<h1 align='center'><span style="color:blue">Exploratory Data Analysis of the Concrete Dataset</span></h1>
<blockquote>
<h4>Contributors</h4>
<p>Divya Chatty<br>Suvarna Satish<br>Uday Kumar KB<br>Vatti Hemanth Kumar<br>Vivek NG<br>Xitij Patel</p>
</blockquote>

***

<blockquote>
    <h2>Aim of the project</h2>
    This project aims to delve into the Concrete Dataset, identify its attributes and draw some useful insights from it. 
    Studying the Concrete dataset is also a good exercise for demonstrating the capabilities of the Python modules for Data Analysis: NumPy, Pandas, Matplotlib and Seaborn. 
</blockquote>

***

<h3>Before we begin - The areas under which the EDA has been done are:</h3>
<blockquote>1. Loading the Dataset<br>
2. Dataset Exploration<br>
3. Using Pandas to draw insights<br>
4. Visualization 
</blockquote>

***

<h1 align='center'><span style="color:blue">1. Loading the Dataset</span></h1>

### Importing the necessary packages

In [None]:
import numpy as np
import pandas as pd

In [None]:
#Suppress warnings in the IPython Notebook
import warnings
warnings.simplefilter('ignore')

### Loading the dataset from file system

In [None]:
# Reading the csv file.
concrete = pd.read_csv('../input/compressive-strength-of-concrete/npvproject-concrete.csv')  # We can replace the file name with path if the file location differs.

In [None]:
# Printing DataFrame 
print(concrete)      # Print the contents of Data Frame

In [None]:
# Printing shape of the dataset
print(concrete.shape)    # Returns total number of Rows and Columns

***
<h1 align = 'center'><span style="color:blue"> 2. Dataset Exploration</span></h1>

In [None]:
# Retrieving Data in first 5 Rows 
concrete.head() # We can also pass the number of rows to retrieve inside the parentheses

##### Note: We can also print last few rows by using concrete.tail() method

In [None]:
# Retrieving columns names of DataFrame.
concrete.columns # Returns the column names in the Concrete Dataset

##### Note: We can retrieve the indices of the Dataframe by using the concrete.index property

### Details of each column from the UCI Repository    
    Column                                Type            Units                 Nature
    -------------------------------------------------------------------------------------------
    Cement (component 1)               -- quantitative -- kg in a m3 mixture -- Input Variable
    Blast Furnace Slag (component 2)   -- quantitative -- kg in a m3 mixture -- Input Variable
    Fly Ash (component 3)              -- quantitative -- kg in a m3 mixture -- Input Variable
    Water (component 4)                -- quantitative -- kg in a m3 mixture -- Input Variable
    Superplasticizer (component 5)     -- quantitative -- kg in a m3 mixture -- Input Variable
    Coarse Aggregate (component 6)     -- quantitative -- kg in a m3 mixture -- Input Variable
    Fine Aggregate (component 7)       -- quantitative -- kg in a m3 mixture -- Input Variable
    Age                                -- quantitative -- Day (1~365)        -- Input Variable
    Concrete compressive strength      -- quantitative -- MPa                -- Output Variable

### Useful Pandas methods to analyse the dataset at a glance

In [None]:
# Using info() method 
concrete.info()     # Outputs some general information about the dataframe

In [None]:
# Using describe() method 
concrete.describe()      # Basic statistical characteristics of each numerical column 

In [None]:
#Applying Functions to Cells, Columns and Rows
concrete.apply(np.max) # Returns the Maximum values of all columns

****

<h1 align='center'><span style="color:blue"> 3. Using Pandas to draw insights</span></h1>

<h3>
        <span style = "color:green">
            The following areas are discussed here: <br>
        </span>
</h3>
<span style = "color:green">
            <blockquote>
            1. Cleaning<br>
            2. Sorting<br>
            3. Accessing the Dataset<br>
            4. Grouping and Pivoting<br>
            5. Statistical Functions            </blockquote></span>
    


<h2><span style="color:purple">1. Cleaning</span></h2>
<h3>Does the dataset have any null values?</h3>

In [None]:
concrete.isna().sum()  # isna() returns a Boolean DataFrame (True where a value is NaN); sum() then counts the NaNs in each column.
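
To see what this call computes, here is a minimal sketch on a small made-up frame. The `toy` DataFrame and its values are hypothetical and are not taken from the Concrete Dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the Concrete Dataset) with two missing values
toy = pd.DataFrame({
    "cement": [540.0, np.nan, 332.5],
    "water": [162.0, 162.0, np.nan],
    "age": [28, 28, 270],
})

mask = toy.isna()                  # Boolean DataFrame: True where a value is missing
missing_per_column = mask.sum()    # True counts as 1, so this counts NaNs per column
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

A column with a count of zero has no missing values, which is what we hope to see for every column of a clean dataset.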

### Does the dataset have duplicate rows?

In [None]:
duplicate_rows = concrete[concrete.duplicated()]
print("There are {} duplicate rows in the dataset".format(duplicate_rows.shape[0]))
duplicate_rows
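
The sketch below shows, on hypothetical toy data, how `duplicated()` decides which rows to flag. By default the first occurrence is not marked, so the count above is the number of extra copies rather than the number of rows involved.

```python
import pandas as pd

# Hypothetical toy data: rows 0, 1 and 3 are identical
toy = pd.DataFrame({
    "cement": [540.0, 540.0, 332.5, 540.0],
    "age": [28, 28, 270, 28],
})

extra_copies = toy.duplicated()            # default keep='first': first occurrence is not flagged
all_copies = toy.duplicated(keep=False)    # flag every row that has a twin
deduped = toy.drop_duplicates()            # new frame with one copy of each row

print(extra_copies.sum(), "extra copies,", all_copies.sum(), "rows involved")
print(deduped)
```

`drop_duplicates()` returns a new DataFrame here; `toy` itself is left unchanged.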

### Dropping unwanted rows and columns

In [None]:
concrete.drop(index=[2,5,1026,1029],columns=['slag','coarseagg'],inplace=False)
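
Because `inplace=False` is used, `drop()` returns a new DataFrame and leaves the original untouched. A minimal sketch with hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy data with the same kind of columns as the Concrete Dataset
toy = pd.DataFrame({
    "cement": [540.0, 332.5, 198.6],
    "slag": [0.0, 142.5, 132.4],
    "strength": [79.99, 40.27, 44.30],
})

# Drop the row labelled 1 and the 'slag' column; the result is a new object
trimmed = toy.drop(index=[1], columns=["slag"], inplace=False)

print("Original shape:", toy.shape)
print("Trimmed shape: ", trimmed.shape)
```

To keep the result, assign it to a name as done here; otherwise the trimmed frame is only displayed and then discarded.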

<h2><span style="color:purple">2. Sorting</span></h2>
<h3> Which are the 5 highest compressive strengths in the dataset? </h3>

In [None]:
concrete.sort_values(by='strength',ascending=False,inplace=False).head(n=5)

### Which are the Top 3 instances in terms of MAX Water poured and LEAST Strength of Concrete?

In [None]:
concrete.sort_values(by=['water','strength'],ascending=[False,True],inplace=False).head(n=3)
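
When several keys are given, the second key only breaks ties in the first. The following sketch uses hypothetical toy values to make that ordering visible.

```python
import pandas as pd

# Hypothetical toy data: two rows share the highest water value
toy = pd.DataFrame({
    "water": [200, 228, 228, 180],
    "strength": [30.0, 25.0, 12.0, 50.0],
})

# Sort by water (descending); among equal water values, sort by strength (ascending)
ranked = toy.sort_values(by=["water", "strength"], ascending=[False, True])
print(ranked)
```

The two rows with the most water come first, and within that pair the weaker sample is listed before the stronger one.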

<h2><span style="color:purple">3. Accessing the Dataset</span></h2>
<h3>iloc based access</h3>

In [None]:
concrete.iloc[2:10,2:9] #Fetch rows at positions 2 to 9 and columns at positions 2 to 8 (zero-based; the end of each slice is excluded)

### loc based access

In [None]:
concrete.loc[4:20:2,  'water':'fineagg'] #Fetches values based on indices and columns specified
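
A common source of confusion is that `iloc` slices exclude the end position while `loc` slices include the end label. Here is a small sketch on a hypothetical frame with the default integer index:

```python
import pandas as pd

# Hypothetical toy data with index labels 0..5
toy = pd.DataFrame({"a": range(6), "b": range(10, 16), "c": range(20, 26)})

by_position = toy.iloc[1:4, 0:2]    # positions 1, 2, 3 and columns at positions 0, 1
by_label = toy.loc[1:4, "a":"b"]    # labels 1, 2, 3, 4 and columns 'a' through 'b'
stepped = toy.loc[0:4:2, "a":"c"]   # every second label from 0 to 4, all three columns

print(by_position.shape, by_label.shape, stepped.shape)
```

With the default index the labels happen to equal the positions, so the only visible difference is whether the last row of the slice is included.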

### Conditional Accessing


#### What is the average strength when the cement content is exactly 540 kg/m3?

In [None]:
concrete[concrete['cement'] == 540]['strength'].mean() 

#### What is the minimum strength obtained if slag composition is more than 130 kg/m3 and it is aged for at least 100 days?

In [None]:
concrete[(concrete['slag']>130) & (concrete['age']>=100)]['strength'].min()
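
The filter above combines two Boolean Series with `&`; each comparison needs its own parentheses because `&` binds more tightly than `>` or `>=`. A minimal sketch with hypothetical toy values:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "slag": [0.0, 142.5, 189.0, 150.0],
    "age": [28, 270, 91, 180],
    "strength": [79.99, 40.27, 82.6, 35.0],
})

# Row-wise AND of the two conditions
mask = (toy["slag"] > 130) & (toy["age"] >= 100)

# Select the 'strength' values of the matching rows and take the minimum
weakest = toy.loc[mask, "strength"].min()
print(mask.tolist(), weakest)
```

Using `toy.loc[mask, "strength"]` selects rows and column in one step, which is equivalent here to the chained form `toy[mask]["strength"]` used above.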

<h2><span style="color:purple">4. Grouping and Pivoting</span></h2>
<h3>Show the mean, min and max of each input column for every distinct value of compressive strength</h3>

In [None]:
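# Aggregation names are passed as strings; newer pandas versions warn when NumPy functions are passed to agg()
concrete.groupby('strength')[list(concrete.columns[:8])].agg(['mean', 'min', 'max']).T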
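
Since strength is a continuous value, grouping on it directly gives a large number of very small groups. One possible alternative, sketched here on hypothetical toy data, is to bin strength into bands with `pd.cut` first. The bin edges and the band names are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "cement": [150.0, 200.0, 350.0, 500.0],
    "strength": [12.0, 18.0, 42.0, 75.0],
})

# Assumed band edges (MPa) and labels, for illustration only
toy["strength_band"] = pd.cut(
    toy["strength"], bins=[0, 20, 50, 100], labels=["low", "medium", "high"]
)

# Summarise cement within each strength band
summary = toy.groupby("strength_band", observed=True)["cement"].agg(["mean", "min", "max"])
print(summary)
```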

### With Cement and Slag as the indices, obtain the means of instances for which cement composition is between 140kg/m3 and 150kg/m3

In [None]:
#Creating a spreadsheet-style pivot table as a DataFrame.

#The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) 
#on the index and columns of the result DataFrame.

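# Keep only the instances whose cement content lies between 140 and 150 kg/m3 (both bounds included)
subset = concrete[concrete['cement'].between(140, 150)]
df2 = pd.pivot_table(subset, index = ['cement','slag'])  # Mean of every other column (default aggfunc)
df2.iloc[:8] #Fetching only first 8 rows of pivoted table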
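
The following sketch shows, on hypothetical toy data, what a pivot table with two index columns looks like: rows sharing the same (cement, slag) pair are collapsed into one row holding their mean.

```python
import pandas as pd

# Hypothetical toy data: the first two rows share the same (cement, slag) pair
toy = pd.DataFrame({
    "cement": [140.0, 140.0, 145.0, 145.0],
    "slag": [0.0, 0.0, 0.0, 100.0],
    "strength": [20.0, 30.0, 25.0, 35.0],
})

# The default aggregation function of pivot_table is the mean
pivot = pd.pivot_table(toy, index=["cement", "slag"])

print(pivot)
print("Index levels:", pivot.index.names)
```

A single cell of the result can be read with a tuple of index values, for example `pivot.loc[(140.0, 0.0), "strength"]`.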

<h2><span style="color:purple">5. Statistical Functions</span></h2>
<h3>Randomly picking an instance from a dataframe</h3>

In [None]:
#The sample() method selects rows or columns randomly.
#By default, one row is returned; here we ask for 10 rows with n=10.

concrete.sample(n=10)

### What is the statistical correlation between any two attributes in the dataset?

In [None]:
#A correlation matrix is a table showing correlation coefficients between variables. 
#It is used to summarize data, as an input into a more advanced analysis, 
#and as a diagnostic for advanced analyses.
concrete.corr()  #Methods - Pearson(by default) or Spearman or Kendall
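
To make the default (Pearson) coefficient less of a black box, the sketch below computes it by hand for one pair of columns and compares it with the pandas result. The `toy` values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "cement": [100.0, 200.0, 300.0, 400.0],
    "strength": [12.0, 25.0, 31.0, 48.0],
})

x = toy["cement"].to_numpy()
y = toy["strength"].to_numpy()

# Pearson r = sum of products of deviations / sqrt(product of the sums of squared deviations)
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

r_pandas = toy["cement"].corr(toy["strength"])   # Pearson by default
print(r_manual, r_pandas)
```

The coefficient always lies between -1 and 1. It measures linear association only, so a value near zero does not rule out a non-linear relationship.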

### Which 2 instances have the smallest values of compressive strength?

In [None]:
concrete.nsmallest(2,'strength')

<h5> Note: You can retrieve the n largest values of any attribute with a call such as <span style="color: green">concrete.nlargest(2,'strength') </span></h5>

***

<h1 align="center"><span style="color:blue">4. Visualization</span></h1>
<blockquote>
    A visual aid is often the easiest way to glean information from datasets and communicate relationships effectively. <br><br>
    Let's delve into the Concrete Dataset using Visualization techniques and derive some useful insights!
   <br><br>
    This part uses the Matplotlib and Seaborn modules. 
</blockquote>

In [None]:
#Loading the necessary packages
import matplotlib.pyplot as plt
import seaborn as sns

<h2><span style="color:purple">Some Univariate Analysis on Compressive strength</span></h2>
<blockquote>
    <b>Univariate analysis</b>: <br>The simplest form of statistical analysis. <br>It can be inferential or descriptive.<br>The key fact is that only one variable is involved.
</blockquote>

In [None]:
#Let's find what's the range of values in the Compressive strength. 
print("Maximum Strength achieved: ",concrete.strength.max())
print("Minimum Strength achieved: ",concrete.strength.min())
print("Range/Spread of the Strength of Concrete: ",concrete.strength.max() - concrete.strength.min())

In [None]:
#Which instances produce the maximum values in strength?
print("Maximum strength achieved in the instance(s): ")
concrete[concrete.strength == concrete.strength.max()]

In [None]:
#Which instances produce the minimum values in strength?
print("Minimum strength achieved in the instance(s): ")
concrete[concrete.strength == concrete.strength.min()]

In [None]:
#Let's also plot a box plot to know if there are any major outliers in the strength values.
#Alongside, let's also plot a histogram of strength to see what type of distribution it is

#Box Plot
plt.figure(figsize=(5,5), dpi=100, facecolor='cyan', edgecolor='#000000')
plt.boxplot(concrete.strength)
plt.text(x=1.1,y=concrete.strength.min() ,s="Min")
plt.text(x=1.1,y=concrete.strength.max(),s="Max")
plt.text(x=1.1,y=concrete.strength.median() ,s="Median")
plt.text(x=1.1,y=concrete.strength.quantile(0.25),s="Q1")
plt.text(x=1.1,y=concrete.strength.quantile(0.75),s="Q3")
plt.title("Distribution of strength of concrete across its range.")
plt.ylabel('Compressive Strength')

#Histogram
plt.figure(figsize=(5,5), dpi=100, facecolor='cyan', edgecolor='#000000')
plt.hist(concrete.strength,color='orange',rwidth=0.9)
plt.xlabel('Compressive strength')   # The histogram bins lie along the x-axis
plt.ylabel('Frequency')              # The bar heights are counts
plt.title("Histogram of Compressive strength across its range.")

<center><h3><span style="color:#4B0082">Observations from the Box Plot:</span></h3></center><br>
<ul>
    <li>The box plot shows that the <b>median strength is around 35 MPa</b>. 
       (This can be corroborated by the values computed above.)</li><br>
                    
   <li>Since the upper whisker is longer than the lower whisker, there seems to be <b>slight right (positive) skewness</b> in the distribution.</li><br>
    
   <li>There are <b>a few outliers</b> present in strength, which suggests that in some cases (we don't yet know which) the concrete is considerably stronger than is typical for this dataset.</li><br>
</ul>

<center><h3><span style="color:#4B0082">Observations from the Histogram:</span></h3></center><br>
<ul>
    <li>The distribution of compressive strength is <b>roughly bell-shaped</b>, but it is visibly <b>skewed</b>, so it is not strictly normal.</li><br>
    <li>The distribution has a <b>longer right tail</b>, which indicates positive skewness.</li><br>
</ul>
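
The outliers and the skewness read off the plots can also be checked numerically. Matplotlib's box plot by default extends each whisker to the furthest point within 1.5 times the interquartile range (IQR) of the box, and anything beyond that is drawn as an outlier. The sketch below applies the same rule to a hypothetical series of values.

```python
import pandas as pd

# Hypothetical toy values with one unusually large observation
values = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 40.0])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Fences used by the default 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
skewness = values.skew()    # positive value means a longer right tail

print("Outliers:", outliers.tolist())
print("Skewness:", skewness)
```

Applied to the real data, the same steps with `concrete.strength` in place of `values` would list the outlying instances and give a number for the skewness.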

<h2><span style="color:purple">Bivariate Analysis - Compressive strength vs Age</span></h2>
<blockquote>
    <b>Bivariate analysis</b>: The relationship between two variables is explored at length.
</blockquote>
<h3>Question - <span style="color:red">Does the strength of concrete increase directly with time?</span></h3>

In [None]:
#Let's plot the variation of strength against time. Here, strength is in MPa and time is in days
plt.figure(figsize=(5,5), dpi=100, facecolor='cyan', edgecolor='#000000')
plt.scatter(x=concrete.age,y=concrete.strength,marker='o',color='orange',alpha=0.5)
plt.xlabel('Age')
plt.ylabel('Compressive strength')
plt.title('Variation of Compressive Strength versus Ageing Time')

<center><h3><span style="color:#4B0082">Observations from the Scatter Plot:</span></h3></center><br>
<ul>
    <li><b>Extremely high ageing times (>350 days) seem to produce concrete with average strength.</b> <br>
       Alternatively, it could be that the strength of concrete reduces over time. <br>
        (This of course depends on how the dataset was collected.)</li><br>
       
 <li><b>Maximum strength</b> of concrete is produced when it is aged for around <b>25-100 days</b>. </li><br>
    
 <li>But, there are <b>4 instances</b> where strength>60 and 2 instances where strength>70 where the ageing time was more than 150 days.<br> 
    This may indicate that, the composition of the concrete also plays a role in the strength. <br>
    <b>Therefore, we can't use time alone as a factor.</b> </li><br>
    
<li>There seems to be a <b>lot of variation in strength in the 0-50 days group.</b><br>
    This group deserves our attention because factors other than time clearly help decide strength.</li>
</ul>

<h3>Answer to our question - <span style="color:green">
Age influences Strength for sure. <br>But, there is clearly no direct relationship between Aging time and Compressive Strength. There are other factors at play.</span></h3> 

<h2><span style="color:purple">Relationship between all the attributes</span></h2>

In [None]:
plt.figure(figsize=(10,8))
sns.heatmap(concrete.corr(), cmap ="YlGnBu", linewidths = 0.1,annot = True) #Plot contents of correlation matrix as a heatmap
plt.title("Heatmap - Relationship between all the attributes",fontdict={'fontsize': 16})  #Title of the plot
plt.show()

<center><h3><span style="color:#4B0082">Observations:</span></h3></center>
<p>
    Compared to the other input factors, only the quantity of cement seems to have a significant correlation with Compressive Strength. <br>
    After Cement, Age and Superplasticizer are the other two factors which seem to affect Compressive Strength.<br>
    Superplasticizer seems to have a strong negative correlation with Water, and positive correlations with Fly ash and Fine aggregate.
</p>
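
Rather than reading the values off the heatmap, the correlations with the target can be ranked directly. This is a minimal sketch on hypothetical toy data; with the real dataset, `toy` would be replaced by `concrete`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "cement": [100.0, 200.0, 300.0, 400.0, 500.0],
    "water": [200.0, 190.0, 195.0, 170.0, 160.0],
    "strength": [15.0, 25.0, 30.0, 45.0, 50.0],
})

# Take the 'strength' column of the correlation matrix, drop the self-correlation,
# and order the remaining features by the absolute size of their coefficient
corr_with_target = (
    toy.corr()["strength"]
    .drop("strength")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strongly negative features, such as water in this toy example, near the top of the list instead of at the bottom.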

<h2><span style="color:purple">Various plots with matplotlib and seaborn</span></h2>
Flexing the capabilities of seaborn and matplotlib

<h3>Distribution Plot</h3>

In [None]:
plt.figure(figsize=(15,8))
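# sns.distplot is deprecated in recent seaborn releases; histplot + rugplot draw a similar histogram, KDE curve and rug
cs_d = sns.histplot(x=concrete.strength, color='mediumvioletred', kde=True, stat='density')
sns.rugplot(x=concrete.strength, color='mediumvioletred', ax=cs_d)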
cs_d.set_title("Compressive Strength Distribution")
plt.show()

<h3>Density Plots</h3>

In [None]:
concrete.plot(kind='density', subplots=True, layout=(5,2), sharex=False, sharey=False, figsize=(20,30))
plt.show()

<h3>Pair Plots</h3>
<ul>
    <li>Plots pairs of columns separately.</li><br>
    <li>The required columns can be specified.</li><br>
    <li>Helps map the relation between values of specified columns.</li><br>
    <li>Here, each of the seven mixture components has been plotted against the output parameter (Compressive Strength).</li>
</ul>

In [None]:
sns.pairplot(concrete, x_vars = ['cement', 'slag','ash','water','superplastic','coarseagg','fineagg'], y_vars = 'strength',height=4, aspect=1.5)
plt.show()

### Advanced Scatter Plots - With more depth

In [None]:
fig, ax = plt.subplots(figsize=(15,8))
sns.scatterplot(y="strength", x="cement", hue="water", size="age", data=concrete, palette='inferno', ax=ax, sizes=(50, 300))
ax.set_title("Compressive Strength vs (Cement, Age, Water)")
plt.show()

<center><h3><span style="color:#4B0082">Observations:</span></h3></center>
<p>
Compressive Strength tends to increase with the amount of cement.<br>
Compressive Strength appears to be highest when the age is roughly between 200 and 350 days.<br>
The older the concrete is, the more water it seems to need for better Compressive Strength.<br>
In general, Concrete Strength increases when less water is used in preparing it.
</p>

***

<center><h1><span style="color:blue">Conclusions from the EDA</span></h1></center>

<ol>
    <li>The dataset is mainly concerned with the factors that influence the Compressive Strength of Concrete.</li><br>
    <li>There are 1030 instances, and there are no missing/unknown/invalid values.</li><br>
    <li>There are 9 attributes --> 8 input variables and 1 output variable</li><br>
    <li>Seven input variables represent the amount of raw material (measured in kg/m³) and one represents Age (in days). The target variable is Concrete Compressive Strength, measured in MPa (megapascals). </li><br>
    <li>There are no categorical variables. There are only numeric values.</li><br>
    <li>Using features of Pandas and Numpy, we can query the dataset interactively and fetch the required results.</li><br>
    <li>Some univariate, bivariate and multivariate analyses were conducted.<br>
    <ul>
        <li>Compressive Strength has a roughly bell-shaped distribution with slight positive skewness.</li><br>
        <li>The time given for concrete to age changes its strength, but that relationship is not linear.</li><br>
        <li>Compressive Strength tends to increase with the amount of cement in the mixture.</li><br>
        <li>Strength tends to decrease as the amount of water in the mixture increases.</li><br>
    </ul>
    </li><br>
    <li>By doing further analyses, most important attributes can be selected.</li>
    
    
</ol><br><br>
<center>
    <h3><span style="color:blue">This is an introduction to the Exploratory Data Analysis of the Concrete Dataset.</span></h3>
The features provided by NumPy, Pandas, Matplotlib and Seaborn are instrumental in understanding the dataset.<br>
<h3>More In-Depth analysis will lead to more in-depth results!</h3>
</center>

***