# <center>Concrete</center>
---

## Regression

In [1]:
# import libraries
import numpy as np
import pandas as pd
import sklearn.tree as tree
import sklearn.ensemble as ens
import sklearn.preprocessing as pre_pro
import sklearn.metrics as eval_metrics
import sklearn.feature_selection as feat_selec
from sklearn.model_selection import train_test_split

### Load Data 

In [2]:
data = pd.read_excel("data/Concrete_Data.xls")

In [3]:
data.head()

Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day),"Concrete compressive strength(MPa, megapascals)"
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.986111
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.887366
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.269535
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05278
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.296075


In [4]:
data.shape

(1030, 9)

### [Data Description](https://archive.ics.uci.edu/ml/datasets/concrete+compressive+strength)
Each variable is listed with its name, data type, measurement unit, and a brief description. Concrete compressive strength is the regression target. The order of the listing matches the order of the columns in the dataset.

| # | Name | Data Type | Measurement | Description |
|---|---|---|---|---|
| 1 | Cement (component 1) | quantitative | kg in a m^3 mixture | Input Variable |
| 2 | Blast Furnace Slag (component 2) | quantitative | kg in a m^3 mixture | Input Variable |
| 3 | Fly Ash (component 3) | quantitative | kg in a m^3 mixture | Input Variable |
| 4 | Water (component 4) | quantitative | kg in a m^3 mixture | Input Variable |
| 5 | Superplasticizer (component 5) | quantitative | kg in a m^3 mixture | Input Variable |
| 6 | Coarse Aggregate (component 6) | quantitative | kg in a m^3 mixture | Input Variable |
| 7 | Fine Aggregate (component 7) | quantitative | kg in a m^3 mixture | Input Variable |
| 8 | Age | quantitative | Day (1~365) | Input Variable |
| 9 | Concrete compressive strength | quantitative | MPa | Output Variable `Target` |

## Data Preparation

In [5]:
# rename columns
data.columns = ["Cement","Blast Furnace Slag","Fly Ash","Water","Superplasticizer",
                "Coarse Aggregate","Fine Aggregate","Age (in Days)","Target: Concrete compressive strength"]

In [7]:
data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age (in Days),Target: Concrete compressive strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.986111
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.887366
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.269535
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05278
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.296075


### Split Data

In [8]:
X = data.iloc[:,:-1]
y = data.iloc[:,-1]

In [9]:
X.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age (in Days)
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [10]:
y.head()

0    79.986111
1    61.887366
2    40.269535
3    41.052780
4    44.296075
Name: Target: Concrete compressive strength, dtype: float64

In [11]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=6)

In [12]:
X.shape, y.shape

((1030, 8), (1030,))

In [13]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((824, 8), (206, 8), (824,), (206,))
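
The shapes above follow from `test_size=0.2`: 20% of the 1030 rows (206) go to the test set and the remaining 824 to training. A minimal sketch that reproduces the arithmetic on zero-filled placeholder arrays (an assumption for illustration, not the concrete data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shapes as X and y (illustrative only).
X_demo = np.zeros((1030, 8))
y_demo = np.zeros(1030)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=6
)

# Rows are shuffled before splitting; random_state makes the shuffle repeatable.
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```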

## Modelling

In [14]:
model = ens.RandomForestRegressor(criterion="squared_error", n_estimators=100)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)

## Evaluation

In [15]:
model.score(X_test,y_test)

0.9267052166994242
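
`model.score` on a scikit-learn regressor returns the coefficient of determination R², so a value near 0.93 means the forest explains roughly 93% of the variance in the held-out strengths. The sketch below uses synthetic data from `make_regression` (a stand-in, not the concrete dataset) to check `score` against `r2_score` and a manual computation. It also reports MAE and RMSE, which are expressed in the units of the target.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the concrete dataset).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=6)

demo_model = RandomForestRegressor(n_estimators=50, random_state=0)
demo_model.fit(X_tr, y_tr)
pred = demo_model.predict(X_te)

# R^2 three ways: estimator.score, r2_score, and the definition 1 - SS_res / SS_tot.
r2_from_score = demo_model.score(X_te, y_te)
r2_from_metric = r2_score(y_te, pred)
r2_manual = 1 - np.sum((y_te - pred) ** 2) / np.sum((y_te - y_te.mean()) ** 2)

# Error metrics in the units of the target.
mae = mean_absolute_error(y_te, pred)
rmse = np.sqrt(mean_squared_error(y_te, pred))

print(f"R^2: {r2_from_score:.4f}  MAE: {mae:.2f}  RMSE: {rmse:.2f}")
```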

## Feature Selection | Filter Method | F-Score

**F-Score for Regression**

`f_regression` runs univariate linear regression tests and returns an F-statistic and a p-value for each feature.

It is a quick linear model for testing the effect of a single regressor,
applied sequentially to many regressors.

This is done in 2 steps:

1. The cross correlation between each regressor and the target is computed
   using `r_regression`.
   
2. It is converted to an F score and then to a p-value.
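
The two steps can be reproduced by hand. With Pearson correlation r and n samples, the F statistic is r² / (1 − r²) · (n − 2), and the p-value is the upper tail of an F distribution with (1, n − 2) degrees of freedom. The sketch below checks this against `f_regression` on synthetic data; `make_regression` and the `*_demo` names are assumptions for illustration.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

# Synthetic stand-in data (not the concrete dataset).
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=20.0, random_state=0)

# Step 1: Pearson correlation between each column and the target.
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(X_demo.shape[1])])

# Step 2: convert r to an F statistic, then to a p-value (survival function).
dof = y_demo.size - 2
f_manual = r**2 / (1 - r**2) * dof
p_manual = stats.f.sf(f_manual, 1, dof)

f_sklearn, p_sklearn = f_regression(X_demo, y_demo)
print(np.allclose(f_manual, f_sklearn), np.allclose(p_manual, p_sklearn))
```

Because the score depends only on the linear correlation, a feature with a strong but non-linear relationship to the target can still receive a low F-score.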

In [21]:
# f_regression returns two arrays: the F-scores and their p-values
f_Score_P_val = feat_selec.f_regression(X,y)

print(f"F-Score:\n {f_Score_P_val[0]}\n")
print(f"P-Value:\n {f_Score_P_val[1]}")

F-Score:
 [338.72579369  19.03257174  11.62694924  94.11879731 159.10932173
  28.74470957  29.58293761 124.67320926]

P-Value:
 [1.32345795e-65 1.41457499e-05 6.75283560e-04 2.36607270e-21
 5.07908924e-34 1.01959722e-07 6.69468148e-08 2.10314421e-27]
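
The two arrays are easier to read when paired with the column names. This small sketch rebuilds the table from the values printed above; in the notebook itself, `f_Score_P_val` and `X.columns` could be passed in directly. The name `ranking` is introduced here for illustration.

```python
import pandas as pd

features = ["Cement", "Blast Furnace Slag", "Fly Ash", "Water", "Superplasticizer",
            "Coarse Aggregate", "Fine Aggregate", "Age (in Days)"]

# Values copied from the printed output above.
f_scores = [338.72579369, 19.03257174, 11.62694924, 94.11879731,
            159.10932173, 28.74470957, 29.58293761, 124.67320926]
p_values = [1.32345795e-65, 1.41457499e-05, 6.75283560e-04, 2.36607270e-21,
            5.07908924e-34, 1.01959722e-07, 6.69468148e-08, 2.10314421e-27]

# Sort so the strongest linear association with the target comes first.
ranking = pd.DataFrame(
    {"F-Score": f_scores, "P-Value": p_values}, index=features
).sort_values("F-Score", ascending=False)

print(ranking)
```

Cement has the highest F-score and Fly Ash the lowest, and every p-value is below 0.05.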


In [22]:
# select the k features with the highest F-scores
# k: number of top-scoring features to keep
skbest = feat_selec.SelectKBest(score_func=feat_selec.f_regression, k=7)

In [24]:
# fit transform
selected_features = skbest.fit_transform(X,y)

In [25]:
# selected features
print(skbest.get_feature_names_out())

list(X.columns[skbest.get_support()])

['Cement' 'Blast Furnace Slag' 'Water' 'Superplasticizer'
 'Coarse Aggregate' 'Fine Aggregate' 'Age (in Days)']


['Cement',
 'Blast Furnace Slag',
 'Water',
 'Superplasticizer',
 'Coarse Aggregate',
 'Fine Aggregate',
 'Age (in Days)']

### Selecting Features
Keep only the columns chosen by `SelectKBest` in both the training and test sets.

In [26]:
# subset the train and test sets to the selected columns
X_train_modi = X_train[skbest.get_feature_names_out()]
X_test_modi = X_test[skbest.get_feature_names_out()]

In [27]:
X_train_modi.head()

Unnamed: 0,Cement,Blast Furnace Slag,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age (in Days)
457,251.37,0.0,192.94,5.75,1043.6,754.3,56
203,190.68,0.0,162.14,7.77,1090.0,804.01,100
515,202.0,11.0,206.0,1.72,942.0,801.0,28
340,297.16,0.0,174.8,9.52,1022.8,753.45,14
751,540.0,0.0,173.0,0.0,1125.0,613.0,7


In [28]:
X_test_modi.head()

Unnamed: 0,Cement,Blast Furnace Slag,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age (in Days)
148,531.3,0.0,141.8,28.2,852.1,893.7,56
761,350.0,0.0,203.0,0.0,974.0,775.0,90
232,213.72,98.05,181.71,6.86,1065.8,785.38,56
885,153.0,145.0,178.0,8.0,867.0,824.0,28
843,142.0,167.0,174.0,11.0,883.0,785.0,28


### Modelling

In [29]:
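model = ens.RandomForestRegressor(criterion="squared_error", n_estimators=100)
# fit and predict on the reduced feature set
model.fit(X_train_modi,y_train)
y_pred = model.predict(X_test_modi)

### Evaluation

In [30]:
# the value recorded below came from the original cell, which fit and scored on
# X_train / X_test (all 8 features); rerun this cell to refresh it
model.score(X_test_modi,y_test)

0.9282377737219897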
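
A difference in the third decimal place between two unseeded random forests on a single split is within run-to-run noise, so it says little about the effect of feature selection. One hedged way to compare more fairly is to fix `random_state` and cross-validate, with the selector inside a pipeline so it is fitted only on each training fold. The sketch below uses synthetic data from `make_regression`; the data, `n_estimators=50`, and the names `baseline` and `reduced` are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (not the concrete dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=5, noise=10.0, random_state=0
)

# Baseline: all 8 features.
baseline = RandomForestRegressor(n_estimators=50, random_state=0)

# Reduced: SelectKBest is refitted inside every fold, so no test rows leak into selection.
reduced = make_pipeline(
    SelectKBest(score_func=f_regression, k=7),
    RandomForestRegressor(n_estimators=50, random_state=0),
)

baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="r2")
reduced_scores = cross_val_score(reduced, X_demo, y_demo, cv=5, scoring="r2")

print(f"all features: {baseline_scores.mean():.4f} +/- {baseline_scores.std():.4f}")
print(f"top 7 by F:   {reduced_scores.mean():.4f} +/- {reduced_scores.std():.4f}")
```

If the two means differ by less than the spread across folds, the feature subset has not changed predictive performance in any measurable way.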