# Concrete Compressive Strength Regression

We will take the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation
7. Experimentation / Improvements

# 1. Problem Definition

How can we use various Python-based machine learning models and the given mixture parameters to predict the compressive strength of concrete?

# 2. Data

Data From: https://www.kaggle.com/vinayakshanawad/cement-manufacturing-concrete-dataset

## Data Description

The actual concrete compressive strength (MPa) for a given mixture at a specific age (days) was determined in the laboratory. The data is in raw form (not scaled). It has 8 quantitative input variables, 1 quantitative output variable, and 1030 instances (observations).

## Domain

Cement manufacturing

## Context

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate.

# 3. Evaluation

As this is a regression problem, we will use the root mean squared error (RMSE) to evaluate the model.
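
For reference, RMSE is the square root of the mean squared difference between observed and predicted values, so it is expressed in the same unit as the target (MPa). Below is a minimal pure-Python sketch; the function name `rmse` and the sample values are illustrative assumptions, not taken from the dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up strengths in MPa, for illustration only
observed = [30.0, 40.0, 50.0]
predicted = [28.0, 43.0, 50.0]
example_rmse = rmse(observed, predicted)
print(f"RMSE: {example_rmse:.3f} MPa")
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.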

# 4. Features

## Inputs / Features


    Cement : measured in kg per m³ of mixture
    Blast furnace slag : measured in kg per m³ of mixture
    Fly ash : measured in kg per m³ of mixture
    Water : measured in kg per m³ of mixture
    Superplasticizer : measured in kg per m³ of mixture
    Coarse Aggregate : measured in kg per m³ of mixture
    Fine Aggregate : measured in kg per m³ of mixture
    Age : days (1 to 365)
    
## Output / Label
    Concrete compressive strength measured in MPa


## Standard Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Reading the dataset

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Local
# df = pd.read_csv('concrete.csv')

# Kaggle
df = pd.read_csv('/kaggle/input/cement-manufacturing-concrete-dataset/concrete.csv')
df.head()

## Data Exploration

In [None]:
df

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
plt.figure(figsize=(20,10))
plt.title('Histogram of strength')
sns.histplot(data=df,x='strength', kde=True);

The histogram shows that strength in the dataset is approximately normally distributed.

In [None]:
plt.figure(figsize=(20,10))
plt.title('Boxplot of Dataset')
sns.boxplot(data=df);

As we can see, there are some outliers in the dataset.
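
The cells that follow pick cut-off values by eye from each boxplot. Boxplots conventionally draw their whiskers at 1.5 times the interquartile range (IQR) beyond the quartiles, which is also seaborn's default (`whis=1.5`), so the same rule can be used to flag outliers numerically. The sketch below uses only the standard library; `iqr_bounds` and the sample values are hypothetical and only illustrate the rule.

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    # method="inclusive" interpolates linearly between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values, for illustration only
sample = [150, 160, 165, 170, 175, 180, 185, 190, 250]
lower, upper = iqr_bounds(sample)
outliers = [v for v in sample if v < lower or v > upper]
print(f"fences: ({lower}, {upper}), outliers: {outliers}")
```

With pandas, the same fences could be derived from `df[col].quantile([0.25, 0.75])` instead of hard-coded thresholds.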

In [None]:
plt.figure(figsize=(20,10))
plt.title('Boxplot of slag')
sns.boxplot(data=df, x='slag');

In [None]:
df[df['slag'] > 350]

In [None]:
df = df.drop(df[df['slag'] > 350].index)

In [None]:
plt.figure(figsize=(20,10))
plt.title('Boxplot of water')
sns.boxplot(data=df, x='water');

In [None]:
df[(df['water'] < 122) | (df['water'] > 230)] 

In [None]:
df['water'].describe()

In [None]:
df = df.drop(df[(df['water'] < 122) | (df['water'] > 230)].index)

In [None]:
plt.figure(figsize=(20,10))
plt.title('Boxplot of superplastic')
sns.boxplot(data=df, x='superplastic');

In [None]:
df['superplastic'].describe()

In [None]:
df[df['superplastic'] > 25]

In [None]:
df = df.drop(df[df['superplastic'] > 25].index)

In [None]:
plt.figure(figsize=(20,10))
plt.title('Boxplot of fineagg')
sns.boxplot(data=df, x='fineagg');

In [None]:
df[(df['fineagg'] < 600) | (df['fineagg'] > 950)]

In [None]:
df = df.drop(df[(df['fineagg'] < 600) | (df['fineagg'] > 950)].index)

In [None]:
df

In [None]:
plt.figure(figsize=(20,10))
plt.title('Boxplot of age')
sns.boxplot(data=df, x='age');

In [None]:
df[df['age'] > 150]

In [None]:
df = df.drop(df[df['age'] > 150].index)

In [None]:
plt.figure(figsize=(20,10))
plt.title('Boxplot of strength')
sns.boxplot(data=df, x='strength');

In [None]:
df['strength'].describe()

In [None]:
df[df['strength'] > 79]

In [None]:
df = df.drop(df[df['strength'] > 79].index)

In [None]:
df

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of cement vs strength')
sns.scatterplot(data=df, x='cement', y= 'strength');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of slag vs strength')
sns.scatterplot(data=df, x='slag', y= 'strength');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of ash vs strength')
sns.scatterplot(data=df, x='ash', y= 'strength');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of water vs strength')
sns.scatterplot(data=df, x='water', y= 'strength');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of superplastic vs strength')
sns.scatterplot(data=df, x='superplastic', y= 'strength');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of coarseagg vs strength')
sns.scatterplot(data=df, x='coarseagg', y= 'strength');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of fineagg vs strength')
sns.scatterplot(data=df, x='fineagg', y= 'strength');

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of age vs strength')
sns.scatterplot(data=df, x='age', y= 'strength');

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(data=df.corr(), annot=True);

In [None]:
df.corr()['strength'].sort_values()[:-1]

As we can see, strength has a strong positive correlation with the following:

    superplastic
    cement
    age

and a negative correlation with:

    water
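
By default, `df.corr()` reports the Pearson coefficient, which measures linear association only. The sketch below shows that computation in plain Python; `pearson_r` and the sample values are illustrative assumptions, not values from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares; the 1/n factors cancel out
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values: one perfectly increasing, one perfectly decreasing relation
r_up = pearson_r([100, 200, 300, 400], [10, 20, 30, 40])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
print(r_up, r_down)
```

Since strength is described above as a highly nonlinear function of age and ingredients, a low Pearson coefficient does not by itself mean a feature is uninformative.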

# 5. Modelling

In [None]:
X = df.drop('strength', axis=1)
y = df['strength']

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
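
The scaler is fitted on the training split only, and the test split is transformed with the training statistics, so nothing from the test data influences the scaling. The sketch below shows the per-column computation that standardization performs, z = (x - mean) / std, using the population standard deviation. The helper names and values are hypothetical, and the sketch assumes the column is not constant.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Learn the mean and population standard deviation of one column."""
    return fmean(column), pstdev(column)

def standardize(column, mean, std):
    """Apply z = (x - mean) / std with previously learned statistics."""
    return [(x - mean) / std for x in column]

# Made-up values, for illustration only
train_col = [100.0, 200.0, 300.0]
test_col = [200.0, 400.0]

mean, std = fit_standardizer(train_col)            # statistics from train only
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)     # reuse the train statistics
print(train_scaled, test_scaled)
```

The scaled test values need not have zero mean or unit variance, because they are expressed relative to the training distribution.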

## Import Models

In [None]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from xgboost import XGBRegressor, XGBRFRegressor
from catboost import CatBoostRegressor

## Baseline Models and Scores

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model and return its test-set score, RMSE, and R2 as DataFrames."""
    np.random.seed(42)

    model_scores = {}
    model_rmse = {}
    model_r2 = {}

    for name, model in models.items():
        model.fit(X_train, y_train)
        # For scikit-learn style regressors, .score() returns R2
        model_scores[name] = model.score(X_test, y_test)
        y_preds = model.predict(X_test)
        model_rmse[name] = np.sqrt(mean_squared_error(y_test, y_preds))
        model_r2[name] = r2_score(y_test, y_preds)

    model_scores = pd.DataFrame(model_scores, index=['Score']).transpose()
    model_scores = model_scores.sort_values('Score')

    model_rmse = pd.DataFrame(model_rmse, index=['RMSE']).transpose()
    model_rmse = model_rmse.sort_values('RMSE')

    model_r2 = pd.DataFrame(model_r2, index=['R2']).transpose()
    model_r2 = model_r2.sort_values('R2')

    return model_scores, model_rmse, model_r2

In [None]:
models = {'Ridge': Ridge(),
          'Lasso': Lasso(),
          'ElasticNet': ElasticNet(),
          'KNeighborsRegressor': KNeighborsRegressor(),
          'SVR': SVR(),
          'DecisionTreeRegressor': DecisionTreeRegressor(),
          'RandomForestRegressor': RandomForestRegressor(),
          'GradientBoostingRegressor': GradientBoostingRegressor(),
          'AdaBoostRegressor': AdaBoostRegressor(),
          'XGBRegressor': XGBRegressor(objective='reg:squarederror'),
          'XGBRFRegressor': XGBRFRegressor(objective='reg:squarederror'),
          'CatBoostRegressor': CatBoostRegressor(verbose=0)
          }

In [None]:
model_scores_baseline, model_rmse_baseline, model_r2_baseline = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
model_scores_baseline

In [None]:
model_rmse_baseline.sort_values('RMSE', ascending=False)

In [None]:
model_r2_baseline

In [None]:
df['strength'].mean()

## CatBoost

In [None]:
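# Large iteration budget; training stops early once the eval metric stops improving.
# The test split doubles as the early-stopping set here, so test metrics may be optimistic.
model = CatBoostRegressor(iterations=10000, verbose=0)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], early_stopping_rounds=50)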
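
`early_stopping_rounds=50` asks the model to stop once the evaluation metric has not improved for 50 consecutive iterations. The patience logic can be sketched in plain Python as follows; `best_iteration_with_patience` and the loss values are hypothetical, and this is an illustration of the idea rather than CatBoost's internal implementation.

```python
def best_iteration_with_patience(eval_losses, patience):
    """Return (best_iteration, stopped_at) for a sequence of eval losses.

    Training is considered stopped at the first iteration that lies
    `patience` steps past the best loss seen so far.
    """
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            return best_iter, i
    # Never ran out of patience: training used the full budget
    return best_iter, len(eval_losses) - 1

# Made-up loss curve, for illustration only
losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.8, 3.4]
best_iter, stopped_at = best_iteration_with_patience(losses, patience=3)
print(best_iter, stopped_at)
```

In this made-up curve the final value (3.4) is never reached, which shows the trade-off: a short patience saves time but can stop before a later improvement.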

# 6. Model Evaluation

In [None]:
y_preds = model.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
mae = mean_absolute_error(y_test, y_preds)
mse = mean_squared_error(y_test,y_preds)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_preds)

In [None]:
print(f'Mean Absolute Error: {mae}')
print(f'Mean Squared Error: {mse}')
print(f'Root Mean Squared Error: {rmse}')
print(f'R2 Score: {r2}')

## Feature Importance

In [None]:
feat_impt = pd.DataFrame(model.feature_importances_, index=X.columns)

In [None]:
feat_impt

In [None]:
plt.figure(figsize=(20,10))
plt.title('Feature Importance')
sns.barplot(data= feat_impt.sort_values(0).T);

## Evaluation using cross-validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
def get_cv_score(model, X, y, cv=5):
    """Print per-fold and mean cross-validation scores and return the means."""

    cv_r2 = cross_val_score(model, X, y, cv=cv,
                            scoring='r2')
    print(f'Cross Validation R2 Scores: {cv_r2}')
    print(f'Cross Validation R2 Mean Score: {cv_r2.mean()}')

    # scikit-learn negates error metrics so that higher is always better
    cv_neg_mean_absolute_error = cross_val_score(model, X, y, cv=cv,
                                                 scoring='neg_mean_absolute_error')
    print(f'Cross Validation Neg MAE Scores: {cv_neg_mean_absolute_error}')
    print(f'Cross Validation Neg MAE Mean Score: {cv_neg_mean_absolute_error.mean()}')

    cv_neg_mean_squared_error = cross_val_score(model, X, y, cv=cv,
                                                scoring='neg_mean_squared_error')
    print(f'Cross Validation Neg MSE Scores: {cv_neg_mean_squared_error}')
    print(f'Cross Validation Neg MSE Mean Score: {cv_neg_mean_squared_error.mean()}')

    cv_neg_root_mean_squared_error = cross_val_score(model, X, y, cv=cv,
                                                     scoring='neg_root_mean_squared_error')
    print(f'Cross Validation Neg RMSE Scores: {cv_neg_root_mean_squared_error}')
    print(f'Cross Validation Neg RMSE Mean Score: {cv_neg_root_mean_squared_error.mean()}')

    cv_metrics = pd.DataFrame({'R2': cv_r2.mean(),
                               'neg_mean_absolute_error': cv_neg_mean_absolute_error.mean(),
                               'neg_mean_squared_error': cv_neg_mean_squared_error.mean(),
                               'neg_root_mean_squared_error': cv_neg_root_mean_squared_error.mean()}, index=[0])

    return cv_metrics

In [None]:
cv_model = CatBoostRegressor(iterations=4524,verbose=0)

In [None]:
cv_metrics = get_cv_score(cv_model, X_train, y_train, cv=5)

In [None]:
cv_metrics
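
With `cv=5`, the data is split into five folds and each fold serves once as the validation set while the other four are used for training. For a regressor and an integer `cv`, scikit-learn uses contiguous, unshuffled folds. The sketch below reproduces that index split in plain Python; `k_fold_indices` is a hypothetical helper written for illustration.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, val_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample each
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, val
        start += size

# Twelve samples split into five folds, for illustration only
folds = list(k_fold_indices(12, n_splits=5))
fold_sizes = [len(val) for _, val in folds]
print(fold_sizes)
```

Every sample lands in exactly one validation fold, so the mean of the fold scores uses each training row for validation once.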