# Predicting Concrete Compressive Strength

## This is a starter notebook for the regression task of predicting concrete compressive strength. It follows these steps:

##### 1. EDA and Visualisation (using auto EDA library Dataprep)
##### 2. Data Preparation (Handling Skewed Data and data preprocessing)
##### 3. Model Selection (using Pycaret AutoML)
##### 4. Visualising Model and Predictions
##### 5. Building and Comparing Neural Network Model
##### 6. Finalising the best Model

## This notebook can be used as a guide for any regression task.

In [None]:
!pip install pycaret

In [None]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold
from sklearn.metrics import r2_score
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
import seaborn as sns
from joblib import dump, load

%matplotlib inline
sns.set(color_codes=True)
pal = sns.color_palette("viridis", 10)
sns.set_palette(pal)

In [None]:
data = pd.read_csv('../input/dl-course-data/concrete.csv')

In [None]:
data.isnull().sum()

> ### No null values

## 1. EDA and detailed report with Dataprep

In [None]:
!pip install dataprep

In [None]:
from dataprep.eda import plot, create_report

In [None]:
plot(data)

In [None]:
create_report(data)

## 2. Data Preparation

## Handling Skewness

### As shown in the report, the skewed variables are
1. BlastFurnaceSlag
2. FlyAsh
3. Water
4. Superplasticizer
5. Age

### There are many ways to handle skewness; we will use a log transformation

In [None]:
# log1p computes log(1 + x), which stays defined when a value is 0 (plain log(0) is not)

data['BlastFurnaceSlag'] = np.log1p(data['BlastFurnaceSlag'])
data['FlyAsh'] = np.log1p(data['FlyAsh'])
data['Water'] = np.log1p(data['Water'])
data['Superplasticizer'] = np.log1p(data['Superplasticizer'])
data['Age'] = np.log1p(data['Age'])

>#### Log transformation decreased the mean absolute error by more than 0.1 and increased the R2 score from 0.92 to 0.933 on the best model
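
To see why `log1p` helps with right-skewed columns, here is a minimal sketch using only the standard library. The sample values are hypothetical (loosely shaped like a curing-age column in days), and `skewness` is an illustrative helper, not part of the notebook's pipeline.

```python
import math

def skewness(values):
    """Sample skewness m3 / m2**1.5 (population moments, no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed sample: many small values and a long right tail.
age_days = [1, 3, 3, 7, 7, 7, 14, 28, 28, 28, 56, 90, 180, 365]

# log1p compresses the long tail; expm1 is its exact inverse.
log_age = [math.log1p(v) for v in age_days]
restored = [math.expm1(v) for v in log_age]

print(f"skewness before: {skewness(age_days):.2f}")
print(f"skewness after:  {skewness(log_age):.2f}")
```

Because `expm1` undoes `log1p`, a transformed column can always be mapped back to its original units if needed.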

## 3. Model Selection with Pycaret
### It's an AutoML library; we will use it to find the best model for our data. It saves time and code!

In [None]:
from pycaret.regression import *
features = list(data.drop(['CompressiveStrength'], axis=1).columns)
reg = setup(data=data, target='CompressiveStrength', numeric_features=features,
            remove_outliers=True, silent=True, train_size=0.7)

In [None]:
compare_models()

> #### CatBoost gives the best score

In [None]:
cb = create_model('catboost')
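
`create_model` reports its metrics fold by fold using cross-validation (10 folds by default, to the best of my knowledge). The sketch below shows the underlying idea with the standard library only: split the rows into k folds, hold each fold out once, and average the error. The strength values and the mean-predictor baseline are hypothetical stand-ins, not PyCaret internals.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=13):
    """Yield (train_idx, valid_idx) pairs; each row is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        valid_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, valid_idx

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical compressive strengths in MPa.
y = [12.5, 18.0, 21.3, 25.1, 29.9, 33.4, 36.8, 40.2, 44.7, 52.9,
     15.2, 19.6, 23.8, 27.4, 31.0, 35.5, 38.1, 42.6, 47.3, 61.0]

fold_mae = []
for train_idx, valid_idx in k_fold_indices(len(y), n_folds=5):
    # Baseline "model": always predict the mean of the training fold.
    baseline = sum(y[i] for i in train_idx) / len(train_idx)
    y_valid = [y[i] for i in valid_idx]
    fold_mae.append(mean_absolute_error(y_valid, [baseline] * len(y_valid)))

print("MAE per fold:", [round(m, 2) for m in fold_mae])
print("mean MAE:", round(sum(fold_mae) / len(fold_mae), 2))
```

Averaging over folds gives a steadier estimate than a single train/dev split, which is why the fold-wise table is a better basis for comparing models than one score.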

## 4. Visualising Model and Predictions

In [None]:
plot_model(cb)

In [None]:
plot_model(cb, plot = 'feature')

In [None]:
interpret_model(cb)

In [None]:
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values
X_train, X_dev, y_train, y_dev = train_test_split(X,y,random_state=13, train_size = 0.7)

In [None]:
from catboost import CatBoostRegressor, Pool
train_pool = Pool(data=X_train, label=y_train)
test_pool = Pool(data = X_dev, label = y_dev)

def r2_check(lr):
    cb_2 = CatBoostRegressor(eval_metric='R2',random_state=13, learning_rate=lr*0.001).fit(train_pool, eval_set = test_pool,  verbose=False)
    pred = cb_2.predict(X_dev)
    return r2_score(y_dev,pred)

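def get_best_lr(r2):
    # r2[i] is the score for learning rate 0.001 * (i + 51), matching range(51, 100) below.
    # argmax also works when every score is negative, unlike a running max that starts at 0.
    best_index = int(np.argmax(r2))
    return 0.001 * (best_index + 51)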

In [None]:
# Candidate learning rates are 0.051 ... 0.099; r2_check scales each integer by 0.001
lr = list(range(51, 100))
r2 = [r2_check(i) for i in lr]
best_lr = get_best_lr(r2)
print(max(r2), best_lr)
cb_2 = CatBoostRegressor(eval_metric='R2', random_state=13, learning_rate=best_lr).fit(train_pool, eval_set=test_pool, verbose=False)
pred = cb_2.predict(X_dev)
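
The search above is a one-dimensional grid search: score every candidate learning rate and keep the best. A minimal sketch of that pattern, keeping each score paired with its candidate so no index arithmetic is needed, could look like this. `pick_best` and `toy_validation_r2` are hypothetical names, and the toy scoring function merely stands in for training and scoring a real model.

```python
def pick_best(candidates, score_fn):
    """Return (best_candidate, best_score) for the highest-scoring candidate."""
    scored = [(score_fn(c), c) for c in candidates]
    best_score, best_candidate = max(scored)
    return best_candidate, best_score

# Same grid as the loop above: 0.051, 0.052, ..., 0.099.
learning_rates = [i / 1000 for i in range(51, 100)]

def toy_validation_r2(lr):
    # Hypothetical stand-in for "train a model with this learning rate and
    # score it on held-out data"; it peaks at lr = 0.08 purely for illustration.
    return 0.93 - 50 * (lr - 0.08) ** 2

best_lr, best_r2 = pick_best(learning_rates, toy_validation_r2)
print(best_lr, round(best_r2, 4))
```

One caveat worth keeping in mind: when the same dev set is used both to pick the learning rate and to report the final score, that score tends to be optimistic. Holding out a separate validation split for the search is the cleaner setup.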


### Checking Predictions

In [None]:
df = pd.DataFrame({'True Compressive Strength (MPa)': y_dev, 'Predicted Compressive Strength (MPa)': pred}).head(40)
df

## 5. Let's Compare CatBoost with a Neural Network

In [None]:
X = data.drop(['CompressiveStrength'],axis=1)
y = data['CompressiveStrength']
X_train, X_dev, y_train, y_dev = train_test_split(X,y,random_state=13, train_size = 0.75)
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_dev = ss.transform(X_dev)
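
The cell above calls `fit_transform` on the training data but only `transform` on the dev data. The point is that the mean and standard deviation come from the training rows alone, so no information from the dev set leaks into preprocessing. A plain-Python sketch of the same idea for a single column follows; the cement values are hypothetical and `fit_scaler` / `apply_scaler` are illustrative helpers, not scikit-learn functions.

```python
import math

def fit_scaler(column):
    """Return (mean, std) of a column, using the population std as StandardScaler does."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    return mean, math.sqrt(variance)

def apply_scaler(column, mean, std):
    """Standardise a column with statistics computed elsewhere."""
    return [(v - mean) / std for v in column]

# Hypothetical cement contents in kg per cubic metre.
train_cement = [540.0, 332.5, 198.6, 266.0, 380.0]
dev_cement = [475.0, 139.6]

# Statistics come from the training split only...
mean, std = fit_scaler(train_cement)
train_scaled = apply_scaler(train_cement, mean, std)
# ...and are reused unchanged on the dev split.
dev_scaled = apply_scaler(dev_cement, mean, std)

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in dev_scaled])
```

The scaled training column has mean 0 and unit variance by construction; the dev column generally does not, and that is expected.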

In [None]:
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[8]),
    layers.Dropout(0.2),
    layers.BatchNormalization(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.2),
    layers.BatchNormalization(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.2),
    layers.BatchNormalization(),
    layers.Dense(1),
])

In [None]:

model.compile(
    optimizer='adam',
    loss='mae', 
)

history = model.fit(
    X_train, y_train, 
    validation_data=(X_dev, y_dev),
    batch_size=32,
    epochs=100,
    verbose=0
)


# Show the learning curves
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();
print(("Minimum Validation Loss: {:0.4f}").format(history_df['val_loss'].min()))

In [None]:
pred = model.predict(X_dev)
r2_score(y_dev,pred)
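
Both models are compared on the R2 score, so it helps to recall what it measures: one minus the ratio of the model's squared error to the squared error of always predicting the mean. A small worked sketch with hypothetical strengths in MPa (the `r2` helper mirrors the standard definition, written out by hand):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical true strengths in MPa.
y_true = [20.0, 30.0, 40.0, 50.0]

perfect = r2(y_true, y_true)                     # predictions match exactly
mean_only = r2(y_true, [35.0] * 4)               # always predict the mean
close = r2(y_true, [22.0, 29.0, 41.0, 47.0])     # small errors

print(perfect, mean_only, round(close, 3))
```

A score of 1 means perfect predictions, 0 means no better than the mean, and negative values are possible when a model does worse than that baseline.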

> ## No improvement over the CatBoost score

## 6. Finalising and Saving best Model

In [None]:
dump(cb_2, 'model.joblib')
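
A saved model only knows about the features it was trained on, which in this notebook means the log1p-transformed columns. Any new mix therefore needs the same transform before prediction. The sketch below shows that step in plain Python; `prepare_row` is a hypothetical helper, and the column names `Cement`, `CoarseAggregate` and `FineAggregate` are assumptions (only the five log-transformed names appear in the notebook).

```python
import math

# Columns that were log1p-transformed before training in this notebook.
LOG_COLUMNS = ("BlastFurnaceSlag", "FlyAsh", "Water", "Superplasticizer", "Age")

def prepare_row(raw_row):
    """Return a copy of raw_row with the training-time log1p transform applied."""
    return {
        name: math.log1p(value) if name in LOG_COLUMNS else value
        for name, value in raw_row.items()
    }

# Hypothetical new mix in raw units; the non-log column names are assumed.
raw_mix = {
    "Cement": 540.0,
    "BlastFurnaceSlag": 0.0,
    "FlyAsh": 0.0,
    "Water": 162.0,
    "Superplasticizer": 2.5,
    "CoarseAggregate": 1040.0,
    "FineAggregate": 676.0,
    "Age": 28.0,
}

prepared = prepare_row(raw_mix)
print(prepared)

# With the saved file available, prediction would then look roughly like:
#   model = load('model.joblib')
#   model.predict([list(prepared.values())])
# assuming the values are in the same column order used during training.
```

Keeping the transform in one function makes it harder for training and inference preprocessing to drift apart.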

------- End of the Notebook ---------- 
