## Introduction
I am not a data scientist but an engineer/researcher in the manufacturing domain. Here I work with the data and reflect on it from a production engineering perspective. Any comments or advice are welcome!

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import KFold, cross_val_score, cross_validate
from xgboost import XGBRegressor, plot_importance
from statistics import mean
import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
data = pd.read_csv('../input/mercedes-benz-greener-manufacturing/train.csv.zip')
test = pd.read_csv('../input/mercedes-benz-greener-manufacturing/test.csv.zip')
print(data.shape, test.shape)
data.head()

## 1. Exploring the dataset 
### 1.1 Basic checkup & cleaning

In [None]:
print(data.dtypes.value_counts())
data.isnull().sum().sort_values(ascending=False) # no missing value!
test.isnull().sum().sort_values(ascending=False) # no missing value for test data as well!

# cardinality=1 columns: 12 columns in data and 5 columns in test
data_one_cardinality_columns = [column for column in data.columns if data[column].nunique()==1]
test_one_cardinality_columns = [column for column in test.columns if test[column].nunique()==1]

one_cardinality_columns = data_one_cardinality_columns + test_one_cardinality_columns

I could drop these cardinality-1 columns, but I do not know the data well yet, so I will decide later.
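To make the decision easier later, it may help to separate columns that are constant in both sets from those that are constant in only one. The sketch below uses two toy frames with made-up values; `constant_columns`, `train_toy` and `test_toy` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy frames standing in for train/test (hypothetical values).
train_toy = pd.DataFrame({"X10": [0, 1, 0], "X11": [0, 0, 0], "X12": [1, 1, 1]})
test_toy = pd.DataFrame({"X10": [0, 0, 0], "X11": [0, 1, 0], "X12": [1, 1, 1]})


def constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique() == 1]


train_const = set(constant_columns(train_toy))
test_const = set(constant_columns(test_toy))

# Constant in both sets: carries no information anywhere.
const_in_both = sorted(train_const & test_const)
# Constant in only one set: varies somewhere, so worth a look before dropping.
const_in_one = sorted(train_const ^ test_const)
print(const_in_both, const_in_one)
```

A column that is constant in the training data cannot help a model fitted on that data, even if it varies in the test set, so the split is mainly useful for understanding the data.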

In [None]:
sns.boxplot(x=data.y)  # pass the vector by keyword; recent seaborn versions expect this

Takt time for car assembly lines is often around 60 sec, but I am not sure about premium cars like Mercedes. If these y values reflect reality, the takt time may be longer than 60 sec and there may be two or more test stations at the end of the line. In any case, the one data point above 250 sec looks very strange, so I remove it.
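The station reasoning can be written as a small calculation. This is only a sketch: the 60 sec takt time comes from the text above, while the 100 sec test cycle is an assumed round number that should be replaced by the observed mean of y.

```python
import math

takt_time_s = 60.0    # assumed takt time of the line, from the text above
test_cycle_s = 100.0  # assumed typical test time; replace with the observed mean of y

# Each station finishes one car every test_cycle_s seconds, so the line needs
# enough parallel stations to release one car per takt.
stations_needed = math.ceil(test_cycle_s / takt_time_s)
print(stations_needed)
```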

In [None]:
# The cutoff of 200 sec removes the extreme point above 250 sec
data_o = data[data['y'] <= 200]
sns.boxplot(x=data_o.y)

### 1.2 Facets Overview from Google. I like this one for initial data exploration.

In [None]:
!pip install facets-overview

In [None]:
### Create the feature stats for the datasets and stringify it.
import base64
from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator

gfsg = GenericFeatureStatisticsGenerator()
proto = gfsg.ProtoFromDataFrames([{'name': 'train', 'table': data}])
protostr = base64.b64encode(proto.SerializeToString()).decode("utf-8")

### Display the facets overview visualization for this data
from IPython.display import display, HTML

HTML_TEMPLATE = """
        <script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html" >
        <facets-overview id="elem"></facets-overview>
        <script>
          document.querySelector("#elem").protoInput = "{protostr}";
        </script>"""
html = HTML_TEMPLATE.format(protostr=protostr)
display(HTML(html))

### 1.3 Some reflection on the data
* As I understand it, the columns are a list of car features: eight are categorical and the rest are binary.
* Because of this, there are no particular outliers among the features. It is a clean and simple dataset.
* Since the input data are car features, the quality of the data should always be good. Otherwise they could not build the cars!
* Many features have low variance, e.g. 99.8% of the values are 0. I am not sure they will contribute to model training.
* The description of the dataset says "dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing". This is a bit difficult for me to understand. Does the permutation (order) of the features matter for the test time, or is it rather the combination of features? The latter makes more sense to me, but maybe I misunderstand something....
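The low-variance observation can be checked per column. Below is a minimal sketch on toy data: the column names `X_a` and `X_b` are made up, and the 99.8% cutoff is taken from the example in the list above rather than from any tuning.

```python
import pandas as pd

# Toy binary features (hypothetical): X_a is almost always 0, X_b is balanced.
toy = pd.DataFrame({
    "X_a": [0] * 999 + [1],
    "X_b": [0, 1] * 500,
})

# Share of the most frequent value in each column
# (value_counts sorts by frequency, highest first).
dominant_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Flag columns where one value covers at least 99.8% of the rows.
near_constant = dominant_share[dominant_share >= 0.998].index.tolist()
print(dominant_share.round(3).to_dict(), near_constant)
```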

## 2. Data preprocessing

In [None]:
y = data_o.y
X = data_o.drop(['ID','y'], axis=1)
X_submission = test.drop('ID', axis=1)

# Concatenate X and X_submission before applying auto OneHotEncoder
X_con = pd.concat([X, X_submission], axis=0)

# Apply auto OneHotEncoder 
X_con_ohe = pd.get_dummies(X_con)

# Now dividing back into train and test data
X_ohe = X_con_ohe[:len(X)]
X_submission_ohe = X_con_ohe[len(X):]
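The reason for concatenating before encoding is that train and test may contain different category levels. The sketch below shows the mismatch on toy frames and one possible alternative, encoding the training data only and aligning the test columns with `reindex`. The frames and their values are hypothetical.

```python
import pandas as pd

train_toy = pd.DataFrame({"X0": ["a", "b", "a"], "X10": [0, 1, 0]})
test_toy = pd.DataFrame({"X0": ["a", "c"], "X10": [1, 1]})  # "c" never occurs in train

# Encoded separately, the two frames end up with different dummy columns.
train_sep = pd.get_dummies(train_toy, dtype=int)
test_sep = pd.get_dummies(test_toy, dtype=int)
print(list(train_sep.columns))
print(list(test_sep.columns))

# Alternative to concatenating: align the test columns to the training columns.
# Levels unseen in train ("c") are dropped; levels missing in test ("b") become 0.
test_aligned = test_sep.reindex(columns=train_sep.columns, fill_value=0)
print(test_aligned)
```

Concatenating first, as done above, keeps every level from both sets; aligning to the training columns keeps only what the model can actually learn from. Either way the two matrices end up with identical columns.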

## 3. Run some estimators as baselines
* Most manufacturing-related relational data works very well with tree ensembles. So far I have not seen other models, such as linear regression or neural networks, beat them...
* R2 is used for scoring, but MAE is more informative in this test bench case, so I use MAE as well.
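To make the difference between the two metrics concrete, here is a small worked example with made-up numbers. MAE is in seconds and can be read directly as "how far off is the prediction on average", while R2 is a unitless ratio relative to the spread of the target.

```python
import numpy as np

y_true = np.array([90.0, 100.0, 110.0, 120.0])  # hypothetical test times in seconds
y_pred = np.array([95.0, 98.0, 105.0, 128.0])   # hypothetical predictions

# Mean absolute error: average size of the miss, in seconds.
mae = np.mean(np.abs(y_true - y_pred))

# R2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(mae, round(r2, 3))
```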

### 3.1 XGBoost

In [None]:
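xgb = XGBRegressor(n_estimators=200, learning_rate=0.05, random_state=42)

# Caveat: eval_set is the full training data, so early stopping also sees the
# rows held out in each CV fold and the scores may be optimistic.
# Depending on library versions, early_stopping_rounds may need to move to the
# XGBRegressor constructor and fit_params to the `params` argument of cross_validate.
fit_params = {"early_stopping_rounds": 5, "eval_set": [(X_ohe, y)]}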
cv = KFold(n_splits=5, shuffle=True, random_state=42)
xgb_scores = cross_validate(xgb, X_ohe, y, scoring=['neg_mean_absolute_error','r2'], cv=cv, n_jobs=-1, 
                            verbose=1, fit_params=fit_params, return_estimator=True)

print('mae:',abs(xgb_scores['test_neg_mean_absolute_error'].mean()))
print('r2:',xgb_scores['test_r2'].mean())

### 3.2 Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(max_depth=4, n_estimators=5, random_state=42)
# evaluate model
cv = KFold(n_splits=10, shuffle=True, random_state=42)
rf_scores = cross_validate(rf, X_ohe, y, scoring=['neg_mean_absolute_error','r2'], cv=cv, n_jobs=-1, return_estimator=True)

rf_base_mae = abs(rf_scores['test_neg_mean_absolute_error'].mean())
rf_base_r2 = rf_scores['test_r2'].mean()

print('mae:',rf_base_mae)
print('r2:',rf_base_r2)

### 3.3 XGBoost with HalvingRandomSearchCV
This is not necessary, but I wanted to experiment and see how it turns out.

In [None]:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV
from scipy.stats import uniform, randint

xgb_model = XGBRegressor(n_estimators=200)   

params = {
    "colsample_bytree": uniform(0.7, 0.3),
    "gamma": uniform(0, 0.5),
    "learning_rate": uniform(0.03, 0.3), # uniform(loc, scale) samples from [0.03, 0.33]
    "max_depth": randint(2, 6), # integers 2 to 5 (upper bound is exclusive)
    "subsample": uniform(0.6, 0.4)}

fit_params = {
    "early_stopping_rounds": 5,
    "eval_set": [(X_ohe, y)]}

xgb_halvsearch = HalvingRandomSearchCV(xgb_model,                   
                param_distributions=params, resource='n_estimators', 
                max_resources=40, random_state=42, cv=5, scoring='r2',
                verbose=1, n_jobs=-1, return_train_score=True)

xgb_halvsearch.fit(X_ohe, y, **fit_params)
print('best r2 score:',xgb_halvsearch.best_score_)

### 3.4 Reflection on running some estimators
* Random Forest is quick and gives an OK result, so I will use it from here on.
* A deviation of about 5 sec from the target on average is probably OK, considering the current mean and variation of the target.
* An increase of a few 0.01 points in R2 only improves the accuracy (MAE) by a few 0.1 seconds. This is very little considering that the average test cycle is around 100 sec, and such a small time reduction is easily cancelled out by other factors in production.
* So it makes little sense to spend hours improving the scores...
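The link between a small R2 gain and the gain in seconds can be sketched with the identity R2 = 1 - MSE / Var(y), which gives RMSE = sigma * sqrt(1 - R2). Note the assumptions: RMSE is used here as a stand-in for MAE, and both the standard deviation of 12 sec and the two R2 values are hypothetical round numbers, not results from this notebook.

```python
import math

sigma_y = 12.0                    # assumed standard deviation of y in seconds (hypothetical)
r2_before, r2_after = 0.55, 0.56  # hypothetical scores, 0.01 apart


def rmse_from_r2(r2, sigma):
    # R2 = 1 - MSE / Var(y)  =>  RMSE = sigma * sqrt(1 - R2)
    return sigma * math.sqrt(1.0 - r2)


gain_s = rmse_from_r2(r2_before, sigma_y) - rmse_from_r2(r2_after, sigma_y)
print(round(gain_s, 2))
```

Under these assumptions a 0.01 step in R2 is worth roughly a tenth of a second, which is in line with the observation above.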

## 4. Explore with SHAP

In [None]:
import shap

estimater = rf_scores['estimator'][1]   # I use the one generating the score close to the average....

explainer = shap.explainers.Tree(estimater)
shap_values = explainer(X_ohe)

shap.summary_plot(shap_values, X_ohe)

So several features, especially when they are true (1), contribute to the inference.

In [None]:
# Check mean shap value for each column
shap_mean = np.abs(shap_values.values).mean(axis=0)
shap_mean_columns = pd.Series(shap_mean, index=X_ohe.columns)
#shap_mean_columns.value_counts().sort_index()
shap_mean_columns.sort_values(ascending=False)

So quite a few features have very low SHAP values. It may be a good idea to remove them.

## 5. Feature selection
### 5.1 Feature selection based on mean SHAP values

In [None]:
# Create a list of columns whose mean |SHAP| value is higher than 0.01 => about 31 columns
shap_incl_columns = shap_mean_columns[shap_mean_columns.values>0.01].index.to_list()
print(shap_incl_columns)
X_ohe_fs = X_ohe[shap_incl_columns]
X_ohe_fs.shape
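The 0.01 cutoff is a judgment call, so it may be worth checking how sensitive the number of kept features is to it. The sketch below does this on a toy Series; the feature names and values are hypothetical and only stand in for `shap_mean_columns`.

```python
import pandas as pd

# Hypothetical mean |SHAP| values per feature, standing in for shap_mean_columns.
toy_shap_mean = pd.Series(
    {"X314": 3.2, "X315": 1.1, "X118": 0.4, "X47": 0.02, "X5_n": 0.008, "X11": 0.0}
)

# Count how many features survive each candidate cutoff.
kept_counts = {}
for cutoff in (0.001, 0.01, 0.1, 1.0):
    kept_counts[cutoff] = int((toy_shap_mean > cutoff).sum())
    print(f"cutoff={cutoff}: {kept_counts[cutoff]} features kept")

# Columns kept at the cutoff used in the cell above.
kept_cols = toy_shap_mean[toy_shap_mean > 0.01].index.tolist()
print(kept_cols)
```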

### 5.2 Recursive Feature Elimination and Cross-Validated selection (RFECV)

In [None]:
# I borrowed the code from Dmitriy K. thanks!
from sklearn.feature_selection import RFECV

selector = RFECV(estimater, step = 1, cv=5, n_jobs=-1,verbose=1, scoring='r2')
selector.fit(X_ohe_fs, y)

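# grid_scores_ was removed in newer scikit-learn; cv_results_ holds the per-step scores
print(selector.cv_results_['mean_test_score'])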

rfecv_features = [f for f, s in zip(X_ohe_fs, selector.support_) if s]
print('selected features:', rfecv_features)

X_ohe_rfecv = X_ohe[rfecv_features]

### 5.3 Compare results after the feature selection

In [None]:
rf = RandomForestRegressor(max_depth=4, n_estimators=5, random_state=42)
cv = KFold(n_splits=10, shuffle=True, random_state=42)

# Score with all features
print('mae_all_data:',rf_base_mae)
print('r2_all_data:',rf_base_r2)

# Score with features with mean shap value >0.01 (31 columns)
rf_scores1 = cross_validate(rf, X_ohe_fs, y, scoring=['neg_mean_absolute_error','r2'], cv=cv, n_jobs=-1, return_estimator=True)
print('mae_shap_features:', abs(rf_scores1['test_neg_mean_absolute_error'].mean()))
print('r2_shap_features:', rf_scores1['test_r2'].mean())

# Score with RFECV features (5 columns)
rf_scores2 = cross_validate(rf, X_ohe_rfecv, y, scoring=['neg_mean_absolute_error','r2'], cv=cv, n_jobs=-1, return_estimator=True)
print('mae_rfecv_features:', abs(rf_scores2['test_neg_mean_absolute_error'].mean()))
print('r2_rfecv_features:', rf_scores2['test_r2'].mean())

### 5.4 Reflection on feature selection
* The reduced feature sets show slightly better results.
* Accuracy-wise, it makes no real difference whether the SHAP features or the RFECV features are used.
* I use the SHAP features here. If the features required maintenance, for instance sensor values, I would use fewer features. But in this case the data should be very stable...
* The inference time of the trained model is short enough, since the car features are decided before the actual manufacturing and the takt time is much longer than the inference time.

## 6. Use RF with shap features for the final model
### 6.1 Map the difference between the target and predicted
I have learned from experience that looking only at aggregated statistics such as means can be risky, especially in manufacturing. It is good to check how the inference looks against each target.

In [None]:
from sklearn.model_selection import train_test_split

final_model = RandomForestRegressor(max_depth=4, n_estimators=5, random_state=42)
final_model.fit(X_ohe_fs,y)

# Make predictions and calculate the difference between y and the prediction for each row.
# Note: final_model was fitted on all rows above, so X_test is in-sample here.
X_train, X_test, y_train, y_test = train_test_split(X_ohe_fs, y, shuffle=False, test_size=0.25)
predicted_y = pd.Series(final_model.predict(X_test))
predicted_y.index = X_test.index
dif = abs(predicted_y-y_test)

In [None]:
# DataFrame with columns y, predicted_y, difference
compare = y_test.to_frame().join(predicted_y.to_frame(name='predicted_y'))
compare = compare.join(dif.to_frame(name='abs_dif'))

# A line plot is not strictly the right chart, but it makes the difference easy to see.
sns.set_theme(context='notebook', style='darkgrid')
plt.figure(figsize=(24, 6.5))
sns.lineplot(data=compare.iloc[250:350,0:2])

In [None]:
# plot the difference
plt.figure(figsize=(24, 6.5))
sns.lineplot(data=compare.iloc[250:350,2]) 

In [None]:
sns.displot(data=compare, x=y_test)
sns.displot(data=compare, x=predicted_y)

### 6.2 Reflection
* The final model generally follows the target well but does not predict the longer test cycles well. However, the purpose is to reduce the test cycle time, so this discrepancy is not so important.
* The distribution plot of the predicted values shows that the model captures the characteristics of the target distribution.
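One way to check where the model struggles without relying on in-sample predictions is to use out-of-fold predictions and then look at the error per target range. The sketch below does this on synthetic data; `X_toy`, `y_toy` and the way the target is generated are assumptions made only for illustration, and the model settings simply mirror the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(42)
# Synthetic stand-in for the encoded features and target (hypothetical data).
X_toy = pd.DataFrame(rng.integers(0, 2, size=(400, 6)),
                     columns=[f"f{i}" for i in range(6)])
y_toy = pd.Series(90 + 15 * X_toy["f0"] + 8 * X_toy["f1"] + rng.normal(0, 4, 400),
                  name="y")

model = RandomForestRegressor(max_depth=4, n_estimators=5, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Each prediction comes from a model that never saw that row during fitting.
oof_pred = cross_val_predict(model, X_toy, y_toy, cv=cv)

compare_toy = pd.DataFrame({"y": y_toy, "predicted_y": oof_pred})
compare_toy["abs_dif"] = (compare_toy["predicted_y"] - compare_toy["y"]).abs()

# Mean absolute error per target quartile shows where the model struggles.
compare_toy["y_bin"] = pd.qcut(compare_toy["y"], q=4)
mae_by_bin = compare_toy.groupby("y_bin", observed=True)["abs_dif"].mean()
print(mae_by_bin)
```

On the real data, the highest bin would correspond to the longer test cycles mentioned above.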

## 7. Submission

In [None]:
X_submission_ohe_fs = X_submission_ohe[shap_incl_columns]

predict = final_model.predict(X_submission_ohe_fs)
submission = pd.DataFrame({'ID': test.ID, 'y': predict})
submission.to_csv('submission.csv', index=False)