The objective of the competition is to predict how long a car will spend in the testing phase. The dataset represents various permutations of Mercedes-Benz vehicle features. Shorter testing times also help reduce carbon dioxide emissions without compromising Daimler's standards.

The dataset contains an anonymized set of variables, each representing a custom feature of a Mercedes vehicle. For example, a variable could be 4WD, added air suspension, or a head-up display.

y is the variable to be predicted: the time (in seconds) the car took to pass testing.

Variables containing letters are categorical. Variables with 0/1 are of binary type.



In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import os
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

In [None]:
colors = ['#001c57','#40948f','#a6a6a6','#99d1df']
sns.palplot(sns.color_palette(colors))

In [None]:
train = pd.read_csv('/kaggle/input/mercedesbenz-greener-manufacturing/train.csv')
test = pd.read_csv('/kaggle/input/mercedesbenz-greener-manufacturing/test.csv')

In [None]:
plt.figure(figsize=(16,6))
plt.subplot(121)
# histplot replaces the deprecated distplot; kde=True overlays the density curve
sns.histplot(train.y.values, bins=50, color=colors[1], kde=True)
plt.title('Target Value Distribution - y\n',fontsize=15)
plt.xlabel('Value in Seconds'); plt.ylabel('Frequency');

plt.subplot(122)
sns.boxplot(x=train.y.values, color=colors[3])
plt.title('Target Value Box Plot - y\n',fontsize=15)
plt.xlabel('Value in Seconds');

In [None]:
train['y'].describe()

Most target values lie between about 72 and 140 seconds. The first and third quartiles are about 91 and 109 seconds, and the median is 100 seconds. There are also outliers above 140 seconds, which we can remove from the training sample because they would add noise to the model.
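The outlier removal mentioned above could be sketched as follows. This is a minimal illustration on invented numbers, not the competition data: the 140-second cutoff comes from the text, while the `toy` frame and the `Y_CUTOFF` name are assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-in for the training frame (values are invented).
toy = pd.DataFrame({"ID": [0, 1, 2, 3, 4],
                    "y": [88.5, 100.2, 265.3, 109.7, 150.4]})

Y_CUTOFF = 140  # seconds; the cutoff suggested by the distribution above

# Keep only rows whose target is at or below the cutoff.
toy_clean = toy[toy["y"] <= Y_CUTOFF].reset_index(drop=True)

print(len(toy), "->", len(toy_clean))
```

Applied to the real data, the same filter would be used on `train` only, since the test set has no `y` column.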


In [None]:
train.dtypes.value_counts()

In [None]:
train.dtypes[train.dtypes=='float']

In [None]:
dtype_df = train.dtypes.reset_index()
dtype_df.columns = ["Count", "Column Type"]
dtype_df.groupby("Column Type").aggregate('count').reset_index()

In [None]:
train.dtypes[train.dtypes=='object']

In [None]:
obj_dtype = train.dtypes[train.dtypes=='object'].index
for i in obj_dtype:
    print(i, train[i].unique())

In [None]:
train.isna().sum()[train.isna().sum()>0]

In [None]:
fig,ax = plt.subplots(len(obj_dtype), figsize=(18,80))

for i, col in enumerate(obj_dtype):
    sns.boxplot(x=col, y='y', data=train, ax=ax[i])

Inference from the graphs:

1) Since the goal is to reduce testing time, the most favorable values are those with the lowest times: az and bc (X0), y (X1), n (X2), x and h (X5). This is a hypothesis that still needs to be checked against y.

2) Variables X3, X5, X6 and X8 have similar distributions: within each feature, the means and quartiles of y differ little between values.

3) X0 and X2 show the greatest variation between their values, which may indicate that these features are more useful.
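The first observation can be checked numerically rather than read off the box plots. The following is a minimal sketch on invented values, assuming a frame shaped like `train`; the `toy` frame and the `summary` name are chosen for the example only.

```python
import pandas as pd

# Hypothetical miniature of the training data (values are invented).
toy = pd.DataFrame({
    "X0": ["az", "az", "k", "k", "bc", "bc"],
    "y":  [78.0, 80.0, 101.0, 99.0, 76.0, 79.0],
})

# Median and row count of y per category, fastest categories first.
summary = (toy.groupby("X0")["y"]
              .agg(["median", "count"])
              .sort_values("median"))
print(summary)
```

The count column matters when reading such a table: a low median backed by only a handful of rows is weak evidence.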


In [None]:
# Integer columns are ID followed by the binary features; [1:] skips ID
num = train.dtypes[train.dtypes=='int64'].index[1:]

The remaining numeric variables are binary (0 or 1), so a detailed distribution analysis is unnecessary. What matters here is whether a variable changes at all across the dataset. To check this, we compute each variable's variance with the var() function and select only those where it is zero, that is, the variable is always 0 or always 1 over the whole dataset.

In [None]:
nan_num = []
for i in num:
    if (train[i].var()==0):
        print(i, train[i].var())
        nan_num.append(i)

Several such variables turned up. We can remove them from the analysis: a constant column carries no information about the target, and dropping it shrinks the feature set the algorithm has to process.
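The same filter can also be written without an explicit loop. This is only a sketch on a toy frame; the column names `f1` to `f4` are hypothetical and the values are invented.

```python
import pandas as pd

# Hypothetical binary features; f2 and f4 never change.
toy = pd.DataFrame({
    "f1": [0, 1, 0, 1],
    "f2": [0, 0, 0, 0],
    "f3": [1, 0, 0, 1],
    "f4": [1, 1, 1, 1],
})

# Variance of every column at once, then keep the names where it is zero.
variances = toy.var()
constant_cols = variances[variances == 0].index.tolist()

toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```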

In [None]:
train = train.drop(columns=nan_num)

In [None]:
train.shape

In [None]:
for i in obj_dtype:
    le = LabelEncoder()
    # Fit on train and test together so categories seen only in test get a code too
    le.fit(list(train[i].values) + list(test[i].values))
    train[i] = le.transform(list(train[i].values))

In [None]:
train[obj_dtype].head()

In [None]:
corr = train[train.columns[1:10]].corr()

fig,ax = plt.subplots(figsize=(12,10))
sns.heatmap(corr, vmax=.7, square=True,annot=True);

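Among the categorical variables, none shows a direct linear relationship with the target y. Note that the integer codes assigned by LabelEncoder are arbitrary, so these correlation coefficients are only a rough indication.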
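Because label codes carry no order, a measure that ignores the coding can complement the heatmap. The sketch below computes the correlation ratio (eta squared), which is a technique swapped in here for illustration and not part of the original analysis; the `correlation_ratio` helper and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

def correlation_ratio(categories: pd.Series, values: pd.Series) -> float:
    """Share of the variance of `values` explained by the group means."""
    grand_mean = values.mean()
    groups = values.groupby(categories)
    # Between-group sum of squares: group size times squared mean offset.
    ss_between = (groups.size() * (groups.mean() - grand_mean) ** 2).sum()
    ss_total = ((values - grand_mean) ** 2).sum()
    return float(ss_between / ss_total) if ss_total > 0 else 0.0

# Invented data in which the category fully determines y.
toy = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "c", "c"],
    "y":   [80.0, 80.0, 100.0, 100.0, 90.0, 90.0],
})

eta_sq = correlation_ratio(toy["cat"], toy["y"])

# Pearson correlation on alphabetical integer codes (a=0, b=1, c=2).
codes = toy["cat"].astype("category").cat.codes
pearson = np.corrcoef(codes, toy["y"])[0, 1]

print(eta_sq, pearson)
```

In this toy case the category explains all of the variance in y, yet the Pearson coefficient on the integer codes is only moderate, which is why a weak heatmap cell does not rule out a useful categorical feature.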

In [None]:
threshold = 1

corr_all = train.drop(columns=obj_dtype).corr()
# Keep only the strict lower triangle so each pair is counted once
corr_all.loc[:,:] = np.tril(corr_all, k=-1)

In [None]:
train.shape

In [None]:
already_in = set()
result = []
for col in corr_all:
    # np.isclose guards against floating-point round-off in the coefficients
    perfect_corr = corr_all[col][np.isclose(corr_all[col], threshold)].index.tolist()
    if perfect_corr and col not in already_in:
        already_in.update(set(perfect_corr))
        perfect_corr.append(col)
        result.append(perfect_corr)

In [None]:
result

The analysis of the numeric variables shows that some of them are perfectly correlated with others. To avoid multicollinearity, we can keep one variable from each group with correlation 1 and remove the rest, or use regularization so that the algorithm handles it automatically.
There is also a simpler way to remove such variables without computing correlations: for binary variables, a correlation of 1 means the columns are identical, so we can simply drop duplicate columns.
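The duplicate-column idea can be illustrated on a toy frame. This is a sketch under assumptions: the column names and values are invented, and the boolean-mask variant is one possible way to do it, not necessarily the one used on the real data.

```python
import pandas as pd

# Hypothetical binary features; f2 copies f1 and f4 copies f3.
toy = pd.DataFrame({
    "f1": [0, 1, 0, 1],
    "f2": [0, 1, 0, 1],
    "f3": [1, 0, 0, 1],
    "f4": [1, 0, 0, 1],
})

# Identical binary columns are perfectly correlated.
corr_f1_f2 = toy["f1"].corr(toy["f2"])

# duplicated() on the transpose flags each column that repeats an earlier one.
is_duplicate = toy.T.duplicated()

# Select the non-duplicated columns with a boolean mask.
toy_unique = toy.loc[:, ~is_duplicate.values]
print(list(toy_unique.columns))
```

Selecting with a mask keeps the original column dtypes, whereas transposing a mixed-type frame twice leaves every column with the object dtype.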



In [None]:
train.T.drop_duplicates().T

In [None]:
# Fit a Random Forest ensemble model

from sklearn.model_selection import train_test_split

# Drop both the identifier and the target; keeping y among the features would leak the target
x = train.drop(columns=['ID', 'y'])
y = train['y']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=10)


from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=200, max_depth=200, min_samples_leaf=4, max_features=0.2, n_jobs=-1, random_state=10)
model.fit(x_train, y_train)

# score() returns the coefficient of determination R^2, printed here as a percentage
print("Training Score:- ",model.score(x_train,y_train)*100)
print("Testing Score:- ",model.score(x_test,y_test)*100)

In [None]:
# Fit a Gradient Boosting Regressor ensemble model

from sklearn.model_selection import train_test_split

# Drop both the identifier and the target; keeping y among the features would leak the target
x = train.drop(columns=['ID', 'y'])
y = train['y']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=10)


from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor()
model.fit(x_train, y_train)

# score() returns the coefficient of determination R^2, printed here as a percentage
print("Training Score:- ",model.score(x_train,y_train)*100)
print("Testing Score:- ",model.score(x_test,y_test)*100)

In [None]:
predicted = model.predict(x_test)

In [None]:
predicted

In [None]:
plt.figure(figsize=(15,5))
plt.subplot(121)
# histplot replaces the deprecated distplot; kde=True overlays the density curve
sns.histplot(predicted, bins=50, color=colors[1], kde=True)
plt.title('Predicted Value Distribution\n',fontsize=15)
plt.xlabel('Value in Seconds'); plt.ylabel('Frequency');

plt.subplot(122)
sns.boxplot(x=predicted, color=colors[3])
plt.title('Predicted Value Box Plot\n',fontsize=15)
plt.xlabel('Value in Seconds');