## Competition [Mercedes-Benz Greener Manufacturing](https://www.kaggle.com/competitions/mercedes-benz-greener-manufacturing)

<img src="https://images.unsplash.com/photo-1630596369706-57eaf9ba7cae?ixlib=rb-1.2.1&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=3570&q=80:*" width="600px">

<a class="anchor" id="0.1"></a>
# Table of Contents

1. [Import libraries](#1)
1. [Download data](#2)
1. [EDA](#3)

## 1. Import libraries<a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

In [None]:
# Work with Data
import numpy as np 
import pandas as pd

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Modeling and Prediction
from sklearn.preprocessing import LabelEncoder

# Helpers
import os
import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

In [None]:
colors = ['#b8e994','#78e08f','#38ada9','#079992']
sns.palplot(sns.color_palette(colors))

## 2. Download data<a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

In [None]:
train = pd.read_csv('../input/mercedes-benz-greener-manufacturing/train.csv.zip')
test = pd.read_csv('../input/mercedes-benz-greener-manufacturing/test.csv.zip')
sub = pd.read_csv('../input/mercedes-benz-greener-manufacturing/sample_submission.csv.zip')

print("Train shape : ", train.shape)
print("Test shape : ", test.shape)

In [None]:
train.head()

## 3. EDA <a class="anchor" id="3"></a>
[Back to Table of Contents](#0.1)

### Task description

The objective of the competition is **to predict the time** a car needs to complete the testing phase.<br>
The dataset represents various permutations of Mercedes-Benz vehicle features.<br>
Reducing the time spent on testing could also help reduce carbon emissions without compromising Daimler's standards.<br>

The data set contains an anonymized set of variables, each representing a custom feature in a Mercedes vehicle.<br>
For example, a variable could be 4WD, an added air suspension, or a head-up display.

***y*** is the variable to be predicted: the time (in seconds) it took the car to pass testing.

Variables containing letters - **categorical**.<br>
Variables with 0/1 - **binary**.

### 3.1 Target variable analysis

In [None]:
plt.figure(figsize=(15,5))
plt.subplot(121)
# histplot replaces the deprecated distplot; kde=True keeps the density curve
sns.histplot(train.y.values, bins=50, color=colors[1], kde=True)
plt.title('Distribution of the target variable (y)\n',fontsize=15)
plt.xlabel('Seconds');
plt.ylabel('Frequency');

plt.subplot(122)
sns.boxplot(x=train.y.values, color=colors[3])
plt.title('Distribution of the target variable (y)\n',fontsize=15)
plt.xlabel('Seconds'); 

In [None]:
train.y.describe()

The target variable (y) has a roughly bell-shaped distribution, with most values between approximately 72 and 140 seconds.<br>
The first and third quartiles are approximately 91 and 109 seconds, with a median of about 100 s.<br>
Also note that there are outliers above 140 s, which we can remove from the training sample, since these values would add noise to the model.
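
The outlier removal mentioned above can be sketched as follows. This is a minimal illustration on made-up data, not the notebook's actual preprocessing; the `toy` frame and the `Y_CUTOFF` name are assumptions, and the 140 s cutoff is simply read off the boxplot.

```python
import pandas as pd

# Toy stand-in for the training set; the values are invented for illustration.
toy = pd.DataFrame({
    "ID": [0, 1, 2, 3, 4],
    "y": [88.5, 99.2, 110.7, 150.3, 265.3],
})

# Assumed cutoff in seconds, taken from the boxplot reading above.
Y_CUTOFF = 140

# Keep only the rows whose target is below the cutoff.
toy_clean = toy[toy["y"] < Y_CUTOFF].reset_index(drop=True)
print(toy_clean.shape)
```

Whether a hard cutoff is appropriate depends on the model; clipping the target instead of dropping rows is another option worth comparing.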

### 3.2 Data types

In [None]:
train.dtypes.value_counts()

In [None]:
train.dtypes[train.dtypes=='float']

In [None]:
train.dtypes[train.dtypes=='object']

In [None]:
obj = train.dtypes[train.dtypes=='object'].index
for i in obj:
    print(i, train[i].unique())

### 3.3 Missing values

In [None]:
train.isna().sum()[train.isna().sum()>0]

### 3.4 Categorical variables

In [None]:
fig,ax = plt.subplots(len(obj), figsize=(20,70))

for i, col in enumerate(obj):
    sns.boxplot(x=col, y='y', data=train, ax=ax[i])

#### What we see from the charts:

1) Since the goal is to reduce testing time, the most interesting category values are those with the lowest times: "az" and "bc" (X0), "y" (X1), "n" (X2), "x" and "h" (X5) -> **hypothesis**: these values may affect y;

2) Variables X3, X5, X6 and X8 have similar distributions across their values, with no notable differences in medians or quartiles within each feature;

3) X0 and X2 show the greatest variation across their values, which may indicate that these features are more useful.

### 3.5 Numeric variables

In [None]:
# Integer columns, skipping the first one (ID)
num = train.dtypes[train.dtypes=='int'].index[1:]

We have a set of numeric variables that take only the values 1 or 0, so there is no need to analyze their distributions in detail.<br>
In this case, we are interested in whether the values within each variable change at all.<br>
To check this, we compute the variance of each variable with the var() function and select only those where the variance is zero (that is, the variable is always 0 or always 1 across the entire dataset).

In [None]:
nan_num = []
for i in num:
    if (train[i].var()==0):
        print(i, train[i].var())
        nan_num.append(i)

We found several such variables. We can remove them from the analysis, since a constant column carries no information about the target; dropping them also reduces the amount of data the algorithm has to process.
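
As an alternative to the variance check, constant columns can also be found by counting distinct values. The sketch below uses a made-up frame; the column names and the `constant_cols` variable are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy binary features; the names and values are invented for illustration.
toy = pd.DataFrame({
    "X10": [0, 1, 0, 1],
    "X11": [0, 0, 0, 0],  # constant: always 0
    "X12": [1, 1, 1, 1],  # constant: always 1
})

# A column with exactly one distinct value has zero variance.
n_unique = toy.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()

toy_reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```

Unlike `var()`, `nunique()` also works on non-numeric columns, which may be convenient if the check is run before encoding.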

In [None]:
train = train.drop(columns=nan_num)

### 3.6 Correlation analysis

To run a correlation analysis on the categorical variables, we first need to convert them to numbers using LabelEncoder().<br>
If we converted the values to a binary (one-hot) form instead, we would not be able to track the relationship for a variable as a whole. We must also take the test set into account when fitting the encoder, since its values will be used when predicting the target.
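
The reason for fitting the encoder on both sets can be sketched on toy data. The frames `toy_train` and `toy_test` are made up for illustration; the point is only that a category seen solely in the test set must be known to the encoder.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data: the category "c" appears only in the test set.
toy_train = pd.DataFrame({"X0": ["a", "b", "a"]})
toy_test = pd.DataFrame({"X0": ["b", "c"]})

# Fit on the union of train and test values so every category gets a code.
le = LabelEncoder()
le.fit(pd.concat([toy_train["X0"], toy_test["X0"]]))

train_codes = le.transform(toy_train["X0"])
# If the encoder had been fitted on toy_train only, this call would raise
# a ValueError because "c" would be an unseen label.
test_codes = le.transform(toy_test["X0"])
print(list(le.classes_))
```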

In [None]:
for i in obj:
    le = LabelEncoder()
    # Fit on train and test values together so that every category is known
    le.fit(list(train[i].values) + list(test[i].values))
    train[i] = le.transform(list(train[i].values))

In [None]:
train[obj].head()

In [None]:
corr = train[train.columns[1:10]].corr()

fig,ax = plt.subplots(figsize=(7,6))
sns.heatmap(corr, vmax=.7, square=True,annot=True);

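Among the categorical variables, we did not find a strong linear relationship with the target y. Keep in mind that label encoding assigns arbitrary integer codes, so this correlation is only a rough check.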

In [None]:
threshold = 1

corr_all = train.drop(columns=obj).corr()
corr_all.loc[:,:] =  np.tril(corr_all, k=-1) 

In [None]:
already_in = set()
result = []
for col in corr_all:
    perfect_corr = corr_all[col][corr_all[col] == threshold ].index.tolist()
    if perfect_corr and col not in already_in:
        already_in.update(set(perfect_corr))
        perfect_corr.append(col)
        result.append(perfect_corr)

In [None]:
result

When analyzing the numeric variables, we found that some of them are perfectly correlated with others.<br>
To avoid multicollinearity, we can remove variables with a correlation of 1 (keeping one from each group), or use regularization so that the algorithm handles this automatically.<br>
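
Keeping one variable from each perfectly correlated group can be sketched as follows. The toy frame and the `groups` list are hypothetical; `groups` only mimics the list-of-lists shape produced by the grouping loop above.

```python
import pandas as pd

# Toy binary features; B and D are exact copies of A.
toy = pd.DataFrame({
    "A": [0, 1, 0, 1],
    "B": [0, 1, 0, 1],
    "C": [1, 1, 0, 0],
    "D": [0, 1, 0, 1],
})

# Hypothetical groups of perfectly correlated columns (list of lists).
groups = [["B", "D", "A"]]

# Keep the first column of each group and drop the rest.
to_drop = [col for group in groups for col in group[1:]]
toy_reduced = toy.drop(columns=to_drop)
print(list(toy_reduced.columns))
```

Which member of a group is kept is arbitrary here; for identical columns the choice does not change the information available to the model.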

How else can we remove such variables without computing correlations? It's simple: **we remove duplicate columns**.

In [None]:
train.T.drop_duplicates().T
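
Transposing twice may change column dtypes when the frame mixes integer and float columns. A variant that keeps the original dtypes is sketched below on toy data; the frame and variable names are illustrative assumptions.

```python
import pandas as pd

# Toy binary features; B is an exact copy of A.
toy = pd.DataFrame({
    "A": [0, 1, 0, 1],
    "B": [0, 1, 0, 1],
    "C": [1, 1, 0, 0],
})

# duplicated() on the transpose flags every column equal to an earlier one.
is_dup = toy.T.duplicated()

# Select the non-duplicated columns from the original frame,
# so the column dtypes are left untouched.
toy_unique = toy.loc[:, ~is_dup.values]
print(list(toy_unique.columns))
```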