# Mercedes-Benz Greener Manufacturing

### DESCRIPTION

#### Reduce the time a Mercedes-Benz spends on the test bench.

##### Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Mercedes-Benz is the leader in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before they hit the road, the company’s engineers have developed a robust testing system. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Mercedes-Benz’s production lines. However, optimizing the speed of their testing system for many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

The following actions should be performed:

- If the variance of any column is zero, remove that column.
- Check for null and unique values in the train and test sets.
- Apply a label encoder.
- Perform dimensionality reduction.
- Predict the test_df values using XGBoost.
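The label-encoding and dimensionality-reduction steps from the list above can be sketched as follows. This is a minimal illustration, assuming scikit-learn's `LabelEncoder` and `PCA`; the frames `toy_train` and `toy_test`, their values, and the choice of two components are made up for the example and are not taken from the competition data.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-ins for the real train/test frames
toy_train = pd.DataFrame({
    "X0": ["az", "bc", "az", "k"],
    "X10": [0, 1, 1, 0],
    "X12": [1, 0, 1, 1],
})
toy_test = pd.DataFrame({
    "X0": ["bc", "k", "t", "az"],
    "X10": [1, 0, 0, 1],
    "X12": [0, 0, 1, 1],
})
categorical_cols = ["X0"]

for col in categorical_cols:
    # Fit on train and test together so a label seen only in test ("t") does not raise an error
    encoder = LabelEncoder().fit(pd.concat([toy_train[col], toy_test[col]]))
    toy_train[col] = encoder.transform(toy_train[col])
    toy_test[col] = encoder.transform(toy_test[col])

# Project the encoded features onto 2 principal components (an arbitrary choice for this sketch)
pca = PCA(n_components=2, random_state=0)
train_reduced = pca.fit_transform(toy_train)
test_reduced = pca.transform(toy_test)
print(train_reduced.shape, test_reduced.shape)
```

Label encoding imposes an arbitrary order on the categories. Tree-based models such as XGBoost usually tolerate this, but it is a design choice worth revisiting for linear models.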

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")


In [None]:
colors = ['#001c57','#50248f','#a6a6a6','#38d1ff','#ca7beb','#c8aef8','#9154f8','#cef3f5']
sns.palplot(sns.color_palette(colors))

### Load the training and testing data and check the number of rows and columns in each

In [None]:
#train = pd.read_csv('../kaggle/input/mercedesbenz-greener-manufacturing/train.csv')

#test = pd.read_csv('../kaggle/input/mercedesbenz-greener-manufacturing/test.csv')

train = pd.read_csv("../input/mercedes-benz-greener-manufacturing/train.csv.zip")
test = pd.read_csv("../input/mercedes-benz-greener-manufacturing/test.csv.zip")

print('Shape of the training data: ',train.shape)
print('Shape of the testing data: ',test.shape)

In [None]:
train.head()

In [None]:
train['y'].isnull().sum()

Look at the distribution of Target Values 

In [None]:
plt.figure(figsize=(15,5))
plt.subplot(121)
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(train.y.values, bins=20, kde=True, color=colors[4])
plt.title('Target Value Distribution \n',fontsize=15)
plt.xlabel('Target Value in Seconds'); plt.ylabel('Occurrences');

plt.subplot(122)
sns.boxplot(x=train.y.values, color=colors[0])
plt.title('Target Value Distribution \n',fontsize=15)
plt.xlabel('Target Value in Seconds');

The plots above show a roughly bell-shaped distribution centred around 100 seconds. There is a single outlier at 265 seconds; every other value is below 180.
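A minimal sketch of how such an outlier could be isolated, using the 180-second cut-off mentioned above. The series `toy_y` is made-up data standing in for `train['y']`, and whether to drop the outlier at all is a modelling choice rather than something this notebook prescribes.

```python
import pandas as pd

# Hypothetical target values; only one lies above the 180-second cut-off
toy_y = pd.Series([88.5, 100.0, 95.2, 110.7, 265.1, 102.3])
threshold = 180

# Boolean masks split the series into outliers and the remaining values
outliers = toy_y[toy_y > threshold]
filtered = toy_y[toy_y <= threshold]

print("Outliers:", outliers.to_dict())
print("Remaining rows:", len(filtered))
```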

The fact that our ID is not equal to the row ID seems to suggest that the train and test sets were randomly sampled from the same dataset, which could have some special order to it, for example a time series. Let's take a look at how this target value changes over time in order to understand whether we're given time series data.
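The observation about IDs can be checked directly. The sketch below uses a made-up frame named `toy` (hypothetical values) in place of `train`: it tests whether the IDs equal the row positions, whether they are sorted, and which IDs are absent and so might belong to the test set.

```python
import pandas as pd

# Hypothetical frame: IDs are increasing but have gaps
toy = pd.DataFrame({"ID": [0, 6, 7, 9, 13], "y": [130.8, 88.5, 76.3, 80.6, 78.0]})

# Compare IDs with row positions element by element
ids_match_rows = (toy["ID"].to_numpy() == toy.index.to_numpy()).all()

# Sorted IDs hint that the original ordering was preserved when sampling
ids_sorted = toy["ID"].is_monotonic_increasing

# IDs missing from this frame (candidates for the other split)
missing_ids = sorted(set(range(toy["ID"].max() + 1)) - set(toy["ID"]))

print(ids_match_rows, ids_sorted, missing_ids)
```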

In [None]:
plt.figure(figsize=(15, 5))
plt.plot(train.y.values, color=colors[0])
plt.xlabel('Row ID')
plt.ylabel('Target value')
plt.title('Change in target value over the dataset')
plt.show()
plt.figure(figsize=(15, 5))
plt.plot(train.y.values[:100], color=colors[0])
plt.xlabel('Row ID')
plt.ylabel('Target value')
plt.title('Change in target value over the dataset (first 100 samples)')
plt.show()

## Feature Analysis

In [None]:
dtype_df = train.dtypes.reset_index()
dtype_df.columns = ["Count", "Column Type"]
dtype_df.groupby("Column Type").aggregate('count').reset_index()

Looking at the above, we have the following:

- 369 integer variables
- 8 object (likely string) variables
- 1 target variable

Let us look at the cardinality of our features.

In [None]:
train.dtypes[train.dtypes=='object']

In [None]:
obj_dtype = train.dtypes[train.dtypes=='object'].index
for i in obj_dtype:
    print(i, train[i].unique())
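The loop above prints every unique value. When only the counts matter, `nunique()` gives a more compact view. A minimal sketch on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical categorical columns
toy = pd.DataFrame({
    "X0": ["az", "bc", "k", "az"],
    "X4": ["d", "d", "d", "b"],
})

# Number of distinct values per column, highest cardinality first
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)
```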

In [None]:
fig,ax = plt.subplots(len(obj_dtype), figsize=(18,80))

for i, col in enumerate(obj_dtype):
    sns.boxplot(x=col, y='y', data=train, ax=ax[i])
    
# Bar chart of value counts for each categorical feature
for c in obj_dtype:
    value_counts = train[c].value_counts()
    fig, ax = plt.subplots(figsize=(10, 5))
    plt.title('Categorical feature {} - Cardinality {}'.format(c, train[c].nunique()))
    plt.xlabel('Feature value')
    plt.ylabel('Occurrences')
    plt.bar(range(len(value_counts)), value_counts.values, color=colors[1])
    ax.set_xticks(range(len(value_counts)))
    ax.set_xticklabels(value_counts.index, rotation='vertical')
    plt.show()

### Inference from the graphs:
##### 1) Since the goal is to reduce testing time, the feature values associated with the lowest testing times are az and bc (X0), y (X1), n (X2), and x and h (X5). This is a hypothesis read from the box plots of y.

##### 2) Variables X3, X5, X6 and X8 have similar distributions: within each feature, the means and quartiles of y differ little between values

##### 3) X0 and X2 show the greatest variation between values, which may indicate that these features are more useful
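The inferences above are read off the box plots by eye. A grouped summary is one way to back them with numbers. The sketch below uses a made-up frame (`toy` and its values are hypothetical) and ranks the values of one feature by median testing time:

```python
import pandas as pd

# Hypothetical sample: one categorical feature and the target
toy = pd.DataFrame({
    "X0": ["az", "az", "bc", "k", "k", "k"],
    "y": [78.0, 80.0, 76.0, 100.0, 110.0, 105.0],
})

# Count and median of y per feature value, lowest median first
summary = toy.groupby("X0")["y"].agg(["count", "median"]).sort_values("median")
print(summary)
```

Values with very few rows deserve caution: a low median computed from a handful of cars is weak evidence.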

In [None]:
# Integer (binary) feature columns, excluding the ID column
num = train.select_dtypes(include=np.integer).columns.drop('ID')

In [None]:
nan_num = []
for i in num:
    if (train[i].var()==0):
        print(i, train[i].var())
        nan_num.append(i)

The numeric variables are binary (each value is 0 or 1), so there is no need to analyse their distributions in detail. What matters here is whether a variable changes at all across the dataset. To check this, the cell above computes each variable's variance with the var() function and collects only those whose variance is zero, that is, variables that are always 0 or always 1 over the entire training set.
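The same check can be written without an explicit loop and followed by the actual removal. A minimal sketch on a made-up frame (`toy` and its columns are hypothetical):

```python
import pandas as pd

# Hypothetical binary features; X11 and X12 never change
toy = pd.DataFrame({
    "X10": [0, 1, 0, 1],
    "X11": [0, 0, 0, 0],
    "X12": [1, 1, 1, 1],
    "X13": [1, 0, 0, 0],
})

# Variance of every column at once, then keep the names where it is zero
variances = toy.var()
zero_var_cols = variances[variances == 0].index.tolist()

# Drop the constant columns; the same list should be dropped from the test set
reduced = toy.drop(columns=zero_var_cols)
print(zero_var_cols, list(reduced.columns))
```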

## XGBoost Starter

In [None]:
# Keep the original column order so that runs are reproducible
usable_columns = [c for c in train.columns if c not in ('ID', 'y')]
y_train = train['y'].values
id_test = test['ID'].values
x_train = train[usable_columns].copy()  # copy so that later assignments do not touch train
x_test = test[usable_columns].copy()
for column in usable_columns:
    cardinality = x_train[column].nunique()
    if cardinality == 1:
        # A column with only one value is useless, so we drop it from both sets
        x_train = x_train.drop(column, axis=1)
        x_test = x_test.drop(column, axis=1)
    elif cardinality > 2:  # Column is categorical
        # Encode each string as the sum of its character codes
        # (crude: different strings such as 'ab' and 'ba' map to the same number)
        mapper = lambda x: sum(ord(ch) for ch in x)
        x_train[column] = x_train[column].apply(mapper)
        x_test[column] = x_test[column].apply(mapper)
x_train.head()

In [None]:
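import xgboost as xgb
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hold out 20% of the training data for validation
x_train, x_valid, y_train, y_valid = train_test_split(x_train, y_train, test_size=0.2, random_state=10)
d_train = xgb.DMatrix(x_train, label=y_train)
d_valid = xgb.DMatrix(x_valid, label=y_valid)
d_test = xgb.DMatrix(x_test)

params = {}
params['objective'] = 'reg:squarederror'  # current name of the deprecated 'reg:linear'
params['eta'] = 0.02
params['max_depth'] = 4

def xgb_r2_score(preds, dtrain):
    """Custom evaluation metric: R^2 between the labels and the predictions."""
    labels = dtrain.get_label()
    return 'r2', r2_score(labels, preds)

watchlist = [(d_train, 'train'), (d_valid, 'valid')]
# Stop when the validation R^2 has not improved for 50 rounds.
# custom_metric needs XGBoost >= 1.6; older versions take the same function as feval.
clf = xgb.train(params, d_train, num_boost_round=1000, evals=watchlist,
                early_stopping_rounds=50, custom_metric=xgb_r2_score,
                maximize=True, verbose_eval=10)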
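The custom metric above reports R², which is why training is run with `maximize=True`. As a reminder of what that number measures, the sketch below computes R² by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with scikit-learn's `r2_score`. The `labels` and `preds` arrays are made-up values.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
labels = np.array([100.0, 90.0, 110.0, 95.0])
preds = np.array([98.0, 92.0, 105.0, 97.0])

# Residual sum of squares: error left after the model's predictions
ss_res = np.sum((labels - preds) ** 2)
# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((labels - labels.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(labels, preds))
```

A model that always predicts the mean scores 0, a perfect model scores 1, and a model worse than the mean scores below 0.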

In [None]:
p_test = clf.predict(d_test)

sub = pd.DataFrame()
sub['ID'] = id_test
sub['y'] = p_test
sub.to_csv('xgb_results.csv',index=False)

In [None]:
sub.head()