# The Problem
Reduce the time a Mercedes-Benz spends on the test bench.

Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before they hit the road, the company’s engineers have developed a robust testing system. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Mercedes-Benz’s production lines. However, optimizing the speed of their testing system for many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. You will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

The following actions should be performed:

1. If for any column(s), the variance is equal to zero, then you need to remove those variable(s).
2. Check for null and unique values for test and train sets.
3. Apply label encoder.
4. Perform dimensionality reduction.
5. Predict your test_df values using XGBoost.



## A) Reading and studying the dataset
We read the training dataset and study it to decide which steps are needed to carry out the first three tasks from the list above:
1. If for any column(s), the variance is equal to zero, then you need to remove those variable(s).
2. Check for null and unique values for test and train sets.
3. Apply label encoder.


In [4]:
# Clear all variables and import Numpy and Pandas libraries
%reset -f
import numpy as np
import pandas as pd
# Upload/read both TRAIN and TEST data with pandas
DFTRAIN = pd.read_csv('train.csv')
DFTEST= pd.read_csv('test.csv')

In [5]:
# Checking for Missing values in TEST and TRAIN dataframes (Item 2 above)
def check_missing_values(df, name):
    if df.isnull().any().any():
        print("There are missing values in the", name, "dataframe")
    else:
        print("There are no missing values in the", name, "dataframe")
check_missing_values(DFTRAIN, "TRAIN")
check_missing_values(DFTEST, "TEST")

There are no missing values in the TRAIN dataframe
There are no missing values in the TEST dataframe
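
Item 2 also asks for the unique values in both sets. A minimal sketch of such a check is given here, assuming small toy frames in place of `DFTRAIN` and `DFTEST`; the helper name `summarize_uniques` and the toy values are illustrative and not part of the original notebook.

```python
import pandas as pd

def summarize_uniques(train: pd.DataFrame, test: pd.DataFrame, columns):
    """Per column: number of unique values in each set and values seen only in test."""
    rows = []
    for col in columns:
        train_vals = set(train[col].unique())
        test_vals = set(test[col].unique())
        rows.append({
            "column": col,
            "n_unique_train": len(train_vals),
            "n_unique_test": len(test_vals),
            "only_in_test": sorted(test_vals - train_vals),
        })
    return pd.DataFrame(rows).set_index("column")

# Toy stand-ins for DFTRAIN and DFTEST (illustrative values only)
train = pd.DataFrame({"X0": ["k", "az", "k"], "X10": [0, 0, 0]})
test = pd.DataFrame({"X0": ["k", "bb", "az"], "X10": [0, 1, 0]})
summary = summarize_uniques(train, test, ["X0", "X10"])
print(summary)
```

On the real data, the same call with the eight categorical column names would list the test-only categories that later need special handling.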


In [6]:
#using "info" to learn more about the train and test data
print("TRAIN SET")
DFTRAIN.info()
print("\nTEST SET")
DFTEST.info()

TRAIN SET
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 378 entries, ID to X385
dtypes: float64(1), int64(369), object(8)
memory usage: 12.1+ MB

TEST SET
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 377 entries, ID to X385
dtypes: int64(369), object(8)
memory usage: 12.1+ MB


In [7]:
# From the result above we see that we have 4209 rows and 378 columns
# Let us see the data and headers to understand the frame
print("The shape of the frame: ", DFTRAIN.shape)
print("We have thus 4209 samples, 1 target and 376 features (ID column is not used)")
print("All features are labeled as 'X...'. The target is labeled as 'y'")
DFTRAIN.head()

The shape of the frame:  (4209, 378)
We have thus 4209 samples, 1 target and 376 features (ID column is not used)
All features are labeled as 'X...'. The target is labeled as 'y'


Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [8]:
# Let us see how many samples have exactly the same feature set. 
# We know from the table above that the columns that contain the features are 2:377
# The duplicated number of samples that have the same feature set is thus 298
b = DFTRAIN[DFTRAIN.duplicated(subset=DFTRAIN.columns[2:], keep='first')]
print("Number of items that have clones", b.shape[0])
# Now look at the kind of data in features
cols = [c for c in DFTRAIN.columns if 'X' in c]
print(f"Number of features: {len(cols)}")
print('Number of Feature types:')
print(DFTRAIN[cols].dtypes.value_counts())

Number of items that have clones 298
Number of features: 376
Number of Feature types:
int64     368
object      8
dtype: int64


## B) Transforming the dataset to better suit the model
We now remove from the train dataset the features (columns) that contain a single value for all samples, i.e. have zero variance (item 1 from above).
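
A compact way to express this step with pandas is sketched below. It assumes a toy frame instead of the real data, and `drop_constant_columns` is a hypothetical helper name. Note that `nunique()` ignores missing values by default, which is harmless here because neither set contains any.

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame):
    """Return a copy without single-valued columns, plus the names that were dropped."""
    n_unique = df.nunique()
    constant = n_unique[n_unique <= 1].index.tolist()
    return df.drop(columns=constant), constant

# Toy frame: X11 is constant, X12 is not (illustrative values only)
toy = pd.DataFrame({"X0": ["k", "az", "k"], "X11": [0, 0, 0], "X12": [0, 1, 0]})
reduced, removed = drop_constant_columns(toy)
print(removed)                # columns to drop from the test set as well
print(list(reduced.columns))
```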

Two samples (rows) with identical features may have different target measurements. We take the following steps to remove samples with repeated input:
1. Take the mean value of the target "y" over all samples (rows) with identical input.
2. Eliminate all but one of the samples (rows) having the same feature input.
3. Replace the target value "y" of the remaining clone with the mean value from step 1.
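
The three steps above can also be sketched with `groupby`, as an alternative to the explicit loops used in the next cell. This is only an illustration on a toy frame with made-up values, not the notebook's own implementation.

```python
import pandas as pd

# Toy frame: rows with ID 0 and 7 share the same features (illustrative values only)
toy = pd.DataFrame({
    "ID": [0, 6, 7, 9],
    "y": [100.0, 90.0, 110.0, 80.0],
    "X0": ["k", "az", "k", "az"],
    "X10": [1, 0, 1, 1],
})
feature_cols = [c for c in toy.columns if c.startswith("X")]

# Step 1: mean target per group of identical feature rows, written back to every row
toy["y"] = toy.groupby(feature_cols)["y"].transform("mean")
# Steps 2 and 3: keep the first row of each group; it already carries the group mean
deduped = toy.drop_duplicates(subset=feature_cols, keep="first").reset_index(drop=True)
print(deduped)
```

With 364 feature columns the grouping key is wide, so it may be worth timing this against the loop-based version before adopting it.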

In [9]:
# A) Remove columns that contain a single value for all samples from a copy of the train set.
import math
import time  # used to time the two alternatives below
X_DF = DFTRAIN.copy()
colsRmv = [] # Used later to remove the same columns from the test dataframe
for c in cols:
    if len(np.unique(X_DF[c])) == 1: # True if the column holds a single distinct value
        del X_DF[c] # same effect as X_DF.drop(c, axis="columns", inplace=True)
        colsRmv.append(c)
print("Number of removed features is", DFTRAIN.shape[1]-X_DF.shape[1])
# From the reduced feature set we remove samples having the identical features.
# By doing that we increase the accuracy of the predictions
b = X_DF[X_DF.duplicated(subset=X_DF.columns[2:], keep=False)].copy() # b is an auxiliary dataframe with all clones
X_DF = X_DF.drop_duplicates(subset=X_DF.columns[2:], keep='first').copy() # original index labels are kept
b.reset_index(drop=True, inplace=True) # Otherwise index may be larger than b.shape[0]
b1 = [] # A list condensing all the feature values of a sample in "b" to one string object
for i in range(b.shape[0]):
    a2 = b.iloc[i, 2:].tolist() # From series to list
    a1 = a2[0] # It is a character, e.g. "k"
    for j in range(1,len(a2)):
        if j < 8:
            a1 += a2[j] # e.g. v, at, a, d, u, j, o
        else:
            a1 += chr(65 +a2[j]) # "A" if a2[j]= 0 or "B" if 1
    b1.append(a1) # b1 = [["kv ....B"], [ ...]]
# Modify Dataframe "b" to contain just the single condensed feature in order to make it easy to take y mean
b.insert(2, "NEW", b1) # Dataframe b has a new column 2 named "NEW". Old columns 2:end are right displaced.
b.drop(b.columns[3:], axis="columns", inplace=True) # Frame b is much simpler now

# ALTERNATIVE 1 - Does not change dataframe b
t = time.perf_counter()
std = []
d = []
for i in range(len(b1)):
    if  not(b1[i] == "0"):
        a1 = b1[i] #  The first element of a series of clones
        while True:
            try:
                d.append(b1.index(a1)) # find the first remaining index in b1 containing a1
                b1[d[-1]] = "0" # this (last) item in b1 will not be found again     
            except ValueError:
                break # Takes place when b1.index(a1) is exhausted
        a1 = b.iloc[d, 1].values.mean() # Average y of the identical items in b1
        if not(math.isnan(a1)):
            std.append(b.iloc[d, 1].values.std()) # Append the std of the family of repeats
            X_DF.loc[X_DF["ID"] == b.iloc[d[0], 0], X_DF.columns[1]] = a1 # Put the average y to the "first"
            d = []
print(f"Mean value of std for clones: {sum(std)/len(std):.2f} in {len(std)} rows")
#print(time.perf_counter() - t, 'sec. Alt 1 \n')

d = """ # ALTERNATIVE 2 ** AT LEAST TWICE AS SLOW - It changes b
t = time.perf_counter()
#b.reset_index(drop=True, inplace=True) # Otherwise index may be larger than b.shape[0]
# Find the number of repeats, take "y" average and remove all clones but the "first"
std = []
for i in range(b.shape[0]):
    if not(b.iloc[i, 2] == "NONE"):
        mask = b["NEW"] == b.iloc[i, 2]
        d = b[mask].loc[:,"y"].values
        X_DF.loc[X_DF["ID"] == b.iloc[i, 0], X_DF.columns[1]] = d.mean() # Put the average y to the "first"
        std.append(d.std()) # Append the std of the family of repeat
        b.iloc[b[mask].index, 2] = "NONE"
print(f"Mean value of std for clones: {sum(std)/len(std):.2f} {len(std)}")
print(time.perf_counter() - t, 'sec. Alt 2')"""

Number of removed features is 12
Mean value of std for clones: 4.13 in 217 rows


In [10]:
# From dataframe to arrays
y_DF = X_DF['y'].values # From a pandas series to a numpy array
# Make a new list with the labels of all Feature columns
cols = [c for c in X_DF.columns if 'X' in c]
# Using Shape to understand the shape of the data
X_DF = X_DF[cols].values # From Frame to a numpy array
print(X_DF.shape)
X_DF

(3911, 364)


array([['k', 'v', 'at', ..., 0, 0, 0],
       ['k', 't', 'av', ..., 0, 0, 0],
       ['az', 'w', 'n', ..., 0, 0, 0],
       ...,
       ['ak', 'v', 'r', ..., 0, 0, 0],
       ['al', 'r', 'e', ..., 0, 0, 0],
       ['z', 'r', 'ae', ..., 0, 0, 0]], dtype=object)

In [11]:
# Item 3 from list above: APPLYING AN ENCODER TO THE CATEGORICAL FEATURES
# To convert categorical features to features that can be used with estimators
# we use one-of-K encoding (also known as one-hot or dummy encoding) instead of a
# plain label encoder. This type of encoding transforms each categorical feature
# with n_categories possible values into n_categories binary features, with one
# of them 1 and all others 0.
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit(X_DF[:,0:8])
#print(enc.categories_)# The categories for each feature
#enc.transform([['k', 'v', 'at', 'a', 'd', 'u', 'j', 'o']]).toarray()
X1=enc.transform(X_DF[:,0:8]).toarray()
print(X1.shape,"one-hot encoded categorical features only")
X1 = np.append(X1[:,:], X_DF[:,8:], axis=1) # append the remaining binary features
print(X1.shape,"after appending the binary features")
print(y_DF.shape, "targets")
#print(X1[5])

(3911, 195) one-hot encoded categorical features only
(3911, 551) after appending the binary features
(3911,) targets
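
The task list asks for a label encoder, while the cell above uses one-hot encoding. To make the difference concrete, here is a small sketch on toy values in the style of column X0. scikit-learn documents `LabelEncoder` as intended for target labels, so using it on a feature column here is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

x0 = np.array(["k", "az", "k", "t"])  # toy values (illustrative only)

# Label encoding: one integer per category, categories sorted alphabetically
labels = LabelEncoder().fit_transform(x0)
print(labels)          # a single integer column

# One-hot encoding: one binary column per category
one_hot = OneHotEncoder().fit_transform(x0.reshape(-1, 1)).toarray()
print(one_hot.shape)   # (n_samples, n_categories)
```

Label encoding keeps the feature count at 8 but imposes an arbitrary order on the categories; one-hot encoding avoids that order at the cost of many more columns.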


## C) Forming the model
The following steps will be taken.

1. Split the samples into a training and a validation set.
2. Reduce the dimensions with PCA.
3. Fit the XGBoost model.

In [12]:
# From the original features and targets we create a training and a validation set
from sklearn.model_selection import train_test_split
x_train, x_valid, y_train, y_valid = train_test_split(
        X1, y_DF, test_size=0.15, 
        random_state=0)

In [13]:
# Item 4 from list above: DIMENSIONALITY REDUCTION WITH PCA
from sklearn.decomposition import PCA
# The number of features, x_train.shape[1], is 551. 100 components explain 90.7% of the variance,
# 225 components explain 99% of the variance, and 325 components explain 99.95% of the variance.
n_comp = 325
# Trial and error showed n_comp=100 is sound
pca = PCA(n_components=n_comp, random_state=0)
X_pca = pca.fit_transform(x_train)
X_pcaV = pca.transform(x_valid)
print(f"pca explained variance ratio: {sum(pca.explained_variance_ratio_):.4f}\
 (with {n_comp} more significant components of {x_train.shape[1]})")

pca explained variance ratio: 0.9995 (with 325 more significant components of 551)
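
Instead of finding the number of components by hand, it can be derived from the cumulative explained variance. The sketch below assumes synthetic data in place of `x_train` and a hypothetical threshold `target`; it is one possible way to choose `n_comp`, not the procedure used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for x_train: 200 samples, 30 features driven by 5 latent factors
latent = rng.normal(size=(200, 5))
X = latent @ rng.normal(size=(5, 30)) + 0.01 * rng.normal(size=(200, 30))

pca_full = PCA().fit(X)  # keep all components to inspect the variance profile
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
target = 0.99            # hypothetical share of variance to retain
# Smallest number of components whose cumulative ratio reaches the target
n_comp = int(np.searchsorted(cumulative, target) + 1)
print(n_comp, cumulative[n_comp - 1])
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, which selects the component count for that variance share directly.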


In [14]:
# More than 280 boosting rounds cause overfitting. 280 rounds minimize the validation-set RMSE,
# 8.01182 (just a little more than the RMSE for the repeated items, 5.55792, shown at In[6]).
# If we do not remove the samples with identical features from the train set, the RMSE of the
# validation set becomes 11.14574. Below are the complete statistics for both cases.
# [279]	train-rmse:0.91933	valid-rmse:8.01166	train-r2:0.99479	valid-r2:0.54013
# [279]	train-rmse:2.18185	valid-rmse:11.14574	train-r2:0.96927	valid-r2:0.35478
import xgboost as xgb
from sklearn.metrics import r2_score
dt_x = xgb.DMatrix(X_pca, label=y_train)
dt_v = xgb.DMatrix(X_pcaV, label=y_valid)
params = {'objective': 'reg:squarederror', 'max_depth': 100,  'eta': 0.02}
def R2_score(yp, db):
    y = db.get_label()
    return 'r2', r2_score(y, yp) 
watchlist = [(dt_x, 'train'), (dt_v, 'valid')]
regr = xgb.train(params, dt_x, 280, watchlist, feval=R2_score, verbose_eval = 20)

[1]	train-rmse:97.05156	train-r2:-57.04910	valid-rmse:96.01337	valid-r2:-65.04721
[2]	train-rmse:95.13505	train-r2:-54.77910	valid-rmse:94.10549	valid-r2:-62.44848
[3]	train-rmse:93.25716	train-r2:-52.59880	valid-rmse:92.23646	valid-r2:-59.95319
[4]	train-rmse:91.41776	train-r2:-50.50522	valid-rmse:90.40443	valid-r2:-57.55590
[5]	train-rmse:89.61493	train-r2:-48.49383	valid-rmse:88.60884	valid-r2:-55.25297
[6]	train-rmse:87.84840	train-r2:-46.56179	valid-rmse:86.84888	valid-r2:-53.04054
[7]	train-rmse:86.11717	train-r2:-44.70577	valid-rmse:85.12682	valid-r2:-50.91871
[8]	train-rmse:84.42181	train-r2:-42.92379	valid-rmse:83.43864	valid-r2:-48.87994
[9]	train-rmse:82.76022	train-r2:-41.21179	valid-rmse:81.78461	valid-r2:-46.92191
[10]	train-rmse:81.13265	train-r2:-39.56778	valid-rmse:80.16218	valid-r2:-45.03948
[11]	train-rmse:79.53739	train-r2:-37.98825	valid-rmse:78.57205	valid-r2:-43.23106
[12]	train-rmse:77.97478	train-r2:-36.47131	valid-rmse:77.01699	valid-r2:-41.49759
[13]	train-rm

[107]	train-rmse:13.10119	train-r2:-0.05782	valid-rmse:13.95571	valid-r2:-0.39539
[108]	train-rmse:12.87368	train-r2:-0.02140	valid-rmse:13.77474	valid-r2:-0.35944
[109]	train-rmse:12.65047	train-r2:0.01371	valid-rmse:13.60384	valid-r2:-0.32591
[110]	train-rmse:12.43156	train-r2:0.04755	valid-rmse:13.43439	valid-r2:-0.29308
[111]	train-rmse:12.21624	train-r2:0.08026	valid-rmse:13.27191	valid-r2:-0.26200
[112]	train-rmse:12.00547	train-r2:0.11172	valid-rmse:13.11691	valid-r2:-0.23269
[113]	train-rmse:11.79861	train-r2:0.14207	valid-rmse:12.96102	valid-r2:-0.20357
[114]	train-rmse:11.59553	train-r2:0.17135	valid-rmse:12.81068	valid-r2:-0.17581
[115]	train-rmse:11.39664	train-r2:0.19953	valid-rmse:12.66266	valid-r2:-0.14879
[116]	train-rmse:11.19892	train-r2:0.22707	valid-rmse:12.52088	valid-r2:-0.12321
[117]	train-rmse:11.00680	train-r2:0.25336	valid-rmse:12.37877	valid-r2:-0.09786
[118]	train-rmse:10.81656	train-r2:0.27894	valid-rmse:12.24119	valid-r2:-0.07359
[119]	train-rmse:10.63047	

[217]	train-rmse:2.13672	train-r2:0.97186	valid-rmse:8.18862	valid-r2:0.51959
[218]	train-rmse:2.10408	train-r2:0.97272	valid-rmse:8.18498	valid-r2:0.52002
[219]	train-rmse:2.07224	train-r2:0.97354	valid-rmse:8.18082	valid-r2:0.52050
[221]	train-rmse:2.01001	train-r2:0.97510	valid-rmse:8.17572	valid-r2:0.52110
[222]	train-rmse:1.97961	train-r2:0.97585	valid-rmse:8.17219	valid-r2:0.52151
[223]	train-rmse:1.95001	train-r2:0.97657	valid-rmse:8.16993	valid-r2:0.52178
[224]	train-rmse:1.92055	train-r2:0.97727	valid-rmse:8.16575	valid-r2:0.52227
[225]	train-rmse:1.89142	train-r2:0.97795	valid-rmse:8.16320	valid-r2:0.52257
[226]	train-rmse:1.86290	train-r2:0.97861	valid-rmse:8.16093	valid-r2:0.52283
[227]	train-rmse:1.83491	train-r2:0.97925	valid-rmse:8.15919	valid-r2:0.52304
[228]	train-rmse:1.80729	train-r2:0.97987	valid-rmse:8.15700	valid-r2:0.52329
[229]	train-rmse:1.78044	train-r2:0.98046	valid-rmse:8.15333	valid-r2:0.52372
[230]	train-rmse:1.75386	train-r2:0.98104	valid-rmse:8.15047	val

## D) Predicting new test values
With the model trained, the remaining task is number 5, "Predict your test_df values using XGBoost". First we clean the data read from test.csv, since it contains feature values that do not occur in the training set.
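
As an alternative to removing test samples with unseen categories, scikit-learn's `OneHotEncoder` can be told to ignore them. The sketch below uses toy values and a separate encoder named `enc_ignore` (a hypothetical name); the notebook itself keeps the default encoder and drops the affected rows.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_cat = np.array([["k"], ["az"], ["t"]])  # categories seen during fitting
test_cat = np.array([["az"], ["bb"]])         # "bb" never appeared in training

# handle_unknown="ignore" encodes unseen categories as an all-zero row
enc_ignore = OneHotEncoder(handle_unknown="ignore").fit(train_cat)
encoded = enc_ignore.transform(test_cat).toarray()
print(encoded)
```

This keeps every test row, so a prediction is produced for all IDs, although rows with unseen categories carry less information for the model.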

In [15]:
# Remove samples in test.csv whose categorical values do not occur in train.csv
# We found the "elim" lists and the column indices by trial and error
elim = ['bb', 'an', 'ag', 'av', 'ae', 'p']
DFTEST = DFTEST.loc[~DFTEST[DFTEST.columns[1]].isin(elim)] #,  inplace=True)
#print(1, DFTEST.shape)
elim = ['u', 'ax', 'ab', 'ad', 'w', 'aj']
DFTEST = DFTEST.loc[~DFTEST[DFTEST.columns[3]].isin(elim)] #,  inplace=True)
#print(3, DFTEST.shape)
elim = ['t', 'a', 'z', 'b']
DFTEST = DFTEST.loc[~DFTEST[DFTEST.columns[6]].isin(elim)] #,  inplace=True)
print(6, DFTEST.shape)
# Now we make a copy in order to remove the same columns we removed from the training set
X_t = DFTEST.copy()
for c in colsRmv:
    X_t.drop(c, axis="columns", inplace=True)
X_t.drop("ID", axis="columns", inplace=True)
X_t = X_t.values # 2-D NumPy object array
#print(X_t.head())

6 (4185, 377)


In [16]:
# We use the copy with less columns to prepare the data to be fed to XGBoost
X2 = [X_t[i,0:8] for i in range(X_t.shape[0])]
# Transform X2 to a one-hot array
X2=enc.transform(X2).toarray()
#print(X2.shape)
X2 = np.append(X2[:,:], X_t[:,8:], axis=1)
# Reduce the features with the same PCA fitted on the training set
X2 = pca.transform(X2)
X2 = xgb.DMatrix(X2) # The XGBoost Booster expects a DMatrix as input
y = regr.predict(X2)
# The last step is to insert the modeled target values to the test dataframe
DFTEST.insert(1, "y", y)
DFTEST.head()

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
4,5,110.078117,w,s,as,c,d,y,i,m,...,1,0,0,0,0,0,0,0,0,0
5,8,91.983849,y,aa,ai,e,d,x,g,s,...,1,0,0,0,0,0,0,0,0,0
6,10,112.357079,x,b,ae,d,d,x,d,y,...,0,0,0,0,0,1,0,0,0,0
7,11,95.787804,f,s,ae,c,d,h,d,a,...,0,0,1,0,0,0,0,0,0,0
8,12,118.142006,ap,l,s,c,d,h,j,n,...,0,0,0,0,0,0,0,0,0,0


In [19]:
# Run this part if you want to re-run the "Predicting new test values" part
try:
    del DFTEST["y"] # or DFTEST.drop("y", inplace = True, axis=1)
except KeyError:
    pass
#DFTEST.head()
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
#                       T H E       E N D                                     #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 