# Mercedes-Benz Project
### by: Rupesh Ingle

### Project - Mercedes-Benz Greener Manufacturing

DESCRIPTION

Reduce the time a Mercedes-Benz spends on the test bench.

### Problem Statement Scenario:
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with the crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of every unique car configuration before they hit the road, Daimler’s engineers have developed a robust testing system. As one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on Daimler’s production lines. However, optimizing the speed of their testing system for many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench. Others will work with a dataset representing different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.

### Following actions should be performed:
* If for any column(s), the variance is equal to zero, then you need to remove those variable(s).
* Check for null and unique values for test and train sets
* Apply label encoder.
* Perform dimensionality reduction.
* Predict your test_df values using xgboost

In [149]:
import pandas as pd
import numpy as np

In [150]:
train = pd.read_csv("train.csv")  # ID column is dropped in Task-4 of Dimensionality reduction
test = pd.read_csv("test.csv")   # ID column is dropped in Task-4 of Dimensionality reduction
print(f"Shape of train is {train.shape}")
print(f"Shape of test is {test.shape}")

Shape of train is (4209, 378)
Shape of test is (4209, 377)


In [151]:
train.head()

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


In [152]:
X_train = train.drop('y', axis=1)
X_train.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,k,v,at,a,d,u,j,o,0,...,0,0,1,0,0,0,0,0,0,0
1,6,k,t,av,e,d,y,l,o,0,...,1,0,0,0,0,0,0,0,0,0
2,7,az,w,n,c,d,x,j,x,0,...,0,0,0,0,0,0,1,0,0,0
3,9,az,t,n,f,d,x,l,e,0,...,0,0,0,0,0,0,0,0,0,0
4,13,az,v,n,f,d,h,d,n,0,...,0,0,0,0,0,0,0,0,0,0


In [153]:
X_train.shape

(4209, 377)

In [154]:
y_train = train['y']
y_train.shape

(4209,)

In [155]:
y_train.head()

0    130.81
1     88.53
2     76.26
3     80.62
4     78.02
Name: y, dtype: float64

In [156]:
X_test = test.copy()
X_test.shape

(4209, 377)

### --------------------------------------Task-1 starts below---------------------------------------------------

## Task-1: If for any column(s), the variance is equal to zero, then you need to remove those variable(s).
### To do this, we check whether any column holds the same value in every row. df.nunique() gives the number of unique values in each column; any column whose count is 1 has zero variance and is dropped.
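As a minimal sketch of the idea, the snippet below finds constant columns in a small toy frame. The frame and its column names (`A` to `D`) are made up for illustration and are not part of the Mercedes-Benz data.

```python
import pandas as pd

# Toy frame standing in for X_train; names and values are hypothetical.
df = pd.DataFrame({
    "A": [0, 0, 0, 0],          # constant numeric column -> zero variance
    "B": [0, 1, 0, 1],
    "C": ["k", "k", "k", "k"],  # constant categorical column
    "D": ["a", "b", "a", "c"],
})

# nunique() counts distinct values per column (NaN is skipped by default).
n_unique = df.nunique()
constant_cols = n_unique[n_unique == 1].index.tolist()

# Drop the constant columns and keep everything else.
reduced = df.drop(columns=constant_cols)
print(constant_cols)
print(reduced.shape)
```

Using `nunique()` rather than `var()` has the advantage that it also works on the object (string) columns, where a numeric variance is not defined.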

In [157]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)

(4209, 377)
(4209,)
(4209, 377)


In [158]:
X_train.nunique()

ID      4209
X0        47
X1        27
X2        44
X3         7
        ... 
X380       2
X382       2
X383       2
X384       2
X385       2
Length: 377, dtype: int64

In [159]:
X_train.nunique().sort_values()

X290       1
X235       1
X11        1
X297       1
X347       1
        ... 
X1        27
X5        29
X2        44
X0        47
ID      4209
Length: 377, dtype: int64

In [160]:
for c in X_train.columns:
    if len(X_train[c].unique()) == 1:
        print(c)

X11
X93
X107
X233
X235
X268
X289
X290
X293
X297
X330
X347


In [161]:
var0_col=[]
for c in X_train.columns:
    if len(X_train[c].unique()) == 1:
        var0_col.append(c)
print(var0_col)
print(len(var0_col))

['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X347']
12


In [162]:
X_train.drop(var0_col, axis = 1,inplace =True)

In [163]:
X_train.shape

(4209, 365)

### Deleted the 12 zero-variance columns from the train set (X_train)

In [164]:
X_test.nunique()

ID      4209
X0        49
X1        27
X2        45
X3         7
        ... 
X380       2
X382       2
X383       2
X384       2
X385       2
Length: 377, dtype: int64

In [165]:
X_test.nunique().sort_values()

X258       1
X295       1
X296       1
X257       1
X369       1
        ... 
X1        27
X5        32
X2        45
X0        49
ID      4209
Length: 377, dtype: int64

In [166]:
var0_col1=[]
for c1 in X_test.columns:
    if len(X_test[c1].unique()) == 1:
        var0_col1.append(c1)
print(var0_col1)
print(len(var0_col1))

['X257', 'X258', 'X295', 'X296', 'X369']
5


In [167]:
X_test.drop(var0_col1, axis =1, inplace = True)
X_test.shape

(4209, 372)

### Deleted the 5 zero-variance columns from the test set (X_test)

In [168]:
print(var0_col)  # X_train columns with variance = 0
print(var0_col1)  # X_test columns with variance = 0

['X11', 'X93', 'X107', 'X233', 'X235', 'X268', 'X289', 'X290', 'X293', 'X297', 'X330', 'X347']
['X257', 'X258', 'X295', 'X296', 'X369']


### The two lists do not overlap, so we also drop from each set the zero-variance columns found in the other set. This keeps the columns of the train and test sets identical.

In [169]:
X_train.drop(var0_col1, axis =1, inplace = True)
X_test.drop(var0_col, axis =1, inplace = True)
print(X_train.shape)
print(X_test.shape)

(4209, 360)
(4209, 360)


### Ans-1: 12 columns in the train set and 5 columns in the test set (12 + 5 = 17 columns in total) had only 1 unique value, i.e. zero variance. These columns were deleted from both sets, leaving 377 - 17 = 360 columns each in the train and test sets.
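The same result can be reached in one pass by collecting the constant columns of both sets and dropping their union. The sketch below assumes two toy frames with made-up columns; the helper name `constant_columns` is hypothetical and does not appear in the notebook.

```python
import pandas as pd

# Toy stand-ins for X_train and X_test (hypothetical values).
train = pd.DataFrame({"X1": [0, 0, 0], "X2": [0, 1, 0],
                      "X3": [1, 0, 1], "X4": [1, 1, 0]})
test = pd.DataFrame({"X1": [0, 1, 0], "X2": [1, 0, 1],
                     "X3": [1, 1, 1], "X4": [0, 1, 1]})


def constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    counts = df.nunique()
    return counts[counts == 1].index.tolist()


# Union of both lists, kept in the original column order.
constant_anywhere = set(constant_columns(train)) | set(constant_columns(test))
to_drop = [c for c in train.columns if c in constant_anywhere]

train_reduced = train.drop(columns=to_drop)
test_reduced = test.drop(columns=to_drop)
print(to_drop)
```

Whether to also drop columns that are constant only in the test set is a judgment call; this sketch follows the notebook and drops both groups so that the two frames keep the same columns.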

### --------------------------------------Task-1 done above---------------------------------------------------
### --------------------------------------Task-2 starts below---------------------------------------------------

## Task-2: Check for null and unique values for test and train sets.
### 2A). First we will find null values in train & test set
### 2B). Then we will find unique values in train & test set

In [170]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 360 entries, ID to X385
dtypes: int64(352), object(8)
memory usage: 11.6+ MB


In [171]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4209 entries, 0 to 4208
Columns: 360 entries, ID to X385
dtypes: int64(352), object(8)
memory usage: 11.6+ MB


In [172]:
X_train.isna().sum()

ID      0
X0      0
X1      0
X2      0
X3      0
       ..
X380    0
X382    0
X383    0
X384    0
X385    0
Length: 360, dtype: int64

In [173]:
X_train.isna().sum().sum()

0

### No null values in train set i.e. X_train

In [174]:
X_test.isna().sum()

ID      0
X0      0
X1      0
X2      0
X3      0
       ..
X380    0
X382    0
X383    0
X384    0
X385    0
Length: 360, dtype: int64

In [175]:
X_test.isna().sum().sum()

0

### No null values in test set i.e. X_test
### Ans-2A: No null values are present in train & test set.
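A small helper can report the total number of missing cells for several frames at once, which makes it harder to inspect the same frame twice by accident. This is only a sketch: the function name `null_report` and the demo frames are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def null_report(frames):
    """Map each frame's name to its total count of missing cells."""
    return {name: int(df.isna().sum().sum()) for name, df in frames.items()}


# Demo frames with made-up values; the second one has two missing cells.
demo_train = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
demo_test = pd.DataFrame({"a": [1, np.nan, 3], "b": ["x", None, "z"]})

report = null_report({"train": demo_train, "test": demo_test})
print(report)
```

In the notebook this would be called with `{"train": X_train, "test": X_test}`.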

### Unique values of train & test set can be found column wise as shown below

In [177]:
# Finding names of X_train columns/features
X_train.columns

Index(['ID', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8', 'X10',
       ...
       'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384',
       'X385'],
      dtype='object', length=360)

In [180]:
X_train.nunique()

ID      4209
X0        47
X1        27
X2        44
X3         7
        ... 
X380       2
X382       2
X383       2
X384       2
X385       2
Length: 360, dtype: int64

In [184]:
# Now let's look at the unique values of the X0 and X3 columns as examples
print(X_train['X0'].unique())
len(X_train['X0'].unique())

['k' 'az' 't' 'al' 'o' 'w' 'j' 'h' 's' 'n' 'ay' 'f' 'x' 'y' 'aj' 'ak' 'am'
 'z' 'q' 'at' 'ap' 'v' 'af' 'a' 'e' 'ai' 'd' 'aq' 'c' 'aa' 'ba' 'as' 'i'
 'r' 'b' 'ax' 'bc' 'u' 'ad' 'au' 'm' 'l' 'aw' 'ao' 'ac' 'g' 'ab']


47

In [186]:
print(X_train['X3'].unique())
len(X_train['X3'].unique())

['a' 'e' 'c' 'f' 'd' 'b' 'g']


7

### The cells above show the unique values of two example columns of the train set (X_train)

In [187]:
# Finding names of X_test columns/features
X_test.columns

Index(['ID', 'X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8', 'X10',
       ...
       'X375', 'X376', 'X377', 'X378', 'X379', 'X380', 'X382', 'X383', 'X384',
       'X385'],
      dtype='object', length=360)

In [188]:
X_test.nunique()

ID      4209
X0        49
X1        27
X2        45
X3         7
        ... 
X380       2
X382       2
X383       2
X384       2
X385       2
Length: 360, dtype: int64

In [189]:
# Now let's look at the unique values of the X1 and X2 columns as examples
print(X_test['X1'].unique())
len(X_test['X1'].unique())

['v' 'b' 'l' 's' 'aa' 'r' 'a' 'i' 'p' 'c' 'o' 'm' 'z' 'e' 'h' 'w' 'g' 'k'
 'y' 't' 'u' 'd' 'j' 'q' 'n' 'f' 'ab']


27

In [190]:
print(X_test['X2'].unique())
len(X_test['X2'].unique())

['n' 'ai' 'as' 'ae' 's' 'b' 'e' 'ak' 'm' 'a' 'aq' 'ag' 'r' 'k' 'aj' 'ay'
 'ao' 'an' 'ac' 'af' 'ax' 'h' 'i' 'f' 'ap' 'p' 'au' 't' 'z' 'y' 'aw' 'd'
 'at' 'g' 'am' 'j' 'x' 'ab' 'w' 'q' 'ah' 'ad' 'al' 'av' 'u']


45

### The cells above show the unique values of two example columns of the test set (X_test)

### To see the unique values of all columns, we can loop over the columns as follows:

In [192]:
# To see unique values of train set (X_train):
for c in X_train.columns:
    print(f"Column {c} unique values are {X_train[c].unique()}")

Column ID unique values are [   0    6    7 ... 8412 8415 8417]
Column X0 unique values are ['k' 'az' 't' 'al' 'o' 'w' 'j' 'h' 's' 'n' 'ay' 'f' 'x' 'y' 'aj' 'ak' 'am'
 'z' 'q' 'at' 'ap' 'v' 'af' 'a' 'e' 'ai' 'd' 'aq' 'c' 'aa' 'ba' 'as' 'i'
 'r' 'b' 'ax' 'bc' 'u' 'ad' 'au' 'm' 'l' 'aw' 'ao' 'ac' 'g' 'ab']
Column X1 unique values are ['v' 't' 'w' 'b' 'r' 'l' 's' 'aa' 'c' 'a' 'e' 'h' 'z' 'j' 'o' 'u' 'p' 'n'
 'i' 'y' 'd' 'f' 'm' 'k' 'g' 'q' 'ab']
Column X2 unique values are ['at' 'av' 'n' 'e' 'as' 'aq' 'r' 'ai' 'ak' 'm' 'a' 'k' 'ae' 's' 'f' 'd'
 'ag' 'ay' 'ac' 'ap' 'g' 'i' 'aw' 'y' 'b' 'ao' 'al' 'h' 'x' 'au' 't' 'an'
 'z' 'ah' 'p' 'am' 'j' 'q' 'af' 'l' 'aa' 'c' 'o' 'ar']
Column X3 unique values are ['a' 'e' 'c' 'f' 'd' 'b' 'g']
Column X4 unique values are ['d' 'b' 'c' 'a']
Column X5 unique values are ['u' 'y' 'x' 'h' 'g' 'f' 'j' 'i' 'd' 'c' 'af' 'ag' 'ab' 'ac' 'ad' 'ae'
 'ah' 'l' 'k' 'n' 'm' 'p' 'q' 's' 'r' 'v' 'w' 'o' 'aa']
Column X6 unique values are ['j' 'l' 'd' 'h' 'i' 'a' 'g' 'c' 'k' 

In [193]:
# To see unique values of test set (X_test):
for c1 in X_test.columns:
    print(f"Column {c1} unique values are {X_test[c1].unique()}")

Column ID unique values are [   1    2    3 ... 8413 8414 8416]
Column X0 unique values are ['az' 't' 'w' 'y' 'x' 'f' 'ap' 'o' 'ay' 'al' 'h' 'z' 'aj' 'd' 'v' 'ak'
 'ba' 'n' 'j' 's' 'af' 'ax' 'at' 'aq' 'av' 'm' 'k' 'a' 'e' 'ai' 'i' 'ag'
 'b' 'am' 'aw' 'as' 'r' 'ao' 'u' 'l' 'c' 'ad' 'au' 'bc' 'g' 'an' 'ae' 'p'
 'bb']
Column X1 unique values are ['v' 'b' 'l' 's' 'aa' 'r' 'a' 'i' 'p' 'c' 'o' 'm' 'z' 'e' 'h' 'w' 'g' 'k'
 'y' 't' 'u' 'd' 'j' 'q' 'n' 'f' 'ab']
Column X2 unique values are ['n' 'ai' 'as' 'ae' 's' 'b' 'e' 'ak' 'm' 'a' 'aq' 'ag' 'r' 'k' 'aj' 'ay'
 'ao' 'an' 'ac' 'af' 'ax' 'h' 'i' 'f' 'ap' 'p' 'au' 't' 'z' 'y' 'aw' 'd'
 'at' 'g' 'am' 'j' 'x' 'ab' 'w' 'q' 'ah' 'ad' 'al' 'av' 'u']
Column X3 unique values are ['f' 'a' 'c' 'e' 'd' 'g' 'b']
Column X4 unique values are ['d' 'b' 'a' 'c']
Column X5 unique values are ['t' 'b' 'a' 'z' 'y' 'x' 'h' 'g' 'f' 'j' 'i' 'd' 'c' 'af' 'ag' 'ab' 'ac'
 'ad' 'ae' 'ah' 'l' 'k' 'n' 'm' 'p' 'q' 's' 'r' 'v' 'w' 'o' 'aa']
Column X6 unique values are ['a' 'g'

### Ans-2B: The unique values of the train and test sets are shown above
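The outputs above show that the categorical columns do not always hold the same categories in both sets (X0 has 47 distinct values in the train set and 49 in the test set). A set difference makes such mismatches explicit. The sketch below uses two short, made-up Series rather than the real columns.

```python
import pandas as pd

# Made-up stand-ins for X_train['X0'] and X_test['X0'].
train_col = pd.Series(["k", "az", "t", "al"])
test_col = pd.Series(["az", "t", "av", "bb"])

train_values = set(train_col.unique())
test_values = set(test_col.unique())

# Categories that appear in only one of the two sets.
only_in_test = sorted(test_values - train_values)
only_in_train = sorted(train_values - test_values)

print("only in test:", only_in_test)
print("only in train:", only_in_train)
```

Categories that appear only in the test set matter for the encoding step, because an encoder fitted on the train set alone has never seen them.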
### --------------------------------------Task-2 done above---------------------------------------------------
### --------------------------------------Task-3 starts below---------------------------------------------------

## Task-3: Apply label encoder.

In [241]:
# Creating a deep copy first
X_train1 = X_train.copy()
X_test1 = X_test.copy()

In [242]:
from sklearn.preprocessing import LabelEncoder

In [243]:
le = LabelEncoder()

In [244]:
X_train1.dtypes == 'object'

ID      False
X0       True
X1       True
X2       True
X3       True
        ...  
X380    False
X382    False
X383    False
X384    False
X385    False
Length: 360, dtype: bool

In [245]:
X_train1['X0'].dtypes

dtype('O')

In [246]:
X_train1['X385'].dtypes

dtype('int64')

In [247]:
ob = X_train1.select_dtypes(include='object')
ob

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
0,k,v,at,a,d,u,j,o
1,k,t,av,e,d,y,l,o
2,az,w,n,c,d,x,j,x
3,az,t,n,f,d,x,l,e
4,az,v,n,f,d,h,d,n
...,...,...,...,...,...,...,...,...
4204,ak,s,as,c,d,aa,d,q
4205,j,o,t,d,d,aa,h,h
4206,ak,v,r,a,d,aa,g,e
4207,al,r,e,f,d,aa,l,u


In [248]:
obj_cols = ob.columns
obj_cols              # this column stores the object column name

Index(['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'], dtype='object')

In [249]:
for c in X_train1.columns:
    if X_train1[c].dtypes == 'O':
        print(c)

X0
X1
X2
X3
X4
X5
X6
X8


In [250]:
for c in X_train1.columns:
    if X_train1[c].dtypes == 'O':
        X_train1[c] = le.fit_transform(X_train1[c])

In [251]:
X_train1[obj_cols]

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
0,32,23,17,0,3,24,9,14
1,32,21,19,4,3,28,11,14
2,20,24,34,2,3,27,9,23
3,20,21,34,5,3,27,11,4
4,20,23,34,5,3,12,3,13
...,...,...,...,...,...,...,...,...
4204,8,20,16,2,3,0,3,16
4205,31,16,40,3,3,0,7,7
4206,8,23,38,0,3,0,6,4
4207,9,19,25,5,3,0,11,20


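### The object columns of the train set (X_train1) are now label-encoded. Next we encode the object columns of the test set (X_test1). Note that the loop below calls le.fit_transform again, so every test column gets a fresh mapping that need not match the train mapping (for example, 'az' in X0 becomes 20 in the train set but 21 in the test set).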

In [252]:
X_test1.select_dtypes(include="object")

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
0,az,v,n,f,d,t,a,w
1,t,b,ai,a,d,b,g,y
2,az,v,as,f,d,a,j,j
3,az,l,n,f,d,z,l,n
4,w,s,as,c,d,y,i,m
...,...,...,...,...,...,...,...,...
4204,aj,h,as,f,d,aa,j,e
4205,t,aa,ai,d,d,aa,j,y
4206,y,v,as,f,d,aa,d,w
4207,ak,v,as,a,d,aa,c,q


In [253]:
for c1 in X_test1.columns:
    if X_test1[c1].dtypes == "O":
        X_test1[c1] = le.fit_transform(X_test1[c1])
        
X_test1[obj_cols]   
# We can see label encoding is applied here in o/p, compare with previous cell output

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8
0,21,23,34,5,3,26,0,22
1,42,3,8,0,3,9,6,24
2,21,23,17,5,3,0,9,9
3,21,13,34,5,3,31,11,13
4,45,20,17,2,3,30,8,12
...,...,...,...,...,...,...,...,...
4204,6,9,17,5,3,1,9,4
4205,42,1,8,3,3,1,9,24
4206,47,23,17,5,3,1,3,22
4207,7,23,17,0,3,1,2,16


In [254]:
X_train1.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,32,23,17,0,3,24,9,14,0,...,0,0,1,0,0,0,0,0,0,0
1,6,32,21,19,4,3,28,11,14,0,...,1,0,0,0,0,0,0,0,0,0
2,7,20,24,34,2,3,27,9,23,0,...,0,0,0,0,0,0,1,0,0,0
3,9,20,21,34,5,3,27,11,4,0,...,0,0,0,0,0,0,0,0,0,0
4,13,20,23,34,5,3,12,3,13,0,...,0,0,0,0,0,0,0,0,0,0


In [255]:
X_test1.head()

Unnamed: 0,ID,X0,X1,X2,X3,X4,X5,X6,X8,X10,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,1,21,23,34,5,3,26,0,22,0,...,0,0,0,1,0,0,0,0,0,0
1,2,42,3,8,0,3,9,6,24,0,...,0,0,1,0,0,0,0,0,0,0
2,3,21,23,17,5,3,0,9,9,0,...,0,0,0,1,0,0,0,0,0,0
3,4,21,13,34,5,3,31,11,13,0,...,0,0,0,1,0,0,0,0,0,0
4,5,45,20,17,2,3,30,8,12,0,...,1,0,0,0,0,0,0,0,0,0
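One way to keep the integer codes consistent between the two sets is to fit one encoder per column on the combined train and test values and then transform each set with that same encoder. This is a sketch of an alternative, not the method used in the cells above; the toy frames and the `encoders` dictionary are assumptions introduced for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the object columns of X_train1 and X_test1.
train = pd.DataFrame({"X0": ["k", "az", "az"], "X3": ["a", "e", "c"]})
test = pd.DataFrame({"X0": ["az", "t", "k"], "X3": ["f", "a", "c"]})
cat_cols = ["X0", "X3"]

train_enc, test_enc = train.copy(), test.copy()
encoders = {}  # one fitted encoder per column, kept for later reuse

for col in cat_cols:
    le = LabelEncoder()
    # Fit on the categories of both sets so neither contains unseen labels.
    le.fit(train[col].tolist() + test[col].tolist())
    train_enc[col] = le.transform(train[col].tolist())
    test_enc[col] = le.transform(test[col].tolist())
    encoders[col] = le

print(train_enc)
print(test_enc)
```

With a shared mapping, a given category such as `'az'` receives the same integer in both frames. Label codes still impose an arbitrary order on the categories, which tree-based models usually tolerate better than linear ones.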


### --------------------------------------Task-3 done above---------------------------------------------------
### --------------------------------------Task-4 starts below---------------------------------------------------

## Task-4: Perform dimensionality reduction.

In [256]:
from sklearn.decomposition import PCA
pca = PCA()

### We now create new DataFrames by dropping the ID column from the train and test data

In [269]:
X_train2 = X_train1.drop('ID',axis=1).copy()
X_test2 = X_test1.drop('ID',axis=1).copy()

In [270]:
X_train2.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8,X10,X12,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,32,23,17,0,3,24,9,14,0,0,...,0,0,1,0,0,0,0,0,0,0
1,32,21,19,4,3,28,11,14,0,0,...,1,0,0,0,0,0,0,0,0,0
2,20,24,34,2,3,27,9,23,0,0,...,0,0,0,0,0,0,1,0,0,0
3,20,21,34,5,3,27,11,4,0,0,...,0,0,0,0,0,0,0,0,0,0
4,20,23,34,5,3,12,3,13,0,0,...,0,0,0,0,0,0,0,0,0,0


In [271]:
X_test2.head()

Unnamed: 0,X0,X1,X2,X3,X4,X5,X6,X8,X10,X12,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,21,23,34,5,3,26,0,22,0,0,...,0,0,0,1,0,0,0,0,0,0
1,42,3,8,0,3,9,6,24,0,0,...,0,0,1,0,0,0,0,0,0,0
2,21,23,17,5,3,0,9,9,0,0,...,0,0,0,1,0,0,0,0,0,0
3,21,13,34,5,3,31,11,13,0,0,...,0,0,0,1,0,0,0,0,0,0
4,45,20,17,2,3,30,8,12,0,0,...,1,0,0,0,0,0,0,0,0,0


In [282]:
print(X_train2.shape)
print(X_test2.shape)

(4209, 359)
(4209, 359)


In [272]:
# To retain at least 95% of the explained variance ratio (EVR), pass a float as n_components
pca = PCA(n_components=0.95)

In [273]:
pca.fit(X_train2)

PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [274]:
pca.explained_variance_ratio_

array([0.38335038, 0.21388171, 0.13261954, 0.1182672 , 0.0920607 ,
       0.01590615])

In [275]:
pca.explained_variance_ratio_.shape

(6,)

In [276]:
np.sum(pca.explained_variance_ratio_)

0.956085668525783

### PCA kept 6 components, which together explain about 95.6% of the variance. The first 5 alone explain about 94.0%, just short of the 0.95 threshold.

### Having identified that 6 components are enough, we now transform the data, which has 359 features (X_train1 had 360, but 'ID' was dropped to create X_train2).

In [277]:
X_train_transformed = pca.transform(X_train2)
X_test_transformed = pca.transform(X_test2)

In [278]:
X_train_transformed

array([[  0.61476509,  -0.13300457,  15.62445722,   3.68757593,
          1.3595744 ,  -2.69141855],
       [  0.565407  ,   1.56033758,  17.90958066,  -0.09288947,
          1.5366486 ,  -4.44287772],
       [ 16.20171239,  12.29285404,  17.63353492,   0.1863145 ,
         11.85081957,  -2.15538711],
       ...,
       [ 29.00465997,  14.86091249,  -7.75334616,  11.22440177,
         -5.84698504,   0.7893111 ],
       [ 22.97242203,   1.68482895,  -9.03125688,   9.74979538,
          9.44955753,  -4.35522878],
       [-17.28304728,  -9.95197999,  -3.71937027,  18.34309844,
          8.40170738,   0.50947589]])

In [279]:
pd.DataFrame(X_train_transformed)

Unnamed: 0,0,1,2,3,4,5
0,0.614765,-0.133005,15.624457,3.687576,1.359574,-2.691419
1,0.565407,1.560338,17.909581,-0.092889,1.536649,-4.442878
2,16.201712,12.292854,17.633535,0.186314,11.850820,-2.155387
3,16.149998,13.535426,14.898693,-3.140913,-6.832194,-4.290012
4,16.459103,13.175011,4.403086,7.671148,2.139916,3.763861
...,...,...,...,...,...,...
4204,22.161404,-7.184316,-8.659411,10.774855,4.669902,3.527911
4205,6.153948,22.828152,-8.314672,10.303205,-3.089277,0.073623
4206,29.004660,14.860912,-7.753346,11.224402,-5.846985,0.789311
4207,22.972422,1.684829,-9.031257,9.749795,9.449558,-4.355229


In [280]:
pd.DataFrame(X_test_transformed)

Unnamed: 0,0,1,2,3,4,5
0,15.122596,12.426351,16.575683,0.381597,10.749272,6.778301
1,-16.418522,-6.087807,-5.818107,-0.643846,11.846740,0.972064
2,11.310890,-2.240983,-5.683221,15.249593,-2.777293,-2.557532
3,12.883277,13.284804,14.409577,-11.039923,2.535216,-3.919067
4,-12.119759,3.021291,20.739082,0.602510,-1.087116,-1.422970
...,...,...,...,...,...,...
4204,22.045697,-5.791778,-14.969442,0.445581,-6.315913,-2.175567
4205,-16.799934,-5.849007,-13.362178,2.585824,12.125719,-1.671761
4206,-13.467664,3.524159,-0.387650,20.586708,9.051955,3.609705
4207,24.086926,-6.512196,-6.389518,12.212749,4.557660,4.376525


In [281]:
print(X_train_transformed.shape)
print(X_test_transformed.shape)

(4209, 6)
(4209, 6)


### After PCA the shape is (4209, 6); before PCA it was (4209, 359), as shown a few cells back for X_train2 and X_test2.
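To illustrate how a float `n_components` selects the number of components, the sketch below runs PCA on synthetic data whose features have deliberately uneven variance. The data, the random seed, and the `scales` vector are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 10 features; the first three dominate the variance.
scales = np.array([10, 8, 6, 1, 1, 0.5, 0.5, 0.1, 0.1, 0.1])
X = rng.normal(size=(200, 10)) * scales

# A float in (0, 1) asks PCA for the fewest components whose cumulative
# explained variance ratio exceeds that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_)
print(cumulative)
```

The last entry of `cumulative` is the first value that crosses the threshold, which mirrors the notebook's result, where the sixth component lifts the total from about 0.940 to about 0.956.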

### --------------------------------------Task-4 done above---------------------------------------------------
### --------------------------------------Task-5 starts below---------------------------------------------------

## Task-5: Predict your test_df values using XGBoost.

In [284]:
import xgboost as xgb

ModuleNotFoundError: No module named 'xgboost'

In [285]:
!pip install xgboost

Collecting xgboost
  Downloading https://files.pythonhosted.org/packages/24/14/d9ecb9fa86727f51bfb35f1c2b0428ebc6cd5ffde24c5e2dc583d3575a6f/xgboost-1.6.2-py3-none-win_amd64.whl (125.4MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.6.2


In [288]:
import xgboost as xgb

In [289]:
xgb_r = xgb.XGBRegressor()

In [290]:
xgb_r.fit(X_train_transformed, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
             num_parallel_tree=1, objective='reg:squarederror',
             predictor='auto', random_state=0, reg_alpha=0, ...)

In [291]:
y_pred = xgb_r.predict(X_test_transformed)

In [292]:
y_pred

array([ 81.353905, 101.158516,  83.286964, ..., 108.5399  , 109.41588 ,
       101.74673 ], dtype=float32)

In [303]:
pd.DataFrame(y_pred, columns = ['y'])

Unnamed: 0,y
0,81.353905
1,101.158516
2,83.286964
3,91.348869
4,104.549469
...,...
4204,114.716934
4205,101.151001
4206,108.539902
4207,109.415878


In [296]:
X_test_transformed

array([[ 15.12259634,  12.4263511 ,  16.57568311,   0.38159714,
         10.74927229,   6.77830055],
       [-16.41852224,  -6.08780679,  -5.81810666,  -0.64384649,
         11.84673994,   0.972064  ],
       [ 11.31089028,  -2.24098281,  -5.68322101,  15.24959317,
         -2.77729329,  -2.55753182],
       ...,
       [-13.46766361,   3.52415935,  -0.38765044,  20.58670847,
          9.05195519,   3.60970468],
       [ 24.08692577,  -6.51219645,  -6.38951799,  12.21274873,
          4.55765983,   4.37652459],
       [-16.55794144,  -5.49565516, -13.70389166,   2.25695005,
          5.14286798,   1.17658804]])

### We have now obtained the predictions with the XGBoost algorithm. Next we concatenate y_pred with X_test_transformed and save the result to final_output.csv

In [304]:
final_output = pd.concat([pd.DataFrame(X_test_transformed), pd.DataFrame(y_pred, columns=['y'])],axis = 1)

In [305]:
final_output

Unnamed: 0,0,1,2,3,4,5,y
0,15.122596,12.426351,16.575683,0.381597,10.749272,6.778301,81.353905
1,-16.418522,-6.087807,-5.818107,-0.643846,11.846740,0.972064,101.158516
2,11.310890,-2.240983,-5.683221,15.249593,-2.777293,-2.557532,83.286964
3,12.883277,13.284804,14.409577,-11.039923,2.535216,-3.919067,91.348869
4,-12.119759,3.021291,20.739082,0.602510,-1.087116,-1.422970,104.549469
...,...,...,...,...,...,...,...
4204,22.045697,-5.791778,-14.969442,0.445581,-6.315913,-2.175567,114.716934
4205,-16.799934,-5.849007,-13.362178,2.585824,12.125719,-1.671761,101.151001
4206,-13.467664,3.524159,-0.387650,20.586708,9.051955,3.609705,108.539902
4207,24.086926,-6.512196,-6.389518,12.212749,4.557660,4.376525,109.415878


In [306]:
final_output.to_csv('final_output.csv')
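The file written above contains the six PCA components and the prediction, but not the ID of each test row. If the predictions need to be matched back to the original cars, one option is to pair them with the ID column instead. The sketch below assumes that a two-column ID/y layout is wanted; the values are made up, and in the notebook the IDs would come from `X_test1['ID']` and the predictions from `y_pred`.

```python
import io

import numpy as np
import pandas as pd

# Stand-ins for X_test1['ID'] and y_pred (hypothetical values).
ids = pd.Series([1, 2, 3], name="ID")
preds = np.array([81.35, 101.16, 83.29], dtype=np.float32)

# Pair each ID with its prediction.
submission = pd.DataFrame({"ID": ids.to_numpy(), "y": preds})

# index=False keeps pandas from writing the row index as an extra column.
# A StringIO buffer is used here so the sketch does not touch the disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file name such as `'final_output.csv'` instead of the buffer writes the same content to disk.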

### --------------------------------------Task-5 done above---------------------------------------------------