Problem Statement Scenario:

Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include the passenger safety cell with a crumple zone, the airbag, and intelligent assistance systems. Mercedes-Benz applies for nearly 2000 patents per year, making the brand the European leader among premium carmakers. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.
To ensure the safety and reliability of every unique car configuration before it hits the road, the company’s engineers have developed a robust testing system. Because Mercedes-Benz is one of the world’s biggest manufacturers of premium cars, safety and efficiency are paramount on its production lines. However, optimizing the speed of the testing system for so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach.

You are required to reduce the time that cars spend on the test bench, working with a dataset that represents different permutations of features in a Mercedes-Benz car to predict the time it takes to pass testing. Optimal algorithms will contribute to faster testing, resulting in lower carbon dioxide emissions without reducing Mercedes-Benz’s standards.

I will take the following approach to the problem statement:

1. If for any column(s), the variance is equal to zero, then you need to remove those variable(s).
2. Check for null and unique values for test and train sets.
3. Apply label encoder for non-numerical categorical variables.
4. Use Boruta for dimensionality reduction, since PCA is not suitable for binary and categorical variables.
5. Use XGBoost for final modeling.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", None)  # show every column when displaying a DataFrame




In [None]:
##Importing Data Set
traindata = pd.read_csv("/kaggle/input/mercedes-benz-greener-manufacturing/train.csv.zip")
testdata = pd.read_csv("/kaggle/input/mercedes-benz-greener-manufacturing/test.csv.zip")

submission = pd.read_csv("/kaggle/input/mercedes-benz-greener-manufacturing/sample_submission.csv.zip")


In [None]:
#Creating a variable to identify which data set each row came from.
testdata['Type'] = "Test"
traindata['Type'] = "train"

#Merging Data Set
mergeddata = (pd.concat([traindata, testdata], ignore_index= True))

# Exploratory Data Inspection

In [None]:
traindata.head()

In [None]:
traindata.describe()

In [None]:
## ID does not look like a relevant feature; it appears to be a serial number of the records, so it is moved to the index.
mergeddata.set_index('ID', inplace=True)

In [None]:
traindata.info()

In [None]:
traindata.shape

In [None]:
traindata.isnull().sum()

In [None]:
traindata.isnull().sum().sum()

In [None]:
traindata.isnull().values.any()

In [None]:
traindata.var(numeric_only=True)  # skip the object columns, which have no variance

In [None]:
testdata.var(numeric_only=True)

# Data PreProcessing

#Step:1 - If for any column(s), the variance is equal to zero, then you need to remove those variable(s).

In [None]:
##Creating Variance dataframe
Numeric = pd.DataFrame(traindata.var(numeric_only=True))
Numeric = Numeric.transpose()
Numeric

In [None]:
##Dropping variables from the combined dataset that have 0 variance in the training set.
train_var = traindata.var(numeric_only=True)
variable = train_var[train_var == 0].index.tolist()  # threshold: variance exactly 0

mergeddata1 = mergeddata.drop(variable, axis=1)
mergeddata1.head()

In [None]:
##Checking Variance in Test Data
Numeric1 = pd.DataFrame(testdata.var(numeric_only=True))
Numeric1 = Numeric1.transpose()
Numeric1

In [None]:
##Dropping variables from the combined dataset that have 0 variance in the test set.
test_var = testdata.var(numeric_only=True)
variable = test_var[test_var == 0].index.tolist()  # threshold: variance exactly 0

# errors='ignore' skips any column that was already dropped in the previous step
mergeddata2 = mergeddata1.drop(variable, axis=1, errors='ignore')
mergeddata2.head()
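
The zero-variance filter in Step 1 can be tried on a small example first. The sketch below is only an illustration: `toy_train`, `toy_test` and the helper `zero_variance_columns` are hypothetical names that do not exist in the notebook, and the data is made up. It drops every numeric column that is constant in either set, which is the same effect as the two drops above.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (made-up data).
toy_train = pd.DataFrame({
    "X0": ["a", "b", "a", "c"],   # categorical, skipped by the variance check
    "X10": [0, 1, 0, 1],          # varies in both sets
    "X11": [0, 0, 0, 0],          # constant in train
    "X12": [1, 0, 1, 1],          # varies in train, constant in test
})
toy_test = pd.DataFrame({
    "X0": ["a", "a", "b", "c"],
    "X10": [1, 1, 0, 0],
    "X11": [0, 1, 0, 0],
    "X12": [1, 1, 1, 1],
})


def zero_variance_columns(df):
    """Return the names of numeric columns whose variance is exactly zero."""
    variances = df.var(numeric_only=True)
    return variances[variances == 0].index.tolist()


constant_in_train = zero_variance_columns(toy_train)
constant_in_test = zero_variance_columns(toy_test)

# Drop any column that is constant in either set from both frames.
to_drop = sorted(set(constant_in_train) | set(constant_in_test))
toy_train_reduced = toy_train.drop(columns=to_drop)
toy_test_reduced = toy_test.drop(columns=to_drop)

print(constant_in_train)
print(constant_in_test)
print(list(toy_train_reduced.columns))
```

Dropping the same columns from both frames keeps the train and test feature sets aligned, which the model needs at prediction time.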

In [None]:
##Checking Categorical Data in DataFrame for train Data
traindata.describe(include=object)

##Dropping the X4 variable: it has almost no variance, as the value 'd' has a count of 8408, which is almost 100% of the variable.

In [None]:
mergeddata3 = mergeddata2.drop('X4',axis=1)
mergeddata3.head()

In [None]:
##Checking Categorical Data in DataFrame for Test Data
testdata.describe(include=object)

#All other categorical variables show variation. X4 is the exception, and it has already been removed while checking the categorical variance of the training data.

In [None]:
FinalData = mergeddata3.copy()
FinalData.head()

##Step:2 - Check for null and unique values for test and train sets.

In [None]:
traindata.isna().sum()

In [None]:
traindata.isnull().sum().sum()

In [None]:
testdata.isna().sum()

In [None]:
testdata.isnull().sum().sum()

##Checking whether the categorical values in the test data are available in the train data.

In [None]:
#Feature X0
X0Check = np.where(testdata.X0.isin(traindata.X0),'Match', testdata.X0)  # keep the test level when it is missing from train
X0Check = pd.Series(X0Check)
X0Check.value_counts()

In [None]:
#Feature X1
X1Check = np.where(testdata.X1.isin(traindata.X1),'Match', testdata.X1)  # keep the test level when it is missing from train
X1Check = pd.Series(X1Check)
X1Check.value_counts()

In [None]:
#Feature X2
X2Check = np.where(testdata.X2.isin(traindata.X2),'Match', testdata.X2)  # keep the test level when it is missing from train
X2Check = pd.Series(X2Check)
X2Check.value_counts()

In [None]:
#Feature X3
X3Check = np.where(testdata.X3.isin(traindata.X3),'Match', testdata.X3)  # keep the test level when it is missing from train
X3Check = pd.Series(X3Check)
X3Check.value_counts()

In [None]:
#Feature X5
X5Check = np.where(testdata.X5.isin(traindata.X5),'Match', testdata.X5)  # keep the test level when it is missing from train
X5Check = pd.Series(X5Check)
X5Check.value_counts()

In [None]:
#Feature X6
X6Check = np.where(testdata.X6.isin(traindata.X6),'Match', testdata.X6)  # keep the test level when it is missing from train
X6Check = pd.Series(X6Check)
X6Check.value_counts()

In [None]:
#Feature X8
X8Check = np.where(testdata.X8.isin(traindata.X8),'Match', testdata.X8)  # keep the test level when it is missing from train
X8Check = pd.Series(X8Check)
X8Check.value_counts()

##All levels in the test data are also available in the train data, except for a few with low occurrence. I will leave them as is.
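
The per-feature checks above repeat the same logic seven times. As a rough sketch, they could be folded into one helper that reports, for each categorical column, which test levels never occur in the train data and how often. The function `unseen_levels` and the toy frames are hypothetical and use made-up values.

```python
import pandas as pd

# Toy stand-ins for the categorical columns (made-up data).
toy_train = pd.DataFrame({"X0": ["a", "b", "c", "a"], "X5": ["x", "y", "x", "y"]})
toy_test = pd.DataFrame({"X0": ["a", "d", "b", "d"], "X5": ["x", "x", "y", "z"]})


def unseen_levels(train, test, columns):
    """Count, per column, the test values that never occur in train."""
    report = {}
    for col in columns:
        missing_mask = ~test[col].isin(train[col])
        report[col] = test.loc[missing_mask, col].value_counts().to_dict()
    return report


report = unseen_levels(toy_train, toy_test, ["X0", "X5"])
print(report)
```

An empty dictionary for a column means that every test level was seen during training.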

In [None]:
##Frequency observation of categorical column.

for column in traindata.select_dtypes(include=['object']).columns:
    display(pd.crosstab(index=traindata[column], columns='Observations%', normalize='columns')*100)   

#Looking at the data, all categorical input features have sufficient variance.

In [None]:
##Frequency observation of numerical(binary) columns.

for column in traindata.select_dtypes(include=['int64']).columns:
    display(pd.crosstab(index=traindata[column], columns='Observations%', normalize='columns')*100)   

##We have already removed the columns that have 0 variance from the Final Data.

# Data Segregation from Merged Data - Train and Test Data

In [None]:
FinalData.head()

In [None]:
FinalTrainData = FinalData[FinalData['Type']=='train'].drop('Type',axis=1)
FinalTrainData.shape

In [None]:
FinalTrainData.head()

In [None]:
FinalTestData = FinalData[FinalData['Type']=='Test'].drop(['y','Type'],axis=1)
FinalTestData.shape

In [None]:
FinalTestData.head()

##Step:3 - Label Encoder Application

In [None]:
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from boruta import BorutaPy
import xgboost as xgb
import seaborn as sns
from scipy.stats import skew
from sklearn.metrics import mean_squared_error as MSE
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.metrics import r2_score as rsq
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
import statistics as sts
le = preprocessing.LabelEncoder()

In [None]:
#Apply the Label Encoder to the non-numerical categorical data.
#The encoder is re-fitted for each column, so each column gets its own integer codes.
for col in ['X0', 'X1', 'X2', 'X3', 'X5', 'X6', 'X8']:
    FinalTrainData[col] = le.fit_transform(FinalTrainData[col])

In [None]:
FinalTrainData.head()

In [None]:
##After label encoding, the categorical features take larger numeric values than the binary features.
##Tree-based models such as XGBoost are not sensitive to feature scale, so this step is optional; the encoded columns are standardized here to keep them on a comparable scale.
stscale=StandardScaler()
for col in ['X0', 'X1', 'X2', 'X3', 'X5', 'X6', 'X8']:
    # ravel() flattens the (n, 1) array returned by the scaler into one column
    FinalTrainData[col] = stscale.fit_transform(FinalTrainData[[col]]).ravel()
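
The notebook encodes and scales only the training data. If the categorical columns were ever fed to the final model, the test data would need exactly the same transformation. The sketch below shows one way to do that under stated assumptions: it swaps `LabelEncoder` for scikit-learn's `OrdinalEncoder`, which can map unseen test levels to a placeholder value, and it uses made-up toy frames. It is not what the notebook does.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

cat_cols = ["X0", "X5"]
# Toy stand-ins for the train and test frames (made-up data).
toy_train = pd.DataFrame({"X0": ["a", "b", "c", "a"], "X5": ["x", "y", "x", "y"]})
toy_test = pd.DataFrame({"X0": ["b", "d"], "X5": ["y", "x"]})  # "d" is unseen in train

# Unseen test levels are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
scaler = StandardScaler()

# Fit both transformers on the training data only.
train_codes = encoder.fit_transform(toy_train[cat_cols])
train_scaled = scaler.fit_transform(train_codes)

# Reuse the fitted objects on the test data: transform only, never fit.
test_codes = encoder.transform(toy_test[cat_cols])
test_scaled = scaler.transform(test_codes)

print(test_codes)
print(test_scaled)
```

Fitting on train and only transforming test keeps the integer codes and the scaling constants identical across both sets.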

In [None]:
FinalTrainData.head()

In [None]:
FinalTrainData.shape

In [None]:
FinalTrainData.info()

##Step:4 - Feature Selection/Dimensionality Reduction using the Boruta Package

In [None]:
##Preparing data for Feature and Label.
X =FinalTrainData.drop('y', axis=1)
y=FinalTrainData['y']
X.shape,y.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size = 0.2, random_state = 21) #0.2 ==> 20% of the data is held out for testing & 80% is used for training
print("Shape of X_train is " , X_train.shape)
print("Shape of y_train is " , y_train.shape)
print("=====================================")
print("Shape of X_test is " , X_test.shape)
print("Shape of y_test is " , y_test.shape)

##Boruta Package Application on Training Data

In [None]:
mymodel=xgb.XGBRegressor()

In [None]:
selfeat=BorutaPy(mymodel, n_estimators='auto', verbose=2, random_state=1)

In [None]:
selfeat.fit(np.array(X),np.array(y))

In [None]:
#Boruta Result
#Boruta was fitted on X, so the feature names are taken from X.columns.
selected_rf_features = pd.DataFrame({'Feature': list(X.columns),
                                     'Ranking': selfeat.ranking_,
                                     'Support': selfeat.support_})
Feature = selected_rf_features.copy()
Confirmed = Feature[Feature['Ranking']==1]  #Confirmed features
Tentative = Feature[Feature['Ranking']==2]  #Tentative features
print(Confirmed.head(10))
print(Tentative.head())

##8 features have been confirmed by Boruta as important, so we will proceed with these 8 features for modeling.
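
To make the idea behind Boruta more concrete, here is a minimal sketch of a single round of its shadow-feature comparison. It is an illustration only: it uses scikit-learn's `RandomForestRegressor` rather than the XGBoost estimator passed to `BorutaPy` above, the data is synthetic, and the real algorithm repeats this comparison over many iterations and applies a statistical test to the hit counts before confirming or rejecting a feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_rows, n_features = 300, 5

# Synthetic binary features; only column 0 actually drives the target.
X_toy = rng.integers(0, 2, size=(n_rows, n_features)).astype(float)
y_toy = 10.0 * X_toy[:, 0] + rng.normal(0.0, 0.1, size=n_rows)

# Shadow features: each column shuffled independently, which breaks any link to y.
shadows = np.column_stack([rng.permutation(X_toy[:, j]) for j in range(n_features)])
X_aug = np.hstack([X_toy, shadows])

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_aug, y_toy)

importances = forest.feature_importances_
real_importance = importances[:n_features]
shadow_max = importances[n_features:].max()

# A real feature scores a "hit" in this round if it beats the best shadow feature.
hits = real_importance > shadow_max
print(hits)
```

A feature that cannot beat randomly shuffled copies of the data is unlikely to carry real signal, which is why Boruta uses the best shadow importance as its reference level.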

In [None]:
#Final modeling data with the confirmed features
FinData1=FinalTrainData[['y','X189','X315','X314','X118','X261','X29','X127','X236']]

In [None]:
##Target Value Data Distribution Review
sns.kdeplot(FinData1.y)

##Looking at the graph, the data seems to be skewed; let's check it with statistics.

In [None]:
##Stats Checking
print('The Median Value of Target Variable is:',sts.median(FinData1.y))
print('The Mean Value of Target Variable is:',round(sts.mean(FinData1.y),2))
print('The Skewness of Target Variable is:',round(skew(FinData1.y),2))


##As per the statistics, the data seems to be approximately normally distributed, as the skewness is below the threshold (<3%).
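
For reference, the value returned by `scipy.stats.skew` with its default settings is the biased Fisher-Pearson coefficient, the third central moment divided by the second central moment raised to the power 1.5. The sketch below recomputes it by hand on made-up numbers. The helper name `sample_skewness` is hypothetical.

```python
import numpy as np


def sample_skewness(values):
    """Biased Fisher-Pearson skewness: m3 / m2**1.5, with central moments m2 and m3."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]        # mirror image around the mean
right_skewed = [1, 1, 1, 1, 10]    # one large value pulls the right tail out

print(sample_skewness(symmetric))
print(sample_skewness(right_skewed))
```

A value near zero indicates a symmetric distribution, while a positive value indicates a longer right tail, meaning the mean sits above the median.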

In [None]:
#Creating the train/test split for the XGBoost model.
X =FinData1.drop('y', axis=1)
y=FinData1['y']
X.shape,y.shape
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size = 0.2, random_state = 21) #0.2 ==> 20% of the data is held out for testing & 80% is used for training
print("Shape of X_train is " , X_train.shape)
print("Shape of y_train is " , y_train.shape)
print("=====================================")
print("Shape of X_test is " , X_test.shape)
print("Shape of y_test is " , y_test.shape)


In [None]:
##Xgb Model
import random
random.seed(111)
xgb_mymodel = xgb.XGBRegressor(max_depth=3, n_estimators=100, n_jobs=2,
                           objective='reg:squarederror', booster='gbtree',
                           random_state=42, learning_rate=0.05)


In [None]:
#Fitting the Data
xgb_mymodel.fit(X_train, y_train)

In [None]:
##Prediction on Test Data for accuracy
preds = xgb_mymodel.predict(X_test)

In [None]:
#Evaluation of the Model on Test Data
rmse = np.sqrt(MSE(y_test, preds))
mae=MAE(y_test, preds)
mse = MSE(y_test, preds)
R2SQ=rsq(y_test, preds)
adjRSq = 1 - (1-R2SQ)*(len(y_test)-1)/(len(y_test)-X.shape[1]-1)
diff = pd.DataFrame({'Actual': y_test, 'Predicted': preds,'Error':y_test -preds})
print("The Root Mean Squared Error is: ",round(rmse,2))
print("The Mean Absolute Error is: ",round(mae,2))
print("The Mean Squared Error is: ",round(mse,2))
print("The RSquared value is:",round(R2SQ,2))
print("The Adj RSquared value is:",round(adjRSq,2))
print("The Min Error value is:",round(diff.Error.min(),2))
print("The Max Error value is:",round(diff.Error.max(),2))

In [None]:
y_test.describe()

RMSE and Adjusted RSquared seem reasonable relative to the mean and standard deviation of the test data.
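
To show how the reported numbers relate to each other, the sketch below recomputes the same metrics by hand on five made-up values. The helper `regression_metrics` is hypothetical; the adjusted R-squared uses the same formula as the evaluation cell, 1 - (1 - R2) * (n - 1) / (n - p - 1), where n is the number of rows and p the number of features.

```python
import math


def regression_metrics(actual, predicted, n_features):
    """Compute RMSE, MAE, MSE, R-squared and adjusted R-squared from plain lists."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    mse = ss_res / n
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "rmse": math.sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        "mse": mse,
        "r2": r2,
        "adj_r2": adj_r2,
    }


# Made-up values, only to exercise the formulas.
actual = [100.0, 90.0, 110.0, 95.0, 105.0]
predicted = [98.0, 92.0, 108.0, 97.0, 105.0]
metrics = regression_metrics(actual, predicted, n_features=2)
print(metrics)
```

Adjusted R-squared is never higher than R-squared, because it penalizes each additional feature; with 8 features and a test set of several hundred rows the two values stay close.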

In [None]:
#Plotting Actual V/s Predicted Value
plt.scatter(y_test, preds,color = "Green")
plt.xlabel('True Values (y)')
plt.ylabel('Predictions (y)')
plt.grid(False)
plt.axis('equal')
plt.axis('square')
plt.xlim(60,140)
plt.ylim(60,140)
plt.title("Actual v/s Predicted Value")
plt.show()

##Step:5 - Predict the test data values using XGBoost.

# Selecting the confirmed features from the Test Data for the actual prediction

In [None]:
ActTestData=FinalTestData[['X189','X315','X314','X118','X261','X29','X127','X236']]
ActTestData.head()

In [None]:
ActTestData.describe()

In [None]:
##Prediction on Actual Data
ActualTestPred = xgb_mymodel.predict(ActTestData)

In [None]:
ActualTestPred

In [None]:
TestID = np.array(testdata['ID'])
FinalPred = pd.DataFrame({'ID': TestID, 'y': ActualTestPred})
FinalPred.head()

In [None]:
##Save the predicted values
FinalPred.to_csv('TestDataSubmission.csv', index=False)
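
Before uploading, the prediction file can be compared with the sample submission that was loaded at the start of the notebook. The sketch below is a hedged example of such a check. The helper `check_submission` is hypothetical, the toy frames use made-up values, and the assumption that the sample file has exactly the columns `ID` and `y` is taken from the frame built above.

```python
import pandas as pd

# Toy stand-ins for the sample submission and the prediction frame (made-up data).
sample = pd.DataFrame({"ID": [1, 2, 3], "y": [100.0, 100.0, 100.0]})
toy_pred = pd.DataFrame({"ID": [1, 2, 3], "y": [98.5, 101.2, 110.0]})


def check_submission(pred_df, sample_df):
    """Return a list of problems found; an empty list means the layout matches."""
    if list(pred_df.columns) != list(sample_df.columns):
        return ["column names or order differ"]
    problems = []
    if len(pred_df) != len(sample_df):
        problems.append("row count differs")
    elif not (pred_df["ID"].to_numpy() == sample_df["ID"].to_numpy()).all():
        problems.append("ID values or order differ")
    if pred_df["y"].isna().any():
        problems.append("missing predictions")
    return problems


print(check_submission(toy_pred, sample))
print(check_submission(toy_pred.iloc[:2], sample))
```

In the notebook itself the call would be `check_submission(FinalPred, submission)`, assuming both frames are still in memory.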

##End of Project: Thank You##