## Overview
Since the first automobile, the Benz Patent Motor Car in 1886, Mercedes-Benz has stood for important automotive innovations. These include, for example, the passenger safety cell with crumple zone, the airbag, and intelligent assistance systems. Daimler’s Mercedes-Benz cars are leaders in the premium car industry. With a huge selection of features and options, customers can choose the customized Mercedes-Benz of their dreams.

To ensure the safety and reliability of each unique car configuration before it hits the road, Daimler’s engineers have developed a robust testing system. However, optimizing the speed of that testing system across so many possible feature combinations is complex and time-consuming without a powerful algorithmic approach. Daimler is one of the world’s biggest manufacturers of premium cars, so safety and efficiency are paramount on its production lines.

In this competition, Daimler is challenging Kagglers to tackle the curse of dimensionality and reduce the time that cars spend on the test bench. Competitors will work with a dataset representing different permutations of Mercedes-Benz car features to predict the time it takes to pass testing. Winning algorithms will contribute to speedier testing, resulting in lower carbon dioxide emissions without reducing Daimler’s standards.


## Dataset Description
The dataset consists of training and test sets containing different permutations of Mercedes-Benz car features. The goal is to predict the time it takes for each configuration to pass testing.
* `train.csv.zip`: Training set with features and target variable.
* `test.csv.zip`: Test set for predictions.

## Setup directory

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


/kaggle/input/mercedes-benz-greener-manufacturing/train.csv.zip
/kaggle/input/mercedes-benz-greener-manufacturing/sample_submission.csv.zip
/kaggle/input/mercedes-benz-greener-manufacturing/test.csv.zip


In [5]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings("ignore")

## Data Loading and Exploration
### Load and read the train and test data

In [6]:
# Load the training and test data (pandas infers zip compression from the file extension)
train_data = pd.read_csv('/kaggle/input/mercedes-benz-greener-manufacturing/train.csv.zip')
test_data = pd.read_csv('/kaggle/input/mercedes-benz-greener-manufacturing/test.csv.zip')
display(train_data.head())

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0


## Preprocess data
* Transform the categorical columns into binary indicator columns with `OneHotEncoder`.
* Combine the train and test sets before encoding so that both end up with the same encoded columns, then split them apart again.

In [21]:
# Combine the training and test data for consistent one-hot encoding
combined_data = pd.concat([train_data, test_data], axis=0, ignore_index=True)
display(combined_data)

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X375,X376,X377,X378,X379,X380,X382,X383,X384,X385
0,0,130.81,k,v,at,a,d,u,j,o,...,0,0,1,0,0,0,0,0,0,0
1,6,88.53,k,t,av,e,d,y,l,o,...,1,0,0,0,0,0,0,0,0,0
2,7,76.26,az,w,n,c,d,x,j,x,...,0,0,0,0,0,0,1,0,0,0
3,9,80.62,az,t,n,f,d,x,l,e,...,0,0,0,0,0,0,0,0,0,0
4,13,78.02,az,v,n,f,d,h,d,n,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8413,8410,,aj,h,as,f,d,aa,j,e,...,0,0,0,0,0,0,0,0,0,0
8414,8411,,t,aa,ai,d,d,aa,j,y,...,0,1,0,0,0,0,0,0,0,0
8415,8413,,y,v,as,f,d,aa,d,w,...,0,0,0,0,0,0,0,0,0,0
8416,8414,,ak,v,as,a,d,aa,c,q,...,0,0,1,0,0,0,0,0,0,0


In [24]:
# Preprocess the combined data using one-hot encoding for categorical variables 
categorical_columns = combined_data.select_dtypes(include=['object']).columns
display(categorical_columns)

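# Encode the categorical data with OneHotEncoder.
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output` and 1.4 removed the old name;
# on newer versions pass sparse_output=False instead.
encoder = OneHotEncoder(sparse=False, drop='first', handle_unknown='ignore')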
encoded_categorical_data = encoder.fit_transform(combined_data[categorical_columns])
encoded_categorical_df = pd.DataFrame(encoded_categorical_data, columns=encoder.get_feature_names_out(categorical_columns))
display(encoded_categorical_df)

# Concatenate the original data with the encoded categorical data
combined_data_encoded = pd.concat([combined_data, encoded_categorical_df], axis=1)

# Display the resulting DataFrame
display(combined_data_encoded)

Index(['X0', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X8'], dtype='object')

Unnamed: 0,X0_aa,X0_ab,X0_ac,X0_ad,X0_ae,X0_af,X0_ag,X0_ai,X0_aj,X0_ak,...,X8_p,X8_q,X8_r,X8_s,X8_t,X8_u,X8_v,X8_w,X8_x,X8_y
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8413,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8414,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8415,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
8416,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X8_p,X8_q,X8_r,X8_s,X8_t,X8_u,X8_v,X8_w,X8_x,X8_y
0,0,130.81,k,v,at,a,d,u,j,o,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,6,88.53,k,t,av,e,d,y,l,o,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,7,76.26,az,w,n,c,d,x,j,x,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,9,80.62,az,t,n,f,d,x,l,e,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,13,78.02,az,v,n,f,d,h,d,n,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8413,8410,,aj,h,as,f,d,aa,j,e,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8414,8411,,t,aa,ai,d,d,aa,j,y,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
8415,8413,,y,v,as,f,d,aa,d,w,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
8416,8414,,ak,v,as,a,d,aa,c,q,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
# Split the combined data back into training and test datasets
train_data_encoded = combined_data_encoded[:len(train_data)]
test_data_encoded = combined_data_encoded[len(train_data):]
display(train_data_encoded)

Unnamed: 0,ID,y,X0,X1,X2,X3,X4,X5,X6,X8,...,X8_p,X8_q,X8_r,X8_s,X8_t,X8_u,X8_v,X8_w,X8_x,X8_y
0,0,130.81,k,v,at,a,d,u,j,o,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,6,88.53,k,t,av,e,d,y,l,o,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,7,76.26,az,w,n,c,d,x,j,x,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,9,80.62,az,t,n,f,d,x,l,e,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,13,78.02,az,v,n,f,d,h,d,n,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,8405,107.39,ak,s,as,c,d,aa,d,q,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4205,8406,108.77,j,o,t,d,d,aa,h,h,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4206,8412,109.22,ak,v,r,a,d,aa,g,e,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4207,8415,87.48,al,r,e,f,d,aa,l,u,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [26]:
# Drop the original categorical columns
train_data_encoded = train_data_encoded.drop(categorical_columns, axis=1)
test_data_encoded = test_data_encoded.drop(categorical_columns, axis=1)
display(train_data_encoded)


Unnamed: 0,ID,y,X10,X11,X12,X13,X14,X15,X16,X17,...,X8_p,X8_q,X8_r,X8_s,X8_t,X8_u,X8_v,X8_w,X8_x,X8_y
0,0,130.81,0,0,0,1,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,6,88.53,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,7,76.26,0,0,0,0,0,0,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,9,80.62,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,13,78.02,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4204,8405,107.39,0,0,0,0,1,0,0,0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4205,8406,108.77,0,0,0,0,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4206,8412,109.22,0,0,1,1,0,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4207,8415,87.48,0,0,0,0,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


## Machine Learning model: Train a Random Forest Regressor model

In [28]:
# Separate the target (y) from the features; ID is an identifier, so it is dropped as well
y_train = train_data_encoded['y']
X_train = train_data_encoded.drop(columns=['ID', 'y'])

In [29]:
# Train a Random Forest Regressor model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [30]:
# Use the trained model to make predictions on the test data
test_predictions = model.predict(test_data_encoded.drop(columns=['ID', 'y']))

In [31]:
# Create a submission file
submission = pd.DataFrame({'ID': test_data_encoded['ID'], 'y': test_predictions})

# Save the submission file
submission.to_csv('submission.csv', index=False)


In [32]:
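# Calculate and print the R-squared score on a 20% split of the training data.
# Caveat: `model` was fit on all of X_train above, so these rows were seen during
# training and the score below is in-sample (optimistic), not a true validation score.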
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train, y_train, test_size=0.2, random_state=99)
y_val_pred = model.predict(X_val_split)
r2 = r2_score(y_val_split, y_val_pred)
print(f'R-squared Score on Validation Set: {r2}')

R-squared Score on Validation Set: 0.9195556437080905
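
Because the model above was fit on all of `X_train` before the split was made, the rows scored here are not unseen data. A hold-out estimate needs the split to happen before fitting. The sketch below shows that ordering on synthetic data; the arrays `X` and `y`, the name `holdout_model`, and the smaller `n_estimators` are assumptions for illustration only, and no score from the real dataset is implied.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features: 300 rows of binary columns
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 10)).astype(float)
y = 80 + 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 3, size=300)

# 1. Split first, so the validation rows are never seen during training
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=99)

# 2. Fit only on the training portion
holdout_model = RandomForestRegressor(n_estimators=50, random_state=42)
holdout_model.fit(X_fit, y_fit)

# 3. Compare the in-sample score with the hold-out score
r2_in_sample = r2_score(y_fit, holdout_model.predict(X_fit))
r2_holdout = r2_score(y_val, holdout_model.predict(X_val))
print(f"In-sample R-squared: {r2_in_sample:.3f}")
print(f"Hold-out R-squared:  {r2_holdout:.3f}")
```

In the notebook, the same ordering would mean calling `train_test_split` on `X_train` and `y_train` before `model.fit`, then refitting on the full training set once the hold-out score looks acceptable.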
