# Vehicle price prediction using several regression models, with step-by-step data cleaning, exploration, feature selection and engineering, modelling, model evaluation and a conclusion

##  Get to know the dataset and the features
__Thoughts on each feature and hypotheses about their effect:__
- selling_price: This is our target variable, since we are predicting the price of vehicles.
- name: Name and model of the vehicle. I think the brand of the car will be somewhat valuable, since most of us do care about our car's brand, don't we?
- year: The year the vehicle was bought. I assume it will be more informative if we convert it into another feature called 'age' by subtracting 'year' from the current year (2021). I also assume older vehicles should be cheaper.
- km_driven: This indirectly tells us the condition of the vehicle. A vehicle that has travelled a longer distance tends to be older, and hence its selling price should be lower.
- fuel: Diesel vs petrol should make a difference, since neither the fuel price nor the engine type is the same.
- seller_type: I assume the selling price for an 'Individual' seller should be lower, since a 'Dealer' often charges a commission, a service fee or some other fee.
- transmission: From my experience, I assume a 'Manual' car should be cheaper than an 'Automatic' one.
- owner: This specifies the number of owners the vehicle has had. I assume the more previous owners a vehicle has had, the cheaper it will be.
- mileage: This is the fuel efficiency metric. I assume a vehicle with higher mileage should have a higher selling_price.
- engine: The cubic capacity (CC) of the engine. I assume a vehicle with higher CC should have a higher selling_price.
- max_power: The brake horsepower (BHP) of the vehicle. I assume a vehicle with higher BHP should have a higher selling_price.
- torque: The torque of the vehicle. For modelling purposes, this does not carry much usable information since the values are rated at different rpm, so I will drop this feature.
- seats: The number of seats can represent the size of the vehicle. I assume a vehicle with more seats will have a higher selling price.

#### 1. Import the modules used for the analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import warnings
warnings.simplefilter(action='ignore')

#### 2. Read the data into a pandas DataFrame and look at the first five rows of data

In [None]:
df = pd.read_csv('../input/vehicle-dataset-from-cardekho/Car details v3.csv')
df.head()

#### 3. Check the size of the original dataset

In [None]:
df.shape

#### 4. Check for duplicated data

In [None]:
df.duplicated().any()

#### 5. Remove duplicated data

In [None]:
df = df.drop_duplicates()

#### 6. Check the size of the data with no duplicated records

In [None]:
df.shape

#### 7. Basic information of the columns

In [None]:
df.info()

#### 8. Dropping the 'torque' column

In [None]:
df.drop(['torque'],axis=1, inplace = True)
df.head()

#### 9. Check for missing values

In [None]:
df.isnull().any()

#### 10. Missing values in percentage of the total samples

In [None]:
df.isnull().sum() / df.shape[0] * 100

#### 11. Since we have more than 6000 samples, which is decent, I decided to remove the rows with missing values, as they only make up about 3% of the data

In [None]:
df.dropna(axis=0, inplace=True)
df.isnull().any()

#### 12. Check the size of the dataset after removing the missing values to make sure only a small number of samples were deleted

In [None]:
df.shape

## Data Cleaning

#### 1. Adding the 'age' feature which is a better feature for modelling purpose and removing the 'year' column
#### 2. Replacing the string in 'owner' with numerical representation for better illustration

In [None]:
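# Age of the car in years, relative to 2021 (the year of writing)
df['age'] = 2021 - df['year']
df.drop(['year'], axis=1, inplace=True)
# Map the ownership labels to numbers; labels not listed in the mapping are left unchanged
df['owner'] = df['owner'].replace({'First Owner': 1, 'Second Owner': 2, 'Third Owner': 3})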
df.head()

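#### 3. Cleaning the data by removing the unit strings from the values

In [None]:
# Remove the literal unit suffixes so that only the numeric part remains.
# (str.strip('kmpl') strips a set of characters, not the suffix itself.)
df['mileage'] = df['mileage'].str.replace(r'\s*(kmpl|km/kg)$', '', regex=True)
df['engine'] = df['engine'].str.replace(r'\s*CC$', '', regex=True)
df['max_power'] = df['max_power'].str.replace(r'\s*bhp$', '', regex=True).str.strip()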
df.head()
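
Pandas' `str.strip(chars)` treats its argument as a set of characters rather than a literal suffix, which is easy to trip over when removing units. The following is a minimal sketch on made-up values (not the dataset) of that behaviour, and of pulling out the leading number with `str.extract` as one possible alternative. The series `raw` and the regular expression are illustrative assumptions.

```python
import pandas as pd

# Made-up values in the same shape as the raw columns (number + unit suffix).
raw = pd.Series(["23.4 kmpl", "17.3 km/kg", "1248 CC", "74 bhp", " bhp"])

# str.strip("kmpl") removes any of the characters k, m, p, l from both ends;
# it does not match the literal suffix "kmpl", so "17.3 km/kg" is left untouched.
stripped = raw.str.strip("kmpl")

# Extracting the leading number is explicit about what is kept.
# Rows without a number (such as " bhp") become NaN.
numeric = pd.to_numeric(raw.str.extract(r"(\d+(?:\.\d+)?)", expand=False))
print(stripped.tolist())
print(numeric.tolist())
```

Rows that end up as NaN here would still need to be dropped or imputed before modelling.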

#### 4. Converting the cleaned columns to numeric format since they are continuous numerical data

In [None]:
df['mileage'] = pd.to_numeric(df['mileage'])
df['engine'] = pd.to_numeric(df['engine'])
df['max_power'] = pd.to_numeric(df['max_power'])

#### 5. Converting the datatype of 'seats' to string since it is categorical data

In [None]:
df['seats'] = df['seats'].astype(str)

## Let's do some exploratory data analysis (EDA)

### Univariate Analysis

#### 1. Plotting histograms to visualize the distribution of all the numerical data

In [None]:
# Six numerical columns on a 3x2 grid ('seats' is treated as categorical and plotted later)
fig = make_subplots(rows=3, cols=2,subplot_titles=("Selling Price in Rupee", "Total KM Driven", "Fuel Efficiency in KM per litre",
                                                   "Engine CC", "Brake Horse Power(BHP)", "Age of Car"))

fig.add_trace(
    go.Histogram(x=df['selling_price'], name="Rupee"),
    row=1, col=1
)

fig.add_trace(
    go.Histogram(x=df['km_driven'], name="KM"),
    row=1, col=2
)

fig.add_trace(
    go.Histogram(x=df['mileage'], name="KM/L"),
    row=2, col=1
)

fig.add_trace(
    go.Histogram(x=df['engine'], name="CC"),
    row=2, col=2
)

fig.add_trace(
    go.Histogram(x=df['max_power'], name="BHP"),
    row=3, col=1
)

fig.add_trace(
    go.Histogram(x=df['age'], name="Years"),
    row=3, col=2
)



fig.update_layout(height=1400, width=800, title_text="Distribution of numerical data")
fig.show()

- __From the histograms above, we can see that 'selling_price', 'km_driven', 'max_power' and 'age' look positively skewed, while 'mileage' looks roughly normal and 'engine' does not seem to follow any particular distribution.__
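
The visual impression of skewness can also be checked numerically. Below is a minimal sketch on synthetic, log-normally distributed values (not the CarDekho data) showing how the sample skewness changes after a log-transformation. The distribution parameters are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, log-normally distributed "prices" (illustrative only).
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.8, size=5000))

skew_raw = prices.skew()          # clearly positive for right-skewed data
skew_log = np.log(prices).skew()  # close to 0 once the data is log-transformed
print(f"skewness before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

On the real DataFrame, the same idea would be `df['selling_price'].skew()`, assuming the column has already been converted to a numeric type.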

#### 2. Plotting boxplots to explore all the numerical data

In [None]:
# Six numerical columns on a 3x2 grid ('seats' is treated as categorical and plotted later)
fig = make_subplots(rows=3, cols=2,subplot_titles=("Selling Price in Rupee", "Total KM Driven", "Fuel Efficiency in KM per litre",
                                                   "Engine CC", "Brake Horse Power(BHP)", "Age of Car"))

fig.add_trace(
    go.Box(x=df['selling_price'], name="Rupee"),
    row=1, col=1
)

fig.add_trace(
    go.Box(x=df['km_driven'], name="KM"),
    row=1, col=2
)

fig.add_trace(
    go.Box(x=df['mileage'], name="KM/L"),
    row=2, col=1
)

fig.add_trace(
    go.Box(x=df['engine'], name="CC"),
    row=2, col=2
)

fig.add_trace(
    go.Box(x=df['max_power'], name="BHP"),
    row=3, col=1
)

fig.add_trace(
    go.Box(x=df['age'], name="Years"),
    row=3, col=2
)



fig.update_layout(height=1400, width=800, title_text="Distribution of numerical data")
fig.show()



- __From the box plots above, we can see that all of the variables contain outliers. We will decide which outliers to remove later, during the feature engineering phase.__
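
Box plots flag points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule by hand to made-up odometer readings. It is only an illustration of what the plot marks as an outlier; this notebook uses hand-picked thresholds instead.

```python
import pandas as pd

# Made-up odometer readings in km; the last value is an obvious outlier.
km = pd.Series([20_000, 35_000, 50_000, 60_000, 75_000, 90_000, 1_500_000])

q1, q3 = km.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the box plot "fences"

outliers = km[(km < lower) | (km > upper)]
print(outliers.tolist())   # [1500000]
```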

#### 3. Plotting bar graphs to show the distribution of all the categorical data

In [None]:
# Count the rows per category. rename_axis(...).reset_index(name='count') yields the
# columns ['<feature>', 'count'] regardless of the pandas version in use.
count_fuel = df['fuel'].value_counts().rename_axis('fuel').reset_index(name='count')

count_seller = df['seller_type'].value_counts().rename_axis('seller_type').reset_index(name='count')

count_transmission = df['transmission'].value_counts().rename_axis('transmission').reset_index(name='count')

count_owner = df['owner'].value_counts().rename_axis('owner').reset_index(name='count')

count_seats = df['seats'].value_counts().rename_axis('seats').reset_index(name='count')

In [None]:
fig = make_subplots(rows=3, cols=2,subplot_titles=("Fuel Type", "Seller Type", "Transmission Type",
                                                   "Number of Owners", "Number of Seats"))

fig.add_trace(
    go.Bar(y=count_fuel['count'], x=count_fuel['fuel'], name="Fuel type"),
    row=1, col=1
)

fig.add_trace(
    go.Bar(y=count_seller['count'], x=count_seller['seller_type'], name="Seller type"),
    row=1, col=2
)

fig.add_trace(
    go.Bar(y=count_transmission['count'], x=count_transmission['transmission'], name="Transmission"),
    row=2, col=1
)

fig.add_trace(
    go.Bar(y=count_owner['count'], x=count_owner['owner'], name="Number of owners"),
    row=2, col=2
)

fig.add_trace(
    go.Bar(y=count_seats['count'], x=count_seats['seats'], name="Number of seats"),
    row=3, col=1
)

fig.update_layout(height=1000, width=800, title_text="Distribution of categorical data")
fig.show()



### Bivariate/Multivariate Analysis

#### 1. Plotting the correlation coefficient heatmap to visualize the relationship between the numerical variables

In [None]:
# Restrict to numeric columns: recent pandas versions raise an error if corr() meets strings
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap="RdBu")
plt.show()

- __We can look at the selling_price row to see the correlation coefficient between each numerical feature and the target variable. A coefficient with a larger absolute value represents a stronger linear relationship between the two variables, regardless of the sign.__
- __Check for multicollinearity among the feature variables. As a rule of thumb, an absolute coefficient below 0.8 is acceptable.__
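
Reading the 0.8 rule of thumb off a heatmap gets tedious with many columns, so the check can be automated. The following is a minimal sketch on a synthetic frame (`toy` and its columns are assumptions built for illustration, with 'max_power' deliberately constructed to track 'engine') that lists every feature pair whose absolute correlation exceeds 0.8.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
engine = rng.normal(1500, 400, size=500)
# Synthetic frame: 'max_power' follows 'engine' closely, 'mileage' is independent noise.
toy = pd.DataFrame({
    "engine": engine,
    "max_power": 0.06 * engine + rng.normal(0, 5, size=500),
    "mileage": rng.normal(19, 4, size=500),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high = upper.stack()
high = high[high > 0.8]
print(high)
```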

#### 2. Plotting the scatterplots for each numerical variable to visualize their relationships

In [None]:
sns.pairplot(df)

## Feature Selection, Feature Engineering and Data Preparation for Modelling

__Now we look at each feature and decide which ones to include when training our model. Selecting features that represent the data well is crucial for building a model that generalizes well (which is our ultimate goal). Some features can be engineered (feature engineering) so that they represent the data better, which in turn yields a better model. For example, the original 'year' column tells us which year the car was bought. If we subtract it from the current year (2021 as of writing), we get the age of the car, which is an important factor when considering a used car. You can think of features as the food for a growing baby (the model): the more nutritious the food (better features) you feed the baby, the smarter the baby becomes (better predictions). So feature selection and feature engineering are as important as choosing a good model to fit your data.__
- name: We can obtain the brand of the vehicle and discard the rest of the information. I will remove this column and add a 'brand' column for the vehicle brand.
- selling_price (in Rupee): From the histogram, we can see the target variable is positively skewed. Hence, I am going to log-transform it so that it behaves more like a normal distribution, which helps the linear regression models. As we can see from the boxplot, this target variable is also very noisy, so I am going to remove all rows with a selling price above 2.5M Rupee.
- km_driven: As we can see from the boxplot, this feature contains many outliers, so I decided to remove all rows above 300k km.
- fuel: I will remove the CNG and LPG fuelled cars, since their 'mileage' is measured in 'km/kg' while the 'mileage' of petrol and diesel vehicles is measured in 'km/l'. The CNG and LPG fuelled cars also make up a very small part of the sample.
- mileage: There are some outliers. I will remove vehicles whose 'mileage' is below 5 or above 35.
- max_power: There are some outliers. I will remove vehicles with BHP > 300. I will also log-transform this positively skewed feature for better performance in the linear models.
- age: As we can see from the histogram, 'age' is positively skewed. I am going to log-transform it so that it behaves more like a normal distribution, for the benefit of the linear models.
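
One consequence of log-transforming the target is that the model's predictions are on the log scale, so they have to be passed through `np.exp` to get back to Rupee. The sketch below, using made-up prices, shows the round trip and how an additive error on the log scale turns into a multiplicative error on the price.

```python
import numpy as np

# Made-up selling prices in Rupee (illustrative only).
price = np.array([150_000, 450_000, 2_000_000], dtype=float)

log_price = np.log(price)        # what the model is trained on
recovered = np.exp(log_price)    # back to Rupee

# An error of 0.1 on the log scale is a multiplicative error of exp(0.1),
# i.e. about 10.5 %, whatever the price level.
pred_log = log_price + 0.1
ratio = np.exp(pred_log) / price
print(ratio)
```

This also means that scores computed against the log-transformed target describe the fit on the log scale, not on the raw prices.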

#### 1. Feature Engineering

In [None]:
# Make a copy of the data for modelling
df_model = df.copy()

# Create the 'brand' column by splitting the 'name' column
df_model['brand'] = df_model['name'].str.split(' ').str.get(0)
df_model.drop(['name'],axis=1,inplace=True)

# Filter the outlier and log-transform the target variable('selling_price')
df_model = df_model[df_model['selling_price'] < 2500000]
df_model['selling_price'] = np.log(df_model['selling_price'])

# Filter the outlier in 'km_driven' feature
df_model = df_model[df_model['km_driven'] < 300000]

# Filter the unwanted rows in 'fuel' feature
df_model = df_model[~df_model['fuel'].isin(['CNG','LPG'])]

# Filter the outliers in 'mileage' feature
df_model = df_model[(df_model['mileage'] > 5) & (df_model['mileage'] < 35)]

# Filter the outlier in 'max_power' feature and log-transform the data.
df_model = df_model[df_model['max_power'] < 300]
df_model['max_power'] = np.log(df_model['max_power'])

# Log-transform the 'age' feature data.
df_model['age'] = np.log(df_model['age'])


# Show the first five records of the feature engineered DataFrame.
df_model.head()


#### 2. One-hot encoding to represent the categorical data for regression modelling

In [None]:
df_model = pd.get_dummies(data = df_model, drop_first=True)
df_model.head()
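
To see what `drop_first=True` does, here is a minimal sketch on a three-row toy frame (the values are made up). For each categorical column, the first category in sorted order is dropped and serves as the baseline, which avoids perfectly collinear dummy columns in the linear models.

```python
import pandas as pd

toy = pd.DataFrame({
    "fuel": ["Diesel", "Petrol", "Diesel"],
    "transmission": ["Manual", "Automatic", "Manual"],
    "engine": [1248, 1197, 1498],
})

# Numeric columns pass through; each string column becomes (n_categories - 1) dummies.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
# ['engine', 'fuel_Petrol', 'transmission_Manual']
```

Here 'Diesel' and 'Automatic' are the implicit baselines: a row with `fuel_Petrol` equal to 0 is a diesel car.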

#### 3. Assigning the feature variables and the target variable

In [None]:
X = df_model.drop(['selling_price'],axis=1)
y = df_model['selling_price']

#### 4. Splitting the dataset into training set(for modelling) and test set(for evaluation)

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
print("x train: ",X_train.shape)
print("x test: ",X_test.shape)
print("y train: ",y_train.shape)
print("y test: ",y_test.shape)

#### 5. Scaling the numerical data.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
num_var = ['km_driven', 'mileage', 'engine', 'max_power', 'age']
X_train[num_var] = scaler.fit_transform(X_train[num_var])
X_test[num_var] = scaler.transform(X_test[num_var])
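
The scaler is fitted on the training set only and then reused on the test set, so that no information from the test rows leaks into the preprocessing. A minimal sketch with made-up numbers shows the effect: the test value is standardized with the training mean and standard deviation, so it does not have to end up near zero.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[40.0]])

scaler = StandardScaler().fit(train)      # mean and std come from the training rows only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)      # reuses the training mean (20) and std

print(train_scaled.ravel(), test_scaled.ravel())
```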

#### 6. Automatic feature selection using Recursive Feature Elimination (RFE). I have just learnt about RFE and am implementing it as an extra experiment to check its effect on model performance.

In [None]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
select = RFE(RandomForestRegressor(n_estimators=100, random_state=42),
                 n_features_to_select=40)
select.fit(X_train, y_train)
X_train_rfe= select.transform(X_train)
X_test_rfe= select.transform(X_test)
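
After fitting, an `RFE` object exposes which features survived through `support_` and the order of elimination through `ranking_`. The sketch below illustrates this on synthetic data. `LinearRegression` and `n_features_to_select=3` are assumptions made to keep the example fast, whereas the notebook uses a random forest and keeps 40 features.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 features, only 3 of which carry signal.
X_toy, y_toy = make_regression(n_samples=200, n_features=8, n_informative=3,
                               noise=1.0, random_state=0)

# RFE repeatedly fits the estimator and drops the weakest feature until 3 remain.
selector = RFE(LinearRegression(), n_features_to_select=3).fit(X_toy, y_toy)

print("kept feature indices:", np.flatnonzero(selector.support_))
print("elimination ranking :", selector.ranking_)   # 1 marks a kept feature
```

On the notebook's data, `X_train.columns[select.support_]` should give the names of the retained columns, assuming `X_train` is still a DataFrame at that point.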

#### 7. Main function to fit each regression model, compute the R2 scores and the cross-validation score, and plot the residual distribution and a scatterplot of y_test_pred vs y_test

In [None]:
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

r2_train_scores = []
r2_test_scores = []
cv_mean = []

def car_price_prediction_model(model):
    model.fit(X_train, y_train)
    
    #R2 score of training set
    y_train_pred = model.predict(X_train)
    r2_train = r2_score(y_train, y_train_pred)
    r2_train_scores.append(round(r2_train,2))
    
    #R2 score of test set
    y_test_pred = model.predict(X_test)
    r2_test = r2_score(y_test, y_test_pred)
    r2_test_scores.append(round(r2_test,2))
    
    # CV score of training set
    cv_training = cross_val_score(model, X_train, y_train, cv=5)
    cv_mean_training = cv_training.mean()
    cv_mean.append(round(cv_mean_training,2))
    
    
    
    # Printing each score
    print("Training set R2 scores: ",round(r2_train,2))
    print("Test set R2 scores: ",round(r2_test,2))
    print("Training cross validation score: ", cv_training)
    print("Training cross validation mean score: ",round(cv_mean_training,2))
    
    
    fig, ax = plt.subplots(1,2,figsize = (10,4))
    # Density of the training residuals (kdeplot replaces the deprecated sns.distplot)
    ax[0].set_title('Residual Plot of Train samples')
    sns.kdeplot(y_train-y_train_pred, ax = ax[0])
    ax[0].set_xlabel('residual')
    
    # y_test vs y_test_pred scatter plot
    ax[1].set_title('y_test vs y_pred_test')
    ax[1].scatter(x = y_test, y = y_test_pred)
    ax[1].set_xlabel('y_test')
    ax[1].set_ylabel('y_pred_test')
    
    plt.show()
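
For reference, the R2 score reported by `r2_score` (and used by default in `cross_val_score` for regressors) is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch with made-up numbers computes it by hand:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))   # 0.949
```

A score of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of `y_true`.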

#### 8. Main function to fit each regression model on the RFE dataset, compute the R2 scores and the cross-validation score, and plot the residual distribution and a scatterplot of y_test_pred_rfe vs y_test

In [None]:
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score

r2_train_scores_rfe = []
r2_test_scores_rfe = []
cv_mean_rfe = []

def car_price_prediction_model_rfe(model):
    model.fit(X_train_rfe, y_train)
    
    
    #R2 score of RFE training set
    y_train_pred_rfe = model.predict(X_train_rfe)
    r2_train_rfe = r2_score(y_train, y_train_pred_rfe)
    r2_train_scores_rfe.append(round(r2_train_rfe,2))
    
    #R2 score of RFE test set
    y_test_pred_rfe = model.predict(X_test_rfe)
    r2_test_rfe = r2_score(y_test, y_test_pred_rfe)
    r2_test_scores_rfe.append(round(r2_test_rfe,2))

    # CV score of RFE training set
    cv_training_rfe = cross_val_score(model, X_train_rfe, y_train, cv=5)
    cv_mean_training_rfe = cv_training_rfe.mean()
    cv_mean_rfe.append(round(cv_mean_training_rfe,2))
    
    # Printing each score
    print("Training set R2 scores: ",round(r2_train_rfe,2))
    print("Test set R2 scores: ",round(r2_test_rfe,2))
    print("Training cross validation score: ", cv_training_rfe)
    print("Training cross validation mean score: ",round(cv_mean_training_rfe,2))
    
    fig, ax = plt.subplots(1,2,figsize = (10,4))
    # Density of the training residuals (kdeplot replaces the deprecated sns.distplot)
    ax[0].set_title('Residual Plot of RFE-Train samples')
    sns.kdeplot(y_train-y_train_pred_rfe, ax = ax[0])
    ax[0].set_xlabel('residual')
    
    # y_test vs y_test_pred_rfe scatter plot
    ax[1].set_title('y_test vs y_pred_test_rfe')
    ax[1].scatter(x = y_test, y = y_test_pred_rfe)
    ax[1].set_xlabel('y_test')
    ax[1].set_ylabel('y_pred_test_rfe')
    
    plt.show()

## Regression Modelling and Evaluation

#### 1. Linear Regression (Ordinary Least Squares)

In [None]:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
car_price_prediction_model(lm)

#### RFE Version

In [None]:
car_price_prediction_model_rfe(lm)

#### 2. Linear Regression(Ridge)

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

rg = Ridge()
alpha = np.logspace(-3,3,num=14)
rg_rs = RandomizedSearchCV(estimator=rg, param_distributions=dict(alpha=alpha))
car_price_prediction_model(rg_rs)
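
Once a `RandomizedSearchCV` object has been fitted, the chosen hyperparameters can be inspected through `best_params_` and `best_score_`. The sketch below does this on synthetic data. The `random_state`, `n_iter` and `cv` values are assumptions added for reproducibility and are not set in the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

X_toy, y_toy = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

alphas = np.logspace(-3, 3, num=14)
# Samples 10 of the 14 candidate alphas and scores each with 5-fold cross-validation.
search = RandomizedSearchCV(Ridge(), param_distributions={"alpha": alphas},
                            n_iter=10, cv=5, random_state=0)
search.fit(X_toy, y_toy)

print("best alpha:", search.best_params_["alpha"])
print("best mean CV R2:", round(search.best_score_, 3))
```

Note that passing a search object to `cross_val_score` re-runs the whole search inside every outer fold, which is part of why the tuned models take noticeably longer to evaluate.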

#### RFE Version

In [None]:
car_price_prediction_model_rfe(rg_rs)

#### 3. Linear Regression(Lasso)

In [None]:
from sklearn.linear_model import Lasso
from sklearn.model_selection import RandomizedSearchCV

ls = Lasso()
alpha = np.logspace(-3,3,num=14)
ls_rs = RandomizedSearchCV(estimator=ls, param_distributions=dict(alpha=alpha))
car_price_prediction_model(ls_rs)

#### RFE Version

In [None]:
car_price_prediction_model_rfe(ls_rs)

#### 4. Extreme Gradient Boosting Regressor

In [None]:
from xgboost import XGBRegressor
xg = XGBRegressor(verbosity= 0)

n_estimators = [100, 500, 900, 1100, 1500]
max_depth = [2, 3, 5, 10, 15]
booster=['gbtree','gblinear']
learning_rate=[0.05,0.1,0.15,0.20]
min_child_weight=[1,2,3,4]
base_score=[0.25,0.5,0.75,1]


parameter_grid = {
    'n_estimators': n_estimators,
    'max_depth':max_depth,
    'learning_rate':learning_rate,
    'min_child_weight':min_child_weight,
    'booster':booster,
    'base_score':base_score
    }

xg_rs = RandomizedSearchCV(estimator=xg, param_distributions=parameter_grid)
            


In [None]:
car_price_prediction_model(xg_rs)

#### RFE Version

In [None]:
car_price_prediction_model_rfe(xg_rs)

#### 5. Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestRegressor()

# Number of trees in Random forest
n_estimators=list(range(500,1000,100))
# Maximum number of levels in a tree
max_depth=list(range(4,9,4))
# Minimum number of samples required to split an internal node
min_samples_split=list(range(4,9,2))
# Minimum number of samples required to be at a leaf node.
min_samples_leaf=[1,2,5,7]
# Number of features to be considered at each split
# (None means all features; the old 'auto' alias was removed in recent scikit-learn versions)
max_features=[None,'sqrt']

# Hyperparameters dict
param_grid = {"n_estimators":n_estimators,
              "max_depth":max_depth,
              "min_samples_split":min_samples_split,
              "min_samples_leaf":min_samples_leaf,
              "max_features":max_features}

rf_rs = RandomizedSearchCV(estimator = rf, param_distributions = param_grid)

In [None]:
car_price_prediction_model(rf_rs)

#### RFE Version

In [None]:
car_price_prediction_model_rfe(rf_rs)

#### 6. Gradient Boosting Regressor

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

gb = GradientBoostingRegressor()

# Rate at which correcting is being made
learning_rate = [0.001, 0.01, 0.1, 0.2]
# Number of trees in Gradient boosting
n_estimators=list(range(500,1000,100))
# Maximum number of levels in a tree
max_depth=list(range(4,9,4))
# Minimum number of samples required to split an internal node
min_samples_split=list(range(4,9,2))
# Minimum number of samples required to be at a leaf node.
min_samples_leaf=[1,2,5,7]
# Number of features to be considered at each split
# (None means all features; the old 'auto' alias was removed in recent scikit-learn versions)
max_features=[None,'sqrt']

# Hyperparameters dict
param_grid = {"learning_rate":learning_rate,
              "n_estimators":n_estimators,
              "max_depth":max_depth,
              "min_samples_split":min_samples_split,
              "min_samples_leaf":min_samples_leaf,
              "max_features":max_features}

gb_rs = RandomizedSearchCV(estimator = gb, param_distributions = param_grid)

In [None]:
car_price_prediction_model(gb_rs)

#### RFE Version

In [None]:
car_price_prediction_model_rfe(gb_rs)

## Model Evaluation and Conclusion

In [None]:
algo = ["LinearRegression(OLS)","LinearRegression(Ridge)","LinearRegression(Lasso)",
        "ExtremeGradientBoostingRegressor","RandomForestRegressor","GradientBoostingRegressor"]

model_eval = pd.DataFrame({'Model': algo,'R Squared(Train)': r2_train_scores,'R Squared(Test)': r2_test_scores,
                           'CV score mean(Train)': cv_mean})
display(model_eval)

In [None]:
model_eval_RFE = pd.DataFrame({'Model': algo,'R Squared(Train)': r2_train_scores_rfe,
                                'R Squared(Test)': r2_test_scores_rfe,'CV score mean(Train)': cv_mean_rfe})
display(model_eval_RFE)

## Conclusion

- Extreme Gradient Boosting Regressor is the model I will choose, since it has the highest mean CV score (91%), which suggests it generalizes better than the other models.
- A linear model is also a good choice if computational power is constrained, since the non-linear models are quite computationally expensive.
- The automatic feature selection (RFE) did not significantly improve any of the models. Hence we do not need it unless computation time is a concern.
