This notebook follows a tutorial video by Krish Naik: https://youtu.be/p_tpQSY1aTs


In [None]:
%config Completer.use_jedi = False

In [None]:
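import seaborn as sns  # imported here because this cell runs before the imports cell

# Default figure size and plot style
sns.set(rc={'figure.figsize':(18,10)})
sns.set_style({'axes.facecolor':'white', 'grid.color': '.8', 'font.family':'Times New Roman'})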

In [None]:
# Colors

cyan = '#00FFD1'
red = '#FF007D'
prussian = '#0075FF'
green = '#EEF622'
yellow = '#FFF338'
violet = '#9B65FF'
orange = '#FFA500'
blue = '#00EBFF'
vermillion = '#FF6900'
red2 = '#FF2626'
seagreen = '#28FFBF'
green2 = '#FAFF00'
navyblue = '#04009A'
darkgreen = '#206A5D'
lightgreen = '#CCF6C8'
pink = '#F35588'
mauve = '#BAABDA'
lightblue = '#1CC5DC'
mustard = '#FDB827'
deeppurple = '#723881'

color_list = [cyan,red,prussian,green,violet,orange,yellow,blue,vermillion,red2,seagreen,green2,navyblue,darkgreen,lightgreen,pink,mauve,lightblue,mustard,deeppurple]
import matplotlib.pyplot as plt  # imported here because this cell runs before the imports cell
plt.rcParams['axes.prop_cycle'] = plt.cycler(color=color_list)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import random

In [None]:
df = pd.read_csv('../input/vehicle-dataset-from-cardekho/car data.csv')

In [None]:
df.head()

In [None]:
df.shape

#### Categorical and Numerical

* Categorical - ['Fuel_Type', 'Seller_Type', 'Transmission', 'Owner'] 
* Numerical - ['Selling_Price', 'Present_Price', 'Kms_Driven' ]

In [None]:
print(df['Fuel_Type'].unique())
print(df['Seller_Type'].unique())
print(df['Transmission'].unique())
print(df['Owner'].unique())
print(df['Year'].unique())

#### Checking missing or null values

In [None]:
df.isnull().sum() 

In [None]:
df.describe()

In [None]:
df.columns

In [None]:
# Dropping car names; .copy() avoids a SettingWithCopyWarning when columns are added later
final_dataset = df[['Year', 'Selling_Price', 'Present_Price', 'Kms_Driven',
       'Fuel_Type', 'Seller_Type', 'Transmission', 'Owner']].copy()

In [None]:
final_dataset.head()

In [None]:
# Age of the car in years, measured against 2020
final_dataset['Current_yr'] = 2020
final_dataset['No_yr'] = final_dataset['Current_yr']-final_dataset['Year']

In [None]:
final_dataset.head()

In [None]:
# dropping Year and Current year
final_dataset.drop(['Year','Current_yr'],axis=1,inplace=True)

In [None]:
final_dataset.head()

### One hot encoding

In [None]:
final_dataset = pd.get_dummies(final_dataset,drop_first=True) 

In [None]:
final_dataset.head()

Explanation:

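Fuel_Type: there were 3 categories, ['Petrol' 'Diesel' 'CNG']. With drop_first=True the first category (CNG) is dropped to prevent the dummy variable trap: a car that is neither Diesel nor Petrol must be CNG, so a third column would carry no extra information.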
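
As a minimal sketch of what `drop_first=True` does, the toy frame below encodes a single `Fuel_Type` column with and without the option. The rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"]})

# Without drop_first: one column per category
full = pd.get_dummies(toy)

# With drop_first: the alphabetically first category (CNG) is dropped
encoded = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))     # three dummy columns
print(list(encoded.columns))  # only the Diesel and Petrol columns remain
print(encoded)                # the CNG row has no column set
```

The CNG row is still identifiable because it is the only one where both remaining columns are off.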

#### Correlation

In [None]:
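pair_color = random.choice(color_list)  # palette is only used together with hue
sns.pairplot(final_dataset, plot_kws={'color': pair_color}, diag_kws={'color': pair_color});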

In [None]:
fig, ax = plt.subplots(figsize=(18,14))
corrmat = final_dataset.corr()  # pairwise correlations between all columns

# Mask the upper triangle so each pair of features is shown only once
mask = np.triu(np.ones_like(corrmat, dtype=bool))
sns.heatmap(corrmat,cmap='BrBG',linewidths=1.5,ax=ax,annot=True,center=0,square=True,mask=mask);

In [None]:
final_dataset.head()

Here Selling_Price is the dependent feature (the target) and the remaining columns are the independent features.

In [None]:
X = final_dataset.iloc[:,1:] # independent features
y = final_dataset.iloc[:,0] # dependent feature (Selling_Price)

In [None]:
X.head()

In [None]:
y.head()

In [None]:
# Fit an extra-trees model to rank the features by importance
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X,y)

In [None]:
print(model.feature_importances_)

Hence the first feature (Present_Price), then the fifth (Fuel_Type_Diesel), then the seventh, and so on have the highest importance.

In [None]:
feat_importances = pd.Series(model.feature_importances_, index = X.columns)
feat_importances.nlargest(5).plot(kind = 'barh',color=color_list) # top 5
plt.title('Top 5',fontsize=30)
plt.show()

In [None]:
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X,y,test_size = 0.2)

In [None]:
print(train_X.shape)
print(test_X.shape)
print(train_y.shape)
print(test_y.shape)

# Random Forest

We could also use linear regression, lasso, and so on, but in the tutorial random forest gave better results.

With a random forest we don't have to scale the values: it is built from decision trees, which split on thresholds, so feature scaling is usually not required.
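
The scaling point can be checked with a small experiment. The sketch below uses made-up synthetic data (not the car dataset) and assumes that rescaling one feature by a constant factor is a fair stand-in for feature scaling: it fits two forests with the same `random_state`, one on the raw features and one with the first feature multiplied by 1000, and compares their predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical synthetic data: kilometres driven and car age
kms = rng.integers(1_000, 200_000, size=200)
age = rng.integers(1, 16, size=200)
X_raw = np.column_stack([kms, age]).astype(float)
y_toy = 10 - 0.00002 * kms - 0.3 * age + rng.normal(0, 0.2, size=200)

# Same data with the first feature expressed in metres instead of kilometres
X_scaled = X_raw.copy()
X_scaled[:, 0] *= 1000

# Same random_state so both forests draw the same bootstrap samples
rf_raw = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_raw, y_toy)
rf_scaled = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_scaled, y_toy)

# Compare predictions made on the matching version of the data
same = np.allclose(rf_raw.predict(X_raw), rf_scaled.predict(X_scaled))
print(same)
```

Rescaling changes the split thresholds but not the order of the values, so the trees partition the rows in the same way.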

In [None]:
from sklearn.ensemble import RandomForestRegressor
rf_random = RandomForestRegressor() 

In [None]:
# Hyperparameters
import numpy as np

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]

# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for regressors it meant all features, i.e. 1.0)
max_features = [1.0,'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5,30,num = 6)]

# Minimum number of samples required to split a node
min_samples_split = [2,5,10,15,100]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1,2,5,10]

# Cross-checking
print(n_estimators)
print(max_depth)

In [None]:
# RandomizedSearchCV samples parameter combinations (number of estimators,
# max features, depth, ...) and keeps the best one it finds
from sklearn.model_selection import RandomizedSearchCV 

These are the candidate hyperparameter values for the trees in the forest; next we collect them into a grid.

In [None]:
# Remember to pass the grid as key-value pairs (a dict)
random_grid = {'n_estimators': n_estimators,
              'max_features': max_features,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf}
print(random_grid)

These are the parameter values that the search will sample from.
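
To get a feel for why a randomized search is used, the sketch below counts the combinations in a grid shaped like the one above and draws 10 of them. It rebuilds the grid so that it runs on its own, uses `1.0` for "all features", and assumes `n_iter=10` and `cv=5` as the search settings.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Rebuilt copy of the search grid so this example is self-contained
grid = {
    "n_estimators": [int(x) for x in np.linspace(start=100, stop=1200, num=12)],
    "max_features": [1.0, "sqrt"],
    "max_depth": [int(x) for x in np.linspace(5, 30, num=6)],
    "min_samples_split": [2, 5, 10, 15, 100],
    "min_samples_leaf": [1, 2, 5, 10],
}

# Exhaustive grid: 12 * 2 * 6 * 5 * 4 = 2880 combinations
total = len(ParameterGrid(grid))

# A randomized search only tries n_iter of them
sampled = list(ParameterSampler(grid, n_iter=10, random_state=42))

# Each sampled combination is fitted once per cross-validation fold (cv=5 assumed)
n_fits = len(sampled) * 5

print(total, len(sampled), n_fits)
```

Trying all 2880 combinations with 5-fold cross-validation would take 14400 fits, while the randomized search needs 50.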

In [None]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune

# Initializing a random forest regressor
rf = RandomForestRegressor() # passed as the estimator argument below

In [None]:
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, scoring='neg_mean_squared_error', n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=1)

verbose=2 prints the progress of every fit in the output below.

In [None]:
rf_random.fit(train_X, train_y)
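
After fitting, it can help to look at which combination the search picked. The sketch below is a hedged illustration on synthetic data from `make_regression`, with a deliberately tiny grid (`small_grid`, a made-up name) so that it runs quickly; the same attributes should be available on the `rf_random` object above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical synthetic data standing in for train_X / train_y
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

# Deliberately small grid so the example runs in a few seconds
small_grid = {"n_estimators": [10, 20], "max_depth": [3, 5], "min_samples_leaf": [1, 2]}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=small_grid,
    scoring="neg_mean_squared_error",
    n_iter=4,
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

# best_score_ is the negated mean squared error, so flip the sign before the square root
best_rmse = np.sqrt(-search.best_score_)

print(search.best_params_)
print(best_rmse)
```

Because the scoring is `neg_mean_squared_error`, scores are negative and values closer to zero are better.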

## Prediction

In [None]:
predictions = rf_random.predict(test_X)

In [None]:
predictions

To compare the predictions with the actual values, we plot the distribution of the residuals (test_y minus predictions):

In [None]:
# distplot is deprecated in recent seaborn; histplot with kde=True is the replacement
sns.histplot(test_y-predictions,kde=True,color=random.choice(color_list));

The residuals are centred close to zero and their distribution looks roughly normal, which suggests that the model's errors are mostly small and not biased in one direction.

The closer this curve is to a narrow Gaussian around zero, the smaller the distance between the predicted and the actual values.

In [None]:
plt.scatter(test_y,predictions,color=random.choice(color_list));

Here also the scatter plot shows that the values of test_y and the predictions fall almost on a straight line, so the model fits the test data well.
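
The visual checks can be backed up with numbers. The sketch below computes three common regression metrics with `sklearn.metrics` on a few made-up values; in the notebook, `test_y` and `predictions` would take the place of the toy arrays.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical values, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)  # average absolute error
mse = mean_squared_error(y_true, y_pred)   # average squared error
rmse = np.sqrt(mse)                        # same units as the target
r2 = r2_score(y_true, y_pred)              # 1.0 would be a perfect fit

print(mae, mse, rmse, r2)
```

Lower MAE and RMSE and an R² closer to 1 point the same way as residuals clustered around zero.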