## MAJOR PROJECT IN MACHINE LEARNING


### Problem Statement: Predicting the prices of used cars from data collected from various sources across different locations in India.


## FEATURES 

* Name: The brand and model of the car. 
* Location: The location in which the car is being sold or is available for purchase. 
* Year: The year or edition of the model.
* Kilometers_Driven: The total kilometres driven in the car by the previous owner(s) in KM.
* Fuel_Type: The type of fuel used by the car.
* Transmission: The type of transmission used by the car. 
* Owner_Type: Whether the ownership is first-hand, second-hand, or other.
* Mileage: The standard mileage quoted by the car company, in kmpl or km/kg.
* Engine: The displacement volume of the engine in cc.
* Power: The maximum power of the engine in bhp. 
* Seats: The number of seats in the car. 
* Price: The price of the used car in INR Lakhs.

## IMPORTING THE LIBRARIES

In [1]:
from sklearn.model_selection import train_test_split  
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import math
from sklearn.metrics import mean_squared_error



## LOADING THE DATASETS

In [None]:
from google.colab import files
data_to_load = files.upload()

In [None]:
from google.colab import files
data_to_load = files.upload()

In [None]:
import pandas as pd
pr = pd.read_excel('Data_Train.xlsx')
prtest=pd.read_excel('Data_Test.xlsx')

#To see how many rows and columns are there in each dataset
pr.shape, prtest.shape

## EXPLORATORY DATA ANALYSIS

In [None]:
pr.describe()

In [None]:
# To see the number of unique values in each column
pr.nunique()

In [None]:
# To see the column Data Types and non missing values
pr.info()

- Check whether the same row has been added to the dataset more than once by mistake.


In [None]:
temp = pd.DataFrame(pr,columns=['Name','Location','Year','Kilometers_Driven','Fuel_Type','Transmission','Owner_Type','Mileage','Engine','Power','Seats','Price'])
dup_rows=temp[temp.duplicated(['Name','Location','Year','Kilometers_Driven','Fuel_Type','Transmission','Owner_Type','Mileage','Engine','Power','Seats','Price'])]
print(dup_rows)


- No duplicate entries were found.
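
As a minimal sketch of the same check on a small made-up frame (the rows below are hypothetical and not taken from the dataset), `DataFrame.duplicated()` called without arguments compares all columns and flags every repeat of an earlier row:

```python
import pandas as pd

# Hypothetical toy listings; the third row repeats the first one exactly.
toy = pd.DataFrame({
    "Name": ["Maruti Swift VDI", "Honda City V", "Maruti Swift VDI"],
    "Year": [2015, 2012, 2015],
    "Price": [4.5, 3.9, 4.5],
})

# duplicated() marks a row True when an identical row appeared earlier.
dup_mask = toy.duplicated()
dup_rows = toy[dup_mask]
print(dup_rows)

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates()
```

If duplicates had been present in `pr`, `drop_duplicates()` would be one way to remove them before modelling.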


## To see the unique values in each column 
 
 
 
- The first function, `nunique()`, counts the unique values present.
- The second function, `unique()`, returns those unique values.
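
A small sketch of how the two calls differ, using a hypothetical series rather than the real columns. Note that `nunique()` ignores missing values by default, while `unique()` includes them:

```python
import pandas as pd

# Hypothetical values for illustration only.
fuel = pd.Series(["Diesel", "Petrol", "Diesel", "CNG", None])

n_distinct = fuel.nunique()      # counts distinct values, ignoring the missing one
distinct_values = fuel.unique()  # returns the distinct values, including the missing one
counts = fuel.value_counts()     # frequency of each non-missing value

print(n_distinct)
print(distinct_values)
print(counts)
```

`value_counts()` is not used in the cells that follow; it is shown here only as a related call that also reports how often each value occurs.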


In [None]:
# To see the unique values present in each column
pr.Name.nunique()

In [None]:
pr.Name.unique()


In [None]:
pr.Location.nunique()

In [None]:
pr.Location.unique()


In [None]:
pr.Fuel_Type.nunique()

In [None]:

pr.Fuel_Type.unique()

In [None]:
pr.Transmission.nunique()

In [None]:
pr.Transmission.unique()

In [None]:
pr.Owner_Type.nunique()

In [None]:
pr.Owner_Type.unique()

In [None]:
pr.Seats.nunique()

In [None]:
pr.Seats.unique()

In [None]:
pr.Year.nunique()


In [None]:
pr.Year.unique()


## Graphical Representation and relation depiction of data

- Price is the target variable, so each of the following graphs plots a feature against Price.


In [None]:
import seaborn as sns
sns.set_style('darkgrid')
fig_dims = (13, 7)
fig, ax = plt.subplots(figsize=fig_dims)
sns.swarmplot(x='Location',y='Price',data=pr,ax=ax)

In [None]:
fig_dims = (13, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.scatterplot(x='Year',y='Price',data=pr,ax=ax)

In [None]:
fig_dims = (5, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x='Fuel_Type',y='Price',data=pr,ax=ax)

In [None]:
fig_dims = (5, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x='Owner_Type',y='Price',data=pr,ax=ax)

In [None]:
fig_dims = (3, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x='Transmission',y='Price',data=pr,ax=ax)

In [None]:
fig_dims = (13, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x='Seats',y='Price',data=pr,ax=ax)

In [None]:
sns.scatterplot(x='Kilometers_Driven',y='Price',data=pr)

In [None]:
fig_dims = (15, 5)
fig, ax = plt.subplots(figsize=fig_dims)
sns.scatterplot(x='Mileage',y='Price',data=pr,ax=ax)

In [None]:
fig_dims = (13, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.scatterplot(x='Power',y='Price',data=pr,ax=ax)

In [None]:
fig_dims = (13, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.scatterplot(x='Engine',y='Price',data=pr,ax=ax)

## COUNTING THE VALUES THAT NEED TO BE REFERENCED AND REMOVING THE OUTLIERS


In [None]:
fig_dims = (13, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.countplot(x='Location',data=pr,ax=ax)

In [None]:
fig_dims = (13, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.countplot(x='Year',data=pr,ax=ax)

In [None]:
fig_dims = (13, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.scatterplot(x='Year',y='Price',data=pr,ax=ax)

In [None]:
fig_dims = (5, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.countplot(x='Owner_Type',data=pr,ax=ax)

In [None]:
fig_dims = (4, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.countplot(x='Transmission',data=pr,ax=ax)

In [None]:
fig_dims = (13, 4)
fig, ax = plt.subplots(figsize=fig_dims)
sns.boxplot(x='Price',data=pr,ax=ax)

## DATA PREPROCESSING

In [None]:
# To find the no of missing values in each attribute of data
pr.isnull().sum()

In [None]:
# The original shape of dataset
pr.shape

In [None]:
#  To drop the rows having missing data
pr.dropna(how='any',inplace=True)


In [None]:
pr.isnull().sum()

In [None]:
# Removing the cars which are of extremely high price
pr = pr[pr['Name'] != 'Ambassador Classic Nova Diesel']
pr = pr[pr['Name'] != 'Lamborghini Gallardo Coupe']
pr = pr[pr['Name'] != 'Force One LX 4x4']
pr = pr[pr['Name'] != 'Force One LX ABS 7 Seating']
pr = pr[pr['Name'] != 'Smart Fortwo CDI AT']
len(pr)

- Keep only cars that have been driven more than 999 km and less than 700,000 km.


In [None]:
pr = pr[pr['Kilometers_Driven'] < 700000]
pr = pr[pr['Kilometers_Driven'] > 999]

- Consider only cars that do not run on electricity, because India does not yet have many electric charging stations.


In [None]:
pr = pr[pr['Fuel_Type'] != 'Electric']
len(pr)

In [None]:
# Forming a dataset containing both training and testing data for preprocessing of data 
tr=pd.concat([pr,prtest],ignore_index=True,sort=False)

In [None]:
tr.head()

In [None]:
# Creating a new variable 'Car age ' containing the age of car
tr['Car age']=2020-tr.Year

In [None]:
# Converting the string variables into numerical.
tr.Engine=tr.Engine.apply(lambda x: str(x).split(" ")[0]).astype(float)

In [None]:
tr.Mileage=tr.Mileage.apply(lambda x: str(x).split(" ")[0]).astype(float)

In [None]:
tr.Power=tr.Power.replace('null bhp','0 ')

In [None]:
tr.Power=tr.Power.apply(lambda x: str(x).split(" ")[0]).astype(float)
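
The three conversions above follow one pattern: keep the text before the first space and cast it to a number. A minimal sketch of that pattern as a single helper, run on hypothetical rows (the helper name `leading_number` is illustrative, not part of the notebook). Unlike the cells above, which turn `'null bhp'` into 0, this variant leaves unparseable entries as `NaN`, which is an assumption about how missing power should be treated:

```python
import pandas as pd

# Hypothetical raw values in the same "<number> <unit>" shape as the dataset columns.
raw = pd.DataFrame({
    "Mileage": ["19.67 kmpl", "26.6 km/kg", None],
    "Engine": ["1582 CC", "998 CC", "1199 CC"],
    "Power": ["126.2 bhp", "null bhp", "88.7 bhp"],
})

def leading_number(column: pd.Series) -> pd.Series:
    """Take the text before the first space and convert it to a float (NaN if not numeric)."""
    first_token = column.str.split(" ").str[0]
    return pd.to_numeric(first_token, errors="coerce")

# apply() runs the helper once per column.
clean = raw.apply(leading_number)
print(clean)
```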

In [None]:
tr.head()

In [None]:
# creating a new variable which contains the brand and model of the car (the first two words of Name)
tr['Brand'] = tr.Name.apply(lambda x: ' '.join(x.split(' ')[:2]))

In [None]:
tr.Price.sort_values(ascending=False)

In [None]:
tr.head()

In [None]:
# Shape of dataset after removing missing values
tr.shape


In [None]:
tr.isnull().sum()

In [None]:
def aggregate_functions(tr):        
    
    agg_func = {
        'Location' : ['count'],
        'Mileage' : ['mean'],
        'Power' : ['mean'],
        'Engine' : ['mean'] }
    
    agg_tr = tr.groupby(['Brand']).agg(agg_func)
    agg_tr.columns = ['_'.join(col).strip() for col in agg_tr.columns.values]
    agg_tr.reset_index(inplace=True)
    
    agg_tr = pd.merge(tr, agg_tr, on='Brand', how='left')
    
    return agg_tr
tr=aggregate_functions(tr)
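
To make the effect of `aggregate_functions` easier to see, here is a sketch of the same group-then-merge idea on a hypothetical three-row frame. It uses pandas named aggregation, which produces flat column names directly instead of joining a two-level column index; the column names `Location_count` and `Power_mean` mirror what the function above would generate:

```python
import pandas as pd

# Hypothetical rows; "Brand" plays the same role as the notebook's grouping key.
cars = pd.DataFrame({
    "Brand": ["Maruti Swift", "Maruti Swift", "Honda City"],
    "Location": ["Pune", "Delhi", "Kochi"],
    "Power": [74.0, 88.0, 117.0],
})

# Named aggregation: one output column per (input column, function) pair.
per_brand = cars.groupby("Brand", as_index=False).agg(
    Location_count=("Location", "count"),
    Power_mean=("Power", "mean"),
)

# A left merge broadcasts each group's statistics back onto its rows.
enriched = cars.merge(per_brand, on="Brand", how="left")
print(enriched)
```

Every row keeps its original values and gains the statistics of its group, so the row count stays the same while the column count grows.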

In [None]:
tr.head()

In [None]:
tr.shape

In [None]:
dummy_name=pd.get_dummies(tr.Name)# creating dummy variables for the categorical variables

In [None]:
tr=pd.concat([tr,dummy_name],axis=1)

In [None]:
tr.head()

In [None]:
dummyloc=pd.get_dummies(tr.Location)

In [None]:
tr=pd.concat([tr,dummyloc],axis=1)

In [None]:
dummyfuel=pd.get_dummies(tr.Fuel_Type)

In [None]:
tr=pd.concat([tr,dummyfuel],axis=1)

In [None]:
dummytrans=pd.get_dummies(tr.Transmission)

In [None]:
tr=pd.concat([tr,dummytrans],axis=1)

In [None]:
dummyowner=pd.get_dummies(tr.Owner_Type)

In [None]:
tr=pd.concat([tr,dummyowner],axis=1)

In [None]:
tr.head()

In [None]:
tr.drop(['Name','Brand','Location','Fuel_Type','Owner_Type','Transmission'], axis=1, inplace=True)
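
The create, concatenate, and drop steps above can also be done in one call. A minimal sketch on hypothetical rows, assuming prefixed column names are acceptable: passing `columns=` to `pd.get_dummies` encodes the listed columns, prefixes each new column with its source column name, and drops the originals.

```python
import pandas as pd

# Hypothetical rows with two categorical columns and one numeric column.
cars = pd.DataFrame({
    "Fuel_Type": ["Diesel", "Petrol", "Diesel"],
    "Transmission": ["Manual", "Automatic", "Manual"],
    "Power": [74.0, 88.0, 117.0],
})

# dtype=int yields 0/1 columns instead of booleans.
encoded = pd.get_dummies(cars, columns=["Fuel_Type", "Transmission"], dtype=int)
print(encoded.columns.tolist())
```

The prefixes keep the new column names distinct even if two categorical columns happen to share a value.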

In [None]:
tr.shape

In [None]:
tr.drop(['Year'],axis=1,inplace=True)

In [None]:
# Splitting the whole dataset into Training data and Testing data
train_tr = tr[tr['Price'].notnull()].copy()
test_tr = tr[tr['Price'].isnull()].drop('Price', axis=1)

In [None]:
train_tr.shape,test_tr.shape

In [None]:
y=train_tr.Price

In [None]:
x=train_tr.drop('Price',axis=1)

In [None]:
x.shape,y.shape


In [None]:
x.head()

## Splitting the labeled dataset into training and testing data

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,random_state=42)


# BUILDING & TRAINING THE MODELS

## KNN Regression:

In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
error=[]
for k in range(1,13):
    knn=KNeighborsRegressor(n_neighbors=k)  # use the loop variable so that each k is actually evaluated
    knn.fit(x_train,y_train)
    pred=knn.predict(x_test)
    mse=metrics.mean_squared_error(y_test,pred)
    rmse=math.sqrt(mse)
    error.append(rmse)
    print('for neighbors=',k,'rmse:',rmse)

- The printed values show how the root mean squared error changes with the number of neighbors in the KNN algorithm.

In [None]:
x=range(1,13)
plt.plot(x,error)
plt.xlabel('Value of k in KNN')
plt.ylabel('RMSE')
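
KNN compares rows by distance, so features with large numeric ranges can dominate the result. `StandardScaler` is imported at the top of the notebook but is not applied to the features. The following is a minimal sketch, on synthetic stand-in data rather than the car dataset, of sweeping k with scaling done inside a pipeline; all variable names here are illustrative:

```python
import math

import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales.
rng = np.random.default_rng(42)
features = np.column_stack([rng.uniform(0, 1, 300), rng.uniform(0, 100_000, 300)])
target = 10 * features[:, 0] + rng.normal(0, 0.5, 300)

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

rmse_by_k = {}
for k in range(1, 13):
    # The scaler is fitted on the training part only, then applied to the test part.
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    model.fit(f_train, t_train)
    mse = mean_squared_error(t_test, model.predict(f_test))
    rmse_by_k[k] = math.sqrt(mse)

best_k = min(rmse_by_k, key=rmse_by_k.get)
print("best k:", best_k, "rmse:", rmse_by_k[best_k])
```

Whether scaling helps on the real data would need to be checked by running the same sweep on `x_train` and `x_test`.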

- Graph between actual and predicted values:

In [None]:
sns.scatterplot(x=y_test,y=pred)
x=range(0,80)
y=range(0,80)
plt.plot(x,y,color='red')
plt.xlabel("Actual values of price")
plt.ylabel("Predicted values of price")

## Random Forest Regression

In [None]:
from sklearn.ensemble import RandomForestRegressor


In [None]:
rf = RandomForestRegressor(n_estimators = 1000, random_state = 42)


In [None]:
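import sys

# Hook that ignores KeyboardInterrupt; it only takes effect once assigned to sys.excepthook
def custom_excepthook(exc_type, value, traceback):
    if exc_type is KeyboardInterrupt:
        return # do nothing
    sys.__excepthook__(exc_type, value, traceback)

rf.fit(x_train, y_train)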

In [None]:
predrf=rf.predict(x_test)



- Finding the root mean squared error

In [None]:
mse=metrics.mean_squared_error(y_test,predrf)
rmse=math.sqrt(mse)
print('RMSE is:',rmse)
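
The same two-step computation is repeated for every model, so it can be wrapped in a small helper. A minimal sketch, with a hypothetical function name `rmse` and made-up prices, that also shows the arithmetic on a tiny example:

```python
import math

import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(actual, predicted) -> float:
    """Root mean squared error: the square root of the mean squared residual."""
    return math.sqrt(mean_squared_error(actual, predicted))

# Hypothetical prices in lakhs: the residuals are 2 and 0, so MSE = (4 + 0) / 2 = 2.
actual = np.array([3.0, 5.0])
predicted = np.array([1.0, 5.0])
print(rmse(actual, predicted))  # sqrt(2), about 1.414
```

Because RMSE is in the same unit as the target, a value of 1.414 here would mean a typical error of about 1.4 lakhs.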

- Graph between actual and predicted values:

In [None]:
sns.scatterplot(x=y_test,y=predrf)
x=range(0,80)
y=range(0,80)
plt.plot(x,y,color='red')
plt.xlabel("Actual values of price")
plt.ylabel("Predicted values of price")


## LightGBM Regressor

- Advantages of LightGBM

 - Faster training speed and higher efficiency: LightGBM uses a histogram-based algorithm, i.e. it buckets continuous feature values into discrete bins, which speeds up training.

 - Lower memory usage: Replacing continuous values with discrete bins results in lower memory usage.
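
The bucketing idea can be illustrated with plain NumPy. This is a conceptual sketch only: it assumes equal-frequency bins built from quantiles and a bin count of 255 chosen for illustration, and LightGBM's own binning logic is more involved. The feature values are synthetic.

```python
import numpy as np

# Synthetic continuous feature, e.g. kilometres driven.
rng = np.random.default_rng(0)
values = rng.uniform(1_000, 700_000, size=10_000)

# Build 255 equal-frequency bins from quantiles and map each value to a bin index.
n_bins = 255
edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
bin_index = np.digitize(values, edges[1:-1]).astype(np.uint8)

print(values.nbytes, "bytes as float64")   # 80000
print(bin_index.nbytes, "bytes as uint8")  # 10000
```

Each value is stored as a one-byte bin index instead of an eight-byte float, and a split search only has to consider the bin boundaries rather than every distinct value.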


- Finding the root mean squared error

In [None]:
mse=metrics.mean_squared_error(y_test,y_pred_lgbm)
rmse=math.sqrt(mse)
print('RMSE is:',rmse)

- Graph between actual and predicted values:

In [None]:
sns.scatterplot(x=y_test,y=y_pred_lgbm)
x=range(0,80)
y=range(0,80)
plt.plot(x,y,color='red')
plt.xlabel("Actual values of price")
plt.ylabel("Predicted values of price")

## Decision Tree Regressor

In [None]:
import sklearn.tree as model
dt=model.DecisionTreeRegressor(criterion='squared_error')  # squared-error (MSE) split criterion
dt.fit(x_train,y_train)
preddec=dt.predict(x_test)


- Finding the root mean squared error

In [None]:
mse=metrics.mean_squared_error(y_test,preddec)
rmse=math.sqrt(mse)
print('RMSE is:',rmse)

- Graph between actual and predicted values:

In [None]:
sns.scatterplot(x=y_test,y=preddec)
x=range(0,80)
y=range(0,80)
plt.plot(x,y,color='red')
plt.xlabel("Actual values of price")
plt.ylabel("Predicted values of price")

## Multiple Regression

In [None]:
from sklearn.linear_model import LinearRegression
reg= LinearRegression()
reg.fit(x_train,y_train)   
predlin=reg.predict(x_test)
reg_ols=sm.OLS(endog=y_train,exog=x_train.astype(float)).fit()  # statsmodels expects numeric columns, not boolean dummies



- Finding the root mean squared error

In [None]:
mse=metrics.mean_squared_error(y_test,predlin)
rmse=math.sqrt(mse)
print('RMSE is:',rmse)

- Graph between actual and predicted values:

In [None]:
sns.scatterplot(x=y_test,y=predlin)
x=range(0,80)
y=range(0,80)
plt.plot(x,y,color='red')
plt.xlabel("Actual values of price")
plt.ylabel("Predicted values of price")

### SELECTING THE BEST MODEL

- Of all the models above, LightGBM Regression has the lowest root mean squared error.
- So, we will use the LightGBM Regressor to predict on the test data.
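
The comparison can also be made explicit in code by fitting every candidate on the same split and collecting the test RMSE in one place. A minimal sketch on synthetic stand-in data with three of the model types, using illustrative variable names; it does not reproduce the notebook's results:

```python
import math

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; the real comparison would use the notebook's split.
rng = np.random.default_rng(1)
features = rng.normal(size=(400, 3))
target = 2 * features[:, 0] + features[:, 1] ** 2 + rng.normal(0, 0.2, 400)
f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

candidates = {
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Multiple Regression": LinearRegression(),
}

# Fit every candidate on the same split and record its test RMSE.
scores = {}
for name, model in candidates.items():
    model.fit(f_train, t_train)
    scores[name] = math.sqrt(mean_squared_error(t_test, model.predict(f_test)))

best_name = min(scores, key=scores.get)
for name, value in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {value:.3f}")
print("lowest RMSE:", best_name)
```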

## MAKING PREDICTIONS ON THE UNKNOWN (TEST) DATA

In [None]:
Predicted_price=lgbm.predict(test_tr)# Using LGBM Regressor, making predictions on the Unknown(Test) data.

In [None]:
Predicted_price=pd.DataFrame(Predicted_price,columns=['Predicted Price(in Lakhs)'])# Storing the predicted values in a DataFrame

### Saving the predicted prices in an Excel sheet

In [None]:
from pandas import ExcelWriter
with ExcelWriter('Predicted Prices on Unknown(Test) data.xlsx') as writer:# The file saved has name 'Predicted Prices on Unknown(Test) data.xlsx'
    Predicted_price.to_excel(writer)# The with block saves and closes the file on exit

We considered five models:

- KNN
- Random Forest
- LightGBM
- Decision Tree
- Multiple Regression

We inferred from these five models that LightGBM gave the lowest RMSE value, so we used it to make predictions on the test dataset.

We learnt many things through the discussions we had while completing the project.

This is a joint effort by our beloved teammates.
