# Model Pipeline

## Description of the Problem

A used-car listing website plans to offer a value-added feature to its customers. Once a customer uploads the details of the car they want to sell, the feature should estimate the price at which the car can be expected to sell. Currently, customers set the resale price based on their own experience or judgement. This leads either to lost revenue, if the car is listed too low, or to a delayed or failed sale, if it is listed too high. The feature will help customers find buyers sooner and at the right price.

The website has collected data on past car resales and plans to use it to build an ML model that estimates the resale price.

## Dataset

The dataset has 12 features for each car, along with the price at which it was sold. All cars were sold in 2019.

1. Id - Car's id. This is a sequence number.
2. Name - The brand and model of the car.
3. Location - The location in which the car is being sold or is available for purchase.
4. Year - The year or edition of the model.
5. Kilometers_Driven - The total distance driven by the previous owner(s), in km.
6. Fuel_Type - The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
7. Transmission - The type of transmission used by the car. (Automatic / Manual)
8. Owner_Type - First, Second, Third, or Fourth & Above
9. Mileage - The standard mileage offered by the car company in kmpl or km/kg
10. Engine - The displacement volume of the engine in CC.
11. Power - The maximum power of the engine in bhp.
12. Seats - The number of seats in the car.
13. New_Price - The price of a new car of the same model.
14. Price - The price of the car (target).

### Load Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
cars_df = pd.read_csv( "https://drive.google.com/uc?export=download&id=10-R6GyVWjt_gjWEFD86mKHDvSWD9lp1z" )

In [None]:
cars_df.sample(5)

| | index | Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | age | KM_Driven | make | mileage_new | engine_new | power_new |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1310 | 2613 | Maruti Swift Dzire LDI | Pune | 2016 | 39000 | Diesel | Manual | First | 26.59 kmpl | 1248 CC | 74 bhp | 5.0 | | 5.8 | 3 | 39 | maruti | 26.59 | 1248.0 | 74.0 |
| 60 | 130 | Maruti Ciaz VXi Plus | Kochi | 2017 | 44285 | Petrol | Manual | First | 20.73 kmpl | 1373 CC | 91.1 bhp | 5.0 | | 7.47 | 2 | 44 | maruti | 20.73 | 1373.0 | 91.1 |
| 2849 | 5541 | Chevrolet Beat Diesel LT | Chennai | 2012 | 81000 | Diesel | Manual | First | 25.44 kmpl | 936 CC | 56.3 bhp | 5.0 | | 2.5 | 7 | 81 | chevrolet | 25.44 | 936.0 | 56.3 |
| 978 | 1917 | Honda City 1.5 EXI | Jaipur | 2005 | 88000 | Petrol | Manual | Second | 13.0 kmpl | 1493 CC | 100 bhp | | | 1.7 | 14 | 88 | honda | 13.0 | 1493.0 | 100.0 |
| 656 | 1321 | Ford Fiesta 1.4 Duratec EXI | Chennai | 2007 | 100000 | Petrol | Manual | First | 16.6 kmpl | 1388 CC | 68 bhp | 5.0 | | 1.6 | 12 | 100 | ford | 16.6 | 1388.0 | 68.0 |
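
The sample includes six derived columns (`age`, `KM_Driven`, `make`, `mileage_new`, `engine_new`, `power_new`) that the dataset description does not cover, and the notebook does not show how they were built. The following is a minimal sketch of one way they could be produced. It assumes that `age` is 2019 minus `Year`, that `KM_Driven` is `Kilometers_Driven` in whole thousands, that `make` is the lowercased first word of `Name`, and that the `*_new` columns hold the leading number of the corresponding text column. These rules are inferred from the sample rows only, and the helper `leading_number` is hypothetical.

```python
import pandas as pd

# Two rows copied from the sample output, raw columns only.
raw = pd.DataFrame({
    "Name": ["Maruti Swift Dzire LDI", "Honda City 1.5 EXI"],
    "Year": [2016, 2005],
    "Kilometers_Driven": [39000, 88000],
    "Mileage": ["26.59 kmpl", "13.0 kmpl"],
    "Engine": ["1248 CC", "1493 CC"],
    "Power": ["74 bhp", "100 bhp"],
})


def leading_number(col):
    """Hypothetical helper: parse the number before the unit, e.g. '1248 CC' -> 1248.0."""
    return pd.to_numeric(col.str.split().str[0], errors="coerce").astype(float)


derived = raw.assign(
    age=2019 - raw["Year"],                        # assumed reference year: 2019
    KM_Driven=raw["Kilometers_Driven"] // 1000,    # assumed unit: thousands of km
    make=raw["Name"].str.split().str[0].str.lower(),
    mileage_new=leading_number(raw["Mileage"]),
    engine_new=leading_number(raw["Engine"]),
    power_new=leading_number(raw["Power"]),
)
print(derived[["age", "KM_Driven", "make", "mileage_new", "engine_new", "power_new"]])
```

Whether the original rounds or truncates `KM_Driven` cannot be determined from the sample, so the integer division here is a guess.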


In [None]:
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3092 entries, 0 to 3091
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              3092 non-null   int64  
 1   Name               3092 non-null   object 
 2   Location           3092 non-null   object 
 3   Year               3092 non-null   int64  
 4   Kilometers_Driven  3092 non-null   int64  
 5   Fuel_Type          3092 non-null   object 
 6   Transmission       3092 non-null   object 
 7   Owner_Type         3092 non-null   object 
 8   Mileage            3092 non-null   object 
 9   Engine             3092 non-null   object 
 10  Power              3092 non-null   object 
 11  Seats              3091 non-null   float64
 12  New_Price          411 non-null    object 
 13  Price              3092 non-null   float64
 14  age                3092 non-null   int64  
 15  KM_Driven          3092 non-null   int64  
 16  make               3092 

### Feature Set Selection

In [None]:
cars_df.columns

Index(['index', 'Name', 'Location', 'Year', 'Kilometers_Driven', 'Fuel_Type',
       'Transmission', 'Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats',
       'New_Price', 'Price', 'age', 'KM_Driven', 'make', 'mileage_new',
       'engine_new', 'power_new'],
      dtype='object')

In [None]:
x_features = ['KM_Driven', 'Fuel_Type', 'age',
              'Transmission', 'Owner_Type', 'Seats', 
              'make', 'mileage_new', 'engine_new', 
              'power_new', 'Location']

In [None]:
cat_vars = ['Fuel_Type',
            'Transmission', 'Owner_Type',
            'make', 'Location']

In [None]:
num_vars = list(set(x_features) - set(cat_vars))

In [None]:
num_vars

['Seats', 'power_new', 'age', 'KM_Driven', 'engine_new', 'mileage_new']

In [None]:
cars_df[x_features].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3092 entries, 0 to 3091
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   KM_Driven     3092 non-null   int64  
 1   Fuel_Type     3092 non-null   object 
 2   age           3092 non-null   int64  
 3   Transmission  3092 non-null   object 
 4   Owner_Type    3092 non-null   object 
 5   Seats         3091 non-null   float64
 6   make          3092 non-null   object 
 7   mileage_new   3092 non-null   float64
 8   engine_new    3092 non-null   float64
 9   power_new     3092 non-null   float64
 10  Location      3092 non-null   object 
dtypes: float64(4), int64(2), object(5)
memory usage: 265.8+ KB


### Setting X and y variables

In [None]:
X = cars_df[x_features]
y = cars_df['Price']

### Data Splitting

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.8,
                                                    random_state = 80)

In [None]:
X_train.shape

(2473, 11)

In [None]:
X_test.shape

(619, 11)
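
With `train_size = 0.8`, `train_test_split` puts floor(0.8 × 3092) = 2473 rows in the training set and the remaining 619 in the test set, which matches the shapes above. A small sketch that reproduces these sizes on a placeholder array of the same dimensions (the zero-filled arrays are stand-ins, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as X and y in the notebook.
X_demo = np.zeros((3092, 11))
y_demo = np.zeros(3092)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=80
)
print(X_tr.shape, X_te.shape)
```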

## Defining Transformation

1. Data imputation for the Seats column
    - Mean imputation
2. Encoding for categorical columns
    - One-hot encoding (OHE)
3. Data scaling for numerical columns
    - Standard scaling

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
imputed_num_vars = ['Seats']

In [None]:
imputed_num_vars

['Seats']

In [None]:
non_imputed_num_vars = list(set(num_vars) - set(imputed_num_vars))

In [None]:
non_imputed_num_vars

['power_new', 'age', 'KM_Driven', 'engine_new', 'mileage_new']

In [None]:
mean_imputer = SimpleImputer(strategy='mean')
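
As a quick illustration of what mean imputation does, the sketch below applies the same `SimpleImputer(strategy='mean')` to a toy `Seats` column with one missing value. The three values are made up for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy Seats column: one missing entry between 5 and 7 seats.
seats = np.array([[5.0], [np.nan], [7.0]])

imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(seats)

# The missing value is replaced by the mean of the observed values.
print(imputer.statistics_)
print(filled.ravel())
```

Inside the pipeline, the mean is learned from the training fold only and then reused when transforming validation or test rows.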

### Encode Categorical Variables

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
ohe_encoder = OneHotEncoder(handle_unknown='ignore')
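
The `handle_unknown='ignore'` setting matters when a category appears at prediction time that was not present during fitting: instead of raising an error, the encoder emits an all-zero row for that column. A minimal sketch on a toy `Fuel_Type` column (values chosen for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_fuel = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel"]})
new_fuel = pd.DataFrame({"Fuel_Type": ["Petrol", "CNG"]})  # CNG was never seen in fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_fuel)

# transform returns a sparse matrix; convert to dense for display.
encoded = encoder.transform(new_fuel).toarray()
print(encoder.categories_[0])
print(encoded)
```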

### Scaling Numerical Vars

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
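
Scaling is particularly relevant for KNN, because distances would otherwise be dominated by features with large numeric ranges such as `engine_new`. The sketch below, on made-up `KM_Driven` values, shows that `StandardScaler` subtracts the column mean and divides by the population standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

km = np.array([[39.0], [44.0], [81.0], [88.0], [100.0]])

scaler_demo = StandardScaler()
scaled = scaler_demo.fit_transform(km)

# Same computation by hand (np.std uses the population formula by default).
manual = (km - km.mean()) / km.std()
print(scaled.ravel())
```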

## Creating Pipelines

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [None]:
imputed_num_transformer = Pipeline( steps = [  
        ('imputation', mean_imputer),
        ('scaler', scaler)])

In [None]:
non_imputed_num_transformer = Pipeline( steps = [('scaler', scaler)])

In [None]:
cat_transformer = Pipeline( steps = [('ohencoder', ohe_encoder)])

In [None]:
preprocessor = ColumnTransformer(
    transformers=[  
        ('num_imputed', imputed_num_transformer, imputed_num_vars),
        ('num_not_imputed', non_imputed_num_transformer, non_imputed_num_vars),
        ('catvars', cat_transformer, cat_vars)])
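
To see how a `ColumnTransformer` of this shape routes columns, the following sketch rebuilds a reduced version of the preprocessor on a four-row toy frame. The toy data and the name `toy_preprocessor` are illustrative assumptions; only three of the eleven features are used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "Seats": [5.0, np.nan, 7.0, 5.0],
    "age": [3, 2, 7, 14],
    "Fuel_Type": ["Diesel", "Petrol", "Diesel", "CNG"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    # Seats: fill the gap with the mean, then scale.
    ("num_imputed",
     Pipeline([("imputation", SimpleImputer(strategy="mean")),
               ("scaler", StandardScaler())]),
     ["Seats"]),
    # Other numeric columns: scale only.
    ("num_not_imputed", Pipeline([("scaler", StandardScaler())]), ["age"]),
    # Categorical columns: one-hot encode.
    ("catvars",
     Pipeline([("ohencoder", OneHotEncoder(handle_unknown="ignore"))]),
     ["Fuel_Type"]),
])

transformed = toy_preprocessor.fit_transform(toy)
# 1 Seats column + 1 age column + 3 one-hot columns (CNG, Diesel, Petrol).
print(transformed.shape)
```

The output columns follow the order in which the transformers are listed, not the column order of the input frame.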

### KNN (K-Nearest Neighbor)


In [None]:
from sklearn.neighbors import KNeighborsRegressor

In [None]:
#knn = KNeighborsRegressor(n_neighbors=20)
knn = KNeighborsRegressor(n_neighbors=20, weights='distance')

In [None]:
knn_v1 = Pipeline(steps=[('preprocessor', preprocessor),
                          ('knn', knn)])

In [None]:
knn_v1.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num_imputed',
                                                  Pipeline(steps=[('imputation',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['Seats']),
                                                 ('num_not_imputed',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['power_new', 'age',
                                                   'KM_Driven', 'engine_new',
                                                   'mileage_new']),
                                                 ('catvars

In [None]:
from sklearn import set_config
set_config(display='diagram') 

In [None]:
knn_v1

## K Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
scores = cross_val_score( knn_v1,
                          X_train,
                          y_train,
                          cv = 10,
                          scoring = 'r2')

In [None]:
scores

array([0.8099739 , 0.74165817, 0.81740538, 0.81873944, 0.77995802,
       0.81580862, 0.80065021, 0.77859505, 0.80501332, 0.81669483])

In [None]:
scores.mean()

0.7984496941261805

In [None]:
scores.std()

0.023552848559143465
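
For a regressor, passing an integer `cv` to `cross_val_score` uses an unshuffled `KFold` split. The sketch below spells out the equivalent loop on synthetic data from `make_regression`; the data and the smaller settings (`cv=5`, `n_neighbors=5`) are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=4,
                                 noise=10.0, random_state=0)
model = KNeighborsRegressor(n_neighbors=5, weights="distance")

auto_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")

manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    # Fit a fresh copy on the training folds, score on the held-out fold.
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    manual_scores.append(r2_score(y_demo[val_idx], fold_pred))

print(auto_scores.mean(), np.mean(manual_scores))
```

Because the whole pipeline is passed to `cross_val_score` in the notebook, the imputer, scaler, and encoder are refit on each training fold, which avoids leaking validation data into the preprocessing.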

## Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
knn_params = { "knn__n_neighbors": [5, 10, 15, 20, 25],
               "knn__weights": ['uniform', 'distance'],
               "knn__metric": ['minkowski', 'euclidean']}

In [None]:
knn_grid_v1 = GridSearchCV(knn_v1,
                           param_grid=knn_params,
                           cv = 10,
                           scoring = 'r2')



In [None]:
knn_grid_v1.fit(X_train, y_train)
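
The grid above expands to 5 × 2 × 2 = 20 candidates, so with `cv = 10` the search fits 200 models. Note that `'minkowski'` with the default `p=2` is the Euclidean distance, so the two metric values are expected to score identically and half of the candidates are redundant. A minimal sketch that checks both points on synthetic data (the data is a stand-in, not the car dataset):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import ParameterGrid
from sklearn.neighbors import KNeighborsRegressor

knn_params = {"knn__n_neighbors": [5, 10, 15, 20, 25],
              "knn__weights": ["uniform", "distance"],
              "knn__metric": ["minkowski", "euclidean"]}
n_candidates = len(ParameterGrid(knn_params))
print(n_candidates)

X_demo, y_demo = make_regression(n_samples=100, n_features=3, random_state=0)
X_fit, y_fit, X_new = X_demo[:80], y_demo[:80], X_demo[80:]

# Same neighbors, same predictions: minkowski with p=2 is the Euclidean metric.
pred_minkowski = (KNeighborsRegressor(n_neighbors=5, metric="minkowski")
                  .fit(X_fit, y_fit).predict(X_new))
pred_euclidean = (KNeighborsRegressor(n_neighbors=5, metric="euclidean")
                  .fit(X_fit, y_fit).predict(X_new))
print(np.allclose(pred_minkowski, pred_euclidean))
```

To make the metric axis informative, one option would be to search over `knn__p` (for example 1 and 2) instead of listing both names.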

In [None]:
knn_grid_v1.best_params_

{'knn__metric': 'minkowski',
 'knn__n_neighbors': 10,
 'knn__weights': 'distance'}

In [None]:
knn_grid_v1.best_score_

0.815226329927499

In [None]:
knn_grid_results = pd.DataFrame( knn_grid_v1.cv_results_ )
knn_grid_results[['param_knn__n_neighbors', 'param_knn__weights', 'mean_test_score', 'std_test_score']]

| | param_knn__n_neighbors | param_knn__weights | mean_test_score | std_test_score |
|---|---|---|---|---|
| 0 | 5 | uniform | 0.795765 | 0.029362 |
| 1 | 5 | distance | 0.808335 | 0.026944 |
| 2 | 10 | uniform | 0.799702 | 0.024248 |
| 3 | 10 | distance | 0.815226 | 0.024555 |
| 4 | 15 | uniform | 0.787126 | 0.024108 |
| 5 | 15 | distance | 0.808177 | 0.023972 |
| 6 | 20 | uniform | 0.773486 | 0.023395 |
| 7 | 20 | distance | 0.79845 | 0.023553 |
| 8 | 25 | uniform | 0.767428 | 0.023019 |
| 9 | 25 | distance | 0.794198 | 0.022889 |


## Building the final model

In [None]:
final_model = KNeighborsRegressor(n_neighbors = knn_grid_v1.best_params_['knn__n_neighbors'],
                                  weights = knn_grid_v1.best_params_['knn__weights'],
                                  metric = knn_grid_v1.best_params_['knn__metric'])
knn_final = Pipeline(steps=[('preprocessor', preprocessor),
                            ('knn', final_model)])

In [None]:
knn_final.fit(X_train, y_train)

In [None]:
knn_final.score(X_test, y_test)

0.8098088897417473
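
Rebuilding the pipeline by hand works, but `GridSearchCV` also refits the best candidate on the full training data by default (`refit=True`) and exposes it as `best_estimator_`. The sketch below, on synthetic data with a simplified pipeline (both are assumptions for illustration), shows that the two routes give the same test score:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(n_samples=300, n_features=4,
                                 noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          train_size=0.8, random_state=80)

pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsRegressor())])
search = GridSearchCV(pipe,
                      param_grid={"knn__n_neighbors": [3, 5, 10],
                                  "knn__weights": ["uniform", "distance"]},
                      cv=5, scoring="r2")
search.fit(X_tr, y_tr)

# Route 1: use the estimator that the search already refit on X_tr.
refit_score = search.best_estimator_.score(X_te, y_te)

# Route 2: rebuild the pipeline from best_params_, as the notebook does.
best = search.best_params_
manual = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=best["knn__n_neighbors"],
                                weights=best["knn__weights"])),
])
manual_score = manual.fit(X_tr, y_tr).score(X_te, y_te)
print(refit_score, manual_score)
```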

In [None]:
from sklearn.metrics import mean_squared_error

In [None]:
final_rmse = np.sqrt(mean_squared_error(y_test, knn_final.predict(X_test)))
final_rmse

0.9615420152525802
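
RMSE is the square root of the mean squared difference between actual and predicted prices, so it is expressed in the same units as `Price`. A small worked sketch, using four actual prices from the sample rows and made-up predictions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([5.8, 7.47, 2.5, 1.7])   # prices from the sample rows
y_pred = np.array([6.0, 7.0, 3.0, 1.5])    # illustrative predictions, not model output

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse_sklearn, rmse_manual)
```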

## Model Persistence

In [None]:
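class CarPredictionModel:
    """Bundle the fitted pipeline with its input feature names and test RMSE."""

    def __init__(self, model, features, rmse):
        self.model = model        # fitted sklearn Pipeline
        self.features = features  # column names the model expects
        self.rmse = rmse          # RMSE on the held-out test set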

In [None]:
my_model = CarPredictionModel(knn_final, list(X_train.columns), final_rmse)

In [None]:
from joblib import dump

In [None]:
dump(my_model, './cars_v1.pkl')

['./cars_v1.pkl']
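
Loading a pickled `CarPredictionModel` requires the class definition to be importable in the loading process, since pickle stores only a reference to the class. The sketch below shows the save, load, and predict round trip using a plain dictionary instead, which avoids that dependency. The toy pipeline, the two-feature data, the file name `cars_demo.pkl`, and the `None` placeholder for `rmse` are all assumptions for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the fitted knn_final pipeline.
train = pd.DataFrame({"age": [3, 2, 7, 14, 12],
                      "KM_Driven": [39, 44, 81, 88, 100]})
price = pd.Series([5.8, 7.47, 2.5, 1.7, 1.6])
toy_model = Pipeline([("scaler", StandardScaler()),
                      ("knn", KNeighborsRegressor(n_neighbors=2))])
toy_model.fit(train, price)

bundle = {"model": toy_model,
          "features": list(train.columns),
          "rmse": None}  # placeholder; the notebook stores final_rmse here

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cars_demo.pkl")
    dump(bundle, path)
    restored = load(path)

# Reorder the incoming columns to match what the model was trained on.
new_car = pd.DataFrame([{"KM_Driven": 50, "age": 4}])
estimate = restored["model"].predict(new_car[restored["features"]])
print(estimate)
```

Storing the feature list alongside the model lets the serving code select and order columns consistently with training.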