# Homework 12

https://scikit-learn.org/0.15/modules/scaling_strategies.html#incremental-learning

* Implement a mini batch functionality to train a regressor.
    - (Optional) If anyone want to do this in a pipeline can do this: https://koaning.github.io/tokenwiser/api/pipeline.html

* Save model, load the model again and test it on `X_test` __Do NOT commit the pickle file__

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [2]:
def test_df():
    df = pd.read_csv('https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/car_prices/car_prices.csv', low_memory=False)

    df = df.sample(5000, random_state=100).reset_index(drop=True)
    
    y = df['sellingprice']
    df.drop('sellingprice', axis=1, inplace=True)
    X = df
    
    return X,y

def partial_df():
    df = pd.read_csv('https://raw.githubusercontent.com/msaricaumbc/DS_data/master/ds602/car_prices/car_prices.csv', low_memory=False)
   
    while(True):
        yield df.sample(100).reset_index(drop=True)
        
gen = partial_df()

In [3]:
X_test, y_test = test_df()

In [4]:
# each time you call this you will get a new slice of the dataframe.
next(gen)

Unnamed: 0,year,make,model,trim,body,transmission,vin,state,condition,odometer,color,interior,seller,mmr,sellingprice,saledate
0,2012,Ford,Fusion,SE,Sedan,automatic,3fahp0ha9cr247085,mo,3.7,31103.0,blue,black,"ford motor credit company,llc pd",11800,11200,Tue Feb 10 2015 02:30:00 GMT-0800 (PST)
1,2014,Subaru,Impreza WRX,Base,Hatchback,manual,jf1gr7e64eg204248,tx,4.4,5451.0,gray,black,music city autoplex llc,24500,25500,Wed Feb 11 2015 02:20:00 GMT-0800 (PST)
2,2011,Honda,Odyssey,Touring,minivan,automatic,5fnrl5h93bb037689,pa,4.3,53851.0,black,gray,adcock brothers inc,23500,23750,Fri Jun 05 2015 02:00:00 GMT-0700 (PDT)
3,2012,Chevrolet,Malibu,LT,sedan,automatic,1g1zc5e00cf206966,il,1,39571.0,black,black,ally,10800,8000,Thu Jun 18 2015 07:00:00 GMT-0700 (PDT)
4,2000,BMW,3 Series,323i,Sedan,automatic,wbaan3340ync92009,ga,3.3,93851.0,silver,black,davis automotive,2125,4300,Thu Jan 22 2015 02:00:00 GMT-0800 (PST)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2011,Chevrolet,Equinox,LT1,SUV,automatic,2cnaldec2b6200499,tx,1.9,49586.0,gold,black,wells fargo dealer services,13400,7600,Wed Mar 04 2015 02:00:00 GMT-0800 (PST)
96,2006,Chevrolet,Impala,LT,Sedan,automatic,2g1wc581569345835,wi,1.9,134040.0,brown,beige,blackhawk finance,3375,3000,Wed Feb 11 2015 02:00:00 GMT-0800 (PST)
97,2007,Ford,Mustang,Deluxe,Convertible,automatic,1zvft84n675262587,co,2.9,30481.0,silver,gray,santander consumer,8125,10500,Tue Mar 03 2015 04:00:00 GMT-0800 (PST)
98,2011,Nissan,Altima,2.5 S,Sedan,automatic,1n4al2ap4bc134794,fl,3.7,53560.0,blue,gray,gm financial,10100,9100,Thu Feb 26 2015 01:40:00 GMT-0800 (PST)


In [5]:
next(gen).dtypes

year              int64
make             object
model            object
trim             object
body             object
transmission     object
vin              object
state            object
condition        object
odometer        float64
color            object
interior         object
seller           object
mmr              object
sellingprice      int64
saledate         object
dtype: object

In [6]:
next(gen).columns

Index(['year', 'make', 'model', 'trim', 'body', 'transmission', 'vin', 'state',
       'condition', 'odometer', 'color', 'interior', 'seller', 'mmr',
       'sellingprice', 'saledate'],
      dtype='object')

In [7]:
gen

<generator object partial_df at 0x000001C3226A67B0>

## Import the necessary libraries

In [8]:
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import joblib
import warnings
warnings.filterwarnings("ignore")

**In the following code, I have implemented the following :-**

- First, initialized the SGDRegressor model.
- Divide the dataset into numerical and categorical features.
- Create a numerical transformer which imputes the missing values with mean and scale the numerical data.
- Create a categorical transformer which imputes the missing values with most_frequent value and do the one hot encoding.
- Preprocess the numerical and categorical data and transform the data using ColumnTransformer.
- Combine the preprocessor pipeline and model pipeline (sgd_regressor).

In [9]:
# Initialize the sgdregressor model
sgd_regressor = SGDRegressor()

# Divide into numerical and categorical features
numerical_features = X_test.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X_test.select_dtypes(include=['object']).columns

# Transform the numerical features by imputing with mean and scale the data using standard scaler.
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Transform the categorical features by imputing with most frequent value and use onehotencoder to convert into numerical values.
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine the numerical and categorical transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create a pipeline with preprocessor and model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', sgd_regressor)
])

### In the following cell, I have implemented mini-batch functionality to train a regressor.

In [10]:
batch = 150
num_batches = 70
for _ in range(num_batches):
    X_batch = next(gen)
    y_batch = X_batch.pop('sellingprice')
    pipeline.fit(X_batch, y_batch)

### Evaluate the model

In [11]:
y_pred = pipeline.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error of Test Set',mse)

Mean Squared Error of Test Set 51501198.521237105


### Save the model into a file

In [12]:
joblib.dump(pipeline, 'model.pkl')

['model.pkl']

### Load the model from the file

In [13]:
model_pipeline = joblib.load('model.pkl')
model_pipeline

### Test the loaded model

In [14]:
pred = model_pipeline.predict(X_test)
mse_loaded = mean_squared_error(y_test, pred)
print('Mean Squared Error of Loaded model test set',mse_loaded)

Mean Squared Error of Loaded model test set 51501198.521237105


**From the above code, mean squared error on the normal test set is 51501198.521237105 and the mean squared error on the loaded model test set is 51501198.521237105**