# Task 3 - Modeling

This notebook will get you started by helping you to load the data, but then it'll be up to you to complete the task! If you need help, refer to the `modeling_walkthrough.ipynb` notebook.


## Section 1 - Setup

In [46]:
import os
import pandas as pd
import numpy as np
import datetime as dt
import datetime_truncate as dtr

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import Normalizer

from sklearn.linear_model import SGDRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.ensemble import VotingRegressor

from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
import joblib

In [4]:
os.chdir('..')

## Section 2 - Data loading

As before, let's load the 3 datasets provided. The cells below read them from a `Data/` folder relative to the working directory, so if you are working from Google Drive, upload the datasets there first and adjust the paths to match.

In [5]:
sales_df = pd.read_csv("Data/sales.csv")
sales_df.drop(columns=["Unnamed: 0"], inplace=True, errors='ignore')
sales_df.head()

Unnamed: 0,transaction_id,timestamp,product_id,category,customer_type,unit_price,quantity,total,payment_type
0,a1c82654-c52c-45b3-8ce8-4c2a1efe63ed,2022-03-02 09:51:38,3bc6c1ea-0198-46de-9ffd-514ae3338713,fruit,gold,3.99,2,7.98,e-wallet
1,931ad550-09e8-4da6-beaa-8c9d17be9c60,2022-03-06 10:33:59,ad81b46c-bf38-41cf-9b54-5fe7f5eba93e,fruit,standard,3.99,1,3.99,e-wallet
2,ae133534-6f61-4cd6-b6b8-d1c1d8d90aea,2022-03-04 17:20:21,7c55cbd4-f306-4c04-a030-628cbe7867c1,fruit,premium,0.19,2,0.38,e-wallet
3,157cebd9-aaf0-475d-8a11-7c8e0f5b76e4,2022-03-02 17:23:58,80da8348-1707-403f-8be7-9e6deeccc883,fruit,gold,0.19,4,0.76,e-wallet
4,a81a6cd3-5e0c-44a2-826c-aea43e46c514,2022-03-05 14:32:43,7f5e86e6-f06f-45f6-bf44-27b095c9ad1d,fruit,basic,4.49,2,8.98,debit card


In [6]:
stock_df = pd.read_csv("Data/sensor_stock_levels.csv")
stock_df.drop(columns=["Unnamed: 0"], inplace=True, errors='ignore')
stock_df.head()

Unnamed: 0,id,timestamp,product_id,estimated_stock_pct
0,4220e505-c247-478d-9831-6b9f87a4488a,2022-03-07 12:13:02,f658605e-75f3-4fed-a655-c0903f344427,0.75
1,f2612b26-fc82-49ea-8940-0751fdd4d9ef,2022-03-07 16:39:46,de06083a-f5c0-451d-b2f4-9ab88b52609d,0.48
2,989a287f-67e6-4478-aa49-c3a35dac0e2e,2022-03-01 18:17:43,ce8f3a04-d1a4-43b1-a7c2-fa1b8e7674c8,0.58
3,af8e5683-d247-46ac-9909-1a77bdebefb2,2022-03-02 14:29:09,c21e3ba9-92a3-4745-92c2-6faef73223f7,0.79
4,08a32247-3f44-4002-85fb-c198434dd4bb,2022-03-02 13:46:18,7f478817-aa5b-44e9-9059-8045228c9eb0,0.22


In [7]:
temp_df = pd.read_csv("Data/sensor_storage_temperature.csv")
temp_df.drop(columns=["Unnamed: 0"], inplace=True, errors='ignore')
temp_df.head()

Unnamed: 0,id,timestamp,temperature
0,d1ca1ef8-0eac-42fc-af80-97106efc7b13,2022-03-07 15:55:20,2.96
1,4b8a66c4-0f3a-4f16-826f-8cf9397e9d18,2022-03-01 09:18:22,1.88
2,3d47a0c7-1e72-4512-812f-b6b5d8428cf3,2022-03-04 15:12:26,1.78
3,9500357b-ce15-424a-837a-7677b386f471,2022-03-02 12:30:42,2.18
4,c4b61fec-99c2-4c6d-8e5d-4edd8c9632fa,2022-03-05 09:09:33,1.38


Now it's up to you. Refer back to the steps in your strategic plan to complete this task. Good luck!

In [8]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7829 entries, 0 to 7828
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   transaction_id  7829 non-null   object 
 1   timestamp       7829 non-null   object 
 2   product_id      7829 non-null   object 
 3   category        7829 non-null   object 
 4   customer_type   7829 non-null   object 
 5   unit_price      7829 non-null   float64
 6   quantity        7829 non-null   int64  
 7   total           7829 non-null   float64
 8   payment_type    7829 non-null   object 
dtypes: float64(2), int64(1), object(6)
memory usage: 550.6+ KB


In [9]:
stock_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   15000 non-null  object 
 1   timestamp            15000 non-null  object 
 2   product_id           15000 non-null  object 
 3   estimated_stock_pct  15000 non-null  float64
dtypes: float64(1), object(3)
memory usage: 468.9+ KB


In [10]:
temp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23890 entries, 0 to 23889
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           23890 non-null  object 
 1   timestamp    23890 non-null  object 
 2   temperature  23890 non-null  float64
dtypes: float64(1), object(2)
memory usage: 560.0+ KB


## Section 3 - Preprocessing data

In [11]:
# Change timestamp to hourly
sales_df['timestamp'] = sales_df.timestamp.apply(lambda x: dtr.truncate_hour(pd.to_datetime(x)))
stock_df['timestamp'] = stock_df.timestamp.apply(lambda x: dtr.truncate_hour(pd.to_datetime(x)))
temp_df['timestamp'] = temp_df.timestamp.apply(lambda x: dtr.truncate_hour(pd.to_datetime(x)))
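
The same hourly truncation can probably be done without a row-by-row `apply`. The following is a minimal sketch, assuming the timestamp strings parse cleanly with `pd.to_datetime`; the toy frame `demo_df` is hypothetical and only stands in for the three dataframes above.

```python
import pandas as pd

# Hypothetical toy frame standing in for sales_df / stock_df / temp_df
demo_df = pd.DataFrame({
    "timestamp": [
        "2022-03-02 09:51:38",
        "2022-03-06 10:33:59",
        "2022-03-04 17:20:21",
    ]
})

# Parse the whole column once, then floor every value to the start of its hour
demo_df["timestamp"] = pd.to_datetime(demo_df["timestamp"]).dt.floor("h")
print(demo_df)
```

This relies only on pandas, so the `datetime_truncate` dependency would not be needed for this step.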

In [12]:
# Group by hour and product
sales_agg = sales_df.groupby(['timestamp', 'product_id']).agg({'quantity': 'sum', 'total': 'mean'}).reset_index()
stock_agg = stock_df.groupby(['timestamp', 'product_id']).agg({'estimated_stock_pct': 'mean'}).reset_index()
temp_agg = temp_df.groupby(['timestamp']).agg({'temperature': 'mean'}).reset_index()
sales_agg.head()

Unnamed: 0,timestamp,product_id,quantity,total
0,2022-03-01 09:00:00,00e120bb-89d6-4df5-bc48-a051148e3d03,3,33.57
1,2022-03-01 09:00:00,01f3cdd9-8e9e-4dff-9b5c-69698a0388d0,3,4.47
2,2022-03-01 09:00:00,03a2557a-aa12-4add-a6d4-77dc36342067,3,17.97
3,2022-03-01 09:00:00,049b2171-0eeb-4a3e-bf98-0c290c7821da,7,8.715
4,2022-03-01 09:00:00,04da844d-8dba-4470-9119-e534d52a03a0,11,1.3475


In [13]:
stock_agg.head()

Unnamed: 0,timestamp,product_id,estimated_stock_pct
0,2022-03-01 09:00:00,00e120bb-89d6-4df5-bc48-a051148e3d03,0.89
1,2022-03-01 09:00:00,01f3cdd9-8e9e-4dff-9b5c-69698a0388d0,0.14
2,2022-03-01 09:00:00,01ff0803-ae73-4234-971d-5713c97b7f4b,0.67
3,2022-03-01 09:00:00,0363eb21-8c74-47e1-a216-c37e565e5ceb,0.82
4,2022-03-01 09:00:00,03f0b20e-3b5b-444f-bc39-cdfa2523d4bc,0.05


In [14]:
temp_agg.head()

Unnamed: 0,timestamp,temperature
0,2022-03-01 09:00:00,-0.02885
1,2022-03-01 10:00:00,1.284314
2,2022-03-01 11:00:00,-0.56
3,2022-03-01 12:00:00,-0.537721
4,2022-03-01 13:00:00,-0.188734


In [15]:
sales_agg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6217 entries, 0 to 6216
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   timestamp   6217 non-null   datetime64[ns]
 1   product_id  6217 non-null   object        
 2   quantity    6217 non-null   int64         
 3   total       6217 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 194.4+ KB


In [16]:
stock_agg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10845 entries, 0 to 10844
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   timestamp            10845 non-null  datetime64[ns]
 1   product_id           10845 non-null  object        
 2   estimated_stock_pct  10845 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 254.3+ KB


In [17]:
temp_agg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   timestamp    77 non-null     datetime64[ns]
 1   temperature  77 non-null     float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 1.3 KB


In [18]:
merged_df = stock_agg.merge(sales_agg, on=['timestamp', 'product_id'], how='left') \
       .merge(temp_agg, on='timestamp', how='left')
merged_df.head()

Unnamed: 0,timestamp,product_id,estimated_stock_pct,quantity,total,temperature
0,2022-03-01 09:00:00,00e120bb-89d6-4df5-bc48-a051148e3d03,0.89,3.0,33.57,-0.02885
1,2022-03-01 09:00:00,01f3cdd9-8e9e-4dff-9b5c-69698a0388d0,0.14,3.0,4.47,-0.02885
2,2022-03-01 09:00:00,01ff0803-ae73-4234-971d-5713c97b7f4b,0.67,,,-0.02885
3,2022-03-01 09:00:00,0363eb21-8c74-47e1-a216-c37e565e5ceb,0.82,,,-0.02885
4,2022-03-01 09:00:00,03f0b20e-3b5b-444f-bc39-cdfa2523d4bc,0.05,,,-0.02885


In [19]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10845 entries, 0 to 10844
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   timestamp            10845 non-null  datetime64[ns]
 1   product_id           10845 non-null  object        
 2   estimated_stock_pct  10845 non-null  float64       
 3   quantity             3067 non-null   float64       
 4   total                3067 non-null   float64       
 5   temperature          10845 non-null  float64       
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 593.1+ KB
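
The non-null counts above show that most stock readings have no matching sale in the same hour, which is what a left merge produces for unmatched rows. A small sketch of how that could be inspected with the `indicator` option of `merge`; the frames `stock_demo` and `sales_demo` are hypothetical toy data, not the notebook's data.

```python
import pandas as pd

# Hypothetical toy frames: three stock readings, only one with a matching sale
stock_demo = pd.DataFrame({
    "timestamp": pd.to_datetime(["2022-03-01 09:00"] * 3),
    "product_id": ["a", "b", "c"],
    "estimated_stock_pct": [0.89, 0.14, 0.67],
})
sales_demo = pd.DataFrame({
    "timestamp": pd.to_datetime(["2022-03-01 09:00"]),
    "product_id": ["a"],
    "quantity": [3],
})

# indicator=True adds a _merge column recording which side each row came from
merged_demo = stock_demo.merge(
    sales_demo, on=["timestamp", "product_id"], how="left", indicator=True
)
print(merged_demo["_merge"].value_counts())

# Unmatched rows carry NaN in the sales columns; 0 means "no sale that hour"
merged_demo["quantity"] = merged_demo["quantity"].fillna(0)
print(merged_demo)
```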


In [20]:
# Add rest of product features
# Chaining drop_duplicates() returns a new frame, which avoids the
# SettingWithCopyWarning raised by calling it in place on a slice
category_feats = sales_df[['product_id', 'category']].drop_duplicates()

product_price = sales_df[['product_id', 'unit_price']].drop_duplicates()


In [21]:
# Merge onto main df
merged_df = merged_df.merge(category_feats, on='product_id', how='left') \
            .merge(product_price, on='product_id', how='left')
merged_df.head()

Unnamed: 0,timestamp,product_id,estimated_stock_pct,quantity,total,temperature,category,unit_price
0,2022-03-01 09:00:00,00e120bb-89d6-4df5-bc48-a051148e3d03,0.89,3.0,33.57,-0.02885,kitchen,11.19
1,2022-03-01 09:00:00,01f3cdd9-8e9e-4dff-9b5c-69698a0388d0,0.14,3.0,4.47,-0.02885,vegetables,1.49
2,2022-03-01 09:00:00,01ff0803-ae73-4234-971d-5713c97b7f4b,0.67,,,-0.02885,baby products,14.19
3,2022-03-01 09:00:00,0363eb21-8c74-47e1-a216-c37e565e5ceb,0.82,,,-0.02885,beverages,20.19
4,2022-03-01 09:00:00,03f0b20e-3b5b-444f-bc39-cdfa2523d4bc,0.05,,,-0.02885,pets,8.19


In [22]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10845 entries, 0 to 10844
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   timestamp            10845 non-null  datetime64[ns]
 1   product_id           10845 non-null  object        
 2   estimated_stock_pct  10845 non-null  float64       
 3   quantity             3067 non-null   float64       
 4   total                3067 non-null   float64       
 5   temperature          10845 non-null  float64       
 6   category             10845 non-null  object        
 7   unit_price           10845 non-null  float64       
dtypes: datetime64[ns](1), float64(5), object(2)
memory usage: 762.5+ KB


In [23]:
# Convert object types to categorical
merged_df['category'] = merged_df['category'].astype('category')
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10845 entries, 0 to 10844
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   timestamp            10845 non-null  datetime64[ns]
 1   product_id           10845 non-null  object        
 2   estimated_stock_pct  10845 non-null  float64       
 3   quantity             3067 non-null   float64       
 4   total                3067 non-null   float64       
 5   temperature          10845 non-null  float64       
 6   category             10845 non-null  category      
 7   unit_price           10845 non-null  float64       
dtypes: category(1), datetime64[ns](1), float64(5), object(1)
memory usage: 689.1+ KB


In [24]:
# Fill NaN: hours with no recorded sale get a quantity and total of 0
merged_df['quantity'] = merged_df['quantity'].fillna(0)
merged_df['total'] = merged_df['total'].fillna(0)

## Feature Engineering

In [25]:
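# Restrict to numeric columns explicitly; newer pandas versions no longer drop the others silently
merged_df.select_dtypes('number').corr()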

Unnamed: 0,estimated_stock_pct,quantity,total,temperature,unit_price
estimated_stock_pct,1.0,0.012929,0.0042,0.007955,-0.024479
quantity,0.012929,1.0,0.658054,-0.017771,-0.108101
total,0.0042,0.658054,1.0,-0.017438,0.245598
temperature,0.007955,-0.017771,-0.017438,1.0,0.002874
unit_price,-0.024479,-0.108101,0.245598,0.002874,1.0


In [26]:
# Extract date features
merged_df['month'] = merged_df.timestamp.dt.month
merged_df['day'] = merged_df.timestamp.dt.day
merged_df['day_of_week'] = merged_df.timestamp.dt.dayofweek
merged_df['hour'] = merged_df.timestamp.dt.hour

merged_df.drop(['timestamp', 'product_id'], axis=1, inplace=True)
merged_df

Unnamed: 0,estimated_stock_pct,quantity,total,temperature,category,unit_price,month,day,day_of_week,hour
0,0.89,3.0,33.57,-0.028850,kitchen,11.19,3,1,1,9
1,0.14,3.0,4.47,-0.028850,vegetables,1.49,3,1,1,9
2,0.67,0.0,0.00,-0.028850,baby products,14.19,3,1,1,9
3,0.82,0.0,0.00,-0.028850,beverages,20.19,3,1,1,9
4,0.05,0.0,0.00,-0.028850,pets,8.19,3,1,1,9
...,...,...,...,...,...,...,...,...,...,...
10840,0.50,4.0,19.96,-0.165077,fruit,4.99,3,7,0,19
10841,0.26,0.0,0.00,-0.165077,meat,19.99,3,7,0,19
10842,0.78,3.0,20.97,-0.165077,packaged foods,6.99,3,7,0,19
10843,0.92,3.0,44.97,-0.165077,meat,14.99,3,7,0,19


## Model Building

In [27]:
# Split data
X = merged_df.drop('estimated_stock_pct', axis=1)
y = merged_df['estimated_stock_pct']
print(X.shape)
print(y.shape)

(10845, 9)
(10845,)


In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train

Unnamed: 0,quantity,total,temperature,category,unit_price,month,day,day_of_week,hour
6641,0.0,0.0,0.160429,kitchen,19.19,3,5,5,12
10739,0.0,0.0,-0.165077,personal care,3.49,3,7,0,19
32,0.0,0.0,-0.028850,snacks,1.99,3,1,1,9
6793,0.0,0.0,0.671020,packaged foods,7.19,3,5,5,13
10482,0.0,0.0,-0.003988,spices and herbs,2.19,3,7,0,17
...,...,...,...,...,...,...,...,...,...
919,0.0,0.0,0.337059,medicine,12.99,3,1,1,15
8710,0.0,0.0,-0.827895,spices and herbs,0.19,3,6,6,15
10798,0.0,0.0,-0.165077,canned foods,7.49,3,7,0,19
3606,0.0,0.0,0.982000,baby products,15.99,3,3,3,12


In [29]:
y_train

6641     0.87
10739    0.06
32       0.22
6793     0.75
10482    0.73
         ... 
919      0.93
8710     0.91
10798    0.14
3606     0.03
6367     0.66
Name: estimated_stock_pct, Length: 8676, dtype: float64
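
The split above does not set `random_state`, so the rows that land in `X_train` will generally differ between runs. A minimal sketch of a reproducible split, using hypothetical toy data and an arbitrary seed of 42:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for X and y
X_demo = pd.DataFrame({"feature": np.arange(20)})
y_demo = pd.Series(np.arange(20) / 20, name="target")

# The same random_state yields the same rows in each split on every call
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

same_split = split_a[0].index.equals(split_b[0].index)
print(same_split)
```

Fixing the seed makes the cross-validation scores further down comparable from one run to the next.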

In [30]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10845 entries, 0 to 10844
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   quantity     10845 non-null  float64 
 1   total        10845 non-null  float64 
 2   temperature  10845 non-null  float64 
 3   category     10845 non-null  category
 4   unit_price   10845 non-null  float64 
 5   month        10845 non-null  int64   
 6   day          10845 non-null  int64   
 7   day_of_week  10845 non-null  int64   
 8   hour         10845 non-null  int64   
dtypes: category(1), float64(4), int64(4)
memory usage: 773.8 KB


In [31]:
X.describe()

Unnamed: 0,quantity,total,temperature,unit_price,month,day,day_of_week,hour
count,10845.0,10845.0,10845.0,10845.0,10845.0,10845.0,10845.0,10845.0
mean,0.908529,5.8187,-0.213678,8.938575,3.0,4.010973,2.988566,13.997234
std,1.78768,13.155445,0.649671,5.390235,0.0,1.998378,1.998261,3.165366
min,0.0,0.0,-1.84727,0.19,3.0,1.0,0.0,9.0
25%,0.0,0.0,-0.657082,4.99,3.0,2.0,1.0,11.0
50%,0.0,0.0,-0.230631,8.19,3.0,4.0,3.0,14.0
75%,1.0,3.725,0.160429,12.49,3.0,6.0,5.0,17.0
max,15.0,95.96,1.435938,23.99,3.0,7.0,6.0,19.0


In [32]:
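# Columns to scale and columns to one-hot encode
to_normalize = ['quantity', 'total', 'unit_price', 'temperature', 'month', 'day', 'hour']
to_one_hot = ['category']

# Preprocessing
# Note: Normalizer rescales each row to unit norm (it is not a per-feature scaler),
# and columns not listed above (day_of_week) are dropped by the default remainder='drop'
ct = make_column_transformer(
    (OneHotEncoder(), to_one_hot),
    (Normalizer(), to_normalize)
)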
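
`Normalizer` and a per-feature scaler such as `StandardScaler` behave quite differently, and the notebook does not say which behaviour was intended. The sketch below contrasts the two on a few hypothetical rows; the values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical rows: [quantity, unit_price, hour]
rows = np.array([
    [0.0, 19.19, 12.0],
    [3.0, 1.49, 9.0],
    [4.0, 4.99, 19.0],
])

# Normalizer works per row: each sample is divided by its own L2 norm
row_scaled = Normalizer().fit_transform(rows)
row_norms = np.linalg.norm(row_scaled, axis=1)

# StandardScaler works per column: each feature gets zero mean and unit variance
col_scaled = StandardScaler().fit_transform(rows)
col_means = col_scaled.mean(axis=0)

print(row_norms)           # every row now has length 1
print(col_means.round(6))  # every column now has mean 0
```

With row-wise normalization, a large value in one column (for example `hour`) shrinks the other features of the same row, which may or may not be desirable here.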

In [33]:
# ML pipeline function
SCORER = 'neg_mean_absolute_error'
def ml_pipeline_cv(model, ct, X_train: pd.DataFrame, y_train: pd.Series, cv: int, scorer: str=SCORER):
    """Builds a pipeline from a column transformer and a model, then runs K-fold
    cross validation on the pipeline.

    Args:
        model: Model compatible with sklearn Pipeline.
        ct: Sklearn column transformer.
        X_train (pd.DataFrame): Features dataframe.
        y_train (pd.Series): Target series.
        cv (int): Number of cross validation folds.
        scorer (str, optional): Sklearn scoring name. Defaults to SCORER.

    Returns:
        float: The mean test score across folds. With the default scorer this is
            the negated MAE, so values closer to 0 are better.
    """
    pipe = Pipeline([
        ('ct', ct),
        ('model', model)
    ])
    cv_results = cross_validate(pipe, X_train, y_train, cv=cv, scoring=scorer, error_score='raise')
    return cv_results['test_score'].mean()

In [34]:
# Linear Regression using Stochastic Gradient Descent
sgd = SGDRegressor()
mae = ml_pipeline_cv(sgd, ct, X_train, y_train, cv=4)
print(f'MAE: {mae.round(3)}')

MAE: -0.224
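
The printed value is negative because the `neg_mean_absolute_error` scorer returns the MAE with its sign flipped, so that higher is always better for sklearn's search utilities. A minimal sketch of recovering the positive MAE, using synthetic data and a `DummyRegressor` purely as a stand-in model:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.uniform(0, 1, size=100)

cv_result = cross_validate(
    DummyRegressor(strategy="mean"),
    X_demo,
    y_demo,
    cv=4,
    scoring="neg_mean_absolute_error",
)

neg_mae = cv_result["test_score"].mean()  # what the scorer reports (<= 0)
mae = -neg_mae                            # the actual mean absolute error
print(f"scorer output: {neg_mae:.3f}, MAE: {mae:.3f}")
```

Running a mean-predicting baseline like this on the real training data would also give a reference point for judging the scores in this section.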


In [35]:
# Support Vector Machine
svr = SVR()
mae = ml_pipeline_cv(svr, ct, X_train, y_train, cv=4)
print(f'MAE: {mae.round(3)}')

MAE: -0.226


In [36]:
# Random Forest
rf = RandomForestRegressor()
mae = ml_pipeline_cv(rf, ct, X_train, y_train, cv=4)
print(f'MAE: {mae.round(3)}')

MAE: -0.237


In [37]:
# Voting regressor: averages the predictions of SGD, SVR and random forest
estimators = [('sgd', sgd), ('svr', svr), ('rf', rf)]
vr = VotingRegressor(estimators=estimators)

mae = ml_pipeline_cv(vr, ct, X_train, y_train, cv=4)
print(f'MAE: {mae.round(3)}')

MAE: -0.226


In [38]:
# Voting regressor with SGD and SVR only
estimators = [('sgd', sgd), ('svr', svr)]
vr = VotingRegressor(estimators=estimators)

mae = ml_pipeline_cv(vr, ct, X_train, y_train, cv=4)
print(f'MAE: {mae.round(3)}')

MAE: -0.224
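
A `VotingRegressor` has no soft or hard voting mode; with default weights it simply averages the predictions of its fitted base estimators. The sketch below checks that on synthetic data, with two arbitrary base models chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=50)

vr_demo = VotingRegressor([
    ("lin", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=0)),
]).fit(X_demo, y_demo)

# Average the fitted base estimators' predictions by hand
manual_avg = np.mean([est.predict(X_demo) for est in vr_demo.estimators_], axis=0)
matches = np.allclose(vr_demo.predict(X_demo), manual_avg)
print(matches)
```

Because the output is a plain average, an ensemble of models with very similar errors tends to score about the same as its members, which is consistent with the results above.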


## Model Tuning

In [39]:
# Tuning grid search function
def model_tuning_cv(model, ct, param_grid, X_train: pd.DataFrame, y_train: pd.Series):
    """Creates a pipeline using a column transformer and model. Searches through all
    specified parameters with K-fold cross validation on the pipeline and prints the best ones.

    Args:
        model: Model compatible with sklearn Pipeline.
        ct: Sklearn column transformer.
        param_grid: Dict or list of dicts containing the parameters to search through.
        X_train (pd.DataFrame): Features dataframe.
        y_train (pd.Series): Target series.
    """
    pipe = Pipeline([
        ('ct', ct),
        ('model', model)
    ])
    gs = GridSearchCV(pipe, param_grid, scoring=SCORER, n_jobs=5)
    gs.fit(X_train, y_train)
    print(gs.best_params_)
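
`model_tuning_cv` only prints `best_params_`, so the best settings have to be copied by hand afterwards. A fitted `GridSearchCV` also exposes `best_score_` and, with the default `refit=True`, a ready-to-use `best_estimator_`. A minimal sketch on synthetic data, with `Ridge` and `StandardScaler` as stand-ins for the notebook's model and column transformer:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=80)

pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
param_grid = {"model__alpha": [0.01, 0.1, 1.0]}

gs = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error", cv=5)
gs.fit(X_demo, y_demo)

print(gs.best_params_)        # best parameter combination
print(-gs.best_score_)        # its cross-validated MAE (sign flipped back)
best_pipe = gs.best_estimator_  # pipeline refitted on all of X_demo
```

Returning the search object from the helper, rather than only printing, would make these attributes available to later cells.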

### SGD Linear Regression

In [40]:
# Grid search
params_sgd = [
    {
        'model__alpha': [0.1, 0.01, 0.001, 0.0001],
        'model__loss': ['squared_error', 'huber'],
        'model__penalty': ['l2', 'l1'],
        'model__max_iter': [500, 1000, 1500, 2000],
        'model__learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive']
    },
    {
        'model__alpha': [0.1, 0.01, 0.001, 0.0001],
        'model__loss': ['squared_error', 'huber'],
        'model__penalty': ['elasticnet'],
        'model__l1_ratio': [0.5, 0.15, 0.25],
        'model__max_iter': [500, 1000, 1500, 2000],
        'model__learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive']
    }
]
model_tuning_cv(sgd, ct, params_sgd, X_train, y_train)

{'model__alpha': 0.1, 'model__l1_ratio': 0.5, 'model__learning_rate': 'constant', 'model__loss': 'squared_error', 'model__max_iter': 500, 'model__penalty': 'elasticnet'}
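
An exhaustive search over a grid like this can be expensive. Its size can be counted up front with `ParameterGrid`; the sketch below rebuilds the same grid as the cell above (the `shared` dict is only a shorthand introduced here).

```python
from sklearn.model_selection import ParameterGrid

# Settings common to both sub-grids
shared = {
    "model__alpha": [0.1, 0.01, 0.001, 0.0001],
    "model__loss": ["squared_error", "huber"],
    "model__max_iter": [500, 1000, 1500, 2000],
    "model__learning_rate": ["constant", "optimal", "invscaling", "adaptive"],
}
params_sgd = [
    {**shared, "model__penalty": ["l2", "l1"]},
    {**shared, "model__penalty": ["elasticnet"], "model__l1_ratio": [0.5, 0.15, 0.25]},
]

n_candidates = len(ParameterGrid(params_sgd))
n_fits = n_candidates * 5  # GridSearchCV defaults to 5-fold cross validation
print(n_candidates, n_fits)  # 640 candidates, 3200 fits (plus one final refit)
```

When the count grows much beyond this, a randomized search over the same space is usually the cheaper option.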


In [43]:
# Finer tune
params_sgd_finer = {
    'model__alpha': [0.1],
    'model__l1_ratio': [0.6],
    'model__learning_rate': ['constant'],
    'model__loss': ['squared_error'],
    'model__max_iter': np.arange(350, 400),
    'model__penalty': ['elasticnet']
}
model_tuning_cv(sgd, ct, params_sgd_finer, X_train, y_train)

{'model__alpha': 0.1, 'model__l1_ratio': 0.6, 'model__learning_rate': 'constant', 'model__loss': 'squared_error', 'model__max_iter': 371, 'model__penalty': 'elasticnet'}


In [44]:
sgd_best = SGDRegressor(
    alpha=0.1, learning_rate='constant', max_iter=371, penalty='elasticnet', l1_ratio=0.6)

mae = ml_pipeline_cv(sgd_best, ct, X_train, y_train, cv=5)
print(f'MAE: {mae.round(3)}')

MAE: -0.223


### SVR

In [45]:
# Parameters
params_svr = {
    'model__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'model__degree': np.arange(1, 12, 2),
    'model__gamma': ['scale', 'auto'],
    'model__coef0': [0, 0.5, 0.9],
    'model__C': [1.0, 3.0, 5.0]
}
model_tuning_cv(svr, ct, params_svr, X_train, y_train)

{'model__C': 1.0, 'model__coef0': 0, 'model__degree': 3, 'model__gamma': 'auto', 'model__kernel': 'poly'}


In [47]:
# Finer tune
params_svr_finer = {
    'model__kernel': ['poly'],
    'model__degree': np.arange(1, 6),
    'model__gamma': ['auto'],
    'model__C': [1, 2]
}
model_tuning_cv(svr, ct, params_svr_finer, X_train, y_train)

{'model__C': 1, 'model__degree': 3, 'model__gamma': 'auto', 'model__kernel': 'poly'}


In [48]:
svr_best = SVR(kernel='poly', C=1, degree=3, gamma='auto')
mae = ml_pipeline_cv(svr_best, ct, X_train, y_train, cv=5)
print(f'MAE: {mae.round(3)}')

MAE: -0.223


### Linear SGD & SVR Voting Regressor

In [49]:
# Random tuning grid search function
def random_model_tuning_cv(model, ct, params, X_train: pd.DataFrame, y_train: pd.Series, n_iter: int=10, random_state: int=123):
    """Creates a pipeline using a column transformer and model. Searches through a
    random sample of specified parameters with K-fold cross validation on the pipeline.

    Args:
        model: Model compatible with sklearn Pipeline.
        ct: Sklearn column transformer.
        params: Dict or list of dicts containing the parameters to search through.
        X_train (pd.DataFrame): Features dataframe.
        y_train (pd.Series): Target series.
        n_iter (int, optional): Number of search iterations. Defaults to 10.
        random_state (int, optional): Random state instance. Defaults to 123.
    """
    pipe = Pipeline([
        ('ct', ct),
        ('model', model)
    ])
    random_gs = RandomizedSearchCV(
        pipe, params, n_iter=n_iter, cv=4, n_jobs=5, scoring=SCORER, random_state=random_state)
    random_gs.fit(X_train, y_train)
    print(random_gs.best_params_)

In [50]:
params_sgd_svr = [
    {
        'model__sgd__alpha': [0.1, 0.01, 0.001, 0.0001],
        'model__sgd__loss': ['squared_error', 'huber'],
        'model__sgd__penalty': ['l2', 'l1'],
        'model__sgd__learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'],
        'model__svr__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'model__svr__degree': [1, 3, 5, 7, 9, 11, 13, 15],
        'model__svr__gamma': ['scale', 'auto'],
        'model__svr__C': np.arange(1, 12, 2)
    },
    {
        'model__sgd__alpha': [ 0.01, 0.001],
        'model__sgd__loss': ['squared_error', 'huber'],
        'model__sgd__penalty': ['elasticnet'],
        'model__sgd__l1_ratio': [0.2, 0.5, 0.7, 0.9],
        'model__sgd__learning_rate': ['constant', 'optimal', 'invscaling', 'adaptive'],
        'model__svr__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'model__svr__degree': [1, 3, 5, 7, 9, 11, 13, 15],
        'model__svr__gamma': ['scale', 'auto'],
        'model__svr__C': np.arange(1, 12, 2)
    }
]
estimators = [
    ('sgd', SGDRegressor()),
    ('svr', SVR())
]
voting_reg = VotingRegressor(estimators, n_jobs=5)
random_model_tuning_cv(
    voting_reg, ct, params_sgd_svr, X_train, y_train, n_iter=500)

{'model__svr__kernel': 'poly', 'model__svr__gamma': 'auto', 'model__svr__degree': 11, 'model__svr__C': 3, 'model__sgd__penalty': 'l1', 'model__sgd__loss': 'squared_error', 'model__sgd__learning_rate': 'constant', 'model__sgd__alpha': 0.1}


In [51]:
# Finer tune
params_sgd_svr = {
    'model__sgd__alpha': [0.1, 0.3],
    'model__sgd__loss': ['squared_error'],
    'model__sgd__penalty': ['l1'],
    'model__sgd__learning_rate': ['constant'],
    'model__svr__kernel': ['poly'],
    'model__svr__degree': np.arange(10, 21),
    'model__svr__gamma': ['auto'],
    'model__svr__C': np.arange(1, 7)
}
voting_reg = VotingRegressor(estimators, n_jobs=5)
model_tuning_cv(voting_reg, ct, params_sgd_svr, X_train, y_train)

{'model__sgd__alpha': 0.3, 'model__sgd__learning_rate': 'constant', 'model__sgd__loss': 'squared_error', 'model__sgd__penalty': 'l1', 'model__svr__C': 5, 'model__svr__degree': 19, 'model__svr__gamma': 'auto', 'model__svr__kernel': 'poly'}


In [52]:
# Calculate MAE for Voting Regressor
voting_sgd_best = SGDRegressor(
    alpha=0.3, penalty='l1', learning_rate='constant')
voting_svr_best = SVR(
    kernel='poly', degree=19, gamma='auto', C=5)

estimators_best = [
    ('sgd', voting_sgd_best),
    ('svr', voting_svr_best)
]
voting_reg_best = VotingRegressor(estimators_best)

mae = ml_pipeline_cv(voting_reg_best, ct, X_train, y_train, cv=5)
print(f'MAE: {mae.round(3)}')

MAE: -0.223


### Final Model

After tuning, three models reached the same cross-validated mean absolute error (about 0.223):
- Linear Regression with SGD
- Support Vector Machine
- Voting Regressor with Linear Regression and SVM

Since accuracy does not separate them, fitting speed decides: the linear model was drastically faster to fit, so it will be used for production and scaling.
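
The notebook does not record fit times. One way the speed difference could be checked is to read `fit_time` from the `cross_validate` results. The sketch below does this on synthetic data, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_validate
from sklearn.svm import SVR

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = rng.uniform(0, 1, size=1000)

fit_times = {}
for name, model in [("sgd", SGDRegressor(random_state=0)), ("svr", SVR())]:
    result = cross_validate(
        model, X_demo, y_demo, cv=4, scoring="neg_mean_absolute_error"
    )
    # Mean time in seconds to fit the model on one training fold
    fit_times[name] = result["fit_time"].mean()

print(fit_times)
```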

In [53]:
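# Save model as pickle file
# cross_validate only fits clones, so fit the full pipeline (preprocessing + model) before saving
final_pipe = Pipeline([('ct', ct), ('model', sgd_best)]).fit(X_train, y_train)
joblib.dump(final_pipe, 'Task_4/final_model.pkl')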

['Task_4/final_model.pkl']
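
A saved estimator is only useful if it was fitted before `joblib.dump`, and `cross_validate` fits clones rather than the object passed in. The following sketch shows a save and load round trip for a fitted pipeline on synthetic data; the temporary file name `demo_model.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = rng.uniform(0, 1, size=200)

# Fit preprocessing and model together so both are stored in one file
pipe = Pipeline([("scale", StandardScaler()), ("model", SGDRegressor(random_state=0))])
pipe.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(pipe, model_path)
    loaded = joblib.load(model_path)

# The reloaded pipeline should reproduce the original predictions
round_trip_ok = np.allclose(pipe.predict(X_demo), loaded.predict(X_demo))
print(round_trip_ok)
```

Saving the whole pipeline means the consumer of the file does not need to rebuild the column transformer before calling `predict`.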