<a href="https://colab.research.google.com/github/ruelanthonyb/sales-predictions/blob/main/Project_1_Part_6_(Core).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1 - Part 6 (Core)

You will add modeling to your sales prediction project. The goal of this is to help the retailer understand the properties of products and outlets that play crucial roles in predicting sales.



**CRISP-DM Phase 4 - Modeling**

1. Your first task is to build a linear regression model to predict sales.
* Build a linear regression model.
* Use the custom evaluation function to get the metrics for your model (on training and test data).
* Compare the training vs. test R-squared values and answer the question: to what extent is this model overfit/underfit?

2. Your second task is to build a Random Forest model to predict sales.
* Build a default Random Forest model.
* Use the custom evaluation function to get the metrics for your model (on training and test data).
* Compare the training vs. test R-squared values and answer the question: to what extent is this model overfit/underfit?
* Compare this model's performance to the linear regression model: which model has the best test scores?

3. Use GridSearchCV to tune at least two hyperparameters for a Random Forest model.
* After determining the best parameters from your GridSearch, fit and evaluate a final best model on the entire training set (no folds).
* Compare your tuned model to your default Random Forest: did the performance improve?

**CRISP-DM Phase 5 - Evaluation**

4. You now have tried several different models on your data set. You need to determine which model to implement.
* Overall, which model do you recommend?
* Justify your recommendation.
* In a Markdown cell:
  * Interpret your model's performance based on R-squared in a way that your non-technical stakeholder can understand.
  * Select another regression metric (RMSE/MAE/MSE) to express the performance of your model to your stakeholder.
  * Include why you selected this metric to explain to your stakeholder.
  * Compare the training vs. test scores and answer the question: to what extent is this model overfit/underfit?

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# import libraries
import pandas as pd

# pre-processing functions
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer

# ML models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

from sklearn import set_config
set_config(transform_output='pandas')

In [None]:
# Define custom function

def regression_metrics(y_true, y_pred, label='', verbose = True,
                       output_dict=False):
  # Get metrics
  mae = mean_absolute_error(y_true, y_pred)
  mse = mean_squared_error(y_true, y_pred)
  rmse = mse ** 0.5  # square root of MSE; the squared=False argument is removed in newer scikit-learn releases
  r_squared = r2_score(y_true, y_pred)
  if verbose:
    # Print Result with Label and Header
    header = "-"*60
    print(header, f"Regression Metrics: {label}", header, sep='\n')
    print(f"- MAE = {mae:,.3f}")
    print(f"- MSE = {mse:,.3f}")
    print(f"- RMSE = {rmse:,.3f}")
    print(f"- R^2 = {r_squared:,.3f}")
  if output_dict:
    metrics = {'Label':label, 'MAE':mae,
               'MSE':mse, 'RMSE':rmse, 'R^2':r_squared}
    return metrics

def evaluate_regression(reg, X_train, y_train, X_test, y_test, verbose = True,
                        output_frame=False):
  # Get predictions for training data
  y_train_pred = reg.predict(X_train)

  # Call the helper function to obtain regression metrics for training data
  results_train = regression_metrics(y_train, y_train_pred, verbose = verbose,
                                     output_dict=output_frame,
                                     label='Training Data')
  print()
  # Get predictions for test data
  y_test_pred = reg.predict(X_test)
  # Call the helper function to obtain regression metrics for test data
  results_test = regression_metrics(y_test, y_test_pred, verbose = verbose,
                                  output_dict=output_frame,
                                    label='Test Data' )

  # Store results in a dataframe if output_frame is True
  if output_frame:
    results_df = pd.DataFrame([results_train,results_test])
    # Set the label as the index
    results_df = results_df.set_index('Label')
    # Set index.name to none to get a cleaner looking result
    results_df.index.name=None
    # Return the dataframe
    return results_df.round(3)
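
To make the four metrics reported by the custom function concrete, here is a minimal sketch that recomputes them in plain Python. The helper name `manual_metrics` and the toy values are assumptions for illustration only; they are not part of the project or the sales data.

```python
import math

def manual_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 by hand (illustrative helper)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n        # mean absolute error
    mse = sum(e ** 2 for e in errors) / n        # mean squared error
    rmse = math.sqrt(mse)                        # back in the target's units
    mean_true = sum(y_true) / n
    ss_res = sum(e ** 2 for e in errors)         # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    r2 = 1 - ss_res / ss_tot
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'R^2': r2}

# Made-up values, not taken from the sales dataset
toy_true = [100.0, 200.0, 300.0, 400.0]
toy_pred = [110.0, 190.0, 330.0, 380.0]
toy_metrics = manual_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Note how RMSE is simply the square root of MSE, which is why it can be read in the same units as the target.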

# Loading Data

In [None]:
# Import the data
path = '/content/drive/MyDrive/<< Coding Dojo PH >>/Data Analytics & Visualization/02 Intro to ML/Wk1/assignments/datasets/sales_predictions_2023.csv'
df = pd.read_csv(path)
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


# Explore and Clean the Data

In [None]:
# check number of rows and columns
df.shape

(8523, 12)

* There are 8523 rows and 12 columns

In [None]:
# check data types
df.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

In [None]:
# check for duplicates
df.duplicated().sum()

0

In [None]:
# check for missing values
df.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

In [None]:
# select object columns
categoricals = df.select_dtypes('object')

# Check for inconsistencies in categorical data
for col in categoricals.columns:
  print(col)
  print(categoricals[col].unique(), '\n')

Item_Identifier
['FDA15' 'DRC01' 'FDN15' ... 'NCF55' 'NCW30' 'NCW05'] 

Item_Fat_Content
['Low Fat' 'Regular' 'low fat' 'LF' 'reg'] 

Item_Type
['Dairy' 'Soft Drinks' 'Meat' 'Fruits and Vegetables' 'Household'
 'Baking Goods' 'Snack Foods' 'Frozen Foods' 'Breakfast'
 'Health and Hygiene' 'Hard Drinks' 'Canned' 'Breads' 'Starchy Foods'
 'Others' 'Seafood'] 

Outlet_Identifier
['OUT049' 'OUT018' 'OUT010' 'OUT013' 'OUT027' 'OUT045' 'OUT017' 'OUT046'
 'OUT035' 'OUT019'] 

Outlet_Size
['Medium' nan 'High' 'Small'] 

Outlet_Location_Type
['Tier 1' 'Tier 3' 'Tier 2'] 

Outlet_Type
['Supermarket Type1' 'Supermarket Type2' 'Grocery Store'
 'Supermarket Type3'] 



In [None]:
# check inconsistent categories
df['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [None]:
df['Item_Fat_Content'].value_counts().sum()

8523

In [None]:
# replacing inconsistent categories in 'Item_Fat_Content' column
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})

# check value counts to confirm changes
df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64
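
As a sanity check on the replacement above, the following sketch re-adds the value counts printed earlier using the same label mapping. It uses plain dictionaries rather than pandas; the variable names are illustrative.

```python
# Value counts printed before cleaning
raw_counts = {'Low Fat': 5089, 'Regular': 2889, 'LF': 316,
              'reg': 117, 'low fat': 112}

# Same mapping that was passed to .replace()
label_map = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}

clean_counts = {}
for label, count in raw_counts.items():
    clean_label = label_map.get(label, label)  # unmapped labels stay as-is
    clean_counts[clean_label] = clean_counts.get(clean_label, 0) + count

print(clean_counts)
print(sum(clean_counts.values()))
```

The merged totals match the cleaned output (5517 and 3006), and no rows were lost (8523 in total).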

# Identify the Features (X) and Target (y)

In [None]:
# define features X and target y
X = df.drop(columns = 'Item_Outlet_Sales')
y = df['Item_Outlet_Sales']

X.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 |


# Dropping Unwanted Column

In [None]:
# drop the 'Item_Identifier' column
X = X.drop(columns = 'Item_Identifier')
X.head()

| | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 |
| 1 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 |
| 2 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 |
| 3 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store |
| 4 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 |


# Train/Test Split the Data

In [None]:
# perform a train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6392 entries, 4776 to 7270
Data columns (total 10 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Weight                5285 non-null   float64
 1   Item_Fat_Content           6392 non-null   object 
 2   Item_Visibility            6392 non-null   float64
 3   Item_Type                  6392 non-null   object 
 4   Item_MRP                   6392 non-null   float64
 5   Outlet_Identifier          6392 non-null   object 
 6   Outlet_Establishment_Year  6392 non-null   int64  
 7   Outlet_Size                4580 non-null   object 
 8   Outlet_Location_Type       6392 non-null   object 
 9   Outlet_Type                6392 non-null   object 
dtypes: float64(3), int64(1), object(6)
memory usage: 549.3+ KB
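
The 6392 training rows follow from the default split size. As a rough check, assuming `train_test_split` falls back to a 25% test fraction when neither `test_size` nor `train_size` is given and rounds the test count up, the arithmetic works out as follows:

```python
import math

n_rows = 8523          # rows in the full dataset (from df.shape)
test_fraction = 0.25   # assumed default when no sizes are passed

n_test = math.ceil(test_fraction * n_rows)  # test count is rounded up
n_train = n_rows - n_test                   # remainder goes to training

print(n_train, n_test)
```

This reproduces the 6392 entries reported by `X_train.info()`.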


In [None]:
# Display the number of null values in X_train
X_train.isna().sum()

Item_Weight                  1107
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  1812
Outlet_Location_Type            0
Outlet_Type                     0
dtype: int64

# Create a numeric features pipeline

In [None]:
# save a list of numeric columns
num_cols = ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']

# instantiate the imputer for numeric columns
num_mean_imputer = SimpleImputer(strategy = 'mean')

# instantiate scaler
num_scaler = StandardScaler()

# make a numeric pipeline
num_pipe = make_pipeline(num_mean_imputer, num_scaler)
num_pipe

# Create ordinal pipeline

In [None]:
# save a list of ordinal columns
ord_cols = ['Outlet_Size']

# instantiate the imputer for ordinal data
ord_freq_imputer = SimpleImputer(strategy = 'most_frequent')

# check unique values in 'Outlet_Size'
X_train['Outlet_Size'].unique()

array(['Medium', 'Small', nan, 'High'], dtype=object)

In [None]:
# define the category order for 'Outlet_Size'
Outlet_Size_order = ['Small', 'Medium', 'High']

# make an order list for OrdinalEncoder
ord_cat_order = [Outlet_Size_order]

# instantiate the OrdinalEncoder
ord_encoder = OrdinalEncoder(categories = ord_cat_order)

In [None]:
# instantiate scaler
ord_scaler = StandardScaler()

# make an ordinal pipeline
ord_pipe = make_pipeline(ord_freq_imputer, ord_encoder, ord_scaler)
ord_pipe
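
To illustrate what the three steps of the ordinal pipeline do, here is a plain-Python sketch on a small made-up sample. It mimics the behavior of the imputer, encoder, and scaler rather than calling scikit-learn, and the sample values are assumptions, not rows from the dataset.

```python
from collections import Counter
import statistics

# Made-up sample with missing values (None stands in for NaN)
sizes = ['Medium', 'Small', None, 'High', 'Medium', None]
order = ['Small', 'Medium', 'High']  # same order used in the pipeline

# Step 1: fill missing values with the most frequent category
most_common = Counter(s for s in sizes if s is not None).most_common(1)[0][0]
imputed = [s if s is not None else most_common for s in sizes]

# Step 2: encode each category as its position in the ordered list
codes = [float(order.index(s)) for s in imputed]

# Step 3: standardize to mean 0 and unit (population) standard deviation
mean = statistics.fmean(codes)
std = statistics.pstdev(codes)
scaled = [(c - mean) / std for c in codes]

print(imputed)
print(codes)
print([round(v, 3) for v in scaled])
```

One consequence worth keeping in mind: every missing `Outlet_Size` is treated as the most common size, so the encoded column cannot distinguish "unknown" from that category.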

# Encode the Categorical (nominal) Features

In [None]:
# save a list of categorical columns
cat_cols = X_train.select_dtypes('object').drop(columns=ord_cols).columns
cat_cols

Index(['Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
       'Outlet_Location_Type', 'Outlet_Type'],
      dtype='object')

In [None]:
# instantiate the OneHotEncoder
one_incoder = OneHotEncoder(handle_unknown = 'ignore', sparse_output = False)
one_incoder

# Create a Column Transformer

In [None]:
# define 3 tuples for column transformer
num_tuple = ('numeric', num_pipe, num_cols)
ord_tuple = ('ordinal', ord_pipe, ord_cols)
cat_tuple = ('categorical', one_incoder, cat_cols)

# create one column transformer
preprocessor = ColumnTransformer([num_tuple, ord_tuple, cat_tuple], verbose_feature_names_out=False)
preprocessor

# 1. Build a Linear Regression Model.

In [None]:
# instantiate linear regression model
lr = LinearRegression()

# combine the processor with the linear regression model in a model pipeline
lr_pipe = make_pipeline(preprocessor, lr)

# fit the model on the training data
lr_pipe.fit(X_train, y_train)

# Evaluate with custom function
evaluate_regression(lr_pipe, X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 847.126
- MSE = 1,297,559.357
- RMSE = 1,139.105
- R^2 = 0.562

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 804.105
- MSE = 1,194,333.006
- RMSE = 1,092.855
- R^2 = 0.567


Compare the training vs test R-squared values and answer the question: to what extent is this model overfit/underfit?

  * The test R^2 (0.567) is slightly higher than the training R^2 (0.562), so there is no sign of overfitting.

  * Because both scores are nearly identical, the model generalizes consistently to unseen data. However, it explains only about 56% of the variance in sales on either set, which suggests it may be somewhat underfit (too simple to capture the remaining patterns).


# 2. Build a Random Forest Model.

In [None]:
# Instantiate default random forest model
rf = RandomForestRegressor(random_state=42)

# combine the processor with the random forest model in a model pipeline
rf_pipe = make_pipeline(preprocessor, rf)

# fit the model on the training data
rf_pipe.fit(X_train, y_train)

# Evaluate with custom function
evaluate_regression(rf_pipe, X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 296.372
- MSE = 182,847.683
- RMSE = 427.607
- R^2 = 0.938

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 766.319
- MSE = 1,214,631.657
- RMSE = 1,102.103
- R^2 = 0.560


Compare the training vs. test R-squared values and answer the question: to what extent is this model overfit/underfit?

  * The model performs very well on the training data (R^2 = 0.938) but much worse on the test data (R^2 = 0.560).
  * This gap of roughly 0.38 in R^2, together with a test RMSE (1,102.103) more than double the training RMSE (427.607), indicates substantial **overfitting**.

Compare this model's performance to the linear regression model: which model has the best test scores?

  * The Linear Regression model has marginally better test R^2 (0.567 vs. 0.560) and RMSE (1,092.855 vs. 1,102.103), while the Random Forest model achieves a lower test MAE (766.319 vs. 804.105), meaning its predictions are closer to the actual values on average.
  * The test scores are therefore mixed and close. Taking MAE as a practical measure of average error, the **Random Forest** model performs slightly better.
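
The per-metric comparison can be sketched with the test scores reported above. The dictionary layout and variable names are illustrative; the only rule encoded is that lower is better for MAE and RMSE and higher is better for R^2.

```python
# Test-set scores copied from the two evaluation outputs above
test_scores = {
    'Linear Regression': {'MAE': 804.105, 'RMSE': 1092.855, 'R^2': 0.567},
    'Random Forest (default)': {'MAE': 766.319, 'RMSE': 1102.103, 'R^2': 0.560},
}
higher_is_better = {'R^2'}  # error metrics are better when lower

winners = {}
for metric in ['MAE', 'RMSE', 'R^2']:
    pick = max if metric in higher_is_better else min
    winners[metric] = pick(test_scores,
                           key=lambda name: test_scores[name][metric])

print(winners)
```

The winner depends on the metric, which is why the choice of metric should be stated alongside any "best model" claim.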

# 3. Use GridSearchCV to tune at least two hyperparameters for a Random Forest model.

In [None]:
# See parameters for tuning
rf_pipe.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('numeric',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer()),
                                                    ('standardscaler',
                                                     StandardScaler())]),
                                    ['Item_Weight', 'Item_Visibility', 'Item_MRP',
                                     'Outlet_Establishment_Year']),
                                   ('ordinal',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer(strategy='most_frequent')),
                                                    ('ordinalencoder',
                                                     OrdinalEncoder(categories=[['Small',
                                                                                 'Medium',
  

In [None]:
# Define parameters to tune
params = {'randomforestregressor__max_depth': [4, 6, 8],
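          # note: 'auto' was removed in scikit-learn 1.3; for regressors it meant all features (use 1.0 there)
          'randomforestregressor__max_features': ['auto', 'sqrt'],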
          'randomforestregressor__min_samples_leaf': [1, 2, 4],
          'randomforestregressor__min_samples_split': [4, 6, 8, 10],
          'randomforestregressor__n_estimators': [150, 200, 250]}

# Instantiate GridSearchCV
gridsearch = GridSearchCV(rf_pipe, params, n_jobs= -1, verbose=1)

# Fit the GridSearchCV on the training data
gridsearch.fit(X_train, y_train)

Fitting 5 folds for each of 216 candidates, totalling 1080 fits
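
The candidate and fit counts in this message follow directly from the size of the grid. A minimal check, using the same value lists as `params` (shortened keys are for readability only) and the 5 folds shown in the output:

```python
import math

# Same value lists as the params dictionary above
grid = {
    'max_depth': [4, 6, 8],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [4, 6, 8, 10],
    'n_estimators': [150, 200, 250],
}

# Every combination of values is one candidate model
n_candidates = math.prod(len(values) for values in grid.values())
n_folds = 5  # as reported in the GridSearchCV message
total_fits = n_candidates * n_folds

print(n_candidates, total_fits)
```

Because the count is a product, adding even one more value to a single hyperparameter noticeably increases the run time.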



In [None]:
# Obtain the best parameters from the gridsearch
gridsearch.best_params_

{'randomforestregressor__max_depth': 6,
 'randomforestregressor__max_features': 'auto',
 'randomforestregressor__min_samples_leaf': 2,
 'randomforestregressor__min_samples_split': 8,
 'randomforestregressor__n_estimators': 250}

In [None]:
# Define the best version of the model
best_rf = gridsearch.best_estimator_

# Evaluate with custom function
evaluate_regression(best_rf, X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 742.442
- MSE = 1,113,816.743
- RMSE = 1,055.375
- R^2 = 0.624

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 727.540
- MSE = 1,099,550.663
- RMSE = 1,048.595
- R^2 = 0.601


Compare your tuned model to your default Random Forest: did the performance improve?

  * Yes. Test R^2 rose from 0.560 to 0.601, and test RMSE and MAE fell from 1,102.103 to 1,048.595 and from 766.319 to 727.540. Training R^2 dropped from 0.938 to 0.624, so the tuned model is far less overfit and generalizes better to unseen data.
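
The effect of tuning can be quantified from the reported scores. This sketch computes the train/test R^2 gap for each Random Forest and the relative drop in test RMSE; the variable names are illustrative.

```python
# R^2 values copied from the default and tuned evaluation outputs
r2 = {
    'default': {'train': 0.938, 'test': 0.560},
    'tuned': {'train': 0.624, 'test': 0.601},
}

# A large train-minus-test gap is a sign of overfitting
gaps = {name: round(s['train'] - s['test'], 3) for name, s in r2.items()}

# Gain in test R^2 from tuning
test_gain = round(r2['tuned']['test'] - r2['default']['test'], 3)

# Relative reduction in test RMSE (default 1,102.103 vs tuned 1,048.595)
rmse_drop_pct = round(100 * (1102.103 - 1048.595) / 1102.103, 1)

print(gaps, test_gain, rmse_drop_pct)
```

The gap shrinks from about 0.38 to about 0.02, while test RMSE falls by roughly 5%.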

# 4. You now have tried several different models on your data set.

You need to determine which model to implement.

Overall, which model do you recommend?

  * I recommend implementing the tuned Random Forest model.

Justify your recommendation.

  * The tuned Random Forest model has the best test scores of the three models: R^2 of 0.601 (vs. 0.567 for Linear Regression and 0.560 for the default Random Forest), RMSE of 1,048.595, and MAE of 727.540.
  * Tuning reduced the model's variance without introducing excessive bias: the gap between training and test R^2 shrank from about 0.38 to about 0.02.
  * Because its training and test scores are close, it is a more reliable choice for making predictions on new data.

Interpret your model's performance based on R-squared in a way that your non-technical stakeholder can understand.

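  * The R^2 value of about 0.60 on the test set means the model explains roughly 60% of the variation in item sales using the product and outlet information we have. The remaining 40% is driven by factors the model does not capture.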

Select another regression metric (RMSE/MAE/MSE) to express the performance of your model to your stakeholder. Include why you selected this metric to explain to your stakeholder.

  * I selected RMSE because it expresses the model's prediction error in the same units as the target variable (item outlet sales). On the test set the RMSE is about 1,049, which is the typical size of the gap between a predicted and an actual sales value, with large misses weighted more heavily than small ones. A lower RMSE means better accuracy.
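
To show how RMSE differs from MAE, here is a small sketch with two made-up sets of prediction errors that have the same average size. The numbers are illustrative only and are not drawn from the models above.

```python
import math

def mae(errors):
    """Mean absolute error of a list of prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

even_errors = [100, 100, 100, 100]  # every prediction is off by the same amount
spiky_errors = [0, 0, 0, 400]       # one large miss, otherwise perfect

print(mae(even_errors), rmse(even_errors))
print(mae(spiky_errors), rmse(spiky_errors))
```

Both sets have the same MAE, but RMSE doubles for the set with one large miss, which is useful when large forecasting errors are especially costly to the stakeholder.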

Compare the training vs. test scores and answer the question: to what extent is this model overfit/underfit?

  * The training and test scores are close: R^2 of 0.624 vs. 0.601, RMSE of 1,055.375 vs. 1,048.595, and MAE of 742.442 vs. 727.540. The model is therefore not meaningfully overfit. With about 40% of the variance still unexplained there is some remaining bias, but the model should perform consistently on new data, making it a reliable choice for deployment.