<a href="https://colab.research.google.com/github/whitefreeze/Prediction-of-Product-Sales/blob/main/Prediction_of_Product_Sales_Part6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prediction of Product Sales
Part 5

## Import Necessary Libraries

In [None]:
# Libraries
import pandas as pd
import numpy as np

# Preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Models
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

## Regression Metrics
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

from sklearn import set_config
set_config(display='diagram')

## Functions

In [None]:
# Function from Regression Metrics Solutions code
# Create a function to take the true and predicted values
# and print MAE, MSE, RMSE, and R2 metrics for a model
def eval_regression(y_true, y_pred, name='model'):
    """Takes true targets and predictions from a regression model and prints
    MAE, MSE, RMSE, AND R2 scores
    Set 'name' to name of model and 'train' or 'test' as appropriate"""
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)

    print(f'{name} Scores')
    print(f'MAE: {mae:,.4f} \nMSE: {mse:,.4f} \nRMSE: {rmse:,.4f} \nR2: {r2:.4f}\n')
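
As a quick check on what the four metrics mean, here is a minimal sketch that computes them by hand with NumPy. The three toy values are assumptions chosen for easy arithmetic and are not taken from the sales data.

```python
import numpy as np

# Toy values (assumed for illustration only)
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred                 # residuals: [1, 0, -2]
mae = np.mean(np.abs(errors))            # mean absolute error
mse = np.mean(errors ** 2)               # mean squared error
rmse = np.sqrt(mse)                      # same units as the target

# R2 compares the model's squared error to that of always predicting the mean
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f'MAE: {mae:.4f} MSE: {mse:.4f} RMSE: {rmse:.4f} R2: {r2:.4f}')
```

RMSE is reported in the same units as the target, which makes it easier to interpret than MSE.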

## Load the Data

In [None]:
# Connect Google Drive to import data
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Import the data
path = '/content/drive/MyDrive/Data Science/Coding Dojo/Course 2: ML/05 Week 5: ML Intro/sales_predictions_2023.csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [None]:
# Explore the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


* Before splitting your data, you can drop duplicates.

In [None]:
# Count number of duplicate rows in dataset
print(f'There are {df.duplicated().sum()} duplicate rows.')

There are 0 duplicate rows.


-No duplicates found. None removed.

* Before splitting your data, you can fix inconsistencies in categorical data.

In [None]:
# Check Item_Fat_Content for inconsistent categorical data.
df.Item_Fat_Content.value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [None]:
# Inconsistent observation naming found in feature. Combining observations as appropriate.
# replace 'LF' with 'Low Fat'
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('LF', 'Low Fat')
# replace 'low fat' with 'Low Fat'
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('low fat', 'Low Fat')
# replace 'reg' with 'Regular'
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace('reg', 'Regular')
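
The three replacements above could also be written as a single mapping. This is a minimal sketch on a toy Series; the names `fat`, `fat_map`, and `fat_clean` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy Series containing the inconsistent labels seen in Item_Fat_Content
fat = pd.Series(['Low Fat', 'LF', 'reg', 'low fat', 'Regular'])

# One dictionary maps every inconsistent label to its standard form
fat_map = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}
fat_clean = fat.replace(fat_map)

print(fat_clean.value_counts())
```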

In [None]:
df.Item_Fat_Content.value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

-Item_Fat_Content observations fixed.

-No other inconsistencies in categorical data found.

* Missing Values

In [None]:
# Display total number of missing values
print(f'There are {df.isna().sum().sum()} missing values.')

There are 3873 missing values.


In [None]:
# Display count of missing values by column
print(df.isna().sum())

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64


In [None]:
# Display percentage of missing values by column
print(df.isna().sum()/len(df)*100)

Item_Identifier               0.000000
Item_Weight                  17.165317
Item_Fat_Content              0.000000
Item_Visibility               0.000000
Item_Type                     0.000000
Item_MRP                      0.000000
Outlet_Identifier             0.000000
Outlet_Establishment_Year     0.000000
Outlet_Size                  28.276428
Outlet_Location_Type          0.000000
Outlet_Type                   0.000000
Item_Outlet_Sales             0.000000
dtype: float64
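
The percentage calculation above is equivalent to taking the mean of the boolean missing-value mask. A minimal sketch on a toy frame (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame: half of Item_Weight is missing, Item_MRP is complete
toy = pd.DataFrame({
    'Item_Weight': [9.3, np.nan, 17.5, np.nan],
    'Item_MRP': [249.8, 48.3, 141.6, 182.1],
})

pct_by_count = toy.isna().sum() / len(toy) * 100  # count-based, as in the cell above
pct_by_mean = toy.isna().mean() * 100             # mean of True/False gives the same share

print(pct_by_mean)
```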


-Two features have missing values: 
* Item_Weight
* Outlet_Size

We will use SimpleImputer in our preprocessing steps after performing our Train_Test_Split.
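
The reason for imputing after the split is that the fill value should be learned from the training data only. A minimal sketch of that idea, using toy arrays that are assumptions and not the project data:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-column arrays standing in for a train and test split
train = np.array([[10.0], [np.nan], [20.0]])
test = np.array([[np.nan], [30.0]])

imputer = SimpleImputer(strategy='mean')
imputer.fit(train)                      # mean is learned from the training rows only

train_filled = imputer.transform(train)
test_filled = imputer.transform(test)   # test gaps are filled with the training mean

print(imputer.statistics_)
print(test_filled.ravel())
```

The test value of 30.0 never influences the fill value, which is how leakage from the test set is avoided.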

* Determine which features will be relevant to include in our features matrix.

In [None]:
# Display feature info.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [None]:
# Count total number of observations.
display(df.shape)

# Only count unique values for categorical features: determine if relevant for feature matrix.
display(df['Item_Identifier'].nunique())
display(df['Item_Fat_Content'].nunique())
display(df['Item_Type'].nunique())
display(df['Outlet_Size'].nunique())
display(df['Outlet_Location_Type'].nunique())
display(df['Outlet_Type'].nunique())

(8523, 12)

1559

2

16

3

3

4

-Item_Identifier has many unique values (1,559), but this is reasonable: the same products are sold at different stores in our dataset, so the column is not made up entirely of unique values. For this reason, we should include the feature in our calculations.

## Ordinal Encoding

The ordinal data can be encoded before the split without much risk of data leakage: there are only a few categories, and each is likely to appear in both the training and testing data.

* Ordinal Encoding 

Outlet_Size is the only ordinal feature that we know the order of. Other features are ambiguous and will be treated as Nominal Categorical.

In [None]:
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

In [None]:
# Ordinal encoding via .replace(), as I was not able to get OrdinalEncoder working
replacement_dictionary = {'Small':0, 'Medium':1, 'High':2}
df['Outlet_Size'] = df['Outlet_Size'].replace(replacement_dictionary)
df['Outlet_Size']

0       1.0
1       1.0
2       1.0
3       NaN
4       2.0
       ... 
8518    2.0
8519    NaN
8520    0.0
8521    1.0
8522    0.0
Name: Outlet_Size, Length: 8523, dtype: float64
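
For reference, one way OrdinalEncoder might be made to work is to give it the category order explicitly and impute the missing sizes first. This is only a sketch under assumptions: the toy frame, the names `sizes`, `size_order`, and `ordinal_pipe`, and the choice to fill gaps with the most frequent size are all hypothetical and differ from the .replace() approach used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy column with one missing value (assumed data)
sizes = pd.DataFrame({
    'Outlet_Size': pd.Series(['Medium', 'Small', np.nan, 'High', 'Medium'], dtype=object)
})

# One inner list per column, ordered from smallest to largest
size_order = [['Small', 'Medium', 'High']]

# Fill missing values first so the encoder only sees known categories
ordinal_pipe = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OrdinalEncoder(categories=size_order),
)

encoded = ordinal_pipe.fit_transform(sizes)
print(encoded.ravel())
```

In a full workflow this pipeline would be fitted on the training split only, like the other transformers.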

## Split the Data

* Identify the features (X) and target (y): Assign the "Item_Outlet_Sales" column as your target and the rest of the relevant variables as your features matrix.

In [None]:
# Define features (X) and target (y)
target = 'Item_Outlet_Sales'
X = df.drop(columns = [target]).copy()
y = df[target].copy()

* Perform a train test split

In [None]:
# Split training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
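
Because no `test_size` is passed, train_test_split falls back to its default of holding out 25% of the rows. A minimal sketch with placeholder arrays of the same length as the dataset (the `_toy` names are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the sales dataset
X_toy = np.arange(8523).reshape(-1, 1)
y_toy = np.arange(8523)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

print(X_tr.shape, X_te.shape)
```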

## Prepare the Data

* Identify the datatypes for each feature

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   float64
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(5), int64(1), object(6)
memory usage: 799.2+ KB


**Ordinal:** 'Outlet_Size'  (Outlet_Size has already been ordinal encoded.)

**Numeric:** 'Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year'

**Nominal:** 'Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Location_Type', 'Outlet_Type'

## Column Selectors, Transformers & Pipelines

* Make sure your imputation of missing values occurs after the train test split using SimpleImputer. 

* ColumnSelector

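As there is one ordinal categorical feature, ordinal and nominal categorical features would normally need to be specified manually. Because Outlet_Size has already been encoded as numbers, the dtype-based selectors below place it with the numeric features.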

In [None]:
# Instantiate column selectors
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')
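
A column selector is a callable that returns the matching column names when given a DataFrame. A minimal sketch on a toy frame (the frame and the names `num_cols` and `cat_cols` are assumptions) shows where an already-encoded Outlet_Size ends up:

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame: two numeric columns and one object column
toy = pd.DataFrame({
    'Item_MRP': [249.8, 48.3],
    'Outlet_Size': [1.0, 2.0],  # already ordinal encoded, so numeric
    'Item_Type': pd.Series(['Dairy', 'Meat'], dtype=object),
})

num_cols = make_column_selector(dtype_include='number')(toy)
cat_cols = make_column_selector(dtype_include='object')(toy)

print(num_cols)
print(cat_cols)
```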

* Imputers

In [None]:
# Display total number of missing values
print(f'There are {df.isna().sum().sum()} missing values.')

There are 3873 missing values.


-We have many missing values and will require values to be imputed.

* Transformers

In [None]:
# Instantiate transformers

# Imputers
freq_imputer = SimpleImputer(strategy='most_frequent')
mean_imputer = SimpleImputer(strategy='mean')

# Scaler
scaler = StandardScaler()

# One-hot encoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

* Instantiate Pipelines

In [None]:
# Numeric Pipeline
numeric_pipe = make_pipeline(mean_imputer, scaler)
numeric_pipe

In [None]:
# Categorical Pipeline
categorical_pipe = make_pipeline(freq_imputer, ohe)
categorical_pipe

* Instantiate ColumnTransformer

Create a preprocessing object to prepare the dataset for Machine Learning

In [None]:
# Tuple for ColumnTransformer
number_tuple = (numeric_pipe, num_selector)
category_tuple = (categorical_pipe, cat_selector)

# ColumnTransformer
preprocessor = make_column_transformer(number_tuple, category_tuple, remainder='passthrough')
preprocessor

* Transform Data

In [None]:
# Fit the ColumnTransformer/preprocessor on the training data
preprocessor.fit(X_train)

In [None]:
# Use fitted ColumnTransformer to transform both training and testing datasets
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

## Inspect Result

In [None]:
# Check for missing values & that data has been scaled and one-hot encoded.
print(np.isnan(X_train_processed).sum().sum(), 'missing values in training data')
print(np.isnan(X_test_processed).sum().sum(), 'missing values in testing data')
print('\n')
print('All data in X_train_processed is ', X_train_processed.dtype)
print('All data in X_test_processed is ', X_test_processed.dtype)
print('\n')
print('The shape of the data is ', X_train_processed.shape)
print('\n')
X_train_processed

0 missing values in training data
0 missing values in testing data


All data in X_train_processed is  float64
All data in X_test_processed is  float64


The shape of the data is  (6392, 1590)




array([[ 0.81724868, -0.71277507,  1.82810922, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.5563395 , -1.29105225,  0.60336888, ...,  0.        ,
         1.        ,  0.        ],
       [-0.13151196,  1.81331864,  0.24454056, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [ 1.11373638, -0.92052713,  1.52302674, ...,  1.        ,
         0.        ,  0.        ],
       [ 1.76600931, -0.2277552 , -0.38377708, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.81724868, -0.95867683, -0.73836105, ...,  1.        ,
         0.        ,  0.        ]])

* Data has been preprocessed and is ready to be modeled.

## Linear Regression Model

Your first task is to build a linear regression model to predict sales.

* Drop the 'Item_Identifier' column due to high cardinality

In [None]:
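# Drop the 'Item_Identifier' column due to high cardinality
# Note: X_train and X_test were created before this step, so the processed
# data used by the models below still includes 'Item_Identifier'.
df.drop(columns=['Item_Identifier'], inplace = True)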
df.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,1.0,Tier 1,Supermarket Type1,3735.138
1,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,1.0,Tier 3,Supermarket Type2,443.4228
2,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,1.0,Tier 1,Supermarket Type1,2097.27
3,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,2.0,Tier 3,Supermarket Type1,994.7052
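
Dropping a column from `df` at this point does not change feature matrices that were already copied out of it. The sketch below illustrates this on a toy frame and shows one possible alternative, dropping the column when the features are defined; the toy values and the `_toy` and `X_before` names are assumptions.

```python
import pandas as pd

# Toy frame with the identifier, one feature, and the target
toy = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'FDN15'],
    'Item_MRP': [249.8, 48.3, 141.6],
    'Item_Outlet_Sales': [3735.1, 443.4, 2097.3],
})
target = 'Item_Outlet_Sales'

# Features copied out first, as in the notebook's split step
X_before = toy.drop(columns=[target]).copy()

# A later drop on the source frame does not reach the earlier copy
toy_dropped = toy.drop(columns=['Item_Identifier'])
print('Item_Identifier' in X_before.columns)   # still present in the copy

# Alternative: exclude the column when defining the features
X_toy = toy.drop(columns=[target, 'Item_Identifier'])
y_toy = toy[target]
print(list(X_toy.columns))
```

With the second approach, the split and the preprocessor would need to be rerun so the processed arrays reflect the reduced feature set.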


* Build a linear regression model.

In [None]:
# Instantiate the transformers
# scaler = StandardScaler()
# mean_imputer = SimpleImputer(strategy='mean')
# freq_imputer = SimpleImputer(strategy='most_frequent')
# ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

In [None]:
# Prepare separate processing pipelines for numeric and categorical data
# num_pipe = make_pipeline(mean_imputer, scaler)
# cat_pipe = make_pipeline(freq_imputer, ohe)

In [None]:
# Create ColumnSelectors for the numeric and categorical data
# cat_selector = make_column_selector(dtype_include='object')
# num_selector = make_column_selector(dtype_include='number')

In [None]:
# Combine the Pipelines and ColumnSelectors into tuples for the ColumnTransformer
# cat_tuple = (cat_pipe, cat_selector)
# num_tuple = (num_pipe, num_selector)

In [None]:
# Create the preprocessing ColumnTransformer
# preprocessor = make_column_transformer(cat_tuple, num_tuple, remainder='drop')
# preprocessor

In [None]:
# Instantiate the model
linreg = LinearRegression()

In [None]:
# Fit the model on the processed training data
# This is the step where the model "learns" the relationship between
# the features (X) and the target (y)
linreg.fit(X_train_processed, y_train)

In [None]:
# Make predictions using the testing and training data
train_preds_linreg = linreg.predict(X_train_processed)
test_preds_linreg = linreg.predict(X_test_processed)
train_preds_linreg[:10]

array([3018.875 , 3785.375 , 2220.9375, 1242.1875, 2216.75  , -105.6875,
       1618.0625, 4391.8125, 3699.0625, 1616.75  ])

In [None]:
# evaluate the model
train_score_linreg = linreg.score(X_train_processed, y_train)
test_score_linreg = linreg.score(X_test_processed, y_test)
print(train_score_linreg)
print(test_score_linreg)

0.6716866170021532
-2.2000993154978634e+18


In [None]:
# Display model performance metrics 
eval_regression(y_train, train_preds_linreg, name='Linear Regression Train')
eval_regression(y_test, test_preds_linreg, name='Linear Regression Test')

Linear Regression Train Scores
MAE: 735.7972 
MSE: 971,628.9142 
RMSE: 985.7124 
R2: 0.6717

Linear Regression Test Scores
MAE: 178,892,562,469.1567 
MSE: 6,070,024,042,377,572,190,781,440.0000 
RMSE: 2,463,741,878,196.1660 
R2: -2200099315497863424.0000



Evaluate the performance of your model based on r^2.

* The r^2 value returned for the test set is an extremely large negative number:

> -2,200,099,315,497,863,424.0000


Evaluate the performance of your model based on rmse.

* The RMSE value for the test set also returned an extreme number:

> 2,463,741,878,196.1660
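
An r^2 far below zero means the squared error of the predictions is many times larger than the error of simply predicting the mean. The sketch below shows how a single wildly wrong prediction is enough to produce such a value; the toy numbers are assumptions for illustration and do not reproduce the result above.

```python
import numpy as np

# Toy targets and predictions: three close guesses and one extreme miss
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 1e9])

ss_res = np.sum((y_true - y_pred) ** 2)          # dominated by the extreme miss
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # spread of the true values
r2 = 1 - ss_res / ss_tot
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f'R2: {r2:.4e}  RMSE: {rmse:,.2f}')
```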

## Regression Tree Model

Your second task is to build a regression tree model to predict sales.

* Build a simple regression tree model.

In [None]:
# Instantiate the DecisionTreeRegressor
dec_tree = DecisionTreeRegressor(random_state = 42)

In [None]:
# Fit the data
dec_tree.fit(X_train_processed, y_train)

In [None]:
# Predict target values
train_preds = dec_tree.predict(X_train_processed)
test_preds = dec_tree.predict(X_test_processed)
train_preds[:10]

array([ 515.3292, 3056.022 , 1577.946 , 1331.6   , 1687.1372,  111.8544,
       1151.1682, 3401.5722, 3570.0196, 1523.3504])

In [None]:
# evaluate the model
train_score = dec_tree.score(X_train_processed, y_train)
test_score = dec_tree.score(X_test_processed, y_test)
print(train_score)
print(test_score)

1.0
0.2083233124406807


In [None]:
# Display model performance metrics 
eval_regression(y_train, train_preds, name='DecisionTreeRegressor Train')
eval_regression(y_test, test_preds, name='DecisionTreeRegressor Test')

DecisionTreeRegressor Train Scores
MAE: 0.0000 
MSE: 0.0000 
RMSE: 0.0000 
R2: 1.0000

DecisionTreeRegressor Test Scores
MAE: 1,004.1214 
MSE: 2,184,218.0003 
RMSE: 1,477.9100 
R2: 0.2083



Evaluate the performance of your model based on r^2.

* The r^2 value for the test set returned a low, but reasonable number:
> 0.2083

Evaluate the performance of your model based on RMSE.

* The RMSE value for the test set is:
> 1,477.91
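
A training r^2 of 1.0 alongside a much lower test r^2 is typical of an unconstrained tree that memorizes the training rows. As a hedged sketch of one possible follow-up, the example below compares an unconstrained tree with a depth-limited one on synthetic data; the data, the names, and the choice of `max_depth=3` are assumptions, not results from this project.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumed): one informative feature plus noise
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.5, size=200)

# Unconstrained tree: keeps splitting until every training row is fit exactly
full_tree = DecisionTreeRegressor(random_state=42).fit(X_demo, y_demo)

# Depth-limited tree: cannot memorize individual rows
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_demo, y_demo)

print('Full tree depth:', full_tree.get_depth())
print('Full tree train R2:', full_tree.score(X_demo, y_demo))
print('Shallow tree train R2:', shallow_tree.score(X_demo, y_demo))
```

Whether a shallower tree scores better on the test set would need to be checked on the actual data.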

## Model Recommendation

You now have tried two different models on your data set. You need to determine which model to implement.

* Overall, which model do you recommend?

I would strongly recommend the Regression Tree Model.

* Justify your recommendation.

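1. For the Linear Regression model, both test metrics are extreme: the R2 score is about -2.2e18 and the RMSE is about 2.46 trillion, while the training R2 is 0.6717. Either the model performed extremely poorly on unseen data, or there is an error in the way the model was implemented. The processed training data has 1,590 columns, which suggests the one-hot encoded Item_Identifier values are still among the features and may be contributing to this result.

2. The Regression Tree model did not perform well (test R2 of 0.2083, RMSE of 1,477.91), and its perfect training R2 of 1.0 points to overfitting, but it had reasonable results that could be useful.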

## GitHub README

To finalize this project, complete a README in your GitHub repository including:

* An overview of the project

* Two (2) relevant insights from the data (supported with reporting quality visualizations)

* Summary of the model and its evaluation metrics

* Final recommendations

## Project Labeling

Remove all references to "Project 1" in your filenames, repository name, final readme, and/or notebook. You want this to be read as a professional presentation, not a school project. If you need, create a clean, new repository that only contains your final notebook, README (project summary/explanation), and the images/visualizations you're using.  Ask yourself, what would this look like if this were a project you completed for a real-life stakeholder?

Please note:

* Do not include detailed technical processes or code snippets in your README. If readers want to know more technical details they should be able to easily find your notebook to learn more.

* Make sure your GitHub repository is organized and professional. Remember, this should be used to showcase your data science skills and abilities.

* Commit all of your work to GitHub and turn in a link to your GitHub repo with your final project.