<a href="https://colab.research.google.com/github/tschelli/food_sales_predictions/blob/main/Sales_Predictions_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Food Sales Predictions Project - Preprocessing

Coding Dojo Data Science Project 1
- Name: Tyler Schelling
- Start Date: 10/12/2022



---



**Data Dictionary Reference:**

Variable Name | Description
--------------|------------
Item_Identifier | Unique product ID
Item_Weight | Weight of product
Item_Fat_Content | Whether the product is low fat or regular
Item_Visibility | The percentage of total display area of all products in a store allocated to the particular product
Item_Type | The category to which the product belongs
Item_MRP | Maximum Retail Price (list price) of the product
Outlet_Identifier | Unique store ID
Outlet_Establishment_Year | The year in which the store was established
Outlet_Size | The size of the store in terms of ground area covered
Outlet_Location_Type | The type of area in which the store is located
Outlet_Type | Whether the outlet is a grocery store or some sort of supermarket
Item_Outlet_Sales | Sales of the product in the particular store. This is the target variable to be predicted.

## Mount Drive | Import Libraries | Load Data
- Section last updated: 10/12/2022

### Mounting Google Drive

In [1]:
# The dataset is stored on Google Drive, so mount the drive first.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Importing Libraries

In [2]:
# Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn import set_config
import math
set_config(display='diagram')

### Load the Data
 *Note: [Original Data Source](https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/)*

In [3]:
# Load the dataset into a DataFrame
filename = '/content/drive/MyDrive/02. Life/Coding Dojo/00 Datasets/sales_predictions.csv'
df = pd.read_csv(filename)

## Preprocessing for Machine Learning
- Section last updated: 10/14/2022

In [4]:
df_ml = df.copy()
df_ml.head(3)

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |


### Drop Duplicates and Fix Inconsistencies

In [5]:
# Check to see if there are any duplicates
df_ml.duplicated().sum()

0

- No duplicates present in the dataset

In [6]:
df_ml.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [7]:
# Outlet_Establishment_Year will be treated as a category
df_ml['Outlet_Establishment_Year'] = df_ml['Outlet_Establishment_Year'].astype("object")

In [8]:
for col in df_ml.select_dtypes(include="object").columns:
  print(col)
  print(df_ml[col].value_counts(), '\n')

Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: Item_Identifier, Length: 1559, dtype: int64 

Item_Fat_Content
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64 

Item_Type
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64 

Outlet_Identifier
OUT027    935
OUT013    932
OUT049    930
OUT046    930
OUT035    930
OUT045    929
OUT018    928
OUT017    926
OUT010    55

- Item_Fat_Content has inconsistent labels: 'LF' and 'low fat' both mean 'Low Fat', and 'reg' means 'Regular'.

In [9]:
# Use df['Item_Fat_Content'].replace() to replace the inconsistent data
df_ml['Item_Fat_Content'] = df_ml['Item_Fat_Content'].replace(['low fat', 'LF'], 'Low Fat').replace('reg','Regular')
print(df_ml['Item_Fat_Content'].unique())

['Low Fat' 'Regular']
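The same cleanup can also be written with a single lookup table, which keeps every variant and its canonical label in one place. This is a minimal sketch on a toy Series; `fat` and `fat_map` are hypothetical names used only for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy column containing the same label variants found in Item_Fat_Content
fat = pd.Series(['Low Fat', 'LF', 'reg', 'low fat', 'Regular'])

# Hypothetical lookup table: each variant points at its canonical label
fat_map = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}

# Values that are not keys in the mapping are left unchanged
fat_clean = fat.replace(fat_map)
print(fat_clean.unique())
```

A dictionary makes it easy to add further variants later without chaining more `.replace()` calls.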


### Identify the features (X) and target (y) and split

In [10]:
# Assign the feature columns as X
X = df_ml.drop(columns = ['Item_Outlet_Sales'])
# Assign the target column as y
y = df_ml['Item_Outlet_Sales']

### Perform the Train Test Split

In [11]:
# Split training and test
# Set random_state to 42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
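Because no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. The sketch below reproduces the resulting split sizes on a stand-in array with the same number of rows as the dataset; `toy`, `train_part`, and `test_part` are illustrative names only.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 8523  # row count reported by df_ml.info()
toy = np.arange(n_rows)  # stand-in for the real rows

# Same call signature as in the notebook: only random_state is set
train_part, test_part = train_test_split(toy, random_state=42)

# The test size is rounded up, and the training set gets the remainder
n_test_expected = math.ceil(0.25 * n_rows)
print(len(train_part), len(test_part), n_test_expected)
```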

In [12]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6392 entries, 4776 to 7270
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            6392 non-null   object 
 1   Item_Weight                5285 non-null   float64
 2   Item_Fat_Content           6392 non-null   object 
 3   Item_Visibility            6392 non-null   float64
 4   Item_Type                  6392 non-null   object 
 5   Item_MRP                   6392 non-null   float64
 6   Outlet_Identifier          6392 non-null   object 
 7   Outlet_Establishment_Year  6392 non-null   object 
 8   Outlet_Size                4580 non-null   object 
 9   Outlet_Location_Type       6392 non-null   object 
 10  Outlet_Type                6392 non-null   object 
dtypes: float64(3), object(8)
memory usage: 599.2+ KB


### Impute Missing Item_Weights

In [13]:
# Loop through the index of the DataFrame
for ind in X_train.index:
    # Only rows with a missing Item_Weight need to be filled
    if pd.isna(X_train.loc[ind, 'Item_Weight']):
        # Select the training rows that share this row's Item_Identifier
        item_id = X_train.loc[ind, 'Item_Identifier']
        item_filter = X_train['Item_Identifier'] == item_id

        # Weights are consistent within an Item_Identifier, so the mean of the
        # known weights for this item recovers the missing value
        item_mean = X_train.loc[item_filter, 'Item_Weight'].mean()
        X_train.loc[ind, 'Item_Weight'] = item_mean

        # Fallback: if no other row for this item has a known weight, the mean is NaN.
        # In that case, use the median Item_Weight of the row's Item_Type instead.
        if math.isnan(item_mean):
            type_filter = X_train['Item_Type'] == X_train.loc[ind, 'Item_Type']
            X_train.loc[ind, 'Item_Weight'] = X_train.loc[type_filter, 'Item_Weight'].median()
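The row-by-row loop can be slow on larger frames. A vectorized alternative with `groupby(...).transform(...)` is sketched below on toy data; `toy_train`, `item_mean`, and `type_median` are hypothetical names. One assumption to note: here the fallback medians are computed once from the original values, whereas the loop recomputes them as rows are filled, so results could differ slightly on real data.

```python
import numpy as np
import pandas as pd

# Toy training frame: A1 has one known weight, B2 has none at all
toy_train = pd.DataFrame({
    'Item_Identifier': ['A1', 'A1', 'B2', 'C3', 'C3'],
    'Item_Type':       ['Dairy', 'Dairy', 'Dairy', 'Meat', 'Meat'],
    'Item_Weight':     [9.3, np.nan, np.nan, 17.5, 17.5],
})

# Step 1: mean weight of each item, broadcast back to every row of that item
item_mean = toy_train.groupby('Item_Identifier')['Item_Weight'].transform('mean')

# Step 2: fallback median of each item type, also broadcast to every row
type_median = toy_train.groupby('Item_Type')['Item_Weight'].transform('median')

# Fill from the item mean first, then from the type median where that is still NaN
toy_train['Item_Weight'] = (
    toy_train['Item_Weight'].fillna(item_mean).fillna(type_median)
)
print(toy_train)
```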

In [14]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6392 entries, 4776 to 7270
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            6392 non-null   object 
 1   Item_Weight                6392 non-null   float64
 2   Item_Fat_Content           6392 non-null   object 
 3   Item_Visibility            6392 non-null   float64
 4   Item_Type                  6392 non-null   object 
 5   Item_MRP                   6392 non-null   float64
 6   Outlet_Identifier          6392 non-null   object 
 7   Outlet_Establishment_Year  6392 non-null   object 
 8   Outlet_Size                4580 non-null   object 
 9   Outlet_Location_Type       6392 non-null   object 
 10  Outlet_Type                6392 non-null   object 
dtypes: float64(3), object(8)
memory usage: 857.3+ KB


#### X_test Imputation Based on X_train values

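This is the same loop as above, but the fill values are computed only from X_train. Missing weights in X_test are filled from the training data so that no information from the test set leaks into the imputation.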

In [15]:
# Loop through the index of the test DataFrame
for ind in X_test.index:
    # Only rows with a missing Item_Weight need to be filled
    if pd.isna(X_test.loc[ind, 'Item_Weight']):
        # Select the TRAINING rows that share this test row's Item_Identifier
        item_id = X_test.loc[ind, 'Item_Identifier']
        item_filter = X_train['Item_Identifier'] == item_id

        # Fill with the mean training weight for this item
        item_mean = X_train.loc[item_filter, 'Item_Weight'].mean()
        X_test.loc[ind, 'Item_Weight'] = item_mean

        # Fallback: if the item never appears in X_train, the mean is NaN.
        # In that case, use the median training Item_Weight of the row's Item_Type instead.
        if math.isnan(item_mean):
            type_filter = X_train['Item_Type'] == X_test.loc[ind, 'Item_Type']
            X_test.loc[ind, 'Item_Weight'] = X_train.loc[type_filter, 'Item_Weight'].median()
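The idea of learning fill values from the training set and only applying them to the test set can be sketched with two lookup Series and `.map()`. This is an illustrative sketch on toy data, not the notebook's method; `toy_train`, `toy_test`, `item_means`, and `type_medians` are hypothetical names.

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({
    'Item_Identifier': ['A1', 'A1', 'C3'],
    'Item_Type':       ['Dairy', 'Dairy', 'Meat'],
    'Item_Weight':     [9.0, 9.0, 17.5],
})
# Z9 never appears in the training data, so it needs the type-level fallback
toy_test = pd.DataFrame({
    'Item_Identifier': ['A1', 'Z9', 'C3'],
    'Item_Type':       ['Dairy', 'Meat', 'Meat'],
    'Item_Weight':     [np.nan, np.nan, 17.5],
})

# Lookup tables learned from the training data only
item_means = toy_train.groupby('Item_Identifier')['Item_Weight'].mean()
type_medians = toy_train.groupby('Item_Type')['Item_Weight'].median()

# Apply the training lookups to the test rows: item mean first, then type median
toy_test['Item_Weight'] = (
    toy_test['Item_Weight']
    .fillna(toy_test['Item_Identifier'].map(item_means))
    .fillna(toy_test['Item_Type'].map(type_medians))
)
print(toy_test)
```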

### Drop Unnecessary Columns

In [16]:
X_train = X_train.drop(columns = 'Item_Identifier')

In [17]:
X_test = X_test.drop(columns = 'Item_Identifier')

### Instantiate Column Selectors

In [18]:
# Column selectors
num_selector = make_column_selector(dtype_include='number')

# Split ordinal columns out
ord_cols = ['Outlet_Size', 'Outlet_Location_Type']
cat_cols = df_ml.select_dtypes('object').columns.difference(['Outlet_Size', 'Outlet_Location_Type', 'Item_Identifier']).to_list()
cat_cols

['Item_Fat_Content',
 'Item_Type',
 'Outlet_Establishment_Year',
 'Outlet_Identifier',
 'Outlet_Type']

### Instantiate Transformers

In [19]:
# Simple Imputer
median_imputer = SimpleImputer(strategy='median')
freq_imputer = SimpleImputer(strategy = 'most_frequent')
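To illustrate what the two strategies do, the sketch below fills one missing numeric value with the column median and one missing label with the most frequent label. The toy arrays and the names `toy_num`, `toy_cat`, `num_filled`, and `cat_filled` are assumptions made for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric column with one missing value; the known values are 1, 3, and 10
toy_num = np.array([[1.0], [np.nan], [3.0], [10.0]])

# Toy categorical column with one missing label; 'Medium' is the most frequent
toy_cat = np.array([['Medium'], [np.nan], ['Small'], ['Medium']], dtype=object)

num_filled = SimpleImputer(strategy='median').fit_transform(toy_num)
cat_filled = SimpleImputer(strategy='most_frequent').fit_transform(toy_cat)

print(num_filled.ravel())
print(cat_filled.ravel())
```

The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for skewed numeric columns.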

In [20]:
# Scaler
scaler = StandardScaler()

In [21]:
# One Hot Encoder
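# sparse_output replaced the sparse argument in scikit-learn 1.2; on older versions use sparse=False
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)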
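Setting `handle_unknown='ignore'` matters when the test set contains a category that the encoder never saw during fitting: instead of raising an error, the encoder emits a row of zeros for that feature. The sketch below shows this on toy data. It leaves the output in its default sparse form and converts with `.toarray()` so that it does not depend on the dense-output keyword; `train_types`, `test_types`, and `toy_ohe` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_types = np.array([['Dairy'], ['Meat'], ['Dairy']], dtype=object)
# 'Seafood' does not occur in the training data
test_types = np.array([['Meat'], ['Seafood']], dtype=object)

toy_ohe = OneHotEncoder(handle_unknown='ignore')
toy_ohe.fit(train_types)

# Known categories get a single 1; the unseen category becomes all zeros
encoded = toy_ohe.transform(test_types).toarray()
print(toy_ohe.categories_)
print(encoded)
```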

### Ordinal Encoder

In [22]:
# Create a list of ordinal labels
out_size_labels = ['Small', 'Medium', 'High']
out_loc_type = ['Tier 1', 'Tier 2', 'Tier 3']

# Combine the label lists in the same order as the columns in ord_cols
ordered_labels = [out_size_labels, out_loc_type]

# Instantiate OrdinalEncoder
ordinal = OrdinalEncoder(categories = ordered_labels)
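Passing explicit `categories` tells the encoder which integer each label should receive, rather than relying on alphabetical order. A small sketch on toy rows follows; `size_order`, `tier_order`, `toy_rows`, and `toy_ordinal` are illustrative names, and the label lists mirror the ones defined in the notebook.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

size_order = ['Small', 'Medium', 'High']
tier_order = ['Tier 1', 'Tier 2', 'Tier 3']

# Columns are in the same order as the category lists: size first, then tier
toy_rows = np.array([
    ['Medium', 'Tier 1'],
    ['High',   'Tier 3'],
    ['Small',  'Tier 2'],
], dtype=object)

toy_ordinal = OrdinalEncoder(categories=[size_order, tier_order])

# Each label is replaced by its position in the corresponding list
codes = toy_ordinal.fit_transform(toy_rows)
print(codes)
```

Without the explicit lists, alphabetical sorting would place 'High' before 'Medium' and 'Small', which would not reflect the intended size ordering.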

### Instantiate Pipelines

In [23]:
# Pipelines for numeric, ordinal, and nominal categorical columns
num_pipe = make_pipeline(median_imputer, scaler)
ord_pipe = make_pipeline(freq_imputer, ordinal)
cat_pipe = make_pipeline(freq_imputer, ohe)

### Instantiate Column Transformer

In [24]:
# Set up the tuples that pair each pipeline with its column selector or column list
num_tuple = (num_pipe, num_selector)
ord_tuple = (ord_pipe, ord_cols)
cat_tuple = (cat_pipe, cat_cols)

In [25]:
# Make column transformer
preprocessor = make_column_transformer(num_tuple, cat_tuple, ord_tuple, remainder = 'passthrough')

### Fit Data

In [26]:
# Fit the preprocessor on the training data only
preprocessor.fit(X_train)

### Transform Data

In [27]:
# Transform X train and test
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

### Explore the Result

In [29]:
# Check for missing values and that data is scaled and one-hot encoded
print(np.isnan(X_train_processed).sum(), 'missing values in training data')
print(np.isnan(X_test_processed).sum(), 'missing values in testing data')
print('\n')
print('All data in X_train_processed are', X_train_processed.dtype)
print('All data in X_test_processed are', X_test_processed.dtype)
print('\n')
print('The shape of the training data is', X_train_processed.shape)
print('The shape of the testing data is', X_test_processed.shape)
print('\n')
pd.DataFrame(X_train_processed)

0 missing values in training data
0 missing values in testing data


All data in X_train_processed are float64
All data in X_test_processed are float64


The shape of the training data is (6392, 46)
The shape of the testing data is (2131, 46)




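| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.736477 | -0.712775 | 1.828109 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 |
| 1 | 0.498624 | -1.291052 | 0.603369 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 2.0 |
| 2 | -0.128441 | 1.813319 | 0.244541 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | -1.074445 | -1.004931 | -0.952591 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 1.385165 | -0.965484 | -0.336460 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6387 | -0.767399 | 4.309657 | -0.044657 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| 6388 | 0.574304 | 1.008625 | -1.058907 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 6389 | 1.006763 | -0.920527 | 1.523027 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 6390 | 1.601394 | -0.227755 | -0.383777 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |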
