<a href="https://colab.research.google.com/github/tschelli/food_sales_predictions/blob/main/Sales_Predictions_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Food Sales Predictions Project - Preprocessing

Coding Dojo Data Science Project 1
- Name: Tyler Schelling
- Start Date: 10/12/2022



---



**Data Dictionary Reference:**

Variable Name | Description
--------------|------------
Item_Identifier | Unique product ID
Item_Weight | Weight of the product
Item_Fat_Content | Whether the product is low fat or regular
Item_Visibility | The percentage of total display area of all products in a store allocated to the particular product
Item_Type | The category to which the product belongs
Item_MRP | Maximum Retail Price (list price) of the product
Outlet_Identifier | Unique store ID
Outlet_Establishment_Year | The year in which the store was established
Outlet_Size | The size of the store in terms of ground area covered
Outlet_Location_Type | The type of area in which the store is located
Outlet_Type | Whether the outlet is a grocery store or some sort of supermarket
Item_Outlet_Sales | Sales of the product in the particular store. This is the target variable to be predicted.

## Mount Drive | Import Libraries | Load Data
- Section last updated: 10/12/2022

### Mounting Google Drive

In [1]:
# The dataset is stored on Google Drive, so mount the drive first.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Importing Libraries

In [2]:
# Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn import set_config
set_config(display='diagram')

### Load the Data
 *Note: [Original Data Source](https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/)*

In [3]:
# Load the dataset into a DataFrame
filename = '/content/drive/MyDrive/02. Life/Coding Dojo/00 Datasets/sales_predictions.csv'
df = pd.read_csv(filename)

## Preprocessing for Machine Learning
- Section last updated: 10/13/2022

In [4]:
df_ml = df.copy()
df_ml.head(3)

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |


### Drop Duplicates and Fix Inconsistencies

In [5]:
# Check to see if there are any duplicates
df_ml.duplicated().sum()

0

- No duplicates present in the dataset

In [6]:
for col in df_ml.columns:
  print(col)
  print(df_ml[col].value_counts(), '\n')

Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: Item_Identifier, Length: 1559, dtype: int64 

Item_Weight
12.150    86
17.600    82
13.650    77
11.800    76
15.100    68
          ..
7.275      2
7.685      1
9.420      1
6.520      1
5.400      1
Name: Item_Weight, Length: 415, dtype: int64 

Item_Fat_Content
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64 

Item_Visibility
0.000000    526
0.076975      3
0.162462      2
0.076841      2
0.073562      2
           ... 
0.013957      1
0.110460      1
0.124646      1
0.054142      1
0.044878      1
Name: Item_Visibility, Length: 7880, dtype: int64 

Item_Type
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              64

- `Item_Fat_Content` has inconsistent labels: `LF` and `low fat` should be `Low Fat`, and `reg` should be `Regular`

In [7]:
# Use df_ml['Item_Fat_Content'].replace() to map the inconsistent labels to 'Low Fat' and 'Regular'
df_ml['Item_Fat_Content'] = df_ml['Item_Fat_Content'].replace(['low fat', 'LF'], 'Low Fat').replace('reg','Regular')
print(df_ml['Item_Fat_Content'].unique())

['Low Fat' 'Regular']
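
The same cleanup can also be written with a single mapping dictionary, which keeps every variant and its canonical label in one place. The snippet below is a minimal sketch on a toy Series; the names `fat`, `fat_map`, and `fat_clean` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy column with the same label variants seen in Item_Fat_Content
fat = pd.Series(['Low Fat', 'LF', 'reg', 'low fat', 'Regular'])

# Map each variant to its canonical label;
# values that are not keys of the mapping are left unchanged
fat_map = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}
fat_clean = fat.replace(fat_map)

print(fat_clean.unique())
```

Only two labels should remain after the replacement, as in the cell output above.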


In [8]:
df_ml.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [9]:
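# Intended: treat Outlet_Establishment_Year as a category.
# Note: this cast is applied to `df`, not to the working copy `df_ml`,
# so the pipeline below still sees the year as a numeric column.
df['Outlet_Establishment_Year'] = df['Outlet_Establishment_Year'].astype("object")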
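
Because `df_ml` was created earlier with `df.copy()`, a cast applied to `df` does not carry over to `df_ml`. That is consistent with the `cat_cols` output further down, which does not list the year column. The sketch below illustrates the behavior on a toy frame (the data is made up for illustration); to have the pipeline treat the year as a category, the cast would presumably need to target `df_ml` instead.

```python
import pandas as pd

# Toy stand-in for the original frame and its working copy
df = pd.DataFrame({'Outlet_Establishment_Year': [1999, 2009, 1999]})
df_ml = df.copy()  # independent copy of the data

# Cast the column on the original frame only
df['Outlet_Establishment_Year'] = df['Outlet_Establishment_Year'].astype('object')

print(df['Outlet_Establishment_Year'].dtype)     # dtype on the original frame
print(df_ml['Outlet_Establishment_Year'].dtype)  # dtype on the copy is unchanged
```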

### Identify the features (X) and target (y) and split

In [10]:
# Assign the feature columns as X
X = df_ml.drop(columns = ['Item_Identifier', 'Item_Outlet_Sales'])
# Assign the target column as y
y = df_ml['Item_Outlet_Sales']

### Perform the Train Test Split

In [11]:
# Split training and test
# Set random_state to 42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
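
Since no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows. The sketch below reproduces the split sizes on stand-in arrays with the same row count as the dataset (8523); the `*_demo` names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same number of rows as the dataset
X_demo = np.arange(8523).reshape(-1, 1)
y_demo = np.arange(8523)

# No test_size given, so the default 25% test split applies
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape[0], 'training rows,', X_te.shape[0], 'test rows')
```

The training row count should match the 6392 rows reported for the processed training data later in the notebook.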

### Instantiate Column Selectors

In [12]:
# Column selectors
num_selector = make_column_selector(dtype_include='number')

# Split the ordinal columns out from the nominal (one-hot encoded) columns
ord_cols = ['Outlet_Size', 'Outlet_Location_Type']
cat_cols = df_ml.select_dtypes('object').columns.difference(ord_cols + ['Item_Identifier']).to_list()
cat_cols

['Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Type']
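
A column selector is a callable that, given a DataFrame, returns the names of the columns matching the requested dtype. The sketch below shows this on a toy frame and derives the nominal columns as "everything that is neither numeric nor ordinal". The frame and the names `num_sel`, `toy_ord_cols`, and `toy_nominal` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame with one numeric, one nominal, and one ordinal column
toy = pd.DataFrame({
    'Item_Weight': [9.3, 5.92],
    'Item_Type': ['Dairy', 'Soft Drinks'],
    'Outlet_Size': ['Medium', 'High'],
})

# Calling the selector on a frame returns the matching column names
num_sel = make_column_selector(dtype_include='number')
toy_num_cols = num_sel(toy)

# Ordinal columns are listed by hand; the rest are treated as nominal
toy_ord_cols = ['Outlet_Size']
toy_nominal = [c for c in toy.columns if c not in toy_num_cols + toy_ord_cols]

print(toy_num_cols)
print(toy_nominal)
```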

### Instantiate Transformers

In [13]:
# Simple Imputer
median_imputer = SimpleImputer(strategy='median')
freq_imputer = SimpleImputer(strategy = 'most_frequent')
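
To make the two strategies concrete, the sketch below applies each imputer to a small toy frame: the median fills the numeric gap, and the most frequent label fills the categorical gap. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one missing value per column
toy = pd.DataFrame({
    'Item_Weight': [9.3, np.nan, 17.5, 5.92],
    'Outlet_Size': pd.Series(['Medium', np.nan, 'Medium', 'High'], dtype=object),
})

# Median of the observed weights (5.92, 9.3, 17.5) replaces the numeric gap
weight_filled = SimpleImputer(strategy='median').fit_transform(toy[['Item_Weight']])

# The most common label replaces the categorical gap
size_filled = SimpleImputer(strategy='most_frequent').fit_transform(toy[['Outlet_Size']])

print(weight_filled.ravel())
print(size_filled.ravel())
```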

In [14]:
# Scaler
scaler = StandardScaler()

In [15]:
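# One-hot encoder; categories unseen during fit are encoded as all zeros
# Note: scikit-learn 1.2+ renames the `sparse` parameter to `sparse_output`
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)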
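
With `handle_unknown='ignore'`, a category that appears only in the test data does not raise an error; its row is encoded as all zeros. The sketch below shows this on toy arrays. It uses the encoder's default sparse output and converts with `.toarray()` so that it does not depend on which scikit-learn version names the dense-output parameter `sparse` or `sparse_output`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# 'Seafood' appears only in the test data
train = np.array([['Dairy'], ['Meat'], ['Dairy']], dtype=object)
test = np.array([['Meat'], ['Seafood']], dtype=object)

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# Convert the sparse result to a dense array for printing
encoded = enc.transform(test).toarray()

print(enc.categories_)
print(encoded)
```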

### Ordinal Encoder

In [16]:
# Create a list of ordinal labels
out_size_labels = ['Small', 'Medium', 'High']
out_loc_type = ['Tier 1', 'Tier 2', 'Tier 3']

# Combine the ordered list in the order that the columns appear
ordered_labels = [out_size_labels, out_loc_type]

# Instantiate OrdinalEncoder
ordinal = OrdinalEncoder(categories = ordered_labels)
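
The position of each label in its list becomes its integer code, so `Small` maps to 0, `Medium` to 1, and `High` to 2. A minimal sketch on a toy frame (the rows are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with the two ordinal columns, in the same order as the label lists
toy = pd.DataFrame({
    'Outlet_Size': ['Medium', 'High', 'Small'],
    'Outlet_Location_Type': ['Tier 1', 'Tier 3', 'Tier 2'],
}, dtype=object)

enc = OrdinalEncoder(categories=[
    ['Small', 'Medium', 'High'],       # order for Outlet_Size
    ['Tier 1', 'Tier 2', 'Tier 3'],    # order for Outlet_Location_Type
])
codes = enc.fit_transform(toy)

print(codes)
```

The list order must match the column order passed to the encoder, otherwise the labels would be looked up in the wrong list.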

### Instantiate Pipelines

In [17]:
# Pipelines for numeric, ordinal, and nominal (one-hot encoded) columns
num_pipe = make_pipeline(median_imputer, scaler)
ord_pipe = make_pipeline(freq_imputer, ordinal)
cat_pipe = make_pipeline(freq_imputer, ohe)
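
Each pipeline runs its steps in order, so missing values are filled before the scaler or encoder sees the data. The sketch below illustrates the numeric case on a toy column; the array and variable names are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric column with one missing value
X_num = np.array([[1.0], [np.nan], [3.0], [5.0]])

# Step 1 fills the gap with the median (3.0); step 2 standardizes the result
pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
X_out = pipe.fit_transform(X_num)

print(X_out.ravel())
print('mean:', X_out.mean(), 'std:', X_out.std())
```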

### Instantiate Column Transformer

In [18]:
# Set up tuples that pair each pipeline with its column selector or column list
num_tuple = (num_pipe, num_selector)
ord_tuple = (ord_pipe, ord_cols)
cat_tuple = (cat_pipe, cat_cols)

In [19]:
# Make column transformer
preprocessor = make_column_transformer(num_tuple, cat_tuple, ord_tuple, remainder = 'passthrough')

### Fit Data

In [20]:
# Fit the preprocessor on the training data only
preprocessor.fit(X_train)

### Transform Data

In [21]:
# Transform X train and test
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)
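
Fitting on the training data and then transforming both sets means the test data is scaled with statistics learned from the training rows only, which avoids leaking test information into preprocessing. A minimal sketch with a standalone scaler on toy arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

# Mean and standard deviation are learned from the training rows only
scaler = StandardScaler().fit(train)

# The test row is scaled with the training statistics, not its own
test_scaled = scaler.transform(test)

print('training mean:', scaler.mean_[0])
print('scaled test value:', test_scaled[0, 0])
```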

### Explore the Result

In [22]:
# Check for missing values and that data is scaled and one-hot encoded
print(np.isnan(X_train_processed).sum(), 'missing values in training data')
print(np.isnan(X_test_processed).sum(), 'missing values in testing data')
print('\n')
print('All data in X_train_processed are', X_train_processed.dtype)
print('All data in X_test_processed are', X_test_processed.dtype)
print('\n')
print('The shape of the training data is', X_train_processed.shape)
print('\n')
pd.DataFrame(X_train_processed)

0 missing values in training data
0 missing values in testing data


All data in X_train_processed are float64
All data in X_test_processed are float64


The shape of the training data is (6392, 38)




,0,1,2,3,4,5,6,7,8,9,...,28,29,30,31,32,33,34,35,36,37
0,0.827485,-0.712775,1.828109,1.327849,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0
1,0.566644,-1.291052,0.603369,1.327849,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,2.0
2,-0.121028,1.813319,0.244541,0.136187,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
3,-1.158464,-1.004931,-0.952591,0.732018,1.0,0.0,1.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,1.538870,-0.965484,-0.336460,0.493686,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6387,-0.821742,4.309657,-0.044657,0.017021,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,2.0
6388,0.649639,1.008625,-1.058907,1.089517,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
6389,1.123896,-0.920527,1.523027,0.493686,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
6390,1.775999,-0.227755,-0.383777,1.089517,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
