<a href="https://colab.research.google.com/github/ralphpatrick/sales_predictions/blob/main/Project%201%20-%20Part%205%20(Core).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1 - Part 5 (Core)

We will continue to work on your sales prediction project. The goal of this step is to help the retailer by using machine learning to make predictions about future sales based on the data provided.

For Part 5, you will go back to your original dataset with the goal of preventing data leakage.  

Please note: If you imputed missing values based on a calculation on the entire dataset (such as mean), you should now perform that step after the train test split using SimpleImputer.  
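
The note above can be illustrated with a minimal sketch. The toy arrays are hypothetical and not taken from the sales data; the point is only that the imputer learns its fill value from the training rows and reuses it on the test rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values): one numeric column with missing entries.
X_train_toy = np.array([[1.0], [3.0], [np.nan]])
X_test_toy = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy='mean')
# fit() sees the training rows only, so the learned mean is (1 + 3) / 2 = 2.0
imputer.fit(X_train_toy)

X_train_filled = imputer.transform(X_train_toy)
# Gaps in the test rows are filled with the training mean, not the test mean.
X_test_filled = imputer.transform(X_test_toy)

print(imputer.statistics_)
print(X_test_filled.ravel())
```

Because the value 100.0 in the test rows never influences the fill value, no information from the test set leaks into preprocessing.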

* Identify the features (X) and target (y): Assign the "Item_Outlet_Sales" column as your target and the rest of the relevant variables as your features matrix.  
* Perform a train test split 
* Create a preprocessing pipeline to prepare the dataset for machine learning

Commit your work to GitHub. 

Turn in a link to your GitHub repo! We will finalize the project next week.

## 1. Load Data and Do Data Cleaning

In [25]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import set_config
set_config(display='diagram')

In [2]:
# import pandas and NumPy
import pandas as pd
import numpy as np
filename = '/content/sales_predictions_2023 (1).csv'

df = pd.read_csv(filename)
df.head(3)


| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |


In [3]:
df.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

In [5]:
#  Find and fix any inconsistent categories of data (example: fix cat, Cat, and cats so that they are consistent)
df['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [6]:
# standardize Item_Fat_Content, step 1: 'LF' -> 'Low Fat'

df.loc[df['Item_Fat_Content'] == 'LF', 'Item_Fat_Content'] = 'Low Fat'

In [7]:
df['Item_Fat_Content'].value_counts()

Low Fat    5405
Regular    2889
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [8]:
# standardize Item_Fat_Content, step 2: 'low fat' -> 'Low Fat'

df.loc[df['Item_Fat_Content'] == 'low fat', 'Item_Fat_Content'] = 'Low Fat'

In [9]:
df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    2889
reg         117
Name: Item_Fat_Content, dtype: int64

In [10]:
# standardize Item_Fat_Content, step 3: 'reg' -> 'Regular'

df.loc[df['Item_Fat_Content'] == 'reg', 'Item_Fat_Content'] = 'Regular'

In [11]:
df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64
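
The three cleanup steps above could also be written as one call that takes a mapping of every variant label to its standard form. This is a sketch on a toy Series, not a rerun of the notebook; the names `fat`, `fat_map`, and `fat_clean` are illustrative.

```python
import pandas as pd

# Toy Series with the same inconsistent labels seen above (rows are illustrative).
fat = pd.Series(['Low Fat', 'LF', 'low fat', 'Regular', 'reg'])

# One mapping covers every variant; labels not listed in it are left unchanged.
fat_map = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}
fat_clean = fat.replace(fat_map)

print(fat_clean.value_counts())
```

Keeping the mapping in one dictionary makes it easy to see at a glance which labels are being merged.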

In [12]:
df.sample(5)

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6746 | FDP37 | 15.6 | Low Fat | 0.143915 | Breakfast | 127.5994 | OUT017 | 2007 | NaN | Tier 2 | Supermarket Type1 | 2441.4886 |
| 1487 | FDT21 | 7.42 | Low Fat | 0.020392 | Snack Foods | 248.9092 | OUT046 | 1997 | Small | Tier 1 | Supermarket Type1 | 2241.0828 |
| 5451 | FDD45 | 8.615 | Low Fat | 0.116485 | Fruits and Vegetables | 94.1436 | OUT045 | 2002 | NaN | Tier 2 | Supermarket Type1 | 1134.5232 |
| 3967 | FDV59 | 13.35 | Low Fat | 0.048018 | Breads | 219.6166 | OUT035 | 2004 | Small | Tier 2 | Supermarket Type1 | 3483.4656 |
| 6253 | FDD09 | 13.5 | Low Fat | 0.02154 | Fruits and Vegetables | 182.4976 | OUT045 | 2002 | NaN | Tier 2 | Supermarket Type1 | 1629.8784 |


## 2. Explore the Data

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [14]:
df.columns

Index(['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
       'Item_Type', 'Item_MRP', 'Outlet_Identifier',
       'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
       'Outlet_Type', 'Item_Outlet_Sales'],
      dtype='object')

* 'Item_Identifier' = Nominal
* 'Item_Weight' = Numeric
* 'Item_Fat_Content' = Nominal
* 'Item_Visibility' = Numeric
* 'Item_Type' = Nominal
* 'Item_MRP' = Numeric
* 'Outlet_Identifier' = Nominal
* 'Outlet_Establishment_Year' = Numeric
* 'Outlet_Size' = **Ordinal**
* 'Outlet_Location_Type' = **Ordinal**
* 'Outlet_Type' = Nominal
* 'Item_Outlet_Sales' = Numeric

## 3. Ordinal Encoding

We can ordinal encode before the split without much risk of data leakage, because the mapping is written by hand and does not depend on any statistic computed from the data. Ordinal variables generally have a small number of categories, and all of them are likely to appear in both the training and testing data. If that is not the case, the scikit-learn transformer OrdinalEncoder can be added to the preprocessing pipeline instead.
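
As a rough sketch of the OrdinalEncoder alternative, the example below encodes a toy frame that borrows this dataset's category labels. The rows and the names `size_order`, `tier_order`, and `ord_encoder` are illustrative assumptions. In a real pipeline, missing values in Outlet_Size would typically be imputed in an earlier step.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame using the category labels from this dataset (rows are illustrative).
toy = pd.DataFrame({
    'Outlet_Size': ['Small', 'High', 'Medium', 'Small'],
    'Outlet_Location_Type': ['Tier 1', 'Tier 3', 'Tier 2', 'Tier 3'],
})

# Categories are listed from lowest to highest, one list per column,
# so the integer codes follow the intended order rather than alphabetical order.
size_order = ['Small', 'Medium', 'High']
tier_order = ['Tier 1', 'Tier 2', 'Tier 3']
ord_encoder = OrdinalEncoder(categories=[size_order, tier_order])

encoded = ord_encoder.fit_transform(toy)
print(encoded)
```

Passing `categories` explicitly reproduces the same codes as the hand-written dictionaries (Small=0, Medium=1, High=2 and Tier 1=0, Tier 2=1, Tier 3=2).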

In [15]:
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

In [16]:
df['Outlet_Location_Type'].value_counts()

Tier 3    3350
Tier 2    2785
Tier 1    2388
Name: Outlet_Location_Type, dtype: int64

In [21]:
# Ordinal Encoding 'Outlet_Size'
replacement_dictionary = {'High':2, 'Medium':1, 'Small':0}
# assign the result back instead of calling replace(inplace=True) on a column selection
df['Outlet_Size'] = df['Outlet_Size'].replace(replacement_dictionary)
df['Outlet_Size'].value_counts()

1.0    2793
0.0    2388
2.0     932
Name: Outlet_Size, dtype: int64

In [22]:
# Ordinal Encoding 'Outlet_Location_Type'
replacement_dictionary = {'Tier 3':2, 'Tier 2':1, 'Tier 1':0}
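# assign the result back instead of calling replace(inplace=True) on a column selection
df['Outlet_Location_Type'] = df['Outlet_Location_Type'].replace(replacement_dictionary)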
df['Outlet_Location_Type'].value_counts()

2    3350
1    2785
0    2388
Name: Outlet_Location_Type, dtype: int64

## 4. Validation Split

In [26]:
# target y is 'Item_Outlet_Sales'
X = df.drop(columns='Item_Outlet_Sales')
y = df['Item_Outlet_Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
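
The split above does not pass `test_size`, so scikit-learn's default of 0.25 applies. The sketch below checks the resulting row counts on stand-in arrays with the same number of rows as the dataset (8523). The array contents are dummies, not the sales data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with 8523 rows, matching the dataset's row count; values are dummies.
X_demo = np.arange(8523).reshape(-1, 1)
y_demo = np.arange(8523)

# No test_size is given, so the default of 0.25 is used.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
```

The training portion has 6392 rows and the test portion 2131, which adds up to the 8523 rows in the dataset.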

## 5. Instantiate Column Selectors

In [28]:
# make selectors
cat_selector = make_column_selector(dtype_include='object')
num_selector = make_column_selector(dtype_include='number')
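
A selector is a callable that returns the matching column names when it is given a DataFrame. The sketch below shows this on a small toy frame. The frame and the names `cat_cols` and `num_cols` are illustrative, and the text column is created with an explicit object dtype so the example does not depend on how a given pandas version infers string columns.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame with one text column and two numeric columns (values are illustrative).
toy = pd.DataFrame({
    'Item_Type': pd.Series(['Dairy', 'Meat'], dtype='object'),
    'Item_MRP': [249.8, 141.6],
    'Outlet_Establishment_Year': [1999, 2009],
})

# Calling a selector on a DataFrame returns a list of column names.
cat_cols = make_column_selector(dtype_include='object')(toy)
num_cols = make_column_selector(dtype_include='number')(toy)

print(cat_cols)
print(num_cols)
```

Note that `dtype_include='number'` matches both float and integer columns, so any column that was ordinal encoded to integers or floats is treated as numeric.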

## 6. Instantiate Transformers

In [29]:
# Imputer
freq_imputer = SimpleImputer(strategy='most_frequent')
mean_imputer = SimpleImputer(strategy='mean')

# Scaler
scaler = StandardScaler()

# One Hot Encoder
# sparse_output replaces the older sparse argument in scikit-learn 1.2 and later
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
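
To see what `handle_unknown='ignore'` does, the sketch below fits an encoder on toy training labels and then transforms a test label that never appeared in training. The arrays are illustrative. The encoder's default sparse output is converted with `.toarray()` so the example does not rely on a version-specific argument name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy labels (illustrative): 'Seafood' appears only in the test rows.
train_types = np.array([['Dairy'], ['Meat'], ['Dairy']], dtype=object)
test_types = np.array([['Meat'], ['Seafood']], dtype=object)

encoder = OneHotEncoder(handle_unknown='ignore')
# The categories are learned from the training rows only.
encoder.fit(train_types)

# An unseen category is encoded as a row of zeros instead of raising an error.
encoded_test = encoder.transform(test_types).toarray()

print(encoder.categories_)
print(encoded_test)
```

Without `handle_unknown='ignore'`, transforming a category that was absent during fitting raises an error, which is a real possibility when the encoder is fit after the split.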

## 7. Instantiate Pipelines

In [30]:
# Numerical pipeline
numeric_pipe = make_pipeline(mean_imputer, scaler)
numeric_pipe

In [31]:
# Categorical Pipeline
categorical_pipe = make_pipeline(freq_imputer, ohe)
categorical_pipe

## 8. Instantiate ColumnTransformer

In [32]:
# Tuples for Column transformer
number_tuple = (numeric_pipe, num_selector)
category_tuple = (categorical_pipe, cat_selector)

# Column transformer
preprocessor = make_column_transformer(number_tuple, category_tuple)
preprocessor

## 9. Transform Data

In [33]:
# fit on the training data only
preprocessor.fit(X_train)



In [34]:
# transform the training and testing data
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

## 10. Inspect the Result

In [35]:
# Check for missing values and that data is scaled and one hot encoded
print(np.isnan(X_train_processed).sum().sum(), 'missing values in training data')
print(np.isnan(X_test_processed).sum().sum(), 'missing values in testing data')
print('\n')
print('All data in X_train_processed are', X_train_processed.dtype)
print('All data in X_test_processed are', X_test_processed.dtype)
print('\n')
print('shape of data is', X_train_processed.shape)
print('\n')
X_train_processed

0 missing values in training data
0 missing values in testing data


All data in X_train_processed are float64
All data in X_test_processed are float64


shape of data is (6392, 1588)




array([[ 0.81724868, -0.71277507,  1.82810922, ...,  0.        ,
         1.        ,  0.        ],
       [ 0.5563395 , -1.29105225,  0.60336888, ...,  0.        ,
         1.        ,  0.        ],
       [-0.13151196,  1.81331864,  0.24454056, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [ 1.11373638, -0.92052713,  1.52302674, ...,  1.        ,
         0.        ,  0.        ],
       [ 1.76600931, -0.2277552 , -0.38377708, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.81724868, -0.95867683, -0.73836105, ...,  1.        ,
         0.        ,  0.        ]])