
In [13]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import set_config
set_config(display='diagram')

filename = "/content/drive/MyDrive/Colab Notebooks/CodingDojo/05IntroML/sales_predictions.csv"
df = pd.read_csv(filename)
df.head()
# Let's make a copy of our dataset so we don't lose the original data.
df_ml = df.copy()

First, let's check for duplicates.

In [14]:
df.duplicated().sum()

0
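
To make it concrete what this count measures, here is a minimal sketch on a made-up frame. The `toy` data and its column names are hypothetical and are not taken from the sales dataset.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the first one exactly.
toy = pd.DataFrame({
    "item": ["A", "B", "A"],
    "price": [1.0, 2.0, 1.0],
})

# duplicated() flags every row that repeats an earlier row; sum() counts the flags
n_duplicates = toy.duplicated().sum()
print(n_duplicates)

# drop_duplicates() keeps only the first occurrence of each row
toy_clean = toy.drop_duplicates()
print(toy_clean.shape)
```

A count of 0, as in our data, means no row is an exact copy of another, so nothing needs to be dropped.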

Next, we'll check for inconsistencies in our data.

In [15]:
df['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

The counts show inconsistent labels for the same two categories: 'LF' and 'low fat' both mean 'Low Fat', and 'reg' means 'Regular'. Let's fix these.

In [16]:
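# Map each inconsistent label to its standard spelling; assigning the result back
# avoids chained in-place replacement, which newer pandas versions warn about
fat_content_fixes = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace(fat_content_fixes)
df['Item_Fat_Content'].value_counts()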

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

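Next, we split the data into training and test sets. We do this before any imputing or scaling so that nothing learned from the test set leaks into the preprocessing.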

In [17]:
X = df.drop(columns = ['Item_Outlet_Sales'])
y = df['Item_Outlet_Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
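
The call above does not pass `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows. A small sketch on made-up arrays (`X_toy` and `y_toy` are hypothetical) shows the resulting shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# No test_size given, so the default 75% / 25% split is used;
# random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=42)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` means rerunning the cell gives the same rows in each set, which keeps later results comparable.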

Let's check for missing values and the types of the affected columns.

In [18]:
display(X_train.isna().sum())
print()
display("The type of item weight is ", df['Item_Weight'].dtype)
print()
display("The type of Outlet size is ", df['Outlet_Size'].dtype)

Item_Identifier                 0
Item_Weight                  1107
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  1812
Outlet_Location_Type            0
Outlet_Type                     0
dtype: int64

'The type of item weight is '

dtype('float64')

'The type of Outlet size is '

dtype('O')

Item_Weight (numeric) and Outlet_Size (categorical) both have missing values, so we'll need to impute these columns.

We also need to scale the numeric columns and one-hot encode all of the categorical columns.

We begin by instantiating the selectors we need. Because numeric and categorical (object) columns need different treatment, we'll need two column selectors.

In [19]:
# instantiate our column selectors
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')
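
A selector is a callable: given a DataFrame, it returns the names of the columns whose dtype matches. The sketch below uses a made-up frame (`toy` and its columns are hypothetical) to show what each selector picks up.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame with two numeric columns and one text column.
# dtype=object is set explicitly so the text column is an object column
# regardless of the installed pandas version.
toy = pd.DataFrame({
    "weight": [9.3, 5.9],
    "year": [1999, 2009],
    "size": pd.Series(["Medium", "Small"], dtype=object),
})

# Calling a selector on a DataFrame returns a list of matching column names
num_cols = make_column_selector(dtype_include="number")(toy)
cat_cols = make_column_selector(dtype_include="object")(toy)

print(num_cols)
print(cat_cols)
```

Note that `'number'` matches both integer and float columns, so a column such as Outlet_Establishment_Year is treated as numeric.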

Next we'll create all of the transformers we need.
For the numeric columns we'll impute with the mean, because we're not concerned with outliers; for the categorical columns we'll impute with the most frequent value.

In [20]:
mean_imputer = SimpleImputer(strategy='mean')
freq_imputer = SimpleImputer(strategy='most_frequent')

scaler = StandardScaler()

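# `sparse_output` replaced the `sparse` argument in scikit-learn 1.2;
# on older versions, pass sparse=False instead
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)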
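
To see what the two imputation strategies do, here is a minimal sketch on made-up columns. The `toy_num` and `toy_cat` frames are hypothetical stand-ins for Item_Weight and Outlet_Size.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric column with one missing value
toy_num = pd.DataFrame({"weight": [10.0, np.nan, 20.0]})

# Hypothetical categorical column with one missing value
toy_cat = pd.DataFrame({
    "size": pd.Series(["Small", "Medium", np.nan, "Medium"], dtype=object),
})

# 'mean' fills the gap with the average of the observed values (here 15.0)
mean_filled = SimpleImputer(strategy="mean").fit_transform(toy_num)

# 'most_frequent' fills the gap with the most common label (here 'Medium')
freq_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy_cat)

print(mean_filled.ravel())
print(freq_filled.ravel())
```

The mean only makes sense for numbers, which is why the categorical column gets the most-frequent strategy instead.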

Next, because each column type needs more than one transformation applied in sequence, we will use pipelines.

In [21]:
num_pipe = make_pipeline(mean_imputer, scaler)
cat_pipe = make_pipeline(freq_imputer, ohe)
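
A pipeline runs its steps in order, feeding each step's output into the next. As a rough sketch, the numeric pipeline on a made-up column (`toy_weights` is hypothetical) first fills the gap with the mean and then standardizes the result:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric column with one missing value
toy_weights = np.array([[1.0], [np.nan], [3.0]])

toy_num_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

# Step 1 fills the NaN with the mean (2.0); step 2 rescales to mean 0, std 1
toy_scaled = toy_num_pipe.fit_transform(toy_weights)
print(toy_scaled.ravel())

# make_pipeline names each step after its lowercased class name
step_names = list(toy_num_pipe.named_steps)
print(step_names)
```

The order matters here: StandardScaler ignores missing values when fitting but passes them through, so scaling first would leave the imputer filling gaps in already-scaled data.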

Because we are operating on two different column types, we'll also need a column transformer to send each type of column to its own pipeline.

In [22]:
# pair each pipeline with the selector for the columns it should handle
num_tuple = (num_pipe, num_selector)
cat_tuple = (cat_pipe, cat_selector)

preprocessor = make_column_transformer(num_tuple, cat_tuple)

preprocessor.fit(X_train)

Now we can transform our data all at once.

In [23]:
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)
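
The sketch below walks through the same fit-on-train, transform-both pattern on a tiny made-up dataset, including a test row whose category never appears in training. All `toy_*` names and values are hypothetical, and the encoder is left at its default output type, so the result is converted to a dense array by a small helper.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training and test data
toy_train = pd.DataFrame({
    "weight": [10.0, np.nan, 20.0, 30.0],
    "size": pd.Series(["Small", "Medium", "Medium", np.nan], dtype=object),
})
toy_test = pd.DataFrame({
    "weight": [15.0],
    "size": pd.Series(["High"], dtype=object),  # never seen during fit
})

toy_num_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
toy_cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
toy_preprocessor = make_column_transformer(
    (toy_num_pipe, make_column_selector(dtype_include="number")),
    (toy_cat_pipe, make_column_selector(dtype_include="object")),
)

# Fit on the training data only, then reuse the fitted statistics everywhere
toy_preprocessor.fit(toy_train)


def to_dense(arr):
    """Return a dense NumPy array whether the transformer output is sparse or dense."""
    return arr.toarray() if hasattr(arr, "toarray") else np.asarray(arr)


toy_train_out = to_dense(toy_preprocessor.transform(toy_train))
toy_test_out = to_dense(toy_preprocessor.transform(toy_test))

print(toy_train_out.shape)
print(toy_test_out)
```

The test row is scaled with the training mean and standard deviation, and because of `handle_unknown='ignore'` its unseen category becomes all zeros in the one-hot columns instead of raising an error.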

Finally, let's check the processed training data for any remaining null values.

In [24]:
np.isnan(X_train_processed).sum()

0

Perfect, the processed training data has no missing values! We are ready for modeling.