# Summer Camp Lab 2 - Intermediate Machine Learning

In this lab we will practice:

Recap from last lab:
* Loading Data
* Data Exploration
* Selecting the Prediction Target
* Choosing Features
* Splitting Data into Training and Test Sets

New this lab:
* Gradient Boosted Trees (XGBoost)
* Encoding of Categorical Variables
* Scaling of Numerical Variables
* Imputing Missing Values of Categorical and Numerical Variables
* Building the Pipeline
* Cross Validation
* Predictions with Cross Validated Estimates

**To work in the notebook, first copy the notebook to your own drive. File > "Save a copy in Drive"**

# Setting Up the Workspace

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

#Metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict

# Case Study

BigMart's team of data scientists has gathered sales data for the year 2013, encompassing 1559 products distributed across 10 stores situated in various cities. The dataset includes specific attributes for each product and store.

The primary objective is to construct a predictive model capable of forecasting the sales of individual products within specific outlets.

This predictive model aims to unveil the influential factors that contribute to increased sales, enabling BigMart to gain insights into product and outlet characteristics crucial for sales growth.

The data has the following features that could be useful in your model:

* Item_Identifier: A unique identifier for each product.

* Item_Weight: The weight of the product.

* Item_Fat_Content: Indicates the level of fat content in the product, often categorized as 'Low Fat,' 'Regular,' etc.

* Item_Visibility: The percentage of total display area of all products in a store allocated to a particular product.

* Item_Type: The category or type of the product (e.g., dairy, meat, fruits, etc.).

* Item_MRP (Maximum Retail Price): The maximum price at which the product can be sold.

* Outlet_Identifier: A unique identifier for each store/outlet.

* Outlet_Establishment_Year: The year in which the store was established.

* Outlet_Size: The size of the store, often categorized as 'Small,' 'Medium,' or 'Large.'

* Outlet_Location_Type: The type of location where the store is situated, such as 'Urban,' 'Suburban,' or 'Rural.'

* Outlet_Type: The type of outlet, such as 'Supermarket Type1,' 'Supermarket Type2,' 'Grocery Store,' etc.

* Item_Outlet_Sales: The target variable, representing the sales of the product in a particular store.


# Loading the Data

In [None]:
#import pandas as pd
df_sales = pd.read_csv('https://www.dropbox.com/s/yqaymhdf7bvvair/bigmart_sales_predictions.csv?dl=1')

#Fix different spelling variants of Fat Content
df_sales['Item_Fat_Content'] = df_sales['Item_Fat_Content'].replace({'LF' : 'Low Fat', 'low fat' : 'Low Fat', 'reg' : 'Regular'})


df_sales.head()

# Data Exploration

In [None]:
df_sales.tail()

In [None]:
df_sales['Item_Fat_Content'].value_counts()

In [None]:
df_sales.info()

In [None]:
df_sales.describe().T

In [None]:
df_sales.value_counts('Outlet_Location_Type')

# Selecting the Prediction Target

In [None]:
# Our target variable is the sales of an item at an outlet.
y = df_sales['Item_Outlet_Sales']
y

# Choosing Features

In [None]:
# We include a few columns that we think could be useful as features in our model.
X = df_sales[['Item_Visibility','Item_MRP','Item_Weight', 'Item_Fat_Content', 'Item_Type', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']]
X

In [None]:
X.describe()

In [None]:
X.dtypes

# Split Data into Training and Test Sets

Set `random_state` if you want to generate the same split on every run of your code.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
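
As a small check of this, the sketch below splits a toy DataFrame twice with the same `random_state` and once with a different one. The data and the variable names (`toy`, `split_a`, ...) are hypothetical and only serve the illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data, not the BigMart file
toy = pd.DataFrame({'feature': range(10), 'target': range(100, 110)})

# Same random_state twice, then a different one
split_a = train_test_split(toy[['feature']], toy['target'], test_size=0.2, random_state=42)
split_b = train_test_split(toy[['feature']], toy['target'], test_size=0.2, random_state=42)
split_c = train_test_split(toy[['feature']], toy['target'], test_size=0.2, random_state=7)

# Element 1 of each result is the test part of the features
same_ab = split_a[1].index.equals(split_b[1].index)
print('Test rows, run A:     ', list(split_a[1].index))
print('Test rows, run B:     ', list(split_b[1].index))
print('Test rows, other seed:', list(split_c[1].index))
print('A and B identical:', same_ab)
```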

# XGBoost

In [None]:
from xgboost import XGBRegressor

# Create a gradient boosted trees regressor
model = XGBRegressor()

# Train the model
model.fit(X_train[['Item_Visibility','Item_Weight','Item_MRP']], y_train)

# Make predictions on the test set
predictions = model.predict(X_test[['Item_Visibility','Item_Weight','Item_MRP']])

# Evaluate the model
print('MSE', mean_squared_error(y_test, predictions))

Setting the number of estimators to 10 instead of the default 100 reduces overfitting and lowers the test error on this dataset.

In [None]:
from xgboost import XGBRegressor

# Create a gradient boosted trees regressor
model = XGBRegressor(n_estimators=10)

# Train the model
model.fit(X_train[['Item_Visibility','Item_Weight','Item_MRP']], y_train)

# Make predictions on the test set
predictions = model.predict(X_test[['Item_Visibility','Item_Weight','Item_MRP']])

# Evaluate the model
print('MSE', mean_squared_error(y_test, predictions))

In [None]:
from xgboost import XGBRegressor

# Create a gradient boosted trees regressor
model = XGBRegressor(n_estimators=100)

# Train the model
model.fit(X_train[['Item_Visibility','Item_Weight','Item_MRP']], y_train)

# Make predictions on the test set
predictions = model.predict(X_test[['Item_Visibility','Item_Weight','Item_MRP']])

# Evaluate the model
print('MSE', mean_squared_error(y_test, predictions))
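
One way to judge whether more trees lead to overfitting is to compare the error on the training data with the error on the test data. The sketch below does this on synthetic data, and it uses scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor` so that it runs without the BigMart file or the xgboost package. Both the data and the stand-in model are assumptions for illustration. The numbers will differ from what you see with XGBoost on the lab data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): one informative feature, two noise features
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(300, 3))
y_toy = 5 * X_toy[:, 0] + rng.normal(0, 5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Fit with few and with many trees; store (train MSE, test MSE) per setting
results = {}
for n in (10, 100):
    gbr = GradientBoostingRegressor(n_estimators=n, random_state=0)
    gbr.fit(X_tr, y_tr)
    results[n] = (mean_squared_error(y_tr, gbr.predict(X_tr)),
                  mean_squared_error(y_te, gbr.predict(X_te)))
    print(f'n_estimators={n}: train MSE={results[n][0]:.1f}, test MSE={results[n][1]:.1f}')
```

A training error that keeps dropping while the test error stalls or rises is the usual sign of overfitting.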

In [None]:
data = X_test[['Item_Visibility','Item_Weight','Item_MRP']].copy()
data['actual_sales'] = y_test
data['predicted_sales'] = predictions
data.to_csv('predictions.csv')
data

In [None]:
# @title actual_sales vs predicted_sales

from matplotlib import pyplot as plt
data.plot(kind='scatter', x='actual_sales', y='predicted_sales', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

# Encoders and Scalers

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

## OrdinalEncoder

In [None]:
df_sales[['Outlet_Size']].value_counts()

In [None]:
X_test[['Outlet_Size']]

In [None]:
from sklearn.preprocessing import OrdinalEncoder

ord_enc = OrdinalEncoder(handle_unknown='error')
ord_enc.fit_transform(X_train[['Outlet_Size']])
ord_enc.transform(X_test[['Outlet_Size']])

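By default, `OrdinalEncoder` orders the categories by their sorted unique values, which for strings means alphabetically. For `Outlet_Location_Type` (Tier 1 to Tier 3) this happens to match the natural order, so the tiers are encoded as 0 to 2. For `Outlet_Size` it does not, because alphabetically `High` comes before `Medium` and `Small`.
You can therefore also specify the categories manually.

Note that you define the categories as one list per column, so `[categories_column1, categories_column2, ...]`.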

In [None]:
ord_enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1, categories=[ ['Small', 'Medium', 'High'] ])
ord_enc.fit_transform(X_train[['Outlet_Size']])
ord_enc.transform(X_test[['Outlet_Size']])
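
The difference between the default order and a manual order is easy to see on a toy column. The values below are hypothetical and only mirror the labels of `Outlet_Size`. By default scikit-learn sorts the categories, which for strings means alphabetically.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with the three sizes
sizes = np.array([['Small'], ['Medium'], ['High']], dtype=object)

# Default: categories are sorted, so High=0, Medium=1, Small=2
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(sizes).ravel()
print(default_enc.categories_[0], default_codes)

# Manual order: Small=0, Medium=1, High=2
manual_enc = OrdinalEncoder(categories=[['Small', 'Medium', 'High']])
manual_codes = manual_enc.fit_transform(sizes).ravel()
print(manual_enc.categories_[0], manual_codes)
```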

If your test data has a category that was not seen during fitting (`fit_transform`), then `OrdinalEncoder` raises an error with the default `handle_unknown='error'`.

In [None]:
ord_enc = OrdinalEncoder(handle_unknown='error')
ord_enc.fit_transform(X_train[['Outlet_Size']])
ord_enc.transform(X_test[['Outlet_Size']])

With `handle_unknown='use_encoded_value'` the encoder encodes unknown categories as the number or `np.nan` that you set in `unknown_value`.

In [None]:
ord_enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
ord_enc.fit_transform(X_train[['Outlet_Size']])
ord_enc.transform([['Unknown category'], ['Unknown']])
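
The two settings can be compared side by side on toy data. The category `'Gigantic'` is made up for the illustration and does not occur in the lab dataset.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train = np.array([['Small'], ['Medium'], ['High']], dtype=object)
unseen = np.array([['Medium'], ['Gigantic']], dtype=object)  # 'Gigantic' was never fitted

# handle_unknown='error': transform fails on the unseen category
strict_enc = OrdinalEncoder(handle_unknown='error').fit(train)
try:
    strict_enc.transform(unseen)
    raised = False
except ValueError as err:
    raised = True
    print('Error:', err)

# handle_unknown='use_encoded_value': the unseen category becomes unknown_value
lenient_enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1).fit(train)
lenient_codes = lenient_enc.transform(unseen).ravel()
print(lenient_codes)
```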

The other two parameters, `min_frequency` and `max_categories`, can be used to group less frequently occurring categories together.

## OneHotEncoder

You use `fit_transform` in the first pass over the data; the encoder then learns the categories from the training data.

In subsequent uses you can use `transform` to encode each category exactly like the data you trained on.

In [None]:
from sklearn.preprocessing import OneHotEncoder

oh_enc = OneHotEncoder(handle_unknown='error', sparse_output=False)
oh_enc.fit_transform(X_train[['Outlet_Size']])
oh_enc.transform(X_test[['Outlet_Size']])

You can again manually specify the order with the `categories` parameter.

In [None]:
oh_enc = OneHotEncoder(handle_unknown='error', sparse_output=False, categories=[[np.nan, 'Small', 'Medium', 'High']])
oh_enc.fit_transform(X_train[['Outlet_Size']])
oh_enc.transform(X_test[['Outlet_Size']])

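Because one-hot encoding loses its effectiveness when a column has a high number of categories, you can use `max_categories` to cap the number of output columns and `min_frequency` to set the minimum frequency a category needs to get its own column. Less frequent categories are grouped into a single "infrequent" column.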

In [None]:
from sklearn.preprocessing import OneHotEncoder

oh_enc = OneHotEncoder(sparse_output=False, max_categories=2)
oh_enc.fit_transform(X_train[['Outlet_Size']])
oh_enc.transform(X_test[['Outlet_Size']])

In [None]:
from sklearn.preprocessing import OneHotEncoder

oh_enc = OneHotEncoder(sparse_output=False, min_frequency=0.8)
oh_enc.fit_transform(X_train[['Outlet_Size']])
oh_enc.transform(X_test[['Outlet_Size']])

## StandardScaler

Standardize features by removing the mean and scaling to unit variance.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit_transform(X_train[['Item_MRP']])
scaler.transform(X_test[['Item_MRP']])
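
What the scaler computes can be checked by hand on a few toy prices (hypothetical values). The mean and standard deviation are learned from the training data only and are then reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_prices = np.array([[10.0], [20.0], [30.0]])
test_prices = np.array([[20.0], [40.0]])

scaler = StandardScaler()
scaler.fit(train_prices)
scaled_test = scaler.transform(test_prices)

# Same computation by hand: (x - training mean) / training standard deviation
by_hand = (test_prices - train_prices.mean()) / train_prices.std()

print('Learned mean:', scaler.mean_, 'learned std:', scaler.scale_)
print('Scaler: ', scaled_test.ravel())
print('By hand:', by_hand.ravel())
```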

## MinMaxScaler

Transform features by scaling each feature to a given range.


In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit_transform(X_train[['Item_MRP']])
scaler.transform(X_test[['Item_MRP']])
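
With the default range of 0 to 1, the scaler computes `(x - min) / (max - min)` using the minimum and maximum of the training data. A small sketch with hypothetical prices shows that test values outside the training range end up outside 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_prices = np.array([[10.0], [20.0], [30.0]])
test_prices = np.array([[20.0], [40.0]])  # 40 is larger than anything seen in training

minmax = MinMaxScaler()
minmax.fit(train_prices)
scaled_test = minmax.transform(test_prices)

# Same computation by hand with the training minimum and maximum
by_hand = (test_prices - train_prices.min()) / (train_prices.max() - train_prices.min())

print('Scaler: ', scaled_test.ravel())
print('By hand:', by_hand.ravel())
```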

# Imputers

You can use the imputer in multiple ways. For example, you can fill in a constant value by using the `constant` strategy. `fill_value` can be a string, any number, or `None`/`np.nan`.

In [None]:
X_test[['Item_Weight']]

In [None]:
from sklearn.impute import SimpleImputer
import numpy as np

imputer = SimpleImputer(strategy='constant', fill_value=np.nan)
imputer.fit_transform(X_train[['Item_Weight']])
imputer.transform(X_test[['Item_Weight']])

For numerical values you can also set the missing values to the median or mean of the data given in `fit_transform`

In [None]:
imputer = SimpleImputer(strategy='median')
imputer.fit_transform(X_train[['Item_MRP']])
imputer.transform(X_test[['Item_MRP']])

In [None]:
imputer = SimpleImputer(strategy='mean')
imputer.fit_transform(X_train[['Item_MRP']])
imputer.transform(X_test[['Item_MRP']])

You can also set the missing values to the most frequent category seen in the data during `fit_transform`.



In [None]:
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit_transform(X_train[['Outlet_Size']])
imputer.transform(X_test[['Outlet_Size']])

In [None]:
X_test[['Outlet_Size']]
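
Because the lab data is large, the effect of the strategies is easier to follow on a handful of values. The arrays below are hypothetical. The point is that the fill value is learned from the training data and then reused on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numerical column: the median of the known training values (1, 3, 100) is 3
weights_train = np.array([[1.0], [3.0], [np.nan], [100.0]])
weights_test = np.array([[np.nan], [5.0]])

median_imputer = SimpleImputer(strategy='median').fit(weights_train)
filled_weights = median_imputer.transform(weights_test)
print('Learned median:', median_imputer.statistics_[0])
print(filled_weights.ravel())

# Categorical column: 'Medium' is the most frequent training value
sizes_train = np.array([['Small'], ['Medium'], ['Medium'], [np.nan]], dtype=object)
sizes_test = np.array([[np.nan], ['High']], dtype=object)

mode_imputer = SimpleImputer(strategy='most_frequent').fit(sizes_train)
filled_sizes = mode_imputer.transform(sizes_test)
print('Learned most frequent value:', mode_imputer.statistics_[0])
print(filled_sizes.ravel())
```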

# Building the Pipeline

Define a pipeline for the numerical columns that imputes missing values with the median of the training data (learned during `fit_transform`) and then scales the data using `StandardScaler`.

In [None]:
from sklearn.pipeline import Pipeline

# Preprocessing for numerical data
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

Define a pipeline that transforms categorical data of each column into a series of OneHot encoded columns.

In [None]:
# Preprocessing for categorical data
# Ignore categories in the validation data that aren't represented in the training data (they are encoded as all zeros)
categorical_transformer = Pipeline(steps=[
   ('imputer', SimpleImputer(strategy='most_frequent')),
   ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# The Pipeline() function is like a railway track with a list of different stations (steps)
# Each step is a tuple declaring the name of the step and then the function to apply

categorical_transformer

Set up the column transformer so that the columns listed in `categorical_cols` are sent to the `categorical_transformer` pipeline and the columns listed in `numerical_cols` are sent to the `numerical_transformer` pipeline.

In [None]:
from sklearn.compose import ColumnTransformer


numerical_cols   = ['Item_Visibility','Item_MRP', 'Item_Weight']
categorical_cols = ['Outlet_Size','Outlet_Location_Type', 'Outlet_Type', 'Item_Fat_Content', 'Item_Type']

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer,   numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ], remainder='passthrough')
    # The ColumnTransformer() function is like a railway switch: it tells what to do with the specified train wagons (data columns).
    # The transformers list gives the different branches where columns can go.
    # Each transformer is a tuple declaring the name of the transformer, the transformer to apply (e.g. the Pipeline defined above), and which columns need to be transformed.
    # By default the ColumnTransformer() drops every column which is not explicitly specified in the list of transformers.
    # With the parameter remainder='passthrough', the columns that you do not mention will not be dropped (and also will not be transformed).

preprocessor

Define a Random Forest model to use in our experiment.

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=0)

Create a pipeline that uses our preprocessing pipeline and feeds the output into our model

In [None]:
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
                            # Here the Pipeline() function is again like a railway track, with a higher level list of different stations (steps)
                            # Each step is a tuple declaring the name of the step and then the function to apply
my_pipeline

In [None]:
# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_test)

# Evaluate the model
score = mean_absolute_error(y_test, preds)
print('MAE:', score)
print('MAPE:', mean_absolute_percentage_error(y_test, preds))
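# RMSE as the square root of the MSE (the `squared=False` argument was removed in recent scikit-learn versions)
print('RMSE:', np.sqrt(mean_squared_error(y_test, preds)))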
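
The three metrics can also be computed by hand, which makes it clearer what each number means. The sketch uses three hypothetical sales values and compares the manual results with scikit-learn.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

# Hypothetical actual and predicted sales
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                     # average absolute error
mape = np.mean(np.abs(errors) / np.abs(y_true))   # average error relative to the actual value
rmse = np.sqrt(np.mean(errors ** 2))              # root of the average squared error

print('MAE :', mae, mean_absolute_error(y_true, y_pred))
print('MAPE:', mape, mean_absolute_percentage_error(y_true, y_pred))
print('RMSE:', rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
```

MAE and RMSE are in the same unit as the target. RMSE penalizes large errors more strongly, and MAPE is a fraction (multiply by 100 for a percentage).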

# Cross validation

We now run 5-fold cross validation with our pipeline on the full data.

In [None]:
# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
print("MAE scores:\n", scores)
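
Under the hood, cross validation with `cv=5` splits the data into five folds, trains on four of them and scores on the remaining one, five times. The sketch below spells this loop out on synthetic data with a `LinearRegression` model (both are assumptions to keep the example small) and compares it with `cross_val_score`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption, not the BigMart data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3 * X_toy[:, 0] - 2 * X_toy[:, 1] + rng.normal(scale=0.5, size=50)

estimator = LinearRegression()

# Manual loop: fit a fresh copy on 4 folds, score on the held-out fold
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_toy):
    fold_model = clone(estimator).fit(X_toy[train_idx], y_toy[train_idx])
    fold_preds = fold_model.predict(X_toy[val_idx])
    manual_scores.append(mean_absolute_error(y_toy[val_idx], fold_preds))

# The same thing in one call
library_scores = -1 * cross_val_score(estimator, X_toy, y_toy, cv=5,
                                      scoring='neg_mean_absolute_error')

print('Manual loop:    ', np.round(manual_scores, 3))
print('cross_val_score:', np.round(library_scores, 3))
```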

## Predictions after cross validation

You can either use `cross_val_predict`

In [None]:
cross_val_predict(my_pipeline, X, y, cv=5)

Or retrain your pipeline with the training data.

In [None]:
# Train the model
my_pipeline.fit(X_train, y_train)

# Make predictions on the test set
predictions = my_pipeline.predict(X_test)

# Evaluate the model
print('MSE', mean_squared_error(y_test, predictions))

In [None]:
data = X_test.copy()
data['actual_sales'] = y_test
data['predicted_sales'] = predictions
data.to_csv('predictions.csv')
data

In [None]:
# @title actual_sales vs predicted_sales

from matplotlib import pyplot as plt
data.plot(kind='scatter', x='actual_sales', y='predicted_sales', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)
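
To make the `cross_val_predict` option more concrete: it returns one prediction per row, and each prediction comes from a model that did not see that row during training. The sketch below shows this on synthetic data with a `LinearRegression` model (both are assumptions for illustration) and compares the resulting error with the error of a model evaluated on its own training data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict

# Synthetic regression data (assumption, not the BigMart data)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=60)

# One out-of-fold prediction per row
oof_preds = cross_val_predict(LinearRegression(), X_toy, y_toy, cv=5)
oof_mae = mean_absolute_error(y_toy, oof_preds)

# For comparison: predictions on the same rows the model was trained on
in_sample_preds = LinearRegression().fit(X_toy, y_toy).predict(X_toy)
in_sample_mae = mean_absolute_error(y_toy, in_sample_preds)

print('Number of rows:', len(y_toy), 'number of predictions:', len(oof_preds))
print('Out-of-fold MAE:', round(oof_mae, 3))
print('In-sample MAE:  ', round(in_sample_mae, 3))
```

The out-of-fold error is usually the more honest estimate of how the model behaves on new data.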

# Assignment

Write a pipeline that:

- Ordinal encodes the columns `Outlet_Size`, `Outlet_Type`, `Outlet_Location_Type` in the correct order of the values (e.g. Small is 0 and High is 2).
- One-Hot encodes the column `Item_Type`. Only allow the 25 categories with the most values.

- Imputes the missing values in column `Item_Weight` with the mean value, and scales them with MinMaxScaler.

- Imputes the missing values in columns `Item_Visibility` and `Item_MRP` with the median value and scales them with StandardScaler.

- Use this preprocessing pipeline with a `RandomForestRegressor` model and with an `XGBRegressor` model.

- Cross validate both model pipelines.
- Generate cross-validated estimates with both model pipelines.
