
In [25]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn import set_config
set_config(display='diagram')

import matplotlib.pyplot as plt
#Import Linear Regression model
from sklearn.linear_model import LinearRegression
# Import random forest Regressor
from sklearn.ensemble import RandomForestRegressor
# Import the bagging regressor
from sklearn.ensemble import BaggingRegressor
#Import regression tree
from sklearn.tree import DecisionTreeRegressor
#Import Metrics for testing our models
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

filename = "/content/drive/MyDrive/Colab Notebooks/CodingDojo/06Regression/sales_predictions.csv"
df = pd.read_csv(filename)
df.head()
# Let's make a copy of our dataset so we don't lose the original data.
df_ml = df.copy()

First, let's check for duplicates and missing values.

In [26]:
display(df_ml.duplicated().sum())
display(df_ml.isna().sum())

0

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

We will deal with the missing values shortly (using SimpleImputer).
For now, we'll check for inconsistencies in the data.

In [27]:
df_ml['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

Here we can clearly see inconsistent labels for the same two categories: 'LF' and 'low fat' both mean 'Low Fat', and 'reg' means 'Regular'. Let's fix these.

In [28]:
# Map each inconsistent label to its canonical form. Assigning the result back
# avoids calling replace(inplace=True) on a column selection.
fat_map = {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'}
df_ml['Item_Fat_Content'] = df_ml['Item_Fat_Content'].replace(fat_map)
df_ml['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

Train test split.

In [29]:
X = df_ml.drop(columns = ['Item_Outlet_Sales'])
y = df_ml['Item_Outlet_Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

Let's re-examine our missing values and the types of the affected columns.

In [30]:
display(X_train.isna().sum())
print()
display("The type of item weight is ", df_ml['Item_Weight'].dtype)
print()
display("The type of Outlet size is ", df_ml['Outlet_Size'].dtype)

Item_Identifier                 0
Item_Weight                  1107
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  1812
Outlet_Location_Type            0
Outlet_Type                     0
dtype: int64




'The type of item weight is '

dtype('float64')




'The type of Outlet size is '

dtype('O')

We'll need to impute the missing values in these two columns.

We also need to scale our numeric data and one-hot encode all of the categorical columns.

We begin by instantiating the selectors we need. Because Item_Weight is numeric and Outlet_Size is categorical (object dtype), we'll need two column selectors.

In [31]:
#instantiate our column selectors
num_selector = make_column_selector(dtype_include='number')
cat_selector = make_column_selector(dtype_include='object')
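
To make it concrete what a selector does, here is a minimal sketch on a toy DataFrame. The frame `toy` and its values are made up for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame with one numeric and one object column (illustrative values only).
toy = pd.DataFrame({
    "Item_Weight": [10.0, np.nan, 20.0],
    "Outlet_Size": pd.Series(["Medium", np.nan, "High"], dtype=object),
})

num_selector = make_column_selector(dtype_include="number")
cat_selector = make_column_selector(dtype_include="object")

# A selector is a callable: given a DataFrame, it returns the names of the
# columns whose dtype matches.
num_cols = num_selector(toy)
cat_cols = cat_selector(toy)
print(num_cols, cat_cols)
```

Because the selection is by dtype, any column added later with a matching dtype would be picked up automatically.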

Next, we'll create all of the transformers we need.
For the numeric columns we will impute with the mean, since we're not concerned with outliers. For the categorical columns we will impute with the most frequent value.

In [32]:
mean_imputer = SimpleImputer(strategy='mean')
freq_imputer = SimpleImputer(strategy='most_frequent')

scaler = StandardScaler()

ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # scikit-learn >= 1.2; older versions use sparse=False
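
As a quick illustration of the two imputation strategies, the sketch below fills a missing value in a toy numeric column and in a toy categorical column. The arrays are made-up examples, not rows from the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric column with one missing value (illustrative values only).
weights = np.array([[10.0], [np.nan], [20.0]])
mean_imputer = SimpleImputer(strategy="mean")
# The NaN is replaced by the mean of the observed values (10 and 20).
weights_filled = mean_imputer.fit_transform(weights)

# Toy categorical column; np.nan marks the missing entry.
sizes = np.array([["Medium"], ["Medium"], [np.nan], ["High"]], dtype=object)
freq_imputer = SimpleImputer(strategy="most_frequent")
# The NaN is replaced by the most common observed label.
sizes_filled = freq_imputer.fit_transform(sizes)

print(weights_filled.ravel())
print(sizes_filled.ravel())
```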

Next, because we're applying several transformations to the same column type, we will use pipelines.

In [33]:
num_pipe = make_pipeline(mean_imputer, scaler)
cat_pipe = make_pipeline(freq_imputer, ohe)

Because we are operating on two different column types, we'll need a column transformer as well.

In [34]:
# group with tuples
num_tuple = (num_pipe, num_selector)
cat_tuple = (cat_pipe, cat_selector)

preprocessor = make_column_transformer(num_tuple, cat_tuple)

preprocessor.fit(X_train)

Now we can transform our data all at once.

In [35]:
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)
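
One possible variation, not used in this notebook, is to chain the preprocessor and an estimator into a single pipeline so that fitting and predicting always apply the same transformations. The sketch below assumes a toy DataFrame with made-up values, and a shallow tree is chosen only to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Toy data (illustrative values only); np.nan marks missing entries.
X_toy = pd.DataFrame({
    "Item_Weight": [10.0, np.nan, 20.0, 15.0],
    "Item_MRP": [100.0, 50.0, 150.0, 120.0],
    "Outlet_Size": pd.Series(["Medium", np.nan, "High", "Medium"], dtype=object),
})
y_toy = pd.Series([1000.0, 400.0, 1600.0, 1100.0])

num_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
preprocessor = make_column_transformer(
    (num_pipe, make_column_selector(dtype_include="number")),
    (cat_pipe, make_column_selector(dtype_include="object")),
)

# Chaining the estimator means fit() learns the imputation, scaling and
# encoding from the data it is given, and predict() reuses them unchanged.
model = make_pipeline(
    preprocessor, DecisionTreeRegressor(max_depth=2, random_state=42)
)
model.fit(X_toy, y_toy)
preds = model.predict(X_toy)
print(preds.shape)
```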

Check to see if there are any null values.


In [36]:
np.isnan(X_train_processed).sum().sum()

0

Perfect, we have no missing data! We are ready for modeling.

Before we begin modeling, we will define a function to help us evaluate our models. We will be using R-squared scoring and RMSE.

In [37]:
def metrics(y, pred):
  print(" R^2 : ", round(r2_score(y, pred),3))
  print("RMSE : ", round(np.sqrt(mean_squared_error(y,pred)),3))
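
To show what these two metrics measure, here is a small sketch that computes them by hand on toy values and compares the results with scikit-learn. The arrays `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy targets and predictions (illustrative values only).
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The manual values should agree with scikit-learn's implementations.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)
print(round(rmse_manual, 3), round(r2_manual, 3))
```

Note that R^2 has no lower bound: it becomes negative whenever the predictions are worse than always predicting the mean of the target.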

Let's begin by building a linear regression model to predict sales.

We will:

  1) Build the model.
  2) Evaluate the performance of our model using R-squared scoring and RMSE.

In [38]:
#instantiate model
linear_reg = LinearRegression()

#fit the data
linear_reg.fit(X_train_processed, y_train)

#make our predictions
train_pred = linear_reg.predict(X_train_processed)
test_pred = linear_reg.predict(X_test_processed)

Now let's evaluate our model to see how well it performed.

In [39]:
print("Metrics for training data")
metrics(y_train, train_pred)
print("\nMetrics for testing data")
metrics(y_test, test_pred)

Metrics for training data
 R^2 :  0.671
RMSE :  986.086

Metrics for testing data
 R^2 :  -1.6575880497968222e+19
RMSE :  6762579228318.999
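
The test R^2 above is hugely negative even though the training score looks reasonable, and this notebook does not diagnose why. One plausible contributor, stated here as an assumption rather than a verified cause, is that high-cardinality identifier columns get one-hot encoded, which gives the linear model many collinear indicator columns and lets the test set contain categories never seen in training. A minimal sketch of a check for this, on toy data with a hypothetical helper `category_report`:

```python
import pandas as pd

def category_report(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per object column, count the distinct training
    categories and the test categories that never appear in training."""
    rows = {}
    for col in train.select_dtypes(include="object").columns:
        train_cats = set(train[col].dropna().unique())
        test_cats = set(test[col].dropna().unique())
        rows[col] = {
            "n_train_categories": len(train_cats),
            "n_unseen_in_test": len(test_cats - train_cats),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy frames (illustrative values only).
train = pd.DataFrame({
    "Item_Identifier": pd.Series(["A1", "B2", "C3"], dtype=object),
    "Outlet_Type": pd.Series(["Grocery", "Supermarket", "Grocery"], dtype=object),
})
test = pd.DataFrame({
    "Item_Identifier": pd.Series(["A1", "D4"], dtype=object),
    "Outlet_Type": pd.Series(["Grocery", "Grocery"], dtype=object),
})

report = category_report(train, test)
print(report)
```

Applied to X_train and X_test, a column with very many categories or with unseen test values would be a candidate to drop or encode differently before fitting a linear model.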


Next, let's build a regression tree model to predict sales.

We will:

  1) Build a regression tree model.
  2) Evaluate and compare the performance of our model using R^2 and RMSE.

In [40]:
#instantiate decision tree
dec_tree = DecisionTreeRegressor(max_depth=5, random_state=42)

#fit the data
dec_tree.fit(X_train_processed, y_train)

#make our predictions
train_pred = dec_tree.predict(X_train_processed)
test_pred = dec_tree.predict(X_test_processed)

Now let's evaluate and compare our results.

In [41]:
print("Metrics for training data")
metrics(y_train, train_pred)
print("\nMetrics for testing data")
metrics(y_test, test_pred) 

Metrics for training data
 R^2 :  0.604
RMSE :  1082.281

Metrics for testing data
 R^2 :  0.596
RMSE :  1055.685


#Tuning Basic Tree Regression

In [42]:
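#Tuning basic regression tree
# Note: dec_tree was fit with max_depth=5 above, so get_depth() is capped at 5
# (it is not the depth of an unconstrained tree) and the search covers depths 2-5.
print("The default depth is : ", dec_tree.get_depth())

# List of values to try for max_depth:
depths = list(range(2, dec_tree.get_depth()+1))

# Let's make a DataFrame to store the score for each value of max_depth: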

scores = pd.DataFrame(index=depths, columns=['Test Score','Train Score'])
for depth in depths:
    dec_tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    
    dec_tree.fit(X_train_processed, y_train)
    
    train_score = dec_tree.score(X_train_processed, y_train)
    test_score = dec_tree.score(X_test_processed, y_test)
    
    scores.loc[depth, 'Train Score'] = train_score
    scores.loc[depth, 'Test Score'] = test_score


sorted_scores = scores.sort_values(by='Test Score', ascending=False)
sorted_scores.head()

The default depth is :  5


   Test Score  Train Score
5    0.596056     0.604207
4    0.583937     0.582705
3    0.524222     0.524218
2    0.433778     0.431641
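
The loop above ranks each max_depth by its score on the test set, so the winning test score is no longer an independent estimate of performance. An alternative, not used in this notebook, is to choose the depth by cross-validation on the training data only. The sketch below uses GridSearchCV on synthetic data from make_regression; the data and the grid of depths are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the processed features.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Candidate depths (illustrative grid); each is scored by 5-fold
# cross-validated R^2 computed on the training split only.
param_grid = {"max_depth": [2, 3, 4, 5, 6, 8]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
# The test split is touched once, after the depth has been chosen.
test_r2 = search.score(X_test, y_test)
print(best_depth, round(test_r2, 3))
```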


#Bagged trees

In [43]:
bagreg = BaggingRegressor(n_estimators=100, random_state=42)

bagreg.fit(X_train_processed, y_train)

train_pred = bagreg.predict(X_train_processed)
test_pred = bagreg.predict(X_test_processed)

print("Metrics for training data")
metrics(y_train, train_pred)
print("\nMetrics for testing data")
metrics(y_test, test_pred) 

Metrics for training data
 R^2 :  0.938
RMSE :  428.911

Metrics for testing data
 R^2 :  0.55
RMSE :  1113.893


#Tuning for Bagged Trees

In [None]:
#Tuning for bagged trees

# List of estimator values
estimators = [10, 20, 30, 40, 50, 100]

# Data frame to store the scores
scores = pd.DataFrame(index=estimators, columns=['Test Score', 'Train Score'])

# Iterate through the values to find the best number of estimators
for num_estimators in estimators:
    bagreg = BaggingRegressor(n_estimators=num_estimators, random_state=42)

    bagreg.fit(X_train_processed, y_train)

    train_score = bagreg.score(X_train_processed, y_train)
    test_score = bagreg.score(X_test_processed, y_test)

    scores.loc[num_estimators, 'Train Score'] = train_score
    scores.loc[num_estimators, 'Test Score'] = test_score

scores = scores.sort_values(by='Test Score', ascending=False)
scores.head()

#Random forest

In [45]:
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train_processed, y_train)

train_pred = rf.predict(X_train_processed)
test_pred = rf.predict(X_test_processed)

print("Metrics for training data")
metrics(y_train, train_pred)
print("\nMetrics for testing data")
metrics(y_test, test_pred) 

Metrics for training data
 R^2 :  0.938
RMSE :  428.748

Metrics for testing data
 R^2 :  0.55
RMSE :  1114.012


#Tuning Random Forest model

In [None]:
#tuning the depth of our trees for a forest model
est_depths = [estimator.get_depth() for estimator in rf.estimators_]
d = max(est_depths)
# Include the deepest fitted tree's depth (range excludes its upper bound)
depths = range(1, d + 1)
scores = pd.DataFrame(index=depths, columns=['Test Score', 'Train Score'])
for depth in depths:
    # Fix random_state so each run of the search is reproducible
    rf = RandomForestRegressor(max_depth=depth, random_state=42)

    rf.fit(X_train_processed, y_train)

    scores.loc[depth, 'Train Score'] = rf.score(X_train_processed, y_train)
    scores.loc[depth, 'Test Score'] = rf.score(X_test_processed, y_test)

sorted_scores = scores.sort_values(by='Test Score', ascending=False)
display(sorted_scores.head())

#tuning the number of estimators for our forest model
n_ests = [50, 100, 150, 200, 250]
scores2 = pd.DataFrame(index=n_ests, columns=['Test Score', 'Train Score'])
# Reuse the depth with the best test score from the depth search
best_depth = int(sorted_scores.index[0])
for n in n_ests:
    # Fix random_state so each run of the search is reproducible
    rf = RandomForestRegressor(max_depth=best_depth, n_estimators=n, random_state=42)

    rf.fit(X_train_processed, y_train)

    scores2.loc[n, 'Train Score'] = rf.score(X_train_processed, y_train)
    scores2.loc[n, 'Test Score'] = rf.score(X_test_processed, y_test)

sorted_scores2 = scores2.sort_values(by='Test Score', ascending=False)
display(sorted_scores2.head())

#3) After trying different models on our data set, let's determine which model we would choose to implement.

I would recommend using a basic regression tree for this data set.
Of the models evaluated above, the basic regression tree (max_depth=5) has the highest test R^2 score (0.596). This means the model explains about 60% of the variance in the target, Item_Outlet_Sales, on the test set using the given independent features. Its training R^2 (0.604) is close to its test R^2, whereas the bagged trees and the random forest score 0.938 on the training data but only 0.55 on the test data, which suggests they overfit. The tree's test RMSE is about 1,056, so its sales predictions are typically off by roughly 1,056 dollars.
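
A comparison like this is easier to read when every model's metrics sit in one table. Below is a minimal sketch of that idea on synthetic data from make_regression; the data, the reduced n_estimators and the column labels are assumptions for illustration, so the numbers it prints are not the results reported above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for X_train_processed / X_test_processed.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Linear regression": LinearRegression(),
    "Regression tree": DecisionTreeRegressor(max_depth=5, random_state=42),
    "Bagged trees": BaggingRegressor(n_estimators=20, random_state=42),
    "Random forest": RandomForestRegressor(n_estimators=20, random_state=42),
}

# Fit each model once and record its train/test metrics in one row.
rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    rows.append({
        "Model": name,
        "Train R^2": r2_score(y_train, model.predict(X_train)),
        "Test R^2": r2_score(y_test, test_pred),
        "Test RMSE": np.sqrt(mean_squared_error(y_test, test_pred)),
    })

# Best test R^2 first; a large train/test gap hints at overfitting.
comparison = (
    pd.DataFrame(rows).set_index("Model").sort_values("Test R^2", ascending=False)
)
print(comparison.round(3))
```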
