# Summer Camp Lab 1 - Intro to Machine Learning

In this lab we will practice:

* Loading Data
* Data Exploration
* Selecting the Prediction Target
* Choosing Features
* Splitting Data into Training and Test Sets
* Building a Decision Tree Model
* Model Validation
* Hyperparameter Tuning
* Building a Random Forest Model


# Setting Up the Workspace

In [None]:
!pip install pandas==2.0.3 scikit-learn==1.2.2 matplotlib==3.7.1

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

# Metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_absolute_percentage_error

# Case Study

BigMart's team of data scientists has gathered sales data for the year 2013, encompassing 1559 products distributed across 10 stores situated in various cities. The dataset includes specific attributes for each product and store.

The primary objective is to construct a predictive model capable of forecasting the sales of individual products within specific outlets.

The model should also reveal which product and outlet characteristics drive higher sales, giving BigMart insight into the factors that matter most for sales growth.

The data has the following features that could be useful in your model:

* Item_Identifier: A unique identifier for each product.

* Item_Weight: The weight of the product.

* Item_Fat_Content: Indicates the level of fat content in the product, often categorized as 'Low Fat,' 'Regular,' etc.

* Item_Visibility: The percentage of total display area of all products in a store allocated to a particular product.

* Item_Type: The category or type of the product (e.g., dairy, meat, fruits, etc.).

* Item_MRP (Maximum Retail Price): The maximum price at which the product can be sold.

* Outlet_Identifier: A unique identifier for each store/outlet.

* Outlet_Establishment_Year: The year in which the store was established.

* Outlet_Size: The size of the store, often categorized as 'Small,' 'Medium,' or 'Large.'

* Outlet_Location_Type: The type of location where the store is situated, such as 'Urban,' 'Suburban,' or 'Rural.'

* Outlet_Type: The type of outlet, such as 'Supermarket Type1,' 'Supermarket Type2,' 'Grocery Store,' etc.

* Item_Outlet_Sales: The target variable, representing the sales of the product in a particular store.


# Loading the Data

In [None]:
#import pandas as pd
df_sales = pd.read_csv('https://www.dropbox.com/s/yqaymhdf7bvvair/bigmart_sales_predictions.csv?dl=1')


# Data Exploration

In [None]:
# display first 5 rows of data
df_sales.___

In [None]:
# explore last 5 rows of data
df_sales.___

In [None]:
# display information about columns (non-null count and dtype)
df_sales.___

In [None]:
# explore descriptive statistics of the numerical data
df_sales.___

In [None]:
# explore values of object data, such as Outlet_Location_Type
df_sales.___

# Selecting the Prediction Target

In [None]:
# Our target variable is the sales of an item at an outlet.
y = ___
y

# Choosing Features

In [None]:
# We select a few columns that we think could be useful features in our model.
# Include Item_Visibility, Item_MRP, and Item_Weight; missing values are filled with 0.
X = df_sales[___].fillna(0)
X
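
The `.fillna(0)` call above replaces missing values so that the model receives no `NaN` entries. A minimal sketch of its effect on a made-up toy frame (the values below are illustrative assumptions, not rows from the BigMart data):

```python
import numpy as np
import pandas as pd

# Toy frame with two missing entries (values are made up for illustration)
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5],
    "Item_MRP": [249.8, 48.3, np.nan],
})

# Count missing values per column before filling
print(toy.isna().sum())

# Replace every missing value with 0
filled = toy.fillna(0)
print(filled)
```

Filling with 0 is a simple placeholder choice; other strategies, such as filling with the column mean, may suit a given column better.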

In [None]:
# Check descriptive statistics of the features
X.___

In [None]:
# Check dtypes of the features
X.___

# Split Data into Training and Test Sets

Set `random_state` to a fixed integer if you want to generate the same split on every run of your code.

In [None]:
# Split the features and the target into training and test sets
X_train, X_test, y_train, y_test = ___
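
To illustrate what a fixed `random_state` does, here is a minimal sketch on made-up toy arrays. The names `X_toy` and `y_toy`, the seed 42, and the 30% test size are assumptions chosen for illustration, not values prescribed by the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (made up for illustration)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two calls with the same seed produce the same shuffle and the same split
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

# Compare the four returned arrays pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)

# Shapes of the training and test feature arrays
print(split_a[0].shape, split_a[1].shape)
```

Without a fixed seed, each run shuffles the rows differently, so error metrics can change from run to run even when the code does not.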

# Building a Decision Tree Model

In [None]:
# Create a decision tree regressor
model = ____()

# Train the model
model.fit(___)

# Make predictions on the test set
predictions = model.predict(____)

# Evaluate the model with mean_squared_error
print(____)


# Model Validation

In [None]:
# Mean Absolute Error (MAE)
print(___(y_test, predictions))

In [None]:
# Mean Absolute Percentage Error (MAPE)

print(___(y_test, predictions))

In [None]:
#Mean Squared Error (MSE)
print(____(y_test, predictions))

In [None]:
# Root Mean Squared Error (RMSE)
print(___(y_test, predictions, ___))
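
To make the four metrics above concrete, the sketch below computes each one by hand with NumPy on three made-up values. The arrays `y_true` and `y_pred` are illustrative assumptions; in the lab, the scikit-learn metric functions imported earlier do this work.

```python
import numpy as np

# Three made-up true values and predictions (for illustration only)
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

errors = y_true - y_pred  # [-10, 20, 0]

# MAE: average absolute error, in the same units as the target
mae = np.mean(np.abs(errors))  # (10 + 20 + 0) / 3 = 10.0

# MAPE: average absolute error relative to the true value, as a fraction
mape = np.mean(np.abs(errors) / np.abs(y_true))  # (0.10 + 0.10 + 0) / 3

# MSE: average squared error, which penalizes large errors more heavily
mse = np.mean(errors ** 2)  # (100 + 400 + 0) / 3

# RMSE: square root of MSE, back in the units of the target
rmse = np.sqrt(mse)

print(mae, mape, mse, rmse)
```

scikit-learn's `mean_absolute_percentage_error` also returns a fraction rather than a value multiplied by 100.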

# Hyperparameter Tuning

In [None]:
def decisiontree_depth_effect(train_X, train_y, test_X, test_y, max_depth_range):

    train_errors = []
    test_errors = []

    for depth in max_depth_range:
        # Create a decision tree regressor with the specified maximum depth
        model = DecisionTreeRegressor(max_depth=depth)

        # Train the model
        model.fit(train_X, train_y)

        # Make predictions on training and test data
        train_preds = model.predict(train_X)
        test_preds = model.predict(test_X)

        # Calculate mean squared error for training and test data
        train_error = mean_squared_error(train_y, train_preds)
        test_error = mean_squared_error(test_y, test_preds)

        # Append errors to the lists
        train_errors.append(train_error)
        test_errors.append(test_error)

    # Plotting the results
    plt.figure(figsize=(10, 6))
    plt.plot(max_depth_range, train_errors, label='Training Error', marker='o')
    plt.plot(max_depth_range, test_errors, label='Test Error', marker='o')
    plt.xlabel('Tree Depth')
    plt.ylabel('Mean Squared Error')
    plt.title('Effect of Tree Depth of DecisionTree on Error Metric')
    plt.legend()
    plt.show()


In [None]:
# Display loss curves for a range of the hyperparameter
decisiontree_depth_effect(X_train, y_train, X_test, y_test, range(___, ___))

# Building a Random Forest Model

In [None]:
# Create a random forest regressor
model = ___()

# Train the model
model.___(___, ___)

# Make predictions on the test set
predictions = model.___(___)

# Evaluate the model using MSE
print(___(___, ___))

In [None]:
def randomforest_depth_effect(train_X, train_y, test_X, test_y, max_depth_range):

    train_errors = []
    test_errors = []

    for depth in max_depth_range:
        # Create a random forest regressor with the specified maximum depth
        model = RandomForestRegressor(max_depth=depth)

        # Train the model
        model.fit(train_X, train_y)

        # Make predictions on training and test data
        train_preds = model.predict(train_X)
        test_preds = model.predict(test_X)

        # Calculate mean squared error for training and test data
        train_error = mean_squared_error(train_y, train_preds)
        test_error = mean_squared_error(test_y, test_preds)

        # Append errors to the lists
        train_errors.append(train_error)
        test_errors.append(test_error)

    # Plotting the results
    plt.figure(figsize=(10, 6))
    plt.plot(max_depth_range, train_errors, label='Training Error', marker='o')
    plt.plot(max_depth_range, test_errors, label='Test Error', marker='o')
    plt.xlabel('Tree Depth')
    plt.ylabel('Mean Squared Error')
    plt.title('Effect of Tree Depth of RandomForest on Error Metric')
    plt.legend()
    plt.show()


In [None]:
# Display loss curves for a range of the hyperparameter
randomforest_depth_effect(X_train, y_train, X_test, y_test, range(___, ___))


# Assignment

In class you learned about the hyperparameter 'max_leaf_nodes'. In lab you learned about the hyperparameter 'max_depth'. Please repeat the lab notebook two (2) times, experimenting with tuning the following hyperparameters:

* 'min_samples_split' (int, default=2): The minimum number of samples required to split an internal node.

* 'min_samples_leaf' (int, default=1): The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
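
As a rough illustration of the smoothing effect described above, the sketch below fits trees with different `min_samples_leaf` values on synthetic data and counts the resulting leaves. The noisy sine data, the seed, and the values 1, 5, and 20 are assumptions chosen for illustration, not recommended settings for the assignment.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a noisy sine curve with 200 samples (made up for illustration)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.3, size=200)

leaf_counts = {}
for min_leaf in [1, 5, 20]:
    # A larger min_samples_leaf forces each leaf to cover more training samples
    tree = DecisionTreeRegressor(min_samples_leaf=min_leaf, random_state=0)
    tree.fit(X_toy, y_toy)
    leaf_counts[min_leaf] = tree.get_n_leaves()

print(leaf_counts)
```

Fewer leaves mean a coarser, smoother prediction function. The same loop pattern can be adapted to `min_samples_split`.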

Does tuning the new hyperparameters yield "better" results than in lab? Explain your intuition for why or why not.

Please submit your two (2) notebooks together with a PDF file containing a brief interpretation of your findings.

*The goal of the exercise is not to develop an optimal model, but rather to practice your skills and your reasoning.*


BONUS: Experiment with adding a **categorical variable** as a series of binary dummy variables using techniques that you learned in Logistic Regression!
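
One possible way to build such dummy variables is `pd.get_dummies`. The sketch below applies it to a made-up toy frame; the rows are illustrative assumptions, and only the column names and category labels come from the dataset description above.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (rows are made up)
toy = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6],
    "Outlet_Type": ["Supermarket Type1", "Grocery Store", "Supermarket Type1"],
})

# Replace the categorical column with one binary column per category
encoded = pd.get_dummies(toy, columns=["Outlet_Type"])

print(encoded.columns.tolist())
print(encoded)
```

The encoded columns can then be included in `X` alongside the numeric features.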