## SuperKart Project
***Marks: 60***

## Problem Statement

### Context:

A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use the sales forecast to estimate weekly, monthly, quarterly, and annual sales totals. An accurate sales forecast is extremely important for a company because it adds value across the organization and helps the different verticals plan their future course of action. Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials.
An accurate sales forecast process has many benefits which include improved decision-making about the future and reduction of sales pipeline and forecast risks. Moreover, it helps to reduce the time spent in planning territory coverage and establish benchmarks that can be used to assess trends in the future.

### Objective:

SuperKart is an organization that owns a chain of supermarkets and food marts offering a wide range of products. It wants to predict the future sales revenue of its outlets so that it can strategize its sales operations across cities of different tiers and plan its inventory accordingly. To achieve this, SuperKart has hired a data science firm, shared the sales records of its various outlets for the previous quarter, and asked the firm to come up with a suitable model to predict the total sales of the stores for the upcoming quarter.


### Data Description:

The data contains the different attributes of the various products and stores. The detailed data dictionary is given below.

* Product_Id - unique identifier of each product, each identifier having two letters at the beginning followed by a number.
* Product_Weight - weight of each product
* Product_Sugar_Content - sugar content of each product like low sugar, regular and no sugar
* Product_Allocated_Area - ratio of the allocated display area of each product to the total display area of all the products in a store
* Product_Type - broad category for each product like meat, snack foods, hard drinks, dairy, canned, soft drinks, health and hygiene, baking goods, bread, breakfast, frozen foods, fruits and vegetables, household, seafood, starchy foods, others
* Product_MRP - maximum retail price of each product
* Store_Id - unique identifier of each store
* Store_Establishment_Year - year in which the store was established
* Store_Size - size of the store depending on sq. feet like high, medium and low
* Store_Location_City_Type - type of city in which the store is located like Tier 1, Tier 2 and Tier 3. Tier 1 consists of cities where the standard of living is comparatively higher than its Tier 2 and Tier 3 counterparts.
* Store_Type - type of store depending on the products that are being sold there like Departmental Store, Supermarket Type 1, Supermarket Type 2 and Food Mart
* Product_Store_Sales_Total - total revenue generated by the sale of that particular product in that particular store


### **Please read the instructions carefully before starting the project.** 
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned. 
* Blanks '_______' are provided in the notebook that need to be filled with appropriate code to get the correct result. Every '_______' blank comes with a comment that briefly describes what needs to be filled in.
* Identify the task to be performed correctly, and only then proceed to write the required code.
* Fill in the code wherever asked by commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw an error.
* Please run the code cells sequentially from the beginning to avoid unnecessary errors.
* Add the results/observations (wherever mentioned) derived from the analysis in the presentation and submit the same.

## Importing necessary libraries

In [None]:
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black

import warnings

warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Library to split data
from sklearn.model_selection import train_test_split

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)


# Libraries for different ensemble regressors
from sklearn.ensemble import (
    BaggingRegressor,
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
    StackingRegressor,
)

from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor

# Libraries to get different metric scores (regression metrics used later in the notebook)
from sklearn import metrics
from sklearn.metrics import (
    r2_score,
    mean_squared_error,
    mean_absolute_error,
)

# To tune different models
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

## Import Dataset

In [None]:
kart = pd."_______" # Fill the blanks to read data 

In [None]:
# copying data to another variable to avoid any changes to original data
data = kart.copy()

## Overview of the Dataset

### View the first and last 5 rows of the dataset

In [None]:
data.('_______') ##  Complete the code to view top 5 rows of the data

In [None]:
data.('_______') ##  Complete the code to view last 5 rows of the data  

### Understand the shape of the dataset

In [None]:
data.('_______') ##  Complete the code to view dimensions of the data

### Check the data types of the columns for the dataset

In [None]:
data."_______" # Fill the blank to display data information

In [None]:
# checking for missing values in the data
data."_______".sum()  # Fill the blank to get number of missing values in data

In [None]:
# checking for duplicate values
data."_______".sum() # Fill the blank to check duplicates 

## Exploratory Data Analysis

#### Let's check the statistical summary of the data

In [None]:
data.('_______') ##  Complete the code to print the statistical summary of the data

#### Let's check the count of each unique category in each of the categorical variables

In [None]:
# Making a list of all categorical variables
cat_col = list(data.select_dtypes("object").columns)

# Printing the count of each unique value in each column
for column in cat_col:
    print(data[column].value_counts())
    print("-" * 50)

In [None]:
# Replacing reg with Regular
data["Product_Sugar_Content"] = data["Product_Sugar_Content"].replace("reg", "Regular")

In [None]:
data.Product_Sugar_Content."_________" # Fill the blank to count values for each class in Product_Sugar_Content

In [None]:
## extracting the first two characters from the Product_Id column and storing it in another column
data["Product_Id_char"] = data["Product_Id"].str[:2]
data.head()
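As a small illustration of the two cleaning steps above, the sketch below applies them to a made-up four-row frame. The name `toy` and all of its values are hypothetical and are not taken from the SuperKart data.

```python
import pandas as pd

# Toy rows; identifiers and labels are invented for illustration only
toy = pd.DataFrame(
    {
        "Product_Id": ["FD101", "DR202", "NC303", "FD404"],
        "Product_Sugar_Content": ["Low Sugar", "reg", "No Sugar", "Regular"],
    }
)

# Map the inconsistent label onto the canonical one (plain assignment, no inplace)
toy["Product_Sugar_Content"] = toy["Product_Sugar_Content"].replace("reg", "Regular")

# Keep the first two characters of the identifier as a coarse product group
toy["Product_Id_char"] = toy["Product_Id"].str[:2]

prefix_counts = toy["Product_Id_char"].value_counts()
print(prefix_counts)
```

After the replacement only three sugar-content labels remain in the toy frame, which is the same consistency check the notebook performs on the real column.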

In [None]:
data["Product_Id_char"]."________" # Fill the blank to get all unique elements in Product_Id_char

In [None]:
data.loc[data.Product_Id_char == "FD", "Product_Type"]."__________"# Fill the blank to get all unique elements

In [None]:
data.loc[data.Product_Id_char == "DR", "Product_Type"]."__________" # Fill the blank to get all unique elements

In [None]:
data.loc[data.Product_Id_char == "NC", "Product_Type"]."__________"# Fill the blank to get all unique elements

#### The Product_Id column will not add any value to our analysis so let's drop it before we move forward

In [None]:
## dropping the column
data = data.drop("________") # Fill the blank to drop the Product_Id column

In [None]:
data.head()

### Univariate Analysis

In [None]:
def histogram_boxplot(data, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

#### Product_Weight

In [None]:
histogram_boxplot(data, "Product_Weight")

#### Product_Allocated_Area

In [None]:
histogram_boxplot("_______") # Fill the blank to display plots for Product_Allocated_Area

#### Product_MRP

In [None]:
histogram_boxplot("_______") # Fill the blank to display plots for Product_MRP

#### Product_Store_Sales_Total

In [None]:
histogram_boxplot("_______") # Fill the blank to display plots for Product_Store_Sales_Total

In [None]:
# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
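The percentage labels in `labeled_barplot` are simply each level's count divided by the column length. A minimal sketch of that arithmetic, without any plotting and on an invented series called `levels`, might look like this:

```python
import pandas as pd

# Invented category values, for illustration only
levels = pd.Series(["Regular", "Regular", "Low Sugar", "No Sugar"])

total = len(levels)             # same role as `total` in labeled_barplot
counts = levels.value_counts()  # one count per category level

# Same format string as the one used for the bar annotations
percent_labels = {
    level: "{:.1f}%".format(100 * n / total) for level, n in counts.items()
}
print(percent_labels)
```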

#### Product_Sugar_Content

In [None]:
labeled_barplot(data, "Product_Sugar_Content", perc=True)

#### Product_Type

In [None]:
labeled_barplot("_________") # Fill the blank to plot barplot for Product_Type

#### Store_Id

In [None]:
labeled_barplot("_________") # Fill the blank to plot barplot for Store_Id

#### Store_Size

In [None]:
labeled_barplot("_________") # Fill the blank to plot barplot for Store_Size

#### Store_Location_City_Type

In [None]:
labeled_barplot("_________") # Fill the blank to plot barplot for Store_Location_City_Type

#### Store_Type

In [None]:
labeled_barplot("_________") # Fill the blank to plot barplot for Store_Type

### Bivariate Analysis

In [None]:
cols_list = data.select_dtypes(include=np.number).columns.tolist()

plt.figure(figsize=(10, 5))
sns.heatmap(
    data[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
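To help read the heatmap, recall that `DataFrame.corr()` returns Pearson correlations between -1 and 1 by default. The sketch below uses invented columns (`price`, `sales`, `discount`) built to be exactly linear in one another, so the extreme values are easy to recognise.

```python
import numpy as np
import pandas as pd

# Invented numeric columns, for illustration only
toy = pd.DataFrame({"price": [10.0, 20.0, 30.0, 40.0]})
toy["sales"] = 3 * toy["price"] + 5       # rises exactly linearly with price
toy["discount"] = 100 - 2 * toy["price"]  # falls exactly linearly with price

corr = toy.corr()  # pairwise Pearson correlation matrix
print(corr.round(2))
```

Real columns rarely reach these extremes; values near 0 suggest little linear relationship, although a non-linear one may still exist.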

#### Let's check how our target variable, Product_Store_Sales_Total, varies with the numeric columns

In [None]:
plt.figure(figsize=[8, 6])
sns.scatterplot(x=data.Product_Weight, y=data.Product_Store_Sales_Total)
plt.show()

In [None]:
plt.figure(figsize=[8, 6])
sns.scatterplot("________") # Fill the blank to plot scatter plot of Product_Allocated_Area against Product_Store_Sales_Total 
plt.show()

In [None]:
plt.figure(figsize=[8, 6])
sns.scatterplot("_______") # Fill the blank to plot scatter plot of Product_MRP against Product_Store_Sales_Total 
plt.show()

#### Let us see from which product type the company is generating most of the revenue

In [None]:
df_revenue1 = data.groupby(["Product_Type"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
plt.figure(figsize=[14, 8])
plt.xticks(rotation=90)
a = sns.barplot(x=df_revenue1.Product_Type, y=df_revenue1.Product_Store_Sales_Total)
a.set_xlabel("Product Types")
a.set_ylabel("Revenue")
plt.show()
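The aggregation behind this bar chart is a group-by followed by a sum. A minimal sketch on invented rows (the frame `toy` and its numbers are hypothetical) shows how the totals can also be sorted to read off the top category directly:

```python
import pandas as pd

# Invented sales rows, for illustration only
toy = pd.DataFrame(
    {
        "Product_Type": ["Dairy", "Snack Foods", "Dairy", "Meat"],
        "Product_Store_Sales_Total": [100.0, 250.0, 50.0, 120.0],
    }
)

# Total revenue per product type, highest first
revenue = (
    toy.groupby("Product_Type", as_index=False)["Product_Store_Sales_Total"]
    .sum()
    .sort_values("Product_Store_Sales_Total", ascending=False)
)
top_type = revenue.iloc[0]["Product_Type"]
print(revenue)
```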

In [None]:
df_revenue2 = data.groupby(["Product_Sugar_Content"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
plt.figure(figsize=[8, 6])
plt.xticks(rotation=90)
b = sns.barplot("___________") # Fill the blank to plot bar plot with Product_Sugar_content as x and Product_Store_Sales_Total as y  
b.set_xlabel("Product_Sugar_content")
b.set_ylabel("Revenue")
plt.show()

#### Let us see from which type of stores and locations the revenue generation is more

In [None]:
df_store_revenue = data.groupby(["Store_Id"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
plt.figure(figsize=[8, 6])
plt.xticks(rotation=90)
r = sns.barplot("________") # Fill the blank to plot bar plot with Store_Id as x and Product_Store_Sales_Total as y  
r.set_xlabel("Stores")
r.set_ylabel("Revenue")
plt.show()

In [None]:
df_revenue3 = data.groupby(["Store_Size"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
plt.figure(figsize=[8, 6])
plt.xticks(rotation=90)
c = sns.barplot("_______") # Fill the blank to plot bar plot with Store_Size as x and Product_Store_Sales_Total as y
c.set_xlabel("Store_Size")
c.set_ylabel("Revenue")
plt.show()

In [None]:
df_revenue4 = data.groupby(["Store_Location_City_Type"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
plt.figure(figsize=[8, 6])
plt.xticks(rotation=90)
d = sns.barplot("______") # Fill the blank to plot bar plot with Store_Location_City_Type as x and Product_Store_Sales_Total as y
d.set_xlabel("Store_Location_City_Type")
d.set_ylabel("Revenue")
plt.show()

In [None]:
df_revenue5 = data.groupby(["Store_Type"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
plt.figure(figsize=[8, 6])
plt.xticks(rotation=90)
e = sns.barplot("_________") # Fill the blank to plot bar plot with Store_Type as x and Product_Store_Sales_Total as y
e.set_xlabel("Store_Type")
e.set_ylabel("Revenue")
plt.show()

#### Let's check how our target variable, Product_Store_Sales_Total, varies with the other categorical columns

In [None]:
plt.figure(figsize=[14, 8])
sns.boxplot(x=data.Store_Id, y=data.Product_Store_Sales_Total)
plt.xticks(rotation=90)
plt.title("Boxplot - Store_Id Vs Product_Store_Sales_Total")
plt.xlabel("Stores")
plt.ylabel("Product_Store_Sales_Total (of each product)")
plt.show()

In [None]:
plt.figure(figsize=[14, 8])
sns.boxplot("________") # Fill the blank to plot boxplot of Store size against Product_Store_Sales_Total 
plt.xticks(rotation=90)
plt.title("Boxplot - Store_Size Vs Product_Store_Sales_Total")
plt.xlabel("Store_Size")
plt.ylabel("Product_Store_Sales_Total (of each product)")
plt.show()

#### Let's now try to find out some relationship between the other columns

#### Generally certain product types will have higher product weight than others. Let's have a look

In [None]:
plt.figure(figsize=[14, 8])
sns.boxplot("_________") # Fill the blank to plot boxplot of Product_Type against Product_Weight 
plt.xticks(rotation=90)
plt.title("Boxplot - Product_Type Vs Product_Weight")
plt.xlabel("Types of Products")
plt.ylabel("Product_Weight")
plt.show()

#### Let's find out whether there is some relationship between the weight of the product and its sugar content

In [None]:
plt.figure(figsize=[14, 8])
sns.boxplot("________") # Fill the blank to plot a box plot of Product_Sugar_Content against Product_Weight 
plt.xticks(rotation=90)
plt.title("Boxplot - Product_Sugar_Content Vs Product_Weight")
plt.xlabel("Product_Sugar_Content")
plt.ylabel("Product_Weight")
plt.show()

#### Let's analyze the sugar content of different product types

In [None]:
plt.figure(figsize=(14, 8))
sns.heatmap(
    pd.crosstab(data["Product_Sugar_Content"], data["Product_Type"]),
    annot=True,
    fmt="g",
    cmap="viridis",
)
plt.ylabel("Product_Sugar_Content")
plt.xlabel("Product_Type")
plt.show()
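The heatmap above is drawn from a contingency table produced by `pd.crosstab`. As a sketch on invented rows (the frame `toy` is hypothetical), the table simply counts how often each pair of levels occurs together:

```python
import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame(
    {
        "Product_Sugar_Content": ["Regular", "Low Sugar", "Regular", "No Sugar"],
        "Product_Type": ["Dairy", "Dairy", "Meat", "Household"],
    }
)

# Rows: sugar content levels; columns: product types; cells: co-occurrence counts
table = pd.crosstab(toy["Product_Sugar_Content"], toy["Product_Type"])
print(table)
```

Pairs that never occur together appear as 0, which is why some cells of the heatmap can be empty of sales records.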

#### Let's find out how many items of each product type have been sold in each of the stores

In [None]:
plt.figure(figsize=(14, 8))
sns.heatmap("________")  # Fill the blank to plot a heatmap with Product_Type as x and Store_Id as y
plt.ylabel("Stores")
plt.xlabel("Product_Type")
plt.show()

#### Different product types have different prices. Let's analyze the trend

In [None]:
plt.figure(figsize=[14, 8])
sns.boxplot("__________") # Fill the blank to plot a box plot of Product_Type against Product_MRP
plt.xticks(rotation=90)
plt.title("Boxplot - Product_Type Vs Product_MRP")
plt.xlabel("Product_Type")
plt.ylabel("Product_MRP (of each product)")
plt.show()

#### Let's find out how the Product_MRP varies with the different stores

In [None]:
plt.figure(figsize=[14, 8])
sns.boxplot("_________") # Fill the blank to plot a box plot of Store_Id against Product_MRP
plt.xticks(rotation=90)
plt.title("Boxplot - Store_Id Vs Product_MRP")
plt.xlabel("Stores")
plt.ylabel("Product_MRP (of each product)")
plt.show()

#### Let's delve deeper and do a detailed analysis of each of the stores

#### OUT001

In [None]:
data.loc[data["Store_Id"] == "OUT001"].describe(include="all").T

In [None]:
data.loc[data["Store_Id"] == "OUT001", "Product_Store_Sales_Total"].sum()

In [None]:
df_OUT001 = (
    data.loc[data["Store_Id"] == "OUT001"]
    .groupby(["Product_Type"], as_index=False)["Product_Store_Sales_Total"]
    .sum()
)
plt.figure(figsize=[14, 8])
plt.xticks(rotation=90)
plt.xlabel("Product_Type")
plt.ylabel("Product_Store_Sales_Total")
plt.title("OUT001")
sns.barplot(x=df_OUT001.Product_Type, y=df_OUT001.Product_Store_Sales_Total)
plt.show()

#### OUT002

In [None]:
data.loc[data["Store_Id"] == "OUT002"]."__________" # Fill the blank to describe

In [None]:
data.loc[data["Store_Id"] == "OUT002", "Product_Store_Sales_Total"]."________" # Fill the blank to perform summation

In [None]:
df_OUT002 = ("_________") # Fill the blank to form the required dataframe just like we have done in OUT001
plt.figure(figsize=[14, 8])
plt.xticks(rotation=90)
plt.xlabel("Product_Type")
plt.ylabel("Product_Store_Sales_Total")
plt.title("OUT002")
sns.barplot("________") # Fill the blank to plot barplot for Product_Type against Product_Store_Sales_Total
plt.show()

#### OUT003

In [None]:
data.loc[data["Store_Id"] == "OUT003"]."__________" # Fill the blank to describe

In [None]:
data.loc[data["Store_Id"] == "OUT003", "Product_Store_Sales_Total"]."________" # Fill the blank to perform summation

In [None]:
df_OUT003 = ("_________") # Fill the blank to form the required dataframe
plt.figure(figsize=[14, 8])
plt.xticks(rotation=90)
plt.xlabel("Product_Type")
plt.ylabel("Product_Store_Sales_Total")
plt.title("OUT003")
sns.barplot("________") # Fill the blank to plot barplot for Product_Type against Product_Store_Sales_Total
plt.show()

#### OUT004

In [None]:
data.loc[data["Store_Id"] == "OUT004"]."__________" # Fill the blank to describe

In [None]:
data.loc[data["Store_Id"] == "OUT004", "Product_Store_Sales_Total"]."________" # Fill the blank to perform summation

In [None]:
df_OUT004 = ("_________") # Fill the blank to form the required dataframe
plt.figure(figsize=[14, 8])
plt.xticks(rotation=90)
plt.xlabel("Product_Type")
plt.ylabel("Product_Store_Sales_Total")
plt.title("OUT004")
sns.barplot("________") # Fill the blank to plot barplot for Product_Type against Product_Store_Sales_Total
plt.show()

#### Let's find out the revenue generated by the stores from each of the product types

In [None]:
df1 = data.groupby(["Product_Type", "Store_Id"], as_index=False)[
    "Product_Store_Sales_Total"
].sum()
df1

#### Let's find out the revenue generated by the stores from products having different levels of sugar content

In [None]:
df2 = data.groupby("_________")["________"].sum() # Fill in the blanks to find the revenue generated by each store for the different sugar content level items 
df2

## Data Preprocessing

### Feature Engineering

#### A store that has been in business for a long time may be more trustworthy than a newly established one. On the other hand, older stores may sometimes lack infrastructure if proper attention is not given. So let us calculate the current age of each store and incorporate it in our model. (The sales records were collected in 2021, so we will use 2021 as the base year to calculate the store age.)

In [None]:
# Outlet Age
data["Store_Age_Years"] = 2021 - data."___________" ## Fill in the blank and use Store_Establishment_Year to extract the present store age

#### We have 16 different product types in our dataset. So let us make two broad categories, perishables and non perishables, in order to reduce the number of product types

In [None]:
perishables = [
    "Dairy",
    "Meat",
    "Fruits and Vegetables",
    "Breakfast",
    "Breads",
    "Seafood",
]

In [None]:
def change(x):
    if x in perishables:
        return "Perishables"
    else:
        return "Non Perishables"


data.Product_Type.apply(change)

In [None]:
change1 = []
for i in range(0, len(data)):
    if data.Product_Type[i] in perishables:
        change1.append("Perishables")
    else:
        change1.append("Non Perishables")
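The loop above looks rows up by the label `i`, so it relies on the frame having the default 0..n-1 index. A possible vectorised alternative, sketched here on an invented frame `toy` with a deliberately non-default index, produces the same labels without that assumption:

```python
import numpy as np
import pandas as pd

perishables = [
    "Dairy",
    "Meat",
    "Fruits and Vegetables",
    "Breakfast",
    "Breads",
    "Seafood",
]

# Invented rows with a non-default index, for illustration only
toy = pd.DataFrame(
    {"Product_Type": ["Dairy", "Canned", "Seafood", "Household"]},
    index=[10, 20, 30, 40],
)

# isin() builds a boolean mask; np.where picks a label for each row
toy["Product_Type_Category"] = np.where(
    toy["Product_Type"].isin(perishables), "Perishables", "Non Perishables"
)
print(toy)
```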

In [None]:
data["Product_Type_Category"] = pd.Series("_______") ## Fill in the blank and use change1 to create a new column

In [None]:
data.head()

### Outlier Check

- Let's check for outliers in the data.

In [None]:
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
numeric_columns.remove("Store_Establishment_Year")
numeric_columns.remove("Store_Age_Years")


plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
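With `whis=1.5`, the boxplot draws as outliers any points lying more than 1.5 times the interquartile range beyond the quartiles. The same rule can be applied numerically; the sketch below uses an invented array `values`, not the project data.

```python
import numpy as np

# Invented values with one obvious extreme point
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Points outside these bounds are the ones a boxplot draws as fliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```

Whether such points should be treated or kept is a modelling decision; tree-based models, as used later in this notebook, are generally less sensitive to them than linear models.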

### Data Preparation for modeling

- We want to forecast the Product_Store_Sales_Total. 
- Before we proceed to build a model, we'll have to encode categorical features and drop the unnecessary columns
- We'll split the data into train and test to be able to evaluate the model that we build on the train data.

In [None]:
data.head()

In [None]:
data = data."_______"(["Product_Type", "Store_Id", "Store_Establishment_Year"], axis=1) # Fill in the blank to drop the listed columns

In [None]:
data.shape

In [None]:
data = pd.get_dummies(
    data,
    columns=data.select_dtypes(include=["object", "category"]).columns.tolist(),
    drop_first=True,
)
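To see what `drop_first=True` does, the sketch below encodes an invented frame `toy` with a single three-level column. One level (the first in sorted order) is dropped and acts as the baseline, so three levels become two indicator columns.

```python
import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame(
    {
        "Product_MRP": [120.0, 95.5, 210.0],
        "Store_Size": ["High", "Medium", "Small"],
    }
)

encoded = pd.get_dummies(
    toy,
    columns=toy.select_dtypes(include=["object", "category"]).columns.tolist(),
    drop_first=True,  # "High" becomes the implicit baseline level
)
print(encoded)
```

A row with both indicators equal to 0 (or False) therefore represents the dropped level.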

In [None]:
data.head()

In [None]:
# Separating features and the target column
X = data."________"("Product_Store_Sales_Total", axis=1) # Fill in the blank to drop the specified column
y = data["Product_Store_Sales_Total"]

In [None]:
# Splitting the data into train and test sets in 70:30 ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, shuffle=True
)

In [None]:
X_train.shape, X_test.shape
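As a quick check of how the 70:30 split behaves, the sketch below splits ten invented rows (`X_toy` and `y_toy` are hypothetical) and repeats the call with the same `random_state` to confirm that the partition is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten invented rows with two features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, shuffle=True
)

# Repeating the call with the same random_state yields the same partition
_, _, _, y_te_again = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, shuffle=True
)
print(X_tr.shape, X_te.shape)
```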

## Model Building - Strategy and Evaluation

- We'll fit different models on the train data and observe their performance. 
- We'll try to improve that performance by tuning some hyperparameters available for that algorithm.
- We'll use `GridSearchCV` for hyperparameter tuning and the R-squared score to optimize the model.
- R-squared, the `coefficient of determination`, is used to evaluate the performance of a regression model. It is the proportion of the variation in the target variable that is predictable from the input features.
- Let's start by creating a function to get model scores, so that we don't have to repeat the same code.

In [None]:
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100


# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mape_score(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf
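The metrics computed by these helpers can be verified by hand on a tiny example. The sketch below uses four invented target values and predictions; the choice of `k = 2` predictors for the adjusted R-squared step is an assumption made only for this illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Invented targets and predictions, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])
n, k = 4, 2  # n observations; k = 2 predictors is assumed for this example

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares: 400
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 50000
r2 = 1 - ss_res / ss_tot                        # 0.992
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # 0.976
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # 10.0
mape = np.mean(np.abs(y_true - y_pred) / y_true) * 100  # about 5.21

print(r2, adj_r2, rmse, mape)
print(r2_score(y_true, y_pred))  # agrees with the manual R-squared
```

Adjusted R-squared is lower than R-squared here because it penalises the number of predictors relative to the number of observations. Note also that MAPE divides by the target, so it is undefined when a target value is zero.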

In [None]:
##  Function to calculate r2_score and RMSE on train and test data
def get_model_score(model, flag=True):
    """
    model : regressor to predict values of X

    """
    # defining an empty list to store train and test results
    score_list = []

    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)

    train_r2 = metrics.r2_score(y_train, pred_train)
    test_r2 = metrics.r2_score(y_test, pred_test)
    train_rmse = np.sqrt(metrics.mean_squared_error(y_train, pred_train))
    test_rmse = np.sqrt(metrics.mean_squared_error(y_test, pred_test))

    # Adding all scores in the list
    score_list.extend((train_r2, test_r2, train_rmse, test_rmse))

    # The following print statements are displayed only if flag is True (the default)
    if flag == True:
        print("R-square on training set : ", metrics.r2_score(y_train, pred_train))
        print("R-square on test set : ", metrics.r2_score(y_test, pred_test))
        print(
            "RMSE on training set : ",
            np.sqrt(metrics.mean_squared_error(y_train, pred_train)),
        )
        print(
            "RMSE on test set : ",
            np.sqrt(metrics.mean_squared_error(y_test, pred_test)),
        )

    # returning the list with train and test scores
    return score_list

## Decision Tree - Model Building and Hyperparameter Tuning

### Decision Tree Model

In [None]:
dtree = DecisionTreeRegressor(random_state=1)
dtree.fit(X_train, y_train)

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

#### Checking model performance on training set

In [None]:
dtree_model_train_perf = model_performance_regression(dtree, X_train, y_train)
dtree_model_train_perf

#### Checking model performance on test set

In [None]:
dtree_model_test_perf = model_performance_regression("___________________") # Fill in the blank to get model performance on test set
dtree_model_test_perf

### Hyperparameter Tuning - Decision Tree Model

In [None]:
# Choose the type of regressor.
dtree_tuned = DecisionTreeRegressor(random_state=1)

# Grid of parameters to choose from
parameters = {
    "max_depth": list(np.arange(1, 9)) + [None],
    "min_samples_leaf": [1, 3, 5, 7, 10],
    "max_leaf_nodes": [2, 3, 5, 10, 15] + [None],
    "min_impurity_decrease": [0.001, 0.01, 0.1, 0.0],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search
grid_obj = GridSearchCV(dtree_tuned, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the regressor to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
dtree_tuned.fit(X_train, y_train)
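The grid-search pattern used here can be tried end to end on synthetic data. The sketch below is a minimal, assumed setup: `make_regression` generates invented data, and the grid `small_grid` is kept deliberately small so that it runs quickly. It passes the string `"r2"` as the scorer, which selects the same R-squared metric.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, for illustration only
X_toy, y_toy = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=1
)

# A deliberately small grid: 3 x 2 = 6 candidate settings
small_grid = {"max_depth": [2, 4, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=1), small_grid, scoring="r2", cv=5
)
search.fit(X_toy, y_toy)

# With the default refit=True, best_estimator_ is already refitted on all of X_toy
best_tree = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```

Because `refit=True` is the default, fitting the best estimator again on the same training data is harmless but should not change the model.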

#### Checking model performance on training set

In [None]:
dtree_tuned_model_train_perf = model_performance_regression(
    dtree_tuned, X_train, y_train
)
dtree_tuned_model_train_perf

#### Checking model performance on test set

In [None]:
dtree_tuned_model_test_perf = model_performance_regression("________________") # Fill in the blank to check the tuned model performance on test data
dtree_tuned_model_test_perf

## Bagging - Model Building and Hyperparameter Tuning

### Bagging Regressor

In [None]:
bagging_regressor = "__________"(random_state=1) # Fill in the blank to define bagging regressor
bagging_regressor.fit("___________") # Fill in the blank to fit the model on the train data

#### Checking model performance on training set

In [None]:
bagging_regressor_model_train_perf = model_performance_regression("____________") # Fill in the blank to check model performance on train data
bagging_regressor_model_train_perf

#### Checking model performance on test set

In [None]:
bagging_regressor_model_test_perf = model_performance_regression("___________") # Fill in the blank to check the model performance on test data
bagging_regressor_model_test_perf

### Hyperparameter Tuning - Bagging Regressor

In [None]:
# Choose the type of regressor.
bagging_estimator_tuned = BaggingRegressor(random_state=1)

# Grid of parameters to choose from
parameters = {
    "max_samples": [0.7, 0.8, 0.9],
    "max_features": [0.7, 0.8, 0.9],
    "n_estimators": np.arange(90, 120, 10),
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search
grid_obj = GridSearchCV("_____________") # complete the code to run grid search with cv = 5
grid_obj = grid_obj.fit(X_train, y_train)

# Set the regressor to the best combination of parameters
bagging_estimator_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
bagging_estimator_tuned.fit(X_train, y_train)
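Before launching a grid search it can help to estimate how many models will be trained. As a sketch, `ParameterGrid` from scikit-learn enumerates the same combinations that `GridSearchCV` would try for the bagging grid above; the 5-fold setting matches the instruction in the cell.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

parameters = {
    "max_samples": [0.7, 0.8, 0.9],
    "max_features": [0.7, 0.8, 0.9],
    "n_estimators": np.arange(90, 120, 10),  # 90, 100, 110 (120 is excluded)
}

n_candidates = len(ParameterGrid(parameters))  # 3 * 3 * 3 combinations
cv_folds = 5
n_fits = n_candidates * cv_folds  # cross-validation fits, before the final refit
print(n_candidates, n_fits)
```

Each of these fits trains a full bagging ensemble of roughly a hundred trees, so this cell can take noticeably longer than the decision tree search.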

#### Checking model performance on training set

In [None]:
bagging_estimator_tuned_model_train_perf = model_performance_regression("___________") # Fill in the blank to check tuned model performance on train data
bagging_estimator_tuned_model_train_perf

#### Checking model performance on test set

In [None]:
bagging_estimator_tuned_model_test_perf = model_performance_regression("___________") # Fill in the blank to check the tuned model performance on test data
bagging_estimator_tuned_model_test_perf

### Random Forest Model

In [None]:
rf_estimator = "__________"(random_state=1) # Fill in the blank to define random forest regressor
rf_estimator.fit("___________") # Fill in the blank to fit the regressor to training data

#### Checking model performance on training set

In [None]:
rf_estimator_model_train_perf = model_performance_regression("____________") # Fill in the blank to check model performance on train data
rf_estimator_model_train_perf

#### Checking model performance on test set

In [None]:
rf_estimator_model_test_perf = model_performance_regression("____________") # Fill in the blank to check model performance on test data
rf_estimator_model_test_perf

### Hyperparameter Tuning - Random Forest

In [None]:
# Choose the type of regressor.
rf_tuned = RandomForestRegressor(random_state=1)

# Grid of parameters to choose from
parameters = {
    "max_depth": [4, 6, 8, 10, None],
    "max_features": ["sqrt", "log2", None],
    "n_estimators": [80, 90, 100, 110, 120],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search
grid_obj = GridSearchCV("_____________") # complete the code to run grid search with cv = 5
grid_obj = grid_obj.fit(X_train, y_train)

# Set the estimator to the best combination of parameters
rf_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
rf_tuned.fit(X_train, y_train)
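
Before running a grid search it can help to estimate how much work it involves. The sketch below enumerates the random forest grid from the cell above with `itertools.product` and counts the model fits that 5-fold cross-validation would need. The variable names `candidates`, `n_candidates`, and `n_cv_fits` are introduced here for illustration only.

```python
from itertools import product

# Same grid as in the random forest tuning cell
parameters = {
    "max_depth": [4, 6, 8, 10, None],
    "max_features": ["sqrt", "log2", None],
    "n_estimators": [80, 90, 100, 110, 120],
}
cv_folds = 5

# Every combination of one value per hyperparameter is one candidate
candidates = [
    dict(zip(parameters, combo)) for combo in product(*parameters.values())
]

n_candidates = len(candidates)
n_cv_fits = n_candidates * cv_folds  # one fit per candidate per fold

print(n_candidates, n_cv_fits)
print(candidates[0])
```

With the default `refit=True`, `GridSearchCV` also refits the best candidate once on the full training data, so `best_estimator_` is already fitted and the explicit `.fit` call afterwards repeats that step.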

#### Checking model performance on training set

In [None]:
rf_tuned_model_train_perf = model_performance_regression("___________") # Fill in the blank to check tuned model performance on train data
rf_tuned_model_train_perf

#### Checking model performance on test set

In [None]:
rf_tuned_model_test_perf = model_performance_regression("___________") # Fill in the blank to check tuned model performance on test data
rf_tuned_model_test_perf

## Boosting - Model Building and Hyperparameter Tuning

### AdaBoost Regressor

In [None]:
ab_regressor = "__________"(random_state=1) # Fill in the blank to define the adaboost regressor
ab_regressor.fit("___________") # Fill in the blank to fit the regressor to training data

#### Checking model performance on training set

In [None]:
ab_regressor_model_train_perf = model_performance_regression("____________") # Fill in the blank to check model performance on train data
ab_regressor_model_train_perf

#### Checking model performance on test set

In [None]:
ab_regressor_model_test_perf = model_performance_regression("____________") # Fill in the blank to check model performance on test data
ab_regressor_model_test_perf

### Hyperparameter Tuning - AdaBoost Regressor

In [None]:
# Choose the type of regressor.
ab_tuned = AdaBoostRegressor(random_state=1)

# Grid of parameters to choose from
parameters = {
    "n_estimators": np.arange(10, 100, 10),
    "learning_rate": [1, 0.1, 0.5, 0.01],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search
grid_obj = GridSearchCV("_____________") # complete the code to run grid search with cv = 5
grid_obj = grid_obj.fit(X_train, y_train)

# Set the estimator to the best combination of parameters
ab_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
ab_tuned.fit(X_train, y_train)
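
One detail of the grid above is easy to miss: `np.arange(10, 100, 10)` excludes its stop value, so 100 estimators is never tried. The sketch below uses Python's built-in `range`, which follows the same half-open convention for integer steps, to show the values actually searched here and in the gradient boosting grid later in the notebook. The variable names are assumptions for this example.

```python
# range(start, stop, step) stops before `stop`, like np.arange with integer steps
adaboost_n_estimators = list(range(10, 100, 10))
gradient_boosting_n_estimators = list(range(50, 200, 25))

print(adaboost_n_estimators)
print(gradient_boosting_n_estimators)

# To include 100 in the search, the stop value must be larger than 100
adaboost_inclusive = list(range(10, 101, 10))
print(adaboost_inclusive)
```

With four learning rates, the AdaBoost grid therefore has 9 x 4 = 36 candidates.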

#### Checking model performance on training set

In [None]:
ab_tuned_model_train_perf = model_performance_regression("___________") # Fill in the blank to check tuned model performance on train data
ab_tuned_model_train_perf

#### Checking model performance on test set

In [None]:
ab_tuned_model_test_perf = model_performance_regression("___________") # Fill in the blank to check tuned model performance on test data
ab_tuned_model_test_perf

### Gradient Boosting Regressor

In [None]:
gb_estimator = "___________"(random_state=1) # Fill in the blank to define gradient boosting regressor 
gb_estimator.fit("___________") # Fill in the blank to fit the regressor to training data

#### Checking model performance on training set

In [None]:
gb_estimator_model_train_perf = model_performance_regression("____________") # Fill in the blank to check model performance on train data
gb_estimator_model_train_perf

#### Checking model performance on test set

In [None]:
gb_estimator_model_test_perf = model_performance_regression("____________") # Fill in the blank to check model performance on test data
gb_estimator_model_test_perf

### Hyperparameter Tuning - Gradient Boosting Regressor

In [None]:
# Choose the type of regressor.
gb_tuned = GradientBoostingRegressor(random_state=1)

# Grid of parameters to choose from
parameters = {
    "n_estimators": np.arange(50, 200, 25),
    "subsample": [0.7, 0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9, 1.0],  # floats are fractions of the features; the integer 1 would mean a single feature
    "max_depth": [3, 5, 7, 10],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search
grid_obj = GridSearchCV("_____________") # complete the code to run grid search with cv = 5
grid_obj = grid_obj.fit(X_train, y_train)

# Set the estimator to the best combination of parameters
gb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
gb_tuned.fit(X_train, y_train)
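
In scikit-learn's tree ensembles, `max_features` is interpreted differently depending on its type: an integer is a feature count, while a float is a fraction of the available features. The sketch below mirrors the rule described in the scikit-learn documentation for the integer, float, and `None` cases. The helper `resolve_max_features` is hypothetical, written only for this illustration, and is not a library function.

```python
def resolve_max_features(max_features, n_features):
    """Number of features considered per split (int, float, and None cases only)."""
    if max_features is None:
        return n_features                      # use every feature
    if isinstance(max_features, int):
        return max_features                    # an integer is an absolute count
    return max(1, int(max_features * n_features))  # a float is a fraction


n_features = 10  # assumed feature count, for illustration

as_int = resolve_max_features(1, n_features)      # a single feature
as_float = resolve_max_features(1.0, n_features)  # all features
as_half = resolve_max_features(0.5, n_features)   # half of the features

print(as_int, as_float, as_half)
```

So a grid entry written as `1` and one written as `1.0` request very different models.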

#### Checking model performance on training set

In [None]:
gb_tuned_model_train_perf = model_performance_regression("___________") # Fill in the blank to check tuned model performance on train data
gb_tuned_model_train_perf

#### Checking model performance on test set

In [None]:
gb_tuned_model_test_perf = model_performance_regression("___________") # Fill in the blank to check tuned model performance on test data
gb_tuned_model_test_perf

### XGBoost Regressor

In [None]:
xgb_estimator = "_________"(random_state=1) # Fill in the blank to define the XGBoost regressor
xgb_estimator.fit("___________") # Fill in the blank to fit the regressor to training data

#### Checking model performance on training set

In [None]:
xgb_estimator_model_train_perf = model_performance_regression("____________") # Fill in the blank to check model performance on train data
xgb_estimator_model_train_perf

#### Checking model performance on test set

In [None]:
xgb_estimator_model_test_perf = model_performance_regression("____________") # Fill in the blank to check model performance on test data
xgb_estimator_model_test_perf

### Hyperparameter Tuning - XGBoost Regressor 

In [None]:
# Choose the type of regressor.
xgb_tuned = XGBRegressor(random_state=1)

# Grid of parameters to choose from
parameters = {
    "n_estimators": [75, 100, 125, 150],
    "subsample": [0.7, 0.8, 0.9, 1],
    "gamma": [0, 1, 3, 5],
    "colsample_bytree": [0.7, 0.8, 0.9, 1],
    "colsample_bylevel": [0.7, 0.8, 0.9, 1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.r2_score)

# Run the grid search
grid_obj = GridSearchCV("_____________") # complete the code to run grid search with cv = 5
grid_obj = grid_obj.fit(X_train, y_train)

# Set the estimator to the best combination of parameters
xgb_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data.
xgb_tuned.fit(X_train, y_train)
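
Each grid search in this notebook uses `cv = 5`. For a regressor and an integer `cv`, scikit-learn splits the training rows into contiguous, unshuffled folds. The sketch below reproduces that partitioning in plain Python to show which rows are held out in each round. The function `kfold_indices` and the sample count of 23 are assumptions for this example.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample each
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(23, n_splits=5))
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)

# Every row is held out exactly once across the five rounds
held_out = sorted(i for _, test_idx in folds for i in test_idx)
print(held_out == list(range(23)))
```

Each candidate is trained on four folds and scored on the fifth, and the five scores are averaged to rank that candidate.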

#### Checking model performance on training set

In [None]:
xgb_tuned_model_train_perf = model_performance_regression("___________") # Fill in the blank to check tuned model performance on train data
xgb_tuned_model_train_perf

#### Checking model performance on test set

In [None]:
xgb_tuned_model_test_perf = model_performance_regression("___________") # Fill in the blank to check tuned model performance on test data
xgb_tuned_model_test_perf

## Stacking Model

#### Now, let's build a stacking model with the tuned models - decision tree, random forest, and gradient boosting, then use XGBoost to get the final prediction

In [None]:
estimators = [
    ("Decision Tree", dtree_tuned),
    ("Random Forest", rf_tuned),
    ("Gradient Boosting", gb_tuned),
]
final_estimator = XGBRegressor(random_state=1)

In [None]:
stacking_estimator = StackingRegressor(
    estimators=estimators, final_estimator=final_estimator, cv=5
)
stacking_estimator.fit("__________") # Fill in the blank to fit the stacking estimator on the train data
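
To make the role of `cv=5` in stacking more concrete: the final estimator is trained on cross-validated (out-of-fold) predictions of the base estimators rather than on predictions for rows those estimators have already seen. The sketch below shows that idea with two toy base models in plain Python. The helpers `contiguous_folds`, `out_of_fold_predictions`, `mean_model`, and `nearest_neighbor_model`, along with the toy data, are all assumptions for this illustration and are not part of scikit-learn.

```python
def contiguous_folds(n_samples, n_splits):
    """Return lists of test indices for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_predictions(fit_predict, X, y, n_splits=5):
    """Predict every row with a model that did not see that row in training."""
    oof = [None] * len(y)
    for test_idx in contiguous_folds(len(y), n_splits):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        for i, p in zip(test_idx, preds):
            oof[i] = p
    return oof


def mean_model(X_train, y_train, X_test):
    """Toy base model: always predict the training mean."""
    mean = sum(y_train) / len(y_train)
    return [mean] * len(X_test)


def nearest_neighbor_model(X_train, y_train, X_test):
    """Toy base model: copy the target of the closest training point."""
    preds = []
    for x in X_test:
        nearest = min(range(len(X_train)), key=lambda j: abs(X_train[j] - x))
        preds.append(y_train[nearest])
    return preds


# Toy one-dimensional data, made up for illustration
X = [float(v) for v in range(1, 11)]
y = [2.0 * v for v in X]

# One column of out-of-fold predictions per base model
meta_features = list(zip(
    out_of_fold_predictions(mean_model, X, y),
    out_of_fold_predictions(nearest_neighbor_model, X, y),
))
print(meta_features[0])
```

In the notebook, the rows of `meta_features` correspond to what the XGBoost final estimator learns from. Note also that `StackingRegressor` clones and refits the base estimators it is given, so the tuned models contribute their hyperparameters rather than their already-fitted state.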

#### Checking model performance on training set

In [None]:
stacking_estimator_model_train_perf = model_performance_regression("____________") # Fill in the blank to check model performance on train data
stacking_estimator_model_train_perf

#### Checking model performance on test set

In [None]:
stacking_estimator_model_test_perf = model_performance_regression("____________") # Fill in the blank to check model performance on test data
stacking_estimator_model_test_perf

## Model Performance Comparison and Final Model Selection

In [None]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        dtree_model_train_perf.T,
        dtree_tuned_model_train_perf.T,
        bagging_regressor_model_train_perf.T,
        bagging_estimator_tuned_model_train_perf.T,
        rf_estimator_model_train_perf.T,
        rf_tuned_model_train_perf.T,
        ab_regressor_model_train_perf.T,
        ab_tuned_model_train_perf.T,
        gb_estimator_model_train_perf.T,
        gb_tuned_model_train_perf.T,
        xgb_estimator_model_train_perf.T,
        xgb_tuned_model_train_perf.T,
        stacking_estimator_model_train_perf.T,
    ],
    axis=1,
)

models_train_comp_df.columns = [
    "Decision Tree",
    "Decision Tree Tuned",
    "Bagging Regressor",
    "Bagging Regressor Tuned",
    "Random Forest Estimator",
    "Random Forest Tuned",
    "Adaboost Regressor",
    "Adaboost Tuned",
    "Gradient Boost Estimator",
    "Gradient Boost Tuned",
    "XGB",
    "XGB Tuned",
    "Stacking Regressor",
]

print("Training performance comparison:")
models_train_comp_df

In [None]:
# testing performance comparison

'_______' ## Complete the code to check performance for test data

### Important features of the best model

In [None]:
# Importance of each feature in the tree building. The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance.

print(
    pd.DataFrame(
        '____________'.feature_importances_, columns=["Imp"], index=X_train.columns ## Complete the code to fill in best model object 
    ).sort_values(by="Imp", ascending=False)
)

In [None]:
feature_names = X_train.columns
importances = '____________'.feature_importances_ ## Complete the code to fill in best model object
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
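
The plotting cell above sorts the importances in ascending order before drawing them. Because a horizontal bar chart places position 0 at the bottom, the most important feature ends up at the top. The sketch below reproduces that ordering in plain Python. The feature names and importance values are made up for illustration and do not come from the SuperKart data.

```python
# Hypothetical names and values; real ones come from the fitted model
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
importances = [0.10, 0.55, 0.05, 0.30]

# Indices that sort the importances in ascending order
# (the same order np.argsort returns when all values are distinct)
indices = sorted(range(len(importances)), key=lambda i: importances[i])

ordered_names = [feature_names[i] for i in indices]
ordered_values = [importances[i] for i in indices]

print(ordered_names)   # bottom-to-top order of the bars
print(ordered_values)
```

The last entry of `ordered_names` is the feature drawn at the top of the chart.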

## Actionable Insights and Business Recommendations

- 


___________________________________________________________________________