# **Problem Statement**

## **Business Context**

A sales forecast is a prediction of future sales revenue based on historical data, industry trends, and the status of the current sales pipeline. Businesses use the sales forecast to estimate weekly, monthly, quarterly, and annual sales totals. A company needs to make an accurate sales forecast as it adds value across an organization and helps the different verticals to chalk out their future course of action.

Forecasting helps an organization plan its sales operations by region and provides valuable insights to the supply chain team regarding the procurement of goods and materials. An accurate sales forecast process has many benefits which include improved decision-making about the future and reduction of sales pipeline and forecast risks. Moreover, it helps to reduce the time spent in planning territory coverage and establish benchmarks that can be used to assess trends in the future.

## **Objective**

SuperKart is a retail chain operating supermarkets and food marts across various tier cities, offering a wide range of products. To optimize its inventory management and make informed decisions around regional sales strategies, SuperKart wants to accurately forecast the sales revenue of its outlets for the upcoming quarter.

To operationalize these insights at scale, the company has partnered with a data science firm—not just to build a predictive model based on historical sales data, but to develop and deploy a robust forecasting solution that can be integrated into SuperKart’s decision-making systems and used across its network of stores.

## **Data Description**

The data contains the attributes of the various products and stores. The detailed data dictionary is given below.

- **Product_Id** - unique identifier of each product, each identifier having two letters at the beginning followed by a number.
- **Product_Weight** - weight of each product
- **Product_Sugar_Content** - sugar content of each product like low sugar, regular and no sugar
- **Product_Allocated_Area** - ratio of the allocated display area of each product to the total display area of all the products in a store
- **Product_Type** - broad category for each product like meat, snack foods, hard drinks, dairy, canned, soft drinks, health and hygiene, baking goods, bread, breakfast, frozen foods, fruits and vegetables, household, seafood, starchy foods, others
- **Product_MRP** - maximum retail price of each product
- **Store_Id** - unique identifier of each store
- **Store_Establishment_Year** - year in which the store was established
- **Store_Size** - size of the store depending on sq. feet like high, medium and low
- **Store_Location_City_Type** - type of city in which the store is located like Tier 1, Tier 2 and Tier 3. Tier 1 consists of cities where the standard of living is comparatively higher than its Tier 2 and Tier 3 counterparts.
- **Store_Type** - type of store depending on the products that are being sold there like Departmental Store, Supermarket Type 1, Supermarket Type 2 and Food Mart
- **Product_Store_Sales_Total** - total revenue generated by the sale of that particular product in that particular store


In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

# **Installing and Importing the necessary libraries**

## **Install required packages - One time activity**
- The install line is commented out because some of the pinned package versions are not compatible with the current environment. joblib and huggingface_hub are already installed, so the project runs without it.
- The notebook is run in VS Code with the Jupyter extension.

In [None]:
#Installing the libraries with the specified versions
# !pip install numpy==2.0.2 pandas==2.2.2 scikit-learn==1.6.1 matplotlib==3.10.0 seaborn==0.13.2 joblib==1.4.2 xgboost==2.1.4 requests==2.32.3 huggingface_hub==0.30.1 -q

**Note:**

- After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab) and run all cells sequentially from the next cell.

- On executing the above line of code, you might see a warning regarding package dependencies. This warning can be ignored, as the above code ensures that all necessary libraries and their dependencies are maintained to successfully execute the code in this notebook.
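
Because the install line is commented out, it can help to confirm which versions are actually present before running the rest of the notebook. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the pinned install line above.

```python
from importlib import metadata

# Packages named in the pinned install line above (the list is illustrative).
packages = ["numpy", "pandas", "scikit-learn", "matplotlib", "seaborn",
            "joblib", "xgboost", "requests", "huggingface_hub"]

def installed_versions(names):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions

versions = installed_versions(packages)
for name, version in versions.items():
    print(f"{name:16s} {version or 'NOT INSTALLED'}")
```

Any package reported as missing, or with a version far from the pinned one, is a candidate cause if a later cell fails.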

## **Import the packages**

In [None]:
import warnings
warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# For splitting the dataset
from sklearn.model_selection import train_test_split

# Scalers, encoders and transformers for data preprocessing
from sklearn.preprocessing import RobustScaler, FunctionTransformer, OrdinalEncoder, MinMaxScaler

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 100)


# Import ensemble regressors to be used for modelling.
from sklearn.ensemble import (
    BaggingRegressor,
    RandomForestRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
)
from xgboost import XGBRegressor

# Libraries to get different metric scores for regression
from sklearn.metrics import (
    f1_score,
    mean_squared_error,
    mean_absolute_error,
    r2_score,
    mean_absolute_percentage_error
)

# To create the pipelines for data transformation of Test, Train and Validation datasets
from sklearn.compose import make_column_transformer, ColumnTransformer
from sklearn.pipeline import make_pipeline,Pipeline

# To tune different models and standardize
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# To serialize the model
import joblib

# os related functionalities
import os

# API request
import requests
import sklearn


# for hugging face space authentication to upload files
from huggingface_hub import login, HfApi

# Set scikit-learn's display mode to 'diagram' for better visualization of pipelines and estimators
sklearn.set_config(display='diagram')

# **Loading the dataset**

In [None]:
# Load the dataset of SuperKart and make a copy for EDA Analysis followed by Model building.
skart_ori = pd.read_csv("SuperKart.csv")

# skart_ori = pd.read_csv("/content/SuperKart.csv")
skart = skart_ori.copy()

- Load the dataset and make a copy.
- Use the copy for EDA and model building.
- EDA will add columns where necessary (feature engineering).
- Model building will scale this dataset using RobustScaler.
- Working on a copy keeps the original data unmodified.
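
As a small illustration of why the copy matters, the sketch below adds an engineered column to a copy and shows that the original frame is untouched. The toy values and the column `MRP_scaled` are made up for this example and are not part of the SuperKart data.

```python
import pandas as pd

# Toy stand-in for the loaded file; the values are invented for illustration.
original = pd.DataFrame({"Product_MRP": [100.0, 150.0],
                         "Store_Size": ["Medium", "High"]})

working = original.copy()  # DataFrame.copy() makes a deep copy by default
working["MRP_scaled"] = working["Product_MRP"] / working["Product_MRP"].max()

print(list(original.columns))  # the original keeps its two columns
print(list(working.columns))   # the copy carries the engineered column
```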

# **Data Overview**

## **Display the first 5 rows**

In [None]:
skart.head(5)

##### **OBSERVATIONS**
- Dataset is loaded properly
- The first 5 rows are displayed correctly, as verified manually against the CSV file.

## **Display the dimension of the dataset**

In [None]:
print ( skart.shape )

##### **OBSERVATIONS**
- There are 8763 rows in the dataset
- There are 12 columns in the dataset

## **Display the column names and data types**

### **Dataframe info**

In [None]:
skart.info()

##### **OBSERVATIONS**
- The dataset has no null values.
- It has 5 numeric fields and 7 object fields. The object fields are categorical in nature and have to be encoded before modelling.
- Product_Id and Store_Id are the main identifiers that uniquely identify the products and stores respectively.
- Products have numeric attributes: weight, allocated area and MRP. Products additionally have two categorical attributes: sugar content and product type.
- Stores have only categorical attributes: type, size, location city type and year of establishment (stored as a number but treated as categorical here).
- Product_Store_Sales_Total is the target variable. It records the sales for each product and store combination.
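
The numeric/non-numeric split and the null check read off `info()` above can also be obtained programmatically. This is a minimal sketch on a toy frame; only the column names are borrowed from the data dictionary, the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "Product_Weight": [12.5, 9.1],
    "Store_Establishment_Year": [1987, 2009],
    "Store_Size": ["Medium", "High"],
})

# Split the columns by dtype instead of reading them off the info() output.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Per-column count of missing values.
null_counts = toy.isnull().sum()

print(numeric_cols, non_numeric_cols)
print(null_counts)
```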

### **Create aliases and classify numeric and categorical variables**

In [None]:
# Define the total sales
target = 'Product_Store_Sales_Total'

# Use short form aliases to avoid typos
pid = 'Product_Id'
sid = 'Store_Id'
pwt = 'Product_Weight'
paa = 'Product_Allocated_Area'
pmrp = 'Product_MRP'
seyr = 'Store_Establishment_Year'
psc = 'Product_Sugar_Content'
ptyp = 'Product_Type'
ssize = 'Store_Size'
sloctype = 'Store_Location_City_Type'
styp = 'Store_Type'

# Define numeric variables. The target is included because it is a continuous variable, not a class label.
numeric_vars = [pwt,paa,pmrp,target]

# Even though 'Store_Establishment_Year' is Numeric lets consider it as Categorical.
# Define categorical variables
categories = [psc,ptyp,ssize,sloctype,styp,seyr]

## **Display the statistical summary**

In [None]:
skart.describe().T

##### **OBSERVATIONS**

There are 5 Numeric variables:

- **Product_Weight:** The units are not mentioned; the assumption is that the weight is in oz. It ranges from 4 to 22.

- **Product_Allocated_Area:** By definition this is the ratio of a product's display area to the total display area of all products in a store, so it is specific to each store. The minimum is 0.004 and the maximum is almost 0.298. There are no product names; only product types are available. The mean of 0.056 lies much closer to the minimum than to the maximum, which suggests right-skewed data. Most products have a small display area even though their sales might be higher; such items may simply be small and need less space.

- **Product_MRP:** The currency is not specified; it is assumed to be US dollars. The minimum price is 31 and the maximum is 266.

- **Store_Establishment_Year:** The first store was established in 1987 and the latest one in 2009.

- **Product_Store_Sales_Total:** This is the sales for each product and store combination. It is the target variable to be predicted. The minimum sales per product per store is 33 and the maximum is 8000.

- There are no NULL values  
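
The right-skew reading for Product_Allocated_Area can be checked numerically rather than inferred from the summary table alone. The sketch below uses a made-up series with one large value; on the real data the same three statistics would be computed on the actual column.

```python
import pandas as pd

# Invented values: mostly small ratios plus one large one, mimicking a long right tail.
area = pd.Series([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.30],
                 name="Product_Allocated_Area")

summary = {
    "mean": area.mean(),
    "median": area.median(),
    "skew": area.skew(),  # sample skewness; positive means a longer right tail
}
print(summary)
```

A mean above the median together with a positive skewness is the usual signature of a right-skewed distribution.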

# **Exploratory Data Analysis (EDA)**

## **Helper functions**

### **Visualizing Univariate data**

In [None]:
# Helper function for checking if there are any Null or missing values in the data.
def col_vals(data,c):
    print(c)
    # print(data[c].unique())
    # print(data[c].value_counts())
    print("Check Missing values for ", c ," = ",
          "No missing values " if ( ( data[c].value_counts().sum() - data.shape[0] == 0) and (data[c].isnull().sum()/len(data)*100 == 0) ) else " invalid entries may exist")
    print('---------------------------------')

# Helper function for visualizing how often each category of one or more features occurs.
def countplot_feature(data, feature, figsize=(12, 7), color='skyblue', rotation=0, nrows=2):
    """
    Count plot for categorical features

    data: dataframe
    feature: list of dataframe columns
    figsize: size of figure (default (12,7))
    color: color of bars (default 'skyblue')
    rotation: rotation of x-axis labels (default 0)
    nrows: number of subplot rows in the two-column grid (default 2)
    """
    plt.figure(figsize=figsize)
    for i, ft in enumerate(feature):
        plt.subplot(nrows, 2, i+1)
        plt.title(f'counts of {ft} categories')
        sns.countplot(x=data[ft], color=color)
        plt.xticks(rotation=rotation)
    plt.tight_layout()
    plt.show()

# This function is used only when a feature exists in two datasets and needs to be compared.
# For example: the difference in the distribution of a feature in the training and test datasets.
def histplot_compare(data1, data2, feature):
    # cols = cols /2
    total = len(feature)
    rows = int( total)
    bins1 = int( data1.shape[0]/100 )
    bins2 = int( data2.shape[0]/25 )
    # squeeze=False keeps ax_hist two-dimensional even when only one feature is passed
    fig, ax_hist = plt.subplots(nrows=rows, ncols=2, figsize=(8,75), squeeze=False)
    # print (ax_hist)
    for i, ft in enumerate(feature):
        ax_hist1 = ax_hist[i, 0]
        ax_hist2 = ax_hist[i, 1]
        ax_hist1.sharex(ax_hist2)
        # ax_hist1.sharey(ax_hist2)
        # plt.title(f'{ft} histogram')
        pcolor=sns.color_palette('Set1_r', as_cmap = True)
        # Palette = ["#090364", "#091e75"] #define your preference
        # sns.set_style("whitegrid")
        # sns.set_palette(Palette) #use the list defined in the function
        ax1 = sns.histplot(
            data=data1, x=ft, kde=True, ax=ax_hist1, bins=bins1, palette=pcolor, edgecolor=None
        )
        ax2 = sns.histplot(
            data=data2, x=ft, kde=True, ax=ax_hist2, bins=bins2, palette=pcolor, edgecolor=None
        )
        ax_hist2.axvline(
            data2[ft].mean(), color="crimson", linestyle="solid"
        )  # Add mean to the histogram
        ax_hist2.axvline(
            data2[ft].median(), color="darkgreen", linestyle="solid"
        )  # Add median to the histogram
        ax_hist1.axvline(
            data1[ft].mean(), color="crimson", linestyle="solid"
        )  # Add mean to the histogram
        ax_hist1.axvline(
            data1[ft].median(), color="darkgreen", linestyle="solid"
        )  # Add median to the histogram
        ax1.lines[0].set_color('black'); ax2.lines[0].set_color('black')
        ax1.set_xlabel(ft  + "(Training)"); ax1.set_ylabel('Frequency')
        ax2.set_xlabel(ft + "(Testing)"); ax2.set_ylabel('Frequency')
        plt.xticks(rotation=0)

    plt.tight_layout()
    plt.show()

# This function is used for plotting a histogram with Mean and Median.
def histplot_feature(data, feature, cols=4, figsize=(12,7), fth=None):
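    total = len(feature)
    rows = -(-total // cols)  # ceiling division so every feature gets a subplot
    bins = 20
    # squeeze=False keeps ax_hist two-dimensional even when there is a single row
    fig, ax_hist = plt.subplots(nrows=rows, ncols=cols, figsize=figsize, squeeze=False)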
    # print (ax_hist)
    for i, ft in enumerate(feature):
        ax_hist2 = ax_hist[int(i/cols), i%cols]
        # plt.title(f'{ft} histogram')
        ax = sns.histplot(
            data=data, x=ft, kde=True, ax=ax_hist2, palette='Set1', bins=bins, edgecolor=None, hue=fth
        )
        ax_hist2.axvline(
            data[ft].mean(), color="crimson", linestyle="solid"
        )  # Add mean to the histogram
        ax_hist2.axvline(
            data[ft].median(), color="darkgreen", linestyle="solid"
        )  # Add median to the histogram
        ax.lines[0].set_color('black')
        plt.xticks(rotation=0)

    plt.tight_layout()
    plt.show()

# This function is used for plotting a box plot of each feature, showing quartiles and outliers.
def boxplot_feature(data, feature, cols=4, figsize=(12,7), fth=None):

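    total = len(feature)
    rows = -(-total // cols)  # ceiling division so every feature gets a subplot
    # squeeze=False keeps ax_box two-dimensional even when there is a single row
    fig, ax_box = plt.subplots(nrows=rows, ncols=cols, figsize=figsize, squeeze=False)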
    for i, ft in enumerate(feature):
        ax_box2 = ax_box[int(i/cols), i%cols]
        # plt.title(f'{ft} boxplot')
        ax = sns.boxplot(
            data=data, x=ft, showmeans=True, ax=ax_box2, palette='Set2', hue=fth
        )
        ax.legend().remove()
        plt.xticks(rotation=0)
        plt.legend(loc='upper left', bbox_to_anchor=(1,1) )

    plt.tight_layout()
    plt.show()

# Similar to histplot_feature but shows only the KDE line.
def kdeplot_feature(data, feature, figsize=(12, 7), color='skyblue', rotation=0):
    """
    kdeplot for numeric features

    data: dataframe
    feature: list of dataframe columns (at most 4, laid out on a 2x2 grid)
    figsize: size of figure (default (12,7))
    color: color of the density curve (default 'skyblue')
    rotation: rotation of x-axis labels (default 0)
    """
    plt.figure(figsize=figsize)
    for i, ft in enumerate(feature):
        plt.subplot(2, 2, i+1)
        plt.title(f'kde {ft} distribution')
        sns.kdeplot(x=data[ft], color=color)
        plt.xticks(rotation=rotation)

    plt.tight_layout()
    plt.show()

# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram


# Wrapper function for a barplot labelled with counts or percentages.
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

### **Categorical data - Label/Ordinal Encoding**
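
As a minimal illustration of the ordinal encoding that the helper in the next cell wraps, the sketch below encodes a toy store-size column. The category order Small < Medium < High is an assumption made for this example; `OrdinalEncoder` expects `categories` as a list containing one ordered list per encoded column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column; the labels and their order are assumptions for illustration.
sizes = pd.DataFrame({"Store_Size": ["Medium", "High", "Small", "Medium"]})

# One inner list per encoded column, in the desired rank order.
order = [["Small", "Medium", "High"]]

encoder = OrdinalEncoder(categories=order)
# fit_transform returns a 2-D array; ravel() flattens it to one value per row.
sizes["Store_Size_enc"] = encoder.fit_transform(sizes[["Store_Size"]]).ravel()

print(sizes)
```

Passing the order explicitly matters: without it the encoder sorts the labels alphabetically, which would rank High below Medium.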

In [None]:
def ordinal_enc( data, col, cat, drop=False):
    """
    Function to perform ordinal encoding on the specified column

    data: dataframe
    col: column to be encoded
    cat: list containing one list of categories in the desired order, e.g. [["a", "b", "c"]]
    drop: whether to drop the column after encoding (default is False)
    """
    from sklearn.preprocessing import OrdinalEncoder
    enc = OrdinalEncoder(categories=cat)
    data[col+'_enc'] = enc.fit_transform(data[[col]])
    if drop:
        data.drop(columns=[col], inplace=True)

def label_enc( data, col, drop=False):
    """
    Function to perform label encoding on the specified column

    data: dataframe
    col: column to be encoded
    drop: whether to drop the column after encoding (default is False)
    """
    from sklearn.preprocessing import LabelEncoder
    enc = LabelEncoder()
    data[col+'_enc'] = enc.fit_transform(data[col])  # LabelEncoder expects a 1-D input
    if drop:
        data.drop(columns=[col], inplace=True)

### **Visualizing Bi-Variate data**

In [None]:
# Plot the cross tabulation of two categorical variables as raw counts and as row-normalized proportions.
def cross_tab(data, feature1, feature2, nrows=1, ncols=2, figsize=(15,5), stacked=True, margins=True):
    """
    Cross tabulation of two features

    data: dataframe
    feature1: first feature
    feature2: second feature
    """
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize)
    tab1 = pd.crosstab(data[feature1], data[feature2], dropna=False)
    if ( margins ):
        tab2 = pd.crosstab(data[feature1], data[feature2], normalize='index', margins=True, margins_name="Total", dropna=True)
    else:
        tab2 = pd.crosstab(data[feature1], data[feature2], normalize='index', margins=False, dropna=True)
    # print(tab1)
    # print(tab2)
    # print(type(fig) )
    fig.set_size_inches(figsize)
    xl = feature1
    yl = "Proportion of " + feature2
    tit = " Actual counts"
    # The first plot shows raw counts, so it gets its own y-axis label.
    tab1.plot(kind='bar', ax=axes[0], stacked=stacked, title=tit, xlabel=xl, ylabel="Count of " + feature2)
    plt.legend(loc='upper left', bbox_to_anchor=(0,0), title=feature2)

    tit = " Normalized (individual) counts"
    tab2.plot(kind='bar', ax=axes[1], stacked=stacked, title=tit, xlabel=xl, ylabel=yl)
    plt.legend(loc='upper left', bbox_to_anchor=(1,1), title=feature2)

    plt.tight_layout()
    plt.show()

# Shared colour palette keyed by the hue values used in the plots below:
# dataset splits, aggregate statistics, feature names and sugar-content categories.
# The name visa_palette is kept as-is because the FacetGrid helpers below reference it.
visa_palette = {0: "#fc8d62", 1: "#66c2a5", "test": "#66c2a5", "train": "#fc8d62", "validation": "#ffd92f","testing": "#66c2a5", "training": "#fc8d62",
                'full': '#ffd92f', 'mean': '#a6d854', 'sum':'#66c2a5', 'std':'#66c2a5', 'min': '#8da0cb',
                "max": "#fc8d62", "median": "#fc8d62", "count": "#ffd92f",
                paa: "#66c2a5", pwt: "#fc8d62", pmrp: "#ffd92f",target:'#8da0cb',pid:'#a6d854',
                "25%": "#66c2a5", "50%": "#fc8d62", "75%": "#ffd92f",
                'F1': '#fc8d62', 'Accuracy': '#a6d854', 'Recall':'#66c2a5', 'Precision': '#8da0cb',
                'Regular': '#fc8d62', 'No Sugar': '#a6d854', 'Low Sugar': '#66c2a5', 'reg': '#8da0cb',
                'NC':"#66c2a5", 'FD':"#fc8d62", 'DR':"#ffd92f"
                }
# Format numbers for labels: values of 10,000 or more are shown in thousands (e.g. 12.3K), others with two decimals.
def custom_fmt(x):
    return f'{x/1000:.1f}K' if x >= 10000 else f'{x:.2f}'

# Formatting helper function used by all FacetGrid plots for handling common display issues.
def fmt_plot(g, nrows, ncols, row_d, col_d, hue_d, x_d, y_d, fnt, tit, kind='None'):
    print(f"5-dimension analysis: {row_d} X {col_d} = {nrows} X {ncols} with hue = {hue_d} on variables {x_d} over {y_d}")
    length = len(g.axes_dict)
    print ( length, g.axes_dict.keys())
    print ( length, g.axes_dict.items())
    if (tit):
        #  g.set_titles(col_template="{col_var}={col_name}", row_template="{row_var}={row_name}", size=fnt)
        g.set_titles(col_template="{col_name}", row_template="{row_name}", size=fnt)
    for i,m in enumerate(g.axes_dict):
        ax = g.axes_dict[m]
        # if(not tit):
        ax.tick_params(direction='out', axis='x', labelsize=fnt, colors='b', grid_color='r', labelrotation=90)
        ax.set_xlabel(str(x_d) + "(" + row_d + "="+ str(m[0]) + ")", fontsize=fnt)
        ax.tick_params(direction='out', axis='y', labelsize=fnt, colors='b', grid_color='r')
        ax.set_ylabel(str(y_d) + "(" + col_d + "="+str(m[1]) + ")", fontsize=fnt)
        # else:
        #     ax.tick_params(direction='out', axis='x', labelsize=fnt, colors='b', grid_color='r', labelrotation=90)
        #     ax.set_xlabel(str(x_d), fontsize=fnt)
        #     ax.tick_params(direction='out', axis='y', labelsize=fnt, colors='b', grid_color='r')
        #     ax.set_ylabel(str(y_d), fontsize=fnt)
        if (kind =='bar'):
            for bars in ax.containers:
                ax.bar_label(bars, labels=[custom_fmt(v) for v in bars.datavalues], color='red',
                              padding=3, fontsize=fnt)
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.show()

# Function for controlling the axis ranges on the plots.
# In a FacetGrid it helps to have the same Y-axis ticks for better comparison.
def unify_axes(data, g, row_d, col_d, hue_d, x_d, y_d, kind=None):
    nrows =data[row_d].nunique()
    ncols = data[col_d].nunique()
    # if( hue_d):
    g.add_legend(title=hue_d, loc='center', bbox_to_anchor=(0,0))
    # Find global y-limits across all axes
    ymin, ymax = data[y_d].min(), data[y_d].max()*1.2
    # Apply same y-limits to all subplots
    for ax in g.axes.flat:
        ax.set_ylim(ymin, ymax)
        if ( ymax > 1000):
            ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'{x/1000:.1f}K'))
        else:
            ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, _: f'{x:.1f}'))
        if(kind =='bar'):
            for bar in ax.patches:
                bar.set_x(bar.get_x() + 0.2)  # Shift right by 0.2 units
    return nrows, ncols

# This function renders a Heat Map inside a Facet Grid.
# As we cannot use map dataframe for heat map,
# this function uses the Facet ax for controlling the rendering of each Heatmap of the Facet.
def FHMap (data, row_d, col_d, vars, fnt=8, ht=None, asp=None, tit=False, annot_fnt=8,debug=False):
    nrows = data[row_d].nunique()
    ncols = data[col_d].nunique()

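    if (ht is None) or (asp is None):
        # Fall back to the default facet cell size.
        ht = 2.75
        asp = 1.25
    # Derive the remaining sizes from the facet cell height and aspect ratio,
    # so they are defined whether or not ht and asp were passed in.
    hgt_1f = ht
    wdt_1f = round(hgt_1f * asp, 2)
    fig_h = round(nrows * hgt_1f, 2)
    fig_w = round(ncols * wdt_1f, 2)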

    # Plot two sets of heatmaps. The first is a single annotated heatmap over all variables;
    # the second is a FacetGrid of heatmaps. Size the first figure from the facet cell size
    # and the number of variables.
    mapsize = len(vars)
    map_width = int(round(wdt_1f * mapsize,0))
    map_height = int(round(hgt_1f * mapsize))

    if(debug):
        print(f"Total height={fig_h}, Total width={fig_w}, FG-cell height={ht}, FG-Cell width={round(asp*ht,2)}, \
              FG-cell aspect={asp}, ref. heatmap width{map_width}:height{map_height}")

    plt.figure(figsize=(map_width,map_height))
    heat_data = data.loc[:, vars].corr()
    # print ( len(heat_data), type(heat_data))
    ax = sns.heatmap(heat_data, annot=True, cbar=True,
                vmin=-1,vmax=1,fmt='.1f', cmap="Spectral", annot_kws={"size": 15},
                xticklabels=True, yticklabels=True)
    ax.set_title(f"Heatmaps using: {' | '.join(vars)} in the dataset.\n\n\
    (1) The first Heatmap does not have slicing Facets of {row_d} and {col_d}\n\
    (2) Compare this with the Grid of Heatmap Facets below which uses Facets of {row_d} and {col_d}\n\
    (3) All heatmaps use the same set of features in the same order\n\
    (4) The first heatmap acts as reference heatmap to show all the features names involved\n\
    (5) The columns are abbreviated to reduce space occupied on the graph.\n",
    loc='left', color='red')

    ax.tick_params(axis='x', rotation = 90)
    ax.tick_params(axis='y', rotation =0)
    plt.tight_layout()
    plt.show()

    # Second heatmap will be inside FacetGrid every Facet split by nrows and ncols will have a heatmap.
    # This way one can compare whether separating the data influences the correlation.
    g = sns.FacetGrid(data, row=row_d, col=col_d, height=ht, aspect=asp, sharex=False, sharey=False)
    for (r_val, c_val), ax in g.axes_dict.items():
        data_sub = data[(data[row_d] == r_val) & (data[col_d] == c_val)]
        heat_data = data_sub.loc[:, vars]
        # print(f"Handing Facet {r_val} and {c_val} using {row_d} x {col_d}, Heatmap features = {vars}. Length of data for correlation {len(heat_data)}")
        sns.heatmap(heat_data.corr(), ax=ax, cbar=False, annot=True, annot_kws={"size": annot_fnt},
                    xticklabels=True, yticklabels=True,  vmin=-1,vmax=1,fmt='.1f', cmap="Spectral")
    fmt_plot(g, nrows, ncols, row_d, col_d, None, "All Numeric Vars", vars[-1], fnt, tit)

# Uses Facet Grid for rendering Bar plot for each Facet in the Grid.
def FBar ( data, row_d, col_d, hue_d, x_d, y_d, fnt=8, ht=2.5, asp=1.5, tit=False, bars=True, cold=None, hue_order=None):
    if (hue_order is None):
        hue_order = sorted(data[hue_d].dropna().unique())
    if (cold is None):
        cold = sorted(data[col_d].dropna().unique())
    g = sns.FacetGrid(data, row=row_d, col=col_d, height=ht, aspect=asp, sharex=False, sharey=False, col_order=cold)
    g.map_dataframe(sns.barplot, x=x_d, y=y_d, hue=hue_d, dodge=True, palette=visa_palette, hue_order=hue_order, ci=None)
    nrows, ncols = unify_axes(data, g, row_d, col_d, hue_d, x_d, y_d, kind='bar')
    fmt_plot(g, nrows, ncols, row_d, col_d, hue_d, x_d, y_d, fnt,tit, kind=('bar' if (bars) else 'None') )

# Uses Facet Grid for rendering Box plot for each Facet in the Grid.
def FBox ( data, row_d, col_d, hue_d, x_d, y_d, fnt=8, ht=2.5, asp=1.5, tit=False, cold=None):
    hue_order = sorted(data[hue_d].dropna().unique())
    if (cold is None):
        cold = sorted(data[col_d].dropna().unique())
    g = sns.FacetGrid(data, row=row_d, col=col_d, height=ht, aspect=asp, sharex=False, sharey=False, col_order=cold)
    g.map_dataframe(sns.boxplot, x=x_d, y=y_d, hue=hue_d, palette=visa_palette, hue_order=hue_order) #.add_legend()
    nrows, ncols = unify_axes(data, g, row_d, col_d, hue_d, x_d, y_d)
    fmt_plot(g, nrows, ncols, row_d, col_d, hue_d, x_d, y_d, fnt,tit)

# Uses Facet Grid for rendering Violin plot for each Facet in the Grid.
def FViolin ( data, row_d, col_d, hue_d, x_d, y_d, fnt=8, ht=2.5, asp=1.5, tit=False, cold=None):
    hue_order = sorted(data[hue_d].dropna().unique())
    if(cold is None):
        cold = sorted(data[col_d].dropna().unique())
    g = sns.FacetGrid(data, row=row_d, col=col_d, height=ht, aspect=asp, sharex=False, sharey=False, col_order=cold)
    g.map_dataframe(sns.violinplot, x=x_d, y=y_d, hue=hue_d, palette=visa_palette, hue_order=hue_order)
    nrows, ncols = unify_axes(data, g, row_d, col_d, hue_d, x_d, y_d)
    fmt_plot(g, nrows, ncols, row_d, col_d, hue_d, x_d, y_d, fnt,tit)

# Uses Facet Grid for rendering Scatter plot for each Facet in the Grid.
def FScatter( data, row_d, col_d, hue_d, x_d, y_d, fnt=8, ht=2, asp=3, tit=False, cold=None):
    hue_order = sorted(data[hue_d].dropna().unique())
    if(cold is None):
        cold = sorted(data[col_d].dropna().unique())
    g = sns.FacetGrid(data, row=row_d, col=col_d, height=ht, aspect=asp, sharex=False, sharey=False, col_order=cold)
    g.map_dataframe(sns.scatterplot, x=x_d, y=y_d, hue=hue_d, palette=visa_palette, hue_order=hue_order)
    nrows, ncols = unify_axes(data, g, row_d, col_d, hue_d, x_d, y_d)
    fmt_plot(g, nrows, ncols, row_d, col_d, hue_d, x_d, y_d, fnt,tit)

## **Correct inconsistent and invalid data in the dataset**

### **Check for missing or null values. Check for Duplicates**

In [None]:
#Check for null values.
for c in skart.columns:
    col_vals(skart, c)
# Check for duplicates
print ( "Duplicate rows = ", "No duplicates" if not skart.duplicated().any() else " Duplicate rows exist")

##### **OBSERVATIONS**
- None of the columns have NULL or missing values.
- None of the values are negative.
- No duplicates in the Dataset
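
The cell above checks for missing values and duplicates; the claim about negative values can be verified with a similar one-liner. This is a minimal sketch on a toy frame with invented values and hypothetical store identifiers, in which the last row deliberately repeats the second.

```python
import pandas as pd

# Invented rows; the third row is an exact duplicate of the second.
toy = pd.DataFrame({
    "Product_Weight": [12.5, 9.1, 9.1],
    "Product_MRP": [117.0, 171.4, 171.4],
    "Store_Id": ["OUT001", "OUT002", "OUT002"],  # hypothetical identifiers
})

# Count negative entries in every numeric column.
numeric = toy.select_dtypes(include="number")
negative_counts = (numeric < 0).sum()

# Count rows that exactly repeat an earlier row.
duplicate_rows = int(toy.duplicated().sum())

print(negative_counts)
print("Duplicate rows:", duplicate_rows)
```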

### **Check Categorical variables for invalid entries**

In [None]:
vcdf = pd.DataFrame ( [{ 'Var': x, 'Total': skart[x].nunique(), 'Category': yind, 'Count': yval } for x in categories for yind, yval in skart[x].value_counts().items() ] )
vcdf

##### **OBSERVATIONS**

There are 6 categorical variables. The 6th one, "Store_Establishment_Year", is numeric but is treated as categorical for better analysis, as explained above. The table above shows row counts per category, so "sales" below refers to the number of product-store sales records.

- **Product_Type:** There are 16 product types, led by "Fruits and Vegetables", with "Seafood" at the bottom. Apart from Household and Health and Hygiene (and the catch-all Others), the categories are food and drink items.

- **Store_Type:** There are 4 store types, with "SuperMarket Type2" having the highest number of products sold.

- **Store_Location_City_Type:** It has 3 classifications: "Tier 1", "Tier 2" and "Tier 3". "Tier 1" has a higher standard of living. However, "Tier 2" has more sales records than the others.

- **Store_Establishment_Year:** The year is stored as a number, ranging from 1987 to 2009. The counts suggest that a newer location (probably with better ambience, more like a mall) attracts more sales than the older ones (probably with a more classic look).

- **Store_Size:** The sizes are classified as "Small", "Medium" and "High". Medium stores have the most sales records, ahead of High and Small.

- **Product_Sugar_Content:** The documentation lists 3 classifications: "Low Sugar", "Regular" and "No Sugar". The data contains an extra category, "reg", with very few entries; it must be corrected and treated as "Regular". "Low Sugar" products record the highest sales.

 - There are two Ids Product_Id and Store_Id. They uniquely identify products and stores respectively.

### **Correct the invalid data present in the Product_Sugar_Content column**

In [None]:
print ("Before replacement",skart[psc].value_counts() )
skart[psc] = np.where(skart[psc]=="reg", "Regular", skart[psc])
print ("\n\nAfter replacement",skart[psc].value_counts() )

##### **OBSERVATIONS**
- The values of Product_Sugar_Content are now reduced from 4 categories to 3, which matches the specification.
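
As an alternative to `np.where`, a stray label can be mapped onto the documented one with `Series.replace`. This is a minimal sketch on invented values; the `sugar` series is an assumption standing in for the real column.

```python
import pandas as pd

# Made-up values standing in for the Product_Sugar_Content column
sugar = pd.Series(["Low Sugar", "reg", "Regular", "No Sugar", "reg"])

# Map the stray label onto the documented one; other labels are left untouched
cleaned = sugar.replace({"reg": "Regular"})

print(cleaned.value_counts())
```

A dictionary keeps the mapping explicit and extends naturally if more stray labels turn up.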

## **Basic Analysis of the dataset**
- The dataset is complicated and offers many possible relationships to analyze.
- Therefore, get a good understanding of the basic relationships first.
- After the basic relationships are understood, we will do deeper Univariate, Bivariate and Multivariate Analysis.

### **Understanding Stores data in the dataset**
- Understand the variables Store_Id, Store_Size, Store_Location_City_Type, Store_Establishment_Year and Store_Type.
- Keep Product_Store_Sales_Total (the target variable) in the analysis as well, to get the basic relationship with the target.

#### **How many stores are there ?**

In [None]:
## Get all the stores
print( "Number of Stores = ", skart[sid].nunique() )
stores = skart.loc[:, [sid, ssize, sloctype, seyr, styp, target] ]
stores = stores.groupby([sid]).agg( {
    target:'sum',
    ssize:'first',
    sloctype:'first',
    seyr: 'first',
    styp: 'first'
    }).reset_index()
stores

#### **Store with highest sales**

In [None]:
smax = stores[target].idxmax()
print ( stores.loc[smax,:] )


##### **OBSERVATIONS**
- There are 4 stores, and each of them is uniquely identified by Store_Type. Store_Id is therefore redundant. The store with ID "OUT004" has the highest sales.

- The "Tier 2", "Medium" size store is doing well. It recorded the maximum revenue, about 250% of that of the "High" size store. It was established in 2009, as opposed to the other stores, which were established at least a decade earlier.
   - THIS SUGGESTS THAT THE MOST RECENTLY ESTABLISHED STORE PROBABLY HAS BETTER FACILITIES AND/OR TECH/AMBIENCE, WHICH ATTRACTS MORE CUSTOMERS.
   - THE BUSINESS MUST TAKE NOTE OF THIS AND CONSIDER RENOVATION.
- Departmental Store and Supermarket Type1 have almost equal revenues. Even though Supermarket Type1 is the oldest store (1987) and is in size category "High", it has lower revenue than the "Medium" size Departmental Store. However, the Departmental Store is in Tier 1 and Supermarket Type1 is in Tier 2, which might be the reason for its lower sales.
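
A statement such as "about 250% of the High size store" can be reproduced by dividing every store's total by a reference store's total. The sketch below uses invented totals and hypothetical store IDs ("A", "B", "C"); only the technique carries over to the `stores` frame built above.

```python
import pandas as pd

# Hypothetical per-store totals (numbers are invented, not the SuperKart results)
store_totals = pd.DataFrame({
    "Store_Id": ["A", "B", "C"],
    "Store_Size": ["Medium", "High", "Small"],
    "sales": [500.0, 200.0, 100.0],
}).set_index("Store_Id")

# Revenue of every store as a percentage of the "High" size store
reference = store_totals.loc[store_totals["Store_Size"] == "High", "sales"].iloc[0]
store_totals["pct_of_high"] = 100 * store_totals["sales"] / reference

print(store_totals)
```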

### **Understanding Products and Products Vs Stores relationships in the dataset**
- Understand Product Type variable. Its relationship with Stores
- Understand Product_Allocation_Area, Product_Weight and Product_MRP.
- Understand Product Vs Stores relationships.

#### **Number of Products**

In [None]:
print( "Number of Products = ", skart[pid].nunique())
products = skart.loc[:, [pid, ptyp, paa, pmrp, pwt, psc, target] ]
print( len(products), products.duplicated().value_counts() )

##### **OBSERVATIONS**
- There are 8763 products in total, each with a unique Product ID.
- No Product ID is repeated in the dataset. This means the dataset has already been aggregated by Product ID.

#### **Understanding Product Weight**
- For any analysis going forward, the data will be sliced by Product Type and Store Type.
- There are distinct patterns that will be noted for all the variables across combinations of Product Type and Store Type.
- There are 16 Product Types and 4 Store Types, a total of 64 combinations.
- The pie charts below aggregate the mean Product Weight across all Product Type and Store Type combinations.
- Each pie chart has 16 slices, one per Product Type, and each column has 4 pie charts, one for each store.
- The first column has the mean Weight and the second column has the sum of Area.
- **The pie charts will tell, for the overall sales (classified per product type and store type combination):**
    - What was the average Weight of the products sold in that product type per store?
    - What was the total Area occupied by the sold items?

**NOTE: The Total Area is a sum of ratios and not a true area. So it will reflect relative sales between product types but not give absolute values.**

In [None]:
pskw = skart.groupby([styp,ptyp]).agg({pwt:'mean', paa:'sum', pid:'count'}).unstack(level=0)
pskw.columns = [y+(" Weight" if x == pwt else " Pid" if x == pid else " Area") for x,y in pskw.columns]

ctyp = "Dummy"  # store type being plotted; the autopct callback below reads this global

def autopct_format_factory(df_vals):
    def autopct_format(pct, all_vals=df_vals):
        # matplotlib calls autopct once per wedge, in order, so the counter gives the row position
        autopct_format.counter += 1
        idx = autopct_format.counter - 1
        colnm = ctyp + " Pid"
        abs_val = all_vals.iloc[idx][colnm]  # order count for this product type and store type
        return f"{pct:.1f}%\n({abs_val})"
    autopct_format.counter = 0
    return autopct_format


fig, ax = plt.subplots(4, 2, figsize=(30, 50))
ax = ax.flatten()

for n, i in enumerate(skart[styp].unique()):
    ctyp = i
    ax1 = ax[2*n]
    ax2 = ax[2*n+1]
    y1c = i + " Weight"
    y2c = i + " Area"
    y3c = i + " Pid"
    # Left pie: mean weight per product type, annotated with the order count
    ax1.pie(pskw[y1c], labels=pskw.index, autopct=autopct_format_factory(pskw), textprops={'fontsize': 14}, radius=1.1)
    ax1.set_title(y1c + " AVG SUM = " + str(round(pskw[y1c].sum(), 3)) + " (Orders= " + str(pskw[y3c].sum()) + ")", fontsize=20, fontweight='bold')
    # Right pie: summed allocation area per product type
    ax2.pie(pskw[y2c], labels=pskw.index, autopct='%1.1f%%', textprops={'fontsize': 14}, radius=1.1)
    ax2.set_title(y2c + " SUM = " + str(round(pskw[y2c].sum(), 3)), fontsize=20, fontweight='bold')



##### **OBSERVATIONS**
- The pie chart on the left shows the sum of the mean Product Weights across unique combinations of Product Type and Store Type.

- The pie chart on the left clearly shows that the sum of means is highest for the Departmental Store. This means the Departmental Store is selling products with higher weights. These could be products with higher MRP and/or premium products. It could also be that multi-packs (5 packs or 10 packs) are bundled together and sold, which would explain the higher weights and therefore the higher sum of means.

- Supermarket Type2, as we already saw, has the highest sales. Here we see that it has a lower sum of means. However, it is still higher than that of the Food Mart. So the Food Mart is probably selling cheaper, lower-priced products rather than the premium range. Supermarket Type2 is selling small to average sizes. This could also be a reason why its sales are high: customers may not always prefer large/premium sizes, and having options between small and average sizes at regular prices might be preferred.

- Supermarket Type1 is unique here. As we saw already, it is the oldest store, from 1987. What we see here is that its weights are closer to those of the Departmental Store and significantly higher than those of its Type2 counterpart. This suggests it has a range that combines both Type2 and Departmental. It is probably offering mostly premium products while also carrying some of the most popular low-end and mid-range products. Type1 is size "High", so it could definitely be renovated to fit more products and give customers more options.

- The Food Mart has the lowest mean. It is more of a convenience store for quick and small purchases.

- The important thing is that the distribution of Product Weights across Product Types is the same for all 4 stores. Even though the stores are selling different sizes of products, the % distribution of product types across the stores is NOT changing significantly.

- Other interesting observations are:
    - On the right-side pie chart, even though it is the sum of all Area values in the dataset, it indirectly reflects overall sales per store (because it is just a sum, more orders push the value higher). It shows that Area is not really correlated with Weight. If it were, the Area sums would follow the mean weights; instead, Departmental has a higher mean weight than Type1 but a lower Area sum. The Area sum is directly proportional only to the number of orders and not to the weight of the product.
    - The orders are lower for the Departmental Store than for Supermarket Type1 and Supermarket Type2. Yet its sales marginally exceed those of Type1, and its revenue per order is higher than that of Type2 (even though its total sales are lower). Again, this suggests the Departmental Store is focusing on premium products, probably with higher MRPs.
    - The %s on the left and right pies are also not correlated. For example, for Seafood, Breakfast and Breads the Weight %s are all > 5%, but the corresponding Area sums are very small. This means these products are given less display area (allocation). Conversely, Frozen Foods, Meat and Household are instances where the display-area shares are larger than the weight shares. This also shows there is no direct correlation between the two.
    - The left-side pie shows the number of orders in brackets (). We can see that the right-side Area is directly proportional to this number of orders, as it is the sum of the areas of the items sold. But it is in no way related to the weights.
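
The claim that a summed Area tracks the number of orders rather than the weight can be checked numerically instead of by eye. The sketch below is an illustration on synthetic data only: the group names, sizes and distributions are assumptions, with area and weight drawn independently of each other.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic rows: five groups of very different sizes, area and weight drawn independently
sizes = {"g1": 20, "g2": 40, "g3": 80, "g4": 160, "g5": 320}
toy = pd.DataFrame({"group": np.repeat(list(sizes), list(sizes.values()))})
toy["area"] = rng.uniform(0.01, 0.10, len(toy))
toy["weight"] = rng.normal(12.0, 2.0, len(toy))

agg = toy.groupby("group").agg(
    orders=("area", "count"),        # number of rows (orders) per group
    area_sum=("area", "sum"),        # summed area, as in the right-hand pies
    weight_mean=("weight", "mean"),  # mean weight, as in the left-hand pies
)

corr_area_orders = agg["area_sum"].corr(agg["orders"])
corr_area_weight = agg["area_sum"].corr(agg["weight_mean"])
print(agg)
print("corr(area_sum, orders)      =", round(corr_area_orders, 3))
print("corr(area_sum, weight_mean) =", round(corr_area_weight, 3))
```

A sum grows with the row count almost by construction, so a high correlation with orders says little about the products themselves. Applying the same aggregation to `skart` grouped by store type and product type would show whether the real data behaves the same way.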

#### **Understanding Product MRP**
- We will do the same kind of analysis as we did above for Product Weight.
- The intention is to see whether the hypothesis made above, that Departmental Stores are selling premium products with higher MRP, is true or not.

In [None]:
psk = skart.groupby([styp,ptyp]).agg({pwt:'mean', pmrp:'mean', pid:'count'}).unstack(level=0)
psk.columns = [y+(" Weight" if x == pwt else " Pid" if x == pid else " MRP") for x,y in psk.columns]


ctyp = "Dummy"  # store type being plotted; the autopct callback below reads this global

def autopct_format_factory(df_vals):
    def autopct_format(pct, all_vals=df_vals):
        # matplotlib calls autopct once per wedge, in order, so the counter gives the row position
        autopct_format.counter += 1
        idx = autopct_format.counter - 1
        colnm = ctyp + " Pid"
        abs_val = all_vals.iloc[idx][colnm]  # order count for this product type and store type
        return f"{pct:.1f}%\n({abs_val})"
    autopct_format.counter = 0
    return autopct_format


fig, ax = plt.subplots(4, 2, figsize=(30, 50))
ax = ax.flatten()

for n, i in enumerate(skart[styp].unique()):
    ctyp = i
    ax1 = ax[2*n]
    ax2 = ax[2*n+1]
    y1c = i + " MRP"
    y2c = i + " Weight"
    y4c = i + " Pid"
    # Left pie: mean MRP per product type, annotated with the order count
    ax1.pie(psk[y1c], labels=psk.index, autopct=autopct_format_factory(psk), textprops={'fontsize': 18}, radius=1.1)
    ax1.set_title(y1c + " AVG SUM = " + str(round(psk[y1c].sum(), 3)) + " (Orders= " + str(psk[y4c].sum()) + ")", fontsize=20, fontweight='bold')
    # Right pie: mean weight per product type; the title adds MRP per unit of weight
    ax2.pie(psk[y2c], labels=psk.index, autopct='%1.1f%%', textprops={'fontsize': 18}, radius=1.1)
    ax2.set_title(y2c + " AVG SUM = " + str(round(psk[y2c].sum(), 1)) + ":\n MRP Price per 1 unit of Weight =" + str(round(psk[y1c].sum() / psk[y2c].sum(), 1)), fontsize=20, fontweight='bold')


##### **OBSERVATIONS**
1. The left-hand pie has AVG MRP and the right-hand pie has AVG Weight. The left-side figure for AVG MRP is as expected. AVG MRP is not directly influenced by the number of orders. It is influenced by the value of each order.
     - Even though the highest sales are in Supermarket Type2, its AVG MRP sum is lower at 2277, as opposed to the Departmental Store value of 2919.
     - It shows that lower-value products are being sold by the Type2 store compared to Departmental. But its number of orders is high (which is the cause of its higher sales).
2. The opposite is true for the Departmental Store. It has a lower number of orders and a higher value for each order. This is consistent with the previous analysis, as the Departmental Store is selling premium products of higher weight and hence higher order value.
3. What is interesting is the right-hand pie chart.
    - The AVG Weight and the ratio of MRP to 1 unit of Weight are shown in the title of the pie, the ratio on the second line.
    - The ratio is lowest, at 10.7, for the Food Mart. Departmental is the highest at 12, and Type2 SM, which has the highest sales, is in between at 11.5. Type1 SM is closer to the Departmental Store.
    - This is probably because of the location of the stores. The Departmental Store is located in "Tier 1", with a high cost of living. In spite of bulk-purchase discounts, the effective price per unit of weight comes to 12.
    - On the other hand, the Food Mart is in "Tier 3", with a low cost of living. In spite of selling smaller units, which usually cost more per unit, its price per unit of weight is lower than in "Tier 1".
    - The Supermarkets are in "Tier 2" cities and are therefore in the middle. Type1 SM is higher than Type2 as it sells higher weights (closer to Departmental).
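
The "MRP per 1 unit of Weight" figure in the pie titles is the sum of the per-product-type mean MRPs divided by the sum of the per-product-type mean weights. A small worked example, with invented numbers chosen so the arithmetic is easy to follow (they are not the SuperKart figures):

```python
import pandas as pd

# Invented per-product-type means for one store type
per_type = pd.DataFrame(
    {"mean_mrp": [150.0, 140.0, 130.0], "mean_weight": [13.0, 12.0, 10.0]},
    index=["Dairy", "Snack Foods", "Breads"],
)

mrp_sum = per_type["mean_mrp"].sum()        # 420.0, the "AVG SUM" of MRP in the left title
weight_sum = per_type["mean_weight"].sum()  # 35.0, the "AVG SUM" of weight in the right title
price_per_unit_weight = mrp_sum / weight_sum

print(round(price_per_unit_weight, 1))
```

This is a ratio of sums of group means, so it weights every product type equally regardless of how many products it contains. A per-product mean of MRP / Weight could give a different number.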

#### **Understanding Product Allocation Area**

##### **Product Allocation area in Stores**

In [None]:
print ( "paa area across stores ", skart.groupby(sid)[paa].sum(), "\n")

##### **OBSERVATIONS**
The "OUT004" store, which has the highest sales, also has the highest sum of Product Allocation Areas. As the given dataset is unique on Product IDs, one would expect this sum to be close to 1, as the area is a ratio by definition.
This needs more analysis. Let us compute the areas per store and product type, and check the mean values against the sales values across all stores.

##### **Understanding Product Allocation Area across Stores and product types**

In [None]:
pskw.columns

In [None]:
pskw = skart.groupby([styp,ptyp]).agg({target:'sum', paa:['mean','sum'], pid:'count'}).unstack(level=0)
pskw.columns = [z+ y +(" Sales" if x == target else " Pid" if x == pid else " Area") for x,y,z in pskw.columns]
# print( pskw.columns)
ctyp = "Dummy"  # column prefix for the store being plotted; the autopct callback below reads this global

def autopct_format_factory(df_vals):
    def autopct_format(pct, all_vals=df_vals):
        # matplotlib calls autopct once per wedge, in order, so the counter gives the row position
        autopct_format.counter += 1
        idx = autopct_format.counter - 1
        colnm = ctyp + " Pid"
        abs_val = all_vals.iloc[idx][colnm]  # order count for this product type and store type
        return f"{pct:.2f}%\n({abs_val})"
    autopct_format.counter = 0
    return autopct_format


fig, ax = plt.subplots(4, 2, figsize=(30, 50))
ax = ax.flatten()

for n, i in enumerate(skart[styp].unique()):
    ctyp = i + "count"
    ax1 = ax[2*n]
    ax2 = ax[2*n+1]
    y1c = i + "mean" + " Area"
    y11c = i + "sum" + " Area"
    y2c = i + "sum" + " Sales"
    y3c = i + "count" + " Pid"
    # Left pie: mean allocation area per product type, annotated with the order count
    ax1.pie(pskw[y1c], labels=pskw.index, autopct=autopct_format_factory(pskw), textprops={'fontsize': 14}, radius=1.1)
    ax1.set_title(y1c + " SUM = " + str(round(pskw[y1c].sum(), 3)) + " (Orders= " + str(pskw[y3c].sum()) + ")", fontsize=20, fontweight='bold')
    # Right pie: total sales per product type; labels add sales per unit of summed area, in thousands
    labels = [f"{idx}\n({round(val, 2)}K)" for idx, val in zip(pskw.index, pskw[y2c] / (1000 * pskw[y11c]))]
    ax2.pie(pskw[y2c], labels=labels, autopct='%1.2f%%', textprops={'fontsize': 14}, radius=1.1)
    ax2.set_title(y2c + " SUM = " + str(round(pskw[y2c].sum(), 2)) + "\n Average sales per Area = " + str(round(pskw[y2c].sum() / (1000 * pskw[y11c].sum()), 2)) + "K", fontsize=20, fontweight='bold')



##### **OBSERVATIONS**
General understanding of the pie charts:
- The Product Allocation Area values are ratios of a product's area to that of the other products within the same Product Type per store. They are not ratios to the total area across all products in all product types. This is confirmed by the mean-value sum shown in the titles of the left-hand pie charts, which is almost 1 for all the stores.
- The pie charts on the left are created one per Store Type, and each shows the 16 Product Types. The orders are shown inside the pie chart in (), and the overall orders for the store are shown in the title.
- The %s inside the left pie charts are the shares of the mean Area per Product Type.
- The right-side pie charts show the overall sales per store, divided into Product Types by %.
- The labels of the right-side pie charts carry another value in (): the sales divided by the summed Area for that Product Type, in thousands (K).

Observations:
- The Product Allocation Areas are very similar in all 4 stores. This is confirmed by the left-hand column of pie charts for Area. Each store has almost the same % allocation when we take the mean value. The min and max vary within a product type group, but the overall %s of display-area allocation are very similar, with minor variations made by the store business owners.
- Sales do not correspond to the Area. In particular, products like 'Starchy Foods', 'Others', 'Breads', 'Breakfast' and 'Seafood' take more allocation area in all stores but generate less revenue.
- This could also mean that products in these categories take less area or simply have lower sales. It is also an input to the business to check whether the allocation areas need optimization or an innovative management of non-display storage and display storage. It could also mean the store can reduce the allocation area for these, and this might influence overall sales.

- Key takeaway: Area has an indirect impact on Sales, with no direct correlation. By reducing the allocation areas for products that sell less, the display areas can be maximized for other products that sell better. In this way Area influences Sales, but not directly.
- Moreover, we see that Area can be positively or negatively correlated with Sales. "Baking Goods", for example, has less Area and higher Sales, but for Breads it is the opposite. This indicates that there is no direct correlation.

- Product area allocation is determined and controlled by the store administration, whereas product weights and MRPs are agreed mutually between the store administration and product vendors. Therefore, products with different weights and MRPs can have the same allocation area. Also, lighter products of the same weight can have lower or higher MRPs depending on the product type. Similarly, smaller products fit easily into smaller areas and vice versa. These possibilities make the relationship between MRP, Area and Weight very complex. Hence it is not possible to determine patterns of correlation visually. The patterns are complex and deeply embedded in the product types and store types.

**IMPORTANT**
- **Based on the analysis of MRP, Weight and Allocation Area, one can conclude that none of these has a clear linear correlation with Sales. It is a complex relationship influenced by many factors, and it is better handled by ensemble models than by visual inspection.**

#### **Number of products sold in each store**

In [None]:
print ( "No of Products sold in each store ", skart.groupby(sid)[pid].nunique(), "\n")

##### **OBSERVATIONS**
- From this data it is pretty clear why "OUT004" has high sales. It simply offers a very wide product range.
- There could be many factors contributing to this, but the store businesses must look at how to maximize their allocation areas and increase their product range.

#### **Is each product sold by one store or by several stores?**

In [None]:
print ( "Sum of all the products sold across stores ", skart.shape[0], sum( skart.groupby(sid)[pid].nunique().values))
print ( "Products sold in multiple stores. ", sum( skart.groupby(pid)[sid].count() > 1) )
print ( "Number of product types sold by each store ", skart.groupby(sid)[ptyp].nunique())

##### **OBSERVATIONS**
- All stores are selling all 16 product types.
- The Product IDs are unique and do not overlap across the stores. This suggests that the stores may be in geographically different areas where the vendors supplying the products are different, hence the different Product IDs. Standardized vendors across stores may not be present, or products may not be given common IDs across stores. This is something the business can look to standardize so that products doing well in some cities can be offered in other cities.
- The first two characters of the Product ID might be useful as an additional categorical variable for studying any pattern of sales.
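
Both ideas, checking whether a product appears in more than one store and deriving a category from the ID prefix, can be sketched as follows. The rows are made up; the column names mirror the dataset, and `prefix` is a hypothetical column name used only for this illustration.

```python
import pandas as pd

# Made-up rows; the column names mirror the dataset, the values do not
toy = pd.DataFrame({
    "Product_Id": ["FD101", "FD102", "NC201", "DR301"],
    "Store_Id": ["OUT001", "OUT002", "OUT001", "OUT003"],
})

# Number of distinct stores each product appears in
stores_per_product = toy.groupby("Product_Id")["Store_Id"].nunique()
shared_products = int((stores_per_product > 1).sum())

# First two characters of the product ID as a candidate categorical feature
toy["prefix"] = toy["Product_Id"].str[:2]

print("Products sold in more than one store:", shared_products)
print(toy["prefix"].value_counts())
```

Using `nunique()` rather than `count()` counts distinct stores, so a product listed twice in the same store would not be reported as shared.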


#### **Top five products types across all stores**

In [None]:
print ( "Top five selling products type \n", skart.groupby(ptyp)[target].sum().sort_values(ascending=False).head(5))
print ( "Top five selling products type per store \n", (skart.groupby([ptyp, styp], as_index=False)[target].sum()).sort_values([styp, target, ptyp], ascending=[True, False, True]).groupby([styp]).head(5) )

##### **OBSERVATIONS**
- THE TOP PRODUCT TYPES ARE THE SAME ACROSS ALL STORES. ONLY THE SALES VOLUMES VARY.
- This suggests that people are coming to the stores for the top 5 categories, and they may experiment if new products are offered in these categories. The other stores should explore the products from "OUT004" and optimize their allocation area to stock them as well and increase sales.

#### **Stores Top 10 Allocation Area per product type**
- Check if the stores are allocating area correctly as per the sales.

In [None]:
store_paa = skart.groupby([styp, ptyp], as_index=False).agg ({paa:'sum'}).sort_values(by=[styp, paa], ascending=[False,False]).groupby([styp]).head(10).reset_index().loc[:,[styp,ptyp,paa]].pivot_table(index=styp, columns=[ptyp], values=paa, aggfunc='sum').sort_values(by='Supermarket Type2', axis=1, ascending=False)
store_paa['type']='paa'

store_sales = skart.groupby([styp, ptyp], as_index=False).agg ({target:'sum'}).sort_values(by=[styp, target], ascending=[False,False]).groupby([styp]).head(10).reset_index().loc[:,[styp,ptyp,target]].pivot_table(index=styp, columns=[ptyp], values=target, aggfunc='sum').sort_values(by='Supermarket Type2', axis=1, ascending=False)
store_sales['type']='sales'

store_paa_vs_sales = pd.concat( [store_sales.reset_index(), store_paa.reset_index()], axis=0, ignore_index=True )
store_paa_vs_sales[styp] = ( (store_paa_vs_sales[styp]).astype(str) + "__" + store_paa_vs_sales['type'] ).astype('category')
store_paa_vs_sales.reset_index(drop=True).sort_values(by=[styp, ], ascending=[False]).iloc[:,0:11]

# store_paa_vs_sales.info()

##### **OBSERVATIONS**
- The top 10 Product Types by allocation area are the same for all the stores. We computed only the top 10 entries, and the non-zero values in these entries indicate that the categories are the same; only the sales % differ.
- The allocation area sizes are correctly adjusted to sales for all the stores in most cases. There are a few anomalies, as we saw earlier, for items like Seafood in some stores, which need inspection.
- For example, check the "Household" and "Frozen Foods" items in "Supermarket Type2" and "Food Mart". In the Supermarket, sales of "Frozen Foods" are higher. In the "Food Mart", sales of "Household" are higher. The allocation area is adjusted accordingly in both.
- It shows that the stores have adjusted their respective areas according to their sales. However, considering the success of "OUT004", there is a need for the other stores to explore other product types and a better layout for maximizing allocation area.
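
One way to make "area is allocated according to sales" checkable is to rank product types within each store by sales and by area and look for disagreements. This is a sketch on invented figures; `store_type`, `product_type`, `sales` and `area` are placeholder column names, not the notebook's variables.

```python
import pandas as pd

# Invented sales and area figures per (store type, product type)
toy = pd.DataFrame({
    "store_type": ["S1"] * 3 + ["S2"] * 3,
    "product_type": ["Dairy", "Snacks", "Breads"] * 2,
    "sales": [30.0, 50.0, 10.0, 5.0, 20.0, 40.0],
    "area": [0.3, 0.5, 0.2, 0.4, 0.2, 0.6],
})

# Rank product types inside each store, 1 = largest
toy["sales_rank"] = toy.groupby("store_type")["sales"].rank(ascending=False).astype(int)
toy["area_rank"] = toy.groupby("store_type")["area"].rank(ascending=False).astype(int)

# Rows where the two ranks disagree are candidates for a closer look
mismatch = toy[toy["sales_rank"] != toy["area_rank"]]
print(mismatch)
```

In this toy data store S1 is consistent, while Dairy and Snacks in S2 have swapped ranks, which is the kind of anomaly noted above for Seafood.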

## **Feature Engineering - Part 1**

#### **Add the first two characters of Product ID as a separate column**
- In model building we will remove Product ID.
- But the first two characters of the Product ID have a specific meaning, as they classify the products differently.
- An analysis of this might also be useful in identifying any patterns in sales.

In [None]:
pidc2 = "pid_c2"
skart[pidc2] = skart[pid].apply(lambda x: x[0:2])
categories = categories + [pidc2]
products[pidc2] = products[pid].apply(lambda x: x[0:2])
labeled_barplot(skart, pidc2, perc=True)

##### **OBSERVATIONS**
- As expected, the food items have the highest order share compared to the other two categories.
- More analysis is needed to see whether the %s of food items are similar or different across stores. This will be done in the bivariate sections. If the analysis shows that they are similar, we drop Product Type and use PID_C2 in model building. Otherwise we drop PID_C2 and keep Product Type.

#### **Convert all Object datatypes to Categorical types for Univariate analysis**

##### **Prepare the ordinal order for each categorical variable**

In [None]:
vcdf = pd.DataFrame ( [{ 'Var': x, 'Total': skart[x].nunique(), 'Category': yind, 'Count': yval } for x in categories for yind, yval in skart[x].value_counts().items() ] )
psc_order = ["No Sugar", "Low Sugar", "Regular"]
ptyp_order = vcdf.loc[(vcdf['Var']==ptyp), ['Category', 'Count']].sort_values(by='Count', ascending=True)['Category'].values.tolist()
ssize_order = ["Small", "Medium", "High"]
sloctype_order = ["Tier 1", "Tier 2", "Tier 3"]
styp_order = vcdf.loc[(vcdf['Var']==styp), ['Category', 'Count']].sort_values(by='Count', ascending=True)['Category'].values.tolist()
seyr_order = vcdf.loc[(vcdf['Var']==seyr), ['Category', 'Count']].sort_values(by='Count', ascending=True)['Category'].values.tolist()
pidc2_order = vcdf.loc[(vcdf['Var']==pidc2), ['Category', 'Count']].sort_values(by='Count', ascending=True)['Category'].values.tolist()
cat_order_map = {psc:psc_order, pidc2:pidc2_order, ptyp:ptyp_order, ssize:ssize_order, sloctype:sloctype_order, styp:styp_order, seyr:seyr_order}

##### **Convert to Categorical and use the ordinal order to set into cat.codes directly**
- This way no special ordinal encoding is required.
- All categorical variables have an inherent ordinal order except the product type.
- There are 16 product types and all stores are selling all product types.
- As we saw earlier, the top 10 product types are the same across stores and only the sales differ.
- Therefore, even for the product type we use label encoding, taking the overall sales count (number of rows per category) as the ordinal order.
- This avoids the complexity of multi-dimensional one-hot encoding and the curse of dimensionality, which is not really necessary in this case.
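
How an ordered categorical turns into integer codes can be seen on a tiny series. The values below are illustrative; the category list plays the role of one entry in `cat_order_map`.

```python
import pandas as pd

sizes = pd.Series(["Medium", "Small", "High", "Medium"])

# The category list defines the order; codes are the positions in that list
ordered = pd.Categorical(sizes, categories=["Small", "Medium", "High"], ordered=True)
codes = pd.Series(ordered).cat.codes

print(dict(zip(ordered.categories, range(len(ordered.categories)))))
print(codes.tolist())
```

Any value missing from the category list becomes NaN and gets the code -1, so the order lists need to cover every label present in the data.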

In [None]:
# Even though Established year is Numeric it is better handled as Categorical. So changing it.
print (categories)
for ft in categories:
    # Pass the predefined category order so that cat.codes follow it
    skart[ft] = pd.Categorical(skart[ft], categories=cat_order_map.get(ft), ordered=True)

# [print ( dict(zip( skart[x].cat.categories, range(len(skart[x].cat.categories)))), skart[x].cat.codes.tolist()) for x in categories]

## **Univariate Analysis**

### **Plot (count plot) all categorical variables with sub categories**

In [None]:
# Use labelling and plot a countplot which looks like a barplot with a value on the top.
cats = categories+[target]
skart_cat = skart.loc[:, cats]

# Do it for all the categorical variables.
for ft in categories:
    labeled_barplot(skart_cat, ft, True)

# Also count stores and number of products in each store.
labeled_barplot(skart, sid, True)

##### **OBSERVATIONS**

**Product_Sugar_Content**
- "Low Sugar" products are selling the most compared to "Regular" and "No Sugar".

**Product_Type**
- "Fruits and Vegetables" are the highest sellers, followed by "Snacks" and then by "Frozen Foods" and "Dairy". "Seafood" and "Starchy Foods" are at the bottom of the sales.

**Store_Size**
- "Medium" size stores are recording the highest sales compared to "Small" and "High".

**Store_Location_City_Type**
- "Tier 2" location stores are recording the highest sales compared to "Tier 1" and "Tier 3".

**Store_Type**
- Out of the four stores, "OUT004", which is "Tier 2" and "Medium" size, has recorded the highest sales. This store is also the primary driver of the high sales in the "Tier 2" and "Medium" size classes.

**Store_Establishment_Year**
- The most recently established store (2009) has recorded high sales, giving the impression that customers prefer the modern ambience and outward look of the store.

**Product_ID (first two characters)**
- This variable is derived from the Product ID. It has three values: FD (Food items), DR (Drinks) and NC (Non Consumables). This additional variable was created for analyzing any patterns of deviation across the different stores.
- We can see in the count plot above that FD items are the highest selling, followed by Non Consumables and finally Drinks.


### **Plot (KDE/Histogram and Box) for numeric variables**
- For numeric data fields, create an additional (equivalent) categorical field by binning into a maximum of 6 bins as per quantiles and outliers.
- Binning will allow additional insights on the numeric data as it can be cross-tabbed with categorical data.
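
The quantile-and-outlier binning described above can be tried on a small series first. The values are made up so that each bin, including both outlier bins, receives at least one value.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 10, 11, 12, 13, 14, 15, 16, 40], dtype=float)

q1, med, q3 = values.quantile([0.25, 0.5, 0.75])   # 11, 13, 15 for this series
iqr = q3 - q1                                       # 4
# Fences at Q1 - 1.5*IQR = 5 and Q3 + 1.5*IQR = 21
edges = [-np.inf, q1 - 1.5 * iqr, q1, med, q3, q3 + 1.5 * iqr, np.inf]
labels = ["Lower Outlier", "Q1 Range", "Q2 Range", "Q3 Range", "Q4 Range", "Upper Outlier"]

# Intervals are right-closed by default, so a value equal to an edge falls in the lower bin
binned = pd.cut(values, bins=edges, labels=labels)
print(pd.concat([values, binned], axis=1, keys=["value", "bin"]))
```

`pd.cut` needs strictly increasing edges, so a column where two of these statistics coincide (for example Q1 equal to the median) would need separate handling.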

In [None]:
nums = numeric_vars
print(nums)

# Data is pd.Series. It will compute Quantiles, IQR and generate bins according to quantiles including outliers
def gen_bins (data):

    # Calculate quartiles and IQR
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1

    # Outlier thresholds
    lower_outlier = Q1 - 1.5 * IQR
    upper_outlier = Q3 + 1.5 * IQR

    # Define bin edges
    bin_edges = [
        -np.inf,                 # extreme lower outlier start
        lower_outlier,           # lower outlier end
        Q1,                      # first quartile
        data.median(),           # median
        Q3,                      # third quartile
        upper_outlier,           # upper outlier start
        np.inf                   # extreme upper outlier end
    ]

    # Define bin labels
    bin_labels = [
        "Lower Outlier",         # < lower_outlier
        "Q1 Range",              # between lower_outlier and Q1
        "Q2 Range",              # between Q1 and median
        "Q3 Range",              # between median and Q3
        "Q4 Range",              # between Q3 and upper_outlier
        "Upper Outlier"          # > upper_outlier
    ]
    return bin_edges, bin_labels

target_bin = target+"_bin"
paa_bin = paa+"_bin"
pmrp_bin = pmrp+"_bin"
pwt_bin = pwt+"_bin"


In [None]:
skart_num = skart.loc[:, nums].copy()  # copy so that adding bin columns does not write into a slice of skart
for ft in nums:
    bin_edges, bin_labels = gen_bins(skart_num[ft])
    skart_num[ft+"_bin"] = pd.cut(skart_num[ft], bins=bin_edges, labels=bin_labels)
    histogram_boxplot(skart, ft, kde=True, bins=20)
# Binned numeric data is also useful for categorical analysis. Therefore append the bin columns to skart_cat
length = len(nums)
skart_cat = pd.concat([skart_cat, skart_num.iloc[:, length:2*length]], axis=1)

##### **OBSERVATIONS**

**Product_Weight**
- Nicely symmetric data forming a bell curve.
- There are outliers (shown in the box plot) on either side of the mean.
- This is expected because there is a wide range of product weights, all controlled into standard sizes in oz and lbs.
- So, for a supermarket product inventory, this kind of symmetric graph is expected for food products, and the outliers are not a problem.

**Product_Allocated_Area**
- Right-skewed data. The mean is being pulled to the right by outliers on the right-hand side.
- This shows that there are some products taking higher allocation areas and many smaller products taking lower allocation areas.
- This right skewness is also expected because there will be many products in smaller sizes which do not demand a lot of allocation area.
- On the other hand, there will be some products in bigger packages or higher quantities (Frozen Foods etc.) which take larger areas and are fewer in number.
- The right-side outliers can also be safely ignored here. The key question to check is: "Are the sales proportional to the display areas that are allocated?"

**Product_MRP**
- Like Product_Weight, this data is very symmetric and has outliers on either side of the mean.
- This is also expected, as the MRP will span a wide range across the 8763 Product IDs in the dataset.
- It will definitely go beyond the permissible lower and upper bounds calculated using the IQR.
- The important thing to check here is which categories or MRP ranges sell the most, giving an indication of how much users usually spend in each of the product categories.

**Product_Store_Sales_Total**
- This is also a nicely symmetric graph. It shows that sales are symmetric: all combinations of products are being purchased.

## **Bivariate/Multivariate Analysis with target variable**
- Use the skart_num dataframe for numeric data fields. It has bins created for each numeric field; with bins, a numeric field can also be treated as a categorical variable.
- Use the skart_cat dataframe to study the influence of categorical variables on the target variable.

### **Need for creating a transformed dataset for Analysis**

##### **OBSERVATIONS**
- As we saw in the basic analysis, variables like Product_Weight, Product_Allocated_Area and Product_MRP depend strongly on the Product_Type.
- The store size, location and even establishment year influence the overall product prices and weights, and are closely related to the Store_Type.
- A general analysis of the features will lose many patterns. The features have to be organized by product and store, and their means, sums, maxima and minima must be taken for a better analysis.
- Therefore, let's create a transformed dataset that represents the relationships between features more closely.
- Also, the numeric variables have different scales: Product_Allocated_Area is 0 to 1, the target variable has very high values, and MRP and weight have intermediate values. It is difficult to analyze the correlations without scaling.
- This transformed dataset must scale the numeric variables. Its main purpose is to identify patterns, not to find actual values or metrics.

In [None]:
skart_AW = skart.groupby([styp, ptyp]).agg( {paa:['max', 'min', 'mean'], target:['sum','count','mean'], pmrp:['mean','max','min'], pwt:['mean','max','min'] })
skart_AW = skart_AW.stack(level=[0,1]).reset_index()
skart_AW.columns = [styp, ptyp, 'feature', 'aggfunc', 'value']

# Sales values are very large, while Product_Weight and Product_Allocated_Area are small,
# which makes visualization difficult. Hence, scale the values to the 0 to 1 range.
# The plot will not show actual numbers; we are interested only in relative increases.

# Initialize MinMaxScaler
scaler = MinMaxScaler()
# Scale the DataFrame
mask_target = ( (skart_AW['value'] >=1 ) & (skart_AW['feature'] == target) )
skart_AW.loc[mask_target, 'value'] = scaler.fit_transform(skart_AW.loc[mask_target, ['value']])

scaler = MinMaxScaler()
mask_mrp = ( (skart_AW['value'] > 1) & (skart_AW['feature'] == pmrp) )
skart_AW.loc[mask_mrp, 'value'] = scaler.fit_transform(skart_AW.loc[mask_mrp, ['value']])

scaler = MinMaxScaler()
mask_wt = ( (skart_AW['value'] > 1) & (skart_AW['feature'] == pwt) )
skart_AW.loc[mask_wt, 'value'] = scaler.fit_transform(skart_AW.loc[mask_wt, ['value']])


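# For uses that need one row per (product type, store type) pair, pivot the aggregates into columns.
# Note: despite its name, skart_long holds one column per feature/aggfunc pair (wide format).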
skart_long = skart_AW.pivot(index=[ptyp,styp], columns=['feature', 'aggfunc'], values='value')
skart_long.columns = [x+"_"+y for x,y in skart_long.columns]
skart_long.reset_index(inplace=True)

# Add the key metrics which are used in this dataset.
skart_long['mrp_2_weight'] = skart_long[pmrp+"_mean"]/skart_long[pwt+"_mean"]
skart_long['area_2_sales'] = skart_long[target+"_mean"]/skart_long[paa+"_mean"]

#Add product types to skart_long
skart_long = skart_long.merge(products.loc[:, [ptyp, pidc2]].drop_duplicates(), on=ptyp, how='left')
#Add Store attributes to skart_long
skart_long = skart_long.merge(stores[[styp, ssize, sloctype, seyr]], how='left', on=styp)
skart_long[pidc2] = pd.Categorical(skart_long[pidc2])
skart_long[sloctype]= pd.Categorical(skart_long[sloctype])
skart_long[seyr] = pd.Categorical(skart_long[seyr])
skart_long.info()
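The per-feature scaling idea used above can be illustrated on a toy table. This is only a sketch with hypothetical names (`long_df`, `min_max`) and made-up values; it applies the same linear rescaling that `MinMaxScaler` performs with its default 0 to 1 range, written in plain pandas.

```python
import pandas as pd

# Hypothetical long-format table (assumption): one row per (feature, group) value.
long_df = pd.DataFrame({
    "feature": ["sales", "sales", "sales", "mrp", "mrp", "mrp"],
    "group": ["A", "B", "C", "A", "B", "C"],
    "value": [2000.0, 3500.0, 5000.0, 100.0, 150.0, 175.0],
})


def min_max(s: pd.Series) -> pd.Series:
    """Rescale a series linearly so its minimum maps to 0 and its maximum to 1."""
    return (s - s.min()) / (s.max() - s.min())


# Scale each feature separately so large-valued features do not dominate a shared axis
long_df["scaled"] = long_df.groupby("feature")["value"].transform(min_max)
print(long_df)
```

After scaling, only the relative position of a value within its own feature is preserved, which is why the plots built on the transformed dataset should be read for patterns and not for actual magnitudes.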

### **Heat Map Analysis for Numeric variables**

In [None]:
# This is a dense visual, so shorter feature names are used to save space
vars = ['P_Wgt', 'P_Area', 'P_MRP', 'Sales']
print( "Using 'P_Wgt' for 'Product_Weight', \n'P_Area' for 'Product_Allocated_Area', 'P_MRP' for 'Product_MRP', 'P_Type' for 'Product_Type','S_Type' for 'Store_Type' and 'Sales' for the target")
skart_hmaps = skart.loc[:, [ptyp,styp,pwt,paa,pmrp,target]]
skart_hmaps.columns = ['P_Type', 'S_Type', 'P_Wgt', 'P_Area', 'P_MRP', 'Sales']

In [None]:
FHMap(skart_hmaps, 'P_Type', 'S_Type', vars, ht=None, asp=None, tit=True, fnt=8, annot_fnt=10, debug=True)

##### **OBSERVATIONS**

- The overall heatmap (the first heatmap above, outside the grid) shows only the following correlations:

    - The correlation between Product_Weight and Product_Store_Sales_Total (the target variable) is 0.7. This does not reflect reality: it is an average that gives a misleading impression of the data. For a better analysis, the individual heatmaps for each unique combination of Product_Type and Store_Type must be examined. These are shown above as 64 (16 product types x 4 store types) smaller heatmaps.

    - There is also a strong correlation of 0.8 between MRP and sales. This does not reflect reality either. In reality, MRP correlates well with sales in the supermarkets but only weakly in the Food Mart and the Departmental Store. Because the supermarkets dominate the number of orders, the strong correlation shows up overall.


- Let's do a deeper analysis of the smaller heatmap grid, computed one per product and store combination:

**Product Area**
- Check cell (1,2) of the grid: Seafood in the Departmental Store. It is the second cell of the first row in the 16x4 grid of heatmaps.

It shows that area is strongly negatively correlated with sales. This means:
- (1) "Seafood" takes more space, but the revenue (sales) it generates is low compared to other food items that occupy a similar area.
- (2) "Seafood" has zero or positive correlation in the Food Mart and Supermarket Type1. This is because these stores allocate a smaller area percentage to it (see the pie chart in the section "Understanding Product Allocation Area" above), which weakens the correlation with sales.
- (3) For Supermarket Type2 the correlation is negative but small. This is because the area of the products, and thereby their weight, is smaller. The Type2 store does not sell premium products; it sells smaller sizes, so the store must have allocated a smaller area.

**In Summary**
- "Seafood" occupies more area in stores without generating correspondingly high sales. The relevant store businesses should check all negative correlations with sales in this heatmap.
- Check the cells in the first column (Food Mart) for Starchy Foods, Breads, Canned and Frozen Foods. They all fall in the same category, where area and MRP are negatively correlated with sales. The Food Mart business should see whether it can reduce the allocated areas and work out even smaller package sizes for these items.
- Check the cells for "Others" (row 4 in the diagram above). It has a positive correlation only in the Departmental Store. All other stores have a negative correlation, except Type1 SM at 0. This suggests that "Others" are sold as premium products and that the Departmental Store has moved away from the general area-allocation guidelines to custom guidelines for them. We can confirm this observation in the pie chart in the "Understanding Product Allocation Area" section above: the share is more than 7% in the Departmental Store, compared to 6% or less in the other stores. Type1 SM, as observed earlier, sells a mix of premium products and lower MRPs. It is a little closer to the Departmental Store and hence shows 0. Type2 SM, however, is losing on area here. The Food Mart is also losing, but less than Type2 SM.

**Product_Weight**
- It is well managed in Type2 SM and Type1 SM: the weights of all products are positively correlated with sales. Product sizes (like 0.5 lb or 10 oz) can be offered in many variants, but the right sizes and quantities (like a 5 pack or 10 pack) must match what customers typically need. These stores (the supermarkets) manage this well. The Type2 supermarket carries a long range of sizes in the mid and lower categories, while Type1 SM carries a higher range of sizes and very few selected lower sizes (refer to the scatter plots in the "Influence of Product_Weight" section below).

**Product MRP**
- It shows strong correlations for supermarkets Type1 and Type2, and low correlations for the Food Mart and the Departmental Store. This is because it depends heavily on the following factors:
  - The number of orders, NOT the sales values. The supermarkets have a higher number of orders. The Departmental Store has higher sales than the Type1 supermarket but fewer orders. The Food Mart also has fewer orders.
  - Another important factor is the range of MRPs. Refer to the graph under "Transformed dataset analysis" below. It shows that the MRP range of the two supermarkets is almost the same (scaled value range 0.35 to 0.70). The range for the Departmental Store is much higher and that for the Food Mart much lower. In the MRP ranges of the Food Mart and/or the Departmental Store (premium products), higher MRP values are more likely to come with lower sales, and vice versa. The supermarkets sell daily-consumption items, so a balanced MRP-to-sales correlation is more likely.
  - However, this requires proper product selection, and the Type1 supermarket is doing a very good job at it. The Type2 supermarket should learn to do better by adopting the MRP and product patterns of Type1.
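The contrast drawn above between the overall heatmap and the per-combination heatmaps can be reproduced on a deliberately exaggerated toy example. The data below is made up and the store labels are used only for illustration; the point is that a pooled correlation can be strongly positive even when the relationship inside every group is negative.

```python
import pandas as pd

# Hypothetical data (assumption): two store types with very different price levels.
# Inside each store, sales fall slightly as MRP rises.
df = pd.DataFrame({
    "store": ["Food Mart"] * 4 + ["Departmental"] * 4,
    "mrp": [50, 60, 70, 80, 200, 210, 220, 230],
    "sales": [1100, 1050, 1000, 950, 5100, 5050, 5000, 4950],
})

# Correlation over all rows, ignoring the store type
pooled = df["mrp"].corr(df["sales"])

# Correlation computed separately inside each store type
per_store = {
    store: grp["mrp"].corr(grp["sales"])
    for store, grp in df.groupby("store")
}

print(f"pooled correlation: {pooled:.2f}")
for store, r in per_store.items():
    print(f"{store}: {r:.2f}")
```

The pooled figure here is driven by the gap between the two stores, not by anything happening inside them, which is the reason the grid of smaller heatmaps is the more reliable view.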

### **Pair plot for Numeric data**
- A traditional pair plot is useful when plotted with Product_Store_Sales_Total_bin as the hue.
- We have defined the bins according to outliers, quartiles and IQRs.
- Additionally, let's plot a custom view of each variable's relationship with the target variable.

#### **Original dataset**

In [None]:
# One marker per store type: hue=styp has four levels, so four markers are supplied
sns.pairplot(data=skart, hue=styp, diag_kind='kde', markers=["o", "s", "D", "^"])
# Show the plot
plt.show()

##### **OBSERVATIONS - Original dataset**

**Product_Weight**
- Departmental Stores and Type1 supermarkets sell products with higher MRPs (closer to wholesale). The Departmental Store is solely focused on higher weights and MRPs (premium products, most likely with premium packaging adding to the weight).
- Type1 is also focused on premium products, but it puts equal focus on selected products in the lower MRP range. The green dots have a distinct pattern: a higher range, then a gap, then a lower range. It seems to be a very well planned strategy.
- The Type1 store was established in 1987, so one would expect it to have matured in its strategy, learned which MRPs sell best, and optimized its product ranges to maximize sales.
- The Type2 supermarket sells low to medium ranges of product packages (hence average weights).
- The Food Mart is popular for the low range of weights. It sells "loose" products at low prices, typically in small or single-piece quantities.
- Product_Weight shows a linear correlation with the target variable. The tilt (angle) of the trend is steeper than for Product_Allocated_Area and Product_MRP, as noted under those headings below.
- The KDE graph peaks highest for Type2, reflecting its high sales. For Type1 (green), the KDE shows two distinct clusters, matching the two clusters of MRPs that it targets. The Food Mart and Departmental KDEs show almost mutually exclusive markets targeted by the two stores.
- Overall, weight has a positive correlation with sales, but there are many anomalies that are complex to visualize.

**Product_Allocated_Area**
- The pattern across the 4 stores is similar to that of Product_Weight, but there is no real linear correlation. Area allocation is solely at the discretion of the stores, within the standard guidelines for all stores. There is no relation such as a larger area for higher MRPs or higher weights.
- The KDE is right skewed. This means the store has a chosen set of products (typically the highest-selling ones) that are always given a larger display area, and a long tail of lower-selling products with smaller display areas.
- It does not show any steep tilt. It has a mixture of flat and positive correlations.
- The interesting aspect for the business is the strong negative correlations for selected product and store combinations, such as Seafood in the Departmental Store, where area influences sales negatively. The business can look at these and correct the display-area allocation for these items.
- Overall, it is difficult to see any direct correlation with the target. Where some correlation is visible, it is very weak and/or negative.

**Product_MRP**
- It is roughly similar to the weights, but not fully, because prices vary with the cost of living. The angles are not as steep. Since Type1 and the Departmental Store sell in bulk, we would expect effective MRPs to come down, but that did not happen because the Departmental Store is in "Tier 1" and Type1 is in "Tier 2". However, for a "Tier 1" city it is a good trade-off to buy wholesale, and the MRP difference to "Tier 3" is then very comparable.
- Overall it is positively correlated, but like area it has a few anomalies. It is therefore not as linearly correlated as the weights.

#### **Use Transformed dataset**

In [None]:
# One marker per store type: hue=styp has four levels, so four markers are supplied
sns.pairplot(data=skart_long.loc[:, [ptyp, styp, paa+"_mean", pmrp+"_mean", pwt+"_mean", target+"_sum"]], hue=styp, diag_kind='kde', markers=["o", "s", "D", "^"])
# Show the plot
plt.show()

##### **OBSERVATIONS**
- This transformed dataset uses mean values instead of actuals. The means are aggregated over unique Product_Type and Store_Type combinations.
- It shows the overall aggregated behaviour for all products and stores. Individual behaviours differ, as we saw in the heatmap.

- KDE graphs
  - The MRP and weight KDE graphs show distinctly that the products shelved by these 4 types of stores are mutually exclusive. This is also confirmed by the fact that product ids do not overlap across store ids in the dataset.
  - The area KDE simply reflects the number of orders: the more orders, the bigger the bell curve. There is no correlation with sales.

Correlations of numeric variables. This graph depicts the overall relationships more accurately.
   - MRP and area are NOT correlated. The colors are mixed in all ranges.
   - Correlation with the target variable
      - MRP: the stores focus on different MRPs. The Food Mart focuses on products with lower MRPs and lower weights (more like single pieces). The Departmental Store focuses on higher MRPs and higher weights (wholesale, like 50 packs). The Type1 supermarket is closer to the Departmental Store. Type2 focuses on middle and upper-middle range MRPs and hence similar weights.
      - So MRP and weight are positively correlated with sales. There are anomalies, however: some scatter points have higher MRPs with lower sales and vice versa, and the same holds for weights. This only gives an insight into the nature of the products and is not an actionable recommendation for the business.
      - Area is different because the store business gets to choose the display-area allocation. For MRP and weight, the stores have chosen focused limits, which are therefore fixed per store. Areas are chosen by the stores, and hence they are a mixed bag.

### **Transformed dataset analysis of numeric variables**
- (Product_Weight, Product_MRP and Product_Allocated_Area)

In [None]:
FBar(skart_AW.sort_values(by='feature'), ptyp, styp,'aggfunc','feature', 'value', ht=5, asp=0.8, fnt=12, tit=True)

##### **OBSERVATIONS**

**IMPORTANT: When using this transformed dataset, the values themselves are not important because they are scaled. What matters are the relative correlation patterns.**

**Product_Weight**

- The same pattern can be observed throughout the graph above, and the transformed dataset shows it very distinctly. Check the mean/max values. The mean is smallest for the Food Mart (a convenience store selling single items) and highest for the Departmental Store (wholesale), followed by Type1 (the store established in 1987, which sells close to wholesale) and then by Type2, which sells low to average product sizes.
- This pattern is very consistent across all the product types.

**Product_Allocated_Area**

The graph shows very useful data for this field, which will help the business analyze its sales versus allocated area. Some examples:
- Seafood: check the Product_Store_Sales_Total and Product_Allocated_Area fields. We saw earlier in the heatmap that the Departmental Store and the Type2 supermarket had a negative correlation between sales and Seafood area. Type1 had a positive correlation. The Food Mart had a 0 correlation.
- This graph shows clearly why this is happening. The Departmental Store has allocated a (higher) area of 0.20 with sales of 0.02, whereas the Type1 supermarket has allocated an area of 0.12 with the same sales. Hence there is a difference in the correlation. Type2, on the other hand, has sales of 0.05 but a higher area of 0.23, and therefore still shows a negative correlation of 0.1.
- The data shows this pattern for many other fields, so that the business can adjust its allocated area in proportion to sales. The Food Mart's Starchy Foods, Breads, Canned and Frozen Foods are other examples. The business can look at the negative correlations in the heatmap and then come to this chart to find the misaligned area allocations.

**Product_MRP**
- This pattern is also depicted very clearly by the transformed dataset. Check the max value in the cells of the grid. The Departmental Store sells premium products, and this is coupled with the fact that it is in a Tier 1 city, which pushes its MRP to the highest. It is followed by the Type1 supermarket, which is in Tier 2 but sells a mix of premium and lower MRPs (high weights). Then comes Type2, which sells low and average sizes and is in Tier 2, and finally the Food Mart, with small sizes in Tier 3.
- This pattern is clearly visible in all cells of the graph, confirming the assumptions made during the heatmap and pair plot analysis.
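The Seafood comparison in the Product_Allocated_Area notes above can be turned into a simple sales-per-area ranking. The sketch below reuses the scaled values quoted in those notes; the frame `seafood` and its column names are hypothetical, the store labels are illustrative, and because the inputs are scaled the ratio is only a relative indicator.

```python
import pandas as pd

# Scaled Seafood values quoted in the observations above (relative, not actual units).
# The frame and its column names are assumptions made for this sketch.
seafood = pd.DataFrame({
    "store_type": ["Departmental Store", "Supermarket Type1", "Supermarket Type2"],
    "area": [0.20, 0.12, 0.23],
    "sales": [0.02, 0.02, 0.05],
})

# Sales obtained per unit of allocated display area; lower means the area earns less
seafood["sales_per_area"] = seafood["sales"] / seafood["area"]

# Rank from the least to the most productive use of display area
ranked = seafood.sort_values("sales_per_area").reset_index(drop=True)
print(ranked)
```

A ratio like this and the heatmap correlation measure different things: the ratio compares aggregate levels across stores, while the correlation describes how sales move with area inside one store, so the two need not agree.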

### **Influence of Product_Id (first two chars) on Product_Store_Sales_Total**
- FD stands for Food items
- DR stands for Drink items
- NC stands for Non Consumables

#### **Percentage variation across Stores for these items**

In [None]:
# Total sales per product-id prefix (rows) and store type (columns)
skart_pidc2 = skart.groupby([styp, pidc2]).agg({target:'sum'}).unstack(level=0)
# Drop the outer column level so the columns are just the store types
skart_pidc2.columns = [x for y, x in skart_pidc2.columns]
print(styp_order)
fig, ax = plt.subplots(2, 2, figsize=(12, 10))
# One pie chart per store type, laid out on the 2x2 grid
for i, st in enumerate(styp_order):
    axi = ax[i//2, i%2]
    axi.set_title(st)
    axi.pie(skart_pidc2[st], labels=skart_pidc2.index, autopct='%1.1f%%')

##### **OBSERVATIONS**
- The percentage of sales for these three product categories is very similar across stores. The variation is less than 3% to 4%. This means that much the same product categories are purchased in all the stores, and only the volume of sales differs.
- This also supports our earlier theory that increasing the number of products may boost sales.
- FD = Food items. These make up the majority of sales.
- DR = Drinks. These have the lowest sales in all the stores.
- NC = Non Consumables. Sales are greater than for drinks and far lower than for food items. Even here, the variation in % sales is less than 3%.

- **Key takeaway: the % sales in all stores is almost the same across product categories. The deviation is only about 3% to 4%.**
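The "deviation across stores" quoted above can also be computed directly instead of being read off the pie charts. This is a minimal sketch on made-up totals; the frame `sales`, the store labels `S1` and `S2`, and the numbers are all hypothetical.

```python
import pandas as pd

# Hypothetical sales rows (assumption); FD, DR and NC follow the Product_Id prefixes above.
sales = pd.DataFrame({
    "store": ["S1", "S1", "S1", "S2", "S2", "S2"],
    "prefix": ["FD", "DR", "NC", "FD", "DR", "NC"],
    "total": [720.0, 80.0, 200.0, 1400.0, 180.0, 420.0],
})

# Rows: category prefix, columns: store, cells: summed sales
table = sales.pivot_table(index="prefix", columns="store", values="total", aggfunc="sum")

# Divide each column by its total so every store's shares sum to 1
share = table / table.sum(axis=0)

# Largest gap between stores for any category, in percentage points
spread_pct = (share.max(axis=1) - share.min(axis=1)).max() * 100
print(share.round(3))
print(f"largest spread: {spread_pct:.1f} percentage points")
```

A small spread, as in this toy case, is what the pie charts show visually: the stores differ in volume but not in category mix.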

#### **Check actual percentage variations for food items**
- If the variations across Product_Type values are not high, then for model building we drop Product_Type and replace it with pid_c2.
- If there are variations that get lost by "averaging" out at the FD level of pid_c2, then we can drop pid_c2 as redundant in model building.

In [None]:
cross_tab(skart_cat.loc[skart_cat[pidc2]=='FD'], ptyp, styp, nrows=2, ncols=1, figsize=(15,10), stacked=True)

##### **OBSERVATIONS**
- The first crosstab is as expected: we know that Type2 has the highest sales, followed by the Departmental and Type1 stores, with the Food Mart the lowest. Nothing new here.
- The second crosstab shows the normalized values. It shows that there are % variations of the product types in each store, even though they are not large. For example:
    - Breakfast is low in Type1. Meat and Starchy Foods are higher in Type1.
    - Dairy and Starchy Foods are higher in Departmental Stores.
    - Breakfast is high in the Food Mart and Type2.
- If we drop Product_Type and use pid_c2, we lose these per-store variations. So it is better to keep them; the ensemble/tree models do a better job of analyzing these values and fine-tuning the model.

### **Influence of Product_MRP on Product_Store_Sales_Total**

#### **Overall influence**

In [None]:
sns.scatterplot(data=skart_long, x=pmrp+"_mean", y=target+"_sum", hue=styp)

##### **OBSERVATIONS**
- Same as what we saw in the pair plot: a positive correlation between sales (the target) and MRP exists, but it is weak.
- Lower MRPs are dominated by the Food Mart and higher MRPs by the Departmental Store, followed by Type1, with Type2 in the middle.
- Type1 has an intelligent mix: a few middle MRPs and a main focus on higher MRPs.

#### **Influence on Stores**

In [None]:
cross_tab(skart_cat, pmrp_bin, styp, nrows=2, ncols=1, figsize=(15,10), stacked=True, margins=False)

##### **OBSERVATIONS**
This plot clearly shows the relationship between MRPs and stores.
- The Food Mart focuses on the lower MRPs.
- The Departmental Store focuses on the highest MRPs.
- Type1 makes a smart trade-off: while focusing on higher MRPs, it also sells a selected set of low MRPs.
- Type2 is very active in the lower, medium and upper-medium ranges. It does not focus on the higher MRP ranges (it is not into wholesale).

#### **Product and Store wise influence**

In [None]:
FScatter(skart, ptyp, styp, psc, pmrp, target,ht=3, asp=1, fnt=7, tit=True)

##### **OBSERVATIONS**
This detailed facet scatter plot on Product_Type and Store_Type, with Product_Sugar_Content as the hue, also gives some good insights.
- First, it confirms that MRPs are higher in the Departmental Store and Type1 than in Type2, and lowest in the Food Mart. The points sit on the higher side of the graphs for the Departmental Store and Type1 and on the lowest side for the Food Mart.
- The interesting thing about this graph is that it shows how the Type1 supermarket does not totally isolate itself from the lower MRP segment. There is a distinct pattern in all product types where Type1 selects a few MRPs from the lower segment, even though a large part of its presence is in the higher MRPs.
- The different color for "Health and Hygiene", "Others" and "Household" shows that they are non-consumables. This also eliminates the need for pid_c2 in the final model building; it would be redundant.
- The graph clearly shows that "Low Sugar" products are in demand. The majority of sales has happened in this category. "Regular" is also present, but with a much smaller relative share.
- It also shows that Type2 focuses on a very narrow range of MRPs, which corresponds closely to the weights of the products, and does not enter the lowest and highest MRP segments at all.

### **Influence of Product_Type on Product_Store_Sales_Total**

#### **Overall influence - Total sales**

In [None]:
plt.figure(figsize=(12,15))
df = skart_AW.loc[((skart_AW['feature']==target) & (skart_AW['aggfunc']=='sum')) , [ptyp, styp, 'value']]
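# Order product types by value; drop repeats so each product type appears once in the order list
sns.barplot(data=df, x='value', y=ptyp, hue=styp, orient='h', order=df.sort_values(by='value', ascending=False)[ptyp].drop_duplicates().tolist())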
plt.tight_layout()
plt.show()

##### **OBSERVATIONS**
- This graph is as expected. It takes the overall sales and distributes them over product types, subdivided by store type.
- As we know, the highest sales are for "Fruits and Vegetables" and for the Type2 supermarket.
- The Departmental Store is next, closely followed by Type1. The graph shows that there are a few product types (Baking Goods, Hard Drinks, Soft Drinks and Household) where the sales of Type1 exceed those of the Departmental Store, even though the Departmental Store exceeds Type1 in overall sales.

#### **Overall influence - Total orders**

In [None]:
plt.figure(figsize=(12,15))
df = skart_AW.loc[((skart_AW['feature']==target) & (skart_AW['aggfunc']=='count')) , [ptyp, styp, 'value']]
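# Order product types by order count; drop repeats so each product type appears once in the order list
sns.barplot(data=df, x='value', y=ptyp, hue=styp, orient='h', order=df.sort_values(by='value', ascending=False)[ptyp].drop_duplicates().tolist())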
plt.tight_layout()
plt.show()

##### **OBSERVATIONS**
- This graph shows an interesting aspect: the number of orders is high in Type1 and lower in the Departmental Store.
- Even though the Departmental Store has fewer orders, its high MRPs make its overall sales higher than Type1's.
- It also shows that the number of Type2 orders is very high. The average MRP of its orders is very low compared to the Departmental Store because it does not focus on high MRPs.
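The relationship described above between order count, average order value and total sales can be checked with one grouped aggregation. The rows below are made up for illustration; `orders` and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical order rows (assumption): few expensive orders versus many cheaper ones.
orders = pd.DataFrame({
    "store": ["Departmental"] * 3 + ["Type1"] * 5,
    "sales": [6000.0, 6500.0, 7000.0, 3500.0, 3600.0, 3700.0, 3800.0, 3900.0],
})

# Total sales, number of orders and average order value per store
summary = orders.groupby("store")["sales"].agg(total="sum", n_orders="count", avg_order="mean")
print(summary)
```

In this toy case the store with fewer orders still has the higher total because its average order value is much larger, which mirrors the Departmental versus Type1 pattern noted above.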

### **Influence of Product_Allocated_Area on Product_Store_Sales_Total**
- Product_Allocated_Area gives good insights into possible improvements for the business.
- The display area influences sales and should usually be directly proportional to them. Where it is not proportional, that indicates room for improvement for the business.
- A detailed analysis is already given in the "Understanding Product Allocation Area" and heatmap sections. Here we only do an overall influence analysis.

#### **Overall influence**

In [None]:
sns.scatterplot(data=skart_long, x=paa+"_mean", y=target+"_sum", hue=styp)

##### **OBSERVATIONS**
- This plot looks different from those for MRP and weight.
- It appears as a flat structure with NOTHING correlated.
- **This is because of how the stores distribute and use area allocation. With respect to MRP and weight, the stores do not have much leverage to play with: they take products that match the range of MRPs in the market segment they focus on. With respect to area allocation, however, they can choose the display area for each product in the store. Therefore we see a lot of overlap: higher weights and MRPs use smaller display areas, and vice versa.**
- **This is also the area where improvements have been identified. The store business has the flexibility to change this and should use the analysis to allocate areas in line with sales. Primarily because area allocations are mixed up across MRPs, the relationship to sales shows as a flat line in the graph. Taken as an overall aggregate across all products and stores, area is weakly correlated with sales, just like MRP.**

#### **Product and Store wise influence**

In [None]:
FScatter(skart, ptyp, styp, psc, paa, target,ht=3, asp=1, fnt=7, tit=True)

##### **OBSERVATIONS**
- The highest points belong to the Departmental Store and the lowest to the Food Mart.
- All points are pretty flat: there is no correlation with the target variable.
- With respect to the target, however, we notice a lot of anomalies, which are explained in detail in the "Transformed dataset analysis" and heatmap sections.
- **As we saw in the heatmap, there are negative correlations or no correlations for some product and store types for the area. There are no positive correlations. Even the negative correlations are mainly attributed to fewer orders and are not directly related to area.**

### **Influence of Product_Sugar_Content on Product_Store_Sales_Total**

#### **Overall influence**

In [None]:
cross_tab(skart_cat, target_bin, psc, nrows=2, ncols=1, figsize=(15,10), stacked=True)

##### **OBSERVATIONS**
- In general, irrespective of product type or store type, "Low Sugar" sells well.
- "No Sugar" and "Regular" have far lower percentage sales. "Regular" is higher than "No Sugar".
- This holds for all MRPs and sales ranges, as confirmed by the graph above.

#### **Store wise influence**

In [None]:
cross_tab(skart_cat, psc, styp, nrows=2, ncols=1, figsize=(15,10), stacked=True)

##### **OBSERVATIONS**
- Nothing new here. The plot only shows that the higher a store's sales, the higher its sales of "Low Sugar" products.

#### **Product and Store wise influence**

In [None]:
g = sns.catplot(data=skart_cat, kind='count', y=ptyp, hue=styp, col=psc, orient='h', height=9, aspect=0.5)
g.set(xlim=(0, 350 * 1.5))

##### **OBSERVATIONS**
- The graph illustrates Product_Sugar_Content across product and store types.
- The left graph shows that the "No Sugar" category applies exclusively to non-consumable products and not to food items.
- Only Household, Health and Hygiene and Others are "No Sugar".
- As we already know, "Low Sugar" has higher demand, which is shown in the 2nd and 3rd graphs.
- There is nothing store-specific about sugar content; it depends only on the products. All stores show similar behaviour.

### **Influence of Product_Weight on Product_Store_Sales_Total**

#### **Overall influence**

In [None]:
g = sns.scatterplot(data=skart, x=pwt, y=target, hue=styp)
plt.legend(loc='upper left', bbox_to_anchor=(0,0), title=styp)


##### **OBSERVATIONS**
- Shows a nice linear correlation between sales and weight.
- It is a positive correlation, shown by the gradual rise of the point cloud. It is very distinct for the Type1 and Type2 supermarkets.
- For the Food Mart and the Departmental Store, the scatter points to the right and left respectively act as outliers and reduce the linear correlation, which is why the heatmap shows lower correlation values. This is not actionable in any way. It is only an insight that the Food Mart and the Departmental Store sell some products that have lower weights with higher sales and/or vice versa.
- The Departmental Store is at the top and the Food Mart at the bottom, with Type1 and Type2 in the middle, as in the MRP graph.
- Refer to the MRP, heatmap and transformed dataset sections, where a very detailed analysis of this is presented.

#### **Store wise influence**

In [None]:
cross_tab(skart_cat, pwt_bin, styp, nrows=2, ncols=1, figsize=(15,10), stacked=True, margins=False)

##### **OBSERVATIONS**
- This is similar to the behaviour we see for MRP.
- The lower quartiles are dominated by Food Mart stores.
- The middle and upper-middle quartiles are dominated by the Type2 supermarket.
- In the higher quartiles we see more of the Type1 and Departmental stores.
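A quartile-by-store breakdown like the one read off the plot above can be sketched with `pd.qcut` and a normalized `pd.crosstab`. The weights below are made up, and plain quartile bins are an assumption for this sketch: the notebook's own bins are also built around outliers and IQR limits.

```python
import pandas as pd

# Hypothetical weights (assumption), four products per store type.
df = pd.DataFrame({
    "store": ["Food Mart"] * 4 + ["Type2"] * 4 + ["Departmental"] * 4,
    "weight": [5.0, 6.0, 7.0, 8.0, 10.0, 11.0, 12.0, 13.0, 15.0, 16.0, 17.0, 18.0],
})

# Four equal-frequency bins (quartiles) of the weight column
df["weight_bin"] = pd.qcut(df["weight"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Share of each store type within every weight bin (each row sums to 1)
mix = pd.crosstab(df["weight_bin"], df["store"], normalize="index")
print(mix.round(2))
```

Reading the table row by row shows which store type dominates each weight range, which is the same information the stacked crosstab plot conveys.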

#### **Product and Store wise influence**

In [None]:
FScatter(skart, ptyp, styp, psc, pwt, target,ht=3, asp=1, fnt=7, tit=True)

##### **OBSERVATIONS**
- Much the same patterns as we have already seen for Product_MRP.
- The higher dots correspond to the Departmental Store and the lower dots to the Food Mart.
- Type1 and Type2 follow below the Departmental Store.
- Overall, a strong positive correlation with sales.

### **Influence of Store_Establishment_Year on Product_Store_Sales_Total**

In [None]:
cross_tab(skart_cat, target_bin,seyr, nrows=2, ncols=1, figsize=(15,10), stacked=True, margins=True)

##### **OBSERVATIONS**
- Simply shows higher sales for the store established in 2009, which is the Type2 supermarket.
- One can interpret this as a newer ambience and mall-like setup drawing more customers to the store. Stores built later may have such an ambience.
- If that is true, the business can evaluate renovation as an option for increasing sales.
- Other than this aspect, there is nothing new in the analysis of this feature.
- It also shows that the first store was established in 1987.

### **Influence of Store_Location_City_Type on Product_Store_Sales_Total**

#### **Influence on MRP**

In [None]:
skart_long.info()

In [None]:
sns.catplot(data=skart_long, kind='bar', y=ptyp, hue=styp, x=pmrp+"_mean", col=sloctype, orient='h', height=9, aspect=0.5)

In [None]:
sns.catplot(data=skart_long, kind='bar', y=ptyp, hue=styp, x=pwt+"_mean", col=sloctype, orient='h', height=9, aspect=0.5)

##### **OBSERVATIONS**
- The two graphs above show MRPs and weights. They are smallest for Tier 3, highest for Tier 1 and in the middle for Tier 2.
- This increase of MRP across tiers is not entirely due to the tiers. As we saw, the MRP is also affected by the weights.
- Tier 3 stores sell smaller weights. In Tier 2, the Type1 supermarket sells higher weights and Type2 sells lower weights.
- Tier 1 sells only high weights.
- Therefore the increase in MRP, Tier 1 > Tier 2 > Tier 3, is a mix of tier (cost of living) and product weight.
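One way to separate the two effects named above is to look at MRP per unit of weight, in the spirit of the `mrp_2_weight` metric defined earlier. The figures below are invented purely for illustration; only the tier assignment of each store type follows the observations in this notebook.

```python
import pandas as pd

# Hypothetical per-store averages (assumption); values are illustrative only.
stores = pd.DataFrame({
    "tier": ["Tier 1", "Tier 2", "Tier 2", "Tier 3"],
    "store": ["Departmental", "Type1", "Type2", "Food Mart"],
    "mean_mrp": [180.0, 165.0, 140.0, 105.0],
    "mean_weight": [15.0, 14.0, 12.5, 10.0],
})

# Price per unit of weight removes the part of the MRP gap explained by pack size
stores["mrp_per_weight"] = stores["mean_mrp"] / stores["mean_weight"]
by_tier = stores.groupby("tier")["mrp_per_weight"].mean()

# Compare the raw Tier 1 / Tier 3 MRP gap with the weight-adjusted gap
raw_gap = 180.0 / 105.0
adjusted_gap = by_tier["Tier 1"] / by_tier["Tier 3"]
print(by_tier.round(2))
print(f"raw gap: {raw_gap:.2f}x, weight-adjusted gap: {adjusted_gap:.2f}x")
```

If the adjusted gap is much smaller than the raw one, as in this toy data, most of the tier difference in MRP comes from heavier products and only the remainder from the tier itself.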

### **Influence of Store_Size on Product_Store_Sales_Total**

In [None]:
cross_tab(skart_cat, target_bin, ssize, nrows=1, ncols=2, figsize=(15,5), stacked=True, margins=False)

##### **OBSERVATIONS**
- The graph shows the expected output. In the dataset, the Type1 supermarket has size "High". It sells higher-MRP products and has sales only in the higher quartile.
- Type2, which has the highest sales, sells in the middle and upper-middle quartiles.
- The Food Mart sells in the lower quartiles.
- The Departmental Store sells in even higher quartiles than the Type1 store, although its size is only "Medium".

In [None]:
sns.catplot(data=skart_long, kind='bar', y=ptyp, hue=styp, x=paa+"_mean", col=ssize, orient='h', height=9, aspect=0.5)

#### **OBSERVATIONS**
- The graph shows how the stores exercise their freedom in choosing product allocation areas based on store size.
- One would assume that the stores analyzed their sales before arriving at these allocations. However, certain anomalies were observed in which a store allocates more area but sales do not rise in proportion. This is detailed in the section "Understanding Product Allocation Area".
- Display area is a primary factor influencing sales, and the size of the store dictates how much area is available. Store size therefore influences overall sales.
- The pie charts in "Understanding Product Allocation Area" show that the majority of product types are allocated area appropriately, giving a positive sales-to-area-allocation ratio. Only a handful of product types are negatively correlated, which the business can analyze. As explained earlier, this graph is included only for completeness; please refer to the sections mentioned.

### Influence of **Store_Type** on **Product_Store_Sales_Total**

In [None]:
sns.catplot(data=skart_long, kind='bar', y=ptyp, hue=styp, x=target+"_sum", col=styp, orient='h', height=9, aspect=0.5)

##### **OBSERVATIONS**
- As already established, sales are highest for the Type2 supermarket and lowest for the Food Market.
- Fruits and Vegetables are the highest-selling products.
- The interesting point is that, apart from minor variations in products such as Dairy and Meat (where the relative sales percentages for the Departmental Store and Type1 are higher than for the Food Market and Type2), the overall product sales pattern is the same across all stores. The relative percentage of sales per product type is almost identical across stores.

# **Insights based on EDA - Summary**

- The 4 stores in the dataset carry mutually exclusive products. They are meant for different market segments; by design, each store type addresses different customer needs.
- The Type2 supermarket addresses daily consumption products that most households purchase on a regular basis. It targets the lower to middle and upper-middle quartiles of the MRP segments.
- The Type1 supermarket carries a deliberate mix of premium products and a selected set of frequently used products from the upper-middle quartile. Its MRP range is above that of the Type2 supermarket but below that of the Departmental Store.
- The Departmental Store covers the premium product segment and targets higher MRPs. Even with fewer orders, its overall sales will be much higher than those of a supermarket such as Type1. It serves Tier 1 cities.
- The Food Market is like a convenience store: a "Small" store addressing very low product MRPs.
- The products are classified into 16 different types. Broadly there are food items, drinks, and non-consumables. Food items make up a large portion of sales, with Fruits and Vegetables the highest for all stores.
- The key product attributes are:
   - Weight: product weight
        - In the lower-middle, middle, and upper-middle quartiles, product weight is strongly and positively correlated with sales. In the lowest and the upper/outlier quartiles, weight is not well correlated, meaning a lower weight can have a higher MRP and vice versa.
        - Supermarkets generally focus on these middle MRP segments, so the correlation between weight and sales is very high for supermarkets.
   - MRP: the selling price of the product
        - By design it characterizes the store type. The 4 store types address different MRP segments.
   - Area: display area allocated to the product
        - Fully controlled by store administration to maximize sales. This succeeded in more than 90% of cases, but in a few cases the allocated area is negatively related to sales. The business has to review these case by case and adjust the allocated areas.
        - The heatmap section of this report details the product and store combinations where negative correlations are observed.
- The stores by design address different MRPs. They monitor sales and make their own decisions on how to adjust the display allocation area for products so that sales are maximized. Product weight influences sales because area and MRP are correlated with weight.
- The correlation of product attributes with sales is a complex relationship that is difficult to visualize. The heatmap and pairplot provide only weak insights. This is a good case for an ensemble model, which can capture deeper relationships.
- Another product attribute is the sugar content level. It is fairly straightforward: sales are high for all "Low Sugar" products, and sales of "Regular" sugar content are relatively low in comparison. This attribute also classifies non-consumable products as "No Sugar".
- The store attributes are more straightforward:
   - Store location type indicates which tier the store targets: Tier 1 for high cost-of-living areas, Tier 3 for low cost-of-living areas, and Tier 2 in between. In the dataset provided, the supermarket-type stores are in "Tier 2", the Departmental Store is in "Tier 1", and the Food Market is in "Tier 3".
   - Store size classifies stores as "High", "Medium", and "Small". This simply reflects the area of the store; it does not reflect allocation areas or MRPs. In the dataset provided, "High" is the Tier 2 Supermarket Type1, "Medium" covers the Departmental Store and Supermarket Type2, and the Tier 3 Food Market is "Small".
   - Store establishment year gives the year in which the store was established. The store established in 1987 managed MRP, weight, and area very well, and this Type1 store showed the highest correlations with sales. Other stores should adopt its practices for allocating area and choosing MRPs and weights.
- Overall, sales prediction depends on complex combinations of product and store attributes. The model has to consider all product and store attributes. There are many hidden relationships between them, which are best analyzed by ensemble models.

# **Data Preprocessing**

## **Feature Engineering - Part2**

No additional feature engineering is required here. The following was already completed during the EDA:
- Added a new column derived from the first two characters of Product_Id. This classifies items as Food, Drinks, and Non Consumables and gives insight into sales per category. However, after analysis it adds no value: as seen in the previous section, the relative sales percentages per product are almost the same across all the categories, so a separate product category column is unnecessary.
- Moreover, the minor deviations in patterns such as Dairy and Meat across stores can be detected by ensemble models, so accurate predictions remain possible. Therefore the new column is dropped.
- Added numerical bins based on quartiles, outliers, and IQRs. These help the EDA, but they are dropped for model building as they add no value.
- In addition, PID and SID are dropped, as explained below.
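
The prefix-derived column described above can be sketched as follows. The sample IDs are made up, and the mapping of the `FD`, `DR`, and `NC` prefixes to Food, Drinks, and Non Consumables is an assumption based on the category names used in this notebook.

```python
from collections import Counter

# Assumed meaning of the two-letter prefixes (not confirmed by the data dictionary)
PREFIX_TO_CATEGORY = {"FD": "Food", "DR": "Drinks", "NC": "Non Consumables"}

def product_category(product_id):
    """Map a Product_Id to a broad category using its first two characters."""
    return PREFIX_TO_CATEGORY.get(product_id[:2], "Unknown")

# Made-up identifiers in the documented shape: two letters followed by a number
sample_ids = ["FD6114", "DR1005", "NC2769", "FD7839"]
categories = [product_category(pid) for pid in sample_ids]
counts = Counter(categories)
print(counts)
```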


### **Drop PID and SID**

In [None]:
X = skart.drop(columns=[target, pid, sid, pidc2])
y = skart[target]

##### **OBSERVATIONS**
- Product ID is unique for every row. It can therefore be dropped, since identifiers that are unique per row do not help an ML model.
- Store ID is useful because it ties products to stores and helps with aggregations and pattern identification. However, Store_Type already identifies each store uniquely in this dataset, which makes Store ID redundant, so it is dropped as well.
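
The redundancy claim can be checked directly: if every store ID maps to exactly one store type and every store type to exactly one store ID, the two columns carry the same information. A minimal sketch on made-up rows follows; the column name `Store_Id`, the ID values, and the helper `is_one_to_one` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; the Store_Id values are hypothetical
toy = pd.DataFrame({
    "Store_Id": ["S1", "S1", "S2", "S3", "S4", "S4"],
    "Store_Type": ["Supermarket Type1", "Supermarket Type1", "Supermarket Type2",
                   "Departmental Store", "Food Mart", "Food Mart"],
})

def is_one_to_one(df, left, right):
    """True when each value of `left` pairs with exactly one value of `right` and vice versa."""
    pairs = df[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)

redundant = is_one_to_one(toy, "Store_Id", "Store_Type")
print(redundant)
```

Running the same check on the real frame before dropping the column would confirm that no information is lost.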

### **Outlier Treatments - Options and Decision**

##### **OBSERVATIONS** - OPTIONS
The applicable options available for outlier treatment here are:
- Cap the Outliers at the upper and lower bounds as calculated by the Tukey method.
- Leave them as they are, since they are meaningful and may lead to useful patterns.


##### **OBSERVATIONS** - DECISIONS
- In this case it is wise to leave the outliers as they are and analyze them specifically for any pattern particular to outliers. In supermarkets there can be many events, celebrity visits, and sale campaigns that spike sales on a few days. During such spikes it is more important to analyze what people tend to buy than to ignore them. Capping outliers would lose important data.
- To analyze the outliers we make bins (using pd.cut) with bin edges matching the IQR, quantile, and outlier ranges. We then look for outlier-specific patterns in the bivariate analysis.
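
A minimal sketch of the binning described above, on made-up numbers. The helper name `tukey_bins` and the bin labels are illustrative assumptions; the fences use the usual 1.5 × IQR Tukey rule.

```python
import numpy as np
import pandas as pd

def tukey_bins(values):
    """Bin values by Tukey fences and quartiles so outliers get their own bins."""
    s = pd.Series(values, dtype="float64")
    q1, q2, q3 = s.quantile([0.25, 0.50, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Edges must be strictly increasing, which holds whenever the IQR is positive
    edges = [-np.inf, lower, q1, q2, q3, upper, np.inf]
    labels = ["low_outlier", "Q1", "Q2", "Q3", "Q4", "high_outlier"]
    return pd.cut(s, bins=edges, labels=labels)

# Made-up sales figures with one obvious spike
sales = [10, 12, 13, 14, 15, 16, 17, 18, 20, 95]
binned = tukey_bins(sales)
print(binned.value_counts().sort_index())
```

Because the outliers keep their own bin rather than being capped, they remain available for the bivariate plots.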

## **Split the dataset into Testing, Validation and Training**

In [None]:
# -----------------------------
# Train / Val / Test split
# -----------------------------
# 60% train, 20% val, 20% test
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42
)  # 0.25 of 0.8 -> 0.20

##### **OBSERVATIONS**
- We take a 3-way split into train, validation, and test sets.
- This is required because we intend to tune the hyperparameters of the models.
- We fit the models on the training set first and then use the validation set to assess the tuned hyperparameters.
- Finally we verify with the test set. Keeping the validation and test sets separate prevents data leakage.
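
The arithmetic behind the 60/20/20 split can be checked with a short sketch. It uses the 8763 rows mentioned in this notebook and mimics, as an assumption, how `train_test_split` rounds a fractional `test_size` (the held-out count is rounded up and the remainder goes to training).

```python
import math

def three_way_sizes(n_rows, test_size=0.20, val_size=0.25):
    """Return (train, val, test) row counts for two successive hold-out splits."""
    n_test = math.ceil(test_size * n_rows)      # first split: hold out the test set
    n_train_full = n_rows - n_test
    n_val = math.ceil(val_size * n_train_full)  # second split: hold out validation
    n_train = n_train_full - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = three_way_sizes(8763)
print(n_train, n_val, n_test)
print(round(n_train / 8763, 3), round(n_val / 8763, 3), round(n_test / 8763, 3))
```

Taking 0.25 of the remaining 80% is what yields a validation share of about 20% of the full dataset.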

## **Prepare a single pipeline for Categorical and Numeric variables**

In [None]:
# Identify numeric and categorical columns
numeric_features = X_train.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X_train.select_dtypes(exclude=["int64", "float64"]).columns  # Already ordinal encoded
print(numeric_features)
print(categorical_features)

preprocessor = ColumnTransformer(
    transformers=[
        ('num', RobustScaler(), numeric_features),
        ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), categorical_features)
    ],
    remainder='drop'
)

##### **OBSERVATIONS**
- Numeric values are scaled using RobustScaler.
- RobustScaler is used because all the numeric features have outliers.
- The fitted pipeline is applied automatically to the validation and test data.
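
To see why a median/IQR-based scaler suits features with outliers, the sketch below applies the same centering and scaling by hand with NumPy on made-up values. It illustrates the idea behind `RobustScaler` with its default 25th to 75th percentile range; it is not the pipeline's actual code path, and both helper names are illustrative.

```python
import numpy as np

def robust_scale(x):
    """Center on the median and divide by the interquartile range."""
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return (x - med) / (q3 - q1)

def standard_scale(x):
    """Center on the mean and divide by the standard deviation, for comparison."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

values = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up data with one outlier
print(robust_scale(values))
print(standard_scale(values))
```

Under robust scaling the four ordinary values keep a usable spread, whereas mean/standard-deviation scaling squeezes them together because the single outlier inflates both statistics.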

# **Model Building**

## **Metrics Used - Regression**
- The relevant metrics are RMSE, R2, and Adj R2.
- As this is a regression problem, we minimize RMSE and tune hyperparameters to bring R2 as close to 1 as possible.
- We also check that Adj R2 remains good, so that the dataset does not carry variables that fail to contribute to error reduction.
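
As a worked example of these three metrics, the sketch below computes them from plain Python lists. The targets and predictions are made up, and `regression_metrics` is an illustrative helper, not the notebook's evaluation function.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Compute RMSE, R2, and adjusted R2 from plain Python lists."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    rmse = math.sqrt(ss_res / n)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalizes R2 for the number of predictors used
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return rmse, r2, adj_r2

# Made-up targets and predictions for a model with a single feature
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]
rmse, r2, adj_r2 = regression_metrics(y_true, y_pred, n_features=1)
print(rmse, r2, adj_r2)
```

Adjusted R2 is always at or below R2, and the gap widens as predictors are added without a matching drop in error.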

## **Define functions for Model Evaluation**

In [None]:
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))


# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mean_absolute_percentage_error(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf

The ML models to be built can be any two out of the following:
1. Decision Tree
2. Bagging
3. Random Forest
4. AdaBoost
5. Gradient Boosting
6. XGBoost

# **Model Performance Improvement - Hyperparameter Tuning**

## **Run default models first and select the best two models**

In [None]:
# List of models to test
models = {
    "AdaBoost": AdaBoostRegressor(random_state=42),
    "GradientBoost": GradientBoostingRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42, verbosity=0),
    "Bagging": BaggingRegressor(random_state=42),
    "RandomForest": RandomForestRegressor(random_state=42)
}
import time
results = []
for name, model in models.items():
    print( " Processing Model = ", name)
    pipeline = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("model", model)
    ])
    start_fit_predict = time.perf_counter()
    pipeline.fit(X_train, y_train)
    res_train = model_performance_regression(pipeline, X_train, y_train)
    time_taken = round( (time.perf_counter() - start_fit_predict), 2)
    res_train['Type'] = 'Training'
    res_train['Time'] = time_taken

    res_val = model_performance_regression(pipeline, X_val, y_val)
    res_val['Type'] = 'Validation'

    res_test = model_performance_regression(pipeline, X_test, y_test)
    res_test['Type'] = 'Testing'
    total_time_taken = round ( (time.perf_counter() - start_fit_predict), 2)
    res_val['Time'] = total_time_taken
    res_test['Time'] = total_time_taken

    one_model = pd.concat([res_train, res_val, res_test], ignore_index=True)
    one_model['Name'] = name
    results.append(one_model)
    print (" Finished model = ", name, " time taken Train:Total", time_taken, total_time_taken)

print( "Completed all Models = ", len(results))




In [None]:
all_results = pd.concat(results, ignore_index=True)
all_results.sort_values(by=['Time', 'Adj. R-squared'], ascending=[True, False])

### **OBSERVATIONS**
1. Random Forest is very slow to train. Despite its better R2 and Adj R2, it is not preferred: training speed is an important aspect and the model must be able to train quickly. The time taken by each model is shown in the "Time" column.
2. Gradient Boost scores are very close, but its training time is also poor, so it is not shortlisted either.
3. AdaBoost trains quickly but its scores are poor, so it is rejected as well.
4. The best models in terms of both speed and metrics are XGBoost and Bagging.
5. However, for both Bagging and XGBoost the train and test R2 values differ by 7 percentage points (98% vs 91%). This calls for hyperparameter tuning.

**Let's tune the hyperparameters of XGBoost and Bagging**

## **Hyper parameter Tuning for the top 2 models**
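
Grid search fits one model per parameter combination per fold, so the cost grows multiplicatively. The sketch below counts the fits implied by the grids in the next cell under 5-fold cross-validation. The helper name `count_fits` is illustrative, and the final refit on the full training set that `GridSearchCV` performs by default is not included in the count.

```python
from itertools import product

# Grids copied from the tuning cell in this notebook
bagging_grid = {
    "model__n_estimators": [50, 100, 200],
    "model__max_samples": [0.5, 0.8, 1.0],
    "model__max_features": [0.5, 0.8, 1.0],
}
xgboost_grid = {
    "model__n_estimators": [100, 200],
    "model__max_depth": [3, 5, 7],
    "model__learning_rate": [0.01, 0.1, 0.2],
    "model__subsample": [0.8, 1.0],
}

def count_fits(grid, cv=5):
    """Return (number of combinations, number of cross-validation fits)."""
    combos = list(product(*grid.values()))
    return len(combos), len(combos) * cv

print(count_fits(bagging_grid))
print(count_fits(xgboost_grid))
```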

In [None]:
# Define models and their parameter grids
model_params = {
    "Bagging": {
        "model": BaggingRegressor(random_state=42),
        "params": {
            "model__n_estimators": [50, 100, 200],
            "model__max_samples": [0.5, 0.8, 1.0],
            "model__max_features": [0.5, 0.8, 1.0]
        }
    },
    "XGBoost": {
        "model": XGBRegressor(random_state=42, eval_metric="rmse"),
        "params": {
            "model__n_estimators": [100, 200],
            "model__max_depth": [3, 5, 7],
            "model__learning_rate": [0.01, 0.1, 0.2],
            "model__subsample": [0.8, 1.0]
        }
    }
}

results = []
models_best = []

for name, mp in model_params.items():
    print(f"\n🔹 Tuning {name}...")

    pipeline = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("model", mp["model"])
    ])

    grid = GridSearchCV(
        estimator=pipeline,
        param_grid=mp["params"],
        scoring="r2",
        cv=5,
        n_jobs=-1,
        verbose=1
    )

    # grid = RandomizedSearchCV(
    #     estimator=pipeline,
    #     param_distributions=mp["params"],
    #     scoring="r2",
    #     cv=5,
    #     n_jobs=-1,
    #     verbose=1
    # )

    start_time = time.time()
    grid.fit(X_train, y_train)
    elapsed = round( (time.time() - start_time),2)
    best_score = round(grid.best_score_, 4)

    print(f"Time taken: {elapsed:.2f} seconds")
    print(f"Best Params: {grid.best_params_}")
    print(f"Best CV Score: {best_score:.4f}")

    # Evaluate best model
    best_model = grid.best_estimator_
    models_best.append(best_model)

    for X_set, y_set, set_name in [(X_train, y_train, "Training"),
                                   (X_val, y_val, "Validation"),
                                   (X_test, y_test, "Testing")]:
        res = model_performance_regression(best_model, X_set, y_set)
        res["Type"] = set_name
        res["Name"] = name
        res["Time"] = elapsed
        results.append(res)
    print ( "Completed Model name:score:Time = ", name, best_score, elapsed)
print("Complete Hyper Tuning for all models = ", len(results))


# **Model Performance Comparison, Final Model Selection, and Serialization**

## **Performance Comparison**

In [None]:
# Combine results
final_results = pd.concat(results, ignore_index=True)
final_results

## **Final Model Selection**

### **OBSERVATIONS**
1. XGBoost is the clear winner in terms of speed.
2. Bagging test scores exceed XGBoost by 0.3%, but this difference is minimal.
3. The gap between training and test scores is larger for Bagging: 97% vs 93%, a gap of 4 points. For XGBoost it is smaller: 95% vs 93%, a gap of 2 points. This shows that XGBoost has generalized better than Bagging.

**Final Model Selected = XGBoost**
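
The comparison above reduces to a train-minus-test gap per model. A minimal sketch using the rounded R2 percentages quoted in the observations (not the full-precision values in `final_results`); the helper name is illustrative.

```python
# Rounded R2 percentages quoted in the observations above
scores = {
    "Bagging": {"train": 97, "test": 93},
    "XGBoost": {"train": 95, "test": 93},
}

def generalization_gap(model_scores):
    """Train score minus test score; a smaller gap suggests less overfitting."""
    return model_scores["train"] - model_scores["test"]

gaps = {name: generalization_gap(s) for name, s in scores.items()}
smallest_gap_model = min(gaps, key=gaps.get)
print(gaps, smallest_gap_model)
```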

# **Feature importance - XGBoost**

In [None]:
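from xgboost import plot_importance
import matplotlib.pyplot as plt

# models_best follows the order of model_params: index 0 is Bagging, index 1 is XGBoost
best_model = models_best[1]

# The ColumnTransformer emits numeric columns first and categorical columns second,
# so take the feature names from the fitted preprocessor rather than X_train.columns
feature_names = best_model.named_steps['preprocessor'].get_feature_names_out()

feature_importances = pd.Series(
    best_model.named_steps['model'].feature_importances_,
    index=feature_names
).sort_values(ascending=False)

print(feature_importances)

plot_importance(best_model.named_steps['model'])
plt.show()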

## **Serialization**

In [None]:
import os
# Create a folder to upload your trained serialized model into it
os.makedirs("backend_files", exist_ok=True)

In [None]:
import joblib
# Save the model
joblib.dump(models_best[1], "backend_files/final_xgb_pipeline.joblib")
print("Model saved as final_xgb_pipeline.joblib")

### **Verification**

In [None]:
# load it
loaded_model = joblib.load("backend_files/final_xgb_pipeline.joblib")

# Test loaded model
y_pred = loaded_model.predict(X_test)
print("Test R² after reload:", r2_score(y_test, y_pred))

# Now get the preprocessor step
preprocessor = loaded_model.named_steps["preprocessor"]

# Extract numeric and categorical column names
num_features = preprocessor.transformers_[0][2]
cat_features = preprocessor.transformers_[1][2]

print("Numeric Features:", num_features)
print("Categorical Features:", cat_features)

# **Deployment - Backend**

## Flask Web Framework


In [None]:
%%writefile backend_files/app.py
# Import necessary libraries
import numpy as np
import joblib  # For loading the serialized model
import pandas as pd  # For data manipulation
from flask import Flask, request, jsonify  # For creating the Flask API

# print( " Trying to load XGBoost model using joblib")
model = joblib.load("final_xgb_pipeline.joblib")

# print("Model loaded successfully!")

# Initialize the Flask application
sales_predictor_api = Flask("SuperKart Sales Prediction")

# Define a route for the home page (GET request)
@sales_predictor_api.get('/')
def home():
    """
    This function handles GET requests to the root URL ('/') of the API.
    It returns a simple welcome message.
    """
    return "Welcome to the SuperKart Sales Prediction API!"

# Define an endpoint for a single sales prediction (POST request)
@sales_predictor_api.post('/v1/sales')
def predict_sales():
    """
    This function handles POST requests to the '/v1/sales' endpoint.
    It expects a JSON payload containing product and store details and returns
    the predicted sales total as a JSON response.
    """
    # Get the JSON data from the request body
    property_data = request.get_json()

    # Extract relevant features from the JSON data
    sample = {

        'Product_Weight': property_data['Product_Weight'],
        'Product_Allocated_Area': property_data['Product_Allocated_Area'],
        'Product_MRP': property_data['Product_MRP'],
        'Product_Sugar_Content': property_data['Product_Sugar_Content'],
        'Product_Type': property_data['Product_Type'],
        'Store_Establishment_Year': property_data['Store_Establishment_Year'],
        'Store_Size': property_data['Store_Size'],
        'Store_Location_City_Type': property_data['Store_Location_City_Type'],
        # 'pid_c2': property_data['pid_c2'],
        'Store_Type': property_data['Store_Type']
    }
    # print('received request from client')
    # Convert the extracted data into a Pandas DataFrame
    input_data = pd.DataFrame([sample])
    # print("data received = ", input_data)
    # Predict the sales total for the single input row
    predicted_sales = model.predict(input_data)[0]
    predicted_sales = float(predicted_sales)  # convert to native float
    # print("Sales predicted = ", predicted_sales)
    return jsonify({'Predicted Sales': predicted_sales})

# Run the Flask application in debug mode if this script is executed directly
if __name__ == '__main__':
    sales_predictor_api.run(debug=True)

## **Dependencies File**
- Some additional dependencies are listed alongside those required for the REST-based model.
- Keeping them here allows the backend to also be used as a combined UI + backend without REST, so they are retained.

In [None]:
%%writefile backend_files/requirements.txt
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
xgboost==2.1.4
joblib==1.4.2
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
requests==2.28.1
uvicorn[standard]

## Dockerfile

In [None]:
%%writefile backend_files/Dockerfile
FROM python:3.9-slim

# Set the working directory inside the container
WORKDIR /app

# Copy all files from the current directory to the container's working directory
COPY . .

# Install dependencies from the requirements file without using cache to reduce image size
RUN pip install --no-cache-dir --upgrade -r requirements.txt

# Define the command to start the application using Gunicorn with 4 worker processes
# - `-w 4`: Uses 4 worker processes for handling requests
# - `-b 0.0.0.0:7860`: Binds the server to port 7860 on all network interfaces
# - `app:sales_predictor_api`: Runs the Flask instance named `sales_predictor_api` defined in `app.py`
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:7860", "app:sales_predictor_api"]

## Setting up a Hugging Face Docker Space for the Backend

In [None]:
# Import the login function from the huggingface_hub library
from huggingface_hub import login

# Read the Hugging Face access token from a local .env file (variable HF_TOKEN)
import os

from dotenv import load_dotenv

load_dotenv()  # loads from .env

token = os.getenv("HF_TOKEN")

if token is None:
    raise ValueError("HF_TOKEN not set in environment")
# print(token)

login(token=token)

# Import the create_repo function from the huggingface_hub library
from huggingface_hub import create_repo

In [None]:
# Try to create the repository for the Hugging Face Space
repoid = 'surnellas/SuperKart_Backend'
try:
    create_repo("SuperKart_Backend",  # Replace "SuperKart_Backend" with the desired space name
        repo_type="space",  # Specify the repository type as "space"
        space_sdk="docker",  # Specify the space SDK as "docker" to create a Docker space
        private=False  # Set to True if you want the space to be private
    )
except Exception as e:
    # Handle potential errors during repository creation
    if "RepositoryAlreadyExistsError" in str(e):
        print("Repository already exists. Skipping creation.")
    else:
        print(f"Error creating repository: {e}")

## Uploading Files to Hugging Face Space (Docker Space)

In [None]:
# for hugging face space authentication to upload files
from huggingface_hub import HfApi

repo_id = "surnellas/SuperKart_Backend"  # Your Hugging Face space id

# Initialize the API
api = HfApi()

# Upload the backend files stored in the folder called backend_files
api.upload_folder(
    folder_path="backend_files",  # Local folder path
    repo_id=repo_id,  # Hugging face space id
    repo_type="space",  # Hugging face repo type "space"
)

# **Deployment - Frontend**

## Points to note before executing the below cells
- Create a Streamlit space on Hugging Face by following the instructions provided on the content page titled **`Creating Spaces and Adding Secrets in Hugging Face`** from Week 1

## Streamlit for Interactive UI

In [None]:
# Create a folder for storing the files needed for frontend UI deployment
os.makedirs("frontend_files", exist_ok=True)

In [None]:
%%writefile frontend_files/app.py
import streamlit as st
import requests
import datetime
import joblib  # Only needed if the commented-out local model test below is enabled
import pandas as pd  # For data manipulation


# Set the title of the Streamlit app
st.title("SuperKart sales Prediction")

# Section for online prediction
st.subheader("Online Prediction")

# cyear = datetime.datetime.now().year
cyear=2025

#--------------------------------------------TESTING (Model without REST)-----------------------------------------------

# print( " Trying to load XGBoost model using joblib")
# model = joblib.load("XGBoost_best_model.joblib")

# print("Model loaded successfully!")

#--------------------------------------------TESTING (Model without REST)-----------------------------------------------------

# Collect user input for product and store features
pwt = st.number_input("Product Weight", min_value=0.1, value=100.0)
paa = st.number_input("Product_Allocated_Area (Ratio of product area to Total Area)", min_value=0.001, max_value=0.999, value=0.2)
psc = st.selectbox("Product Sugar Content", ['Low Sugar', 'Regular', 'No Sugar'])
ptyp = st.selectbox("Product Type", ['Fruits and Vegetables', 'Snack Foods', 'Frozen Foods', 'Dairy',
       'Household', 'Baking Goods', 'Canned', 'Health and Hygiene',
       'Meat', 'Soft Drinks', 'Breads', 'Hard Drinks', 'Others',
       'Starchy Foods', 'Breakfast', 'Seafood'])
ssize = st.selectbox("Store Size", ['Small', 'Medium', 'High'])
sloctype = st.selectbox("Store Location City Type", ['Tier 1', 'Tier 2', 'Tier 3'])
styp = st.selectbox("Store Type", ['Supermarket Type2', 'Supermarket Type1', 'Departmental Store', 'Food Mart'])
# pid_c2 = st.selectbox("pid_c2", ['FD', 'DR', 'NC'])
pmrp = st.number_input("Product MRP", min_value=0.1,  value=100.0)
seyr_i = st.number_input("Store Establishment Year", min_value=1987, step=1, max_value=cyear, value=2025)
seyr = str(seyr_i)

# Convert user input into a DataFrame
input_data = pd.DataFrame([{
    'Product_Weight': pwt,
    'Product_Allocated_Area': paa,
    'Product_MRP': pmrp,
    'Product_Sugar_Content': psc,
    'Product_Type': ptyp,
    'Store_Establishment_Year': seyr,
    'Store_Size': ssize,
    # 'pid_c2':pid_c2,
    'Store_Location_City_Type': sloctype,
    'Store_Type': styp
}])

# Make prediction when the "Predict" button is clicked
if st.button("Predict"):
    print("payload = ", input_data.to_dict(orient='records')[0])
    response = requests.post("https://surnellas-SuperKart-Backend.hf.space/v1/sales", json=input_data.to_dict(orient='records')[0])
    if response.status_code == 200:
        prediction = response.json()['Predicted Sales']
        st.success(f"Predicted Sales: {prediction}")
    else:
        st.error("Error making prediction.")

## Dependencies File

In [None]:
%%writefile frontend_files/requirements.txt
pandas==2.2.2
numpy==2.0.2
scikit-learn==1.6.1
xgboost==2.1.4
joblib==1.4.2
Werkzeug==2.2.2
flask==2.2.2
gunicorn==20.1.0
requests==2.28.1
uvicorn[standard]
streamlit==1.43.2

## DockerFile

In [None]:
%%writefile frontend_files/Dockerfile
# Use a minimal base image with Python 3.9 installed
FROM python:3.9-slim

# Set the working directory inside the container to /app
WORKDIR /app

# Copy all files from the current directory on the host to the container's /app directory
COPY . .

# Install Python dependencies listed in requirements.txt
RUN pip3 install -r requirements.txt

# Define the command to run the Streamlit app on port 8501 and make it accessible externally
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.enableXsrfProtection=false"]

# NOTE: Disable XSRF protection for easier external access in order to make batch predictions

## Uploading Files to Hugging Face Space (Streamlit Space)

In [None]:
# for hugging face space authentication to upload files
from huggingface_hub import HfApi

repo_id = "surnellas/SuperKart_Frontend"  # Your Hugging Face space id

# Initialize the API
api = HfApi()

# Upload the Streamlit app files stored in the folder called frontend_files
api.upload_folder(
    folder_path="frontend_files",  # Local folder path
    repo_id=repo_id,  # Hugging face space id
    repo_type="space",  # Hugging face repo type "space"
)

### **Hugging face links**

- Frontend:  https://huggingface.co/spaces/surnellas/SuperKart_Frontend
- Backend:   https://huggingface.co/spaces/surnellas/SuperKart_Backend
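
For reference, the JSON body that the backend's `/v1/sales` endpoint reads can be sketched as below. The field names come from the Flask handler in this notebook; the values are made up, and `missing_fields` is a hypothetical helper, not part of the deployed app.

```python
import json

# Field names read by the /v1/sales handler
REQUIRED_FIELDS = [
    "Product_Weight", "Product_Allocated_Area", "Product_MRP",
    "Product_Sugar_Content", "Product_Type", "Store_Establishment_Year",
    "Store_Size", "Store_Location_City_Type", "Store_Type",
]

def missing_fields(payload):
    """Return the required fields that are absent from a request payload."""
    return [name for name in REQUIRED_FIELDS if name not in payload]

# Made-up example values; the year is a string, as the Streamlit frontend sends it
payload = {
    "Product_Weight": 12.5,
    "Product_Allocated_Area": 0.05,
    "Product_MRP": 150.0,
    "Product_Sugar_Content": "Low Sugar",
    "Product_Type": "Dairy",
    "Store_Establishment_Year": "2009",
    "Store_Size": "Medium",
    "Store_Location_City_Type": "Tier 2",
    "Store_Type": "Supermarket Type2",
}

body = json.dumps(payload)  # what a client would send as the request body
print(missing_fields(payload))
print(missing_fields({"Product_MRP": 150.0}))
```

A check like this before indexing the payload would let a handler reject incomplete requests explicitly instead of failing on an unhandled `KeyError`.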

## **Screenshot of the Backend**

- Colab does not embed screenshots properly in Markdown cells, so I loaded the screenshot using OpenCV to display the image in Colab.
- I captured this screenshot while the backend was running.

In [None]:
import cv2
import matplotlib.pyplot as plt

# Define the path to the image file
image_path = '/content/Backend.png'

# Load the image using OpenCV
# cv2.imread reads an image in BGR format by default
image = cv2.imread(image_path)

# Check if the image was loaded successfully
if image is not None:
    # Convert the image from BGR to RGB for displaying with Matplotlib
    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Display the image using Matplotlib
    plt.imshow(image_rgb)
    plt.title('Backend Hugging Face Screenshot')
    plt.axis('off') # Hide axes
    plt.show()
else:
    print(f"Error: Could not load image from {image_path}")


## **Screenshot of the frontend**

1. This frontend screenshot shows the inputs entered and the output text for the prediction.
2. The REST JSON API is executed. The code, with links, is above.



In [None]:
import cv2
import matplotlib.pyplot as plt

# Define the path to the image file
image_path = '/content/Frontend.png'

# Load the image using OpenCV
# cv2.imread reads an image in BGR format by default
image = cv2.imread(image_path)

# Check if the image was loaded successfully
if image is not None:
    # Convert the image from BGR to RGB for displaying with Matplotlib
    image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    # Display the image using Matplotlib
    plt.imshow(image_rgb)
    plt.title('Frontend Hugging Face screenshot')
    plt.axis('off') # Hide axes
    plt.show()
else:
    print(f"Error: Could not load image from {image_path}")

![Screenshot 2025-08-13 204412_Frontend.png](<attachment:/content/Screenshot 2025-08-13 204412_Frontend.png>)

# **Actionable Insights and Business Recommendations**

## **Insights**

- The dataset is well formed with no invalid data. The only correction needed was in Product_Sugar_Content (PSC), where some entries used "reg" instead of "Regular"; this was easily fixed. No elaborate treatment was required for missing or null values.
- There are 4 stores in the dataset: two are Supermarkets, one is a Departmental Store, and one is a Food Mart.
- The dataset contains Product and Store attributes. There are 8763 products, and they do not overlap across stores.
- The Product and Store attributes are numerous, and their correlations were explored in detail to support an accurate prediction, with more than 91% accuracy as validated by the cross-validation scores.
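
The "reg" to "Regular" correction mentioned above can be sketched as follows. This is a minimal illustration on a few made-up rows rather than the notebook's actual cleaning cell: the sample values are assumptions, and only the `Product_Sugar_Content` column name comes from the data dictionary.

```python
import pandas as pd

# Hypothetical sample rows; the real notebook works on the full SuperKart dataset.
sample = pd.DataFrame(
    {"Product_Sugar_Content": ["Low Sugar", "reg", "Regular", "No Sugar", "reg"]}
)

# Map the stray "reg" label onto "Regular"; every other label is left unchanged.
sample["Product_Sugar_Content"] = sample["Product_Sugar_Content"].replace(
    {"reg": "Regular"}
)

# Count missing values and inspect the cleaned category counts.
missing = int(sample["Product_Sugar_Content"].isna().sum())
counts = sample["Product_Sugar_Content"].value_counts()
print(missing)
print(counts)
```

Running `value_counts()` before and after the replacement is a quick way to confirm that no unexpected labels remain.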

The insights are as follows:

- The 4 stores in the dataset carry mutually exclusive products and are meant for different market segments. By design, the store types address different customer needs.
- Supermarket Type2 addresses daily-use products that are most commonly purchased by every household on a regular basis. It targets the lower to middle and upper-middle quartiles of the MRP segments for the products.
- Supermarket Type1 offers an intelligent mix of near-premium products and a selected set of frequently used products from the upper-middle quartile. Its MRP range is above that of Supermarket Type2 but below that of the Departmental Store. It is the "High" size store, located in the Tier 2 areas of the cities.
- The Departmental Store serves the more premium product segment and targets higher MRPs. Even with fewer orders, its overall sales will be much higher than those of supermarkets such as Type1. It addresses Tier 1 cities.
- The Food Mart is like a convenience store: a "Small" store addressing very low MRPs for products.
- The products are classified into 16 different types. Broadly, there are food items, drinks, and non-consumables. Food items make up a large portion of sales, with Fruits and Vegetables the highest for all the stores.
- The key attributes for products are:
   - Weight: the product weight
        - In the lower-middle, middle, and upper-middle quartiles, product weight is strongly and positively correlated with sales. In the lowest quartile and the upper/outlier quartiles, weight is not well correlated, meaning a lower weight can have a higher MRP and vice versa.
        - Supermarkets generally focus on these MRP segments, so the correlation between weight and sales is very high for Supermarkets.
   - MRP: the selling price of the product
        - By design, it characterizes the store type. The 4 store types address different MRP segments.
   - Area: the display area allocated to the product
        - This is fully controlled by store administration to maximize sales. It succeeded in more than 90% of cases, but in a few cases the allocated area is negatively correlated with sales. The business has to review these case by case and adjust the allocated areas.
        - The heatmap section of this report details the product and store combinations where negative correlations are observed.
- The stores by design address different MRPs. They monitor sales and make their own decisions on how to adjust the display area allocated to products so that sales are maximized. Product weight influences sales because area and MRP are correlated with weight.
- The relationship between product attributes and sales is very complex and difficult to visualize. The heatmap and pairplot provide only weak insights. This is a good case for an ensemble model to uncover deeper relationships and capture them in an ML model.
- Another product attribute is the sugar content level. It is fairly straightforward: sales are high for products that are "Low Sugar", and sales of "Regular" sugar content are relatively low by comparison. This attribute also classifies "Non Consumable" products as "No Sugar".
- The store attributes are more straightforward:
   - Store location type: indicates which tier the store targets. Tier 1 is for high cost-of-living areas, Tier 3 is for low cost-of-living areas, and Tier 2 is in between. In the dataset provided, the Supermarket stores are in "Tier 2", the Departmental Store is in "Tier 1", and the Food Mart is in "Tier 3".
   - Store size: classifies stores as "High", "Medium", and "Small". This simply reflects the area of the store; it does not reflect allocated areas or MRPs. In the dataset provided, "High" is the "Tier 2" Supermarket Type1, "Medium" covers the Departmental Store and Supermarket Type2, and the "Tier 3" Food Mart is "Small".
   - Store establishment year: the year in which the store was established. The store established in 1987 managed MRP, weight, and area very well, and this Type1 store had the highest correlations with sales. Other stores should adopt its practices for allocating areas and choosing MRPs and weights.
- Overall, sales prediction depends on complex combinations of product and store attributes, so the model has to consider all of them. There are many hidden relationships between store and product attributes, which are best analyzed by ensemble models.
   - The model built handles this well. It has 92% accuracy and is ready for production deployment.
   - It has been deployed on Hugging Face with Streamlit for use in demos.
   - The links for the Frontend and Backend are listed in the section before this one.
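
The negative area-to-sales correlations noted above can also be surfaced programmatically. The sketch below is one possible check, not the notebook's heatmap code: it computes the Pearson correlation between allocated area and sales within each store and keeps only the negative ones. The column names `Store_Id`, `Product_Allocated_Area`, and `Product_Store_Sales_Total`, the helper name `negative_area_correlations`, and the toy values are all assumptions for illustration.

```python
import pandas as pd

def negative_area_correlations(df, group_col, area_col, sales_col):
    """Return per-group area/sales Pearson correlations that are below zero."""
    corr_by_group = pd.Series(
        {
            key: group[area_col].corr(group[sales_col])
            for key, group in df.groupby(group_col)
        }
    )
    return corr_by_group[corr_by_group < 0]

# Toy data (hypothetical): store "A" sells more as area grows, store "B" sells less.
toy = pd.DataFrame(
    {
        "Store_Id": ["A"] * 4 + ["B"] * 4,
        "Product_Allocated_Area": [0.02, 0.04, 0.06, 0.08] * 2,
        "Product_Store_Sales_Total": [100, 200, 300, 400, 400, 300, 250, 100],
    }
)

flagged = negative_area_correlations(
    toy, "Store_Id", "Product_Allocated_Area", "Product_Store_Sales_Total"
)
print(flagged)
```

Grouping by a product-type column instead of the store column would give the per-category view; any group this check flags is a candidate for the case-by-case review of allocated area.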

## **Recommendations**

Display area allocation:
- Display areas must prioritize products with higher sales over products with low sales.
- Check the anomalies listed in the heatmap for negative correlations in product area allocations.
- Use the heatmap data to correct allocations so that areas correspond to sales.

Double down on best selling items:
- Prioritize Fruits & Vegetables, Snack Foods, Dairy, Frozen for end-caps, promos, and in-stock rigor.

Store-type strategy:
- Supermarket Type2: Treat as the growth engine; run cross-category bundles and high-velocity replenishment.
- Departmental/Type1: Use targeted promos to lift basket size; mirror bestsellers from Type2.
- Food Mart: Curate a tight, high-turn core; avoid long-tail inventory.

Size & layout:
- High/Medium stores: Keep space-rich for top categories; protect space against fragmentation.
- Small stores: Focus on premium, faster-moving items (higher MRP bins) to lift average ticket.

Pricing & portfolio:
- Create good-better-best ladders in categories where MRP bins show step-ups in average sales.
- Use price anchoring: display premium SKUs to increase perceived value and trade-up.

Assortment by city tier & sugar:
- In health-conscious catchments (if you can map city tiers to affluence), increase low/regular sugar variants in Dairy/Snacks and message “better-for-you” options.

Promo & ops:
- Run basket-building offers that pair produce with dairy/snacks (complements with highest sales).
- Track out-of-stock on top 10 products and enforce service-level SLAs.