---
<center><h1>Big Mart Sales Prediction</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

The objective of this project is to utilize machine learning techniques to create an accurate predictive model for **forecasting sales of products in the Big Mart retail chain**. This problem is a **Regression Machine Learning task**, where the goal is to predict continuous numerical values. Specifically, we aim to predict the sales of various products across different stores based on historical sales data and other relevant features. The developed model will help Big Mart optimize inventory management, stock replenishment strategies and store performance by providing reliable sales forecasts.

## 2) Understanding Data
---

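The project uses the **Big Mart Sales Data**, which contains several predictor (independent) variables describing items and outlets, such as `Item_Weight`, `Item_Visibility`, `Item_MRP`, `Outlet_Size` and `Outlet_Type`, and one outcome (dependent) variable, `Item_Outlet_Sales`.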

## 3) Getting System Ready
---
Importing required libraries


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")
%matplotlib inline

## 4) Data Eyeballing
---

### Loading Data

In [None]:
big_mart_sales_df = pd.read_csv('Datasets/Day12_Big_Mart_Sales_Data.csv') 

In [None]:
big_mart_sales_df

In [None]:
print('The size of Dataframe is: ', big_mart_sales_df.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
big_mart_sales_df.info()
print('-'*100)

In [None]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in big_mart_sales_df.columns if big_mart_sales_df[feature].dtype != 'O']
categorical_features = [feature for feature in big_mart_sales_df.columns if big_mart_sales_df[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

In [None]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=big_mart_sales_df.isnull().sum().sort_values(ascending=False)
percent=(big_mart_sales_df.isnull().sum()/big_mart_sales_df.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [None]:
print('Summary Statistics of numerical features for DataFrame are as follows:')
print('-'*100)
big_mart_sales_df.describe()

In [None]:
print('Summary Statistics of categorical features for DataFrame are as follows:')
print('-'*100)
big_mart_sales_df.describe(include='object').T

## 5) Data Cleaning & Preprocessing
---

### Handling Missing Values

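#### Filling the missing values in the "Item_Weight" column with the mean

In [None]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which newer pandas versions warn about and may not apply to the original DataFrame
big_mart_sales_df['Item_Weight'] = big_mart_sales_df['Item_Weight'].fillna(big_mart_sales_df['Item_Weight'].mean())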

#### Filling the missing values in the "Outlet_Size" column with the mode

In [None]:
mode_of_Outlet_size = big_mart_sales_df['Outlet_Size'].mode()[0]
mode_of_Outlet_size

In [None]:
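# Assign the result back rather than relying on inplace=True on a column selection
big_mart_sales_df['Outlet_Size'] = big_mart_sales_df['Outlet_Size'].fillna(mode_of_Outlet_size)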

In [None]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=big_mart_sales_df.isnull().sum().sort_values(ascending=False)
percent=(big_mart_sales_df.isnull().sum()/big_mart_sales_df.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
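
To see what mean and mode imputation do in isolation, here is a minimal sketch on a small made-up DataFrame. The values are illustrative assumptions and are not taken from the Big Mart CSV; only the two column names match the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the Big Mart CSV)
toy_df = pd.DataFrame({
    "Item_Weight": [10.0, np.nan, 14.0, 12.0],
    "Outlet_Size": ["Medium", "Small", None, "Medium"],
})

# Numeric column: the mean skips missing entries, so it is (10 + 14 + 12) / 3
weight_mean = toy_df["Item_Weight"].mean()
toy_df["Item_Weight"] = toy_df["Item_Weight"].fillna(weight_mean)

# Categorical column: mode() returns the most frequent value(s); take the first
size_mode = toy_df["Outlet_Size"].mode()[0]
toy_df["Outlet_Size"] = toy_df["Outlet_Size"].fillna(size_mode)

# Count any missing values that remain after imputation
missing_after = int(toy_df.isnull().sum().sum())
print(toy_df)
print("Missing values left:", missing_after)
```

Mean imputation keeps the column average unchanged but shrinks its variance, so it is a simple baseline rather than the only reasonable choice.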

### Item Weight Distribution

In [None]:
plt.figure(figsize=(6,6))
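# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(big_mart_sales_df['Item_Weight'], kde=True)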
plt.show()

### Item Visibility Distribution

In [None]:
plt.figure(figsize=(6,6))
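# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(big_mart_sales_df['Item_Visibility'], kde=True)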
plt.show()

### Item MRP Distribution

In [None]:
plt.figure(figsize=(6,6))
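# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(big_mart_sales_df['Item_MRP'], kde=True)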
plt.show()

### Item Outlet Sales Distribution

In [None]:
plt.figure(figsize=(6,6))
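# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(big_mart_sales_df['Item_Outlet_Sales'], kde=True)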
plt.show()

### Outlet Establishment Year

In [None]:
plt.figure(figsize=(6,6))
sns.countplot(x='Outlet_Establishment_Year', data=big_mart_sales_df)
plt.show()

### Item Fat Content Distribution

In [None]:
plt.figure(figsize=(6,6))
sns.countplot(x='Item_Fat_Content', data=big_mart_sales_df)
plt.show()

### Item Type Distribution

In [None]:
plt.figure(figsize=(15,15))
sns.countplot(x='Item_Type', data=big_mart_sales_df)
plt.xticks(rotation=90)
plt.show()

### Outlet Size Distribution

In [None]:
plt.figure(figsize=(6,6))
sns.countplot(x='Outlet_Size', data=big_mart_sales_df)
plt.show()

In [None]:
big_mart_sales_df

In [None]:
big_mart_sales_df['Item_Fat_Content'].value_counts()

In [None]:
big_mart_sales_df.replace({'Item_Fat_Content': {'low fat':'Low Fat','LF':'Low Fat', 'reg':'Regular'}}, inplace=True)

In [None]:
big_mart_sales_df['Item_Fat_Content'].value_counts()

### Encoding the Categorical Features

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

In [None]:
# Label-encode each categorical column; fit_transform refits the encoder for every column
categorical_columns = ['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
                       'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']

for column in categorical_columns:
    big_mart_sales_df[column] = encoder.fit_transform(big_mart_sales_df[column])

In [None]:
big_mart_sales_df
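
As a rough illustration of what `LabelEncoder` does to a single column, the sketch below encodes a short list of made-up size labels. The list is an assumption for demonstration only and is not read from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels (assumed, not read from the dataset)
sizes = ["Small", "High", "Medium", "Small"]

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(sizes)

# classes_ holds the unique labels in sorted order; the position is the integer code
mapping = {label: code for code, label in enumerate(demo_encoder.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform recovers the original labels from the codes
decoded = demo_encoder.inverse_transform(codes).tolist()
print(decoded)
```

Because the codes follow alphabetical order, they do not reflect any natural ranking of the categories. Tree-based models are usually tolerant of this, while linear models may read the integers as an ordering, so one-hot or ordinal encoding may be worth considering for those.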

## 6) Model Building
---

### Creating Feature Matrix (Independent Variables) & Target Variable (Dependent Variable)

In [None]:
# separating the data and labels
X = big_mart_sales_df.drop(columns=['Item_Outlet_Sales']) # Feature matrix
y = big_mart_sales_df['Item_Outlet_Sales'] # Target variable

In [None]:
X

In [None]:
y

### Data Standardization

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [None]:
scaler.fit(X)

In [None]:
standardized_data = scaler.transform(X)

In [None]:
standardized_data

In [None]:
X = standardized_data

In [None]:
X

### Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

In [None]:
print(y.shape, y_train.shape, y_test.shape)
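
Fitting the scaler on the full feature matrix before splitting lets test-set statistics influence the preprocessing. One common way to avoid this is to split first and wrap the scaler and the model in a scikit-learn `Pipeline`, so the scaler is fitted on the training rows only. The sketch below shows the idea on synthetic data; the `*_demo` names and the generated data are assumptions for illustration, not part of this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumed; not the Big Mart features)
rng = np.random.default_rng(45)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=1.0, size=200)

# Split first, so the test rows never influence the scaler
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=45
)

# The pipeline fits the scaler on the training data, then fits the model on the scaled result
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train_demo, y_train_demo)

# The scaler's learned means match the training split, not the full data
fitted_scaler = pipeline.named_steps["standardscaler"]
print("Scaler means:", fitted_scaler.mean_)
print("Train means :", X_train_demo.mean(axis=0))

# score() scales the test rows with the training statistics and returns R2
test_r2 = pipeline.score(X_test_demo, y_test_demo)
print("Test R2:", test_r2)
```

The same pipeline object can be passed to cross-validation utilities, which then refit the scaler inside every fold.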

### Model Comparison : Training & Evaluation

In [None]:
# For Model Building
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
models = [LinearRegression, Lasso, Ridge, SVR, DecisionTreeRegressor, RandomForestRegressor]
mae_scores = []
mse_scores = []
rmse_scores = []
r2_scores = []

for model in models:
    regressor = model().fit(X_train, y_train)
    y_pred = regressor.predict(X_test)
    
    mae_scores.append(mean_absolute_error(y_test, y_pred))
    mse_scores.append(mean_squared_error(y_test, y_pred))
    # RMSE as the square root of MSE; the squared=False argument is deprecated in newer scikit-learn
    rmse_scores.append(np.sqrt(mean_squared_error(y_test, y_pred)))
    r2_scores.append(r2_score(y_test, y_pred))

In [None]:
regression_metrics_df = pd.DataFrame({
    "Model": ["Linear Regression", "Lasso", "Ridge", "SVR", "Decision Tree Regressor", "Random Forest Regressor"],
    "Mean Absolute Error": mae_scores,
    "Mean Squared Error": mse_scores,
    "Root Mean Squared Error": rmse_scores,
    "R-squared (R2)": r2_scores
})

regression_metrics_df.set_index('Model', inplace=True)
regression_metrics_df
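
For reference, the four metrics in the table can be computed by hand with NumPy. The sketch below uses a tiny made-up pair of arrays (assumed values, unrelated to the Big Mart results) to show how each number is derived.

```python
import numpy as np

# Made-up true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

residuals = y_true - y_hat

# Mean Absolute Error: average size of the errors
mae = np.mean(np.abs(residuals))

# Mean Squared Error: average squared error, penalizes large misses more
mse = np.mean(residuals ** 2)

# Root Mean Squared Error: square root of MSE, in the same units as the target
rmse = np.sqrt(mse)

# R-squared: 1 minus the ratio of residual variation to total variation around the mean
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MAE :", mae)
print("MSE :", mse)
print("RMSE:", rmse)
print("R2  :", r2)
```

An R2 of 0 corresponds to always predicting the mean of the target, and 1 corresponds to perfect predictions.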

### Inference

In the context of predicting Big Mart sales:
- The Random Forest Regressor outperforms the other models, with the lowest Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). Its R-squared (R2) value of 0.557 means it explains roughly 56% of the variance in sales on the test set, which makes it the top choice among the models compared.
- The linear models (Linear Regression, Lasso, Ridge) perform slightly worse, with R2 values around 0.514.
- The Support Vector Regressor (SVR) and Decision Tree Regressor lag behind and give less accurate sales forecasts.

Overall, the **Random Forest Regressor** stands out as the **most promising** candidate for optimizing inventory management and sales forecasting in Big Mart stores.
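
The comparison above uses default hyperparameters, and the notebook imports `RandomizedSearchCV` without using it. A possible next step is a small randomized search over the Random Forest settings. The sketch below runs on synthetic data; the search space, the `*_demo` names and the data are assumptions chosen for illustration, not tuned values for Big Mart.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumed; not the Big Mart features)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(150, 4))
y_demo = 10.0 * X_demo[:, 0] + 5.0 * X_demo[:, 1] ** 2 + rng.normal(scale=0.5, size=150)

# Hypothetical search space; sensible ranges depend on the real data
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 3, 5],
}

# Sample 5 parameter combinations and score each with 3-fold cross-validation
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_demo, y_demo)

# The scorer returns negative RMSE, so flip the sign for reporting
best_cv_rmse = -search.best_score_
print("Best parameters:", search.best_params_)
print("Best CV RMSE:", best_cv_rmse)
```

Cross-validated scores are generally a steadier basis for choosing between models than a single train-test split, at the cost of extra training time.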