**UNDERSTANDING OF THE PROBLEM STATEMENT:**
    
*   According to the quote, "Success in sales is the sum of small efforts, repeated day in & day out"
  
*   Consider a supermarket chain with several outlets around the world that wants us to predict the sales it can expect.

**APPLICATION OF PREDICTING THE SALES:**
   
*    We can tell the company which challenges it may face

*    We can identify which brands or products sell the most

*    This helps the sales team decide which products to sell and which to promote

*    The company can also build marketing plans: if a particular product sells the most in a particular store, understanding why it sells so well can guide better marketing decisions
    


**IMPORTING LIBRARIES**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn import metrics

**LOADING THE DATA**

In [None]:
# loading the data
sales_data = pd.read_csv('/kaggle/input/bigmart-sales-data/Train.csv')
#checking the first 5 rows of the dataframe
sales_data.head()

**It is important to note that Item_Outlet_Sales is the target variable which we are going to predict & the remaining are the feature variables**

In [None]:
# checking the number of rows (data points) & the number of columns
sales_data.shape

**Hence, we have 8523 rows (data points) and 12 columns: 11 features plus the target Item_Outlet_Sales**

In [None]:
# getting some information about the dataset
sales_data.info()

**Categorical Features:**

* Item_Identifier : the unique ID of the product

* Item_Fat_Content : whether the product is low fat or regular

* Item_Type : the category of the product, such as meat or soft drinks

* Outlet_Identifier : the unique ID of the outlet

* Outlet_Size : whether the outlet is small, medium or high in size

* Outlet_Location_Type : the tier of the outlet's location, such as tier 1 or tier 2

* Outlet_Type : whether the outlet is a supermarket or a grocery store

In [None]:
# checking for missing values
sales_data.isnull().sum()

We can observe that there are 1463 missing values in the Item_Weight column & 2410 missing values in the Outlet_Size column


**IN ORDER TO DEAL WITH THE MISSING VALUES**

**Mean --> average**

* The mean of a dataset is its average value, i.e. a number around which the data is spread out. All values are weighted equally when calculating the mean.

* In this case, we fill the missing values in a numerical column with the mean of that column

**Mode --> most repeated value**

* The mode is the value that appears most frequently in a data set. A set of data may have one mode, more than one mode, or no mode at all. The mode can be the same value as the mean and/or median, but this is usually not the case.

* In this case, we fill the missing values in a categorical feature with the mode of that column
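As a small illustration on made-up values (the toy Series below are not taken from the BigMart data), pandas returns every tied value from `mode()`, which is why a single entry is usually picked with `[0]` or `.iloc[0]` before filling:

```python
import pandas as pd

# toy columns, invented for illustration only
weights = pd.Series([10.0, None, 14.0, 12.0])
sizes = pd.Series(["Small", "Medium", None, "Small", "Medium"])

# numerical column: fill with the mean of the observed values
mean_weight = weights.mean()              # missing entries are skipped
weights_filled = weights.fillna(mean_weight)

# categorical column: mode() returns all tied values
all_modes = sizes.mode()                  # "Medium" and "Small" both appear twice
sizes_filled = sizes.fillna(all_modes.iloc[0])

print(mean_weight)
print(list(all_modes))
print(list(sizes_filled))
```

When two values tie, which one ends up first is a pandas detail rather than a modelling decision, so it may be worth checking the full result of `mode()` before relying on it.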

**Replacing the missing values in the "Item_Weight" column**

In [None]:
# mean value of "Item_Weight" column
sales_data['Item_Weight'].mean()

In [None]:
# filling the missing values in "Item_Weight" column with the mean value
sales_data['Item_Weight'] = sales_data['Item_Weight'].fillna(sales_data['Item_Weight'].mean())

**Replacing the missing values in the "Outlet_Size" column**

In [None]:
# mode of "Outlet_Size" column
sales_data['Outlet_Size'].mode()

In [None]:
# filling the missing values in "Outlet_Size" column with the mode
# the mode is computed separately for each Outlet_Type, since the two columns are related
mode_of_Outlet_size = sales_data.pivot_table(values='Outlet_Size', columns='Outlet_Type', aggfunc=(lambda x: x.mode()[0]))

In [None]:
print(mode_of_Outlet_size)

From the above pivot table, we can observe that

* If the outlet type is Grocery Store, the most common outlet size (mode) is Small
* If the outlet type is Supermarket Type1, the most common outlet size (mode) is Small
* If the outlet type is Supermarket Type2, the most common outlet size (mode) is Medium
* If the outlet type is Supermarket Type3, the most common outlet size (mode) is Medium
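The idea of filling each missing size with the mode of its outlet type can be sketched on a toy table. The data below is invented for illustration, and `groupby` with `map` is used here as one possible way to do the lookup. It is an alternative to the pivot table, not the notebook's own method:

```python
import pandas as pd

# toy data, invented for illustration; column names mirror the dataset
toy = pd.DataFrame({
    "Outlet_Type": ["Grocery Store", "Grocery Store", "Grocery Store",
                    "Supermarket Type2", "Supermarket Type2", "Supermarket Type2"],
    "Outlet_Size": ["Small", "Small", None, "Medium", None, "Medium"],
})

# most frequent size within each outlet type (missing values are ignored by mode())
mode_by_type = toy.groupby("Outlet_Type")["Outlet_Size"].agg(lambda s: s.mode().iloc[0])

# fill only the missing sizes, looking up the mode for that row's outlet type
missing = toy["Outlet_Size"].isnull()
toy.loc[missing, "Outlet_Size"] = toy.loc[missing, "Outlet_Type"].map(mode_by_type)

print(mode_by_type.to_dict())
print(toy["Outlet_Size"].tolist())
```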

In [None]:
miss_values = sales_data['Outlet_Size'].isnull()

In [None]:
print(miss_values)

**False** means the entry is not null, i.e. the **value is present**

**True** means the **value is missing**

In [None]:
# replacing each missing Outlet_Size with the mode for that row's Outlet_Type
sales_data.loc[miss_values, 'Outlet_Size'] = sales_data.loc[miss_values,'Outlet_Type'].apply(lambda x: mode_of_Outlet_size[x].iloc[0])

In [None]:
# checking for missing values
sales_data.isnull().sum()

**Thus we no longer have any missing values in either the numerical column or the categorical column**

**ANALYSING THE DATA**

In [None]:
#statistical measures about the data
sales_data.describe()

**DATA VISUALIZATION**

* Data visualization is the graphical representation of information and data.
* It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns

**VISUALIZATION OF NUMERICAL FEATURES**

In [None]:
sns.set()

In [None]:
# Item_Weight distribution
#plt.figure(figsize=(5,5))
sns.histplot(sales_data['Item_Weight'], kde=True, stat='density', color='purple')
plt.show()

* From the above graph we can observe that item weights range from about 5 kg to about 20 kg, with the highest peak near 12 kg, close to the mean of 12.85 kg. Part of this peak comes from the 1463 missing values that were filled with the mean.

* Therefore, across these 8523 records the average weight is about 12.85 kg

In [None]:
# Item Visibility distribution
#plt.figure(figsize=(5,5))
sns.histplot(sales_data['Item_Visibility'], kde=True, stat='density', color='purple')
plt.show()

* Hence from the above graph we can observe that Item_Visibility feature is positively skewed
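Skewness can also be checked numerically rather than by eye. A minimal sketch on made-up values, using pandas' `Series.skew()`; on the real data the same call could be applied to `sales_data['Item_Visibility']`:

```python
import pandas as pd

# toy values, invented for illustration: many small values and one large one
values = pd.Series([0.01, 0.02, 0.02, 0.03, 0.04, 0.05, 0.30])

skewness = values.skew()   # sample skewness; above 0 indicates a longer right tail
print(skewness > 0)

# a long right tail also pulls the mean above the median
print(values.mean() > values.median())
```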

In [None]:
# Item MRP distribution
#plt.figure(figsize=(5,5))
sns.histplot(sales_data['Item_MRP'], kde=True, stat='density', color='purple')
plt.show()

* From the above graph, we can observe a good number of products around MRPs of roughly 50, 100 and 200, with fewer products at higher prices

* Hence most products fall in the range of about 100 MRP to 180 MRP

In [None]:
# Item_Outlet_Sales distribution
#plt.figure(figsize=(5,5))
sns.histplot(sales_data['Item_Outlet_Sales'], kde=True, stat='density', color='purple')
plt.show()

* Hence from the above graph we can observe that Item_Outlet_Sales feature is positively skewed

In [None]:
# Outlet_Establishment_Year column
#plt.figure(figsize=(5,5))
sns.countplot(x='Outlet_Establishment_Year', data=sales_data)
plt.show()

* From the above graph we can observe that the outlet establishment years run from 1985 and 1987 all the way to 2009

* These are the years in which the different outlets (stores) were established

* Since each bar counts rows, we can also observe that the most records come from outlets established in 1985 and the fewest from 1998, while the other years are roughly equal

**VISUALIZATION OF CATEGORICAL FEATURES**

In [None]:
# Item_Fat_Content column
#plt.figure(figsize=(5,5))
sns.countplot(x='Item_Fat_Content', data=sales_data)
plt.show()

* From the above graph we can observe that the data in the Item_Fat_Content column has to be cleaned: the labels Low Fat, low fat and LF all mean the same thing and must be merged into a single label. Similarly, Regular and reg must be merged into a single label.

* Hence, we need to preprocess this column, which we will do after the visualization of the data

In [None]:
# Item_Type column
plt.figure(figsize=(25,7))
sns.countplot(x='Item_Type', data=sales_data)
plt.show()

* From the above graph we can observe the different items or food types we have such as dairy, soft drinks, meat, fruits & vegetables, household etc

* Hence in total we have 16 Item_Type values, with the most rows in the fruits & vegetables category and the snack foods category
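The bar heights in a count plot are simply the number of rows per category, so the same information can be read as numbers with `value_counts()`. A small sketch on an invented column; on the real data this would be `sales_data['Item_Type'].value_counts()`:

```python
import pandas as pd

# toy column, invented for illustration
item_type = pd.Series(["Dairy", "Snack Foods", "Meat", "Snack Foods",
                       "Fruits and Vegetables", "Snack Foods", "Dairy"])

counts = item_type.value_counts()     # sorted from most to least frequent
print(counts.index[0], counts.iloc[0])

# number of distinct categories in the column
print(item_type.nunique())
```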

In [None]:
# Outlet_Size column
#plt.figure(figsize=(5,5))
sns.countplot(x='Outlet_Size', data=sales_data)
plt.show()

* From the above graph, we can observe that there are three Outlet_Size values in this case: Medium, Small & High

**PREPROCESSING OF DATA**

In [None]:
sales_data.head()

In [None]:
sales_data['Item_Fat_Content'].value_counts()

In [None]:
sales_data.replace({'Item_Fat_Content': {'low fat':'Low Fat','LF':'Low Fat', 'reg':'Regular'}}, inplace=True)

In [None]:
sales_data['Item_Fat_Content'].value_counts()

Hence, we have successfully cleaned the data in Item_Fat_Content column

**LABEL ENCODING:**
*     Label encoding refers to converting labels into numeric form so that they are machine-readable and machine learning algorithms can operate on them. It is an important preprocessing step for structured datasets in supervised learning.

*     In simple terms, it takes all the categorical values & transforms them into numerical values
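The mapping that label encoding produces can be sketched with plain NumPy: the distinct labels are sorted and each value is replaced by the position of its label. scikit-learn's `LabelEncoder` assigns its codes in the same sorted order. The toy column below is invented for illustration:

```python
import numpy as np

# toy column, invented for illustration
sizes = np.array(["Small", "Medium", "High", "Medium"])

# classes: sorted distinct labels; codes: index of each value's label in classes
classes, codes = np.unique(sizes, return_inverse=True)

print(classes.tolist())
print(codes.tolist())

# decoding is just indexing back into the sorted labels
print(classes[codes].tolist())
```

Because the codes follow alphabetical order, they do not necessarily reflect any natural ordering of the categories (here High comes before Medium and Small).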

In [None]:
encoder = LabelEncoder()

In [None]:
sales_data['Item_Identifier'] = encoder.fit_transform(sales_data['Item_Identifier'])

sales_data['Item_Fat_Content'] = encoder.fit_transform(sales_data['Item_Fat_Content'])

sales_data['Item_Type'] = encoder.fit_transform(sales_data['Item_Type'])

sales_data['Outlet_Identifier'] = encoder.fit_transform(sales_data['Outlet_Identifier'])

sales_data['Outlet_Size'] = encoder.fit_transform(sales_data['Outlet_Size'])

sales_data['Outlet_Location_Type'] = encoder.fit_transform(sales_data['Outlet_Location_Type'])

sales_data['Outlet_Type'] = encoder.fit_transform(sales_data['Outlet_Type'])

In [None]:
sales_data.head()

* Hence, we now have only numerical values in our data, where each unique category in a column has been assigned its own integer code

* Therefore we have successfully encoded the categorical columns into numerical values, which is an important data preprocessing step.

**SPLITTING FEATURES AND TARGET INTO X & Y RESPECTIVELY**

We know that the data in the "Item_Outlet_Sales" column is the target & remaining are the features

In [None]:
#Let's have all the features in X & target in Y
X = sales_data.drop(columns='Item_Outlet_Sales')
Y = sales_data['Item_Outlet_Sales']

In [None]:
# X contains features
print(X)

In [None]:
# Y contains target
print(Y)

**SPLITTING THE DATA INTO TRAINING DATA & TESTING DATA**

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)

In [None]:
print(X.shape, X_train.shape, X_test.shape)

We can observe that

* X contains the original data, which is 8523 rows

* X_train contains 80% of the data, which is 6818 rows

* X_test contains 20% of the data, which is 1705 rows
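The split sizes can be reproduced with simple arithmetic. This sketch assumes the test share is rounded up to a whole row and the remainder goes to training, which matches the shapes reported above:

```python
import math

n_rows = 8523
test_size = 0.2

# assumption: the test share is rounded up and the rest is used for training
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

print(n_train, n_test)
```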

**MACHINE LEARNING MODEL**

**SUPERVISED LEARNING:**

*  It is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately.

* Basically, supervised learning is when we teach or train the machine using data that is well labeled.

* In this particular project, the labels are the target values.

* In this case the targets are the sales amounts

**REGRESSION:**

* Regression means predicting a continuous value (here, sales)

**MACHINE LEARNING MODEL TRAINING - XGBoost Regressor**

Extreme Gradient Boosting (XGBoost) is an open-source library that provides an efficient and effective implementation of the gradient boosting algorithm, which can be used for regression predictive modeling.
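The core idea of gradient boosting for squared error can be sketched in a few lines of NumPy: start from the mean, then repeatedly fit a weak learner to the current residuals and add a fraction of its output to the prediction. This is only a simplified illustration on invented data, with hypothetical helper names (`fit_stump`, `predict_stump`). XGBoost itself uses deeper trees, regularization and second-order gradient information, none of which is shown here.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best splits the residuals (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    """Return the left or right value depending on which side of the threshold x falls."""
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# toy data, invented for illustration: the target rises with the feature in steps
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([10.0, 12.0, 30.0, 32.0, 60.0, 62.0])

learning_rate = 0.5
prediction = np.full_like(y, y.mean())     # start from the mean of the target
initial_error = np.mean((y - prediction) ** 2)

for _ in range(20):
    residual = y - prediction              # what the current ensemble gets wrong
    stump = fit_stump(x, residual)         # fit a weak learner to the residuals
    prediction += learning_rate * predict_stump(stump, x)

final_error = np.mean((y - prediction) ** 2)
print(initial_error, final_error)
```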

In [None]:
regressor = XGBRegressor()

In [None]:
#fit the model
#Training data is in X_train and the corresponding sales value is in Y_train
regressor.fit(X_train, Y_train)

**EVALUATION**

The R2 score is a very important metric that is used to evaluate the performance of a regression-based machine learning model. It is pronounced as R squared and is also known as the coefficient of determination. It measures the proportion of the variance in the target that is explained by the model's predictions: a value of 1 means a perfect fit, and a value of 0 means the model does no better than always predicting the mean.
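R squared can be written as `1 - SS_res / SS_tot`, where `SS_res` is the sum of squared prediction errors and `SS_tot` is the sum of squared deviations of the target from its mean. A minimal sketch on made-up numbers; the helper name `r2_score_manual` is hypothetical and is only meant to mirror what `metrics.r2_score` computes:

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination computed from its definition."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# toy values, invented for illustration
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]

r2 = r2_score_manual(y_true, y_pred)

# a model that always predicts the mean of the target scores exactly 0
r2_mean_model = r2_score_manual(y_true, [250.0] * 4)

print(r2)
print(r2_mean_model)
```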

**PREDICTION OF THE DATA**

In [None]:
# prediction on training data
sales_data_prediction = regressor.predict(X_train)

In [None]:
# In order to check the performance of the model on the training data we find the R squared value
r2_sales = metrics.r2_score(Y_train, sales_data_prediction)
print('R Squared value (training data) = ', r2_sales)

In [None]:
# prediction on test data
data_prediction = regressor.predict(X_test)

In [None]:
# R squared value on the test data
r2_data = metrics.r2_score(Y_test, data_prediction)

In [None]:
print('R Squared value (test data) = ', r2_data)

**BUILDING A PREDICTIVE SYSTEM**

* Building a predictive system in order to find the sales for the first product in the dataset

In [None]:
# encoded feature values of the first row in the dataset
input_data = (156, 9.300, 0, 0.016047, 4, 249.8092, 9, 1999, 1, 0, 1)
# wrapping the values in a one-row dataframe with the same columns the model was trained on
input_df = pd.DataFrame([input_data], columns=X.columns)
prediction = regressor.predict(input_df)
print("The sales for the first product in the dataset is predicted as ", prediction[0])

In [None]:
print("Thus we have built the model to predict the sales & have performed the evaluation successfully")

**STAY SAFE**🏡 & **STAY HEALTHY**👩

**HAPPY LEARNING** ✍ & **KEEP KAGGLING** 👩🏻

