### Problem Statement

Sales Prediction for Big Mart Outlets

The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Certain attributes of each product and store have also been defined. The aim is to build a predictive model that predicts the sales of each product at a particular outlet.

Using this model, BigMart will try to understand the properties of products and outlets which play a key role in increasing sales.

Please note that the data may have missing values, as some stores might not report all the data due to technical glitches. These values will need to be treated accordingly.

Data Dictionary

We have a train (8523 rows) and a test (5681 rows) data set. The train data set has both input and output variable(s); the task is to predict the sales for the test data set.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
train = pd.read_csv("../input/bigmart-sales-data/Train.csv")
test = pd.read_csv("../input/bigmart-sales-data/Test.csv")

In [None]:
# Preview the first five rows of the data
train.head()

In [None]:
train.info()

### Steps to Modelling

1. Problem Statement

2. Hypothesis Generation

3. Exploratory Data Analysis

3.1. Univariate Analysis

3.2. Bivariate or Multivariate Analysis

3.3. Missing Values Treatment

3.4. Outlier Identification

3.5. Feature Engineering

3.6. Standardization - the last step of EDA, popularly known as the data pre-processing step.

4. Applying Machine Learning Models

### Exploratory Data Analysis

1. Univariate Analysis

The columns in the dataset are either numerical or categorical.

For numerical columns, create a histogram (optionally with a density curve).

A histogram is a statistical plot that shows how the data are distributed. If the distribution is not normal (bell-shaped), it is typically skewed.

For categorical columns, create a bar plot (frequency/count plot).


In [None]:
# List the columns; the target variable is Item_Outlet_Sales
train.columns

In [None]:
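# sns.distplot is deprecated in recent seaborn; histplot with a KDE overlay gives the same view
sns.histplot(train.Item_Outlet_Sales, kde = True, stat = "density", color = "m")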
plt.show()

Item_Outlet_Sales is Positively Skewed
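
The skew seen in the plot can also be put into a number with `Series.skew()`. The sketch below uses a synthetic right-skewed sample (not the BigMart data) and shows how a `log1p` transform pulls the long right tail in; whether such a transform helps the model here is an open question, not something this notebook tests.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "sales" values (illustrative only, not the BigMart data).
rng = np.random.default_rng(0)
sales = pd.Series(rng.lognormal(mean=7.0, sigma=1.0, size=1000), name="sales")

raw_skew = sales.skew()            # > 0 means a longer right tail
log_skew = np.log1p(sales).skew()  # log1p compresses the right tail

print(f"skew before log1p: {raw_skew:.2f}")
print(f"skew after  log1p: {log_skew:.2f}")
```

On the real data the same call would be `train.Item_Outlet_Sales.skew()`.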

In [None]:
train.Item_Outlet_Sales.describe()

In [None]:
sns.histplot(train.Item_Visibility, kde = True, stat = "density", color = "red");

Visibility is higher for a lot of items.

In [None]:
sns.histplot(train.Item_Weight.dropna(), kde = True, stat = "density", color = "g");

In [None]:
sns.histplot(train.Item_MRP, kde = True, stat = "density", color = "r");

Item_MRP appears to have four distinct groups of values. This needs further exploration.

In [None]:
train.head()

In [None]:
test.Item_Fat_Content.value_counts()

In [None]:
# Assign the result back; inplace replace on an attribute-accessed column may not update the DataFrame
test["Item_Fat_Content"] = test["Item_Fat_Content"].replace(
    {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"})

In [None]:
# Replace "LF" and "low fat" with "Low Fat"
train["Item_Fat_Content"] = train["Item_Fat_Content"].replace(
    {"LF": "Low Fat", "low fat": "Low Fat"})


In [None]:
# Replace "reg" with "Regular"
train["Item_Fat_Content"] = train["Item_Fat_Content"].replace({"reg": "Regular"})
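
The label cleanup above can also be written as a single dictionary passed to `Series.replace`. A minimal sketch on a toy column (the values are illustrative, not read from the CSV):

```python
import pandas as pd

# Toy column with the same kind of inconsistent labels (illustrative values).
fat = pd.Series(["Low Fat", "LF", "low fat", "Regular", "reg"])

# One dictionary covers every variant; unlisted values pass through unchanged.
label_map = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
fat_clean = fat.replace(label_map)

print(fat_clean.value_counts())
```

Keeping the mapping in one place makes it easy to apply the identical cleanup to both train and test.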

In [None]:
# Item Fat Content
train.Item_Fat_Content.value_counts().plot(kind = "bar")

In [None]:
# Item Type
train.Item_Type.value_counts().plot(kind = "bar")
plt.show()

# The same counts with seaborn
sns.countplot(x = "Item_Type", data = train)
plt.xticks(rotation = 90)
plt.show()

In [None]:
# Outlet_Identifier
train.Outlet_Identifier.value_counts().plot(kind = "bar")

OUT010 and OUT019 have the lowest frequency counts.

In [None]:
# Outlet_Size
train.Outlet_Size.value_counts().plot(kind = "bar")

Medium is the most frequent outlet size.

In [None]:
# Outlet_Type
train.Outlet_Type.value_counts().plot(kind = "bar");

Most of the outlets are of type Supermarket Type1 (S1).

### Bivariate Analysis

1. Num vs Num - Scatterplot

2. Cat Vs Num - Boxplot (Statistical Plot) | Violin Plot

3. Cat Vs Cat - pd.crosstab | Table - Frequency
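
For the Cat vs Cat case, a small sketch of `pd.crosstab` on a toy frame. The column names mirror the dataset, but the rows are made up for illustration:

```python
import pandas as pd

# Toy rows (illustrative only); each row stands for one record.
toy = pd.DataFrame({
    "Outlet_Size": ["Small", "Small", "Medium", "High", "Medium", "Small"],
    "Outlet_Type": ["Grocery Store", "Supermarket Type1", "Supermarket Type1",
                    "Supermarket Type1", "Supermarket Type2", "Grocery Store"],
})

# Rows are outlet sizes, columns are outlet types, cells are frequencies.
size_by_type = pd.crosstab(toy["Outlet_Size"], toy["Outlet_Type"])
print(size_by_type)
```

On the real data the equivalent call would be `pd.crosstab(train.Outlet_Size, train.Outlet_Type)`.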

In [None]:
# Num vs Num
train.head()

In [None]:
plt.scatter(train.Item_Weight, train.Item_Outlet_Sales, color = "magenta");

There is no clear pattern between the two.

In [None]:
plt.figure(figsize = [10, 8])
plt.scatter(train.Item_Visibility, train.Item_Outlet_Sales, color = "red");

There are many zeros in Item_Visibility for items that nevertheless have sales.

In [None]:
plt.scatter(train.Item_MRP, train.Item_Outlet_Sales, color = "hotpink")
# Price Per Unit

In [None]:
# Cat Vs Numerical
# Recent seaborn versions require x and y as keyword arguments
sns.boxplot(x = train.Item_Fat_Content, y = train.Item_Outlet_Sales)

In [None]:
train.groupby("Item_Fat_Content")["Item_Outlet_Sales"].describe().T
# Hint: compare the Empirical Rule with its distribution-free counterpart, Chebyshev's inequality
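
To make the hint concrete: the Empirical Rule (about 68/95/99.7% of values within 1/2/3 standard deviations) assumes a normal distribution, while Chebyshev's inequality guarantees at least 1 - 1/k² of the values within k standard deviations for any distribution with finite variance. A minimal sketch on a synthetic skewed sample (an assumption standing in for the sales column):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=2000.0, size=5000)  # synthetic skewed sample

mu, sigma = x.mean(), x.std()  # std with ddof=0, so the bound holds exactly

# Share of points within k standard deviations of the mean.
coverage = {k: float(np.mean(np.abs(x - mu) <= k * sigma)) for k in (2, 3)}

for k, observed in coverage.items():
    bound = 1 - 1 / k**2
    print(f"k={k}: observed {observed:.3f}, Chebyshev lower bound {bound:.3f}")
```

Because the sales are skewed, the Chebyshev bound is the safer statement to make about each group.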

In [None]:
# Cat Vs Numerical
plt.figure(figsize = [13,6])
sns.boxplot(x = train.Item_Type, y = train.Item_Outlet_Sales)
plt.xticks(rotation = 90)
plt.title("Boxplot - Item Type Vs Sales")
plt.xlabel("Item Type")
plt.ylabel("Sales")
plt.show()

In [None]:
# Cat Vs Numerical
plt.figure(figsize = [13,6])
sns.boxplot(x = train.Outlet_Identifier, y = train.Item_Outlet_Sales)
plt.xticks(rotation = 90)
plt.title("Boxplot - Outlet ID Vs Sales")
plt.xlabel("Outlets")
plt.ylabel("Sales")
plt.show()

In [None]:
# Outlet Size
# Cat Vs Numerical
plt.figure(figsize = [13,6])
sns.boxplot(x = train.Outlet_Size, y = train.Item_Outlet_Sales)
plt.xticks(rotation = 90)
plt.title("Boxplot - Outlet Size Vs Sales")
plt.xlabel("Outlet Size")
plt.ylabel("Sales")
plt.show()

In [None]:
pd.DataFrame(train.groupby("Outlet_Size")["Outlet_Identifier"].value_counts()).T

In [None]:
# Missing Value
train.isnull().sum()[train.isnull().sum()!=0]

In [None]:
weightna = train[train.Item_Weight.isnull()]

In [None]:
weightna.head()

In [None]:
# Combining the Dataset
combined = pd.concat([train,test], ignore_index=True, sort = False)

In [None]:
combined.isnull().sum()[combined.isnull().sum()!=0]

In [None]:
combined.Item_Fat_Content.value_counts()

In [None]:
# Pattern
train[train.Item_Identifier=="FDX07"]["Item_Visibility"].median()

# Missing value Imputation
train.loc[29, "Item_Weight"]= train[train.Item_Identifier=="FDC14"]["Item_Weight"].median()

# Finding ID | np.where(train.Item_Weight.isna())
ids = train[pd.isnull(train.Item_Weight)]["Item_Identifier"]
locs = ids.index # Finding Index of the Item Weight Missing Values

# Missing Value Final Code
for i in range(0, len(ids)):
    train.loc[locs[i],"Item_Weight"]=train[train.Item_Identifier==ids.values[i]]["Item_Weight"].median()

In [None]:
# Missing Value Imputation - Item Weight | Lambda
combined["Item_Weight"]=combined.groupby("Item_Identifier")["Item_Weight"].transform(lambda x:x.fillna(x.median()))
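
A toy illustration of the group-wise median fill used above (made-up identifiers and weights). Note that an item whose weights are all missing stays missing, so it is worth re-checking `isnull().sum()` afterwards:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "A", "B", "B", "C"],
    "Item_Weight":     [9.0, np.nan, 9.0, 5.0, np.nan, np.nan],
})

# Fill each missing weight with the median weight of the same item.
toy["Item_Weight"] = (
    toy.groupby("Item_Identifier")["Item_Weight"]
       .transform(lambda w: w.fillna(w.median()))
)

# Item "C" has no observed weight at all, so its value cannot be filled this way.
still_missing = int(toy["Item_Weight"].isna().sum())
print(toy)
print("still missing:", still_missing)
```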

In [None]:
# Item Visibility - zeros are treated as missing and replaced by the item's median visibility
combined["Item_Visibility"] = combined.groupby("Item_Identifier")["Item_Visibility"].transform(lambda x:x.replace(to_replace = 0,value = x.median()))
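
One subtlety of the replacement above: the zeros themselves take part in the median, so an item with mostly zero visibilities gets a median of 0 and nothing changes. The sketch below compares that behaviour with a variant (my assumption, not the notebook's method) that masks zeros as NaN first. The data are toy values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Item_Identifier": ["A", "A", "A", "B", "B", "B"],
    "Item_Visibility": [0.0, 0.04, 0.06, 0.0, 0.0, 0.12],
})

# Variant 1: replace zeros with the group median (zeros are part of the median).
with_zeros = toy.groupby("Item_Identifier")["Item_Visibility"].transform(
    lambda v: v.replace(0, v.median())
)

# Variant 2: treat zeros as missing first, so the median uses non-zero values only.
masked = toy["Item_Visibility"].replace(0, np.nan)
without_zeros = masked.groupby(toy["Item_Identifier"]).transform(
    lambda v: v.fillna(v.median())
)

print(pd.DataFrame({"with_zeros": with_zeros, "without_zeros": without_zeros}))
```

For item "B", variant 1 leaves the zeros in place, which is the kind of leftover that then needs a manual fix.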

In [None]:
plt.figure(figsize = [10,7])
plt.scatter(combined["Item_Visibility"], combined["Item_Outlet_Sales"], color = "red")

In [None]:
combined[combined["Item_Identifier"]=="FDY07"]

In [None]:
train[train.Item_Identifier=="FDY07"]["Item_Visibility"]

In [None]:
# Imputation of FDY07: fill the zeros that the median replacement left behind
combined.loc[(combined.Item_Identifier=="FDY07") & (combined["Item_Visibility"]==0), 
        "Item_Visibility"]=0.121848

In [None]:
combined.head()

In [None]:
# Lets Deal with Tier 2
train.loc[train["Outlet_Location_Type"]=='Tier 2',"Outlet_Size"]="Small"

In [None]:
#train.loc[train["Outlet_Location_Type"]=='Tier 1',"Outlet_Size"]

In [None]:
# Feature Engineering
train.head()

In [None]:
# Size
pd.DataFrame(combined.groupby(["Outlet_Type", "Outlet_Location_Type"])
             ["Outlet_Size"].value_counts())

Rules

1. Tier 3 and Grocery Store - Medium
2. Tier 2 and Supermarket Type1 (S1) - Small

When Outlet_Size is NA, the locations are Tier 2 and Tier 3, and the outlet types are 'Grocery Store' and 'Supermarket Type1'.

In [None]:
# Imputing Rule 2: Tier 2 and S1 - Small
# .loc takes the boolean mask and the column label directly (no extra list around them)
combined.loc[(combined["Outlet_Location_Type"]=="Tier 2") & 
             (combined["Outlet_Type"]=="Supermarket Type1"),
            "Outlet_Size"]="Small"

In [None]:
# Imputing Rule 1: Tier 3 and Grocery Store - Medium
# .loc takes the boolean mask and the column label directly (no extra list around them)
combined.loc[(combined["Outlet_Location_Type"]=="Tier 3") & 
             (combined["Outlet_Type"]=="Grocery Store"),
            "Outlet_Size"]="Medium"
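
A slightly more defensive way to apply the two rules is to restrict each assignment to rows where Outlet_Size is actually missing, so existing sizes are never overwritten. This is a sketch on a toy frame (made-up rows); the extra `isna()` guard is my addition, not part of the rules above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Outlet_Location_Type": ["Tier 2", "Tier 2", "Tier 3", "Tier 3"],
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type1",
                    "Grocery Store", "Supermarket Type3"],
    "Outlet_Size": [np.nan, "Small", np.nan, "Medium"],
})

missing = toy["Outlet_Size"].isna()
rule_small = (toy["Outlet_Location_Type"] == "Tier 2") & (toy["Outlet_Type"] == "Supermarket Type1")
rule_medium = (toy["Outlet_Location_Type"] == "Tier 3") & (toy["Outlet_Type"] == "Grocery Store")

# Only fill rows that are both missing and matched by a rule.
toy.loc[missing & rule_small, "Outlet_Size"] = "Small"
toy.loc[missing & rule_medium, "Outlet_Size"] = "Medium"
print(toy)
```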

In [None]:
combined.isnull().sum()

In [None]:
combined.head()

### Feature Engineering

1. Price Per Unit - MRP/Weight
2. Item Type Category - convert Item Type into two categories: Perishables and Non Perishables
3. Outlet Age - 2013 - Establishment Year
4. Extract the first two characters (the code) from the Item ID
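
The four features in the list can be sketched in vectorized form on a toy frame. The rows below are made up for illustration; the column names follow the dataset and the new column names follow the ones used later in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "NCD19"],
    "Item_Weight": [10.0, 5.0, 8.0],
    "Item_MRP": [250.0, 50.0, 56.0],
    "Item_Type": ["Dairy", "Soft Drinks", "Household"],
    "Outlet_Establishment_Year": [1999, 2009, 1987],
})
perishables = ["Dairy", "Meat", "Fruits and Vegetables", "Breakfast", "Breads", "Seafood"]

# 1. Price per unit of weight
toy["Price_Per_Unit"] = toy["Item_MRP"] / toy["Item_Weight"]
# 2. Perishable vs non-perishable category
toy["ItemType_Cat"] = np.where(toy["Item_Type"].isin(perishables),
                               "Perishables", "Non Perishables")
# 3. Outlet age relative to the 2013 sales year
toy["Outlet_Age"] = 2013 - toy["Outlet_Establishment_Year"]
# 4. First two characters of the item identifier
toy["Item_IDS"] = toy["Item_Identifier"].str[:2]

print(toy[["Price_Per_Unit", "ItemType_Cat", "Outlet_Age", "Item_IDS"]])
```

The `.str[:2]` accessor and `np.where` avoid explicit Python loops and keep the result aligned with the frame's index.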

In [None]:
# Price Per Unit
combined["Price_Per_Unit"] = combined["Item_MRP"]/combined["Item_Weight"]

In [None]:
# Outlet Age
combined["Outlet_Age"] = 2013 - combined.Outlet_Establishment_Year

In [None]:
combined.Item_Type.unique()

In [None]:
perishables = ['Dairy', 'Meat', 'Fruits and Vegetables','Breakfast',
              'Breads','Seafood']

In [None]:
# Function
def badalde(x):
    if(x in perishables):
        return("Perishables")
    else:
        return("Non Perishables")
    
combined.Item_Type.apply(badalde)

In [None]:
# np.where
np.isin(combined.Item_Type, perishables)

In [None]:
np.where(combined.Item_Type.isin(perishables), "Perishables", 
         "Non Perishables")

In [None]:
# Loop
badlale = []
for i in range(0, len(combined)):
    if(combined.Item_Type[i] in perishables):
        badlale.append("Perishables")
    else:
        badlale.append("Non Perishables")

In [None]:
combined["ItemType_Cat"]=pd.Series(badlale)

In [None]:
combined.head()

In [None]:
str(combined.Item_Identifier[0])[:2]

In [None]:
item_id =[]
for i in combined.Item_Identifier:
    item_id.append(str(i)[:2])

In [None]:
combined["Item_IDS"]=pd.Series(item_id)

In [None]:
combined.head()

In [None]:
plt.figure(figsize=[10,7])
plt.scatter(combined["Price_Per_Unit"], combined["Item_MRP"], color = "red")

In [None]:
# Dropping the Columns
combined.columns

In [None]:
newdata = combined.drop(['Item_Identifier','Item_MRP','Item_Type','Outlet_Identifier',
       'Outlet_Establishment_Year',], axis = 1)

In [None]:
print(newdata.shape)

In [None]:
# Applying OHE
dummydata = pd.get_dummies(newdata)

In [None]:
dummydata.head()

In [None]:
# Split the Data in Train and Test
newtrain = dummydata[0:train.shape[0]]

In [None]:
# Test: every row after the train rows (no hard-coded row count)
newtest = dummydata[train.shape[0]:].copy()

In [None]:
# Reassign instead of dropping in place on a slice of dummydata
newtest = newtest.drop("Item_Outlet_Sales", axis = 1)

In [None]:
print(newtrain.shape)
print(newtest.shape)
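
An alternative to splitting by row position is to label the two blocks when concatenating, so the split cannot drift if the row counts change. This is a sketch under the assumption that a MultiIndex on the combined frame is acceptable; it is not how the notebook builds `combined`. Toy frames only.

```python
import pandas as pd

toy_train = pd.DataFrame({"x": [1, 2, 3], "Item_Outlet_Sales": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"x": [4, 5]})

# keys= labels each block, so the split never depends on a hard-coded row count.
both = pd.concat([toy_train, toy_test], keys=["train", "test"], sort=False)
both["x_sq"] = both["x"] ** 2  # stands in for any shared preprocessing

back_train = both.loc["train"]
back_test = both.loc["test"].drop(columns="Item_Outlet_Sales")

print(back_train.shape, back_test.shape)
```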

## Data Pre-Processing Stage

In [None]:
newtrain.columns

In [None]:
newtest.columns

In [None]:
# Scaling the Dataset
from sklearn.preprocessing import StandardScaler
nayasc = StandardScaler()

In [None]:
newtrain.drop("Item_Outlet_Sales", axis = 1).shape

In [None]:
newtrain.columns[newtrain.columns!="Item_Outlet_Sales"]

In [None]:
# Standardized Train Set
scaledtrain = pd.DataFrame(nayasc.fit_transform(newtrain.drop("Item_Outlet_Sales", axis = 1)), 
             columns = newtrain.columns[newtrain.columns!="Item_Outlet_Sales"])

In [None]:
# Standardized Test Set
scaledtest = pd.DataFrame(nayasc.transform(newtest), columns=newtest.columns)
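
A small sanity check of the fit-on-train, transform-on-test pattern used above, on toy numbers. The scaler learns the mean and standard deviation from the train rows only, so the test rows are not guaranteed to end up with mean 0.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy_train = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 10.0, 30.0, 30.0]})
toy_test = pd.DataFrame({"a": [2.5, 5.0], "b": [20.0, 40.0]})

scaler = StandardScaler()
# fit_transform learns mean/std from train; transform reuses them for test.
train_scaled = pd.DataFrame(scaler.fit_transform(toy_train), columns=toy_train.columns)
test_scaled = pd.DataFrame(scaler.transform(toy_test), columns=toy_test.columns)

print(train_scaled.mean().round(6))
print(test_scaled)
```

Tree-based models such as the ones used next are generally insensitive to this kind of scaling, so the step matters more if a linear or distance-based model is tried later.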

In [None]:
# Random Forest and Gradient Boosting regressors
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
rf = RandomForestRegressor()
gbm = GradientBoostingRegressor()

In [None]:
gbm.fit(scaledtrain, newtrain.Item_Outlet_Sales)

In [None]:
gbm_pred = gbm.predict(scaledtest)
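
Before submitting, a hold-out split gives a local estimate of the error. The sketch below assumes RMSE is the metric of interest and uses synthetic features and target rather than `scaledtrain`; the comparison against a predict-the-mean baseline is my addition.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Keep 25% of the rows aside for validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

rmse = float(np.sqrt(mean_squared_error(y_val, model.predict(X_val))))
# Baseline: always predict the training mean.
baseline = float(np.sqrt(np.mean((y_val - y_tr.mean()) ** 2)))

print(f"validation RMSE: {rmse:.3f} (baseline {baseline:.3f})")
```

With the real data, `X` and `y` would be `scaledtrain` and `newtrain.Item_Outlet_Sales`.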

In [None]:
# Submission file for Analytics Vidhya (AV)
solution = pd.DataFrame({"Item_Identifier":test["Item_Identifier"],
                        "Outlet_Identifier":test["Outlet_Identifier"],
                        "Item_Outlet_Sales":gbm_pred})

In [None]:
solution.to_csv("GBM Model.csv", index = False) # 1164.224735564618

In [None]:
rf.fit(scaledtrain, newtrain.Item_Outlet_Sales)

In [None]:
pred = rf.predict(scaledtest)

In [None]:
# Submission file for Analytics Vidhya (AV)
solution = pd.DataFrame({"Item_Identifier":test["Item_Identifier"],
                        "Outlet_Identifier":test["Outlet_Identifier"],
                        "Item_Outlet_Sales":pred})

In [None]:
solution.head()

In [None]:
solution.to_csv("RandomF Model.csv", index = False) # 1224.9984365775733

Please upvote if you find this useful, and check out my GitHub: https://github.com/swapnilbhange