***The Hypotheses***
> I came up with the following hypotheses while thinking about the problem. These are just my thoughts, and you can come up with many more. Since we’re talking about stores and products, let’s make a separate set for each.

***Store Level Hypotheses:***

1. City type: Stores located in urban or Tier 1 cities should have higher sales because of the higher income levels of people there.
2. Population Density: Stores located in densely populated areas should have higher sales because of more demand.
3. Store Capacity: Stores which are very big in size should have higher sales as they act like one-stop shops and people would prefer getting everything from one place.
4. Competitors: Stores having similar establishments nearby should have less sales because of more competition.
5. Marketing: Stores which have a good marketing division should have higher sales as it will be able to attract customers through the right offers and advertising.
6. Location: Stores located within popular marketplaces should have higher sales because of better access to customers.
7. Customer Behavior: Stores keeping the right set of products to meet the local needs of customers will have higher sales.
8. Ambiance: Stores which are well-maintained and managed by polite and humble people are expected to have higher footfall and thus higher sales.

***Product Level Hypotheses:***

1. Brand: Branded products should have higher sales because customers trust them more.
2. Packaging: Products with good packaging can attract customers and sell more.
3. Utility: Daily use products should have a higher tendency to sell as compared to the specific use products.
4. Display Area: Products which are given bigger shelves in the store are likely to catch attention first and sell more.
5. Visibility in Store: The location of a product in a store will impact sales. Products placed right at the entrance will catch the customer's eye before the ones at the back.
6. Advertising: Better advertising of products in the store should lead to higher sales in most cases.
7. Promotional Offers: Products accompanied with attractive offers and discounts will sell more.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
from matplotlib import cbook, rc_params_from_file, rcParamsDefault
import plotly.express as px
from sklearn.metrics import roc_auc_score
from sklearn import metrics
import squarify

# regression models and metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import *

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train = pd.read_csv(r"/kaggle/input/big-mart-sales-prediction/Train.csv")
test = pd.read_csv(r"/kaggle/input/big-mart-sales-prediction/Test.csv")

In [None]:
print(train.shape)
train.head()

In [None]:
train.info()

In [None]:
print(test.shape)
test.head()

In [None]:
test.info()

In [None]:
df = pd.concat([train, test],ignore_index=True)
print(train.shape, test.shape, df.shape)

In [None]:
print(df.shape)
df.head()

In [None]:
df.notnull().tail()

In [None]:
df.dropna(how = 'any').shape

In [None]:
df.duplicated().sum()

In [None]:
df.loc[df.duplicated(keep = 'last'), :]

In [None]:
df.loc[df.duplicated(keep = False), :]

In [None]:
df.drop_duplicates(keep = 'first').shape

In [None]:
df.drop_duplicates(keep = 'last').shape

In [None]:
df.drop_duplicates(keep = False).shape

In [None]:
df.drop_duplicates(subset = ['Item_Identifier', 'Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type', 'Item_MRP', 
                             'Outlet_Identifier', 'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type','Outlet_Type', 
                             'Item_Outlet_Sales']).shape

In [None]:
df.info()

In [None]:
df.columns

In [None]:
df.isnull().sum()

***Some observations:***

1. 'Item_Visibility' has a min value of zero. This makes no practical sense because when a product is being sold in a store, the visibility cannot be 0.
2. 'Outlet_Establishment_Year' varies from 1985 to 2009. The values might not be useful in this form. If we convert them to the age of the particular store, they should relate to sales more directly.
3. The lower ‘count’ of "Item_Weight" and "Item_Outlet_Sales" confirms the findings from the missing value check.
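
Observations 1 and 2 could be acted on roughly as follows. This is a minimal sketch on a toy frame, not part of the notebook's pipeline: the frame `toy`, the reference year 2013, and the new column name `Outlet_Years` are all assumptions made for illustration. A visibility of 0 is treated as missing and replaced by the mean visibility of the same item.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with the same column names as the dataset
toy = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'FDA15', 'DRC01', 'DRC01'],
    'Item_Visibility': [0.0, 0.016, 0.019, 0.0],
    'Outlet_Establishment_Year': [1999, 2009, 1985, 1999],
})

# Treat a visibility of 0 as missing, then fill it with the item's mean visibility
vis = toy['Item_Visibility'].replace(0, np.nan)
item_mean_vis = vis.groupby(toy['Item_Identifier']).transform('mean')
toy['Item_Visibility'] = vis.fillna(item_mean_vis)

# Convert the establishment year into the age of the store
REFERENCE_YEAR = 2013  # assumption: the year the sales data refers to
toy['Outlet_Years'] = REFERENCE_YEAR - toy['Outlet_Establishment_Year']

print(toy)
```

Filling per item rather than with a global mean keeps the imputed visibility close to what the same product shows in other stores.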

In [None]:
df['Item_Weight'] = df['Item_Weight'].fillna(df['Item_Weight'].mode()[0])
df['Outlet_Size'] = df['Outlet_Size'].fillna(df['Outlet_Size'].mode()[0])

In [None]:
df.isnull().sum()

In [None]:
df.nunique()

> This tells us that there are 1559 products and 10 outlets/stores (which was also mentioned in the problem statement). Another thing that should catch attention is that Item_Type has 16 unique values.
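
One way to reduce the 16 Item_Type values to a few broad groups is sketched below. It assumes that the first two characters of `Item_Identifier` encode a broad category (`FD`, `DR`, `NC`); the toy frame, the group labels, and the column name `Item_Type_Combined` are hypothetical and only meant to show the idea.

```python
import pandas as pd

# Toy data (hypothetical rows) with the same column names as the dataset
toy = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'NCD19', 'FDX07'],
    'Item_Type': ['Dairy', 'Soft Drinks', 'Household', 'Fruits and Vegetables'],
})

# Assumed meaning of the two-letter identifier prefixes
prefix_to_group = {'FD': 'Food', 'DR': 'Drinks', 'NC': 'Non-Consumable'}

# Take the first two characters of the ID and map them to a broad group
toy['Item_Type_Combined'] = toy['Item_Identifier'].str[:2].map(prefix_to_group)

print(toy[['Item_Identifier', 'Item_Type', 'Item_Type_Combined']])
```

Any identifier with a prefix outside the mapping would come out as NaN, which makes unexpected codes easy to spot.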

In [None]:
#Filter categorical variables
categorical_columns = [x for x in df.dtypes.index if df.dtypes[x]=='object']
#Exclude ID cols and source:
categorical_columns = [x for x in categorical_columns if x not in ['Item_Identifier','Outlet_Identifier','source']]
#Print frequency of categories
for col in categorical_columns:
    print('\nFrequency of Categories for variable %s'%col)
    print(df[col].value_counts())

The output gives us following observations:

1. Item_Fat_Content: Some of the ‘Low Fat’ values are mis-coded as ‘low fat’ and ‘LF’. Also, some of the ‘Regular’ values appear as ‘reg’.
2. Item_Type: Not all categories have substantial numbers. It looks like combining them can give better results.
3. Outlet_Type: Supermarket Type2 and Type3 can be combined. But we should check if that’s a good idea before doing it.
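
The first and third observations can be checked with a few lines. The sketch below uses a toy frame with made-up sales figures, so the numbers only illustrate the mechanics: it standardises the fat-content labels and then compares mean sales per outlet type, which is one simple way to judge whether Supermarket Type2 and Type3 behave alike before merging them.

```python
import pandas as pd

# Toy data (hypothetical values) with the same column names as the dataset
toy = pd.DataFrame({
    'Item_Fat_Content': ['Low Fat', 'LF', 'low fat', 'Regular', 'reg'],
    'Outlet_Type': ['Supermarket Type1', 'Supermarket Type2', 'Supermarket Type3',
                    'Supermarket Type2', 'Supermarket Type3'],
    'Item_Outlet_Sales': [2000.0, 1800.0, 3600.0, 2200.0, 3800.0],
})

# Collapse the mis-coded labels into the two real categories
toy['Item_Fat_Content'] = toy['Item_Fat_Content'].replace(
    {'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})
print(toy['Item_Fat_Content'].value_counts())

# Compare mean sales per outlet type; very different means argue against merging
mean_sales = toy.groupby('Outlet_Type')['Item_Outlet_Sales'].mean()
print(mean_sales)
```

If the group means differ substantially, as they do in this toy example, keeping the two outlet types separate is the safer choice.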

In [None]:
df.apply(lambda x: x.dtype)

In [None]:
df.columns.to_series().groupby(df.dtypes).groups

In [None]:
round((df.apply(lambda x:x.isnull().sum())/len(df))*100,2)

In [None]:
total_miss = df.isnull().sum()
perc_miss = total_miss/df.isnull().count()*100

missing_data = pd.DataFrame({'Total missing':total_miss,'% missing':perc_miss})

missing_data.sort_values(by='Total missing',ascending=False).head(3)

In [None]:
(df.isnull().sum()/len(df))*100

In [None]:
# find the unique values from categorical features
for col in df.select_dtypes(include='object').columns:
    print(col)
    print(df[col].unique())

In [None]:
for column in df.columns:
    print(column,df[column].nunique())

In [None]:
categorical_features = [feature for feature in df.columns if df[feature].dtypes == 'O']
categorical_features

In [None]:
for feature in categorical_features:
    print('The feature is {} and number of categories are {}'.format(feature,len(df[feature].unique())))

In [None]:
numerical_data = df.select_dtypes(include=np.number) # select_dtypes selects data with numeric features
numerical_col = numerical_data.columns 

print("Numeric Features:")
print(numerical_data.head())
print("===="*20)

In [None]:
categorical_data = df.select_dtypes(exclude=np.number) # we will exclude data with numeric features
categorical_col = categorical_data.columns                          # we will store the categorical features in a variable

print("Categorical Features:")
print(categorical_data.head())
print("===="*20)

In [None]:
### numerical 
numerical_cols = list(df.select_dtypes(exclude=['object']))
numerical_cols

In [None]:
### categorical
categorical_cols = list(df.select_dtypes(include=['object']))
categorical_cols

In [None]:
#Summarise the target (count and mean sales) within each category of the categorical features
for categorical_feature in categorical_features:
    print(df.groupby(categorical_feature)['Item_Outlet_Sales'].agg(['count', 'mean']))

In [None]:
# list of numerical variables
numerical_features = [feature for feature in df.columns if df[feature].dtypes != 'O']
print('Number of numerical variables: ', len(numerical_features))

# visualise the numerical variables
df[numerical_features].head()

In [None]:
#Discrete Numerical Features
discrete_feature=[feature for feature in numerical_features if len(df[feature].unique())<25]
print("Discrete Variables Count: {}".format(len(discrete_feature)))

In [None]:
#Continuous Numerical Features
continuous_features=[feature for feature in numerical_features if feature not in discrete_feature]
print("Continuous feature Count: {}".format(len(continuous_features)))

In [None]:
cols_with_missing = [col for col in df.columns 
                                 if df[col].isnull().any()]
cols_with_missing

In [None]:
df.describe()

In [None]:
df.corr(numeric_only=True)

In [None]:
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(numeric_only=True),annot=True)

In [None]:
df.hist(figsize=(15,15))
plt.show()

In [None]:
sns.pairplot(df)

In [None]:
def bar_plot(variable):
    var = df[variable]
    varValue = var.value_counts()
    plt.figure(figsize=(15,3))
    plt.bar(varValue.index, varValue,color=['#00008b','#00e5ee','#cd1076', '#008080','#cd5555','red','blue',])
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    
    plt.show()
    print("{}: \n {}".format(variable,varValue))

In [None]:
categorical_cols = ['Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']
for c in categorical_cols:
    bar_plot(c)

In [None]:
categorical_variables = ['Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']
for col in categorical_variables:
    plt.figure(figsize=(10,4))
    counts = df[col].value_counts()
    # pass x and y by keyword; recent seaborn versions reject positional data arguments
    sns.barplot(x=counts.values, y=counts.index.astype(str))
    plt.title(col)
    plt.tight_layout()

In [None]:
for i in df.describe().columns:
    sns.boxplot(df[i].dropna())
    plt.show()

In [None]:
def count_categorical_variables(df):
    categorical_variables = df.select_dtypes(include=['object']).columns.tolist()
    #fig = plt.figure(figsize=(14, 18))

    for index, col in enumerate(categorical_variables):
        print("------------",col," value counts---------------------")
        print(df[col].value_counts())
        
        #fig.add_subplot(3, 2, index+1)
        #df[col].value_counts()[:20].plot(kind='bar', title=col, color = "royalblue")
        #plt.tight_layout()
        
    print("\n\n------------Number of categories in each column---------------------")
    for i in categorical_variables:
        a = df[i].unique()
        print("There are {} categories in {}".format(len(a),i))
count_categorical_variables(df)

# Data Cleaning

This step typically involves imputing missing values and treating outliers. Outlier treatment matters a lot for regression techniques such as linear regression, whereas tree-based algorithms are largely insensitive to outliers in the features. So I’ll leave it to you to try it out. We’ll focus on the imputation step here, which is a very important step.
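
If you do want to try outlier treatment, a common starting point is the 1.5 × IQR rule. The sketch below is not part of this notebook's pipeline; it runs on a small made-up series and shows both flagging and capping the extreme values.

```python
import pandas as pd

# Toy sales values (hypothetical) with one obvious outlier
sales = pd.Series([100.0, 120.0, 130.0, 150.0, 160.0, 5000.0], name='Item_Outlet_Sales')

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag values outside the fences
outliers = sales[(sales < lower) | (sales > upper)]
print(outliers)

# Alternatively, cap the values at the fences instead of dropping rows
capped = sales.clip(lower=lower, upper=upper)
print(capped)
```

Capping keeps every row, which matters here because train and test rows live in the same frame.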

# Imputing Missing Values

We found two variables with missing values – Item_Weight and Outlet_Size. Let’s impute the former with the average weight of the particular item. This can be done as:

In [None]:
#Determine the average weight per item:
item_avg_weight = df.pivot_table(values='Item_Weight', index='Item_Identifier')

#Get a boolean variable specifying missing Item_Weight values
miss_bool = df['Item_Weight'].isnull() 

#Impute data and check #missing values before and after imputation to confirm
print('Original #missing: %d' % miss_bool.sum())
#pivot_table returns a DataFrame, so look the average up in its 'Item_Weight' column
df.loc[miss_bool,'Item_Weight'] = df.loc[miss_bool,'Item_Identifier'].map(item_avg_weight['Item_Weight'])
print('Final #missing: %d' % df['Item_Weight'].isnull().sum())
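
The second variable, Outlet_Size, is categorical, so an average does not apply. One option is to fill it with the most frequent size among outlets of the same Outlet_Type. This is a sketch under that assumption, shown on a toy frame with made-up rows rather than on `df`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical rows) with the same column names as the dataset
toy = pd.DataFrame({
    'Outlet_Type': ['Grocery Store', 'Grocery Store', 'Grocery Store',
                    'Supermarket Type1', 'Supermarket Type1', 'Supermarket Type1'],
    'Outlet_Size': ['Small', 'Small', np.nan, 'Medium', np.nan, 'Medium'],
})

# Most frequent Outlet_Size per Outlet_Type (mode() ignores NaN by default).
# Note: a group whose sizes are all missing would make .iloc[0] fail.
size_mode = toy.groupby('Outlet_Type')['Outlet_Size'].agg(lambda s: s.mode().iloc[0])

# Fill only the missing entries, looking the mode up by outlet type
missing = toy['Outlet_Size'].isnull()
print('Original #missing: %d' % missing.sum())
toy.loc[missing, 'Outlet_Size'] = toy.loc[missing, 'Outlet_Type'].map(size_mode)
print('Final #missing: %d' % toy['Outlet_Size'].isnull().sum())
```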

# Univariate Analysis

In [None]:
df['Item_Fat_Content'].value_counts()

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(y = df['Item_Fat_Content'])

In [None]:
plt.figure(figsize=(10,10))
df['Item_Fat_Content'].value_counts().plot.pie(autopct="%0.2f%%")

In [None]:
df['Item_Type'].value_counts()

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(y = df['Item_Type'])

In [None]:
plt.figure(figsize=(10,10))
df['Item_Type'].value_counts().plot.pie(autopct="%0.2f%%")

In [None]:
df['Item_MRP'].value_counts()

In [None]:
df['Outlet_Identifier'].value_counts()

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(y = df['Outlet_Identifier'])

In [None]:
plt.figure(figsize=(10,10))
df['Outlet_Identifier'].value_counts().plot.pie(autopct="%0.2f%%")

In [None]:
df['Outlet_Establishment_Year'].value_counts()

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(y = df['Outlet_Establishment_Year'])

In [None]:
plt.figure(figsize=(10,10))
df['Outlet_Establishment_Year'].value_counts().plot.pie(autopct="%0.2f%%")

In [None]:
df['Outlet_Size'].value_counts()

In [None]:
plt.figure(figsize=(10,7))
sns.countplot(y = df['Outlet_Size'])

In [None]:
plt.figure(figsize=(15,10))
df['Outlet_Size'].value_counts().plot.pie(autopct="%0.2f%%")

In [None]:
df['Outlet_Location_Type'].value_counts()

In [None]:
plt.figure(figsize=(10,7))
sns.countplot(y = df['Outlet_Location_Type'])

In [None]:
plt.figure(figsize=(15,10))
df['Outlet_Location_Type'].value_counts().plot.pie(autopct="%0.2f%%")

In [None]:
df['Outlet_Type'].value_counts()

In [None]:
plt.figure(figsize=(10,7))
sns.countplot(y = df['Outlet_Type'])

In [None]:
plt.figure(figsize=(15,10))
df['Outlet_Type'].value_counts().plot.pie(autopct="%0.2f%%")

In [None]:
df['Item_Outlet_Sales'].value_counts()

In [None]:
# 'LF' and 'low fat' are mis-coded 'Low Fat'; 'reg' is mis-coded 'Regular'
df['Item_Fat_Content'] = df['Item_Fat_Content'].map({'Low Fat': 0, 'LF': 0, 'low fat': 0, 'Regular': 1, 'reg': 1})
df.head()

# Dist Plot

In [None]:
plt.figure(figsize = (10, 10))
sns.histplot(x = df['Item_Visibility'], kde = True, stat = 'density')  # distplot is deprecated

In [None]:
plt.figure(figsize = (10, 10))
sns.histplot(x = df['Item_MRP'], kde = True, stat = 'density')  # distplot is deprecated

In [None]:
plt.figure(figsize = (10, 10))
sns.histplot(x = df['Outlet_Establishment_Year'], kde = True, stat = 'density')  # distplot is deprecated

# Joint Plot

In [None]:
sns.jointplot(x = "Item_Fat_Content", y = "Item_Outlet_Sales", data = df)

In [None]:
sns.jointplot(x = "Item_Visibility", y = "Item_Outlet_Sales", data = df)

In [None]:
sns.jointplot(x = "Item_MRP", y = "Item_Outlet_Sales", data = df)

In [None]:
sns.jointplot(x = "Outlet_Identifier", y = "Item_Outlet_Sales", data = df)
plt.xticks(rotation = 90)

In [None]:
sns.jointplot(x = "Outlet_Establishment_Year", y = "Item_Outlet_Sales", data = df)

In [None]:
sns.jointplot(x = "Outlet_Size", y = "Item_Outlet_Sales", data = df)

In [None]:
sns.jointplot(x = "Outlet_Location_Type", y = "Item_Outlet_Sales", data = df)

In [None]:
sns.jointplot(x = "Outlet_Type", y = "Item_Outlet_Sales", data = df)
plt.xticks(rotation = 90)

# Bar Plot

In [None]:
plt.figure(figsize = (15,7))
sns.barplot(data = df, y = 'Item_Outlet_Sales', x = 'Item_Fat_Content')

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(data = df, y = 'Item_Outlet_Sales', x = 'Item_Type')
plt.xticks(rotation = 90)

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(data = df, y = 'Item_Outlet_Sales', x = 'Outlet_Location_Type')

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(data = df, y = 'Item_Outlet_Sales', x = 'Outlet_Size')

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(data = df, y = 'Item_Outlet_Sales', x = 'Outlet_Establishment_Year')

In [None]:
plt.subplots(figsize = (15,4))
sns.barplot(y = df['Outlet_Identifier'], x= df['Item_Outlet_Sales'])

In [None]:
plt.figure(figsize=(15,7))
sns.barplot(x = 'Item_Type', y = 'Item_Outlet_Sales',hue = 'Outlet_Type', data = df)
plt.xticks(rotation = 90)

In [None]:
plt.figure(figsize=(20,6))
sns.barplot(x = 'Outlet_Type',y = 'Item_Outlet_Sales',hue = 'Outlet_Location_Type',data = df)

In [None]:
plt.figure(figsize=(20,6))
sns.barplot(x = 'Item_Type',y ='Item_Weight',data = df)

In [None]:
plt.figure(figsize=(20,6))
sns.barplot(y='Item_Outlet_Sales',hue='Outlet_Type',x='Outlet_Location_Type',data=df)

# Box Plot

In [None]:
plt.figure(figsize = [12,9])
sns.boxplot(x = 'Item_Type', y = 'Item_Outlet_Sales', data = df)
plt.xticks(rotation = 90)

In [None]:
plt.figure(figsize = [20,7])
sns.boxplot(x = 'Outlet_Identifier', y = 'Item_Outlet_Sales', data = df)

In [None]:
plt.figure(figsize = [20,7])
sns.boxplot(x = 'Outlet_Size', y = 'Item_Outlet_Sales', data = df)

In [None]:
plt.figure(figsize = [20,7])
sns.boxplot(x = 'Outlet_Location_Type', y = 'Item_Outlet_Sales', data = df)

In [None]:
plt.figure(figsize = [20,7])
sns.boxplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales', data = df)

In [None]:
plt.figure(figsize=(16,10))
temp_col = ['Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']
for i, col in enumerate(temp_col):
    plt.subplot(2,2,i+1)
    sns.boxplot(data=df, y='Item_Outlet_Sales', x=col)
    plt.xlabel(col, fontsize=14)

In [None]:
plt.subplots(figsize = (25,4))
sns.boxplot(x = df['Item_Type'], y= df['Item_Outlet_Sales'])

In [None]:
plt.subplots(figsize = (15,4))
sns.boxplot(y = df['Outlet_Identifier'], x= df['Item_Outlet_Sales'])

In [None]:
plt.subplots(figsize = (15,4))
sns.boxplot(y = df['Outlet_Size'], x= df['Item_Outlet_Sales'])

In [None]:
plt.figure(figsize=(20,6))
sns.boxplot(y='Item_Outlet_Sales',hue='Outlet_Type',x='Outlet_Location_Type',data=df)

In [None]:
plt.figure(figsize=(20,7))
sns.boxplot(x = 'Item_Type', y = 'Item_Outlet_Sales',hue = 'Outlet_Type', data = df)
plt.xticks(rotation = 90)

In [None]:
plt.figure(figsize=(20,7))
sns.boxplot(x = 'Item_Type', y = 'Item_Outlet_Sales',hue = 'Outlet_Size', data = df)
plt.xticks(rotation = 90)

In [None]:
plt.figure(figsize=(20,7))
sns.boxplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales',hue = 'Outlet_Size', data = df)

In [None]:
plt.figure(figsize=(20,7))
sns.boxplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales',hue = 'Outlet_Type', data = df)

In [None]:
plt.figure(figsize=(20,7))
sns.boxplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales',hue = 'Outlet_Location_Type', data = df)

In [None]:
plt.figure(figsize=(20,7))
sns.boxplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales',hue = 'Outlet_Establishment_Year', data = df)

In [None]:
fig,axes=plt.subplots(2,2,figsize=(15,12))
sns.boxplot(x='Outlet_Establishment_Year',y='Item_Outlet_Sales',hue='Outlet_Size',ax=axes[0,0],data=df)
sns.boxplot(x='Outlet_Size',y='Item_Outlet_Sales',hue='Outlet_Size',ax=axes[0,1],data=df)
sns.boxplot(x='Outlet_Location_Type',y='Item_Outlet_Sales',hue='Outlet_Size',ax=axes[1,0],data=df)
sns.boxplot(x='Outlet_Type',y='Item_Outlet_Sales',hue='Outlet_Size',ax=axes[1,1],data=df)

In [None]:
plt.figure(figsize = (15,10))

plt.subplot(211)
sns.boxplot(x='Outlet_Identifier', y='Item_Outlet_Sales', data=df, palette="Set1")

plt.subplot(212)
sns.boxplot(x='Item_Type', y='Item_Outlet_Sales', data=df, palette="Set1")

plt.show()

# Count Plot

In [None]:
plt.figure(figsize=(20,7))
sns.countplot(x = 'Item_Fat_Content', hue = 'Outlet_Type', data = df)

In [None]:
plt.figure(figsize=(20,7))
sns.countplot(x = 'Item_Fat_Content', hue = 'Outlet_Size', data = df)

In [None]:
plt.figure(figsize=(20,7))
sns.countplot(x = 'Item_Fat_Content',hue = 'Outlet_Location_Type', data = df)

In [None]:
plt.figure(figsize=(20,7))
sns.countplot(x = 'Item_Type',hue = 'Outlet_Size', data = df)
plt.xticks(rotation = 90)

In [None]:
plt.figure(figsize=(20,7))
sns.countplot(x = 'Item_Type',hue = 'Outlet_Type', data = df)
plt.xticks(rotation = 90)

In [None]:
plt.figure(figsize=(20,6))
sns.countplot(hue='Outlet_Type',x='Outlet_Location_Type',data=df)

# Violin Plot

In [None]:
plt.figure(figsize=(20,7))
sns.violinplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales',hue = 'Outlet_Location_Type', data = df)

In [None]:
plt.figure(figsize=(20,7))
sns.violinplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales',hue = 'Outlet_Size', data = df)

In [None]:
plt.figure(figsize=(20,6))
sns.violinplot(y = 'Item_Outlet_Sales', x = 'Outlet_Location_Type', data = df)

In [None]:
plt.figure(figsize=(20,6))
sns.violinplot(y = 'Item_Outlet_Sales', x = 'Item_Fat_Content', data = df)

In [None]:
plt.figure(figsize=(20,6))
sns.violinplot(y = 'Item_Outlet_Sales', x = 'Item_Type', data = df)

In [None]:
plt.figure(figsize=(20,6))
sns.violinplot(y = 'Item_Outlet_Sales', x = 'Outlet_Identifier', data = df)

In [None]:
plt.figure(figsize=(20,6))
sns.violinplot(y = 'Item_Outlet_Sales', x = 'Outlet_Establishment_Year', data = df)

In [None]:
plt.figure(figsize = (15,10))

plt.subplot(211)
sns.violinplot(x='Outlet_Identifier', y='Item_Outlet_Sales', data=df, palette="Set1")

plt.subplot(212)
sns.violinplot(x='Item_Type', y='Item_Outlet_Sales', data=df, palette="Set1")
plt.xticks(rotation = 90)

plt.show()

# Scatter Plot

In [None]:
plt.figure(figsize=(20,7))
sns.scatterplot(y = 'Item_Outlet_Sales',x = 'Item_Type',hue = 'Outlet_Location_Type', data = df)
plt.xticks(rotation = 90)

In [None]:
plt.figure(figsize = (20,7))
sns.scatterplot(x = 'Item_Weight', y = 'Item_Outlet_Sales', hue = 'Item_Type', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.scatterplot(x= 'Item_Visibility', y = 'Item_Outlet_Sales', hue = 'Item_Type', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.scatterplot(x= 'Item_Type', y = 'Item_Outlet_Sales', hue = 'Outlet_Type', data = df)
plt.xticks(rotation = 90)

In [None]:
plt.figure(figsize = (20, 7))
sns.scatterplot(x= 'Item_Type', y = 'Item_Outlet_Sales', hue = 'Outlet_Size', data = df)
plt.xticks(rotation = 90)

In [None]:
sns.scatterplot(x = 'Outlet_Type', y = 'Item_Outlet_Sales', hue = 'Outlet_Location_Type', data = df)

In [None]:
sns.scatterplot(x = 'Outlet_Location_Type', y = 'Item_Outlet_Sales', hue = 'Outlet_Type', data = df)

In [None]:
sns.scatterplot(y = 'Item_Fat_Content', x = 'Item_Outlet_Sales', hue = 'Outlet_Type', data = df)

In [None]:
sns.scatterplot(y = 'Item_Fat_Content', x = 'Item_Outlet_Sales', hue = 'Outlet_Size', data = df)

In [None]:
fig,axes=plt.subplots(1,1,figsize=(12,8))
sns.scatterplot(x='Item_MRP',y='Item_Outlet_Sales',hue='Item_Fat_Content',size='Item_Weight',data=df)

In [None]:
fig,axes=plt.subplots(1,1,figsize=(10,8))
sns.scatterplot(x='Item_MRP',y='Item_Outlet_Sales',hue='Item_Fat_Content',size='Item_Weight',data=df)
plt.plot([69,69],[0,5000])
plt.plot([137,137],[0,5000])
plt.plot([203,203],[0,9000])

# Pie Chart

In [None]:
sums = df['Item_Outlet_Sales'].groupby(df['Outlet_Identifier']).sum().reset_index(name='Sales')
# sums
plt.figure(figsize=(10,10))
plt.pie(sums['Sales'], labels=sums['Outlet_Identifier'], rotatelabels=True,autopct='%1.1f%%');
plt.show()

In [None]:
sums = df['Item_Outlet_Sales'].groupby(df['Item_Type']).sum().reset_index(name='Sales')
# sums
plt.figure(figsize=(10,10))
plt.pie(sums['Sales'], labels=sums['Item_Type'], rotatelabels=True,autopct='%1.1f%%');

In [None]:
for i in df['Outlet_Identifier'].unique():
    # count the sales records per Item_Type within this outlet only
    outlet_df = df[df['Outlet_Identifier'] == i]
    sums = outlet_df.groupby('Item_Type')['Item_Outlet_Sales'].count().reset_index(name='Sales')
    plt.figure(figsize=(10,10))
    plt.title('Outlet_Identifier is ' + i)
    plt.pie(sums['Sales'], labels=sums['Item_Type'], rotatelabels=False,autopct='%1.1f%%');

In [None]:
sums = df['Item_Type'].groupby(df['Item_Type']).count().reset_index(name='counts')
# sums
plt.figure(figsize=(10,10))
plt.pie(sums['counts'], labels=sums['Item_Type'], rotatelabels=True,autopct='%1.1f%%');

In [None]:
# Check what types of items are stocked in each outlet
for i in df['Outlet_Identifier'].unique():
    # restrict the count to the rows of the current outlet
    outlet_df = df[df['Outlet_Identifier'] == i]
    sums = outlet_df.groupby('Item_Type').size().reset_index(name='counts')
    plt.figure(figsize=(10,10))
    plt.title('Outlet_Identifier is ' + i)
    plt.pie(sums['counts'], labels=sums['Item_Type'], rotatelabels=False,autopct='%1.1f%%');

In [None]:
sums = df['Item_Outlet_Sales'].groupby(df['Outlet_Type']).sum().reset_index(name='Sales')
# sums
plt.figure(figsize=(10,10))
plt.pie(sums['Sales'], labels=sums['Outlet_Type'], rotatelabels=True,autopct='%1.1f%%');

In [None]:
sums = df['Item_Outlet_Sales'].groupby(df['Outlet_Size']).sum().reset_index(name='Sales')
# sums
plt.figure(figsize=(10,10))
plt.pie(sums['Sales'], labels=sums['Outlet_Size'], rotatelabels=True,autopct='%1.1f%%');

# Point Plot

In [None]:
figures, axes = plt.subplots(3, 2, figsize=(25, 15))

sns.pointplot(y = 'Item_Outlet_Sales', x = 'Item_Type', data = df, hue = 'Outlet_Type', ax = axes[0,0])
sns.pointplot(y = 'Item_Outlet_Sales', x = 'Item_Type', data = df, hue = 'Outlet_Establishment_Year', ax = axes[0,1])
sns.pointplot(y = 'Item_Outlet_Sales', x = 'Item_Type', data = df, hue = 'Outlet_Location_Type', ax = axes[1,0])
sns.pointplot(y = 'Item_Outlet_Sales', x = 'Item_Type', data = df, hue = 'Outlet_Size', ax = axes[1,1])
sns.pointplot(y = 'Item_Outlet_Sales', x = 'Item_Fat_Content', data = df, hue = 'Outlet_Type', ax = axes[2,0])
sns.pointplot(y = 'Item_Outlet_Sales', x = 'Item_Fat_Content', data = df, hue = 'Outlet_Location_Type', ax = axes[2,1])

plt.show()

# Pivot Table

In [None]:
table = pd.pivot_table(df,index=['Item_Fat_Content','Outlet_Establishment_Year'])
table

In [None]:
table = pd.pivot_table(df, values = 'Item_Outlet_Sales',index = 'Outlet_Type')
table

In [None]:
table = pd.pivot_table(df, values = 'Item_Visibility', index = 'Item_Identifier')
table

In [None]:
table = pd.pivot_table(df, values = 'Outlet_Type', columns = 'Outlet_Identifier',aggfunc = lambda x:x.mode())
table

In [None]:
df.pivot_table(values='Outlet_Type', columns='Outlet_Size',aggfunc=lambda x:x.mode())

In [None]:
df.pivot_table(values='Outlet_Location_Type', columns='Outlet_Type',aggfunc=lambda x:x.mode())

In [None]:
table = pd.pivot_table(df,index=['Item_Fat_Content','Outlet_Size'])
table

In [None]:
table = pd.pivot_table(df,index=['Item_Fat_Content','Outlet_Type'])
table

In [None]:
table = pd.pivot_table(df,index=['Item_Fat_Content','Outlet_Location_Type'])
table

In [None]:
table = pd.pivot_table(df,index=['Outlet_Type','Item_Type'])
table

In [None]:
table = pd.pivot_table(df,index=['Outlet_Size','Item_Type'])
table

In [None]:
table = pd.pivot_table(df,index=['Outlet_Location_Type','Item_Type'])
table

In [None]:
table = pd.pivot_table(df,index=['Outlet_Size','Item_Fat_Content'])
table

In [None]:
table = pd.pivot_table(df,index=['Outlet_Type','Item_Fat_Content'])
table

In [None]:
table = pd.pivot_table(df,index=['Outlet_Location_Type','Item_Fat_Content'])
table

In [None]:
table = pd.pivot_table(df,index=['Outlet_Establishment_Year','Item_Fat_Content'])
table

# Cross Tab

In [None]:
pd.crosstab(df['Item_Fat_Content'],df['Outlet_Size']).style.background_gradient(cmap = 'winter')

In [None]:
pd.crosstab(df['Item_Fat_Content'],df['Outlet_Type']).style.background_gradient(cmap = 'cool')

In [None]:
pd.crosstab(df['Item_Fat_Content'],df['Item_Type']).style.background_gradient(cmap = 'ocean')

In [None]:
pd.crosstab(df['Item_Fat_Content'],df['Outlet_Establishment_Year']).style.background_gradient(cmap = 'spring')

In [None]:
pd.crosstab(df['Item_Fat_Content'],df['Outlet_Location_Type']).style.background_gradient(cmap = 'summer')

In [None]:
pd.crosstab(df['Item_Type'],df['Outlet_Type']).style.background_gradient(cmap = 'autumn')

In [None]:
pd.crosstab(df['Item_Type'],df['Outlet_Size']).style.background_gradient(cmap = 'twilight')

In [None]:
pd.crosstab(df['Item_Type'],df['Outlet_Establishment_Year']).style.background_gradient(cmap = 'flag')

In [None]:
pd.crosstab(df['Item_Type'],df['Outlet_Size']).style.background_gradient(cmap = 'Wistia')

In [None]:
pd.crosstab(df['Item_Type'],df['Outlet_Location_Type']).style.background_gradient(cmap = 'bwr')

In [None]:
pd.crosstab(df['Outlet_Size'],df['Outlet_Type']).style.background_gradient(cmap = 'seismic')

In [None]:
pd.crosstab(df['Outlet_Establishment_Year'],df['Outlet_Type']).style.background_gradient(cmap = 'PRGn')

In [None]:
pd.crosstab(df['Outlet_Location_Type'],df['Outlet_Type']).style.background_gradient(cmap = 'PuOr')

In [None]:
fat = df['Item_Fat_Content'].value_counts()

plt.style.use('default')
plt.figure(figsize = (7, 5))
squarify.plot(sizes = fat.values, label = fat.index, value = fat.values)
plt.title('Item Fat Content Distribution', fontdict = {'fontname' : 'Monospace', 'fontsize' : 20, 'fontweight' : 'bold'})
plt.show()

In [None]:
type1 = df['Item_Type'].value_counts()

plt.style.use('default')
plt.figure(figsize = (15, 10))
squarify.plot(sizes = type1.values, label = type1.index, value = type1.values)
plt.title('Item Type Distribution', fontdict = {'fontname' : 'Monospace', 'fontsize' : 20, 'fontweight' : 'bold'})
plt.show()

In [None]:
year = df['Outlet_Establishment_Year'].value_counts()

plt.style.use('default')
plt.figure(figsize = (7, 5))
squarify.plot(sizes = year.values, label = year.index, value = year.values)
plt.title('Outlet Establishment Year Distribution', fontdict = {'fontname' : 'Monospace', 'fontsize' : 20, 'fontweight' : 'bold'})
plt.show()

In [None]:
size = df['Outlet_Size'].value_counts()

plt.style.use('default')
plt.figure(figsize = (7, 5))
squarify.plot(sizes = size.values, label = size.index, value = size.values)
plt.title('Outlet Size Distribution', fontdict = {'fontname' : 'Monospace', 'fontsize' : 20, 'fontweight' : 'bold'})
plt.show()

In [None]:
location = df['Outlet_Location_Type'].value_counts()

plt.style.use('default')
plt.figure(figsize = (7, 5))
squarify.plot(sizes = location.values, label = location.index, value = location.values)
plt.title('Outlet Location Type Distribution', fontdict = {'fontname' : 'Monospace', 'fontsize' : 20, 'fontweight' : 'bold'})
plt.show()

In [None]:
type2 = df['Outlet_Type'].value_counts()

plt.style.use('default')
plt.figure(figsize = (7, 5))
squarify.plot(sizes = type2.values, label = type2.index, value = type2.values)
plt.title('Outlet Type Distribution', fontdict = {'fontname' : 'Monospace', 'fontsize' : 20, 'fontweight' : 'bold'})
plt.show()

# Line Plot

In [None]:
plt.figure(figsize=(15,10))
sns.lineplot(x = df['Item_Type'], y = df['Item_Outlet_Sales'])

In [None]:
plt.figure(figsize=(15,10))
sns.lineplot(x = df['Outlet_Identifier'], y = df['Item_Outlet_Sales'])

In [None]:
plt.figure(figsize=(15,10))
sns.lineplot(x = df['Outlet_Establishment_Year'], y = df['Item_Outlet_Sales'])

In [None]:
plt.figure(figsize = (20,7))
sns.lineplot(x = 'Item_Type', y = 'Item_Outlet_Sales', hue = 'Outlet_Type', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.lineplot(x = 'Item_Type', y = 'Item_Outlet_Sales', hue = 'Outlet_Size', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.lineplot(x = 'Item_Type', y = 'Item_Outlet_Sales', hue = 'Outlet_Location_Type', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.lineplot(x = 'Item_Type', y = 'Item_Outlet_Sales', hue = 'Outlet_Establishment_Year', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.lineplot(x = 'Item_Type', y = 'Item_Outlet_Sales', hue = 'Item_Fat_Content', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.lineplot(x = 'Outlet_Type', y = 'Item_Outlet_Sales', hue = 'Item_Type', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.lineplot(x = 'Outlet_Type', y = 'Item_Outlet_Sales', hue = 'Outlet_Location_Type', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.lineplot(x = 'Outlet_Location_Type', y = 'Item_Outlet_Sales', hue = 'Outlet_Size', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.lineplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales', hue = 'Outlet_Location_Type', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.lineplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales', hue = 'Outlet_Type', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.lineplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales', hue = 'Outlet_Size', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.lineplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales', hue = 'Outlet_Establishment_Year', data = df)
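
When `x` is categorical, `sns.lineplot` aggregates the repeated observations at each x value and, by default, draws their mean with a confidence band. The numbers behind the lines can be sketched with a plain pandas group-by. The small frame below is made-up illustration data, not the sales dataset; only the column names are borrowed from it.

```python
import pandas as pd

# Hypothetical toy data that reuses the notebook's column names
toy = pd.DataFrame({
    'Item_Type': ['Dairy', 'Dairy', 'Dairy', 'Meat', 'Meat', 'Meat'],
    'Outlet_Type': ['Grocery Store', 'Supermarket Type1', 'Supermarket Type1',
                    'Grocery Store', 'Grocery Store', 'Supermarket Type1'],
    'Item_Outlet_Sales': [100.0, 300.0, 500.0, 200.0, 400.0, 900.0],
})

# Mean sales per (x, hue) pair: one row per Item_Type, one column per Outlet_Type
mean_sales = (toy.groupby(['Item_Type', 'Outlet_Type'])['Item_Outlet_Sales']
                 .mean()
                 .unstack())
print(mean_sales)
```

Each column of `mean_sales` corresponds to one hue line in the plot, so the table is a quick way to read off exact values.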

# Boxen Plot

In [None]:
plt.figure(figsize = (20,7))
sns.boxenplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales', hue = 'Outlet_Establishment_Year', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.boxenplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales', hue = 'Outlet_Size', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.boxenplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales', hue = 'Outlet_Type', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.boxenplot(x = 'Item_Fat_Content', y = 'Item_Outlet_Sales', hue = 'Outlet_Location_Type', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.boxenplot(x = 'Outlet_Location_Type', y = 'Item_Outlet_Sales', hue = 'Outlet_Size', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.boxenplot(x = 'Outlet_Type', y = 'Item_Outlet_Sales', hue = 'Outlet_Location_Type', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.boxenplot(x = 'Outlet_Type', y = 'Item_Outlet_Sales', hue = 'Item_Type', data = df)

In [None]:
plt.figure(figsize = (20,7))
sns.boxenplot(x = 'Item_Type', y = 'Item_Outlet_Sales', hue = 'Item_Fat_Content', data = df)

In [None]:
average_sales    = df.groupby('Item_Type')["Item_Outlet_Sales"].mean()
pct_change_sales = df.groupby('Item_Type')["Item_Outlet_Sales"].sum().pct_change()
pct_change_sales

In [None]:
average_sales

In [None]:
# plot average sales per Item_Type
fig, (axis1,axis2) = plt.subplots(2,1,sharex=True,figsize=(15,8))

fig1 = average_sales.plot(legend=True,ax=axis1,marker='o',colormap="flag", title="Average Sales Per Item_Type")
fig1.set_xticks(range(len(average_sales)))
fig1.set_xticklabels(average_sales.index.tolist(), rotation=90)

# plot percent change in total sales between consecutive Item_Type categories
fig2 = pct_change_sales.plot(legend=True,ax=axis2,marker='o',rot=90,colormap="summer",title="Sales Percent Change Per Item_Type")
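
A note on what the percent-change series measures: `groupby` sorts the categories alphabetically, and `pct_change` then compares each category's total with the one listed just before it. It is a category-to-category comparison, not a change over time. A minimal sketch on made-up numbers (the toy frame is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical toy data, not the real sales figures
toy = pd.DataFrame({
    'Item_Type': ['Breads', 'Canned', 'Canned', 'Dairy'],
    'Item_Outlet_Sales': [100.0, 50.0, 100.0, 300.0],
})

# Totals per category, sorted alphabetically by the group key
totals = toy.groupby('Item_Type')['Item_Outlet_Sales'].sum()

# Relative change versus the previous category in that sorted order;
# the first category has nothing before it, so its value is NaN
change = totals.pct_change()
print(change)
```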

# Groupby Plot

In [None]:
df.groupby(['Outlet_Location_Type','Outlet_Size'])['Outlet_Size'].count()

In [None]:
df.groupby(['Outlet_Identifier','Outlet_Size'])['Outlet_Size'].count()

In [None]:
df.groupby(['Outlet_Type','Outlet_Size'])['Outlet_Size'].count()

In [None]:
df.groupby(['Outlet_Type'])['Item_Visibility'].mean()
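
The two-key counts above come back as a long Series with a MultiIndex. If a grid is easier to read, `pd.crosstab` gives the same counts as a table and fills missing combinations with 0. This is only a sketch on hypothetical toy rows that reuse the notebook's column names.

```python
import pandas as pd

# Hypothetical toy rows, not the real outlets
toy = pd.DataFrame({
    'Outlet_Location_Type': ['Tier 1', 'Tier 1', 'Tier 2', 'Tier 3', 'Tier 3', 'Tier 3'],
    'Outlet_Size': ['Small', 'Medium', 'Small', 'High', 'Medium', 'Medium'],
})

# Long format, as in the cells above
counts = toy.groupby(['Outlet_Location_Type', 'Outlet_Size'])['Outlet_Size'].count()

# Wide format: rows are location types, columns are outlet sizes
table = pd.crosstab(toy['Outlet_Location_Type'], toy['Outlet_Size'])
print(table)
```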

In [None]:
df['Item_Type'] = df['Item_Type'].map({'Snack Foods': 0, 'Fruits and Vegetables': 1, 'Household': 2, 'Frozen Foods': 3, 
                                       'Dairy': 4, 'Baking Goods': 5, 'Canned': 6, 'Health and Hygiene': 7, 'Meat': 8, 
                                       'Soft Drinks': 9, 'Breads': 10, 'Hard Drinks': 11, 'Starchy Foods': 12, 'Others': 13, 
                                      'Breakfast': 14, 'Seafood': 15})
df.head()

In [None]:
df['Outlet_Identifier'] = df['Outlet_Identifier'].map({'OUT027': 0, 'OUT045': 1, 'OUT046': 2, 'OUT013': 3, 'OUT035': 4,
                                                       'OUT049': 5, 'OUT017': 6, 'OUT018': 7, 'OUT010': 8, 'OUT019': 9})
df.head()

In [None]:
df['Outlet_Size'] = df['Outlet_Size'].map({'High': 0, 'Medium': 1, 'Small': 2})
df.head()

In [None]:
df['Outlet_Location_Type'] = df['Outlet_Location_Type'].map({'Tier 3': 0, 'Tier 2': 1, 'Tier 1': 2})
df.head()

In [None]:
df['Outlet_Type'] = df['Outlet_Type'].map({'Supermarket Type1': 0, 'Grocery Store': 1, 'Supermarket Type3': 2, 
                                           'Supermarket Type2': 3})
df.head()

In [None]:
df = df.dropna()
#print(df)
df.head()

In [None]:
# Rows with missing values were already dropped in the previous cell; confirm none remain
df.isnull().sum()
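
One thing to keep in mind with the dictionary encodings above: `Series.map` returns NaN for any label that is not a key in the dictionary, and `dropna` then removes those rows. It can be worth checking for unmapped labels before mapping. A minimal sketch, where the series and the extra label `'Tiny'` are hypothetical:

```python
import pandas as pd

# Hypothetical column with one unexpected label and one missing value
sizes = pd.Series(['High', 'Medium', 'Small', 'Tiny', None])
mapping = {'High': 0, 'Medium': 1, 'Small': 2}

# Labels present in the data but absent from the dictionary
unmapped = set(sizes.dropna().unique()) - set(mapping)
print('Unmapped labels:', unmapped)

# Both the unmapped label and the missing value become NaN after mapping
encoded = sizes.map(mapping)
n_lost = int(encoded.isna().sum())
print('Rows that dropna() would remove:', n_lost)
```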

# Training And Testing Data

In [None]:
# Feature matrix (predictor columns)
X = df.loc[:, ['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_Type', 'Item_MRP', 'Outlet_Identifier', 
               'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']]
X.head()

In [None]:
Y = df.loc[:, ['Item_Outlet_Sales']]
Y.head()

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 42, shuffle = True)
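
As a quick illustration of what this split does: `test_size = 0.25` holds out a quarter of the rows, and a fixed `random_state` makes the shuffle repeatable. The arrays below are synthetic placeholders, not the sales data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 rows, 2 feature columns
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, shuffle=True)

# Same seed, same split
_, _, _, y_te_again = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, shuffle=True)

print(X_tr.shape, X_te.shape)
print(np.array_equal(y_te, y_te_again))
```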

# Linear Regression

In [None]:
regressor = LinearRegression()  
regressor.fit(X_train, Y_train) #training the algorithm
#To retrieve the intercept:
print(regressor.intercept_)

# To retrieve the coefficients (one slope per feature column):
print(regressor.coef_)

In [None]:
Y_pred = regressor.predict(X_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, Y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, Y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
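
For reference, the three metrics printed above can be computed by hand. The sketch below uses three made-up values to show how MAE, MSE and RMSE relate to each other; RMSE is simply the square root of MSE, which puts it back in the units of the target.

```python
import numpy as np

# Made-up targets and predictions for illustration
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred          # [1, 0, -2]

mae = np.mean(np.abs(errors))     # mean of |error|
mse = np.mean(errors ** 2)        # mean of error squared
rmse = np.sqrt(mse)               # square root of MSE

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
```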

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Model initialization
regression_model = LinearRegression()
# Fit on the full dataset (no hold-out set, so the scores below are in-sample)
regression_model.fit(X, Y)
# Predict
Y_pred = regression_model.predict(X)

# model evaluation: RMSE is the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(Y, Y_pred))
r2 = r2_score(Y, Y_pred)

# printing values
print('Coefficients:', regression_model.coef_)
print('Intercept:', regression_model.intercept_)
print('Root mean squared error: ', rmse)
print('R2 score: ', r2)

In [None]:
import statsmodels.api as sm

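# Synthetic demo data, kept under separate names so the feature matrix X and target Y are not overwritten
X_demo = np.random.rand(100)
Y_demo = X_demo + np.random.rand(100)*0.1

results = sm.OLS(Y_demo, sm.add_constant(X_demo)).fit()

print(results.summary())

plt.scatter(X_demo, Y_demo)

X_plot = np.linspace(0,1,100)
# add_constant prepends the constant column, so params[0] is the intercept and params[1] is the slope
plt.plot(X_plot, results.params[0] + X_plot*results.params[1])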

plt.show()
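
The order of the fitted parameters is easy to mix up. `sm.add_constant` places the column of ones first by default, so the first parameter is the intercept and the second is the slope. The sketch below mimics that layout with plain NumPy least squares on noise-free made-up data (true intercept 2, true slope 3); it is an illustration of the ordering, not a replacement for the statsmodels fit.

```python
import numpy as np

# Made-up, noise-free data: y = 2 + 3x
x = np.linspace(0, 1, 50)
y = 2.0 + 3.0 * x

# Column of ones first, then x: the same column order add_constant uses by default
design = np.column_stack([np.ones_like(x), x])

# Ordinary least squares; params follow the column order of the design matrix
params, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, slope = params

fitted = intercept + slope * x
print('Intercept:', intercept, 'Slope:', slope)
```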

# Decision Tree 

In [None]:
regressor = DecisionTreeRegressor(max_depth=15,min_samples_leaf=300)
regressor.fit(X_train, Y_train)

In [None]:
Y_pred = regressor.predict(X_test)
Y_pred

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, Y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, Y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))

# Random Forest Model

In [None]:
regressor = RandomForestRegressor(n_estimators=100,max_depth=6, min_samples_leaf=50,n_jobs=4)
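# Pass the target as a 1-D array; a single-column DataFrame triggers a DataConversionWarning here
regressor.fit(X_train, Y_train.values.ravel())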

In [None]:
Y_pred = regressor.predict(X_test)
Y_pred

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_test, Y_pred))  
print('Mean Squared Error:', metrics.mean_squared_error(Y_test, Y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))
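
Since the same three metrics are printed for every model, the comparison can also be wrapped in one helper. The sketch below is one way to do it, run on synthetic data from `make_regression` rather than the sales data. The helper name `evaluate` and the `random_state` values are my own additions; the other hyperparameters mirror the cells above.

```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the sales dataset)
X_syn, y_syn = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(max_depth=15, min_samples_leaf=300, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=6,
                                           min_samples_leaf=50, random_state=42),
}

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model and return MAE, MSE and RMSE on the test split."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = metrics.mean_squared_error(y_test, pred)
    return {
        'MAE': metrics.mean_absolute_error(y_test, pred),
        'MSE': mse,
        'RMSE': np.sqrt(mse),
    }

scores = {name: evaluate(model, X_tr, y_tr, X_te, y_te) for name, model in models.items()}
for name, s in scores.items():
    print(f"{name}: MAE={s['MAE']:.2f}  MSE={s['MSE']:.2f}  RMSE={s['RMSE']:.2f}")
```

Keeping the scores in one dictionary makes it straightforward to compare models side by side; the numbers on synthetic data say nothing about how the models rank on the real sales data.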