# **Problem Description**
*The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined.*

**The aim is to build a predictive model and find out the sales of each product at a particular store.
Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.**


**In this analysis, I have focused mainly on the visualizations and their interpretation.**

# **Importing the libraries and the datasets**



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
from scipy import stats
from scipy.stats import norm
warnings.filterwarnings('ignore')

In [None]:
train = pd.read_csv('../input/big-mart-sales-prediction/Train.csv')
train.head()

In [None]:
test = pd.read_csv('../input/big-mart-sales-prediction/Test.csv')

In [None]:
print("Train : ", train.shape)
print("Test : ", test.shape)

In [None]:
train.info()

In [None]:
# To look at the unique observations from each of the features
train.nunique()

# **Data Visualization and Cleaning**

In [None]:
# To check if there are any null values
train.isnull().sum()

In [None]:
# Overall percentage of missing cells (missing values divided by the total number of cells)
missing_data = train.isnull().sum()
total_percentage = (missing_data.sum()/train.size) * 100
print(f'The total percentage of missing data is {round(total_percentage,2)}%')

In [None]:
total = train.isnull().sum().sort_values(ascending=False)
percent_total = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)*100
missing = pd.concat([total, percent_total], axis=1, keys=["Total", "Percentage"])
missing_data = missing[missing['Total']>0]
missing_data

In [None]:
# Plotting the percentage of missing values
plt.figure(figsize=(5,5))
sns.set(style="whitegrid")
sns.barplot(x=missing_data.index, y=missing_data['Percentage'], data = missing_data)
plt.title('Percentage of missing data by feature')
plt.xlabel('Features', fontsize=14)
plt.ylabel('Percentage', fontsize=14)
plt.show()

In [None]:
# Filling the null values with the mean value (assigning back instead of a chained inplace fill)
train['Item_Weight'] = train['Item_Weight'].fillna(train['Item_Weight'].mean())
train.isnull().sum()

In [None]:
# Mapping the categories to ordinal codes and then filling the missing values with the median code
train['Outlet_Size'] = train['Outlet_Size'].map({'Small':1, 'Medium':2, 'High':3})

print("The median value : ", train['Outlet_Size'].median())
train['Outlet_Size'] = train['Outlet_Size'].fillna(train['Outlet_Size'].median())
train.isnull().sum()

In [None]:
# Replacing it back into categorical values
train['Outlet_Size'] = train['Outlet_Size'].replace(1.000000,'Small')
train['Outlet_Size'] = train['Outlet_Size'].replace(2.000000,'Medium')
train['Outlet_Size'] = train['Outlet_Size'].replace(3.000000,'High')
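
The three steps above (map to codes, fill with the median, map back) can be sketched in one pass with an inverse dictionary. This is a minimal sketch on a made-up toy column; the names `size_to_code`, `code_to_size` and `imputed_sizes` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy column standing in for train['Outlet_Size'] (values are made up)
sizes = pd.Series(['Small', 'Medium', None, 'High', 'Medium'])

size_to_code = {'Small': 1, 'Medium': 2, 'High': 3}
code_to_size = {code: size for size, code in size_to_code.items()}

codes = sizes.map(size_to_code)          # missing entries stay NaN
filled = codes.fillna(codes.median())    # median of [1, 2, 3, 2] is 2.0

# Round in case the median falls between two codes, then map back to labels
imputed_sizes = filled.round().astype(int).map(code_to_size)
print(imputed_sizes.tolist())
```

Keeping both dictionaries next to each other makes it harder for the forward and backward mappings to drift apart.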

In [None]:
# Changing the data type of the establishment year to object, as the years are treated here as categories rather than numerical values
train['Outlet_Establishment_Year']  = train['Outlet_Establishment_Year'].astype('object')

train.info()

In [None]:
# Statistical description of the data
train.describe()

In [None]:
# The minimum value of the item visibility feature is zero(0)
# Replacing the minimum value with the 2nd minimum value of the feature, as item visibility cannot be zero
train['Item_Visibility'] = train['Item_Visibility'].replace(0.000000,0.003574698)
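
The replacement value above is typed in by hand. As a hedged alternative, the smallest non-zero value could be computed from the column itself, so the same code works on any split of the data. The toy values and the names `smallest_positive` and `cleaned` below are assumptions for illustration only.

```python
import pandas as pd

# Toy column standing in for train['Item_Visibility'] (values are made up)
visibility = pd.Series([0.0, 0.05, 0.0, 0.0036, 0.12])

# Smallest value that is strictly greater than zero
smallest_positive = visibility[visibility > 0].min()

# Replace every exact zero with that value
cleaned = visibility.replace(0.0, smallest_positive)
print(cleaned.tolist())
```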

In [None]:
# Detecting the outliers so that they can be removed
plt.title('Box-plot of Item outlet sales')
sns.boxplot(x='Item_Outlet_Sales',data=train)

**The outliers in the Item outlet sales feature start at about 6250. Hence only the observations with values < 6250 are kept.**

In [None]:
train=train[train['Item_Outlet_Sales']<6250]
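
The cut-off above was read off the box plot. A box plot with default settings draws its upper whisker at most 1.5 times the interquartile range above the third quartile, so a similar threshold can be computed directly. The sketch below uses made-up numbers, and the helper name `upper_fence` is hypothetical.

```python
import pandas as pd

def upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, the usual upper outlier fence of a box plot."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)

# Toy sales figures (made up); the last value is an obvious outlier
sales = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0, 5000.0])

fence = upper_fence(sales)
kept = sales[sales < fence]
print(fence, len(kept))
```

The same helper could be applied to any other numeric column whose outliers are trimmed by eye.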

In [None]:
# Plotting the distribution of the feature
plt.figure(figsize=(10,8))
sns.distplot(train['Item_Outlet_Sales'], fit = norm, color='red')
plt.title('Item Outlet Sales Distribution')
plt.axvline(train['Item_Outlet_Sales'].median(),color='yellow',label='Median')
plt.axvline(train['Item_Outlet_Sales'].mean(),color='blue',label='Mean')
plt.legend()

In [None]:
print ("Skewness :", train['Item_Outlet_Sales'].skew())
print("Kurtosis : ", train['Item_Outlet_Sales'].kurt())

1. **The distribution is positively skewed: it has a long right tail, so most of the items have sales less than the mean value.**
2. **The distribution curve is platykurtic, i.e. its tails are lighter than those of a normal distribution, which tells that it is less prone to outliers.**
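
To make the two terms concrete, here is a small sketch on synthetic data (not the BigMart data). It relies on the fact that pandas `kurt()` reports excess kurtosis, so a normal distribution scores close to 0 and a platykurtic one scores below 0.

```python
import numpy as np
import pandas as pd

# A right-skewed toy sample: one large value pulls the mean above the median
skewed = pd.Series([1, 1, 2, 2, 3, 10])
print("Skewness:", skewed.skew(), "Mean:", skewed.mean(), "Median:", skewed.median())

# Synthetic samples with a fixed seed so the numbers are reproducible
rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(size=100_000))
uniform_sample = pd.Series(rng.uniform(size=100_000))

# Excess kurtosis: close to 0 for normal data, negative for the flatter uniform data
normal_kurt = normal_sample.kurt()
uniform_kurt = uniform_sample.kurt()
print("Normal:", normal_kurt, "Uniform:", uniform_kurt)
```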

In [None]:
plt.title('Box-plot of Item outlet sales after removing the outliers')
sns.boxplot(x='Item_Outlet_Sales',data=train)

In [None]:
plt.title('Box-plot of Item visibility')
sns.boxplot(x='Item_Visibility',data=train)

**The outliers in the Item visibility feature start at about 0.195.**

**Hence only the observations with values < 0.195 are kept.**

In [None]:
train=train[train['Item_Visibility']<0.195]

In [None]:
plt.figure(figsize=(10,8))
sns.distplot(train['Item_Visibility'], fit = norm, color='red')
plt.title('Item_Visibility after removing the outliers')
plt.axvline(train['Item_Visibility'].median(),color='yellow',label='Median')
plt.axvline(train['Item_Visibility'].mean(),color='blue',label='Mean')
plt.legend()

In [None]:
print ("Skewness :", train['Item_Visibility'].skew())
print("Kurtosis : ", train['Item_Visibility'].kurt())

1. **The distribution is positively skewed: it has a long right tail, so most of the items are less visible than the mean value.**
2. **The distribution curve is platykurtic, i.e. its tails are lighter than those of a normal distribution, which tells that it is less prone to outliers.**

In [None]:
plt.title('Box-plot of Item visibility after removing the outliers')
sns.boxplot(x='Item_Visibility',data=train)

In [None]:
plt.title('Box-plot of Item MRP')
sns.boxplot(x='Item_MRP',data=train)

In [None]:
plt.figure(figsize=(10,8))
sns.distplot(train['Item_MRP'], fit = norm, color = 'red')
plt.title('Item_MRP')
plt.axvline(train['Item_MRP'].median(),color='yellow',label='Median')
plt.axvline(train['Item_MRP'].mean(),color='blue',label='Mean')
plt.legend()

**The distribution looks symmetric, as the mean and the median values are almost the same.**

In [None]:
print ("Skewness :", train['Item_MRP'].skew())
print("Kurtosis : ", train['Item_MRP'].kurt())

In [None]:
plt.title('Box-plot of Item weight')
sns.boxplot(x='Item_Weight',data=train)

In [None]:
plt.figure(figsize=(10,8))
sns.distplot(train['Item_Weight'], fit = norm, color = 'red')
plt.title('Item_Weight')
plt.axvline(train['Item_Weight'].median(),color='yellow',label='Median')
plt.axvline(train['Item_Weight'].mean(),color='blue',label='Mean')
plt.legend()

**The distribution looks symmetric, as the mean and the median values are the same.**

In [None]:
print ("Skewness :", train['Item_Weight'].skew())
print("Kurtosis : ", train['Item_Weight'].kurt())

In [None]:
train['Item_Fat_Content'].value_counts()

In [None]:
# The values LF, low fat and Low Fat are the same, as are reg and Regular. Hence replacing them to avoid confusion
train['Item_Fat_Content'] = train['Item_Fat_Content'].replace('low fat', 'Low Fat')
train['Item_Fat_Content'] = train['Item_Fat_Content'].replace('reg', 'Regular')
train['Item_Fat_Content'] = train['Item_Fat_Content'].replace('LF', 'Low Fat')
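
The three replacements above can also be written as a single call with a dictionary of aliases. A minimal sketch on a toy column follows; the `aliases` name is an assumption for illustration.

```python
import pandas as pd

# Toy column standing in for train['Item_Fat_Content']
fat = pd.Series(['Low Fat', 'LF', 'reg', 'Regular', 'low fat'])

# Every alias points at the label it should be merged into
aliases = {'low fat': 'Low Fat', 'LF': 'Low Fat', 'reg': 'Regular'}
normalized = fat.replace(aliases)
print(normalized.value_counts())
```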

In [None]:
# Low Fat products have the higher count, which suggests that these products are preferred
sns.countplot(x=train['Item_Fat_Content'])

In [None]:
# Fruits and vegetables and snacks have high counts, which suggests that there is a good demand for these products
sns.countplot(x=train['Item_Type'])
plt.xticks(rotation=90)

In [None]:
sns.countplot(x=train['Outlet_Establishment_Year'])
plt.xticks(rotation=90)

In [None]:
sns.countplot(x=train['Outlet_Identifier'])
plt.xticks(rotation=90)

In [None]:
sns.countplot(x=train['Outlet_Size'])
plt.xticks(rotation=90)

In [None]:
sns.countplot(x=train['Outlet_Location_Type'])
plt.xticks(rotation=90)

In [None]:
sns.countplot(x=train['Outlet_Type'])
plt.xticks(rotation=90)

In [None]:
plt.figure(figsize=(8,5))
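# numeric_only=True skips the text columns, which newer pandas versions refuse to correlate
sns.heatmap(train.corr(numeric_only=True), annot=True)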

**Item MRP and Item outlet sales show a high positive correlation, which tells that as the MRP of an item increases, its sales also increase.**
**Similarly, item visibility and item outlet sales are negatively correlated, which means that less visible items tend to have higher sales and more visible items tend to have lower sales.**
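
The same reading can be taken from a sorted list instead of the heatmap. The sketch below uses a made-up four-row frame and assumes pandas 1.5 or later for the `numeric_only` argument; `corr_with_sales` is an illustrative name.

```python
import pandas as pd

# Toy frame with the same column names as the notebook (values are made up)
df = pd.DataFrame({
    'Item_MRP': [50.0, 100.0, 150.0, 200.0],
    'Item_Visibility': [0.12, 0.08, 0.05, 0.02],
    'Outlet_Size': ['Small', 'Medium', 'High', 'Medium'],  # text column, skipped below
    'Item_Outlet_Sales': [500.0, 1100.0, 1400.0, 2100.0],
})

# Correlation of every numeric feature with the target, strongest first
corr_with_sales = (
    df.corr(numeric_only=True)['Item_Outlet_Sales']
      .drop('Item_Outlet_Sales')
      .sort_values(ascending=False)
)
print(corr_with_sales)
```

A correlation only describes how two columns move together; it does not by itself show that one causes the other.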

In [None]:
plt.figure(figsize=(8,5))
sns.scatterplot(x='Item_MRP',y='Item_Outlet_Sales',hue='Item_MRP',data=train)
plt.axvline(train['Item_MRP'].mean(),color='red',label='Mean')

**By looking at the plot we can say that if the MRP is high, then the sales are high.**
**If an item has an MRP above the mean value, then its sales tend to be higher.**

In [None]:
plt.figure(figsize=(8,5))
sns.scatterplot(x='Item_Visibility',y='Item_Outlet_Sales',hue='Item_Visibility',data=train)

**If the item visibility is less than 0.100, the sales are higher.**

In [None]:
sns.barplot(x=train['Outlet_Identifier'],y=train['Item_Outlet_Sales'])
plt.xticks(rotation=90)

**Outlet OUT027 has higher sales than the other outlets.**

In [None]:
# The breakfast items have good visibility
sns.barplot(x=train['Item_Type'],y=train['Item_Visibility'])
plt.xticks(rotation=90)

In [None]:
# Most of the items have a good amount of sales
sns.barplot(x=train['Item_Type'],y=train['Item_Outlet_Sales'])
plt.xticks(rotation=90)

In [None]:
# The outlet established in 1998 did not have great sales
sns.barplot(x=train['Outlet_Establishment_Year'],y=train['Item_Outlet_Sales'])
plt.xticks(rotation=90)

In [None]:
# After the removal of outliers
train.shape

In [None]:
fig,axes=plt.subplots(2,2,figsize=(15,12))
sns.boxplot(x='Outlet_Establishment_Year',y='Item_Outlet_Sales',ax=axes[0,0],data=train)
sns.boxplot(x='Outlet_Size',y='Item_Outlet_Sales',ax=axes[0,1],data=train)
sns.boxplot(x='Outlet_Location_Type',y='Item_Outlet_Sales',ax=axes[1,0],data=train)
sns.boxplot(x='Outlet_Type',y='Item_Outlet_Sales',ax=axes[1,1],data=train)

In [None]:
fig,axes=plt.subplots(3,1,figsize=(15,12))
sns.boxplot(x='Item_Type',y='Item_Outlet_Sales',ax=axes[0],data=train)
sns.boxplot(x='Outlet_Identifier',y='Item_Outlet_Sales',ax=axes[1],data=train)
sns.boxplot(x='Item_Fat_Content',y='Item_Outlet_Sales',ax=axes[2],data=train)

In [None]:
train.info()

In [None]:
# Mapping the binary features
train['Item_Fat_Content'] = train['Item_Fat_Content'].map({'Low Fat': 1, 'Regular': 0})

In [None]:
# Creating dummy variables of all the other categorical features
Itemtype = pd.get_dummies(train['Item_Type'],prefix='ItemType',drop_first=True)
train = pd.concat([train,Itemtype],axis=1)

OutID = pd.get_dummies(train['Outlet_Identifier'],prefix='OutIden',drop_first=True)
train = pd.concat([train,OutID],axis=1)

OutLoctype = pd.get_dummies(train['Outlet_Location_Type'],prefix='OutLocTy',drop_first=True)
train = pd.concat([train,OutLoctype],axis=1)

Outtype = pd.get_dummies(train['Outlet_Type'],prefix='OutTy',drop_first=True)
train = pd.concat([train,Outtype],axis=1)

OutSz = pd.get_dummies(train['Outlet_Size'],prefix='OutSz',drop_first=True)
train = pd.concat([train,OutSz],axis=1)

OutEYr = pd.get_dummies(train['Outlet_Establishment_Year'],prefix='OutEstYear',drop_first=True)
train = pd.concat([train,OutEYr],axis=1)
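
The encode-then-concat pattern above can also be expressed as one `pd.get_dummies` call on the whole frame, which drops the original columns as part of the same step. This is a sketch on a toy frame with two of the notebook's columns; the prefixes match the ones used above.

```python
import pandas as pd

# Toy frame (values are made up)
df = pd.DataFrame({
    'Item_MRP': [100.0, 150.0, 200.0],
    'Outlet_Size': ['Small', 'Medium', 'High'],
    'Outlet_Type': ['Grocery Store', 'Supermarket Type1', 'Grocery Store'],
})

# One call encodes the listed columns and removes the originals
encoded = pd.get_dummies(
    df,
    columns=['Outlet_Size', 'Outlet_Type'],
    prefix={'Outlet_Size': 'OutSz', 'Outlet_Type': 'OutTy'},
    drop_first=True,  # drop the first category of each column, as in the cells above
)
print(encoded.columns.tolist())
```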

In [None]:
train.drop(['Item_Type','Outlet_Identifier','Outlet_Location_Type','Outlet_Type','Outlet_Size','Outlet_Establishment_Year'],axis=1,inplace=True)

In [None]:
X = train.drop(['Item_Identifier','Item_Outlet_Sales'],axis=1)

y = train['Item_Outlet_Sales']

In [None]:
X.shape

In [None]:
from sklearn.ensemble import ExtraTreesRegressor
model = ExtraTreesRegressor()
model.fit(X,y)

# To look at the best features
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest().plot(kind='barh')
plt.show()

**Item MRP, Item Visibility and Item Weight are the best features, along with Outlet OUT019 and the outlet established in 1998.**

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)

y_linreg = lin_reg.predict(x_test)

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

MSE=mean_squared_error(y_test,y_linreg)
MAE=mean_absolute_error(y_test,y_linreg)
r2=r2_score(y_test,y_linreg)
RMSE = np.sqrt(MSE)
print("R squared value: ", r2)
print("Root Mean Squared Error : ", RMSE)
print("Mean Absolute Error : ", MAE)
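
For reference, the three metrics printed above can be worked out by hand. The sketch below uses four made-up values and plain NumPy, so each formula is visible; it is not the notebook's data.

```python
import numpy as np

# Made-up true and predicted sales
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))          # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error

ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                          # R squared

print("MAE:", mae, "RMSE:", rmse, "R squared:", r2)
```

MAE and RMSE are in the same units as the sales, while R squared is unitless and compares the model against always predicting the mean.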

In [None]:
from sklearn.ensemble import RandomForestRegressor
reg=RandomForestRegressor(n_estimators=100)
reg.fit(x_train,y_train)

feature_imp = pd.Series(reg.feature_importances_,index=x_train.columns).sort_values(ascending=False)

plt.figure(figsize=(10,15))
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to the graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

**Both of the regression models showed the same features as the best, which are Item MRP, Item Visibility, Item Weight, Outlet OUT019 and Outlet Establishment Year 1998.**

In [None]:
rfr_reg=RandomForestRegressor(n_estimators=200)
rfr_reg.fit(x_train,y_train)

y_rfreg = rfr_reg.predict(x_test)

In [None]:
MSE=mean_squared_error(y_test,y_rfreg)
MAE=mean_absolute_error(y_test,y_rfreg)
r2=r2_score(y_test,y_rfreg)
RMSE = np.sqrt(MSE)
print("R squared value: ", r2)
print("Root Mean Squared Error : ", RMSE)
print("Mean Absolute Error : ", MAE)

**Comparing the 2 models on the R squared, RMSE and MAE values, we can say that the Random forest model is the better one.**

**For predicting the test data, we will use both the models.**
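
A single train/test split can favour one model by chance. As a hedged sketch, the comparison could be repeated with 5-fold cross-validation. The example below runs on synthetic data from `make_regression`, not on the BigMart data, and the names `X_demo`, `y_demo` and `cv_r2` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for X and y
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

models = {
    'linear': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=0),
}

# One R squared score per fold for each model
cv_r2 = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring='r2')
    for name, model in models.items()
}

for name, scores in cv_r2.items():
    print(f"{name}: mean R squared = {scores.mean():.3f} (std {scores.std():.3f})")
```

On the real data, `X` and `y` from the earlier cells would take the place of `X_demo` and `y_demo`.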

# **Testing Phase**
**The same data cleaning steps were taken for the test data, and then the regression models will be used to predict the Item outlet sales for the unseen data.**

In [None]:
test.head()

In [None]:
test.nunique()

In [None]:
test['Outlet_Establishment_Year']  = test['Outlet_Establishment_Year'].astype('object')

test.info()

In [None]:
test.isnull().sum()

In [None]:
test['Item_Weight'] = test['Item_Weight'].fillna(test['Item_Weight'].mean())
test.isnull().sum()
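
The cell above fills the test set with its own mean. A common alternative, shown here only as a sketch on made-up numbers, is to reuse the statistic computed on the training data so that both sets are treated identically. The names `train_weight`, `test_weight` and `train_mean` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy columns standing in for train['Item_Weight'] and test['Item_Weight']
train_weight = pd.Series([10.0, 12.0, np.nan, 14.0])
test_weight = pd.Series([np.nan, 20.0])

# Statistic learned from the training data only (NaN is ignored by mean)
train_mean = train_weight.mean()

# The same value fills the gaps in the test data
test_filled = test_weight.fillna(train_mean)
print(test_filled.tolist())
```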

In [None]:
test['Outlet_Size'] = test['Outlet_Size'].map({'Small':1, 'Medium':2, 'High':3})

test['Outlet_Size'] = test['Outlet_Size'].fillna(test['Outlet_Size'].median())
test.isnull().sum()

In [None]:
test['Outlet_Size'] = test['Outlet_Size'].replace(1.000000,'Small')
test['Outlet_Size'] = test['Outlet_Size'].replace(2.000000,'Medium')
test['Outlet_Size'] = test['Outlet_Size'].replace(3.000000,'High')

In [None]:
test['Item_Visibility'] = test['Item_Visibility'].replace(0.000000,0.003591414)

In [None]:
test.describe()

In [None]:
plt.title('Box-plot of Item visibility')
sns.boxplot(x='Item_Visibility',data=test)

In [None]:
test=test[test['Item_Visibility']<0.19]

In [None]:
plt.title('Box-plot of Item visibility after removing the outliers')
sns.boxplot(x='Item_Visibility',data=test)

In [None]:
plt.title('Box-plot of Item Weight')
sns.boxplot(x='Item_Weight',data=test)

In [None]:
plt.title('Box-plot of Item MRP')
sns.boxplot(x='Item_MRP',data=test)

In [None]:
test.shape

In [None]:
test.info()

In [None]:
Prediction_LR = pd.DataFrame(test['Item_Identifier'])
Prediction_LR['Outlet_Identifier'] = test['Outlet_Identifier']

In [None]:
Prediction_RFR = pd.DataFrame(test['Item_Identifier'])
Prediction_RFR['Outlet_Identifier'] = test['Outlet_Identifier']

In [None]:
test['Item_Fat_Content'] = test['Item_Fat_Content'].replace('low fat', 'Low Fat')
test['Item_Fat_Content'] = test['Item_Fat_Content'].replace('reg', 'Regular')
test['Item_Fat_Content'] = test['Item_Fat_Content'].replace('LF', 'Low Fat')

In [None]:
test['Item_Fat_Content'] = test['Item_Fat_Content'].map({'Low Fat': 1, 'Regular': 0})

In [None]:
Itemtype = pd.get_dummies(test['Item_Type'],prefix='ItemType',drop_first=True)
test = pd.concat([test,Itemtype],axis=1)

OutID = pd.get_dummies(test['Outlet_Identifier'],prefix='OutIden',drop_first=True)
test = pd.concat([test,OutID],axis=1)

OutLoctype = pd.get_dummies(test['Outlet_Location_Type'],prefix='OutLocTy',drop_first=True)
test = pd.concat([test,OutLoctype],axis=1)

Outtype = pd.get_dummies(test['Outlet_Type'],prefix='OutTy',drop_first=True)
test = pd.concat([test,Outtype],axis=1)

OutSz = pd.get_dummies(test['Outlet_Size'],prefix='OutSz',drop_first=True)
test = pd.concat([test,OutSz],axis=1)

OutEYr = pd.get_dummies(test['Outlet_Establishment_Year'],prefix='OutEstYear',drop_first=True)
test = pd.concat([test,OutEYr],axis=1)

In [None]:
test.drop(['Item_Type','Outlet_Identifier','Outlet_Location_Type','Outlet_Type','Outlet_Size','Outlet_Establishment_Year'],axis=1,inplace=True)
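
The models expect the test columns to match the training columns in both name and order. If a category were missing from one of the sets, the dummy columns would differ. One hedged way to guard against this is to reindex the test frame on the training columns, as sketched below on toy frames; `train_features` and `test_features` are illustrative names.

```python
import pandas as pd

# Toy encoded frames (values are made up); the test frame lacks one dummy column
train_features = pd.DataFrame({
    'Item_MRP': [100.0, 150.0],
    'OutSz_Medium': [1, 0],
    'OutSz_Small': [0, 1],
})
test_features = pd.DataFrame({
    'OutSz_Small': [1],
    'Item_MRP': [120.0],
})

# Same columns, same order; dummy columns missing from the test set are filled with 0
aligned = test_features.reindex(columns=train_features.columns, fill_value=0)
print(aligned)
```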

In [None]:
test.drop('Item_Identifier',axis=1,inplace=True)

In [None]:
test.head()

In [None]:
test.shape

In [None]:
item_sales_linreg = lin_reg.predict(test)

Prediction_LR['Item_Outlet_Sales'] = item_sales_linreg
Prediction_LR.head()

In [None]:
item_sales_rfr = rfr_reg.predict(test)

Prediction_RFR['Item_Outlet_Sales'] = item_sales_rfr
Prediction_RFR.head()