# EDA

Exploratory data analysis (EDA) is a crucial step in data analysis that helps us understand the data.
EDA gives insight into the data, which later helps us build a suitable model.

For this notebook, I chose the BigMart Sales data. The task is to build a regression model to predict the sales of the items.

The data has both numerical and categorical features with missing values.
Looks like I can apply all basic EDA techniques here!

In [None]:
# BigMart Sales Prediction
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the data
a = pd.read_csv(r'../input/big-mart-sales-prediction/Train.csv')  # training data
b = pd.read_csv(r'../input/big-mart-sales-prediction/Test.csv')   # testing data

# Store the test identifiers for later use, then drop them from both sets.
c = b['Item_Identifier']
d = b['Outlet_Identifier']

a.drop(['Item_Identifier', 'Outlet_Identifier'], axis = 1, inplace = True)
b.drop(['Item_Identifier', 'Outlet_Identifier'], axis = 1, inplace = True)

I stored the numerical and categorical feature names in lists for future use.

In [None]:
categorical = ['Item_Fat_Content', 'Outlet_Size','Outlet_Location_Type','Outlet_Type', 'Item_Type']
continuous = ['Item_Weight','Item_Visibility', 'Item_MRP','Item_Outlet_Sales']

Let's have a look at the structure of our training data...

In [None]:
a.info()


... and the testing data.

In [None]:
b.info()

* There are 9 features, of which 5 are categorical and 4 are continuous.
* The target variable is Item_Outlet_Sales, and it is continuous.
* Both training and testing data have the same continuous and categorical features, with the exception of the target variable.
* Even though the number of features is the same, the testing data has fewer observations. More training observations generally help the model fit, so having the larger share of the data in the training set works in our favour.

By looking at the features, we can guess that Item_MRP, Item_Fat_Content and Item_Visibility are related to the target variable, but we won't draw a conclusion without looking at the correlations.

Let's have a closer look at our VIP, the target variable.

In [None]:
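# distplot is deprecated in recent seaborn versions; histplot with a KDE is the replacement.
sns.histplot(a['Item_Outlet_Sales'], kde = True)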
plt.show()

Turns out that the VIP is positively skewed.
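The skew can also be put into a number instead of being read off the plot. Here is a minimal sketch on synthetic, right-skewed data (the `sales` series is made up and only stands in for a sales column): a positive skewness and a mean above the median both point to a long right tail.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for a sales column (not the BigMart data).
rng = np.random.default_rng(0)
sales = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=5000), name="sales")

skewness = sales.skew()   # > 0 means a longer right tail
mean_value = sales.mean()
median_value = sales.median()

print(f"skewness: {skewness:.2f}")
print(f"mean: {mean_value:.1f}, median: {median_value:.1f}")
```

On the real data, the same call would be `a['Item_Outlet_Sales'].skew()`.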


Let's have a peek into the data.

In [None]:
a.head()

* The data has NaN values.

Checking for missing data.


Sometimes missing data is written as NaN, and sometimes it is simply replaced with 0.
For some features 0 does not make sense, so there it is safe to assume that a 0 marks missing data.

In [None]:
a.isnull().sum()

In [None]:
b.isnull().sum()

* The training data has missing values in Item_Weight and Outlet_Size, which are continuous and categorical features respectively. Hence, the treatment will be different for each.
* The testing data also has null values in the same features.
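Raw counts are easier to judge next to the share of rows they represent. Below is a small sketch on an invented frame (the column names mirror the notebook, the values are made up) that reports both the count and the percentage of missing values per column.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame; column names mirror the notebook, values are invented.
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan, 8.9],
    "Outlet_Size": ["Medium", None, "Small", "High", None],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9],
})

missing_count = df.isnull().sum()
missing_share = df.isnull().mean() * 100  # percentage of rows missing per column

summary = pd.DataFrame({"count": missing_count, "percent": missing_share})
print(summary[summary["count"] > 0])
```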

Checking for 0 values.

In [None]:
a.eq(0).sum()

In [None]:
b.eq(0).sum()

Only the Item_Visibility feature contains 0 values. We can approach this in 2 ways:
* From the data description, we know that Item_Visibility is measured in %. So, 0% can mean that the item was not on display. We can then treat it as normal data and move ahead with the analysis.
* Or we can see it as a missing value. Since it belongs to a continuous feature, it can be treated in the same way as Item_Weight.

Here, I treat it as missing data.
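One way to make that choice explicit is to recode the zeros as NaN, so that every later step sees a single kind of "missing". This is only a sketch on invented values, not a step the notebook itself performs.

```python
import numpy as np
import pandas as pd

# Invented values; only the column name comes from the notebook.
df = pd.DataFrame({"Item_Visibility": [0.016, 0.0, 0.019, 0.0, 0.054]})

n_zero = int(df["Item_Visibility"].eq(0).sum())

# Recode 0 as NaN so that zeros and NaN are handled the same way afterwards.
df["Item_Visibility"] = df["Item_Visibility"].replace(0, np.nan)
n_nan = int(df["Item_Visibility"].isnull().sum())

print(n_zero, "zeros recoded,", n_nan, "NaN values now")
```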

Missing data Treatment

There are 3 types of missing data:
* Missing completely at random - The data is missing by error, and the missingness does not depend on any other feature or on the value itself.
* Missing at random - Here, the data is missing because of another feature and not because of the value itself. For example, women not disclosing their age: the missingness of age depends on sex, which is another feature.
* Not missing at random - The data is missing because of its own value. For example, people with a high salary not disclosing their salary.

The treatment for missing values depends on the category.
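The three mechanisms can be simulated to see why the category matters. The sketch below uses synthetic data with hypothetical `group` and `salary` columns (nothing here comes from BigMart): under MCAR the observed mean stays close to the true mean, while under the other two mechanisms it drifts.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20_000

# Synthetic survey data (hypothetical columns, not from BigMart).
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=n),
    "salary": rng.normal(50.0, 10.0, size=n),
})
# Make salary depend on group so that MAR has a visible effect.
df.loc[df["group"] == "B", "salary"] += 10.0

# MCAR: every value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.30

# MAR: missingness depends on another observed column (group), not on salary.
mar_mask = rng.random(n) < np.where(df["group"] == "B", 0.50, 0.10)

# MNAR: missingness depends on the hidden value itself (high salaries withheld).
mnar_mask = rng.random(n) < np.where(df["salary"] > 60.0, 0.60, 0.05)

true_mean = df["salary"].mean()
observed_means = {
    "MCAR": df.loc[~mcar_mask, "salary"].mean(),
    "MAR": df.loc[~mar_mask, "salary"].mean(),
    "MNAR": df.loc[~mnar_mask, "salary"].mean(),
}

print(f"true mean: {true_mean:.2f}")
for name, value in observed_means.items():
    print(f"{name}: observed mean {value:.2f}")
```

Simple mean imputation is only harmless in the MCAR case. In the MAR case, imputing within each group would be the more careful choice.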

In our data, values are missing in Item_Visibility, Outlet_Size and Item_Weight.
I will assume that these are missing completely at random.

I will impute them.

In [None]:
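from sklearn.impute import SimpleImputer

# Item_Weight: replace NaN with the training mean.
imp = SimpleImputer(missing_values = np.nan, strategy = 'mean')
a[['Item_Weight']] = imp.fit_transform(a[['Item_Weight']])
b[['Item_Weight']] = imp.transform(b[['Item_Weight']])

# Item_Visibility: treat 0 as missing and replace it with the training mean.
imp1 = SimpleImputer(missing_values = 0, strategy = 'mean')
a[['Item_Visibility']] = imp1.fit_transform(a[['Item_Visibility']])
b[['Item_Visibility']] = imp1.transform(b[['Item_Visibility']])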

For categorical missing values, I will replace them with the mode.

In [None]:
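# Assign the result back; inplace fillna on a selected column is unreliable in recent pandas.
a['Outlet_Size'] = a['Outlet_Size'].fillna(a['Outlet_Size'].mode()[0])
b['Outlet_Size'] = b['Outlet_Size'].fillna(b['Outlet_Size'].mode()[0])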

Correlation

Now, to check the relationship between our VIP and its followers, I will use a heatmap.

In [None]:
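# numeric_only skips the categorical columns, which newer pandas no longer drops silently.
sns.heatmap(a.corr(numeric_only = True), annot = True)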
plt.show()

Looks like our VIP is not liked that much. Sad!
Only Item_MRP has a correlation coefficient above 0.5 with the target variable.
But multicollinearity won't be a problem. ;)
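A multicollinearity check looks at the correlations among the features themselves rather than with the target. A rough sketch on synthetic predictors follows (`x1`, `x2`, `x3` and the 0.8 threshold are illustrative assumptions): it lists every feature pair whose absolute correlation exceeds the threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic predictors: x2 is almost a copy of x1, x3 is independent.
x1 = rng.normal(size=n)
features = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=n),
    "x3": rng.normal(size=n),
})

corr = features.corr().abs()

# Keep the upper triangle only so that each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.8]  # 0.8 is an arbitrary illustrative threshold

print(high_pairs)
```

An empty result on the real feature columns would support the claim that multicollinearity is not a concern.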


In [None]:
sns.scatterplot(x = 'Item_MRP', y = 'Item_Outlet_Sales', data = a)
plt.show()

Even though sales appear to increase with MRP, there are fewer sales at high MRP values. This trend seems natural: at a lower price, items tend to sell out quickly.

Let's peek into the big picture.

Relationship with Continuous Features.

In [None]:
sns.pairplot(a)
plt.show()

Wow, the data is not very linear.


The Outlet_Establishment_Year graph looks odd, let's look at the values.

In [None]:
a['Outlet_Establishment_Year'].unique()

Since there are only 9 values here, we can treat this feature as categorical data and encode or bin it.
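Both options can be sketched with plain pandas. The years below are illustrative values, and the bin edges and labels are arbitrary choices of mine, not something the data prescribes.

```python
import pandas as pd

# Illustrative years only; bin edges and labels are arbitrary choices.
years = pd.Series([1985, 1987, 1997, 1998, 1999, 2002, 2004, 2007, 2009],
                  name="Outlet_Establishment_Year")

# Option 1: treat each distinct year as its own category.
as_category = years.astype("category")

# Option 2: group the years into coarser, ordered bins.
# The intervals are (1980, 1990], (1990, 2000] and (2000, 2010].
binned = pd.cut(
    years,
    bins=[1980, 1990, 2000, 2010],
    labels=["1981-1990", "1991-2000", "2001-2010"],
)
bin_counts = binned.value_counts(sort=False)

print(as_category.cat.categories.tolist())
print(bin_counts)
```

Binning keeps the ordering of the years while reducing the number of levels, which can matter once the feature is one-hot encoded.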

Relationship of target variable with Categorical Features.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=3, figsize = (15,8))
fig.subplots_adjust(right=1)
fig.suptitle('Relationship with Categorical Features')
for ax, feature in zip(axes.flatten(),  categorical[0:3]):
    sns.stripplot(x = feature,  y = 'Item_Outlet_Sales', data = a, ax = ax)
plt.show()

Uh-oh, look at the first graph. Item_Fat_Content has errors. Low Fat, LF and low fat all belong to one category. We will need to correct this.

In [None]:
a.Item_Fat_Content = a.Item_Fat_Content.replace({'low fat' : 'Low Fat', 'LF' : 'Low Fat', 'reg' : 'Regular'})
b.Item_Fat_Content = b.Item_Fat_Content.replace({'low fat' : 'Low Fat', 'LF' : 'Low Fat', 'reg' : 'Regular'})
fig, axes = plt.subplots(nrows=1, ncols=3, figsize = (15,8))
fig.subplots_adjust(right=1)
fig.suptitle('Relationship with Categorical Data')
for ax, feature in zip(axes.flatten(),  categorical[0:3]):
    sns.stripplot(x = feature,  y = 'Item_Outlet_Sales', data = a, ax = ax)
plt.show()

Perfect. 
* Low Fat items sold a little more than Regular ones. Who can stay away from the good stuff, right?
* A small outlet size means congestion, maybe that's why sales went down.


In [None]:
fig, axes = plt.subplots(nrows = 2, ncols=1, figsize = (20,25))
fig.subplots_adjust(hspace=0.5)
fig.suptitle('Relationship with Categorical Data')
for ax, feature in zip(axes.flatten(),  categorical[3:]):
    sns.stripplot(x = 'Item_Outlet_Sales', y = feature, data = a, ax = ax)
plt.show()

Naturally, supermarkets are large establishments with more choice, quality and quantity, which increases sales. After all, they are SUPERmarkets ;)

Some more insights

In [None]:
fig, ax = plt.subplots(figsize = (10,10))
sns.boxplot(x = 'Outlet_Establishment_Year', y = 'Item_Outlet_Sales', hue = 'Outlet_Type', data = a, ax = ax)
plt.show()

* We can see that sales in grocery stores are low, even in earlier years.
* Many outliers are present. These are establishments with very high sales for that year.
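The points a box plot draws as outliers are, by the usual convention, values more than 1.5 times the interquartile range away from the quartiles. The same rule can be applied in code to list them. A minimal sketch on made-up sales figures (the `flag_outliers` helper is hypothetical, and the numbers are invented):

```python
import pandas as pd

# Made-up sales figures for two outlet types.
df = pd.DataFrame({
    "Outlet_Type": ["Grocery Store"] * 6 + ["Supermarket Type1"] * 6,
    "Item_Outlet_Sales": [100, 120, 130, 150, 160, 900,
                          1500, 1800, 2000, 2200, 2500, 9000],
})

def flag_outliers(s: pd.Series) -> pd.Series:
    """Return True where a value lies beyond 1.5 * IQR from the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Apply the rule within each outlet type, similar to a grouped box plot.
df["is_outlier"] = (
    df.groupby("Outlet_Type")["Item_Outlet_Sales"]
      .transform(flag_outliers)
      .astype(bool)
)

print(df[df["is_outlier"]])
```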

Treating our Categorical friends.

In [None]:
# Encoding Categorical
a = pd.get_dummies(a, drop_first = True)
b = pd.get_dummies(b, drop_first = True)

In [None]:
a.info()

* drop_first is used to avoid the dummy variable trap.
* Since Item_Type has many categories, we can use a hashing technique on it.
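Both points can be illustrated on a toy column. In the sketch below, the categories are invented stand-ins for Item_Type, and `n_features=8` is an arbitrary choice for scikit-learn's `FeatureHasher`. It is meant as an illustration of the idea rather than the encoding used in this notebook.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Invented categories standing in for a many-valued feature such as Item_Type.
items = pd.Series(["Dairy", "Soft Drinks", "Meat", "Dairy", "Snack Foods", "Meat"],
                  name="Item_Type")

# Full one-hot encoding: the columns of each row always sum to 1,
# so together with an intercept they are perfectly collinear.
full = pd.get_dummies(items)
reduced = pd.get_dummies(items, drop_first=True)  # one column fewer

# Hashing trick: map each category string to one of n_features columns.
# Collisions between categories become likelier as n_features shrinks.
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([[value] for value in items]).toarray()

print(full.shape, reduced.shape, hashed.shape)
```

The hashed width stays fixed no matter how many categories appear, which also sidesteps train and test sets ending up with different dummy columns.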


Checking for assumptions:


Normality of Errors

To check the normality of errors, we need a model and must fit it to the data.

In [None]:
# Splitting
Y = a['Item_Outlet_Sales']
X = a.drop('Item_Outlet_Sales', axis = 1)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 0, test_size = 0.25)
# Model
# Linear Regression
from sklearn.linear_model import LinearRegression
lg = LinearRegression()
lg.fit(X_train, Y_train)
Y_pred = lg.predict(X_test)
residue = Y_test - Y_pred
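# Residuals against predicted values; x and y must be keywords in recent seaborn (lowess needs statsmodels).
sns.regplot(x = Y_pred, y = residue, lowess = True, line_kws={'color': 'red'})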
plt.show()


The graph is funnel shaped. Hence, the errors show heteroscedasticity.

To correct this, I will log transform the Y variable and try again.
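Why a log transform can help is easiest to see on synthetic data with multiplicative noise, where the spread of the target grows with its level. The sketch below fits a straight line with `np.polyfit` and compares the residual spread for high and low fitted values. The data and the `spread_ratio` helper are made up, and the log requires a strictly positive target.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic data with multiplicative noise: the spread of y grows with its level.
x = rng.uniform(1.0, 10.0, size=n)
y = np.exp(1.0 + 0.3 * x + rng.normal(scale=0.5, size=n))

def spread_ratio(x, target):
    """Fit a straight line, then compare residual spread for high vs. low fitted values."""
    slope, intercept = np.polyfit(x, target, deg=1)
    fitted = intercept + slope * x
    residuals = target - fitted
    high = fitted > np.median(fitted)
    return residuals[high].std() / residuals[~high].std()

ratio_raw = spread_ratio(x, y)          # well above 1 indicates a funnel shape
ratio_log = spread_ratio(x, np.log(y))  # near 1 indicates roughly constant spread

print(f"raw target: {ratio_raw:.2f}, log target: {ratio_log:.2f}")
```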

In [None]:
Y_train = np.log(Y_train)
Y_test = np.log(Y_test)

In [None]:
from sklearn.linear_model import LinearRegression
lg = LinearRegression()
lg.fit(X_train, Y_train)
Y_pred = lg.predict(X_test)
residue = Y_test - Y_pred
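# Residuals against predicted values; x and y must be keywords in recent seaborn (lowess needs statsmodels).
sns.regplot(x = Y_pred, y = residue, lowess = True, line_kws={'color': 'red'})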
plt.show()


# Regression

I will use SVM regression (SVR).

In [None]:
# SVM
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X_train, Y_train)
Y_pred2 = regressor.predict(X_test)


Calculating RMSE

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt
rms = sqrt(mean_squared_error(Y_test, Y_pred2))
from sklearn.metrics import r2_score
r2 = r2_score(Y_test, Y_pred2)
print('RMSE = ',rms, ' R2 score = ',r2)

Feature Selection Using Lasso

In [None]:
from sklearn.linear_model import LassoCV
model_lasso = LassoCV(alphas = [1, 0.1, 0.001, 0.0005])
model_lasso.fit(X_train, Y_train)
coef = pd.Series(model_lasso.coef_, index = X_train.columns)
imp_features = coef.index[coef!=0].tolist()

imp_features


In [None]:
# Copy the selections so that later column assignments do not act on a view.
X_train = X_train[imp_features].copy()
X_test = X_test[imp_features].copy()



Feature Engineering

I will bin the Outlet_Establishment_Year feature. 

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
disc = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
# Fit the bin edges on the training data only, then reuse them for the test data.
X_train['Outlet_Establishment_Year'] = disc.fit_transform(X_train[['Outlet_Establishment_Year']]).ravel()
X_test['Outlet_Establishment_Year'] = disc.transform(X_test[['Outlet_Establishment_Year']]).ravel()


In [None]:
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X_train, Y_train)
Y_pred3 = regressor.predict(X_test)
from sklearn.metrics import mean_squared_error
from math import sqrt
rms = sqrt(mean_squared_error(Y_test, Y_pred3))
from sklearn.metrics import r2_score
r2 = r2_score(Y_test, Y_pred3)
print('RMSE = ',rms, ' R2 score = ',r2)

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train.iloc[:,0:2] = sc_X.fit_transform(X_train.iloc[:,0:2])
X_test.iloc[:,0:2] = sc_X.transform(X_test.iloc[:,0:2])


Model after feature scaling

In [None]:
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X_train, Y_train)
Y_pred3 = regressor.predict(X_test)
from sklearn.metrics import mean_squared_error
from math import sqrt
rms = sqrt(mean_squared_error(Y_test, Y_pred3))
from sklearn.metrics import r2_score
r2 = r2_score(Y_test, Y_pred3)
print('RMSE = ',rms, ' R2 score = ',r2)

After feature scaling, RMSE decreased and R squared increased.
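One thing to keep in mind when reading these numbers: the target was log transformed, so the RMSE is in log units rather than sales units. Predictions can be transformed back with `np.exp` before scoring. A small sketch with invented numbers and a hypothetical `rmse` helper:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equally sized arrays."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical sales and log-scale predictions (invented numbers).
sales_true = np.array([500.0, 1200.0, 2500.0, 4000.0])
log_pred = np.log(np.array([550.0, 1000.0, 2700.0, 3500.0]))

rmse_log = rmse(np.log(sales_true), log_pred)     # error in log units
rmse_sales = rmse(sales_true, np.exp(log_pred))   # error in sales units

print(f"RMSE on log scale: {rmse_log:.3f}")
print(f"RMSE on original scale: {rmse_sales:.1f}")
```

In the notebook's terms, that would mean comparing `np.exp(Y_test)` with `np.exp(Y_pred3)`.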