# Part-1

Objectives:
1. To explore the dataset

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input/bigmart-sales-data'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
test_data = pd.read_csv('/kaggle/input/bigmart-sales-data/Test.csv')
train_data = pd.read_csv('/kaggle/input/bigmart-sales-data/Train.csv')

We can see that for every unique item there is a unique ID.

In [None]:
print(train_data.shape)
print(test_data.shape)



In [None]:
train_data.head()

**BigMart Sales Prediction practice problem**

We have a train set (8523 rows) and a test set (5681 rows). The train set has both input and output variable(s). We need to predict the sales for the test set.

* Item_Identifier: Unique product ID

* Item_Weight: Weight of product

* Item_Fat_Content: Whether the product is low fat or not

* Item_Visibility: The % of total display area of all products in a store allocated to the particular product

* Item_Type: The category to which the product belongs

* Item_MRP: Maximum Retail Price (list price) of the product

* Outlet_Identifier: Unique store ID

* Outlet_Establishment_Year: The year in which store was established

* Outlet_Size: The size of the store in terms of ground area covered

* Outlet_Location_Type: The type of city in which the store is located

* Outlet_Type: Whether the outlet is just a grocery store or some sort of supermarket

* Item_Outlet_Sales: Sales of the product in the particular store. This is the outcome variable to be predicted.

In [None]:
train_data.info()

In [None]:
train_data.isnull().sum()

In [None]:
test_data.columns

In [None]:
test_data.isnull().sum()

In [None]:
train_data.select_dtypes(include='object').nunique()

# Up to this point we can conclude that:
1. **Our data consists of two groups of columns, one for the item and one for the outlet, and each contains two types of data: categorical and numerical.**
2. **There are missing values in Item_Weight, which is a float, and in Outlet_Size, which is categorical. The same is true of the test data.**
3. **The object columns hold categorical values that need to be encoded.**
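
The checks above (dtypes, missing counts, distinct values) can also be collected into a single table. The following is a minimal sketch on a hypothetical toy frame; `toy` and the helper name `summarize_columns` are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that mimics the kinds of columns described above.
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "Small", "Medium"],
    "Item_Type": ["Dairy", "Meat", "Dairy", "Canned"],
})

def summarize_columns(df):
    """Return dtype, missing count, and distinct-value count for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "n_unique": df.nunique(),  # nunique ignores missing values by default
    })

summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(train_data)` and `summarize_columns(test_data)` would give the same information as the separate `info()`, `isnull().sum()`, and `nunique()` cells in one view.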

# Part-2

**Objectives**
1. Visualize each column and check the correlation between them.
2. Fill the null or missing values.
3. One-hot encode all categorical data.


In [None]:
train_data.head()

Let's visualize some of the data and draw conclusions.

In [None]:
import seaborn as sns
sns.barplot(x='Item_Fat_Content',y='Item_Outlet_Sales',data=train_data)

In [None]:
sns.set(rc={'figure.figsize':(20,8)})
chart = sns.barplot(x='Item_Type',y='Item_Weight',data=train_data)
chart.set_xticklabels(chart.get_xticklabels(), rotation=45, horizontalalignment='right')

This plot is important because of what we can infer from it:
On average, weight does not vary much across item types. For example, Canned and Dairy are both close to 12, and no item type has a noticeably different maximum.
Hence we fill the missing weights with the overall mean value.
But think of a situation where the bar for Meat reaches about 6, Canned about 3, and Seafood about 12. How would we fill the null values then?
In that case we would have to fill the null values by Item_Type, using the mean weight of the corresponding item type.
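
The per-type alternative can be sketched as follows. This is only an illustration on a hypothetical toy frame using the made-up weights from the scenario above; the notebook itself uses the overall mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data: weights differ strongly between item types.
toy = pd.DataFrame({
    "Item_Type": ["Meat", "Meat", "Meat", "Canned", "Canned", "Seafood", "Seafood"],
    "Item_Weight": [5.0, 7.0, np.nan, 3.0, np.nan, 12.0, np.nan],
})

# Mean weight of each Item_Type, broadcast back to every row of that type.
group_means = toy.groupby("Item_Type")["Item_Weight"].transform("mean")

# Fill each missing weight with the mean of its own item type.
toy["Item_Weight_filled"] = toy["Item_Weight"].fillna(group_means)
print(toy)
```

With `transform("mean")` the group means keep the original row index, so `fillna` can line them up row by row without a separate merge.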




In [None]:
sns.set(rc={'figure.figsize':(10,8)})
sns.barplot(x='Outlet_Type',y='Item_Outlet_Sales',data=train_data)

We can infer from this that Grocery Store outlets have poor Item_Outlet_Sales compared with the other outlet types.

In [None]:
train_data['Item_Weight'].describe()

Now we merge the train and test data for data engineering, so that both go through the same cleaning steps.

In [None]:
train_data['source'] = 'train'
test_data['source'] = 'test'
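# Placeholder target for the test rows so that both frames share the same columns
test_data['Item_Outlet_Sales'] = 0
# ignore_index=True gives the combined frame a unique row index
data = pd.concat([train_data, test_data], sort=False, ignore_index=True)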

Handling missing (NA) values

In [None]:
# Fill missing Item_Weight values with the overall mean weight
mean = data['Item_Weight'].mean()
data['Item_Weight'] = data['Item_Weight'].fillna(value=mean)

In [None]:
data.isnull().sum()

In [None]:
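# distplot is deprecated in recent seaborn versions; histplot with kde=True is the replacement.
# Plot the train rows only, because the test rows hold the placeholder value 0.
sns.histplot(train_data['Item_Outlet_Sales'], kde=True)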

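Item_Outlet_Sales appears to be quite right-skewed, so before applying it to the model we might need some preprocessing such as a standard scaler operation. Keep in mind that standard scaling changes only the location and scale of the values, not the shape of the distribution; reducing the skew itself would need a transform such as a logarithm.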
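
To illustrate the effect of preprocessing on a right-skewed variable, the sketch below compares skewness for raw values, standardized values, and log-transformed values. The data is synthetic (a log-normal sample chosen only for illustration), so the numbers say nothing about the actual Item_Outlet_Sales column.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample: many moderate values and a long right tail.
rng = np.random.default_rng(0)
sales = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=2000))

# Skewness of the raw values.
skew_raw = sales.skew()

# Standardizing (subtract mean, divide by standard deviation) keeps the shape.
standardized = (sales - sales.mean()) / sales.std()
skew_std = standardized.skew()

# log1p compresses the long right tail.
skew_log = np.log1p(sales).skew()

print("raw:", skew_raw, "standardized:", skew_std, "log1p:", skew_log)
```

If a log transform is applied to the target before training, predictions have to be mapped back with `np.expm1` before they are reported as sales.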

In [None]:
sns.set(rc={'figure.figsize':(10,8)})
sns.barplot(x='Outlet_Size',y='Item_Outlet_Sales',data=train_data)

In [None]:
sns.countplot(x='Outlet_Size', data=data)

We fill the missing Outlet_Size values with the mode of Outlet_Size for the corresponding Outlet_Type.

In [None]:
# Determine the most frequent Outlet_Size for each Outlet_Type (missing values are ignored).
# This assumes every Outlet_Type has at least one known Outlet_Size.
outlet_size_mode = data.groupby('Outlet_Type')['Outlet_Size'].agg(lambda x: x.mode().iloc[0])
print('Mode for each Outlet_Type:')
print(outlet_size_mode)

# Boolean mask marking the rows where Outlet_Size is missing
missing_values = data['Outlet_Size'].isnull()

# Impute, printing the number of missing values before and after to confirm
print('\nOriginal #missing: %d' % missing_values.sum())
data.loc[missing_values, 'Outlet_Size'] = data.loc[missing_values, 'Outlet_Type'].map(outlet_size_mode).to_numpy()
print('Remaining #missing: %d' % data['Outlet_Size'].isnull().sum())

In [None]:
data.head()

In [None]:
data['Item_Fat_Content'].unique()

Here the same category appears under different names, so we merge them into one.

In [None]:
# Merge the labels into 'Low Fat' and 'Regular'
data['Item_Fat_Content'] = data['Item_Fat_Content'].replace({'LF':'Low Fat','low fat':'Low Fat','reg':'Regular'})


In [None]:
# An Item_Visibility of 0 does not make sense: with zero visibility the product's outlet sales should be zero, but they are not.
# Hence we replace the zero values.
sns.scatterplot(x='Item_Visibility', y='Item_Outlet_Sales', data=data)


In [None]:
# Replace the zero values with the mean visibility.
mean = data['Item_Visibility'].mean()
data=data.replace({'Item_Visibility': {0.0: mean}})

In [None]:
data['Item_Type'].unique()

In [None]:
data.head()

Finally, we one-hot encode the categorical values.

In [None]:
dummies = pd.get_dummies(data[['Item_Fat_Content','Item_Type','Outlet_Size','Outlet_Location_Type','Outlet_Type']])
dummies

In [None]:
data_processed = pd.concat([data,dummies],axis=1)

In [None]:
data_processed.head()

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Encode Outlet_Identifier as an integer label in a new Outlet column
data_processed['Outlet'] = le.fit_transform(data['Outlet_Identifier'])

In [None]:
data_processed.drop(['Item_Fat_Content','Item_Type','Outlet_Identifier','Outlet_Size','Outlet_Location_Type','Outlet_Type'],inplace=True,axis=1)

In [None]:
data_processed.head()

In [None]:
# Outlet age in years, measured relative to 2009 as the reference year
data_processed['Outlet_Year'] = 2009 - data_processed['Outlet_Establishment_Year']

In [None]:
data_processed.drop(['Outlet_Establishment_Year'],axis=1,inplace=True)

In [None]:
# Use .copy() so the drops below act on independent frames, not on slices of data_processed
train = data_processed.loc[data['source']=="train"].copy()
test = data_processed.loc[data['source']=="test"].copy()

# Drop unnecessary columns:
test.drop(['Item_Outlet_Sales','source'],axis=1,inplace=True)
train.drop(['source'],axis=1,inplace=True)


In [None]:
train.head()

In [None]:
print(train.shape)
print(test.shape)

Here we complete Part 2.
**What we conclude from this**
* Our data is cleaned, and all the categorical data is encoded in a form the model can understand.
* We still have to apply a standard scaler, but we should be aware that standard scaling does not always work well on dummy variables. For the reasons, go through the following link.

https://www.quora.com/How-bad-is-it-to-standardize-dummy-variables
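
One way to act on this caution is to standardize only the continuous columns and leave the dummy columns as 0/1. The sketch below does this on a hypothetical toy frame; the helper `standardize_columns` and the choice of columns are assumptions for illustration. It uses the population standard deviation (`ddof=0`), which matches the default behaviour of scikit-learn's `StandardScaler`.

```python
import pandas as pd

# Hypothetical frame with two continuous columns and two dummy columns.
toy = pd.DataFrame({
    "Item_MRP": [100.0, 150.0, 200.0, 250.0],
    "Item_Weight": [5.0, 10.0, 15.0, 20.0],
    "Outlet_Size_Small": [1, 0, 0, 1],
    "Outlet_Size_Medium": [0, 1, 1, 0],
})
numeric_cols = ["Item_MRP", "Item_Weight"]

def standardize_columns(df, cols):
    """Return a copy of df with only the listed columns scaled to mean 0, std 1."""
    out = df.copy()
    means = out[cols].mean()
    stds = out[cols].std(ddof=0)  # population std
    out[cols] = (out[cols] - means) / stds
    return out

scaled = standardize_columns(toy, numeric_cols)
print(scaled)
```

In a real train/test setup the means and standard deviations would be computed on the train rows only and then reused for the test rows, so that no information from the test set leaks into the scaling.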
