## **About the Notebook**
### This is a basic introduction to machine learning using the Big Mart Sales dataset.

## **1-1 Problem Feature**
### The data provided is the BigMart sales data of 10 outlets with 8523 records in the training dataset and 5681 records in the test dataset.
### The dataset contains the product as well as the outlet details for each product.

## **1-2 Variables in the Dataset**
### The dataset contains 12 columns:
### * Item_Identifier - unique product id
### * Item_Weight - The weight of the product 
### * Item_Fat_Content - The fat content of the product. A categorical variable showing whether it is low fat or regular
### * Item_Visibility - The percentage of total display area allocated to a product in an outlet
### * Item_Type - A categorical variable providing information on the type of the product
### * Item_MRP - The price of the product. A continuous Numerical data type.
### * Outlet_Identifier - Unique outlet id
### * Outlet_Establishment_Year - The establishment year of the outlet.
### * Outlet_Size - The size of the outlet. A categorical variable showing whether it is a High, Medium or Small outlet.
### * Outlet_Location_Type -  Tier wise classification of city where the outlet is located.
### * Outlet_Type - Type of outlet, i.e. whether it is a supermarket or a grocery store
### * Item_Outlet_Sales - The dependent / target variable showing the sales of products in an outlet.

## **1-3 Aim**
### The aim is to train a model on the training dataset and predict Item_Outlet_Sales for the records in the test dataset.
### Using this model, BigMart will try to understand the properties of products and outlets which play a key role in increasing sales.

## Loading the required packages.

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns



## Importing the BigMart sales dataset

In [None]:
import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
mart_train = pd.read_csv('../input/Train.csv')
mart_test  = pd.read_csv('../input/Test.csv')

In [None]:
mart_train.head()

In [None]:
print(mart_train.shape, mart_test.shape)

In [None]:
# The shape of the training and test dataset shows that there are 8523 records in training and 5681 records in test dataset.

In [None]:
mart_train.info()

In [None]:
mart_test.info()

# Observe that there are a few null values in both the training and test datasets. It is therefore convenient to merge the two datasets and clean them together. After the cleaning and EDA we can split the result back into train and test datasets.

In [None]:
# Tag each row with its origin so the train and test records can be told apart after merging.
mart_train['Source'] = 'Train'
mart_test['Source'] = 'Test'

full_data = pd.concat([mart_train, mart_test], ignore_index=True)

print(mart_train.shape, mart_test.shape, full_data.shape)

In [None]:
print(full_data.isna().sum())    # Getting the count of missing values in the full_data.

# Observe that the count of 5681 missing values in Item_Outlet_Sales is expected: it is the target variable, which the test dataset does not contain. The missing values we have to handle are those in Item_Weight and Outlet_Size.
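Once those gaps are filled, the `Source` column is what lets the combined frame be separated again. The sketch below shows one way to do that on a small made-up frame; `demo_full`, `train_clean`, and `test_clean` are hypothetical names used only for illustration, and the values are not taken from the BigMart data.

```python
import pandas as pd

# Toy stand-in for the merged frame; the values are invented for illustration.
demo_full = pd.DataFrame({
    "Item_Identifier": ["A", "B", "C", "D"],
    "Item_Outlet_Sales": [100.0, 250.0, None, None],
    "Source": ["Train", "Train", "Test", "Test"],
})

# Select the training rows by their origin label, then drop the helper column.
train_clean = demo_full[demo_full["Source"] == "Train"].drop(columns="Source")

# Test rows never had a target, so drop that column along with the helper.
test_clean = (
    demo_full[demo_full["Source"] == "Test"]
    .drop(columns=["Source", "Item_Outlet_Sales"])
    .reset_index(drop=True)
)

print(train_clean.shape, test_clean.shape)
```

Resetting the index on the test part is optional; it only restores a 0-based index like the one the original test file had.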

In [None]:
full_data.describe()

## A few observations from the above results:
## 1. The minimum value of Item_Visibility is 0, which is not plausible: if a product is being sold in a store, it must have some visibility.
## 2. The establishment year of the outlets ranges from 1985 to 2009.


In [None]:
full_data.apply(lambda x : len(x.unique()))

In [None]:
# From the above result it's evident that there are 1559 products in total and 16 distinct types of items.
# Also, there are 10 outlets.


In [None]:
full_data.Item_Fat_Content.value_counts()

In [None]:
# Observe that "Low Fat", "LF" and "low fat" are different labels for the same category, and likewise "Regular" and "Reg". "LF" and "low fat" need to be updated to "Low Fat", and "Reg" to "Regular".
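One possible way to merge these labels is a dictionary passed to `Series.replace`. This is a minimal sketch on a made-up series, not the notebook's own cleaning step; `demo_fat` and `label_map` are hypothetical names, and the exact spellings (for example "reg" versus "Reg") are an assumption that should be checked against the `value_counts()` output above.

```python
import pandas as pd

# Made-up sample containing the label variants mentioned above.
demo_fat = pd.Series(
    ["Low Fat", "LF", "low fat", "Regular", "reg"],
    name="Item_Fat_Content",
)

# Map every variant spelling to its canonical label.
# Both "reg" and "Reg" are listed because the exact casing is an assumption.
label_map = {
    "LF": "Low Fat",
    "low fat": "Low Fat",
    "reg": "Regular",
    "Reg": "Regular",
}

# Values not present in the mapping are left unchanged.
cleaned = demo_fat.replace(label_map)
print(cleaned.value_counts())
```

On the real data the same mapping would be applied to `full_data['Item_Fat_Content']` and the result assigned back to that column.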

In [None]:
full_data.Outlet_Location_Type.value_counts()

In [None]:
# Updating the null values of Item_Weight field with the mean
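# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the DataFrame.
full_data['Item_Weight'] = full_data['Item_Weight'].fillna(full_data['Item_Weight'].mean())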

In [None]:
# Check null values in Item_Weight
full_data['Item_Weight'].isna().sum()

In [None]:
# Updating the null values of Outlet_size with the mode
# mode() returns a Series, so take its first entry as the fill value.
full_data['Outlet_Size'] = full_data['Outlet_Size'].fillna(full_data['Outlet_Size'].mode()[0])

In [None]:
# Check the null values in Outlet_Size
full_data['Outlet_Size'].isna().sum()

In [None]:
# Mean Item_Outlet_Sales per outlet type (pivot_table aggregates with the mean by default).
full_data.pivot_table(values='Item_Outlet_Sales', index='Outlet_Type')

In [None]:
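# Replace zero visibility with the overall mean (note: this mean is computed with the zeros still included).
full_data.loc[full_data['Item_Visibility']==0, 'Item_Visibility'] = full_data['Item_Visibility'].mean()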

In [None]:
print((full_data['Item_Visibility']==0).sum())
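The replacement above fills every zero with a single overall mean. As an alternative, and not the approach this notebook takes, zeros could be treated as missing and filled with the mean visibility of the same product, falling back to the overall non-zero mean when a product has no valid value. The sketch below uses a made-up frame; `demo_vis` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample: product C has no non-zero visibility at all.
demo_vis = pd.DataFrame({
    "Item_Identifier": ["A", "A", "A", "B", "B", "C"],
    "Item_Visibility": [0.10, 0.0, 0.20, 0.05, 0.0, 0.0],
})

# Treat zero visibility as missing so it does not distort the means.
vis = demo_vis["Item_Visibility"].replace(0, np.nan)

# Mean of the observed (non-missing) values for each product,
# broadcast back to one value per row.
item_mean = vis.groupby(demo_vis["Item_Identifier"]).transform("mean")

# Fill with the per-product mean first, then with the overall non-zero mean
# for products that have no observed value.
demo_vis["Item_Visibility"] = vis.fillna(item_mean).fillna(vis.mean())
print(demo_vis)
```

Whether the per-product fill is preferable depends on how much visibility actually varies between outlets for the same product, which this sketch does not check.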