# Libraries import

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 

# Data import

In [33]:
test = pd.read_csv(r"C:\Users\rafal\Documents\python\Big-Mart-Sales\Test-Set.csv")
train = pd.read_csv(r"C:\Users\rafal\Documents\python\Big-Mart-Sales\Train-Set.csv")

## Exploratory Analysis

### Context

The data scientists at Big Mart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and predict the sales of each product at a particular outlet.

Using this model, Big Mart will try to understand the properties of products and outlets which play a key role in increasing sales.

Note that the data may contain missing values, because some stores may not have reported everything due to technical glitches. These values will need to be handled before modelling.
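A minimal sketch of how missing values could be counted per column, assuming the usual pandas approach with `isna()`. The toy frame `preview` is a hypothetical name; it reuses three columns from the first five test rows printed later in this notebook rather than the full dataset.

```python
import numpy as np
import pandas as pd

# Toy frame built from the first five test rows shown in this notebook
# (only three columns kept); `preview` is a hypothetical name.
preview = pd.DataFrame({
    "ProductID": ["FDW58", "FDW14", "NCN55", "FDQ58", "FDY38"],
    "Weight": [20.75, 8.3, 14.6, 7.315, np.nan],
    "OutletSize": ["Medium", None, None, None, "Medium"],
})

# Count missing values per column, then express them as a share of rows.
missing_counts = preview.isna().sum()
missing_share = preview.isna().mean()

print(missing_counts)
print(missing_share)
```

On the real data the same two calls would be applied to `train` and `test` directly.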

### Variable description

As the dataframes below show, each subset has 11 columns, and the train set has one additional column with outlet sales.

- **ProductID** : unique product ID
- **Weight** : weight of products
- **FatContent** : specifies whether the product is low on fat or not
- **ProductVisibility** : percentage of total display area of all products in a store allocated to the particular product
- **ProductType** : the category to which the product belongs
- **MRP** : Maximum Retail Price (listed price) of the products
- **OutletID** : unique store ID
- **EstablishmentYear** : year of establishment of the outlets
- **OutletSize** : the size of the store in terms of ground area covered
- **LocationType** : the type of city in which the store is located
- **OutletType** : specifies whether the outlet is just a grocery store or some sort of supermarket
- **OutletSales** : (target variable) sales of the product in the particular store
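The claim that the two subsets differ only by the target column can be checked by comparing column headers. The sketch below is illustrative: it uses empty frames with the column names as they appear in the dataframes, and the names `feature_cols`, `train_cols` and `test_cols` are hypothetical stand-ins for the real `train` and `test`.

```python
import pandas as pd

# Column names as they appear in the dataframes; empty frames are
# enough to compare headers. All three names here are hypothetical.
feature_cols = [
    "ProductID", "Weight", "FatContent", "ProductVisibility",
    "ProductType", "MRP", "OutletID", "EstablishmentYear",
    "OutletSize", "LocationType", "OutletType",
]
train_cols = pd.DataFrame(columns=feature_cols + ["OutletSales"])
test_cols = pd.DataFrame(columns=feature_cols)

# Columns present in one subset but absent from the other
only_in_train = train_cols.columns.difference(test_cols.columns)
only_in_test = test_cols.columns.difference(train_cols.columns)

print(list(only_in_train))
print(list(only_in_test))
```

With the real data, `train.columns.difference(test.columns)` would be expected to return only `OutletSales`.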

In [34]:
test.head()

Unnamed: 0,ProductID,Weight,FatContent,ProductVisibility,ProductType,MRP,OutletID,EstablishmentYear,OutletSize,LocationType,OutletType
0,FDW58,20.75,Low Fat,0.007565,Snack Foods,107.8622,OUT049,1999,Medium,Tier 1,Supermarket Type1
1,FDW14,8.3,reg,0.038428,Dairy,87.3198,OUT017,2007,,Tier 2,Supermarket Type1
2,NCN55,14.6,Low Fat,0.099575,Others,241.7538,OUT010,1998,,Tier 3,Grocery Store
3,FDQ58,7.315,Low Fat,0.015388,Snack Foods,155.034,OUT017,2007,,Tier 2,Supermarket Type1
4,FDY38,,Regular,0.118599,Dairy,234.23,OUT027,1985,Medium,Tier 3,Supermarket Type3


In [35]:
test.shape

(5681, 11)

In [36]:
train.head()

Unnamed: 0,ProductID,Weight,FatContent,ProductVisibility,ProductType,MRP,OutletID,EstablishmentYear,OutletSize,LocationType,OutletType,OutletSales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [37]:
train.shape

(8523, 12)

In [38]:
# share of rows that belong to the training set
train.shape[0] / (test.shape[0] + train.shape[0])

0.6000422416220783

The training set accounts for about 60% of all rows and the test set for the remaining 40%.

In [39]:
train.dtypes

ProductID             object
Weight               float64
FatContent            object
ProductVisibility    float64
ProductType           object
MRP                  float64
OutletID              object
EstablishmentYear      int64
OutletSize            object
LocationType          object
OutletType            object
OutletSales          float64
dtype: object

Seven columns are stored as "object" although they actually hold categorical values (ProductID, FatContent, ProductType, OutletID, OutletSize, LocationType and OutletType). In the next step these columns are therefore converted to the "category" type.

In [40]:
# Columns that hold categorical values but were read in as "object"
cat_cols = ["ProductID", "FatContent", "ProductType", "OutletID", "OutletSize", "LocationType", "OutletType"]
train[cat_cols] = train[cat_cols].astype("category")
test[cat_cols] = test[cat_cols].astype("category")
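As an alternative to typing the column names by hand, the object-typed columns could be selected programmatically. This is only a sketch on a hypothetical toy frame `df`, not the method used in this notebook; it assumes that every "object" column should become a category, which holds for the columns listed above.

```python
import pandas as pd

# Hypothetical toy frame; in the notebook this would be `train` or `test`.
# The explicit object dtype mirrors what the dtypes output shows.
df = pd.DataFrame({
    "FatContent": pd.Series(["Low Fat", "Regular", "Low Fat"], dtype=object),
    "OutletSize": pd.Series(["Medium", None, "High"], dtype=object),
    "MRP": [249.8092, 48.2692, 141.618],
})

# Pick every object-typed column instead of listing the names manually.
object_cols = df.select_dtypes(include="object").columns

# Convert only those columns; numeric columns are left untouched.
df = df.astype({col: "category" for col in object_cols})

print(df.dtypes)
```

The trade-off is that this converts every "object" column, so free-text columns, if a dataset had any, would need to be excluded first.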

In [41]:
train.dtypes

ProductID            category
Weight                float64
FatContent           category
ProductVisibility     float64
ProductType          category
MRP                   float64
OutletID             category
EstablishmentYear       int64
OutletSize           category
LocationType         category
OutletType           category
OutletSales           float64
dtype: object
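After the conversion it can be useful to list the levels of a categorical column. In the previews above, FatContent appears as `Low Fat`, `reg` and `Regular`, so the sketch below uses the five FatContent values from the test preview. The names `fat`, `levels` and `counts` are hypothetical, and whether such labels should be merged is not decided here.

```python
import pandas as pd

# FatContent values taken from the first five rows of the test preview.
fat = pd.Series(
    ["Low Fat", "reg", "Low Fat", "Low Fat", "Regular"],
    dtype="category",
)

# List the distinct levels and how often each one occurs.
levels = list(fat.cat.categories)
counts = fat.value_counts()

print(levels)
print(counts)
```

On the real data the equivalent calls would be `train["FatContent"].cat.categories` and `train["FatContent"].value_counts()`.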