## Sales Prediction for Big Mart Outlets
- The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive model and predict the sales of each product at a particular outlet.


- Using this model, BigMart will try to understand the properties of products and outlets which play a key role in increasing sales.


- Please note that the data may have missing values, as some stores might not report all the data due to technical glitches. These values will need to be treated accordingly.

### Data Dictionary
- We have a train data set (8523 rows) and a test data set (5681 rows). The train data set has both input and output variable(s); you need to predict the sales for the test data set.



- **Train file:** CSV containing the item outlet information with sales value

|Variable	                     |Description                                                                                 |
|---------------------           |------------------------------------------------------------------------------------------- | 
|Item_Identifier	             |Unique product ID                                                                           |
|Item_Weight	                 |Weight of product                                                                           |
|Item_Fat_Content	             |Whether the product is low fat or not                                                       |
|Item_Visibility	             |The % of total display area of all products in a store allocated to the particular product  |
|Item_Type	                     |The category to which the product belongs                                                   |
|Item_MRP	                     |Maximum Retail Price (list price) of the product                                            |
|Outlet_Identifier	             |Unique store ID                                                                             |
|Outlet_Establishment_Year	     |The year in which store was established                                                     |
|Outlet_Size	                 |The size of the store in terms of ground area covered                                       |
|Outlet_Location_Type	         |The type of city in which the store is located                                              |
|Outlet_Type	                 |Whether the outlet is just a grocery store or some sort of supermarket                      |
|Item_Outlet_Sales	             |Sales of the product in the particular store. This is the outcome variable to be predicted  |

- **Test file:** CSV containing item outlet combinations for which sales need to be forecasted

|Variable	                     |Description                                                                                 |
|------------------------------  |------------------------------------------------------------------------------------------  |
|Item_Identifier	             |Unique product ID                                                                           |
|Item_Weight	                 |Weight of product                                                                           |
|Item_Fat_Content	             |Whether the product is low fat or not                                                       |
|Item_Visibility	             |The % of total display area of all products in a store allocated to the particular product  |
|Item_Type	                     |The category to which the product belongs                                                   |
|Item_MRP	                     |Maximum Retail Price (list price) of the product                                            |
|Outlet_Identifier	             |Unique store ID                                                                             |
|Outlet_Establishment_Year	     |The year in which store was established                                                     |
|Outlet_Size	                 |The size of the store in terms of ground area covered                                       |
|Outlet_Location_Type	         |The type of city in which the store is located                                              |
|Outlet_Type	                 |Whether the outlet is just a grocery store or some sort of supermarket                      |


- **Submission file format**

|Variable	                     |Description                                                                                 |
|----------------------------    |------------------------------------------------------------------------------------------- |
|Item_Identifier	             |Unique product ID                                                                           |
|Outlet_Identifier	             |Unique store ID                                                                             |
|Item_Outlet_Sales	             |Sales of the product in the particular store. This is the outcome variable to be predicted  |
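
A minimal sketch of assembling a file in this format with pandas. The identifiers and predictions below are placeholders for illustration, and clipping negative predictions to zero is an assumption (sales cannot be negative), not a rule stated by the competition.

```python
import pandas as pd

# Placeholder identifiers and predictions, for illustration only.
test_ids = pd.DataFrame({
    "Item_Identifier": ["FDW58", "FDW14", "NCN55"],
    "Outlet_Identifier": ["OUT049", "OUT017", "OUT010"],
})
predictions = [1500.0, -20.0, 640.5]  # stand-in for model output

submission = test_ids.copy()
# Assumption: clip negative predictions to zero before submitting.
submission["Item_Outlet_Sales"] = pd.Series(predictions).clip(lower=0).to_numpy()

# Serialise the three required columns without the DataFrame index.
csv_text = submission.to_csv(index=False)
```

Passing a file path as the first argument of `to_csv` writes the same content to disk instead of returning a string.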


### Evaluation Metric
- Your model's performance will be evaluated on your sales predictions for the test data (test.csv), which contains the same kind of data points as train, minus the sales to be predicted. Your submission needs to follow the format shown in the sample submission.


- We hold the actual sales for the test dataset, against which your predictions will be evaluated. We will use the Root Mean Square Error (RMSE) to judge your response.
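
For reference, Root Mean Square Error is the square root of the mean squared difference between actual and predicted sales. A minimal sketch with NumPy; the function name `rmse` is an illustrative choice and the numbers are made up.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: sqrt(mean((y_true - y_pred) ** 2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values: the errors are 10, -10 and 30.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
score = rmse(actual, predicted)
```

Because errors are squared before averaging, a few large misses raise the score more than many small ones.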

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

sns.set_style("darkgrid")

In [2]:
train = pd.read_csv("train_v9rqX0R.csv")
test = pd.read_csv("test_AbJTz2l.csv")
submission = pd.read_csv("sample_submission_8RXa3c6.csv")

In [3]:
display(train.head())
display(test.head())
display(submission.head())

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,FDW58,20.75,Low Fat,0.007565,Snack Foods,107.8622,OUT049,1999,Medium,Tier 1,Supermarket Type1
1,FDW14,8.3,reg,0.038428,Dairy,87.3198,OUT017,2007,,Tier 2,Supermarket Type1
2,NCN55,14.6,Low Fat,0.099575,Others,241.7538,OUT010,1998,,Tier 3,Grocery Store
3,FDQ58,7.315,Low Fat,0.015388,Snack Foods,155.034,OUT017,2007,,Tier 2,Supermarket Type1
4,FDY38,,Regular,0.118599,Dairy,234.23,OUT027,1985,Medium,Tier 3,Supermarket Type3


Unnamed: 0,Item_Identifier,Outlet_Identifier,Item_Outlet_Sales
0,FDW58,OUT049,1000
1,FDW14,OUT017,1000
2,NCN55,OUT010,1000
3,FDQ58,OUT017,1000
4,FDY38,OUT027,1000


In [4]:
display(train.shape)
display(test.shape)

(8523, 12)

(5681, 11)

In [5]:
train_original = train.copy()
test_original = test.copy()

### Missing Values

In [6]:
display(train.isnull().sum())
display(test.isnull().sum())

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

Item_Identifier                 0
Item_Weight                   976
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  1606
Outlet_Location_Type            0
Outlet_Type                     0
dtype: int64

In [7]:
train['Item_Weight'].median()

12.6

In [8]:
test['Item_Weight'].median()

12.5

In [9]:
train['Item_Weight'] = train['Item_Weight'].fillna(12.6)
test['Item_Weight'] = test['Item_Weight'].fillna(12.5)

In [10]:
train['Outlet_Size'] = train['Outlet_Size'].fillna(train['Outlet_Size'].mode()[0])
test['Outlet_Size'] = test['Outlet_Size'].fillna(test['Outlet_Size'].mode()[0])
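
The cells above hard-code the medians and fill the test set from its own statistics. An alternative sketch, shown on a made-up toy frame, computes the fill values from the training data only and reuses them for the test data, so preprocessing is identical at prediction time. This is a common convention, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real train/test frames (values are made up).
toy_train = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 12.0],
    "Outlet_Size": ["Medium", None, "Medium", "Small"],
})
toy_test = pd.DataFrame({
    "Item_Weight": [np.nan, 8.3],
    "Outlet_Size": [None, "High"],
})

# Learn the fill values from the training data only.
fill_values = {
    "Item_Weight": toy_train["Item_Weight"].median(),
    "Outlet_Size": toy_train["Outlet_Size"].mode()[0],
}

# Apply the same per-column values to both frames.
toy_train = toy_train.fillna(fill_values)
toy_test = toy_test.fillna(fill_values)
```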

In [11]:
display(train.isnull().sum())
display(test.isnull().sum())

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
dtype: int64

In [12]:
train.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1, inplace=True)
test.drop(['Item_Identifier', 'Outlet_Identifier'], axis=1, inplace=True)

## EDA (Exploratory Data Analysis)

### 1. Item_Fat_Content

In [13]:
display(train['Item_Fat_Content'].value_counts())
display(test['Item_Fat_Content'].value_counts())

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

Low Fat    3396
Regular    1935
LF          206
reg          78
low fat      66
Name: Item_Fat_Content, dtype: int64

In [14]:
train['Item_Fat_Content'] = train['Item_Fat_Content'].str.replace('LF', 'Low Fat')
test['Item_Fat_Content'] = test['Item_Fat_Content'].str.replace('LF', 'Low Fat')

In [15]:
train['Item_Fat_Content'] = train['Item_Fat_Content'].str.replace('low fat', 'Low Fat')
test['Item_Fat_Content'] = test['Item_Fat_Content'].str.replace('low fat', 'Low Fat')

In [16]:
train['Item_Fat_Content'] = train['Item_Fat_Content'].str.replace('reg', 'Regular')
test['Item_Fat_Content'] = test['Item_Fat_Content'].str.replace('reg', 'Regular')

In [17]:
display(train['Item_Fat_Content'].value_counts())
display(test['Item_Fat_Content'].value_counts())

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

Low Fat    3668
Regular    2013
Name: Item_Fat_Content, dtype: int64
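
The three `str.replace` cells could also be collapsed into one exact-match replacement. A sketch on a made-up Series: `Series.replace` with a dict substitutes whole values, so it cannot accidentally rewrite a substring inside another label.

```python
import pandas as pd

fat = pd.Series(["Low Fat", "LF", "low fat", "reg", "Regular"])

# Map each variant spelling to its canonical label (whole-value match).
canonical = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
fat_clean = fat.replace(canonical)

counts = fat_clean.value_counts()
```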

In [18]:
train['Item_Fat_Content'] = train['Item_Fat_Content'].map({'Low Fat':0, "Regular":1})
test['Item_Fat_Content'] = test['Item_Fat_Content'].map({'Low Fat':0, "Regular":1})

In [19]:
display(train['Item_Fat_Content'].value_counts())
display(test['Item_Fat_Content'].value_counts())

0    5517
1    3006
Name: Item_Fat_Content, dtype: int64

0    3668
1    2013
Name: Item_Fat_Content, dtype: int64

### 1. Train Data

### Numerical Columns

In [20]:
numeric_cols_train = train.select_dtypes(include=[np.number])
display(numeric_cols_train.head())
print('\n')
numeric_cols_train.columns

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
0,9.3,0,0.016047,249.8092,1999,3735.138
1,5.92,1,0.019278,48.2692,2009,443.4228
2,17.5,0,0.01676,141.618,1999,2097.27
3,19.2,1,0.0,182.095,1998,732.38
4,8.93,0,0.0,53.8614,1987,994.7052






Index(['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_MRP',
       'Outlet_Establishment_Year', 'Item_Outlet_Sales'],
      dtype='object')

In [21]:
numeric_cols_train.shape

(8523, 6)

### Categorical Columns

In [22]:
categorical_cols_train = train.select_dtypes(include=['object'])  # np.object was removed in NumPy 1.24
display(categorical_cols_train.head())
print('\n')
categorical_cols_train.columns

Unnamed: 0,Item_Type,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,Dairy,Medium,Tier 1,Supermarket Type1
1,Soft Drinks,Medium,Tier 3,Supermarket Type2
2,Meat,Medium,Tier 1,Supermarket Type1
3,Fruits and Vegetables,Medium,Tier 3,Grocery Store
4,Household,High,Tier 3,Supermarket Type1






Index(['Item_Type', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'], dtype='object')

In [23]:
categorical_cols_train.shape

(8523, 4)

### 2. Test Data

### Numerical

In [24]:
numeric_cols_test = test.select_dtypes(include=[np.number])
display(numeric_cols_test.head())
print('\n')
numeric_cols_test.columns

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Establishment_Year
0,20.75,0,0.007565,107.8622,1999
1,8.3,1,0.038428,87.3198,2007
2,14.6,0,0.099575,241.7538,1998
3,7.315,0,0.015388,155.034,2007
4,12.5,1,0.118599,234.23,1985






Index(['Item_Weight', 'Item_Fat_Content', 'Item_Visibility', 'Item_MRP',
       'Outlet_Establishment_Year'],
      dtype='object')

In [25]:
numeric_cols_test.shape

(5681, 5)

### Categorical Columns

In [26]:
categorical_cols_test = test.select_dtypes(include=['object'])  # np.object was removed in NumPy 1.24
display(categorical_cols_test.head())
print('\n')
categorical_cols_test.columns

Unnamed: 0,Item_Type,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,Snack Foods,Medium,Tier 1,Supermarket Type1
1,Dairy,Medium,Tier 2,Supermarket Type1
2,Others,Medium,Tier 3,Grocery Store
3,Snack Foods,Medium,Tier 2,Supermarket Type1
4,Dairy,Medium,Tier 3,Supermarket Type3






Index(['Item_Type', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'], dtype='object')

In [27]:
categorical_cols_test.shape

(5681, 4)

------------

<h1 style="color:blue" align="left"> 1. Get Dummies All Features </h1>

In [29]:
categorical_cols_train_label = pd.get_dummies(categorical_cols_train, drop_first=True)
categorical_cols_test_label = pd.get_dummies(categorical_cols_test, drop_first=True)

In [30]:
display(categorical_cols_train_label.head())
display(categorical_cols_test_label.head())

Unnamed: 0,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,Item_Type_Dairy,Item_Type_Frozen Foods,Item_Type_Fruits and Vegetables,Item_Type_Hard Drinks,Item_Type_Health and Hygiene,Item_Type_Household,Item_Type_Meat,...,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Size_Medium,Outlet_Size_Small,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
0,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,1,0,1,0
2,0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,0,0,0,1,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,1,1,0,0


Unnamed: 0,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,Item_Type_Dairy,Item_Type_Frozen Foods,Item_Type_Fruits and Vegetables,Item_Type_Hard Drinks,Item_Type_Health and Hygiene,Item_Type_Household,Item_Type_Meat,...,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Size_Medium,Outlet_Size_Small,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,1,0,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,1,0,1,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,1,0,1,0,0
4,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,1
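
Encoding train and test separately yields matching columns only when every category appears in both files, which appears to be the case here. A defensive sketch, on made-up toy frames, reindexes the test dummies to the training columns so the two matrices always line up.

```python
import pandas as pd

toy_train = pd.DataFrame({"Item_Type": ["Dairy", "Meat", "Seafood"]})
toy_test = pd.DataFrame({"Item_Type": ["Meat", "Dairy"]})  # no "Seafood"

train_dummies = pd.get_dummies(toy_train, drop_first=True)
test_dummies = pd.get_dummies(toy_test, drop_first=True)

# Force the test frame to have exactly the training columns, in order;
# categories absent from test become all-zero columns.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```

A model fitted on `train_dummies` can then be applied to `test_dummies` without a feature-count mismatch.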


In [31]:
train.drop(['Item_Type', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'], axis=1, inplace=True)
test.drop(['Item_Type', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'], axis=1, inplace=True)

In [32]:
display(train.head())
display(test.head())

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
0,9.3,0,0.016047,249.8092,1999,3735.138
1,5.92,1,0.019278,48.2692,2009,443.4228
2,17.5,0,0.01676,141.618,1999,2097.27
3,19.2,1,0.0,182.095,1998,732.38
4,8.93,0,0.0,53.8614,1987,994.7052


Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Establishment_Year
0,20.75,0,0.007565,107.8622,1999
1,8.3,1,0.038428,87.3198,2007
2,14.6,0,0.099575,241.7538,1998
3,7.315,0,0.015388,155.034,2007
4,12.5,1,0.118599,234.23,1985


In [33]:
train_final = pd.concat([train, categorical_cols_train_label], axis=1)
test_final = pd.concat([test, categorical_cols_test_label], axis=1)

In [34]:
display(train_final.head())
display(test_final.head())

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,Item_Type_Dairy,...,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Size_Medium,Outlet_Size_Small,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
0,9.3,0,0.016047,249.8092,1999,3735.138,0,0,0,1,...,0,0,0,1,0,0,0,1,0,0
1,5.92,1,0.019278,48.2692,2009,443.4228,0,0,0,0,...,0,1,0,1,0,0,1,0,1,0
2,17.5,0,0.01676,141.618,1999,2097.27,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
3,19.2,1,0.0,182.095,1998,732.38,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
4,8.93,0,0.0,53.8614,1987,994.7052,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0


Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,Item_Type_Dairy,Item_Type_Frozen Foods,...,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Size_Medium,Outlet_Size_Small,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
0,20.75,0,0.007565,107.8622,1999,0,0,0,0,0,...,1,0,0,1,0,0,0,1,0,0
1,8.3,1,0.038428,87.3198,2007,0,0,0,1,0,...,0,0,0,1,0,1,0,1,0,0
2,14.6,0,0.099575,241.7538,1998,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
3,7.315,0,0.015388,155.034,2007,0,0,0,0,0,...,1,0,0,1,0,1,0,1,0,0
4,12.5,1,0.118599,234.23,1985,0,0,0,1,0,...,0,0,0,1,0,0,1,0,0,1


-----------------

<h1 style="color:blue" align="left"> 2. Get Dummies_Individual </h1>

### 2. Item_Type

In [28]:
display(train['Item_Type'].value_counts())
display(test['Item_Type'].value_counts())

Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64

Snack Foods              789
Fruits and Vegetables    781
Household                638
Frozen Foods             570
Dairy                    454
Baking Goods             438
Canned                   435
Health and Hygiene       338
Meat                     311
Soft Drinks              281
Breads                   165
Hard Drinks              148
Starchy Foods            121
Others                   111
Breakfast                 76
Seafood                   25
Name: Item_Type, dtype: int64

In [29]:
display(train['Item_Type'].unique())
display(test['Item_Type'].unique())

array(['Dairy', 'Soft Drinks', 'Meat', 'Fruits and Vegetables',
       'Household', 'Baking Goods', 'Snack Foods', 'Frozen Foods',
       'Breakfast', 'Health and Hygiene', 'Hard Drinks', 'Canned',
       'Breads', 'Starchy Foods', 'Others', 'Seafood'], dtype=object)

array(['Snack Foods', 'Dairy', 'Others', 'Fruits and Vegetables',
       'Baking Goods', 'Health and Hygiene', 'Breads', 'Hard Drinks',
       'Seafood', 'Soft Drinks', 'Household', 'Frozen Foods', 'Meat',
       'Canned', 'Starchy Foods', 'Breakfast'], dtype=object)

In [30]:
display(train['Item_Type'].nunique())
display(test['Item_Type'].nunique())

16

16

In [31]:
# Define the mapping once and reuse it for both frames.
# Note: code 10 is unused; 'Starchy Foods' maps to 15.
item_type_map = {'Fruits and Vegetables':0, 'Household':1, 'Breakfast':2, 'Soft Drinks':3,
                 'Hard Drinks':4, 'Frozen Foods':5, 'Meat':6, 'Breads':7, 'Seafood':8,
                 'Dairy':9, 'Baking Goods':11, 'Canned':12, 'Snack Foods':13,
                 'Health and Hygiene':14, 'Starchy Foods':15, 'Others':16}

train['Item_Type'] = train['Item_Type'].map(item_type_map)
test['Item_Type'] = test['Item_Type'].map(item_type_map)
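
Hand-written mappings are easy to get wrong: a repeated key in a dict literal silently overwrites the earlier entry. A sketch of deriving the integer codes from the training categories instead, on made-up data. The codes follow alphabetical order here, so they would differ from the hand-assigned numbers above.

```python
import pandas as pd

toy_train = pd.Series(["Dairy", "Meat", "Seafood", "Dairy"])
toy_test = pd.Series(["Seafood", "Dairy"])

# Build the mapping once from the training data, then reuse it for test.
categories = sorted(toy_train.unique())
code_map = {name: code for code, name in enumerate(categories)}

train_codes = toy_train.map(code_map)
test_codes = toy_test.map(code_map)  # categories unseen in train would become NaN
```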

### Outlet_Size

In [32]:
display(train['Outlet_Size'].value_counts())
display(test['Outlet_Size'].value_counts())

Medium    5203
Small     2388
High       932
Name: Outlet_Size, dtype: int64

Medium    3468
Small     1592
High       621
Name: Outlet_Size, dtype: int64

In [33]:
train['Outlet_Size'] = train['Outlet_Size'].map({'Small':0, 'Medium':1, "High":2})
test['Outlet_Size'] = test['Outlet_Size'].map({'Small':0, 'Medium':1, "High":2})

In [34]:
display(train['Outlet_Size'].value_counts())
display(test['Outlet_Size'].value_counts())

1    5203
0    2388
2     932
Name: Outlet_Size, dtype: int64

1    3468
0    1592
2     621
Name: Outlet_Size, dtype: int64

### Outlet_Location_Type

In [35]:
display(train['Outlet_Location_Type'].value_counts())
display(test['Outlet_Location_Type'].value_counts())

Tier 3    3350
Tier 2    2785
Tier 1    2388
Name: Outlet_Location_Type, dtype: int64

Tier 3    2233
Tier 2    1856
Tier 1    1592
Name: Outlet_Location_Type, dtype: int64

In [36]:
train['Outlet_Location_Type'] = train['Outlet_Location_Type'].map({'Tier 1':0, 'Tier 2':1, 'Tier 3':2})
test['Outlet_Location_Type'] = test['Outlet_Location_Type'].map({'Tier 1':0, 'Tier 2':1, 'Tier 3':2})

In [37]:
display(train['Outlet_Location_Type'].value_counts())
display(test['Outlet_Location_Type'].value_counts())

2    3350
1    2785
0    2388
Name: Outlet_Location_Type, dtype: int64

2    2233
1    1856
0    1592
Name: Outlet_Location_Type, dtype: int64

### Outlet_Type

In [38]:
display(train['Outlet_Type'].value_counts())
display(test['Outlet_Type'].value_counts())

Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64

Supermarket Type1    3717
Grocery Store         722
Supermarket Type3     624
Supermarket Type2     618
Name: Outlet_Type, dtype: int64

In [39]:
train['Outlet_Type'] = train['Outlet_Type'].map({'Grocery Store':0, 'Supermarket Type1':1, 'Supermarket Type2':2, 'Supermarket Type3':3})
test['Outlet_Type'] = test['Outlet_Type'].map({'Grocery Store':0, 'Supermarket Type1':1, 'Supermarket Type2':2, 'Supermarket Type3':3})

In [40]:
display(train['Outlet_Type'].value_counts())
display(test['Outlet_Type'].value_counts())

1    5577
0    1083
3     935
2     928
Name: Outlet_Type, dtype: int64

1    3717
0     722
3     624
2     618
Name: Outlet_Type, dtype: int64

In [41]:
display(train.head())
display(test.head())

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,9.3,0,0.016047,9,249.8092,1999,1,0,1,3735.138
1,5.92,1,0.019278,3,48.2692,2009,1,2,2,443.4228
2,17.5,0,0.01676,6,141.618,1999,1,0,1,2097.27
3,19.2,1,0.0,0,182.095,1998,1,2,0,732.38
4,8.93,0,0.0,1,53.8614,1987,2,2,1,994.7052


Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,20.75,0,0.007565,13,107.8622,1999,1,0,1
1,8.3,1,0.038428,9,87.3198,2007,1,1,1
2,14.6,0,0.099575,16,241.7538,1998,1,2,0
3,7.315,0,0.015388,13,155.034,2007,1,1,1
4,12.5,1,0.118599,9,234.23,1985,1,2,3


In [42]:
display(train.shape)
display(test.shape)

(8523, 10)

(5681, 9)

In [43]:
display(train.isnull().sum())
display(test.isnull().sum())

Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
dtype: int64

-------------------

In [45]:
X = train.iloc[:,:-1]
y = train.iloc[:,-1]

In [46]:
from sklearn.model_selection import train_test_split
import xgboost
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import accuracy_score



In [47]:
# Split the data into training and hold-out sets in an 80:20 ratio (test_size=0.2).
# random_state fixes the random seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=4)

In [48]:
# shape of X & Y test / train
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(6818, 9) (1705, 9) (6818,) (1705,)


## 1. Random Forest Regression

In [49]:
from sklearn.ensemble import RandomForestRegressor
rfc = RandomForestRegressor(n_estimators=100)
rfc.fit(X_train, y_train)

RandomForestRegressor()

In [50]:
y_pred_rfc = rfc.predict(X_test)

In [51]:
print("Train Score {:.2f} & Test Score {:.2f}".format(rfc.score(X_train,y_train), rfc.score(X_test,y_test)))

Train Score 0.94 & Test Score 0.55


In [53]:
from sklearn import metrics

In [54]:
# RMSE on the training split (X_train, y_train), not on the hold-out split
print("RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(y_train, rfc.predict(X_train))))

RMSE : 428.3
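
The figure above is computed on the training split, where a random forest fits very closely. A sketch of reporting RMSE on both splits so the gap is visible. `report_rmse` and `MeanModel` are hypothetical names used for illustration; any fitted estimator with a `predict` method (such as `rfc`) could be passed instead.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between two equally long sequences."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def report_rmse(model, X_train, y_train, X_test, y_test):
    """Return (train_rmse, test_rmse) for an already fitted model."""
    return (rmse(y_train, model.predict(X_train)),
            rmse(y_test, model.predict(X_test)))

class MeanModel:
    """Stand-in estimator that always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

# Made-up data: the stand-in model predicts 2.5 everywhere.
X_tr, y_tr = np.zeros((4, 1)), np.array([1.0, 2.0, 3.0, 4.0])
X_te, y_te = np.zeros((2, 1)), np.array([2.5, 6.5])
train_rmse, test_rmse = report_rmse(MeanModel().fit(X_tr, y_tr), X_tr, y_tr, X_te, y_te)
```

The hold-out value is the one comparable to a score on unseen data; the training value mainly shows how closely the model fits what it has already seen.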


### Random

In [104]:
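# max_features=1.0 considers all features at each split, which is what the
# removed 'auto' setting meant for regressors.
rfc1 = RandomForestRegressor(n_estimators=2500, min_samples_split=50,
                            min_samples_leaf=1, max_features=1.0, max_depth=60)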
rfc1.fit(X_train, y_train)

RandomForestRegressor(max_depth=60, min_samples_split=50, n_estimators=2500)

In [105]:
y_pred_rfc1 = rfc1.predict(X_test)

In [106]:
print("Train Score {:.2f} & Test Score {:.2f}".format(rfc1.score(X_train,y_train), rfc1.score(X_test,y_test)))

Train Score 0.69 & Test Score 0.58


In [107]:
print("RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(y_train, rfc1.predict(X_train))))

RMSE : 953


### RMSE : 1161.7

## 2. XGBOOST

In [64]:
reg_xgb = xgboost.XGBRegressor()
reg_xgb.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [65]:
y_pred_xgb = reg_xgb.predict(X_test)

In [66]:
print("Train Score {:.2f} & Test Score {:.2f}".format(reg_xgb.score(X_train,y_train),reg_xgb.score(X_test,y_test)))

Train Score 0.86 & Test Score 0.53


In [67]:
print("RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(y_train, reg_xgb.predict(X_train))))

RMSE : 651.3


-------------

In [68]:
reg_xgb1 = xgboost.XGBRegressor(base_score=0.25, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=2, min_child_weight=1, missing=None, n_estimators=900,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
reg_xgb1.fit(X_train, y_train)

Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




XGBRegressor(base_score=0.25, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.1, max_delta_step=0, max_depth=2,
             min_child_weight=1, missing=None, monotone_constraints='()',
             n_estimators=900, n_jobs=1, nthread=1, num_parallel_tree=1,
             objective='reg:linear', random_state=0, reg_alpha=0, reg_lambda=1,
             scale_pos_weight=1, seed=0, silent=True, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [69]:
y_pred_xgb1 = reg_xgb1.predict(X_test)

In [70]:
print("Train Score {:.2f} & Test Score {:.2f}".format(reg_xgb1.score(X_train,y_train), reg_xgb1.score(X_test,y_test)))

Train Score 0.68 & Test Score 0.57


In [71]:
print("RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(y_train, reg_xgb1.predict(X_train))))

RMSE : 975


### RMSE : 1171.6

## 3. LGBM

In [74]:
lgbm_model = LGBMRegressor()
lgbm_model.fit(X_train, y_train)

LGBMRegressor()

In [75]:
y_pred_lgbm = lgbm_model.predict(X_test)

In [76]:
print("Train Score {:.2f} & Test Score {:.2f}".format(lgbm_model.score(X_train,y_train),lgbm_model.score(X_test,y_test)))

Train Score 0.72 & Test Score 0.57


In [77]:
print("RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(y_train, lgbm_model.predict(X_train))))

RMSE : 906.8


### RMSE : 1160

## 4. CatBoost Regressor

In [80]:
from catboost import CatBoostRegressor
CAT = CatBoostRegressor()
CAT.fit(X_train, y_train)

Learning rate set to 0.054928
0:	learn: 1667.6583845	total: 890ms	remaining: 14m 49s
1:	learn: 1618.6869120	total: 902ms	remaining: 7m 29s
2:	learn: 1573.4355230	total: 1s	remaining: 5m 33s
3:	learn: 1533.4991538	total: 1.1s	remaining: 4m 33s
4:	learn: 1494.5781181	total: 1.1s	remaining: 3m 39s
5:	learn: 1458.7517277	total: 1.15s	remaining: 3m 11s
6:	learn: 1425.0438464	total: 1.16s	remaining: 2m 44s
7:	learn: 1394.1701032	total: 1.19s	remaining: 2m 27s
8:	learn: 1366.5425310	total: 1.25s	remaining: 2m 17s
9:	learn: 1340.4955670	total: 1.31s	remaining: 2m 9s
10:	learn: 1316.9957251	total: 1.36s	remaining: 2m 2s
11:	learn: 1296.4712956	total: 1.36s	remaining: 1m 52s
12:	learn: 1277.7382309	total: 1.41s	remaining: 1m 47s
13:	learn: 1259.3946487	total: 1.42s	remaining: 1m 39s
14:	learn: 1242.7866425	total: 1.44s	remaining: 1m 34s
15:	learn: 1227.7145356	total: 1.49s	remaining: 1m 31s
16:	learn: 1214.3384701	total: 1.5s	remaining: 1m 26s
17:	learn: 1202.4795275	total: 1.54s	remaining: 1m 2

178:	learn: 1027.7373244	total: 2.95s	remaining: 13.6s
179:	learn: 1027.3691176	total: 2.96s	remaining: 13.5s
180:	learn: 1027.1410083	total: 2.96s	remaining: 13.4s
181:	learn: 1026.8565006	total: 2.97s	remaining: 13.3s
182:	learn: 1026.3507158	total: 2.97s	remaining: 13.3s
183:	learn: 1026.1120380	total: 2.98s	remaining: 13.2s
184:	learn: 1025.8629600	total: 2.99s	remaining: 13.2s
185:	learn: 1025.6232827	total: 2.99s	remaining: 13.1s
186:	learn: 1024.9800224	total: 3s	remaining: 13s
187:	learn: 1024.6867961	total: 3s	remaining: 13s
188:	learn: 1024.4508274	total: 3.01s	remaining: 12.9s
189:	learn: 1024.2235332	total: 3.01s	remaining: 12.8s
190:	learn: 1023.7283104	total: 3.02s	remaining: 12.8s
191:	learn: 1023.4822797	total: 3.02s	remaining: 12.7s
192:	learn: 1022.9986546	total: 3.02s	remaining: 12.6s
193:	learn: 1022.3692303	total: 3.03s	remaining: 12.6s
194:	learn: 1021.9714560	total: 3.03s	remaining: 12.5s
195:	learn: 1021.6443701	total: 3.04s	remaining: 12.5s
...
334:	learn: 976.1449971	total: 3.8s	remaining: 7.55s
335:	learn: 975.6633920	total: 3.81s	remaining: 7.53s
336:	learn: 975.5221748	total: 3.81s	remaining: 7.5s
337:	learn: 975.1857560	total: 3.82s	remaining: 7.48s
338:	learn: 974.9097019	total: 3.82s	remaining: 7.46s
339:	learn: 974.6996635	total: 3.83s	remaining: 7.43s
340:	learn: 974.4834781	total: 3.83s	remaining: 7.41s
341:	learn: 974.2127315	total: 3.84s	remaining: 7.38s
342:	learn: 974.0250193	total: 3.84s	remaining: 7.36s
343:	learn: 973.7658902	total: 3.85s	remaining: 7.34s
344:	learn: 973.4933575	total: 3.85s	remaining: 7.32s
345:	learn: 973.1684959	total: 3.86s	remaining: 7.29s
346:	learn: 972.8934963	total: 3.86s	remaining: 7.27s
347:	learn: 972.6030666	total: 3.87s	remaining: 7.24s
348:	learn: 972.3739585	total: 3.87s	remaining: 7.22s
349:	learn: 972.1291657	total: 3.87s	remaining: 7.2s
350:	learn: 971.9349140	total: 3.88s	remaining: 7.17s
351:	learn: 971.7119197	total: 3.88s	remaining: 7.15s
...
501:	learn: 934.7172951	total: 4.61s	remaining: 4.57s
502:	learn: 934.4504723	total: 4.62s	remaining: 4.57s
503:	learn: 934.1852353	total: 4.63s	remaining: 4.55s
504:	learn: 933.9863779	total: 4.63s	remaining: 4.54s
505:	learn: 933.7287277	total: 4.64s	remaining: 4.53s
506:	learn: 933.5359104	total: 4.64s	remaining: 4.51s
507:	learn: 933.1945412	total: 4.65s	remaining: 4.5s
508:	learn: 932.8628903	total: 4.65s	remaining: 4.49s
509:	learn: 932.7346560	total: 4.66s	remaining: 4.47s
510:	learn: 932.3747780	total: 4.66s	remaining: 4.46s
511:	learn: 932.1440064	total: 4.67s	remaining: 4.45s
512:	learn: 931.8691705	total: 4.67s	remaining: 4.43s
513:	learn: 931.5698989	total: 4.67s	remaining: 4.42s
514:	learn: 931.2509053	total: 4.68s	remaining: 4.41s
515:	learn: 931.0751605	total: 4.68s	remaining: 4.39s
516:	learn: 930.8673071	total: 4.69s	remaining: 4.38s
517:	learn: 930.6788033	total: 4.69s	remaining: 4.37s
518:	learn: 930.4464881	total: 4.7s	remaining: 4.35s
...
671:	learn: 898.6087222	total: 5.42s	remaining: 2.65s
672:	learn: 898.3770502	total: 5.43s	remaining: 2.64s
673:	learn: 898.1141256	total: 5.43s	remaining: 2.63s
674:	learn: 897.9040816	total: 5.44s	remaining: 2.62s
675:	learn: 897.6833957	total: 5.44s	remaining: 2.61s
676:	learn: 897.5894507	total: 5.45s	remaining: 2.6s
677:	learn: 897.4529542	total: 5.45s	remaining: 2.59s
678:	learn: 897.2564492	total: 5.46s	remaining: 2.58s
679:	learn: 897.0310524	total: 5.46s	remaining: 2.57s
680:	learn: 896.8806527	total: 5.47s	remaining: 2.56s
681:	learn: 896.6686498	total: 5.47s	remaining: 2.55s
682:	learn: 896.4569562	total: 5.48s	remaining: 2.54s
683:	learn: 896.3057302	total: 5.48s	remaining: 2.53s
684:	learn: 896.1223503	total: 5.49s	remaining: 2.52s
685:	learn: 895.9530837	total: 5.49s	remaining: 2.51s
686:	learn: 895.7865083	total: 5.5s	remaining: 2.5s
687:	learn: 895.5376414	total: 5.5s	remaining: 2.5s
688:	learn: 895.3775512	total: 5.51s	remaining: 2.49s
...
849:	learn: 866.4095760	total: 6.26s	remaining: 1.1s
850:	learn: 866.2268088	total: 6.26s	remaining: 1.1s
851:	learn: 866.0133796	total: 6.27s	remaining: 1.09s
852:	learn: 865.8799442	total: 6.27s	remaining: 1.08s
853:	learn: 865.6400727	total: 6.28s	remaining: 1.07s
854:	learn: 865.4596401	total: 6.28s	remaining: 1.06s
855:	learn: 865.2825780	total: 6.29s	remaining: 1.06s
856:	learn: 865.1467583	total: 6.29s	remaining: 1.05s
857:	learn: 864.8778873	total: 6.29s	remaining: 1.04s
858:	learn: 864.7734009	total: 6.3s	remaining: 1.03s
859:	learn: 864.7105049	total: 6.31s	remaining: 1.03s
860:	learn: 864.5165438	total: 6.31s	remaining: 1.02s
861:	learn: 864.3535818	total: 6.32s	remaining: 1.01s
862:	learn: 864.2249944	total: 6.32s	remaining: 1s
863:	learn: 864.0491004	total: 6.33s	remaining: 996ms
864:	learn: 863.8549496	total: 6.33s	remaining: 988ms
865:	learn: 863.7277146	total: 6.34s	remaining: 981ms
866:	learn: 863.5288509	total: 6.34s	remaining: 973ms
...

<catboost.core.CatBoostRegressor at 0x2231c95e2e8>

In [81]:
y_pred_CAT = CAT.predict(X_test)

In [82]:
print("Train Score {:.2f} & Test Score {:.2f}".format(CAT.score(X_train,y_train), CAT.score(X_test,y_test)))

Train Score 0.76 & Test Score 0.57
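
For a regressor, `score` returns the coefficient of determination R², so the figures above mean the model explains about 76% of the variance in the training target but only about 57% on the held-out split. As a reminder of what that number measures, here is a small sketch that computes R² by hand; the function name `r2_score_manual` and the toy values are made up for illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Toy target values (hypothetical).
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])

# Perfect predictions give R^2 = 1; always predicting the mean gives R^2 = 0.
perfect = r2_score_manual(y_true_demo, y_true_demo)
baseline = r2_score_manual(y_true_demo, np.full(4, y_true_demo.mean()))
print(perfect, baseline)
```

A gap as large as 0.76 versus 0.57 usually points to overfitting, which is why the held-out figure is the one to compare across models.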


In [83]:
# RMSE on the training data (y_train), not on the held-out split
print("RMSE : %.4g" % np.sqrt(metrics.mean_squared_error(y_train, CAT.predict(X_train))))

RMSE : 841.7
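
The cell above evaluates RMSE on the same rows the model was trained on, so it is an optimistic figure. A small helper makes it easy to report the same metric for any split; the sketch below uses only NumPy, and the helper name `rmse` and the toy numbers are assumptions for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy check: errors of 3 and 4 give sqrt((9 + 16) / 2) = sqrt(12.5).
toy_rmse = rmse([3, 5], [0, 1])
print("toy RMSE: %.4f" % toy_rmse)
```

In this notebook the held-out counterpart would be `rmse(y_test, CAT.predict(X_test))`, using the same `X_test` and `y_test` that the score cell already references.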


### RMSE : 1178.4

### Submission

In [108]:
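# Note: these predictions come from rfc1 (defined in an earlier cell), not from the CatBoost model CAT
y_pred_test = rfc1.predict(test)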

In [109]:
submission = pd.DataFrame({'Item_Identifier': test_original['Item_Identifier'],
                           'Outlet_Identifier':test_original['Outlet_Identifier'],
                           'Item_Outlet_Sales': y_pred_test})
submission.to_csv('Bigmart3.csv', index=False)
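
Before uploading, it can help to sanity-check the submission frame. The sketch below is one possible check, not part of the original workflow: the helper `check_submission`, the assumption that the three columns must appear in exactly this order, and the made-up identifiers in the demo frames are all hypothetical.

```python
import pandas as pd

# Column names taken from the submission cell above; the required order is an assumption.
EXPECTED_COLUMNS = ["Item_Identifier", "Outlet_Identifier", "Item_Outlet_Sales"]

def check_submission(df, expected_rows=None):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if list(df.columns) != EXPECTED_COLUMNS:
        problems.append("unexpected columns: %s" % list(df.columns))
    if df.isnull().any().any():
        problems.append("missing values present")
    if "Item_Outlet_Sales" in df.columns and (df["Item_Outlet_Sales"] < 0).any():
        problems.append("negative sales predictions")
    if expected_rows is not None and len(df) != expected_rows:
        problems.append("expected %d rows, found %d" % (expected_rows, len(df)))
    return problems

# Made-up rows for demonstration only.
demo_ok = pd.DataFrame({
    "Item_Identifier": ["ITEM_A", "ITEM_B"],
    "Outlet_Identifier": ["OUTLET_1", "OUTLET_2"],
    "Item_Outlet_Sales": [1520.5, 310.0],
})
demo_bad = demo_ok.assign(Item_Outlet_Sales=[-5.0, 310.0])

print(check_submission(demo_ok, expected_rows=2))
print(check_submission(demo_bad, expected_rows=2))
```

For the real file, the call would be `check_submission(submission, expected_rows=len(test_original))`, so the row count is compared against the test set loaded earlier in the notebook.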