## Load Bigmart Sales data

In [1]:
import pandas as pd

In [2]:
bigmart = pd.read_csv('train_bm.csv')

In [3]:
bigmart.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


### Check datatypes of variables

In [4]:
bigmart.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

### Encoding a single variable: Outlet_Type

In [5]:
bigmart['Outlet_Type'].value_counts()

Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64

In [6]:
pd.get_dummies(bigmart['Outlet_Type']).head()

Unnamed: 0,Grocery Store,Supermarket Type1,Supermarket Type2,Supermarket Type3
0,0,1,0,0
1,0,0,1,0
2,0,1,0,0
3,1,0,0,0
4,0,1,0,0
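The call above returns only the dummy columns. As a minimal sketch of how the same encoding could be kept alongside the other columns, the `columns` argument of `pd.get_dummies` restricts encoding to the listed variables. The `toy` frame below is a made-up stand-in for `bigmart`, and `dtype=int` is an assumption to reproduce the 0/1 display shown above, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy stand-in for the Bigmart frame (values are illustrative only).
toy = pd.DataFrame({
    'Item_MRP': [249.8, 48.3, 182.1],
    'Outlet_Type': ['Supermarket Type1', 'Supermarket Type2', 'Grocery Store'],
})

# Encode only Outlet_Type and leave every other column untouched.
# dtype=int forces 0/1 instead of True/False.
toy_encoded = pd.get_dummies(toy, columns=['Outlet_Type'], dtype=int)

print(toy_encoded.columns.tolist())
```

The original `Outlet_Type` column is dropped and replaced by one prefixed column per level.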


### One-hot encoding all the categorical variables

In [7]:
bigmart_encoded = pd.get_dummies(bigmart)
bigmart_encoded.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales,Item_Identifier_DRA12,Item_Identifier_DRA24,Item_Identifier_DRA59,Item_Identifier_DRB01,Item_Identifier_DRB13,...,Outlet_Size_High,Outlet_Size_Medium,Outlet_Size_Small,Outlet_Location_Type_Tier 1,Outlet_Location_Type_Tier 2,Outlet_Location_Type_Tier 3,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
0,9.3,0.016047,249.8092,1999,3735.138,0,0,0,0,0,...,0,1,0,1,0,0,0,1,0,0
1,5.92,0.019278,48.2692,2009,443.4228,0,0,0,0,0,...,0,1,0,0,0,1,0,0,1,0
2,17.5,0.01676,141.618,1999,2097.27,0,0,0,0,0,...,0,1,0,1,0,0,0,1,0,0
3,19.2,0.0,182.095,1998,732.38,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
4,8.93,0.0,53.8614,1987,994.7052,0,0,0,0,0,...,1,0,0,0,0,1,0,1,0,0


We have two problems here: 

#### Problem 1:
Look at the newly created variables *Outlet_Size_High*, *Outlet_Size_Medium* and *Outlet_Size_Small*: the order between the sizes (Small < Medium < High) is destroyed, because each level becomes an independent 0/1 column. As a result, we lose important information.

In [8]:
bigmart_encoded[['Outlet_Size_High', 'Outlet_Size_Medium', 'Outlet_Size_Small']].head()

Unnamed: 0,Outlet_Size_High,Outlet_Size_Medium,Outlet_Size_Small
0,0,1,0
1,0,1,0
2,0,1,0
3,0,0,0
4,1,0,0


#### Problem 2:
The number of features has increased from 12 to 1605, and most of the values in the new columns are 0.

In [9]:
bigmart.shape, bigmart_encoded.shape

((8523, 12), (8523, 1605))
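To see where the extra columns come from, the width of the encoded frame can be predicted before encoding: every numeric column is kept as is, and every categorical column contributes one column per distinct non-missing level. The sketch below uses a made-up `toy` frame rather than the real data, and it assumes every non-numeric column is a string column that `pd.get_dummies` will encode.

```python
import pandas as pd

# Toy stand-in for the Bigmart frame (values are illustrative only).
toy = pd.DataFrame({
    'Item_Identifier': ['FDA15', 'DRC01', 'FDN15', 'FDA15'],
    'Item_MRP': [249.8, 48.3, 141.6, 250.1],
    'Outlet_Size': ['Medium', 'Medium', None, 'High'],
})

# Columns that get_dummies will expand (assumed: all non-numeric columns).
categorical = toy.select_dtypes(exclude='number')
n_numeric = toy.shape[1] - categorical.shape[1]

# Distinct levels per categorical column; missing values are not counted,
# matching the get_dummies default of dummy_na=False.
levels = categorical.nunique()
expected_width = n_numeric + int(levels.sum())

print(levels.sort_values(ascending=False))
print(expected_width, pd.get_dummies(toy).shape[1])
```

Running the same `nunique()` check on `bigmart` would show which variable contributes the most columns. The `Item_Identifier_*` columns in the encoded output suggest it is `Item_Identifier`.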

In [10]:
# The dummy columns exist only in bigmart_encoded, not in bigmart.
# bigmart_encoded[['Item_Identifier_DRA12', 'Item_Identifier_DRA24',
#                  'Item_Identifier_DRA59', 'Item_Identifier_DRB01',
#                  'Item_Identifier_DRB13', 'Item_Identifier_DRB24',
#                  'Item_Identifier_DRB25', 'Item_Identifier_DRB48',
#                  'Item_Identifier_DRC01', 'Item_Identifier_DRC12',
#                  'Item_Identifier_DRC13', 'Item_Identifier_DRC24',
#                  'Item_Identifier_DRC25', 'Item_Identifier_DRC27',
#                  'Item_Identifier_DRC36', 'Item_Identifier_DRC49',
#                  'Item_Identifier_DRD01', 'Item_Identifier_DRD12',
#                  'Item_Identifier_DRD13', 'Item_Identifier_DRD15',
#                  'Item_Identifier_DRD24', 'Item_Identifier_DRD25',
#                  'Item_Identifier_DRD27', 'Item_Identifier_DRD37',
#                  'Item_Identifier_DRD49']].head()

## Problem 1 solution

In [11]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [12]:
bigmart['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

In [13]:
le.fit_transform(['Small', 'Medium', 'High'])

array([2, 1, 0], dtype=int64)

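LabelEncoder assigns codes in alphabetical order (High = 0, Medium = 1, Small = 2), not in the order we intend. Here the codes happen to be the exact reverse of the size ranking, but in general alphabetical order has nothing to do with the meaning of an ordinal variable, so we again lose the order. Instead, we specify the mapping ourselves.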
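If you prefer to stay inside scikit-learn, one option is `OrdinalEncoder`, which accepts an explicit `categories` list and assigns codes in that order. This is a sketch of an alternative, not the approach used in this notebook. The `sizes` array is made-up sample input.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder expects 2-D input: one row per sample, one column per feature.
sizes = np.array([['Small'], ['Medium'], ['High'], ['Medium']])

# The position of each label in this list becomes its code.
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'High']])
codes = encoder.fit_transform(sizes).ravel()

print(codes)
```

With this ordering, Small maps to 0, Medium to 1 and High to 2.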

In [14]:
bigmart['Outlet_Size'] = bigmart['Outlet_Size'].map({'Small': 0,
                                                     'Medium': 1,
                                                     'High': 2})

In [15]:
bigmart['Outlet_Size'].head()

0    1.0
1    1.0
2    1.0
3    NaN
4    2.0
Name: Outlet_Size, dtype: float64
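Note that the result is `float64` because the missing sizes stay `NaN`. As a hedged alternative sketch, an ordered pandas categorical keeps the labels readable, records the ranking, and still exposes integer codes. The `sizes` series is made-up sample data, and `size_type` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for bigmart['Outlet_Size'], including one missing value.
sizes = pd.Series(['Medium', 'Medium', None, 'High', 'Small'])

# Declare the ranking explicitly: Small < Medium < High.
size_type = pd.CategoricalDtype(categories=['Small', 'Medium', 'High'],
                                ordered=True)
ordered = sizes.astype(size_type)

# Integer codes follow the declared order; missing values get code -1.
codes = ordered.cat.codes
print(codes.tolist())

# Ordered categoricals also support comparisons against a level.
print((ordered >= 'Medium').tolist())
```

The `-1` code for missing values would need to be handled (for example, imputed) before modelling, just like the `NaN` values in the mapped column.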

So that solves the first challenge we encountered. Now we'll see how to deal with high cardinality.