In [1]:
import pandas as pd
import numpy as np
from scipy import sparse
from sklearn import preprocessing  
import matplotlib
%matplotlib inline  

# Load Dataset

In [3]:
df = pd.read_csv("../../data/train.csv", delimiter=",")
df.head()

# This dataset has different types of categorical variables:
# five binary, ten nominal, six ordinal, two cyclic, and one target variable

Unnamed: 0,id,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,...,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month,target
0,0,0.0,0.0,0.0,F,N,Red,Trapezoid,Hamster,Russia,...,02e7c8990,3.0,Contributor,Hot,c,U,Pw,6.0,3.0,0
1,1,1.0,1.0,0.0,F,Y,Red,Star,Axolotl,,...,f37df64af,3.0,Grandmaster,Warm,e,X,pE,7.0,7.0,0
2,2,0.0,1.0,0.0,F,N,Red,,Hamster,Canada,...,,3.0,,Freezing,n,P,eN,5.0,9.0,0
3,3,,0.0,0.0,F,N,Red,Circle,Hamster,Finland,...,f9d456e57,1.0,Novice,Lava Hot,a,C,,3.0,3.0,0
4,4,0.0,,0.0,T,N,Red,Triangle,Hamster,Costa Rica,...,c5361037c,3.0,Grandmaster,Cold,h,C,OZ,5.0,12.0,0


In [5]:
df.loc[[0,2,3],"nom_0"]

0    Red
2    Red
3    Red
Name: nom_0, dtype: object

# Categorical Variables

## Nominal Variables

Variables with two or more categories that have no inherent order.
  E.g. if gender is recorded as two groups, male and female, it can be treated as a nominal variable.
  
## Ordinal Variables

These variables have levels (categories) with a particular order. E.g. an ordinal variable can be a feature with three levels: low, medium, and high. The order matters.



## Cyclic Variables

These variables repeat in cycles, e.g. days of the week, months of the year, or hours of the day.

## Binary Variables

Variables with only two categories, typically in the form of `0` and `1`.



In [3]:
# Let's look at the `ord_2` feature in the dataset
# note that there are NaN values too
df["ord_2"].unique()

array(['Hot', 'Warm', 'Freezing', 'Lava Hot', 'Cold', 'Boiling Hot', nan],
      dtype=object)

## Mapping ordinal value (text to numbers)

This type of encoding of categorical variables is known as **Label Encoding**, because we encode every category as a numerical label using a mapping dictionary.

In [4]:
# List mapping values here
mapping = {
    "Freezing" : 0,
    "Warm" : 1,
    "Cold" : 2, 
    "Boiling Hot" : 3,
    "Hot" : 4,
    "Lava Hot" : 5
}

##### Now convert categorical values present in `ord_2` feature to numbers

In [5]:
df.loc[:, "ord_2"] = df.ord_2.map(mapping)
# count values after mapping
# note that value_counts is used for counting on a Series (i.e. a single column)
# by default value_counts drops NaN values (pass dropna=False to include them)
df.ord_2.value_counts()

0.0    142726
1.0    124239
2.0     97822
3.0     84790
4.0     67508
5.0     64840
Name: ord_2, dtype: int64
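
A minimal sketch of what `map` does with values that are not keys of the dictionary, using a toy Series rather than the real `train.csv` column. The category `"Tepid"` is a made-up value for illustration only.

```python
import pandas as pd

# Toy column standing in for ord_2 (illustrative values, not taken from train.csv)
toy = pd.Series(["Hot", "Warm", None, "Freezing", "Tepid"], name="ord_2")

mapping = {"Freezing": 0, "Warm": 1, "Cold": 2, "Boiling Hot": 3, "Hot": 4, "Lava Hot": 5}

encoded = toy.map(mapping)

# Missing values and categories absent from the dictionary ("Tepid") both become NaN,
# which is why the mapped column is float rather than int
print(encoded.tolist())
print(encoded.value_counts(dropna=False))
```

This is why the counts above are shown as `0.0`, `1.0`, and so on: the column still contains NaN after mapping.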

# Alternative (Label Encoding) using Scikit-learn
Label encoding can also be done with scikit-learn's `LabelEncoder`.

In [6]:
# read data again (as previous df is overwritten)

train_df = pd.read_csv("../data/train.csv")

print("ord 2 before label encoder")
print(train_df.loc[:, "ord_2"])
# fill NaN values with a placeholder: sklearn does not handle the NaN values present in this column
train_df.loc[:, "ord_2"] = train_df.ord_2.fillna("NONE")
# initialize LabelEncoder
label_enc = preprocessing.LabelEncoder()

train_df.loc[:, "ord_2"] = label_enc.fit_transform(train_df.ord_2.values)


ord 2 before label encoder
0                 Hot
1                Warm
2            Freezing
3            Lava Hot
4                Cold
             ...     
599995       Freezing
599996    Boiling Hot
599997       Freezing
599998           Warm
599999    Boiling Hot
Name: ord_2, Length: 600000, dtype: object


In [7]:
print(" ============== ")
print(" ord 2 values after sklearn transform ")
print(" ============== ")

train_df.loc[:, "ord_2"]

 ord 2 values after sklearn transform 


0         3
1         6
2         2
3         4
4         1
         ..
599995    2
599996    0
599997    2
599998    6
599999    0
Name: ord_2, Length: 600000, dtype: int64
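
The integer codes above are not arbitrary: `LabelEncoder` sorts the categories and uses each category's position as its code. A small sketch on toy values (not the full dataset) that inspects `classes_` and reverses the encoding:

```python
import numpy as np
from sklearn import preprocessing

# Toy values standing in for ord_2 after fillna("NONE")
values = np.array(["Hot", "Warm", "Freezing", "Lava Hot", "Cold", "Boiling Hot", "NONE"])

label_enc = preprocessing.LabelEncoder()
codes = label_enc.fit_transform(values)

# classes_ holds the sorted categories; a category's position is its integer code
for code, name in enumerate(label_enc.classes_):
    print(code, name)

# inverse_transform maps the integer codes back to the original strings
restored = label_enc.inverse_transform(codes)
```

Note that this alphabetical order differs from the hand-written mapping dictionary, so the two approaches give different integers for the same category.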

We can use label-encoded features directly in tree-based models such as decision trees, random forests, XGBoost, and GBM, because these models split on feature values rather than multiplying each feature by a learned weight. Linear models (y = w*x + b), SVMs, and neural networks do learn such weights, so they would treat the arbitrary integer codes as magnitudes.

**These weight-based models expect normalized or standardized inputs, so label-encoded integers are not suitable for them as-is. One option is to binarize the labels, so that every column contains only 1s and 0s.**

E.g.

 Freezing ---> 0 ---> 0 0 0

 Warm ---> 1 ---> 0 0 1

 Cold ---> 2 ---> 0 1 0

 Boiling Hot ---> 3 ---> 0 1 1

 Hot ---> 4 ---> 1 0 0

 Lava Hot ---> 5 ---> 1 0 1

However, note that this also increases the overall number of columns (features), because one feature is now spread across several binary columns.
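
A minimal sketch of turning integer labels into fixed-width binary digits with NumPy. The variable names (`labels`, `n_bits`, `shifts`) are illustrative, and this is one possible way to do it, not necessarily how the notebook's author would.

```python
import numpy as np

# Integer labels 0..5, as produced by the mapping dictionary for ord_2
labels = np.array([0, 1, 2, 3, 4, 5])

n_bits = 3  # 3 bits are enough for six categories (2**3 = 8 >= 6)

# Shift each label right by 2, 1, 0 and keep the lowest bit: most significant bit first
shifts = np.arange(n_bits - 1, -1, -1)
binarized = (labels[:, None] >> shifts) & 1

print(binarized)
```

Each row of `binarized` holds the three binary columns for one label, so a single feature becomes three features.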

#### Drawback of binarizing them
**With many categorical features and a large dataset, storing all of these columns as a dense array becomes very expensive.**

### Solution

To solve this, we can use a **sparse format (a.k.a. compressed sparse format)**. It is simply a memory-efficient way of storing data: instead of storing every value, we store only the ones that matter, which in this case are the 1s and not the 0s.

This can reduce memory consumption substantially.

Note: there are several ways to represent an array in sparse format.

## Example of Sparse Format

In [8]:

# consider the following array of binarized features
# we can easily convert it into sparse format
example = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [1, 0, 1]
])
sparse_example = sparse.csr_matrix(example)
print(f'sparse example {sparse_example}')
print(f'example bytes without using sparse format {example.data.nbytes}')
# note: .data.nbytes counts only the stored non-zero values, not the CSR index arrays
print('example bytes with sparse format ', sparse_example.data.nbytes)

# Imagine a huge number of rows mostly filled with 0s: the sparse format would be significantly more memory-efficient

sparse example   (0, 2)	1
  (1, 0)	1
  (2, 0)	1
  (2, 2)	1
example bytes without using sparse format 72
example bytes with sparse format  32


**Lesson learned: this is why we prefer a sparse format over a dense format when the array is mostly 0s.**
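
A caveat worth checking: `.data.nbytes` on a CSR matrix counts only the stored non-zero values. The matrix also keeps two index arrays, `indices` and `indptr`. The sketch below adds all three together; the helper name `csr_nbytes` and the 1000 x 1000 size are assumptions chosen for illustration.

```python
import numpy as np
from scipy import sparse

def csr_nbytes(matrix):
    """Total bytes held by the three arrays of a CSR matrix."""
    return matrix.data.nbytes + matrix.indices.nbytes + matrix.indptr.nbytes

# Small binarized example: the index arrays are a large share of the total cost
small = np.array([[0, 0, 1], [1, 0, 0], [1, 0, 1]])
small_csr = sparse.csr_matrix(small)
print("dense:", small.nbytes, "csr values only:", small_csr.data.nbytes,
      "csr total:", csr_nbytes(small_csr))

# Larger, mostly-zero array (illustrative size): exactly one 1 per row
n_rows, n_cols = 1000, 1000
rng = np.random.default_rng(0)
big = np.zeros((n_rows, n_cols), dtype=np.int64)
big[np.arange(n_rows), rng.integers(0, n_cols, size=n_rows)] = 1
big_csr = sparse.csr_matrix(big)
print("dense:", big.nbytes, "csr total:", csr_nbytes(big_csr))
```

On tiny arrays the index overhead can cancel out much of the saving; the benefit of the sparse format shows up when the array is large and mostly zeros.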

### Hang on, there might be an even more memory-efficient way to represent this information ;)

It is called one-hot encoding ;)


In [9]:
# consider the following one-hot encoded array
# we can easily convert it into sparse format
example = np.array([
    [0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0]
])
sparse_example = sparse.csr_matrix(example)
print(f'sparse example {sparse_example}')
print(f'example bytes without using sparse format {example.data.nbytes}')
print('example bytes with sparse format ', sparse_example.data.nbytes)
print("#"*10)

sparse example   (0, 4)	1
  (1, 1)	1
  (2, 0)	1
example bytes without using sparse format 144
example bytes with sparse format  24
##########


**Note that with one-hot encoding the sparse representation stores fewer bytes (24) than the sparse binarized example (32), even though the dense one-hot array is larger (144 bytes versus 72).**

### Let's encode features using the `OneHotEncoder` class provided by sklearn

In [10]:
# read data again (as previous df is overwritten)

train_df_new = pd.read_csv("../data/train.csv")

print("ord 2 before one-hot encoder")
print(train_df_new.ord_2.values)
print("*"*10)
# fill NaN values with a placeholder: sklearn does not handle the NaN values present in this column
train_df_new.loc[:, "ord_2"] = train_df_new.ord_2.fillna("NONE")
# initialize OneHotEncoder
# keep sparse=False to get a dense array
# (scikit-learn >= 1.2 names this parameter sparse_output)
one_hot_enc = preprocessing.OneHotEncoder(sparse=False)

# reshape the ord_2 column to a 2-D array, as fit_transform expects that
reshaped_example = train_df_new.ord_2.values.reshape(-1, 1)
ohe_example = one_hot_enc.fit_transform(reshaped_example)
print(f"ohe_example: \n {ohe_example}" )
# NONE is also one category
print(f"ohe_example shape : \n {ohe_example.shape}")

# Now check the size of the dense array
print(f"Size of dense array: {ohe_example.data.nbytes}")

print("*"*10)

# Now check the size of the sparse array, i.e. keep sparse=True
one_hot_enc_sparse = preprocessing.OneHotEncoder(sparse=True)

# reshape the ord_2 column to a 2-D array, as fit_transform expects that
reshaped_example = train_df_new.ord_2.values.reshape(-1, 1)
ohe_example_sparse = one_hot_enc_sparse.fit_transform(reshaped_example)
# NONE is also one category (hence there are 7 categories in total)
print(f"ohe_example shape : \n {ohe_example_sparse.shape}")

# Now check the size of the sparse array (stored values only)
print(f"Size of sparse array: {ohe_example_sparse.data.nbytes}")


ord 2 before one-hot encoder
['Hot' 'Warm' 'Freezing' ... 'Freezing' 'Warm' 'Boiling Hot']
**********
ohe_example: 
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [0. 0. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 1.]
 [1. 0. 0. ... 0. 0. 0.]]
ohe_example shape : 
 (600000, 7)
Size of dense array: 33600000
**********
ohe_example shape : 
 (600000, 7)
Size of sparse array: 4800000


## Take Away: 
The **sparse one-hot encoded array** is much smaller than the **dense one-hot encoded array**.

### Summary

We learned three different ways to handle categorical values:

* **Dictionary mapping / label encoding** (maps the categories to the numbers 0 to N - 1, where N is the total number of categories in a given feature). Note: this is not suitable as-is for linear models, which expect normalized (a.k.a. standardized) inputs.
* **Binarized variables** (you can also use a sparse format to reduce the array size)
* **One-hot encoding** (this can also be stored in a sparse format if memory is a concern)
    
    
There are many other methods for handling categorical variables. **One such method converts categorical variables to numerical variables using their counts** (discussed below).

In [11]:
# Let's go back to the ord_2 feature and check how many times the category "Boiling Hot"
# occurs in it

df = pd.read_csv("../data/train.csv", delimiter=",")
print("total no. of times boiling hot occurs ", df[df.ord_2 == "Boiling Hot"].shape)

# to get count of all categories

df.groupby(["ord_2"]).count()
# note that the count varies from column to column, because count() ignores NaN values
# the id column has no NaN values, so we can use it to get the proper count per category

total no. of times boiling hot occurs  (84790, 25)


Unnamed: 0_level_0,id,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,...,nom_8,nom_9,ord_0,ord_1,ord_3,ord_4,ord_5,day,month,target
ord_2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Boiling Hot,84790,82324,82219,82209,82266,82285,82265,82241,82244,82229,...,82242,82214,82273,82252,82241,82130,82251,82194,82276,84790
Cold,97822,94946,94789,94932,94944,94904,94828,94781,94860,94874,...,94881,94853,94854,94789,94859,94965,94850,94856,94850,97822
Freezing,142726,138468,138496,138541,138370,138462,138370,138311,138353,138394,...,138525,138475,138411,138417,138475,138565,138579,138547,138447,142726
Hot,67508,65487,65544,65501,65362,65495,65500,65518,65465,65415,...,65526,65513,65380,65469,65472,65537,65558,65415,65479,67508
Lava Hot,64840,62820,62888,62894,62952,62892,62838,62864,62895,62919,...,62975,62875,62833,62942,62938,62893,62922,62942,63034,64840
Warm,124239,120487,120507,120487,120533,120432,120413,120613,120573,120522,...,120577,120458,120453,120570,120553,120456,120596,120606,120431,124239


In [12]:
df.groupby(["ord_2"])["id"].count()

ord_2
Boiling Hot     84790
Cold            97822
Freezing       142726
Hot             67508
Lava Hot        64840
Warm           124239
Name: id, dtype: int64

##### If we replace the `ord_2` column with these counts, we have converted it into a kind of numerical feature


In [13]:
# To do this, we can create a new column (or replace the existing one) by using
# the transform function of pandas along with groupby

df["count_ord2"] = df.groupby(["ord_2"])["id"].transform("count")
# rows where ord_2 is NaN get NaN here, because groupby drops NaN keys by default
# you can build such count features for all the categorical features
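
A sketch of the same count encoding written with `value_counts` and `map`, which makes it easy to reuse the training counts on other data. The frames `train` and `new_rows` are toy data, and filling unseen categories with 0 is an assumption for illustration, not something the notebook prescribes.

```python
import pandas as pd

# Toy frames (illustrative values, not from train.csv)
train = pd.DataFrame({"id": range(6),
                      "ord_2": ["Hot", "Hot", "Cold", None, "Hot", "Cold"]})
new_rows = pd.DataFrame({"ord_2": ["Cold", "Hot", "Warm"]})

# For non-missing rows this matches groupby("ord_2")["id"].transform("count")
counts = train["ord_2"].value_counts()
train["count_ord2"] = train["ord_2"].map(counts)

# Reuse the training counts on new rows; categories never seen in training get 0 here
new_rows["count_ord2"] = new_rows["ord_2"].map(counts).fillna(0).astype(int)

print(train)
print(new_rows)
```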

In [14]:
df

Unnamed: 0,id,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,...,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month,target,count_ord2
0,0,0.0,0.0,0.0,F,N,Red,Trapezoid,Hamster,Russia,...,3.0,Contributor,Hot,c,U,Pw,6.0,3.0,0,67508.0
1,1,1.0,1.0,0.0,F,Y,Red,Star,Axolotl,,...,3.0,Grandmaster,Warm,e,X,pE,7.0,7.0,0,124239.0
2,2,0.0,1.0,0.0,F,N,Red,,Hamster,Canada,...,3.0,,Freezing,n,P,eN,5.0,9.0,0,142726.0
3,3,,0.0,0.0,F,N,Red,Circle,Hamster,Finland,...,1.0,Novice,Lava Hot,a,C,,3.0,3.0,0,64840.0
4,4,0.0,,0.0,T,N,Red,Triangle,Hamster,Costa Rica,...,3.0,Grandmaster,Cold,h,C,OZ,5.0,12.0,0,97822.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
599995,599995,0.0,1.0,0.0,T,N,Red,Polygon,Axolotl,India,...,3.0,Novice,Freezing,a,R,GZ,5.0,,0,142726.0
599996,599996,1.0,0.0,0.0,T,Y,Blue,Polygon,Dog,Costa Rica,...,2.0,Novice,Boiling Hot,n,N,sf,,3.0,0,84790.0
599997,599997,0.0,0.0,0.0,F,Y,Red,Circle,Axolotl,Russia,...,2.0,Contributor,Freezing,n,H,MV,7.0,5.0,0,142726.0
599998,599998,1.0,1.0,0.0,F,Y,,Polygon,Axolotl,,...,1.0,Master,Warm,m,X,Ey,1.0,5.0,0,124239.0


In [15]:
# you can add count features for every categorical column, replace the originals with them,
# or group by multiple columns and use the counts of each combination.

# E.g. counting by grouping on the ord_1 and ord_2 columns
df.groupby(["ord_2", "ord_1"])["id"].count().reset_index(name="count")
# in this way you can add a new feature

Unnamed: 0,ord_2,ord_1,count
0,Boiling Hot,Contributor,15634
1,Boiling Hot,Expert,19477
2,Boiling Hot,Grandmaster,13623
3,Boiling Hot,Master,10800
4,Boiling Hot,Novice,22718
5,Cold,Contributor,17734
6,Cold,Expert,22956
7,Cold,Grandmaster,15464
8,Cold,Master,12364
9,Cold,Novice,26271


In [16]:
### You can also create new features from these categorical variables easily,
### just by concatenating them with _
df["new_feature"] = df.ord_1.astype(str) + "_" + df.ord_2.astype(str)

# similarly, you can combine different features to create new features
# Note: a NaN in either column becomes the string "nan" in the combined feature (e.g. nan_Freezing)

In [17]:
df

Unnamed: 0,id,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,...,ord_1,ord_2,ord_3,ord_4,ord_5,day,month,target,count_ord2,new_feature
0,0,0.0,0.0,0.0,F,N,Red,Trapezoid,Hamster,Russia,...,Contributor,Hot,c,U,Pw,6.0,3.0,0,67508.0,Contributor_Hot
1,1,1.0,1.0,0.0,F,Y,Red,Star,Axolotl,,...,Grandmaster,Warm,e,X,pE,7.0,7.0,0,124239.0,Grandmaster_Warm
2,2,0.0,1.0,0.0,F,N,Red,,Hamster,Canada,...,,Freezing,n,P,eN,5.0,9.0,0,142726.0,nan_Freezing
3,3,,0.0,0.0,F,N,Red,Circle,Hamster,Finland,...,Novice,Lava Hot,a,C,,3.0,3.0,0,64840.0,Novice_Lava Hot
4,4,0.0,,0.0,T,N,Red,Triangle,Hamster,Costa Rica,...,Grandmaster,Cold,h,C,OZ,5.0,12.0,0,97822.0,Grandmaster_Cold
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
599995,599995,0.0,1.0,0.0,T,N,Red,Polygon,Axolotl,India,...,Novice,Freezing,a,R,GZ,5.0,,0,142726.0,Novice_Freezing
599996,599996,1.0,0.0,0.0,T,Y,Blue,Polygon,Dog,Costa Rica,...,Novice,Boiling Hot,n,N,sf,,3.0,0,84790.0,Novice_Boiling Hot
599997,599997,0.0,0.0,0.0,F,Y,Red,Circle,Axolotl,Russia,...,Contributor,Freezing,n,H,MV,7.0,5.0,0,142726.0,Contributor_Freezing
599998,599998,1.0,1.0,0.0,F,Y,,Polygon,Axolotl,,...,Master,Warm,m,X,Ey,1.0,5.0,0,124239.0,Master_Warm


As we saw, we can create new features easily, but which categories to combine is a separate question. For now, we can try different combinations and see which performs better. (We will look at feature engineering later ;)
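
A small sketch of combining two columns while filling missing values explicitly first, so the combined category reads `NONE_...` instead of `nan_...`. The placeholder `"NONE"` follows the convention used elsewhere in this notebook; the toy frame is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "ord_1": ["Contributor", None, "Novice"],
    "ord_2": ["Hot", "Freezing", None],
})

# Fill missing values before concatenating, so the placeholder is intentional
# rather than the text that astype(str) happens to produce for a missing value
combined = toy["ord_1"].fillna("NONE") + "_" + toy["ord_2"].fillna("NONE")
print(combined.tolist())
```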


### So what do you do when you get categorical variables? Do the following ;)

* Fill the NaN values (very important).
* Convert them to integers by applying label encoding, using sklearn's LabelEncoder or a mapping dictionary.
* Create a one-hot encoding (skip binarization, as sparse one-hot encoding consumes less memory).
* Go to modelling ;) Boom :p
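
The steps above can be sketched on a toy frame as follows. This is a minimal illustration under assumptions: the column values are made up, and the separate label-encoding step is skipped because `OneHotEncoder` accepts string columns directly.

```python
import pandas as pd
from scipy import sparse
from sklearn import preprocessing

# Toy frame (illustrative values); in the notebook this would be the train.csv features
toy = pd.DataFrame({
    "nom_0": ["Red", "Blue", None, "Red"],
    "ord_2": ["Hot", None, "Cold", "Hot"],
})

# Step 1: fill NaN with an explicit category and make sure everything is a string
features = list(toy.columns)
for col in features:
    toy[col] = toy[col].fillna("NONE").astype(str)

# Step 2: one-hot encode all columns at once; the default output is a scipy sparse matrix
ohe = preprocessing.OneHotEncoder()
X = ohe.fit_transform(toy[features])

# One output column per (feature, category) pair; X is what a model would be trained on
print(type(X), X.shape)
```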

### Now let's see how to handle NaN data in categorical variables

* If you do not handle `NaN` values in the training data and then encounter `NaN` values in test/real data, you will face errors (as you would use the same pre-processing for training and test/real data).

* One simple way to **handle NaN values** is to drop the affected rows. This can work, but if the training set has a lot of **NaN values**, you end up with far fewer training examples.

* Another way to **handle NaN values** is to treat them as a completely new category, e.g. one called `"Rare"`. This way you use all of the data, and sometimes the **NaN values** themselves carry useful information.

Let's explore the same categorical column and check whether it has **any NaN values**.

In [18]:
# value_counts does not list NaN values by default
print(df.ord_2.value_counts())
print()
print("***** add flag dropna = False ** \n")
print(df.ord_2.value_counts(dropna=False))

Freezing       142726
Warm           124239
Cold            97822
Boiling Hot     84790
Hot             67508
Lava Hot        64840
Name: ord_2, dtype: int64

***** add flag dropna = False ** 

Freezing       142726
Warm           124239
Cold            97822
Boiling Hot     84790
Hot             67508
Lava Hot        64840
NaN             18075
Name: ord_2, dtype: int64


In [20]:
## fill the NaN values
df.ord_2.fillna("NONE").value_counts()
# you can see that we have a lot of NaN values (18075 rows)

Freezing       142726
Warm           124239
Cold            97822
Boiling Hot     84790
Hot             67508
Lava Hot        64840
NONE            18075
Name: ord_2, dtype: int64

As you can see, we have assigned a new category to the `NaN` values, so we use all the information in that feature. This may also improve model performance.
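
A short sketch comparing the two options, dropping rows versus filling with a new category, on a toy frame (values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "ord_2": ["Hot", None, "Cold", None, "Warm"],
    "target": [0, 1, 0, 1, 0],
})

# Option 1: drop rows with a missing ord_2 (loses two of the five rows here)
dropped = toy.dropna(subset=["ord_2"])

# Option 2: keep every row and treat "missing" as its own category
filled = toy.assign(ord_2=toy["ord_2"].fillna("NONE"))

print(len(dropped), len(filled))
print(filled["ord_2"].value_counts())
```

In this toy data both dropped rows have `target == 1`, which hints at how missingness itself can carry signal that dropping would throw away.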

## How to Handle an Unknown Category in a Feature?

Let's say a new category appears in one of the features in live data. Take the `ord_2` feature: what if your real-time data contains a new category, e.g. `warmer`, that was not present in your training dataset? How are you going to handle it?

One possible solution is to include an `unknown/rare category` in your training data. Say we rename `NONE` to `unknown category`; then whenever a new category appears in live data, we can map it directly to `unknown category`.


This is very similar to NLP problems, where we always build a model on a fixed vocabulary (the vocabulary present in our training data). Increasing the size of the vocabulary increases the size of the model. E.g. Transformer models like BERT are trained with a vocabulary of `~30000 words` (for English). So, when a new word comes in, we mark it as **UNK** (unknown).


So, you can either assume that your test data will have the same categories as the training data, or you can introduce a rare or unknown category in training to take care of new categories in the test data.
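
A minimal sketch of the second option in pandas: anything in live data that was not seen during training is mapped to a single placeholder. The label `"UNKNOWN"` and the toy Series are assumptions for illustration.

```python
import pandas as pd

# Training column: missing values already mapped to the placeholder category
train = pd.Series(["Hot", "Cold", None, "Warm"], name="ord_2").fillna("UNKNOWN")

# Live data with a category never seen in training ("warmer") and a missing value
live = pd.Series(["Cold", "warmer", None], name="ord_2")

known = list(train.unique())
live_clean = live.fillna("UNKNOWN")
# keep known categories as-is, replace everything else with the placeholder
live_clean = live_clean.where(live_clean.isin(known), "UNKNOWN")

print(live_clean.tolist())
```

Because `"UNKNOWN"` exists in the training data, any encoder fitted on `train` already has a slot for it.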

### Let's explore another feature and introduce a rare category

In [21]:
# check no. of unique categories
df.ord_4.fillna("NONE").unique()

array(['U', 'X', 'P', 'C', 'Q', 'R', 'Y', 'N', 'I', 'O', 'M', 'E', 'V',
       'K', 'G', 'B', 'H', 'NONE', 'T', 'W', 'A', 'F', 'D', 'S', 'J', 'L',
       'Z'], dtype=object)

In [22]:
# check count of each category
df.ord_4.fillna("NONE").value_counts()

N       39978
P       37890
Y       36657
A       36633
R       33045
U       32897
M       32504
X       32347
C       32112
H       31189
Q       30145
T       29723
O       25610
B       25212
E       21871
K       21676
I       19805
NONE    17930
D       17284
F       16721
W        8268
Z        5790
S        4595
G        3404
V        3107
J        1950
L        1657
Name: ord_4, dtype: int64

Here a few categories occur fewer than 4000 times (G, V, J, and L), and two of them (J and L) occur fewer than 2000 times. Based on a threshold, we can merge such categories into a single "rare" category.

This takes just one line of pandas ;)

In [25]:

df.ord_4 = df.ord_4.fillna("NONE")


In [26]:
# boolean mask: True for rows whose ord_4 category occurs fewer than 2000 times
df.ord_4.value_counts()[df["ord_4"]].values < 2000

array([False, False, False, ..., False, False, False])

In [29]:

df.loc[df["ord_4"].value_counts()[df["ord_4"]].values < 2000, "ord_4"] = "RARE"


In [30]:
df.ord_4.value_counts()

N       39978
P       37890
Y       36657
A       36633
R       33045
U       32897
M       32504
X       32347
C       32112
H       31189
Q       30145
T       29723
O       25610
B       25212
E       21871
K       21676
I       19805
NONE    17930
D       17284
F       16721
W        8268
Z        5790
S        4595
RARE     3607
G        3404
V        3107
Name: ord_4, dtype: int64

**We have assigned a new category, "RARE", to the categories whose frequency is below a threshold (2000 here). If a new, previously unseen category shows up in this feature in real data, we can map it to "RARE" as well.**
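
One way to package the thresholding so that it can be reused on live data is sketched below. The helper name `group_rare`, the toy Series, and the threshold of 3 are hypothetical and chosen only for illustration.

```python
import pandas as pd

def group_rare(train_col, threshold, rare_label="RARE"):
    """Replace categories seen fewer than `threshold` times with `rare_label`.

    Returns the transformed column and the index of categories that were kept.
    """
    counts = train_col.value_counts()
    frequent = counts[counts >= threshold].index
    return train_col.where(train_col.isin(frequent), rare_label), frequent

# Toy training column: A occurs 4 times, B 3 times, C and D once each
train = pd.Series(list("AAAABBBCD"), name="ord_4")
train_grouped, frequent = group_rare(train, threshold=3)

# Live data: "C" was rare in training and "Q" was never seen; both become RARE
live = pd.Series(["A", "C", "Q"], name="ord_4")
live_grouped = live.where(live.isin(frequent), "RARE")

print(train_grouped.value_counts())
print(live_grouped.tolist())
```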



## It's time to put things together and start model training ;)


Note: all training code is written in Python scripts ;)

## General pandas comments

### value_counts vs count

`count` can be applied to a DataFrame (or a groupby object) and ignores NaN values. It is mostly used for counting within groups.

`value_counts` is applied to a Series (e.g. a single column). It ignores NaN values by default, but you can pass `dropna=False` to include them.
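
A quick side-by-side on a toy frame (illustrative values) showing what each call returns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"ord_2": ["Hot", "Cold", np.nan, "Hot"], "id": [0, 1, 2, 3]})

# count: number of non-NaN entries, per column for a DataFrame
per_column = toy.count()
print(per_column)

# value_counts: frequency of each distinct value in one column
freq = toy["ord_2"].value_counts()
freq_with_nan = toy["ord_2"].value_counts(dropna=False)
print(freq)
print(freq_with_nan)
```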

### Groupby

Any groupby operation involves one or more of the following operations on the original object:

* Splitting the Object

* Applying a function

* Combining the results


In many situations, we split the data into sets and apply some functionality to each subset. In the apply step, we can perform the following operations:

* Aggregation − computing a summary statistic

* Transformation − perform some group-specific operation

* Filtration − discarding the data with some condition
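
The three kinds of apply step can be sketched on a toy frame like this (the values and the "at least 2 rows" condition are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "ord_2": ["Hot", "Hot", "Cold", "Cold", "Cold", "Warm"],
    "target": [1, 0, 0, 0, 1, 1],
})
grouped = toy.groupby("ord_2")["target"]

# Aggregation: one summary value per group
agg = grouped.mean()

# Transformation: a group-level result broadcast back to every original row
trans = grouped.transform("count")

# Filtration: keep only the rows of groups that satisfy a condition (drops "Warm")
filt = toy.groupby("ord_2").filter(lambda g: len(g) >= 2)

print(agg)
print(trans.tolist())
print(filt)
```

Aggregation shrinks the data to one row per group, transformation keeps the original length, and filtration returns a subset of the original rows.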

In [None]:
# df.groupby(["target"])["id"].count().plot.bar()
# df.target.value_counts().plot.bar()