## Encoding the categorical features

#### Encoding the categorical features to use categorical data in predictive analytics

#### Tags:
    Data: labeled data, Kaggle competition
    Technologies: python, pandas
    Techniques: data import, categorical encoding (label encoding, one-hot encoding)
    
#### Resources:
[Kaggle competition data](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews)

In [1]:
import pandas as pd


In [6]:
# import the relevant dataset
df = pd.read_csv('../data/Womens Clothing E-Commerce Reviews.csv')
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
Unnamed: 0                 23486 non-null int64
Clothing ID                23486 non-null int64
Age                        23486 non-null int64
Title                      19676 non-null object
Review Text                22641 non-null object
Rating                     23486 non-null int64
Recommended IND            23486 non-null int64
Positive Feedback Count    23486 non-null int64
Division Name              23472 non-null object
Department Name            23472 non-null object
Class Name                 23472 non-null object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


In [13]:
# In total, 5 columns hold textual data,
# of which 3 are categorical: 'Division Name',
# 'Department Name' and 'Class Name'

# Let's take a look at all the values of the 'Class Name' column
class_cat = df['Class Name'].unique()
class_cat

array(['Intimates', 'Dresses', 'Pants', 'Blouses', 'Knits', 'Outerwear',
       'Lounge', 'Sweaters', 'Skirts', 'Fine gauge', 'Sleep', 'Jackets',
       'Swim', 'Trend', 'Jeans', 'Legwear', 'Shorts', 'Layering',
       'Casual bottoms', nan, 'Chemises'], dtype=object)

In [14]:
# unique() includes NaN, so this counts 20 classes plus the missing value
len(df['Class Name'].unique())

21

##### Consider 3 approaches:
    * Label encoding - straightforward: each category is encoded as a separate integer,
    usually starting at 0. The advantage is that the encoding is simple to understand.
    * One-hot encoding - each category is encoded as a binary variable in a separate column. Its
    advantage over label encoding is that it does not create artificial meaning (such as an ordering
    or a mean of the label-encoded variable), so it is usually the safer choice for machine learning
    algorithms that treat their inputs as numeric. The disadvantage is that it increases the
    dimensionality of the dataset.
    * Special - for ordinal variables expressed as ranges, other approaches can work better,
    such as representing each range by its mean or median. Also, if dimensionality is a problem, further
    aggregation of the categories might make sense. The right choice depends on the case.
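The "Special" approach for ranges can be illustrated with a minimal sketch. The `age_range` column and its values are hypothetical toy data, not part of the reviews dataset; the idea is only to show how a text range might be replaced by its midpoint so that the ordering is preserved.

```python
import pandas as pd

# Hypothetical toy data: an ordinal feature stored as "low-high" text ranges.
toy = pd.DataFrame({"age_range": ["18-25", "26-35", "36-50", "26-35"]})

# Split each range into its lower and upper bound and convert both to integers.
bounds = toy["age_range"].str.split("-", expand=True).astype(int)

# Represent every range by the mean of its bounds (the midpoint).
toy["age_mid"] = bounds.mean(axis=1)

print(toy)
```

Unlike arbitrary label codes, the midpoints keep both the order and the rough spacing of the ranges, which is why this can suit ordinal variables better.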

In [18]:
# One-hot encoding the variable, creating 20 new columns
# (one per non-null class; NaN gets no column by default)

df_one_hot = pd.get_dummies(df,columns=['Class Name'])

df_one_hot.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 30 columns):
Unnamed: 0                   23486 non-null int64
Clothing ID                  23486 non-null int64
Age                          23486 non-null int64
Title                        19676 non-null object
Review Text                  22641 non-null object
Rating                       23486 non-null int64
Recommended IND              23486 non-null int64
Positive Feedback Count      23486 non-null int64
Division Name                23472 non-null object
Department Name              23472 non-null object
Class Name_Blouses           23486 non-null uint8
Class Name_Casual bottoms    23486 non-null uint8
Class Name_Chemises          23486 non-null uint8
Class Name_Dresses           23486 non-null uint8
Class Name_Fine gauge        23486 non-null uint8
Class Name_Intimates         23486 non-null uint8
Class Name_Jackets           23486 non-null uint8
Class Name_Jeans             23486 no
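How `pd.get_dummies` treats missing values and redundant columns can be seen on a small example. This is a sketch on hypothetical toy data (the `toy` frame is an assumption, not the reviews dataset); `dummy_na` and `drop_first` are existing `get_dummies` parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing class.
toy = pd.DataFrame({"Class Name": ["Dresses", "Pants", np.nan, "Dresses"]})

# Default behaviour: one column per non-null category, NaN rows are all zero.
default = pd.get_dummies(toy, columns=["Class Name"])

# dummy_na=True adds an extra indicator column for missing values.
with_nan = pd.get_dummies(toy, columns=["Class Name"], dummy_na=True)

# drop_first=True drops the first category to avoid a redundant column.
reduced = pd.get_dummies(toy, columns=["Class Name"], drop_first=True)

print(default.shape, with_nan.shape, reduced.shape)
```

Dropping the first category can help with linear models, where the full set of indicator columns is perfectly collinear with the intercept.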

In [24]:
# Label encoding the variable

# Convert the column to the pandas category dtype, then use it to encode the variable

df['Class Name Category'] = df['Class Name'].astype('category')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 12 columns):
Unnamed: 0                 23486 non-null int64
Clothing ID                23486 non-null int64
Age                        23486 non-null int64
Title                      19676 non-null object
Review Text                22641 non-null object
Rating                     23486 non-null int64
Recommended IND            23486 non-null int64
Positive Feedback Count    23486 non-null int64
Division Name              23472 non-null object
Department Name            23472 non-null object
Class Name                 23472 non-null object
Class Name Category        23472 non-null category
dtypes: category(1), int64(6), object(5)
memory usage: 2.0+ MB


In [25]:
# Replace the categories with their integer codes; missing values (NaN) get the code -1
df['Class Name Category'] = df['Class Name Category'].cat.codes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 12 columns):
Unnamed: 0                 23486 non-null int64
Clothing ID                23486 non-null int64
Age                        23486 non-null int64
Title                      19676 non-null object
Review Text                22641 non-null object
Rating                     23486 non-null int64
Recommended IND            23486 non-null int64
Positive Feedback Count    23486 non-null int64
Division Name              23472 non-null object
Department Name            23472 non-null object
Class Name                 23472 non-null object
Class Name Category        23486 non-null int8
dtypes: int64(6), int8(1), object(5)
memory usage: 2.0+ MB


In [29]:
df[['Class Name','Class Name Category']].head()

|   | Class Name | Class Name Category |
|---|------------|---------------------|
| 0 | Intimates  | 5                   |
| 1 | Dresses    | 3                   |
| 2 | Dresses    | 3                   |
| 3 | Pants      | 13                  |
| 4 | Blouses    | 0                   |
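The codes follow the sorted order of the categories, which is why 'Blouses' maps to 0. A minimal sketch on hypothetical toy values (the `classes` series is an assumption for illustration) shows how the code-to-label mapping might be recovered and how a missing value is encoded:

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with one missing value.
classes = pd.Series(["Intimates", "Dresses", np.nan, "Blouses"], dtype="category")

# Integer codes follow the sorted category order; NaN is encoded as -1.
codes = classes.cat.codes

# Build a lookup from each code back to its original label.
code_to_label = dict(enumerate(classes.cat.categories))

print(codes.tolist())
print(code_to_label)
```

Keeping such a lookup makes it possible to translate model outputs or feature values back to readable class names, and the -1 code is worth handling explicitly before modelling.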
