# Categorical Data
Categorical data takes values from a finite set of categories.

Two types of categorical data:
- Ordinal data: the categories have an inherent order, which should be retained while preparing the data
- Nominal data: the categories have no order; only the presence or absence of a category matters

Encoding categorical data is an important part of feature engineering.

In [1]:
import category_encoders as ce
import pandas as pd

In [5]:
train_df = pd.DataFrame(
    {
        'Degree': ['High school', 'Masters', 'Diploma', 'Bachelors', 'Bachelors', 'Masters', 'Phd', 'High school', 'High school']
    }
)
train_df

Unnamed: 0,Degree
0,High school
1,Masters
2,Diploma
3,Bachelors
4,Bachelors
5,Masters
6,Phd
7,High school
8,High school


## Ordinal Encoding
- Used when the categories have an inherent (or natural) order
- Each category is given a weight according to that order, by mapping it to a number

In [7]:
# create object of Ordinal Encoding
encoder = ce.OrdinalEncoder(
    cols      = ['Degree'],
    return_df = True,
    mapping   = [{
        'col': 'Degree',
        'mapping': {
            'None'          : 0,
            'High school'   : 1,
            'Diploma'       : 2,
            'Bachelors'     : 3,
            'Masters'       : 4,
            'Phd'           : 5
        }
    }]
)

# Original Data
train_df

Unnamed: 0,Degree
0,High school
1,Masters
2,Diploma
3,Bachelors
4,Bachelors
5,Masters
6,Phd
7,High school
8,High school


In [8]:
# fit and transform train data
df_train_transformed = encoder.fit_transform(train_df)
df_train_transformed

Unnamed: 0,Degree
0,1
1,4
2,2
3,3
4,3
5,4
6,5
7,1
8,1
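The mapping step itself is simple enough to write by hand. The sketch below is a minimal plain-Python version of the same idea, not a description of what `category_encoders` does internally. The names `degree_order` and `encode_degree` are illustrative, and returning -1 for an unseen category is an assumption made for this example.

```python
# Explicit order of the categories, lowest to highest.
degree_order = ['None', 'High school', 'Diploma', 'Bachelors', 'Masters', 'Phd']

# Map each category to its position in the order.
degree_to_rank = {degree: rank for rank, degree in enumerate(degree_order)}

def encode_degree(value):
    """Return the rank of a degree, or -1 if the category was never listed."""
    return degree_to_rank.get(value, -1)

degrees = ['High school', 'Masters', 'Diploma', 'Bachelors', 'Phd']
encoded = [encode_degree(d) for d in degrees]
print(encoded)
```

Because the order is written out explicitly, the encoded numbers can be compared meaningfully (for example, 'Masters' is greater than 'Diploma').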


## One Hot Encoding

- Used when features are nominal (i.e., the categories don't have any order)
- For each category, create a new variable (column), called a **Dummy Variable**, indicating the presence (1) or absence (0) of that category
- The number of dummy variables depends on the number of levels (categories) in the categorical variable
- Uses N variables for N categories

In [9]:
data = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad', 'Bengalore', 'Delhi']
})
data

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad
7,Bengalore
8,Delhi


In [10]:
# create object for one-hot encoding
encoder = ce.OneHotEncoder(cols='City', handle_unknown='return_nan', return_df=True, use_cat_names=True)

# fit and transform data
data_encoded = encoder.fit_transform(data)
data_encoded

Unnamed: 0,City_Delhi,City_Mumbai,City_Hyderabad,City_Chennai,City_Bangalore,City_Bengalore
0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0
5,1.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,1.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,1.0
8,1.0,0.0,0.0,0.0,0.0,0.0
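Note that in the output above the misspelled 'Bengalore' gets its own column, separate from 'Bangalore': the encoder treats every distinct string as a category. The plain-Python sketch below shows the same mechanics on a shortened list; it is an illustration only, and the variable names are not part of any library.

```python
cities = ['Delhi', 'Mumbai', 'Bangalore', 'Delhi', 'Bengalore']

# Categories in order of first appearance (dict preserves insertion order).
categories = list(dict.fromkeys(cities))

# One row per value, one 0/1 column per category.
one_hot = [[1 if city == cat else 0 for cat in categories] for city in cities]

print(categories)
for city, row in zip(cities, one_hot):
    print(city, row)
```

Cleaning up spelling variants before encoding keeps them from turning into spurious extra columns.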


## Dummy Encoding
- Improvement over One-Hot Encoding
- Used for nominal data
- Transforms categorical variables into binary variables
- Uses N-1 features to represent N labels/categories

*Note:* 
- Both One-Hot Encoding and Dummy Encoding become inefficient when a categorical variable has a large number of levels/categories, since they make the dataset much wider and put pressure on the learning algorithm
- One-Hot Encoding can lead to the dummy variable trap, where the dummy columns are perfectly collinear: the value of any one column can be predicted from the others. Dropping one column, as Dummy Encoding does, avoids this
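
The redundancy behind the dummy variable trap can be checked directly. The sketch below is plain Python with illustrative variable names: it builds a full N-column one-hot encoding and shows that one column can be rebuilt from the remaining ones.

```python
cities = ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']
categories = sorted(set(cities))

# Full one-hot encoding: N columns for N categories.
one_hot = [[1 if city == cat else 0 for cat in categories] for city in cities]

# Every row sums to 1, so any column is fully determined by the others.
row_sums = [sum(row) for row in one_hot]

# Rebuild the first column from the remaining ones.
first_col = [row[0] for row in one_hot]
rebuilt = [1 - sum(row[1:]) for row in one_hot]

print(row_sums)
print(first_col == rebuilt)
```

Since the first column carries no extra information, it can be dropped, which leaves N-1 columns.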


In [12]:
data = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']
})
data

Unnamed: 0,City
0,Delhi
1,Mumbai
2,Hyderabad
3,Chennai
4,Bangalore
5,Delhi
6,Hyderabad


In [14]:
# encode the data; the first level, 'Bangalore', is dropped and represented by all zeros
data_encoded = pd.get_dummies(data=data, drop_first=True)
data_encoded

Unnamed: 0,City_Chennai,City_Delhi,City_Hyderabad,City_Mumbai
0,0,1,0,0
1,0,0,0,1
2,0,0,1,0
3,1,0,0,0
4,0,0,0,0
5,0,1,0,0
6,0,0,1,0


## Effect Encoding
- Improvement over Dummy Encoding
- Also known as **Deviation Encoding** or **Sum Encoding**
- Same as Dummy Encoding, except that the row of all **0**s, which represents the reference group, is replaced by a row of all **-1**s
- Hence, Effect Encoding uses three values: 1, 0, and -1

In [16]:
data = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']
})

encoder = ce.SumEncoder(cols=['City'], verbose=False)
print(data)

encoder.fit_transform(data)

        City
0      Delhi
1     Mumbai
2  Hyderabad
3    Chennai
4  Bangalore
5      Delhi
6  Hyderabad


Unnamed: 0,intercept,City_0,City_1,City_2,City_3
0,1,1.0,0.0,0.0,0.0
1,1,0.0,1.0,0.0,0.0
2,1,0.0,0.0,1.0,0.0
3,1,0.0,0.0,0.0,1.0
4,1,-1.0,-1.0,-1.0,-1.0
5,1,1.0,0.0,0.0,0.0
6,1,0.0,0.0,1.0,0.0
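
The relationship to Dummy Encoding can be sketched in plain Python. This is an illustration rather than the library's implementation: `effect_encode` is a hypothetical helper, and treating the last category as the reference group is an assumption chosen to match the output above, where 'Bangalore' is the all -1 row.

```python
cities = ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore']

# Assumption for this sketch: the last category is the reference group.
*kept, reference = list(dict.fromkeys(cities))

def effect_encode(city):
    """Return N-1 values: a 1/0 indicator row, or all -1 for the reference."""
    if city == reference:
        return [-1] * len(kept)
    return [1 if city == cat else 0 for cat in kept]

for city in cities:
    print(city, effect_encode(city))
```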


## Hash Encoding
### Hashing
- Transformation of an input of arbitrary size into a fixed-size value
- Hashing algorithms are used to generate the hash value of an input
- It's a one-way process, i.e., the original input cannot be recovered from its hash
- Some hashing algorithms: the MD family (MD2, MD5) and the SHA family (SHA-0, SHA-1, SHA-2), etc.
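
The fixed-size property is easy to see with Python's standard `hashlib` module. The snippet below is a small illustration; the input strings are arbitrary.

```python
import hashlib

inputs = ['a', 'Hyderabad', 'a much longer input string ' * 10]

# Hex digest of each input; MD5 always yields 128 bits (32 hex characters).
digests = [hashlib.md5(text.encode('utf-8')).hexdigest() for text in inputs]

for text, digest in zip(inputs, digests):
    print(len(text), len(digest), digest)
```

Inputs of 1, 9, and 270 characters all produce digests of the same length.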

- Like One-Hot Encoding, Hash Encoding represents categorical features using new dimensions
- The number of dimensions can be fixed using the **n_components** argument
- By default, it uses the md5 hashing algorithm

- Since hashing maps the data into fewer dimensions, it may lead to loss of information
- It may cause collisions, as multiple values may be represented by the same hash

In [17]:
data = pd.DataFrame({
    'Month': [
        'January', 'April', 'March', 'April', 'February', 'June', 'July', 'June', 'September'
    ]
})

print(data)

# create object for hash encoder
encoder = ce.HashingEncoder(cols='Month', n_components=6)

# fit and transform the data
encoder.fit_transform(data)

       Month
0    January
1      April
2      March
3      April
4   February
5       June
6       July
7       June
8  September


Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5
0,0,0,0,0,1,0
1,0,0,0,1,0,0
2,0,0,0,0,1,0
3,0,0,0,1,0,0
4,0,0,0,0,1,0
5,0,1,0,0,0,0
6,1,0,0,0,0,0
7,0,1,0,0,0,0
8,0,0,0,0,1,0
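
In the output above, January, March, February, and September all land in `col_4`, which is a collision. The sketch below illustrates the general idea of hashing a value and taking the result modulo the number of columns. `hash_bucket` is a hypothetical helper, and the column assignments it produces are not expected to match those of `ce.HashingEncoder`, whose internals may differ.

```python
import hashlib

def hash_bucket(value, n_components):
    """Map a string to one of n_components column indices via MD5."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_components

months = ['January', 'April', 'March', 'February', 'June', 'July', 'September']
n_components = 6

buckets = {month: hash_bucket(month, n_components) for month in months}
print(buckets)

# Seven distinct values in six columns: at least two must share a column.
n_used = len(set(buckets.values()))
print(n_used, 'columns used for', len(months), 'values')
```

Increasing `n_components` makes collisions less likely at the cost of more columns.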


## Binary Encoding

- It's a combination of Hash Encoding and One-Hot Encoding
- The categorical feature is first converted to numbers using an ordinal encoder
- The numbers are then converted to their binary form
- After that, the binary digits are split into separate columns

- Works well when there are a large number of categories

In [19]:
data = pd.DataFrame({
    'City': [
        'Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad', 'Mumbai', 'Agra'
    ]
})
print(data)

# create object for binary encoding
encoder = ce.BinaryEncoder(cols=['City'], return_df = True)

# fit and transform data
data_encoded = encoder.fit_transform(data)
data_encoded

        City
0      Delhi
1     Mumbai
2  Hyderabad
3    Chennai
4  Bangalore
5      Delhi
6  Hyderabad
7     Mumbai
8       Agra


Unnamed: 0,City_0,City_1,City_2
0,0,0,1
1,0,1,0
2,0,1,1
3,1,0,0
4,1,0,1
5,0,0,1
6,0,1,1
7,0,1,0
8,1,1,0
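
The three steps can be reproduced by hand. The sketch below is plain Python and assumes ordinal codes start at 1 in order of first appearance, which is consistent with the output above (Delhi is `0 0 1`, Agra is `1 1 0`); `binary_encode` is an illustrative helper, not a library function.

```python
cities = ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore',
          'Delhi', 'Hyderabad', 'Mumbai', 'Agra']

# Step 1: ordinal codes starting at 1, in order of first appearance.
codes = {city: i for i, city in enumerate(dict.fromkeys(cities), start=1)}

# Step 2: number of binary digits needed for the largest code.
n_bits = max(codes.values()).bit_length()

def binary_encode(city):
    """Step 3: write the code in binary and split the digits into columns."""
    return [int(bit) for bit in format(codes[city], f'0{n_bits}b')]

for city in codes:
    print(city, binary_encode(city))
```

Six categories need only three columns here, compared with six for One-Hot Encoding.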


## Base N Encoding

- Extended form of Binary Encoding
- Binary Encoding uses base 2, converting each value into its binary form
- When there are many categories and Binary Encoding still produces too many columns, a larger base such as 4 or 8 can be used

In [22]:
data = pd.DataFrame({
    'City': ['Delhi', 'Mumbai','Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad', 'Mumbai', 'Agra']
})
print(data)

# create an object for Base N Encoder
encoder = ce.BaseNEncoder(cols=['City'], return_df=True, base=5) # base 5 is also known as the quinary system

# fit and transform data
data_encoded = encoder.fit_transform(data)
data_encoded

        City
0      Delhi
1     Mumbai
2  Hyderabad
3    Chennai
4  Bangalore
5      Delhi
6  Hyderabad
7     Mumbai
8       Agra


Unnamed: 0,City_0,City_1
0,0,1
1,0,2
2,0,3
3,0,4
4,1,0
5,0,1
6,0,3
7,0,2
8,1,1
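
The digit conversion behind this output can be sketched in plain Python. Both helpers below are illustrative and not part of `category_encoders`; the sketch assumes ordinal codes 1 to 6, which is consistent with the output above (Bangalore is `1 0`, Agra is `1 1`).

```python
def digits_needed(n_categories, base):
    """Smallest number of digits that can represent codes 1..n_categories."""
    width = 1
    while base ** width <= n_categories:
        width += 1
    return width

def to_base_n(number, base, width):
    """Digits of a non-negative integer in the given base, zero-padded to width."""
    digits = []
    while number > 0:
        number, remainder = divmod(number, base)
        digits.append(remainder)
    digits.extend([0] * (width - len(digits)))
    return digits[::-1]

width = digits_needed(6, base=5)
for code in range(1, 7):
    print(code, to_base_n(code, base=5, width=width))

# Column counts for 1000 categories at different bases.
print(digits_needed(1000, base=2), digits_needed(1000, base=8))
```

A larger base needs fewer columns, but each column then holds more distinct values.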


## Target Encoding

- It's a Bayesian encoding technique
- Uses information from the dependent/target variable to encode the categorical data
- Calculate the mean of the target variable for each category and replace the category with that mean value
- For a categorical target variable, each category is replaced by the posterior probability of the target

In [23]:
data = pd.DataFrame({
    'class': ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A'],
    'Marks': [50, 30, 70, 80, 45, 97, 80, 68]
})
print(data)

# create target encoding object
encoder = ce.TargetEncoder(cols='class')

# fit and Transform Train Data
data_encoded = encoder.fit_transform(data['class'], data['Marks'])
data_encoded

  class  Marks
0     A     50
1     B     30
2     C     70
3     B     80
4     C     45
5     A     97
6     A     80
7     A     68


Unnamed: 0,class
0,73.335024
1,57.689414
2,59.517061
3,57.689414
4,59.517061
5,73.335024
6,73.335024
7,73.335024
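
The values above are close to, but not equal to, the raw per-class means (73.75 for A, 55 for B, 57.5 for C), because the encoder blends each class mean with the overall mean of 65. The plain-Python sketch below uses a sigmoid weighting with `min_samples_leaf=1` and `smoothing=1.0`. These settings are an assumption: they reproduce the numbers printed above, but the defaults may differ between `category_encoders` versions. `smoothed_mean` is an illustrative helper.

```python
import math
from collections import defaultdict

classes = ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A']
marks = [50, 30, 70, 80, 45, 97, 80, 68]

# Group the target values by category.
groups = defaultdict(list)
for cls, mark in zip(classes, marks):
    groups[cls].append(mark)

global_mean = sum(marks) / len(marks)
raw_means = {cls: sum(v) / len(v) for cls, v in groups.items()}

def smoothed_mean(values, prior, min_samples_leaf=1, smoothing=1.0):
    """Blend the category mean with the prior; the weight grows with the count."""
    weight = 1 / (1 + math.exp(-(len(values) - min_samples_leaf) / smoothing))
    return prior * (1 - weight) + (sum(values) / len(values)) * weight

smoothed = {cls: smoothed_mean(v, global_mean) for cls, v in groups.items()}
print(raw_means)
print(smoothed)
```

Class A has four rows, so its encoded value stays close to its own mean; B and C have two rows each and are pulled further toward 65.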


### Issues

Target encoding can lead to overfitting. Methods to deal with it:
- Leave-One-Out Encoding: the current row's target value is excluded when computing the category's target mean, to avoid leakage
- Adding Gaussian noise to the target statistics; the amount of noise is a hyperparameter of the model

An uneven distribution of categories between the training and test data can also cause categories to take extreme values:
- To counter this, the target mean for each category is blended with the marginal (overall) mean of the target
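
Leave-one-out encoding can be sketched in plain Python on the same data. This is a minimal illustration, not the library's implementation: `leave_one_out` is a hypothetical helper, and falling back to the overall mean for a category with a single row is an assumption made for this example.

```python
from collections import defaultdict

classes = ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A']
marks = [50, 30, 70, 80, 45, 97, 80, 68]

# Per-category sum and count of the target.
sums = defaultdict(float)
counts = defaultdict(int)
for cls, mark in zip(classes, marks):
    sums[cls] += mark
    counts[cls] += 1

global_mean = sum(marks) / len(marks)

def leave_one_out(cls, mark):
    """Mean of the category's targets, excluding the current row."""
    if counts[cls] == 1:
        return global_mean  # assumption: no other rows to average over
    return (sums[cls] - mark) / (counts[cls] - 1)

loo = [leave_one_out(c, m) for c, m in zip(classes, marks)]
print(loo)
```

Each row is encoded without using its own target value, so rows of the same class get different encoded values.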

## END