In [None]:
!pip install category-encoders

# Categorical data in Machine learning
Categorical data consists of categories or labels rather than numeric measurements, and it appears in many domains, including machine learning. Categorical data can be divided into two types.
1. Ordinal
2. Nominal



# Nominal Data

Nominal data is made up of categories or labels that have no inherent order or ranking.  

*   Colours, animal species, and countries are examples of nominal variables.
*   No category is "greater" or "smaller" than another; they are simply different categories.
*   Another example is the city where a person lives. Delhi and Kolkata should be given equal weight; neither ranks above the other.


# Ordinal Data
Ordinal data represents categories in a logical order or ranking, although the intervals between them are not always equal.

For example:
* Education levels such as "High School," "Bachelor's Degree," and "Master's Degree" form an ordinal variable.
* You can arrange them in order of education level, but you cannot claim that the difference between "High School" and "Bachelor's Degree" is the same as the difference between "Bachelor's Degree" and "Master's Degree."
* When encoding ordinal categorical data, the encoding must preserve the order of the categories.
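
The ordering property can be sketched in plain Python. The rank values below are an assumption chosen for illustration; only their relative order carries meaning, not the size of the gaps between them.

```python
# Hypothetical rank mapping for an ordinal variable. Only the order matters;
# the gaps between the integers carry no meaning.
education_rank = {"High School": 0, "Bachelor's Degree": 1, "Master's Degree": 2}

people = ["Master's Degree", "High School", "Bachelor's Degree", "High School"]

# Sorting by rank respects the logical order of the categories.
sorted_people = sorted(people, key=education_rank.get)
print(sorted_people)

# Comparisons are meaningful for ordinal data (they would not be for nominal data).
is_higher = education_rank["Master's Degree"] > education_rank["High School"]
print(is_higher)
```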





# What is Encoding?

Encoding converts data from one form to another so that the information it carries is in a form suitable for a machine learning model.

Handling categorical data is critical in machine learning since most machine learning methods require numerical input. Here are some common categorical data encoding strategies:


1. Label Encoding
2. One-Hot Encoding
3. Binary Encoding
4. Frequency Encoding
5. Embedding Layers (for Deep Learning)
6. Target Encoding
7. Hash Encoding
8. Base-N encoding
9. Effect Encoding

# Label Encoding
* This encoding is intended for ordinal variables. Each label is converted to an integer, and the integers reflect the order of the labels.
* You assign integers to the categories according to their order. For example, assign 1 to "Low," 2 to "Medium," and 3 to "High."
* This method preserves the order, but the integers imply equally spaced categories, which is not always the case.

In [None]:
import category_encoders as ce
import pandas as pd

In [None]:
dataset=pd.DataFrame({'height':['medium','short','tall','medium','short','tall','medium','short','tall']})

In [None]:
dataset

Unnamed: 0,height
0,medium
1,short
2,tall
3,medium
4,short
5,tall
6,medium
7,short
8,tall


In [None]:
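# create an ordinal encoder with an explicit category-to-integer mapping
encoder = ce.OrdinalEncoder(cols=['height'], return_df=True, mapping=[{'col': 'height', 'mapping': {'None': 0, 'tall': 1, 'medium': 2, 'short': 3}}])
# fit_transform returns a DataFrame; keep its encoded 'height' column
dataset['encoded'] = encoder.fit_transform(dataset)['height']
dataset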

Unnamed: 0,height,encoded
0,medium,2
1,short,3
2,tall,1
3,medium,2
4,short,3
5,tall,1
6,medium,2
7,short,3
8,tall,1


# One Hot Encoding

One-hot encoding is a popular technique for nominal data. It creates one binary (0 or 1) column per category, and for each observation only the column for its category is "hot" (set to 1). Because every category gets its own column, the encoding does not impose any order on the categories.

In other words, each category is mapped to a vector of 0s and 1s that marks which category is present. There are as many distinct vectors as categories, and each vector has one entry per category.

In [None]:
import category_encoders as ce
import pandas as pd

In [None]:
dataset=pd.DataFrame({'City': ['Bangalore','Delhi','Hydrabad','Bangalore','Delhi','Delhi','Mumbai','Hydrabad','Chennai']})

In [None]:
dataset

Unnamed: 0,City
0,Bangalore
1,Delhi
2,Hydrabad
3,Bangalore
4,Delhi
5,Delhi
6,Mumbai
7,Hydrabad
8,Chennai


In [None]:
# Create an object for one hot encoding
encoder = ce.OneHotEncoder(cols='City', handle_unknown='nan', return_df=True, use_cat_names=True)

In [None]:
data_encoded = encoder.fit_transform(dataset)
data_encoded

Unnamed: 0,City_Bangalore,City_Delhi,City_Hydrabad,City_Mumbai,City_Chennai
0,1,0,0,0,0
1,0,1,0,0,0
2,0,0,1,0,0
3,1,0,0,0,0
4,0,1,0,0,0
5,0,1,0,0,0
6,0,0,0,1,0
7,0,0,1,0,0
8,0,0,0,0,1


# Binary Encoding

Binary encoding is a compromise between label encoding and one-hot encoding. It first assigns a unique integer to each category and then converts that integer into binary code. This reduces the dimensionality compared to one-hot encoding, but the bit patterns still reflect the integer assigned to each category.

The technique works in three steps: the categories are encoded as integers, each integer is converted to binary, and the digits of that binary string are split into separate columns.

This technique is appropriate when there are many categories and you do not want to add as many dimensions as one-hot encoding would. The 12 distinct categories in the example below fit in 4 columns instead of 12.
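
The three steps can be sketched by hand in plain Python. Starting the integer codes at 1 and numbering categories in order of first appearance are assumptions that mirror the encoder output shown in this section; a library may number categories differently.

```python
values = ["a", "b", "a", "h", "i"]

# Step 1: ordinal codes starting at 1, in order of first appearance.
codes = {}
for v in values:
    codes.setdefault(v, len(codes) + 1)

# Step 2: number of binary digits needed for the largest code.
width = max(codes.values()).bit_length()

# Step 3: one column per binary digit, most significant digit first.
encoded = [[int(bit) for bit in format(codes[v], f"0{width}b")] for v in values]

for v, row in zip(values, encoded):
    print(v, row)
```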

In [None]:
# data
dataset = pd.DataFrame({
    'category': ['a', 'b', 'a',  'h', 'i', 's', 'p', 'b', 'd', 'e', 'd', 'f', 'g', 'h', 'h', 'k','z']
})

In [None]:
dataset

Unnamed: 0,category
0,a
1,b
2,a
3,h
4,i
5,s
6,p
7,b
8,d
9,e


In [None]:
# create an object of BinaryEncoder
ce_binary = ce.BinaryEncoder(cols = ['category'])

In [None]:
# fit and transform and you will get the encoded data
print(ce_binary.fit_transform(dataset))

    category_0  category_1  category_2  category_3
0            0           0           0           1
1            0           0           1           0
2            0           0           0           1
3            0           0           1           1
4            0           1           0           0
5            0           1           0           1
6            0           1           1           0
7            0           0           1           0
8            0           1           1           1
9            1           0           0           0
10           0           1           1           1
11           1           0           0           1
12           1           0           1           0
13           0           0           1           1
14           0           0           1           1
15           1           0           1           1
16           1           1           0           0


# Target Encoding

Target encoding replaces each categorical value with the mean of the target variable for that category.

It belongs to the family of Bayesian encoders, which use the target variable to encode the categorical values.

Target encoding, also known as mean encoding or likelihood encoding, works for both regression targets (the category mean of the target) and binary classification targets (the share of positive labels in the category). The main idea is to convert categorical values into numerical representations that capture the relationship between the categorical feature and the target variable. Implementations usually smooth each category mean toward the global mean, which is why the encoded values in the output below are not the raw per-category means.
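
For the sample data in this section, the raw per-category means of marks are 376 (alex), 205 (john), and 393.5 (mary). The sketch below shows one way a smoothed target encoding can be computed by hand. It assumes a sigmoid weighting between the category mean and the global mean, with hypothetical hyperparameters `min_samples_leaf = 20` and `smoothing = 10`. These values are not stated in the text; they are chosen because they reproduce the encoded values in this section's output.

```python
import math
from collections import defaultdict

names = ["alex", "john", "mary", "alex", "john", "mary"]
marks = [102, 240, 307, 650, 170, 480]

# Assumed smoothing hyperparameters (not stated in the text).
min_samples_leaf = 20
smoothing = 10

global_mean = sum(marks) / len(marks)

# Collect the target values of each category.
groups = defaultdict(list)
for name, mark in zip(names, marks):
    groups[name].append(mark)

encoding = {}
for name, values in groups.items():
    n = len(values)
    category_mean = sum(values) / n
    # Sigmoid weight: near 0 for rare categories, near 1 for frequent ones.
    weight = 1 / (1 + math.exp(-(n - min_samples_leaf) / smoothing))
    # Blend the category mean with the global mean.
    encoding[name] = global_mean * (1 - weight) + category_mean * weight

encoded = [encoding[name] for name in names]
print(encoded)
```

With only two rows per category the weight is small, so every encoded value stays close to the global mean of about 324.83.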

In [None]:
df=pd.DataFrame({'name':[
'alex','john','mary','alex','john','mary'
],'marks' : [102,240,307,650,170,480]})

In [None]:
df

Unnamed: 0,name,marks
0,alex,102
1,john,240
2,mary,307
3,alex,650
4,john,170
5,mary,480


In [None]:
encoder=ce.TargetEncoder(cols='name')

In [None]:
encoder.fit_transform(df['name'],df['marks'])

Unnamed: 0,name
0,332.091379
1,307.834847
2,334.573773
3,332.091379
4,307.834847
5,334.573773


# Hash Encoding
Hashing transforms a string of characters into a usually shorter, fixed-length value using an algorithm, and that value stands in for the original string.

If you set the encoder's n_components hyperparameter to 5, then regardless of the length of the string or the number of categories, the encoder maps each value to one of 5 columns. Distinct categories can land in the same column (a collision), as Blue and Black do in the output below.

Hash encoding, also known as the hashing trick or feature hashing, is a technique used in machine learning for encoding categorical variables into numerical values. It is particularly useful for high-cardinality categorical features, where the number of unique categories is large. Hash encoding reduces the dimensionality of the categorical variable while still providing a numerical representation for machine learning models.
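
The idea can be sketched with the standard library alone. The helper name `hash_encode` and the use of MD5 are assumptions for this illustration, so the column each colour lands in need not match what any particular encoder produces.

```python
import hashlib

def hash_encode(value, n_components=5):
    """Return a 0/1 vector with a single 1 at the hashed column index."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    # Reduce the large hash integer to one of n_components columns.
    index = int(digest, 16) % n_components
    return [1 if i == index else 0 for i in range(n_components)]

colors = ["Blue", "Blue", "Green", "Black", "Yellow"]
encoded = [hash_encode(c) for c in colors]

for color, row in zip(colors, encoded):
    print(color, row)
```

The same string always maps to the same column, and the number of columns stays fixed no matter how many distinct categories appear.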

In [None]:
data = pd.DataFrame({
    'color' : [ 'Blue', 'Blue', 'Green', 'Black', 'Blue','Yellow', 'Black', 'Green']
})

In [None]:
# create an object of the HashingEncoder
hashed = ce.HashingEncoder(cols=['color'],n_components=5)
# fit and transform and you will get the encoded data
hashed.fit_transform(data)

Unnamed: 0,col_0,col_1,col_2,col_3,col_4
0,0,0,0,1,0
1,0,0,0,1,0
2,0,0,0,0,1
3,0,0,0,1,0
4,0,0,0,1,0
5,0,1,0,0,0
6,0,0,0,1,0
7,0,0,0,0,1


# Base-N encoding

Binary encoding converts the ordinal integers to base 2. Base-N encoding generalizes this to any base: a higher base needs fewer columns for the same number of categories, which can be useful when you have many categories.
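
The digit conversion behind this can be sketched as follows. The helper `to_base_digits` is a hypothetical name introduced for illustration, not part of any library.

```python
def to_base_digits(number, base, width):
    """Return the digits of a non-negative integer in `base`, left-padded to `width`."""
    digits = []
    while number > 0:
        number, remainder = divmod(number, base)
        digits.append(remainder)
    # Pad with leading zeros so every category yields the same number of columns.
    digits.extend([0] * (width - len(digits)))
    return digits[::-1]

ordinal_code = 7
print(to_base_digits(ordinal_code, base=2, width=4))  # [0, 1, 1, 1]
print(to_base_digits(ordinal_code, base=4, width=2))  # [1, 3]
```

The same ordinal code needs four columns in base 2 but only two in base 4. The trade-off is that each column then holds values from 0 to 3 instead of 0 or 1.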

In [None]:
# make some data
data = pd.DataFrame({
 'class' : ['h', 'h', 'k', 'h', 'i', 's', 'p', 'z','a', 'b', 'a', 'b', 'd', 'e', 'd', 'f', 'g']})
# create an object of the BaseNEncoder
ce_baseN4 = ce.BaseNEncoder(cols=['class'],base=4)
# fit and transform and you will get the encoded data
ce_baseN4.fit_transform(data)

Unnamed: 0,class_0,class_1
0,0,1
1,0,1
2,0,2
3,0,1
4,0,3
5,1,0
6,1,1
7,1,2
8,1,3
9,2,0


# Effect Encoding

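Effect encoding (also called sum coding or deviation coding) represents categories with the values -1, 0, and 1. It resembles one-hot encoding with one column fewer: one category serves as the reference, and its rows are filled with -1 in every column instead of getting a column of their own. In the output below, Bangalore is the reference category.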
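
A by-hand sketch of this scheme follows. Treating the last category to appear as the reference is an assumption that mirrors the output in this section; other implementations may pick the reference differently.

```python
cities = ["Delhi", "Mumbai", "Hyderabad", "Chennai", "Bangalore", "Delhi", "Hyderabad"]

# Categories in order of first appearance.
categories = list(dict.fromkeys(cities))

# Assumption: the last category seen is the reference and gets no column.
reference = categories[-1]
indicator_categories = categories[:-1]

def effect_encode(city):
    """Return the effect-coded row for one city."""
    if city == reference:
        # The reference category is marked with -1 in every column.
        return [-1] * len(indicator_categories)
    return [1 if city == cat else 0 for cat in indicator_categories]

encoded = [effect_encode(city) for city in cities]

for city, row in zip(cities, encoded):
    print(city, row)
```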

In [None]:
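dataset = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Hyderabad', 'Chennai', 'Bangalore', 'Delhi', 'Hyderabad']})
# create a sum (effect) encoder for the City column
encoder = ce.SumEncoder(cols=['City'], verbose=True)
# fit and transform to get the effect-coded columns
df = encoder.fit_transform(dataset)
print(df)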

   intercept  City_0  City_1  City_2  City_3
0          1     1.0     0.0     0.0     0.0
1          1     0.0     1.0     0.0     0.0
2          1     0.0     0.0     1.0     0.0
3          1     0.0     0.0     0.0     1.0
4          1    -1.0    -1.0    -1.0    -1.0
5          1     1.0     0.0     0.0     0.0
6          1     0.0     0.0     1.0     0.0




# Frequency Encoding

Frequency encoding uses the frequency of each category in the dataset as its encoded value, so more frequent categories receive higher values.

This is useful when how often a category occurs is related to the target variable: the model can then learn a weight that grows or shrinks with the frequency, depending on the nature of the data. Categories with the same frequency receive the same value and become indistinguishable, as Delhi and Hyderabad do in the example below.
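
The same computation can be sketched with the standard library. Mapping categories that were not seen during fitting to 0.0 is an assumption made for this illustration, and "Pune" is a hypothetical unseen value.

```python
from collections import Counter

train = ["Delhi", "Mumbai", "Hyderabad", "Chennai", "Bangalore", "Delhi", "Hyderabad"]

# Relative frequency of each category in the training data.
counts = Counter(train)
frequency = {city: count / len(train) for city, count in counts.items()}

# Assumption: categories not seen during fitting are encoded as 0.0.
new_data = ["Delhi", "Pune"]
encoded = [frequency.get(city, 0.0) for city in new_data]
print(encoded)
```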



In [None]:
df=pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad']})
fe = df.groupby('City').size()/len(df)
df.loc[:,'encoded']= df['City'].map(fe)
df

Unnamed: 0,City,encoded
0,Delhi,0.285714
1,Mumbai,0.142857
2,Hyderabad,0.285714
3,Chennai,0.142857
4,Bangalore,0.142857
5,Delhi,0.285714
6,Hyderabad,0.285714


# A/B testing
A/B testing is a statistical way of comparing two (or more) techniques, the A and the B. Typically, the A is an existing technique, and the B is a new technique. A/B testing not only determines which technique performs better but also whether the difference is statistically significant.

A/B testing usually compares a single metric on two techniques; for example, how does model accuracy compare for two techniques? However, A/B testing can also compare any finite number of metrics.
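
As a sketch of the significance check, the following compares the accuracy of two models with a two-proportion z-test. The counts are hypothetical, and the test assumes the two models were evaluated on independent samples; a paired test would fit better if both were scored on the same examples.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference between two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF expressed through the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical results: A is correct on 820 of 1000 examples, B on 850 of 1000.
z, p_value = two_proportion_z_test(820, 1000, 850, 1000)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

With these made-up counts, B scores higher than A, but the p-value is above 0.05, so the difference would not count as significant at that level.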