# Encoding

![Encoding.png](attachment:Encoding.png)

# Categorical Variables
- Categorical variables appear in most datasets. They assign each observation to one of a fixed set of groups (for example, a species or an island), so observations in the same group share the same label.



## Why encoding is important

- In real-world data, categorical (qualitative) values usually arrive as strings. Most machine learning algorithms operate on numbers, so these categories have to be converted to numeric form first.
- Nominal-level data needs particular care because its categories are unordered. Mapping them to integers such as 0, 1, or 2 suggests an order that does not exist and can hurt the effectiveness of the model.
- Nominal values can instead be encoded by creating one additional binary feature per category, indicating whether an observation belongs to that category.
- Likewise, when encoding ordinal-level data, the integers must follow the category order: 0 for the lowest level, 1 for the next level up, and so on.
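
The difference between the two options for nominal data can be seen on a small example. This is a minimal sketch on a toy column (the `islands` series and the integer mapping are made up for illustration; only the island names come from the dataset used below):

```python
import pandas as pd

# Toy nominal column (illustrative only).
islands = pd.Series(["Torgersen", "Biscoe", "Dream", "Biscoe"], name="Island")

# Arbitrary integer codes: a model would see Torgersen (2) > Dream (1) > Biscoe (0),
# an ordering that has no meaning for a nominal feature.
integer_codes = islands.map({"Biscoe": 0, "Dream": 1, "Torgersen": 2})

# One binary indicator column per category: no ordering is implied.
indicators = pd.get_dummies(islands, prefix="Island", dtype=int)

print(integer_codes.tolist())
print(indicators)
```

Each row of `indicators` contains exactly one 1, so membership is recorded without ranking the islands.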


In [13]:
!pip install category_encoders



In [51]:
# Importing libraries
import numpy as np
import pandas as pd
import sklearn
import category_encoders as ce

In [52]:
penguin_raw = pd.read_excel("ML101 Dataset_2 penguin_manipulated_data_set.xlsx") # reading the data set

In [53]:
penguin = penguin_raw.copy() # creating a copy to do modification

In [54]:
penguin.head()

Unnamed: 0,studyName,Sample Number,Species,Region,Island,Stage,Individual ID,Clutch Completion,Date Egg,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex,Delta 15 N (o/oo),Delta 13 C (o/oo),Comments
0,PAL0708,1.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A1,Yes,2007-11-11,,,181.0,3750.0,MALE,,,Not enough blood for isotopes.
1,PAL0708,2.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N1A2,Yes,2007-11-11,39.5,17.4,186.0,3800.0,FEMALE,8.94956,-24.69454,
2,PAL0708,3.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A1,Yes,2007-11-16,40.3,18.0,195.0,3250.0,FEMALE,8.36821,-25.33302,
3,PAL0708,4.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N2A2,Yes,2007-11-16,,,,.,,,,Adult not sampled.
4,PAL0708,5.0,Adelie Penguin (Pygoscelis adeliae),Anvers,Torgersen,"Adult, 1 Egg Stage",N3A1,Yes,2007-11-16,36.7,19.3,193.0,3450.0,FEMALE,8.76651,-25.32426,


In [55]:
# Dropping certain columns as they are not important
penguin = penguin.drop(['studyName', 'Sample Number', 'Stage', 'Region', 'Date Egg', 'Individual ID', 'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)', 'Comments'], axis = 1)

In [56]:
penguin.head()

Unnamed: 0,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,,,181.0,3750.0,MALE
1,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,,,,.,
4,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,36.7,19.3,193.0,3450.0,FEMALE


In [57]:
penguin.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Species              344 non-null    object 
 1   Island               344 non-null    object 
 2   Clutch Completion    344 non-null    object 
 3   Culmen Length (mm)   288 non-null    float64
 4   Culmen Depth (mm)    235 non-null    float64
 5   Flipper Length (mm)  337 non-null    float64
 6   Body Mass (g)        344 non-null    object 
 7   Sex                  334 non-null    object 
dtypes: float64(3), object(5)
memory usage: 21.6+ KB


In [58]:
penguin.nunique() # nunique function to count unique values in each column

Species                  3
Island                   3
Clutch Completion        2
Culmen Length (mm)     144
Culmen Depth (mm)       74
Flipper Length (mm)     55
Body Mass (g)           96
Sex                      3
dtype: int64

In [59]:
penguin["Species"].unique()  # Nominal feature 

array(['Adelie Penguin (Pygoscelis adeliae)',
       'Chinstrap penguin (Pygoscelis antarctica)',
       'Gentoo penguin (Pygoscelis papua)'], dtype=object)

In [60]:
penguin["Island"].unique()  #Nominal Feature

array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

In [61]:
penguin["Sex"].unique()  # Nominal feature

array(['MALE', 'FEMALE', nan, '.'], dtype=object)

In [72]:
penguin.loc[(penguin["Sex"]=="FEMALE") | (penguin["Sex"]=="MALE")].head() # use | (or) here, not & (and): no row is both FEMALE and MALE

Unnamed: 0,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,,,181.0,3750.0,MALE
1,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,40.3,18.0,195.0,3250.0,FEMALE
4,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,36.7,19.3,193.0,3450.0,FEMALE
5,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,,,190.0,3650.0,MALE


In [73]:
penguin = penguin.loc[(penguin["Sex"]=='MALE') | (penguin["Sex"]=='FEMALE')] # dropped values where Sex is other than Male and Female 

In [74]:
penguin

Unnamed: 0,Species,Island,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Sex
0,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,,,181.0,3750.0,MALE
1,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,40.3,18.0,195.0,3250.0,FEMALE
4,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,36.7,19.3,193.0,3450.0,FEMALE
5,Adelie Penguin (Pygoscelis adeliae),Torgersen,Yes,,,190.0,3650.0,MALE
...,...,...,...,...,...,...,...,...
338,Gentoo penguin (Pygoscelis papua),Biscoe,No,47.2,13.7,214.0,4925.0,FEMALE
340,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,46.8,14.3,215.0,4850.0,FEMALE
341,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,50.4,15.7,222.0,5750.0,MALE
342,Gentoo penguin (Pygoscelis papua),Biscoe,Yes,45.2,14.8,212.0,.,FEMALE


In [76]:
penguin.nunique()

Species                  3
Island                   3
Clutch Completion        2
Culmen Length (mm)     143
Culmen Depth (mm)       73
Flipper Length (mm)     54
Body Mass (g)           95
Sex                      2
dtype: int64

- There are many encoding approaches; here we discuss a few commonly used ones:
    - **One-hot encoding**
    - **Dummy encoding**
    - **Ordinal encoding**
    - **Label encoding**
    - **Binary encoding**

## Pandas get dummies


#### Nominal feature encoding

In [77]:
pd.get_dummies(penguin,columns=["Species","Island","Sex"])

Unnamed: 0,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Species_Adelie Penguin (Pygoscelis adeliae),Species_Chinstrap penguin (Pygoscelis antarctica),Species_Gentoo penguin (Pygoscelis papua),Island_Biscoe,Island_Dream,Island_Torgersen,Sex_FEMALE,Sex_MALE
0,Yes,,,181.0,3750.0,1,0,0,0,0,1,0,1
1,Yes,39.5,17.4,186.0,3800.0,1,0,0,0,0,1,1,0
2,Yes,40.3,18.0,195.0,3250.0,1,0,0,0,0,1,1,0
4,Yes,36.7,19.3,193.0,3450.0,1,0,0,0,0,1,1,0
5,Yes,,,190.0,3650.0,1,0,0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
338,No,47.2,13.7,214.0,4925.0,0,0,1,1,0,0,1,0
340,Yes,46.8,14.3,215.0,4850.0,0,0,1,1,0,0,1,0
341,Yes,50.4,15.7,222.0,5750.0,0,0,1,1,0,0,0,1
342,Yes,45.2,14.8,212.0,.,0,0,1,1,0,0,1,0


In [78]:
pd.get_dummies(penguin,columns=["Species","Island","Sex"], drop_first=True) 
# drop_first=True drops the first level of each encoded feature to avoid the dummy variable trap

# to drop a level other than the first, encode without drop_first and then use df.drop([column_name], axis=1)

Unnamed: 0,Clutch Completion,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Species_Chinstrap penguin (Pygoscelis antarctica),Species_Gentoo penguin (Pygoscelis papua),Island_Dream,Island_Torgersen,Sex_MALE
0,Yes,,,181.0,3750.0,0,0,0,1,1
1,Yes,39.5,17.4,186.0,3800.0,0,0,0,1,0
2,Yes,40.3,18.0,195.0,3250.0,0,0,0,1,0
4,Yes,36.7,19.3,193.0,3450.0,0,0,0,1,0
5,Yes,,,190.0,3650.0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...
338,No,47.2,13.7,214.0,4925.0,0,1,0,0,0
340,Yes,46.8,14.3,215.0,4850.0,0,1,0,0,0
341,Yes,50.4,15.7,222.0,5750.0,0,1,0,0,1
342,Yes,45.2,14.8,212.0,.,0,1,0,0,0


In [79]:
pd.get_dummies(penguin,columns=['Species', 'Clutch Completion', 'Sex'])

Unnamed: 0,Island,Culmen Length (mm),Culmen Depth (mm),Flipper Length (mm),Body Mass (g),Species_Adelie Penguin (Pygoscelis adeliae),Species_Chinstrap penguin (Pygoscelis antarctica),Species_Gentoo penguin (Pygoscelis papua),Clutch Completion_No,Clutch Completion_Yes,Sex_FEMALE,Sex_MALE
0,Torgersen,,,181.0,3750.0,1,0,0,0,1,0,1
1,Torgersen,39.5,17.4,186.0,3800.0,1,0,0,0,1,1,0
2,Torgersen,40.3,18.0,195.0,3250.0,1,0,0,0,1,1,0
4,Torgersen,36.7,19.3,193.0,3450.0,1,0,0,0,1,1,0
5,Torgersen,,,190.0,3650.0,1,0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
338,Biscoe,47.2,13.7,214.0,4925.0,0,0,1,1,0,1,0
340,Biscoe,46.8,14.3,215.0,4850.0,0,0,1,0,1,1,0
341,Biscoe,50.4,15.7,222.0,5750.0,0,0,1,0,1,0,1
342,Biscoe,45.2,14.8,212.0,.,0,0,1,0,1,1,0


- By default, `get_dummies` returns one column for every category, so one column per encoded feature can be removed before fitting an ML model.
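
Why one column is redundant can be checked directly. The following is a small sketch on a toy frame (the `toy` data is made up for illustration): when every level is kept, the indicator columns of a feature always sum to 1, so any one of them can be rebuilt from the rest.

```python
import pandas as pd

# Toy data (illustrative only).
toy = pd.DataFrame({"Sex": ["MALE", "FEMALE", "FEMALE", "MALE"]})

full = pd.get_dummies(toy, columns=["Sex"], dtype=int)                      # Sex_FEMALE, Sex_MALE
reduced = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=int)  # Sex_MALE only

# With all levels kept, each row sums to exactly 1.
row_sums = full.sum(axis=1)

# The dropped column carries no extra information: it is 1 minus the remaining one.
rebuilt_female = 1 - reduced["Sex_MALE"]

print(row_sums.tolist())
print(rebuilt_female.tolist())
```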

### Ordinal encoding
- Ordinal encoding converts a categorical variable into numbers by assigning each category a unique integer. Unlike one-hot encoding, it produces a single column rather than a binary matrix, and the integers are chosen so that they preserve the order of the categories.


In [83]:
speed = {'Car_speed' :['very high', 'high', 'medium', 'low', 'very low']}
df=pd.DataFrame(speed,columns=["Car_speed"])
temp_dict = {'very high': 4,'high': 3,'medium': 2,'low': 1,"very low":0}
df

Unnamed: 0,Car_speed
0,very high
1,high
2,medium
3,low
4,very low


In [84]:
df["speed_ordinal"] = df.Car_speed.map(temp_dict)
df

Unnamed: 0,Car_speed,speed_ordinal
0,very high,4
1,high,3
2,medium,2
3,low,1
4,very low,0
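
As an alternative to the manual `map`, the same mapping can be sketched with scikit-learn's `OrdinalEncoder`, assuming scikit-learn is installed. The names `df_speed` and `order` are illustrative; the important part is that the category order is passed explicitly, because the default order is alphabetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_speed = pd.DataFrame({"Car_speed": ["very high", "high", "medium", "low", "very low"]})

# Categories listed from lowest to highest; the position in the list becomes the code.
order = ["very low", "low", "medium", "high", "very high"]
ordinal_encoder = OrdinalEncoder(categories=[order])

# fit_transform expects a 2-D input and returns a 2-D float array, hence ravel() and astype(int).
encoded = ordinal_encoder.fit_transform(df_speed[["Car_speed"]])
df_speed["speed_ordinal"] = encoded.ravel().astype(int)

print(df_speed)
```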


## Label encoding
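### When the order is not important (codes are assigned in alphabetical order)
Label encoding assigns each category an integer according to its sorted (alphabetical) position, so the resulting numbers carry no meaningful order. It is a reasonable choice when the order of the codes does not matter, for example for target labels, which is what scikit-learn's `LabelEncoder` is designed for. For nominal input features, keep in mind that a model may interpret the integers as an order.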

 


In [87]:
from sklearn.preprocessing import LabelEncoder
import numpy as np

# Create a sample dataset
X = np.array(["red", "green", "blue", "green", "red"])
print(X)


['red' 'green' 'blue' 'green' 'red']


In [88]:
# Create an instance of the LabelEncoder class
encoder = LabelEncoder()

# Fit the encoder to the dataset
encoder.fit(X) #fit = calculate

# Transform the dataset into an array of integer codes
X_encoded = encoder.transform(X) # convert

print(X_encoded)

[2 1 0 1 2]
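
To see which integer was assigned to which label, the fitted encoder can be inspected. A minimal sketch, using the same colour values as above (the variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

colors = np.array(["red", "green", "blue", "green", "red"])

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(colors)  # fit and transform in one step

# classes_ holds the sorted unique labels; the index of each label is its code.
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
restored = label_encoder.inverse_transform(codes)
print(restored)
```

The mapping follows alphabetical order (`blue`, `green`, `red`), which is why `red` receives the largest code.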


## One-hot encoding (using sklearn) == Dummy encoding (using pandas)
- One-hot encoding converts a categorical variable into a binary matrix with one column per category: each row has a 1 in the column of its category and 0 everywhere else. It is commonly used in machine learning to turn categorical data into a numerical format that can be fed into a model.
- One-hot encoding increases the dimensionality of the data, which can contribute to the curse of dimensionality. The result is also sparse (mostly zeros), and some machine learning models may not perform well with sparse data.


In [None]:
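# Boolean mask of the columns stored with object (string) dtype
s = (penguin.dtypes == 'object')
cols = list(s[s].index)  # names of those columns
from sklearn.preprocessing import OneHotEncoder
# sparse_output=False returns a dense array (scikit-learn >= 1.2; older versions use sparse=False)
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)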

In [None]:
penguin_Species = pd.DataFrame(ohe.fit_transform(penguin[["Species"]]))
penguin_Species

Unnamed: 0,0,1,2
0,1.0,0.0,0.0
1,1.0,0.0,0.0
2,1.0,0.0,0.0
3,1.0,0.0,0.0
4,1.0,0.0,0.0
...,...,...,...
339,0.0,0.0,1.0
340,0.0,0.0,1.0
341,0.0,0.0,1.0
342,0.0,0.0,1.0


## Binary encoding
- Binary encoding is built on the binary digit, or bit, as the fundamental unit of information. A bit can only be ‘0’ or ‘1’. By combining bits, numbers larger than 1 can be represented, and such groups of bits are called words.
- A binary number is a number expressed in the binary (base-2) numeral system.
- A binary encoder first encodes the categories as integers using ordinal encoding and then converts each integer into binary code, with one column per bit.


To convert a number to binary, divide it by 2 repeatedly (integer division) until the quotient is 0. The remainders, read from bottom to top, are the binary digits.

#### Binary encoding of number 1

|Division|Remainder|
|---|---|
|1/2 = 0|1|

Hence, 1 is **1**

#### Binary encoding of number 2

|Division|Remainder|
|---|---|
|2/2 = 1|0|
|1/2 = 0|1|

Hence, 2 is **10**

#### Binary encoding of number 3

|Division|Remainder|
|---|---|
|3/2 = 1|1|
|1/2 = 0|1|

Hence, 3 is **11**

#### Binary encoding of number 4

|Division|Remainder|
|---|---|
|4/2 = 2|0|
|2/2 = 1|0|
|1/2 = 0|1|

Hence, 4 is **100**

#### Binary encoding of number 5

|Division|Remainder|
|---|---|
|5/2 = 2|1|
|2/2 = 1|0|
|1/2 = 0|1|

Hence, 5 is **101**

|Denary number|Binary number|
|---|---|
|1|1|
|2|10|
|3|11|
|4|100|
|5|101|
|6|110|
|7|111|
|8|1000|
|9|1001|
|10|1010|
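
The repeated-division procedure tabulated above can be written as a short function. This is an illustrative sketch (the name `to_binary` is not part of any library):

```python
def to_binary(n: int) -> str:
    """Convert a non-negative integer to its binary string by repeated division by 2."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        n, remainder = divmod(n, 2)  # quotient and remainder of the division by 2
        bits.append(str(remainder))
    # Remainders come out least-significant bit first, so reverse them.
    return "".join(reversed(bits))


for number in range(1, 11):
    print(number, to_binary(number))
```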

![binary_encoding.png](attachment:binary_encoding.png)
Source: https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02

In [None]:
ce_be = ce.BinaryEncoder(cols=['Species'])  # here we encode only the Species column

# fit the encoder and transform the data
data_binary = ce_be.fit_transform(penguin["Species"])
data_binary

Unnamed: 0,Species_0,Species_1
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1
...,...,...
339,1,1
340,1,1
341,1,1
342,1,1


- The data contains three species, yet the encoding produced only two columns.
- Binary encoding uses fewer features than one-hot encoding, which mitigates the curse of dimensionality for data with high cardinality.
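
How three categories fit into two columns can be reproduced by hand with pandas. This is a sketch of the idea, not the library's actual implementation: it assumes the ordinal codes start at 1, which is consistent with the `0 1` pattern shown for the first species above. The shortened species names and variable names are illustrative.

```python
import pandas as pd

species = pd.Series(["Adelie", "Chinstrap", "Gentoo", "Adelie"], name="Species")

# Step 1: ordinal codes, starting at 1 (assumption, see the note above).
categories = sorted(species.unique())
codes = species.map({cat: i for i, cat in enumerate(categories, start=1)})

# Step 2: number of bits needed to write the largest code in base 2.
n_bits = len(categories).bit_length()

# Step 3: one column per bit, most significant bit first.
binary = pd.DataFrame(
    {f"Species_{b}": (codes // 2 ** (n_bits - 1 - b)) % 2 for b in range(n_bits)}
)
print(binary)
```

With three categories the largest code is 3, which is `11` in binary, so two bit columns are enough.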

#### One hot encoding and dummy encoding
One-hot encoding and dummy encoding are two robust, efficient, and widely used encoding strategies, but they may be less helpful in the following situations:

- If a feature has many categories, a comparable number of dummy variables is needed to encode it. For instance, a column with 25 different values requires 25 additional variables with one-hot encoding.
- If the dataset has many categorical features, for example eight or more categorical columns, a similar situation arises: each feature and each of its categories contributes further binary features.
<br><br>
In both situations, these two encoding techniques make the dataset sparse: most entries in the new columns are 0 and only a few are 1. In other words, they add many dummy features to the dataset without providing any new information.
<br><br>
- They can also lead to the dummy variable trap, a phenomenon in which the features are highly correlated: the value of one variable can easily be predicted from the other variables.
- The increase in dataset size raises the computational cost and can affect overall performance. These encodings are often not an optimal choice for tree-based models.
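
The growth in column count for a high-cardinality feature can be checked on a made-up example. The 25-value `feature` series below is hypothetical, and the binary column count is computed as the number of bits needed for codes 1 to 25 rather than by calling an encoder:

```python
import pandas as pd

# Hypothetical feature with 25 distinct values.
feature = pd.Series([f"cat_{i:02d}" for i in range(25)], name="feature")

one_hot_cols = pd.get_dummies(feature).shape[1]                  # one column per category
dummy_cols = pd.get_dummies(feature, drop_first=True).shape[1]   # one column fewer
binary_cols = int(feature.nunique()).bit_length()                # bits needed for codes 1..25

print(one_hot_cols, dummy_cols, binary_cols)
```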

   