# Encoding techniques 
- Data encoding is an important preprocessing step in machine learning. It converts categorical or textual data into a numerical format that algorithms can take as input.

#### Types of encoding techniques
- One hot encoding / nominal encoding
- Ordinal encoding
- Target guided ordinal encoding

#### One Hot Encoding / Nominal encoding 
- One-hot encoding converts a categorical variable into a set of binary columns, one per category, so that it can be used in machine learning models.
- It is also called nominal encoding because it suits nominal categorical data, i.e. data whose categories have no inherent ranking.

- Advantages 
1. It allows the use of categorical variables in models that require numerical input.
2. It can improve model performance because each category gets its own column, so the model can learn a separate effect for each one.
3. It avoids imposing a false order on the categories. That problem arises when nominal categories are encoded as integers (e.g. Adelie = 0, Chinstrap = 1, Gentoo = 2) and the model treats the codes as ranked.

- Disadvantages 

1. It can lead to increased dimensionality, as a separate column is created for each category in the variable. This can make the model more complex and slow to train.
 
2. It can lead to sparse data, as most observations will have a value of 0 in most of the one-hot encoded columns.
 
3. It can lead to overfitting, especially if there are many categories in the variable and the sample size is relatively small.
 
In summary, one-hot encoding is a powerful way to handle categorical data, but it can lead to increased dimensionality, sparsity, and overfitting. Use it with care and consider other methods such as ordinal encoding or binary encoding when a variable has many categories.
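
The dimensionality and sparsity costs described above can be seen on a small example. The sketch below uses a hypothetical toy column (`city`, not part of the penguins data) and compares the default encoding with `drop="first"`, one common way to trim a column per feature. It is an illustration under those assumptions, not a recommendation for every model.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with four distinct categories.
toy = pd.DataFrame({"city": ["Pune", "Delhi", "Mumbai", "Chennai", "Pune", "Delhi"]})

# Default: one output column per category, returned as a SciPy sparse matrix.
full_encoder = OneHotEncoder()
full = full_encoder.fit_transform(toy[["city"]])

# drop="first" omits the first category of each feature, giving k - 1 columns.
reduced_encoder = OneHotEncoder(drop="first")
reduced = reduced_encoder.fit_transform(toy[["city"]])

full_shape = full.shape        # 6 rows, 4 columns
reduced_shape = reduced.shape  # 6 rows, 3 columns

# Each row holds exactly one 1, so the share of zeros grows with the category count.
zero_fraction = 1 - full.nnz / (full.shape[0] * full.shape[1])
print(full_shape, reduced_shape, zero_fraction)
```

With four categories, three quarters of the entries are zero; with hundreds of categories the matrix is almost entirely zeros, which is why the encoder returns a sparse matrix by default.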

In [8]:
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

In [6]:
df = sns.load_dataset('penguins')

In [7]:
df.head()

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |


In [13]:
# Let's look at an example from the penguins dataset: count the rows per species
df['species'].value_counts()

Adelie       152
Gentoo       124
Chinstrap     68
Name: species, dtype: int64

In [28]:
# Create a OneHotEncoder instance and encode the two nominal columns

encoder = OneHotEncoder()

encoded = encoder.fit_transform(df[['species', 'island']])

In [30]:
# fit_transform returns a sparse matrix, so convert it to a dense array first
encoded_df = pd.DataFrame(encoded.toarray(), columns=encoder.get_feature_names_out())

In [32]:
# Concatenate the encoded columns with the original dataset

pd.concat([df, encoded_df], axis = 1)

| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | species_Adelie | species_Chinstrap | species_Gentoo | island_Biscoe | island_Dream | island_Torgersen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |


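#### Label and Ordinal Encoding
- Ordinal encoding is used for data where the ranking of the categories matters, such as game ranks: D, C, B, A, A+, S, etc.

- A label encoder does not encode any ranking; it simply assigns an integer to each category, in sorted order. In scikit-learn, LabelEncoder is intended for target labels rather than input features.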

In [33]:
## Label encoder
from sklearn.preprocessing import LabelEncoder

In [42]:
# Use one encoder per column so each keeps its own fitted classes_
species_encoder = LabelEncoder()
island_encoder = LabelEncoder()
encoded_species = pd.DataFrame(species_encoder.fit_transform(df['species']), columns=['species_encoded'])
encoded_islands = pd.DataFrame(island_encoder.fit_transform(df['island']), columns=['island_encoded'])

In [44]:
label_encoded_df = pd.concat([df, encoded_species, encoded_islands], axis=1)

In [45]:
label_encoded_df

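| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | species_encoded | island_encoded |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male | 0 | 2 |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female | 0 | 2 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | 0 | 2 |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN | 0 | 2 |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female | 0 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN | 2 | 0 |
| 340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female | 2 | 0 |
| 341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male | 2 | 0 |
| 342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female | 2 | 0 |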


- The problem with label encoding is that the categories become integers such as 0, 1, 2, so a model may assume that one category is greater than another. For nominal data that assumption is wrong and can distort what the model learns.
- When the categories really do have a ranking, use ordinal encoding, which lets us specify that order explicitly.
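
The difference can be sketched on a hypothetical ordered feature (`size`, with low < medium < high; not taken from the datasets used here). LabelEncoder sorts the classes, so its codes follow alphabetical order, while OrdinalEncoder with an explicit `categories` list keeps the intended ranking.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordered feature: low < medium < high.
sizes = pd.DataFrame({"size": ["low", "medium", "high", "medium", "low"]})

# LabelEncoder sorts the classes alphabetically: high=0, low=1, medium=2.
label_codes = LabelEncoder().fit_transform(sizes["size"])

# OrdinalEncoder with an explicit list encodes low=0, medium=1, high=2.
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordinal_codes = ordinal.fit_transform(sizes[["size"]]).ravel()

print(list(label_codes))    # [1, 2, 0, 2, 1]
print(list(ordinal_codes))  # [0.0, 1.0, 2.0, 1.0, 0.0]
```

In the label-encoded version "high" receives the smallest code, which is the opposite of the intended order.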

In [46]:
## Ordinal encoding 

from sklearn.preprocessing import OrdinalEncoder

In [47]:
diamonds = sns.load_dataset('diamonds')

In [48]:
diamonds.head()

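| | carat | cut | color | clarity | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |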


In [50]:
# Several variables in the diamonds dataset have a ranking associated with them.
# OrdinalEncoder takes a list of categories ordered by rank: the rank increases
# as we move along the list (lowest first, highest last).

diamonds['cut'].value_counts()  

Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
Name: cut, dtype: int64

In [56]:
encoder = OrdinalEncoder(categories=[['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']])

In [58]:
cut = encoder.fit_transform(diamonds[['cut']])

In [63]:
df_cut = pd.DataFrame(cut, columns=['cut_encoded'])

In [64]:
pd.concat([diamonds, df_cut], axis = 1)

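| | carat | cut | color | clarity | depth | table | price | x | y | z | cut_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 4.0 |
| 1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 3.0 |
| 2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 1.0 |
| 3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | 3.0 |
| 4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 | 4.0 |
| 53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 | 1.0 |
| 53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 | 2.0 |
| 53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 | 3.0 |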
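
Target guided ordinal encoding, the third technique in the list of encoding types, ranks the categories by a statistic of the target (commonly its mean) instead of by a hand-written order. The sketch below is a minimal pandas version on hypothetical data: the `city` and `price` columns and their values are made up for illustration and are not taken from the datasets above.

```python
import pandas as pd

# Hypothetical data: a categorical feature and a numeric target.
data = pd.DataFrame({
    "city": ["Pune", "Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"],
    "price": [200, 150, 300, 180, 220, 320],
})

# Mean target per category, sorted from lowest to highest.
mean_price = data.groupby("city")["price"].mean().sort_values()

# Rank 0 goes to the category with the lowest mean target.
rank_by_city = {city: rank for rank, city in enumerate(mean_price.index)}
data["city_encoded"] = data["city"].map(rank_by_city)

print(rank_by_city)  # {'Delhi': 0, 'Pune': 1, 'Mumbai': 2}
print(data)
```

Because the ranks are derived from the target, the mapping should be computed on the training data only and then applied to validation and test data; otherwise target information leaks into the features.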
