# ENCODING

***In ML, encoding is the technique of converting categorical features into numerical features so that they can be fed to a machine learning model.***

##### Types of Encoding

    1. Nominal Encoding
            --> Finite set of discrete values with no relationship between them.
            --> Used when the values of a feature have no natural order.
            
            e.g. 1. Names of states: MH, KL, MP, UP, ...
                 2. Gender: Male, Female
                 
    2. Ordinal Encoding
            --> Finite set of discrete values with a relationship (order) between them.
            --> Used when the values of a feature have a natural order, for example when they are ranked.
            
            e.g. 1. High, Middle, Low
                 2. PhD, Masters, Bachelor
                 

![Encoding.png](attachment:Encoding.png)
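
The difference between the two kinds of variable can be sketched with pandas categoricals. This is only an illustration and not part of the original notebook: the names `states` and `levels` and the category order `Low < Middle < High` are assumptions made for the example.

```python
import pandas as pd

# Nominal: state codes have no order, so the categorical is left unordered
states = pd.Categorical(['MH', 'KL', 'MP', 'UP'])
print(states.ordered)

# Ordinal: the order Low < Middle < High is declared explicitly (assumed here)
levels = pd.Categorical(
    ['High', 'Low', 'Middle'],
    categories=['Low', 'Middle', 'High'],
    ordered=True,
)
print(levels.ordered)

# Integer codes follow the declared order: Low=0, Middle=1, High=2
print(levels.codes)

# Order comparisons are meaningful for the ordered categorical
print(levels < 'High')
```

Declaring the order up front keeps the ranking in one place, so any integer codes derived from it stay consistent with the intended meaning.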

## Ordinal Encoding

***Each unique value is assigned an integer value***

For example, “red” is 1, “green” is 2, and “blue” is 3. (The scikit-learn example in this section numbers the categories from 0 in sorted order, so “blue” is 0, “green” is 1, and “red” is 2.)

    The integer values have a natural ordered relationship with each other, and machine learning algorithms may be able to understand and exploit this relationship.
    
    For nominal variables such as color, integer encoding imposes an ordinal relationship where none exists. This can mislead the model, and a one-hot encoding may be used instead.
    
    
![ORDINAL%20ENCODING.jpg](attachment:ORDINAL%20ENCODING.jpg)
    
***from sklearn.preprocessing import OrdinalEncoder***

In [1]:
# example of an ordinal encoding
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# define data
d = {'color':['red','green','blue']}
data = pd.DataFrame(data=d)
print(data)

# define ordinal encoding
encoder = OrdinalEncoder()

# transform data
result = encoder.fit_transform(data)
print(result)

   color
0    red
1  green
2   blue
[[2.]
 [1.]
 [0.]]
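
By default, `OrdinalEncoder` derives the categories from the data and sorts them, which for strings means alphabetical order, so the codes do not necessarily follow the real ranking. A minimal sketch of passing the order explicitly through the `categories` parameter follows. The `level` column and its values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# hypothetical ranked feature (not from the original notebook)
data = pd.DataFrame({'level': ['High', 'Low', 'Middle', 'Low']})

# one list of categories per column, given in ascending order
encoder = OrdinalEncoder(categories=[['Low', 'Middle', 'High']])

# Low -> 0, Middle -> 1, High -> 2
result = encoder.fit_transform(data)
print(result)

# the codes can be mapped back to the original labels
restored = encoder.inverse_transform(result)
print(restored)
```

Without the `categories` argument, the same data would be coded alphabetically (High, Low, Middle), which does not match the ranking.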


## One Hot Encoding

***Applied to nominal categorical variables***


    This is where the integer encoded variable is removed and one new binary variable is added for each unique integer value in the variable.

    Each bit represents a possible category.
    If the variable cannot belong to multiple categories at once, then only one bit in the group can be “on.” This is called one-hot encoding …


![one-hot-encoding.jpg](attachment:one-hot-encoding.jpg)
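
The mechanism described above can be sketched from scratch with NumPy: each category first gets an integer code, and the code then selects a row of an identity matrix. This is an illustrative sketch, not how scikit-learn is implemented. The names `index` and `codes` are assumptions of the example.

```python
import numpy as np

colors = ['red', 'green', 'blue']

# step 1: integer-encode the categories in sorted order
categories = sorted(set(colors))                  # ['blue', 'green', 'red']
index = {c: i for i, c in enumerate(categories)}
codes = np.array([index[c] for c in colors])      # [2, 1, 0]

# step 2: replace each integer with the matching row of an identity matrix,
# which gives one binary column per category
onehot = np.eye(len(categories))[codes]
print(onehot)

# exactly one bit is "on" in every row
print(onehot.sum(axis=1))
```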

In [2]:
# example of a one hot encoding
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# define data
d = {'color':['red','green','blue']}
data = pd.DataFrame(data=d)
print(data)

# define one hot encoding
# (scikit-learn >= 1.2 uses sparse_output; older releases used sparse=False)
encoder = OneHotEncoder(sparse_output=False)

# transform data
onehot = encoder.fit_transform(data)
print('\nonehot\n',onehot)

   color
0    red
1  green
2   blue

onehot
 [[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


## Dummy Variable Encoding

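    One-hot encoding creates one binary variable for each category.
    The problem is that this representation is redundant: with C categories, the value of the last binary variable can always be inferred from the other C – 1. With three colors, a sample that is neither "green" nor "red" must be "blue".
    The redundancy matters for models such as linear regression with an intercept, where it makes the design matrix rank deficient.
    
    When there are C possible values of the predictor and only C – 1 dummy variables are used, the matrix inverse can be computed and the contrast method is said to be a full rank parameterization.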
    
![dummy%20data%20encoding.jpg](attachment:dummy%20data%20encoding.jpg)

In [5]:
# example of a dummy variable encoding
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# define data
d = {'color':['red','green','blue']}
data = pd.DataFrame(data=d)
print(data)

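# define dummy variable encoding (drop the first category, here 'blue')
# (scikit-learn >= 1.2 uses sparse_output; older releases used sparse=False)
encoder = OneHotEncoder(drop='first', sparse_output=False)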

# transform data
dummy = encoder.fit_transform(data)
print('\ndummy\n',dummy)

   color
0    red
1  green
2   blue

dummy
 [[0. 1.]
 [1. 0.]
 [0. 0.]]
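
The full rank statement can be checked numerically. The sketch below assumes a linear model with an intercept column, which the text above does not spell out, and uses six made-up samples. With all C one-hot columns the design matrix loses rank because the columns sum to the intercept. With C – 1 dummy columns it has full column rank.

```python
import numpy as np

# integer codes for six hypothetical samples: 0 = blue, 1 = green, 2 = red
codes = np.array([2, 1, 0, 2, 0, 1])

one_hot = np.eye(3)[codes]        # one column per category
intercept = np.ones((6, 1))       # assumed intercept term of a linear model

# intercept + all 3 one-hot columns: 4 columns
X_onehot = np.hstack([intercept, one_hot])

# intercept + 2 dummy columns (first category dropped): 3 columns
X_dummy = np.hstack([intercept, one_hot[:, 1:]])

rank_onehot = np.linalg.matrix_rank(X_onehot)
rank_dummy = np.linalg.matrix_rank(X_dummy)

print(X_onehot.shape[1], rank_onehot)   # more columns than rank: redundant
print(X_dummy.shape[1], rank_dummy)     # columns equal rank: full rank
```

When the rank equals the number of columns, the matrix `X.T @ X` used by ordinary least squares is invertible, which is what the full rank parameterization refers to.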
