# **Data Encoding**

***Data encoding*** is the process of converting data from one format or representation to another in order to facilitate storage, transmission, or processing. Examples include converting text to binary for storage in a computer and encoding categorical data as numerical values for machine learning algorithms.
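
As a small illustration of the text-to-binary case mentioned above, the following sketch encodes a short string as UTF-8 bytes and prints each byte as eight binary digits. The sample string `"Hi"` is an arbitrary choice made for this example.

```python
# Encode a short string as UTF-8 bytes, then show each byte as 8 binary digits.
text = "Hi"  # arbitrary sample text
raw_bytes = text.encode("utf-8")
bits = " ".join(f"{byte:08b}" for byte in raw_bytes)
print(bits)  # 01001000 01101001

# Decoding reverses the process and recovers the original text.
decoded = raw_bytes.decode("utf-8")
print(decoded)  # Hi
```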

1. [Nominal/One-Hot Encoding](#id1)
2. [Label and Ordinal Encoding](#id2)
3. [Target Guided Ordinal Encoding](#id3)

<a id="id1"></a>
## Nominal/One-Hot Encoding

***Nominal/One-Hot Encoding*** is a technique that converts a categorical variable into binary vectors: each category becomes its own 0/1 feature, which makes the variable usable by machine learning models that expect numeric input.

| ID | Red | Blue | Green |
|----|-----|------|-------|
| 1  | 1   | 0    | 0     |
| 2  | 0   | 1    | 0     |
| 3  | 0   | 0    | 1     |
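
The table above can be reproduced by hand, which may help clarify what the library encoders do internally. This is a minimal sketch using NumPy; the variable names and the column order (`Red`, `Blue`, `Green`, chosen to match the table) are assumptions made for illustration.

```python
import numpy as np

colors = ["Red", "Blue", "Green"]      # one value per row (IDs 1, 2, 3)
categories = ["Red", "Blue", "Green"]  # column order chosen to match the table

# Map each category to its column position.
index = {cat: i for i, cat in enumerate(categories)}

# Start from all zeros and set a single 1 per row.
one_hot = np.zeros((len(colors), len(categories)), dtype=int)
for row, color in enumerate(colors):
    one_hot[row, index[color]] = 1

print(one_hot)
```

Each row contains exactly one `1`, in the column of the category that row belongs to.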

### **Disadvantages**

___Increased Dimensionality:___ It can lead to a significant increase in the number of features, which may result in larger and more complex datasets.

___Sparse Data:___ One-hot encoding creates sparse matrices with many zeros, making the data inefficient to store and process.

___Curse of Dimensionality:___ High dimensionality can lead to increased computational requirements and potential overfitting in machine learning models.

___Loss of Information:___ It does not capture any relationships or similarities between categories in the original variable. Each category is treated independently.

___Unsuitable for High Cardinality:___ One-hot encoding is not practical for variables with high cardinality (many unique categories) as it can lead to an impractical number of features.
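
To make the dimensionality and sparsity points concrete, here is a rough sketch with a synthetic high-cardinality column. The labels (`cat_0` to `cat_199`) and the sizes are made up for this example; the sketch relies on `OneHotEncoder` returning a SciPy sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

n_rows, n_categories = 1000, 200

# Hypothetical column: 200 distinct labels, each appearing 5 times.
labels = np.array([f"cat_{i % n_categories}" for i in range(n_rows)]).reshape(-1, 1)

encoder = OneHotEncoder()                # sparse output by default
encoded = encoder.fit_transform(labels)

print(encoded.shape)  # (1000, 200): one column per unique category
print(encoded.nnz)    # 1000: exactly one non-zero entry per row

# Fraction of entries that are non-zero.
density = encoded.nnz / (encoded.shape[0] * encoded.shape[1])
print(density)        # 0.005
```

A single column became 200 features, and only 0.5% of the resulting entries are non-zero, which is why a sparse representation is usually preferred over calling `.toarray()` on large data.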


In [1]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Sample data
data = {'ID': [1, 2, 3, 4, 5, 6],
        'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green']}

# Create a DataFrame
df = pd.DataFrame(data)
df.head()

|   | ID | Color |
|---|----|-------|
| 0 | 1  | Red   |
| 1 | 2  | Blue  |
| 2 | 3  | Green |
| 3 | 4  | Red   |
| 4 | 5  | Blue  |


In [2]:
df_encoded = pd.get_dummies(df, columns=['Color'])
df_encoded  # first method: pandas get_dummies

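|   | ID | Color_Blue | Color_Green | Color_Red |
|---|----|------------|-------------|-----------|
| 0 | 1  | 0          | 0           | 1         |
| 1 | 2  | 1          | 0           | 0         |
| 2 | 3  | 0          | 1           | 0         |
| 3 | 4  | 0          | 0           | 1         |
| 4 | 5  | 1          | 0           | 0         |
| 5 | 6  | 0          | 1           | 0         |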


In [28]:
# Second method: sklearn.preprocessing.OneHotEncoder

# handle_unknown='error' is the default: transform() raises an error
# if new data contains a category that was not seen during fit.
encoder = OneHotEncoder(handle_unknown='error')
encoded = encoder.fit_transform(df[['Color']]).toarray()

In [29]:
pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

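|   | Color_Blue | Color_Green | Color_Red |
|---|------------|-------------|-----------|
| 0 | 0.0        | 0.0         | 1.0       |
| 1 | 1.0        | 0.0         | 0.0       |
| 2 | 0.0        | 1.0         | 0.0       |
| 3 | 0.0        | 0.0         | 1.0       |
| 4 | 1.0        | 0.0         | 0.0       |
| 5 | 0.0        | 1.0         | 0.0       |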


In [27]:
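# Handling new data coming into the model.
# Passing a DataFrame with the same column name used in fit avoids
# scikit-learn's "X does not have valid feature names" warning.

encoder.transform(pd.DataFrame({'Color': ['Blue']})).toarray()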

array([[1., 0., 0.]])
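
The `handle_unknown` parameter decides what happens when new data contains a category that was not present during `fit`. The sketch below contrasts the two common settings on a small made-up frame; `"Purple"` is a hypothetical unseen category introduced only for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})
new = pd.DataFrame({"Color": ["Blue", "Purple"]})  # "Purple" was not seen in fit

# handle_unknown="error" (the default): an unseen category raises ValueError.
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True
print(raised)  # True

# handle_unknown="ignore": an unseen category is encoded as an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
result = lenient.transform(new).toarray()
print(result)
```

With `"ignore"`, the row for `"Purple"` is all zeros, so the model sees it as "none of the known categories". Whether that is acceptable depends on the application.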

In [40]:
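# Another example: one-hot encoding the "day" column of the tips dataset

# A raw string keeps the backslashes in the Windows path literal.
data = pd.read_csv(r"D:\AI\DATASETS\tips.csv")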
display(data.tail())
encoder1 = OneHotEncoder()
encoded_1 = encoder1.fit_transform(data[["day"]]).toarray()
encoded_1


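|     | total_bill | tip  | sex    | smoker | day  | time   | size |
|-----|------------|------|--------|--------|------|--------|------|
| 239 | 29.03      | 5.92 | Male   | No     | Sat  | Dinner | 3    |
| 240 | 27.18      | 2.0  | Female | Yes    | Sat  | Dinner | 2    |
| 241 | 22.67      | 2.0  | Male   | Yes    | Sat  | Dinner | 2    |
| 242 | 17.82      | 1.75 | Male   | No     | Sat  | Dinner | 2    |
| 243 | 18.78      | 3.0  | Female | No     | Thur   | Dinner | 2    |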


array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],


In [53]:
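# handle_unknown does not control warnings, so they are suppressed with the
# warnings module instead. The warning here most likely comes from passing a
# plain list to an encoder that was fitted on a DataFrame.
import warnings
warnings.filterwarnings(action="ignore")  # ignore all warnings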
encoder1.transform([["Sat"], ["Thur"], ["Sat"]]).toarray()

array([[0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.]])

In [54]:

encoder1.transform([["Thur"], ["Sat"]]).toarray()

array([[0., 0., 0., 1.],
       [0., 1., 0., 0.]])