# **Data Encoding**

***Data encoding*** is the process of converting data from one format or representation to another so that it is easier to store, transmit, or process. Examples include converting text to binary for storage in a computer and encoding categorical data as numerical values for machine learning algorithms.

1. [Nominal/ One Hot Encoding](#id1)
2. [Label Encoding](#id2) And [Ordinal Encoding](#id2-1)
3. [Target Guided Ordinal Encoding](#id3)

<a id="id1"></a>
## Nominal/ One Hot Encoding

***Nominal/One-Hot Encoding*** is a technique used to convert categorical variables into binary vectors, where each category is represented as a binary feature, making them suitable for machine learning models.

| ID | Red | Blue | Green |
|----|----------|-----------|------------|
| 1  |      1       |      0        |    0         |
| 2  |      0       |      1        |    0         |
| 3  |      0       |      0        |    1         |

### **Disadvantages**

___Increased Dimensionality:___ It can lead to a significant increase in the number of features, which may result in larger and more complex datasets.

___Sparse Data:___ One-hot encoding creates sparse matrices with many zeros, making the data inefficient to store and process.

___Curse of Dimensionality:___ High dimensionality can lead to increased computational requirements and potential overfitting in machine learning models.

___Loss of Information:___ It does not capture any relationships or similarities between categories in the original variable. Each category is treated independently.

___Unsuitable for High Cardinality:___ One-hot encoding is not practical for variables with high cardinality (many unique categories) as it can lead to an impractical number of features.
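To make the dimensionality and sparsity costs concrete, here is a minimal sketch. The `zip_code` column, its 500 distinct values, and the row count are invented for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical high-cardinality column: 500 distinct codes spread over 2,000 rows.
n_rows, n_categories = 2_000, 500
df_zip = pd.DataFrame(
    {"zip_code": [f"Z{i % n_categories:04d}" for i in range(n_rows)]}
)

# One-hot encode the single column; dtype=int gives 0/1 instead of booleans.
encoded_zip = pd.get_dummies(df_zip, columns=["zip_code"], dtype=int)

# One new column is created per distinct category.
n_columns = encoded_zip.shape[1]

# Every row contains exactly one 1, so the share of non-zero cells is 1 / n_categories.
density = encoded_zip.to_numpy().mean()

print(n_columns)  # number of one-hot columns
print(density)    # fraction of cells that are non-zero
```

A single column becomes 500 columns in which 99.8% of the cells are zero, which is why high-cardinality variables are usually encoded some other way.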


In [85]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Sample data
data = {'ID': [1, 2, 3, 4, 5, 6],
        'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green']}

# Create a DataFrame
df = pd.DataFrame(data)
df.head()

|   | ID | Color |
|---|----|-------|
| 0 | 1  | Red   |
| 1 | 2  | Blue  |
| 2 | 3  | Green |
| 3 | 4  | Red   |
| 4 | 5  | Blue  |


In [86]:
# Method 1: pandas get_dummies
# dtype=int keeps the 0/1 output shown here (pandas 2.x returns booleans by default)
df_encoded = pd.get_dummies(df, columns=['Color'], dtype=int)
df_encoded

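|   | ID | Color_Blue | Color_Green | Color_Red |
|---|----|------------|-------------|-----------|
| 0 | 1  | 0          | 0           | 1         |
| 1 | 2  | 1          | 0           | 0         |
| 2 | 3  | 0          | 1           | 0         |
| 3 | 4  | 0          | 0           | 1         |
| 4 | 5  | 1          | 0           | 0         |
| 5 | 6  | 0          | 1           | 0         |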


In [87]:
# Method 2: sklearn.preprocessing.OneHotEncoder

# handle_unknown='error' (the default) makes transform() raise a ValueError
# when it meets a category that was not present during fit
encoder = OneHotEncoder(handle_unknown='error')
encoded = encoder.fit_transform(df[['Color']]).toarray()

In [88]:
pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

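|   | Color_Blue | Color_Green | Color_Red |
|---|------------|-------------|-----------|
| 0 | 0.0        | 0.0         | 1.0       |
| 1 | 1.0        | 0.0         | 0.0       |
| 2 | 0.0        | 1.0         | 0.0       |
| 3 | 0.0        | 0.0         | 1.0       |
| 4 | 1.0        | 0.0         | 0.0       |
| 5 | 0.0        | 1.0         | 0.0       |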


In [89]:
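# Handling new data arriving at the model: pass a DataFrame with the same
# column name used during fit, so scikit-learn does not warn about feature names

encoder.transform(pd.DataFrame({'Color': ['Blue']})).toarray()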

array([[1., 0., 0.]])
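The example above only transforms a category the encoder has already seen. As a hedged sketch of what happens with an unseen one, the snippet below compares `handle_unknown="error"` with `handle_unknown="ignore"`. The value `"Purple"` and the variable names are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Color": ["Red", "Blue", "Green"]})
new = pd.DataFrame({"Color": ["Blue", "Purple"]})  # "Purple" was never seen during fit

# handle_unknown="error" (the default): transform() raises on unseen categories.
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True

# handle_unknown="ignore": unseen categories are encoded as an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded_new = lenient.transform(new).toarray()

print(raised)
print(encoded_new)
```

With `"ignore"`, the `"Purple"` row comes back as all zeros, so downstream code keeps running but cannot tell an unknown category apart from one that is simply missing.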

In [90]:
# Another example

# A raw string keeps the backslashes in the Windows path literal
data = pd.read_csv(r"D:\AI\DATASETS\tips.csv")
display(data.tail())
encoder1 = OneHotEncoder()
encoded_1 = encoder1.fit_transform(data[["day"]]).toarray()
encoded_1


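|     | total_bill | tip  | sex    | smoker | day  | time   | size |
|-----|------------|------|--------|--------|------|--------|------|
| 239 | 29.03      | 5.92 | Male   | No     | Sat  | Dinner | 3    |
| 240 | 27.18      | 2.0  | Female | Yes    | Sat  | Dinner | 2    |
| 241 | 22.67      | 2.0  | Male   | Yes    | Sat  | Dinner | 2    |
| 242 | 17.82      | 1.75 | Male   | No     | Sat  | Dinner | 2    |
| 243 | 18.78      | 3.0  | Female | No     | Thur | Dinner | 2    |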


array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.]])


In [91]:
# Passing a DataFrame with the same column name used during fit avoids
# scikit-learn's feature-name warning, so warnings need not be silenced globally
encoder1.transform(pd.DataFrame({"day": ["Sat", "Thur", "Sat"]})).toarray()

array([[0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.]])

In [92]:

encoder1.transform(pd.DataFrame({"day": ["Thur", "Sat"]})).toarray()

array([[0., 0., 0., 1.],
       [0., 1., 0., 0.]])


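<a id="id2"></a>
## Label Encoding

Label encoding and ordinal encoding are two techniques for encoding categorical data as numbers. Label encoding is typically applied to nominal data, but it is often a poor choice for input features: the integer codes imply an order and spacing between categories that do not exist, which can mislead models that treat the codes as magnitudes.

Label encoding assigns a unique integer to each category in the variable. The integers are usually assigned in alphabetical order or by category frequency. For example, if we have a categorical variable "color" with three possible values (red, green, blue), we can represent it using label encoding as follows:

1. Red: 1
2. Green: 2
3. Blue: 3

scikit-learn's `LabelEncoder`, used in the cells that follow, sorts the categories and numbers them from 0, so it produces blue: 0, green: 1, red: 2. It is intended for encoding target labels rather than input features.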

In [93]:
data.head()

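|   | total_bill | tip  | sex    | smoker | day | time   | size |
|---|------------|------|--------|--------|-----|--------|------|
| 0 | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1 | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2 | 21.01      | 3.5  | Male   | No     | Sun | Dinner | 3    |
| 3 | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4 | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |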


In [94]:
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})
lbl_encoder = LabelEncoder()
df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | green |
| 4 | red   |


In [95]:
lbl_encoder.fit_transform(df["color"])  # LabelEncoder expects a 1-D array, so pass the Series

array([2, 0, 1, 1, 2, 0])

In [96]:
lbl_encoder.transform(['red'])

array([2])

In [97]:
display(lbl_encoder.transform(['green']), lbl_encoder.transform(['blue']))

array([1])

array([0])
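The outputs above show that the codes follow sorted order rather than order of appearance. A small sketch, reusing the same color values, makes the full mapping explicit through the fitted encoder's `classes_` attribute and reverses it with `inverse_transform`. The variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "blue", "green", "green", "red", "blue"]

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the sorted unique labels; a label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

# inverse_transform turns the integer codes back into the original labels.
decoded = [str(label) for label in le.inverse_transform(codes)]

print(mapping)
print(decoded)
```

Keeping the fitted encoder around matters: the same `le` object must be reused on new data so that each label keeps the same code.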

<a id="id2-1"></a>
## Ordinal Encoding

Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, if we have a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate), we can represent it using ordinal encoding as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4

In [98]:
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})
display(df.head())
# List the categories from smallest to largest so the encoder preserves that order
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
# Fit the encoder on the column (passed as a one-column DataFrame)
encoder.fit_transform(df[['size']])
# Transform new values, passed as DataFrames with the same column name used during fit
tuple(encoder.transform(pd.DataFrame({'size': [value]})) for value in ['small', 'large', 'medium'])

|   | size   |
|---|--------|
| 0 | small  |
| 1 | medium |
| 2 | large  |
| 3 | medium |
| 4 | small  |


(array([[0.]]), array([[2.]]), array([[1.]]))
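The education-level example from the text can be sketched the same way. This version uses a pandas ordered categorical instead of `OrdinalEncoder`, purely as an alternative illustration, and the sample values are invented.

```python
import pandas as pd

# Categories listed from lowest to highest, matching the order given in the text.
levels = ["high school", "college", "graduate", "post-graduate"]

education = pd.Series(
    ["college", "high school", "post-graduate", "graduate", "college"]
)

# An ordered categorical assigns 0-based codes by position in `levels`.
ordered = pd.Categorical(education, categories=levels, ordered=True)

# Add 1 so the codes match the 1..4 numbering used in the text.
# A value missing from `levels` would get code -1, i.e. 0 after the shift.
education_code = pd.Series(ordered.codes, index=education.index) + 1

print(education_code.tolist())
```

Either approach works as long as the category order is stated explicitly. Leaving it to alphabetical sorting would place "college" before "high school" only by coincidence.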

<a id="id3"></a>
## Target Guided Ordinal Encoding

Target guided ordinal encoding encodes a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in a machine learning model.

In target guided ordinal encoding, we replace each category with a numerical value based on the mean or median of the target variable for that category. The encoded value then rises and falls with the category's average target, giving a monotonic relationship between the encoded feature and the target that can improve the model's predictive power. Because the encoding is computed from the target, it should be fitted on the training data only, so that target information does not leak into validation or test data.

In [99]:
import pandas as pd

# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})
display(df.head())
mean_price=df.groupby('city')['price'].mean().to_dict()
display(mean_price)

# Mapping the dict above onto the city column gives the target guided ordinal encoding

|   | city     | price |
|---|----------|-------|
| 0 | New York | 200   |
| 1 | London   | 150   |
| 2 | Paris    | 300   |
| 3 | Tokyo    | 250   |
| 4 | New York | 180   |


{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [100]:
df["TGOE"] = df["city"].map(mean_price)
df

|   | city     | price | TGOE  |
|---|----------|-------|-------|
| 0 | New York | 200   | 190.0 |
| 1 | London   | 150   | 150.0 |
| 2 | Paris    | 300   | 310.0 |
| 3 | Tokyo    | 250   | 250.0 |
| 4 | New York | 180   | 190.0 |
| 5 | Paris    | 320   | 310.0 |
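The `TGOE` column above stores each city's mean price directly. A common variant, sketched here as an assumption rather than as the notebook's method, sorts the categories by that mean and stores the resulting rank instead, which yields small ordinal integers. The names `df_city`, `city_means`, and `city_rank` are illustrative.

```python
import pandas as pd

df_city = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Mean target per category, sorted from lowest to highest.
city_means = df_city.groupby("city")["price"].mean().sort_values()

# Rank 1 goes to the category with the lowest mean target.
city_rank = {city: rank for rank, city in enumerate(city_means.index, start=1)}

# Replace each city with its rank; cities absent from city_rank would map to NaN.
df_city["city_rank"] = df_city["city"].map(city_rank)

print(city_rank)
print(df_city)
```

Ranks keep the ordering of the means but discard the gaps between them, so the choice between ranks and raw means depends on whether those gaps are informative for the model.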


In [101]:
# Let's try another example

data.head() # data already loaded

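|   | total_bill | tip  | sex    | smoker | day | time   | size |
|---|------------|------|--------|--------|-----|--------|------|
| 0 | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1 | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2 | 21.01      | 3.5  | Male   | No     | Sun | Dinner | 3    |
| 3 | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4 | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |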


In [114]:
# bins = range(0, int(data['total_bill'].max()) + 5, 5)

# # Add a new column 'total_bill_interval' to the DataFrame
# data['total_bill_interval'] = pd.cut(data['total_bill'], bins)

# # Find the mean of 'total_bill' for each interval
# mean_tip_by_interval = data.groupby('total_bill_interval')['total_bill'].mean()
# mean_data = data.groupby("total_bill_interval")["total_bill"].mean().to_dict()

# data["mean_interval_bill"] = data["total_bill_interval"].map(mean_data)
# data


# Making the code above reusable.
# Internal assignment: convert a numerical column into fixed-width bins, find the mean of each bin, and map those means back onto the input column.
def tgoe_on_numerical_var(binsize: int, data: pd.Series, df: pd.DataFrame, interval_col_name: str, output_col: str, num_col: str):
    """Bin df[num_col] into intervals of width binsize and attach each interval's mean."""
    # Extend the upper bound by binsize + 1 so the last bin edge always reaches the maximum value
    bins = range(0, int(max(data)) + binsize + 1, binsize)

    # Add a new column `interval_col_name` holding each row's interval
    df[interval_col_name] = pd.cut(df[num_col], bins)

    # Find the mean of `num_col` for each interval that actually occurs
    mean_data = df.groupby(interval_col_name, observed=True)[num_col].mean().to_dict()

    # Map each row's interval to that interval's mean
    df[output_col] = df[interval_col_name].map(mean_data)
    return df

tgoe_on_numerical_var(5, data["total_bill"], data, "intervals", "output_col", "total_bill")

|     | total_bill | tip  | sex    | smoker | day | time   | size | intervals | output_col |
|-----|------------|------|--------|--------|-----|--------|------|-----------|------------|
| 0   | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    | (15, 20]  | 17.195373  |
| 1   | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    | (10, 15]  | 12.397143  |
| 2   | 21.01      | 3.50 | Male   | No     | Sun | Dinner | 3    | (20, 25]  | 22.201905  |
| 3   | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    | (20, 25]  | 22.201905  |
| 4   | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    | (20, 25]  | 22.201905  |
| ... | ...        | ...  | ...    | ...    | ... | ...    | ...  | ...       | ...        |
| 239 | 29.03      | 5.92 | Male   | No     | Sat | Dinner | 3    | (25, 30]  | 27.360000  |
| 240 | 27.18      | 2.00 | Female | Yes    | Sat | Dinner | 2    | (25, 30]  | 27.360000  |
| 241 | 22.67      | 2.00 | Male   | Yes    | Sat | Dinner | 2    | (20, 25]  | 22.201905  |
| 242 | 17.82      | 1.75 | Male   | No     | Sat | Dinner | 2    | (15, 20]  | 17.195373  |
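The bin-then-average step can also be written without an intermediate dictionary. The sketch below uses `groupby(...).transform("mean")` on only the first five `total_bill` values shown above, so its interval means differ from the full-dataset figures in the table. The names `bills` and `interval_mean` are illustrative.

```python
import pandas as pd

# First five total_bill values from the tips data shown above.
bills = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01, 23.68, 24.59]})

# Fixed-width bins: (0, 5], (5, 10], ..., (20, 25].
bills["interval"] = pd.cut(bills["total_bill"], bins=range(0, 30, 5))

# transform("mean") returns one value per row: the mean of that row's interval.
bills["interval_mean"] = (
    bills.groupby("interval", observed=True)["total_bill"].transform("mean")
)

print(bills)
```

`transform` keeps the result aligned with the original rows, which removes the separate `to_dict()` and `map()` steps at the cost of not having a reusable lookup table for new data.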


In [103]:
# data = [10, 20, 25, 30, 40, 50, 60]
# bins = range(0, max(data)+20, 20)
# labels = ['Low', 'Medium', 'High']

# # Create bins and assign labels
# categories = pd.cut(data, bins=bins, labels=labels)

# # The 'categories' variable will contain the category labels for each data point.

# categories

In [109]:
data

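|     | total_bill | tip  | sex    | smoker | day | time   | size |
|-----|------------|------|--------|--------|-----|--------|------|
| 0   | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1   | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2   | 21.01      | 3.50 | Male   | No     | Sun | Dinner | 3    |
| 3   | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4   | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |
| ... | ...        | ...  | ...    | ...    | ... | ...    | ...  |
| 239 | 29.03      | 5.92 | Male   | No     | Sat | Dinner | 3    |
| 240 | 27.18      | 2.00 | Female | Yes    | Sat | Dinner | 2    |
| 241 | 22.67      | 2.00 | Male   | Yes    | Sat | Dinner | 2    |
| 242 | 17.82      | 1.75 | Male   | No     | Sat | Dinner | 2    |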


Written With Love By,

[Shaik Maaheed][]

[Shaik Maaheed]: https://www.linkedin.com//in//shaikmaaheed// "Follow me on LinkedIn"

