In [None]:
---
title: "One hot encoding"
format: 
  html:
    code-fold: true
jupyter: python3
---

# One hot encoding

In [15]:
%%html
<style type='text/css'>
.CodeMirror {
    font-size: 14px; 
    font-family: 'Jetbrains Mono';
}
</style>


In [16]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:85% !important; }</style>"))

In [17]:
import pandas as pd

# Show every row and column, and avoid truncating wide cells
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 1000)

****

Consider the sentence "The cat sat on the mat". The vocabulary (the set of unique words, ignoring case) in this sentence is (cat, mat, on, sat, the). To represent each word, create a zero vector whose length equals the vocabulary size, then place a one at the index that corresponds to the word. This approach is shown in the following diagram.

![Diagram of one-hot encodings](https://github.com/tensorflow/text/blob/master/docs/guide/images/one-hot.png?raw=1)

Image source: https://github.com/tensorflow/text/blob/master/docs/guide
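
The idea in the diagram can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions: the names `tokens`, `vocab`, `word_to_index`, and `one_hot` are made up for this example, words are lowercased so that "The" and "the" map to the same entry, and the vocabulary is sorted alphabetically to match the order listed above.

```python
import numpy as np

sentence = "The cat sat on the mat"

# Lowercase so that "The" and "the" count as the same word
tokens = sentence.lower().split()

# Sorted unique words give each vocabulary entry a fixed index
vocab = sorted(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocab)}

# One row per token; each row is a zero vector with a single 1
one_hot = np.zeros((len(tokens), len(vocab)), dtype=int)
for row, word in enumerate(tokens):
    one_hot[row, word_to_index[word]] = 1

print(vocab)
print(one_hot)
```

Each row of `one_hot` corresponds to one word of the sentence, in order, and each column to one vocabulary entry.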

****

## Sample data

I use the following sample data to try out one-hot encoding with pandas. The technique is taken from [https://github.com/DavidMertz/PyTorch-webinar/blob/master/NetworkExamples_0.ipynb](https://github.com/DavidMertz/PyTorch-webinar/blob/master/NetworkExamples_0.ipynb).

In [18]:
import pandas as pd

data = {
    'Product': ['iPhone', 'MacBook', 'iPad', 'AirPods', 'Apple Watch', 'iMac'],
    'Price': [999, 1299, 499, 199, 399, 1499],
    'Release_Year': [2023, 2022, 2023, 2021, 2022, 2022],
    'Category': ['Mobile', 'Laptop', 'Tablet', 'Accessories', 'Accessories', 'Desktop']
}

df = pd.DataFrame(data)

df

|   | Product | Price | Release_Year | Category |
|---|---|---|---|---|
| 0 | iPhone | 999 | 2023 | Mobile |
| 1 | MacBook | 1299 | 2022 | Laptop |
| 2 | iPad | 499 | 2023 | Tablet |
| 3 | AirPods | 199 | 2021 | Accessories |
| 4 | Apple Watch | 399 | 2022 | Accessories |
| 5 | iMac | 1499 | 2022 | Desktop |


Here, I am one-hot encoding the `Category` attribute.

In [19]:
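# Feature columns (everything except the categorical column)
X = df[['Product', 'Price', 'Release_Year']]

# One-hot encoding; dtype=int gives 0/1 columns (pandas >= 2.0 defaults to booleans)
df_one_hot = pd.get_dummies(df, columns=['Category'], dtype=int)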
df_one_hot

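|   | Product | Price | Release_Year | Category_Accessories | Category_Desktop | Category_Laptop | Category_Mobile | Category_Tablet |
|---|---|---|---|---|---|---|---|---|
| 0 | iPhone | 999 | 2023 | 0 | 0 | 0 | 1 | 0 |
| 1 | MacBook | 1299 | 2022 | 0 | 0 | 1 | 0 | 0 |
| 2 | iPad | 499 | 2023 | 0 | 0 | 0 | 0 | 1 |
| 3 | AirPods | 199 | 2021 | 1 | 0 | 0 | 0 | 0 |
| 4 | Apple Watch | 399 | 2022 | 1 | 0 | 0 | 0 | 0 |
| 5 | iMac | 1499 | 2022 | 0 | 1 | 0 | 0 | 0 |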


### Extracting the categories as a separate DataFrame

In [20]:
# Keep only the one-hot columns, then strip the 'Category_' prefix from their names
Y = df_one_hot[[col for col in df_one_hot.columns if col.startswith('Category_')]]
Y.columns = [col.replace('Category_', '') for col in Y.columns]

Y

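|   | Accessories | Desktop | Laptop | Mobile | Tablet |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 | 0 |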
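
A quick way to sanity-check a one-hot table like `Y` is to decode it back into labels. The sketch below is not part of the referenced notebook. It rebuilds the category column on its own so it can run independently, and the names `categories`, `encoded`, and `decoded` are assumptions made for this example. Since every row contains exactly one 1, `idxmax(axis=1)` returns the name of the column holding it.

```python
import pandas as pd

categories = pd.Series(
    ['Mobile', 'Laptop', 'Tablet', 'Accessories', 'Accessories', 'Desktop'],
    name='Category',
)

# dtype=int keeps 0/1 output regardless of the pandas version
encoded = pd.get_dummies(categories, dtype=int)

# Each row has exactly one 1, so the column with the row maximum is the label
decoded = encoded.idxmax(axis=1)

print(decoded.tolist())
```

If the decoded labels match the original column, the encoding lost no information.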


****