# How to encode categorical features

# Abstract


Encoding transforms data from one representation to another. In machine learning, we usually use this concept
when we want to transform non-numeric data into numeric data. For example, we transform the labels "cat" and "dog" to 1 and 0, respectively.

There are some popular ways to encode categorical features: using a dictionary to convert data to an index, using one-hot vectors, or training a neural network and encoding the data as embeddings such as word2vec.

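# Problems with some popular encoding techniques

1) One-hot: memory usage, since every category needs its own dimension

2) Dictionary: the dictionary can become very large, and unknown items (items not in the dictionary) cannot be encoded

3) word2vec: unknown items have no vector

4) Neural network: the input must first be represented in some way, which brings us back to 1), 2), or 3). Sometimes the output is also too complex, e.g. when encoding with GPT.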

Additionally, there are some cases where we use a sparse matrix to encode...
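The first two problems in the list above can be illustrated with a small sketch. The labels, the `-1` sentinel for unknown items, and the table sizes are assumptions chosen only for illustration; they do not come from a real dataset.

```python
# Build a dictionary (label -> index) from the labels seen at training time.
train_labels = ["cat", "dog", "cat", "bird"]
vocab = {label: idx for idx, label in enumerate(sorted(set(train_labels)))}


def encode_with_dictionary(label, vocab):
    """Return the index of a label, or -1 if the label was never seen.

    The -1 sentinel is an assumption of this sketch; a real pipeline
    might raise an error or reserve a dedicated "unknown" index instead.
    """
    return vocab.get(label, -1)


known = encode_with_dictionary("dog", vocab)
unknown = encode_with_dictionary("fish", vocab)  # "fish" is not in the dictionary
print(vocab)    # {'bird': 0, 'cat': 1, 'dog': 2}
print(known)    # 2
print(unknown)  # -1

# Rough memory estimate for a dense one-hot table:
# one byte per cell, one column per category (hypothetical sizes).
n_rows, n_categories = 1_000_000, 50_000
dense_bytes = n_rows * n_categories
print(dense_bytes / 1e9, "GB")  # 50.0 GB
```

The unknown label has no meaningful index, and the dense one-hot table grows with the number of rows times the number of categories.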

In [2]:
# one-hot encoding
import pandas as pd

# Sample dataset with a categorical column
data = {'Color': ['Red', 'Blue', 'Green', 'Red', 'Green']}
df = pd.DataFrame(data)

# Perform one-hot encoding using Pandas
one_hot_encoded = pd.get_dummies(df, columns=['Color']).astype(int)
one_hot_encoded

Unnamed: 0,Color_Blue,Color_Green,Color_Red
0,0,0,1
1,1,0,0
2,0,1,0
3,0,0,1
4,0,1,0


In [4]:
# dictionary (label encoder)
# LabelEncoder sorts the classes alphabetically, so Large=0, Medium=1, Small=2
from sklearn.preprocessing import LabelEncoder

data = {'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']}
df = pd.DataFrame(data)

label_encoder = LabelEncoder()

df['Size_encoded'] = label_encoder.fit_transform(df['Size'])
df

Unnamed: 0,Size,Size_encoded
0,Small,2
1,Medium,1
2,Large,0
3,Medium,1
4,Small,2


In [10]:
# sparse matrix
from numpy import array
from scipy.sparse import csr_matrix
A = array([[1, 0, 0, 1, 0, 0], [0, 0, 2, 0, 0, 1], [0, 0, 0, 2, 0, 0]])
print(A)
print("---------------------")
S = csr_matrix(A)
print(S)



[[1 0 0 1 0 0]
 [0 0 2 0 0 1]
 [0 0 0 2 0 0]]
---------------------
  (0, 0)	1
  (0, 3)	1
  (1, 2)	2
  (1, 5)	1
  (2, 3)	2


# Solution: Byte Pair Encoding

## Byte Pair Encoding

The Byte Pair Encoding (BPE) algorithm is commonly used in LLM tokenization. The algorithm is "byte-level" here because it runs on UTF-8 encoded strings.

One thing to note is that minbpe always allocates the 256 individual byte values as tokens first, and then merges pairs of tokens as needed from there. So for us a=97, b=98, c=99, d=100 (their ASCII values).
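To make the merge step concrete, here is a minimal pure-Python sketch of byte-level BPE training. It is not the minbpe source code: the function names `count_pairs`, `merge_pair`, and `train_bpe` are hypothetical, and ties between equally frequent pairs are broken by first occurrence, which is an assumption of this sketch.

```python
def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Start from raw UTF-8 bytes (ids 0..255) and apply `num_merges` merges."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        counts = count_pairs(ids)
        if not counts:
            break  # nothing left to merge
        best = max(counts, key=counts.get)  # most frequent pair
        new_id = 256 + step                 # new ids start after the 256 byte tokens
        ids = merge_pair(ids, best, new_id)
        merges[best] = new_id
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", 3)
print(ids)     # [258, 100, 258, 97, 99]
print(merges)  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
```

With three merges, "aa" becomes 256, then 256 followed by "a" becomes 257, then 257 followed by "b" becomes 258, so "aaab" ends up as the single token 258.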

**References:**<br>
https://en.wikipedia.org/wiki/Byte_pair_encoding

https://github.com/karpathy/minbpe


In [2]:
import sys
sys.path.append("../code")
from minbpe.minbpe import BasicTokenizer

tokenizer = BasicTokenizer()
text = "aaabdaaabac"
tokenizer.train(text, 256 + 3) # 256 are the byte tokens, then do 3 merges
print(tokenizer.encode(text))
# [258, 100, 258, 97, 99]
print(tokenizer.decode([258, 100, 258, 97, 99]))

[258, 100, 258, 97, 99]
aaabdaaabac


In [3]:
# try with an "unknown" word
unknown_word = "hehe"
tokenizer.encode(unknown_word)

[104, 101, 104, 101]
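The output above is simply the UTF-8 bytes of the word, because none of the learned merges apply to it. A short sketch using only built-in Python shows why a byte-level tokenizer never meets an out-of-vocabulary item; the second example string is an assumption added for illustration.

```python
# Any string can be turned into byte ids in the range 0..255,
# so there is always a fallback encoding even when no merge applies.
word = "hehe"
byte_ids = list(word.encode("utf-8"))
print(byte_ids)  # [104, 101, 104, 101]

# Non-ASCII characters become several bytes, but are still encodable.
accented = "café"
accented_ids = list(accented.encode("utf-8"))
print(accented_ids)  # [99, 97, 102, 195, 169]

# Decoding the bytes restores the original text.
restored = bytes(accented_ids).decode("utf-8")
print(restored)  # café
```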