# Feature/Data Engineering: encoding

In our last session, we expanded on the curriculum by looking more closely at how unsupervised learning techniques such as DBSCAN clustering and PCA can be applied not only to EDA and dimensionality reduction but also to creating new features, aka feature engineering. If you're interested in reviewing feature engineering for numerical data (with examples), check out:
- notebooks/Mod4_Data_Analytics.ipynb
- notebooks/Mod6_PCA_Clustering.ipynb

In this session, we're going to explore Feature Engineering for categorical features and text-based data, preparing you not only for traditional ML but also for Generative AI such as LLMs.

Before we dive into encoding (tokenization) for LLMs, let’s revisit how we handle feature encoding for categorical data — a crucial step that parallels what LLMs later do implicitly with text.

In [1]:
import pathlib

import pandas as pd
import spacy

import requests
from bs4 import BeautifulSoup

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

from IPython.display import Image, HTML, display

In [2]:
import sys
parent=pathlib.Path.cwd().parent
sys.path.insert(0,str(parent))

from helpers.visuals import show_mermaid_diagram

Let's create a simple sentence, split it into words, store the words in a dataframe, and capture the unique words (in order of first appearance) in a list.

In [41]:
sentence = "I love everything involving AI and machine learning and love the University of California, Berkeley!"

words = sentence.split()

# dict.fromkeys() --> keeps order and the first time each word appears
unique_words = list(dict.fromkeys(words))

df = pd.DataFrame(words, columns=['word'])

## Ordinal Encoding
Tip: IF you have categorical data WITH inherent order (e.g., small/medium/large, education levels) → Use ordinal
- Applied to categorical features (words, categories)
- Output is a single integer per category
- Intended for features with inherent order; in the example below, the IDs simply follow each word's first appearance
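As a minimal sketch of ordinal encoding where the order is genuinely meaningful, the snippet below encodes T-shirt sizes by hand. The `SIZE_ORDER` list, the `orders` data, and the `ordinal_encode` helper are hypothetical and only for illustration; they are not part of this notebook's dataset.

```python
# Hypothetical ordered category: T-shirt sizes (not part of the notebook's data).
SIZE_ORDER = ["small", "medium", "large"]

# Map each category to its rank so that the integers preserve the order.
size_to_id = {size: rank for rank, size in enumerate(SIZE_ORDER)}


def ordinal_encode(values, mapping):
    """Replace each category with its integer rank (raises KeyError on unknowns)."""
    return [mapping[value] for value in values]


orders = ["medium", "small", "large", "small"]
encoded = ordinal_encode(orders, size_to_id)
print(encoded)  # [1, 0, 2, 0]
```

Because small < medium < large holds for the integers as well, a model can make use of comparisons such as "larger than medium", which would be meaningless for unordered categories.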

In [42]:
ord_encoder = OrdinalEncoder(categories=[unique_words],dtype=int)

encoded_ord = ord_encoder.fit_transform(df[['word']].astype(str))

df_ordinal = df.copy()
df_ordinal['ordinal_id'] = encoded_ord.astype(int)

df_ordinal

Unnamed: 0,word,ordinal_id
0,I,0
1,love,1
2,everything,2
3,involving,3
4,AI,4
5,and,5
6,machine,6
7,learning,7
8,and,5
9,love,1


## One-hot Encoding

Using one-hot encoding and a dataframe, we can turn each unique word (categorical data) into its own feature column and visualise where it occurs: 1 where the row's word matches the column, 0 otherwise. 
Tip: IF you have categorical data with NO inherent order (e.g., colors, countries) → Use one-hot
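To make the mechanics concrete, here is a small hand-rolled sketch of one-hot encoding in plain Python. The `COLOURS` list, the `rows` data, and the `one_hot_encode` helper are hypothetical names chosen for illustration; in practice you would normally use scikit-learn's `OneHotEncoder`.

```python
# Hypothetical unordered category: colours (not part of the notebook's data).
COLOURS = ["red", "green", "blue"]


def one_hot_encode(value, categories):
    """Return a 0/1 vector with a single 1 at the position of `value`.

    An unknown value yields an all-zero vector, which mirrors the behaviour
    of handle_unknown="ignore" in scikit-learn's OneHotEncoder.
    """
    return [1 if value == category else 0 for category in categories]


rows = ["green", "blue", "red", "purple"]
encoded = [one_hot_encode(colour, COLOURS) for colour in rows]
for colour, vector in zip(rows, encoded):
    print(f"{colour:>6}: {vector}")
```

Each category gets its own column, so no artificial ordering is implied between "red", "green", and "blue".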

In [43]:
oh_encoder = OneHotEncoder(categories=[unique_words], sparse_output=False, dtype=int, handle_unknown="ignore")

encoded_oh = oh_encoder.fit_transform(df[['word']])

encoded_df = pd.DataFrame(encoded_oh, columns=[f'_{word.lower()}' for word in unique_words], index=df.index)

df_one_hot = pd.concat([df.reset_index(drop=True), encoded_df], axis=1)
df_one_hot


Unnamed: 0,word,_i,_love,_everything,_involving,_ai,_and,_machine,_learning,_the,_university,_of,"_california,",_berkeley!
0,I,1,0,0,0,0,0,0,0,0,0,0,0,0
1,love,0,1,0,0,0,0,0,0,0,0,0,0,0
2,everything,0,0,1,0,0,0,0,0,0,0,0,0,0
3,involving,0,0,0,1,0,0,0,0,0,0,0,0,0
4,AI,0,0,0,0,1,0,0,0,0,0,0,0,0
5,and,0,0,0,0,0,1,0,0,0,0,0,0,0
6,machine,0,0,0,0,0,0,1,0,0,0,0,0,0
7,learning,0,0,0,0,0,0,0,1,0,0,0,0,0
8,and,0,0,0,0,0,1,0,0,0,0,0,0,0
9,love,0,1,0,0,0,0,0,0,0,0,0,0,0


## Advanced: From Text --> Categorical Data --> One-Hot Encoding

So far, we’ve treated words themselves as categories and encoded them directly, but in the real world we often want to transform raw text into higher-level categorical features before encoding.

One way to create categorical features from raw text (and then encode them) is to categorise words as nouns, verbs, adjectives, etc., aka part-of-speech (POS) tagging.

In Natural Language Processing (NLP), which we will cover later, you can do this using a library like spaCy.

*Note: installing the requirements (pip install -r requirements.txt) installs spaCy but not its language model, so you will still need to run the command: python -m spacy download en_core_web_sm*

In [45]:
categorise_text = spacy.load("en_core_web_sm")

text = categorise_text(sentence)

words = []
pos_categ = []

for token in text:
    if token.is_space or token.is_punct:
        continue
    words.append(token.text)
    pos_categ.append(token.pos_)

df_pos = pd.DataFrame({'word': words,'pos': pos_categ})

unique_pos = list(dict.fromkeys(pos_categ))

oh_encoder = OneHotEncoder(categories=[unique_pos], sparse_output=False, dtype=int, handle_unknown="ignore")

encoded_oh = oh_encoder.fit_transform(df_pos[['pos']])

encoded_df = pd.DataFrame(encoded_oh, columns=[f'_{pos.lower()}' for pos in unique_pos], index=df_pos.index)

df_pos_ohe = pd.concat([df_pos.reset_index(drop=True), encoded_df], axis=1)
df_pos_ohe


Unnamed: 0,word,pos,_pron,_verb,_propn,_cconj,_noun,_det,_adp
0,I,PRON,1,0,0,0,0,0,0
1,love,VERB,0,1,0,0,0,0,0
2,everything,PRON,1,0,0,0,0,0,0
3,involving,VERB,0,1,0,0,0,0,0
4,AI,PROPN,0,0,1,0,0,0,0
5,and,CCONJ,0,0,0,1,0,0,0
6,machine,NOUN,0,0,0,0,1,0,0
7,learning,NOUN,0,0,0,0,1,0,0
8,and,CCONJ,0,0,0,1,0,0,0
9,love,VERB,0,1,0,0,0,0,0


These encoders give numerical structure to non-numeric data. However, they rely entirely on human-defined categories. In contrast, LLMs will learn their own encoding rules.

# LLM Encoding & Features: ML similarities & differences

## PCA v LLM Latent Features
When you apply scaling, PCA, or clustering on numerical data, you're effectively trying to uncover hidden structure that isn't visible in your existing features/dataset. For example, PCA explicitly creates **latent features** (principal components) as linear combinations of your existing numerical features.

In LLMs, latent features are not computed by a predefined algorithm, nor are they assumed to be linear; instead, they emerge from the raw data during training.

In other words, 
- Traditional ML: **you** *engineer* latent features explicitly --> using tools like PCA or clustering 
- LLMs: the **model** *discovers* latent features implicitly --> through backpropagation
  

## Encoding text v features

In traditional ML, we encode text features using techniques such as ordinal encoding or one-hot encoding to transform the features into a numerical format the model can ingest.

Similarly, encoding is used in LLMs to prepare text data prior to model training. The difference is that LLMs begin encoding text data before any features are identified. 
- *LLMs begin by converting raw text into token IDs (numerical form) before the model learns any features or relationships between them.*

That is to say, LLMs identify features implicitly, as part of their training process, rather than explicitly, as part of pre-processing (as we have seen thus far in typical regression and classification models).
- *LLMs identify features implicitly within their neural layers during training, rather than explicitly during data preprocessing.*

Summary:
- ML Features: explicit, identified by the Data Scientist before training
- LLM Features: implicit, discovered by the Model during training

As Data Scientists, we still need to prepare the data for encoding; it's just that we're not encoding features.

Let’s visualise how these two approaches — explicit feature engineering and implicit feature discovery — differ at the pipeline level:

In [3]:
show_mermaid_diagram("ml_llm_data_prep_pipeline")

## LLM Data Preparation: Chunking

To help an LLM discover these text features, the first step before training is to break down its enormous training dataset (corpus) into manageable chunks for encoding and then training.

Chunks are typically semantically meaningful bodies of text, which can range from a single sentence to an entire document depending on the training objective. The upper limit on chunk size is set by the LLM's context window.

After breaking down the corpus into chunks, we then need to break down the chunks into small, unique strings we can then encode.
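The chunking step can be sketched roughly as follows. This is a simplified illustration, assuming sentence boundaries are a good enough proxy for "semantically meaningful" and that a word count stands in for the real token-based limit; `chunk_text`, `max_words`, and the sample `corpus` are hypothetical.

```python
import re


def chunk_text(text, max_words=12):
    """Greedily pack whole sentences into chunks of at most `max_words` words.

    Word count is a stand-in for token count here; a sentence longer than
    the limit becomes its own (oversized) chunk rather than being cut.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the limit.
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


corpus = (
    "Machine learning finds patterns in data. "
    "Large language models learn from text. "
    "Tokenisers turn text into integers. "
    "Chunking keeps each training example small."
)
chunks = chunk_text(corpus, max_words=12)
for i, chunk in enumerate(chunks):
    print(i, chunk)
```

Keeping sentences whole is a design choice for readability; production pipelines often measure chunk size in tokens and may overlap neighbouring chunks.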

## Encoding small strings

These small, unique strings are frequently occurring character combinations in a language, often representing words or sub-words. Once identified, each one is mapped to a unique ID, and the text is encoded by replacing each string with its ID.
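As a toy illustration of replacing strings with IDs, the sketch below uses a tiny hand-written vocabulary. Everything here is an assumption for illustration: the `vocab` contents are invented, and the greedy longest-match lookup in `encode` is just one simple strategy, not how a BPE tokeniser selects its tokens.

```python
# Hypothetical hand-picked vocabulary; a real tokeniser learns this from a corpus.
vocab = {"[UNK]": 0, "learn": 1, "ing": 2, "machine": 3, " ": 4}


def encode(text, vocab):
    """Greedy longest-match lookup: at each position take the longest known string."""
    ids, pos = [], 0
    longest = max(len(s) for s in vocab)
    while pos < len(text):
        for size in range(min(longest, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += size
                break
        else:
            # No known string starts here: emit the unknown ID, skip one character.
            ids.append(vocab["[UNK]"])
            pos += 1
    return ids


id_to_string = {i: s for s, i in vocab.items()}
ids = encode("machine learning", vocab)
print(ids)                               # [3, 4, 1, 2]
print([id_to_string[i] for i in ids])    # ['machine', ' ', 'learn', 'ing']
```

As long as `vocab` stays frozen, the same text always produces the same IDs, which is the consistency a model relies on.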

In practice, this encoding process is so time-consuming and complicated (how can you guarantee the sub-string-to-ID mappings don't change when you update your training corpus?) that it requires a model of its own.

As a result, most LLMs use pre-trained encoders created from other models.

These models are referred to as 'Tokenisers'.


## Tokenisers: text encoding models for LLMs

A "Tokenizer" is simply a model that CONSISTENTLY encodes your text data into a numerical format for LLM ingestion. This process is called tokenisation. Under the hood, the most popular algo for tokenisation is BPE.

**Byte Pair Encoding (BPE)** is a popular encoding algorithm used in tokenisers that both extracts and engineers features.

BPE seeks to:
- identify the most frequent combinations of characters in your text dataset as features → **feature extraction**
- map each feature to a unique integer (token id) suitable for model training → **feature engineering**
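The two steps above can be sketched with a toy BPE trainer in plain Python. This is a simplified illustration rather than the implementation used by the `tokenizers` library: it works on characters instead of bytes, ignores special tokens, and breaks ties alphabetically. The `train_bpe` function and `toy_words` list are hypothetical.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn `num_merges` merge rules from a list of words (toy BPE trainer)."""
    # Each distinct word is a tuple of symbols, starting as single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Feature extraction: count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties broken alphabetically so runs are reproducible.
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    # Feature engineering: give every symbol an integer token id.
    symbols = sorted({ch for word in words for ch in word})
    symbols += [a + b for a, b in merges]
    vocab = {symbol: idx for idx, symbol in enumerate(symbols)}
    return merges, vocab


toy_words = ["learning", "learning", "training", "testing", "learn"]
merges, vocab = train_bpe(toy_words, num_merges=4)
print(merges)
print(vocab)
```

On this toy input the first two merges are `('i', 'n')` and then `('in', 'g')`, so the frequent suffix "ing" ends up as a single token with its own ID.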

Want to visualise text tokenisation in OpenAI models? [Tiktokenizer](https://tiktokenizer.vercel.app/)

In [4]:
show_mermaid_diagram("tokenizer_process", center=True)

### 1. Data Collection & Cleaning

Let's create our corpus by scraping text from this course's description at UC Berkeley's Executive Education Program: https://em-executive.berkeley.edu/professional-certificate-machine-learning-artificial-intelligence


In [66]:
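dataset_dir = pathlib.Path("../datasets")
# Create the folder if it is missing so the corpus file can be written later
dataset_dir.mkdir(parents=True, exist_ok=True)

url = "https://em-executive.berkeley.edu/professional-certificate-machine-learning-artificial-intelligence"

# The timeout (in seconds) stops the request from hanging on an unresponsive site
corpus_source = requests.get(url, timeout=30)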
corpus_source.raise_for_status()
soup = BeautifulSoup(corpus_source.text, "html.parser")

Let's do some basic cleaning and joining of our corpus data.
Note: data cleaning isn't as important here as it is during normal model data preparation. This is because we're less interested in the semantic meaning and data quality than we are in the combinations of characters in our text, which are what create tokens.

In [67]:
paragraphs = [p.get_text().strip() for p in soup.find_all("p") if p.get_text().strip()]
full_text = "\n".join(paragraphs)

raw_file = dataset_dir / "corpus_full.txt"
with raw_file.open("w", encoding="utf-8") as f:
    f.write(full_text)

### 2: Tokenizer Algorithm
Select the tokenizer algorithm and any pre-processing steps.

In [68]:
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()

### 3-8: Tokenizer Training

Now that our text corpus is saved, let’s train our own BPE-based tokenizer. This is the same type of process used for models like GPT, except that they run it on a massive scale.

In [69]:
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],vocab_size=2000)

tokenizer.train(files=[str(raw_file)], trainer=trainer)






### 9. Final Vocabulary
That's it. You have now created a "vocabulary" that maps each token (string) to a token id, for every frequent combination of characters the trainer could find!

Note: as languages gain new words, LLM use cases become more specialised, and models become more powerful, I expect vocab sizes to increase to incorporate more whole words e.g. LinkedIn, Professor etc.

Below is a sample of the tokens our tokenizer learned — these represent frequent subword units discovered from the training corpus.

In [70]:
print("Vocab size:", tokenizer.get_vocab_size())

vocab = tokenizer.get_vocab()

for token, idx in list(vocab.items())[:10]:
    print(f"{idx}: {token}")

Vocab size: 1661
1080: oking
1452: email
252: gh
512: sci
898: ),
337: job
1096: sch
1500: 9495
1566: librar
1133: ze


# Closing Remarks

## Text Encoding comparison: ML v LLM

Let’s now merge our outputs — one-hot, ordinal, and BPE — into a single table to visually compare how traditional and LLM encodings differ.

In [65]:
token_ids = []
for word in words:
    encoded = tokenizer.encode(word)
    token_ids.append(encoded.ids)

df_bpe = pd.DataFrame({"word": words,"token_id": token_ids})

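# Add a positional key so repeated words (e.g. "and", "love") line up row by row
df_ordinal["idx"] = df_ordinal.index
df_pos_ohe["idx"] = df_pos_ohe.index
df_bpe["idx"] = df_bpe.index

# Inner merges on position + word: a row is kept only where both tables hold
# the same word at the same position
df_encode = df_bpe.merge(df_pos_ohe, on=["idx", "word"]).drop(columns="pos")
df_encode = df_encode.merge(df_ordinal, on=["idx", "word"]).drop(columns="idx")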

df_encode


Unnamed: 0,word,token_id,_pron,_verb,_propn,_cconj,_noun,_det,_adp,ordinal_id
0,I,[39],1,0,0,0,0,0,0,0
1,love,"[137, 114]",0,1,0,0,0,0,0,1
2,everything,"[1024, 79, 86, 98]",1,0,0,0,0,0,0,2
3,involving,"[83, 76, 722]",0,1,0,0,0,0,0,3
4,AI,[134],0,0,1,0,0,0,0,4
5,and,[96],0,0,0,1,0,0,0,5
6,machine,[750],0,0,0,0,1,0,0,6
7,learning,[448],0,0,0,0,1,0,0,7
8,and,[96],0,0,0,1,0,0,0,5
9,love,"[137, 114]",0,1,0,0,0,0,0,1


### Lightbulb moment
What is cool about this? In our original sentence, the word "involving" (idx: 3) is tokenised BUT doesn't exist in the corpus!

Instead, it is assembled from sub-strings that do appear in the corpus. The string:
- "inv" appeared 8 times in the corpus from the word "invite" --> encoded with token id: 83
- "olv" appeared 6 times in the corpus, from words such as "solve" and "evolve" --> encoded with token id: 76
- "ing" appeared 152 times in the corpus, from lots of words! --> encoded with token id 722
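To see how an unseen word can still be tokenised, here is a minimal sketch that applies merge rules to a word character by character. The `MERGES` list is hypothetical, chosen so that the result matches the three pieces above; the real rules learned by our tokenizer, and their order, will differ.

```python
# Hypothetical merge rules, listed in the order a BPE trainer might have learned them.
MERGES = [("i", "n"), ("in", "g"), ("o", "l"), ("in", "v"), ("ol", "v")]


def bpe_tokenize(word, merges):
    """Split `word` into characters, then apply each merge rule in learned order."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                # Fuse the matching neighbours into a single symbol.
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


print(bpe_tokenize("involving", MERGES))  # ['inv', 'olv', 'ing']
print(bpe_tokenize("solving", MERGES))    # ['s', 'olv', 'ing']
```

No rule ever saw the full word "involving", yet the pieces learned from other words are enough to cover it.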

# (Out of Scope) LLM Encoding during inference

We discussed a lot about feature engineering and encoding today, but this is essentially what happens every time you run a prompt/query with an LLM such as ChatGPT!

User Input: "I love everything involving machine learning and the University of California, Berkeley!"

LLM's tokenizer: breaks this text into subword tokens (a BPE tokenizer does this by applying its learned merge rules in order), matches each token against its vocabulary, finds its token ID... (see diagram below for a fuller picture)


**The takeaway:** whether you’re engineering one-hot vectors or training massive LLMs, both rely on feature encoding — the key difference lies in who engineers the features: you, or the model.


Have a great day!

In [6]:
show_mermaid_diagram("llm_encoding", center=True)