# 📖 Part 2: Tokenization & Embeddings

In this section, we will explore two fundamental steps in NLP:

1. **Tokenization**: Splitting text into smaller units.
2. **Embeddings**: Representing text numerically for machine learning models.
___

## ✂️ Tokenization
Tokenization is the process of breaking text into smaller units called tokens. These tokens can be words, subwords, or characters.

### Types of Tokenization:
- Word Tokenization
- Subword Tokenization (e.g., Byte Pair Encoding)
- Character Tokenization
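
To make the three granularities above concrete, here is a minimal, standard-library-only sketch. The regex word splitter and the `learn_bpe_merges` helper are illustrative assumptions, not the API of any tokenizer library; production BPE implementations typically work on bytes, add end-of-word markers, and train on far larger corpora.

```python
import re
from collections import Counter

text = "low lower lowest"

# Word tokenization: a simple regex split (real tokenizers handle many more cases).
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character tokenization: every character, spaces included, becomes a token.
char_tokens = list(text)


def learn_bpe_merges(words, num_merges):
    """Toy Byte Pair Encoding: repeatedly fuse the most frequent adjacent pair."""
    # Start with each word represented as a tuple of single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, list(corpus)


merges, subword_tokens = learn_bpe_merges(word_tokens, num_merges=2)
print("Words:     ", word_tokens)
print("Characters:", char_tokens[:5], "...")
print("BPE merges:", merges)
print("Subwords:  ", subword_tokens)
```

After two merges the shared stem `low` becomes a single subword, while the rarer suffixes stay split into characters. This is the trade-off subword tokenization aims for: frequent pieces get their own token, and rare words remain representable.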

### 💻 Example: Word Tokenization with NLTK

In [None]:
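import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer data. Recent NLTK releases look for
# 'punkt_tab' while older ones use 'punkt', so fetch both to be safe.
nltk.download('punkt')
nltk.download('punkt_tab')

text = "Tokenization is a crucial step in NLP!"
tokens = word_tokenize(text)  # punctuation such as "!" becomes its own token
print("Tokens:", tokens)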

___
## 📐 Embeddings
Embeddings are dense numerical representations of words or longer texts in a continuous vector space. Words with similar meanings are mapped to nearby vectors, which lets models capture semantic relationships between them.
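
One common way to measure how close two embeddings are is cosine similarity. The sketch below uses hypothetical 3-dimensional vectors chosen by hand for illustration; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
toy_embeddings = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


sim_royal = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_fruit = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(f"king vs queen: {sim_royal:.3f}")
print(f"king vs apple: {sim_fruit:.3f}")
```

With these toy values, "king" and "queen" score much higher than "king" and "apple", which is the kind of structure a trained embedding space is expected to show.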

### Popular Embedding Methods:
- One-Hot Encoding (a sparse baseline rather than a learned dense embedding)
- Word2Vec
- GloVe
- FastText
- Transformer-based Embeddings (e.g., BERT)
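
As a baseline for comparison with the learned methods in this list, the following sketch builds one-hot vectors by hand. The three-word vocabulary and the `one_hot` and `dot` helpers are assumptions made for illustration.

```python
# Hypothetical three-word vocabulary for illustration.
vocab = ["cat", "dog", "car"]
index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


print("dog:", one_hot("dog"))
print("cat . dog:", dot(one_hot("cat"), one_hot("dog")))
print("cat . car:", dot(one_hot("cat"), one_hot("car")))
```

Every pair of distinct words has a dot product of zero, so one-hot vectors say nothing about which words are related, and their length grows with the vocabulary. Methods such as Word2Vec, GloVe, and FastText address this by learning compact dense vectors.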

### 💻 Example: Using Pre-trained Word Embeddings with spaCy

In [None]:
import spacy

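# Load a spaCy English model that ships with static word vectors.
# "en_core_web_sm" has none (token.vector falls back to context tensors),
# so use the medium model: python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")
doc = nlp("Natural Language Processing")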

for token in doc:
    print(f"Token: {token.text}, Vector (first 5 dims): {token.vector[:5]}")

___
## ✅ Next Steps
Proceed to Part 3: Fine-Tuning LLMs to learn how to adapt large language models for specific NLP tasks.
___