✅ Phase 0: NLP Foundations – "Speaking the Language of AI"
This phase is designed to build your intuition, technical skill, and vocabulary around Natural Language Processing (NLP), the core layer of Retrieval-Augmented Generation (RAG). Here’s what we’ll cover:

🔹 What is NLP?
Natural Language Processing (NLP) is the intersection of linguistics and machine learning that enables machines to understand, interpret, and generate human language.

🔹 Why is NLP the Foundation of RAG?
RAG uses language models for generation (G in RAG).

It uses retrieval mechanisms that depend on understanding semantics and meaning (NLP!).

Concepts like tokenization, embeddings, context windows, etc., come directly from NLP.

🔹 What You’ll Learn in This Phase:
| Concept                            | Why It Matters for RAG                                | Activities                                             |
| ---------------------------------- | ----------------------------------------------------- | ------------------------------------------------------ |
| **Tokenization**                   | Breaks text into pieces that the model can process    | Try tokenizing using `nltk` and `transformers`         |
| **Embeddings**                     | Convert text into numerical vectors for comparison    | Experiment with OpenAI, Hugging Face, and local models |
| **Context Windows**                | Determines how much text a model can "see" at once    | Visualize limits of different LLMs                     |
| **Text Cleaning**                  | Prepares data for indexing and retrieval              | Build a simple pipeline using `re` and `nltk`          |
| **Similarity Search**              | Core to retrieval in RAG                              | Use cosine similarity between embeddings               |
| **Named Entity Recognition (NER)** | Helps extract meaningful chunks from documents        | Use `spaCy` or `transformers` for practical use        |
| **Vectorization Techniques**       | From TF-IDF → Word2Vec → Sentence Transformers        | Compare their outputs and quality                      |
| **Prompt Engineering Basics**      | Formulating questions that produce accurate responses | Design and refine prompts with LLMs                    |


🔸 Suggested Tools for Hands-on Practice:
🐍 Python Libraries:

nltk, spaCy, transformers, sentence-transformers, sklearn

🧪 Try This Notebook:

notebooks/RAG/nlp_fundamentals.ipynb → where you’ll practice hands-on

In [4]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


✅ 1. Tokenization :

Breaks text into pieces that the model can process



In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Hello, this is a testing for RAG learning with a sample sentence"

tokens = tokenizer.tokenize(text)
tokenids = tokenizer.convert_tokens_to_ids(tokens)

print("Tokens:",tokens)
print("TokenIds:",tokenids)


Tokens: ['hello', ',', 'this', 'is', 'a', 'testing', 'for', 'rag', 'learning', 'with', 'a', 'sample', 'sentence']
TokenIds: [7592, 1010, 2023, 2003, 1037, 5604, 2005, 17768, 4083, 2007, 1037, 7099, 6251]


✅ 2. Normalization

Text normalization is the process of preparing text for analysis by converting it into a standard format.

🔹 Why do we normalize?

To remove inconsistencies (e.g., "Hello", "hello!", "HELLO" → all mean the same)

Reduces noise before downstream NLP tasks (e.g., vectorization)

🔹 What’s involved?

Lowercasing

Removing punctuation

Stemming

Lemmatization



In [4]:
import re
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')

text = "Running, jumped and runs are forms of run. Apples are tasty!"

# Lowercase
text = text.lower()

# Remove punctuation
text = re.sub(r'[^\w\s]', '', text)

# Tokenize
tokens = nltk.word_tokenize(text)

# Stemming
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokens]

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token) for token in tokens]

print("Stemmed:", stemmed)
print("Lemmatized:", lemmatized)


[nltk_data] Downloading package punkt to /home/koyas/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /home/koyas/nltk_data...


LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - '/home/koyas/nltk_data'
    - '/home/koyas/genai-lab-sandbox/venv/nltk_data'
    - '/home/koyas/genai-lab-sandbox/venv/share/nltk_data'
    - '/home/koyas/genai-lab-sandbox/venv/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
