## Natural Language Processing

### NLP Interview Questions (AI/ML Engineer): Easy → Hard

**5) Hard: Production NLP Failure Modes**
An NLP model performs well in validation but fails in production.

* List common NLP-specific failure modes.
* How would you detect data drift in text?
* How would you debug hallucinations or semantic errors in an LLM-based system?


**1) Easy: Text Cleaning & Normalization**
You're given a column `review_text`.

* What basic preprocessing steps would you apply before modeling?
* Which steps depend on the downstream model (TF-IDF vs transformer)?

**Answer:**
1. Text cleaning: remove HTML tags, URLs, PII, special characters, and extra whitespace; decide how to handle emojis (drop, keep, or map to text).
2. Lowercasing: convert all text to lowercase to reduce vocabulary size.
3. Tokenization: split text into smaller units called tokens.
4. Stop-word removal: drop high-frequency function words such as "is", "the", "in".
5. Stemming: strip suffixes with heuristic rules (running, runs -> run); reduces vocabulary size.
6. Lemmatization: reduce words to their base form using linguistic rules (better -> good).
7. Named entity recognition: identify real-world entities such as people, organizations, and places.
8. Spelling correction.
9. Feature extraction / vectorization: convert text into numerical form.
    - Bag of Words
    - TF-IDF
    - Word embeddings (Word2Vec, GloVe)
    - Contextual embeddings (BERT)

Which steps depend on the model: TF-IDF needs heavier preprocessing, whereas transformers need very little.
- Stop-word removal. TF-IDF: stop words dominate raw frequency but add little value, so remove them. Transformers: keep them, because the model needs full sentences to capture meaning and attention patterns.
- Stemming or lemmatization. TF-IDF: reduces sparsity by merging similar word forms into a single term. Transformers: skip it, because the model relies on grammar and tense.
- Lowercasing. TF-IDF: almost always. Transformers: follow the checkpoint; uncased models lowercase inside their tokenizer, cased models expect the original case.
- Tokenization. TF-IDF: literal words become tokens (split on whitespace and punctuation). Transformers: use the pretrained model's own subword tokenizer (BPE / WordPiece).
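
The model-dependent steps above can be sketched as a single cleaning function with a switch. This is a minimal illustration using only the standard library; the function name `clean_text` and the flag `for_tfidf` are hypothetical, and PII masking, emoji handling, and stop-word removal are deliberately left out.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str, for_tfidf: bool = True) -> str:
    """Light cleaning for both model families; extra normalization only for TF-IDF."""
    text = html.unescape(text)      # "&amp;" -> "&"
    text = TAG_RE.sub(" ", text)    # drop HTML tags
    text = URL_RE.sub(" ", text)    # drop URLs
    if for_tfidf:
        text = text.lower()                # shrink the vocabulary
        text = NON_WORD_RE.sub(" ", text)  # drop punctuation and symbols
    return SPACE_RE.sub(" ", text).strip()


raw = "<p>Great phone!! See https://example.com &amp; buy</p>"
print(clean_text(raw, for_tfidf=True))   # aggressive: for sparse models
print(clean_text(raw, for_tfidf=False))  # light: leave case and punctuation to the tokenizer
```

Keeping one function with a flag makes the trade-off explicit in code review: the transformer path only removes markup and URLs.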

**2) Easy–Medium: Tokenization Strategies**
Explain the difference between:

* word-level tokenization
* subword tokenization (BPE / WordPiece)
* character-level tokenization

When would each fail or succeed in real NLP systems?

**Answer:**
1. Word-level tokenization
- Splits text into words, usually on whitespace and punctuation.
- Works well for languages with clear word boundaries and for classical models (BoW, TF-IDF).
- Works best when the vocabulary is small and clean.
- `Fails` on misspellings, typos, and out-of-vocabulary words, and when the vocabulary grows very large.
2. Subword tokenization (BPE / WordPiece)
- Breaks words into frequently occurring subword units learned from data.
- For example, "unhappiness" -> ['un', 'happi', 'ness'], "ChatGPTlike" -> ['Chat', 'GPT', 'like'] (the exact split depends on the learned vocabulary).
- Works well with modern neural networks, handles unknown words, and keeps vocabulary size under control.
- Preserves meaning better than characters.
- `Fails` when token boundaries do not align with linguistic units; splits can be unintuitive.
3. Character-level tokenization
- Treats every character as a token.
- "cat" -> ['c', 'a', 't']
- Extremely robust to typos and noise.
- Used in special domains such as DNA sequences, source code, and OCR text.
- `Fails` on long inputs: sequences become very long, computation gets expensive, and meaning is harder to learn because semantic structure is lost.
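
To make the three strategies concrete, here is a small sketch in plain Python. The subword function implements the greedy longest-match-first split used at inference time by WordPiece-style tokenizers; `toy_vocab` is a made-up vocabulary for illustration, not one learned from data, and real tokenizers add normalization steps that are omitted here.

```python
import re


def word_tokenize(text):
    """Words plus standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokenize(text):
    """Every character is a token."""
    return list(text)


def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"un", "##happi", "##ness", "happy"}  # hypothetical vocabulary

print(word_tokenize("Good movie!"))
print(char_tokenize("cat"))
print(wordpiece_tokenize("unhappiness", toy_vocab))
print(wordpiece_tokenize("unhappy", toy_vocab))
```

The last call shows the failure mode: with a vocabulary that lacks the right continuation pieces, a perfectly ordinary word collapses to the unknown token.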

**3) Medium: Text Representation Choices**
You need to build a classifier with **limited labeled data**.

* Compare BoW, TF-IDF, word embeddings, and transformer embeddings.
* Which would you choose and why?

**Answer:**
1. Bag of Words (BoW):
- Represents text as word occurrence counts.
- Pros: very simple and fast | works well with linear models | low risk of overfitting.
- Cons: no word importance weighting, no semantics or context, very sparse and high-dimensional, poor generalization with limited data.
- Use when data is extremely small and you need a baseline model.

2. TF-IDF
- Weighted version of BoW that downweights common words and upweights informative ones.
- Pros: strong baseline for small datasets | better signal than BoW | works well with linear classifiers | interpretable and stable.
- Cons: sparse, no semantics or word order, cannot capture synonymy or context.
- Use when labeled data is limited.

3. Word Embeddings (Word2Vec / GloVe / FastText):
- Dense vector representations learned from large corpora.
- Pros: capture semantic similarity | dense and lower-dimensional | pretrained embeddings are available.
- Cons: context-independent (one vector per word regardless of sense), need pooling to get a document vector, often underperform TF-IDF on small labeled datasets.
- Use when data size is moderate and semantics matter.

4. Transformer Embeddings (BERT, RoBERTa)
- Contextual embeddings pretrained on massive corpora.
- Pros: strong semantic understanding, handle context, excellent for transfer learning.
- Cons: computationally expensive, risk of overfitting when fine-tuned on very small labeled data, more complex to train and tune.
- Use when complex language understanding is required.

Choice with limited labeled data: start with TF-IDF plus a linear classifier as the baseline, and move to transformer embeddings only if the task needs semantic understanding that the baseline cannot provide.
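
The difference between BoW and TF-IDF is easiest to see on a tiny corpus. The sketch below computes both by hand in plain Python. The smoothed IDF formula is one common variant (it matches scikit-learn's default `smooth_idf=True`, before L2 normalization); the three documents are invented for illustration.

```python
import math
from collections import Counter

docs = [
    "the match was a great match",
    "the market fell",
    "the team won the match",
]
tokenized = [d.split() for d in docs]

# Bag of Words: raw term counts per document.
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)


def idf(term):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


# TF-IDF: term count scaled by how rare the term is across documents.
tfidf = [{t: c * idf(t) for t, c in counts.items()} for counts in bow]

for term in ("the", "match", "market"):
    print(term, round(idf(term), 3))
```

"the" appears in every document and gets the minimum weight, while "market" appears once and gets the highest: this is the "downweights common words" behaviour described above.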

**4) Medium–Hard: Handling Long Documents**
Your documents exceed the 512-token limit of BERT-like models.

* What strategies would you use to handle this?
* How do chunking and hierarchical models differ?
* What are the trade-offs?

**Answer:**
1. Chunking
- Split the document into smaller chunks, process each chunk independently with BERT, then aggregate.
- Aggregation methods: mean or max pooling over chunk embeddings, attention-based pooling, or majority vote for classification.
- Pros: easy to implement; works with a standard BERT model.
- Cons: loses global document context, ignores cross-chunk dependencies, and aggregation may dilute important signals.
2. Hierarchical models: model text at multiple levels (tokens -> sentences -> document).
- Example: a sentence encoder (BERT) produces sentence embeddings, and a second, lighter model combines them into a document representation.
- Pros: preserves structure and long-range dependencies; better document-level understanding.
- Cons: more expensive to train and serve.
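
A minimal sketch of the chunking strategy, assuming the document is already tokenized into a list. The names `chunk_tokens` and `mean_pool` and the default `stride` are illustrative choices; in a real BERT pipeline the usable window is slightly smaller than 512 because special tokens take up positions.

```python
def chunk_tokens(tokens, max_len=512, stride=384):
    """Split a token list into windows of max_len that overlap by max_len - stride."""
    if stride <= 0 or stride > max_len:
        raise ValueError("stride must be between 1 and max_len")
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end
    return chunks


def mean_pool(vectors):
    """Average chunk-level vectors into a single document vector."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy example: 10 "tokens", windows of 4 with an overlap of 1.
print(chunk_tokens(list(range(10)), max_len=4, stride=3))

# Pretend each chunk was encoded into a 2-dimensional vector.
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```

The overlap softens the "cross-chunk dependencies are ignored" problem a little, at the cost of encoding some tokens twice.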

In [4]:
import pandas as pd

df = pd.read_csv("../data/NLP_covid/Corona_NLP_train.csv", encoding = "latin-1")
df.head()

| Unnamed: 0 | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 3799 | 48751 | London | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 2 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | 3802 | 48754 | | 16-03-2020 | My food stock is not the only one which is emp... | Positive |
| 4 | 3803 | 48755 | | 16-03-2020 | Me, ready to go at supermarket during the #COV... | Extremely Negative |


## Goal: 2 hour NLP Revision - 11:37

After 2 hours, you should be able to:
* Explain NLP systems as pipelines
* Answer "how would you build..." questions
* Discuss failure modes and tradeoffs
* Avoid sounding academic or junior

> NLP is not about the model; it is about converting messy text into reliable decisions.

### NLP Pipeline
Text ingestion -> Cleaning & Normalization -> Chunking/segmentation -> Task-Specific Model -> Post processing -> evaluation & monitoring.

### 5 NLP problems
> use case -> Model Choice -> Failure -> Fix
1. Text Classification
    - Use: intent detection, routing, spam vs. non-spam
    - Failure: class imbalance, ambiguity
    - Fix: decision thresholds, an abstain class
        - In production, not predicting is better than predicting wrong.

2. Information Retrieval - Search engines, product search in e-commerce, legal document search.
3. Text Generation - Generate human-like text (examples: text completion, chatbots, creative writing).
4. Machine Translation - From one language to another.
5. Text Summarization - News summaries, meeting notes generation, legal document summaries.
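
The threshold-plus-abstain fix listed under text classification can be sketched in a few lines. The function name, the default threshold of 0.7, and the `ABSTAIN` label are assumptions for illustration; in practice the threshold is tuned on validation data.

```python
def predict_or_abstain(probs, threshold=0.7, abstain_label="ABSTAIN"):
    """Return the top label only when the model is confident enough."""
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    return label if confidence >= threshold else abstain_label


print(predict_or_abstain({"spam": 0.90, "ham": 0.10}))
print(predict_or_abstain({"spam": 0.55, "ham": 0.45}))
```

Abstained items are typically routed to a human or a fallback rule, which is what makes "not predicting" cheaper than a wrong prediction.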

## NLP Question set - NLTK & SpaCy - Day 4

#### 🟢 Question 1: Easy (HR + Screening)

**"Can you explain the difference between NLTK and spaCy, and when you would use each?"**

NLTK and spaCy are both Python libraries for natural language processing, and both cover basic preprocessing such as tokenization and lemmatization. They differ in focus:
- NLTK is experiment-centric: a broad toolkit suited to linguistic exploration and classical ML pipelines (POS tagging, NER, and so on).
- spaCy is production-centric: it is built for fast, scalable, and reliable applications that ship to production.
> "NLTK is primarily used for linguistic exploration and classical NLP experimentation, while spaCy is designed for building fast, scalable, production-grade NLP pipelines with pretrained and custom models."

**What I'm testing**

* Conceptual clarity
* Ability to communicate simply
* Correct tool selection

**Strong signal**

* Mentions *NLTK for exploration* and *spaCy for production*
* Avoids buzzwords

**Red flag**

* "They are both NLP libraries; I use whichever is faster"

---

#### 🟡 Question 2: Easy → Medium (Applied NLP)

**"Suppose you are building a sentiment analysis system for customer reviews. Walk me through your NLP pipeline."**

Start with dataset curation -> data preprocessing (cleaning, tokenization, and, for classical models, stop-word removal and lemmatization/stemming) -> model training -> evaluation.

> I'd start with data inspection to understand noise, imbalance, and language style, apply selective normalization, tokenize appropriately based on model choice, train a baseline before moving to more complex models, and evaluate using metrics like F1 while performing error analysis on misclassified reviews.

**What I'm testing**

* End-to-end thinking
* Understanding of preprocessing
* Awareness of real data issues

**Strong signal**

* Mentions normalization, tokenization, model choice, evaluation
* Acknowledges noise, sarcasm, imbalance

**Red flag**

* Jumps straight to "use BERT" without pipeline context

---

#### 🟠 Question 3: Medium (System Design)

**"We need a Named Entity Recognition system for legal documents. spaCy's pretrained model performs poorly. What steps would you take?"**

**What I'm testing**

* Domain adaptation
* Custom model training
* Error analysis mindset

**Strong signal**

* Talks about annotation strategy
* Mentions rule-based + ML hybrid
* Mentions evaluation & iteration

**Red flag**

* "I'll try a bigger model"
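
One way to picture the rule-based + ML hybrid mentioned under the strong signals: high-precision regex rules catch well-formatted entities, and model predictions are kept only where they do not collide with a rule. Everything here is a hypothetical sketch: the patterns, the labels, and the hard-coded `model_spans` stand in for real legal citation formats and a real NER model's output.

```python
import re

# Hypothetical patterns for illustration; real citation formats vary by jurisdiction.
RULES = {
    "STATUTE": re.compile(r"\bSection \d+[A-Za-z]?\b"),
    "CASE_NO": re.compile(r"\bCase No\. \d+/\d{4}\b"),
}


def rule_entities(text):
    """Return (start, end, label) spans found by the regex rules."""
    spans = []
    for label, pattern in RULES.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    return spans


def merge_entities(rule_spans, model_spans):
    """Keep every rule span; keep model spans only where they overlap no rule span."""
    merged = list(rule_spans)
    for start, end, label in model_spans:
        overlaps = any(start < r_end and r_start < end for r_start, r_end, _ in rule_spans)
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)


text = "Under Section 12 the court in Case No. 45/2019 ruled for Acme Corp."
org_start = text.index("Acme Corp")
# Stand-in for a statistical model's predictions (one conflicts with a rule).
model_spans = [(6, 16, "LAW"), (org_start, org_start + len("Acme Corp"), "ORG")]

for start, end, label in merge_entities(rule_entities(text), model_spans):
    print(label, repr(text[start:end]))
```

Giving rules priority is only one possible policy; the opposite (model first, rules as fallback) can make sense when the rules are noisy.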

---

#### 🔵 Question 4: Hard (Scaling & Collaboration)

**"We suddenly need to process 50 million documents per month. What changes in your NLP system design?"**

**What I'm testing**

* Scalability thinking
* Engineering maturity
* Cross-functional awareness

**Strong signal**

* Mentions batching, streaming, memory, parallelism
* Talks about infra, monitoring, tradeoffs
* Knows spaCy strengths here

**Red flag**

* "Just add more servers"
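
Batching and streaming, two of the strong signals above, can be illustrated without any NLP library. The sketch below reads documents lazily and processes them in fixed-size batches so memory stays bounded; `handle_batch` is a placeholder for whatever model call you use. With spaCy, the analogous built-in is `nlp.pipe`, which accepts `batch_size` and `n_process`.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items without loading the whole stream."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_stream(docs, handle_batch, batch_size=1000):
    """Apply handle_batch to each batch and yield the results one by one."""
    for batch in batched(docs, batch_size):
        yield from handle_batch(batch)


# Toy usage: "processing" is just upper-casing each document.
results = process_stream(["a", "b", "c"], lambda batch: [d.upper() for d in batch], batch_size=2)
print(list(results))
```

Because both functions are generators, the same code works whether `docs` is a list, a file handle, or a message-queue consumer.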

---

#### 🔴 Question 5: Very Hard (Senior / Founder-Level)

**"An entity extraction model performs well in offline evaluation but fails badly in production. How do you debug this?"**

**What I'm testing**

* Real-world experience
* Failure analysis
* Ownership mindset

**Strong signal**

* Talks about data drift
* Mentions logging, input distribution changes
* Discusses annotation mismatch
* Proposes monitoring & retraining loop

**Red flag**

* Blames the model without investigation
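
For the data-drift part of the answer, two cheap signals can be computed from logged inputs: the out-of-vocabulary rate against the training vocabulary, and a divergence between training and production token distributions. The sketch below uses whitespace tokens and Jensen-Shannon divergence; both choices, and the function names, are assumptions for illustration rather than a standard recipe.

```python
import math
from collections import Counter


def token_distribution(texts):
    """Relative frequency of each lowercase whitespace token."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    def kl(a, m):
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    vocab = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def oov_rate(texts, known_vocab):
    """Share of tokens that never appeared in the training vocabulary."""
    tokens = [tok for t in texts for tok in t.lower().split()]
    return sum(tok not in known_vocab for tok in tokens) / max(len(tokens), 1)


train = ["refund my order", "cancel my order"]
prod = ["refund my crypto wallet", "wallet login broken"]

train_dist = token_distribution(train)
print(round(js_divergence(train_dist, token_distribution(prod)), 3))
print(round(oov_rate(prod, set(train_dist)), 3))
```

Tracked per day or per week, a rising value of either number is a prompt to sample production inputs and check them against the annotation guidelines.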

---

## 🎯 What These Questions Reveal

| Level | What I learn                               |
| ----- | ------------------------------------------ |
| Q1–Q2 | Can you work in NLP at all?                |
| Q3    | Are you an applied NLP engineer?           |
| Q4    | Can you ship and scale?                    |
| Q5    | Can I trust you with my startup's core AI? |

---

## Coach's advice (important)

If you can **confidently answer Q3–Q5**, you are:

* Senior NLP Engineer
* Strong startup hire
* Someone I'd give ownership to



# NLP - Text Classification project

## 📌 Problem Statement: News Topic Classification Using NLP

In today's digital world, news platforms publish thousands of articles every day across multiple domains such as **business, sports, technology, and politics**. Manually categorizing these articles is time-consuming, error-prone, and not scalable.

The objective of this project is to **automatically classify news articles into predefined topics** using Natural Language Processing (NLP) techniques.

Specifically, the project aims to:

1. Build a **baseline text classification model** using traditional NLP techniques such as **TF-IDF with Logistic Regression**.
2. Build a **deep learning-based classifier** using a **transformer model (DistilBERT)**.
3. Compare the two approaches in terms of:

   * accuracy
   * precision, recall, and F1-score
   * ability to capture semantic meaning
4. Analyze model performance through **error analysis and confusion matrices**.
5. Study the impact of **class imbalance** and understand **precision vs recall trade-offs**.

The final outcome is to understand **when classical NLP methods are sufficient** and **when transformer-based models provide clear advantages**, considering both performance and computational cost.

---

### 🎯 Inputs and Outputs

**Input:**
Raw news article text.

**Output:**
One of the predefined categories:

* Business
* Sports
* Technology
* Politics

---

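### 📊 Dataset

The project uses the **AG News dataset**, a standard benchmark for text classification, containing labeled news articles across four categories. Its label names are World (0), Sports (1), Business (2), and Sci/Tech (3), so "politics" and "technology" in this plan correspond to the World and Sci/Tech classes.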

---

### 🧠 Why this problem matters (interview angle)

This problem represents a **real-world NLP use case** commonly found in:

* news aggregation platforms
* content recommendation systems
* search engines
* enterprise document classification

It demonstrates core NLP skills including **text preprocessing, feature extraction, model selection, evaluation, and error analysis**.

---

### One-line version (if interviewer asks quickly)

> "The goal is to automatically classify news articles into categories using NLP, comparing traditional TF-IDF-based models with transformer-based models to understand performance and trade-offs."


### 🧭 Roadmap: News Topic Classification (NLP Revision Project)

#### 🎯 Goal

Classify news articles into:

* business
* sports
* tech
* politics

Compare **traditional NLP** vs **transformer-based NLP**.

---

### 🗓️ Phase 1: Data & Problem Setup (Day 1)

##### 1️⃣ Dataset selection

Use one of these:

* **AG News**
* **BBC News**
* **20 Newsgroups (filtered)**

```python
from datasets import load_dataset

dataset = load_dataset("ag_news")

print(dataset)

dataset["train"].to_csv("../data/NLP_TextClassification/ag_news_train.csv")
dataset["test"].to_csv("../data/NLP_TextClassification/ag_news_test.csv")
```

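Split:

* 80% train
* 10% validation
* 10% test

AG News ships with only `train` (120,000 rows) and `test` (7,600 rows) splits, so an 80/10/10 split means pooling and re-splitting, or more simply carving a validation set out of `train` and keeping the official test set untouched.

##### 2️⃣ Exploratory Data Analysis (EDA)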

Do this **before modeling**.

✔ Class distribution
✔ Average text length
✔ Vocabulary size

> Interview point:
> "I checked class imbalance and text length to choose model and max sequence length."
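
The three EDA checks can be computed with the standard library alone. This sketch uses whitespace tokens as a rough proxy for length (a subword tokenizer will produce more tokens), and the function name `eda_summary` plus the two sample rows are illustrative.

```python
from collections import Counter
from statistics import mean


def eda_summary(texts, labels):
    """Class counts, length statistics in whitespace tokens, and vocabulary size."""
    lengths = [len(t.split()) for t in texts]
    vocab = {tok for t in texts for tok in t.lower().split()}
    return {
        "class_counts": dict(Counter(labels)),
        "avg_tokens": mean(lengths),
        "max_tokens": max(lengths),
        "vocab_size": len(vocab),
    }


sample_texts = ["Stocks fall again", "Team wins"]
sample_labels = [2, 1]
print(eda_summary(sample_texts, sample_labels))
```

The maximum and average lengths feed directly into the max sequence length decision for the transformer phase.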

---

### 🗓️ Phase 2: Baseline: TF-IDF + Logistic Regression (Day 1)

##### 3️⃣ Text preprocessing

Minimal, intentional preprocessing:

* lowercase
* remove punctuation
* stopwords (optional)
* no lemmatization (explain why)

> Interview answer:
> "Heavy preprocessing can remove useful signal for downstream models."

---

##### 4️⃣ Feature extraction (TF-IDF)

```python
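from sklearn.feature_extraction.text import TfidfVectorizer

# train_texts / test_texts: lists of raw article strings from the train and test splits
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),   # unigrams and bigrams
    max_features=50_000,  # cap the vocabulary size
    min_df=5,             # drop terms seen in fewer than 5 documents
)
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary and IDF on train only
X_test = vectorizer.transform(test_texts)        # reuse the fitted vocabulary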
```

Explain:

* Why ngrams?
* Why max_features?
* Why min_df?
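
To answer the three "why" questions, it helps to see what each parameter does to a tiny corpus. The sketch below re-implements the idea in plain Python (it is not scikit-learn's code): n-grams add short phrases, `min_df` drops terms seen in too few documents, and `max_features` keeps only the most frequent survivors. The function names and the three documents are made up.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """All contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_vocab(docs, min_df=2, max_features=None):
    """Keep terms seen in at least min_df documents, then the max_features most frequent."""
    term_counts, doc_freq = Counter(), Counter()
    for doc in docs:
        terms = ngrams(doc.lower().split())
        term_counts.update(terms)
        doc_freq.update(set(terms))
    kept = [t for t in term_counts if doc_freq[t] >= min_df]
    kept.sort(key=lambda t: (-term_counts[t], t))  # most frequent first, ties alphabetical
    return kept[:max_features]


docs = ["interest rates rise", "interest rates fall", "rates fall again"]
print(build_vocab(docs))                  # rare terms such as "rise" are gone
print(build_vocab(docs, max_features=2))  # only the two most frequent terms remain
```

Bigrams like "interest rates" carry meaning that the separate words do not, while `min_df` and `max_features` keep the feature matrix small enough for a fast linear model.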

---

##### 5️⃣ Train classifier

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
```

Evaluate:

* accuracy
* precision
* recall
* F1 (macro)

> Interview point:
> "This gives a fast, interpretable baseline."
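
The four metrics listed above are simple enough to compute by hand, which is a useful check on what "macro" means: per-class F1 scores are averaged with equal weight, regardless of class size. The sketch below is plain Python with illustrative function names; in a project you would normally rely on a library implementation.

```python
def per_class_metrics(y_true, y_pred):
    """Precision, recall, and F1 for every label that appears in y_true or y_pred."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_metrics(y_true, y_pred)
    return sum(m["f1"] for m in scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(accuracy(y_true, y_pred), round(macro_f1(y_true, y_pred), 4))
```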

---

### 🗓️ Phase 3: Transformer Model (Day 2)

##### 6️⃣ Tokenization

Use **DistilBERT** first (lighter, faster).

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```

Explain:

* Subword tokenization
* Why max length (128 / 256)

---

##### 7️⃣ Fine-tuning model

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=4
)
```

Train with:

* small learning rate (2e-5)
* 2–3 epochs
* batch size 16

> Interview point:
> "Transformers capture contextual meaning, not just word frequency."

---

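##### 8️⃣ Evaluation & comparison

Compare (rough placeholders, not measured results; replace them with your own numbers):

| Model       | Accuracy | Macro F1 |
| ----------- | -------- | -------- |
| TF-IDF + LR | ~80%     | ~0.78    |
| DistilBERT  | ~88–92%  | ~0.90    |

On AG News specifically, n-gram TF-IDF baselines are typically reported above 90% accuracy, so expect the real gap to be smaller than this table suggests.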

---

### 🗓️ Phase 4: Error Analysis (Day 3)

##### 9️⃣ Confusion matrix

Identify:

* sports vs business confusion
* tech vs politics overlap

> Interview gold:
> "Most errors happen where semantic overlap exists."
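
Finding the confused class pairs does not require plotting: counting the off-diagonal cells of the confusion matrix is enough. The sketch below is plain Python with illustrative names and made-up predictions.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs."""
    return Counter(zip(y_true, y_pred))


def top_confusions(y_true, y_pred, k=3):
    """Most frequent off-diagonal cells, i.e. the most common mistakes."""
    errors = Counter({
        pair: n
        for pair, n in confusion_counts(y_true, y_pred).items()
        if pair[0] != pair[1]
    })
    return errors.most_common(k)


y_true = ["business", "business", "sports", "tech", "business"]
y_pred = ["tech", "tech", "sports", "tech", "business"]
print(top_confusions(y_true, y_pred))
```

Each returned pair points at a slice of examples worth reading by hand in the misclassification analysis.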

---

##### 🔟 Analyze misclassifications

Check:

* short texts
* ambiguous headlines
* domain-specific words

---

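### 🗓️ Phase 5: Class Imbalance & Metrics (Day 3)

##### 1️⃣1️⃣ Handle class imbalance

AG News itself is balanced (30,000 training examples per class), so this step applies when you subsample the data or move to an imbalanced dataset.

Options:

* class weights
* oversampling
* focal loss (mention, don't overdo)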

```python
LogisticRegression(class_weight="balanced")
```
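
For intuition, the "balanced" heuristic weights each class by `n_samples / (n_classes * class_count)`, so rare classes count more in the loss. The sketch below reproduces that formula in plain Python; the function name and the 90/10 label split are illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


imbalanced = [0] * 90 + [1] * 10
print(balanced_class_weights(imbalanced))

balanced = [0, 1, 2, 3] * 5
print(balanced_class_weights(balanced))
```

On a balanced label set every weight comes out as 1.0, which is why class weighting changes nothing for a dataset with equal class counts.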

---

##### 1️⃣2️⃣ Precision vs Recall discussion

Explain with example:

* **High precision** → fewer false positives
* **High recall** → fewer false negatives

> Interview answer:
> "For news categorization, balanced F1 matters more than raw accuracy."
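
The trade-off is easiest to show by moving the decision threshold on one class's scores. In this sketch the scores and labels are invented, and returning precision 1.0 when nothing is predicted positive is a convention chosen here, not a universal rule.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall for the positive class when predicting positive at score >= threshold."""
    preds = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0  # convention: no predictions, no false positives
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


scores = [0.95, 0.80, 0.60, 0.40, 0.20]
labels = [True, True, False, True, False]

for threshold in (0.9, 0.5, 0.3):
    print(threshold, precision_recall_at(scores, labels, threshold))
```

A strict threshold gives perfect precision but misses most positives; a loose one finds them all at the cost of a false positive.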

---

### 🧠 Phase 6: Interview Explanation Cheat Sheet

##### Why TF-IDF fails on semantic similarity?

✔ Word-based
✔ No context
✔ Synonyms treated differently

Example:

> "stock market" vs "equity trading"
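
The example can be checked numerically: with count-based vectors, two phrases that share no words have cosine similarity exactly zero, however close their meaning. This is a minimal sketch with an illustrative function name; TF-IDF weighting rescales the counts but cannot create overlap where no word is shared.

```python
import math
from collections import Counter


def bow_cosine(a, b):
    """Cosine similarity between bag-of-words count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


print(bow_cosine("stock market", "equity trading"))  # no shared words
print(bow_cosine("stock market", "stock prices"))    # one shared word
```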

---

##### Why transformers generalize better?

✔ Contextual embeddings
✔ Attention mechanism
✔ Pretrained on large corpora

---

##### How do you choose metrics?

* Accuracy → balanced data
* Macro F1 → class imbalance
* Precision/Recall → business needs

---

##### When would you NOT use transformers?

✔ Low latency systems
✔ Very small datasets
✔ Edge devices

---

### 📁 Final Deliverables (VERY IMPORTANT)

Your GitHub should include:

* clean README
* architecture diagram
* comparison table
* error analysis section
* clear conclusions

---

### ⏱️ Time estimate

* **Fast learner**: 2 days
* **Safe prep**: 3–4 days

---




DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 120000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7600
    })
})



In [11]:
import pandas as pd
df = pd.read_csv('../data/NLP_TextClassification/ag_news_train.csv')
df.head()

| Unnamed: 0 | text | label |
| --- | --- | --- |
| 0 | Wall St. Bears Claw Back Into the Black (Reute... | 2 |
| 1 | Carlyle Looks Toward Commercial Aerospace (Reu... | 2 |
| 2 | Oil and Economy Cloud Stocks' Outlook (Reuters... | 2 |
| 3 | Iraq Halts Oil Exports from Main Southern Pipe... | 2 |
| 4 | Oil prices soar to all-time record, posing new... | 2 |


In [12]:
df['label'].value_counts()

label
2    30000
3    30000
1    30000
0    30000
Name: count, dtype: int64

In [13]:
dataset['train'][0]

{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",
 'label': 2}