## Natural Language Processing

### NLP Interview Questions (AI/ML Engineer) — Easy → Hard




**5) Hard — Production NLP Failure Modes**
An NLP model performs well in validation but fails in production.

* List common NLP-specific failure modes.
* How would you detect data drift in text?
* How would you debug hallucinations or semantic errors in an LLM-based system?

---



**1) Easy — Text Cleaning & Normalization**
You’re given a column `review_text`.

* What basic preprocessing steps would you apply before modeling?
* Which steps depend on the downstream model (TF-IDF vs transformer)?

**Answer:**
1. Text cleaning: remove HTML tags, URLs, PII, special characters, and extra whitespace, and decide how to handle emojis.
2. Lowercasing: convert all text to lowercase to reduce vocabulary size.
3. Tokenization: split text into smaller units called tokens.
4. Stop-word removal: drop very frequent words such as "is", "the", and "in".
5. Stemming: strip suffixes with heuristic rules (running, runs -> run), which reduces vocabulary size.
6. Lemmatization: reduce words to their dictionary base form using linguistic rules (better -> good).
7. Named entity recognition: identify real-world entities.
8. Spelling correction.
9. Feature extraction / vectorization: convert text into numerical form
    - Bag of words
    - TF-IDF
    - Word embeddings (Word2Vec, GloVe)
    - Contextual embeddings (BERT)

Model-dependent steps: TF-IDF needs heavy preprocessing, whereas a transformer needs much less.
- Stop-word removal: in TF-IDF, stop words dominate the counts but add little value, so remove them. Transformers use them to capture meaning and attention patterns, so keep them.
- Stemming or lemmatization: in TF-IDF this reduces sparsity by merging similar word forms into one feature. Transformers rely on grammar and tense, so do not apply it there.
- Lowercasing: usually applied for TF-IDF. For transformers, follow the checkpoint: cased models preserve case, and uncased models lowercase inside their own tokenizer.
- Tokenization: TF-IDF treats literal words as tokens, whereas transformers use their own subword tokenizer (BPE / WordPiece).
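
The split between model-agnostic cleaning and model-dependent normalization can be sketched roughly as follows. This is a minimal illustration using only the standard library; the function names, the regular expressions, and the tiny stop-word list are assumptions made for the example, not a recommended production setup.

```python
import html
import re
from typing import List

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
WS_RE = re.compile(r"\s+")

# Tiny illustrative stop-word list (assumption); real projects use a fuller one.
STOP_WORDS = {"is", "the", "in", "a", "an", "and", "of", "to"}


def basic_clean(text: str) -> str:
    """Model-agnostic cleanup: HTML tags and entities, URLs, extra whitespace."""
    text = TAG_RE.sub(" ", text)      # drop HTML tags
    text = html.unescape(text)        # "&amp;" -> "&"
    text = URL_RE.sub(" ", text)      # drop URLs
    return WS_RE.sub(" ", text).strip()


def prepare_for_tfidf(text: str) -> List[str]:
    """Heavier normalization for sparse models: lowercase, drop punctuation and stop words."""
    tokens = re.findall(r"[a-z0-9']+", basic_clean(text).lower())
    return [t for t in tokens if t not in STOP_WORDS]


def prepare_for_transformer(text: str) -> str:
    """Light touch: keep case, punctuation and stop words for the model's own tokenizer."""
    return basic_clean(text)


review = "<p>The food is GREAT &amp; cheap!</p>  See https://example.com/menu"
print(prepare_for_tfidf(review))
print(prepare_for_transformer(review))
```

The same `basic_clean` step feeds both paths; only the TF-IDF path goes on to lowercase and filter tokens.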

**2) Easy–Medium — Tokenization Strategies**
Explain the difference between:

* word-level tokenization
* subword tokenization (BPE / WordPiece)
* character-level tokenization

When would each fail or succeed in real NLP systems?

**Answer:**
1. Word-level tokenization
- Splits text into tokens at word boundaries, usually whitespace and punctuation.
- Works well with classical NLP models and languages with clear word boundaries.
- Works best when the vocabulary is small and clean.
- `Fails` on misspellings, typos, and other out-of-vocabulary words, and when the vocabulary grows very large.
2. Subword tokenization (BPE / WordPiece)
- Breaks words into frequently occurring subword units learned from data.
- Illustrative splits (the actual pieces depend on the learned vocabulary): "unhappiness" -> ['un', 'happi', 'ness'], "ChatGPTlike" -> ['Chat', 'GPT', 'like']
- Works well with modern neural networks, handles unknown words, and keeps vocabulary size under control.
- Preserves meaning better than characters.
- `Fails` when subword boundaries do not align with linguistic meaning; the resulting splits can be unintuitive.
3. Character-level tokenization
- Treats every character as a token.
- "cat" -> ['c', 'a', 't']
- Extremely robust to typos and noise.
- Used in special domains such as DNA sequences, code, and OCR text.
- `Fails` on very long sequences: it loses semantic structure, is computationally expensive, and makes meaning harder to learn.
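
A toy comparison of the three strategies is sketched below. The subword function shows only the greedy longest-match step that WordPiece-style tokenizers apply at inference time, with a hand-written vocabulary; it does not learn merges the way BPE or WordPiece training does. The function names and `TOY_VOCAB` are hypothetical.

```python
import re


def word_tokenize(text):
    """Word-level: split into words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokenize(text):
    """Character-level: every character (including spaces) is a token."""
    return list(text)


def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word, WordPiece-style.

    Continuation pieces carry a "##" prefix. If some position cannot be
    matched against the vocabulary, the whole word maps to the unknown token.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; real vocabularies are learned from a corpus.
TOY_VOCAB = {"un", "##happi", "##ness", "##ly"}

print(word_tokenize("I love NLP!"))
print(char_tokenize("cat"))
print(subword_tokenize("unhappiness", TOY_VOCAB))
print(subword_tokenize("unhappily", TOY_VOCAB))  # unseen word, known pieces
print(subword_tokenize("xyz", TOY_VOCAB))        # nothing matches
```

"unhappily" never appears in the vocabulary as a whole word, yet it still decomposes into known pieces, which is the main practical advantage over word-level tokenization.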

**3) Medium — Text Representation Choices**
You need to build a classifier with **limited labeled data**.

* Compare BoW, TF-IDF, word embeddings, and transformer embeddings.
* Which would you choose and why?

**Answer:**
1. Bag of Words (BoW):
- Represents text as word occurrence counts.
- Pros: Very simple and fast | Works well with linear models | Low risk of overfitting
- Cons: No word importance, no semantics or context, very sparse and high dimensional. Poor generalization with limited data.
- Use when data is extremely small, as a baseline model.

2. TF-IDF
- Weighted version of BoW that downweights common words and upweights informative ones.
- Pros: Strong baseline for small datasets | Better signal than BoW | Works well with linear classifiers | Interpretable and stable.
- Cons: Sparse, no semantics or word order, cannot capture synonymy or context.
- Use when labeled data is limited.

3. Word Embeddings (Word2Vec / GloVe / FastText):
- Dense vector representations learned from large corpora.
- Pros: Capture semantic similarity, dense and lower dimensional, can use pretrained embeddings.
- Cons: Context independent, need pooling to get a document vector, often underperform TF-IDF on small labeled datasets.
- Use when data size is moderate and semantics matter.

4. Transformer Embeddings (BERT, RoBERTa)
- Contextual embeddings pretrained on massive corpora.
- Pros: Strong semantic understanding, handles context, excellent at transfer learning.
- Cons: Computationally expensive, risk of overfitting with very small labeled data, more complex to train and tune.
- Use when complex language understanding is required.
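
To make the difference between BoW and TF-IDF concrete, here is a small from-scratch sketch. It uses the textbook weighting tf = count / document length and idf = log(N / df); library implementations typically add smoothing and normalization, so their numbers will differ. The function names and the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return one Counter of raw term counts per tokenized document."""
    return [Counter(doc) for doc in docs]


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf = count / doc_length, idf = log(N / df), with no smoothing.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "terrible"],
    ["the", "plot", "was", "great"],
]
bow = bag_of_words(docs)
weights = tf_idf(docs)

print(bow[1])
print(weights[1])
```

In BoW every word in the second document counts equally. Under TF-IDF, "the" and "was" appear in every document and get weight zero, while "terrible", which appears only once in the corpus, gets the largest weight.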

**4) Medium–Hard — Handling Long Documents**
Your documents exceed the 512-token limit of BERT-like models.

* What strategies would you use to handle this?
* How do chunking and hierarchical models differ?
* What are the trade-offs?

**Answer:**
1. Chunking
- Split the document into smaller chunks, process each chunk independently with BERT, then aggregate.
- Aggregation methods: mean or max pooling over chunk embeddings, attention-based pooling, or majority vote for classification.
- Pros: Easy to implement, works with standard BERT.
- Cons: Loses global document context, ignores cross-chunk dependencies, and aggregation may dilute important signals.
2. Hierarchical models: model text at multiple levels (tokens -> sentences -> documents)
- Example: sentence encoder (BERT) -> sentence embeddings -> document-level encoder over the sentence embeddings.
- Pros: Preserves structure and long-range dependencies | Better document-level understanding
- Cons: Expensive.
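
The chunk-then-aggregate strategy can be sketched as below. `toy_encoder` is a hypothetical stand-in for a BERT-like encoder so that the example runs without any model; the window size and stride are illustrative defaults. With a real BERT tokenizer the usable window is slightly smaller than 512 because special tokens such as `[CLS]` and `[SEP]` take up positions.

```python
def chunk_tokens(tokens, max_len=512, stride=384):
    """Split a token list into overlapping windows of at most max_len tokens.

    stride is the step between window starts, so consecutive windows
    overlap by max_len - stride tokens.
    """
    if not 0 < stride <= max_len:
        raise ValueError("stride must be in (0, max_len]")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return chunks


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def toy_encoder(chunk):
    """Hypothetical stand-in for a BERT-like encoder: returns a 2-d vector."""
    return [float(len(chunk)), float(sum(chunk)) / len(chunk)]


tokens = list(range(1000))  # pretend token ids for a 1000-token document
chunks = chunk_tokens(tokens, max_len=512, stride=384)
doc_vector = mean_pool([toy_encoder(c) for c in chunks])

print([len(c) for c in chunks])
print(doc_vector)
```

The overlap softens the cross-chunk dependency problem a little, but mean pooling still weights every chunk equally, which is where the "dilutes important signals" trade-off comes from.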

```python
import pandas as pd

# The file is not UTF-8 encoded, so read it as latin-1.
df = pd.read_csv("../data/NLP_covid/Corona_NLP_train.csv", encoding="latin-1")
df.head()
```

|   | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|----------|------------|----------|---------|---------------|-----------|
| 0 | 3799 | 48751 | London | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 2 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | 3802 | 48754 |  | 16-03-2020 | My food stock is not the only one which is emp... | Positive |
| 4 | 3803 | 48755 |  | 16-03-2020 | Me, ready to go at supermarket during the #COV... | Extremely Negative |


## Goal: 2-hour NLP Revision - 11:37

After 2 hours, you should be able to:
* Explain NLP systems as pipelines
* Answer “how would you build…” questions
* Discuss failure modes and tradeoffs
* Avoid sounding academic or junior

> NLP is not about the model; it is about converting messy text into reliable decisions.

### NLP Pipeline
Text ingestion -> Cleaning & normalization -> Chunking/segmentation -> Task-specific model -> Post-processing -> Evaluation & monitoring.

### 5 NLP problems
> Use case -> Model choice -> Failure -> Fix

1. Text Classification
    - Use: Intent detection, routing, spam vs. non-spam
    - Failure: Class imbalance, ambiguity
    - Fix: Thresholds, abstain class
        - In production, not predicting is better than predicting wrong.

2. Information Retrieval - Search engines, product search in e-commerce, legal document search.
3. Text Generation - Generate human-like text (examples: text completion, chatbots, creative writing).
4. Machine Translation - Translate text from one language to another.
5. Text Summarization - News summaries, meeting notes generation, legal document summaries.
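
The threshold-plus-abstain fix from the text classification item can be sketched as follows. It assumes the classifier already produces a probability per label; the function name, the 0.7 threshold, and the `"abstain"` label are illustrative choices, and in practice the threshold would be tuned on validation data.

```python
def predict_or_abstain(probabilities, threshold=0.7, abstain_label="abstain"):
    """Return the top label if its probability clears the threshold, else abstain.

    probabilities: dict mapping label -> probability from any classifier.
    """
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < threshold:
        return abstain_label
    return label


confident = predict_or_abstain({"spam": 0.92, "not_spam": 0.08})
uncertain = predict_or_abstain({"spam": 0.55, "not_spam": 0.45})

print(confident)
print(uncertain)
```

Abstained inputs can then be routed to a fallback such as a human reviewer or a default flow instead of acting on a low-confidence guess.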