# Practical NLP Session – From Theory to Code

This notebook walks through a complete basic NLP pipeline with **very detailed comments**, so you can use it as both **notes and code**.

We will cover:

1. Loading & preparing text data  
2. Cleaning & preprocessing (lowercasing, removing noise)  
3. Tokenization  
4. Stopword removal  
5. Stemming & Lemmatization  
6. N-grams  
7. CountVectorizer & TF-IDF (text -> numbers)  
8. Building a simple sentiment classifier (Logistic Regression)  
9. Trying the model on your own sentences  
10. Project idea based on this notebook

In [3]:
# ================================
# 1. Imports (Libraries we need)
# ================================

import re
import string

import numpy as np
import pandas as pd

# NLTK: Natural Language Toolkit
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Scikit-learn: for vectorization + ML models
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

print("Imports done.")

Imports done.


In [10]:
# ==========================================
# 2. Download extra NLTK data (required once)
# ==========================================

# NLTK needs a few language resources that are downloaded separately.
#
# What each one does:
# - 'punkt'       : sentence tokenizer models used by sent_tokenize() and,
#                   internally, by word_tokenize()
# - 'punkt_tab'   : tokenizer data that newer NLTK releases load instead of
#                   the older 'punkt' files; without it they raise a LookupError
# - 'stopwords'   : list of common English words like "the", "is", "at", etc.
# - 'wordnet'     : dictionary used for lemmatization (finding base words)
# - 'omw-1.4'     : Open Multilingual WordNet data; some NLTK versions need it
#                   when loading WordNet

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")
nltk.download("punkt_tab")

print("NLTK resources downloaded.")

NLTK resources downloaded.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


## 3. Creating a Tiny Text Dataset

To understand the pipeline, we start with a **small toy dataset** of sentences.

Each sentence has a label:

- `1` -> positive sentiment  
- `0` -> negative sentiment

In real life, you would load data from:
- Kaggle
- CSV files
- Databases
- APIs, etc.
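
As a rough illustration of the CSV case, the sketch below reads a `text` column and a `label` column using only the standard library. The CSV content and column names are made up for this example; a real file would be opened with `open(path)` instead of `io.StringIO`.

```python
import csv
import io

# Hypothetical CSV content; in practice this would be a file on disk.
raw_csv = io.StringIO(
    "text,label\n"
    "I love this phone,1\n"
    "The battery is awful,0\n"
)

texts, labels = [], []
for row in csv.DictReader(raw_csv):
    texts.append(row["text"])
    labels.append(int(row["label"]))  # CSV values arrive as strings

print(texts)
print(labels)
```

With pandas, `pd.read_csv(path)` would give a DataFrame with the same two columns in one call.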

In [7]:
# =====================================
# 3. Tiny toy dataset (text + labels)
# =====================================

sentences = [
    "I love learning NLP, it is so interesting!",
    "This movie was terrible and boring.",
    "The food was amazing and the service was great.",
    "I hate waiting in long lines.",
    "Natural language processing makes computers smarter.",
    "The product quality is bad and I am disappointed.",
    "What a fantastic experience, I will come again!",
    "This is the worst thing I have ever bought.",
    "The lecture was very helpful and easy to understand.",
    "I am not happy with this phone, battery life is poor."
]

# Labels: 1 = positive, 0 = negative
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

data = pd.DataFrame({
    "text": sentences,
    "label": labels
})

data

| | text | label |
|---|---|---|
| 0 | I love learning NLP, it is so interesting! | 1 |
| 1 | This movie was terrible and boring. | 0 |
| 2 | The food was amazing and the service was great. | 1 |
| 3 | I hate waiting in long lines. | 0 |
| 4 | Natural language processing makes computers sm... | 1 |
| 5 | The product quality is bad and I am disappointed. | 0 |
| 6 | What a fantastic experience, I will come again! | 1 |
| 7 | This is the worst thing I have ever bought. | 0 |
| 8 | The lecture was very helpful and easy to under... | 1 |
| 9 | I am not happy with this phone, battery life i... | 0 |


## 4. Basic Text Preprocessing

Raw text is **messy**. People use:

- Different cases: `Love`, `love`, `LOVE`
- Links
- HTML tags
- Numbers
- Extra spaces

We clean the text so the model focuses on **meaning**, not noise.

Typical steps:

1. Lowercase everything  
2. Remove URLs  
3. Remove HTML tags  
4. Remove digits  
5. Remove extra spaces

In [8]:
# ==================================
# 4. Text cleaning helper function
# ==================================

def clean_text(text: str) -> str:
    """
    Simple, readable cleaning function.
    Steps:
    1. Lowercase
    2. Remove URLs
    3. Remove HTML tags
    4. Remove digits
    5. Remove extra spaces
    """
    # 1. Lowercase
    text = text.lower()

    # 2. Remove URLs (http://, https://, www.)
    text = re.sub(r"https?://\S+|www\.\S+", "", text)

    # 3. Remove HTML tags such as <b> or </b> (the text between tags is kept)
    text = re.sub(r"<.*?>", "", text)

    # 4. Remove digits
    text = re.sub(r"\d+", "", text)

    # 5. Remove extra spaces
    text = re.sub(r"\s+", " ", text).strip()

    return text


# Apply cleaning to our dataset
data["clean_text"] = data["text"].apply(clean_text)
data[["text", "clean_text"]]

| | text | clean_text |
|---|---|---|
| 0 | I love learning NLP, it is so interesting! | i love learning nlp, it is so interesting! |
| 1 | This movie was terrible and boring. | this movie was terrible and boring. |
| 2 | The food was amazing and the service was great. | the food was amazing and the service was great. |
| 3 | I hate waiting in long lines. | i hate waiting in long lines. |
| 4 | Natural language processing makes computers sm... | natural language processing makes computers sm... |
| 5 | The product quality is bad and I am disappointed. | the product quality is bad and i am disappointed. |
| 6 | What a fantastic experience, I will come again! | what a fantastic experience, i will come again! |
| 7 | This is the worst thing I have ever bought. | this is the worst thing i have ever bought. |
| 8 | The lecture was very helpful and easy to under... | the lecture was very helpful and easy to under... |
| 9 | I am not happy with this phone, battery life i... | i am not happy with this phone, battery life i... |
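
The toy sentences contain no URLs, HTML tags or digits, so the output above only shows the lowercasing step. The sketch below runs the same five regex steps on a made-up messy string (the string and the URL are invented for illustration) to show the other steps at work.

```python
import re

def clean_text(text: str) -> str:
    """Same five steps as the helper above, repeated so this block runs alone."""
    text = text.lower()                                 # 1. lowercase
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # 2. remove URLs
    text = re.sub(r"<.*?>", "", text)                   # 3. remove HTML tags
    text = re.sub(r"\d+", "", text)                     # 4. remove digits
    text = re.sub(r"\s+", " ", text).strip()            # 5. collapse spaces
    return text

# Hypothetical messy input
messy = "Visit <b>OUR</b> site   https://example.com NOW, 50% off!"
cleaned = clean_text(messy)
print(cleaned)  # visit our site now, % off!
```

Note that punctuation such as `,`, `%` and `!` survives: this function does not remove it.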


## 5. Tokenization – Breaking Text into Pieces

Computers cannot understand a full paragraph directly.  
We break text into smaller pieces called **tokens**.

- **Word tokenization** → split into words  
- **Sentence tokenization** → split into sentences

We will use `nltk.word_tokenize` and `nltk.sent_tokenize`.

In [11]:
# ==========================
# 5. Tokenization example
# ==========================

example_text = data["clean_text"][0]
print("Original sentence:", example_text)

# Word tokens
word_tokens = word_tokenize(example_text)
print("\nWord tokens:")
print(word_tokens)

# Sentence tokens (here we just have 1 sentence, but still for demo)
sent_tokens = sent_tokenize(example_text)
print("\nSentence tokens:")
print(sent_tokens)

Original sentence: i love learning nlp, it is so interesting!

Word tokens:
['i', 'love', 'learning', 'nlp', ',', 'it', 'is', 'so', 'interesting', '!']

Sentence tokens:
['i love learning nlp, it is so interesting!']
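
To get a feel for what word tokenization does, here is a rough regex-based approximation. It is not NLTK's algorithm (which handles contractions, abbreviations and many other cases); it only splits runs of word characters from individual punctuation marks, which happens to give the same tokens for this sentence.

```python
import re

def simple_word_tokenize(text: str) -> list:
    # \w+      : a run of letters/digits/underscores (a "word")
    # [^\w\s]  : any single character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("i love learning nlp, it is so interesting!")
print(tokens)
```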


## 6. Stopword Removal

Stopwords = very common words like **the, is, am, are, in, at**.

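They appear everywhere but usually don’t add much meaning for tasks like
sentiment analysis, topic classification, etc. One caveat: the NLTK list also
contains negations such as *not* and *no*, which can flip the sentiment of a
sentence (for example, "I am not happy").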

We will:

1. Get the English stopword list from NLTK  
2. Remove them from our token list

In [12]:
# ===================================
# 6. Stopwords and removal function
# ===================================

# Get English stopwords from NLTK
stop_words = set(stopwords.words("english"))
print("Number of stopwords:", len(stop_words))
print("Few examples:", sorted(stop_words)[:20])

def remove_stopwords(tokens):
    """Remove tokens that are in the stopword list."""
    return [w for w in tokens if w not in stop_words]

tokens_no_stop = remove_stopwords(word_tokens)
print("\nOriginal tokens:", word_tokens)
print("\nAfter stopword removal:", tokens_no_stop)

Number of stopwords: 198
Few examples: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been']

Original tokens: ['i', 'love', 'learning', 'nlp', ',', 'it', 'is', 'so', 'interesting', '!']

After stopword removal: ['love', 'learning', 'nlp', ',', 'interesting', '!']
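
The result above still contains `,` and `!`, because punctuation marks are not stopwords. A minimal sketch of filtering both in one pass is shown below; `TOY_STOPWORDS` is a small hand-picked set used only to keep the example self-contained, not the NLTK list.

```python
import string

# Hand-picked subset for illustration; the notebook uses NLTK's full list.
TOY_STOPWORDS = {"i", "it", "is", "so"}
PUNCT_CHARS = set(string.punctuation)

def remove_stopwords_and_punct(tokens, stop_words):
    """Drop stopwords and tokens made up entirely of punctuation characters."""
    return [
        t for t in tokens
        if t not in stop_words and not all(ch in PUNCT_CHARS for ch in t)
    ]

tokens = ['i', 'love', 'learning', 'nlp', ',', 'it', 'is', 'so', 'interesting', '!']
filtered = remove_stopwords_and_punct(tokens, TOY_STOPWORDS)
print(filtered)  # ['love', 'learning', 'nlp', 'interesting']
```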


## 7. Stemming & Lemmatization

Both try to reduce words to a **base form**:

- **Stemming**: crude chopping of word endings  
  - `playing` → `play`  
  - `studies` → `studi` (sometimes ugly)

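- **Lemmatization**: smarter, uses a dictionary (WordNet) and the word's part of speech  
  - `better` → `good` (when the word is tagged as an adjective, `pos="a"`)  
  - `studies` → `study`  
  - Without a `pos` argument, `WordNetLemmatizer.lemmatize` treats the word as a noun, so `playing` and `better` come back unchanged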

Goal: treat different forms of the same word as similar.

In [13]:
# =========================================
# 7. Compare stemming vs lemmatization
# =========================================

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["playing", "played", "plays", "better", "studies"]

print("Word   | Stem   | Lemma")
print("----------------------------")
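for w in words:
    stem = stemmer.stem(w)
    # lemmatize() assumes pos="n" (noun) by default, so verb and adjective
    # forms such as "playing" or "better" are returned unchanged.
    lemma = lemmatizer.lemmatize(w)
    print(f"{w:<8} {stem:<8} {lemma:<8}")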

Word   | Stem   | Lemma
----------------------------
playing  play     playing 
played   play     played  
plays    play     play    
better   better   better  
studies  studi    study   
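
The difference between the two columns can be mimicked with a toy sketch. `toy_stem` chops suffixes by rule (it is not the Porter algorithm, just a few hand-written rules that agree with it on these five words), while `toy_lemmatize` looks up a tiny made-up dictionary keyed by word and part of speech, which is why the `pos` argument matters.

```python
# Toy suffix rules: (suffix, replacement). Checked in order.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word: str) -> str:
    """Crude rule-based chopping; keeps at least 3 characters of the word."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

# Hypothetical mini dictionary: (word, part of speech) -> lemma
TOY_LEMMAS = {
    ("better", "a"): "good",
    ("playing", "v"): "play",
    ("played", "v"): "play",
    ("plays", "n"): "play",
    ("studies", "n"): "study",
}

def toy_lemmatize(word: str, pos: str = "n") -> str:
    """Dictionary lookup; unknown (word, pos) pairs are returned unchanged."""
    return TOY_LEMMAS.get((word, pos), word)

stems = [toy_stem(w) for w in ["playing", "played", "plays", "better", "studies"]]
print(stems)
print(toy_lemmatize("better"), toy_lemmatize("better", pos="a"))
```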


## 8. N-grams – Looking at Neighbour Words

Single words sometimes aren’t enough:

- “New” + “York” separately ≠ “New York” (the city)

**N-gram** = sequence of N words:

- 1-gram (unigram): `["i", "love", "nlp"]`  
- 2-gram (bigram): `["i love", "love nlp"]`  
- 3-gram (trigram): `["i love nlp"]`

We’ll create bigrams with `nltk.util.ngrams` and with `CountVectorizer`.

In [14]:
# ==========================
# 8. N-grams using NLTK
# ==========================

from nltk.util import ngrams

print("Tokens without stopwords:", tokens_no_stop)

bigrams = list(ngrams(tokens_no_stop, 2))
print("\nBigrams (pairs of words):")
print(bigrams)

Tokens without stopwords: ['love', 'learning', 'nlp', ',', 'interesting', '!']

Bigrams (pairs of words):
[('love', 'learning'), ('learning', 'nlp'), ('nlp', ','), (',', 'interesting'), ('interesting', '!')]
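
For intuition, n-grams can also be built by hand with a sliding window over the token list. The helper name `make_ngrams` is made up for this sketch; it yields tuples in the same style as the output above.

```python
def make_ngrams(tokens, n):
    """Slide a window of size n over the tokens; each window is one n-gram."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["i", "love", "nlp"]
unigrams = make_ngrams(tokens, 1)
bigrams_manual = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams_manual)  # [('i', 'love'), ('love', 'nlp')]
print(trigrams)        # [('i', 'love', 'nlp')]
```

If `n` is larger than the number of tokens, the result is an empty list.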


In [15]:
# =======================================
# 8b. N-grams using CountVectorizer
# =======================================

ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X_ngrams = ngram_vectorizer.fit_transform(data["clean_text"])

print("Shape (documents x features):", X_ngrams.shape)
print("\nFirst 30 feature names (unigrams + bigrams):")
print(ngram_vectorizer.get_feature_names_out()[:30])

Shape (documents x features): (10, 122)

First 30 feature names (unigrams + bigrams):
['again' 'am' 'am disappointed' 'am not' 'amazing' 'amazing and' 'and'
 'and am' 'and boring' 'and easy' 'and the' 'bad' 'bad and' 'battery'
 'battery life' 'boring' 'bought' 'come' 'come again' 'computers'
 'computers smarter' 'disappointed' 'easy' 'easy to' 'ever' 'ever bought'
 'experience' 'experience will' 'fantastic' 'fantastic experience']


## 9. Text → Numbers: CountVectorizer & TF-IDF

Machine learning models need **numbers**, not words.

### CountVectorizer
- Builds a vocabulary of all words
- Each document → vector of word **counts**
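
The two steps above can be sketched by hand on three made-up documents. This is only the bag-of-words idea with whitespace splitting; `CountVectorizer` itself uses its own tokenization rules.

```python
from collections import Counter

# Hypothetical mini corpus
docs = [
    "the food was great",
    "the service was bad",
    "great food great service",
]

# Step 1: vocabulary = every distinct word, sorted for a stable column order
vocab = sorted({word for doc in docs for word in doc.split()})

# Step 2: one row per document, one column per vocabulary word
count_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    count_matrix.append([counts[word] for word in vocab])

print(vocab)
for row in count_matrix:
    print(row)
```

One difference worth knowing: by default `CountVectorizer` ignores single-character tokens, which is why words like `i` and `a` do not appear among its feature names.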

### TF-IDF (Term Frequency – Inverse Document Frequency)
- Starts from the same word counts as CountVectorizer
- Increases the weight of words that appear in **few documents** (rare, so more distinctive)
- Decreases the weight of words that appear in almost every document
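
A hand-rolled sketch of the weighting is shown below on a made-up corpus. It assumes the smoothed formula used by scikit-learn's default settings, `idf = ln((1 + n_docs) / (1 + doc_freq)) + 1`, followed by scaling each document vector to unit length; other textbooks use slightly different variants.

```python
import math
from collections import Counter

# Hypothetical mini corpus
docs = [
    "the food was great",
    "the service was bad",
    "great food great service",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
vocab = sorted({w for doc in tokenized for w in doc})

# Document frequency: in how many documents does each word appear?
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed inverse document frequency (assumed variant, see lead-in)
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Raw count * idf for each vocabulary word, then L2-normalised."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw

vectors = [tfidf_vector(doc) for doc in tokenized]
weights_doc1 = dict(zip(vocab, vectors[1]))
print(weights_doc1)
```

In the second document, `bad` appears in only one document and therefore ends up with a larger weight than `the`, which appears in two.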

In [16]:
# ==========================
# 9a. CountVectorizer
# ==========================

count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(data["clean_text"])

print("Shape (documents x features):", X_counts.shape)
print("\nFirst 20 feature names:")
print(count_vect.get_feature_names_out()[:20])

Shape (documents x features): (10, 58)

First 20 feature names:
['again' 'am' 'amazing' 'and' 'bad' 'battery' 'boring' 'bought' 'come'
 'computers' 'disappointed' 'easy' 'ever' 'experience' 'fantastic' 'food'
 'great' 'happy' 'hate' 'have']


In [17]:
# =====================
# 9b. TF-IDF Vectorizer
# =====================

tfidf_vect = TfidfVectorizer()
X_tfidf = tfidf_vect.fit_transform(data["clean_text"])

print("Shape (documents x features):", X_tfidf.shape)
print("\nFirst 20 TF-IDF feature names:")
print(tfidf_vect.get_feature_names_out()[:20])

Shape (documents x features): (10, 58)

First 20 TF-IDF feature names:
['again' 'am' 'amazing' 'and' 'bad' 'battery' 'boring' 'bought' 'come'
 'computers' 'disappointed' 'easy' 'ever' 'experience' 'fantastic' 'food'
 'great' 'happy' 'hate' 'have']


## 10. Simple Sentiment Classifier

Now we build a small **sentiment analysis model**.

- Input: `clean_text`  
- Vectorizer: `TfidfVectorizer`  
- Model: `LogisticRegression`  

We’ll use a `Pipeline` so that TF-IDF + model act as one unit.
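
Once trained, a logistic regression model predicts by taking a weighted sum of the feature values, passing it through the sigmoid function, and thresholding the resulting probability at 0.5. The sketch below shows only that prediction step; the weights, bias and feature values are invented for illustration and are not what the model in this notebook learns.

```python
import math

# Hypothetical learned weights (one per vocabulary word) and bias
weights = {"great": 1.2, "food": 0.1, "bad": -1.5}
bias = 0.0

def predict_proba(features):
    """Probability of the positive class for a {word: tfidf_value} dict."""
    z = bias + sum(weights.get(word, 0.0) * value for word, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)

def predict(features):
    """Label 1 (positive) if the probability is at least 0.5, else 0."""
    return 1 if predict_proba(features) >= 0.5 else 0

print(predict({"great": 0.7, "food": 0.5}))  # 1
print(predict({"bad": 0.8}))                 # 0
```

Words the model has never seen contribute nothing to the sum, so a sentence made only of unseen words is decided by the bias alone.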

In [18]:
# ==========================================
# 10. Train a Logistic Regression classifier
# ==========================================

X = data["clean_text"]
y = data["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("log_reg", LogisticRegression())
])

model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Classification report:\n")
print(classification_report(y_test, y_pred))

Classification report:

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.33      1.00      0.50         1

    accuracy                           0.33         3
   macro avg       0.17      0.50      0.25         3
weighted avg       0.11      0.33      0.17         3





## 11. Try the Model on Your Own Sentences

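With only 7 training sentences the model makes mistakes (in the run below, a clearly negative sentence is labelled positive). Type any sentence and the model will predict: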

- **1 → Positive**
- **0 → Negative**

In [20]:
# ==========================================
# 11. Test the model with your own input
# ==========================================

while True:
    text_input = input("Enter a sentence (or 'quit' to stop): ")
    if text_input.lower().strip() == "quit":
        break

    cleaned = clean_text(text_input)
    pred = model.predict([cleaned])[0]
    label = "Positive (1)" if pred == 1 else "Negative (0)"
    print("Prediction:", label)
    print("-" * 40)

Enter a sentence (or 'quit' to stop): This is the worst experience ever.
Prediction: Positive (1)
----------------------------------------
Enter a sentence (or 'quit' to stop): He is good boy.
Prediction: Positive (1)
----------------------------------------
Enter a sentence (or 'quit' to stop): quit


## 12. Word Sense Disambiguation (WSD) – Short Theory Note

Some words have multiple meanings:

- *bank* = a financial institution (where you keep money)  
- *bank* = the side of a river  

**Word Sense Disambiguation** tries to pick the **correct meaning** based on context.

Modern deep models (like BERT) give different vector representations to the same
word in different sentences, which helps with WSD. Implementing full WSD usually
needs bigger models and datasets than this toy notebook.
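
To make "based on context" concrete, here is a very small sketch in the spirit of the simplified Lesk heuristic: pick the sense whose definition shares the most words with the sentence. The two definitions and the stopword set are hand-written for this example, and a heuristic this simple fails easily on real text.

```python
# Hand-written sense definitions (assumptions for illustration only)
BANK_SENSES = {
    "financial institution": "an institution that accepts deposits of money and gives loans",
    "river side": "sloping land beside a river or other body of water",
}
FUNCTION_WORDS = {"a", "an", "the", "of", "or", "and", "that", "at", "on"}

def content_words(text):
    """Lowercased words with a few function words removed."""
    return {w for w in text.lower().split() if w not in FUNCTION_WORDS}

def disambiguate(sentence, senses):
    """Return the sense whose definition overlaps most with the sentence."""
    context = content_words(sentence)
    overlaps = {
        name: len(context & content_words(definition))
        for name, definition in senses.items()
    }
    return max(overlaps, key=overlaps.get)

sense_money = disambiguate("i deposited money at the bank", BANK_SENSES)
sense_river = disambiguate("we sat on the bank of the river", BANK_SENSES)
print(sense_money)  # financial institution
print(sense_river)  # river side
```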