# **NLP**

## What is NLP?

**Natural Language Processing (NLP)** is a field of Artificial Intelligence that enables computers to understand, interpret, and generate human language.  
It bridges the gap between human communication and computer understanding.

NLP combines **linguistics**, **computer science**, and **machine learning** to analyze text and speech data.  
Its applications are everywhere — from search engines and chatbots to sentiment analysis and language translation.

## Real-World Applications of NLP

- **Machine Translation** — Google Translate, DeepL, etc.  
- **Sentiment Analysis** — understanding emotions in reviews or tweets.  
- **Chatbots & Virtual Assistants** — Siri, Alexa, ChatGPT, etc.  
- **Information Retrieval** — search engines and document summarization.  
- **Text Classification** — spam detection, topic categorization, etc.  
- **Speech Recognition** — converting voice to text.

## Core Tasks in NLP

1. **Text Preprocessing**
   - Tokenization  
   - Stopword Removal  
   - Stemming & Lemmatization  
   - Part-of-Speech (POS) Tagging  

2. **Text Representation**
   - Bag of Words (BoW)  
   - TF-IDF (Term Frequency–Inverse Document Frequency)  
   - Word Embeddings (Word2Vec, GloVe, FastText)

3. **Modeling**
   - Classical ML Models (Naive Bayes, SVM, etc.)  
   - Deep Learning Models (RNN, LSTM, Transformers, BERT)

4. **Evaluation**
   - Accuracy, Precision, Recall, F1-score  
   - BLEU Score, ROUGE Score (for generation tasks)
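
The classification metrics listed under Evaluation can be computed by hand from counts of true positives, false positives, and false negatives. The following is a minimal standard-library sketch; the function names and the spam labels are hypothetical and only serve as an illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Hypothetical spam-detection labels: 1 = spam, 0 = not spam
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

acc = accuracy(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print(f"Accuracy:  {acc:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1-score:  {f1:.3f}")
```

In practice a library such as scikit-learn would normally be used for these metrics; the manual version only makes the definitions explicit.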

## NLP Workflow

1. **Data Collection** → Gather text data from various sources.  
2. **Text Preprocessing** → Clean and structure the data.  
3. **Feature Extraction** → Convert text into numerical features.  
4. **Model Training** → Train a model to learn patterns from the text.  
5. **Evaluation & Deployment** → Test and deploy the model for real use.
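
The feature extraction step of this workflow can be sketched with Bag of Words and TF-IDF written in plain Python. The three toy documents are made up for illustration, and the sketch uses the unsmoothed formula `tf * log(N / df)`. Libraries such as scikit-learn apply smoothed variants by default, so their numbers will differ.

```python
import math
from collections import Counter

# Toy documents (hypothetical), already lowercased and free of punctuation
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Naive whitespace tokenization
tokenized = [doc.split() for doc in docs]

# Sorted vocabulary so that every vector uses the same column order
vocab = sorted({tok for doc in tokenized for tok in doc})


def bag_of_words(tokens):
    """Raw term counts, one entry per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def tf_idf(tokens):
    """Unsmoothed TF-IDF: (count / doc length) * log(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    vector = []
    for term in vocab:
        tf = counts[term] / len(tokens)
        df = sum(1 for doc in tokenized if term in doc)
        vector.append(tf * math.log(n_docs / df))
    return vector


bow_matrix = [bag_of_words(doc) for doc in tokenized]
tfidf_matrix = [tf_idf(doc) for doc in tokenized]

print("Vocabulary:", vocab)
print("BoW, document 1:   ", bow_matrix[0])
print("TF-IDF, document 1:", [round(x, 3) for x in tfidf_matrix[0]])
```

In the first document, "the" appears twice but also occurs in another document, so its TF-IDF weight ends up lower than that of "cat", which appears once and in this document only.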


## Popular NLP Libraries

| Library | Description |
|----------|--------------|
| **NLTK** | Great for beginners; includes basic NLP functions. |
| **spaCy** | Industrial-strength NLP with fast tokenization and POS tagging. |
| **scikit-learn** | Machine learning algorithms and text vectorization tools. |
| **Transformers (Hugging Face)** | State-of-the-art pre-trained models like BERT, GPT, T5. |
| **gensim** | Word embeddings and topic modeling. |

## Summary

Natural Language Processing allows machines to:
- Understand the meaning of human language.  
- Process and analyze large amounts of text efficiently.  
- Power intelligent systems that communicate naturally with users.

> NLP is at the heart of modern AI — enabling language understanding, translation, summarization, and conversation.

Installing the necessary libraries:
* Using pip:
    - `pip install nltk spacy gensim scikit-learn transformers`
* Using uv:
     - `uv add nltk spacy gensim scikit-learn transformers`

## **Key Terms**

| Term | Meaning |
|------|----------|
| **Corpus** | A large, structured collection of texts or speech data used to train and test NLP models. |
| **Document** | A single, self-contained unit of text within a corpus, such as a sentence, paragraph, email, or book.|
| **Vocabulary** | The set of all unique words that are present across the entire corpus, typically without duplicates.|
| **Word** | An individual token (sequence of characters) that serves as the basic element of meaning in the text.|
| **Tokenization** | Breaking text into smaller units like words or sentences. |
| **Stopwords** | Common words (like “the”, “is”, “and”) often removed during preprocessing. |
| **Stemming** | Reducing a word to a crude stem by stripping suffixes; the result may not be a real word (e.g., *running* → *run*, *easily* → *easili*). |
| **Lemmatization** | Converting a word to its meaningful root using vocabulary and morphology (e.g., *better* → *good*). |
| **POS Tagging** | Identifying parts of speech (noun, verb, adjective, etc.). |
| **Named Entity Recognition (NER)** | Detecting entities like names, dates, or organizations. |
| **Vectorization** | Converting text into numerical form for machine learning models. |

In [None]:
import pandas as pd
import numpy as np

# Text Preprocessing tools
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist

# spaCy for advanced NLP tasks
import spacy

In [None]:
# downloading necessary data

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

## **Corpus, Document, Vocabulary, Words**

In [None]:
corpus = [
    "Natural Language Processing is an exciting field of AI.",
    "It allows machines to understand human language.",
    "Tokenization, stemming, and lemmatization are common preprocessing steps of NLP."
]

In [None]:
# corpus: join the documents with a space so sentences do not run together

corpus_text = " ".join(corpus)

print(f"Corpus: \n{corpus_text}")

Corpus: 
Natural Language Processing is an exciting field of AI. It allows machines to understand human language. Tokenization, stemming, and lemmatization are common preprocessing steps of NLP.


In [None]:
# Documents

for i, doc in enumerate(corpus):
  print(f"Document {i+1}: {doc}")

Document 1: Natural Language Processing is an exciting field of AI.
Document 2: It allows machines to understand human language.
Document 3: Tokenization, stemming, and lemmatization are common preprocessing steps of NLP.


In [None]:
# Vocabulary (naive whitespace split, so punctuation stays attached, e.g. 'AI.')

vocab = []
for text in corpus:
  vocab.extend(text.split(' '))
print(f"Vocabulary: {set(vocab)}")
print(f"Vocabulary size: {len(set(vocab))}")

Vocabulary: {'is', 'exciting', 'AI.', 'machines', 'common', 'lemmatization', 'preprocessing', 'language.', 'to', 'allows', 'Natural', 'are', 'Processing', 'NLP.', 'of', 'Language', 'Tokenization,', 'steps', 'field', 'stemming,', 'an', 'and', 'human', 'It', 'understand'}
Vocabulary size: 25


In [None]:
# words

print(f"Words: {vocab}")
print(f"Words size: {len(vocab)}")

Words: ['Natural', 'Language', 'Processing', 'is', 'an', 'exciting', 'field', 'of', 'AI.', 'It', 'allows', 'machines', 'to', 'understand', 'human', 'language.', 'Tokenization,', 'stemming,', 'and', 'lemmatization', 'are', 'common', 'preprocessing', 'steps', 'of', 'NLP.']
Words size: 26


## **Tokenization**

**Tokenization** is the process of splitting text into smaller units called **tokens**.  
These tokens can be **words**, **sentences**, or **subwords**, depending on the application.

- **Word Tokenization:** Splitting text into words  
- **Sentence Tokenization:** Splitting text into sentences  

Tokenization is usually the **first step in text preprocessing** because most NLP tasks operate on tokens rather than raw text.

**Why Tokenization Is Important**

1. Simplifies text processing  
2. Helps in **counting words** and building **vocabulary**  
3. Essential for **feature extraction** like Bag of Words or TF-IDF  
4. Enables **stopword removal**, **stemming**, and **lemmatization**
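
To make the idea concrete before turning to NLTK, here is a minimal regex-based sketch of word and sentence tokenization using only the standard library. The function names are hypothetical and the rules are deliberately simple: abbreviations such as "Dr." would be split incorrectly, and contractions such as "don't" would be broken apart. Dedicated tokenizers handle such cases better.

```python
import re


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simple_word_tokenize(text):
    """Return runs of word characters, plus each punctuation mark on its own."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "NLP is fun. Tokenizers split text into pieces! Do they handle punctuation?"

sample_sentences = simple_sent_tokenize(sample)
sample_tokens = simple_word_tokenize(sample_sentences[0])

print(sample_sentences)
print(sample_tokens)
```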

In [None]:
text = "Natural Language Processing enables machines to understand human language. It is fascinating!"

In [None]:
# sentence tokenization

sentences = sent_tokenize(text)
print("Sentence Tokenization:")
print(sentences)
print("\nNumber of sentences in text:", len(sentences))

Sentence Tokenization:
['Natural Language Processing enables machines to understand human language.', 'It is fascinating!']

Number of sentences in text: 2


In [None]:
# word tokenization

words = word_tokenize(text)
print("Word Tokenization:")
print(words)
print("Number of tokens in the text:", len(words))

Word Tokenization:
['Natural', 'Language', 'Processing', 'enables', 'machines', 'to', 'understand', 'human', 'language', '.', 'It', 'is', 'fascinating', '!']
Number of tokens in the text: 14


In [None]:
# filtering the tokens

words_filtered = [word.lower() for word in words if word.isalpha()]
print("Filtered Tokens (lowercase, no punctuation):")
print(words_filtered)
print("\nNumber of filtered words:", len(words_filtered))

Filtered Tokens (lowercase, no punctuation):
['natural', 'language', 'processing', 'enables', 'machines', 'to', 'understand', 'human', 'language', 'it', 'is', 'fascinating']

Number of filtered words: 12


In [None]:
# vocabulary from the filtered words

print(f"Vocabulary: {set(words_filtered)}")
print(f"Vocabulary size: {len(set(words_filtered))}")

Vocabulary: {'machines', 'is', 'natural', 'human', 'enables', 'fascinating', 'language', 'to', 'understand', 'it', 'processing'}
Vocabulary size: 11
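
The difference between 12 filtered tokens and 11 vocabulary entries comes from one repeated word. A frequency count makes this visible. The sketch below uses `collections.Counter` on the same filtered tokens, copied in by hand so the block runs on its own. NLTK's `FreqDist`, imported at the top of the notebook, is built on `Counter` and should behave much the same way.

```python
from collections import Counter

# Filtered tokens from the cell above, copied so this block is self-contained
words_filtered = ['natural', 'language', 'processing', 'enables', 'machines',
                  'to', 'understand', 'human', 'language', 'it', 'is', 'fascinating']

freq = Counter(words_filtered)

print("Total tokens:", len(words_filtered))   # every occurrence counts
print("Vocabulary size:", len(freq))          # unique tokens only
print("Most common:", freq.most_common(1))    # the repeated word
```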


## **Stemming**

**Stemming** is the process of reducing words to their **root form** or **stem**.  
It usually involves removing suffixes like `-ing`, `-ed`, `-s`.  

- Example:  
  - "running" → "run"  
  - "played" → "play"  
  - "cats" → "cat"

Stemming is **rule-based** and may produce stems that are not real words.  
It is often used in **search engines** or **text classification** where exact grammar is less important.
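
The rule-based nature of stemming can be illustrated with a toy suffix stripper. This is a hypothetical, deliberately naive function and not the Porter algorithm: it removes the first matching suffix and stops there. Its output shows why real stemmers carry extra rules, for example to reduce the doubled consonant left behind in "runn".

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix; a naive illustration, not Porter."""
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["running", "played", "cats", "is", "sing"]:
    print(f"{w} -> {toy_stem(w)}")
```

Here "running" becomes "runn", which is not a real word, while "sing" and "is" are too short to be stripped.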

**Stemming Methods in NLP**

**1. Porter Stemmer**  
The **Porter Stemmer** is a rule-based stemming algorithm that removes common suffixes from words to reduce them to their stem form.  
- Example: "running" → "run", "played" → "play"  
- Pros: Simple and fast  
- Cons: May produce non-dictionary stems

**2. Snowball Stemmer**  
The **Snowball Stemmer** (also called Porter2) is an improved version of the Porter Stemmer.  
- Supports multiple languages  
- More aggressive and consistent stemming than Porter  
- Example: "running" → "run", "easily" → "easili"


In [None]:
# imports for stemming

from nltk.stem import PorterStemmer, SnowballStemmer

In [None]:
# applying stemming with PorterStemmer

words = ["running", "played", "plays", "easily", "fairness"]

porter = PorterStemmer()
stemmed_porter = [porter.stem(word) for word in words]

print("Initial words:", words)
print("Porter Stemmer:", stemmed_porter)

Initial words: ['running', 'played', 'plays', 'easily', 'fairness']
Porter Stemmer: ['run', 'play', 'play', 'easili', 'fair']


In [None]:
# applying stemming with SnowballStemmer

words = ["running", "played", "plays", "easily", "fairness"]

snowball = SnowballStemmer("english")
stemmed_snowball = [snowball.stem(word) for word in words]

print("Initial words:", words)
print("Snowball Stemmer:", stemmed_snowball)

Initial words: ['running', 'played', 'plays', 'easily', 'fairness']
Snowball Stemmer: ['run', 'play', 'play', 'easili', 'fair']


## **Lemmatization**

**Lemmatization** reduces words to their **base or dictionary form** called a **lemma**.  
Unlike stemming, it aims to return a **real word**; a word the lemmatizer cannot resolve is returned unchanged.  

- Example:  
  - "running" → "run" (as a verb)  
  - "better" → "good" (as an adjective)  
  - "cats" → "cat"

Lemmatization usually considers **part-of-speech (POS)** for more accurate results.  
It is widely used in **NLP pipelines** where language understanding matters.
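
Why the part of speech matters can be seen with a toy dictionary lookup keyed by `(word, pos)`. The lexicon below is hypothetical and tiny. The single-letter tags follow the WordNet convention (`n` noun, `v` verb, `a` adjective, `r` adverb), and unknown pairs fall back to the input word.

```python
# Hypothetical mini-lexicon: (word, pos) -> lemma
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("better", "r"): "well",
    ("cats", "n"): "cat",
    ("ate", "v"): "eat",
}


def toy_lemmatize(word, pos="n"):
    """Look up the lemma for (word, pos); return the word unchanged if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("better"))        # treated as a noun: no entry, unchanged
print(toy_lemmatize("better", "a"))   # treated as an adjective
print(toy_lemmatize("running"))       # treated as a noun: no entry, unchanged
print(toy_lemmatize("running", "v"))  # treated as a verb
```

The same word maps to different lemmas depending on the tag, which is why passing the wrong or default part of speech often leaves a word untouched.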

In [None]:
# import for lemmatization

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [None]:
words = ["running", "played", "plays", "easily", "better", "cats"]

# using wordnetLemmatizer (default POS='n')
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print(f"{'Words':>10} {'Lemmatized':>15}")
print("-"*40)
for word, lemma in zip(words, lemmatized_words):
    print(f"{word:>10}  {lemma:>10}")


     Words      Lemmatized
----------------------------------------
   running     running
    played      played
     plays        play
    easily      easily
    better      better
      cats         cat


In [56]:
# using wordnetLemmatizer for pos='v'

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]

print(f"{'Words':>10} {'Lemmatized':>15}")
print("-"*40)
for word, lemma in zip(words, lemmatized_words):
    print(f"{word:>10}  {lemma:>10}")

     Words      Lemmatized
----------------------------------------
   running         run
    played        play
     plays        play
    easily      easily
    better      better
      cats         cat


**Comparing Stemming and Lemmatization**

In [58]:
words = ["running", "played", "cats", "better", 'plays', 'eaten', 'ate']

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

print(f"{'Word':<10} {'Porter':<10} {'Snowball':<10} {'Lemma (noun)':<15} {'Lemma (verb)':<15}")
for word in words:
    print(f"{word:<10} {porter.stem(word):<10} {snowball.stem(word):<10} "
          f"{lemmatizer.lemmatize(word):<15} {lemmatizer.lemmatize(word, pos='v'):<15}")

Word       Porter     Snowball   Lemma (noun)    Lemma (verb)   
running    run        run        running         run            
played     play       play       played          play           
cats       cat        cat        cat             cat            
better     better     better     better          better         
plays      play       play       play            play           
eaten      eaten      eaten      eaten           eat            
ate        ate        ate        ate             eat            


## **Stopwords**

**Stopwords** are common words in a language that usually carry little meaningful information, such as:  
`a, an, the, is, in, on, and, but`

### **Why Remove Stopwords?**
- Reduce noise in text data  
- Improve efficiency of NLP models  
- Focus on important keywords in tasks like classification or clustering

### **Handling Stopwords in Python**
- The **NLTK** library provides a ready-made list of stopwords  
- Stopwords can be customized for specific tasks  
- Example: Removing stopwords from a tokenized sentence
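
Customizing a stopword list usually means removing words that matter for the task and adding domain-specific filler. The sketch below uses a small hand-written set in place of a library list; all three sets are hypothetical. For sentiment analysis, for instance, dropping "not" would change the meaning of a review, so it is kept.

```python
# Small hand-written stopword set, standing in for a library-provided list
base_stop_words = {"a", "an", "the", "is", "in", "on", "and", "but", "to", "it", "not"}

# Task-specific adjustments (hypothetical choices for a product-review task)
keep = {"not", "but"}          # negation and contrast carry sentiment
extra = {"product", "item"}    # domain filler that appears in every review

custom_stop_words = (base_stop_words - keep) | extra

tokens = "the product is not good but it is cheap".split()
filtered = [t for t in tokens if t not in custom_stop_words]

print(filtered)  # ['not', 'good', 'but', 'cheap']
```

The same set arithmetic applies to the set built from `stopwords.words('english')`.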

In [59]:
# importing library and downloading stopwords

from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [64]:
sentence = "Natural Language Processing allows machines to understand human language easily. It is a toolkit for handling the text"

# Tokenize words
words = word_tokenize(sentence.lower())
print("Original Tokens:")
print(words)

Original Tokens:
['natural', 'language', 'processing', 'allows', 'machines', 'to', 'understand', 'human', 'language', 'easily', '.', 'it', 'is', 'a', 'toolkit', 'for', 'handling', 'the', 'text']


In [65]:
# Get English stopwords
stop_words = set(stopwords.words('english'))
print("English Stopwords are:")
print(stop_words)
print(f"Number of stopwords: {len(stop_words)}")

English Stopwords are:
{'is', "he's", 'what', 'him', 'in', 'couldn', 'do', 'o', 'only', 'some', 'm', 'mightn', "you'll", 'he', 's', 'yourselves', 'won', 'from', "you're", 'them', 'were', 'shouldn', 'before', 'this', "weren't", 've', 'hers', "wasn't", 'yours', 'both', 'me', 'wouldn', "we're", 'hasn', "they'll", "didn't", "i'd", 'just', 'has', 'be', 'down', 'isn', 'at', 'but', 'they', 'those', 'if', 'then', 'itself', "shan't", "it'll", "don't", "hadn't", 'himself', "that'll", 'such', 'most', 'nor', 'under', 'out', 'whom', 'her', "doesn't", 'myself', 'as', 'than', 'we', "haven't", 'have', 'did', 'against', 'further', "i've", "couldn't", 'to', 'now', 'd', 'between', 'his', 'other', 'didn', "mightn't", 'themselves', 'shan', "they're", "hasn't", 'here', 'an', 'can', "he'd", 'theirs', 'who', 'll', 'which', 'being', 'their', 'she', 'aren', 'too', 'the', "you've", "you'd", 'how', 'few', 'while', 'above', 'about', 'or', 'that', 't', "isn't", 'again', 'very', "i'll", 'ain', 'not', 'same', "he'll"

In [66]:
# Remove stopwords
filtered_words = [word for word in words if word.isalpha() and word not in stop_words]

print("Original Tokens:")
print(words)
print("\nTokens after Stopwords Removal:")
print(filtered_words)

Original Tokens:
['natural', 'language', 'processing', 'allows', 'machines', 'to', 'understand', 'human', 'language', 'easily', '.', 'it', 'is', 'a', 'toolkit', 'for', 'handling', 'the', 'text']

Tokens after Stopwords Removal:
['natural', 'language', 'processing', 'allows', 'machines', 'understand', 'human', 'language', 'easily', 'toolkit', 'handling', 'text']
