# NLP Basics: Text Preprocessing, Vectorization, and Text Similarity

**Objective:**  
This notebook introduces fundamental Natural Language Processing (NLP) concepts, including text preprocessing, Bag of Words (BoW), TF-IDF vectorization, and text similarity measures.  
By the end of this notebook, you will be able to:
- Preprocess raw text for NLP tasks
- Convert text into numerical representations
- Compute similarity between text documents

## What is NLP?

Natural Language Processing (NLP) is a subfield of AI and linguistics that focuses on enabling computers to understand, interpret, and generate human language.

**Key Tasks in NLP:**
- Text Preprocessing
- Text Representation (BoW, TF-IDF, Word Embeddings)
- Text Classification
- Named Entity Recognition (NER)
- Sentiment Analysis
- Machine Translation
- Question Answering

## Required Python Libraries

We will use the following Python libraries:

- `nltk` for tokenization, stopwords, stemming, and lemmatization
- `scikit-learn` for vectorization (BoW, TF-IDF) and similarity computation
- `numpy` for numerical operations


In [6]:
# Install libraries if not already installed
#!pip install nltk scikit-learn

# Import libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Download NLTK resources
# (newer NLTK releases may also need: nltk.download('punkt_tab'))
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ksiri\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ksiri\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ksiri\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Text Preprocessing

Text preprocessing is the first step in NLP. It involves cleaning and normalizing text to make it suitable for analysis or model input.

**Common Preprocessing Steps:**
1. **Lowercasing** – Convert all text to lowercase to ensure uniformity.
2. **Tokenization** – Split text into words or sentences.
3. **Stopwords Removal** – Remove common words (like "the", "is") that do not carry much meaning.
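4. **Stemming** – Strip suffixes with heuristic rules to get a stem (e.g., "running" → "run"). The stem is not always a real word (e.g., "amazing" → "amaz").
5. **Lemmatization** – Reduce words to their dictionary form, or lemma (e.g., "better" → "good" when the word is tagged as an adjective).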


In [7]:
# Sample text
text = "Natural Language Processing is amazing! NLP helps machines understand text."

# Lowercase & Tokenize
tokens = word_tokenize(text.lower())
print("Tokens:", tokens)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in tokens if w.isalpha() and w not in stop_words]
print("Filtered Tokens:", filtered_tokens)

# Stemming
ps = PorterStemmer()
stemmed_tokens = [ps.stem(w) for w in filtered_tokens]
print("Stemmed Tokens:", stemmed_tokens)

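# Lemmatization (lemmatize() treats each token as a noun unless a pos argument is given,
# which is why 'processing' and 'amazing' come back unchanged)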
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(w) for w in filtered_tokens]
print("Lemmatized Tokens:", lemmatized_tokens)


Tokens: ['natural', 'language', 'processing', 'is', 'amazing', '!', 'nlp', 'helps', 'machines', 'understand', 'text', '.']
Filtered Tokens: ['natural', 'language', 'processing', 'amazing', 'nlp', 'helps', 'machines', 'understand', 'text']
Stemmed Tokens: ['natur', 'languag', 'process', 'amaz', 'nlp', 'help', 'machin', 'understand', 'text']
Lemmatized Tokens: ['natural', 'language', 'processing', 'amazing', 'nlp', 'help', 'machine', 'understand', 'text']


## Bag of Words (BoW)

**Definition:**  
BoW is a method to represent text as a fixed-length vector of word counts. It ignores grammar and word order but captures frequency information.

**Steps:**
1. Create a vocabulary of all unique words.
2. Count occurrences of each word in the document.
3. Represent each document as a vector of word counts.

**Example:**  
Text: ["I love NLP", "NLP is fun"] → Vocabulary: ["I", "love", "NLP", "is", "fun"]  
Vectors: "I love NLP" → [1, 1, 1, 0, 0], "NLP is fun" → [0, 0, 1, 1, 1]
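
The three steps above can be sketched in plain Python before handing the work to a library. This is an illustrative sketch, not how `CountVectorizer` is implemented: the helper name `bag_of_words` is hypothetical, and tokenization is simplified to lowercasing and splitting on whitespace. The vocabulary is sorted, so the column order differs from the example above.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count_vectors) for a list of documents."""
    # Simplifying assumption: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]

    # Step 1: vocabulary of all unique words, sorted for a stable column order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})

    # Steps 2 and 3: count words per document and lay the counts out as vectors.
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["I love NLP", "NLP is fun"])
print(vocabulary)  # ['fun', 'i', 'is', 'love', 'nlp']
print(vectors)     # [[0, 1, 0, 1, 1], [1, 0, 1, 0, 1]]
```

Unlike this sketch, `CountVectorizer` with default settings drops single-character tokens, which is why "I" does not appear among the feature names in the next cell.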


In [8]:
corpus = [
    "I love machine learning.",
    "Natural Language Processing is fascinating.",
    "Machine learning helps understand data."
]

# Create BoW representation
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)

print("BoW Feature Names:", vectorizer.get_feature_names_out())
print("BoW Matrix:\n", X_bow.toarray())


BoW Feature Names: ['data' 'fascinating' 'helps' 'is' 'language' 'learning' 'love' 'machine'
 'natural' 'processing' 'understand']
BoW Matrix:
 [[0 0 0 0 0 1 1 1 0 0 0]
 [0 1 0 1 1 0 0 0 1 1 0]
 [1 0 1 0 0 1 0 1 0 0 1]]


## TF-IDF (Term Frequency – Inverse Document Frequency)

**Definition:**  
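TF-IDF weights each term by how often it occurs in a document, scaled down by how many documents in the corpus contain it. Words that appear everywhere get low weights, while words concentrated in a few documents get high weights.

**Formulas:**
- **TF (Term Frequency):** Number of times a word occurs in a document (sometimes divided by the document length).
- **IDF (Inverse Document Frequency):** log(N / df), where N = total number of documents and df = number of documents containing the word.
- **TF-IDF:** TF × IDF.

Note that scikit-learn's `TfidfVectorizer` defaults to a smoothed IDF, ln((1 + N) / (1 + df)) + 1, and L2-normalizes each document vector, so its values differ from the textbook formula.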

**Benefits:**  
- Reduces impact of frequent words (like "the")
- Highlights important terms for the document


In [9]:
# Create TF-IDF representation
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

print("TF-IDF Feature Names:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix:\n", X_tfidf.toarray())


TF-IDF Feature Names: ['data' 'fascinating' 'helps' 'is' 'language' 'learning' 'love' 'machine'
 'natural' 'processing' 'understand']
TF-IDF Matrix:
 [[0.         0.         0.         0.         0.         0.51785612
  0.68091856 0.51785612 0.         0.         0.        ]
 [0.         0.4472136  0.         0.4472136  0.4472136  0.
  0.         0.         0.4472136  0.4472136  0.        ]
 [0.49047908 0.         0.49047908 0.         0.         0.37302199
  0.         0.37302199 0.         0.         0.49047908]]
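
The values printed above can be reproduced by hand, which helps show what the vectorizer does with default settings: a smoothed IDF, ln((1 + N) / (1 + df)) + 1, followed by scaling each document vector to unit L2 length. The sketch below is plain Python under stated assumptions. The regex only approximates the vectorizer's default tokenization, and the helper names `smoothed_idf` and `tfidf_row` are hypothetical.

```python
import math
import re

corpus = [
    "I love machine learning.",
    "Natural Language Processing is fascinating.",
    "Machine learning helps understand data.",
]

# Approximate the default tokenization: lowercase, keep tokens of 2+ word characters.
docs = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw count times IDF for every vocabulary term, then L2-normalized."""
    raw = [doc.count(term) * smoothed_idf(term) for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf = [tfidf_row(doc) for doc in docs]

# Non-zero weights of the first document.
for term, weight in zip(vocab, tfidf[0]):
    if weight > 0:
        print(f"{term}: {weight:.4f}")
# learning: 0.5179
# love: 0.6809
# machine: 0.5179
```

"love" gets the highest weight in the first document because it appears in only one document, while "machine" and "learning" each appear in two.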


## Text Similarity

After converting text into vectors (BoW or TF-IDF), we can compute similarity between documents.

**Common Similarity Measures:**
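- Cosine Similarity: the cosine of the angle between two vectors. For non-negative vectors such as BoW and TF-IDF it ranges from 0 (no shared terms) to 1 (same direction).
- Euclidean Distance: the straight-line distance between two vectors. Smaller values mean more similar documents, and unlike cosine similarity it is sensitive to vector length.

**Use Cases:** Finding similar documents, clustering, information retrieval.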


In [10]:
# Compute cosine similarity between documents
similarity_matrix = cosine_similarity(X_tfidf)
print("Cosine Similarity Matrix:\n", similarity_matrix)


Cosine Similarity Matrix:
 [[1.         0.         0.38634343]
 [0.         1.         0.        ]
 [0.38634343 0.         1.        ]]
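
To make the matrix entries less opaque, the 0.386 score between documents 0 and 2 can be recomputed by hand. The sketch below is plain Python; the two vectors are copied from the TF-IDF matrix printed earlier, and the helper names `cosine_sim` and `euclidean_dist` are hypothetical.

```python
import math

# Rows 0 and 2 of the TF-IDF matrix printed earlier.
doc0 = [0, 0, 0, 0, 0, 0.51785612, 0.68091856, 0.51785612, 0, 0, 0]
doc2 = [0.49047908, 0, 0.49047908, 0, 0, 0.37302199, 0, 0.37302199, 0, 0, 0.49047908]

def cosine_sim(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_dist(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cos = cosine_sim(doc0, doc2)
dist = euclidean_dist(doc0, doc2)
print(f"cosine similarity:  {cos:.4f}")   # 0.3863
print(f"euclidean distance: {dist:.4f}")
```

Only "learning" and "machine" are shared by the two documents, so they alone contribute to the dot product. Document 1 shares no terms with the others, which is why its off-diagonal similarities are exactly 0. Because the TF-IDF rows are already unit length, the two measures are tied together here: distance² = 2 − 2 × cosine.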


## Summary

In this notebook, you learned:
- Basic text preprocessing: tokenization, stopwords removal, stemming, lemmatization.
- Representing text numerically using Bag of Words and TF-IDF.
- Measuring text similarity using cosine similarity.

These are foundational NLP skills required before moving to **word embeddings, sequence models, and transformers**.
