Objectives:   
	•	Introduce text vectorization methods.   
	•	Explore preprocessing steps (stemming, lemmatization, etc.) and their impact on vectorization.    
	•	Compare methods on the Reuters dataset.

In [1]:
# Load and explore the Reuters dataset using NLTK.

import nltk
from nltk.corpus import reuters
import pandas as pd

# Download Reuters dataset
nltk.download('reuters')

# Load dataset into a DataFrame
categories = reuters.categories()
files = reuters.fileids()
texts = [reuters.raw(fileid) for fileid in files[:500]]  # Limit to 500 documents for efficiency
# Reuters documents can carry several labels; only the first one is kept as the category
df = pd.DataFrame({'text': texts, 'category': [reuters.categories(fileid)[0] for fileid in files[:500]]})

# Display sample
print("Sample Dataset:")
print(df.head())

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\water\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!


Sample Dataset:
                                                text  category
0  ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RI...     trade
1  CHINA DAILY SAYS VERMIN EAT 7-12 PCT GRAIN STO...     grain
2  JAPAN TO REVISE LONG-TERM ENERGY DEMAND DOWNWA...     crude
3  THAI TRADE DEFICIT WIDENS IN FIRST QUARTER\n  ...      corn
4  INDONESIA SEES CPO PRICE RISING SHARPLY\n  Ind...  palm-oil


### Stemming & Lemmatization

In [2]:
# Compare Porter, Snowball, and WordNetLemmatizer.

from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
nltk.download('punkt_tab')

# Sample text
sample_text = df['text'][0]

# Tokenization
tokens = word_tokenize(sample_text)

# Stemming
porter = PorterStemmer()
snowball = SnowballStemmer('english')
porter_stemmed = [porter.stem(token) for token in tokens]
snowball_stemmed = [snowball.stem(token) for token in tokens]

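# Lemmatization
# Note: lemmatize() defaults to pos='n' (noun) and does not lower-case its input,
# so verbs such as 'raised' and upper-case tokens pass through unchanged.
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(token) for token in tokens]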

print("\nOriginal Tokens:", tokens[:20])
print("\nPorter Stemmed:", porter_stemmed[:20])
print("\nSnowball Stemmed:", snowball_stemmed[:20])
print("\nLemmatized:", lemmatized[:20])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\water\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\water\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!



Original Tokens: ['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U.S.-JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U.S.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many']

Porter Stemmed: ['asian', 'export', 'fear', 'damag', 'from', 'u.s.-japan', 'rift', 'mount', 'trade', 'friction', 'between', 'the', 'u.s.', 'and', 'japan', 'ha', 'rais', 'fear', 'among', 'mani']

Snowball Stemmed: ['asian', 'export', 'fear', 'damag', 'from', 'u.s.-japan', 'rift', 'mount', 'trade', 'friction', 'between', 'the', 'u.s.', 'and', 'japan', 'has', 'rais', 'fear', 'among', 'mani']

Lemmatized: ['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U.S.-JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U.S.', 'And', 'Japan', 'ha', 'raised', 'fear', 'among', 'many']
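
The outputs above show stemming lower-casing tokens and merging inflected forms ('FEAR' and 'fears' both become 'fear'), which shrinks the vocabulary a vectorizer later builds. The sketch below illustrates that effect on a made-up token list. It uses `crude_stem`, a hypothetical toy suffix stripper written only for this illustration (it is not the Porter or Snowball algorithm), so that it runs without NLTK.

```python
def crude_stem(token):
    """Toy suffix stripper for illustration only; NOT the Porter/Snowball algorithm."""
    token = token.lower()
    for suffix in ("ing", "ers", "ed", "s"):
        # Strip the first matching suffix, but keep at least 3 characters
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Made-up tokens in the spirit of the Reuters sample above
tokens = ["Exporters", "export", "exporting", "exported", "FEAR", "fear", "fears", "trade"]

raw_vocab = set(tokens)                          # case-sensitive surface forms
lowered_vocab = {t.lower() for t in tokens}      # lower-casing only
stemmed_vocab = {crude_stem(t) for t in tokens}  # lower-casing + suffix stripping

print(len(raw_vocab), len(lowered_vocab), len(stemmed_vocab))
print(sorted(stemmed_vocab))
```

Each normalization step leaves fewer distinct terms, and therefore fewer columns in the resulting document-term matrix. A real stemmer applies far more careful rules, but the direction of the effect is the same.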


#### Handling Special Characters, Numbers and URLs

In [3]:
import re

def preprocess_text(text):
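    # Remove URLs ('http\S+' already covers https links)
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove punctuation and other special characters; they are deleted outright,
    # so 'U.S.-JAPAN' becomes 'USJAPAN' and 'far-reaching' becomes 'farreaching'
    text = re.sub(r'[^\w\s]', '', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)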
    return text

cleaned_text = preprocess_text(sample_text)
print("\nCleaned Text:", cleaned_text[:200])


Cleaned Text: ASIAN EXPORTERS FEAR DAMAGE FROM USJAPAN RIFT
  Mounting trade friction between the
  US And Japan has raised fears among many of Asias exporting
  nations that the row could inflict farreaching econo
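
The cleaned text above glues hyphenated and dotted tokens together ('USJAPAN', 'farreaching') because the removed characters are replaced with an empty string. One possible variant, sketched here under the assumption that word boundaries matter more than preserving abbreviations, substitutes a space instead and then collapses runs of whitespace. The function name `preprocess_keep_boundaries` and the sample string are made up for illustration.

```python
import re

def preprocess_keep_boundaries(text):
    """Variant of preprocess_text that replaces removed spans with a space."""
    text = re.sub(r'http\S+|www\S+', ' ', text)  # URLs
    text = re.sub(r'[^\w\s]', ' ', text)         # punctuation / special characters
    text = re.sub(r'\d+', ' ', text)             # numbers
    return re.sub(r'\s+', ' ', text).strip()     # collapse whitespace

sample = "U.S.-JAPAN RIFT: far-reaching damage, see www.example.com 1987"
cleaned = preprocess_keep_boundaries(sample)
print(cleaned)
```

The trade-off is visible in the result: 'far-reaching' splits into two useful words, but 'U.S.' falls apart into the single letters 'U' and 'S'. Which behavior is preferable depends on the downstream tokenizer and task.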


### Vectorization

#### One Hot Encoding (OHE)

In [4]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

# One-Hot Encoding Example
corpus = df['text'][:10]

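# Use CountVectorizer to create a vocabulary and transform the text into a document-term matrix
vectorizer = CountVectorizer()
document_term_matrix = vectorizer.fit_transform(corpus)

# Apply OneHotEncoder to the count matrix. Each term column is treated as a categorical
# feature, so the encoder emits one output column per distinct count value in that column.
# The resulting width is therefore the number of (term, count) pairs, not the vocabulary size.
encoder = OneHotEncoder(sparse_output=False) # sparse_output=False for a dense array
one_hot_encoded = encoder.fit_transform(document_term_matrix.toarray()) # Convert to dense array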

print("\nOne-Hot Encoded Shape:", one_hot_encoded.shape)


One-Hot Encoded Shape: (10, 2251)
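
In the text setting, one-hot encoding usually means that every vocabulary term gets its own position, and a token is represented by a vector with a single 1 at that position. A document is then often summarized by combining its token vectors into a binary presence vector with one column per term. The sketch below builds such vectors by hand on a made-up corpus; in scikit-learn, `CountVectorizer(binary=True)` is the closest built-in equivalent, although its tokenization differs from the plain `split()` assumed here.

```python
docs = ["trade friction raised fears", "grain trade grain stocks", "energy demand"]

# One position per distinct term, in sorted order
vocab = sorted({tok for doc in docs for tok in doc.split()})
index = {tok: i for i, tok in enumerate(vocab)}

def presence_vector(doc):
    """Binary vector: 1 if the term occurs in the document, else 0."""
    vec = [0] * len(vocab)
    for tok in doc.split():
        vec[index[tok]] = 1  # repeated tokens still yield 1, not a count
    return vec

matrix = [presence_vector(doc) for doc in docs]
print(vocab)
for row in matrix:
    print(row)
```

The matrix has exactly `len(vocab)` columns, and the second document records 'grain' once even though it appears twice. That is the difference from the Bag of Words counts in the next section.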


#### Bag of Words

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

# Bag of Words
bow_vectorizer = CountVectorizer(max_features=100)
bow_matrix = bow_vectorizer.fit_transform(df['text'][:10])

print("\nBag of Words Shape:", bow_matrix.shape)


Bag of Words Shape: (10, 100)
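
The shape `(10, 100)` follows from `max_features=100`, which keeps only the terms with the highest frequency across the corpus. A from-scratch sketch of that idea on a made-up corpus is shown below. It assumes whitespace tokenization and uses `Counter.most_common`, whose handling of ties is not claimed to match scikit-learn's.

```python
from collections import Counter

docs = ["trade friction trade fears", "grain trade grain stocks", "grain prices"]
tokenized = [doc.split() for doc in docs]

# Corpus-wide term frequencies decide which terms survive the cap
corpus_counts = Counter(tok for toks in tokenized for tok in toks)
max_features = 2
kept = sorted(term for term, _ in corpus_counts.most_common(max_features))

# Each row holds the raw counts of the kept terms in one document
bow = [[Counter(toks)[term] for term in kept] for toks in tokenized]

print(kept)
for row in bow:
    print(row)
```

Terms outside the kept set ('friction', 'fears', 'stocks', 'prices') vanish from the matrix entirely, so a small cap trades information for a fixed, manageable width.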


#### TF-IDF

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=100)
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'][:10])

print("\nTF-IDF Shape:", tfidf_matrix.shape)


TF-IDF Shape: (10, 100)
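
TF-IDF keeps the same columns as Bag of Words but reweights each count by how rare the term is across documents. The sketch below reproduces the weighting by hand using the smoothed formula documented for scikit-learn's defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The corpus is made up and pre-tokenized, so the numbers are illustrative only.

```python
import math
from collections import Counter

docs = [["trade", "grain"], ["trade", "oil"], ["trade", "trade", "grain"]]
n_docs = len(docs)
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: number of documents containing each term
df_counts = Counter(tok for doc in docs for tok in set(doc))
# Smoothed idf; a term present in every document gets idf = 1
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Raw count times idf, then scaled to unit Euclidean length."""
    counts = Counter(doc)
    weights = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in tfidf:
    print([round(w, 3) for w in row])
```

In the first document 'grain' outweighs 'trade' although both occur once, because 'trade' appears in every document and so carries the minimum idf.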


### Benchmarking and Analysis

In [7]:
import numpy as np

def calculate_sparsity(matrix):
    """
    Calculates sparsity percentage of a given matrix.
    :param matrix: Sparse matrix or array
    :return: Sparsity as a percentage
    """
    # Convert the matrix to a dense array if it's sparse
    if hasattr(matrix, 'toarray'):
        matrix = matrix.toarray()
    total_elements = np.prod(matrix.shape)
    zero_elements = np.count_nonzero(matrix == 0)
    sparsity = (zero_elements / total_elements) * 100
    return sparsity

# Calculate sparsity for Bag of Words and TF-IDF matrices
print("\nBag of Words Sparsity:", calculate_sparsity(bow_matrix), "%")
print("TF-IDF Sparsity:", calculate_sparsity(tfidf_matrix), "%")


Bag of Words Sparsity: 61.6 %
TF-IDF Sparsity: 61.6 %
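
The two sparsity figures are identical, most likely because both vectorizers were fitted on the same ten documents with the same `max_features` and therefore share a vocabulary. TF-IDF then only rescales the nonzero counts with positive weights; it never turns a zero into a nonzero value or the reverse. The sketch below checks this on a small made-up count matrix, with arbitrary positive `term_weights` standing in for idf values.

```python
def sparsity_percent(rows):
    """Share of zero entries in a list-of-lists matrix, as a percentage."""
    total = sum(len(row) for row in rows)
    zeros = sum(1 for row in rows for value in row if value == 0)
    return 100 * zeros / total

counts = [[0, 2, 0, 1], [3, 0, 0, 0], [0, 1, 1, 0]]
# Stand-in for TF-IDF weighting: positive per-term weights (values are made up)
term_weights = [1.0, 1.3, 1.7, 2.1]
weighted = [[c * w for c, w in zip(row, term_weights)] for row in counts]

print(sparsity_percent(counts), sparsity_percent(weighted))
```

Sparsity therefore says nothing about the difference between the two representations; it depends only on the vocabulary and on which terms occur in which documents.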


Reflection Questions    
	1. How does stemming or lemmatization affect the vocabulary size for vectorization?   
	2. What trade-offs exist between Bag of Words and TF-IDF in terms of dimensionality and interpretability?   
	3. Why might preprocessing steps (e.g., removing special characters or numbers) influence model performance?