## Assignment 8 - Deadline: Dec 29, 2024, Sun 11pm

#### DSAI 510 Fall 2024

Complete the assignment below and upload <span style="color:red">both the .ipynb file and its pdf</span> to https://moodle.boun.edu.tr by the deadline given above. The submission page on Moodle will close automatically after this date and time.


To make a pdf, this may work: Hit CMD+P or CTRL+P, and save it as PDF. You may also use other options from the File menu.

In [2]:
# Run this cell first

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Set the display option to show all rows scrolling with a slider
pd.set_option('display.max_rows', None)
# To disable this, run the line below:
# pd.reset_option('display.max_rows')

## Note: 
In the problems below, if a question asks you to "show the number of records that are nonzero", 
the answer is a number, so you don't need to show the records themselves.
But if it asks you to "show the records with NaN", it wants you to print the records (rows)
that contain NaN along with their other entries, not to report how many such records there are. So be careful about what you're asked.
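
As a minimal sketch of the distinction in the note above, the snippet below uses a small made-up DataFrame (the column names and values are assumptions for illustration, not assignment data). The first answer is a single number, the second is the rows themselves.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not part of the assignment
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 0.0], "b": [0.0, 2.0, np.nan, 4.0]})

# "Show the number of records with NaN": the answer is one number
n_nan_rows = int(df.isna().any(axis=1).sum())
print(n_nan_rows)

# "Show the records with NaN": print the rows themselves, with all their entries
nan_rows = df[df.isna().any(axis=1)]
print(nan_rows)
```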

# Problem 1: Newsgroup discussions classification (30 pts)

The `fetch_20newsgroups` dataset is a collection of approximately 20,000 newsgroup documents, distributed across 20 different newsgroups. Each newsgroup represents a distinct category, ranging from technology and politics to religion and sports. This dataset is widely used in natural language processing and machine learning for tasks like text classification and clustering, due to its diverse range of topics and real-world discussion content.

Your goal is to build a classifier that would take a document from this dataset and output the category of that document. Categories: 'sci.med', 'comp.graphics', 'sci.electronics' and 'talk.politics.misc'.

The data is already loaded below, and the first five documents and their class labels from the dataset are printed.

1) Use TF-IDF to vectorize the train and test dataset. 

(Important to prevent data leakage:
Remember from the ipynb notebook we used in the lecture: you use `fit_transform()` on the train dataset to learn the vocabulary (list of words), and for the test dataset you only use `transform()`, which does not learn anything from the test dataset but represents the test documents in terms of the vocabulary learned from the train data. It is likely that there will be words that appear only in the test dataset and do not occur in the train dataset. Those words will simply be ignored when the vectorizer is applied to the test dataset, because the vocabulary is fixed by the train dataset when we call `fit_transform()` on it.)

2) Use a classifier of your choice to train your model and calculate the accuracy of your model by using the test data.

(Note that we have four classes here, so your classifier will not be a binary classifier but a multi-class classifier. This means your model should be capable of distinguishing and predicting among four different categories.)


3) Find one document from the test data for each of the categories 'sci.med', 'comp.graphics', 'sci.electronics' and 'talk.politics.misc'. Print those four documents and print their category as a string (not as a number, but as 'sci.med' etc.). 

In [1]:
from sklearn.datasets import fetch_20newsgroups
# Selected categories 
categories = ['sci.med', 'comp.graphics', 'sci.electronics', 'talk.politics.misc']

# Loading training dataset
df_train = fetch_20newsgroups(subset='train',categories=categories, shuffle=True, random_state=3)

# Loading testing dataset
df_test = fetch_20newsgroups(subset='test',categories=categories)

In [2]:
type(df_train)

sklearn.utils._bunch.Bunch

In [3]:
df_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

### First five documents as examples

In [4]:
for i in range(5):
    print(f"\033[1m\033[91mDocument {i}:\033[0m ")  # Red and bold text
    print(df_train.data[i])
    print("\n------------------\n")  # Separator for readability

[1m[91mDocument 0:[0m 
From: peter.m@insane.apana.org.au (Peter Tryndoch)
Subject: Dmm Advice Needed
Lines: 28

AllMartin EmdeDMM Advice Needed

ME>From: mce5921@bcstec.ca.boeing.com (Martin Emde)
ME>Organization: Boeing
ME> 
ME>I an currely in the market for a DMM and recently saw an add
ME>for a Kelvin 94 ($199).  Does anyone own one of these or some
ME>other brand that they are extremely happy with.  How do the 
ME>small name brands compare with the Fluke and Beckman brands?
ME>I am willing to spend ~$200 for one.
ME> 
ME>Any help is greatly appreciated. (please email)
ME> 
ME>-Martin

If you are going to use one where it counts (eg:aviation, space scuttle, 
etc) then I suggest you go and buy a Fluke (never seen a Beckman), however 
for every other use you can buy a cheapie. I have a metex which is some 
made up name, as I have seen the same DMM with other brand names on it, I 
bought it about 4 yrs ago for Aus$125.00 (convert that to US and you see 
that it's definetly a cheapie

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.multiclass import OneVsRestClassifier
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import re

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')


# Create lemmatizer
lemmatizer = WordNetLemmatizer()
    
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    
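    # Remove everything except letters and whitespace (digits, punctuation).
    # Note: characters are deleted, not replaced by a space, so email addresses
    # and similar strings are fused into one token (e.g. "a@b.edu" -> "abedu").
    text = re.sub(r'[^a-zA-Z\s]', '', text)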
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Lemmatize and filter short tokens
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens if len(token) > 2]
    
    # Join tokens back into text
    return ' '.join(lemmatized_tokens)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\seval\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\seval\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\seval\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [19]:
# Part 1
# Standard TF-IDF vectorization (no custom preprocessing)
print("Step 1: TF-IDF Vectorization")
tfidf_vectorizer = TfidfVectorizer(
    stop_words='english',  # Use built-in English stop words
    token_pattern=r'\b[a-zA-Z]{2,}\b',  # Only take words with 2 or more characters
    lowercase=True  # Convert all text to lowercase
)
# Fit and transform training data
X_train_tfidf = tfidf_vectorizer.fit_transform(df_train.data)

# Only transform test data using vocabulary from training data
X_test_tfidf = tfidf_vectorizer.transform(df_test.data)

# Verify this is multiclass
print(f"Number of classes: {len(categories)}")
print(f"Unique classes in training data: {len(set(df_train.target))}")

# Part 2
# Train classifier using OneVsRestClassifier explicitly for multiclass
print("\nStep 2: Training Classifier")
clf = OneVsRestClassifier(MultinomialNB())
clf.fit(X_train_tfidf, df_train.target)

# Get probability estimates for each class
probabilities = clf.predict_proba(X_test_tfidf)
print(f"Shape of probability predictions: {probabilities.shape}")

# Make predictions
y_pred = clf.predict(X_test_tfidf)

# Calculate and print accuracy
accuracy = accuracy_score(df_test.target, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")

# Print detailed classification report
print("\nDetailed Classification Report:")
# target_names must follow the label order of df_test.target, i.e. df_test.target_names,
# which is not the order of the `categories` list
print(classification_report(df_test.target, y_pred, target_names=df_test.target_names))

# Part 3
# Find and print one document from each category
print("\nStep 3: Example Documents from Test Set:")
print("-" * 50)

for category_idx, category_name in enumerate(df_test.target_names):
    # Find first document in test set for this category
    doc_idx = next(i for i, label in enumerate(df_test.target) if label == category_idx)
    
    # Get the predicted category for this document
    pred_category = df_test.target_names[y_pred[doc_idx]]
    
    print(f"\nCategory: {category_name}")
    print(f"Predicted Category: {pred_category}")
    print("Document:")
    # Print first 300 characters for readability
    print(df_test.data[doc_idx][:300] + "...")
    print("-" * 50)

Step 1: TF-IDF Vectorization
Number of classes: 4
Unique classes in training data: 4

Step 2: Training Classifier
Shape of probability predictions: (1488, 4)

Model Accuracy: 0.9147

Detailed Classification Report:
                    precision    recall  f1-score   support

     comp.graphics       0.89      0.93      0.91       389
   sci.electronics       0.89      0.89      0.89       393
           sci.med       0.94      0.91      0.93       396
talk.politics.misc       0.94      0.93      0.94       310

          accuracy                           0.91      1488
         macro avg       0.92      0.92      0.92      1488
      weighted avg       0.92      0.91      0.91      1488


Step 3: Example Documents from Test Set:
--------------------------------------------------

Category: comp.graphics
Predicted Category: comp.graphics
Document:
From: lee@hobbes.cs.umass.edu (Peter Lee)
Subject: Re: QuickTime performance (was Re: Rumours about 3DO ???)
	<1993Apr16.212441.34125@rchland.ibm.com>
	

In [20]:
# Trial with Lemmatization to increase accuracy
# Part 1
# TF-IDF Vectorization with custom preprocessor function including Lemmatization
print("Step 1: TF-IDF Vectorization")
tfidf_vectorizer = TfidfVectorizer(
    preprocessor=preprocess_text,
    stop_words=None,  # No stop-word removal in this trial (preprocess_text does not filter stop words either)
    token_pattern=r'\b[a-zA-Z]{2,}\b'  # Only take words with 2 or more characters
)
# Fit and transform training data
X_train_tfidf = tfidf_vectorizer.fit_transform(df_train.data)

# Only transform test data using vocabulary from training data
X_test_tfidf = tfidf_vectorizer.transform(df_test.data)

# Verify this is multiclass
print(f"Number of classes: {len(categories)}")
print(f"Unique classes in training data: {len(set(df_train.target))}")

# Part 2
# Train classifier using OneVsRestClassifier explicitly for multiclass
print("\nStep 2: Training Classifier")
clf = OneVsRestClassifier(MultinomialNB())
clf.fit(X_train_tfidf, df_train.target)

# Get probability estimates for each class
probabilities = clf.predict_proba(X_test_tfidf)
print(f"Shape of probability predictions: {probabilities.shape}")

# Make predictions
y_pred = clf.predict(X_test_tfidf)

# Calculate and print accuracy
accuracy = accuracy_score(df_test.target, y_pred)
print(f"\nModel Accuracy: {accuracy:.4f}")

# Print detailed classification report
print("\nDetailed Classification Report:")
# target_names must follow the label order of df_test.target, i.e. df_test.target_names,
# which is not the order of the `categories` list
print(classification_report(df_test.target, y_pred, target_names=df_test.target_names))

# Part 3
# Find and print one document from each category
print("\nStep 3: Example Documents from Test Set:")
print("-" * 50)

for category_idx, category_name in enumerate(df_test.target_names):
    # Find first document in test set for this category
    doc_idx = next(i for i, label in enumerate(df_test.target) if label == category_idx)
    
    # Get the predicted category for this document
    pred_category = df_test.target_names[y_pred[doc_idx]]
    
    print(f"\nCategory: {category_name}")
    print(f"Predicted Category: {pred_category}")
    print("Document:")
    # Print first 300 characters for readability
    print(df_test.data[doc_idx][:300] + "...")
    print("-" * 50)


Step 1: TF-IDF Vectorization
Number of classes: 4
Unique classes in training data: 4

Step 2: Training Classifier
Shape of probability predictions: (1488, 4)

Model Accuracy: 0.9254

Detailed Classification Report:
                    precision    recall  f1-score   support

     comp.graphics       0.94      0.89      0.92       389
   sci.electronics       0.91      0.93      0.92       393
           sci.med       0.91      0.95      0.93       396
talk.politics.misc       0.95      0.93      0.94       310

          accuracy                           0.93      1488
         macro avg       0.93      0.93      0.93      1488
      weighted avg       0.93      0.93      0.93      1488


Step 3: Example Documents from Test Set:
--------------------------------------------------

Category: comp.graphics
Predicted Category: comp.graphics
Document:
From: lee@hobbes.cs.umass.edu (Peter Lee)
Subject: Re: QuickTime performance (was Re: Rumours about 3DO ???)
	<1993Apr16.212441.34125@rchland.ibm.com>
	

In [21]:
# With the lemmatizing preprocessor, accuracy increased from 0.9147 to 0.9254 (about 1 percentage point), which is OK
"""
The performance metrics for each class are quite balanced
(class names follow df_test.target_names, the label order of df_test.target):
comp.graphics: precision=0.94, recall=0.89
sci.electronics: precision=0.91, recall=0.93
sci.med: precision=0.91, recall=0.95
talk.politics.misc: precision=0.95, recall=0.93

The last document (talk.politics.misc) was misclassified as sci.electronics.
This is an interesting case to examine, since it is one of the roughly 7.5% of cases where our model made a mistake.
"""
# 1. Let's see the actual probabilities for this misclassified document
# Find the index of our misclassified document
misclassified_idx = next(i for i, (true, pred) in enumerate(zip(df_test.target, y_pred)) 
                        if true == 3 and pred != 3)  # 3 is the index for talk.politics.misc

# Get probabilities for this document (columns follow df_test.target_names)
doc_probs = probabilities[misclassified_idx]
print("\nProbabilities for misclassified document:")
for category, prob in zip(df_test.target_names, doc_probs):
    print(f"{category}: {prob:.4f}")

# 2. Let's look at the most important features (words) for this document
# Get the feature names
feature_names = tfidf_vectorizer.get_feature_names_out()

# Get the TF-IDF scores for this document
doc_vector = X_test_tfidf[misclassified_idx]

# Create a list of (word, tfidf_score) pairs
word_scores = [(feature_names[i], doc_vector[0, i]) 
               for i in doc_vector.nonzero()[1]]

# Sort by TF-IDF score
word_scores.sort(key=lambda x: x[1], reverse=True)

print("\nTop 10 most important words in this document:")
for word, score in word_scores[:10]:
    print(f"{word}: {score:.4f}")

# 3. Print the full content of this document
print("\nFull document content:")
print(df_test.data[misclassified_idx])


Probabilities for misclassified document:
comp.graphics: 0.0674
sci.electronics: 0.3968
sci.med: 0.3505
talk.politics.misc: 0.1853

Top 10 most important words in this document:
lazlo: 0.4226
idea: 0.2585
sense: 0.2232
russottoengumdedu: 0.2113
russotto: 0.2113
spill: 0.2006
ignite: 0.1930
carinaunmedu: 0.1872
tank: 0.1824
hay: 0.1783

Full document content:
From: lazlo@carina.unm.edu (Lazlo Nibble)
Subject: Re: WACO burning
Organization: Vroom Socko International Fear Club
Lines: 11
Distribution: world
NNTP-Posting-Host: carina.unm.edu

russotto@eng.umd.edu (Matthew T. Russotto) writes:

> The idea that kerosene lamps would be all over the place (with
> electricity cut off) makes sense.  The idea that ramming tanks into the
> building would spill them and cause a fire makes sense.

As does the idea that a CS gas canister can get hot enough to ignite dry
baled hay.

--
Lazlo (lazlo@triton.unm.edu)



In [None]:
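""" The model was probably confused by names and email addresses: several of the top TF-IDF
terms above (lazlo, russotto, russottoengumdedu, carinaunmedu) are names or fused email addresses.
Here we could create separate features for domains and organizations,
or use a more sophisticated model that can weight these features appropriately,
or add domain-based feature engineering
...but this is not required for this assignment, so we keep the 93% accuracy as it is :)
"""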
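
One simple variant of the ideas listed above is to drop the header block and email addresses before vectorizing, instead of turning them into features. The function below is a hypothetical helper (its name and the regular expression are assumptions, and it was not evaluated on the assignment data); the sample text is shortened from the misclassified post shown earlier.

```python
import re

def strip_headers_and_emails(text):
    """Drop the header block (everything before the first blank line) and email addresses."""
    # Newsgroup posts separate the headers from the body with a blank line
    _, sep, body = text.partition("\n\n")
    if not sep:  # no blank line found: treat the whole text as body
        body = text
    # Replace email addresses with a space so neighbouring words are not fused
    return re.sub(r"\S+@\S+", " ", body)

sample = (
    "From: lazlo@carina.unm.edu (Lazlo Nibble)\n"
    "Subject: Re: WACO burning\n"
    "\n"
    "russotto@eng.umd.edu (Matthew T. Russotto) writes:\n"
    "> The idea that kerosene lamps would be all over the place makes sense.\n"
)
cleaned = strip_headers_and_emails(sample)
print(cleaned)
```

Such a function could be applied to each document before `preprocess_text`. scikit-learn's loader also offers a built-in option, `fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))`, which strips this metadata at load time; either choice would change the accuracy figures reported above.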