# Week 2 Assignment: Text Vectorization

Overview:

This assignment involves applying preprocessing techniques and classical vectorization methods to the SMS Spam Dataset. Students will build a preprocessing pipeline, vectorize the text using various methods, and evaluate the impact on a classification task (spam vs. ham).
In addition to vectorization, students must analyze the sparsity and feature-space dimensions of their models to evaluate computational efficiency.

Submission Requirements   
	•	Code: Submit a single Jupyter Notebook containing all the Python code.    
	•	Analysis: Include written responses to all analytical prompts in markdown cells within the notebook.    
	•	Visualization: Include all required plots and charts in the notebook.   
	•	Filename: Name the file as
			Week2_Textvectorization_<YourName>.pdf



```python
import pandas as pd

# URL for the SMS Spam Dataset
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/sms.tsv"

# Load the dataset (tab-separated, no header row)
sms_data = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

# Display the first few rows
print("First 5 rows of the SMS Spam Dataset:")
print(sms_data.head())

# Check dataset size
print("\nDataset Size:", sms_data.shape)

# Count how many messages fall into each category (spam vs. ham)
print("\nLabel Distribution:")
print(sms_data['label'].value_counts())
```

## Part 1: Preprocessing (20 points)


	1.	Task:   
	•	Preprocess the SMS Spam dataset:
			•	Tokenize the text.
			•	Apply stemming (Porter or Snowball) and lemmatization.
			•	Remove special characters, numbers, and URLs.
	•	Create a reusable preprocessing pipeline.
	•	Justify whether stemming or lemmatization is more appropriate for maintaining the semantic integrity of SMS-style language (e.g., 'u', 'r', 'gr8').
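
As a starting point, the cleaning and tokenization steps above can be sketched with the standard library alone. This is a minimal illustration, not a reference solution: the function name `preprocess`, the `normalizer` parameter, and the sample message are all hypothetical. The `normalizer` argument is assumed to be any callable that maps one token to its normalized form, for example NLTK's `PorterStemmer().stem` or `WordNetLemmatizer().lemmatize`.

```python
import re

# Matches http(s) links and bare "www." links (a deliberately simple pattern).
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
# Matches anything that is not a lowercase letter or whitespace.
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def preprocess(text, normalizer=None):
    """Lowercase, strip URLs, digits, and special characters, then tokenize.

    `normalizer` is an optional callable applied to each token,
    e.g. a stemmer's or lemmatizer's per-word method (assumption).
    """
    text = URL_RE.sub(" ", text.lower())
    text = NON_ALPHA_RE.sub(" ", text)
    tokens = text.split()  # whitespace tokenization
    if normalizer is not None:
        tokens = [normalizer(tok) for tok in tokens]
    return tokens


# Invented sample message, used only to show raw vs. preprocessed text.
raw = "WINNER!! Claim ur 1000 prize now at http://example.com/win u r gr8"
print("Raw:         ", raw)
print("Preprocessed:", preprocess(raw))
```

Note that removing digits turns `gr8` into `gr`, which is one of the effects on SMS-style language that the justification should address.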

	2.	Deliverable (15 points):
	•	Python code implementing the preprocessing pipeline.
	•	Example output showing raw and preprocessed text.

	3.	Question (5 points):
	•	Beyond vocabulary size, how does the choice between stemming and lemmatization affect the sparsity of the resulting vectors in Part 2?

## Part 2: Vectorization (40 points)


	1.	Task:
	•	Vectorize the preprocessed text using:
			•	Bag of Words (BoW).
			•	TF-IDF.
			•	N-Grams (unigrams, bigrams, and trigrams).
	•	Compare vector sizes and sparsity across methods.
	•	Calculate the total number of unique features generated for unigrams, bigrams, and trigrams separately.
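
One way to measure dimensions and sparsity is sketched below, assuming scikit-learn's `CountVectorizer` and `TfidfVectorizer`. The four-message corpus, the `summarize` helper, and the configuration names are invented for illustration. Sparsity is taken here to mean the fraction of zero entries in the document-term matrix, which is an assumption about the intended definition.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny invented corpus; the assignment would use the preprocessed SMS messages.
docs = [
    "free prize claim now",
    "are you free tonight",
    "claim your free prize now",
    "see you at home tonight",
]


def summarize(name, vectorizer, texts):
    """Fit a vectorizer and report its feature count and sparsity."""
    X = vectorizer.fit_transform(texts)  # SciPy sparse matrix
    n_docs, n_features = X.shape
    # Sparsity = share of cells that are zero (assumed definition).
    sparsity = 1.0 - X.nnz / (n_docs * n_features)
    return {"method": name, "n_features": n_features, "sparsity": round(sparsity, 3)}


configs = {
    "BoW (unigrams)": CountVectorizer(),
    "TF-IDF (unigrams)": TfidfVectorizer(),
    "Bigrams only": CountVectorizer(ngram_range=(2, 2)),
    "Trigrams only": CountVectorizer(ngram_range=(3, 3)),
}

summary = [summarize(name, vec, docs) for name, vec in configs.items()]
for row in summary:
    print(row)
```

Setting `ngram_range=(n, n)` counts each n-gram order separately, whereas `ngram_range=(1, 3)` would pool unigrams, bigrams, and trigrams into a single feature space.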

	2.	Deliverable (30 points):
	•	Python code for vectorization using each method.
	•	Table summarizing vector dimensions and sparsity.

	3.	Question (10 points):
	•	What are the trade-offs between Bag of Words and TF-IDF in terms of dimensionality and interpretability?
	•	Based on the dimensionality of your N-gram vectors, discuss the 'curse of dimensionality'. At what point is the benefit of context-awareness (e.g., trigrams) outweighed by the computational cost?

## Part 3: ML Classification Model (40 points)


	1.	Task:
	•	Build two Support Vector Machine (SVM) classifiers: one using **BoW vectors** and one using **TF-IDF vectors**.
	•	Train each model to classify messages as spam or ham.
	•	Evaluate each model using accuracy, precision, recall, and F1-score.
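
The training and evaluation loop might look like the sketch below. It assumes scikit-learn's `LinearSVC` as the SVM implementation (`SVC(kernel="linear")` would be another option) and reports precision, recall, and F1 for the spam class. The toy messages, their labels, and the `evaluate` helper are invented for illustration; with so few examples the scores carry no meaning, and the assignment itself would use a proper train/test split of `sms_data`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data standing in for the real train/test split.
train_texts = [
    "free prize claim now", "win cash prize today", "urgent claim your free reward",
    "are you free tonight", "see you at home", "call me when you arrive",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
test_texts = ["claim your cash prize", "see you tonight"]
test_labels = ["spam", "ham"]


def evaluate(vectorizer):
    """Train an SVM on the given vectorizer's features and score it."""
    model = make_pipeline(vectorizer, LinearSVC(random_state=0))
    model.fit(train_texts, train_labels)
    pred = model.predict(test_texts)
    # Precision/recall/F1 are computed for the spam class only.
    precision, recall, f1, _ = precision_recall_fscore_support(
        test_labels, pred, pos_label="spam", average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(test_labels, pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


results = {
    "BoW": evaluate(CountVectorizer()),
    "TF-IDF": evaluate(TfidfVectorizer()),
}
for name, metrics in results.items():
    print(name, metrics)
```

Wrapping the vectorizer and classifier in one pipeline fits the vocabulary on the training messages only, which avoids leaking test-set terms into the features. Passing `results` to `pd.DataFrame` is one way to get a side-by-side table with one column per model.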

	2.	Deliverable (30 points):
	•	Python code for training and evaluating both models.
	•	Classification report summarizing each model's performance.
	•	A side-by-side comparison table of accuracy, precision, recall, and F1-score for both models.

	3.	Question (10 points):
	•	TF-IDF is intended to prioritize rare but informative terms. Did the TF-IDF model outperform BoW for the 'Spam' class specifically? Why or why not?