Problem 1: Use the popular SMS Spam Collection dataset (available on Kaggle), which contains messages labeled as either "spam" or "ham" (not spam), stored in a Pandas DataFrame with columns Label (spam/ham) and Message (text).

In [30]:
import pandas as pd
import numpy as np
import nltk
import re
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
from gensim.models import KeyedVectors

In [31]:
# Download the tokenizer models and the stop-word list
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package punkt to C:\Users\Soham
[nltk_data]     Murudkar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Soham
[nltk_data]     Murudkar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to C:\Users\Soham
[nltk_data]     Murudkar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [32]:
df = pd.read_csv("spam.csv", encoding='latin-1')[['v1', 'v2']]
df.columns = ['Label', 'Message']
df['Label'] = df['Label'].map({'ham': 0, 'spam': 1})
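
A quick sanity check after the label mapping can be useful, since map() silently turns any label missing from the dictionary into NaN. The sketch below uses a small made-up DataFrame (the rows in toy are hypothetical, not taken from spam.csv) to show the check and the class shares:

```python
import pandas as pd

# Toy stand-in for spam.csv (hypothetical rows, not the real data)
toy = pd.DataFrame({
    "v1": ["ham", "spam", "ham", "ham"],
    "v2": ["see you at 5", "WIN a free prize now", "ok thanks", "call me later"],
})
toy.columns = ["Label", "Message"]
toy["Label"] = toy["Label"].map({"ham": 0, "spam": 1})

# map() turns any label outside the dictionary into NaN, so count those.
n_unmapped = int(toy["Label"].isna().sum())

# Fraction of messages per class; a skewed split changes how accuracy should be read.
class_share = toy["Label"].value_counts(normalize=True).to_dict()

print(n_unmapped)
print(class_share)
```

The same two checks could be run on df directly once the real file is loaded.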

(a) Preprocess each message by tokenizing, removing stop words, and lowercasing the text:

In [33]:
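def preprocess(text):
    # Lowercase first so the stop-word comparison is case-insensitive
    text = text.lower()
    # Replace URLs, @mentions, and every non-letter character with a space
    text = re.sub(r"http\S+|@\S+|[^a-zA-Z]", " ", text)
    tokens = nltk.word_tokenize(text)
    # Drop English stop words
    tokens = [word for word in tokens if word not in stop_words]
    return tokens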

df['Tokens'] = df['Message'].apply(preprocess)
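
To see what the cleaning steps in preprocess() do to a single message, here is a minimal sketch that runs without any NLTK downloads. It is not the notebook's function: TOY_STOP_WORDS is a small hypothetical stand-in for NLTK's English stop-word list, and str.split() replaces nltk.word_tokenize, which is a reasonable substitute only because the regex has already reduced the text to letters and spaces.

```python
import re

# Small hypothetical stop-word list standing in for NLTK's English list.
TOY_STOP_WORDS = {"you", "ve", "a", "the", "to", "at"}

def preprocess_sketch(text):
    """Mirror the cleaning steps of preprocess() without needing NLTK data."""
    text = text.lower()
    # Same pattern as the notebook: drop URLs, @mentions, and non-letters.
    text = re.sub(r"http\S+|@\S+|[^a-zA-Z]", " ", text)
    # Only letters and spaces remain, so whitespace splitting is enough here.
    tokens = text.split()
    return [t for t in tokens if t not in TOY_STOP_WORDS]

sample_tokens = preprocess_sketch(
    "You've WON a $1000 prize! Claim at http://example.com now"
)
print(sample_tokens)  # ['won', 'prize', 'claim', 'now']
```

Note that the amount "$1000" and the URL disappear entirely, so any signal they carry is not available to the classifier.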

In [34]:
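from nltk.stem import WordNetLemmatizer

# Alternative preprocessing with lemmatization; defined for comparison and not used below.
# Requires the WordNet corpus: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    if pd.isna(text):
        return ''
    text = text.lower()
    tokens = nltk.word_tokenize(text)
    words = [lemmatizer.lemmatize(word) for word in tokens if word.isalpha() and word not in stop_words]
    return ' '.join(words)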


In [35]:
df['Tokens'] = df['Message'].apply(preprocess)

(b) Load the pre-trained Google News Word2Vec model using gensim:

In [36]:
w2v_model = KeyedVectors.load_word2vec_format("../GoogleNews-vectors-negative300.bin.gz", binary=True)

(c) Convert each message into a fixed-length vector by averaging the Word2Vec vectors of all the words in the message (ignore words not found in the model vocabulary):

In [37]:
def vectorize(tokens, model, size=300):
    vecs = [model[word] for word in tokens if word in model]
    if len(vecs) == 0:
        return np.zeros(size)
    return np.mean(vecs, axis=0)

df['Vector'] = df['Tokens'].apply(lambda tokens: vectorize(tokens, w2v_model))
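
The averaging step can be checked by hand on a tiny example. The sketch below swaps the Word2Vec model for a plain dictionary of made-up 3-dimensional vectors (toy_vectors is hypothetical; the real model has 300 dimensions). A dict supports the same `word in model` and `model[word]` operations that vectorize() relies on.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the Word2Vec model.
toy_vectors = {
    "free": np.array([1.0, 0.0, 2.0]),
    "prize": np.array([3.0, 2.0, 0.0]),
}

def vectorize_sketch(tokens, model, size):
    """Average the vectors of in-vocabulary tokens; all zeros if none are found."""
    vecs = [model[w] for w in tokens if w in model]
    if not vecs:
        return np.zeros(size)
    return np.mean(vecs, axis=0)

# "xyzzy" is out of vocabulary, so it is skipped and does not dilute the mean.
avg = vectorize_sketch(["free", "prize", "xyzzy"], toy_vectors, size=3)
# No known words at all: fall back to the zero vector.
empty = vectorize_sketch(["xyzzy"], toy_vectors, size=3)

print(avg)    # [2. 1. 1.]
print(empty)  # [0. 0. 0.]
```

Messages that fall back to the zero vector all look identical to the classifier, so it may be worth counting how many rows end up that way.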

(d) Split the dataset into training (80%) and testing (20%) sets using train_test_split:

In [38]:
X = np.vstack(df['Vector'].values)
y = df['Label'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
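
The split above is purely random. When one class is much rarer than the other, an optional variation is to pass stratify so that both sets keep the same class proportions. This is a sketch on made-up data (X_toy and y_toy are hypothetical), not a change to the split used in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 20% of them in the positive class.
X_toy = np.arange(200, dtype=float).reshape(100, 2)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy keeps the positive share the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both shares match the overall 20% positive rate.
print(y_tr.mean(), y_te.mean())
```

Switching the notebook's own split to a stratified one would change which rows land in the test set, so the accuracy reported below would no longer be directly comparable.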

(e) Train a Logistic Regression classifier on the vectorized training data and print the accuracy on the test set:

In [39]:
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

In [40]:
y_pred = clf.predict(X_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))

Test Accuracy: 0.9443946188340807
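Accuracy alone can look high on imbalanced data, because always predicting the majority class already scores well. The sketch below uses small made-up label arrays (y_true_toy and y_pred_toy are hypothetical, not the notebook's predictions) to compare accuracy with a majority-class baseline and with precision and recall for the spam class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels: 8 ham (0) and 2 spam (1); the toy classifier misses one spam.
y_true_toy = np.array([0] * 8 + [1] * 2)
y_pred_toy = np.array([0] * 8 + [1, 0])

acc = accuracy_score(y_true_toy, y_pred_toy)
# Accuracy of always predicting the more frequent class.
spam_share = y_true_toy.mean()
baseline = max(spam_share, 1 - spam_share)
# Precision and recall for the positive (spam) class.
prec = precision_score(y_true_toy, y_pred_toy)
rec = recall_score(y_true_toy, y_pred_toy)

print(acc, baseline, prec, rec)
```

Here the accuracy is 0.9 against a baseline of 0.8, while recall shows that half of the spam was missed. Running the same metrics on y_test and y_pred would put the reported test accuracy in context.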


(f) Write a Python function predict_message_class(model, w2v_model, message) that takes a trained classifier, the Word2Vec model, and a single message (string), and returns the predicted class (spam or ham):

In [41]:
def predict_message_class(model, w2v_model, message):
    tokens = preprocess(message)
    vec = vectorize(tokens, w2v_model).reshape(1, -1)
    pred = model.predict(vec)[0]
    return 'spam' if pred == 1 else 'ham'

In [42]:
# Example usage
example = "Congratulations! You've won a $1000 Walmart gift card. Call now!"
print("Prediction:", predict_message_class(clf, w2v_model, example))

Prediction: spam
