# Task
Build an AI model to classify SMS messages as spam or legitimate using the dataset at "/content/spam.csv". Use techniques like TF-IDF or word embeddings with classifiers like Naive Bayes, Logistic Regression, or Support Vector Machines.

## Load the dataset

### Subtask:
Load the dataset from "/content/spam.csv" into a pandas DataFrame.


**Reasoning**:
Import pandas and load the data into a DataFrame and display the head.



In [1]:
import pandas as pd

df = pd.read_csv("/content/spam.csv", encoding='latin-1')
display(df.head())

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | | | |
| 1 | ham | Ok lar... Joking wif u oni... | | | |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | | | |
| 3 | ham | U dun say so early hor... U c already then say... | | | |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | | | |
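
The preview suggests that the three `Unnamed` columns are mostly empty. One way to check this before dropping them is to measure the fraction of missing values per column. The sketch below is only an illustration: it uses a small hypothetical in-memory CSV (`sample_csv`, not the real file) so that it runs on its own. On the real data, the same `isna().mean()` call would be applied to `df`.

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics the layout of spam.csv.
sample_csv = io.StringIO(
    "v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4\n"
    "ham,Ok lar,,,\n"
    "spam,Free entry,,,\n"
    "ham,See you later,extra text,,\n"
)
sample_df = pd.read_csv(sample_csv)

# Fraction of missing values in each column (0.0 = none, 1.0 = all missing).
missing_fraction = sample_df.isna().mean()
print(missing_fraction)

# Class balance, which is useful to know before choosing evaluation metrics.
label_counts = sample_df["v1"].value_counts()
print(label_counts)
```

A column whose missing fraction is close to 1.0 carries almost no information and is a reasonable candidate for dropping.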


## Preprocess the data

### Subtask:
Clean the text data by removing punctuation, converting text to lowercase, and removing stop words.


**Reasoning**:
I need to perform the initial data cleaning steps as outlined in the instructions: dropping unnecessary columns, renaming relevant columns, and converting the target variable to a numerical format.



In [2]:
df = df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1)
df = df.rename(columns={'v1': 'label', 'v2': 'message'})
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
display(df.head())

| | label | message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


**Reasoning**:
Now that the initial column cleaning is done, I need to define and apply a function to clean the text data in the 'message' column by removing punctuation, converting to lowercase, and removing stop words.



In [3]:
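import string
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
# 'punkt' is downloaded here but not used below: clean_text splits on whitespace.
nltk.download('punkt')

# Build the stop-word set and the punctuation table once, not once per word.
stop_words = set(stopwords.words('english'))
punctuation_table = str.maketrans('', '', string.punctuation)

def clean_text(text):
    """Lowercase the text, strip punctuation, and drop English stop words."""
    text = text.lower().translate(punctuation_table)
    words = [word for word in text.split() if word not in stop_words]
    return ' '.join(words)

df['cleaned_message'] = df['message'].apply(clean_text)
display(df[['message', 'cleaned_message']].head())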

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


| | message | cleaned_message |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazy available bugis n great ... |
| 1 | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry 2 wkly comp win fa cup final tkts 2... |
| 3 | U dun say so early hor... U c already then say... | u dun say early hor u c already say |
| 4 | Nah I don't think he goes to usf, he lives aro... | nah dont think goes usf lives around though |


## Feature engineering

### Subtask:
Transform the text data into numerical features using techniques like TF-IDF or word embeddings.


**Reasoning**:
Transform the cleaned text data into numerical features using TF-IDF.



In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['cleaned_message'])
display(X.shape)

(5572, 9376)

## Split the data

### Subtask:
Split the data into training and testing sets.


**Reasoning**:
Split the feature matrix X and the target variable df['label'] into training and testing sets.



In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, df['label'], test_size=0.2, random_state=42)
display(X_train.shape)
display(X_test.shape)
display(y_train.shape)
display(y_test.shape)

(4457, 9376)

(1115, 9376)

(4457,)

(1115,)
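
Note that the vectorizer above was fitted on all 5572 messages before the split, so its vocabulary and IDF weights are influenced by the test messages. A common way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a `Pipeline`, so that TF-IDF is fitted on the training portion only. The sketch below shows this under stated assumptions: the toy corpus and the names `texts`, `labels`, and `spam_pipeline` are hypothetical, and the metrics reported in this notebook were not produced with this setup. Passing `stratify` keeps the spam ratio similar in both parts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; in the notebook this would be
# df['cleaned_message'] and df['label'].
texts = [
    "free entry win prize", "win cash now", "free prize claim now",
    "claim free cash", "ok see you later", "are you coming home",
    "call me later", "see you at home", "lunch at noon ok", "coming home now",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the class ratio in both parts.
text_train, text_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only.
spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
spam_pipeline.fit(text_train, y_tr)

predictions = spam_pipeline.predict(text_test)
print(len(text_train), len(text_test), predictions)
```

Because the vectorizer lives inside the pipeline, calling `predict` on new text applies the training-time vocabulary automatically.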

## Train the model

### Subtask:
Train different classification models like Naive Bayes, Logistic Regression, or Support Vector Machines on the training data.


**Reasoning**:
Import the necessary classification models and train them on the training data.



In [6]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Instantiate models
nb_model = MultinomialNB()
lr_model = LogisticRegression()
svm_model = SVC()

# Train models
nb_model.fit(X_train, y_train)
lr_model.fit(X_train, y_train)
svm_model.fit(X_train, y_train)

print("Models trained successfully.")

Models trained successfully.


## Evaluate the model

### Subtask:
Evaluate the performance of the trained models on the testing data using appropriate metrics like accuracy, precision, recall, and F1-score.


**Reasoning**:
Evaluate the performance of each trained model on the testing data using accuracy, precision, recall, and F1-score.



In [7]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Evaluate Naive Bayes model
nb_pred = nb_model.predict(X_test)
nb_accuracy = accuracy_score(y_test, nb_pred)
nb_precision = precision_score(y_test, nb_pred)
nb_recall = recall_score(y_test, nb_pred)
nb_f1 = f1_score(y_test, nb_pred)

print("Naive Bayes Model Performance:")
print(f"  Accuracy: {nb_accuracy:.4f}")
print(f"  Precision: {nb_precision:.4f}")
print(f"  Recall: {nb_recall:.4f}")
print(f"  F1-score: {nb_f1:.4f}")
print("-" * 30)

# Evaluate Logistic Regression model
lr_pred = lr_model.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_pred)
lr_precision = precision_score(y_test, lr_pred)
lr_recall = recall_score(y_test, lr_pred)
lr_f1 = f1_score(y_test, lr_pred)

print("Logistic Regression Model Performance:")
print(f"  Accuracy: {lr_accuracy:.4f}")
print(f"  Precision: {lr_precision:.4f}")
print(f"  Recall: {lr_recall:.4f}")
print(f"  F1-score: {lr_f1:.4f}")
print("-" * 30)

# Evaluate Support Vector Machine model
svm_pred = svm_model.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_pred)
svm_precision = precision_score(y_test, svm_pred)
svm_recall = recall_score(y_test, svm_pred)
svm_f1 = f1_score(y_test, svm_pred)

print("Support Vector Machine Model Performance:")
print(f"  Accuracy: {svm_accuracy:.4f}")
print(f"  Precision: {svm_precision:.4f}")
print(f"  Recall: {svm_recall:.4f}")
print(f"  F1-score: {svm_f1:.4f}")

Naive Bayes Model Performance:
  Accuracy: 0.9659
  Precision: 1.0000
  Recall: 0.7467
  F1-score: 0.8550
------------------------------
Logistic Regression Model Performance:
  Accuracy: 0.9426
  Precision: 0.9574
  Recall: 0.6000
  F1-score: 0.7377
------------------------------
Support Vector Machine Model Performance:
  Accuracy: 0.9677
  Precision: 0.9831
  Recall: 0.7733
  F1-score: 0.8657
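
As a quick consistency check, the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R), so it can be recomputed from the figures printed above. The sketch below uses the rounded values from the output; the names `reported` and `f1_from` are introduced here for illustration only.

```python
# Precision and recall as reported in the output above (rounded to 4 places).
reported = {
    "Naive Bayes": (1.0000, 0.7467),
    "Logistic Regression": (0.9574, 0.6000),
    "Support Vector Machine": (0.9831, 0.7733),
}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

recomputed_f1 = {name: f1_from(p, r) for name, (p, r) in reported.items()}
for name, value in recomputed_f1.items():
    print(f"{name}: F1 = {value:.4f}")
```

Because the inputs are already rounded, the recomputed values can differ from scikit-learn's in the last decimal place.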


## Select the best model

### Subtask:
Choose the best performing model based on the evaluation metrics.


**Reasoning**:
Compare the evaluation metrics and determine the best performing model based on the priorities of spam classification.



In [8]:
print("--- Model Comparison ---")
print(f"Naive Bayes: Accuracy={nb_accuracy:.4f}, Precision={nb_precision:.4f}, Recall={nb_recall:.4f}, F1-score={nb_f1:.4f}")
print(f"Logistic Regression: Accuracy={lr_accuracy:.4f}, Precision={lr_precision:.4f}, Recall={lr_recall:.4f}, F1-score={lr_f1:.4f}")
print(f"Support Vector Machine: Accuracy={svm_accuracy:.4f}, Precision={svm_precision:.4f}, Recall={svm_recall:.4f}, F1-score={svm_f1:.4f}")

print("\n--- Best Model Selection ---")
# In spam classification, false positives (a legitimate message flagged as spam)
# are costly because users may miss important messages, so precision matters.
# Recall matters too, since every missed spam message reaches the inbox.
# F1-score balances the two, so it is used here as the selection criterion.

f1_scores = {
    "Naive Bayes": nb_f1,
    "Logistic Regression": lr_f1,
    "Support Vector Machine": svm_f1,
}
best_model = max(f1_scores, key=f1_scores.get)

print(f"Considering the importance of minimizing false positives (high precision) and identifying spam (high recall), and looking at the overall F1-score, the best performing model is the {best_model}.")


--- Model Comparison ---
Naive Bayes: Accuracy=0.9659, Precision=1.0000, Recall=0.7467, F1-score=0.8550
Logistic Regression: Accuracy=0.9426, Precision=0.9574, Recall=0.6000, F1-score=0.7377
Support Vector Machine: Accuracy=0.9677, Precision=0.9831, Recall=0.7733, F1-score=0.8657

--- Best Model Selection ---
Considering the importance of minimizing false positives (high precision) and identifying spam (high recall), and looking at the overall F1-score, the best performing model is the Support Vector Machine.


## Summary:

### Data Analysis Key Findings

*   The dataset contains 5572 SMS messages in 5 columns. Three of them ('Unnamed: 2', 'Unnamed: 3', and 'Unnamed: 4') were empty in the previewed rows and were dropped.
*   The labels ('ham' and 'spam') were successfully converted to numerical representations (0 and 1).
*   The text data was cleaned by removing punctuation, converting to lowercase, and removing English stop words, resulting in a 'cleaned\_message' column.
*   The cleaned text data was transformed into numerical features using TF-IDF, resulting in a matrix with 5572 samples and 9376 features.
*   The data was split into training (80%, 4457 samples) and testing (20%, 1115 samples) sets.
*   Three models (Multinomial Naive Bayes, Logistic Regression, and Support Vector Machine) were trained on the training data.
*   Model evaluation on the test set showed the following performance:
    *   **Naive Bayes:** Accuracy=0.9659, Precision=1.0000, Recall=0.7467, F1-score=0.8550
    *   **Logistic Regression:** Accuracy=0.9426, Precision=0.9574, Recall=0.6000, F1-score=0.7377
    *   **Support Vector Machine:** Accuracy=0.9677, Precision=0.9831, Recall=0.7733, F1-score=0.8657
*   The Support Vector Machine model achieved the highest accuracy (0.9677), recall (0.7733), and F1-score (0.8657). Its precision (0.9831) was second only to Naive Bayes (1.0000).

### Insights or Next Steps

*   On these metrics, the Support Vector Machine is the strongest of the three models for this SMS spam classification task, although its recall of 0.7733 means that roughly 23% of the spam messages in the test set were still missed.
*   Further optimization of the chosen SVM model could be explored through hyperparameter tuning or experimenting with different text vectorization techniques (e.g., word embeddings) to potentially improve performance further.
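
The hyperparameter tuning mentioned above could be sketched with `GridSearchCV`, as shown below. This is a minimal illustration under stated assumptions, not a tuned configuration: the toy corpus (`tune_texts`, `tune_labels`), the pipeline name `svm_pipeline`, and the parameter grid are all hypothetical, and sensible ranges depend on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for the SMS messages and labels.
tune_texts = [
    "free entry win prize", "win cash now", "free prize claim now",
    "claim free cash", "win free entry", "cash prize now",
    "ok see you later", "are you coming home", "call me later",
    "see you at home", "lunch at noon ok", "coming home now",
]
tune_labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorizer and classifier are tuned together, so TF-IDF is refitted
# on the training part of every cross-validation fold.
svm_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC()),
])

# Illustrative grid only; "svm__" routes each parameter to the SVC step.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}

search = GridSearchCV(svm_pipeline, param_grid, scoring="f1", cv=3)
search.fit(tune_texts, tune_labels)
print(search.best_params_, round(search.best_score_, 4))
```

Scoring on F1 here mirrors the selection criterion used in the model comparison; a different metric could be substituted if precision or recall should be prioritized.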
