<a href="https://colab.research.google.com/github/zrghassabi/DataScienceProject/blob/main/SpamDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Text Classification with Naive Bayes

In this notebook, we perform text classification using a Naive Bayes model.
We will train the model, evaluate its performance, and use it to make predictions on test data.
The goal is to classify messages as either spam (1) or not spam (0).

In [None]:
#1. Importing Libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

Explanation: We start by importing the necessary libraries:

pandas for data manipulation.
CountVectorizer for converting text data into numerical features.
MultinomialNB for our Naive Bayes classifier.
f1_score to evaluate model performance.


In [None]:
#2. Load and Split the Data
import pandas as pd
from sklearn.model_selection import train_test_split

# Path to the extracted file
file_path = '/content/sample_data/SMSSpamCollection'

# Load the dataset
data = pd.read_csv(file_path, sep='\t', header=None, names=['label', 'text'], encoding='latin1')

# Check the first few rows to ensure correct loading
print(data.head())

# Encode labels
data['label'] = data['label'].map({'ham': 0, 'spam': 1})

# Split data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(data['text'], data['label'], test_size=0.2, random_state=42)

# Check the dimensions of the splits
print(f"Training set size: {len(X_train)}")
print(f"Validation set size: {len(X_val)}")


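Explanation: We load the SMS Spam Collection file (SMSSpamCollection), which is tab-separated and has no header row, so the column names are supplied in the read_csv call. The data consists of two columns:

label: 'ham' (not spam) or 'spam' in the raw file, encoded as 0 and 1 respectively.
text: The message text (input feature).

We then hold out 20% of the rows as a validation set, with random_state=42 so the split is reproducible.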
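The loading and label-encoding step can be tried in isolation on a small in-memory sample. This is only a sketch: the two messages and the names sample_tsv and sample are made up for illustration and are not rows from the real dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same tab-separated, headerless layout
sample_tsv = "ham\tSee you at lunch\nspam\tWin a free prize now\n"

# Read it the same way as the real file: tab separator, no header, explicit column names
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", header=None, names=["label", "text"])

# Encode the string labels as integers; any label missing from the mapping would become NaN
sample["label"] = sample["label"].map({"ham": 0, "spam": 1})

# Check that every label was mapped before training on it
print(sample["label"].notna().all())
print(sample)
```

Checking for unmapped labels right after the encoding step is a cheap way to catch a typo in the mapping or an unexpected label value.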

In [None]:
#3. Text Vectorization
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)

Explanation: We convert the text data into numerical features using CountVectorizer. This step transforms the text into a matrix of token counts, which can be used for training the model.

In [None]:
#4. Initialize and Train the Naive Bayes Model
nb_model = MultinomialNB()
nb_model.fit(X_train_vec, y_train)

Explanation: We initialize a MultinomialNB model and train it on the vectorized training data (X_train_vec) and labels (y_train).
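As an optional alternative to keeping the vectorizer and the model as separate objects, the two steps can be chained in a scikit-learn pipeline so that raw text goes in and labels come out. The sketch below uses four made-up messages (toy_texts, toy_labels, and toy_pipeline are illustrative names); it is not how this notebook trains its model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set: 1 = spam, 0 = not spam
toy_texts = ["free prize win now", "win cash free", "see you at lunch", "call me later"]
toy_labels = [1, 1, 0, 0]

# The pipeline vectorizes the text and then fits the classifier on the counts
toy_pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_texts, toy_labels)

# predict() accepts raw strings because the fitted vectorizer runs first
toy_preds = toy_pipeline.predict(["free cash prize", "see you later"])
print(toy_preds)
```

Bundling the steps this way makes it harder to accidentally refit the vectorizer on validation or test data.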

In [None]:
#5. Evaluate Model Performance on Validation Data
X_val_vec = vectorizer.transform(X_val)
y_pred_val = nb_model.predict(X_val_vec)
f1 = f1_score(y_val, y_pred_val)
print(f"Validation F1 Score: {f1:.2f}")

In [None]:
#6. Predict on Test Data
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load the test data
# NOTE: this path currently points at the same file used for training; update it to your test file
test_file_path = '/content/sample_data/SMSSpamCollection'
test_data = pd.read_csv(test_file_path, sep='\t', header=None, names=['label', 'text'], encoding='latin1')

# Prepare test data
X_test = test_data['text']

# Transform the test data using the same vectorizer
X_test_vec = vectorizer.transform(X_test)

# Make predictions using the trained model
test_preds = nb_model.predict(X_test_vec)

# Create a DataFrame for the submission
submission = pd.DataFrame({'label': test_preds})

# Save the predictions to a CSV file
submission.to_csv('submission.csv', index=False)

print("Submission file created: submission.csv")


In [None]:
test_preds

Explanation (step 5): We predict the labels on the validation set and calculate the F1 score. The F1 score is the harmonic mean of precision and recall, so it is high only when both are high. By default, f1_score treats label 1 (spam) as the positive class.
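A small worked example may help relate the F1 score to precision and recall. The labels below are made up for illustration (y_true_toy and y_pred_toy are hypothetical names); the sketch compares the formula 2 * precision * recall / (precision + recall) with scikit-learn's f1_score.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 3 actual spam messages (1) and 5 actual non-spam messages (0)
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
# Predictions: 2 true positives, 1 false negative, 2 false positives
y_pred_toy = [1, 1, 0, 1, 1, 0, 0, 0]

precision = precision_score(y_true_toy, y_pred_toy)  # TP / (TP + FP) = 2 / 4
recall = recall_score(y_true_toy, y_pred_toy)        # TP / (TP + FN) = 2 / 3

# Harmonic mean of precision and recall
manual_f1 = 2 * precision * recall / (precision + recall)
library_f1 = f1_score(y_true_toy, y_pred_toy)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"manual F1={manual_f1:.3f} library F1={library_f1:.3f}")
```

Here both computations give 4/7, roughly 0.571, which sits below the plain average of precision and recall because the harmonic mean penalizes the lower of the two.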

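Explanation: We load the test data from test_file_path and transform the text using the same vectorizer that was fitted on the training data, calling transform rather than fit_transform. This ensures consistency in feature representation. Note that test_file_path here points at the same SMSSpamCollection file used for training, so these predictions do not measure performance on unseen data; replace the path with a separate test file for that.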
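The effect of reusing a fitted vectorizer can be seen on a toy example. In this sketch (train_texts, new_texts, and demo_vectorizer are illustrative names, and the messages are made up), a word that was never seen during fitting is simply ignored, and the output keeps the same number of columns as the training vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training messages and one new message
train_texts = ["free prize now", "see you soon"]
new_texts = ["free lottery now"]  # "lottery" does not appear in train_texts

demo_vectorizer = CountVectorizer()
demo_vectorizer.fit(train_texts)  # learns a 6-token vocabulary

# transform() reuses the learned vocabulary; it does not add new tokens
new_vec = demo_vectorizer.transform(new_texts)

print(sorted(demo_vectorizer.vocabulary_))
print(new_vec.toarray())
```

Calling fit_transform on the new data instead would build a different vocabulary, and the resulting columns would no longer line up with the features the model was trained on.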

Explanation: We create a DataFrame with the predicted labels and save it to submission.csv. This file will be used for evaluation or submission.


In [None]:
#7. Compare Predictions with Actual Labels
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Load and prepare test data
# NOTE: this path currently points at the same file used for training; update it to your test file
test_file_path = '/content/sample_data/SMSSpamCollection'
test_data = pd.read_csv(test_file_path, sep='\t', header=None, names=['label', 'text'], encoding='latin1')

# Extract the text column for prediction
X_test = test_data['text']
y_test_actual = test_data['label']  # Actual target labels for reference

# Transform the test data using the same vectorizer
X_test_vec = vectorizer.transform(X_test)

# Make predictions using the trained model
test_preds = nb_model.predict(X_test_vec)

# Create a DataFrame comparing actual and predicted labels
submission = pd.DataFrame({
    'text': X_test,
    'actual_label': y_test_actual,  # Already 'ham'/'spam' strings in the raw file
    'predicted_label': test_preds
})

# Map numerical predictions back to 'ham' and 'spam' for easier interpretation
submission['predicted_label'] = submission['predicted_label'].map({0: 'ham', 1: 'spam'})

# Save the predictions to a CSV file
submission.to_csv('submission.csv', index=False)

print("Submission file created: submission.csv")


In [None]:
test_preds