# Experiment 5
Spam Detection System Using Naive Bayes Classifier and Deployment with Flask API


## Aim
To design a spam detection system using text classification techniques, train a Naive Bayes classifier on a labeled dataset, and deploy the model as a web-based API for real-time predictions.


## Objectives
1. To preprocess and analyze a labeled dataset containing messages classified as spam or ham.
2. To convert textual data into a numerical format using CountVectorizer.
3. To implement and train a Naive Bayes classifier for text classification.
4. To evaluate the model's performance using accuracy, classification reports, and confusion matrices.
5. To save the trained model and vectorizer for future use.
6. To deploy the trained model using Flask as an API for real-time spam detection.


## Course Outcomes
1. Understand the process of text preprocessing and feature extraction for machine learning.
2. Learn to implement a Naive Bayes classifier for binary text classification.
3. Gain experience in model evaluation using performance metrics.
4. Learn to save and reuse machine learning models and vectorizers using Pickle.
5. Develop the ability to deploy machine learning models using Flask for real-time applications.


## Theory

- Spam Detection: Spam detection is a binary text classification task where the goal is to classify messages into two categories: ham (non-spam) and spam. It involves preprocessing the text data, transforming it into numerical features, and applying a machine learning algorithm to make predictions.

- CountVectorizer: `CountVectorizer` is a feature extraction tool that converts text into a bag-of-words representation. It learns a vocabulary from the training messages, counts how often each vocabulary word occurs in each message, and returns the counts as a sparse matrix (one row per message, one column per word) for use in machine learning models.

- Naive Bayes Classifier: The Naive Bayes classifier is a probabilistic algorithm based on Bayes' theorem. It assumes that features are conditionally independent given the class, which rarely holds exactly for words in a message but works well in practice. Its simplicity and speed make it particularly effective for text classification tasks.

- Flask API: Flask is a lightweight web framework used to build web applications and APIs. In this code, Flask is used to create an API endpoint for real-time spam detection. The trained model and vectorizer are loaded, and predictions are made for input messages.
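To make the bag-of-words and Naive Bayes bullets above concrete, here is a minimal pure-Python sketch of what the two steps compute conceptually. It is not the scikit-learn implementation: the toy messages are invented for illustration, `tokenize`, `log_posterior`, and `classify` are hypothetical helper names, and the whitespace tokenizer is simpler than `CountVectorizer`'s default. The classifier picks the class that maximizes log P(class) plus the sum of log P(word | class), using Laplace smoothing.

```python
import math
from collections import Counter

# Toy labeled corpus (hypothetical messages, not taken from the dataset).
train = [
    ("win a free prize now", "spam"),
    ("free cash prize claim now", "spam"),
    ("are we meeting for lunch", "ham"),
    ("see you at lunch tomorrow", "ham"),
]

def tokenize(text):
    # Simplified tokenizer: lowercase and split on whitespace.
    return text.lower().split()

# Bag-of-words step: per-class word counts and the shared vocabulary.
word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(tokenize(text))
    doc_counts[label] += 1
vocabulary = sorted(set(word_counts["spam"]) | set(word_counts["ham"]))

def log_posterior(text, label, alpha=1.0):
    # log P(label) + sum of log P(word | label), with Laplace smoothing (alpha).
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for word in tokenize(text):
        if word not in vocabulary:
            continue  # words never seen in training carry no information here
        count = word_counts[label][word]
        score += math.log((count + alpha) / (total + alpha * len(vocabulary)))
    return score

def classify(text):
    # Choose the class with the higher (unnormalized) log posterior.
    return max(("spam", "ham"), key=lambda label: log_posterior(text, label))

print(classify("claim your free prize"))
print(classify("lunch tomorrow"))
```

Working in log space avoids multiplying many small probabilities together, and the smoothing term keeps a word that never appeared in one class from forcing that class's probability to zero.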


## Procedure

- Load the Dataset
    - Read the CSV dataset using `pandas.read_csv()`.
    - Select the relevant columns (`v1` and `v2`) and rename them for clarity (`label` for spam/ham and `message` for text content).
    - Map labels to binary values: `ham` to 0 and `spam` to 1.

- Preprocess the Dataset
    - Inspect the dataset using `head()`, `info()`, and `describe()` for basic exploration.
    - Check for missing values (for example with `isnull().sum()`) and handle any that are found before training.
    - Split the dataset into training and testing subsets using `train_test_split()`.

- Convert Text Data to Numerical Features
    - Initialize `CountVectorizer` to tokenize and vectorize the text messages.
    - Fit the vectorizer to the training data and transform both training and testing data.

- Train the Naive Bayes Classifier
    - Initialize the `MultinomialNB` classifier.
    - Train the classifier on the vectorized training data using the `fit()` method.

- Evaluate the Model
    - Use the trained model to predict labels for the test data.
    - Compute the accuracy score using `accuracy_score()`.
    - Generate a classification report and confusion matrix to evaluate the model's performance.

- Save the Model and Vectorizer: Save the trained model and vectorizer to files using `pickle.dump()` for future use.

- Deploy the Model Using Flask
    - Load the saved model and vectorizer using `pickle.load()`.
    - Initialize a Flask application and define an endpoint for predictions.
    - Create a function that accepts a message, preprocesses it using the vectorizer, and predicts whether it's spam or ham using the trained model.
    - Return the prediction as a JSON response.
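The inspection and splitting steps in the procedure can be sketched as follows. This is an illustrative example on a small invented DataFrame rather than the real CSV, and the `stratify` argument is an addition not used in the experiment's code: it keeps the spam/ham ratio the same in the training and testing subsets, which may help when one class is much rarer than the other.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real CSV (one message is missing).
df = pd.DataFrame({
    "label": ["ham"] * 9 + ["spam"] * 2,
    "message": [f"see you at {i} pm" for i in range(8)]
               + [None, "win a prize", "free cash now"],
})

# Basic exploration: first rows and summary statistics.
print(df.head())
print(df.describe())

# Count missing values per column, then drop rows without a message.
missing = df.isnull().sum()
print(missing)
df = df.dropna(subset=["message"]).reset_index(drop=True)

# Encode labels and check the class balance.
df["label"] = df["label"].map({"ham": 0, "spam": 1})
print(df["label"].value_counts())

# Stratified split: both subsets keep the same ham/spam proportions.
X_train, X_test, y_train, y_test = train_test_split(
    df["message"], df["label"],
    test_size=0.5, random_state=42, stratify=df["label"],
)
print(y_train.value_counts(), y_test.value_counts())
```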


## Results

- Exploratory Data Analysis
    - The dataset contains two categories of messages: spam and ham.
    - The dataset is imbalanced: ham messages are far more prevalent than spam (965 ham versus 150 spam in the 1,115-message test split).

- Model Performance
    - Accuracy: The Naive Bayes classifier achieved an accuracy of approximately 98% on the test dataset (1,097 of 1,115 messages classified correctly).
    - Classification Report:
      - Ham: precision 0.98, recall 1.00, F1-score 0.99.
      - Spam: precision 0.99, recall 0.89, F1-score 0.94. Spam recall is the weakest figure: about 11% of spam messages were missed.
    - Confusion Matrix: 18 of the 1,115 test messages were misclassified: 2 ham messages were flagged as spam and 16 spam messages were labeled as ham.

- Deployment
    - The trained model and vectorizer were successfully saved to files using Pickle.
    - The Flask API endpoint `/predict` was implemented to accept POST requests with messages and return predictions as JSON responses.
    - A sample test message sent to the API was processed in real time and classified as spam, as expected.


## Conclusions
The spam detection system using a Naive Bayes classifier performed well, reaching about 98% accuracy on the test set, although spam recall (0.89) was noticeably lower than ham recall (1.00). The preprocessing steps, including label encoding and vectorization, were crucial for achieving high accuracy. Deploying the model using Flask allowed real-time predictions, showcasing the practical applicability of machine learning in text classification.

This project highlights the importance of data preprocessing, feature extraction, and model evaluation in building effective classification systems. Future improvements could include experimenting with advanced vectorization techniques (e.g., TF-IDF) or integrating deep learning-based text classification models.
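As a rough illustration of the TF-IDF idea mentioned above, the sketch below swaps `CountVectorizer` for scikit-learn's `TfidfVectorizer` and bundles it with the classifier in a pipeline. The toy messages and labels are invented, and this variant was not evaluated in the experiment, so no accuracy claim is made for it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data (1 = spam, 0 = ham).
messages = [
    "win a free prize now",
    "free cash prize claim now",
    "are we meeting for lunch",
    "see you at lunch tomorrow",
]
labels = [1, 1, 0, 0]

# The pipeline applies TF-IDF weighting, then trains Naive Bayes on the result.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(messages, labels)

# Raw strings go in; the pipeline vectorizes them before predicting.
preds = pipeline.predict(["free prize waiting for you", "lunch at noon?"])
print(preds)
```

One design benefit of a pipeline is that the vectorizer and the model can be saved and loaded as a single object, so they cannot drift out of sync at deployment time.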

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import pickle

# Load the dataset
df = pd.read_csv('/content/sample_data/SPAM.csv')

# Preprocess the dataset
df = df[['v1', 'v2']].copy()  # Select relevant columns; copy so the later assignment does not trigger SettingWithCopyWarning
df.columns = ['label', 'message']  # Rename columns for clarity
df['label'] = df['label'].map({'ham': 0, 'spam': 1})  # Convert labels to binary

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size=0.2, random_state=42)

# Convert text data to numerical data using CountVectorizer
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Train a Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train_counts, y_train)

# Make predictions
y_pred = model.predict(X_test_counts)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('Classification Report:')
print(classification_report(y_test, y_pred))
print('Confusion Matrix:')
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.98
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       965
           1       0.99      0.89      0.94       150

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115

Confusion Matrix:
[[963   2]
 [ 16 134]]
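The figures in the classification report can be recomputed by hand from the confusion matrix above. The short sketch below does that arithmetic for the spam class, assuming scikit-learn's convention that rows are true labels and columns are predicted labels (class 0 = ham, class 1 = spam).

```python
# Confusion matrix from the test run: rows are true labels, columns are predictions.
tn, fp = 963, 2    # true ham:  predicted ham, predicted spam
fn, tp = 16, 134   # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Precision: of the messages flagged as spam, how many really were spam.
spam_precision = tp / (tp + fp)
# Recall: of the real spam messages, how many were caught.
spam_recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"total={total}")
print(f"accuracy={accuracy:.2f}")
print(f"spam precision={spam_precision:.2f}")
print(f"spam recall={spam_recall:.2f}")
print(f"spam f1={spam_f1:.2f}")
```

The 16 missed spam messages account for the lower spam recall, while the 2 false alarms barely affect spam precision.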


In [None]:
# Save the model to a file
with open('/content/sample_data/spam_model.pkl', 'wb') as model_file:
    pickle.dump(model, model_file)

# Save the vectorizer to a file
with open('/content/sample_data/vectorizer.pkl', 'wb') as vectorizer_file:
    pickle.dump(vectorizer, vectorizer_file)

print("Model and vectorizer have been exported.")

In [None]:
# Load the model from the file
with open('/content/sample_data/spam_model.pkl', 'rb') as model_file:
    loaded_model = pickle.load(model_file)

# Load the vectorizer from the file
with open('/content/sample_data/vectorizer.pkl', 'rb') as vectorizer_file:
    loaded_vectorizer = pickle.load(vectorizer_file)

print("Model and vectorizer have been loaded.")

In [None]:
def predict_spam(message):
    # Transform the input message using the loaded vectorizer
    message_vector = loaded_vectorizer.transform([message])

    # Make a prediction using the loaded model
    prediction = loaded_model.predict(message_vector)

    # Map the prediction back to 'ham' or 'spam'
    return 'spam' if prediction[0] == 1 else 'ham'

# Example usage
new_message = "Congratulations! You've won a $1,000 Walmart gift card. Click here to claim your prize."
result = predict_spam(new_message)
print(f"The message is: {result}")


In [None]:

import pickle  # Needed when this cell is run as a standalone script

from flask import Flask, request, jsonify

# Load the model and vectorizer
with open('/content/sample_data/spam_model.pkl', 'rb') as model_file:
    loaded_model = pickle.load(model_file)

with open('/content/sample_data/vectorizer.pkl', 'rb') as vectorizer_file:
    loaded_vectorizer = pickle.load(vectorizer_file)

# Initialize the Flask application
app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict_spam():
    # Get the JSON data from the request (None if the body is not valid JSON)
    data = request.get_json(silent=True)

    # Check that a JSON object containing 'message' was provided
    if not isinstance(data, dict) or 'message' not in data:
        return jsonify({'error': 'No message provided'}), 400

    message = data['message']

    # Transform the input message using the loaded vectorizer
    message_vector = loaded_vectorizer.transform([message])

    # Make a prediction using the loaded model
    prediction = loaded_model.predict(message_vector)

    # Map the prediction back to 'ham' or 'spam'
    result = 'spam' if prediction[0] == 1 else 'ham'

    # Return the result as a JSON response
    return jsonify({'message': message, 'prediction': result}), 200

if __name__ == '__main__':
    app.run(debug=True)


```bash
curl -X POST http://127.0.0.1:5000/predict -H "Content-Type: application/json" -d '{"message": "Congratulations! You have won a $1,000 Walmart gift card!"}'
```

```json
{
    "message": "Congratulations! You have won a $1,000 Walmart gift card!",
    "prediction": "spam"
}
```
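The same request can also be sent from Python. The sketch below uses only the standard library and assumes the Flask server from this experiment is running at `http://127.0.0.1:5000`; `build_request` and `classify` are hypothetical helper names introduced here for illustration.

```python
import json
import urllib.request

# Assumed address: the server from this experiment on Flask's default host and port.
API_URL = "http://127.0.0.1:5000/predict"

def build_request(message, url=API_URL):
    # Encode the message as a JSON body and prepare a POST request.
    body = json.dumps({"message": message}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def classify(message, url=API_URL):
    # Send the request and decode the JSON response into a dict.
    with urllib.request.urlopen(build_request(message, url), timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))

# Example (requires the server to be running):
# print(classify("Congratulations! You have won a $1,000 Walmart gift card!"))
```

Separating request construction from sending makes the payload easy to inspect or test without a live server.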