# Spam Detection Project

This notebook demonstrates a step-by-step approach to implementing text classification for spam detection, leveraging Term Frequency-Inverse Document Frequency (TF-IDF) and classical machine learning models. The project aligns with the requirements of our university course, emphasizing:
- Using classical frameworks like scikit-learn for basic spam detection.
- Exploring datasets such as the SpamAssassin public corpus.
- Comparing classifiers like Decision Trees, Naive Bayes, and Neural Networks.

This enhanced version includes additional features such as:
- Comprehensive data exploration and preprocessing.
- Robust evaluation with metrics like precision, recall, F1-score, and accuracy.
- Insights into modern approaches for text classification using LLMs for context.


# Text Classification using Decision Trees in Python

Text classification assigns text documents to predefined categories. This section explores how decision trees can be used to classify textual data.

## Text Classification and Decision Trees
Text classification involves assigning predefined categories or labels to text documents based on their content. Decision trees are hierarchical tree structures that recursively partition the feature space based on the values of input features. They are particularly well-suited for classification tasks due to their simplicity, interpretability, and ability to handle non-linear relationships.

Decision Trees provide a clear and understandable model for text classification, making them an excellent choice for tasks where interpretability is as important as predictive power. Their inherent simplicity, however, might lead to challenges when dealing with very complex or nuanced text data, leading practitioners to explore more sophisticated or ensemble methods for improvement.

## Implementation: Text Classification using Decision Trees
For text classification using Decision Trees in Python, we'll use the popular 20 Newsgroups dataset. This dataset comprises around 20,000 newsgroup documents, partitioned across 20 different newsgroups. We'll use scikit-learn to fetch the dataset, preprocess the text, convert it into a feature vector using TF-IDF vectorization, and then apply a Decision Tree classifier for classification.

# Load the Dataset

We will work with the SpamAssassin public corpus, which contains a collection of labeled emails for spam classification. The dataset is well-suited for demonstrating text classification tasks. The focus will be on preprocessing the dataset to extract meaningful features for classification.

In [None]:
# Load SpamAssassin dataset
import os
import pandas as pd

# Define paths to spam and ham emails (adjust to your dataset directory)
spam_path = 'path_to_spam_folder'
ham_path = 'path_to_ham_folder'

# Helper function to load emails
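def load_emails(path, label):
    emails = []
    for filename in sorted(os.listdir(path)):
        file_path = os.path.join(path, filename)
        if not os.path.isfile(file_path):
            continue  # Skip subdirectories and other non-file entries
        # latin-1 can decode any byte sequence, so malformed emails do not raise errors
        with open(file_path, 'r', encoding='latin-1') as file:
            emails.append({'text': file.read(), 'label': label})
    return emails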

# Load spam and ham emails
spam_emails = load_emails(spam_path, 'spam')
ham_emails = load_emails(ham_path, 'ham')

# Create a DataFrame
emails_df = pd.DataFrame(spam_emails + ham_emails)
emails_df.head()

## Data Exploration
Let's explore the dataset to understand the distribution of classes and the content of emails.

In [None]:
# Check class distribution
emails_df['label'].value_counts().plot(kind='bar', title='Class Distribution');
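
Beyond class counts, it can help to look at the content itself, for example how long the messages in each class are. The sketch below uses a small invented DataFrame named `toy_df` (a hypothetical stand-in with the same `text` and `label` columns as `emails_df`) so that it runs on its own.

```python
import pandas as pd

# Invented stand-in for emails_df, with the same 'text' and 'label' columns
toy_df = pd.DataFrame({
    "text": [
        "Win cash now!!!",
        "Meeting at 10am tomorrow",
        "Free prize, click here",
        "See you at lunch",
    ],
    "label": ["spam", "ham", "spam", "ham"],
})

# Simple length features: characters and whitespace-separated words per message
toy_df["n_chars"] = toy_df["text"].str.len()
toy_df["n_words"] = toy_df["text"].str.split().str.len()

# Average length per class
summary = toy_df.groupby("label")[["n_chars", "n_words"]].mean()
print(summary)
```

The same two feature lines could be applied to `emails_df` once the real corpus is loaded.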

## Text Preprocessing
We will preprocess the text data by:
- Removing stop words.
- Lowercasing text.
- Applying TF-IDF vectorization.


# Load the 20 Newsgroups Dataset

Only four categories of the 20 Newsgroups dataset are loaded to keep the problem small. Headers, footers, and quotes are removed so that the model learns from the body text rather than from metadata.

In [None]:
# Import the necessary function from sklearn.datasets
from sklearn.datasets import fetch_20newsgroups

# Load the dataset
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))

# Exploratory Data Analysis

The following cells provide basic exploratory data analysis by visualizing the distribution of classes in the training and test sets.

In [None]:
# Import necessary library
import numpy as np
import matplotlib.pyplot as plt

# Display distribution of classes in the training set
# Assign the target labels from newsgroups_train to y_train
y_train = newsgroups_train.target
class_distribution = np.bincount(y_train)
plt.bar(range(len(class_distribution)), class_distribution)
plt.xticks(range(len(class_distribution)), newsgroups_train.target_names, rotation=45)
plt.title('Distribution of Classes in Training Set')
plt.xlabel('Class')
plt.ylabel('Number of Documents')
plt.show()

In [None]:
# Display distribution of classes in the test set
y_test = newsgroups_test.target # Assign target labels from newsgroups_test to y_test
class_distribution = np.bincount(y_test)
plt.bar(range(len(class_distribution)), class_distribution)
plt.xticks(range(len(class_distribution)), newsgroups_test.target_names, rotation=45)
plt.title('Distribution of Classes in Test Set')
plt.xlabel('Class')
plt.ylabel('Number of Documents')
plt.show()

# Data Preprocessing

Text data is converted into TF-IDF feature vectors. TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document in a collection. This step is crucial for converting text data into a format that can be used for machine learning.
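
To show what the weights actually are, the sketch below recomputes TF-IDF by hand on three invented documents and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings: smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented documents
docs = ["free prize free", "meeting monday", "free meeting"]

# Reference result from scikit-learn with default settings
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()
vocab = vectorizer.vocabulary_  # maps each word to its column index

# Term frequency: raw count of each word in each document
n_docs = len(docs)
counts = np.zeros_like(X)
for i, doc in enumerate(docs):
    for word in doc.split():
        counts[i, vocab[word]] += 1

# Document frequency: number of documents containing each word
df = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn default)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale each row to unit L2 length
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

print(np.allclose(manual, X))
```

Words that appear in many documents (here `free` and `meeting`) receive a lower IDF than words that appear in only one, so they contribute less to the final vector.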

In [None]:
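# Data preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights from the training data only,
# then reuse them to transform the test data
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)
y_train = newsgroups_train.target
y_test = newsgroups_test.target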


# Decision Tree Classifier
A Decision Tree classifier is initialized and trained on the processed training data. Decision Trees are a non-linear predictive modeling tool that can be used for both classification and regression tasks.

In [None]:
# Initialize and train a Decision Tree classifier
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=42)  # Fixed seed for reproducible results
clf.fit(X_train, y_train)


# Model Evaluation

The trained model is used to make predictions on the test set, and the model's performance is evaluated using accuracy and a detailed classification report, which includes precision, recall, f1-score, and support for each class.

In [None]:
from sklearn.metrics import accuracy_score, classification_report

# Make predictions
y_pred = clf.predict(X_test)
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=newsgroups_test.target_names))


The output shows the performance of a Decision Tree classifier on a text classification task using the 20 Newsgroups dataset. An accuracy of approximately 63.25% means the model predicted the correct category for a little under two-thirds of the newsgroup posts in the test set. The precision, recall, and f1-score for each category show how well the model performs for individual classes. Precision is the fraction of posts labeled as a class that truly belong to it, recall is the fraction of posts truly belonging to a class that the model found, and the f1-score is the harmonic mean of precision and recall. The variation across categories (alt.atheism, comp.graphics, sci.med, soc.religion.christian) suggests that the model's ability to classify posts depends on the subject matter: it performs best on 'soc.religion.christian' and worst on 'alt.atheism'.
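
As a worked example of these metrics, the sketch below computes precision, recall, F1, and accuracy by hand for a single class treated as positive. The label lists are invented for illustration and are unrelated to the results above.

```python
# Invented labels: 1 = positive class, 0 = everything else
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_hat = [1, 0, 0, 0, 1, 0, 0, 1]

# Count the four outcomes of a binary confusion matrix
tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / len(y_true)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

In a multi-class report such as the one above, the same calculation is repeated once per class, each time treating that class as positive and all others as negative.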

# Comparison with Other Text Classification Techniques

We will compare decision trees with other popular text classification algorithms such as Random Forest and Support Vector Machines.

# Text Classification using Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize and train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the model
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=newsgroups_test.target_names))


# Text Classification using SVM

In [None]:
from sklearn.svm import SVC

# Initialize and train an SVM classifier
clf = SVC(kernel='linear', random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the model
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=newsgroups_test.target_names))


# Observations

1. SVM outperforms both Random Forest and Decision Tree classifiers in terms of accuracy and overall performance, as indicated by the higher F1-score.
2. Random Forest performs relatively well but slightly lags behind SVM.
3. Decision Tree shows the lowest performance among the three classifiers, indicating the importance of choosing an appropriate algorithm for text classification tasks.
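
One way to make such a comparison easier to read is to collect a single summary number per model. The sketch below defines a hypothetical helper, `compare_models`, that returns accuracy and macro-averaged F1 for the three classifiers. The toy documents are invented so the block runs on its own, and their scores say nothing about the 20 Newsgroups results. In the notebook, the existing `X_train`, `y_train`, `X_test`, and `y_test` would be passed instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score


def compare_models(X_train, y_train, X_test, y_test):
    """Fit each classifier and return its accuracy and macro F1 (hypothetical helper)."""
    models = {
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
        "SVM": SVC(kernel='linear', random_state=42),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        results[name] = {
            "accuracy": accuracy_score(y_test, y_pred),
            "macro_f1": f1_score(y_test, y_pred, average="macro"),
        }
    return results


# Invented toy data: 1 = spam, 0 = ham
train_docs = [
    "free prize win now", "claim free cash prize", "win cash reward today",
    "team meeting on monday", "project report for review", "lunch with the team",
]
train_labels = [1, 1, 1, 0, 0, 0]
test_docs = ["free cash today", "monday project meeting"]
test_labels = [1, 0]

toy_vectorizer = TfidfVectorizer()
results = compare_models(
    toy_vectorizer.fit_transform(train_docs), train_labels,
    toy_vectorizer.transform(test_docs), test_labels,
)
for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, macro F1={scores['macro_f1']:.3f}")
```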



In [None]:
# Import necessary libraries for data handling, model training, and evaluation
# These libraries help load data, convert text to numbers, build models, and assess model accuracy.
from sklearn.model_selection import train_test_split  # For dividing data into training and testing sets
from sklearn.feature_extraction.text import TfidfVectorizer  # For transforming text into numerical features
from sklearn.tree import DecisionTreeClassifier  # Decision Tree model for classification
from sklearn.ensemble import RandomForestClassifier  # Random Forest model for classification
from sklearn.svm import SVC  # Support Vector Machine model for classification
from sklearn.metrics import classification_report, accuracy_score  # For measuring model performance
from sklearn.datasets import fetch_20newsgroups  # Dataset containing news articles for text classification
import numpy as np  # For numerical calculations
import matplotlib.pyplot as plt  # For creating visual plots

# Load the dataset
# '20 Newsgroups' is a collection of news articles grouped by topic.
# We select four categories (topics) to make this a simpler classification problem.
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
# Load training data (data that the model will learn from)
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))
# Load test data (data to test the model's accuracy after training)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, remove=('headers', 'footers', 'quotes'))

# Visualize the distribution of each category in the training set
# We can see how many articles we have in each category (class), which helps us understand data balance.
y_train = newsgroups_train.target  # Extract the category labels for the training data
class_distribution_train = np.bincount(y_train)  # Count the number of articles per category
plt.bar(range(len(class_distribution_train)), class_distribution_train)  # Create a bar plot for visual representation
plt.xticks(range(len(class_distribution_train)), newsgroups_train.target_names, rotation=45)  # Label each category
plt.title('Training Set Class Distribution')  # Title of the plot
plt.xlabel('Class')  # Label for x-axis (category)
plt.ylabel('Number of Documents')  # Label for y-axis (count of documents)
plt.show()  # Display the plot

# Visualize the distribution of each category in the test set
# This is done to ensure that test data is also balanced like the training data.
y_test = newsgroups_test.target  # Extract category labels for the test data
class_distribution_test = np.bincount(y_test)  # Count articles per category in test data
plt.bar(range(len(class_distribution_test)), class_distribution_test)  # Bar plot for visual representation
plt.xticks(range(len(class_distribution_test)), newsgroups_test.target_names, rotation=45)  # Label each category
plt.title('Test Set Class Distribution')  # Title of the plot
plt.xlabel('Class')  # Label for x-axis (category)
plt.ylabel('Number of Documents')  # Label for y-axis (count of documents)
plt.show()  # Display the plot

# Preprocess text data: Convert text into numerical data using TF-IDF
# TF-IDF (Term Frequency-Inverse Document Frequency) measures how important words are to a document
# by assigning weights to each word, based on their frequency in a document and in the entire dataset.
vectorizer = TfidfVectorizer(stop_words='english')  # Initialize vectorizer, removing common English stop words
X_train = vectorizer.fit_transform(newsgroups_train.data)  # Learn vocabulary from training data and transform it
X_test = vectorizer.transform(newsgroups_test.data)  # Transform test data using the learned vocabulary
# Here, X_train and X_test are now matrices with numerical values representing the text data.

# Model 1: Decision Tree Classifier
# Decision Tree builds a flowchart-like model of decisions to classify data into categories.
clf_tree = DecisionTreeClassifier(random_state=42)  # Initialize Decision Tree with a random seed for consistency
clf_tree.fit(X_train, y_train)  # Train the model using the training data

# Predict and evaluate the Decision Tree model
y_pred_tree = clf_tree.predict(X_test)  # Use the trained model to predict labels for the test data
print("Decision Tree Model Performance:")
print("Accuracy:", accuracy_score(y_test, y_pred_tree))  # Show the accuracy (percentage of correct predictions)
print("Classification Report:\n", classification_report(y_test, y_pred_tree, target_names=newsgroups_test.target_names))
# The classification report shows performance for each category, including precision and recall.

# Model 2: Random Forest Classifier
# A Random Forest builds multiple Decision Trees and averages their predictions for better accuracy.
clf_rf = RandomForestClassifier(n_estimators=100, random_state=42)  # Initialize Random Forest with 100 trees
clf_rf.fit(X_train, y_train)  # Train the model using the training data

# Predict and evaluate the Random Forest model
y_pred_rf = clf_rf.predict(X_test)  # Predict labels for the test data
print("Random Forest Model Performance:")
print("Classification Report:\n", classification_report(y_test, y_pred_rf, target_names=newsgroups_test.target_names))

# Model 3: Support Vector Machine (SVM) Classifier
# SVM finds the best boundary that separates categories, maximizing the distance between different categories.
clf_svc = SVC(kernel='linear', random_state=42)  # Initialize SVM with a linear kernel (suitable for text data)
clf_svc.fit(X_train, y_train)  # Train the model using the training data

# Predict and evaluate the SVM model
y_pred_svc = clf_svc.predict(X_test)  # Predict labels for the test data
print("SVM Model Performance:")
print("Classification Report:\n", classification_report(y_test, y_pred_svc, target_names=newsgroups_test.target_names))


### Step-by-Step Explanation with Comments

1. **Library Imports**: Detailed comments explain the purpose of each library, especially for someone new to Python and machine learning.

2. **Loading and Understanding the Dataset**:
   - Describes the `20 Newsgroups` dataset and why specific categories were chosen.
   - Explains the difference between training and test data and why both are needed.

3. **Class Distribution Visualization**:
   - Step-by-step comments guide the reader through checking data balance, which is essential for fair model training.

4. **TF-IDF Vectorization**:
   - Simplified explanation of how TF-IDF works to convert text data into numbers.
   - Notes why removing stop words (common words) helps improve model focus on significant terms.

5. **Models and Their Descriptions**:
   - **Decision Tree**: Describes the basics of how a Decision Tree makes predictions.
   - **Random Forest**: Explains why combining multiple Decision Trees improves performance.
   - **SVM**: Highlights the core idea of finding a boundary to separate categories.

6. **Evaluation and Metrics**:
   - Explains `accuracy_score` for measuring overall accuracy.
   - Introduces the `classification_report`, which gives details on precision and recall for each category.

This should make the code easy to understand, with each concept introduced gradually to guide a beginner through the process of building and evaluating classification models for text data.

Decision Trees offer a straightforward baseline for spam detection. The step-by-step guide below shows a simple implementation that follows the same text classification process used earlier in this notebook.

### Steps to Implement Spam Detection Using Decision Trees

1. **Gather Your Data**:
   - You need a dataset of emails (or text messages) labeled as either "spam" or "not spam." You can use existing datasets like the [SMS Spam Collection](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection) or the [Enron Email Dataset](https://www.cs.cmu.edu/~enron/).
   - Ensure your dataset contains two main columns: the text of the message and the label (spam or not spam).

2. **Prepare Your Environment**:
   - Make sure you have Python installed along with libraries like `scikit-learn`, `pandas`, and `matplotlib`. You can install these using pip:
     ```bash
     pip install scikit-learn pandas matplotlib
     ```

3. **Load and Preprocess Your Data**:
   - Use pandas to load your data and preprocess it (cleaning and formatting).
   - Split your data into training and testing sets to evaluate your model later.

4. **Convert Text to Numerical Format**:
   - Use `TfidfVectorizer` to convert the text messages into numerical format, which is required for the Decision Tree model.

5. **Train the Decision Tree Model**:
   - Create and train your Decision Tree classifier using the training data.

6. **Make Predictions and Evaluate Your Model**:
   - Use the trained model to predict whether the messages in the test set are spam or not.
   - Evaluate the model's performance using metrics like accuracy, precision, and recall.

### Example Code for Spam Detection Using Decision Trees

Here’s how you can implement the above steps in Python:

```python
# Import necessary libraries
import pandas as pd  # For data manipulation
from sklearn.model_selection import train_test_split  # To split data into training and testing sets
from sklearn.feature_extraction.text import TfidfVectorizer  # To convert text to numerical features
from sklearn.tree import DecisionTreeClassifier  # For the Decision Tree model
from sklearn.metrics import classification_report, accuracy_score  # To evaluate model performance

# Step 1: Load your dataset
# Replace 'spam_data.csv' with your actual file path
data = pd.read_csv('spam_data.csv')  # Load the dataset
print(data.head())  # Display the first few rows to understand the data structure

# Step 2: Preprocess the data
# Assuming the dataset has 'text' and 'label' columns
X = data['text']  # Features (text messages)
y = data['label']  # Labels (spam or not spam)

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Convert text data to numerical format
vectorizer = TfidfVectorizer(stop_words='english')  # Initialize the vectorizer
X_train_vectorized = vectorizer.fit_transform(X_train)  # Fit and transform the training data
X_test_vectorized = vectorizer.transform(X_test)  # Transform the test data

# Step 4: Train the Decision Tree model
clf = DecisionTreeClassifier(random_state=42)  # Initialize the model
clf.fit(X_train_vectorized, y_train)  # Train the model with the training data

# Step 5: Make predictions on the test set
y_pred = clf.predict(X_test_vectorized)  # Predict labels for the test data

# Step 6: Evaluate the model's performance
print("Accuracy:", accuracy_score(y_test, y_pred))  # Print the accuracy
print("Classification Report:\n", classification_report(y_test, y_pred))  # Detailed performance metrics
```

### Explanation of the Code:

1. **Loading Data**: The code starts by loading your dataset from a CSV file into a pandas DataFrame. You can check the structure of your data with `print(data.head())`.

2. **Data Preparation**:
   - The features (`X`) are the text messages, and the labels (`y`) indicate whether they are spam or not.
   - The data is split into training and testing sets to help evaluate the model later.

3. **Text Vectorization**: `TfidfVectorizer` converts the text data into a numerical format that the Decision Tree can understand. It removes common words (stop words) that don't contribute much to the classification.

4. **Model Training**: A Decision Tree classifier is created and trained using the training data.

5. **Prediction**: The trained model predicts whether messages in the test set are spam or not.

6. **Evaluation**: Finally, the model’s accuracy and detailed classification metrics are printed to evaluate its performance.

### Conclusion

By following these steps, you can effectively use the Decision Trees method for spam detection. This approach helps automate the identification of spam messages based on the text content, which can be very useful in real-world applications like email filtering and messaging apps.
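
For the filtering use case mentioned above, the vectorizer and classifier can be bundled so that new raw messages are classified in one call. The following is a minimal sketch using scikit-learn's `make_pipeline`. The training messages and the name `spam_filter` are invented for illustration, and a real filter would be trained on a full labeled dataset.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Invented training messages for illustration
train_texts = [
    "win a free prize today",
    "claim your free cash reward",
    "team meeting moved to friday",
    "project report due tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

# The pipeline applies TF-IDF and then the tree, both at fit and at predict time
spam_filter = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    DecisionTreeClassifier(random_state=42),
)
spam_filter.fit(train_texts, train_labels)

# Classify previously unseen raw text
new_messages = ["free prize inside", "see you at the meeting"]
predictions = spam_filter.predict(new_messages)
for message, label in zip(new_messages, predictions):
    print(f"{label}: {message}")
```

Keeping both steps in one object avoids accidentally vectorizing new messages with a different vocabulary than the one the model was trained on.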

Running this code on Google Colab with the dataset stored in Google Drive requires mounting Google Drive and loading the dataset from there. The version below adds that step and also limits the vocabulary size and the tree depth.

The following code targets the SMS Spam Collection dataset and can be run on Google Colab. Detailed comments describe each step.

### Step-by-Step Code for Google Colab

```python
# Step 1: Mount Google Drive to Access Dataset
from google.colab import drive
drive.mount('/content/drive')  # Mounts Google Drive for file access

# Step 2: Import necessary libraries
import pandas as pd  # Data manipulation
from sklearn.model_selection import train_test_split  # Train/test split
from sklearn.feature_extraction.text import TfidfVectorizer  # Text to numerical features
from sklearn.tree import DecisionTreeClassifier  # Decision Tree model
from sklearn.metrics import classification_report, accuracy_score  # Model evaluation

# Step 3: Load the SMS Spam Collection dataset
# Update the path below with the actual path to the dataset in your Google Drive
file_path = '/content/drive/My Drive/path_to_your_file/SMSSpamCollection'  # Example path
data = pd.read_csv(file_path, sep='\t', header=None, names=['label', 'text'])  # Load dataset with tab separator
print("First few rows of the dataset:")
print(data.head())  # Show first few rows for verification

# Step 4: Encode the labels
# Convert 'spam' to 1 and 'ham' to 0 to make labels numeric for the model
data['label'] = data['label'].map({'spam': 1, 'ham': 0})

# Step 5: Split data into features and labels
X = data['text']  # Text messages
y = data['label']  # Labels (1 for spam, 0 for ham)

# Split the dataset into 80% training and 20% testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {len(X_train)} messages")
print(f"Testing set size: {len(X_test)} messages")

# Step 6: Convert text data to numerical format using TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=3000)  # Limit features to 3000 for efficiency
X_train_vectorized = vectorizer.fit_transform(X_train)  # Fit & transform training data
X_test_vectorized = vectorizer.transform(X_test)  # Transform test data

# Step 7: Train a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42, max_depth=20)  # Adding max_depth to prevent overfitting
clf.fit(X_train_vectorized, y_train)  # Train the model

# Step 8: Make predictions on the test set
y_pred = clf.predict(X_test_vectorized)  # Predict spam or ham for the test set

# Step 9: Evaluate the model's performance
print("\nDecision Tree Model Performance on SMS Spam Collection Dataset:")
print("Accuracy:", accuracy_score(y_test, y_pred))  # Show accuracy
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))
```

### Code Enhancements and Explanations:

1. **Mount Google Drive**:
   - Using `drive.mount()` allows access to files stored in Google Drive directly from Colab.

2. **TF-IDF Vectorizer**:
   - Specified `max_features=3000` in `TfidfVectorizer` to limit the number of features and make the model more efficient. This can help reduce memory usage and prevent overfitting.

3. **Decision Tree Model Parameters**:
   - Added `max_depth=20` in `DecisionTreeClassifier` to limit the depth of the tree. This helps prevent overfitting by avoiding a model that is too complex for the dataset.

4. **Detailed Output**:
   - The code prints the model’s accuracy and provides a classification report, which includes precision, recall, and F1-score for both spam and ham classes.
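
The effect of the two limits described above can be checked directly on a fitted model. The sketch below uses invented messages and deliberately small limits (`max_features=5`, `max_depth=2`) so the effect is visible on toy data. The values 3000 and 20 used in the notebook behave the same way on the full dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Invented messages: 1 = spam, 0 = ham
messages = [
    "free prize win now", "claim your free cash prize", "win cash reward today",
    "team meeting on monday", "project report for review", "lunch with the team on friday",
]
labels = [1, 1, 1, 0, 0, 0]

# Without a limit, every distinct token becomes a feature
full_vectorizer = TfidfVectorizer()
full_vectorizer.fit(messages)
print("Unrestricted vocabulary size:", len(full_vectorizer.vocabulary_))

# max_features keeps only the most frequent terms across the corpus
small_vectorizer = TfidfVectorizer(max_features=5)
X_small = small_vectorizer.fit_transform(messages)
print("Restricted vocabulary size:", len(small_vectorizer.vocabulary_))

# max_depth caps how many successive splits any path in the tree can contain
shallow_tree = DecisionTreeClassifier(random_state=42, max_depth=2)
shallow_tree.fit(X_small, labels)
print("Tree depth:", shallow_tree.get_depth())
```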

### Additional Tips for Running in Google Colab:

- **Change File Path**: Ensure `file_path` points correctly to the SMS Spam Collection dataset within your Google Drive.
- **Install Missing Packages**: If needed, you can install any missing packages in Colab using `!pip install package_name`.
  
This setup should make it straightforward to load your dataset from Google Drive, preprocess the text data, train a Decision Tree classifier, and evaluate its performance on spam detection in a Colab environment.

In [None]:
# Step 1: Mount Google Drive to Access Dataset
from google.colab import drive
drive.mount('/content/drive')  # Mounts Google Drive for file access

# Step 2: Import necessary libraries
import pandas as pd  # Data manipulation
from sklearn.model_selection import train_test_split  # Train/test split
from sklearn.feature_extraction.text import TfidfVectorizer  # Text to numerical features
from sklearn.tree import DecisionTreeClassifier  # Decision Tree model
from sklearn.metrics import classification_report, accuracy_score  # Model evaluation

# Step 3: Load the SMS Spam Collection dataset
# Replace with the actual path to your dataset in Google Drive
file_path = '/content/drive/MyDrive/dataset/SMSSpamCollection'  # Example path; adjust to where the file is stored in your Drive
data = pd.read_csv(file_path, sep='\t', header=None, names=['label', 'text'])  # Load dataset with tab separator
print("First few rows of the dataset:")
print(data.head())  # Show first few rows for verification

# Step 4: Encode the labels
# Convert 'spam' to 1 and 'ham' to 0 to make labels numeric for the model
data['label'] = data['label'].map({'spam': 1, 'ham': 0})

# Step 5: Split data into features and labels
X = data['text']  # Text messages
y = data['label']  # Labels (1 for spam, 0 for ham)

# Split the dataset into 80% training and 20% testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {len(X_train)} messages")
print(f"Testing set size: {len(X_test)} messages")

# Step 6: Convert text data to numerical format using TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=3000)  # Limit features to 3000 for efficiency
X_train_vectorized = vectorizer.fit_transform(X_train)  # Fit & transform training data
X_test_vectorized = vectorizer.transform(X_test)  # Transform test data

# Step 7: Train a Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42, max_depth=20)  # Adding max_depth to prevent overfitting
clf.fit(X_train_vectorized, y_train)  # Train the model

# Step 8: Make predictions on the test set
y_pred = clf.predict(X_test_vectorized)  # Predict spam or ham for the test set

# Step 9: Evaluate the model's performance
print("\nDecision Tree Model Performance on SMS Spam Collection Dataset:")
print("Accuracy:", accuracy_score(y_test, y_pred))  # Show accuracy
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))

In [None]:
# Step 1: Mount Google Drive
# Google Colab requires mounting Google Drive if we want to use files from it.
# This code connects to Google Drive to access the dataset stored there.
from google.colab import drive
drive.mount('/content/drive')  # This command mounts Google Drive so we can access files within it.

# Step 2: Import necessary libraries
# Libraries are reusable pieces of code that help with specific tasks.
import pandas as pd  # 'pandas' helps to manage and analyze data in table format.
from sklearn.model_selection import train_test_split  # Splits data into training and testing sets.
from sklearn.feature_extraction.text import TfidfVectorizer  # Converts text into numbers for analysis.
from sklearn.tree import DecisionTreeClassifier  # The Decision Tree model that will predict spam or ham.
from sklearn.metrics import classification_report, accuracy_score  # Tools to measure model performance.

# Step 3: Load the SMS Spam Collection dataset
# Here, we load the dataset containing SMS messages and their labels (spam or ham).
file_path = '/content/drive/MyDrive/dataset/SMSSpamCollection'  # This path should be updated to your dataset location on Google Drive.

# Read the dataset using 'pandas' with tab ('\t') as the separator.
# Since this dataset doesn’t have column names, we name them 'label' and 'text' for clarity.
data = pd.read_csv(file_path, sep='\t', header=None, names=['label', 'text'])

# Display the first few rows to confirm the data loaded correctly
print("First few rows of the dataset:")
print(data.head())  # Show the first few rows of data to get an idea of its structure.

# Step 4: Encode the labels as numbers
# Machine learning models work with numbers, so we need to convert text labels (spam or ham) to numbers.
# Here, we map 'spam' to 1 and 'ham' to 0.
data['label'] = data['label'].map({'spam': 1, 'ham': 0})  # 'spam' becomes 1, and 'ham' becomes 0.

# Step 5: Define features (X) and labels (y)
# - X will contain the text messages (SMS content).
# - y will contain the corresponding labels (1 for spam, 0 for ham).
X = data['text']  # Features (the actual text of the SMS messages)
y = data['label']  # Labels (1 for spam, 0 for ham)

# Step 6: Split the data into training and testing sets
# We need to divide the data so we can train the model on one part and test it on another.
# - The model will learn patterns in the training set (80% of data).
# - The testing set (20% of data) will help us see how well the model performs on unseen messages.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Print the sizes of the training and testing sets
print(f"Training set size: {len(X_train)} messages")  # Number of messages in training set
print(f"Testing set size: {len(X_test)} messages")  # Number of messages in testing set

# Step 7: Convert text data into a numerical format using TF-IDF Vectorizer
# The model can't work with raw text, so we turn each message into a vector of numbers.
# TF-IDF (Term Frequency-Inverse Document Frequency) weights a word by how often it appears in a message,
# scaled down when the word appears in many messages, so distinctive words get higher weights.
# This helps the model focus on words that distinguish spam from ham.
vectorizer = TfidfVectorizer(stop_words='english', max_features=3000)  # Drops common English stop words and keeps the 3000 most frequent remaining terms.

# Fit and transform the training data
X_train_vectorized = vectorizer.fit_transform(X_train)  # 'fit' learns vocabulary from training data, 'transform' converts it to numbers.
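print(X_train_vectorized)  # Inspect the sparse matrix: each entry pairs a (message index, word index) with its TF-IDF weight.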
# Transform the testing data
X_test_vectorized = vectorizer.transform(X_test)  # Convert test messages to numbers using the same vocabulary.

# Step 8: Train the Decision Tree Classifier
# Now we create and train our Decision Tree model.
# The Decision Tree works by splitting data based on certain conditions to classify messages as spam or ham.
# We set 'max_depth' to limit the depth of the tree, helping it generalize better and avoid overfitting.
clf = DecisionTreeClassifier(random_state=42, max_depth=20)  # Setting 'max_depth' keeps the model simpler and faster.

# Train (fit) the model on the training data
clf.fit(X_train_vectorized, y_train)  # The model learns from training data to recognize patterns in spam and ham.

# Step 9: Make predictions on the test set
# With our trained model, we now predict whether each message in the test set is spam or ham.
y_pred = clf.predict(X_test_vectorized)  # Predict labels (spam/ham) for test messages

# Step 10: Evaluate the model's performance
# To see how well the model performs, we check its accuracy and other metrics.
# - Accuracy: Fraction of correct predictions (out of all predictions), between 0 and 1.
# - Classification report: Detailed performance metrics for spam and ham (precision, recall, F1-score).
print("\nDecision Tree Model Performance on SMS Spam Collection Dataset:")
print("Accuracy:", accuracy_score(y_test, y_pred))  # Shows overall accuracy of the model

# Print a classification report for further insights into model performance on each class (ham/spam)
print("\nClassification Report:\n", classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))


In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Initialize a dictionary to store the results for each depth
depth_results = {}

# Test decision tree classifiers with varying depths
for depth in range(1, 21):  # Test depths from 1 to 20
    clf = DecisionTreeClassifier(random_state=42, max_depth=depth)
    clf.fit(X_train_vectorized, y_train)  # Train the classifier
    y_pred = clf.predict(X_test_vectorized)  # Predict on the test set

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    depth_results[depth] = accuracy  # Store accuracy for the current depth

    print(f"Depth: {depth}, Accuracy: {accuracy:.4f}")

# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(list(depth_results.keys()), list(depth_results.values()), marker='o', color='blue')
plt.title('Effect of Tree Depth on Accuracy', fontsize=14)
plt.xlabel('Tree Depth', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.grid(True)
plt.xticks(range(1, 21))
plt.show()

# Identify the best depth
best_depth = max(depth_results, key=depth_results.get)
print(f"\nBest Depth: {best_depth}, Best Accuracy: {depth_results[best_depth]:.4f}")


In [None]:
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Initialize a dictionary to store the results for each depth
depth_results = {}

# Test decision tree classifiers with varying depths
for depth in range(1, 101):  # Test depths from 1 to 100
    clf = DecisionTreeClassifier(random_state=42, max_depth=depth)
    clf.fit(X_train_vectorized, y_train)  # Train the classifier
    y_pred = clf.predict(X_test_vectorized)  # Predict on the test set

    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    depth_results[depth] = accuracy  # Store accuracy for the current depth

    # Print accuracy for every 10th depth to reduce verbosity
    if depth % 10 == 0 or depth == 1 or depth == 100:
        print(f"Depth: {depth}, Accuracy: {accuracy:.4f}")

# Plotting the results
plt.figure(figsize=(12, 8))
plt.plot(list(depth_results.keys()), list(depth_results.values()), marker='o', color='blue')
plt.title('Effect of Tree Depth on Accuracy (1-100)', fontsize=14)
plt.xlabel('Tree Depth', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.grid(True)
plt.xticks(range(0, 101, 10))  # Show x-axis ticks every 10 depths
plt.show()


# Print summary of results
# These lines state expectations to check against the plot; they are not computed from depth_results.
print("\nSummary of Results:")
print("Depth 1-10: Accuracy is expected to improve quickly as the tree learns basic patterns.")
print("Depth 10-100: Test accuracy is expected to change little; extra depth mostly fits the training data more closely, or the tree stops growing before it reaches max_depth.")
print("Optimal Depth: The depth with the highest test accuracy.")

# Identify the best depth
best_depth = max(depth_results, key=depth_results.get)
print(f"\nBest Depth: {best_depth}, Best Accuracy: {depth_results[best_depth]:.4f}")
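
Accuracy alone can hide poor spam detection when one class dominates, so it may help to track the F1-score of the positive class in the same depth sweep. The sketch below is an illustration under assumptions: it uses synthetic imbalanced data from `make_classification` (with arbitrarily chosen class weights) instead of the notebook's TF-IDF matrix, and the names `X_demo`, `y_demo`, and `sweep` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a minority positive class (assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.87, 0.13], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

sweep = {}  # depth -> (accuracy, F1 of the positive class)
for depth in (1, 5, 10, 20):
    tree = DecisionTreeClassifier(random_state=42, max_depth=depth)
    tree.fit(X_tr, y_tr)
    pred = tree.predict(X_te)
    sweep[depth] = (
        accuracy_score(y_te, pred),
        f1_score(y_te, pred, zero_division=0),  # 0 if no positives are predicted
    )
    print(f"Depth: {depth}, Accuracy: {sweep[depth][0]:.4f}, F1 (positive class): {sweep[depth][1]:.4f}")
```

To apply the same idea to the notebook, replace the synthetic arrays with `X_train_vectorized`, `X_test_vectorized`, `y_train`, and `y_test`.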


NEXT WEEK TASKS:

1.   Plot the training and test accuracy together in one figure.

2.   Check the actual depth of the fitted decision tree.
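
For the second task, one possible approach is to compare the requested `max_depth` with the depth the fitted tree actually reaches, using scikit-learn's `get_depth()` and `get_n_leaves()` methods. The sketch below uses synthetic data as a stand-in (an assumption; `X_demo`, `y_demo`, and `actual_depths` are hypothetical names), since `max_depth` is only an upper bound and the tree stops earlier once it can no longer split.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption for illustration; swap in X_train_vectorized and y_train)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

actual_depths = {}  # requested max_depth -> depth the fitted tree really has
for requested_depth in (3, 10, 100):
    tree = DecisionTreeClassifier(random_state=42, max_depth=requested_depth)
    tree.fit(X_demo, y_demo)
    actual_depths[requested_depth] = tree.get_depth()
    print(
        f"max_depth={requested_depth}: actual depth={tree.get_depth()}, "
        f"leaves={tree.get_n_leaves()}"
    )
```

If the actual depth stops increasing well before the requested value, that would explain a flat accuracy curve at large `max_depth` settings.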



In [None]:
# Required Libraries
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Initialize dictionaries to store training and testing accuracies
train_accuracies = {}
test_accuracies = {}

# Test decision tree classifiers with varying depths
for depth in range(1, 21):  # Test depths from 1 to 20
    clf = DecisionTreeClassifier(random_state=42, max_depth=depth)
    clf.fit(X_train_vectorized, y_train)  # Train the classifier

    # Predict on the training set
    y_train_pred = clf.predict(X_train_vectorized)
    train_accuracy = accuracy_score(y_train, y_train_pred)  # Training accuracy
    train_accuracies[depth] = train_accuracy

    # Predict on the testing set
    y_test_pred = clf.predict(X_test_vectorized)
    test_accuracy = accuracy_score(y_test, y_test_pred)  # Testing accuracy
    test_accuracies[depth] = test_accuracy

    print(f"Depth: {depth}, Training Accuracy: {train_accuracy:.4f}, Testing Accuracy: {test_accuracy:.4f}")

# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(list(train_accuracies.keys()), list(train_accuracies.values()), marker='o', label='Training Accuracy', color='blue')
plt.plot(list(test_accuracies.keys()), list(test_accuracies.values()), marker='s', label='Testing Accuracy', color='orange')
plt.title('Effect of Tree Depth on Accuracy', fontsize=14)
plt.xlabel('Tree Depth', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.legend(fontsize=12)
plt.grid(True)
plt.xticks(range(1, 21))
plt.show()

# Identify the best depth based on test accuracy
# (choosing the depth on the test set makes the reported best score optimistic)
best_depth = max(test_accuracies, key=test_accuracies.get)
print(f"\nBest Depth: {best_depth}, Best Testing Accuracy: {test_accuracies[best_depth]:.4f}")


In [None]:
# Step 1: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Step 2: Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, accuracy_score
import matplotlib.pyplot as plt

# Step 3: Load the SMS Spam Collection dataset
file_path = '/content/drive/MyDrive/dataset/SMSSpamCollection'  # Update the path to your dataset
data = pd.read_csv(file_path, sep='\t', header=None, names=['label', 'text'])
print("First few rows of the dataset:")
print(data.head())

# Step 4: Encode the labels as numbers
data['label'] = data['label'].map({'spam': 1, 'ham': 0})

# Step 5: Define features (X) and labels (y)
X = data['text']
y = data['label']

# Step 6: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {len(X_train)} messages")
print(f"Testing set size: {len(X_test)} messages")

# Step 7: Convert text data into a numerical format using TF-IDF Vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=3000)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Step 8: Train and evaluate the Decision Tree Classifier for varying depths
train_accuracies = {}
test_accuracies = {}

for depth in range(1, 21):  # Test depths from 1 to 20
    clf = DecisionTreeClassifier(random_state=42, max_depth=depth)
    clf.fit(X_train_vectorized, y_train)

    # Predict on the training set
    y_train_pred = clf.predict(X_train_vectorized)
    train_accuracy = accuracy_score(y_train, y_train_pred)
    train_accuracies[depth] = train_accuracy

    # Predict on the testing set
    y_test_pred = clf.predict(X_test_vectorized)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_accuracies[depth] = test_accuracy

    print(f"Depth: {depth}, Training Accuracy: {train_accuracy:.4f}, Testing Accuracy: {test_accuracy:.4f}")

# Step 9: Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(list(train_accuracies.keys()), list(train_accuracies.values()), marker='o', label='Training Accuracy', color='blue')
plt.plot(list(test_accuracies.keys()), list(test_accuracies.values()), marker='s', label='Testing Accuracy', color='orange')
plt.title('Effect of Tree Depth on Accuracy', fontsize=14)
plt.xlabel('Tree Depth', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.legend(fontsize=12)
plt.grid(True)
plt.xticks(range(1, 21))
plt.show()

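# Step 10: Identify the best depth based on test accuracy
# Note: the test set is used here to pick the depth, so the final score in Step 11 is an optimistic estimate.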
best_depth = max(test_accuracies, key=test_accuracies.get)
print(f"\nBest Depth: {best_depth}, Best Testing Accuracy: {test_accuracies[best_depth]:.4f}")

# Step 11: Evaluate the final model at the best depth
clf = DecisionTreeClassifier(random_state=42, max_depth=best_depth)
clf.fit(X_train_vectorized, y_train)
y_test_pred = clf.predict(X_test_vectorized)

print("\nFinal Model Performance at Best Depth:")
print("Accuracy on Test Set:", accuracy_score(y_test, y_test_pred))
print("\nClassification Report:\n", classification_report(y_test, y_test_pred, target_names=['Ham', 'Spam']))
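
Because the depth above is chosen using the test set, an alternative worth considering is to select `max_depth` by cross-validation on the training data only and keep the test set for a single final evaluation. The following is a minimal sketch of that idea using scikit-learn's `Pipeline` and `GridSearchCV`; the messages are made up and repeated only to have enough rows for five folds, so the scores themselves are meaningless. All names (`spam_texts`, `ham_texts`, `pipeline`, `search`) are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in messages (assumption; replace with X_train and y_train from the notebook)
spam_texts = [
    "win a free prize now", "free cash offer call now", "claim your free reward today",
    "urgent winner call to claim prize", "free entry text win to claim",
]
ham_texts = [
    "are you coming home tonight", "see you at lunch tomorrow", "can you call me later",
    "i will be late for dinner", "thanks for the help today",
]
texts = spam_texts * 4 + ham_texts * 4
labels = [1] * (len(spam_texts) * 4) + [0] * (len(ham_texts) * 4)

# Putting the vectorizer inside the pipeline refits it on each training fold,
# so the validation fold never influences the vocabulary.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

search = GridSearchCV(
    pipeline,
    param_grid={"tree__max_depth": [1, 2, 3, 5, 10]},
    cv=5,           # 5-fold stratified cross-validation for classifiers
    scoring="f1",   # F1 of the positive (spam = 1) class
)
search.fit(texts, labels)

print("Best max_depth:", search.best_params_["tree__max_depth"])
print(f"Mean cross-validated F1: {search.best_score_:.4f}")
```

With the real data, `search.best_estimator_` could then be evaluated once on `X_test` and `y_test` to report a less biased final score.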
