# AI-Powered Phishing Email Classifier

## I. Project Goal and Scope
This project aims to design and implement a robust machine learning system that classifies emails as Legitimate (Ham) or Phishing/Spam. The primary objective is a high **Recall** score on the Phishing/Spam class, which minimizes the risk of missing phishing emails.
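
For reference, treating Phishing/Spam as the positive class, Recall and Precision follow from the confusion-matrix counts (true positives $TP$, false negatives $FN$, false positives $FP$):

$$
\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}
$$

A missed phishing message counts as a false negative, so a higher Recall means fewer misses, usually at the cost of some Precision.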

## II. Core Technical Requirements
1. **Data**: SMS Spam Collection dataset (SMS messages labeled ham or spam).
2. **Preprocessing**: Text cleaning (lowercasing; removing HTML tags, URLs, special characters, and stop words), stemming.
3. **Feature Engineering**: TF-IDF Vectorization.
4. **Models**: Naive Bayes, Logistic Regression, Random Forest.
5. **Evaluation**: Accuracy, Precision, Recall, F1-Score, Confusion Matrix.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import joblib
import sys
import os

# Add src to path to import modules
sys.path.append(os.path.abspath('../src'))

from data_loader import get_processed_data
from features import extract_features
from model_trainer import train_models
from evaluation import evaluate_model
from inference import predict_email, load_resources

## III. Data Acquisition and Preprocessing
We load the SMS Spam Collection dataset and apply the following cleaning steps:
- Lowercase conversion
- Removal of HTML tags, URLs, and special characters
- Removal of stop words
- Stemming using PorterStemmer
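
The cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the code inside `get_processed_data`: the function name `clean_text`, the regular expressions, and the tiny stop-word set are all assumptions made for the example. Only `PorterStemmer` (from NLTK) is named in the text.

```python
import re

from nltk.stem import PorterStemmer

# Tiny illustrative stop-word set (assumption); a real pipeline would use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "for", "in", "of", "and", "you", "we"}

_stemmer = PorterStemmer()


def clean_text(text: str) -> str:
    """Hypothetical helper that applies the cleaning steps in the listed order."""
    text = text.lower()                                    # lowercase conversion
    text = re.sub(r"<[^>]+>", " ", text)                   # strip HTML tags
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)   # strip URLs
    text = re.sub(r"[^a-z\s]", " ", text)                  # strip digits and special characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # drop stop words
    return " ".join(_stemmer.stem(t) for t in tokens)      # Porter stemming


cleaned = clean_text("Visit <b>www.example.com</b> NOW to claim the PRIZE!!!")
print(cleaned)
```

The same function would need to be applied to any text passed to the model at inference time, so that training and prediction see identically processed input.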

In [None]:
df = get_processed_data()
df.head()

## IV. Feature Engineering
We use **TF-IDF Vectorization** to convert text data into numerical features. We limit the features to the top 5000 words.
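
As a rough sketch of what this step might look like with scikit-learn (the internals of `extract_features` are not shown here, so the toy corpus, split ratio, and variable names below are assumptions): `TfidfVectorizer(max_features=5000)` keeps at most the 5000 terms with the highest term frequency across the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus (illustrative only): 1 = spam/phishing, 0 = ham.
texts = [
    "win free prize now", "claim free cash urgent",
    "urgent prize claim now", "free membership win cash",
    "lunch tomorrow at noon", "see you at the meeting",
    "call me after work", "thanks for the notes",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split first so the vocabulary and IDF weights are learned from training data only.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

vectorizer = TfidfVectorizer(max_features=5000)  # cap the vocabulary size
X_train = vectorizer.fit_transform(train_texts)  # learn vocabulary + IDF, then transform
X_test = vectorizer.transform(test_texts)        # reuse the fitted vocabulary

print(X_train.shape, X_test.shape)
```

Fitting the vectorizer on the training split alone avoids leaking test-set vocabulary statistics into the features.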

In [None]:
X_train, X_test, y_train, y_test, vectorizer = extract_features(df)

## V. Model Training and Selection
We train three models with hyperparameter tuning using GridSearchCV:
1. **Naive Bayes (MultinomialNB)**: A strong baseline for text classification.
2. **Logistic Regression**: A linear model, tuned with `class_weight='balanced'` to improve Recall.
3. **Random Forest**: An ensemble tree-based model.

We optimize for **Recall** during cross-validation.
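
A minimal sketch of a Recall-oriented grid search, assuming spam is encoded as label `1` (the positive class for `scoring="recall"`). The toy corpus and the parameter grid are illustrative assumptions and are not taken from `train_models`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy corpus (illustrative only): 1 = spam/phishing, 0 = ham.
spam = [
    "win free prize now", "free cash claim now", "urgent prize claim free",
    "win cash now urgent", "claim free membership prize", "urgent free cash win",
]
ham = [
    "lunch tomorrow at noon", "see you at the meeting", "call me after work",
    "are we still meeting tomorrow", "thanks for the notes", "running late see you soon",
]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

X = TfidfVectorizer().fit_transform(texts)

search = GridSearchCV(
    estimator=LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},  # hypothetical grid
    scoring="recall",                    # select hyperparameters by Recall on class 1
    cv=3,                                # stratified 3-fold for classifiers
)
search.fit(X, labels)

print(search.best_params_, search.best_score_)
```

The same pattern applies to `MultinomialNB` and `RandomForestClassifier` by swapping the estimator and its parameter grid.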

In [None]:
trained_models = train_models(X_train, y_train)

## VI. Evaluation and Metrics
We evaluate the models on the unseen test set.
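
The metrics reported here can be computed with `sklearn.metrics`. The sketch below uses a small hand-made set of labels and predictions (an assumption for illustration, not project results) to show how each metric relates to the confusion matrix. The body of `evaluate_model` may differ.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hand-made example: 1 = spam/phishing, 0 = ham.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]  # one missed spam (FN), one false alarm (FP)

metrics = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "Recall": recall_score(y_true, y_pred),
    "F1-Score": f1_score(y_true, y_pred),
}

# Rows are true labels, columns are predicted labels, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(metrics)
print(cm)
```

In this example there are 3 true positives, 1 false negative, 1 false positive, and 3 true negatives, so all four metrics equal 0.75.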

In [None]:
results = []
for name, model in trained_models.items():
    metrics = evaluate_model(model, X_test, y_test, name)
    metrics['Model'] = name
    results.append(metrics)

results_df = pd.DataFrame(results)
results_df

### Model Comparison
Based on the results above, we select the best model, prioritizing Recall in line with the project goal. **Logistic Regression** often performs well on this task, particularly when trained with `class_weight='balanced'`.

## VII. Inference
We can now use the best model to classify new emails.

In [None]:
# Load the best model (Logistic Regression is assumed here; the name could instead be picked from results_df)
best_model_name = 'Logistic_Regression'
vec, model = load_resources(best_model_name)

sample_spam = "URGENT! You have won a 1 week FREE membership in our £100,000 Prize Jackpot! Txt WORD: CLAIM to No: 81010 T&C www.dbuk.net LCCLTD POBOX 4403LDNW1A7RW18"
sample_ham = "Hey, are we still meeting for lunch tomorrow?"

print("Prediction for Spam Sample:")
print(predict_email(sample_spam, vec, model))

print("\nPrediction for Ham Sample:")
print(predict_email(sample_ham, vec, model))
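
For readers without access to the `inference` module, the following is a hedged sketch of how persisting and reusing a vectorizer/model pair might work with `joblib`. The file names, the toy training data, and the helper `classify_text` are assumptions for illustration and are not the actual `load_resources` or `predict_email` implementations.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (illustrative only): 1 = spam/phishing, 0 = ham.
texts = [
    "win free prize now", "claim free cash urgent", "urgent prize claim now",
    "lunch tomorrow at noon", "see you at the meeting", "call me after work",
]
labels = [1, 1, 1, 0, 0, 0]

vec = TfidfVectorizer().fit(texts)
model = LogisticRegression(class_weight="balanced").fit(vec.transform(texts), labels)


def classify_text(text, vectorizer, classifier):
    """Hypothetical helper: vectorize one message and map the label to a name."""
    features = vectorizer.transform([text])  # must use the fitted vectorizer
    label = int(classifier.predict(features)[0])
    return "spam" if label == 1 else "ham"


# Round-trip both artifacts through disk, as a saved model would be reloaded later.
with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, "vectorizer.joblib")   # hypothetical file name
    model_path = os.path.join(tmp, "model.joblib")      # hypothetical file name
    joblib.dump(vec, vec_path)
    joblib.dump(model, model_path)
    vec_loaded = joblib.load(vec_path)
    model_loaded = joblib.load(model_path)

sample = "claim your free prize now"
print(classify_text(sample, vec_loaded, model_loaded))
```

The vectorizer has to be saved alongside the model, because the classifier only understands feature columns produced by the exact vocabulary it was trained on.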