# Task 2: Email Spam Detection with Skewed Data

## Overview
In real-world spam detection systems, legitimate emails (ham) far outnumber spam emails. Because of this class imbalance, a model can score high accuracy while failing at its real goal of catching spam: a classifier that labels every email as ham is right most of the time yet catches nothing.

To build a useful spam filter, you must:
- Properly handle imbalanced data
- Use meaningful metrics like precision, recall, and F1-score
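
As a rough illustration of why accuracy alone misleads on skewed data, the sketch below scores a baseline that always predicts ham. The class counts (618 ham, 115 spam) are borrowed from the test-set support in the report further down; the variable names are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Assumed class counts, taken from the test-set support reported in this notebook
n_ham, n_spam = 618, 115

# Labels: 0 = ham, 1 = spam
y_true = np.array([0] * n_ham + [1] * n_spam)

# A "classifier" that ignores its input and always predicts ham
y_always_ham = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_always_ham)
baseline_spam_recall = recall_score(y_true, y_always_ham, pos_label=1)

print(f"accuracy:    {baseline_accuracy:.3f}")
print(f"spam recall: {baseline_spam_recall:.3f}")
```

The baseline is correct on roughly 84% of emails while catching no spam at all, which is why recall and F1-score for the spam class are the numbers to watch.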

In [1]:
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

In [2]:
container_path = "data/spam_emails"
data = load_files(container_path)

In [3]:
# Separate the raw email texts (features) from the class labels (targets)
X, y = data.data, data.target

# Hold out 30% of the data for testing; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
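
The split above is purely random, so the spam share in the train and test sets can drift from the overall share. One possible alternative, not used in this notebook, is to pass `stratify` to `train_test_split`. The sketch below uses made-up labels (850 ham, 150 spam) and placeholder features to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 850 ham (0) and 150 spam (1)
y_demo = np.array([0] * 850 + [1] * 150)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

# stratify=y_demo keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"spam share overall: {y_demo.mean():.3f}")
print(f"spam share train:   {y_tr.mean():.3f}")
print(f"spam share test:    {y_te.mean():.3f}")
```

Switching the notebook to a stratified split would change which emails land in the test set, so the reported metrics would no longer match exactly.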

In [4]:
# Create a pipeline: TF-IDF features followed by logistic regression.
# Each step is named so it can be accessed later, e.g. pipeline.named_steps["clf"].
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        lowercase=True,
        stop_words="english",
        max_features=1000
    )),
    # class_weight="balanced" weights each class inversely to its frequency
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000))
])
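
To make the effect of `class_weight="balanced"` concrete, the sketch below computes the weights scikit-learn assigns under that setting, `n_samples / (n_classes * np.bincount(y))`, using hypothetical labels (850 ham, 150 spam) rather than the real training set.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 850 ham (0) and 150 spam (1)
y_demo = np.array([0] * 850 + [1] * 150)
classes = np.unique(y_demo)

# Weights scikit-learn derives for class_weight="balanced"
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same values computed by hand from the documented formula
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

for cls, w in zip(classes, weights):
    print(f"class {cls}: weight {w:.3f}")
```

The rarer class gets the larger weight, so misclassifying a spam email costs the model more during training than misclassifying a ham email.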

In [5]:
# train the model
pipeline.fit(X_train, y_train)

# get prediction for test set
y_pred = pipeline.predict(X_test)

In [6]:
# display the report
print(classification_report(y_test, y_pred, target_names=data.target_names))

              precision    recall  f1-score   support

         ham       1.00      0.98      0.99       618
        spam       0.91      0.99      0.95       115

    accuracy                           0.98       733
   macro avg       0.96      0.99      0.97       733
weighted avg       0.98      0.98      0.98       733
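
As a sanity check on the report, the F1-score of each class is the harmonic mean of its precision and recall. The sketch below recomputes it from the rounded values printed above; `f1_from` is an illustrative helper, not part of scikit-learn.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the report above
reported = {"ham": (1.00, 0.98), "spam": (0.91, 0.99)}

f1_scores = {label: f1_from(p, r) for label, (p, r) in reported.items()}

for label, f1 in f1_scores.items():
    print(f"{label}: f1 = {f1:.2f}")
```

scikit-learn computes these scores from unrounded counts, so recomputing from rounded values can differ in the last digit in general, although here both classes agree with the report.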



## Note:

1. The actual task was to train the model on the whole dataset and return the pipeline. The pipeline was then tested on hidden data to check whether it achieves accuracy >= 80% and F1-score >= 80%.
2. Since I don't have access to the hidden data, I used a train/test split here instead.
3. The actual submission can be found in the submission folder, in the file `task2.py`.
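
For orientation, a function that trains on all available data and returns the fitted pipeline might look roughly like the sketch below. This is not the content of `task2.py`: the function names `train_full_pipeline` and `meets_thresholds`, the toy emails, and the choice of F1 on the spam class are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import Pipeline


def train_full_pipeline(texts, labels):
    """Fit the same pipeline as above on all data (hypothetical helper)."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english", max_features=1000)),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ])
    pipeline.fit(texts, labels)
    return pipeline


def meets_thresholds(pipeline, texts, labels, threshold=0.80):
    """Check accuracy and spam-class F1 against a threshold (assumed criteria)."""
    preds = pipeline.predict(texts)
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds, pos_label=1)
    return bool(acc >= threshold and f1 >= threshold)


# Toy data for demonstration only: 1 = spam, 0 = ham
toy_texts = [
    "win a free prize now",
    "claim your free cash prize",
    "meeting agenda for monday",
    "lunch tomorrow with the team",
    "project report attached",
    "see you at the meeting",
]
toy_labels = [1, 1, 0, 0, 0, 0]

model = train_full_pipeline(toy_texts, toy_labels)
print(model.predict(["free prize inside"]))
```

Checking thresholds on the training data itself would overstate performance, so `meets_thresholds` is only meaningful on held-out emails such as the test split used in this notebook.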