In this file, we'll accomplish the following:

1. Extract a small sample from a large dataset (so that we can run this exercise meaningfully without high-performance computing resources)
2. Use natural language processing (NLP) techniques to score the emotional sentiment of email text
3. Encode email text as TF-IDF vectors

By the end of the file, you'll turn a dataset of email text into a ready-to-model dataset with engineered features.

DISCLAIMER: Because this project contains real phishing email text, some of the emails include inappropriate language. Please do not inspect the 'email_text' column if you are not comfortable with that.

In [6]:
import numpy as np
import pandas as pd 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

First, we'll load in our raw dataset.

In [7]:
df = pd.read_csv('../DATA/raw.csv')

These cells draw sample_size (default 2,000) rows from the full dataset, giving a sample small enough to run on students' laptops. Note that we've used the 'stratify' argument, which ensures our sample keeps the same proportion of phishing and safe emails as the full dataset.

In [8]:
sample_size = 2000

In [9]:
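# Create a stratified sample of sample_size observations.
# train_test_split returns (train, test); we keep only the "test" part,
# which has exactly sample_size rows, and discard the rest.
_, df = train_test_split(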
    df,
    test_size=sample_size,       # Number of samples you want
    stratify=df['email_type'],  # Stratify on the phishing/safe label
)
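
To see what stratification does, here is a minimal sketch on made-up data. The `toy` frame, its 80/20 class split, and the `random_state=42` seed are assumptions for illustration only; the notebook's own cell does not set a seed, so its sample changes on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 800 "safe" and 200 "phishing" rows (an 80/20 split).
toy = pd.DataFrame({
    "email_text": [f"message {i}" for i in range(1000)],
    "email_type": ["safe"] * 800 + ["phishing"] * 200,
})

# Draw a stratified sample of 100 rows; random_state makes the draw repeatable.
_, toy_sample = train_test_split(
    toy,
    test_size=100,
    stratify=toy["email_type"],
    random_state=42,
)

# Compare class shares in the full data and in the sample.
full_share = toy["email_type"].value_counts(normalize=True)
sample_share = toy_sample["email_type"].value_counts(normalize=True)
print(full_share)
print(sample_share)
```

The two printed tables should show matching class shares, which is the property 'stratify' is meant to preserve.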

#### Sentiment Feature Engineering

Sentiment analysis is the process of identifying emotional tone behind a body of text.

We are using the VADER (Valence Aware Dictionary and sEntiment Reasoner) tool, a lexicon- and
rule-based analyzer tuned for short, informal text such as social media posts. Here we apply it to email bodies.

For each email, VADER gives us four scores:
- 'neg': proportion of the text that is negative
- 'neu': proportion that is neutral
- 'pos': proportion that is positive
- 'compound': an overall sentiment score, ranging from -1 (extremely negative) to +1 (extremely positive)
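
As a rough guide to reading the 'compound' score, the VADER documentation suggests treating values at or above 0.05 as positive, values at or below -0.05 as negative, and anything in between as neutral. The helper below is a hypothetical sketch of that convention; `label_from_compound` is not part of the vaderSentiment package, and the example scores are made up.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to a coarse label (hypothetical helper)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Illustrative scores, not taken from the dataset.
for score in (0.64, 0.0, -0.42):
    print(score, label_from_compound(score))
```

This notebook keeps the raw scores as features rather than these labels, so the thresholds matter only for interpretation.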

These sentiment features can help us detect patterns: for example, phishing emails might use
stronger negative or urgent emotional language compared to safe emails.

In this step, we apply VADER to each email and expand the results into new columns.

In [10]:
sia = SentimentIntensityAnalyzer()

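# Return VADER's score dictionary (neg, neu, pos, compound) for one email
def get_sentiment(text):
    return sia.polarity_scores(text)

# Score every email, then expand the dictionaries into one column per score,
# keeping df's index so the rows line up when we concatenate later
sentiment_df = pd.DataFrame(
    df['email_text'].apply(get_sentiment).tolist(),
    index=df.index,
)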
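
One way to check whether phishing emails really do lean more negative is to compare the compound score by class. The sketch below uses small made-up stand-ins (`toy_sentiment`, `toy_labels`) in place of sentiment_df and df['email_type'], so the numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical stand-ins for sentiment_df and df['email_type'];
# the values are invented for illustration.
toy_sentiment = pd.DataFrame(
    {"compound": [-0.6, -0.2, 0.1, 0.5, 0.7, 0.3]},
    index=[10, 11, 12, 13, 14, 15],
)
toy_labels = pd.Series(
    ["phishing", "phishing", "phishing", "safe", "safe", "safe"],
    index=[10, 11, 12, 13, 14, 15],
    name="email_type",
)

# join aligns on the index, so both objects must share the same row labels.
summary = (
    toy_sentiment.join(toy_labels)
    .groupby("email_type")["compound"]
    .agg(["mean", "median", "count"])
)
print(summary)
```

With the real data, the same pattern would be `sentiment_df.join(df['email_type'])` followed by the same groupby.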

#### TF-IDF Feature Engineering

TF-IDF stands for "Term Frequency - Inverse Document Frequency."

It transforms raw text into numerical features by measuring:
- How often a word appears in a specific document (term frequency)
- How rare that word is across all documents (inverse document frequency)

Words that appear often in one email but rarely across the full dataset are considered
more important, and get higher TF-IDF scores.

In this project, we use TF-IDF to capture meaningful words or short phrases (unigrams and bigrams)
from each email. These will act as input features for our machine learning model.

We also:
- Remove common stopwords like "the," "and," and "is" that add little meaning
- Limit the vocabulary to the 5,000 most frequent words/phrases across the dataset to reduce memory use

The result is a matrix where each row is an email and each column represents the importance
of a specific word or phrase in that email.

In [39]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words='english', max_features=5000)
tfidf_matrix = vectorizer.fit_transform(df['email_text'])
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(), 
    columns=vectorizer.get_feature_names_out(), 
    index=df.index
)
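
Calling .toarray() turns the sparse TF-IDF matrix into a dense one, so memory grows with rows times columns. The back-of-the-envelope sketch below assumes 8 bytes per value (the vectorizer's default float64 output); `dense_megabytes` is a hypothetical helper, and the 50,000-row figure is an invented example, not the size of the real dataset.

```python
def dense_megabytes(n_rows, n_cols, bytes_per_value=8):
    """Approximate size of a dense float matrix in megabytes (hypothetical helper)."""
    return n_rows * n_cols * bytes_per_value / 1e6


# The sample used in this notebook: 2,000 emails x 5,000 features.
print(dense_megabytes(2000, 5000))

# A hypothetical larger dataset of 50,000 emails with the same vocabulary.
print(dense_megabytes(50000, 5000))
```

This is why both sample_size and max_features are kept modest here.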

Our final DataFrame for EDA and modeling combines the sentiment scores, the TF-IDF features, and the phishing/safe label for each email.

In [40]:
final_df = pd.concat(
    [
        sentiment_df[['neg', 'neu', 'pos', 'compound']],  # Sentiment columns
        tfidf_df,                                         # TF-IDF features
        df[['email_type']]                                # Target variable
    ],
    axis=1
)
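
One thing worth checking before saving: if the TF-IDF vocabulary happens to contain a term such as "compound" or "pos", the concatenated frame will have duplicate column names, which can cause confusing selections and may make the Parquet write fail. The sketch below uses a made-up `toy_tfidf` frame, and the 'tfidf_' prefix is only one possible remedy, not something the notebook itself does.

```python
import pandas as pd

# Column names used by the sentiment scores and the target.
reserved = ["neg", "neu", "pos", "compound", "email_type"]

# Hypothetical TF-IDF frame whose vocabulary happens to include "compound".
toy_tfidf = pd.DataFrame({"account": [0.1, 0.0], "compound": [0.0, 0.3]})

# Any overlap here would become a duplicated column after pd.concat(axis=1).
clashes = sorted(set(toy_tfidf.columns) & set(reserved))
print(clashes)

# One possible remedy: prefix the TF-IDF columns so every name stays unique.
toy_tfidf_prefixed = toy_tfidf.add_prefix("tfidf_")
print(list(toy_tfidf_prefixed.columns))
```

If the prefix is adopted, any later notebook that selects TF-IDF columns by name would need to use the prefixed names.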


In [42]:
final_df.to_parquet('../DATA/final.parquet')