### Question 8

- Using Python, build a social engineering detection model (e.g., phishing email detection) with text feature extraction (TF-IDF or word embeddings). 
- Provide the Python code and describe the model evaluation metrics.

ref: https://huggingface.co/datasets/zefang-liu/phishing-email-dataset/blob/main/Phishing_Email.csv

### Using dataset from HuggingFace

Reading data

- The 'Unnamed: 0' column is only a leftover row index, so it can be removed.

In [1]:
import pandas as pd 

df = pd.read_csv("Phishing_Email.csv")
df

Unnamed: 0.1,Unnamed: 0,Email Text,Email Type
0,0,"re : 6 . 1100 , disc : uniformitarianism , re ...",Safe Email
1,1,the other side of * galicismos * * galicismo *...,Safe Email
2,2,re : equistar deal tickets are you still avail...,Safe Email
3,3,\nHello I am your hot lil horny toy.\n I am...,Phishing Email
4,4,software at incredibly low prices ( 86 % lower...,Phishing Email
...,...,...,...
18645,18646,date a lonely housewife always wanted to date ...,Phishing Email
18646,18647,request submitted : access request for anita ....,Safe Email
18647,18648,"re : important - prc mtg hi dorn & john , as y...",Safe Email
18648,18649,press clippings - letter on californian utilit...,Safe Email


Examining data

- 'Unnamed: 0' has 18650 non-null integer values, one per row, which confirms that it is only an index column.
- 'Email Text' has 18634 non-null entries while 'Email Type' has 18650, so 16 rows have a null 'Email Text'.

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18650 entries, 0 to 18649
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  18650 non-null  int64 
 1   Email Text  18634 non-null  object
 2   Email Type  18650 non-null  object
dtypes: int64(1), object(2)
memory usage: 437.2+ KB


### Preprocessing

Removing 'Unnamed: 0' or index column

- .drop() removes the column; axis=1 targets columns and inplace=True modifies the DataFrame directly.
- .info() is run again to confirm that the column is gone.

In [3]:
df.drop('Unnamed: 0', axis=1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18650 entries, 0 to 18649
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Email Text  18634 non-null  object
 1   Email Type  18650 non-null  object
dtypes: object(2)
memory usage: 291.5+ KB


Removing null values in 'Email Text' column

- df is replaced with a filtered DataFrame that keeps only the rows where 'Email Text' is not null.
- .notna() returns a boolean mask that is True for non-null values, and that mask is used to select the rows.
- .info() is run again to confirm that the null values were removed.

In [4]:
df = df[df['Email Text'].notna()]

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 18634 entries, 0 to 18649
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Email Text  18634 non-null  object
 1   Email Type  18634 non-null  object
dtypes: object(2)
memory usage: 436.7+ KB


Defining X and y

In [5]:
X = df['Email Text']
y = df['Email Type']

Checking for class imbalance

- .value_counts() counts the number of instances in each class

In [6]:
count = y.value_counts()

count

Email Type
Safe Email        11322
Phishing Email     7312
Name: count, dtype: int64
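
To put the counts above in proportion, the minimal sketch below computes each class share and the accuracy that a trivial classifier would reach by always predicting the majority class. The counts are copied from the output above and the variable names are illustrative; the baseline only carries over to the test set if it has a similar class mix.

```python
# Class counts copied from the value_counts() output above.
counts = {"Safe Email": 11322, "Phishing Email": 7312}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class is right
# exactly as often as the majority class occurs.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Roughly 61% of the emails are safe and 39% are phishing, so the imbalance is moderate rather than extreme.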

Train Test Split
- The train-test split uses an 80:20 ratio

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
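
As a quick check of the 80:20 split, the sketch below reproduces the expected set sizes by hand. It assumes that scikit-learn rounds a fractional test_size up to a whole number of samples and gives the remainder to the training set; the resulting test size agrees with the support total of 3727 in the classification report.

```python
import math

n_samples = 18634   # rows left after dropping null 'Email Text' values
test_size = 0.2

# Assumption: the test share is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```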

Downsampling for equal number of instances in each class
- Downsampling is done after the train-test split because only the training set should be balanced
- Balancing the training set avoids biasing the model towards the majority class during training
- The test set is not downsampled so that it keeps the original class distribution, which better reflects real-world data

In [8]:
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy='majority')
X_train, y_train = rus.fit_resample(X_train.to_frame(), y_train)

In [9]:
print('Training shapes: ', X_train.shape, y_train.shape)

Training shapes:  (11588, 1) (11588,)
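
The 11588 rows above are what balancing to the minority class would produce: 7312 phishing emails in total, minus the 1518 that landed in the test set (the support shown in the classification report), leaves 5794 in training, and 2 × 5794 = 11588. As an illustration of the idea only, here is a plain-Python sketch of majority-class undersampling. The function name undersample_majority and its seed parameter are hypothetical, and this is not how imbalanced-learn implements RandomUnderSampler.

```python
import random
from collections import Counter

def undersample_majority(texts, labels, seed=0):
    """Randomly drop rows until every class has the minority-class count.

    With two classes this is the same as undersampling only the majority class.
    """
    counts = Counter(labels)
    minority_count = min(counts.values())
    rng = random.Random(seed)

    keep = []
    for label in counts:
        idx = [i for i, y in enumerate(labels) if y == label]
        # Sampling without replacement; the minority class keeps every row.
        keep.extend(rng.sample(idx, minority_count))
    keep.sort()  # preserve the original row order
    return [texts[i] for i in keep], [labels[i] for i in keep]

texts = ["a", "b", "c", "d", "e"]
labels = ["Safe Email", "Safe Email", "Safe Email", "Phishing Email", "Phishing Email"]

bal_texts, bal_labels = undersample_majority(texts, labels)
print(Counter(bal_labels))
```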


In [10]:
print(type(X_train))

<class 'pandas.core.frame.DataFrame'>


### Modelling

#### Logistic Regression Pipeline

##### TF-IDF

From examining the data earlier, it is known that both X and y have the 'object' dtype, i.e. they contain text.
Therefore, the email text must be converted into a numeric format before modelling; here a TF-IDF vectorizer is used.
TF-IDF weights each term by how often it appears in an email, scaled down by how many emails contain it, so the most frequently used words do not necessarily receive the heaviest weights.

Although it is a preprocessing step, it appears in the 'Modelling' section because it is part of the Logistic Regression pipeline, which keeps the code short and clean. The following parameters are specified:
- ngram_range=(1,3) means that the extracted features can be unigrams (1-word phrases), bigrams (2-word phrases) or trigrams (3-word phrases)
- max_features=10000 means that only the 10000 n-grams with the highest term frequency across the training corpus are kept in the vocabulary
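
To make the weighting and the n-gram range concrete, here is a hand-rolled sketch on a toy corpus. It assumes the smoothed formula that scikit-learn documents for its defaults, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each document vector, and it tokenises on whitespace, which is simpler than TfidfVectorizer's default token pattern. The helper names ngrams and tfidf and the example sentences are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def tfidf(docs, n_min=1, n_max=1):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    tokenised = [ngrams(d.lower().split(), n_min, n_max) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in tokenised for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for doc in tokenised:
        tf = Counter(doc)
        raw = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

docs = [
    "click here to claim your prize",
    "click here to view the meeting notes",
    "your invoice for the meeting",
]

print(ngrams("click here to".split()))   # unigrams, bigrams and the trigram
weights = tfidf(docs)
print(sorted(weights[0].items(), key=lambda kv: -kv[1]))
```

In the first toy document, 'prize' and 'claim' occur in only one document and therefore receive higher weights than 'click', which occurs in two.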

##### Logistic Regression
- A standard choice for binary classification, suitable for classifying emails as phishing or safe.
- max_iter=1000 sets the maximum number of solver iterations, giving the optimizer more room to converge than the default of 100.
- .fit() is called with X_train.iloc[:, 0] because TfidfVectorizer expects a 1-dimensional sequence of strings. After resampling, X_train is a single-column DataFrame (2-dimensional), so .iloc[:, 0] selects that column as a Series of email texts.

In [11]:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

lr_pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1,3), max_features= 10000),
    LogisticRegression(max_iter=1000)
)

lr_pipeline.fit(X_train.iloc[:, 0], y_train)
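
The reason for .iloc[:, 0] can be seen in a small sketch with a toy single-column DataFrame (the example texts are made up). Iterating over a DataFrame yields its column labels rather than its rows, so anything that simply iterates over its input would see one "document" named after the column instead of the emails.

```python
import pandas as pd

frame = pd.DataFrame({"Email Text": ["first email body", "second email body"]})

# Iterating over a DataFrame yields its column labels, not its rows.
as_frame = list(frame)

# Selecting the single column gives a Series, which iterates over the texts.
as_series = list(frame.iloc[:, 0])

print(as_frame)
print(as_series)
```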

### Evaluation

Metrics Used
- Classification Report
    - Accuracy: Measures the proportion of correctly predicted instances (both phishing and safe) over the total samples
    - Precision: Indicates how many emails predicted as phishing were actually phishing. High precision means few false alarms, which matters where false positives are costly.
    - Recall: Shows how many actual phishing emails were correctly identified. High recall means few real phishing emails are missed.
    - F1-Score: Harmonic mean of precision and recall, balancing both metrics

- Confusion Matrix
    - False Positives: Safe emails incorrectly flagged as phishing
    - False Negatives: Phishing emails missed and marked as safe
    - True Positives: Correctly classified phishing emails
    - True Negatives: Correctly classified safe emails


Based on the classification report (precision and recall are quoted for the phishing class),

- Accuracy: 96.24% of test emails are classified correctly
- Precision: 93.06% of emails flagged as phishing are actually phishing
- Recall: 98.09% of actual phishing emails are detected
- F1-Score: 95.51% (phishing) and 96.77% (safe) show balanced performance across both classes

In [12]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred_lr = lr_pipeline.predict(X_test)

print(classification_report(y_test, y_pred_lr, digits=4))

                precision    recall  f1-score   support

Phishing Email     0.9306    0.9809    0.9551      1518
    Safe Email     0.9864    0.9498    0.9677      2209

      accuracy                         0.9624      3727
     macro avg     0.9585    0.9653    0.9614      3727
  weighted avg     0.9637    0.9624    0.9626      3727



Confusion matrix results (rows are true labels, columns are predicted labels, both ordered as Phishing Email, Safe Email):

- TP = 1489
- FN = 29
- FP = 111
- TN = 2098

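The low FN count shows that few phishing emails are missed (29 of 1518, about 1.9%).
The high TN count and moderate FP count show that most safe emails are correctly classified, although 111 of 2209 (about 5.0%) are incorrectly flagged as phishing.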

In [13]:
print(confusion_matrix(y_test, y_pred_lr))

[[1489   29]
 [ 111 2098]]
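
As a cross-check, the classification report figures for the phishing class can be recomputed by hand from the four counts above. This is a minimal sketch using the standard metric definitions; the variable names are illustrative.

```python
# Counts taken from the confusion matrix above (phishing = positive class).
tp, fn, fp, tn = 1489, 29, 111, 2098

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)   # of the emails flagged as phishing, how many were phishing
recall = tp / (tp + fn)      # of the actual phishing emails, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

To four decimal places, these values agree with the Phishing Email row and the accuracy line of the classification report.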
