**NOTE**
We started building this notebook, but then noticed that the hyperparameter tuning notebook already covers the model evaluation we needed, so this notebook was left incomplete.


# Model Building and Evaluation

In this notebook, we build and evaluate several machine learning models that classify URLs as either `phishing` or `legitimate`.

## Steps:
1. Load the preprocessed dataset
2. Split the dataset into training and testing sets
3. Train multiple machine learning models
4. Evaluate the models using accuracy, precision, recall, and F1-score
5. Perform cross-validation to validate model performance


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:

# Load the preprocessed dataset (replace the path with your actual file path)
data = pd.read_csv('../data/norm_data.csv')

# Define features (X) and target (y)
X = data.copy()
y = X.pop('label')

# Drop the raw URL string; it is an identifier, not a model feature
X = X.drop(columns=['url'])

# Convert the target labels to binary format
y = y.map({'phishing': 1, 'legitimate': 0})

# Split the dataset into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)



We begin by loading the preprocessed dataset and splitting it into training and testing sets.
We use 80% of the data for training the models and reserve 20% for testing the models' performance.

The target variable (`label`) is converted to a binary format where:
- `1` represents `phishing`
- `0` represents `legitimate`
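The split and label mapping described above can be sanity-checked on a small toy frame. This is only a sketch: the `url_length` column and the toy class counts are made up for illustration, and `stratify` is an optional extra that the cell above does not use. It keeps the phishing/legitimate ratio the same in both parts of the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the preprocessed data; 'url_length' is a hypothetical feature.
toy = pd.DataFrame({
    "url_length": range(100),
    "label": ["phishing"] * 40 + ["legitimate"] * 60,
})

toy_y = toy.pop("label").map({"phishing": 1, "legitimate": 0})

# A label string that is not in the mapping (e.g. different casing)
# silently becomes NaN, so it is worth checking before training.
assert toy_y.notna().all(), "unmapped label values found"

# test_size=0.2 gives an 80/20 split; stratify keeps the class ratio in both parts.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(len(toy_X_train), len(toy_X_test))
print(toy_y_train.mean(), toy_y_test.mean())
```

The two printed means are the fraction of phishing samples in each part, which is a quick way to see whether the split preserved the class balance.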


In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Initialize the models
log_reg = LogisticRegression()
tree_clf = DecisionTreeClassifier()
rf_clf = RandomForestClassifier()
svm_clf = SVC()

# Train the models
log_reg.fit(X_train, y_train)
tree_clf.fit(X_train, y_train)
rf_clf.fit(X_train, y_train)
svm_clf.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
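The warning above suggests two remedies: raising `max_iter` or scaling the features. A minimal sketch of combining both in a pipeline follows. It runs on synthetic data rather than the phishing dataset, and the value `max_iter=1000` is an arbitrary assumption; whether scaling helps on `norm_data.csv` depends on how that file was normalized.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the phishing dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_demo[:, 0] *= 10_000  # exaggerate one feature's scale to mimic unscaled inputs

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fit on the training data only and then reused at prediction time,
# so no information from the test set leaks into the scaling step.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X_tr, y_tr)

demo_accuracy = scaled_log_reg.score(X_te, y_te)
iterations_used = scaled_log_reg.named_steps["logisticregression"].n_iter_[0]
print(demo_accuracy, iterations_used)
```

The same pipeline pattern would also apply to `SVC`, which is similarly sensitive to feature scale.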



We are training four machine learning models:
1. **Logistic Regression**: A simple linear model for classification.
2. **Decision Tree**: A non-linear model that splits data based on feature values.
3. **Random Forest**: An ensemble model that builds multiple decision trees and averages their predictions.
4. **Support Vector Machine (SVM)**: A model that finds the hyperplane which best separates the classes.

Each model is trained on the training set (`X_train`, `y_train`).
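To make the "averages their predictions" description of the random forest concrete, the sketch below checks on synthetic data that scikit-learn's forest probabilities match the mean of the per-tree class probabilities. The dataset and the choice of 25 trees are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the phishing dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_toy, y_toy)

# Collect each tree's class probabilities: shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([tree.predict_proba(X_toy) for tree in forest.estimators_])
mean_of_trees = tree_probas.mean(axis=0)

# Compare the per-tree average with the forest's own probabilities.
forest_proba = forest.predict_proba(X_toy)
matches = np.allclose(mean_of_trees, forest_proba)
print(matches)
```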


In [None]:

from sklearn.metrics import accuracy_score, classification_report

# Make predictions
log_reg_pred = log_reg.predict(X_test)
tree_clf_pred = tree_clf.predict(X_test)
rf_clf_pred = rf_clf.predict(X_test)
svm_clf_pred = svm_clf.predict(X_test)

# Evaluate accuracy
print("Logistic Regression Accuracy: ", accuracy_score(y_test, log_reg_pred))
print("Decision Tree Accuracy: ", accuracy_score(y_test, tree_clf_pred))
print("Random Forest Accuracy: ", accuracy_score(y_test, rf_clf_pred))
print("SVM Accuracy: ", accuracy_score(y_test, svm_clf_pred))

# Classification report for more detailed evaluation
# (sep="\n" puts each report on its own lines so the table columns stay aligned)
print("Logistic Regression Report:", classification_report(y_test, log_reg_pred), sep="\n")
print("Decision Tree Report:", classification_report(y_test, tree_clf_pred), sep="\n")
print("Random Forest Report:", classification_report(y_test, rf_clf_pred), sep="\n")
print("SVM Report:", classification_report(y_test, svm_clf_pred), sep="\n")



We evaluate the models using the following metrics:
1. **Accuracy**: The percentage of correct predictions out of all predictions.
2. **Precision**: The percentage of true positives out of all positive predictions.
3. **Recall**: The percentage of true positives out of all actual positives.
4. **F1-score**: The harmonic mean of precision and recall, providing a balance between the two.

We generate detailed classification reports for each model.
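The four definitions above can be worked through by hand on a small example. The labels below are invented purely for illustration (`1` = phishing, `0` = legitimate); the arithmetic follows the definitions given in the list.

```python
# Hypothetical ground truth and predictions: 1 = phishing, 0 = legitimate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # phishing correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # legitimate wrongly flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # phishing that was missed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # legitimate correctly passed

accuracy = (tp + tn) / len(y_true)                   # (3 + 5) / 10 = 0.8
precision = tp / (tp + fp)                           # 3 / 4 = 0.75
recall = tp / (tp + fn)                              # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)   # 0.75

print(accuracy, precision, recall, f1)
```

For phishing detection, recall is often the metric to watch, since a false negative means a phishing URL was let through.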


In [None]:

from sklearn.model_selection import cross_val_score

# Perform cross-validation for each model
log_reg_cv = cross_val_score(log_reg, X_train, y_train, cv=5)
tree_clf_cv = cross_val_score(tree_clf, X_train, y_train, cv=5)
rf_clf_cv = cross_val_score(rf_clf, X_train, y_train, cv=5)
svm_clf_cv = cross_val_score(svm_clf, X_train, y_train, cv=5)

print("Logistic Regression Cross-Val Scores: ", log_reg_cv)
print("Decision Tree Cross-Val Scores: ", tree_clf_cv)
print("Random Forest Cross-Val Scores: ", rf_clf_cv)
print("SVM Cross-Val Scores: ", svm_clf_cv)



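We also perform cross-validation with 5 folds to assess the stability and generalizability of the models.
Cross-validation does not reduce overfitting by itself; it helps detect it. Each fold trains a fresh copy of the model on four fifths of the training data and scores it on the held-out fifth, so the spread of the five scores (accuracy by default for classifiers) shows how much performance depends on the particular split.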
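Since the steps at the top list precision, recall, and F1-score alongside accuracy, one possible extension is `cross_validate`, which accepts several scoring metrics at once. This is a sketch on synthetic data, with an arbitrarily chosen forest size, not a result from the phishing dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data, not the phishing dataset.
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=0)

metrics = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_cv,
    y_cv,
    cv=5,
    scoring=metrics,
)

# Summarize each metric as mean and standard deviation over the 5 folds.
for metric in metrics:
    scores = cv_results[f"test_{metric}"]
    print(f"{metric}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean together with the standard deviation makes it easier to compare models than reading five raw scores per model.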
