# Machine Learning for Predicting Phishing URLs

Building on the PhishGuardX project, which educated users about phishing threats by running simulated phishing campaigns, this part of the Something Awesome project focuses on a critical component of those attacks: fraudulent URLs.

Phishing attacks often rely on deceptive URLs that lead unsuspecting users to malicious websites designed to steal sensitive information, as PhishGuardX demonstrated. To counter this threat, the objective of this part of the project is to develop a system that automatically classifies URLs as legitimate or malicious with a high degree of accuracy.

By doing so, I aim to enhance online security and protect users from falling victim to phishing scams. 

Because we are deciding whether or not a URL is a phishing scam, this is a binary classification problem.


In [18]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import xgboost as xgb

### Step 1: Data Processing

In [19]:
# Load phishing dataset
df = pd.read_csv('phishing_dataset.csv')
df = df.drop(columns=['id'])
df.head()

Unnamed: 0,NumDots,SubdomainLevel,PathLevel,UrlLength,NumDash,NumDashInHostname,AtSymbol,TildeSymbol,NumUnderscore,NumPercent,...,IframeOrFrame,MissingTitle,ImagesOnlyInForm,SubdomainLevelRT,UrlLengthRT,PctExtResourceUrlsRT,AbnormalExtFormActionR,ExtMetaScriptLinkRT,PctExtNullSelfRedirectHyperlinksRT,CLASS_LABEL
0,3,1,5,72,0,0,0,0,0,0,...,0,0,1,1,0,1,1,-1,1,1
1,3,1,3,144,0,0,0,0,2,0,...,0,0,0,1,-1,1,1,1,1,1
2,3,1,2,58,0,0,0,0,0,0,...,0,0,0,1,0,-1,1,-1,0,1
3,3,1,6,79,1,0,0,0,0,0,...,0,0,0,1,-1,1,1,1,-1,1
4,3,0,4,46,0,0,0,0,0,0,...,1,0,0,1,1,-1,0,-1,-1,1


In [20]:
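# Separate the features from the target label
X = df.drop("CLASS_LABEL", axis=1)
y = df["CLASS_LABEL"]
# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)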
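
The split above does not stratify on the label. As a minimal sketch, assuming a label column named `CLASS_LABEL` as in the dataset above, passing `stratify=` keeps the class ratio the same in both partitions. The toy DataFrame and its `feature` column are made up purely for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with perfectly balanced labels
toy = pd.DataFrame({
    "feature": range(100),
    "CLASS_LABEL": [0, 1] * 50,
})

X_toy = toy.drop("CLASS_LABEL", axis=1)
y_toy = toy["CLASS_LABEL"]

# stratify=y_toy preserves the 50/50 label ratio in both train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.value_counts(normalize=True).to_dict())
print(y_te.value_counts(normalize=True).to_dict())
```

With a dataset that is already close to balanced, the effect is small, but stratifying removes one source of run-to-run variation in the reported metrics.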

### Step 2: Model Selection and Hyperparameter Tuning
The Random Forest Classifier is a strong, widely used model for classification problems. A random forest uses ensemble learning: it trains many decision trees and combines their predictions into one. This ensemble approach is often more accurate than any individual decision tree ("wisdom of the crowd"). XGBoost, a gradient-boosted tree ensemble, is trained alongside it for comparison, and its hyperparameters are tuned with a grid search.

Random forests are also less prone to overfitting than a single decision tree, which makes them a good choice for classifying URLs, especially when the dataset has many features. Because each tree is trained on a bootstrap sample and the predictions are aggregated across trees, they also tend to be fairly robust to outliers in the data.
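
To make the ensemble argument concrete, here is a small sketch that compares a single decision tree with a random forest using 5-fold cross-validation. It runs on synthetic data from `make_classification`, not on the phishing dataset, and the sample and feature counts are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the phishing dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Mean 5-fold cross-validated accuracy for each model
tree_acc = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="accuracy").mean()
forest_acc = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="accuracy").mean()

print(f"Single tree:   {tree_acc:.3f}")
print(f"Random forest: {forest_acc:.3f}")
```

On data like this the forest usually scores higher, though the exact gap depends on the data and the random seed.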

In [21]:
# Hyperparameter grid for the XGBoost classifier
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.1, 0.01, 0.001],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

# Random forest baseline with fixed settings (not tuned)
model_forest = RandomForestClassifier(n_estimators=100, random_state=42)
model_xgb = xgb.XGBClassifier()

# Exhaustive search over the grid with 5-fold cross-validation, scored by accuracy
grid_search = GridSearchCV(estimator=model_xgb, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Find the best hyperparameters
best_params = grid_search.best_params_
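
The cost of an exhaustive grid search grows multiplicatively with each parameter. The sketch below counts the candidates for the grid used above with scikit-learn's `ParameterGrid`. The names `cv_folds`, `n_candidates`, and `n_fits` are introduced here for illustration only.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 4, 5],
    "learning_rate": [0.1, 0.01, 0.001],
    "subsample": [0.7, 0.8, 0.9],
    "colsample_bytree": [0.7, 0.8, 0.9],
}

cv_folds = 5
n_candidates = len(ParameterGrid(param_grid))  # 3 values for each of 5 parameters
n_fits = n_candidates * cv_folds               # one fit per candidate per fold

print(n_candidates, n_fits)  # 243 1215
```

`GridSearchCV` also refits the best candidate once on the full training set, since `refit=True` is the default, so the search above trains 1216 models in total.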

### Step 3: Model Training

In [22]:
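# Train the random forest baseline
model_forest.fit(X_train, y_train)

# Retrain XGBoost using the best hyperparameters found by the grid search
best_xgb_model = xgb.XGBClassifier(**best_params)
best_xgb_model.fit(X_train, y_train)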

### Step 4: Model Evaluation

In [15]:
y_pred_forest = model_forest.predict(X_test)
accuracy_forest = accuracy_score(y_test, y_pred_forest)
report_forest = classification_report(y_test, y_pred_forest)

print(f"Accuracy: {accuracy_forest:.2f}")
print(report_forest)
print("---------------------------------------------------------------------------")

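# Evaluate the tuned XGBoost model (calling model_forest.predict here would simply repeat the random forest's report)
y_pred_xgb = best_xgb_model.predict(X_test)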
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
report_xgb = classification_report(y_test, y_pred_xgb)

print(f"Accuracy: {accuracy_xgb:.2f}")
print(report_xgb)

Accuracy: 0.98
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       988
           1       0.98      0.98      0.98      1012

    accuracy                           0.98      2000
   macro avg       0.98      0.98      0.98      2000
weighted avg       0.98      0.98      0.98      2000

-------------------------
Accuracy: 0.98
              precision    recall  f1-score   support

           0       0.98      0.98      0.98       988
           1       0.98      0.98      0.98      1012

    accuracy                           0.98      2000
   macro avg       0.98      0.98      0.98      2000
weighted avg       0.98      0.98      0.98      2000

-------------------------
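
For phishing detection, a missed phishing URL (a false negative) is arguably more costly than a false alarm, and accuracy alone does not separate the two. The sketch below shows how `confusion_matrix` breaks the errors down. It uses hand-written toy labels rather than the models' real predictions, and it assumes that label 1 denotes phishing, which the dataset description above does not state explicitly.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (assumption: 1 = phishing, 0 = legitimate)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_hat = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

recall_phishing = tp / (tp + fn)  # share of phishing URLs that were caught
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")         # TN=3 FP=1 FN=1 TP=3
print(f"Phishing recall: {recall_phishing:.2f}")  # Phishing recall: 0.75
```

The same call can be applied to `y_test` and each model's predictions to see how many phishing URLs slip through.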
