It's time to start modeling. First, let's import the dependencies we need and load the data. 

In [1]:
# Filter the unnecessary warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Import numpy
import numpy as np

# Fix the random seed
np.random.seed(7)

In [3]:
# Load the NumPy arrays that will serve as our datasets from now on
X_train, y_train = np.load("X_train.npy", allow_pickle=True), np.load("y_train.npy", allow_pickle=True)
X_test, y_test = np.load("X_test.npy", allow_pickle=True), np.load("y_test.npy", allow_pickle=True)

Let's instantiate the Logistic Regression model and fit it to the training data. 

In [12]:
# Other imports
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
from sklearn.linear_model import LogisticRegression
import wandb
import time

I like to define a utility function for training machine learning models that also measures the training time and performance. 

In [13]:
def train_eval_pipeline(model, train_data, test_data, name):
    # Initialize Weights and Biases
    wandb.init(project="phishing-websites-detection", name=name)

    # Unpack the datasets
    (X_train, y_train) = train_data
    (X_test, y_test) = test_data

    # Train the model and measure the training time (in seconds)
    start = time.time()
    model.fit(X_train, y_train)
    training_time = time.time() - start
    prediction = model.predict(X_test)

    # Compute the metrics once and reuse them for logging and printing
    accuracy = accuracy_score(y_test, prediction) * 100.0
    precision, recall, _, _ = precision_recall_fscore_support(y_test, prediction, average='macro')

    # Log all the necessary metrics
    wandb.log({"accuracy": accuracy,
               "precision": precision,
               "recall": recall,
               "training_time": training_time})

    # Note: the messages below are hardcoded and do not depend on `name`
    print("Accuracy score of the Logistic Regression classifier with default hyperparameter values {0:.2f}%"
              .format(accuracy))
    print("\n")
    print("----Classification report of the Logistic Regression classifier with default hyperparameter value----")
    print("\n")
    print(classification_report(y_test, prediction, target_names=["Phishing Websites", "Normal Websites"]))

In [14]:
logreg = LogisticRegression()
train_eval_pipeline(logreg, (X_train, y_train),
                         (X_test, y_test), "logistic_regression")

wandb: Wandb version 0.8.19 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade


Accuracy score of the Logistic Regression classifier with default hyperparameter values 92.46%


----Classification report of the Logistic Regression classifier with default hyperparameter value----


                   precision    recall  f1-score   support

Phishing Websites       0.93      0.90      0.91      3924
  Normal Websites       0.92      0.94      0.93      4920

        micro avg       0.92      0.92      0.92      8844
        macro avg       0.92      0.92      0.92      8844
     weighted avg       0.92      0.92      0.92      8844
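
The `macro avg` and `weighted avg` rows in the report summarize the per-class rows in two different ways. As a rough illustration, the sketch below recomputes both averages for precision from the rounded numbers printed above. Because the inputs are already rounded to two decimals, the results are approximate, and the variable names are only illustrative.

```python
# Per-class precision and support, copied from the classification report above.
precision = {"Phishing Websites": 0.93, "Normal Websites": 0.92}
support = {"Phishing Websites": 3924, "Normal Websites": 4920}

total = sum(support.values())

# Macro average: plain mean over the classes, ignoring class sizes.
macro_precision = sum(precision.values()) / len(precision)

# Weighted average: each class contributes in proportion to its support.
weighted_precision = sum(precision[c] * support[c] for c in precision) / total

print(f"macro precision:    {macro_precision:.3f}")
print(f"weighted precision: {weighted_precision:.3f}")
```

With classes this close in size and score, the two averages are nearly identical. They would diverge more on a heavily imbalanced dataset.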



Can we improve this model? A good place to start is tuning the model's hyperparameters. Let's first define a grid of values for the hyperparameters we would like to tune. We will use *random search* for hyperparameter tuning. 

In [None]:
# Import RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [15]:
# Define the grid of values
penalty = ["l1", "l2"]
C = [0.8, 0.9, 1.0]
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200, 250]

# Create a dictionary where penalty, C, tol and max_iter are keys and
# the lists of their values are the corresponding values
param_grid = dict(penalty=penalty, C=C, tol=tol, max_iter=max_iter)
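
To get a feel for how large this search space is and how random search treats it, here is a small sketch using scikit-learn's `ParameterGrid` and `ParameterSampler` utilities. Neither is used elsewhere in this notebook, and `random_state=7` is an assumption made only so that the draw is repeatable.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same grid as above, repeated here so the snippet runs on its own.
param_grid = dict(
    penalty=["l1", "l2"],
    C=[0.8, 0.9, 1.0],
    tol=[0.01, 0.001, 0.0001],
    max_iter=[100, 150, 200, 250],
)

# Every combination an exhaustive grid search would try: 2 * 3 * 3 * 4.
n_combinations = len(ParameterGrid(param_grid))
print("Grid size:", n_combinations)

# Random search evaluates only n_iter of them
# (RandomizedSearchCV uses n_iter=10 unless told otherwise).
sampled = list(ParameterSampler(param_grid, n_iter=10, random_state=7))
for params in sampled[:3]:
    print(params)
```

Because only a subset of the combinations is evaluated, random search is cheaper than an exhaustive grid search, but it may miss the best combination in the grid.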

Now that we have defined the grid, let's use it to find a good set of hyperparameter values. 

In [16]:
# Instantiate RandomizedSearchCV with the required parameters
random_model = RandomizedSearchCV(estimator=logreg, param_distributions=param_grid, cv=5)

# Fit random_model to the data
random_model_result = random_model.fit(X_train, y_train)

# Summarize results
best_score, best_params = random_model_result.best_score_, random_model_result.best_params_
print("Best score: %.2f using %s" % (best_score*100., best_params))

Best score: 92.44 using {'tol': 0.0001, 'penalty': 'l1', 'max_iter': 100, 'C': 1.0}


Random search did not help much: the best cross-validated score (92.44%) is about the same as the 92.46% test accuracy of the default model. Just to be sure, let's train another Logistic Regression model with tuned hyperparameter values and evaluate it on the test set. 

As an additional exercise, you might want to define a distribution instead of a grid of values, let the random search algorithm sample values from that distribution, and see the results. Follow [this article](https://blog.floydhub.com/guide-to-hyperparameters-search-for-deep-learning-models/) to learn more about this process. 
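
A minimal sketch of that exercise might look like the following. It makes several assumptions that are not part of this notebook: the data is a synthetic stand-in generated with `make_classification`, the ranges of the distributions are arbitrary, `n_iter=20` is picked for illustration, and `scipy.stats.loguniform` requires SciPy 1.4 or newer. The `liblinear` solver is set explicitly because it supports both the `l1` and `l2` penalties.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data so the sketch runs on its own;
# replace with X_train, y_train to try it on the real dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=7)

# Distributions instead of fixed lists. The ranges are illustrative assumptions.
param_distributions = {
    "penalty": ["l1", "l2"],         # lists are still sampled uniformly
    "C": loguniform(1e-2, 1e2),      # continuous, log-scaled
    "tol": loguniform(1e-5, 1e-2),   # continuous, log-scaled
    "max_iter": randint(100, 301),   # integers from 100 to 300
}

search = RandomizedSearchCV(
    estimator=LogisticRegression(solver="liblinear"),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    random_state=7,
)
search.fit(X_demo, y_demo)

print("Best score: %.2f using %s" % (search.best_score_ * 100.0, search.best_params_))
```

Sampling `C` and `tol` on a log scale spreads the draws evenly across orders of magnitude, which is a common choice for regularization strengths and tolerances.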

Let's first log the hyperparameter values with which we are going to train the model. 

In [18]:
config = wandb.config

config.tol = 0.001
config.penalty = "l1"
config.C = 1.0


In [20]:
# Train the model
logreg = LogisticRegression(tol=config.tol, penalty=config.penalty, max_iter=250, C=config.C)
train_eval_pipeline(logreg, (X_train, y_train),
                         (X_test, y_test), "logistic-regression-random-search")


Accuracy score of the Logistic Regression classifier with default hyperparameter values 92.48%


----Classification report of the Logistic Regression classifier with default hyperparameter value----


                   precision    recall  f1-score   support

Phishing Websites       0.93      0.90      0.91      3924
  Normal Websites       0.92      0.94      0.93      4920

        micro avg       0.92      0.92      0.92      8844
        macro avg       0.93      0.92      0.92      8844
     weighted avg       0.92      0.92      0.92      8844

