# How to Use the PhishBuster Model

This notebook shows how to use PhishBuster to classify emails as safe or unsafe.

## Imports

You need to import these libraries into your program.

In [1]:
# Import required libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from textblob import TextBlob
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
import pickle

## Stopwords

You need to load the NLTK English stopword list for text preprocessing and assign it to the *stop_words* variable.

In [2]:
# Load stopword lists used in text preprocessing
nltk.download('stopwords')
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tytoa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load Pickle Files

You need to load the two pickle files for the vectorizer and classifier models. They are in the pickle folder.

In [3]:
# Load pickle files with vectorizer and classifier
with open('pickle/vectorizer.pkl', 'rb') as f:
    vectorizer = pickle.load(f)
with open('pickle/classifier.pkl', 'rb') as f:
    classifier = pickle.load(f)
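
As a minimal sketch of the save-and-load round trip behind these files, the example below pickles a stand-in object to a temporary file and reads it back. The `demo_model` dictionary and the `demo.pkl` file name are illustrative assumptions, not part of PhishBuster; the real files hold a fitted vectorizer and classifier. Note that `pickle.load` can execute arbitrary code, so only load pickle files from a source you trust.

```python
import os
import pickle
import tempfile

# Stand-in object (assumption); the notebook's pickles hold a fitted vectorizer and classifier
demo_model = {'name': 'demo', 'threshold': 0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.pkl')

    # Write in binary mode; the with-block closes the file even if dump raises
    with open(path, 'wb') as f:
        pickle.dump(demo_model, f)

    # Read back in binary mode, mirroring how the notebook loads its files
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored object compares equal to the original
print(restored == demo_model)
```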

## Define Text Preprocessing Function

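You need to define the text preprocessing function that cleans and normalizes the raw email text. The lemmatization step relies on the NLTK WordNet corpus, which must be downloaded with `nltk.download('wordnet')` if it is not already installed.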

In [4]:
# Define a text pre-processing function
# @ param: text - A string containing the raw email text to clean
# @ return - A string with the cleaned email text
def preprocess_text(text):
    # Construct a TextBlob object from the text
    blob = TextBlob(text)

    # Convert text to lowercase
    words = [word.lower() for word in blob.words]

    # Keep purely alphabetic tokens (drops punctuation, numbers, and mixed tokens)
    words = [word for word in words if word.isalpha()]
    
    # Remove stop words
    words = [word for word in words if word not in stop_words]

    # Normalize each word to its lemma (base form)
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]

    # Join the words
    text = ' '.join(words)
    return text
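
To make the effect of each cleaning step concrete, here is a dependency-free approximation of the pipeline above. It is a sketch under stated assumptions, not the notebook's function: `preprocess_sketch` and `DEMO_STOP_WORDS` are hypothetical names, a regular expression stands in for TextBlob's tokenizer, the stop-word set is a tiny subset of the NLTK list, and lemmatization is omitted (so "accounts" is not reduced to "account").

```python
import re

# Hypothetical miniature stop-word set; the notebook uses NLTK's full English list
DEMO_STOP_WORDS = {'this', 'is', 'an', 'a', 'the', 'to'}

def preprocess_sketch(text):
    # Lowercase, then split into runs of letters, digits, and apostrophes
    # (a rough stand-in for TextBlob's word tokenizer)
    tokens = re.findall(r"[a-z0-9']+", text.lower())

    # Keep purely alphabetic tokens, which drops numbers such as '2'
    tokens = [t for t in tokens if t.isalpha()]

    # Remove stop words
    tokens = [t for t in tokens if t not in DEMO_STOP_WORDS]

    # Join the remaining words back into one string
    return ' '.join(tokens)

result = preprocess_sketch('This is an URGENT email: verify 2 accounts NOW!')
print(result)  # urgent email verify accounts now
```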

## Define Single Prediction Function

You can make single predictions on a string by defining this function.

In [5]:
# Define a function to preprocess and vectorize an input string and make a prediction
# @ param: text - A string containing the raw email text to classify
# @ return - An integer, either 0 if the email is safe or 1 if the email is unsafe
def make_prediction(text):
    clean_text = [preprocess_text(text)]
    input_df = pd.DataFrame(data=clean_text, columns=['clean_text'])
    vect = vectorizer.transform(input_df.clean_text)
    pred = classifier.predict(vect)
    # pred is a one-element array; compare its single value explicitly
    return 0 if pred[0] == 0 else 1

You call the *make_prediction* function like this.

In [6]:
safe_text = 'this is a nice email :)'
print(f'Text: {safe_text} Classification: {make_prediction(safe_text)}')

unsafe_text = 'This is an unsafe email. Sexy! Horny!'
print(f'Text: {unsafe_text} Classification: {make_prediction(unsafe_text)}')

Text: this is a nice email :) Classification: 0
Text: This is an unsafe email. Sexy! Horny! Classification: 1
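
Because the function returns a bare integer, it can help to translate it into a readable label before showing it to a user. The helper below is a small sketch; `LABELS` and `describe_prediction` are hypothetical names that do not appear in the notebook, and the mapping simply follows the 0 = safe, 1 = unsafe convention documented above.

```python
# Hypothetical label names; the notebook itself only returns the integers 0 and 1
LABELS = {0: 'safe', 1: 'unsafe'}

def describe_prediction(label):
    # Fall back to 'unknown' for any value outside the documented 0/1 range
    return LABELS.get(int(label), 'unknown')

print(describe_prediction(0))  # safe
print(describe_prediction(1))  # unsafe
```

With the notebook's function in scope, this could be called as `describe_prediction(make_prediction(safe_text))`.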


## Define Batch Prediction Function

You can make batch predictions on a dataframe by defining this function.

In [15]:
# Define a function to preprocess and vectorize an input dataframe and make predictions
# @ param: df - A dataframe with one column called 'text' with values being strings of raw email text
# @ return - A dataframe with the 'text' column and a 'classification' column of integers, 0 for safe, 1 for unsafe
def batch_prediction(df):
    # Work on a copy so the caller's dataframe is not modified
    df = df.copy()
    df['clean_text'] = df['text'].astype(str).apply(preprocess_text)
    vect = vectorizer.transform(df.clean_text)
    pred = classifier.predict(vect)
    df['classification'] = pred
    df = df[['text', 'classification']]
    return df

You call the *batch_prediction* function like this.

In [16]:
data = ['this is a nice email :)', 'This is an unsafe email. Sexy! Horny!']
df_batch = pd.DataFrame(data=data, columns=['text'])
batch_prediction(df_batch)

| | text | classification |
|---|---|---|
| 0 | this is a nice email :) | 0 |
| 1 | This is an unsafe email. Sexy! Horny! | 1 |
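
Once a batch has been classified, a common next step is to summarize the results and pull out the flagged rows. The sketch below assumes a dataframe shaped like the output of *batch_prediction*; the `results` dataframe and its values are illustrative stand-ins, not real model output.

```python
import pandas as pd

# Stand-in for the dataframe returned by batch_prediction (values are illustrative)
results = pd.DataFrame({
    'text': ['first email body', 'second email body', 'third email body'],
    'classification': [0, 1, 1],
})

# Count how many emails fall into each class
counts = results['classification'].value_counts().to_dict()

# Keep only the rows flagged as unsafe and renumber them from zero
unsafe = results[results['classification'] == 1].reset_index(drop=True)

print(counts)
print(unsafe['text'].tolist())
```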
