<a href="https://colab.research.google.com/github/sugatoray/Manning-Phishing-Websites-Detection/blob/master/src/notebooks/MPWD.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project: _Use Machine Learning to Detect Phishing Websites_

+ [Project home page @manning][#project-home-manning]
+ [Project Repo@GitHub - core][#project-github-core]
+ [Project Repo@GitHub - sugatoray][#project-github-sugatoray]


[#project-home-manning]: https://liveproject.manning.com/module/101_1_1/use-machine-learning-to-detect-phishing-websites/introduction/about-this-liveproject?

[#project-github-core]: https://github.com/sayakpaul/Manning-Phishing-Websites-Detection
[#project-github-sugatoray]: https://github.com/sugatoray/Manning-Phishing-Websites-Detection

---

## Instructions

In this liveProject (**Manning Phishing Websites Detection**: _MPWD_), you will fill the role of a data scientist working for an organization's cybersecurity manager. Lately, the organization's employees have been receiving many emails that contain links to phishing websites. Your task is to develop a machine learning model that predicts whether a website linked from an email is a phishing website.

Phishing attacks are among the most common online security threats. They can breach an organization's defenses to extract confidential information such as user passwords and financial data. The [Internet Crime Report 2018](https://www.ic3.gov/media/2018.aspx) presents the effects of phishing websites.

Your first assignment as a newly on-boarded data scientist is to build upon the following steps to develop a phishing websites classifier:

-   Load and understand a tabular dataset. As a data scientist, you should be comfortable working with tabular data.

-   Query the dataset to derive interesting reports.

-   Clean the dataset accordingly so that it is well-suited for a machine learning model.

-   Build and train machine learning models, like Logistic Regression and Neural Networks.

-   Tune hyperparameters with techniques such as random search.

-   Provide a summary of the performance of the machine learning models.

Experiment Tracking: 

We will use [`wandb`][#wandb-github] for experiment tracking.

[#wandb-github]: https://github.com/wandb/client#quickstart

**Installation**:  

```
!pip install wandb
```

**Example Use**:  



## Install Packages

- `pandas_profiling`
- `wandb`

In [101]:
!pip install -U -q pandas_profiling[notebook]

In [0]:
!pip install -U -q wandb

In [100]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pandas_profiling as ppf

import wandb
import time

import os, json
from IPython.display import display, clear_output

from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases
import warnings
warnings.resetwarnings()
print("WARNINGS ARE BEING IGNORED!!!")
warnings.filterwarnings("ignore")

%matplotlib inline

# set numpy random generator seed 
# for reproducibility
seed = 42
np.random.seed(seed)
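
The cell above silences every warning for the rest of the session. Where that is too broad, one scoped alternative is the standard-library `warnings.catch_warnings()` context manager. The sketch below is only an illustration; `noisy_step` is a hypothetical stand-in for any call that emits a warning.

```python
import warnings


def noisy_step():
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Suppress warnings only inside this block; the previous filters
# are restored automatically when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    quiet_result = noisy_step()

# A second block records warnings instead of hiding them,
# which shows that the suppression above did not leak out.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud_result = noisy_step()

print(quiet_result, loud_result, len(caught))
```

This keeps warnings visible everywhere else in the notebook, which can help when a library deprecation affects later cells.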



In [0]:
data_url = r"https://raw.githubusercontent.com/sugatoray/Manning-Phishing-Websites-Detection/master/Phishing.csv"

## Manage API Key

We will use the custom-defined `APIKeyHandler` class to add the API key to the environment variables.

```python
api_key_name = 'WANDB_API_KEY'
```

In [60]:
class APIKeyHandler(object):
    """Handles API Keys. Adds and/or removes them to/from the environment 
    variables (os.environ).
    
    Example
    -------

    # Initialize APIKeyHandler
    akh = APIKeyHandler()
    print(akh) # print the object
    api_key_name = 'WANDB_API_KEY'
    # ADD api-key to os.environ
    akh.add_to_enviroment(api_key_name=api_key_name)
    # REMOVE api-key from os.environ
    #akh.remove_from_environemnt(api_key_name=api_key_name)
    """
    API_STORE = r"/content/drive/My Drive/Data Repository/API_REPO/api_chest.json"

    def __init__(self):
        self.api_key_names = None
        self._apis = None
        self.update_api_data()

    def update_api_data(self):
        self._read_api_store()
        self.api_key_names = sorted(list(self._apis.keys()))

    def __repr__(self):
        total = len(self.api_key_names)
        msg = ', '.join(self.api_key_names)
        cls = self.__class__.__name__
        return '{}( total {} keys; [ {} ] )'.format(cls, total, msg)

    def _read_api_store(self):
        with open(self.API_STORE, 'r') as f:  # class attribute, so it needs the self. prefix
            self._apis = json.loads(f.read())

    def get_api_key(self, api_key_name=None):
        if api_key_name is not None:        
            self.update_api_data()
            #apis = json.loads(self._apis)
            return self._apis.get(api_key_name, None)

    def add_to_enviroment(self, api_key_name=None):
        if api_key_name is not None:
            print('Adding Environment Variable: {}'.format(api_key_name))
            api_key_value = self.get_api_key(api_key_name)
            if api_key_value is not None:
                os.environ[api_key_name] = api_key_value
                print(' ... SUCCESS')
            else:
                print(' ... ABORTED. Key value is NULL.')
    
    def remove_from_environemnt(self, api_key_name=None):
        if api_key_name is not None:            
            api_key_value = self.get_api_key(api_key_name)            
            if (api_key_name in os.environ) and (api_key_value is not None):
                print('Removing Environment Variable: {}'.format(api_key_name))
                v = os.environ.pop(api_key_name, None)
                if v is not None:
                    print(' ... SUCCESS')
                else:
                    print(' ... ABORTED. Key value in os.environ is NULL.')
            
print(APIKeyHandler.__doc__)
print('\n'+''.join(['-']*80)+'\n')
# Initialize APIKeyHandler
akh = APIKeyHandler()
print(akh) # print the object
api_key_name = 'WANDB_API_KEY'
# ADD api-key to os.environ
akh.add_to_enviroment(api_key_name=api_key_name)
# REMOVE api-key from os.environ
#akh.remove_from_environemnt(api_key_name=api_key_name)

Handles API Keys. Adds and/or removes them to/from the environment 
    variables (os.environ).
    
    Example
    -------

    # Initialize APIKeyHandler
    akh = APIKeyHandler()
    print(akh) # print the object
    api_key_name = 'WANDB_API_KEY'
    # ADD api-key to os.environ
    akh.add_to_enviroment(api_key_name=api_key_name)
    # REMOVE api-key from os.environ
    #akh.remove_from_environemnt(api_key_name=api_key_name)
    

--------------------------------------------------------------------------------

APIKeyHandler( total 4 keys; [ IQAIR_API_KEY, JOVIAN_API_KEY, OPENSKY_API_KEY, WANDB_API_KEY ] )
Adding Environment Variable: WANDB_API_KEY
 ... SUCCESS
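
The class reads its keys with `self._apis.keys()` and `self._apis.get(...)`, so the JSON store is presumably a flat mapping from key name to key value. A minimal sketch of the same idea as a single function follows, under that assumption. The function name `export_api_key`, the temporary store, and the key `DEMO_API_KEY` are all hypothetical and exist only for this illustration.

```python
import json
import os
import tempfile


def export_api_key(store_path, api_key_name):
    """Copy one key from a JSON store into os.environ; return True on success."""
    with open(store_path, "r") as f:
        apis = json.load(f)  # assumed layout: {"KEY_NAME": "key-value", ...}
    value = apis.get(api_key_name)
    if value is None:
        return False
    os.environ[api_key_name] = value
    return True


# Demo with a throwaway store and a made-up key (both hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    demo_store = os.path.join(tmp, "api_chest.json")
    with open(demo_store, "w") as f:
        json.dump({"DEMO_API_KEY": "not-a-real-key"}, f)
    added = export_api_key(demo_store, "DEMO_API_KEY")
    missing = export_api_key(demo_store, "ABSENT_API_KEY")

print(added, missing, os.environ.get("DEMO_API_KEY"))
```

Keeping the store outside the repository, as the class does with a Google Drive path, means the keys never end up in version control.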


## Add Login Credentials using `wandb login` Command

Inside a Jupyter Notebook, run the following from a code cell. When prompted, paste your API key from https://app.wandb.ai/authorize.

```
!wandb login
```

**CAUTION**: This approach leaves the API key you used visible in the cell output. Do NOT use it if you plan to share your notebook with anyone.

If you want to clear the output after the login is successful, you could use the following method.

```
from IPython.display import clear_output
!wandb login
clear_output()
```

However, I have wrapped this in a Python convenience function, `wandb_login()`.

```python
def wandb_login(sleeptime=0.5):
    !wandb login
    time.sleep(sleeptime)
    clear_output()
    # Location of your wandb credential saved 
    #   using command wandb login: ~/.netrc 
    # Uncomment the following line to see the 
    #   contents of the file.
    #!cat ~/.netrc 
```

Just use: 

```python
wandb_login()
```

In [0]:
def wandb_login(sleeptime=0.5):
    !wandb login
    time.sleep(sleeptime)
    clear_output()
    # Location of your wandb credential saved 
    #   using command wandb login: ~/.netrc 
    # Uncomment the following line to see the 
    #   contents of the file.
    #!cat ~/.netrc 

wandb_login()
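
Another way to keep the key out of the cell output is to read it with the standard-library `getpass` module and export it as `WANDB_API_KEY`, the environment variable used earlier in this notebook. This is a sketch, not part of the original workflow: the helper name `set_wandb_key` and its `prompt_fn` parameter are hypothetical, and the demo replaces the interactive prompt with a stub that returns a made-up key.

```python
import getpass
import os


def set_wandb_key(prompt_fn=getpass.getpass):
    """Prompt for the key without echoing it and export it to os.environ."""
    key = prompt_fn("wandb API key: ").strip()
    if not key:
        raise ValueError("empty API key")
    os.environ["WANDB_API_KEY"] = key
    return len(key)  # return only the length, never the key itself


# Non-interactive demo: a stub stands in for the prompt (made-up key).
key_length = set_wandb_key(prompt_fn=lambda prompt: "  dummy-key-123  ")
print(key_length)
```

In the notebook, calling `set_wandb_key()` with no arguments would prompt interactively, and the typed key would not appear in the saved output.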

## Clone GitHub Repo

To clone the GitHub repo, use the following command.

```
!git clone https://github.com/sugatoray/Manning-Phishing-Websites-Detection.git
```

In [66]:
!git clone https://github.com/sugatoray/Manning-Phishing-Websites-Detection.git

Cloning into 'Manning-Phishing-Websites-Detection'...
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (15/15), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 15 (delta 3), reused 2 (delta 0), pack-reused 0
Unpacking objects: 100% (15/15), done.


## Define Project Home Directory

In [0]:
home = r"/content/Manning-Phishing-Websites-Detection"
data_path = os.path.join(home, 'Phishing.csv') # data path

## Read Data into a Pandas DataFrame

In [85]:
df = pd.read_csv(data_path)
df.head()

Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,port,HTTPS_token,Request_URL,URL_of_Anchor,Links_in_tags,SFH,Submitting_to_email,Abnormal_URL,Redirect,on_mouseover,RightClick,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,-1,1,1,1,-1,-1,-1,-1,-1,1,1,-1,1,-1,1,-1,-1,-1,0,1,1,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,1,-1,0,1,-1,1,1,-1,1,0,-1,-1,1,1,0,1,1,1,1,-1,-1,0,-1,1,1,1,-1
2,1,0,1,1,1,-1,-1,-1,-1,1,1,-1,1,0,-1,-1,-1,-1,0,1,1,1,1,1,-1,1,-1,1,0,-1,-1
3,1,0,1,1,1,-1,-1,-1,1,1,1,-1,-1,0,0,-1,1,1,0,1,1,1,1,-1,-1,1,-1,1,-1,1,-1
4,1,0,-1,1,1,-1,1,1,-1,1,1,1,1,0,0,-1,1,1,0,-1,1,-1,1,-1,-1,0,-1,1,1,1,1


## Inspect the DataFrame with `pandas_profiling` Library

```python
# Install if necessary
#!pip install -U -q pandas_profiling
import pandas_profiling as ppf
# Create Profile Report
ppf.ProfileReport(df)
```

In [102]:
profile = ppf.ProfileReport(df)
profile.to_file(output_file = 'profile_report.html')


## Set up Git

### Install Latest Git Version

In [127]:
!git --version 

git version 2.17.1


In [126]:
!apt-get install git

Reading package lists... Done
Building dependency tree       
Reading state information... Done
git is already the newest version (1:2.17.1-1ubuntu0.7).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.


### Setup Global Config Variables

In [128]:
!git config --global user.name "sugatoray"
!git config --global user.email "ray.sugato@gmail.com"
!git config -l

user.name=sugatoray
user.email=ray.sugato@gmail.com
core.repositoryformatversion=0
core.filemode=true
core.bare=false
core.logallrefupdates=true
remote.origin.url=https://github.com/sugatoray/Manning-Phishing-Websites-Detection.git
remote.origin.fetch=+refs/heads/*:refs/remotes/origin/*
branch.master.remote=origin
branch.master.merge=refs/heads/master


### Check Git Commands

In [107]:
!git

usage: git [--version] [--help] [-C <path>] [-c <name>=<value>]
           [--exec-path[=<path>]] [--html-path] [--man-path] [--info-path]
           [-p | --paginate | --no-pager] [--no-replace-objects] [--bare]
           [--git-dir=<path>] [--work-tree=<path>] [--namespace=<name>]
           <command> [<args>]

These are common Git commands used in various situations:

start a working area (see also: git help tutorial)
   clone      Clone a repository into a new directory
   init       Create an empty Git repository or reinitialize an existing one

work on the current change (see also: git help everyday)
   add        Add file contents to the index
   mv         Move or rename a file, a directory, or a symlink
   reset      Reset current HEAD to the specified state
   rm         Remove files from the working tree and from the index

examine the history and state (see also: git help revisions)
   bisect     Use binary search to find the commit that introduced a bug
   grep       Prin

In [109]:
!git help remote 

GIT-REMOTE(1)                     Git Manual                     GIT-REMOTE(1)

NAME
       git-remote - Manage set of tracked repositories

SYNOPSIS
       git remote [-v | --verbose]
       git remote add [-t <branch>] [-m <master>] [-f] [--[no-]tags] [--mirror=<fetch|push>] <name> <url>
       git remote rename <old> <new>
       git remote remove <name>
       git remote set-head <name> (-a | --auto | -d | --delete | <branch>)
       git remote set-branches [--add] <name> <branch>...
       git remote get-url [--push] [--all] <name>
       git remote set-url [--push] <name> <newurl> [<oldurl>]
       git remote set-url --add [--push] <name> <newurl>
       git remo

### Check inside `.git` directory under project-home

In [112]:
!ls -la Manning-Phishing-Websites-Detection/.git

total 52
drwxr-xr-x  8 root root 4096 May  8 03:57 .
drwxr-xr-x  5 root root 4096 May  8 05:02 ..
drwxr-xr-x  2 root root 4096 May  8 03:57 branches
-rw-r--r--  1 root root  293 May  8 03:57 config
-rw-r--r--  1 root root   73 May  8 03:57 description
-rw-r--r--  1 root root   23 May  8 03:57 HEAD
drwxr-xr-x  2 root root 4096 May  8 03:57 hooks
-rw-r--r--  1 root root  297 May  8 03:57 index
drwxr-xr-x  2 root root 4096 May  8 03:57 info
drwxr-xr-x  3 root root 4096 May  8 03:57 logs
drwxr-xr-x 19 root root 4096 May  8 03:57 objects
-rw-r--r--  1 root root  114 May  8 03:57 packed-refs
drwxr-xr-x  5 root root 4096 May  8 03:57 refs


### Inspect contents of `project-home/.git/config` file 

In [113]:
!cat Manning-Phishing-Websites-Detection/.git/config

[core]
	repositoryformatversion = 0
	filemode = true
	bare = false
	logallrefupdates = true
[remote "origin"]
	url = https://github.com/sugatoray/Manning-Phishing-Websites-Detection.git
	fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
	remote = origin
	merge = refs/heads/master


### Start Committing to Repo

In [119]:
home

'/content/Manning-Phishing-Websites-Detection'

In [0]:
os.chdir(home)

In [121]:
!git status

On branch master
Your branch is up to date with 'origin/master'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	src/

nothing added to commit but untracked files present (use "git add" to track)


In [0]:
!git add .

In [124]:
!git commit -m "first commit"

[master cd11fc3] first commit
 1 file changed, 18151 insertions(+)
 create mode 100644 src/stage1/profile_report.html


In [129]:
!git push origin master

fatal: could not read Username for 'https://github.com': No such device or address


## Current Status

The commit succeeded locally, but pushing it to GitHub failed with this error:

```
!git push origin master
```
Output:  
```
fatal: could not read Username for 'https://github.com': No such device or address
```
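
This error typically means that `git` tried to prompt for a username but had no terminal to prompt on, which is the case for shell commands run from a Colab cell. One common workaround, sketched below under stated assumptions, is to embed a GitHub personal access token in the remote URL. The function name `build_remote_url` and the environment variable `GITHUB_TOKEN` are hypothetical choices for this sketch, and the fallback token is made up.

```python
import os
from urllib.parse import quote


def build_remote_url(user, repo, token):
    """Return an HTTPS remote URL that carries credentials for a push."""
    # Percent-encode the credentials so special characters do not break the URL.
    return "https://{}:{}@github.com/{}/{}.git".format(
        quote(user, safe=""), quote(token, safe=""), user, repo
    )


# Read the token from the environment; fall back to a made-up value for the demo.
token = os.environ.get("GITHUB_TOKEN") or "dummy-token"
url = build_remote_url("sugatoray", "Manning-Phishing-Websites-Detection", token)

# Print the URL with the token masked so it is not leaked in the cell output.
print(url.replace(quote(token, safe=""), "***"))
```

The resulting URL could then be applied with `!git remote set-url origin <url>` before retrying `!git push origin master`. Because the token ends up in `.git/config`, this approach is best limited to short-lived sessions and tokens with narrow scope.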

### Possible solutions

1. https://medium.com/@navan0/how-to-push-files-into-github-from-google-colab-379fd0077aa8

1. https://stackoverflow.com/questions/22147574/fatal-could-not-read-username-for-https-github-com-no-such-file-or-directo

1. https://unix.stackexchange.com/questions/33617/how-can-i-update-to-a-newer-version-of-git-using-apt-get