## Malicious URLs Kaggle Dataset

This dataset contains 651,191 URLs, out of which 428103 benign or safe URLs, 96457 defacement URLs, 94111 phishing URLs, and 32520 malware URLs. For our purposes, we will extract the html data of the URLs and use it as our dataset.

In [21]:
import pandas as pd
import numpy as np
import requests

# Load the dataset
df = pd.read_csv('data/malicious_phish.csv')
df.head()

Unnamed: 0,url,type
0,br-icloud.com.br,phishing
1,mp3raid.com/music/krizz_kaliko.html,benign
2,bopsecrets.org/rexroth/cr/1.htm,benign
3,http://www.garage-pirenne.be/index.php?option=...,defacement
4,http://adventure-nicaragua.net/index.php?optio...,defacement


This dataset has a known issue where urls from PhishStorm dataset are mislabelled: benign urls are labelled as phishing urls and vice versa. We will fix this issue by checking the url in the PhishStorm dataset and correcting the label.

In [17]:
# Fix the encoding issue of phishstorm dataset
from ftfy import fix_encoding

with open('data/urlset.csv', 'r', encoding='ascii') as f:
    with open('data/urlset_fixed.csv', 'w', encoding='ascii') as f2:
        while True:
            try:
                line = f.readline()
            except:
                pass
            if not line:
                break
            f2.write(fix_encoding(line))

In [18]:
# Load the PhishStorm dataset
phishstorm = pd.read_csv('data/urlset_fixed.csv', encoding='ascii')
phishstorm.head()

Unnamed: 0,domain,ranking,mld_res,mld.ps_res,card_rem,ratio_Rrem,ratio_Arem,jaccard_RR,jaccard_RA,jaccard_AR,jaccard_AA,jaccard_ARrd,jaccard_ARrem,label
0,nobell.it/70ffb52d079109dca5664cce6f317373782/...,10000000.0,1.0,0.0,18,107.611111,107.277778,0.0,0.0,0.0,0.0,0.8,0.795729,1.0
1,www.dghjdgf.com/paypal.co.uk/cycgi-bin/webscrc...,10000000.0,0.0,0.0,11,150.636364,152.272727,0.0,0.0,0.0,0.0,0.0,0.768577,1.0
2,serviciosbys.com/paypal.cgi.bin.get-into.herf....,10000000.0,0.0,0.0,14,73.5,72.642857,0.0,0.0,0.0,0.0,0.0,0.726582,1.0
3,mail.printakid.com/www.online.americanexpress....,10000000.0,0.0,0.0,6,562.0,590.666667,0.0,0.0,0.0,0.0,0.0,0.85964,1.0
4,thewhiskeydregs.com/wp-content/themes/widescre...,10000000.0,0.0,0.0,8,29.0,24.125,0.0,0.0,0.0,0.0,0.0,0.748971,1.0


In [19]:
len(phishstorm)

95913

We will check the urls in the PhishStorm dataset and correct the label in the malicious dataset. In PhishStorm label 1 is phishing and 0 is benign.

In [22]:
from tqdm import tqdm

# Check the urls in the PhishStorm dataset
phish_urls = phishstorm['domain'].values
fixed_count = 0
for i, row in tqdm(df.iterrows(), total=df.shape[0]):
    if row['url'] in phish_urls:
        phishstorm_row = phishstorm[phishstorm['domain'] == row['url']]
        if row['type'] == 'phishing' and phishstorm_row['label'].values[0] == 0:
            df.loc[i, 'type'] = 'benign'
            fixed_count += 1
        elif row['type'] == 'benign' and phishstorm_row['label'].values[0] == 1:
            df.loc[i, 'type'] = 'phishing'
            fixed_count += 1

print(f'Fixed {fixed_count} rows')

100%|██████████| 651191/651191 [20:54<00:00, 519.24it/s]

Fixed 95910 rows





In [27]:
df.to_csv('data/malicious_phish_fixed.csv')

In [3]:
from tqdm import tqdm
import pandas as pd
import numpy as np
import requests

In [10]:
df = pd.read_csv('data/malicious_phish_fixed.csv')
print(len(df))
df.head()

651191


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,url,type
0,0,0,br-icloud.com.br,phishing
1,1,1,mp3raid.com/music/krizz_kaliko.html,benign
2,2,2,bopsecrets.org/rexroth/cr/1.htm,benign
3,3,3,http://www.garage-pirenne.be/index.php?option=...,defacement
4,4,4,http://adventure-nicaragua.net/index.php?optio...,defacement


In [12]:
from urllib.parse import urlparse, ParseResult

def url_parse(url):
    p = urlparse(url, 'http')
    netloc = p.netloc or p.path
    path = p.path if p.netloc else ''
    if not netloc.startswith('www.'):
        netloc = 'www.' + netloc

    p = ParseResult('http', netloc, path, *p[3:])
    return p.geturl()

In [13]:
df['url'] = df['url'].apply(url_parse)

In [15]:
df.to_csv('data/malicious_phish_fixed.csv', index=False)