Fetching infosec-related machine learning datasets with scikit-learn
=======================================================

If you want to post a sample Jupyter notebook online as a portfolio piece, one of the first "problems" you run into is which dataset to use. If the piece is meant to demonstrate model-building skills more than domain knowledge or feature engineering, then starting from raw data is tedious and distracts from a lean demonstration. You could prepare the data in a separate analysis, host the prepared dataset on e.g. GitHub, and fetch it by URL in the portfolio piece, but that is also a distraction when data munging and feature engineering are not the point of the demo.

Fortunately, this is a common problem, so there are standardized ways of fetching data.

This notebook illustrates some of them, specifically the ones for security-related data.

In [1]:
import numpy as np

sklearn.datasets
-----------------------------

Scikit-learn has [many methods](https://scikit-learn.org/stable/datasets/index.html) for loading prepared datasets or generating random data with specified distributions.

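Mind that in the KDD Cup '99 data fetched below, the "normal" traffic has a dadgum gotcha `.` at the end of its target value: `normal.` (and the labels are bytes, so it is really `b'normal.'`).

Also mind that, by default, `fetch_kddcup99` only fetches 10% of the dataset.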

### `fetch_kddcup99`

Documentation for the dataset is [here](https://kdd.ics.uci.edu/databases/kddcup99/task.html).

In [2]:
# convenience function for describing a subset
def describe_kddcup99_Ys(Y):
    # boolean masks; note the bytes type and the trailing `.` in the label
    s = Y == b'normal.'
    t = np.logical_not(s)

    num_normal = Y[s].shape[0]
    num_abnormal = Y[t].shape[0]
    # despite the name, this is a fraction between 0 and 1
    percent_abnormal = num_abnormal / (num_normal + num_abnormal)

    # distinct attack labels present in this subset
    unique_abnormal = np.unique(Y[t])

    print("{} normal traffic points".format(num_normal))
    print("{} abnormal traffic points".format(num_abnormal))
    print("percent abnormal {}".format(percent_abnormal))
    print("abnormal classes: {}".format(unique_abnormal))
    # this last line reports the number of distinct abnormal classes
    print('num_abnormal: {}'.format(unique_abnormal.shape[0]))

In [3]:
from sklearn.datasets import fetch_kddcup99

# no subset filtering, data of shape `(494021, 41)`, with 
# (97278,) normal traffic and 
# (396743,) abnormal traffic
# making a ratio of 80% abnormal (that's a lot!!)

x, Y = fetch_kddcup99(return_X_y=True)
describe_kddcup99_Ys(Y)

97278 normal traffic points
396743 abnormal traffic points
percent abnormal 0.8030893423558918
abnormal classes: [b'back.' b'buffer_overflow.' b'ftp_write.' b'guess_passwd.' b'imap.'
 b'ipsweep.' b'land.' b'loadmodule.' b'multihop.' b'neptune.' b'nmap.'
 b'perl.' b'phf.' b'pod.' b'portsweep.' b'rootkit.' b'satan.' b'smurf.'
 b'spy.' b'teardrop.' b'warezclient.' b'warezmaster.']
num_abnormal: 22
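
Since the labels are bytes with that trailing `.`, building a binary "normal vs. attack" target takes a little care. A minimal sketch, assuming a hypothetical helper named `to_binary_target` (not part of scikit-learn) and a tiny hand-made label array standing in for the real `Y`:

```python
import numpy as np


def to_binary_target(y, normal_label=b"normal."):
    """Return 0 for normal traffic and 1 for everything else.

    `to_binary_target` is a made-up helper for this notebook. The default
    label keeps the trailing period and the bytes type of the KDD labels.
    """
    y = np.asarray(y)
    return (y != normal_label).astype(int)


# toy labels for illustration only, not real rows from the dataset
toy_labels = np.array([b"normal.", b"smurf.", b"normal.", b"neptune."], dtype=object)
binary = to_binary_target(toy_labels)
print(binary)  # [0 1 0 1]

# forgetting the trailing period silently marks everything as abnormal
print(to_binary_target(toy_labels, normal_label=b"normal"))  # [1 1 1 1]
```

On the real data, `to_binary_target(Y).mean()` should reproduce the "percent abnormal" figure printed above.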


In [4]:
# filter to all "normal" data plus only 3377 "anomalous" datapoints, 
# resulting in a feature dataset of shape `(100655, 41)` (3% non-normal traffic)

x, Y = fetch_kddcup99(subset='SA', return_X_y=True)
describe_kddcup99_Ys(Y)

97278 normal traffic points
3377 abnormal traffic points
percent abnormal 0.03355024588942427
abnormal classes: [b'back.' b'ipsweep.' b'neptune.' b'nmap.' b'pod.' b'portsweep.' b'satan.'
 b'smurf.' b'teardrop.' b'warezclient.']
num_abnormal: 10
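
`'SA'` is one of several subsets. As far as I can tell from the scikit-learn documentation, `subset` also accepts `'SF'`, `'http'` and `'smtp'`, and `percent10=False` fetches the full dataset instead of the 10% sample. A hedged sketch that wraps both options; `load_kddcup99` and `VALID_SUBSETS` are made-up names for this notebook, not scikit-learn API:

```python
# subset names as I understand them from the scikit-learn docs (assumption)
VALID_SUBSETS = (None, "SA", "SF", "http", "smtp")


def load_kddcup99(subset=None, full=False):
    """Fetch KDD Cup '99 features and labels.

    subset: None for everything, or one of the names in VALID_SUBSETS.
    full:   False keeps the default 10% sample, True downloads it all.
    """
    if subset not in VALID_SUBSETS:
        raise ValueError("unknown subset: {!r}".format(subset))
    # imported lazily so the argument check above runs without scikit-learn
    from sklearn.datasets import fetch_kddcup99

    return fetch_kddcup99(subset=subset, percent10=not full, return_X_y=True)
```

Calling `x, Y = load_kddcup99(subset='SA')` should give the same data as the cell above. Expect `full=True` to take much longer to download.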


### `fetch_openml` -- Phishing Websites

This OpenML dataset comes from [UCI](https://archive.ics.uci.edu/ml/datasets/phishing+websites) and is connected with three academic papers cited on the UCI page. Features are described [in this Word document](https://archive.ics.uci.edu/ml/machine-learning-databases/00327/Phishing%20Websites%20Features.docx) hosted on uci.edu.

In [5]:
from sklearn.datasets import fetch_openml

# dataset hosted on openml.org; data_id is taken from the dataset URL and must be an int
x, Y = fetch_openml(data_id=4534, return_X_y=True)
x.shape

# use `as_frame` argument to get a pandas dataframe
data, target = fetch_openml(data_id=4534, return_X_y=True, as_frame=True)
print("pandas df colnames: {}".format(data.columns))

pandas df colnames: Index(['having_IP_Address', 'URL_Length', 'Shortining_Service',
       'having_At_Symbol', 'double_slash_redirecting', 'Prefix_Suffix',
       'having_Sub_Domain', 'SSLfinal_State', 'Domain_registeration_length',
       'Favicon', 'port', 'HTTPS_token', 'Request_URL', 'URL_of_Anchor',
       'Links_in_tags', 'SFH', 'Submitting_to_email', 'Abnormal_URL',
       'Redirect', 'on_mouseover', 'RightClick', 'popUpWidnow', 'Iframe',
       'age_of_domain', 'DNSRecord', 'web_traffic', 'Page_Rank',
       'Google_Index', 'Links_pointing_to_page', 'Statistical_report'],
      dtype='object')


#### Other OpenML security datasets
* SPAM email database, [https://www.openml.org/d/44](https://www.openml.org/d/44)
  - has text pre-featurized
* Credit Card Fraud
  - several copies on OpenML, all the same dataset:
      - [https://www.openml.org/d/42175](https://www.openml.org/d/42175)
      - [https://www.openml.org/d/1597](https://www.openml.org/d/1597)
  - This is the same dataset as the famous one hosted on Kaggle, [here](https://www.kaggle.com/mlg-ulb/creditcardfraud),
    and also the same as the one used by [this AWS SageMaker tutorial](https://aws.amazon.com/solutions/fraud-detection-using-machine-learning/)
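
Since the `data_id` for `fetch_openml` is just the number at the end of these URLs, a small helper can pull it out. A sketch under the assumption that the URL has the `https://www.openml.org/d/<id>` form used in this list; `openml_data_id` is a made-up name, not part of scikit-learn or OpenML:

```python
from urllib.parse import urlparse


def openml_data_id(url):
    """Extract the integer dataset id from an OpenML '/d/<id>' URL."""
    path = urlparse(url).path.rstrip("/")
    prefix, _, last = path.rpartition("/")
    if prefix != "/d" or not last.isdigit():
        raise ValueError("not an OpenML /d/<id> URL: {!r}".format(url))
    return int(last)


urls = [
    "https://www.openml.org/d/44",
    "https://www.openml.org/d/42175",
    "https://www.openml.org/d/1597",
]
ids = [openml_data_id(u) for u in urls]
print(ids)  # [44, 42175, 1597]
```

Each id can then be passed along as `fetch_openml(data_id=..., return_X_y=True)`.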

### datahub.io datapackages -- Credit Card Fraud

This is the same data as listed in the "Other OpenML security datasets" section above. Datahub imported this credit card fraud
dataset from OpenML.

Their website includes code for fetching and loading their hosted data in various languages. The pandas example
for the credit card fraud dataset is [here](https://www.datahub.io/machine-learning/creditcard#pandas).

My gripe with the `datapackage` library that datahub uses is that it doesn't handle caching, whereas scikit-learn's `fetch_openml` [does](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html#sklearn.datasets.fetch_openml). I was going to say that one benefit of `datapackage` is that it can load a dataset straight into a pandas dataframe, with attendant column names and types, but `fetch_openml` can do that too via the `as_frame` argument, since OpenML's ARFF format allows specifying attribute names and types.

So, pfft datahub.io. `fetch_openml` ftw.