# Assignment Email

This email was received from David Meltzer on July 6, 2023 at 8:59AM PST. I began work at 6:00PM that day.

> Hi Russell,
>
> As I mentioned last week, the challenge is the last step in our interview process.  I have provided the instructions below, but if you need any clarification or have problems with how we have defined this, please let us know.  To the extent you incur any AWS costs as part of this, we will reimburse you.  There is no fixed deadline for this, but we are hoping to have it back within the next week or so to be able to make a decision.
>
> ----
>
> Use the IOT intrusion detection dataset for a supervised anomaly detection task. The label column indicates whether each row's data is normal or anomalous.
You can choose an ML classifier for training on this dataset.
>
> The task is to build a classifier using the train dataset and deploy a trained model in Sagemaker.
You can use your personal AWS account to do this.
The classifier should be able to generate robust classification metrics on a held-out test/validation dataset. You can use a portion of your train set for validation metrics.
>
> As for output, please send us a link to the code and classification results in your git repository when you are done. Configuration should be through an orchestration system so we can re-create the environment programmatically.
>
> We will use a held-out test set to evaluate your model's performance.
>
> Test data can be found here: [https://github.com/netography/ml-engineer/archive/refs/heads/main.zip](https://github.com/netography/ml-engineer/archive/refs/heads/main.zip)
>
> It can also be found here: [https://www.dropbox.com/scl/fi/oz1fspqu4mago3wxeb9dp/IoT-network-intrustion-dataset-train.csv?rlkey=d6hblxlw4t163tt386w16gldi&dl=0](https://www.dropbox.com/scl/fi/oz1fspqu4mago3wxeb9dp/IoT-network-intrustion-dataset-train.csv?rlkey=d6hblxlw4t163tt386w16gldi&dl=0)
>
> -Dave

In [None]:
import pandas as pd

pd.set_option('display.max_columns', 100)

# Load and Evaluate the Flow Logs

In [None]:
train_df = pd.read_parquet("data/ml-engineer-main/iot_network_intrustion_dataset_train.parquet")
train_df = train_df.rename(columns={"Unnamed: 0": "ID"})

train_df.head()

In [None]:
train_df.describe()

In [None]:
print(f"{len(train_df):,}")

In [None]:
test_df = pd.read_parquet("data/ml-engineer-main/iot_network_intrustion_dataset_test.parquet")
test_df = test_df.rename(columns={"Unnamed: 0": "ID"})

test_df.head()

In [None]:
test_df.describe()

In [None]:
print(f"{len(test_df):,}")

## How are flows unique?

In [None]:
print(f'{len(train_df["ID"].unique()):,}')

In [None]:
print(f'{len(train_df["Flow_ID"].unique()):,}')

In [None]:
print(f'{len(train_df[train_df.Flow_ID == "163.152.127.193-192.168.0.13-10101-56361-17"]):,}')

## This seems like it...

In [None]:
train_df[train_df.Flow_ID == "163.152.127.193-192.168.0.13-10101-56361-17"].sort_values(["Timestamp", "ID"])

## Label Check

As we see below, only 6.4% of the training data and 6.1% of the test data are normal event types. I would like to ask a question about this, but at this late date I will go with what I can determine.

In [None]:
train_label_counts = train_df["Label"].value_counts()

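# Index by label name rather than position; positional lookup on a string index is deprecated in pandas
train_normal_pct = (train_label_counts["Normal"] / train_label_counts.sum()) * 100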
print(f"Normal label percentage: {train_normal_pct:,.2f}%\n")

train_label_counts

In [None]:
test_label_counts = test_df["Label"].value_counts()

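# Index by label name rather than position; positional lookup on a string index is deprecated in pandas
test_normal_pct = (test_label_counts["Normal"] / test_label_counts.sum()) * 100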
print(f"Normal label percentage: {test_normal_pct:,.2f}%\n")

test_label_counts

### Discussion

We might assume the proportion isn't representative of the problem, in which case we might resample to oversample normal data. While I don't know how the data was sampled, I am going to assume it is a representative, random sample... it is the only prior I have evidence for at this time. The problem is thus flipped on its head: normal traffic is the rare class, so we need to detect normal traffic and throw out anomalies.

Perhaps this is a web server? Let's do some more EDA to understand it, so I can feel comfortable about proceeding.

In [None]:
[
    len(train_df["Src_IP"].unique()),
    len(train_df["Dst_IP"].unique()),
]

In [None]:
# How many IPs have more than 1 connection?
# ip_counts = train_df.groupby("Dst_IP").filter(lambda x: len(x) > 1)["Dst_IP"].value_counts()

ip_counts = train_df["Dst_IP"].value_counts()
ip_counts

In [None]:
# Time for a histogram!
import seaborn as sns

sns.histplot(data=ip_counts, bins=60, log=True)

In [None]:
# Get a sense of the distribution across destination IPs
dst_label_counts = train_df[["Dst_IP", "Label"]].value_counts().reset_index().sort_values(by=["Dst_IP", "Label"], ascending=[True, False])
dlc_df = dst_label_counts[dst_label_counts["Dst_IP"].str.startswith("192.168.0")]
dlc_df

In [None]:
ip_counts = train_df[["Dst_IP", "Label"]].value_counts().sort_index()
ip_counts = ip_counts.to_frame().rename(columns={0: "count"})
ip_counts

In [None]:
ip_val_counts = ip_counts.reset_index().rename(columns={"count": "IP_Count"})
ip_val_counts

### Comparing Anomaly / Label Ratios per IP

The following histogram is interesting... there is no clear pattern visible in terms of label per destination IP. I am going to stop my EDA here and move on to a baseline unsupervised model.

In [None]:
sns.histplot(data=ip_val_counts, x="IP_Count", hue="Label", kde=True, bins=50, log_scale=True)

## Network Visualization

I just have to look at this as a network before moving on to actual machine learning... I recently worked at Graphistry and am familiar with their tool.

### Preparing a Dataset to Visualize

I need to see the network before proceeding because I tend to make fundamental errors when I skip this step. This exercise has an example: I spent time in CloudFormation setting up a new SageMaker notebook inside a VPC when I could have used a SageMaker domain and the SageMaker Python SDK to publish a model very easily. In any case... let's do a first-pass visualization.

In [None]:
# Copy the slice so the new columns added below don't trigger SettingWithCopyWarning
viz_df = train_df[["Flow_ID", "Src_IP", "Src_Port", "Dst_IP", "Dst_Port", "Timestamp", "Protocol", "Label"]].copy()
viz_df.head()

In [None]:
viz_df["src"] = viz_df["Src_IP"] + " / " + viz_df["Src_Port"].astype("str") + " / " + viz_df["Protocol"].astype("str")
viz_df["dst"] = viz_df["Dst_IP"] + " / " + viz_df["Dst_Port"].astype("str") + " / " + viz_df["Protocol"].astype("str")

viz_df.head()

### Are Flow IDs Labeled All or Nothing?

As we will see below - yes they are!

In [None]:
flow_label_counts = viz_df.groupby(["Flow_ID", "Label"]).count().reset_index()
flow_label_counts = flow_label_counts.rename(columns={"Src_IP": "Count"})[["Flow_ID", "Label", "Count"]]
flow_label_counts

In [None]:
flow_anomaly_counts = flow_label_counts[flow_label_counts["Label"] == "Anomaly"]
flow_anomaly_counts

In [None]:
flow_normal_counts = flow_label_counts[flow_label_counts["Label"] == "Normal"]
flow_normal_counts

### Look... a flow ID is all normal or all anomalous, regardless of timestamp :)

There is no overlap between these two datasets. This simplifies visualizing them considerably.

In [None]:
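# Align on Flow_ID rather than the row index: a non-NaN ratio would mark a flow that carries both labels
flow_anomaly_ratio = flow_normal_counts.set_index("Flow_ID")["Count"] / flow_anomaly_counts.set_index("Flow_ID")["Count"]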
flow_anomaly_ratio.sort_values()
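
A more direct way to confirm that labels are all-or-nothing per flow is to count the distinct labels seen for each `Flow_ID`. The following is a minimal sketch on toy data, not part of the analysis itself; `toy_df`, `labels_per_flow` and `mixed_flows` are hypothetical names, and the flow IDs are invented.

```python
import pandas as pd

# Toy flows; the Flow_ID values are invented for illustration
toy_df = pd.DataFrame({
    "Flow_ID": ["a", "a", "b", "b", "c"],
    "Label": ["Anomaly", "Anomaly", "Normal", "Normal", "Anomaly"],
})

# Number of distinct labels seen for each flow
labels_per_flow = toy_df.groupby("Flow_ID")["Label"].nunique()

# Flows that carry more than one label; an empty result means labels are all-or-nothing
mixed_flows = labels_per_flow[labels_per_flow > 1]
print(f"Flows with mixed labels: {len(mixed_flows):,}")
```

On the real data, the same check would run against `viz_df` in place of `toy_df`.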

### Alter `viz_df` to Account for Label Polarity

Let's dedupe the flows to account for the fact that Flow IDs are always anomalous or not...

In [None]:
viz_df = viz_df.drop(columns=["Timestamp"], errors="ignore").drop_duplicates()
print(f"Total edges: {len(viz_df):,}")
viz_df

### Setting up Graphistry

I used the GPUs freely available on [Graphistry Hub](https://hub.graphistry.com/) at [https://hub.graphistry.com/](https://hub.graphistry.com/) to visualize the flow logs as a network. It is free for personal use and is powerful for visualizing networks large and small.

You can [sign up](https://hub.graphistry.com/accounts/signup/) for a Graphistry account at [https://hub.graphistry.com/accounts/signup/](https://hub.graphistry.com/accounts/signup/). <b>You should use a username/password/email to get the required credentials</b>, although after that you can log in with your GitHub or Google account.

<center><img src="images/graphistry_hub_registration.png" /></center>

Retain your credentials and use them in the login form and in the environment variables in the next cell below. You should set the `GRAPHISTRY_USERNAME` and `GRAPHISTRY_PASSWORD` variables in the `env/graphistry.env` file, and then restart this Docker container to pick up the new values.

<center><img src="images/graphistry_hub_homepage.png" /></center>

In [None]:
import os
import graphistry

In [None]:
# Environment variable setup
GRAPHISTRY_USERNAME = os.getenv("GRAPHISTRY_USERNAME")
GRAPHISTRY_PASSWORD = os.getenv("GRAPHISTRY_PASSWORD")

In [None]:
graphistry.register(
    api=3,
    username=GRAPHISTRY_USERNAME,
    password=GRAPHISTRY_PASSWORD,
)

In [None]:
(
    graphistry.edges(viz_df, source="src", destination="dst")
    #.options({""})
    .plot()
)

### 192.168.0.13: Real Time Streaming Protocol (RTSP) for Microsoft Windows Media

One interesting thing that pops up immediately is the following image of Graphistry visualizing the flow network, which shows what looks like port scanning of the machine `192.168.0.13` from a number of hosts.

<center><img alt="192.168.0.13 is an interesting high degree node. Is this a port scan or normal traffic?" src="images/Netography-Network-Flows-Graphistry-Server.jpg" /></center>
<br />

You can see below that the flows are mostly around the host `192.168.0.13` serving a [Real Time Streaming Protocol (RTSP) for Microsoft Windows Media streaming services and QuickTime Streaming Server (QTSS)](https://www.speedguide.net/port.php?port=554) workload. You can imagine if we were manually feature engineering that a low port (<1024) with a high degree would likely be legitimate.

<br />
<center><img alt="192.168.0.13 is a Windows Media Server" src="images/Netography-Network-Flows-Graphistry-Windows-Media-Server.jpg" /></center>
<br />
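
The manual feature mentioned above (a low destination port combined with a high degree) could be sketched roughly as follows. This is only an illustration on toy data: `toy_df`, `dst_in_degree` and `dst_low_port` are hypothetical names, the addresses are invented, and "degree" is assumed here to mean the number of distinct sources contacting a destination IP/port pair.

```python
import pandas as pd

# Toy flows; addresses and ports are invented for illustration
toy_df = pd.DataFrame({
    "Src_IP": ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.1"],
    "Dst_IP": ["192.168.0.13", "192.168.0.13", "192.168.0.13", "192.168.0.24"],
    "Dst_Port": [554, 554, 554, 49152],
})

# In-degree: distinct sources that contact each destination IP/port pair
toy_df["dst_in_degree"] = (
    toy_df.groupby(["Dst_IP", "Dst_Port"])["Src_IP"].transform("nunique")
)

# Flag well-known ports, i.e. those below 1024
toy_df["dst_low_port"] = toy_df["Dst_Port"] < 1024

print(toy_df)
```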



Wait... traffic to this server is marked as `Anomaly`. This is very confusing. Things seem reversed once again. I really need to know more about the use case to understand this better... what constitutes normal traffic on this network isn't what I would expect.

In [2]:
thin_df = train_df[train_df.Dst_IP == "192.168.0.13"][["Flow_ID", "Src_IP", "Src_Port", "Dst_IP", "Dst_Port", "Protocol", "Timestamp", "Label", "Cat", "Sub_Cat"]]
thin_df

# Feature Engineering

While we could use something like sentence encoding on our IP addresses and the like, for simplicity let's matricize our features - which are mostly numeric - in the simplest, most direct manner possible.

In [None]:
train_df.columns

In [None]:
# Dump IDs - this must work on new data. Ignore errors for repeats.
train_df = train_df.drop(columns=["ID", "Flow_ID"], errors="ignore")

In [None]:
train_df.dtypes.value_counts()

### String Columns

I am going to ordinal encode the string columns. `Cat` and `Sub_Cat` in particular look useful. I chose scikit-learn's [ordinal encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html) over label encoding because it can be configured to encode values it has not seen before, which matters once the model is deployed.
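
To illustrate the unseen-value behavior, here is a minimal sketch on toy data. The category strings and the names `train_cats`, `new_cats` and `encoded` are invented for illustration; only the `OrdinalEncoder` options mirror the ones used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy category column; the values are invented for illustration
train_cats = np.array([["beta"], ["alpha"], ["gamma"]], dtype=object)
new_cats = np.array([["alpha"], ["never_seen"]], dtype=object)

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train_cats)

# Known categories get their fitted code; unseen ones are mapped to -1
encoded = encoder.transform(new_cats)
print(encoded)
```

`LabelEncoder`, by contrast, raises an error when asked to transform a value it did not see during fitting.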

#### Timestamp...

Not sure what to do with timestamp... it should be relative, but to what? Probably a difference between it and the previous flow log.

Leaving it out on a first pass. I will window function a diff from the last value if need be.
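
If the timestamp does turn out to be needed, the window-function diff could look something like the sketch below. It runs on toy data; the timestamp format, the choice to group by `Src_IP`, and the names `toy_df` and `secs_since_prev` are all assumptions made for illustration.

```python
import pandas as pd

# Toy flows; the timestamp format is an assumption
toy_df = pd.DataFrame({
    "Src_IP": ["10.0.0.1", "10.0.0.1", "10.0.0.2", "10.0.0.1"],
    "Timestamp": [
        "2023-07-06 10:00:00",
        "2023-07-06 10:00:05",
        "2023-07-06 10:00:07",
        "2023-07-06 10:00:30",
    ],
})
toy_df["Timestamp"] = pd.to_datetime(toy_df["Timestamp"])

# Sort so "previous" means the prior flow in time from the same source
toy_df = toy_df.sort_values(["Src_IP", "Timestamp"])

# Seconds since the previous flow from the same source; the first flow per source gets 0
toy_df["secs_since_prev"] = (
    toy_df.groupby("Src_IP")["Timestamp"].diff().dt.total_seconds().fillna(0.0)
)

print(toy_df)
```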

In [None]:
str_cols = list(train_df.columns[train_df.dtypes == "object"].values)
str_cols_no_dt = [x for x in str_cols if x != "Timestamp"]
str_cols_no_dt

In [None]:
str_train_df = train_df[str_cols_no_dt]
str_train_df.dtypes

In [None]:
# Wow, Cat is informative
str_train_df.values

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Encode unknown values as -1 since this is for production - retrain to pick up the new flow IPs, etc.
ordinal_encoder = OrdinalEncoder(
    categories="auto",
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)

In [None]:
ordinal_encoder.fit(str_train_df)

In [None]:
X_str_train = ordinal_encoder.transform(str_train_df)
X_str_train.shape

In [None]:
# numpy is used below for np.inf / np.nan, so import it before first use
import numpy as np

# Note we are dropping the timestamp at this point - potentially a big deal
numeric_train_df = train_df.drop(columns=str_cols, errors="ignore")

# We have np.inf as values - not good for KMeans below
numeric_train_df = numeric_train_df.replace([np.inf, -np.inf], np.nan)

X_numeric_train = numeric_train_df.values
X_numeric_train.shape

In [None]:
import numpy as np

X_train = np.append(X_numeric_train, X_str_train, axis=1)
X_train.shape

In [None]:
# We ran into a problem below, now we np.nan impute infinities above
np.isinf(X_train).sum()

In [None]:
# With infinities gone, NaNs are problematic too... so let's impute the column average to eliminate signal
feature_means = np.nanmean(X_train,axis=0)
feature_means

In [None]:
#Find indices that you need to replace
nan_indices = np.where(np.isnan(X_train))

# np.inf and np.nans appear in these columns
numeric_train_df.columns[16:18]

numeric_train_df["Flow_Byts/s"].isnull().sum(), numeric_train_df["Flow_Pkts/s"].isnull().sum()

In [None]:
#Place column means in the indices. Align the arrays using take
X_train[nan_indices] = np.take(feature_means, nan_indices[1])

# Should now be zero infs and nans
# Count the offending values directly (summing np.where output would sum indices, not counts)
np.isnan(X_train).sum(), np.isinf(X_train).sum()

## Statistical Anomaly Detection with KMeans in `scikit-learn`

Pulled this as a first baseline from a KMeans example from scikit-learn.

In [None]:
import numpy as np
from sklearn.cluster import KMeans

# Fit the k-means algorithm to the dataset (fixed seed and explicit n_init for reproducible clusters)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_train)

# Get the distances of each point to its nearest cluster
distances = kmeans.transform(X_train)
nearest_distances = np.min(distances, axis=1)

# Define a threshold for anomaly detection
threshold = np.percentile(nearest_distances, 95)

# Identify anomaly indices
anomalies = np.where(nearest_distances > threshold)

# Print the indices of the anomalies
print("Anomalies:", anomalies)
anomalies[0].shape

## Scoring Anomaly Detection Algorithms

We need a consistent way to score the anomaly detection methods. Given the label imbalance, where normal traffic is only 6.4% of the training data and 6.1% of the test data, accuracy isn't going to work very well: a model that labels every row as anomalous would already score about 94% accuracy.
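
One possible scoring scheme, sketched on toy labels, is to report precision, recall and F1 for the minority class alongside balanced accuracy. This is an assumption about what a consistent scorer might look like, not a method chosen in this notebook; the label encoding (1 = Normal, 0 = Anomaly) and the names `y_true`, `y_pred` and `scores` are hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    balanced_accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Toy labels: 1 = Normal (minority class), 0 = Anomaly; values are invented
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 0])

# Score the minority class explicitly, plus a class-balanced summary
scores = {
    "precision": precision_score(y_true, y_pred, pos_label=1),
    "recall": recall_score(y_true, y_pred, pos_label=1),
    "f1": f1_score(y_true, y_pred, pos_label=1),
    "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
}

for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

In this toy case plain accuracy would be 80%, even though only half of the normal rows are recovered, which is the gap these metrics are meant to expose.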



## PyOD

PyOD is a widely used Python library for outlier and anomaly detection...

In [None]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
from pyod.models.knn import KNN 
from pyod.utils.data import generate_data, get_outliers_inliers