# Anomaly Detection

This notebook uses the **Isolation Forest** algorithm to detect anomalies in AWS CloudTrail logs.

## How It Works
Isolation Forest builds an ensemble of random trees. Each tree is grown by repeatedly picking a feature at random and then picking a split value at random between that feature's minimum and maximum. The key idea is that anomalies are isolated after fewer splits than normal points, which sit close to one another and therefore need more splits to separate.

1. **Random Splitting**: Each tree partitions the data by choosing a random feature and a random split value. Points that lie far from the rest tend to be cut off after only a few such partitions.
1. **Path Length**: The number of splits needed to isolate a data point is its path length. Anomalies, being few and distinct, are expected to have shorter path lengths.
1. **Scoring**: The anomaly score is derived from the path length averaged over all trees in the forest. Points with shorter average path lengths are considered anomalies.
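
The three steps above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: it follows a single point through freshly drawn random splits over the full data set (no subsampling), and the helper names `isolation_depth`, `average_depth`, and `anomaly_score` are made up for this example. The score normalization follows the formula from the original Isolation Forest paper, using the usual logarithmic approximation of the harmonic number.

```python
import math
import random


def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count the random splits needed to separate `point` from the rest of `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))        # step 1: pick a random feature...
    values = [row[feature] for row in data]
    low, high = min(values), max(values)
    if low == high:                            # this feature cannot split the data
        return depth
    split = rng.uniform(low, high)             # ...and a random split value
    # Keep only the rows that fall on the same side of the split as `point`.
    if point[feature] < split:
        same_side = [row for row in data if row[feature] < split]
    else:
        same_side = [row for row in data if row[feature] >= split]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)


def average_depth(point, data, n_trees=200, seed=0):
    """Step 2: average the path length of `point` over many random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees


def anomaly_score(avg_depth, n):
    """Step 3: map an average path length to a score in (0, 1]; higher is more anomalous.

    Assumes n > 2 so that the approximation of c(n) below is positive.
    """
    harmonic = math.log(n - 1) + 0.5772156649  # approximates the harmonic number H(n - 1)
    c_n = 2.0 * harmonic - 2.0 * (n - 1) / n   # normalizing constant c(n)
    return 2.0 ** (-avg_depth / c_n)


# Synthetic data: a tight 2-D cluster plus one far-away point.
data_rng = random.Random(42)
cluster = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = cluster + [outlier]

outlier_depth = average_depth(outlier, data)
typical_depth = average_depth(cluster[0], data)
outlier_score = anomaly_score(outlier_depth, len(data))
typical_score = anomaly_score(typical_depth, len(data))

print(f"outlier: depth={outlier_depth:.1f}, score={outlier_score:.2f}")
print(f"typical: depth={typical_depth:.1f}, score={typical_score:.2f}")
```

The far-away point should come out with a shorter average path and a higher score than the point inside the cluster.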

## Advantages
- **Efficiency**: Each tree is trained on a small random subsample, so Isolation Forest scales to large datasets with low memory requirements.
- **Performance**: It needs no distance or density calculations, which helps it remain effective on high-dimensional datasets.
- **Interpretability**: The concept of path length and isolation provides a straightforward interpretation of why a point is considered an anomaly.

In [None]:
%pip install seaborn matplotlib pandas numpy scikit-learn

In [None]:
%pip install https://scanner-dev-public.s3.us-west-2.amazonaws.com/sdks/python/scanner_client-0.0.1-py3-none-any.whl

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
import matplotlib.pyplot as plt
import seaborn as sns
from scanner_client import Scanner
from datetime import datetime, timezone, timedelta
import os

In [None]:
def convert_results_to_data_frame(results):
    """Convert Scanner query results into a pandas DataFrame."""
    rows = [row.columns.to_dict() for row in results.rows]
    column_tags = results.column_tags.to_dict()
    if len(column_tags) > 0:
        # If this is a table, use the column ordering in the data frame
        return pd.DataFrame(data=rows, columns=results.column_ordering)
    else:
        # Otherwise, this is a list of log events, so use pandas JSON
        # normalization to set the table columns to the union of all keys.
        return pd.json_normalize(rows)
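
As a small illustration of the second branch, `pd.json_normalize` flattens nested objects into dot-separated column names, which is why later cells refer to columns such as `userIdentity.arn`. The two records below are invented for this sketch and are not real CloudTrail events.

```python
import pandas as pd

# Hypothetical, heavily trimmed log events (not real data)
sample_rows = [
    {
        "eventName": "GetObject",
        "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::111122223333:user/alice"},
    },
    {"eventName": "ListBuckets", "awsRegion": "us-west-2"},
]

# Nested keys become dotted columns; keys missing from a record become NaN
flat = pd.json_normalize(sample_rows)
print(sorted(flat.columns))
```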

In [None]:
scanner = Scanner(
    api_url=os.environ["SCANNER_API_URL"],
    api_key=os.environ["SCANNER_API_KEY"],
)

In [None]:
end_time = datetime.now(tz=timezone.utc)
start_time = end_time - timedelta(days=1)

In [None]:
response = scanner.query.blocking_query(
    start_time=start_time.isoformat(),
    end_time=end_time.isoformat(),
    query_text="""
        %ingest.source_type: "aws:cloudtrail"
        userIdentity.type: IAMUser
    """
)
df = convert_results_to_data_frame(response.results)
df.head()

In [None]:
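features = [
    'eventSource', 'eventName', 'userIdentity.arn', 'sourceIPAddress', 'eventTime',
    'awsRegion', 'eventHour', 'eventDate',
]

# Parse the timestamp once, then derive the hour of day and the calendar date
event_time = pd.to_datetime(df['eventTime'])
df['eventHour'] = event_time.dt.hour
df['eventDate'] = event_time.dt.date

# Copy the selection so that adding the encoded columns below
# does not trigger pandas' SettingWithCopyWarning
df = df[features].copy()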

# Encode categorical features as integer codes. pd.factorize numbers the
# distinct values in order of first appearance, so the codes carry no ordinal meaning.
df['encEventDate'] = pd.factorize(df['eventDate'])[0]
df['encEventSource'] = pd.factorize(df['eventSource'])[0]
df['encEventName'] = pd.factorize(df['eventName'])[0]
df['encSourceIPAddress'] = pd.factorize(df['sourceIPAddress'])[0]
df['encUserIdentityArn'] = pd.factorize(df['userIdentity.arn'])[0]
df['encAwsRegion'] = pd.factorize(df['awsRegion'])[0]

df.head()
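
To make the encoding step concrete, here is a minimal sketch of what `pd.factorize` returns. The sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical event sources (not taken from real logs)
event_sources = pd.Series(["s3.amazonaws.com", "iam.amazonaws.com", "s3.amazonaws.com"])

# codes holds one integer per row; uniques holds the distinct values in first-seen order
codes, uniques = pd.factorize(event_sources)
print(codes.tolist())    # [0, 1, 0]
print(uniques.tolist())  # ['s3.amazonaws.com', 'iam.amazonaws.com']
```

Because the codes depend only on the order in which values first appear, the model treats them as arbitrary numeric positions. One-hot encoding would be an alternative if that ordering turns out to matter.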

In [None]:
model_features = [
    'eventHour', 'encEventDate', 'encEventSource', 'encEventName', 'encSourceIPAddress', 'encUserIdentityArn', 'encAwsRegion',
]
model_df = df[model_features]
model_df.head()

In [None]:
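from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(model_df)

# Train the Isolation Forest model. contamination=0.01 assumes that about 1%
# of events are anomalous; random_state makes the run reproducible.
clf = IsolationForest(contamination=0.01, random_state=42)
df['anomaly'] = clf.fit_predict(X)

# fit_predict labels inliers as 1 and outliers as -1;
# convert to 0 (normal) and 1 (anomaly)
df['anomaly'] = df['anomaly'].map({1: 0, -1: 1})

# Keep only the rows flagged as anomalies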
anomalies = df[df['anomaly'] == 1]
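
The `contamination` parameter controls what share of the training rows gets flagged, and `score_samples` exposes the underlying scores so that rows can be ranked. A minimal sketch on synthetic data follows; the array is random noise standing in for the scaled features, not CloudTrail data, and the `demo_` names are specific to this example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))      # synthetic stand-in for the scaled features

demo_clf = IsolationForest(contamination=0.01, random_state=42)
labels = demo_clf.fit_predict(X_demo)    # 1 = inlier, -1 = outlier
n_flagged = int((labels == -1).sum())
print(f"flagged {n_flagged} of {len(X_demo)} rows")

# Lower score_samples values mean "more anomalous"
scores = demo_clf.score_samples(X_demo)
most_anomalous = np.argsort(scores)[:5]  # row positions of the 5 lowest scores
print(most_anomalous)
```

With `contamination=0.01`, roughly 1% of the rows are labeled as outliers regardless of whether the data contains any genuine anomalies, so the value is worth tuning to the expected anomaly rate.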



## Anomalies Detected

In [None]:
print(f"Anomalies found: {len(anomalies)}")

anomalies[features]

## Anomalies by Event Name

In [None]:
# Count the flagged events per event name
plt.figure(figsize=(10, 6))
sns.countplot(data=anomalies, x='eventName')
plt.title('Anomalies by Event Name')
plt.xlabel('Event Name')
plt.ylabel('Count of Anomalies')
plt.xticks(rotation=45)
plt.show()

## Heatmap of Anomalies by Hour and Date

In [None]:
# Creating a pivot table for the heatmap
heatmap_data = anomalies.pivot_table(index='eventHour', columns='eventDate', aggfunc='size', fill_value=0)

plt.figure(figsize=(12, 8))
sns.heatmap(heatmap_data, annot=True, fmt='d')
plt.title('Heatmap of Anomalies by Hour and Date')
plt.xlabel('Event Date')
plt.ylabel('Event Hour')
plt.show()