# **Lab 8: Dimensionality Reduction & Anomaly Detection**

During this lecture, we learned more about unsupervised problems. In this lab, we will see how to use PCA for dimensionality reduction, and LOF and IsolationForest for anomaly detection.

## Exercise 2: Novelty Detection with LOF

We are going to train a Local Outlier Factor (LOF) model on a dataset containing only normal credit card transactions (expected behaviour) and use it on unseen data to identify outliers.

We will load the data from:
https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab08/ex2/creditcard_fraud.csv

The steps are:
1. Load and explore dataset
2. Prepare data
3. Train LOF on normal observations
4. Analyse outlier detection results
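
The steps above can be sketched end to end on synthetic data before touching the real CSV. This is a minimal illustration, not the lab solution: the two-dimensional Gaussian cluster, the far-away points, the seed and `n_neighbors=20` are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical stand-in for "normal" transactions: one 2-D Gaussian cluster
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Unseen data: 5 points drawn like the training data, then 5 points far from it
X_like_train = rng.normal(loc=0.0, scale=1.0, size=(5, 2))
X_far = np.full((5, 2), 10.0) + rng.normal(scale=0.1, size=(5, 2))
X_unseen = np.vstack([X_like_train, X_far])

# novelty=True lets the fitted model score observations it has never seen
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)

# predict returns 1 for inliers and -1 for outliers
preds = lof.predict(X_unseen)
print(preds)
```

The last five predictions correspond to the far-away points and should be flagged as `-1`; the first five are drawn from the training distribution, so most of them would be expected to come back as `1`.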

---
### 0. Setup Environment

In [None]:
# Do not modify this code
!pip install -q utstd

from utstd.folders import *
from utstd.ipyrenders import *

lab = LabExFolder(
  course_code="36106",
  lab="lab08",
  exercise="ex02"
)
lab.run()

In [None]:
import warnings
warnings.simplefilter(action='ignore')

### 1. Load and Explore Dataset

**[1.1]** Import the pandas, numpy and altair packages

In [None]:
# Placeholder for student's code

In [None]:
# Solution
import pandas as pd
import numpy as np
import altair as alt

**[1.2]** Create a variable called `file_url` containing the link to the CSV file and load the dataset into a dataframe called `df`

In [None]:
# Solution
file_url = 'https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab08/ex2/creditcard_fraud.csv'
df = pd.read_csv(file_url)

**[1.3]** Display the first 5 rows of `df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.head()

**[1.4]** Display the dimensions of `df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.shape

**[1.5]** Display the summary of `df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.info()

**[1.6]** Display the descriptive statistics of `df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.describe()

### 2. Prepare Data

**[2.1]** Create a copy of `df` and save it into a variable called `df_cleaned`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df_cleaned = df.copy()

**[2.2]** Re-map the values of the `class` column:
- 0 will become 1
- 1 will become -1
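
The reason for this re-mapping is that scikit-learn outlier detectors return `1` for inliers and `-1` for outliers from `predict`, so the labels need the same encoding to be compared with the predictions later. A minimal sketch on a toy Series (the values below are made up for illustration):

```python
import pandas as pd

# Toy labels using the original encoding: 0 = normal, 1 = fraud
labels = pd.Series([0, 0, 1, 0, 1])

# Series.map with a dict replaces each value by its dict entry;
# any value missing from the dict would become NaN
remapped = labels.map({0: 1, 1: -1})
print(remapped.tolist())  # [1, 1, -1, 1, -1]
```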

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df_cleaned['class'] = df_cleaned['class'].map({0:1, 1:-1})

**[2.3]** Create a pandas mask called `outliers_mask` that will filter all observations with value -1 on the `class` column

In [None]:
# Placeholder for student's code

In [None]:
# Solution
outliers_mask = df_cleaned['class'] == -1

**[2.4]** Create 2 new dataframes from `df_cleaned`:
- one called `X_outliers` that will contain all the observations with value -1 on the `class` column
- the other one called `X_normal` that will contain all the remaining observations

In [None]:
# Placeholder for student's code

In [None]:
# Solution
X_outliers = df_cleaned[outliers_mask].copy()
X_normal = df_cleaned[~outliers_mask].copy()

**[2.5]** Print the dimensions of `X_outliers` and `X_normal`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
print(X_outliers.shape)
print(X_normal.shape)

**[2.6]** Randomly sample 500 observations of `X_normal` and save them in a new variable called `X_new`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
X_new = X_normal.sample(500)

**[2.7]** Remove all the observations contained in `X_new` from the `X_normal` dataframe (use dataframe indexes)
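
Dropping by index works because `sample` keeps the original index labels of the rows it picks. A small sketch on a toy dataframe, where the column name `amount` and the `random_state` value are assumptions added for reproducibility:

```python
import pandas as pd

toy = pd.DataFrame({"amount": [10, 20, 30, 40, 50]})

# sample() preserves the index labels of the selected rows
held_out = toy.sample(2, random_state=0)

# drop() removes rows by index label, so the two sets no longer overlap
remaining = toy.drop(held_out.index)

print(sorted(held_out.index), sorted(remaining.index))
```

After this step, no row appears in both `held_out` and `remaining`, which is what keeps the unseen normal observations out of the training data.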

In [None]:
# Placeholder for student's code

In [None]:
# Solution
X_normal.drop(X_new.index, inplace=True)

**[2.8]** Print the dimensions of `X_new` and `X_normal`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
print(X_new.shape)
print(X_normal.shape)

**[2.9]** Append the observations from `X_outliers` to `X_new`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
X_new = pd.concat([X_new, X_outliers])

In [None]:
X_new.shape

**[2.10]** Save the `class` column from `X_normal` and `X_new` into 2 new variables called `class_normal` and `class_new` respectively

In [None]:
# Placeholder for student's code

In [None]:
# Solution
class_normal = X_normal.pop('class')
class_new = X_new.pop('class')

### 3. Train LOF on normal observations

**[3.1]** Import LocalOutlierFactor from sklearn

In [None]:
# Placeholder for student's code

In [None]:
# Solution
from sklearn.neighbors import LocalOutlierFactor

**[3.2]** Create a variable called `lof` that instantiates the `LocalOutlierFactor` class (with `n_neighbors=21` and `novelty=True`) and fit it on `X_normal`
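
The `novelty=True` flag matters here: with the default `novelty=False`, `LocalOutlierFactor` only supports `fit_predict` on the training data and does not expose `predict` for unseen observations. A quick check, shown as a sketch rather than part of the lab solution:

```python
from sklearn.neighbors import LocalOutlierFactor

default_lof = LocalOutlierFactor()              # novelty=False by default
novelty_lof = LocalOutlierFactor(novelty=True)  # enables predict on new data

# predict is only exposed when novelty=True
has_predict_default = hasattr(default_lof, "predict")
has_predict_novelty = hasattr(novelty_lof, "predict")
print(has_predict_default, has_predict_novelty)
```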

In [None]:
# Placeholder for student's code

In [None]:
# Solution
lof = LocalOutlierFactor(n_neighbors=21, novelty=True).fit(X_normal)

**[3.3]** Store the predictions from `lof` on `X_new` in a variable called `preds`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
preds = lof.predict(X_new)
preds

### 4. Analyse Outlier Detection Results

**[4.1]** Import accuracy_score from sklearn

In [None]:
# Placeholder for student's code

In [None]:
# Solution
from sklearn.metrics import accuracy_score

**[4.2]** Display the accuracy score of the predictions against `class_new`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
accuracy_score(class_new, preds)

**[4.3]** Import ConfusionMatrixDisplay from sklearn

In [None]:
# Placeholder for student's code

In [None]:
# Solution
from sklearn.metrics import ConfusionMatrixDisplay

**[4.4]** Display the confusion matrix from the predictions against `class_new`
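
When reading the matrix, it helps to know that scikit-learn sorts the labels, so the first row and column correspond to `-1` (outliers) and the second to `1` (inliers). With `normalize='true'`, each row is divided by the number of true observations in that class, so every row sums to 1. A sketch on made-up labels using `confusion_matrix`, the function that computes the underlying values:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (illustrative values only)
y_true = np.array([1, 1, 1, 1, -1, -1])
y_pred = np.array([1, 1, 1, -1, -1, 1])

# Rows are true classes in sorted order (-1, then 1); each row sums to 1
cm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm)
# [[0.5  0.5 ]
#  [0.25 0.75]]
```

In this toy case, the first row says that half of the true outliers were detected, and the second row says that a quarter of the normal observations were wrongly flagged.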

In [None]:
# Placeholder for student's code

In [None]:
# Solution
ConfusionMatrixDisplay.from_predictions(class_new, preds, normalize='true')