# **Lab 8: Dimensionality Reduction & Anomaly Detection**

In this lecture, we learned more about unsupervised problems. In this lab, we will see how to use PCA for dimensionality reduction, and LOF and IsolationForest for anomaly detection.

## Exercise 3: Anomaly Detection with IsolationForest

We are going to train an IsolationForest on a dataset containing information on thyroid diseases. Each observation describes a patient who is either diagnosed as hypothyroid or normal.
https://archive.ics.uci.edu/ml/datasets/thyroid+disease


We will be loading the data from here:
https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab08/ex3/annthyroid.csv

The steps are:
1.   Load and explore dataset
2.   Prepare Data
3.   Train IsolationForest
4.   Analysing IsolationForest Results
5.   Train PCA
6.   Perform Dimensionality Reduction with PCA

---
### 0. Setup Environment



In [None]:
# Do not modify this code
!pip install -q utstd

from utstd.folders import *
from utstd.ipyrenders import *

lab = LabExFolder(
  course_code="36106",
  lab="lab08",
  exercise="ex03"
)
lab.run()

In [None]:
import warnings
warnings.simplefilter(action='ignore')

---

### 1. Load and Explore Dataset




**[1.1]** Import the pandas, numpy and altair packages

In [None]:
# Placeholder for student's code

In [None]:
# Solution
import pandas as pd
import numpy as np
import altair as alt

**[1.2]** Create a variable called `file_url` containing the link to the CSV file and load the dataset into a dataframe called `df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
file_url = 'https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab08/ex3/annthyroid.csv'
df = pd.read_csv(file_url)

**[1.3]** Display the first 5 rows of `df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.head()

**[1.4]** Display the dimensions of `df`


In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.shape

**[1.5]** Display the summary of `df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.info()

**[1.6]** Display the descriptive statistics of `df`


In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.describe()

---
### 2. Prepare Data

**[2.1]** Create a copy of `df` and save it into a variable called `X`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
X = df.copy()
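
Before fitting anything on `X`, it can help to confirm that every column is numeric and complete, since PCA in particular does not accept missing values. The sketch below uses a small made-up dataframe (`demo_df` and its column names are hypothetical, not the lab dataset) to show two quick checks:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one non-numeric column
demo_df = pd.DataFrame({
    "f1": [0.1, np.nan, 0.3],
    "f2": [1.0, 2.0, 3.0],
    "label": ["a", "b", "c"],
})

# Number of missing values in each column
missing_per_column = demo_df.isna().sum()

# Names of the columns that are not numeric
non_numeric_cols = demo_df.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(non_numeric_cols)
```

The same two calls can be run on `X`; if both come back empty, no further preparation is needed for this exercise.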

---
### 3. Train IsolationForest

**[3.1]** Import IsolationForest from sklearn

In [None]:
# Placeholder for student's code

In [None]:
# Solution
from sklearn.ensemble import IsolationForest

**[3.2]** Instantiate the IsolationForest into a variable called `ifr` and fit it on `X`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
ifr = IsolationForest(random_state=0).fit(X)

**[3.3]** Compute the predictions of `ifr` on `X` and save them into a variable called `preds`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
preds = ifr.predict(X)

**[3.4]** Display the number of observations of `preds` that have the value -1

In [None]:
# Placeholder for student's code

In [None]:
# Solution
(preds == -1).sum()
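
In the output of `predict`, the value -1 marks observations flagged as anomalies and 1 marks those treated as normal. As a minimal sketch on synthetic data (`X_demo`, `ifr_demo`, `preds_demo` and `scores` are illustrative names, not part of the lab), the following shows how these labels relate to the sign of `decision_function`:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic data: 200 points around the origin plus 5 points far away
X_demo = np.vstack([
    rng.normal(0, 1, size=(200, 2)),
    rng.normal(8, 0.5, size=(5, 2)),
])

ifr_demo = IsolationForest(random_state=0).fit(X_demo)
preds_demo = ifr_demo.predict(X_demo)        # array of 1 (normal) and -1 (anomaly)
scores = ifr_demo.decision_function(X_demo)  # negative scores are labelled -1

n_anomalies = int((preds_demo == -1).sum())
print(n_anomalies, "observations flagged out of", len(X_demo))
```

The continuous scores are useful when you want to rank observations by how unusual they are rather than rely on the binary label alone.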

---
### 4. Analysing IsolationForest Results

**[4.1]** Add a new column to `df` called `anomaly` that will contain the data from `preds`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df['anomaly'] = preds

In [None]:
df

**[4.2]** Create a Pandas mask called `anomaly_mask` that will check if observations from the column `anomaly` are equal to -1

In [None]:
# Placeholder for student's code

In [None]:
# Solution
anomaly_mask = df['anomaly'] == -1

**[4.3]** Display the descriptive statistics on the observations defined as anomalies (value -1)

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df[anomaly_mask].describe()

**[4.4]** Display the descriptive statistics on the observations defined as normal (value 1)

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df[~anomaly_mask].describe()
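
The two `describe()` outputs can be hard to compare side by side. As a sketch, assuming a small made-up dataframe `demo_df` with an `anomaly` column (the feature names `f1` and `f2` are hypothetical), `groupby` places the per-group means in a single table:

```python
import pandas as pd

# Hypothetical toy data: three normal rows (1) and one anomaly (-1)
demo_df = pd.DataFrame({
    "f1": [0.1, 0.2, 0.15, 5.0],
    "f2": [1.0, 1.1, 0.9, 9.0],
    "anomaly": [1, 1, 1, -1],
})

# One row per group (-1 and 1), one column per feature
group_means = demo_df.groupby("anomaly").mean()
print(group_means)
```

Applied to `df`, the same call would show which features differ most between the two groups.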

---
### 5. Train PCA

**[5.1]** Import PCA from sklearn

In [None]:
# Placeholder for student's code

In [None]:
# Solution
from sklearn.decomposition import PCA

**[5.2]** Instantiate a new PCA called `pca` with 2 components only

In [None]:
# Placeholder for student's code

In [None]:
# Solution
pca = PCA(n_components=2)

**[5.3]** Fit `pca` on `X`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
pca.fit(X)
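
PCA looks for directions of maximum variance, so a feature measured on a much larger scale than the others can dominate the first component. The sketch below illustrates this on synthetic data and shows one common remedy, standardising before PCA. The names `X_demo`, `raw_ratio`, `scaled_pca` and `scaled_ratio` are illustrative; whether scaling is needed for the lab dataset depends on the feature ranges seen in `df.describe()`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two independent synthetic features on very different scales
X_demo = np.column_stack([
    rng.normal(0, 1000, size=200),
    rng.normal(0, 1, size=200),
])

# Without scaling, the large-scale feature absorbs almost all the variance
raw_ratio = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# With standardisation, both features contribute comparably
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X_demo)
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```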

**[5.3]** Save the explained variance ratio of the principal components into a variable called `pc_variance_ratio`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
pc_variance_ratio = pca.explained_variance_ratio_
pc_variance_ratio
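
Each entry of `explained_variance_ratio_` is the fraction of the total variance captured by one component, so the cumulative sum shows how much of the variance is retained by the first k components. A minimal sketch on synthetic data (`X_demo`, `pca_demo`, `ratio` and `cumulative` are illustrative names):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Four synthetic features with decreasing spread
X_demo = rng.normal(size=(100, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

pca_demo = PCA(n_components=2).fit(X_demo)
ratio = pca_demo.explained_variance_ratio_  # one value per component, sorted descending
cumulative = np.cumsum(ratio)               # variance retained by the first k components

print(ratio)
print(cumulative)
```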

**[5.4]** Create a list called `pc_list` that will contain the name of each principal component (PC1, PC2, ...)

In [None]:
# Placeholder for student's code

In [None]:
# Solution
pc_list = [f'PC{i}' for i in range(1, len(pc_variance_ratio) + 1)]
pc_list

**[5.5]** Create a dictionary called `pc_loadings` that will have the names of the principal components as keys and their corresponding loadings as values

In [None]:
# Placeholder for student's code

In [None]:
# Solution
pc_loadings = dict(zip(pc_list, pca.components_))

**[5.6]** Convert `pc_loadings` to a dataframe called `loadings_df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
loadings_df = pd.DataFrame(pc_loadings)
loadings_df

**[5.7]** Prepend a column called `feature_names` to `loadings_df` that will contain the name of the original features

In [None]:
# Placeholder for student's code

In [None]:
# Solution
loadings_df.insert(0,'feature_names', X.columns)
loadings_df
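
The attribute `components_` has shape `(n_components, n_features)`: each row is one principal component expressed as weights on the original features. As an alternative sketch to the dictionary approach, assuming synthetic data with hypothetical column names `a`, `b` and `c`, the same table can be built by transposing `components_`:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
pca_demo = PCA(n_components=2).fit(X_demo)

# Transpose so that rows are original features and columns are components
loadings_demo = pd.DataFrame(
    pca_demo.components_.T,
    index=X_demo.columns,
    columns=["PC1", "PC2"],
)
print(loadings_demo)
```

Each column of this table is a unit-length vector, so a large absolute value means the feature contributes strongly to that component.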

**[5.8]** Display a horizontal bar chart with the loadings of PC1 against the original features

In [None]:
# Placeholder for student's code

In [None]:
# Solution
alt.Chart(loadings_df).mark_bar().encode(
    x='PC1:Q',
    y="feature_names:N"
)

**[5.9]** Display a horizontal bar chart with the loadings of PC2 against the original features

In [None]:
# Placeholder for student's code

In [None]:
# Solution
alt.Chart(loadings_df).mark_bar().encode(
    x='PC2:Q',
    y="feature_names:N"
)

---
### 6.   Perform Dimensionality Reduction with PCA

**[6.1]** Apply PCA transformation on `X` using `pca` and save the outputs into a dataframe called `res_df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
res_df = pd.DataFrame(pca.transform(X))
res_df

**[6.2]** Rename the columns of `res_df` to PC1 and PC2. Display its content

In [None]:
# Placeholder for student's code

In [None]:
# Solution
res_df.columns = ['PC1', 'PC2']
res_df

**[6.3]** Copy the content of columns `anomaly` from `df` into a new column called `anomaly` in `res_df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
res_df['anomaly'] = df['anomaly']

**[6.4]** Display a scatter plot showing the observations from `res_df` against PC1 and PC2

In [None]:
# Placeholder for student's code

In [None]:
# Solution
alt.data_transformers.disable_max_rows()
alt.Chart(res_df).mark_circle(opacity=0.3).encode(
    x='PC1',
    y='PC2',
    color='anomaly:N'
)
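
For intuition about what `pca.transform` computes: with the default settings (no whitening), it subtracts the per-feature mean learned during `fit` and projects the result onto the principal components. The sketch below checks this on synthetic data (`X_demo`, `pca_demo`, `projected` and `manual` are illustrative names):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
pca_demo = PCA(n_components=2).fit(X_demo)

# Projection computed by scikit-learn
projected = pca_demo.transform(X_demo)

# Same projection by hand: centre the data, then multiply by the component vectors
manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T

print(np.allclose(projected, manual))
```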