# **Lab 8: Dimensionality Reduction & Anomaly Detection**

During the lecture, we learned more about unsupervised problems. In this lab, we will see how to use PCA for dimensionality reduction, and LOF and IsolationForest for anomaly detection.

## Exercise 1: PCA with K-Means

We are going to perform a cluster analysis with k-means to identify groups of customers with similar behaviors, using data whose dimensions have first been reduced with PCA.

We will be loading the data from here:
https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab08/ex1/customer_segmentation.csv

The steps are:
1.   Load and explore dataset
2.   Prepare Data
3.   Scale the data
4.   Find number of Principal Components
5.   Reduce dimensions with PCA
6.   Find Optimal Number of Clusters
7.   Train a K-means Model
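
As a rough preview of steps 3 to 7, the sketch below chains scaling, PCA and k-means on synthetic data. It is not the lab solution: `X_demo`, `workflow`, the 8 features and the choice of 3 components and 4 clusters are all assumptions made for illustration, and the lab itself runs each step separately so the intermediate results can be inspected.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer data (assumption): 300 rows, 8 numeric features
X_demo, _ = make_blobs(n_samples=300, n_features=8, centers=4, random_state=0)

# Scale -> reduce to 3 components -> cluster, chained in one estimator
workflow = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
labels_demo = workflow.fit_predict(X_demo)

print(labels_demo.shape)       # one cluster label per row
print(np.unique(labels_demo))  # cluster ids that were assigned
```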

---
### 0. Setup Environment

In [None]:
# Do not modify this code
!pip install -q utstd

from utstd.folders import *
from utstd.ipyrenders import *

lab = LabExFolder(
  course_code="36106",
  lab="lab08",
  exercise="ex01"
)
lab.run()

In [None]:
import warnings
warnings.simplefilter(action='ignore')

### 1. Load and Explore Dataset

**[1.1]** Import the pandas, numpy and altair packages

In [None]:
# Placeholder for student's code

In [None]:
# Solution
import pandas as pd
import numpy as np
import altair as alt

**[1.2]** Create a variable called `file_url` containing the link to the CSV file and load the dataset into a dataframe called `df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
file_url = 'https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab08/ex1/customer_segmentation.csv'
df = pd.read_csv(file_url)

**[1.3]** Display the first 5 rows of `df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.head()

**[1.4]** Display the dimensions of `df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.shape

**[1.5]** Display the summary of `df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.info()

**[1.6]** Display the descriptive statistics of `df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df.describe()

### 2.   Prepare Data

**[2.1]** Create a copy of `df` and save it into a variable called `df_cleaned`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df_cleaned = df.copy()

**[2.2]** Remove the columns `Customer Id` and `Address` from `df_cleaned`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df_cleaned.drop(['Customer Id', 'Address'], axis=1, inplace=True)

**[2.3]** Replace missing values in the `Defaulted` column with the value 0

In [None]:
# Placeholder for student's code

In [None]:
# Solution
df_cleaned['Defaulted'] = df_cleaned['Defaulted'].fillna(0)
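
To check that the replacement worked, it can help to count missing values before and after. The sketch below does this on a small made-up dataframe (`toy` and its values are hypothetical, not taken from the lab dataset). Note that filling with 0 treats a missing value as "did not default", which is an assumption about the data rather than something the dataset guarantees.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values in 'Defaulted'
toy = pd.DataFrame({
    "Age": [41, 47, 33, 29],
    "Defaulted": [0.0, np.nan, 1.0, np.nan],
})

missing_before = toy.isna().sum()               # missing values per column
toy["Defaulted"] = toy["Defaulted"].fillna(0)   # same call as in the solution
missing_after = toy.isna().sum()

print(missing_before["Defaulted"])  # 2
print(missing_after["Defaulted"])   # 0
```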

### 3.   Scale the data

**[3.1]** Create a copy of `df_cleaned` and save it into a variable called `X`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
X = df_cleaned.copy()

**[3.2]** Import StandardScaler from sklearn.preprocessing

In [None]:
# Placeholder for student's code

In [None]:
# Solution
from sklearn.preprocessing import StandardScaler

**[3.3]** Instantiate a StandardScaler that returns pandas dataframes and save it into a variable called `scaler`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
scaler = StandardScaler().set_output(transform="pandas")

**[3.4]** Fit and apply the scaling on `X` and save the results in a variable called `X_scaled`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
X_scaled = scaler.fit_transform(X)

In [None]:
X_scaled

### 4.   Find number of Principal Components

Note: If you are stuck in previous steps, you can download the content of the scaled data here: https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36106-mlaa/lab08/ex1

**[4.1]** Import PCA from sklearn.decomposition

In [None]:
# Placeholder for student's code

In [None]:
# Solution
from sklearn.decomposition import PCA

**[4.2]** Instantiate an object called `pca` from the PCA class

In [None]:
# Placeholder for student's code

In [None]:
# Solution
pca = PCA()

**[4.3]** Fit `pca` on the scaled data

In [None]:
# Placeholder for student's code

In [None]:
# Solution
pca.fit(X_scaled)

**[4.4]** Create a list called `pc_list` that will contain the name of each principal component (PC1, PC2, ...)

In [None]:
# Placeholder for student's code

In [None]:
# Solution
# One name per fitted component (PC1, PC2, ...)
pc_list = [f'PC{i}' for i in range(1, pca.n_components_ + 1)]
pc_list

**[4.5]** Create a dictionary called `pc_data` with the following key-value pairs:
- pc: list of principal component names
- exp_variance: list of the variance ratio explained by each principal component

In [None]:
# Placeholder for student's code

In [None]:
# Solution
pc_data = {
    'pc': pc_list,
    'exp_variance': list(pca.explained_variance_ratio_)
}
pc_data
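
One common way to turn the explained variance ratios into a number of components is to keep the smallest number whose cumulative ratio reaches a chosen threshold. The sketch below illustrates this on synthetic data; `X_demo`, `pca_demo` and the 80% threshold are assumptions for illustration, not values prescribed by the lab.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 8 features driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 8))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 8))

pca_demo = PCA().fit(StandardScaler().fit_transform(X_demo))

# Running total of the explained variance ratio, component by component
cum_var = np.cumsum(pca_demo.explained_variance_ratio_)

threshold = 0.80  # hypothetical cut-off, to be chosen for the problem at hand
# argmax returns the first position where the condition is True
n_keep = int(np.argmax(cum_var >= threshold)) + 1

print(cum_var.round(3))
print(n_keep)
```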

**[4.6]** Convert `pc_data` to a dataframe called `pc_df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
pc_df = pd.DataFrame(pc_data)
pc_df

**[4.7]** Create a line chart that will show the cumulative sum of the explained variance ratio across the principal components

In [None]:
# Placeholder for student's code

In [None]:
# Solution
# Sort the x-axis by component order (pc_list) rather than alphabetically
alt.Chart(pc_df).mark_line().transform_window(
    cum_var='sum(exp_variance)'
).encode(
    x=alt.X('pc:N', sort=pc_list),
    y='cum_var:Q'
).properties(width=600)

**[4.8]** Create a dictionary called `pc_loadings` that will have the names of the principal components as keys and their corresponding loadings as values

In [None]:
# Solution
# Each row of pca.components_ holds the loadings of one principal component
pc_loadings = dict(zip(pc_list, pca.components_))
pc_loadings

**[4.9]** Convert `pc_loadings` to a dataframe called `loadings_df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
loadings_df = pd.DataFrame(pc_loadings)
loadings_df

**[4.10]** Prepend a column called `feature_names` to `loadings_df` that will contain the names of the original features

In [None]:
# Placeholder for student's code

In [None]:
# Solution
loadings_df.insert(0, 'feature_names', X.columns)
loadings_df

**[4.11]** Display a horizontal bar chart with the loadings of `PC1` against the original features

In [None]:
# Placeholder for student's code

In [None]:
# Solution
alt.Chart(loadings_df).mark_bar().encode(
    x='PC1:Q',
    y="feature_names:N"
)

**[4.12]** Display a horizontal bar chart with the loadings of `PC2` against the original features

In [None]:
# Placeholder for student's code

In [None]:
# Solution
alt.Chart(loadings_df).mark_bar().encode(
    x='PC2:Q',
    y="feature_names:N"
)


**[4.13]** Display a horizontal bar chart with the loadings of `PC3` against the original features

In [None]:
# Placeholder for student's code

In [None]:
# Solution
alt.Chart(loadings_df).mark_bar().encode(
    x='PC3:Q',
    y="feature_names:N"
)
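
The bar charts can be cross-checked numerically by looking up, for each component, the feature with the largest absolute loading. The sketch below follows the lab's use of `pca.components_` as loadings but runs on hypothetical data (`toy` and its column names are invented for illustration). The sign of a whole component is arbitrary, so the magnitude of a loading is usually more informative than its direction.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two pairs of strongly correlated features
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
toy = pd.DataFrame({
    "income": base[:, 0],
    "card_debt": base[:, 0] + 0.1 * rng.normal(size=200),
    "age": base[:, 1],
    "years_employed": base[:, 1] + 0.1 * rng.normal(size=200),
})

pca_demo = PCA().fit(StandardScaler().fit_transform(toy))

# components_ has one row per component, so transpose to get features as rows
loadings = pd.DataFrame(
    pca_demo.components_.T,
    index=toy.columns,
    columns=[f"PC{i}" for i in range(1, pca_demo.n_components_ + 1)],
)

# Feature with the largest absolute loading on each component
top_feature = loadings.abs().idxmax()

print(loadings.round(2))
print(top_feature)
```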


### 5.   Reduce dimensions with PCA

**[5.1]** Instantiate a new PCA with 3 components only and fit on the scaled data

In [None]:
# Placeholder for student's code

In [None]:
# Solution
pca = PCA(n_components=3).fit(X_scaled)

**[5.2]** Apply PCA transformation on the scaled data using `pca` and save the outputs into a dataframe called `pca_df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
pca_df = pd.DataFrame(pca.transform(X_scaled))

**[5.3]** Rename the columns of `pca_df` to `PC1`, `PC2` and `PC3`. Display its content

In [None]:
# Placeholder for student's code

In [None]:
# Solution
pca_df.columns = ['PC1', 'PC2', 'PC3']
pca_df
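
Each value in the transformed data is a projection of a centred row onto one principal component. The sketch below checks this on synthetic data (`X_demo` and `pca_demo` are stand-ins, not the lab's variables) by reproducing `transform` with a matrix product, assuming the default `whiten=False`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 100 rows, 6 features
X_demo, _ = make_blobs(n_samples=100, n_features=6, centers=3, random_state=0)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=3).fit(X_demo_scaled)
projected = pca_demo.transform(X_demo_scaled)

# Manual projection: centre the data, then take dot products with each component
manual = (X_demo_scaled - pca_demo.mean_) @ pca_demo.components_.T

print(projected.shape)                 # (100, 3)
print(np.allclose(projected, manual))  # True
```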

### 6. Find Optimal Number of Clusters

**[6.1]** Create an empty dataframe called `cluster_df` and an empty list called `inertia_list`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
cluster_df = pd.DataFrame()
inertia_list = []

**[6.2]** Create a new column called `nb_cluster` on `cluster_df` that will contain values from 1 to 20

In [None]:
# Placeholder for student's code

In [None]:
# Solution
# range() excludes its stop value, so use 21 to include 20
cluster_df['nb_cluster'] = range(1, 21)

**[6.3]** Import KMeans from sklearn.cluster

In [None]:
# Placeholder for student's code

In [None]:
# Solution
from sklearn.cluster import KMeans

**[6.4]** Iterate over the candidate numbers of clusters, fit a k-means model on `pca_df` for each one, and store its inertia in `inertia_list`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
for k in cluster_df['nb_cluster']:
    kmeans = KMeans(n_clusters=k, random_state=8, n_init=10).fit(pca_df)
    inertia_list.append(kmeans.inertia_)

**[6.5]** Create a new column called `inertia` on `cluster_df` that will contain the inertia values

In [None]:
# Placeholder for student's code

In [None]:
# Solution
cluster_df['inertia'] = inertia_list

In [None]:
cluster_df
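
Inertia is the sum of squared distances between each point and the centroid of the cluster it was assigned to. The sketch below recomputes it by hand on synthetic data (`X_demo` and `km` are illustrative names, not the lab's variables). Because inertia typically keeps decreasing as clusters are added, the elbow of the curve is used rather than its minimum.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in (assumption): 150 points in 3 dimensions
X_demo, _ = make_blobs(n_samples=150, n_features=3, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=8).fit(X_demo)

# Look up, for every point, the centroid of the cluster it was assigned to
assigned_centroids = km.cluster_centers_[km.labels_]

# Sum of squared distances from each point to its own centroid
manual_inertia = ((X_demo - assigned_centroids) ** 2).sum()

print(np.isclose(manual_inertia, km.inertia_))
```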

**[6.6]** Plot a line chart that will show the inertia values against the number of clusters

In [None]:
# Placeholder for student's code

In [None]:
# Solution
alt.Chart(cluster_df).mark_line().encode(alt.X('nb_cluster'), alt.Y('inertia'))

**[6.7]** Create a variable called `k` that will take the optimal number of clusters

In [None]:
# Placeholder for student's code

In [None]:
# Solution
k = 4

### 7. Train a K-means Model

**[7.1]** Instantiate a KMeans model with `random_state=1` and `k` as `n_clusters`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
kmeans = KMeans(random_state=1, n_clusters=k)

**[7.2]** Fit the model on `pca_df`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
kmeans.fit(pca_df)

**[7.3]** Predict the cluster of each row and save the results in `pca_df` in a new column called `kmeans`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
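# Select the component columns explicitly so the cell can be re-run after 'kmeans' is added
pca_df['kmeans'] = kmeans.predict(pca_df[['PC1', 'PC2', 'PC3']])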

In [None]:
pca_df
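
Cluster labels on principal components are hard to read on their own, so one possible follow-up is to attach the labels to the original features and compare cluster sizes and averages. The sketch below uses a hypothetical dataframe with made-up values and labels. In the lab, the same idea would assume that the rows of `pca_df` are still in the same order as those of `df_cleaned`.

```python
import pandas as pd

# Hypothetical original features with cluster labels already attached
toy = pd.DataFrame({
    "Age": [25, 30, 52, 48, 35, 60],
    "Income": [20, 25, 90, 85, 40, 120],
    "kmeans": [0, 0, 1, 1, 0, 1],
})

# Number of rows per cluster
cluster_sizes = toy["kmeans"].value_counts().sort_index()

# Average of each original feature within each cluster
cluster_profile = toy.groupby("kmeans").mean()

print(cluster_sizes)
print(cluster_profile.round(1))
```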

**[7.4]** Display a scatter plot showing the clusters against `PC1` and `PC2`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
alt.Chart(pca_df).mark_point().encode(
    x='PC1',
    y='PC2',
    color='kmeans:N'
)

**[7.5]** Display a scatter plot showing the clusters against `PC1` and `PC3`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
alt.Chart(pca_df).mark_point().encode(
    x='PC1',
    y='PC3',
    color='kmeans:N'
)

**[7.6]** Display a scatter plot showing the clusters against `PC2` and `PC3`

In [None]:
# Placeholder for student's code

In [None]:
# Solution
alt.Chart(pca_df).mark_point().encode(
    x='PC2',
    y='PC3',
    color='kmeans:N'
)