# Dimension reduction answers

Notebook containing the example answers to Chapter 4 - Dimension reduction.

Note: these answers are examples and your way of tackling the problem may also be correct.

## Preparation

In [None]:
import numpy as np

from sklearn.decomposition import PCA 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.decomposition import KernelPCA
from sklearn.decomposition import IncrementalPCA
from sklearn.manifold import TSNE

import matplotlib.pyplot as plt

import pandas as pd

# function from original notebook
from dim_red_function import PCA_train_predict_score

In [None]:
high_dim = np.loadtxt("../data/high_dimensions.csv", delimiter=",")

# Splitting the data set
X = high_dim[:,0:-1]
y = high_dim[:,-1]

## Exercise 1

Perform PCA again on $X$, this time with $n\_components=10$. What is the difference in **total** explained variance ratio between using five and ten components? Which number of components might we prefer to use?

In [None]:
# Find the decomposition of the data
pca_5 = PCA(n_components=5).fit(X)
pca_10 = PCA(n_components=10).fit(X)

# Calculate total variance ratio
tevr_5 = pca_5.explained_variance_ratio_.sum()
tevr_10 = pca_10.explained_variance_ratio_.sum()

print("5 components: ", tevr_5)
print("10 components: ", tevr_10)

In [None]:
# Calculate the difference between the ratios (largest - smallest)
tevr_difference = tevr_10 - tevr_5

print("Difference in total explained variance ratio: ", round(tevr_difference, 3))

Using $k = 10$ explains more of the variance in the original data than $k = 5$, so we would prefer it when retaining information matters more than having fewer dimensions.
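
As a minimal sketch of where these ratios come from, the example below reproduces the explained variance ratio by hand from the singular values of the centred data. It uses randomly generated data (`X_demo` is a hypothetical stand-in, not the notebook's $X$), so the printed numbers are illustrative only. Because the components are ordered by the variance they explain, the total for ten components can never be lower than the total for five.

```python
import numpy as np

# Hypothetical stand-in data: 200 samples, 20 features with unequal spread
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 20)) * np.linspace(5.0, 0.5, 20)

# Centre the data, then take the singular values (returned largest first)
X_centred = X_demo - X_demo.mean(axis=0)
singular_values = np.linalg.svd(X_centred, compute_uv=False)

# Each component's share of the total variance
evr_demo = singular_values**2 / np.sum(singular_values**2)

tevr_demo_5 = evr_demo[:5].sum()
tevr_demo_10 = evr_demo[:10].sum()

print("5 components: ", tevr_demo_5)
print("10 components: ", tevr_demo_10)
```

The shares over all 20 components sum to 1, which is why the value is called a ratio.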

## Exercise 2

What value of $k$ has the highest associated F1 score? What does that tell you about our data?

**HINT** Look up the numpy function $np.argmax()$

### Preparation

In [None]:
# The maximum number of components is given by the original feature number
k_max = X.shape[1]

# Create a list of k_values
k_values = list(range(1, k_max))

# Generate F1 scores for each k value of PCA
f1_scores = [PCA_train_predict_score(X, y, k) for k in k_values]

### Answer

In [None]:
# We get the index of the highest F1 score here
highest_value_index = np.argmax(f1_scores)

# The k value this corresponds to is
highest_k = k_values[highest_value_index]

print("The value of k with the highest F1 score is:", highest_k)
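
The indexing step can be easy to get wrong, so here is a small sketch with made-up scores (`k_values_demo` and `f1_scores_demo` are hypothetical and are not results from this notebook). It shows that `np.argmax` returns a zero-based position, which then has to be mapped back to a value of $k$.

```python
import numpy as np

# Hypothetical scores for k = 1, 2, 3, 4 (not results from this notebook)
k_values_demo = [1, 2, 3, 4]
f1_scores_demo = [0.61, 0.74, 0.82, 0.79]

# np.argmax gives a zero-based position in the list, not the k value itself
best_index = int(np.argmax(f1_scores_demo))
best_k = k_values_demo[best_index]

print("index:", best_index, "k:", best_k)
```

If several values of $k$ tie for the highest score, `np.argmax` returns the first of them, which here would be the smallest $k$.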

## Exercise 3
Using $X$ fit a PCA model with $k = 50$.

Plot the cumulative sum of the explained variance ratio of the components.

**HINT:** use the numpy $np.cumsum()$ function

In [None]:
# Create the object with the decomposed data
pca_50 = PCA(n_components=50).fit(X)

# Calculate the cumulative sum of the components' explained variance ratios
cumulative_sum_ver = np.cumsum(pca_50.explained_variance_ratio_)

# Create a list of the component numbers
components_values = list(range(1, 51))

In [None]:
plt.title("Cumulative explained variance ratio by number of components")
plt.xlabel("number of components")
plt.ylabel("Cumulative explained variance ratio")
plt.plot(components_values, cumulative_sum_ver, c="navy");

We can see that the cumulative explained variance ratio increases sharply at low numbers of components. Around 5-10 components the rate of increase slows, showing that each additional component explains less additional variance than the earlier components.
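
One way to make the flattening of the curve concrete is to look at the step between neighbouring points. The sketch below uses made-up ratios (`evr_demo` is hypothetical, not taken from `pca_50`) and recovers the gain from each additional component with `np.diff`.

```python
import numpy as np

# Hypothetical explained variance ratios, largest first
evr_demo = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

cumulative_demo = np.cumsum(evr_demo)

# The gain from adding component k is the step between neighbouring points on the curve
gains = np.diff(cumulative_demo, prepend=0.0)

print("cumulative:", cumulative_demo)
print("gain per component:", gains)
```

The gains are simply the individual ratios again, so a curve that flattens is another way of saying that the later components have small ratios.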

## Exercise 4

Using $X\_circles$ perform *linear* PCA with $k = 2$. Plot the resulting new dimensions of data. Calculate the **total** explained variance ratio of the model and discuss why it may be difficult for a linear machine learning model to learn the pattern of this data.

### Preparation


In [None]:
# Load in our concentric circle data
circle_data = np.loadtxt("../data/circles.csv", delimiter=",")

# Split X and Y data
X_circles, y_circles = circle_data[:,0:-1], circle_data[:,-1]

### Answer

In [None]:
# Create and fit the PCA object
pca_circles_2 = PCA(n_components=2).fit(X_circles)

# Generate the PCA'd data
X_circles_2 = pca_circles_2.transform(X_circles)

In [None]:
plt.figure(figsize=(6, 6))
plt.title("Linear PCA Dimensions of Features")
plt.xlabel("$component~1$")
plt.ylabel("$component~2$")
plt.axis("equal")
plt.scatter(x=X_circles_2[:,0], y=X_circles_2[:,1], c=y_circles, cmap="coolwarm");

This view of our data shows that we still have two classes, one a circle within the ring of the other. A simple linear model would struggle to separate them: no straight line in this space puts one class mostly on one side and the other class mostly on the other.
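
The exercise also asks for the total explained variance ratio. The sketch below shows one way to compute it, using scikit-learn's `make_circles` as a stand-in for `circles.csv`. This is an assumption: the stand-in has exactly two features, as the two-dimensional plot suggests, but the real file may differ. It also fits a logistic regression as an example of a linear model on the transformed data.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Stand-in for circles.csv: two concentric rings with two features
X_demo, y_demo = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

pca_demo = PCA(n_components=2).fit(X_demo)
X_demo_2 = pca_demo.transform(X_demo)

# With two features and two components, no variance is discarded
tevr_demo = pca_demo.explained_variance_ratio_.sum()
print("Total explained variance ratio:", tevr_demo)

# A linear classifier fitted and scored on the transformed data
linear_accuracy = LogisticRegression().fit(X_demo_2, y_demo).score(X_demo_2, y_demo)
print("Logistic regression training accuracy:", linear_accuracy)
```

Keeping every component retains all of the variance, yet the linear classifier still cannot separate the rings well, so a high explained variance ratio does not by itself make the classes linearly separable.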

## Exercise 5

Perform Kernel PCA on the $X\_circles$ data set using the "rbf" kernel with $k = 2$.

Use values of $gamma=1$ and $gamma=7$. What effect does this have?

Plot the results of each value.

In [None]:
# Create kernel PCA objects with associated parameters
kpca_g_1 = KernelPCA(n_components=2, kernel="rbf", gamma=1)
kpca_g_7 = KernelPCA(n_components=2, kernel="rbf", gamma=7)

# Generate the data from Kernel PCA with each parameter
X_g_1 = kpca_g_1.fit_transform(X_circles)
X_g_7 = kpca_g_7.fit_transform(X_circles)

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, sharey=True, sharex=True, figsize=(10, 5))

ax1.set_title("rbf kernel $gamma=1$")
ax1.set_xlabel("$component~1$")
ax1.set_ylabel("$component~2$")
ax1.scatter(x=X_g_1[:,0], y=X_g_1[:,1], c=y_circles, cmap="coolwarm");

ax2.set_title("rbf kernel $gamma=7$")
ax2.set_xlabel("$component~1$")
ax2.set_ylabel("$component~2$")
ax2.scatter(x=X_g_7[:,0], y=X_g_7[:,1], c=y_circles, cmap="coolwarm");

The value of `gamma` controls how quickly the radial basis function kernel's similarity falls off with the distance between two points. A larger `gamma` makes the kernel more local, so only close neighbours count as similar, which gives a more pronounced change to our circular data. `gamma` is a hyperparameter, which means we can tune it!
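
As a rough illustration of how `gamma` enters the calculation, the RBF kernel scores a pair of points as $\exp(-\gamma \lVert a - b \rVert^2)$. The sketch below evaluates this for two made-up points; `rbf_similarity`, `point_a` and `point_b` are hypothetical names introduced only for this example.

```python
import numpy as np

def rbf_similarity(a, b, gamma):
    """RBF kernel value exp(-gamma * ||a - b||^2) for two points."""
    squared_distance = np.sum((np.asarray(a) - np.asarray(b)) ** 2)
    return np.exp(-gamma * squared_distance)

# Two hypothetical points half a unit apart
point_a = [0.0, 0.0]
point_b = [0.5, 0.0]

for gamma in (1, 7):
    print("gamma =", gamma, "similarity =", rbf_similarity(point_a, point_b, gamma))
```

For the same pair of points the similarity is lower with `gamma=7` than with `gamma=1`, so points need to be closer together before the kernel treats them as alike.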

## Exercise 6

Fit an incremental PCA model ($ipca$) to $X$. Compare its total explained variance ratio with that of the original PCA method.

In [None]:
# Create and fit incremental PCA object
ipca_20 = IncrementalPCA(n_components=20).fit(X)

# Create and fit linear PCA object
pca_20 = PCA(n_components=20).fit(X)

In [None]:
ipca_20_tevr = ipca_20.explained_variance_ratio_.sum()
pca_20_tevr = pca_20.explained_variance_ratio_.sum()

In [None]:
print("IPCA total explained variance ratio: {}%".format(round(100*ipca_20_tevr, 2)))
print("PCA total explained variance ratio: {}%".format(round(100*pca_20_tevr, 2)))
print("Difference: {}%".format(round(100*(pca_20_tevr - ipca_20_tevr), 3)))

We can see that there is a difference in performance between the two methods for the data and number of components given. However, this difference is small, so the resulting difference in model performance will also be small. The added benefit of being able to use this method for data which will not fit into memory can outweigh the slight performance reduction in some cases.
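
To illustrate the memory benefit, here is a minimal sketch that feeds data to `IncrementalPCA` one batch at a time with `partial_fit`. The data is randomly generated (`X_demo` is a hypothetical stand-in) and is split in memory purely for demonstration; in a real out-of-memory setting each batch would be read from disk instead.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Hypothetical stand-in data; in practice each batch would be read from disk
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 20)) * np.linspace(5.0, 0.5, 20)

ipca_demo = IncrementalPCA(n_components=5)

# Feed the data in ten batches of 100 rows; each batch needs at least n_components rows
for batch in np.array_split(X_demo, 10):
    ipca_demo.partial_fit(batch)

ipca_demo_tevr = ipca_demo.explained_variance_ratio_.sum()
print("IPCA total explained variance ratio:", ipca_demo_tevr)
```

Only one batch is processed at a time, which is what allows the method to scale to data sets larger than the available memory.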

## Exercise 7

Using /data/bikes.csv, remove the non-numeric data, visualise the data in 2D using t-SNE and the target "count" attribute.

Compare t-SNE and PCA in 2D for this data. 

What value of $k$ is needed for the total explained variance ratio to be $> 0.8$?

In [None]:
bikes_raw_data = pd.read_csv("../data/bikes.csv")

# Keep only boolean and numerical data in the frame
bikes_numeric = bikes_raw_data.select_dtypes(include=["float64", "int64", "bool"])

In [None]:
# Drop rows with missing values
bikes_clean = bikes_numeric.dropna()

# Convert to numpy
bikes = bikes_clean.to_numpy()

# Separate features and target
X_bikes, y_bikes = bikes[:,:-1], bikes[:,-1]

In [None]:
# Create dimension reduction objects
tsne_2 = TSNE(n_components=2)
pca_2 = PCA(n_components=2)

# Generate the new data for each technique
X_bikes_tsne = tsne_2.fit_transform(X_bikes)
X_bikes_pca = pca_2.fit_transform(X_bikes)

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))

ax1.set_title("t-SNE")
ax1.set_xlabel("$component~1$")
ax1.set_ylabel("$component~2$")
tsne_plot = ax1.scatter(x=X_bikes_tsne[:,0], y=X_bikes_tsne[:,1], c=y_bikes, cmap="coolwarm")
fig.colorbar(tsne_plot, ax=ax1)

ax2.set_title("PCA")
ax2.set_xlabel("$component~1$")
ax2.set_ylabel("$component~2$")
pca_plot = ax2.scatter(x=X_bikes_pca[:,0], y=X_bikes_pca[:,1], c=y_bikes, cmap="coolwarm")
fig.colorbar(pca_plot, ax=ax2);

We can see that the two methods give us significantly different structures of our data with regard to the target. Both seem to separate the values of "count" to some extent, with t-SNE creating a clear boundary.

In [None]:
# Create and fit object with all components (default number)
pca_bikes_all = PCA().fit(X_bikes)

# Calculate the cumulative sum of each component's ratio in order
pca_bikes_all.explained_variance_ratio_.cumsum()

From inspection of the cumulative sum of the explained variance ratio we can see that $k=2$ gives us $>0.8$ explained. For larger numbers of components, however, we can pass the variance we want to achieve to the PCA object instead of the number of components, and it will find the right number of components for us.

In [None]:
# We can specify the desired ratio
pca_bikes_var = PCA(n_components=0.8).fit(X_bikes)

# The PCA object has useful attributes
pca_bikes_var.n_components_

More explicitly, it follows a method similar to:

In [None]:
# Create an array of the cumulative ratio values
cum_sums = pca_bikes_all.explained_variance_ratio_.cumsum()

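# Create a range of k values from 1 to the number of fitted components
# (len(cum_sums) avoids the n_features_ attribute, which newer scikit-learn versions no longer provide)
k_values = range(1, len(cum_sums) + 1)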

# Loop over the cumulative sums and stop when value > 0.8
for index, each_sum in enumerate(cum_sums):
    if each_sum > 0.8:
        print("Number of required k for > 0.8 explained variance ratio:", k_values[index])
        break
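
The same search can be written without a loop. The sketch below is one possible vectorised version, shown on made-up cumulative ratios (`cum_sums_demo` is hypothetical and does not contain the bikes results).

```python
import numpy as np

# Hypothetical cumulative ratios (not the bikes results)
cum_sums_demo = np.array([0.55, 0.85, 0.95, 1.00])

# Boolean mask of components whose cumulative ratio exceeds the threshold;
# np.argmax returns the position of the first True
k_required = int(np.argmax(cum_sums_demo > 0.8)) + 1

print("Number of required k for > 0.8 explained variance ratio:", k_required)
```

This relies on at least one entry exceeding the threshold. That holds when all components are kept, because the cumulative ratio then ends at 1, but `np.argmax` would return position 0 for an all-False mask, so a truncated PCA would need an extra check.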