<a href="https://colab.research.google.com/github/yesoly/MachineLearningProject/blob/master/Assignment_08.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Principal Component Analysis

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import random as rd

## 1. Data



- the data are given in the file `data-pca.txt`
- the data consist of a set of points $\{ (x_i, y_i) \}_{i=1}^{n}$, where $z_i = (x_i, y_i)$ denotes a 2-dimensional point in Cartesian coordinates



Load the data from the file

In [None]:
path = '/content/drive/My Drive/ML_Assignment/data/data-pca.txt'
data = np.loadtxt(path, delimiter=',')
x = data[:,0]
y = data[:,1]

Plot the original data points

In [None]:
fig_1 = plt.figure(figsize = (6,6))
plt.scatter(x, y, c='r', marker = '+') 
plt.title('original data points')
plt.show()
fig_1.savefig('original data points.png')

## 2. Normalization

- the data are normalized to have mean 0 and standard deviation 1
- $x = \frac{x - \mu_x}{\sigma_x}$ and $y = \frac{y - \mu_y}{\sigma_y}$


> * $\mu_x$ denotes the mean of $x$
> * $\sigma_x$ denotes the standard deviation of $x$
> * $\mu_y$ denotes the mean of $y$
> * $\sigma_y$ denotes the standard deviation of $y$
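As a quick check of the formula above, the following sketch z-scores a small synthetic sample and confirms that the result has mean 0 and standard deviation 1. The sample values and the helper name `zscore` are made up for illustration and are not part of the assignment. Note that `np.std` uses the population standard deviation (`ddof=0`) by default.

```python
import numpy as np

def zscore(v):
    # hypothetical helper: subtract the mean first, then divide by the standard deviation
    return (v - np.mean(v)) / np.std(v)

# synthetic sample, not the assignment data
x_demo = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
xn_demo = zscore(x_demo)

print(xn_demo)
print(np.mean(xn_demo), np.std(xn_demo))
```

The parentheses around `v - np.mean(v)` matter: without them only the mean is divided by the standard deviation, and the result is merely shifted instead of being normalized.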









define a function to normalize the input data points $x$ and $y$

In [None]:
def normalize_data(x, y):

    # subtract the mean first, then divide by the standard deviation (note the parentheses)
    xn = (x - np.mean(x)) / np.std(x) # normalize x. the mean of xn is zero and the standard deviation of xn is one #
    yn = (y - np.mean(y)) / np.std(y) # normalize y. the mean of yn is zero and the standard deviation of yn is one #

    return xn, yn

Plot the normalized data points

In [None]:
xn, yn = normalize_data(x, y)

In [None]:
fig_2 = plt.figure(figsize = (6,6))
plt.scatter(xn, yn, c='r', marker = '+') 
plt.title('data normalized by z-scoring')
plt.show()
fig_2.savefig('normalized data points.png')

## 3. Covariance Matrix




- compute the co-variance matrix
- $\Sigma = \frac{1}{n} \sum_{i = 1}^n z_i z_i^T = \frac{1}{n} Z^T Z$

> * $n$ denotes the number of data

> * $Z = \begin{bmatrix} z_1^T \\ \vdots \\ z_n^T \end{bmatrix}$
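The formula above can be checked numerically. The sketch below builds $Z$ from a synthetic, z-scored sample (not the assignment data) and compares $\frac{1}{n} Z^T Z$ with `np.cov`. All names ending in `_demo` are made up for illustration. The formula assumes that the points $z_i$ have zero mean, which holds after normalization.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic correlated 2-D sample, not the assignment data
Z_demo = rng.normal(size=(200, 2)) @ np.array([[1.0, 0.6], [0.0, 0.8]])

# z-score each column so that the rows z_i have zero mean and unit standard deviation
Z_demo = (Z_demo - Z_demo.mean(axis=0)) / Z_demo.std(axis=0)

n = Z_demo.shape[0]
sigma_demo = Z_demo.T @ Z_demo / n      # (1/n) Z^T Z

# bias=True makes np.cov divide by n; rowvar=False treats columns as variables
reference = np.cov(Z_demo, rowvar=False, bias=True)

print(sigma_demo)
print(np.allclose(sigma_demo, reference))
```

Because the sample is z-scored, the diagonal entries are 1 and the off-diagonal entry is the correlation between the two coordinates.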





define a function to compute the co-variance matrix of the data

In [None]:
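def compute_covariance(x, y):

    # one possible implementation of (1/n) Z^T Z; assumes x and y are already normalized (zero mean)
    z = np.column_stack((x, y))     # Z has one row per data point
    covar = z.T @ z / z.shape[0]    # compute the covariance matrix #

    return covar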

## 4. Principal Components

* compute the eigen-values and the eigen-vectors of the co-variance matrix
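A minimal sketch of this step, assuming a made-up covariance matrix rather than the one from the assignment data. It uses `np.linalg.eigh`, which is intended for symmetric matrices and returns the eigenvalues in ascending order, so the result is reordered to put the direction of largest variance first.

```python
import numpy as np

# hypothetical covariance matrix of z-scored data (unit variances, correlation 0.8)
sigma_demo = np.array([[1.0, 0.8],
                       [0.8, 1.0]])

# eigh returns ascending eigenvalues; eigenvectors are the columns of the second output
eigenvalues, eigenvectors = np.linalg.eigh(sigma_demo)

# reorder so that the first column is the direction of largest variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)      # variances along the principal directions
print(eigenvectors)     # columns are unit-length principal directions
```

The sign of each eigenvector is arbitrary: both $v$ and $-v$ describe the same principal axis.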

define a function to compute the principal directions from the co-variance matrix

In [None]:
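def compute_principal_direction(covariance):

    # one possible implementation: eigh suits symmetric matrices and returns ascending eigenvalues
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]   # largest variance first
    direction = eigenvectors[:, order]      # columns are unit-length principal directions #

    return direction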

define a function to compute the projection of the data point onto the principal axis

In [None]:
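def compute_projection(point, axis):

    # one possible implementation: orthogonal projection ((p . a) / (a . a)) a
    # point may be a single 2-D point or an (n, 2) array of points
    coefficient = np.dot(point, axis) / np.dot(axis, axis)
    projection = np.multiply.outer(coefficient, axis) # compute the projection of point on the axis #

    return projection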
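To make the projection concrete, here is a small worked example with made-up numbers. The orthogonal projection of a point $p$ onto an axis through the origin with direction $a$ is $\frac{p \cdot a}{a \cdot a} a$. The names `point_demo` and `axis_demo` are hypothetical.

```python
import numpy as np

axis_demo = np.array([1.0, 1.0])     # hypothetical axis direction (need not be unit length)
point_demo = np.array([2.0, 1.0])    # hypothetical data point

# orthogonal projection: ((p . a) / (a . a)) * a
coefficient = np.dot(point_demo, axis_demo) / np.dot(axis_demo, axis_demo)
projection_demo = coefficient * axis_demo

# the residual is perpendicular to the axis, so its dot product with the axis is zero
residual_demo = point_demo - projection_demo

print(projection_demo)
print(np.dot(residual_demo, axis_demo))
```

Here $p \cdot a = 3$ and $a \cdot a = 2$, so the projection is $1.5 \cdot (1, 1) = (1.5, 1.5)$.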

In [None]:
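def compute_distance(point1, point2):

    # one possible implementation: Euclidean norm of the difference along the last axis
    distance = np.linalg.norm(np.asarray(point1) - np.asarray(point2), axis=-1) # compute the Euclidean distance between point1 and point2 #

    return distance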

K-means clustering algorithms

In [None]:
def k_means_clustering(max_iter, data, centroids):
    loss_iters = [] # record the loss values
    centroid_iters = []

    for i in range(max_iter):
        dist = compute_distance(data, centroids)
        cluster = compute_label(dist)
        centroids = compute_centroid(cluster)
        loss = compute_loss(cluster, centroids)
        c_dist = compute_centroid_distrance(centroids)
        loss_iters.append(loss)      # save the current loss value
        centroid_iters.append(c_dist)

    return cluster, centroids, loss_iters, centroid_iters

Learning K-means clustering

In [None]:
final_result, final_c, loss_iter, c_iter = k_means_clustering(max_iter, data, centroids)

Plot the loss curve

In [None]:
# Plot the loss curve
fig_3 = plt.figure(figsize = (8,5))

plt.plot(np.array(range(max_iter)),loss_iter, c = 'b')
plt.title('Loss')
plt.show()

fig_3.savefig('Loss.png')

Plot the centroid of each cluster

In [None]:
# Plot the centroid of each cluster
fig_4 = plt.figure(figsize = (8,8))

color=['red','blue','green', 'black', 'yellow']
label=['Cluster 1','Cluster 2','Cluster 3', 'Cluster 4', 'Cluster 5']
np_c_iter = np.array(c_iter)

for i in range(k):
    idx = (final_result[:,2]==i)
    plt.plot(np.array(range(max_iter)), np_c_iter[:,i], c=color[i],label=label[i])

plt.title('centroid of cluster')
plt.legend(loc = 'upper right')
plt.show()
fig_4.savefig('centroid of cluster.png')

Plot the final clustering result

In [None]:
fig_5 = plt.figure(figsize = (8,8))

color=['red','blue','green', 'black', 'yellow']
label=['Cluster 1','Cluster 2','Cluster 3', 'Cluster 4', 'Cluster 5']
for i in range(k):
    idx = (final_result[:,2]==i)
    plt.scatter(x_data[idx],y_data[idx], c=color[i],label=label[i])
plt.scatter(final_c[:,0],final_c[:,1],s=300, c='k', marker='+', label='Centroids')

plt.title('Final cluster')
plt.legend()
plt.show()
fig_5.savefig('Final cluster.png')

# Output

## 1. Plot the original data points [1pt]

In [None]:
fig_1

## 2. Plot the normalized data points [1pt]

- $z = \frac{z - \mu}{\sigma}$
- $\mu$ denotes the average and $\sigma$ denotes the standard deviation

In [None]:
fig_2

## 3. Plot the principal axes [2pt]

- plot the normalized data points
- plot the first principal vector
- plot the second principal vector
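One way to draw such a figure is sketched below on synthetic data standing in for the normalized points. The names `pts`, `vals`, `vecs` and `fig_demo` are made up for illustration. Scaling each vector by its eigenvalue is only a display choice, and the `Agg` backend is selected so the sketch also runs without a display (it is not needed inside a notebook).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")    # headless backend for this sketch; not needed in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# synthetic correlated sample standing in for the normalized data
pts = rng.normal(size=(100, 2)) @ np.array([[1.0, 0.7], [0.0, 0.5]])
pts = (pts - pts.mean(axis=0)) / pts.std(axis=0)

cov = pts.T @ pts / len(pts)
vals, vecs = np.linalg.eigh(cov)            # ascending eigenvalues
vals, vecs = vals[::-1], vecs[:, ::-1]      # largest eigenvalue first

fig_demo, ax = plt.subplots(figsize=(6, 6))
ax.scatter(pts[:, 0], pts[:, 1], c='r', marker='+')
for k, color in enumerate(['b', 'g']):
    # arrow from the origin along the k-th principal direction, scaled by its eigenvalue
    v = vals[k] * vecs[:, k]
    ax.quiver(0, 0, v[0], v[1], angles='xy', scale_units='xy', scale=1, color=color)
ax.set_aspect('equal')
ax.set_title('principal directions (synthetic sketch)')
```

Setting the aspect ratio to `'equal'` keeps the two vectors visibly perpendicular.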

In [None]:
fig_3

## 4. Plot the first principal axis [3pt]

- plot the normalized data points
- plot the first principal axis

In [None]:
fig_4

## 5. Plot the projection of the normalized data points onto the first principal axis [4pt]

- plot the normalized data points
- plot the first principal axis
- plot the projected points from the normalized data points onto the first principal axis

In [None]:
fig_5

## 6. Plot the lines between the normalized data points and their projection points on the first principal axis [3pt]

- plot the normalized data points
- plot the first principal axis
- plot the projected points from the normalized data points onto the first principal axis
- plot the lines that connect the normalized data points to their projection points on the first principal axis
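The four items above can be combined in one figure. The following is a sketch on synthetic data, not the assignment solution. The names `pts`, `first`, `proj` and `fig_demo` are hypothetical, and the `Agg` backend is only there so that the sketch runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")    # headless backend for this sketch; not needed in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# synthetic correlated sample standing in for the normalized data
pts = rng.normal(size=(30, 2)) @ np.array([[1.0, 0.7], [0.0, 0.5]])
pts = (pts - pts.mean(axis=0)) / pts.std(axis=0)

vals, vecs = np.linalg.eigh(pts.T @ pts / len(pts))
first = vecs[:, np.argmax(vals)]        # unit vector along the first principal axis

# projection of every point onto the axis: (p . a) a for a unit vector a
proj = np.outer(pts @ first, first)

fig_demo, ax = plt.subplots(figsize=(6, 6))
ax.scatter(pts[:, 0], pts[:, 1], c='r', marker='+', label='normalized points')
t = np.array([-3.0, 3.0])               # extent of the drawn axis segment
ax.plot(t * first[0], t * first[1], c='b', label='first principal axis')
ax.scatter(proj[:, 0], proj[:, 1], c='g', s=15, label='projections')
for p, q in zip(pts, proj):
    # segment joining each point to its projection
    ax.plot([p[0], q[0]], [p[1], q[1]], c='gray', linewidth=0.5)
ax.set_aspect('equal')
ax.legend()
```

With an equal aspect ratio, each connecting segment meets the axis at a right angle.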

In [None]:
fig_6

## 7. Plot the second principal axis [3pt]

- plot the normalized data points
- plot the second principal axis

In [None]:
fig_7

## 8. Plot the projection of the normalized data points onto the second principal axis [4pt]

- plot the normalized data points
- plot the second principal axis
- plot the projected points from the normalized data points onto the second principal axis

In [None]:
fig_8

## 9. Plot the lines between the normalized data points and their projection points on the second principal axis [3pt]

- plot the normalized data points
- plot the second principal axis
- plot the projected points from the normalized data points onto the second principal axis
- plot the lines that connect the normalized data points to their projection points on the second principal axis
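The projections onto the two principal axes are related. The sketch below, on synthetic data with made-up names, checks two facts: the projections onto the first and second axes add up to the original point (the axes are orthonormal), and the mean squared coordinate along each axis equals the corresponding eigenvalue of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic correlated sample standing in for the normalized data
pts = rng.normal(size=(50, 2)) @ np.array([[1.0, 0.7], [0.0, 0.5]])
pts = (pts - pts.mean(axis=0)) / pts.std(axis=0)

vals, vecs = np.linalg.eigh(pts.T @ pts / len(pts))   # ascending eigenvalues
second, first = vecs[:, 0], vecs[:, 1]                # smaller eigenvalue belongs to the second axis

proj_first = np.outer(pts @ first, first)
proj_second = np.outer(pts @ second, second)

# the two projections reconstruct every point exactly
print(np.allclose(proj_first + proj_second, pts))

# mean squared coordinate along each axis versus the matching eigenvalue
print(np.mean((pts @ first) ** 2), vals[1])
print(np.mean((pts @ second) ** 2), vals[0])
```

As a consequence, the connecting line from a point to its projection on the second axis is the point's projection onto the first axis, shifted to start at the projected point.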

In [None]:
fig_9