# Tutorial 9 - Unsupervised Learning

*Written and revised by Jozsef Arato, Mengfan Zhang, Dominik Pegler*  
Computational Cognition Course, University of Vienna  
https://github.com/univiemops/tewa1-computational-cognition

---
**This tutorial will cover:**

1. multivariate gaussian distribution
2. gaussian mixture model and k-means
3. k-means for image processing
4. PCA for image compression
5. simulating n-dimensional random gaussian data
6. visualizing gmm, with probabilities

---

## 1. Imports

In [None]:
import numpy as np
from matplotlib import pyplot as plt

## 2. Creating 2D gaussian data

### 2.1. Independent 2D gaussian data using a covariance matrix

- use the `np.eye()` function to make an identity matrix
- use such a covariance matrix to make random 2D gaussian data
- for making such data, use  `np.random.multivariate_normal()`

In [None]:
np.eye(3)  # 3x3 identity matrix

In [None]:
N=100
Means=[0,0]  # use zero mean for X1 and X2
# YOUR CODE
Covar=np.eye(2)
print(Covar)
XX=np.random.multivariate_normal(# your code)
plt.scatter(XX[:,0],XX[:,1])
plt.xlabel('X1',fontsize=14)
plt.ylabel('X2',fontsize=14)
plt.title('Independent data')

### 2.2. Dependency with a covariance matrix

Change the covariance matrix so that the two variables are dependent. Try different values, and generate data with both positive and negative correlation. Keep the matrix symmetric: the two off-diagonal entries must be equal.
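As a hedged illustration of how the entries of a covariance matrix relate to correlation, the sketch below builds a 2x2 matrix from two standard deviations and a target correlation, then checks the empirical correlation of the sampled data. The values `sd1`, `sd2`, `rho` and the name `covar_example` are hypothetical and only chosen for this example; the exercise itself can use any valid values.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the sketch is reproducible

# Hypothetical example values: standard deviations and target correlation
sd1, sd2, rho = 1.0, 2.0, -0.7

# Diagonal entries are variances; off-diagonal entries are rho * sd1 * sd2
covar_example = np.array([[sd1**2, rho * sd1 * sd2],
                          [rho * sd1 * sd2, sd2**2]])

sample = rng.multivariate_normal([0, 0], covar_example, size=5000)

# Empirical correlation between the two columns (columns are the variables)
emp_rho = np.corrcoef(sample, rowvar=False)[0, 1]
print("target correlation:", rho, "empirical:", np.round(emp_rho, 2))
```

With a correlation strictly between -1 and 1 the matrix stays a valid covariance matrix; a negative `rho` tilts the point cloud downward, a positive one upward.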



In [None]:
N=100
Means=[0,0]  # use zero mean for X1 and X2
# YOUR CODE
Covar=np.eye(2)
Covar[0,1]=# your code
Covar[1,0]=# your code
Covar[0,0]=# your code
Covar[1,1]=# your code

print(Covar)
XX=np.random.multivariate_normal(Means,Covar,N)
plt.scatter(XX[:,0],XX[:,1])
plt.xlabel('X1',fontsize=14)
plt.ylabel('X2',fontsize=14)
plt.title('Dependent data')

## 3. Creating multivariate gaussian mixture data

Make 3 datasets D1-D3, with different means for X1 and X2, combine them into a single numpy array, for example using `np.vstack()`.

In [None]:
Covar=np.eye(2)
Covar[0,1]=.5
Covar[1,0]=.5

D1=np.random.multivariate_normal(# your code)

Covar=np.eye(2)
Covar[0,1]=-.5
Covar[1,0]=-.5


D2=np.random.multivariate_normal(# your code)
D3=np.random.multivariate_normal(# your code)
XX=np.vstack((D1,D2,D3))
plt.scatter(XX[:,0],XX[:,1])

plt.xlabel('X1',fontsize=14)
plt.ylabel('X2',fontsize=14)

## 4. Fitting gaussian mixture model

In [None]:
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

Set up the model:

In [None]:
gm = GaussianMixture(n_components=3)

Fit the gaussian mixture model, then change the number of components and observe how the fitted means, covariance matrices, score, and model fit measures (BIC & AIC) change.


In [None]:
gm.fit(XX)
print('mean ',# your code)
print('covar ',# your code)
print('score ',# your code)   # average log-likelihood
print('BIC ',gm.bic(XX))   # Bayesian information criterion
print('AIC ',gm.aic(XX))  # Akaike information criterion

In [None]:
gm.covariances_[1]

In [None]:
gm.covariances_[0]

In [None]:
gm.predict(XX)

### 4.1. Test the `predict()` method on the original data

Visualize prediction:

In [None]:
plt.scatter(XX[:, 0], XX[:, 1], c=gm.predict(XX))
plt.xlabel("X1", fontsize=14)
plt.ylabel("X2", fontsize=14)

---

*(Advanced part)*

## 5. k-means

### Repeat on the same data

Repeat the workflow above with k-means instead of a gaussian mixture, and visualize the result on the same data. Do the two methods divide the data in the same way?

In [None]:
#

### 5.1. Upload a photo of your choice

In [None]:
from google.colab import files

uploaded = files.upload()

Standard k-means can be slow on large images, so import `MiniBatchKMeans`, which fits the model on small batches of the data:

In [None]:
from sklearn.cluster import MiniBatchKMeans

In [None]:
from PIL import Image
image = np.asarray(Image.open(# your image file))
print('resolution',np.shape(# your code))

In [None]:
plt.imshow(image)

In [None]:
plt.imshow(image[:, 400:620, 0], cmap="Reds")

In [None]:
plt.imshow(image[200:420, :, :])

Transform the image into grayscale with NumPy by averaging over the color channels:

In [None]:
gray = np.mean(image, 2)
res = np.shape(gray)

gray_array = gray.reshape(-1, 1)  # 1 dimensional data array for machine learning
print(np.shape(gray_array))

In [None]:
plt.imshow(gray, cmap=plt.get_cmap("gray"))

In [None]:
gray = np.mean(image, 2)
res = np.shape(gray)
print(res)
plt.imshow(gray, cmap=plt.get_cmap("gray"))

### 5.2. Set up k-means algorithm

If you have a large image, it is better to use `MiniBatchKMeans`.

In [None]:
km = MiniBatchKMeans(n_clusters=6)

Use your 1D arrangement of pixel values to fit the model:

In [None]:
km.fit(gray_array)

In [None]:
plt.hist(gray_array)

In [None]:
km.cluster_centers_

In [None]:
gray_array[0:100].T

In [None]:
len(gray_array)

In [None]:
km.predict(gray_array)[0:100]  # print prediction for first 100 pixels

### 5.3. Recover clustered image
The tricky part is recovering the image from the model prediction.

For this, you need to combine `predict()` with `cluster_centers_`.

Finally, reshape from 1D back to 2D to get an image that you can display.

In [None]:
preds = km.predict(# your code)
km.cluster_centers_


# your code

#### 5.3.1. Vectorized solution

In [None]:
prd_pix = km.cluster_centers_[preds]
recover = prd_pix.reshape(res)

#### 5.3.2. Iterative solution

In [None]:
pred_pix = np.zeros(len(preds))
for i in range(len(preds)):
    pred_pix[i] = km.cluster_centers_[preds[i], 0]  # gray value of the assigned cluster
recover = pred_pix.reshape(res)

In [None]:
plt.imshow(recover, cmap=plt.get_cmap("gray"))

In [None]:
km.cluster_centers_

Visualize what you recovered:

In [None]:
plt.imshow(recover, cmap=plt.get_cmap("gray"))

## 6. Principal component analysis

Let's test it on the same image as above.

Remember that PCA uses the correlation (covariance) between columns, which is why it can be used to build a compressed representation.
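To make the link between PCA and the covariance between columns concrete, here is a minimal sketch on hypothetical toy data (the array `toy` and its covariance values are assumptions made for this example, not part of the tutorial data). It compares the fitted components with the eigenvectors of the column covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical toy data: 500 rows, 3 correlated columns
toy = rng.multivariate_normal(
    [0, 0, 0],
    [[3.0, 1.5, 0.5],
     [1.5, 2.0, 0.3],
     [0.5, 0.3, 1.0]],
    size=500,
)

pc_toy = PCA(n_components=3).fit(toy)

# Eigen-decomposition of the covariance matrix between columns
eigvals, eigvecs = np.linalg.eigh(np.cov(toy, rowvar=False))
order = np.argsort(eigvals)[::-1]  # largest variance first, as PCA orders components
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Explained variances correspond to eigenvalues,
# components correspond to eigenvectors (the sign of each is arbitrary)
same_variances = np.allclose(pc_toy.explained_variance_, eigvals)
same_directions = np.allclose(np.abs(pc_toy.components_), np.abs(eigvecs.T))
print(same_variances, same_directions)
```

Keeping only the first few components therefore keeps the directions along which the columns vary together the most, which is what makes the compressed representation possible.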


In [None]:
from sklearn.decomposition import PCA

In [None]:
pc = PCA(n_components=2)

In [None]:
pc.fit(gray)

`fit_transform` fits the PCA and then reduces the data with the fitted components. Because PCA centers the data first, the result equals `(gray - pc.mean_).dot(pc.components_.T)`.
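The relation between `fit_transform` and the fitted components can be checked with a short sketch. The random array `toy_img` is a hypothetical stand-in for the grayscale image; note that scikit-learn's PCA subtracts the column means (`mean_`) before projecting, so the manual projection has to do the same.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
toy_img = rng.random((40, 30))  # hypothetical stand-in for the grayscale image

pc_toy = PCA(n_components=2)
reduced_toy = pc_toy.fit_transform(toy_img)

# Manual projection: subtract the column means, then project onto the components
manual = (toy_img - pc_toy.mean_) @ pc_toy.components_.T
projection_matches = np.allclose(reduced_toy, manual)

# Manual reconstruction: map back to the original space and add the means again
recon = reduced_toy @ pc_toy.components_ + pc_toy.mean_
reconstruction_matches = np.allclose(recon, pc_toy.inverse_transform(reduced_toy))

print(projection_matches, reconstruction_matches)
```

The reconstruction has the original shape but only carries the information of the retained components, which is why it looks like a blurred version of the input.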

In [None]:
reduced = pc.fit_transform(gray)

In [None]:
print(pc.explained_variance_)
print(pc.explained_variance_ratio_)

In [None]:
print("original shape", np.shape(gray))
print(
    "components", np.shape(pc.components_)
)  # components_ contains the "loadings": how much each column contributes to each principal component
print("reduced data", np.shape(reduced))

In [None]:
pc.components_

In [None]:
plt.imshow(pc.inverse_transform(reduced), cmap="gray")

# reduced is (gray - pc.mean_).dot(pc.components_.T)

In [None]:
plt.plot(pc.components_[0, :], label="pca1")
plt.plot(pc.components_[1, :], label="pca2")
plt.xlabel("pixels")
plt.ylabel("pca loadings")
plt.legend()

PCA 1 vs 2, color coded by column number

In [None]:
plt.scatter(
    pc.components_[0, :], pc.components_[1, :], c=np.arange(np.shape(gray)[1])
)  # ,label='pca1')
plt.xlabel("PCA 1")
plt.ylabel("PCA 2")
plt.colorbar()

In [None]:
plt.imshow(gray, cmap="gray")

### 6.1. Testing multiple decompositions

In [None]:
plt.figure(figsize=(16, 10))
for cc, nc in enumerate(np.arange(2, 30, 5)):
    plt.subplot(3, 3, cc + 1)
    pc = PCA(n_components=nc)
    reduced = pc.fit_transform(gray)
    plt.imshow(pc.inverse_transform(reduced), cmap="gray")
    plt.title(
        "Num PC-s: "
        + str(nc)
        + " explained var:"
        + str(np.round(np.sum(pc.explained_variance_ratio_), 2))
    )

In [None]:
np.shape(pc.components_)

### 6.2. Simulating multivariate data with many predictors


In [None]:
n_var = 10  # number of variables
means = np.zeros(n_var)  #  all with zero mean
n_dat = 200  # number of data points
a = np.random.normal(0, 1, (n_var, n_var))
covar = np.dot(a, a.transpose())
plt.pcolor(covar)
plt.colorbar()
plt.title("Random covariance matrix")
xtab = np.random.multivariate_normal(means, covar, n_dat)

With the above covariance matrix, we made a random dataset with 10 predictors.

1. Calculate the empirical correlation matrix with `np.corrcoef` (beware of the row/column default).
2. Alternatively, write your own code using `scipy.stats.pearsonr`.
3. Visualize the empirical correlation matrix as above.
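The warning about row and column defaults in the first step can be illustrated on a small hypothetical array (`toy` is an assumption for this sketch, not the tutorial's dataset): `np.corrcoef` treats each row as a variable unless told otherwise.

```python
import numpy as np

rng = np.random.default_rng(3)
toy = rng.normal(size=(200, 4))  # hypothetical data: 200 observations, 4 variables

# Default rowvar=True treats each ROW as a variable: one entry per pair of observations
by_rows = np.corrcoef(toy)

# rowvar=False treats each COLUMN as a variable: one entry per pair of variables
by_columns = np.corrcoef(toy, rowvar=False)

print(by_rows.shape, by_columns.shape)

# Transposing the data is equivalent to passing rowvar=False
transpose_matches = np.allclose(by_columns, np.corrcoef(toy.T))
print(transpose_matches)
```

Checking the shape of the result is a quick way to confirm that the correlation was computed between variables rather than between observations.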

In [None]:
#

### 6.3. Print the first 20 rows and all columns of xtab to "see" the data

In [None]:
print(xtab[# your code])

### 6.4. PCA on simulated data
Fit the PCA and transform the data into a lower-dimensional space:

In [None]:
pca = PCA(n_components=2)
dim_reduc_data = pca.fit_transform(xtab)

np.shape(dim_reduc_data)

Visualize the data after dimensionality reduction:

In [None]:
plt.scatter(dim_reduc_data[:, 0], dim_reduc_data[:, 1])
plt.xlabel("PC1")
plt.ylabel("PC2")

In [None]:
plt.scatter(pca.components_[0, :], pca.components_[1, :], c=np.arange(n_var))
plt.xlabel("PC1")
plt.ylabel("PC2")

## 7. Visualizing gaussian mixture, with predicted probabilities

Set up the parameters and fit the model:

In [None]:
nc = 4
gm = GaussianMixture(n_components=nc)
gm.fit(XX)

Make predictions and visualize them:

In [None]:
cols = [
    "Reds",
    "Blues",
    "Greens",
    "Purples",
    "Greys",
]  # ,'olive','orange','darkred','marine']
preds = gm.predict(XX)  # hard cluster assignment per data point
pred_p = gm.predict_proba(XX)  # probability of each component per data point
print(np.shape(pred_p))
plt.figure()
for c in range(nc):
    plt.scatter(
        XX[preds == c, 0],
        XX[preds == c, 1],
        c=np.max(pred_p[preds == c, :], 1),  # color by confidence of the assignment
        cmap=cols[c],
    )
plt.colorbar()
plt.xlabel("X1", fontsize=14)
plt.ylabel("X2", fontsize=14)
plt.title("BIC: " + str(np.round(gm.bic(XX))))
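The plot above colors each point by its largest predicted probability. A small sketch on hypothetical toy data (the array `toy` and the model `gm_toy` are assumptions for this example) shows how `predict()` and `predict_proba()` relate: the hard label is the component with the highest probability.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)

# Hypothetical toy data: two well separated 2D clusters
toy = np.vstack((rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))))

gm_toy = GaussianMixture(n_components=2, random_state=0).fit(toy)
proba = gm_toy.predict_proba(toy)  # shape (n_samples, n_components)
labels = gm_toy.predict(toy)       # shape (n_samples,)

# Each row of probabilities sums to 1
rows_sum_to_one = np.allclose(proba.sum(axis=1), 1.0)

# The hard label is the most probable component
labels_match = np.array_equal(labels, np.argmax(proba, axis=1))

print(proba.shape, rows_sum_to_one, labels_match)
```

Points near the boundary between two components have a maximum probability well below 1, which is what the lighter colors in the figure indicate.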

## Homework 1

1. Use the above code to fit a gaussian mixture model to the same data, with the number of clusters/components changing from 2 to 8 (all values from 2 to 8, using a for loop).
2. For each iteration, calculate the model fit measures BIC and AIC.
3. Plot both on a single graph, with the number of components on the X-axis and AIC and BIC on the Y-axis (one line for AIC and another for BIC).
4. Remember to add axis labels and a legend (to see which line is AIC and which is BIC). Try to make it look nice by changing fontsize, color, linewidth etc.


Remember that AIC and BIC are model fit measures based on the log-likelihood, with a penalty for the number of parameters. Lower values indicate better model fit.
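As a hedged illustration of how AIC and BIC are built from the log-likelihood, the sketch below recomputes both by hand for one fitted model and compares them with scikit-learn's values. The toy data and the names `toy`, `gm_toy`, `n_params` are assumptions for this example; the parameter count is written out for `covariance_type="full"` only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)

# Hypothetical toy data: two 2D clusters
toy = np.vstack((rng.normal(-2, 1, (150, 2)), rng.normal(2, 1, (150, 2))))

k = 2  # number of components
gm_toy = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(toy)

n, d = toy.shape
log_lik = gm_toy.score(toy) * n  # score() is the average log-likelihood per sample

# Free parameters with full covariances: means + covariance entries + mixing weights
n_params = k * d + k * d * (d + 1) // 2 + (k - 1)

aic_manual = -2 * log_lik + 2 * n_params
bic_manual = -2 * log_lik + n_params * np.log(n)

print(np.isclose(aic_manual, gm_toy.aic(toy)), np.isclose(bic_manual, gm_toy.bic(toy)))
```

Both measures reward a higher log-likelihood and penalize additional parameters; BIC's penalty grows with the number of data points, so it tends to prefer fewer components than AIC on larger datasets.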



## Homework 2

Implement k-means with NumPy only.

Requirements:

- number of centroids optional
- initialize the centroids randomly (within the range of the data)
- number of steps optional
- keep track of distance (distance to closest centroid should decrease)





