# Applied Machine Learning (2023), exercises


## General instructions for all exercises

Follow the instructions and fill in your solution under the line marked by the tag

> YOUR CODE HERE

Also remove the line

> raise NotImplementedError()

**Do not change other areas of the document**, since it may disturb the autograding of your results!
  
Having written the answer, execute the code cell by pressing the `Shift-Enter` key combination. The code is run, and it may print some information under the code cell. The focus automatically moves to the next cell, and you can execute that cell by pressing `Shift-Enter` again, until you reach the code cell that tests your solution. Execute it and follow the feedback. Usually it either says that the solution seems acceptable or reports some errors. You can go back to your solution, modify it, and repeat the process until you are satisfied. Then proceed to the next task.
   
Repeat the process for all tasks.

The notebook may also contain manually graded answers. Write your manually graded answer under the line marked by the tag:

> YOUR ANSWER HERE

Manually graded tasks are answered as text in Markdown format. An answer may contain text, pseudocode, or mathematical formulas. You can write formulas in $\LaTeX$ syntax by enclosing the formula in dollar signs (`$`); for example, `$f(x)=2 \pi / \alpha$` produces $f(x)=2 \pi / \alpha$.

When you have passed the tests in the notebook, and you are ready to submit your solutions, validate and submit your solution using the nbgrader tools from the `Nbgrader/Assignment List`-menu.


# Applied Machine Learning (2022), exercises


## General instructions for all exercises


When you have passed the tests in the notebook and are ready to submit your solutions, download the whole notebook using the menu `File -> Download as -> Notebook (.ipynb)`. Save the file on your hard disk, and submit it in [Moodle](https://moodle.uwasa.fi) or EUNICE Moodle under the corresponding exercise.

Your solution should be executable Python code. Use the existing code as an example of Python programming, and consult the many Python programming resources available on the Internet if necessary.


# Unsupervised learning, clustering

## Task 1: Apply k-means

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()  # for plot styling

The data file `liver-spectroscopy.tab` is a text file that contains near-infrared (NIR) spectral measurements of different cells from a liver. The file includes 731 samples, each with 234 variables. The variables describe the amount of infrared radiation absorbed at each of the 234 tested wavelengths. The last column in the data is the cell type, which is one of four types: 'collagen', 'glycogen', 'lipids' or 'DNA'. The column names are the wavelengths used in measuring the absorption values in the respective columns.

In [None]:
D = pd.read_table('liver-spectroscopy.tab')

# Print the classes
print(D.type.unique())

# Separate the true classes from the data into their own variable
types = pd.Categorical(D.type)
del D['type']
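
The loading step above can be tried without the real file. The following is a minimal sketch on a made-up three-row table; the wavelengths, values, and the names `toy_tab`, `toy`, and `toy_types` are all hypothetical. It shows how `pd.Categorical` turns the label column into category names plus integer codes, which is what the later tasks rely on.

```python
import io

import pandas as pd

# Tiny stand-in for the real file: three wavelength columns plus a 'type' label column.
# All numbers here are invented for illustration.
toy_tab = io.StringIO(
    "1000.0\t1004.0\t1008.0\ttype\n"
    "0.12\t0.15\t0.11\tDNA\n"
    "0.40\t0.42\t0.39\tlipids\n"
    "0.13\t0.16\t0.10\tDNA\n"
)
toy = pd.read_table(toy_tab)

# pop() removes the label column from the frame and returns it in one step
toy_types = pd.Categorical(toy.pop("type"))

print(toy.shape)                    # only the numeric columns remain
print(list(toy_types.categories))   # sorted category names
print(list(toy_types.codes))        # integer code of each sample's category
```

The integer codes index into `categories`, so a category name can be mapped to its code with `toy_types.categories.get_loc(...)`.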

The first task is to apply PCA to transform the data `D` into the PCA projection `projected`. Then apply KMeans to `projected` and try to find clusters in the data. Use just enough PCA components to retain more than 90% of the variance. Name your KMeans object `kmeans`.

Plot the clusters in a scatter plot using the PC1 and PC2 axes. Use a different color for each cluster.

In [None]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Project the spectra onto the first 20 principal components.
# 20 is a fixed choice; pca.explained_variance_ratio_.sum() shows whether it exceeds 0.90.
pca = PCA(n_components=20)
projected = pca.fit_transform(D)

# Cluster the projected samples with k-means
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(projected)

# First two principal components, used as the plot axes
pc1 = projected[:, 0]
pc2 = projected[:, 1]

# Scatter plot colored by the k-means cluster label
plt.figure(figsize=(10, 8))
plt.scatter(pc1, pc2, c=kmeans.labels_, cmap='rainbow')
plt.title('K-Means Clustering of Liver Spectroscopy Data')
plt.xlabel('Principal Component 1 (PC1)')
plt.ylabel('Principal Component 2 (PC2)')
plt.show()
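
The task asks for just enough components to retain more than 90% of the variance. One way to find that number is to accumulate the explained variance ratios of a full PCA and take the first position where the sum exceeds 0.90. The sketch below does this on synthetic data, not on the liver spectra; `X_demo`, `full_pca`, `cumulative`, and `n_needed` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples, 30 correlated features driven by 3 latent factors
latent = rng.normal(size=(200, 3))
X_demo = latent @ rng.normal(size=(3, 30)) + 0.1 * rng.normal(size=(200, 30))

# Fit a PCA with all components and accumulate the explained variance ratios
full_pca = PCA().fit(X_demo)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Index of the first cumulative value above 0.90, converted to a component count
n_needed = int(np.argmax(cumulative > 0.90)) + 1

# Refit with exactly that many components
demo_projected = PCA(n_components=n_needed).fit_transform(X_demo)
print(n_needed, cumulative[n_needed - 1])
```

scikit-learn's `PCA` also accepts a fraction such as `n_components=0.90` and then selects the number of components itself; the explicit version above just makes the selection visible.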


In [None]:
projected

In [None]:
points=0
if (abs(kmeans.inertia_-75)<5):
    points+=1
points
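
The test above checks `kmeans.inertia_`, which is the sum of squared distances from each sample to the center of its assigned cluster. As a rough illustration, assuming synthetic blob data rather than the liver spectra, the sketch below recomputes that quantity by hand and shows how it varies with the number of clusters. The names `X_blobs`, `inertias`, `model4`, and `manual` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four well-separated groups
X_blobs, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# Inertia for a range of cluster counts
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_blobs)
    inertias[k] = model.inertia_

# Recompute the inertia of one model by hand:
# squared distance of every sample to the center of its own cluster, summed
model4 = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_blobs)
own_centers = model4.cluster_centers_[model4.labels_]
manual = ((X_blobs - own_centers) ** 2).sum()

for k, value in inertias.items():
    print(k, round(value, 1))
print("manual vs. inertia_:", manual, model4.inertia_)
```

Because inertia depends on the number of clusters and on the scale of the data, a value is only comparable between runs that use the same projection.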

## Task 2: Apply PCA and GMM

Use the previous PCA projection and apply a Gaussian Mixture Model to the projected data to find clusters. Name the GaussianMixture object `gmm`.

Plot the clusters with different colors in a scatter plot using the PC1 and PC2 axes.

In [None]:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=4, random_state=0)
gmm.fit(projected)

pc1 = projected[:, 0]
pc2 = projected[:, 1]

# Predict cluster assignments using GMM
cluster_labels = gmm.predict(projected)

# Create a scatter plot with different colors for each cluster
plt.figure(figsize=(10, 8))
plt.scatter(pc1, pc2, c=cluster_labels, cmap='rainbow')
plt.title('Gaussian Mixture Model Clustering of Liver Spectroscopy Data')
plt.xlabel('Principal Component 1 (PC1)')
plt.ylabel('Principal Component 2 (PC2)')
plt.show()
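
The number of mixture components is fixed by hand in the cell above. One common way to compare candidate values without labels is the Bayesian information criterion, which `GaussianMixture` exposes as `.bic()`; lower is better. The following is a sketch on synthetic blobs, not on the liver data, and `X_gmm`, `bic`, and `best_k` are hypothetical names.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data generated from four groups
X_gmm, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.7, random_state=1)

# Fit one mixture per candidate component count and record its BIC
bic = {}
for k in range(1, 8):
    candidate = GaussianMixture(n_components=k, random_state=0).fit(X_gmm)
    bic[k] = candidate.bic(X_gmm)

# The candidate with the lowest BIC balances fit against model size
best_k = min(bic, key=bic.get)
print(best_k)
```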

In [None]:
points=0
points

## Task 3: Confusion matrix

Calculate the confusion matrix `CM` between the clusters and the true tissue types. From `CM`, find the cluster to which *DNA* tissue type samples are most often assigned. Assign to the variable `nDNA` the number of times samples of tissue type *DNA* were assigned to this cluster.

Some instructions:

1. Find the category number of DNA by listing `types.categories`. The category number is the index of the string 'DNA' in the list of categories.
1. Use the `confusion_matrix()` function from `sklearn.metrics` to compute the confusion matrix `CM`.
1. Assign the found number to the variable `nDNA`.
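
The steps above can be sketched on a toy example. The six labels and cluster assignments below are invented, and `true_types`, `clusters`, `cm_demo`, `dna_row`, `dna_cluster`, and `dna_count` are hypothetical names. In the matrix returned by `confusion_matrix(y_true, y_pred)`, rows correspond to the true categories and columns to the clusters, so the DNA row holds the counts of DNA samples per cluster.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix

# Invented true tissue types and cluster assignments for six samples
true_types = pd.Categorical(["DNA", "lipids", "DNA", "collagen", "DNA", "lipids"])
clusters = np.array([2, 0, 2, 1, 0, 0])

# Rows: true category codes, columns: cluster labels
cm_demo = confusion_matrix(true_types.codes, clusters)

# Row index of the 'DNA' category
dna_row = true_types.categories.get_loc("DNA")

# Cluster that receives the most DNA samples, and how many it receives
dna_cluster = int(cm_demo[dna_row].argmax())
dna_count = int(cm_demo[dna_row].max())

print(cm_demo)
print(dna_row, dna_cluster, dna_count)  # 0 2 2
```

Note the difference between `argmax()`, which gives the cluster index, and `max()`, which gives the number of samples in that cluster.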


In [None]:
types.categories

In [None]:
category_number_DNA = types.categories.get_loc('DNA')
category_number_DNA

In [None]:
from sklearn.metrics import confusion_matrix
CM = confusion_matrix(types.codes, cluster_labels)
CM

In [None]:
# Cluster to which DNA samples are most often assigned (column index of the row maximum)
cluster_DNA = CM[category_number_DNA].argmax()
# Number of DNA samples assigned to that cluster, as the task asks
nDNA = CM[category_number_DNA].max()

In [None]:
nDNA

In [None]:
points=0
assert('CM' in globals()), "Define the confusion matrix as CM, please!"
assert(type(CM) == np.ndarray), "The confusion matrix is not a NumPy array?"
assert('nDNA' in globals()), "Assign nDNA as instructed, please."

points

## Task 3: The probability of the samples

The GMM model includes a method called `.predict_proba()`, which returns the probability that a sample belongs to each cluster. Calculate the probabilities of each sample belonging to each cluster and assign the result to the variable `P`. Then find the probability that the last sample belongs to the cluster to which most DNA samples are assigned, and assign that probability to the variable `pDNA`.

In [None]:
P = gmm.predict_proba(projected)
# Column of P for the cluster that receives the most DNA samples, last sample only
pDNA = P[-1, cluster_DNA]

In [None]:
points=0
assert('P' in globals()), "Define probabilities, P as instructed, please."
assert('pDNA' in globals()), "Define probability, pDNA as instructed, please."
points+=1
points
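
To make the shape and meaning of the `.predict_proba()` output concrete, here is a small sketch on synthetic blobs rather than the liver data; `X_soft`, `gmm_demo`, `P_demo`, and `agreement` are hypothetical names. Each row of the result holds one sample's membership probabilities over all components, so every row sums to one, and the most probable component matches the hard label from `.predict()`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with three groups
X_soft, _ = make_blobs(n_samples=150, centers=3, cluster_std=1.0, random_state=0)

gmm_demo = GaussianMixture(n_components=3, random_state=0).fit(X_soft)

# One row per sample, one column per mixture component
P_demo = gmm_demo.predict_proba(X_soft)

# Fraction of samples whose most probable component equals the hard label
agreement = np.mean(P_demo.argmax(axis=1) == gmm_demo.predict(X_soft))

print(P_demo.shape)
print(P_demo[-1])   # membership probabilities of the last sample
print(agreement)
```

Indexing a single entry, for example `P_demo[-1, 0]`, gives the probability that the last sample belongs to component 0.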

## Task 4: Evaluation

1. Which clustering method, KMeans or GMM, is better for this case, and why?
1. How can the quality of the clustering methods be assessed if 
    1. the true classes are known?
    1. the true classes are not known?
    
Answer by writing text or Markdown text in the cell below.

For this data set, K-means clustering works efficiently.

If the true classes are known, the quality of the clustering can be assessed with external metrics such as the adjusted Rand index, normalized mutual information, or a confusion matrix between clusters and classes. These metrics compare the cluster assignments to the ground-truth labels and measure how well the clustering reproduces the true classes.

When the true classes are not known, evaluating clustering quality is more challenging. Without ground-truth labels, it is good practice to use internal measures such as:

1. Silhouette score
1. Davies-Bouldin index
1. Within-cluster sum of squares (WCSS)
1. Visual inspection
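
As a rough illustration of both cases, the sketch below clusters synthetic blobs with KMeans and computes one label-based metric and two label-free metrics from `sklearn.metrics`. The data and the names `X_eval`, `y_true`, `labels`, `ari`, `sil`, and `dbi` are hypothetical and unrelated to the liver data set.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data with known group membership
X_eval, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_eval)

# True classes known: compare cluster labels with the ground truth
# (1.0 means identical partitions, values near 0 mean chance-level agreement)
ari = adjusted_rand_score(y_true, labels)

# True classes unknown: use only the data and the cluster labels
sil = silhouette_score(X_eval, labels)      # in [-1, 1], higher is better
dbi = davies_bouldin_score(X_eval, labels)  # non-negative, lower is better

print(ari, sil, dbi)
```

The adjusted Rand index ignores how the clusters are numbered, which matters here because cluster labels are arbitrary.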