# Machine Learning Laboratory
Institute of Imaging and Computer Vision, RWTH Aachen

Version SS2024

# Session 1: Basics All at Once


## Goals of this session

After this session, you will have an understanding of:

First 30min:
-  basic python programming
-  python datastructures
-  some important python packages: `NumPy`, `matplotlib.pyplot`, `scikit-learn`
-  working with images
-  some common terms used in ML 


Second 30min:
-  the Iris dataset 
-  PCA

Third 30min:
- Difference between supervised and unsupervised learning
- What is meant by Classification in ML
- Supervised Learning:
    - Support Vector Machine (SVM)
    - Optional: k-nearest neighbour

Fourth 30min:
- Unsupervised Learning:
    - k-means clustering
    - GMM


**>> Time Management**
  
**>> Please keep in mind the suggested 30min intervals. Not as fast as expected? No problem at all: you will have ca. 60min buffer time. Still not completed? You can continue with the last piece at home.**

### Structure of the Notebook: 
You need to perform the tasks and answer the questions in the notebook.
For most of the tasks you will find either examples or hints to help you.
For this notebook the hints are mostly in the form of commands that you will be required to use. 
Feel free to search the internet and find out how to use the required command, eg. which inputs need to be provided and how many outputs are to be expected, etc.

Let's start with some basic python programming.
Have fun!

Python requires you to `import` the package you want to work with. Let's `import` an important package that motivates us to learn python!

In [None]:
# importing necessary packages
import antigravity

## 1. Basic `NumPy` operations

In [None]:
# importing necessary packages
import numpy as np

**Task:** Create a 1D array with 10 elements 
Hint: you may use any of `NumPy` built-in array generators, eg. `np.zeros`, `np.ones`, `np.arange`, `np.random`, etc.

In [None]:
# Your code here:



**Task:** Reshape the above array (`np.reshape`)

In [None]:
# Your code here:



**Task:** Create another 1D array and combine it with the previous array (a) vertically (b) horizontally
(eg. using `np.concatenate`, `np.vstack`, etc.)

In [None]:
# Your code here:



**Task:** Get positions where first and second array have the same elements (eg. `np.where`)

In [None]:
# Your code here:



**Task:** Extract the following from the matrix you created by horizontal concatenation:
 (a) the first element of the matrix, (b) the first row, (c) the first column,
 (d) any subset of your choice, eg. the four center elements.

#### This is called 'Slicing'

In [None]:
# Your code here:



## 2. Python data types
- numbers (eg. int, float)
- booleans (True, False)
- 'null' type (None)
- strings (eg. 'hello', 'world')
- lists (eg. [8, 'hello', 5>7])
- tuples (eg. (8,'hello',5>7))

Hopefully you are already familiar with the common datatypes. We are going to discuss the last two:

### List
Python lists allow you to store a sequence of different objects, as in the above example, [int, string, bool]. 

Note: Lists are created using square brackets [ ] and comma separated values

**Tasks on List**
1. Create a list containing a string, a bool, and an int
2. Change any one object of your choice from the list
3. Add new objects to the end of the existing list using (a) `append` (b) `extend`.

In [None]:
# Your code here:



**Question:** What is the difference between `append` and `extend`?

Answer:

**Tasks on List (cont.)**
5. Try out the examples using the functions: `pop`, `remove`, and `del`

In [None]:
myList = [0,5,3,4,3]
# myList.pop(3)

# myList = [0,5,3,4,3]
# myList.remove(3)
# myList

# myList = [0,5,3,4,3]
# del myList[3]
# myList

**Question:** What is the difference among functions from task 5? 

Answer: 



### Tuple
Similar to lists, tuples can also store a sequence of arbitrary objects. They are constructed using parentheses.

Important: Lists can be changed after construction (mutable). Tuples cannot be changed (immutable). 

**Tasks on Tuple**
1. Create a tuple object and replace the first object with the int object 0
2. Convert the above tuple to list, change the first object, convert back to tuple

Hint: Use functions `list()` and `tuple()` to convert datatype 
Tip: You can check datatype using either `type()` or `isinstance(var,dtype)`

In [None]:
# Your code here:



### Sequence types
- list
- string
- tuple
- `NumPy` array

**Task on sequence types** (handy when programming in python!)

- Checking if object is contained within sequence or not.

Understand the following example code and then perform the given tasks.

In [None]:
# Example code:
x = (1,2,3)
y1 = 3 in x
y2 = 5 not in x
print('y1 =',y1)
print('y2 =',y2)

**Task:** Perform the tasks according to the instructions in the following three cells and answer the question at the end

In [None]:
# Given:
task1 = 'let us learn python'
# Task: Check if 'us' is present in task1 or not
# Your code here:



In [None]:
# Given:
task2 = [10,20,30,40]
# Task: Check if [10,20] is present in task2 or not
# Your code here:



In [None]:
# Given:
task3 = [True,[2,3],'hello']
# Task: Check if [2,3] is present in task3 or not
# Your code here:



**Question:** Explain the different outputs in task 2 and task 3 above.

Answer: 



## 3. Working with images
We will learn to load, display and manipulate images using the `matplotlib` package.

In [None]:
# importing necessary packages
%matplotlib inline
import matplotlib.pyplot as plt
import cv2

In [None]:
# Load image
irisImg = cv2.imread('/home/praktikum/MLlab/irisImage.png') 
irisImg = cv2.cvtColor(irisImg, cv2.COLOR_BGR2RGB) # because cv2 reads images as bgr!
# Display  image
plt.imshow(irisImg) # display in RGB
plt.show()
irisImg.shape

**Task:** Find out the datatype of the image.
Hint: You've already used this command above!

In [None]:
# Your code here:



### Let's use the `NumPy` manipulations we learned to manipulate the image

**Task:** Extract and display only one of the species
Hint: You can do so by the array slicing method you learned above

In [None]:
# Your code here:
# Please name the image 'iris1'



**Task:** Delete the white border from all the sides of the extracted image (you can do it by simply cropping the image further down)

In [None]:
# Your code here:
# You may use plt.axis('off') to display images without the axis



### Some useful image manipulations

**Task:** Resize the image to three times its original size (use `cv2.resize`)

In [None]:
# Your code here:
# Please name the image 'resizedImg'



**Task:** Convert the image to grayscale. This can be done by taking the sum of weighted RGB values as follows: 0.2989*R + 0.5870*G + 0.1140*B

Hint: Use slicing to extract individual channels

Hint: Use `cmap='gray'` in `plt.imshow` to display the image with a grayscale colormap

In [None]:
# Your code here:
# Please name your image 'irisGray'



Alternatively, we could use the built-in function `cv2.cvtColor` with the conversion flag `cv2.COLOR_RGB2GRAY`

In [None]:
# irisGray = cv2.cvtColor(iris1, cv2.COLOR_RGB2GRAY)
# plt.imshow(irisGray, cmap='gray')
# plt.axis('off')
# plt.show()

**Task:** Plot the histogram of the gray scale image (use `plt.hist()` with `ravel()`)

In [None]:
# Your code here:



### Normalization
In Machine Learning tasks, normalization is often the first step. It essentially means scaling and centering the data values for faster convergence and improved accuracy. Here we show one method of normalizing by subtracting the minimum value from all data points and then dividing by the range of values. 

### Normalization method 1:

In [None]:
# Normalization method 1: 
irisNorm1 = (irisGray-np.min(irisGray))/(np.max(irisGray)-np.min(irisGray))
plt.imshow(irisNorm1, cmap='gray')
plt.axis('off')
plt.show()
# To see the histogram of the normalized image
plt.hist(irisNorm1.ravel(),32)
plt.show()

Note that after normalization the image is rescaled between 0 and 1. 

### Normalization method 2:

**Task:** Perform normalization on the grayscale image by subtracting the mean and scaling by the standard deviation 

In [None]:
# Normalization method 2: 
# Your code here:
# Please name your image 'irisNorm2'



Now we're a little comfortable with python and handling different kinds of data. We now move on to the next step!

## 4. Machine Learning
In ML, we try to create a model based on the data we have at hand. So the first most important thing is to learn to represent that data such that the computer understands it. In our course, we will use python's Scikit-Learn package.

### The dataset
The Iris flower dataset (or Fisher's or Anderson's Iris data set) is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. The aim was to quantify the morphologic variation of Iris flowers of three related species (the three flowers we saw above!). 

- The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).
- Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. 

Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

We will analyze and visualize the data using `seaborn`. It is a handy python package for visualising statistical data and it offers a plethora of nice plotting tools. It also includes the aforementioned Iris dataset. To load the Iris dataset from seaborn we need neither `numpy` nor `cv2`, as we don't work on the images directly. Instead, we work with 'features,' as mentioned above. This information is stored in a so-called `pandas dataframe`, another useful tool for statistical analyses.

In [None]:
# We will now load this dataset and display the first few elements.
# importing necessary packages
import seaborn as sns  
iris = sns.load_dataset('iris')
iris.head() #displaying the data (partially)

Note: pandas dataframe displays the data in a very intuitive tabular format.
Here we see the four features our data has as individual columns.
The last column gives the species to which the measurements in the corresponding row belong. 

### Data visualization

In [None]:
%matplotlib inline 
sns.set_theme()
sns.pairplot(iris, hue='species',height=1.5)

Note: Here we see a visual pair-wise comparison of all the features

### Working with our dataset using `sklearn`
We will now load the data using the Python ML library `sklearn`.
It includes many ML algorithms, some of which we will be learning in this course.

In [None]:
# importing necessary packages
from sklearn import datasets
iris = datasets.load_iris()

In [None]:
# Displaying the dataset
print(iris.data)
# Displaying the datatype
print(type(iris.data))

Note: 
1. The dataset was loaded as a numpy array
2. We do not see the species information here
     
In `sklearn`, the dataset can be separated into 'features' and 'targets' as follows (`sklearn` stores species information as targets)

In [None]:
features = iris.data[:,[0,1,2,3]]
targets = iris.target 

**Task:** Extract all the features of the first flower species
Hint: You will find some more information when you inspect the shape of 'features' (and 'targets')

In [None]:
# Your code here:



Now let's try to visualize the relationship between some of the features 

In [None]:
# Plotting the relationship between the Sepal Length and Sepal Width
    
species = ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica')
colors = ('blue','green','red')

data = [[features[np.where(targets == target)][:, feature] for feature in [0, 1]] for target in range(3)]

for item, color, group in zip(data, colors, species):
    plt.scatter(item[0], item[1], color=color, label=group)
plt.title('Iris dataset scatter plot')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.legend()
plt.show()

**Task:** In a similar manner, plot petal length versus petal width

In [None]:
# Your code here:



**Question:** State your findings, eg. do these two features provide a better species separation? 

Answer:

We have been comparing only two features at a time because that is easier to visualize. In ML, however, it is quite common to have hundreds of features. In such a case, some features might be redundant, for example because they are just a linear combination of other features! Hence, it is common to perform dimensionality reduction to retain only the most representative information from the entire set. One way of doing so is Principal Component Analysis (PCA). Here we try to understand and implement PCA:

## 5. Principal Component Analysis (PCA)

### PCA Summary:
- Standardize the data.
- Calculate Eigenvectors and Eigenvalues from the covariance matrix.
- Sort eigenvalues in descending order
- Choose the k eigenvectors that correspond to the k largest eigenvalues (k=number of dimensions of the new feature subspace (k≤d)).
- Construct the projection matrix W from the selected k eigenvectors.
- Transform the original dataset X via W to obtain a k-dimensional feature subspace Y.

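As a rough cross-check of the summary above, here is a minimal sketch on made-up toy data. It deliberately takes a different route from the tasks below: instead of an eigendecomposition of the covariance matrix it uses the singular value decomposition (`np.linalg.svd`) of the standardized data, which yields the same principal directions up to sign. All names prefixed with `toy_` are hypothetical and not part of the lab code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (made up): 200 samples, 3 features, the third nearly redundant
a = rng.normal(size=200)
b = rng.normal(size=200)
toy_X = np.column_stack([a, b, a + 0.1 * rng.normal(size=200)])

# Step 1: standardize each feature (zero mean, unit variance)
toy_X = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

# Steps 2-3 via SVD: the rows of Vt are the principal directions,
# already ordered by decreasing singular value
U, S, Vt = np.linalg.svd(toy_X, full_matrices=False)
toy_var_exp = S**2 / np.sum(S**2) * 100  # explained variance in percent

# Steps 4-6: keep the first k directions and project the data onto them
k = 2
toy_W = Vt[:k].T       # projection matrix, shape (3, k)
toy_Y = toy_X @ toy_W  # transformed data, shape (200, k)

print(np.round(toy_var_exp, 1))
print(toy_W.shape, toy_Y.shape)
```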
In [None]:
X = features
Y = targets
# importing necessary packages for standardizing
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)

**Task:** Calculate the covariance matrix of the **transposed** feature matrix (use `NumPy`'s `cov` function)

In [None]:
# Your code here:
# Please name the matrix as 'cov_mat'
# Remember to transpose the matrix before computing the covariance


**Task:** Perform eigenvalue decomposition on cov_mat (use `NumPy`'s linear algebra function `eig`)

In [None]:
# Your code here:
# Please call your variables eig_vecs and eig_vals



**Question:**  How many principal components (PCs) should we retain? 

**Answer:** I don't know either ;) 

But we can both find out by calculating the 'Explained Variance.'

In [None]:
total = sum(eig_vals)
var_exp = [(i/total)*100 for i in sorted(eig_vals, reverse=True)]

# One bar per principal component, in percent of the total variance
for pos, value in enumerate(var_exp):
    plt.bar(pos, value, align='edge')
plt.show()

This plot shows that around 73% of the variance is captured by the first PC and almost 23% by the second. We can safely ignore the third and the fourth component without losing much information. 
In principle, we are reducing the 4D feature space to a 2D feature subspace, by choosing the "top 2" eigenvectors with the highest eigenvalues to construct a d×k-dimensional eigenvector matrix W. So let's construct W.

In [None]:
# First, make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]

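# Sort the (eigenvalue, eigenvector) tuples from high to low by eigenvalue only
# (comparing the eigenvector arrays themselves would fail if two eigenvalues tie)
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)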
  
matrix_w = np.hstack((eig_pairs[0][1].reshape(4,1), 
                      eig_pairs[1][1].reshape(4,1)))

print('Matrix W:\n', matrix_w)

Lastly, we will use the 4×2-dimensional projection matrix W to transform our samples onto the new subspace via the equation
Y=X×W, where Y is a 150×2 matrix of our transformed samples.

In [None]:
Y = X.dot(matrix_w)

In [None]:
species = ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica')
colors = ('blue','green','red')
data = [[Y[np.where(targets == target)][:, feature] for feature in [0, 1]] for target in range(3)]

for item, color, group in zip(data, colors, species):
    plt.scatter(item[0], item[1], color=color, label=group)
plt.title('Iris data projected onto the first two PCs')
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.legend()
plt.show()

### PCA using `scikit-learn`
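Now we can compare the self-implemented PCA with the one in `scikit-learn`. Note that the sign of a principal component is arbitrary, so the plot may appear mirrored along an axis compared with the one above.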

In [None]:
# importing necessary packages
from sklearn.decomposition import PCA as sklearnPCA

sklearn_pca = sklearnPCA(n_components=2)
features_PCA = sklearn_pca.fit_transform(X)

species = ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica')
colors = ('blue','green','red')
data = [[features_PCA[np.where(targets == target)][:, feature] for feature in [0, 1]] for target in range(3)]

for item, color, group in zip(data, colors, species):
    plt.scatter(item[0], item[1], color=color, label=group)
plt.title('Iris data projected onto the first two PCs (scikit-learn)')
plt.xlabel('PCA1')
plt.ylabel('PCA2')
plt.legend()
plt.show()

## Discussion
What is the difference between the LDA and PCA algorithms for dimensionality reduction in terms of annotation, i.e. the use of class labels?



## 6. Supervised Learning (Classification of the Iris dataset)

Classification tasks aim at predicting the class of a given data sample. For the Iris dataset, we want to predict which one of the three species a given sample belongs to. In supervised learning, we do so by 'learning' a model based on some training samples. We then use this trained model to make predictions on an 'unseen' test dataset.

Before we dive into any of the algorithms, we need to get introduced to the idea of splitting the dataset into train and test sets. 

We know that the Iris dataset consists of 150 samples. Now, we will take roughly 80% of the samples as our training set and the remaining samples as our test set. We calculate the accuracy of our model on the test set. The common ML terminology is as follows: X_train, X_test are, respectively, the feature vectors for training and testing; and y_train and y_test are the corresponding labels for X_train and X_test. The predictions will be stored in y_pred.

**Task:** Using the `train_test_split` function of `sklearn`, split the PCA features obtained above in accordance with the standard ML naming convention (as explained above). 

Tip: Please use the default parameters for the split. Note that by default `train_test_split` holds out 25% of the samples for testing, so the split is 75/25 rather than exactly 80/20. You will learn about this topic in more detail in the next session ;)

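To make the naming convention concrete, here is a minimal sketch of what a random split does conceptually, written with plain `NumPy` on made-up toy data. It is not a replacement for `train_test_split`, which the task asks you to use; all `toy_` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 20 samples with 2 features each, and one label per sample
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.arange(20) % 2

# Shuffle the sample indices, then cut them at 80 %
idx = rng.permutation(len(toy_X))
n_train = int(0.8 * len(toy_X))
train_idx, test_idx = idx[:n_train], idx[n_train:]

# Features and labels are indexed with the same indices, so they stay aligned
toy_X_train, toy_X_test = toy_X[train_idx], toy_X[test_idx]
toy_y_train, toy_y_test = toy_y[train_idx], toy_y[test_idx]

print(toy_X_train.shape, toy_X_test.shape)  # (16, 2) (4, 2)
```

The important point is that no sample ends up in both sets, and that every feature row keeps its own label.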


In [None]:

# Your code here:
# Hint: import the necessary modules first



### 6.1 My very own classifier
**Task:** Refer to the PCA plot above and design your own classifier (by completing the assisting code!). For this, you may define simple linear or non-linear conditions to classify the sample points into one of the three classes. 

Example condition: if PCA1 for given sample < -2, sample is Iris-Setosa


In [None]:

# Complete the following lines of code: 

class StudentsClassifier:
    def predict(self, X):
        return np.array([self.predict_single(x) for x in X])
    
    def predict_single(self, X):
        # X[0] = PCA feature 1
        # X[1] = PCA feature 2
        
        # write a function that returns the index of the class
        # e.g. if Iris-Setosa, return 0 
        

        # Your code here:

    

### 6.2. Qualitative evaluation

Below is a ready-to-use function to visualize classification boundaries. You are not required to understand it in detail but try to understand what inputs/outputs the functions require/produce. You will need to use these later in the notebook. 


In [None]:

def make_meshgrid(x, y, h=.02):
    """Create a mesh of points to plot in

    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional

    Returns
    -------
    xx, yy : ndarray
    """
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy


def plot_contours(clf, X, y, **params):
    """Plot the decision regions of a classifier together with the data.

    Parameters
    ----------
    clf: a classifier with a predict method
    X: array of shape (n_samples, 2), the two features to plot
    y: array of shape (n_samples,), class labels used to colour the points
    params: additional keyword arguments (currently unused)
    """
    fig, ax = plt.subplots(constrained_layout=True)
    
    X0, X1 = X[:, 0], X[:, 1]
    xx, yy = make_meshgrid(X0, X1)
    
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.7)
    
    ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=30, edgecolors='k')
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())
    plt.show()
    


Using the function `plot_contours`, visualize the decision boundaries obtained by your classifier by providing the correct arguments to it. 


In [None]:

# Calling your classifier
my_clf = StudentsClassifier()

# Visualizing class boundaries for my_clf
plot_contours(my_clf, X_train, y_train)



### 6.3. Quantitative evaluation

**Task:** First 'predict' the classes for X_test using the classifier you just defined above. Then calculate the accuracy on this test dataset. Use `sklearn`'s `accuracy_score`.

Hint: First import the necessary module.


In [None]:

# Your code here:
my_clf = StudentsClassifier()

# y_pred = ? # Complete the code!



### 6.4. Support Vector Machine

SVMs are a powerful and flexible class of supervised algorithms for both classification and regression. They are memory efficient in that they use only a subset of the training points in the decision function (called support vectors). The simplest SVM uses a linear kernel to separate classes. 

**Task:** Using `sklearn`'s `svm.SVC`, implement an SVM with a linear kernel. Also perform a qualitative and a quantitative evaluation just as you did previously. 



In [None]:
from sklearn import svm
# Example code: LinearSVC
clf0 = svm.LinearSVC() # SVC = Support Vector Classifier
# train clf0 on X_train and y_train
clf0.fit(X_train, y_train)
# visualize 
plot_contours(clf0, X_train, y_train)

# Your code here:



**Question:** How does it perform compared to your classifier? (Compare the accuracies).

Answer: 




### 6.4. More on accuracy metrics

The Confusion Matrix provides a better summary of the classification performance than just the accuracy score. The latter is often not the best measure for classification tasks involving more than two classes. Calculating a confusion matrix can give you a better idea of what your classification model is getting right and what types of errors it is making.

**Task:** Read up on the Confusion Matrix if you are not familiar with the term. Then, using `sklearn`'s `confusion_matrix`, compute and display it and try to understand the performance of your classifier. 

Tip: You may also want to look into `sklearn`'s `classification_report` function. 

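To illustrate what such a matrix contains, here is a small hand-built sketch using only `NumPy` and made-up labels (the `toy_` arrays are hypothetical and unrelated to the Iris data). It follows the same convention as `sklearn`: rows are the true classes, columns the predicted classes.

```python
import numpy as np

# Made-up labels for three classes (0, 1, 2), purely for illustration
toy_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
toy_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 1])

n_classes = 3
toy_cm = np.zeros((n_classes, n_classes), dtype=int)
# Count every (true, predicted) pair: row = true class, column = predicted class
np.add.at(toy_cm, (toy_true, toy_pred), 1)

# The diagonal holds the correct predictions, so accuracy = trace / total
toy_acc = np.trace(toy_cm) / toy_cm.sum()

print(toy_cm)
# [[2 1 0]
#  [0 2 1]
#  [0 1 3]]
print('accuracy =', toy_acc)  # accuracy = 0.7
```

The off-diagonal entries show which classes get confused with each other, which a single accuracy value cannot tell you.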

In [None]:

# Your code here:


Stay tuned for more metrics in the next session ;)

### 6.5. Tuning the Hyperparameters

There are several parameters that can help achieve better results (introduced in the Preparatory Material). 

- Kernel: Depending on the (expected) distribution of our classes, we can choose different types of functions, eg. linear, polynomial, and radial basis function (RBF). As might be obvious, the latter two are useful for non-linear decision boundaries.

- Regularization: `C` in scikit-learn is a penalty parameter that controls how strongly misclassified training points are penalized. A smaller value of C tolerates more margin violations and creates a larger-margin hyperplane; a larger value of C penalizes violations more strongly and creates a smaller-margin hyperplane. 

- Gamma: This defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. A small gamma value defines a Gaussian function with a large variance; in this case, two points can be considered similar even if they are far from each other. On the other hand, a large gamma value defines a Gaussian function with a small variance; in this case, two points are considered similar only if they are close to each other.

**Task:** With a simple trial and error approach try to visualize and understand the effect of the hyperparameters and find an optimum classifier for the given dataset. You may use the metrics you learnt about to evaluate the performance of the different classifiers.

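The effect of gamma described above can be sketched numerically. The RBF kernel measures the similarity of two points as exp(-gamma * ||x1 - x2||^2). The helper `rbf_kernel_value` and the three points below are hypothetical and only serve as an illustration.

```python
import numpy as np

def rbf_kernel_value(x1, x2, gamma):
    """RBF similarity exp(-gamma * ||x1 - x2||^2) between two points."""
    diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

p = [0.0, 0.0]
q_near = [0.5, 0.0]  # close to p
q_far = [3.0, 0.0]   # far from p

# Print the similarity of p to the near and to the far point for several gammas
for gamma in (0.01, 1.0, 100.0):
    print(gamma, rbf_kernel_value(p, q_near, gamma), rbf_kernel_value(p, q_far, gamma))
```

With a small gamma, even the far point remains quite similar to `p`; with a large gamma, the similarity drops towards zero already for the near point.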

In [None]:
# Your code here:


### 6.6. Optional: k-nearest neighbour (knn)
#### *>> You may first jump to 7. Unsupervised Learning and come back if you're interested.*
#### *>> But they are very cooool!*
### knn Algorithm

- Define a distance metric (Euclidean distance)
- Choose a value for k (= the number of nearest neighbours) 
- Take k-nearest neighbors of the new data point, according to your distance metric
- Assign the new data point the most common category among its k nearest neighbours
### 6.7. Optional: My very own knn classifier

We define our own knn classifier according to the above algorithm. 

**Task (Step 1):** Define a function that calculates the Euclidean distance between two points. 


In [None]:
def my_ecd(v1, v2):
    # Your code here:


**Task (Step 2):** Define a function that uses my_ecd to calculate the distance between one single test data point with *all* the training data points. Save the distances *along with their respective indices* in a list. Return the *sorted* list as the output of the function. 

Hint: 1. To get indices, you may want to use `enumerate`.        2. You may use the function `sort` or `sorted`.


In [None]:
def my_distance_metric(X_train, single_test):
    """Calculates the distance between one test sample and every sample in X_train.

    Parameters:
    X_train = all available training samples
    single_test = one particular test sample
    -----------
    Returns: sorted distance list  
    """
    dist_list = []
    # Your code here:
    # Define a for-loop to compute distance

    return dist_list


**Task (Step 3):** Define a function to save the first k target values corresponding to dist_list obtained above. 


In [None]:
def my_target_list(dist_list, y_train, k): 
    # Your code here:
    # make a list of the k neighbors' targets


    return target_list


**Task (Step 4):** Define a function that assigns predictions to the test data points. 

Hint: Use the `most_common` method of the `Counter` object to get the target that occurs most often.


In [None]:
from collections import Counter
def my_predict(target_list):
    # Your code here:



**Task (Step 5):** Finally define a function that loops through all data points predicting each one by one.


In [None]:
def my_knn(X_test, X_train, y_train, k):    
    all_predictions = []
    # Your code here:
    # define a for-loop to loop through all the test data points 
    # to get the predictions for each one of them individually 


**Task:** Use the knn-function to predict X_test and calculate the accuracy. 


In [None]:
# Your code here:



**Task:** Use the built-in knn classifier from `scikit-learn`


In [None]:
# Your code here:



**Question:** How does your knn classifier perform in comparison to the built-in function? 

Answer:


## 7. Unsupervised Learning 
Until now we have always worked with the features and targets (labels) of the given dataset. In unsupervised learning, we do not have any labels for the data. To group the samples into classes, we therefore rely on clustering algorithms. 

In this section, we will see the K-means and Gaussian Mixture Model (GMM) clustering methods. These algorithms require us to 'guess' how many clusters (classes) we have. 
### 7.1. K-means 

This is the simplest clustering algorithm. We initialize the algorithm with 'k' clusters according to which we get 'k' centroids. The algorithm then iteratively assigns every datapoint to its nearest cluster. The 'means' in its name refers to averaging of the data, i.e., finding the centroid.
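
As a rough illustration of this assign-and-average iteration, here is a minimal `NumPy` sketch on made-up 2D data. It is not how `sklearn` implements `KMeans` (which, among other things, uses a smarter initialisation); the data, the naive initialisation and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 2D data: two well-separated blobs of 50 points each
toy_pts = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
                     rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))])

k = 2
# Naive initialisation (an assumption for this sketch): use the first and
# the last data point as starting centroids
centroids = toy_pts[[0, -1]].copy()

for _ in range(10):
    # Assignment step: index of the nearest centroid for every point
    dists = np.linalg.norm(toy_pts[:, None, :] - centroids[None, :, :], axis=2)
    toy_labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned points
    centroids = np.array([toy_pts[toy_labels == j].mean(axis=0) for j in range(k)])

print(centroids)
```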

**Task:** Perform unsupervised classification using `sklearn`'s `KMeans`. To check your results, print out the output labels and compare with the target values. 

Note: For this task we ignore the several optional parameters. Providing only the `n_clusters` argument will suffice for this task. 

**Question:** Before you begin the task, think about which dataset you should work with here. 

Hint: Unsupervised learning = NO labels available

Answer: 


In [None]:
# Your code here:



**Question:** Is it informative to compare the labels and the targets? Explain.

Answer: 


### 7.2 Visualization of labels

I hope by now it is clear to you why calculating accuracy as done previously does not make sense in this case. Hence, we perform a simple visualization to assess the performance. 

**Task:** Plot the results of the kmeans function as a scatter plot (similar to PCA). 


In [None]:
# Your code here:


# Optional: Plotting the centroids of the clusters
# plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:,1], s = 70, c = 'black')



### 7.3 The Elbow Method

In a truly unsupervised learning scenario, how could we make the initial guess for the number of clusters? One option is the Elbow Method.

The basic idea behind cluster partitioning methods, such as k-means clustering, is to define clusters such that the total intra-cluster variation, or total within-cluster sum of squares (WCSS), is minimized. In the Elbow Method, we plot the WCSS against a set of values for 'k' and the location of a bend (elbow) in the plot is generally considered as an indicator of the appropriate number of clusters.


In [None]:
# Finding the optimum number of clusters for k-means clustering
from sklearn.cluster import KMeans  # in case it has not been imported yet

wcss = [] # within-cluster sum of squares

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(features_PCA)
    wcss.append(kmeans.inertia_)
    
#Plotting the results onto a line graph, allowing us to observe 'The elbow'
plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') 
plt.show()


**Question:** What is the optimum number of clusters according to the Elbow Method? 

Answer: 
### 7.4 GMM

As we saw, k-means might not always provide the best output because it does not have any intrinsic measure of probability or uncertainty of cluster assignments. A major limitation of k-means is that the cluster models must be circular: k-means has no built-in way of accounting for oblong or elliptical clusters.

Gaussian mixture models (GMMs) offer an extension to the idea of k-means and can provide a better estimation. They attempt to find a mixture of multi-dimensional Gaussian probability distributions that best models the input dataset. While k-means performs hard labeling, i.e., it assigns every sample to exactly one cluster, GMMs provide soft labeling by reporting the probabilities for all clusters instead of only the most probable one. 
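
The difference between hard and soft labels can be sketched with a hand-picked 1D mixture of two Gaussians. The mixture parameters below are assumptions chosen for illustration and are not fitted to any data; in the task, `GaussianMixture` estimates them for you.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1D normal distribution."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Hypothetical, hand-picked mixture (not fitted to any data)
means = np.array([0.0, 4.0])
stds = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

toy_x = np.array([0.0, 2.0, 3.5])

# Weighted density of every sample under every component, shape (3, 2)
weighted = weights * gaussian_pdf(toy_x[:, None], means, stds)
# Soft labels: normalise each row so that the probabilities sum to 1
soft = weighted / weighted.sum(axis=1, keepdims=True)
# Hard labels: keep only the most probable component
hard = soft.argmax(axis=1)

print(np.round(soft, 2))
print(hard)
```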

**Task:** Try out GMM using `sklearn`'s `GaussianMixture`. Display the probabilities that are assigned to every sample to understand the concept of soft-labeling as explained above. You may also plot the clusters for visualization. 

Tip: Round the probabilities to two decimal places before displaying.


In [None]:
# Your code here:



## Feedback Cell:
I hope this Notebook gave you a good start into python and Machine Learning. Let us know how you liked it. Any suggestions or criticism are also welcome! 

Your Feedback: 


### Acknowledgements

The iris pictures are licensed with [CC BY-SA 3.0 DEED](https://creativecommons.org/licenses/by-sa/3.0/) from https://w.wiki/9hfQ and https://w.wiki/9hfR, and [CC BY-SA 2.0 DEED](https://creativecommons.org/licenses/by-sa/2.0/deed.en) from https://w.wiki/9hfT. The [iris dataset](https://doi.org/10.24432/C56C76) from Fisher (1936) is licensed under [CC BY 4.0 LEGAL CODE](https://creativecommons.org/licenses/by/4.0/legalcode). The libraries used are antigravity, numpy, matplotlib, seaborn, sklearn, opencv, etc. 
