# Machine learning I
## Unsupervised learning (Clustering)

We assume that our data has a form similar to the supervised setting, except that no labels are available for training. The data consists of N samples with M features, stored as a 2-dimensional array (matrix) in $\mathbb{R}^{N \times M}$ in the following format:

$$\mathbf{Data} = \begin{bmatrix}
    \textbf{feature 1} & \textbf{feature 2} & \textbf{feature 3} & \dots  & \textbf{feature M} \\
    x_{1}^{(1)} & x_{2}^{(1)} & x_{3}^{(1)} & \dots  & x_{M}^{(1)} \\
    x_{1}^{(2)} & x_{2}^{(2)} & x_{3}^{(2)} & \dots  & x_{M}^{(2)} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    x_{1}^{(N)} & x_{2}^{(N)} & x_{3}^{(N)} & \dots  & x_{M}^{(N)}
\end{bmatrix}.
$$


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

We start by creating an artificial dataset with M = 2 features and N = 1000 observations. We further assume that the data is drawn from 4 clusters (centers).

In [None]:
from sklearn import datasets
X, y = datasets.make_blobs(n_samples=1000, n_features=2, centers=4, cluster_std=1.5, random_state=7)
print("X:\n", X[:5])
print("y:\n", y[:5])

Next, we plot the data, colored by the true cluster label `y`.

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y)

We train a k-means model with k = 4 clusters.

In [None]:
from sklearn.cluster import KMeans
clust = KMeans(n_clusters=4)
y_cl = clust.fit_predict(X)
y_cl[:5]

Then we visualize the predicted cluster memberships.

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y_cl)
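
To see what the fitted model contains, the following sketch inspects the centroids and checks that each sample is assigned to its nearest centroid. It regenerates the same artificial data so that it runs on its own; `random_state=0` and `n_init=10` are assumptions added only to make the run repeatable and are not part of the cells above.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

# Same artificial data as above
X, y = datasets.make_blobs(n_samples=1000, n_features=2, centers=4,
                           cluster_std=1.5, random_state=7)

# random_state and n_init are fixed here only for repeatability (assumption)
clust = KMeans(n_clusters=4, n_init=10, random_state=0)
y_cl = clust.fit_predict(X)

# One centroid per cluster, one column per feature
centers = clust.cluster_centers_
print("centroids:\n", centers)

# Sum of squared distances of the samples to their assigned centroid
print("inertia:", clust.inertia_)

# Number of samples per predicted cluster
print("cluster sizes:", np.bincount(y_cl))

# k-means assigns every sample to the closest centroid (Euclidean distance)
dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
nearest = dists.argmin(axis=1)
print("agreement with nearest centroid:", (nearest == y_cl).mean())
```

Note that the cluster indices in `y_cl` are arbitrary: cluster 0 of k-means does not have to correspond to center 0 in `y`.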

What happens if we have a 'wrong' number (e.g. 3 or 6) of clusters?

In [None]:
# %load solutions/l2_num_clusters.py


## How can we find a good number of clusters k?

In [None]:
from sklearn import metrics

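The [silhouette score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html) is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The silhouette coefficient of a single sample is (b - a) / max(a, b), and the score is the mean over all samples. It ranges from -1 to 1, and higher values indicate better separated clusters.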
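
As a rough illustration of how a and b enter the score, the sketch below computes the silhouette score by hand and compares it with `metrics.silhouette_score`. The helper `silhouette_by_hand` is a hypothetical name introduced for this example, and a smaller sample (200 points) is used only to keep the distance matrix small.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.cluster import KMeans

X, _ = datasets.make_blobs(n_samples=200, n_features=2, centers=4,
                           cluster_std=1.5, random_state=7)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Full matrix of pairwise Euclidean distances
D = metrics.pairwise_distances(X, metric="euclidean")


def silhouette_by_hand(D, labels):
    """Mean silhouette coefficient (hypothetical helper for illustration)."""
    cluster_ids = np.unique(labels)
    coeffs = np.zeros(len(labels))
    for i, own in enumerate(labels):
        same = labels == own
        n_same = same.sum()
        if n_same == 1:
            continue  # a singleton cluster gets coefficient 0 by convention
        # a: mean distance to the other members of the own cluster
        a = D[i, same].sum() / (n_same - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(D[i, labels == other].mean()
                for other in cluster_ids if other != own)
        coeffs[i] = (b - a) / max(a, b)
    return coeffs.mean()


print("by hand:", silhouette_by_hand(D, labels))
print("sklearn:", metrics.silhouette_score(X, labels, metric="euclidean"))
```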

In [None]:
clust = KMeans(n_clusters=5)
y_cl = clust.fit_predict(X)
metrics.silhouette_score(X, y_cl, metric='euclidean')

In [None]:
ks = range(2,10)
scores = []
for k in ks:
    clust = KMeans(n_clusters=k)
    y_cl = clust.fit_predict(X)
    scores.append(metrics.silhouette_score(X, y_cl, metric='euclidean'))
plt.plot(ks,scores)
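
# A possible follow-up (sketch, not part of the original cell): pick the k with
# the highest silhouette score. The data is regenerated so the snippet runs on
# its own; random_state and n_init are assumptions added for repeatability.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.cluster import KMeans

X, _ = datasets.make_blobs(n_samples=1000, n_features=2, centers=4,
                           cluster_std=1.5, random_state=7)

ks = list(range(2, 10))
scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores.append(metrics.silhouette_score(X, labels, metric="euclidean"))

# The k with the largest mean silhouette coefficient
best_k = ks[int(np.argmax(scores))]
print("best k by silhouette:", best_k)
```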
    

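Alternative scores are the [Calinski-Harabasz score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html) (named `calinski_harabasz_score` in current scikit-learn versions) and the 
[Rand index adjusted for chance](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html). Note that the adjusted Rand index compares a clustering with known true labels, so it can only be used when such labels are available.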
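
A minimal sketch of both scores on the artificial data follows. It assumes a scikit-learn version in which the function is spelled `calinski_harabasz_score`, and it uses the true labels `y` from `make_blobs` for the adjusted Rand index.

```python
from sklearn import datasets, metrics
from sklearn.cluster import KMeans

X, y = datasets.make_blobs(n_samples=1000, n_features=2, centers=4,
                           cluster_std=1.5, random_state=7)
y_cl = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Internal score: needs only the data and the predicted labels (higher is better)
ch = metrics.calinski_harabasz_score(X, y_cl)

# External score: compares predicted labels with the true labels.
# It does not depend on how the cluster indices are numbered.
ari = metrics.adjusted_rand_score(y, y_cl)

print("Calinski-Harabasz:", ch)
print("adjusted Rand index:", ari)
```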

## Other clustering methods:
### 1) [DBSCAN](https://en.wikipedia.org/wiki/DBSCAN)

In [None]:
from sklearn.cluster import DBSCAN
clust = DBSCAN(eps=1, min_samples=5)
y_cl = clust.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_cl)

Note that there are some "outliers" (noise points) that do not belong to any cluster! DBSCAN assigns them the label -1.

Which instances are the outliers?

In [None]:
np.where(y_cl == -1)
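
The sketch below summarizes a DBSCAN result: the number of clusters found, the number of noise points, and the core samples. It uses the same `eps` and `min_samples` as above and regenerates the data so that it runs on its own.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import DBSCAN

X, _ = datasets.make_blobs(n_samples=1000, n_features=2, centers=4,
                           cluster_std=1.5, random_state=7)

clust = DBSCAN(eps=1, min_samples=5).fit(X)
labels = clust.labels_

# Noise points carry the label -1 and are not counted as a cluster
n_noise = int((labels == -1).sum())
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# Core samples have at least min_samples points within distance eps
core_mask = np.zeros(len(labels), dtype=bool)
core_mask[clust.core_sample_indices_] = True

print("clusters found:", n_clusters)
print("noise points:", n_noise)
print("core samples:", int(core_mask.sum()))
```

Unlike k-means, DBSCAN does not take the number of clusters as input; it follows from `eps` and `min_samples`.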

### 2) Use [Agglomerative Clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) (see [here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) for documentation)

In [None]:
# %load solutions/l2_AgglomerativeClustering.py
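
# One possible approach, shown as a sketch. This is NOT the content of the
# solution file above; n_clusters=4 and linkage="ward" are assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering

X, _ = datasets.make_blobs(n_samples=1000, n_features=2, centers=4,
                           cluster_std=1.5, random_state=7)

# Ward linkage merges, at each step, the two clusters whose merge
# increases the within-cluster variance the least
agg = AgglomerativeClustering(n_clusters=4, linkage="ward")
y_agg = agg.fit_predict(X)

print("cluster sizes:", np.bincount(y_agg))
```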


## Clustering the Iris data

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/gesiscss/WDCNLP/main/data/iris.csv", na_values="?")
df[1:5]

In [None]:
X =  df.drop("species", axis=1).values
X[1:5]

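In a real example we should standardize the features, because distance-based methods such as k-means are otherwise dominated by the features with the largest scale. The [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) subtracts the mean of each feature and scales it to unit variance.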

In [None]:
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
X_scaled[1:5]
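
As a quick check of what the scaler does, this sketch compares its output with the manual formula (x - mean) / std, computed per feature. It assumes the copy of the Iris data bundled with scikit-learn (`load_iris`) as a stand-in for the CSV file above, so that it runs without network access.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Bundled Iris data (assumption: used in place of the CSV loaded above)
X_iris = load_iris().data

X_std = StandardScaler().fit_transform(X_iris)

# Manual standardization; StandardScaler uses the population std (ddof=0)
X_manual = (X_iris - X_iris.mean(axis=0)) / X_iris.std(axis=0)

print("column means:", X_std.mean(axis=0).round(6))
print("column stds: ", X_std.std(axis=0).round(6))
print("matches manual formula:", np.allclose(X_std, X_manual))
```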

In [None]:
clust = KMeans(n_clusters=3)
y_cl = clust.fit_predict(X_scaled)
y_cl

In [None]:
plt.scatter(X[:, 0], X[:, 1], c=y_cl)

In [None]:
plt.scatter(X[:, 2], X[:, 3], c=y_cl)

### Conducting a [Principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) on the Iris features

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
X_pca[1:5]

In [None]:
# Plot the first against the second principal component
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_cl)
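
# How much of the variance do the two components retain? A sketch, assuming the
# bundled Iris data (load_iris) as a stand-in for the CSV file and standardized
# features as input to the PCA (the cell above uses the unscaled X).

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_iris = load_iris().data
X_std = StandardScaler().fit_transform(X_iris)

pca = PCA(n_components=2)
X_proj = pca.fit_transform(X_std)

# Fraction of the total variance captured by each component, in decreasing order
ratios = pca.explained_variance_ratio_
print("explained variance ratio:", ratios)
print("retained in 2 components:", ratios.sum())
```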