<a href="https://colab.research.google.com/github/urmilapol/urmilapolprojects/blob/master/unsupervisedml.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unsupervised Learning Tutorial

In [None]:
import matplotlib.pyplot as plt
from sklearn import datasets
import sklearn.metrics as sm
 
import pandas as pd
import numpy as np
 
# Only needed if you want to display your plots inline if using Notebook
# change inline to auto if you have Spyder installed
%matplotlib inline

In [None]:
# import some data to play with
iris = datasets.load_iris()

## Mapping target labels to target names

In [None]:
species_dict = dict(zip(range(0, len(iris.target_names)), iris.target_names))

iris_species = list((map(lambda x : species_dict[x], iris.target)))
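
The same mapping can also be written without an intermediate dictionary. The sketch below assumes `iris.target_names` is a NumPy array (as returned by `load_iris()`); the name `iris_species_alt` is made up for this illustration.

```python
from sklearn import datasets

iris = datasets.load_iris()

# target_names is a NumPy array, so indexing it with the integer targets
# looks up the species name of every sample in one step
iris_species_alt = iris.target_names[iris.target].tolist()

print(iris_species_alt[:3])
```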

## Explore the data

In [None]:
iris.data

In [None]:
iris.feature_names

![](http://5047-presscdn.pagely.netdna-cdn.com/wp-content/uploads/2015/04/iris_petal_sepal.png)

In [None]:
iris.target

In [None]:
iris.target_names

![](https://i1.wp.com/dataaspirant.com/wp-content/uploads/2017/01/irises.png?resize=600%2C181)

In [None]:
# Store the inputs as a Pandas Dataframe and set the column names
x = pd.DataFrame(iris.data, columns = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width'])
 
y = pd.DataFrame(iris.target, columns = ['Targets'])

## Visualizing the data

### How many clusters?

The iris data is an array of size 150x4, stored above as the DataFrame `x`. As seen above, our features are sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm).

matplotlib.pyplot has already been imported as plt.

* Make a scatter plot by passing x.Sepal_Length and x.Sepal_Width to the plt.scatter() function.
* Make a scatter plot by passing x.Petal_Length and x.Petal_Width to the plt.scatter() function.
* Call the plt.show() function to show your plot.

How many clusters do you see?

In [None]:
# Set the size of the plot
plt.figure(figsize=(14,7))
 
# Plot Sepal
plt.subplot(1, 2, 1) # Creating subplots (1st subplot of 1 row, 2 columns)

# Produce a scatter plot for the sepal length and width 
plt.scatter(x.Sepal_Length, x.Sepal_Width)
plt.xlabel('Length')
plt.ylabel('Width')

plt.title('Sepal')
 
plt.subplot(1, 2, 2)
# Produce a scatter plot for the petal length and width 
plt.scatter(x.Petal_Length, x.Petal_Width)
plt.xlabel('Length')
plt.ylabel('Width')

plt.title('Petal')

# Show both subplots
plt.show()

## Dealing with the size of data with Principal Component Analysis

The sheer size of data in the modern age is not only a challenge for computer hardware but also a main bottleneck for the performance of many machine learning algorithms. The main goal of PCA is to identify patterns in data by detecting correlations between variables. Reducing the dimensionality only makes sense if strong correlations between variables exist. In a nutshell, this is what PCA is all about: finding the directions of maximum variance in high-dimensional data and projecting the data onto a lower-dimensional subspace while retaining most of the information.
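
To make "directions of maximum variance" concrete, here is a minimal sketch of PCA written directly with NumPy. It assumes the classic formulation through the eigenvectors of the covariance matrix; scikit-learn's `PCA` computes the result differently internally, so treat this as an illustration rather than the library's implementation. The variable names are chosen for this example only.

```python
import numpy as np
from sklearn import datasets

X = datasets.load_iris().data

# Standardise each feature to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the standardised features (4 x 4)
cov = np.cov(X_scaled, rowvar=False)

# Eigenvectors are the principal directions,
# eigenvalues are the variances along those directions
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so reorder to descending
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the data onto the two directions of maximum variance
k = 2
X_projected = X_scaled @ eigenvectors[:, :k]

print(X_projected.shape)
print(eigenvalues / eigenvalues.sum())  # share of variance per direction
```

The signs of the projected columns may differ from what scikit-learn returns, because an eigenvector is only defined up to its sign.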

### PCA and Dimensionality Reduction

Often, the desired goal is to reduce the dimensions of a $d$-dimensional dataset by projecting it onto a $k$-dimensional subspace (where $k<d$) in order to increase the computational efficiency while retaining most of the information. An important question is "what is the size of $k$ that represents the data 'well'?"

#### Summary of Approach

The general scikit-learn workflow has three steps:

* Instantiate (or create) the specific machine learning model you want to use
* Fit the model to the training data
* Use the model to make predictions

For PCA, these steps become:

* Standardise data
* Instantiate [```PCA()```](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)
* Fit ```PCA``` to the training data with the [```.fit```](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit) function.
* Use ```PCA``` to [```transform```](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.transform) the training data
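
The steps above can also be chained into a single object. The sketch below uses scikit-learn's `make_pipeline`, which this tutorial does not otherwise use, so it is shown only as one possible way to organise the same steps.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data

# Chain the two steps: standardise, then project onto 2 principal components
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))

# fit_transform fits every step in order and returns the transformed data
X_reduced = pipeline.fit_transform(X)

print(X_reduced.shape)
```

A pipeline keeps the scaler and the PCA model together, so the same scaling is applied automatically whenever new data is transformed.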

#### Standardising your data

##### Motivation

Since the range of values of raw data varies widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example, many classifiers calculate the distance between two points using the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
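
A small sketch of this effect, using made-up samples (height in metres and income in dollars are hypothetical features, not part of the iris data):

```python
import numpy as np

# Hypothetical samples: column 0 is height in metres, column 1 is income in dollars
samples = np.array([
    [1.60, 30000.0],
    [1.95, 30500.0],
    [1.61, 32000.0],
])

def distances_from_first(data):
    """Euclidean distance from the first row to every other row."""
    return np.linalg.norm(data[1:] - data[0], axis=1)

raw = distances_from_first(samples)

# Standardise each column to zero mean and unit variance
scaled_samples = (samples - samples.mean(axis=0)) / samples.std(axis=0)
scaled = distances_from_first(scaled_samples)

print(raw)     # dominated by the income differences (roughly 500 and 2000)
print(scaled)  # both features now contribute to the distance
```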

Unscaled data can also slow down or even prevent the convergence of many gradient-based estimators. 
There are various methods to normalize data. For this tutorial we are going to use the [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from scikit-learn.

```python
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)
```
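
As a rough check of what the scaler does, the sketch below compares `StandardScaler` with the same calculation done by hand. It assumes the default settings, under which each column has its mean subtracted and is divided by its population standard deviation; `X_manual` is a name introduced for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data

X_std = StandardScaler().fit_transform(X)

# Same calculation by hand: subtract the column mean and divide by the
# column standard deviation (NumPy's default ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_std, X_manual))
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```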

### Variance of PCA features

The iris dataset is 4-dimensional. But what is its **intrinsic dimension**? (**Intrinsic dimension** = number of features needed to approximate the dataset) Make a plot of the variances of the PCA features to find out. The samples are stored in the DataFrame ```x```, where each row represents a sample. You'll need to standardize the features first.

#### Instructions

* Create an instance of StandardScaler called ```scaler```.
* Create a PCA instance called ```pca```.
* Use the ```.fit_transform()``` method of ```scaler``` on the iris samples and assign the result to ```X_norm```.
* Use the ```.fit()``` method of ```pca``` to fit it to the scaled data ```X_norm```.
* Extract the number of components used using the ```.n_components_``` attribute of pca. Place this inside a range() function and store the result as features.
* Use the plt.bar() function to plot the explained variances, with features on the x-axis and ```pca.explained_variance_``` on the y-axis.

In [None]:
# Perform the necessary imports
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Create scaler: scaler
scaler = StandardScaler()

# Create a PCA instance: pca
pca = PCA()

# Fit scaler to the iris samples 'x' and transform them: X_norm
X_norm = scaler.fit_transform(x)

In [None]:
# Fit pca to the scaled samples 'X_norm'
pca.fit(X_norm)

# Plot the explained variances
features = range(0, pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()

Looking at your plot, what do you think would be a reasonable choice for the "intrinsic dimension" of the iris dataset? Recall that the intrinsic dimension is the number of PCA features with significant variance.
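
Besides reading the bar chart, one common way to judge "significant variance" is the cumulative share of variance retained by the first $k$ components. The sketch below uses the `explained_variance_ratio_` attribute of a fitted `PCA`; the 95% threshold is an illustrative assumption, not a rule.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_norm = StandardScaler().fit_transform(datasets.load_iris().data)
pca = PCA().fit(X_norm)

# Fraction of the total variance captured by each component
ratios = pca.explained_variance_ratio_

# Running total: how much variance the first k components retain together
cumulative = np.cumsum(ratios)

# Smallest k whose components retain at least 95% of the variance
# (the threshold is an assumption made for this example)
k = int(np.argmax(cumulative >= 0.95)) + 1

print(cumulative)
print(k)
```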

### Dimension reduction of the iris dataset

In the previous exercise, you found the "intrinsic dimension" of the iris dataset to be some $k < 4$. Now use PCA for dimensionality reduction of the iris dataset, retaining only the 2 most important components.

The data has already been scaled above and is available as ```X_norm```.

#### Instructions

* Create a ```PCA``` instance called ```pca``` with ```n_components=2```.
* Use the ```.fit()``` method of ```pca``` to fit it to the scaled iris data ```X_norm```.
* Use the ```.transform()``` method of ```pca``` to transform the ```X_norm```. Assign the result to ```pca_features```.

In [None]:
# Create a PCA model with 2 components: pca
pca = PCA(n_components=2)

# Fit the PCA instance to the scaled samples
pca.fit(X_norm)

# Transform the scaled samples: pca_features
pca_features = pca.transform(X_norm)

# Print the shape of pca_features
print(pca_features.shape)

In [None]:
plt.scatter(pca_features[:, 0], pca_features[:, 1])

## What is Kmeans clustering?

Kmeans clustering is an unsupervised learning technique to automatically group data into coherent clusters.

* Data: The model will take in training data.
* Output: Cluster centroids and the labels for each data point. The labels tell us which clusters they belong to.

### A summary of the algorithm

Randomly initialize K cluster centroids.

Repeat until the centroid positions stop changing:
* Assign each data point x to the cluster whose centroid is closest to x.
* Update each cluster centroid to the mean of the data points assigned to it.
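
The loop above can be sketched in a few lines of NumPy. This is a minimal illustration under simple assumptions (initial centroids are drawn from the data points, and an empty cluster keeps its old centroid), not scikit-learn's implementation. The function name `kmeans` and the synthetic data are made up for the example.

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Minimal k-means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Randomly initialise the centroids by picking k distinct data points
    centroids = points[rng.choice(len(points), size=k, replace=False)]

    for _ in range(max_iter):
        # Assignment step: distance from every point to every centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Update step: move each centroid to the mean of its assigned points
        # (a centroid with no points keeps its previous position)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels

# Two well-separated synthetic blobs (made-up data for illustration)
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

centroids, labels = kmeans(points, k=2)
print(centroids)
```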


## Building the Kmeans model

From the scatter plot of the previous exercise, you saw that the points seem to separate into 3 clusters. You'll now create a KMeans model to find 3 clusters, and fit it to the scaled iris samples. After the model has been fit, you'll obtain the cluster label of each sample from the labels_ attribute. (To label points the model has not seen, you would use the .predict() method instead.)

The scaled samples from the previous exercise are available as X_norm.

### Instructions

* Import KMeans from sklearn.cluster.
* Using KMeans(), create a KMeans instance called model to find 3 clusters. To specify the number of clusters, use the n_clusters keyword argument.
* Use the .fit() method of model to fit the model to the scaled samples X_norm.
* Read the cluster labels of the fitted samples from the labels_ attribute of model, assigning the result to labels.

In [None]:
# Import KMeans
from sklearn.cluster import KMeans

# Create a KMeans instance with 3 clusters: model
model = KMeans(n_clusters=3)

# Fit model to the scaled samples
model.fit(X_norm)

# Get the cluster label of each sample: labels
labels = model.labels_

# Display the cluster labels
labels

## Inspect your clustering

* Can check correspondence with e.g. iris species
* … but what if there are no species to check against?
* Measure quality of a clustering
* Informs choice of how many clusters to look for

## Correspondence with iris species

### Instructions

Use the ```pd.crosstab()``` function on ```df['labels']``` and ```df['species']``` to count the number of times each iris species coincides with each cluster label. Assign the result to ```ct```.

In [None]:
# Create a KMeans model with 3 clusters: model
model = KMeans(n_clusters=3)

# Use fit_predict to fit model and obtain cluster labels: labels
labels = model.fit_predict(pca_features)

# Create a DataFrame with labels and species as columns: df
df = pd.DataFrame({'labels': labels, 'species': iris_species})

# Create crosstab: ct
ct = pd.crosstab(df['labels'], df['species'])

# Display ct
print(ct)
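
When true labels are available, the agreement shown in the cross-tabulation can also be summarised as a single number. The sketch below uses `adjusted_rand_score` from `sklearn.metrics`, which the original notebook does not use; `random_state` and `n_init` are fixed only to make the sketch reproducible.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_norm = StandardScaler().fit_transform(iris.data)

# random_state and n_init are assumptions made for reproducibility
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_norm)

# 1.0 means the clusters match the species perfectly; values near 0 mean
# the agreement is no better than chance. The label numbering does not matter.
score = adjusted_rand_score(iris.target, labels)
print(score)
```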

### Measuring Quality of Clustering

* Using only samples and their cluster labels
* A good clustering has tight clusters
* ... and samples in each cluster bunched together

###  Inertia measures clustering quality

* Measures how spread out the clusters are (lower is better)
* Sum of squared distances from each sample to the centroid of its cluster
* After ```fit()```, available as attribute ```inertia_```
* k-means attempts to minimize the inertia when choosing clusters

```python
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3)
model.fit(samples)  # samples: 2D array, one row per sample
print(model.inertia_)
# Output: 78.9408414261
```
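
To see where this number comes from, the sketch below recomputes the inertia by hand from the fitted centroids and labels. It assumes the raw iris measurements as `samples`; `random_state` and `n_init` are fixed only for reproducibility, and `manual_inertia` is a name introduced here.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

samples = datasets.load_iris().data

model = KMeans(n_clusters=3, n_init=10, random_state=0)
model.fit(samples)

# Centroid assigned to each sample, looked up through its cluster label
assigned_centroids = model.cluster_centers_[model.labels_]

# Inertia: sum of squared distances from each sample to its centroid
manual_inertia = np.sum((samples - assigned_centroids) ** 2)

print(np.isclose(manual_inertia, model.inertia_))
```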

### Instructions

For each of the given values of ```k```, perform the following steps:

* Create a ```KMeans``` instance called ```model``` with ```k``` clusters.
* Fit the model to the PCA-transformed iris data ```pca_features```.
* Append the value of the ```inertia_``` attribute of ```model``` to the list ```inertias```.

The code to plot ```ks``` vs ```inertias``` has been written for you, so hit 'Shift + Enter' to see the plot!

In [None]:
ks = range(1, 6)
inertias = []

for k in ks:
    # Create a KMeans instance with k clusters: model
    model = KMeans(n_clusters=k)
    
    # Fit model to samples
    model.fit(pca_features)
    
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
    
# Plot ks vs inertias
plt.plot(ks, inertias, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

## Next steps

* PCA - Look at factor loadings
* KMeans - Evaluation of cluster compactness by looking at Silhouette Score when we do not have access to labels to evaluate
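
A possible starting point for both next steps is sketched below. It treats the rows of `pca.components_` as unscaled loadings (some texts additionally scale them by the square root of the explained variance) and uses `silhouette_score` from `sklearn.metrics`. The column names `PC1` and `PC2` and the fixed `random_state` are assumptions made for this example.

```python
import pandas as pd
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_norm = StandardScaler().fit_transform(iris.data)

# Loadings: weight of each original feature in each principal component
pca = PCA(n_components=2).fit(X_norm)
loadings = pd.DataFrame(pca.components_.T,
                        index=iris.feature_names,
                        columns=['PC1', 'PC2'])
print(loadings)

# Silhouette score per k: ranges from -1 to 1, and higher values mean
# tighter, better-separated clusters. It needs at least 2 clusters.
pca_features = pca.transform(X_norm)
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(pca_features)
    scores[k] = silhouette_score(pca_features, labels)
print(scores)
```

Unlike inertia, the silhouette score does not always improve as k grows, so it can be compared across different numbers of clusters.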

## Resources

* https://www.datacamp.com/courses/unsupervised-learning-in-python

## THE END - WELL DONE!