# Unsupervised Learning

In this notebook we will learn the basics of unsupervised learning. 

After completing this notebook you will be familiar with the basics of K-Means clustering as well as Principal Component Analysis for dimensionality reduction.

- PCA looks to find a low-dimensional representation of the observations that explains a good fraction of the variance
- Clustering looks to find homogeneous subgroups among the observations

In concrete terms we will analyse a data set consisting of a bag of [stop words](https://en.wikipedia.org/wiki/Stop_words) from a collection of books. Our task is to try to group and identify the authors of the books.

**Link to the data set source:** https://www.openml.org/d/458

**Structure of the Notebook:**
1. Exploring the data
 - Prepare the data for clustering
1. K-Means clustering
 - Determine the number of clusters using the elbow method
 - Apply K-Means clustering to identify authors
 - Analyse the results
1. Principal Component Analysis
 - Determine the number of principal components
 - Analysing the principal components
 - Apply PCA to be used in clustering
 - Analyse the results
1. Using PCA to visualize clusters in two dimensions

## Loading the libraries

In [1]:
import numpy as np # library used for matrix and mathematical operations
import pandas as pd # library used for data wrangling
import matplotlib.pyplot as plt # library used for plotting data
import seaborn as sns # a more sophisticated library for visualizations

# display visualizations within the notebook
%matplotlib inline 

# a few personal preferences for the visualizations
# note: on Matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
plt.rcParams.update({'font.size': 12})
sns.set_style('whitegrid')

## Loading the data

Use the `read_csv()` function from the Pandas library to load the dataset into a Pandas DataFrame. We'll call the variable `df` (short for dataframe).

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv

https://pandas.pydata.org/pandas-docs/stable/api.html#dataframe
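As a minimal sketch of how `read_csv()` behaves, the snippet below reads a tiny in-memory CSV instead of the real file. The file content, the column names and the variable name `df_demo` are assumptions made for illustration only; for the exercise you would pass the path of the downloaded dataset instead.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real file (hypothetical content).
csv_text = """a,the,BookID,Author
10,52,1,A
7,48,2,B
12,60,3,A
"""

# read_csv accepts a file path, a URL or any file-like object.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.shape)   # (number of rows, number of columns)
print(df_demo.dtypes)  # column types inferred by pandas
```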

In [2]:
# your code here (read csv)

## Exploring the data
**Basic information about the dataset**

Call the `info()` method on the newly created Pandas DataFrame.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.info.html#pandas.DataFrame.info

In [3]:
# your code here (info)

**Getting a glimpse of the data**

Call the `head()` method from the Pandas DataFrame to view the content of the first 5 rows of the DataFrame.

**Tip:** You can return *n* rows by passing *n* as an argument. You can list the last *n* rows by calling the `tail()` method.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html

In [4]:
# your code here (head)

**Books**

Print the number of unique books by calling the `nunique()` function on the `BookID` column in the dataset.

https://pandas.pydata.org/pandas-docs/stable/indexing.html

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.nunique.html#pandas.Series.nunique

In [5]:
# your code here (books summary)

**Authors**

List the distinct authors by calling the `unique()` method on the `Author` column in the dataset; the length of the result (or `nunique()`) gives the number of authors.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.unique.html#pandas.unique

In [6]:
# your code here (author summary)

**Books per author**

Group the data by `Author` using the `groupby()` method.

Print the number of books per author by calling the `nunique()` method on the `BookID` column.
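The grouping step could look roughly like the sketch below. It runs on a small made-up DataFrame (`books` and its values are hypothetical), so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy data: several rows can belong to the same book.
books = pd.DataFrame({
    "BookID": [1, 1, 2, 3, 3, 3],
    "Author": ["A", "A", "A", "B", "B", "B"],
})

# Distinct authors and their count.
print(books["Author"].unique())
print(books["Author"].nunique())

# Number of distinct books for each author.
books_per_author = books.groupby("Author")["BookID"].nunique()
print(books_per_author)
```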

In [7]:
# your code here (books per author summary)

**Words**

Print lists of top 10 words by:
- Value count in an individual row
- Sum of words occurring in total
- Mean word occurrence per row

You can use `iloc` for slicing the dataframe (exclude last two columns) and `max()`, `sum()`, `mean()` functions in addition to a few familiar functions mentioned above to reach the desired result.
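One possible way to combine `iloc` with the aggregation methods is sketched here on a made-up table (`words` and its values are hypothetical). It assumes, as stated above, that the last two columns are not word counts.

```python
import pandas as pd

# Hypothetical toy data with three word columns and two non-word columns.
words = pd.DataFrame({
    "a": [10, 7, 12],
    "the": [52, 48, 60],
    "of": [30, 35, 20],
    "BookID": [1, 2, 3],
    "Author": ["A", "B", "A"],
})

# Keep every row, drop the last two columns.
counts = words.iloc[:, :-2]

# Aggregate per column, sort descending and keep at most ten entries.
top_max = counts.max().sort_values(ascending=False).head(10)
top_sum = counts.sum().sort_values(ascending=False).head(10)
top_mean = counts.mean().sort_values(ascending=False).head(10)

print(top_max)
print(top_sum)
print(top_mean)
```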

In [8]:
# your code here (max occurrence)

In [9]:
# your code here (sum of words)

In [10]:
# your code here (average word count)

**CONCLUSION/WHAT DID WE LEARN?**

*Your answer here*

## Preparing the data
K-Means might work in unintended ways if the units and scales of the features vary a lot. As we saw during the data exploration phase, the word counts are rather homogeneous. Nevertheless, in this section we will standardize/normalize the data.

**Extract features and standardize the data**

Scikit Learn provides a convenient `StandardScaler` class that you can import from the `sklearn.preprocessing` package. Create an instance of the class and call the `fit_transform()` method to standardize the data.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler

Transform all data but the last column into a separate variable `X` using slicing.
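To illustrate what standardization does, here is a small sketch on a made-up array (`raw` and `X_demo` are hypothetical names; the slicing of the real DataFrame is left to the exercise). After `fit_transform()` each column should have a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales.
raw = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

scaler = StandardScaler()
X_demo = scaler.fit_transform(raw)  # learn mean/std per column, then rescale

print(X_demo.mean(axis=0))  # per-column mean after scaling
print(X_demo.std(axis=0))   # per-column standard deviation after scaling
```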

In [11]:
# your code here (standard scaler)

# K-Means clustering
In K-means clustering, we seek to partition the observations into a pre-specified *K* number of distinct, non-overlapping clusters.

**Documentation:** https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

**Theory/examples:** https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html

## Identifying number of clusters
We can use the elbow method to determine the number of clusters for the data. In the elbow method you fit K-Means with 1 to *n* clusters (where *n* is the number of features) and store the Within-Cluster Sum of Squares (WCSS) of each run. By plotting the WCSS value against the number of clusters we hope to see an "elbow": the point after which adding more clusters no longer reduces the WCSS by much.

Import the `KMeans` class from the `sklearn.cluster` package and create a Python list called `wcss`.
Loop through a range from 1 to num_features+1 and within this loop:
- Apply KMeans with `i` number of clusters
- Append the value from the `inertia_` attribute to the wcss list.

Finally `plot()` the data using Matplotlib (and style the plot according to your liking)
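The loop described above might be sketched as follows. The data comes from `make_blobs` purely for illustration, and `n_init=10` and `random_state=0` are arbitrary choices; plotting `wcss` is omitted here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the standardized features (assumption).
X_demo, _ = make_blobs(n_samples=200, centers=3, n_features=5, random_state=0)
num_features = X_demo.shape[1]

wcss = []
for i in range(1, num_features + 1):
    km = KMeans(n_clusters=i, n_init=10, random_state=0)
    km.fit(X_demo)
    wcss.append(km.inertia_)  # within-cluster sum of squares for i clusters

for i, value in enumerate(wcss, start=1):
    print(i, round(value, 1))
```

Passing `range(1, num_features + 1)` and `wcss` to `plt.plot()` would then give the elbow curve.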

In [12]:
# your code here (elbow method)

**CONCLUSION/WHAT DID WE LEARN?**

*Your answer here*

## Applying the actual K-Means clustering
With this specific dataset we're mostly interested in clustering by author, which means that the number of clusters is 4.

A few notes about the parameters of the `KMeans` class:
- `init` defines the centroid/cluster initialization method
- `n_init` sets the number of times the algorithm is run with different centroid seeds
- `max_iter` sets the maximum number of iterations for a single run
- `random_state` can be used if one needs reproducible results. **NOTE:** Never use in production!

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

**Tip:** You can both `fit()` and `predict()` the data at once calling the `fit_predict()` method.
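A rough sketch of how these parameters fit together is shown below, again on synthetic `make_blobs` data rather than the book dataset. The concrete parameter values are illustrative assumptions, not recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with four groups (assumption).
X_demo, _ = make_blobs(n_samples=120, centers=4, random_state=42)

kmeans = KMeans(
    n_clusters=4,      # number of clusters to form
    init="k-means++",  # centroid initialization method
    n_init=10,         # number of runs with different centroid seeds
    max_iter=300,      # iteration cap for a single run
    random_state=42,   # fixed seed so the demo is reproducible
)

# fit_predict() fits the model and returns one cluster index per sample.
labels = kmeans.fit_predict(X_demo)

print(labels[:10])
print(kmeans.cluster_centers_.shape)
```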

In [13]:
# your code here (K-Means)

## Evaluating the results
Now that we have divided the data into four distinct clusters, it is time to evaluate the results. In general, evaluating the performance of clustering algorithms is challenging, to say the least. This dataset has a target variable, which means that we can compare the output of our clustering algorithm to it. **Note:** In most cases where clustering is used we do not have this luxury; if labels were available, supervised learning would generally be the more accurate choice.

### Prepare the target/label to be used in metrics
Before we can evaluate the results, the label/target variable needs to be converted to a numeric representation, since strings (such as Shakespeare) cannot be used in numeric computations.

Scikit Learn provides a convenient `LabelEncoder` class from the `sklearn.preprocessing` package to do this. Create an instance of the class and call the `fit_transform()` method to convert the labels into variable `y`.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder
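As a small sketch of the encoding step, the example below uses a short hand-written list of author names (illustrative values, not read from the dataset). `LabelEncoder` assigns integers according to the sorted order of the distinct labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical label column.
authors = ["Shakespeare", "Austen", "Shakespeare", "Milton"]

encoder = LabelEncoder()
y_demo = encoder.fit_transform(authors)

print(y_demo)            # integer code for each entry
print(encoder.classes_)  # position in this array is the integer code

# The mapping can be reversed when the names are needed again.
print(encoder.inverse_transform(y_demo))
```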

In [14]:
# your code here (label encoding)

### Homogeneity, completeness and V-measure
Given the knowledge of the ground truth class assignments of the samples, it is possible to define some intuitive metrics using conditional entropy analysis.

In particular Rosenberg and Hirschberg (2007) define the following two desirable objectives for any cluster assignment:

- **homogeneity:** each cluster contains only members of a single class.
- **completeness:** all members of a given class are assigned to the same cluster.

These concepts are available as the scores `homogeneity_score` and `completeness_score`. Both are bounded below by 0.0 and above by 1.0 (higher is better).

https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation

In [15]:
# your code here (homogeneity)

In [16]:
# your code here (completeness)

The harmonic mean of homogeneity and completeness is called the **V-measure** and is computed using the `v_measure_score()` function.
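The relationship between the three scores can be checked with a quick sketch on hand-made labels (`y_true` and `y_pred` are hypothetical). Note that the scores do not depend on which integer is assigned to which cluster, only on how the samples are grouped.

```python
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

# Hypothetical ground truth and an imperfect clustering of six samples.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 0, 2]

h = homogeneity_score(y_true, y_pred)
c = completeness_score(y_true, y_pred)
v = v_measure_score(y_true, y_pred)

# With the default settings, V-measure is the harmonic mean of h and c.
harmonic = 2 * h * c / (h + c)
print(h, c, v, harmonic)

# A perfect grouping with swapped cluster ids still gets the top score.
perfect = v_measure_score([0, 0, 1, 1], [1, 1, 0, 0])
print(perfect)
```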

In [17]:
# your code here (v measure)

**CONCLUSION/WHAT DID WE LEARN?**

*Your answer here*

# Principal Component Analysis

PCA looks to find a low-dimensional representation of the observations that explains a good fraction of the variance.

**Documentation:** https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

**Theory/examples:** https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-analysis.html

## Choosing the number of Principal Components
In order to identify how much variance is explained by each Principal Component we will create an instance of the `PCA` class from the `sklearn.decomposition` package. Lastly, `fit()` the newly created object to the data.

In [18]:
# your code here (PCA)

**Plot the variance plots**

We can plot the variance in two different ways. 
1. Plotting the `explained_variance_ratio_` visualizes the percentage of variance each Principal Component explains. This allows us to locate a similar elbow curve as we did with the K-Means algorithm.
1. Plotting the cumulative sum of the `explained_variance_ratio_` visualizes the total amount of variance explained by a certain number of Principal Components combined. This might be useful if you are looking to reach a certain threshold level for the components. 

Create two separate plots visualizing both use cases. You can use the NumPy function `cumsum()` to calculate the cumulative sum of the components. 
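The two quantities to plot could be computed roughly as in the sketch below. It uses synthetic data (an assumption) and leaves the actual plotting out; the 80% threshold is just an example value.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized features (assumption).
X_raw, _ = make_blobs(n_samples=150, centers=4, n_features=6, random_state=1)
X_demo = StandardScaler().fit_transform(X_raw)

pca = PCA()      # keep all components
pca.fit(X_demo)

ratios = pca.explained_variance_ratio_  # share of variance per component
cumulative = np.cumsum(ratios)          # running total over the components

# Smallest number of components whose cumulative share reaches 80%.
n_80 = int(np.argmax(cumulative >= 0.80)) + 1

print(ratios)
print(cumulative)
print(n_80)
```

Passing `ratios` and `cumulative` to `plt.plot()` would give the two plots described above.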

In [19]:
# your code here (elbow method)

In [20]:
# your code here (cumulative variance explained)

**CONCLUSION/WHAT DID WE LEARN?**

*Your answer here*

## Analysing the Principal Components
Before we proceed with applying PCA in practice, let's have a look at how we can better understand the Principal Components.

In order to more easily analyze the components we'll create a new Pandas DataFrame of the data:
- Store the feature names in a variable named `columns` (remember to exclude the Author column)
- Create a list of principal component identifiers, i.e. \['PC1', 'PC2', ... 'PCn'] (you can use Python [list comprehensions](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions) for this)
- Store the principal components in a variable named `data` by accessing the `components_` attribute from the pca object
- Create a new Pandas DataFrame using above variables (set the parameters `data`, `columns`, `index`). Call it `df_pca` so you do not overwrite the main DataFrame.
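The steps above can be sketched as follows. The feature names and the random data are made up for the demo; only the construction of the DataFrame from `components_` is the point here.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical feature names and random data standing in for the real features.
rng = np.random.default_rng(0)
columns = ["a", "the", "of", "and"]
X_demo = rng.normal(size=(50, len(columns)))

pca = PCA().fit(X_demo)

# One identifier per principal component: PC1, PC2, ...
index = [f"PC{i + 1}" for i in range(pca.n_components_)]

# Each row of components_ holds the weights of one principal component.
data = pca.components_

df_pca = pd.DataFrame(data=data, columns=columns, index=index)
print(df_pca.head())
```

Each row describes how strongly every original feature contributes to that component.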

In [21]:
# your code here (dataframe from pca)

Take a peek at the data using `head()` to ensure that the DataFrame looks okay.

In [22]:
# your code here (head)

**Plotting the Principal Components**

A heatmap provides a good overview of the effect that each variable has on each principal component. We will utilize the `heatmap()` function from the Seaborn library to create the heatmap. For convenience we'll enlarge the plot by first calling the `figure()` function from the Matplotlib library and providing the figure dimensions as a tuple for the `figsize` parameter, e.g. *(25, 10)*.

In addition, it might be a good idea to only plot the top 10 principal components to improve the readability of the plot.

In [23]:
# your code here (plot heatmap)

### Principal Component 1
The visualization might give a good overview, but to understand the variables' effect on the Principal Components more specifically we need to have a closer look at the data. 

Let's have a look at the variables that have the biggest positive and negative impact on the Principal component value.

Slice the data using `loc` to only select the first principal component and use the `sort_values()` and `head()` functions to create the top lists.
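A minimal sketch of that slicing and sorting is given below. The DataFrame `df_pca_demo` and its weights are invented for illustration and do not come from the real PCA result.

```python
import pandas as pd

# Hypothetical component weights: rows are components, columns are features.
df_pca_demo = pd.DataFrame(
    data=[[0.6, -0.5, 0.1, -0.2],
          [0.1, 0.3, -0.7, 0.4]],
    columns=["a", "the", "of", "and"],
    index=["PC1", "PC2"],
)

# Select the row of the first principal component by its label.
pc1 = df_pca_demo.loc["PC1"]

top_positive = pc1.sort_values(ascending=False).head(2)  # largest weights first
top_negative = pc1.sort_values(ascending=True).head(2)   # most negative first

print(top_positive)
print(top_negative)
print(pc1.describe())
```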

In [24]:
# your code here (largest positive factors for PC1)

In [25]:
# your code here (largest negative factors for PC1)

You can use the Pandas `describe()` method to get a statistical overview of the data.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.describe.html

In [26]:
# your code here (statistical description for PC1)

**CONCLUSION/WHAT DID WE LEARN?**

*Your answer here*

## Repeat K-Means with Principal Components
Now that we better understand PCA we can apply it to the clustering algorithm. 
Select two different numbers of principal components:
1. Number of components that explains at least 80% of the variance
1. Number of components that explains at least 90% of the variance


- Create the PCA object and apply the transformation (now you need to use `fit_transform()` to actually get the values for the principal components).
- Create the K-Means object and predict the clusters.
- Apply the three metrics used earlier and compare the results.
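Put together, the workflow might look like the sketch below. It runs on synthetic `make_blobs` data with known labels (an assumption made for the demo), picks the component count from an 80% cumulative variance threshold, and uses arbitrary values for `n_init` and `random_state`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with known group labels (assumption).
X_raw, y_demo = make_blobs(n_samples=200, centers=4, n_features=10, random_state=3)
X_demo = StandardScaler().fit_transform(X_raw)

# Number of components needed to explain at least 80% of the variance.
cumulative = np.cumsum(PCA().fit(X_demo).explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.80)) + 1

# Project the data onto those components, then cluster the projection.
X_pca = PCA(n_components=n_components).fit_transform(X_demo)
clusters = KMeans(n_clusters=4, n_init=10, random_state=3).fit_predict(X_pca)

h = homogeneity_score(y_demo, clusters)
c = completeness_score(y_demo, clusters)
v = v_measure_score(y_demo, clusters)
print(n_components, h, c, v)
```

Repeating the last block with a different threshold makes the scores directly comparable.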

### PCA using number of components from the elbow

In [27]:
# your code here (PCA)

In [28]:
# your code here (K-Means)

In [29]:
# your code here (homogeneity score)

In [30]:
# your code here (completeness score)

In [31]:
# your code here (v measure)

### PCA using ~80% variance explained

In [32]:
# your code here (PCA)

In [33]:
# your code here (K-Means)

In [34]:
# your code here (homogeneity score)

In [35]:
# your code here (completeness score)

In [36]:
# your code here (v measure)

**CONCLUSION/WHAT DID WE LEARN?**

*Your answer here*

## Using PCA to Visualize Clusters

In [37]:
# your code here (K-Means)

In [38]:
# your code here (scatter plot, PC1 & PC2)

In [39]:
# your code here (scatter plot, PC1 & PC3)

In [40]:
# your code here (scatter plot, PC2 & PC3)

**Comparison to actual labels**

Finally we'll end the visualization by comparing the identified clusters and actual labels side by side. 

In [41]:
# your code here (scatter plot, cluster vs. labels)

**CONCLUSION/WHAT DID WE LEARN?**

*Your answer here*