<!--NAVIGATION-->


<a href="https://colab.research.google.com/github/saskeli/x/blob/master/pca.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

|                                     -                                     |
|---------------------------------------------------------------------------|
| [Exercise 8 (explained variance)](<#Exercise-8-(explained-variance&#41;>) |



# ML: Principal component analysis

## Principal component analysis

Principal component analysis (PCA) is an unsupervised learning method that tries to detect the directions along which vector-valued data varies the most. It first finds the direction of highest variance, and then proceeds to find directions of highest variance that are orthogonal to the directions already found. So, for n-dimensional data, it returns, by default, n orthogonal directions and the corresponding variances (provided there are at least n data points). These directions are called *principal axes*, and if we project a data point onto these axes, we get the *principal components* of the point, one for each axis.
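The procedure described above can be sketched with plain NumPy. This is only an illustration under stated assumptions, not how scikit-learn implements PCA internally: the helper name `pca_by_eigendecomposition` and the synthetic data are made up for this example, and the principal axes are obtained as eigenvectors of the sample covariance matrix.

```python
import numpy as np

def pca_by_eigendecomposition(X):
    """Return (axes, variances); rows of `axes` are principal axes,
    sorted by decreasing variance. Hypothetical helper for illustration."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Sample covariance matrix of the features (uses the n - 1 denominator).
    cov = np.cov(centered, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]
    return eigenvectors[:, order].T, eigenvalues[order]

rng = np.random.default_rng(0)
# Synthetic data (an assumption): three features with clearly different spreads.
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.2])

axes, variances = pca_by_eigendecomposition(X)
print("Principal axes (one per row):")
print(axes)
print("Variances along the axes:", variances)
```

The axes come out mutually orthogonal, and the variances along them add up to the total variance of the original features.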

To use another terminology, the set of principal axes forms a basis for the vector space where the data points reside, and the principal components are the coordinates of the data points in this new coordinate system. The `PCA` class in the scikit-learn library has a `transform` method, which transforms data to this new coordinate system.
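As a small sanity check of this change-of-basis view, the sketch below compares `transform` with a manual projection. The random data is an assumption made for illustration; `mean_`, `components_` and `explained_variance_` are attributes of a fitted `PCA` object, and the default `whiten=False` setting is assumed.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic correlated data (an assumption): 200 points with 4 features.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

pca = PCA()
pca.fit(X)
Z = pca.transform(X)

# Manual change of basis: centre the data, then take the dot product
# of each point with every principal axis (the rows of components_).
Z_manual = (X - pca.mean_) @ pca.components_.T

print(np.allclose(Z, Z_manual))
# The variance of each new coordinate is the explained variance of that axis.
print(np.allclose(Z.var(axis=0, ddof=1), pca.explained_variance_))
```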

Let's look at an example where the data comes from a multivariate Gaussian distribution.
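A minimal sketch of such an example follows. The mean vector, the covariance matrix and the sample size are assumptions chosen only for illustration; the two features are positively correlated, so the first principal axis should point roughly along the diagonal of the point cloud.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Illustrative parameters (assumptions): two positively correlated features.
mean = [0.0, 0.0]
cov = [[3.0, 2.0],
       [2.0, 2.0]]
X = rng.multivariate_normal(mean, cov, size=1000)

pca = PCA()
pca.fit(X)

print("Principal axes (one per row):")
print(pca.components_)
print("Explained variances:", pca.explained_variance_)
```

The explained variances are sorted in decreasing order, and their sum equals the total variance of the two original features.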

#### <div class="alert alert-info">Exercise 8 (explained variance)</div>

This exercise is worth at most two points!

Part 1.

Write a function `explained_variance` that reads the tab-separated file "data.tsv". The data contains 10 features. Then fit PCA to the data. The function should return two lists (or 1D arrays). The first list should contain the variances of all the features. The second list should consist of the explained variances returned by the PCA.

In the main function print these values in the following form:
```
The variances are: ?.??? ?.??? ...
The explained variances after PCA are: ?.??? ?.??? ...
```
Print the values to three decimal places and separate them with a single space.

Part 2.

Plot the cumulative explained variances. The y-axis should be the cumulative sum, and the x-axis the number of terms in the cumulative sum.
<hr/>

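```python
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def explained_variance():
    # Read the tab-separated data; every column is a feature.
    df = pd.read_csv("src/data.tsv", sep="\t")
    model = PCA()
    model.fit(df)
    # Both pandas' var and PCA's explained_variance_ use the n - 1 denominator,
    # so the two results sum to the same total variance.
    return df.var(axis=0).to_numpy(), model.explained_variance_

def main():
    v, ev = explained_variance()
    print("The variances are:", " ".join(f"{x:.3f}" for x in v))
    print("The explained variances after PCA are:", " ".join(f"{x:.3f}" for x in ev))

    # Part 2: cumulative explained variance against the number of terms summed.
    plt.plot(np.arange(1, len(ev) + 1), np.cumsum(ev))
    plt.xlabel("Number of terms in the cumulative sum")
    plt.ylabel("Cumulative explained variance")
    plt.show()

if __name__ == "__main__":
    main()
```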


## Summary (week 6)

* We got to know another supervised learning method, namely naive Bayes classification.
* We saw examples of naive Bayes classification where either a Gaussian or a multinomial distribution was used to model the features of samples belonging to a class.
* We saw how to use cross-validation to assess the predictive ability of a model. This helps us detect when a model is overfitting.
* In the clustering section we saw examples of using k-means, DBSCAN, and hierarchical clustering methods. They take different approaches to clustering, and each has different strengths.
* Clustering is based on the notion of distance between the points in the data.
* Principal component analysis is another example of unsupervised learning.
* It can reduce the dimensionality of data by discarding those principal axes along which the variability is low.
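
The last point can be sketched as follows. The synthetic data is an assumption made for illustration: five features whose variation lies almost entirely in a two-dimensional subspace. The `n_components` parameter of `PCA` limits how many principal axes are kept.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data (an assumption): 5 features, but nearly all of the
# variation comes from 2 hidden coordinates plus a little noise.
latent = rng.normal(size=(300, 2)) * np.array([4.0, 2.0])
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(300, 5))

# Keep only the two principal axes with the highest variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print("Reduced shape:", X_reduced.shape)
print("Fraction of variance kept:", pca.explained_variance_ratio_.sum())
```

Because the discarded axes carry only the small noise term, the two retained components account for almost all of the variance in this data.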

