## Unsupervised learning

### Objectives
- Preprocessing input data
- Dimensionality reduction 
- Principal Component Analysis (PCA)
 - Dimensionality reduction
 - Data Compression
 - Visualization
- Clustering
 - K-Means
 - Other clustering methods

In [17]:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns; sns.set()

## Transformations
- A very basic example is the **rescaling** of our data, which is a requirement for many machine learning algorithms. 
- There exist many different rescaling techniques, and in the following example, we will take a look at a particular method that is commonly called **standardization**.
- Here, we will rescale the data so that each feature is:
 - **centered at zero** (mean = 0);
 - with **unit variance** (standard deviation = 1).

The standardized values are computed via the equation $x_{standardized} = \frac{x - \mu_x}{\sigma_x}$,
where $\mu_x$ is the sample mean and $\sigma_x$ the standard deviation of feature $x$.

In [18]:
#example
a = np.array([1,2,3,4,5])
a_standardized = (a - a.mean()) / a.std()
print(a_standardized)

print("mean:" , a_standardized.mean())
print("std:" , a_standardized.std())

[-1.41421356 -0.70710678  0.          0.70710678  1.41421356]
mean: 0.0
std: 0.9999999999999999
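
The standard deviation in the formula can be computed with two different normalizations, and the choice affects the result slightly. The sketch below compares them on the same array, using NumPy's `ddof` argument: `ddof=0` (the NumPy default, dividing by `n`) and `ddof=1` (dividing by `n - 1`). The variable names `std_population` and `std_sample` are introduced here for illustration only.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

# Population standard deviation (divide by n); this is NumPy's default.
std_population = a.std(ddof=0)
# Sample standard deviation (divide by n - 1).
std_sample = a.std(ddof=1)

print("ddof=0:", std_population)
print("ddof=1:", std_sample)

# Standardizing with the ddof=0 value yields unit standard deviation
# when the result is again measured with ddof=0.
a_standardized = (a - a.mean()) / std_population
print("std after standardizing:", a_standardized.std(ddof=0))
```

The example above relies on the `ddof=0` default, which is also the convention scikit-learn's `StandardScaler` uses.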


### Using sklearn for transformations
Although standardization is a very basic preprocessing procedure, as we've seen in the code snippet above, scikit-learn implements a `StandardScaler` class for this computation.

To get some more practice with scikit-learn's "Transformer" interface, let's start by loading the iris dataset and rescaling it:

In [19]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x , y = iris.data , iris.target
x_train , x_test , y_train , y_test = train_test_split(x , y , test_size=.25 , random_state=2)
print(x_train.shape)

(112, 4)


In [20]:
# The iris dataset is not "centered": it has a non-zero mean, and the standard deviation is different for each feature.
print('mean:' , x_train.mean(axis=0))
print('standard deviation:' , x_train.std(axis=0))

mean: [5.8875     3.05       3.84553571 1.23392857]
standard deviation: [0.81340609 0.44340404 1.73348783 0.7523242 ]


In [21]:
#standardization
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()

In [22]:
#fit()
#As with the classification and regression algorithms, we call fit to learn the model from the data. 
#As this is an unsupervised model, we only pass X, not y. This simply estimates mean and standard deviation.
standard_scaler.fit(x_train)

In [23]:
#transform()
#Now we can rescale our data by applying the transform (not predict) method.
x_train_scaled = standard_scaler.transform(x_train)

In [24]:
print('mean:' , x_train_scaled.mean(axis=0))
print('standard deviation:' , x_train_scaled.std(axis=0))

mean: [1.96990242e-15 3.41393580e-15 5.96744876e-16 4.85722573e-16]
standard deviation: [1. 1. 1. 1.]
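
To make the connection between `fit`, `transform`, and the standardization formula explicit, here is a minimal sketch that reloads the same split and checks the scaler against a manual computation. It assumes the fitted attributes `mean_` and `scale_` of `StandardScaler`; the names `scaler` and `manual` are illustrative and not part of the cells above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=.25, random_state=2)

scaler = StandardScaler().fit(x_train)

# mean_ and scale_ hold the per-feature statistics learned by fit().
manual = (x_train - scaler.mean_) / scaler.scale_

# transform() applies the same formula with the stored statistics.
print(np.allclose(manual, scaler.transform(x_train)))

# The stored statistics match the training mean and (ddof=0) std.
print(np.allclose(scaler.mean_, x_train.mean(axis=0)))
print(np.allclose(scaler.scale_, x_train.std(axis=0)))

# fit_transform() combines both steps for the training data.
print(np.allclose(StandardScaler().fit_transform(x_train),
                  scaler.transform(x_train)))
```

Each check should print `True`, since `transform` only reuses what `fit` estimated.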


In [26]:
# It's important to note that the same transformation is applied to the training and the test set.
# That has the consequence that usually the mean of the test data is not zero after scaling.
x_test_scaled = standard_scaler.transform(x_test)
print('mean test data:' , x_test_scaled.mean(axis=0))
print('standard deviation test data:' , x_test_scaled.std(axis=0))


mean test data: [-0.21433587  0.0652844  -0.19932976 -0.18151769]
standard deviation test data: [1.04018969 0.91559884 1.04366875 1.02620647]
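
To see why the test set is transformed with the training statistics rather than its own, the sketch below contrasts the two approaches on the same split. The second scaler, fitted on the test data, is shown only as a counterexample; the names `x_test_refit` and `sample` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=.25, random_state=2)

# Consistent: statistics come from the training set only.
scaler = StandardScaler().fit(x_train)
x_test_scaled = scaler.transform(x_test)

# Inconsistent (counterexample): a second scaler fitted on the test set.
x_test_refit = StandardScaler().fit_transform(x_test)

print("test mean, training statistics:", x_test_scaled.mean(axis=0))
print("test mean, refit on test data: ", x_test_refit.mean(axis=0))

# The same raw sample ends up at different coordinates under the two
# scalers, so a model trained on x_train would see shifted inputs.
sample = x_test[:1]
print("consistent:  ", scaler.transform(sample))
print("inconsistent:", x_test_refit[:1])
```

Refitting on the test set forces its mean to zero, but it places test samples in a different coordinate system from the one used during training.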


In [None]:
# It is important for the training and test data to be transformed in exactly the same way
# for the following processing steps to make sense of the data, as is illustrated in the figure below.
from figs_folder import plot_scaling
