# Python machine learning: unsupervised learning
In unsupervised learning there is no labeled data, i.e. no target variable Y; we have only the features (X). The goal is to see whether the data can be grouped using X alone. For practice only, we will use Y to check how well the model works.

**sklearn flows**
![sklearn](sklearn_flows.png)

## Load the UCI IRIS dataset and prepare the dataset for machine learning
Data description is [here](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names)

In [98]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
colnames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species']
iris = pd.read_csv(url, names=colnames)

In [99]:
X = iris[['sepal-length', 'sepal-width', 'petal-length', 'petal-width']].values
Y = iris[['species']].values                ##Using Y for metric reporting only

In [100]:
from sklearn import model_selection
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.2, random_state=7)

## Perform machine learning using a number of algorithms

### Clustering with K Means

Clusters are groups of data points. Each cluster is centered on a point called a centroid. For a given number of clusters, K-means places the centroids so as to minimize the sum of squared distances between each point and its nearest centroid.
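To make the centroid idea concrete, here is a minimal NumPy sketch of the basic K-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is an illustration only, not scikit-learn's implementation: the function names `nearest_centroid` and `kmeans_sketch`, the plain random initialisation, the fixed iteration count, and the toy data are all assumptions made for this example.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Index of the closest centroid for every row of X (Euclidean distance)."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Toy K-means loop: alternate the assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random
    # (plain random init, not the k-means++ scheme).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = nearest_centroid(X, centroids)        # assignment step
        for j in range(k):                             # update step
            if np.any(labels == j):                    # skip empty clusters
                centroids[j] = X[labels == j].mean(axis=0)
    labels = nearest_centroid(X, centroids)            # final assignment
    # Within-cluster sum of squared distances: the quantity being minimized.
    inertia = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, inertia

# Two well-separated, made-up groups of 2-D points.
toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids, inertia = kmeans_sketch(toy, k=2)
print(labels, inertia)
```

Which group ends up as cluster 0 and which as cluster 1 depends on the starting centroids; only the grouping itself is meaningful.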

In [101]:
from sklearn import cluster
##since we already know there are 3 classes, we set n_clusters=3; note that random_state has a significant impact on the result.
KMC = cluster.KMeans(init='k-means++', n_clusters=3, random_state=42)
KMC.fit(X_train)
KMC_pred = KMC.predict(X_test)

In [102]:
##show the number of centroids and X features
KMC.cluster_centers_.shape

(3L, 4L)

In [103]:
print(KMC_pred[:9])

[1 1 0 1 1 0 2 1 0]


Notice that the predicted classes are integers representing the three clusters (0, 1, 2). To see how well this model works, we compare KMC_pred with Y_test. Because Y_test uses strings while KMC_pred uses integers, we first need to convert Y_test into integers. Keep in mind that K-means numbers its clusters arbitrarily, so a cluster id is not guaranteed to match the integer code of the corresponding species; in this run they happen to line up.

In [104]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(Y_test.ravel())                  ##le learns the distinct class labels present in Y_test
Y_test_orig = Y_test                    ##remember the original Y_test (string labels)
Y_test = le.transform(Y_test.ravel())   ##le then maps each string label to its integer code
le.classes_                             ##optional, to see which classes were found, in encoded order

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
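The order of `le.classes_` is what determines the integer codes: the label at position 0 becomes 0, and so on. As a rough illustration of that mapping, the sketch below reproduces it with plain NumPy on a few made-up labels. It assumes the encoder assigns codes in sorted label order, which matches the output above; it is not a description of the encoder's internals.

```python
import numpy as np

# Made-up labels in the same style as the species column.
species = np.array(['Iris-virginica', 'Iris-setosa',
                    'Iris-versicolor', 'Iris-setosa'])

# np.unique returns the sorted distinct labels; return_inverse gives, for each
# element, its index in that sorted array, i.e. one integer code per label.
classes, codes = np.unique(species, return_inverse=True)

# Indexing the classes with the codes recovers the original strings,
# which is what an inverse transform has to do.
decoded = classes[codes]

print(classes)
print(codes)
```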

We can compare the original Y_test and the transformed Y_test side by side to check that they correspond:

In [107]:
list(zip(Y_test_orig, Y_test))[:9]

[(array(['Iris-virginica'], dtype=object), 2),
 (array(['Iris-versicolor'], dtype=object), 1),
 (array(['Iris-setosa'], dtype=object), 0),
 (array(['Iris-versicolor'], dtype=object), 1),
 (array(['Iris-virginica'], dtype=object), 2),
 (array(['Iris-setosa'], dtype=object), 0),
 (array(['Iris-versicolor'], dtype=object), 1),
 (array(['Iris-versicolor'], dtype=object), 1),
 (array(['Iris-setosa'], dtype=object), 0)]

A couple of notes on related transformations (for information only):

+ One-hot encoding: `pd.get_dummies(Y_test)` generates a one-hot dataframe. How to use that dataframe for a confusion matrix is still to be determined.
+ NumPy array to list: `Y1 = Y_test.tolist()`
+ List to NumPy array: `Y2 = np.asarray(Y1)` (after `import numpy as np`)
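On the open question of using one-hot labels with a confusion matrix, one possible approach is to convert them back to integer codes first, since each one-hot row has a single 1 whose position is the class. The sketch below shows the round trip with NumPy on made-up codes; `codes`, `n_classes`, `one_hot`, and `recovered` are names chosen for this example, and for a one-hot dataframe the same `argmax` idea would presumably be applied to its underlying array.

```python
import numpy as np

codes = np.array([2, 1, 0, 1, 2])   # made-up integer class labels
n_classes = 3

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing it with the codes yields one row per sample.
one_hot = np.eye(n_classes, dtype=int)[codes]

# argmax along each row returns the position of the single 1,
# recovering the integer labels that a confusion matrix expects.
recovered = one_hot.argmax(axis=1)

print(one_hot)
print(recovered)
```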

In [108]:
from sklearn import metrics
metrics.confusion_matrix(Y_test, KMC_pred)

array([[ 7,  0,  0],
       [ 0, 10,  2],
       [ 0,  5,  6]])
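The diagonal of this matrix is only meaningful when cluster ids coincide with the encoded class ids, which K-means does not guarantee. When they do not coincide, one simple option is to relabel each cluster with the most common true class among its members. The sketch below does this on made-up arrays; `y_true`, `y_pred`, and `mapping` are names chosen for this example, and majority voting is just one heuristic (it can send two clusters to the same class).

```python
import numpy as np

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])   # made-up encoded classes
y_pred = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])   # made-up cluster ids, permuted

mapping = {}
for cluster_id in np.unique(y_pred):
    members = y_true[y_pred == cluster_id]
    # Most frequent true class among the points assigned to this cluster.
    mapping[int(cluster_id)] = int(np.bincount(members).argmax())

# Replace every cluster id by the class it was mapped to.
remapped = np.array([mapping[int(c)] for c in y_pred])
agreement = float((remapped == y_true).mean())

print(mapping)
print(remapped, agreement)
```

After remapping, a confusion matrix computed on `remapped` has its correct assignments on the diagonal regardless of how the clusters were originally numbered.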

How to interpret the confusion matrix, redrawn below as a table (rows are true classes, columns are predicted classes):

| True \ Predicted | 0 | 1 | 2 |
| --- | --- | --- | --- |
| 0 | 7 | 0 | 0 |
| 1 | 0 | 10 | 2 |
| 2 | 0 | 5 | 6 |

The table has one row and one column per class of Y; in this case there are 3 classes (0, 1, 2). Predicted values run across the top and true values down the side. Off-diagonal entries are errors. Total cases = 30.
+ Class 0: all 7 cases are predicted correctly as 0, so 100% of this class is recovered.
+ Class 1: 10 cases are predicted correctly (true = 1, predicted = 1) and 2 cases are predicted as class 2, so 10/12 (about 83%) are correct.
+ Class 2 is the worst case: of 11 cases, 6 are correctly predicted as 2 while 5 are predicted as 1, so 6/11 (about 55%) are correct.
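The per-class figures above can be checked directly from the confusion matrix. The sketch below re-enters the reported matrix by hand and divides the diagonal by the row totals; the variable names are chosen for this example, and the explicit float conversion is there so the division is not truncated to integers.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[7, 0, 0],
               [0, 10, 2],
               [0, 5, 6]])

correct = np.diag(cm).astype(float)      # correctly assigned cases per class
per_class = correct / cm.sum(axis=1)     # fraction of each true class recovered
overall = correct.sum() / cm.sum()       # fraction of all 30 cases on the diagonal

print(per_class)
print(overall)
```

The overall figure is (7 + 10 + 6) / 30 = 23/30, so roughly 77% of the test cases land on the diagonal.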