# Gesture Phase Detection using Unsupervised and Supervised Learning



## 1. Introduction

 ## 1.1 Important Links:
- [Gestuture Phase Segmentation Dataset from UCI Repository](https://archive.ics.uci.edu/ml/datasets/Gesture+Phase+Segmentation)
- [sci-kit learn classes for Quick Reference](http://scikit-learn.org/stable/modules/classes.html)

## 1.2 About the Dataset
   --- Gesture Phase Segmentation consists in a temporal segmentation of gestures, performed by gesture researchers in
       order to preprocess videos for further analysis. The description of gesture phase segmentation problem may be   found in MADEO et al. (2013a) and MADEO et al. (2013b).

   --- The dataset is composed by seven videos recorded using Microsoft Kinect sensor. Three different users were asked to read three comic strips and to tell the stories in front of the sensor. Using Microsoft Kinect, we have obtained:
       (a) a image of each frame, identified by a timestamp; 
       (b) a text file containing positions (coordinates x, y, z) of six articulation points -- left hand, right hand, left wrist, right wrist, head and spine, with each line in the file corresponding to a frame and identified by a timestamp. 
The images enabled a manual segmentation of each file by a specialist, providing a ground truth for classification.

   --- The dataset is organized in 14 files: 7 raw files and 7 processed files, one for each video which compose the dataset. The name of the file refers to each video: the letter corresponds to the user (A, B, C) and the number corresponds to the story (1, 2, 3). The raw files contain the information obtained from Microsoft Kinect, described above. The processed file contains vectorial and scalar velocity and acceleration of left hand, right hand, left wrist, and right wrist. These measures were obtained after normalizing positions of hand and wrists considering the position of head and spine, and using a displacement equal to 3 in order to measure velocity, as described in MADEO et al. (2013c).

### Number of Instances: 

	A1 - 1747 frames
	A2 - 1264 frames
	A3 - 1834 frames
	B1 - 1073 frames
	B3 - 1423 frames
	C1 - 1111 frames
	C3 - 1448 frames

### Number of Attributes: 

	
   - Raw files (*_raw.csv): 18 numeric attributes (double), a timestamp (integer) and a class attribute (nominal).
   - Processed files (*_va3.csv): 32 numeric attributes (double) and a class attribute (nominal).
    
	A feature vector with up to 51 numeric attributes can be generated with the two files mentioned above.

### Attribute Information:

#### Raw files:

   1. lhx - Position of left hand (x coordinate)
   2. lhy - Position of left hand (y coordinate)
   3. lhz - Position of left hand (z coordinate)
   4. rhx - Position of right hand (x coordinate)
   5. rhy - Position of right hand (y coordinate)
   6. rhz - Position of right hand (z coordinate)
   7. hx - Position of head (x coordinate)
   8. hy - Position of head  (y coordinate)
   9. hz - Position of head  (z coordinate)
   10. sx - Position of spine (x coordinate)
   11. sy - Position of spine (y coordinate)
   12. sz - Position of spine (z coordinate)
   13. lwx - Position of left wrist (x coordinate)
   14. lwy - Position of left wrist (y coordinate)
   15. lwz - Position of left wrist (z coordinate)
   16. rwx - Position of right wrist (x coordinate)
   17. rwy - Position of right wrist (y coordinate)
   18. rwz - Position of right wrist (z coordinate)
   19. timestamp - 
   20. phase: 
		-- Rest
		-- Preparation
		-- Stroke
		-- Hold
		-- Retraction

#### Processed files:

   1. Vectorial velocity of left hand (x coordinate)
   2. Vectorial velocity of left hand (y coordinate)
   3. Vectorial velocity of left hand (z coordinate)
   4. Vectorial velocity of right hand (x coordinate)
   5. Vectorial velocity of right hand (y coordinate)
   6. Vectorial velocity of right hand (z coordinate)
   7. Vectorial velocity of left wrist (x coordinate)
   8. Vectorial velocity of left wrist (y coordinate)
   9. Vectorial velocity of left wrist (z coordinate)
   10. Vectorial velocity of right wrist (x coordinate)
   11. Vectorial velocity of right wrist (y coordinate)
   12. Vectorial velocity of right wrist (z coordinate)
   13. Vectorial acceleration of left hand (x coordinate)
   14. Vectorial acceleration of left hand (y coordinate)
   15. Vectorial acceleration of left hand (z coordinate)
   16. Vectorial acceleration of right hand (x coordinate)
   17. Vectorial acceleration of right hand (y coordinate)
   18. Vectorial acceleration of right hand (z coordinate)
   19. Vectorial acceleration of left wrist (x coordinate)
   20. Vectorial acceleration of left wrist (y coordinate)
   21. Vectorial acceleration of left wrist (z coordinate)
   22. Vectorial acceleration of right wrist (x coordinate)
   23. Vectorial acceleration of right wrist (y coordinate)
   24. Vectorial acceleration of right wrist (z coordinate)
   25. Scalar velocity of left hand
   26. Scalar velocity of right hand
   27. Scalar velocity of left wrist
   28. Scalar velocity of right wrist
   29. Scalar velocity of left hand
   30. Scalar velocity of right hand
   31. Scalar velocity of left wrist
   32. Scalar velocity of right wrist
   33. phase:
		-- D (rest position, from portuguese "descanso")
		-- P (preparation)
		-- S (stroke)
		-- H (hold)
		-- R (retraction)


___

## 2. Loading the Dataset

In [1]:
%matplotlib inline
import pandas as pd

import numpy as np

from sklearn.model_selection import train_test_split
from sklearn import model_selection, metrics
from sklearn import preprocessing
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler

import matplotlib
import matplotlib.pyplot as plt

### 2.1 Reading the '.csv' Files

In [6]:
# read .csv from provided dataset
csv_filename01="./Dataset/a1_raw.csv"
csv_filename02="./Dataset/a1_va3.csv"
csv_filename03="./Dataset/a2_raw.csv"
csv_filename04="./Dataset/a2_va3.csv"
csv_filename05="./Dataset/a3_raw.csv"
csv_filename06="./Dataset/a3_va3.csv"

csv_filename07="./Dataset/b1_raw.csv"
csv_filename08="./Dataset/b1_va3.csv"
csv_filename09="./Dataset/b3_raw.csv"
csv_filename10="./Dataset/b3_va3.csv"

csv_filename11="./Dataset/c1_raw.csv"
csv_filename12="./Dataset/c1_va3.csv"
csv_filename13="./Dataset/c3_raw.csv"
csv_filename14="./Dataset/c3_va3.csv"

# df=pd.read_csv(csv_filename,index_col=0)
df1=pd.read_csv("./Dataset/a1_raw.csv" , skiprows=[1,2,3,4])
df2=pd.read_csv(./Dataset/a1_va3.csv")
df3=pd.read_csv(./Dataset/a2_raw.csv , skiprows=[1,2,3,4])
df4=pd.read_csv(./Dataset/a2_va3.csv)
df5=pd.read_csv(./Dataset/a3_raw.csv , skiprows=[1,2,3,4])
df6=pd.read_csv(./Dataset/a3_va3.csv)
df7=pd.read_csv(./Dataset/b1_raw.csv , skiprows=[1,2,3,4])
df8=pd.read_csv(./Dataset/b1_va3.csv)
df9=pd.read_csv(./Dataset/b3_raw.csv , skiprows=[1,2,3,4])
df10=pd.read_csv(csv_filename10)
df11=pd.read_csv(csv_filename11 , skiprows=[1,2,3,4])
df12=pd.read_csv(csv_filename12)
df13=pd.read_csv(csv_filename13 , skiprows=[1,2,3,4])
df14=pd.read_csv(csv_filename14)

### 2.2 Removing the 'timestamp' & 'phase' labels from unprocessed files

In [3]:
df1.drop('timestamp',axis=1,inplace=True)
df1.drop('phase',axis=1,inplace=True)
df3.drop('timestamp',axis=1,inplace=True)
df3.drop('phase',axis=1,inplace=True)
df5.drop('timestamp',axis=1,inplace=True)
df5.drop('phase',axis=1,inplace=True)
df7.drop('timestamp',axis=1,inplace=True)
df7.drop('phase',axis=1,inplace=True)
df9.drop('timestamp',axis=1,inplace=True)
df9.drop('phase',axis=1,inplace=True)
df11.drop('timestamp',axis=1,inplace=True)
df11.drop('phase',axis=1,inplace=True)
df13.drop('timestamp',axis=1,inplace=True)
df13.drop('phase',axis=1,inplace=True)

### 2.3 Visualising the Table 1

In [4]:
#df1.head()

In [5]:
#df1.shape

### 2.4 Visualising the Table 2

In [6]:
#df2.head()

In [7]:
#df2.shape

In [8]:
#df2.Phase.unique()

Visualising the Column Labels in the two tables.

In [9]:
#df1.columns

In [10]:
#df2.columns

### 2.5 Renaming 'Phase' Column for convinience

In [11]:
df2.rename(columns={'Phase': 'phase'}, inplace=True)
df4.rename(columns={'Phase': 'phase'}, inplace=True)
df6.rename(columns={'Phase': 'phase'}, inplace=True)
df8.rename(columns={'Phase': 'phase'}, inplace=True)
df10.rename(columns={'Phase': 'phase'}, inplace=True)
df12.rename(columns={'Phase': 'phase'}, inplace=True)
df14.rename(columns={'Phase': 'phase'}, inplace=True)

___

### 2.6 Concatenating the Dataframes 

In [12]:
p1 = pd.concat([df1,df2],axis=1)
p2 = pd.concat([df3,df4],axis=1)
p3 = pd.concat([df5,df6],axis=1)
p4 = pd.concat([df7,df8],axis=1)
p5 = pd.concat([df9,df10],axis=1)
p6 = pd.concat([df11,df12],axis=1)
p7 = pd.concat([df13,df14],axis=1)

In [13]:
df= pd.concat([p1,p2,p3,p4,p5,p6,p7])

### 2.7 Encoding Phase Labels and Estimating number of instances of Differrent Labels

In [14]:
df.phase.unique()

array(['D', 'P', 'S', 'H', 'R'], dtype=object)

In [15]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['phase'] = le.fit_transform(df['phase'])

In [16]:
df.groupby('phase').count()

Unnamed: 0_level_0,lhx,lhy,lhz,rhx,rhy,rhz,hx,hy,hz,sx,...,23,24,25,26,27,28,29,30,31,32
phase,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,2741,2741,2741,2741,2741,2741,2741,2741,2741,2741,...,2741,2741,2741,2741,2741,2741,2741,2741,2741,2741
1,998,998,998,998,998,998,998,998,998,998,...,998,998,998,998,998,998,998,998,998,998
2,2097,2097,2097,2097,2097,2097,2097,2097,2097,2097,...,2097,2097,2097,2097,2097,2097,2097,2097,2097,2097
3,1087,1087,1087,1087,1087,1087,1087,1087,1087,1087,...,1087,1087,1087,1087,1087,1087,1087,1087,1087,1087
4,2950,2950,2950,2950,2950,2950,2950,2950,2950,2950,...,2950,2950,2950,2950,2950,2950,2950,2950,2950,2950


In [17]:
df.phase.unique()

array([0, 2, 4, 1, 3])

___

### 2.8 Randomising the Data before Splitting

#### Before:

In [18]:
df.head()

Unnamed: 0,lhx,lhy,lhz,rhx,rhy,rhz,hx,hy,hz,sx,...,24,25,26,27,28,29,30,31,32,phase
0,5.00316,4.27853,1.542866,4.985812,4.182155,1.52033,5.037557,1.619226,1.778925,5.052367,...,0.00018808,0.005133,0.0104,0.000646,0.007871,0.004631,0.000963,9.2e-05,0.000438,0
1,5.064488,4.290401,1.542146,4.955739,4.163175,1.511876,5.037724,1.618397,1.779722,5.045395,...,-7.5e-07,0.005093,0.005756,0.000573,0.003459,0.00073,0.000332,1.2e-05,0.000433,0
2,5.067825,4.290883,1.542058,4.928284,4.157637,1.511306,5.038332,1.618043,1.78008,5.045374,...,-3.92e-05,0.002406,0.003279,0.000452,0.003261,0.002412,0.000852,4.2e-05,0.000202,0
3,5.070332,4.290677,1.541985,4.916637,4.151067,1.51051,5.038742,1.618044,1.780114,5.045767,...,-3.184e-05,0.001416,0.001334,0.000493,0.001358,0.000313,0.000611,2.9e-05,0.000596,0
4,5.071611,4.290927,1.542046,4.906132,4.143034,1.509449,5.042224,1.618561,1.780209,5.047422,...,-2.015e-05,0.000158,0.001709,0.000325,0.001713,0.000203,6.9e-05,3.8e-05,6.9e-05,0


In [19]:
df = df.sample(frac=1)

#### After:

In [20]:
df.head()

Unnamed: 0,lhx,lhy,lhz,rhx,rhy,rhz,hx,hy,hz,sx,...,24,25,26,27,28,29,30,31,32,phase
1087,6.171766,3.673885,1.540934,5.183923,2.342675,1.46296,4.993442,1.630638,1.758757,4.989823,...,0.00048882,0.001908,0.019398,0.00102,0.019363,0.000933,0.009907,0.000592,0.009903,4
663,1.598923,4.018676,2.12724,3.447668,3.961191,2.194116,2.406612,0.866565,2.419828,2.433447,...,2.37e-05,0.002486,0.00164,0.002337,0.000369,0.000438,0.000202,0.000414,4.8e-05,2
167,1.900734,5.139158,1.966193,3.518129,3.974145,2.017433,2.873748,1.309875,2.223579,2.62221,...,-1.2e-07,0.016789,0.001238,0.014379,0.000946,0.00062,3.1e-05,0.000143,1.7e-05,3
603,1.877504,4.600049,2.244735,2.388666,4.04745,2.220921,2.251689,0.814129,2.41751,2.389094,...,-1.06e-05,0.004462,0.00282,0.003729,0.002076,0.000187,0.000232,2.6e-05,0.000122,0
379,5.368234,3.68643,1.511358,5.200821,3.733999,1.513813,5.448239,1.473407,1.781345,5.476373,...,-3.054e-05,0.004301,0.004158,0.000762,0.00255,0.005362,0.003775,0.000887,6.7e-05,0


### 2.9 Extracting the Feautre & Label Vector + Splitting into Test & Train (60:40)

In [21]:
cols = list(df.columns)
features = cols
features.remove('phase')

In [22]:
len(features)

50

In [23]:
X = df[features]
y = df['phase']
X.shape

(9873, 50)

In [25]:
# split dataset to 60% training and 40% testing
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.4, random_state=0)

### 2.10 Normalize the Dataset for Easier Parameter Selection

In [None]:
X_train = StandardScaler().fit_transform(X_train)
X_test = StandardScaler().fit_transform(X_test)

In [None]:
print(X_train.shape, y_train.shape)

___

## 3. Clustering- an Unsupervised Learning Task

### Link: 
1. Sci-Kit Library on Clustering Methods: (http://scikit-learn.org/stable/modules/clustering.html#clustering)
2. Sck-Kit Learn Metrics ()
3. Comparing different clustering algorithms on Toy Datasets. (http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py)

In [None]:
from sklearn import cluster
from sklearn import metrics

### Overview:
Eventhough, we are provided with a labelled dataset for our analysis. We would for first part of our implementation, treat it as a unlabelled dataset and try to run clustering algorithms to find out the distinct group of data points namely:
1. PCA
2. Agglomerative Clustering
3. KMeans
4. Affinity Propogation
5. MeanShift
6. Mixture of Gaussian Models

A similar comparartive study of various unsupervised classification algorithms on the Toy Dataset is provided in the link below.

http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html#sphx-glr-auto-examples-cluster-plot-cluster-comparison-py

### 3.1 PCA - Principal Component Analysis

It is a precursor step to any analysis that we may subject to our dataset. The above datset has a highly dimensional feature space consisting of 50 fetures. In such a high-dimensional space, Euclidean distances tend to become inflated and meaningless. This can severely impact our algorithms performance. Such a situation demands more data to train our model and this problem is called the 'Curse of Dimensionality.'

The PCA algorithm solves this problem by finding out the features that explain the maximum variance. So, instead of training our models over 50 features we will be training them over 5 features that explain the maximum variance.

In [None]:
len(features)

In [None]:
# Apply PCA with the same number of dimensions as variables in the dataset
from sklearn.decomposition import PCA
pca = PCA(n_components=50)
#pca = PCA(n_components=18)
pca.fit(X)

# Print the components and the amount of variance in the data contained in each dimension

#print(pca.components_)
#print(pca.explained_variance_)

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(list(pca.explained_variance_ratio_),'-o')
plt.title('Explained variance ratio as function of PCA components')
plt.ylabel('Explained variance ratio')
plt.xlabel('Component')
plt.show()

### 3.1.1 Create 'reduced_data' - a Feature Dataframe containing PCA components explaining Maximum Variance

In [None]:
# First we reduce the data to two dimensions using PCA to capture variation
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(X)
#print(reduced_data[:10])  # print upto 10 elements

In [None]:
reduced_data.shape

### 3.1.2 Applying KMeans on 'reduced_data'

In [None]:
kmeans = cluster.KMeans(n_clusters=5)
clusters = kmeans.fit(reduced_data)
print(clusters)

In [None]:
# Plot the decision boundary by building a mesh grid to populate a graph.
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1

hx = (x_max-x_min)/1000.
hy = (y_max-y_min)/1000.

xx, yy = np.meshgrid(np.arange(x_min, x_max, hx), np.arange(y_min, y_max, hy))

# Obtain labels for each point in mesh. Use last trained model.
Z = clusters.predict(np.c_[xx.ravel(), yy.ravel()])

In [None]:
# Find the centroids for KMeans or the cluster means for GMM 

centroids = kmeans.cluster_centers_

#print('*** K MEANS CENTROIDS ***')
#print(centroids)

# TRANSFORM DATA BACK TO ORIGINAL SPACE FOR ANSWERING 7

#print('*** CENTROIDS TRANSFERED TO ORIGINAL SPACE ***')
#print(pca.inverse_transform(centroids))

In [None]:
# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='Nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)
plt.title('Clustering on the seeds dataset (PCA-reduced data)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

____

### 3.2 Applying Agglomerative Clustering 

#### Reference Link: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

Other Links:
- [What is a Connectivity Matrix?](https://people.hofstra.edu/geotrans/eng/methods/connectivitymatrix.html)
- [What is Precision & Recall? What is a Confusion Matrix?](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/)
- [How to compute F-Score?](https://en.wikipedia.org/wiki/F1_score)
- [What is a ROC Curve? How is it useful?](http://www.dataschool.io/roc-curves-and-auc-explained/)
- [What is Cohen's Kappa?](https://en.wikipedia.org/wiki/Cohen's_kappa)

In [None]:
ac = cluster.AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='complete')
labels = ac.fit_predict(X)
print('Cluster labels: %s' % labels)

In [None]:
np.shape(labels)

___

### 3.3 K Means Clustering

#### Reference Link: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans

Other Links:



In [None]:
clf = cluster.KMeans(init='k-means++', n_clusters=5, random_state=5)
clf.fit(X_train)
print(clf.labels_.shape)
print(clf.labels_)

In [None]:
# Predict clusters on testing data
y_pred = clf.predict(X_test)

In [None]:
print ("Addjusted rand score:{:.2}".format(metrics.adjusted_rand_score(y_test, y_pred)))
print ("Homogeneity score:{:.2} ".format(metrics.homogeneity_score(y_test, y_pred)) )
print ("Completeness score: {:.2} ".format(metrics.completeness_score(y_test, y_pred)))
print ("Confusion matrix")
print (metrics.confusion_matrix(y_test, y_pred))

____

### 3.4 Affinity Propogation

#### Reference Link: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation
Other Links

In [None]:
# Affinity propagation
aff = cluster.AffinityPropagation()
aff.fit(X_train)
print(aff.cluster_centers_indices_.shape)

In [None]:
y_pred = aff.predict(X_test)

In [None]:
print ("Addjusted rand score:{:.2}".format(metrics.adjusted_rand_score(y_test, y_pred)))
print ("Homogeneity score:{:.2} ".format(metrics.homogeneity_score(y_test, y_pred)))
print ("Completeness score: {:.2} ".format(metrics.completeness_score(y_test, y_pred)))
print ("Confusion matrix")
print (metrics.confusion_matrix(y_test, y_pred))

____

### 3.5 MeanShift

#### Reference Link: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html

Other Links:

In [None]:
ms = cluster.MeanShift()
ms.fit(X_train)

In [None]:
y_pred = ms.predict(X_test)

In [None]:
print "Addjusted rand score:{:.2}".format(metrics.adjusted_rand_score(y_test, y_pred))
print "Homogeneity score:{:.2} ".format(metrics.homogeneity_score(y_test, y_pred)) 
print "Completeness score: {:.2} ".format(metrics.completeness_score(y_test, y_pred))
print "Confusion matrix"
print metrics.confusion_matrix(y_test, y_pred)

___

### 3.6 Mixture of Guassian Models

#### Reference Link: http://scikit-learn.org/stable/modules/mixture.html

Other Links:

In [None]:
from sklearn import mixture

In [None]:
# Define a heldout dataset to estimate covariance type
X_train_heldout, X_test_heldout, y_train_heldout, y_test_heldout = train_test_split(
        X_train, y_train,test_size=0.25, random_state=42)

for covariance_type in ['spherical','tied','diag','full']:
    gm = mixture.GaussianMixture(n_components=100, covariance_type=covariance_type, random_state=42, n_init=5)
    gm.fit(X_train_heldout)
    y_pred=gm.predict(X_test_heldout)
    print "Adjusted rand score for covariance={}:{:.2}".format(covariance_type, 
                                                               metrics.adjusted_rand_score(y_test_heldout, y_pred))


___

### 3.7 KMeans vs. Agglomerative Clustering

In [None]:
pca = PCA(n_components=2)
X1 = pca.fit_transform(X)

In [None]:
from matplotlib.pyplot import cm 
n=6
c = []
color=iter(cm.rainbow(np.linspace(0,1,n)))
for i in range(n):
    c.append(next(color))

#**************************************************************************************************************#

n = 5
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(8,6))

#**************************************************************************************************************#

km = KMeans(n_clusters= n , random_state=0)
y_km = km.fit_predict(X1)

for i in range(n):
    ax1.scatter(X1[y_km==i,0], X1[y_km==i,1], c=c[i], marker='o', s=40, label='cluster{}'.format(i))   
ax1.set_title('K-means clustering')

#**************************************************************************************************************#

ac = AgglomerativeClustering(n_clusters=n, affinity='euclidean', linkage='complete')
y_ac = ac.fit_predict(X1)

for i in range(n):
    ax2.scatter(X1[y_ac==i,0], X1[y_ac==i,1], c=c[i], marker='o', s=40, label='cluster{}'.format(i))
ax2.set_title('Agglomerative clustering')

#**************************************************************************************************************#

# Put a legend below current axis
plt.legend()
    
plt.tight_layout()
#plt.savefig('./figures/kmeans_and_ac.png', dpi=300)
plt.show()

____

# 4. Classification- Supervised Learning Task

## 4.0 Introduction

In [26]:
import os
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn import model_selection, metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA

from time import time
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score , classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score, classification_report

In [27]:
# First we reduce the data to two dimensions using PCA to capture variation
#pca = PCA(n_components=10)
#X = pca.fit_transform(X)
#print(reduced_data[:10])  # print upto 10 elements

In [28]:
# split dataset to 60% training and 40% testing
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.4, random_state=0)

In [29]:
print (X_train.shape, y_train.shape,X_test.shape, y_test.shape)

((5923, 50), (5923,), (3950, 50), (3950,))


## 4.0.2 Cross Validation Parameter

Link: (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)

Point to Note:
- #### sklearn.model_selection.cross_val_score : (estimator,X,y)

## 4.0.3 Feature Scaling
Link: (https://www.analyticsvidhya.com/blog/2016/07/practical-guide-data-preprocessing-python-scikit-learn/)

In [30]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler

#1. MinMax Scaling
min_max=MinMaxScaler()
X_train_minmax=min_max.fit_transform(X_train)
X_test_minmax =min_max.fit_transform(X_test)

#2. Scale
X_train_scale =scale(X_train)
X_test_scale  =scale(X_test)

#3. Standard Scaler
X_train_Stdscaler = StandardScaler().fit_transform(X_train)
X_test_Stdscaler = StandardScaler().fit_transform(X_test)

___

## 4.1 Decision Tree accuracy and time elapsed caculation

In [31]:
dt = DecisionTreeClassifier(min_samples_split=20,random_state=99)
# dt = DecisionTreeClassifier(min_samples_split=20,max_depth=5,random_state=99)

In [32]:
tt0=time()

print ("cross result========")
scores = model_selection.cross_val_score(dt, X,y, cv=5)
print (scores)
print (scores.mean())
tt1=time()
print ("time elapsed: ", tt1-tt0)
print ("\n")

[ 0.75316136  0.77884615  0.75025329  0.73796249  0.77546883]
0.75913842502
('time elapsed: ', 3.512260913848877)




## 4.1.1 Implementation 1

In [33]:
t0=time()
print ("DecisionTree: Case-1: Normal Unprocessed Data")

clf_dt1= dt.fit(X_train,y_train)

print ("Acurracy: ", clf_dt1.score(X_test,y_test))
t1=time()
print ("time elapsed: ", t1-t0)

DecisionTree: Case-1: Normal Unprocessed Data
('Acurracy: ', 0.71772151898734182)
('time elapsed: ', 0.4864768981933594)


## 4.1.2 Implementation 2

In [34]:
t0=time()
print ("DecisionTree: Case-2: MinMax Feature Scaling")

clf_dt2= dt.fit(X_train_minmax,y_train)

print ("Acurracy: ", clf_dt2.score(X_test_minmax,y_test))
t1=time()
print ("time elapsed: ", t1-t0)

DecisionTree: Case-2: MinMax Feature Scaling
('Acurracy: ', 0.40962025316455697)
('time elapsed: ', 0.518604040145874)


## 4.1.3 Implementation 3

In [35]:
t0=time()
print ("DecisionTree: Case-3: Scale")

clf_dt3= dt.fit(X_train_scale,y_train)

print ("Acurracy: ", clf_dt3.score(X_test_scale,y_test))
t1=time()
print ("time elapsed: ", t1-t0)

DecisionTree: Case-3: Scale
('Acurracy: ', 0.66835443037974684)
('time elapsed: ', 0.5103750228881836)


## 4.1.4 Implementation 4

In [36]:
t0=time()
print ("DecisionTree: Case-4: Standard Scaler")

clf_dt4= dt.fit(X_train_Stdscaler,y_train)

print ("Acurracy: ", clf_dt4.score(X_test_Stdscaler,y_test))
t1=time()
print ("time elapsed: ", t1-t0)

DecisionTree: Case-4: Standard Scaler
('Acurracy: ', 0.66835443037974684)
('time elapsed: ', 0.48151302337646484)


___

## 4.2 Random Forest accuracy and time elapsed caculation

In [37]:
rf = RandomForestClassifier(n_estimators=100,n_jobs=-1)

In [38]:
tt2=time()
print ("cross result========")
scores = model_selection.cross_val_score(rf, X,y, cv=5)
print (scores)
print (scores.mean())
tt3=time()
print ("time elapsed: ", tt3-tt2)
print ("\n")

[ 0.90035407  0.90080972  0.9047619   0.90978206  0.91535732]
0.906213014968
('time elapsed: ', 13.290452003479004)




## 4.2.1 Implementation 1

In [None]:
t2=time()
print ("RandomForest: Case-1 : Normal")
clf_rf1 = rf.fit(X_train,y_train)
print ("Acurracy: ", clf_rf1.score(X_test,y_test))
t3=time()

print ("time elapsed: ", t3-t2)

## 4.2.2 Implementation 2

In [None]:
t2=time()
print ("RandomForest: Case-2 : MinMax Feature Scaling")
clf_rf2 = rf.fit(X_train_minmax,y_train)
print ("Acurracy: ", clf_rf2.score(X_test_minmax,y_test))
t3=time()
print ("time elapsed: ", t3-t2)

## 4.2.3 Implementation 3

In [None]:
t2=time()
print ("RandomForest: Case-3 : Scale")
clf_rf3 = rf.fit(X_train_scale,y_train)
print ("Acurracy: ", clf_rf3.score(X_test_scale,y_test))
t3=time()
print ("time elapsed: ", t3-t2)

## 4.2.4 Implementation 4

In [None]:
t2=time()
print ("RandomForest: Case-2 : MinMax Feature Scaling")
clf_rf4 = rf.fit(X_train_Stdscaler,y_train)
print ("Acurracy: ", clf_rf4.score(X_test_Stdscaler,y_test))
t3=time()
print ("time elapsed: ", t3-t2)

___

## 4.3 Naive Bayes accuracy and time elapsed caculation


In [None]:
nb = BernoulliNB()

In [None]:
tt4=time()
print ("cross result========")
scores = model_selection.cross_val_score(nb, X,y, cv=5)
print (scores)
print (scores.mean())
tt5=time()
print ("time elapsed: ", tt5-tt4)
print ("\n")

In [None]:
t4=time()
print ("NaiveBayes")
clf_nb=nb.fit(X_train,y_train)
print ("Acurracy: ", clf_nb.score(X_test,y_test))
t5=time()
print ("time elapsed: ", t5-t4)

___

## 4.4 KNN accuracy and time elapsed caculation

In [None]:
t6=time()
print ("KNN")
# knn = KNeighborsClassifier(n_neighbors=3)
knn = KNeighborsClassifier()
clf_knn=knn.fit(X_train, y_train)
print ("Acurracy: ", clf_knn.score(X_test,y_test) )
t7=time()
print ("time elapsed: ", t7-t6)

In [None]:
tt6=time()
print ("cross result========")
scores = model_selection.cross_val_score(clf_knn, X,y, cv=5)
print (scores)
print (scores.mean())
tt7=time()
print ("time elapsed: ", tt7-tt6)
print ("\n")

___

## 4.5 SVM Accuracy and Time Elapsed Calculation

## 4.5.1 SVM with a Linear Kernel 

In [None]:
t7=time()
print ("SVM")

svc = SVC()
clf_svc=svc.fit(X_train_minmax, y_train)
print ("Acurracy: ", clf_svc.score(X_test_minmax,y_test) )
t8=time()
print ("time elapsed: ", t8-t7)

In [None]:
tt7=time()
print ("cross result========")
scores = model_selection.cross_val_score(clf_svc, X,y, cv=5)
print (scores)
print (scores.mean())
tt8=time()
print ("time elapsed: ", tt8-tt7)
print ("\n")

## 4.5.2 SVM with Multiple Kernels 

In [42]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn import model_selection

svc = SVC()

parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10, 100]}

grid = model_selection.GridSearchCV(svc, parameters, n_jobs=-1, verbose=1, scoring='accuracy')


grid.fit(X_train, y_train)

print ('Best score: %0.3f' % grid.best_score_)

print ('Best parameters set:')
best_parameters = grid.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print ('\t%s: %r' % (param_name, best_parameters[param_name]))
    
predictions = grid.predict(X_test)
print (classification_report(y_test, predictions))

Fitting 3 folds for each of 6 candidates, totalling 18 fits


[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:   35.8s finished


Best score: 0.698
Best parameters set:
	C: 100
	kernel: 'rbf'
             precision    recall  f1-score   support

          0       0.80      0.98      0.88      1126
          1       0.61      0.48      0.54       410
          2       0.58      0.55      0.56       847
          3       0.78      0.31      0.45       420
          4       0.74      0.83      0.78      1147

avg / total       0.71      0.72      0.70      3950



In [40]:
pipeline = Pipeline([
    ('clf', SVC(kernel='rbf', gamma=0.01, C=100))
])

parameters = {
    'clf__gamma': (0.01, 0.03, 0.1, 0.3, 1),
    'clf__C': (0.1, 0.3, 1, 3, 10, 30),
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, scoring='accuracy')

grid_search.fit(X_train, y_train)

print ('Best score: %0.3f' % grid_search.best_score_)

print ('Best parameters set:')
best_parameters = grid_search.best_estimator_.get_params()

for param_name in sorted(parameters.keys()):
    print ('\t%s: %r' % (param_name, best_parameters[param_name]))
    
predictions = grid_search.predict(X_test)
print (classification_report(y_test, predictions))

Fitting 3 folds for each of 30 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   57.7s
[Parallel(n_jobs=-1)]: Done  90 out of  90 | elapsed:  1.8min finished


Best score: 0.848
Best parameters set:
	clf__C: 30
	clf__gamma: 1
             precision    recall  f1-score   support

          0       0.90      0.98      0.94      1126
          1       0.81      0.86      0.83       410
          2       0.85      0.79      0.82       847
          3       0.88      0.75      0.81       420
          4       0.93      0.92      0.93      1147

avg / total       0.89      0.89      0.88      3950



___

In [None]:
clf_gb = ske.GradientBoostingClassifier(n_estimators=50)
test_classifier(clf_gb)

___

## 4.6 Leveraging weak learners via adaptive boosting

In [None]:
from sklearn.ensemble import AdaBoostClassifier

tree = DecisionTreeClassifier(criterion='entropy', 
                              max_depth=1)

ada = AdaBoostClassifier(base_estimator=tree,
                         n_estimators=500, 
                         learning_rate=0.1,
                         random_state=0)

tree = tree.fit(X_train, y_train)
y_train_pred = tree.predict(X_train)
y_test_pred = tree.predict(X_test)

tree_train = accuracy_score(y_train, y_train_pred)
tree_test = accuracy_score(y_test, y_test_pred)
print('Decision tree train/test accuracies %.3f/%.3f'
      % (tree_train, tree_test))

ada = ada.fit(X_train, y_train)
y_train_pred = ada.predict(X_train)
y_test_pred = ada.predict(X_test)

ada_train = accuracy_score(y_train, y_train_pred) 
ada_test = accuracy_score(y_test, y_test_pred) 
print('AdaBoost train/test accuracies %.3f/%.3f'
      % (ada_train, ada_test))

## 5. Important Resources

Complete List of all
http://scikit-learn.org/stable/modules/classes.html

___