__Dataset__:
- Classification of hand gestures based on sensor recordings of muscle movements.
- Data source: UCI Machine Learning Repository

__Project Goal__:
1. Understand the evaluation metrics used for classification.
2. Compare the performance of different classification methods.

__Methods Used__:
1. [Accuracy](#accuracy), [Confusion matrix](#confusion_matrix), [Classification report](#precision)
2. Logistic Regression, KNN, Gradient Boosted trees


In [1]:
import os
import pandas as pd

In [2]:
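data = pd.DataFrame()
# Walk the data directory (path left blank here) and append every recording file
for root, dirs, files in os.walk(r''):
    for file in files:
        # Skip the README; every other file is a tab-separated recording
        if file != 'README.txt':
            data_read = pd.read_csv(os.path.join(root, file), sep='\t', header=None, index_col=None, skiprows=1)
            print('Appending file', file)
            data = pd.concat([data, data_read])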

Appending file 1_raw_data_13-12_22.03.16.txt
Appending file 2_raw_data_13-13_22.03.16.txt
Appending file 1_raw_data_14-19_22.03.16.txt
Appending file 2_raw_data_14-21_22.03.16.txt
Appending file 1_raw_data_09-32_11.04.16.txt
Appending file 2_raw_data_09-34_11.04.16.txt
Appending file 1_raw_data_18-02_24.04.16.txt
Appending file 2_raw_data_18-03_24.04.16.txt
Appending file 1_raw_data_10-28_30.03.16.txt
Appending file 2_raw_data_10-29_30.03.16.txt
Appending file 1_raw_data_10-38_11.04.16.txt
Appending file 2_raw_data_10-40_11.04.16.txt
Appending file 1_raw_data_18-48_22.03.16.txt
Appending file 2_raw_data_18-50_22.03.16.txt
Appending file 1_raw_data_12-14_23.03.16.txt
Appending file 2_raw_data_12-16_23.03.16.txt
Appending file 1_raw_data_12-41_23.03.16.txt
Appending file 2_raw_data_12-43_23.03.16.txt
Appending file 1_raw_data_11-08_21.03.16.txt
Appending file 2_raw_data_11-10_21.03.16.txt
Appending file 1_raw_data_13-11_18.03.16.txt
Appending file 2_raw_data_13-13_18.03.16.txt
Appending 

In [3]:
data.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1,1e-05,-2e-05,-1e-05,-3e-05,0.0,-1e-05,0.0,-1e-05,0.0
1,5,1e-05,-2e-05,-1e-05,-3e-05,0.0,-1e-05,0.0,-1e-05,0.0
2,6,-1e-05,1e-05,2e-05,0.0,1e-05,-2e-05,-1e-05,1e-05,0.0


In [4]:
data_main = data.iloc[:,1:].copy()
cols=['C1','C2','C3','C4','C5','C6','C7','C8','Class']
data_main.columns = cols
data_main.head()

Unnamed: 0,C1,C2,C3,C4,C5,C6,C7,C8,Class
0,1e-05,-2e-05,-1e-05,-3e-05,0.0,-1e-05,0.0,-1e-05,0.0
1,1e-05,-2e-05,-1e-05,-3e-05,0.0,-1e-05,0.0,-1e-05,0.0
2,-1e-05,1e-05,2e-05,0.0,1e-05,-2e-05,-1e-05,1e-05,0.0
3,-1e-05,1e-05,2e-05,0.0,1e-05,-2e-05,-1e-05,1e-05,0.0
4,-1e-05,1e-05,2e-05,0.0,1e-05,-2e-05,-1e-05,1e-05,0.0


In [5]:
data_main.shape

(4237908, 9)

In [6]:
gesture = {'0':'unmarked data',
'1':'hand at rest', 
'2': 'hand clenched in a fist', 
'3' : 'wrist flexion',
'4' : 'wrist extension',
'5' : 'radial deviations',
'6' : 'ulnar deviations',
'7' : 'extended palm' }

In [7]:
gesture

{'0': 'unmarked data',
 '1': 'hand at rest',
 '2': 'hand clenched in a fist',
 '3': 'wrist flexion',
 '4': 'wrist extension',
 '5': 'radial deviations',
 '6': 'ulnar deviations',
 '7': 'extended palm'}

Since gesture 7 was not performed by every subject and gesture 0 is unmarked, we will remove both of these classes from our dataset.

Another exercise that can be done with this dataset is to classify the unmarked gestures. But since there are no true labels to validate against, we won't be able to score that classification.

# Tasks for this dataset

For now, we will consider the following exercises on this dataset:

1) Try various multi-class classification algorithms: Logistic Regression, KNN and Gradient Boosted trees

2) Evaluate various scoring metrics for our model

In [8]:
lis = data_main[data_main['Class']==0].index

In [9]:
data_clean = data_main.loc[~((data_main['Class'] == 0) | (data_main['Class'] == 7))]

In [10]:
data_clean['Class'].value_counts()

6.0    253009
5.0    251733
4.0    251570
1.0    250055
3.0    249494
2.0    243193
Name: Class, dtype: int64

So, our final data set will be data_clean. 

In [11]:
data_clean = data_clean.reset_index(drop=True)

In [12]:
data_clean.head()

Unnamed: 0,C1,C2,C3,C4,C5,C6,C7,C8,Class
0,-1e-05,0.0,-1e-05,0.0,0.0,-1e-05,-1e-05,1e-05,1.0
1,-1e-05,-2e-05,0.0,-1e-05,-1e-05,-1e-05,-3e-05,-2e-05,1.0
2,-1e-05,-2e-05,0.0,-1e-05,-1e-05,-1e-05,-3e-05,-2e-05,1.0
3,-1e-05,-2e-05,0.0,-1e-05,-1e-05,-1e-05,-3e-05,-2e-05,1.0
4,-1e-05,-2e-05,0.0,-1e-05,-1e-05,-1e-05,-3e-05,-2e-05,1.0


# Data Split

In [13]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

In [14]:
#Function to display tables side by side
from IPython.display import HTML

def multi_table(table_list):
    ''' Accepts a list of IpyTable objects and returns a table which contains each IpyTable in a cell'''
    return HTML(
        '<table><tr style="background-color:white;">' + 
        ''.join(['<td>' + table._repr_html_() + '</td>' for table in table_list]) +
        '</tr></table>'
    )

# Classifiers



- Logistic Regression<br>
Don't know how logistic regression works? Refer to my article [here]() on how machine learning algorithms work.

- KNN<br>
Don't know how KNN works? Refer to my article [here]() on how machine learning algorithms work.

# Ensemble methods:
------------------------------
- Gradient Boost<br>
Don't know how gradient boosting works? Refer to my article [here]() on how machine learning algorithms work.

In [15]:
#shuffle the rows within each class (frac=1 keeps every row)
sample_data = data_clean.groupby('Class',as_index=False,group_keys=False).apply(lambda x:x.sample(frac =1,random_state =1))

In [16]:
sample_data['Class'].value_counts()

6.0    253009
5.0    251733
4.0    251570
1.0    250055
3.0    249494
2.0    243193
Name: Class, dtype: int64

In [17]:
#Split --> Scale the train --> transform test with same scaler
#split the data

x_train,x_test,y_train,y_test = train_test_split(sample_data.iloc[:,:-1],sample_data.iloc[:,-1],test_size =0.2)

#Scale

scaler = preprocessing.MinMaxScaler()
scaler_1 = scaler.fit(x_train)

#Transform test and train data by using scaler which is trained only on train data

x_train_scaled = pd.DataFrame(scaler_1.transform(x_train),columns=x_train.columns,index=x_train.index)
x_test_scaled = pd.DataFrame(scaler_1.transform(x_test),columns=x_test.columns,index=x_test.index)
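
The split, then scale order used above matters because the scaler's minimum and maximum are learned from the training rows only. Here is a minimal sketch on toy numbers; the arrays and the `toy_` names are made up for illustration and are not taken from the gesture data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy, made-up values: a single feature with separate train and test rows.
toy_train = np.array([[1.0], [2.0], [3.0], [5.0]])
toy_test = np.array([[0.0], [4.0], [6.0]])

# Fit on the training rows only, then reuse the same min/max for both sets.
toy_scaler = MinMaxScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled.ravel())  # training values span exactly [0, 1]
print(toy_test_scaled.ravel())   # test values can fall outside [0, 1]
```

Test values outside the [0, 1] range are expected here: they simply show that the test set did not influence the scaler.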

In [18]:
# we can use various metrics from sklearn.metrics module independently, or 
# use the classification report to compute it all together

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

## Models

### Logistic Regression

In [19]:
from sklearn.linear_model import LogisticRegression

In [20]:
log_class = LogisticRegression(multi_class='multinomial',solver='lbfgs').fit(x_train_scaled,np.ravel(y_train))

In [21]:
pred_log = log_class.predict(x_test_scaled)

### Gradient Trees

In [24]:
from sklearn.ensemble import GradientBoostingClassifier

In [32]:
gd_model = GradientBoostingClassifier(learning_rate=0.3).fit(x_train_scaled,np.ravel(y_train))

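__New learning__ : 
1. np.ravel - flattens an array to one dimension, which is the shape scikit-learn expects for the target `y`.
2. Gradient boosting is very slow to train on a dataset of this size. The trees are built one after another, which is likely the main cost; the learning rate of 0.3 (scikit-learn's default is 0.1) does not by itself change the number of trees that are fitted.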
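
To make the `np.ravel` note concrete, here is a small sketch with a hypothetical single-column label frame (`toy_labels` is made up for illustration). In this notebook `y_train` is already a one-dimensional Series, so the call mainly guards against a column-shaped target.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels stored as a single-column DataFrame, shape (4, 1).
toy_labels = pd.DataFrame({'Class': [1.0, 2.0, 2.0, 3.0]})

# np.ravel returns a flat, one-dimensional NumPy array.
flat_labels = np.ravel(toy_labels)

print(toy_labels.shape)   # (4, 1)
print(flat_labels.shape)  # (4,)
```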

In [33]:
pred_gd = gd_model.predict(x_test_scaled)

### KNN

In [26]:
from sklearn.neighbors import KNeighborsClassifier

In [27]:
knn_mod = KNeighborsClassifier().fit(x_train_scaled,np.ravel(y_train))

In [28]:
#default k=5, i.e., 5 neighbors considered

In [29]:
pred_mod = knn_mod.predict(x_test_scaled)

<a id="accuracy"></a>
**Understanding the Metrics**

1. Accuracy
     
 The most commonly used metric is the accuracy of the model. Accuracy is defined as the ratio of __correctly predicted labels__ to the __total number of predictions__. 
<br>

    In binary classification, if 70 out of 100 predictions are correct, the accuracy is 70%. <br>The same applies to multi-class classification, which is our case: we sum up the correctly predicted labels and divide by the total number of predictions.
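
The definition above can be checked by hand. This is a minimal sketch with ten made-up multi-class labels (not from the dataset), comparing a manual count against scikit-learn's `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted gesture labels, for illustration only.
toy_true = np.array([1, 2, 3, 3, 5, 6, 6, 4, 2, 1])
toy_pred = np.array([1, 2, 3, 4, 5, 6, 1, 4, 2, 2])

# Accuracy = number of correct predictions / total number of predictions.
manual_accuracy = np.mean(toy_true == toy_pred)
library_accuracy = accuracy_score(toy_true, toy_pred)

print(manual_accuracy)   # 7 of the 10 predictions match
print(library_accuracy)
```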


In [50]:
from sklearn import metrics

In [52]:
print("logistic regression",metrics.accuracy_score(y_test,pred_log))
print("gradient trees",metrics.accuracy_score(y_test,pred_gd))
print("KNN",metrics.accuracy_score(y_test,pred_mod))

logistic regression 0.18121416492390205
gradient trees 0.6764928571666817
KNN 0.9736800851202925



    - Logistic Regression: 18%
    - Gradient Trees: 68%
    - KNN: 97%

   But can we tell how many hand movements of type "1" were classified correctly?<br>Not from a single accuracy percentage. For that level of detail, we use another metric called the confusion matrix.
<br><br>


<a id="confusion_matrix"></a>
**Confusion Matrix**
    
  Scratching the surface of classification performance a little more, the confusion matrix is helpful when we want to understand the model's performance for each label being classified. As shown below, ![title](images/confusion_matrix.png) <br>the confusion matrix tabulates the number of predicted labels against the number of actual labels. The figure shows the binary case, but a similar matrix can be formed for multi-class problems, with the actual values across the top and the predicted values down the side. Note that scikit-learn's `confusion_matrix`, used in the cells below, has the opposite orientation: rows are actual labels and columns are predicted labels.

This matrix gives us a clear picture of how well each individual label is being predicted. The terms used in the table are:
  - TP: True Positives -  TP are the number of data points which are actually positive and were correctly classified as positives.
  - FP: False Positives - FP are the number of data points which are actually negative but were classified positive. 
  - TN: True Negatives - TN are the number of data points which are actually negatives and were correctly classified as negatives.
  - FN: False Negatives - FN are the number of data points which are actually positive but were classified as negatives.
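
The four counts can be read directly from a binary confusion matrix. This is a small sketch on made-up labels (1 = positive, 0 = negative); the `toy_` names are illustrative only.

```python
from sklearn.metrics import confusion_matrix

# Made-up binary labels: four actual positives and six actual negatives.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# scikit-learn puts actual labels on rows and predicted labels on columns,
# so with labels=[0, 1] the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred, labels=[0, 1]).ravel()

print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
```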

Further, FP are also called **Type 1 errors** and FN **Type 2 errors**. These terms come from statistical hypothesis testing and are commonly used in the healthcare industry when testing for disease. <br><br>
__Ideally__, we want FP and FN to be 0, since that would mean every positive and negative was identified correctly. But this is rarely the case; we almost always get some FP and FN. Depending on the business application, the task then shifts from improving the overall accuracy of the model to reducing FP or FN specifically.<br> <br>
     
   __Disease detection test__<br>

  - We don't want the test to fail (i.e., Predicted - Negative) to identify a patient who has the disease (Actual - Positive). If it does, we have a false negative. If the test instead flags a patient without the disease as positive, the test has also failed, but this is a false positive. 
  - In the medical industry, we can usually afford a false positive because the healthy patient can undergo extra tests and eventually be identified as healthy. So the __cost of a false positive__ is the cost of the extra tests.
  - The same is not true for a false negative. A false negative tells a patient who is actually ill that they are healthy and that no extra tests are required. So, the __cost of a false negative__ might eventually be the patient's life.
  - Thus, in disease detection, we want our false negative rate to be as low as possible.
  
__Traffic Light Violation Detection__:
 - Traffic light violations are recorded by cameras mounted nearby. Whenever a driver is recorded violating the traffic light, the driver is fined but is also allowed to challenge the violation in court. 
 - If a driver with no violation is fined anyway, they will have to challenge it in court. This wastes government resources because of a false positive classification.
 - In this case, we would like to reduce FP. But we cannot let FN grow in exchange, since that would mean the model is failing to identify real violators. This problem needs a balance between FP and FN.
<br><br>
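
One common way to trade FP against FN in a binary problem is to move the decision threshold applied to the predicted probability. The sketch below is only an illustration of that idea: the data is synthetic (`make_classification`), the model is scored on its own training data for brevity, and the thresholds 0.3, 0.5 and 0.7 are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic binary data, generated only for illustration.
X_toy, y_toy = make_classification(n_samples=1000, n_features=5, random_state=0)
toy_model = LogisticRegression().fit(X_toy, y_toy)

# Probability of the positive class for every row.
positive_prob = toy_model.predict_proba(X_toy)[:, 1]

error_counts = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict positive only when the probability reaches the threshold.
    toy_pred = (positive_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_toy, toy_pred, labels=[0, 1]).ravel()
    error_counts[threshold] = (int(fp), int(fn))
    print(f"threshold={threshold}: FP={fp}, FN={fn}")
```

A lower threshold flags more rows as positive, so FN can only shrink while FP can only grow. A disease test would lean toward the lower threshold, while the traffic example needs a value somewhere in between.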



In [44]:
print("\n\n Confusion Matrix- Logistic Regression\n\n",confusion_matrix(y_test,pred_log))



 Confusion Matrix- Logistic Regression

 [[  546     4    63  5361  8240 36025]
 [ 3413  5354  4372 15362  5564 14569]
 [ 3903  2619  5925 14192  7366 15914]
 [ 9055  1039  3145 23333  3283 10431]
 [ 7653   419  4448 21371  4637 11568]
 [ 4937  2676  5705 18127  4657 14535]]


In [45]:
print("\n\n Confusion Matrix- Gradient Tree\n\n",confusion_matrix(y_test,pred_gd))



 Confusion Matrix- Gradient Tree

 [[47558   111   524   133  1662   251]
 [  819 28859  5589  2854  4939  5574]
 [ 1895  3997 30925   845  3153  9104]
 [  857  2742   621 32610  9629  3827]
 [ 2309  2904  3488  7725 32388  1282]
 [  692  4591  8422  3912  2540 30480]]


In [46]:
print("\n\n Confusion Matrix-KNN\n\n",confusion_matrix(y_test,pred_mod))



 Confusion Matrix-KNN

 [[49954    27    88    41   106    23]
 [   85 46909   496   313   358   473]
 [  144   362 48425    97   285   606]
 [   68   202    72 49067   605   272]
 [  195   243   275   726 48481   176]
 [   63   368   568   384   170 49084]]
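
The matrices above answer the earlier question about gesture "1" directly. As a sketch, the KNN matrix is copied by hand into an array below (the names `knn_cm` and `labels` are ours). Rows are actual labels and columns are predicted labels, which is scikit-learn's layout.

```python
import numpy as np

# KNN confusion matrix copied from the output above (rows: actual, columns: predicted).
knn_cm = np.array([
    [49954,    27,    88,    41,   106,    23],
    [   85, 46909,   496,   313,   358,   473],
    [  144,   362, 48425,    97,   285,   606],
    [   68,   202,    72, 49067,   605,   272],
    [  195,   243,   275,   726, 48481,   176],
    [   63,   368,   568,   384,   170, 49084],
])
labels = np.array([1, 2, 3, 4, 5, 6])

# Diagonal entries are correct predictions; row sums are the actual counts per label.
correct = np.diag(knn_cm)
actual_totals = knn_cm.sum(axis=1)
print(f"Gesture 1: {correct[0]} of {actual_totals[0]} classified correctly")

# Zero the diagonal to find the most frequent confusion between two gestures.
off_diagonal = knn_cm.copy()
np.fill_diagonal(off_diagonal, 0)
row, col = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)
print(f"Most common mistake: actual {labels[row]} predicted as {labels[col]} "
      f"({off_diagonal[row, col]} times)")
```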


### Reports

In [47]:
print("Classification report for Logistic Regression \n\n",classification_report(y_test,pred_log))

Classification report for Logistic Regression 

               precision    recall  f1-score   support

         1.0       0.02      0.01      0.01     50239
         2.0       0.44      0.11      0.18     48634
         3.0       0.25      0.12      0.16     49919
         4.0       0.24      0.46      0.32     50286
         5.0       0.14      0.09      0.11     50096
         6.0       0.14      0.29      0.19     50637

    accuracy                           0.18    299811
   macro avg       0.20      0.18      0.16    299811
weighted avg       0.20      0.18      0.16    299811



In [48]:
print("Classification report for Gradient Tree \n\n",classification_report(y_test,pred_gd))

Classification report for Gradient Tree 

               precision    recall  f1-score   support

         1.0       0.88      0.95      0.91     50239
         2.0       0.67      0.59      0.63     48634
         3.0       0.62      0.62      0.62     49919
         4.0       0.68      0.65      0.66     50286
         5.0       0.60      0.65      0.62     50096
         6.0       0.60      0.60      0.60     50637

    accuracy                           0.68    299811
   macro avg       0.67      0.68      0.67    299811
weighted avg       0.67      0.68      0.67    299811



In [49]:
print("Classification report for KNN\n\n",classification_report(y_test,pred_mod))

Classification report for KNN

               precision    recall  f1-score   support

         1.0       0.99      0.99      0.99     50239
         2.0       0.98      0.96      0.97     48634
         3.0       0.97      0.97      0.97     49919
         4.0       0.97      0.98      0.97     50286
         5.0       0.97      0.97      0.97     50096
         6.0       0.97      0.97      0.97     50637

    accuracy                           0.97    299811
   macro avg       0.97      0.97      0.97    299811
weighted avg       0.97      0.97      0.97    299811
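
The per-label columns in these reports can be recomputed from the confusion matrices printed earlier. As a sketch, the Gradient Tree matrix is copied by hand below (`gd_cm` is our own name). For each label, precision divides the diagonal entry by its column total, recall divides it by its row total, and the F1-score is the harmonic mean of the two.

```python
import numpy as np

# Gradient Tree confusion matrix copied from the output above
# (rows: actual labels 1-6, columns: predicted labels 1-6).
gd_cm = np.array([
    [47558,   111,   524,   133,  1662,   251],
    [  819, 28859,  5589,  2854,  4939,  5574],
    [ 1895,  3997, 30925,   845,  3153,  9104],
    [  857,  2742,   621, 32610,  9629,  3827],
    [ 2309,  2904,  3488,  7725, 32388,  1282],
    [  692,  4591,  8422,  3912,  2540, 30480],
])

true_positives = np.diag(gd_cm)
precision = true_positives / gd_cm.sum(axis=0)  # column totals = predicted counts
recall = true_positives / gd_cm.sum(axis=1)     # row totals = actual counts (support)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / gd_cm.sum()

for label, p, r, f in zip(range(1, 7), precision, recall, f1):
    print(f"{label}.0  precision={p:.2f}  recall={r:.2f}  f1-score={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The printed values should agree with the Gradient Tree report up to rounding.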



<a id="precision"></a>

**Understanding the Metrics**

1. **Precision**<br><br>
    Going a step further with the confusion matrix, we can take ratios of the numbers within it for a better understanding. Precision and recall describe not the whole model but a specific label of interest. __The precision of a model for a particular label is the proportion of the predictions of that label which are correct__.
    <BR><br>Mathematically, precision is defined as:<br><br>
    \begin{equation}{\text{Precision}}=\frac{\text { True Positive }}{\text { True Positive }+\text { False Positive }}\end{equation}
    
   We can use precision to compare the algorithms:
     - The precision of Logistic Regression for label "1.0" is 2%.
       This means that of all the test points the algorithm predicted as "1.0" (29,507, the first column total of its confusion matrix), only about 2% (546) actually have label "1.0". The 50,239 shown in the report is the support, i.e. the number of points whose true label is "1.0".
     - Similarly, 88% of the points that Gradient Tree predicted as "1.0" were correct.
     - For KNN, 99% of the points predicted as "1.0" were correct.
<br><br>

2. **Recall**<br>
      __Recall can be defined as the proportion of the true values for a particular label that our algorithm correctly classified__.<br> It is calculated as follows:<br><br>
\begin{equation}{\text{Recall}}
= \frac{\text { True Positive }}{\text { True Positive }+\text { False Negative }}
\end{equation} <br>
Comparing the algorithms using recall:
     - For Logistic Regression, out of 50,239 points whose true label is "1.0", only 1% were classified as "1.0".
     - Gradient Tree has a recall of 95% for label 1.0, i.e., it correctly classified 95% of the points whose true label is 1.0.
     - KNN has a recall of 99% for label 1.0. 
     <br><Br>
    
3. **F1-Score**<br>
    We now know that both recall and precision should be as high as possible for a good model. Optimizing both metrics at once can be difficult. To address this, there is another metric called the F1-Score. The F1-Score is the harmonic mean of precision and recall. It is defined as:
<br><Br>\begin{equation}
F_{1}=2 \cdot \frac{\text { precision } \cdot \text { recall }}{\text { precision }+\text { recall }}
\end{equation}
<br>
The main advantage of the F1-Score is that it is a single metric that reflects both precision and recall, which makes it more convenient to optimize.
<bR><Br>
    
4. **Macro-Average, Weighted-Average & Micro-Average**: These are ways of combining the per-label metrics into a single number. They are useful when the dataset is imbalanced or when we want to focus on particular labels. 
    - Macro-average: the unweighted mean of the per-label metric
    - Weighted-average: the mean of the per-label metric, weighted by each label's support
    - Micro-average: the metric computed globally from the total true positives, false negatives and false positives across all labels
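
The three averages only differ when the labels are imbalanced, so the sketch below uses a deliberately lopsided, made-up example (eight points of class 0 and two of class 1). It uses scikit-learn's `f1_score` with its `average` parameter; the `toy_` names are illustrative.

```python
from sklearn.metrics import f1_score

# Made-up, imbalanced labels: eight of class 0 and two of class 1.
toy_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

per_label_f1 = f1_score(toy_true, toy_pred, average=None)      # one score per label
macro_f1 = f1_score(toy_true, toy_pred, average='macro')        # plain mean of the two
weighted_f1 = f1_score(toy_true, toy_pred, average='weighted')  # mean weighted by support
micro_f1 = f1_score(toy_true, toy_pred, average='micro')        # from pooled TP, FP, FN

print("per label:", per_label_f1)  # 0.8 for class 0, 0.4 for class 1
print("macro:", macro_f1)          # (0.8 + 0.4) / 2 = 0.6
print("weighted:", weighted_f1)    # (8 * 0.8 + 2 * 0.4) / 10 = 0.72
print("micro:", micro_f1)          # pooled TP = 7, FP = 3, FN = 3 gives 0.7
```

In our gesture dataset the classes are almost balanced, which is why the macro and weighted averages in the reports are nearly identical.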