
## Writing my own binary classifier evaluation module

Classification is the supervised-learning counterpart of clustering. A classifier assigns each observation to one of a finite number of categories, or classes. I use labeled training data to build the classifier and then use it to predict the classes of new observations.

In this project, I will build my own module for working with and evaluating binary classifiers and write unit tests for it. I will then use the module to evaluate k-nearest neighbors (KNN) classification on social data: the congressional voting records in `house-votes-84.data`. I will use the scikit-learn package to run the KNN algorithm.

### Function to split data into training and testing sets

Refer to the `class_eval.py` file in the Data-Science-Projects repository.
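
As a rough illustration of what such a function might look like (a sketch, not the code in `class_eval.py`), the version below shuffles a copy of the data and holds out `p_test` percent of it. The function name and the `p_test` keyword match the call used in the evaluation code in this document. Treating `p_test` as a percentage, the rounding rule, and the optional `seed` parameter are all assumptions.

```python
import random


def split_training_testing(data, p_test=20, seed=None):
    """Randomly split data into (train, test) lists.

    Assumption: p_test is the percentage of observations (0-100)
    reserved for testing. seed is a hypothetical extra parameter,
    added here only to make the split reproducible.
    """
    if not 0 <= p_test <= 100:
        raise ValueError("p_test must be between 0 and 100")
    rng = random.Random(seed)
    shuffled = list(data)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * p_test / 100)
    # The first n_test shuffled rows form the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]


rows = [["label", i] for i in range(10)]
train, test = split_training_testing(rows, p_test=20, seed=0)
print(len(train), len(test))  # prints "8 2"
```

Shuffling a copy rather than the input keeps the caller's data in its original order, which makes repeated splits of the same list independent of each other.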

### Function to compute the confusion matrix values

See the `confusion_matrix()` function in `class_eval.py`.
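
For readers without the repository at hand, here is a minimal sketch of how the four confusion matrix counts could be tallied. It is not the module's implementation. The argument order (predicted labels, actual labels, positive class) mirrors the `print_eval_metrics()` call in this document, while the return order `(tp, fp, tn, fn)` is an assumption.

```python
def confusion_matrix(pred_labels, actual_labels, positive_class):
    """Return (tp, fp, tn, fn) for a binary problem.

    Any label other than positive_class counts as negative.
    The return order is an assumption made for this sketch.
    """
    if len(pred_labels) != len(actual_labels):
        raise ValueError("label lists must have the same length")
    tp = fp = tn = fn = 0
    for pred, actual in zip(pred_labels, actual_labels):
        if pred == positive_class:
            if actual == positive_class:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive_class:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


pred = ["democrat", "democrat", "republican", "republican", "democrat"]
actual = ["democrat", "republican", "republican", "democrat", "democrat"]
print(confusion_matrix(pred, actual, "democrat"))  # prints (2, 1, 1, 1)
```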


### Functions to compute evaluation metrics

See the following functions in `class_eval.py`: `accuracy()`, `sensitivity()`, `specificity()`, `pos_pred_val()`, and `neg_pred_val()`.
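
The standard definitions behind these five metrics can be sketched as follows. The function names come from `class_eval.py`, but passing the four counts as arguments is an assumption, and the real signatures may differ. The example counts were chosen to be consistent with the results reported at the end of this document; they are not taken from the module's output.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(tp, fp, tn, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp, fp, tn, fn):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


def pos_pred_val(tp, fp, tn, fn):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)


def neg_pred_val(tp, fp, tn, fn):
    """Negative predictive value: TN / (TN + FN)."""
    return tn / (tn + fn)


# Illustrative counts in the assumed order (tp, fp, tn, fn)
counts = (46, 2, 33, 6)
for metric in (accuracy, sensitivity, specificity, pos_pred_val, neg_pred_val):
    print(metric.__name__, metric(*counts))
```

This sketch does not guard against empty denominators, so a test set with no actual positives, for example, raises `ZeroDivisionError` in `sensitivity()`.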

### Unit tests for the classification evaluation module

Refer to the `tests.py` file.
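
To give a flavor of what such tests might cover, here is a small, self-contained `unittest` sketch. It does not reproduce `tests.py`. The `sensitivity()` function defined below is a stand-in with an assumed signature; in the real test file it would be imported from `class_eval` instead.

```python
import unittest


def sensitivity(tp, fp, tn, fn):
    """Stand-in for class_eval.sensitivity (assumed signature)."""
    return tp / (tp + fn)


class TestSensitivity(unittest.TestCase):
    def test_perfect_recall(self):
        # No false negatives means every actual positive was found
        self.assertEqual(sensitivity(tp=5, fp=1, tn=4, fn=0), 1.0)

    def test_known_value(self):
        # 3 of the 4 actual positives were detected
        self.assertAlmostEqual(sensitivity(tp=3, fp=0, tn=2, fn=1), 0.75)

    def test_no_actual_positives(self):
        # The stand-in divides by zero when there are no actual positives
        with self.assertRaises(ZeroDivisionError):
            sensitivity(tp=0, fp=2, tn=3, fn=0)


# Run the tests without exiting the interpreter
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestSensitivity)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Covering a typical value, a boundary value, and a degenerate input for each metric keeps the test suite short while still catching swapped numerators or denominators.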

---
## Evaluating the module using KNN on real data

```python
import random
import class_eval
from sklearn.neighbors import KNeighborsClassifier

FPATH = '../data/house-votes-84.data'

# Fix the random seed for reproducibility, if needed
# random.seed(0)

def get_data(fpath):
    """Read house-votes-84.data and return a list of rows.

    Each row is a list whose first element is the label (political
    affiliation) and whose remaining elements are the features
    (voting decisions), encoded as 1 for yes, 0 for no, and 0.5
    for neither.
    """
    relabel = {'y': 1, 'n': 0, '?': 0.5}
    data = []
    # Use a context manager so the file is closed after reading
    with open(fpath, 'r') as f:
        for line in f:
            strlst = line.strip().split(',')
            data.append([strlst[0]] + [relabel[i] for i in strlst[1:]])
    return data

# Get the data
data = get_data(FPATH)

# Split it into training and testing sets and separate labels from feature vectors
train, test = class_eval.split_training_testing(data, p_test=20)
train_labels = [i[0] for i in train]
train_features = [i[1:] for i in train]
test_actual_labels = [i[0] for i in test]
test_features = [i[1:] for i in test]

# Make an instance of the KNN classifier and fit a model to the training data
neigh = KNeighborsClassifier(n_neighbors=11)
neigh.fit(train_features, train_labels)

# Predict the labels for the test data and evaluate the performance
test_pred_labels = neigh.predict(test_features)

# The predict() method returns a numpy.ndarray, so we convert it
# to a list to match the function specification
class_eval.print_eval_metrics(list(test_pred_labels), test_actual_labels, 'democrat')

# This routine is meant for testing purposes only.
# In an actual analysis, we would search more systematically for a k
# that maximizes the model's accuracy. We would then use multiple rounds
# of random partitioning and average the model's performance over all rounds.
```

```text
Accuracy: 0.9080459770114943
Sensitivity: 0.8846153846153846
Specificity: 0.9428571428571428
Positive predictive value: 0.9583333333333334
Negative predictive value: 0.8461538461538461
```
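
The more systematic procedure mentioned in the closing comments of the code (trying several values of k and averaging accuracy over repeated random partitions) could be sketched along the following lines. This is an illustration under stated assumptions, not the analysis itself: a small pure-Python majority-vote KNN stands in for scikit-learn's `KNeighborsClassifier` so that the sketch has no dependencies, the data is synthetic and invented for the example, and the helper names `knn_predict`, `mean_accuracy`, and `make_toy_data` are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train_features, train_labels, query, k):
    """Majority vote among the k training points closest to query
    (squared Euclidean distance). Stand-in for scikit-learn's KNN."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(feat, query)), label)
        for feat, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def mean_accuracy(data, k, rounds=10, p_test=20, seed=0):
    """Average test accuracy over several random train/test partitions."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):
        rows = list(data)
        rng.shuffle(rows)
        n_test = round(len(rows) * p_test / 100)
        test, train = rows[:n_test], rows[n_test:]
        train_x = [r[1:] for r in train]
        train_y = [r[0] for r in train]
        hits = sum(knn_predict(train_x, train_y, r[1:], k) == r[0] for r in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


def make_toy_data(n=60, n_features=6, seed=1):
    """Synthetic rows in the same layout as get_data(): label first,
    then feature values. Invented for illustration only."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = "democrat" if i % 2 == 0 else "republican"
        p_yes = 0.8 if label == "democrat" else 0.2
        votes = [1 if rng.random() < p_yes else 0 for _ in range(n_features)]
        rows.append([label] + votes)
    return rows


data = make_toy_data()
# Odd values of k avoid tied votes between the two classes
results = {k: mean_accuracy(data, k) for k in (1, 3, 5, 7, 9, 11)}
best_k = max(results, key=results.get)
print(best_k, results[best_k])
```

Because every value of k is evaluated with the same seed, all candidates see the same sequence of partitions, so differences in mean accuracy reflect k rather than the luck of a particular split.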
