# The Effect of Thresholding on Classification

## The Methods

We first try simple thresholding. The conjecture is that low-intensity pixels add noise to the classification process, so we set every pixel whose value falls below a given threshold to zero and repeat the classification across the range of possible threshold values.
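As a minimal sketch of the thresholding step described above, the snippet below applies it to a tiny made-up array. The helper name `zero_below` and the toy values are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def zero_below(pixels, threshold):
    """Return a copy of `pixels` with every value below `threshold` set to 0."""
    result = np.array(pixels, copy=True)  # copy so the input is left untouched
    result[result < threshold] = 0        # boolean-mask assignment
    return result

# Hypothetical 2x3 "image" with intensities in 0..255.
toy_image = np.array([[0, 12, 200],
                      [90, 255, 30]])

thresholded = zero_below(toy_image, 100)
print(thresholded)
```

Working on a copy keeps the original pixel values available for the next threshold.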

## Environment

First we set up the environment: pandas and NumPy for data handling, and scikit-learn for the classifier and the metrics.

In [None]:
import pandas as pd
pd.options.display.max_rows = 10
pd.options.display.max_columns = 29

import numpy as np

from sklearn import svm, metrics

In [None]:
# DataFrame.get_values() was removed in pandas 1.0; to_numpy() is its replacement.
data = pd.read_csv('train.csv').to_numpy()

## Simple Thresholding

For our initial thresholding, we train on a smaller subset of the data so that we can get at least some information quickly.

In [None]:
# Rows 0-9999 are used for training and rows 10000-10999 for testing; column 0 holds the label.
refinement_training_labels = data[:10000,0]
refinement_test_labels = data[10000:11000,0]
refinement_test_data = data[10000:11000,1:]
metrics_np = np.ndarray(shape=(5,0))
for refinement in range(0, 255):
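    # Copy the slice: basic slicing returns a view, so thresholding it in place would overwrite `data`.
    refinement_training_data = data[:10000,1:].copy()
    refinement_training_data[refinement_training_data < refinement] = 0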
    refinement_classifier = svm.LinearSVC()
    refinement_classifier.fit(refinement_training_data, refinement_training_labels)
    refinement_predicted_labels = refinement_classifier.predict(refinement_test_data)
    # One row per digit label, each tagged with the current threshold.
    labels = np.arange(10)
    refinements = np.full(10, refinement)
    precision = metrics.precision_score(refinement_test_labels, refinement_predicted_labels, average=None)
    recall = metrics.recall_score(refinement_test_labels, refinement_predicted_labels, average=None)
    f1 = metrics.f1_score(refinement_test_labels, refinement_predicted_labels, average=None)
    scores = np.vstack([refinements, labels, precision, recall, f1])
    metrics_np = np.concatenate((metrics_np, scores), axis=1)
    print("Finished refinement %i" % refinement)
    
metrics_df = pd.DataFrame(
    columns = ['refinement', 'label', 'precision', 'recall', 'f1'],
    data = metrics_np.T
)
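One way to read a table shaped like `metrics_df` is to average the per-label F1 scores at each threshold and look for the maximum. The sketch below does this on a small hand-made frame; `toy_metrics_df` and its numbers are hypothetical stand-ins, not results from the experiment above.

```python
import pandas as pd

# Hypothetical stand-in for metrics_df: two thresholds, two labels.
toy_metrics_df = pd.DataFrame({
    'refinement': [0.0, 0.0, 1.0, 1.0],
    'label':      [0.0, 1.0, 0.0, 1.0],
    'precision':  [0.90, 0.80, 0.70, 0.60],
    'recall':     [0.80, 0.90, 0.60, 0.70],
    'f1':         [0.85, 0.85, 0.65, 0.65],
})

# Unweighted (macro) mean of the per-label F1 scores at each threshold.
mean_f1 = toy_metrics_df.groupby('refinement')['f1'].mean()

# Threshold with the highest mean F1.
best_refinement = mean_f1.idxmax()
print(mean_f1)
print("Best threshold:", best_refinement)
```

The same two lines should apply to the real `metrics_df`, assuming it keeps the column names defined above.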

## Line Thickness

Our previous attempt at dimensionality reduction, projecting onto the y-axis, was less than promising. One possible explanation is that variations in line thickness confuse the model. We will try two methods to correct for this.