# The Effect of Thresholding on Classification

## The Methods

We begin with simple thresholding.  The conjecture is that low-intensity pixels add noise that blurs the classification, so we zero out every pixel whose value falls below a given threshold and repeat the classification for every threshold from 0 to 254.
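
As a minimal sketch of the thresholding step on its own, the snippet below zeroes out low-intensity pixels in a toy array.  The helper name `zero_below` and the toy values are illustrative assumptions, not part of the notebook.  The helper works on a copy because slicing a NumPy array returns a view, and an in-place assignment through that view would also change the source data.

```python
import numpy as np

def zero_below(pixels, threshold):
    """Return a copy of `pixels` with every value below `threshold` set to 0."""
    out = pixels.copy()  # copy so the caller's array is left untouched
    out[out < threshold] = 0
    return out

# Hypothetical 1x6 "image" of grayscale intensities in the range 0..255.
toy = np.array([[0, 12, 40, 128, 200, 255]])
thresholded = zero_below(toy, 100)
print(thresholded)
```

Only the three pixels at or above 100 survive, and `toy` itself is unchanged.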

## Environment

In [69]:
import pandas as pd
pd.options.display.max_rows = 10
pd.options.display.max_columns = 29

import numpy as np

from sklearn import svm, metrics

## Simple Thresholding


In [60]:
# DataFrame.get_values() is deprecated and absent from recent pandas releases; to_numpy() replaces it.
raw_st_data = pd.read_csv('train.csv').to_numpy()
print("Data read complete")
# Column 0 holds the digit label; the remaining columns hold the pixel intensities.
st_train_labels = raw_st_data[:10000,0]
st_test_labels = raw_st_data[10000:11000,0]
# Rows: threshold, label, precision, recall, f1.  Ten columns are appended per threshold.
metrics_np = np.empty((5,0))
for threshold in range(0, 255):
    # Copy the slice so that thresholding does not modify raw_st_data in place.
    st_data = raw_st_data[:11000,1:].copy()
    st_data[st_data < threshold] = 0
    st_train_data = st_data[:10000,:]
    st_test_data = st_data[10000:,:]
    st_classifier = svm.LinearSVC()
    st_classifier.fit(st_train_data, st_train_labels)
    st_predicted_labels = st_classifier.predict(st_test_data)
    labels = np.arange(10)
    thresholds = np.full(10, threshold)
    # average=None returns one score per digit class.
    precision = metrics.precision_score(st_test_labels, st_predicted_labels, average=None)
    recall = metrics.recall_score(st_test_labels, st_predicted_labels, average=None)
    f1 = metrics.f1_score(st_test_labels, st_predicted_labels, average=None)
    scores = np.vstack([thresholds, labels, precision, recall, f1])
    metrics_np = np.concatenate((metrics_np, scores), axis=1)
    print("Finished threshold level %i" % threshold)

# Transpose so that each row is one (threshold, label) pair.
metrics_df = pd.DataFrame(
    columns = ['threshold', 'label', 'precision', 'recall', 'f1'],
    data = metrics_np.T
)
metrics_df.to_csv('metrics_df.csv')

Finished threshold level 0
Finished threshold level 1
Finished threshold level 2
Finished threshold level 3
Finished threshold level 4
Finished threshold level 5
Finished threshold level 6
Finished threshold level 7
Finished threshold level 8
Finished threshold level 9
Finished threshold level 10
Finished threshold level 11
Finished threshold level 12
Finished threshold level 13
Finished threshold level 14
Finished threshold level 15
Finished threshold level 16
Finished threshold level 17
Finished threshold level 18
Finished threshold level 19
Finished threshold level 20
Finished threshold level 21
Finished threshold level 22
Finished threshold level 23
Finished threshold level 24
Finished threshold level 25
Finished threshold level 26
Finished threshold level 27
Finished threshold level 28
Finished threshold level 29
Finished threshold level 30
Finished threshold level 31
Finished threshold level 32
Finished threshold level 33
Finished threshold level 34
Finished threshold level 35
...

## Line Thickness

Our previous attempt at dimensionality reduction, projecting onto the y-axis, was not promising.  One possible explanation is that variations in line thickness confuse the model.  We will attempt three methods to address this.

### Averaging

The first two methods share a pipeline whose first two steps are simple thresholding followed by projection onto the y-axis, which reduces each digit to a vector of length 28.  We then normalize that vector in two ways, giving two vectors per digit: the first divides by the sum of all pixel values, the second divides by the count of nonzero pixels.  The differences between the results of these two methods shed light on whether pen pressure confuses our models.
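
To see why the two normalizations can behave differently, here is a small sketch on two toy images that share a stroke shape but differ in intensity.  The function name `project_and_normalize`, the 2x2 image size, and the `np.maximum(..., 1)` guard against blank images are assumptions made for illustration.

```python
import numpy as np

def project_and_normalize(flat_images, side):
    """Project each flattened side x side image onto the y-axis, then
    normalize by total intensity and by the number of nonzero pixels."""
    images = flat_images.reshape(-1, side, side)
    projection = images.sum(axis=2)                 # row sums, shape (n, side)
    pixel_sum = flat_images.sum(axis=1)             # total intensity per image
    nonzero_count = (flat_images != 0).sum(axis=1)  # "ink" pixels per image
    # Assumed guard: a blank image would otherwise divide by zero.
    by_sum = projection / np.maximum(pixel_sum, 1)[:, None]
    by_count = projection / np.maximum(nonzero_count, 1)[:, None]
    return by_sum, by_count

# Two hypothetical 2x2 images: same shape, different "pen pressure".
light = [100, 0, 100, 100]
heavy = [200, 0, 200, 200]
flat = np.array([light, heavy])
by_sum, by_count = project_and_normalize(flat, side=2)
print(by_sum)
print(by_count)
```

In this toy case, dividing by the pixel sum makes the two images identical, while dividing by the nonzero count keeps the factor-of-two intensity difference.  If the sum-normalized features classify noticeably better, that would be consistent with pen pressure being a confounder.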

In [71]:
# DataFrame.get_values() is deprecated and absent from recent pandas releases; to_numpy() replaces it.
raw_avg_data = pd.read_csv('train.csv').to_numpy()
print("Data read complete")
# Column 0 holds the digit label; the remaining columns hold the pixel intensities.
avg_train_labels = raw_avg_data[:10000,0]
avg_test_labels = raw_avg_data[10000:11000,0]
# Rows: threshold, label, then precision/recall/f1 for count- and sum-averaging.
avg_metrics_np = np.empty((8,0))
for threshold in range(0, 255):
    print("Threshold: %i" % threshold)
    labels = np.arange(10)
    thresholds = np.full(10, threshold)
    # Copy the slice so that thresholding does not modify raw_avg_data in place.
    avg_data = raw_avg_data[:11000,1:].copy()
    avg_data[avg_data < threshold] = 0
    # Per-image normalizers: number of nonzero pixels and total intensity.
    count_vector = (avg_data != 0).sum(1)
    sum_vector = avg_data.sum(1)
    print("Count Vector: %s" % count_vector)
    print("Sum Vector: %s" % sum_vector)
    avg_data = np.reshape(avg_data, (-1,28,28))
    # Summing over axis 2 collapses each row, projecting the image onto the y-axis.
    avg_data = np.sum(avg_data, axis=2)
    # Average by nonzero count; the max(..., 1) guard avoids 0/0 on blank images.
    ac_data = avg_data / np.maximum(count_vector, 1)[:, None]
    ac_train_data = ac_data[:10000,:]
    ac_test_data = ac_data[10000:,:]
    ac_classifier = svm.LinearSVC()
    ac_classifier.fit(ac_train_data, avg_train_labels)
    ac_predicted_labels = ac_classifier.predict(ac_test_data)
    ac_precision = metrics.precision_score(avg_test_labels, ac_predicted_labels, average=None)
    ac_recall = metrics.recall_score(avg_test_labels, ac_predicted_labels, average=None)
    ac_f1 = metrics.f1_score(avg_test_labels, ac_predicted_labels, average=None)
    # Average by pixel sum, with the same guard against blank images.
    as_data = avg_data / np.maximum(sum_vector, 1)[:, None]
    as_train_data = as_data[:10000,:]
    as_test_data = as_data[10000:,:]
    as_classifier = svm.LinearSVC()
    as_classifier.fit(as_train_data, avg_train_labels)
    as_predicted_labels = as_classifier.predict(as_test_data)
    as_precision = metrics.precision_score(avg_test_labels, as_predicted_labels, average=None)
    as_recall = metrics.recall_score(avg_test_labels, as_predicted_labels, average=None)
    as_f1 = metrics.f1_score(avg_test_labels, as_predicted_labels, average=None)
    scores = np.vstack([thresholds, labels,
                        ac_precision, ac_recall, ac_f1,
                        as_precision, as_recall, as_f1])
    avg_metrics_np = np.concatenate((avg_metrics_np, scores), axis=1)
    print("Finished threshold level %i" % threshold)

# Transpose so that each row is one (threshold, label) pair.
avg_metrics_df = pd.DataFrame(
    columns = ['threshold', 'label',
               'precision (count)', 'recall (count)', 'f1 (count)',
               'precision (sum)', 'recall (sum)', 'f1 (sum)'],
    data = avg_metrics_np.T
)
avg_metrics_df.to_csv('avg_metrics_df.csv')

Data read complete
Threshold: 0
Count Vector: [ 97 245  79 ..., 169  97 105]
Sum Vector: [16649 44609 13425 ..., 30499 15915 16540]
Finished threshold level 0
Threshold: 1
Count Vector: [ 97 245  79 ..., 169  97 105]
Sum Vector: [16649 44609 13425 ..., 30499 15915 16540]
Finished threshold level 1
Threshold: 2
Count Vector: [ 97 242  79 ..., 169  97 105]
Sum Vector: [16649 44606 13425 ..., 30499 15915 16540]
Finished threshold level 2
Threshold: 3
Count Vector: [ 97 241  79 ..., 168  96 102]
Sum Vector: [16649 44604 13425 ..., 30497 15913 16534]
Finished threshold level 3
Threshold: 4
Count Vector: [ 97 241  77 ..., 168  96 102]
Sum Vector: [16649 44604 13419 ..., 30497 15913 16534]
Finished threshold level 4
Threshold: 5
Count Vector: [ 97 241  76 ..., 167  96 101]
Sum Vector: [16649 44604 13415 ..., 30493 15913 16530]
Finished threshold level 5
Threshold: 6
Count Vector: [ 97 241  76 ..., 166  95 101]
Sum Vector: [16649 44604 13415 ..., 30488 15908 16530]
Finished threshold level 6
Threshold: 7
Count Vector: [ 97 241  75 ..., 166  95 101]
Sum Vector: [16649 44604 13409 ..., 30488 15908 16530]
Finished threshold level 7
Threshold: 8
Count Vector: [ 95 240  75 ..., 166  94 100]
Sum Vector: [16635 44597 13409 ..., 30488 15901 16523]
Finished threshold level 8
Threshold: 9
Count Vector: [ 95 239  73 ..., 166  94 100]
Sum Vector: [16635 44589 13393 ..., 30488 15901 16523]
...

## Double Threshold

The next method extends thresholding to two axes: pixel intensity and nonzero count.  At each nonzero-count level, we raise the pixel-intensity threshold on each digit until its number of nonzero pixels falls to that level.  This method still needs refinement and will probably move to its own notebook.
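
One possible reading of this procedure is sketched below for a single digit.  The function name `threshold_to_count` is hypothetical, and the sketch assumes that "reached" means the nonzero count has dropped to at most the target, since tied intensities can make an exact match impossible.

```python
import numpy as np

def threshold_to_count(image, target_count, max_level=255):
    """Raise the intensity threshold until at most `target_count`
    nonzero pixels remain.  Returns (threshold, thresholded_image)."""
    threshold = 0
    thresholded = image.copy()
    while np.count_nonzero(thresholded) > target_count and threshold <= max_level:
        threshold += 1
        # Zero every pixel strictly below the current threshold.
        thresholded = np.where(image < threshold, 0, image)
    return threshold, thresholded

# Hypothetical flattened digit with two tied mid-intensity pixels.
digit = np.array([0, 10, 50, 50, 200])
level, result = threshold_to_count(digit, target_count=2)
print(level, result)
```

Here the two pixels of intensity 50 are removed together, so the result keeps one nonzero pixel rather than two.  How to handle such ties is one of the details a refined version would need to settle.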