# LinearSVC Classification of Digits Dataset
## Environment
First, we set up our environment with standard packages.

In [25]:
import pandas as pd
pd.options.display.max_rows = 10
pd.options.display.max_columns = 28

In [26]:
import numpy as np

In [27]:
from sklearn import svm, metrics

## Initial Handling of Data
The data is available from Kaggle and needs no extra cleaning or modification. Each row holds the digit label in the first column, followed by the pixel values.

In [28]:
# DataFrame.get_values() was removed in pandas 1.0; to_numpy() returns the same array.
training_digits = pd.read_csv('train.csv').to_numpy()
training_digits

array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       ..., 
       [7, 0, 0, ..., 0, 0, 0],
       [6, 0, 0, ..., 0, 0, 0],
       [9, 0, 0, ..., 0, 0, 0]])

## Initial Training
We'll train on the first 90% of the data (37,800 of 42,000 rows) and hold out the remaining 10% to see what kind of accuracy we can expect.

In [29]:
training_labels = training_digits[:37800,0]
training_data = training_digits[:37800,1:]
classifier = svm.LinearSVC()
classifier.fit(training_data, training_labels)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
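The hard-coded index 37800 above is 90% of the rows. As a minimal sketch, the same order-based split can be computed from the array length. The helper name `split_labeled_rows` and the tiny `demo_rows` array are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np

def split_labeled_rows(rows, train_fraction=0.9):
    """Split rows of [label, pixel, pixel, ...] into training and held-out parts.

    Hypothetical helper: keeps the original row order (no shuffling),
    matching the slice-based split used in the notebook.
    """
    split_index = int(round(len(rows) * train_fraction))
    train, held_out = rows[:split_index], rows[split_index:]
    # Column 0 is the label; the remaining columns are the features.
    return (train[:, 1:], train[:, 0]), (held_out[:, 1:], held_out[:, 0])

# Tiny stand-in for training_digits: 10 rows, label in column 0, 4 "pixels".
demo_rows = np.arange(50).reshape(10, 5)
(train_X, train_y), (held_X, held_y) = split_labeled_rows(demo_rows)
print(train_X.shape, held_X.shape)
```

Because the rows are not shuffled, this estimate is only trustworthy if the file is not ordered by label, which the mixed labels in the array preview suggest.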

In [30]:
# Hold out the last 10% of the labeled rows to estimate accuracy.
test_labels = training_digits[37800:, 0]
test_data = training_digits[37800:, 1:]
predicted = classifier.predict(test_data)
print(metrics.classification_report(test_labels, predicted))

             precision    recall  f1-score   support

          0       0.93      0.96      0.94       455
          1       0.93      0.97      0.95       458
          2       0.88      0.81      0.84       392
          3       0.94      0.81      0.87       448
          4       0.84      0.92      0.88       438
          5       0.90      0.70      0.79       354
          6       0.88      0.95      0.92       413
          7       0.92      0.93      0.92       421
          8       0.76      0.77      0.77       397
          9       0.76      0.86      0.81       424

avg / total       0.88      0.87      0.87      4200



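The weighted-average precision is 0.88 and the weighted-average recall, which equals overall accuracy, is 0.87. That is good, but not great. The weakest classes are 5 (recall 0.70), 8, and 9.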
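The summary row of a classification report averages each column weighted by support, and for recall that weighted average coincides with overall accuracy. The following sketch checks this on a small made-up label set (the `y_true` and `y_pred` values are illustrative, not taken from the digits data).

```python
from sklearn import metrics

# Illustrative labels only; not from the digits dataset.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

# Fraction of predictions that match the true label.
accuracy = metrics.accuracy_score(y_true, y_pred)

# Per-class recall averaged with weights proportional to class support.
weighted_recall = metrics.recall_score(y_true, y_pred, average='weighted')

print(accuracy, weighted_recall)
```

Calling `metrics.accuracy_score(test_labels, predicted)` on the held-out rows would give the single accuracy figure directly.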

## Predict on the test data
We'll retrain the classifier on all of the training data, then predict labels for the test data. Based on the held-out results, we expect an accuracy of about 87%, so it's unlikely that we'll actually submit this prediction.

In [32]:
full_training_labels = training_digits[:,0]
full_training_data = training_digits[:,1:]
full_classifier = svm.LinearSVC()
full_classifier.fit(full_training_data, full_training_labels)
# Convert to an array so predict() sees the same input type as fit().
full_test_data = pd.read_csv('test.csv').to_numpy()
full_predictions = full_classifier.predict(full_test_data)
full_predictions

array([2, 0, 9, ..., 3, 9, 2])
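If the predictions were to be submitted after all, they would need to be written to a CSV file. The sketch below assumes a two-column layout with a 1-based `ImageId` and a `Label`. That layout, the helper name `to_submission_frame`, and the output filename are assumptions to check against the competition's sample submission file.

```python
import numpy as np
import pandas as pd

def to_submission_frame(predictions):
    """Build a submission table from predicted labels.

    Hypothetical helper: the ImageId/Label column names and 1-based ids
    are assumed, not confirmed by the notebook.
    """
    predictions = np.asarray(predictions)
    return pd.DataFrame({
        'ImageId': np.arange(1, len(predictions) + 1),
        'Label': predictions,
    })

# Small stand-in for full_predictions.
demo = to_submission_frame([2, 0, 9])
print(demo)

# To write the file (hypothetical filename):
# demo.to_csv('submission.csv', index=False)
```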