# Trist'n Joseph - IST 718 - Lab 3

## Introduction

Classification is a powerful machine learning tool: it lets an analyst determine which attributes affect an event’s outcome or grouping, and then build models to predict that grouping. It applies to many use cases, such as detecting fraudulent transactions, identifying cancerous tumors, deciding whether a self-driving car should stop, and blocking spam emails from reaching a user’s inbox.

In this analysis, the goal was to build models that recognize the digits 0 to 9 in images of handwriting. This type of classification can be quite difficult because handwriting styles vary: depending on how an individual writes, a 4 can plausibly look like a 7 or a 9, a 7 like a 1, and a 2 like a 5.

Although difficult, this type of analysis is also very useful. Given the number of transactions that banks and other financial institutions process every day, it is not surprising that they have implemented machine learning algorithms that read the handwriting on a cheque to reduce processing time. Similarly, institutions like the IRS process large numbers of handwritten tax documents every year. Accuracy therefore matters: imagine if these algorithms read 1,000,000 dollars as 7,999,999 dollars.

For this analysis, I developed naïve Bayes, decision tree, multinomial logistic regression, and support vector machine classifiers to determine which model yields the highest accuracy when identifying handwritten digits.

## Packages

In [50]:
import pandas as pd
import numpy as np
import collections

import random

from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.naive_bayes import GaussianNB

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

from sklearn.linear_model import LogisticRegression

from sklearn.svm import SVC

## Data Manipulation

### Data Loading

In [2]:
data_path = "C:/Users/trist/OneDrive/Desktop/Trist'n/School/Syracuse University/Q2 2021/IST707/Homework/Week 7/Kaggle-digit-train.csv"
initial_df = pd.read_csv(data_path)

In [3]:
initial_df.head()

Unnamed: 0,label,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [11]:
initial_df.iloc[:, 1:].head()

Unnamed: 0,pixel0,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel774,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Data Analysis

## Modeling

In [14]:
X = initial_df.iloc[:, 1:].values
y = initial_df.iloc[:, 0].values

In [15]:
# Split the data into training and testing sets.
# Note: test_size=0.75 holds out 75% of the rows for testing and trains on the remaining 25%.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.75, random_state=1234)

In [68]:
# print(pd.DataFrame(y_test))

distribution_of_y_test = collections.Counter(y_test)
print('There are {} items within the testing output.'.format(len(y_test)))
print(distribution_of_y_test.most_common(10))

There are 31500 items within the testing output.
[(1, 3508), (7, 3321), (3, 3309), (0, 3131), (2, 3118), (9, 3114), (6, 3085), (4, 3056), (8, 3014), (5, 2844)]


Therefore, within the test set:
- 11.14 percent of the data are `1`
- 10.54 percent of the data are `7`
- 10.50 percent of the data are `3`
- 9.94 percent of the data are `0`
- 9.90 percent of the data are `2`
- 9.89 percent of the data are `9`
- 9.79 percent of the data are `6`
- 9.70 percent of the data are `4`
- 9.57 percent of the data are `8`
- 9.03 percent of the data are `5`

For a model to be of sufficient value, it must perform better than:
- The random guess approach: since there are 10 digits, guessing each label with equal probability would be correct about 10 percent of the time.
- The 'classify everything as the largest group' approach: since `1` is the largest group, labelling every item as `1` would be correct 11.14 percent of the time.
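
The percentages and the two baselines above can be recomputed from the class counts printed earlier. The following is a minimal sketch that hard-codes those counts; the names `test_counts`, `class_share`, `random_guess_baseline`, and `majority_baseline` are illustrative and do not appear in the original notebook.

```python
from collections import Counter

# Class counts copied from the printed distribution of y_test above.
test_counts = Counter({1: 3508, 7: 3321, 3: 3309, 0: 3131, 2: 3118,
                       9: 3114, 6: 3085, 4: 3056, 8: 3014, 5: 2844})

total = sum(test_counts.values())

# Share of each digit among the test labels, as a percentage.
class_share = {digit: round(100 * count / total, 2)
               for digit, count in test_counts.items()}

# Baseline 1: uniform random guessing over the 10 digits.
random_guess_baseline = round(100 / len(test_counts), 2)

# Baseline 2: always predict the most frequent digit.
majority_digit, majority_count = test_counts.most_common(1)[0]
majority_baseline = round(100 * majority_count / total, 2)

print(total, random_guess_baseline, majority_digit, majority_baseline)
```

Any classifier worth keeping should clear the larger of the two baseline values.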

### Naive Bayes

Naïve Bayes is a probabilistic classifier that assigns each instance the label with the highest estimated probability. The term naïve comes from its strong assumption that the features are independent of one another given the class.
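
To make the independence assumption concrete, here is a small sketch of how a Gaussian naïve Bayes score can be computed by hand. It illustrates the idea only and is not scikit-learn's internal code; the two classes, their priors, means, and variances are hypothetical values chosen for the example.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log-density of a one-dimensional normal distribution."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def log_posterior(x, prior, means, variances):
    """Unnormalized log-posterior of one class.

    The 'naive' step: per-feature log-likelihoods are simply added,
    which is only valid if features are independent given the class.
    """
    return math.log(prior) + sum(
        gaussian_log_pdf(xi, m, v) for xi, m, v in zip(x, means, variances)
    )

# Hypothetical per-class parameters for a two-feature problem.
classes = {
    "a": {"prior": 0.5, "means": [0.0, 0.0], "variances": [1.0, 1.0]},
    "b": {"prior": 0.5, "means": [3.0, 3.0], "variances": [1.0, 1.0]},
}

sample = [2.5, 2.0]
scores = {label: log_posterior(sample, **params)
          for label, params in classes.items()}
predicted = max(scores, key=scores.get)
print(predicted)
```

The sample lies closer to the mean of class `b`, so that class receives the higher score.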

In [17]:
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

GaussianNB()

In [18]:
nb_y_predictions = nb_classifier.predict(X_test)

In [26]:
nb_confusion_matrix = pd.DataFrame(confusion_matrix(y_test, nb_y_predictions))
nb_accuracy = round(accuracy_score(y_test, nb_y_predictions) * 100, 3)

In [25]:
nb_confusion_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,2747,7,20,18,5,15,222,3,59,35
1,0,3323,12,15,3,8,54,1,59,33
2,270,109,818,325,18,67,899,9,560,43
3,184,251,34,1184,13,11,294,51,1009,278
4,62,58,30,15,636,15,445,30,472,1293
5,322,123,13,70,39,252,275,7,1502,241
6,25,70,23,3,3,13,2912,0,32,4
7,15,34,6,37,32,10,9,1429,70,1679
8,47,549,13,32,12,35,139,8,1715,464
9,14,45,11,7,17,3,3,63,66,2885


In [28]:
print('The accuracy score for the Naive Bayes classifier is ~{}%'.format(nb_accuracy))

The accuracy score for the Naive Bayes classifier is ~56.829%
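
The confusion matrix holds more detail than the single accuracy figure. As a rough sketch, the snippet below copies the Naive Bayes matrix from the output above (rows are true digits, columns are predicted digits, following scikit-learn's `confusion_matrix` convention) and derives the overall accuracy, the recall per digit, and the most frequent wrong prediction for each digit. The variable names are illustrative.

```python
# Naive Bayes confusion matrix copied from the output above
# (rows = true digit, columns = predicted digit).
nb_matrix = [
    [2747, 7, 20, 18, 5, 15, 222, 3, 59, 35],
    [0, 3323, 12, 15, 3, 8, 54, 1, 59, 33],
    [270, 109, 818, 325, 18, 67, 899, 9, 560, 43],
    [184, 251, 34, 1184, 13, 11, 294, 51, 1009, 278],
    [62, 58, 30, 15, 636, 15, 445, 30, 472, 1293],
    [322, 123, 13, 70, 39, 252, 275, 7, 1502, 241],
    [25, 70, 23, 3, 3, 13, 2912, 0, 32, 4],
    [15, 34, 6, 37, 32, 10, 9, 1429, 70, 1679],
    [47, 549, 13, 32, 12, 35, 139, 8, 1715, 464],
    [14, 45, 11, 7, 17, 3, 3, 63, 66, 2885],
]

total = sum(sum(row) for row in nb_matrix)
correct = sum(nb_matrix[d][d] for d in range(10))
accuracy = round(100 * correct / total, 3)

# Recall per digit: share of each true digit that was predicted correctly.
recall = {d: round(100 * nb_matrix[d][d] / sum(nb_matrix[d]), 1)
          for d in range(10)}

# Most frequent wrong prediction for each true digit.
top_confusion = {}
for d, row in enumerate(nb_matrix):
    wrong = [(count, pred) for pred, count in enumerate(row) if pred != d]
    top_confusion[d] = max(wrong)[1]

print(accuracy, recall, top_confusion)
```

For this matrix, true 4s and 7s are most often mislabelled as 9, which matches the introduction's remark about digits that look alike.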


Naive Bayes models:

- Are easy to implement.
- Are less likely to overfit.
- Are suitable for large data sets.

However, these models:

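- Are biased by the independence assumption, which rarely holds in practice (neighboring pixels in an image, for example, are strongly correlated).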

### Decision Tree

The goal of a decision tree is to predict the class or value of the target variable by learning simple decision rules inferred from the data. To predict the class label for a record, we start at the root of the tree, compare the record’s value for the root’s splitting attribute, follow the branch corresponding to the result, and repeat at the next node until a leaf is reached.

Decision trees can use several criteria to decide where to split a node into two or more sub-nodes. Each split is chosen to increase the homogeneity (purity) of the resulting sub-nodes with respect to the target variable: the tree evaluates candidate splits on all available variables and selects the one that produces the most homogeneous sub-nodes.
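
One common purity measure is Gini impurity, which is the default `criterion` of scikit-learn's `DecisionTreeClassifier`. The sketch below shows, on a made-up set of six labels, how a split that separates the classes cleanly scores lower (purer) than one that leaves them mixed. The helper names and the toy labels are illustrative only.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 0.0 for a pure node, higher for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_impurity(left, right):
    """Impurity of a split, weighting each child by its share of records."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)

# Toy node containing three 1s and three 7s.
parent = [1, 1, 1, 7, 7, 7]

clean_split = weighted_impurity([1, 1, 1], [7, 7, 7])   # perfectly separated
mixed_split = weighted_impurity([1, 1, 7], [1, 7, 7])   # still mixed

print(gini_impurity(parent), clean_split, mixed_split)
```

A tree using this criterion would prefer the clean split, because it reduces impurity the most.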

In [31]:
dt_classifier = DecisionTreeClassifier()
dt = dt_classifier.fit(X_train, y_train)

In [32]:
dt_y_predictions = dt.predict(X_test)

In [33]:
dt_confusion_matrix = pd.DataFrame(confusion_matrix(y_test, dt_y_predictions))
dt_accuracy = round(accuracy_score(y_test, dt_y_predictions) * 100, 3)

In [34]:
dt_confusion_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,2769,8,68,23,20,77,70,30,37,29
1,3,3277,42,35,10,21,19,18,65,18
2,66,73,2372,150,53,48,98,84,129,45
3,26,68,168,2456,42,207,39,66,171,66
4,18,48,39,28,2448,71,81,51,61,211
5,54,36,57,189,43,2123,108,40,113,81
6,63,27,55,23,89,85,2636,13,77,17
7,23,42,117,51,50,42,13,2782,33,168
8,38,116,139,176,80,107,81,43,2121,113
9,30,20,43,74,135,77,21,107,74,2533


In [35]:
print('The accuracy score for the Decision Tree classifier is ~{}%'.format(dt_accuracy))

The accuracy score for the Decision Tree classifier is ~81.006%


Decision tree classifiers:

- Are easy to understand and interpret.
- Can work with numerical and categorical features.
- Select features automatically.
- Require little data processing.

However, these models:

- Are prone to overfitting.

### Multinomial Logistic Regression

Multinomial logistic regression is a simple extension of binary logistic regression that allows for more than two categories of the dependent or outcome variable. Like binary logistic regression, multinomial logistic regression uses maximum likelihood estimation to evaluate the probability of categorical membership.
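
In the multinomial case, the model produces one linear score per class and converts those scores into probabilities with the softmax function. The following is a minimal sketch of that final step; the three scores are hypothetical and stand in for the per-class linear combinations of the pixel values.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear scores for three classes.
scores = [2.0, 0.5, 0.1]
probs = softmax(scores)

# The predicted class is the one with the highest probability.
predicted = probs.index(max(probs))
print(probs, predicted)
```

Maximum likelihood estimation then chooses the coefficients that make the observed labels most probable under these softmax outputs.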

In [37]:
# Note: recent scikit-learn releases expect penalty=None rather than the string 'none'.
mlr = LogisticRegression(random_state=0, multi_class='multinomial', penalty='none', solver='newton-cg').fit(X_train, y_train)

In [38]:
mlr_y_predictions = mlr.predict(X_test)

In [39]:
mlr_confusion_matrix = pd.DataFrame(confusion_matrix(y_test, mlr_y_predictions))
mlr_accuracy = round(accuracy_score(y_test, mlr_y_predictions) * 100, 3)

In [40]:
mlr_confusion_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,2865,0,36,16,2,94,74,10,25,9
1,0,3354,33,19,2,16,6,13,59,6
2,62,37,2554,151,33,42,58,36,119,26
3,32,29,103,2757,4,127,25,57,115,60
4,28,18,38,9,2598,19,63,61,46,176
5,85,31,53,144,53,2197,73,31,140,37
6,28,14,103,17,16,60,2802,6,33,6
7,8,14,73,47,45,8,4,2887,11,224
8,37,69,81,145,28,210,38,24,2287,95
9,25,12,17,49,172,28,0,171,56,2584


In [41]:
print('The accuracy score for the Multinomial Logistic Regression classifier is ~{}%'.format(mlr_accuracy))

The accuracy score for the Multinomial Logistic Regression classifier is ~85.349%


Multinomial logistic regressions: 

- Are easy to implement, interpret and train.
- Make no assumptions about the distributions of the classes.
- Are less likely to overfit.

However, these models:

- Assume a linear relationship between the independent variables and the log-odds of each class.
- Require little or no multicollinearity between independent variables.

### Support Vector Machine

The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space that distinctly separates the data points of different classes. For two classes, many separating hyperplanes are possible; the objective is to find the one with the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of each class. Maximizing the margin provides some robustness, so that future data points can be classified with more confidence.
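
The margin can be illustrated with a small worked example. For a hyperplane w·x + b = 0 scaled so that the closest points satisfy |w·x + b| = 1, the margin width is 2 / ||w||. The sketch below uses a hypothetical two-dimensional hyperplane and two toy points; it does not fit a model. scikit-learn's `SVC` extends this two-class idea to several classes by training one classifier per pair of classes.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted side."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Width of the margin for a canonically scaled hyperplane: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane x1 + x2 - 3 = 0 separating two toy points.
w, b = [1.0, 1.0], -3.0
negative_point = (1.0, 1.0)   # class -1, lies exactly on the margin
positive_point = (2.0, 2.0)   # class +1, lies exactly on the margin

print(decision_value(w, b, negative_point))
print(decision_value(w, b, positive_point))
print(margin_width(w))
```

Here the two points act as support vectors: they sit on the margin boundaries, and the margin width equals the distance between them.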

In [43]:
svm = SVC(kernel='linear').fit(X_train, y_train)

In [44]:
svm_y_predictions = svm.predict(X_test)

In [45]:
svm_confusion_matrix = pd.DataFrame(confusion_matrix(y_test, svm_y_predictions))
svm_accuracy = round(accuracy_score(y_test, svm_y_predictions) * 100, 3)

In [46]:
svm_confusion_matrix

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,3028,0,9,7,5,33,38,0,9,2
1,1,3435,12,15,1,7,3,6,24,4
2,26,29,2832,75,30,6,36,23,52,9
3,18,33,81,2948,1,113,8,28,44,35
4,13,12,40,1,2817,5,27,21,4,116
5,32,36,31,161,25,2438,41,8,54,18
6,33,7,58,2,12,53,2904,1,14,1
7,5,15,48,23,42,9,2,3040,7,130
8,19,57,60,139,18,135,18,22,2514,32
9,21,6,14,34,158,17,0,137,20,2707


In [47]:
print('The accuracy score for the Support Vector Machine classifier is ~{}%'.format(svm_accuracy))

The accuracy score for the Support Vector Machine classifier is ~90.994%


SVM models:

- Work well when there is a clear margin of separation.
- Are effective in high dimensional spaces.
- Are effective in cases where the number of dimensions is greater than the number of samples.
- Use a subset of training points in the decision function (called support vectors), so they are also memory efficient.

However, these models:

- Do not perform well on large data sets, because the required training time is higher.
- Do not perform very well when the data set is noisy, i.e. when the target classes overlap.

## Conclusion

After training, all four models were able to classify handwritten digits, three of them with accuracy above 80%. In fact, all models outperformed both the random guess approach and the 'classify everything as the largest group' approach.

The support vector machine classifier outperformed all other models, having an accuracy of ~91%.
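
As a short recap, the sketch below ranks the four reported accuracy scores and compares each with the stronger of the two baselines. The numbers are copied from the outputs above; the variable names are illustrative.

```python
# Accuracy scores (percent) reported by the four classifiers above.
model_accuracy = {
    "Naive Bayes": 56.829,
    "Decision Tree": 81.006,
    "Multinomial Logistic Regression": 85.349,
    "Support Vector Machine": 90.994,
}

# Baselines derived from the test-label distribution.
baselines = {"random guess": 10.0, "largest group": 11.14}
strongest_baseline = max(baselines.values())

# Rank models from most to least accurate.
ranked = sorted(model_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, accuracy in ranked:
    gain = accuracy - strongest_baseline
    print(f"{name}: {accuracy:.3f}% (+{gain:.2f} points over the best baseline)")

best_model = ranked[0][0]
all_beat_baselines = all(acc > strongest_baseline
                         for acc in model_accuracy.values())
```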