# Task: Decision Tree on the MNIST dataset

Mimic the steps in the iris example. Use a decision tree again to train a classification model to identify handwritten digits using the popular MNIST dataset.

In [None]:
from sklearn.datasets import fetch_openml
from matplotlib import pyplot as plt
import os
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report


## Data Acquisition and Preprocessing
The MNIST data is fetched from OpenML. To avoid downloading it on every run, we convert it once into our preferred form, a feature matrix `X` and a label vector `y`, and cache both as NumPy files. If the cache files exist, we load the preprocessed data from disk instead.


In [None]:
X = None
y = None
# re-download unless both cache files are present
if not (os.path.exists("mnist_X.npy") and os.path.exists("mnist_y.npy")):
    print("Acquiring and preprocessing")
    # as_frame=False returns NumPy arrays, so X[j] later selects one image (row)
    mnist = fetch_openml('mnist_784', as_frame=False)
    X = mnist.data.astype(np.uint8)  # grayscale values arrive as floats; store them as 0 .. 255
    y = np.array([int(t) for t in mnist.target])  # targets arrive as strings
    np.save("mnist_X.npy", X)
    np.save("mnist_y.npy", y)
else:
    print("Using cache. Delete mnist_*.npy to reacquire")
    X = np.load("mnist_X.npy")
    y = np.load("mnist_y.npy")
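
The caching pattern described above can also be wrapped in a small helper. The following is only a sketch under stated assumptions: `load_or_build` and `fake_mnist` are hypothetical names that do not appear in this notebook, random data stands in for the real download, and a single `.npz` file is used so that `X` and `y` cannot get out of sync on disk.

```python
import os
import tempfile

import numpy as np


def load_or_build(cache_path, build_fn):
    """Return (X, y) from cache_path if it exists, else build and cache them.

    Hypothetical helper for illustration; build_fn must return (X, y).
    """
    if os.path.exists(cache_path):
        with np.load(cache_path) as cached:
            return cached["X"], cached["y"]
    X, y = build_fn()
    X, y = np.asarray(X), np.asarray(y)
    # one archive holds both arrays, so the cache is either complete or absent
    np.savez_compressed(cache_path, X=X, y=y)
    return X, y


def fake_mnist():
    """Stand-in for the real download: 10 random 'images' with random labels."""
    rng = np.random.default_rng(0)
    X = rng.integers(0, 256, size=(10, 784), dtype=np.uint8)
    y = rng.integers(0, 10, size=10)
    return X, y


cache_file = os.path.join(tempfile.mkdtemp(), "mnist_cache.npz")
X_first, y_first = load_or_build(cache_file, fake_mnist)   # builds and saves
X_again, y_again = load_or_build(cache_file, fake_mnist)   # loads from disk
print(X_first.shape, X_first.dtype, np.array_equal(X_first, X_again))
```

In the notebook itself, `build_fn` would be a function that calls `fetch_openml` and performs the type conversions shown in the cell above.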

## Show a few random images

Be careful: data loading code is often wrong at first. Whenever you train a model, at least take a quick look at the model input to check that it is sound. Plotting a few random samples is a good start. Here, each title is the class label and the image is the corresponding digit.

In [None]:
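# show six random images, each titled with its label
vis = np.random.choice(X.shape[0], size=6, replace=False)
print(vis)

for i, j in enumerate(vis):
    plt.subplot(1, 6, i + 1)
    plt.imshow(X[j].reshape(28, 28), cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title(y[j])
    plt.axis('off')  # pixel coordinates carry no information here
plt.show()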
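
Besides looking at images, a few numeric checks can catch loading mistakes early. The sketch below is one possible set of checks, not a complete validation: `summarize_inputs` is a hypothetical helper, and the demo arrays are synthetic stand-ins with the same layout as `X` and `y`.

```python
import numpy as np


def summarize_inputs(X, y, n_classes=10):
    """Check basic MNIST-style invariants and return the number of samples per class.

    Hypothetical helper for illustration only.
    """
    X = np.asarray(X)
    y = np.asarray(y)
    assert X.ndim == 2 and X.shape[1] == 28 * 28, "expected flattened 28x28 images"
    assert X.shape[0] == y.shape[0], "X and y must have the same number of rows"
    assert X.min() >= 0 and X.max() <= 255, "pixel values should lie in 0 .. 255"
    assert y.min() >= 0 and y.max() < n_classes, "labels should be digits 0 .. 9"
    return np.bincount(y, minlength=n_classes)


# synthetic stand-in data: 100 random images, labels cycling through 0 .. 9
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 256, size=(100, 784), dtype=np.uint8)
y_demo = np.arange(100) % 10

counts = summarize_inputs(X_demo, y_demo)
print(counts)
```

Calling `summarize_inputs(X, y)` on the real data would show whether the classes are roughly balanced, which matters when reading the accuracy later.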


## Train Test Split
Data mining requires a training set to fit the model parameters and a test set to estimate performance on unseen data. For MNIST, both can be created with a simple random split, as provided by `train_test_split` in scikit-learn.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)  # 70% data for training, 30% data for testing
print("# data points for training:", len(X_train))
print("# data points for testing:", len(X_test))
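
A plain random split gives different results on every run and does not guarantee equal class proportions in both parts. As a sketch of one alternative, `train_test_split` accepts `random_state` for reproducibility and `stratify` to preserve class proportions. Assumption: to keep the example fast and offline, scikit-learn's small bundled `load_digits` dataset (8x8 images) stands in for MNIST, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Assumption: load_digits (8x8 digit images shipped with scikit-learn) replaces MNIST here
X_small, y_small = load_digits(return_X_y=True)

# stratify=y_small keeps the share of each digit equal in both parts;
# random_state fixes the shuffle so the split is reproducible
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_small, y_small, test_size=0.3, stratify=y_small, random_state=0
)

train_share = np.bincount(ys_train) / len(ys_train)
test_share = np.bincount(ys_test) / len(ys_test)
print(np.round(train_share, 3))
print(np.round(test_share, 3))
```

The two printed vectors should be nearly identical, which is the point of stratification.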

## Train the Classifier

In [None]:
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

## Evaluate the classifier

In [None]:
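# get predictions from the trained model (dtc)
pred_train = dtc.predict(X_train)
pred_test = dtc.predict(X_test)
# compare training and test accuracy; a large gap indicates overfitting
print("train accuracy:", accuracy_score(y_train, pred_train))
print("test accuracy:", accuracy_score(y_test, pred_test))
# per-class precision, recall and F1; the argument order is (y_true, y_pred)
print(classification_report(y_test, pred_test))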
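
A confusion matrix complements the report by showing which digits are mistaken for which. The following is a minimal sketch, again assuming the small bundled `load_digits` dataset as a stand-in for MNIST so that it runs offline; the variable names are illustrative and the numbers it prints are not the MNIST results.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: load_digits (8x8 images) stands in for MNIST
Xd, yd = load_digits(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    Xd, yd, test_size=0.3, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(Xd_train, yd_train)
yd_pred = tree.predict(Xd_test)

# both metrics take the true labels first and the predictions second
acc = accuracy_score(yd_test, yd_pred)
cm = confusion_matrix(yd_test, yd_pred)  # rows: true digit, columns: predicted digit

print("test accuracy:", round(acc, 3))
print(cm)
```

The diagonal of `cm` counts correct predictions, so its sum divided by the total equals the accuracy; large off-diagonal entries point to pairs of digits the tree confuses.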

## Conclude
This seems to show that a decision tree reaches about 87% on average, which is quite good but not good enough. This is where Deep Learning will bring a real boost.