In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelBinarizer, LabelEncoder
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.linear_model import SGDClassifier, LogisticRegression

In [3]:
from neural_network import FNNClassifier

## Data Preprocessing

For this experiment we will use the Fashion MNIST dataset. It is like the MNIST digit dataset, but contains images of clothing and fashion accessories instead of handwritten digits. It is still a toy dataset; the only reason for using it over MNIST is that MNIST is too old, too boring and too easy.

### Load and preprocess Training Data

Let's load the data and see what is there. We won't go into too much detail about the data, because it's not our focus.

In [4]:
# Load training data from CSV
train_data = pd.read_csv('./data/fashion-mnist_train.csv')

In [5]:
train_data.shape

(60000, 785)

In [6]:
train_data.head()

Unnamed: 0,label,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
0,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,6,0,0,0,0,0,0,0,5,0,...,0,0,0,30,43,0,0,0,0,0
3,0,0,0,0,1,2,0,0,0,0,...,3,0,0,0,0,1,0,0,0,0
4,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The data is stored as a comma-separated values (CSV) file, with one row for each image. The first value is the label and the next 784 values are the pixel intensities. Each image is a $28 \times 28$ pixel monochrome image.
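
To make the row layout concrete, here is a minimal sketch that splits one row into its label and a $28 \times 28$ image. The row is synthetic (hypothetical values), and row-major pixel order is an assumption about the CSV rather than something this notebook verifies.

```python
import numpy as np

# Synthetic stand-in for one CSV row: label first, then 784 pixel intensities.
rng = np.random.default_rng(0)
row = np.concatenate(([9], rng.integers(0, 256, size=784)))

label = int(row[0])
# Assumes pixels are stored row by row (row-major order).
image = row[1:].reshape(28, 28)

print(label, image.shape)
```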

Now let's look at the labels.

In [7]:
train_data["label"].value_counts()

9    6000
8    6000
7    6000
6    6000
5    6000
4    6000
3    6000
2    6000
1    6000
0    6000
Name: label, dtype: int64

We have ten different classes, with exactly 6000 images in each class. This is ideal for training; however, it's rare to find such perfectly balanced data in the real world.

#### Features

Let's get the features

In [8]:
# Get features (X), which is everything except the column label
X_train = train_data[[c for c in train_data.columns if c not in {'label'}]].values

In [9]:
print("Feature value range is", X_train.min(), "to", X_train.max())

Feature value range is 0 to 255


As expected, each pixel ranges from 0 to 255. However, we will see that the exact range does not matter for us, even if it were different.

In [10]:
# Get label (y), which is the column label in here
y_train = train_data['label'].values

#### Scale the features

In [11]:
# Convert to float64
X_train = np.array(X_train, dtype=np.float64)

The range does not matter because we are going to scale the features anyway. Scaling is crucial for a feedforward network, or any other classifier trained with gradient descent.

Here we are using scikit-learn's StandardScaler. It scales each feature individually, which is not crucial in our case because all the features are pixel intensities on the same 0 to 255 scale. However, it is necessary when features have different scales. The standard scaler centers each feature, i.e. its mean becomes zero, and divides it by its standard deviation.

In [12]:
# Initialize standard scaler
scaler = StandardScaler()

*fit_transform* will fit the mean and standard deviation as well as scale the features.

In [13]:
# Scale using standard scaler
X_train = scaler.fit_transform(X_train)
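
As a rough illustration of what standardization computes, the sketch below applies the same formula by hand with NumPy on a tiny synthetic matrix (hypothetical values, not the Fashion MNIST data). The guard for constant features is an assumption made here so the example does not divide by zero; pixels that never change, such as an always-black corner, would hit that case.

```python
import numpy as np

# Tiny synthetic feature matrix: 4 samples, 3 features on different scales.
X = np.array([[0.0, 10.0, 255.0],
              [1.0, 20.0, 255.0],
              [2.0, 30.0, 255.0],
              [3.0, 40.0, 255.0]])

mean = X.mean(axis=0)                    # per-feature mean
std = X.std(axis=0)                      # per-feature population std (ddof=0)
std_safe = np.where(std == 0, 1.0, std)  # leave constant features unscaled

X_scaled = (X - mean) / std_safe
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After scaling, every feature has mean zero, and every non-constant feature has unit standard deviation.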

### Load and preprocess testing data

In our case the test data is processed the same way as the training data, except for scaling.

In [14]:
test_data = pd.read_csv('./data/fashion-mnist_test.csv')

In [15]:
test_data.shape

(10000, 785)

In [16]:
X_test = test_data[[c for c in test_data.columns if c not in {'label'}]].values

In [17]:
y_test = test_data['label'].values

In [18]:
X_test = np.array(X_test, dtype=np.float64)

Here we only use _transform_, so the scaling is done using the mean and standard deviation from the training data.

In [19]:
X_test = scaler.transform(X_test)
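
The difference between fitting and transforming can be sketched with plain NumPy. The arrays below are synthetic stand-ins (hypothetical data), and the manual formula is only meant to mirror the idea: statistics come from the training set alone and are then reused for the test set.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the pixel matrices (hypothetical data).
X_tr = rng.normal(loc=100.0, scale=50.0, size=(1000, 5))
X_te = rng.normal(loc=110.0, scale=50.0, size=(200, 5))

# "fit": learn statistics from the training data only.
mean_tr = X_tr.mean(axis=0)
std_tr = X_tr.std(axis=0)

# "transform": apply the same training statistics to both sets.
X_tr_scaled = (X_tr - mean_tr) / std_tr
X_te_scaled = (X_te - mean_tr) / std_tr

print(X_tr_scaled.mean(axis=0).round(3))  # zero up to rounding
print(X_te_scaled.mean(axis=0).round(3))  # generally not zero
```

Reusing the training statistics keeps the test set on the scale the model was trained on, and avoids leaking information about the test data into preprocessing.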

### Training

Now let's get to the meat of the process: the actual training on the prepared data. We will compare three different classifiers.

#### Feedforward Network Baseline

This is a baseline feedforward network with all the default parameters. The class extends scikit-learn's _BaseEstimator_ class and behaves like most classifier classes in scikit-learn. However, in the backend it uses _TensorFlow (Keras API)_, which enables fast training using state-of-the-art algorithms.

The user can stay completely abstracted from the technical details of the classifier, if they choose to. Most importantly, the user does not need to create the _computation graph_. Here, you can see that the classifier is defined without any parameters.

In [20]:
fnn_baseline = FNNClassifier()

_DeepDummies_ provides sensible defaults for all necessary parameters. The user does not need to worry about them (except for a few; we will soon see why it pays off _to worry_ about those few).

Here you can see the full list of hyperparameters _DeepDummies_ has generated.

In [21]:
fnn_baseline.get_params()

{'activation': 'auto',
 'batch_size': 128,
 'callbacks': [<tensorflow.python.keras.callbacks.EarlyStopping at 0x17a8927aba8>],
 'class_weight': None,
 'dropout': [0.5],
 'early_stopping': 5,
 'epochs': 100,
 'gradient_clipping_norm': None,
 'gradient_clipping_value': None,
 'hidden_layers': [50],
 'l1_penalty': 0.0,
 'l2_penalty': 0.0,
 'learning_rate': 'auto',
 'loss': 'crossentropy',
 'metrics': ['accuracy'],
 'optimizer': 'adam',
 'timeit': True,
 'validation_data': None,
 'validation_split': 0.1,
 'verbosity': 2}

Let the fun begin! We just need to call the _fit_ method, like in scikit-learn.

In [22]:
fnn_baseline.fit(X_train, y_train)

Data size (60000, 784) -	 Epochs 100 -	 Batch Size 128
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Hidden_1 (Dense)             (None, 50)                39250     
_________________________________________________________________
Dropout_1_0.5 (Dropout)      (None, 50)                0         
_________________________________________________________________
Output_softmax (Dense)       (None, 10)                510       
Total params: 39,760
Trainable params: 39,760
Non-trainable params: 0
_________________________________________________________________
Train on 54000 samples, validate on 6000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Fit complete in 48.03 seconds
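
The parameter counts in the summary above can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers add no parameters. The helper names below are hypothetical and are not part of _DeepDummies_.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

def count_params(n_features, hidden_layers, n_classes):
    """Per-layer parameter counts for hidden layers followed by an output layer."""
    sizes = [n_features, *hidden_layers, n_classes]
    return [dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

counts = count_params(784, [50], 10)
print(counts, sum(counts))  # [39250, 510] 39760
```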


Let's check the accuracy and loss on the testing data. We can use the _score_ method, which returns a dictionary.

In [23]:
fnn_baseline.score(X_test, y_test)



{'accuracy': 0.8739, 'loss': 0.35596662336587903}

We can also predict the labels and get the accuracy with scikit-learn's _accuracy_score_ function. As said before, this class is designed to be fully compatible with scikit-learn's API.

In [24]:
accuracy_score(y_test, fnn_baseline.predict(X_test))

0.8739
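
For intuition, accuracy is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels (hypothetical values):

```python
import numpy as np

y_true = np.array([0, 1, 2, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1])

# Accuracy is the fraction of positions where prediction and label agree.
accuracy = np.mean(y_pred == y_true)
print(accuracy)  # 0.8
```

Because the comparison is symmetric, swapping the two arguments does not change the value, although scikit-learn documents the order as `accuracy_score(y_true, y_pred)`.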

#### Feedforward Network Custom

We said before that it pays off to 'tweak' some hyperparameters. _DeepDummies_ gives you the freedom to tweak most of them, while keeping things simple and providing sensible defaults.

Let's play with them. One of the most "unreasonable" defaults in the previous classifier was the number of units in the hidden layer, set to 50. With 784 features it is worth increasing that number, so let's change it to 250. Notice that we provide a list as the value. The number of elements in the list corresponds to the number of hidden layers. For example, [250, 100] will create a graph with 2 hidden layers, the first with 250 units and the second with 100 units. More details can be found in the _API Reference_ document.
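
To illustrate how a list such as [250, 100] turns into a stack of layers, here is a minimal NumPy sketch of a forward pass. It is not the _DeepDummies_ implementation: the helper names are hypothetical, and ReLU hidden activations with a softmax output are assumptions, since the notebook only shows `'activation': 'auto'`.

```python
import numpy as np

def init_layers(n_features, hidden_layers, n_classes, seed=0):
    """Create one (weights, bias) pair per layer; hypothetical helper."""
    rng = np.random.default_rng(seed)
    sizes = [n_features, *hidden_layers, n_classes]
    return [(rng.normal(scale=0.01, size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(X, layers):
    """ReLU on hidden layers (assumed), softmax on the output layer."""
    h = X
    for W, b in layers[:-1]:
        h = np.maximum(0.0, h @ W + b)
    W, b = layers[-1]
    logits = h @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

layers = init_layers(784, [250, 100], 10)
probs = forward(np.random.default_rng(1).normal(size=(4, 784)), layers)
print([W.shape for W, _ in layers])  # [(784, 250), (250, 100), (100, 10)]
print(probs.shape)                   # (4, 10)
```

Each element of the list adds one weight matrix, and the output of one layer becomes the input of the next.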

We are also increasing dropout a bit, from 0.5 to 0.6, to avoid overfitting due to the additional units. The early-stopping setting is raised from 5 to 10 as well. These hyperparameters were set by intuition followed by brief trial and error.
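
As a rough sketch of what a dropout rate of 0.6 means, the snippet below implements inverted dropout in NumPy: a random 60% of the units are zeroed during training and the survivors are rescaled so the expected activation stays the same. The function is a hypothetical illustration, not code taken from _DeepDummies_ or Keras.

```python
import numpy as np

def dropout(h, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    keep = rng.random(h.shape) >= rate
    return h * keep / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones((1000, 250))          # stand-in for hidden activations
h_drop = dropout(h, rate=0.6, rng=rng)

print((h_drop == 0).mean())       # roughly 0.6
print(h_drop.mean())              # roughly 1.0, the expected value is preserved
```

Dropout is only active during training; at prediction time all units are kept.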

In [25]:
fnn = FNNClassifier(hidden_layers=[250], 
                    dropout=0.6,
                    early_stopping=10)
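
The sketch below shows one common way a setting like `early_stopping` works. It assumes the value is used as the patience of the Keras EarlyStopping callback listed in the parameters, and that validation loss is the monitored quantity; both are assumptions about _DeepDummies_, and the loss values are made up.

```python
def stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop.

    Stops once the validation loss has failed to improve for
    `patience` consecutive epochs; otherwise runs all epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Hypothetical validation losses: best value at epoch 3, then no improvement.
losses = [0.50, 0.42, 0.40, 0.41, 0.43, 0.40, 0.44, 0.45]
print(stopping_epoch(losses, patience=5))  # 8
```

A larger patience gives the network more epochs to recover from a temporary plateau, at the cost of longer training.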

In [26]:
fnn.get_params()

{'activation': 'auto',
 'batch_size': 128,
 'callbacks': [<tensorflow.python.keras.callbacks.EarlyStopping at 0x17a88a15320>],
 'class_weight': None,
 'dropout': [0.6],
 'early_stopping': 10,
 'epochs': 100,
 'gradient_clipping_norm': None,
 'gradient_clipping_value': None,
 'hidden_layers': [250],
 'l1_penalty': 0.0,
 'l2_penalty': 0.0,
 'learning_rate': 'auto',
 'loss': 'crossentropy',
 'metrics': ['accuracy'],
 'optimizer': 'adam',
 'timeit': True,
 'validation_data': None,
 'validation_split': 0.1,
 'verbosity': 2}

In [27]:
fnn.fit(X_train, y_train)

Data size (60000, 784) -	 Epochs 100 -	 Batch Size 128
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Hidden_1 (Dense)             (None, 250)               196250    
_________________________________________________________________
Dropout_1_0.6 (Dropout)      (None, 250)               0         
_________________________________________________________________
Output_softmax (Dense)       (None, 10)                2510      
Total params: 198,760
Trainable params: 198,760
Non-trainable params: 0
_________________________________________________________________
Train on 54000 samples, validate on 6000 samples
Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
Epoch 10/100
Epoch 11/100
Epoch 12/100
Epoch 13/100
Epoch 14/100
Epoch 15/100
Epoch 16/100
Epoch 17/100
Epoch 18/100
Epoch 19/100
Epoch 20/100
Epoch 21/100
Epoch 22/100
Epoch 23/100
Epoch 

In [28]:
fnn.score(X_test, y_test)



{'accuracy': 0.8958, 'loss': 0.31569191516041756}

In [29]:
accuracy_score(y_test, np.argmax(fnn.predict_proba(X_test), axis=1))

0.8958
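
The cell above takes the `argmax` over the output of _predict_proba_ to recover labels. A small sketch with made-up probabilities (hypothetical values) shows the idea. Taking the column index as the label works here because the labels are the integers 0 to 9; in general the index should be mapped back through the list of classes.

```python
import numpy as np

# Hypothetical class probabilities for 3 samples and 4 classes.
proba = np.array([[0.1, 0.7, 0.1, 0.1],
                  [0.6, 0.2, 0.1, 0.1],
                  [0.2, 0.2, 0.1, 0.5]])
classes = np.array([0, 1, 2, 3])   # labels in column order

column_index = np.argmax(proba, axis=1)   # most probable column per row
labels = classes[column_index]            # map columns back to labels
print(labels)  # [1 0 3]
```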

_Voilà!_ With a little tweaking we gained about 2.2 percentage points of accuracy (0.8739 to 0.8958), a relative improvement of about 2.5%.
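
The size of the improvement depends on how it is measured. The short calculation below uses the two test accuracies reported above and expresses the gain in three ways.

```python
baseline_acc = 0.8739
custom_acc = 0.8958

absolute_gain = custom_acc - baseline_acc             # in accuracy units
relative_gain = absolute_gain / baseline_acc          # relative to the baseline
error_reduction = absolute_gain / (1 - baseline_acc)  # share of errors removed

print(f"{absolute_gain * 100:.2f} percentage points")  # 2.19 percentage points
print(f"{relative_gain * 100:.2f}% relative")          # 2.51% relative
print(f"{error_reduction * 100:.1f}% fewer errors")    # 17.4% fewer errors
```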

#### SGD Classifier

Now let's compare it with scikit-learn's SGDClassifier. You can think of this one as an FNN without a hidden layer (or of an FNN as an SGD classifier with hidden layers). Roughly speaking, it is a linear classifier in the same family as Logistic Regression that supports multiclass classification. With the default hinge loss used here it actually fits a linear SVM, so the comparison is rough and practical rather than technically exact.

Take the time to compare this one with the two classifiers used above. You will see that they are used in almost exactly the same way.

In [30]:
sgd = SGDClassifier(max_iter=1000, tol=1e-3)

In [31]:
sgd.get_params()

{'alpha': 0.0001,
 'average': False,
 'class_weight': None,
 'epsilon': 0.1,
 'eta0': 0.0,
 'fit_intercept': True,
 'l1_ratio': 0.15,
 'learning_rate': 'optimal',
 'loss': 'hinge',
 'max_iter': 1000,
 'n_iter': None,
 'n_jobs': 1,
 'penalty': 'l2',
 'power_t': 0.5,
 'random_state': None,
 'shuffle': True,
 'tol': 0.001,
 'verbose': 0,
 'warm_start': False}
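
The parameter listing above shows `'loss': 'hinge'`. As a rough illustration of how that differs from the logistic loss used by Logistic Regression, the sketch below evaluates both on a few margins, where the margin is the true label (as -1 or +1) times the linear score. The function names are hypothetical.

```python
import numpy as np

def hinge_loss(margin):
    """Hinge loss on the margin y * score, with y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - margin)

def log_loss(margin):
    """Logistic loss on the same margin."""
    return np.log1p(np.exp(-margin))

margins = np.array([-1.0, 0.0, 1.0, 2.0])
print(hinge_loss(margins))  # [2. 1. 0. 0.]
print(log_loss(margins))
```

The hinge loss drops to exactly zero once a sample is classified with a margin of at least 1, while the logistic loss keeps decreasing but never reaches zero.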

In [32]:
y_train = train_data['label'].values

In [33]:
sgd.fit(X_train, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=1000, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=0.001, verbose=0, warm_start=False)

In [34]:
y_test = test_data['label'].values

In [35]:
accuracy_score(y_test, sgd.predict(X_test))

0.8372

But the results are not identical. With the hidden layer we got much better accuracy (0.8958 versus 0.8372) for the same human effort.