# Introduction
In this short notebook I experiment with PCA and a neural network implemented with the Keras library.
I would also like to illustrate how PCA decomposition works on this dataset. Most importantly, I hope to get some feedback on how to improve this method, or pointers to a completely different approach.

Thanks to the other authors for publishing their notebooks. I reused some parts of their code while looking for solutions to problems I ran into along the way.

# Loading and normalizing data

In [None]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

from keras.models import Sequential
from keras.utils import to_categorical
from keras.layers import Dense, Dropout, GaussianNoise, Conv1D
from keras.preprocessing.image import ImageDataGenerator

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
test = pd.read_csv('../input/test.csv')
train = pd.read_csv('../input/train.csv')

Y_train = train['label'].values.astype('int32')
# One-hot encode the labels (digit 3 -> 1 at index 3)
Y_train = to_categorical(Y_train)
train.drop(['label'], axis=1, inplace=True)

X_train = (train.values).astype('float32')
X_test = (test.values).astype('float32')

After reading the data from CSV, I create NumPy arrays from the pandas DataFrames: X_train and X_test hold the pixel data, and Y_train holds the labels. The labels, originally digits from 0 to 9, need to be converted to categorical (one-hot) vectors because the network used later has one output per class.

It looks like this now:

In [None]:
print('Y_train value form: {}'.format(Y_train[1]))
print('Which is 0 (1 in [0] position of the vector).')
plt.imshow(X_train[1].reshape(28,28))
plt.show()
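
The one-hot conversion described above can be sketched in plain NumPy. This is only an illustration of the idea, not the Keras helper itself, and the `labels` array is made up:

```python
import numpy as np

# Hypothetical labels, standing in for train['label'].values
labels = np.array([1, 0, 4, 9])

# Row i of the 10x10 identity matrix is the one-hot vector for digit i
one_hot = np.eye(10, dtype='float32')[labels]

print(one_hot[1])               # label 0 -> 1.0 in position 0
print(one_hot.argmax(axis=1))   # argmax recovers the original labels
```

Taking `argmax` over each row reverses the encoding, which is also how class probabilities from the network can be turned back into digits.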

Each image has 28x28 pixels, so the data has 784 features. That is not much data for a modern computer, but there are cases where dimensionality reduction matters. For further processing, 100 features that retain most of the information are preferable to almost 800 features.

Before using PCA to reduce the dimensionality of the data, I standardize it with sklearn's StandardScaler. The scaler is fitted on the training data only, because I assume I know nothing about the test data. Both datasets are then transformed with it.

In [None]:
scaler = StandardScaler()
scaler.fit(X_train)
X_sc_train = scaler.transform(X_train)
X_sc_test = scaler.transform(X_test)
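
To show what "fit on train, transform both" means, here is a rough NumPy sketch of the same idea. The arrays are random stand-ins for the real data, and the constant third column is an assumption added to mimic pixels that never change:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up stand-ins for X_train / X_test: 3 features, the last one constant
train_data = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
train_data[:, 2] = 0.0
test_data = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
test_data[:, 2] = 0.0

# The statistics come from the training set only
mean = train_data.mean(axis=0)
std = train_data.std(axis=0)
std[std == 0] = 1.0  # constant features: avoid dividing by zero

train_sc = (train_data - mean) / std
test_sc = (test_data - mean) / std  # reuse the training statistics

print(train_sc.mean(axis=0).round(3), train_sc.std(axis=0).round(3))
```

The scaled test set will not have exactly zero mean and unit variance, and that is expected: it is described in terms of the training distribution.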

# PCA decomposition

Before I use the data with reduced dimensionality, I would like to show what this is about.

I arbitrarily set the number of components to 500. The right value depends on the data; for this kind of visualization the number of components should be close to the number of original features. The transformation takes longer as the number of components grows.

In [None]:
pca = PCA(n_components=500)
pca.fit(X_train)

plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')

In the plot above, the cumulative explained variance is already very high near 100 components and increases only slowly after that. This means that most of the variation in the data is captured by the first 100 components. The trade-offs need to be evaluated before choosing how many components to keep. I chose 100 to see how it works, since that seems to retain most of the information.

In [None]:
NCOMPONENTS = 100

pca = PCA(n_components=NCOMPONENTS)
X_pca_train = pca.fit_transform(X_sc_train)
X_pca_test = pca.transform(X_sc_test)
pca_std = np.std(X_pca_train)

print(X_sc_train.shape)
print(X_pca_train.shape)
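
Instead of reading the number of components off the plot, it could also be picked from a variance threshold. The sketch below uses made-up ratios in place of `pca.explained_variance_ratio_`, and the `target` value is a hypothetical choice:

```python
import numpy as np

# Made-up explained variance ratios, sorted in decreasing order like
# pca.explained_variance_ratio_
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.02])
cumulative = np.cumsum(ratios)

target = 0.85  # hypothetical threshold, pick what suits the problem
# Index of the first cumulative value that reaches the target, plus one
# because component counts start at 1
n_components = int(np.searchsorted(cumulative, target) + 1)

print(n_components)  # 4 components are needed to pass 0.85 here
```

scikit-learn's PCA can also take a fraction between 0 and 1 as `n_components` and make a similar selection itself.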

PCA decomposition can be inverted. Let's see how images look after reconstructing them from 100 components.

In [None]:
inv_pca = pca.inverse_transform(X_pca_train)
inv_sc = scaler.inverse_transform(inv_pca)

In [None]:
def side_by_side(indexes):
    org = X_train[indexes].reshape(28,28)
    rec = inv_sc[indexes].reshape(28,28)
    pair = np.concatenate((org, rec), axis=1)
    plt.figure(figsize=(4,2))
    plt.imshow(pair)
    plt.show()
    
for index in range(0,10):
    side_by_side(index)

After inverting the PCA and scaler transforms, I plotted the images side by side. The reconstructed images are of lower quality than the originals, but the digits are still clearly visible. The quality depends on the number of components.
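
The loss of quality can also be measured rather than judged by eye. Below is a small sketch on made-up data, with PCA written out via NumPy's SVD instead of sklearn, that computes the mean squared reconstruction error for different numbers of components:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 300 samples, 20 features driven by 5 latent factors plus noise
latent = rng.normal(size=(300, 5))
mixing = rng.normal(size=(5, 20))
data = latent @ mixing + 0.1 * rng.normal(size=(300, 20))

mean = data.mean(axis=0)
centered = data - mean
# Rows of vt are the principal directions, ordered by explained variance
_, _, vt = np.linalg.svd(centered, full_matrices=False)

def reconstruction_mse(k):
    """Project onto the first k components, invert, and measure the error."""
    components = vt[:k]
    projected = centered @ components.T        # the transform step
    restored = projected @ components + mean   # the inverse transform step
    return float(np.mean((data - restored) ** 2))

errors = {k: reconstruction_mse(k) for k in (1, 5, 20)}
print(errors)
```

The error shrinks as components are added and is essentially zero when all of them are kept, which matches what the side-by-side images suggest.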

# Neural network with Keras

I implemented a simple multilayer perceptron (MLP) neural network using Keras and experimented with it.

Using the library is simple. First you create a model instance and then add layers with the model.add() method. The first layer needs to be set up with the proper input dimension. The output layer needs the proper output dimension and activation function. Hidden layers can be added in between.

During compilation, the loss function, optimizer and metrics need to be set depending on the problem.

In [None]:
model = Sequential()
layers = 1
units = 128

model.add(Dense(units, input_dim=NCOMPONENTS, activation='relu'))
model.add(GaussianNoise(pca_std))
for i in range(layers):
    model.add(Dense(units, activation='relu'))
    model.add(GaussianNoise(pca_std))
    model.add(Dropout(0.1))
model.add(Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['categorical_accuracy'])

model.fit(X_pca_train, Y_train, epochs=100, batch_size=256, validation_split=0.15, verbose=2)

I experimented with the parameters of compilation (metrics and loss), fitting (split, epochs, batch_size) and network construction (number of layers, units, layer types). Training time changed, but all in all the predictions I got seem to be about the same in the end. The code above produces predictions with 0.97+ accuracy.
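
To make the output layer and loss a bit more concrete, here is a NumPy sketch of what a softmax output combined with categorical cross-entropy computes for a single image. The `logits` values are made up, and details that Keras handles internally (batching, clipping of probabilities) are left out:

```python
import numpy as np

def softmax(logits):
    # Subtracting the max keeps exp() stable and does not change the result
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def categorical_crossentropy(one_hot_target, probabilities):
    # Only the probability assigned to the true class contributes
    return float(-np.sum(one_hot_target * np.log(probabilities)))

# Made-up raw outputs of the last layer for one image, 10 classes
logits = np.array([2.0, 0.5, 0.1, 0.0, -1.0, 0.3, 0.2, -0.5, 1.0, 0.0])
target = np.zeros(10)
target[0] = 1.0  # the true class is digit 0

probs = softmax(logits)
loss = categorical_crossentropy(target, probs)
print(probs.argmax(), loss)
```

The loss gets smaller as the probability of the true class approaches 1, which is what training pushes towards.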

In [None]:
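# predict_classes() is not available in recent Keras releases;
# argmax over the predicted class probabilities gives the same labels
predictions = np.argmax(model.predict(X_pca_test, verbose=0), axis=1)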

def write_predictions(predictions, fname):
    pd.DataFrame({"ImageId": list(range(1,len(predictions)+1)), "Label": predictions}).to_csv(fname, index=False, header=True)

write_predictions(predictions, "pca-keras-mlp.csv")

# Conclusion

After experimenting with the parameters of the models, I reached 0.97+ accuracy on part of the test set, but I have been unable to improve on it.

Because the dataset is built from images of handwritten digits, a bigger training set could help. Maybe I need to change the approach: skip PCA decomposition altogether and use ImageDataGenerator to generate more images, or use convolutional layers.
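
As a rough idea of what generating more images could look like, here is a sketch that creates shifted copies of an image with plain NumPy. This is a simple stand-in for ImageDataGenerator, not its API; `shift_image` is a hypothetical helper and the test image is made up:

```python
import numpy as np

def shift_image(image, dy, dx):
    """Shift a 2D image by (dy, dx) pixels, filling exposed edges with zeros."""
    h, w = image.shape
    shifted = np.zeros_like(image)
    # Part of the original that stays inside the frame after the shift
    src = image[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    shifted[max(0, dy):max(0, dy) + src.shape[0],
            max(0, dx):max(0, dx) + src.shape[1]] = src
    return shifted

# Made-up 28x28 image with a single bright pixel
image = np.zeros((28, 28), dtype='float32')
image[10, 12] = 1.0

# The original plus four one-pixel shifts gives five training samples
offsets = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
augmented = np.stack([shift_image(image, dy, dx) for dy, dx in offsets])
print(augmented.shape)  # (5, 28, 28)
```

If this were combined with the pipeline above, the new images would have to be flattened and passed through the same scaler and PCA before training.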

If you know how the above result can be (easily?) improved, please leave a comment with your suggestion. As a data science newbie I would be grateful for any suggestions :)