In this notebook we will see how to reduce the dimensionality of the MNIST dataset with two well-known techniques:
* PCA 
* T-SNE

Then we'll use the first two components as standalone new features and compare the performance obtained on:
* original data
* original data + 2 first pca components
* original + 2 first t-sne vectors

The models that we'll be using:
* RandomForestClassifier
* KNeighborsClassifier
* GaussianNB


In [None]:
import numpy as np 
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import plotly.express as px
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

## DataViz

In [None]:
# Load the data
train = pd.read_csv("../input/digit-recognizer/train.csv")
test = pd.read_csv("../input/digit-recognizer/test.csv")
print("train shape", train.shape)
print("test shape", test.shape)

Let's take a look at the distribution of the classes

In [None]:
X_train = train.drop('label', axis=1)
y_train = train["label"]

# free some space
del train

# pass the labels explicitly as x; newer seaborn versions expect keyword arguments here
g = sns.countplot(x=y_train)

In [None]:
index = 7
print("real value : ", y_train[index])
digit = X_train.loc[index,:]

#reshape to a squared shape
reshape_size = int(np.sqrt(len(digit)))
digit = digit.values.reshape(reshape_size,reshape_size)
#imshow to plot raw digit 
plt.imshow(digit,cmap='binary')
plt.show()

### PCA: Plot a projection on the first 2 principal axes:

The main idea behind PCA is to find out projection vectors called principal axes so that a **maximum ratio of original variance is preserved**.
The corresponding vectors minimize the mean squared distance between the original dataset and its projection onto these axes.

General approach:
#### 1. Find the principal axes:
This is done with the Singular Value Decomposition (SVD), which factorizes `X_train` into three matrices $U \Sigma V^T$, where the columns of $V$ are the principal components we are looking for. These vectors constitute a new orthonormal basis that will be used as the projection basis.

#### 2. Project the training data onto the principal component basis
Once the principal components have been identified, dimensionality reduction is performed by projecting the original dataset onto the hyperplane defined by the first r principal components, which preserves as much variance as possible.
Mathematically, this consists of applying a **dot product** between the training set matrix $X_{train}$ and the matrix $V_r$, defined as the matrix containing the first r principal components (the first r columns of $V$).
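
The two steps above can be sketched with plain NumPy. This is only an illustration on random toy data (an assumption, not MNIST), and the names `X`, `V_r` and `X_proj` are chosen here for the example; the notebook itself relies on scikit-learn's `PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data standing in for X_train: 200 samples, 5 correlated features (assumed)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# 1. Center the data, then decompose it: X_centered = U @ diag(S) @ Vt
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 2. Keep the first r principal axes (rows of Vt, i.e. columns of V) and project
r = 2
V_r = Vt[:r].T                 # shape (5, r)
X_proj = X_centered @ V_r      # shape (200, r)

# Share of the total variance carried by each principal axis
explained_ratio = (S ** 2) / np.sum(S ** 2)
print(X_proj.shape, explained_ratio[:r].sum())
```

The singular values come back sorted in decreasing order, so the first columns of `V_r` are always the directions that carry the most variance.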

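Please note that PCA is sensitive to the scale of the features, so the dataset should be standardized beforehand. (By the way, tree-based methods such as RF, XGB and LGB are among the few families of algorithms that are insensitive to feature scale.) The scikit-learn implementation centers the data automatically but does not rescale it, so we apply the scaling step ourselves.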
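
A quick way to see what scikit-learn's `PCA` does and does not handle is to compare the explained variance ratios on raw, centered and standardized versions of the same data. The sketch below uses synthetic features with very different scales (an assumption made for the illustration).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data (assumed): three independent features with different means and scales
X = rng.normal(loc=[10.0, -5.0, 3.0], scale=[1.0, 10.0, 100.0], size=(500, 3))

ratio_raw = PCA().fit(X).explained_variance_ratio_
ratio_centered = PCA().fit(X - X.mean(axis=0)).explained_variance_ratio_
ratio_scaled = PCA().fit(StandardScaler().fit_transform(X)).explained_variance_ratio_

print("raw      :", ratio_raw.round(3))
print("centered :", ratio_centered.round(3))  # matches raw: centering is built in
print("scaled   :", ratio_scaled.round(3))    # differs: rescaling is not built in
```

Without scaling, the feature with the largest variance dominates the first component almost entirely, which is why the explicit `StandardScaler` step matters.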

Let's compute the PCA components using the `fit_transform()` method of the `sklearn.decomposition.PCA` model

In [None]:

# Feature scaling: standardize each pixel to zero mean and unit variance
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(test)

In [None]:
%%time
pca = PCA(n_components=2)
pca_proj = pca.fit_transform(X_train)

Now let's plot the first two principal components in 2D

In [None]:
plt.figure(figsize=(15,8))
plt.scatter(pca_proj[:, 0], pca_proj[:, 1], 
            c=y_train, cmap="Paired")
plt.colorbar()
plt.show()

To choose the right number of dimensions to keep, we compute the number of components needed to preserve not 100% of the variance but, for example, 90%, which is a reasonable proportion. (For data visualization we keep only the first 2 or 3.)


In [None]:
X_train.shape

In [None]:
pca_test = PCA()
pca_test.fit(X_train)

def get_right_dimension(pca, threshold=0.9):
    cumsum = np.cumsum(pca.explained_variance_ratio_)
    d = np.argmax(cumsum >= threshold) + 1
    print(f'The {d} first components are sufficient to preserve {threshold * 100}% of the variance')
    return d

In [None]:
d = get_right_dimension(pca_test)
print(f'Original data dimension : {X_train.shape[-1]}')
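
As an alternative to the `get_right_dimension` helper, scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then keeps just enough components to exceed that share of the variance. The sketch below checks that both routes agree; it uses the small 8x8 digits dataset bundled with scikit-learn as a stand-in for MNIST (an assumption, to keep the example self-contained).

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Small 8x8 digits dataset bundled with scikit-learn (stand-in for MNIST)
X = StandardScaler().fit_transform(load_digits().data)

# Manual route: cumulative explained variance, as in get_right_dimension
full_pca = PCA(svd_solver="full").fit(X)
cumsum = np.cumsum(full_pca.explained_variance_ratio_)
d_manual = int(np.argmax(cumsum >= 0.9) + 1)

# Shortcut: a float n_components is interpreted as a variance threshold
pca_90 = PCA(n_components=0.9, svd_solver="full").fit(X)
print(d_manual, pca_90.n_components_)
```

The shortcut is convenient inside a pipeline, while the manual route keeps the full spectrum available for plotting.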

#### Use Plotly:
We will use [Plotly](https://plotly.com/)'s Python library to make more interactive, high-quality graphs.


In [None]:
# construct the dataframe corresponding to the new pca_proj
df = pd.DataFrame(pca_proj, columns=['axis_{}'.format(i+1) for i in range(pca.n_components)])
df['value'] = y_train
# convert the labels to str so that Plotly treats them as discrete color categories
df["value"] = df["value"].astype(str)
fig = px.scatter(df, x="axis_1", y="axis_2", color="value", width=800, height=600)
fig.show()

Now let's persist the computed projections for future use

In [None]:
# save results
df.to_pickle('pca_embedding.pkl')

### Use t-SNE
The t-distributed stochastic neighbor embedding [(t-SNE)](http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf) algorithm is a dimensionality reduction technique for data visualization developed by Geoffrey Hinton and Laurens van der Maaten.
The t-SNE algorithm is based on a probabilistic interpretation of proximities. A probability distribution is defined on the pairs of points in the original space so that points close to each other have a high probability of being chosen while distant points have a low probability of being selected. A second distribution of the same kind is defined on the pairs of points in the low-dimensional map. The t-SNE algorithm consists of matching these two probability distributions by minimizing the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) between them with respect to the location of the points on the map.

The non-linear t-SNE "feature extraction" algorithm constructs a new representation of the data so that points that are close in the original space have a high probability of having close representations in the new space. On the other hand, points that are distant in the original space have a low probability of having close representations in the new space.
In practice, the similarity between each pair of points, in both spaces, is measured by means of probabilistic calculations based on distribution hypotheses. The new representations are then constructed so as to minimize the difference between the probability distributions measured in the original space and those of the new space.
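
The quantities described above can be sketched in a few lines of NumPy. This is a simplified illustration on random toy points (an assumption): it uses one fixed Gaussian bandwidth `sigma` for every point, whereas the actual algorithm tunes a bandwidth per point from the perplexity, and it only evaluates the cost without optimizing the map. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X_high = rng.normal(size=(6, 10))   # 6 toy points in the original 10-D space (assumed)
Y_low = rng.normal(size=(6, 2))     # a candidate 2-D map of the same points

def squared_distances(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(axis=-1)

def high_dim_affinities(X, sigma=1.0):
    """Gaussian similarities p_ij, with a single fixed bandwidth (simplification)."""
    P = np.exp(-squared_distances(X) / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)    # conditional p_{j|i}
    return (P + P.T) / (2 * len(X))      # symmetrized joint p_ij

def low_dim_affinities(Y):
    """Student-t (one degree of freedom) similarities q_ij in the map."""
    Q = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the cost that t-SNE minimizes over the map coordinates."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))

P = high_dim_affinities(X_high)
Q = low_dim_affinities(Y_low)
kl = kl_divergence(P, Q)
print(f"KL(P || Q) = {kl:.4f}")
```

t-SNE moves the points of `Y_low` by gradient descent so that this divergence decreases; a random map such as the one above starts with a comparatively high cost.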

The code block below applies the t-SNE algorithm to our digits in order to project them into a 2-dimensional space (the two axes of the plot).


In [None]:
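#%%time
# t-SNE on the full training set is slow, so it is disabled by default;
# the next cell loads an embedding saved on a previous run instead.
RUN_TSNE = False
if RUN_TSNE:
    tsn_proj = pd.DataFrame(TSNE(n_components=2).fit_transform(X_train),
                            columns=['axis_1', 'axis_2'])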


In [None]:
# t-SNE takes a lot of time to run,
# so we load the embedding saved on the first run
tsn_proj = pd.read_pickle("../input/digitrecognizertsne/tsne_embedding.pkl")
tsn_proj.head()

In [None]:
plt.figure(figsize=(15,8))
plt.scatter(tsn_proj.loc[:, 'axis_1'], tsn_proj.loc[:, 'axis_2'], c=y_train, cmap="Paired")
plt.colorbar()
plt.show()

The figure above shows the projection of each image in this new space, each digit being associated with
a different color (in the figure, the orange color represents the number "1", the red color the number "0").

As can be seen, the representation provided by the t-SNE algorithm separates the data into distinct groups, one for each of the
digits of the data set.

Let's use Plotly

In [None]:

df = pd.DataFrame(tsn_proj, columns=['axis_1', 'axis_2'])
df['value'] = y_train

df["value"] = df["value"].astype(str)
fig = px.scatter(df, x="axis_1", y="axis_2", color="value", 
                  width=800, height=600,
                 title='t-SNE visualization of MNIST data')
fig.show()

In [None]:
# save data
#df.to_pickle('tsne_2_components_embedding.pkl')

Let's evaluate different algorithms on:
* original data
* original data + 2 first PCA components
* original data + 2 first t-SNE vectors

For each case we will test out:
* RandomForestClassifier
* KNeighborsClassifier
* GaussianNB

To evaluate performance, we will use the **f1_weighted** scoring metric with a 5-fold cross-validation strategy.

[F1_weighted](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) is the average of the per-label F1 scores, weighted by support (the number of true instances for each label).
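
To make the weighting concrete, here is a small worked example on made-up labels (an assumption for the illustration). It computes the per-label F1 scores, weights them by support by hand, and compares the result with `f1_score(..., average="weighted")`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels (assumed): class 0 has 4 true instances, classes 1 and 2 have 2 each
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])

per_class_f1 = f1_score(y_true, y_pred, average=None)  # one F1 score per label
support = np.bincount(y_true)                           # true instances per label

manual_weighted = np.sum(per_class_f1 * support) / support.sum()
sklearn_weighted = f1_score(y_true, y_pred, average="weighted")
print(per_class_f1, manual_weighted, sklearn_weighted)
```

Here the per-label scores are 0.75, 0.8 and 2/3, and class 0 counts twice as much as the others because it has twice the support.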

In [None]:
# let's define our classifiers
classifiers = {"RF" : RandomForestClassifier(n_estimators=100, criterion = 'entropy', random_state = 42),
              "KNN" : KNeighborsClassifier(n_neighbors = 7),
              "NB" : GaussianNB() }

### 1. Original data

In [None]:
%%time
origin_scores = {}
for name, clf in classifiers.items():
    print(name, 'classifier')
    origin_scores[name] = cross_val_score(clf, X_train, y_train,
                             scoring="f1_weighted",
                             cv=5)
    print(f'f1_score = {origin_scores[name]}')
    print(name, "overall score :", 
          "%.3f"%(np.mean(origin_scores[name])),
         "+- %.3f"%(np.std(origin_scores[name]))
         )     
    print("*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-")

### 2. PCA transformed data

#### 2.1. PCA projection only:
Now we will apply the same models on `pca_proj`, which contains only the first two principal components.

In [None]:
%%time
pca_scores = {}

for name, clf in classifiers.items():
    print(name, 'classifier')
    pca_scores[name] = cross_val_score(clf, pca_proj, y_train,
                             scoring="f1_weighted",
                             cv=5)
    print(f'f1_score = {pca_scores[name]}')

#### 2.2. Add the first two components:

In order to compare fairly with t-SNE, we'll add only the first 2 PCA components to the original data.

In [None]:
%%time
two_first_pca_scores = {}


for name, clf in classifiers.items():
    print(name, 'classifier')
    two_first_pca_scores[name] = cross_val_score(clf, 
                                                 np.c_[X_train, pca_proj[:, [0, 1]]], 
                                                 y_train,
                                                 scoring="f1_weighted",
                                                 cv=5)
    
    print(f'CV f1 scores = {two_first_pca_scores[name]}')
    print(name, "overall score :", 
          "%.3f"%(np.mean(two_first_pca_scores[name])),
         "+- %.3f"%(np.std(two_first_pca_scores[name]))
         )     
    print("*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-")

### 3. t-SNE transformed data
Now we will apply the same models on the original data augmented with the two `tsn_proj` axes.

In [None]:
%%time
tsne_scores = {}

for name, clf in classifiers.items():
    print(name, 'classifier')
    tsne_scores[name] = cross_val_score(clf, np.c_[X_train, tsn_proj[['axis_1', 'axis_2']].values], y_train,
                             scoring="f1_weighted",
                             cv=5)
    print(f'f1_score = {tsne_scores[name]}')
    print(name, "overall score :", 
          "%.3f"%(np.mean(tsne_scores[name])),
         "+- %.3f"%(np.std(tsne_scores[name]))
         ) 
    print("*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-")

### Save results:

In [None]:

results = pd.DataFrame({'Classifier': list(classifiers)})

for colname, scores in zip(["original_F1_Score", "original+2_first_pca_f1_score", "original+2_first_tsne_f1_score"],
                           [origin_scores, two_first_pca_scores, tsne_scores]):
    # mean cross-validation score of each classifier, in the order of `classifiers`
    results[colname] = [np.mean(scores[name]) for name in classifiers]

In [None]:
results

As we can see, the t-SNE embedding gives the classifiers the best encoding of the three feature sets.