# GEC Data Science Program
## Level 2, Lab 1

### Intro to project

1. Understanding the Amazon from Space

   https://www.kaggle.com/c/planet-understanding-the-amazon-from-space

2. Google Cloud & YouTube-8M Video Understanding Challenge

   https://www.kaggle.com/c/youtube8m

### Environment Setup

#### Software and tools

- Anaconda https://www.anaconda.com/download/
- TensorFlow https://www.tensorflow.org/
- Keras https://keras.io/

#### Mac/Linux Install (python 2.7)
TensorFlow
```sh
# macOS, CPU-only wheel; on Linux, substitute the matching Linux wheel URL
export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.12.1-py2-none-any.whl
sudo pip install --upgrade $TF_BINARY_URL --ignore-installed
```
Keras
```sh
sudo pip install keras
# print() with a single argument works on both Python 2.7 and Python 3
python -c "import keras; print(keras.__version__)"
sudo pip install --upgrade keras
```

In [None]:
from __future__ import division

In [None]:
import tensorflow as tf

In [None]:
tf.__version__

In [None]:
import keras

In [None]:
keras.__version__

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

## Load Data

#### Data

This lab uses the Kaggle Digit Recognizer data set:
https://www.kaggle.com/c/digit-recognizer

In [None]:
d=pd.read_csv("train.csv")

In [None]:
d.head()

The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.

Each image is 28 pixels high and 28 pixels wide, for a total of 784 pixels. Each pixel has a single value indicating its lightness or darkness, with higher numbers meaning darker. This value is an integer between 0 and 255, inclusive. In train.csv, the first column, `label`, holds the digit; the remaining 784 columns hold the pixel values.
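
The row layout can be sketched on a synthetic row. This is only an illustration: the row below is randomly generated rather than read from train.csv, and it assumes the label sits in the first position, followed by the 784 pixel values.

```python
import numpy as np

# Synthetic stand-in for one row of train.csv (assumption: label first,
# then 784 pixel values). In the lab, real rows come from d.values.
rng = np.random.RandomState(0)
row = np.concatenate(([7], rng.randint(0, 256, size=784)))

label = row[0]                     # the digit this row represents
pixels = row[1:]                   # 784 gray-scale values in [0, 255]
image = pixels.reshape((28, 28))   # each run of 28 values becomes one image row
```

After the reshape, `image[r, c]` corresponds to `pixels[28 * r + c]`.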

In [None]:
d.values

### Explore data

Let's look at the first row (the label followed by the 784 pixel values of the first image):

In [None]:
d.values[0]

### Q: Can we plot the digits?

In [None]:
L = d.values[3][0]

In [None]:
L

In [None]:
img_vector = d.values[3][1:]

In [None]:
img=img_vector.reshape((28,28))

In [None]:
plt.imshow(img)

In [None]:
plt.imshow(img, cmap=plt.cm.binary)

#### Q: How can we rotate the image 90 degrees?
Hint: use np.rot90

In [None]:
plt.imshow(np.rot90(img), cmap=plt.cm.binary) 

#### Q: How can we flip the image?

In [None]:
#flip upside-down
plt.imshow(, cmap=plt.cm.binary) 

In [None]:
#flip left-right
plt.imshow(, cmap=plt.cm.binary) 
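
One possible way to fill in the two blanks is with NumPy's `np.flipud` and `np.fliplr`. The sketch below uses a tiny hypothetical array instead of the 28x28 `img`, so that the effect is easy to read off.

```python
import numpy as np

# Tiny 2x3 array standing in for the 28x28 digit image `img`.
small = np.array([[1, 2, 3],
                  [4, 5, 6]])

upside_down = np.flipud(small)   # reverses the order of the rows
left_right = np.fliplr(small)    # reverses the order of the columns
```

Passing `np.flipud(img)` or `np.fliplr(img)` to `plt.imshow` would then display the flipped digit.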

#### Q: Plot some random images:

In [None]:
plt.figure(figsize=(15,5))
for i,idx in enumerate(np.random.randint(0,high=len(d),size=18)):  # start at 0 so the first row can be drawn too
    L = d.values[idx][0]
    img_vector = d.values[idx][1:]
    img=img_vector.reshape((28,28))
    plt.subplot(2,9,i+1)
    plt.imshow(img, cmap=plt.cm.binary)
    plt.title(str(L))

### Q: What's the distribution of digits?

In [None]:
plt.hist(d.label)

In [None]:
d.label.value_counts().plot.bar()

#### Create X=features, y=labels.

In [None]:
X=

In [None]:
y=
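
A minimal sketch of one way to separate features from labels, assuming the frame has a `label` column plus pixel columns as in train.csv. The small frame `demo` and the names `X_demo` and `y_demo` are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Small synthetic frame laid out like train.csv, but with only 4 pixel columns.
demo = pd.DataFrame({
    "label":  [3, 0, 7],
    "pixel0": [0, 12, 0],
    "pixel1": [255, 0, 34],
    "pixel2": [10, 0, 0],
    "pixel3": [0, 99, 1],
})

X_demo = demo.drop("label", axis=1).values   # features: every column except the label
y_demo = demo["label"].values                # labels: one digit per row
```

Taking `.values` yields plain NumPy arrays, which keeps later positional indexing (for example `y[idx]`) unambiguous.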

### Q: Can we visualize the Principal Components?

In [None]:
from sklearn.decomposition import PCA

In [None]:
import sklearn

In [None]:
pca=PCA()

In [None]:
Xpca=pca.fit_transform(X)

### Q: How many Principal Components are important? How much variation does each one explain?

In [None]:
plt.plot()

In [None]:
np.where(>0.005)

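Let's keep only the most important PCs. (The plotting cell further down reshapes each reduced vector to 36 values, so it assumes 36 components are kept.)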

In [None]:
Xpca_r = 
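
As a hedged sketch of how these three blanks might be approached: a fitted scikit-learn `PCA` object exposes `explained_variance_ratio_`, the fraction of total variance carried by each component. The data below is synthetic and every `*_demo` name is hypothetical; only the 0.005 threshold is taken from the cell above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 10 features, with most of the variance
# in the first three features (a stand-in for the digit matrix X).
rng = np.random.RandomState(0)
scales = np.array([10.0, 5.0, 2.0] + [0.1] * 7)
X_demo = rng.normal(size=(200, 10)) * scales

pca_demo = PCA()
X_demo_pca = pca_demo.fit_transform(X_demo)

# Fraction of total variance explained by each component, largest first.
ratios = pca_demo.explained_variance_ratio_

# Indices of the components that explain more than 0.5% of the variance.
important = np.where(ratios > 0.005)[0]

# Components are sorted by explained variance, so keeping the leading
# columns keeps the most important ones.
X_demo_reduced = X_demo_pca[:, :len(important)]
```

Plotting `ratios` (or its cumulative sum via `np.cumsum`) with `plt.plot` shows how quickly the explained variance falls off.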

### Q: What if we turn Principal Components into images?

In [None]:
plt.figure(figsize=(15,5))
for i,idx in enumerate(np.random.randint(0,high=len(d),size=18)):  # start at 0 so the first row can be drawn too
    L = y[idx]
    img_vector = Xpca_r[idx]
    img=img_vector.reshape((36,1))
    plt.subplot(1,18,i+1)
    plt.imshow(img)
    plt.title(str(L))

### Q: Can we use another transformation other than PCA?

http://scikit-learn.org/stable/modules/manifold.html

In [None]:
from sklearn import manifold
from matplotlib.pyplot import cm

lle = manifold.LocallyLinearEmbedding(n_neighbors=5, n_components=2)

In [None]:
Xlle = lle.fit_transform(X) #warning: takes a long time to complete

In [None]:
plt.scatter(Xlle[:,0],Xlle[:,1],c=y, cmap=cm.RdYlBu); 
plt.colorbar();
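
Because fitting on the full data set is slow, one option is to fit the embedding on a random subset of rows first. The sketch below assumes rows are samples and uses synthetic data; the subset size of 150 is an arbitrary choice for illustration, and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn import manifold

# Synthetic stand-in for X: 1000 samples with 20 features each.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 20))

# Pick 150 distinct row indices and fit only on those rows.
subset = rng.choice(len(X_demo), size=150, replace=False)
lle_demo = manifold.LocallyLinearEmbedding(n_neighbors=5, n_components=2)
X_demo_lle = lle_demo.fit_transform(X_demo[subset])
```

When coloring the scatter plot, the labels need the same subset, for example `c=y[subset]` if `y` is a NumPy array.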

### Let's create a simple classifier -- Nearest Neighbor

In [None]:
from sklearn import model_selection

In [None]:
!pip install tqdm

In [None]:
from tqdm import tqdm

In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(Xpca_r, y, test_size=0.33)

In [None]:
def find_nearest_neighbor(x, D):
    xr=x.reshape((1,len(x))).repeat(len(D), axis=0)
    delt=D-xr
    sq_dist=np.sum(delt*delt, axis=1)
    return np.argmin(sq_dist)

Let's test the first data point from the test set:

Q: What's the index of the nearest neighbor in training data to the first test data point?

In [None]:
idx=

Q: What's the nearest neighbor label?

Is the prediction correct?
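
The three questions above can be walked through on a toy example. This is a sketch with made-up 2-D points, not the lab data; the function is a broadcasting variant of `find_nearest_neighbor`, and all `*_demo` values are hypothetical.

```python
import numpy as np

def find_nearest_neighbor(x, D):
    # Squared Euclidean distance from x to every row of D;
    # broadcasting subtracts x from each row without an explicit repeat.
    delt = D - x
    sq_dist = np.sum(delt * delt, axis=1)
    return np.argmin(sq_dist)

# Toy training set with one label per row.
D_demo = np.array([[0.0, 0.0],
                   [5.0, 5.0],
                   [9.0, 1.0]])
labels_demo = np.array([0, 1, 2])

x_demo = np.array([4.0, 6.0])   # toy "test" point
true_label = 1                  # its known label, used only for the check

idx_demo = find_nearest_neighbor(x_demo, D_demo)   # index of the closest training row
predicted = labels_demo[idx_demo]                  # label of that neighbor
is_correct = predicted == true_label
```

In the lab, the same pattern would apply with `X_test[0]`, `X_train`, `y_train`, and `y_test[0]`, assuming the labels are NumPy arrays.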

Q: Do the same for all test data:

In [None]:
def predict_Labels(X, D, labels):
    r=[]
    for x in tqdm(X):
        idx = find_nearest_neighbor(x, D)
        predicted_label = labels[idx]
        r.append(predicted_label)
    return r

In [None]:
y_pred_ = predict_Labels(X_test, X_train, y_train)
y_pred = np.array(y_pred_)

In [None]:
# #save y_pred
# with open("y_pred","wb") as f:  # binary mode, since tofile writes raw bytes
#     y_pred.tofile(f)

In [None]:
# #load y_pred
# with open("y_pred","rb") as f:  # binary mode, matching how the file was written
#     y_pred = np.fromfile(f, dtype=np.int64)

Let's check the first few predictions:

In [None]:
y_pred[:10]

In [None]:
y_test[:10]

Looks very good!
#### Q: What's the accuracy?

In [None]:
num_corrects = 

In [None]:
num_corrects, len(y_test)

In [None]:
num_corrects/len(y_test)
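
A minimal sketch of the accuracy computation on made-up arrays (the `*_demo` names are hypothetical). Comparing two NumPy arrays elementwise gives a boolean array, and summing it counts the matches.

```python
import numpy as np

y_true_demo = np.array([1, 0, 4, 7, 3])
y_pred_demo = np.array([1, 0, 9, 7, 3])

# True where the prediction matches the actual label.
matches = y_pred_demo == y_true_demo

num_correct_demo = np.sum(matches)        # number of correct predictions
accuracy_demo = np.mean(matches)          # fraction correct, independent of integer division
```

Using `np.mean` avoids relying on `from __future__ import division` under Python 2.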

### Q: Which data points were incorrectly predicted, and why?

In [None]:
incorrects_idx = 
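
One way to locate the misclassified points is `np.where` on the elementwise inequality. The sketch below uses made-up arrays and hypothetical names, and it assumes both label arrays are NumPy arrays; with a pandas Series, positional indexing would need `.values` or `.iloc`.

```python
import numpy as np

y_true_demo = np.array([1, 0, 4, 7, 3, 9])
y_pred_demo = np.array([1, 0, 9, 7, 5, 9])

# Positions where prediction and truth disagree.
wrong_idx_demo = np.where(y_pred_demo != y_true_demo)[0]

# Count each (actual, predicted) pair among the errors in a 10x10 table.
confusions = np.zeros((10, 10), dtype=int)
np.add.at(confusions,
          (y_true_demo[wrong_idx_demo], y_pred_demo[wrong_idx_demo]),
          1)
```

Row sums of `confusions` then give the number of errors per actual digit.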

In [None]:
Xe = X_test[incorrects_idx]
ye = y_test[incorrects_idx]

In [None]:
y_pred_e = y_pred[incorrects_idx]

In [None]:
y_pred_e[:10]

In [None]:
ye[:10]

In [None]:
plt.figure(figsize=(5,5))
plt.hist2d(y_pred_e,ye,bins=np.arange(11)-0.5,cmap=cm.gray_r);  # one bin per digit 0-9
plt.grid(True);
plt.xlabel("predicted");
plt.ylabel("actual");

#### Q: Which digit accounts for most of the errors?

Unfortunately, we can't view the misclassified images: we applied PCA and then randomly split the data into train and test sets, so we lost the link back to the original data points.
#### Q: how can we avoid this problem?
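
One possible approach, sketched under assumptions: split the row indices rather than the data, so every test row can be traced back to its original image. The arrays below are synthetic and all names are hypothetical stand-ins for `X` and `Xpca_r`.

```python
import numpy as np

# Synthetic stand-ins: 10 "original images" of 4 pixels each, and a
# 2-column reduced version of the same rows.
rng = np.random.RandomState(0)
X_raw_demo = rng.randint(0, 256, size=(10, 4))
X_red_demo = X_raw_demo[:, :2].astype(float)

# Shuffle the row indices once and split the indices, not the data.
perm = rng.permutation(len(X_raw_demo))
n_test = 3
test_idx, train_idx = perm[:n_test], perm[n_test:]

X_train_demo = X_red_demo[train_idx]
X_test_demo = X_red_demo[test_idx]

# Row k of the test set came from original row test_idx[k],
# so the original image is still recoverable.
k = 1
original_image = X_raw_demo[test_idx[k]]
```

`model_selection.train_test_split` also accepts several arrays of equal length and splits them consistently, so passing an index array (or the original `X`) alongside `Xpca_r` and `y` is another option.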

### Q: How can we provide a confidence measure for the predictions?
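
One simple idea, offered as a sketch rather than the intended answer: look at the k nearest neighbors instead of just one, and report the share of them that agree with the predicted label. The function name `knn_predict_with_confidence`, the choice `k=3`, and the toy data are all hypothetical.

```python
import numpy as np

def knn_predict_with_confidence(x, D, labels, k=5):
    # Squared Euclidean distance from x to every training row.
    sq_dist = np.sum((D - x) ** 2, axis=1)
    nearest = np.argsort(sq_dist)[:k]                    # indices of the k closest rows
    votes = np.bincount(labels[nearest], minlength=10)   # votes per digit 0-9
    predicted = np.argmax(votes)
    confidence = votes[predicted] / float(k)             # share of neighbors that agree
    return predicted, confidence

# Toy 1-D training set: three points near 0 and two near 5.
D_demo = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
labels_demo = np.array([3, 3, 8, 8, 8])

pred_demo, conf_demo = knn_predict_with_confidence(
    np.array([0.05]), D_demo, labels_demo, k=3)
```

Here two of the three nearest neighbors carry label 3, so the prediction is 3 with a vote share of 2/3. A vote share is only a rough confidence proxy; it is not a calibrated probability.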

### Q: Use another algorithm (e.g. SVM, Random Forest) to predict digits. Will accuracy increase?
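
As a starting point, here is a hedged sketch of a Random Forest using scikit-learn. To stay self-contained it runs on scikit-learn's bundled 8x8 digits (`load_digits`), which is a stand-in and not the Kaggle 28x28 data, so its score says nothing about how the nearest-neighbor result above would compare. The hyperparameters are arbitrary illustration values.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled 8x8 digits, NOT the Kaggle set.
digits = load_digits()
X_tr, X_te, y_tr, y_te = train_test_split(
    digits.data, digits.target, test_size=0.33, random_state=0)

# Fit a forest of 100 trees on the training part.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

rf_pred = forest.predict(X_te)           # predicted digit for each test row
rf_accuracy = forest.score(X_te, y_te)   # fraction of correct test predictions
```

To answer the question for this lab, the same calls would be run on `X_train`, `X_test`, `y_train`, and `y_test` from the split above and compared with the nearest-neighbor accuracy.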