# Data Processing and Machine Learning in Python
Stephen Casper, scasper@college.harvard.edu

## Imports
Not much to it. NumPy is for linear algebra, scikit-learn is for machine learning, and Matplotlib is for plotting. All three are extremely common.

In [0]:
# imports
import numpy as np
import sklearn as skl
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn import linear_model
import matplotlib.pyplot as plt

## Getting Data
Let's download and process our data a bit. It's the "Iris" dataset, which consists of 3 classes of flowers. Each sample is represented as a 4-dimensional vector with dimensions giving the length and width of the sepals and petals. We will store our data as numpy arrays where each row is a flower and each column is a feature. We will also normalize the data so that each feature has mean 0 and standard deviation 1. Why might this be helpful?

In [0]:
# get data
iris = datasets.load_iris()
x = iris.data
y = iris.target

# normalize each feature to mean 0 and standard deviation 1
x = (x - np.mean(x, axis=0)) / np.std(x, axis=0)
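
As a quick sanity check on the normalization step, the sketch below standardizes the Iris features and prints the per-column mean and standard deviation. The names `x_raw` and `x_std` are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn import datasets

# load the raw Iris features (150 flowers x 4 features)
x_raw = datasets.load_iris().data

# standardize: subtract each column's mean, then divide by its standard deviation
x_std = (x_raw - np.mean(x_raw, axis=0)) / np.std(x_raw, axis=0)

# after standardizing, every column should have mean ~0 and standard deviation ~1
print(np.round(x_std.mean(axis=0), 6))
print(np.round(x_std.std(axis=0), 6))
```

Note the parentheses: the mean has to be subtracted before dividing, otherwise only the mean gets rescaled.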

## tSNE Visualization
Let's process the data and see what structure we can find in it by representing it in 2 dimensions instead of 4. tSNE stands for t-distributed stochastic neighbor embedding, and it's a common stochastic tool for representing high dimensional data in lower dimensions. It iteratively moves points around a space, optimizing for how close neighboring points are and how far distant points are from each other. When you see the results, notice that tSNE is done while blind to the labels. Also, try running this cell multiple times to see what happens.

In [0]:
# fit tSNE
x_tsne = TSNE(n_components=2).fit_transform(x)

# get points of each class
tsne0 = x_tsne[y==0]
tsne1 = x_tsne[y==1]
tsne2 = x_tsne[y==2]

# plot
plt.scatter(tsne0[:,0], tsne0[:,1], c='r')
plt.scatter(tsne1[:,0], tsne1[:,1], c='b')
plt.scatter(tsne2[:,0], tsne2[:,1], c='g')
plt.show()


## PCA Visualization
There's structure! Another technique for dimensionality reduction is principal component analysis (PCA), which represents our data in terms of the axes in feature space (in our case it's 4 dimensional space) that explain the most variance in the data. Let's do a PCA transform and plot the first two principal components. Now try running this cell multiple times to see what happens. Also, what do you suspect these principal components correspond to?

In [0]:
# fit PCA
pca = PCA(n_components=2)
x_pca = pca.fit_transform(x) 

# get points of each class
pca0 = x_pca[y==0]
pca1 = x_pca[y==1]
pca2 = x_pca[y==2]

# plot
plt.scatter(pca0[:,0], pca0[:,1], c='r')
plt.scatter(pca1[:,0], pca1[:,1], c='b')
plt.scatter(pca2[:,0], pca2[:,1], c='g')
plt.show()
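
One way to explore what the principal components correspond to is to look at how much variance each one explains and how it weights the original features. The sketch below is a minimal example that reloads and standardizes the data itself; the names `iris_data`, `x_std`, and `pca_model` are illustrative and not part of the cells above.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

# reload and standardize the data so this cell runs on its own
iris_data = datasets.load_iris()
x_std = (iris_data.data - iris_data.data.mean(axis=0)) / iris_data.data.std(axis=0)

pca_model = PCA(n_components=2).fit(x_std)

# fraction of the total variance captured by each principal component
print(pca_model.explained_variance_ratio_)

# each component is a set of weights on the 4 original features
for name, w1, w2 in zip(iris_data.feature_names,
                        pca_model.components_[0],
                        pca_model.components_[1]):
    print(f"{name:20s} PC1 weight {w1:+.2f}   PC2 weight {w2:+.2f}")

# refitting PCA on the same data gives the same projection
x_pca_a = PCA(n_components=2).fit_transform(x_std)
x_pca_b = PCA(n_components=2).fit_transform(x_std)
print(np.allclose(x_pca_a, x_pca_b))
```

Features with large weights of the same sign in a component tend to vary together along that axis.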

## tSNE vs. PCA
Which technique, tSNE or PCA, seems to do better at finding low dimensional structure in the dataset? As you noticed, probably tSNE, but PCA has the advantage of returning consistent results. Also, tSNE's performance depends on tuning a fairly arbitrary perplexity parameter. For datasets like this, the perplexity doesn't have a large effect, but for others, it can have a massive effect on the outcomes. See [here](https://distill.pub/2016/misread-tsne/). tSNE isn't commonly used in published research for the same reason that lie detectors aren't admissible in court--it usually works, but it's not reliable.
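
To see the effect of perplexity for yourself, the sketch below fits tSNE at a few perplexity values. The values 5, 30, and 50 are arbitrary choices for illustration, and `embeddings` is a hypothetical name. Passing `random_state` fixes the random seed, which should make reruns repeatable on the same setup.

```python
import numpy as np
from sklearn import datasets
from sklearn.manifold import TSNE

# reload and standardize the data so this cell runs on its own
iris_data = datasets.load_iris()
x_std = (iris_data.data - iris_data.data.mean(axis=0)) / iris_data.data.std(axis=0)

# fit one 2-dimensional embedding per perplexity value (illustrative values)
embeddings = {}
for perplexity in (5, 30, 50):
    tsne_model = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne_model.fit_transform(x_std)
    print(perplexity, embeddings[perplexity].shape)
```

Each entry of `embeddings` can be plotted with `plt.scatter` in the same way as the tSNE cell earlier in this notebook.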



## Separating Train/Test Data
We will use PCA preprocessing for our ML models. Let's also put our data into training and testing sets. Why is this a good idea?

In [0]:
# get x's for each class
x0 = x_pca[y==0]
x1 = x_pca[y==1]
x2 = x_pca[y==2]

# make test data with 20 examples from each class
test_x = np.concatenate((x0[:20], x1[:20], x2[:20]))
test_y = np.concatenate((np.repeat(0, 20), np.repeat(1, 20), np.repeat(2, 20)))

# make train data with the rest of the examples
n0 = len(x0[20:])
n1 = len(x1[20:])
n2 = len(x2[20:])
train_x = np.concatenate((x0[20:], x1[20:], x2[20:]))
train_y = np.concatenate((np.repeat(0, n0), np.repeat(1, n1), np.repeat(2, n2)))
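
The manual split above can also be sketched with scikit-learn's `train_test_split`. This is an alternative, not what the rest of this notebook uses: it shuffles before splitting, and `stratify` keeps the classes balanced. The `split_*` names and `random_state=0` are illustrative choices, and the raw features stand in for the PCA-processed ones.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris_data = datasets.load_iris()

# hold out 60 shuffled examples for testing, stratified by class
split_train_x, split_test_x, split_train_y, split_test_y = train_test_split(
    iris_data.data,
    iris_data.target,
    test_size=60,
    stratify=iris_data.target,
    random_state=0,
)

# number of examples per class in the test and train sets
print(np.bincount(split_test_y), np.bincount(split_train_y))
```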

## KNN Classification
Here, we will use the k-nearest neighbor (KNN) classification algorithm, which labels each point with the class held by a plurality of its k nearest neighbors in the training set.

In [0]:
# fit model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_x, train_y)

# print results
print('Training data accuracy: ', sum(knn.predict(train_x)==train_y), 'out of', len(train_y))
print('Testing data accuracy: ', sum(knn.predict(test_x)==test_y), 'out of', len(test_y))
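
To make the plurality vote concrete, here is a minimal from-scratch sketch of KNN in numpy. It is an illustration of the idea, not scikit-learn's implementation. The function `knn_predict` and the toy data are hypothetical, and the sketch assumes Euclidean distance and non-negative integer labels.

```python
import numpy as np

def knn_predict(train_x, train_y, query_x, k=3):
    """Label each query point by a plurality vote of its k nearest training points."""
    # pairwise Euclidean distances, shape (n_query, n_train)
    dists = np.linalg.norm(query_x[:, None, :] - train_x[None, :, :], axis=2)
    # indices of the k closest training points for each query point
    nearest = np.argsort(dists, axis=1)[:, :k]
    # labels of those neighbors, shape (n_query, k)
    votes = train_y[nearest]
    # most common label per row (ties go to the smallest label)
    return np.array([np.bincount(row).argmax() for row in votes])

# toy data: one cluster near (0, 0) with label 0, one near (5, 5) with label 1
toy_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.1, 0.1], [5.0, 5.1]])

toy_pred = knn_predict(toy_x, toy_y, queries, k=3)
print(toy_pred)
```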

## Naive Bayes Classification
A naive Bayes classifier estimates, for each class, how likely each of a datapoint's dimension values is under that class. It then multiplies these per-dimension likelihoods together (along with the class's prior probability) to find an overall value proportional to the probability of each class. Then it classifies based on the class which has the highest probability estimate. It's called naive because it makes the (often bad) assumption that the dimensions are independent of each other given the class.

In [0]:
# fit model
gnb = GaussianNB()
gnb.fit(train_x, train_y)

# print results
print('Training data accuracy: ', sum(gnb.predict(train_x)==train_y), 'out of', len(train_y))
print('Testing data accuracy: ', sum(gnb.predict(test_x)==test_y), 'out of', len(test_y))
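
The per-dimension multiplication can be sketched from scratch as follows. This is a simplified illustration, not scikit-learn's `GaussianNB` (which also applies a small variance smoothing term). The function names are hypothetical, and the sketch uses the raw Iris features and works with log probabilities, so the product over dimensions becomes a sum.

```python
import numpy as np
from sklearn import datasets

def gaussian_nb_fit(x, y):
    """Estimate a mean and variance per class and dimension, plus class priors."""
    classes = np.unique(y)
    means = np.array([x[y == c].mean(axis=0) for c in classes])
    variances = np.array([x[y == c].var(axis=0) for c in classes])
    priors = np.array([np.mean(y == c) for c in classes])
    return classes, means, variances, priors

def gaussian_nb_predict(params, x):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    classes, means, variances, priors = params
    # difference to each class mean, shape (n_points, n_classes, n_dims)
    diff = x[:, None, :] - means[None, :, :]
    # log of the Gaussian density for every point, class, and dimension
    log_pdf = -0.5 * (np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None])
    # summing logs over dimensions is the same as multiplying probabilities
    scores = log_pdf.sum(axis=2) + np.log(priors)[None, :]
    return classes[scores.argmax(axis=1)]

iris_data = datasets.load_iris()
nb_params = gaussian_nb_fit(iris_data.data, iris_data.target)
nb_pred = gaussian_nb_predict(nb_params, iris_data.data)
nb_accuracy = np.mean(nb_pred == iris_data.target)
print('Training data accuracy:', nb_accuracy)
```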

Would the naive Bayes classifier work better or worse if we used our tSNE processed data instead of our PCA processed data? Why?

## Decision Tree Classification
A decision tree classifier works a lot like a dichotomous key, which you may have learned about in biology classes. It asks yes/no questions about each point (e.g. Is the value of the first dimension less than 5?) and proceeds down a fork of the tree accordingly either to a conclusion or another fork. But it only allows itself to ask these questions up to a certain "depth limit". Try varying this depth limit. What happens?

In [0]:
# fit model
tree_classifier = tree.DecisionTreeClassifier(max_depth=2)
tree_classifier.fit(train_x, train_y)

# print results
print('Training data accuracy: ', sum(tree_classifier.predict(train_x)==train_y), 'out of', len(train_y))
print('Testing data accuracy: ', sum(tree_classifier.predict(test_x)==test_y), 'out of', len(test_y))
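
To see the yes/no questions a fitted tree actually asks, scikit-learn's `tree.export_text` prints them as indented rules. The sketch below fits a fresh tree on the raw Iris features so that the rules are stated in terms of the named measurements. The name `shallow_tree` and `random_state=0` are illustrative choices.

```python
from sklearn import datasets, tree

iris_data = datasets.load_iris()

# fit a depth-limited tree on the raw features
shallow_tree = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
shallow_tree.fit(iris_data.data, iris_data.target)

# print the learned questions as an indented set of rules
rules = tree.export_text(shallow_tree, feature_names=iris_data.feature_names)
print(rules)
```

Raising `max_depth` adds more levels of questions to the printed rules.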

## Linear Classification
Finally, we can use a linear classifier, which learns a matrix of coefficients with one entry per dimension per class. It then classifies each datapoint by multiplying that coefficient matrix with the datapoint vector and picking the class whose resulting value is the highest. Ours will be trained using a technique called stochastic gradient descent. Try varying the maximum number of passes over the training data (`max_iter`).

In [0]:
# fit model
lm = linear_model.SGDClassifier(max_iter=1)
lm.fit(train_x, train_y)

# print results
print('Training data accuracy: ', sum(lm.predict(train_x)==train_y), 'out of', len(train_y))
print('Testing data accuracy: ', sum(lm.predict(test_x)==test_y), 'out of', len(test_y))
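
The "multiply and take the highest value" step can be checked by hand against a fitted model. The sketch below trains a separate `SGDClassifier` on the standardized raw features and recomputes its predictions from `coef_` and `intercept_`. The names `sgd`, `manual_scores`, and `manual_pred` are illustrative, and `max_iter=1000` and `random_state=0` are arbitrary choices.

```python
import numpy as np
from sklearn import datasets, linear_model

# reload and standardize the data so this cell runs on its own
iris_data = datasets.load_iris()
x_std = (iris_data.data - iris_data.data.mean(axis=0)) / iris_data.data.std(axis=0)

sgd = linear_model.SGDClassifier(max_iter=1000, random_state=0)
sgd.fit(x_std, iris_data.target)

# one row of coefficients per class, one column per feature, plus one intercept per class
print(sgd.coef_.shape, sgd.intercept_.shape)

# score every class for every datapoint, then pick the highest-scoring class
manual_scores = x_std @ sgd.coef_.T + sgd.intercept_
manual_pred = sgd.classes_[np.argmax(manual_scores, axis=1)]

# the hand-computed predictions should match the model's own
print(np.array_equal(manual_pred, sgd.predict(x_std)))
```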

## Try Something New
Pick one or two:

- Find a new dataset from sklearn and train a model on it.
- Find a new classification algorithm and implement it. 
- Implement an algorithm from scratch.
- Search for the sklearn info page on one of the algorithms and find parameters to tune and see how it affects performance. 
- Find a new way to preprocess data and see how it affects performance with one or more of the algorithms. 
- Learn more about the math behind one of these algorithms (other than KNN). You might be especially interested in learning about gradient descent.
