# Intermediate Machine Learning Concepts

## Picking Up Speed in the Competition

Aliens! Scientists have discovered life on another planet! The plant Zenon is crawling with alien creatures, and while we don't know much about these Zenon inhabitants yet, we have collected plenty of data from satellite images. Your job is to classify each creature into one of seven species that the researchers have distinguished so far.

Here we build on our Introduction to Machine Learning talk, bolstering some of our methods as well as introducing some new approaches to the data! 

In [1]:
from __future__ import print_function 
import pandas as pd
import numpy as np
from collections import OrderedDict
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

## Data Preprocessing

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Save the Id's from the test set (we need them for the predictions)
id_test = test['Id']
X_test = test.drop('Id', axis=1)

# remove constant columns
remove = []
for col in train.columns:
    if train[col].std() == 0:
        remove.append(col)
        
train.drop(remove, axis=1, inplace=True)
test.drop(remove, axis=1, inplace=True)

y_train = train['Alien_Type']
X_train = train.drop(['Id', 'Alien_Type'], axis=1)

In [3]:
# New split for training and evaluation from training set.
X_train, X_val, y_train, y_val= train_test_split(X_train, y_train, test_size=0.3)

## Tree Based Model

In the Introductions to Machine Learning talk we used a Naive Bayes model. Here Let's give a shot at a tree based classifier, the Extra Trees Classifier.

In [4]:
#clf = ExtraTreesClassifier()
#clf = DecisionTreeClassifier()
#clf = GaussianNB()
clf = LogisticRegression()

# Train the model
clf.fit(X_train, y_train)
# Predict the model on the validation set
val_pred = clf.predict(X_val)

# How did we do? Note: this competition uses Multiclass Accuracy!
acc = accuracy_score(y_val, val_pred)
print("Multiclass accuracy: {}".format(acc))

Multiclass accuracy: 0.669080604534


## Principal Components Analysis

While ensemble tree models like the Extra Trees Classifier tend not to overfit, there is still that chance. One strategy to make our models more robust is to use a data preprocessing method called Principal Component Analysis (PCA). In essence, PCA is a data transformation that allows us to describe the variation in our data with fewer explanatory features. It is important to note that PCA does not throw away any of the features in our data, rather it re-expresses the information of our original features in lower dimensions. It is a method used to counteract what is known as the 'curse of dimensionality'.

In [5]:
pca = PCA(n_components=7)

pca.fit(X_train)
X_train_new = pca.transform(X_train)
pca.fit(X_val)
X_val_new = pca.transform(X_val)

# Train the model
clf.fit(X_train_new, y_train)
# Predict the model on the validation set
val_pred = clf.predict(X_val_new)

# How did we do? Note: this competition uses Multiclass Accuracy!
acc = accuracy_score(y_val, val_pred)
print("Multiclass accuracy: {}".format(acc))

Multiclass accuracy: 0.52298488665


We now have a multiclass accuracy score of $77.4$% on the validation set, and this is using information contained in only seven features!

## Cross Fold Validation

So far we have been trusting that the accuracy score we get on validation set will be indicative of what we get on the actual test set. But what if we just happened to get lucky on that random subset of the data? This is a legitimate concern. After all, a great one-time score could be a fluke... 

Luckily, there are ways that we can further safeguard our generalization estimates. One of these methods is to use cross fold validation. The idea here is pretty simple: don't just trust a single random split for your validation data, use many random splits into training and validation sets (folds), fit the model on the new training set and see what happens when we predict on the validation set. If you are consistently good across many random splits of the data, odds are you are going to do pretty well on the test set!

In [6]:
# First let's recombine the train and validation sets
X_train = pd.concat([X_train, X_val])
y_train = pd.concat([y_train, y_val])

cross_val_score(clf, X_train, y_train, cv=5, n_jobs=1).mean()

0.66921419095857404

In [10]:
pca.fit(X_test)
X_test_new = pca.transform(X_test)

In [15]:
# We have our model, let's predict on the test set!
y_pred = clf.predict(X_test_new)

# Creating a submission
# ------------------
submission = pd.DataFrame({"Id":id_test, "Alien_Type":y_pred}).set_index("Id")
submission.to_csv("submission.csv")