# Introduction to Machine Learning

## A Running Start for the Hackathon

Aliens! Scientists have discovered life on another planet! The planet Zenon is crawling with alien creatures, and while we don't know much about these Zenon inhabitants yet, we have collected plenty of data from satellite images. Your job is to classify each creature into one of seven species that the researchers have distinguished so far.

Let's get up and running for this competition! Here is the first rule for machine learning: start with something simple, use that to gauge how difficult the problem is, and iterate quickly! 

In [27]:
from __future__ import print_function 
import pandas as pd
import numpy as np
from collections import OrderedDict
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

## Data Preprocessing

Let's first read in the data (Pandas is a good way to start here). Because of the competition submission requirements, we are going to want to save the IDs for the examples in the test set. We will also want to separate out the field that we are trying to predict (in the case of classification, this is often called a label). The convention is to call the field we are trying to predict 'y_train' and the set of attributes we will use to train the model (often called features) 'X_train'.

Note: once we have saved the test set IDs for the submission format, we will want to remove the ID column from both the training and test data (IDs typically tell us nothing about the aliens we are interested in; they are just numbers used in place of names).

In [37]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

train.head()

Unnamed: 0,Id,Elevation,Oxygen_level,Slope,Distance_to_spring,Temperature,Distance_to_major_Z_crater,Trace_chemicals_morning,Trace_chemicals_midday,Trace_chemicals_afternoon,...,Terra32,Terra33,Terra34,Terra35,Terra36,Terra37,Terra38,Terra39,Terra40,Alien_Type
0,11730,3277,72,3,595,86,5872,223,233,146,...,0,0,0,0,0,0,0,0,0,2
1,9613,3680,14,15,1208,325,5150,204,207,137,...,0,0,0,0,0,0,0,0,1,7
2,7591,2760,25,12,30,1,2882,214,215,135,...,0,1,0,0,0,0,0,0,0,2
3,1012,2940,90,4,30,0,5370,226,233,143,...,0,0,0,0,0,0,0,0,0,1
4,3042,3049,25,20,283,58,124,206,194,118,...,0,0,0,0,0,0,0,0,0,2


In [39]:
test.head()

Unnamed: 0,Id,Elevation,Oxygen_level,Slope,Distance_to_spring,Temperature,Distance_to_major_Z_crater,Trace_chemicals_morning,Trace_chemicals_midday,Trace_chemicals_afternoon,...,Terra31,Terra32,Terra33,Terra34,Terra35,Terra36,Terra37,Terra38,Terra39,Terra40
0,8319,3031,145,15,240,16,1753,240,236,121,...,0,0,0,0,0,0,0,0,0,0
1,10993,2544,251,14,0,0,2792,187,249,199,...,0,0,0,0,0,0,0,0,0,0
2,21,2501,71,9,60,8,767,230,223,126,...,0,0,0,0,0,0,0,0,0,0
3,11826,2830,23,18,258,168,268,206,198,123,...,0,0,0,0,0,0,0,0,0,0
4,9862,3225,104,4,60,-2,2032,226,234,143,...,0,0,0,0,1,0,0,0,0,0


In [28]:
# Save the IDs from the test set (we need them for the submission file)
id_test = test['Id']
X_test = test.drop('Id', axis=1)

y_train = train['Alien_Type']
X_train = train.drop(['Id', 'Alien_Type'], axis=1)
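
Before moving on, it can help to confirm that the training and test features line up and to glance at how the labels are distributed. The following is a minimal sketch on tiny made-up frames; `demo_train`, `demo_test`, and their values are hypothetical stand-ins for the competition files, not real data.

```python
import pandas as pd

# Hypothetical miniature versions of train.csv and test.csv (made-up values)
demo_train = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "Elevation": [3277, 3680, 2760, 2940],
    "Slope": [3, 15, 12, 4],
    "Alien_Type": [2, 7, 2, 1],
})
demo_test = pd.DataFrame({
    "Id": [5, 6],
    "Elevation": [3031, 2544],
    "Slope": [15, 14],
})

# Same separation as above: label, training features, test features
demo_y = demo_train["Alien_Type"]
demo_X = demo_train.drop(["Id", "Alien_Type"], axis=1)
demo_X_test = demo_test.drop("Id", axis=1)

# The model expects the same feature columns, in the same order, at predict time
columns_match = list(demo_X.columns) == list(demo_X_test.columns)
print("Feature columns match:", columns_match)

# How many examples of each species do we have?
class_counts = demo_y.value_counts()
print(class_counts)
```

On the real data, the same two checks would be run on 'X_train', 'X_test', and 'y_train'.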

## Creating a Validation Set

We could fit a bunch of models blindly on the entire training set, making predictions on the test set and waiting to see what happens. This gives us very little feedback about how our modeling is going, however. We would like to see how well our model is performing on some data that was not used for training (otherwise, how can you be sure that your model isn't just memorizing all of the training examples?). This extra reservation of data for diagnostics (and later model parameter tuning) is called the validation set. For statistical reasons, we often use multiple random splits of the training data for our validation set, but, for starters, we will stick with just one. 

In [29]:
# New split for training and evaluation from training set.
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3)
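
The "multiple random splits" idea mentioned above is usually done with k-fold cross-validation. Here is a minimal sketch using scikit-learn's `cross_val_score`; the synthetic data from `make_classification` is an assumption standing in for the competition features, and the choice of five folds is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the competition data (an assumption for illustration):
# 700 examples, 10 features, 7 classes
X_demo, y_demo = make_classification(
    n_samples=700, n_features=10, n_informative=6, n_redundant=0,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

# Five stratified folds: each fold serves once as the validation set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", cv_scores)
print("Mean: {:.3f}, std: {:.3f}".format(cv_scores.mean(), cv_scores.std()))
```

The spread across folds gives a rough sense of how much a single validation score can move just because of the split.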

## Model Fitting

Here comes the fun part. Pick a model, train it on the training set, and find out how well we do on our validation set! 

In [40]:
# Model choice (clf is a naming convention for classifiers in the Python world)
clf = GaussianNB()

# Train the model
clf.fit(X_train, y_train)
# Predict labels for the validation set
val_pred = clf.predict(X_val)

# How did we do? Note: this competition uses Multiclass Accuracy!
acc = accuracy_score(y_val, val_pred)
print("Multiclass accuracy: {}".format(acc))

Multiclass accuracy: 0.597607052897
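
A single accuracy number hides which species get confused with which. One way to look closer is a confusion matrix; the sketch below uses a few hand-written labels (hypothetical, three classes only) rather than the real validation predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, standing in for y_val and val_pred
demo_true = [1, 1, 2, 2, 3, 3, 3]
demo_pred = [1, 2, 2, 2, 3, 1, 3]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(demo_true, demo_pred, labels=[1, 2, 3])
print(cm)

# Accuracy is the diagonal (correct predictions) divided by the total count
demo_acc = accuracy_score(demo_true, demo_pred)
print("Accuracy: {:.3f}".format(demo_acc))

# Fraction of each true class that was predicted correctly (per-class recall)
per_class_recall = np.diag(cm) / cm.sum(axis=1)
print("Per-class recall:", per_class_recall)
```

With the real data, `confusion_matrix(y_val, val_pred)` would give a 7x7 table in the same layout.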


## Reflection

Our model scored about $59.8$% accuracy on the validation set! (Your exact number will vary a little, because the train/validation split is random.) That is not bad at all, especially considering we used the default parameters for the Gaussian Naive Bayes model. Remember, since we are predicting seven different classes, uniform random guessing would be right only about $1/7 \approx 0.14$ of the time!

## Final Model Fitting and Predictions

Now that we have a (hopefully) reasonable idea about how well our model will perform on unseen data, let's fit the model over all of the training data (the concatenation of the training and validation sets). Then, we can make some predictions on the test set for our first competition submission!

Submissions for Kaggle competitions can be a bit tricky sometimes. You want to make sure that you are getting your formatting correct. This competition asks for a .csv with two columns: the IDs and the predicted labels for the test set. This is pretty straightforward using Python's dictionaries with Pandas' DataFrames.

In [41]:
# Fitting the model over all the training data
X_train = pd.concat([X_train, X_val])
y_train = pd.concat([y_train, y_val])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Creating a submission
submission = pd.DataFrame({"Id":id_test, "Alien_Type":y_pred}).set_index('Id')
submission.to_csv('submission.csv')
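
Since formatting mistakes are easy to make, it can be worth reading the file back before uploading it. The sketch below does this round trip in memory with a few made-up IDs and predictions (`demo_ids` and `demo_pred` are hypothetical stand-ins for 'id_test' and 'y_pred').

```python
import io
import pandas as pd

# Hypothetical IDs and predictions standing in for id_test and y_pred
demo_ids = pd.Series([8319, 10993, 21], name="Id")
demo_pred = [2, 1, 7]

# Same construction as the submission above
demo_submission = pd.DataFrame({"Id": demo_ids, "Alien_Type": demo_pred}).set_index("Id")

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
demo_submission.to_csv(buffer)
buffer.seek(0)
check = pd.read_csv(buffer)

# Expect exactly two columns, one row per test example, and no repeated IDs
print(list(check.columns))
print(check.shape)
print("IDs unique:", check["Id"].is_unique)
```

For the real file, `pd.read_csv('submission.csv')` can be checked the same way, with the row count compared against `len(id_test)`.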

## Next Steps

Try looking through scikit-learn's documentation (scikit-learn.org) and try a few models out. There is a great theorem in machine learning called the No Free Lunch Theorem, which essentially says that there is no guarantee that any particular type of model will outperform all others for a given problem. So try out a bunch of models, read up on their characteristics and their parameters, and see if you can find the winning solution! It is all up for grabs, and sometimes the answer comes from surprising places!
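
One possible way to try a few models out is to loop over several classifiers and score each on the same validation split. This is only a sketch: the synthetic data is an assumption standing in for the competition features, and the four models and their settings are arbitrary starting points, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the competition data (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=700, n_features=10, n_informative=6, n_redundant=0,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# A few candidate models, mostly with default parameters
candidates = {
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighbors": KNeighborsClassifier(n_neighbors=5),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the validation split
results = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_va, model.predict(X_va))

# Print the models from best to worst validation accuracy
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print("{:<20s} {:.3f}".format(name, acc))
```

The ranking on synthetic data says nothing about the alien data; the point is the loop, which keeps the comparison fair by using one split for every model.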

We only scratched the surface here. Come out to the Intermediate Machine Learning Concepts talk with Camille and Todd at 1:00 p.m.; there is plenty more that we can do!

Happy Hacking!