# Assignment 7 - Machine Learning
### This assignment is focused on both applying machine learning methods using scikit-learn, and validating their performance.

This is a paired assignment; the same rules apply as for previous paired
assignments.  There is no requirement that you work with the same partner as on
the previous assignment; in fact, I encourage you to switch around who you work
with, since this will be good practice for being a data scientist who needs to
work with a diverse group of collaborators.

The goal of this assignment is to get some practice using scikit-learn to apply
various machine learning algorithms to data in a sensible way.  This will
involve both the creation and evaluation of potential solutions to machine
learning problems.  It is important to not only train a model, but also to
validate that model; otherwise, your results won't be trustworthy.

The basic steps you will follow are:

    1. Load some data
    2. Pre-process as necessary
    3. Format data for scikit-learn
    4. Train & validate models using scikit-learn

### Basic Imports
These are the libraries we've been using all along.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

### Load some data
As we've seen, there are lots of ways we can load data using Pandas.  Here, we're going to load a dataset that doesn't require much pre-processing (i.e. it's clean enough that we don't need to mess with it too much).

In [2]:
# NOTE: the data file doesn't have a header with column names, so we'll set 
# them up manually based on the data description (see file 'covtype.info')

colNames = ['Elevation', 'Aspect', 'Slope',
            'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
            'Horizontal_Distance_To_Roadways',
            'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
            'Horizontal_Distance_To_Fire_Points',
            'Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3', 'Wilderness_Area4']

# add the 40 soil-type columns: Soil_Type1 through Soil_Type40
for x in range(40):
    columnName = 'Soil_Type' + str(x + 1)
    colNames.append(columnName)

colNames.append('Cover_Type')

### TODO - data description
Before you start working with the data, you should try to understand what actual problem the data represent.  You should be able to figure this out by looking at the 'covtype.info' file.  Put a summary of the most salient points here.  Don't just copy/paste from the file; it contains far too much information, and much of it is not strictly relevant to a high-level understanding of the problem.

In [3]:
# read in the actual dataset
data = pd.read_csv('covtype.data', header=None, names=colNames)
data.head()

| |Elevation|Aspect|Slope|Horizontal_Distance_To_Hydrology|Vertical_Distance_To_Hydrology|Horizontal_Distance_To_Roadways|Hillshade_9am|Hillshade_Noon|Hillshade_3pm|Horizontal_Distance_To_Fire_Points|...|Soil_Type32|Soil_Type33|Soil_Type34|Soil_Type35|Soil_Type36|Soil_Type37|Soil_Type38|Soil_Type39|Soil_Type40|Cover_Type|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|2596|51|3|258|0|510|221|232|148|6279|...|0|0|0|0|0|0|0|0|0|5|
|1|2590|56|2|212|-6|390|220|235|151|6225|...|0|0|0|0|0|0|0|0|0|5|
|2|2804|139|9|268|65|3180|234|238|135|6121|...|0|0|0|0|0|0|0|0|0|2|
|3|2785|155|18|242|118|3090|238|238|122|6211|...|0|0|0|0|0|0|0|0|0|2|
|4|2595|45|2|153|-1|391|220|234|150|6172|...|0|0|0|0|0|0|0|0|0|5|
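
As a small illustration of how `header=None` and `names=` work together in `pd.read_csv`, here is a sketch on a toy in-memory file. The three-column data and the names `raw`, `toy_names`, and `toy` are made up for this example; they only stand in for 'covtype.data' and `colNames`.

```python
import io
import pandas as pd

# Hypothetical three-column file with no header row (a toy stand-in for 'covtype.data').
raw = io.StringIO("2596,51,5\n2590,56,5\n2804,139,2\n")
toy_names = ['Elevation', 'Aspect', 'Cover_Type']

# header=None tells pandas that the first line is data, not column names;
# names= supplies the column labels to use instead.
toy = pd.read_csv(raw, header=None, names=toy_names)

print(toy.shape)
print(list(toy.columns))
```

Without `header=None`, pandas would treat the first data row as the header, so that row would be lost from the table.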


### Separate out the features we'll use to train our model from the 'target' variable we're trying to predict
Make two new table views, one which drops the final column (i.e. 'Cover_Type'), and the other which contains *only* the final column.  Call the first one 'features' and the second one 'target'.

In [4]:
# drop() returns a new table; 'data' itself still contains 'Cover_Type'
features = data.drop(['Cover_Type'], axis=1)
target = data['Cover_Type']
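
The same feature/target separation can be tried on a tiny table to see what each piece looks like. This is only a sketch: the frame `toy` and its three columns are invented for illustration, and `drop(columns=...)` is used as an equivalent spelling of `drop([...], axis=1)`.

```python
import pandas as pd

# Toy table (made-up rows) with two feature columns and one target column.
toy = pd.DataFrame({
    'Elevation': [2596, 2590, 2804],
    'Aspect': [51, 56, 139],
    'Cover_Type': [5, 5, 2],
})

# Everything except the target column; drop() returns a new DataFrame.
toy_features = toy.drop(columns=['Cover_Type'])
# Only the target column, as a pandas Series.
toy_target = toy['Cover_Type']

print(list(toy_features.columns))
print('Cover_Type' in toy.columns)  # checks that the original table still has the column
```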

### Format the data for scikit-learn
Scikit-learn works with input given as numpy arrays.  Fortunately, Pandas dataframes have an attribute '.values' that will give you the contents of the table formatted as a numpy array; use that to generate numpy versions of your 'features' and 'target' variables.  Call them 'X' and 'y' respectively.

In [5]:
# X holds the feature matrix, y holds the target labels
X = features.values
y = target.values
X

array([[2596,   51,    3, ...,    0,    0,    0],
       [2590,   56,    2, ...,    0,    0,    0],
       [2804,  139,    9, ...,    0,    0,    0],
       ..., 
       [2386,  159,   17, ...,    0,    0,    0],
       [2384,  170,   15, ...,    0,    0,    0],
       [2383,  165,   13, ...,    0,    0,    0]])
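
To make the shapes concrete, the sketch below converts a toy table (made-up values, hypothetical names `toy`, `toy_X`, `toy_y`) the same way. It also shows `.to_numpy()`, which is the spelling newer Pandas versions recommend and which gives the same array here.

```python
import numpy as np
import pandas as pd

# Toy table (made-up rows) standing in for the real data.
toy = pd.DataFrame({
    'Elevation': [2596, 2590, 2804],
    'Aspect': [51, 56, 139],
    'Cover_Type': [5, 5, 2],
})

toy_X = toy.drop(columns=['Cover_Type']).values  # 2-D array: one row per sample
toy_y = toy['Cover_Type'].values                 # 1-D array: one label per sample

print(type(toy_X).__name__, toy_X.shape)
print(type(toy_y).__name__, toy_y.shape)

# .to_numpy() yields the same contents as .values for this table
same = np.array_equal(toy_X, toy.drop(columns=['Cover_Type']).to_numpy())
print(same)
```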

### Import some classifiers from scikit-learn
We're going to import several classifiers so we can compare them, as well as some model selection tools that will let us do validation

In [6]:
# classifiers to compare
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# validation tools
from sklearn.model_selection import train_test_split, cross_val_score

### Try out two different classifiers, and see how well they perform
Try out the classifiers 'svm.SVC' and 'neighbors.KNeighborsClassifier'.  If you don't give them any explicit parameters, they'll use defaults, which is fine for now.  Try fitting each to the whole data set, and then testing them on the whole data set.

See the posted example from class before break, and also the online documentation for scikit-learn:

http://scikit-learn.org/stable/modules/svm.html#classification

http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification


In [7]:
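# hold out 33% of the rows for testing; the split is random, so scores vary between runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
clf = KNeighborsClassifier()     # default parameters
clf.fit(X_train, y_train)
clf.score(X_test, y_test)        # accuracy on the held-out rows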

0.96414303149154557

In [8]:
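# second classifier: a decision tree (used here in place of svm.SVC), same split as above
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
tree.score(X_test, y_test)       # accuracy on the held-out rows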

0.93271928818050009
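
The prompt asks for fitting and testing on the whole data set, while the cells above score on a held-out third. The sketch below shows how the two kinds of score can differ. It uses synthetic data from `make_classification` rather than the covertype data, and every name in it (`X_toy`, `model`, `train_acc`, and so on) is made up for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; this is not the covertype data).
X_toy, y_toy = make_classification(n_samples=500, n_features=10,
                                   n_informative=5, flip_y=0.1,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33,
                                          random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_tr, y_tr)

# Scoring on the rows the model was trained on tends to look optimistic,
# because an unpruned tree can memorize its training data.
train_acc = model.score(X_tr, y_tr)
# Scoring on held-out rows estimates performance on unseen data.
test_acc = model.score(X_te, y_te)

print('accuracy on training rows:', train_acc)
print('accuracy on held-out rows:', test_acc)
```

A score computed on the training rows says little about novel data, which is why a held-out or cross-validated score is the one to report.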

### Now use cross-validation to see how well they are actually doing on novel data
The idea of cross-validation is to try splitting the data several times and seeing not only how well the algorithm does on the unseen part of the data, but also seeing how consistent it is across different splits.  Try using 5 folds (i.e. use the parameter 'cv=5').

This will return an array of accuracy scores, one per fold; be sure to print out both the mean and the standard deviation of this array.

See the documentation for further details:

http://scikit-learn.org/stable/modules/cross_validation.html

In [9]:
# 5-fold cross-validation on the full data set
A_scores = cross_val_score(clf, X, y, cv=5)
B_scores = cross_val_score(tree, X, y, cv=5)

print('K Neighbors (mean):', A_scores.mean())
print('Decision Tree (mean):', B_scores.mean())

print('K Neighbors (std-dev):', A_scores.std())
print('Decision Tree (std-dev):', B_scores.std())

K Neighbors (mean): 0.509239366804
Decision Tree (mean): 0.557119696182
K Neighbors (std-dev): 0.0237856884632
Decision Tree (std-dev): 0.0339459637218
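
These cross-validated scores are much lower than the hold-out scores from the earlier split. One possible factor, offered as an untested assumption, is how the folds are drawn: `train_test_split` shuffles rows by default, whereas an integer `cv` with a classifier uses stratified folds without shuffling, so any ordering in the file carries over into the folds. The sketch below shows how shuffling can be made explicit by passing a splitter object. It runs on synthetic data, and the names `splitter` and `toy_scores` are made up.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; this is not the covertype data).
X_toy, y_toy = make_classification(n_samples=300, n_features=8,
                                   n_informative=4, random_state=0)

# Five stratified folds, with rows shuffled before they are assigned to folds.
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
toy_scores = cross_val_score(KNeighborsClassifier(), X_toy, y_toy, cv=splitter)

print('mean accuracy:', toy_scores.mean())
print('std-dev:', toy_scores.std())
```

Running both variants (`cv=5` and `cv=splitter`) on the real `X` and `y` would show whether row order actually matters for this data set.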
