In [1]:
import pandas as pd

I'm going to write a basic pipeline for supervised learning and look at how multiple classifiers can be used to solve the same problem. Then I'll build a little more intuition for what it means for an algorithm to learn something from data (this sounds magical, but it's not).

Imagine you're building a spam classifier.
![spam classifier](./assets/spamClassifier.png)
This is just a function that labels an email as spam or not spam. Say you've already collected a dataset and you're ready to train a model.  


In [2]:
pd.DataFrame({"Email":["Click here to claim your prize!","What's new?","Hang out later?","You have won $1000,000","..."],
              "Label":["Spam","Not spam","Not spam","Spam","..."]
             })

Unnamed: 0,Email,Label
0,Click here to claim your prize!,Spam
1,What's new?,Not spam
2,Hang out later?,Not spam
3,"You have won $1000,000",Spam
4,...,...


But before you put it into production, there's a question you need to answer: how accurate will it be when it has to classify emails that weren't in your training data? You want to verify that a model works well before you deploy it, and you can run an experiment to find out.

![train and test](./assets/spamDataset2.png)

One approach is to partition your data into two parts. Call these _TRAIN_ and _TEST_.

* Train: use train to train the model.
* Test: use test to see how accurate it is on new data.

This is a common pattern so let's see how it looks in code.

# Example 1

In [3]:
# import a dataset into scikit
from sklearn import datasets

# using iris because it's provided by scikit
iris = datasets.load_iris()

# I've already used Iris in module 2
# what I haven't done before is name the pieces:
# I'm calling the features "X" and the labels "y"
X = iris.data
y = iris.target

The reason I'm using _X_ and _y_ is that I can think of a classifier as a function. Think of X as the input (i.e. the features) and y as the output (i.e. the label).

```python
y = f(X)
```
After importing the dataset, you need to partition it. The utility below is handy, and its syntax is clear.

In [4]:
# using model_selection because the cross_validation module was deprecated in version 0.18 and removed in 0.20.
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split
# train_test_split returns the pieces in this order: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

The code above takes our X's and y's (i.e. features and labels) and partitions them into two sets. *X_train* and *y_train* are the features and labels for the training set, and *X_test* and *y_test* are the features and labels for the testing set. The parameter *test_size* is set to 0.5 to split the data into two equal halves. Iris has 150 records, so 75 go into training and the other 75 go into testing.
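
To make the split concrete, here is a small sketch that checks the sizes of the four pieces. The `random_state=0` argument is my own addition (it isn't used in the cell above); it fixes the shuffle so the same split comes back on every run.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# train_test_split returns the pieces in this order:
# X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0  # random_state is an assumption, for repeatability
)

print(X_train.shape, X_test.shape)  # (75, 4) (75, 4)
print(y_train.shape, y_test.shape)  # (75,) (75,)
```

Leaving `random_state` out, as the notebook does, gives a different shuffle on every run.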

Now let's create the classifier. I'm going to use two different types of classifier to show how they accomplish the same task. 

In [5]:
X_test.shape

(75, 4)

In [6]:
y_test.shape

(75,)

In [7]:
# starting with the decision tree
# this was covered in previous module
# Note there are only two lines of code that are classifier specific.
from sklearn import tree
classifier = tree.DecisionTreeClassifier()

In [8]:
# Training the classifier with data
classifier.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [9]:
# at this point the classifier is ready to be used to classify data
# call the predict method to classify your testing data
predictions = classifier.predict(X_test)

In [10]:
# print out the predictions to see them
# each number corresponds to the type of iris the classifier predicts for a row in the testing data
print(predictions)

[2 0 0 0 2 2 2 0 0 1 1 2 2 2 0 0 0 2 1 2 2 2 2 2 1 1 2 0 1 1 2 0 1 0 1 2 1
 2 1 1 2 0 0 0 2 2 0 2 2 2 2 2 0 1 1 1 0 0 1 2 1 2 2 0 0 1 2 2 2 0 2 0 1 1
 1]


In [11]:
# now let's look at how accurate the classifier was on the testing set
# recall you have the true labels for the testing data located in variable "y_test"
# to calculate the accuracy, you can compare the predicted labels to the true labels and tally up the score
# there's a convenience method in scikit you can import to do this
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))

0.92


Notice the accuracy here is over 90%. The score may vary between trials due to the randomness in how the train / test data is partitioned.
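
To see that variation for yourself, one option is to repeat the experiment with a few different splits. This is only a sketch: the five seeds and the use of `random_state` are my own choices for illustration, not part of the cells above.

```python
from sklearn import datasets, tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

scores = []
for seed in range(5):  # five arbitrary seeds, chosen only for illustration
    # a different seed gives a different train / test partition
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    classifier = tree.DecisionTreeClassifier(random_state=seed)
    classifier.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, classifier.predict(X_test)))

# one accuracy per split; the values usually differ a little from each other
print(scores)
```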


# Example 2

Now for something interesting. By replacing these two lines of code, you can use a different classifier to accomplish the same task.

```python
from sklearn import tree
classifier = tree.DecisionTreeClassifier()
```

Instead of using a decision tree, I'm going to use one called k-nearest neighbors.

In [12]:
# using a new classifier to perform the same task
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier()

If you run your experiment, you'll see that the code works in exactly the same way. The accuracy may be different because a decision tree classifier works differently than k-nearest neighbors, and because of the randomness in the train / test split. Likewise, if I wanted to use a more sophisticated classifier, I could just import it and change these two lines. Otherwise, the code is the same.

```python
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier()
```

The remaining code runs the same as the code for the decision tree (above).

In [13]:
# Training the classifier with data
classifier.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [14]:
# at this point the classifier is ready to be used to classify data
# call the predict method to classify your testing data
predictions = classifier.predict(X_test)

In [15]:
# print out the predictions to see them
# each number corresponds to the type of iris the classifier predicts for a row in the testing data
print(predictions)

[2 0 0 0 1 2 2 0 0 1 1 2 2 2 0 0 0 2 1 2 2 2 1 2 1 1 2 0 1 1 2 0 1 0 1 2 1
 2 1 1 2 0 0 0 2 2 0 2 2 2 2 1 0 1 1 1 0 0 1 2 1 2 2 0 0 1 2 2 2 0 2 0 2 1
 1]


In [16]:
# now let's look at how accurate the classifier was on the testing set
# recall you have the true labels for the testing data located in variable "y_test"
# to calculate the accuracy, you can compare the predicted labels to the true labels and tally up the score
# there's a convenience method in scikit you can import to do this
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))

0.9466666666666667
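
The tally behind `accuracy_score` can also be done by hand. Here is a minimal sketch using made-up labels (these are not the iris results above): count the positions where the prediction matches the true label, then divide by the number of examples.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only (not the iris results above)
true_labels = np.array([0, 1, 2, 2, 1, 0, 1, 2])
predicted = np.array([0, 1, 2, 1, 1, 0, 2, 2])

# Count matching positions and divide by the number of examples
correct = int((true_labels == predicted).sum())
manual_accuracy = correct / len(true_labels)

print(correct, "of", len(true_labels), "correct")  # 6 of 8 correct
print(manual_accuracy)                             # 0.75
print(accuracy_score(true_labels, predicted))      # 0.75
```

By the same arithmetic, a score of 0.9466666666666667 on 75 test rows means 71 correct predictions.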


The takeaway here is that although there are many types of classifiers, at a high level they share a similar interface: create the classifier, call `fit` on the training data, then call `predict` on new data.
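
One way to see the shared interface is to put both classifiers in a loop and run the identical training and scoring code on each. This is a sketch under my own assumptions: the dictionary of names and the fixed `random_state=0` are there only to keep the example short and repeatable.

```python
from sklearn import datasets, tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=0
)

# Only this dictionary is classifier specific
classifiers = {
    "decision tree": tree.DecisionTreeClassifier(random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(),
}

results = {}
for name, classifier in classifiers.items():
    classifier.fit(X_train, y_train)          # same call for every classifier
    predictions = classifier.predict(X_test)  # same call for every classifier
    results[name] = accuracy_score(y_test, predictions)
    print(name, results[name])
```

Adding another scikit-learn classifier to the experiment should only require one more entry in the dictionary.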

# Learning from Data

What does it mean to learn from data?

Earlier I stated that features are called "X" and labels "y" because they are the input and output of a function. I already know what a function is from programming:

```python
def classify(features):
    # do some logic to pick a label (placeholder: nothing is decided yet)
    label = None
    return label
```

From the perspective of _supervised learning_, I don't want to write the code above by hand; I want an algorithm to learn it from training data. So what does it mean to learn a function? A function is just a mapping from input values to output values, for example:

```python
y = m * x + b
```

This is the equation for a line and there are two parameters:
* m: gives the slope
* b: gives the y-intercept

Given these parameters, you can compute y for any value of x and plot the line. In supervised learning, my classify function might have some parameters as well. The difference is that the input _X_ is the set of _features_ I want to classify, and the output _y_ is a _label_ (i.e. spam or not spam, or the type of flower).
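
As a quick illustration of a function with parameters, here is the line written as Python. The function name `line` and the values of `m` and `b` are hypothetical, picked only so the arithmetic is easy to check.

```python
def line(x, m, b):
    """Return y for a given x on the line with slope m and y-intercept b."""
    return m * x + b

# Hypothetical parameter values, chosen only for illustration
m, b = 2.0, 1.0
ys = [line(x, m, b) for x in [0, 1, 2, 3]]
print(ys)  # [1.0, 3.0, 5.0, 7.0]
```

Changing `m` or `b` changes every output, which is the sense in which the parameters define the function.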

# <span style = "color:red"> IMAGE </span>

What can the body of a function look like? This is the part you want to write algorithmically (i.e. learn). The important thing to understand here is that you are not starting from scratch and pulling the body of the function out of thin air. Instead, you start with a model. Think of a model as the _prototype_ or the _rules_ that define the body of your function. Typically a model has parameters that you can adjust with your training data.

# <span style = "color:red"> IMAGE </span>
    

Let's say you're trying to distinguish between red dots and green dots. To do this I'll use the x and y coordinates of a dot. How can you classify this data? You want a function that considers a new dot it has never seen before and classifies it as red or green; or maybe you have a lot of data you want to classify (i.e. a lot of dots the classifier has never seen before). These are dots that weren't in the training data. Since the classifier has never seen them before, how can it predict the right label? Imagine you could draw a line across the data. Then you could say the dots to the left of the line are green and the dots to the right of the line are red, and this line can serve as your classifier.
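
Here is a minimal sketch of a line acting as a classifier. The function `classify_dot`, the line `y = x`, and the four dots are all made up for illustration. For a line with positive slope, being above the line is the same as being to the left of it, so that is the test I use for green.

```python
def classify_dot(x, y, m, b):
    """Label a dot by which side of the line y = m*x + b it falls on."""
    # For a line with positive slope, "above the line" is the same as "left of it"
    return "green" if y > m * x + b else "red"

m, b = 1.0, 0.0  # hypothetical line y = x
dots = [(1.0, 3.0), (3.0, 1.0), (0.5, 2.0), (4.0, 0.0)]  # made-up (x, y) coordinates
labels = [classify_dot(x, y, m, b) for x, y in dots]
print(labels)  # ['green', 'red', 'green', 'red']
```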
    
# <span style = "color:red"> IMAGE </span>
    

So how can you learn this line? One way is to use the training data to adjust the parameters of a model, and the model here is a simple straight line (like the one shown above). That means you have two parameters to adjust, _m_ and _b_, and by changing them you can change where the line appears. So, how can you learn the right parameters? One idea is to adjust them iteratively using the training data. For example, you might start with a random line, then use it to classify your first example. If the classifier gets it right, you don't need to change the line, and you move on to the next example. But if it gets it wrong, you can slightly adjust the parameters of the model to make it more accurate. The takeaway here is that one way to think of learning is using training data to adjust the parameters of a model.
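
The loop described above can be sketched in a few lines. This is one possible update rule (a perceptron-style nudge), not the only way to learn a line, and everything specific in it is my own assumption: the made-up dots, the step size of 0.1, the seed, and the limit of 100 passes over the data.

```python
import random

# Made-up training dots: (x, y, label). Green dots sit above the line y = x, red below.
training_data = [
    (1.0, 3.0, "green"), (0.5, 2.0, "green"), (1.0, 4.0, "green"), (2.0, 4.0, "green"),
    (3.0, 1.0, "red"), (4.0, 0.0, "red"), (4.0, 2.0, "red"), (3.0, 0.0, "red"),
]

def classify_dot(x, y, m, b):
    """Green if the dot is above the line y = m*x + b, otherwise red."""
    return "green" if y > m * x + b else "red"

rng = random.Random(0)  # fixed seed so the run is repeatable
m, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # start with a random line
step = 0.1  # how far to nudge the parameters after a mistake (hypothetical value)

for epoch in range(100):
    mistakes = 0
    for x, y, label in training_data:
        if classify_dot(x, y, m, b) == label:
            continue  # right answer: leave the line alone
        mistakes += 1
        # Wrong answer: shift the line so this dot moves toward its correct side
        direction = 1 if label == "green" else -1
        m -= step * direction * x
        b -= step * direction
    if mistakes == 0:
        break  # every training dot is classified correctly

correct = sum(classify_dot(x, y, m, b) == label for x, y, label in training_data)
accuracy = correct / len(training_data)
print(m, b, accuracy)
```

With a small step and dots that a line can separate, as here, the loop stops as soon as one full pass produces no mistakes. The learned `m` and `b` depend on the starting line, so different seeds can give different lines that all separate the training dots.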

# Resources

[Write a pipeline](https://www.youtube.com/watch?v=84gqSbLcBFE)

[TensorFlow Playground](): For future study. TensorFlow Playground is an example of a neural network you can run and experiment with in your browser. Think of a neural network as a more sophisticated type of classifier than a decision tree or a simple line, but in principle the idea is similar.