## Basic Machine Learning
Machine learning is commonly divided into two broad categories: unsupervised and supervised. Unsupervised learning uses unlabeled data. Two goals of unsupervised learning are anomaly detection and clustering. For instance, imagine the iris dataset didn't have the species feature (i.e., it had no labels). Anomaly detection could tell us which of the 150 instances stand out from the rest of the data. Clustering could tell us that there are 3 distinct groups of instances in the data, thereby approximately recovering the unknown class labels.
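
As a rough illustration of the clustering idea, here is a minimal sketch that runs k-means on the iris measurements without ever looking at the species labels. The choice of k-means, `n_clusters=3`, `n_init=10`, and `random_state=0` are assumptions made for this example. Note that k-means has to be told how many groups to look for, and the cluster ids it returns are arbitrary numbers that need not line up with the species codes.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

iris = datasets.load_iris()

# Only the measurements (iris.data) are used; the labels (iris.target) are ignored.
# n_clusters=3 is our choice here, not something k-means discovers on its own.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(iris.data)

print(cluster_ids[:5])   # cluster ids assigned to the first five instances
print(cluster_ids[-5:])  # cluster ids assigned to the last five instances
```

Instances that share a cluster id were grouped together by the algorithm; comparing those groups with the true species is one way to see how well the labels were recovered.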

We are going to focus on supervised learning for now. A main goal of supervised learning is classification: given labeled data, we create a model that will label unlabeled instances.

Let's look at an example.
<br><br>
## Load Data

In [39]:
from sklearn import datasets
from sklearn.model_selection import train_test_split

loaded_data = datasets.load_iris()

x = loaded_data.data # By convention we store instances in a variable called x
y = loaded_data.target # By convention we store labels in a variable called y

In [40]:
x[0] # This is the first instance

array([5.1, 3.5, 1.4, 0.2])

In [41]:
y[0] # This is the label for the first instance. 

0

In [42]:
print(loaded_data.target_names[y[0]]) # Remember, 0 means setosa

setosa


<br>
<br>

## Train a model

<br>

### First we set up the model.

In [43]:
from sklearn import neighbors
# We are going to set up a nearest neighbors classifier.
# The 3 specifies how many neighbors to use when determining an instance's class.
# All model types have parameters that a user sets prior to training.
# These user-settable parameters are called hyperparameters.
nn_model = neighbors.KNeighborsClassifier(n_neighbors=3)




### Then we train the model.

In [44]:
nn_model.fit(x, y) # The data we use to train a model is called the training set.

KNeighborsClassifier(n_neighbors=3)
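
To make the role of the neighbors hyperparameter concrete, here is a minimal from-scratch sketch of the nearest neighbors idea. The helper name `knn_predict` is hypothetical, and the sketch assumes Euclidean distance with a simple majority vote. It is meant to show the principle, not to reproduce sklearn's implementation, which may break ties differently.

```python
from collections import Counter

import numpy as np
from sklearn import datasets


def knn_predict(x_train, y_train, query, k=3):
    """Label `query` by majority vote among its k closest training instances."""
    # Euclidean distance from the query to every training instance
    distances = np.linalg.norm(x_train - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


iris = datasets.load_iris()
sketch_label = knn_predict(iris.data, iris.target, iris.data[0], k=3)
print(iris.target_names[sketch_label])
```

Changing `k` changes how many neighbors get a vote, which is exactly what the hyperparameter controls.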

### Now that we have a model, let's use it to classify some instances.

In [45]:
# Have the model create labels for the instances.
# The data we use to test a model is called the test set; here we reuse the training data.
predictions = nn_model.predict(x)

In [46]:
print(y) # print out the known labels
print(predictions) # print out the model generated labels

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1
 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2
 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


### Let's generate a single value to represent how well our model worked.
This value is called accuracy: the fraction of predictions that match the known labels. There are better measurements of how well models perform, and we'll look at these later. For now, accuracy is good enough.

In [47]:
# This code iterates over the predictions and labels, adding 1 to num_correct when the prediction matches the corresponding label
num_predictions = len(predictions)
num_correct = 0

for instance_num in range(num_predictions):
    if predictions[instance_num] == y[instance_num]:
        num_correct += 1

print(f"The accuracy is {num_correct / num_predictions}")



The accuracy is 0.96


In [48]:
# Or we can use sklearn's function to do the same
from sklearn.metrics import accuracy_score
print(f"The accuracy is {accuracy_score(y, predictions)}")

The accuracy is 0.96
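
Both cells above compute the same quantity: the number of correct predictions divided by the total number of predictions. As a sketch, the same calculation can also be written as a single NumPy expression. The variable names below are chosen for this example only, and the model is retrained so the block runs on its own.

```python
import numpy as np
from sklearn import datasets, neighbors
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
model = neighbors.KNeighborsClassifier(3).fit(iris.data, iris.target)
preds = model.predict(iris.data)

# preds == iris.target is an array of True/False values;
# its mean is the fraction of True entries, i.e. the accuracy.
vectorized_accuracy = np.mean(preds == iris.target)
sklearn_accuracy = accuracy_score(iris.target, preds)

print(vectorized_accuracy, sklearn_accuracy)
```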


<br>

Okay, that wasn't that impressive. Why? We created a model using 150 instances, then checked our model using those same 150 instances. What's wrong with this? Above we used the nearest neighbor (NN) classifier. If you knew every new unclassified instance exactly matches one of the training instances, could you design a method better than NN? Think about it for a minute. One better solution would be to just look up the label of the matching training instance and assign that label to the new instance. Such a lookup would label every instance it has already seen correctly, yet it tells us nothing about how well a model handles instances it has never seen.
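
The lookup idea can be sketched with a plain dictionary. This is only an illustration of the argument; the names `lookup` and `unseen` are made up for this example, and the `unseen` measurements are invented values that do not occur in the dataset.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Memorize every training instance: measurements -> label
lookup = {tuple(row): label for row, label in zip(iris.data, iris.target)}

# "Classify" each instance by looking it up; this only works for instances already seen
lookup_predictions = [lookup[tuple(row)] for row in iris.data]
num_matches = sum(p == t for p, t in zip(lookup_predictions, iris.target))
lookup_accuracy = num_matches / len(iris.target)
print(f"The accuracy is {lookup_accuracy}")

# An instance the lookup has never seen (invented measurements)
unseen = (9.9, 9.9, 9.9, 9.9)
print(unseen in lookup)  # the dictionary has no answer for it
```

The dictionary scores perfectly on data it has memorized and has nothing to say about anything else, which is why testing on the training instances is a weak test.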

<br>

A better test of our model would be to train the model with some labeled data, then test it with a different set of labeled data. Instead of collecting new instances, we will randomly split the data into two different sets: training and test.

In [58]:
# Split the data into training and test sets
seed = 49

# train_test_split is built into sklearn. It randomly chooses 80% of the instances
# for the training set and uses the remaining 20% (test_size=.20) for the test set.
# random_state is an optional argument; passing a fixed seed guarantees that the
# random selections are always the same.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.20, random_state=seed)

print(x_train[0]) # Run this cell several times and this will print the same instance every time.

# Now change the seed to any other number and try it again. x_train[0] will most likely be a different instance. Using a seed allows us to perform reproducible science.

[5.7 3.8 1.7 0.3]
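
As a quick check of the reproducibility claim, the sketch below performs the same split twice with the same seed and compares the results. The helper name `split_with_seed` is hypothetical and exists only for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()


def split_with_seed(seed):
    """Return an 80/20 split of the iris data for the given seed."""
    return train_test_split(iris.data, iris.target, test_size=.20, random_state=seed)


x_train_a, x_test_a, y_train_a, y_test_a = split_with_seed(49)
x_train_b, x_test_b, y_train_b, y_test_b = split_with_seed(49)

print(x_train_a.shape, x_test_a.shape)       # sizes of the training and test sets
print(np.array_equal(x_train_a, x_train_b))  # same seed, same training set
print(np.array_equal(y_test_a, y_test_b))    # same seed, same test labels
```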


## Train again
Now we are going to train a new model using the training set, then test it using the test set.

In [59]:

nn_model = neighbors.KNeighborsClassifier(3)

nn_model.fit(x_train, y_train) # train the model using the training set

KNeighborsClassifier(n_neighbors=3)

## Calculate Accuracy

In [60]:
predictions = nn_model.predict(x_test)
print(predictions)
print(y_test)


[1 2 1 2 2 0 2 2 2 2 0 1 0 1 1 1 1 2 2 0 0 1 0 2 0 1 1 2 0 2]
[1 2 1 2 2 0 2 2 2 2 0 1 0 1 1 2 1 2 2 0 0 1 0 2 0 1 1 1 0 2]


In [61]:
# Use sklearn's function to calculate the accuracy
print(f"The accuracy is {accuracy_score(y_test, predictions)}")

The accuracy is 0.9333333333333333
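
The two printed arrays above differ in two places, which is where the model was wrong. Rather than comparing them by eye, a short sketch like the following can list the misclassified test instances. The variable names are chosen for this example, and the split and model are rebuilt so the block runs on its own; exact results may vary with library versions.

```python
import numpy as np
from sklearn import datasets, neighbors
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_tr, x_te, y_tr, y_te = train_test_split(iris.data, iris.target, test_size=.20, random_state=49)

model = neighbors.KNeighborsClassifier(3).fit(x_tr, y_tr)
preds = model.predict(x_te)

# Positions in the test set where the prediction differs from the known label
wrong = np.flatnonzero(preds != y_te)

for i in wrong:
    print(i, x_te[i], "predicted:", iris.target_names[preds[i]], "actual:", iris.target_names[y_te[i]])
```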
