# Building Models
## Objectives
- Learn how models work
- Build a model on a real dataset

## [10] Discussion Exercise to Review ML Concepts
Suppose we are trying to predict whether a given tissue sample is cancerous.

- What type of machine learning is this?
- What is our target?
- What are some potential features?

# [30] Learn the k-Nearest Neighbors Algorithm

k-Nearest Neighbors is a simple algorithm that helps us illustrate how a model works and what parameters can be tuned.
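
To make the idea concrete, here is a minimal from-scratch sketch of kNN classification: find the `k` closest labeled points and take a majority vote. The function name `knn_predict` and the points below are hypothetical (they are not taken from the worksheets), and Euclidean distance is an assumption.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_predict(points, labels, query, k):
    """Classify `query` by majority vote among its k nearest labeled points."""
    # Rank the labeled points by their distance to the query.
    order = sorted(range(len(points)), key=lambda i: dist(points[i], query))
    nearest_labels = [labels[i] for i in order[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D points, chosen only for illustration.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (5.0, 5.0), (6.0, 5.5), (3.0, 3.0)]
labels = ["blue", "blue", "blue", "red", "red", "red"]
query = (2.8, 2.6)

for k in (1, 3, 5):
    print(k, knn_predict(points, labels, query, k))
```

With these particular points, the single nearest neighbor is red, but the vote flips to blue at `k=3` and `k=5`, which shows why `k` is a parameter worth tuning.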

Let's work together through the following worksheets.


### kNN Worksheet 1

![kNN_Worksheet_1](kNN_sheet_1.png)

### Questions
1. What is the nearest point to the question mark? This is also called the 1-nearest neighbor.
2. What are the 2 nearest neighbors to the question mark? Based on that, how should we classify the question mark?
3. What about the 3 nearest neighbors?
4. How about the 5 nearest neighbors?
5. What do you notice happening as k increases?

### kNN Worksheet 2

![kNN_Worksheet_2](kNN_sheet_2.png)

### Questions
1. What would a 1-nearest-neighbor classifier output?
2. What about a 3-nearest-neighbors classifier?

## Discussion
1. Any thoughts about this model?
2. Do you see any drawbacks or benefits?

**Notes:** Let's talk about explainable models vs. black-box models.
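
One reason kNN is often considered explainable is that every prediction can be traced back to specific training points. The sketch below uses the `kneighbors` method of scikit-learn's `KNeighborsClassifier` to list the neighbors behind one prediction. The training points and the query are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D training points with two classes.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [5.0, 5.0], [6.0, 5.5], [3.0, 3.0]])
y_train = np.array(["blue", "blue", "blue", "red", "red", "red"])

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

query = np.array([[2.8, 2.6]])
prediction = model.predict(query)[0]

# kneighbors returns the distances to, and the row indices of,
# the nearest training points, ordered from closest to farthest.
distances, indices = model.kneighbors(query)
for d, i in zip(distances[0], indices[0]):
    print(f"neighbor {X_train[i]} label={y_train[i]} distance={d:.2f}")
print("prediction:", prediction)
```

Reading the printed neighbors alongside the prediction shows exactly which examples produced the vote, something that is much harder to do with a black-box model.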

# [30] Let's build a kNN model using `scikit-learn`
Let's walk through the following machine learning workflow together.

**First, let's orient ourselves to scikit-learn**

## https://scikit-learn.org/stable/

In [8]:
import pandas as pd
df = pd.read_csv("https://s3-us-west-2.amazonaws.com/ga-dat-2015-suneel/datasets/breast-cancer.csv", header=None)

In [9]:
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


## Let's go through the ML Workflow
1. Separate the data into features and target
2. Separate those into training and validation sets
3. Train the model using the training data
4. Test the model on the validation data

**1. Separate into features and target.**

In [16]:
y.head()

0    M
1    M
2    M
3    M
4    M
Name: 1, dtype: object

Let's see how we home in on just our feature columns...

Let's take a look at our feature "matrix".

In [21]:
X.head()

| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | ... | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |
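
The cells that created `y` and `X` are not shown, but the outputs suggest one plausible approach: `y` is the column named `1`, and `X` keeps the columns from `2` onward. The sketch below assumes column `0` is a sample ID and column `1` is the diagnosis label. It uses a small stand-in frame, `df_small` (a hypothetical name), built from a few values in the `df.head()` output so that it runs without downloading the CSV.

```python
import pandas as pd

# Tiny stand-in with the same layout as `df`: column 0 is assumed to be an ID,
# column 1 the diagnosis label, and the remaining columns measurements.
# Only the first three feature columns are reproduced here.
df_small = pd.DataFrame([
    [842302, "M", 17.99, 10.38, 122.8],
    [842517, "M", 20.57, 17.77, 132.9],
    [84300903, "M", 19.69, 21.25, 130.0],
])

y = df_small[1]                    # target: the diagnosis column
X = df_small.drop(columns=[0, 1])  # features: everything except ID and diagnosis

print(y.name, list(X.columns))
```

Dropping the ID column matters because an arbitrary identifier carries no information about the tissue itself, yet a distance-based model would still treat it as a measurement.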


**2. Separate into training and validation sets**

## Exercise
Use `train_test_split` to separate `X` and `y` into training and test (a.k.a. validation) sets.

In [22]:
from sklearn.model_selection import train_test_split
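
As a reference for the exercise, here is a minimal sketch of one way to call `train_test_split`. It uses scikit-learn's bundled breast cancer dataset as a stand-in so that it runs on its own. The `test_size` and `random_state` values are arbitrary choices, not values prescribed by this lesson.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled breast cancer dataset,
# used here so the example runs without downloading the CSV.
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,   # hold out a quarter of the rows (an arbitrary choice)
    random_state=42,  # fixed seed so the split is reproducible
)

print(X_train.shape, X_test.shape)
```

The function shuffles the rows before splitting and returns the pieces in the order `X_train, X_test, y_train, y_test`, so the features and labels stay aligned row by row.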

**3. Let's train the model on the training data**

In [1]:
# from sklearn.neighbors import KNeighborsClassifier
# cancer_predictor = KNeighborsClassifier(n_neighbors=5)
# cancer_predictor.fit(X_train, y_train)

**Let's make a prediction**

In [3]:
# y_test.head(10)

In [2]:
# cancer_predictor.predict(X_test.head(10))

**4. Let's score the model**

In [4]:
# cancer_predictor.score(X_test, y_test)
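
Steps 3 and 4 can be sketched end to end as follows. This is a self-contained illustration on scikit-learn's bundled breast cancer dataset rather than the CSV loaded earlier, so the labels are `0`/`1` instead of letters. The `random_state` value is an arbitrary choice.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (bundled with scikit-learn), not the CSV loaded earlier.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train: for kNN, fitting essentially stores the training data.
cancer_predictor = KNeighborsClassifier(n_neighbors=5)
cancer_predictor.fit(X_train, y_train)

# Predict labels for the held-out rows and compare them with the truth.
predictions = cancer_predictor.predict(X_test)
manual_accuracy = (predictions == y_test).mean()

# For classifiers, score() reports mean accuracy on the given data.
accuracy = cancer_predictor.score(X_test, y_test)
print(accuracy, manual_accuracy)
```

The two printed numbers match, because `score` for a classifier is the fraction of test rows whose predicted label equals the true label.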

# [30] Mini-Lab: Model Tuning
Let's learn how to tune kNN to see if/how we can improve the performance of our model.

**Try building a model for different values of `n_neighbors` from 3 to 101 and save the accuracies in a list**

HINT: Use a `for` loop.

**Then, we'll graph `n_neighbors` vs. score to see which value is best for this dataset**
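
One possible shape for this mini-lab is sketched below, again on scikit-learn's bundled breast cancer dataset as a stand-in. Stepping through odd values of `n_neighbors` only is an assumption made here to avoid tied votes between two classes. The names `k_values`, `accuracies`, and `best_k` are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (bundled with scikit-learn), not the CSV loaded earlier.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

k_values = list(range(3, 102, 2))  # odd values from 3 to 101 (an assumption)
accuracies = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

# The k with the highest test accuracy (the first one, if several tie).
best_k = k_values[accuracies.index(max(accuracies))]
print("best n_neighbors:", best_k)

# Plot accuracy against n_neighbors.
fig, ax = plt.subplots()
ax.plot(k_values, accuracies)
ax.set_xlabel("n_neighbors")
ax.set_ylabel("accuracy on the test set")
plt.show()
```

The shape of the curve depends on the data and on the particular split, so rerunning with a different `random_state` may move the best value.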

In [30]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib
