```
---
title: Intro Classification + KNN
duration: "1:5"
creator:
    name: Kiefer Katovich + David Yerrington
    city: SF
    updated by: John Marin
    city: LA
---
```

<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 15px; height: 80px">

# Intro to Classification + KNN
Week 5 | Lesson 1.1



### Classification is the prediction of a <u>qualitative</u> response for an observation (unlike regression where we assume the response variable is quantitative).

- <b> Predicting a qualitative response for an observation can be referred to as classifying that observation, since it involves assigning the observation to a category, or class. <br> <br>

- It is also the case that the methods used for classification first predict the probability of each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods. </b>

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Identify classification problems
- Understand the difference between regression and classification problems
- Understand the basic difference between KNN and Logistic Regression

---

# (5 mins) Discussion:

# 1. Exactly what is classification and how is it different from regression?
# 2. What sort of problems do you solve with classification?


## Identify that problem:  Classification or Regression?
<br>
<img src="https://snag.gy/0j5DbL.jpg" width=500> 




## Identify that problem: Classification or Regression? 
<br>

<img src="https://snag.gy/Rk6sEw.jpg" width=500> 

## Identify that problem: Classification or Regression?
<br>

<img src="https://snag.gy/lS3FNa.jpg" width=520> 
<br>
<br>

<br><br>
# Common Classification Methods

## K-Nearest Neighbors
![](https://snag.gy/J38wxN.jpg)

## Logistic Regression
![](https://snag.gy/NCnh3b.jpg)

## Naive Bayes
![](https://snag.gy/ZJwmn6.jpg)

## Decision Trees
![](https://snag.gy/cJL5gr.jpg)

## Classification Intro

![](https://snag.gy/0Jns5x.jpg)

Classification methods in machine learning are fundamentally supervised methods: each observation in the training data is associated with a (discrete) label designating its class.  Classification differs from regression (which predicts continuous values) because we are now predicting classes / labels.  This can be thought of as a discrimination problem: modelling the differences or similarities between groups.

*Classification is supervised because we know the labels of our training (sampled) observations.*





## Classification Evaluation

Classification is assessed quite differently from regression on a continuous response.  Generally, we are concerned with whether we misidentified an observation, missed it completely, or predicted it correctly on our training and test sets during cross-validation.  Related ideas that come up when evaluating a classifier include **precision**, **recall**, **accuracy**, **F-measures**, **class imbalance**, and our beloved **Receiver Operating Characteristic (ROC) curve**.

Once we get a sense of how our classification method performs, we have the opportunity to tune for sensitivity or specificity.

> **Sensitivity and specificity** are statistical measures of the performance of a binary classification test, also known in statistics as a classification function:

> **Sensitivity** (also called the true positive rate, the recall, or probability of detection in some fields) measures the proportion of positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).

> **Specificity** (also called the true negative rate) measures the proportion of negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).

> [Sensitivity and Specificity](https://en.wikipedia.org/wiki/Sensitivity_and_specificity)

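As a minimal sketch, both quantities can be computed directly from true and predicted binary labels. The labels below are made up for illustration, and `sensitivity_specificity` is a hypothetical helper name, not a library function.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = tp / (tp + fn)  # share of actual positives we caught
    specificity = tn / (tn + fp)  # share of actual negatives we cleared
    return sensitivity, specificity

# Illustrative labels only: 1 = sick, 0 = healthy
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Here 3 of the 4 sick people are flagged (sensitivity 0.75) and 4 of the 6 healthy people are cleared (specificity of about 0.67).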

## Contingency Table / Confusion Matrix
<img src="https://snag.gy/qit9l3.jpg" width=520> 
<br>
<br>

## Confusion Matrix - sklearn
<img src="https://snag.gy/b6VDIo.jpg" width=500>
<br>

## ROC Curve Analysis
<img src="https://snag.gy/CBxZbh.jpg" width=500>

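A ROC curve plots the true positive rate against the false positive rate as the decision threshold is varied. The sketch below builds those points by hand for a tiny set of made-up scores; `roc_points` is a hypothetical helper written for this illustration, not part of sklearn.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs, one per candidate threshold, for 0/1 labels."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    # Start with a threshold above every score: nothing is predicted positive
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Illustrative labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)

# Area under the curve via the trapezoid rule
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(points)
print(auc)
```

Lowering the threshold moves along the curve from (0, 0) toward (1, 1): more positives are caught, at the cost of more false alarms.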
## sklearn "classification_report()"
```python
>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 0]
>>> y_pred = [0, 0, 2, 1, 0]
>>> target_names = ['class 0', 'class 1', 'class 2']
>>> print(classification_report(y_true, y_pred, target_names=target_names))
             precision    recall  f1-score   support

    class 0       0.67      1.00      0.80         2
    class 1       0.00      0.00      0.00         1
    class 2       1.00      0.50      0.67         2

avg / total       0.67      0.60      0.59         5
```
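To see where the per-class numbers in the report come from, the sketch below recomputes precision, recall, and F1 by hand for the same `y_true` and `y_pred`. The helper `per_class_scores` is a hypothetical name used only for this illustration.

```python
y_true = [0, 1, 2, 2, 0]
y_pred = [0, 0, 2, 1, 0]

def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for a single class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)  # times we predicted this class
    actual = sum(1 for t in y_true if t == label)     # times it truly occurred (support)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1, actual

report = {label: per_class_scores(y_true, y_pred, label) for label in sorted(set(y_true))}
for label, (precision, recall, f1, support) in report.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} "
          f"f1={f1:.2f} support={support}")
```

The three printed rows match the per-class rows of the report above; the final row of that report is the support-weighted average of each column.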

# K Nearest Neighbors classification walkthrough

From here on out we are going to look at how the kNN algorithm classifies tumors as malignant or benign in the Wisconsin breast cancer dataset.

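As a rough preview, the sketch below fits sklearn's `KNeighborsClassifier` to the copy of the Wisconsin diagnostic breast cancer data that ships with scikit-learn. Treating that bundled copy as the dataset for this lesson is an assumption, and the split size, `random_state`, and `n_neighbors=5` are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: use the breast cancer data bundled with scikit-learn
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set, keeping the class balance the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# kNN is distance-based, so put all features on a common scale first
scaler = StandardScaler().fit(X_train)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(scaler.transform(X_train), y_train)

accuracy = knn.score(scaler.transform(X_test), y_test)
print(f"test accuracy: {accuracy:.3f}")
```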
---

## kNN

### What is k-Nearest Neighbors?

- When a prediction is required for an unseen (out-of-sample) data instance, the kNN algorithm searches the training dataset and locates the k most similar instances. The labels of those instances are summarized (i.e. "voted upon"), and the result becomes the predicted value for the unseen instance. <br><br>

- The similarity measure depends on the data involved. The Euclidean distance is often used, but categorical or binary data lend themselves to the Hamming distance. <br><br>

- For regression problems, the average of the neighbors may be returned, whereas for classification problems, the most prevalent class is returned. <br><br>

![](https://snag.gy/hatSE6.jpg)

The pseudocode algorithm for kNN is as follows:



```
for unclassified_point in sample:
    for known_point in known_class_points:
        calculate distance (euclidean or other) between known_point and unclassified_point
    sort known_class_points by distance to unclassified_point
    take the first k of them as k_nearest_points
    assign class to unclassified_point using "votes" from k_nearest_points
```
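The pseudocode can be turned into a short runnable sketch. This is a brute-force illustration using Euclidean distance and a simple majority vote, not sklearn's implementation; `knn_predict` and the toy points are hypothetical and made up for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every known point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy two-feature data, illustrative only
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 5.5), (6.5, 7.0)]
train_labels = ["benign", "benign", "benign", "malignant", "malignant", "malignant"]

first = knn_predict(train_points, train_labels, (1.2, 1.4))
second = knn_predict(train_points, train_labels, (6.4, 6.1))
print(first, second)
```

The first query sits among the benign points and the second among the malignant ones, so the votes are unanimous in both cases.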
> ### Common KNN Distance Functions
> These distance functions can be used with KNN.  Euclidean is the most common choice. In each formula, $n$ is the number of features and $x_i$, $y_i$ are the $i$-th feature values of the two points.
>
> ### Euclidean  
> $\sqrt{\sum\limits_{i=1}^n(x_i - y_i)^2}$
>
> ### Manhattan 
> $\sum\limits_{i=1}^n \left| x_i - y_i \right|$
>
> ### Minkowski
> $\left(\sum\limits_{i=1}^n |x_i-y_i|^p\right)^{1/p}$

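The three distance functions translate almost directly into code. The sketch below uses plain Python with made-up points; it also shows that Minkowski distance reduces to Manhattan distance at `p=1` and to Euclidean distance at `p=2`.

```python
def euclidean(x, y):
    """Straight-line distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def manhattan(x, y):
    """Sum of absolute differences along each feature."""
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    """Generalization of both: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

# Illustrative points only
x, y = (0.0, 0.0), (3.0, 4.0)

print(euclidean(x, y))     # 5.0
print(manhattan(x, y))     # 7.0
print(minkowski(x, y, 1))  # 7.0
print(minkowski(x, y, 2))  # 5.0
```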
---

[NOTE: in the case of ties, sklearn's `KNeighborsClassifier()` will just choose the first class when using uniform weights! If this is unappealing to you, you can change the `weights` keyword argument to `'distance'`.]
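
The effect of distance weighting on a tied vote can be sketched in plain Python. This illustrates the idea of inverse-distance weighting only; it is not sklearn's code, the `vote` helper and the neighbor list are hypothetical, and it assumes no neighbor sits at distance zero.

```python
from collections import defaultdict

def vote(neighbors, weights="uniform"):
    """Tally votes from (distance, label) pairs for the k nearest points."""
    totals = defaultdict(float)
    for distance, label in neighbors:
        # uniform: every neighbor counts once
        # distance: closer neighbors count more (assumes distance > 0)
        totals[label] += 1.0 if weights == "uniform" else 1.0 / distance
    return dict(totals)

# Four neighbors, two per class: a tie under uniform voting
neighbors = [(0.5, "malignant"), (1.0, "malignant"), (2.0, "benign"), (4.0, "benign")]

uniform = vote(neighbors, "uniform")
weighted = vote(neighbors, "distance")
print(uniform)
print(weighted)
```

With uniform weights both classes receive 2 votes, so the winner depends on the tie-breaking rule. With inverse-distance weights the two close malignant neighbors total 3.0 against 0.75 for benign, and the tie disappears.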