<a href="https://colab.research.google.com/github/veyselberk88/Data-Science-Tools-and-Ecosystem/blob/main/lec37.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="./ccsf.png" alt="CCSF Logo" width=200px style="margin:0px -5px">

# Lecture 37: Implementing Classifiers

Associated Textbook Sections: [17.4](https://ccsf-math-108.github.io/textbook/chapters/17/4/Implementing_the_Classifier.html)

---

## Outline

* [Implementing a Classifier](#Implementing-a-Classifier)
* [Feature Selection](#Feature-Selection)
* [Rows](#Rows)
* [Calculating Distance](#Calculating-Distance)
* [Splitting the Data](#Splitting-the-Data)
* [Nearest Neighbors](#Nearest-Neighbors)

---

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from mpl_toolkits.mplot3d import Axes3D

---

## Implementing a Classifier

---

### A Process

```mermaid
flowchart TD
    A["Population"] --> B["Sample with labels"]
    B -->|"x% of Sample"| C["Training Set"]
    B -->|"(100-x)% of Sample"| D["Test Set"]
    C --> E["Model the association\n between attributes & labels"]
    E --> F["Predict label of a new point"]
    E --> G["Predict labels for Test Set"]
    D --> G
    G --> H["Evaluate model quality"]
```

---

### $k$ Nearest Neighbor Classification

- A nearest neighbor classifier assigns a label to an unlabeled point by using the **majority label** of **nearby** points
- The number of nearby points considered is called **$k$** and can vary depending on the classifier
- For this lecture, let's explore a way to
    - Calculate distance
    - Split the sample into training and testing sets
    - Determine the nearby points
    - Calculate the majority label

---

## Feature Selection

---

### Reviewing the Features

* In the previous lecture, we looked at the chronic kidney disease (CKD) data set
* Each row represents an individual
* Each column is a feature of the individual
* The `Class` column indicates whether the individual was diagnosed with CKD (`1`) or not (`0`)

In [None]:
ckd = Table.read_table('ckd.csv').relabeled('Blood Glucose Random', 'Glucose')
ckd.sample(3).show(3)

Age,Blood Pressure,Specific Gravity,Albumin,Sugar,Red Blood Cells,Pus Cell,Pus Cell clumps,Bacteria,Glucose,Blood Urea,Serum Creatinine,Sodium,Potassium,Hemoglobin,Packed Cell Volume,White Blood Cell Count,Red Blood Cell Count,Hypertension,Diabetes Mellitus,Coronary Artery Disease,Appetite,Pedal Edema,Anemia,Class
23,80,1.025,0,0,normal,normal,notpresent,notpresent,111,34,1.1,145,4.0,14.3,41,7200,5.0,no,no,no,good,no,no,0
42,70,1.02,0,0,normal,normal,notpresent,notpresent,93,32,0.9,143,4.7,16.6,43,7100,5.3,no,no,no,good,no,no,0
33,80,1.025,0,0,normal,normal,notpresent,notpresent,100,37,1.2,142,4.0,16.9,52,6700,6.0,no,no,no,good,no,no,0


* Which features help predict CKD?
* We saw that glucose and hemoglobin measurements can be helpful

In [None]:
ckd = ckd.select('Hemoglobin', 'Glucose', 'Class')
ckd

Hemoglobin,Glucose,Class
11.2,117,1
9.5,70,1
10.8,380,1
5.6,157,1
7.7,173,1
9.8,95,1
12.5,264,1
10.0,70,1
10.5,253,1
9.8,163,1


---

## Rows

---

### Reviewing Rows of Tables

Each row contains all the data for one individual.
* `t.row(i)` evaluates to the row of table `t` at index `i`
* `t.row(i).item(j)` is the value of column `j` in row `i`
* If all values are numbers, then `np.array(t.row(i))` evaluates to an array of all the numbers in the row
* To consider each row individually, use a loop of the form `for row in t.rows:` with `row.item(j)` in its body
* `t.exclude(i)` evaluates to the table `t` without the row at index `i`


---

## Calculating Distance

---

### Pythagoras' Formula

<img src="./pyth.png" width=20%>

For a right triangle with legs $a, b$ and hypotenuse $c$, the following relationship is always true: $$a^2 + b^2 = c^2.$$


---

### Distance Between Two Points

One way to calculate the distance $D$ between two points applies Pythagoras' formula, with one squared difference for each attribute the points have.
* If $D$ represents the distance between points $(x_0, y_0)$ and $(x_1, y_1)$, then $$D = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2}$$
* If $D$ represents the distance between points $(x_0, y_0, z_0)$ and $(x_1, y_1, z_1)$, then $$D = \sqrt{(x_0 - x_1)^2 + (y_0 - y_1)^2 + (z_0 - z_1)^2}$$
* etc.
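
As a quick worked check of these formulas, the sketch below uses only Python's standard `math` module. The two-attribute points are (Hemoglobin, Glucose) pairs from the first two rows of the `ckd` table shown earlier. The three-attribute points and all variable names are made up for illustration.

```python
import math

# Two points with two attributes each: (Hemoglobin, Glucose)
x0, y0 = 11.2, 117
x1, y1 = 9.5, 70

# The legs of the right triangle are the differences in each attribute
D = math.sqrt((x0 - x1) ** 2 + (y0 - y1) ** 2)
print(D)   # roughly 47.03

# Three attributes work the same way, with one more squared difference
# (illustrative points (0, 0, 0) and (1, 2, 2))
D3 = math.sqrt((0 - 1) ** 2 + (0 - 2) ** 2 + (0 - 2) ** 2)
print(D3)  # 3.0
```

Because glucose values are so much larger than hemoglobin values here, the glucose difference accounts for almost all of the distance.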

---

### Distances using Array Arithmetic

* For a table like `tbl`:

| Attribute | Point 1 | Point 2 |
|---|---|---|
| x | $x_0$ | $x_1$ |
| y | $y_0$ | $y_1$ |
| z | $z_0$ | $z_1$ |
| ... | ... | ... |

* You can get the column (array) data:
    * `pt1 = tbl.column('Point 1')`
    * `pt2 = tbl.column('Point 2')`
* Then the distance formula with NumPy is: `np.sqrt(np.sum((pt1 - pt2)**2))`
* This formula with NumPy is the same no matter how many rows (attributes) there are!
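
To illustrate that last point, here is a minimal NumPy sketch. The helper name `euclidean` and the sample points are assumptions chosen for illustration; the same function body handles two attributes or four without any change.

```python
import numpy as np

def euclidean(pt1, pt2):
    """Distance between two points given as equal-length arrays."""
    return np.sqrt(np.sum((pt1 - pt2) ** 2))

# Two attributes: the classic 3-4-5 right triangle
d2 = euclidean(np.array([0, 0]), np.array([3, 4]))

# Four attributes: same function, nothing to rewrite
d4 = euclidean(np.array([1, 1, 1, 1]), np.array([2, 2, 2, 2]))

print(d2, d4)  # 5.0 2.0
```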

---

### Demo: Distance

* Create a function to calculate the distance between 2 points (represented as arrays).
* Use that function to calculate the distance between two rows of numeric data from a table.
* Remove the `'Class'` column from `ckd` to create `features`.
* Apply the `row_distance` function.

In [None]:
def distance(pt1, pt2):
    """Return the distance between two points, represented as arrays"""
    # np.sum matches the array-arithmetic formula given above
    return np.sqrt(np.sum((pt1 - pt2)**2))

In [None]:
def row_distance(row1, row2):
    """Return the distance between two numerical rows of a table"""
    return distance(np.array(row1), np.array(row2))

In [None]:
features = ckd.drop('Class')
features

Hemoglobin,Glucose
11.2,117
9.5,70
10.8,380
5.6,157
7.7,173
9.8,95
12.5,264
10.0,70
10.5,253
9.8,163


In [None]:
features.row(1)

Row(Hemoglobin=9.5, Glucose=70)

In [None]:
row_distance(features.row(0), features.row(1))

47.030734631727789

In [None]:
row_distance(features.row(0), features.row(2))

263.00030418233359

In [None]:
row_distance(features.row(2), features.row(2))

0.0

---

## Splitting the Data

---

### `split`

* The `datascience` library provides a `Table` method called `split`
    * Notation: `tbl.split(first_n)`
* Splits `tbl` into 2 Tables
* `first_n` represents the number of rows from `tbl` to be randomly assigned to the first table
* The rest of the rows in `tbl` are assigned to the second table
* This method returns two tables, so assign its result to two names!
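
The behavior described above can be imitated on row indices with plain NumPy. This is only a sketch of the idea, not how the `datascience` library implements `split`; the function name `split_indices` and its `seed` parameter are hypothetical.

```python
import numpy as np

def split_indices(num_rows, first_n, seed=None):
    """Return two arrays of row indices: first_n random rows, then the rest."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_rows)  # random ordering of 0..num_rows-1
    return shuffled[:first_n], shuffled[first_n:]

# Split 10 rows into a group of 4 and a group of 6
first, rest = split_indices(10, 4, seed=0)
print(len(first), len(rest))  # 4 6
```

Every row index lands in exactly one of the two groups, which is what keeps a training set and a test set from overlapping.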

---

### Demo: Splitting Data

* Use `split` to split the data into a training set and a test set, with half of the original data in each set
* Explain how `split` works by showing a manual process of splitting the data

In [None]:
half_way = round(ckd.num_rows/2)
train_50, test_50 = ckd.split(half_way)

In [None]:
train_50

Hemoglobin,Glucose,Class
13.6,107,0
14.2,134,0
13.5,91,0
15.8,100,0
9.9,94,1
15.9,130,0
12.6,122,1
16.2,83,0
14.4,132,0
16.3,111,0


In [None]:
test_50

Hemoglobin,Glucose,Class
14.8,139,0
15.5,130,0
17.6,79,0
14.7,81,0
14.0,92,0
15.3,113,0
5.6,157,1
17.0,112,0
12.5,264,1
13.1,128,0


In [None]:
half_way = round(ckd.num_rows/2)
shuffled = ckd.sample(with_replacement=False)
train_manual = shuffled.take(np.arange(half_way))
test_manual  = shuffled.take(np.arange(half_way, ckd.num_rows))

In [None]:
train_manual

Hemoglobin,Glucose,Class
9.1,129,1
13.0,99,0
14.3,111,0
10.0,117,1
7.9,288,1
16.7,89,0
13.3,88,0
16.2,117,0
16.3,140,0
12.6,424,1


In [None]:
test_manual

Hemoglobin,Glucose,Class
8.3,107,1
10.9,214,1
13.0,117,0
13.5,91,0
15.1,74,0
15.0,140,0
14.7,105,0
14.8,139,0
14.1,137,0
14.2,114,0


---

## Nearest Neighbors

---

### Finding the `k` Nearest Neighbors

To find the `k` nearest neighbors of an example:
* Find the distance between the example and each example in the training set
* Augment the training data table with a column containing all the distances
* Sort the augmented table in increasing order of the distances
* Take the top `k` rows of the sorted table
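
The four steps above can be sketched with NumPy arrays in place of a table. The small training set below is illustrative (values in the style of the `ckd` data), and the names `train_features`, `train_labels`, and `example` are assumptions for this sketch.

```python
import numpy as np

# Illustrative training data: columns are (Hemoglobin, Glucose)
train_features = np.array([
    [13.9, 119],
    [10.0, 117],
    [12.0, 118],
    [13.4, 120],
    [16.5, 75],
    [13.0, 117],
])
train_labels = np.array([0, 1, 1, 0, 0, 0])
example = np.array([12.3, 119])
k = 3

# Step 1: distance from the example to every training row
dists = np.sqrt(np.sum((train_features - example) ** 2, axis=1))

# Steps 2-3: argsort gives row positions in increasing order of distance
order = np.argsort(dists)

# Step 4: keep the k closest rows
nearest = order[:k]
print(train_features[nearest])
print(train_labels[nearest])
```

Here `np.argsort` stands in for adding a distance column and sorting the table by it.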

---

### The Classifier

To classify a point:
* Find its `k` nearest neighbors
* Take a majority vote of the `k` nearest neighbors to see which of the two classes appears more often
* Assign the point the class that wins the majority vote
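
The majority vote itself can be sketched with the standard library's `collections.Counter`. The function name `majority_label` and the neighbor labels are hypothetical, chosen only to illustrate the vote.

```python
from collections import Counter

def majority_label(labels):
    """Return the label that appears most often in labels."""
    return Counter(labels).most_common(1)[0][0]

# Labels of five hypothetical nearest neighbors
neighbor_labels = [1, 0, 0, 0, 0]
predicted = majority_label(neighbor_labels)
print(predicted)  # 0
```

With two classes, choosing an odd `k` guarantees the vote cannot end in a tie.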

---

### Demo: The Classifier

The `distances` function defined below calculates the distance between an example row (an individual patient) and every row in the training set of patient data.

* How can we use kNN to classify an example patient?
* Split the `ckd` table into training and testing sets.
* Measure the distance between the example patient and every row in the training data set.
* Create a function that finds the `k` closest row(s) in the training set to the example patient. Apply that function to the situation above.
* Create a function or functions to report the majority class for the nearest `k` rows in the training set to the example patient.

In [None]:
example_patient = [12.3, 119]

In [None]:
example_patient

[12.3, 119]

In [None]:
np.random.seed(123) # Makes sure we all get the same data
row_80th = round(ckd.num_rows * 0.80)
train, test = ckd.split(row_80th)

In [None]:
train

Hemoglobin,Glucose,Class
12.6,122,1
8.3,273,1
15.0,95,0
17.3,104,0
15.1,74,0
13.7,132,0
15.6,131,0
16.5,113,0
16.5,75,0
15.8,131,0


In [None]:
test

Hemoglobin,Glucose,Class
15.7,105,0
15.0,89,0
13.5,130,0
14.3,100,0
16.7,93,0
16.3,111,0
15.8,100,0
9.4,214,1
13.6,137,0
10.8,380,1


In [None]:
def distances(training, example):
    """
    Compute distance between example row and every row in training.
    Return training augmented with a Distance_to_ex column
    """
    # Local name differs from the function name to avoid shadowing it
    dists = make_array()
    features_only = training.drop('Class')

    for row in features_only.rows:
        dists = np.append(dists, row_distance(row, example))

    # The loop above is the same as:
    #
    # for i in np.arange(features_only.num_rows):
    #     row = features_only.row(i)
    #     dists = np.append(dists, row_distance(row, example))

    return training.with_column('Distance_to_ex', dists)

In [None]:
distances(train, example_patient).sort('Distance_to_ex')

Hemoglobin,Glucose,Class,Distance_to_ex
12.0,118,1,1.04403
13.4,120,0,1.48661
13.9,119,0,1.6
13.0,117,0,2.11896
13.0,117,0,2.11896
11.2,117,1,2.28254
13.6,121,0,2.38537
14.8,118,0,2.69258
12.6,122,1,3.01496
10.0,117,1,3.04795


In [None]:
def closest(training, example, k):
    """
    Return a table of the k closest neighbors to example
    """
    return distances(training, example).sort('Distance_to_ex').take(np.arange(k))

In [None]:
closest(train, example_patient, 5)

Hemoglobin,Glucose,Class,Distance_to_ex
12.0,118,1,1.04403
13.4,120,0,1.48661
13.9,119,0,1.6
13.0,117,0,2.11896
13.0,117,0,2.11896


In [None]:
(closest(train, example_patient, 5)).group('Class').sort('count', descending=True)

Class,count
0,4
1,1


In [None]:
def majority_class(topk):
    """
    Return the class with the highest count
    """
    return topk.group('Class').sort('count', descending=True).column(0).item(0)

In [None]:
def classify(training, example, k):
    """
    Return the majority class among the
    k nearest neighbors of example
    """
    return majority_class(closest(training, example, k))

In [None]:
classify(train, example_patient, 5)

0

---

### Did the Classifier Work?

- We predicted the class of the example patient.
- Was the prediction correct?
- A better question - How reliable is our classifier?
- We will use the test set in the next lecture to measure the quality of our classifier.

---

### Review of the Steps

- `distance(pt1, pt2)`: Returns the distance between the arrays `pt1` and `pt2`
- `row_distance(row1, row2)`: Returns the distance between the rows `row1` and `row2`
- `distances(training, example)`: Returns a table that is `training` with an additional column `'Distance_to_ex'` that contains the distance between `example` and each row of `training`
- `closest(training, example, k)`: Returns a table of the rows corresponding to the `k` smallest distances
- `majority_class(topk)`: Returns the majority class in the `'Class'` column
- `classify(training, example, k)`: Returns the predicted class of `example` based on a `k` nearest neighbors classifier using the historical sample `training`

In the next lecture, we will show you a way to evaluate a classifier.

---

## Attribution

This content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC BY-NC-SA 4.0)</a> and derived from the <a href="https://www.data8.org/">Data 8: The Foundations of Data Science</a> offered by the University of California, Berkeley.

<img src="./by-nc-sa.png" width=100px>