<a href="https://colab.research.google.com/github/rilschultz/CS167Notes/blob/main/Day06_Normalization_and_w_knn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day06
## Weighted k Nearest Neighbors

#### CS167: Machine Learning, Spring 2023

Tuesday, February 14th 💖, 2023

📆 [Course Schedule](https://docs.google.com/spreadsheets/d/e/2PACX-1vSvFV5Mz0_YZE1d5r3gQ8IMktE4cBAsJIlP30cl2GhEpSO0J-YWV62QokSDz-OcOCsEmxMuKpY0kVlR/pubhtml?gid=0&single=true) | 🙋[PollEverywhere](https://pollev.com/meredithmoore011) | 📜 [Syllabus](https://analytics.drake.edu/~moore/cs167_s23_syllabus.html) | 📬 [CodePost Login](https://codepost.io/login)

# Overview of Today:

Normalization

Weighted k-NN


# Admin Stuff
Grading:
- Notebook #1 is graded, scores and feedback are posted on CodePost, let me know if you have any questions.


You should be working on:
- [Notebook #2](https://classroom.github.com/a/d6K-Q-t8): due **Thursday, February 16th, 2023 by 11:59pm** 
- Heads up that **Quiz #1 will be released today after class (~2:00pm)**, due next Tuesday, 2/21 by 11:49pm.
    - To be completed individually
    - Only one chance to hit 'submit'
    - Cite any materials that you use outside of class

## Can't forget to load our data:

And some of our favorite modules, `pandas` and `numpy`

In [3]:
#run this cell if you're using Colab:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
#import the data:
#make sure the path on the line below corresponds to the path where you put your dataset.
import pandas as pd
import numpy as np
path = '/content/drive/MyDrive/datasets/penguins_size.csv'
penguins = pd.read_csv(path)
penguins.head()

path1 = '/content/drive/MyDrive/datasets/irisData.csv'
iris = pd.read_csv(path1)
iris.head()

path2 = '/content/drive/MyDrive/datasets/titanic.csv'
titanic = pd.read_csv(path2)
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


# Quick Review: Missing Data:
Most datasets you will work with will not be in perfect shape--you'll need to "clean" the data before you can run any machine learning algorithms on it.

Missing data is a pretty common thing--so much so that there's a special value for missing data: `NaN`, or not a number.

The steps of cleaning data normally include:
1. Identifying which columns have missing data `df.isna().any()`
2. Determining how much data is missing in each column `df.col_missing_data.value_counts(dropna=False)`
3. Deciding what to do with the missing data: drop it `dropna()`, fill it `fillna()`, let it be 
    - Remember to either save the returned result, `result = df.dropna()`, or modify the dataframe in place with `df.dropna(inplace=True)`

## Summary: Missing Data Functions
- `isna()`: returns True for any missing data
- `notna()`: returns True for any data that is __not__ `NaN`
- `any()`: returns `True` if any of the elements in a Series is `True`
- `value_counts()`: returns how many times each distinct value appears in a Series, use `dropna=False` to include `NaN` in the counts
- `dropna()`: drops rows or columns (specify which axis, 1 or 0) that have missing data. Don't forget to either save the result of the call or add `inplace=True` as a parameter.
- `fillna()`: replaces missing data with a given value (generally 0 or the mean)
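
Here is a minimal sketch of these functions on a small made-up dataframe. The `toy` frame and its values are invented for illustration and are not taken from the penguin data.

```python
import numpy as np
import pandas as pd

# hypothetical toy data with a few missing values
toy = pd.DataFrame({
    'mass': [3750.0, np.nan, 3250.0, 3450.0],
    'sex': ['MALE', 'FEMALE', None, 'FEMALE'],
})

# which columns contain at least one missing value?
cols_with_missing = toy.isna().any()

# how many values are missing in each column?
missing_counts = toy.isna().sum()

# value_counts with dropna=False also counts the missing entries
sex_counts = toy['sex'].value_counts(dropna=False)

# drop every row that has a missing value (returns a new dataframe)
dropped = toy.dropna()

# or fill the numeric column with its mean instead of dropping rows
filled_mass = toy['mass'].fillna(toy['mass'].mean())

print(cols_with_missing)
print(missing_counts)
print(sex_counts)
print(dropped.shape)
```

Dropping is the simplest choice when only a few rows are affected; filling keeps every row but changes the data, so it is worth checking how many values you are inventing.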

## 💻 Review Exercise: 

Take care of the missing data in the penguin dataset 🐧.

In [5]:
# step 1: identify which columns are missing data
penguins.isna().any()

species              False
island               False
culmen_length_mm      True
culmen_depth_mm       True
flipper_length_mm     True
body_mass_g           True
sex                   True
dtype: bool

In [18]:
# step 2: determine how much data is missing from each column that is missing data
results = penguins.culmen_depth_mm.value_counts(dropna = False)
results[np.nan]
#OR
penguins.species.count() - penguins.count()

species               0
island                0
culmen_length_mm      2
culmen_depth_mm       2
flipper_length_mm     2
body_mass_g           2
sex                  10
dtype: int64

Sex is missing 10

body mass is missing 2

Flipper length is missing 2

Culmen Depth is missing 2

Culmen length is missing 2



In [20]:
# step 3: decide whether to drop, fill, or leave it
penguins.dropna(inplace=True)
penguins.shape

(334, 7)

# 💬 Discussion Question:

Imagine we wanted to use the penguin dataset to predict the species of penguin using k-Nearest Neighbors:
- What steps will you need to take before running a kNN on the penguin dataset?
- Will each column have an equal weight in the final prediction? Or will one column have a bigger say in the decision? Why?

In [21]:
penguins.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,MALE


# ✨ New Material

## Normalization Motivation:

In datasets that have numeric data, the columns that have the largest magnitude will have a greater 'say' in the decision of what to predict. 

In the penguin dataset, `body_mass_g` will have a much bigger say in the prediction than the other columns, because its values are in the thousands while the other measurements are in the tens or hundreds.
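
To see the effect, here is a rough sketch that measures how much each column contributes to the Euclidean distance between the first two penguins printed above. The variable names are made up for this illustration.

```python
import numpy as np

# the first two penguins shown above:
# (culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g)
a = np.array([39.1, 18.7, 181.0, 3750.0])
b = np.array([39.5, 17.4, 186.0, 3800.0])

# each column's contribution to the squared Euclidean distance
squared_diffs = (a - b) ** 2
distance = np.sqrt(squared_diffs.sum())

# fraction of the squared distance that comes from each column
share = squared_diffs / squared_diffs.sum()

print(distance)
print(share)
```

Almost all of the squared distance comes from the last entry, `body_mass_g`, even though a 50 g difference is small for a penguin.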

# Normalization:

__Normalizing data:__
- rescale attribute values so they're about the same size
- adjusting values measured on different scales to a common scale

## A Simple Normalization:
One simple method of normalizing data is to replace each value with a proportion relative to the max value. 

For example, the oldest person on the Titanic was 80, so: 

| **age** | **replaced by** |
|---------|:------------------|
| 80      | 80/80 = 1        |
| 50      | 50/80 = 0.625    |
| 48      | 48/80 = 0.6      |
| 25      | 25/80 = 0.3125   |
| 4       | 4/80 = 0.05      |
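
The table above can be reproduced with a short sketch. It assumes the values are non-negative, as ages are, and uses a hand-typed Series rather than the real Titanic column.

```python
import pandas as pd

# the ages from the table above
ages = pd.Series([80, 50, 48, 25, 4])

# divide every value by the column maximum
ages_scaled = ages / ages.max()

print(ages_scaled.tolist())
```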

## Before Normalization
<div>
<img src="https://github.com/merriekay/S23-CS167-Notes/blob/main/images/day03_zscore_improvement.png?raw=1" width=600/>
</div>

### Age is overemphasized here

## After Normalization

<div>
<img src="https://github.com/merriekay/S23-CS167-Notes/blob/main/images/day03_norm_dist1.png?raw=1"/>
</div>

### Now is sex over-emphasized?

## Z-Score: Another Normalization Method

__Idea__: rather than normalize to a proportion of the max, normalize based on how many standard deviations each value is away from the mean.

__Standard Deviation__: usually represented as $\sigma$ (sigma), a kind of 'average' distance from the average value. 
- a low standard deviation indicates that the values tend to be close to the mean
- a high standard deviation indicates that the values are spread out over a wider range.

## Standard Deviation:
<div>
<img src="https://github.com/merriekay/S23-CS167-Notes/blob/main/images/day03_std.png?raw=1" width=600/>
</div>

## Standard Deviation Calculation:

## $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$

1. Find the mean, represented as $\mu$ (mu)
2. Then, for each number, subtract the mean and square the result.
3. Then, find the mean of those squared differences. 
4. Take the square root of that and we are done. 

Let $\mu$ be the mean, then the standard deviation of $x_1, x_2, ..., x_N$ is:

## $\sigma = \sqrt{\frac{(x_1-\mu)^2 + (x_2 - \mu)^2+ ... + (x_N-\mu)^2}{N}}$
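
As a worked example, the four steps can be followed by hand in NumPy. The numbers are made up and chosen so the arithmetic comes out evenly.

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # made-up values

mu = x.mean()                    # step 1: find the mean
squared_diffs = (x - mu) ** 2    # step 2: subtract the mean and square
variance = squared_diffs.mean()  # step 3: mean of the squared differences
sigma = np.sqrt(variance)        # step 4: take the square root

print(mu, variance, sigma)
```

Here the mean is 5, the squared differences add up to 32, their mean is 4, and the standard deviation is 2.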

# Corrected Sample Standard Deviation

The mean of a sample tends to be a good estimate for the mean of the entire population (on average), but...
- the standard deviation of a sample tends to be _smaller_ than the standard deviation of the entire population.

__Bessel's correction__ says that you should divide by $N-1$ instead of N when working with a sample (as we usually do in machine learning tasks), and your estimate will be a little less biased. 

## $\sigma = \sqrt{\frac{(x_1-\mu)^2 + (x_2 - \mu)^2+ ... + (x_N-\mu)^2}{N-1}}$
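
In code, the difference between the two formulas is the `ddof` argument ("delta degrees of freedom"). The sketch below uses made-up values; note that NumPy's `std` divides by $N$ by default while pandas' `std()` divides by $N-1$ by default.

```python
import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

population_std = np.std(values)          # divides by N (ddof=0 is NumPy's default)
sample_std = np.std(values, ddof=1)      # divides by N - 1 (Bessel's correction)
pandas_std = pd.Series(values).std()     # pandas uses ddof=1 by default

print(population_std, sample_std, pandas_std)
```

So calling `.std()` on a dataframe column already gives the corrected sample standard deviation.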

# Computing the Z-Score
After computing the corrected sample standard deviation, normalize by replacing each value $x_i$ with its Z-score, based on the mean ($\mu$) and standard deviation ($\sigma$) of its column. 

## $\text{Z-score} = \frac{x_i- \mu}{\sigma}$

## Example Z-Score Calculation

For example: 
On the Titanic:
- sex mean(0:male, 1:female): 0.35
- sex standard deviation: 0.48
- age mean: 29.7
- age standard deviation: 13

<div>
<img src="https://github.com/merriekay/S23-CS167-Notes/blob/main/images/day03_zscore.png?raw=1" width=600/>
</div>


<div>
<img src="https://github.com/merriekay/S23-CS167-Notes/blob/main/images/day03_zscore_ex.png?raw=1" width=600/>
</div>
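
The same calculation as a small sketch, using the rounded Titanic statistics listed above. The helper `z` is a made-up name for this illustration, and because the inputs are rounded the results will differ slightly from what pandas computes on the full column.

```python
# rounded Titanic statistics from the example above
sex_mean, sex_std = 0.35, 0.48
age_mean, age_std = 29.7, 13

def z(value, mean, std):
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / std

# a 22-year-old male passenger (sex encoded as 0)
print(z(0, sex_mean, sex_std))
print(z(22, age_mean, age_std))
```

Both results are negative because this passenger is below the mean on both columns, and both are now on a comparable scale.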

# Normalization Code:
Let's try out some code now:



In [22]:
#make sure your data is loaded and ready to go (one of the top few cells)
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## New function `replace()`

Called on a dataframe or a single column, `replace()` will replace the values given in `to_replace` with `value`. 

Let's use this to make the `sex` column of the dataset numeric.

In [24]:
titanic['sex'] = titanic['sex'].replace(to_replace='female', value=1)
titanic['sex'] = titanic['sex'].replace(to_replace='male', value=0)
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,0,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,1,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,1,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,1,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,0,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
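
As an alternative sketch (not the approach used in the cell above), `Series.map()` with a dictionary encodes both labels in one pass. The Series below is typed in by hand from the first five rows of the `sex` column.

```python
import pandas as pd

# the first five values of the sex column shown above
sex = pd.Series(['male', 'female', 'female', 'female', 'male'])

# map each label to its numeric code in a single call
sex_numeric = sex.map({'female': 1, 'male': 0})

print(sex_numeric.tolist())
```

One difference to keep in mind: with `map()`, any value that is not a key in the dictionary becomes `NaN`, while `replace()` leaves unmatched values unchanged.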


## Calculating z-score:
Now that we have the data as 1s and 0s, let's calculate the mean and standard deviation.

In [25]:
s_mean = titanic.sex.mean()
s_std = titanic.sex.std()

#replace column with each entry's z-score
titanic.sex = (titanic.sex - s_mean)/s_std
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,-0.737281,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,1.354813,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,1.354813,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,1.354813,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,-0.737281,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


Next, you'd need to repeat this process for all of the predictor columns -- so they're all of comparable size. 

## 💻 Programming Exercise #1: 

Normalize each of the predictor columns in the iris dataset.

> Note: you need a way to transform the new reading (the specimen) that you will make the prediction on so that the new one and the training data will all be on the same scale. How can you do that?

Repeat your kNN prediction code with the normalized data. 
- Does the value of k change the predictions? 

In [35]:
# use z-score to normalize the iris data
iris_copy = iris.copy()
plm = iris_copy['petal length'].mean()
slm = iris_copy['sepal length'].mean()
pwm = iris_copy['petal width'].mean()
swm = iris_copy['sepal width'].mean()

pls = iris_copy['petal length'].std()
sls = iris_copy['sepal length'].std()
pws = iris_copy['petal width'].std()
sws = iris_copy['sepal width'].std()

# z-score: subtract the mean first, then divide by the standard deviation
iris_copy['petal length'] = (iris_copy['petal length'] - plm) / pls
iris_copy['sepal length'] = (iris_copy['sepal length'] - slm) / sls
iris_copy['petal width'] = (iris_copy['petal width'] - pwm) / pws
iris_copy['sepal width'] = (iris_copy['sepal width'] - swm) / sws

In [36]:
def knn(specimen, data, k):
    # write your code in here to make this function work
    # 1. calculate distances
    data_copy = data.copy() #good practice to make a copy of the data
    data_copy['distance_to_new'] = np.sqrt(
        (specimen['petal length'] - data_copy['petal length'])**2 
        +(specimen['sepal length'] - data_copy['sepal length'])**2 
        +(specimen['petal width'] - data_copy['petal width'])**2
        +(specimen['sepal width'] - data_copy['sepal width'])**2)

    # 2. sort
    sorted_data = data_copy.sort_values(['distance_to_new'])
    
    # 3. predict
    prediction = sorted_data.iloc[0:k]['species'].mode()[0]

    #return prediction
    return prediction

In [39]:
#what will you have to do here to make it work?
# normalize the new specimen with the same means and stds as the training data

new_iris = {}
new_iris['petal length'] = (5.1 - plm) / pls
new_iris['sepal length'] = (7.2 - slm) / sls
new_iris['petal width'] = (1.5 - pwm) / pws
new_iris['sepal width'] = (2.5 - swm) / sws

pred = knn(new_iris, iris_copy, 15)
print(pred)


## Programming Exercise #2: 

Write a function called `z_score()` that will take in a list of the names of the columns that you want to normalize, and the dataframe, and will return a dataframe where those columns have been z-score normalized.

In [43]:
def z_score(columns, data):
    """
    takes in a list of columns to normalize using the z-score method
    Params:
        columns, a list of columns to normalize
        data, the dataframe, preferably a copy
    Return:
        a copy of the dataframe with the specified columns normalized
    """
    norm = data.copy()
    
    for col in columns:
        # select columns by name with [] (iloc only accepts integer positions)
        # get the mean and std
        average = norm[col].mean()
        stdev = norm[col].std()
        # replace the column with the z-score
        norm[col] = (norm[col] - average) / stdev
    return norm

In [44]:
iris_norm = z_score(['sepal length', 'sepal width', 'petal width', 'petal length'], iris)
iris_norm.head()
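
A quick sanity check for any z-score normalization: afterwards, each normalized column should have a mean of about 0 and a standard deviation of about 1. The sketch below uses a made-up `toy` dataframe and a vectorized form of the calculation (no loop) as one possible way to write it.

```python
import pandas as pd

# made-up measurements standing in for two predictor columns
toy = pd.DataFrame({
    'length': [4.0, 5.0, 6.0, 7.0],
    'width': [1.0, 1.5, 3.0, 2.5],
    'label': ['a', 'a', 'b', 'b'],
})
cols = ['length', 'width']

toy_norm = toy.copy()
# pandas lines up each column with its own mean and standard deviation
toy_norm[cols] = (toy[cols] - toy[cols].mean()) / toy[cols].std()

print(toy_norm[cols].mean())  # each value should be very close to 0
print(toy_norm[cols].std())   # each value should be very close to 1
```

The `label` column is left alone, which is what we want for the target variable.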

## Are all neighbors created equal?

The way we've learned kNN so far, each neighbor gets an equal vote in the decision of what to predict.

Do we see any problems with this? If so, what?

<div>
<img src="https://github.com/merriekay/S23-CS167-Notes/blob/main/images/day04_wknn_motivation.png?raw=1" width = 500/>
</div>

Should neighbors that are closer to the new instance get a larger share of the vote?

# Weighted k-NN Intuition:

In weighted kNN, the nearest k points are given a weight, and the weights are grouped by the target variable. The class with the largest sum of weights will be the class that is predicted. 

The intuition is to give more weight to the points that are nearby and less weight to the points that are farther away.
- distance-weighted voting

In w-kNN, we want to predict the target variable with the most weight, where the weight is defined by the inverse distance function.

## $w_{q,i} = \frac{1}{d(x_q, x_i)^2}$

> In English, you can read that as: the __weight__ of a training example is equal to 1 divided by the squared distance between the new instance and the training example.

## A w-kNN Example: Step 1

Start by calculating the distance between the new example ('X'), and each of the other training examples:

<div>
<img src="https://github.com/merriekay/S23-CS167-Notes/blob/main/images/day04_wknn_ex.png?raw=1"/>
</div>

## A w-kNN Example: Step 2

Then, __calculate the weight__ of each training example using the inverse distance squared.

<div>
<img src="https://github.com/merriekay/S23-CS167-Notes/blob/main/images/day04_wknn_ex1.png?raw=1"/>
</div>

## A w-kNN Example: Step 3

Find the k closest neighbors--let's assume `k=3` for this example: 
<div>
<img src="https://github.com/merriekay/S23-CS167-Notes/blob/main/images/day04_wknn_ex2.png?raw=1"/>
</div>

Then, sum the weights for each possible class: 
- __orange__: $1$
- __blue__: $1/16 + 1/9 \approx 0.174$

### What would a __normal 3NN__ predict? Weighted 3NN?
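
To check your answer, here is a minimal sketch of both votes on the three nearest neighbors from the example. The distances (1, 3 and 4) are inferred from the weights listed above, and the dataframe is made up for this illustration.

```python
import pandas as pd

# the three nearest neighbors from the example; distances are
# inferred from the weights listed above (1, 1/9 and 1/16)
neighbors = pd.DataFrame({
    'label': ['orange', 'blue', 'blue'],
    'distance': [1.0, 3.0, 4.0],
})
neighbors['weight'] = 1 / neighbors['distance'] ** 2

# unweighted vote: the most common class among the neighbors
unweighted = neighbors['label'].mode()[0]

# weighted vote: the class with the largest total weight
weight_sums = neighbors.groupby('label')['weight'].sum()
weighted = weight_sums.idxmax()

print(weight_sums)
print('unweighted:', unweighted)
print('weighted:', weighted)
```

The two methods disagree here: blue wins on the number of neighbors, while orange wins on total weight because its single neighbor is much closer.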

## Let's write some code: 

Write a new function `weighted_knn()`

Pass the iris measurements (specimen), data frame, and k as parameters and return the predicted class.

In [49]:
def weighted_knn(specimen, data, k):
    # work on a copy so the caller's dataframe is not modified
    data_copy = data.copy()

    #calculate the distance
    data_copy['distance_to_new'] = np.sqrt(
    (specimen['petal length'] - data_copy['petal length'])**2 
    +(specimen['sepal length'] - data_copy['sepal length'])**2 
    +(specimen['petal width'] - data_copy['petal width'])**2
    +(specimen['sepal width'] - data_copy['sepal width'])**2)

    # calculate the weights (remember, weights are 1/d^2)
    data_copy['weights'] = 1 / data_copy['distance_to_new']**2
    
    # find the k closest neighbors
    neighbors = data_copy.sort_values(['distance_to_new']).iloc[0:k]
    
    # use groupby to sum the weights of each species in the closest k
    results = neighbors.groupby(['species'])['weights'].sum()
    # return the class that has the largest sum of weight.
    return results.idxmax()

In [50]:
new_iris = {}
new_iris['petal length'] = 5.1
new_iris['sepal length'] = 7.2
new_iris['petal width'] = 1.5
new_iris['sepal width'] = 2.5

weighted_knn(new_iris, iris, 15)
#new_iris['petal length']

'Iris-versicolor'
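
One edge case worth knowing about: if the specimen has exactly the same measurements as a training row, the distance is 0 and `1 / d**2` becomes infinite. One possible guard, which is an assumption on my part and not part of the function above, is to put a small floor on the distance before inverting it. The helper name `inverse_square_weights` and the default `eps` are hypothetical.

```python
import numpy as np

def inverse_square_weights(distances, eps=1e-9):
    """Hypothetical helper: 1 / d^2 weights with a small floor on d to avoid dividing by zero."""
    d = np.maximum(np.asarray(distances, dtype=float), eps)
    return 1.0 / d ** 2

weights = inverse_square_weights([0.0, 1.0, 2.0])
print(weights)
```

An exact match still gets by far the largest weight, but the result stays finite, so sums over several neighbors remain well defined.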

## Exercises:

Normalize each of the predictor columns in the iris dataset, or just use `iris_norm` which we created above.

>__Note__: you need a way to transform the new reading (the specimen) that you will make the prediction on so that the new one and the training data will all be on the same scale. How can you do that?

Repeat your k-NN prediction code for the normalized data.
- Does the value of k change the predictions? 
    - compare using `k=3`, and `k=5` on each method (normalized and non-normalized), (weighted and unweighted)

In [None]:
# get the mean() and std() for each column of iris


# create a new dictionary with the normalized values


species
Iris-versicolor    1.319030
Iris-virginica     2.360782
Name: weight, dtype: float64


'Iris-virginica'

In [None]:
print("Not normalized:")
print('unweighted kNN, k=3:', knn(new_iris, iris, 3))
print('unweighted kNN, k=5:', knn(new_iris, iris, 5))
print('weighted kNN, k=3:', weighted_knn(new_iris, iris, 3))
print('weighted kNN, k=5:', weighted_knn(new_iris, iris, 5))

In [None]:
print("Normalized:")
print('unweighted kNN, k=3:', knn(norm_iris, iris_norm, 3))
print('unweighted kNN, k=5:', knn(norm_iris, iris_norm, 5))
print('weighted kNN, k=3:', weighted_knn(norm_iris, iris_norm, 3))
print('weighted kNN, k=5:', weighted_knn(norm_iris, iris_norm, 5))

## Use these tables to keep track of your predictions:
### `k=3`
|                    | **not normalized** | **normalized** |
|--------------------|--------------------|----------------|
| **unweighted kNN** |          |              |
| **weighted kNN**   |          |               |

### `k=5`

|                    | **not normalized** | **normalized** |
|--------------------|--------------------|----------------|
| **unweighted kNN** |                    |                |
| **weighted kNN**   |                    |                |

# 💬 Discussion Question

Should we __always__ normalize our data? Why or why not?

When does it make sense to normalize? When might it make more sense not to?