<a href="https://colab.research.google.com/github/urness/CS167Fall2025/blob/main/Day06_Normalization_and_Weighted_kNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day06
## Normalization and Weighted kNN

#### CS167: Machine Learning, Fall 2025


## Before we get started, let's load in our datasets:
Make sure you change the path to match your Google Drive.


In [None]:
import pandas as pd
import numpy as np

# The first step is to mount your Google Drive to your Colab account.
# You will be asked to authorize Colab to access your Google Drive; follow the prompts.

from google.colab import drive
drive.mount('/content/drive')

In [None]:
#import the data:
#make sure the path on the line below corresponds to the path where you put your dataset.

iris_df = pd.read_csv('/content/drive/MyDrive/CS167/datasets/irisData.csv')

titanic_df = pd.read_csv('/content/drive/MyDrive/CS167/datasets/titanic.csv')

penguins_df = pd.read_csv('/content/drive/MyDrive/CS167/datasets/penguins.csv')

# 💬 Discussion Question:

Imagine we wanted to use the penguin dataset to predict the species of penguin using k-Nearest Neighbors:
- What steps will you need to take before running a kNN on the penguin dataset?
- Will each column have an equal weight in the final prediction? Or will one column have a bigger say in the decision? Why?

In [None]:
penguins_df.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,gender
0,Adelie,Torgersen,39.1,18.7,181,3750,MALE
1,Adelie,Torgersen,39.5,17.4,186,3800,FEMALE
2,Adelie,Torgersen,40.3,18.0,195,3250,FEMALE
3,Adelie,Torgersen,36.7,19.3,193,3450,FEMALE
4,Adelie,Torgersen,39.3,20.6,190,3650,MALE


## Normalization Motivation:

In datasets with numeric data, the columns with the largest magnitudes have a greater 'say' in the prediction, because they contribute the most to the distance between two examples.

In the penguin dataset, `body_mass_g` will have a much bigger say in the prediction than the other columns.
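
To make this concrete, here is a small sketch that compares two rows copied from the `penguins_df.head()` output (rows 0 and 2) and shows how much each column contributes to the Euclidean distance. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Two rows copied from the penguins_df.head() output (rows 0 and 2).
features = ['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']
penguin_a = pd.Series([39.1, 18.7, 181, 3750], index=features)
penguin_b = pd.Series([40.3, 18.0, 195, 3250], index=features)

# Squared difference that each column contributes to the Euclidean distance.
squared_diffs = (penguin_a - penguin_b) ** 2
distance = np.sqrt(squared_diffs.sum())

# Share of the total squared distance that comes from each column.
share = squared_diffs / squared_diffs.sum()
print(squared_diffs)
print(distance)
print(share.round(4))
```

For this pair, almost all of the distance comes from `body_mass_g`, simply because grams are numerically much larger than millimeters.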

# Normalization:

__Normalizing data:__
- rescaling attribute values so they fall in roughly the same range
- adjusting values measured on different scales to a common scale

## A Simple Normalization:
One simple method of normalizing data is to replace each value with a proportion relative to the max value.

For example, the oldest person on the Titanic was 80, so:

| **age** | **replaced by** |
|---------|:------------------|
| 80      | 80/80 = 1        |
| 50      | 50/80 = 0.625    |
| 48      | 48/80 = 0.6      |
| 25      | 25/80 = 0.3125   |
| 4       | 4/80 = 0.05      |
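
The table can be reproduced with a couple of lines of pandas. This is a minimal sketch that uses a hand-built Series holding the ages from the table, not the full Titanic column.

```python
import pandas as pd

# Ages taken from the table above; the maximum is 80.
ages = pd.Series([80, 50, 48, 25, 4], name='age')

# Divide every value by the column maximum. For non-negative values,
# the results fall between 0 and 1.
ages_scaled = ages / ages.max()
print(ages_scaled.tolist())
```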

## Before Normalization
<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day03_norm_dist1.png" width=600/>
</div>

### Age is overemphasized here



---



## After Normalization

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day03_norm_dist1.png" width=600/>
</div>

### Now is sex over-emphasized?

## Z-Score: Another Normalization Method

__Idea__: rather than normalizing each value to a proportion of the max, normalize it based on how many **standard deviations** it is away from the mean.

__Standard Deviation__: usually represented as $\sigma$ (sigma), a kind of 'average' distance from the average value.
- a low standard deviation indicates that the values tend to be close to the mean
- a high standard deviation indicates that the values are spread out over a wider range.

## Standard Deviation Calculation:

## $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$

1. Find the mean, represented as $\mu$ (mu)
2. Then, for each number, subtract the mean and square the result.
3. Then, find the mean of those squared differences.
4. Take the square root of that and we are done.

Let $\mu$ be the mean, then standard deviation of $x_1, x_2, ..., x_N$ is:

## $\sigma = \sqrt{\frac{(x_1-\mu)^2 + (x_2 - \mu)^2+ ... + (x_N-\mu)^2}{N}}$
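
The four steps can be followed one at a time in NumPy. The sample values below are hypothetical and were picked only so the arithmetic comes out cleanly.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the steps.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()                    # 1. find the mean
squared_diffs = (x - mu) ** 2    # 2. subtract the mean and square the result
mean_sq = squared_diffs.mean()   # 3. find the mean of the squared differences
sigma = np.sqrt(mean_sq)         # 4. take the square root

print(mu, sigma)
```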

# Corrected Sample Standard Deviation

The mean of a sample tends to be a good estimate for the mean of the entire population (on average), but:
- the standard deviation of a sample tends to be _smaller_ than the standard deviation of the entire population.

__Bessel's correction__ says that you should divide by $N-1$ instead of $N$ when working with a sample (as we usually do in machine learning tasks), and your estimate will be a little less biased.

## $\sigma = \sqrt{\frac{(x_1-\mu)^2 + (x_2 - \mu)^2+ ... + (x_N-\mu)^2}{N-1}}$
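
It is worth knowing which version your library computes. In the sketch below (with hypothetical values), NumPy's `np.std` divides by $N$ by default, while pandas' `Series.std()` divides by $N-1$ by default; the `ddof` argument switches between the two.

```python
import numpy as np
import pandas as pd

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample

population_std = np.std(values)          # divides by N (NumPy default, ddof=0)
sample_std = np.std(values, ddof=1)      # divides by N - 1 (Bessel's correction)
pandas_std = pd.Series(values).std()     # pandas default is ddof=1

print(population_std, sample_std, pandas_std)
```

Because the code in this notebook calls `.std()` on pandas columns, it uses the corrected ($N-1$) version.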

# Computing the Z-Score
After computing the corrected sample standard deviation, normalize by replacing each value $x_i$ with its Z-score, computed from the mean ($\mu$) and standard deviation ($\sigma$) of its column.

## $\text{Z-score} = \frac{x_i - \mu}{\sigma}$

## Example Z-Score Calculation

For example, on the Titanic:
- `sex` mean (0: male, 1: female): 0.35
- `sex` standard deviation: 0.48
- `age` mean: 29.7
- `age` standard deviation: 13

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day03_zscore.png" width=600/>
</div>

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day03_zscore_ex.png" width=600/>
</div>
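
As a quick arithmetic check, the sketch below plugs the rounded means and standard deviations listed above into the Z-score formula. The helper `z()` and the 50-year-old female passenger are hypothetical, made up for this illustration.

```python
# Rounded means and standard deviations quoted in the list above.
age_mean, age_std = 29.7, 13
sex_mean, sex_std = 0.35, 0.48

def z(value, mean, std):
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / std

# Hypothetical passenger: a 50-year-old female (sex encoded as 1).
age_z = z(50, age_mean, age_std)
sex_z = z(1, sex_mean, sex_std)
print(round(age_z, 2), round(sex_z, 2))
```

After normalization, both attributes are expressed in the same unit (standard deviations from the mean), so neither dominates the distance just because of its original scale.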

# Normalization Code:
Let's try out some code now:


In [None]:
#make sure your data is loaded and ready to go (one of the top few cells)
titanic_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## New function `map()`

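Called on a column of a dataframe (a pandas Series), `map()` replaces each value according to a dictionary. Any value that does not appear as a key in the dictionary becomes `NaN`.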

Let's use this to make the `sex` column of the dataset numeric.

In [None]:
#first, let's make a copy of the original data into a new dataframe
normalized_titanic_df = titanic_df.copy()
normalized_titanic_df['sex'] = normalized_titanic_df['sex'].map({'male': 0, 'female': 1})
normalized_titanic_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,0,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,1,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,1,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,1,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,0,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## Calculating z-score:
Now that we have the data as 1s and 0s, let's calculate the mean and standard deviation.

In [None]:
s_mean = normalized_titanic_df.sex.mean()
s_std = normalized_titanic_df.sex.std()

#replace column with each entry's z-score
normalized_titanic_df.sex = (normalized_titanic_df.sex - s_mean)/s_std
normalized_titanic_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,-0.737281,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,1.354813,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,1.354813,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,1.354813,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,-0.737281,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


Next, you'd need to repeat this process for all of the predictor columns, so that they are all on a comparable scale.

## 💻 Programming Exercise #1:

Normalize each of the predictor columns in the iris dataset.

Use the `kNN` function (with `k=5`) to predict the species of the following specimen:
```
new_iris = {}
new_iris['petal length'] = 5.1
new_iris['sepal length'] = 7.2
new_iris['petal width'] = 1.5
new_iris['sepal width'] = 2.5
```

> Note: you need a way to transform the new reading (the specimen) that you will make the prediction on, so that it and the training data are all on the same scale. How can you do that?


In [None]:
# YOUR CODE HERE

In [None]:
def kNN(specimen, data, k):
    # 1. calculate distances
    data['distance_to_new'] = np.sqrt(
     (specimen['petal length'] - data['petal length'])**2
    +(specimen['sepal length'] - data['sepal length'])**2
    +(specimen['petal width'] - data['petal width'])**2
    +(specimen['sepal width'] - data['sepal width'])**2)

    # 2. sort
    sorted_data = data.sort_values(['distance_to_new'])

    # 3. predict
    prediction = sorted_data.iloc[0:k]['species'].mode()[0]

    return prediction

## 💻 Programming Exercise #2:

Write a function called `z_score()` that will take in a list of the names of the columns that you want to normalize, and the dataframe, and will return a dataframe where those columns have been z-score normalized.

In [None]:
def z_score(columns, data):
    """
    takes in a list of columns to normalize using the z-score method
    Params:
        columns, a list of columns to normalize
        data, the dataframe, preferably a copy
    Return:
        a copy of the dataframe with the specified columns normalized
    """
    data_copy = data.copy()

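    for col in columns:
        # get the mean and (corrected sample) standard deviation
        col_mean = data_copy[col].mean()
        col_std = data_copy[col].std()

        # replace the column with the z-score
        data_copy[col] = (data_copy[col] - col_mean) / col_std

    return data_copy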

In [None]:
iris_norm = z_score(['sepal length', 'sepal width', 'petal width', 'petal length'], iris_df)
iris_norm.head()

## Are all neighbors created equal?

The way we've learned kNN so far, each neighbor gets an equal vote in the decision of what to predict.

Do we see any problems with this? If so, what?

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day04_wknn_motivation.png" width = 500/>
</div>

Should neighbors that are closer to the new instance get a larger share of the vote?

# Weighted kNN Intuition:

In weighted kNN, each of the k nearest points is given a weight, and the weights are summed separately for each class of the target variable. The class with the largest sum of weights is the class that is predicted.

The intuition is to give more weight to the points that are nearby and less weight to the points that are farther away.
- distance-weighted voting

In w-kNN, we predict the class with the largest total weight, where each neighbor's weight is the inverse of its squared distance to the new instance.

## $w_{q,i} = \frac{1}{d(x_q, x_i)^2}$

> In English: the __weight__ of a training example is 1 divided by the square of the distance between the new instance and that training example.
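
Here is a minimal sketch of the weighting and voting step on its own. The three neighbors, their distances, and the labels `'a'` and `'b'` are hypothetical, and the sketch assumes no distance is exactly 0 (which would cause a division by zero).

```python
import pandas as pd

# Hypothetical k=3 nearest neighbors: distance to the new instance and class label.
neighbors = pd.DataFrame({
    'distance': [1.0, 2.0, 2.5],
    'label': ['a', 'b', 'b'],
})

# Weight of each neighbor: 1 / distance^2.
# Assumes no distance is exactly 0, which would divide by zero.
neighbors['weight'] = 1 / neighbors['distance'] ** 2

# Sum the weights per class and pick the class with the largest total.
totals = neighbors.groupby('label')['weight'].sum()
weighted_prediction = totals.idxmax()

# Unweighted majority vote, for comparison.
majority_prediction = neighbors['label'].mode()[0]
print(totals)
print(weighted_prediction, majority_prediction)
```

With these made-up numbers the two methods disagree: the single very close neighbor outweighs the two farther ones.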

## A w-kNN Example: Step 1

Start by calculating the distance between the new example ('X'), and each of the other training examples:

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day04_wknn_ex.png"/>
</div>

## A w-kNN Example: Step 2

Then, __calculate the weight__ of each training example using the inverse distance squared.

<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day04_wknn_ex1.png"/>
</div>

## A w-kNN Example: Step 3

Find the k closest neighbors--let's assume `k=3` for this example:
<div>
<img src="https://analytics.drake.edu/~reza/teaching/cs167_sp25/notes/images/day04_wknn_ex2.png"/>
</div>

Then, sum the weights for each possible class:
- __orange__: $1$
- __blue__: $1/16 + 1/9 \approx 0.174$

### What would a __normal 3NN__ predict? Weighted 3NN?

## Let's write some code:

Write a new function `weighted_knn()`

Pass the iris measurements (specimen), the dataframe, and k as parameters, and return the predicted class.