## Homework 4: Weighted k-Nearest Neighbors (kNN)

This is an individual assignment. You must document assistance in accordance with _Documentation of Academic Work_. Submit a cover sheet and documentation by saving your documentation to a PDF and placing it inside the folder for this homework.

Suppose we want to classify player $i$ as either a Hall of Famer or not. In Lesson 15, we coded the standard version of kNN, which weights each of $i$'s $k$ neighbors equally. Weighted kNN is an alternative that gives more weight to neighbors that are closer to $i$. One option is to weight by inverse distance. To make the weights sum to 1 (an attractive property), we divide each inverse distance by the sum of the inverse distances of $i$'s $k$ closest neighbors. Mathematically, the weight that player $i$ assigns to neighbor $j$ is:

$$w_{i,j} = \frac{\frac{1}{d(i,j)}}{\sum_{j' \in S}\frac{1}{d(i,j')}}$$ 

where $S$ is the set of the $k$ closest points to $i$ and $d(i,j)$ is the distance from $i$ to $j$.  Note the following property holds:

$$\sum_{j \in S} w_{i,j} = 1$$

and the predicted probability of player $i$ making the Hall of Fame is:

$$\hat{Pr}(HOF = 1 | X = i) = \sum_{j \in S} w_{i,j} * I(HOF_j = 1)$$

where $I(HOF_j = 1)$ equals 1 if player $j$ is in the Hall of Fame and 0 otherwise.
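
The sum-to-one property follows directly from the definition: the denominator does not depend on $j$, so it can be pulled out of the sum.

$$\sum_{j \in S} w_{i,j} = \frac{\sum_{j \in S}\frac{1}{d(i,j)}}{\sum_{j' \in S}\frac{1}{d(i,j')}} = 1$$

As a sanity check, if all $k$ neighbors are equally distant from $i$, every weight reduces to $1/k$ and the weighted estimate matches unweighted kNN. The formula assumes $d(i,j) > 0$ for every neighbor in $S$.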

Example:

Let $k = 3$, and suppose player $i$'s closest neighbors are $j = 0, 1, 2$, at distances of 2, 5, and 10 units, respectively. The two closest players ($j = 0, 1$) are in the Hall of Fame; the third is not.

$$w_{i,0} = \frac{1/2}{1/2 + 1/5 + 1/10} = \frac{5}{8}$$

$$w_{i,1} = \frac{1/5}{1/2 + 1/5 + 1/10} = \frac{1}{4}$$

$$w_{i, 2} = \frac{1/10}{1/2 + 1/5 + 1/10} = \frac{1}{8}$$

The estimated probability player $i$ makes the Hall of Fame is:

$$w_{i,0} * I(HOF_0 = 1) + w_{i,1} * I(HOF_1 = 1) + w_{i,2} * I(HOF_2 = 1) = \frac{5}{8} * 1 + \frac{1}{4} * 1 + \frac{1}{8} * 0 = \frac{7}{8}$$
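
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `inverse_distance_weights` is made up for illustration, and it uses exact fractions so the results match the hand calculation without rounding.

```python
from fractions import Fraction

def inverse_distance_weights(distances):
    """Return inverse-distance weights that sum to 1 (hypothetical helper)."""
    inverse = [1 / Fraction(d) for d in distances]
    total = sum(inverse)
    return [w / total for w in inverse]

# Distances and Hall of Fame indicators from the worked example
distances = [2, 5, 10]
in_hof = [1, 1, 0]

weights = inverse_distance_weights(distances)
prob = sum(w * h for w, h in zip(weights, in_hof))

print(weights)  # [Fraction(5, 8), Fraction(1, 4), Fraction(1, 8)]
print(prob)     # 7/8
```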

*Update the function below from Lesson 15 to give the user a choice between unweighted and weighted kNN.*

The *all* data frame contains every Major League Baseball player who was eligible for the Hall of Fame at some point (10+ seasons, retired for five years), plus 17 players who are likely to be on the ballot in 2024 (ballot2024 == 1).


In [5]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

#Read in all players
all = pd.read_csv('halloffame2024.csv')


all.head(n = 10)

Unnamed: 0,playerID,years,H,HR,nameFirst,nameLast,finalGame,inducted,ballot2024,color_pt
0,aaronha01,23,3771,755,Hank,Aaron,1976-10-03,1,0,2
1,aasedo01,13,0,0,Don,Aase,1990-10-03,0,0,0
2,abbotgl01,11,0,0,Glenn,Abbott,1984-08-08,0,0,0
3,abbotji01,10,2,0,Jim,Abbott,1999-07-21,0,0,0
4,abbotpa01,11,5,0,Paul,Abbott,2004-08-07,0,0,0
5,abernte02,14,25,0,Ted,Abernathy,1972-09-30,0,0,0
6,abreubo01,18,2470,288,Bobby,Abreu,2014-09-28,0,1,1
7,ackerji01,10,9,0,Jim,Acker,1992-06-14,0,0,0
8,adairje01,13,1022,57,Jerry,Adair,1970-05-03,0,0,0
9,adamsba01,19,216,3,Babe,Adams,1926-08-11,0,0,0


In [6]:
import plotly.express as px

fig = px.scatter(all,
                 x="H",
                 y="HR",
                 color='color_pt',
                 hover_name="nameLast",
                 hover_data=['H', 'HR'])

fig.update_layout(showlegend=False)

fig.update_traces(marker=dict(size=12,
                              line=dict(width=2, color='DarkSlateGrey')),
                  selector=dict(mode='markers'))

fig.show()

Update the function below with a weighted version of kNN (see comment 'Enter your code here'). *Try to avoid using loops.*
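
One loop-free way to go from a distance matrix to weighted votes is sketched here on made-up toy data (the arrays `queries`, `points`, and `labels` are illustrative and are not part of the homework data). It uses `np.take_along_axis` to gather each row's $k$ smallest distances, which is one alternative to row/column fancy indexing. The small `eps` floor that guards against a zero distance is an added assumption, not something the assignment specifies.

```python
import numpy as np

# Toy data (made up for illustration): 2 query points, 4 labelled points
queries = np.array([[0.0, 0.0], [3.0, 4.0]])
points = np.array([[3.0, 4.0], [0.0, 1.0], [6.0, 8.0], [0.0, 2.0]])
labels = np.array([1, 0, 1, 1])
k = 2

# Pairwise Euclidean distances via broadcasting: shape (2, 4)
diff = queries[:, np.newaxis] - points[np.newaxis, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

# Column indices of the k nearest points for each query: shape (2, k)
nearest_k = np.argsort(dist, axis=1)[:, :k]

# Gather each row's k smallest distances without a loop
dists_k = np.take_along_axis(dist, nearest_k, axis=1)

# Inverse-distance weights; eps avoids dividing by zero (an assumption)
eps = 1e-12
weights = 1 / np.maximum(dists_k, eps)
weights = weights / weights.sum(axis=1, keepdims=True)

# Weighted vote for each query point
prob = (weights * labels[nearest_k]).sum(axis=1)
print(prob)
```

For the first query, the two nearest points are at distances 1 and 2, so their weights are $2/3$ and $1/3$. The second query coincides with one of the labelled points, so the `eps` floor gives that neighbor essentially all of the weight.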

In [8]:
def k_nearest_hof(all, k = 5, weighted = False):
    '''function to perform weighted and unweighted kNN'''
    
    #extract stats as numpy arrays
    players_stats = all.loc[all.ballot2024 == 0, :][['HR', 'H']].to_numpy(copy = True)
    hof = all.loc[all.ballot2024 == 0, :]['inducted'].to_numpy(copy = True)
    ballot_stats = all.loc[all.ballot2024 == 1, :][['HR', 'H']].to_numpy(copy = True)
    
    #standardize the variables
    players_std = (players_stats - np.mean(players_stats, axis = 0))/np.std(players_stats, axis = 0)
    ballot_std = (ballot_stats - np.mean(players_stats, axis = 0))/np.std(players_stats, axis = 0)

    #add a new axis for broadcasting
    players_std = players_std[np.newaxis, :]
    ballot_std = ballot_std[:, np.newaxis]
    
    #calculate the Euclidean distance from each player on the ballot to each eligible player
    dist = np.sum((ballot_std - players_std) ** 2, axis = 2) ** (1 / 2)
    
    #determine who is closest to each player
    nearest = np.argsort(dist, axis = 1)
    
    
    #find the nearest k
    nearest_k = nearest[:, 0:k]
    
    #create a copy of players on the ballot
    ballot = all.loc[all.ballot2024 == 1, :].copy()
    
    ballot['neighbors_prop'] = 0
    
    if weighted:
        #ENTER YOUR CODE HERE
        #distances from each ballot player to its k nearest neighbors, shape (n_ballot, k)
        dists = dist[np.arange(len(dist))[:, None], nearest_k]
        #inverse-distance weights (1/d, not 1/d**2), normalized so each row sums to 1
        weights = 1 / dists
        weights = weights / np.sum(weights, axis = 1, keepdims = True)
        ballot['neighbors_prop'] = np.sum(weights * hof[nearest_k], axis = 1)
        #weighted is a function parameter, so no input() prompt is needed


    else:
        ballot['neighbors_prop'] = np.sum(hof[nearest_k], axis = 1) / k
        
    
    return ballot.sort_values('neighbors_prop', ascending = False)
    


print(k_nearest_hof(all, k = 10, weighted = False))

print(k_nearest_hof(all, k = 10, weighted = True))


       playerID  years     H   HR nameFirst   nameLast   finalGame  inducted  \
2691  sheffga01     22  2689  509      Gary  Sheffield  2009-09-30         0   
178   beltrad01     21  3166  477    Adrian     Beltre  2018-09-30         0   
3046  vizquom01     24  2877   80      Omar    Vizquel  2012-10-03         0   
2421  ramirma02     19  2574  555     Manny    Ramirez  2011-04-06         0   
179   beltrca01     20  2725  435    Carlos    Beltran  2017-10-01         0   
2527  rodrial01     22  3115  696      Alex  Rodriguez  2016-08-12         0   
2546  rolliji01     17  2455  231     Jimmy    Rollins  2016-06-08         0   
1408  hunteto01     19  2452  353     Torii     Hunter  2015-10-03         0   
1269  heltoto01     17  2519  369      Todd     Helton  2013-09-29         0   
6     abreubo01     18  2470  288     Bobby      Abreu  2014-09-28         0   
1059  gonzaad01     15  2050  317    Adrian   Gonzalez  2018-06-10         0   
1355  hollima01     15  2096  316      M