# Task 1.1: Machine Learning Problem
The problem I'm solving with Logistic Regression is predicting whether an NBA player will be inducted into the Hall of Fame. Logistic Regression suits this problem because it outputs a probability rather than only a class label. Other classification models can predict the class of an NBA player (Hall of Famer or not), but a probability is more informative here because induction into the Hall of Fame is ultimately a subjective decision.


In [1]:
import math
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import utils

# Task 1.2: Data Manipulation
The model uses data from multiple data sets since there was no singular data set relevant to the NBA containing both information about players and their Hall of Fame status. <br>
Links:
 - [NBA Players](https://www.kaggle.com/datasets/justinas/nba-players-data)
 - [NBA Hall of Famers 2021](https://www.kaggle.com/datasets/simplyab/nba-hall-of-famers-as-for-2021)

Exploratory Data Analysis (EDA) processes are shown in later cells, along with their descriptions and justifications.


## Load data

In [74]:
temp_player_data = pd.read_csv('/content/all_seasons.csv')
temp_hof_data = pd.read_csv('/content/NBA Hall of Famers 2021.csv')

In [75]:
predict_label = 'In_Hall_of_fame'

In [76]:
# Maybe conduct EDA on the columns being dropped
# Ensure no harm done to model by removing columns

player_data = temp_player_data.drop(columns=['Unnamed: 0', 'age', 'player_height', 'player_weight', 'country', 'season'])
hof_data = temp_hof_data.drop(columns=['position', 'height', 'weight', 'born'])

In [77]:
player_data.head()

Unnamed: 0,player_name,team_abbreviation,college,draft_year,draft_round,draft_number,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct
0,Dennis Rodman,CHI,Southeastern Oklahoma State,1986,2,27,55,5.7,16.1,3.1,16.1,0.186,0.323,0.1,0.479,0.113
1,Dwayne Schintzius,LAC,Florida,1990,1,24,15,2.3,1.5,0.3,12.3,0.078,0.151,0.175,0.43,0.048
2,Earl Cureton,TOR,Detroit Mercy,1979,3,58,9,0.8,1.0,0.4,-2.1,0.105,0.102,0.103,0.376,0.148
3,Ed O'Bannon,DAL,UCLA,1995,1,9,64,3.7,2.3,0.6,-8.7,0.06,0.149,0.167,0.399,0.077
4,Ed Pinckney,MIA,Villanova,1985,1,10,27,2.4,2.4,0.2,-11.2,0.109,0.179,0.127,0.611,0.04


In [78]:
hof_data.head()

Unnamed: 0,Name,All_star_selections,In_Hall_of_fame
0,Kareem Abdul-Jabbar,19,1
1,Kobe Bryant,18,1
2,LeBron James,17,2
3,Tim Duncan,15,1
4,Kevin Garnett,15,1


## Pre-EDA: Null-value analysis
Checking each data set for null values helps determine the correct method to merge them together. In addition, entries with too many null values can be eliminated, and entries with few null values can be filled manually. Luckily, this process revealed that there are no null values in either data set, so the merge was seamless.

In [79]:
hof_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 437 entries, 0 to 436
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Name                 437 non-null    object
 1   All_star_selections  437 non-null    int64 
 2   In_Hall_of_fame      437 non-null    int64 
dtypes: int64(2), object(1)
memory usage: 10.4+ KB


In [80]:
player_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12305 entries, 0 to 12304
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   player_name        12305 non-null  object 
 1   team_abbreviation  12305 non-null  object 
 2   college            12305 non-null  object 
 3   draft_year         12305 non-null  object 
 4   draft_round        12305 non-null  object 
 5   draft_number       12305 non-null  object 
 6   gp                 12305 non-null  int64  
 7   pts                12305 non-null  float64
 8   reb                12305 non-null  float64
 9   ast                12305 non-null  float64
 10  net_rating         12305 non-null  float64
 11  oreb_pct           12305 non-null  float64
 12  dreb_pct           12305 non-null  float64
 13  usg_pct            12305 non-null  float64
 14  ts_pct             12305 non-null  float64
 15  ast_pct            12305 non-null  float64
dtypes: float64(9), int64(1), object(6)

## Data pre-processing
Group player data by each player and merge with HOF data

In [81]:
player_data[player_data['player_name'] == 'Dennis Rodman']

Unnamed: 0,player_name,team_abbreviation,college,draft_year,draft_round,draft_number,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct
0,Dennis Rodman,CHI,Southeastern Oklahoma State,1986,2,27,55,5.7,16.1,3.1,16.1,0.186,0.323,0.1,0.479,0.113
728,Dennis Rodman,CHI,Southeastern Oklahoma State,1986,2,27,80,4.7,15.0,2.9,6.7,0.169,0.319,0.088,0.459,0.112
1227,Dennis Rodman,LAL,Southeastern Oklahoma State,1986,2,27,23,2.1,11.2,1.3,-1.4,0.114,0.348,0.065,0.388,0.063
1354,Dennis Rodman,DAL,Southeastern Oklahoma State,1986,2,27,12,2.8,14.3,1.2,-7.1,0.136,0.351,0.064,0.457,0.046


In [82]:
def column_agg(x):
    """Collapse one player's per-season rows into a single career row."""
    d = {}
    # Most frequent team and first listed college
    d['team'] = x['team_abbreviation'].mode().iloc[0]
    d['college'] = x['college'].iloc[0]
    # Undrafted players are encoded as -1
    d['draft'] = -1 if x['draft_year'].iloc[0] == 'Undrafted' else x['draft_number'].iloc[0]
    games = x['gp'].sum()
    d['gp'] = games
    # Per-season averages are weighted by the games played in each season
    convertStat = lambda stat: round((stat * x['gp']).sum() / games, 2)
    d['pts'] = convertStat(x['pts'])
    d['reb'] = convertStat(x['reb'])
    d['ast'] = convertStat(x['ast'])
    d['net_rating'] = convertStat(x['net_rating'])
    d['oreb_pct'] = convertStat(x['oreb_pct'])
    d['dreb_pct'] = convertStat(x['dreb_pct'])
    d['usg_pct'] = convertStat(x['usg_pct'])
    d['ts_pct'] = convertStat(x['ts_pct'])
    d['ast_pct'] = convertStat(x['ast_pct'])
    return pd.Series(d, index=['team', 'college', 'draft', 'gp', 'pts', 'reb',
                               'ast', 'net_rating', 'oreb_pct', 'dreb_pct',
                               'usg_pct', 'ts_pct', 'ast_pct'])

In [83]:
player_groups = player_data.groupby('player_name').apply(column_agg)
player_groups.head()

Unnamed: 0_level_0,team,college,draft,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct
player_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
A.C. Green,DAL,Oregon State,23,379,5.86,6.19,0.89,-1.58,0.09,0.19,0.13,0.49,0.05
A.J. Bramlett,CLE,Arizona,39,8,1.0,2.8,0.0,-32.6,0.16,0.21,0.15,0.19,0.0
A.J. Guyton,CHI,Indiana,32,80,5.51,1.02,1.82,-7.35,0.02,0.06,0.2,0.47,0.23
AJ Hammons,DAL,Purdue,46,22,2.2,1.6,0.2,-0.6,0.05,0.2,0.17,0.47,0.04
AJ Price,IND,Connecticut,52,261,5.83,1.45,2.19,-1.28,0.02,0.09,0.21,0.48,0.25
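
As a quick check of the games-weighted averaging in `column_agg`, the sketch below recomputes a career points average by hand from the four Dennis Rodman season rows shown earlier. The helper name `weighted_average` is illustrative and not part of the notebook.

```python
# Per-season (games played, points per game) for Dennis Rodman, copied from the rows above
seasons = [(55, 5.7), (80, 4.7), (23, 2.1), (12, 2.8)]

def weighted_average(rows):
    """Average a per-game stat across seasons, weighting each season by games played."""
    total_games = sum(gp for gp, _ in rows)
    return round(sum(gp * stat for gp, stat in rows) / total_games, 2)

career_pts = weighted_average(seasons)
simple_mean = round(sum(stat for _, stat in seasons) / len(seasons), 2)
print(career_pts, simple_mean)
```

Weighting by games keeps short seasons (such as the 12-game one) from pulling the career figure around as much as a plain mean of the season averages would.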


In [84]:
hof_data = hof_data.rename(columns={'Name': 'player_name'})
merge_data = player_groups.merge(hof_data, on='player_name', how='inner')
merge_data.head()

Unnamed: 0,player_name,team,college,draft,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,All_star_selections,In_Hall_of_fame
0,A.C. Green,DAL,Oregon State,23,379,5.86,6.19,0.89,-1.58,0.09,0.19,0.13,0.49,0.05,1,0
1,Al Horford,ATL,Florida,3,950,13.67,8.18,3.31,4.03,0.07,0.21,0.19,0.57,0.16,5,2
2,Allan Houston,NYK,Tennessee,11,602,18.56,3.06,2.45,0.58,0.01,0.09,0.25,0.54,0.13,2,0
3,Allen Iverson,PHI,Georgetown,1,914,26.67,3.71,6.15,0.74,0.02,0.08,0.31,0.52,0.28,11,1
4,Alonzo Mourning,MIA,Georgetown,2,553,14.68,7.63,0.94,5.4,0.1,0.2,0.24,0.58,0.06,7,1


In [85]:
merge_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 212 entries, 0 to 211
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   player_name          212 non-null    object 
 1   team                 212 non-null    object 
 2   college              212 non-null    object 
 3   draft                212 non-null    object 
 4   gp                   212 non-null    int64  
 5   pts                  212 non-null    float64
 6   reb                  212 non-null    float64
 7   ast                  212 non-null    float64
 8   net_rating           212 non-null    float64
 9   oreb_pct             212 non-null    float64
 10  dreb_pct             212 non-null    float64
 11  usg_pct              212 non-null    float64
 12  ts_pct               212 non-null    float64
 13  ast_pct              212 non-null    float64
 14  All_star_selections  212 non-null    int64  
 15  In_Hall_of_fame      212 non-null    int64  

In [86]:
# Only keep inactive players (active players may or may not become HOFs)
# Label encoding done to convert categorical features to numbers
le = preprocessing.LabelEncoder()

# .copy() avoids pandas' SettingWithCopyWarning on the assignments below
final_data = merge_data[merge_data['In_Hall_of_fame'] < 2].copy()

final_data['team'] = le.fit_transform(final_data['team'])

# Invert the pick number so earlier picks get larger values; undrafted players stay at -1
final_data['draft'] = final_data['draft'].astype(int)
max_draft = max(final_data['draft'])
final_data['draft'] = final_data['draft'].apply(lambda x: abs(x - max_draft) if x > -1 else -1)

final_data.head()


Unnamed: 0,player_name,team,college,draft,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,All_star_selections,In_Hall_of_fame
0,A.C. Green,6,Oregon State,34,379,5.86,6.19,0.89,-1.58,0.09,0.19,0.13,0.49,0.05,1,0
2,Allan Houston,20,Tennessee,46,602,18.56,3.06,2.45,0.58,0.01,0.09,0.25,0.54,0.13,2,0
3,Allen Iverson,22,Georgetown,56,914,26.67,3.71,6.15,0.74,0.02,0.08,0.31,0.52,0.28,11,1
4,Alonzo Mourning,15,Georgetown,55,553,14.68,7.63,0.94,5.4,0.1,0.2,0.24,0.58,0.06,7,1
5,Amar'e Stoudemire,23,,48,846,18.91,7.84,1.24,2.37,0.09,0.2,0.26,0.6,0.07,6,0
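
The `draft` values in this table come from inverting the pick number so that earlier picks score higher. The sketch below repeats that mapping on a few picks. The helper name `invert_pick` is illustrative, and the maximum pick of 57 is inferred from the tables (pick 23 becomes 34) rather than printed by the notebook.

```python
def invert_pick(pick, max_pick):
    """Map a draft pick so earlier picks score higher; -1 (undrafted) is left unchanged."""
    return abs(pick - max_pick) if pick > -1 else -1

max_pick = 57  # inferred from the tables above, not read from the data
for pick in (1, 2, 11, 23, -1):
    print(pick, invert_pick(pick, max_pick))
```

On this scale the latest pick maps to 0, so undrafted players (-1) sit just below it.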


## Data preparation for model
Split data into train and test X and y data sets

In [87]:
# Final data set (player_name is kept for reference to each of the entries)
final_data = final_data.drop(columns=['college', 'dreb_pct', 'oreb_pct', 'ast', 'usg_pct'])
final_data.head()

Unnamed: 0,player_name,team,draft,gp,pts,reb,net_rating,ts_pct,ast_pct,All_star_selections,In_Hall_of_fame
0,A.C. Green,6,34,379,5.86,6.19,-1.58,0.49,0.05,1,0
2,Allan Houston,20,46,602,18.56,3.06,0.58,0.54,0.13,2,0
3,Allen Iverson,22,56,914,26.67,3.71,0.74,0.52,0.28,11,1
4,Alonzo Mourning,15,55,553,14.68,7.63,5.4,0.58,0.06,7,1
5,Amar'e Stoudemire,23,48,846,18.91,7.84,2.37,0.6,0.07,6,0


In [88]:
use_data = final_data.drop(columns=['player_name'])

X_final, y_final = use_data.drop(columns=[predict_label]), use_data[predict_label]

min_max = preprocessing.MinMaxScaler()
temp = X_final.values
temp_cols = X_final.columns
temp_scaled = min_max.fit_transform(temp)
X_final = pd.DataFrame(temp_scaled, columns=temp_cols)

In [89]:
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.1, random_state=42)
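
The cells above scale the full feature matrix before splitting it, which lets the test rows influence the minimum and maximum used for scaling. A common alternative is to compute the scaling statistics from the training rows only. The sketch below illustrates that ordering on made-up numbers with plain NumPy; the array values and the helper name `fit_min_max` are assumptions for illustration, not the notebook's data.

```python
import numpy as np

def fit_min_max(train):
    """Return per-column minimum and range computed from the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    return col_min, col_range

# Hypothetical feature rows (games played, points per game)
train = np.array([[100.0, 5.0], [600.0, 15.0], [900.0, 25.0]])
test = np.array([[1200.0, 10.0]])

col_min, col_range = fit_min_max(train)
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

With this ordering, test values can fall outside [0, 1], which is expected. With scikit-learn's `MinMaxScaler`, the same idea corresponds to calling `fit_transform` on the training split and `transform` on the test split.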

In [90]:
X_train.head()

Unnamed: 0,team,draft,gp,pts,reb,net_rating,ts_pct,ast_pct,All_star_selections
66,0.677419,0.666667,0.566099,0.379701,0.127581,0.711506,0.642857,0.609756,0.0
18,0.774194,0.912281,0.204843,0.682572,0.223451,0.742386,0.678571,0.463415,0.117647
69,0.935484,0.210526,0.179319,0.466235,0.128319,0.909687,0.857143,0.390244,0.0
124,0.741935,0.859649,0.752618,0.536595,0.548673,0.749154,0.678571,0.121951,0.176471
30,0.806452,1.0,0.440445,0.771128,0.640118,0.742597,0.571429,0.390244,0.235294


In [91]:
y_train.head()

93     0
27     0
99     0
183    0
42     1
Name: In_Hall_of_fame, dtype: int64

In [92]:
X_train = X_train.to_numpy()
Y_train = np.reshape(y_train.to_numpy(), (len(y_train), 1))

X_test = X_test.to_numpy()
Y_test = np.reshape(y_test.to_numpy(), (len(y_test), 1))

print(X_train.shape, Y_train.shape)

(133, 9) (133, 1)


# Exploratory Data Analysis (EDA)
Skip for HW 2

In [None]:
# Possible EDA methods (from class example):
# info(), describe(), pairplot, scatter plot, pie chart, bar chart, histogram,
# grouping, PCA (not from class)

### Description of Data
This provides general statistics for each column in the data, which gives further insights on the number of values and the distributions of the features.

In [None]:
final_data.describe()

Unnamed: 0,team,draft,gp,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,All_star_selections,In_Hall_of_fame
count,148.0,148.0,148.0,148.0,148.0,148.0,148.0,148.0,148.0,148.0,148.0,148.0,148.0,148.0
mean,15.648649,42.804054,651.236486,13.579122,5.563851,2.97277,1.562703,0.060473,0.155676,0.215676,0.534122,0.161216,3.932432,0.243243
std,9.011606,14.089924,361.940863,4.758926,2.592176,1.975245,4.868963,0.036818,0.056715,0.043002,0.037178,0.092287,3.909977,0.430498
min,0.0,-1.0,13.0,1.94,1.23,0.28,-32.38,0.01,0.06,0.09,0.35,0.04,1.0,0.0
25%,8.0,38.0,371.0,10.58,3.485,1.5475,-0.31,0.03,0.1,0.19,0.51,0.09,1.0,0.0
50%,15.0,48.0,621.5,13.565,5.49,2.395,1.72,0.05,0.15,0.215,0.53,0.13,2.0,0.0
75%,23.0,53.0,894.25,16.735,7.5225,3.9225,3.8975,0.09,0.2,0.25,0.5525,0.2025,6.0,0.0
max,31.0,56.0,1541.0,26.67,14.79,8.71,14.9,0.18,0.33,0.33,0.63,0.45,18.0,1.0


### Pairplot
This shows the pairwise relationships between the features, with points colored by the prediction label (Hall of Famer). Detecting strongly correlated features is especially important when using Logistic Regression, since highly correlated inputs make the learned weights unstable and harder to interpret.

In [None]:
sns.pairplot(final_data, hue=predict_label)


### Correlation heatmaps
Because the pairplot above revealed significant correlations between features, further analysis was required to eliminate them. The heatmaps provide a cleaner visualization, with numeric values representing the level of correlation between each pair of features. These values allow for a clear threshold to determine which correlations need to be eliminated. <br>
The second heatmap shows the best results in terms of limited correlation, so the features to remove from the data are
- dreb_pct
- oreb_pct
- ast
- usg_pct
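
One way to apply such a threshold programmatically is to list every feature pair whose absolute correlation exceeds it. The sketch below does this on synthetic data; the 0.8 cutoff, the feature names, and the generated values are assumptions for illustration and are not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 200

# Synthetic features: feat_b is nearly a multiple of feat_a, feat_c is independent
feat_a = rng.normal(size=n_rows)
feat_b = 2.0 * feat_a + 0.1 * rng.normal(size=n_rows)
feat_c = rng.normal(size=n_rows)

names = ['feat_a', 'feat_b', 'feat_c']
corr = np.corrcoef(np.vstack([feat_a, feat_b, feat_c]))  # rows are variables

threshold = 0.8  # hypothetical cutoff
high_pairs = [
    (names[i], names[j], round(float(corr[i, j]), 2))
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]
print(high_pairs)
```

Dropping one feature from each listed pair is the simplest response; which one to drop remains a judgment call.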

In [None]:
# Source: https://towardsdatascience.com/why-feature-correlation-matters-a-lot-847e8ba439c4

fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(final_data.select_dtypes(include='number').corr(), vmax=1.0, center=0, fmt='.2f',
                square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .70})
plt.show()

In [None]:
# Source: https://towardsdatascience.com/why-feature-correlation-matters-a-lot-847e8ba439c4

fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(final_data.drop(columns=['dreb_pct', 'oreb_pct', 'ast', 'usg_pct']).select_dtypes(include='number').corr(), vmax=1.0, center=0, fmt='.2f',
                square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .70})
plt.show()

In [None]:
# Source: https://towardsdatascience.com/why-feature-correlation-matters-a-lot-847e8ba439c4

fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(final_data.drop(columns=['reb', 'ast', 'usg_pct']).select_dtypes(include='number').corr(), vmax=1.0, center=0, fmt='.2f',
                square=True, linewidths=.5, annot=True, cbar_kws={"shrink": .70})
plt.show()

### Histogram
This provides further insight into the distribution of an individual feature for each value of the prediction label (Hall of Famer). The data description above showed an unusual distribution for the net rating feature (a minimum of -32.38 against a 25th percentile of -0.31), so a more detailed look seemed appropriate. In the end, the outliers did not seem likely to harm the model, since both labels still showed most of the expected distribution, so those data points were kept in the data set.

In [None]:
# Histogram for net_rating

g = sns.FacetGrid(final_data, col=predict_label, height=5)
g.map(plt.hist, 'net_rating', bins=20)

### Pie charts
These show the distribution of values in categorical features. The prediction label was treated as categorical because although the 0 and 1 are numerical, they represent possible classes, not continuous values. The high cardinality of the College feature led to its removal from the dataset: the feature did not seem important enough to justify the extra dimensions that encoding it would add.

In [None]:
# Source: https://matplotlib.org/3.1.1/gallery/pie_and_polar_charts/pie_and_donut_labels.html#sphx-glr-gallery-pie-and-polar-charts-pie-and-donut-labels-py

fig, ax = plt.subplots(figsize=(5,5))
ax.pie(x=final_data[predict_label].value_counts(), 
       labels=final_data[predict_label].value_counts().index,
       autopct=lambda pct: '{:.1f}%\n({:d})'.format(pct, int(len(final_data)*pct/100.)),
       textprops=dict(color="black"))

ax.set_title(predict_label)

plt.show()

In [None]:
# Pie charts for college
fig, ax = plt.subplots(figsize=(5,5))
ax.pie(x=final_data['college'].value_counts(), 
       labels=final_data['college'].value_counts().index,
       autopct='%1.1f%%', textprops=dict(color="black"))

ax.set_title('College')

plt.show()

### PCA

In [None]:
# PCA (Not used due to limited applicability)
# pca = PCA()
# pca.fit(final_data.drop(columns=['player_name', 'college', predict_label]))
# print(pca.explained_variance_ratio_)

# Task 1.3: Logistic Regression Model

In [62]:
def get_acc(y, y_hat):
    """Fraction of predictions that match y after rounding y_hat to 0 or 1."""
    return (len(y) - np.count_nonzero(np.round(y_hat, decimals=0) - y)) / len(y)

In [None]:
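# Task 3: logistic regression
# Use this cell to reset the weights for each version of the model

n, n_features = X_train.shape  # 133 training samples, 9 features
w = np.random.normal(scale=0.1, size=(n_features, 1))
b = np.zeros((1, 1))
# np.exp keeps sigmoid usable on scalars and arrays alike
sigmoid = lambda x: 1 / (1 + np.exp(-x))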

## Batch Gradient Descent

In [None]:
learning_rate = 0.1

# Converges after 10 iterations
for _ in range(10):
    A = np.vectorize(sigmoid)(np.dot(X_train, w) + b)
    cost = -1/n * np.sum(Y_train * np.log(A) + (1-Y_train) * (np.log(1-A)))
    dw = np.dot(X_train.T, (A-Y_train))/n
    db = np.sum(A-Y_train)/n

    w = w - learning_rate*dw
    b = b - learning_rate*db
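
Whether ten iterations are enough can be checked by recording the cost at every step and looking at how much it still changes. The sketch below runs the same update rule on synthetic data and stores the cost history; the data, the seed, and the iteration count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 3

# Synthetic features in [0, 1] and labels driven by the first feature
X = rng.random((n_samples, n_features))
Y = (X[:, [0]] > 0.5).astype(float)

w = np.zeros((n_features, 1))
b = 0.0
learning_rate = 0.1
costs = []

for _ in range(200):
    A = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    A = np.clip(A, 1e-12, 1 - 1e-12)  # keep log() finite
    costs.append(float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))))
    dw = X.T @ (A - Y) / n_samples
    db = float(np.mean(A - Y))
    w -= learning_rate * dw
    b -= learning_rate * db

print(costs[0], costs[9], costs[-1])
```

If the cost is still falling noticeably after the tenth value, the loop has not yet converged, and more iterations or a different learning rate may be worth trying.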

In [None]:
# Accuracy measure

Y_hat = np.vectorize(sigmoid)(np.dot(X_train, w) + b)
print(get_acc(Y_train, Y_hat))

Y_hat = np.vectorize(sigmoid)(np.dot(X_test, w) + b)
print(get_acc(Y_test, Y_hat))

## Stochastic Gradient Descent

In [None]:
learning_rate = 0.1

# Converges after 10 iterations
for _ in range(10):
    X_train, Y_train = utils.shuffle(X_train, Y_train)
    for i in range(n):
        a = np.vectorize(sigmoid)(np.dot(X_train[i], w) + b)
        cost = -1/n * np.sum(Y_train[i] * np.log(a) + (1-Y_train[i]) * (np.log(1-a)))
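        # Note: these are single-sample gradients divided by n, so the effective
        # step size is learning_rate / n rather than learning_rate
        dw = np.dot(np.reshape(X_train[i].T, (9, 1)), np.reshape(a-Y_train[i], (1, 1)))/n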
        db = (a-Y_train[i])/n

        w = w - learning_rate*dw
        b = b - learning_rate*db

In [None]:
Y_hat = np.vectorize(sigmoid)(np.dot(X_train, w) + b)
print(get_acc(Y_train, Y_hat))

Y_hat = np.vectorize(sigmoid)(np.dot(X_test, w) + b)
print(get_acc(Y_test, Y_hat))

0.7518796992481203
0.8
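
A useful sanity check for these numbers is the majority-class baseline: the accuracy obtained by always predicting the more common label. The sketch below computes it for a made-up label list; the counts (100 negatives, 33 positives) are hypothetical, chosen only to match the size of the training set, and are not the notebook's actual label distribution.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy of a classifier that always predicts the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels: 100 non-Hall-of-Famers and 33 Hall of Famers
labels = [0] * 100 + [1] * 33
baseline = majority_baseline(labels)
print(round(baseline, 4))
```

A model whose accuracy matches this baseline may be predicting a single class, so comparing the two numbers is worthwhile before drawing conclusions about convergence.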


# Task 1.4: Optimization Algorithms
Chosen algorithms are Adam and Adamax

In [None]:
# Only used in this task
import tensorflow as tf

## Adam optimizer

In [None]:
# Task 4: optimizers
# Adam optimizer, adamax optimizer
# Using tensorflow sigmoid function because custom sigmoid function isn't
# compatible with Tensors

adam_opt = tf.keras.optimizers.Adam(learning_rate=0.1)

w = tf.Variable(w)
b = tf.Variable(b)
print(b)
Y_train = tf.cast(tf.convert_to_tensor(Y_train), dtype=tf.float64)
X_train = tf.cast(tf.convert_to_tensor(X_train), dtype=tf.float64)
loss = lambda: (-1/n) * tf.math.reduce_sum(
    Y_train * tf.math.log(tf.math.sigmoid(tf.tensordot(X_train, w, 1) + b)) + 
    (1-Y_train) * (tf.math.log(1-tf.math.sigmoid(tf.tensordot(X_train, w, 1) + b))))

# Note: a single minimize() call applies one optimizer step, not a full training loop
adam_opt.minimize(loss, var_list=[w, b]).numpy()

print(b)

<tf.Variable 'Variable:0' shape=(1, 1) dtype=float64, numpy=array([[0.]])>
<tf.Variable 'Variable:0' shape=(1, 1) dtype=float64, numpy=array([[-0.09999881]])>


In [None]:
Y_hat = tf.math.sigmoid(tf.tensordot(X_train, w, 1) + b)
print(get_acc(Y_train, Y_hat))

Y_hat = tf.math.sigmoid(tf.tensordot(X_test, w, 1) + b)
print(get_acc(Y_test, Y_hat))

0.7518796992481203
0.8
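
The bias printed by the Adam cell above moves from 0 to roughly -0.1, which is what a single Adam step with a learning rate of 0.1 is expected to produce: after bias correction, the first update has a magnitude close to the learning rate regardless of the gradient's size. The sketch below reproduces that behaviour with the textbook Adam update for one scalar parameter. The gradient value 0.3 is made up, and the hyperparameters are the commonly used defaults (beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-7), assumed here rather than read from the notebook.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-7):
    """Apply one textbook Adam update to a scalar parameter."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

b, m, v = 0.0, 0.0, 0.0
b, m, v = adam_step(b, grad=0.3, m=m, v=v, t=1)  # hypothetical positive gradient
print(b)
```

Training to convergence would need this step repeated many times, for example by calling the optimizer inside a loop and tracking the loss between calls.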


## Adamax optimizer

In [None]:
adamax_opt = tf.keras.optimizers.Adamax(learning_rate=0.1)

w = tf.Variable(w)
b = tf.Variable(b)
print(b)
Y_train = tf.cast(tf.convert_to_tensor(Y_train), dtype=tf.float64)
X_train = tf.cast(tf.convert_to_tensor(X_train), dtype=tf.float64)
loss = lambda: (-1/n) * tf.math.reduce_sum(
    Y_train * tf.math.log(tf.math.sigmoid(tf.tensordot(X_train, w, 1) + b)) + 
    (1-Y_train) * (tf.math.log(1-tf.math.sigmoid(tf.tensordot(X_train, w, 1) + b))))

# Note: a single minimize() call applies one optimizer step, not a full training loop
adamax_opt.minimize(loss, var_list=[w, b]).numpy()

print(b)

<tf.Variable 'Variable:0' shape=(1, 1) dtype=float64, numpy=array([[0.]])>
<tf.Variable 'Variable:0' shape=(1, 1) dtype=float64, numpy=array([[-0.09999997]])>


In [None]:
Y_hat = tf.math.sigmoid(tf.tensordot(X_train, w, 1) + b)
print(get_acc(Y_train, Y_hat))

Y_hat = tf.math.sigmoid(tf.tensordot(X_test, w, 1) + b)
print(get_acc(Y_test, Y_hat))

0.7518796992481203
0.8


## Conclusion + Result Analysis



Logistic regression was run with Batch Gradient Descent, Stochastic Gradient Descent, Batch Gradient Descent using the Adam optimizer, and Batch Gradient Descent using the Adamax optimizer. Every version of the model ended with a training accuracy of ~75% and a testing accuracy of 80%. Because the data set is relatively small, the testing set (only 15 elements) may not be a reliable measure of the model's generalizability. The ~75% training accuracy is also close to the share of non-Hall-of-Famers in the data (about 76% of the 148 players), so the models may be doing little more than predicting the majority class. <br>
Based on these results, there is no measurable difference between using an optimization algorithm or not. One caveat is that each optimizer cell calls `minimize` only once, so the Adam and Adamax results reflect a single update step from the reset weights (the printed bias moves from 0 to about -0.1) rather than a full training run. Optimization algorithms are likely most helpful when applied to a model that is underperforming or converging slowly. With a model that has already settled, optimizers may not add much value.

# Task 2.1: Decision Tree Classifier

In [63]:
# Imports used for HW2
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold

In [None]:
# Run import cell, task 1.2, and first cell of task 1.3 to load data properly for HW2
# Test that the data was loaded properly
print(X_train)
print(y_train)
print(X_test)
print(Y_test)

In [64]:
clf = DecisionTreeClassifier(random_state=42)
clf = clf.fit(X_train, y_train)

In [65]:
y_pred = clf.predict(X_test)
print(get_acc(Y_test, np.reshape(y_pred, (15, 1))))

0.9333333333333333


In [66]:
# Try different parameters on the model
# Parameters to test: random_state, max_depth(int), max_leaf_nodes(int), min_impurity_decrease(float)
possible_vals = [5, 10, 20]
for val in possible_vals:
    clf1 = DecisionTreeClassifier(random_state=42, max_depth=val)
    clf1 = clf1.fit(X_train, y_train)
    y_pred = clf1.predict(X_test)
    print(get_acc(Y_test, np.reshape(y_pred, (15, 1))))
print("------------------")

possible_vals = [5, 10, 15]
for val in possible_vals:
    clf2 = DecisionTreeClassifier(random_state=42, max_leaf_nodes=val)
    clf2 = clf2.fit(X_train, y_train)
    y_pred = clf2.predict(X_test)
    print(get_acc(Y_test, np.reshape(y_pred, (15, 1))))
print("------------------")

possible_vals = [0.2, 0.4, 0.8]
for val in possible_vals:
    clf3 = DecisionTreeClassifier(random_state=42, min_impurity_decrease=val)
    clf3 = clf3.fit(X_train, y_train)
    y_pred = clf3.predict(X_test)
    print(get_acc(Y_test, np.reshape(y_pred, (15, 1))))
print("------------------")

0.9333333333333333
0.9333333333333333
0.9333333333333333
------------------
0.9333333333333333
0.9333333333333333
0.9333333333333333
------------------
0.9333333333333333
0.8
0.8
------------------
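
The drop to 0.8 for the two largest `min_impurity_decrease` values is consistent with how impurity is bounded. Assuming the default Gini criterion, a two-class node with positive share p has impurity 2p(1 - p), which never exceeds 0.5, and a split cannot reduce impurity by more than the node's own impurity. The sketch below evaluates that bound for a positive share of 0.243 (the mean of `In_Hall_of_fame` in the data description); applying this share to the training split is an assumption.

```python
def gini(p):
    """Gini impurity of a two-class node whose positive-class share is p."""
    return 2 * p * (1 - p)

root_impurity = gini(0.243)  # share of Hall of Famers in the full data set
for threshold in (0.2, 0.4, 0.8):
    # A split at the root can reduce impurity by at most root_impurity
    print(threshold, threshold <= root_impurity)
```

Under these assumptions only the 0.2 threshold leaves room for a split, so the trees for 0.4 and 0.8 would reduce to a single leaf that predicts the majority class.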


## Discussion
I chose to alter 3 parameters: the max depth, max leaf nodes, and min impurity decrease of the tree. The max depth sets the maximum depth of the tree. The max leaf nodes caps the number of leaf nodes, which constrains the number of splits. The min impurity decrease constrains when a split is allowed: a node is split only if the split reduces the impurity by at least this threshold. <br>
None of these parameters had an impact on my final accuracy, except for the min impurity decrease, which reduced the accuracy for values higher than 0.2 (default is 0.0). A likely reason is that the Gini impurity of a two-class node cannot exceed 0.5, so thresholds of 0.4 and 0.8 leave little or no room for any split, and the tree falls back to predicting the majority class. The lack of change elsewhere is most likely due to the small sample size of my test dataset and the simplicity of my data in general. Even though the tree structure may differ with different parameters, the problem is easy enough to solve that the slight differences aren't seen in the results of the model.

# Task 2.2: Bagging and Boosting

In [93]:
# Bagging model: Random Forest Classifier
rfc = RandomForestClassifier(random_state=42)
rfc = rfc.fit(X_train, y_train)

In [94]:
y_pred = rfc.predict(X_test)
print(get_acc(Y_test, np.reshape(y_pred, (15, 1))))

0.9333333333333333


In [95]:
# Boosting model: Gradient Boosting Classifier
gbc = GradientBoostingClassifier(random_state=42)
gbc = gbc.fit(X_train, y_train)

In [96]:
y_pred = gbc.predict(X_test)
print(get_acc(Y_test, np.reshape(y_pred, (15, 1))))

0.9333333333333333


In [71]:
# k-folds cross validation on both models
X = X_final.to_numpy()
y = y_final.to_numpy()

kf = KFold(n_splits=10)  # shuffle=True, random_state=42 left disabled
i = 1
for train_index, test_index in kf.split(X):
    print("Split #" + str(i))
    # Fold-specific names avoid overwriting the global y_train / y_test split
    x_fold_train, x_fold_test = X[train_index], X[test_index]
    y_fold_train, y_fold_test = y[train_index], y[test_index]

    rfc = RandomForestClassifier(random_state=42)
    rfc = rfc.fit(x_fold_train, y_fold_train)
    y_pred = rfc.predict(x_fold_test)
    print(get_acc(np.reshape(y_fold_test, (len(y_fold_test), 1)), np.reshape(y_pred, (len(y_pred), 1))))

    gbc = GradientBoostingClassifier(random_state=42)
    gbc = gbc.fit(x_fold_train, y_fold_train)
    y_pred = gbc.predict(x_fold_test)
    print(get_acc(np.reshape(y_fold_test, (len(y_fold_test), 1)), np.reshape(y_pred, (len(y_pred), 1))))

    i += 1

Split #1
1.0
0.9333333333333333
Split #2
0.8
0.8666666666666667
Split #3
1.0
0.9333333333333333
Split #4
0.8666666666666667
0.8666666666666667
Split #5
0.8666666666666667
0.8
Split #6
1.0
0.9333333333333333
Split #7
0.9333333333333333
0.9333333333333333
Split #8
0.9333333333333333
0.8666666666666667
Split #9
0.9285714285714286
0.7857142857142857
Split #10
0.7857142857142857
0.9285714285714286
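
The per-split accuracies above are easier to compare once they are summarized. The sketch below copies the printed values (rounded to four decimals) and reports the mean and standard deviation for each model using only the standard library.

```python
from statistics import mean, stdev

# Accuracies copied from the ten splits above (first value per split: random forest)
rf_acc = [1.0, 0.8, 1.0, 0.8667, 0.8667, 1.0, 0.9333, 0.9333, 0.9286, 0.7857]
gb_acc = [0.9333, 0.8667, 0.9333, 0.8667, 0.8, 0.9333, 0.9333, 0.8667, 0.7857, 0.9286]

for name, scores in (("random forest", rf_acc), ("gradient boosting", gb_acc)):
    print(f"{name}: mean={mean(scores):.3f}, std={stdev(scores):.3f}")
```

Averaged over the folds, the random forest comes out at roughly 0.91 and gradient boosting at roughly 0.88, a gap smaller than the effect of one misclassified player in a single fold.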


## Discussion
K-folds cross validation was used with 10 splits. The accuracies range from about 79% to 100% across the splits, which brackets the 93% obtained from the original bagging and boosting models. With only 14 or 15 players per test fold, a single misclassification moves the accuracy by roughly 7 percentage points, which likely explains both the spread and the splits with 100% testing accuracy. The small test size and the simplicity of the data are also the likely reason that the bagging and boosting models did not improve on the regular decision tree classifier.

# Task 2.3: Results Comparison

The three models perform almost identically, most likely due to the simplicity of the problem and the limited data set, as mentioned in the previous discussion. The metric used was accuracy (the *get_acc* function), which divides the number of correct predictions by the total number of predictions. Accuracy is justified here because a player being incorrectly classified as a Hall of Famer and a player being incorrectly left out of the Hall of Fame can be penalized equally, which is what accuracy does. A different metric would place higher importance on either false positives or false negatives, which would be less complete than accuracy since both are equally important. As shown below, since only one data point is misclassified in the test data, the *get_precision* helper returns the same value as the accuracy and the *get_recall* helper returns 100%. Note that both helpers count every correct prediction (true negatives included) in the numerator, so they are not the standard definitions TP / (TP + FP) and TP / (TP + FN). Neither value describes the performance of the model any better than the accuracy.

In [132]:
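# Demonstrate use of different metric
# Note: both lambdas use all correct predictions (TP + TN) as the numerator,
# which differs from standard precision TP/(TP+FP) and recall TP/(TP+FN)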
get_precision = lambda y, y_hat: (len(y) - np.count_nonzero(np.round(y_hat, decimals=0) - y)) / ((len(y) - np.count_nonzero(np.round(y_hat, decimals=0) - y)) + np.sum((np.round(y_hat, decimals=0) - y) == 1))
get_recall = lambda y, y_hat: (len(y) - np.count_nonzero(np.round(y_hat, decimals=0) - y)) / ((len(y) - np.count_nonzero(np.round(y_hat, decimals=0) - y)) + np.sum((np.round(y_hat, decimals=0) - y) == -1))
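
For reference, the standard definitions are precision = TP / (TP + FP) and recall = TP / (TP + FN), where only true positives appear in the numerator. The sketch below computes them for a made-up 15-element test set with a single false positive; the labels and the helper name `confusion_counts` are hypothetical and do not reproduce the notebook's predictions.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels encoded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical labels: 3 positives, 12 negatives, one negative predicted as positive
y_true = [1, 1, 1] + [0] * 12
y_pred = [1, 1, 1, 1] + [0] * 11

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

In this example accuracy and precision differ (14/15 versus 3/4), which is why the two metrics are usually reported separately when one class is much rarer than the other.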

In [133]:
# Bagging model: Random Forest Classifier
rfc = RandomForestClassifier(random_state=42)
rfc = rfc.fit(X_train, y_train)

In [134]:
y_pred = rfc.predict(X_test)
print(get_precision(Y_test, np.reshape(y_pred, (15, 1))))
print(get_recall(Y_test, np.reshape(y_pred, (15, 1))))

0.9333333333333333
1.0


In [135]:
# Boosting model: Gradient Boosting Classifier
gbc = GradientBoostingClassifier(random_state=42)
gbc = gbc.fit(X_train, y_train)

In [136]:
y_pred = gbc.predict(X_test)
print(get_precision(Y_test, np.reshape(y_pred, (15, 1))))
print(get_recall(Y_test, np.reshape(y_pred, (15, 1))))

0.9333333333333333
1.0
