## NLP Draft Predictions

This notebook details our initial attempt at two tasks: the supervised task of predicting NHL player draft positions and the unsupervised task of clustering NHL players by similarity. The main methodology uses NLP text embeddings extracted from 2014-2022 NHL scouting reports published by various public sports news outlets.

Members:
- Quoc-Huy Nguyen
- Ryan DeSalvio



In [1]:
import os
import re
import json
import numpy as np
import pandas as pd

import matplotlib.cm as cm
import matplotlib.pyplot as plt

from sklearn.svm import SVC
from sklearn import preprocessing
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GroupShuffleSplit
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

import clean_reports
import preprocess_reports
import setup_predictor
from model import *
from train_test_predictor import train_and_test

nltk.download("punkt")
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mrquo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mrquo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mrquo\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\mrquo\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
# dataset location
DATASET = "data/prospect-data.csv"

# load dataset into dataframe
data = clean_reports.clean(DATASET, raw=True)

data.head()

| | Year | Position | Height | Weight | Drafted | Team | Average Ranking | Name | Description - Corey Pronman | Description - Scott Wheeler | Description - Smaht Scouting | Description - ESPN (Chris Peters) | Description - EP Rinkside | Description - EP Rinkside Part 2 | Description - The Painted Lines | Description - FCHockey |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | LW | 76.0 | 218.0 | 1 | MON | 1.0 | Juraj Slafkovsky | Slafkovsky has all the assets you're looking f... | Slafkovsky is one of the draft's most tantaliz... | Slafkovsky can be a menace at the NHL level. H... | The potential of what Slafkovsky can be as he ... | Slafkovsky is one of the largest players in th... | Nothing brought out Juraj Slafkovsky's draft y... | Slafkovsky dominated the international scene t... | Juraj Slafkovsky drives offense from the wing.... |
| 1 | 2022 | C | 72.5 | 193.0 | 4 | SEA | 2.0 | Shane Wright | Wright is a very well-rounded center who has n... | Still my top prospect in this class (though no... | The complicated and essential question to answ... | With high-end hockey sense highlighted by his ... | The top player from the class held this positi... | In a draft year shaped by substantial depth ra... | Wright has been on the radar of scouts for a l... | Shane Wright is an elite two-way center with i... |
| 2 | 2022 | D | 72.0 | 190.0 | 2 | NJD | 4.0 | Simon Nemec | Nemec is a very well-rounded defenseman. His p... | This kid turned 18 in the middle of February a... | With Nemec, you are netting a top pairing defe... | One of the very best passers in this draft, Ne... | The statistical comparables to Nemec's draft y... | Šimon Nemec just put together the most product... | Few players have been as dominant at the pro-l... | Simon Nemec is a mobile two-way and highly-int... |
| 3 | 2022 | C | 70.5 | 180.0 | 3 | ARI | 3.0 | Logan Cooley | Cooley is a dynamic player. When he has the pu... | Cooley is a beautiful, flowing skater capable ... | Logan Cooley is for sure one of the more well-... | Over the last several months, I have constantl... | A highly creative, speed-driven pivot who brin... | You won't find a more singularly-gifted puckha... | A small, speedy center, Cooley dominated the U... | Juraj Slafkovsky drives offense from the wing.... |
| 4 | 2022 | D | 75.0 | 189.0 | 6 | CBJ | 5.0 | David Jiricek | Jiricek is a big, right-shot defenseman who ma... | After suffering a knee injury at the world jun... | Top pairing defenseman at the NHL level if he ... | A highly mobile, 6-foot-3 defenseman with a bo... | A knee injury fractured Jiříček's season, but ... | David Jiříček finishes the year as the highest... | Similar to Nemec, Jiricek plays a mature game ... | David Jiricek is an exceptional two-way defens... |


In [3]:
data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 360 entries, 0 to 359
Data columns (total 16 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Year                               360 non-null    int64  
 1   Position                           360 non-null    object 
 2   Height                             358 non-null    float64
 3   Weight                             358 non-null    float64
 4   Drafted                            360 non-null    int64  
 5   Team                               360 non-null    object 
 6   Average Ranking                    120 non-null    float64
 7   Name                               360 non-null    object 
 8   Description - Corey Pronman        347 non-null    object 
 9   Description - Scott Wheeler        171 non-null    object 
 10  Description - Smaht Scouting       107 non-null    object 
 11  Description - ESPN (Chris Peters)  187 non-null    object 

## Data Cleaning

In [4]:
# clean up dataset
# drop players drafted by Seattle (SEA); this choice may need revisiting
# later, but for clustering it should not matter
data = data[data['Team'] != 'SEA']

# try with only forwards
# data = data[
#     (data['Position'] == 'C') | 
#     (data['Position'] == 'LW') | 
#     (data['Position'] == 'RW')
# ]

# keep draft years up to and including 2022 (this year's class will be predicted later)
data = data[data['Year'] <= 2022]

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 355 entries, 0 to 359
Data columns (total 16 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Year                               355 non-null    int64  
 1   Position                           355 non-null    object 
 2   Height                             353 non-null    float64
 3   Weight                             353 non-null    float64
 4   Drafted                            355 non-null    int64  
 5   Team                               355 non-null    object 
 6   Average Ranking                    115 non-null    float64
 7   Name                               355 non-null    object 
 8   Description - Corey Pronman        342 non-null    object 
 9   Description - Scott Wheeler        166 non-null    object 
 10  Description - Smaht Scouting       104 non-null    object 
 11  Description - ESPN (Chris Peters)  182 non-null    object 

In [5]:
HOCKEY_WORDS = ["usntdp", "ntdp", "development", "program",
                "khl", "shl", "ushl", "ncaa", "ohl", "chl", "whl", "qmjhl",
                "sweden", "russia", "usa", "canada", "ojhl", "finland",
                "finnish", "swedish", "russian", "american", "wisconsin",
                "michigan", "bc", "boston", "london", "bchl", "kelowna",
                "liiga",
                "portland", "minnesota", "ska", "frolunda", "sjhl", "college",
                "center", "left", "right", "saginaw",
                "slovakia"]

# scouting report columns
mask = data.columns.str.match('Description')
scouting_reports = data.columns[mask]

# preprocess data with NLTK
preprocessed_df = data.copy()
for report in scouting_reports:
    report_preprocessor = preprocess_reports.NltkPreprocessor(data[report])
    preprocessed_df.loc[:,report] = report_preprocessor\
        .remove_names(data['Name'])\
        .remove_whitespace()\
        .remove_words(HOCKEY_WORDS)\
        .get_text()
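
The `NltkPreprocessor` class comes from the project's `preprocess_reports` module and is not shown in this notebook. As a rough sketch of the chaining pattern used above, here is a hypothetical stand-in (`ToyPreprocessor` and its internals are assumptions, not the authors' implementation) in which each step returns `self` so calls can be chained:

```python
import re


class ToyPreprocessor:
    """Hypothetical chainable text cleaner (not the project's class)."""

    def __init__(self, texts):
        # Copy so the caller's list is left untouched; None marks a missing report.
        self.texts = list(texts)

    def _apply(self, func):
        # Apply func to every present report, then return self for chaining.
        self.texts = [t if t is None else func(t) for t in self.texts]
        return self

    def remove_whitespace(self):
        # Collapse runs of whitespace into single spaces and trim the ends.
        return self._apply(lambda t: re.sub(r"\s+", " ", t).strip())

    def remove_words(self, words):
        # Drop whole-word, case-insensitive matches of the given vocabulary.
        if not words:
            return self
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
            flags=re.IGNORECASE,
        )
        return self._apply(lambda t: pattern.sub("", t))

    def get_text(self):
        return self.texts


cleaned = (
    ToyPreprocessor(["A fast  OHL winger from   London", None])
    .remove_words(["ohl", "london"])
    .remove_whitespace()
    .get_text()
)
print(cleaned)
```

In this sketch the whitespace step runs after word removal, so the gaps left by deleted words are collapsed as well.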


In [6]:
# transform from wide to long data frame
long_df = preprocessed_df.melt(
    id_vars=['Year', 'Position', 'Height', 'Weight', 'Drafted', 'Team', 'Average Ranking', 'Name'],
    value_vars=scouting_reports.tolist(),
    var_name='reporter',  
    value_name='text'
).dropna(
    subset=['text']
)
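
To make the wide-to-long reshape concrete, here is a small sketch on made-up data (the player names, scout names, and report texts are invented for illustration). Each player/scout pair becomes its own row, and pairs with no report are dropped:

```python
import pandas as pd

# Toy wide frame: one row per player, one column per (hypothetical) scout.
wide = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Drafted": [1, 2],
    "Description - Scout X": ["strong skater", None],
    "Description - Scout Y": ["good shot", "smart passer"],
})

report_cols = [c for c in wide.columns if c.startswith("Description")]

# One row per (player, scout) pair; pairs without a report are dropped.
long = wide.melt(
    id_vars=["Name", "Drafted"],
    value_vars=report_cols,
    var_name="reporter",
    value_name="text",
).dropna(subset=["text"])

print(long)
print(long.index.tolist())
```

Two things are worth noticing: a player with several reports now appears in several rows, and the index keeps gaps where rows were dropped.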



In [8]:
bert_embeddings_path = 'data/reports_with_bert_embeddings.csv'
if not os.path.exists(bert_embeddings_path):
    from sentence_transformers import SentenceTransformer

    # compute the sentence embeddings once up front, so we don't have to
    # redo it for each fit during GridSearch
    bert_model = SentenceTransformer('all-mpnet-base-v2')

    bert_embeddings = bert_model.encode(long_df['text'].tolist())

    # reuse long_df's index: melt + dropna leaves gaps in it, so a default
    # RangeIndex would attach embeddings to the wrong rows
    bert_columns = [f'bert{i}' for i in range(bert_embeddings.shape[1])]
    bert_df = pd.DataFrame(bert_embeddings, columns=bert_columns, index=long_df.index)

    long_df = long_df.join(bert_df)

    # index=False avoids a stray "Unnamed: 0" column when the cache is reloaded
    long_df.to_csv(bert_embeddings_path, index=False)
else:
    long_df = pd.read_csv(bert_embeddings_path)

    bert_columns = long_df.columns[long_df.columns.str.match('^bert')].tolist()
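
Because rows were dropped from `long_df`, its index has gaps, and an index-based join only lines up if the feature frame carries the same index. A minimal sketch with made-up data (the single `f0` column stands in for the embedding columns):

```python
import numpy as np
import pandas as pd

# A frame whose index has a gap, as happens after dropna().
df = pd.DataFrame({"text": ["a", "b", "c"]}, index=[0, 2, 3])

# Stand-in for encoder output: one row of features per remaining text.
features = np.array([[1.0], [2.0], [3.0]])

# A default RangeIndex (0, 1, 2) does not line up with df's index (0, 2, 3).
misaligned = df.join(pd.DataFrame(features, columns=["f0"]))

# Reusing df's index keeps each feature row next to its own text.
aligned = df.join(pd.DataFrame(features, columns=["f0"], index=df.index))

print(misaligned)
print(aligned)
```

In the misaligned frame, text `b` receives the features computed for `c`, and `c` receives `NaN`.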

In [9]:
# setup model architecture
numeric_cols = ['Height', 'Weight'] + bert_columns
categorical_cols = ['Position', 'reporter']
# text_cols = scouting_reports.tolist()
text_cols = []
lr_model = setup_predictor.setup(
    numeric_cols=numeric_cols, 
    categorical_cols=categorical_cols,
    text_cols=text_cols,
    func=LogisticOrdinalRegression()
)
knn_model = setup_predictor.setup(
    numeric_cols=numeric_cols, 
    categorical_cols=categorical_cols,
    text_cols=text_cols,
    func=OrdinalKNeighborsClassifier()
)
rf_model = setup_predictor.setup(
    numeric_cols=numeric_cols, 
    categorical_cols=categorical_cols,
    text_cols=text_cols,
    func=RandomForestOrdinalClassifier()
)

In [10]:
X = long_df[numeric_cols + categorical_cols + text_cols]
y = long_df['Drafted']
groups = long_df['Name']

mean_df = pd.DataFrame(columns=['accuracy', 'f1', 'precision', 'recall'])
std_df = pd.DataFrame(columns=['accuracy', 'f1', 'precision', 'recall'])
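
`groups` holds each row's player name, presumably so that all reports about one player land on the same side of a train/test split. The `train_and_test` helper is not shown here, so that reading is an assumption. A minimal sketch of the idea using scikit-learn's `GroupShuffleSplit` on made-up data:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Made-up data: six reports covering three players (two reports each).
X = np.arange(12).reshape(6, 2)
y = np.array([1, 1, 2, 2, 3, 3])
groups = np.array(["A", "A", "B", "B", "C", "C"])

# test_size=1 (an int) holds out one whole group per split.
splitter = GroupShuffleSplit(n_splits=3, test_size=1, random_state=0)
for train_idx, test_idx in splitter.split(X, y, groups):
    train_players = set(groups[train_idx])
    test_players = set(groups[test_idx])
    # No player contributes reports to both the training and the test set.
    assert train_players.isdisjoint(test_players)
    print(sorted(train_players), sorted(test_players))
```

Without grouping, two reports about the same player could be split across train and test, which would leak the target.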

In [11]:
# logistic regression model
param_grid = {
    'clf__penalty' : ['l1', 'l2'],
    'clf__C' : np.logspace(-4,4,20).tolist(),
    'clf__solver' : ['liblinear'],
}

label = 'BERT_log_reg'

lr_metrics = train_and_test(lr_model, X, y, groups, param_grid, notes=label)

lr_mean = {k : np.mean(v) for k,v in lr_metrics.items()}
lr_std = {k : np.std(v) for k,v in lr_metrics.items()}

mean_df.loc[label] = pd.Series(lr_mean)
std_df.loc[label] = pd.Series(lr_std)


Fitting 3 folds for each of 40 candidates, totalling 120 fits


100%|██████████| 52/52 [00:02<00:00, 19.87it/s]
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Fitting 3 folds for each of 40 candidates, totalling 120 fits


100%|██████████| 52/52 [00:01<00:00, 41.02it/s]
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Fitting 3 folds for each of 40 candidates, totalling 120 fits


100%|██████████| 60/60 [00:38<00:00,  1.54it/s]
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
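
The log line "Fitting 3 folds for each of 40 candidates, totalling 120 fits" follows from the grid size: a grid search tries every combination of the listed values. A quick standard-library check of that arithmetic (the values mirror the logistic regression grid, and the fold count is taken from the log):

```python
from itertools import product

# Same shape as the logistic regression grid: 2 penalties x 20 C values x 1 solver.
param_grid = {
    "clf__penalty": ["l1", "l2"],
    "clf__C": [10 ** (-4 + 8 * i / 19) for i in range(20)],  # like np.logspace(-4, 4, 20)
    "clf__solver": ["liblinear"],
}

# Every combination of values is one candidate.
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]

n_folds = 3  # taken from the log output
print(len(candidates), len(candidates) * n_folds)  # 40 120
```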


In [12]:
# KNN model
param_grid = {
    'clf__n_neighbors' : np.arange(3,8).tolist(),
}

label = 'BERT_KNN'

knn_metrics = train_and_test(knn_model, X, y, groups, param_grid, notes=label)

knn_mean = {k : np.mean(v) for k,v in knn_metrics.items()}
knn_std = {k : np.std(v) for k,v in knn_metrics.items()}

mean_df.loc[label] = pd.Series(knn_mean)
std_df.loc[label] = pd.Series(knn_std)


Fitting 3 folds for each of 5 candidates, totalling 15 fits


100%|██████████| 52/52 [00:00<00:00, 1761.23it/s]
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Fitting 3 folds for each of 5 candidates, totalling 15 fits


100%|██████████| 52/52 [00:00<00:00, 1855.60it/s]
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Fitting 3 folds for each of 5 candidates, totalling 15 fits


100%|██████████| 60/60 [00:00<00:00, 1557.10it/s]
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [13]:
# Random Forest Classification model
param_grid = {
    'clf__n_estimators' : np.arange(60, 110, 20).tolist(),
    'clf__max_depth' : np.arange(20, 100, 20).tolist(),
}

label = 'BERT_rand_forest'

rf_metrics = train_and_test(rf_model, X, y, groups, param_grid, notes=label)

rf_mean = {k : np.mean(v) for k,v in rf_metrics.items()}
rf_std = {k : np.std(v) for k,v in rf_metrics.items()}

mean_df.loc[label] = pd.Series(rf_mean)
std_df.loc[label] = pd.Series(rf_std)


Fitting 3 folds for each of 12 candidates, totalling 36 fits


100%|██████████| 52/52 [00:22<00:00,  2.30it/s]
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Fitting 3 folds for each of 12 candidates, totalling 36 fits


100%|██████████| 52/52 [00:34<00:00,  1.53it/s]
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Fitting 3 folds for each of 12 candidates, totalling 36 fits


100%|██████████| 60/60 [00:33<00:00,  1.79it/s]
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [14]:
mean_df

| | accuracy | f1 | precision | recall |
|---|---|---|---|---|
| NLTK_log_reg | 0.009663 | 0.009663 | 0.012601 | 0.014181 |
| NLTK_KNN | 0.016027 | 0.016027 | 0.015873 | 0.022605 |
| NLTK_rand_forest | 0.012314 | 0.012314 | 0.015237 | 0.015402 |


In [15]:
std_df

| | accuracy | f1 | precision | recall |
|---|---|---|---|---|
| NLTK_log_reg | 0.003956 | 0.003956 | 0.00782 | 0.001907 |
| NLTK_KNN | 0.001984 | 0.001984 | 0.002654 | 0.006797 |
| NLTK_rand_forest | 0.002884 | 0.002884 | 0.000983 | 0.006633 |
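
The `accuracy` and `f1` columns are identical in both tables. One possible explanation is micro-averaged F1, which always equals accuracy when each sample has exactly one label. This is an assumption, since the metric code in `train_and_test` is not shown. A pure-Python sketch on made-up draft positions:

```python
def accuracy(y_true, y_pred):
    # Fraction of exact matches.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    # Pool true positives, false positives and false negatives over all classes.
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            tp += (t == label and p == label)
            fp += (t != label and p == label)
            fn += (t == label and p != label)
    return 2 * tp / (2 * tp + fp + fn)


# Made-up draft positions, purely for illustration.
y_true = [1, 2, 3, 4, 5, 6]
y_pred = [1, 3, 3, 5, 5, 2]

print(accuracy(y_true, y_pred), micro_f1(y_true, y_pred))
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why the two numbers coincide.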


In [16]:
# note: pandas 1.5 renamed `line_terminator` to `lineterminator`
with open('artifacts/mean_df.csv', 'a') as append_file:
    mean_df.to_csv(append_file, index=True, lineterminator='\n')

with open('artifacts/std_df.csv', 'a') as append_file:
    std_df.to_csv(append_file, index=True, lineterminator='\n')
