## NLP Draft Predictions

This notebook details our initial attempt at two tasks: the supervised task of predicting NHL player draft positions, and the unsupervised task of clustering NHL players by similarity. The main methodology uses NLP word vectors extracted from 2014-2022 NHL scouting reports published by various public sports news outlets.

Members:
- Quoc-Huy Nguyen
- Ryan DeSalvio



In [1]:
import os
import re
import json
import numpy as np
import pandas as pd

from collections import defaultdict

import matplotlib.cm as cm
import matplotlib.pyplot as plt

from sklearn.svm import SVC
from sklearn import preprocessing
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.model_selection import GroupShuffleSplit
from sklearn.model_selection import StratifiedKFold, GridSearchCV
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

import clean_reports
import preprocess_reports
import setup_predictor
from model import (
    LogisticOrdinalRegression,
    OrdinalKNeighborsClassifier,
    RandomForestOrdinalClassifier,
)
from train_test_predictor import train_and_test

nltk.download("punkt")
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [2]:
# dataset location
DATASET = "data/prospect-data.csv"

# load dataset into dataframe
data = clean_reports.clean(DATASET, raw=True)

data.head()

Unnamed: 0,Year,Position,Height,Weight,Drafted,Team,Average Ranking,Name,Description - Corey Pronman,Description - Scott Wheeler,Description - Smaht Scouting,Description - ESPN (Chris Peters),Description - EP Rinkside,Description - EP Rinkside Part 2,Description - The Painted Lines,Description - FCHockey
0,2023,C,69.75,185.0,,,1.0,Connor Bedard,Bedard is a potential franchise-changing No. 1...,Bedard’s statistical profile speaks for itself...,Connor Bedard is an extremely gifted generatio...,One of the most naturally gifted goal scorers ...,,Connor Bedard is the premier prospect in the w...,,
1,2023,C,74.0,187.0,,,2.0,Adam Fantilli,There's so much to love about Fantilli's NHL p...,"Fantilli is a big, strong, powerful center who...",Adam Fantilli has every tool that an NHL team ...,"A 6-foot-2, 200-pound power center with touch,...",,"A fantastic consolation prize, Adam Fantilli w...",,
2,2023,RW,70.0,148.0,,,3.0,Matvei Michkov,Michkov is one of the very best first-year dra...,Michkov is the best Russian prospect since Ale...,"A smart, dynamic goal-scoring winger, Michkov ...","For the last few years, I’ve described Michkov...",,"Statistically, Matvei Michkov is *another* fir...",,
3,2023,C,75.0,194.0,,,4.0,Leo Carlsson,"Carlsson has elite skill, which when combined ...",Though he doesn’t play the game with some of t...,Carlsson has been played extremely well at the...,The buzz is growing (and rightfully so) that C...,,"Oh, look, another first-overall talent. Leo Ca...",,
4,2023,LW,69.75,170.0,,,5.0,Zach Benson,Benson has a ton of creativity and offense in ...,"There were a lot of nights last season, on an ...",While I don’t necessarily see Zach Benson reac...,"An offensive dynamo with deft scoring touch, B...",,Some people are worried about selecting a 5-9 ...,,


In [3]:
data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 402 entries, 0 to 401
Data columns (total 16 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Year                               402 non-null    int64  
 1   Position                           402 non-null    object 
 2   Height                             402 non-null    float64
 3   Weight                             402 non-null    float64
 4   Drafted                            360 non-null    float64
 5   Team                               360 non-null    object 
 6   Average Ranking                    162 non-null    float64
 7   Name                               402 non-null    object 
 8   Description - Corey Pronman        389 non-null    object 
 9   Description - Scott Wheeler        213 non-null    object 
 10  Description - Smaht Scouting       149 non-null    object 
 11  Description - ESPN (Chris Peters)  229 non-null    object 

## Data Cleaning

In [4]:
# clean up dataset
# drop Seattle ('SEA') picks for now; this choice may need revisiting later,
# but for clustering it should not matter
data = data[data['Team'] != 'SEA']

# try with only forwards
# data = data[
#     (data['Position'] == 'C') | 
#     (data['Position'] == 'LW') | 
#     (data['Position'] == 'RW')
# ]

# keep only draft classes up to 2022 (predict this year's class at a later time)
data = data[data['Year'] <= 2022]

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 355 entries, 42 to 401
Data columns (total 16 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Year                               355 non-null    int64  
 1   Position                           355 non-null    object 
 2   Height                             355 non-null    float64
 3   Weight                             355 non-null    float64
 4   Drafted                            355 non-null    float64
 5   Team                               355 non-null    object 
 6   Average Ranking                    115 non-null    float64
 7   Name                               355 non-null    object 
 8   Description - Corey Pronman        342 non-null    object 
 9   Description - Scott Wheeler        166 non-null    object 
 10  Description - Smaht Scouting       104 non-null    object 
 11  Description - ESPN (Chris Peters)  182 non-null    object

In [5]:
# domain-specific stopwords: leagues, countries, teams, and position words
HOCKEY_WORDS = ["usntdp", "ntdp", "development", "program",
                "khl", "shl", "ushl", "ncaa", "ohl", "chl", "whl", "qmjhl",
                "sweden", "russia", "usa", "canada", "ojhl", "finland", 
                "finnish", "swedish", "russian", "american", "wisconsin",
                "michigan", "bc", "boston", "london", "bchl", "kelowna",
                "liiga", 
                "portland", "minnesota", "ska", "frolunda", "sjhl", "college",
                "center", "left", "right", "saginaw", "slovakia"]

# scouting report columns
mask = data.columns.str.match('Description')
scouting_reports = data.columns[mask]

# preprocess data with NLTK
preprocessed_df = data.copy()
for report in scouting_reports:
    report_preprocessor = preprocess_reports.NltkPreprocessor(data[report])
    preprocessed_df.loc[:,report] = report_preprocessor\
        .remove_names(data['Name'])\
        .remove_whitespace()\
        .tokenize_text()\
        .remove_stopwords(HOCKEY_WORDS)\
        .normalize_words(normalization='porter')\
        .get_text()
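The `preprocess_reports.NltkPreprocessor` class is not shown in this notebook, so the following is only a minimal, standard-library sketch of what a chain like the one above might do to a single report: strip the player's name, tokenize, and drop stopwords. The `preprocess` function and the tiny `BASE_STOPWORDS` set are hypothetical stand-ins for the NLTK-based steps, and stemming is deliberately left out.

```python
import re

# Hypothetical, tiny stopword list; the notebook relies on NLTK's English
# stopwords plus HOCKEY_WORDS instead.
BASE_STOPWORDS = {"a", "an", "the", "is", "and", "in", "of", "who", "for", "with"}


def preprocess(text, player_name, extra_stopwords=()):
    """Lowercase, strip the player's name, tokenize, and drop stopwords."""
    text = text.lower()
    # remove each part of the player's name so the text does not reveal identity
    for part in player_name.lower().split():
        text = re.sub(rf"\b{re.escape(part)}\b", " ", text)
    # tokenize on runs of letters; this also discards extra whitespace and punctuation
    tokens = re.findall(r"[a-z]+", text)
    stop = BASE_STOPWORDS | set(extra_stopwords)
    return [tok for tok in tokens if tok not in stop]


tokens = preprocess(
    "Bedard is   a dynamic scorer for the WHL.",
    player_name="Connor Bedard",
    extra_stopwords=["whl"],
)
print(tokens)  # ['dynamic', 'scorer']
```

Removing names and league or team words is meant to keep the model focused on descriptions of play style rather than on who the player is or where he played.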


In [6]:
# transform from wide to long data frame
#   and drop rows with missing text
long_df = preprocessed_df.melt(
    id_vars=['Year', 'Position', 'Height', 'Weight', 'Drafted', 
             'Team', 'Average Ranking', 'Name'],
    value_vars=scouting_reports.tolist(),
    var_name='reporter',  
    value_name='text'
).dropna(
    subset=['text']
)

# subset_long_df = long_df.sample(frac=0.2)
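To make the wide-to-long reshaping concrete, here is a small standard-library sketch of the same idea on made-up rows. The `melt_reports` helper, the player names, and the report columns are all hypothetical; the notebook itself uses `DataFrame.melt` followed by `dropna`.

```python
def melt_reports(rows, id_keys, report_keys):
    """Turn one row per player into one row per (player, reporter) pair."""
    long_rows = []
    for key in report_keys:          # stack the report columns one after another
        for row in rows:
            text = row.get(key)
            if text is None:         # plays the role of dropna(subset=['text'])
                continue
            record = {k: row[k] for k in id_keys}
            record["reporter"] = key
            record["text"] = text
            long_rows.append(record)
    return long_rows


wide = [
    {"Name": "Player A", "Drafted": 1.0,
     "Description - X": "fast skater", "Description - Y": None},
    {"Name": "Player B", "Drafted": 2.0,
     "Description - X": "big body", "Description - Y": "heavy shot"},
]

long_rows = melt_reports(wide, ["Name", "Drafted"],
                         ["Description - X", "Description - Y"])
for record in long_rows:
    print(record)
```

Each player now contributes one row per available report, which is why the same `Name` can appear several times in the long frame.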

In [7]:
# setup model architecture
numeric_cols = ['Height', 'Weight']
categorical_cols = ['Position', 'reporter']
text_cols = ['text']
lr_model = setup_predictor.setup(
    numeric_cols=numeric_cols, 
    categorical_cols=categorical_cols,
    text_cols=text_cols,
    func=LogisticOrdinalRegression()
)
knn_model = setup_predictor.setup(
    numeric_cols=numeric_cols, 
    categorical_cols=categorical_cols,
    text_cols=text_cols,
    func=OrdinalKNeighborsClassifier()
)
rf_model = setup_predictor.setup(
    numeric_cols=numeric_cols, 
    categorical_cols=categorical_cols,
    text_cols=text_cols,
    func=RandomForestOrdinalClassifier()
)
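The ordinal wrappers come from the project's own `model` module, which is not shown here. One common way to build such classifiers is to split a K-class ordinal problem into K-1 binary questions of the form "is the label greater than k?" and then recombine the binary probabilities. The sketch below illustrates that decomposition under this assumption; `binary_targets` and `ordinal_class_probabilities` are hypothetical helpers, not the authors' implementation.

```python
def binary_targets(y, thresholds):
    """One binary target vector per threshold: is the label greater than it?"""
    return [[int(label > t) for label in y] for t in thresholds]


def ordinal_class_probabilities(p_greater):
    """Combine P(y > k) for k = 1..K-1 into P(y = k) for k = 1..K."""
    probs = [1.0 - p_greater[0]]                       # P(y = 1) = 1 - P(y > 1)
    for k in range(1, len(p_greater)):
        probs.append(p_greater[k - 1] - p_greater[k])  # P(y = k+1) = P(y > k) - P(y > k+1)
    probs.append(p_greater[-1])                        # P(y = K) = P(y > K-1)
    return probs


# four ordered classes (1..4) give three binary problems
targets = binary_targets([1, 3, 2, 4], thresholds=[1, 2, 3])
print(targets)  # [[0, 1, 1, 1], [0, 1, 0, 1], [0, 0, 0, 1]]

# made-up outputs of the three binary classifiers for one sample
probs = ordinal_class_probabilities([0.9, 0.6, 0.2])
predicted_class = probs.index(max(probs)) + 1
print(predicted_class)  # 3
```

Compared with plain multiclass classification, this keeps the ordering of draft positions visible to the model, since every binary problem is defined relative to a threshold.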

In [8]:
X = long_df[numeric_cols + categorical_cols + text_cols]
y = long_df['Drafted']
groups = long_df['Name']

mean_df = pd.DataFrame(columns=['accuracy', 'f1', 'precision', 'recall'])
std_df = pd.DataFrame(columns=['accuracy', 'f1', 'precision', 'recall'])
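Because the long data frame holds several rows per player, passing `groups = long_df['Name']` lets the evaluation keep all reports about one player on the same side of a split. The internals of `train_and_test` are not shown, so the following is only a standard-library sketch of a group-aware shuffle split; `group_shuffle_split` and the toy group labels are hypothetical.

```python
import random


def group_shuffle_split(groups, test_fraction=0.2, seed=0):
    """Split row indices so that no group appears in both train and test."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# toy example: eight report rows written about five players
groups = ["A", "A", "B", "C", "C", "C", "D", "E"]
train_idx, test_idx = group_shuffle_split(groups, test_fraction=0.4)

train_players = {groups[i] for i in train_idx}
test_players = {groups[i] for i in test_idx}
print(train_players & test_players)  # set()
```

Without grouping, a report about a player could land in the training set while another report about the same player sits in the test set, which would leak information about the target.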

In [10]:
# logistic regression model
param_grid = {
    'clf__penalty' : ['l1', 'l2'],
    'clf__C' : np.logspace(-4,4,20).tolist(),
    'clf__solver' : ['liblinear'],
}

label = 'NLTK_log_reg'

lr_metrics = train_and_test(lr_model, X, y, groups, param_grid, notes=label)

lr_mean = {k : np.mean(v) for k,v in lr_metrics.items()}
lr_std = {k : np.std(v) for k,v in lr_metrics.items()}

mean_df.loc[label] = pd.Series(lr_mean)
std_df.loc[label] = pd.Series(lr_std)


Fitting 3 folds for each of 40 candidates, totalling 120 fits

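The fit count reported for the logistic regression search follows directly from the size of the parameter grid. The sketch below enumerates the grid with the standard library and confirms the arithmetic; `logspace` is a hypothetical pure-Python stand-in for `np.logspace`, and the three folds are taken from the search output.

```python
from itertools import product


def logspace(start, stop, num):
    """Pure-Python stand-in for np.logspace(start, stop, num) with base 10."""
    step = (stop - start) / (num - 1)
    return [10 ** (start + i * step) for i in range(num)]


param_grid = {
    "clf__penalty": ["l1", "l2"],
    "clf__C": logspace(-4, 4, 20),
    "clf__solver": ["liblinear"],
}

# every combination of one value per parameter is one candidate
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
n_folds = 3

print(len(candidates), len(candidates) * n_folds)  # 40 120
```

Two penalties times twenty values of `C` times one solver gives 40 candidates, and each candidate is fitted once per fold.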
In [11]:
# KNN model
param_grid = {
    'clf__n_neighbors' : np.arange(3,8).tolist(),
}

label = 'NLTK_KNN'

knn_metrics = train_and_test(knn_model, X, y, groups, param_grid, notes=label)

knn_mean = {k : np.mean(v) for k,v in knn_metrics.items()}
knn_std = {k : np.std(v) for k,v in knn_metrics.items()}

mean_df.loc[label] = pd.Series(knn_mean)
std_df.loc[label] = pd.Series(knn_std)


Fitting 3 folds for each of 5 candidates, totalling 15 fits

In [12]:
# Random Forest Classification model
param_grid = {
    'clf__n_estimators' : np.arange(40, 120, 20).tolist(),
    'clf__max_depth' : np.arange(20, 100, 10).tolist(),
}

label = 'NLTK_rand_forest'

rf_metrics = train_and_test(rf_model, X, y, groups, param_grid, notes=label)

rf_mean = {k : np.mean(v) for k,v in rf_metrics.items()}
rf_std = {k : np.std(v) for k,v in rf_metrics.items()}

mean_df.loc[label] = pd.Series(rf_mean)
std_df.loc[label] = pd.Series(rf_std)


Fitting 3 folds for each of 32 candidates, totalling 96 fits

In [13]:
mean_df

| model | accuracy | f1 | precision | recall |
|---|---|---|---|---|
| NLTK_log_reg | 0.015824 | 0.015824 | 0.018799 | 0.018233 |
| NLTK_KNN | 0.011425 | 0.011425 | 0.011664 | 0.016442 |
| NLTK_rand_forest | 0.014825 | 0.014825 | 0.018299 | 0.018231 |
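The accuracy and f1 columns are identical for every model. One possible explanation, which is an assumption because the scoring code inside `train_and_test` is not shown, is that F1 is micro-averaged: for single-label multiclass predictions, micro-averaged precision, recall, and F1 all equal accuracy. The standard-library sketch below checks this on made-up labels; `accuracy` and `micro_f1` are hypothetical helpers.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """F1 computed from true/false positives and negatives pooled over all labels."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            tp += int(t == label and p == label)
            fp += int(t != label and p == label)
            fn += int(t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# made-up draft positions and predictions
y_true = [1, 2, 3, 4, 5]
y_pred = [1, 3, 3, 2, 5]

acc = accuracy(y_true, y_pred)
f1 = micro_f1(y_true, y_pred)
print(abs(acc - f1) < 1e-12)  # True
```

Every wrong prediction counts once as a false positive for the predicted label and once as a false negative for the true label, so the pooled precision and recall share the same denominator as accuracy.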


In [14]:
std_df

| model | accuracy | f1 | precision | recall |
|---|---|---|---|---|
| NLTK_log_reg | 0.001717 | 0.001717 | 0.001952 | 0.002227 |
| NLTK_KNN | 0.00142 | 0.00142 | 0.000337 | 0.007151 |
| NLTK_rand_forest | 0.001789 | 0.001789 | 0.002475 | 0.003977 |
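Both tables are built by collapsing the per-split metric lists returned by `train_and_test` with `np.mean` and `np.std`. Note that `np.std` defaults to the population standard deviation (`ddof=0`). The sketch below reproduces that aggregation with the standard library on made-up scores; the `fold_metrics` values are hypothetical and are not results from this notebook.

```python
from statistics import mean, pstdev

# hypothetical per-split scores, one list per metric
fold_metrics = {
    "accuracy": [0.010, 0.020, 0.030],
    "recall": [0.020, 0.020, 0.020],
}

# pstdev is the population standard deviation, matching np.std's default ddof=0
summary_mean = {name: mean(scores) for name, scores in fold_metrics.items()}
summary_std = {name: pstdev(scores) for name, scores in fold_metrics.items()}

print(summary_mean)
print(summary_std)
```

With only a handful of splits, the population and sample standard deviations can differ noticeably, so it helps to state which one a table reports.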


In [15]:
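# append this run's metrics to the artifact files; the header row is
# written again on every append
# note: pandas 1.5 renamed `line_terminator` to `lineterminator`, and
# pandas 2.0 removed the old spelling
with open('artifacts/mean_df.csv', 'a') as append_file:
    mean_df.to_csv(append_file, index=True, line_terminator='\n')

with open('artifacts/std_df.csv', 'a') as append_file:
    std_df.to_csv(append_file, index=True, line_terminator='\n')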
