Since I'm going to reference a single dataframe a lot, it makes sense to build it once instead of repeatedly querying the PostgreSQL server. This notebook also preprocesses the dataframe by renaming the goalie_minutes column to goalie_seconds, replacing bad position labels with blanks, and replacing alternate position names with the generic A/M/D/G.

In [1]:
import sys
sys.path.insert(0, '../..')

import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import helpers.laxdb as laxdb
LaxDB = laxdb.LaxDB

SELECT data from the PostgreSQL server

In [2]:
box = """
SELECT * from ncaa.box_scores;
"""
boxdata = LaxDB().query(box)

teams = """
SELECT id, name from ncaa.teams;
"""
teamdata = LaxDB().query(teams)

No need to leave the connection open after the data is retrieved

In [3]:
LaxDB().close()
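
The internals of the LaxDB helper aren't shown here, so the following is only a sketch of the same "query twice, then close" pattern with a stand-in. It assumes an in-memory SQLite database and toy tables (`teams`, `box_scores`, and the `demo_*` names are illustrative, not part of the real schema), and it runs both queries through one connection that is closed automatically.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# Stand-in for the real server: an in-memory SQLite database with toy tables.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE teams (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE box_scores (id INTEGER, team_id INTEGER, position TEXT)")
    conn.execute("INSERT INTO teams VALUES (2, 'Binghamton')")
    conn.executemany(
        "INSERT INTO box_scores VALUES (?, ?, ?)",
        [(4712, 2, "D"), (4715, 2, "M")],
    )
    # Both queries share the one connection, which is closed when the block exits.
    demo_teams = pd.read_sql_query("SELECT id, name FROM teams", conn)
    demo_box = pd.read_sql_query("SELECT * FROM box_scores", conn)

print(len(demo_teams), len(demo_box))
```

Holding a single connection object makes it explicit which connection is being closed once the dataframes are in memory.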

Merge the two datasets on the team ID (id in teams, team_id in box_scores).

In [4]:
boxscore_df = teamdata.merge(boxdata, left_on='id', right_on='team_id')
boxscore_df.head()

Unnamed: 0,id_x,name,id_y,game_id,team_id,player_id,position,player_name,goals,assists,...,caused_turnovers,faceoffs_won,faceoffs_taken,penalties,penalty_time,goalie_minutes,goals_allowed,goalie_saves,created_at,updated_at
0,2,Binghamton,4712,1,2,41.0,D,Chris Bechle,0,0,...,0,0,0,1,30,0,0,0,2023-10-24 16:37:17.319077,2023-10-24 16:37:17.319077
1,2,Binghamton,4713,1,2,48.0,D,George Diegnan,0,0,...,0,1,2,0,0,0,0,0,2023-10-24 16:37:17.319077,2023-10-24 16:37:17.319077
2,2,Binghamton,4714,1,2,51.0,D,Sean Finnigan,0,0,...,0,0,0,0,0,0,0,0,2023-10-24 16:37:17.319077,2023-10-24 16:37:17.319077
3,2,Binghamton,4715,1,2,57.0,M,Matt Kaser,0,1,...,0,0,0,0,0,0,0,0,2023-10-24 16:37:17.319077,2023-10-24 16:37:17.319077
4,2,Binghamton,4716,1,2,63.0,M,Anthony Lombardo,0,0,...,0,0,0,0,0,0,0,0,2023-10-24 16:37:17.319077,2023-10-24 16:37:17.319077
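
The id_x and id_y columns in the output above come from pandas' default suffixes, because both tables have an id column. As a hedged sketch on toy data (the frames and the "Example U" team are made up for illustration), `merge` can name those columns explicitly, check that each team id appears only once on the left, and flag rows that found no match:

```python
import pandas as pd

# Toy stand-ins for teamdata and boxdata.
teams_demo = pd.DataFrame({"id": [2, 3], "name": ["Binghamton", "Example U"]})
box_demo = pd.DataFrame(
    {"id": [4712, 4713, 9999], "team_id": [2, 2, 7], "position": ["D", "D", "M"]}
)

merged_demo = teams_demo.merge(
    box_demo,
    left_on="id",
    right_on="team_id",
    how="outer",                  # keep unmatched rows so they can be inspected
    suffixes=("_team", "_box"),   # instead of the default _x / _y
    validate="one_to_many",       # raises if a team id is duplicated on the left
    indicator=True,               # adds a _merge column: both / left_only / right_only
)
print(merged_demo["_merge"].value_counts())
```

The notebook's merge uses the default inner join, so box score rows whose team_id has no match in teams are dropped silently; the `_merge` column is one way to see whether that happens.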


Change column name 'goalie_minutes' to 'goalie_seconds'

In [5]:
boxscore_df.rename(columns={'goalie_minutes':'goalie_seconds'}, inplace=True)

What are all of the unique positions in this dataframe?

In [6]:
boxscore_df['position'].unique()

array(['D', 'M', 'A', 'G', '', 'GK', 'LSM', 'FO', 'MF', 'MFO', 'O', 'DM',
       'S', 'LS', '6', 'AM', 'F', 'L', 'FOS', 'NA', 'N', 'AS', 'MA', 'DL',
       'DK', 'FS', 'CD', 'ATT', 'MID', 'DEF', 'FOM', 'FOGO', 'SSDM', None,
       'FA', 'MD', 'RW', 'DB', 'C', 'GOAL', 'LM', 'SR', 'AMF', 'SS', 'F0',
       '45', 'LSMF', 'MFA', 'AT', 'B', 'ATK', 'DLSM', 'DLMS', 'LPM', '9',
       '3', 'DLS', '0', 'LMS', '16', 'FW', 'AA', '35', '4', 'LSD', 'AD',
       'FOR', 'LDM', 'LP', 'AQ', 'DMF', '8', 'GF', 'LSMD', 'MG', 'FM',
       'MIC', 'W', '27', 'PCS', 'GT', '1', 'DDM', 'SLM', 'D08', '5', 'MB',
       '2', 'ATMD', 'GO', 'MIDF', 'Q', 'MK', 'CK', 'DG', 'K', 'IH', 'DST',
       'DFO', 'MIG', 'ATM', 'NF', 'SSD', 'ID', '31', 'AFO', 'SSM', 'MDM',
       'FK', 'FSO'], dtype=object)

In [12]:
count_df = boxscore_df.groupby('position')['name'].count()
count_df

position
        931109
0            1
1            1
16           1
2            1
         ...  
SS           2
SSD          2
SSDM        19
SSM          2
W            2
Name: name, Length: 109, dtype: int64
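
One thing to keep in mind with the counts above: `groupby` drops missing keys by default, so rows whose position is None are not counted. A small sketch on toy data (the frame below is invented for illustration) shows the difference from `value_counts(dropna=False)`:

```python
import pandas as pd

# Toy frame with a blank, a None, and a numeric-looking label.
demo = pd.DataFrame(
    {
        "name": ["T1", "T1", "T2", "T2", "T3", "T3"],
        "position": ["D", "M", "", None, "27", "M"],
    }
)

# groupby skips the None row, so only four labels are reported.
grouped_counts = demo.groupby("position")["name"].count()

# value_counts with dropna=False keeps the missing label as its own entry.
all_counts = demo["position"].value_counts(dropna=False)

print(len(grouped_counts), len(all_counts))
```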

In [13]:
count_df.to_csv('count_positions_df')

That's interesting. Some of the letters and letter groups make sense, but there are also numbers. Let's take a look at '27'.

In [8]:
twensev = np.where(boxscore_df['position'] == '27')
twensev

(array([793184], dtype=int64),)

In [9]:
boxscore_df.iloc[793184]

id_x                                      3318
name                          Westminster (UT)
id_y                                   1222376
game_id                                  26505
team_id                                   3318
player_id                             124438.0
position                                    27
player_name                        Jacob Parks
goals                                        0
assists                                      0
points                                       0
shots                                        2
shots_on_goal                                0
man_up_goals                                 0
man_down_goals                               0
ground_balls                                 1
turnovers                                    1
caused_turnovers                             0
faceoffs_won                                 0
faceoffs_taken                               0
penalties                                    0
penalty_time 
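
The lookup above works because np.where returns positional indices, which is what iloc expects. An alternative sketch, using a toy frame with made-up player names and a non-default index, selects the same kind of row in one step with a boolean mask:

```python
import pandas as pd

# Toy frame; the index is deliberately not 0..n-1.
demo = pd.DataFrame(
    {"player_name": ["Player A", "Player B", "Player C"], "position": ["D", "27", "M"]},
    index=[10, 20, 30],
)

# The boolean mask selects matching rows directly, whatever the index looks like.
odd_rows = demo.loc[demo["position"] == "27"]
print(odd_rows)
```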

No position is numbered 27, and several of the other labels don't make sense either, so it's time to replace every label that doesn't make sense with a blank.

In [10]:
# Assign the result back to the column. Calling replace(..., inplace=True) on
# boxscore_df['position'] is chained assignment: recent pandas 2.x releases emit a
# FutureWarning for it, and it will no longer modify boxscore_df in pandas 3.0.
boxscore_df['position'] = boxscore_df['position'].replace(
    {"''":"", "O":"", "S":"", "6":"", "AM":"", "F":"", "L":"", "NA":"", "N":"", "AS":"",
     "MA":"", "FS":"", "CD":"", None:"", "C":"", "SR":"", "F0":"", "45":"", "B":"",
     "9":"", "3":"", "0":"", "16":"", "35":"", "4":"", "AD":"", "FOR":"", "8":"", "GF":"",
     "MG":"", "MIC":"", "27":"", "PCS":"", "1":"", "D08":"", "5":"", "2":"", "Q":"", "MK":"",
     "CK":"", "DG":"", "K":"", "IH":"", "DST":"", "DFO":"", "MIG":"", "NF":"", "ID":"",
     "31":"", "FK":"", "FSO":""}
)


This classifier is intended to predict the four main positions: Attacker (A), Midfielder (M), Defender (D), and Goalie (G). Each has sub-positions; for example, Attackers can be called 'X' or wings. Let's clean up the remaining labels by replacing what the scorer entered with A, M, D, or G.

In [11]:
# Assign the result back rather than using inplace=True on the selected column,
# which is chained assignment and will stop working in pandas 3.0.
boxscore_df['position'] = boxscore_df['position'].replace(
    {'ATT': 'A', 'RW': 'A', 'AT': 'A', 'ATK': 'A', 'FW': 'A', 'AA': 'A', 'AQ': 'A', 'W': 'A',
     'ATMD': 'A', 'ATM': 'A', 'AFO': 'A', 'DL': 'D', 'DK': 'D', 'DEF': 'D', 'DB': 'D', 'DLS': 'D',
     'LSD': 'D', 'LP': 'D', 'DDM': 'D', 'GK': 'G', 'GOAL': 'G', 'GT': 'G', 'LSM': 'M', 'FO': 'M', 'MF': 'M',
     'MFO': 'M', 'DM': 'M', 'LS': 'M', 'FOS': 'M', 'MID': 'M', 'FOM': 'M', 'FOGO': 'M', 'SSDM': 'M',
     'FA': 'M', 'MD': 'M', 'LM': 'M', 'AMF': 'M', 'SS': 'M', 'LSMF': 'M', 'MFA': 'M', 'DLSM': 'M',
     'DLMS': 'M', 'LPM': 'M', 'LMS': 'M', 'LDM': 'M', 'DMF': 'M', 'LSMD': 'M', 'FM': 'M', 'SLM': 'M',
     'MB': 'M', 'GO': 'M', 'MIDF': 'M', 'SSD': 'M', 'SSM': 'M', 'MDM': 'M'}
)
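
The two replacement steps can also be expressed as "map known aliases, blank everything else". The sketch below is one possible alternative, not the approach used in this notebook: it uses a small, illustrative subset of the aliases (the `canonical_groups` name and its contents are assumptions for the example) and a whitelist, so any label that nobody anticipated falls through to a blank instead of needing its own dictionary entry.

```python
import pandas as pd

# Illustrative subset of aliases, grouped by the canonical position they map to.
canonical_groups = {
    "A": ["ATT", "ATK", "RW"],
    "D": ["DEF", "DL"],
    "G": ["GK", "GOAL"],
    "M": ["MID", "LSM", "FOGO"],
}
# Invert the groups into a flat alias -> canonical lookup.
alias_to_canonical = {
    alias: canon for canon, aliases in canonical_groups.items() for alias in aliases
}

positions_demo = pd.Series(["ATT", "GK", "LSM", "D", "27", None], name="position")

# Step 1: map known aliases. Step 2: blank anything that is still not A/M/D/G.
cleaned = positions_demo.replace(alias_to_canonical)
cleaned = cleaned.where(cleaned.isin(["A", "M", "D", "G"]), "")
print(cleaned.tolist())
```

The trade-off is that a whitelist silently blanks a valid alias missing from the lookup, so it is worth reviewing the raw label counts before relying on it.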


Confirming that the only positions are the four main categories and blanks.

In [12]:
pos = boxscore_df['position'].unique()
pos

array(['D', 'M', 'A', 'G', ''], dtype=object)

Save the boxscore dataframe to CSV.

In [13]:
boxscore_df.to_csv('boxscore_df')

Load the CSV to confirm it's the same.

In [14]:
df = pd.read_csv('boxscore_df')
df.head()

Unnamed: 0.1,Unnamed: 0,id_x,name,id_y,game_id,team_id,player_id,position,player_name,goals,...,caused_turnovers,faceoffs_won,faceoffs_taken,penalties,penalty_time,goalie_seconds,goals_allowed,goalie_saves,created_at,updated_at
0,0,2,Binghamton,4712,1,2,41.0,D,Chris Bechle,0,...,0,0,0,1,30,0,0,0,2023-10-24 16:37:17.319077,2023-10-24 16:37:17.319077
1,1,2,Binghamton,4713,1,2,48.0,D,George Diegnan,0,...,0,1,2,0,0,0,0,0,2023-10-24 16:37:17.319077,2023-10-24 16:37:17.319077
2,2,2,Binghamton,4714,1,2,51.0,D,Sean Finnigan,0,...,0,0,0,0,0,0,0,0,2023-10-24 16:37:17.319077,2023-10-24 16:37:17.319077
3,3,2,Binghamton,4715,1,2,57.0,M,Matt Kaser,0,...,0,0,0,0,0,0,0,0,2023-10-24 16:37:17.319077,2023-10-24 16:37:17.319077
4,4,2,Binghamton,4716,1,2,63.0,M,Anthony Lombardo,0,...,0,0,0,0,0,0,0,0,2023-10-24 16:37:17.319077,2023-10-24 16:37:17.319077


All looks good, apart from the extra Unnamed: 0 column, which is the saved index read back in as data.
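
Two details of the CSV round trip are worth checking more strictly than by eye. `to_csv` writes the index by default, which is where the extra Unnamed: 0 column in the reloaded frame comes from, and `read_csv` reads empty strings back as NaN, so the blank positions do not survive unchanged. A minimal sketch on toy data, using an in-memory buffer instead of a file:

```python
import io

import pandas as pd

demo = pd.DataFrame(
    {"name": ["Binghamton", "Binghamton"], "position": ["D", ""], "goals": [0, 1]}
)

buffer = io.StringIO()
demo.to_csv(buffer, index=False)   # index=False avoids the extra "Unnamed: 0" column
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# Empty strings come back as NaN, so restore the blanks before comparing.
reloaded["position"] = reloaded["position"].fillna("")

# Raises an AssertionError if the two frames differ.
pd.testing.assert_frame_equal(demo, reloaded)
```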