CSC-5610-301: AI Tools and Paradigms
Project #6: Baseball Hall of Fame Prediction
Data Cleaning
12.7.2024
Benjamin F. Shaske

Import all relevant libraries for data cleaning.

In [1]:
import kagglehub
import pandas as pd
import os



Download the latest version of the dataset from Kaggle Hub and list its .csv files.

In [None]:
# Download latest version
path = kagglehub.dataset_download("open-source-sports/baseball-databank")

print("Path to dataset files:", path)

csv_files = [f for f in os.listdir(path) if f.endswith('.csv')]


Path to dataset files: C:\Users\bshaske\.cache\kagglehub\datasets\open-source-sports\baseball-databank\versions\2
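
The cell above builds `csv_files` but does not use it yet. As a rough sketch, the list could drive a loop that reads every table into a dictionary keyed by file name. The helper name `load_csv_tables` and the temporary demo folder are assumptions made for illustration; they are not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def load_csv_tables(directory):
    """Read every .csv file in `directory` into a dict keyed by file stem."""
    tables = {}
    for file_name in sorted(os.listdir(directory)):
        if file_name.endswith(".csv"):
            table_name = os.path.splitext(file_name)[0]
            tables[table_name] = pd.read_csv(os.path.join(directory, file_name))
    return tables


# Demonstration with made-up data in a temporary folder, so the sketch
# runs without downloading anything from Kaggle.
with tempfile.TemporaryDirectory() as demo_dir:
    pd.DataFrame({"playerID": ["demo01"], "yearID": [2000]}).to_csv(
        os.path.join(demo_dir, "Demo.csv"), index=False
    )
    demo_tables = load_csv_tables(demo_dir)

print(sorted(demo_tables))
```

In the notebook itself, the call would presumably be `load_csv_tables(path)`, after which a table such as Batting is available as `tables["Batting"]`.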


About Dataset
Baseball Databank is a compilation of historical baseball data in a
convenient, tidy format, distributed under Open Data terms.

This version of the Baseball databank was downloaded from Sean Lahman's website.

Note that as of v1, this dataset is missing a few tables because of a restriction on the number of individual files that can be added. This is in the process of being fixed. The missing tables are Parks, HomeGames, CollegePlaying, Schools, Appearances, and FieldingPost.

The Data
The design follows these general principles. Each player is assigned a
unique number (playerID). All of the information relating to that player
is tagged with his playerID. The playerIDs are linked to names and
birthdates in the MASTER table.
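
Because every table is keyed by playerID, per-player statistics can be joined to the MASTER table with an ordinary merge. The sketch below uses made-up rows; the name columns `nameFirst` and `nameLast` are assumed here and should be checked against the actual Master.csv header.

```python
import pandas as pd

# Made-up rows for illustration only; these are not real players.
master = pd.DataFrame({
    "playerID": ["demo01", "demo02"],
    "nameFirst": ["Alex", "Sam"],      # assumed column name
    "nameLast": ["Example", "Sample"],  # assumed column name
})
batting = pd.DataFrame({
    "playerID": ["demo01", "demo01", "demo02"],
    "yearID": [2000, 2001, 2000],
    "H": [150, 160, 90],
})

# Collapse the season rows to one row per player, then attach the names.
career_hits = batting.groupby("playerID", as_index=False)["H"].sum()
career = career_hits.merge(
    master, on="playerID", how="left", validate="one_to_one"
)
print(career)
```

The `validate="one_to_one"` argument makes pandas raise an error if playerID turns out not to be unique on either side, which is a cheap check on the "unique number" claim.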

The database is comprised of the following main tables:

MASTER - Player names, DOB, and biographical info
Batting - batting statistics
Pitching - pitching statistics
Fielding - fielding statistics

It is supplemented by these tables:

AllStarFull - All-Star appearances
HallofFame - Hall of Fame voting data
Managers - managerial statistics
Teams - yearly stats and standings
BattingPost - post-season batting statistics
PitchingPost - post-season pitching statistics
TeamFranchises - franchise information
FieldingOF - outfield position data
FieldingPost - post-season fielding data
ManagersHalf - split season data for managers
TeamsHalf - split season data for teams
Salaries - player salary data
SeriesPost - post-season series information
AwardsManagers - awards won by managers
AwardsPlayers - awards won by players
AwardsShareManagers - award voting for manager awards
AwardsSharePlayers - award voting for player awards
Appearances - details on the positions a player appeared at
Schools - list of colleges that players attended
CollegePlaying - list of players and the colleges they attended

Descriptions of each of these tables can be found attached to their associated files, below.

Acknowledgments
This work is licensed under a Creative Commons Attribution-ShareAlike
3.0 Unported License. For details see:
http://creativecommons.org/licenses/by-sa/3.0/

Person identification and demographics data are provided by
Chadwick Baseball Bureau (http://www.chadwick-bureau.com),
from its Register of baseball personnel.

Player performance data for 1871 through 2014 is based on the
Lahman Baseball Database, version 2015-01-24, which is
Copyright (C) 1996-2015 by Sean Lahman.

The tables Parks.csv and HomeGames.csv are based on the game logs
and park code table published by Retrosheet.
This information is available free of charge from and is copyrighted
by Retrosheet. Interested parties may contact Retrosheet at
http://www.retrosheet.org.

Load each table into its own data frame, named after its source file so the team can find it easily. For the first several tables, a description of the key columns is printed before the frame summary.
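
The read/print/info pattern repeated in the next cell could also be wrapped in a small helper that keeps the column descriptions in a dictionary, which avoids hand-building long strings with `\n` separators. This is only a sketch: `COLUMN_NOTES` and `load_and_describe` are hypothetical names, the descriptions are copied from the AwardsManagers print statement, and the demo row is made up.

```python
import os
import tempfile

import pandas as pd

# Column descriptions for one table, copied from the notebook's print text.
COLUMN_NOTES = {
    "AwardsManagers": {
        "playerID": "Manager ID code",
        "awardID": "Name of award won",
        "yearID": "Year",
        "lgID": "League",
        "tie": "Award was a tie (Y or N)",
        "notes": "Notes about the award",
    },
}


def load_and_describe(directory, table, column_notes=COLUMN_NOTES):
    """Read <table>.csv, print its column notes, then print DataFrame.info()."""
    df = pd.read_csv(os.path.join(directory, f"{table}.csv"))
    print(f"\nAbout this file: {table}_df")
    for column, meaning in column_notes.get(table, {}).items():
        print(f"{column}: {meaning}")
    print(f"\n{table}_df:")
    df.info()
    return df


# Demonstration on a made-up one-row file in a temporary folder.
with tempfile.TemporaryDirectory() as demo_dir:
    pd.DataFrame({
        "playerID": ["demo01"],
        "awardID": ["Demo Award"],
        "yearID": [2000],
        "lgID": ["AL"],
        "tie": [None],
        "notes": [None],
    }).to_csv(os.path.join(demo_dir, "AwardsManagers.csv"), index=False)
    AwardsManagers_demo_df = load_and_describe(demo_dir, "AwardsManagers")
```

Printing one description per loop iteration means a missing line break cannot silently fuse two column descriptions together.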

In [None]:
AllstarFull_df = pd.read_csv(os.path.join(path, 'AllstarFull.csv'))
print('\nAbout this file: AllstarFull_df\nplayerID: Player ID code\nyearID: Year\ngameNum: Game number (zero if only one All-Star game played that season)'\
    '\ngameID: Retrosheet ID for the game\nteamID: Team\nlgID: League\nGP: 1 if Played in the game\nstartingPos: If player was game starter, the position played\n\nAllstarFull_df:')
AllstarFull_df.info()  

AwardsManagers_df = pd.read_csv(os.path.join(path, 'AwardsManagers.csv'))
print('\nAbout this file: AwardsManagers_df\nplayerID: Manager ID code\nawardID: Name of award won\nyearID: Year\nlgID: League\ntie: Award was a tie (Y or N)'\
    '\nnotes: Notes about the award\n\nAwardsManagers_df:')
AwardsManagers_df.info() 

AwardsPlayers_df = pd.read_csv(os.path.join(path, 'AwardsPlayers.csv'))
print('\nAbout this file: AwardsPlayers_df\nplayerID: Player ID code\nawardID: Name of award won\nyearID: Year\nlgID: League\ntie: Award was a tie (Y or N)'\
    '\nnotes: Notes about the award\n\nAwardsPlayers_df:')
AwardsPlayers_df.info()  

AwardsShareManagers_df = pd.read_csv(os.path.join(path, 'AwardsShareManagers.csv'))
print('\nAbout this file: AwardsShareManagers_df\nawardID: Name of award won\nyearID: Year\nlgID: League\nplayerID: Manager ID code\npointsWon: Number of points won'\
    '\npointsMax: Maximum number of points possible\nvotesFirst: Number of first place votes\n\nAwardsShareManagers_df:')
AwardsShareManagers_df.info()

AwardsSharePlayers_df = pd.read_csv(os.path.join(path, 'AwardsSharePlayers.csv')) 
print('\nAbout this file: AwardsSharePlayers_df\nawardID: Name of award won\nyearID: Year\nlgID: League\nplayerID: Player ID code\npointsWon: Number of points won'\
    '\npointsMax: Maximum number of points possible\nvotesFirst: Number of first place votes\n\nAwardsSharePlayers_df:')  
AwardsSharePlayers_df.info()  

Batting_df = pd.read_csv(os.path.join(path, 'Batting.csv'))
print('\nAbout this file: Batting_df\nplayerID: Player ID code\nyearID: Year\nstint: player\'s stint (order of appearances within a season)\nteamID: Team\nlgID: League'\
    '\nG: Games\nAB: At Bats\nR: Runs\nH: Hits\n2B: Doubles\n3B: Triples\nHR: Homeruns\nRBI: Runs Batted In\nSB: Stolen Bases\nCS: Caught stealing\nBB: Base on Balls'\
        '\nSO: Strikeouts\nIBB: Intentional walks\nHBP: Hit by pitch\nSH: Sacrifices\nSF: Sacrifice flies\nGIDP: Grounded into double plays\n\nBatting_df:')
Batting_df.info()

BattingPost_df = pd.read_csv(os.path.join(path, 'BattingPost.csv'))
print('\nAbout this file: BattingPost_df\nplayerID: Player ID code\nyearID: Year\nround: Playoff round\nteamID: Team\nlgID: League\nG: Games\nAB: At Bats\nR: Runs'\
    '\nH: Hits\n2B: Doubles\n3B: Triples\nHR: Homeruns\nRBI: Runs Batted In\nSB: Stolen Bases\nCS: Caught stealing\nBB: Base on Balls\nSO: Strikeouts\nIBB: Intentional walks'\
        '\nHBP: Hit by pitch\nSH: Sacrifices\nSF: Sacrifice flies\nGIDP: Grounded into double plays\n\nBattingPost_df:')
BattingPost_df.info() 

Fielding_df = pd.read_csv(os.path.join(path, 'Fielding.csv'))
print('\nAbout this file: Fielding_df\nplayerID: Player ID code\nyearID: Year\nstint: player\'s stint (order of appearances within a season)\nteamID: Team\nlgID: League'\
    '\nPOS: Position\nG: Games\nGS: Games Started\nInnOuts: Time played in the field expressed as outs\nPO: Putouts\nA: Assists\nE: Errors\nDP: Double Plays\nPB: Passed Balls'\
        '\nWP: Wild Pitches\nSB: Opponent Stolen Bases\nCS: Opponents Caught Stealing\nZR: Zone Rating\n\nFielding_df:')
Fielding_df.info()

FieldingOF_df = pd.read_csv(os.path.join(path, 'FieldingOF.csv'))
FieldingOF_df.info()

HallOfFame_df = pd.read_csv(os.path.join(path, 'HallOfFame.csv'))
HallOfFame_df.info()

Managers_df = pd.read_csv(os.path.join(path, 'Managers.csv'))
Managers_df.info()

ManagersHalf_df = pd.read_csv(os.path.join(path, 'ManagersHalf.csv'))  
ManagersHalf_df.info()

Master_df = pd.read_csv(os.path.join(path, 'Master.csv'))
Master_df.info()

Pitching_df = pd.read_csv(os.path.join(path, 'Pitching.csv'))
Pitching_df.info()

PitchingPost_df = pd.read_csv(os.path.join(path, 'PitchingPost.csv'))
PitchingPost_df.info()

Salaries_df = pd.read_csv(os.path.join(path, 'Salaries.csv'))
Salaries_df.info()

SeriesPost_df = pd.read_csv(os.path.join(path, 'SeriesPost.csv'))
SeriesPost_df.info()

Teams_df = pd.read_csv(os.path.join(path, 'Teams.csv'))
Teams_df.info()

TeamsFranchises_df = pd.read_csv(os.path.join(path, 'TeamsFranchises.csv'))
TeamsFranchises_df.info()

TeamsHalf_df = pd.read_csv(os.path.join(path, 'TeamsHalf.csv'))
TeamsHalf_df.info()


About this file: AllstarFull_df
playerID: Player ID code
yearID: Year
gameNum: Game number (zero if only one All-Star game played that season)
gameID: Retrosheet ID for the game
teamID: Team
lgID: League
GP: 1 if Played in the game
startingPos: If player was game starter, the position played

AllstarFull_df:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5069 entries, 0 to 5068
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   playerID     5069 non-null   object 
 1   yearID       5069 non-null   int64  
 2   gameNum      5069 non-null   int64  
 3   gameID       5020 non-null   object 
 4   teamID       5069 non-null   object 
 5   lgID         5069 non-null   object 
 6   GP           5050 non-null   float64
 7   startingPos  1580 non-null   float64
dtypes: float64(2), int64(2), object(4)
memory usage: 316.9+ KB
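
The summary above shows gaps in three columns: gameID is missing in 49 of 5069 rows, GP in 19, and startingPos in 3489. A per-column missing-value table makes such gaps easier to compare across frames. The helper below is a sketch with a hypothetical name, `missing_summary`, run on made-up rows rather than the real All-Star data.

```python
import pandas as pd


def missing_summary(df):
    """Return missing-value counts and percentages per column, largest first."""
    missing = df.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (missing / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


# Made-up rows that mimic the pattern above; not real All-Star data.
demo = pd.DataFrame({
    "playerID": ["demo01", "demo02", "demo03", "demo04"],
    "GP": [1.0, 1.0, None, 0.0],
    "startingPos": [1.0, None, None, None],
})
summary = missing_summary(demo)
print(summary)
```

On the real frame the call would presumably be `missing_summary(AllstarFull_df)`.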

About this file: AwardsManagers_df
playerID: Manager ID code
awardID: Name of award won
yearI