CSC-5610-301: AI Tools and Paradigms
Project #6: Baseball Hall of Fame Prediction
Data Cleaning
12.7.2024
Benjamin F. Shaske

Import all relevant libraries for data cleaning.

In [100]:
import kagglehub
import pandas as pd
import os

Download the latest versions of all .csv files from Kaggle Hub.

In [101]:
# Download latest version
path = kagglehub.dataset_download("open-source-sports/baseball-databank")

print("Path to dataset files:", path)

csv_files = [f for f in os.listdir(path) if f.endswith('.csv')]


Path to dataset files: C:\Users\bshaske\.cache\kagglehub\datasets\open-source-sports\baseball-databank\versions\2
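
The `csv_files` list above collects the names of the .csv files in the download folder. A listing like that could also drive the loading step instead of one `pd.read_csv` call per table. The sketch below is only an illustration under assumptions: `load_csv_folder`, `demo_dir`, and `demo_frames` are hypothetical names, and a temporary folder with two tiny files stands in for the real dataset so the example runs without a download.

```python
import os
import tempfile

import pandas as pd


def load_csv_folder(folder):
    """Read every .csv file in `folder` into a dict keyed by '<FileStem>_df'."""
    frames = {}
    for file_name in sorted(os.listdir(folder)):
        if file_name.endswith(".csv"):
            stem = os.path.splitext(file_name)[0]
            frames[f"{stem}_df"] = pd.read_csv(os.path.join(folder, file_name))
    return frames


# Tiny stand-in folder (made-up rows) so the sketch runs without the dataset.
demo_dir = tempfile.mkdtemp()
pd.DataFrame({"playerID": ["a01", "b01"], "yearID": [1990, 1991]}).to_csv(
    os.path.join(demo_dir, "Batting.csv"), index=False
)
pd.DataFrame({"yearID": [1990], "teamID": ["XYZ"]}).to_csv(
    os.path.join(demo_dir, "Teams.csv"), index=False
)

demo_frames = load_csv_folder(demo_dir)
print(sorted(demo_frames))
```

A dict keeps the frames addressable by name (for example `demo_frames["Batting_df"]`), which may be convenient when the same cleaning step has to be applied to many tables.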


About Dataset
Baseball Databank is a compilation of historical baseball data in a
convenient, tidy format, distributed under Open Data terms.

This version of the Baseball databank was downloaded from Sean Lahman's website.

Note that as of v1, this dataset is missing a few tables because of a restriction on the number of individual files that can be added. This is in the process of being fixed. The missing tables are Parks, HomeGames, CollegePlaying, Schools, Appearances, and FieldingPost.

The Data
The design follows these general principles. Each player is assigned a
unique number (playerID). All of the information relating to that player
is tagged with his playerID. The playerIDs are linked to names and
birthdates in the MASTER table.

The database is comprised of the following main tables:

MASTER - Player names, DOB, and biographical info
Batting - batting statistics
Pitching - pitching statistics
Fielding - fielding statistics
It is supplemented by these tables:

AllStarFull - All-Star appearances
HallofFame - Hall of Fame voting data
Managers - managerial statistics
Teams - yearly stats and standings
BattingPost - post-season batting statistics
PitchingPost - post-season pitching statistics
TeamFranchises - franchise information
FieldingOF - outfield position data
FieldingPost - post-season fielding data
ManagersHalf - split season data for managers
TeamsHalf - split season data for teams
Salaries - player salary data
SeriesPost - post-season series information
AwardsManagers - awards won by managers
AwardsPlayers - awards won by players
AwardsShareManagers - award voting for manager awards
AwardsSharePlayers - award voting for player awards
Appearances - details on the positions a player appeared at
Schools - list of colleges that players attended
CollegePlaying - list of players and the colleges they attended
Descriptions of each of these tables can be found attached to their associated files, below.

Acknowledgments
This work is licensed under a Creative Commons Attribution-ShareAlike
3.0 Unported License. For details see:
http://creativecommons.org/licenses/by-sa/3.0/

Person identification and demographics data are provided by
Chadwick Baseball Bureau (http://www.chadwick-bureau.com),
from its Register of baseball personnel.

Player performance data for 1871 through 2014 is based on the
Lahman Baseball Database, version 2015-01-24, which is
Copyright (C) 1996-2015 by Sean Lahman.

The tables Parks.csv and HomeGames.csv are based on the game logs
and park code table published by Retrosheet.
This information is available free of charge from and is copyrighted
by Retrosheet. Interested parties may contact Retrosheet at
http://www.retrosheet.org.

Load each table into its own data frame, named after its source file so it is easy for the team to use. For most tables, a description of the key columns is printed before the frame's summary.

In [None]:
AllstarFull_df = pd.read_csv(os.path.join(path, 'AllstarFull.csv'))
AllstarFull_df.info()

AwardsManagers_df = pd.read_csv(os.path.join(path, 'AwardsManagers.csv'))
AwardsManagers_df.info()

AwardsPlayers_df = pd.read_csv(os.path.join(path, 'AwardsPlayers.csv'))
AwardsPlayers_df.info()

AwardsShareManagers_df = pd.read_csv(os.path.join(path, 'AwardsShareManagers.csv'))
AwardsShareManagers_df.info()

AwardsSharePlayers_df = pd.read_csv(os.path.join(path, 'AwardsSharePlayers.csv')) 
AwardsSharePlayers_df.info()

Batting_df = pd.read_csv(os.path.join(path, 'Batting.csv'))
Batting_df.info()

BattingPost_df = pd.read_csv(os.path.join(path, 'BattingPost.csv'))
BattingPost_df.info()

Fielding_df = pd.read_csv(os.path.join(path, 'Fielding.csv'))
Fielding_df.info()

FieldingOF_df = pd.read_csv(os.path.join(path, 'FieldingOF.csv'))
print('\nAbout this file: FieldingOF_df\nplayerID: Player ID code\nyearID: Year\nstint: player\'s stint (order of appearances within a season)\nteamID: Team\nlgID: League'\
    '\nPOS: Position\nG: Games\nGS: Games Started\nInnOuts: Time played in the field expressed as outs\nPO: Putouts\nA: Assists\nE: Errors\nDP: Double Plays\n\nFieldingOF_df:')
FieldingOF_df.info()

HallOfFame_df = pd.read_csv(os.path.join(path, 'HallOfFame.csv'))
print('\nAbout this file: HallOfFame_df\nplayerID: Player ID code\nyearID: Year of ballot\nvotedBy: Method by which player was voted upon\nballots: Total ballots cast in that year'\
    '\nneeded: Number of votes needed for selection in that year\nvotes: Total votes received\ninducted: Whether player was inducted by that vote or not (Y or N)\ncategory: Category in which'\
    ' player was honored\nneeded_note: Explanation of qualifiers for special elections\n\nHallOfFame_df:')
HallOfFame_df.info()

Managers_df = pd.read_csv(os.path.join(path, 'Managers.csv'))
print('\nAbout this file: Managers_df\nplayerID: Manager ID code\nyearID: Year\nteamID: Team\nlgID: League\ninseason: Managerial order.  Zero if the individual managed the team the entire year'\
    '\nG: Games\nW: Wins\nL: Losses\nrank: Team\'s final position in the standings that year\nplyrMgr: Player Manager (denoted by "Y")\n\nManagers_df:')
Managers_df.info()

ManagersHalf_df = pd.read_csv(os.path.join(path, 'ManagersHalf.csv'))  
print('\nAbout this file: ManagersHalf_df\nplayerID: Manager ID code\nyearID: Year\nteamID: Team\nlgID: League\ninseason: Managerial order.  Zero if the individual managed the team the entire year'\
    '\nG: Games\nW: Wins\nL: Losses\nrank: Team\'s final position in the standings that year\nplyrMgr: Player Manager (denoted by "Y")\n\nManagersHalf_df:')
ManagersHalf_df.info()

Master_df = pd.read_csv(os.path.join(path, 'Master.csv'))
print('\nAbout this file: Master_df\nplayerID: Player ID code\nbirthYear: Year player was born\nbirthMonth: Month player was born\nbirthDay: Day player was born\nbirthCountry: Country player was born'\
    '\nbirthState: State player was born\nbirthCity: City player was born\ndebut: Date player made first major league appearance\nfinalGame: Date player made final major league appearance'\
    '\nretroID: ID used by retrosheet\nbbrefID: ID used by Baseball Reference website\n\nMaster_df:')
Master_df.info()

Pitching_df = pd.read_csv(os.path.join(path, 'Pitching.csv'))
print('\nAbout this file: Pitching_df\nplayerID: Player ID code\nyearID: Year\nstint: player\'s stint (order of appearances within a season)\nteamID: Team\nlgID: League\nW: Wins\nL: Losses'\
    '\nG: Games\nGS: Games Started\nCG: Complete Games\nSHO: Shutouts\nSV: Saves\nIPouts: Outs Pitched (innings pitched x 3)\nH: Hits\nER: Earned Runs\nHR: Homeruns\nBB: Base on Balls'\
    '\nSO: Strikeouts\nBAOpp: Opponent\'s Batting Average\nERA: Earned Run Average\nIBB: Intentional Walks\nWP: Wild Pitches\nHBP: Hit by pitch\nBK: Balks\nBFP: Batters faced by pitcher'\
    '\nGF: Games Finished\nR: Runs Allowed\nSH: Sacrifices by opposing batters\nSF: Sacrifice flies by opposing batters\nGIDP: Grounded into double plays by opposing batter\n\nPitching_df:')
Pitching_df.info()

PitchingPost_df = pd.read_csv(os.path.join(path, 'PitchingPost.csv'))
print('\nAbout this file: PitchingPost_df\nplayerID: Player ID code\nyearID: Year\nround: Playoff round\nteamID: Team\nlgID: League\nW: Wins\nL: Losses\nG: Games\nGS: Games Started'\
    '\nCG: Complete Games\nSHO: Shutouts\nSV: Saves\nIPouts: Outs Pitched (innings pitched x 3)\nH: Hits\nER: Earned Runs\nHR: Homeruns\nBB: Base on Balls\nSO: Strikeouts\nBAOpp: Opponent\'s Batting Average'\
    '\nERA: Earned Run Average\nIBB: Intentional Walks\nWP: Wild Pitches\nHBP: Hit by pitch\nBK: Balks\nBFP: Batters faced by pitcher\nGF: Games Finished\nR: Runs Allowed\nSH: Sacrifices by opposing batters'\
    '\nSF: Sacrifice flies by opposing batters\nGIDP: Grounded into double plays by opposing batter\n\nPitchingPost_df:')
PitchingPost_df.info()

Salaries_df = pd.read_csv(os.path.join(path, 'Salaries.csv'))
print('\nAbout this file: Salaries_df\nyearID: Year\nteamID: Team\nlgID: League\nplayerID: Player ID code\nsalary: Salary\n\nSalaries_df:')
Salaries_df.info()

SeriesPost_df = pd.read_csv(os.path.join(path, 'SeriesPost.csv'))
print('\nAbout this file: SeriesPost_df\nyearID: Year\nround: Playoff round\nteamIDwinner: Team ID of the winner\nlgIDwinner: League ID of the winner\nteamIDloser: Team ID of the loser'\
    '\nlgIDloser: League ID of the loser\nwins: Wins by the winner\nlosses: Losses by the loser\nties: Tie games\n\nSeriesPost_df:')
SeriesPost_df.info()

Teams_df = pd.read_csv(os.path.join(path, 'Teams.csv'))
print('\nAbout this file: Teams_df\nyearID: Year\nlgID: League\nteamID: Team\nfranchID: Franchise (links to TeamsFranchise table)\ndivID: Team\'s division\nRank: Position in final standings'\
    '\nG: Games played\nGhome: Games played at home\nW: Wins\nL: Losses\nDivWin: Division Winner (Y or N)\nWCWin: Wild Card Winner (Y or N)\nLgWin: League Champion (Y or N)'\
    '\nWSWin: World Series Winner (Y or N)\nR: Runs scored\nAB: At Bats\nH: Hits by batters\n2B: Doubles\n3B: Triples\nHR: Homeruns by batters\nBB: Walks by batters\nSO: Strikeouts by batters'\
    '\nSB: Stolen Bases\nCS: Caught stealing\nHBP: Hit by pitch\nSF: Sacrifice flies\nRA: Opponents runs scored\nER: Earned Runs\nERA: Earned Run Average\nCG: Complete Games'\
    '\nSHO: Shutouts\nSV: Saves\nIPouts: Outs Pitched (innings pitched x 3)\nHA: Hits allowed\nHRA: Homeruns allowed\nBBA: Walks allowed\nSOA: Strikeouts by pitchers\nE: Errors\nDP: Double Plays'\
    '\nFP: Fielding Percentage\nname: Team Name\npark: Name of team\'s home ballpark\nattendance: Home attendance total\nBPF: Three-year park factor for batters\nPPF: Three-year park factor for pitchers'\
    '\nteamIDBR: Team ID used by Baseball Reference website\nteamIDlahman45: Team ID used in Lahman database version 4.5\nteamIDretro: Team ID used by Retrosheet\n\nTeams_df:')
Teams_df.info()

TeamsFranchises_df = pd.read_csv(os.path.join(path, 'TeamsFranchises.csv'))
print('\nAbout this file: TeamsFranchises_df\nfranchID: Franchise ID\nfranchName: Franchise name\nactive: Whether team is currently active (Y or N)\nNAassoc: Association with National Association'\
    '\n\nTeamsFranchises_df:')
TeamsFranchises_df.info()

TeamsHalf_df = pd.read_csv(os.path.join(path, 'TeamsHalf.csv'))
print('\nAbout this file: TeamsHalf_df\nyearID: Year\nlgID: League\nteamID: Team\nHalf: Half season indicator (1 or 2)\ndivID: Team\'s division\nDivWin: Division Winner (Y or N)\nRank: Position in final standings'\
    '\nG: Games played\nW: Wins\nL: Losses\nWCWin: Wild Card Winner (Y or N)\nLgWin: League Champion (Y or N)\nWSWin: World Series Winner (Y or N)'\
    '\nR: Runs scored\nAB: At Bats\nH: Hits by batters\n2B: Doubles\n3B: Triples\nHR: Homeruns by batters\nBB: Walks by batters\nSO: Strikeouts by batters\nSB: Stolen Bases'\
    '\nCS: Caught stealing\nHBP: Hit by pitch\nSF: Sacrifice flies\nRA: Opponents runs scored\nER: Earned Runs\nERA: Earned Run Average\nCG: Complete Games\nSHO: Shutouts'\
    '\nSV: Saves\nIPouts: Outs Pitched (innings pitched x 3)\nHA: Hits allowed\nHRA: Homeruns allowed\nBBA: Walks allowed\nSOA: Strikeouts by pitchers\nE: Errors\nDP: Double Plays'\
    '\nFP: Fielding Percentage\nname: Team Name\npark: Name of team\'s home ballpark\nattendance: Home attendance total\nBPF: Three-year park factor for batters\nPPF: Three-year park factor for pitchers'\
    '\nteamIDBR: Team ID used by Baseball Reference website\nteamIDlahman45: Team ID used in Lahman database version 4.5\nteamIDretro: Team ID used by Retrosheet\n\nTeamsHalf_df:')
TeamsHalf_df.info()


About this file: Fielding_df
playerID: Player ID code
yearID: Year
stint: player's stint (order of appearances within a season)
teamID: Team
lgID: League
POS: Position
G: Games
GS: Games Started
InnOuts: Time played in the field expressed as outs
PO: Putouts
A: Assists
E: Errors
DP: Double Plays
PB: Passed Balls
WP: Wild Pitches
SB: Opponent Stolen Bases
CS: Opponents Caught Stealing
ZR: Zone Rating

Fielding_df:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170526 entries, 0 to 170525
Data columns (total 18 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   playerID  170526 non-null  object 
 1   yearID    170526 non-null  int64  
 2   stint     170526 non-null  int64  
 3   teamID    170526 non-null  object 
 4   lgID      169023 non-null  object 
 5   POS       170526 non-null  object 
 6   G         170526 non-null  int64  
 7   GS        75849 non-null   float64
 8   InnOuts   102313 non-null  float64
 9   PO        156409 non-null  f

# Cleaning the data
## Cleaning Functions
Helper functions for cleaning steps that are repeated across the data sets.

In [103]:
def delete_column(df, column_name):
    """Deletes a specified column from a DataFrame."""
    # Check if the column exists in the DataFrame, error handling to verify proper column name.
    if column_name in df.columns:
        df = df.drop(columns=[column_name])
        print(f"Column '{column_name}' has been deleted.")
    else:
        print(f"Column '{column_name}' not found in the DataFrame.")
    return df

def delete_rows_with_nulls(df):
    """Deletes rows from a DataFrame that contain any null (NaN) values."""
    # Drop rows with any null values
    df_cleaned = df.dropna()
    print(f"{len(df) - len(df_cleaned)} rows containing null values were deleted.")
    return df_cleaned

def replace_nan(df, value):
    """Replaces all NaN values in a DataFrame with a specified value."""
    # Replace NaN values
    df_replaced = df.fillna(value)
    print(f"All NaN values have been replaced with '{value}'.")
    return df_replaced

def replace_nan_with_imputed_value(df, column_name, method="mean"):
    """Replaces NaN values in a specific column with an imputed value (mean, median, or mode)."""
    if column_name not in df.columns:
        print(f"Column '{column_name}' not found in the DataFrame.")
        return df
    
    if method == "mean":
        imputed_value = df[column_name].mean()
    elif method == "median":
        imputed_value = df[column_name].median()
    elif method == "mode":
        imputed_value = df[column_name].mode()[0]  # Use the first mode value
    else:
        raise ValueError("Invalid method. Choose 'mean', 'median', or 'mode'.")
    
    # Replace NaN values in the specified column
    df[column_name] = df[column_name].fillna(imputed_value)
    print(f"NaN values in column '{column_name}' have been replaced with the {method} value: {imputed_value}")
    return df

def append_missing_values(source_list, target_list):
    """Appends values from the source list to the target list if they are not already in the target list."""
    for value in source_list:
        if value not in target_list:
            target_list.append(value)
    return target_list
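
One detail worth noting about the helpers above: `replace_nan_with_imputed_value` assigns into the frame it receives, while `delete_column`, `delete_rows_with_nulls`, and `replace_nan` return a new frame. The sketch below shows a copy-based variant for comparison. It is not the notebook's implementation; `impute_column` and the toy frame are hypothetical, and the early return for an all-NaN column is an added assumption about the desired behavior.

```python
import pandas as pd


def impute_column(df, column_name, method="median"):
    """Return a copy of df with NaN in one column filled; the input is left unchanged."""
    if column_name not in df.columns:
        raise KeyError(f"Column '{column_name}' not found in the DataFrame.")
    if method == "mean":
        fill_value = df[column_name].mean()
    elif method == "median":
        fill_value = df[column_name].median()
    elif method == "mode":
        modes = df[column_name].mode()
        if modes.empty:  # happens when every value in the column is NaN
            return df.copy()
        fill_value = modes.iloc[0]
    else:
        raise ValueError("Invalid method. Choose 'mean', 'median', or 'mode'.")
    result = df.copy()
    result[column_name] = result[column_name].fillna(fill_value)
    return result


toy = pd.DataFrame({"G": [10.0, None, 30.0]})  # made-up values
filled = impute_column(toy, "G", method="median")
print(filled["G"].tolist())   # the gap is filled with the median of 10 and 30
print(toy["G"].isna().sum())  # the original frame still has its NaN
```

Returning a copy makes it harder to overwrite a raw table by accident when the same frame is reused across several cells.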

## Cleaning variables used to track column types and prepare for joining

In [None]:
cleaned_categorical = [] # List to store cleaned categorical columns
cleaned_numerical_int = [] # List to store cleaned numerical (integer) columns
cleaned_numerical_float = [] # List to store cleaned numerical (float) columns
cleaned_boolean = [] # List to store cleaned boolean columns
cleaned_dataframes = [] # List to store cleaned DataFrames

Determine the primary combined data set (open question: clean before joining?). The unique key should be the player ID, and the data frames are to be combined on it.
16 of the 20 candidate data sets contain a player ID. Create a list of them for later reference.

Data sets with no player ID will not be used for the primary calculations.

In [105]:
# Data sets containing player IDs
df_with_playerID = ['AllstarFull_df', 'AwardsManagers_df', 'AwardsPlayers_df', 'AwardsShareManagers_df', 'AwardsSharePlayers_df', 'Batting_df', 'BattingPost_df',
    'Fielding_df', 'FieldingOF_df', 'HallOfFame_df', 'Managers_df', 'ManagersHalf_df', 'Master_df', 'Pitching_df', 'PitchingPost_df', 'Salaries_df']
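
A hand-typed list like the one above can drift from the data. As a cross-check, the membership could be derived from the column names. The sketch below assumes the frames are held in a dict; `toy_frames` and its contents are hypothetical stand-ins, not the real tables.

```python
import pandas as pd

# Toy stand-ins for the loaded frames; names and columns are illustrative only.
toy_frames = {
    "Batting_df": pd.DataFrame({"playerID": ["a01"], "HR": [3]}),
    "Teams_df": pd.DataFrame({"teamID": ["XYZ"], "W": [90]}),
    "Master_df": pd.DataFrame({"playerID": ["a01"], "birthYear": [1970]}),
}

# Keep the names of frames that expose a playerID column, in insertion order.
with_player_id = [
    name for name, frame in toy_frames.items() if "playerID" in frame.columns
]
without_player_id = [name for name in toy_frames if name not in with_player_id]

print(with_player_id)
print(without_player_id)
```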

## Cleaning the AllstarFull_df
Categorical variables: {playerID, yearID, gameID, teamID, lgID, startingPos}
NOTE: startingPos has many null values; consider removing it from the cleaned data set. Non-starters (DH, bench players) could instead be set to 0, meaning non-starter.
Numerical variables: {gameNum, GP}
Consider mapping startingPos to its categorical position name (is this even necessary?).

In [106]:
print('\nAbout this file: AllstarFull_df\nplayerID: Player ID code\nyearID: Year\ngameNum: Game number (zero if only one All-Star game played that season)'\
    '\ngameID: Retrosheet ID for the game\nteamID: Team\nlgID: League\nGP: 1 if Played in the game\nstartingPos: If player was game starter, the position played\n\nAllstarFull_df:')
print(f"{AllstarFull_df.head()}\n")
AllstarFull_df.info()

# Consolidate variables
AllstarFull_df_cat = ['playerID', 'yearID', 'gameID', 'teamID', 'lgID', 'startingPos']
append_missing_values(AllstarFull_df_cat, cleaned_categorical)

AllstarFull_df_num_int = ['gameNum']
append_missing_values(AllstarFull_df_num_int, cleaned_numerical_int)

AllstarFull_df_num_float = ['GP']
append_missing_values(AllstarFull_df_num_float, cleaned_numerical_float)


About this file: AllstarFull_df
playerID: Player ID code
yearID: Year
gameNum: Game number (zero if only one All-Star game played that season)
gameID: Retrosheet ID for the game
teamID: Team
lgID: League
GP: 1 if Played in the game
startingPos: If player was game starter, the position played

AllstarFull_df:
    playerID  yearID  gameNum        gameID teamID lgID   GP  startingPos
0  gomezle01    1933        0  ALS193307060    NYA   AL  1.0          1.0
1  ferreri01    1933        0  ALS193307060    BOS   AL  1.0          2.0
2  gehrilo01    1933        0  ALS193307060    NYA   AL  1.0          3.0
3  gehrich01    1933        0  ALS193307060    DET   AL  1.0          4.0
4  dykesji01    1933        0  ALS193307060    CHA   AL  1.0          5.0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5069 entries, 0 to 5068
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   playerID     5069 non-null   object 
 1   ye

['GP']
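
The note about startingPos suggests treating a missing value as "did not start". A minimal sketch of that option follows, on a made-up frame. It assumes 0 does not already carry another meaning in the real column (worth confirming with `unique()` first), and the `started` flag is a hypothetical extra column, not something the notebook defines.

```python
import pandas as pd

# Made-up rows shaped like AllstarFull_df; only the relevant columns are kept.
toy_allstar = pd.DataFrame(
    {
        "playerID": ["a01", "b01", "c01"],
        "GP": [1.0, 1.0, 0.0],
        "startingPos": [1.0, None, None],
    }
)

cleaned = toy_allstar.copy()
# Assumption: 0 is used as a sentinel meaning "did not start".
cleaned["startingPos"] = cleaned["startingPos"].fillna(0).astype(int)
# Hypothetical convenience flag derived from the sentinel.
cleaned["started"] = cleaned["startingPos"] > 0
print(cleaned)
```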

## Cleaning the AwardsManagers_df
Consider the notes column.
A key question is whether a former player who becomes a manager and appears in the manager awards is more likely to be inducted into the Hall of Fame.
Categorical variables: {playerID, awardID, yearID, lgID}
Boolean variables: {tie}

In [107]:
print('\nAbout this file: AwardsManagers_df\nplayerID: Manager ID code\nawardID: Name of award won\nyearID: Year\nlgID: League\ntie: Award was a tie (Y or N)'\
    '\nnotes: Notes about the award\n\nAwardsManagers_df:')
print(f"{AwardsManagers_df.head()}\n") 
AwardsManagers_df.info()  # info() prints its own summary; wrapping it in print() adds a stray "None"

# Consolidate variables
AwardsManagers_df_cat = ['playerID', 'awardID', 'yearID', 'lgID']
append_missing_values(AwardsManagers_df_cat, cleaned_categorical)

AwardsManagers_df_bool = ['tie']
append_missing_values(AwardsManagers_df_bool, cleaned_boolean)

# Delete the notes column: it has no non-null values in this table
AwardsManagers_df = delete_column(AwardsManagers_df, 'notes')



About this file: AwardsManagers_df
playerID: Manager ID code
awardID: Name of award won
yearID: Year
lgID: League
tie: Award was a tie (Y or N)
notes: Notes about the award

AwardsManagers_df:
    playerID                    awardID  yearID lgID  tie  notes
0  larusto01  BBWAA Manager of the year    1983   AL  NaN    NaN
1  lasorto01  BBWAA Manager of the year    1983   NL  NaN    NaN
2  andersp01  BBWAA Manager of the year    1984   AL  NaN    NaN
3   freyji99  BBWAA Manager of the year    1984   NL  NaN    NaN
4    coxbo01  BBWAA Manager of the year    1985   AL  NaN    NaN

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 177 entries, 0 to 176
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   playerID  177 non-null    object 
 1   awardID   177 non-null    object 
 2   yearID    177 non-null    int64  
 3   lgID      177 non-null    object 
 4   tie       2 non-null      object 
 5   notes     0 non-null      float64
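
The tie column is listed as boolean, but the summary above shows it stored as object with only 2 non-null values out of 177. One possible conversion is sketched below on made-up rows. It assumes a missing value means the award was not shared, which the source data does not state explicitly.

```python
import pandas as pd

# Made-up rows shaped like AwardsManagers_df.
toy_awards = pd.DataFrame(
    {"playerID": ["m01", "m02", "m03"], "tie": ["Y", None, None]}
)

cleaned = toy_awards.copy()
# Assumption: a missing value means the award was not a tie.
# eq("Y") yields True only for "Y"; missing values compare as False.
cleaned["tie"] = cleaned["tie"].eq("Y")
print(cleaned["tie"].tolist())
```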

## Cleaning the AwardsPlayers_df
Same treatment as the AwardsManagers_df.
Eliminate the notes column: most notes indicate the position played, which should be available in the general data set, so keeping it is likely redundant.
(Note: this notes column has significantly fewer null entries; however, it is not easy to parse within the scope of this project.)


In [108]:
print('\nAbout this file: AwardsPlayers_df\nplayerID: Player ID code\nawardID: Name of award won\nyearID: Year\nlgID: League\ntie: Award was a tie (Y or N)'\
    '\nnotes: Notes about the award\n\nAwardsPlayers_df:')
print(f"{AwardsPlayers_df.head()}\n") 
AwardsPlayers_df.info()
print(AwardsPlayers_df.notes.unique())


About this file: AwardsPlayers_df
playerID: Player ID code
awardID: Name of award won
yearID: Year
lgID: League
tie: Award was a tie (Y or N)
notes: Notes about the award

AwardsPlayers_df:
    playerID                awardID  yearID lgID  tie notes
0   bondto01  Pitching Triple Crown    1877   NL  NaN   NaN
1  hinespa01           Triple Crown    1878   NL  NaN   NaN
2  heckegu01  Pitching Triple Crown    1884   AA  NaN   NaN
3  radboch01  Pitching Triple Crown    1884   NL  NaN   NaN
4  oneilti01           Triple Crown    1887   AA  NaN   NaN

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6078 entries, 0 to 6077
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   playerID  6078 non-null   object
 1   awardID   6078 non-null   object
 2   yearID    6078 non-null   int64 
 3   lgID      6078 non-null   object
 4   tie       45 non-null     object
 5   notes     4648 non-null   object
dtypes: int64(1), object(5)
memory usa

## Cleaning the AwardsShareManagers_df
No null data.
Categorical variables: {awardID, yearID, lgID, playerID}
Numerical variables: {pointsWon, pointsMax, votesFirst}

In [109]:
print('\nAbout this file: AwardsShareManagers_df\nawardID: Name of award won\nyearID: Year\nlgID: League\nplayerID: Manager ID code\npointsWon: Number of points won'\
    '\npointsMax: Maximum number of points possible\nvotesFirst: Number of first place votes\n\nAwardsShareManagers_df:')
print(f"{AwardsShareManagers_df.head()}\n")
AwardsShareManagers_df.info()


About this file: AwardsShareManagers_df
awardID: Name of award won
yearID: Year
lgID: League
playerID: Manager ID code
pointsWon: Number of points won
pointsMax: Maximum number of points possible
votesFirst: Number of first place votes

AwardsShareManagers_df:
           awardID  yearID lgID   playerID  pointsWon  pointsMax  votesFirst
0  Mgr of the year    1983   AL  altobjo01          7         28           7
1  Mgr of the year    1983   AL    coxbo01          4         28           4
2  Mgr of the year    1983   AL  larusto01         17         28          17
3  Mgr of the year    1983   NL  lasorto01         10         24          10
4  Mgr of the year    1983   NL  lillibo01          9         24           9

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   awardID     414 non-null    object
 1   yearID      414 non-null    int64 
 2   lgID  

## Cleaning the AwardsSharePlayers_df
Consider changing null votesFirst values to 0 (and consider how much value this column adds to the project).
All other columns contain no null values.
Categorical variables: {awardID, yearID, lgID, playerID}
Numerical variables: {pointsWon, pointsMax, votesFirst}

In [110]:
print('\nAbout this file: AwardsSharePlayers_df\nawardID: Name of award won\nyearID: Year\nlgID: League\nplayerID: Player ID code\npointsWon: Number of points won'\
    '\npointsMax: Maximum number of points possible\nvotesFirst: Number of first place votes\n\nAwardsSharePlayers_df:')  
print(f"{AwardsSharePlayers_df.head()}\n")
AwardsSharePlayers_df.info()


About this file: AwardsSharePlayers_df
awardID: Name of award won
yearID: Year
lgID: League
playerID: Player ID code
pointsWon: Number of points won
pointsMax: Maximum number of points possible
votesFirst: Number of first place votes

AwardsSharePlayers_df:
    awardID  yearID lgID   playerID  pointsWon  pointsMax  votesFirst
0  Cy Young    1956   ML   fordwh01        1.0         16         1.0
1  Cy Young    1956   ML  maglisa01        4.0         16         4.0
2  Cy Young    1956   ML  newcodo01       10.0         16        10.0
3  Cy Young    1956   ML  spahnwa01        1.0         16         1.0
4  Cy Young    1957   ML  donovdi01        1.0         16         1.0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6795 entries, 0 to 6794
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   awardID     6795 non-null   object 
 1   yearID      6795 non-null   int64  
 2   lgID        6795 non-null   object 
 3   play

## Cleaning Batting_df
Categorical variables: {playerID, yearID, teamID, lgID}
All batting stats should be numerical. Note that many of them are stored as floats (for example AB, R, and H).

In [111]:
print('\nAbout this file: Batting_df\nplayerID: Player ID code\nyearID: Year\nstint: player\'s stint (order of appearances within a season)\nteamID: Team\nlgID: League'\
    '\nG: Games\nAB: At Bats\nR: Runs\nH: Hits\n2B: Doubles\n3B: Triples\nHR: Homeruns\nRBI: Runs Batted In\nSB: Stolen Bases\nCS: Caught stealing\nBB: Base on Balls'\
        '\nSO: Strikeouts\nIBB: Intentional walks\nHBP: Hit by pitch\nSH: Sacrifices\nSF: Sacrifice flies\nGIDP: Grounded into double plays\n\nBatting_df:')
print(f"{Batting_df.head()}\n")
Batting_df.info()
print(f"\nUnique Stint Values:\n {Batting_df.stint.unique()}")


About this file: Batting_df
playerID: Player ID code
yearID: Year
stint: player's stint (order of appearances within a season)
teamID: Team
lgID: League
G: Games
AB: At Bats
R: Runs
H: Hits
2B: Doubles
3B: Triples
HR: Homeruns
RBI: Runs Batted In
SB: Stolen Bases
CS: Caught stealing
BB: Base on Balls
SO: Strikeouts
IBB: Intentional walks
HBP: Hit by pitch
SH: Sacrifices
SF: Sacrifice flies
GIDP: Grounded into double plays

Batting_df:
    playerID  yearID  stint teamID lgID   G     AB     R     H    2B  ...  \
0  abercda01    1871      1    TRO  NaN   1    4.0   0.0   0.0   0.0  ...   
1   addybo01    1871      1    RC1  NaN  25  118.0  30.0  32.0   6.0  ...   
2  allisar01    1871      1    CL1  NaN  29  137.0  28.0  40.0   4.0  ...   
3  allisdo01    1871      1    WS3  NaN  27  133.0  28.0  44.0  10.0  ...   
4  ansonca01    1871      1    RC1  NaN  25  120.0  29.0  39.0  11.0  ...   

    RBI   SB   CS   BB   SO  IBB  HBP  SH  SF  GIDP  
0   0.0  0.0  0.0  0.0  0.0  NaN  NaN NaN N
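
The float dtypes noted for the batting stats are a side effect of missing values: by default pandas stores an integer column that contains NaN as float64. If whole-number counts are preferred, one option is the nullable `Int64` dtype, which keeps the gaps as `<NA>`. The sketch below uses made-up rows, and `count_columns` is a hypothetical subset chosen for illustration.

```python
import pandas as pd

# Made-up rows shaped like Batting_df; IBB has a gap, as in the early seasons.
toy_batting = pd.DataFrame(
    {
        "playerID": ["a01", "b01"],
        "AB": [4.0, 118.0],
        "IBB": [None, 2.0],
    }
)

count_columns = ["AB", "IBB"]  # hypothetical subset of counting stats
# Cast each counting column to the nullable integer dtype; NaN becomes <NA>.
converted = toy_batting.astype({column: "Int64" for column in count_columns})
print(converted.dtypes)
```

This only works when every non-missing value is already a whole number, so it fits counting stats but not rate stats such as ERA.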

## Cleaning BattingPost_df
Categorical variables: {yearID, playerID, teamID}

In [112]:
print('\nAbout this file: BattingPost_df\nplayerID: Player ID code\nyearID: Year\nround: Playoff round\nteamID: Team\nlgID: League\nG: Games\nAB: At Bats\nR: Runs'\
    '\nH: Hits\n2B: Doubles\n3B: Triples\nHR: Homeruns\nRBI: Runs Batted In\nSB: Stolen Bases\nCS: Caught stealing\nBB: Base on Balls\nSO: Strikeouts\nIBB: Intentional walks'\
    '\nHBP: Hit by pitch\nSH: Sacrifices\nSF: Sacrifice flies\nGIDP: Grounded into double plays\n\nBattingPost_df:')
print(f"{BattingPost_df.head()}\n")
BattingPost_df.info()


About this file: BattingPost_df
playerID: Player ID code
yearID: Year
round: Playoff round
teamID: Team
lgID: League
G: Games
AB: At Bats
R: Runs
H: Hits
2B: Doubles
3B: Triples
HR: Homeruns
RBI: Runs Batted In
SB: Stolen Bases
CS: Caught stealing
BB: Base on Balls
SO: Strikeouts
IBB: Intentional walks
HBP: Hit by pitch
SH: Sacrifices
SF: Sacrifice flies
GIDP: Grounded into double plays

BattingPost_df:
   yearID round   playerID teamID lgID  G  AB  R  H  2B  ...  RBI  SB  CS  BB  \
0    1884    WS  becanbu01    NY4   AA  1   2  0  1   0  ...    0   0 NaN   0   
1    1884    WS  bradyst01    NY4   AA  3  10  1  0   0  ...    0   0 NaN   0   
2    1884    WS  esterdu01    NY4   AA  3  10  0  3   1  ...    0   1 NaN   0   
3    1884    WS  forstto01    NY4   AA  1   3  0  0   0  ...    0   0 NaN   0   
4    1884    WS  keefeti01    NY4   AA  2   5  0  1   0  ...    0   0 NaN   0   

   SO  IBB  HBP  SH  SF  GIDP  
0   0  0.0  NaN NaN NaN   NaN  
1   1  0.0  NaN NaN NaN   NaN  
2   3  0.0

## Cleaning Fielding_df

In [113]:
print('\nAbout this file: Fielding_df\nplayerID: Player ID code\nyearID: Year\nstint: player\'s stint (order of appearances within a season)\nteamID: Team\nlgID: League'\
    '\nPOS: Position\nG: Games\nGS: Games Started\nInnOuts: Time played in the field expressed as outs\nPO: Putouts\nA: Assists\nE: Errors\nDP: Double Plays\nPB: Passed Balls'\
    '\nWP: Wild Pitches\nSB: Opponent Stolen Bases\nCS: Opponents Caught Stealing\nZR: Zone Rating\n\nFielding_df:')
print(Fielding_df.head())
Fielding_df.info()


About this file: Fielding_df
playerID: Player ID code
yearID: Year
stint: player's stint (order of appearances within a season)
teamID: Team
lgID: League
POS: Position
G: Games
GS: Games Started
InnOuts: Time played in the field expressed as outs
PO: Putouts
A: Assists
E: Errors
DP: Double Plays
PB: Passed Balls
WP: Wild Pitches
SB: Opponent Stolen Bases
CS: Opponents Caught Stealing
ZR: Zone Rating

Fielding_df:
    playerID  yearID  stint teamID lgID POS   G  GS  InnOuts    PO     A  \
0  abercda01    1871      1    TRO  NaN  SS   1 NaN      NaN   1.0   3.0   
1   addybo01    1871      1    RC1  NaN  2B  22 NaN      NaN  67.0  72.0   
2   addybo01    1871      1    RC1  NaN  SS   3 NaN      NaN   8.0  14.0   
3  allisar01    1871      1    CL1  NaN  2B   2 NaN      NaN   1.0   4.0   
4  allisar01    1871      1    CL1  NaN  OF  29 NaN      NaN  51.0   3.0   

      E   DP  PB  WP  SB  CS  ZR  
0   2.0  0.0 NaN NaN NaN NaN NaN  
1  42.0  5.0 NaN NaN NaN NaN NaN  
2   7.0  0.0 NaN NaN N