# Business Understanding

Our MLB team wants to improve on last season's record to increase our chances of making a deep postseason run and winning the World Series next season. Offensive production was a weakness last season, so we would like to use OPS to evaluate and predict the offensive production of MLB hitters. OPS (on-base plus slugging) adds on-base percentage (OBP), which measures on-base skill, to slugging percentage (SLG), which measures power hitting, to gauge overall offensive performance. We will use this information to help build our roster for next season by evaluating our current under-contract players, possible trade acquisitions, and free agents. Hitters with a high OPS should help our team score more runs and win more games, which in turn improves our postseason and championship chances, fan sentiment, revenue, and profits.
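
As a rough illustration of how OPS is built from counting stats, the sketch below uses the standard definitions OBP = (H + BB + HBP) / (AB + BB + HBP + SF) and SLG = total bases / AB. The function names and the example stat line are hypothetical, and returning NaN when a denominator is zero is an assumption, not something this notebook specifies.

```python
import math


def on_base_percentage(h, bb, hbp, ab, sf):
    """OBP: times on base divided by the plate appearances that count toward OBP."""
    denominator = ab + bb + hbp + sf
    # Assumption: an undefined rate (no qualifying appearances) is reported as NaN.
    return (h + bb + hbp) / denominator if denominator else math.nan


def slugging_percentage(h, doubles, triples, hr, ab):
    """SLG: total bases per at-bat."""
    # Hits already count one base each, so add the extra bases for 2B, 3B and HR.
    total_bases = h + doubles + 2 * triples + 3 * hr
    return total_bases / ab if ab else math.nan


def ops(h, doubles, triples, hr, bb, hbp, ab, sf):
    """OPS: on-base percentage plus slugging percentage."""
    return on_base_percentage(h, bb, hbp, ab, sf) + slugging_percentage(
        h, doubles, triples, hr, ab
    )


# Hypothetical season line, for illustration only.
example_ops = ops(h=150, doubles=30, triples=3, hr=25, bb=60, hbp=5, ab=500, sf=5)
print(round(example_ops, 3))
```

The keyword names map onto the `H`, `2B`, `3B`, `HR`, `BB`, `HBP`, `AB`, and `SF` columns of `Batting.csv`.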

# Data Understanding

In [1]:
# install the Kaggle API to read data in directly from the site

!pip install kaggle



In [2]:
# define download path and copy the file

import os
import shutil

downloads_path = os.path.expanduser('~/Downloads/kaggle.json')
target_path = os.path.join(os.getcwd(), 'kaggle.json')

if os.path.exists(downloads_path):
    shutil.copy(downloads_path, target_path)
    print(f"kaggle.json copied to: {target_path}")
else:
    print("kaggle.json not found in Downloads. Download it first.")

kaggle.json copied to: /Users/buzzardsroostimac/Documents/Flatiron/Phase_5/MLB_hitter_production/MLB_hitter_production/kaggle.json


In [3]:
# set permissions

!chmod 600 kaggle.json

In [4]:
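# copy the token to ~/.kaggle, creating the directory first if it does not exist
!mkdir -p ~/.kaggle && cp ~/Downloads/kaggle.json ~/.kaggle/kaggle.json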

In [5]:
# confirming the API token is in place (key redacted; never publish or commit it)

!cat ~/.kaggle/kaggle.json

{"username":"shannonhunley","key":"<redacted>"}

In [6]:
# set permissions

!chmod 600 ~/.kaggle/kaggle.json
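
The copy and permission steps above can also be done in one Python function that never prints the key. This is a minimal sketch, not the notebook's actual setup: `install_kaggle_token` is a hypothetical helper, and the demo runs against a temporary directory with placeholder credentials instead of the real `~/.kaggle`.

```python
import json
import os
import shutil
import stat
import tempfile
from pathlib import Path


def install_kaggle_token(source: Path, dest_dir: Path) -> str:
    """Copy kaggle.json into dest_dir, restrict it to the owner, return the username."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / "kaggle.json"
    shutil.copyfile(source, dest)
    os.chmod(dest, 0o600)  # owner read/write only, same as `chmod 600`
    with open(dest) as fh:
        credentials = json.load(fh)
    # Return only the username so the secret key never reaches notebook output.
    return credentials["username"]


# Demo with placeholder credentials in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    fake_token = Path(tmp) / "downloads" / "kaggle.json"
    fake_token.parent.mkdir()
    fake_token.write_text(json.dumps({"username": "example_user", "key": "not-a-real-key"}))

    username = install_kaggle_token(fake_token, Path(tmp) / ".kaggle")
    mode = stat.S_IMODE(os.stat(Path(tmp) / ".kaggle" / "kaggle.json").st_mode)
    print(username, oct(mode))
```

For real use, the call would be something like `install_kaggle_token(Path.home() / "Downloads" / "kaggle.json", Path.home() / ".kaggle")`.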

In [7]:
# downloading the Lahman baseball database which contains the hitting data

!kaggle datasets download -d dalyas/lahman-baseball-database

Dataset URL: https://www.kaggle.com/datasets/dalyas/lahman-baseball-database
License(s): CC-BY-SA-3.0
lahman-baseball-database.zip: Skipping, found more recently modified local copy (use --force to force download)


In [8]:
# unzip the data

!unzip -o lahman-baseball-database.zip

Archive:  lahman-baseball-database.zip
  inflating: lahman_1871-2024_csv/AllstarFull.csv  
  inflating: lahman_1871-2024_csv/Appearances.csv  
  inflating: lahman_1871-2024_csv/AwardsManagers.csv  
  inflating: lahman_1871-2024_csv/AwardsPlayers.csv  
  inflating: lahman_1871-2024_csv/AwardsShareManagers.csv  
  inflating: lahman_1871-2024_csv/AwardsSharePlayers.csv  
  inflating: lahman_1871-2024_csv/Batting.csv  
  inflating: lahman_1871-2024_csv/BattingPost.csv  
  inflating: lahman_1871-2024_csv/CollegePlaying.csv  
  inflating: lahman_1871-2024_csv/Fielding.csv  
  inflating: lahman_1871-2024_csv/FieldingOF.csv  
  inflating: lahman_1871-2024_csv/FieldingOFsplit.csv  
  inflating: lahman_1871-2024_csv/FieldingPost.csv  
  inflating: lahman_1871-2024_csv/HallOfFame.csv  
  inflating: lahman_1871-2024_csv/HomeGames.csv  
  inflating: lahman_1871-2024_csv/Managers.csv  
  inflating: lahman_1871-2024_csv/ManagersHalf.csv  
  inflating: lahman_1871-2024_csv/Parks.csv  
  inflating: lah

In [9]:
# reading in the data, getting a look at its columns and some values

import pandas as pd
df = pd.read_csv('lahman_1871-2024_csv/Batting.csv')

df.head()

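|   | playerID | yearID | stint | teamID | lgID | G | G_batting | AB | R | H | ... | SB | CS | BB | SO | IBB | HBP | SH | SF | GIDP | G_old |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | aardsda01 | 2004 | 1 | SFN | NL | 11 |  | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 1 | aardsda01 | 2006 | 1 | CHN | NL | 45 |  | 2 | 0 | 0 | ... | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |  |
| 2 | aardsda01 | 2007 | 1 | CHA | AL | 25 |  | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 3 | aardsda01 | 2008 | 1 | BOS | AL | 47 |  | 1 | 0 | 0 | ... | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |
| 4 | aardsda01 | 2009 | 1 | SEA | AL | 73 |  | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |  |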


In [10]:
# looking at the amount of data and columns

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115450 entries, 0 to 115449
Data columns (total 24 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   playerID   115450 non-null  object 
 1   yearID     115450 non-null  int64  
 2   stint      115450 non-null  int64  
 3   teamID     115450 non-null  object 
 4   lgID       114713 non-null  object 
 5   G          115450 non-null  int64  
 6   G_batting  3266 non-null    float64
 7   AB         115450 non-null  int64  
 8   R          115450 non-null  int64  
 9   H          115450 non-null  int64  
 10  2B         115450 non-null  int64  
 11  3B         115450 non-null  int64  
 12  HR         115450 non-null  int64  
 13  RBI        114694 non-null  float64
 14  SB         113082 non-null  float64
 15  CS         91908 non-null   float64
 16  BB         115450 non-null  int64  
 17  SO         113350 non-null  float64
 18  IBB        78799 non-null   float64
 19  HBP        112634 non-n

In [12]:
# looking at the values of column 'G_old' since I'm unfamiliar with it, to see what info it provides

df['G_old'].value_counts()

0.0    1651
Name: G_old, dtype: int64

In [14]:
# dropping unnecessary columns

df_clean = df.drop(columns=['stint', 'teamID', 'lgID', 'G', 'G_batting', 'SB', 'CS', 'GIDP', 'G_old'])
df_clean.head()

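|   | playerID | yearID | AB | R | H | 2B | 3B | HR | RBI | BB | SO | IBB | HBP | SH | SF |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | aardsda01 | 2004 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | aardsda01 | 2006 | 2 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | aardsda01 | 2007 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | aardsda01 | 2008 | 1 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | aardsda01 | 2009 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |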


In [16]:
# confirming unneeded columns were dropped and looking into null values

df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 115450 entries, 0 to 115449
Data columns (total 15 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   playerID  115450 non-null  object 
 1   yearID    115450 non-null  int64  
 2   AB        115450 non-null  int64  
 3   R         115450 non-null  int64  
 4   H         115450 non-null  int64  
 5   2B        115450 non-null  int64  
 6   3B        115450 non-null  int64  
 7   HR        115450 non-null  int64  
 8   RBI       114694 non-null  float64
 9   BB        115450 non-null  int64  
 10  SO        113350 non-null  float64
 11  IBB       78799 non-null   float64
 12  HBP       112634 non-null  float64
 13  SH        109382 non-null  float64
 14  SF        79346 non-null   float64
dtypes: float64(6), int64(8), object(1)
memory usage: 13.2+ MB
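
The null counts above are concentrated in a few columns (`IBB`, `SF`, `SH`, `HBP`). One way to look into them is to tabulate, per column, how many values are missing and the earliest season with a recorded value, which may hint at whether a stat was simply not tracked in earlier years. The sketch below assumes pandas as used in the cells above; `missing_summary` and the toy frame are hypothetical and stand in for `df_clean`.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame, year_col: str = "yearID") -> pd.DataFrame:
    """Per-column null count, null share, and first year with a recorded value."""
    stat_cols = [c for c in frame.columns if c != year_col]
    nulls = frame[stat_cols].isna()
    summary = pd.DataFrame(
        {"n_missing": nulls.sum(), "share_missing": nulls.mean()}
    )
    # Earliest season in which each column has a non-null entry.
    summary["first_year_recorded"] = [
        frame.loc[frame[c].notna(), year_col].min() for c in stat_cols
    ]
    return summary.sort_values("n_missing", ascending=False)


# Toy data for illustration only; these are not values from Batting.csv.
toy = pd.DataFrame(
    {
        "yearID": [1900, 1950, 1960, 2000],
        "AB": [400, 500, 450, 600],
        "SF": [None, None, 4.0, 6.0],
        "IBB": [None, None, None, 10.0],
    }
)
summary = missing_summary(toy)
print(summary)
```

On the real data, the call would be `missing_summary(df_clean.drop(columns=['playerID']))`, which keeps the string ID column out of the summary.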


# Data Preparation

# Exploratory Data Analysis

# Modeling

# Evaluation

# Conclusion

## Limitations

## Recommendations

## Next Steps