# Wrangling and Analyzing Open Source Baseball Data

**Baseball Notebooks**  
1. This is the first in a series of notebooks for wrangling and analyzing baseball data.

This notebook will:
* create directories for Lahman and Retrosheet raw and wrangled data
* download the data
* unzip the downloaded files

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension: https://github.com/jupyterlab/jupyterlab-toc

# Open Source Baseball Data
**Lahman**  
* Stats per Player per Year including:
  * Batting.csv
  * Pitching.csv
  * Fielding.csv
* Stats per Team per Year:
  * Teams.csv
* "Lookup" tables such as:
  * People.csv: player_id -> player info
  * Parks.csv: park_id -> park info
* and more ...

**Retrosheet**  
* Play by Play data for nearly every MLB game since 1921
  * as parsed by cwdaily -> Batting/Pitching/Fielding stats per Player per Game
  * as parsed by cwgame -> Batting/Pitching/Fielding stats per Team per Game, and Game specific info  
  
**Using Both**  
* People.csv has the Lahman player_id as well as the Retrosheet player_id
* Teams.csv has the Lahman team_id as well as the Retrosheet team_id
* This allows for joins between Lahman and Retrosheet
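
As a minimal sketch of such a join, the snippet below uses tiny in-memory DataFrames instead of the real files. It assumes the raw People.csv column names `playerID` (Lahman) and `retroID` (Retrosheet). The IDs, the `retro_games` and `lahman_batting` frames, and all the numbers are made up for illustration only.

```python
import pandas as pd

# Stand-in for People.csv: Lahman ID next to Retrosheet ID (made-up IDs).
people = pd.DataFrame({
    'playerID': ['doejo01', 'roeja01'],
    'retroID': ['doej001', 'roej001'],
})

# Hypothetical per-game stats keyed by the Retrosheet player ID.
retro_games = pd.DataFrame({
    'retroID': ['doej001', 'doej001', 'roej001'],
    'hits': [2, 0, 3],
})

# Hypothetical per-season stats keyed by the Lahman player ID.
lahman_batting = pd.DataFrame({
    'playerID': ['doejo01', 'roeja01'],
    'season_hits': [10, 20],
})

# People acts as the bridge: Retrosheet ID -> Lahman ID -> Lahman stats.
joined = (
    retro_games
    .merge(people, on='retroID', how='left')
    .merge(lahman_batting, on='playerID', how='left')
)
print(joined)
```

The same pattern should apply to teams, using the two team ID columns in Teams.csv as the bridge.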

**Note**  
The code checks whether the data has already been downloaded, and later code checks whether it has already been parsed.  This allows the notebook cells to be rerun without having to wait.  To download new data and reprocess it, first remove all the data from the Lahman and Retrosheet directories.
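
The skip-if-present check described above can be sketched as a small helper. This is an illustration only, not the code used in this notebook: `download_if_missing` is a hypothetical name, and it uses `urllib.request.urlretrieve` from the standard library in place of `wget`.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, target):
    """Download url to target unless target already exists.

    Hypothetical helper. Returns True if a download happened,
    False if the file was already present and the download was skipped.
    """
    target = Path(target)
    if target.is_file():
        # Already downloaded: rerunning the cell costs nothing.
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(target))
    return True
```

Deleting the target file is what forces a fresh download on the next run, which is why the directories need to be emptied to pick up new data.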

# Download and Unpack Lahman Data

## Create Lahman Directories
* raw data -- zipped and unzipped files
* wrangled data -- to be populated in later notebooks

In [1]:
import pandas as pd
import numpy as np

import os
import wget
from pathlib import Path
import zipfile

In [2]:
# create path objects
home = Path.home()
lahman = home.joinpath('data/lahman')
p_lahman_raw = lahman.joinpath('raw')
p_lahman_wrangled = lahman.joinpath('wrangled')

# create directories from these path objects
p_lahman_raw.mkdir(parents=True, exist_ok=True)
p_lahman_wrangled.mkdir(parents=True, exist_ok=True)

## Download and Unzip
There are two sources for the Lahman data.

**Sean Lahman**  
http://www.seanlahman.com/baseball-archive/statistics  
This site has specific snapshots of the data.  Useful if you want to be sure you have the same data as someone else.

**Baseball Databank**  
https://github.com/chadwickbureau/baseballdatabank  
This is the latest data.  This is the data that will be used here.

In [3]:
# if not already downloaded, download Lahman zip file
os.chdir(p_lahman_raw)
baseball_zip = p_lahman_raw.joinpath('baseballdatabank-master.zip')

if not baseball_zip.is_file():
    url = 'https://github.com/chadwickbureau/baseballdatabank/archive/master.zip'
    wget.download(url)

    # unzip it
    with zipfile.ZipFile('baseballdatabank-master.zip', "r") as zip_ref:
        zip_ref.extractall()

In [4]:
import shutil
os.chdir(p_lahman_raw)
people_csv = p_lahman_raw.joinpath('People.csv')

if not people_csv.is_file():
    unzip_dir = p_lahman_raw.joinpath('baseballdatabank-master/core')

    # move the unzipped csv files to the current working directory
    os.chdir(p_lahman_raw)
    for root, dirs, files in os.walk(unzip_dir):
        for file in files:
            shutil.move(os.path.join(root, file), '.')

    # rm the extract directory
    shutil.rmtree('baseballdatabank-master')

In [5]:
# verify the current directory (p_lahman_raw) has the csv files
os.chdir(p_lahman_raw)
sorted(os.listdir())

['AllstarFull.csv',
 'Appearances.csv',
 'AwardsManagers.csv',
 'AwardsPlayers.csv',
 'AwardsShareManagers.csv',
 'AwardsSharePlayers.csv',
 'Batting.csv',
 'BattingPost.csv',
 'CollegePlaying.csv',
 'Fielding.csv',
 'FieldingOF.csv',
 'FieldingOFsplit.csv',
 'FieldingPost.csv',
 'HallOfFame.csv',
 'HomeGames.csv',
 'Managers.csv',
 'ManagersHalf.csv',
 'Parks.csv',
 'People.csv',
 'Pitching.csv',
 'PitchingPost.csv',
 'Salaries.csv',
 'Schools.csv',
 'SeriesPost.csv',
 'Teams.csv',
 'TeamsFranchises.csv',
 'TeamsHalf.csv',
 'baseballdatabank-master.zip',
 'readme2014.txt']

# Download and Unpack Retrosheet Data

## Create Retrosheet Directories

Create Retrosheet directories for:
* raw data -- zipped and unzipped files
* wrangled data -- to be populated in later notebooks

In [6]:
# create path objects
home = Path.home()
retrosheet = home.joinpath('data/retrosheet')
p_retro_raw = retrosheet.joinpath('raw')
p_retro_wrangled = retrosheet.joinpath('wrangled')

# create directories (if they don't already exist) from these path objects
p_retro_raw.mkdir(parents=True, exist_ok=True)
p_retro_wrangled.mkdir(parents=True, exist_ok=True)

## Download and Unzip

### Retrosheet Event (aka Play by Play) Data
Data is available from 1921 to present.

Note that the "live ball" era of baseball began in 1920.  
[Wikipedia Live Ball Era](https://en.wikipedia.org/wiki/Live-ball_era)

Roughly:
* fewer than 1% of all games since 1955 are missing
* almost all of the missing games occurred before about 1975

Data from 1955 through 2018 will be downloaded and unzipped.

The start year of 1955 was chosen for a few reasons:
* Retrosheet is less likely to have missing games
* some statistics, such as sacrifice flies, were not recorded prior to 1955

Using all data since 1955 will create a DataFrame of over 2 GB in later notebooks.

In [7]:
os.chdir(p_retro_raw)

for year in range(1955,2019):   
    # download each event file, if it doesn't exist locally
    filename = f'{year}eve.zip'
    path = Path(filename)
    if not path.exists():
        url = f'http://www.retrosheet.org/events/{year}eve.zip'
        wget.download(url)
    
    # unzip each zip file, if its contents don't exist locally
    # {year}BOS.EVA is in all zip files
    filename = f'{year}BOS.EVA'
    path = Path(filename)
    if not path.exists():
        filename = f'{year}eve.zip'
        with zipfile.ZipFile(filename, "r") as zip_ref:
            zip_ref.extractall(".")
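
The cell above uses `{year}BOS.EVA` as a marker that a zip file has already been extracted. An alternative, sketched below under the assumption that checking every archive member is acceptable, is to compare the archive's member list against the files on disk. `extract_missing` is a hypothetical helper name, not part of this notebook.

```python
import zipfile
from pathlib import Path


def extract_missing(zip_path, dest='.'):
    """Extract only the archive members not yet present in dest.

    Hypothetical helper. Returns the list of member names extracted.
    """
    dest = Path(dest)
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        # Members whose target path does not exist yet.
        missing = [name for name in zip_ref.namelist()
                   if not (dest / name).exists()]
        # An empty list extracts nothing, so reruns are cheap.
        zip_ref.extractall(dest, members=missing)
    return missing
```

This avoids relying on one particular file being in every archive, at the cost of one existence check per member.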

### Unzipped Data File Types
The unzipped data consists of 3 types of files:
1. `*.EVA` and `*.EVN` -- these are American League and National League event files per team per year
2. `*.ROS` -- these are the rosters per team per year
3. `TEAM*` -- these are the MLB teams in existence per year
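
The three naming patterns above can be turned into a small classifier. This is a sketch only: `retrosheet_file_type` is a hypothetical helper, and it treats any other file name (such as the downloaded zip files) as unclassified.

```python
from pathlib import Path


def retrosheet_file_type(filename):
    """Classify a Retrosheet file name as 'event', 'roster', 'team', or None."""
    name = Path(filename).name.upper()
    if name.endswith(('.EVA', '.EVN')):
        return 'event'    # play-by-play file for one team and year
    if name.endswith('.ROS'):
        return 'roster'   # roster for one team and year
    if name.startswith('TEAM'):
        return 'team'     # teams in existence for one year
    return None           # anything else, e.g. 2018eve.zip


for example in ['2018BOS.EVA', 'BOS2018.ROS', 'TEAM2018', '2018eve.zip']:
    print(example, retrosheet_file_type(example))
```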

In [8]:
# List 2018 Play by Play Files
files = os.listdir(p_retro_raw)
for file in files:
    if '2018' in file and (file.endswith('.EVA') or file.endswith('.EVN')):
        print(file)

2018HOU.EVA
2018CHN.EVN
2018MIL.EVN
2018CLE.EVA
2018TEX.EVA
2018ATL.EVN
2018ANA.EVA
2018PHI.EVN
2018TOR.EVA
2018PIT.EVN
2018SEA.EVA
2018DET.EVA
2018SLN.EVN
2018BOS.EVA
2018OAK.EVA
2018TBA.EVA
2018MIN.EVA
2018WAS.EVN
2018MIA.EVN
2018ARI.EVN
2018SDN.EVN
2018LAN.EVN
2018COL.EVN
2018KCA.EVA
2018CHA.EVA
2018CIN.EVN
2018SFN.EVN
2018NYN.EVN
2018BAL.EVA
2018NYA.EVA


In [9]:
# List 2018 Roster Files
files = os.listdir(p_retro_raw)
for file in files:
    if '2018' in file and file.endswith('.ROS'):
        print(file)

OAK2018.ROS
DET2018.ROS
MIA2018.ROS
TEX2018.ROS
LAN2018.ROS
SEA2018.ROS
CIN2018.ROS
CHA2018.ROS
CHN2018.ROS
PHI2018.ROS
BOS2018.ROS
ATL2018.ROS
MIN2018.ROS
NYA2018.ROS
ANA2018.ROS
HOU2018.ROS
TOR2018.ROS
NYN2018.ROS
COL2018.ROS
KCA2018.ROS
TBA2018.ROS
CLE2018.ROS
BAL2018.ROS
SFN2018.ROS
SDN2018.ROS
SLN2018.ROS
PIT2018.ROS
ARI2018.ROS
WAS2018.ROS
MIL2018.ROS


In [10]:
# List 2018 Team Files
files = os.listdir(p_retro_raw)
for file in files:
    if '2018' in file and file.startswith('TEAM'):
        print(file)

TEAM2018
