# Baseball Data Analysis with Pandas

This is the first in a series of notebooks.

This notebook is designed to be used with Jupyter Lab and the Table of Contents extension: https://github.com/jupyterlab/jupyterlab-toc

This notebook briefly describes:
* the baseball data
* how the data will be persisted after being wrangled (in later notebooks)

This notebook will:
* create directories to hold the downloaded data
* download the data
* unzip the downloaded files

# Open Source Baseball Data
**Lahman**  
* Stats per Player per Season including:
  * Batting.csv
  * Pitching.csv
  * Fielding.csv
* "Lookup" tables such as:
  * People.csv: player_id -> player info
  * Parks.csv: park_id -> park info
  * Teams.csv: team_id -> team info
* and more ...

**Retrosheet**  
* Play by Play data for every MLB game since 1921
  * as parsed by cwdaily -> Batting and Pitching stats per Player per Game
  * as parsed by cwgame -> Game Stats (e.g. home and away team hits) per Game  
  
**Using Both**  
* People.csv has the Lahman player_id as well as the Retrosheet player_id
* This allows for joins (about player data) between the two data sources
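
A minimal sketch of such a join, using tiny hand-made frames rather than the real files. The column names `playerID` and `retroID` are assumed to match the Lahman People.csv as distributed. The Retrosheet-side column name `player_id`, the stat columns, and every value below are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical stand-in for People.csv: one row per player, both IDs present.
people = pd.DataFrame({
    "playerID": ["lah001", "lah002"],   # Lahman player ID (made-up values)
    "retroID": ["ret001", "ret002"],    # Retrosheet player ID (made-up values)
})

# Hypothetical per-game stats keyed by the Retrosheet player ID.
retro_games = pd.DataFrame({
    "player_id": ["ret001", "ret001", "ret002"],
    "hits": [2, 0, 1],
})

# Hypothetical per-season stats keyed by the Lahman player ID.
lahman_seasons = pd.DataFrame({
    "playerID": ["lah001", "lah002"],
    "season_hits": [150, 98],
})

# Attach the Lahman ID to the Retrosheet rows, then join to the Lahman stats.
retro_with_ids = retro_games.merge(
    people, left_on="player_id", right_on="retroID", how="left"
)
combined = retro_with_ids.merge(lahman_seasons, on="playerID", how="left")
print(combined[["player_id", "playerID", "hits", "season_hits"]])
```

People.csv acts purely as a bridge table here: it contributes no stats of its own, only the mapping between the two ID systems.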

# Persisting Intermediate Results

**Primary Concern: Column Data Types**
* dates should be read back in as dates to allow for date specific operations
* small integer values should be read back as small integers to save memory
* categories should be read back as categories to save memory  
  
In all cases, knowing the most specific data type for a variable helps both data analysts and any software libraries that make use of the data.

**Persisted Data Representations**  
For demonstration purposes, the data will be persisted in two forms (only one form is necessary for data analysis):
1. CSV files
2. Postgres Tables

**CSV Files**  
To avoid loss of Pandas column data types:
* When writing:
  * The DataFrame's data types will be written to a csv file.
  * The DataFrame will be written to a csv file.
* When reading:
  * The data types will be read.
  * The data types will be used to properly read in the DataFrame.
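
The write and read steps above can be sketched as follows. This is an illustration under assumptions, not necessarily the implementation used in later notebooks: the helper names `to_csv_with_types` and `from_csv_with_types`, the `_types.csv` suffix, and the sample columns are all hypothetical. Datetime columns are routed through `parse_dates` because `read_csv` does not accept datetime types in its `dtype` argument.

```python
import tempfile
from pathlib import Path

import pandas as pd


def to_csv_with_types(df: pd.DataFrame, path: Path) -> None:
    """Write df to `path` and its dtypes to a sibling `<stem>_types.csv`."""
    types_path = path.with_name(path.stem + "_types.csv")
    dtypes = df.dtypes.astype(str)
    dtypes.rename_axis("column").rename("dtype").to_csv(types_path)
    df.to_csv(path, index=False)


def from_csv_with_types(path: Path) -> pd.DataFrame:
    """Read a DataFrame written by to_csv_with_types, restoring its dtypes."""
    types_path = path.with_name(path.stem + "_types.csv")
    types = pd.read_csv(types_path, index_col="column")["dtype"]
    # Datetime columns go to parse_dates; everything else goes to dtype.
    date_cols = [c for c, t in types.items() if t.startswith("datetime64")]
    other = {c: t for c, t in types.items() if not t.startswith("datetime64")}
    return pd.read_csv(path, dtype=other, parse_dates=date_cols)


# Round trip on a small made-up frame.
df = pd.DataFrame({
    "game_date": pd.to_datetime(["2018-04-01", "2018-04-02"]),
    "hits": pd.Series([1, 3], dtype="int8"),
    "team": pd.Series(["BOS", "NYA"], dtype="category"),
})
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "demo.csv"
    to_csv_with_types(df, csv_path)
    back = from_csv_with_types(csv_path)
print(back.dtypes)
```

Without the types file, a plain `pd.read_csv` would read `hits` as a 64-bit integer and both `game_date` and `team` as generic strings.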

**Postgres**  
1. df.to_sql()
   * automatically chooses the right database data type for Pandas datetime columns
   * chooses too large a database data type for numeric values, so
     * specify SQLAlchemy types such as SmallInteger and Integer, as appropriate
2. pd.read_sql()
   * automatically chooses the right Pandas data type for database datetime columns
   * chooses too large a Pandas data type for numeric values, so
     * use pd.to_numeric with downcast
   * chooses object when category may be more appropriate, so
     * convert object to category if the number of unique values is less than 5% (for example) of the number of records
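
The read-side adjustments can be sketched in Pandas alone, with no database involved. The helper name `optimize_df_dtypes`, its `max_unique_frac` parameter, and the sample data are hypothetical. The 5% threshold is simply the example figure mentioned above.

```python
import pandas as pd


def optimize_df_dtypes(df: pd.DataFrame, max_unique_frac: float = 0.05) -> pd.DataFrame:
    """Return a copy of df with downcast numeric columns and low-cardinality
    string columns converted to category."""
    out = df.copy()
    for col in out.columns:
        s = out[col]
        if pd.api.types.is_integer_dtype(s):
            out[col] = pd.to_numeric(s, downcast="integer")
        elif pd.api.types.is_float_dtype(s):
            out[col] = pd.to_numeric(s, downcast="float")
        elif pd.api.types.is_object_dtype(s) or isinstance(s.dtype, pd.StringDtype):
            # Convert only when few distinct values repeat across many rows.
            if len(s) > 0 and s.nunique() / len(s) < max_unique_frac:
                out[col] = s.astype("category")
    return out


# Made-up frame standing in for the result of a database read.
demo = pd.DataFrame({
    "hits": [i % 5 for i in range(100)],                      # small integers
    "team": ["BOS" if i % 2 else "NYA" for i in range(100)],  # 2 unique values
    "player_id": [f"p{i:03d}" for i in range(100)],           # all unique
})
small = optimize_df_dtypes(demo)
print(small.dtypes)
```

Here `player_id` is left alone because every value is unique, so a category would save nothing.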

# Download and Unpack Lahman Data

## Create Lahman Directories
* raw data -- zipped and unzipped files
* wrangled data -- to be populated in later notebooks

In [1]:
import pandas as pd
import numpy as np

import os
import wget
from pathlib import Path
import zipfile

In [2]:
# create path objects
home = Path.home()
lahman = home.joinpath('data/lahman')
p_raw = lahman.joinpath('raw')
p_wrangled = lahman.joinpath('wrangled')

# create directories from these path objects
p_raw.mkdir(parents=True, exist_ok=True)
p_wrangled.mkdir(parents=True, exist_ok=True)

## Download and Unzip
There are two sources for the Lahman data.

**Sean Lahman**  
http://www.seanlahman.com/baseball-archive/statistics  
There appears to be a snapshot of data taken the day prior to last season's opening day.

**Baseball Databank**  
https://github.com/chadwickbureau/baseballdatabank  
This is the latest data.  As of this writing, it includes the 2018 season whereas the previous link does not.

To use 2018 data, this notebook uses the Baseball Databank.

In [3]:
# download zip file from github (if not already downloaded)
os.chdir(p_raw)
baseball_zip = p_raw.joinpath('baseballdatabank-master.zip')

if not baseball_zip.is_file():
    url = 'https://github.com/chadwickbureau/baseballdatabank/archive/master.zip'
    wget.download(url)

    # unzip it
    with zipfile.ZipFile('baseballdatabank-master.zip', "r") as zip_ref:
        zip_ref.extractall()

In [4]:
import shutil
os.chdir(p_raw)
people_csv = p_raw.joinpath('People.csv')

if not people_csv.is_file():
    unzip_dir = p_raw.joinpath('baseballdatabank-master/core')

    # move the unzipped csv files to the current working directory
    os.chdir(p_raw)
    for root, dirs, files in os.walk(unzip_dir):
        for file in files:
            shutil.move(os.path.join(root, file), '.')

    # rm the extract directory
    shutil.rmtree('baseballdatabank-master')
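
As an alternative to extracting everything and then moving files, the csv files can be written straight to their destination. The sketch below is one possible approach, not the notebook's method: the helper name `extract_flat` and its `subdir` parameter are hypothetical, and the demo builds a small stand-in zip in a temporary directory instead of using the real download.

```python
import tempfile
import zipfile
from pathlib import Path, PurePosixPath


def extract_flat(zip_path: Path, dest_dir: Path, subdir: str = "core") -> list:
    """Copy files located directly under any `.../subdir/` folder in the zip
    into dest_dir, dropping the directory part of each member name."""
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            member = PurePosixPath(info.filename)
            if info.is_dir() or member.parent.name != subdir:
                continue
            target = dest_dir / member.name
            target.write_bytes(zf.read(info))
            extracted.append(target.name)
    return sorted(extracted)


# Demo on a stand-in archive that mimics the layout of the real zip file.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("baseballdatabank-master/core/People.csv", "playerID\n")
        zf.writestr("baseballdatabank-master/core/Batting.csv", "playerID\n")
        zf.writestr("baseballdatabank-master/README.md", "readme\n")
    dest = Path(tmp) / "raw"
    dest.mkdir()
    names = extract_flat(demo_zip, dest)
    readme_copied = (dest / "README.md").exists()
print(names)
```

This avoids changing the working directory and leaves no extract directory to remove afterwards.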

In [5]:
# verify the current directory (p_raw) has the csv files
os.chdir(p_raw)
sorted(os.listdir())

['AllstarFull.csv',
 'Appearances.csv',
 'AwardsManagers.csv',
 'AwardsPlayers.csv',
 'AwardsShareManagers.csv',
 'AwardsSharePlayers.csv',
 'Batting.csv',
 'BattingPost.csv',
 'CollegePlaying.csv',
 'Fielding.csv',
 'FieldingOF.csv',
 'FieldingOFsplit.csv',
 'FieldingPost.csv',
 'HallOfFame.csv',
 'HomeGames.csv',
 'Managers.csv',
 'ManagersHalf.csv',
 'Parks.csv',
 'People.csv',
 'Pitching.csv',
 'PitchingPost.csv',
 'Salaries.csv',
 'Schools.csv',
 'SeriesPost.csv',
 'Teams.csv',
 'TeamsFranchises.csv',
 'TeamsHalf.csv',
 'baseballdatabank-master.zip',
 'readme2014.txt']

# Download and Unpack Retrosheet Data

## Create Retrosheet Directories

Create Retrosheet directories for:
* raw data -- zipped and unzipped files
* wrangled data -- to be populated in later notebooks

In [6]:
# create path objects
home = Path.home()
retrosheet = home.joinpath('data/retrosheet')
p_raw = retrosheet.joinpath('raw')
p_wrangled = retrosheet.joinpath('wrangled')

# create directories (if they don't already exist) from these path objects
p_raw.mkdir(parents=True, exist_ok=True)
p_wrangled.mkdir(parents=True, exist_ok=True)

## Download and Unzip

### Retrosheet Event (aka Play by Play) Data
Data is available from 1921 to present.

Here, data from 1955 through 2018 will be downloaded and unzipped.  The start year of 1955 was chosen in part because there are fewer missing values for baseball attributes from 1955 on.

Using 1955 to present will result in at least one temporary Pandas DataFrame of more than 2 GB in later notebooks, so choose more or fewer years as appropriate for your computer's resources.

In [7]:
# change to raw file directory
os.chdir(p_raw)

for year in range(1955,2019):   
    # download each event file, if it doesn't exist locally
    filename = f'{year}eve.zip'
    path = Path(filename)
    if not path.exists():
        url = f'http://www.retrosheet.org/events/{year}eve.zip'
        wget.download(url)
    
    # unzip each zip file, if its contents don't exist locally
    # {year}BOS.EVA is in all zip files
    filename = f'{year}BOS.EVA'
    path = Path(filename)
    if not path.exists():
        filename = f'{year}eve.zip'
        with zipfile.ZipFile(filename, "r") as zip_ref:
            zip_ref.extractall(".")
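
The "download only if missing" step can also be written with the standard library and explicit destination paths, which removes the dependence on the current working directory. This is a sketch under assumptions: the helper name `download_if_missing` is hypothetical, and the demo copies from a local `file://` URL so that it runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url: str, dest: Path) -> bool:
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if dest was already present.
    """
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, str(dest))
    return True


# Local demo: a file:// URL stands in for the remote zip file.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "source.zip"
    src.write_bytes(b"demo bytes")
    dest = Path(tmp) / "raw" / "copy.zip"
    first = download_if_missing(src.as_uri(), dest)   # performs the copy
    second = download_if_missing(src.as_uri(), dest)  # skipped, file exists
    copied = dest.read_bytes()
print(first, second)
```

In this notebook the call would look like `download_if_missing(f'http://www.retrosheet.org/events/{year}eve.zip', p_raw / f'{year}eve.zip')` inside the loop over years.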

### Unzipped Data File Types
The unzipped data consists of 3 types of files:
1. `*.EVA` and `*.EVN` -- these are American League and National League event files per team per year
2. `*.ROS` -- these are the rosters per team per year
3. `TEAM*` -- these are the MLB teams in existence per year
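
The naming rules above can be captured in one small function. This is only a sketch: the function name `classify_retrosheet_file` and the labels it returns are hypothetical, and the sample names are taken from the listings in this notebook.

```python
from collections import defaultdict


def classify_retrosheet_file(name: str) -> str:
    """Label a Retrosheet file name using the three patterns described above."""
    if name.endswith((".EVA", ".EVN")):
        return "event"
    if name.endswith(".ROS"):
        return "roster"
    if name.startswith("TEAM"):
        return "team"
    return "other"


# Group a few sample file names by type.
names = ["2018BOS.EVA", "2018CHN.EVN", "BOS2018.ROS", "TEAM2018", "2018eve.zip"]
by_type = defaultdict(list)
for name in names:
    by_type[classify_retrosheet_file(name)].append(name)
print(dict(by_type))
```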

In [10]:
# List 2018 Play by Play Files
files = os.listdir(p_raw)
for file in files:
    if '2018' in file and (file.endswith('.EVA') or file.endswith('.EVN')):
        print(file)

2018HOU.EVA
2018CHN.EVN
2018MIL.EVN
2018CLE.EVA
2018TEX.EVA
2018ATL.EVN
2018ANA.EVA
2018PHI.EVN
2018TOR.EVA
2018PIT.EVN
2018SEA.EVA
2018DET.EVA
2018SLN.EVN
2018BOS.EVA
2018OAK.EVA
2018TBA.EVA
2018MIN.EVA
2018WAS.EVN
2018MIA.EVN
2018ARI.EVN
2018SDN.EVN
2018LAN.EVN
2018COL.EVN
2018KCA.EVA
2018CHA.EVA
2018CIN.EVN
2018SFN.EVN
2018NYN.EVN
2018BAL.EVA
2018NYA.EVA


In [11]:
# List 2018 Roster Files
files = os.listdir(p_raw)
for file in files:
    if '2018' in file and file.endswith('.ROS'):
        print(file)

OAK2018.ROS
DET2018.ROS
MIA2018.ROS
TEX2018.ROS
LAN2018.ROS
SEA2018.ROS
CIN2018.ROS
CHA2018.ROS
CHN2018.ROS
PHI2018.ROS
BOS2018.ROS
ATL2018.ROS
MIN2018.ROS
NYA2018.ROS
ANA2018.ROS
HOU2018.ROS
TOR2018.ROS
NYN2018.ROS
COL2018.ROS
KCA2018.ROS
TBA2018.ROS
CLE2018.ROS
BAL2018.ROS
SFN2018.ROS
SDN2018.ROS
SLN2018.ROS
PIT2018.ROS
ARI2018.ROS
WAS2018.ROS
MIL2018.ROS


In [12]:
# List 2018 Team Files
files = os.listdir(p_raw)
for file in files:
    if '2018' in file and file.startswith('TEAM'):
        print(file)

TEAM2018
