# Introduction to pyBaseball
For the first part of this class, we'll use the pyBaseball package to access data. pyBaseball is a Python package that provides a convenient API for the Baseball Savant website and the Lahman database. The Lahman database is distributed as a set of .csv files that you download onto your local machine.

There are lots of example pyBaseball queries in the project's GitHub repo. I encourage you to rummage through the repo and look at what's available.

If you haven't already done so, you need to install pyBaseball.

Open a terminal and type the following commands to pull and install the latest pybaseball:

* git clone https://github.com/jldbc/pybaseball
* cd pybaseball
* python setup.py install --user

To test that pybaseball installed correctly, run the following. If you get back data, pybaseball is working.

In [1]:
from pybaseball import statcast
data = statcast(start_dt='2017-06-24', end_dt='2017-06-27')
data.head(2)

Unnamed: 0,index,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,...,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment
0,316,CU,2017-06-27,79.7,-1.3441,5.4075,Matt Bush,608070.0,456713.0,field_out,...,1.0,2.0,1.0,2.0,2.0,1.0,1.0,2.0,Standard,Strategic
1,329,FF,2017-06-27,98.1,-1.3547,5.4196,Matt Bush,429665.0,456713.0,field_out,...,1.0,2.0,1.0,2.0,2.0,1.0,1.0,2.0,Standard,Strategic


## Lahman Database ##
The Lahman database was created by Sean Lahman and contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2017. It includes data from the American and National Leagues, as well as other leagues, including the National Association of 1871-1875.

The version that we will work with here is a collection of .csv files. The files are all publicly available through several sources, including http://www.seanlahman.com. We're going to access it through the pybaseball Python package. You can find documentation on what's in each .csv on the Lahman website.
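
Because each Lahman table is just a .csv file, it can be loaded into a pandas DataFrame (a table of rows and columns). The following is a minimal sketch, assuming pandas is installed. It reads a few made-up rows from an in-memory string rather than a downloaded file, so the values are placeholders and not real Lahman records; the column names are only meant to resemble the ones used later in this notebook.

```python
import io
import pandas as pd

# Toy CSV text standing in for a Lahman file; the rows are invented for illustration.
toy_csv = io.StringIO(
    "playerID,yearID,teamID,HR\n"
    "examp01,2015,AAA,10\n"
    "examp01,2016,AAA,12\n"
    "sampl02,2016,BBB,3\n"
)

# read_csv turns the text into a DataFrame: one row per line, one column per field.
toy = pd.read_csv(toy_csv)

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # the column headers
```

The pybaseball functions used in the next cell hand back the same kind of object, so anything that works on this toy table should also work on the real data.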

A quick query of the Lahman data through pybaseball to display some data:

In [3]:
#Download the Lahman data
from pybaseball.lahman import *

#download the entire lahman database to your current working directory
download_lahman() 

#Look at the data. Divided by category
#.csv files for Batting, Fielding, Managers, Pitching, etc

#DataFrame - a pandas DataFrame is a 2D data structure with rows and columns.
#Think of it as a data table.
#DataFrames are the data structure that we'll be working with all semester.

#create a dataframe from the batting.csv file
#(named batting_df so that the batting() function is not overwritten)
batting_df = batting()

'''
Once we have the dataframe, there are some common things that we want
to do with it:
Inspect what's in it. What are the column variables?
Filter rows by a criterion
Select columns by name
'''

#first, inspect the data
#get the first 10 rows
batting_df.head(10)

#or, to show the column headers only
list(batting_df)

#Filter rows that meet a criterion.
#Use the column name to build the condition
batting2016 = batting_df[batting_df["yearID"] == 2016]  #one year only
batting2016.head(5)

#Select columns using the column names
batting_df[["teamID", "lgID"]].head(10)

#You can repeat the process with the other csv files to look at what's included in each file.
#If you look at the pyBaseball lahman.py code, you'll see the functions that return
#the data stored in each csv. For example, let's look at the Parks data
#(named parks_df so that the parks() function is not overwritten)
parks_df = parks()
parks_df.head(10)
p = parks_df[parks_df["park.name"] == 'Riverside Park']
print(p)

  park.key       park.name park.alias    city state country
0    ALB01  Riverside Park        NaN  Albany    NY      US
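
The questions in the next section need more than a single filter: several conditions have to hold at once, and player names may live in a different table from the statistics. The following is a minimal sketch of both ideas on made-up data, assuming pandas. The tables, IDs, and names are hypothetical placeholders rather than Lahman records, and the `POS`, `nameFirst`, and `nameLast` column names are assumptions that you should check against the Lahman documentation.

```python
import pandas as pd

# Hypothetical fielding-style table (invented rows, not Lahman data).
fielding_toy = pd.DataFrame({
    "playerID": ["examp01", "sampl02", "examp01", "third03"],
    "yearID":   [2016, 2016, 2015, 2016],
    "teamID":   ["AAA", "AAA", "AAA", "BBB"],
    "POS":      ["SS", "2B", "SS", "SS"],
})

# Hypothetical names table keyed by the same playerID (invented names).
names_toy = pd.DataFrame({
    "playerID":  ["examp01", "sampl02", "third03"],
    "nameFirst": ["Pat", "Sam", "Lee"],
    "nameLast":  ["Example", "Sample", "Third"],
})

# Combine conditions with & and wrap each one in parentheses.
mask = (
    (fielding_toy["yearID"] == 2016)
    & (fielding_toy["teamID"] == "AAA")
    & (fielding_toy["POS"] == "SS")
)
shortstops = fielding_toy[mask]

# merge() joins the two tables on their shared playerID column.
result = shortstops.merge(names_toy, on="playerID", how="left")
print(result[["playerID", "nameFirst", "nameLast"]])
```

Each condition needs its own parentheses because `&` binds more tightly than `==` in Python.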


# Questions #
In groups of 2-3 or alone, find the answers to the following questions in the Lahman data. Write your code in a code cell and submit your notebook to Canvas.

1. Who played second base for the Baltimore Orioles in 2016? The teamID for the Orioles is BAL. Can you get the first and last name, as well as the playerID?
2. Who was the manager for the St. Louis Cardinals in 2010?