# Baseball Prediction: 5a - Getting (Raw) Individual Pitcher Data
In the previous lesson we compared our simple, hitting-only model to the Las Vegas odds and concluded that incorporating starting-pitcher information is a crucial next step for improving the model.

In this notebook we will learn how to scrape individual, game-level pitching data from Retrosheet, writing a loop to download the data for each pitcher. This will let us augment our game-level dataframe with features derived from the starting pitcher's previous performance.

Let's start by going to Retrosheet (www.retrosheet.org) and finding the stats for Pedro Martinez.

In [33]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

pd.set_option('display.max_columns',1000)
pd.set_option('display.max_rows',1000)

import lxml
import html5lib
from urllib.request import urlopen
import time

from bs4 import BeautifulSoup
import requests

## Scrape a single season

In [34]:
url = 'https://www.retrosheet.org/boxesetc/2004/Kmartp0010132004.htm' # Pedro Martinez 2004 pitching retrosheet page
page = requests.get(url)
#page.content

In [35]:
soup = BeautifulSoup(page.content, 'html.parser')
soup

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "https://www.w3.org/TR/REC-html40/strict.dtd">

<html dir="LTR" lang="EN">
<pre><a href="../MISC/Kdescr.htm">Read Me</a></pre>
<head>
<title>The 2004 BOS A Regular Season Pitching Log for Pedro Martinez</title>
<link href="https://www.retrosheet.org/menubar/menubar.css" rel="stylesheet" type="text/css"/>
<script src="https://www.retrosheet.org/menubar/menubar.js" type="text/javascript"></script>
</head>
<body>
<p class="nopad"><a href="https://www.retrosheet.org"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></a></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="https://www.retrosheet.org/searches/search.html">Search</a></li>
<li><a href="#">Games/People/Parks ↓</a>
<ul>
<li><a href="#">People →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
<li><a href="https://ww

In [36]:
soup1 = list(soup.children)[-1]
soup1

<html dir="LTR" lang="EN">
<pre><a href="../MISC/Kdescr.htm">Read Me</a></pre>
<head>
<title>The 2004 BOS A Regular Season Pitching Log for Pedro Martinez</title>
<link href="https://www.retrosheet.org/menubar/menubar.css" rel="stylesheet" type="text/css"/>
<script src="https://www.retrosheet.org/menubar/menubar.js" type="text/javascript"></script>
</head>
<body>
<p class="nopad"><a href="https://www.retrosheet.org"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></a></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="https://www.retrosheet.org/searches/search.html">Search</a></li>
<li><a href="#">Games/People/Parks ↓</a>
<ul>
<li><a href="#">People →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Managers">Managers</a>
<li><a href="https://www.retrosheet.o

In [37]:
soup2 = list(soup1.children)[-1]
soup2

<body>
<p class="nopad"><a href="https://www.retrosheet.org"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></a></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="https://www.retrosheet.org/searches/search.html">Search</a></li>
<li><a href="#">Games/People/Parks ↓</a>
<ul>
<li><a href="#">People →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Managers">Managers</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Coaches">Coaches</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Umpires">Umpires</a>
<li><a href="https://www.retrosheet.org/transactions/index.html">Transactions</a>
</li></li></li></li></li></ul>
<li><a href="#">Games →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html">Regular season</a>
<li><a h

In [38]:
soup3 = list(soup2.children)
soup3

['\n',
 <p class="nopad"><a href="https://www.retrosheet.org"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></a></p>,
 '\n',
 <div class="mbcenter">
 <ul class="nav">
 <li><a href="https://www.retrosheet.org/">Home</a>
 <li><a href="https://www.retrosheet.org/searches/search.html">Search</a></li>
 <li><a href="#">Games/People/Parks ↓</a>
 <ul>
 <li><a href="#">People →</a>
 <ul>
 <li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
 <li><a href="https://www.retrosheet.org/boxesetc/index.html#Managers">Managers</a>
 <li><a href="https://www.retrosheet.org/boxesetc/index.html#Coaches">Coaches</a>
 <li><a href="https://www.retrosheet.org/boxesetc/index.html#Umpires">Umpires</a>
 <li><a href="https://www.retrosheet.org/transactions/index.html">Transactions</a>
 </li></li></li></li></li></ul>
 <li><a href="#">Games →</a>
 <ul>
 <li><a href="https://www.retrosheet.org/boxesetc/index.html">R

In [39]:
index_num = np.where(["Opponent" in str(x) for x in soup3])[0][0]
index_num

12
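
The `np.where(...)[0][0]` idiom above returns the position of the first match, but it raises an `IndexError` when nothing matches. A minimal pure-Python sketch of the same lookup follows; `first_index_containing` is a hypothetical helper name (not part of the original notebook), and the sample list only stands in for `soup3`.

```python
def first_index_containing(items, needle):
    """Return the index of the first item whose string form contains `needle`.

    Returns None when there is no match (hypothetical helper, not part of
    the notebook's original code).
    """
    return next((i for i, item in enumerate(items) if needle in str(item)), None)


# Stand-in for the list of <body> children; the real list holds bs4 objects.
sample_children = ["\n", "<p>logo</p>", "<pre>   Date    #         Opponent  GS</pre>"]

idx = first_index_containing(sample_children, "Opponent")
missing = first_index_containing(sample_children, "Pitching Record")
print(idx, missing)
```

Returning `None` instead of raising makes it easier to skip pages that have no game log, which may matter once the scraper runs over many pitchers.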

In [40]:
soup4 = soup3[index_num]
soup4

<pre>   Date    #         Opponent  GS  CG SHO  GF  SV  IP     H  BFP  HR   R  ER  BB  IB  SO  SH  SF  WP HBP  BK  2B  3B GDP ROE   W   L    ERA
<a href="../2004/04042004.htm"> 4- 4-2004</a>   <a href="../2004/B04040BAL2004.htm">BOX+PBP</a> AT BAL A   1   0   0   0   0   6     7   26   1   3   2   1   0   5   0   0   0   1   0   0   0   1   1   0   1   3.00
<a href="../2004/04102004.htm"> 4-10-2004</a>   <a href="../2004/B04100BOS2004.htm">BOX+PBP</a> VS TOR A   1   0   0   0   0   7.2   4   29   1   1   1   2   0   7   0   0   0   1   0   0   0   1   0   1   0   1.98
<a href="../2004/04152004.htm"> 4-15-2004</a>   <a href="../2004/B04150BOS2004.htm">BOX+PBP</a> VS BAL A   1   0   0   0   0   5     8   26   2   7   7   4   0   3   0   0   0   0   0   2   0   1   0   0   0   4.82
<a href="../2004/04202004.htm"> 4-20-2004</a>   <a href="../2004/B04200TOR2004.htm">BOX+PBP</a> AT TOR A   1   0   0   0   0   7     5   28   0   2   1   2   0   6   0   1   0   0   0   1   2   0   0   1   0   

In [41]:
soup5 = list(soup4.children)
soup5

['   Date    #         Opponent  GS  CG SHO  GF  SV  IP     H  BFP  HR   R  ER  BB  IB  SO  SH  SF  WP HBP  BK  2B  3B GDP ROE   W   L    ERA\n',
 <a href="../2004/04042004.htm"> 4- 4-2004</a>,
 '   ',
 <a href="../2004/B04040BAL2004.htm">BOX+PBP</a>,
 ' AT BAL A   1   0   0   0   0   6     7   26   1   3   2   1   0   5   0   0   0   1   0   0   0   1   1   0   1   3.00\n',
 <a href="../2004/04102004.htm"> 4-10-2004</a>,
 '   ',
 <a href="../2004/B04100BOS2004.htm">BOX+PBP</a>,
 ' VS TOR A   1   0   0   0   0   7.2   4   29   1   1   1   2   0   7   0   0   0   1   0   0   0   1   0   1   0   1.98\n',
 <a href="../2004/04152004.htm"> 4-15-2004</a>,
 '   ',
 <a href="../2004/B04150BOS2004.htm">BOX+PBP</a>,
 ' VS BAL A   1   0   0   0   0   5     8   26   2   7   7   4   0   3   0   0   0   0   0   2   0   1   0   0   0   4.82\n',
 <a href="../2004/04202004.htm"> 4-20-2004</a>,
 '   ',
 <a href="../2004/B04200TOR2004.htm">BOX+PBP</a>,
 ' AT TOR A   1   0   0   0   0   7     5   28   0  
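
The listing above suggests a repeating pattern: after the header string, the children come in groups of four (date link, spacer text, box-score link, stats string). Assuming that pattern holds for the whole page, stride slicing can split them into one tuple per game. The sketch below uses plain strings in place of the bs4 objects, and `split_game_parts` is an illustrative name.

```python
def split_game_parts(parts):
    """Split [header, date, spacer, box, stats, date, spacer, box, stats, ...]
    into the header plus one (date, spacer, box, stats) tuple per game.

    Assumes the four-element pattern repeats without gaps; zip() silently
    drops a trailing incomplete group.
    """
    header = parts[0]
    games = list(zip(parts[1::4], parts[2::4], parts[3::4], parts[4::4]))
    return header, games


# Plain-string stand-ins for the children of the <pre> block.
parts = [
    "Date # Opponent ...\n",
    " 4- 4-2004", "   ", "BOX+PBP", " AT BAL A   1   0 ...\n",
    " 4-10-2004", "   ", "BOX+PBP", " VS TOR A   1   0 ...\n",
]

header, games = split_game_parts(parts)
for date_text, spacer, box, stats in games:
    print(date_text.strip(), stats.split()[:3])
```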

In [42]:
for i in range(12):
    print(soup5[i].get_text().split())

['Date', '#', 'Opponent', 'GS', 'CG', 'SHO', 'GF', 'SV', 'IP', 'H', 'BFP', 'HR', 'R', 'ER', 'BB', 'IB', 'SO', 'SH', 'SF', 'WP', 'HBP', 'BK', '2B', '3B', 'GDP', 'ROE', 'W', 'L', 'ERA']
['4-', '4-2004']
[]
['BOX+PBP']
['AT', 'BAL', 'A', '1', '0', '0', '0', '0', '6', '7', '26', '1', '3', '2', '1', '0', '5', '0', '0', '0', '1', '0', '0', '0', '1', '1', '0', '1', '3.00']
['4-10-2004']
[]
['BOX+PBP']
['VS', 'TOR', 'A', '1', '0', '0', '0', '0', '7.2', '4', '29', '1', '1', '1', '2', '0', '7', '0', '0', '0', '1', '0', '0', '0', '1', '0', '1', '0', '1.98']
['4-15-2004']
[]
['BOX+PBP']
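
Each stats string splits into 29 fields, while the page header has only 27 labels after `Date` and `#`: the single `Opponent` label covers three fields (AT/VS, team, league). A minimal sketch of mapping one stats string to named fields is shown here. `parse_stats_row` is a hypothetical helper, and it is deliberately strict about the field count rather than slicing to the first 29 fields.

```python
# Column names for one stats string; the first three replace the single
# "Opponent" label in the page header.
STAT_COLUMNS = [
    'at_vs', 'Opponent', 'League', 'GS', 'CG', 'SHO', 'GF', 'SV', 'IP', 'H',
    'BFP', 'HR', 'R', 'ER', 'BB', 'IB', 'SO', 'SH', 'SF', 'WP', 'HBP',
    'BK', '2B', '3B', 'GDP', 'ROE', 'W', 'L', 'ERA',
]


def parse_stats_row(row_text):
    """Map one whitespace-separated stats string to a {column: value} dict.

    Values stay as strings. Raises ValueError if the field count is unexpected.
    """
    fields = row_text.split()
    if len(fields) != len(STAT_COLUMNS):
        raise ValueError(f"expected {len(STAT_COLUMNS)} fields, got {len(fields)}")
    return dict(zip(STAT_COLUMNS, fields))


# First game row from the page above.
row = (' AT BAL A   1   0   0   0   0   6     7   26   1   3   2   1   0   5'
       '   0   0   0   1   0   0   0   1   1   0   1   3.00\n')
game = parse_stats_row(row)
print(game['Opponent'], game['IP'], game['SO'], game['ERA'])
```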


## Given a URL for a specific pitcher and season, scrape the data and lightly process it

In [43]:

def get_season_pitching_data(url):
    """Scrape one pitcher-season page into a DataFrame (all values kept as strings)."""
    time.sleep(1)  # pause between requests so we do not hammer the server
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    html = list(soup.children)[-1]
    body = list(html.children)[-1]
    sec_next = list(body.children)
    # find the child of <body> that holds the game log (the one mentioning "Opponent")
    secnum = np.where(["Opponent" in str(x) for x in sec_next])[0][0]
    key_section = sec_next[secnum]
    working_part = list(key_section.children)
    p_header = working_part[0].strip().split()  # header as shown on the page (not used below)
    # the page's single "Opponent" label covers three fields: AT/VS, team and league
    mod_header = ['at_vs', 'Opponent', 'League', 'GS', 'CG', 'SHO', 'GF', 'SV', 'IP', 'H',
                  'BFP', 'HR', 'R', 'ER', 'BB', 'IB', 'SO', 'SH', 'SF', 'WP', 'HBP',
                  'BK', '2B', '3B', 'GDP', 'ROE', 'W', 'L', 'ERA']

    # after the header, the children repeat in groups of four:
    # date link, text under the "#" column, box-score link, stats string
    date_list = []
    day_href_list = []  # collected but not returned
    for k in range(1, len(working_part), 4):
        date_list.append(working_part[k].get_text().strip())
        day_href_list.append(working_part[k].attrs['href'])

    dblhead_num_list = []
    for k in range(2, len(working_part), 4):
        dblhead_num_list.append(working_part[k].strip())

    game_href_list = []  # collected but not returned
    for k in range(3, len(working_part), 4):
        game_href_list.append(working_part[k].attrs['href'])

    main_data_matrix = []
    for k in range(4, len(working_part), 4):
        # keep only the first 29 fields so each row matches mod_header
        main_data_row = (working_part[k].strip().split())[:29]
        main_data_matrix.append(main_data_row)

    out_df = pd.DataFrame(main_data_matrix, columns=mod_header)
    out_df['Date'] = date_list
    out_df['dblhead_num'] = dblhead_num_list
    return out_df

In [44]:
url

'https://www.retrosheet.org/boxesetc/2004/Kmartp0010132004.htm'

In [45]:
get_season_pitching_data(url)

Unnamed: 0,at_vs,Opponent,League,GS,CG,SHO,GF,SV,IP,H,BFP,HR,R,ER,BB,IB,SO,SH,SF,WP,HBP,BK,2B,3B,GDP,ROE,W,L,ERA,Date,dblhead_num
0,AT,BAL,A,1,0,0,0,0,6.0,7,26,1,3,2,1,0,5,0,0,0,1,0,0,0,1,1,0,1,3.0,4- 4-2004,
1,VS,TOR,A,1,0,0,0,0,7.2,4,29,1,1,1,2,0,7,0,0,0,1,0,0,0,1,0,1,0,1.98,4-10-2004,
2,VS,BAL,A,1,0,0,0,0,5.0,8,26,2,7,7,4,0,3,0,0,0,0,0,2,0,1,0,0,0,4.82,4-15-2004,
3,AT,TOR,A,1,0,0,0,0,7.0,5,28,0,2,1,2,0,6,0,1,0,0,0,1,2,0,0,1,0,3.86,4-20-2004,
4,AT,NY,A,1,0,0,0,0,7.0,4,26,0,0,0,1,0,7,0,0,0,0,0,2,0,0,0,1,0,3.03,4-25-2004,
5,AT,TEX,A,1,0,0,0,0,4.0,9,21,1,6,6,1,0,3,0,0,0,0,0,4,0,0,0,0,1,4.17,5- 1-2004,2.0
6,AT,CLE,A,1,0,0,0,0,7.0,4,27,1,2,2,3,0,8,0,0,0,0,0,0,0,0,0,1,0,3.92,5- 6-2004,
7,VS,CLE,A,1,0,0,0,0,7.0,5,28,0,2,2,2,0,11,0,0,0,0,0,1,0,0,0,0,0,3.73,5-11-2004,
8,AT,TOR,A,1,0,0,0,0,7.0,6,28,1,3,3,1,0,6,0,0,0,0,0,0,0,0,0,0,1,3.75,5-16-2004,
9,VS,TOR,A,1,0,0,0,0,6.0,5,24,0,2,2,1,0,7,0,0,0,0,0,1,1,0,0,0,0,3.68,5-22-2004,
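
The `Date` column comes back as text such as `4- 4-2004`, where single-digit days are padded with a space. A minimal sketch of converting these strings to `datetime.date` objects follows, assuming the month-day-year order seen in the output above; `parse_retrosheet_date` is a hypothetical helper name.

```python
from datetime import date


def parse_retrosheet_date(text):
    """Convert a string like '4- 4-2004' or '4-10-2004' to a datetime.date.

    Assumes month-day-year order with '-' separators; int() ignores the
    padding spaces around each piece.
    """
    month, day, year = (int(piece) for piece in text.split('-'))
    return date(year, month, day)


first_start = parse_retrosheet_date('4- 4-2004')
later_start = parse_retrosheet_date('5-22-2004')
print(first_start, later_start, (later_start - first_start).days)
```

Real date objects make it straightforward to sort games chronologically or compute days of rest between starts.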


## Get all season links for a player

In [46]:
url = 'https://www.retrosheet.org/boxesetc/M/Pmartp001.htm' # Pedro Martinez Base Page
page = requests.get(url)
sup = BeautifulSoup(page.content, 'html.parser')
sup

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "https://www.w3.org/TR/REC-html40/strict.dtd">

<html dir="LTR" lang="EN">
<pre><a href="../MISC/Pdescr.htm">Read Me</a></pre>
<head>
<title>Pedro Martinez</title>
<link href="https://www.retrosheet.org/menubar/menubar.css" rel="stylesheet" type="text/css"/>
<script src="https://www.retrosheet.org/menubar/menubar.js" type="text/javascript"></script>
</head>
<body>
<p class="nopad"><a href="https://www.retrosheet.org"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></a></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="https://www.retrosheet.org/searches/search.html">Search</a></li>
<li><a href="#">Games/People/Parks ↓</a>
<ul>
<li><a href="#">People →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Managers">

In [47]:
sup2 = list(sup.children)[2]
sup3 = list(sup2.children)[5]
sup3

<body>
<p class="nopad"><a href="https://www.retrosheet.org"><img alt="Retrosheet" class="bancenter" height="46" src="https://www.retrosheet.org/menubar/retro-logo.gif" width="400"/></a></p>
<div class="mbcenter">
<ul class="nav">
<li><a href="https://www.retrosheet.org/">Home</a>
<li><a href="https://www.retrosheet.org/searches/search.html">Search</a></li>
<li><a href="#">Games/People/Parks ↓</a>
<ul>
<li><a href="#">People →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Players">Players</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Managers">Managers</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Coaches">Coaches</a>
<li><a href="https://www.retrosheet.org/boxesetc/index.html#Umpires">Umpires</a>
<li><a href="https://www.retrosheet.org/transactions/index.html">Transactions</a>
</li></li></li></li></li></ul>
<li><a href="#">Games →</a>
<ul>
<li><a href="https://www.retrosheet.org/boxesetc/index.html">Regular season</a>
<li><a h

## Plan: find the `<pre>` tag whose text starts with 'Pitching Record' (after stripping whitespace)
## Then get the href attribute of every `<a>` tag whose text is "Daily"

In [48]:
pre_tags = [x for x in sup3.find_all('pre')]
pre_tag_text = [x.get_text().strip() for x in pre_tags]
pre_tag_text

['Top Performances',
 'Pitcher Matchups   Batter Matchups',
 'Batting Record\nYear Team                     G    AB    R    H  2B  3B  HR  RBI   BB IBB   SO HBP  SH  SF  XI ROE GDP   SB  CS   AVG   OBP   SLG   BFW Year Team\n1992 LA  N    Daily Splits    2     2    0    0   0   0   0    0    0   0    0   0   0   0   0   0   0    0   0  .000  .000  .000   0.0 1992 LA  N\n1993 LA  N    Daily Splits   66     4    0    0   0   0   0    0    0   0    3   0   2   0   0   0   0    0   0  .000  .000  .000   0.0 1993 LA  N\n1994 MON N    Daily Splits   24    44    1    4   0   1   0    5    3   0   21   0   5   1   0   0   1    0   0  .091  .146  .136   0.0 1994 MON N\n1995 MON N    Daily Splits   30    63    2    7   0   0   0    2    0   0   30   2   5   2   0   0   1    0   0  .111  .134  .111   0.0 1995 MON N\n1996 MON N    Daily Splits   33    64    5    6   1   0   0    4    4   0   29   1  16   0   0   1   0    0   0  .094  .159  .109   0.0 1996 MON N\n1997 MON N    Daily Splits   31    

In [49]:
np.where([x.startswith('Pitching Record') for x in pre_tag_text])[0][0]

7

In [50]:
ind = np.where([x.startswith('Pitching Record') for x in pre_tag_text])[0][0]
a_tags = pre_tags[ind].find_all('a')
a_tags

[<a href="../1992/Y_1992.htm">1992</a>,
 <a href="../1992/TLAN01992.htm">LA  N</a>,
 <a href="../1992/Kmartp0010011992.htm">Daily</a>,
 <a href="../1992/Lmartp0010011992.htm">Splits</a>,
 <a href="../1992/Y_1992.htm">1992</a>,
 <a href="../1992/TLAN01992.htm">LA  N</a>,
 <a href="../1993/Y_1993.htm">1993</a>,
 <a href="../1993/TLAN01993.htm">LA  N</a>,
 <a href="../1993/Kmartp0010021993.htm">Daily</a>,
 <a href="../1993/Lmartp0010021993.htm">Splits</a>,
 <a href="../1993/Y_1993.htm">1993</a>,
 <a href="../1993/TLAN01993.htm">LA  N</a>,
 <a href="../1994/Y_1994.htm">1994</a>,
 <a href="../1994/TMON01994.htm">MON N</a>,
 <a href="../1994/Kmartp0010031994.htm">Daily</a>,
 <a href="../1994/Lmartp0010031994.htm">Splits</a>,
 <a href="../1994/Y_1994.htm">1994</a>,
 <a href="../1994/TMON01994.htm">MON N</a>,
 <a href="../1995/Y_1995.htm">1995</a>,
 <a href="../1995/TMON01995.htm">MON N</a>,
 <a href="../1995/Kmartp0010041995.htm">Daily</a>,
 <a href="../1995/Lmartp0010041995.htm">Splits</a>,


In [51]:
links = [x.attrs['href'] for x in a_tags if x.get_text()=='Daily']
links

['../1992/Kmartp0010011992.htm',
 '../1993/Kmartp0010021993.htm',
 '../1994/Kmartp0010031994.htm',
 '../1995/Kmartp0010041995.htm',
 '../1996/Kmartp0010051996.htm',
 '../1997/Kmartp0010061997.htm',
 '../1998/Kmartp0010071998.htm',
 '../1999/Kmartp0010081999.htm',
 '../2000/Kmartp0010092000.htm',
 '../2001/Kmartp0010102001.htm',
 '../2002/Kmartp0010112002.htm',
 '../2003/Kmartp0010122003.htm',
 '../2004/Kmartp0010132004.htm',
 '../2005/Kmartp0010142005.htm',
 '../2006/Kmartp0010152006.htm',
 '../2007/Kmartp0010162007.htm',
 '../2008/Kmartp0010172008.htm',
 '../2009/Kmartp0010182009.htm']
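
These links are relative to the page they were found on. One way to turn them into absolute URLs is the standard library's `urljoin`, sketched below as an alternative to trimming the leading `../` by hand. The base URL is the Pedro Martinez page fetched above.

```python
from urllib.parse import urljoin

# Page on which the relative links were found (Pedro Martinez base page).
base_page = 'https://www.retrosheet.org/boxesetc/M/Pmartp001.htm'
relative_links = ['../1992/Kmartp0010011992.htm', '../2004/Kmartp0010132004.htm']

# urljoin resolves '..' against the directory of the base page.
absolute_links = [urljoin(base_page, link) for link in relative_links]
for link in absolute_links:
    print(link)
```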

## Get the links to the pitcher-season tables given the pitcher id

In [52]:
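def get_daily_season_links(pitcher_id):
    """Return absolute URLs of the 'Daily' pitching pages, one per season, for a pitcher ID."""
    letter = pitcher_id.upper()[0]  # player pages are grouped by the first letter of the ID
    url_prefix = 'https://www.retrosheet.org/boxesetc/'
    url = url_prefix + letter + '/P' + pitcher_id + '.htm'
    time.sleep(1)  # pause between requests
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    html = list(soup.children)
    # positions 2 and 5 are where <html> and <body> sat on the page explored above
    body = list(html[2].children)[5]
    pre_texts = [x for x in body.find_all('pre')]
    secnum = np.where([x.get_text().strip().startswith('Pitching Record') for x in pre_texts])[0][0]
    a_pre_texts = pre_texts[secnum].find_all('a')
    # hrefs look like '../1992/K...htm'; [3:] drops the leading '../'
    daily_season_links = [url_prefix + x.attrs['href'][3:] for x in a_pre_texts if x.get_text() == 'Daily']
    return daily_season_links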

In [57]:
get_daily_season_links('martp001')  # martp001 is the Retrosheet ID for Pedro Martinez


['https://www.retrosheet.org/boxesetc/1992/Kmartp0010011992.htm',
 'https://www.retrosheet.org/boxesetc/1993/Kmartp0010021993.htm',
 'https://www.retrosheet.org/boxesetc/1994/Kmartp0010031994.htm',
 'https://www.retrosheet.org/boxesetc/1995/Kmartp0010041995.htm',
 'https://www.retrosheet.org/boxesetc/1996/Kmartp0010051996.htm',
 'https://www.retrosheet.org/boxesetc/1997/Kmartp0010061997.htm',
 'https://www.retrosheet.org/boxesetc/1998/Kmartp0010071998.htm',
 'https://www.retrosheet.org/boxesetc/1999/Kmartp0010081999.htm',
 'https://www.retrosheet.org/boxesetc/2000/Kmartp0010092000.htm',
 'https://www.retrosheet.org/boxesetc/2001/Kmartp0010102001.htm',
 'https://www.retrosheet.org/boxesetc/2002/Kmartp0010112002.htm',
 'https://www.retrosheet.org/boxesetc/2003/Kmartp0010122003.htm',
 'https://www.retrosheet.org/boxesetc/2004/Kmartp0010132004.htm',
 'https://www.retrosheet.org/boxesetc/2005/Kmartp0010142005.htm',
 'https://www.retrosheet.org/boxesetc/2006/Kmartp0010152006.htm',
 'https://

In [58]:
get_season_pitching_data(get_daily_season_links('martp001')[2])

Unnamed: 0,at_vs,Opponent,League,GS,CG,SHO,GF,SV,IP,H,BFP,HR,R,ER,BB,IB,SO,SH,SF,WP,HBP,BK,2B,3B,GDP,ROE,W,L,ERA,Date,dblhead_num
0,VS,CHI,N,1,0,0,0,0,6.0,3,23,0,1,1,1,0,8,0,0,0,2,0,3,0,0,0,0,1,1.5,4- 8-1994,
1,VS,CIN,N,1,0,0,0,0,8.0,1,26,0,1,1,0,0,8,0,0,1,1,0,0,0,0,0,0,0,1.29,4-13-1994,
2,AT,SF,N,1,0,0,0,0,7.0,5,27,0,0,0,2,1,10,0,0,0,2,0,0,1,2,1,0,0,0.86,4-18-1994,
3,AT,LA,N,1,0,0,0,0,6.2,7,30,2,6,6,2,0,9,0,0,1,0,0,2,1,0,1,0,1,2.6,4-24-1994,
4,VS,SD,N,1,0,0,0,0,5.0,4,24,0,3,3,3,0,6,0,0,0,1,0,1,1,0,1,1,0,3.03,4-30-1994,
5,AT,ATL,N,1,0,0,0,0,5.0,6,19,1,3,3,0,0,4,0,0,0,0,0,1,0,1,0,0,1,3.35,5- 6-1994,
6,VS,NY,N,1,0,0,0,0,7.0,7,29,0,3,3,3,0,8,0,0,0,0,0,3,0,1,0,1,0,3.43,5-11-1994,
7,AT,PHI,N,1,0,0,0,0,5.0,6,21,0,2,2,1,0,3,0,1,0,0,0,2,0,1,0,0,0,3.44,5-17-1994,
8,AT,PIT,N,1,0,0,0,0,6.0,6,24,2,2,2,0,0,4,0,0,0,0,0,1,0,0,0,1,0,3.4,5-22-1994,
9,VS,COL,N,1,0,0,0,0,7.1,4,30,0,2,0,4,1,10,0,1,0,0,0,1,0,0,0,0,0,3.0,5-28-1994,
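
Note that `IP` values such as `6.2` and `7.1` follow box-score notation, where the digit after the point counts outs: `6.2` means six innings plus two outs, not 6.2 innings. Reading the column as an ordinary decimal would distort any per-inning feature. A minimal sketch of a conversion to total outs, with `innings_to_outs` as a hypothetical helper name:

```python
def innings_to_outs(ip_text):
    """Convert innings-pitched notation ('6.2' = 6 innings + 2 outs) to total outs."""
    whole, _, partial = str(ip_text).partition('.')
    extra_outs = int(partial) if partial else 0
    if extra_outs not in (0, 1, 2):
        raise ValueError(f"unexpected innings-pitched value: {ip_text!r}")
    return int(whole) * 3 + extra_outs


for ip in ['6', '6.2', '7.1']:
    outs = innings_to_outs(ip)
    print(ip, '->', outs, 'outs =', round(outs / 3, 3), 'innings')
```

Working in outs keeps everything in integers; dividing by 3 recovers true innings when a rate statistic is needed.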


## Get all the data for a particular pitcher


In [59]:
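def get_full_pitching_data(pitcher_id):
    """Download every available season for one pitcher and stack the results."""
    link_list = get_daily_season_links(pitcher_id)
    df_pitching = pd.DataFrame()
    for url in link_list:
        # each season keeps its own 0-based index, so index values repeat across seasons
        df_pitching = pd.concat((df_pitching, get_season_pitching_data(url)))
    return df_pitching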

In [60]:
dg_data = get_full_pitching_data('martp001')

In [61]:
dg_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 476 entries, 0 to 8
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   at_vs        476 non-null    object
 1   Opponent     476 non-null    object
 2   League       476 non-null    object
 3   GS           476 non-null    object
 4   CG           476 non-null    object
 5   SHO          476 non-null    object
 6   GF           476 non-null    object
 7   SV           476 non-null    object
 8   IP           476 non-null    object
 9   H            476 non-null    object
 10  BFP          476 non-null    object
 11  HR           476 non-null    object
 12  R            476 non-null    object
 13  ER           476 non-null    object
 14  BB           476 non-null    object
 15  IB           476 non-null    object
 16  SO           476 non-null    object
 17  SH           476 non-null    object
 18  SF           476 non-null    object
 19  WP           476 non-null    object


In [62]:
dg_data.sample(5)

Unnamed: 0,at_vs,Opponent,League,GS,CG,SHO,GF,SV,IP,H,BFP,HR,R,ER,BB,IB,SO,SH,SF,WP,HBP,BK,2B,3B,GDP,ROE,W,L,ERA,Date,dblhead_num
29,AT,BAL,A,1,0,0,0,0,6,7,26,0,2,2,1,0,6,0,0,1,0,0,2,0,0,0,1,0,2.26,9-22-2002,
15,VS,CHI,A,1,0,0,0,0,5,3,18,1,1,1,1,0,4,0,0,0,0,0,0,0,0,0,1,0,2.08,6-26-1999,
15,VS,PHI,N,1,0,0,0,0,7,2,26,1,1,1,2,0,2,0,0,0,1,0,0,0,0,0,1,0,3.73,6-25-2004,
26,VS,PHI,N,1,0,0,0,0,7,8,32,4,5,5,3,0,6,2,0,0,0,0,2,0,0,0,0,1,2.9,8-31-2005,
56,VS,ATL,N,0,0,0,1,0,2,1,7,0,0,0,1,0,4,0,0,0,0,0,0,0,0,0,1,0,2.47,9- 6-1993,
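
The `info()` output shows every column stored as `object` (strings), and the sample shows index labels repeating across seasons (15 appears twice). A small sketch of one possible clean-up is given below, using tiny made-up frames rather than the scraped data: concatenate once with `ignore_index=True`, then convert the stat columns with `pd.to_numeric`.

```python
import pandas as pd

# Two tiny stand-ins for per-season frames; values are strings, as scraped.
season_a = pd.DataFrame({'Opponent': ['BAL', 'TOR'], 'SO': ['5', '7'], 'ERA': ['3.00', '1.98']})
season_b = pd.DataFrame({'Opponent': ['CHI'], 'SO': ['8'], 'ERA': ['1.50']})

# Concatenate once and renumber the rows 0..n-1 instead of restarting per season.
combined = pd.concat([season_a, season_b], ignore_index=True)

# Convert the stat columns from strings to numbers; non-numeric text becomes NaN.
for col in ['SO', 'ERA']:
    combined[col] = pd.to_numeric(combined[col], errors='coerce')

print(combined.dtypes)
print(combined)
```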


## Load in our game level data

In [63]:
df=pd.read_csv('df_bp3.csv')


In [64]:
start_pitchers_h = df.pitcher_start_id_h.unique()
start_pitchers_v = df.pitcher_start_id_v.unique()
len(start_pitchers_h), len(start_pitchers_v)

(2286, 2293)

In [65]:
start_pitchers_all = np.union1d(start_pitchers_h, start_pitchers_v)
len(start_pitchers_all), start_pitchers_all[:10]

(2470,
 array(['abadf001', 'abboc001', 'abboj001', 'abbok001', 'abbop001',
        'aceva001', 'acevj001', 'acevj002', 'ackej001', 'adamc002'],
       dtype=object))
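
`np.union1d` returns the sorted, de-duplicated union of the two arrays, which is why the combined count (2470) is smaller than the sum of the home and visiting counts. The sketch below shows the same idea with plain Python sets; the split of IDs between the two sets is illustrative, not real data.

```python
# Illustrative assignment of a few IDs to each set (not the real split).
home_starters = {'abadf001', 'abboc001', 'abboj001'}
visiting_starters = {'abboc001', 'abboj001', 'abbok001'}

# Sorted union, the same values np.union1d would return for these inputs.
all_starters = sorted(home_starters | visiting_starters)
both = home_starters & visiting_starters

# Inclusion-exclusion: size of union = size A + size B - size of overlap.
assert len(all_starters) == len(home_starters) + len(visiting_starters) - len(both)
print(all_starters, len(both))

# Applied to the counts reported above: pitchers who started both at home
# and on the road.
overlap_count = 2286 + 2293 - 2470
print(overlap_count)
```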

### Run this for everyone in the list (may take a while); the cell below only processes the first two IDs as a test

In [66]:
for p_id in start_pitchers_all[:2]:
    print(p_id)
    df_temp = get_full_pitching_data(p_id)
    # may want to modify below to save to a dedicated folder
    fname_out = 'pitching_data_'+p_id+'.csv'
    df_temp.to_csv(fname_out, index=False)

abadf001
abboc001
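
With a couple of thousand pitchers and a one-second pause per request, the full run is long enough that an interruption is likely. A hedged sketch of a resumable version of the loop follows: it writes into a dedicated folder, skips files that already exist, and collects failures instead of stopping. `save_all_pitchers` and the `pitching_data` folder name are assumptions for illustration, not part of the original notebook.

```python
from pathlib import Path


def save_all_pitchers(pitcher_ids, fetch, out_dir='pitching_data'):
    """Fetch and save one CSV per pitcher, skipping files that already exist.

    `fetch` is any callable returning an object with a pandas-style
    `.to_csv(path, index=False)` method, e.g. get_full_pitching_data.
    Returns the list of ids that failed so they can be retried.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    failed = []
    for p_id in pitcher_ids:
        target = out_path / f'pitching_data_{p_id}.csv'
        if target.exists():
            continue  # already downloaded on a previous run
        try:
            fetch(p_id).to_csv(target, index=False)
        except Exception as err:  # keep going; report failures at the end
            print(f'{p_id}: {err}')
            failed.append(p_id)
    return failed
```

In this notebook it could be called as `failed = save_all_pitchers(start_pitchers_all, get_full_pitching_data)`, and re-running the same call later would pick up where the previous run stopped.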
