# Predicting the Next Pitch Type
Using the **mlbgame** library, we can build a training set of pitch sequences. We will then use it to train a sequence-to-sequence neural network that predicts the next pitch given the previous pitches in an at bat.

In this project, the predictions will be made on a per-at-bat basis. That is, given the pitches thrown so far in an individual at bat, we will, hopefully, predict the next one. This simplifies data collection, as the MLB game data is already split by at bat.
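
To make the prediction task concrete, here is a minimal sketch of how one at bat could be turned into (previous pitches, next pitch) training examples. The function name `next_pitch_pairs` and the sample at bat are made up for illustration; they are not part of mlbgame.

```python
def next_pitch_pairs(at_bat):
    """Return (previous pitches, next pitch) pairs for one at bat."""
    # Start at index 1 so every example has at least one pitch of context.
    return [(at_bat[:i], at_bat[i]) for i in range(1, len(at_bat))]


# Hypothetical at bat, written as a list of pitch-type codes.
sample_at_bat = ["FF", "FF", "SL", "CH"]
for context, target in next_pitch_pairs(sample_at_bat):
    print(context, "->", target)
```

An at bat with a single pitch yields no examples under this scheme, since there is no context to predict from.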

## Hello World
First things first, we check that mlbgame is working.  Looks like it is.  

In [17]:
from __future__ import print_function
import mlbgame
import random
import pickle
import os.path

month = mlbgame.games(2015, 6, home="Mets")
games = mlbgame.combine_games(month)
for game in games:
    print(game)

Giants (5) at Mets (0)
Giants (8) at Mets (5)
Giants (4) at Mets (5)
Braves (3) at Mets (5)
Braves (5) at Mets (3)
Braves (8) at Mets (10)
Blue Jays (3) at Mets (4)
Blue Jays (2) at Mets (3)
Reds (1) at Mets (2)
Reds (1) at Mets (2)
Reds (2) at Mets (7)
Reds (1) at Mets (2)
Cubs (1) at Mets (0)


## Gathering Pitches
We need to get the actual pitch sequences out of the MLB data. So let's start by getting just the pitches from one game. We get the events, then for each inning we can look at each at bat in the top and bottom of the inning.
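
The traversal described above can be sketched without touching the network by using stand-in objects. Everything here is an assumption for illustration: `Pitch`, `AtBat`, `fake_events`, and the string inning keys are hypothetical and only mimic the attributes this notebook relies on (`ab.pitches` and `pitch.pitch_type`). This variant keeps one list per at bat.

```python
from collections import namedtuple

# Hypothetical stand-ins for the objects in the events feed.
Pitch = namedtuple("Pitch", ["pitch_type"])
AtBat = namedtuple("AtBat", ["pitches"])


def at_bat_pitch_types(events):
    """Return one list of pitch-type codes per at bat."""
    sequences = []
    # Assumes inning keys are numbers or numeric strings.
    for inning_key in sorted(events, key=int):
        inning = events[inning_key]
        # Top half first, then bottom half.
        for ab in inning["top"] + inning["bottom"]:
            sequences.append([p.pitch_type for p in ab.pitches])
    return sequences


fake_events = {
    "1": {"top": [AtBat([Pitch("FF"), Pitch("SL")])],
          "bottom": [AtBat([Pitch("CU")])]},
    "2": {"top": [AtBat([Pitch("FT"), Pitch("FT"), Pitch("CH")])],
          "bottom": []},
}
print(at_bat_pitch_types(fake_events))
```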

Pitch mappings:

* A - Changeup (CH)
* B - Curveball (CU)
* C - Cutter (FC)
* D - Eephus (EP)
* E - Forkball (FO)
* F - Four-Seam Fastball (FF) - There seems to be a discrepancy here: the official site says FA, other tutorials disagree, and the data uses FF. I don't see any FAs.
* G - Knuckleball (KN)
* H - Knuckle-curve (KC)
* I - Screwball (SC)
* J - Sinker (SI)
* K - Slider (SL)
* L - Splitter (FS)
* M - Two-Seam Fastball (FT)

In [18]:
pitch_dict = {"CH":"A", 
              "CU":"B", 
              "FC":"C", 
              "EP":"D", 
              "FO":"E", 
              "FF":"F", 
              "KN":"G", 
              "KC":"H", 
              "SC":"I", 
              "SI":"J", 
              "SL":"K", 
              "FS":"L", 
              "FT":"M",
              "PO":"", # Probably a pitchout; dropped from the sequence
              "IN":"", # Probably an intentional ball; dropped
              "UN":"", # Probably unknown/unidentified; dropped
              "":""}

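def get_pitch_seqs_from_game( game_id ):
    """Return one pitch string per inning of the given game."""
    pitches = []
    events  = mlbgame.game_events(game_id)
    for i in events:
        inning = events[i]
        # One string per inning: the at bats from both halves are concatenated.
        pitch_str = ""
        for ab in inning['top']+inning['bottom']:
            for pitch in ab.pitches:
                # Map the pitch-type code to its single-letter identifier.
                pitch_str += pitch_dict[ pitch.pitch_type ]
        pitches.append(pitch_str)
    return pitches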

In [19]:
pitch_sequences = get_pitch_seqs_from_game( games[0].game_id )

for seq in pitch_sequences:
    print(seq)

BFBJFFFFJJFFFJKBKJBJBB
BFKMAFMFAFJBJBABJJJBJBJJ
HAFMHHHCKMMMMMHMFAKMKFKKJJBBJJJBBBJJBJ
FAJJJBFBBAFJBAJBBJJAJJA
BJBJJJFBBAJFFFBFJBBJBJFBFFJJAJJJAJBJJBBJJ
FAMMMAFCMMMAJABAJBBJ
FFJFFFFAJFAJJFFJBBJJBJJJJBJKJABJJJB
JJAFJFABJJJBFFFFJJBJJKKJJJJFJ
AAFBAAAAKBJJJJJJJJJJB


We convert the pitch sequences to strings of single-letter identifiers. This will simplify using the strings with an RNN.
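
As a rough illustration of why single letters are convenient, the sketch below turns a pitch string into integer indices and one-hot vectors, which is a common input format for an RNN. The names `ALPHABET`, `CHAR_TO_INDEX`, `encode`, and `one_hot` are hypothetical helpers, and the alphabet is assumed to be the 13 letters A through M from the mapping above.

```python
import string

# Assumed alphabet: the 13 pitch letters "A" through "M".
ALPHABET = string.ascii_uppercase[:13]
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(sequence):
    """Map each pitch letter to its integer index."""
    return [CHAR_TO_INDEX[ch] for ch in sequence]


def one_hot(sequence):
    """Return one 13-element 0/1 vector per pitch in the sequence."""
    vectors = []
    for idx in encode(sequence):
        row = [0] * len(ALPHABET)
        row[idx] = 1
        vectors.append(row)
    return vectors


print(encode("BFJ"))
print(one_hot("A")[0])
```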

Now we can easily build a data set as large as we'd like by grabbing games from a date range, passing them to the function above, and appending the results to one large collection. Let's get pitch data for all of 2016. There are about 2000 games a year, and the function above returns one sequence per inning, so we should expect about $2000\cdot9 = 18000$ entries in the data set.

In [20]:
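def get_pitch_data_from_game_list( games ):
    """Collect the pitch sequences of every game, skipping games that fail."""
    pitch_data = []
    for game in games:
        try:
            pitch_data += get_pitch_seqs_from_game( game.game_id )
        except Exception:
            # Catch Exception rather than everything, so Ctrl-C still works.
            print("error with game: "+game.game_id)
    return pitch_data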

In [13]:
fname = "pitches.p"
if os.path.isfile(fname):
    # Reuse the cached download if it exists.
    with open( fname, "rb" ) as f:
        pitch_data = pickle.load( f )
else:
    year_2016 = mlbgame.games(2016)
    games = mlbgame.combine_games(year_2016)
    pitch_data = get_pitch_data_from_game_list( games )
    # Drop empty sequences before caching.
    pitch_data = [s for s in pitch_data if s != '']
    with open( fname, "wb" ) as f:
        pickle.dump( pitch_data, f )
# Defined on both paths so later cells can use either name.
pitch_data_clean = pitch_data

COOL! That's a good chunk of data. Wooooo. Not sure where the other errors come from, but we just catch them and move on. Some of the sequences are empty, due to either value or HTTP exceptions. Cutting those out still leaves us with about 23k sequences. Pickle is used to save us from having to redownload. To load a different year or set of games, simply delete the pitches.p file and rerun.

In [15]:
print("number of instances: "+str(len(pitch_data)))
# random.choice avoids the off-by-one of randint, whose upper bound is inclusive.
print("random instance: "+random.choice(pitch_data))

number of instances: 23561
random instance: FKBKKFBFBFKKFKFAFFKBBFKKKKBKKJKJ


I WANT MORE DATA. NOMNOMNOM. Let's see if we can get a few years of data. Let's factor out a function for loading data by year. This is crude, but as long as we give it well-behaved year identifiers we should be OK.

In [21]:
def load_pitch_by_year( year ):
    """Load one year of pitch sequences, downloading and caching if needed."""
    fname = "pitches_"+str(year)+".p"
    if os.path.isfile(fname):
        with open( fname, "rb" ) as f:
            pitch_data = pickle.load( f )
    else:
        # Separate name so the year argument is not overwritten.
        year_games = mlbgame.games( year )
        games = mlbgame.combine_games( year_games )
        pitch_data = get_pitch_data_from_game_list( games )
        # Drop empty sequences before caching.
        pitch_data = [s for s in pitch_data if s != '']
        with open( fname, "wb" ) as f:
            pickle.dump( pitch_data, f )
    return pitch_data

In [None]:
# If the list of years contains a new year, this will take a WHILE.
years_wanted = [2016, 2015, 2014]
pitches_by_year = {}

for y in years_wanted:
    pitches_by_year[str(y)] = load_pitch_by_year(y)
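
Once the per-year lists are loaded, they will probably need to be merged into one training collection. The following is a small sketch of that step, assuming a dictionary keyed by year string as built above; `combine_years` and the sample data are hypothetical and stand in for the real downloads.

```python
def combine_years(by_year):
    """Concatenate the per-year sequence lists in ascending year order."""
    combined = []
    for year in sorted(by_year):
        combined.extend(by_year[year])
    return combined


# Made-up sample data in the same shape as pitches_by_year.
example = {"2016": ["AAF"], "2015": ["FFK", "BJ"]}
all_pitches = combine_years(example)
print("total sequences: " + str(len(all_pitches)))
```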