# Predicting the Next Pitch Type
Using the [**mlbgame**](http://panz.io/mlbgame/) library, we can build a training set of pitch sequences. Then, we will use that to train an RNN to predict the next pitch given the previous pitches in an at bat. 

## Gathering Pitches
We need to get the actual pitch sequences out of the MLB data.  So lets start by getting just the pitches from one game.  We get the events, then for each inning we can look at each at bat in the top and bottom of the inning.  

Pitch mappings:

* A - Changeup (CH)
* B - Curveball (CU)
* C - Cutter (FC)
* D - Eephus (EP)
* E - Forkball (FO)
* F - Four-Seam Fastball (FF) - Seems to be a discrepency here.  Official site says FA. Other tutorials disagree.  Data uses FF.  Dont see any FA's.
* G - Knuckleball (KN)
* H - Knuckle-curve (KC)
* I - Screwball (SC)
* J - Sinker (SI)
* K - Slider (SL)
* L - Splitter (FS)
* M - Two-Seam Fastball (FT)

In [3]:
import mlbgame
import pickle
import os

pitch_dict = {"CH":"A", 
              "CU":"B", 
              "FC":"C", 
              "EP":"D", 
              "FO":"E", 
              "FF":"F", 
              "KN":"G", 
              "KC":"H", 
              "SC":"I", 
              "SI":"J", 
              "SL":"K", 
              "FS":"L", 
              "FT":"M",
              "PO":"", # Not sure what these three are
              "IN":"", # Weird
              "UN":"", # Weird
              "":""}

def get_pitch_seqs_from_game( game_id ):
    pitches = []
    events  = mlbgame.game_events(game_id)
    for i in events:
        inning = events[i]
        pitch_str = ""
        for ab in inning['top']+inning['bottom']:
            for pitch in ab.pitches:
                pitch_str += pitch_dict[ pitch.pitch_type ]
        pitches.append(pitch_str)
    return pitches

We convert the pitch sequences to strings of single letter identifiers.  Note - I did this initially because I thought it would help putting the data into a network, but I dont think it was necessary.  I ended up making 1-hot vectors, so this is probably an unneeded extra step.

Next is a method to abstract getting a set of pitch seqences from a list of games.

In [4]:
def get_pitch_data_from_game_list( games ):
    pitch_data = []
    for game in games:
        try:
            pitch_data += get_pitch_seqs_from_game( game.game_id )
        except:
            print("error with game: "+game.game_id)
    return pitch_data

Lets see if we can get a few years of data.  Let's abstract away a method for loading data by year.  This is crude, but as long as we give it well behaved year identifiers we should be ok.  Make sure the ```pitch_data``` directory exists or you'll wait a long while to see an annoying error with open. 

In [5]:
PITCH_DATA_DIR = "pitch_data/"
def load_pitch_by_year( year ):
    fname = PITCH_DATA_DIR+"pitches_"+str(year)+".p"
    if os.path.isfile(fname):
        pitch_data = pickle.load( open( fname, "rb") )
    else:
        year = mlbgame.games( year )
        games = mlbgame.combine_games( year )
        pitch_data = get_pitch_data_from_game_list( games )
        pitch_data_clean = [s for s in pitch_data if s != ''] # Removes the empty atbats.  
        pitch_data = pitch_data_clean
        pickle.dump( pitch_data, open( fname, "wb" ) )
    return pitch_data

In [6]:
# If the list of years contains a new year, this will take a WHILE.
years_wanted = [2016, 2015, 2014]
pitches_by_year = {}

for y in years_wanted:
    pitches_by_year[str(y)] = load_pitch_by_year(y)

error with game: 2016_03_22_anamlb_slcaaa_1
error with game: 2016_03_22_tbamlb_cubint_1
error with game: 2016_03_31_phimlb_pfsmin_1
error with game: 2016_03_31_sdnmlb_elpaaa_1
error with game: 2016_04_02_detmlb_atlmlb_1
error with game: 2016_04_02_milmlb_blxaax_1
error with game: 2016_04_04_bosmlb_clemlb_1
error with game: 2016_04_04_houmlb_nyamlb_1
error with game: 2016_04_09_miamlb_wasmlb_1
error with game: 2016_04_10_nyamlb_detmlb_1
error with game: 2016_04_17_balmlb_texmlb_1
error with game: 2016_04_27_milmlb_chnmlb_1
error with game: 2016_04_28_pitmlb_colmlb_1
error with game: 2016_04_30_atlmlb_chnmlb_1
error with game: 2016_05_16_bosmlb_kcamlb_1
error with game: 2016_05_26_chamlb_kcamlb_1
error with game: 2016_06_08_clemlb_seamlb_1
error with game: 2016_09_25_atlmlb_miamlb_1
error with game: 2016_10_03_clemlb_detmlb_1


FileNotFoundError: [Errno 2] No such file or directory: 'pitch_data/pitches_2016.p'

In [15]:
total = 0
for y in years_wanted:
    total += len(pitches_by_year[str(y)])
    print( "year: "+str(y)+" count: "+str(len(pitches_by_year[str(y)])) )
print("total: "+str(total))

year: 2016 count: 23561
year: 2015 count: 23085
year: 2014 count: 22781
total: 69427


Sweet.  Thats about 70k sequences to train off of.  
To train, I think the pitch types need to be converted to sequences of 1-Hot input vectors.  There are 13 possible pitch types, so each pitch will be a vector of size 13.  

In [105]:
# I'll need to know the length of the longest sequence so I can size the time-step of the RNN.
for y in years_wanted:
    longest = max( pitches_by_year[str(y)], key=len )
    print(longest + " len: " + str(len(longest)))

KBKKAFKBFKFBMFKMMFFFKFKMMMKKMFKMMMAFKFMFAKMMMMFMBFBFFFFFKFFACCFBABCCJBAAJAJCJCAAFCACJKKKKFF len: 91
MBFKABFKFKMMMBMMMFFKFAMMMMMABMFCFFCFFCFKFFKCFCFFFFHFFFFFHHFAFFFFFHAFFFAFHFFFFMAFMCCFFCMMBF len: 90
FKFFFKFFFFKMMBKKJAMMBMMBMJJBMMMMJMBJMFFFBFAFFFBFFBAFFFAFFFBFAFBBFFFBFFFMKMMKKMKMMKKKKKKK len: 88


## Building a Model
First, I'll need to turn the data set into a huge tensor of the shape [ NUM_SEQ, MAX_SEQ_LEN, NUM_PITCH_TYPES ].  Or atleast I think thats the shape I want.  

TODO - Fix the shapes.  :(

In [1]:
import numpy as np
VECTOR_SIZE = 13 # Number of pitch types

pitch_char_dict = { "A":0,
                    "B":1,
                    "C":2,
                    "D":3,
                    "E":4,
                    "F":5,
                    "G":6,
                    "H":7,
                    "I":8,
                    "J":9,
                    "K":10,
                    "L":11,
                    "M":12}

# Create 1-Hot vectors from data set
def create_one_hot_series( seq ):
    vectors = []
    for char in seq:
        v = np.zeros( VECTOR_SIZE )
        v[ pitch_char_dict[char] ] = 1.0
        vectors.append(v)
    return vectors    

In [2]:
# Build the actual data set.
full_list = []
for y in years_wanted:
    full_list += pitches_by_year[str(y)]

X_full = []
for seq in full_list:
    X_full.append( create_one_hot_series( seq ) )

NameError: name 'years_wanted' is not defined

In [98]:
import sys
print( str(sys.getsizeof(X_full)) + " bytes")

578936 bytes


In [99]:
# Pitch Prediction Network
import tensorflow as tf
tf.reset_default_graph()

# Construction Phase ###############
NUM_INPUTS  = 13    # Size of the input vector (the number of possible pitch types)
NUM_OUTPUTS = 13    # Want a pitch type out, so same size as input.
NUM_NEURONS = 5     # Just a placeholder.  Its what the book uses for the sequence example 
NUM_STEPS   = 91    # Size of largest input sequence - So all training can fit.

## RNN Graph
X = tf.placeholder( tf.float32, [None, NUM_STEPS, NUM_INPUTS] ) 
y = tf.placeholder( tf.float32, [None, NUM_STEPS, NUM_INPUTS] ) # X, but shifted to the left 1 element.
seq_len = tf.placeholder( tf.int32, [None] ) # 1D Tensor to hold the length of each sequence in a batch

basic_cell   = tf.contrib.rnn.BasicRNNCell( num_units=NUM_NEURONS )
wrapped_cell = tf.contrib.rnn.OutputProjectionWrapper( basic_cell, output_size=NUM_OUTPUTS )

outputs, states = tf.nn.dynamic_rnn( wrapped_cell, X, dtype=tf.float32, sequence_length=seq_len ) 

## Cost Function for Training
LEARNING_RATE = 0.001
loss        = tf.reduce_mean( tf.square( outputs-y ) )
optimizer   = tf.train.AdamOptimizer( learning_rate=LEARNING_RATE )
training_op = optimizer.minimize( loss )

init = tf.global_variables_initializer()

In [100]:
def get_training_batch( X_full, batch_number, batch_size ):
    start = batch_number*batch_size
    end   = start+batch_size
    if end > len(X_full):
        end = len(X_full)-1
    X_batch = X_full[start:end]
    y_batch = [ s[1:] for s in X_batch ]  # Shift the sequence to get a y
    seq_lens = [ len(s) for s in X_batch ]
    return np.array(X_batch), np.array(y_batch), np.array(seq_lens)

In [107]:
### Training Phase
BATCH_SIZE = 100
NUM_ITERATIONS = int(len(X_full) / BATCH_SIZE)

with tf.Session() as sess:
    init.run()
    for iteration in range(NUM_ITERATIONS):
        X_batch, y_batch, seq_lengths = get_training_batch( X_full, iteration, BATCH_SIZE )
        sess.run( training_op, feed_dict={X: X_batch, y: y_batch, seq_len: seq_lengths})
        if iteration % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(iteration, "\tMSE:", mse)

ValueError: setting an array element with a sequence.