# Predicting the Next Pitch Type
Using the [**mlbgame**](http://panz.io/mlbgame/) library, we can build a training set of pitch sequences. We will then use it to train an RNN to predict the next pitch given the previous pitches in an at-bat.

## Gathering Pitches
We need to get the actual pitch sequences out of the MLB data, so let's start by getting just the pitches from one game. We get the game's events, then for each inning we look at each at-bat in the top and bottom of the inning.

Pitch mappings:

* A - Changeup (CH)
* B - Curveball (CU)
* C - Cutter (FC)
* D - Eephus (EP)
* E - Forkball (FO)
* F - Four-Seam Fastball (FF) - There seems to be a discrepancy here. The official site says FA, and other tutorials disagree, but the data uses FF and I don't see any FA's.
* G - Knuckleball (KN)
* H - Knuckle-curve (KC)
* I - Screwball (SC)
* J - Sinker (SI)
* K - Slider (SL)
* L - Splitter (FS)
* M - Two-Seam Fastball (FT)

In [103]:
import mlbgame
import pickle
import os

pitch_dict = {"CH":"A", 
              "CU":"B", 
              "FC":"C", 
              "EP":"D", 
              "FO":"E", 
              "FF":"F", 
              "KN":"G", 
              "KC":"H", 
              "SC":"I", 
              "SI":"J", 
              "SL":"K", 
              "FS":"L", 
              "FT":"M",
              "PO":"", # Likely a pitchout; not one of the 13 pitch types, so it is dropped
              "IN":"", # Likely an intentional ball; also dropped
              "UN":"", # Likely unknown/unidentified; also dropped
              "":""}

def get_pitch_seqs_from_game( game_id ):
    pitches = []
    events  = mlbgame.game_events(game_id)
    for i in events:
        inning = events[i]
        # NOTE: pitch_str is reset once per inning, so each sequence spans
        # every at-bat in that inning (top and bottom), not a single at-bat.
        pitch_str = ""
        for ab in inning['top']+inning['bottom']:
            for pitch in ab.pitches:
                pitch_str += pitch_dict[ pitch.pitch_type ]
        pitches.append(pitch_str)
    return pitches

We convert the pitch sequences to strings of single-letter identifiers. Note: I did this initially because I thought it would help with feeding the data into a network, but I don't think it was necessary. I ended up making one-hot vectors anyway, so this is probably an unneeded extra step.
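To make the encoding concrete, here is a minimal sketch that applies the same kind of mapping to a hand-written list of pitch-type codes. The sample list and the `encode_pitch_types` helper are hypothetical (they are not part of mlbgame), and the mapping is trimmed to a few entries.

```python
# A trimmed copy of the pitch-type -> letter mapping used above
PITCH_TO_LETTER = {"CH": "A", "CU": "B", "FF": "F", "SL": "K", "FT": "M",
                   "PO": "", "IN": "", "UN": "", "": ""}

def encode_pitch_types(pitch_types):
    """Join the single-letter codes for a list of pitch-type codes."""
    return "".join(PITCH_TO_LETTER[p] for p in pitch_types)

sample = ["FF", "FF", "SL", "PO", "CH"]   # hypothetical pitch types
encoded = encode_pitch_types(sample)
print(encoded)  # FFKA
```

Codes that map to the empty string simply vanish from the sequence, which is why `PO` leaves no trace in the result.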

Next is a helper that collects the pitch sequences from a list of games.

In [4]:
def get_pitch_data_from_game_list( games ):
    pitch_data = []
    for game in games:
        try:
            pitch_data += get_pitch_seqs_from_game( game.game_id )
        except Exception:
            # Log and skip any game whose events cannot be loaded or parsed.
            print("error with game: "+game.game_id)
    return pitch_data

Let's see if we can get a few years of data. Here is a helper that loads the data for one year and caches it in a pickle file. This is crude, but as long as we give it well-behaved year identifiers we should be OK. Make sure the `pitch_data` directory exists, or you'll wait a long while only to see an annoying error from `open`.

In [104]:
PITCH_DATA_DIR = "pitch_data/"
def load_pitch_by_year( year ):
    fname = PITCH_DATA_DIR+"pitches_"+str(year)+".p"
    if os.path.isfile(fname):
        # Cached by an earlier run: load it instead of downloading again.
        with open( fname, "rb" ) as f:
            pitch_data = pickle.load( f )
    else:
        games_by_day = mlbgame.games( year )
        games = mlbgame.combine_games( games_by_day )
        pitch_data = get_pitch_data_from_game_list( games )
        pitch_data = [s for s in pitch_data if s != ''] # Removes the empty sequences.
        with open( fname, "wb" ) as f:
            pickle.dump( pitch_data, f )
    return pitch_data
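One way to avoid the missing-directory error is to create the directory just before writing. The sketch below shows the same load-or-build caching pattern with `os.makedirs(..., exist_ok=True)`. The `load_or_build` helper and the `fake_build` callback are hypothetical stand-ins for the slow download step, and a temporary directory stands in for `pitch_data/`.

```python
import os
import pickle
import tempfile

def load_or_build(fname, build):
    """Return the pickled data in fname, building and caching it on a miss."""
    if os.path.isfile(fname):
        with open(fname, "rb") as f:
            return pickle.load(f)
    data = build()
    # Create the cache directory if needed so open() cannot fail on it
    os.makedirs(os.path.dirname(fname), exist_ok=True)
    with open(fname, "wb") as f:
        pickle.dump(data, f)
    return data

cache_dir = os.path.join(tempfile.mkdtemp(), "pitch_data")
fname = os.path.join(cache_dir, "pitches_demo.p")
calls = []

def fake_build():
    calls.append(1)              # count how often the slow path runs
    return ["FFK", "AB"]

first = load_or_build(fname, fake_build)
second = load_or_build(fname, fake_build)   # served from the cache
print(first == second, len(calls))  # True 1
```

This version assumes `fname` always includes a directory component, as it does here.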

In [10]:
# If the list of years contains a new year, this will take a WHILE.
years_wanted = [2016, 2015, 2014]
pitches_by_year = {}

for y in years_wanted:
    pitches_by_year[str(y)] = load_pitch_by_year(y)

error with game: 2016_03_22_tbamlb_cubint_1
error with game: 2016_03_22_anamlb_slcaaa_1
error with game: 2016_03_31_phimlb_pfsmin_1
error with game: 2016_03_31_sdnmlb_elpaaa_1
error with game: 2016_04_02_detmlb_atlmlb_1
error with game: 2016_04_02_milmlb_blxaax_1
error with game: 2016_04_04_houmlb_nyamlb_1
error with game: 2016_04_04_bosmlb_clemlb_1
error with game: 2016_04_09_miamlb_wasmlb_1
error with game: 2016_04_10_nyamlb_detmlb_1
error with game: 2016_04_17_balmlb_texmlb_1
error with game: 2016_04_27_milmlb_chnmlb_1
error with game: 2016_04_28_pitmlb_colmlb_1
error with game: 2016_04_30_atlmlb_chnmlb_1
error with game: 2016_05_16_bosmlb_kcamlb_1
error with game: 2016_05_26_chamlb_kcamlb_1
error with game: 2016_06_08_clemlb_seamlb_1
error with game: 2016_09_25_atlmlb_miamlb_1
error with game: 2016_10_03_clemlb_detmlb_1
error with game: 2015_03_04_sdnmlb_seamlb_1
error with game: 2015_03_06_sfnmlb_texmlb_1
error with game: 2015_03_07_arimlb_seamlb_1
error with game: 2015_03_07_cinm

error with game: 2014_09_18_bosmlb_pitmlb_1
error with game: 2014_09_18_wasmlb_miamlb_1
error with game: 2014_10_03_kcamlb_anamlb_1
error with game: 2014_10_07_lanmlb_slnmlb_1
error with game: 2014_10_07_wasmlb_sfnmlb_1
error with game: 2014_10_11_kcamlb_balmlb_1
error with game: 2014_10_13_balmlb_kcamlb_1
error with game: 2014_10_22_sfnmlb_kcamlb_1
error with game: 2014_10_28_sfnmlb_kcamlb_1


In [15]:
total = 0
for y in years_wanted:
    total += len(pitches_by_year[str(y)])
    print( "year: "+str(y)+" count: "+str(len(pitches_by_year[str(y)])) )
print("total: "+str(total))

year: 2016 count: 23561
year: 2015 count: 23085
year: 2014 count: 22781
total: 69427


Sweet. That's about 70k sequences to train on. To create a model, I'll need to know the length of the longest sequence, since that determines how many time steps the RNN has to cover.

In [105]:
# I'll need to know the length of the longest sequence so I can size the time-step of the RNN.
for y in years_wanted:
    longest = max( pitches_by_year[str(y)], key=len )
    print(longest + " len: " + str(len(longest)))

KBKKAFKBFKFBMFKMMFFFKFKMMMKKMFKMMMAFKFMFAKMMMMFMBFBFFFFFKFFACCFBABCCJBAAJAJCJCAAFCACJKKKKFF len: 91
MBFKABFKFKMMMBMMMFFKFAMMMMMABMFCFFCFFCFKFFKCFCFFFFHFFFFFHHFAFFFFFHAFFFAFHFFFFMAFMCCFFCMMBF len: 90
FKFFFKFFFFKMMBKKJAMMBMMBMJJBMMMMJMBJMFFFBFAFFFBFFBAFFFAFFFBFAFBBFFFBFFFMKMMKKMKMMKKKKKKK len: 88
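The per-year maxima can also be folded into a single number for sizing the padding. A small sketch, using a hypothetical miniature stand-in for `pitches_by_year`:

```python
# Hypothetical miniature version of pitches_by_year
pitches_by_year_demo = {
    "2016": ["FFK", "ABFFK"],
    "2015": ["M", "KKFFAB"],
}

# Longest sequence across every year, usable as the padding length
pad_to_length = max(len(seq)
                    for seqs in pitches_by_year_demo.values()
                    for seq in seqs)
print(pad_to_length)  # 6
```

Computing the value this way avoids hard-coding it, so adding another year of data cannot silently exceed the padded length.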


## Building a Model
First, some convenience functions: one builds a data set of one-hot vectors, and another gets a training batch (along with its y values and sequence lengths).

In [185]:
import numpy as np
import random

VECTOR_SIZE   = 13 # Number of pitch types
PAD_TO_LENGTH = 91 # Size of longest sequence
pitch_char_dict = { "A":0,
                    "B":1,
                    "C":2,
                    "D":3,
                    "E":4,
                    "F":5,
                    "G":6,
                    "H":7,
                    "I":8,
                    "J":9,
                    "K":10,
                    "L":11,
                    "M":12}

# Create 1-Hot vectors from data set
def create_one_hot_series( seq ):
    vectors = []
    for char in seq:
        v = [ 0.0 for i in range(VECTOR_SIZE)]
        v[ pitch_char_dict[char] ] = 1.0
        vectors.append(v)
    for i in range( len(seq), PAD_TO_LENGTH ):
        v = [ 0.0 for i in range(VECTOR_SIZE)]
        vectors.append(v)
    return vectors    

# Grab a new randomized batch from the full data set.  Also returns the y labels, and the sequence lengths
def get_training_batch( X_full, batch_size ):
    random.shuffle( X_full )  # Shuffles the full data set in place
    X_batch = X_full[:batch_size] 
    # NOTE: every series in X_full is already padded, so len(s) is always
    # PAD_TO_LENGTH here rather than the true number of pitches.
    seq_lens = [ len(s) for s in X_batch ]
    y_batch = [ s[1:] for s in X_batch ]  # Shift the sequence to get a y
    # Add empty pitch vector at end to keep the same shape as X
    for seq in y_batch:
        v = [ 0.0 for _ in range(VECTOR_SIZE) ]
        seq.append(v)
    return np.asarray(X_batch), np.asarray(y_batch), np.asarray(seq_lens, dtype=np.int32)
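Since `create_one_hot_series` pads every series to `PAD_TO_LENGTH`, taking `len()` of a padded series always gives 91. If the true lengths are wanted, they have to be recovered from the vectors themselves. Below is a minimal sketch of one way to do that, by counting the rows that contain a 1.0. The `true_length` and `one_hot_padded` helpers are hypothetical, and the sizes are shrunk for readability.

```python
VECTOR_SIZE_DEMO = 3     # shrunk from 13
PAD_TO_LENGTH_DEMO = 5   # shrunk from 91

def one_hot_padded(indices):
    """One-hot encode a list of class indices and zero-pad to a fixed length."""
    rows = []
    for idx in indices:
        row = [0.0] * VECTOR_SIZE_DEMO
        row[idx] = 1.0
        rows.append(row)
    while len(rows) < PAD_TO_LENGTH_DEMO:
        rows.append([0.0] * VECTOR_SIZE_DEMO)
    return rows

def true_length(series):
    """Count the rows that hold a pitch (padding rows are all zeros)."""
    return sum(1 for row in series if any(row))

series = one_hot_padded([0, 2, 1])
print(len(series), true_length(series))  # 5 3
```

An alternative would be to store each length alongside its series when the data set is built, which avoids rescanning the vectors on every batch.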

In [174]:
# Build the actual data set.
full_list = []
for y in years_wanted:
    full_list += pitches_by_year[str(y)]

# TODO - Split this for validation
X_full = []  # This will be the entire data set.  
for seq in full_list:
    X_full.append( create_one_hot_series( seq ) )
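The TODO above could be handled with a simple shuffled split. The sketch below is one possible approach, not something the notebook does. The `split_train_validation` helper, the 10% fraction, and the fixed seed are all assumptions, and a list of placeholder strings stands in for `X_full`.

```python
import random

def split_train_validation(data, validation_fraction=0.1, seed=0):
    """Shuffle a copy of data and split off a validation set."""
    shuffled = list(data)                   # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded for a repeatable split
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

demo = ["seq%d" % i for i in range(20)]     # stand-in for X_full
train, val = split_train_validation(demo)
print(len(train), len(val))  # 18 2
```

Splitting before training keeps the validation sequences out of every training batch, so the validation loss reflects data the network has not seen.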

In [210]:
# Pitch Prediction Network
import tensorflow as tf
tf.reset_default_graph()

# Construction Phase ###############
NUM_INPUTS  = 13    # Size of the input vector (the number of possible pitch types)
NUM_OUTPUTS = 13    # Want a pitch type out, so same size as input.
NUM_NEURONS = 5     # Just a placeholder.  It's what the book uses for the sequence example
NUM_STEPS   = 91    # Size of largest input sequence - So all training can fit.

## Build Graph
X = tf.placeholder( tf.float32, [None, NUM_STEPS, NUM_INPUTS] ) 
y = tf.placeholder( tf.float32, [None, NUM_STEPS, NUM_INPUTS] ) # X, but shifted to the left 1 element.
seq_len = tf.placeholder( tf.int32, [None] ) # 1D Tensor to hold the length of each sequence in a batch

basic_cell   = tf.contrib.rnn.BasicRNNCell( num_units=NUM_NEURONS )
wrapped_cell = tf.contrib.rnn.OutputProjectionWrapper( basic_cell, output_size=NUM_OUTPUTS )

outputs, states = tf.nn.dynamic_rnn( wrapped_cell, X, dtype=tf.float32, sequence_length=seq_len ) 

## Cost Function for Training
LEARNING_RATE = 0.001
loss        = tf.reduce_mean( tf.square( outputs-y ) )
optimizer   = tf.train.AdamOptimizer( learning_rate=LEARNING_RATE )
training_op = optimizer.minimize( loss )

init = tf.global_variables_initializer()

In [211]:
### Training Phase
EPOCHS = 1
BATCH_SIZE = 100
NUM_ITERATIONS = int( len(X_full)/BATCH_SIZE )*EPOCHS

with tf.Session() as sess:
    init.run()
    for iteration in range(NUM_ITERATIONS):
        X_batch, y_batch, seq_lengths = get_training_batch( X_full, BATCH_SIZE )
        sess.run( training_op, feed_dict={X: X_batch, y: y_batch, seq_len: seq_lengths})
        if iteration % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch, seq_len: seq_lengths})
            print(iteration, "\tMSE:", mse)
# NOTE: the session closes when the with-block exits.  Nothing is saved to a
# checkpoint, so the trained variable values are not available afterwards.

0 	MSE: 0.0357572
100 	MSE: 0.0258345
200 	MSE: 0.0220188
300 	MSE: 0.0213313
400 	MSE: 0.0197646
500 	MSE: 0.0197821
600 	MSE: 0.0198067
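The log stops at iteration 600 because of how the iteration count is derived. A quick check of the arithmetic, using the sequence total reported earlier:

```python
total_sequences = 69427   # total count printed earlier
batch_size = 100
epochs = 1

num_iterations = (total_sequences // batch_size) * epochs
# Iterations at which the training loop prints the MSE
logged = list(range(0, num_iterations, 100))
print(num_iterations, logged)  # 694 [0, 100, 200, 300, 400, 500, 600]
```

Because `get_training_batch` reshuffles the whole data set and takes the first `BATCH_SIZE` items on every call, these 694 batches are random draws rather than one clean pass over the data.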


## Evaluation

In [216]:
# Takes a sequence, encoded as 'AABCD', and predicts the next one.
def predict_pitch_from_seq( seq ):
    seq_one_hot = create_one_hot_series( seq )
    seq_tensor  = np.asarray( [seq_one_hot] )
    len_tensor  = np.asarray( [len(seq)] , dtype=np.int32)
    # NOTE: tf.Session() creates a brand-new session whose variables have never
    # been initialized or trained, which is what triggers the error below.
    seq_y = tf.Session().run( outputs, feed_dict={ X:seq_tensor, seq_len:len_tensor})
    print(seq_y)
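The function prints the raw network output, which has one row of 13 scores per time step. Since `y` is `X` shifted left by one element, the scores for the next pitch sit at time step `len(seq) - 1`. Below is a sketch of how those scores could be turned back into a pitch letter. The `decode_next_pitch` helper is hypothetical, and the scores are made up rather than taken from a trained network.

```python
LETTERS = "ABCDEFGHIJKLM"   # index -> pitch letter, matching pitch_char_dict

def decode_next_pitch(step_outputs, seq_len):
    """Return the letter with the highest score at the last real time step."""
    scores = step_outputs[seq_len - 1]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LETTERS[best]

# Made-up scores for a 3-pitch sequence padded to 4 time steps
fake_outputs = [[0.0] * 13 for _ in range(4)]
fake_outputs[2][5] = 0.7    # index 5 is "F"
fake_outputs[2][10] = 0.2   # index 10 is "K"
print(decode_next_pitch(fake_outputs, 3))  # F
```

With real output from the network, `step_outputs` would be the first (and only) element of the batch returned by `run`.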

In [221]:
s = "AAA"
predict_pitch_from_seq(s)

FailedPreconditionError: Attempting to use uninitialized value rnn/basic_rnn_cell/weights
	 [[Node: rnn/basic_rnn_cell/weights/read = Identity[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/basic_rnn_cell/weights)]]

Caused by op 'rnn/basic_rnn_cell/weights/read', defined at:
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/ipykernel/kernelapp.py", line 477, in start
    ioloop.IOLoop.instance().start()
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/zmq/eventloop/ioloop.py", line 177, in start
    super(ZMQIOLoop, self).start()
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 440, in _handle_events
    self._handle_recv()
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 472, in _handle_recv
    self._run_callback(callback, msg)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/zmq/eventloop/zmqstream.py", line 414, in _run_callback
    callback(*args, **kwargs)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
    handler(stream, idents, msg)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/ipykernel/ipkernel.py", line 196, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/ipykernel/zmqshell.py", line 533, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2698, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2802, in run_ast_nodes
    if self.run_code(code, result):
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-210-abc2bf8bc559>", line 19, in <module>
    outputs, states = tf.nn.dynamic_rnn( wrapped_cell, X, dtype=tf.float32, sequence_length=seq_len )
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 553, in dynamic_rnn
    dtype=dtype)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 720, in _dynamic_rnn_loop
    swap_memory=swap_memory)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2623, in while_loop
    result = context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2456, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2406, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 703, in _time_step
    skip_conditionals=True)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 177, in _rnn_step
    new_output, new_state = call_cell()
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/rnn.py", line 691, in <lambda>
    call_cell = lambda: cell(input_t, state)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py", line 498, in __call__
    output, res_state = self._cell(inputs, state)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py", line 122, in __call__
    _linear([inputs, state], self._num_units, True))
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/contrib/rnn/python/ops/core_rnn_cell_impl.py", line 1044, in _linear
    _WEIGHTS_VARIABLE_NAME, [total_arg_size, output_size], dtype=dtype)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 1049, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 948, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 356, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 341, in _true_getter
    use_resource=use_resource)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 714, in _get_single_variable
    validate_shape=validate_shape)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/variables.py", line 197, in __init__
    expected_shape=expected_shape)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/variables.py", line 313, in _init_from_args
    self._snapshot = array_ops.identity(self._variable, name="read")
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 1338, in identity
    result = _op_def_lib.apply_op("Identity", input=input, name=name)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
    op_def=op_def)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/Users/wpower/Documents/ML_Workspace/env/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
    self._traceback = _extract_stack()

FailedPreconditionError (see above for traceback): Attempting to use uninitialized value rnn/basic_rnn_cell/weights
	 [[Node: rnn/basic_rnn_cell/weights/read = Identity[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/basic_rnn_cell/weights)]]
