# Basic Pitch Model
This model provides a baseline to compare against. The other models will attempt to show that they learn more useful patterns than the pitch sequences alone provide. This model takes only the pitches as input, with no additional features.

### Data
The data for all the models will be the pitch sequences from all MLB games during the past four seasons (2013 through 2016). Two scripts were used to collect the data. The first updates the lahmanDB of baseball statistics to include the mlbgameID that is used by the Python library that exposes the pitch sequence data. Once the database includes the IDs for the years to be sequenced, the second script is run. It reads the at-bat data for each game in a given year and produces a set of example feature vectors, which are saved in pickled format for later use.

For this model, only the sequences are needed, so the pickled files need to be read in and converted to a format that contains just these sequences and that TensorFlow can consume. Additionally, the sequences should be converted into one-hot vectors. TODO: find a way to do this with TensorFlow.

Once the entire data set has been formatted correctly, the data needs to be separated into training, validation, and testing sets.
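
One simple way to do this split is to shuffle the at-bats once with a fixed seed and slice the result. The sketch below is only an illustration, not the split used in this notebook: the helper name `split_dataset`, the 80/10/10 fractions, and the seed are all assumptions, and the toy list stands in for `cleaned_data`.

```python
import random

def split_dataset(sequences, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a copy of `sequences` and slice it into train/val/test lists."""
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Toy stand-in for cleaned_data: 20 fake at-bats of two pitches each.
toy = [["FF", "SL"] for _ in range(20)]
train, val, test = split_dataset(toy)
print(len(train), len(val), len(test))
```

Splitting by at-bat puts at-bats from the same game in different sets. If that turns out to leak too much information, splitting by game or by date may be a better choice.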

In [1]:
import pickle
# First we get the complete data set.
# Right now I just have the first 6 months of the 2016 season.
# Each entry here is an array with two entries. The first contains a simple feature
# vector: [pitcher_hand, batter_hand, batter_pos]; the second is a list of
# pitch identifiers. Each identifier is a 2-letter code that maps to the
# typical pitch name.
full_data = [] 
year = 2016
for m in [3,4,5,6,7,8]:
    fn = "../data/pitches_{}_{}.p".format(year, m)
    with open(fn, "rb") as f:  # close the file handle once the pickle is read
        seqs = pickle.load(f)
    full_data += seqs

# The MLBgame api documentation is incomplete, but from reading 
# the source code, there should be a total of 16 pitch types.   
cleaned_data = [] # no 0 or 1 length sequences. 
longest_seq = 0
empties_or_single = 0
pitch_types = set()
for line in full_data:
    seq = line[1]  # the seq is the second element, first is the feature vector
    longest_seq = max(longest_seq, len(seq))
    if len(seq) <= 1:
        empties_or_single += 1
    else:
        cleaned_data.append(seq)
        pitch_types.update(seq)

print("longest sequence length: {}\nempties: {}\ntotal (clean): {}\npitch types: {}".format(longest_seq, 
                                                                                            empties_or_single,
                                                                                            len(cleaned_data),
                                                                                            len(pitch_types)))
pitch_counts = { p: 0 for p in pitch_types }
for line in cleaned_data:
    for p in line:
        pitch_counts[p] += 1

# To ensure the mappings make sense, let's count the pitch occurrences and compare them to
# their actual names.
for pitch, count in pitch_counts.items():
    print(pitch, count)

# Save the cleaned data to a pickle to make it easier to work with in the other models.
with open("../data/pitches_full_{}.p".format(year), "wb") as f:
    pickle.dump(cleaned_data, f)


longest sequence length: 18
empties: 13757
total (clean): 110216
pitch types: 16
CU 40774
UN 5
CH 48577
IN 1853
FS 7154
FT 62120
PO 125
KC 9628
FC 23526
FF 164046
FO 188
SC 26
KN 1504
EP 14
SI 34286
SL 70932


```
* 'SL' - 70932  -  slider
* 'KC' - 9628   -  knuckle-curve
* 'CH' - 48577  -  changeup
* 'SI' - 34286  -  fastball (sinker)
* 'FO' - 188    -  pitch-out
* 'FS' - 7154   -  fastball 
* 'CU' - 40774  -  curveball
* 'PO' - 125    -  pitch-out (would be better modeled with on-base info)
* 'KN' - 1504   -  knuckleball
* 'FF' - 164046 -  fastball (four-seam)
* 'EP' - 14     -  eephus
* 'IN' - 1853   -  intentional walk (again, maybe better modeled with on-base info)
* 'SC' - 26     -  screwball
* 'FT' - 62120  -  fastball (two-seam)
* 'FC' - 23526  -  fastball (cutter)   
* 'UN' - 5      -  unidentified (need to deal with this)
```

This is encouraging: the 'odd' pitch types, like unidentified, pitch-out, and eephus, occur in very small numbers. Right now my focus is on getting a network graph built correctly and training it properly, but a stretch goal could be to address the 'meta strategy' pitches like the pitch-out and intentional walk. I'll keep them in for now, but I might filter them out, or regather data that includes the on-base info, and see if that helps.
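
Since this model is meant as a baseline, the counts above can also provide rough reference points for judging it. The sketch below computes the accuracy of always guessing the most common pitch, and the cross-entropy (in nats) of always predicting the overall pitch frequencies. Treat both as approximations: the counts include the first pitch of every at-bat, which the model never has to predict, and the variable names are introduced here for illustration.

```python
import math

# Pitch counts copied from the output of the first cell.
pitch_counts = {
    "CU": 40774, "UN": 5, "CH": 48577, "IN": 1853, "FS": 7154, "FT": 62120,
    "PO": 125, "KC": 9628, "FC": 23526, "FF": 164046, "FO": 188, "SC": 26,
    "KN": 1504, "EP": 14, "SI": 34286, "SL": 70932,
}

total = sum(pitch_counts.values())
probs = {p: c / total for p, c in pitch_counts.items()}

# Accuracy of always guessing the most common pitch.
most_common = max(pitch_counts, key=pitch_counts.get)
majority_accuracy = probs[most_common]

# Cross-entropy (nats) of always predicting the overall pitch frequencies.
unigram_entropy = -sum(q * math.log(q) for q in probs.values())

# Cross-entropy (nats) of a uniform guess over all pitch types.
uniform_loss = math.log(len(pitch_counts))

print(most_common, round(majority_accuracy, 3))
print(round(unigram_entropy, 3), round(uniform_loss, 3))
```

A sequence model that has learned anything beyond pitch frequencies should end up with a loss below the frequency-only figure, so it is a useful number to keep next to the training log.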

The goal of this investigation was to understand the shape of the input data. Each sequence will be composed of vectors of length 16, with one entry per pitch identifier above. These sequences of one-hot vectors must be created and padded to the length of the longest sequence (18).

In [2]:
import numpy as np

# Creating X - padded sequences of one-hots. Need a dictionary of pitch types to an index.
pitch_map = {
    'KC': 0,
    'CH': 1,
    'SL': 2,
    'SI': 3,
    'FO': 4,
    'FS': 5,
    'CU': 6,
    'PO': 7,
    'KN': 8,
    'FF': 9,
    'EP': 10,
    'IN': 11,
    'SC': 12,
    'FT': 13,
    'FC': 14,
    'UN': 15
}

MAX_LENGTH = longest_seq

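def create_onehot(seq):
    # Inputs are every pitch except the last: the final pitch has no successor
    # to predict. Dropping it keeps the number of non-zero rows in X equal to
    # the number of real targets in y, which matters because the loss mask is
    # derived from X. Otherwise the last step is scored against the padding label 0.
    ret = []
    for p in seq[:-1]:
        p_oh = np.zeros((len(pitch_map),), dtype=np.float32)
        p_oh[pitch_map[p]] = 1.0
        ret.append(p_oh)
    for _ in range(len(ret), MAX_LENGTH):  # Pad to length.
        ret.append(np.zeros((len(pitch_map),), dtype=np.float32))
    return ret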

def create_target(seq):
    ret = []
    i = 0
    for p in seq[1:]:
        ret.append(pitch_map[p])
        i += 1
    for j in range(i, MAX_LENGTH):
        ret.append(0)
    return ret

X_full = [] # Sequences of onehots.
y_full = [] # index of correct pitch in the one-hot, starting at X[1]
for line in cleaned_data:
    X_full.append(create_onehot(line))
    y_full.append(create_target(line))

    
# these should both be 18. Nice.
print(len(X_full[0]), len(y_full[0]))

18 18
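
To make the relationship between inputs and targets concrete, here is a small sketch with a hypothetical three-pitch at-bat. The reduced `toy_map` and the sequence are made up for illustration. The point is only that the target at each step is the pitch that follows the input, so an at-bat of n pitches yields n - 1 labeled steps.

```python
# Hypothetical at-bat and a reduced pitch map, for illustration only.
toy_map = {"FF": 0, "SL": 1, "CH": 2}
toy_seq = ["FF", "SL", "CH"]

# Pair every pitch with the one thrown after it; the last pitch has no successor.
pairs = list(zip(toy_seq, toy_seq[1:]))
target_ids = [toy_map[nxt] for _, nxt in pairs]

print(pairs)       # [('FF', 'SL'), ('SL', 'CH')]
print(target_ids)  # [1, 2]
```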


### Model

The first pass of the model is a very basic RNN. Once I can get this training on any data, I'll focus on turning it into an architecture that can actually work, for example by adding multiple layers or using a different cell (like the LSTM). For now I'm just trying to get the basic RNN cell working.
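
For reference, the recurrence a basic RNN cell computes can be written as below. The symbols are notation introduced here, not names from the code: $x_t$ is the one-hot pitch at step $t$, $h_t$ is the hidden state (of size `NUM_NEURONS`), and $W_x$, $W_h$, $b$ are the learned parameters. `BasicRNNCell` uses $\tanh$ by default, and `dynamic_rnn` starts from a zero state when no initial state is given.

$$
h_t = \tanh(W_x x_t + W_h h_{t-1} + b), \qquad h_0 = 0
$$

Assuming a linear output layer with parameters $W_o$ and $b_o$ on top of the cell, the hidden state is turned into a distribution over the 16 pitch types for the next pitch:

$$
z_t = W_o h_t + b_o, \qquad P(\text{pitch}_{t+1} = k \mid x_1, \dots, x_t) = \frac{e^{z_{t,k}}}{\sum_{j=1}^{16} e^{z_{t,j}}}
$$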

In [None]:
import tensorflow as tf
tf.reset_default_graph()

# Assumptions about data:
#  - X Padded to MAX_SIZE, with 0-vectors of size(pitch_types)
#  - X only includes the pitch sequences. 
#  - Note: in the other models, each input in the seq will also have the additional feature 
#          vector for the at-bat.
#  - y Padded to MAX_SIZE, with 0's.  (get length off of X, though)

##### Construction Phase ###############
NUM_INPUTS  = 16    # Size of the input vector (the number of possible pitch types)
NUM_OUTPUTS = 16    # Want a pitch type out, so same size as input.
NUM_NEURONS = 10     # Number of neurons inside the RNN cell.  
MAX_SIZE    = 18    # the maximum size of a sequence.  Everything gets padded to this, and masked.
BATCH_SIZE = 5
LEARNING_RATE = 0.015

### RNN Graph
# 0-vector padded sequences.  
X = tf.placeholder( tf.float32, [BATCH_SIZE, MAX_SIZE, NUM_INPUTS] ) 
# y is X shifted to the left, but also converted to the *index* of the correct logit - for seq2seq loss.
y = tf.placeholder( tf.int32, [BATCH_SIZE, MAX_SIZE] ) 

# Get a 1D tensor holding the 'true' length of each padded sequence in a batch.
collapsed_features = tf.sign(tf.reduce_max(tf.abs(X), 2)) # max+abs gives 1 for real timesteps, 0 for 0-vector padding
seq_len  = tf.cast( tf.reduce_sum(collapsed_features, 1), tf.int32 ) # Count the 1's to get the length.
seq_mask = tf.sequence_mask(seq_len, maxlen=MAX_SIZE, dtype=tf.float32) # Create a mask from these lengths.

basic_cell   = tf.contrib.rnn.BasicRNNCell( num_units=NUM_NEURONS )
# output is shaped [BATCH_SIZE, MAX_SIZE, NUM_NEURONS]
outputs, states = tf.nn.dynamic_rnn( basic_cell, X, dtype=tf.float32, sequence_length=seq_len ) 

### Loss, Optimization, Training.  

# sequence_loss compares the logits predicted at each timestep to the index
# of the correct label, given by y.  seq_mask zeroes out the loss on padded
# timesteps (sequence_length above is what stops the dynamic RNN from
# unrolling past the end of each sequence).

# NOTE: The RNN outputs are [BATCH_SIZE, MAX_SIZE, NUM_NEURONS], so they can't be
#       used directly as a prediction.  Another network layer is needed to convert
#       them into a [BATCH_SIZE, MAX_SIZE, NUM_OUTPUTS] tensor (logits for each pitch type).
#
# As far as I can tell, this is normally done with a fully connected layer between the
# RNN outputs and the loss function.  activation_fn=None keeps that layer linear; the
# default is ReLU, which would clip negative logits before the softmax inside the loss.
logits = tf.contrib.layers.fully_connected(outputs, NUM_OUTPUTS, activation_fn=None)

loss = tf.contrib.seq2seq.sequence_loss(logits, 
                                        y, 
                                        seq_mask, 
                                        average_across_timesteps=True, 
                                        average_across_batch=True)
tf.summary.scalar('loss', loss)

optimizer = tf.train.AdamOptimizer(learning_rate=LEARNING_RATE)

# Once we have a loss function, we can just let the optimizer do its job. (hopefully)
training_op = optimizer.minimize( loss )

init = tf.global_variables_initializer()
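
The length, mask, and loss steps in the graph above can be hard to picture, so here is a plain-Python sketch of the same idea on a toy batch. It mirrors the intent of `tf.sequence_mask` and of `sequence_loss` with both averaging flags on, but it is not the library code; the helper names and the toy data are assumptions made for illustration.

```python
import math

def sequence_lengths(batch):
    """Count the non-zero rows in each zero-padded sequence of one-hot vectors."""
    return [sum(1 for step in seq if any(v != 0 for v in step)) for seq in batch]

def length_mask(lengths, max_size):
    """1.0 for real timesteps, 0.0 for padding."""
    return [[1.0 if t < n else 0.0 for t in range(max_size)] for n in lengths]

def cross_entropy(logits, label):
    """Softmax cross-entropy for one timestep, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def masked_sequence_loss(logits, targets, mask):
    """Average cross-entropy over the unmasked timesteps of the whole batch."""
    total, weight = 0.0, 0.0
    for seq_logits, seq_targets, seq_mask in zip(logits, targets, mask):
        for z, y, w in zip(seq_logits, seq_targets, seq_mask):
            total += w * cross_entropy(z, y)
            weight += w
    return total / (weight + 1e-12)  # small constant guards against an all-padding batch

# Toy batch: 2 sequences, 3 timesteps, 4 pitch types; all-zero rows are padding.
X_toy = [
    [[1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]],
    [[0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
]
y_toy = [[2, 3, 0], [1, 0, 0]]

lengths = sequence_lengths(X_toy)
mask = length_mask(lengths, max_size=3)
uniform_logits = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
loss = masked_sequence_loss(uniform_logits, y_toy, mask)
print(lengths, mask, loss)
```

With all logits equal, the loss is the log of the number of classes no matter what the targets are. For the 16 pitch types here that is ln 16, about 2.77, which is close to the first loss logged during training (about 2.83), as expected for a freshly initialized network.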


### Train

Right now I'm still trying to get the network to actually process a batch of examples.  

In [None]:
#### Training Phase ###############
EPOCHS     = 10 # Will need to figure out what this should be. dVC stuff?
ITERATIONS = 10000

# NOTE: Hacky right now, but just want to get data into the model.
# TODO: actually turn the data into tf.Dataset objects? 
# TODO: or use some of the batch operations?
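def get_training_batch(X, y, batch_size):
    # Sample batch_size examples uniformly at random (with replacement).
    # Only the sampled rows are converted to arrays; converting all of X
    # on every call would copy the whole data set each iteration.
    ids = np.random.randint(0, len(X), batch_size)
    return np.array([X[i] for i in ids]), np.array([y[i] for i in ids])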

merged = tf.summary.merge_all()

# Testing for now, want to see if it actually updates on one batch.
with tf.Session() as sess:
    summary_writer = tf.summary.FileWriter('../data', sess.graph)
    init.run()
    
    # For debugging
    X_batch, y_batch = get_training_batch( X_full, y_full, BATCH_SIZE )
    print("shape(seq_len): ", sess.run(seq_len, feed_dict={X: X_batch, y: y_batch}).shape)
    print("shape(seq_mask): ",sess.run(seq_mask, feed_dict={X: X_batch, y: y_batch}).shape)
    print("shape(outputs): ", sess.run(outputs, feed_dict={X: X_batch, y: y_batch}).shape)
    print("shape(state): ",   sess.run(states, feed_dict={X: X_batch, y: y_batch}).shape)
    print("shape(logits): ",  sess.run(logits, feed_dict={X: X_batch, y: y_batch}).shape)
    
    for i in range(ITERATIONS):
        X_batch, y_batch = get_training_batch( X_full, y_full, BATCH_SIZE )   
        _, l, summary = sess.run([training_op, loss, merged], feed_dict={X: X_batch, y: y_batch})
        summary_writer.add_summary(summary, i)
        if i%10 == 0: print("loss at i {}: {}".format(i, l))
            
    summary_writer.close()


shape(seq_len):  (5,)
shape(seq_mask):  (5, 18)
shape(outputs):  (5, 18, 10)
shape(state):  (5, 10)
shape(logits):  (5, 18, 16)
loss at i 0: 2.8267900943756104
loss at i 10: 2.709425449371338
loss at i 20: 2.0089735984802246
loss at i 30: 1.9500250816345215
loss at i 40: 2.181269645690918
loss at i 50: 1.847040057182312
loss at i 60: 1.6691128015518188
loss at i 70: 2.297168016433716
loss at i 80: 1.9176737070083618
loss at i 90: 1.7170404195785522
loss at i 100: 2.0219006538391113
loss at i 110: 1.9986138343811035
loss at i 120: 2.2544689178466797
loss at i 130: 2.174241304397583
loss at i 140: 2.6272571086883545
loss at i 150: 2.4405465126037598
loss at i 160: 1.7813363075256348
loss at i 170: 1.6529394388198853
loss at i 180: 1.4457261562347412
loss at i 190: 1.6243501901626587
loss at i 200: 2.1029717922210693
loss at i 210: 1.7985771894454956
loss at i 220: 2.3652803897857666
loss at i 230: 1.635446548461914
loss at i 240: 2.4044580459594727
loss at i 250: 2.086073875427246
...

loss at i 2360: 2.091217279434204
loss at i 2370: 2.360408306121826
loss at i 2380: 1.9183526039123535
loss at i 2390: 1.645511507987976
loss at i 2400: 2.0755250453948975
loss at i 2410: 1.9413864612579346
loss at i 2420: 1.3453856706619263
loss at i 2430: 1.8689213991165161
loss at i 2440: 1.8482835292816162
loss at i 2450: 2.1978063583374023
loss at i 2460: 1.7647393941879272
loss at i 2470: 2.0937201976776123
loss at i 2480: 1.6762148141860962
loss at i 2490: 1.7376598119735718
loss at i 2500: 1.4741389751434326
loss at i 2510: 1.693228840827942
loss at i 2520: 2.164606809616089
loss at i 2530: 1.3185579776763916
loss at i 2540: 1.1575065851211548
loss at i 2550: 2.065110921859741
loss at i 2560: 1.4948843717575073
loss at i 2570: 1.3689886331558228
loss at i 2580: 1.369441032409668
loss at i 2590: 1.9275054931640625
loss at i 2600: 1.9632614850997925
loss at i 2610: 1.8672701120376587
loss at i 2620: 1.294786810874939
loss at i 2630: 2.096583604812622
loss at i 2640: 1.52465772628

loss at i 4720: 2.4235005378723145
loss at i 4730: 1.627945899963379
loss at i 4740: 2.2439002990722656
loss at i 4750: 1.6502056121826172
loss at i 4760: 2.176772356033325
loss at i 4770: 1.854501724243164
loss at i 4780: 1.66303288936615
loss at i 4790: 1.6711256504058838
loss at i 4800: 1.96393883228302
loss at i 4810: 1.4162349700927734
loss at i 4820: 1.7785613536834717
loss at i 4830: 1.650625467300415
loss at i 4840: 1.2669119834899902
loss at i 4850: 1.7860535383224487
loss at i 4860: 1.5783164501190186
loss at i 4870: 1.3400689363479614
loss at i 4880: 2.110015392303467
loss at i 4890: 1.8689706325531006
loss at i 4900: 0.9505634307861328
loss at i 4910: 2.049506902694702
loss at i 4920: 1.9714382886886597
loss at i 4930: 1.77403724193573
loss at i 4940: 1.5663522481918335
loss at i 4950: 1.5247989892959595
loss at i 4960: 1.6990342140197754
loss at i 4970: 2.230835199356079
loss at i 4980: 1.6096924543380737
loss at i 4990: 2.0491158962249756
loss at i 5000: 1.992991924285888
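
The training loop above samples each batch at random with replacement, so `EPOCHS` is never used and some at-bats may be seen more often than others. One possible alternative, sketched below, is to shuffle the example indices once per epoch and walk through them in fixed-size chunks. The generator name `epoch_batches` is hypothetical, and dropping the last partial batch is a choice made here so that every batch matches the fixed `BATCH_SIZE` dimension of the placeholders.

```python
import random

def epoch_batches(n_examples, batch_size, seed=None):
    """Yield lists of indices that visit each example at most once, in shuffled order.

    The last partial batch is dropped so every batch has exactly batch_size entries.
    """
    order = list(range(n_examples))
    random.Random(seed).shuffle(order)
    for start in range(0, n_examples - batch_size + 1, batch_size):
        yield order[start:start + batch_size]

# Toy run: 12 examples in batches of 5 gives two full batches; 2 examples are skipped.
batches = list(epoch_batches(n_examples=12, batch_size=5, seed=0))
print(len(batches), [len(b) for b in batches])
```

Each list of indices could then be used to pick rows from `X_full` and `y_full`, with a fresh shuffle at the start of every epoch.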

### Evaluate