# Hand and Position Model

This model is similar to the previous one, but each vector in a sequence will also include the handedness (pitcher and batter) and batter position data from the at-bat.

### Data
First, the new example vectors need to be created.  The y tensor is exactly the same, but extra work needs to be done to create X.

In [10]:
import pickle
import numpy as np
 
full_data = [] 
year = 2016
for m in [3,4,5,6,7,8]:
    fn = "../data/pitches_{}_{}.p".format(year, m)
    with open(fn, "rb") as f:  # close the file once it has been read
        seqs = pickle.load(f)
    full_data += seqs

cleaned_data = [] # no 0 or 1 length sequences. 
longest_seq = 0
empties_or_single = 0
pitch_types = set()
pos_types   = set()

for line in full_data:
    if(len(line[1]) > longest_seq): longest_seq = len(line[1])
    if(len(line[1]) <= 1): 
        empties_or_single += 1
    else:
        cleaned_data.append(line)
        pos_types.add(line[0][2]) 
        for p in line[1]: # the seq is the second element, first is the feature vector
            pitch_types.add(p)

print("longest sequence length: {}\nempties: {}\ntotal (clean): {}\npitch types: {}".format(
    longest_seq, empties_or_single, len(cleaned_data), len(pitch_types)))

print("pos types: {}".format(len(pos_types)))
# Saving the cleaned data to a pickle to make it easier to work with the other models.
with open("../data/pitches_full_{}.p".format(year), "wb") as f:
    pickle.dump(cleaned_data, f)

# Creating X - padded sequences of one-hots. Need a dictionary of pitch types to an index.
pitch_map = {
    'KC': 0,
    'CH': 1,
    'SL': 2,
    'SI': 3,
    'FO': 4,
    'FS': 5,
    'CU': 6,
    'PO': 7,
    'KN': 8,
    'FF': 9,
    'EP': 10,
    'IN': 11,
    'SC': 12,
    'FT': 13,
    'FC': 14,
    'UN': 15
}

MAX_LENGTH = longest_seq
NUM_EXTRA_FEATURES = 3 # pitcher hand, batter hand, batter pos

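def create_onehot_with_features(seq):
    # One-hot encode each pitch in the sequence (seq[1]), then zero-pad to MAX_LENGTH.
    # Note: the extra hand/position features are not appended here yet.
    ret = []
    for p in seq[1]:
        p_oh = np.zeros((len(pitch_map),), dtype=np.float32)
        p_oh[pitch_map[p]] = 1.0
        ret.append(p_oh)
    for _ in range(len(ret), MAX_LENGTH):  # Pad to length.
        ret.append(np.zeros((len(pitch_map),), dtype=np.float32))
    return ret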

def create_target(seq):
    ret = []
    i = 0
    for p in seq[1][1:]:
        ret.append(pitch_map[p])
        i += 1
    for j in range(i, MAX_LENGTH):  # Pad to length. Note that 0 is also the index of 'KC'.
        ret.append(0)
    return ret

X_full = [] # Sequences of onehots.
y_full = [] # index of correct pitch in the one-hot, starting at X[1]
for line in cleaned_data:
    X_full.append(create_onehot_with_features(line))
    y_full.append(create_target(line))

print(pos_types)
    
# These should both be 18. Nice.
print(len(X_full[0]), len(y_full[0]))

longest sequence length: 18
empties: 13757
total (clean): 110216
pitch types: 16
pos types: 125
{'2B-SS', 'PR-1B', '3B-2B', 'SS-RF', 'PH-SS', '2B-P', 'C', 'PR-1B-2B', 'SS-LF', 'C-LF', 'CF-1B', 'PR-LF-CF', 'PH-1B-2B', 'PR-LF', 'PH-LF', 'LF-2B', 'PH-3B', 'DH-1B', '2B-RF', 'PH-RF-LF', 'RF-LF-CF', 'RF-3B', '2B-LF-3B', 'PR-DH-3B', 'RF-CF', 'P', '1B', 'DH', 'LF-1B', 'LF-RF', 'RF-LF-1B', 'PH-2B-1B', 'RF-LF-3B', '3B-LF', '3B-SS', 'PR-SS', 'CF-LF-CF', 'SS-3B-SS', '2B-3B-LF', 'CF-SS', 'PH-DH-RF', 'PR-DH', 'PH-LF-CF', 'PH-1B', '1B-LF', 'DH-2B', 'LF-P-LF-P', 'SS-1B', '3B-1B', '2B', '3B-1B-3B', '3B-RF-3B', 'P-LF-P', 'SS-2B', '1B-2B', 'DH-C', 'LF-CF-LF', 'PH-1B-LF', 'CF-2B', '1B-CF', 'LF-3B', 'DH-3B', 'PR-3B-1B', 'CF-RF', 'LF', 'LF-SS', '1B-P', 'RF-1B-LF', '2B-LF-RF', 'SS-P', '2B-1B', 'LF-1B-LF', 'PH-DH-2B', '1B-RF', 'PH-3B-1B', 'PR-CF', 'RF-2B', 'DH-LF', 'RF-SS', 'PH-C', '3B-2B-LF', 'LF-CF', '3B-CF', '2B-LF', 'PH-RF', 'PR-RF-CF', 'CF-LF', '3B-P', 'DH-SS', 'CF-3B', '3B', 'DH-RF', 'PR-RF', 'C-1B', '1

Whoa! A lot more positions than I was anticipating.  It looks like the data allows multiple positions per batter, joined by hyphens.  I think I can still handle this, but the position feature will have to be a vector with one slot per individual position, and a batter's position vector would contain a value for each position listed.  These could even be normalized.

In [11]:
simple_poss = set()

for p in pos_types:
    p_split = p.split("-")
    for i in p_split:
        simple_poss.add(i)
# This should look like just the regular list of positions. 
print(simple_poss)

{'PR', '3B', '2B', 'P', 'C', '1B', 'DH', 'SS', 'PH', 'CF', 'RF', 'LF'}
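
The multi-position encoding described above could be sketched roughly as follows. This is only a sketch: `encode_position`, `SIMPLE_POSITIONS`, and `POS_INDEX` are hypothetical names, the positions are sorted just to give each one a stable index, and counting a repeated position twice (as in `'LF-P-LF-P'`) is an assumption rather than a decision made anywhere in this notebook.

```python
import numpy as np

# The 12 individual positions printed above, sorted so each index is stable between runs.
SIMPLE_POSITIONS = sorted(
    ['PR', '3B', '2B', 'P', 'C', '1B', 'DH', 'SS', 'PH', 'CF', 'RF', 'LF'])
POS_INDEX = {pos: i for i, pos in enumerate(SIMPLE_POSITIONS)}

def encode_position(pos_string, normalize=True):
    """Hypothetical helper: turn e.g. 'PH-1B-2B' into a vector over positions."""
    vec = np.zeros((len(POS_INDEX),), dtype=np.float32)
    for part in pos_string.split("-"):
        # Assumption: a position listed twice (e.g. 'P-LF-P') is counted twice.
        vec[POS_INDEX[part]] += 1.0
    if normalize and vec.sum() > 0:
        vec /= vec.sum()  # scale so the entries sum to 1
    return vec

# Example usage on two of the position strings seen in the data.
print(encode_position('PH-1B-2B'))
print(encode_position('LF-P-LF-P', normalize=False))
```

With this layout the position feature takes 12 values rather than 1, so `NUM_EXTRA_FEATURES` would need to grow accordingly if the vector is appended to each time step.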


### Model
I feel like there is an issue with just extending the tensor that goes from cell to cell in the RNN.  If at each iteration we get an output that represents the logits for each feature in the feature vector, what's to stop the network from just predicting the handedness and position at each step, since they never change?

Is there a way to restrict the calculation of the logits to just the 16 pitch outputs?  Could I make the input the 16+3 vector, have the internals output 16 values, and build the next input as that output plus the 3 features?
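
One way to picture the idea in the questions above is a plain NumPy sketch of a vanilla RNN step. It is not the model from the previous notebook: the weight names, the hidden size of 32, the random initialization, and the made-up feature values are all assumptions for illustration. The point it shows is that the output projection can have a different width than the input, so the logits cover only the 16 pitches while the 3 static features are re-attached to the input at every step.

```python
import numpy as np

NUM_PITCHES = 16   # len(pitch_map)
NUM_FEATURES = 3   # assumed: pitcher hand, batter hand, batter position as scalars
HIDDEN = 32        # arbitrary hidden size, for illustration only

rng = np.random.default_rng(0)
# The input-to-hidden weights see the pitch one-hot plus the static features...
W_xh = rng.normal(scale=0.1, size=(NUM_PITCHES + NUM_FEATURES, HIDDEN))
W_hh = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))
b_h = np.zeros(HIDDEN)
# ...but the output projection only produces pitch logits.
W_hy = rng.normal(scale=0.1, size=(HIDDEN, NUM_PITCHES))
b_y = np.zeros(NUM_PITCHES)

def rnn_step(pitch_vec, features, h_prev):
    """One vanilla-RNN step: (16 + 3)-dim input in, 16 pitch logits out."""
    x = np.concatenate([pitch_vec, features])
    h = np.tanh(x @ W_xh + h_prev @ W_hh + b_h)
    logits = h @ W_hy + b_y
    return logits, h

def generate(first_pitch, features, steps):
    """Feed each predicted pitch back in, re-attaching the static features."""
    h = np.zeros(HIDDEN)
    pitch_vec = first_pitch
    predicted = []
    for _ in range(steps):
        logits, h = rnn_step(pitch_vec, features, h)
        idx = int(np.argmax(logits))
        predicted.append(idx)
        pitch_vec = np.zeros(NUM_PITCHES)
        pitch_vec[idx] = 1.0  # one-hot of the predicted pitch becomes the next input
    return predicted

first = np.zeros(NUM_PITCHES)
first[9] = 1.0                        # 'FF' in pitch_map
features = np.array([1.0, 0.0, 0.5])  # made-up hand/position values
print(generate(first, features, steps=5))
```

During training the input at each step would presumably be the true pitch from `X` rather than the fed-back prediction, and the loss would be taken only over the 16 logits against `y`, so the network is never asked to predict handedness or position.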