## Data Sets
For this project I need a few different data sets.  The first is simply the 'basic' at-bats.  Each sequence will represent one at-bat: one pitcher, one batter.  The sequences will be made of one-hot vectors encoding the pitch type.  Further data sets will add a richer feature vector for the model to learn from.  For example:
* pitcher_handedness
* batter_handedness
* batter_trailing_average
* batter_position

#### Handedness
There are a few ways that the handedness of the pitcher/batter could be encoded.  I think some of the 'busywork' of learning the patterns for matchups could be simplified by encoding each matchup as a one-hot vector of length 4, with one entry for each of the possible matchups: {LL, LR, RL, RR}.  Hopefully this makes it easier to learn to separate the patterns/features specific to each matchup.  It seems simpler than having two separate one-hot vectors, where the model would have to learn the map from the pair to the matchup, then from the matchup to the patterns.
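
A minimal sketch of the length-4 matchup encoding described above. The function name `encode_matchup`, the pitcher-first ordering of each pair, and the decision to reject anything other than `L`/`R` (switch hitters, missing values) are assumptions for illustration, not decisions made in these notes.

```python
# Order of the four matchups as listed in the notes.
# Assumption: the first letter is the pitcher's hand, the second the batter's.
MATCHUPS = ["LL", "LR", "RL", "RR"]


def encode_matchup(pitcher_hand, batter_hand):
    """Return a length-4 one-hot list for a pitcher/batter handedness pair."""
    key = pitcher_hand.upper() + batter_hand.upper()
    if key not in MATCHUPS:
        # Switch hitters or missing values do not fit the four-way scheme;
        # how to handle them is left open here.
        raise ValueError(f"unsupported matchup: {key!r}")
    return [1 if key == m else 0 for m in MATCHUPS]


if __name__ == "__main__":
    print(encode_matchup("L", "R"))
```

Exactly one entry is set per at-bat, so the model sees the matchup directly rather than having to combine two separate handedness inputs.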
#### Position
Another one-hot vector.  Will need an accurate list of the position labels.  I think there are TF helpers for making one-hots; could use them for this and for the pitch types in the sequences themselves.
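
As a rough illustration of one-hot encoding from a list of labels, here is a plain-Python sketch that does not depend on TensorFlow. The helper names `build_vocab` and `one_hot` are made up for this example, and the pitch codes are only the ones visible in the sample output, so the real vocabulary would need to come from the full data set. The same two helpers would apply to position labels once an accurate list exists.

```python
def build_vocab(labels):
    """Map each distinct label to an index, sorted so the mapping is stable."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def one_hot(label, vocab):
    """Return a one-hot list for `label` using the index stored in `vocab`."""
    vec = [0] * len(vocab)
    vec[vocab[label]] = 1
    return vec


# Pitch codes that appear in the sample output (not a complete list).
pitch_vocab = build_vocab(["SL", "FF", "CH", "FT", "CU", "FC"])

# Encode one at-bat's pitch sequence as a list of one-hot vectors.
sequence = ["SL", "FF", "CH"]
encoded = [one_hot(pitch, pitch_vocab) for pitch in sequence]

if __name__ == "__main__":
    for pitch, vec in zip(sequence, encoded):
        print(pitch, vec)
```

TensorFlow's `tf.one_hot` works on integer indices, so a label-to-index mapping like `pitch_vocab` would probably be needed either way.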
#### Averages
The way we calculate and include the averages will need to handle new players, missing data, and edge cases between seasons, which might make this very difficult.  It might be easier to use the team's previous-game performance as a proxy.  OK, so spitballing:
* Team's previous-game performance
* Player's previous-game performance
* Player's average at the time of the at-bat (this will be a tough one)
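
One possible way to get a player's average "at the time of the at-bat" is to walk the at-bats in chronological order and keep running counts, reading the value before recording the current outcome. The sketch below assumes exactly that. The class name, the fallback `prior_avg`, and the `min_at_bats` threshold are all hypothetical, and it counts every recorded outcome as an at-bat, which simplifies how batting average is really computed.

```python
from collections import defaultdict


class TrailingAverage:
    """Running hits/at-bats per batter, with a fallback for sparse data."""

    def __init__(self, prior_avg=0.250, min_at_bats=20):
        # Both defaults are placeholders, not tuned values.
        self.prior_avg = prior_avg
        self.min_at_bats = min_at_bats
        self.hits = defaultdict(int)
        self.at_bats = defaultdict(int)

    def value(self, batter_id):
        """Average before the current at-bat; prior if too little history."""
        at_bats = self.at_bats.get(batter_id, 0)
        if at_bats < self.min_at_bats:
            return self.prior_avg
        return self.hits.get(batter_id, 0) / at_bats

    def record(self, batter_id, was_hit):
        """Update the counts after the at-bat's outcome is known."""
        self.at_bats[batter_id] += 1
        if was_hit:
            self.hits[batter_id] += 1


if __name__ == "__main__":
    tracker = TrailingAverage(min_at_bats=2)
    for batter_id, was_hit in [("a", True), ("a", False), ("a", True)]:
        print(batter_id, tracker.value(batter_id))  # feature for this at-bat
        tracker.record(batter_id, was_hit)
```

Reading `value` before calling `record` keeps the current at-bat's outcome out of its own feature. New players fall back to the prior until they have enough history. Whether the counts reset between seasons is still an open question.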

#### Pitcher Type
Suggested by prof.  There are some 'standard' (heavy on the air quotes) ways to describe archetypes of pitchers.  Could enumerate them, then look at how people decide which archetype a pitcher fits and craft a heuristic to describe each pitcher.
## Models
Tentatively, I'll settle for 3 models to compare against each other.
* Base Model - Just the pitches into a simple RNN
* Base Model + Hand Matchup + Position
* Base Model + Hand Matchup + Position + Pitcher Type

This will hopefully show that the base model is crap because it's simply counting patterns.  The second model will hopefully perform better with the inclusion of the additional info.  The third model should/could show that adding a more 'hand-crafted' heuristic feature can also increase the accuracy of the model.


In [14]:
import pickle, os, mlbgame
from pprint import pprint

import pymysql

YEARS_WANTED   = [2016, 2015, 2014] 
PITCH_DATA_DIR = "pitch_data/"

def get_pitch_seqs_from_game( game_id ):
    atbats = []
    events  = mlbgame.game_events(game_id)
    stats   = mlbgame.player_stats(game_id)
    # Dictionaries to get batter/pitcher info for feature vector
    batters = stats['away_batting']+stats['home_batting']
    batters = {b.id: b for b in batters}
    pitchers = stats['away_pitching']+stats['home_pitching']
    pitchers = {b.id: b for b in pitchers}
    for i in events:
        inning = events[i]
        for ab in inning['top']+inning['bottom']:
            pitch_seq = []
            for pitch in ab.pitches:
                pitch_seq.append(pitch.pitch_type)
            feature_str = get_feature_str(ab, batters, pitchers)
            atbats.append([feature_str, pitch_seq])
    return atbats

def get_feature_str(atbat, batters, pitchers):
    # TODO: use the at-bat to build the full feature vector.
    b = batters[atbat.batter]
    p = pitchers[atbat.pitcher]
    # The Lahman DB has its own ID scheme, so the pitcher/batter info has to be
    # matched with a similarity query over the names.  First and last name alone
    # will likely produce duplicates, so name + other fields may be needed.

    # Other candidates for the return value (commented out):
    #     ret = p.name_display_first_last
    #     ret = p.name
    # For now, return only the pitcher ID as a string.
    ret = str(atbat.pitcher)
    return ret


# NOTE: `game` is not defined in this cell; it is assumed to come from an earlier cell.
pitch_seq = get_pitch_seqs_from_game(game.game_id)
print(pitch_seq)

# For each at bat
    # Get the pitcher
    # Get the batter
    # Get the sequence
    # append the representations together
    
# Don't process the data at this stage, just have a row per at bat.  
# [pitcher_id, batter_id, pitcher_hand, batter_hand, batter_pos, <pitcher type data - TBD>, [sequence]]

# Later, we will take the row data, and convert it to an actual feature vector using some of the TF
# helper functions.  I think it has a way to turn things like characters/labels into one-hot vectors 
# all on its own.  

[['491703', ['SL', 'FF', 'SL', 'FF', 'FF', 'FF', 'FF', 'FF']], ['491703', ['CH', 'FF', 'FF', 'FT', 'FF', 'FF']], ['491703', ['FF', 'SL', 'FF', 'FF', 'CH', 'SL']], ['491703', ['FF']], ['448609', ['FF', 'CH', 'CH']], ['448609', ['FF']], ['448609', ['SL', 'FF', 'SL', 'SL']], ['407890', ['FF', 'FF', 'SL', 'CU', 'SL']], ['407890', ['FF', 'FF']], ['407890', ['FF', 'FT', 'FF', 'FF', 'FF']], ['572971', ['FF', 'CH', 'FC']], ['572971', ['CH', 'CH', 'FF', 'CH', 'FF', 'FF']], ['572971', ['FT']], ['572971', ['FT', 'CH', 'CH', 'FT', 'CH']], ['572971', ['FT', 'CH', 'FT']], ['407890', ['CU']], ['407890', ['FF', 'CU']], ['407890', ['CU', 'CU', 'SL', 'FF']], ['572971', ['FT', 'CH', 'FF', 'CH', 'FT', 'FT', 'FT']], ['572971', ['CH', 'FT']], ['572971', ['SL', 'FT', 'SL']], ['407890', ['FF', 'FF', 'FF', 'SL', 'SL', 'FF']], ['407890', ['CU', 'FF', 'FF']], ['407890', ['CU', 'FF', 'FF', 'SL', 'FF']], ['572971', ['FT']], ['572971', ['CH', 'FF']], ['572971', ['CH']], ['572971', ['FT']], ['474521', ['FF', 'FF', '

## Data Preprocessing
All of the label-valued attributes will need to be converted to one-hot vectors.  Any numerical values that are used (batting averages, on-base numbers) will need to be normalized as well.
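
The notes do not say which normalization to use, so the sketch below assumes z-score standardization (subtract the mean, divide by the standard deviation) as one common choice. The function names and the sample batting averages are made up for illustration.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Return (mean, std) for a numeric feature; std falls back to 1.0."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature would otherwise cause a division by zero.
        sigma = 1.0
    return mu, sigma


def standardize(value, mu, sigma):
    """Scale a single value to zero mean and unit variance."""
    return (value - mu) / sigma


if __name__ == "__main__":
    # Made-up batting averages standing in for a training set.
    train_averages = [0.200, 0.250, 0.300]
    mu, sigma = fit_standardizer(train_averages)
    print([standardize(v, mu, sigma) for v in train_averages])
```

The mean and standard deviation would normally be computed on the training split only and then reused for validation and test data, so the evaluation data does not leak into the features.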