# Leveraging SQLAlchemy ORM to Store and Retrieve MLB Stats

## Table of Contents

[Part 1: Exploring the MLB API](#part-1)
- [1a. Install and Import](#part-1a)
- [1b. Get GamePks](#part-1b)
- [1c. The 'Game' Endpoint](#part-1c)

---

### The SQLAlchemy Object Relational Mapper (ORM) automatically constructs higher-level SQL and automates persistence of Python objects.
We're going to query the MLB Stats API using a Python wrapper created by Todd Roberts and store the results in a SQLite database for future analysis.

---

<a id='part-1'></a>

## Part 1: Exploring the MLB API
Todd Roberts' Python wrapper is published on the Python Package Index. You can find more information on [PyPI](https://pypi.org/project/MLB-StatsAPI/) or on [GitHub](https://github.com/toddrob99/MLB-StatsAPI).

<a id='part-1a'></a>

First, we have to install it and import it.

In [1]:
import sys
# Uncomment the next line to install the package into the environment this kernel is running in
#!{sys.executable} -m pip install MLB-StatsAPI

import statsapi as mlb

Todd was nice enough to give us several convenience functions for accessing the API's endpoints. The most flexible of these is the get() function, which takes an endpoint name and a dictionary of parameters and returns the raw JSON response from the MLB Stats API as a Python dictionary. The endpoint configuration is stored in the ENDPOINTS global dictionary. To get usage notes for a given endpoint, call the notes() function.

In [2]:
list(mlb.ENDPOINTS.keys())[:10]

['attendance',
 'awards',
 'conferences',
 'divisions',
 'draft',
 'game',
 'game_diff',
 'game_timestamps',
 'game_changes',
 'game_contextMetrics']

In [3]:
print(mlb.notes('game'))

Endpoint: game 
All path parameters: ['ver', 'gamePk']. 
Required path parameters (note: ver will be included by default): ['ver', 'gamePk']. 
All query parameters: ['timecode', 'hydrate', 'fields']. 
Required query parameters: None. 
The hydrate function is supported by this endpoint. Call the endpoint with {'hydrate':'hydrations'} in the parameters to return a list of available hydrations. For example, statsapi.get('schedule',{'sportId':1,'hydrate':'hydrations','fields':'hydrations'})



<a id='part-1b'></a>

#### Get GamePks

In [4]:
from datetime import datetime as dt
import os,re,csv
from os import walk

#dates from the 'seasons' endpoint come back as YYYY-MM-DD, but we query the 'schedule' endpoint with MM/DD/YYYY
#we'll use this function to take care of that in a moment
def convert_date(date):
    date = dt.strptime(date,"%Y-%m-%d")
    convertedDate = dt.strftime(date,"%m/%d/%Y")
    return convertedDate

def get_gamePks(seasons,target_directory=None):
    """
    Takes a list of seasons as year strings, e.g. ['2018','2019'].
    Queries the MLB API for the gamePks in each season and writes them to one CSV file per season.
    Seasons that already have a CSV in the target directory are skipped.
    If a target directory is not specified, a directory called 'gamePks'
    is created in the current working directory.
    """
    if target_directory:
        gamePks_path = target_directory
    else:
        #create a directory to store CSVs
        try:
            os.mkdir(os.getcwd()+'/gamePks')
        except FileExistsError:
            pass
        gamePks_path=os.getcwd()+'/gamePks'
    
    #walk the gamePks directory to see if we've already added any seasons
    f = []
    for (dirpath, dirnames, filenames) in walk(gamePks_path):
        f.extend(filenames)
        break
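    #strip the '.csv' extension to recover the season year from each filename
    #(the character class '[^.csv]' only worked by accident, since it excludes the letters c, s, v)
    already_added = [os.path.splitext(x)[0] for x in f if x.endswith('.csv')]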
    seasons = list(set(seasons)-set(already_added))
    
    #query the API to get start dates and end dates for all seasons
    all_seasons = mlb.get('seasons',{'sportId':1,'all':True})['seasons']
    
    #filter out the ones we don't care about right now
    seasons = list(filter(lambda x: x['seasonId'] in seasons,all_seasons))
    
    gamePks = {}
    for season in seasons:  
        year = season['seasonId']
        startDate = convert_date(season['seasonStartDate'])
        endDate = convert_date(season['seasonEndDate'])
        
        #returns a list of dicts for each date in the range
        #each dict has a 'games' key with a list of dicts for each game in that day as values
        dates = mlb.get('schedule',{'sportId':1,'startDate':startDate,'endDate':endDate})['dates']
        
        #for each date, and for each game in that date, get the gamePk 
        gamePks[year]= [ game['gamePk'] 
                                          for date in dates 
                                          for game in date['games'] ]
        #store the gamePks as CSVs
        with open(gamePks_path + f"/{year}.csv", 'w',newline='') as myfile:
            wr = csv.writer(myfile,quoting=csv.QUOTE_ALL)
            wr.writerow(gamePks[year])
get_gamePks([str(x) for x in range(2008,2020)])   
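
To see the shape of the data without hitting the API, here is a small offline sketch of the two key steps inside get_gamePks: converting the date format and flattening the nested schedule into a list of gamePks. The `sample_dates` list is made-up data that only mimics the dates → games → gamePk nesting described in the comments above, and the gamePk values are hypothetical.

```python
from datetime import datetime

def convert_date(date):
    """Convert 'YYYY-MM-DD' to 'MM/DD/YYYY'."""
    return datetime.strptime(date, "%Y-%m-%d").strftime("%m/%d/%Y")

# Hypothetical stand-in for mlb.get('schedule', ...)['dates']
sample_dates = [
    {"date": "2019-03-28", "games": [{"gamePk": 1001}, {"gamePk": 1002}]},
    {"date": "2019-03-29", "games": [{"gamePk": 1003}]},
]

# Same nested comprehension as in get_gamePks: one gamePk per game per date
game_pks = [game["gamePk"] for date in sample_dates for game in date["games"]]

print(convert_date("2019-03-28"))  # 03/28/2019
print(game_pks)                    # [1001, 1002, 1003]
```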

In [5]:
def read_gamePks():
    gamePks_path = os.curdir+'/gamePks'
    f = []
    for (dirpath, dirnames, filenames) in walk(gamePks_path):
        f.extend(filenames)
        break
    pk_paths = [gamePks_path + '/' + x for x in f if x[0]!= '.']
    
    gamePks = {}
    for path in pk_paths:
        #the season is the filename without its '.csv' extension
        season = os.path.splitext(os.path.basename(path))[0]
        with open(path, 'r') as csv_file:
            reader = csv.reader(csv_file)
            seasonPks = list(reader)
        gamePks[season] = [item for sublist in seasonPks for item in sublist]
    return gamePks

In [6]:
gamePks=read_gamePks()
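
One detail worth keeping in mind: the gamePks were integers when they came from the API, but anything read back through the csv module is a string. The sketch below shows that round trip using an in-memory buffer instead of a file; the IDs are hypothetical.

```python
import csv
import io

# Write a row of integer IDs the same way get_gamePks does (hypothetical IDs)
buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow([1001, 1002, 1003])

# Read it back: every field is now a string
buffer.seek(0)
rows = list(csv.reader(buffer))
as_strings = [item for row in rows for item in row]

# Convert explicitly if integers are needed downstream
as_ints = [int(item) for item in as_strings]

print(as_strings)  # ['1001', '1002', '1003']
print(as_ints)     # [1001, 1002, 1003]
```

The API calls in this post work with the string form, so the conversion is only needed if you want to compare against integer IDs elsewhere.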

<a id='part-1c'></a>

#### Explore the 'Game' Endpoint

Let's pick an arbitrary gamePk to see what's inside the 'game' endpoint. The API query returns a TON of information in nested dictionaries. Since our goal is to store this information in a SQL database, we want to organize it into [first normal form](https://www.essentialsql.com/get-ready-to-learn-sql-8-database-first-normal-form-explained-in-simple-english/).

From this single result, we'll be able to start building [normalized SQL tables](https://www.essentialsql.com/get-ready-to-learn-sql-database-normalization-explained-in-simple-english/) for games, teams, venues, players, plays, and pitches. Let's start with the games table. 

In [7]:
temp_pk=gamePks['2019'][501]
print(temp_pk) 
game_result = mlb.get('game',{'gamePk':temp_pk})

565812


In [8]:
game_result.keys()

dict_keys(['copyright', 'gamePk', 'link', 'metaData', 'gameData', 'liveData'])

In [9]:
gameData = game_result['gameData']
gameData.keys()

dict_keys(['game', 'datetime', 'status', 'teams', 'players', 'venue', 'weather', 'review', 'flags', 'alerts', 'probablePitchers', 'officialScorer', 'primaryDatacaster'])

In [10]:
game = gameData['game']
game

{'pk': 565812,
 'type': 'R',
 'doubleHeader': 'N',
 'id': '2019/04/26/pitmlb-lanmlb-1',
 'gamedayType': 'P',
 'tiebreaker': 'N',
 'gameNumber': 1,
 'calendarEventID': '14-565812-2019-04-26',
 'season': '2019',
 'seasonDisplay': '2019'}

The dictionary above, nested two layers deep in the original API result, gives us a good starting point, but we're going to add some additional information to make our games table more informative.

In [11]:
gameData['datetime']

{'dateTime': '2019-04-27T02:10:00Z',
 'originalDate': '2019-04-26',
 'dayNight': 'night',
 'time': '7:10',
 'ampm': 'PM'}

In [12]:
gameData['weather']

{'condition': 'Clear', 'temp': '61', 'wind': '2 mph, Varies'}

In [13]:
gameData['venue']['timeZone']

{'id': 'America/Los_Angeles', 'offset': -8, 'tz': 'PST'}

In [14]:
gameData['status']

{'abstractGameState': 'Final',
 'codedGameState': 'F',
 'detailedState': 'Final',
 'statusCode': 'F',
 'abstractGameCode': 'F'}

In [15]:
#keys to add
keys_to_add = ['dateTime',
          'originalDate',
          'condition',
          'temp','wind','tz']
#dictionaries from which to add them
dicts = [gameData['datetime'],
         gameData['weather'],
         gameData['venue']['timeZone']
        ]

for k in keys_to_add:
    for d in dicts:
        try:
            game[k]=d[k]
        except KeyError:
            continue
#'seasonDisplay' key:value seems to be redundant
del game['seasonDisplay']
game

{'pk': 565812,
 'type': 'R',
 'doubleHeader': 'N',
 'id': '2019/04/26/pitmlb-lanmlb-1',
 'gamedayType': 'P',
 'tiebreaker': 'N',
 'gameNumber': 1,
 'calendarEventID': '14-565812-2019-04-26',
 'season': '2019',
 'dateTime': '2019-04-27T02:10:00Z',
 'originalDate': '2019-04-26',
 'condition': 'Clear',
 'temp': '61',
 'wind': '2 mph, Varies',
 'tz': 'PST'}
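
The reason for picking keys one by one, rather than merging the dictionaries wholesale, is that the source dictionaries share key names: the time zone dictionary has its own 'id', which would overwrite the game's 'id'. Below is a minimal sketch of the same selection written as a helper. The name `pick_keys` is hypothetical, and the input dictionaries are copied from the outputs above.

```python
# Source dictionaries copied from the outputs above
datetime_info = {'dateTime': '2019-04-27T02:10:00Z', 'originalDate': '2019-04-26',
                 'dayNight': 'night', 'time': '7:10', 'ampm': 'PM'}
weather = {'condition': 'Clear', 'temp': '61', 'wind': '2 mph, Varies'}
time_zone = {'id': 'America/Los_Angeles', 'offset': -8, 'tz': 'PST'}

keys_to_add = ['dateTime', 'originalDate', 'condition', 'temp', 'wind', 'tz']

def pick_keys(keys, sources):
    """Return {key: value} for each wanted key found in any source dict.

    If a key appears in more than one source, the last source wins,
    which matches the behaviour of the nested loop above.
    """
    picked = {}
    for source in sources:
        picked.update({k: source[k] for k in keys if k in source})
    return picked

extra = pick_keys(keys_to_add, [datetime_info, weather, time_zone])
print(extra)
```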

In [16]:
gameData['probablePitchers']

{'away': {'id': 502042,
  'fullName': 'Archer, Chris',
  'link': '/api/v1/people/502042'},
 'home': {'id': 547943,
  'fullName': 'Ryu, Hyun-Jin',
  'link': '/api/v1/people/547943'}}

In [17]:
def get_game(api_call):
    gameData = api_call['gameData']
    dateTime = gameData['datetime']
    game = gameData['game']
    weather = gameData['weather']
    timeZone = gameData['venue']['timeZone']
    status = gameData['status']
    probablePitchers = gameData['probablePitchers']
    
    keys_to_add = ['dateTime','originalDate',
                   'condition','temp','wind',
                   'tz','detailedState'
                  ]
    dicts = [weather,dateTime,timeZone,status]
    for k in keys_to_add:
        for d in dicts:
            try:
                game[k]=d[k]
            except KeyError:
                continue
    #'seasonDisplay' key:value seems to be redundant
    del game['seasonDisplay']
    
    home_team = gameData['teams']['home']
    away_team = gameData['teams']['away']
    
    game['homeTeam_id'] = home_team['id']
    game['awayTeam_id'] = away_team['id']
    
    game['venue_id'] = gameData['venue']['id']
    
    for team in ['home','away']:
        try:
            game[f"{team}_probablePitcher"]=probablePitchers[team]['id']
        except KeyError:
            pass
    
    #format the dateTime and originalDate
    fmt = "%Y-%m-%dT%H:%M:%SZ" 
    game['dateTime'] = dt.strptime(game['dateTime'],fmt)
    fmt = "%Y-%m-%d"
    game['originalDate'] = dt.strptime(game['originalDate'],fmt).date()
    
    return game
api_call = mlb.get('game',{'gamePk':temp_pk})
game = get_game(api_call)
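
The date handling at the end of get_game is worth a closer look. The sketch below parses the two values shown earlier and illustrates that strptime treats the trailing 'Z' as a literal character, so the result carries no time zone information. Attaching UTC explicitly is an optional extra step that get_game does not take.

```python
from datetime import datetime, timezone

raw_date_time = '2019-04-27T02:10:00Z'   # value shown in gameData['datetime'] above
raw_original_date = '2019-04-26'

parsed = datetime.strptime(raw_date_time, "%Y-%m-%dT%H:%M:%SZ")
original = datetime.strptime(raw_original_date, "%Y-%m-%d").date()

# The 'Z' in the format string is matched literally, so `parsed` is naive
print(parsed.tzinfo)  # None

# Optional: mark the value explicitly as UTC
aware = parsed.replace(tzinfo=timezone.utc)
print(aware.isoformat())  # 2019-04-27T02:10:00+00:00

# The two fields fall on different calendar days for this game
print(parsed.date() == original)  # False
```

This also shows why both fields are kept: for this night game, the timestamp falls on April 27 while the original date is April 26.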

## API calls
What's the best way to automate API calls when needed?

In [19]:
def api_calls(gamePks):
    return ( mlb.get('game',{'gamePk':gamePk}) for gamePk in gamePks )

In [20]:
calls = api_calls(gamePks['2019'][500:600])

In [21]:
#test = [get_game(x) for x in calls]
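
Because api_calls returns a generator expression, no request is sent until something iterates over it. The sketch below demonstrates that behaviour offline. `fake_get` is a hypothetical stand-in for mlb.get that simply records which gamePks were requested, and the `fetch` parameter is an addition for illustration only.

```python
from itertools import islice

calls_made = []

def fake_get(game_pk):
    """Hypothetical stand-in for mlb.get('game', {'gamePk': game_pk})."""
    calls_made.append(game_pk)
    return {'gamePk': game_pk}

def api_calls(game_pks, fetch=fake_get):
    # Generator expression: nothing is fetched until a consumer asks for it
    return (fetch(pk) for pk in game_pks)

calls = api_calls([1001, 1002, 1003, 1004])
print(calls_made)                    # [] : no requests yet

first_two = list(islice(calls, 2))   # pull only two results
print(calls_made)                    # [1001, 1002]

rest = list(calls)                   # the generator resumes where it left off
print([r['gamePk'] for r in rest])   # [1003, 1004]
```

A generator can only be consumed once, so a second pass over the same games needs a fresh call to api_calls.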

<a id='part-2'></a>

## Introducing SQLAlchemy

In [22]:
import sqlalchemy
from sqlalchemy import create_engine,PrimaryKeyConstraint,UniqueConstraint

In [23]:
class MyDatabase:
    # http://docs.sqlalchemy.org/en/latest/core/engines.html
    """
    Custom class for instantiating a SQLAlchemy engine. Configured here for SQLite, but intended to be flexible.
    Credit to Medium user Mahmud Ahsan:
    https://medium.com/@mahmudahsan/how-to-use-python-sqlite3-using-sqlalchemy-158f9c54eb32
    """
    DB_ENGINE = {
       'sqlite': 'sqlite:////{DB}'
    }

    # Main DB Connection Ref Obj
    db_engine = None
    def __init__(self, dbtype, username='', password='', dbname='',path=os.getcwd()+'/'):
        dbtype = dbtype.lower()
        if dbtype in self.DB_ENGINE.keys():
            engine_url = self.DB_ENGINE[dbtype].format(DB=path+dbname)
            self.db_engine = create_engine(engine_url)
            print(self.db_engine)
        else:
            print("DBType is not found in DB_ENGINE")
db=MyDatabase('sqlite',dbname='mlb.db')

Engine(sqlite://///Users/schlinkertc/code/MLB/mlb_sqlite/blog_posts/mlb.db)


In [24]:
sqlalchemy.dialects.sqlite.base.SQLiteDialect.construct_arguments

[(sqlalchemy.sql.schema.Table, {'autoincrement': False}),
 (sqlalchemy.sql.schema.Index, {'where': None}),
 (sqlalchemy.sql.schema.Column,
  {'on_conflict_primary_key': None,
   'on_conflict_not_null': None,
   'on_conflict_unique': None}),
 (sqlalchemy.sql.schema.Constraint, {'on_conflict': None})]

In [25]:
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()

from sqlalchemy import Table,Column,Integer,String,DateTime,Date,Boolean

class Game(Base):
    __tablename__ = 'games'
    __table_args__ = (PrimaryKeyConstraint('id','detailedState',sqlite_on_conflict='IGNORE'),
                      {'extend_existing': True})
    
    pk = Column(Integer)
    type = Column(String(1))
    doubleHeader = Column(String(1))
    id = Column(String(150))
    gamedayType = Column(String(1))
    tiebreaker = Column(String(1))
    gameNumber = Column(Integer)
    #must match the 'calendarEventID' key in the game dictionary, or the column is never populated
    calendarEventID = Column(String(50))
    season = Column(Integer)
    
    dateTime = Column(DateTime)
    originalDate = Column(Date)
    
    detailedState = Column(String(12))
    
    homeTeam_id = Column(Integer)
    awayTeam_id = Column(Integer)
    
    condition = Column(String(25))
    temp = Column(Integer)
    wind = Column(String(50))
    
    venue_id = Column(Integer)
    
    home_probablePitcher = Column(Integer)
    away_probablePitcher = Column(Integer)
    
    def __repr__(self): 
        return "<Game(pk='%s',id='%s')>" % (
                        self.pk, self.id)
    
    def __init__(self,dictionary):
        for k,v in dictionary.items():
            setattr(self,k,v)

In [26]:
Base.metadata.create_all(db.db_engine)
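
The `sqlite_on_conflict='IGNORE'` argument on the primary key is what lets us re-insert a game we've already stored without raising an error. To show the effect in isolation, here is a minimal sketch using the standard library's sqlite3 module and hand-written DDL. The table definition is a trimmed-down assumption about what the constraint amounts to, not the exact SQL that SQLAlchemy emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hand-written, trimmed-down version of the games table (assumption)
conn.execute("""
    CREATE TABLE games (
        pk INTEGER,
        id VARCHAR(150),
        detailedState VARCHAR(12),
        PRIMARY KEY (id, detailedState) ON CONFLICT IGNORE
    )
""")

row = (565812, '2019/04/26/pitmlb-lanmlb-1', 'Final')
conn.execute("INSERT INTO games VALUES (?, ?, ?)", row)
conn.execute("INSERT INTO games VALUES (?, ?, ?)", row)  # duplicate key: silently skipped

count = conn.execute("SELECT COUNT(*) FROM games").fetchone()[0]
print(count)  # 1
conn.close()
```

Because 'detailedState' is part of the key, the same game can still be stored again if its state changes, for example from a scheduled state to 'Final'.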

In [27]:
game_record = Game(game)

In [28]:
from sqlalchemy.orm import sessionmaker
Session = sessionmaker(bind=db.db_engine)
session = Session()

In [29]:
game_record

<Game(pk='565812',id='2019/04/26/pitmlb-lanmlb-1')>

In [30]:
session.add(game_record)

session.commit()

session.query(Game).all()

[<Game(pk='565812',id='2019/04/26/pitmlb-lanmlb-1')>]

Now we're going to move on to plays. We're going to set up a one-to-many relationship between games and plays, so we'll need to be able to link the two when we get to the SQLAlchemy stage. With that in mind, we'll start exploring the information embedded in the 'liveData' key of the original API result.

'allPlays' gives us a list of dictionaries with information about every play. Remember that we're aiming for first normal form, so we don't want any information about pitches, players, or runners in the plays table. Those will live in their own tables.
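
As a rough picture of where this is heading, the sketch below sets up a one-to-many link between a games table and a plays table using plain sqlite3. The schema is a simplified assumption (two or three columns per table), and the play events are made up. The point is only that each play row carries the gamePk of the game it belongs to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# One row per game
conn.execute("CREATE TABLE games (pk INTEGER PRIMARY KEY, season INTEGER)")

# Many rows per game, each pointing back to its game through gamePk
conn.execute("""
    CREATE TABLE plays (
        gamePk INTEGER REFERENCES games (pk),
        atBatIndex INTEGER,
        event TEXT,
        PRIMARY KEY (gamePk, atBatIndex)
    )
""")

conn.execute("INSERT INTO games VALUES (565812, 2019)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?, ?)",
    [(565812, 0, 'Strikeout'), (565812, 1, 'Single')],  # hypothetical plays
)

rows = conn.execute("""
    SELECT g.pk, p.atBatIndex, p.event
    FROM games AS g JOIN plays AS p ON p.gamePk = g.pk
    ORDER BY p.atBatIndex
""").fetchall()
print(rows)  # [(565812, 0, 'Strikeout'), (565812, 1, 'Single')]
conn.close()
```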

In [42]:
def string_to_dateTime(string):
    fmt="%Y-%m-%dT%H:%M:%S.%fZ"
    return dt.strptime(string,fmt)
string_to_dateTime('2019-04-26T20:59:22.000Z')

datetime.datetime(2019, 4, 26, 20, 59, 22)

In [43]:
# def parse_play(play):
#     play_dict = play['result']
#     play_dict.update(play['about'])
#     play_dict.update(play['count'])
    
#     for t in ['startTime','endTime']:
#         play_dict[t]=string_to_dateTime(play_dict[t])
        
#     for player in ['batter','pitcher']:
#         try:
#             play_dict[f"{player}_id"] = play['matchup'][player]['id']
#         except:
#             pass
#     return play_dict

In [45]:
# def get_plays(API_result):
#     #foreign key references games table
#     gamePk={'gamePk':API_result['gamePk']}
    
#     allPlays = API_result['liveData']['plays']['allPlays']
#     plays = [parse_play(play) for play in allPlays] 
#     [play.update(gamePk) for play in plays]
#     return plays

In [49]:
def flatten_dicts(dictionary):
    """
    recursively flatten a dictionary of dictionaries
    """
    #base case 
    if dict not in [type(x) for x in dictionary.values()]:
        return dictionary
    else:
        for key, value in dictionary.items():
            if type(value)==dict:
                temp_dict = dictionary.pop(key)
                for k,v in temp_dict.items():
                    dictionary[f"{key}_{k}"]=v
                return flatten_dicts(dictionary)
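
Note that flatten_dicts modifies the dictionary it is given. For comparison, here is a sketch of a non-mutating variant that builds and returns a new dictionary. The name `flatten` and the `parent_key` parameter are hypothetical, and the sample venue is a trimmed reconstruction of the nesting implied by the flattened venue keys used in this post.

```python
def flatten(nested, parent_key=""):
    """Return a new flat dict; nested keys are joined with underscores."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{parent_key}_{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the nested dict, carrying the prefix along
            flat.update(flatten(value, full_key))
        else:
            flat[full_key] = value
    return flat

venue = {
    'id': 22,
    'name': 'Dodger Stadium',
    'location': {
        'city': 'Los Angeles',
        'defaultCoordinates': {'latitude': 34.07368, 'longitude': -118.24053},
    },
}

flat_venue = flatten(venue)
print(flat_venue)
print('location' in venue)  # True: the input is left untouched
```

One visible difference is key order: this variant keeps flattened keys where the nested dictionary was, while flatten_dicts moves them to the end.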
            
    

In [117]:
def get_plays(API_result):
    #foreign key references games table
    gamePk={'gamePk':API_result['gamePk']}
    
    allPlays = API_result['liveData']['plays']['allPlays']
    keys = ['result','about','count']
    plays = []
    matchups = []
    for play in allPlays:
        #new dict holding only the keys we want
        play_dict = {k:play[k] for k in keys}
        plays.append(flatten_dicts(play_dict))
        #copy the matchup so the original API result is left intact and the function can be re-run
        matchup = dict(play['matchup'])
        #foreign keys to play 'atBatIndex', 'playEndTime'
        fks=['atBatIndex', 'playEndTime']
            
        matchup.update({fk:play[fk] for fk in fks})
        matchups.append(flatten_dicts(matchup))
            
    for row in matchups + plays:
        row.update(gamePk)
    return plays,matchups
        
        

In [118]:
def get_pitches(API_result):
    #foreign key references games table
    gamePk={'gamePk':API_result['gamePk']}
    
    allPlays = API_result['liveData']['plays']['allPlays']
    
    pitches = []
    for play in allPlays:
        for i in play['pitchIndex']:
            pitch = play['playEvents'][i]
            #foreign keys to play 'atBatIndex', 'playEndTime'
            fks=['atBatIndex', 'playEndTime']
            
            pitch.update(
                {k:v for k,v in zip(fks,[play[fk] for fk in fks])}
            )
            
            pitches.append(flatten_dicts(pitch))
    
    [p.update(gamePk) for p in pitches]
    return pitches
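
get_pitches, and the get_runners and get_actions functions that follow, all use the same pattern: an index list on the play selects child events, and each child is tagged with the parent play's keys plus the gamePk. Here is a minimal offline sketch of that pattern. The play is hypothetical and heavily trimmed, with only the key names mirroring the API result. Unlike the functions in this post, it builds new rows instead of updating the events in place.

```python
# Hypothetical, heavily trimmed play; only the key names mirror the API result
play = {
    'atBatIndex': 3,
    'playEndTime': '2019-04-27T02:25:10.000Z',
    'pitchIndex': [0, 2],
    'playEvents': [
        {'isPitch': True, 'details': {'description': 'Ball'}},
        {'isPitch': False, 'details': {'description': 'Mound Visit'}},
        {'isPitch': True, 'details': {'description': 'Foul'}},
    ],
}
game_pk = 565812
parent_keys = ['atBatIndex', 'playEndTime']

pitch_rows = []
for i in play['pitchIndex']:
    # Combine the event, the parent play's keys, and the game key into a new row
    row = {**play['playEvents'][i],
           **{k: play[k] for k in parent_keys},
           'gamePk': game_pk}
    pitch_rows.append(row)

print(len(pitch_rows))                                     # 2
print([r['details']['description'] for r in pitch_rows])   # ['Ball', 'Foul']
print(pitch_rows[0]['atBatIndex'], pitch_rows[0]['gamePk'])  # 3 565812
```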

In [119]:
def get_runners(API_result):
    #foreign key references games table
    gamePk={'gamePk':API_result['gamePk']}
    
    allPlays = API_result['liveData']['plays']['allPlays']
    
    runners = []
    credits = []
    for play in allPlays:
        for i in play['runnerIndex']:
            runner = play['runners'][i]
            
            fks=['atBatIndex', 'playEndTime']
            
            runner.update(
                {k:v for k,v in zip(fks,[play[fk] for fk in fks])}
            )
            try:
                temp_credits = runner.pop('credits')
                
                for credit in temp_credits:
                    credit.update({k:v for k,v in zip(fks,[play[fk] for fk in fks])})
                
                    credits.append(flatten_dicts(credit))
            except KeyError:
                pass 
            runners.append(flatten_dicts(runner))
    
    #tag both runners and credits with the game they belong to
    for row in runners + credits:
        row.update(gamePk)
    return runners,credits

In [120]:
runners,credits = get_runners(api_call)

In [121]:
def get_actions(API_result):
    #foreign key references games table
    gamePk={'gamePk':API_result['gamePk']}
    
    allPlays = API_result['liveData']['plays']['allPlays']
    
    actions = []
    for play in allPlays:
        for i in play['actionIndex']:
            action = play['playEvents'][i]
            #foreign keys to play 'atBatIndex', 'playEndTime'
            fks=['atBatIndex', 'playEndTime']
            
            action.update(
                {k:v for k,v in zip(fks,[play[fk] for fk in fks])}
            )
            
            actions.append(flatten_dicts(action))
    
    [a.update(gamePk) for a in actions]
    return actions

In [68]:
actions = get_actions(api_call)

In [73]:
def get_players(API_result):
    #fk for game_player_link
    gamePk=API_result['gamePk']
    
    players = API_result['gameData']['players']
    players = [flatten_dicts(players[player_id]) for player_id in players.keys()]
    
    game_player_links = []
    for player in players:
        link = {'player':player['id'],'gamePk':gamePk}
        game_player_links.append(link)
    
    return players,game_player_links

In [74]:
players,game_player_links = get_players(api_call)

In [78]:
def get_teams(API_result):
    #fk for game_team_link
    gamePk=API_result['gamePk']
    teams_dict = API_result['gameData']['teams']
    
    teams = []
    links = []
    team_records = []
    for key in ['home','away']:
        team = teams_dict[key]
        
        team_record = team.pop('record')
        team_record.update({'gamePk':gamePk})
        team_records.append(team_record)
        
        teams.append(flatten_dicts(team))
        
        link = {'gamePk':gamePk,
                'team_id':team['id'],
                'home_away':key}
        links.append(link)
        
    return teams, links, team_records

In [79]:
teams, game_team_links, team_records = get_teams(api_call)

In [80]:
def get_venue(API_result):
    venue = API_result['gameData']['venue']
    return flatten_dicts(venue)
get_venue(api_call)

{'id': 22,
 'name': 'Dodger Stadium',
 'link': '/api/v1/venues/22',
 'location_city': 'Los Angeles',
 'location_state': 'California',
 'location_stateAbbrev': 'CA',
 'timeZone_id': 'America/Los_Angeles',
 'timeZone_offset': -8,
 'timeZone_tz': 'PST',
 'fieldInfo_capacity': 56000,
 'fieldInfo_turfType': 'Grass',
 'fieldInfo_roofType': 'Open',
 'fieldInfo_leftLine': 330,
 'fieldInfo_leftCenter': 385,
 'fieldInfo_center': 395,
 'fieldInfo_rightCenter': 385,
 'fieldInfo_rightLine': 330,
 'location_defaultCoordinates_latitude': 34.07368,
 'location_defaultCoordinates_longitude': -118.24053}

In [81]:
import pandas as pd

In [94]:
test_gamePks = gamePks['2019'][::300]

In [108]:
len(test_gamePks)

9

In [98]:
#[get_game(g) for g in api_calls(test_gamePks)]

In [122]:
def parse_games(gamePks):
    api_results=api_calls(gamePks)
    tables = ['games','plays','pitches','runners','credits','actions','teams','venues','game_team_links','game_player_links','team_records','players','matchups']
    dfs = {}
    for table in tables:
        dfs[table]=[]
    for result in api_results:
        dfs['games'].append(get_game(result))
        dfs['venues'].append(get_venue(result))
        dfs['pitches'].extend(get_pitches(result))
        dfs['actions'].extend(get_actions(result))
        
        plays,matchups = get_plays(result)
        dfs['plays'].extend(plays)
        dfs['matchups'].extend(matchups)
        
        runners,credits = get_runners(result)
        dfs['runners'].extend(runners)
        dfs['credits'].extend(credits)
        
        teams, game_team_links, team_records = get_teams(result)
        dfs['teams'].extend(teams)
        dfs['game_team_links'].extend(game_team_links)
        dfs['team_records'].extend(team_records)
        
        players,game_player_links = get_players(result)
        dfs['players'].extend(players)
        dfs['game_player_links'].extend(game_player_links)
    for key in dfs.keys():
        dfs[key]=pd.DataFrame.from_records(dfs[key])
    return dfs
        
    

In [123]:
dfs = parse_games(test_gamePks)
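
Since get_venue, get_teams, and get_players run once per game, the same venue, team, or player can appear in several rows once more than one game is parsed. Before loading those tables into the database, the repeats would likely need collapsing. Below is a minimal pure-Python sketch, assuming 'id' uniquely identifies a row. The helper name `dedupe_by_key` and the second venue are hypothetical.

```python
venue_rows = [
    {'id': 22, 'name': 'Dodger Stadium'},
    {'id': 99, 'name': 'Example Park'},      # hypothetical venue
    {'id': 22, 'name': 'Dodger Stadium'},    # same venue, seen in a second game
]

def dedupe_by_key(rows, key='id'):
    """Keep the first row seen for each key value, preserving order."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())

unique_venues = dedupe_by_key(venue_rows)
print([v['id'] for v in unique_venues])  # [22, 99]
```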

In [128]:
dfs['matchups'].iloc[687]['batterHotColdZones']

[{'zone': '01',
  'color': 'rgba(255, 255, 255, 0.55)',
  'temp': 'lukewarm',
  'value': '.750'},
 {'zone': '02',
  'color': 'rgba(6, 90, 238, .55)',
  'temp': 'cold',
  'value': '.000'},
 {'zone': '03',
  'color': 'rgba(6, 90, 238, .55)',
  'temp': 'cold',
  'value': '.000'},
 {'zone': '04',
  'color': 'rgba(255, 255, 255, 0.55)',
  'temp': 'lukewarm',
  'value': '.667'},
 {'zone': '05',
  'color': 'rgba(6, 90, 238, .55)',
  'temp': 'cold',
  'value': '.400'},
 {'zone': '06',
  'color': 'rgba(214, 41, 52, .55)',
  'temp': 'hot',
  'value': '2.500'},
 {'zone': '07',
  'color': 'rgba(6, 90, 238, .55)',
  'temp': 'cold',
  'value': '.000'},
 {'zone': '08',
  'color': 'rgba(234, 147, 153, .55)',
  'temp': 'warm',
  'value': '1.000'},
 {'zone': '09',
  'color': 'rgba(6, 90, 238, .55)',
  'temp': 'cold',
  'value': '.000'},
 {'zone': '11',
  'color': 'rgba(6, 90, 238, .55)',
  'temp': 'cold',
  'value': '.000'},
 {'zone': '12',
  'color': 'rgba(6, 90, 238, .55)',
  'temp': 'cold',
  'value'

In [82]:
plays_df=pd.DataFrame.from_records(plays)

In [83]:
plays_df.columns

Index(['result_type', 'result_event', 'result_eventType', 'result_description',
       'result_rbi', 'result_awayScore', 'result_homeScore',
       'about_atBatIndex', 'about_halfInning', 'about_isTopInning',
       'about_inning', 'about_startTime', 'about_endTime', 'about_isComplete',
       'about_isScoringPlay', 'about_hasReview', 'about_hasOut',
       'about_captivatingIndex', 'count_balls', 'count_strikes', 'count_outs',
       'matchup_batterHotColdZones', 'matchup_pitcherHotColdZones',
       'matchup_batter_id', 'matchup_batter_fullName', 'matchup_batter_link',
       'matchup_batSide_code', 'matchup_batSide_description',
       'matchup_pitcher_id', 'matchup_pitcher_fullName',
       'matchup_pitcher_link', 'matchup_pitchHand_code',
       'matchup_pitchHand_description', 'matchup_postOnFirst_id',
       'matchup_postOnFirst_fullName', 'matchup_postOnFirst_link',
       'matchup_splits_batter', 'matchup_splits_pitcher',
       'matchup_splits_menOnBase', 'matchup_postOnThird

In [84]:
def unique_columns(df):
    """
    returns names for columns that have all unique values
    """
    return [x for x in df.columns if len(df[x].unique())==df.shape[0]]

In [85]:
#pitches for the single game fetched earlier
pitches = get_pitches(api_call)
pitches_df = pd.DataFrame.from_records(pitches)

In [86]:
pitches_df.columns

Index(['index', 'pfxId', 'playId', 'pitchNumber', 'startTime', 'endTime',
       'isPitch', 'type', 'atBatIndex', 'playEndTime', 'details_description',
       'details_code', 'details_ballColor', 'details_trailColor',
       'details_isInPlay', 'details_isStrike', 'details_isBall',
       'details_hasReview', 'count_balls', 'count_strikes',
       'pitchData_startSpeed', 'pitchData_endSpeed', 'pitchData_strikeZoneTop',
       'pitchData_strikeZoneBottom', 'pitchData_zone',
       'pitchData_typeConfidence', 'pitchData_plateTime',
       'pitchData_extension', 'details_call_code', 'details_call_description',
       'details_type_code', 'details_type_description',
       'pitchData_coordinates_aY', 'pitchData_coordinates_aZ',
       'pitchData_coordinates_pfxX', 'pitchData_coordinates_pfxZ',
       'pitchData_coordinates_pX', 'pitchData_coordinates_pZ',
       'pitchData_coordinates_vX0', 'pitchData_coordinates_vY0',
       'pitchData_coordinates_vZ0', 'pitchData_coordinates_x',
       '

In [87]:
pitches_df['details_description'].unique()

array(['Called Strike', 'Swinging Strike', 'In play, no out', 'Ball',
       'In play, run(s)', 'Foul', 'In play, out(s)', 'Ball In Dirt',
       'Foul Tip', 'Pickoff Attempt 2B', 'Pickoff Attempt 1B'],
      dtype=object)

In [88]:
pitches_df[pitches_df['details_description']=='In play, out(s)']['hitData_totalDistance']

12        NaN
14     138.24
26     343.52
44        NaN
72     285.86
73     285.75
76     144.55
81     247.93
86     312.71
96      84.87
101     95.95
136       NaN
144    315.07
151     89.22
153    112.42
155    132.92
171    146.39
200    140.74
208       NaN
223    146.10
227    146.49
250    286.28
254    265.53
255    126.68
266    307.11
281    291.70
286    256.27
Name: hitData_totalDistance, dtype: float64

In [89]:
unique_columns(pitches_df)

['playId']

In [90]:
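#note: .shape is a (rows, columns) tuple, so comparing it to an integer is always False
#use plays_df.shape[0] to compare the number of plays with the number of unique at-bats
plays_df.shape==len(pitches_df['atBatIndex'].unique())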

False

In [91]:
pitches_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287 entries, 0 to 286
Data columns (total 63 columns):
index                             287 non-null int64
pfxId                             284 non-null object
playId                            286 non-null object
pitchNumber                       284 non-null float64
startTime                         284 non-null object
endTime                           284 non-null object
isPitch                           287 non-null bool
type                              287 non-null object
atBatIndex                        287 non-null int64
playEndTime                       287 non-null object
details_description               287 non-null object
details_code                      287 non-null object
details_ballColor                 284 non-null object
details_trailColor                284 non-null object
details_isInPlay                  284 non-null object
details_isStrike                  284 non-null object
details_isBall                    

In [92]:
pitches_df['hitData_hardness'].unique()

array([nan, 'medium', 'soft', 'hard'], dtype=object)