# Leveraging SQLAlchemy ORM to Store and Retrieve MLB Stats

## Table of Contents

[Part 1: Exploring the MLB API](#part-1)
- [1a. Install and Import](#part-1a)
- [1b. Get GamePks](#part-1b)
- [1c. The 'Game' Endpoint](#part-1c)

---

### The SQLAlchemy Object Relational Mapper automatically constructs higher-level SQL and automates persistence of Python objects.
We're going to query the MLB API using a Python wrapper created by Todd Roberts and store the results in a SQLite database for future analysis.

---

<a id='part-1'></a>

## Part 1: Exploring the MLB API
Todd Roberts' Python wrapper is published on the Python Package Index (PyPI). You can find more information [here](https://pypi.org/project/MLB-StatsAPI/) or on [GitHub](https://github.com/toddrob99/MLB-StatsAPI).

<a id='part-1a'></a>

First, we have to install it and import it.

In [1]:
import sys
#uncomment the next line to install the package into the environment this kernel runs in
#!{sys.executable} -m pip install MLB-StatsAPI

import statsapi as mlb

Todd was nice enough to give us several convenient functions for accessing the API's endpoints. The most flexible and powerful of these is the get() function, which takes an endpoint name and a dictionary of parameters and returns the raw JSON response from the MLB Stats API. The endpoint configuration lives in a dictionary, the ENDPOINTS global variable. To get notes for a given endpoint, use the notes() function.

In [79]:
list(mlb.ENDPOINTS.keys())[:10]

['attendance',
 'awards',
 'conferences',
 'divisions',
 'draft',
 'game',
 'game_diff',
 'game_timestamps',
 'game_changes',
 'game_contextMetrics']

In [139]:
print(mlb.notes('game'))

Endpoint: game 
All path parameters: ['ver', 'gamePk']. 
Required path parameters (note: ver will be included by default): ['ver', 'gamePk']. 
All query parameters: ['timecode', 'hydrate', 'fields']. 
Required query parameters: None. 
The hydrate function is supported by this endpoint. Call the endpoint with {'hydrate':'hydrations'} in the parameters to return a list of available hydrations. For example, statsapi.get('schedule',{'sportId':1,'hydrate':'hydrations','fields':'hydrations'})



<a id='part-1b'></a>

#### Get GamePks

In [132]:
from datetime import datetime as dt
import os,re,csv
from os import walk

#dates from the 'season' endpoint are returned in a different format than what we need to query the API
#we'll use this function to take care of that in a moment
def convert_date(date):
    date = dt.strptime(date,"%Y-%m-%d")
    convertedDate = dt.strftime(date,"%m/%d/%Y")
    return convertedDate

def get_gamePks(seasons,target_directory=None):
    """
    Takes in a list of seasons as strings representing their year e.g. ['2018','2019']
    Queries the MLB API to find gamePks for each season and writes them to CSV files
    if a target directory for the gamePks is not specified, a directory called 'gamePks'
    will be added to the current directory. 
    """
    if target_directory:
        gamePks_path = target_directory
    else:
        #create a directory to store CSVs
        try:
            os.mkdir(os.getcwd()+'/gamePks')
        except FileExistsError:
            pass
        gamePks_path=os.getcwd()+'/gamePks'
    
    #walk the gamePks directory to see if we've already added any seasons
    f = []
    for (dirpath, dirnames, filenames) in walk(gamePks_path):
        f.extend(filenames)
        break
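    #strip the extension from each file name to recover the season, e.g. '2018.csv' -> '2018'
    years = [os.path.splitext(x)[0] for x in f]
    #keep only names that start like a year (this skips hidden files such as .DS_Store)
    already_added = [year for year in years if year[:1] in ['1','2']]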
    seasons = list(set(seasons)-set(already_added))
    
    #query the API to get start dates and end dates for all seasons
    all_seasons = mlb.get('seasons',{'sportId':1,'all':True})['seasons']
    
    #filter out the ones we don't care about right now
    seasons = list(filter(lambda x: x['seasonId'] in seasons,all_seasons))
    
    gamePks = {}
    for season in seasons:  
        year = season['seasonId']
        startDate = convert_date(season['seasonStartDate'])
        endDate = convert_date(season['seasonEndDate'])
        
        #returns a list of dicts for each date in the range
        #each dict has a 'games' key with a list of dicts for each game in that day as values
        dates = mlb.get('schedule',{'sportId':1,'startDate':startDate,'endDate':endDate})['dates']
        
        #for each date, and for each game in that date, get the gamePk 
        gamePks[year]= [ game['gamePk'] 
                                          for date in dates 
                                          for game in date['games'] ]
        #store the gamePks as CSVs
        with open(gamePks_path + f"/{year}.csv", 'w',newline='') as myfile:
            wr = csv.writer(myfile,quoting=csv.QUOTE_ALL)
            wr.writerow(gamePks[year])
get_gamePks([str(x) for x in range(2008,2020)])   
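The step that skips seasons already saved to disk depends on recovering the year from each file name. As a small, self-contained check (the file names below are made up for illustration), this sketch compares a character-class pattern such as `[^.csv]+` with `os.path.splitext`:

```python
import os
import re

# Hypothetical file names; only the first follows the '<year>.csv' layout used here.
filenames = ["2018.csv", "scores_2019.csv", ".DS_Store"]

for name in filenames:
    # '[^.csv]+' is a character class: it matches runs of characters that are
    # not '.', 'c', 's' or 'v'. That is not the same as "strip the .csv suffix".
    by_regex = re.findall("[^.csv]+", name)
    # splitext separates the extension from the rest of the name.
    stem = os.path.splitext(name)[0]
    print(f"{name!r}: regex={by_regex}, splitext={stem!r}")
```

For names made only of digits the two approaches agree, but they diverge as soon as a name contains one of the letters in the class.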

In [133]:
def read_gamePks():
    gamePks_path = os.curdir+'/gamePks'
    f = []
    for (dirpath, dirnames, filenames) in walk(gamePks_path):
        f.extend(filenames)
        break
    pk_paths = [gamePks_path + '/' + x for x in f if x[0]!= '.']
    
    gamePks = {}
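    for path in pk_paths:
        #the file name without its extension is the season, e.g. '2019'
        season = os.path.splitext(os.path.basename(path))[0]
        with open(path, 'r', newline='') as csv_file:
            reader = csv.reader(csv_file)
            seasonPks = list(reader)
        #flatten the rows; note that the gamePks come back as strings
        gamePks[season] = [item for sublist in seasonPks for item in sublist]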
    return gamePks

In [135]:
gamePks=read_gamePks()
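One thing to keep in mind is that the `csv` module reads every field back as a string, so the gamePks in this dictionary are strings rather than integers. A minimal round-trip sketch, using a temporary directory and a made-up second gamePk, shows the effect:

```python
import csv
import os
import tempfile

# 566385 appears later in this post; 566386 is a made-up neighbor for illustration.
game_pks = [566385, 566386]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "2019.csv")
    # Write one row of gamePks, quoting every field.
    with open(path, "w", newline="") as out_file:
        csv.writer(out_file, quoting=csv.QUOTE_ALL).writerow(game_pks)
    # Read the row back.
    with open(path, "r", newline="") as in_file:
        rows = list(csv.reader(in_file))

as_read = rows[0]                        # every field is a str
restored = [int(pk) for pk in as_read]   # convert back to int if needed
print(as_read, restored)
```

The API accepts the gamePk either way in the calls shown here, so the conversion only matters if you compare against integer keys later.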

<a id='part-1c'></a>

#### Explore the 'Game' Endpoint

Let's pick an arbitrary gamePk to see what's inside the 'game' endpoint. There is a TON of information stored in the nested dictionaries returned by the API query. Since our goal is to store this information in a SQL database, we want to organize it into [first normal form](https://www.essentialsql.com/get-ready-to-learn-sql-8-database-first-normal-form-explained-in-simple-english/).

From this single result, we'll be able to start building [normalized SQL tables](https://www.essentialsql.com/get-ready-to-learn-sql-database-normalization-explained-in-simple-english/) for games, teams, venues, players, plays, and pitches. Let's start with the games table. 
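To make the goal concrete, here is a rough sketch of what "flattening" means. The nested dictionary is a trimmed-down stand-in for the API result, and the venue values are invented for illustration:

```python
# Trimmed-down stand-in for the nested API result; the venue values are made up.
sample = {
    "gamePk": 566385,
    "gameData": {
        "game": {"pk": 566385, "type": "R", "season": "2019"},
        "venue": {"id": 1, "name": "Example Park"},
    },
}

game_data = sample["gameData"]

# One flat row per table: every value is a single, atomic item.
venue_row = dict(game_data["venue"])
game_row = dict(game_data["game"])
# The game row keeps only a reference to the venue, not the venue's details.
game_row["venue_id"] = venue_row["id"]

print(game_row)
print(venue_row)
```

Each row holds only scalar values, and the two rows are linked by an id instead of by nesting.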

In [249]:
temp_pk=gamePks['2019'][500]
print(temp_pk) 
game_result = mlb.get('game',{'gamePk':temp_pk})

566385


In [250]:
game_result.keys()

dict_keys(['copyright', 'gamePk', 'link', 'metaData', 'gameData', 'liveData'])

In [251]:
gameData = game_result['gameData']
gameData.keys()

dict_keys(['game', 'datetime', 'status', 'teams', 'players', 'venue', 'weather', 'review', 'flags', 'alerts', 'probablePitchers', 'officialScorer', 'primaryDatacaster'])

In [252]:
game = gameData['game']
game

{'pk': 566385,
 'type': 'R',
 'doubleHeader': 'N',
 'id': '2019/04/26/texmlb-seamlb-1',
 'gamedayType': 'P',
 'tiebreaker': 'N',
 'gameNumber': 1,
 'calendarEventID': '14-566385-2019-04-26',
 'season': '2019',
 'seasonDisplay': '2019'}

The dictionary above, nested two layers deep in the original API result, gives us a good starting point. But we're going to add some additional information to make our games table more informative.

In [253]:
gameData['datetime']

{'dateTime': '2019-04-27T02:10:00Z',
 'originalDate': '2019-04-26',
 'dayNight': 'night',
 'time': '7:10',
 'ampm': 'PM'}

In [254]:
gameData['weather']

{'condition': 'Cloudy', 'temp': '56', 'wind': '5 mph, In From CF'}

In [255]:
gameData['venue']['timeZone']

{'id': 'America/Los_Angeles', 'offset': -7, 'tz': 'PDT'}

In [296]:
gameData['status']

{'abstractGameState': 'Final',
 'codedGameState': 'F',
 'detailedState': 'Final',
 'statusCode': 'F',
 'abstractGameCode': 'F'}

In [256]:
#keys to add
keys_to_add = ['dateTime',
          'originalDate',
          'condition',
          'temp','wind','tz']
#dictionaries from which to add them
dicts = [gameData['datetime'],
         gameData['weather'],
         gameData['venue']['timeZone']
        ]

for k in keys_to_add:
    for d in dicts:
        try:
            game[k]=d[k]
        except KeyError:
            continue
#'seasonDisplay' key:value seems to be redundant
del game['seasonDisplay']
game

{'pk': 566385,
 'type': 'R',
 'doubleHeader': 'N',
 'id': '2019/04/26/texmlb-seamlb-1',
 'gamedayType': 'P',
 'tiebreaker': 'N',
 'gameNumber': 1,
 'calendarEventID': '14-566385-2019-04-26',
 'season': '2019',
 'dateTime': '2019-04-27T02:10:00Z',
 'originalDate': '2019-04-26',
 'condition': 'Cloudy',
 'temp': '56',
 'wind': '5 mph, In From CF',
 'tz': 'PDT'}

In [305]:
gameData['probablePitchers']

{'away': {'id': 571946,
  'fullName': 'Miller, Shelby',
  'link': '/api/v1/people/571946'},
 'home': {'id': 579328,
  'fullName': 'Kikuchi, Yusei',
  'link': '/api/v1/people/579328'}}

In [387]:
def get_game(api_call):
    gameData = api_call['gameData']
    dateTime = gameData['datetime']
    game = gameData['game']
    weather = gameData['weather']
    timeZone = gameData['venue']['timeZone']
    status = gameData['status']
    probablePitchers = gameData['probablePitchers']
    
    keys_to_add = ['dateTime','originalDate',
                   'condition','temp','wind',
                   'tz','detailedState'
                  ]
    dicts = [weather,dateTime,timeZone,status]
    for k in keys_to_add:
        for d in dicts:
            try:
                game[k]=d[k]
            except KeyError:
                continue
    #'seasonDisplay' key:value seems to be redundant
    del game['seasonDisplay']
    
    home_team = gameData['teams']['home']
    away_team = gameData['teams']['away']
    
    game['homeTeam_id'] = home_team['id']
    game['awayTeam_id'] = away_team['id']
    
    game['venue_id'] = gameData['venue']['id']
    
    for team in ['home','away']:
        try:
            game[f"{team}_probablePitcher"]=probablePitchers[team]['id']
        except KeyError:
            pass
    
    #format the dateTime and originalDate
    fmt = "%Y-%m-%dT%H:%M:%SZ" 
    game['dateTime'] = dt.strptime(game['dateTime'],fmt)
    fmt = "%Y-%m-%d"
    game['originalDate'] = dt.strptime(game['originalDate'],fmt).date()
    
    return game
game = get_game(mlb.get('game',{'gamePk':temp_pk}))

In [388]:
game['originalDate']

datetime.date(2019, 4, 26)
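The `dateTime` string is in UTC, which is why its date (April 27) falls one day after `originalDate` (April 26). A quick sketch that converts it to local time using the offset from the venue's `timeZone` dictionary; the variable names are only for illustration:

```python
from datetime import datetime, timedelta, timezone

# Values copied from the outputs above.
raw = "2019-04-27T02:10:00Z"
utc_offset_hours = -7  # gameData['venue']['timeZone']['offset']

# The trailing 'Z' means UTC; strptime returns a naive datetime, so attach the zone.
start_utc = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
# Shift to the venue's local time with a fixed-offset time zone.
start_local = start_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))

print(start_utc.isoformat())
print(start_local.isoformat())
```

The local result lands on April 26 at 19:10, which lines up with the 'time' and 'ampm' values in the `datetime` dictionary.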

## API call class
- instantiated with the pk used to query the 'game' endpoint
- YIELDS the API result because we don't want everything loaded into memory
- class methods parse the yielded result into dictionaries used to instantiate table records

In [324]:
def api_call(gamePk):
    yield mlb.get('game',{'gamePk':gamePk})

In [325]:
class API_call():
    def __init__(self,gamePk):
        self.result = api_call(gamePk)

In [326]:
call = API_call(temp_pk)
response = call.result
response

<generator object api_call at 0x10fd9fed0>
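The generator object shown above has not made a request yet; the body of `api_call` only runs when a value is pulled from it with `next()`. The following sketch demonstrates this offline, with a hypothetical `fake_get` function standing in for `mlb.get`:

```python
calls = []  # records when the (fake) request actually happens

def fake_get(endpoint, params):
    # Hypothetical stand-in for mlb.get so the sketch runs without network access.
    calls.append((endpoint, params["gamePk"]))
    return {"gamePk": params["gamePk"], "gameData": {}, "liveData": {}}

def api_call(gamePk):
    yield fake_get("game", {"gamePk": gamePk})

response = api_call(566385)
requests_before = len(calls)   # creating the generator runs nothing

result = next(response)        # the function body runs now
requests_after = len(calls)

print(requests_before, requests_after, list(result.keys()))
```

Because the generator yields exactly once, a second `next()` call raises `StopIteration`, so any parsing method needs to hold on to the first result.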

In [None]:
keys = ['gamePk','gameData','liveData']


<a id='part-2'></a>

## Introducing SQLAlchemy

In [363]:
import sqlalchemy
from sqlalchemy import create_engine

In [364]:
class MyDatabase:
    # http://docs.sqlalchemy.org/en/latest/core/engines.html
    """
    Custom class for instantiating a SQLAlchemy connection. Configured here for SQLite, but intended to be flexible.
    Credit to Medium user Mahmud Ahsan:
    https://medium.com/@mahmudahsan/how-to-use-python-sqlite3-using-sqlalchemy-158f9c54eb32
    """
    DB_ENGINE = {
       'sqlite': 'sqlite:////{DB}'
    }

    # Main DB Connection Ref Obj
    db_engine = None
    def __init__(self, dbtype, username='', password='', dbname='',path=os.getcwd()+'/'):
        dbtype = dbtype.lower()
        if dbtype in self.DB_ENGINE.keys():
            engine_url = self.DB_ENGINE[dbtype].format(DB=path+dbname)
            self.db_engine = create_engine(engine_url)
            print(self.db_engine)
        else:
            print("DBType is not found in DB_ENGINE")
db=MyDatabase('sqlite',dbname='mlb.db')

Engine(sqlite://///Users/schlinkertc/code/MLB/mlb_sqlite/blog_posts/mlb.db)
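The engine printed above shows five slashes after `sqlite:` because the template's four slashes are followed by the leading slash of the absolute path from `os.getcwd()`. As a rough guide, SQLAlchemy's SQLite URLs use three slashes before a relative path and four before an absolute one. A short sketch using `make_url` to show what each form resolves to (the `/tmp/mlb.db` path is just an example):

```python
from sqlalchemy.engine.url import make_url

# Three slashes: path relative to the working directory.
relative = make_url("sqlite:///mlb.db")
# Four slashes: absolute path on a POSIX system.
absolute = make_url("sqlite:////tmp/mlb.db")
# Five slashes: the database path keeps an extra leading slash.
extra_slash = make_url("sqlite://///tmp/mlb.db")

for url in (relative, absolute, extra_slash):
    print(url.database)
```

On POSIX systems a doubled leading slash usually still points at the same file, which would explain why the engine above works, but dropping one slash from the template is the tidier form.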


In [389]:
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()

from sqlalchemy import Table,Column,Integer,String,DateTime,Date,Boolean

class Game(Base):
    __tablename__ = 'games'
    __table_args__ = {'extend_existing': True}
    
    pk = Column(Integer)
    type = Column(String(1))
    doubleHeader = Column(String(1))
    id = Column(String(150), primary_key=True,unique=True)
    gamedayType = Column(String(1))
    tiebreaker = Column(String(1))
    gameNumber = Column(Integer)
    #the name must match the API key exactly, otherwise setattr in __init__ never fills this column
    calendarEventID = Column(String(50))
    season = Column(Integer)
    
    dateTime = Column(DateTime)
    originalDate = Column(Date)
    
    detailedState = Column(String(12))
    
    homeTeam_id = Column(Integer)
    awayTeam_id = Column(Integer)
    
    condition = Column(String(25))
    temp = Column(Integer)
    wind = Column(String(50))
    
    venue_id = Column(Integer)
    
    home_probablePitcher = Column(Integer)
    away_probablePitcher = Column(Integer)
    
    def __repr__(self): 
        return "<Game(pk='%s',id='%s')>" % (
                        self.pk, self.id)
    
    def __init__(self,dictionary):
        for k,v in dictionary.items():
            setattr(self,k,v)

In [380]:
Base.metadata.create_all(db.db_engine)

In [381]:
#get_game expects the API response, not the gamePk itself
game_record = Game(get_game(mlb.get('game',{'gamePk':temp_pk})))

In [382]:
from sqlalchemy.orm import sessionmaker
Session = sessionmaker(bind=db.db_engine)
session = Session()

In [383]:
session.add(game_record)

In [384]:
session.commit()

In [385]:
session.query(Game).all()

[<Game(pk='566385',id='2019/04/26/texmlb-seamlb-1')>]

Now we're going to move on to plays. We're going to set up a one-to-many relationship between games and plays, so we'll need a way to link the two when we get to the SQLAlchemy stage. With that in mind, we'll start exploring the information embedded in the 'liveData' key of the original API result.
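As a preview of where this is heading, here is a minimal sketch of a one-to-many mapping in SQLAlchemy. The class and table names (`GameSketch`, `PlaySketch`) are hypothetical, the sketch uses `pk` as the game's primary key for simplicity (unlike the `Game` class above), and it runs against an in-memory database:

```python
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import relationship, sessionmaker

try:  # SQLAlchemy 1.4 and later
    from sqlalchemy.orm import declarative_base
except ImportError:  # older releases
    from sqlalchemy.ext.declarative import declarative_base

SketchBase = declarative_base()

class GameSketch(SketchBase):
    __tablename__ = "games_sketch"
    pk = Column(Integer, primary_key=True)
    season = Column(String(4))
    # One game has many plays.
    plays = relationship("PlaySketch", back_populates="game")

class PlaySketch(SketchBase):
    __tablename__ = "plays_sketch"
    # Composite primary key: a play is identified by its game and its at-bat index.
    gamePk = Column(Integer, ForeignKey("games_sketch.pk"), primary_key=True)
    atBatIndex = Column(Integer, primary_key=True)
    event = Column(String(50))
    game = relationship("GameSketch", back_populates="plays")

engine = create_engine("sqlite:///:memory:")
SketchBase.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

game = GameSketch(pk=566385, season="2019")
# Appending to the relationship fills in the foreign key when the session flushes.
game.plays.append(PlaySketch(atBatIndex=0, event="Groundout"))
session.add(game)
session.commit()

stored = session.query(PlaySketch).one()
print(stored.gamePk, stored.game.season)
```

The foreign key column is what links each play back to its game, so every play dictionary we build needs to carry the gamePk.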

In [258]:
#'game_result' variable still contains the info from our original exploration
liveData = game_result['liveData']
liveData.keys()

dict_keys(['plays', 'linescore', 'boxscore', 'decisions', 'leaders'])

In [259]:
plays = liveData['plays']
plays.keys()

dict_keys(['allPlays', 'currentPlay', 'scoringPlays', 'playsByInning'])

In [260]:
allPlays = plays['allPlays']

'allPlays' gives us a list of dictionaries with information about every play. Remember that we're aiming for first normal form, so we don't want any information about pitches, players, or runners at this time. Those will be stored in their own tables.

In [267]:
temp_play = allPlays[0]
temp_play.keys()

dict_keys(['result', 'about', 'count', 'matchup', 'pitchIndex', 'actionIndex', 'runnerIndex', 'runners', 'playEvents', 'atBatIndex', 'playEndTime'])

In [268]:
temp_play['result']

{'type': 'atBat',
 'event': 'Groundout',
 'eventType': 'field_out',
 'description': 'Delino DeShields grounds out, second baseman Dee Gordon to first baseman Edwin Encarnacion.',
 'rbi': 0,
 'awayScore': 0,
 'homeScore': 0}

In [269]:
temp_play['about']

{'atBatIndex': 0,
 'halfInning': 'top',
 'isTopInning': True,
 'inning': 1,
 'startTime': '2019-04-27T00:55:22.000Z',
 'endTime': '2019-04-27T02:11:24.000Z',
 'isComplete': True,
 'isScoringPlay': False,
 'hasReview': False,
 'hasOut': True,
 'captivatingIndex': 0}

In [271]:
temp_play['count']

{'balls': 1, 'strikes': 0, 'outs': 1}
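The 'result', 'about', and 'count' dictionaries shown above are already flat, so one plausible way to build a play row is to merge them and attach the gamePk. A sketch using trimmed copies of those outputs (the variable names are illustrative):

```python
# Trimmed copies of the dictionaries shown above.
result = {"type": "atBat", "event": "Groundout", "eventType": "field_out", "rbi": 0}
about = {"atBatIndex": 0, "halfInning": "top", "inning": 1, "isScoringPlay": False}
count = {"balls": 1, "strikes": 0, "outs": 1}

game_pk = 566385

# Check that no key appears in more than one dictionary before merging,
# since a later dictionary would silently overwrite an earlier one.
all_keys = list(result) + list(about) + list(count)
has_collision = len(all_keys) != len(set(all_keys))

play_row = {"gamePk": game_pk, **result, **about, **count}
print(has_collision)
print(play_row)
```

The collision check is worth keeping in a real parser, because unpacking dictionaries with `**` keeps only the last value for a repeated key.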

'playEvents' will give us a list of dictionaries for pitches and events that happened during the play. We'll use the 'pitchIndex' and 'actionIndex' to access these details later.

In [288]:
print(temp_play['actionIndex'],temp_play['pitchIndex'])
len(temp_play['playEvents']) == ( len(temp_play['actionIndex']) 
                                 + len(temp_play['pitchIndex'])
                                )

[0, 1, 2] [3, 4]


True
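Since the two index lists together cover every entry in 'playEvents', they can be used to split the events into actions and pitches. A minimal sketch with placeholder events (the real entries are dictionaries with many more fields):

```python
# Index lists copied from the output above.
action_index = [0, 1, 2]
pitch_index = [3, 4]

# Placeholder events standing in for temp_play['playEvents'].
play_events = [{"position": i} for i in range(5)]

actions = [play_events[i] for i in action_index]
pitches = [play_events[i] for i in pitch_index]

# Every event should land in exactly one of the two groups.
covered = sorted(action_index + pitch_index) == list(range(len(play_events)))
print(len(actions), len(pitches), covered)
```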

In [289]:
temp_play['runnerIndex']

[0]

In [291]:
temp_play['atBatIndex']

0

In [None]:
def get_play(API_result):
    play = {}
    #foreign key references games table
    play['gamePk']=API_result['gamePk']
    return play