## Scraping Baseball Savant data

Baseball Savant provides data on many aspects of baseball, including batting, pitching, and fielding. This notebook scrapes the data directly from the website, although a user can also download a .csv file there. One advantage of scraping is that every column of data ends up in a single file, whereas the website export contains only the columns the user selects. The comments in the code note where it can be adapted to other years and selections.
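
As an illustration of how the leaderboard URL could be adapted to other years and selections, here is a minimal sketch that builds the query string from its parts. The query parameter names are taken from the URL used in this notebook; the helper `build_savant_url`, the comma-separated format for `selections`, and the example column names `xba` and `xslg` are assumptions for illustration and have not been checked against the website.

```python
from urllib.parse import urlencode

# Leaderboard endpoint, taken from the URL used in this notebook.
BASE_URL = "https://baseballsavant.mlb.com/leaderboard/custom"


def build_savant_url(year, selections=(), player_type="batter", min_pa=10):
    """Build a leaderboard URL for one year (hypothetical helper).

    Assumption: `selections` is sent as a comma-separated list of column names.
    """
    params = {
        "year": year,
        "type": player_type,
        "filter": "",
        "min": min_pa,
        "selections": ",".join(selections),
        "chart": "false",
        "x": "",
        "y": "",
        "r": "no",
        "chartType": "beeswarm",
        "sort": 1,
        "sortDir": "desc",
    }
    # safe="," keeps the commas in `selections` unescaped.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


# One URL per requested year; the column names here are placeholders.
urls = [build_savant_url(year, selections=["xba", "xslg"]) for year in ["2023", "2024"]]
```

Called with only a year, the helper reproduces the URL used in the code below, so it could stand in for the hard-coded string.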

In [16]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import matplotlib.pyplot as plot
from time import sleep
import json, re

import requests

## this is to suppress warnings I was getting in this code. 
import warnings
# Suppress FutureWarning messages
warnings.simplefilter(action='ignore', category=FutureWarning)

In [17]:
## append 'years' list with years desired. I include only '2024'
## since this is the year the bat tracking data became available. 

years = ['2024']

## build one url per requested year by substituting the year into
## the baseball savant leaderboard url.
savant_url = 'https://baseballsavant.mlb.com/leaderboard/custom?year={year}&type=batter&filter=&min=10&selections=&chart=false&x=&y=&r=no&chartType=beeswarm&sort=1&sortDir=desc'

savant_urls = [savant_url.format(year=year) for year in years]

## initialize a data frame to store the players data
players_data = pd.DataFrame()

## going through the urls accessing the baseball savant website.
for url in savant_urls:
    ## request the current url (not just the first one in the list)
    response = requests.get(url=url)
    ## stop early if the request failed
    response.raise_for_status()
    
    ## create the soup object, parsing the html code
    soup = BeautifulSoup(response.text, 'html.parser')
    
    ## find all the div tags, with attribute article template
    ## since this is where the data is held on baseball savant website
    data = soup.find_all('div', attrs={'class':'article-template'})[0]

    ## the data is held in a script tag
    script = data.find('script')

    ## since the data we want is in a variable in the html code as a json object,
    ## we need to gather it via searching in script for the right string
    
    ## \[(.*)\] greedily matches from the first [ to the last ] on the same line,
    ## which captures the whole JSON array; group(0) returns the full match.
    ## json.loads parses that JSON string into a Python list of dictionaries.
    batting_data = json.loads(re.search(r'\[(.*)\]', script.text).group(0))
    
    ## batting_data is a list of dictionaries (one per batter), so build a
    ## dataframe from it in one step and append it to players_data.
    players_data = pd.concat([players_data, pd.DataFrame(batting_data)])
    
    ## pause between requests in case the user wants more years from baseball savant.
    sleep(2)

## exporting the dataframe as a csv
players_data.to_csv('updated_batter_data.csv', index = False)
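
The regular-expression step above can be tried offline on a small string. The sketch below uses a made-up script snippet (the variable name `data` and the two records are invented for illustration and are not the real page contents) and a hypothetical helper, `extract_json_array`, that raises a clear error when no array is found instead of failing on `None`.

```python
import json
import re

# Invented stand-in for the text of the <script> tag; not real Savant data.
sample_script = 'var data = [{"player_id": 1, "name": "A"}, {"player_id": 2, "name": "B"}];'


def extract_json_array(script_text):
    """Return the JSON array embedded in script_text as a list (hypothetical helper)."""
    # Greedy match: from the first "[" to the last "]" on the same line.
    match = re.search(r'\[(.*)\]', script_text)
    if match is None:
        raise ValueError("no JSON array found in script text")
    # group(0) is the full match, brackets included, so it parses as a JSON array.
    return json.loads(match.group(0))


records = extract_json_array(sample_script)
print(len(records), records[0]["name"])
```

Because the match is greedy and `.` does not cross newlines, this approach assumes the whole array sits on one line and that no later `]` follows it on that line. If the page layout changes, `json.loads` will raise an error rather than return partial data.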