# MLB Scraper

This Python module provides a class, `MLB_Scrape`, that queries the MLB Stats API for sports, schedules, teams, leagues, game types, and pitch-level game data. Most results are returned as Polars DataFrames for easy manipulation and analysis.

## Requirements

- Python 3.x
- `requests` library
- `polars` library
- `numpy` library
- `tqdm` library
- `pytz` library

You can install the required libraries using pip:

```sh 
pip install requests polars numpy tqdm pytz
```

## Usage

Import the `MLB_Scrape` class from the module and initialize the scraper:

In [2]:
# Import the MLB_Scrape class from the module
from api_scraper import MLB_Scrape

# Initialize the scraper
scraper = MLB_Scrape()

#### get_sport_id()

Retrieves the list of sports from the MLB API and processes it into a Polars DataFrame.

In [3]:
# Call the get_sport_id method
sport_ids = scraper.get_sport_id()
print(sport_ids)

shape: (18, 7)
┌──────┬──────┬─────────────────────┬────────────────────┬──────────────┬───────────┬──────────────┐
│ id   ┆ code ┆ link                ┆ name               ┆ abbreviation ┆ sortOrder ┆ activeStatus │
│ ---  ┆ ---  ┆ ---                 ┆ ---                ┆ ---          ┆ ---       ┆ ---          │
│ i64  ┆ str  ┆ str                 ┆ str                ┆ str          ┆ i64       ┆ bool         │
╞══════╪══════╪═════════════════════╪════════════════════╪══════════════╪═══════════╪══════════════╡
│ 1    ┆ mlb  ┆ /api/v1/sports/1    ┆ Major League       ┆ MLB          ┆ 11        ┆ true         │
│      ┆      ┆                     ┆ Baseball           ┆              ┆           ┆              │
│ 11   ┆ aaa  ┆ /api/v1/sports/11   ┆ Triple-A           ┆ AAA          ┆ 101       ┆ true         │
│ 12   ┆ aax  ┆ /api/v1/sports/12   ┆ Double-A           ┆ AA           ┆ 201       ┆ true         │
│ 13   ┆ afa  ┆ /api/v1/sports/13   ┆ High-A             ┆ A+           ┆ 30
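
Methods such as `get_schedule()` take numeric sport IDs, so it can help to resolve an abbreviation to its `id` first. The following is a minimal sketch, not part of the module: `SPORT_ID_BY_ABBREVIATION` and `sport_id_for` are hypothetical names, and the mapping holds only the four rows visible in the output above.

```python
# Hypothetical lookup built from the four rows visible in the sample output.
SPORT_ID_BY_ABBREVIATION = {"MLB": 1, "AAA": 11, "AA": 12, "A+": 13}


def sport_id_for(abbreviation: str) -> int:
    """Return the sport ID for an abbreviation, ignoring letter case."""
    key = abbreviation.upper()
    if key not in SPORT_ID_BY_ABBREVIATION:
        raise KeyError(f"unknown sport abbreviation: {abbreviation!r}")
    return SPORT_ID_BY_ABBREVIATION[key]


print(sport_id_for("aaa"))  # 11
```

With the actual DataFrame, the same lookup would be a filter on the `abbreviation` column.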

#### get_sport_id_check()

Checks if the provided sport ID exists in the list of sports retrieved from the MLB API.

In [4]:
# Call the get_sport_id_check method
is_valid = scraper.get_sport_id_check(sport_id=1)
print(is_valid)


True


#### get_schedule()

Retrieves the schedule of baseball games based on the specified parameters.

In [6]:
# Call the get_schedule method
schedule = scraper.get_schedule(year_input=[2024], sport_id=[1], game_type=['R'])
print(schedule)

shape: (2_430, 8)
┌─────────┬──────────┬────────────┬───────────────┬──────────────┬───────┬──────────┬──────────────┐
│ game_id ┆ time     ┆ date       ┆ away          ┆ home         ┆ state ┆ venue_id ┆ venue_name   │
│ ---     ┆ ---      ┆ ---        ┆ ---           ┆ ---          ┆ ---   ┆ ---      ┆ ---          │
│ i64     ┆ str      ┆ date       ┆ str           ┆ str          ┆ str   ┆ i64      ┆ str          │
╞═════════╪══════════╪════════════╪═══════════════╪══════════════╪═══════╪══════════╪══════════════╡
│ 745444  ┆ 06:05 AM ┆ 2024-03-20 ┆ Los Angeles   ┆ San Diego    ┆ F     ┆ 5150     ┆ Gocheok Sky  │
│         ┆          ┆            ┆ Dodgers       ┆ Padres       ┆       ┆          ┆ Dome         │
│ 746175  ┆ 06:05 AM ┆ 2024-03-21 ┆ San Diego     ┆ Los Angeles  ┆ F     ┆ 5150     ┆ Gocheok Sky  │
│         ┆          ┆            ┆ Padres        ┆ Dodgers      ┆       ┆          ┆ Dome         │
│ 746418  ┆ 04:10 PM ┆ 2024-03-28 ┆ New York      ┆ Houston      ┆ F     
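
The `date` and `time` values in the sample rows look like US Eastern local time (the Seoul opener is listed at 06:05 AM on 2024-03-20). That is an inference from the output, not something this README states. The sketch below shows what such a conversion could look like. It assumes a UTC start time of `2024-03-20T10:05:00Z` as input and hard-codes the daylight-saving offset (UTC-4) to stay within the standard library; `to_eastern_clock` is a hypothetical helper, and a real conversion would use a time zone database such as `pytz`.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Eastern Daylight Time (UTC-4), fixed here for simplicity.
EDT = timezone(timedelta(hours=-4), name="EDT")


def to_eastern_clock(utc_iso: str) -> tuple:
    """Convert a 'YYYY-MM-DDTHH:MM:SSZ' UTC string to (date, 12-hour time) in EDT."""
    utc_dt = datetime.strptime(utc_iso, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    local_dt = utc_dt.astimezone(EDT)
    return local_dt.strftime("%Y-%m-%d"), local_dt.strftime("%I:%M %p")


# Assumed UTC start time for game 745444.
print(to_eastern_clock("2024-03-20T10:05:00Z"))  # ('2024-03-20', '06:05 AM')
```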

#### get_data() and get_data_df()

`get_data()` retrieves live game data for a list of game IDs. `get_data_df()` converts the resulting list of game data JSON objects into a Polars DataFrame.

In [8]:
# Call the get_data method
game_data = scraper.get_data(game_list_input=[745444])
# Call the get_data_df method
data_df = scraper.get_data_df(data_list=game_data)
print(data_df)

This May Take a While. Progress Bar shows Completion of Data Retrieval.


Processing: 100%|██████████| 1/1 [00:00<00:00,  4.05iteration/s]

Converting Data to Dataframe.
shape: (304, 78)
┌─────────┬────────────┬───────────┬──────────────┬───┬────────────┬─────┬────────────┬────────────┐
│ game_id ┆ game_date  ┆ batter_id ┆ batter_name  ┆ … ┆ event_type ┆ rbi ┆ away_score ┆ home_score │
│ ---     ┆ ---        ┆ ---       ┆ ---          ┆   ┆ ---        ┆ --- ┆ ---        ┆ ---        │
│ i64     ┆ str        ┆ i64       ┆ str          ┆   ┆ str        ┆ f64 ┆ f64        ┆ f64        │
╞═════════╪════════════╪═══════════╪══════════════╪═══╪════════════╪═════╪════════════╪════════════╡
│ 745444  ┆ 2024-03-20 ┆ 605141    ┆ Mookie Betts ┆ … ┆ NaN        ┆ NaN ┆ NaN        ┆ NaN        │
│ 745444  ┆ 2024-03-20 ┆ 605141    ┆ Mookie Betts ┆ … ┆ NaN        ┆ NaN ┆ NaN        ┆ NaN        │
│ 745444  ┆ 2024-03-20 ┆ 605141    ┆ Mookie Betts ┆ … ┆ NaN        ┆ NaN ┆ NaN        ┆ NaN        │
│ 745444  ┆ 2024-03-20 ┆ 605141    ┆ Mookie Betts ┆ … ┆ NaN        ┆ NaN ┆ NaN        ┆ NaN        │
│ 745444  ┆ 2024-03-20 ┆ 605141    ┆ Mookie 
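
Converting nested game JSON into a table amounts to emitting one flat row per pitch or event. The sketch below illustrates that idea on a deliberately simplified, made-up game object. The key names (`plays`, `events`, and so on) and the `flatten_game` helper are assumptions for illustration, not the real feed layout or the module's code.

```python
from typing import Any, Dict, List


def flatten_game(game: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one nested game object into a flat list of per-event rows."""
    rows = []
    for play in game["plays"]:
        for event in play["events"]:
            rows.append({
                "game_id": game["game_id"],
                "batter_id": play["batter_id"],
                "batter_name": play["batter_name"],
                "is_pitch": event["is_pitch"],
                "pitch_type": event.get("pitch_type"),  # None when absent
            })
    return rows


# Made-up game object; the pitch types are invented for illustration.
sample_game = {
    "game_id": 745444,
    "plays": [
        {
            "batter_id": 605141,
            "batter_name": "Mookie Betts",
            "events": [
                {"is_pitch": True, "pitch_type": "FF"},
                {"is_pitch": True, "pitch_type": "SL"},
            ],
        },
    ],
}

rows = flatten_game(sample_game)
print(len(rows))  # 2
```

Each row repeats the game-level and batter-level fields, which matches the repeated `game_id` and `batter_name` values in the output above.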




#### get_teams()

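Retrieves team information from the MLB API and processes it into a Polars DataFrame. The result is not limited to the 30 MLB clubs: the sample output below has 741 rows and includes college teams.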

In [9]:
# Call the get_teams method
teams = scraper.get_teams()
print(teams)

shape: (741, 10)
┌─────────┬────────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ team_id ┆ city       ┆ name      ┆ franchise ┆ … ┆ parent_or ┆ league_id ┆ league_na ┆ parent_or │
│ ---     ┆ ---        ┆ ---       ┆ ---       ┆   ┆ g         ┆ ---       ┆ me        ┆ g_abbrevi │
│ i64     ┆ str        ┆ str       ┆ str       ┆   ┆ ---       ┆ i64       ┆ ---       ┆ ation     │
│         ┆            ┆           ┆           ┆   ┆ str       ┆           ┆ str       ┆ ---       │
│         ┆            ┆           ┆           ┆   ┆           ┆           ┆           ┆ str       │
╞═════════╪════════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 100     ┆ Georgia    ┆ Yellow    ┆ Georgia   ┆ … ┆ Office of ┆ 107       ┆ College   ┆ null      │
│         ┆ Tech       ┆ Jackets   ┆ Tech      ┆   ┆ the Commi ┆           ┆ Baseball  ┆           │
│         ┆ Yellow     ┆           ┆ Yellow    ┆   ┆ ssioner   ┆          

#### get_leagues()
Retrieves league information from the MLB API and processes it into a Polars DataFrame. Spring training, minor, and college leagues are included alongside the American and National Leagues.

In [10]:
# Call the get_leagues method
leagues = scraper.get_leagues()
print(leagues)

shape: (116, 4)
┌───────────┬────────────────────────┬─────────────────────┬──────────┐
│ league_id ┆ league_name            ┆ league_abbreviation ┆ sport_id │
│ ---       ┆ ---                    ┆ ---                 ┆ ---      │
│ i64       ┆ str                    ┆ str                 ┆ i64      │
╞═══════════╪════════════════════════╪═════════════════════╪══════════╡
│ 103       ┆ American League        ┆ AL                  ┆ 1        │
│ 104       ┆ National League        ┆ NL                  ┆ 1        │
│ 114       ┆ Cactus League          ┆ CL                  ┆ null     │
│ 115       ┆ Grapefruit League      ┆ GL                  ┆ null     │
│ 117       ┆ International League   ┆ INT                 ┆ 11       │
│ …         ┆ …                      ┆ …                   ┆ …        │
│ 107       ┆ College Baseball       ┆ CBB                 ┆ 22       │
│ 108       ┆ College Baseball       ┆ CBB                 ┆ 22       │
│ 587       ┆ Showcase Games         ┆ SG       

#### get_player_games_list()
Retrieves a list of game IDs for a specific player in a given season.

In [17]:
# Call the get_player_games_list method
player_games = scraper.get_player_games_list(player_id=592450, season=2024)
print(player_games)

[746418, 746412, 746410, 746413, 747218, 747220, 747219, 745764, 745766, 745765, 745761, 745762, 745763, 746656, 746658, 746648, 744951, 744944, 744948, 745758, 745760, 745759, 745757, 745754, 745753, 745755, 746000, 745998, 746001, 747046, 747047, 747048, 747044, 745756, 745750, 745751, 745752, 745749, 745748, 745094, 745096, 745092, 745909, 745908, 745907, 745747, 745744, 745745, 745746, 745742, 745743, 745741, 745415, 745411, 745413, 746229, 746228, 746225, 745331, 745332, 745330, 745739, 745740, 745738, 745737, 745735, 745736, 746296, 746294, 746299, 746947, 746946, 746945, 745733, 745734, 745729, 745731, 745727, 745806, 745808, 744919, 744922, 744921, 744916, 745730, 745728, 745726, 745725, 745724, 745723, 745074, 745069, 745065, 747012, 747013, 747009, 745721, 745720, 745722, 745717, 745715, 745716, 746933, 746931, 746929, 745547, 745546, 745543, 745719, 745714, 745718, 745709, 745713, 745710, 745712, 745708, 745711, 746762, 746757, 746759, 746438, 746434, 746431, 745707, 745704,
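
Because a full season produces a long list of game IDs, it can be convenient to fetch the games in smaller batches. This is a minimal sketch using only the standard library. `batched_ids` is a hypothetical helper, not part of the module, and each batch could be passed as `game_list_input` to `get_data()`.

```python
from typing import Iterator, List


def batched_ids(game_ids: List[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of game_ids with at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(game_ids), batch_size):
        yield game_ids[start:start + batch_size]


# First five IDs from the output above.
sample_ids = [746418, 746412, 746410, 746413, 747218]
for batch in batched_ids(sample_ids, batch_size=2):
    print(batch)
```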

#### get_game_types()
Retrieves the different types of MLB games from the MLB API and processes them into a Polars DataFrame.

In [15]:
# Call the get_game_types method
game_types = scraper.get_game_types()
print(game_types)

shape: (12, 2)
┌─────┬────────────────────────────┐
│ id  ┆ description                │
│ --- ┆ ---                        │
│ str ┆ str                        │
╞═════╪════════════════════════════╡
│ S   ┆ Spring Training            │
│ R   ┆ Regular Season             │
│ F   ┆ Wild Card Game             │
│ D   ┆ Division Series            │
│ L   ┆ League Championship Series │
│ …   ┆ …                          │
│ N   ┆ Nineteenth Century Series  │
│ P   ┆ Playoffs                   │
│ A   ┆ All-Star Game              │
│ I   ┆ Intrasquad                 │
│ E   ┆ Exhibition                 │
└─────┴────────────────────────────┘
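
These codes are the values passed as `game_type` to `get_schedule()`. A small pre-flight check can catch typos before any request is made. The sketch below is an assumption-laden illustration: `VISIBLE_GAME_TYPES` contains only the ten codes visible in the output above (two rows are elided there and are left out here), and `unknown_game_types` is a hypothetical helper.

```python
# Codes visible in the sample output; the two elided rows are omitted.
VISIBLE_GAME_TYPES = {
    "S": "Spring Training",
    "R": "Regular Season",
    "F": "Wild Card Game",
    "D": "Division Series",
    "L": "League Championship Series",
    "N": "Nineteenth Century Series",
    "P": "Playoffs",
    "A": "All-Star Game",
    "I": "Intrasquad",
    "E": "Exhibition",
}


def unknown_game_types(requested):
    """Return the requested codes that are not in VISIBLE_GAME_TYPES, in order."""
    return [code for code in requested if code not in VISIBLE_GAME_TYPES]


print(unknown_game_types(["R", "D", "Z"]))  # ['Z']
```

In practice the set of valid codes would come from the `id` column returned by `get_game_types()` rather than a hard-coded dictionary.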


## Example

In this example, we retrieve the pitch-by-pitch data for Bryce Miller in the 2024 MLB regular season.

In [42]:
import polars as pl
# Bryce Miller player ID
player_id = 682243
season = 2024

# Get game IDs for Bryce Miller
player_games = scraper.get_player_games_list(player_id=player_id, season=season)

# Get data for Bryce Miller
data = scraper.get_data(game_list_input=player_games)
df = scraper.get_data_df(data_list=data)
# Print the data
print(df)

This May Take a While. Progress Bar shows Completion of Data Retrieval.


Processing: 100%|██████████| 27/27 [00:04<00:00,  5.51iteration/s]


Converting Data to Dataframe.
shape: (7_896, 78)
┌─────────┬────────────┬───────────┬──────────────┬───┬────────────┬─────┬────────────┬────────────┐
│ game_id ┆ game_date  ┆ batter_id ┆ batter_name  ┆ … ┆ event_type ┆ rbi ┆ away_score ┆ home_score │
│ ---     ┆ ---        ┆ ---       ┆ ---          ┆   ┆ ---        ┆ --- ┆ ---        ┆ ---        │
│ i64     ┆ str        ┆ i64       ┆ str          ┆   ┆ str        ┆ f64 ┆ f64        ┆ f64        │
╞═════════╪════════════╪═══════════╪══════════════╪═══╪════════════╪═════╪════════════╪════════════╡
│ 745279  ┆ 2024-03-31 ┆ 680776    ┆ Jarren Duran ┆ … ┆ NaN        ┆ NaN ┆ NaN        ┆ NaN        │
│ 745279  ┆ 2024-03-31 ┆ 680776    ┆ Jarren Duran ┆ … ┆ NaN        ┆ NaN ┆ NaN        ┆ NaN        │
│ 745279  ┆ 2024-03-31 ┆ 680776    ┆ Jarren Duran ┆ … ┆ NaN        ┆ NaN ┆ NaN        ┆ NaN        │
│ 745279  ┆ 2024-03-31 ┆ 680776    ┆ Jarren Duran ┆ … ┆ strikeout  ┆ 0.0 ┆ 0.0        ┆ 0.0        │
│ 745279  ┆ 2024-03-31 ┆ 646240    ┆ Rafae

With the DataFrame, we can filter only pitches thrown by Bryce Miller this season and then group by pitch type to get the metrics for each pitch.

We will be getting the following metrics:
- pitches: Number of Pitches
- start_speed: Initial Velocity of the Pitch (mph)
- ivb: Induced Vertical Break (in)
- hb: Horizontal Break (in)
- spin_rate: Spin Rate (rpm)
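
To make the arithmetic behind these metrics concrete, here is a plain-Python sketch of a group-by that counts pitches, averages one metric while skipping NaN readings, and computes each pitch type's share of the total. It is a simplification of the group-by used in this example, not a replacement for it: `summarize` is a hypothetical helper, only `start_speed` is averaged, pitches with a NaN speed are not counted, and the sample readings are made up.

```python
import math
from collections import defaultdict


def summarize(pitches):
    """Group pitch dicts by pitch_type; return count, mean speed, and proportion."""
    speeds = defaultdict(list)
    for pitch in pitches:
        # Skip missing readings, mirroring the role of drop_nans().
        if not math.isnan(pitch["start_speed"]):
            speeds[pitch["pitch_type"]].append(pitch["start_speed"])
    total = sum(len(values) for values in speeds.values())
    return {
        pitch_type: {
            "pitches": len(values),
            "start_speed": round(sum(values) / len(values), 1),
            "proportion": round(len(values) / total, 3),
        }
        for pitch_type, values in speeds.items()
    }


# Made-up readings, not real pitch data.
sample = [
    {"pitch_type": "FF", "start_speed": 95.0},
    {"pitch_type": "FF", "start_speed": 96.0},
    {"pitch_type": "FF", "start_speed": float("nan")},
    {"pitch_type": "SL", "start_speed": 87.0},
]
print(summarize(sample))
```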

In [64]:
# Group the data by pitch type
grouped_df = (
    df.filter(pl.col("pitcher_id") == player_id)
    .group_by(['pitcher_id', 'pitch_type'])
    .agg([
        pl.col('is_pitch').drop_nans().count().alias('pitches'),
        pl.col('start_speed').drop_nans().mean().round(1).alias('start_speed'),
        pl.col('ivb').drop_nans().mean().round(1).alias('ivb'),
        pl.col('hb').drop_nans().mean().round(1).alias('hb'),
        pl.col('spin_rate').drop_nans().mean().round(0).alias('spin_rate'),
    ])
    .with_columns(
        (pl.col('pitches') / pl.col('pitches').sum().over('pitcher_id')).round(3).alias('proportion')
    )
    .sort('proportion', descending=True)
)

# Display the grouped DataFrame
print(grouped_df)

shape: (7, 8)
┌────────────┬────────────┬─────────┬─────────────┬──────┬───────┬───────────┬────────────┐
│ pitcher_id ┆ pitch_type ┆ pitches ┆ start_speed ┆ ivb  ┆ hb    ┆ spin_rate ┆ proportion │
│ ---        ┆ ---        ┆ ---     ┆ ---         ┆ ---  ┆ ---   ┆ ---       ┆ ---        │
│ i64        ┆ str        ┆ u32     ┆ f64         ┆ f64  ┆ f64   ┆ f64       ┆ f64        │
╞════════════╪════════════╪═════════╪═════════════╪══════╪═══════╪═══════════╪════════════╡
│ 682243     ┆ FF         ┆ 1014    ┆ 95.1        ┆ 18.2 ┆ 6.4   ┆ 2481.0    ┆ 0.428      │
│ 682243     ┆ SI         ┆ 414     ┆ 94.7        ┆ 10.8 ┆ 16.0  ┆ 2404.0    ┆ 0.175      │
│ 682243     ┆ FS         ┆ 393     ┆ 84.3        ┆ -0.1 ┆ 9.5   ┆ 867.0     ┆ 0.166      │
│ 682243     ┆ ST         ┆ 212     ┆ 82.3        ┆ -5.3 ┆ -14.9 ┆ 2253.0    ┆ 0.09       │
│ 682243     ┆ SL         ┆ 192     ┆ 86.7        ┆ 1.5  ┆ -3.2  ┆ 2534.0    ┆ 0.081      │
│ 682243     ┆ KC         ┆ 95      ┆ 85.1        ┆ -6.5 ┆ -1.6  ┆