# MLB Scraper

This Python module provides a class `MLB_Scrape` that interacts with the MLB Stats API to retrieve various types of baseball-related data. The data is processed and returned as Polars DataFrames for easy manipulation and analysis.

## Requirements

- Python 3.x
- `requests` library
- `polars` library
- `numpy` library
- `tqdm` library
- `pytz` library

You can install the required libraries using pip:

```sh
pip install requests polars numpy tqdm pytz
```

## Usage

Import the `MLB_Scrape` class from the module and initialize the scraper:

In [1]:
# Import the MLB_Scrape class from the module
from api_scraper import MLB_Scrape

# Initialize the scraper
scraper = MLB_Scrape()

#### get_sport_id()

Retrieves the list of sports from the MLB Stats API and processes it into a Polars DataFrame.

In [2]:
# Call the get_sport_id method
sport_ids = scraper.get_sport_id()
print(sport_ids)

shape: (18, 7)
┌──────┬──────┬─────────────────────┬────────────────────┬──────────────┬───────────┬──────────────┐
│ id   ┆ code ┆ link                ┆ name               ┆ abbreviation ┆ sortOrder ┆ activeStatus │
│ ---  ┆ ---  ┆ ---                 ┆ ---                ┆ ---          ┆ ---       ┆ ---          │
│ i64  ┆ str  ┆ str                 ┆ str                ┆ str          ┆ i64       ┆ bool         │
╞══════╪══════╪═════════════════════╪════════════════════╪══════════════╪═══════════╪══════════════╡
│ 1    ┆ mlb  ┆ /api/v1/sports/1    ┆ Major League       ┆ MLB          ┆ 11        ┆ true         │
│      ┆      ┆                     ┆ Baseball           ┆              ┆           ┆              │
│ 11   ┆ aaa  ┆ /api/v1/sports/11   ┆ Triple-A           ┆ AAA          ┆ 101       ┆ true         │
│ 12   ┆ aax  ┆ /api/v1/sports/12   ┆ Double-A           ┆ AA           ┆ 201       ┆ true         │
│ 13   ┆ afa  ┆ /api/v1/sports/13   ┆ High-A             ┆ A+           ┆ 30

#### get_sport_id_check()

Checks if the provided sport ID exists in the list of sports retrieved from the MLB Stats API.

In [3]:
# Call the get_sport_id_check method
is_valid = scraper.get_sport_id_check(sport_id=1)
print(is_valid)


True
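
The check itself is not shown in this README. A minimal sketch of how such a membership test could work is below, assuming the sports list is available as a list of records with an `id` field. `SAMPLE_SPORTS` and `sport_id_exists` are hypothetical names used only for illustration; the four records mirror the rows visible in the `get_sport_id()` output.

```python
# Hypothetical stand-in for the sports list. The four records mirror the
# rows visible in the get_sport_id() output, trimmed to a few fields.
SAMPLE_SPORTS = [
    {"id": 1, "name": "Major League Baseball", "abbreviation": "MLB"},
    {"id": 11, "name": "Triple-A", "abbreviation": "AAA"},
    {"id": 12, "name": "Double-A", "abbreviation": "AA"},
    {"id": 13, "name": "High-A", "abbreviation": "A+"},
]


def sport_id_exists(sport_id, sports):
    """Return True if any record in `sports` carries the given id."""
    return any(record["id"] == sport_id for record in sports)


print(sport_id_exists(1, SAMPLE_SPORTS))    # True
print(sport_id_exists(999, SAMPLE_SPORTS))  # False
```

Running a check like this before requesting a schedule can catch a mistyped `sport_id` early.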


#### get_schedule()

Retrieves the schedule of baseball games based on the specified parameters.
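
How the module builds its request is not shown in this README. As a rough sketch only, the parameters might map onto a schedule query like the one below. The base URL, the `sportId`, `season` and `gameType` parameter names, the comma-joining of list values, and the one-request-per-season split are all assumptions, and `build_schedule_urls` is a hypothetical helper rather than part of `MLB_Scrape`.

```python
from urllib.parse import urlencode

# Assumed base URL for the schedule endpoint of the MLB Stats API.
SCHEDULE_URL = "https://statsapi.mlb.com/api/v1/schedule"


def build_schedule_urls(years, sport_ids, game_types):
    """Return one schedule URL per season (hypothetical helper)."""
    urls = []
    for year in years:
        params = {
            "sportId": ",".join(str(sport_id) for sport_id in sport_ids),
            "season": year,
            "gameType": ",".join(game_types),
        }
        # safe="," keeps the comma separators literal instead of %2C.
        urls.append(f"{SCHEDULE_URL}?{urlencode(params, safe=',')}")
    return urls


for url in build_schedule_urls([2024], [1], ["R"]):
    print(url)
```

The list-valued arguments in the sketch follow the shape of the real call, which also takes `year_input`, `sport_id` and `game_type` as lists.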

In [4]:
# Call the get_schedule method
schedule = scraper.get_schedule(year_input=[2024], sport_id=[1], game_type=['R'])
print(schedule)

shape: (2_430, 8)
┌─────────┬──────────┬────────────┬───────────────┬──────────────┬───────┬──────────┬──────────────┐
│ game_id ┆ time     ┆ date       ┆ away          ┆ home         ┆ state ┆ venue_id ┆ venue_name   │
│ ---     ┆ ---      ┆ ---        ┆ ---           ┆ ---          ┆ ---   ┆ ---      ┆ ---          │
│ i64     ┆ str      ┆ date       ┆ str           ┆ str          ┆ str   ┆ i64      ┆ str          │
╞═════════╪══════════╪════════════╪═══════════════╪══════════════╪═══════╪══════════╪══════════════╡
│ 745444  ┆ 06:05 AM ┆ 2024-03-20 ┆ Los Angeles   ┆ San Diego    ┆ F     ┆ 5150     ┆ Gocheok Sky  │
│         ┆          ┆            ┆ Dodgers       ┆ Padres       ┆       ┆          ┆ Dome         │
│ 746175  ┆ 06:05 AM ┆ 2024-03-21 ┆ San Diego     ┆ Los Angeles  ┆ F     ┆ 5150     ┆ Gocheok Sky  │
│         ┆          ┆            ┆ Padres        ┆ Dodgers      ┆       ┆          ┆ Dome         │
│ 746418  ┆ 04:10 PM ┆ 2024-03-28 ┆ New York      ┆ Houston      ┆ F     

#### get_data() and get_data_df()

`get_data()` retrieves live game data for a list of game IDs. `get_data_df()` converts the resulting list of game data JSON objects into a Polars DataFrame.

In [6]:
# Call the get_data method
game_data = scraper.get_data(game_list_input=[745444])
# Call the get_data_df method
data_df = scraper.get_data_df(data_list=game_data)
print(data_df)

This May Take a While. Progress Bar shows Completion of Data Retrieval.


Processing: 100%|██████████| 1/1 [00:00<00:00, 15.32iteration/s]

Converting Data to Dataframe.
shape: (304, 78)
┌─────────┬────────────┬───────────┬─────────────┬───┬────────────┬──────┬────────────┬────────────┐
│ game_id ┆ game_date  ┆ batter_id ┆ batter_name ┆ … ┆ event_type ┆ rbi  ┆ away_score ┆ home_score │
│ ---     ┆ ---        ┆ ---       ┆ ---         ┆   ┆ ---        ┆ ---  ┆ ---        ┆ ---        │
│ i64     ┆ str        ┆ i64       ┆ str         ┆   ┆ str        ┆ i64  ┆ i64        ┆ i64        │
╞═════════╪════════════╪═══════════╪═════════════╪═══╪════════════╪══════╪════════════╪════════════╡
│ 745444  ┆ 2024-03-20 ┆ 605141    ┆ Mookie      ┆ … ┆ null       ┆ null ┆ null       ┆ null       │
│         ┆            ┆           ┆ Betts       ┆   ┆            ┆      ┆            ┆            │
│ 745444  ┆ 2024-03-20 ┆ 605141    ┆ Mookie      ┆ … ┆ null       ┆ null ┆ null       ┆ null       │
│         ┆            ┆           ┆ Betts       ┆   ┆            ┆      ┆            ┆            │
│ 745444  ┆ 2024-03-20 ┆ 605141    ┆ Mookie 
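
The conversion step is not shown in this README. The sketch below illustrates the general idea of flattening nested game JSON into one row per play event, using plain Python. The nesting (`liveData` → `plays` → `allPlays` → `playEvents`) is an assumption about the feed layout, `flatten_events` is a hypothetical helper, and the `SAMPLE_GAME` payload is a hand-written toy whose pitch type is a placeholder, not real data.

```python
def flatten_events(game):
    """Flatten one nested game dict into a list of per-event rows."""
    rows = []
    for play in game["liveData"]["plays"]["allPlays"]:
        batter = play["matchup"]["batter"]
        for event in play["playEvents"]:
            rows.append({
                "game_id": game["gamePk"],
                "batter_id": batter["id"],
                "batter_name": batter["fullName"],
                "is_pitch": bool(event.get("isPitch")),
                # Missing keys fall through to None instead of raising.
                "pitch_type": event.get("details", {}).get("type", {}).get("code"),
            })
    return rows


# Toy payload: the IDs and name come from the output above, the events are invented.
SAMPLE_GAME = {
    "gamePk": 745444,
    "liveData": {"plays": {"allPlays": [{
        "matchup": {"batter": {"id": 605141, "fullName": "Mookie Betts"}},
        "playEvents": [
            {"isPitch": True, "details": {"type": {"code": "FF"}}},
            {"isPitch": False, "details": {}},
        ],
    }]}},
}

for row in flatten_events(SAMPLE_GAME):
    print(row)
```

A list of flat dictionaries like this is a shape that `polars.DataFrame` accepts directly, which is one plausible route to the 78-column table shown above.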




#### get_teams()

Retrieves information about MLB teams from the MLB Stats API and processes it into a Polars DataFrame.

In [7]:
# Call the get_teams method
teams = scraper.get_teams()
print(teams)

shape: (720, 10)
┌─────────┬────────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬───────────┐
│ team_id ┆ city       ┆ name      ┆ franchise ┆ … ┆ parent_or ┆ league_id ┆ league_na ┆ parent_or │
│ ---     ┆ ---        ┆ ---       ┆ ---       ┆   ┆ g         ┆ ---       ┆ me        ┆ g_abbrevi │
│ i64     ┆ str        ┆ str       ┆ str       ┆   ┆ ---       ┆ i64       ┆ ---       ┆ ation     │
│         ┆            ┆           ┆           ┆   ┆ str       ┆           ┆ str       ┆ ---       │
│         ┆            ┆           ┆           ┆   ┆           ┆           ┆           ┆ str       │
╞═════════╪════════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 100     ┆ Georgia    ┆ Yellow    ┆ Georgia   ┆ … ┆ Office of ┆ 107       ┆ College   ┆ null      │
│         ┆ Tech       ┆ Jackets   ┆ Tech      ┆   ┆ the Commi ┆           ┆ Baseball  ┆           │
│         ┆ Yellow     ┆           ┆ Yellow    ┆   ┆ ssioner   ┆          

#### get_leagues()
Retrieves information about MLB leagues from the MLB Stats API and processes it into a Polars DataFrame.

In [8]:
# Call the get_leagues method
leagues = scraper.get_leagues()
print(leagues)

shape: (116, 4)
┌───────────┬────────────────────────┬─────────────────────┬──────────┐
│ league_id ┆ league_name            ┆ league_abbreviation ┆ sport_id │
│ ---       ┆ ---                    ┆ ---                 ┆ ---      │
│ i64       ┆ str                    ┆ str                 ┆ i64      │
╞═══════════╪════════════════════════╪═════════════════════╪══════════╡
│ 103       ┆ American League        ┆ AL                  ┆ 1        │
│ 104       ┆ National League        ┆ NL                  ┆ 1        │
│ 114       ┆ Cactus League          ┆ CL                  ┆ null     │
│ 115       ┆ Grapefruit League      ┆ GL                  ┆ null     │
│ 117       ┆ International League   ┆ INT                 ┆ 11       │
│ …         ┆ …                      ┆ …                   ┆ …        │
│ 107       ┆ College Baseball       ┆ CBB                 ┆ 22       │
│ 108       ┆ College Baseball       ┆ CBB                 ┆ 22       │
│ 587       ┆ Showcase Games         ┆ SG       

#### get_player_games_list()
Retrieves a list of game IDs for a specific player in a given season.

In [9]:
# Call the get_player_games_list method
player_games = scraper.get_player_games_list(player_id=660271, season=2024, pitching=False)
print(player_games)

[745444, 746175, 746165, 746167, 746168, 746166, 746170, 746169, 746163, 746897, 746896, 746895, 745925, 745924, 745923, 746162, 746161, 746164, 746158, 746157, 746159, 746160, 746156, 746154, 744863, 744864, 744867, 744949, 744946, 744947, 747211, 747210, 746152, 746155, 746153, 746149, 746147, 746151, 745421, 745422, 745343, 745342, 745341, 746150, 746148, 746142, 746144, 746146, 746145, 746143, 746713, 746710, 746711, 745818, 745817, 746140, 746138, 746141, 745497, 745494, 745495, 745737, 745735, 745736, 746139, 746137, 746136, 746134, 746135, 746130, 746543, 746542, 746538, 746539, 746131, 746132, 746780, 746775, 746777, 745319, 745317, 745320, 746133, 746128, 746126, 746127, 746129, 746122, 745557, 745556, 745558, 746450, 746447, 746449, 746123, 746121, 746125, 746124, 746118, 746119, 746117, 746361, 746362, 746364, 745389, 745386, 745631, 745636, 745627, 746116, 746120, 746115, 746113, 746112, 746114, 745953, 745954, 745951, 745950, 745144, 745140, 745139, 746109, 746107, 746110,

#### get_game_types()
Retrieves the different types of MLB games from the MLB Stats API and processes them into a Polars DataFrame.

In [10]:
# Call the get_game_types method
game_types = scraper.get_game_types()
print(game_types)

shape: (12, 2)
┌─────┬────────────────────────────┐
│ id  ┆ description                │
│ --- ┆ ---                        │
│ str ┆ str                        │
╞═════╪════════════════════════════╡
│ S   ┆ Spring Training            │
│ R   ┆ Regular Season             │
│ F   ┆ Wild Card Game             │
│ D   ┆ Division Series            │
│ L   ┆ League Championship Series │
│ …   ┆ …                          │
│ N   ┆ Nineteenth Century Series  │
│ P   ┆ Playoffs                   │
│ A   ┆ All-Star Game              │
│ I   ┆ Intrasquad                 │
│ E   ┆ Exhibition                 │
└─────┴────────────────────────────┘
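
The `id` codes in this table are the values passed as `game_type` to methods such as `get_schedule()`. A small lookup like the sketch below can validate codes before making a request. `KNOWN_GAME_TYPES` and `describe_game_types` are hypothetical names, and the dictionary holds only the ten rows visible above, so the two rows hidden behind the ellipsis would be rejected until they are added.

```python
# Codes and descriptions copied from the rows visible in the output above.
# The two rows hidden behind the ellipsis are deliberately left out.
KNOWN_GAME_TYPES = {
    "S": "Spring Training",
    "R": "Regular Season",
    "F": "Wild Card Game",
    "D": "Division Series",
    "L": "League Championship Series",
    "N": "Nineteenth Century Series",
    "P": "Playoffs",
    "A": "All-Star Game",
    "I": "Intrasquad",
    "E": "Exhibition",
}


def describe_game_types(codes):
    """Map game type codes to descriptions, rejecting unknown codes."""
    unknown = [code for code in codes if code not in KNOWN_GAME_TYPES]
    if unknown:
        raise ValueError(f"Unrecognised game type code(s): {unknown}")
    return [KNOWN_GAME_TYPES[code] for code in codes]


print(describe_game_types(["R", "P"]))  # ['Regular Season', 'Playoffs']
```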


#### get_players()
Retrieves player information from the MLB Stats API and processes it into a Polars DataFrame.

In [11]:
# Call the get_players method
df_player = scraper.get_players(sport_id=1, season=2024, game_type=['S'])
print(df_player)

shape: (1_442, 6)
┌───────────┬─────────────┬────────────┬──────────────────┬──────────┬──────┐
│ player_id ┆ first_name  ┆ last_name  ┆ name             ┆ position ┆ team │
│ ---       ┆ ---         ┆ ---        ┆ ---              ┆ ---      ┆ ---  │
│ i64       ┆ str         ┆ str        ┆ str              ┆ str      ┆ i64  │
╞═══════════╪═════════════╪════════════╪══════════════════╪══════════╪══════╡
│ 445276    ┆ Kenley      ┆ Jansen     ┆ Kenley Jansen    ┆ P        ┆ 111  │
│ 445926    ┆ Jesse       ┆ Chavez     ┆ Jesse Chavez     ┆ P        ┆ 145  │
│ 450203    ┆ Charles     ┆ Morton     ┆ Charlie Morton   ┆ P        ┆ 144  │
│ 455119    ┆ Christopher ┆ Martin     ┆ Chris Martin     ┆ P        ┆ 111  │
│ 458677    ┆ Justin      ┆ Wilson     ┆ Justin Wilson    ┆ P        ┆ 113  │
│ …         ┆ …           ┆ …          ┆ …                ┆ …        ┆ …    │
│ 814217    ┆ Samuel      ┆ Gardner    ┆ Sam Gardner      ┆ P        ┆ 158  │
│ 814280    ┆ Landon      ┆ Tomkins    ┆ Lando

## Example

In this example, we retrieve pitch-by-pitch data for every game Bryce Miller pitched in during the 2024 MLB regular season.

In [12]:
import polars as pl
# Bryce Miller's player ID
player_id = 682243
season = 2024

# Get game IDs for Bryce Miller
player_games = scraper.get_player_games_list(player_id=player_id, season=season, game_type=['R'], pitching=True)

# Get pitch-by-pitch data for those games
data = scraper.get_data(game_list_input=player_games)
df = scraper.get_data_df(data_list=data)
# Print the data
print(df)

This May Take a While. Progress Bar shows Completion of Data Retrieval.


Processing: 100%|██████████| 31/31 [00:00<00:00, 37.83iteration/s]


Converting Data to Dataframe.
shape: (9_130, 78)
┌─────────┬────────────┬───────────┬─────────────┬───┬────────────┬──────┬────────────┬────────────┐
│ game_id ┆ game_date  ┆ batter_id ┆ batter_name ┆ … ┆ event_type ┆ rbi  ┆ away_score ┆ home_score │
│ ---     ┆ ---        ┆ ---       ┆ ---         ┆   ┆ ---        ┆ ---  ┆ ---        ┆ ---        │
│ i64     ┆ str        ┆ i64       ┆ str         ┆   ┆ str        ┆ i64  ┆ i64        ┆ i64        │
╞═════════╪════════════╪═══════════╪═════════════╪═══╪════════════╪══════╪════════════╪════════════╡
│ 745279  ┆ 2024-03-31 ┆ 680776    ┆ Jarren      ┆ … ┆ null       ┆ null ┆ null       ┆ null       │
│         ┆            ┆           ┆ Duran       ┆   ┆            ┆      ┆            ┆            │
│ 745279  ┆ 2024-03-31 ┆ 680776    ┆ Jarren      ┆ … ┆ null       ┆ null ┆ null       ┆ null       │
│         ┆            ┆           ┆ Duran       ┆   ┆            ┆      ┆            ┆            │
│ 745279  ┆ 2024-03-31 ┆ 680776    ┆ Jarre

The DataFrame contains every pitch from those games, so we first keep only the pitches thrown by Bryce Miller and then group by pitch type to compute metrics for each pitch type.

We will be getting the following metrics:
- pitches: Number of Pitches
- start_speed: Initial Velocity of the Pitch (mph)
- ivb: Induced Vertical Break (in)
- hb: Horizontal Break (in)
- spin_rate: Spin Rate (rpm)
- proportion: Share of the pitcher's total pitches
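
To make the arithmetic behind a grouped summary concrete, here is a plain-Python sketch of the same idea: count pitches per type, average one metric, and divide each count by the total. `summarize_by_pitch_type` is a hypothetical helper and `toy_rows` holds invented values chosen only to keep the numbers easy to check. It is not the Polars code used in this example.

```python
from collections import defaultdict


def summarize_by_pitch_type(rows):
    """Group rows by pitch type and compute count, mean speed, and share."""
    speeds = defaultdict(list)
    for row in rows:
        speeds[row["pitch_type"]].append(row["start_speed"])

    total = sum(len(values) for values in speeds.values())
    return {
        pitch_type: {
            "pitches": len(values),
            "start_speed": round(sum(values) / len(values), 1),
            "proportion": round(len(values) / total, 3),
        }
        for pitch_type, values in speeds.items()
    }


# Invented toy rows, not real pitch data.
toy_rows = [
    {"pitch_type": "FF", "start_speed": 95.0},
    {"pitch_type": "FF", "start_speed": 96.0},
    {"pitch_type": "FF", "start_speed": 94.0},
    {"pitch_type": "SI", "start_speed": 94.0},
]

print(summarize_by_pitch_type(toy_rows))
```

With these toy rows, `FF` accounts for three of the four pitches, so its proportion is 0.75.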

In [13]:
# Group the data by pitch type
grouped_df = (
    df.filter(pl.col("pitcher_id") == player_id)
    .group_by(['pitcher_id', 'pitch_type'])
    .agg([
        pl.col('is_pitch').drop_nans().count().alias('pitches'),
        pl.col('start_speed').drop_nans().mean().round(1).alias('start_speed'),
        pl.col('ivb').drop_nans().mean().round(1).alias('ivb'),
        pl.col('hb').drop_nans().mean().round(1).alias('hb'),
        pl.col('spin_rate').drop_nans().mean().round(0).alias('spin_rate'),
    ])
    .with_columns(
        (pl.col('pitches') / pl.col('pitches').sum().over('pitcher_id')).round(3).alias('proportion')
    )
    .sort('proportion', descending=True)
)

# Display the grouped DataFrame
print(grouped_df)

shape: (8, 8)
┌────────────┬────────────┬─────────┬─────────────┬──────┬───────┬───────────┬────────────┐
│ pitcher_id ┆ pitch_type ┆ pitches ┆ start_speed ┆ ivb  ┆ hb    ┆ spin_rate ┆ proportion │
│ ---        ┆ ---        ┆ ---     ┆ ---         ┆ ---  ┆ ---   ┆ ---       ┆ ---        │
│ i64        ┆ str        ┆ u32     ┆ f64         ┆ f64  ┆ f64   ┆ f64       ┆ f64        │
╞════════════╪════════════╪═════════╪═════════════╪══════╪═══════╪═══════════╪════════════╡
│ 682243     ┆ FF         ┆ 1162    ┆ 95.2        ┆ 18.3 ┆ 6.3   ┆ 2483.0    ┆ 0.422      │
│ 682243     ┆ SI         ┆ 470     ┆ 94.7        ┆ 10.9 ┆ 16.0  ┆ 2407.0    ┆ 0.171      │
│ 682243     ┆ FS         ┆ 469     ┆ 84.4        ┆ 0.4  ┆ 9.9   ┆ 912.0     ┆ 0.17       │
│ 682243     ┆ ST         ┆ 234     ┆ 82.4        ┆ -5.3 ┆ -14.8 ┆ 2257.0    ┆ 0.085      │
│ 682243     ┆ SL         ┆ 229     ┆ 86.6        ┆ 1.3  ┆ -3.2  ┆ 2541.0    ┆ 0.083      │
│ 682243     ┆ KC         ┆ 122     ┆ 85.0        ┆ -6.8 ┆ -1.7  ┆