# Scrape and combine IL and DTD data

Scrape IL (injured list) data from the MLB Stats API and DTD (day-to-day) data from Pro Sports. Because Pro Sports provides only player names, not ids, aligning its injuries to MLB ids is harder: players are narrowed down by name and by the team where the injury occurred.
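To illustrate the name-plus-team narrowing, here is a minimal sketch on toy data. It is not the `AlignProsportsMLB` implementation used later in this notebook; the `normalize_name` helper, the `key` column, and all rows are hypothetical and exist only for this example.

```python
import unicodedata

import pandas as pd


def normalize_name(name: str) -> str:
    """Strip accents and lowercase so 'José Ramírez' matches 'Jose Ramirez'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


# Toy reference table with ids (values are made up for illustration).
players = pd.DataFrame(
    {
        "player_id": [1, 2, 3],
        "full_name": ["José Ramírez", "Jose Ramirez", "Will Smith"],
        "team": ["CLE", "ATL", "LAD"],
    }
)

# Toy DTD rows: names only, plus the team where the injury occurred.
dtd = pd.DataFrame({"name": ["Jose Ramirez", "Will Smith"], "team": ["CLE", "LAD"]})

players["key"] = players["full_name"].map(normalize_name)
dtd["key"] = dtd["name"].map(normalize_name)

# Name alone is ambiguous here (two "jose ramirez"); name plus team is unique.
matched = dtd.merge(
    players[["player_id", "key", "team"]], on=["key", "team"], how="left"
)
```

A left merge keeps DTD rows that find no match, so unmatched players show up as missing ids rather than disappearing silently.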

In [None]:
%reload_ext autoreload
%autoreload 2

In [41]:
import logging
import os
import pathlib

import numpy as np
import pandas as pd
import requests

# logging.basicConfig(level = logging.INFO)
pd.options.display.max_columns = 100

In [42]:
from injury.scrape.prosports import scrape_dtd_data
from injury.scrape.statsapi import scrape_il_data

### Query IL Data

In [128]:
start, end = 2012, 2022
status_changes, teams = scrape_il_data(start, end + 1)

INFO:injury.scrape.statsapi:Scraping IL data for 2012
INFO:injury.scrape.statsapi:Scraping IL data for 2013
INFO:injury.scrape.statsapi:Scraping IL data for 2014
INFO:injury.scrape.statsapi:Scraping IL data for 2015
INFO:injury.scrape.statsapi:Scraping IL data for 2016
INFO:injury.scrape.statsapi:Scraping IL data for 2017
INFO:injury.scrape.statsapi:Scraping IL data for 2018
INFO:injury.scrape.statsapi:Scraping IL data for 2019
INFO:injury.scrape.statsapi:Scraping IL data for 2020
INFO:injury.scrape.statsapi:Scraping IL data for 2021
INFO:injury.scrape.statsapi:Scraping IL data for 2022


In [130]:
os.makedirs("statsapi_data/", exist_ok=True)
status_changes.to_csv(f"statsapi_data/status_changes{start}-{end}.csv", index=False)
teams.to_csv(f"statsapi_data/teams{start}-{end}.csv", index=False)

### Query DTD Data

Scraping takes a while, so each year is saved to its own file.

In [131]:
scrape_dtd_data(2022, 2022 + 1, path="prosports_data/")

INFO:injury.scrape.prosports:Scraping DTD data for 2022
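Because each year lands in its own file, a slow scrape can be resumed by skipping years that are already on disk. The sketch below assumes files are named `prosports_{year}.csv`, which is a guess consistent with the `prosports_*.csv` glob used in the next section. `years_to_scrape` is a hypothetical helper, not part of the `injury` package.

```python
import pathlib


def years_to_scrape(path: str, start: int, end: int) -> list[int]:
    """Return the years in [start, end) that have no saved CSV under `path`.

    Assumes one file per year named prosports_{year}.csv (an assumption).
    """
    directory = pathlib.Path(path)
    return [
        year
        for year in range(start, end)
        if not (directory / f"prosports_{year}.csv").exists()
    ]
```

Each returned year could then be passed to `scrape_dtd_data(year, year + 1, path="prosports_data/")`, mirroring the call above.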


### Clean DTD Data

In [132]:
from injury.preprocess.prosports import ProsportsCleaner

In [133]:
# Read data
prosports = pd.concat(
    [pd.read_csv(f) for f in pathlib.Path("prosports_data").glob("prosports_*.csv")]
)
teams = pd.read_csv(f"statsapi_data/teams{start}-{end}.csv")

In [134]:
pc = ProsportsCleaner(prosports, teams)
dtd = pc.clean()

  has_abbrev = prosports.name.str.contains("([A-Z]|r)\.")
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  prosports["name"] = remove_accents(prosports["name"]).str.lower()
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  prosports[["name", "name2"]] = prosports["name"].str.split(
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  prosports[["name", "n

### Align DTD Data to MLB ids

In [135]:
from injury.preprocess.prosports import AlignProsportsMLB

The cell below pulls player data from my own Statcast database. The same information is available from the MLB Stats API or Baseball Savant, and the data I used is included in `../data`.

In [136]:
import statcast
db = statcast.db.Postgres()
players = db.query("player.sql", query_params={"min_year": 2012, "max_year": 2022})
players.to_parquet("../data/players.parquet")



In [137]:
players = pd.read_parquet("../data/players.parquet")
apm = AlignProsportsMLB(dtd, players)
matched_dtd = apm.run()

In [140]:
# matched_dtd.shape

### Combine IL and DTD

In [144]:
status_changes = (
    pd.read_csv(f"statsapi_data/status_changes{start}-{end}.csv")
    .drop(columns=["resolutionDate", "id"])
    .rename(columns={"description": "notes"})
)

player_names = (
    players[["player_id", "full_name"]]
    .drop_duplicates("player_id")
    .rename(columns={"full_name": "name"})
)


# Keep only IL transactions, i.e. notes mentioning "the N-day" or "the N day".
# Non-capturing group avoids the pandas warning about unused match groups.
il_df = status_changes[status_changes.notes.str.contains(r"the \d+(?:\s|-)day")]
il_df = il_df.merge(player_names, how="left")
il_df["dtd"] = False
matched_dtd["il_days"] = 0


In [145]:
injuries = pd.concat([il_df, matched_dtd.drop(columns=["id"])])
injuries["date"] = pd.to_datetime(injuries["date"])
injuries["activated"] = injuries.notes.str.contains("activat")
injuries["transfer"] = injuries.notes.str.contains("transfer")
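As a quick check of the text filters used in this section, the sketch below runs them on made-up notes. The note strings are invented for illustration and are not scraped rows. How the real `il_days` column is derived is not shown in this notebook, so the `str.extract` step is only one hypothetical way to recover the IL length from a note.

```python
import pandas as pd

# Made-up notes in the style of transaction descriptions (not real data).
notes = pd.Series(
    [
        "Team A placed RHP John Doe on the 10-day injured list.",
        "Team A activated RHP John Doe from the 10-day injured list.",
        "Team B transferred LHP Jim Roe from the 10-day injured list "
        "to the 60-day injured list.",
        "Team C optioned C Joe Poe to Triple-A.",
    ]
)

# IL filter: notes that mention "the N-day" or "the N day".
is_il = notes.str.contains(r"the \d+(?:\s|-)day")

# Hypothetical: take the first "N-day" length in each note; NaN when absent.
il_days = pd.to_numeric(notes.str.extract(r"the (\d+)(?:\s|-)day", expand=False))

# Same substring flags as in the cell above.
activated = notes.str.contains("activat")
transfer = notes.str.contains("transfer")
```

Note that a transfer note mentions two IL lengths, and the extraction above keeps only the first, so transfers would need separate handling if the destination list matters.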

In [146]:
injuries.reset_index(drop=True).to_parquet(f"../data/injuries{start}-{end}.parquet")