# LA Dodgers Standings, 1958-2024
> This notebook downloads historical standings tables from [Baseball Reference](https://www.baseball-reference.com/teams/LAD/2024-schedule-scores.shtml) and exports them in CSV, JSON and Parquet formats for later analysis and visualization.

---

#### Import Python tools and Jupyter config

In [216]:
import pandas as pd
import jupyter_black
from time import sleep
from tqdm.notebook import tqdm

In [217]:
jupyter_black.load()
pd.options.display.max_columns = 100
pd.options.display.max_rows = 1000
pd.options.display.max_colwidth = None

---

## Fetch

#### Build a list of historical URLs, one per season

In [None]:
urls = [
    f"https://www.baseball-reference.com/teams/LAD/{year}-schedule-scores.shtml"
    for year in range(1958, 2025)
]

#### Loop through urls, fetch standings table, store in list of dataframes

In [220]:
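dfs = []

for url in tqdm(urls):
    # The season is the leading part of the final URL segment
    year = url.split("/")[5].replace("-schedule-scores.shtml", "")
    src_df = (
        pd.read_html(url)[0]  # take the first table parsed from the page
        # Drop repeated header rows and games that have not been played yet
        .query("Tm != 'Tm' and Inn != 'Game Preview, and Matchups'")
        .drop(["Unnamed: 2", "Streak", "Orig. Scheduled"], axis=1)
        .rename(columns={"Unnamed: 4": "home_away"})
        .assign(season=year)
    )
    dfs.append(src_df)
    sleep(4)  # pause between requests to go easy on the site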

  0%|          | 0/67 [00:00<?, ?it/s]
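
The row filter in the loop can be hard to picture without seeing the raw table. Below is a minimal sketch on a made-up frame (the rows and the `Inn` values for played games are assumptions, not scraped data) showing how the same `query` string drops a repeated header row and an unplayed game.

```python
import pandas as pd

# Hypothetical miniature of a scraped schedule table: the header row
# repeats mid-table and the last game has not been played yet.
raw = pd.DataFrame(
    {
        "Gm#": ["1", "2", "Gm#", "3"],
        "Tm": ["LAD", "LAD", "Tm", "LAD"],
        "Inn": ["9", "10", "Inn", "Game Preview, and Matchups"],
    }
)

# Same filter as the fetch loop above
played = raw.query("Tm != 'Tm' and Inn != 'Game Preview, and Matchups'")
print(played["Gm#"].tolist())
```

Only the first two rows survive, so the later type conversions never see header text in a numeric column.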

#### Concatenate into one historic dataframe

In [233]:
src = pd.concat(dfs)

---

## Process

#### Clean columns

In [234]:
src.columns = src.columns.str.lower().str.replace("/", "_")

In [235]:
src.columns = [
    "gm",
    "date",
    "tm",
    "home_away",
    "opp",
    "result",
    "r",
    "ra",
    "inn",
    "record",
    "rank",
    "gb",
    "win",
    "loss",
    "save",
    "time",
    "day_night",
    "attendance",
    "cli",
    "year",
]
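
As a rough illustration of why explicit names are assigned after the generic cleanup, the sketch below applies the lowercase and slash replacement to a handful of headers. The raw header names here are assumed to match what the site returns; they are not taken from this notebook's output.

```python
import pandas as pd

# Assumed raw headers from the scraped table (illustrative only)
raw_columns = pd.Index(["Gm#", "W/L", "W-L", "D/N", "cLI"])

# Generic cleanup: lowercase, then swap slashes for underscores
cleaned = raw_columns.str.lower().str.replace("/", "_")
print(list(cleaned))
```

Names such as `gm#` and `w-l` are still awkward to use as attributes or in `query` strings, which is one reason to overwrite the whole list with hand-picked names.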

#### Split, format date

In [236]:
src[["weekday", "date"]] = src["date"].str.split(", ", expand=True)

In [237]:
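src["date"] = (
    src["date"]
    # regex=False treats the parentheses literally on every pandas version
    .str.replace(" (1)", "", regex=False)
    .str.replace(" (2)", "", regex=False)
)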

In [238]:
src["game_date"] = pd.to_datetime(src["date"] + ", " + src["year"], format="%b %d, %Y")
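
The three date steps above can be sketched end to end on a few sample strings. The rows below are hypothetical, and the single regular expression is a swapped-in alternative to the two literal replacements, assuming doubleheader markers always look like ` (1)` or ` (2)` at the end of the string.

```python
import pandas as pd

# Hypothetical raw values in the same shape as the scraped "date" column
games = pd.DataFrame(
    {
        "date": ["Tuesday, Apr 2", "Saturday, Jul 4 (1)", "Saturday, Jul 4 (2)"],
        "year": ["2024", "1964", "1964"],
    }
)

# Split "Weekday, Mon D" into two columns
games[["weekday", "date"]] = games["date"].str.split(", ", expand=True)

# Strip a trailing doubleheader marker such as " (1)"
games["date"] = games["date"].str.replace(r" \(\d\)$", "", regex=True)

# Combine month/day with the season and parse with an explicit format
games["game_date"] = pd.to_datetime(
    games["date"] + ", " + games["year"], format="%b %d, %Y"
)
print(games[["weekday", "game_date"]])
```

Both halves of a doubleheader end up with the same `game_date`, so the game number is what distinguishes them.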

#### Clean home-away column

In [239]:
src.loc[src.home_away == "@", "home_away"] = "away"
src.loc[src.home_away.isna(), "home_away"] = "home"
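
An equivalent way to express the same recoding, shown here only as a sketch on made-up values, is to map the `@` marker and fill everything else. This is an alternative to the two `.loc` assignments, not what the notebook runs.

```python
import numpy as np
import pandas as pd

# Hypothetical column: "@" marks road games, missing values mark home games
games = pd.DataFrame({"home_away": ["@", np.nan, "@", np.nan]})

# Map the marker to "away"; unmapped (missing) values become "home"
games["home_away"] = games["home_away"].map({"@": "away"}).fillna("home")
print(games["home_away"].tolist())
```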

#### Format "games back" as a number (positive = lead; negative = behind)

In [240]:
src["gb"] = (
    src["gb"].str.replace("up ", "up").str.replace("up", "+").str.replace("Tied", "0")
)

In [241]:
src["gb"] = src["gb"].apply(
    lambda x: float(x) if x.startswith("+") else -float(x) if float(x) != 0 else 0
)
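
To make the sign convention concrete, here is a small sketch that folds both cells into one helper. The function name `games_back_to_float` and the sample strings are assumptions for illustration; the raw column is assumed to hold values like `up 2.0`, `Tied` or a bare number.

```python
import pandas as pd


def games_back_to_float(value: str) -> float:
    """Hypothetical helper: signed games back, positive when leading."""
    if value == "Tied":
        return 0.0
    if value.startswith("up"):
        # "up 2.0" or "up2.0" means a lead of two games
        return float(value[2:].strip())
    # A bare number means the team trails by that many games
    return -float(value)


raw_gb = pd.Series(["up 2.0", "up1.5", "Tied", "3.0", "0.5"])
print(raw_gb.map(games_back_to_float).tolist())
```

Like the lambda above, this sketch assumes every value is a string; a missing value would need to be handled before calling it.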

#### Convert runs, attendance and game number to integers

In [242]:
# Games with no recorded attendance become 0 so the column can be cast to int
src["attendance"] = src["attendance"].fillna(0)
src[["r", "ra", "attendance", "gm"]] = src[["r", "ra", "attendance", "gm"]].astype(int)

#### Convert the 'time' column to timedelta, then to minutes

In [243]:
src["time"] = src["time"] + ":00"

In [244]:
src["time_minutes"] = pd.to_timedelta(src["time"]).dt.total_seconds() / 60
src["time_minutes"] = src["time_minutes"].astype(int)
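
The conversion above relies on `H:MM` strings becoming valid `H:MM:SS` durations once `:00` is appended. A minimal sketch on a few sample durations (illustrative values in the same shape as the `time` column):

```python
import pandas as pd

# Game lengths as "hours:minutes" strings
durations = pd.Series(["2:38", "2:57", "3:05"])

# Append seconds so pandas reads the value as H:MM:SS, then convert to minutes
as_timedelta = pd.to_timedelta(durations + ":00")
minutes = (as_timedelta.dt.total_seconds() / 60).astype(int)
print(minutes.tolist())
```

Without the appended seconds, a string such as `2:38` could be read as minutes and seconds rather than hours and minutes, which is why the suffix is added first.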

#### Just the columns we need, in a clean dataframe

In [245]:
df = src[
    [
        "gm",
        "game_date",
        "home_away",
        "opp",
        "result",
        "r",
        "ra",
        "record",
        "rank",
        "gb",
        "time",
        "time_minutes",
        "day_night",
        "attendance",
        "year",
    ]
].copy()

---

## Exports

#### CSV format

In [246]:
df.to_csv("../data/processed/dodgers_standings_1958_2023.csv", index=False)

#### JSON

In [247]:
df.to_json(
    "../data/processed/dodgers_standings_1958_2023.json", indent=4, orient="records"
)

#### Parquet

In [248]:
df.to_parquet("../data/processed/dodgers_standings_1958_2023.parquet", index=False)
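
One practical difference between the formats is that CSV does not store column types, so dates come back as plain strings unless they are parsed on read. The sketch below round-trips a tiny frame through an in-memory buffer (a stand-in for the real file path) to show this.

```python
import io

import pandas as pd

# Two-row sample in the shape of the exported frame
sample = pd.DataFrame(
    {
        "gm": [7, 8],
        "game_date": pd.to_datetime(["2024-04-01", "2024-04-02"]),
        "result": ["W", "W"],
        "attendance": [49044, 49365],
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime column when reading back
restored = pd.read_csv(buffer, parse_dates=["game_date"])
print(restored.dtypes)
```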

In [249]:
df.columns

Index(['gm', 'game_date', 'home_away', 'opp', 'result', 'r', 'ra', 'record',
       'rank', 'gb', 'time', 'time_minutes', 'day_night', 'attendance',
       'year'],
      dtype='object')

In [250]:
df.tail()

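|    | gm | game_date  | home_away | opp | result | r | ra | record | rank | gb  | time    | time_minutes | day_night | attendance | year |
|----|----|------------|-----------|-----|--------|---|----|--------|------|-----|---------|--------------|-----------|------------|------|
| 7  | 7  | 2024-04-01 | home      | SFG | W      | 8 | 3  | 5-2    | 1    | 1.0 | 2:38:00 | 158          | N         | 49044      | 2024 |
| 8  | 8  | 2024-04-02 | home      | SFG | W      | 5 | 4  | 6-2    | 1    | 1.0 | 2:57:00 | 177          | N         | 49365      | 2024 |
| 9  | 9  | 2024-04-03 | home      | SFG | W      | 5 | 4  | 7-2    | 1    | 2.0 | 2:25:00 | 145          | N         | 52746      | 2024 |
| 10 | 10 | 2024-04-05 | away      | CHC | L      | 7 | 9  | 7-3    | 1    | 2.0 | 2:57:00 | 177          | D         | 34981      | 2024 |
| 11 | 11 | 2024-04-06 | away      | CHC | W      | 4 | 1  | 8-3    | 1    | 3.0 | 2:45:00 | 165          | D         | 41040      | 2024 |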
