# LA Dodgers Standings, 1958-present
> This notebook downloads the team's current-season schedule and results table from [Baseball Reference](https://www.baseball-reference.com/teams/LAD/2024-schedule-scores.shtml) and combines it with historic game-by-game records for later analysis and visualization.

---

#### Import Python tools and Jupyter config

In [25]:
import os
import numpy as np
import pandas as pd
import jupyter_black
from time import sleep
from tqdm.notebook import tqdm

In [26]:
jupyter_black.load()
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = None

In [27]:
profile_name = os.environ.get("AWS_PERSONAL_PROFILE")

In [3]:
today = pd.Timestamp("today").strftime("%Y-%m-%d")

---

## Fetch

#### Import historic game-by-game results, 1958-2023

In [4]:
historic_df = pd.read_parquet("data/processed/dodgers_standings_1958_2023.parquet")

#### Define some variables we need for the request

In [5]:
year = 2024
url = f"https://www.baseball-reference.com/teams/LAD/{year}-schedule-scores.shtml"

#### Get the current year's table

In [6]:
src = (
    pd.read_html(url)[0]
    .query("Tm !='Tm' and Inn != 'Game Preview, and Matchups'")
    .drop(["Unnamed: 2", "Streak", "Orig. Scheduled"], axis=1)
    .rename(columns={"Unnamed: 4": "home_away"})
    .assign(season=year)
)
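
The query above removes two kinds of non-game rows: header rows repeated inside the table body and rows for games that have not been played yet. A minimal sketch of that filter on made-up rows (the `sample` frame and its values are hypothetical, not scraped data):

```python
import pandas as pd

# Hypothetical rows mimicking the scraped table: one completed
# extra-inning game, one repeated header row, one unplayed game.
sample = pd.DataFrame(
    {
        "Gm#": ["1", "Gm#", "2"],
        "Tm": ["LAD", "Tm", "LAD"],
        "Inn": ["10", "Inn", "Game Preview, and Matchups"],
    }
)

# Same filter as the notebook cell: drop header repeats and preview rows.
cleaned = sample.query("Tm != 'Tm' and Inn != 'Game Preview, and Matchups'")
print(cleaned)
```

Only the first row should survive, which is a quick way to confirm the filter before running it against the live page.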

---

## Process

#### Clean columns

In [7]:
src.columns = src.columns.str.lower().str.replace("/", "_")

In [8]:
src.columns = [
    "gm",
    "date",
    "tm",
    "home_away",
    "opp",
    "result",
    "r",
    "ra",
    "inn",
    "record",
    "rank",
    "gb",
    "win",
    "loss",
    "save",
    "time",
    "day_night",
    "attendance",
    "cli",
    "year",
]

#### Convert data types where needed

In [9]:
src["gm"] = src["gm"].astype(int)
src["year"] = src["year"].astype(str)

#### Split, format date

In [10]:
src[["weekday", "date"]] = src["date"].str.split(", ", expand=True)

In [11]:
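# regex=False keeps the parentheses literal on every pandas version
src["date"] = (
    src["date"]
    .str.replace(" (1)", "", regex=False)
    .str.replace(" (2)", "", regex=False)
)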

In [12]:
src["game_date"] = pd.to_datetime(src["date"] + ", " + src["year"], format="%b %d, %Y")
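
The three date steps above (split off the weekday, strip doubleheader markers, append the season and parse) can be sketched end to end on a few made-up strings. The sample values are hypothetical, and the single regex used here is a swapped-in alternative to the two literal replacements in the notebook:

```python
import pandas as pd

# Hypothetical raw date strings in the table's format,
# including doubleheader markers such as " (1)".
raw = pd.Series(["Thursday, Mar 28", "Saturday, Apr 6 (1)", "Saturday, Apr 6 (2)"])
year = "2024"

# Split "Weekday, Mon D" into two columns.
parts = raw.str.split(", ", expand=True)
parts.columns = ["weekday", "date"]

# Remove a trailing doubleheader marker with one explicit regex.
parts["date"] = parts["date"].str.replace(r"\s\(\d\)$", "", regex=True)

# Append the season so the parser sees a complete date.
parts["game_date"] = pd.to_datetime(parts["date"] + ", " + year, format="%b %d, %Y")
print(parts)
```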

#### Clean home-away column

In [13]:
src.loc[src.home_away == "@", "home_away"] = "away"
src.loc[src.home_away.isna(), "home_away"] = "home"
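
The same relabeling can be written as a single vectorized step. This is a sketch on hypothetical values, and it is slightly stricter than the two `.loc` assignments: anything other than `"@"` becomes `"home"`, whereas the cell above only relabels missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical column: "@" marks road games, missing means home games.
sample = pd.DataFrame({"home_away": ["@", np.nan, "@", np.nan]})

# One pass: "@" becomes "away", everything else becomes "home".
sample["home_away"] = np.where(sample["home_away"] == "@", "away", "home")
print(sample)
```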

#### Games back figures as a number

In [14]:
src["gb"] = (
    src["gb"].str.replace("up ", "up").str.replace("up", "+").str.replace("Tied", "0")
)

In [15]:
src["gb"] = (
    src["gb"]
    .apply(
        lambda x: float(x) if x.startswith("+") else -float(x) if float(x) != 0 else 0
    )
    .astype(float)
)
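
The sign convention used above is that a lead is positive, a deficit is negative, and a tie is zero. A named helper makes that easier to test in isolation. `parse_games_back` is a hypothetical function, not part of the notebook, and the handling of missing values is an assumption about inputs the lambda above would not accept:

```python
import math


def parse_games_back(value):
    """Convert a games-back label to a signed float (hypothetical helper)."""
    # Missing values stay missing instead of raising.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return float("nan")
    text = str(value).strip()
    if text == "Tied":
        return 0.0
    # "up 2.0" or "up2.0" means the team leads: positive number.
    if text.startswith("up"):
        return float(text[2:].strip())
    # A bare number means the team trails: negative number.
    return -float(text)


examples = ["up 2.0", "up1.0", "Tied", "1.5", None]
print([parse_games_back(v) for v in examples])
```

Applied to the real column, this would be `src["gb"].apply(parse_games_back)` on the raw strings, before either of the two cells above has run.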

#### Just the columns we need

In [16]:
src_df = src[
    [
        "gm",
        "game_date",
        "home_away",
        "opp",
        "result",
        "r",
        "ra",
        "record",
        "rank",
        "gb",
        "time",
        "day_night",
        "attendance",
        "year",
    ]
].copy()

---

## Concatenate

#### Historic and current dataframes combined into one

In [17]:
df = pd.concat([src_df, historic_df]).sort_values("game_date", ascending=False)
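
Because both frames carry their own index, the combined frame keeps repeated index labels. A minimal sketch, on hypothetical two-row frames, of resetting the index and checking that no season/game pair appears twice (it assumes `year` is stored as a string in both frames, which the notebook does not confirm for the historic file):

```python
import pandas as pd

# Hypothetical stand-ins for the current-season and historic frames.
current = pd.DataFrame(
    {
        "gm": [1, 2],
        "game_date": pd.to_datetime(["2024-03-20", "2024-03-21"]),
        "year": ["2024", "2024"],
    }
)
historic = pd.DataFrame(
    {
        "gm": [1, 2],
        "game_date": pd.to_datetime(["2023-03-30", "2023-03-31"]),
        "year": ["2023", "2023"],
    }
)

# Combine, sort newest first, and give the result a clean 0..n-1 index.
combined = (
    pd.concat([current, historic], ignore_index=True)
    .sort_values("game_date", ascending=False)
    .reset_index(drop=True)
)

# A season/game pair should appear only once after concatenation.
dupes = int(combined.duplicated(subset=["year", "gm"]).sum())
print(combined, dupes)
```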

In [18]:
# Cast to float so missing values are stored as NaN
df["r"] = df["r"].astype(float)
df["ra"] = df["ra"].astype(float)
df["attendance"] = df["attendance"].astype(float)

In [19]:
df.head()

Unnamed: 0,gm,game_date,home_away,opp,result,r,ra,record,rank,gb,time,day_night,attendance,year
10,10,2024-04-05,away,CHC,L,7.0,9.0,7-3,1,2.0,2:57,D,34981.0,2024
9,9,2024-04-03,home,SFG,W,5.0,4.0,7-2,1,2.0,2:25,N,52746.0,2024
8,8,2024-04-02,home,SFG,W,5.0,4.0,6-2,1,1.0,2:57,N,49365.0,2024
7,7,2024-04-01,home,SFG,W,8.0,3.0,5-2,1,1.0,2:38,N,49044.0,2024
5,6,2024-03-31,home,STL,W,5.0,4.0,4-2,1,0.0,2:41,D,41014.0,2024


In [20]:
df.tail()

Unnamed: 0,gm,game_date,home_away,opp,result,r,ra,record,rank,gb,time,day_night,attendance,year
4,5,1958-04-19,home,SFG,L,4.0,11.0,2-3,5,-2.5,2:37,D,41303.0,1958
3,4,1958-04-18,home,SFG,W,6.0,5.0,2-2,3,-1.5,3:00,D,78672.0,1958
2,3,1958-04-17,away,SFG,L,4.0,7.0,1-2,6,-1.5,2:50,D,12520.0,1958
1,2,1958-04-16,away,SFG,W,13.0,1.0,1-1,4,-0.5,3:03,N,22735.0,1958
0,1,1958-04-15,away,SFG,L,0.0,8.0,0-1,5,-1.0,2:29,D,23448.0,1958


---

## Exports

#### CSV format

In [21]:
df.to_csv("data/processed/dodgers_standings_1958_present.csv", index=False)

#### JSON

In [22]:
df.to_json(
    "data/processed/dodgers_standings_1958_present.json", indent=4, orient="records"
)
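
One thing worth checking in the JSON export is how `game_date` is serialized, since `to_json` accepts a `date_format` argument that controls it. The sketch below, on a hypothetical one-row frame, requests ISO 8601 strings explicitly. Whether downstream consumers of this file expect ISO strings or epoch values is an assumption to verify before changing the export:

```python
import json

import pandas as pd

# Hypothetical one-row frame with a datetime column.
sample = pd.DataFrame({"gm": [1], "game_date": pd.to_datetime(["2024-03-20"])})

# Serialize to a string (no path given) with ISO-formatted dates.
records = json.loads(sample.to_json(orient="records", date_format="iso"))
print(records)
```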

#### Parquet

In [None]:
df.to_parquet("data/processed/dodgers_standings_1958_present.parquet", index=False)

#### S3

In [29]:
!aws s3 cp data/processed/dodgers_standings_1958_present.json s3://stilesdata.com/dodgers/dodgers_standings_1958_present.json --profile {profile_name}

upload: data/processed/dodgers_standings_1958_present.json to s3://stilesdata.com/dodgers/dodgers_standings_1958_present.json
