In [48]:
import pandas as pd 
import altair as alt 

In [49]:
data = pd.read_csv("../data/mlb_stats_api_schedule_sample.csv")

In [50]:
data["home_win"] = (data["home_score"] > data["away_score"]).astype(int)
data.head()

Unnamed: 0,game_id,game_datetime,game_date,game_type,status,away_name,home_name,away_id,home_id,doubleheader,...,venue_name,national_broadcasts,series_status,winning_team,losing_team,winning_pitcher,losing_pitcher,save_pitcher,summary,home_win
0,744932,2024-06-01T19:07:00Z,2024-06-01,R,Final,Pittsburgh Pirates,Toronto Blue Jays,134,141,N,...,Rogers Centre,[],Series tied 1-1,Pittsburgh Pirates,Toronto Blue Jays,Mitch Keller,Yusei Kikuchi,Luis Ortiz,2024-06-01 - Pittsburgh Pirates (8) @ Toronto ...,0
1,747029,2024-06-01T20:05:00Z,2024-06-01,R,Final,Tampa Bay Rays,Baltimore Orioles,139,110,N,...,Oriole Park at Camden Yards,['MLB.tv Free Game'],BAL wins 2-0,Baltimore Orioles,Tampa Bay Rays,Jacob Webb,Taj Bradley,,2024-06-01 - Tampa Bay Rays (5) @ Baltimore Or...,1
2,746952,2024-06-01T20:10:00Z,2024-06-01,R,Final,Detroit Tigers,Boston Red Sox,116,111,N,...,Fenway Park,[],BOS leads 2-1,Boston Red Sox,Detroit Tigers,Cooper Criswell,Reese Olson,,2024-06-01 - Detroit Tigers (3) @ Boston Red S...,1
3,746629,2024-06-01T20:10:00Z,2024-06-01,R,Final,Washington Nationals,Cleveland Guardians,120,114,N,...,Progressive Field,[],CLE wins 2-0,Cleveland Guardians,Washington Nationals,Ben Lively,Mitchell Parker,Emmanuel Clase,2024-06-01 - Washington Nationals (2) @ Clevel...,1
4,745812,2024-06-01T20:10:00Z,2024-06-01,R,Final,Arizona Diamondbacks,New York Mets,109,121,N,...,Citi Field,['MLBN (out-of-market only)'],NYM leads 2-1,Arizona Diamondbacks,New York Mets,Kevin Ginkel,Sean Manaea,,2024-06-01 - Arizona Diamondbacks (10) @ New Y...,0


In [57]:
data.isna().sum().sort_values(ascending=False)

home_pitcher_note        92
away_pitcher_note        92
save_pitcher             43
inning_state              1
losing_pitcher            1
winning_pitcher           1
losing_team               1
winning_team              1
current_inning            1
game_datetime             0
home_win                  0
summary                   0
series_status             0
national_broadcasts       0
venue_name                0
home_score                0
game_date                 0
away_score                0
away_probable_pitcher     0
home_probable_pitcher     0
game_num                  0
doubleheader              0
home_name                 0
away_name                 0
status                    0
game_type                 0
run_diff                  0
dtype: int64

In [51]:
# 1. Drop identifier columns
data = data.drop(['game_id', 'away_id', 'home_id', 'venue_id'], axis=1)

# 2. Keep only numeric columns
numeric_data = data.select_dtypes(include='number')

# 3. Drop columns with all missing values
numeric_data = numeric_data.dropna(axis=1, how='all')

# 4. Drop zero-variance columns (no predictive signal)
numeric_data = numeric_data.loc[:, numeric_data.nunique() > 1]

# 5. Summary statistics
numeric_data.describe()

Unnamed: 0,away_score,home_score,current_inning,home_win
count,92.0,92.0,91.0,92.0
mean,4.51087,4.163043,9.065934,0.521739
std,3.242894,2.742731,0.290677,0.502264
min,0.0,0.0,9.0,0.0
25%,2.0,2.0,9.0,0.0
50%,4.0,3.0,9.0,1.0
75%,6.0,6.25,9.0,1.0
max,14.0,10.0,11.0,1.0


- Columns with no variance (e.g., game_num) and columns with entirely missing values (e.g., pitcher_note fields) were removed, as they provide no predictive signal.

- Away teams scored slightly more runs on average (4.51) than home teams (4.16). The run environment appears typical for MLB, with the middle 50% of team scores falling between 2 and 6 runs (6.25 for home teams). A few high-scoring outliers (maximum 14 runs) indicate occasional blowouts.

- The current_inning variable reflects the number of innings played and is post-game information. It will not be used in predictive modeling, to avoid data leakage. TODO: confirm with Rabin.
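The leakage concern above can be made mechanical by listing the post-game columns once and dropping them before any modeling. The sketch below is only illustrative: the `POST_GAME_COLUMNS` list is an assumption about which fields are known only after the final out, the helper name `split_pre_game` is hypothetical, and the demo frame is hand-made rather than taken from the real CSV.

```python
import pandas as pd

# Assumption: these columns (present in, or derived in, this notebook) are
# only known once a game has finished, so they must never be used as features.
POST_GAME_COLUMNS = [
    "home_score", "away_score", "current_inning", "inning_state",
    "winning_team", "losing_team", "winning_pitcher", "losing_pitcher",
    "save_pitcher", "series_status", "summary", "run_diff", "total_runs",
]


def split_pre_game(df, target="home_win"):
    """Return (X, y) with every listed post-game column removed from X."""
    leaky = [col for col in POST_GAME_COLUMNS if col in df.columns]
    X = df.drop(columns=leaky + [target])
    y = df[target]
    return X, y


# Tiny hand-made frame for illustration only (not rows from the real data).
demo = pd.DataFrame({
    "home_name": ["Team A", "Team B"],
    "away_name": ["Team C", "Team D"],
    "home_score": [3, 5],
    "away_score": [4, 2],
    "current_inning": [9, 9],
    "home_win": [0, 1],
})

X, y = split_pre_game(demo)
print(list(X.columns))  # only the pre-game columns remain
```

Keeping the list in one place means a newly derived outcome column (such as a future margin feature) only has to be registered once to stay out of the feature matrix.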

In [52]:
data["home_win"].value_counts(normalize=True)

home_win
1    0.521739
0    0.478261
Name: proportion, dtype: float64

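Home teams won slightly more often than away teams: 52.2% of the 92 games (48 wins) versus 47.8% (44 wins). With a sample this small, the edge is modest relative to sampling noise.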
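To gauge how much weight the home-win proportion can bear, a confidence interval helps. The sketch below computes a 95% Wilson score interval using only the standard library; the counts (48 home wins in 92 games) are derived from the proportions printed above, and the function name `wilson_interval` is a hypothetical helper, not something from the notebook.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 0.521739 * 92 games = 48 home wins
low, high = wilson_interval(48, 92)
print(f"95% interval for home win rate: {low:.3f} to {high:.3f}")
```

The interval runs from roughly 0.42 to 0.62, so it comfortably contains 0.5: this sample alone cannot distinguish a real home-field edge from a coin flip.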

In [53]:
data["home_score"].describe(), data["away_score"].describe()

(count    92.000000
 mean      4.163043
 std       2.742731
 min       0.000000
 25%       2.000000
 50%       3.000000
 75%       6.250000
 max      10.000000
 Name: home_score, dtype: float64,
 count    92.000000
 mean      4.510870
 std       3.242894
 min       0.000000
 25%       2.000000
 50%       4.000000
 75%       6.000000
 max      14.000000
 Name: away_score, dtype: float64)

In [55]:
data["run_diff"] = data["home_score"] - data["away_score"]
run_diff_chart = (
    alt.Chart(data)
    .mark_bar()
    .encode(
        alt.X("run_diff:Q", bin=alt.Bin(maxbins=20), title="Run Differential (Home - Away)"),
        alt.Y("count()", title="Number of Games"),
        tooltip=["count()"]
    )
    .properties(
        title="Distribution of Run Differential"
    )
)

run_diff_chart

The distribution of run differential is centered near zero, indicating that 
most games are competitive. The majority of games are decided by fewer than 
four runs. A small number of extreme positive and negative values indicate 
occasional blowout games.

The relatively symmetric shape suggests balanced competition in the sample, 
with no extreme skew toward either home or away dominance.

In [56]:
data.groupby("home_win")["run_diff"].mean()

home_win
0   -3.772727
1    2.791667
Name: run_diff, dtype: float64

As expected, games where the home team won have a positive average run 
differential (+2.79), while games where the home team lost show a negative 
average run differential (−3.77).

Interestingly, home losses appear to occur by a larger average margin than 
home wins. This suggests that when the home team loses, losses may be more 
decisive on average in this sample.
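A quick arithmetic check ties these group means back to the summary statistics. It is a sketch that hard-codes figures printed earlier in the notebook; the win and loss counts (48 and 44) are derived from the 52.2% home-win rate over 92 games. It shows how home teams can win more often yet still be outscored on average.

```python
n_games = 92
n_home_wins = round(0.521739 * n_games)   # 48, derived from the win proportion
n_home_losses = n_games - n_home_wins     # 44

mean_margin_win = 2.791667     # mean run_diff when home_win == 1
mean_margin_loss = -3.772727   # mean run_diff when home_win == 0

# Weighted average of the two group means gives the overall mean run_diff.
overall_from_groups = (
    n_home_wins * mean_margin_win + n_home_losses * mean_margin_loss
) / n_games

# The same quantity from the describe() output: mean home minus mean away score.
overall_from_scores = 4.163043 - 4.510870

print(round(overall_from_groups, 3), round(overall_from_scores, 3))
```

Both routes give an average run differential of about −0.35 runs per game, so the larger losing margins outweigh the small surplus of home wins.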

In [58]:
data["run_diff"].skew()

np.float64(-0.3854392599392755)

The run differential has a skewness of −0.39, a mild left skew: the negative tail (large away victories) is somewhat longer or heavier than the positive tail (large home victories) in this sample.

However, the magnitude is small, so the distribution remains largely symmetric overall.

Put simply, the home team wins slightly more often (about 52% of games), but its losses tend to be by a larger margin than its wins:
- Home wins are usually by a few runs (average margin +2.79).
- Home losses are sometimes by more runs (average margin −3.77).

That asymmetry is why the run differential shows a small negative skew.
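To make the skewness figure less of a black box, here is a minimal standard-library sketch of the bias-adjusted sample skewness. It is assumed, not verified here, to be the same estimator that pandas' `Series.skew()` reports. The function name and the margins in the demo are hypothetical and chosen only to mimic "many small wins, a few large losses".

```python
import math


def adjusted_skewness(values):
    """Bias-adjusted sample skewness (adjusted Fisher-Pearson coefficient)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant data has no asymmetry
    g1 = m3 / m2 ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


# Hypothetical run differentials: several narrow home wins, a few big losses.
margins = [-9, -4, -2, 1, 1, 2, 2, 3, 3, 3]
print(adjusted_skewness(margins))          # negative: longer left tail
print(adjusted_skewness([-2, -1, 0, 1, 2]))  # symmetric data gives zero
```

Because deviations are cubed, a handful of large negative margins can pull the statistic below zero even when most games sit on the positive side.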

In [None]:
import altair as alt

melted = data.melt(
    value_vars=["home_score", "away_score"],
    var_name="team_type",
    value_name="runs"
)

alt.Chart(melted).mark_bar(opacity=0.6).encode(
    alt.X("runs:Q", bin=alt.Bin(maxbins=15), title="Runs"),
    # stack=None overlays the two histograms; by default Altair stacks the bars
    alt.Y("count()", stack=None, title="Number of Games"),
    color=alt.Color("team_type:N", title="Team Type"),
    tooltip=["team_type", "count()"]
).properties(
    title="Distribution of Runs: Home vs Away"
)

The distributions of home and away runs are broadly similar, with both peaking 
around 3–4 runs per game. However, away teams exhibit a slightly heavier 
right tail, indicating more high-scoring outlier performances. This explains 
their marginally higher average runs in the sample.

Overall, scoring patterns appear comparable between home and away teams, 
suggesting no major structural imbalance in offensive output.

In [60]:
data["total_runs"] = data["home_score"] + data["away_score"]

In [61]:
alt.Chart(data).mark_bar().encode(
    alt.X("total_runs:Q", bin=alt.Bin(maxbins=20), title="Total Runs"),
    alt.Y("count()", title="Number of Games")
).properties(
    title="Distribution of Total Runs per Game"
)

The distribution of total runs per game is centered around 7–9 runs, 
with most games falling between 6 and 12 runs. A small number of high-scoring 
outliers (15+ total runs) indicate occasional offensive explosions.

Overall, the scoring environment in this sample (about 8.7 total runs per game on average, from 4.16 home plus 4.51 away) appears consistent with modern MLB averages, with moderate game-to-game volatility.

The EDA confirms a competitive MLB scoring environment with moderate volatility and a slight home-field advantage. However, apart from the team, venue, and probable-pitcher fields, the current dataset consists mostly of post-game outcome variables and is insufficient for predictive modeling on its own.

Predictive performance will therefore depend primarily on robust pre-game feature engineering rather than outcome-level statistics.




The task should be framed as a probabilistic binary classification problem:

Estimate P(Home Team Wins | Pre-Game Features)

Given the observed scoring variability and competitive balance, deterministic prediction is unrealistic. Instead, the model should output calibrated win probabilities.
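Because the model's output is a probability, it should be scored with a proper scoring rule rather than accuracy alone. The standard-library sketch below computes the Brier score and log loss for two trivial forecasters on outcomes shaped like this sample (48 home wins, 44 losses, derived from the proportions above). These are the baselines any real model would need to beat; the helper names are hypothetical.

```python
import math


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def log_loss(probs, outcomes):
    """Mean negative log-likelihood of the outcomes under the predictions."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        total += -math.log(p) if y == 1 else -math.log(1 - p)
    return total / len(outcomes)


# Outcomes with the same composition as the sample: 48 home wins, 44 losses.
outcomes = [1] * 48 + [0] * 44
base_rate = sum(outcomes) / len(outcomes)

coin_flip = [0.5] * len(outcomes)            # always predict 50%
constant_rate = [base_rate] * len(outcomes)  # always predict the home win rate

for name, probs in [("coin flip", coin_flip), ("base rate", constant_rate)]:
    print(name, round(brier_score(probs, outcomes), 4), round(log_loss(probs, outcomes), 4))
```

The two baselines are nearly indistinguishable, which illustrates how thin the home-field signal is: any gain from pre-game features will show up as a small improvement in these scores.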


Recommended Modeling Approaches

1. Logistic Regression (Baseline)
- Interpretable
- Naturally probabilistic
- Strong benchmark model
- Establishes directional feature effects

2. Random Forest
- Captures nonlinear relationships
- Handles feature interactions
- Provides probability output

Useful for detecting complex patterns.

3. Gradient Boosting (XGBoost / LightGBM)
- Strong performance in tabular data
- Handles noisy systems effectively
- Likely best candidate for extracting small predictive edges
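As a rough illustration of the baseline approach, the sketch below fits a logistic regression with scikit-learn and compares its cross-validated log loss against a constant base-rate forecast. Everything about the data is an assumption: the two features are synthetic stand-ins for hypothetical pre-game differentials (for example, a starter-quality gap and a recent-form gap), and the coefficients are invented so the example has signal to find. It is not a result from the MLB sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_games = 600

# Synthetic, hypothetical pre-game features (NOT from the real dataset).
X = rng.normal(size=(n_games, 2))
true_logit = 0.1 + 1.0 * X[:, 0] - 0.8 * X[:, 1]  # invented coefficients
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Out-of-fold win probabilities from a plain logistic regression.
model = LogisticRegression()
proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]

# Constant forecast at the observed home-win rate, as the bar to clear.
baseline = np.full(n_games, y.mean())

model_loss = log_loss(y, proba)
baseline_loss = log_loss(y, baseline)
print(f"logistic regression: {model_loss:.3f}, base rate: {baseline_loss:.3f}")
```

The same out-of-fold comparison can be reused unchanged for a random forest or gradient boosting model by swapping the estimator, which keeps the three candidates on an equal footing.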