# dbt Mart Validation: Initial Insights from Transformed Data
> Validation Queries: Ronald

**Executive Summary:**  
The dbt pipeline is complete and validated—all 14 models built, all 24 tests passing. We ran initial queries against the three marts to validate data quality and explore patterns. The findings? More interesting than expected.

**What We Found:**
None of the "obvious" attendance drivers showed up clearly:
- **Winning doesn't guarantee crowds:** Average teams (8-9 wins) drew slightly MORE fans than winning teams (average weekly attendance of 68,708 vs. 68,183), possibly because of venue capacity ceilings and market size confounds
- **Playoff momentum is weak:** When attendance grew, playoff teams gained about as much (+3.0%) as non-playoff teams (+3.2%), suggesting other factors (stadium upgrades, star signings) matter more
- **Indoor vs outdoor is a wash:** Weather-proofed domes show nearly the same attendance variability (74.5%) as outdoor stadiums (76.5%); opponent quality may dominate

**What This Means:**
Raw attendance numbers are noisy without context. The good news: we have the data staged to add that context. Phase 2 priorities:
- Normalize by venue capacity (eliminate the ceiling effect)
- Add market size controls (isolate team performance signal)
- Layer in weather data for outdoor venues (test the cold-game hypothesis)
- Explore the 4 staged future tables (broadcast impact, score excitement, player stats)

**Pipeline Stats:**  
- **Sources:** Kaggle (2000-2019), ESPN API (2020-2024)  
- **Models:** 5 staging, 2 intermediate, 3 marts  
- **Data Quality:** 24/24 tests passing (unique keys, not-null constraints validated)
- **Tables Ready:** 7 ESPN themed schemas + 3 Kaggle sources = 10 raw tables, 14 dbt models

*Bottom line: The infrastructure works. Now we can ask better questions.*

---

### 0. Setup

In [1]:
# Standard library
import sqlite3
from pathlib import Path

# Third-party
import pandas as pd

# Database connection
DB_PATH = Path('../data/processed/nfl_attendance.db')
conn = sqlite3.connect(DB_PATH)

print(f"Connected to: {DB_PATH}")
print(f"Database exists: {DB_PATH.exists()}")

Connected to: ..\data\processed\nfl_attendance.db
Database exists: True
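
Before running the mart queries, it can help to confirm that the three marts referenced in this notebook are actually present. The following is a minimal sketch, assuming the marts are materialized as tables or views; `missing_marts` is a hypothetical helper, and the in-memory database is a stand-in so the example runs on its own.

```python
import sqlite3

# Mart names as referenced in this notebook.
EXPECTED_MARTS = {
    "mart_win_attendance_correlation",
    "mart_playoff_momentum",
    "mart_venue_attendance_patterns",
}

def missing_marts(conn, expected=EXPECTED_MARTS):
    """Return the expected mart names that are absent from the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type IN ('table', 'view')"
    ).fetchall()
    present = {name for (name,) in rows}
    return sorted(set(expected) - present)

# Stand-in in-memory database (assumption: a real check would pass the
# connection opened in the setup cell instead).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE mart_playoff_momentum (team TEXT)")
print(missing_marts(demo))  # lists the two marts not created above
demo.close()
```

An empty list means every expected mart is queryable.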


---

## Question 1: Does Winning Drive Attendance?

**Hypothesis:** Winning teams (10+ wins) should draw larger crowds than losing teams (<8 wins).

**Analysis Approach:**
- Categorize teams by performance tier (winning/average/losing)
- Calculate average weekly attendance per tier
- Compare across 638 team-seasons (2000-2019 Kaggle data)

**Data Source:** `mart_win_attendance_correlation` (dbt mart)

In [2]:
query_q1 = """
SELECT 
    CASE
        WHEN wins >= 10 THEN 'Winning (10+ wins)'
        WHEN wins >= 8 THEN 'Average (8-9 wins)'
        ELSE 'Losing (<8 wins)'
    END AS performance_tier,
    COUNT(*) AS seasons,
    ROUND(AVG(avg_weekly_attendance), 0) AS avg_attendance,
    ROUND(AVG(win_percentage), 3) AS avg_win_pct
FROM mart_win_attendance_correlation
GROUP BY performance_tier
ORDER BY avg_win_pct DESC
"""

df_q1 = pd.read_sql_query(query_q1, conn)
df_q1

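| | performance_tier | seasons | avg_attendance | avg_win_pct |
|---|------------------|---------|----------------|-------------|
| 0 | Winning (10+ wins) | 216 | 68183.0 | 0.714 |
| 1 | Average (8-9 wins) | 138 | 68708.0 | 0.533 |
| 2 | Losing (<8 wins) | 284 | 66521.0 | 0.322 |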


### Finding 1: The "Average Team Paradox"

**Surprising Result:**  
Teams with 8-9 wins draw **higher average attendance** (68,708) than winning teams with 10+ wins (68,183).

**Possible Explanations:**
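1. **Venue Capacity Ceiling:** Winning teams may play in smaller-capacity stadiums, which caps their raw attendance regardless of demand. Capacity differences across the league are large enough to matter: Lambeau Field (~81k) and MetLife Stadium (~82k) sit near the top, while some venues hold closer to 60-65k.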

2. **Market Size Confound:** Average teams might be concentrated in large markets (NY, LA, Chicago) with inherently high attendance baselines, while winning teams could include small-market success stories (Green Bay, Kansas City pre-2020).

3. **Ticket Price Effect:** Winning teams charge premium prices, potentially reducing volume attendance despite higher revenue. (Note: This dataset lacks pricing data to test.)

4. **Competitive Excitement:** Games with 8-9 win teams are often high-stakes, close contests late in the season, driving fan engagement more than blowout wins by dominant teams.

**Implication for Analysis:**  
Raw attendance is a poor metric without normalizing for venue capacity. Phase 2 should calculate **attendance as % of capacity** to isolate demand signals from infrastructure constraints.

**Data Quality Note:**  
Losing teams (<8 wins) show 284 seasons vs. 216 winning seasons — sample size is sufficient, but further stratification by market size and venue type is needed.
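
The capacity normalization proposed above could look like the following sketch. The rows and capacities are hypothetical placeholders (the dataset does not yet include capacities), and `capacity_utilization` is an illustrative helper, not part of the pipeline.

```python
# Hypothetical team-season rows: (team, avg_weekly_attendance, venue_capacity).
# Capacities are illustrative placeholders, not sourced figures.
rows = [
    ("Team A", 61_000, 61_500),
    ("Team B", 68_000, 82_500),
    ("Team C", 66_000, 70_000),
]

def capacity_utilization(attendance, capacity):
    """Attendance as a percentage of venue capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * attendance / capacity

# Rank by utilization rather than by raw attendance.
ranked = sorted(rows, key=lambda r: capacity_utilization(r[1], r[2]), reverse=True)
for team, attendance, capacity in ranked:
    pct = capacity_utilization(attendance, capacity)
    print(f"{team}: {attendance:,} fans, {pct:.1f}% of capacity")
```

In this toy data, Team B has the highest raw attendance but the lowest utilization, which is the kind of reversal the ceiling hypothesis predicts.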

---

## Question 2: Do Playoff Teams Get an Attendance Boost?

**Hypothesis:** Teams that made the playoffs should see attendance increase the following season due to fan optimism and "bandwagon effect."

**Analysis Approach:**
- Compare year-over-year attendance changes for playoff vs. non-playoff teams
- Use LAG window function to get prior season context
- Categorize as Increase/Decrease/Flat trends

**Data Source:** `mart_playoff_momentum` (dbt mart with window functions)
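
To illustrate the LAG approach, here is a minimal sketch on an in-memory SQLite database. The `team_seasons` table, its columns, and its values are hypothetical; only the output column names `prior_season_made_playoffs` and `attendance_pct_change` mirror the mart. The mart's actual SQL is not shown in this notebook and may differ. Window functions require SQLite 3.25 or later.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
CREATE TABLE team_seasons (
    team TEXT, season INTEGER, attendance INTEGER, made_playoffs INTEGER
);
INSERT INTO team_seasons VALUES
    ('Team A', 2017, 60000, 1),
    ('Team A', 2018, 63000, 0),
    ('Team A', 2019, 61740, 1);
""")

# LAG looks one row back within each team's seasons, ordered by year.
query = """
SELECT
    team,
    season,
    LAG(made_playoffs) OVER w AS prior_season_made_playoffs,
    ROUND(100.0 * (attendance - LAG(attendance) OVER w)
          / LAG(attendance) OVER w, 1) AS attendance_pct_change
FROM team_seasons
WINDOW w AS (PARTITION BY team ORDER BY season)
ORDER BY team, season
"""
lag_rows = conn_demo.execute(query).fetchall()
for row in lag_rows:
    print(row)
conn_demo.close()
```

Each team's first season has no prior row, so both lagged columns come back as NULL. That is why the query below filters on `prior_season_made_playoffs IS NOT NULL`.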

In [3]:
query_q2 = """
SELECT
    prior_season_made_playoffs AS made_playoffs_prior_year,
    attendance_trend,
    COUNT(*) AS occurrences,
    ROUND(AVG(attendance_pct_change), 1) AS avg_change_pct,
    ROUND(MIN(attendance_pct_change), 1) AS min_change,
    ROUND(MAX(attendance_pct_change), 1) AS max_change
FROM mart_playoff_momentum
WHERE prior_season_made_playoffs IS NOT NULL
GROUP BY prior_season_made_playoffs, attendance_trend
ORDER BY prior_season_made_playoffs DESC, avg_change_pct DESC
"""

df_q2 = pd.read_sql_query(query_q2, conn)
df_q2

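| | made_playoffs_prior_year | attendance_trend | occurrences | avg_change_pct | min_change | max_change |
|---|--------------------------|------------------|-------------|----------------|------------|------------|
| 0 | 1 | Increase | 128 | 3.0 | 0.0 | 13.3 |
| 1 | 1 | Decrease | 100 | -1.9 | -8.1 | 0.0 |
| 2 | 0 | Increase | 167 | 3.2 | 0.0 | 21.4 |
| 3 | 0 | Decrease | 209 | -3.1 | -14.5 | 0.0 |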


### Finding 2: Minimal Playoff Momentum Effect

**Result:**  
Teams that made the playoffs show:
- **Increase:** 128 seasons, avg +3.0%  
- **Decrease:** 100 seasons, avg -1.9%

Teams that **didn't** make playoffs:
- **Increase:** 167 seasons, avg +3.2%  
- **Decrease:** 209 seasons, avg -3.1%

**Key Insight:**  
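When attendance does grow, the playoff "bump" (+3.0%) is slightly **smaller** than the increase seen by non-playoff teams (+3.2%). These are conditional averages, though: playoff teams grew in 128 of 228 seasons (56%) versus 167 of 376 (44%) for non-playoff teams, and their declines were shallower (-1.9% vs. -3.1%), so a modest directional effect remains. The similar size of the increases suggests: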

1. **Attendance changes are driven by factors beyond recent playoff success** — such as:
   - Stadium renovations or capacity expansions
   - Star player signings (e.g., acquiring a franchise QB)
   - Market-wide economic trends
   - Schedule strength (more prime-time games)

2. **Fan loyalty may be "sticky"** — playoff teams already had strong attendance in year N, so there's less room to grow. Non-playoff teams with increases might be rebounding from temporary dips.

3. **Sample composition matters** — playoff teams (228 total) vs. non-playoff teams (376 total) have different baseline characteristics. Playoff teams tend to be in established markets with mature fan bases.
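
The four rows of the Q2 output can be collapsed into one figure per group by weighting each average by its occurrence count. The sketch below does that using only the numbers shown in the output; because those averages are rounded to one decimal, the net figures are approximate, and they remain raw comparisons with no controls. `summarize` is an illustrative helper.

```python
# (made_playoffs_prior_year, trend, occurrences, avg_change_pct) from the Q2 output.
q2_rows = [
    (1, "Increase", 128, 3.0),
    (1, "Decrease", 100, -1.9),
    (0, "Increase", 167, 3.2),
    (0, "Decrease", 209, -3.1),
]

def summarize(rows, made_playoffs):
    """Return (share of seasons with an increase, occurrence-weighted mean change)."""
    group = [r for r in rows if r[0] == made_playoffs]
    total = sum(n for _, _, n, _ in group)
    increases = sum(n for _, trend, n, _ in group if trend == "Increase")
    weighted_change = sum(n * change for _, _, n, change in group) / total
    return increases / total, weighted_change

for flag, label in [(1, "Playoff"), (0, "Non-playoff")]:
    share, net = summarize(q2_rows, flag)
    print(f"{label}: {share:.1%} increased, net change {net:+.2f}%")
# Playoff: 56.1% increased, net change +0.85%
# Non-playoff: 44.4% increased, net change -0.30%
```

On this view the groups differ by roughly one percentage point in net change: small, but not zero.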

**Statistical Caveat:**  
These are raw percentages without controlling for confounds. A proper analysis would require:
- Regression with controls for venue capacity, market size, ticket pricing
- Matched pairs analysis (comparing similar teams that just missed vs. just made playoffs)
- Time series decomposition to separate trend from seasonality

**Implication:**  
Playoff appearances alone don't reliably predict attendance growth. Marketing teams should focus on fan experience improvements, pricing strategy, and star player narratives rather than relying solely on "winning cures all" assumptions.

---

## Question 3: Does Venue Type Affect Attendance Stability?

**Hypothesis:** Indoor stadiums should show more **consistent** attendance (lower variability) than outdoor venues, since weather doesn't impact fan decisions.

**Analysis Approach:**
- Calculate attendance variability as (max - min) / average for each venue
- Compare indoor vs. outdoor venues across 2020-2024 ESPN data
- Filter to venues with 5+ games for statistical reliability

**Data Source:** `mart_venue_attendance_patterns` (dbt mart aggregated by venue)

**Note on Metric:**  
SQLite doesn't support `STDDEV()`, so we use **range-based variability** as a proxy:  
`variability_pct = ((max - min) / avg) * 100`
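
For comparison, the sketch below computes the range-based proxy next to a standard-deviation-based measure (coefficient of variation) in plain Python. The attendance figures are hypothetical, and it assumes game-level attendance can be pulled from an upstream model, which this notebook does not show.

```python
import statistics

def range_variability_pct(values):
    """The notebook's proxy: (max - min) / mean, as a percentage."""
    return 100.0 * (max(values) - min(values)) / statistics.mean(values)

def coefficient_of_variation_pct(values):
    """Sample standard deviation / mean, as a percentage."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical per-game attendance for one venue (not real data).
typical = [65_000, 66_000, 64_000, 65_500, 64_500]
with_outlier = typical + [13_000]  # e.g. one capacity-restricted game

for label, games in [("typical", typical), ("with outlier", with_outlier)]:
    print(
        f"{label}: range-based {range_variability_pct(games):.1f}%, "
        f"CV {coefficient_of_variation_pct(games):.1f}%"
    )
```

The range-based metric depends only on the two most extreme games, so a single unusual game moves it far more than it moves the coefficient of variation.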

In [4]:
query_q3 = """
SELECT
    venue_type,
    COUNT(*) AS num_venues,
    ROUND(AVG(games_played), 1) AS avg_games,
    ROUND(AVG(avg_attendance), 0) AS avg_attendance,
    ROUND(AVG(attendance_variability_pct), 1) AS avg_variability_pct,
    ROUND(MIN(attendance_variability_pct), 1) AS min_var,
    ROUND(MAX(attendance_variability_pct), 1) AS max_var
FROM mart_venue_attendance_patterns
GROUP BY venue_type
ORDER BY avg_variability_pct
"""

df_q3 = pd.read_sql_query(query_q3, conn)
df_q3

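| | venue_type | num_venues | avg_games | avg_attendance | avg_variability_pct | min_var | max_var |
|---|------------|------------|-----------|----------------|---------------------|---------|---------|
| 0 | Indoor | 10 | 43.6 | 62800.0 | 74.5 | 2.6 | 129.2 |
| 1 | Outdoor | 21 | 40.1 | 62660.0 | 76.5 | 1.1 | 123.7 |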


### Finding 3: Weather Doesn't Drive Variability (Yet)

**Result:**  
- **Indoor venues (10):** 74.5% average variability  
- **Outdoor venues (21):** 76.5% average variability

**Interpretation:**  
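The 2-percentage-point difference is **not meaningful** next to the spread within each group (roughly 1% to 129% across venues). Both venue types show similar attendance swings (a min-to-max range of about 75% of average attendance). This does not support the hypothesis that weather-proofing creates stability.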

**Why This Might Be:**

1. **Indoor Venues Are in Cold-Weather Markets:**  
   Several domed stadiums (Detroit, Minneapolis, Indianapolis) are in regions where outdoor football would be miserable. These teams may have built indoor venues *because* their markets already showed weather-sensitive attendance. The dome doesn't eliminate variability; it just shifts the baseline.

2. **Opponent Quality Dominates:**  
   A Packers-Bears rivalry game in December will sell out Lambeau Field (outdoor) regardless of snow, while a late-season game between two eliminated teams might leave empty seats in a climate-controlled dome. **Matchup quality > weather.**

3. **Small Sample + Heterogeneous Mix:**  
   Only 10 indoor venues vs. 21 outdoor means the indoor average could be skewed by 1-2 outliers. Additionally, "outdoor" lumps together Miami (warm) and Buffalo (arctic) — opposite weather profiles.

4. **Data Timeframe (2020-2024) Includes COVID-19:**  
   The 2020 season had capacity restrictions and fan behavior changes. Because the variability metric uses each venue's minimum and maximum, a handful of restricted-capacity games can inflate it across both venue types, masking any weather effect.

**Next Steps for Deeper Analysis:**

- **Add weather data:** Join game-day temperature, precipitation, wind speed for outdoor venues. Test if variability correlates with weather severity.

- **Control for opponent strength:** Use Vegas betting lines or team win% as a covariate to isolate weather from game quality.

- **Stratify outdoor venues:** Separate "cold outdoor" (Buffalo, Green Bay) from "mild outdoor" (Miami, Tampa) to test if *extreme* weather matters.

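- **Longitudinal comparison:** Analyze teams that moved from an outdoor to a roofed venue (Rams and Chargers into SoFi Stadium, Raiders OAK→LV, all in 2020) to see if the move changed their attendance patterns. This requires combining the Kaggle seasons with the ESPN data.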

**Statistical Note:**  
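With only 10 indoor venues and a 2-point gap set against a spread of more than 100 points within each group, a t-test would very likely fail to reach significance (p > 0.05). Resampling approaches such as bootstrapping or permutation tests, or Bayesian methods, are better suited to such a small N.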
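
One resampling option for small groups is a permutation test on the difference in group means. The sketch below is illustrative only: `permutation_p_value` is a hypothetical helper, and the per-venue values are placeholders (the endpoints echo the min/max in the Q3 output, while the values in between are invented).

```python
import random
import statistics

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the mean difference.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an exact zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical per-venue variability percentages (placeholders, not the mart's rows).
indoor = [2.6, 40.0, 75.0, 90.0, 129.2]
outdoor = [1.1, 35.0, 60.0, 80.0, 95.0, 123.7]

print(f"p = {permutation_p_value(indoor, outdoor, n_permutations=2_000):.3f}")
```

With the real per-venue values from `mart_venue_attendance_patterns` substituted in, a large p-value would back the reading that the indoor/outdoor gap is indistinguishable from noise.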

**Business Implication:**  
Indoor stadiums are expensive (~$1B+ to build). If weather doesn't significantly stabilize attendance, then the business case for domes should focus on **event versatility** (hosting concerts, conventions) rather than "guaranteed sellouts."

---

## Summary & Next Steps

### What We Learned:

1. **Winning ≠ Attendance (Without Context)**  
   Average teams draw more fans than winning teams in raw numbers — likely due to venue capacity constraints and market size confounds. **Action:** Normalize attendance by capacity.

2. **Playoff Momentum Is Weak**  
   Making the playoffs doesn't reliably boost next-year attendance more than random fluctuations. **Action:** Investigate other drivers like star signings, stadium upgrades, and pricing changes.

3. **Indoor vs. Outdoor Is Nuanced**  
   Weather-proofing doesn't reduce attendance variability in aggregate. **Action:** Add game-day weather data and stratify by climate zones.

### Phase 2 Priorities:

| Priority | Task | Data Needed | Expected Insight |
|----------|------|-------------|------------------|
| 🔥 HIGH | Venue capacity normalization | Stadium capacities (Wikipedia, team sites) | True demand vs. supply constraints |
| 🔥 HIGH | Market size controls | MSA population, median income | Isolate team performance effect |
| 🟡 MEDIUM | Weather integration | NOAA game-day conditions | Quantify cold/rain impact on outdoor venues |
| 🟡 MEDIUM | Ticket pricing data | StubHub historical prices (if available) | Test price elasticity hypothesis |
| 🟢 LOW | Player star power | Pro Bowl selections, jersey sales | Measure "superstar effect" on attendance |

### Methodology Notes:

- **Sample Size:** 638 team-seasons (Kaggle 2000-2019) + 31 venues (ESPN 2020-2024) provides adequate power for basic trends but limits subgroup analysis.

- **Causality:** All findings are **correlational**. Without randomized experiments (impossible in NFL), we rely on natural experiments (e.g., team relocations, stadium openings) for quasi-causal inference.

- **Data Quality:** dbt tests validate key uniqueness and completeness, but **we haven't validated attendance accuracy** against official NFL sources. Spot-checking a few high-profile games (Super Bowl, playoffs) would increase confidence.

### Recommended Visualizations for Presentation:

1. **Scatter plot:** Win % vs. Attendance (colored by venue capacity) — shows the capacity ceiling effect visually.

2. **Box plot:** Attendance change distribution for playoff vs. non-playoff teams — reveals overlap and outliers.

3. **Map:** US venues colored by average attendance, sized by variability — geographic patterns may emerge (e.g., Midwest loyalty).

4. **Time series:** Pick 3-5 teams and show attendance trajectories overlaid with playoff appearances — illustrates the weak momentum effect.

---

**Conclusion:**  
These initial queries validate that the dbt pipeline produces queryable, analysis-ready data. The surprising findings (average > winning, weak playoff effect, weather-neutral) highlight the importance of **not jumping to conclusions** from univariate analysis. The next phase will layer in contextual variables (capacity, market, weather) to build a more robust explanatory model.

*For the business stakeholder: "Winning alone doesn't explain attendance; stadium size and market dynamics may matter as much or more. Invest in fan experience and venue optimization, not just roster talent."*

In [5]:
# Clean up
conn.close()
print("Database connection closed.")

Database connection closed.
