In [16]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import date, timedelta, datetime
import os
import time
import re

## 0. Approach:
# Data Collection Approach

This project compiles historical game data for Atlanta’s professional sports teams across the NBA, MLB, and MLS. The objective was to create a unified dataset of all Atlanta-area home games with a consistent structure, standardized venue information, and geographic coordinates for spatial visualization and analysis.

#### 1. Data Sources

Game schedules were collected from publicly available league and team schedule pages:

- **NBA (Atlanta Hawks)**
- **MLB (Atlanta Braves)**
- **MLS (Atlanta United FC)**

### MLS Special Case  
Unlike the NBA and MLB, **MLS does not provide open, machine-readable API access or downloadable historical schedule data**. To obtain MLS match information:

- AI-assisted scraping was used to extract match details from the official Atlanta United schedule site.
- The extracted rows were manually checked and corrected for consistency.
- Clean data were inserted into a SQL table and exported as a CSV file for processing.

The finalized source files are:

- `nba_atlanta_games.csv`
- `mlb_atlanta_games.csv`
- `mls_atlanta_united_games.csv`

#### 2. Standardization of Fields

Because each league formats its data differently, several preprocessing steps were used to unify the datasets:

1. Convert all dates into a consistent `datetime` format (`game_date`).
2. Add a `league` column identifying NBA, MLB, or MLS.
3. Retain only the key fields required for the merged dataset:
   - `game_date`
   - `league`
   - `home_team`
   - `event_name`
   - `venue_name`
4. Remove unused fields such as internal game identifiers, season or type labels, scores, and redundant venue metadata.

This produces a consistent schema across all three leagues.

#### 3. Venue Normalization and Geolocation

All games occur at a small set of Atlanta-area venues. To support mapping and spatial analyses, each venue was assigned fixed latitude and longitude coordinates:

| Venue Name              | Latitude  | Longitude   |
|-------------------------|-----------|-------------|
| Mercedes-Benz Stadium   | 33.755489 | -84.401993  |
| State Farm Arena        | 33.757220 | -84.396390  |
| Truist Park             | 33.890781 | -84.468239  |
| CoolToday Park          | 27.032414 | -82.319747  |

These were included as new columns: `latitude` and `longitude`.

#### 4. Merging All Leagues

After cleaning, the NBA, MLB, and MLS tables were vertically merged using `pandas.concat()` and sorted chronologically.  
The final dataset uses the following standardized column order:

1. `game_date`
2. `league`
3. `home_team`
4. `event_name`
5. `venue_name`
6. `longitude`
7. `latitude`

#### 5. Final Output

The completed dataset was exported as:

```
Data/atlanta_sports.csv
```

## 1. NFL Falcon home games 

In [14]:
# Anchor date: Thu, Jun 11, 2026 - beginning of the World Cup
ANCHOR_DATE = date(2026, 6, 11)
YEARS_BACK = 3

# Date range
start_date = date(ANCHOR_DATE.year - YEARS_BACK, ANCHOR_DATE.month, ANCHOR_DATE.day)
end_date = ANCHOR_DATE

print(f"Collecting Atlanta MLB home games from {start_date} to {end_date}...")

BASE_URL = "https://statsapi.mlb.com/api/v1/schedule"
BRAVES_ID = 144  # Atlanta Braves

def month_ranges(start, end):
    cur = date(start.year, start.month, 1)
    while cur <= end:
        if cur.month == 12:
            next_month = date(cur.year + 1, 1, 1)
        else:
            next_month = date(cur.year, cur.month + 1, 1)
        range_start = max(cur, start)
        range_end = min(end, next_month - timedelta(days=1))
        yield range_start, range_end
        cur = next_month

all_rows = []

for rng_start, rng_end in month_ranges(start_date, end_date):
    params = {
        "teamId": BRAVES_ID,           # only Atlanta Braves games
        "sportId": 1,                  # MLB
        "startDate": rng_start.isoformat(),
        "endDate": rng_end.isoformat()
    }

    resp = requests.get(BASE_URL, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()

    for d in data.get("dates", []):
        game_date = d.get("date")
        for g in d.get("games", []):

            # Only include home games (played in Atlanta)
            if g.get("teams", {}).get("home", {}).get("team", {}).get("id") != BRAVES_ID:
                continue

            game_pk = g.get("gamePk")
            home_team = g.get("teams", {}).get("home", {}).get("team", {}).get("name")
            away_team = g.get("teams", {}).get("away", {}).get("team", {}).get("name")

            event_name = f"{away_team} at {home_team}"

            venue = g.get("venue", {}) or {}
            venue_name = venue.get("name")
            venue_id = venue.get("id")

            all_rows.append({
                "gamePk": game_pk,
                "event_name": event_name,
                "game_date": game_date,
                "home_team": home_team,
                "away_team": away_team,
                "venue_name": venue_name,
                "venue_id": venue_id
            })

df = pd.DataFrame(all_rows).drop_duplicates(subset=["gamePk"])

# Save to Data folder
output_dir = "Data"
os.makedirs(output_dir, exist_ok=True)

output_file = os.path.join(output_dir, "mlb_atlanta_games.csv")
df.to_csv(output_file, index=False)

print(f"Saved {len(df)} Atlanta home games to {output_file}")


Collecting Atlanta MLB home games from 2023-06-11 to 2026-06-11...
Saved 291 Atlanta home games to Data/mlb_atlanta_games.csv


## 2. NBA Hawks home games

In [13]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import date, timedelta, datetime
import os
import time

# Anchor date: Thu, Jun 11, 2026
ANCHOR_DATE = date(2026, 6, 11)
YEARS_BACK = 3

# Date range
start_date = date(ANCHOR_DATE.year - YEARS_BACK, ANCHOR_DATE.month, ANCHOR_DATE.day)
end_date = ANCHOR_DATE

print(f"Collecting Atlanta Hawks home games from {start_date} to {end_date}...")

def get_nba_season_year(game_date):
    """
    NBA seasons span two calendar years (e.g., 2023-24 season).
    Returns the starting year of the season.
    """
    if game_date.month >= 10:  # October onwards is start of new season
        return game_date.year
    else:  # Jan-Jun is end of previous season
        return game_date.year - 1

def scrape_espn_schedule(season_start_year):
    """
    Scrape Atlanta Hawks schedule from ESPN
    """
    # ESPN uses season type codes: 2 = regular season, 3 = playoffs
    all_games = []
    
    for season_type in [2, 3]:  # Regular season and playoffs
        url = f"https://www.espn.com/nba/team/schedule/_/name/atl/season/{season_start_year}/seasontype/{season_type}"
        
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        
        try:
            resp = requests.get(url, headers=headers, timeout=30)
            resp.raise_for_status()
            
            soup = BeautifulSoup(resp.content, 'html.parser')
            
            # Find the schedule table
            table = soup.find('table', {'class': 'Table'})
            
            if not table:
                print(f"No table found for {season_start_year}-{season_start_year + 1} season (type {season_type})")
                continue
            
            tbody = table.find('tbody', {'class': 'Table__TBODY'})
            if not tbody:
                continue
            
            rows = tbody.find_all('tr', {'class': 'Table__TR'})
            
            for row in rows:
                try:
                    cells = row.find_all('td')
                    if len(cells) < 3:
                        continue
                    
                    # Date cell
                    date_cell = cells[0]
                    date_text = date_cell.get_text(strip=True)
                    
                    # Skip if it's a date header or invalid
                    if not date_text or date_text.startswith('Date'):
                        continue
                    
                    # Opponent cell
                    opponent_cell = cells[1]
                    opponent_text = opponent_cell.get_text(strip=True)
                    
                    # Check if it's a home game (vs) or away game (@)
                    if opponent_text.startswith('@'):
                        continue  # Skip away games
                    
                    # Remove 'vs' prefix
                    opponent = opponent_text.replace('vs', '').strip()
                    
                    # Parse date - ESPN format is like "Wed, Oct 25" or "Thu 10/25"
                    try:
                        # Try to parse with year
                        if ',' in date_text:
                            # Format: "Wed, Oct 25"
                            date_parts = date_text.split(',')[1].strip()
                            game_date = datetime.strptime(f"{date_parts} {season_start_year}", "%b %d %Y").date()
                            
                            # If month is before July, it's the following year
                            if game_date.month < 7:
                                game_date = datetime.strptime(f"{date_parts} {season_start_year + 1}", "%b %d %Y").date()
                        else:
                            # Try other formats
                            continue
                    except:
                        continue
                    
                    # Filter by date range
                    if start_date <= game_date <= end_date:
                        all_games.append({
                            'game_date': game_date,
                            'season': f"{season_start_year}-{season_start_year + 1}",
                            'home_team': 'Atlanta Hawks',
                            'away_team': opponent,
                            'event_name': f"{opponent} at Atlanta Hawks",
                            'venue_name': 'State Farm Arena',
                            'venue_city': 'Atlanta',
                            'season_type': 'Regular Season' if season_type == 2 else 'Playoffs'
                        })
                
                except Exception as e:
                    continue
        
        except Exception as e:
            print(f"Error scraping {season_start_year}-{season_start_year + 1} season (type {season_type}): {e}")
        
        time.sleep(1)  # Be polite between requests
    
    return all_games

# Determine which seasons to scrape
seasons_to_scrape = set()
current = start_date
while current <= end_date:
    seasons_to_scrape.add(get_nba_season_year(current))
    current += timedelta(days=90)  # Jump by ~3 months

all_rows = []

for season_year in sorted(seasons_to_scrape):
    print(f"Scraping {season_year}-{season_year + 1} season...")
    season_games = scrape_espn_schedule(season_year)
    all_rows.extend(season_games)
    time.sleep(2)  # Be polite to the server

# Create DataFrame
df = pd.DataFrame(all_rows)

if not df.empty:
    # Sort by date
    df = df.sort_values('game_date').reset_index(drop=True)
    
    # Remove duplicates
    df = df.drop_duplicates(subset=['game_date', 'away_team'])
    
    # Save to Data folder
    output_dir = "Data"
    os.makedirs(output_dir, exist_ok=True)
    output_file = os.path.join(output_dir, "nba_atlanta_games.csv")
    df.to_csv(output_file, index=False)
    
    print(f"\nSaved {len(df)} Atlanta Hawks home games to {output_file}")
    print(f"\nDate range: {df['game_date'].min()} to {df['game_date'].max()}")
else:
    print("No games found in the specified date range.")

Collecting Atlanta Hawks home games from 2023-06-11 to 2026-06-11...
Scraping 2022-2023 season...
Scraping 2023-2024 season...
Scraping 2024-2025 season...
Scraping 2025-2026 season...

Saved 129 Atlanta Hawks home games to Data/nba_atlanta_games.csv

Date range: 2023-10-19 to 2026-04-18


## 3. MLS Atalanta United home games

In [29]:
import pandas as pd
import os

# Create the data
data = [
    ['2023-06-10', 2023, 'Atlanta United FC', 'D.C. United', '1-0', 'D.C. United at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2023-06-21', 2023, 'Atlanta United FC', 'New York City FC', '0-4', 'New York City FC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2023-07-02', 2023, 'Atlanta United FC', 'Toronto FC', '0-2', 'Toronto FC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2023-07-15', 2023, 'Atlanta United FC', 'Orlando City SC', '3-3', 'Orlando City SC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2023-08-26', 2023, 'Atlanta United FC', 'Nashville SC', '5-1', 'Nashville SC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2023-08-30', 2023, 'Atlanta United FC', 'New England Revolution', '3-0', 'New England Revolution at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2023-09-16', 2023, 'Atlanta United FC', 'Inter Miami CF', '1-4', 'Inter Miami CF at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2023-09-23', 2023, 'Atlanta United FC', 'CF Montréal', '5-0', 'CF Montréal at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2023-10-07', 2023, 'Atlanta United FC', 'Columbus Crew', '1-1', 'Columbus Crew at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-03-09', 2024, 'Atlanta United FC', 'New England Revolution', '4-1', 'New England Revolution at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-03-17', 2024, 'Atlanta United FC', 'Orlando City SC', '2-0', 'Orlando City SC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-03-31', 2024, 'Atlanta United FC', 'Chicago Fire FC', '3-0', 'Chicago Fire FC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-04-14', 2024, 'Atlanta United FC', 'Philadelphia Union', '2-2', 'Philadelphia Union at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-04-20', 2024, 'Atlanta United FC', 'FC Cincinnati', '1-2', 'FC Cincinnati at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-05-04', 2024, 'Atlanta United FC', 'Minnesota United FC', '1-2', 'Minnesota United FC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-05-11', 2024, 'Atlanta United FC', 'D.C. United', '2-3', 'D.C. United at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-05-25', 2024, 'Atlanta United FC', 'Nashville SC', '0-1', 'Nashville SC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-06-02', 2024, 'Atlanta United FC', 'Charlotte FC', '2-3', 'Charlotte FC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-06-15', 2024, 'Atlanta United FC', 'Houston Dynamo FC', '2-2', 'Houston Dynamo FC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-06-29', 2024, 'Atlanta United FC', 'Toronto FC', '2-1', 'Toronto FC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-07-17', 2024, 'Atlanta United FC', 'New York City FC', '2-2', 'New York City FC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-07-20', 2024, 'Atlanta United FC', 'Columbus Crew', '2-1', 'Columbus Crew at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-09-14', 2024, 'Atlanta United FC', 'Nashville SC', '0-2', 'Nashville SC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-09-18', 2024, 'Atlanta United FC', 'Inter Miami CF', '2-2', 'Inter Miami CF at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-10-02', 2024, 'Atlanta United FC', 'CF Montréal', '1-2', 'CF Montréal at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2024-10-05', 2024, 'Atlanta United FC', 'New York Red Bulls', '2-1', 'New York Red Bulls at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2025-02-22', 2025, 'Atlanta United FC', 'CF Montréal', '', 'CF Montréal at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2025-03-09', 2025, 'Atlanta United FC', 'Orlando City SC', '', 'Orlando City SC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2025-03-23', 2025, 'Atlanta United FC', 'Charlotte FC', '', 'Charlotte FC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2025-04-05', 2025, 'Atlanta United FC', 'Chicago Fire FC', '', 'Chicago Fire FC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2025-05-03', 2025, 'Atlanta United FC', 'New York City FC', '', 'New York City FC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2025-05-17', 2025, 'Atlanta United FC', 'Philadelphia Union', '', 'Philadelphia Union at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
    ['2025-06-12', 2025, 'Atlanta United FC', 'New York City FC', '', 'New York City FC at Atlanta United FC', 'Mercedes-Benz Stadium', 'Atlanta', 'MLS'],
]

# Create DataFrame
columns = ['game_date', 'season', 'home_team', 'away_team', 'score', 'event_name', 'venue_name', 'venue_city', 'league']
df = pd.DataFrame(data, columns=columns)

# Convert game_date to datetime
df['game_date'] = pd.to_datetime(df['game_date'])

# Create Data directory if it doesn't exist
os.makedirs('Data', exist_ok=True)

# Save to CSV
output_file = 'Data/mls_atlanta_united_games.csv'
df.to_csv(output_file, index=False)

print(f"Success! Saved {len(df)} Atlanta United home games to {output_file}")
print(f"\nDate range: {df['game_date'].min().date()} to {df['game_date'].max().date()}")
print(f"\nBreakdown by season:")
print(df.groupby('season').size())

Success! Saved 33 Atlanta United home games to Data/mls_atlanta_united_games.csv

Date range: 2023-06-10 to 2025-06-12

Breakdown by season:
season
2023     9
2024    17
2025     7
dtype: int64


## 4. Merge all home games data into a single CSV file

In [35]:
# 1. Read original datasets
nba = pd.read_csv(os.path.join(data_path, "nba_atlanta_games.csv"))
mlb = pd.read_csv(os.path.join(data_path, "mlb_atlanta_games.csv"))
mls = pd.read_csv(os.path.join(data_path, "mls_atlanta_united_games.csv"))

# 2. Make sure dates are datetime
for df in [nba, mlb, mls]:
    df["game_date"] = pd.to_datetime(df["game_date"])

# 3. Add league column
nba["league"] = "NBA"
mlb["league"] = "MLB"
mls["league"] = "MLS"   # overwrite / enforce

# 4. Keep only needed columns first
keep_cols = ["game_date", "league", "home_team", "event_name", "venue_name"]
nba = nba[keep_cols]
mlb = mlb[keep_cols]
mls = mls[keep_cols]

# 5. Add latitude / longitude based on venue_name
venue_coords = {
    "Mercedes-Benz Stadium": {"latitude": 33.755489, "longitude": -84.401993},
    "Truist Park":           {"latitude": 33.890781, "longitude": -84.468239},
    "State Farm Arena":      {"latitude": 33.757220, "longitude": -84.396390},
    "CoolToday Park":        {"latitude": 27.032414, "longitude": -82.319747},
}

def add_coords(df):
    df["longitude"] = df["venue_name"].map(lambda v: venue_coords[v]["longitude"])
    df["latitude"]  = df["venue_name"].map(lambda v: venue_coords[v]["latitude"])
    return df

nba = add_coords(nba)
mlb = add_coords(mlb)
mls = add_coords(mls)

# 6. Concatenate all games
all_games = pd.concat([nba, mlb, mls], ignore_index=True)

# 7. Order columns exactly as requested
all_games = all_games[
    ["game_date", "league", "home_team", "event_name",
     "venue_name", "longitude", "latitude"]
]

# 8. Sort by date from earliest to latest
all_games = all_games.sort_values("game_date").reset_index(drop=True)

# 9. Save to CSV
output_path = os.path.join(data_path, "atlanta_sports.csv")
all_games.to_csv(output_path, index=False)

print("Saved:", output_path)


Saved: Data/atlanta_sports.csv


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["longitude"] = df["venue_name"].map(lambda v: venue_coords[v]["longitude"])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["latitude"]  = df["venue_name"].map(lambda v: venue_coords[v]["latitude"])


In [36]:
all_games

Unnamed: 0,game_date,league,home_team,event_name,venue_name,longitude,latitude
0,2023-06-10,MLS,Atlanta United FC,D.C. United at Atlanta United FC,Mercedes-Benz Stadium,-84.401993,33.755489
1,2023-06-11,MLB,Atlanta Braves,Washington Nationals at Atlanta Braves,Truist Park,-84.468239,33.890781
2,2023-06-15,MLB,Atlanta Braves,Colorado Rockies at Atlanta Braves,Truist Park,-84.468239,33.890781
3,2023-06-16,MLB,Atlanta Braves,Colorado Rockies at Atlanta Braves,Truist Park,-84.468239,33.890781
4,2023-06-17,MLB,Atlanta Braves,Colorado Rockies at Atlanta Braves,Truist Park,-84.468239,33.890781
...,...,...,...,...,...,...,...
448,2026-06-03,MLB,Atlanta Braves,Toronto Blue Jays at Atlanta Braves,Truist Park,-84.468239,33.890781
449,2026-06-04,MLB,Atlanta Braves,Toronto Blue Jays at Atlanta Braves,Truist Park,-84.468239,33.890781
450,2026-06-05,MLB,Atlanta Braves,Pittsburgh Pirates at Atlanta Braves,Truist Park,-84.468239,33.890781
451,2026-06-06,MLB,Atlanta Braves,Pittsburgh Pirates at Atlanta Braves,Truist Park,-84.468239,33.890781
