# 01 - Financial Data Ingestion from Transfermarkt
**Project**: Premier League Competitiveness Analysis (2000-2024)  
**Purpose**: Scrape squad market values, transfer spending, and league table data from Transfermarkt  
**Output**: Three Delta Lake tables — `squad_market_values`, `transfer_spending`, `league_tables`

## Data Sources
- **Transfermarkt** (web scraping with BeautifulSoup)
  - Squad market values: `/premier-league/startseite/wettbewerb/GB1/`
  - Transfer spending: `/premier-league/einnahmenausgaben/wettbewerb/GB1/`
  - League tables: `/premier-league/tabelle/wettbewerb/GB1/`

## Seasons Covered
- 2000/01 through 2024/25 (25 seasons, 20 clubs each = 500 records per table)
- Note: Squad market values are only available from 2004/05 onward
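
The season arithmetic above can be checked with a short sketch. `season_label` and `CLUBS_PER_SEASON` are hypothetical names introduced here for illustration; the label format mirrors the `season` field built by the scraping functions in this notebook.

```python
def season_label(start_year: int) -> str:
    """Format a season's starting year as 'YYYY/YY', e.g. 2023 -> '2023/24'."""
    return f"{start_year}/{str(start_year + 1)[-2:]}"

# range() excludes its upper bound, so this covers 2000/01 through 2024/25
season_ids = list(range(2000, 2025))
labels = [season_label(year) for year in season_ids]

CLUBS_PER_SEASON = 20  # assumption taken from the bullet above
expected_records = len(season_ids) * CLUBS_PER_SEASON

print(labels[0], labels[-1], expected_records)
```

Comparing `expected_records` with the row count of each scraped table is a quick way to spot a season that failed to download.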

In [0]:
print("Hello Databricks!")
print(f"Spark version: {spark.version}")


## Step 1: Install Dependencies
Databricks clusters may not ship with every web scraping library we need. `%pip` installs packages as notebook-scoped libraries for the current notebook session. Run it near the top of the notebook, because installing packages can reset the Python interpreter state.
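
Before installing, it can help to check which libraries are actually missing. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper, not a Databricks utility. Note that it takes import names, which can differ from pip names (`beautifulsoup4` is imported as `bs4`).

```python
import importlib.util

def missing_modules(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]

# Import names, not pip package names
required = ["requests", "bs4", "lxml"]
missing = missing_modules(required)
print("Missing:", missing or "none")
```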

In [0]:
%pip install requests beautifulsoup4 lxml

## Step 2: Connect to Transfermarkt
Transfermarkt blocks requests that lack a browser User-Agent header, so we send a Chrome-style User-Agent string with every request. A status code of 200 means the request succeeded.
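
A single failed request does not have to end a long scraping run. The sketch below retries until it sees a 200, waiting longer after each failure. `fetch_with_retry` is a hypothetical helper; it takes the HTTP function as an argument (for example `requests.get`) so the retry logic can be tested without network access. The retry count, wait times, and 30-second timeout are assumptions, not values from this notebook.

```python
import time

def fetch_with_retry(get, url, headers, retries=3, backoff_seconds=2.0,
                     sleep=time.sleep):
    """Call get(url, headers=..., timeout=...) until it returns HTTP 200.

    `get` is any callable shaped like requests.get. Returns the last
    response received, whose status_code may still not be 200.
    """
    response = None
    for attempt in range(1, retries + 1):
        response = get(url, headers=headers, timeout=30)
        if response.status_code == 200:
            return response
        if attempt < retries:
            sleep(backoff_seconds * attempt)  # wait longer after each failure
    return response
```

In this notebook it could be called as `fetch_with_retry(requests.get, url, headers)`; the caller still has to check `status_code` on the result.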

In [0]:
import requests
from bs4 import BeautifulSoup
import time

# Test: can we reach Transfermarkt?
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'
}

url = "https://www.transfermarkt.us/premier-league/startseite/wettbewerb/GB1/plus/?saison_id=2023"
response = requests.get(url, headers=headers)

print(f"Status code: {response.status_code}")
print(f"Page length: {len(response.text)} characters")

In [0]:
soup = BeautifulSoup(response.text, 'lxml')

# Let's find all tables on the page and see what's there
tables = soup.find_all('table')
print(f"Number of tables found: {len(tables)}")

# Look at the first few links that contain club names
# Transfermarkt uses 'vereinprofil_tooltip' class for club links
links = soup.find_all('a', class_='vereinprofil_tooltip')
print(f"\nLinks with 'vereinprofil_tooltip': {len(links)}")
for link in links[:10]:
    print(f"  {link.text.strip()} -> {link.get('href')}")

In [0]:
# Let's look at what's inside the main tables
# Table 0 is usually navigation, so let's check a few
for i, table in enumerate(tables):
    rows = table.find_all('tr')
    print(f"\n--- Table {i}: {len(rows)} rows ---")
    # Print first row to see structure
    if rows:
        first_row_text = rows[0].get_text(strip=True)[:150]
        print(f"  First row: {first_row_text}")

In [0]:
# Extract data from Table 1
target_table = tables[1]
rows = target_table.find_all('tr')

# Skip header row (index 0) and total row (last), look at actual club rows
for row in rows[1:5]:  # Just first 4 clubs to explore
    cells = row.find_all('td')
    print(f"\nNumber of cells: {len(cells)}")
    for j, cell in enumerate(cells):
        text = cell.get_text(strip=True)
        if text:
            print(f"  Cell {j}: {text}")
        

In [0]:
# Extract all clubs from the 2023/24 season page
club_data = []

for row in rows[1:]:  # Skip header
    cells = row.find_all('td')
    
    # Skip rows without the expected cells (e.g. spacer or footer rows)
    if len(cells) < 7:
        continue
    
    club_name = cells[1].get_text(strip=True)
    
    # Skip the totals row (no club name)
    if not club_name:
        continue
    
    squad_size = cells[2].get_text(strip=True)
    avg_age = cells[3].get_text(strip=True).replace(' Years', '')
    total_value = cells[6].get_text(strip=True)
    
    club_data.append({
        'club': club_name,
        'squad_size': squad_size,
        'avg_age': avg_age,
        'total_market_value_raw': total_value
    })

# Check what we got
for club in club_data:
    print(f"{club['club']:30s} | {club['total_market_value_raw']}")

print(f"\nTotal clubs found: {len(club_data)}")

## Step 3: Data Parsing Utilities
Transfermarkt displays values as strings like "€1.46bn" or "€955.65m". The function below converts them to numeric euro values by applying the scale suffix (bn, m, or k). Strings without a recognized suffix return 0.0.
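
Returning 0.0 for unrecognized strings makes a missing value look the same as a real zero. As an alternative, here is a regex-based sketch that returns `None` when the string is not a value. `parse_value_or_none` is a hypothetical name, and two behaviors are assumptions that differ from the notebook's parser: a number without a suffix is read as plain euros, and a minus sign is accepted on either side of the euro sign.

```python
import re
from decimal import Decimal

_SCALES = {"bn": Decimal(1_000_000_000), "m": Decimal(1_000_000), "k": Decimal(1_000)}
_VALUE_RE = re.compile(r"^(-?)€?(-?)(\d+(?:\.\d+)?)(bn|m|k)?$")

def parse_value_or_none(value_str):
    """Return the value in euros as a float, or None if it is not a value."""
    match = _VALUE_RE.match(value_str.strip())
    if match is None:
        return None  # e.g. '-' or an empty string
    sign_before, sign_after, number, suffix = match.groups()
    # Decimal avoids float rounding noise when scaling, e.g. 1.46 * 1e9
    amount = Decimal(number) * _SCALES.get(suffix, Decimal(1))
    if sign_before or sign_after:
        amount = -amount
    return float(amount)

for text in ["€1.46bn", "€955.65m", "€850k", "€-181.90m", "-"]:
    print(f"{text:>12s}  ->  {parse_value_or_none(text)}")
```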

In [0]:
def parse_market_value(value_str):
    "Convert Transfermarkt value strings to numbers in euros."
    
    value_str = value_str.replace('€', '').strip()
    
    if 'bn' in value_str:
        return float(value_str.replace('bn', '')) * 1_000_000_000
    elif 'm' in value_str:
        return float(value_str.replace('m', '')) * 1_000_000
    elif 'k' in value_str:
        return float(value_str.replace('k', '')) * 1_000
    else:
        return 0.0

# Test it
test_values = ['€1.46bn', '€955.65m', '€140.90m', '€850k']
for v in test_values:
    print(f"{v:>12s}  ->  {parse_market_value(v):>20,.0f}")

## Step 4: Scrape Squad Market Values
Scrapes the total squad market value for every Premier League club for seasons 2000/01 - 2024/25. Uses a 3-second delay between requests to avoid IP blocking. Saves to Delta table: `squad_market_values`
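
The per-row extraction can be kept separate from the HTTP and HTML handling, which makes it testable offline. The sketch below assumes the cell positions found in the exploration cells (club name at index 1, squad size at 2, average age at 3, total value at 6). `parse_squad_row` is a hypothetical helper and the sample row is made up.

```python
def parse_squad_row(cell_texts, season_id):
    """Build one record from a row's cell texts, or None for non-club rows."""
    if len(cell_texts) < 7 or not cell_texts[1]:
        return None  # header, spacer, or totals row
    return {
        "season": f"{season_id}/{str(season_id + 1)[-2:]}",
        "season_start_year": season_id,
        "club": cell_texts[1],
        "squad_size": int(cell_texts[2]),
        "avg_age": float(cell_texts[3].replace(" Years", "")),
        "total_market_value_raw": cell_texts[6],
    }

# Made-up cell texts in the layout described above
sample = ["", "Example FC", "25", "26.4 Years", "15", "€20.00m", "€500.00m"]
record = parse_squad_row(sample, 2023)
print(record)
```

With BeautifulSoup, the input list could be produced by `[c.get_text(strip=True) for c in row.find_all('td')]`.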

In [0]:
def scrape_season_squad_values(season_id, headers):
    """Scrape squad market values for one Premier League season.
    
    Args:
        season_id: The starting year (e.g., 2023 for 2023/24 season)
        headers: Request headers with User-Agent
    
    Returns:
        List of dicts with club data, or empty list if failed
    """
    url = f"https://www.transfermarkt.us/premier-league/startseite/wettbewerb/GB1/plus/?saison_id={season_id}"
    
    response = requests.get(url, headers=headers)
    
    if response.status_code != 200:
        print(f"  Failed for {season_id}: status {response.status_code}")
        return []
    
    soup = BeautifulSoup(response.text, 'lxml')
    tables = soup.find_all('table')
    
    # Find the right table (the one with club squad data)
    target_table = None
    for table in tables:
        first_row = table.find('tr')
        if first_row and 'market value' in first_row.get_text().lower():
            target_table = table
            break
    
    if not target_table:
        print(f"  No squad table found for {season_id}")
        return []
    
    rows = target_table.find_all('tr')
    season_data = []
    
    for row in rows[1:]:
        cells = row.find_all('td')
        # Skip spacer or footer rows that lack the expected cells
        if len(cells) < 7:
            continue
        club_name = cells[1].get_text(strip=True)
        
        if not club_name:
            continue
        
        total_value_raw = cells[6].get_text(strip=True)
        
        season_data.append({
            'season': f"{season_id}/{str(season_id+1)[-2:]}",
            'season_start_year': season_id,
            'club': club_name,
            'squad_size': int(cells[2].get_text(strip=True)),
            'avg_age': float(cells[3].get_text(strip=True).replace(' Years', '')),
            'total_market_value_raw': total_value_raw,
            'total_market_value_eur': parse_market_value(total_value_raw)
        })
    
    return season_data

# Test with one season
test = scrape_season_squad_values(2023, headers)
print(f"Season 2023/24: {len(test)} clubs scraped")
print(f"Example: {test[0]['club']} — €{test[0]['total_market_value_eur']:,.0f}")

In [0]:
# Scrape all seasons from 2000/01 to 2024/25
all_squad_values = []

for season_id in range(2000, 2025):
    data = scrape_season_squad_values(season_id, headers)
    all_squad_values.extend(data)
    print(f"Season {season_id}/{str(season_id+1)[-2:]}: {len(data)} clubs")

    time.sleep(3)

print(f"\n{'='*50}")
print(f"Total records scraped: {len(all_squad_values)}")

In [0]:
# Convert to Spark DataFrame
df_squad_values = spark.createDataFrame(all_squad_values)

# Check the schema
df_squad_values.printSchema()

# Show first 10 rows
df_squad_values.show(10, truncate=False)

# Count by season to verify
df_squad_values.groupBy('season', 'season_start_year').count().orderBy("season_start_year").show(25)


In [0]:
from pyspark.sql.functions import avg, round as spark_round

# Average market value per season to see where data starts
df_squad_values.groupBy('season', 'season_start_year') \
    .agg(spark_round(avg('total_market_value_eur') / 1_000_000, 2).alias('avg_squad_value_millions')) \
    .orderBy('season_start_year') \
    .show(25)

In [0]:
from pyspark.sql.functions import col, when

# Add a flag for whether financial data is available
df_squad_values = df_squad_values.withColumn(
    'has_financial_data',
    when(col('total_market_value_eur') > 0, True).otherwise(False)
)

# Quick check
df_squad_values.groupBy('has_financial_data').count().show()

In [0]:
# Save to Delta Lake
df_squad_values.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("squad_market_values")

print("Table saved!")

In [0]:
%sql
SELECT season, club, total_market_value_eur
FROM squad_market_values
WHERE season_start_year = 2023
ORDER BY total_market_value_eur DESC

In [0]:
top_squads = spark.sql("""
    SELECT season, club, 
           format_number(total_market_value_eur, 0) as squad_value_eur
    FROM squad_market_values
    WHERE has_financial_data = true
    ORDER BY total_market_value_eur DESC
    LIMIT 5
""")
top_squads.show(truncate=False)

## Step 5: Scrape Transfer Spending
Scrapes transfer expenditure, income, and net spend per club per season. Uses the `einnahmenausgaben` (income and expenditure) page. Saves to Delta table: `transfer_spending`
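
One way to sanity-check a scraped row is to compare the three money columns against each other. The sketch below assumes the balance column equals income minus expenditure, so a net spender has a negative balance; the source does not state this convention explicitly. `balance_is_consistent` is a hypothetical helper, the tolerance is an assumption, and the figures are made up.

```python
def balance_is_consistent(expenditure_eur, income_eur, net_spend_eur,
                          tolerance_eur=100_000):
    """Check that net spend equals income minus expenditure, within a tolerance.

    The tolerance absorbs rounding in the displayed strings.
    """
    return abs((income_eur - expenditure_eur) - net_spend_eur) <= tolerance_eur

# Made-up figures for illustration only
ok = balance_is_consistent(464_100_000, 282_200_000, -181_900_000)
flipped = balance_is_consistent(464_100_000, 282_200_000, 181_900_000)
print(ok, flipped)
```

A row that fails this check usually points at a shifted cell index or a dropped minus sign rather than at the site's data.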

In [0]:
# Let's explore the transfer spending page structure
url = "https://www.transfermarkt.us/premier-league/einnahmenausgaben/wettbewerb/GB1/plus/0?ids=a&sa=&saession_id=2023&saison_id=2023&nat=&pos=&altersklasse=&w_s=&leihe=&intern=0"

response = requests.get(url, headers=headers)
print(response.status_code)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'lxml')
    tables = soup.find_all('table')
    print(f"Found {len(tables)} tables")

    for i, table in enumerate(tables):
        rows = table.find_all('tr')  # rows of this table, not of the whole page
        print(f"Table {i} has {len(rows)} rows")

        if rows:
            print(f"First row: {rows[0].get_text(strip=True)[:150]}")


In [0]:
# Try the simpler URL format
url = "https://www.transfermarkt.us/premier-league/einnahmenausgaben/wettbewerb/GB1/plus/0?ids=a&sa=&saison_id=2023&nat=&pos=&altersklasse=&w_s=&leihe=&intern=0"

response = requests.get(url, headers=headers)
print(f"Status code: {response.status_code}")

soup = BeautifulSoup(response.text, 'lxml')

# Instead of tables, let's look for divs with class 'box' 
# which is how Transfermarkt structures its content sections
boxes = soup.find_all('div', class_='box')
print(f"\nBoxes found: {len(boxes)}")

for i, box in enumerate(boxes):
    header = box.find(['h2', 'h1', 'h3'])
    if header:
        print(f"  Box {i}: {header.get_text(strip=True)[:100]}")

In [0]:
# Explore the content inside that box
box = boxes[0]

# Look for the actual data table inside this box
box_tables = box.find_all('table')
print(f"Tables inside the box: {len(box_tables)}")

if box_tables:
    for i, table in enumerate(box_tables):
        rows = table.find_all('tr')
        print(f"\n--- Table {i}: {len(rows)} rows ---")
        if rows:
            print(f"  Header: {rows[0].get_text(strip=True)[:200]}")
        if len(rows) > 1:
            print(f"  Row 1: {rows[1].get_text(strip=True)[:200]}")
        if len(rows) > 2:
            print(f"  Row 2: {rows[2].get_text(strip=True)[:200]}")

In [0]:
# Try with a specific season filter
url = "https://www.transfermarkt.us/premier-league/einnahmenausgaben/wettbewerb/GB1/plus/0?ids=a&sa=&saison_id=2023&saison_id_bis=2023&nat=&pos=&altersklasse=&w_s=&leihe=&intern=0"

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'lxml')

box = soup.find_all('div', class_='box')[0]
table = box.find_all('table')[1]
rows = table.find_all('tr')

print(f"Rows: {len(rows)}")
print(f"Header: {rows[0].get_text(strip=True)[:200]}")
print(f"\nFirst 3 clubs:")
for row in rows[1:4]:
    print(f"  {row.get_text(strip=True)[:150]}")

In [0]:
# Let's see the cell structure for one row
row = rows[1]  # First club
cells = row.find_all('td')
print(f"Number of cells: {len(cells)}")
for i, cell in enumerate(cells):
    text = cell.get_text(strip=True)
    if text:
        print(f"  Cell {i}: {text}")

In [0]:
def scrape_season_transfers(season_id, headers):
    """Scrape transfer spending for one Premier League season."""
    url = (
        f"https://www.transfermarkt.us/premier-league/einnahmenausgaben/wettbewerb/GB1/plus/0"
        f"?ids=a&sa=&saison_id={season_id}&saison_id_bis={season_id}"
        f"&nat=&pos=&altersklasse=&w_s=&leihe=&intern=0"
    )
    
    response = requests.get(url, headers=headers)
    
    if response.status_code != 200:
        print(f"  Failed for {season_id}: status {response.status_code}")
        return []
    
    soup = BeautifulSoup(response.text, 'lxml')
    
    boxes = soup.find_all('div', class_='box')
    if not boxes:
        print(f"  No box found for {season_id}")
        return []
    
    table = boxes[0].find_all('table')[1]
    rows = table.find_all('tr')
    
    season_data = []
    for row in rows[1:]:  # Skip header
        cells = row.find_all('td')
        # Skip spacer or footer rows that lack the expected cells
        if len(cells) < 8:
            continue
        club_name = cells[2].get_text(strip=True)
        
        if not club_name:
            continue
        
        expenditure_raw = cells[3].get_text(strip=True)
        income_raw = cells[5].get_text(strip=True)
        balance_raw = cells[7].get_text(strip=True)
        
        season_data.append({
            'season': f"{season_id}/{str(season_id+1)[-2:]}",
            'season_start_year': season_id,
            'club': club_name,
            'expenditure_eur': parse_market_value(expenditure_raw),
            'arrivals': int(cells[4].get_text(strip=True)),
            'income_eur': parse_market_value(income_raw),
            'departures': int(cells[6].get_text(strip=True)),
            'net_spend_eur': parse_market_value(balance_raw)
        })
    
    return season_data

# Test again
test = scrape_season_transfers(2023, headers)
print(f"Season 2023/24: {len(test)} clubs")
print(f"\nTop 3 spenders:")
for club in sorted(test, key=lambda x: x['net_spend_eur'])[:3]:
    print(f"  {club['club']:25s} | Spent: €{club['expenditure_eur']:>15,.0f} | Net: €{club['net_spend_eur']:>15,.0f}")

In [0]:
def parse_market_value(value_str):
    """Convert Transfermarkt value strings to numbers in euros.
    
    Handles: '€1.46bn', '€955.65m', '€850k', '€-181.90m', '-'
    """
    value_str = value_str.replace('€', '').strip()
    
    if value_str == '-' or value_str == '':
        return 0.0
    
    # Handle negative values
    is_negative = False
    if value_str.startswith('-'):
        is_negative = True
        value_str = value_str[1:]  # Remove the minus sign
    
    if 'bn' in value_str:
        result = float(value_str.replace('bn', '')) * 1_000_000_000
    elif 'm' in value_str:
        result = float(value_str.replace('m', '')) * 1_000_000
    elif 'k' in value_str:
        result = float(value_str.replace('k', '')) * 1_000
    else:
        result = 0.0
    
    return -result if is_negative else result

# Test with negatives
test_values = ['€464.10m', '€-181.90m', '€1.46bn', '-', '€850k']
for v in test_values:
    print(f"{v:>12s}  ->  {parse_market_value(v):>20,.0f}")

In [0]:
# Scrape all seasons
all_transfers = []

for season_id in range(2000, 2025):
    data = scrape_season_transfers(season_id, headers)
    all_transfers.extend(data)
    print(f"Season {season_id}/{str(season_id+1)[-2:]}: {len(data)} clubs")
    time.sleep(3)

print(f"\n{'='*50}")
print(f"Total records scraped: {len(all_transfers)}")

In [0]:
# Convert to Spark DataFrame
df_transfers = spark.createDataFrame(all_transfers)

# Check the schema
df_transfers.printSchema()

# Quick sanity check: who spent the most in a single season ever?
df_transfers.createOrReplaceTempView("transfers_temp")

spark.sql("""
    SELECT season, club, 
           format_number(expenditure_eur, 0) as spent,
           format_number(net_spend_eur, 0) as net_spend
    FROM transfers_temp
    ORDER BY expenditure_eur DESC
    LIMIT 5
""").show(truncate=False)

In [0]:
# Save to Delta Lake
df_transfers.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("transfer_spending")

print("Table saved!")

# Verify both tables exist
spark.sql("SHOW TABLES").show()

## Step 6: Scrape League Tables
Scrapes final standings including position, points, wins, draws, losses, and goals. This is our on-field competitiveness data. Saves to Delta table: `league_tables`
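
Standings rows carry their own internal checks: games played should equal wins plus draws plus losses, and goal difference should equal goals for minus goals against. The sketch below reports mismatches instead of raising, because a points mismatch can be genuine (point deductions) rather than a scraping error. `standing_issues` is a hypothetical helper and the sample row uses illustrative values.

```python
def standing_issues(row):
    """Return a list of inconsistencies found in one standings record."""
    issues = []
    if row["played"] != row["wins"] + row["draws"] + row["losses"]:
        issues.append("played != wins + draws + losses")
    if row["goal_difference"] != row["goals_for"] - row["goals_against"]:
        issues.append("goal_difference != goals_for - goals_against")
    if row["points"] != 3 * row["wins"] + row["draws"]:
        # Not necessarily a scraping error: point deductions cause this too
        issues.append("points != 3 * wins + draws")
    return issues

# Illustrative values only
sample_row = {
    "played": 38, "wins": 28, "draws": 7, "losses": 3,
    "goals_for": 96, "goals_against": 34, "goal_difference": 62, "points": 91,
}
print(standing_issues(sample_row))
```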

In [0]:
# Test the TheSportsDB API - free, no API key needed
url = "https://www.thesportsdb.com/api/v1/json/3/lookuptable.php?l=4328&s=2023-2024"

response = requests.get(url)
print(f"Status code: {response.status_code}")

# Parse JSON response
data = response.json()
print(f"Keys: {data.keys()}")
print(f"Number of teams: {len(data['table'])}")

# Look at first team's data structure
team = data['table'][0]
print(f"\nAvailable fields:")
for key, value in team.items():
    print(f"  {key}: {value}")

In [0]:
# TheSportsDB returns only 5 teams on the free tier
# Let's try an alternative: football-data.org
print("\nTrying football-data.org...")
url = "https://www.football-data.org/v4/competitions/PL/standings?season=2023"
response = requests.get(url, headers={'X-Auth-Token': 'test'})
print(f"Status: {response.status_code}")

In [0]:
# Transfermarkt has league tables
# Let's check the structure
url = "https://www.transfermarkt.us/premier-league/tabelle/wettbewerb/GB1/saison_id/2023"

response = requests.get(url, headers=headers)
print(f"Status code: {response.status_code}")

soup = BeautifulSoup(response.text, 'lxml')
tables = soup.find_all('table')
print(f"Tables found: {len(tables)}")

for i, table in enumerate(tables):
    rows = table.find_all('tr')
    print(f"\n--- Table {i}: {len(rows)} rows ---")
    if rows:
        print(f"  Header: {rows[0].get_text(strip=True)[:200]}")
    if len(rows) > 1:
        print(f"  Row 1: {rows[1].get_text(strip=True)[:200]}")

In [0]:
target_table = tables[1]
rows = target_table.find_all('tr')

# Check first club row
row = rows[1]
cells = row.find_all('td')
print(f"Number of cells: {len(cells)}")
for i, cell in enumerate(cells):
    text = cell.get_text(strip=True)
    if text:
        print(f"  Cell {i}: {text}")

In [0]:
def scrape_season_table(season_id, headers):
    """Scrape league table for one Premier League season."""
    url = f"https://www.transfermarkt.us/premier-league/tabelle/wettbewerb/GB1/saison_id/{season_id}"
    
    response = requests.get(url, headers=headers)
    
    if response.status_code != 200:
        print(f"  Failed for {season_id}: status {response.status_code}")
        return []
    
    soup = BeautifulSoup(response.text, 'lxml')
    tables = soup.find_all('table')
    
    # Find the table with league standings
    target_table = None
    for table in tables:
        first_row = table.find('tr')
        if first_row and 'Pts' in first_row.get_text():
            target_table = table
            break
    
    if not target_table:
        print(f"  No standings table found for {season_id}")
        return []
    
    rows = target_table.find_all('tr')
    season_data = []
    
    for row in rows[1:]:
        cells = row.find_all('td')
        # Skip spacer or footer rows that lack the expected cells
        if len(cells) < 10:
            continue
        club_name = cells[2].get_text(strip=True)
        
        if not club_name:
            continue
        
        # Parse goals "96:34" into goals_for and goals_against
        goals_raw = cells[7].get_text(strip=True)
        goals_for, goals_against = goals_raw.split(':')
        
        season_data.append({
            'season': f"{season_id}/{str(season_id+1)[-2:]}",
            'season_start_year': season_id,
            'club': club_name,
            'position': int(cells[0].get_text(strip=True)),
            'played': int(cells[3].get_text(strip=True)),
            'wins': int(cells[4].get_text(strip=True)),
            'draws': int(cells[5].get_text(strip=True)),
            'losses': int(cells[6].get_text(strip=True)),
            'goals_for': int(goals_for),
            'goals_against': int(goals_against),
            'goal_difference': int(cells[8].get_text(strip=True)),
            'points': int(cells[9].get_text(strip=True))
        })
    
    return season_data

# Test
test = scrape_season_table(2023, headers)
print(f"Season 2023/24: {len(test)} clubs\n")
for club in test[:5]:
    print(f"  {club['position']}. {club['club']:20s} | Pts: {club['points']} | W{club['wins']} D{club['draws']} L{club['losses']}")

In [0]:
# Scrape all seasons
all_tables = []

for season_id in range(2000, 2025):
    data = scrape_season_table(season_id, headers)
    all_tables.extend(data)
    print(f"Season {season_id}/{str(season_id+1)[-2:]}: {len(data)} clubs")
    time.sleep(3)

print(f"\n{'='*50}")
print(f"Total records scraped: {len(all_tables)}")

In [0]:
# Convert to Spark DataFrame and save
df_league_tables = spark.createDataFrame(all_tables)

df_league_tables.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("league_tables")

# Verify all three tables exist
spark.sql("SHOW TABLES").show()

# Quick sanity check: all champions 2000-2024
spark.sql("""
    SELECT season, club, points
    FROM league_tables
    WHERE position = 1
    ORDER BY season_start_year
""").show(25, truncate=False)

In [0]:
# Check for name mismatches across our three tables
squad_names = spark.sql("SELECT DISTINCT club FROM squad_market_values").toPandas()['club'].tolist()
transfer_names = spark.sql("SELECT DISTINCT club FROM transfer_spending").toPandas()['club'].tolist()
league_names = spark.sql("SELECT DISTINCT club FROM league_tables").toPandas()['club'].tolist()

# Find names in league_tables that don't appear in squad_market_values
print("League table names NOT in squad values:")
for name in sorted(league_names):
    if name not in squad_names:
        print(f"  {name}")

print(f"\nUnique clubs - League tables: {len(league_names)}")
print(f"Unique clubs - Squad values: {len(squad_names)}")
print(f"Unique clubs - Transfers: {len(transfer_names)}")

## Summary
### Tables Created
| Table | Records | Description |
|-------|---------|-------------|
| `squad_market_values` | 500 | Squad valuations per club per season (financial data from 2004/05) |
| `transfer_spending` | 500 | Transfer income/expenditure per club per season |
| `league_tables` | 500 | Final standings, points, wins, goals |

### Known Issues
- Squad market values return €0 for seasons 2000/01 - 2003/04 (Transfermarkt didn't track values then)
- Club names differ between tables (e.g., "Man City" vs "Manchester City") — resolved in Notebook 02
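
The name mismatch can be illustrated with a small alias map. This is only a sketch of one possible approach; how Notebook 02 actually resolves the names is not shown here. `CLUB_ALIASES`, `normalize_club`, and `unmatched` are hypothetical names, and the only alias pair included is the one from the note above.

```python
CLUB_ALIASES = {
    "Man City": "Manchester City",  # example pair from the Known Issues note
}

def normalize_club(name, aliases=CLUB_ALIASES):
    """Map a club name to its canonical form, leaving unknown names unchanged."""
    cleaned = name.strip()
    return aliases.get(cleaned, cleaned)

def unmatched(names_a, names_b, aliases=CLUB_ALIASES):
    """Return names from names_a that have no counterpart in names_b after normalizing."""
    normalized_b = {normalize_club(n, aliases) for n in names_b}
    return sorted({normalize_club(n, aliases) for n in names_a} - normalized_b)

print(unmatched(["Man City", "Example FC"], ["Manchester City"]))
```

Running `unmatched` on the distinct club lists of two tables shows which aliases still need to be added to the map.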

### Next Step
→ `02_data_processing`: Standardize club names and join all three tables