# Race Results Normalization with Pydantic

This notebook demonstrates how to use Pydantic models to normalize race results data from multiple sources into a consistent, analyzable format. This enables writing general-purpose analysis code that works across different races and data sources.

**Key Benefits:**
- Standardize data from different sources (marathons, fell races, parkruns, etc.)
- Type validation and automatic field conversion
- Flexible column mapping for different data formats
- Consistent API for analysis code

## 1. Import Required Libraries

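Start by importing pandas, NumPy, and the Pydantic-based models from `running_results`. Pydantic itself must be installed in the notebook's environment (for example with `pip install pydantic`); without it, the import cell fails with a `ModuleNotFoundError`.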

In [2]:
import pandas as pd
import numpy as np
from typing import Optional, List, Dict
import warnings
warnings.filterwarnings('ignore')

# Import the new Pydantic models from running_results
from running_results.models import (
    NormalizedRaceResult,
    ColumnMapping,
    TimeParser,
    RaceResultsNormalizer,
    RaceCategory,
    Gender,
    normalize_race_results
)

print("✓ Libraries imported successfully")

ModuleNotFoundError: No module named 'pydantic'

## 2. Define Pydantic Models for Race Data

The Pydantic models have already been defined in `running_results/models.py`. Let's explore the key components:

**`NormalizedRaceResult`**: The core schema that all race results are normalized to. It includes:
- **Position fields**: `position_overall`, `position_gender`, `position_category`
- **Participant info**: `name`, `bib_number`, `gender`, `age_category`, `club`
- **Time fields**: `finish_time_seconds`, `chip_time_seconds`, `gun_time_seconds` (with automatic conversion to minutes)
- **Metadata**: `race_name`, `race_date`, `race_year`, `race_category`

**`ColumnMapping`**: Defines how columns in your source data map to normalized fields.

**`RaceResultsNormalizer`**: The main class that handles normalization from arbitrary DataFrame formats.

Let's see an example of the schema:

In [1]:
# Print the NormalizedRaceResult schema
print("NormalizedRaceResult Schema:")
print("=" * 70)
print(NormalizedRaceResult.model_json_schema())


NormalizedRaceResult Schema:


NameError: name 'NormalizedRaceResult' is not defined
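
For orientation, a schema of this kind could be sketched as follows. This is not the code in `running_results/models.py`: the class names `SketchGender` and `SketchRaceResult`, the reduced field list, and the derived `finish_time_minutes` property are assumptions made for illustration, using the Pydantic v2 API.

```python
from enum import Enum
from typing import Optional

from pydantic import BaseModel


class SketchGender(str, Enum):
    """Hypothetical gender enum (the real Gender enum may differ)."""

    MALE = "M"
    FEMALE = "F"


class SketchRaceResult(BaseModel):
    """Hypothetical, cut-down stand-in for NormalizedRaceResult."""

    position_overall: Optional[int] = None
    name: Optional[str] = None
    gender: Optional[SketchGender] = None
    club: Optional[str] = None
    finish_time_seconds: Optional[float] = None
    race_name: Optional[str] = None
    race_year: Optional[int] = None

    @property
    def finish_time_minutes(self) -> Optional[float]:
        """Derived value: the finish time converted from seconds to minutes."""
        if self.finish_time_seconds is None:
            return None
        return self.finish_time_seconds / 60


# Pydantic coerces the string "3" to an int and "F" to the enum member.
result = SketchRaceResult(
    position_overall="3", name="Alice Smith", gender="F", finish_time_seconds=5445
)
print(result.position_overall, result.gender, result.finish_time_minutes)

# The JSON schema lists the declared fields (the property is not a field).
schema_fields = sorted(SketchRaceResult.model_json_schema()["properties"])
print(schema_fields)
```

Making every field optional is one way to let sources with different column sets share a single schema; stricter models could require `name` or a time field instead.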

## 3. Parse and Validate Race Results

Let's load some real race data and normalize it. We'll use two datasets: the Edinburgh Marathon results and the Great Scottish Run (GSR) results.

In [None]:
# Load Edinburgh Marathon data
print("Loading Edinburgh Marathon 2024 data...")
edinburgh_df = pd.read_csv('edinburgh-marathon-2024.csv')
print(f"Shape: {edinburgh_df.shape}")
print("\nFirst few rows:")
print(edinburgh_df.head(2))
print("\nColumns:")
print(edinburgh_df.columns.tolist())


In [None]:
# Define how Edinburgh Marathon columns map to our standard schema
edinburgh_mapping = ColumnMapping(
    position_overall='Position (Overall)',
    position_category='Position (Category)',
    name='Name Number',
    club='Club',
    chip_time_seconds='Chip Time (seconds)',
    gun_time_seconds='Gun Time (seconds)',
    age_category='Category'
)

# Create normalizer for Edinburgh Marathon
edinburgh_normalizer = RaceResultsNormalizer(
    mapping=edinburgh_mapping,
    race_name='Edinburgh Marathon 2024',
    race_year=2024,
    race_category=RaceCategory.MARATHON
)

# Normalize the data
edinburgh_normalized = edinburgh_normalizer.normalize(
    edinburgh_df, 
    return_dataframe=True
)

print(f"✓ Normalized {len(edinburgh_normalized)} Edinburgh Marathon results")
print("\nNormalized columns:")
print(edinburgh_normalized.columns.tolist())
print("\nFirst normalized result:")
print(edinburgh_normalized.iloc[0])
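
The idea behind a column mapping can be sketched without the library. The snippet below is only an illustration under assumptions: `SketchColumnMapping` and `apply_mapping` are hypothetical names, the row is invented sample data, and the real `ColumnMapping` and `RaceResultsNormalizer` may work differently.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class SketchColumnMapping:
    """Hypothetical mapping: normalized field name -> source column name."""

    position_overall: Optional[str] = None
    name: Optional[str] = None
    club: Optional[str] = None
    chip_time_seconds: Optional[str] = None


def apply_mapping(rows: List[Dict], mapping: SketchColumnMapping) -> List[Dict]:
    """Rename mapped columns; unmapped or absent fields become None."""
    field_to_source = asdict(mapping)
    return [
        {
            field: (row.get(source) if source else None)
            for field, source in field_to_source.items()
        }
        for row in rows
    ]


# Invented sample row that uses the Edinburgh-style column headers.
source_rows = [
    {"Position (Overall)": 1, "Name Number": "A. Runner", "Chip Time (seconds)": 8100}
]
mapping = SketchColumnMapping(
    position_overall="Position (Overall)",
    name="Name Number",
    chip_time_seconds="Chip Time (seconds)",
)
normalized_rows = apply_mapping(source_rows, mapping)
print(normalized_rows[0])
```

Every output row carries the same keys regardless of which source columns were present, which is what lets later analysis code ignore where the data came from.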


In [None]:
# Load GSR (Great Scottish Run) results
print("\nLoading GSR 2022 results...")
gsr_df = pd.read_csv('gsr-results-final-2022.csv')
print(f"Shape: {gsr_df.shape}")
print("\nFirst few rows:")
print(gsr_df.head(2))
print("\nColumns:")
print(gsr_df.columns.tolist())


In [None]:
# GSR has a completely different schema - let's define the mapping
gsr_mapping = ColumnMapping(
    position_overall='Pos',
    name='Name',
    bib_number='Bib',
    club='Club',
    finish_time_minutes='Finish Time'  # Already in minutes for this dataset
)

# Create normalizer for GSR with an explicit mapping (auto-detection is an alternative)
gsr_normalizer = RaceResultsNormalizer(
    mapping=gsr_mapping,
    race_name='Great Scottish Run 2022',
    race_year=2022,
    race_category=RaceCategory.TEN_K
)

# Normalize the GSR data
gsr_normalized = gsr_normalizer.normalize(
    gsr_df,
    return_dataframe=True
)

print(f"✓ Normalized {len(gsr_normalized)} GSR results")
print("\nFirst normalized result:")
print(gsr_normalized.iloc[0])
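
Auto-detection of columns, mentioned in the cell above, can be approximated by matching cleaned-up headers against a table of known aliases. The sketch below is an assumption about how such a feature might work, not the library's implementation; `ALIASES` and `detect_columns` are hypothetical names.

```python
import re
from typing import Dict, Iterable

# Hypothetical alias table: normalized field -> lowercase header spellings.
ALIASES = {
    "position_overall": {"pos", "position", "rank", "place"},
    "name": {"name", "athlete", "runner"},
    "bib_number": {"bib", "bib number", "race number"},
    "club": {"club", "team"},
}


def _clean(header: str) -> str:
    """Lowercase the header and collapse punctuation/underscores to spaces."""
    return re.sub(r"[^a-z0-9]+", " ", header.lower()).strip()


def detect_columns(headers: Iterable[str]) -> Dict[str, str]:
    """Return {normalized_field: source_header} for headers matching an alias."""
    detected: Dict[str, str] = {}
    for header in headers:
        cleaned = _clean(header)
        for field, aliases in ALIASES.items():
            # Keep the first header that matches each field.
            if cleaned in aliases and field not in detected:
                detected[field] = header
    return detected


detected = detect_columns(["Pos", "Name", "Bib", "Club", "Finish Time"])
print(detected)
```

Headers that match no alias (here `Finish Time`) are simply left out, so an explicit mapping is still needed for anything the alias table does not cover.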


## 4. Handle Missing and Inconsistent Data

Fields that a source does not provide stay null after normalization, and Pydantic converts the values that are present to the declared types. The comparison below shows how the two datasets line up:

In [None]:
# Compare the two datasets to see how they're now unified
print("Comparison of normalized datasets:")
print("=" * 70)
print("\nEdinburgh Marathon:")
# Count the columns that contain at least one non-null value
print(f"  - Columns with data: {edinburgh_normalized.notna().any().sum()}")
print(f"  - Name field: {edinburgh_normalized['name'].notna().sum()} non-null")
print(f"  - Chip time seconds: {edinburgh_normalized['chip_time_seconds'].notna().sum()} non-null")
print(f"  - Club field: {edinburgh_normalized['club'].notna().sum()} non-null")

print(f"\nGSR:")
print(f"  - Name field: {gsr_normalized['name'].notna().sum()} non-null")
print(f"  - Finish time minutes: {gsr_normalized['finish_time_minutes'].notna().sum()} non-null")
print(f"  - Club field: {gsr_normalized['club'].notna().sum()} non-null")
print(f"  - Chip time (not in source): {gsr_normalized['chip_time_seconds'].notna().sum()} non-null")

print("\n✓ Both datasets now have the same schema!")
print("✓ Missing data is preserved as null values")
print("✓ Analysis code can now work with both datasets seamlessly")
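
One way a Pydantic model can treat blank or placeholder values as missing is a `mode="before"` validator on optional fields. This is a sketch under assumptions (Pydantic v2, a hypothetical `SketchResult` model, and an assumed set of placeholder strings); the validators in `running_results/models.py` may be written differently.

```python
import math
from typing import Any, Optional

from pydantic import BaseModel, field_validator


class SketchResult(BaseModel):
    """Hypothetical model: every field is optional and defaults to None."""

    name: Optional[str] = None
    club: Optional[str] = None
    chip_time_seconds: Optional[float] = None

    @field_validator("name", "club", "chip_time_seconds", mode="before")
    @classmethod
    def blank_to_none(cls, value: Any) -> Any:
        """Treat NaN, empty strings and common placeholders as missing."""
        if isinstance(value, float) and math.isnan(value):
            return None
        if isinstance(value, str) and value.strip() in {"", "-", "N/A"}:
            return None
        return value


# A blank name, a NaN club (as pandas would supply) and a numeric string.
row = SketchResult(name="  ", club=float("nan"), chip_time_seconds="8100.5")
print(row)

# A field that is absent from the input simply keeps its default of None.
partial = SketchResult(name="Alice Smith")
print(partial.chip_time_seconds)
```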


## 5. Create Utility Functions for Data Standardization

Now we can write general analysis code that works with any normalized race dataset:

In [None]:
def analyze_race_results(normalized_df: pd.DataFrame) -> Dict:
    """
    Generic analysis function that works on any normalized race results.
    
    This demonstrates how once you have normalized data, you can write
    analysis code that works across all races without modification.
    """
    
    results = {}
    
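    # Basic statistics: use the first time column that actually contains data.
    # A normalized frame may include a time column that is entirely null, so
    # checking only for the column name is not enough.
    time_col = next(
        (col for col in ('finish_time_minutes', 'chip_time_minutes')
         if col in normalized_df.columns and normalized_df[col].notna().any()),
        None
    )
    
    if time_col: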
        valid_times = normalized_df[time_col].dropna()
        results['time_stats'] = {
            'count': len(valid_times),
            'mean_minutes': valid_times.mean(),
            'median_minutes': valid_times.median(),
            'min_minutes': valid_times.min(),
            'max_minutes': valid_times.max(),
            'std_minutes': valid_times.std(),
        }
    
    # Club participation
    club_col = normalized_df['club'].dropna()
    results['top_clubs'] = club_col.value_counts().head(10).to_dict()
    results['club_count'] = club_col.nunique()
    
    # Position analysis
    results['finisher_count'] = normalized_df['position_overall'].notna().sum()
    
    # Category breakdown if available
    if 'age_category' in normalized_df.columns:
        category_counts = normalized_df['age_category'].value_counts()
        results['categories'] = category_counts.to_dict()
    
    return results

# Apply to both races
print("Edinburgh Marathon 2024 Analysis:")
print("-" * 70)
edin_analysis = analyze_race_results(edinburgh_normalized)
print(f"Finishers: {edin_analysis['finisher_count']}")
if 'time_stats' in edin_analysis:
    print(f"Average time: {edin_analysis['time_stats']['mean_minutes']:.1f} minutes")
    print(f"Median time: {edin_analysis['time_stats']['median_minutes']:.1f} minutes")
print(f"Clubs represented: {edin_analysis['club_count']}")
print(f"Top clubs: {list(edin_analysis['top_clubs'].keys())[:5]}")

print("\n\nGreat Scottish Run 2022 Analysis:")
print("-" * 70)
gsr_analysis = analyze_race_results(gsr_normalized)
print(f"Finishers: {gsr_analysis['finisher_count']}")
if 'time_stats' in gsr_analysis:
    print(f"Average time: {gsr_analysis['time_stats']['mean_minutes']:.1f} minutes")
    print(f"Median time: {gsr_analysis['time_stats']['median_minutes']:.1f} minutes")
print(f"Clubs represented: {gsr_analysis['club_count']}")
print(f"Top clubs: {list(gsr_analysis['top_clubs'].keys())[:5]}")

print("\n✓ Same analysis code works for both races with different formats!")


## 6. Test Normalization with Sample Data

Let's create and validate a custom race dataset to demonstrate the robustness of the schema:

In [None]:
# Create a test dataset with completely different column names and formats
test_data = pd.DataFrame({
    'Rank': [1, 2, 3, 4, 5],
    'Athlete': ['Alice Smith', 'Bob Johnson', 'Charlie Brown', 'Diana Prince', 'Eve Wilson'],
    'BiB': [101, 102, 103, 104, 105],
    'Team': ['Running Club A', 'Team B', 'Independent', 'Running Club A', 'Team C'],
    'Time_HM': ['1:30:45', '1:35:22', '1:42:18', '1:45:30', '1:50:15'],  # HH:MM:SS format
    'AgeGroup': ['30-39', '40-49', '20-29', '50-59', '35-44']
})

print("Test dataset (various formats):")
print(test_data)

# Define mapping for this test data
test_mapping = ColumnMapping(
    position_overall='Rank',
    name='Athlete',
    bib_number='BiB',
    club='Team',
    finish_time_minutes='Time_HM',
    age_category='AgeGroup'
)

# Create normalizer with custom time format parser
test_parser = TimeParser(format='HH:MM:SS')
test_normalizer = RaceResultsNormalizer(
    mapping=test_mapping,
    time_parser=test_parser,
    race_name='Test Local 5K Race',
    race_year=2024,
    race_category=RaceCategory.FIVE_K
)

# Normalize
test_normalized = test_normalizer.normalize(test_data, return_dataframe=True)

print("\nNormalized test data:")
print(test_normalized[['position_overall', 'name', 'finish_time_minutes', 
                        'finish_time_seconds', 'club', 'race_name']])

print("\n✓ Irregular data formats were normalized without errors")
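
The conversion a time parser has to perform for strings such as `1:30:45` can be sketched in plain Python. The function name `parse_time_to_seconds` and its handling of unparseable input are assumptions for illustration, not the behaviour of `TimeParser`.

```python
from typing import Optional, Union


def parse_time_to_seconds(value: Union[str, int, float, None]) -> Optional[float]:
    """Convert 'H:MM:SS', 'MM:SS' or a plain number of seconds to seconds."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    parts = value.strip().split(":")
    if not 1 <= len(parts) <= 3:
        return None
    try:
        numbers = [float(part) for part in parts]
    except ValueError:
        return None  # non-numeric entries such as 'DNF'
    seconds = 0.0
    for number in numbers:
        # Each further component is one base-60 place to the right.
        seconds = seconds * 60 + number
    return seconds


winner_seconds = parse_time_to_seconds("1:30:45")
print(winner_seconds, winner_seconds / 60)  # 5445.0 90.75
print(parse_time_to_seconds("25:30"))       # 1530.0
print(parse_time_to_seconds("DNF"))         # None
```

Returning `None` for unparseable input keeps the row in the dataset with a missing time; raising an error instead would be the stricter choice.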


## Key Advantages of This Approach

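**1. Type Safety**: Pydantic validates field types when a result is created
- Convertible values are coerced to the declared type (for example `"42"` to `42`)
- With the models here, values that cannot be converted are marked as `None`
- Type problems surface at load time instead of in downstream analysis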

**2. Flexibility**: Supports multiple data formats
- Different column names
- Different time formats (HH:MM:SS, MM:SS, seconds)
- Optional fields with sensible defaults

**3. Reusable Analysis Code**: Write once, run anywhere
- Same functions work for marathons, parkruns, fell races, etc.
- No need to handle each data source differently

**4. Auto-detection**: Optional automatic column mapping
- For well-formatted data, no manual mapping needed
- Falls back to explicit mapping when needed

**5. Extensibility**: Easy to add new fields or validations
- Add custom validators to the NormalizedRaceResult model
- Extend for race-specific requirements
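
Extending a model for one kind of race could look like the following sketch. It assumes Pydantic v2; `BaseResult` is a hypothetical stand-in for `NormalizedRaceResult`, and `FellRaceResult` with its `ascent_metres` field is invented for illustration.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, field_validator


class BaseResult(BaseModel):
    """Hypothetical stand-in for NormalizedRaceResult."""

    name: Optional[str] = None
    finish_time_seconds: Optional[float] = None


class FellRaceResult(BaseResult):
    """Adds a race-specific field and a sanity check on numeric values."""

    ascent_metres: Optional[float] = None

    @field_validator("finish_time_seconds", "ascent_metres")
    @classmethod
    def must_not_be_negative(cls, value: Optional[float]) -> Optional[float]:
        if value is not None and value < 0:
            raise ValueError("must not be negative")
        return value


ok = FellRaceResult(name="Alice Smith", finish_time_seconds=5445, ascent_metres=750)

try:
    FellRaceResult(name="Bob Johnson", finish_time_seconds=-1)
    rejected = False
except ValidationError:
    rejected = True

print(ok.ascent_metres, rejected)
```

Subclassing keeps the shared fields in one place, so analysis code written against the base schema continues to work on the extended results.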

## Next Steps

Now that you have normalized data, you can:

1. **Combine multiple races** for comparative analysis
2. **Create reusable analysis functions** that work across all races
3. **Build pipelines** that automatically normalize new race data
4. **Export to standard formats** (JSON, CSV) for sharing with other tools
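
Combining and exporting normalized results can be sketched with plain pandas. The two small frames below are invented sample data that only mimic the normalized column names used in this notebook.

```python
import pandas as pd

# Invented sample frames that share the normalized column names.
race_a = pd.DataFrame({
    "name": ["Alice Smith", "Bob Johnson"],
    "finish_time_minutes": [185.0, 210.5],
    "race_name": ["Race A", "Race A"],
})
race_b = pd.DataFrame({
    "name": ["Charlie Brown"],
    "finish_time_minutes": [42.0],
    "race_name": ["Race B"],
})

# Because the schemas match, the frames can be stacked directly.
combined = pd.concat([race_a, race_b], ignore_index=True)

# Per-race summary computed by one piece of code for all races.
summary = combined.groupby("race_name")["finish_time_minutes"].agg(["count", "median"])
print(summary)

# Export to JSON records (use combined.to_csv(path) for CSV).
records_json = combined.to_json(orient="records")
print(records_json)
```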

The models in `running_results/models.py` include comprehensive validation and error handling to ensure data quality!