# Exploratory Data Analysis: Olympic Athletes Performance & Trends

<div style="text-align: center;">
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/5c/Olympic_rings_without_rims.svg/2560px-Olympic_rings_without_rims.svg.png" alt="The Olympic Rings" width="320"/>
</div>


# Introduction & Research Context

This analysis explores Olympic medalist data spanning 1896 to 2016, examining demographic patterns, achievement trends, and insights into athletic performance across different Olympic disciplines and time periods. By analyzing athlete characteristics, country performance, and sport-specific patterns, we can identify what factors distinguish Olympic medalists and how competition has evolved.

### Dataset Overview

The analysis uses the Olympic medalists dataset containing one record per athlete-event-medal combination. Key variables include:

* **Athlete Information:** ID, Name, Sex, Age, Height (cm), Weight (kg)
* **Organizational Data:** Team, NOC (National Olympic Committee code), Country
* **Event Details:** Games (year and season), Season (Summer/Winter), City, Sport, Event category
* **Performance:** Medal (Gold, Silver, or Bronze)

**Dataset Scope:** 1896–2016 Olympic Games; includes only medalists (athletes who won at least one medal)

**Important Note:** Missing values exist in Age, Height, Weight, and Country fields. These represent data unavailability rather than zero values, and analyses appropriately exclude missing data from calculations to maintain accuracy.

Let's begin by loading and exploring the dataset structure.


In [10]:
# import the pandas library
import pandas as pd

# load the data into a dataframe
df = pd.read_csv('datasets/olympics.csv')

# preview the dataframe
print(df.head())

   ID                      Name Sex   Age  Height  Weight            Team  \
0   4      Edgar Lindenau Aabye   M  34.0     NaN     NaN  Denmark/Sweden   
1  15      Arvo Ossian Aaltonen   M  30.0     NaN     NaN         Finland   
2  15      Arvo Ossian Aaltonen   M  30.0     NaN     NaN         Finland   
3  16  Juhamatti Tapio Aaltonen   M  28.0   184.0    85.0         Finland   
4  17   Paavo Johannes Aaltonen   M  28.0   175.0    64.0         Finland   

   NOC        Games  Year  Season       City       Sport  \
0  DEN  1900 Summer  1900  Summer      Paris  Tug-Of-War   
1  FIN  1920 Summer  1920  Summer  Antwerpen    Swimming   
2  FIN  1920 Summer  1920  Summer  Antwerpen    Swimming   
3  FIN  2014 Winter  2014  Winter      Sochi  Ice Hockey   
4  FIN  1948 Summer  1948  Summer     London  Gymnastics   

                                    Event   Medal   region  
0             Tug-Of-War Men's Tug-Of-War    Gold  Denmark  
1  Swimming Men's 200 metres Breaststroke  Bronze  Fin

## Data Inspection & Structure

Begin by understanding the dataset dimensions, column types, and data completeness. This foundational review ensures we can trust subsequent analyses.

**Dataset Overview:** The Olympic medalists dataset contains 39,783 records across 16 columns, spanning 120 years of international competition (1896-2016). Each record represents an athlete-medal combination, meaning athletes who won multiple medals appear multiple times. The dataset includes historical records across Summer and Winter Olympics, various sports, and athletic disciplines.

In [None]:
# Inspect the numbers of rows and columns
print(f"Dataset shape: {df.shape}")

# Print out all the column names
print(f"\nColumn names:\n{df.columns.tolist()}")

# Inspect column data types, null values, and other info
print("\nData types and missing values:")
df.info()

(39783, 16)
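
Because athletes with several medals appear on several rows, the row count is not the number of distinct athletes. The sketch below shows one way to separate the two. The `sample` frame is a hypothetical stand-in for `df`: the IDs come from the preview above, but the medal values past the first two rows are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df; IDs follow the preview, medals are partly invented
sample = pd.DataFrame({
    "ID": [4, 15, 15, 16, 17],
    "Medal": ["Gold", "Bronze", "Bronze", "Bronze", "Bronze"],
})

def medal_rows_per_athlete(data: pd.DataFrame) -> pd.Series:
    """Count medal rows per athlete ID, most-decorated athletes first."""
    return data.groupby("ID").size().sort_values(ascending=False)

n_rows = len(sample)                 # one row per athlete-medal combination
n_athletes = sample["ID"].nunique()  # distinct athletes
per_athlete = medal_rows_per_athlete(sample)

print(f"{n_rows} rows, {n_athletes} unique athletes")
print(per_athlete.head())
```

Running the same two lines against `df` would show how many of the 39,783 rows belong to repeat medalists.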


### Data Quality Note: Missing Values

The dataset contains missing values in the Age, Height, Weight, and Country fields. They are stored as NaN (null) and represent unavailable data, not zeros. pandas aggregations such as mean(), min(), and max() skip NaN by default, so the summary statistics in this notebook are computed only over records where the value is known.
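
To see how much data is missing, a per-column count and percentage is usually enough. The following is a minimal sketch on a hypothetical four-row `sample` frame (values loosely based on the preview above). Applying the same calls to `df` is an assumption about how you might check the real data, not output from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps similar to those in the preview
sample = pd.DataFrame({
    "Age": [34.0, 30.0, 28.0, np.nan],
    "Height": [np.nan, np.nan, 184.0, 175.0],
    "Weight": [np.nan, np.nan, 85.0, 64.0],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "percent": (sample.isna().mean() * 100).round(1),
})
print(missing_summary)

# mean() skips NaN by default, so only the two known heights are averaged
mean_height = sample["Height"].mean()
print(f"Mean height over known values: {mean_height}")
```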


In [15]:
# Use unique() to examine the types of medals in the dataset.
print(df['Medal'].unique())

['Gold' 'Bronze' 'Silver']


## Data Preparation

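Standardize column names to improve readability and consistency. Rename NOC to CountryCode and region to Country for clarity. Remove the Team column, since it largely duplicates the country information (the first preview row, Denmark/Sweden versus Denmark, shows the two are not always identical) and is not used in the analysis that follows.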
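
Before dropping a column as redundant, it can help to check how often it actually disagrees with the column you are keeping. The sketch below does this on a hypothetical three-row `sample` whose values are taken from the preview above. It uses the original column name `region`, so it would need to run before the rename.

```python
import pandas as pd

# Values copied from the first rows of the preview; a stand-in for df
sample = pd.DataFrame({
    "Team": ["Denmark/Sweden", "Finland", "Finland"],
    "region": ["Denmark", "Finland", "Finland"],
})

# Rows where the team label and the region label differ
mismatch = sample[sample["Team"] != sample["region"]]
n_mismatch = len(mismatch)

print(f"{n_mismatch} of {len(sample)} rows have Team != region")
print(mismatch)
```

If the mismatch share on the full data is small and confined to mixed or historical teams, dropping Team loses little.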


In [None]:
# Rename 'NOC' column to 'CountryCode' and 'region' column to 'Country'
df.rename(columns={'NOC': 'CountryCode', 'region': 'Country'}, inplace=True)
print("After renaming columns:")
print(df.head())

# Remove the 'Team' column (duplicates country information)
df.drop(columns=['Team'], inplace=True)
print("\nAfter removing Team column:")
print(df.head())

   ID                      Name Sex   Age  Height  Weight            Team  \
0   4      Edgar Lindenau Aabye   M  34.0     NaN     NaN  Denmark/Sweden   
1  15      Arvo Ossian Aaltonen   M  30.0     NaN     NaN         Finland   
2  15      Arvo Ossian Aaltonen   M  30.0     NaN     NaN         Finland   
3  16  Juhamatti Tapio Aaltonen   M  28.0   184.0    85.0         Finland   
4  17   Paavo Johannes Aaltonen   M  28.0   175.0    64.0         Finland   

  CountryCode        Games  Year  Season       City       Sport  \
0         DEN  1900 Summer  1900  Summer      Paris  Tug-Of-War   
1         FIN  1920 Summer  1920  Summer  Antwerpen    Swimming   
2         FIN  1920 Summer  1920  Summer  Antwerpen    Swimming   
3         FIN  2014 Winter  2014  Winter      Sochi  Ice Hockey   
4         FIN  1948 Summer  1948  Summer     London  Gymnastics   

                                    Event   Medal  Country  
0             Tug-Of-War Men's Tug-Of-War    Gold  Denmark  
1  Swimming 

## Part 1: Demographic Analysis of Olympic Medalists

Examine key demographic characteristics of Olympic medalists including age range, medal distribution, and athlete composition across sports. These foundational metrics provide context for understanding who succeeds at the Olympic level.

**Critical Finding:** Olympic medalists show remarkable age diversity, ranging from pre-teenage prodigies to competitors in their 70s. This wide range reflects the fundamental differences between Olympic sports—some disciplines (like gymnastics and diving) reward youth and physical development, while others (like shooting, equestrian, and sailing) favor experience, technique, and mental discipline. Understanding this variation is essential to recognizing that "Olympic athlete" describes vastly different physical profiles and career trajectories.

### Age Extremes: Olympic Medalists at Life's Boundaries

**Youngest Olympic Medalist:** The youngest medalist in the dataset was 10 years old. Medalists this young indicate that certain Olympic sports recruit and develop competitors during childhood. This pattern is particularly common in gymnastics, diving, and swimming, where physical development, training plasticity, and fearlessness during formative years confer competitive advantages. These young medalists represent years of specialized training beginning in early childhood, a pathway that many sporting federations have institutionalized.

In [None]:
# Age extremes: youngest and oldest medalists
youngest_age = df[df['Medal'].notnull()]['Age'].min()
oldest_age = df[df['Medal'].notnull()]['Age'].max()
print(f"Youngest age of an Olympic medalist: {youngest_age}")
print(f"Oldest age of an Olympic medalist: {oldest_age}")

Youngest age of an Olympic medalist: 10.0


**Oldest Olympic Medalist:** Competitors winning Olympic medals in their 70s demonstrate that certain sports reward accumulated expertise, strategic thinking, and technical mastery over physical youth. Sports like shooting, sailing, equestrian, and fencing tend to have older medalists, reflecting disciplines where experience, precision, and mental fortitude matter more than power and speed. These cases challenge conventional narratives about athletic aging and show that Olympic success encompasses diverse career trajectories.

### Medal Distribution: Symmetry & Balance

Count the total number of each medal type awarded across all Olympics in the dataset. Each event is designed to award one Gold, one Silver, and one Bronze, so the three counts are close but not identical. Because each row is an athlete rather than an event, likely reasons for the differences include team sizes that vary between podium places, ties, and sports such as boxing and judo that award two bronzes. The totals also reflect the growing number of Olympic events over time, as newer sports have been added to the program.

In [25]:
# number of medals awarded by type
medal_counts = df['Medal'].value_counts()
print(medal_counts)

Gold      13372
Bronze    13295
Silver    13116
Name: Medal, dtype: int64
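
Since every team member gets a row, athlete-level counts overweight team events. One way to approximate event-level medal counts is to keep a single row per (Games, Event, Medal) combination. The sketch below uses a hypothetical `sample` with an invented three-person team gold. Note that this approach also collapses genuine ties into one medal, so it is an approximation rather than an official tally.

```python
import pandas as pd

# Hypothetical sample: a team event with three gold-medal rows plus an individual event
sample = pd.DataFrame({
    "Games": ["1900 Summer"] * 5 + ["1920 Summer"] * 3,
    "Event": ["Tug-Of-War Men's Tug-Of-War"] * 5
             + ["Swimming Men's 200 metres Breaststroke"] * 3,
    "Medal": ["Gold", "Gold", "Gold", "Silver", "Bronze",
              "Gold", "Silver", "Bronze"],
})

def medals_per_event(data: pd.DataFrame) -> pd.Series:
    """Count medals once per (Games, Event, Medal) instead of once per athlete."""
    placements = data.drop_duplicates(subset=["Games", "Event", "Medal"])
    return placements["Medal"].value_counts()

athlete_rows = sample["Medal"].value_counts()  # athlete-level counts
event_medals = medals_per_event(sample)        # event-level counts

print(athlete_rows)
print(event_medals)
```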


### Event & Sport Diversity: Olympic Program Expansion

**Interpretation:** The breadth of Olympic competition has expanded dramatically over the 120-year timespan in this dataset. Early Olympic Games featured only a handful of sports, primarily drawn from European and Western athletic traditions. Modern Olympics include over 30 sports with hundreds of individual events, reflecting:
- Political evolution of international sports governance
- Globalization of athletic competition and recruitment
- Expansion of Olympic programming to include non-Western sporting traditions
- Addition of sports with mass participation appeal (e.g., skateboarding, sport climbing, and karate, which debuted at Tokyo 2020 and therefore fall outside this dataset)

In [None]:
# Medal counts and unique events/sports
medal_counts = df['Medal'].value_counts()
print("Medal distribution:")
print(medal_counts)

print(f"\nNumber of unique events: {df['Event'].nunique()}")
print(f"Number of unique sports: {df['Sport'].nunique()}")

# Average age of an Olympic medalist
average_age = df[df['Medal'].notnull()]['Age'].mean()
print(f"\nAverage age of Olympic medalists: {average_age:.2f} years")

Number of unique events: 756



### Central Age Tendency: The "Typical" Olympic Medalist

**Statistical Interpretation:** Mean age provides the central tendency around which most medalist ages cluster. However, this average masks the wide spread created by sport-specific age profiles. Disciplines like gymnastics pull the average younger, while shooting and equestrian pull it older. The true picture emerges only when examining sport-specific breakdowns, a reminder that aggregate statistics can conceal important variation underlying seemingly uniform categories.
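
A per-sport breakdown makes this variation visible. The sketch below groups by sport and reports count, mean, and median age. The `sample` values and the `min_count` parameter are illustrative assumptions; on the real `df` a higher `min_count` would filter out sports with too few medalists to summarize reliably.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two young gymnasts, two older shooters, one unknown age
sample = pd.DataFrame({
    "Sport": ["Gymnastics", "Gymnastics", "Shooting", "Shooting", "Swimming"],
    "Age": [16.0, 18.0, 40.0, 50.0, np.nan],
})

def age_summary_by_sport(data: pd.DataFrame, min_count: int = 1) -> pd.DataFrame:
    """Count, mean, and median age per sport, youngest sports first."""
    summary = (
        data.dropna(subset=["Age"])   # rows without an age cannot contribute
            .groupby("Sport")["Age"]
            .agg(["count", "mean", "median"])
    )
    return summary[summary["count"] >= min_count].sort_values("mean")

overall_mean = sample["Age"].mean()
by_sport = age_summary_by_sport(sample)

print(f"Overall mean age: {overall_mean}")
print(by_sport)
```

In this toy data the overall mean sits between the two sports and describes neither of them, which is the effect the paragraph above warns about.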

### Sports Favoring Older Competitors: Experience as Competitive Advantage

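**Insight:** Certain Olympic sports show age profiles far from the overall mean. Among the ten oldest medalists in this dataset, half competed in the Art Competitions (part of the Games from 1912 to 1948), with sailing, shooting, and archery accounting for the rest. Sports such as sailing, shooting, and archery appear among the oldest medalists because: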
- **Technical Mastery:** These sports reward precision and technique developed over decades of practice
- **Physical Persistence:** Experience and mental toughness can compensate for slight physical decline
- **Less Speed-Dependent:** Unlike track sprinting or gymnastics, performance doesn't depend on peak cardiovascular or neuromuscular power
- **Equipment Importance:** Athletes with resources and time accumulate superior equipment, coaching, and tactical experience

This pattern reveals that "athletic prime" is sport-specific—what constitutes peak performance varies dramatically across Olympic disciplines.

In [36]:
# most common sports among the 10 oldest medalists
medalists_with_age = df[df['Medal'].notnull() & df['Age'].notnull()]
top_10_oldest = medalists_with_age.sort_values(by='Age', ascending=False).head(10)
top_10_oldest['Sport'].value_counts()


Art Competitions    5
Sailing             3
Shooting            1
Archery             1
Name: Sport, dtype: int64

### National Performance: Top 10 Medal-Winning Countries

**Research Question:** Which countries have most successfully competed across the entire Olympic era? Medal count reflects the combination of:
- Sustained participation across multiple Olympic cycles
- Economic resources for athlete development
- Geographic/population advantages
- Historical timing of Olympic infrastructure investment
- National sporting culture and priorities

**What Medal Counts Reveal:** The countries at the top of the table are those that have institutionalized elite athlete development pathways, invested in sports infrastructure, and treated international competition as a national objective. These rankings are not random; they reflect policy choices and resource allocation spanning decades. Note that each row in this dataset is one athlete's medal, so a team-event medal counts once per team member, and the totals below weight team sports more heavily than official medal tables do.

In [38]:
# What are the 10 winningest countries in total medal count?
medalists = df[df['Medal'].notnull()]
medal_counts_by_country = medalists['Country'].value_counts()
top_10_countries = medal_counts_by_country.head(10)
print(f"Top 10 winningest countries by total medal count:\n{top_10_countries.to_string()}")


Top 10 winningest countries by total medal count:
USA          5637
Russia       3947
Germany      3756
UK           2068
France       1777
Italy        1637
Sweden       1536
Canada       1352
Australia    1349
Hungary      1135
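
Total counts hide the mix of Gold, Silver, and Bronze behind each country. A cross-tabulation is one way to break the totals down. The sketch below uses `pd.crosstab` on a hypothetical `sample`; the medal counts in it are invented and do not reflect the real table above.

```python
import pandas as pd

# Hypothetical sample; counts are invented for illustration
sample = pd.DataFrame({
    "Country": ["USA", "USA", "USA", "Russia", "Russia", "Germany"],
    "Medal": ["Gold", "Gold", "Silver", "Gold", "Bronze", "Silver"],
})

def medal_table(data: pd.DataFrame, top_n: int = 10) -> pd.DataFrame:
    """Country-by-medal-type counts with a Total column, largest totals first."""
    table = pd.crosstab(data["Country"], data["Medal"])
    # Fix the column order and fill in medal types a sample may lack
    table = table.reindex(columns=["Gold", "Silver", "Bronze"], fill_value=0)
    table["Total"] = table.sum(axis=1)
    return table.sort_values("Total", ascending=False).head(top_n)

table = medal_table(sample)
print(table)
```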


### Sport-Specific Analysis: Trampolining as a Case Study

**Research Context:** Trampolining is a relatively recent Olympic addition, first appearing in the 2000 Sydney Olympics. Examining its medal distribution provides insight into how newer sports distribute success compared to traditional Olympic disciplines. This comparison reveals:
- Whether newer sports show different medal concentration patterns
- If emerging sports have different national competitive bases
- Whether Olympic program expansion creates opportunities for athletic development in underrepresented regions
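
The questions above can be approached by counting medals per country within a single sport and checking how concentrated they are. The sketch below is a minimal version. The helper name `medals_by_country` and every row in `sample` are hypothetical, so the printed countries say nothing about real trampolining results.

```python
import pandas as pd

# Hypothetical sample; countries and medals are invented for illustration
sample = pd.DataFrame({
    "Sport": ["Trampolining"] * 4 + ["Swimming"],
    "Country": ["China", "China", "Canada", "Russia", "USA"],
    "Medal": ["Gold", "Bronze", "Silver", "Gold", "Gold"],
})

def medals_by_country(data: pd.DataFrame, sport: str) -> pd.Series:
    """Medal rows per country for one sport, largest first."""
    subset = data[(data["Sport"] == sport) & data["Medal"].notna()]
    return subset["Country"].value_counts()

tramp = medals_by_country(sample, "Trampolining")
# Share of the sport's medals held by the leading country
top_share = tramp.iloc[0] / tramp.sum()

print(tramp)
print(f"Leading country's share: {top_share:.0%}")
```

Comparing `top_share` between a newer sport and a long-established one would give a rough measure of medal concentration.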

## Part 2: Narrative Analysis & Data Journalism

The exploratory analysis above reveals patterns in athlete demographics, national performance, and sport-specific characteristics. Effective data journalism transforms these patterns into compelling narratives that illuminate broader questions about Olympic competition.

**Key Insights to Explore:**
- Age distribution across Olympic sports reveals surprising patterns about career longevity
- Country-level medal concentration shows which nations have sustained Olympic competitive advantages
- Sport-specific demographics provide context for understanding how different disciplines recruit and develop athletes

Select one compelling pattern from the analysis above to develop into a data-driven story pitch. Your pitch should articulate:
1. **The Angle:** What specific phenomenon or trend warrants investigation?
2. **Supporting Evidence:** Which quantitative findings provide credibility?
3. **Story Value:** Why is this finding interesting or important to general audiences?
4. **Future Direction:** What additional information would strengthen the reporting?

Consider how demographic patterns might reveal untold stories about athletic achievement, national priorities, or the evolution of Olympic competition itself.


### Example Narrative: Age & Athletic Longevity

**Story Angle:** While popular culture portrays Olympic athletes as primarily young adults in their prime physical years, analysis reveals surprising longevity among medalists in specific sports. A handful of competitors have achieved Olympic success well beyond typical athletic prime age, suggesting that sport type, training methodology, and individual resilience significantly influence career trajectories.

**Supporting Evidence:** 
- Medalist ages range from 10 to competitors in their 70s, indicating sport-specific demographic variation
- The ten oldest medalists come from Art Competitions, sailing, shooting, and archery, suggesting that strategic thinking, technical precision, and experience may outweigh youth-dependent physical attributes in certain Olympic disciplines
- A comparison of age profiles across Summer and Winter sports, not yet run in this notebook, could show whether recruitment strategies and sport-specific physical demands differ by season

**Story Value:** This narrative challenges conventional assumptions about athletic aging, offering readers insight into how Olympic success depends on sport selection and individual factors beyond raw physical ability. It provides inspirational examples of athlete resilience while revealing data-driven patterns about which disciplines reward experience.

**Next Steps for Investigation:**
- Interview older medalists about training adaptations and career longevity strategies
- Analyze longitudinal data on athlete age across decades to assess whether average medalist age is shifting
- Compare nations' approaches to athlete development and career extension
- Investigate whether specific coaching or training methodologies enable extended careers in certain sports
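
As a starting point for the longitudinal question above, medalist age can be averaged by decade. This is a rough sketch, assuming `Year` and `Age` columns like those in `df`; the `sample` values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample spanning three decades, with one missing age
sample = pd.DataFrame({
    "Year": [1900, 1908, 1920, 1924, 2012, 2016],
    "Age": [34.0, 30.0, 30.0, 28.0, 24.0, np.nan],
})

def mean_age_by_decade(data: pd.DataFrame) -> pd.Series:
    """Mean medalist age per decade; NaN ages are skipped by mean()."""
    decade = (data["Year"] // 10) * 10   # e.g. 1908 -> 1900
    return data.groupby(decade)["Age"].mean().rename_axis("Decade")

age_trend = mean_age_by_decade(sample)
print(age_trend)
```

On the full data it would be worth splitting this by sport as well, since a changing mix of events can shift the average on its own.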


## Part 3: Advanced Analysis & Investigation

Deepen your understanding through targeted analysis of specific dimensions. These investigations complement the foundational analysis and may reveal additional angles for storytelling or data-driven journalism.


In [None]:
# Gold medals awarded to the United States
# Note: the Country column labels the United States as 'USA' (see the top-10 table),
# so filtering on 'United States' matches no rows and reports 0.
us_gold_medals = df[(df['Medal'] == 'Gold') & (df['Country'] == 'USA')]
num_us_gold_medals = us_gold_medals.shape[0]
print(f"Number of gold medals awarded to the United States: {num_us_gold_medals}")

# Trampolining medals
print("\nTrampolining medal distribution:")
trampolining_medals = df[(df['Sport'] == 'Trampolining') & (df['Medal'].notnull())]
print(trampolining_medals['Medal'].value_counts())


### Temporal Analysis: Olympic Games Timeline

**Analytical Purpose:** Listing every Olympic Games in chronological order lays the groundwork for temporal analysis. Summer Games are held roughly every four years. Winter Games began in 1924 and shared a year with the Summer Games through 1992; since 1994 they have been offset by two years. Examining this timeline reveals:
- Periods of Olympic continuity vs. disruptions (notably WWI and WWII cancellations)
- Evolution of Olympic programming and expansion of participating nations
- Temporal shifts in athlete demographics and national competitive advantage

In [47]:
# Olympic Games in the dataset
df['Year'] = df['Games'].str.extract(r'(\d{4})').astype(int)
season_order = {'Summer': 0, 'Winter': 1}
df['SeasonOrder'] = df['Season'].map(season_order)

unique_games_df = df[['Games', 'Year', 'SeasonOrder']].drop_duplicates()

sorted_games = unique_games_df.sort_values(by=['Year', 'SeasonOrder'], ascending=[False, False])

print("Olympic Games in dataset starting with most recent:")
for game in sorted_games['Games']:
    print(game)

Olympic Games in dataset starting with most recent:
2016 Summer
2014 Winter
2012 Summer
2010 Winter
2008 Summer
2006 Winter
2004 Summer
2002 Winter
2000 Summer
1998 Winter
1996 Summer
1994 Winter
1992 Winter
1992 Summer
1988 Winter
1988 Summer
1984 Winter
1984 Summer
1980 Winter
1980 Summer
1976 Winter
1976 Summer
1972 Winter
1972 Summer
1968 Winter
1968 Summer
1964 Winter
1964 Summer
1960 Winter
1960 Summer
1956 Winter
1956 Summer
1952 Winter
1952 Summer
1948 Winter
1948 Summer
1936 Winter
1936 Summer
1932 Winter
1932 Summer
1928 Winter
1928 Summer
1924 Winter
1924 Summer
1920 Summer
1912 Summer
1908 Summer
1906 Summer
1904 Summer
1900 Summer
1896 Summer
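
The disruptions in the timeline can be found programmatically by looking for intervals longer than the usual four years. The sketch below hard-codes the Summer years from the listing above. The helper `find_gaps` is a hypothetical name, and it could equally be fed `df.loc[df['Season'] == 'Summer', 'Year']`.

```python
# Summer Games years as they appear in the listing above
summer_years = (
    [1896, 1900, 1904, 1906, 1908, 1912]
    + list(range(1920, 1937, 4))   # 1920 through 1936
    + list(range(1948, 2017, 4))   # 1948 through 2016
)

def find_gaps(years, expected_interval=4):
    """Return (previous, next) year pairs separated by more than the expected interval."""
    ordered = sorted(set(years))
    return [
        (prev, curr)
        for prev, curr in zip(ordered, ordered[1:])
        if curr - prev > expected_interval
    ]

gaps = find_gaps(summer_years)
print(gaps)  # [(1912, 1920), (1936, 1948)]
```

The two gaps correspond to the Games cancelled during the World Wars (1916, 1940, and 1944).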


### Physical Characteristics: Comparative Profiles Across Seasons

**Analytical Question:** Do Summer and Winter Olympic medalists show distinct physical profiles? Height and weight variations across seasons reveal sport-specific physical requirements:

**Winter Olympic Athletes** typically include:
- Alpine skiers and snowboarders with muscular builds for technique and power
- Speed skaters with lean, efficient builds for cardiovascular demands
- Curlers and biathletes with varied physiques (technical sports showing less physical specialization)
- Figure skaters and skiers with sport-specific morphologies

**Summer Olympic Athletes** show broader variation due to sport diversity, including:
- Swimmers and track athletes with lean, muscle-efficient builds
- Weightlifters and throwers with substantial muscle mass
- Basketball and volleyball players with height as competitive advantage
- Gymnasts, divers, and martial artists with lean, flexible builds

In [None]:
# Physical characteristics: average height and weight of medalists
# in the most recent Winter and Summer Olympics

def latest_games_mean(data, season, column):
    """Mean of `column` for medalists at the most recent Games of `season` that has data."""
    subset = data[(data['Season'] == season) & data['Medal'].notnull() & data[column].notnull()]
    latest_year = subset['Year'].max()
    return subset.loc[subset['Year'] == latest_year, column].mean()

def cm_to_feet_inches(cm):
    """Convert centimetres to (feet, inches), rounded to the nearest inch."""
    # Round the total first so that 71.6 in becomes 6' 0" rather than 5' 12"
    return divmod(int(round(cm / 2.54)), 12)

for season in ['Winter', 'Summer']:
    feet, inches = cm_to_feet_inches(latest_games_mean(df, season, 'Height'))
    avg_weight_kg = latest_games_mean(df, season, 'Weight')
    print(f"Average medalist height in the most recent {season} Olympics: {feet}' {inches}\"")
    print(f"Average medalist weight in the most recent {season} Olympics: {round(avg_weight_kg, 2)} kg\n")

Average medalist weight in the most recent Winter Olympics: 72.21 kg
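
Season-level averages mix men and women, whose height and weight distributions differ, so a shift in the sex ratio can look like a seasonal difference. Grouping by both Season and Sex is one way to control for that. The following is a minimal sketch on a hypothetical `sample`; the measurements are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; one Winter row has no recorded height
sample = pd.DataFrame({
    "Season": ["Summer", "Summer", "Winter", "Winter", "Winter"],
    "Sex": ["M", "F", "M", "M", "F"],
    "Height": [180.0, 170.0, 184.0, 176.0, np.nan],
    "Weight": [80.0, 60.0, 85.0, 75.0, 58.0],
})

# Mean height (cm) and weight (kg) for each Season/Sex combination
profile = (
    sample.groupby(["Season", "Sex"])[["Height", "Weight"]]
          .mean()
          .round(1)
)
print(profile)
```

A group whose values are all missing shows NaN instead of a misleading zero, as with Winter/F height here.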


### Physical Characteristics: Summer Olympics Height Profile

Analyze average medalist height in the most recent Summer Olympics. Compare Summer and Winter profiles to understand how sport-specific physical requirements vary across Olympic disciplines.

### Visualization: National Medal Performance

**Storytelling Through Data:** Visualizing country-level medal counts transforms raw numbers into a compelling narrative about global athletic competition. Bar charts reveal:
- **Concentration Patterns:** A small number of countries consistently dominate Olympic competition, reflecting sustained investment in athletic infrastructure
- **Historical Dominance:** Countries with lengthy histories of Olympic participation accumulate larger total medal counts
- **National Strategies:** Different countries prioritize different sports, creating distinct competitive profiles within the broader Olympic system

The visualization makes these patterns immediately apparent to viewers who might otherwise miss the statistical concentration evident in summary tables.

In [None]:
# Import plotly express library and create visualization
import plotly.express as px

# Assign top 10 winningest countries table to a variable
top10_countries = df[df['Medal'].notnull()].groupby('Country')['Medal'].count().sort_values(ascending=False).head(10).reset_index()
top10_countries.columns = ['Country', 'Medal Count']

# Visualize the table as a bar chart (National Medal Performance)
fig = px.bar(top10_countries,
             x='Country',
             y='Medal Count',
             title='Top 10 Countries by Total Olympic Medals',
             text='Medal Count',
             color='Country')

fig.show()