# Analyzing Olympics Data with SQL and Python
In this project, I'll use Python and SQL together to analyze Olympic data: gathering data from more than one source, merging the datasets, and running exploratory analyses.

Starting with an Olympics dataset, I'll enrich it with country-level details from a SQL database and explore the combined data with pandas. I'll then build interactive visualizations with plotly to communicate the key findings.

## Loading Olympics Data and Importing Essential Libraries


In [3]:
# Import libraries
import pandas as pd
import plotly.express as px

# Import the data
olympics = pd.read_csv("athlete_events.csv")

# Preview the DataFrame
pd.set_option('display.max_columns', None)
print(olympics.head())

To inspect data types and the number of non-null rows per column, we can call the .info() method. It gives a succinct overview of the dataset's structure: column names, data types, and non-null counts, from which the presence of missing values can be read off.

In [4]:
# Inspect the DataFrame
olympics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   id      271116 non-null  int64  
 1   name    271116 non-null  object 
 2   sex     271116 non-null  object 
 3   age     261642 non-null  float64
 4   height  210945 non-null  float64
 5   weight  208241 non-null  float64
 6   team    271116 non-null  object 
 7   noc     271116 non-null  object 
 8   games   271116 non-null  object 
 9   year    271116 non-null  int64  
 10  season  271116 non-null  object 
 11  city    271116 non-null  object 
 12  sport   271116 non-null  object 
 13  event   271116 non-null  object 
 14  medal   39783 non-null   object 
dtypes: float64(3), int64(2), object(10)
memory usage: 31.0+ MB


To count the missing values per column, I'll chain .isna() with .sum(). This shows how complete each column is and highlights columns that may need imputation or further investigation.

In [5]:
# Check missing values
olympics.isna().sum()

id             0
name           0
sex            0
age         9474
height     60171
weight     62875
team           0
noc            0
games          0
year           0
season         0
city           0
sport          0
event          0
medal     231333
dtype: int64

The missing values in the medal column are likely due to the inclusion of all competitors, not just medal winners, which is consistent with the nature of the dataset. Additionally, it's noticeable that there are numerous missing entries in the height, weight, and age columns. This could be attributed to the absence of such data in the early years of the Olympics. However, for the current study, these missing values are not of primary concern.
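One way to check the guess about early Games is to compute the share of missing values per year. The sketch below uses a small made-up DataFrame named `toy` (an assumption, not the real data); on the real data the same expression could be applied to `olympics`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the olympics DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    "year":   [1900, 1900, 1900, 2016, 2016, 2016],
    "height": [np.nan, np.nan, 175.0, 180.0, 168.0, np.nan],
    "weight": [np.nan, np.nan, np.nan, 80.0, 60.0, 72.0],
})

# isna() yields booleans; the mean of booleans per group is the missing share
missing_share = toy[["height", "weight"]].isna().groupby(toy["year"]).mean()
print(missing_share)
```

If the share falls steadily over the years on the real data, that would support the explanation above; if not, the gaps have another cause.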

During data exploration, I noticed that some team names contain hyphens and forward slashes, for example "Denmark/Sweden" in the year 1900. To understand this better, let's examine the unique values in this column.

To inspect the unique team names and their frequencies, we'll utilize an interactive table viewer for a more comprehensive analysis.

In [6]:
# Inspect the team column
olympics["team"].value_counts().to_frame()

team,count
United States,17847
France,11988
Great Britain,11404
Italy,10260
Germany,9326
...,...
Briar,1
Hannover,1
Nan-2,1
Brentina,1


The team column is inconsistent: some entries list multiple countries separated by forward slashes, and others carry a hyphenated suffix. To address this, we'll use the .str.split() method with a pattern that matches slashes and hyphens, and keep only the first piece. For example, "Denmark/Sweden" becomes "Denmark", giving us one consistent name per team.


In [8]:
# Split the team column on forward slashes and hyphens
olympics["team_clean"] = olympics["team"].str.split("[/-]").str[0]

# Preview the new column
olympics["team_clean"].unique()

array(['China', 'Denmark', 'Netherlands', 'United States', 'Finland',
       'Norway', 'Romania', 'Estonia', 'France', 'Taifun', 'Morocco',
       'Spain', 'Egypt', 'Iran', 'Bulgaria', 'Italy', 'Chad',
       'Azerbaijan', 'Sudan', 'Russia', 'Argentina', 'Cuba', 'Belarus',
       'Greece', 'Cameroon', 'Turkey', 'Chile', 'Mexico', 'Soviet Union',
       'Nicaragua', 'Hungary', 'Nigeria', 'Algeria', 'Kuwait', 'Bahrain',
       'Pakistan', 'Iraq', 'United Arab Republic', 'Lebanon', 'Qatar',
       'Malaysia', 'Germany', 'Thessalonki', 'Canada', 'Ireland',
       'Australia', 'South Africa', 'Eritrea', 'Tanzania', 'Jordan',
       'Tunisia', 'Libya', 'Belgium', 'Djibouti', 'Palestine', 'Comoros',
       'Kazakhstan', 'Brunei', 'India', 'Saudi Arabia', 'Syria',
       'Maldives', 'Ethiopia', 'United Arab Emirates', 'North Yemen',
       'Indonesia', 'Philippines', 'Singapore', 'Uzbekistan',
       'Kyrgyzstan', 'Tajikistan', 'Unified Team', 'Japan',
       'Congo (Brazzaville)', 'Switzerlan
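To see what the split does on individual values, here is a minimal sketch on a hand-written Series. The sample names are chosen for illustration, and `regex=True` is an explicit spelling of what the pattern already implies (it assumes pandas 1.4 or later).

```python
import pandas as pd

# Hand-picked examples: a slash, a plain name, a numbered suffix, a hyphenated name
teams = pd.Series(["Denmark/Sweden", "China", "United States-1", "Guinea-Bissau"])

# "[/-]" is a character class matching either "/" or "-"; keep the first piece
first_part = teams.str.split(r"[/-]", regex=True).str[0]
print(first_part.tolist())
# ['Denmark', 'China', 'United States', 'Guinea']
```

Note the last value: any name that legitimately contains a hyphen is shortened as well, so hyphenated names may deserve a manual check before merging on this column.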

## Incorporating Additional Data

To enrich our analysis, we'll fetch data from a MariaDB database containing comprehensive information on world nations.

The retrieved data will be stored as a pandas DataFrame named nations_data.

In [11]:
SELECT name AS country,
       year,
       population
FROM countries
INNER JOIN country_stats
USING (country_id);

,country,year,population
0,Aruba,1986,62644
1,Aruba,1987,61833
2,Aruba,1988,61079
3,Aruba,1989,61032
4,Aruba,1990,62149
...,...,...,...
9509,Zimbabwe,2014,13586681
9510,Zimbabwe,2015,13814629
9511,Zimbabwe,2016,14030390
9512,Zimbabwe,2017,14236745
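Outside a notebook with a built-in SQL cell, the same query can be loaded into pandas with `pd.read_sql_query`. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for the MariaDB instance, recreates two tiny tables with the column names used in the query, and copies a few rows from the output above. The table layout and the name `nations_sample` are assumptions.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the MariaDB instance (an assumption)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE countries (country_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE country_stats (country_id INTEGER, year INTEGER, population INTEGER);
    INSERT INTO countries VALUES (1, 'Aruba'), (2, 'Zimbabwe');
    INSERT INTO country_stats VALUES
        (1, 1986, 62644), (1, 1987, 61833), (2, 2017, 14236745);
""")

query = """
    SELECT name AS country, year, population
    FROM countries
    INNER JOIN country_stats USING (country_id);
"""

# Run the query and collect the result as a DataFrame
nations_sample = pd.read_sql_query(query, conn)
conn.close()
print(nations_sample)
```

For a real MariaDB server, the connection object would come from a database driver instead of `sqlite3`; the query itself would stay the same.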


Now that we have country data available, we can enhance our analysis by merging it with the Olympics data. We'll employ the .merge() method to combine the two DataFrames based on the country and year columns.

For this merge, we'll use a "left" join so that all rows from the olympics DataFrame are retained, since some teams may not be present in the nations_data DataFrame.

In [15]:
# Perform a left join between the two DataFrames
olympics_full = olympics.merge(
    nations_data,
    left_on=["team_clean", "year"],
    right_on=["country", "year"],
    how="left",
)

# Preview our data
pd.set_option('display.max_columns', None)
print(olympics_full.head())

   id                      name sex   age  height  weight            team  \
0   1                 A Dijiang   M  24.0   180.0    80.0           China   
1   2                  A Lamusi   M  23.0   170.0    60.0           China   
2   3       Gunnar Nielsen Aaby   M  24.0     NaN     NaN         Denmark   
3   4      Edgar Lindenau Aabye   M  34.0     NaN     NaN  Denmark/Sweden   
4   5  Christine Jacoba Aaftink   F  21.0   185.0    82.0     Netherlands   

   noc        games  year  season       city          sport  \
0  CHN  1992 Summer  1992  Summer  Barcelona     Basketball   
1  CHN  2012 Summer  2012  Summer     London           Judo   
2  DEN  1920 Summer  1920  Summer  Antwerpen       Football   
3  DEN  1900 Summer  1900  Summer      Paris     Tug-Of-War   
4  NED  1988 Winter  1988  Winter    Calgary  Speed Skating   

                              event medal   team_clean      country  \
0       Basketball Men's Basketball   NaN        China        China   
1      Judo Men'
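Because a left join keeps unmatched rows silently, it can help to count how many Olympic rows found no population. A minimal sketch, using toy DataFrames (`olympics_toy`, `nations_toy`, and the population figure are made up) and the `indicator=True` option of `.merge()`:

```python
import pandas as pd

# Toy stand-ins for the two DataFrames (values are illustrative only)
olympics_toy = pd.DataFrame({
    "team_clean": ["China", "Denmark", "Taifun"],
    "year": [2012, 1920, 1920],
})
nations_toy = pd.DataFrame({
    "country": ["China"],
    "year": [2012],
    "population": [1_350_000_000],  # made-up figure
})

# indicator=True adds a "_merge" column saying where each row came from
merged = olympics_toy.merge(
    nations_toy,
    left_on=["team_clean", "year"],
    right_on=["country", "year"],
    how="left",
    indicator=True,
)

# Rows present only on the left side had no matching country and year
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["team_clean", "year"]])
```

Here the unmatched rows are a year outside the population table's range (1920) and a team name that is not a country ("Taifun"), both of which also occur in the real data.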

## Which Countries Have the Most Gold Medals?

Let's start by computing and visualizing the tally of gold medals earned by athletes from each country. We'll use the .query() method to filter rows where the medal is "Gold", then .groupby() to group the data by our team_clean variable, and finally .count() to get the number of rows per team.

A subsequent line will sort the values of our query, ensuring that the top-performing teams are prominently displayed.

In [21]:
# Count the number of gold medals earned by a country
gold_count = olympics_full.query("medal == 'Gold'").groupby("team_clean", as_index=False)["medal"].count()

# Sort the values
gold_count.sort_values(by="medal", ascending = False, inplace = True)

# Preview our count
pd.set_option('display.max_columns', None)
print(gold_count.head())

        team_clean  medal
194  United States   2529
175   Soviet Union   1080
74         Germany    721
93           Italy    549
75   Great Britain    545
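Since each row is one athlete in one event, the count above is a count of gold medals handed to athletes, so a team event contributes one row per team member. The sketch below contrasts that with a per-event count on a made-up DataFrame (`toy` and its values are assumptions), using `.drop_duplicates()` before grouping.

```python
import pandas as pd

# Three basketball players share one team gold; one fencer wins an individual gold
toy = pd.DataFrame({
    "team_clean": ["United States"] * 3 + ["Italy"],
    "games": ["2016 Summer"] * 4,
    "event": ["Basketball Men's Basketball"] * 3 + ["Fencing Men's Foil, Individual"],
    "medal": ["Gold"] * 4,
})

golds = toy.query("medal == 'Gold'")

# Counting rows: one gold per athlete
per_athlete = golds.groupby("team_clean", as_index=False)["medal"].count()

# Counting events: collapse each team's event at a given Games to a single row first
per_event = (
    golds.drop_duplicates(subset=["team_clean", "games", "event"])
    .groupby("team_clean", as_index=False)["medal"]
    .count()
)

print(per_athlete)
print(per_event)
```

Which count is appropriate depends on the question; the per-athlete tally favors countries that are strong in team sports.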


We can visualize this data using Plotly. Let's generate a choropleth map (a world map) in which each country's color corresponds to its gold medal count!

In [22]:
# Create choropleth map of gold medal counts
fig = px.choropleth(
    gold_count,
    locations="team_clean",
    locationmode="country names",
    color="medal",
    labels={"team_clean": "Country", "medal": "Medal Count"},
    title="Number of Gold Medals by Country",
)

fig.show()

 
![Screenshot 2024-04-28 at 9.42.31 PM](Screenshot%202024-04-28%20at%209.42.31%20PM.png)


## How Has the Number of Sports Evolved Over Time?

Another intriguing question to explore is whether the diversity of individual sports has expanded across different time periods.

To investigate this, we'll group the data by year and season (i.e., summer or winter), and then determine the count of unique sports using .nunique().

In [24]:
# Group by year and season and count the number of unique values
sport_count = olympics_full.groupby(["year", "season"], as_index=False)["sport"].nunique()

# Preview the DataFrame
pd.set_option('display.max_columns', None)
print(sport_count.head(15))

    year  season  sport
0   1896  Summer      9
1   1900  Summer     20
2   1904  Summer     18
3   1906  Summer     13
4   1908  Summer     24
5   1912  Summer     17
6   1920  Summer     25
7   1924  Summer     20
8   1924  Winter     10
9   1928  Summer     17
10  1928  Winter      8
11  1932  Summer     18
12  1932  Winter      7
13  1936  Summer     24
14  1936  Winter      8
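For a side-by-side comparison of the two seasons, the long table can be reshaped to one row per year. A small sketch with `.pivot()`, using four rows copied from the preview above (the names `sport_count_sample` and `wide` are introduced here for illustration):

```python
import pandas as pd

# A few rows copied from the preview above
sport_count_sample = pd.DataFrame({
    "year": [1924, 1924, 1928, 1928],
    "season": ["Summer", "Winter", "Summer", "Winter"],
    "sport": [20, 10, 17, 8],
})

# One row per year, one column per season
wide = sport_count_sample.pivot(index="year", columns="season", values="sport")
print(wide)
```

Years with only one season would show NaN in the other column, which makes the pre-1924 Summer-only Games easy to spot.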


Let's visualize this data using a line plot, with one line per season.

In [27]:
# Create a line plot for Summer and Winter Olympics
fig = px.line(
    sport_count,
    x="year",
    y="sport",
    color="season",
    labels={"year":"Year", "sport":"Number of Sports", "season":"Season"},
    title="Count of Distinct Sports by Year and Season"
)

fig.show()

![Screenshot 2024-04-28 at 9.48.10 PM](Screenshot%202024-04-28%20at%209.48.10%20PM.png)


## Identifying Countries with the Highest Medal Count per 10 Million People in 2016

To conclude, let's use the population data retrieved from our SQL database! One interesting approach is to recreate a variation of a chart from Business Insider showing the number of medals per capita.

In team events, every team member has their own row, so a single team medal appears several times. To count it once, we'll first group by team, event, medal, and population and take the .first() medal in each group. This gives us one row per country, event, and medal.

In [28]:
# Calculate event medals
event_medals = olympics_full.query("year == 2016")\
      .groupby(["team_clean", "event", "medal", "population"], as_index=False)["medal"].first()


# Preview the DataFrame
pd.set_option('display.max_columns', None)
print(event_medals.head(15))

   team_clean                                       event  population   medal
0     Algeria                Athletics Men's 1,500 metres  40551404.0  Silver
1     Algeria                  Athletics Men's 800 metres  40551404.0  Silver
2   Argentina                         Hockey Men's Hockey  43590368.0    Gold
3   Argentina              Judo Women's Extra-Lightweight  43590368.0    Gold
4   Argentina                     Sailing Mixed Multihull  43590368.0    Gold
5   Argentina                        Tennis Men's Singles  43590368.0  Silver
6     Armenia             Weightlifting Men's Heavyweight   2936146.0  Silver
7     Armenia       Weightlifting Men's Super-Heavyweight   2936146.0  Silver
8     Armenia    Wrestling Men's Heavyweight, Greco-Roman   2936146.0    Gold
9     Armenia   Wrestling Men's Welterweight, Greco-Roman   2936146.0  Silver
10  Australia                          Archery Men's Team  24190907.0  Bronze
11  Australia          Athletics Men's 20 kilometres Walk  24190
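The same one-row-per-event result can also be reached without a groupby. The following is an alternative sketch, not the method used above: it drops rows without a medal or population (groupby skips missing keys by default) and then removes duplicates. The toy rows reuse values from the preview, with the hockey gold repeated to mimic two team members.

```python
import pandas as pd

# Toy 2016 rows; the hockey gold appears twice, once per team member
toy_2016 = pd.DataFrame({
    "team_clean": ["Argentina", "Argentina", "Argentina", "Algeria"],
    "event": [
        "Hockey Men's Hockey",
        "Hockey Men's Hockey",
        "Tennis Men's Singles",
        "Athletics Men's 800 metres",
    ],
    "medal": ["Gold", "Gold", "Silver", "Silver"],
    "population": [43590368.0, 43590368.0, 43590368.0, 40551404.0],
})

# Keep medal rows with a known population, then one row per team, event, and medal
event_medals_alt = (
    toy_2016.dropna(subset=["medal", "population"])
    .drop_duplicates(subset=["team_clean", "event", "medal"])
    .reset_index(drop=True)
)
print(event_medals_alt)
```

This form avoids selecting a column that is also a grouping key, which some pandas versions may handle differently.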

With the event data in hand, we can run an aggregation similar to the earlier one and compute the ratio of medal count to population. Expressing the population in units of 10 million people makes the result easier to read: the number of medals per 10 million inhabitants.

In [32]:
# Group by the team and population
medal_counts = event_medals.groupby(["team_clean", "population"], as_index=False)["medal"].count()

# Calculate the number of medals per 10 million people
medal_counts["per_10m"] = medal_counts["medal"] / (medal_counts["population"] / 10_000_000)

# Sort values and take the top 20 countries
top_countries = medal_counts.sort_values(by="per_10m", ascending = False).head(20)

# Preview the DataFrame
pd.set_option('display.max_columns', None)
print(top_countries.head(20))

        team_clean  population  medal    per_10m
29         Grenada    110261.0      1  90.693899
6          Bahamas    377931.0      2  52.919713
47     New Zealand   4693200.0     18  38.353362
37         Jamaica   2906238.0     11  37.849619
19         Denmark   5728010.0     15  26.187105
16         Croatia   4174349.0     10  23.955831
58        Slovenia   2065042.0      4  19.370066
26         Georgia   3727505.0      7  18.779318
5       Azerbaijan   9757812.0     18  18.446758
30         Hungary   9814023.0     15  15.284252
7          Bahrain   1425791.0      2  14.027301
41       Lithuania   2868231.0      4  13.945878
2          Armenia   2936146.0      4  13.623301
3        Australia  24190907.0     29  11.987975
46     Netherlands  17030314.0     19  11.156576
62          Sweden   9923085.0     11  11.085262
17            Cuba  11335109.0     11   9.704362
8          Belarus   9501534.0      9   9.472155
18  Czech Republic  10566332.0     10   9.464022
63     Switzerland  
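As a quick sanity check of the per_10m formula, the top row can be recomputed by hand. The sketch below plugs in Grenada's figures from the output above (1 medal, population 110,261).

```python
# Grenada's figures, taken from the table above
medals = 1
population = 110_261

# Medals divided by population expressed in units of 10 million people
per_10m = medals / (population / 10_000_000)
print(round(per_10m, 2))
# 90.69
```

A single medal is enough to put a very small country at the top, which is worth keeping in mind when reading the ranking.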

Now, we can employ a bar chart to visualize the leading countries based on medals per capita!

In [36]:
# Create a bar chart of medals per 10 million people
fig = px.bar(
    top_countries,
    x="team_clean",
    y="per_10m",
    labels={"team_clean": "Country", "per_10m": "Medals per 10m"},
    title="Number of Medals per 10 Million Population",
    hover_data=["population", "medal"],
)

fig.show()

![Screenshot 2024-04-28 at 10.05.35 PM](Screenshot%202024-04-28%20at%2010.05.35%20PM.png)
