## 120 Years of Olympic History  

1. Data cleaning and processing
2. Statistical overview of the data + visualization
3. Data wrangling and visualization:
    * Countries: athletes / total medals ratio
    * Athletes total participation / total medals ratio
4. Choropleth - countries with most medals

In [None]:
# importing relevant python libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import folium

In [None]:
# assigning csv files to their respective variables

athlete_events = pd.read_csv('../input/120-years-of-olympic-history/athlete_events.csv')
country_def = pd.read_csv('../input/120-years-of-olympic-history/country_definitions.csv')

Let's get an overview of our dataset

In [None]:
athlete_events.head(5)

We can see that we have some null values that we will need to handle properly.

In [None]:
athlete_events.shape # we have 271,116 rows in our dataset and 15 columns

In [None]:
athlete_events.isnull().sum()

Overall, the data looks fine. We have missing values in the Age, Height, Weight, and Medal columns. The missing values in Age, Weight, and Height are a small proportion of the whole data set, so I feel comfortable filling them with their respective averages. A null value in the Medal column means that the athlete did not win a medal in that event, so we will fill those with "No Medal".
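
Before filling anything, it can help to put a number on "proportionally small". Below is a minimal sketch on a made-up toy table (the `toy` frame and its values are assumptions for illustration, not the real data); the same expression could be applied to `athlete_events`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for athlete_events; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [24.0, np.nan, 31.0, 27.0],
    "Height": [180.0, 175.0, np.nan, np.nan],
    "Medal": ["Gold", np.nan, np.nan, np.nan],
})

# isnull() yields booleans, so mean() gives the fraction of missing rows per column
missing_share = toy.isnull().mean().mul(100).round(1)
print(missing_share)
```

If a column turns out to be missing in a large share of rows, filling it with a single average deserves a second look.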

In [None]:
# NaN values are filled with string "No Medal"
athlete_events['Medal'] = athlete_events['Medal'].fillna('No Medal') 

# NaN values in Age are filled with the average age
athlete_events['Age'] = athlete_events['Age'].fillna(athlete_events['Age'].mean())

# NaN values in Height are filled with the average height
athlete_events['Height'] = athlete_events['Height'].fillna(athlete_events['Height'].mean())

# NaN values in Weight are filled with the average weight
athlete_events['Weight'] = athlete_events['Weight'].fillna(athlete_events['Weight'].mean())

In [None]:
# data types seem fine
athlete_events.dtypes

Now that we have filled in the missing data, let's get some statistical info on a few of the columns and get to know our athletes a little better.

In [None]:
# let's get some statistical information
athlete_events[['Age', 'Height', 'Weight']].describe() 

Let's visualize it using histograms to get a better look at how the values are distributed.

In [None]:
athlete_events['Age'].plot(kind='hist', figsize=(5,5))

In [None]:
athlete_events['Height'].plot(kind='hist', figsize=(5,5), color='Green')

In [None]:
athlete_events['Weight'].plot(kind='hist', figsize=(5,5), color='Orange')
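
One thing to keep in mind when reading these histograms: because the missing values were filled with the column mean, the bin that contains the mean gets every filled row on top of the real observations. Here is a small sketch with made-up heights (the numbers and bin edges are assumptions chosen for illustration) that counts values per bin before and after filling.

```python
import numpy as np
import pandas as pd

# Made-up heights with two missing entries, for illustration only.
heights = pd.Series([160.0, 170.0, 180.0, 190.0, np.nan, np.nan])

# Same strategy as above: fill missing values with the mean of the observed ones
filled = heights.fillna(heights.mean())

# Count values per bin before and after filling, using the same bin edges
bins = [150, 165, 185, 200]
before, _ = np.histogram(heights.dropna(), bins=bins)
after, _ = np.histogram(filled, bins=bins)
print(before)
print(after)
```

Only the middle bin grows, so a tall bar around the average may partly reflect the fill rather than the athletes themselves.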

Now comes the interesting part - wrangling and visualizing some data

**Ratios**: for every country and every athlete, what is the ratio between participation and medals? For countries, we look at how many athletes each country sent to the Olympics and how many medals they won. The same goes for individual athletes: how many Olympic appearances did they make, and how many medals did they win? Let's calculate the ratios and see what comes out.

**We will start with countries**

In [None]:
# making a copy of the data specifically for this task, with the columns I need
spec_table = athlete_events[['Sex','NOC','Year','Medal']] 

# creating two tables, each grouped by country: one counts the country's total participants, the other its total medals
total_participants = spec_table[['NOC', 'Sex']].groupby('NOC').count()
total_medals = spec_table.loc[spec_table['Medal'] != 'No Medal']
total_medals = total_medals[['NOC', 'Medal']].groupby('NOC').count()

In [None]:
# joining the two tables I have created based on the NOC column
joined_tble = pd.merge(total_participants,
                       total_medals,
                       how="inner",
                       on='NOC') 

# calculating the ratio (medals per 100 participants) and adding the result as new column "Ratio"
joined_tble['Ratio'] = joined_tble['Medal'] / joined_tble['Sex'] * 100

# sorting the values from high ratio to low (assigned back, otherwise the sorted result is discarded)
joined_tble = joined_tble.sort_values(by='Ratio', ascending=False)

# final arrangements to the table: taking the 15 countries with the highest ratio and preparing for visualization
viz_table = joined_tble['Ratio'].astype(int).to_frame().sort_values(by='Ratio', ascending=False).head(15)

viz_table = pd.merge(viz_table,
                     country_def, # the table that translates 'NOC' to full country name
                     how="inner",
                     on='NOC')

viz_table = viz_table.loc[viz_table['NOC'] != 'URS'] # represents the former Soviet Union, let's remove it for convenience
viz_table = viz_table.loc[viz_table['NOC'] != 'GDR'] # represents East Germany, let's remove it for convenience
viz_table = viz_table.loc[viz_table['NOC'] != 'EUN'] # represents the Unified Team of former Soviet republics (1992), let's remove it for convenience

viz_table.set_index('region', inplace=True)
viz_table = viz_table['Ratio']
viz_table.sort_values(ascending=True).plot(kind='barh', figsize=(10,10))
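
As an aside, the same country ratio can be sketched with a single `groupby` and named aggregation instead of two tables and a merge. This is a minimal sketch on made-up rows (the `toy` frame and the column names `won`, `entries`, and `medals` are my own, hypothetical choices), not a replacement for the cell above.

```python
import pandas as pd

# Toy stand-in for athlete_events after the 'No Medal' fill; values are made up.
toy = pd.DataFrame({
    "NOC":   ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB"],
    "Medal": ["Gold", "No Medal", "Silver", "No Medal", "No Medal", "No Medal"],
})

ratios = (
    toy.assign(won=toy["Medal"].ne("No Medal"))   # True where a medal was won
       .groupby("NOC")
       .agg(entries=("won", "size"),              # number of rows per country
            medals=("won", "sum"))                # number of medal rows per country
)
ratios["Ratio"] = ratios["medals"] / ratios["entries"] * 100
print(ratios)
```

Two things to note: unlike the inner merge, this keeps countries that never won a medal (with a ratio of 0), and in both versions a "participant" is really one row of the table, so an athlete who entered several events is counted several times.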

**Next, the actual athletes**

In [None]:
# filtering the original table to only get the data we want, then aggregating the total number of medals they won
medalists = athlete_events[['Name', 'NOC', 'Medal']].loc[athlete_events['Medal'] != 'No Medal'].groupby(['Name', 'NOC'])['Medal'].count().to_frame().sort_values(by='Medal', ascending=False)

# another table that contains the athletes and their total Olympic appearances
appearances = athlete_events[['Name','ID']].groupby('Name').count().sort_values(by='ID', ascending=False)

# merging the two tables together
joined_totals = pd.merge(medalists,
                         appearances,
                         how="inner",
                         on='Name').rename(columns={'ID': 'Total Appearances', 'Medal':'Total Medals'})

# calculating and creating the ratio column
joined_totals['Ratio'] = joined_totals['Total Medals']/joined_totals['Total Appearances']*100

# first, sort the table by the ratio. Then keep only the 'heavy guns': athletes with more than 10 medals, which leaves out athletes whose 100% ratio comes from a single appearance
joined_totals = joined_totals.sort_values(by=['Ratio'], ascending=True)
viz_table2 = joined_totals.loc[joined_totals['Total Medals'] > 10]

# let's plot the best athletes in terms of their ratio
viz_table2['Ratio'].plot(kind='barh', figsize=(10, 10))

In [None]:
# let's also plot the athletes in terms of their total medals
viz_table2['Total Medals'].sort_values(ascending=True).plot(kind='barh', figsize=(10,10))
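
A caveat on the athlete tables: appearances are counted per `Name`, so two different athletes who happen to share a name would be pooled together. A possible alternative, assuming the `ID` column identifies one athlete uniquely (I have not verified this against the data), is to group by `ID` as well. The rows below are made up for illustration.

```python
import pandas as pd

# Toy rows: two different athletes share a name but have distinct IDs (made-up data).
toy = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2],
    "Name":  ["A. Smith", "A. Smith", "A. Smith", "A. Smith", "A. Smith"],
    "Medal": ["Gold", "Gold", "No Medal", "No Medal", "Bronze"],
})

per_athlete = (
    toy.assign(won=toy["Medal"].ne("No Medal"))        # True where a medal was won
       .groupby(["ID", "Name"])                        # ID keeps the namesakes apart
       .agg(total_entries=("won", "size"),
            total_medals=("won", "sum"))
)
per_athlete["Ratio"] = per_athlete["total_medals"] / per_athlete["total_entries"] * 100
print(per_athlete)
```

Grouping by `Name` alone would report one athlete with 5 entries and 3 medals here, while grouping by `ID` keeps the two apart.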

Now, let's turn to some basic data wrangling (group by and sum) to get all the countries and their total medals. The fun part will be creating a world map using Folium so we can see which countries have historically ruled the Olympics.

**World Map - Which Countries Won the Most Medals?**

In [None]:
# making all relevant preparations for visualizing the data using folium

world_map = folium.Map()
world_geo = "../input/world-countries/world-countries.json"

# organizing the data we want for the medals map
medals_per_country = athlete_events[['NOC', 'Medal']]
medals_per_country = medals_per_country.loc[medals_per_country['Medal'] != 'No Medal'].groupby('NOC').count().sort_values(by='Medal',ascending=False)

medals_per_country = pd.merge(medals_per_country,
                              country_def,
                              how="inner",
                              on='NOC')

# unlike in previous visualizations, this time we do not want to exclude East Germany and the Soviet Union
# so we are going to merge Soviet Union with current Russia and East Germany with current Germany
medals_per_country = medals_per_country[['region', 'Medal']].groupby('region').sum().sort_values(by='Medal', ascending=False)
medals_per_country = medals_per_country.reset_index()

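# making some cosmetic changes so the region names match the country names used in the map file
# (assigning the result back instead of using inplace=True on a column, which newer pandas versions may not apply)
medals_per_country['region'] = medals_per_country['region'].replace({'USA': 'United States of America',
                                                                     'UK': 'United Kingdom'})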

# now that we have a table with all the countries and their total medals, let's visualize it using a choropleth map
# folium.Choropleth is used here because newer folium releases no longer provide the Map.choropleth method
folium.Choropleth(geo_data=world_geo,
                  data=medals_per_country,
                  columns=['region', 'Medal'],
                  key_on="feature.properties.name",
                  fill_color='YlOrRd').add_to(world_map)
world_map

The answer: Russia and the United States.
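
The renames for 'USA' and 'UK' are needed because the map matches each row of the table to a country shape by name (`feature.properties.name`), and a region whose name has no match is simply not colored. Here is a minimal sketch of how one might list unmatched names before drawing the map. The `geo` dict is a made-up stand-in for the contents of `world-countries.json`, and the medal counts are invented.

```python
import pandas as pd

# Minimal GeoJSON-like dict standing in for world-countries.json (made-up subset).
geo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "United States of America"}},
        {"type": "Feature", "properties": {"name": "Russia"}},
    ],
}

# Made-up medal counts; 'USA' deliberately does not match the GeoJSON name.
medals = pd.DataFrame({"region": ["USA", "Russia"], "Medal": [10, 8]})

# Collect the names the map knows about, then find regions without a match
geo_names = {feature["properties"]["name"] for feature in geo["features"]}
unmatched = sorted(set(medals["region"]) - geo_names)
print(unmatched)
```

With the real file, the dict could be loaded with `json.load` and every name in the resulting list would be a candidate for another rename.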