# Example 2: Do Wealthier Countries Win More Medals?

## Tasks

#### Correlation
* GDP and total medals
* GDP per capita and medals per capita
* Segment countries into income tiers
* Compare medal distribution across tiers

#### Stats
* Pearson Correlation
* Spearman Correlation (rank-based)

#### Visuals
* Scatter plot with trend line
* Boxplots by income tier

In [32]:
import pandas as pd
import plotly.express as px
import numpy as np

In [33]:
file_path = 'C:/Users/viole/dev/analytics/kaggle/olympics-data-analysis/data/olympic_countries_efficiency.csv'
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,NOC,ISO3,Year,population,gdp_per_capita,income_group,host_country,athletes_sent,sports_participated,events_participated,female_athlete_percentage,prev_total_medals,prev_medals_per_athlete,Gold,Silver,Bronze,total_medals,medals_per_athlete
0,AFG,AFG,2004,23560654.0,221.763654,Low income,0,5,4,5,40.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,AFG,AFG,2008,26482622.0,381.733238,Low income,0,4,2,4,25.0,0.0,0.0,0.0,0.0,1.0,1.0,0.25
2,AFG,AFG,2012,30560034.0,651.417134,Low income,0,6,4,6,16.666667,1.0,0.25,0.0,0.0,1.0,1.0,0.166667
3,AFG,AFG,2016,34700612.0,522.082216,Low income,0,3,2,3,33.333333,1.0,0.166667,0.0,0.0,0.0,0.0,0.0
4,ALB,ALB,1992,3247039.0,200.85222,Low income,0,7,4,8,22.222222,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [34]:
df.columns

Index(['NOC', 'ISO3', 'Year', 'population', 'gdp_per_capita', 'income_group',
       'host_country', 'athletes_sent', 'sports_participated',
       'events_participated', 'female_athlete_percentage', 'prev_total_medals',
       'prev_medals_per_athlete', 'Gold', 'Silver', 'Bronze', 'total_medals',
       'medals_per_athlete'],
      dtype='str')

## Columns Needed

* NOC
* Year
* population
* gdp_per_capita
* total_medals
* gdp (calculated)

In [35]:
# Keep only columns needed, create a copy of df
df_small = df[[
    'NOC',
    'Year',
    'population',
    'gdp_per_capita',
    'total_medals'
]].copy()

df_small.head()

Unnamed: 0,NOC,Year,population,gdp_per_capita,total_medals
0,AFG,2004,23560654.0,221.763654,0.0
1,AFG,2008,26482622.0,381.733238,1.0
2,AFG,2012,30560034.0,651.417134,1.0
3,AFG,2016,34700612.0,522.082216,0.0
4,ALB,1992,3247039.0,200.85222,0.0


In [36]:
# Calculate total gdp column
df_small['gdp'] = df_small['gdp_per_capita'] * df_small['population']
df_small.head()

Unnamed: 0,NOC,Year,population,gdp_per_capita,total_medals,gdp
0,AFG,2004,23560654.0,221.763654,0.0,5224897000.0
1,AFG,2008,26482622.0,381.733238,1.0,10109300000.0
2,AFG,2012,30560034.0,651.417134,1.0,19907330000.0
3,AFG,2016,34700612.0,522.082216,0.0,18116570000.0
4,ALB,1992,3247039.0,200.85222,0.0,652175000.0


In [37]:
df_small[['NOC', 'gdp', 'total_medals']].sort_values('gdp', ascending=False).head()

Unnamed: 0,NOC,gdp,total_medals
196,CHN,11456020000000.0,113.0
195,CHN,8673665000000.0,125.0
194,CHN,4667346000000.0,184.0
149,BRA,2465228000000.0,59.0
193,CHN,1984197000000.0,94.0


## Correlation between GDP and Total Medals

In [53]:
# Scatter plot to see if the relationship even exists
fig = px.scatter(
    df_small,
    x='gdp',
    y='total_medals',
    hover_name='NOC',
    hover_data={
        'gdp_per_capita': ':,.0f',
        'population': ':,.0f',
        'total_medals': True,
        'gdp': ':,.0f'
    },
    trendline='ols',
    title='Total Medals by GDP',
    labels={
        'gdp': 'GDP (USD)',
        'total_medals': 'Total Medals'
    }
)

# Log scale for the GDP axis (display only; the OLS trendline is still fitted on raw GDP)
fig.update_xaxes(type='log')

fig.show()

The scatter plot above shows a strong, positive relationship between a country's GDP and total medals won. Countries with larger economies tend to win more medals, while many low-GDP countries win few or no medals.

## Correlation between GDP per Capita and Total Medals


In [54]:
# Scatter plot to see if the relationship even exists
fig = px.scatter(
    df_small,
    x='gdp_per_capita',
    y='total_medals',
    hover_name='NOC',
    hover_data={
        'gdp_per_capita': ':,.0f',
        'population': ':,.0f',
        'total_medals': True,
        'gdp': ':,.0f'
    },
    trendline='ols',
    title='Total Medals by GDP per Capita',
    labels={
        'gdp_per_capita': 'GDP per Capita (USD)',
        'total_medals': 'Total Medals'
    }
)

# Log scale for the GDP per capita axis (display only; the OLS trendline is still fitted on raw values)
fig.update_xaxes(type='log')

fig.show()

The scatter plot above shows a weaker correlation between GDP per capita and total medals. GDP per capita measures average wealth per person, whereas the previous plot used the total size of the economy. Per-person wealth therefore does not strongly predict Olympic success, which suggests that population size and total economic output play a more significant role.

## Income Tiers

In [58]:
# Collapse df_small to the most recent year per country
# Currently there are multiple rows for each country (one per year)
df_latest = (
    df_small.sort_values('Year')
      .groupby('NOC', as_index=False)
      .last()
)

df_latest.head()

Unnamed: 0,NOC,Year,population,gdp_per_capita,total_medals,gdp
0,AFG,2016,34700612.0,522.082216,0.0,18116570000.0
1,ALB,2016,2689469.0,4457.634122,0.0,11988670000.0
2,ALG,2016,40850721.0,4424.98529,2.0,180763800000.0
3,AND,2016,72181.0,40129.838581,0.0,2896612000.0
4,ARG,2016,43900313.0,12699.962314,22.0,557532300000.0


In [61]:
# Create income tiers
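# Tier boundaries follow the World Bank income thresholds, which are defined
# on GNI per capita; applying them to GDP per capita is an approximation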
def income_tier(gdp_per_capita):
    if gdp_per_capita < 1135:
        return 'Low Income'
    elif gdp_per_capita < 4466:
        return 'Lower-Middle Income'
    elif gdp_per_capita < 13846:
        return 'Upper-Middle Income'
    else:
        return 'High Income'



In [62]:
df_latest['income_tier'] = df_latest['gdp_per_capita'].apply(income_tier)
df_latest.head()

Unnamed: 0,NOC,Year,population,gdp_per_capita,total_medals,gdp,income_tier
0,AFG,2016,34700612.0,522.082216,0.0,18116570000.0,Low Income
1,ALB,2016,2689469.0,4457.634122,0.0,11988670000.0,Lower-Middle Income
2,ALG,2016,40850721.0,4424.98529,2.0,180763800000.0,Lower-Middle Income
3,AND,2016,72181.0,40129.838581,0.0,2896612000.0,High Income
4,ARG,2016,43900313.0,12699.962314,22.0,557532300000.0,Upper-Middle Income
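The `income_tier` function above is a chain of threshold comparisons, which can also be written as a sorted-boundary lookup. The following is a minimal sketch using only the standard library; the names `TIER_BOUNDS`, `TIER_LABELS`, and `income_tier_lookup` are hypothetical and not part of the notebook. It is intended to return the same label as `income_tier` for any numeric input.

```python
from bisect import bisect_right

# Same boundaries and labels as the income_tier function in the notebook.
# The names below are hypothetical, introduced only for this sketch.
TIER_BOUNDS = [1135, 4466, 13846]
TIER_LABELS = [
    'Low Income',
    'Lower-Middle Income',
    'Upper-Middle Income',
    'High Income',
]


def income_tier_lookup(gdp_per_capita):
    """Return the tier label by counting how many boundaries are <= the value."""
    return TIER_LABELS[bisect_right(TIER_BOUNDS, gdp_per_capita)]


# GDP per capita values taken from the df_latest.head() output above
for noc, value in [('AFG', 522.082216), ('ALB', 4457.634122),
                   ('AND', 40129.838581), ('ARG', 12699.962314)]:
    print(noc, income_tier_lookup(value))
```

Keeping the boundaries in one list makes them easier to update in a single place. Note that a missing value (`NaN`) fails every `<` comparison, so both this lookup and the original function would label it 'High Income'; rows with missing GDP per capita may need to be filtered out first.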


In [63]:
# Compare medal distributions by tier
df_latest.groupby(
    'income_tier'
)['total_medals'].mean()

income_tier
High Income            25.142857
Low Income              0.250000
Lower-Middle Income     3.000000
Upper-Middle Income    34.166667
Name: total_medals, dtype: float64

In [64]:
# Plot medals by income-tier
px.box(
    df_latest,
    x='income_tier',
    y='total_medals',
    title='Total Medals by Income Tier'
)

Low and lower-middle income countries win fewer medals, but high income countries do not necessarily win more medals than upper-middle income countries: the mean is about 25 medals for the high income tier versus about 34 for the upper-middle tier.

## Pearson Correlation

In [67]:
# GDP vs total medals, using log10(GDP) to reduce the skew of raw GDP
# Uses the country-level dataframe (one row per country)
df_latest['log_gdp'] = np.log10(df_latest['gdp'])

df_latest['log_gdp'].corr(df_latest['total_medals'])

np.float64(0.7840782220967755)

A Pearson coefficient of 0.78 indicates a moderately strong positive linear correlation between log10(GDP) and total medals.
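To make the Pearson coefficient less of a black box, here is a minimal sketch that computes it by hand: the covariance of the two variables divided by the product of their spreads. The helper name `pearson_r` is hypothetical. The sample uses only the five rows printed by `df_latest.head()`, so it will not reproduce the 0.78 obtained from the full dataframe.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-variation scaled by the spread of each variable."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(ss_x * ss_y)


# First five rows of df_latest as printed above (illustration only)
gdp = [18116570000.0, 11988670000.0, 180763800000.0, 2896612000.0, 557532300000.0]
medals = [0.0, 0.0, 2.0, 0.0, 22.0]

log_gdp = [math.log10(g) for g in gdp]
r_sample = pearson_r(log_gdp, medals)
print(f"Pearson r on the five sample rows: {r_sample:.3f}")
```

Because Pearson measures linear association, taking the logarithm of GDP first changes what is being tested: the coefficient describes how linearly medals vary with the order of magnitude of GDP, not with GDP itself.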

## Spearman Correlation

In [70]:
df_latest['log_gdp'].corr(
    df_latest['total_medals'],
    method='spearman'
)


np.float64(0.802260226560853)

In [72]:
# Population vs medals
df_latest['population'].corr(
    df_latest['total_medals'],
    method='spearman'
)

np.float64(0.6075290582281284)

In [73]:
# Within similar income levels does economic ranking matter?
(
    df_latest
    .groupby('income_tier')
    .apply(lambda g: g['gdp'].corr(g['total_medals'], method='spearman'))
)

income_tier
High Income            0.945611
Low Income            -0.258199
Lower-Middle Income    0.463713
Upper-Middle Income    0.942857
dtype: float64

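Within each tier, the rank correlation between GDP and medals rises with income: it is slightly negative for low income countries (-0.26), moderate for lower-middle income countries (0.46), and very strong for upper-middle (0.94) and high income (0.95) countries.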