# Prep Course Capstone Project – “Olympic history: athletes and results” Dataset Study

July 2018 by Raaj Bhavsar

The aim of this project is to analyze the “Olympic history: athletes and results” dataset, referred to below as the Olympic dataset. It provides basic biographical data on athletes and their medal results from Athens 1896 to Rio 2016. Randi Griffin scraped the data from www.sports-reference.com using R.

“This dataset provides an opportunity to ask questions about how the Olympics have evolved over time, including questions about the participation and performance of women, different nations, and different sports and events.” (Randi Griffin)

In [4]:
from IPython.display import Image
#Display the Olympic rings image at 300 x 300 pixels
Image(url="http://cliparts.co/cliparts/8iG/6pB/8iG6pBKgT.jpg", width=300, height=300)

![Image of Olympics](http://cliparts.co/cliparts/8iG/6pB/8iG6pBKgT.jpg)

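Note: Markdown image syntax has no size option. To resize the image in a Markdown cell, an HTML `<img>` tag with a `width` attribute can be used instead.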

# Data
This dataset contains two CSV files, `athlete_events.csv` and `noc_regions.csv`.
The main file, `athlete_events.csv`, has:

    •	271,116 rows (data entries)
    •	15 fields of biographical and event information (shown below)


In [6]:
#importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [7]:
#Importing the DataSets
DataSet = pd.read_csv('athlete_events.csv')
Df2 = pd.read_csv('noc_regions.csv')

In [8]:
DataSet.head()

| | ID | Name | Sex | Age | Height | Weight | Team | NOC | Games | Year | Season | City | Sport | Event | Medal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | A Dijiang | M | 24.0 | 180.0 | 80.0 | China | CHN | 1992 Summer | 1992 | Summer | Barcelona | Basketball | Basketball Men's Basketball | NaN |
| 1 | 2 | A Lamusi | M | 23.0 | 170.0 | 60.0 | China | CHN | 2012 Summer | 2012 | Summer | London | Judo | Judo Men's Extra-Lightweight | NaN |
| 2 | 3 | Gunnar Nielsen Aaby | M | 24.0 | NaN | NaN | Denmark | DEN | 1920 Summer | 1920 | Summer | Antwerpen | Football | Football Men's Football | NaN |
| 3 | 4 | Edgar Lindenau Aabye | M | 34.0 | NaN | NaN | Denmark/Sweden | DEN | 1900 Summer | 1900 | Summer | Paris | Tug-Of-War | Tug-Of-War Men's Tug-Of-War | Gold |
| 4 | 5 | Christine Jacoba Aaftink | F | 21.0 | 185.0 | 82.0 | Netherlands | NED | 1988 Winter | 1988 | Winter | Calgary | Speed Skating | Speed Skating Women's 500 metres | NaN |


In [9]:
Df2.head()

| | NOC | region | notes |
|---|---|---|---|
| 0 | AFG | Afghanistan | NaN |
| 1 | AHO | Curacao | Netherlands Antilles |
| 2 | ALB | Albania | NaN |
| 3 | ALG | Algeria | NaN |
| 4 | AND | Andorra | NaN |
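
The two tables share the `NOC` column, so region names could be attached to each athlete row with a left join. The sketch below is only an illustration on a handful of rows: the athlete rows come from the first preview, two region rows come from the second, and the `DEN` region row is an assumption added so that one code matches. The names `athletes`, `regions`, `merged` and `unmatched` are not part of the original notebook.

```python
import pandas as pd

# Athlete rows copied from the DataSet.head() preview
athletes = pd.DataFrame({
    'Name': ['A Dijiang', 'Gunnar Nielsen Aaby', 'Christine Jacoba Aaftink'],
    'NOC': ['CHN', 'DEN', 'NED'],
})

# Two region rows from the Df2.head() preview plus one assumed row (DEN)
regions = pd.DataFrame({
    'NOC': ['AFG', 'ALB', 'DEN'],
    'region': ['Afghanistan', 'Albania', 'Denmark'],
})

# A left join keeps every athlete row, even when its NOC code has no region
merged = athletes.merge(regions, on='NOC', how='left', indicator=True)

# Codes without a match in the lookup table are flagged as 'left_only'
unmatched = merged.loc[merged['_merge'] == 'left_only', 'NOC'].tolist()
print(merged)
print(unmatched)
```

Checking `unmatched` on the full data would show whether any NOC codes in `athlete_events.csv` are absent from `noc_regions.csv` before regions are used in an analysis.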


# What Can We Determine from the Data?
After a first look at the data, several questions seem worth exploring:

    1. Which countries have shown unusually strong support for their female athletes?
        a. At what times and places in Olympic history did major changes occur?
        b. Were these changes sustained over time, or only blips in history?

    2. How have the Olympics grown over time?
        a. Number of events?
        b. Number of participants?
        c. Number of countries?

    3. How have participants’ physical characteristics (average age, height, weight, sex) changed over time?
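
As a possible starting point for question 1, the share of female athletes per country and Games can be computed from the `ID`, `Sex`, `NOC` and `Games` columns. This is a minimal sketch on made-up rows: the `sample` frame and the `XXX`/`YYY` codes are hypothetical, and the output is not a result from the real data.

```python
import pandas as pd

# Hypothetical rows in the same layout as athlete_events.csv
sample = pd.DataFrame({
    'ID':    [1, 1, 2, 3, 4, 5],
    'Sex':   ['F', 'F', 'M', 'M', 'F', 'M'],
    'NOC':   ['XXX', 'XXX', 'XXX', 'XXX', 'YYY', 'YYY'],
    'Games': ['2000 Summer'] * 6,
})

# Keep one row per athlete per country per Games,
# so athletes entered in several events are counted once
athletes = sample.drop_duplicates(subset=['ID', 'NOC', 'Games'])

# Share of female athletes within each (NOC, Games) group
female_share = (
    athletes['Sex'].eq('F')
    .groupby([athletes['NOC'], athletes['Games']])
    .mean()
    .rename('FemaleShare')
    .reset_index()
)
print(female_share)
```

Comparing this share across Games for one country would be one way to look for sustained changes versus one-off blips.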






In [10]:
#Encoding Categorical Medal Data
#Medals worth Gold = 3, Silver = 2, Bronze = 1, nan = 0
mapping = {'Gold' : 3, 'Silver' : 2, 'Bronze' : 1}
#Map medal names to points; rows without a medal (NaN) become 0
DataSet['Medal'] = DataSet['Medal'].map(mapping).fillna(0).astype(int)

In [11]:
#Describe the initial statistics 
DataSet.describe()

| | ID | Age | Height | Weight | Year | Medal |
|---|---|---|---|---|---|---|
| count | 271116.0 | 261642.0 | 210945.0 | 208241.0 | 271116.0 | 271116.0 |
| mean | 68248.954396 | 25.556898 | 175.33897 | 70.702393 | 1978.37848 | 0.29376 |
| std | 39022.286345 | 6.393561 | 10.518462 | 14.34802 | 29.877632 | 0.774697 |
| min | 1.0 | 10.0 | 127.0 | 25.0 | 1896.0 | 0.0 |
| 25% | 34643.0 | 21.0 | 168.0 | 60.0 | 1960.0 | 0.0 |
| 50% | 68205.0 | 24.0 | 175.0 | 70.0 | 1988.0 | 0.0 |
| 75% | 102097.25 | 28.0 | 183.0 | 79.0 | 2002.0 | 0.0 |
| max | 135571.0 | 97.0 | 226.0 | 214.0 | 2016.0 | 3.0 |
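
The `count` row shows that Age, Height and Weight are missing for many entries (for example, only 210,945 of 271,116 rows have a height), and `describe()` leaves those rows out of its statistics. One way to see the gaps per column is to count missing values directly. The sketch below does this on the five preview rows shown earlier, so its numbers apply only to that small sample; the name `preview` is illustrative.

```python
import numpy as np
import pandas as pd

# The five rows from DataSet.head(), reduced to the numeric body columns
preview = pd.DataFrame({
    'Age':    [24.0, 23.0, 24.0, 34.0, 21.0],
    'Height': [180.0, 170.0, np.nan, np.nan, 185.0],
    'Weight': [80.0, 60.0, np.nan, np.nan, 82.0],
})

# Number and fraction of missing values per column
missing_count = preview.isna().sum()
missing_share = preview.isna().mean()
print(pd.DataFrame({'missing': missing_count, 'share': missing_share}))
```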


In [12]:
#Encoding Categorical Gender Data
#Male = 1, Female = 0 (LabelEncoder assigns codes in sorted label order: 'F' -> 0, 'M' -> 1)
from sklearn.preprocessing import LabelEncoder
labelencoder_sex = LabelEncoder()
DataSet['Sex'] = labelencoder_sex.fit_transform(DataSet['Sex'])

In [16]:
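#Making Dataframe to show Medal points by Sex, Country, Games
#Medal holds the points from In [10] (Gold = 3, Silver = 2, Bronze = 1), so the sum is a point total, not a medal count
FemMedSumByCountry = DataSet.groupby(['Sex', 'NOC', 'Games'], as_index = False)['Medal'].sum()
FemMedSumByCountry = FemMedSumByCountry[['Medal', 'Sex', 'NOC', 'Games']]

#TODO: find the % of medal points won by women vs men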
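
The percentage step left open above could look like the following. It is only a sketch on hypothetical rows and assumes the encodings used in this notebook (Medal points 3/2/1/0, Sex with Male = 1 and Female = 0). The names `sample`, `points` and `FemalePct` and the `XXX`/`YYY` codes are illustrative.

```python
import pandas as pd

# Hypothetical, already-encoded rows: Sex 0 = Female, 1 = Male; Medal in points
sample = pd.DataFrame({
    'Sex':   [0, 0, 1, 1, 1, 0],
    'NOC':   ['XXX', 'XXX', 'XXX', 'XXX', 'YYY', 'YYY'],
    'Games': ['2000 Summer'] * 6,
    'Medal': [3, 0, 1, 2, 0, 2],
})

# Medal points per (NOC, Games), with one column per sex
points = sample.pivot_table(index=['NOC', 'Games'], columns='Sex',
                            values='Medal', aggfunc='sum', fill_value=0)
points = points.rename(columns={0: 'Female', 1: 'Male'})

# Female share of all medal points in percent (NaN when nobody scored)
total = points['Female'] + points['Male']
points['FemalePct'] = 100 * points['Female'] / total.where(total > 0)
print(points)
```

Masking zero totals with `where` avoids a division by zero for teams that won no medals at a given Games.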

In [20]:
#Making Dataframe to show Number of Events by Sex, Games
#Count unique events per group, then turn the group labels back into columns
FemEventCount = DataSet.groupby(['Sex', 'Games'])['Event'].nunique().reset_index()
FemEventCount = FemEventCount[['Event', 'Sex', 'Games']]

In [21]:
#Same event counts as above, but keeping (Sex, Games) as the index
EventCount = DataSet.groupby(['Sex', 'Games'])['Event'].nunique().to_frame()
#Copy the index levels into columns in one step; assigning row by row with
#EventCount['Sex'][index] = ... is chained indexing and triggers SettingWithCopyWarning
EventCount['Sex'] = EventCount.index.get_level_values('Sex')
EventCount['Games'] = EventCount.index.get_level_values('Games')


In [24]:
#Making Dataframe to show Biological Information
#Sex is encoded as Male = 1, Female = 0, so its mean is the share of male entries
BiologyOverTime = DataSet.groupby('Games')[['Age', 'Sex', 'Weight', 'Height']].mean()

#The index holds labels such as '1992 Summer'. Indexing a label with [0] and [1]
#returns its first two characters, not the year and season, so split it instead.
#The columns do not need to be created beforehand; assignment creates them.
GamesParts = BiologyOverTime.index.to_series().str.split(' ', n=1, expand=True)
BiologyOverTime['Year'] = GamesParts[0].astype(int)
BiologyOverTime['Season'] = GamesParts[1]


In [25]:
BiologyOverTime[['Age', 'Sex', 'Weight', 'Height']].describe()

| | Age | Sex | Weight | Height |
|---|---|---|---|---|
| count | 51.0 | 51.0 | 51.0 | 51.0 |
| mean | 25.788976 | 0.78723 | 70.915812 | 174.817113 |
| std | 1.954422 | 0.138803 | 1.548416 | 1.541697 |
| min | 23.443241 | 0.545368 | 67.363636 | 170.701493 |
| 25% | 24.248515 | 0.687727 | 70.076283 | 173.833129 |
| 50% | 25.547112 | 0.780011 | 70.903281 | 174.991354 |
| 75% | 26.420076 | 0.914284 | 71.353671 | 175.933986 |
| max | 33.45154 | 1.0 | 75.917073 | 178.206226 |


In [26]:
#DataFrame to show Growth over time
#Unique countries (NOC), events and athlete names per Games
Growth = DataSet.groupby('Games')[['NOC', 'Event', 'Name']].nunique()
#Copy the Games index into a regular column
Growth['Games'] = Growth.index
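
Summer and Winter Games tend to differ in size, so growth may be easier to read when each season is tracked separately. The dataset already has `Year` and `Season` columns, so the `Games` label does not need to be parsed. The following is a minimal sketch on hypothetical rows; the frame `sample`, its values, and the column names `Countries`, `Events` and `Athletes` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows in the layout of athlete_events.csv
sample = pd.DataFrame({
    'Name':   ['A', 'B', 'C', 'A', 'D', 'E'],
    'NOC':    ['XXX', 'XXX', 'YYY', 'XXX', 'YYY', 'ZZZ'],
    'Event':  ['E1', 'E1', 'E2', 'E1', 'E2', 'E3'],
    'Year':   [1996, 1996, 1996, 2000, 2000, 2000],
    'Season': ['Summer'] * 6,
})

# Unique countries, events and athlete names per season and year
growth = (
    sample.groupby(['Season', 'Year'])[['NOC', 'Event', 'Name']]
    .nunique()
    .rename(columns={'NOC': 'Countries', 'Event': 'Events', 'Name': 'Athletes'})
    .sort_index()
)
print(growth)
```

Sorting by the `(Season, Year)` index puts each season's Games in chronological order, which is convenient for plotting one line per season.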
  
