# Statistical analysis of the Olympic Games with R

![image.png](attachment:image.png)

#### Downloading the data
https://docs.google.com/document/d/1HgPH1u0r7Pze5HrErzVXbxFcA_9wUpknUCMtgbPh8BM/edit?usp=sharing

Outline:

1. dplyr refresher
2. Data manipulation example
3. Questions

#### R refresher:

* https://suzan.rbind.io/2018/02/dplyr-tutorial-3/#filtering-rows-based-on-a-numeric-variable
* https://ggplot2.tidyverse.org/
* https://thinkr.fr/utiliser-la-grammaire-dplyr-pour-triturer-ses-donnees/

#### II- Data manipulation example

##### Data manipulation with dplyr

There are 5 functions to know in the dplyr package.
* **select**: selects columns
* **filter**: filters rows
* **mutate**: creates a new column
* **arrange**: sorts the data
* **summarise**: aggregates the data into a summary
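The five verbs listed above can be sketched together on a small table. This is a minimal illustration, assuming dplyr is installed; the `athletes` data frame and its values are hypothetical and only loosely shaped like `athlete_events.csv`.

```r
library(dplyr)

# Hypothetical toy data (not taken from athlete_events.csv)
athletes <- data.frame(
  Name   = c("A", "B", "C", "D"),
  Team   = c("France", "France", "Italy", "Italy"),
  Year   = c(1980, 2016, 1980, 2016),
  Height = c(180, 170, 175, 190)
)

result <- athletes %>%
  filter(Year == 2016) %>%              # keep only the 2016 rows
  select(Name, Team, Height) %>%        # keep three columns
  mutate(Height_m = Height / 100) %>%   # add a derived column (metres)
  arrange(desc(Height_m))               # sort, tallest first

print(result)

# summarise collapses all rows into a single summary row
athletes %>% summarise(mean_height = mean(Height))
```

Each verb takes a data frame and returns a data frame, which is what makes chaining them with `%>%` possible.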

In [4]:
# File name
filename <- "athlete_events.csv"
# Load the file
dataset <- read.csv(filename, header = TRUE, encoding = "UTF-8")

# Display the first 6 rows (the default for head())
head(dataset)

“EOF within quoted string”


,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
,<int>,<chr>,<chr>,<int>,<int>,<dbl>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
1,1,A Dijiang,M,24,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
2,2,A Lamusi,M,23,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
3,3,Gunnar Nielsen Aaby,M,24,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
4,4,Edgar Lindenau Aabye,M,34,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
5,5,Christine Jacoba Aaftink,F,21,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,
6,5,Christine Jacoba Aaftink,F,21,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,"Speed Skating Women's 1,000 metres",
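The warning “EOF within quoted string” shown above generally means that the parser met an opening double quote and reached the end of the file before finding the closing one, so some rows may be missing from the result. The sketch below reproduces the situation on a hypothetical two-row file written to a temporary path; it is not a diagnosis of `athlete_events.csv` itself, and disabling quotes is only appropriate when no field contains the separator.

```r
# Hypothetical file: the first data row has an unbalanced double quote
path <- tempfile(fileext = ".csv")
writeLines(c("ID,Name", "1,\"A Dijiang", "2,A Lamusi"), path)

# Default quoting: the parser is expected to stay inside the quoted field
# until the end of the file, so rows after the stray quote can be lost
broken <- suppressWarnings(read.csv(path))

# quote = "" disables quote handling, so each line is read as one row
plain <- read.csv(path, quote = "")

print(c(default = nrow(broken), no_quotes = nrow(plain)))
```

Comparing `nrow(dataset)` with the number of records expected in the file is a quick way to check whether a load was affected.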


In [5]:
#### Filter on the team France
library("dplyr")
dataset %>% filter(Team == "France")

ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
<int>,<chr>,<chr>,<int>,<int>,<dbl>,<chr>,<chr>,<chr>,<int>,<chr>,<chr>,<chr>,<chr>,<chr>
34,Jamale (Djamel-) Aarrass (Ahrass-),M,30,187,76,France,FRA,2012 Summer,2012,Summer,London,Athletics,"Athletics Men's 1,500 metres",
52,Patrick Abada,M,22,189,80,France,FRA,1976 Summer,1976,Summer,Montreal,Athletics,Athletics Men's Pole Vault,
56,Ren Abadie,M,21,,,France,FRA,1956 Summer,1956,Summer,Melbourne,Cycling,"Cycling Men's Road Race, Individual",
56,Ren Abadie,M,21,,,France,FRA,1956 Summer,1956,Summer,Melbourne,Cycling,"Cycling Men's Road Race, Team",Gold
73,Luc Abalo,M,23,182,86,France,FRA,2008 Summer,2008,Summer,Beijing,Handball,Handball Men's Handball,Gold
73,Luc Abalo,M,27,182,86,France,FRA,2012 Summer,2012,Summer,London,Handball,Handball Men's Handball,Gold
73,Luc Abalo,M,31,182,86,France,FRA,2016 Summer,2016,Summer,Rio de Janeiro,Handball,Handball Men's Handball,Silver
93,Jol Marc Abati,M,34,190,85,France,FRA,2004 Summer,2004,Summer,Athina,Handball,Handball Men's Handball,
93,Jol Marc Abati,M,38,190,85,France,FRA,2008 Summer,2008,Summer,Beijing,Handball,Handball Men's Handball,Gold
167,Ould Lamine Abdallah,M,,,,France,FRA,1952 Summer,1952,Summer,Helsinki,Athletics,"Athletics Men's 10,000 metres",


In [None]:
### Dimensions of the data

dim(dataset)

In [None]:
#### Data type of each variable

sapply(dataset, class)

In [None]:
#### Summary of the data

summary(dataset)

       ID                               Name        Sex             Age       
 Min.   :     1   Robert Tait McKenzie    :    58   F: 74522   Min.   :10.00  
 1st Qu.: 34643   Heikki Ilmari Savolainen:    39   M:196594   1st Qu.:21.00  
 Median : 68205   Joseph "Josy" Stoffel   :    38              Median :24.00  
 Mean   : 68249   Ioannis Theofilakis     :    36              Mean   :25.56  
 3rd Qu.:102097   Takashi Ono             :    33              3rd Qu.:28.00  
 Max.   :135571   Alexandros Theofilakis  :    32              Max.   :97.00  
                  (Other)                 :270880              NA's   :9474   
     Height          Weight                 Team             NOC        
 Min.   :127.0   Min.   : 25.0   United States: 17847   USA    : 18853  
 1st Qu.:168.0   1st Qu.: 60.0   France       : 11988   FRA    : 12758  
 Median :175.0   Median : 70.0   Great Britain: 11404   GBR    : 12256  
 Mean   :175.3   Mean   : 70.7   Italy        : 10260   ITA    : 10715  
 3r

In [None]:
#### Loading dplyr

library("dplyr")

"package 'dplyr' was built under R version 3.6.1"
Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



In [None]:
#### Select the columns Team, Games and City
head(select(dataset, Team, Games, City), 5)

Team,Games,City
China,1992 Summer,Barcelona
China,2012 Summer,London
Denmark,1920 Summer,Antwerpen
Denmark/Sweden,1900 Summer,Paris
Netherlands,1988 Winter,Calgary


In [None]:
#### Alternatively, the same call can be written with the pipe
#### Comparable to chaining method calls with the dot in Python
select(dataset, Team, Games, City) %>% head()

Team,Games,City
China,1992 Summer,Barcelona
China,2012 Summer,London
Denmark,1920 Summer,Antwerpen
Denmark/Sweden,1900 Summer,Paris
Netherlands,1988 Winter,Calgary
Netherlands,1988 Winter,Calgary


In [None]:
#### Or (the form I recommend)

dataset %>% select(Team, Games) %>% head()

Team,Games
China,1992 Summer
China,2012 Summer
Denmark,1920 Summer
Denmark/Sweden,1900 Summer
Netherlands,1988 Winter
Netherlands,1988 Winter
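The nested and piped forms shown above are two spellings of the same computation: `x %>% f(y)` is evaluated as `f(x, y)`. A minimal sketch on a hypothetical three-row table, assuming dplyr is installed:

```r
library(dplyr)

# Hypothetical toy table reusing a few values from the output above
toy <- data.frame(
  Team  = c("China", "Denmark", "Netherlands"),
  Games = c("1992 Summer", "1920 Summer", "1988 Winter"),
  City  = c("Barcelona", "Antwerpen", "Calgary")
)

# Nested form: read from the inside out
nested <- head(select(toy, Team, Games), 2)

# Piped form: read from left to right
piped <- toy %>% select(Team, Games) %>% head(2)

# Both forms should produce the same data frame
same <- isTRUE(all.equal(nested, piped))
print(same)
```

The piped form tends to stay readable as more steps are added, which is the usual argument for preferring it.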


#### Filter

In [None]:
#### Filter on the year 1980
#### Year is an integer column, so it is compared with a number rather than a string

head(filter(dataset, Year == 1980), 5)

ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
11,Jorma Ilmari Aalto,M,22,182,76.5,Finland,FIN,1980 Winter,1980,Winter,Lake Placid,Cross Country Skiing,Cross Country Skiing Men's 30 kilometres,
85,Alejandro Abascal Garca,M,28,181,82.0,Spain,ESP,1980 Summer,1980,Summer,Moskva,Sailing,Sailing Mixed Two Person Heavyweight Dinghy,Gold
86,Jos Manuel Abascal Gmez,M,22,182,67.0,Spain,ESP,1980 Summer,1980,Summer,Moskva,Athletics,"Athletics Men's 1,500 metres",
88,Nunu Dzhansuhivna Abashydze (-Myslaieva),F,25,168,105.0,Soviet Union,URS,1980 Summer,1980,Summer,Moskva,Athletics,Athletics Women's Shot Put,
108,Giuseppe Abbagnale,M,20,187,97.0,Italy,ITA,1980 Summer,1980,Summer,Moskva,Rowing,Rowing Men's Coxed Pairs,


Be careful to use `==` (comparison) rather than a single `=` sign.
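The difference between the two operators can be sketched on a hypothetical toy table. With `==`, `filter` receives a logical condition; with a single `=`, R treats `Year` as an argument name, and dplyr is expected to stop with an error rather than filter anything.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(Name = c("A", "B", "C"), Year = c(1980, 1980, 2016))

# == compares each value of Year with 1980
kept <- filter(toy, Year == 1980)
print(kept)

# A single = is parsed as a named argument; the error is caught here
# so that the script keeps running
msg <- tryCatch(
  {
    filter(toy, Year = 1980)
    "no error"
  },
  error = function(e) "error"
)
print(msg)
```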

In [None]:
#### Alternatively, with the pipe

dataset %>% filter(Year == 1980) %>% head()

ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
11,Jorma Ilmari Aalto,M,22,182,76.5,Finland,FIN,1980 Winter,1980,Winter,Lake Placid,Cross Country Skiing,Cross Country Skiing Men's 30 kilometres,
85,Alejandro Abascal Garca,M,28,181,82.0,Spain,ESP,1980 Summer,1980,Summer,Moskva,Sailing,Sailing Mixed Two Person Heavyweight Dinghy,Gold
86,Jos Manuel Abascal Gmez,M,22,182,67.0,Spain,ESP,1980 Summer,1980,Summer,Moskva,Athletics,"Athletics Men's 1,500 metres",
88,Nunu Dzhansuhivna Abashydze (-Myslaieva),F,25,168,105.0,Soviet Union,URS,1980 Summer,1980,Summer,Moskva,Athletics,Athletics Women's Shot Put,
108,Giuseppe Abbagnale,M,20,187,97.0,Italy,ITA,1980 Summer,1980,Summer,Moskva,Rowing,Rowing Men's Coxed Pairs,
109,Abdul Latif Al-Sayed Abbas Youssef Hashem,M,27,176,64.0,Kuwait,KUW,1980 Summer,1980,Summer,Moskva,Athletics,Athletics Men's 400 metres Hurdles,


#### Arrange

In [None]:
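#### Sort the data by births, then by year, in ascending order
#### Note: naissances and année, and the output below, appear to come from a
#### baby-names table rather than from athlete_events.csv

head(arrange(dataset, naissances, année), 10)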

X,année,prénom,genre,naissances
835,1880,Adelle,F,5
836,1880,Adina,F,5
837,1880,Adrienne,F,5
838,1880,Albertine,F,5
839,1880,Alys,F,5
840,1880,Ana,F,5
841,1880,Araminta,F,5
842,1880,Arthur,F,5
843,1880,Birtha,F,5
844,1880,Bulah,F,5


In [None]:
#### Sort the data in descending order (by Year, then by Sport)

head(arrange(dataset, desc(Year), desc(Sport)), 20)

ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
250,Saeid Morad Abdevali,M,26,170,80,Iran,IRI,2016 Summer,2016,Summer,Rio de Janeiro,Wrestling,"Wrestling Men's Middleweight, Greco-Roman",Bronze
338,Muminzhon Abdullayev,M,26,190,130,Uzbekistan,UZB,2016 Summer,2016,Summer,Rio de Janeiro,Wrestling,"Wrestling Men's Super-Heavyweight, Greco-Roman",
354,Bekzod Makhamadzhonovich Abdurakhmonov,M,26,172,74,Uzbekistan,UZB,2016 Summer,2016,Summer,Rio de Janeiro,Wrestling,"Wrestling Men's Middleweight, Freestyle",
721,Mara Jos Acosta Acosta,F,24,172,69,Venezuela,VEN,2016 Summer,2016,Summer,Rio de Janeiro,Wrestling,"Wrestling Women's Light-Heavyweight, Freestyle",
864,Yasemin Adar,F,24,180,75,Turkey,TUR,2016 Summer,2016,Summer,Rio de Janeiro,Wrestling,"Wrestling Women's Heavyweight, Freestyle",
902,Odunayo Folasade Adekuoroye,F,22,169,53,Nigeria,NGR,2016 Summer,2016,Summer,Rio de Janeiro,Wrestling,"Wrestling Women's Featherweight, Freestyle",
924,Aminat Oluwafunmilayo Adeniyi,F,23,165,58,Nigeria,NGR,2016 Summer,2016,Summer,Rio de Janeiro,Wrestling,"Wrestling Women's Lightweight, Freestyle",
1525,Zied Ait Ouagram,M,27,191,75,Morocco,MAR,2016 Summer,2016,Summer,Rio de Janeiro,Wrestling,"Wrestling Men's Middleweight, Greco-Roman",
1634,Taha Akgl,M,25,192,125,Turkey,TUR,2016 Summer,2016,Summer,Rio de Janeiro,Wrestling,"Wrestling Men's Super-Heavyweight, Freestyle",Gold
1645,Habibollah Jomeh Akhlaghi,M,31,175,90,Iran,IRI,2016 Summer,2016,Summer,Rio de Janeiro,Wrestling,"Wrestling Men's Light-Heavyweight, Greco-Roman",
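How `arrange` combines several keys can be sketched on a hypothetical four-row table: rows are ordered by the first key, and the second key only breaks ties. `desc()` reverses the order for the key it wraps.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  Year  = c(2016, 1980, 2016, 1980),
  Sport = c("Judo", "Wrestling", "Wrestling", "Athletics")
)

# Most recent year first; within a year, sports in reverse alphabetical order
sorted <- arrange(toy, desc(Year), desc(Sport))
print(sorted)
```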


#### Mutate

In [None]:
#### Add a new variable year_start that takes 1880 as year 0

mutate(dataset, year_start = Year - 1880) %>% head()

ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal,year_start
1,A Dijiang,M,24,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,,112
2,A Lamusi,M,23,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,,132
3,Gunnar Nielsen Aaby,M,24,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,,40
4,Edgar Lindenau Aabye,M,34,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold,20
5,Christine Jacoba Aaftink,F,21,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,,108
5,Christine Jacoba Aaftink,F,21,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,"Speed Skating Women's 1,000 metres",,108


#### Summarise

In [None]:
#### Summarise the data with summarise
#### Total sum of the heights
#### The result is presumably NA (shown as "" below) because Height contains missing values

summarise(dataset, somme = sum(Height))

somme
""
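An empty result like the one above is what `sum` returns when the column contains missing values. A minimal sketch on hypothetical data shows the effect and the usual remedy, the `na.rm = TRUE` argument of `sum`:

```r
library(dplyr)

# Hypothetical toy data with one missing height
toy <- data.frame(Height = c(180, NA, 170))

# Any NA in the column makes the sum NA
with_na <- summarise(toy, somme = sum(Height))

# na.rm = TRUE drops the missing values before summing;
# counting them alongside keeps the omission visible
without_na <- summarise(
  toy,
  somme     = sum(Height, na.rm = TRUE),
  n_missing = sum(is.na(Height))
)

print(with_na)
print(without_na)
```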


In [2]:
library("dplyr")


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [8]:
#### Group the data before summing the weight (this cell groups by Team, not by year)

by_annee <- group_by(dataset, Team)


In [None]:
dataset %>% filter(Sport == "Judo" | Sport == "Football") %>% arrange(Year)

ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
2829,Richard Louis Pierre Allemane,M,18,,,USFSA,FRA,1900 Summer,1900,Summer,Paris,Football,Football Men's Football,Silver
6510,Louis Dsir Bach,M,17,,,USFSA,FRA,1900 Summer,1900,Summer,Paris,Football,Football Men's Football,Silver
12293,Alfred Bloch,M,,,,USFSA,FRA,1900 Summer,1900,Summer,Paris,Football,Football Men's Football,Silver
16085,Claude Percival Buckenham,M,24,,,Upton Park FC,GBR,1900 Summer,1900,Summer,Paris,Football,Football Men's Football,Gold
16727,Tom Eustace Burridge,M,19,,,Upton Park FC,GBR,1900 Summer,1900,Summer,Paris,Football,Football Men's Football,Gold
17893,Fernand mile Canelle,M,18,,,USFSA,FRA,1900 Summer,1900,Summer,Paris,Football,Football Men's Football,Silver
19595,Alfred Ernest Chalk,M,25,,,Upton Park FC,GBR,1900 Summer,1900,Summer,Paris,Football,Football Men's Football,Gold
27218,Albert Delbecque,M,,,,Univ. of Brussels,BEL,1900 Summer,1900,Summer,Paris,Football,Football Men's Football,Bronze
30724,R. Duparc,M,,,,USFSA,FRA,1900 Summer,1900,Summer,Paris,Football,Football Men's Football,Silver
36903,Maurice Eugne Fraysse,M,20,,,USFSA,FRA,1900 Summer,1900,Summer,Paris,Football,Football Men's Football,Silver


#### Exercise


In [9]:
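# by_annee holds the grouped data, but summarise() below is called on dataset,
# so the grouping is not used and a single value is returned
by_annee <- group_by(dataset, Year)
summarise(dataset, somme = sum(Height))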

somme
<int>
""


In [11]:
# Number of distinct athlete names per year
dataset %>%
  group_by(Year) %>%
  distinct(Name, .keep_all = TRUE) %>%
  summarise(nb_name = n())

`summarise()` ungrouping output (override with `.groups` argument)



Year,nb_name
<int>,<int>
1896.0,16
1900.0,156
1904.0,74
1906.0,97
1908.0,278
1912.0,351
1920.0,388
1924.0,484
1928.0,490
1932.0,250


In [12]:
# Number of distinct events per year (the result column keeps the name nb_name)
dataset %>%
  group_by(Year) %>%
  distinct(Event, .keep_all = TRUE) %>%
  summarise(nb_name = n())

`summarise()` ungrouping output (override with `.groups` argument)



Year,nb_name
<int>,<int>
1896.0,23
1900.0,64
1904.0,52
1906.0,54
1908.0,100
1912.0,102
1920.0,136
1924.0,144
1928.0,128
1932.0,108
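The `distinct` followed by `n()` pattern used in the two cells above can also be written with dplyr's `n_distinct`, which counts unique values directly and allows several counts in one `summarise` call. A minimal sketch on hypothetical data; the column `nb_event` is an illustrative name.

```r
library(dplyr)

# Hypothetical toy data: athlete "A" appears in two events in 1896
toy <- data.frame(
  Year  = c(1896, 1896, 1896, 1900, 1900),
  Name  = c("A", "A", "B", "A", "C"),
  Event = c("100m", "200m", "100m", "100m", "100m")
)

counts <- toy %>%
  group_by(Year) %>%
  summarise(
    nb_name  = n_distinct(Name),   # unique athlete names per year
    nb_event = n_distinct(Event),  # unique events per year
    .groups = "drop"
  )

print(counts)
```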


<div class="alert alert-success">
 <b>EXERCISE</b>:
 <ul>
        <li>Has the number of athletes, nations, and events changed over time?</li>
        <li>How has the number of men and women changed over time?</li>
        <li>How does the number of women relative to men vary across countries?</li>
        <li>What was the proportion of women on Olympic teams in 1936?</li>
        <li>What were the medal counts for women of different nations in 1936?</li>
        <li>What was the proportion of women on Olympic teams in 1976?</li>
        <li>What were the medal counts for women of different nations in 1976?</li>
        <li>How has athlete height changed over time?</li>
        <li>How has athlete weight changed over time?</li>

</ul> 

</div>
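One possible starting point for the questions about men and women is sketched below on hypothetical data. It assumes, as in the table shown earlier, that an athlete appears once per event, so rows are first reduced to one per athlete and year before counting. The column names `n_women`, `n_men` and `prop_women` are illustrative.

```r
library(dplyr)

# Hypothetical toy data: athlete 1 competes in two events in 1936
toy <- data.frame(
  ID   = c(1, 1, 2, 3, 4, 5),
  Sex  = c("F", "F", "M", "M", "F", "M"),
  Year = c(1936, 1936, 1936, 1936, 1976, 1976)
)

sex_by_year <- toy %>%
  distinct(Year, ID, Sex) %>%            # one row per athlete and year
  group_by(Year) %>%
  summarise(
    n_women    = sum(Sex == "F"),
    n_men      = sum(Sex == "M"),
    prop_women = n_women / n(),          # share of women among athletes
    .groups = "drop"
  )

print(sex_by_year)
```

Adding `Team` or `NOC` to both `distinct` and `group_by` would give the same counts per country.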