# Activity 2: Analyzing Different Scenarios and Generating the Appropriate Visualization

We'll be working with the 120 years of Olympic History dataset acquired by Randi Griffin from https://www.sports-reference.com/ and made available on the GitHub repository of this book. Your assignment is to identify the top five sports based on the largest number of medals awarded in the year 2016, and then perform the following analysis:

1.  Generate a plot indicating the number of medals awarded in each of the top five sports in 2016.
2.  Plot a graph depicting the distribution of the age of medal winners in the top five sports in 2016.
3.  Find out which national teams won the largest number of medals in the top five sports in 2016.
4.  Observe the trend in the average weight of male and female athletes winning in the top five sports in 2016.

## High-Level Steps

1.  Download the dataset and format it as a pandas DataFrame.
2.  Filter the DataFrame to only include the rows corresponding to medal winners from 2016.
3.  Find out the medals awarded in 2016 for each sport.
4.  List the top five sports based on the largest number of medals awarded. Filter the DataFrame one more time to only include the records for the top five sports in 2016.
5.  Generate a bar plot of record counts corresponding to each of the top five sports.
6.  Generate a histogram for the Age feature of all medal winners in the top five sports (2016).
7.  Generate a bar plot indicating how many medals were won by each country's team in the top five sports in 2016.
8.  Generate a bar plot indicating the average weight of players, categorized based on gender, winning in the top five sports in 2016.

In [18]:
pip install pandas plotly

Note: you may need to restart the kernel to use updated packages.





In [1]:
import pandas as pd
import plotly.express as px

In [6]:
# Load the downloaded dataset into a DataFrame
url = "./datasets/athlete_events.csv"
df = pd.read_csv(url)
df

Unnamed: 0,ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
0,1,A Dijiang,M,24.0,180.0,80.0,China,CHN,1992 Summer,1992,Summer,Barcelona,Basketball,Basketball Men's Basketball,
1,2,A Lamusi,M,23.0,170.0,60.0,China,CHN,2012 Summer,2012,Summer,London,Judo,Judo Men's Extra-Lightweight,
2,3,Gunnar Nielsen Aaby,M,24.0,,,Denmark,DEN,1920 Summer,1920,Summer,Antwerpen,Football,Football Men's Football,
3,4,Edgar Lindenau Aabye,M,34.0,,,Denmark/Sweden,DEN,1900 Summer,1900,Summer,Paris,Tug-Of-War,Tug-Of-War Men's Tug-Of-War,Gold
4,5,Christine Jacoba Aaftink,F,21.0,185.0,82.0,Netherlands,NED,1988 Winter,1988,Winter,Calgary,Speed Skating,Speed Skating Women's 500 metres,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
271111,135569,Andrzej ya,M,29.0,179.0,89.0,Poland-1,POL,1976 Winter,1976,Winter,Innsbruck,Luge,Luge Mixed (Men)'s Doubles,
271112,135570,Piotr ya,M,27.0,176.0,59.0,Poland,POL,2014 Winter,2014,Winter,Sochi,Ski Jumping,"Ski Jumping Men's Large Hill, Individual",
271113,135570,Piotr ya,M,27.0,176.0,59.0,Poland,POL,2014 Winter,2014,Winter,Sochi,Ski Jumping,"Ski Jumping Men's Large Hill, Team",
271114,135571,Tomasz Ireneusz ya,M,30.0,185.0,96.0,Poland,POL,1998 Winter,1998,Winter,Nagano,Bobsleigh,Bobsleigh Men's Four,


In [13]:
# Keep only the 2016 medal winners (rows with a non-missing Medal)
df_2016 = df[(df['Year'] == 2016) & (df['Medal'].notna())]
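
The filter above combines two boolean conditions with `&`. As a minimal sketch of how that mask behaves, the following uses a made-up four-row frame (the names and values are hypothetical, not taken from the dataset) with the same `Year` and `Medal` columns:

```python
import pandas as pd

# Hypothetical miniature of athlete_events.csv; values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Year": [2016, 2016, 2012, 2016],
    "Medal": ["Gold", None, "Silver", "Bronze"],
})

# Each condition needs its own parentheses when it contains a comparison,
# because & binds more tightly than ==.
mask = (toy["Year"] == 2016) & toy["Medal"].notna()
toy_2016 = toy[mask]

print(toy_2016["Name"].tolist())
```

Row B is dropped because its medal is missing, and row C because it is from 2012.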

In [14]:
# Count medals per sport and keep the five largest counts
medals_by_sport = df_2016['Sport'].value_counts().nlargest(5)
top5_sports = medals_by_sport.index.tolist()
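
Note that `value_counts()` on `Sport` counts rows, and the dataset has one row per athlete per event, so every member of a medal-winning team contributes one row. The sketch below, on a made-up frame, contrasts that row count with a count of distinct (sport, event, medal) combinations. The deduplicated count is an alternative reading of "medals awarded", not necessarily the one the activity intends, and it would undercount events where a tie produced two medals of the same color.

```python
import pandas as pd

# Hypothetical data: one team event with three gold medallists,
# plus two individual events.
toy = pd.DataFrame({
    "Sport": ["Basketball"] * 3 + ["Athletics"] * 2,
    "Event": ["Men's Basketball"] * 3 + ["100 metres", "200 metres"],
    "Medal": ["Gold"] * 5,
})

# One count per row: each team member adds one to the sport's total.
rows_per_sport = toy["Sport"].value_counts()

# One count per distinct (sport, event, medal) combination.
events_per_sport = (
    toy.drop_duplicates(subset=["Sport", "Event", "Medal"])["Sport"]
    .value_counts()
)

print(rows_per_sport.to_dict())
print(events_per_sport.to_dict())
```

Whichever definition is used, applying `.nlargest(5)` to the resulting Series selects the top five sports in the same way.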

In [15]:
# Keep only the rows for the five sports with the most medals
df_top5 = df_2016[df_2016['Sport'].isin(top5_sports)]

In [19]:
# Bar plot: number of medals per sport
# Name both columns explicitly so the code does not depend on how the
# installed pandas version labels the output of value_counts()
medals_df = medals_by_sport.rename_axis('Sport').reset_index(name='Medals')
fig1 = px.bar(
    medals_df,
    x='Sport',
    y='Medals',
    labels={'Medals': 'Number of medals'},
    title="Top 5 sports by number of medals in 2016"
)
fig1.show()

In [None]:
# Histogram of medal winners' ages in the top five sports
fig2 = px.histogram(
    df_top5,
    x='Age',
    nbins=30,
    color='Sport',
    title="Age distribution of medal winners in the top five sports (2016)"
)
fig2.show()

In [None]:
# Medals per national team (NOC code) in the top five sports
team_medals = df_top5['NOC'].value_counts().reset_index()
team_medals.columns = ['NOC', 'Medals']
fig3 = px.bar(
    team_medals.head(10),  # ten teams with the most medals
    x='NOC',
    y='Medals',
    labels={'Medals': 'Number of medals'},
    title="Top 10 national teams by medals in the top five sports (2016)"
)
fig3.show()

In [None]:
# Average weight of medal winners by sex in the top five sports
avg_weight_gender = df_top5.groupby('Sex')['Weight'].mean().reset_index()
fig4 = px.bar(
    avg_weight_gender,
    x='Sex',
    y='Weight',
    labels={'Weight': 'Average weight'},
    title="Average weight of medal winners by sex in the top five sports (2016)"
)
fig4.show()
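
The cell above averages weight by sex across all five sports together. If the goal is to compare male and female averages within each sport, one possible variation is to group by both `Sport` and `Sex`. The sketch below shows that grouping on a made-up frame (the values are hypothetical); it is an assumption about how the task could be read, not the book's reference solution.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset.
toy = pd.DataFrame({
    "Sport": ["Rowing", "Rowing", "Rowing", "Swimming", "Swimming"],
    "Sex": ["M", "M", "F", "F", "F"],
    "Weight": [90.0, 94.0, 70.0, 60.0, None],
})

# mean() skips missing weights, so the athlete without a recorded
# weight does not affect the Swimming average.
avg_weight = toy.groupby(["Sport", "Sex"])["Weight"].mean().reset_index()

print(avg_weight)
```

A frame shaped like `avg_weight` can then be passed to `px.bar` with `x='Sport'`, `y='Weight'`, `color='Sex'`, and `barmode='group'` to draw side-by-side bars for each sport.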