# Exploratory Data Analysis and Visualization

## Introduction

This notebook performs exploratory data analysis (EDA) and visualization on `final_data.csv`, the dataset that was scraped and merged in the notebooks in the `01_data_collection` folder. For details on the data collection and merging process, refer to those notebooks.

The goal of this analysis is to uncover patterns, trends, and insights within the dataset through statistical summaries and visualizations. We will examine the data's structure, identify missing values, detect outliers, and explore relationships between variables. This EDA serves as a foundation for understanding the dataset's characteristics before the hypothesis tests.

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [3]:
df = pd.read_csv('../data/final_data.csv')
df.head()

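|   | name | country | gender | boulder_points | lead_points | combined_points | highest_grade | count_8c_plus | avg_grade_first5 |
|---|------|---------|--------|----------------|-------------|-----------------|---------------|---------------|------------------|
| 0 | Sorato Anraku | JPN | male | 3835.0 | 2281.0 | 6508.0 | NaN | NaN | NaN |
| 1 | Dohyun Lee | KOR | male | 3708.0 | 2123.0 | 4710.0 | NaN | NaN | NaN |
| 2 | Meichi Narasaki | JPN | male | 3055.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | Sohta Amagasa | JPN | male | 2967.0 | NaN | NaN | NaN | NaN | NaN |
| 4 | Tomoa Narasaki | JPN | male | 2459.0 | 415.0 | 2860.0 | NaN | NaN | NaN |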


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 551 entries, 0 to 550
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              551 non-null    object 
 1   country           551 non-null    object 
 2   gender            551 non-null    object 
 3   boulder_points    429 non-null    float64
 4   lead_points       316 non-null    float64
 5   combined_points   195 non-null    float64
 6   highest_grade     157 non-null    float64
 7   count_8c_plus     61 non-null     float64
 8   avg_grade_first5  157 non-null    float64
dtypes: float64(6), object(3)
memory usage: 38.9+ KB


In [5]:
df.shape

(551, 9)

Our dataset has a total of **551 unique climbers**, described by 9 columns.
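
`df.shape` only reports the row count, so whether every row is a distinct climber has to be checked separately. A minimal sketch of such a check, assuming the `name` column identifies a climber; the `sample` frame and its values are hypothetical stand-ins for `final_data.csv`:

```python
import pandas as pd

# Hypothetical miniature stand-in for final_data.csv (illustrative values only).
sample = pd.DataFrame({
    "name": ["Climber A", "Climber B", "Climber A", "Climber C"],
    "country": ["JPN", "KOR", "JPN", "FRA"],
    "gender": ["male", "male", "male", "female"],
})

# True only if no name appears more than once.
names_unique = sample["name"].is_unique

# Rows that share a name with another row; keep=False marks every copy.
repeated = sample[sample["name"].duplicated(keep=False)]

n_rows = len(sample)
n_distinct = sample["name"].nunique()
print(f"{n_rows} rows, {n_distinct} distinct names, all unique: {names_unique}")
```

On the real data, replacing `sample` with `df` would confirm the claim if `names_unique` is `True`. Two different climbers can share a name, so any rows in `repeated` would need a manual look rather than automatic removal.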

#### Summary Statistics
Now let's look at the summary statistics of our dataset.

In [6]:
df.describe()

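|       | boulder_points | lead_points | combined_points | highest_grade | count_8c_plus | avg_grade_first5 |
|-------|----------------|-------------|-----------------|---------------|---------------|------------------|
| count | 429.0 | 316.0 | 195.0 | 157.0 | 61.0 | 157.0 |
| mean  | 301.124242 | 372.65538 | 857.554872 | 21.847134 | 16.180328 | 19.323439 |
| std   | 581.40167 | 643.704321 | 1311.443736 | 4.081173 | 51.41806 | 3.806938 |
| min   | 1.0 | 1.0 | 4.0 | 3.0 | 1.0 | 3.0 |
| 25%   | 12.0 | 12.0 | 50.0 | 20.0 | 1.0 | 18.0 |
| 50%   | 69.0 | 48.5 | 237.8 | 23.0 | 5.0 | 20.21 |
| 75%   | 282.5 | 421.0 | 1076.75 | 25.0 | 12.0 | 21.79 |
| max   | 3835.0 | 3220.0 | 6508.0 | 29.0 | 388.0 | 25.54 |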


The `count` row gives the number of non-null values per column, which tells us:

**Competition Participation:**
  - 429 climbers have points in **bouldering**
  - 316 climbers have points in **lead**
  - 195 climbers have points in **combined**

**Profile Data:**
  - Out of the 551 climbers, only **157** have an **8a.nu profile** (a non-null `highest_grade`)
    - Of those, only **61** have logged at least one ascent graded **8c+ or above**
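
The figures above are the non-null counts of each column, so they can be pulled out directly instead of being read off the `describe()` output. A small sketch on a hypothetical toy frame (the values are illustrative, not taken from `final_data.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN means the climber has no result or profile entry.
sample = pd.DataFrame({
    "boulder_points": [100.0, 50.0, np.nan, 12.0],
    "lead_points": [80.0, np.nan, 48.5, np.nan],
    "combined_points": [200.0, np.nan, np.nan, np.nan],
    "highest_grade": [np.nan, 23.0, 25.0, np.nan],
})

# Non-null count per column, the same number describe() reports as "count".
non_null = sample.notna().sum()

# Climbers with points in each discipline.
participation = non_null[["boulder_points", "lead_points", "combined_points"]]

# Climbers with a profile, assuming a non-null highest_grade marks one.
has_profile = int(sample["highest_grade"].notna().sum())

print(participation)
print(f"Climbers with a profile: {has_profile}")
```

Treating a non-null `highest_grade` as "has a profile" is an assumption of this sketch; it matches the 157 count shared by `highest_grade` and `avg_grade_first5`.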

#### Gender Distribution
Let's look at the gender distribution of our dataset.

In [14]:
print("Gender Distribution:")
# Calculate gender counts and percentages
gender_counts = df['gender'].value_counts()
gender_percentages = df['gender'].value_counts(normalize=True) * 100

# Print compact output
for gender, count in gender_counts.items():
    print(f"{gender.capitalize()}: {count} climbers ({gender_percentages[gender]:.2f}%)")

Gender Distribution:
Male: 294 climbers (53.36%)
Female: 257 climbers (46.64%)


The data shows a near-equal split between male and female climbers. This distribution aligns with expectations, as the IFSC strives to maintain balanced gender representation in its competitions, such as World Cups and Championships.
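
The split above is computed over all rows at once. One possible follow-up is to check whether the balance also holds within each discipline, by counting non-null point values per gender. The sketch below uses a hypothetical toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (illustrative values only).
sample = pd.DataFrame({
    "gender": ["male", "female", "male", "female", "female"],
    "boulder_points": [10.0, 20.0, np.nan, 5.0, np.nan],
    "lead_points": [np.nan, 7.0, 3.0, np.nan, 9.0],
})

disciplines = ["boulder_points", "lead_points"]

# GroupBy.count() counts non-null values, i.e. climbers with points per discipline.
per_discipline = sample.groupby("gender")[disciplines].count()

# Share of each gender within a discipline, in percent.
per_discipline_pct = (per_discipline / per_discipline.sum() * 100).round(2)

print(per_discipline)
print(per_discipline_pct)
```

On the real data, `sample` would be replaced by `df` and `combined_points` added to `disciplines`.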

#### NaN Values

In [15]:
# Fraction of missing (NaN) values in each column
df.isnull().sum() / df.shape[0]

name                0.000000
country             0.000000
gender              0.000000
boulder_points      0.221416
lead_points         0.426497
combined_points     0.646098
highest_grade       0.715064
count_8c_plus       0.889292
avg_grade_first5    0.715064
dtype: float64
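
The fractions above are easier to compare when shown next to the raw counts and sorted. A minimal sketch of such a summary on a hypothetical toy frame; the column names `a`, `b`, `c` are placeholders, not columns of `final_data.csv`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with 0, 1 and 3 missing values per column.
sample = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, 3.0, np.nan],
})

# isna().mean() is equivalent to isnull().sum() / number of rows.
missing_fraction = sample.isna().mean()

missing = pd.DataFrame({
    "missing_count": sample.isna().sum(),
    "missing_pct": (missing_fraction * 100).round(2),
}).sort_values("missing_pct", ascending=False)

print(missing)
```

Applied to `df`, this would list `count_8c_plus` first, since it has the highest share of missing values (about 89%).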