# Race Trend Analysis
The goal of this notebook is to analyze trends across the entire data range and see what insights can be gleaned from them.

In [1]:
import pandas as pd
import seaborn as sns

In [2]:
df = pd.read_csv("datasets/UM _RACEDATA_RAW.csv") 


Calculate ```athlete_age``` as of 2022 from the athlete's year of birth

In [3]:
df['athlete_age'] = 2022 - df['Athlete year of birth']
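Note that the cell above measures every age relative to 2022, so it does not reflect how old a runner was on race day. If age at race time is what matters for the trend analysis, a minimal sketch could look like the following. It assumes the raw ```Year of event``` column is still present at this point, and it uses a small made-up frame and a hypothetical ```age_at_race``` column name in place of the real data.

```python
import pandas as pd

# Toy stand-in for the raw data; the values are illustrative only.
sample = pd.DataFrame({
    "Year of event": [2018, 2018, 2020],
    "Athlete year of birth": [1983.0, 1991.0, None],
})

# Age in the year of the race rather than relative to a fixed 2022.
# Rows with a missing birth year stay NaN and can be dropped later.
sample["age_at_race"] = sample["Year of event"] - sample["Athlete year of birth"]
print(sample)
```

Because the subtraction propagates missing birth years as ```NaN```, the later ```dropna()``` step would still remove those rows before the cast to ```int```.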

Keep only ```USA``` runners

In [4]:
df = df[df['Athlete country'].isin(['USA'])]

Remove the following columns:
* ```Athlete club```
* ```Athlete country```
* ```Athlete year of birth```
* ```Athlete age category```
* ```Event number of finishers```

In [5]:
df = df.drop(['Athlete club', 'Athlete country', 'Athlete year of birth', 'Athlete age category', 'Event number of finishers'], axis = 1)

Rename column headers

In [6]:
df = df.rename(columns = { 'Year of event' : 'year',
                           'Event dates' : 'race_day',
                           'Event name' : 'race_name',
                           'Event distance/length' : 'race_length',
                           'Athlete performance' : 'athlete_performance',
                           'Athlete gender' : 'athlete_gender',
                           'Athlete average speed' : 'athlete_average_speed',
                           'Athlete ID' : 'athlete_id'
})

Removing ```null``` values

In [7]:
df = df.dropna()

Drop ```duplicate``` rows

In [8]:
df = df.drop_duplicates()

Fix types:
* ```athlete_age``` to ```int```
* ```athlete_average_speed``` to ```float```

In [9]:
df['athlete_age'] = df['athlete_age'].astype(int)

Convert the ```'athlete_average_speed'``` column to numeric, coercing errors to ```NaN```

In [12]:
df['athlete_average_speed'] = pd.to_numeric(df['athlete_average_speed'], errors='coerce')
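To illustrate what ```errors='coerce'``` does, here is a small sketch on made-up values (the strings below are not taken from the dataset). Entries that cannot be parsed as numbers become ```NaN```, which is why a second ```dropna()``` follows the conversion.

```python
import pandas as pd

# Illustrative values only: two parseable speeds and one bad entry.
raw = pd.Series(["8.141", "6.914", "unknown"])

# Unparseable strings are replaced with NaN instead of raising an error.
speeds = pd.to_numeric(raw, errors="coerce")

print(speeds.dtype)         # the result is already a float column
print(speeds.isna().sum())  # number of values that failed to parse
```

Counting the ```NaN``` values before dropping them is a cheap way to see how many rows the conversion is about to cost.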

In [13]:
df = df.dropna()

In [14]:
df['athlete_average_speed'] = df['athlete_average_speed'].astype(float)

Count how many different race lengths there are, then filter to the four most popular ultramarathon race distances

In [18]:
print(df['race_length'].value_counts(dropna=False))

race_length
50km             620836
50mi             278362
100mi            117543
100km             57324
24h               46990
                  ...  
1006km                1
875km                 1
1016km                1
80km/3Etappen         1
72.5km                1
Name: count, Length: 971, dtype: int64


### NOTE: Break off and look at 24-hour races separately

Focus only on the most popular races: 
* ```50km```
* ```50mi```
* ```100km```
* ```100mi```


In [20]:
# .copy() makes the filtered frame independent, so later column assignments do not warn
df = df[df['race_length'].isin(['50km', '50mi', '100mi', '100km'])].copy()
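The four distances above are hard-coded from the ```value_counts()``` output. As an alternative sketch, the top four could be derived from the counts directly. The frame below is a made-up stand-in, and ```top_lengths``` and ```filtered``` are hypothetical names.

```python
import pandas as pd

# Toy data whose relative counts loosely mirror the ranking shown above.
sample = pd.DataFrame({
    "race_length": ["50km"] * 5 + ["50mi"] * 4 + ["100mi"] * 3
                   + ["100km"] * 2 + ["24h"]
})

# Take the four most frequent race lengths instead of listing them by hand.
top_lengths = sample["race_length"].value_counts().nlargest(4).index

# Keep only rows for those lengths; .copy() gives an independent frame.
filtered = sample[sample["race_length"].isin(top_lengths)].copy()
print(filtered["race_length"].value_counts())
```

The trade-off is that a count-based selection changes silently if the underlying data changes, whereas the explicit list keeps the analysis pinned to the same four distances.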

Convert ```'race_length'``` into a ```category``` type

In [22]:
df['race_length'] = df['race_length'].astype('category')
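A plain ```category``` conversion orders the labels alphabetically, which does not match the actual distances. If sorted output or plot ordering matters later, one option is an ordered categorical. This is a sketch on made-up values, not something the notebook does, and ```length_order``` is a hypothetical name.

```python
import pandas as pd

# Shortest to longest: 50 km < 50 mi (about 80 km) < 100 km < 100 mi (about 161 km).
length_order = pd.CategoricalDtype(
    categories=["50km", "50mi", "100km", "100mi"], ordered=True
)

lengths = pd.Series(["100mi", "50km", "100km", "50mi"]).astype(length_order)

# Sorting now follows race distance rather than alphabetical order.
print(lengths.sort_values().tolist())
```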

Filter for ```Male``` & ```Female``` athletes

In [26]:
df = df[df['athlete_gender'].isin(['M', 'F'])].copy()

Convert ```'athlete_gender'``` into a ```category``` type

In [27]:
df['athlete_gender'] = df['athlete_gender'].astype('category')


In [29]:
df.dtypes

year                        int64
race_day                   object
race_name                  object
race_length              category
athlete_performance        object
athlete_gender           category
athlete_average_speed     float64
athlete_id                  int64
athlete_age                 int64
dtype: object

In [30]:
df.head(30)

| | year | race_day | race_name | race_length | athlete_performance | athlete_gender | athlete_average_speed | athlete_id | athlete_age |
|---|---|---|---|---|---|---|---|---|---|
| 55 | 2018 | 06.01.2018 | Yankee Springs 50 Mile Winter Challenge (USA) | 50mi | 9:53:05 h | M | 8.141 | 55 | 39 |
| 58 | 2018 | 06.01.2018 | Yankee Springs 50 Mile Winter Challenge (USA) | 50mi | 11:38:17 h | M | 6.914 | 58 | 36 |
| 59 | 2018 | 06.01.2018 | Yankee Springs 50 Mile Winter Challenge (USA) | 50mi | 11:56:35 h | M | 6.738 | 59 | 34 |
| 60 | 2018 | 06.01.2018 | Yankee Springs 50 Mile Winter Challenge (USA) | 50mi | 12:32:16 h | M | 6.418 | 60 | 27 |
| 61 | 2018 | 06.01.2018 | Yankee Springs 50 Mile Winter Challenge (USA) | 50mi | 12:39:36 h | M | 6.356 | 61 | 43 |
| 62 | 2018 | 06.01.2018 | Yankee Springs 50 Mile Winter Challenge (USA) | 50mi | 12:39:36 h | F | 6.356 | 62 | 45 |
| 63 | 2018 | 06.01.2018 | Yankee Springs 50 Mile Winter Challenge (USA) | 50mi | 13:24:05 h | F | 6.004 | 63 | 32 |
| 64 | 2018 | 06.01.2018 | Yankee Springs 50 km Winter Challenge (USA) | 50km | 5:09:40 h | F | 9.688 | 64 | 31 |
| 65 | 2018 | 06.01.2018 | Yankee Springs 50 km Winter Challenge (USA) | 50km | 6:00:47 h | M | 8.315 | 65 | 46 |
| 66 | 2018 | 06.01.2018 | Yankee Springs 50 km Winter Challenge (USA) | 50km | 6:02:02 h | F | 8.287 | 66 | 30 |


Next steps:
* Remove the trailing ```h``` from ```athlete_performance```
* Standardize ```athlete_performance``` values
* Remove ```(USA)``` from ```race_name```
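The next steps above could be sketched roughly as follows. The two rows are copied from the ```df.head()``` output, ```performance_hours``` is a hypothetical column name, and the sketch assumes every time looks like ```H:MM:SS h```. Any other format in the full dataset would come out as ```NaN``` because of ```errors="coerce"```.

```python
import pandas as pd

# Two rows taken from the head() output above.
sample = pd.DataFrame({
    "race_name": [
        "Yankee Springs 50 Mile Winter Challenge (USA)",
        "Yankee Springs 50 km Winter Challenge (USA)",
    ],
    "athlete_performance": ["9:53:05 h", "5:09:40 h"],
})

# Drop the trailing " h", then parse H:MM:SS into a timedelta.
cleaned = sample["athlete_performance"].str.replace(r"\s*h$", "", regex=True)
durations = pd.to_timedelta(cleaned, errors="coerce")

# Store the finishing time as decimal hours so it can be plotted or averaged.
sample["performance_hours"] = durations.dt.total_seconds() / 3600

# Strip the trailing country tag from the race name.
sample["race_name"] = sample["race_name"].str.replace(r"\s*\(USA\)$", "", regex=True)

print(sample)
```

Decimal hours is only one possible standard form. Keeping the ```timedelta``` values themselves would work just as well if later cells need time arithmetic.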