<a href="https://www.kaggle.com/code/vtrackstar/machine-learning-project-olympic-t-f?scriptVersionId=194820828" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Ask
* The goal of this project is to predict future Olympic medalists in track and field events using over 100 years of historical data. 
* By analyzing trends in athlete performance, country dominance, and changes in event characteristics, I will develop machine learning models to forecast potential medalists.

## Prepare and Process

In [21]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

file_path = '/kaggle/input/olympic-track-field-results/results.csv'

# Skip malformed lines; re-raise on failure so later code does not run without df
try:
    df = pd.read_csv(file_path, on_bad_lines='skip')
except Exception as e:
    print(f"An error occurred: {e}")
    raise

# Display the first few rows to verify successful loading
print(df.head())

# Show distinct values in the "Event" column
if 'Event' in df.columns:
    distinct_events = df['Event'].unique()
    print("Distinct values in 'Event' column:")
    print(distinct_events)
else:
    print("'Event' column not found in the DataFrame.")
    
# Define keywords to filter out
keywords = ["Relay", "Decathlon", "Heptathlon"]

# Create a regex pattern to match any of the keywords
pattern = '|'.join(keywords)

# Drop rows where 'Event' column contains any of the keywords
df_cleaned = df[~df['Event'].str.contains(pattern, case=False, na=False)]

# Display the first few rows of the cleaned DataFrame
print(df_cleaned.head())

  Gender       Event Location  Year Medal                   Name Nationality  \
0      M  10000M Men      Rio  2016     G          Mohamed FARAH         USA   
1      M  10000M Men      Rio  2016     S  Paul Kipngetich TANUI         KEN   
2      M  10000M Men      Rio  2016     B           Tamirat TOLA         ETH   
3      M  10000M Men  Beijing  2008     G        Kenenisa BEKELE         ETH   
4      M  10000M Men  Beijing  2008     S         Sileshi SIHINE         ETH   

     Result  
0  25:05.17  
1  27:05.64  
2  27:06.26  
3  27:01.17  
4  27:02.77  
Distinct values in 'Event' column:
['10000M Men' '100M Men' '110M Hurdles Men' '1500M Men' '200M Men'
 '20Km Race Walk Men' '3000M Steeplechase Men' '400M Hurdles Men'
 '400M Men' '4X100M Relay Men' '4X400M Relay Men' '5000M Men'
 '50Km Race Walk Men' '800M Men' 'Decathlon Men' 'Discus Throw Men'
 'Hammer Throw Men' 'High Jump Men' 'Javelin Throw Men' 'Long Jump Men'
 'Marathon Men' 'Pole Vault Men' 'Shot Put Men' 'Triple Jump Men'

In [22]:
file_path1 = '/kaggle/input/paris2024-olympics-country-level-data/Paris2024_olympics_country_data.csv'
df1 = pd.read_csv(file_path1)

# Display the first few rows of the DataFrame to understand its structure
print(df1.head())

# Add Country column from Olympic Medal Dataset to Olympic Track and Field Dataset
# how='left' keeps every result row; Country is NaN where Nationality has no matching 'Country Code'
merged_df = pd.merge(df_cleaned, df1[['Country Code', 'Country']], left_on='Nationality', right_on='Country Code', how='left')

# Drop the extra 'Country Code' column, since only 'Country' is needed
merged_df = merged_df.drop(columns='Country Code')

print(merged_df.head())

# Function to convert "Result" column time formats to seconds
def convert_to_seconds(Result):
    # Anything that is not a non-empty string cannot be parsed
    if not isinstance(Result, str) or not Result.strip():
        return np.nan

    # Remove unwanted text like " est" or any other text after a space
    Result = Result.split()[0]

    # Skip entries with a dash, assuming they are distances and not times
    if '-' in Result:
        return np.nan

    try:
        # Handle the case where the format is '1h19' or similar
        if 'h' in Result:
            hours, rest = Result.split('h')
            minutes, seconds = 0, 0
            if ':' in rest:
                parts = rest.split(':')
                if len(parts) == 2:
                    minutes = float(parts[0])
                    seconds = float(parts[1])
            else:
                minutes = float(rest)
            total_seconds = float(hours) * 3600 + minutes * 60 + seconds
            return round(total_seconds, 2)

        if ':' in Result:
            # Convert HH:MM:SS or MM:SS format to seconds
            parts = [float(p) for p in Result.split(':')]
            if len(parts) == 3:  # HH:MM:SS format
                hours, minutes, seconds = parts
                return round(hours * 3600 + minutes * 60 + seconds, 2)
            if len(parts) == 2:  # MM:SS format
                minutes, seconds = parts
                return round(minutes * 60 + seconds, 2)
            # Any other number of fields is not a recognized time format
            return np.nan

        # Plain number: treat it as already being in seconds (SS.XX)
        return round(float(Result), 2)
    except ValueError:
        # Malformed numeric fields (or more than one 'h') cannot be converted
        return np.nan

# Add the 'Result(S)' column to merged_df
merged_df['Result(S)'] = merged_df['Result'].apply(convert_to_seconds)

# Remove rows where 'Result(S)' is NaN (results that could not be parsed).
# Note: plain decimal marks, such as field-event distances, still parse as floats and are kept.
merged_df = merged_df.dropna(subset=['Result(S)'])

# Display the updated DataFrame to confirm the new column
print(merged_df[['Result', 'Result(S)']].head())

   Unnamed: 0         Country Country Code  Number of athletes  Gold medals  \
0           0     Afghanistan          AFG                   6            0   
1           1         Albania          ALB                   8            0   
2           2         Algeria          DZA                  45            2   
3           3  American Samoa          ASM                   2            0   
4           4         Andorra          AND                   2            0   

   Silver medals  Bronze medals  Total medals           GDP  GDP per capita  \
0              0              0             0  1.450216e+10      352.603733   
1              0              2             2  1.891638e+10     6810.114041   
2              0              1             3  2.255603e+11     5023.252932   
3              0              0             0  8.710000e+08    19673.390102   
4              0              0             0  3.380602e+09    42350.697069   

   Population  Life expectancy  Democracy  Gender 

## Analyze

In [23]:
# Country Dominance
# Mapping abbreviations to full names
medal_mapping = {'G': 'Gold', 'S': 'Silver', 'B': 'Bronze'}

# Apply the mapping to convert abbreviations to full names
merged_df['Medal'] = merged_df['Medal'].map(medal_mapping)

# Count the number of each type of medal by country and event
medal_counts = (
    merged_df.groupby(['Country', 'Event'])['Medal']
    .value_counts()
    .unstack(fill_value=0)
    # unstack() orders the medal columns alphabetically (Bronze, Gold, Silver),
    # so select them by name rather than renaming them by position
    .reindex(columns=['Gold', 'Silver', 'Bronze'], fill_value=0)
    .reset_index()
)

# Remove the leftover 'Medal' label from the column index
medal_counts.columns.name = None

print("Medal Counts by Country and Event:")
print(medal_counts)

Medal Counts by Country and Event:
           Country               Event  Gold  Silver  Bronze
0        Argentina        Marathon Men     2       1       0
1        Argentina     Triple Jump Men     0       1       0
2        Australia          10000M Men     0       0       3
3        Australia  100M Hurdles Women     1       0       0
4        Australia            100M Men     0       0       2
..             ...                 ...   ...     ...     ...
426  United States        Shot Put Men    14      14      10
427  United States      Shot Put Women     1       0       1
428  United States     Triple Jump Men     7       5       2
429      Venezuela     Triple Jump Men     0       0       1
430      Venezuela   Triple Jump Women     0       1       0

[431 rows x 5 columns]


In [None]:
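# Number of Olympic Appearances
# Each row of merged_df is one medal-winning result, so count() tallies medal
# results per athlete rather than distinct Games
# (merged_df.groupby('Name')['Year'].nunique() would count distinct Olympic years)
athlete_longevity = merged_df.groupby('Name')['Year'].count().reset_index()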
athlete_longevity.columns = ['Name', 'Number of Appearances']
athlete_longevity = athlete_longevity.sort_values(by='Number of Appearances', ascending=False)

print("Number of Olympic Appearances by Athlete:")
print(athlete_longevity)

# Number of Medals by Olympic Athlete
# Count medals by athlete
athlete_medals = merged_df.groupby('Name')['Medal'].count().reset_index()
athlete_medals.columns = ['Name', 'Number of Medals']
athlete_medals = athlete_medals.sort_values(by='Number of Medals', ascending=False)

print("Number of Medals by Athlete:")
print(athlete_medals)

# Merge the two DataFrames
athlete_summary = pd.merge(athlete_longevity, athlete_medals, on='Name', how='left')
athlete_summary = athlete_summary.sort_values(by=['Number of Appearances', 'Number of Medals'], ascending=False)

print("Athlete Longevity and Medal Counts:")
print(athlete_summary)

# Search for a particular athlete
athlete_name = input("Enter the name of the athlete you want to search for: ")
athlete_info = athlete_summary.loc[athlete_summary['Name'] == athlete_name]

if not athlete_info.empty:
    print(f"Information for {athlete_name}:")
    print(athlete_info)
else:
    print(f"No data found for {athlete_name}.")


Number of Olympic Appearances by Athlete:
                          Name  Number of Appearances
1129               Paavo NURMI                      7
1005             Merlene OTTEY                      7
1399           Tirunesh DIBABA                      6
634         Irena KIRSZENSTEIN                      6
1187                Ralph ROSE                      5
...                        ...                    ...
584               Herb ELLIOTT                      1
583             Henry STALLARD                      1
582      Henry JONSSON-KÄLARNE                      1
581             Henry ERIKSSON                      1
1570             Ödön FÖLDESSY                      1

[1571 rows x 2 columns]
Number of Medals by Athlete:
                          Name  Number of Medals
1129               Paavo NURMI                 7
1005             Merlene OTTEY                 7
1399           Tirunesh DIBABA                 6
634         Irena KIRSZENSTEIN                 6
1187       

In [None]:
# 400m Regression Analysis and Evaluation (Men and Women)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Split the data by 400m Men and 400m Women
df_men_400m = merged_df[merged_df['Event'] == '400M Men'].copy()
df_women_400m = merged_df[merged_df['Event'] == '400M Women'].copy()

# Regression to predict 'Result(S)'
X_men = df_men_400m[['Year']]
y_men = df_men_400m['Result(S)']

X_women = df_women_400m[['Year']]
y_women = df_women_400m['Result(S)']

X_train_men, X_test_men, y_train_men, y_test_men = train_test_split(X_men, y_men, test_size=0.3, random_state=42)
X_train_women, X_test_women, y_train_women, y_test_women = train_test_split(X_women, y_women, test_size=0.3, random_state=42)

# Train Linear Regression model
lr_model_men = LinearRegression()
lr_model_men.fit(X_train_men, y_train_men)
y_pred_men = lr_model_men.predict(X_test_men)

lr_model_women = LinearRegression()
lr_model_women.fit(X_train_women, y_train_women)
y_pred_women = lr_model_women.predict(X_test_women)

# Evaluation
print("Men's 400m Regression Analysis Evaluation:")
print("Mean Absolute Error:", mean_absolute_error(y_test_men, y_pred_men))
print("R-squared:", r2_score(y_test_men, y_pred_men))

print("Women's 400m Regression Analysis Evaluation:")
print("Mean Absolute Error:", mean_absolute_error(y_test_women, y_pred_women))
print("R-squared:", r2_score(y_test_women, y_pred_women))


## Share
Interactive Tableau Dashboard Coming Soon
### 400M Olympic Regression Analysis Interpretation
* Men's 400m Regression Analysis:
    * Mean Absolute Error (MAE): **0.966 seconds**
    * On average, the model's predictions on the held-out test set differ from the actual race times by about 0.966 seconds.
    * R-squared (R²): **0.709**
    * The model explains approximately 70.9% of the variance in men's 400m times in the test set, indicating a strong linear relationship between the year of the race and the time.
* Women's 400m Regression Analysis:
    * Mean Absolute Error (MAE): **0.739 seconds**
    * On average, the model's predictions differ from the actual race times by about 0.739 seconds. This is lower than the men's MAE, but on its own it does not show that the model is accurate, because MAE says nothing about how widely the times vary.
    * R-squared (R²): **-0.002**
    * An R-squared value slightly below zero means the model predicts no better than always guessing the mean time of the test set. A linear trend in the year alone does not capture the variation in women's 400m times.
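
The contrast between a small MAE and a near-zero R² follows from how R² is defined: it compares the model's squared error with that of a baseline that always predicts the mean of the observed times. The sketch below illustrates this with invented times (not the notebook's data); the helper names `mae` and `r2` are hypothetical stand-ins for `mean_absolute_error` and `r2_score`.

```python
# Illustrative only: the times below are invented, not taken from results.csv.

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, where SS_tot uses the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical test-set times (seconds) clustered in a narrow band
y_true = [49.2, 49.8, 50.1, 49.5, 50.4]
# Hypothetical predictions that miss the small race-to-race variation
y_pred = [50.0, 49.4, 49.6, 50.2, 49.7]
# Baseline: always predict the mean of the observed times
baseline = [sum(y_true) / len(y_true)] * len(y_true)

print("model    MAE:", round(mae(y_true, y_pred), 3), "R2:", round(r2(y_true, y_pred), 3))
print("baseline MAE:", round(mae(y_true, baseline), 3), "R2:", round(r2(y_true, baseline), 3))
```

In this toy case the model's MAE is well under a second, yet its R² is negative because the mean-only baseline has a smaller squared error.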

## Act
### 400M Regression Model Insights
* The regression model captures the trend in men's 400m times over the years, with a high R-squared value (0.709). This suggests that most of the change in medal-winning times is explained by the year of the race.
    * To build on this, future analyses could incorporate additional variables, such as technological advancements or training techniques, to refine predictions and explore the factors behind the observed trend.
* For women's 400m, the regression model is less effective, as indicated by the negative R-squared value (-0.002). The year alone does not explain the variability in race times.
    * To address this, future analyses could include more features or explore other modeling techniques. Additional factors, such as changes in training practices, coaching strategies, or athlete nutrition over the years, may provide better insight into trends in women's 400m times.
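
As one possible way to "include more features", the sketch below compares a year-only fit with a fit that also uses the medal position. It is a minimal illustration under stated assumptions: the rows are invented, `medal_rank` (1 = Gold, 2 = Silver, 3 = Bronze) is a hypothetical encoding of the `Medal` column, and the closed-form least-squares code stands in for `LinearRegression`.

```python
# Illustrative only: rows are invented (year, medal_rank, time_s), not from results.csv.
rows = [
    (2000, 1, 49.1), (2000, 2, 49.6), (2000, 3, 49.7),
    (2004, 1, 49.4), (2004, 2, 49.6), (2004, 3, 49.9),
    (2008, 1, 49.6), (2008, 2, 49.7), (2008, 3, 49.9),
    (2012, 1, 49.5), (2012, 2, 49.7), (2012, 3, 49.8),
]

def mean(values):
    return sum(values) / len(values)

def dev_sum(a, b):
    """Sum of products of deviations from the means of a and b."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b))

def r2(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = sum((t - p) ** 2 for t, p in zip(y, y_hat))
    return 1 - ss_res / dev_sum(y, y)

year = [r[0] for r in rows]
rank = [r[1] for r in rows]
time_s = [r[2] for r in rows]

# Model A: time ~ year (simple least squares, closed form)
slope = dev_sum(year, time_s) / dev_sum(year, year)
intercept = mean(time_s) - slope * mean(year)
pred_year = [intercept + slope * x for x in year]

# Model B: time ~ year + medal_rank (two-feature least squares via the normal equations)
s11, s22, s12 = dev_sum(year, year), dev_sum(rank, rank), dev_sum(year, rank)
s1y, s2y = dev_sum(year, time_s), dev_sum(rank, time_s)
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
b0 = mean(time_s) - b1 * mean(year) - b2 * mean(rank)
pred_both = [b0 + b1 * x + b2 * k for x, k in zip(year, rank)]

r2_year = r2(time_s, pred_year)
r2_both = r2(time_s, pred_both)
print("R2, year only:        ", round(r2_year, 3))
print("R2, year + medal rank:", round(r2_both, 3))
```

In-sample R² can never decrease when a feature is added to a least-squares fit, so a comparison on held-out data would be needed before concluding that the extra feature helps.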