# Exploratory Analysis of FIS Ski Jumping Data

# Introduction

Ski jumping shows a persistent gap in performance between athletes from different countries. Athletes from a small set of countries (mostly European countries and Japan) consistently outperform those from the rest. The ultimate purpose of this project is to determine the root causes of this discrepancy by analyzing ski jumping data.

# Data

**Data Collection**

All of the data for this project was collected from the FIS website. Most of it, covering competition results and hill specifications, was collected and made publicly available by Kaggle user wrotki8778, who downloaded and scraped the competition result PDFs. The remaining data on the athletes was scraped directly from the FIS site by me.

**Data Cleaning/Feature Engineering**

The following steps are taken in the next code cell to make the data more usable for this project:

* A single dataframe is created by combining columns from the competition, results, and athletes datasets.
* Rows containing NaN values are dropped from the dataset.
* Date-time values are parsed so that *birthdate* contains only the birth year and *date* (the competition date) contains only the month (the year is already stated in the *season* column).
* Inconsistent spellings of club names are reconciled by creating a *club_id* column, which gives clubs with sufficiently similar names the same id. The matching is likely imperfect and somewhat up to interpretation.
* Columns containing strings that represent numerical values are converted to a numerical data type.
* Flying hill and regular hill results are differentiated by creating the *flying_hill* column: True for flying hills (K-point above 160 m), False otherwise.
* The distance travelled by an athlete is expressed as a fraction of the hill's K-point in the column *percent_k*. This allows distances to be compared across different hill sizes.
* The approximate age of the athlete (season minus birth year) is stored in the column *age*.
* Whether the athlete is jumping in their home country is stored as a bool in the column *home_comp*.
* Whether the athlete is from one of the aforementioned top-performing countries is stored in the column *is_euro*. Despite the name, this includes Japanese athletes and excludes athletes from European countries that do not consistently outperform others (e.g. Britain, Russia).
* A rough approximation of the experience of an athlete is stored in the *years_seen* column. This contains the number of different seasons in which the athlete appears in the dataset.
* The size of a club (number of different athletes with the same club id) is contained in the *club_size* column.
* The size of a nation (number of different athletes with the same nationality) is contained in the *nation_size* column.
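
To make the club id step more concrete, here is a minimal sketch of the idea: sort the names, compare each name with the one before it, and start a new id whenever the similarity drops below a threshold. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for the fuzzywuzzy score used in the actual cell, so the scores will not be identical. The helper names and the club strings are illustrative assumptions, not rows from the dataset.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 100]; a stand-in for a fuzzywuzzy-style score."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def assign_group_ids(sorted_names, min_score=70):
    """Give consecutive names the same id when they are similar enough."""
    ids = []
    current = 0
    for i, name in enumerate(sorted_names):
        # start a new group unless this name closely matches the previous one
        if i == 0 or similarity(sorted_names[i - 1], name) <= min_score:
            current += 1
        ids.append(current)
    return ids


# illustrative, already lower-cased club names (not taken from the dataset)
clubs = sorted(["sc oberstdorf", "sc oberstdorf e.v.", "wks zakopane", "sk triglav kranj"])
club_ids = dict(zip(clubs, assign_group_ids(clubs)))

for name, cid in club_ids.items():
    print(cid, name)
```

Because only neighbours in the sorted list are compared, two spellings that sort far apart (for example a name with a different leading word) would still end up with different ids. This is one reason the matching should be treated as approximate.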



In [None]:
# package imports
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns; sns.set_theme(color_codes=True)
import os
from sklearn.feature_selection import mutual_info_regression
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
import time


#for dirname, _, filenames in os.walk('/kaggle/input'):
 #   for filename in filenames:
  #      print(os.path.join(dirname, filename))     

# data imports
ratings = pd.read_csv('../input/ski-jumping-results-database-2009now/all_ratings.csv')
names = pd.read_csv('../input/ski-jumping-results-database-2009now/all_names.csv')
comps = pd.read_csv('../input/ski-jumping-results-database-2009now/all_comps_r.csv')
results = pd.read_csv('../input/ski-jumping-results-database-2009now/all_results.csv')
stats = pd.read_csv('../input/ski-jumping-results-database-2009now/all_stats_r.csv')
athletes = pd.read_csv('../input/fis-ski-jumpers/all_athlete_data.csv')
athletes.drop(columns=['Unnamed: 0'], inplace=True)

#merge data into single dataset containing athlete, comp, hill, and results data
athletes_results = pd.merge(left = results, right = athletes, left_on='codex', right_on='xcode')
athletes_comp_results = pd.merge(left=comps, right=athletes_results, left_on = 'id', right_on = 'id')
#show all columns when visualizing
pd.set_option('display.max_columns', None)
#drop repeat columns
athletes_comp_results.rename(columns={'codex_x': 'comp_codex', 'hill_size_x': 'hill_size', 'xcode': 'athlete_codex', 'gender_y': 'gender','points':'total_points', 'loc': 'ranking'}, inplace=True)
athletes_comp_results.drop(columns=['gender_x', 'hill_size_y', 'codex_y'], inplace=True)

#drop individual judge scores
athletes_comp_results.drop(columns=['note_1', 'note_2', 'note_3', 'note_4', 'note_5'], inplace=True)

#drop missing values (copy so that later column assignments do not act on a view)
acr = athletes_comp_results.dropna().copy()


### Data Cleaning ###

#fix birthdate to contain only year
def dtstring_year(bd):
    try:
        return int(pd.to_datetime(bd).year)
    except ValueError:
        # unparseable or missing dates become NaN
        return np.nan
    except Exception:
        # otherwise assume the value is already a bare year
        return int(bd)

#fix comp date to only month
def dtstring_month(date):
    try:
        return int(pd.to_datetime(date).month)
    except Exception:
        return np.nan

#replace a single-character (whitespace) entry with NaN
def replace_nan(item):
    if len(item) == 1:
        return np.nan
    else:
        return item

#acr.club = acr.club.map(lambda c: replace_nan(c))
acr['birthdate'] = acr.birthdate.map(dtstring_year)
acr['date'] = acr.date.map(dtstring_month)
#acr.dropna(inplace=True)
acr['club'] = acr.club.map(lambda c: c.lower().strip())

#fix fuzzy matching for club names
#make a 'club id' column for similar enough club names 

clubs = acr[['club', 'nationality']]
club_names = np.sort(clubs.club.unique())

def matchscore(s1, s2, limit=1):
    return fuzz.partial_ratio(s1,s2)

def make_id(names, minscore):
    ids = [1]
    idnum = 1
    for i in range(len(names)-1):
        if matchscore(names[i+1], names[i]) > minscore:
            ids.append(idnum)
        else:
            idnum += 1
            ids.append(idnum)
    return ids

club_ids = make_id(club_names, 70)

club_id_df = pd.DataFrame({'club': club_names, 'club_id': club_ids})

acr = pd.merge(left = acr, right = club_id_df, left_on='club', right_on='club')


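#change cols to numerical values where possible
#(errors='ignore' is deprecated in recent pandas, so fall back explicitly)
def to_numeric_if_possible(col):
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

acr = acr.apply(to_numeric_if_possible)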


### Some Feature Engineering ###

def check_top_country(val):
    return val in ['GER', 'NOR', 'AUT', 'POL', 'SLO', 'JPN']

# separate flying hills as k-point > 160
acr.loc[:,'flying_hill'] = acr['k.point'] > 160
acr.loc[:,'percent_k'] = acr.dist/acr['k.point']
# age
acr['age'] = acr.season - acr.birthdate
# home_comp
acr['home_comp'] = acr['country'] == acr['nationality']
# is from a top competing country
is_top_li = [check_top_country(val) for val in acr['nationality']]
acr['is_euro'] = is_top_li
# num_years_seen
acr['years_seen'] = acr.groupby('athlete_codex').season.transform('nunique')
# club size
acr['club_size'] = acr.groupby('club_id').athlete_codex.transform('nunique')
# nation_size (number of individuals from a particular nation)
acr['nation_size'] = acr.groupby('nationality').athlete_codex.transform('nunique')

acr.head()

# Analysis

The following observations come from the table of correlations computed in the next cell:

* hill size and k point can probably be reduced to a single column.
* Points gained from distance are strongly correlated with the size of the hill. This needs to be taken into account when analyzing points.
* Performance columns, as expected, show some correlation with is_euro and nation_size.
* club_size shows little to no correlation with athlete performance.
* nation_size and club_size are negatively correlated.

The correlations involving the club_size column should be taken with a grain of salt: club size was calculated using club_id, an imperfect fix for inconsistent club names.

In [None]:
# look at correlations

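# numeric_only=True skips string columns, which recent pandas no longer drops silently
Mcorr = acr.drop(columns=['training']).corr(numeric_only=True)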
Mcorr

# Feature Importance

The correlation table showed a few notable relationships, but I want a list of features ranked by their ability to predict performance. Mutual information (MI) scores measure how much knowing a feature reduces uncertainty about the target: 0 means the feature and the target are independent, and although there is no fixed upper bound, scores above roughly 2 are uncommon in practice. This will be a useful metric. I will also train a random forest on the dataframe and list its feature importances. Ranking makes the most sense as a target: it corrects for the hill-size-related points discrepancies found earlier, and we won't have to worry about the different scoring system for ski flying competitions. Other useful targets to look at are total points and distance, which are finer grained and hold a little more information than the rank.

**Mutual Information between features and rank**

The following cell plots the mutual information scores of the features using rank as the target. Aside from the target leakage columns (points, distance, etc.), the features that provide the most information about rank are athletes, clubs, and then nations. Part of this ordering may simply reflect cardinality: a feature with more distinct values partitions the data more finely and can therefore score higher. Is_euro and years_seen carry a little less information about the target, and club size, interestingly, carries very little. Home_comp, whether an athlete is jumping in their home country, holds virtually no information about the target, suggesting that athletes do not tend to jump any better at home than abroad.

It is interesting to me that club holds more information than nationality. Further analysis should be conducted in this area.
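
To give a feel for the scale of these scores, here is a small sketch that computes mutual information (in nats) directly from counts for two discrete variables. This is the textbook plug-in formula, not what `mutual_info_regression` does internally (scikit-learn uses a nearest-neighbour estimator for continuous targets), and the toy lists are made up for illustration.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        # compare the joint probability with what independence would predict
        mi += p_joint * log(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi


feature = [0, 0, 1, 1]
dependent_target = [0, 0, 1, 1]    # fully determined by the feature
independent_target = [0, 1, 0, 1]  # unrelated to the feature

mi_dependent = mutual_information(feature, dependent_target)
mi_independent = mutual_information(feature, independent_target)
print(mi_dependent, mi_independent)
```

A binary feature that fully determines a binary target scores ln 2 (about 0.69), and an unrelated one scores 0. Higher scores are only possible when the variables take more distinct values, which is worth keeping in mind when comparing athlete, club, and nation.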

In [None]:
## get feat. importance with mutual information

def make_mi_scores(X, y):
    X = X.copy()
    for colname in X.select_dtypes(["object", "category"]):
        X[colname], _ = X[colname].factorize()
    # All discrete features should now have integer dtypes
    discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores


def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")
    return 0

X = acr.dropna()
y = X.pop('ranking')

mi_scores = make_mi_scores(X, y)

plt.figure(dpi=100, figsize=(10, 10))
plot_mi_scores(mi_scores)

**Feature importance using points as the target**

It is necessary to separate flying hill data from regular hill data when using total_points as the target, because flying hill competitions score distance differently. Hill size will also have a big impact on total points, as we saw in the correlation table. The following cell shows the impact of flying hills on points and separates flying and regular jumps into two datasets.
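
As a rough illustration of why the two hill types are not comparable on points, the sketch below computes distance points under the scoring convention I am assuming here: a jump to the K-point is worth 60 points on regular hills and 120 on flying hills, and each metre beyond or short of it adds or subtracts the hill's meter value (assumed 1.8 for a large hill and 1.2 for a flying hill). The function name and the example jumps are hypothetical, not values from the dataset.

```python
def distance_points(dist_m, k_point_m, meter_value, is_flying):
    """Distance points under the assumed convention described above."""
    # assumed base score for landing exactly on the K-point
    base = 120.0 if is_flying else 60.0
    return base + (dist_m - k_point_m) * meter_value


# two illustrative jumps, both 10% beyond the K-point
regular = distance_points(dist_m=132.0, k_point_m=120.0, meter_value=1.8, is_flying=False)
flying = distance_points(dist_m=220.0, k_point_m=200.0, meter_value=1.2, is_flying=True)

print(f"regular hill: {regular:.1f} points, percent_k = {132.0 / 120.0:.2f}")
print(f"flying hill:  {flying:.1f} points, percent_k = {220.0 / 200.0:.2f}")
```

Both jumps have the same percent_k, yet the flying hill jump earns far more distance points, so mixing the two in one points model would mostly measure hill type.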

In [None]:
# demonstration of flying hill impact 
sns.scatterplot(x=acr.type, y=acr.total_points, hue=acr.flying_hill)
sns.lmplot(x='total_points', y='percent_k', hue='flying_hill', data=acr)

# separate the two (copies, so later column assignments do not act on views of acr)
flying_df = acr[acr['flying_hill']].copy()
regular_df = acr[~acr['flying_hill']].copy()

**Feature importance with mutual information**

This time I will get rid of the target leakage columns before computing the feature importance. The results are similar to those obtained with rank as the target, and the importance of the individual (athlete name and codex) is especially obvious. Place also seems to share a lot of information with total points, which may reflect that points are awarded differently in different cities. Factors related to hill size (including the wind and gate factors) populate the higher end of the list, as expected. Wind appears to share some information with an athlete's score, perhaps indicating that wind factors are not always properly calibrated. Is_euro ranks far lower than nationality, suggesting that the distinction made in is_euro does not perfectly separate the higher and lower ranking countries. Club size, again, ranks far lower than club id. Home_comp again seems to share little to no information with total_points.

In [None]:
## get feat. importance with mutual information

#drop irrelevant/target-leakage columns
total_points_prediction_set = regular_df.drop(columns=['comp_codex','id','speed','dist','dist_points',
                                                       'note_points', 'wind_comp', 'gate_points', 'ranking'])

#drop NaNs
tp_reg = total_points_prediction_set.dropna()

X = tp_reg.drop(columns=['percent_k'])
y = X.pop('total_points')

mi_scores = make_mi_scores(X, y)

plt.figure(dpi=100, figsize=(10, 10))
plot_mi_scores(mi_scores)

**Feature importance with random forest**
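
A random forest needs numeric inputs, so string columns such as nationality have to be encoded first. The sketch below shows the two preprocessing ideas used in the next cell in plain Python: merging rare categories into 'other', then mapping each remaining category to an integer code in sorted order (similar in spirit to scikit-learn's `LabelEncoder`). The helper names, the threshold, and the toy list are assumptions for illustration only.

```python
from collections import Counter


def collapse_rare(values, min_count):
    """Replace categories seen fewer than min_count times with 'other'."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else "other" for v in values]


def label_encode(values):
    """Map each category to an integer code, assigned in sorted order."""
    classes = sorted(set(values))
    code = {c: i for i, c in enumerate(classes)}
    return [code[v] for v in values], classes


# toy nationality column (not taken from the dataset)
nationalities = ["NOR", "NOR", "NOR", "GER", "GER", "USA", "KAZ"]
collapsed = collapse_rare(nationalities, min_count=2)
codes, classes = label_encode(collapsed)

print(collapsed)
print(classes)
print(codes)
```

The integer codes carry no meaningful order. Tree-based models can still split on them, but the resulting importances should be read with that in mind.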



In [None]:

#get categorical and numerical columns
numerical_cols = [col for col in X.columns if X[col].dtype in ['float64', 'int64']]
categorical_cols = [col for col in X.columns if X[col].dtype == 'object']

## reduce the number of nationalities by merging the less common ones into 'other'
## (useful when encoding nationality for a numerical random forest implementation)
## note: X and y were already built in the previous cell, so this only affects
## data derived from acr in later cells

nat_counts = acr.nationality.value_counts()
acr['nationality'] = acr.nationality.where(acr.nationality.map(nat_counts) >= 2500, 'other')

## Label Encode

# Make copy to avoid changing original data 
label_X = X.copy()

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in categorical_cols:
    label_X[col] = label_encoder.fit_transform(X[col])

Xtrain_L, Xval_L, ytrain, yval = train_test_split(label_X, y, train_size = 0.8, test_size = 0.2)

# random forest
#### get feature importance

plt.rcParams["figure.figsize"] = (20,20)


# Function for comparing different models
def score_model(X_t, X_v, y_t, y_v):
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    print (mean_absolute_error(y_v, preds))
    return model

def plot_features(model, X_t, title):
    feature_importances = model.feature_importances_

    imp = pd.Series(feature_importances, index = X_t.columns).sort_values(ascending=True)
    width = np.arange(len(imp))
    ticks = list(imp.index)
    plt.barh(width, imp)
    plt.yticks(width, ticks)
    plt.title("RF feature importance {}".format(title))
    return 0

Label_model = score_model(Xtrain_L, Xval_L, ytrain, yval)


plot_features(Label_model, Xtrain_L, 'Label Encoding')

**Feature importance using distance as the target**



In [None]:
dist_target_set = acr[acr.flying_hill == False].copy()

##drop irrelevant/target leakage cols
dist_target_set.drop(columns=['id', 'speed', 'dist', 'dist_points', 'gate', 'gate_points', 'bib', 'comp_codex', 'flying_hill'], inplace=True)

# birthdate and date were already reduced to year and month in the cleaning cell.
# Re-running dtstring_year/dtstring_month here would treat those numbers as
# nanosecond timestamps (year 1970, month 1), so the columns are left as they are.

dist_target_set.dropna(inplace=True)

dist_target_set.head(10)


In [None]:
## mutual information scores

X = dist_target_set.drop(columns=['total_points']).copy()
y = X.pop('percent_k')

mi_scores = make_mi_scores(X, y)

plt.figure(dpi=100, figsize=(8, 5))
plot_mi_scores(mi_scores)

In [None]:
## Label Encode

#get categorical and numerical columns
numerical_cols = [col for col in X.columns if X[col].dtype in ['float64', 'int64']]
categorical_cols = [col for col in X.columns if X[col].dtype == 'object']

# Make copy to avoid changing original data 
label_X = X.drop(columns=['note_points']).copy()

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in categorical_cols:
    label_X[col] = label_encoder.fit_transform(X[col])

Xtrain_L, Xval_L, ytrain, yval = train_test_split(label_X, y, train_size = 0.8, test_size = 0.2)

plt.rcParams["figure.figsize"] = (20,20)

Label_model = score_model(Xtrain_L, Xval_L, ytrain, yval)

plot_features(Label_model, Xtrain_L, 'Label Encoding')

# Further analysis with data visualization

**Investigation of club performance in relation to rank**



In [None]:
def mode(series):
    return series.value_counts().idxmax()

statcols = ['club_id', 'nationality', 'nation_size', 'ranking', 'club_size', 'club']
stats = ['mean', 'median', 'std']

stats_df = acr[statcols].dropna()
stats_df['season'] = acr.season

for col in statcols[:-2]:
    stats_df[col + '_rank_mode'] = stats_df.groupby(col).ranking.transform(mode)
    for stat in stats:
        stats_df[col + '_rank_' + stat] = stats_df.groupby(col).ranking.transform(stat)
        
#look at the top 50 and bottom 50 clubs by mean ranking
top_50 = stats_df.groupby('club_id').apply(lambda df: df.loc[df.club_id_rank_mean.idxmin()]).sort_values(by='club_id_rank_mean').head(50)
bottom_50 = stats_df.groupby('club_id').apply(lambda df: df.loc[df.club_id_rank_mean.idxmin()]).sort_values(by='club_id_rank_mean', ascending=False).head(50)

top_50.head()

**Club performance over time**

In [None]:
top_50_clubs = top_50.club_id

club50 = acr.loc[acr.club_id.isin(top_50_clubs)].copy()

seasons = np.sort(acr.season.unique())

club50['cmr_season'] = club50.groupby(['season', 'club_id']).ranking.transform('mean')

#drop small clubs as the mean values will tend to be more extreme and less useful
club50 = club50.drop(club50[club50.club_size < 3].index)

#fig, ax = plt.subplots(figsize=(15,10))
#sns.lineplot(x=club50.season, y=club50.cmr_season, hue=club50.club)

#sort by nation
nations = club50.nationality.unique()
for nation in nations:
    df = club50.loc[club50.nationality == nation]
    fig, ax = plt.subplots(figsize=(10,7))
    sns.lineplot(ax=ax, x=df.season, y=df.cmr_season, hue=df.club_id)
    ax.set_title(nation)
    

We can see that clubs with id 308, 297, and 129 have shown quite dramatic improvement over the last 10 years.

In [None]:
##### look at some features

# total_points_prediction_set was built from regular_df, so tp_flying is empty here
tp_flying = total_points_prediction_set[total_points_prediction_set['flying_hill']].copy()
tp_reg = total_points_prediction_set[~total_points_prediction_set['flying_hill']].copy()

features = ["hill_size", 'place', 'club', "nationality"]
sns.relplot(
    x="value", y="total_points", col="variable", data=tp_reg.melt(id_vars="total_points", value_vars=features), facet_kws=dict(sharex=False),
);

features = ['wind', 'age', 'years_seen']
sns.relplot(
    x="value", y="total_points", col="variable", data=tp_reg.melt(id_vars="total_points", value_vars=features), facet_kws=dict(sharex=False),
);

features = ['home_comp', 'is_euro', 'club_size', 'nation_size']
sns.relplot(
    x="value", y="total_points", col="variable", data=tp_reg.melt(id_vars="total_points", value_vars=features), facet_kws=dict(sharex=False),
);

In [None]:
# look at relationship between hill size/k point, wind factor, gate factor, metre value

features = ['gate.factor', 'wind.factor', "meter.value"]
sns.relplot(
    x="value", y="hill_size", col="variable", data=tp_reg.melt(id_vars="hill_size", value_vars=features), facet_kws=dict(sharex=False),
);


In [None]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15., 7.))

ax1.scatter(tp_reg['gate.factor'], tp_reg['wind.factor'])
ax1.set_xlabel('gate factor')
ax1.set_ylabel('wind factor')
ax1.set_title('wind-gate factor correlation')

ax2.scatter(tp_reg['meter.value'], tp_reg.hill_size)
ax2.set_xlabel('meter value')
ax2.set_ylabel('hill size')
ax2.set_title('meter-value - hill size correlation')

In [None]:
fig, ax = plt.subplots(figsize=(10., 7.))
sns.scatterplot(x=tp_reg['gate.factor'], y=tp_reg['wind.factor'], hue=tp_reg.country)

In [None]:
# let's make some clusters and see if they correlate with points/distance

cluster_features = ['gate.factor', 'wind.factor']
# copy so that scaling does not modify (or warn about) a view of tp_reg
scaled_cluster_df = tp_reg[cluster_features].copy()
for feat in cluster_features:
    scaled_cluster_df[feat] = scaled_cluster_df[feat] / scaled_cluster_df[feat].std()

# fixed seed so the cluster labels are reproducible between runs
kmeans = KMeans(n_clusters=2, random_state=0)
scaled_cluster_df["cluster"] = kmeans.fit_predict(scaled_cluster_df[cluster_features])

fig, ax = plt.subplots(figsize=(10., 7.))
sns.scatterplot(y=scaled_cluster_df['wind.factor'], x=scaled_cluster_df['gate.factor'], hue=scaled_cluster_df['cluster'])




In [None]:
#try and isolate the outliers

scaled_cluster_df['wind_gate_outlier'] = scaled_cluster_df['wind.factor'] > scaled_cluster_df['gate.factor']*2 - 7

fig, ax = plt.subplots(figsize=(10., 7.))
sns.scatterplot(y=scaled_cluster_df['wind.factor'], x=scaled_cluster_df['gate.factor'], hue=scaled_cluster_df['wind_gate_outlier'])



In [None]:
tp_reg['wind_gate_outlier'] = scaled_cluster_df['wind_gate_outlier']

#see what it affects
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10., 10.))
sns.scatterplot(ax=ax1, x=tp_reg['wind.factor'], y=tp_reg['gate.factor'], hue=tp_reg.wind_gate_outlier)
sns.kdeplot(ax = ax2, x='total_points', hue='wind_gate_outlier', data=tp_reg)
fig2, ax = plt.subplots(figsize = (10., 7.))
sns.kdeplot(ax = ax, x='hill_size', hue='wind_gate_outlier', data=tp_reg)

In [None]:
plt.rcParams['figure.figsize'] = (15,10)
#wind is slightly negatively correlated
sns.regplot(x=tp_reg.wind, y=tp_reg.total_points)

In [None]:
#maybe some correlation with age

age_df = tp_reg[["age", "total_points"]].copy()
age_df['mean_points_age'] = age_df.groupby('age').total_points.transform('mean')

age_plot = sns.regplot(x=age_df.age, y=age_df.mean_points_age, logx=True)


In [None]:
# `shade` is deprecated in seaborn; `fill` is its replacement
sns.kdeplot(data=tp_reg, x='age', fill=True)


In [None]:
years_df = tp_reg[["years_seen", "total_points"]].copy()
years_df['mean_points_years'] = years_df.groupby('years_seen').total_points.transform('mean')
# standard deviation (not variance) of points within each years_seen group
years_df['std_points_years'] = years_df.groupby('years_seen').total_points.transform('std')


ax = sns.regplot(x=years_df.years_seen, y=years_df.mean_points_years)
ax.errorbar(x=years_df.years_seen, y=years_df.mean_points_years, yerr=years_df.std_points_years, fmt='none', capsize=5, zorder=1, color='C0')

In [None]:
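# distplot is deprecated in seaborn; histplot with a KDE overlay is the closest replacement
sns.histplot(x=tp_reg.years_seen, kde=True, stat='density')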

In [None]:
sns.catplot(x='is_euro', y='total_points', data=tp_reg, kind='violin')

In [None]:
sns.kdeplot(data=tp_reg, x="total_points", hue='is_euro', multiple="fill")

In [None]:
sns.catplot(x='home_comp', y='total_points', data=tp_reg, kind='violin')

In [None]:
sns.kdeplot(data=tp_reg, x="total_points", hue='home_comp', multiple="fill")

In [None]:
# look for correlation between club size and club mean performance

tp_reg['club_points_mean'] = tp_reg.groupby('club').total_points.transform('mean')

club_df = tp_reg[['club', 'club_size', 'club_points_mean', 'is_euro', 'nation_size']]

x = sns.catplot(x='club_size', y='club_points_mean', data=club_df, kind='box')
x.fig.set_figheight(10)
x.fig.set_figwidth(15)

In [None]:
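club_df = club_df.sort_values(by='club_points_mean')
# distplot is deprecated in seaborn; histplot with a KDE overlay is the closest replacement
c = sns.histplot(x=club_df.club_points_mean, kde=True, stat='density')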

In [None]:

print(tp_reg.club.nunique())
print(tp_reg.club_size.nunique())

sns.scatterplot(x=club_df.club_size, y=club_df.nation_size)



In [None]:
sns.kdeplot(data=dist_target_set, x="percent_k", hue='is_euro', multiple="fill")

In [None]:
sns.kdeplot(data=dist_target_set, x="percent_k", hue='home_comp', multiple="fill")

In [None]:
sns.lmplot(x="note_points", y="percent_k", hue="gender", data=dist_target_set)
sns.lmplot(x="note_points", y="percent_k", hue="is_euro", data=dist_target_set)