In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
df = pd.read_csv('/kaggle/input/strava-jeddah-segments-leaderboard/jeddah_strava_segments.csv')

Checking the completeness of the dataframe

In [None]:
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='copper')
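
The heatmap gives a visual impression of missing values. A per-column count makes the same check numeric. The snippet below is a minimal sketch on a small made-up frame: the values are hypothetical, and only the column names `act_avg_spd` and `gender` are taken from this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real leaderboard dataframe.
toy = pd.DataFrame({
    "act_avg_spd": [25.1, np.nan, 30.4, 18.9],
    "gender": ["M", "F", None, "M"],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(missing)
```

On the real data the same two expressions can be applied to `df` instead of `toy`.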

Checking the datatypes of the dataframe

In [None]:
df.info()

In [None]:
df.head()

### Checking the date of attempts in the leaderboards

In [None]:
# resetting the format of this feature to datetime
df['attempt_date'] = pd.to_datetime(df['attempt_date'])

In [None]:
df.groupby('attempt_date')['attempt_date'].count().plot()

We can notice that the majority of the entries are recent. This is because the leaderboard of each segment holds one entry (the best one) per user. For example, let's say I attempted a segment in 2018 and finished it in 5 minutes. Then yesterday I reattempted the same segment and managed to finish it in 4 minutes. The leaderboard is now updated to show my new attempt (the best attempt), which also changes the date.
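
To illustrate the "one best entry per user" behaviour described above, here is a minimal sketch on made-up data. The columns `user_id` and `elapsed_sec` and the segment name `segment_1` are hypothetical names chosen for the example, not columns confirmed to exist in this dataset.

```python
import pandas as pd

# Hypothetical attempts log: user "a" rides the same segment twice.
attempts = pd.DataFrame({
    "user_id": ["a", "a", "b"],
    "smt_name": ["segment_1", "segment_1", "segment_1"],
    "attempt_date": pd.to_datetime(["2018-03-01", "2020-01-15", "2019-06-10"]),
    "elapsed_sec": [300, 240, 280],
})

# Keep only the fastest attempt per user and segment, as a leaderboard would.
best_idx = attempts.groupby(["user_id", "smt_name"])["elapsed_sec"].idxmin()
leaderboard = attempts.loc[best_idx].reset_index(drop=True)
print(leaderboard)
```

User "a" ends up with only the 2020 attempt, so the 2018 date disappears from the leaderboard. This is why the date distribution leans towards recent entries.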

### Let's explore the average speed of cyclists over their entire activity, by age group

In [None]:
sns.scatterplot(x='act_avg_spd',y='user_age_group',data=df, hue='gender')

Here we can notice outliers on both sides (low speed and high speed). However, bear in mind that this is the average speed over the entire activity, and maintaining 35 km/h on a bicycle for a whole activity requires intensive training.

Exploring activity max speed and average speed

In [None]:
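# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
sns.histplot(df["act_avg_spd"], kde=True, stat="density")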

The majority of the cyclists' average speeds range from 10 km/h to around 35 km/h.
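
A range read off a plot can be double-checked with quantiles and a simple share. The sketch below uses hypothetical speed values, and the 10 and 35 km/h bounds are the ones read from the plot above.

```python
import pandas as pd

# Hypothetical average speeds in km/h, standing in for df["act_avg_spd"].
toy = pd.Series(
    [8.0, 12.5, 18.0, 22.4, 25.0, 27.3, 30.1, 33.8, 36.0, 41.2],
    name="act_avg_spd",
)

low, high = 10, 35
# Fraction of entries whose average speed falls inside [low, high].
share = toy.between(low, high).mean()

print(toy.quantile([0.05, 0.5, 0.95]))
print(f"Share between {low} and {high} km/h: {share:.0%}")
```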

In [None]:
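# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay replaces it
sns.histplot(df["act_max_spd"], kde=True, stat="density")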

The density plot above raises many questions: a speed above 200 km/h is not plausible on an ordinary bicycle ride, and the official highest cycling speed record is 82.52 km/h.

This indicates that the leaderboard contains faulty entries and cheaters who used cars or motorbikes.
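
How many entries are affected can be counted directly. This is a minimal sketch with hypothetical speeds. The constant `RECORD_KMH` is just the record value quoted above.

```python
import pandas as pd

RECORD_KMH = 82.52  # record value quoted in the text above

# Hypothetical max speeds in km/h, standing in for df["act_max_spd"].
toy = pd.DataFrame({"act_max_spd": [45.0, 61.3, 83.0, 212.5, 38.7]})

# Boolean mask of entries faster than the record.
above = toy["act_max_spd"] > RECORD_KMH
print(f"{above.sum()} of {len(toy)} entries ({above.mean():.0%}) exceed {RECORD_KMH} km/h")
```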

### Checking the average speed of participants in each segment

In [None]:
sns.scatterplot(y='smt_name',x='smt_avg_spd',data=df)

It is difficult to get any information from the plot above. Hence, let's create a flag for entries whose activity max speed is above the official world speed record of 82.52 km/h.

In [None]:
# Flag entries whose activity max speed exceeds the 82.52 km/h record (1 = flagged, 0 = not flagged)
df['act_max_spd_weird'] = (df['act_max_spd'] > 82.52).astype(int)

In [None]:
sns.scatterplot(y='smt_name',x='smt_avg_spd',data=df, hue='act_max_spd_weird')

We can see that some entries with a high segment average speed got flagged, but the picture is not that clear.

Therefore, let's lower the max speed threshold.

In [None]:
# Zoom in on the 20-100 km/h range to pick a lower threshold
sns.histplot(df["act_max_spd"], kde=True, stat="density").set_xlim((20, 100))

In [None]:
# Picking the threshold to be 50 km/h from the distribution plot above
df['act_max_spd_weird'] = (df['act_max_spd'] > 50).astype(int)

In [None]:
sns.scatterplot(y='smt_name',x='smt_avg_spd',data=df, hue='act_max_spd_weird')

We now notice that the majority of the segment entries with a ridiculously high average segment speed are flagged (orange).
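
The same observation can be checked with a table instead of by eye. The sketch below cross-tabulates the flag against a "high segment speed" band on hypothetical data. The 50 km/h cut-off used for the band and the `smt_speed_band` column are assumptions made for the example, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only act_max_spd and smt_avg_spd are real column names.
toy = pd.DataFrame({
    "act_max_spd": [42.0, 55.0, 95.0, 48.0, 130.0, 60.0],
    "smt_avg_spd": [28.0, 33.0, 70.0, 30.0, 90.0, 52.0],
})

# Same flag as above: activity max speed over 50 km/h.
toy["act_max_spd_weird"] = (toy["act_max_spd"] > 50).astype(int)

# Assumed cut-off: call a segment average speed above 50 km/h "high".
toy["smt_speed_band"] = np.where(toy["smt_avg_spd"] > 50, "high", "normal")

# Rows: segment speed band, columns: flag value.
table = pd.crosstab(toy["smt_speed_band"], toy["act_max_spd_weird"])
print(table)
```

If the flag works as intended, almost all "high" rows should fall in the column for flag 1.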