### Investigation into safety cars
In this notebook I want to investigate safety cars. My main question is whether the number of safety cars can be modelled with a Poisson distribution. To answer this, I need to do the following:

1. Find safety cars in the data.\
   Hypothesis: a safety car period can be recognised by the following signs:
   - Increasing lap times
   - No overtaking
   - Smaller time difference between the first and last driver
   - Smaller time difference between all drivers
2. Look at their distribution.
   - Find lambda (the expected number per time interval).
   - Investigate lambda: are there differences per track or per year?
3. Comment on this: was Austria 2020 really so strange?
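
Step 2 can be sketched on made-up numbers before touching the real data. Assuming we already had one safety car count per race (the `counts` list below is purely illustrative and not taken from the dataset), the maximum-likelihood estimate of lambda for a Poisson distribution is the sample mean, and the fitted probabilities can be compared with the observed frequencies. The helper `poisson_pmf` is a hypothetical name introduced only for this sketch.

```python
import math
from collections import Counter

# Hypothetical safety car counts per race (illustrative values only)
counts = [0, 1, 0, 2, 1, 0, 0, 3, 1, 0]

# For a Poisson distribution the maximum-likelihood estimate of lambda is the mean
lam = sum(counts) / len(counts)

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Compare observed relative frequencies with the fitted Poisson probabilities
observed = Counter(counts)
comparison = {
    k: (observed.get(k, 0) / len(counts), poisson_pmf(k, lam))
    for k in range(max(counts) + 1)
}
for k, (obs, fit) in comparison.items():
    print(f"k={k}: observed {obs:.2f}, Poisson {fit:.2f}")
```

A large gap between the two columns, or a sample variance far from the sample mean, would be a first hint that the Poisson assumption does not hold.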
    


In [None]:
## Imports
import pandas as pd
import os
import numpy as np
import seaborn as sns
import random
from datetime import datetime
from sklearn.linear_model import LinearRegression

# To get full output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

#set random seed
random.seed(20)

In [None]:
# Change current working directory to where our F1 data is stored
os.getcwd();
os.chdir('C:\\Users\\yanni\\OneDrive\\Documents\\Data_Science\\F1_data')
os.getcwd()  

We want the lap times for every lap of every race, so we need the lap times dataset.

In [None]:
laptimes_df = pd.read_csv('lap_times.csv')

In [None]:
laptimes_df.head()
laptimes_df.shape
laptimes_df.dtypes
laptimes_df.describe(include = 'all')
laptimes_df.head()

#### Step 1. Find safety cars in the data
To find safety cars in the data we have the following hypotheses:
1. There are increasing laptimes
2. There is no overtaking
3. The time difference between the first and the last driver decreases

In [None]:
# Hypothesis 1. There are increasing laptimes
# Here we check the distribution of laptimes in a "normal" race
# I also want to know which race I'm looking at, so I'll add that info
races_df = pd.read_csv('races.csv')
races_df.head()
races_df.dtypes
races_df['race'] = races_df['name'] + ' ' + races_df['year'].astype(str)
races_df.head()

In [None]:
# Here I'll get the mean time per lap
laptime_race_df = laptimes_df.merge(races_df[['raceId', 'race']]
                                    , how = 'left'
                                    , on = 'raceId')
laptime_race_df.head()
avg_laptimes_race_df = laptime_race_df[['raceId', 'lap', 'race','milliseconds',]]\
                                     .groupby(['raceId', 'lap'])\
                                     .agg({'milliseconds' : 'mean',
                                           'race'         : 'max'})
avg_laptimes_race_df = avg_laptimes_race_df.reset_index()
avg_laptimes_race_df.head()

In [None]:
# Visualize the laptimes of an individual race
# Here we select a race to visualize
plot_raceID = random.choice(np.unique(avg_laptimes_race_df['raceId']))

sns.regplot(x = 'lap'
            , y = 'milliseconds'
            , data = avg_laptimes_race_df[avg_laptimes_race_df['raceId'] == plot_raceID]
            , scatter = False
            , line_kws={'color' : 'blue',
                       'ls' : '--'}
           )
sns.lineplot(x = 'lap'
            , y = 'milliseconds'
            , data = avg_laptimes_race_df[avg_laptimes_race_df['raceId'] == plot_raceID]
            , color = 'red')\
.set_title("average laptimes of {}".format(avg_laptimes_race_df['race'][avg_laptimes_race_df['raceId'] == plot_raceID].iloc[0]))

In [None]:
sns.regplot(x = 'lap'
            , y = 'milliseconds'
            , data = avg_laptimes_race_df
            , scatter = False
            , line_kws={'color' : 'blue',
                       'ls' : '--'}
           )
sns.lineplot(x = 'lap'
            , y = 'milliseconds'
            , data = avg_laptimes_race_df
            , color = 'red')\
.set_title("average laptimes of all races")

In [None]:
# From this plot we can see that, most of the time, lap times
# go down as the race progresses.
# We check with a regression whether that is indeed the case.
# Here X is the lap number and y is the lap time
X = avg_laptimes_race_df[['lap']]
y = avg_laptimes_race_df[['milliseconds']]
reg = LinearRegression().fit(X, y)
# The slope is the change in lap time per lap: a negative slope means faster laps,
# so we flip the sign to report how much faster each lap is
print('Every lap is on average {0:.2f}ms faster than the previous'.format(-reg.coef_[0][0]))

In [None]:
# Thus, the lap time should go down with each lap
# Based on this we know that something is going on when that is not the case
# We make an indicator showing whether the lap was faster than the previous one
avg_laptimes_race_df['previous_round'] = avg_laptimes_race_df.groupby('raceId')['milliseconds'].shift()

avg_laptimes_race_df['faster_previous'] = np.where(avg_laptimes_race_df['milliseconds'] < avg_laptimes_race_df['previous_round']
                                                   , True
                                                   , False)
avg_laptimes_race_df.head(100)

We now have an indicator showing whether a lap was faster than the previous one. However, a lap can be slower for several reasons, such as pit stops, rain or traffic.
Therefore we need more conditions, and we continue with our next hypothesis:
there is no overtaking during a safety car period.

#### How to find this
We can find this by looking at each driver's position during the race and comparing it to the lap before. If it is the same, the driver neither overtook nor was overtaken.
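
As a minimal sketch of this idea on a toy table (the three drivers and their positions are invented for illustration; only the column names `raceId`, `driverId`, `lap` and `position` come from the lap times dataset), `groupby(...).shift()` puts each driver's previous-lap position next to the current one:

```python
import pandas as pd

# Toy data: 3 drivers over 3 laps (values invented for illustration)
toy_df = pd.DataFrame({
    'raceId':   [1] * 9,
    'driverId': [10, 20, 30] * 3,
    'lap':      [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'position': [1, 2, 3, 2, 1, 3, 2, 1, 3],
})

# Previous-lap position of the same driver in the same race (NaN on lap 1)
toy_df['position_previous_lap'] = toy_df.groupby(['raceId', 'driverId'])['position'].shift()

# A driver gained a place if the position number went down
toy_df['gained_place'] = toy_df['position'] < toy_df['position_previous_lap']

# A lap counts as "with overtaking" if any driver gained a place in it
overtake_per_lap = toy_df.groupby(['raceId', 'lap'])['gained_place'].any()
print(overtake_per_lap)
```

The shift relies on the rows being ordered by lap within each driver. If that ordering is not guaranteed, sorting by `lap` first makes it explicit.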

In [None]:
# First we will take a look at the position of drivers during a race
ax = sns.lineplot(x = 'lap', y = 'position', hue = 'driverId', data = laptime_race_df[(laptime_race_df['raceId'] == 841)])
ax.set_title("Changes in position during {}".format(laptime_race_df['race'][laptime_race_df['raceId'] == 841].iloc[0]))
ax.invert_yaxis()
ax.legend_.remove()

In [None]:
# From here we can see that there are moments with many position changes (pit stops)
# We now make an indicator showing whether a driver gained a position compared to the previous lap
laptime_race_df.head()
laptime_race_df['position_previous_round'] = laptime_race_df.groupby(['raceId', 'driverId'])['position'].shift()
laptime_race_df['overtake'] = np.where(laptime_race_df['position'] < laptime_race_df['position_previous_round']
                                       , True
                                       , False)  
laptime_race_df.head(100)


In [None]:
# We now have an indicator per driver showing whether they overtook someone.
# However, we want to know per lap of the race whether there was an overtake
overtake_df = laptime_race_df[['raceId', 'lap','overtake',]]\
                                     .groupby(['raceId', 'lap'])\
                                     .max()\
                                     .reset_index() 

overtake_df.head(100)

In [None]:
# To check whether this works the way we want, we compare the results with the graph
overtake_df['lap'][(overtake_df['overtake'] == False) & (overtake_df['raceId'] == 841)]

# It seems there were no overtakes in laps 43 to 47
no_overtake_laps_df = laptime_race_df[(laptime_race_df['lap'] >= 43) 
                                   & (laptime_race_df['lap'] <= 47) 
                                   & (laptime_race_df['raceId'] == 841)]

ax = sns.lineplot(x = 'lap', y = 'position', hue = 'driverId', data = no_overtake_laps_df)
ax.set_title("Changes in position during {}".format(laptime_race_df['race'][laptime_race_df['raceId'] == 841].iloc[0]))
ax.invert_yaxis()
ax.legend_.remove()

# We only see straight lines so indeed no overtakes
# I checked and there was no safety car during this period

In [None]:
# We join this to our initial df
avg_laptimes_race_df = avg_laptimes_race_df.merge(overtake_df[['raceId'
                                                               , 'lap'
                                                               , 'overtake']]
                                    , how = 'left'
                                    , on = ['raceId', 'lap'])
avg_laptimes_race_df.head()

Our second indicator, whether there was overtaking, is now finished. We continue with the next indicator: a decrease in the time difference between the first and last driver.
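
One possible way to build this indicator, sketched here on invented numbers: the lap times file stores the time of each lap rather than the gap to the leader, so this sketch assumes the gap can be approximated by cumulating `milliseconds` per driver and taking, per lap, the difference between the largest and smallest cumulative time. The column names `race_time`, `gap_first_last` and `gap_decreased` are hypothetical, and lapped or retired drivers would need extra care, which is ignored here.

```python
import pandas as pd

# Toy data: 2 drivers over 3 laps (values invented for illustration)
toy_df = pd.DataFrame({
    'raceId':       [1] * 6,
    'driverId':     [10, 20] * 3,
    'lap':          [1, 1, 2, 2, 3, 3],
    'milliseconds': [90000, 92000, 90000, 93000, 120000, 116000],
})

# Cumulative race time per driver, after making the lap order explicit
toy_df = toy_df.sort_values(['raceId', 'driverId', 'lap'])
toy_df['race_time'] = toy_df.groupby(['raceId', 'driverId'])['milliseconds'].cumsum()

# Gap between the slowest and fastest cumulative time on each lap
gap_df = (toy_df.groupby(['raceId', 'lap'])['race_time']
                .agg(lambda s: s.max() - s.min())
                .rename('gap_first_last')
                .reset_index())

# Indicator: did the gap shrink compared to the previous lap of the same race?
gap_df['gap_decreased'] = gap_df.groupby('raceId')['gap_first_last'].diff() < 0
print(gap_df)
```

In this toy race the field bunches up on the slow third lap, which is the pattern we would expect behind a safety car.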