#### Objectives
1. What percentage of drivers finish races every year?
2. How has the reliability of cars changed over the years? Is this result consistent with the previous statistic?
3. Who are the drivers with the most mechanical DNFs (MDNFs)? (Unluckiest drivers)
4. Most reliable and least reliable drivers.
5. Most reliable constructors.
6. Circuits with most number of crashes (Most dangerous circuits). 

#### Setup

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

#### Reading the data and EDA

In [2]:
# These files contain data that needs to be put together
drivers = pd.read_csv('datasets/cleaned/drivers.csv')
constructors = pd.read_csv('datasets/cleaned/constructors.csv')
circuits = pd.read_csv('datasets/original/circuits.csv')
status = pd.read_csv('datasets/cleaned/status.csv')
races = pd.read_csv('datasets/cleaned/races.csv')

# The results file contains all the codes from above files and needs to be worked on
results = pd.read_csv('datasets/cleaned/race_results.csv')

Results Data

In [3]:
# Viewing the results dataframe before modifications
results.info()
results.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25660 entries, 0 to 25659
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   raceId           25660 non-null  int64  
 1   driverId         25660 non-null  int64  
 2   constructorId    25660 non-null  int64  
 3   grid             25660 non-null  int64  
 4   positionOrder    25660 non-null  int64  
 5   points           25660 non-null  float64
 6   laps             25660 non-null  int64  
 7   time             25660 non-null  object 
 8   milliseconds     25660 non-null  float64
 9   fastestLap       25660 non-null  float64
 10  rank             25660 non-null  float64
 11  fastestLapTime   25660 non-null  object 
 12  fastestLapSpeed  25660 non-null  float64
 13  statusId         25660 non-null  int64  
dtypes: float64(5), int64(7), object(2)
memory usage: 2.7+ MB


Unnamed: 0,raceId,driverId,constructorId,grid,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId
0,18,1,1,1,1,10.0,58,0 days 01:34:50.616000,5690616.0,39.0,2.0,0 days 00:01:27.452000,218.3,1
1,18,2,2,5,2,8.0,58,0 days 01:34:56.094000,5696094.0,41.0,3.0,0 days 00:01:27.739000,217.586,1
2,18,3,3,7,3,6.0,58,0 days 01:34:58.779000,5698779.0,41.0,5.0,0 days 00:01:28.090000,216.719,1
3,18,4,4,11,4,5.0,58,0 days 01:35:07.797000,5707797.0,58.0,7.0,0 days 00:01:28.603000,215.464,1
4,18,5,1,3,5,4.0,58,0 days 01:35:08.630000,5708630.0,43.0,1.0,0 days 00:01:27.418000,218.385,1


In [4]:
results.drop(columns=['time', 'milliseconds', 'points', 'fastestLapTime', 'fastestLapSpeed', 'grid', 'fastestLap', 'rank'], inplace=True)
results.head()

Unnamed: 0,raceId,driverId,constructorId,positionOrder,laps,statusId
0,18,1,1,1,58,1
1,18,2,2,2,58,1
2,18,3,3,3,58,1
3,18,4,4,4,58,1
4,18,5,1,5,58,1


Races Data

In [5]:
races.info()
races.drop(columns=['date', 'round'], inplace=True)
races = races.rename(columns={'name': 'raceName'})
races.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1079 entries, 0 to 1078
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   raceId     1079 non-null   int64 
 1   year       1079 non-null   int64 
 2   round      1079 non-null   int64 
 3   circuitId  1079 non-null   int64 
 4   name       1079 non-null   object
 5   date       1079 non-null   object
dtypes: int64(4), object(2)
memory usage: 50.7+ KB


Unnamed: 0,raceId,year,circuitId,raceName
0,1,2009,1,Australian Grand Prix
1,2,2009,2,Malaysian Grand Prix
2,3,2009,17,Chinese Grand Prix
3,4,2009,3,Bahrain Grand Prix
4,5,2009,4,Spanish Grand Prix


Circuits Data

In [6]:
circuits.info()
circuits.drop(columns=['country', 'lat', 'lng', 'alt', 'url'], inplace=True)
circuits = circuits.rename(columns={'name': 'circuitName'})
circuits.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76 entries, 0 to 75
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   circuitId   76 non-null     int64  
 1   circuitRef  76 non-null     object 
 2   name        76 non-null     object 
 3   location    76 non-null     object 
 4   country     76 non-null     object 
 5   lat         76 non-null     float64
 6   lng         76 non-null     float64
 7   alt         76 non-null     object 
 8   url         76 non-null     object 
dtypes: float64(2), int64(1), object(6)
memory usage: 5.5+ KB


Unnamed: 0,circuitId,circuitRef,circuitName,location
0,1,albert_park,Albert Park Grand Prix Circuit,Melbourne
1,2,sepang,Sepang International Circuit,Kuala Lumpur
2,3,bahrain,Bahrain International Circuit,Sakhir
3,4,catalunya,Circuit de Barcelona-Catalunya,Montmeló
4,5,istanbul,Istanbul Park,Istanbul


Constructors Data

In [7]:
constructors.info()
constructors.drop('nationality', axis=1, inplace=True)
constructors = constructors.rename(columns={'name': 'constructorName'})
constructors.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   constructorId   211 non-null    int64 
 1   constructorRef  211 non-null    object
 2   name            211 non-null    object
 3   nationality     211 non-null    object
dtypes: int64(1), object(3)
memory usage: 6.7+ KB


Unnamed: 0,constructorId,constructorRef,constructorName
0,1,mclaren,McLaren
1,2,bmw_sauber,BMW Sauber
2,3,williams,Williams
3,4,renault,Renault
4,5,toro_rosso,Toro Rosso


Drivers Data

In [8]:
drivers.info()
drivers.drop(columns=['dob', 'nationality'], inplace=True)
drivers = drivers.rename(columns={'name': 'driverName'})
drivers.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 854 entries, 0 to 853
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   driverId     854 non-null    int64 
 1   driverRef    854 non-null    object
 2   dob          854 non-null    object
 3   nationality  854 non-null    object
 4   name         854 non-null    object
dtypes: int64(1), object(4)
memory usage: 33.5+ KB


Unnamed: 0,driverId,driverRef,driverName
0,1,hamilton,Lewis Hamilton
1,2,heidfeld,Nick Heidfeld
2,3,rosberg,Nico Rosberg
3,4,alonso,Fernando Alonso
4,5,kovalainen,Heikki Kovalainen


#### Joining the DataFrames

In [9]:
df = pd.merge(results, races, on='raceId',how='left')
df = pd.merge(df, circuits, on='circuitId', how='left')
df = pd.merge(df, drivers, on='driverId', how='left')
df = pd.merge(df, constructors, on='constructorId', how='left')
df = pd.merge(df, status, on='statusId', how='left')
df.head()

Unnamed: 0,raceId,driverId,constructorId,positionOrder,laps,statusId,year,circuitId,raceName,circuitRef,circuitName,location,driverRef,driverName,constructorRef,constructorName,status
0,18,1,1,1,58,1,2008,1,Australian Grand Prix,albert_park,Albert Park Grand Prix Circuit,Melbourne,hamilton,Lewis Hamilton,mclaren,McLaren,Finished
1,18,2,2,2,58,1,2008,1,Australian Grand Prix,albert_park,Albert Park Grand Prix Circuit,Melbourne,heidfeld,Nick Heidfeld,bmw_sauber,BMW Sauber,Finished
2,18,3,3,3,58,1,2008,1,Australian Grand Prix,albert_park,Albert Park Grand Prix Circuit,Melbourne,rosberg,Nico Rosberg,williams,Williams,Finished
3,18,4,4,4,58,1,2008,1,Australian Grand Prix,albert_park,Albert Park Grand Prix Circuit,Melbourne,alonso,Fernando Alonso,renault,Renault,Finished
4,18,5,1,5,58,1,2008,1,Australian Grand Prix,albert_park,Albert Park Grand Prix Circuit,Melbourne,kovalainen,Heikki Kovalainen,mclaren,McLaren,Finished


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25660 entries, 0 to 25659
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   raceId           25660 non-null  int64 
 1   driverId         25660 non-null  int64 
 2   constructorId    25660 non-null  int64 
 3   positionOrder    25660 non-null  int64 
 4   laps             25660 non-null  int64 
 5   statusId         25660 non-null  int64 
 6   year             25660 non-null  int64 
 7   circuitId        25660 non-null  int64 
 8   raceName         25660 non-null  object
 9   circuitRef       25660 non-null  object
 10  circuitName      25660 non-null  object
 11  location         25660 non-null  object
 12  driverRef        25660 non-null  object
 13  driverName       25660 non-null  object
 14  constructorRef   25660 non-null  object
 15  constructorName  25660 non-null  object
 16  status           25660 non-null  object
dtypes: int64(8), object(9)
memory u
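
The chain of left joins above assumes that every lookup table has exactly one row per key; a duplicated raceId or driverId would silently multiply rows in df. A minimal sketch of how that assumption could be checked with the validate and indicator options of pd.merge. The frames results_demo and races_demo are made-up stand-ins, not the real files:

```python
import pandas as pd

# Hypothetical miniature versions of the results and races tables
results_demo = pd.DataFrame({"raceId": [18, 18, 19], "driverId": [1, 2, 1]})
races_demo = pd.DataFrame({"raceId": [18, 19], "year": [2008, 2008]})

# validate="many_to_one" raises if raceId is duplicated on the right side;
# indicator=True adds a _merge column showing where each row came from
merged = pd.merge(results_demo, races_demo, on="raceId", how="left",
                  validate="many_to_one", indicator=True)

# A left join against a unique key must keep the row count unchanged
same_length = len(merged) == len(results_demo)
unmatched = (merged["_merge"] != "both").sum()
print(same_length, unmatched)

# A lookup table with a duplicated key is rejected instead of inflating the join
races_dup = pd.concat([races_demo, races_demo.iloc[[0]]])
try:
    pd.merge(results_demo, races_dup, on="raceId", how="left", validate="many_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```

The same two options could be added to each pd.merge call in the cell above if the lookup tables are ever regenerated.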

#### Objective 1
What percentage of drivers finish races every year?

1. How many drivers finished a race that they started? To figure this out, we can look at the number of drivers across all races with the race status as 'Finished' or '+1/2/3 laps'. This is done because drivers can finish on the lead lap (i.e. they haven't been lapped by the race winner) or they can finish after being lapped. In some cases, drivers may retire in the last couple of laps of the race.

    In modern F1, such drivers are still classified as finishers provided they have completed 90% of the race distance; as a rule of thumb, this analysis treats anything beyond +3 laps as a Did Not Finish (DNF). Historically, all drivers who completed the race (regardless of how many laps they were behind the winning driver) were classified. For this reason, all finishers have been considered.

In [11]:
# Dataframe of all drivers who finished a Grand Prix
finished_status = [1, 11, 12, 13, 14, 15, 16, 17, 18, 19, 45, 50, 128, 53, 55,
                   58, 88, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 122, 123, 124, 125, 127, 133, 134]
finished_historic = df.loc[df['statusId'].isin(finished_status)].shape[0]

# Percentage of drivers finishing a race in all of F1 history
finish_historic_pc = round((finished_historic / df.shape[0]) * 100, 3)

print(f"In all of F1 history, only {finish_historic_pc}% of race starts have been finished.")

# If we consider modern F1 rules
finished_modern = df.loc[df['statusId'].isin([1, 11, 12, 13])].shape[0]
finish_modern_pc = round((finished_modern / df.shape[0]) * 100, 3)
print(f"But, if modern rules are considered, only {finish_modern_pc}% of race starts have been finished.")

In all of F1 history, only 55.374% of race starts have been finished.
But, if modern rules are considered, only 51.068% of race starts have been finished.


Note: The number of race starts in a race is the total number of drivers taking part in it. So, if 20 drivers compete in a race, that race contributes 20 race starts.
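
The finished_status list above is hard-coded by ID. As a hedged alternative, the same kind of list could be derived from the status text, since finishers are described as 'Finished' or '+N laps'. The sketch below uses a small made-up table (status_demo) whose rows are illustrative only; the exact spelling of the labels in the real status file is an assumption:

```python
import pandas as pd

# Toy stand-in for the status table; rows are illustrative, not the real file
status_demo = pd.DataFrame({
    "statusId": [1, 3, 5, 11, 12, 20],
    "status": ["Finished", "Accident", "Engine", "+1 Lap", "+2 Laps", "Spun off"],
})

# A status counts as a finish if it reads 'Finished' or '+N Lap(s)'
is_finish = status_demo["status"].str.fullmatch(r"Finished|\+\d+ Laps?")
finished_ids = status_demo.loc[is_finish, "statusId"].tolist()
print(finished_ids)
```

Deriving the IDs this way avoids maintaining a long list by hand, but any status that denotes a finish without matching the pattern would still need to be added manually.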

2. Now, we determine the percentage of drivers finishing each race over the years. Car reliability could be one of the factors that affect these results: better reliability could mean more drivers finish the race.

    The plot below shows a general downward trend in average finishes from 1972 until 1989, the worst year, in which on average only 29.5% of drivers finished each race. From 1989 onwards there is an upward trend, and in recent seasons around 80% of drivers finish each race.
    Later sections investigate whether car reliability is a major factor behind these results.

In [12]:
df1 = df[['year', 'raceId', 'statusId']]

# Percentage of all status IDs at each race event through the years
df1 = df1.groupby(['year', 'raceId'])['statusId'].value_counts(normalize=True).reset_index(name="percentage")

# Extract percentages of drivers who FINISHED races (status IDs in finished_status) and add up their percentages
prct_finish = df1.loc[df1['statusId'].isin(finished_status)]
prct_finish = prct_finish.groupby(['year', 'raceId'])['percentage'].sum().reset_index(name='percentageFinishers')

# Group by year and get the average to get the average percentage of finishers per race per year
prct_finish = prct_finish.groupby(['year'])['percentageFinishers'].mean().reset_index(name='avgFinishPrct')
prct_finish['avgFinishPrct'] = (round(prct_finish['avgFinishPrct'] * 100, 3))

prct_finish


Unnamed: 0,year,avgFinishPrct
0,1950,50.907
1,1951,53.087
2,1952,53.560
3,1953,49.103
4,1954,46.406
...,...,...
68,2018,79.762
69,2019,85.714
70,2020,83.235
71,2021,86.591


In [13]:
fig = px.bar(prct_finish, x='year', y='avgFinishPrct', color='avgFinishPrct',
             color_continuous_scale=px.colors.sequential.Viridis[::-1], width=1050, height=650,
             labels=dict(year="Year", avgFinishPrct="Percentage"),
             title='Average percentage of drivers finishing each race per year')

fig.update_layout(plot_bgcolor='gainsboro')
fig.update_yaxes(showgrid=True, gridwidth=1,  range=[0, 100])

fig.show()

#### Objective 2
How has the reliability of cars varied over the years?

In [14]:
# List of status codes for reliability issues
reliability_status = [5, 6, 7, 8, 9, 10, 21, 22, 23, 24, 25, 26, 51, 56, 121, 126, 129, 131, 132, 135, 141]
reliability_status.extend(range(30, 50))
reliability_status.extend(range(63, 111))


reliability_status = [i for i in reliability_status if i not in [
    33, 35, 36, 41, 45, 46, 65, 67, 68, 73, 74, 77, 78, 81, 82, 88, 89, 90, 92, 93, 96, 97, 100, 104, 107]]

In [15]:
# Dataframe of all reliability issues
rel_df = df.loc[df['statusId'].isin(reliability_status)]

# Count reliability DNFs per season, ordered by year
dnfs_per_year = rel_df.groupby('year').size().reset_index(name='count')

In [16]:
# Plotting the number of reliability issues causing DNFs over the years
fig = px.bar(dnfs_per_year, x="year", y="count", title = 'Number of DNFs per year due to reliability issues', color = 'count', 
             color_continuous_scale=px.colors.sequential.Viridis[::-1], width=1050, height=650)

fig.update_layout(plot_bgcolor='gainsboro')

fig.show()

The above plot clearly shows that the 1980s and early 1990s saw the highest number of mechanical retirements from races. It is important to note that seasons in the early days of F1 had as few as 7 races, whereas in the mid to late 1980s there were 16 races per season. Additionally, more experimentation and changing regulations meant that teams had to try out new approaches, often resulting in a DNF.

So let's normalize the statistic to the number of races that happened in the respective seasons to get a better picture.

In [17]:
# All non-finishes (any cause, not only reliability issues), counted per race
dnfs_per_race = df.loc[~df['statusId'].isin(finished_status)]
dnfs_per_race = dnfs_per_race.groupby(['year', 'raceId']).agg({'driverId': 'count'}).reset_index().rename(columns={'driverId': 'dnfs'})
# Average number of DNFs per race in each season
avg_dnfs_per_race = dnfs_per_race.groupby(['year'])['dnfs'].mean().reset_index().rename(columns={'dnfs': 'avgDNFs'})
avg_dnfs_per_race

Unnamed: 0,year,avgDNFs
0,1950,11.428571
1,1951,11.125000
2,1952,12.500000
3,1953,14.333333
4,1954,13.444444
...,...,...
68,2018,4.047619
69,2019,3.000000
70,2020,3.352941
71,2021,3.105263


In [18]:
fig = px.bar(avg_dnfs_per_race, x='year', y='avgDNFs', title='Average number of DNFs per race every year',
             color='avgDNFs', color_continuous_scale=px.colors.sequential.Viridis[::-1],
             width=1050, height=650, labels=dict(year='Year', avgDNFs='Avg DNFs per race'))

fig.update_layout(plot_bgcolor='gainsboro')

fig.show()

Now we see that the average number of DNFs per race increased between 1969 and 1989, with 1989 being the worst year. This is fairly consistent with the previous statistic, the average percentage of drivers finishing each race, where 1989 was also the year with the lowest proportion of finishers.
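
The per-race average above counts every non-finish, whatever the cause. If the aim is to normalize only the reliability-related DNFs by the number of races in a season, one possible sketch is shown below. The data and the rel_ids list are made up for illustration; seasons without any mechanical DNF are kept with a value of zero:

```python
import pandas as pd

# Made-up results: two races in 1950, one in 1989, one in 2021
toy = pd.DataFrame({
    "year":     [1950, 1950, 1950, 1950, 1989, 1989, 1989, 2021, 2021],
    "raceId":   [1, 1, 2, 2, 3, 3, 3, 4, 4],
    "statusId": [5, 1, 1, 1, 5, 6, 1, 1, 1],
})
rel_ids = [5, 6]  # illustrative reliability status codes

# Number of races held in each season
races_per_year = toy.groupby("year")["raceId"].nunique()

# Number of reliability DNFs in each season
mdnfs = toy[toy["statusId"].isin(rel_ids)].groupby("year").size()

# Align on all seasons so that seasons with no mechanical DNF count as 0
mdnfs_per_race = (mdnfs.reindex(races_per_year.index, fill_value=0) / races_per_year).round(3)
print(mdnfs_per_race)
```

Dividing by the race count from the full results, rather than from the filtered frame, keeps races with no retirements in the denominator.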

Next, we need to consider how many drivers took part in each F1 season.

The current F1 grid consists of 10 teams with 2 race drivers each. Teams sometimes have to use reserve drivers due to factors such as driver injury or illness. A good example is Nico Hulkenberg, who filled in for Sergio Perez, Lance Stroll and Sebastian Vettel at Racing Point/Aston Martin between 2020 and 2022 due to COVID-19.

However, F1 has had many more drivers in a single season in the past. 108 different drivers representing 41 teams took part in the 1953 F1 season, the most drivers in a single season ever. Because of these factors, the following plot looks at the average number of mechanical DNFs per driver in each season.

In [19]:
drivers_per_year = df.groupby(['year'])['driverName'].nunique().reset_index().rename(columns={'driverName': 'numDrivers'})

dnfs_per_year = pd.merge(dnfs_per_year, drivers_per_year, on='year', how='left')

dnfs_per_year['dnfPerDriver'] = (dnfs_per_year['count']/dnfs_per_year['numDrivers']).round(3)
dnfs_per_year

Unnamed: 0,year,count,numDrivers,dnfPerDriver
0,1950,61,81,0.753
1,1951,75,84,0.893
2,1952,75,105,0.714
3,1953,102,108,0.944
4,1954,89,97,0.918
...,...,...,...,...
68,2018,43,20,2.150
69,2019,28,20,1.400
70,2020,28,23,1.217
71,2021,24,21,1.143


In [20]:
fig = px.bar(dnfs_per_year, x = 'year', y = 'dnfPerDriver', title = 'Average DNFs per driver every year', 
             color = 'dnfPerDriver', color_continuous_scale=px.colors.sequential.Viridis[::-1], 
             width=1050, height=650, labels = dict(year = 'Year', dnfPerDriver = 'Average DNFs per driver'))

fig.update_layout(plot_bgcolor='gainsboro')

fig.show()

A brief explanation of this statistic: it shows, on average, how many races a driver would fail to finish in a season because of reliability issues. For example, it shows the 1986 season to be the worst for drivers, with an average driver not finishing 5-6 races in the season.

Since only DNFs caused by reliability issues have been considered, this is another way to look at how car reliability has varied over the years.
For the next objective, we look at the unluckiest drivers in F1: those with the most unfinished races due to reliability issues outside their control. Looking at this graph, we can expect many of the unluckiest drivers to have raced between 1983 and 2002.

#### Objective 3
Who are the drivers with the most mechanical DNFs (MDNFs)? (Unluckiest drivers)


Mechanical DNFs and accidents by driver
Now we can look at which drivers have had the most mechanical DNFs (i.e. the worst luck) and the most accidents/collisions. These figures are also considered in relation to the total number of Grands Prix each driver has started, to put them in context.

This analysis uses the rel_df dataframe created earlier, which contains all mechanical DNF results in F1 history, together with the full df dataframe to account for accidents.

In [21]:
# Number of mechanical DNFs per driver (value_counts already sorts in descending order)
mdnf_drivers = rel_df['driverName'].value_counts().rename_axis('driverName').reset_index(name='mDNFs')

# Races entered per driver
races_per_driver = df['driverName'].value_counts().rename_axis('driverName').reset_index(name='raceStarts')

# Crash DNFs per driver - considering status IDs 3 (accident), 4 (collision), 20 (spun off) and 130 (collision damage)
crashes = df.loc[df['statusId'].isin([3, 4, 20, 130])]
crashes = crashes['driverName'].value_counts().rename_axis('driverName').reset_index(name='crashDNFs')

# Merging dataframes
dnf_drivers =  pd.merge(mdnf_drivers, races_per_driver, on ='driverName', how ='left')
dnf_drivers = pd.merge(dnf_drivers, crashes, on = 'driverName', how = 'left')
dnf_drivers

Unnamed: 0,driverName,mDNFs,raceStarts,crashDNFs
0,Riccardo Patrese,102,257,36.0
1,Andrea de Cesaris,97,214,37.0
2,Michele Alboreto,80,215,20.0
3,Gerhard Berger,67,210,26.0
4,Jacques Laffite,66,180,13.0
...,...,...,...,...
632,Toni Branca,1,3,
633,Georges Grignard,1,1,
634,Michael Andretti,1,13,6.0
635,Juan Jover,1,1,


In [22]:
# Not every driver will have had mechanical DNFs as well as accidents. This creates NaNs in the crashDNFs column
# Fill NaN values and convert crashDNFs to int
dnf_drivers['crashDNFs'] = dnf_drivers['crashDNFs'].fillna(0)
dnf_drivers['crashDNFs'] = dnf_drivers['crashDNFs'].astype('int')

# Create columns: race starts per mechanical DNF and per crash DNF
# (drivers with zero crash DNFs get racesPerCDNF = inf)
dnf_drivers['racesPerMDNF'] = dnf_drivers['raceStarts']/dnf_drivers['mDNFs']
dnf_drivers['racesPerCDNF'] = dnf_drivers['raceStarts']/dnf_drivers['crashDNFs']

# Determine the total number of DNF results (mechanical + crash) for each driver
dnf_drivers['totalDNFs'] = dnf_drivers['mDNFs'] + dnf_drivers['crashDNFs']

# Every start that is not a mechanical or crash DNF is counted as a finish here
dnf_drivers['totalFinishes'] = dnf_drivers['raceStarts'] - dnf_drivers['totalDNFs']
dnf_drivers = dnf_drivers[['driverName', 'raceStarts', 'mDNFs', 'crashDNFs', 'totalDNFs', 'totalFinishes', 'racesPerMDNF', 'racesPerCDNF']].reset_index()

dnf_drivers.head()

Unnamed: 0,index,driverName,raceStarts,mDNFs,crashDNFs,totalDNFs,totalFinishes,racesPerMDNF,racesPerCDNF
0,0,Riccardo Patrese,257,102,36,138,119,2.519608,7.138889
1,1,Andrea de Cesaris,214,97,37,134,80,2.206186,5.783784
2,2,Michele Alboreto,215,80,20,100,115,2.6875,10.75
3,3,Gerhard Berger,210,67,26,93,117,3.134328,8.076923
4,4,Jacques Laffite,180,66,13,79,101,2.727273,13.846154
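
One caveat with the racesPerCDNF column: a driver with no crash DNFs leads to a division by zero, which pandas evaluates to inf rather than raising an error. A minimal sketch of the effect, and of one possible way to turn such cases into missing values instead, using two made-up drivers:

```python
import numpy as np
import pandas as pd

# Two hypothetical drivers; driver B has never crashed out of a race
toy = pd.DataFrame({
    "driverName": ["A", "B"],
    "raceStarts": [100, 60],
    "crashDNFs": [4, 0],
})

# Plain division: the zero denominator yields inf for driver B
naive = toy["raceStarts"] / toy["crashDNFs"]

# Replacing 0 with NaN first yields a missing value instead of inf
safe = toy["raceStarts"] / toy["crashDNFs"].replace(0, np.nan)
print(naive.tolist(), safe.tolist())
```

Whether inf or NaN is preferable depends on how the column is sorted and plotted later; this sketch only makes the difference visible.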


In [23]:
dnf_drivers = dnf_drivers.sort_values(by='mDNFs', ascending=False).reset_index()

fig = px.bar(dnf_drivers.head(20), x="driverName", y=["mDNFs", "totalFinishes"],
             labels=dict(driverName="Driver Name",
                         value="Count", variable='Result type'),
             color_discrete_sequence=px.colors.qualitative.Pastel,
             title="Top 20 Unluckiest drivers", width=1050, height=650)

fig.update_layout(plot_bgcolor='gainsboro')
fig.update_xaxes(tickangle=45)
fig.update_layout(legend_traceorder="reversed")

# Update legend entries
newnames = {'mDNFs': 'Mechanical DNF', 'totalFinishes': 'Finishes'}
fig.for_each_trace(lambda t: t.update(name=newnames[t.name],
                                      legendgroup=newnames[t.name],
                                      hovertemplate=t.hovertemplate.replace(
                                          t.name, newnames[t.name])
                                      )
                   )

fig.show()


The top 5 drivers in this chart were all active in F1 in the mid to late-1980s - the period of F1 with the highest number of mechanical DNFs (lowest car reliability) as per the 'Number of DNFs per year due to reliability issues' chart above.

#### Objective 4
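Most Reliable Drivers

Driver reliability is measured here as race starts per crash DNF (racesPerCDNF), considering only drivers with more than 50 race starts. A higher number indicates a more reliable driver.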

In [24]:
dnf_drivers_crashes = dnf_drivers[dnf_drivers['raceStarts'] > 50].drop(columns='level_0')
dnf_drivers_crashes = dnf_drivers_crashes.sort_values(by='racesPerCDNF', ascending=False).reset_index()

fig = px.bar(dnf_drivers_crashes.head(20), x="driverName", y="racesPerCDNF", title='Most Reliable Drivers', color='racesPerCDNF',
             color_continuous_scale=px.colors.sequential.Viridis[::-1], width=1050, height=650)

fig.update_layout(plot_bgcolor='gainsboro')

fig.show()

Least Reliable Drivers

In [25]:
dnf_drivers_crashes.sort_values(by='racesPerCDNF', ascending=True, inplace=True)

fig = px.bar(dnf_drivers_crashes.head(20), x="driverName", y="racesPerCDNF", title='Least Reliable Drivers', color='racesPerCDNF',
             color_continuous_scale=px.colors.sequential.Viridis[::-1], width=1050, height=650)

fig.update_layout(plot_bgcolor='gainsboro')

fig.show()


#### Objective 5
Most Reliable Constructors


This section looks at the number of mechanical DNFs for each constructor. These numbers have been compared with the number of starts made by each constructor. In modern F1, each constructor has 2 starters per race. However, in the early days of F1, a team could enter more than 2 drivers in a race.

Crashes per constructor have not been considered, as accidents and collisions are mostly caused by driver error, weather conditions and other external factors. These numbers have been considered in the drivers section above.

Only constructors with at least 100 starts in F1 have been considered so as to have a meaningful indication of reliability over the period of a few seasons.

The number of starts has been divided by the number of mechanical DNFs. A higher number indicates higher reliability.

In [26]:
# Number of mechanical DNFs per constructor (value_counts already sorts in descending order)
mdnf_constructors = rel_df['constructorName'].value_counts().rename_axis('constructorName').reset_index(name='mDNFs')

# Number of starts per constructor
constructor_starts = df['constructorName'].value_counts().rename_axis('constructorName').reset_index(name='raceStarts')

# Merge dataframes
mdnf_constructors =  pd.merge(mdnf_constructors, constructor_starts, on ='constructorName', how ='left')

# Reorder columns
mdnf_constructors = mdnf_constructors[['constructorName', 'raceStarts', 'mDNFs']]

# Create column for number of raceStarts per mdnf and sort values
mdnf_constructors['startsPerMDNF'] = (mdnf_constructors['raceStarts']/mdnf_constructors['mDNFs']).round(3)
mdnf_constructors = mdnf_constructors.sort_values(by='startsPerMDNF', ascending = False)

# Only include constructors with at least 100 raceStarts
mdnf_constructors = mdnf_constructors[mdnf_constructors['raceStarts'] >= 100].reset_index().drop('index', axis = 1)

mdnf_constructors.head()

Unnamed: 0,constructorName,raceStarts,mDNFs,startsPerMDNF
0,Mercedes,542,32,16.938
1,Force India,424,31,13.677
2,BMW Sauber,140,12,11.667
3,Haas F1 Team,270,27,10.0
4,Marussia,109,11,9.909
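
As a quick sanity check, the startsPerMDNF values can be reproduced by hand from the raceStarts and mDNFs columns shown in the table above. The sketch below simply re-enters the first three rows into a small frame named check (a name chosen here for illustration):

```python
import pandas as pd

# First three rows of the table above, re-entered by hand
check = pd.DataFrame({
    "constructorName": ["Mercedes", "Force India", "BMW Sauber"],
    "raceStarts": [542, 424, 140],
    "mDNFs": [32, 31, 12],
})

# Starts per mechanical DNF: a higher value means fewer mechanical retirements per start
check["startsPerMDNF"] = (check["raceStarts"] / check["mDNFs"]).round(3)
print(check)
```

For Mercedes this works out to roughly one mechanical DNF in every 17 starts.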


In [27]:
fig = px.bar(mdnf_constructors.head(25), x = 'constructorName', y = 'startsPerMDNF', color = 'startsPerMDNF', 
              color_continuous_scale=px.colors.sequential.Viridis[::-1], width=1050, height=650,
              labels=dict(constructorName="Constructor", startsPerMDNF="Number of starts per mechanical DNF"), 
              title = 'Race Starts per MDNF')

fig.layout.coloraxis.colorbar.title = 'Number'
fig.update_layout(plot_bgcolor='gainsboro')
fig.update_xaxes(tickangle = 50)

fig.show()

Let's take a look at another statistic: the average number of MDNFs per season for each constructor.

In [28]:
# Getting mDNF counts for every constructor in every season in which it had at least one mDNF
mdnfs_per_year = rel_df.groupby(['constructorName', 'year']).agg({'statusId': 'count'}).reset_index().rename(columns={'statusId': 'mDNFs'})

# Finding the average MDNFs per season and merging with constructor_starts to get the raceStarts
mdnfs_per_year = mdnfs_per_year.groupby(['constructorName']).agg({'mDNFs': 'mean'}).reset_index().rename(columns={'mDNFs': 'avgMDNFs'}).round(3)
mdnfs_per_year = pd.merge(mdnfs_per_year, constructor_starts, on='constructorName', how='left')

# Considering only those constructors with at least 100 race starts and sorting by avgMDNFs
mdnfs_per_year = mdnfs_per_year[mdnfs_per_year['raceStarts'] >= 100].reset_index().sort_values(by='avgMDNFs', ascending=True)

In [29]:
fig = px.bar(mdnfs_per_year.head(25), x='constructorName', y='avgMDNFs', color='avgMDNFs',
             color_continuous_scale=px.colors.sequential.Viridis[::-1], width=1050, height=650,
             labels=dict(constructorName="Constructor",
                         avgMDNFs="Average MDNFs per season"),
             title='MDNFs per season')

fig.layout.coloraxis.colorbar.title = 'Number'
fig.update_layout(plot_bgcolor='gainsboro')
fig.update_xaxes(tickangle=50)

fig.show()

#### Objective 6
Circuits with the most crashes


Finally, we can consider the circuits which have had the most accidents and collisions.

Accidents and collisions are more common when weather conditions change during a race. Accidents are also common at narrow street circuits such as Monaco, Baku, Jeddah and Singapore, where many drivers touch the wall and compromise their laps/races.

The number of accidents at each circuit will also be compared against the number of Grands Prix held at the circuit to get a better picture of the dangers a certain circuit poses. An example of why this is necessary is Jeddah - a street circuit which (as of 2022) is very new to F1 (only 2 races held so far) but has had multiple crashes.

In [30]:
circuit_races = pd.merge(races, circuits, on='circuitId', how='left')
circuit_races = circuit_races[['raceId', 'year', 'circuitId', 'raceName', 'circuitRef', 'circuitName', 'location']]

circuit_races

Unnamed: 0,raceId,year,circuitId,raceName,circuitRef,circuitName,location
0,1,2009,1,Australian Grand Prix,albert_park,Albert Park Grand Prix Circuit,Melbourne
1,2,2009,2,Malaysian Grand Prix,sepang,Sepang International Circuit,Kuala Lumpur
2,3,2009,17,Chinese Grand Prix,shanghai,Shanghai International Circuit,Shanghai
3,4,2009,3,Bahrain Grand Prix,bahrain,Bahrain International Circuit,Sakhir
4,5,2009,4,Spanish Grand Prix,catalunya,Circuit de Barcelona-Catalunya,Montmeló
...,...,...,...,...,...,...,...
1074,1092,2022,22,Japanese Grand Prix,suzuka,Suzuka Circuit,Suzuka
1075,1093,2022,69,United States Grand Prix,americas,Circuit of the Americas,Austin
1076,1094,2022,32,Mexico City Grand Prix,rodriguez,Autódromo Hermanos Rodríguez,Mexico City
1077,1095,2022,18,Brazilian Grand Prix,interlagos,Autódromo José Carlos Pace,São Paulo


In [31]:
# Determine the total number of crashes (accidents, collisions and spins) which have happened at each circuit
circuit_accidents = df.loc[df['statusId'].isin([3, 4, 20])]
circuit_accidents = circuit_accidents[['circuitName', 'status']]
circuit_accidents = pd.crosstab(index=circuit_accidents['circuitName'], columns=circuit_accidents['status'])
circuit_accidents = circuit_accidents.reset_index()

# Determine the number of races held at each circuit
num_races = circuit_races['circuitName'].value_counts().rename_axis('circuitName').reset_index(name='numRaces')
circuit_accidents = pd.merge(circuit_accidents, num_races, on = 'circuitName', how = 'left')

# Calculate the total number of accidents at each circuit and arrange the dataframe according to this
circuit_accidents['crashes'] = circuit_accidents['Accident'] + circuit_accidents['Collision'] + circuit_accidents['Spun off']
circuit_accidents = circuit_accidents.sort_values(by = 'crashes', ascending = False)

# Calculate the number of accidents per race
circuit_accidents['crashesPerRace'] = (circuit_accidents['crashes']/circuit_accidents['numRaces']).round(3)

# Clean the dataframe by resetting the index
circuit_accidents = circuit_accidents.reset_index()
circuit_accidents = circuit_accidents.drop(['index'], axis = 1)

circuit_accidents.head()

Unnamed: 0,circuitName,Accident,Collision,Spun off,numRaces,crashes,crashesPerRace
0,Circuit de Monaco,118,72,66,68,256,3.765
1,Autodromo Nazionale di Monza,52,40,52,72,144,2.0
2,Circuit de Spa-Francorchamps,52,47,37,55,136,2.473
3,Circuit Gilles Villeneuve,44,43,46,41,133,3.244
4,Silverstone Circuit,39,42,46,57,127,2.228


In [32]:
fig = px.bar(circuit_accidents.head(20), x="circuitName", y=["Accident", "Collision", 'Spun off'], 
             labels=dict(circuitName="Circuit", value="Count", variable='Crash type'),
             color_discrete_sequence=px.colors.qualitative.Pastel, 
             title="Top 20 Most Dangerous Circuits", width=1050, height=650)

fig.update_layout(plot_bgcolor='gainsboro')
fig.update_xaxes(tickangle = 45)
fig.update_layout(legend_traceorder="reversed")


fig.update_layout(hovermode="x")


fig.show()

Finally, we can look at the number of crashes per race at different circuits. A higher number indicates that a circuit is more dangerous.

In [33]:
crashes_per_race = circuit_accidents[circuit_accidents['numRaces'] > 10].sort_values(
    by='crashesPerRace', ascending=False)
crashes_per_race.head()

Unnamed: 0,circuitName,Accident,Collision,Spun off,numRaces,crashes,crashesPerRace
16,Adelaide Street Circuit,13,18,27,11,58,5.273
8,Indianapolis Motor Speedway,60,9,18,19,87,4.579
0,Circuit de Monaco,118,72,66,68,256,3.765
22,Autódromo do Estoril,11,9,26,13,46,3.538
3,Circuit Gilles Villeneuve,44,43,46,41,133,3.244


In [34]:
fig = px.bar(crashes_per_race.head(20), x = 'circuitName', y = 'crashesPerRace', 
             title="Top 20 Circuits with the Most Crashes per Race", width=1050, height=650,
            labels=dict(circuitName="Circuit", crashesPerRace='Crashes per race'), color = 'crashesPerRace', 
              color_continuous_scale=px.colors.sequential.Viridis[::-1])

fig.update_layout(plot_bgcolor='gainsboro')
fig.update_xaxes(tickangle = 45)

fig.show()