<a href="https://colab.research.google.com/github/so-dipe/Web-Scraping-Datasets/blob/main/F1%20Data%20-%20Feature%20Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [428]:
import pandas as pd
from datetime import datetime

pd.options.mode.chained_assignment = None

# Feature Engineering

In this notebook, I try to create features capable of predicting the final position of drivers in an F1 Grand Prix before the race begins, but after Qualifying is over and the starting grid is known.

First, we import the dataset containing the F1 results from 2014 onwards (the hybrid era) from my GitHub page.

In [429]:
df = pd.read_csv('https://raw.githubusercontent.com/so-dipe/Web-Scraping-Datasets/main/f1-archive/f1-hybrid-era-results.csv')
df.shape

(3204, 17)

In [430]:
df.head()

Unnamed: 0.1,Unnamed: 0,Pos,No,Driver,Car,Laps,Time/Retired,PTS,Circuit,Location,year,Q1,Q2,Q3,Starting Grid,race_num,Time
0,0,1,6,Nico Rosberg ROS,Mercedes,57.0,1:32:58.710,25.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:32.564,1:42.264,1:44.595,3,1,
1,1,EX,3,Daniel Ricciardo RIC,Red Bull Racing Renault,57.0,+24.525s,0.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:30.775,1:42.295,1:44.548,2,1,
2,2,2,20,Kevin Magnussen MAG,McLaren Mercedes,57.0,+26.777s,18.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:30.949,1:43.247,1:45.745,4,1,
3,3,3,22,Jenson Button BUT,McLaren Mercedes,57.0,+30.027s,15.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:31.396,1:44.437,ELIMINATED,10,1,
4,4,4,14,Fernando Alonso ALO,Ferrari,57.0,+35.284s,12.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:31.388,1:42.805,1:45.819,5,1,


The `Time` column is unnecessary as it contains the time each driver took to complete the race, or the gap to first position. These values are not available for all races, and they also constitute leakage.

The same goes for the `Laps` column, which contains the number of laps completed by a driver in a race. This information is only available at the end of the Grand Prix, so it is also a leaky feature and can't be used for a prediction made before the race.

In [431]:
df.drop(columns=['Time', 'Unnamed: 0', 'Laps'], inplace=True)

In [432]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3204 entries, 0 to 3203
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Pos            3204 non-null   object 
 1   No             3204 non-null   int64  
 2   Driver         3204 non-null   object 
 3   Car            3204 non-null   object 
 4   Time/Retired   3204 non-null   object 
 5   PTS            3204 non-null   float64
 6   Circuit        3204 non-null   object 
 7   Location       3204 non-null   object 
 8   year           3204 non-null   int64  
 9   Q1             3204 non-null   object 
 10  Q2             3204 non-null   object 
 11  Q3             3204 non-null   object 
 12  Starting Grid  3204 non-null   int64  
 13  race_num       3204 non-null   int64  
dtypes: float64(1), int64(4), object(9)
memory usage: 350.6+ KB


In [433]:
df.nunique()

Pos                25
No                 47
Driver             51
Car                27
Time/Retired     1524
PTS                27
Circuit            51
Location           34
year                8
Q1               3066
Q2               2306
Q3               1518
Starting Grid      22
race_num           22
dtype: int64

As we can see, over the course of 8 seasons the data covers 34 locations and 51 distinct circuit names. There are 27 distinct values in `Car` (constructor and engine combinations) that have been involved in the Constructors' Championship battle, and 51 drivers in the WDC (World Drivers' Championship).

Starting with the `Starting Grid` column (commonly known as the starting lineup)...

In [434]:
df['Starting Grid'].unique()

array([ 3,  2,  4, 10,  5, 15,  7, 11,  6,  8, 16, 13, 20, 17, 18, 21, 19,
       12,  1,  9, 14, 22])

After checking the unique values in this column, there is nothing to be done here. For any race, there is a maximum of 22 drivers on the grid, and the column has exactly 22 unique values: the numbers 1 to 22.
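The check described above can also be done in code rather than by eye. This is a small sketch that copies the printed values by hand into a hypothetical `grid_values` array; on the real data, `df['Starting Grid'].unique()` would take its place.

```python
import numpy as np

# The unique grid positions printed above, copied by hand
grid_values = np.array([3, 2, 4, 10, 5, 15, 7, 11, 6, 8, 16, 13, 20, 17, 18, 21, 19,
                        12, 1, 9, 14, 22])

# True when the values are exactly the integers 1 to 22 with no gaps or extras
is_complete = sorted(grid_values.tolist()) == list(range(1, 23))
print(is_complete)
```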

Next, let's move to the target column `Pos`. As can be seen from the `df.nunique()` query we ran before, there are 25 unique values here, as opposed to the maximum of 22 drivers that start a race.

In [435]:
df['Pos'].unique()

array(['1', 'EX', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11',
       '12', '13', 'NC', '14', '15', '16', '17', '18', '19', '20', '21',
       '22', 'DQ'], dtype=object)

Well, the reason for the 25 unique values is evident. Apart from the 22 positions (numbers), we have three extra values namely,

EX - Excluded (same as disqualified)
NC - Not classified (Did not finish 90% of race)
DQ - Disqualified

For the prediction, I plan to use a classification algorithm with only 11 classes: the first 10 classes for the 10 points-paying positions, and the 11th for drivers who finished outside the points, retired or got disqualified.

In [436]:
category_dict = {
    '12':'11',
    '13':'11',
    '14':'11',
    '15':'11',
    '16':'11',
    '17':'11',
    '18':'11',
    '19':'11',
    '20':'11',
    '21':'11',
    '22':'11',
    'EX':'11',
    'NC':'11',
    'DQ':'11',
}
# Assign the result back; an inplace replace on a selected column is not reliable in recent pandas
df['Pos'] = df['Pos'].replace(category_dict)
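The same grouping could also be written as a small function, which avoids listing every position by hand. This is only a sketch: `to_points_class` is a hypothetical helper name, and it assumes `Pos` holds strings like the ones printed above.

```python
def to_points_class(pos: str) -> str:
    """Map a finishing position to one of 11 classes: '1'..'10', or '11' for everything else."""
    # Points-paying positions keep their own class
    if pos.isdigit() and 1 <= int(pos) <= 10:
        return pos
    # Positions 11 to 22 and the codes EX, NC and DQ all share the last class
    return '11'

examples = ['1', '10', '11', '17', 'EX', 'NC', 'DQ']
classes = [to_points_class(p) for p in examples]
print(classes)
```

On the real data, something like `df['Pos'].map(to_points_class)` should give the same 11 classes as the dictionary.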

In [437]:
df.head()

Unnamed: 0,Pos,No,Driver,Car,Time/Retired,PTS,Circuit,Location,year,Q1,Q2,Q3,Starting Grid,race_num
0,1,6,Nico Rosberg ROS,Mercedes,1:32:58.710,25.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:32.564,1:42.264,1:44.595,3,1
1,11,3,Daniel Ricciardo RIC,Red Bull Racing Renault,+24.525s,0.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:30.775,1:42.295,1:44.548,2,1
2,2,20,Kevin Magnussen MAG,McLaren Mercedes,+26.777s,18.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:30.949,1:43.247,1:45.745,4,1
3,3,22,Jenson Button BUT,McLaren Mercedes,+30.027s,15.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:31.396,1:44.437,ELIMINATED,10,1
4,4,14,Fernando Alonso ALO,Ferrari,+35.284s,12.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:31.388,1:42.805,1:45.819,5,1


In [438]:
df['Pos'].nunique()

11

The `No` column holds each driver's car number. I initially wanted to keep it as a categorical column to represent the different drivers instead of using `Preprocessing`, but I was made aware that numbers change hands: a number can be picked up by a different driver once its previous holder has retired.
For example, Nico Rosberg was car 6 until he retired, a number now used by Nicholas Latifi. Also, Max Verstappen and Sebastian Vettel used the number 1 when they were reigning champions.
The counts do not match either: there are 51 drivers but only 47 numbers.
This means the column cannot identify a driver reliably and should be dropped.
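To make the argument concrete, here is a minimal sketch of how one could check whether a number maps to exactly one driver. The `toy` frame is made up for illustration and only mirrors the two situations described above; on the real data the same `groupby` calls would run on `df` before `No` is dropped.

```python
import pandas as pd

# Illustrative rows: number 6 is shared by two drivers, and one driver appears under two numbers
toy = pd.DataFrame({
    'No': [6, 6, 1, 5],
    'Driver': ['Nico Rosberg ROS', 'Nicholas Latifi LAT',
               'Sebastian Vettel VET', 'Sebastian Vettel VET'],
})

# Numbers used by more than one driver
drivers_per_number = toy.groupby('No')['Driver'].nunique()
shared_numbers = drivers_per_number[drivers_per_number > 1].index.tolist()

# Drivers who raced under more than one number
numbers_per_driver = toy.groupby('Driver')['No'].nunique()

print(shared_numbers)
print(numbers_per_driver)
```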

In [439]:
df.drop(columns='No', inplace=True)
df.head()

Unnamed: 0,Pos,Driver,Car,Time/Retired,PTS,Circuit,Location,year,Q1,Q2,Q3,Starting Grid,race_num
0,1,Nico Rosberg ROS,Mercedes,1:32:58.710,25.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:32.564,1:42.264,1:44.595,3,1
1,11,Daniel Ricciardo RIC,Red Bull Racing Renault,+24.525s,0.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:30.775,1:42.295,1:44.548,2,1
2,2,Kevin Magnussen MAG,McLaren Mercedes,+26.777s,18.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:30.949,1:43.247,1:45.745,4,1
3,3,Jenson Button BUT,McLaren Mercedes,+30.027s,15.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:31.396,1:44.437,ELIMINATED,10,1
4,4,Fernando Alonso ALO,Ferrari,+35.284s,12.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:31.388,1:42.805,1:45.819,5,1


Moving to the `Car` column

In [440]:
df['Car'].unique()

array(['Mercedes', 'Red Bull Racing Renault', 'McLaren Mercedes',
       'Ferrari', 'Williams Mercedes', 'Force India Mercedes',
       'STR Renault', 'Sauber Ferrari', 'Marussia Ferrari',
       'Lotus Renault', 'Caterham Renault', 'McLaren Honda',
       'Lotus Mercedes', 'Red Bull Racing TAG Heuer', 'Haas Ferrari',
       'Renault', 'Toro Rosso Ferrari', 'MRT Mercedes', 'Toro Rosso',
       'McLaren Renault', 'Scuderia Toro Rosso Honda',
       'Red Bull Racing Honda', 'Alfa Romeo Racing Ferrari',
       'Racing Point BWT Mercedes', 'AlphaTauri Honda',
       'Aston Martin Mercedes', 'Alpine Renault'], dtype=object)

This column is unusual because it shows more constructors than there really are. Every team name includes the name of its engine/power unit provider, and this changes over time. Therefore, we first have to create a `Power Unit` column and then rename the values in the `Car` column so that they contain only the constructor name.

In [441]:
# The power unit is the last word of the constructor name
df['Power Unit'] = df['Car'].str.split(' ').str[-1]
df.head()

Unnamed: 0,Pos,Driver,Car,Time/Retired,PTS,Circuit,Location,year,Q1,Q2,Q3,Starting Grid,race_num,Power Unit
0,1,Nico Rosberg ROS,Mercedes,1:32:58.710,25.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:32.564,1:42.264,1:44.595,3,1,Mercedes
1,11,Daniel Ricciardo RIC,Red Bull Racing Renault,+24.525s,0.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:30.775,1:42.295,1:44.548,2,1,Renault
2,2,Kevin Magnussen MAG,McLaren Mercedes,+26.777s,18.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:30.949,1:43.247,1:45.745,4,1,Mercedes
3,3,Jenson Button BUT,McLaren Mercedes,+30.027s,15.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:31.396,1:44.437,ELIMINATED,10,1,Mercedes
4,4,Fernando Alonso ALO,Ferrari,+35.284s,12.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:31.388,1:42.805,1:45.819,5,1,Ferrari


In [442]:
df['Power Unit'].unique()

array(['Mercedes', 'Renault', 'Ferrari', 'Honda', 'Heuer', 'Rosso'],
      dtype=object)

Over the 8 seasons of the Formula 1 hybrid era covered here, there have been only 4 engine providers: Mercedes, Renault, Ferrari and Honda.

'Heuer' and 'Rosso' are not engine makers. They come from 'Red Bull Racing TAG Heuer' and 'Toro Rosso', both of which were running Renault engines, so we have to make that change.

In [443]:
df['Power Unit'] = df['Power Unit'].replace(
    {
        'Heuer':'Renault',
        'Rosso':'Renault'
    }
)

In [444]:
df['Power Unit'].unique()

array(['Mercedes', 'Renault', 'Ferrari', 'Honda'], dtype=object)

Now, actually working on the `Car` column

In [445]:
# The first word of the constructor name is enough to identify the team
df['Cars'] = df['Car'].str.split(' ').str[0]
df['Cars'].unique()

array(['Mercedes', 'Red', 'McLaren', 'Ferrari', 'Williams', 'Force',
       'STR', 'Sauber', 'Marussia', 'Lotus', 'Caterham', 'Haas',
       'Renault', 'Toro', 'MRT', 'Scuderia', 'Alfa', 'Racing',
       'AlphaTauri', 'Aston', 'Alpine'], dtype=object)

In [446]:
# 'Scuderia' (from 'Scuderia Toro Rosso Honda') has no entry below, so it stays as its own value
df['Cars'] = df['Cars'].replace(
    {
        'Red':'Red Bull Racing',
        'Force':'Force India',
        'STR':'AlphaTauri',
        'Toro':'AlphaTauri',
        'Racing':'Racing Point',
        'Aston':'Aston Martin',
        'Renault':'Alpine',
        'MRT':'Manor Racing',
        'Alfa':'Alfa Romeo'
    }
)
df.drop(columns='Car', inplace=True)
df.rename(columns={'Cars':'Car'}, inplace=True)
df['Car'].unique()

array(['Mercedes', 'Red Bull Racing', 'McLaren', 'Ferrari', 'Williams',
       'Force India', 'AlphaTauri', 'Sauber', 'Marussia', 'Lotus',
       'Caterham', 'Haas', 'Alpine', 'Manor Racing', 'Scuderia',
       'Alfa Romeo', 'Racing Point', 'Aston Martin'], dtype=object)

The next column, `Time/Retired`, is also not needed as it would constitute leakage.

In [447]:
df.drop(columns='Time/Retired', inplace=True)

Moving rather quickly to `PTS`. The goal here is to get the number of points each driver had before the race (kind of a lag).

In [448]:
df_2014 = df[df['year'] == 2014]
df_2014.shape

(396, 13)

In [449]:
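def pts_sum(df):
  # One row per driver, one column per race, holding the points scored in that race
  wdc_df = pd.pivot_table(df, index='Driver', columns='race_num', values='PTS').fillna(0)
  # Running total of points up to and including each race
  pts_ls = []
  for column in wdc_df.columns:
    pts = wdc_df.loc[:,:column].sum(axis=1)
    pts_ls.append(pts)
  # Insert a 'lag n' column (total after race n) right after each race column
  k = 1
  i = 0
  for column in wdc_df.columns:
    wdc_df.insert(loc=k, column=f'lag {i+1}', value=pts_ls[i])
    i = i + 1
    k = k + 2
  return wdc_df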

# wdc_df = pts_sum(df_2014)


In [450]:
# wdc_df.head()

In [451]:
for year in df['year'].unique():
  df_year = df[df['year'] == year]
  wdc_df = pts_sum(df_year)
  for driver in wdc_df.index:
    # Skip the first race of the season: nobody has points before it
    for num in df_year['race_num'].unique()[1:]:
      driver_mask = df_year['Driver'] == driver
      num_mask = df_year['race_num'] == num
      rows = df_year[driver_mask & num_mask]
      # The driver did not take part in this race
      if rows.empty:
        continue
      # 'lag n' is the total after race n, so race num uses 'lag num-1'
      df.loc[rows.index[0], 'lagged PTS'] = wdc_df.loc[driver, f'lag {num-1}']

In [452]:
df['lagged PTS'] = df['lagged PTS'].fillna(0)
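For comparison, the same "points before the race" idea can be sketched without the pivot table and nested loops, using a grouped cumulative sum. The `toy` frame below is made-up data, not taken from the dataset, and I have not run this against the real `df`, so treat it as an illustration of the technique rather than a drop-in replacement.

```python
import pandas as pd

# Toy results (made up): two drivers over three races in one season
toy = pd.DataFrame({
    'year':     [2014] * 6,
    'race_num': [1, 1, 2, 2, 3, 3],
    'Driver':   ['A', 'B', 'A', 'B', 'A', 'B'],
    'PTS':      [25.0, 18.0, 18.0, 25.0, 0.0, 15.0],
})

# Races must be in order within each driver's season for the running total to make sense
toy = toy.sort_values(['year', 'Driver', 'race_num'])

# Running total within a season, minus the current race, gives the points held before the race
running = toy.groupby(['year', 'Driver'])['PTS'].cumsum()
toy['lagged PTS'] = running - toy['PTS']
print(toy)
```

Grouping by `year` as well as `Driver` resets the total at the start of every season, which is what the loop above does by working on one year at a time.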

In [453]:
df['Driver'].unique()

array(['Nico  Rosberg  ROS', 'Daniel  Ricciardo  RIC',
       'Kevin  Magnussen  MAG', 'Jenson  Button  BUT',
       'Fernando  Alonso  ALO', 'Valtteri  Bottas  BOT',
       'Nico  Hulkenberg  HUL', 'Kimi  Räikkönen  RAI',
       'Jean-Eric  Vergne  VER', 'Daniil  Kvyat  KVY',
       'Sergio  Perez  PER', 'Adrian  Sutil  SUT',
       'Esteban  Gutierrez GUT', 'Max  Chilton  CHI',
       'Jules  Bianchi  BIA', 'Pastor  Maldonado  MAL',
       'Marcus  Ericsson  ERI', 'Sebastian  Vettel  VET',
       'Lewis  Hamilton  HAM', 'Felipe  Massa  MAS',
       'Kamui  Kobayashi  KOB', 'Romain  Grosjean  GRO',
       'Andre  Lotterer  LOT', 'Will  Stevens  STE', 'Felipe  Nasr  NAS',
       'Carlos  Sainz  SAI', 'Max  Verstappen  VER',
       'Roberto  Merhi  MER', 'Alexander  Rossi  RSI',
       'Jolyon  Palmer  PAL', 'Stoffel  Vandoorne  VAN',
       'Pascal  Wehrlein  WEH', 'Rio  Haryanto  HAR',
       'Esteban  Ocon  OCO', 'Antonio  Giovinazzi  GIO',
       'Lance  Stroll  STR', 'Paul  di Rest

In [454]:
df[df['Driver'] == 'Valtteri  Bottas  BOT']

Unnamed: 0,Pos,Driver,PTS,Circuit,Location,year,Q1,Q2,Q3,Starting Grid,race_num,Power Unit,Car,lagged PTS
5,5,Valtteri Bottas BOT,10.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,1:31.601,1:43.852,1:48.147,15,1,Mercedes,Williams,0.0
28,8,Valtteri Bottas BOT,4.0,"Sepang International Circuit, Malaysia",Malaysia,2014,1:59.709,2:02.756,ELIMINATED,18,2,Mercedes,Williams,10.0
50,8,Valtteri Bottas BOT,4.0,"Bahrain International Circuit, Bahrain",Bahrain,2014,1:34.934,1:34.842,1:34.247,3,3,Mercedes,Williams,14.0
71,7,Valtteri Bottas BOT,6.0,"Shanghai International Circuit, China",China,2014,1:56.501,1:56.253,1:56.282,7,4,Mercedes,Williams,18.0
91,5,Valtteri Bottas BOT,10.0,"Circuit de Barcelona-Catalunya, Spain",Spain,2014,1:28.198,1:27.563,1:26.632,4,5,Mercedes,Williams,24.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3119,11,Valtteri Bottas BOT,0.0,"Autódromo Hermanos Rodríguez, Mexico City",Mexico,2021,1:16.727,1:16.864,1:15.875,1,18,Mercedes,Mercedes,181.0
3127,3,Valtteri Bottas BOT,15.0,"Autódromo José Carlos Pace, São Paulo",Brazil,2021,1:09.040,1:08.426,1:08.469,1,19,Mercedes,Mercedes,181.0
3164,11,Valtteri Bottas BOT,0.0,"Losail International Circuit, Lusail",Qatar,2021,1:22.016,1:21.991,1:21.478,6,20,Mercedes,Mercedes,196.0
3167,3,Valtteri Bottas BOT,15.0,"Jeddah Corniche Circuit, Jeddah",Saudi Arabia,2021,1:28.057,1:28.054,1:27.622,2,21,Mercedes,Mercedes,196.0


😩
Now that that is over, let's move on to `Q1`, `Q2`, `Q3`.
Here, the goal is to find the top 10/15 fastest drivers and put all eliminated drivers in the same category.

First, we need to change the time strings to time objects.

In [455]:
def str_to_time(time_str):
  # Non-numeric entries such as 'ELIMINATED' become a zero time
  if time_str[0] not in '0123456789':
    return datetime.strptime('0:00.0', '%M:%S.%f').time()
  # Times under a minute have no minutes part
  if ':' not in time_str:
    return datetime.strptime(time_str, '%S.%f').time()
  return datetime.strptime(time_str, '%M:%S.%f').time()

In [456]:
q1 = df['Q1'].apply(str_to_time)

Well, it works. Now, to convert the time object to an integer (`int`) number of microseconds.

In [457]:
def time_to_int(time):
  time = (
      time.microsecond +
      (time.second * 1_000_000) + 
      (time.minute * 60 * 1_000_000)
  )
  return time

In [458]:
df['Q1'].apply(str_to_time).apply(time_to_int)

0       92564000
1       90775000
2       90949000
3       91396000
4       91388000
          ...   
3199    83350000
3200    84338000
3201    84118000
3202    84423000
3203    84779000
Name: Q1, Length: 3204, dtype: int64
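The two steps (string to time object, then time object to microseconds) could also be folded into one function that parses the string directly. This is a sketch under the assumption that every lap time looks like `M:SS.fff` or `SS.fff`; `lap_to_microseconds` is a hypothetical name, not something used elsewhere in this notebook.

```python
def lap_to_microseconds(lap: str) -> int:
    """Convert 'M:SS.fff' or 'SS.fff' to microseconds; non-numeric entries give 0."""
    # Entries such as 'ELIMINATED' do not start with a digit
    if not lap or not lap[0].isdigit():
        return 0
    # minutes is '' when the string has no ':'
    minutes, _, rest = lap.rpartition(':')
    seconds, _, fraction = rest.partition('.')
    # Pad or trim the fractional part to exactly six digits (microseconds)
    micro = int((fraction + '000000')[:6])
    return (int(minutes or 0) * 60 + int(seconds)) * 1_000_000 + micro

print(lap_to_microseconds('1:32.564'))
```

For `'1:32.564'` this gives `92564000`, the same value as the first row of the output above. Working with integers throughout avoids any floating point rounding.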

In [459]:
df['Q1'] = df['Q1'].apply(str_to_time).apply(time_to_int)
df['Q2'] = df['Q2'].apply(str_to_time).apply(time_to_int)
df['Q3'] = df['Q3'].apply(str_to_time).apply(time_to_int)

In [460]:
df[['Q1', 'Q2', 'Q3']].max()

Q1    141611000
Q2    132470000
Q3    129776000
dtype: int64

Now that the columns are in `int` form, we still face a problem of scale. The maximum value of each column is roughly 130 to 140 million 😯

But to scale them, we have to do it race by race 😩

In [461]:
# Cast to float first so the scaled values can be written back into the same columns
df[['Q1', 'Q2', 'Q3']] = df[['Q1', 'Q2', 'Q3']].astype(float)

for year in df['year'].unique():
  df_year = df[df['year'] == year]
  pivt = pd.pivot(df_year, index='Driver', columns='race_num')
  scaled = {}
  for q in ['Q1', 'Q2', 'Q3']:
    # -1 marks drivers who did not take part in that race
    pivtQ = pivt[q].fillna(-1)
    for col in pivtQ.columns:
      # Min-max scale the real lap times of one session into the range 5 to 6
      main = pivtQ[pivtQ[col] > 0][col]
      main = (main - main.min())/(main.max() - main.min()) + 5
      # Keep the 0 (no time set) and -1 (absent) markers unchanged
      others = pivtQ[pivtQ[col] <= 0][col]
      # Series.append was removed in pandas 2.0, so use pd.concat
      pivtQ[col] = pd.concat([main, others])
    scaled[q] = pivtQ
  # Map each (driver, race) cell back to its row label in df
  index_df = pd.pivot_table(df_year.reset_index(), index='Driver', columns='race_num', values='index').fillna(-1).astype(int)
  for i in range(index_df.shape[0]):
    for j in range(index_df.shape[1]):
      index = index_df.iloc[i, j]
      if index != -1:
        df.loc[index, 'Q1'] = scaled['Q1'].iloc[i, j]
        df.loc[index, 'Q2'] = scaled['Q2'].iloc[i, j]
        df.loc[index, 'Q3'] = scaled['Q3'].iloc[i, j]

   

In [462]:
df.head()

Unnamed: 0,Pos,Driver,PTS,Circuit,Location,year,Q1,Q2,Q3,Starting Grid,race_num,Power Unit,Car,lagged PTS
0,1,Nico Rosberg ROS,25.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,5.408261,5.0,5.092952,3,1,Mercedes,Mercedes,0.0
1,11,Daniel Ricciardo RIC,0.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,5.0,5.006164,5.08095,2,1,Renault,Red Bull Racing,0.0
2,2,Kevin Magnussen MAG,18.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,5.039708,5.195466,5.386619,4,1,Mercedes,McLaren,0.0
3,3,Jenson Button BUT,15.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,5.141716,5.432094,0.0,10,1,Mercedes,McLaren,0.0
4,4,Fernando Alonso ALO,12.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,5.13989,5.107576,5.405516,5,1,Ferrari,Ferrari,0.0


I did get what I wanted. Each qualifying column went from a string to a time object, then to an integer, and finally each race's times were scaled to a number between `5` and `6`, with `5` for the fastest driver of the session and `6` for the slowest. A value of `0` marks drivers who were eliminated in a previous session or could not set a time in that session (probably due to a technical issue or whatever). The `-1` placeholder for drivers who didn't take part in a race at all only exists in the pivot tables and is never written back to `df`. 😌
(I really doubt I can do this again 😩)

The last thing I want to do is get each team's points total for the previous season. I could use just the championship position, but I want to show the size of the gap between teams, and the points total does that. I'm not going to add a value for the 2014 rows, as I don't want to go and scrape the data myself from F1's website (lazy 😩😓)
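For what it's worth, the same per-race scaling can be sketched with `groupby(...).transform(...)`, which avoids the pivot tables and the write-back loop. The `toy` frame and the `scale_session` helper are made up for illustration; the sketch also assumes every session has at least two different valid times, otherwise the division would be by zero.

```python
import pandas as pd

# Toy qualifying times in microseconds (made up); 0 means no time was set
toy = pd.DataFrame({
    'year':     [2014] * 4 + [2015] * 3,
    'race_num': [1] * 4 + [1] * 3,
    'Q1':       [90_000_000, 91_000_000, 92_000_000, 0,
                 80_000_000, 82_000_000, 81_000_000],
})

def scale_session(times: pd.Series) -> pd.Series:
    """Min-max scale the positive times of one session into 5..6; leave zeros untouched."""
    times = times.astype(float)
    valid = times > 0
    lo, hi = times[valid].min(), times[valid].max()
    scaled = times.copy()
    scaled[valid] = (times[valid] - lo) / (hi - lo) + 5
    return scaled

# Each (year, race_num) group is one qualifying session
toy['Q1 scaled'] = toy.groupby(['year', 'race_num'])['Q1'].transform(scale_session)
print(toy)
```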

In [482]:
df['PTS (Last Season)'] = 0.0
for year in df['year'].unique()[1:]:
  year_mask = df['year'] == year
  df_year = df[year_mask]
  # Total points per team. NOTE: df_year is the season being filled in, so these
  # are that season's own totals rather than the previous season's.
  pts = df_year.groupby('Car')['PTS'].sum()
  for index in pts.index:
    index_mask = df['Car'] == index
    df.loc[year_mask & index_mask, 'PTS (Last Season)'] = pts.loc[index]

In [483]:
df

Unnamed: 0,Pos,Driver,PTS,Circuit,Location,year,Q1,Q2,Q3,Starting Grid,race_num,Power Unit,Car,lagged PTS,PTS (Last Season)
0,1,Nico Rosberg ROS,25.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,5.408261,5.000000,5.092952,3,1,Mercedes,Mercedes,0.0,0.0
1,11,Daniel Ricciardo RIC,0.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,5.000000,5.006164,5.080950,2,1,Renault,Red Bull Racing,0.0,0.0
2,2,Kevin Magnussen MAG,18.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,5.039708,5.195466,5.386619,4,1,Mercedes,McLaren,0.0,0.0
3,3,Jenson Button BUT,15.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,5.141716,5.432094,0.000000,10,1,Mercedes,McLaren,0.0,0.0
4,4,Fernando Alonso ALO,12.0,"Melbourne Grand Prix Circuit, Australia",Australia,2014,5.139890,5.107576,5.405516,5,1,Ferrari,Ferrari,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3199,11,Sergio Perez PER,0.0,"Yas Marina Circuit, Yas Island",Abu Dhabi,2021,5.245027,5.222591,5.644615,4,22,Honda,Red Bull Racing,186.0,574.5
3200,11,Nicholas Latifi LAT,0.0,"Yas Marina Circuit, Yas Island",Abu Dhabi,2021,5.724406,0.000000,0.000000,16,22,Mercedes,Williams,7.0,23.0
3201,11,Antonio Giovinazzi GIO,0.0,"Yas Marina Circuit, Yas Island",Abu Dhabi,2021,5.617661,5.964120,0.000000,14,22,Ferrari,Alfa Romeo,3.0,13.0
3202,11,George Russell RUS,0.0,"Yas Marina Circuit, Yas Island",Abu Dhabi,2021,5.765648,0.000000,0.000000,17,22,Mercedes,Williams,16.0,23.0
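One thing to watch: the loop above groups the rows of the same season it is filling in, so the column ends up holding each team's total for that season (the 574.5 shown for Red Bull Racing on a 2021 row, for example), which would leak the outcome into the features. A minimal sketch of looking up the previous season instead is below. The `toy` frame is made up, and the approach assumes team names are consistent from one season to the next, which is not always true in this dataset.

```python
import pandas as pd

# Toy data (made up): three teams over two seasons
toy = pd.DataFrame({
    'year': [2014, 2014, 2014, 2015, 2015, 2015],
    'Car':  ['Mercedes', 'Mercedes', 'Ferrari', 'Mercedes', 'Ferrari', 'Haas'],
    'PTS':  [25.0, 18.0, 15.0, 25.0, 18.0, 0.0],
})

# Season totals per team, relabelled with the following year so they join as "last season"
totals = toy.groupby(['year', 'Car'], as_index=False)['PTS'].sum()
totals['year'] += 1
totals = totals.rename(columns={'PTS': 'PTS (Last Season)'})

# A left join keeps every row; teams with no previous season get 0
toy = toy.merge(totals, on=['year', 'Car'], how='left')
toy['PTS (Last Season)'] = toy['PTS (Last Season)'].fillna(0)
print(toy)
```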


I guess this is the end of this notebook. I have created all the features I think I need. The next step should be to create a wrangling function that takes the initial dataset and turns it into what we have here.

I promise to do that later, but for now, I'll just export the dataframe to CSV and move on to model building in another notebook.

Till then, goodbye.

Peace. ✌

In [484]:
df.to_csv('f1-data-for-model.csv')