# Predicting Formula 1 Constructor's Champion 

Author: Lauren Y Kim (laurenkim04102001@gmail.com)

Course Project, UC Irvine, Math 10, Summer 2023

## Introduction

In this project, I predict the constructors' standings for Race ID#1105 using `LogisticRegression`. I then assess the accuracy of each prediction and consider ways to improve it.

## Exploring 


In [1]:
import pandas as pd
import numpy as np
import altair as alt
from sklearn.linear_model import LogisticRegression

In [2]:
drivers = pd.read_csv("drivers.csv")
results = pd.read_csv("results.csv")
races = pd.read_csv("races.csv")
constructors = pd.read_csv("constructors.csv")
constructor_standing = pd.read_csv("constructor_standings.csv")

In [3]:
results = results[['raceId', 'driverId', 'constructorId', 'grid', 'position']]
races = races[['raceId', 'year']]
races = races[races['raceId'] == 1105]
constructors = constructors[['constructorId', 'name']]
constructor_standing = constructor_standing[['raceId', 'constructorId', 'position']]
constructor = pd.merge(constructor_standing, constructors, on='constructorId')
race = pd.merge(constructor, races, on='raceId')
race = race.rename(columns={'position': 'constructorPosition'})
race = race.rename(columns={'name': 'constructorName'})
drivers = drivers[['driverId', 'surname', 'forename']]
driver = pd.merge(drivers, results, on='driverId')
driver = driver.rename(columns={'position': 'driverPosition'})
result = pd.merge(driver, race, on=['constructorId', 'raceId'])
#ChatGPT helped me with the code below to filter out any rows with \\N as their values.
result = result[~result['driverPosition'].str.contains(r"\\N")]
result['driverPosition'] = result['driverPosition'].astype(int)
result

| | driverId | surname | forename | raceId | constructorId | grid | driverPosition | constructorPosition | constructorName | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Hamilton | Lewis | 1105 | 131 | 4 | 2 | 2 | Mercedes | 2023 |
| 1 | 847 | Russell | George | 1105 | 131 | 12 | 3 | 2 | Mercedes | 2023 |
| 2 | 4 | Alonso | Fernando | 1105 | 117 | 8 | 7 | 3 | Aston Martin | 2023 |
| 3 | 840 | Stroll | Lance | 1105 | 117 | 5 | 6 | 3 | Aston Martin | 2023 |
| 4 | 842 | Gasly | Pierre | 1105 | 214 | 10 | 10 | 5 | Alpine F1 Team | 2023 |
| 5 | 839 | Ocon | Esteban | 1105 | 214 | 6 | 8 | 5 | Alpine F1 Team | 2023 |
| 6 | 807 | Hülkenberg | Nico | 1105 | 210 | 7 | 15 | 7 | Haas F1 Team | 2023 |
| 7 | 825 | Magnussen | Kevin | 1105 | 210 | 17 | 18 | 7 | Haas F1 Team | 2023 |
| 8 | 815 | Pérez | Sergio | 1105 | 9 | 11 | 4 | 1 | Red Bull | 2023 |
| 9 | 830 | Verstappen | Max | 1105 | 9 | 1 | 1 | 1 | Red Bull | 2023 |


I merged the separate datasets into one master dataset with 10 columns, then filtered it so that it only includes data from Race ID#1105. I chose this race because it was the most recent one in the dataset with data from all 20 drivers across the 10 constructor teams.
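The merge-and-filter step can be illustrated on a tiny example. This is only a sketch: `results_toy` and `drivers_toy` are hypothetical stand-ins for the CSV files, the third driver is made up, and `\N` is assumed to mark a missing finishing position. It also shows `pd.to_numeric` as one possible alternative to the `str.contains` filter used above.

```python
import pandas as pd

# Toy stand-ins for results.csv and drivers.csv (illustrative values only).
results_toy = pd.DataFrame({
    "raceId": [1105, 1105, 1105],
    "driverId": [1, 847, 999],
    "position": ["2", "3", r"\N"],  # r"\N" is assumed to mean "no position recorded"
})
drivers_toy = pd.DataFrame({
    "driverId": [1, 847, 999],
    "surname": ["Hamilton", "Russell", "Example"],  # "Example" is a made-up driver
})

# Inner merge on the shared key, as in the cell above.
merged = pd.merge(drivers_toy, results_toy, on="driverId")

# Alternative to the str.contains filter: turn non-numeric entries into NaN,
# drop those rows, then cast the remaining positions to int.
merged["position"] = pd.to_numeric(merged["position"], errors="coerce")
merged = merged.dropna(subset=["position"])
merged["position"] = merged["position"].astype(int)

print(merged)
```

The coercion approach does not depend on the exact spelling of the placeholder, which may make it a little more robust if other non-numeric markers appear.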

In [4]:
result['driverPositionAve'] = result.groupby('constructorId')['driverPosition'].transform('sum')/2
result['gridAve'] = result.groupby('constructorId')['grid'].transform('sum')/2
result['finalPositionAve'] = result['gridAve'] + result['driverPositionAve']

result_sorted = result.sort_values(by='constructorName', ascending=True)
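# ChatGPT helped me write the line below, which keeps only 1 row per constructor team.
# It relies on teammates sitting on consecutive index pairs (0-1, 2-3, ...),
# so every even index belongs to a different team.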
result_sorted = result_sorted[result_sorted.index % 2 == 0]

result_sorted = result_sorted[['constructorName','gridAve','driverPositionAve', 'finalPositionAve','constructorPosition']]
result_sorted


| | constructorName | gridAve | driverPositionAve | finalPositionAve | constructorPosition |
|---|---|---|---|---|---|
| 10 | Alfa Romeo | 14.5 | 14.0 | 28.5 | 8 |
| 18 | AlphaTauri | 14.5 | 13.0 | 27.5 | 9 |
| 4 | Alpine F1 Team | 8.0 | 9.0 | 17.0 | 5 |
| 2 | Aston Martin | 6.5 | 6.5 | 13.0 | 3 |
| 12 | Ferrari | 10.5 | 8.0 | 18.5 | 4 |
| 6 | Haas F1 Team | 12.0 | 16.5 | 28.5 | 7 |
| 14 | McLaren | 6.0 | 15.0 | 21.0 | 6 |
| 0 | Mercedes | 8.0 | 2.5 | 10.5 | 2 |
| 8 | Red Bull | 6.0 | 2.5 | 8.5 | 1 |
| 16 | Williams | 19.0 | 18.0 | 37.0 | 10 |


Using the master dataset I created above, I calculated three new columns per constructor team: the average grid position (`gridAve`), the average finishing position of the two drivers (`driverPositionAve`), and `finalPositionAve`, which is the sum of those two averages. Each average is the team's two values added together and divided by 2. I then sorted the dataset alphabetically by constructor name, kept one row per team, and removed the columns that were no longer needed to make the dataset easier to read.
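As a cross-check on the averaging step, the same per-team numbers can be produced with a single `groupby(...).agg(...)`, which also leaves one row per team without any index arithmetic. This is a sketch on four rows copied from the merged table above; `toy` and `per_team` are names introduced here for illustration.

```python
import pandas as pd

# Four rows copied from the merged table: two constructors, two drivers each.
toy = pd.DataFrame({
    "constructorName": ["Mercedes", "Mercedes", "Red Bull", "Red Bull"],
    "grid": [4, 12, 11, 1],
    "driverPosition": [2, 3, 4, 1],
    "constructorPosition": [2, 2, 1, 1],
})

# One row per constructor: mean grid slot, mean finishing position,
# and the (shared) constructor standing.
per_team = toy.groupby("constructorName", as_index=False).agg(
    gridAve=("grid", "mean"),
    driverPositionAve=("driverPosition", "mean"),
    constructorPosition=("constructorPosition", "first"),
)

# Same definition as in the notebook: the sum of the two averages.
per_team["finalPositionAve"] = per_team["gridAve"] + per_team["driverPositionAve"]

print(per_team)
```

For Mercedes and Red Bull this gives the same `gridAve`, `driverPositionAve`, and `finalPositionAve` values as the table above (8.0, 2.5, 10.5 and 6.0, 2.5, 8.5).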

In [5]:
clf = LogisticRegression(max_iter=2000)
grid = result_sorted[['gridAve']]
gridy = result_sorted['constructorPosition']
clf.fit(grid, gridy)


LogisticRegression(max_iter=2000)

Next, I fit a `LogisticRegression` classifier that uses the average grid position to predict the actual constructor standings.

In [6]:
driv = result_sorted[['driverPositionAve']]
drivy = result_sorted['constructorPosition']
clf.fit(driv, drivy)


LogisticRegression(max_iter=2000)

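I repeated the process above with the average driver position and the actual constructor standings. Because this cell reuses the same `clf` object, this fit replaces the grid-based fit from the previous cell.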

In [7]:
result_sorted['predGridConstructorPosition'] = clf.predict(grid)
result_sorted['predDrivConstructorPosition'] = clf.predict(driv)

result_sorted

Feature names unseen at fit time:
- gridAve
Feature names seen at fit time, yet now missing:
- driverPositionAve



| | constructorName | gridAve | driverPositionAve | finalPositionAve | constructorPosition | predGridConstructorPosition | predDrivConstructorPosition |
|---|---|---|---|---|---|---|---|
| 10 | Alfa Romeo | 14.5 | 14.0 | 28.5 | 8 | 6 | 8 |
| 18 | AlphaTauri | 14.5 | 13.0 | 27.5 | 9 | 6 | 9 |
| 4 | Alpine F1 Team | 8.0 | 9.0 | 17.0 | 5 | 4 | 5 |
| 2 | Aston Martin | 6.5 | 6.5 | 13.0 | 3 | 3 | 3 |
| 12 | Ferrari | 10.5 | 8.0 | 18.5 | 4 | 5 | 4 |
| 6 | Haas F1 Team | 12.0 | 16.5 | 28.5 | 7 | 9 | 10 |
| 14 | McLaren | 6.0 | 15.0 | 21.0 | 6 | 3 | 6 |
| 0 | Mercedes | 8.0 | 2.5 | 10.5 | 2 | 4 | 2 |
| 8 | Red Bull | 6.0 | 2.5 | 8.5 | 1 | 3 | 2 |
| 16 | Williams | 19.0 | 18.0 | 37.0 | 10 | 10 | 10 |


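Then, I added two columns: the constructor standings predicted from the average grid position and the standings predicted from the average driver position. The warning above appears because `clf` was last fit on `driverPositionAve` before `clf.predict(grid)` was called. As a result, the `predGridConstructorPosition` column holds the driver-position model's output on the grid averages, not the output of a model trained on `gridAve`.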

In [8]:

X = result_sorted[['gridAve', 'driverPositionAve']]
y = result_sorted['constructorPosition']
clf.fit(X, y)


LogisticRegression(max_iter=2000)

I repeated the process once more, this time using both `gridAve` and `driverPositionAve` together as features to predict the actual constructor standings.

In [9]:
result_sorted['predBothConstructorPosition'] = clf.predict(X)
result_sorted

| | constructorName | gridAve | driverPositionAve | finalPositionAve | constructorPosition | predGridConstructorPosition | predDrivConstructorPosition | predBothConstructorPosition |
|---|---|---|---|---|---|---|---|---|
| 10 | Alfa Romeo | 14.5 | 14.0 | 28.5 | 8 | 6 | 8 | 8 |
| 18 | AlphaTauri | 14.5 | 13.0 | 27.5 | 9 | 6 | 9 | 9 |
| 4 | Alpine F1 Team | 8.0 | 9.0 | 17.0 | 5 | 4 | 5 | 5 |
| 2 | Aston Martin | 6.5 | 6.5 | 13.0 | 3 | 3 | 3 | 3 |
| 12 | Ferrari | 10.5 | 8.0 | 18.5 | 4 | 5 | 4 | 4 |
| 6 | Haas F1 Team | 12.0 | 16.5 | 28.5 | 7 | 9 | 10 | 7 |
| 14 | McLaren | 6.0 | 15.0 | 21.0 | 6 | 3 | 6 | 6 |
| 0 | Mercedes | 8.0 | 2.5 | 10.5 | 2 | 4 | 2 | 2 |
| 8 | Red Bull | 6.0 | 2.5 | 8.5 | 1 | 3 | 2 | 1 |
| 16 | Williams | 19.0 | 18.0 | 37.0 | 10 | 10 | 10 | 10 |


Then, I added the constructor standings predicted from both the average grid position and the average driver position.

In [10]:
chart = alt.Chart(result_sorted).mark_circle(size=70).encode(
    x=alt.X('predGridConstructorPosition',scale=alt.Scale(domain=[1, 11])),
    y=alt.Y('constructorPosition',scale=alt.Scale(domain=[1, 10])),
    color=alt.Color('constructorName:N', scale=alt.Scale(scheme='magma')),
    tooltip=['gridAve', 'driverPositionAve', 'constructorName:N']
).properties(
    width=500,
    height=300
)
chart

I then created a chart comparing the constructor standings predicted from average grid positions with each team's actual standing. As the chart shows, average grid position is not a good predictor here: only 2 of the 10 teams' standings were predicted correctly.

In [11]:
chart = alt.Chart(result_sorted).mark_circle(size=70).encode(
    x=alt.X('predDrivConstructorPosition',scale=alt.Scale(domain=[1, 11])),
    y=alt.Y('constructorPosition',scale=alt.Scale(domain=[1, 10])),
    color=alt.Color('constructorName:N', scale=alt.Scale(scheme='magma')),
    tooltip=['gridAve', 'driverPositionAve', 'constructorName:N']
).properties(
    width=500,
    height=300
)
chart

I then created a chart comparing the constructor standings predicted from average driver positions with each team's actual standing. As the chart shows, average driver position is still an imperfect predictor, but it is clearly better than grid position, as it predicted 8 out of 10 standings correctly.

In [12]:
chart = alt.Chart(result_sorted).mark_circle(size=70).encode(
    x=alt.X('predBothConstructorPosition',scale=alt.Scale(domain=[1, 11])),
    y=alt.Y('constructorPosition',scale=alt.Scale(domain=[1, 10])),
    color=alt.Color('constructorName:N', scale=alt.Scale(scheme='magma')),
    tooltip=['gridAve', 'driverPositionAve', 'constructorName:N']
).properties(
    width=500,
    height=300
)
chart

I then created a chart comparing the constructor standings predicted from both average grid and average driver positions with each team's actual standing. As the chart shows, using both features together works best of the three approaches, as it matched all 10 teams' standings.

In [13]:
from sklearn.metrics import accuracy_score
# accuracy_score expects (y_true, y_pred); the score is the same either way.
gridAccuracy = accuracy_score(result_sorted['constructorPosition'], result_sorted['predGridConstructorPosition'])
gridAccuracy

0.2

In [14]:
drivAccuracy = accuracy_score(result_sorted['constructorPosition'], result_sorted['predDrivConstructorPosition'])
drivAccuracy

0.8

In [15]:
bothAccuracy = accuracy_score(result_sorted['constructorPosition'], result_sorted['predBothConstructorPosition'])
bothAccuracy

1.0

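I calculated the accuracy score for all three tests to see whether any of the predictions were successful. As the scores above show, using both the average grid position and the average driver position was the most successful at predicting the constructor standings of Race ID#1105. Note that each score is computed on the same 10 rows the model was fit to, so it measures how well the model fits this race rather than how well it would predict a new one.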
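The three accuracy scores can be checked by hand: accuracy is the number of teams whose predicted standing equals the actual standing, divided by the number of teams. The sketch below uses the columns of the final table above; the helper `accuracy` is a name introduced here for illustration and is not part of scikit-learn.

```python
# Columns copied from the final table above (Alfa Romeo ... Williams).
actual    = [8, 9, 5, 3, 4, 7, 6, 2, 1, 10]
pred_grid = [6, 6, 4, 3, 5, 9, 3, 4, 3, 10]
pred_driv = [8, 9, 5, 3, 4, 10, 6, 2, 2, 10]
pred_both = [8, 9, 5, 3, 4, 7, 6, 2, 1, 10]


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true value."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


print(accuracy(actual, pred_grid))  # 0.2 (2 of 10 correct)
print(accuracy(actual, pred_driv))  # 0.8 (8 of 10 correct)
print(accuracy(actual, pred_both))  # 1.0 (10 of 10 correct)
```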

## Summary

In this project, I explored three factors that could be used to predict the Formula 1 constructor standings of Race ID#1105: the average grid position, the average driver position, and both together. I ran one test per factor and finished by calculating the accuracy score of each test to see which factor was most successful.

## References


* Source of the dataset: https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020

* Other references: I used ChatGPT for help with writing the code that filters my master dataset.
