# Formula 1 Grand Prix result prediction

## This project aims to predict future F1 GP winners based on the drivers, the constructors, or both
### Things to keep in mind

Before beginning the project, we need to understand the history of F1 and the different eras in which a certain driver or constructor dominated the whole grid. Here are some important eras of F1 before and after 2010.

* 1994-2009 Schumacher (Scuderia Ferrari)
* 2007-2010 Alonso (Renault, Scuderia Ferrari)
* 2011-2013 Vettel (Red Bull Racing)
* 2014-Present Hamilton (Mercedes-Benz)

F1 constructors' performance depends largely on the FIA technical regulations for the season. After the 2013 season, new engine regulations were introduced (the hybrid era), and Mercedes-Benz has been the most dominant team since, followed by Red Bull Racing and Scuderia Ferrari. The rules are set to change for 2022, so the analysis made here will not apply to the 2022 season and beyond. Only data from 2010 onwards is considered in the following analysis.

## Create spark context

In [1]:
# Entry point (Spark 2.x and later)
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SparkSession.builder.config("spark.sql.shuffle.partitions", "2").appName("InjestionProcessing").master("local[2]").getOrCreate()

# On yarn:
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()
# specify .master("yarn")

sc = spark.sparkContext



In [2]:
%run "includes/configuration"

'hdfs://localhost:9000/user/sunbeam/f1/data/'

## Read files

In [3]:
results = spark.read.option("header", True).csv(f"{data}/results.csv")
races = spark.read.option("header", True).csv(f"{data}/races.csv")
qualifying = spark.read.option("header", True).csv(f"{data}/qualifying.csv")
drivers = spark.read.option("header", True).csv(f"{data}/drivers.csv")
constructors = spark.read.option("header", True).csv(f"{data}/constructors.csv")
circuits = spark.read.option("header", True).csv(f"{data}/circuits.csv")

## Rename columns for better understanding

In [4]:
results.columns

['resultId',
 'raceId',
 'driverId',
 'constructorId',
 'number',
 'grid',
 'position',
 'positionText',
 'positionOrder',
 'points',
 'laps',
 'time',
 'milliseconds',
 'fastestLap',
 'rank',
 'fastestLapTime',
 'fastestLapSpeed',
 'statusId']

In [5]:
results = results.withColumnRenamed("resultId", "result_id").withColumnRenamed("raceId", "race_id").withColumnRenamed("constructorId", "constructor_id").withColumnRenamed("statusId", "status_id").withColumnRenamed("number", "results_number").withColumnRenamed("time", "results_time").withColumnRenamed("driverId", "driver_id").withColumnRenamed("position", "result_position")

In [6]:
races.columns

['raceId', 'year', 'round', 'circuitId', 'name', 'date', 'time', 'url']

In [7]:
races = races.withColumnRenamed("raceId", "race_id").withColumnRenamed("circuitId", "circuit_id").withColumnRenamed("url", "race_url").withColumnRenamed("time", "race_time").withColumnRenamed("name", "race_name")

In [8]:
qualifying.columns

['qualifyId',
 'raceId',
 'driverId',
 'constructorId',
 'number',
 'position',
 'q1',
 'q2',
 'q3']

In [9]:
# The key column in qualifying.csv is "qualifyId" (there is no "qualifyingId"); it is not selected later, so it keeps its name.
qualifying = qualifying.withColumnRenamed("number", "qualifying_number").withColumnRenamed("raceId", "race_id").withColumnRenamed("driverId", "driver_id").withColumnRenamed("constructorId", "constructor_id").withColumnRenamed("position", "qualifying_position")

In [10]:
drivers.columns

['driverId',
 'driverRef',
 'number',
 'code',
 'forename',
 'surname',
 'dob',
 'nationality',
 'url']

In [11]:
drivers = drivers.withColumnRenamed("number", "driver_number").withColumnRenamed("nationality", "driver_nationality").withColumnRenamed("url", "driver_url").withColumnRenamed("driverId", "driver_id").withColumnRenamed("driverRef", "driver_ref")

In [12]:
constructors.columns

['constructorId', 'constructorRef', 'name', 'nationality', 'url']

In [13]:
constructors = constructors.withColumnRenamed("name", "constructor_name").withColumnRenamed("nationality", "constructor_nationality").withColumnRenamed("url", "constructor_url").withColumnRenamed("constructorId", "constructor_id").withColumnRenamed("constructorRef", "constructor_ref")

In [14]:
circuits.columns

['circuitId',
 'circuitRef',
 'name',
 'location',
 'country',
 'lat',
 'lng',
 'alt',
 'url']

In [15]:
circuits = circuits.withColumnRenamed("circuitId", "circuit_id").withColumnRenamed("circuitRef", "circuit_ref").withColumnRenamed("name", "circuit_name").withColumnRenamed("location", "circuit_location").withColumnRenamed("country", "circuit_country").withColumnRenamed("url", "circuit_url")
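
The renames above follow one pattern: camelCase becomes snake_case, and column names shared by several tables (`name`, `url`, ...) get a table prefix. The following is only a sketch of how such a rename plan could be generated; `to_snake_case` and `rename_plan` are hypothetical helpers, not part of the notebook or of PySpark.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a camelCase column name such as 'raceId' to 'race_id'."""
    # Insert an underscore before every capital letter that is not at the start.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def rename_plan(columns, prefix, ambiguous):
    """Map each original column name to its new name.

    Columns listed in `ambiguous` (names shared by several tables) get the
    table prefix; the others are only converted to snake_case.
    """
    plan = {}
    for column in columns:
        snake = to_snake_case(column)
        plan[column] = f"{prefix}_{snake}" if column in ambiguous else snake
    return plan


circuit_columns = ["circuitId", "circuitRef", "name", "location", "country", "url"]
plan = rename_plan(
    circuit_columns,
    prefix="circuit",
    ambiguous={"name", "location", "country", "url"},
)
for old, new in plan.items():
    print(f"{old} -> {new}")
```

Each `(old, new)` pair could then be passed to `withColumnRenamed` in a loop instead of writing one long chain per table.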

## Join DataFrames to create one

In [16]:
df1 = races.join(results, "race_id", "inner")

In [17]:
df2 = df1.join(qualifying, ["race_id", "driver_id", "constructor_id"], "inner")

In [18]:
df3 = df2.join(drivers, "driver_id", "inner")

In [19]:
df4 = df3.join(constructors, "constructor_id", "inner")

In [20]:
df5 = df4.join(circuits, "circuit_id", "inner")

In [21]:
df5.columns

['circuit_id',
 'constructor_id',
 'driver_id',
 'race_id',
 'year',
 'round',
 'race_name',
 'date',
 'race_time',
 'race_url',
 'result_id',
 'results_number',
 'grid',
 'result_position',
 'positionText',
 'positionOrder',
 'points',
 'laps',
 'results_time',
 'milliseconds',
 'fastestLap',
 'rank',
 'fastestLapTime',
 'fastestLapSpeed',
 'status_id',
 'qualifyId',
 'qualifying_number',
 'qualifying_position',
 'q1',
 'q2',
 'q3',
 'driver_ref',
 'driver_number',
 'code',
 'forename',
 'surname',
 'dob',
 'driver_nationality',
 'driver_url',
 'constructor_ref',
 'constructor_name',
 'constructor_nationality',
 'constructor_url',
 'circuit_ref',
 'circuit_name',
 'circuit_location',
 'circuit_country',
 'lat',
 'lng',
 'alt',
 'circuit_url']

## Select necessary columns

In [22]:
data = df5.select(['year', 'date', 'grid', 'status_id', 'qualifying_position', 'forename', 'surname', 'dob', 'driver_nationality', 'constructor_name', 'constructor_nationality', 'race_name', 'circuit_country'])

In [23]:
data.columns

['year',
 'date',
 'grid',
 'status_id',
 'qualifying_position',
 'forename',
 'surname',
 'dob',
 'driver_nationality',
 'constructor_name',
 'constructor_nationality',
 'race_name',
 'circuit_country']

## F1 Grand Prix structure

An F1 GP runs over the three days of a race weekend and is made up of three parts: the practice sessions, the qualifying session, and the actual race.

Practice has three stages, FP1, FP2 and FP3. These are free practice sessions in which teams test their cars on Friday and Saturday.
The qualifying session is also made up of three stages, Q1, Q2 and Q3. In Q1 all drivers compete to set the best lap time, and the bottom 5 drivers are eliminated. The top 15 drivers take part in Q2 and again try to set the best lap time; the top 10 move on to Q3, where they set the best lap time they can. The cars' positions at the start of the race are decided by these qualifying times, with the fastest driver starting at the front.
Sunday's session is the race: points are awarded to the top 10 drivers, and the top three get to enjoy the podium.
This is repeated over a full season at different circuits. The driver with the most points is awarded the World Championship, and the team with the most points wins the Constructors' Championship (each team has two cars and two drivers).
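
The knockout qualifying format described above can be sketched in plain Python. This is an illustration under simplifying assumptions: 20 hypothetical drivers (`D01` to `D20`), invented lap times, 5 eliminations per stage, and no grid penalties. `starting_grid` is a hypothetical helper, not something the notebook defines.

```python
def starting_grid(q1, q2, q3, eliminated_per_stage=5):
    """Order drivers for the race start from their best lap in each stage.

    Each argument maps driver -> best lap time in seconds for that stage.
    Times of drivers who were already eliminated are ignored.
    """
    def fastest_first(times, drivers):
        return sorted(drivers, key=lambda d: times[d])

    # Q1: everyone runs, the slowest drivers are out.
    q1_order = fastest_first(q1, q1.keys())
    q1_out = q1_order[-eliminated_per_stage:]
    q2_field = q1_order[:-eliminated_per_stage]

    # Q2: survivors run again, the slowest of them are out.
    q2_order = fastest_first(q2, q2_field)
    q2_out = q2_order[-eliminated_per_stage:]
    q3_field = q2_order[:-eliminated_per_stage]

    # Q3: the remaining drivers fight for the front of the grid.
    q3_order = fastest_first(q3, q3_field)
    return q3_order + q2_out + q1_out


drivers = [f"D{i:02d}" for i in range(1, 21)]
q1 = {d: 80.0 + 0.1 * i for i, d in enumerate(drivers)}
q2 = {d: t - 0.5 for d, t in q1.items()}
q3 = {d: t - 0.5 for d, t in q2.items()}
q3["D01"] = 85.0  # invented mistake in Q3: D01 ends up last of the top 10

grid = starting_grid(q1, q2, q3)
print(grid)
```

In this invented session `D01` is fastest in Q1 and Q2 but starts tenth, because only the Q3 lap counts for the top 10.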

In [24]:
# Consider only data points from 2010 onwards
data = data.filter(col("year") >= 2010)

In [25]:
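# Rename columns, parse the date columns and build the full driver name.
# "position" is the qualifying classification; "quali_pos" is the starting grid slot.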
data = data.withColumnRenamed("race_name", "GP_name").withColumnRenamed("circuit_country", "country").withColumnRenamed("qualifying_position", "position").withColumnRenamed("grid", "quali_pos").withColumnRenamed("constructor_name", "constructor").withColumn("date", to_date(col("date"))).withColumn("dob", to_date(col("dob"))).withColumn("driver", concat(col("forename"), lit(" "), col("surname")))

In [26]:
# Creating driver age parameter
data = data.withColumn("age_at_gp_in_days", datediff(col("date"), col("dob")))
data = data.withColumn("age_at_gp_in_days", expr("CAST(age_at_gp_in_days AS STRING)"))
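
As a quick sanity check of the age feature, the same day difference can be computed with the standard library. This is a sketch with made-up dates; `age_in_days` is a hypothetical helper that mirrors the argument order of `datediff(end, start)`.

```python
from datetime import date


def age_in_days(race_date: date, date_of_birth: date) -> int:
    """Whole days between the date of birth and the race day."""
    return (race_date - date_of_birth).days


# 20 years including 5 leap days (2000, 2004, 2008, 2012, 2016)
example_age = age_in_days(date(2020, 1, 1), date(2000, 1, 1))
print(example_age)  # 7305
```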

In [27]:
data = data.withColumn("constructor", when(col("constructor") == "Force India", "Racing Point")
                                   .when(col("constructor") == "Sauber", "Alfa Romeo")
                                   .when(col("constructor") == "Lotus F1", "Renault")
                                   .when(col("constructor") == "Toro Rosso", "AlphaTauri")
                                   .otherwise(col("constructor")))

In [28]:
data = data.withColumn('driver_nationality', data['driver_nationality'].substr(1, 3))
data = data.withColumn('constructor_nationality', data['constructor_nationality'].substr(1, 3))
data = data.withColumn('country', when(data['country'] == 'UK', 'Bri').otherwise(data['country']))
data = data.withColumn('country', when(data['country'] == 'USA', 'Ame').otherwise(data['country']))
data = data.withColumn('country', when(data['country'] == 'France', 'Fre').otherwise(data['country']))
data = data.withColumn('country', data['country'].substr(1, 3))
data = data.withColumn('driver_home', (data['driver_nationality'] == data['country']).cast("int"))
data = data.withColumn('constructor_home', (data['constructor_nationality'] == data['country']).cast("int"))
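
The home-race flags rely on the first three letters of a nationality matching the first three letters of a country, with a few overrides. Below is a plain-Python sketch of that matching rule. It assumes the circuits table stores country names such as `UK`, `USA` and `France` and the drivers table stores adjectives such as `British`; the helper names are hypothetical.

```python
# Countries whose first three letters differ from the nationality adjective.
COUNTRY_OVERRIDES = {"UK": "Bri", "USA": "Ame", "France": "Fre"}


def country_key(country: str) -> str:
    """Three-letter key for a circuit country, after applying overrides."""
    return COUNTRY_OVERRIDES.get(country, country)[:3]


def nationality_key(nationality: str) -> str:
    """Three-letter key for a driver or constructor nationality."""
    return nationality[:3]


def is_home_race(nationality: str, country: str) -> int:
    """1 if the nationality and the circuit country share a key, else 0."""
    return int(nationality_key(nationality) == country_key(country))


for nationality, country in [("British", "UK"), ("German", "Germany"),
                             ("French", "France"), ("Dutch", "Netherlands")]:
    print(nationality, country, is_home_race(nationality, country))
```

Pairs such as Dutch and Netherlands share no prefix, so under this rule they count as away races unless another override is added.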

In [29]:
dnf_statuses = [3, 4, 20, 29, 31, 41, 68, 73, 81, 97, 82, 104, 107, 130, 137]
data = data.withColumn('driver_dnf', when(col('status_id').isin(dnf_statuses), 1).otherwise(0))
data = data.withColumn('constructor_dnf', when(~col('status_id').isin(dnf_statuses + [1]), 1).otherwise(0))
data = data.drop('forename', 'surname')
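
The two flags split retirements by who is held responsible. A plain-Python sketch of the same rule for a single result row, assuming that status id `1` marks a finished race; `dnf_flags` is a hypothetical helper.

```python
DRIVER_DNF_STATUSES = {3, 4, 20, 29, 31, 41, 68, 73, 81, 97, 82, 104, 107, 130, 137}
FINISHED = 1  # assumption: status id of a finished race


def dnf_flags(status_id: int):
    """Return (driver_dnf, constructor_dnf) for one result row."""
    driver_dnf = int(status_id in DRIVER_DNF_STATUSES)
    # Anything that is neither a driver-related status nor a finish
    # is counted against the constructor.
    constructor_dnf = int(status_id not in DRIVER_DNF_STATUSES
                          and status_id != FINISHED)
    return driver_dnf, constructor_dnf


for status_id in (1, 3, 5):
    print(status_id, dnf_flags(status_id))
```

Note that every status id outside the driver list other than `1` is attributed to the constructor, whatever that status means in the source data.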

In [30]:
# Calculate DNF count by driver
dnf_by_driver = data.groupBy('driver').agg({'driver_dnf': 'sum'})

# Calculate race entered count by driver
driver_race_entered = data.groupBy('driver').count()

# Join the two calculated DataFrames
driver_stats = dnf_by_driver.join(driver_race_entered, 'driver')

# Calculate DNF ratio and driver confidence
driver_stats = driver_stats.withColumn('driver_dnf_ratio', driver_stats['sum(driver_dnf)'] / driver_stats['count'])
driver_stats = driver_stats.withColumn('driver_confidence', 1 - driver_stats['driver_dnf_ratio'])

# Select the necessary columns and collect them into a Python dictionary
driver_confidence_dict = driver_stats.select('driver', 'driver_confidence').rdd.collectAsMap()
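
The driver confidence is simply `1 - (driver DNFs / races entered)`. A small worked example in plain Python, using invented entries for two placeholder drivers; `confidence_by_driver` is a hypothetical helper.

```python
from collections import Counter


def confidence_by_driver(entries, dnf_statuses):
    """entries: iterable of (driver, status_id) pairs, one per race entered."""
    entered = Counter()
    dnfs = Counter()
    for driver, status_id in entries:
        entered[driver] += 1
        if status_id in dnf_statuses:
            dnfs[driver] += 1
    # confidence = 1 - DNF ratio
    return {driver: 1 - dnfs[driver] / entered[driver] for driver in entered}


entries = [("Driver A", 1), ("Driver A", 3), ("Driver A", 1), ("Driver A", 1),
           ("Driver B", 4), ("Driver B", 20)]
confidence = confidence_by_driver(entries, dnf_statuses={3, 4, 20})
print(confidence)
```

Driver A retires once in four races (confidence 0.75), while Driver B retires in both races entered (confidence 0.0).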

In [31]:
# Calculate DNF count by constructor
dnf_by_constructor = data.groupBy('constructor').agg({'constructor_dnf': 'sum'})

# Calculate race entered count by constructor
constructor_race_entered = data.groupBy('constructor').count()

# Join the two calculated DataFrames
constructor_stats = dnf_by_constructor.join(constructor_race_entered, 'constructor')

# Calculate DNF ratio and constructor reliability
constructor_stats = constructor_stats.withColumn('constructor_dnf_ratio', constructor_stats['sum(constructor_dnf)'] / constructor_stats['count'])
constructor_stats = constructor_stats.withColumn('constructor_reliability', 1 - constructor_stats['constructor_dnf_ratio'])

# Select the necessary columns and collect them into a Python dictionary
constructor_reliability_dict = constructor_stats.select('constructor', 'constructor_reliability').rdd.collectAsMap()

In [32]:
# Create a DataFrame for driver confidence and constructor reliability dictionaries
driver_confidence_df = spark.createDataFrame(driver_confidence_dict.items(), ["driver", "driver_confidence"])
constructor_reliability_df = spark.createDataFrame(constructor_reliability_dict.items(), ["constructor", "constructor_reliability"])

# Adding 'driver_confidence' column
data = data.join(driver_confidence_df, on='driver', how='left')

# Adding 'constructor_reliability' column
data = data.join(constructor_reliability_df, on='constructor', how='left')

In [33]:
# Lists of active constructors and drivers
active_constructors = ['Renault', 'Williams', 'McLaren', 'Ferrari', 'Mercedes',
                       'AlphaTauri', 'Racing Point', 'Alfa Romeo', 'Red Bull',
                       'Haas F1 Team']
active_drivers = ['Daniel Ricciardo', 'Kevin Magnussen', 'Carlos Sainz',
                  'Valtteri Bottas', 'Lance Stroll', 'George Russell',
                  'Lando Norris', 'Sebastian Vettel', 'Kimi Räikkönen',
                  'Charles Leclerc', 'Lewis Hamilton', 'Daniil Kvyat',
                  'Max Verstappen', 'Pierre Gasly', 'Alexander Albon',
                  'Sergio Pérez', 'Esteban Ocon', 'Antonio Giovinazzi',
                  'Romain Grosjean', 'Nicholas Latifi']

# Adding 'active_driver' column
data = data.withColumn("active_driver", when(col("driver").isin(active_drivers), 1).otherwise(0))

# Adding 'active_constructor' column
data = data.withColumn("active_constructor", when(col("constructor").isin(active_constructors), 1).otherwise(0))

In [37]:
data.columns

['constructor',
 'driver',
 'year',
 'date',
 'quali_pos',
 'status_id',
 'position',
 'dob',
 'driver_nationality',
 'constructor_nationality',
 'GP_name',
 'country',
 'age_at_gp_in_days',
 'driver_home',
 'constructor_home',
 'driver_dnf',
 'constructor_dnf',
 'driver_confidence',
 'constructor_reliability',
 'active_driver',
 'active_constructor']

In [35]:
data.write.csv(r"/home/sunbeam/Desktop/f1/data/hnhr/", header=True)

Py4JJavaError: An error occurred while calling o407.csv.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 28.0 failed 1 times, most recent failure: Lost task 0.0 in stage 28.0 (TID 24) (cdac executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/sunbeam/.local/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 485, in main
    ("%d.%d" % sys.version_info[:2], version))
RuntimeError: Python in worker has different version 3.7 than that in driver 3.6, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:556)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:762)
	at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:744)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:509)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2450)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2399)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2398)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2398)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1156)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1156)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1156)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2638)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2580)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2569)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
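
The `RuntimeError` above reports that the worker ran Python 3.7 while the driver ran 3.6, and it points at `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON`. One common remedy is sketched below. It assumes the `local[2]` master used in this notebook, so driver and workers share one machine, and it assumes these lines run before the `SparkSession` is created.

```python
import os
import sys

# Point both the driver and the workers at the interpreter running this code,
# so that their minor versions cannot differ.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

print(os.environ["PYSPARK_PYTHON"], "%d.%d" % sys.version_info[:2])
```

After restarting the kernel and re-running the notebook from the first cell with these variables set, the write in the last cell should no longer hit the version check, provided nothing else overrides them.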
