# **<p style="text-align: center;">Which circuits produce the fastest average finishing time?</p>**

##### <p style="text-align: center;">QTM 151</p>
##### <p style="text-align: center;">Dr. Juan Estrada</p>
##### <p style="text-align: center;">Samuel Lim, Jonathan Wang, Enoc Dejesus</p>
##### <p style="text-align: center;">Section 1</p>

QTM FINAL PROJECT - Formula 1
Project members: Samuel Lim, Jonathan Wang, Enoc Dejesus
QTM 151-1

INTRODUCTION
Formula 1, or F1, stands at the pinnacle of international motorsport, showcasing cutting-edge technology and high-speed racing on a global scale. 
It features premier single-seater racing events held on many iconic tracks. Teams, each with two drivers, compete in aerodynamically advanced cars powered by hybrid power units, reaching speeds over 200 miles per hour. The sport demands a unique mix of skill, strategy, and teamwork as drivers vie for the Drivers' and Constructors' Championships. With a history dating back to the 1950s, Formula 1 has become a worldwide spectacle, captivating millions of fans and pushing the boundaries of automotive innovation. F1 races are commonly evaluated with metrics such as finishing times and other race characteristics. The "Formula 1" dataset provides access to these metrics, covering F1 race data from 1950 to 2023.

OUR QUESTION
The central research question of this project is as follows: _which circuits produce the fastest average finishing time?_ We analyze the average finishing time of drivers (in milliseconds) at each circuit and compare these averages to see which circuits produce the fastest racing times. This comparison matters because finishing times can vary widely from one track to another, which obscures what counts as a fast finishing time.
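
The comparison described above amounts to a group-by average. The following is a minimal sketch on made-up toy data, not the project's actual analysis: the column names `name` and `milliseconds` follow the dataset, while the race labels and times are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one row per driver result, with the race name and
# the finishing time in milliseconds (all values are made up).
toy = pd.DataFrame({
    "name": ["Race A", "Race A", "Race B", "Race B", "Race C"],
    "milliseconds": [5_400_000, 5_600_000, 4_900_000, 5_100_000, 6_000_000],
})

# Average finishing time per race name, sorted from fastest to slowest.
avg_time = (
    toy.groupby("name")["milliseconds"]
       .mean()
       .sort_values()
)

# The first index label is the race with the lowest average time.
fastest = avg_time.index[0]
print(avg_time)
print("Fastest on average:", fastest)
```

Sorting the averages in ascending order puts the fastest race first, so the ranking can be read directly from the resulting Series.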





 Our results indicate that Finland produces the lowest average finishing time among drivers who participate in F1 races. The following report outlines the data wrangling and analysis that produced this outcome by describing relevant data sets and information, explaining any data manipulation methods, and presenting and interpreting results.

In [11]:
# Imports for project
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import datasets
races = pd.read_csv("raw_data/races.csv")
results = pd.read_csv("raw_data/results.csv")

# Count number of observations
print("# of observations in races:", len(races))
print("# of observations in results:", len(results))

# of observations in races: 1102
# of observations in results: 25840


#### 1. Merging Procedure
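We merge the two datasets, `races` and `results`, on the variable `raceId` because our two variables of interest, `milliseconds` and `name`, are stored in different datasets: `name` comes from `races` and `milliseconds` comes from `results`. Since we are concerned with the performance of drivers on different race circuits, and `raceId` is the key shared by both tables, merging on `raceId` is most appropriate. We use a left merge, so every race is kept even when it has no matching results.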

In [12]:
races_merge = pd.merge(races,
                       results,
                       on = "raceId",
                       how = "left")
display(races_merge)

| | raceId | year | round | circuitId | name | date | time_x | url | fp1_date | fp1_time | ... | positionOrder | points | laps | time_y | milliseconds | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2009 | 1 | 1 | Australian Grand Prix | 2009-03-29 | 06:00:00 | http://en.wikipedia.org/wiki/2009_Australian_G... | \N | \N | ... | 1.0 | 10.0 | 58.0 | 1:34:15.784 | 5655784 | 17 | 3 | 1:28.020 | 216.891 | 1.0 |
| 1 | 1 | 2009 | 1 | 1 | Australian Grand Prix | 2009-03-29 | 06:00:00 | http://en.wikipedia.org/wiki/2009_Australian_G... | \N | \N | ... | 2.0 | 8.0 | 58.0 | +0.807 | 5656591 | 43 | 14 | 1:29.066 | 214.344 | 1.0 |
| 2 | 1 | 2009 | 1 | 1 | Australian Grand Prix | 2009-03-29 | 06:00:00 | http://en.wikipedia.org/wiki/2009_Australian_G... | \N | \N | ... | 3.0 | 6.0 | 58.0 | +1.604 | 5657388 | 50 | 10 | 1:28.916 | 214.706 | 1.0 |
| 3 | 1 | 2009 | 1 | 1 | Australian Grand Prix | 2009-03-29 | 06:00:00 | http://en.wikipedia.org/wiki/2009_Australian_G... | \N | \N | ... | 4.0 | 5.0 | 58.0 | +4.435 | 5660219 | 53 | 6 | 1:28.416 | 215.920 | 1.0 |
| 4 | 1 | 2009 | 1 | 1 | Australian Grand Prix | 2009-03-29 | 06:00:00 | http://en.wikipedia.org/wiki/2009_Australian_G... | \N | \N | ... | 5.0 | 4.0 | 58.0 | +4.879 | 5660663 | 53 | 9 | 1:28.712 | 215.199 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25858 | 1116 | 2023 | 19 | 69 | United States Grand Prix | 2023-10-22 | 19:00:00 | https://en.wikipedia.org/wiki/2023_United_Stat... | 2023-10-20 | 17:30:00 | ... | | | | | | | | | | |
| 25859 | 1117 | 2023 | 20 | 32 | Mexico City Grand Prix | 2023-10-29 | 20:00:00 | https://en.wikipedia.org/wiki/2023_Mexico_City... | 2023-10-27 | 18:30:00 | ... | | | | | | | | | | |
| 25860 | 1118 | 2023 | 21 | 18 | São Paulo Grand Prix | 2023-11-05 | 17:00:00 | https://en.wikipedia.org/wiki/2023_S%C3%A3o_Pa... | 2023-11-03 | 14:30:00 | ... | | | | | | | | | | |
| 25861 | 1119 | 2023 | 22 | 80 | Las Vegas Grand Prix | 2023-11-19 | 06:00:00 | https://en.wikipedia.org/wiki/2023_Las_Vegas_G... | 2023-11-17 | 04:30:00 | ... | | | | | | | | | | |
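
The last rows of the merged output have no result values, which is what a left merge produces for races without matching rows in `results`. The merged data has 25863 rows against 25840 rows in `results`, which would be consistent with 23 races lacking results, assuming every result row matches a race. One way to check this kind of outcome is sketched below on hypothetical miniature tables (`toy_races` and `toy_results` are made up for illustration), using the `indicator` and `validate` parameters of `pd.merge`.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables (made-up values).
toy_races = pd.DataFrame({
    "raceId": [1, 2, 3],
    "name": ["Race A", "Race B", "Race C"],
})
toy_results = pd.DataFrame({
    "raceId": [1, 1, 2],
    "milliseconds": ["5655784", "5656591", "\\N"],
})

# A left merge keeps every race. `indicator=True` adds a `_merge` column
# recording whether a matching result row was found, and `validate`
# raises an error if raceId is duplicated in the races table.
toy_merge = pd.merge(
    toy_races,
    toy_results,
    on="raceId",
    how="left",
    indicator=True,
    validate="one_to_many",
)

# Races without any result row are flagged "left_only" and carry NaN.
unmatched = toy_merge[toy_merge["_merge"] == "left_only"]
print(len(toy_merge), "rows after merge;", len(unmatched), "race(s) without results")
```

Applied to the real tables, the same filter on `_merge` would list the races that contribute only missing finishing times.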


#### 2. Data Cleaning
We began the data cleaning process by inspecting the data types in the merged dataset. The `milliseconds` column was stored as an object (string) type, so its missing entries had to be marked and its values converted to a numeric format. To do this, we checked each entry in the `milliseconds` column to see whether it was numeric. The check flagged many entries, all of which contained the same non-numeric placeholder, `\N`, used in the raw data for missing values. We replaced this placeholder with NaN to mark missing data in the `milliseconds` column, and then converted the column to numeric values.

In [21]:
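# Find entries in "milliseconds" that are not purely numeric strings
subset = races_merge.query("milliseconds.str.isnumeric()==False")
list_unique = pd.unique(subset["milliseconds"])
display(list_unique)

# Replace those placeholder values with NaN
races_merge["milliseconds"] = races_merge["milliseconds"].replace(list_unique, np.nan)

# Convert the column to numeric and store the result back in the DataFrame
races_merge["milliseconds"] = pd.to_numeric(races_merge["milliseconds"])
races_merge["milliseconds"]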

array(['\\N'], dtype=object)

0        5655784.0
1        5656591.0
2        5657388.0
3        5660219.0
4        5660663.0
           ...    
25858          NaN
25859          NaN
25860          NaN
25861          NaN
25862          NaN
Name: milliseconds, Length: 25863, dtype: float64
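
The cell above replaces the placeholder explicitly before converting. A shorter alternative, shown here only as a sketch on a hypothetical toy column rather than the project's method, is to let `pd.to_numeric` coerce unparseable entries to NaN with `errors="coerce"`. The trade-off is that coercion silently converts any unexpected value, so it is worth inspecting the unique non-numeric values first, as done above.

```python
import pandas as pd

# Hypothetical column mimicking the raw data: numeric strings, the "\N"
# placeholder, and a true missing value.
raw = pd.Series(["5655784", "5656591", "\\N", None], name="milliseconds")
print(raw.dtype)

# errors="coerce" turns anything that cannot be parsed into NaN in one step.
clean = pd.to_numeric(raw, errors="coerce")

# Count how many entries ended up missing after the conversion.
n_missing = int(clean.isna().sum())
print(clean.dtype, n_missing)
```

Comparing `n_missing` with the number of rows flagged by the `isnumeric` check plus the rows that were already missing is a quick way to confirm that nothing else was coerced.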