## General F1 data analysis

In this notebook I want to do some general analysis on F1 data. I will use this notebook to get back into using Python and to try some interesting things with F1 data. My first step is to import the data and just play around with it.

In [None]:
## Imports
import pandas as pd
import os
import numpy as np
import seaborn as sns
from datetime import datetime
# To get full output
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
# Change current working directory to where our F1 data is stored
os.getcwd();
os.chdir('C:\\Users\\yanni\\OneDrive\\Documents\\Data_Science\\F1_data')
os.getcwd()     


In [None]:
circuits_df = pd.read_csv('circuits.csv')

In [None]:
circuits_df.shape
circuits_df.dtypes
circuits_df.describe(include = 'all')
circuits_df.head()

In [None]:
# We saw that altitude only has 2 values
circuits_df.groupby(by = 'alt').count()

# The \N value probably indicates a missing value, so we set it to missing
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
circuits_df = circuits_df.replace(r"\N", np.nan)

# The \N values have now been changed to null values
circuits_df.head()
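
As an alternative to replacing `\N` after loading, the placeholder can be declared as missing while the file is read. The sketch below uses a small made-up sample (the rows and the `sample_df` name are hypothetical, not taken from the real `circuits.csv`) to show that a column containing `\N` then comes out numeric instead of as text.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics the layout of circuits.csv
sample_csv = io.StringIO(
    "circuitId,name,lat,alt\n"
    "1,Track A,-37.8,10\n"
    "2,Track B,57.3,\\N\n"
    "3,Track C,45.6,162\n"
)

# Treat the literal string \N as a missing value while reading the file
sample_df = pd.read_csv(sample_csv, na_values=["\\N"])

# alt is parsed as a numeric column because the placeholder never enters it
print(sample_df.dtypes)
print(sample_df["alt"].isna().sum())
```

With the `replace` approach used above, a column that contained `\N` may stay of type `object`, so it can be worth checking `dtypes` again afterwards.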

#### General analysis of circuits
In this part I will investigate the different tracks. I want to find the answers to the following questions:

- Which track is the most northern?
- Which track is the most southern?
- Which country has the most F1 tracks?

In [None]:
# Which track is the most northern and southern

# Most northern
circuits_df.loc[circuits_df['lat'].idxmax()]
# We see that the most northern track is a track in Sweden

# Most southern
circuits_df.loc[circuits_df['lat'].idxmin()]
# The most southern track is Albert Park in Melbourne

In [None]:
# Which country has the most F1 tracks
# We see that this is the USA, which has 11 F1 tracks
circuits_df[['circuitId', 'country']] \
.groupby(by = 'country') \
.count()\
.sort_values(by = ['circuitId'], ascending = False) \
.head(5)
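
The same count can probably be obtained more directly with `value_counts`, which counts rows per value and sorts in descending order by default. This is a minimal sketch on made-up data (`sample_df` and its rows are hypothetical), assuming one row per circuit as in `circuits_df`.

```python
import pandas as pd

# Hypothetical sample; the real circuits_df has one row per circuit
sample_df = pd.DataFrame({
    "circuitId": [1, 2, 3, 4, 5],
    "country": ["USA", "Italy", "USA", "Sweden", "Italy"],
})

# value_counts counts the rows per country and sorts descending by default
tracks_per_country = sample_df["country"].value_counts()
print(tracks_per_country.head(5))
```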

#### What has changed over the years
In this step we will see if things have changed over the years.
We want to see whether:
- Cars have become faster
- Pit stops have become faster
- There are fewer safety cars
- There are smaller gaps between the drivers


In [None]:
# Check whether cars have become faster
# We will do this by comparing the lap times on the different tracks and
# seeing how they have changed from year to year

# To do this we need the lap times data
laptimes_df = pd.read_csv('lap_times.csv')

In [None]:
# Some basic checks
laptimes_df.shape
laptimes_df.dtypes
laptimes_df.describe(include = 'all')
laptimes_df.head()

In [None]:
# We see from this that we also need to join the race data 
race_df = pd.read_csv('races.csv')

In [None]:
race_df.shape
race_df.dtypes
race_df.describe(include = 'all')
race_df.head()

In [None]:
# Now we can find where each lap was raced. We want to do this so that we
# can group per circuit per year

# First we merge with race_df to get the circuitId, the year and the race name
laptime_race_df = laptimes_df.merge(race_df[['raceId', 'year'
                                             , 'circuitId', 'name']]
                                    , how = 'left'
                                    , on = 'raceId')
laptime_race_df = laptime_race_df.rename(columns = {'name': 'name_race'})
# Then we merge with circuits_df to get the circuit name
laptime_circuit_df = laptime_race_df.merge(circuits_df[['circuitId'
                                                        , 'name']]
                                           , how = 'left'
                                           , on = 'circuitId')
laptime_circuit_df = laptime_circuit_df.rename(columns = {'name': 'name_circuit'})
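
A left merge like this can silently duplicate laps if a key appears more than once on the right-hand side. As a sanity check, pandas offers the `validate` argument of `merge`. The sketch below uses tiny made-up frames (`laps`, `races`, `circuits` and their values are hypothetical stand-ins for the real files) and renames the `name` columns before merging so they cannot collide.

```python
import pandas as pd

# Hypothetical miniature versions of lap_times.csv, races.csv and circuits.csv
laps = pd.DataFrame({"raceId": [10, 10, 11], "lap": [1, 2, 1],
                     "milliseconds": [91000, 90500, 84000]})
races = pd.DataFrame({"raceId": [10, 11], "year": [2008, 2009],
                      "circuitId": [1, 2], "name": ["Race A GP", "Race B GP"]})
circuits = pd.DataFrame({"circuitId": [1, 2], "name": ["Track A", "Track B"]})

# Rename before merging so the two 'name' columns never collide
races_small = races.rename(columns={"name": "name_race"})
circuits_small = circuits.rename(columns={"name": "name_circuit"})

# validate="many_to_one" raises a MergeError if a key repeats on the right side
merged = (
    laps.merge(races_small, how="left", on="raceId", validate="many_to_one")
        .merge(circuits_small, how="left", on="circuitId", validate="many_to_one")
)

# A left join keeps every lap; unmatched keys would show up as NaN
print(len(merged) == len(laps))
print(merged["name_circuit"].isna().sum())
```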

In [None]:
# We now have a df of all the lap times with their race, circuit and year
laptime_circuit_df.head()


In [None]:
# Per year and circuit, compute the min, max, mean and median lap time
times_circuit_df = laptime_circuit_df[['year', 'name_circuit', 'milliseconds']] \
                    .groupby(['year', 'name_circuit']) \
                    .agg(['min', 'max', 'mean', 'median'])

# Drop the 'milliseconds' level so the columns are just min, max, mean, median
times_circuit_df.columns = times_circuit_df.columns.droplevel(0)
times_circuit_df = times_circuit_df.reset_index()
# name_circuit already has the right name, so only the statistics are renamed
times_circuit_df = times_circuit_df.rename(columns = { 'min': 'min_ms'
                                                      , 'max': 'max_ms'
                                                      , 'mean' : 'mean_ms'
                                                      , 'median': 'median_ms'})
times_circuit_df.head()
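
The level-dropping and renaming steps can likely be avoided with pandas named aggregation, where each output column is given its name directly in `agg`. A minimal sketch on made-up lap times (the `laps` frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical lap times for one circuit over two years
laps = pd.DataFrame({
    "year": [2008, 2008, 2008, 2009, 2009],
    "name_circuit": ["Track A"] * 5,
    "milliseconds": [91000, 90500, 92000, 89000, 89500],
})

# Named aggregation: output_column=(input_column, function)
summary = (
    laps.groupby(["year", "name_circuit"], as_index=False)
        .agg(min_ms=("milliseconds", "min"),
             max_ms=("milliseconds", "max"),
             mean_ms=("milliseconds", "mean"),
             median_ms=("milliseconds", "median"))
)
print(summary)
```

The result already has flat column names, so no `droplevel`, `reset_index` or `rename` is needed.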

In [None]:
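# We now have the min, max, mean and median lap time per year and circuit
# We can visualize this
# Note: with several circuits per year, sns.barplot shows the mean of min_ms
# over those circuits as the bar height, with an error bar around it

g = sns.barplot(x = 'year', y = 'min_ms', data = times_circuit_df, color = 'red')
g.set_xticklabels(g.get_xticklabels(), rotation = 90, ha = 'center')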

In [None]:
# We can see from here that there are certain years where the cars became
# slower. This happened mostly in 2009
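
One caveat with comparing yearly averages is that the set of circuits on the calendar changes, so a year can look slower simply because longer tracks were raced. A possible way around this, sketched below on made-up numbers, is to compute the year-over-year change in the fastest lap within each circuit and only then average per year. The `fastest` frame, its values and the `change_pct` column are hypothetical; it is only shaped like `times_circuit_df`.

```python
import pandas as pd

# Hypothetical fastest-lap table shaped like times_circuit_df
fastest = pd.DataFrame({
    "year": [2008, 2009, 2010, 2008, 2009, 2010],
    "name_circuit": ["Track A"] * 3 + ["Track B"] * 3,
    "min_ms": [90000, 91800, 90900, 80000, 80800, 79992],
})

# Sort so that each circuit's rows are in chronological order
fastest = fastest.sort_values(["name_circuit", "year"])

# Percentage change of the fastest lap versus the previous row of the same circuit
fastest["change_pct"] = fastest.groupby("name_circuit")["min_ms"].pct_change() * 100

# Average the per-circuit changes for each year; positive means slower
yearly_change = fastest.groupby("year")["change_pct"].mean()
print(yearly_change.round(2))
```

Note that `pct_change` compares against the previous row for that circuit, which is only the previous season if the circuit was raced in consecutive years. Wet races can also distort the fastest lap, so this is a rough check rather than a definitive measure.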


In [None]:
times_circuit_df.name_circuit.unique()