# Welcome to my first kernel

## About Formula 1
I am as new to the sport as I am to Data Science. At the time of writing, Lewis Hamilton has just won his seventh World Championship title, equaling Michael Schumacher's record, with his latest win at the Turkish Grand Prix. There are two annual World Championships in the sport: one for the drivers and one for the constructors. Each F1 constructor has two drivers, and the results of each race are scored with a points system. Drivers from the same constructor have to work as a team to contribute points to the Constructors' Championship while competing with each other and with the other drivers on the grid for the Drivers' Championship. For this reason, Formula 1 is both a team and an individual sport.

## About the dataset
Most of the datasets used come from the Kaggle dataset **Formula 1 World Championship (1950-2020)**, which contains information on the drivers, circuits, lap times, etc. To match each country to its continent, a dataset from another Kaggle source, **Covid19-plus-populations**, is used. A further dataset retrieved from [Dinuks](https://raw.githubusercontent.com/Dinuks/country-nationality-list/master/countries.csv) contains the country codes that are needed later to match the nationalities of the drivers.

## Questions
Europe is the traditional base of the sport and many F1 drivers are from European countries. There are many statistics we can pull from this sport, but in my first attempt at data visualization I am trying to answer the following questions:  
1. How many F1 circuits are there in the world, and where are they?
2. Which nationality has the highest number of F1 drivers?
3. How many races (Grand Prix) are there in each season?
4. Who is the fastest driver on the grid?

### Q1: How many F1 circuits are there in the world and where are they?
First, I investigate the locations of all the F1 circuits by plotting them on a map. Unsurprisingly, many of them are in European countries.

In [None]:
import pandas as pd
df_circuits = pd.read_csv('../input/formula-1-world-championship-1950-2020/circuits.csv')
df_circuits.head()

In [None]:
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(12,9))
fig.set_facecolor('#FFFFFF')

m = Basemap(projection='mill',
           llcrnrlat = -60,
           urcrnrlat = 90,
           llcrnrlon = -180,
           urcrnrlon = 180,
           resolution = 'c')

m.etopo(alpha=0.8)
m.drawcoastlines()

sites_lat_y = df_circuits['lat'].tolist()
sites_lon_x = df_circuits['lng'].tolist()

# one color per circuit, so the palette always matches the number of points
colors = sns.color_palette(None, len(df_circuits))

m.scatter(sites_lon_x,sites_lat_y,latlon=True, s=100, c=colors, marker='^', alpha=1, edgecolor='k', linewidth=1, zorder=2)
plt.title('Where are all the F1 circuits?', fontsize=20)

plt.show()

Let's zoom in on Europe.

In [None]:
fig = plt.figure(figsize=(12,9))

m = Basemap(projection='mill',
           llcrnrlat = 30,
           urcrnrlat = 65,
           llcrnrlon = -20,
           urcrnrlon = 35,
           resolution = 'c')

m.etopo(scale=0.5, alpha=0.8)
m.drawcoastlines()
m.drawcountries()
m.scatter(sites_lon_x,sites_lat_y,latlon=True, s=100, c=colors, marker='^', alpha=1, edgecolor='k', linewidth=1, zorder=2)
plt.show()
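
As a rough numerical companion to the zoomed-in map, the same corner coordinates can be used to count the circuits that fall inside the view. This is only a sketch on made-up toy coordinates (`toy_circuits` is a hypothetical stand-in for `df_circuits`), and a rectangular box is not a real definition of Europe.

```python
import pandas as pd

# Toy coordinates standing in for circuits.csv (approximate, illustrative values)
toy_circuits = pd.DataFrame({
    'name': ['Silverstone', 'Monza', 'Suzuka', 'Interlagos'],
    'lat': [52.08, 45.62, 34.84, -23.70],
    'lng': [-1.02, 9.28, 136.54, -46.70],
})

# Same corners as the zoomed-in map: latitude 30..65, longitude -20..35
in_box = toy_circuits['lat'].between(30, 65) & toy_circuits['lng'].between(-20, 35)
europe_view = toy_circuits[in_box]

print(europe_view['name'].tolist())
```

With the real data, replacing `toy_circuits` by `df_circuits` would give the number of markers expected on the zoomed-in map.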

Now, let's find out how the circuits are distributed across the continents. Conventionally, this kind of proportion is plotted with a pie chart, but let's explore using a waffle chart instead.

In [None]:
df_country = pd.read_csv('../input/covid19pluspopulations/country-and-continent-codes-list.csv')
df_country.columns

In [None]:
df_country = df_country[df_country.Country_Name.str.contains('|'.join(df_circuits.country))]
df_country = df_country[['Continent_Name', 'Continent_Code', 'Country_Name', 'Three_Letter_Country_Code']]
df_country.rename(columns={'Three_Letter_Country_Code':'Country_Code'}, inplace=True)
df_country.head()

In [None]:
df_country[df_country['Country_Name'].duplicated(keep=False)]

First, I need to assign Azerbaijan, Russia and Turkey each to a single continent; the choices made are based solely on personal opinion.

In [None]:
# drop() returns a new frame, so assign the result back to keep the change
df_country = df_country.drop(index=[8, 192, 235])
df_country.head()

In [None]:
df_circuits['dummy'] = 1
df_country['dummy'] = 1
df_combined = df_circuits.merge(df_country, on = 'dummy').drop('dummy', axis=1)
df = df_combined[df_combined.apply(lambda x: x.Country_Name.find(x.country), axis=1).ge(0)]
df = df[['circuitId', 'name', 'location', 'country', 'lat', 'lng', 
         'Continent_Name', 'Continent_Code', 'Country_Name', 'Country_Code']]
# rename() returns a new frame, so assign the result back to keep the change
df = df.rename(columns={'name':'Circuit_Name'})
df = df.set_index('circuitId')
df.head()
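
The cell above pairs every circuit with every country through a constant `dummy` key and then keeps the pairs where the circuit's country appears inside the longer official country name. A minimal sketch of that idea on toy data follows; the two small frames are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for df_circuits and df_country (illustrative values only)
toy_circuits = pd.DataFrame({'name': ['Monza', 'Sochi'],
                             'country': ['Italy', 'Russia']})
toy_countries = pd.DataFrame({
    'Country_Name': ['Italy, Italian Republic', 'Russian Federation', 'Japan'],
    'Continent_Name': ['Europe', 'Europe', 'Asia'],
})

# A constant key turns the merge into a cross join: 2 x 3 = 6 candidate pairs
toy_circuits['dummy'] = 1
toy_countries['dummy'] = 1
pairs = toy_circuits.merge(toy_countries, on='dummy').drop(columns='dummy')

# Keep only the pairs where the short country name is a substring of the long one
is_match = pairs.apply(lambda row: row['country'] in row['Country_Name'], axis=1)
matched = pairs[is_match].reset_index(drop=True)

print(matched[['name', 'country', 'Continent_Name']])
```

In pandas 1.2 and later, `merge(..., how='cross')` builds the same candidate pairs without the helper column. Substring matching can also produce false positives when one country name is contained in another, so the result is worth checking by eye.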

In [None]:
df.Continent_Name.value_counts()
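
The raw counts can also be turned into shares, which is what the waffle chart conveys visually. The sketch below uses a made-up `continents` series rather than the real `df.Continent_Name` column.

```python
import pandas as pd

# Toy continent labels, one per circuit (illustrative values only)
continents = pd.Series(
    ['Europe', 'Europe', 'Europe', 'Asia', 'North America'],
    name='Continent_Name',
)

# normalize=True returns fractions instead of counts; scale them to percent
shares = continents.value_counts(normalize=True).mul(100).round(1)

print(shares)
```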

In [None]:
!pip install pywaffle
from pywaffle import Waffle
data = df.Continent_Name.value_counts().to_dict()
fig = plt.figure(
    figsize = (12,16),
    FigureClass=Waffle, 
    rows=5, 
    values=data, 
    colors=sns.color_palette("rocket",len(data)).as_hex(),
    title={'label': 'Distribution of {} circuits across the {} continents'.format(sum(data.values()), len(data)), 'loc': 'left', 'size':18},
    labels=["{0} ({1})".format(k, v) for k, v in data.items()],
    legend={'loc': 'upper left', 'bbox_to_anchor': (1, 1)},
    icons='flag-checkered', icon_size=45, 
    icon_legend=True
)
fig.set_facecolor('#FFFFFF')

So there we have it, more than half of the circuits are located in Europe. Next question:

### Q2: Which nationality has the highest number of F1 drivers? 

In [None]:
driver = pd.read_csv('../input/formula-1-world-championship-1950-2020/drivers.csv')
data = driver.nationality.value_counts().to_frame('counts')

url = 'https://raw.githubusercontent.com/Dinuks/country-nationality-list/master/countries.csv'
nationality = pd.read_csv(url)
nationality.head()

In [None]:
df_nat = nationality[nationality.nationality.str.contains('|'.join(driver.nationality))]
df_nat = df_nat.drop(index=[4,76,77,78,235,242])
set.difference(set(data.index), set(df_nat.nationality))
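
The set difference above lists the driver nationalities that have no exact counterpart in the lookup table. A small sketch of the same check on toy labels follows, together with one possible way to reconcile the spellings; the label values and the `fix` mapping are chosen for illustration only.

```python
import pandas as pd

# Toy labels: how drivers are tagged vs. how the lookup table spells them
driver_labels = {'British', 'German', 'Monegasque'}
lookup = pd.Series(['British, UK', 'German', 'Monégasque, Monacan'])

# Labels used by drivers that have no exact match in the lookup table
unmatched = sorted(driver_labels - set(lookup))
print(unmatched)

# Map the lookup spellings onto the driver spellings, then check again
fix = {'British, UK': 'British', 'Monégasque, Monacan': 'Monegasque'}
cleaned = lookup.replace(fix)
still_unmatched = sorted(driver_labels - set(cleaned))
print(still_unmatched)
```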

In [None]:
df_nat[df_nat.nationality.str.contains('|'.join(set.difference(set(data.index), set(df_nat.nationality))))]

In [None]:
nationality[nationality.en_short_name.str.contains('Monaco')]

In [None]:
nationality[nationality.nationality.str.contains('New Zealand|Liechtenstein')]

In [None]:
data[data.index.str.contains('New Zealand|Liechtenstein')]

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df_nat = pd.concat([df_nat, nationality[nationality.nationality.str.contains('New Zealand|Monégasque|Liechtenstein')]])
# assign the result back instead of replacing in place on a column selection
df_nat['nationality'] = df_nat['nationality'].replace({'British, UK':'British', 'Dutch, Netherlandic': 'Dutch', 'Hungarian, Magyar': 'Hungarian', 'Monégasque, Monacan':'Monegasque', 'New Zealand, NZ':'New Zealander', 'Liechtenstein':'Liechtensteiner'})
df_nat[df_nat.nationality.str.contains('New Zealand|Monegasque|Liechtenstein|British|Dutch|Hungarian')]

In [None]:
df_driver = pd.merge(driver, df_nat, on='nationality', how='left')
df_driver[df_driver['nationality'] == 'Liechtensteiner']

In [None]:
dfa = df_driver[~df_driver[['nationality', 'alpha_2_code']].duplicated()][['nationality', 'alpha_2_code']].set_index('nationality')
dfb = driver.nationality.value_counts().to_frame('counts')
data = pd.merge(dfa, dfb, left_index = True, right_index = True)
data = data.sort_values(by='counts', ascending=False)

In [None]:
data[data.alpha_2_code.isna()]
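
The rows with a missing `alpha_2_code` are the nationalities that the left merge could not match. Another way to surface them, sketched here on toy data, is the `indicator=True` option of `merge`; the frames and the `driver_x` entry are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the drivers table and the nationality lookup
toy_drivers = pd.DataFrame({
    'driverRef': ['hamilton', 'verstappen', 'driver_x'],
    'nationality': ['British', 'Dutch', 'East German'],
})
toy_lookup = pd.DataFrame({
    'nationality': ['British', 'Dutch'],
    'alpha_2_code': ['GB', 'NL'],
})

# indicator=True adds a '_merge' column saying where each row was found
merged = toy_drivers.merge(toy_lookup, on='nationality', how='left', indicator=True)

# 'left_only' rows exist in the drivers table but not in the lookup
missing = merged.loc[merged['_merge'] == 'left_only', 'nationality'].unique().tolist()

print(missing)
```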

In [None]:
import requests
from io import BytesIO
from matplotlib.offsetbox import OffsetImage, AnnotationBbox

fig, ax = plt.subplots(figsize=(12,24))
fig.set_facecolor('#FFFFFF')
ax.set_facecolor('#FFFFFF')

labels = data.alpha_2_code
values = data.counts

ax.barh(data.index, values, color='orangered')
ax.set_xlim(-25,)

# remove axes spines
for s in ['top','bottom','left','right']:
    ax.spines[s].set_visible(False)

# remove x, y ticks, x-axis  
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')
ax.set_xticklabels([])

# to show the top value first 
ax.invert_yaxis()

ax.set_title('The Nationality of Formula 1 Drivers (1950-2020)', loc='center', size=18)

for i in ax.patches:
    ax.text(i.get_width()+1, i.get_y()+0.5, str(round((i.get_width()), 2)),
            fontsize=10, fontweight='bold', color='grey')

def isNaN(string):
    return string != string

def offset_image(x, y, label, ax):
    if(isNaN(label)): 
        response = requests.get(f'https://upload.wikimedia.org/wikipedia/commons/c/c9/Logof1.png')
        img = plt.imread(BytesIO(response.content))
        im = OffsetImage(img, zoom=0.05)
    else: 
        response = requests.get(f'https://www.countryflags.io/{label}/shiny/64.png')        
        img = plt.imread(BytesIO(response.content))
        im = OffsetImage(img, zoom=0.65)
        
    im.image.axes = ax
    x_offset = -45
    ab = AnnotationBbox(im, (0, y), xybox=(x_offset, 0), frameon=False,
                        xycoords='data', boxcoords="offset points", pad=0)
    ax.add_artist(ab)
    
max_value = values.max()
for i, (label, value) in enumerate(zip(labels, values)):
    offset_image(value, i, label, ax=plt.gca())
plt.subplots_adjust(left=0.15)    
plt.show()

Given that the first F1 World Championship race took place in the United Kingdom, it is not surprising to see so many British drivers. What surprises me, though, is that Americans come second.

### Q3: How many races/Grand Prix in each season?

In [None]:
laptimes = pd.read_csv('../input/formula-1-world-championship-1950-2020/lap_times.csv')
races = pd.read_csv('../input/formula-1-world-championship-1950-2020/races.csv')
print(laptimes.columns)
print(races.columns)
print(driver.columns)
print(df_circuits.columns)

In [None]:
df_combined = pd.merge(laptimes, races, on='raceId', how='left')
df_combined.columns

In [None]:
df_combined = df_combined[['raceId', 'driverId', 'time_x', 'milliseconds',
       'year', 'round', 'circuitId', 'name', 'date']]
df_combined.rename(columns={'time_x':'lap_time', 'name':'circuit_name'}, inplace=True)
df_combined = pd.merge(df_combined, driver, on='driverId', how='left')
df_combined = pd.merge(df_combined, df_circuits, on='circuitId', how='left')
df_combined.columns

In [None]:
df_combined = df_combined[['raceId', 'driverId', 'lap_time', 'milliseconds', 'year', 'round',
       'circuitId', 'circuit_name', 'date', 'driverRef', 'number', 'code',
       'forename', 'surname', 'dob', 'nationality', 'circuitRef', 'location', 'country']]
df_combined.head(3)

In [None]:
from matplotlib import pyplot as plt
import numpy as np

rounds = races.groupby('year').round.max().reset_index()['round'].tolist()
years = races.groupby('year').round.max().reset_index()['year'].tolist()

N = len(rounds)
arrRounds = np.array(rounds)

theta=np.arange(0,1.75*np.pi,1.75*np.pi/N)
width = (1.75*np.pi)/N *0.9
bottom = 40

fig = plt.figure(figsize=(8,8))
ax = fig.add_axes([0.1, 0.1, 0.75, 0.75], polar=True)
fig.set_facecolor('#FFFFFF')
bars = ax.bar(theta, arrRounds, width=width, bottom=bottom, color=colors, alpha=0.5)
plt.axis('off')

rotations = np.rad2deg(theta)
for x, bar, rotation, counts in zip(theta, bars, rotations, rounds):
    lab = ax.text(x,bottom+bar.get_height(), counts, 
             ha='left', va='center', rotation=rotation, rotation_mode="anchor") 
    
for x, bar, rotation, labels in zip(theta, bars, rotations, years):
    lab = ax.text(x,bottom, labels, 
             ha='left', va='center', rotation=rotation, rotation_mode="anchor") 

rads = np.arange(0, 1.75*np.pi, 0.01) 
for rad in rads: 
    plt.polar(rad, bottom-2, 'g.') 

ax.plot(theta[0], bottom-2, '8', color='g', markersize=10)
ax.plot(rads[len(rads)-1], bottom-1.5, 'v', color='g', markersize=10)

response = requests.get(f'https://upload.wikimedia.org/wikipedia/commons/c/c9/Logof1.png')
img = plt.imread(BytesIO(response.content))
im = OffsetImage(img, zoom=0.15)
im.image.axes = plt.gca()
ab = AnnotationBbox(im, (0, 15), xybox=(-45, 30), frameon=False,
                    xycoords='data', boxcoords="offset points", pad=0)
plt.gca().add_artist(ab)

ax.text(0, 0, 'Number of Grand Prix per Season', fontsize=14,
                horizontalalignment='center',
                verticalalignment='center')

plt.show()

There are about 70 seasons in total, so a conventional bar plot looks very congested. I therefore decided to use a circular bar plot instead.
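
The geometry behind the circular bar plot is simple: the bars are spread over seven eighths of a full circle (`1.75 * pi` radians), and each label is rotated by the angle of its bar. Here is a minimal sketch of that calculation, assuming a made-up `n_bars` in place of the real number of seasons.

```python
import numpy as np

n_bars = 7            # hypothetical number of seasons, kept small for readability
arc = 1.75 * np.pi    # the bars cover 315 degrees, leaving a gap for the title

# Start angle of each bar; endpoint=False guarantees exactly n_bars values
theta = np.linspace(0, arc, n_bars, endpoint=False)

# Each bar takes 90% of its slot so neighbouring bars do not touch
width = arc / n_bars * 0.9

# Text labels are rotated by the same angle, expressed in degrees
rotations = np.rad2deg(theta)

print(np.round(rotations, 1))
```

`np.linspace` is used here instead of `np.arange` because a non-integer step in `np.arange` can occasionally yield one element more or less than expected due to floating-point rounding.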

We can see that the total number of races per season is pretty consistent over the years. This year, due to the Covid-19 pandemic, only 17 races could be arranged, compared with 21 races in recent years. And away we go to the last question:

### Q4: Who is the fastest driver on the grid?
To determine the fastest driver, I use the shortest lap time in each race. Since the number of races differs slightly across years, I express the number of fastest laps as a percentage of the races in each season, so that the fastest drivers of different years can be compared fairly. I exclude the 2020 season from this analysis since it is still ongoing.
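
The calculation can be sketched on a tiny made-up season before running it on the real data. Note the assumptions: the lap values are invented, and the sketch compares the numeric `milliseconds` column instead of the `lap_time` strings used in the cells below, since string comparison could misorder unusually long laps.

```python
import pandas as pd

# Toy lap data: one season, four races, two drivers (invented values)
toy_laps = pd.DataFrame({
    'year':  [2019] * 8,
    'round': [1, 1, 2, 2, 3, 3, 4, 4],
    'code':  ['HAM', 'VER'] * 4,
    'milliseconds': [90000, 90500, 88000, 87500, 91000, 91200, 85000, 85300],
})

# Row of the fastest lap in each race
best = toy_laps.loc[toy_laps.groupby(['year', 'round'])['milliseconds'].idxmin()]

# Number of fastest laps per driver and season
wins = best.groupby(['year', 'code']).size().rename('counts').reset_index()

# Races per season, taken as the highest round number
races = toy_laps.groupby('year')['round'].max().rename('races').reset_index()

wins = wins.merge(races, on='year', how='left')
wins['percent'] = wins['counts'] / wins['races'] * 100

# Driver with the most fastest laps in each season
top = wins.loc[wins.groupby('year')['counts'].idxmax()]

print(top)
```

Unlike the merge-based approach in the notebook, `idxmin` and `idxmax` keep only the first row when there is a tie, so tied drivers would need separate handling.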

In [None]:
data = pd.merge(df_combined.groupby(['circuit_name','date']).lap_time.min().to_frame().reset_index(), df_combined[['circuit_name','date','lap_time', 'driverRef','code']], on=['circuit_name','date','lap_time'], how='left')
data = data.sort_values(by='date', ascending = False)
data.head(5)

In [None]:
data['year'] = pd.DatetimeIndex(data.date).year
data['counts'] = 1
data = data.groupby(['year', 'code', 'driverRef']).counts.count().to_frame().reset_index().sort_values(by='year', ascending=False)
data.head(3)

In [None]:
#fastest = data.loc[data.groupby(['year'])['occ'].idxmax()]
fastest = pd.merge(data, data.groupby(['year'])['counts'].max().to_frame(name='max').reset_index(), on='year', how='left')
fastest = fastest[fastest['counts'] == fastest['max']][['year','code','driverRef','counts']]
fastest.driverRef = fastest.driverRef.str.capitalize()

# Calculate the percentage of fastest lap per season 
fastest = pd.merge(fastest, df_combined.groupby('year')['round'].max().reset_index(), on='year', how='left')
fastest['percent'] = np.array(fastest['counts'])/np.array(fastest['round'])*100
fastest

In [None]:
fastest.iloc[[22,23,24],fastest.columns.get_loc('code')] = ['HAK','HAK','HAK']
fastest.iloc[26,fastest.columns.get_loc('code')] = 'FRE'
fastest.iloc[27,fastest.columns.get_loc('code')] = 'VIL'

# drop 2020 
fastest = fastest.drop(index=0)
fastest = fastest.sort_values(by='year', ascending=True)
fastest['year'] = fastest['year'].astype(str)
fastest = fastest.reset_index(drop=True)
fastest

In [None]:
fastest[fastest['year'].duplicated(keep=False)]

In [None]:
# concatenate code, driver name from duplicated rows 
data_code = fastest[fastest['year'].duplicated(keep=False)][['year','code']].groupby('year').transform(lambda x: ', '.join(x))
data_ref = fastest[fastest['year'].duplicated(keep=False)][['year','driverRef']].groupby('year').transform(lambda x: ', '.join(x))
data_combined = pd.merge(data_code, data_ref, left_index=True, right_index=True, how='outer')
data_combined

In [None]:
pd.merge(fastest[fastest['year'].duplicated(keep=False)]['year'], data_combined, left_index=True, right_index=True, how='outer')

In [None]:
fastest.loc[fastest['year'].duplicated(keep=False), ['year', 'code','driverRef']]

In [None]:
# replace the subset of fastest dataframe with the updated values 
fastest.loc[fastest['year'].duplicated(keep=False), ['year', 'code','driverRef']] = pd.merge(fastest[fastest['year'].duplicated(keep=False)]['year'], data_combined, left_index=True, right_index=True, how='outer').values
fastest

In [None]:
fastest = fastest[~fastest.duplicated()]
fastest
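
The last few cells collapse seasons in which several drivers tie into a single row by joining their codes and names. The same result can be sketched more compactly with `groupby(...).agg`; the frame below uses invented placeholder drivers, not real results.

```python
import pandas as pd

# Toy version of the `fastest` table: the 2002 season has a two-way tie
toy_fastest = pd.DataFrame({
    'year': ['2001', '2002', '2002'],
    'code': ['AAA', 'BBB', 'CCC'],
    'driverRef': ['Driver_a', 'Driver_b', 'Driver_c'],
    'counts': [5, 3, 3],
})

# One row per season: tied drivers are joined into a comma-separated string
collapsed = toy_fastest.groupby(['year', 'counts'], as_index=False).agg(
    {'code': ', '.join, 'driverRef': ', '.join}
)

print(collapsed)
```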

In [None]:
fastest[fastest['code'] == 'HAM']

In [None]:
from bokeh.palettes import Category20b

fig, ax = plt.subplots(figsize=(12,16))
fig.set_facecolor('#FFFFFF')
ax.set_facecolor('#FFFFFF')

ax.hlines(fastest.year, xmin=0, xmax=fastest.percent, linestyle='dotted')

groups = fastest[['year','percent','driverRef']].groupby('driverRef')
colors = Category20b[len(fastest.code.unique().tolist())]

for (name, group), color in zip(groups, colors):
    ax.plot(group.percent, group.year, marker='o', color=color, linestyle='', ms=12, label=name)
ax.set_xlim(0,65)
ax.legend()

for x,y, label, count in zip(fastest.percent, fastest.year, fastest.code, fastest.counts):
    ax.annotate(label+'({} races)'.format(count), xy=(x+0.8,y), textcoords='data')
    #ax.annotate('(%s, %s)' % xy, xy=xy, textcoords='data')

plt.xlabel('Percentage of Fastest Lap Wins(%)')
plt.title('Who is the fastest driver in each season?', fontsize=18)

plt.show()

As it turns out, Kimi Räikkönen set the fastest lap in more than half of the races in the 2008 season, and in 2005 he also set the fastest lap in 9 out of 19 races (47.37%). Lewis Hamilton, by comparison, never had such a large margin: in the seasons when he was the fastest driver, his percentage was below 40%. However, just like Michael Schumacher, he has been the fastest driver on the grid four times, and he will most likely break that record once this season ends.