# Drivers Statistics
The objective of this notebook is to analyze the data stored in our database and see which statistics can be derived from it.  
Let's start by importing the pandas library.

While trying to read the csv file, I received the following error:  
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 4: invalid continuation byte  

So the CSV file is not encoded as UTF-8.  
I found the following information on the issue: [Unicode Error](https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python)  
By using ISO-8859-1 encoding instead of UTF-8, I was able to import the CSV file as a pandas dataframe.
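
The try-UTF-8-then-fall-back approach described above can be sketched in a few lines. This is only an illustration, not the code this notebook ran: `decode_with_fallback` and the sample bytes are hypothetical, and the list of candidate encodings is an assumption.

```python
def decode_with_fallback(raw, encodings=("utf-8", "ISO-8859-1")):
    """Return (text, encoding) for the first encoding that decodes `raw`.

    Hypothetical helper for illustration only.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate failed, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


# Sample bytes: 0xE9 is "é" in ISO-8859-1 but is not valid UTF-8 in this position.
sample = b"forename\nS\xe9bastien\n"
text, used = decode_with_fallback(sample)
print(used)
```

The encoding that succeeds could then be passed on as `pd.read_csv(..., encoding=used)`. Keep in mind that ISO-8859-1 maps every possible byte to a character, so it never raises an error; a successful read does not guarantee that accented names come out correctly, and they are worth checking by eye.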

The file has been added to GitHub.


In [2]:
import pandas as pd

fileData = 'drivers.csv'
df  = pd.read_csv(fileData, header=0, index_col=None, encoding = "ISO-8859-1")
df

Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url
0,1,hamilton,44.0,HAM,Lewis,Hamilton,07/01/1985,British,http://en.wikipedia.org/wiki/Lewis_Hamilton
1,2,heidfeld,,HEI,Nick,Heidfeld,10/05/1977,German,http://en.wikipedia.org/wiki/Nick_Heidfeld
2,3,rosberg,6.0,ROS,Nico,Rosberg,27/06/1985,German,http://en.wikipedia.org/wiki/Nico_Rosberg
3,4,alonso,14.0,ALO,Fernando,Alonso,29/07/1981,Spanish,http://en.wikipedia.org/wiki/Fernando_Alonso
4,5,kovalainen,,KOV,Heikki,Kovalainen,19/10/1981,Finnish,http://en.wikipedia.org/wiki/Heikki_Kovalainen
...,...,...,...,...,...,...,...,...,...
837,838,vandoorne,2.0,VAN,Stoffel,Vandoorne,26/03/1992,Belgian,http://en.wikipedia.org/wiki/Stoffel_Vandoorne
838,839,ocon,31.0,OCO,Esteban,Ocon,17/09/1996,French,http://en.wikipedia.org/wiki/Esteban_Ocon
839,840,stroll,18.0,STR,Lance,Stroll,29/10/1998,Canadian,http://en.wikipedia.org/wiki/Lance_Stroll
840,841,giovinazzi,36.0,GIO,Antonio,Giovinazzi,14/12/1993,Italian,http://en.wikipedia.org/wiki/Antonio_Giovinazzi


Let's see what we can do with the data that we just imported into the dataframe.  
Let's sort the rows by the `dob` column.

In [77]:
# dob is a dd/mm/yyyy string, so this sort is lexicographic, not chronological.
df1 = df.sort_values('dob', ascending=True)
df1.head(10)

Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url
754,754,balsa,,,Marcel,Balsa,01/01/1909,French,http://en.wikipedia.org/wiki/Marcel_Balsa
433,434,sharp,,,Hap,Sharp,01/01/1928,American,http://en.wikipedia.org/wiki/Hap_Sharp
247,248,gimax,,,Carlo,Franchi,01/01/1938,Italian,http://en.wikipedia.org/wiki/Gimax
234,235,ickx,,,Jacky,Ickx,01/01/1945,Belgian,http://en.wikipedia.org/wiki/Jacky_Ickx
232,233,stuck,,,Hans-Joachim,Stuck,01/01/1951,German,http://en.wikipedia.org/wiki/Hans_Joachim_Stuck
110,111,gounon,,,Jean-Marc,Gounon,01/01/1963,French,http://en.wikipedia.org/wiki/Jean-Marc_Gounon
46,47,baumgartner,,,Zsolt,Baumgartner,01/01/1981,Hungarian,http://en.wikipedia.org/wiki/Zsolt_Baumgartner
684,684,whitehouse,,,Bill,Whitehouse,01/04/1909,British,http://en.wikipedia.org/wiki/Bill_Whitehouse
270,271,kessel,,,Loris,Kessel,01/04/1950,Swiss,http://en.wikipedia.org/wiki/Loris_Kessel
74,75,nakano,,,Shinji,Nakano,01/04/1971,Japanese,http://en.wikipedia.org/wiki/Shinji_Nakano


Because `dob` is stored as a dd/mm/yyyy string, this sort orders the rows by day and month first rather than chronologically. Even so, the output shows drivers born as early as 1909. Many of the drivers in this dataset are no longer active, and some are no longer alive, but this is a great collection of all the drivers who have ever participated in the sport.
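
A chronological sort needs the `dob` strings parsed into dates first. The sketch below shows one possible way to do that with `pd.to_datetime`, using a few rows copied from the output above. The `sample` dataframe and the `dob_parsed` column name are assumptions introduced for illustration; they are not part of the original notebook.

```python
import pandas as pd

# A few rows copied from the output above (not the full dataset).
sample = pd.DataFrame({
    "surname": ["Hamilton", "Balsa", "Sharp", "Whitehouse"],
    "dob": ["07/01/1985", "01/01/1909", "01/01/1928", "01/04/1909"],
})

# Parse the day/month/year strings into real datetime values.
sample["dob_parsed"] = pd.to_datetime(sample["dob"], format="%d/%m/%Y")

# Sorting on the parsed column orders the rows from earliest to latest birth date.
by_age = sample.sort_values("dob_parsed", ascending=True)
print(by_age[["surname", "dob"]])
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first dates such as 01/04/1909.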

Let's see which country has the most F1 drivers.

In [78]:
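# Count how many drivers hold each nationality.
nationality_counts = {}
for nationality in df1['nationality']:
    nationality_counts[nationality] = nationality_counts.get(nationality, 0) + 1

# Sort the (nationality, count) pairs from most to fewest drivers.
sorted_d = sorted(nationality_counts.items(), key=lambda x: x[1], reverse=True)
print(sorted_d)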

[('British', 162), ('American', 157), ('Italian', 99), ('French', 73), ('German', 49), ('Brazilian', 31), ('Argentine', 24), ('Belgian', 23), ('Swiss', 23), ('South African', 23), ('Japanese', 19), ('Dutch', 17), ('Australian', 17), ('Spanish', 15), ('Austrian', 15), ('Canadian', 13), ('Swedish', 10), ('New Zealander', 9), ('Finnish', 9), ('Mexican', 6), ('Danish', 5), ('Irish', 5), ('Uruguayan', 4), ('Rhodesian', 4), ('Portuguese', 4), ('Venezuelan', 3), ('Colombian', 3), ('East German', 3), ('Monegasque', 3), ('Russian', 2), ('Indian', 2), ('Hungarian', 1), ('American-Italian', 1), ('Polish', 1), ('Argentine-Italian', 1), ('Czech', 1), ('Liechtensteiner', 1), ('Chilean', 1), ('Thai', 1), ('Malaysian', 1), ('Indonesian', 1)]


Let's display the top three nationalities with the most F1 drivers.

In [79]:
print("No 1: " + str(sorted_d[0]))
print("No 2: " + str(sorted_d[1]))
print("No 3: " + str(sorted_d[2]))

No 1: ('British', 162)
No 2: ('American', 157)
No 3: ('Italian', 99)
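
The counting and ranking steps above could also be written with `collections.Counter` from the standard library, whose `most_common` method returns the pairs already sorted by count. The sketch below uses a small made-up list in place of `df1['nationality']`, so its counts do not reflect the real dataset.

```python
from collections import Counter

# Small illustrative sample; in the notebook this would be df1['nationality'].
nationalities = ["British", "German", "British", "Italian", "German", "British"]

# Counter tallies how often each value occurs.
counts = Counter(nationalities)

# most_common(n) returns the n (value, count) pairs with the highest counts.
for rank, (nationality, total) in enumerate(counts.most_common(3), start=1):
    print(f"No {rank}: {nationality} ({total})")
```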


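Let's see if we can show the 3 oldest and the 3 youngest drivers. Since `dob` is still sorted as a string here, the rows shown are the first and last in string order, not necessarily the true oldest and youngest.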

In [80]:
print("3 Oldest Drivers")
df1 = df1.sort_values(by=['dob'])
df1[0:3]

3 Oldest Drivers


Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url
754,754,balsa,,,Marcel,Balsa,01/01/1909,French,http://en.wikipedia.org/wiki/Marcel_Balsa
433,434,sharp,,,Hap,Sharp,01/01/1928,American,http://en.wikipedia.org/wiki/Hap_Sharp
247,248,gimax,,,Carlo,Franchi,01/01/1938,Italian,http://en.wikipedia.org/wiki/Gimax


In [83]:
print("3 Youngest Drivers")
df1 = df1.sort_values(by=['dob'], ascending=False)
df1[0:3]

3 Youngest Drivers


Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url
66,67,buemi,,BUE,Sébastien,Buemi,31/10/1988,Swiss,http://en.wikipedia.org/wiki/S%C3%A9bastien_Buemi
320,321,bell,,,Derek,Bell,31/10/1941,British,http://en.wikipedia.org/wiki/Derek_Bell_(auto_...
214,215,guerra,,,Miguel Ángel,Guerra,31/08/1953,Argentine,http://en.wikipedia.org/wiki/Miguel_Angel_Guerra
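
Once the birth dates are parsed as datetimes, `nsmallest` and `nlargest` can pick the oldest and youngest drivers directly. The following is a minimal sketch under that assumption, built from a handful of rows that appear in the outputs above; the `sample` dataframe and the `dob_parsed` column are hypothetical, and the result applies to this sample only, not to the full dataset.

```python
import pandas as pd

# A handful of rows taken from the outputs above (not the full dataset).
sample = pd.DataFrame({
    "surname": ["Balsa", "Bell", "Buemi", "Ocon", "Stroll", "Hamilton"],
    "dob": ["01/01/1909", "31/10/1941", "31/10/1988",
            "17/09/1996", "29/10/1998", "07/01/1985"],
})

# Convert the dd/mm/yyyy strings to datetimes so comparisons are chronological.
sample["dob_parsed"] = pd.to_datetime(sample["dob"], format="%d/%m/%Y")

# The earliest birth dates belong to the oldest drivers, the latest to the youngest.
oldest = sample.nsmallest(3, "dob_parsed")
youngest = sample.nlargest(3, "dob_parsed")

print("3 Oldest Drivers:", oldest["surname"].tolist())
print("3 Youngest Drivers:", youngest["surname"].tolist())
```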


## Conclusion
This dataframe contains the biographical details of every driver, which can be used to construct a bio of the drivers from the relevant season.  
This data should be used in conjunction with other datasets to uncover more informative statistical patterns.