# Scraping Formula 1 Race Results Using Python


![banner-image](https://i.imgur.com/ay2hbsc.jpg)

[Formula One](https://www.formula1.com/) (also known as Formula 1 or F1) is the highest class of international racing for open-wheel single-seater formula racing cars sanctioned by the Fédération Internationale de l'Automobile (FIA). The World Drivers' Championship, which became the FIA Formula One World Championship in 1981, has been one of the premier forms of racing around the world since its inaugural season in 1950. The word formula in the name refers to the set of rules to which all participants' cars must conform. A Formula One season consists of a series of races, known as Grands Prix, which take place worldwide on both purpose-built circuits and closed public roads.

The page https://www.formula1.com/en/results.html/drivers.html provides details of race results. In this project, we'll retrieve the results of every race, year by year, using _web scraping_: the process of extracting information from a website in an automated fashion using code. We'll use the Python libraries [requests](https://pypi.org/project/requests/) and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) to scrape data from this page.


Here's the project outline and the steps we'll follow:
1. Download the web page using the `requests` library
2. Parse the HTML source of the page using `BeautifulSoup`
3. Write functions to scrape data such as race location, race date, winner name, car, number of laps and race finish time
4. Store the scraped data in a dictionary
5. Create a data frame from the dictionary using `pandas`
6. Save the data to a CSV file


By the end of the project, we'll create a CSV file in the following format:

```
GrandPrix_location, Race_date, winner, car, No_of_laps, Race_finish_time, Year
Brazil, 07 Nov 2010, Sebastian Vettel, RBR Renault, 71, 1:33:11.803, 2010
Abu Dhabi, 14 Nov 2010, Sebastian Vettel, RBR Renault, 55, 1:39:36.837, 2010
...
...
```

In [7]:
!pip install jovian --upgrade --quiet

In [8]:
import jovian

In [9]:
# Import the requests library and BeautifulSoup
import requests
from bs4 import BeautifulSoup

def get_race_results(year):
    """Download the race results page for a year and return the parsed document."""
    # str() lets the year be passed either as a string or as an integer
    url = 'https://www.formula1.com/en/results.html/' + str(year) + '/races.html'
    response = requests.get(url)
    # Check that the page loaded successfully
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(url))
    # Parse the HTML with BeautifulSoup
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [10]:
year = input('Enter year ')

Enter year 2022


In [11]:
doc = get_race_results(year)

In [12]:
doc.title.text.strip()

'2022 RACE RESULTS'
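The download step can be made a little more defensive. The sketch below is one possible variant, not the notebook's own code: `build_results_url` and `fetch_race_results` are hypothetical names, the 10-second timeout is an arbitrary choice, and `raise_for_status()` replaces the manual status check (it raises only for 4xx and 5xx responses, which is slightly looser than requiring exactly 200).

```python
import requests
from bs4 import BeautifulSoup

# URL pattern taken from get_race_results above
RESULTS_URL = 'https://www.formula1.com/en/results.html/{year}/races.html'

def build_results_url(year):
    """Return the results page URL for a year given as int or str."""
    return RESULTS_URL.format(year=year)

def fetch_race_results(year, timeout=10):
    """Download and parse the results page for one year (hypothetical helper)."""
    # A timeout stops the call from hanging forever on a stalled connection
    response = requests.get(build_results_url(year), timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')

print(build_results_url(2022))
```

Keeping the URL construction in its own function makes it easy to check without touching the network.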

## Defining helper functions to extract elements

In the screenshot below, the element holding the Grand Prix name has the `class` value `dark bold ArchiveLink`. If we select every element with this class, we can extract the names and store them in a list.

![](https://i.imgur.com/TYGpvlc.png)

#### Defining a function that finds every element with the selection class and extracts its text

In [13]:
def get_grand_prix(doc):
    selection_class = 'dark bold ArchiveLink'
    gp_tags = doc.find_all('a', {'class' : selection_class})
    gp_titles = []
    for tag in gp_tags:
        gp_titles.append(tag.text.strip())
    return gp_titles

In [14]:
len(get_grand_prix(doc)) # check length to confirm with official website

20

In [15]:
gp_place = get_grand_prix(doc)

In [16]:
gp_place

['Bahrain',
 'Saudi Arabia',
 'Australia',
 'Emilia Romagna',
 'Miami',
 'Spain',
 'Monaco',
 'Azerbaijan',
 'Canada',
 'Great Britain',
 'Austria',
 'France',
 'Hungary',
 'Belgium',
 'Netherlands',
 'Italy',
 'Singapore',
 'Japan',
 'United States',
 'Mexico']
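When a multi-word string such as `'dark bold ArchiveLink'` is passed as the class, BeautifulSoup compares it against the exact value of the `class` attribute, so the same classes in a different order would not match. The sketch below illustrates this on a small hand-written snippet (the markup is an assumption, not copied from the site) and shows a CSS selector as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same three classes, written in two different orders
sample_html = """
<table>
  <tr><td><a class="dark bold ArchiveLink" href="#">
      Bahrain
  </a></td></tr>
  <tr><td><a class="ArchiveLink dark bold" href="#">Saudi Arabia</a></td></tr>
</table>
"""
sample_doc = BeautifulSoup(sample_html, 'html.parser')

# Exact string match on the class attribute, as used in get_grand_prix
exact_tags = sample_doc.find_all('a', {'class': 'dark bold ArchiveLink'})
exact_titles = [tag.text.strip() for tag in exact_tags]

# CSS selector: matches any <a> carrying all three classes, in any order
css_titles = [tag.text.strip() for tag in sample_doc.select('a.dark.bold.ArchiveLink')]

print(exact_titles)
print(css_titles)
```

If the site ever reorders its class names, the selector form would keep working while the exact-string form would silently return fewer results.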

#### Defining a similar function to extract the date. The `class` of the date element is `dark hide-for-mobile`

![](https://i.imgur.com/FrBGSeA.png)

In [17]:
def get_grand_prix_date(doc):
    selection_class = 'dark hide-for-mobile'
    gp_date_tags = doc.find_all('td', {'class' : selection_class})
    gp_dates = []
    for tag in gp_date_tags:
        gp_dates.append(tag.text.strip())
    return gp_dates

In [18]:
gp_date = get_grand_prix_date(doc)

In [19]:
gp_date

['20 Mar 2022',
 '27 Mar 2022',
 '10 Apr 2022',
 '24 Apr 2022',
 '08 May 2022',
 '22 May 2022',
 '29 May 2022',
 '12 Jun 2022',
 '19 Jun 2022',
 '03 Jul 2022',
 '10 Jul 2022',
 '24 Jul 2022',
 '31 Jul 2022',
 '28 Aug 2022',
 '04 Sep 2022',
 '11 Sep 2022',
 '02 Oct 2022',
 '09 Oct 2022',
 '23 Oct 2022',
 '30 Oct 2022']

* This part is a little tricky!

The driver name has two parts, the first name and the last name, in elements with the classes `hide-for-tablet` and `hide-for-mobile` respectively. Let's extract each element individually.

For example, the first name and last name of `Lewis Hamilton` are extracted separately and then combined using *string concatenation*.

![](https://i.imgur.com/YfaUr7r.png)

In [20]:
def get_grand_prix_winner(doc):
    selection_class = 'hide-for-tablet'
    first_name_tags = doc.find_all('span', {'class' : selection_class})
    last_name_tags = doc.find_all('span', class_='hide-for-mobile')
    
    gp_winners = []
    # Pair each first name with its last name and join them using string concatenation
    for first, last in zip(first_name_tags, last_name_tags):
        gp_winners.append(first.text.strip() + " " + last.text.strip())
        
    return gp_winners

In [21]:
gp_winner = get_grand_prix_winner(doc)

In [22]:
gp_winner

['Charles Leclerc',
 'Max Verstappen',
 'Charles Leclerc',
 'Max Verstappen',
 'Max Verstappen',
 'Max Verstappen',
 'Sergio Perez',
 'Max Verstappen',
 'Max Verstappen',
 'Carlos Sainz',
 'Charles Leclerc',
 'Max Verstappen',
 'Max Verstappen',
 'Max Verstappen',
 'Max Verstappen',
 'Max Verstappen',
 'Sergio Perez',
 'Max Verstappen',
 'Max Verstappen',
 'Max Verstappen']
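Collecting first names and last names in two separate passes relies on both lists lining up by position. An alternative is to walk the table row by row and read every field from the same `<tr>`, so values cannot drift out of alignment. The sketch below assumes a simplified table structure; the markup and the name `get_winners_by_row` are hypothetical, and only the class names come from this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that reuses the class names from this notebook
sample_html = """
<table>
  <tr>
    <td><a class="dark bold ArchiveLink">Bahrain</a></td>
    <td class="dark bold">
      <span class="hide-for-tablet">Charles</span>
      <span class="hide-for-mobile">Leclerc</span>
    </td>
  </tr>
  <tr>
    <td><a class="dark bold ArchiveLink">Saudi Arabia</a></td>
    <td class="dark bold">
      <span class="hide-for-tablet">Max</span>
      <span class="hide-for-mobile">Verstappen</span>
    </td>
  </tr>
</table>
"""

def get_winners_by_row(doc):
    """Return (location, winner) pairs, reading both fields from the same row."""
    results = []
    for row in doc.find_all('tr'):
        link = row.find('a', class_='ArchiveLink')
        first = row.find('span', class_='hide-for-tablet')
        last = row.find('span', class_='hide-for-mobile')
        # Skip header rows or rows missing any of the expected elements
        if link is None or first is None or last is None:
            continue
        winner = first.text.strip() + ' ' + last.text.strip()
        results.append((link.text.strip(), winner))
    return results

sample_doc = BeautifulSoup(sample_html, 'html.parser')
print(get_winners_by_row(sample_doc))
```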

#### Defining a function to get car details

Here the selection class is `semi-bold uppercase`.

![](https://i.imgur.com/EC82udW.png)

In [23]:
def get_grand_prix_car(doc):
    selection_class = 'semi-bold uppercase'
    gp_car_tags = doc.find_all('td', {'class' : selection_class})
    gp_cars = []
    for tag in gp_car_tags:
        gp_cars.append(tag.text.strip())
    return gp_cars

In [24]:
gp_cars_name = get_grand_prix_car(doc)

In [25]:
gp_cars_name

['Ferrari',
 'Red Bull Racing RBPT',
 'Ferrari',
 'Red Bull Racing RBPT',
 'Red Bull Racing RBPT',
 'Red Bull Racing RBPT',
 'Red Bull Racing RBPT',
 'Red Bull Racing RBPT',
 'Red Bull Racing RBPT',
 'Ferrari',
 'Ferrari',
 'Red Bull Racing RBPT',
 'Red Bull Racing RBPT',
 'Red Bull Racing RBPT',
 'Red Bull Racing RBPT',
 'Red Bull Racing RBPT',
 'Red Bull Racing RBPT',
 'Red Bull Racing RBPT',
 'Red Bull Racing RBPT',
 'Red Bull Racing RBPT']

#### Defining a function to get the number of laps

![](https://i.imgur.com/EC82udW.png)

In [26]:
def get_grand_prix_laps(doc):
    selection_class = 'bold hide-for-mobile'
    gp_lap_tags = doc.find_all('td', {'class' : selection_class})
    gp_laps = []
    for tag in gp_lap_tags:
        gp_laps.append(tag.text.strip())
    return gp_laps

In [27]:
gp_no_of_laps = get_grand_prix_laps(doc)

In [28]:
gp_no_of_laps

['57',
 '50',
 '58',
 '63',
 '57',
 '66',
 '64',
 '51',
 '70',
 '52',
 '71',
 '53',
 '70',
 '44',
 '72',
 '53',
 '59',
 '28',
 '56',
 '71']

#### Defining a helper function to get the race finish time

![](https://i.imgur.com/EC82udW.png)

In [29]:
def get_grand_prix_time(doc):
    selection_class = 'dark bold hide-for-tablet'
    gp_time_tags = doc.find_all('td', {'class' : selection_class})
    gp_finish_time = []
    for tag in gp_time_tags:
        gp_finish_time.append(tag.text.strip())
    return gp_finish_time

In [30]:
gp_finish_time = get_grand_prix_time(doc)

In [31]:
gp_finish_time

['1:37:33.584',
 '1:24:19.293',
 '1:27:46.548',
 '1:32:07.986',
 '1:34:24.258',
 '1:37:20.475',
 '1:56:30.265',
 '1:34:05.941',
 '1:36:21.757',
 '2:17:50.311',
 '1:24:24.312',
 '1:30:02.112',
 '1:39:35.912',
 '1:25:52.894',
 '1:36:42.773',
 '1:20:27.511',
 '2:02:20.238',
 '3:01:44.004',
 '1:42:11.687',
 '1:38:36.729']
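The finish times are scraped as plain strings, which makes them awkward to compare or average. A minimal sketch of a conversion to seconds follows, assuming every value has the `H:MM:SS.mmm` shape seen in the 2022 output; `finish_time_to_seconds` is a hypothetical helper, and values in other formats would need separate handling.

```python
from datetime import timedelta

def finish_time_to_seconds(time_text):
    """Convert an 'H:MM:SS.mmm' finish time string to seconds as a float."""
    hours, minutes, seconds = time_text.split(':')
    duration = timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))
    return duration.total_seconds()

# Two values taken from the 2022 output above
sample_times = ['1:37:33.584', '3:01:44.004']
print([finish_time_to_seconds(text) for text in sample_times])
```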

## Main function to extract all data

* So far we have extracted data for a single year. Now, using all the helper functions, we'll define a function `get_all_data` that extracts all the required elements for a list of years and stores them in a dictionary.



In [32]:
def get_all_data(dates):
    dict1 = {
    'Race location' : [],                               # Create empty lists to store the data extracted in the loop
    'Race date' : [],
    'Race winner' : [],
    'Race car' : [],
    'Race lap' : [],
    'Race time' : []
    }
    
    for date in dates:                              # Loop over each input year and scrape its results page
        doc = get_race_results(date)                # Download the page, check that it loaded and parse it into 'doc'
        dict1['Race location'].extend(get_grand_prix(doc))        
        dict1['Race date'].extend(get_grand_prix_date(doc))
        dict1['Race winner'].extend(get_grand_prix_winner(doc))
        dict1['Race car'].extend(get_grand_prix_car(doc))
        dict1['Race lap'].extend(get_grand_prix_laps(doc))
        dict1['Race time'].extend(get_grand_prix_time(doc))
        

    return dict1

- Pass all the years as a list of strings, since the page URL is different for every year.

In [33]:
# Build the list of years as strings, from 1950 up to and including 2022
dates = [str(year) for year in range(1950, 2023)]

In [34]:
race_data = get_all_data(dates)
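Because each column is filled by a separate helper, a page where one helper finds more or fewer elements than the others would leave the lists with different lengths, and `pd.DataFrame` raises a `ValueError` for a dictionary of unequal-length lists. A small check such as the sketch below can surface the problem earlier; `check_column_lengths` is a hypothetical helper, and the sample dictionary merely stands in for `race_data`.

```python
def check_column_lengths(data):
    """Return a dict of column name -> length, raising if the lengths differ."""
    lengths = {name: len(values) for name, values in data.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Columns have different lengths: {}'.format(lengths))
    return lengths

# Small hand-made dictionary standing in for race_data
sample_data = {
    'Race location': ['Bahrain', 'Saudi Arabia'],
    'Race winner': ['Charles Leclerc', 'Max Verstappen'],
}
print(check_column_lengths(sample_data))
```

Calling it as `check_column_lengths(race_data)` right after scraping would show which column is short before the data frame is built.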

## Dataframe using pandas

* Install and import the pandas library
* Create a DataFrame using pandas 

In [35]:
!pip install pandas --upgrade --quiet

In [36]:
import pandas as pd

In [37]:
df = pd.DataFrame(race_data)

* Let's extract just the year from 'Race date' using the pandas `to_datetime` function

In [39]:
df['year'] = pd.to_datetime(df['Race date']).dt.year

In [40]:
df

Unnamed: 0,Race location,Race date,Race winner,Race car,Race lap,Race time,year
0,Great Britain,13 May 1950,Nino Farina,Alfa Romeo,70,2:13:23.600,1950
1,Monaco,21 May 1950,Juan Manuel Fangio,Alfa Romeo,100,3:13:18.700,1950
2,Indianapolis 500,30 May 1950,Johnnie Parsons,Kurtis Kraft Offenhauser,138,2:46:55.970,1950
3,Switzerland,04 Jun 1950,Nino Farina,Alfa Romeo,42,2:02:53.700,1950
4,Belgium,18 Jun 1950,Juan Manuel Fangio,Alfa Romeo,35,2:47:26.000,1950
...,...,...,...,...,...,...,...
1075,Italy,11 Sep 2022,Max Verstappen,Red Bull Racing RBPT,53,1:20:27.511,2022
1076,Singapore,02 Oct 2022,Sergio Perez,Red Bull Racing RBPT,59,2:02:20.238,2022
1077,Japan,09 Oct 2022,Max Verstappen,Red Bull Racing RBPT,28,3:01:44.004,2022
1078,United States,23 Oct 2022,Max Verstappen,Red Bull Racing RBPT,56,1:42:11.687,2022
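All scraped values arrive as strings, so it can help to give the columns proper types before any analysis. The sketch below works on a tiny hand-made frame rather than the real `df`; passing an explicit `format` to `pd.to_datetime` is an optional refinement, and `errors='coerce'` is an assumption about how non-numeric lap values should be treated.

```python
import pandas as pd

# Tiny stand-in for df, using values that appear in this notebook
sample_df = pd.DataFrame({
    'Race date': ['13 May 1950', '30 Oct 2022'],
    'Race lap': ['70', '71'],
})

# An explicit format avoids per-value format guessing
sample_df['Race date'] = pd.to_datetime(sample_df['Race date'], format='%d %b %Y')
sample_df['year'] = sample_df['Race date'].dt.year

# errors='coerce' turns values that are not numbers into NaN instead of raising
sample_df['Race lap'] = pd.to_numeric(sample_df['Race lap'], errors='coerce')

print(sample_df.dtypes)
```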


* Writing the DataFrame to a CSV file

In [41]:
df.to_csv('Formula1 Race Results.csv', index=False)
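The `index=False` argument matters when the file is read back later. As a small illustration on a hand-made frame, writing with the index and reading the result again produces an extra unnamed column, which may be where a header such as `Unnamed: 0` comes from. `roundtrip_columns` is a hypothetical helper, and an in-memory buffer is used here instead of a file.

```python
import io
import pandas as pd

sample_df = pd.DataFrame({'Race location': ['Bahrain'], 'Race lap': [57]})

def roundtrip_columns(df, index):
    """Write df to CSV text and read it back, returning the column names."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=index)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)

print(roundtrip_columns(sample_df, index=True))
print(roundtrip_columns(sample_df, index=False))
```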

In [42]:
jovian.commit(files = ['Formula1 Race Results.csv'], index=False)


[jovian] Updating notebook "trineshnk/f1-project" on https://jovian.ai
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/trineshnk/f1-project


'https://jovian.ai/trineshnk/f1-project'

## Conclusion
1. Scraped the Formula 1 website [link here](https://formula1.com/en/results.html) for race results using Requests and BeautifulSoup4
2. Built functions such as `get_race_results()` and `get_all_data()`, along with several helper functions, to scrape race locations, dates, winners, car names, laps and race finish times
3. Stored the data, consisting of 1080 rows x 7 columns, in the file 'Formula1 Race Results.csv'

## Future work
* We can extract driver standings and constructor standings in an upcoming web scraping project
* The same data can be used to analyse the performance of all F1 teams and drivers


## References

1. Formula1 official website: https://www.formula1.com/en/results.html/drivers.html
2. Jovian tutorial videos: https://jovian.ai/
3. Wikipedia: https://en.wikipedia.org/wiki/Formula_One

In [None]:
jovian.commit(project='F1 Project')
