## Scraping Mercedes AMG Petronas F1 Team Points from 2010-2020

Building on the techniques I used to scrape RWC data, I scraped Mercedes AMG Petronas F1 data from the official F1 website. 

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
import datetime as dt
import numpy as np

I list the results URL for each season, fetch each page with requests, parse it with Beautiful Soup, and then clean the result so that only the text of the results table remains.

In [2]:
URL_2010 = 'https://www.formula1.com/en/results.html/2010/team/mercedes.html'
URL_2011 = 'https://www.formula1.com/en/results.html/2011/team/mercedes.html'
URL_2012 = 'https://www.formula1.com/en/results.html/2012/team/mercedes.html'
URL_2013 = 'https://www.formula1.com/en/results.html/2013/team/mercedes.html'
URL_2014 = 'https://www.formula1.com/en/results.html/2014/team/mercedes.html'
URL_2015 = 'https://www.formula1.com/en/results.html/2015/team/mercedes.html'
URL_2016 = 'https://www.formula1.com/en/results.html/2016/team/mercedes.html'
URL_2017 = 'https://www.formula1.com/en/results.html/2017/team/mercedes.html'
URL_2018 = 'https://www.formula1.com/en/results.html/2018/team/mercedes.html'
URL_2019 = 'https://www.formula1.com/en/results.html/2019/team/mercedes.html'
URL_2020 = 'https://www.formula1.com/en/results.html/2020/team/mercedes.html'
URLS = [URL_2010,URL_2011,URL_2012,URL_2013,URL_2014,URL_2015,URL_2016,URL_2017,URL_2018,URL_2019,URL_2020]
big_soup = []
for url in URLS:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Keep only the body of the results table
    table = soup.find('table', {'class': 'resultsarchive-table'}).tbody
    text = table.text
    # Flatten the table text onto one line so the regexes can scan it
    text_clean = text.replace('\n', ' ')
    big_soup.append(text_clean)
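
Since the eleven URLs differ only in the year, the list could also be generated rather than typed out. This is a minimal sketch of that idea; `URL_TEMPLATE`, `build_urls`, and `generated_urls` are hypothetical names that do not appear in the notebook.

```python
# Hypothetical template: the season URL from the cell above with the year left blank
URL_TEMPLATE = 'https://www.formula1.com/en/results.html/{year}/team/mercedes.html'


def build_urls(first_year, last_year):
    """Return one results URL per season, including both end years."""
    return [URL_TEMPLATE.format(year=year)
            for year in range(first_year, last_year + 1)]


generated_urls = build_urls(2010, 2020)
```

The resulting list should match `URLS` entry for entry, so it could be passed to the same scraping loop.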

In [3]:
soup = " ".join(big_soup)

In [4]:
soup

'    Bahrain  14 Mar 2010 18      Australia  28 Mar 2010 11      Malaysia  04 Apr 2010 15      China  18 Apr 2010 16      Spain  09 May 2010 12      Monaco  16 May 2010 6      Turkey  30 May 2010 22      Canada  13 Jun 2010 8      Europe  27 Jun 2010 1      Great Britain  11 Jul 2010 17      Germany  25 Jul 2010 6      Hungary  01 Aug 2010 0      Belgium  29 Aug 2010 14      Italy  12 Sep 2010 12      Singapore  26 Sep 2010 10      Japan  10 Oct 2010 8      South Korea  24 Oct 2010 12      Brazil  07 Nov 2010 14      Abu Dhabi  14 Nov 2010 12        Australia  27 Mar 2011 0      Malaysia  10 Apr 2011 2      China  17 Apr 2011 14      Turkey  08 May 2011 10      Spain  22 May 2011 14      Monaco  29 May 2011 0      Canada  12 Jun 2011 12      Europe  26 Jun 2011 6      Great Britain  10 Jul 2011 10      Germany  24 Jul 2011 10      Hungary  31 Jul 2011 2      Belgium  28 Aug 2011 18      Italy  11 Sep 2011 10      Singapore  25 Sep 2011 6      Japan  09 Oct 2011 9      South Korea  16 O

Once all the text is in `soup`, I use regular expressions to find the circuits, dates, and points, and store each set of matches in its own list.

In [5]:
# Circuit regex: a one- or two-word name preceded by 4-8 spaces and followed by two spaces
circuit_pattern = re.compile(r'\s{4,8}(\w*|\w*\s\w*)\s{2}')
# findall already returns a list of the captured names
circuit_list = circuit_pattern.findall(soup)

In [6]:
# Date regex: day, month abbreviation, four-digit year (e.g. 14 Mar 2010)
date_pattern = re.compile(r'\d{1,2}\s\w*\s\d{4}')
date_list = date_pattern.findall(soup)

In [7]:
# Points regex: the one- or two-digit number that directly follows a date
points_pattern = re.compile(r'\d{1,2}\s\w*\s\d{4}\s(\d{1,2})')
points_list = points_pattern.findall(soup)
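
Three separate patterns produce three independent lists, so a single missed match in one of them would shift every later row. As a hedged alternative, one pattern with named groups can capture circuit, date, and points together. This sketch is not the notebook's method; `sample` is a short excerpt of the output shown earlier, and `row_pattern` and `rows` are hypothetical names.

```python
import re

# A few entries copied from the scraped text, with the original spacing
sample = ('    Bahrain  14 Mar 2010 18      Australia  28 Mar 2010 11'
          '      Great Britain  11 Jul 2010 17      Hungary  01 Aug 2010 0  ')

# One match per race: circuit name (one or more words), two spaces,
# the date, one space, then the points
row_pattern = re.compile(
    r'(?P<circuit>\w+(?: \w+)*)\s{2}'
    r'(?P<date>\d{1,2} \w{3} \d{4}) '
    r'(?P<points>\d{1,2})'
)

# Each match becomes a dict, so the three fields of a race stay together
rows = [match.groupdict() for match in row_pattern.finditer(sample)]
```

Each element of `rows` is a dictionary with the keys `circuit`, `date`, and `points`, which `pd.DataFrame(rows)` should accept directly.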

In [8]:
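dico = {'Circuit': circuit_list, 'Date': date_list, 'Points': points_list}
# orient='index' plus transpose() pads shorter lists with missing values instead of raising
df = pd.DataFrame.from_dict(dico, orient='index')
df = df.transpose()
# df['Date'] = pd.to_datetime(df['Date'], format='%d %b %Y')
# Integer copy of the points column; df itself still holds strings
points_df = df['Points'].astype(str).astype(int)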

Finally, I collect the three lists in a dictionary and build a pandas DataFrame from it.
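
Because every value comes out of a regular expression, the columns all hold strings. The following is a minimal sketch of converting them to proper types, using the first three rows of the scraped data; `typed` and `total_points` are hypothetical names, and the date format string assumes English month abbreviations as seen in the output.

```python
import pandas as pd

# First three races of 2010, as strings, the way the regexes return them
rows = {
    'Circuit': ['Bahrain', 'Australia', 'Malaysia'],
    'Date': ['14 Mar 2010', '28 Mar 2010', '04 Apr 2010'],
    'Points': ['18', '11', '15'],
}
typed = pd.DataFrame(rows)

# Parse day / month abbreviation / year into real timestamps
typed['Date'] = pd.to_datetime(typed['Date'], format='%d %b %Y')
# Convert points to integers so they can be summed and averaged
typed['Points'] = typed['Points'].astype(int)

total_points = typed['Points'].sum()
```

With typed columns, operations such as sorting by date or summing points per season no longer depend on string comparison.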

In [9]:
df

| | Circuit | Date | Points |
|---|---|---|---|
| 0 | Bahrain | 14 Mar 2010 | 18 |
| 1 | Australia | 28 Mar 2010 | 11 |
| 2 | Malaysia | 04 Apr 2010 | 15 |
| 3 | China | 18 Apr 2010 | 16 |
| 4 | Spain | 09 May 2010 | 12 |
| ... | ... | ... | ... |
| 210 | Emilia Romagna | 01 Nov 2020 | 44 |
| 211 | Turkey | 15 Nov 2020 | 25 |
| 212 | Bahrain | 29 Nov 2020 | 29 |
| 213 | Sakhir | 06 Dec 2020 | 7 |


In [10]:
merc = df

I also realized that there are some one-off races, such as the Styria and Emilia Romagna Grands Prix, whose names aren't very useful for data analysis. They should instead be grouped under the country in which each race took place.

In [13]:
# Map each one-off race name to its host country in a single replace call
merc['Circuit'] = merc['Circuit'].replace({
    'Eifel': 'Germany',
    'Tuscany': 'Italy',
    '70th Anniversary': 'Great Britain',
    'Sakhir': 'Bahrain',
    'Styria': 'Austria',
    'Emilia Romagna': 'Italy',
})
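
To illustrate why the grouping helps, here is a small sketch that applies the same name-to-country pairs and then totals points per country. It uses the last four rows of the table above with points already converted to integers; `RACE_TO_COUNTRY`, `demo`, and `points_by_country` are hypothetical names.

```python
import pandas as pd

# Same pairs as in the cell above, collected in one mapping
RACE_TO_COUNTRY = {
    'Eifel': 'Germany',
    'Tuscany': 'Italy',
    '70th Anniversary': 'Great Britain',
    'Sakhir': 'Bahrain',
    'Styria': 'Austria',
    'Emilia Romagna': 'Italy',
}

# Last four races of 2020 from the scraped table
demo = pd.DataFrame({
    'Circuit': ['Emilia Romagna', 'Turkey', 'Bahrain', 'Sakhir'],
    'Points': [44, 25, 29, 7],
})

# Names not in the mapping (Turkey, Bahrain) are left unchanged
demo['Circuit'] = demo['Circuit'].replace(RACE_TO_COUNTRY)
# Total points per host country
points_by_country = demo.groupby('Circuit')['Points'].sum()
```

After the replacement, the Sakhir race is counted together with the Bahrain race, which is the grouping described above.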

In [14]:
merc.describe()

| | Circuit | Date | Points |
|---|---|---|---|
| count | 215 | 215 | 215 |
| unique | 27 | 215 | 37 |
| top | Italy | 18 Sep 2016 | 43 |
| freq | 13 | 1 | 46 |


In [11]:
# merc.to_csv(r'C:\Users\lacar\DQ Projects\F1' + '\\mercedesrecord2010-2020.csv', index=False)