## Scraping the Results Page

Notes on scraping the MLS results page using Selenium.

The results page loads the form table dynamically, and a quick look at the markup doesn't reveal an easy way to call the underlying API directly. A web client that fully renders the page, such as Selenium, therefore seems like the right way to get a complete HTML page that can then be parsed.

In [25]:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import mlsutil

In [20]:
url = 'https://www.mlssoccer.com/results/2011'
file_path = '../raw/team-results-2011.html'

In [16]:
# create a new browser session
driver = webdriver.Chrome()
driver.get(url)

In [21]:
# save the rendered HTML, then close the browser session
with open(file_path, 'w', encoding='utf-8') as f:
    f.write(driver.page_source)
driver.quit()

In [24]:
with open(file_path, encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')

table_elements = soup.find_all('table')
len(table_elements)

2

In [50]:
# this is going to require some special parsing

# there are two tables in the data, the first is a table of the teams
t = table_elements[0]
rows = t.find_all('tr')

# the first row is garbage
rows = rows[1:]

# get the team abbreviations, which become the row index of the data frame
teams = [row.get_text().strip() for row in rows]
teams_index = pd.Series(data=teams, name="Club")

# get second table and separate header and data
t_results = table_elements[1]
rows = t_results.find_all('tr')
header = rows[0]
data = rows[1:]

column_names = [h.get_text().strip() for h in header.find_all('th')]
column_index = pd.Series(data=column_names, name="Match")

# create an empty data frame (clubs x matches) to be filled in below
df = pd.DataFrame(columns=column_index, index=teams_index)

In [51]:
# insert results into the data frame, one cell per result link
for i_row, row in enumerate(data):
    for i_column, tag in enumerate(row.find_all('td')):
        df.iloc[i_row, i_column] = tag.a.get_text().strip()

df.head()

Match  1  2  3  4  5  6  7  8  9  10  ...  25  26  27  28  29  30  31  32  33  34
Club
CHI    D  W  L  L  L  D  D  D  D  L   ...  W   W   L   W   W   W   D   L   W   W
CHV    L  L  D  D  D  W  W  L  W  L   ...  D   L   L   L   L   D   W   D   L   L
CLB    L  D  W  D  W  D  W  D  L  L   ...  W   L   L   D   L   L   L   W   W   L
COL    W  W  W  L  L  L  D  W  D  D   ...  W   W   D   L   L   L   D   W   D   W
DAL    D  L  L  W  L  W  W  D  W  W   ...  D   L   W   L   L   L   L   W   W   L
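
The same frame can also be built in a single constructor call rather than cell by cell. The sketch below is an alternative under stated assumptions, not the approach used in these notes: `results_frame` is a hypothetical helper, and `sample_html` is a made-up miniature that only imitates the two-table layout described above. Unlike the loop, it stores `None` for a cell that has no link instead of raising `AttributeError`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the saved page: same two-table layout, made-up data.
sample_html = """
<table><tr><th>Club</th></tr><tr><td>AAA</td></tr><tr><td>BBB</td></tr></table>
<table>
  <tr><th>1</th><th>2</th></tr>
  <tr><td><a href="/m/1">W</a></td><td><a href="/m/2">D</a></td></tr>
  <tr><td><a href="/m/1">L</a></td><td></td></tr>
</table>
"""


def results_frame(soup):
    """Build the club-by-match frame from the first two tables in `soup`."""
    club_table, result_table = soup.find_all("table")[:2]
    # Skip the first row of the club table, as in the cell above.
    clubs = [tr.get_text().strip() for tr in club_table.find_all("tr")[1:]]
    header, *body = result_table.find_all("tr")
    matches = [th.get_text().strip() for th in header.find_all("th")]
    # One inner list per club; a cell without a link becomes None.
    cells = [
        [td.a.get_text().strip() if td.a else None for td in tr.find_all("td")]
        for tr in body
    ]
    return pd.DataFrame(
        cells,
        index=pd.Index(clubs, name="Club"),
        columns=pd.Index(matches, name="Match"),
    )


frame = results_frame(BeautifulSoup(sample_html, "html.parser"))
```

With the real page, `results_frame(soup)` would be called on the `soup` object parsed earlier.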


In [78]:
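import re

# `row` is still the last row from the loop above; take its seventh cell
data_tags = row.find_all('td')
tag = data_tags[6]
# the match slug is the last path segment of the link, ignoring a trailing slash
m = re.search(r'[^/]+(?=/$|$)', tag.a['href'])
m.group()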

'2011-04-23-vancouver-whitecaps-fc-vs-fc-dallas'
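
A slug like the one above appears to carry the match date and both team names. The sketch below splits it apart, assuming every slug follows the `YYYY-MM-DD-<team>-vs-<team>` pattern seen in this one example. `SLUG_RE` and `parse_match_slug` are hypothetical names, and these notes do not confirm which of the two teams is the home side.

```python
import re
from datetime import date

# Assumed slug layout: YYYY-MM-DD-<first team>-vs-<second team>
SLUG_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(.+?)-vs-(.+)$")


def parse_match_slug(slug):
    """Return (match_date, first_team, second_team), or None if no match."""
    m = SLUG_RE.match(slug)
    if m is None:
        return None
    year, month, day, first, second = m.groups()
    return date(int(year), int(month), int(day)), first, second


parsed = parse_match_slug('2011-04-23-vancouver-whitecaps-fc-vs-fc-dallas')
# parsed == (date(2011, 4, 23), 'vancouver-whitecaps-fc', 'fc-dallas')
```

Returning `None` for a slug that does not fit keeps any unexpected link format visible instead of silently producing a wrong date.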

In [49]:
# save off the data frame
df.to_csv('../data/team-results-2011.csv')