<span style="font-family:Trebuchet MS; font-size:2em;">Capstone | NB1: Data Collection | Scraping Official Team Sites</span>

Riley Robertson | DSIR Capstone Project | MLS Analytics: Predicting Game Outcomes

---

Before hunting down detailed statistics, I wanted solid basic data for each team in the league, so I scraped it from the teams' official sites.   
If this foundational data turns out to contain mistakes, at least I'll know they were made by the team or the league itself rather than by an inaccurate third party I had misplaced my trust in.

## Imports

I brought in the `requests` and `BeautifulSoup` libraries for scraping the data and `pandas` for storing and saving it.

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Scraping

## Strategy

To make sure I had working code for collecting the basic info I needed, I started by scraping for a single team and a single season. Once I was confident with my process, I could then expand that to include all seasons for that same team. From there, I'd be able to write a function to look at all teams in the league and collect every season's worth of available data.

**Steps**
1. One Team, One Season
2. One Team, All Seasons
3. All Teams, All Seasons

## One Team | One Season

In [3]:
url = 'https://www.soundersfc.com/schedule?month=all&year=2019'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')

ssfc_matches = []
site_code = soup.find_all('article',{'class':'match_item'})
match_id = 201901

for article in site_code:  
    match = {}    
    match['competition'] = article.find('span',{'class':'match_competition'}).text
    if match['competition'] == 'MLS':
        match['match_id'] = f'{match_id}'
        match_id += 1
    match['opponent'] = article.find('img').attrs['title']
    # .find() returns None when the tag is missing, so .text raises AttributeError
    try:
        match['result'] = article.find('span',{'class':'match_result'}).text.split(' ')[0].lower()
    except AttributeError:
        match['result'] = ''
    try:
        match['score'] = article.find('span',{'class':'match_result'}).text.split(' ')[-1]
    except AttributeError:
        match['score'] = ''
    match['type'] = article.find('span',{'class':'match_home_away'}).text
    match['location'] = article.find('div',{'class':'match_info match_location_short'}).text.title()
#     match['date'] = article.find('div',{'class':'match_date'}).text
    match['year'] = article.find('div',{'class':'match_date'}).text.split(' ')[3]
    match['weekday'] = article.find('div',{'class':'match_date'}).text.split(' ')[0][:-1]
    match['month'] = article.find('div',{'class':'match_date'}).text.split(' ')[1]
    match['day'] = article.find('div',{'class':'match_date'}).text.split(' ')[2][:-1]
    try:
        match['time'] = article.find('span',{'class':'match_time'}).text
    except AttributeError:
        match['time'] = ''
        
        
    ssfc_matches.append(match)
    
    
ssfc_2019 = pd.DataFrame(ssfc_matches)
ssfc_2019.head()

Unnamed: 0,competition,opponent,result,score,type,location,year,weekday,month,day,time,match_id
0,Preseason,Houston Dynamo FC,loss,3-2,H,Kino Sports Complex,2019,Saturday,February,9,5:00PM PT,
1,Preseason,Portland Timbers,2-1,2-1,A,Kino Sports Complex,2019,Wednesday,February,13,6:00PM PT,
2,Preseason,FC Dallas,loss,2-1,H,Kino Sports Complex,2019,Saturday,February,16,9:00AM PT,
3,Preseason,Club Nacional de Football,2-0,2-0,H,"Lumen Field, Seattle, Wa",2019,Wednesday,February,20,7:30PM PT,
4,Preseason,San Jose Earthquakes,2-2,2-2,A,Paypal Park,2019,Saturday,February,23,12:45PM PT,
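
The loop above looks up the `match_date` element four times and splits the same string each time. A minimal sketch of splitting it once is shown here. It assumes the date text looks like `Saturday, February 9, 2019`, which is what the indexing in the loop implies, and `split_match_date` is a hypothetical helper name rather than something from the notebook.

```python
def split_match_date(date_text):
    """Split text like 'Saturday, February 9, 2019' into its four parts."""
    # Assumed layout: '<weekday>, <month> <day>, <year>'
    weekday, month, day, year = date_text.split()
    return {
        'weekday': weekday.rstrip(','),  # drop the trailing comma
        'month': month,
        'day': day.rstrip(','),          # drop the trailing comma
        'year': year,
    }


parts = split_match_date('Saturday, February 9, 2019')
print(parts)
```

Inside the loop, the four date assignments could then become a single `match.update(split_match_date(date_text))`, where `date_text` is the text of the `match_date` element.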


---

## One Team | All Seasons

In [3]:
import pandas as pd
import requests
import time
from bs4 import BeautifulSoup

column_names = ['competition', 'opponent', 'result', 'score', 'type', 
                'location', 'year', 'weekday', 'month', 'day', 'time']

ssfc_match_hist = pd.DataFrame(columns=column_names)


for year in range(2009,2021):
    url = f'https://www.soundersfc.com/schedule?month=all&year={year}'
    res = requests.get(url)
    time.sleep(1)
    soup = BeautifulSoup(res.content, 'lxml')
    match_items = soup.find_all('article',{'class':'match_item'})
    season = []
    match_id = year*100
    
    for article in match_items:  
        match = {}
        match['competition'] = article.find('span',{'class':'match_competition'}).text
        if match['competition'] == 'MLS':
            match['match_id'] = f'{match_id}'
            match_id += 1
        match['opponent'] = article.find('img').attrs['title']
        # .find() returns None when the tag is missing, so .text raises AttributeError
        try:
            match['result'] = article.find('span',{'class':'match_result'}).text.split(' ')[0].lower()
        except AttributeError:
            match['result'] = ''
        try:
            match['score'] = article.find('span',{'class':'match_result'}).text.split(' ')[-1]
        except AttributeError:
            match['score'] = ''
        match['type'] = article.find('span',{'class':'match_home_away'}).text
        match['location'] = article.find('div',{'class':'match_info match_location_short'}).text.title()
    #     match['date'] = article.find('div',{'class':'match_date'}).text
        match['year'] = article.find('div',{'class':'match_date'}).text.split(' ')[3]
        match['weekday'] = article.find('div',{'class':'match_date'}).text.split(' ')[0][:-1]
        match['month'] = article.find('div',{'class':'match_date'}).text.split(' ')[1]
        match['day'] = article.find('div',{'class':'match_date'}).text.split(' ')[2][:-1]
        try:
            match['time'] = article.find('span',{'class':'match_time'}).text
        except AttributeError:
            match['time'] = ''
        season.append(match)
    season = pd.DataFrame(season)
    
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    ssfc_match_hist = pd.concat([ssfc_match_hist, season], ignore_index=True)
ssfc_match_hist

Unnamed: 0,competition,opponent,result,score,type,location,year,weekday,month,day,time,match_id
0,Preseason,LA Galaxy,win,3-1,A,"Carson, Ca",2009,Monday,February,9,TBD,
1,Preseason,Shandong Luneng Taishan FC,win,2-0,A,"Oxnard, Ca",2009,Tuesday,February,10,TBD,
2,Preseason,San Jose Earthquakes,loss,3-2,A,"San Luis Obispo, Ca",2009,Friday,February,13,TBD,
3,Preseason,Vancouver Whitecaps FC,win,4-0,H,"Seattle, Wa",2009,Sunday,February,22,TBD,
4,Preseason,Estudiantes de La Plata,win,3-1,A,Argentina,2009,Friday,February,27,TBD,
...,...,...,...,...,...,...,...,...,...,...,...,...
574,MLS,San Jose Earthquakes,win,4-1,H,Lumen Field,2020,Sunday,November,8,3:30PM PT,202022
575,MLS Playoffs,Los Angeles Football Club,win,3-1,H,Lumen Field,2020,Tuesday,November,24,7:30PM PT,
576,MLS Playoffs,FC Dallas,win,1-0,H,Lumen Field,2020,Tuesday,December,1,6:30PM PT,
577,MLS Playoffs,Minnesota United FC,win,3-2,H,Lumen Field,2020,Monday,December,7,6:30PM PT,


In [4]:
# ssfc_match_hist[:50]

In [5]:
ssfc_match_hist.shape

(579, 12)

In [6]:
ssfc_match_hist['competition'].value_counts()

MLS             390
Preseason        77
CCL              38
MLS Playoffs     36
USOC             32
Cup               4
US_KO             1
ACC               1
Name: competition, dtype: int64

In [7]:
ssfc_match_hist.to_csv('../data/output_scraping/ssfc_match_hist.csv')

## Function

In [8]:
def get_team_hist(team_domain, team_code, start_year, end_year, filepathname):

    import pandas as pd
    import requests
    import time
    from bs4 import BeautifulSoup
    
    column_names = ['competition', 'opponent', 'result', 'score', 'type', 
                    'location', 'year', 'weekday', 'month', 'day', 'time']
    team_match_hist = pd.DataFrame(columns=column_names)
        
    for year in range(start_year, end_year + 1):
        url = f'https://www.{team_domain}/schedule?month=all&year={year}'
        res = requests.get(url)
        time.sleep(.5)
        soup = BeautifulSoup(res.content, 'lxml')
        match_items = soup.find_all('article',{'class':'match_item'})
        match_id = year*100+1
        season = []

        for article in match_items:  
            match = {}
            match['competition'] = article.find('span',{'class':'match_competition'}).text
            if match['competition'] == 'MLS':
                match['match_id'] = f'{team_code}{match_id}'
                match_id += 1
            match['opponent'] = article.find('img').attrs['title']
            # .find() returns None when the tag is missing, so .text raises AttributeError
            try:
                match['result'] = article.find('span',{'class':'match_result'}).text.split(' ')[0].lower()
            except AttributeError:
                match['result'] = ''
            try:
                match['score'] = article.find('span',{'class':'match_result'}).text.split(' ')[-1]
            except AttributeError:
                match['score'] = ''
            match['type'] = article.find('span',{'class':'match_home_away'}).text
            match['location'] = article.find('div',{'class':'match_info match_location_short'}).text.title()
            match['year'] = article.find('div',{'class':'match_date'}).text.split(' ')[3]
            match['weekday'] = article.find('div',{'class':'match_date'}).text.split(' ')[0][:-1]
            match['month'] = article.find('div',{'class':'match_date'}).text.split(' ')[1]
            match['day'] = article.find('div',{'class':'match_date'}).text.split(' ')[2][:-1]
            try:
                match['time'] = article.find('span',{'class':'match_time'}).text
            except AttributeError:
                match['time'] = ''
            season.append(match)

        season = pd.DataFrame(season)
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        team_match_hist = pd.concat([team_match_hist, season], ignore_index=True)
        
    team_match_hist.to_csv(f'{filepathname}')
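
The `match_id` built inside the function joins the team code, the season year, and a running counter over that season's MLS matches. The sketch below isolates that scheme. `make_match_id` is a hypothetical helper name, and the arithmetic assumes fewer than 100 MLS matches per team per season, since a larger counter would spill into the year digits.

```python
def make_match_id(team_code, year, match_number):
    """Build an ID such as 'sea202023' from a team code, year, and 1-based counter."""
    # Same arithmetic as the function: the counter starts at year*100 + 1
    return f'{team_code}{year * 100 + match_number}'


print(make_match_id('sea', 2020, 1))
print(make_match_id('sea', 2020, 23))
```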

Before using the function on all teams, I tested it on the Sounders to see if I would get the same result as I did with the code above.

In [9]:
get_team_hist('soundersfc.com', 'sea', 2009, 2020, '../data/output_scraping/test/function_test_ssfc.csv')

In [10]:
test_team_hist = pd.read_csv('../data/output_scraping/test/function_test_ssfc.csv')
test_team_hist

Unnamed: 0.1,Unnamed: 0,competition,opponent,result,score,type,location,year,weekday,month,day,time,match_id
0,0,Preseason,LA Galaxy,win,3-1,A,"Carson, Ca",2009,Monday,February,9,TBD,
1,1,Preseason,Shandong Luneng Taishan FC,win,2-0,A,"Oxnard, Ca",2009,Tuesday,February,10,TBD,
2,2,Preseason,San Jose Earthquakes,loss,3-2,A,"San Luis Obispo, Ca",2009,Friday,February,13,TBD,
3,3,Preseason,Vancouver Whitecaps FC,win,4-0,H,"Seattle, Wa",2009,Sunday,February,22,TBD,
4,4,Preseason,Estudiantes de La Plata,win,3-1,A,Argentina,2009,Friday,February,27,TBD,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
574,574,MLS,San Jose Earthquakes,win,4-1,H,Lumen Field,2020,Sunday,November,8,3:30PM PT,sea202023
575,575,MLS Playoffs,Los Angeles Football Club,win,3-1,H,Lumen Field,2020,Tuesday,November,24,7:30PM PT,
576,576,MLS Playoffs,FC Dallas,win,1-0,H,Lumen Field,2020,Tuesday,December,1,6:30PM PT,
577,577,MLS Playoffs,Minnesota United FC,win,3-2,H,Lumen Field,2020,Monday,December,7,6:30PM PT,


In [11]:
test_team_hist.drop(columns='Unnamed: 0').shape

(579, 12)

In [12]:
test_team_hist['competition'].value_counts()

MLS             390
Preseason        77
CCL              38
MLS Playoffs     36
USOC             32
Cup               4
US_KO             1
ACC               1
Name: competition, dtype: int64

The test appears to have been successful. The shape and competition value counts are identical.
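
The same comparison can be done in code rather than by eye. This is a minimal sketch, assuming both DataFrames are already in memory. `same_summary` is a hypothetical helper, and the small frames stand in for `ssfc_match_hist` and `test_team_hist.drop(columns='Unnamed: 0')`.

```python
import pandas as pd


def same_summary(left, right, column='competition'):
    """Return True when two frames share a shape and the value counts of `column`."""
    same_shape = left.shape == right.shape
    # Sort by label so the comparison does not depend on row order or tie order
    left_counts = left[column].value_counts().sort_index()
    right_counts = right[column].value_counts().sort_index()
    return same_shape and left_counts.equals(right_counts)


# Stand-in data: b is a reordered copy of a, c has a different competition mix
a = pd.DataFrame({'competition': ['MLS', 'MLS', 'USOC'], 'score': ['1-0', '2-1', '0-0']})
b = pd.DataFrame({'competition': ['USOC', 'MLS', 'MLS'], 'score': ['0-0', '2-1', '1-0']})
c = pd.DataFrame({'competition': ['MLS', 'USOC', 'USOC'], 'score': ['1-0', '2-1', '0-0']})

print(same_summary(a, b))
print(same_summary(a, c))
```

This only mirrors the two checks made above. It does not compare values cell by cell, which would fail here anyway because the function prefixes `match_id` with the team code.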

---

## All Teams | All Seasons

From https://www.mlssoccer.com/clubs and https://en.wikipedia.org/wiki/Major_League_Soccer, I collected each team's basic information including name, location, year joined, and site info so that I could loop through a list of teams and use the function on each.

Unfortunately, not all teams had the same schedule page structure, so the code I wrote to collect historical match data from the Sounders' site wouldn't work for a handful of teams in the league. I created a column (`site_type`) to record which sites these were. Site Type A matches the Sounders' site and Site Type B does not.

The B sites have dynamic pages that use scripting (JavaScript, I think) to show different seasons' data without changing the URL. Because of that, I can't simply adjust URL parameters to get data from different seasons, so I will have to deal with those sites separately.

In [13]:
site_info = pd.read_csv('../data/manual_site_data/mls_site_info.csv')
static_site_info = site_info[site_info['site_type'] == 'A'].copy()
dynamic_site_info = site_info[site_info['site_type'] == 'B'].copy()

### Teams with Static Match History Pages

In [14]:
static_site_info.shape

(22, 12)

In [15]:
static_site_info.head()

Unnamed: 0,team,team_code,joined,city,state,url_base,url_domain,url_schedule,url_team,url_tld,site_type,conf
0,Atlanta United,atl,2017,Atlanta,Georgia,https://www.atlutd.com,atlutd.com,https://www.atlutd.com/schedule,atlutd,.com,A,East
1,CF Montreal,mtl,2012,Montreal,Quebec,https://www.cfmontreal.com,cfmontreal.com,https://www.cfmontreal.com/schedule,cfmontreal,.com,A,East
2,Chicago Fire FC,chi,1998,Chicago,Illinois,https://www.chicagofirefc.com,chicagofirefc.com,https://www.chicagofirefc.com/schedule,chicagofirefc,.com,A,East
3,Columbus Crew,col,1996,Columbus,Ohio,https://www.columbuscrew.com,columbuscrew.com,https://www.columbuscrew.com/schedule,columbuscrew,.com,A,East
4,D.C. United,dcu,1996,Washington,D.C.,https://www.dcunited.com,dcunited.com,https://www.dcunited.com/schedule,dcunited,.com,A,East


Now that I had the data I needed, I could loop over the teams and use my function from above to create a CSV of all match data for each one.

In [16]:
for index, team in static_site_info.iterrows():
    code = team['team_code']
    get_team_hist(team['url_domain'], 
                  code, 
                  team['joined'], 
                  2020, 
                  f'../data/output_scraping/may23-scrape1/{code}-match_hist.csv')
    time.sleep(1)

It worked! I'm clearly going to need to do some serious QC, but I now have a .csv file for each of the 22 teams whose sites are easier to scrape.

I'm going to read them back in and combine them into one DataFrame to look at them all together.

In [85]:
mls_matches_columns = ['competition', 'opponent', 'result', 'score', 'type', 'location', 'year',
                       'weekday', 'month', 'day', 'time', 'match_id', 'team_code']

mls_matches = pd.DataFrame(columns=mls_matches_columns)
mls_matches

for index, team in static_site_info.iterrows():
    code = team['team_code']
    if code == 'afc':
        continue
    match_hist = pd.read_csv(f'../data/output_scraping/may23-scrape1/{code}-match_hist.csv')
    match_hist = match_hist.drop(columns=['Unnamed: 0'])   
    match_hist['team_code'] = code
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    mls_matches = pd.concat([mls_matches, match_hist])
    
mls_matches    
mls_matches.reset_index(inplace=True)
mls_matches.to_csv('../data/output_scraping/may23-scrape1/mls-match_history.csv')

# Initial Exploration

### Unique Scores (All Competitions)

In [92]:
allscores = mls_matches['score'].value_counts().copy()
allscores

2-1          2089
1-0          1893
2-0          1293
1-1          1094
3-1           810
             ... 
1(5)-(6)1       1
5-5             1
2(3)-(1)2       1
2(3)-(5)1       1
9-2             1
Name: score, Length: 71, dtype: int64

In [93]:
allscores = pd.DataFrame(allscores).rename(axis=1, mapper={'score':'count'})
allscores

Unnamed: 0,count
2-1,2089
1-0,1893
2-0,1293
1-1,1094
3-1,810
...,...
1(5)-(6)1,1
5-5,1
2(3)-(1)2,1
2(3)-(5)1,1


In [94]:
allscores.reset_index(inplace=True)

In [95]:
allscores = allscores.rename(axis=1, mapper={'index':'score'})
allscores

Unnamed: 0,score,count
0,2-1,2089
1,1-0,1893
2,2-0,1293
3,1-1,1094
4,3-1,810
...,...,...
66,1(5)-(6)1,1
67,5-5,1
68,2(3)-(1)2,1
69,2(3)-(5)1,1


In [106]:
print(allscores.sort_values(by='score').reset_index(drop=True)[:35])
print(allscores.sort_values(by='score').reset_index(drop=True)[35:])

        score  count
0   0(1)-(3)0      3
1   0(2)-(4)0      2
2   0(3)-(1)0      3
3   0(3)-(4)0      2
4   0(4)-(5)0      1
5   0(5)-(6)0      1
6         0-0    660
7   1(1)-(3)0      2
8   1(1)-(4)1      2
9   1(2)-(4)1      3
10  1(3)-(1)0      1
11  1(3)-(2)1      2
12  1(3)-(4)1      3
13  1(3)-(5)1      3
14  1(4)-(1)1      2
15  1(4)-(2)1      3
16  1(4)-(3)1      4
17  1(5)-(4)1      5
18  1(5)-(6)1      1
19  1(6)-(5)1      2
20  1(6)-(7)1      2
21  1(7)-(6)1      1
22  1(7)-(8)1      2
23        1-0   1893
24        1-1   1094
25  2(2)-(3)2      2
26  2(2)-(4)2      1
27  2(3)-(1)2      1
28  2(3)-(4)1      1
29  2(3)-(4)2      3
30  2(3)-(5)1      1
31  2(4)-(2)1      3
32  2(4)-(3)2      1
33  2(4)-(5)2      1
34  2(6)-(5)2      3
         score  count
35   2(7)-(6)2      2
36  2(9)-(10)2      1
37         2-0   1293
38         2-1   2089
39         2-2    589
40   3(2)-(4)2      2
41   3(3)-(4)3      1
42   3(4)-(3)3      1
43   3(5)-(4)3      1
44   3(7)-(6)3      1
45
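
Scores such as `1(5)-(6)1` mix two pairs of numbers, so they can't be split on `-` alone the way `2-1` can. The sketch below separates the outer goal counts from the bracketed numbers with a regular expression. It assumes the bracketed numbers are penalty-shootout tallies, which the notebook does not confirm. `parse_score` and the `left`/`right` key names are hypothetical and deliberately don't say which side is the home team.

```python
import re

# Outer digits are goals; optional bracketed digits sit next to the dash
SCORE_PATTERN = re.compile(r'^(\d+)(?:\((\d+)\))?-(?:\((\d+)\))?(\d+)$')


def parse_score(score):
    """Return goal and bracketed counts from a score string, or None if it doesn't fit."""
    if not isinstance(score, str):
        return None  # e.g. NaN for matches without a result
    found = SCORE_PATTERN.match(score.strip())
    if found is None:
        return None
    left, left_extra, right_extra, right = found.groups()
    return {
        'left_goals': int(left),
        'right_goals': int(right),
        'left_shootout': int(left_extra) if left_extra else None,
        'right_shootout': int(right_extra) if right_extra else None,
    }


print(parse_score('2-1'))
print(parse_score('2(9)-(10)2'))
```

Applied with `mls_matches['score'].apply(parse_score)`, any row that comes back as `None` is a score format worth inspecting by hand.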

### Count of Matches vs Expected Count of Matches

In [19]:
# 'location' was listed twice; each column name should appear only once
mls_matches_columns = ['competition', 'opponent', 'result', 'score', 'type', 'location', 
                       'year', 'weekday', 'month', 'day', 'time', 'match_id', 'team_code']

mls_matches = pd.DataFrame(columns=mls_matches_columns)
mls_matches

expected_matches = []

for index, team in static_site_info.iterrows():
    code = team['team_code']
    if code == 'afc':
        continue
    match_hist = pd.read_csv(f'../data/output_scraping/may23-scrape1/{code}-match_hist.csv')
    match_hist = match_hist.drop(columns=['Unnamed: 0'])   
    match_hist['team_code'] = code
    match_hist = match_hist.dropna(subset=['match_id'])
    
    t = {}
    t['team'] = code
    t['match_count'] = len(match_hist)
    t['seasons'] = (2021-int(team['joined']))
    t['xGP'] = ((2021-int(team['joined']))*32)
    t['diff'] = t['xGP']-t['match_count']
    expected_matches.append(t)

pd.DataFrame(expected_matches)    

Unnamed: 0,team,match_count,seasons,xGP,diff
0,atl,125,4,128,3
1,mtl,193,9,288,95
2,chi,720,23,736,16
3,col,783,25,800,17
4,dcu,787,25,800,13
5,cin,57,2,64,7
6,mia,24,1,32,8
7,ner,789,25,800,11
8,nyc,193,6,192,-1
9,orl,191,6,192,1
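
The arithmetic behind each row can be checked on its own. This is a small sketch of the same calculation. `expected_vs_actual` is a hypothetical helper, and the flat 32 games per season is the notebook's own simplifying assumption, kept here as a default parameter so it is easy to change.

```python
def expected_vs_actual(joined, match_count, last_season=2020, games_per_season=32):
    """Compare scraped MLS match counts against seasons * games_per_season."""
    seasons = last_season - joined + 1  # same as 2021 - joined in the cell above
    expected = seasons * games_per_season
    return {'seasons': seasons, 'xGP': expected, 'diff': expected - match_count}


# Inputs taken from the atl and mtl rows of the table above
print(expected_vs_actual(joined=2017, match_count=125))
print(expected_vs_actual(joined=2012, match_count=193))
```

A positive `diff` means fewer MLS matches were scraped than the flat estimate predicts, and a negative one means more.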
