# First Data Scraper Notes

For the past few weeks I've been learning various Python data-wrangling libraries (pandas, numpy, matplotlib, etc.). It is also the middle of the MLS season, so I've been making occasional posts on [Sounder at Heart](https://www.sounderatheart.com/).

Intellectual curiosity and a need to overanalyze my favorite sports team sent me searching for raw MLS data. I couldn't find anything in the public domain that was suitable for direct monkeying around with, so I decided to build my own data set.

Here are some notes on the first part of that journey.

The following header block includes the modules needed to do the scraping.
* [requests](http://docs.python-requests.org/en/master/) is a simple HTTP client library suitable for grabbing data from the web.
* [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) parses HTML and provides simple ways to access the node tree.

In [1]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [9]:
import mlsutil

In [3]:
year = '2014'
standings_file_path = '../raw/standings-ss-{}.html'.format(year)
standings_file_path

'../raw/standings-ss-2014.html'

In [6]:
# download the html to the raw data path
standings_ss_url = 'https://www.mlssoccer.com/standings/supporters-shield/{}/'.format(year)
r = requests.get(standings_ss_url)
with open(standings_file_path, 'w', encoding='utf-8') as f:
    f.write(r.text)
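
The download cell above hits the site every time it runs. Here is a minimal sketch of a helper that reuses the saved copy when one exists and refuses to save an error page. The name `download_page` and the `fetch` and `force` parameters are my own inventions for illustration, not part of `mlsutil`.

```python
from pathlib import Path

import requests


def download_page(url, path, fetch=None, force=False):
    """Save the page at `url` to `path`, skipping the request if the file exists."""
    path = Path(path)
    if path.exists() and not force:
        return path  # reuse the cached copy

    if fetch is None:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
        text = response.text
    else:
        text = fetch(url)  # hypothetical injected fetcher, handy for offline testing

    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding='utf-8')
    return path
```

With this in place the cell would reduce to something like `download_page(standings_ss_url, standings_file_path)`, and re-running the notebook would not trigger another request unless `force=True` is passed.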


In [55]:
# open up the html file and get a reference to any tables
with open(standings_file_path, encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'lxml')
    
table_elements = soup.find_all('table')
len(table_elements)

1

In [56]:
# just assume there is only one
table_element = table_elements[0]

In [57]:
# fixup the standings table so that it can be converted using our generic table converter

# remove the first row, it is a secondary level header row
table_element.tr.extract()

# convert the first row in the table to hold th rather than td elements
for tag in table_element.tr.find_all('td'):
    tag.name = 'th'
    
# the Club column includes spans for mobile which erratically abbreviate the team name
# so clear those spans
bad_spans = table_element.select('.show-on-mobile-inline')
for span in bad_spans:
    span.clear()
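
To see what those three fixups do, here is a small sketch that runs the same steps on a tiny made-up table. The markup is an assumption shaped like the description above (in particular the `hide-on-mobile-inline` span and the `SEA` abbreviation are invented), not a copy of the real page, and it uses the built-in `html.parser` so it does not need lxml.

```python
from bs4 import BeautifulSoup

# A tiny, made-up table shaped like the one described above (not the real page markup)
html = """
<table>
  <tr><td colspan="2">Overall</td></tr>
  <tr><td>#</td><td>Club</td></tr>
  <tr><td>1</td><td><span class="hide-on-mobile-inline">Seattle Sounders FC</span><span class="show-on-mobile-inline">SEA</span></td></tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser').table

table.tr.extract()                      # drop the secondary header row
for cell in table.tr.find_all('td'):    # table.tr is now the real header row
    cell.name = 'th'
for span in table.select('.show-on-mobile-inline'):
    span.clear()                        # empty the span, leaving only the full name

headers = [th.get_text(strip=True) for th in table.find_all('th')]
first_row = [td.get_text(strip=True) for td in table.find_all('td')]
print(headers)    # ['#', 'Club']
print(first_row)  # ['1', 'Seattle Sounders FC']
```

Note that `table.tr` always refers to the first remaining row, which is why the same expression works both for removing the secondary header and for converting the real one.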

In [62]:
# import the data into a DataFrame

# alas, the simple conversion code that worked for the team standings doesn't work as well here
df = mlsutil.table_to_dataframe(table_element)
df.head()

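| Unnamed: 0 | # | Club | PTS | PPG | Unnamed: 5 | GP | W | L | T | GF | GA | GD | Unnamed: 13 | W-L-T | Unnamed: 15 | W-L-T.1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Seattle Sounders FC | 64 | 1.88 |  | 34 | 20 | 10 | 4 | 65 | 50 | 15 |  | 12-4-1 |  | 8-6-3 |
| 1 | 2 | LA Galaxy | 61 | 1.79 |  | 34 | 17 | 7 | 10 | 69 | 37 | 32 |  | 12-1-4 |  | 5-6-6 |
| 2 | 3 | D.C. United | 59 | 1.74 |  | 34 | 17 | 9 | 8 | 52 | 37 | 15 |  | 11-2-4 |  | 6-7-4 |
| 3 | 4 | Real Salt Lake | 56 | 1.65 |  | 34 | 15 | 8 | 11 | 54 | 39 | 15 |  | 11-1-5 |  | 4-7-6 |
| 4 | 5 | New England Revolution | 55 | 1.62 |  | 34 | 17 | 13 | 4 | 51 | 46 | 5 |  | 11-4-2 |  | 6-9-2 |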
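
The `mlsutil` module isn't shown in these notes, so for readers following along, here is a rough sketch of what a generic table converter could look like. This is a hypothetical stand-in, not the actual `mlsutil.table_to_dataframe`: it assumes one header row of `th` cells followed by rows of `td` cells, and it will not reproduce the `Unnamed: N` column names seen in the output above.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(table):
    """Hypothetical stand-in: th cells become columns, each row of td cells a record."""
    columns = [th.get_text(strip=True) for th in table.find_all('th')]
    rows = []
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:  # skip the header row, which has no td cells
            rows.append(cells)
    return pd.DataFrame(rows, columns=columns)


# Small usage example with made-up markup (html.parser avoids the lxml dependency)
html = (
    "<table>"
    "<tr><th>#</th><th>Club</th><th>PTS</th></tr>"
    "<tr><td>1</td><td>Seattle Sounders FC</td><td>64</td></tr>"
    "</table>"
)
demo = table_to_dataframe(BeautifulSoup(html, 'html.parser').table)
print(demo)
```

Every value comes back as a string here, so numeric columns would still need something like `pd.to_numeric` before doing any arithmetic. A row with a different number of cells than the header would raise an error, which is one reason the fixups before the conversion matter.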


In [29]:
# save out the data frame 
team_stats_datafile_path = "../data/team-stats-{}.csv".format(year)
df.to_csv(team_stats_datafile_path)
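
One thing to watch when saving: by default `to_csv` also writes the DataFrame index as a first column with no name, which pandas labels `Unnamed: 0` when the file is read back. Below is a minimal sketch of tidying the frame before saving, using two rows copied from the output above. Treating `W-L-T` as the home record and `W-L-T.1` as the away record is my assumption (the secondary header row that would confirm it was removed earlier), and the `Rank` name is likewise just an illustrative choice.

```python
import io

import pandas as pd

# Two rows copied from the output above, including the empty and duplicated columns
csv_text = """#,Club,PTS,PPG,Unnamed: 5,GP,W,L,T,GF,GA,GD,Unnamed: 13,W-L-T,Unnamed: 15,W-L-T.1
1,Seattle Sounders FC,64,1.88,,34,20,10,4,65,50,15,,12-4-1,,8-6-3
2,LA Galaxy,61,1.79,,34,17,7,10,69,37,32,,12-1-4,,5-6-6
"""
raw = pd.read_csv(io.StringIO(csv_text))

# Assumed meanings for the ambiguous column names (not confirmed by the source page)
new_names = {'#': 'Rank', 'W-L-T': 'Home W-L-T', 'W-L-T.1': 'Away W-L-T'}

clean = (
    raw.dropna(axis='columns', how='all')  # drop the all-empty spacer columns
       .rename(columns=new_names)
)

buffer = io.StringIO()
clean.to_csv(buffer, index=False)  # index=False keeps the index out of the file
reloaded = pd.read_csv(io.StringIO(buffer.getvalue()))
print(list(reloaded.columns))
```

The same `index=False` argument works when writing to `team_stats_datafile_path` instead of an in-memory buffer.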

Table Key

GP: Games Played, GS: Games Started, G: Goals, MIN: Minutes Played, A: Assists, SHT: Shots, SOG: Shots on Goal, FC: Fouls Committed, FS: Fouls Suffered, Y: Yellow Cards, R: Red Cards, GF: Goals For, GA: Goals Against, SO: Shutouts, SV: Saves, CK: Corner Kicks, PKA: Penalty Kick Attempts, PKG: Penalty Kick Goals, PKS: Penalty Kick Saves, OFF: Offsides