# Scraping football match results from the web

### Introduction

This notebook contains the code to scrape all match statistics from the website http://www.legaseriea.it/it.

The goal of this scraper is to extract the data without performing any transformation on them. 

I am a strong believer in software modularity. In my experience, a software system that works is one made of small components, each with a well-defined set of responsibilities that it handles well, and nothing else.

This component has **2 main responsibilities**:

1. Extract all the available statistics from each match report of each season available from the web site archive.
2. Dump them to a .csv file for further processing by other modules.

Scraping typically requires a significant amount of time, and after seeing the program fail a couple of times I decided to dump each season right after its extraction is complete and then proceed with the next one.

Therefore in the data directory you'll find one file for each season.
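
Since each season ends up in its own file, a rerun after a failure could skip the seasons that are already on disk. The following is only a sketch of that idea and is not part of the scraper in this notebook: `season_file` and `seasons_still_to_scrape` are hypothetical helpers, and they assume the file naming used by the dump function later on (the season identifier with "/" replaced by "_", plus ".csv").

```python
import os.path


def season_file(season, data_dir="./data/"):
    """Path of the csv file for one season (naming assumed from the dump function)."""
    return os.path.join(data_dir, season.replace("/", "_") + ".csv")


def seasons_still_to_scrape(seasons, data_dir="./data/"):
    """Hypothetical helper: keep only the seasons whose csv file is not on disk yet."""
    return [s for s in seasons if not os.path.exists(season_file(s, data_dir))]
```

Because a season is written only once its extraction is complete, the presence of its file is a reasonable, though not bulletproof, sign that the season does not need to be scraped again.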

Before dumping the data to disk, I convert them to a pandas data frame. No attention is given to the order of the columns at this point. Later on, once I have read all the .csv files into a single data frame, I'll take care of that and also give the columns short English names.
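
As a rough idea of what that later step might look like, here is a minimal sketch that stacks the per-season files into one data frame. `load_all_seasons` is a hypothetical name, and the sketch assumes that every .csv file in the data directory is a season file produced by this notebook.

```python
import glob
import os.path

import pandas as pd


def load_all_seasons(data_dir="./data/"):
    """Read every per-season csv in data_dir and stack them into one data frame."""
    paths = sorted(glob.glob(os.path.join(data_dir, "*.csv")))
    frames = [pd.read_csv(path) for path in paths]
    # Seasons have different sets of columns: concat keeps the union of them
    # and fills the statistics a season does not have with NaN
    return pd.concat(frames, ignore_index=True, sort=False)
```

Note that `pd.concat` raises a `ValueError` when the directory contains no csv files, so a caller may want to check for that case first.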

### Libraries

First of all I load all the necessary libraries at the beginning of the notebook.

In [1]:
import urllib.request
from bs4 import BeautifulSoup
import re
import pandas as pd
import os.path

# Function definitions

### Support functions

The functions in this section will be used to transform the data scraped from the website into a data frame and then dump them to the filesystem in the form of a csv file.

The **build_column_names** function has the following responsibility: each season potentially has a different set of statistics that have been collected, therefore the first thing to do is extract these statistics names as they will be used to name the columns of the data frame and will become the headers of the csv file.

In [2]:
############################################################################################
# This function builds the columns names for the tabular representation of the data
# extracted by the scraper
############################################################################################
def build_column_names(m):
    # Fixed fields, present in any given season
    cnames = ["Stagione", "Giornata", "Squadra", "Avversaria", "Campo", "Gol segnati", "Gol subiti"]

    # Seasons differ in the statistics they report, so we append those names
    # dynamically, scanning the part of the record where we have stored
    # the tuples (statname, hvalue, avalue).
    #
    # For any given statname, we also include its passive version: the value
    # of the same statistic for the opponent of a given team.
    # Slicing a record that has only the 6 opening fields yields an empty list.
    for statname, _hvalue, _avalue in m[6:]:
        cnames.append(statname)
        cnames.append(statname + " avversario")
    return cnames

When I extract the data from the HTML code I create a list of lists, where each element of the overall list has a structure of this kind:

(season,   game_number,   home_team,   away_team,   home_goals,   away_goals,   (statistics_name, home_value, away_value))

This is because, while the first 6 fields are always present, the statistics are not. As the seasons progress, more statistics are collected, so the first seasons have only a limited set of stats, while the most recent ones have a much richer set. To keep the data structure flexible, I used this kind of EAV (entity attribute value) tuple, of which I can have as many as I want without bothering about their position in the list.

Later I'll take care of unpacking it when building the data frame.

By the way, this is the reason why columns will appear in that scrambled order in the .csv, but like I said, column order is not the responsibility of the scraper, so I'll let it be.

In [3]:
###############################################################################
# This function scans the extracted statistics and builds 2 lists:
#
#  1. all the statistics related to the home team, both active and passive
#  2. all the statistics related to the away team, both active and passive
###############################################################################
def build_statistics_lists(game):
    hlst = []
    alst = []
    # season
    hlst.append(game[0])
    alst.append(game[0])
    # game day
    hlst.append(game[1])
    alst.append(game[1])
    # home team
    hlst.append(game[2])
    # away team
    alst.append(game[3])
    # home opponent
    hlst.append(game[3])
    # away opponent
    alst.append(game[2])
    # Venue
    hlst.append("home")
    alst.append("away")
    # home goals scored
    hlst.append(game[4])
    # home goals allowed
    hlst.append(game[5])    
    # away goals scored
    alst.append(game[5])
    # away goals allowed
    alst.append(game[4])
    # all game statistics
    if len(game) > 6:
        for statname, hvalue, avalue in game[6:]:
            hlst.append(hvalue)
            hlst.append(avalue)
            alst.append(avalue)
            alst.append(hvalue)
    return (hlst, alst)
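
To make the record layout concrete, the sketch below condenses the logic of the two functions above into a single hypothetical helper, `expand_match`, and applies it to a made-up record. The team names and values are invented for illustration; only the field order and the column names follow the notebook.

```python
def expand_match(record):
    """Hypothetical compact equivalent of the two support functions above."""
    season, day, home, away, hgoals, agoals = record[:6]
    stats = record[6:]  # zero or more (statname, home value, away value) tuples

    columns = ["Stagione", "Giornata", "Squadra", "Avversaria", "Campo",
               "Gol segnati", "Gol subiti"]
    home_row = [season, day, home, away, "home", hgoals, agoals]
    away_row = [season, day, away, home, "away", agoals, hgoals]

    for statname, hvalue, avalue in stats:
        columns += [statname, statname + " avversario"]
        home_row += [hvalue, avalue]  # own value first, then the opponent's
        away_row += [avalue, hvalue]
    return columns, home_row, away_row


# Made-up record, for illustration only
record = ["1986-87", "1", "Team A", "Team B", "2", "1", ("shots on goal", "7", "3")]
columns, home_row, away_row = expand_match(record)
print(dict(zip(columns, home_row)))
print(dict(zip(columns, away_row)))
```

Each match therefore produces two rows that mirror each other: what is "scored" for one team is "allowed" for the other, and the same holds for every statistic.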

Here I take the two lists built for a single match, one for the home team and one for the away team, and create 2 data frames:

1. _dfhome_ will contain all the stats for the home team
2. _dfaway_ will contain all the stats for the away team

In [4]:
################################################################
# This function reads 2 lists:
#
#  1. all statistics for the home team
#  2. all statistics for the away team
#
# and it builds a data frame representation from them
################################################################
def build_data_frame(hteamstats, ateamstats, cols):
    dfhome = pd.DataFrame([hteamstats], columns = cols)
    dfaway = pd.DataFrame([ateamstats], columns = cols)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([dfhome, dfaway])
    return df

This is the function that takes a data frame, and dumps it to a csv file.

In [5]:
##########################################################
# This function stores the extracted data to a csv file
##########################################################
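def dump_df_to_csv(df, season):
    save_path = "./data/"
    # Create the target directory if it does not exist yet
    os.makedirs(save_path, exist_ok=True)
    fname = season.replace("/","_") + ".csv"
    completeName = os.path.join(save_path, fname)
    # to_csv opens and closes the file itself when it is given a path
    df.to_csv(completeName, index=False)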

This function loops through all the statistics extracted for a single season and takes care of the entire process of converting them to a data frame and then saving it to a file.

It is like the _main_ of the dumping process, with the only responsibility of looping through the list and calling the right functions in the right order.

In [6]:
##########################################################################################
# This function takes care of the entire process of converting the extracted statistics
# into a data frame and dumping it to a .csv file
##########################################################################################
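def store_statistics(seas, stats):
    frames = []
    for match in stats:
        # scrape_match_report returns None when a report could not be scraped
        if match is None:
            continue
        cnames = build_column_names(match)
        hval, aval = build_statistics_lists(match)
        frames.append(build_data_frame(hval, aval, cnames))
    # Nothing to dump if no match report was scraped successfully
    if not frames:
        return
    # Matches with different sets of statistics end up with the union of the columns
    csv = pd.concat(frames, axis = 0, ignore_index=True)
    dump_df_to_csv(csv, seas)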

### Scraping functions

The functions in this section will be used to scrape the actual data from the website.

Scraping will be organized at different levels of detail:

>1. Scraping a single match report: this is actually the function that implements the scraping logic.
>2. Scraping all games of a particular match day.
>3. Scraping all match days of a particular season.
>4. Scraping all seasons contained in the archive.

This is the core function, the one which scrapes the actual match reports.

In doing so it must handle some minor season-specific peculiarities in the structure of the HTML code.

It builds and returns a list with the following structure:

* **opening fields**: season, match number, home team, away team, home goals, away goals
* **actual match statistics**: this will be organized as tuples like (statname, home value, away value), for example ("shots on goal", 7, 3).

In [7]:
##################################################################################
# This function scrapes one single match report.
# It takes the url of the match report in input.
# It returns all the statistics as a list with the following structure:
#
# - season
# - game day
# - home team name
# - away team name
# - home team goals scored
# - away team goals scored
# - as many tuples of the form
#        (statname, home value, away value)
#   as there are statistics in the report
##################################################################################
def scrape_match_report(url):
    try:
        print ("MATCH REPORT: Scraping url...", url)

        # Open the url, parse it and close the connection immediately
        page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, "html.parser")
        page.close()

        # Initialize return list
        lst = []

        # Extract season and matchday information from the url
        season = re.findall(".*match-report/([0-9]+-[0-9]+)/.*", url)
        matchday = re.findall(".*UNI/([0-9]+)/.*", url)
        lst.append(season[0])
        lst.append(matchday[0])

        # Extract team names and goals scored
        report_risultato = soup.find(class_ = "report-risultato")
        hteam = report_risultato.find(class_ = "report-squadra squadra-a").span.text
        ateam = report_risultato.find(class_ = "report-squadra squadra-b").span.text
        hgoal = report_risultato.find(class_ = "squadra-risultato squadra-a").text
        agoal = report_risultato.find(class_ = "squadra-risultato squadra-b").text
        ###print(hteam, ateam, hgoal, agoal)
        lst.extend([hteam, ateam, hgoal, agoal])

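        # Extract all match statistics.
        # From season 2014-15 on, the title at index 1 is skipped and every later
        # title is paired with the value cells one position earlier.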
        statistiche_comparate = soup.find(id = "statistiche-comparate")
        statnames_rs = statistiche_comparate.find_all(class_ = "valoretitolo")
        hvalues_rs = statistiche_comparate.find_all(class_ = "valoresx")
        avalues_rs = statistiche_comparate.find_all(class_ = "valoredx")
        for i in range(len(statnames_rs)):
            if (season[0] >= "2014-15"):
                if (i==0):                     
                    lst.append((statnames_rs[i].text, hvalues_rs[i].text, avalues_rs[i].text))
                elif (i==1):
                    continue
                else:
                    lst.append((statnames_rs[i].text, hvalues_rs[i-1].text, avalues_rs[i-1].text))                
            else:
                lst.append((statnames_rs[i].text, hvalues_rs[i].text, avalues_rs[i].text))

        return lst
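    except Exception:
        # On exception we dump the function's execution context.
        # soup does not exist yet if the download itself failed, so we guard for it.
        # The function then returns None instead of a list of statistics.
        with open("scrape_match_report.err", "w") as fhand:
            fhand.write("{}\n\n".format(url))
            if "soup" in locals():
                fhand.write(str(soup))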
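
The pairing of statistic names with their values is the least obvious part of the function above, so here is a small sketch of it on plain lists. `pair_statistics` and its `skip_index` parameter are hypothetical, and the cell contents are invented; the sketch only reproduces the index arithmetic used for the seasons from 2014-15 on, where one title has no value cells of its own.

```python
def pair_statistics(names, hvalues, avalues, skip_index=None):
    """Pair each statistic name with its home and away value.

    skip_index is the position of a name that has no value cells: that name is
    dropped and the names after it are paired with the values one position earlier.
    """
    paired = []
    for i, name in enumerate(names):
        if skip_index is not None and i == skip_index:
            continue
        j = i - 1 if skip_index is not None and i > skip_index else i
        paired.append((name, hvalues[j], avalues[j]))
    return paired


# Made-up cell contents, for illustration only
names = ["stat 1", "title without values", "stat 2", "stat 3"]
hvalues = ["10", "4", "55"]
avalues = ["8", "6", "45"]
print(pair_statistics(names, hvalues, avalues, skip_index=1))
```

With `skip_index=None` the helper falls back to the plain one-to-one pairing used for the older seasons.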

This function loops through a given match day of the current season, extracts the links to all the match reports and then calls the scrape_match_report function on each of them.

In [1]:
##################################################################################
# This function scrapes one single match day.
# As input it takes:
#
#  1. a list of stats
#  2. the url of the match day page
#
# It extracts all links to match reports within the page and processes them
# one at a time.
##################################################################################
def scrape_match_day(stats, url):
    try:
        print("\n\nGAMEDAY: Scraping url...", url)
        # Base domain variable
        domain = "http://www.legaseriea.it"    

        # Open the url, parse it and close the connection immediately
        page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, "html.parser")
        page.close()
        
        # Extract season information from the url
        season = re.findall(".*archivio/([0-9]+-[0-9]+)/UNICO.*", url)
        
        # Find all links to match-reports within current matchday
        # Since 2015-16 there's an additional match-program link that we
        # have to skip        
        matchreports = soup.find_all(class_ = "link-matchreport")
        
        for matchreport in matchreports:
            if (season[0] == "2015-16"):
                stats.append(scrape_match_report(domain + matchreport.find_all("a")[1]['href']))
            else:
                stats.append(scrape_match_report(domain + matchreport.a['href']))

    except Exception:
        # On exception we dump the function's execution context.
        # soup does not exist yet if the download itself failed, so we guard for it.
        with open("scrape_match_day.err", "w") as fhand:
            fhand.write("{}\n\n".format(url))
            if "soup" in locals():
                fhand.write(str(soup))
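
Both scraping functions so far read the season (and, for match reports, the match day) straight from the url. The sketch below isolates that step so it can be tried without touching the network. The two helper names are hypothetical; the patterns are the ones used above, except that they are applied with `re.search` and return `None` on a non-matching url instead of raising an `IndexError`.

```python
import re

# Patterns mirror the ones used in the scraping functions
MATCH_REPORT_RE = re.compile(r"match-report/([0-9]+-[0-9]+)/.*UNI/([0-9]+)/")
MATCH_DAY_RE = re.compile(r"archivio/([0-9]+-[0-9]+)/UNICO")


def parse_match_report_url(url):
    """Return (season, matchday) from a match report url, or None if it does not match."""
    found = MATCH_REPORT_RE.search(url)
    return found.groups() if found else None


def parse_match_day_url(url):
    """Return the season from a match day url, or None if it does not match."""
    found = MATCH_DAY_RE.search(url)
    return found.group(1) if found else None


print(parse_match_report_url(
    "http://www.legaseriea.it/it/serie-a-tim/match-report/1986-87/UNICO/UNI/1/AVEFIO"))
print(parse_match_day_url(
    "http://www.legaseriea.it/it/serie-a-tim/archivio/1986-87/UNICO/UNI/1"))
```

Both values come back as strings, which is also how the scraper stores them.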

This function loops through a given season, extracts the links to all the match days and then calls the scrape_match_day function on each of them.

In [9]:
##################################################################################
# This function scrapes one single season.
# As input it takes the url of the season page.
#
# It extracts all links to all match days within the page and processes them
# one at a time, then it returns the list of collected stats.
##################################################################################
def scrape_season(url):
    try:
        stats = []
        print("\n\n\n\n======\nSEASON\n======")
        # Base domain variable
        domain = "http://www.legaseriea.it"    

        # Open the url, parse it and close the connection immediately
        page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, "html.parser")
        page.close()

        # Find all links to all match days of current season.
        # scrape_match_day appends the match reports to stats in place.
        matchdays_first_half_season = soup.find_all(class_ = "box_Ngiornata_andata")
        for matchday in matchdays_first_half_season:
            scrape_match_day(stats, domain + matchday.a['href'])

        matchdays_second_half_season = soup.find_all(class_ = "box_Ngiornata_ritorno")
        for matchday in matchdays_second_half_season:
            scrape_match_day(stats, domain + matchday.a['href'])

        return stats
    except Exception:
        # On exception we dump the function's execution context.
        # soup does not exist yet if the download itself failed, so we guard for it.
        with open("scrape_season.err", "w") as fhand:
            fhand.write("{}\n\n".format(url))
            if "soup" in locals():
                fhand.write(str(soup))
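
The link extraction in the function above can be tried offline on a small HTML fragment. The fragment below is invented: only the two class names and the shape of the relative links are taken from this notebook, while the `div` tags and the specific match day numbers are assumptions made for illustration. `match_day_urls` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

DOMAIN = "http://www.legaseriea.it"

# Invented fragment: only the class names and the link shape follow the notebook
html = """
<div class="box_Ngiornata_andata"><a href="/it/serie-a-tim/archivio/1986-87/UNICO/UNI/1">1</a></div>
<div class="box_Ngiornata_andata"><a href="/it/serie-a-tim/archivio/1986-87/UNICO/UNI/2">2</a></div>
<div class="box_Ngiornata_ritorno"><a href="/it/serie-a-tim/archivio/1986-87/UNICO/UNI/16">16</a></div>
"""


def match_day_urls(soup):
    """Collect absolute match day urls, first half of the season before the second."""
    urls = []
    for css_class in ("box_Ngiornata_andata", "box_Ngiornata_ritorno"):
        for box in soup.find_all(class_=css_class):
            # box.a is the first <a> tag inside the box
            urls.append(DOMAIN + box.a["href"])
    return urls


soup = BeautifulSoup(html, "html.parser")
for url in match_day_urls(soup):
    print(url)
```

Collecting the urls first and scraping them afterwards keeps the parsing step separate from the network calls, which makes it easier to test.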

This function loops through the entire archive and does the following:

1. Extracts all the seasons contained in the archive.
2. For each one of these builds a link to the corresponding page.
3. Passes this link to the scrape_season function.
4. Calls the store_statistics function at the end of the season scraping process to dump the data to a csv file.

In [None]:
##################################################################################
# This function scrapes the entire archive.
# It takes no input: the url of the archive is hard-coded.
# It builds all links to all seasons within the page and processes them
# one at a time.
##################################################################################
def scrape_archive():
    try:
        # Open the url, parse it and close the connection immediately
        page = urllib.request.urlopen("http://www.legaseriea.it/it/serie-a-tim/archivio")
        soup = BeautifulSoup(page, "html.parser")
        page.close()    

        # Extract list of all available seasons and sort it ascending
        archivio_stagione_id = soup.find(id = "archivio_stagione_id")
        options = archivio_stagione_id.find_all("option")
        seasons = []
        for option in options:
            seasons.append(option['value'])        
        seasons.sort()

        # For each season call the scrape_season method
        for season in seasons:   
            url = "http://www.legaseriea.it/it/serie-a-tim/archivio/" + season            
            stats = scrape_season(url)
            store_statistics(season, stats)

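    except Exception:
        # On exception we dump the function's execution context.
        # url and soup may not exist yet, depending on where the failure occurred.
        context = locals()
        with open("scrape_archive.err", "w") as fhand:
            fhand.write("{}\n\n".format(context.get("url", "archive index page")))
            if "soup" in context:
                fhand.write(str(context["soup"]))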
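
All four scraping functions end with the same kind of error dump, so it could be factored into one shared helper. The following is a sketch of that idea rather than what the notebook does: `dump_error_context` is a hypothetical function, and writing the traceback next to the url is an addition of mine to make the cause of the failure visible.

```python
import traceback


def dump_error_context(func_name, url=None, soup=None):
    """Hypothetical helper: write url, traceback and page source to <func_name>.err."""
    fname = func_name + ".err"
    with open(fname, "w", encoding="utf-8") as fhand:
        fhand.write("{}\n\n".format(url))
        # format_exc() describes the exception that is currently being handled
        fhand.write(traceback.format_exc())
        if soup is not None:
            fhand.write("\n" + str(soup))
    return fname
```

It is meant to be called from inside an `except Exception:` block, for example `dump_error_context("scrape_season", url, soup)`, with `soup` initialized to `None` before the `try` so that it is always defined.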

### Execution

Once everything is set, the fun begins: simply run the scrape_archive function, sit back, relax and enjoy the scraping... :-).

In [None]:
scrape_archive()





SEASON


GAMEDAY: Scraping url... http://www.legaseriea.it/it/serie-a-tim/archivio/1986-87/UNICO/UNI/1
MATCH REPORT: Scraping url... http://www.legaseriea.it/it/serie-a-tim/match-report/1986-87/UNICO/UNI/1/AVEFIO
MATCH REPORT: Scraping url... http://www.legaseriea.it/it/serie-a-tim/match-report/1986-87/UNICO/UNI/1/BRENAP
MATCH REPORT: Scraping url... http://www.legaseriea.it/it/serie-a-tim/match-report/1986-87/UNICO/UNI/1/EMPINT
MATCH REPORT: Scraping url... http://www.legaseriea.it/it/serie-a-tim/match-report/1986-87/UNICO/UNI/1/MILASC
MATCH REPORT: Scraping url... http://www.legaseriea.it/it/serie-a-tim/match-report/1986-87/UNICO/UNI/1/ROMCOM
MATCH REPORT: Scraping url... http://www.legaseriea.it/it/serie-a-tim/match-report/1986-87/UNICO/UNI/1/SAMATA
MATCH REPORT: Scraping url... http://www.legaseriea.it/it/serie-a-tim/match-report/1986-87/UNICO/UNI/1/TORVER
MATCH REPORT: Scraping url... http://www.legaseriea.it/it/serie-a-tim/match-report/1986-87/UNICO/UNI/1/UDIJUV


GAMEDAY: Sc