# Scraping data from fbref.com

The purpose of this notebook is to scrape football match and player data from https://fbref.com. The data will be used to create a model that predicts certain events in a football match.

The scraping will be done in the following steps:
1. **Scrape the webpages with the standings of the Bundesliga for the last 7 seasons for each team**
   Saving the raw pages first means each page is requested from the server only once, so the parsing step can be rerun later without sending new requests. It also keeps the crawler simple enough to run overnight with a low chance of crashing.
2. **Scrape the data off the saved webpages**
   The data is stored in HTML tables. We iterate over the saved webpages and extract the data from those tables.
3. **Merge the data into one dataframe**
4. **Process the data further**
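
Because the pages are kept on disk, an interrupted crawl could in principle be resumed without requesting pages that were already saved. The following is a minimal sketch of such a check, not part of the notebook: `url_to_filename` mirrors the naming used in the crawler cell, while `is_already_saved` and its `directory` parameter are hypothetical helpers introduced here for illustration.

```python
from pathlib import Path


def url_to_filename(url: str) -> str:
    """Mirror the crawler's naming: '/' becomes '(' and ':' becomes '_'."""
    return url.replace("/", "(").replace(":", "_")


def is_already_saved(url: str, directory: Path) -> bool:
    """Return True when a non-empty copy of the page exists in `directory`."""
    path = directory / url_to_filename(url)
    return path.is_file() and path.stat().st_size > 0
```

A crawler could call `is_already_saved(url, Path('data/webpages'))` before each request and skip the download when it returns `True`.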

## 1. Scraping webpages

In [20]:
from time import sleep

import requests
from bs4 import BeautifulSoup
from requests import Response

In [14]:
base_url = 'https://fbref.com'
stats_href = '/en/comps/20/Bundesliga-Stats'

For each of the past seasons, save every team's page with its matches. Also save the team's match-log pages for each statistics category (shooting, passing, and so on).

Sleep for 7 seconds after each request to avoid spamming the server.
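
A fixed `sleep` after every request is the simplest way to pace a crawler. As an alternative sketch (not what the notebook uses), a small throttle can enforce a minimum gap between requests and only sleep for the time that is still missing. The `Throttle` class and its `min_interval` parameter are hypothetical names introduced for this example.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self) -> None:
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()
```

Calling `throttle.wait()` right before each `requests.get` would keep the 7-second spacing while not adding extra delay when parsing already took some of that time.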

In [17]:
seasons_to_scrape = 7
time_to_sleep = 7
current_stats_href = stats_href

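def save_page(href) -> Response|None:
    # An empty href means the category link was not found;
    # skip it instead of requesting the site root.
    if not href:
        return None
    try:
        url = f'{base_url}{href}'
        html = requests.get(url, headers={'User-agent': 'bot123'}, timeout=30)
        # Treat HTTP errors (e.g. 404, 429) as failures so error pages are not saved.
        html.raise_for_status()
        # Build a filesystem-safe file name from the URL.
        file_name = url.replace('/', '(').replace(':', '_')
        with open(f'./data/webpages/{file_name}', 'w', encoding='utf-8') as f:
            f.write(html.text)
        print(f'✅ {file_name}')
        return html
    except Exception as e:
        print(e)
        print(f'Failed to save {href}')
        return None
    finally:
        # Pause after every request, including failed ones.
        sleep(time_to_sleep)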

def get_category_href(soup, category):
    try:
        return soup.select(f'div.filter div a[href*="all_comps/{category}"]')[0]['href']
    except Exception as e:
        print(f'Failed to get href for {category}')
        print(e)
        return ''

categories = ['shooting', 'keeper', 'passing', 'passing_types', 'gca', 'defense', 'possession', 'misc']

for season_no in range(seasons_to_scrape):
    stats_html = save_page(current_stats_href)
    if stats_html is None:
        # Without the standings page we cannot find the teams or the previous season.
        break
    stats_soup = BeautifulSoup(stats_html.text, 'html.parser')

    # The first stats table is the league standings; its first cell holds the team link.
    standings_table = stats_soup.select('table.stats_table')[0]
    teams_anchors = standings_table.select('tr td:nth-of-type(1) a')
    team_hrefs = [anchor["href"] for anchor in teams_anchors]

    for team_href in team_hrefs:
        team_html = save_page(team_href)
        if team_html is None:
            continue
        team_soup = BeautifulSoup(team_html.text, 'html.parser')

        # Save the match-log page of every statistics category for this team.
        for category in categories:
            save_page(get_category_href(team_soup, category))

    href_to_previous_season = stats_soup.select('div.prevnext a:-soup-contains("Previous Season")')[0]['href']
    current_stats_href = href_to_previous_season
    sleep(time_to_sleep)

✅ https_((fbref.com(en(comps(20(Bundesliga-Stats
✅ https_((fbref.com(en(squads(054efa67(Bayern-Munich-Stats
✅ https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(shooting(Bayern-Munich-Match-Logs-All-Competitions
✅ https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(keeper(Bayern-Munich-Match-Logs-All-Competitions
✅ https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(passing(Bayern-Munich-Match-Logs-All-Competitions
✅ https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(passing_types(Bayern-Munich-Match-Logs-All-Competitions
✅ https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(gca(Bayern-Munich-Match-Logs-All-Competitions
✅ https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(defense(Bayern-Munich-Match-Logs-All-Competitions
✅ https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(possession(Bayern-Munich-Match-Logs-All-Competitions
✅ https_((fbref.com(en(squads(054efa67(2023-2024(match
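
The saved file names are the page URLs with `/` replaced by `(` and `:` replaced by `_`. The sketch below shows how a file name could be turned back into its URL. `filename_to_url` is a hypothetical helper, and it assumes the original URL contains no `(` and that the only `:` is the one after the scheme.

```python
def filename_to_url(name: str) -> str:
    """Reverse the crawler's file naming scheme.

    Only the first '_' stands for ':' (right after 'https'); later
    underscores, as in 'all_comps' or 'passing_types', are part of the URL.
    """
    return name.replace("(", "/").replace("_", ":", 1)


saved_name = "https_((fbref.com(en(comps(20(Bundesliga-Stats"
print(filename_to_url(saved_name))
```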

## 2. Scraping data off the webpages

In [2]:
from os import listdir
import re
import pandas as pd
from tqdm.notebook import tqdm

Get a list of all the webpages we saved and check how many there are.

In [3]:
webpages = listdir('data/webpages')
len(webpages)

1142

Extract the team stats pages. A team stats page contains the stats for one team in one season, for example Eintracht Frankfurt's stats for the 2019/2020 season. It contains a table with all matches played by the team in that season and links to the stats pages for each match. The stats are divided into categories, such as shooting and passing, and each category has its own table.

We exclude the Bundesliga-Stats page because it contains the standings for the Bundesliga, which we don't need at the current stage.

In [4]:
team_stats_pattern = re.compile(r'^(?!.*Bundesliga-Stats$).*-Stats$')
team_stats_pages = list(filter(team_stats_pattern.match, webpages))
team_stats_pages[:5]

['https_((fbref.com(en(squads(f0ac8ee6(2020-2021(Eintracht-Frankfurt-Stats',
 'https_((fbref.com(en(squads(2818f8bc(2018-2019(Hertha-BSC-Stats',
 'https_((fbref.com(en(squads(32f3ee20(2021-2022(Monchengladbach-Stats',
 'https_((fbref.com(en(squads(62add3bf(2020-2021(Werder-Bremen-Stats',
 'https_((fbref.com(en(squads(60b5e41f(2018-2019(Hannover-96-Stats']
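
To see how the pattern behaves, the sketch below applies it to three file names taken from the crawler output above. The negative lookahead `(?!.*Bundesliga-Stats$)` rejects the league page, and the trailing `-Stats$` rejects the match-log pages.

```python
import re

team_stats_pattern = re.compile(r'^(?!.*Bundesliga-Stats$).*-Stats$')

sample_names = [
    # League standings page: rejected by the negative lookahead.
    'https_((fbref.com(en(comps(20(Bundesliga-Stats',
    # Team stats page: kept.
    'https_((fbref.com(en(squads(054efa67(Bayern-Munich-Stats',
    # Match-log page: rejected because it does not end in '-Stats'.
    'https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(shooting(Bayern-Munich-Match-Logs-All-Competitions',
]

kept = [name for name in sample_names if team_stats_pattern.match(name)]
print(kept)
```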

Iterate over team stats pages and create a dataframe for each team for each season. The initial dataframe contains basic statistics but we merge it with all the other statistics tables to get a complete picture of the team's performance in the season.

In [17]:
all_matches = []

def get_stats_dataframe(team_stats_page, stats, table_caption):
    # Drop the trailing '<Team>-Stats' part so the prefix also matches the match-log pages.
    link_without_team = '('.join(team_stats_page.split('(')[:-1])
    try:
        stats_page = [page for page in webpages if link_without_team in page and f'({stats}(' in page][0]
        with open(f'data/webpages/{stats_page}', 'r') as stats_file:
            stats_df = pd.read_html(stats_file, match=table_caption)[0]
        # Blank out the 'For <team>' top-level header, then flatten the two header levels.
        stats_df = stats_df.rename(columns=lambda x: re.sub('^For .+','',x))
        stats_df.columns = stats_df.columns.map(' '.join)
        # The first two columns are the merge keys; prefix the rest with the category name.
        stats_df.columns = ['Date', 'Time'] + [f'{stats} {column}' for column in stats_df.columns[2:]]
        return stats_df
    except Exception as e:
        print(f'Failed to create {stats} dataframe for {team_stats_page}')
        print(e)
        return None
        

statistics_types = ['shooting', 'keeper', 'passing', 'passing_types', 'gca', 'defense', 'possession', 'misc']
statistics_titles = ['Shooting', 'Goalkeeping', 'Passing', 'Pass Types', 'Goal and Shot Creation', 'Defensive Actions', 'Possession', 'Miscellaneous Stats']

with tqdm(total=len(team_stats_pages)) as prog_bar:
    for team_stats_page in team_stats_pages:
        # 'https_((...(Eintracht-Frankfurt-Stats' -> 'Eintracht Frankfurt'
        team_name = team_stats_page \
                        .split('(')[-1] \
                        .replace('-Stats', '') \
                        .replace('-', ' ')
        with open(f'data/webpages/{team_stats_page}', 'r') as team_stats_file:
            team_scores_and_fixtures_df = pd.read_html(team_stats_file, match='Scores & Fixtures')[0]
        team_scores_and_fixtures_df['Team'] = team_name

        for stat_type, stat_title in zip(statistics_types, statistics_titles):
            stat_df = get_stats_dataframe(team_stats_page, stat_type, stat_title)
            if stat_df is None:
                continue  # the helper has already reported the failure
            team_scores_and_fixtures_df = team_scores_and_fixtures_df.merge(stat_df, on=['Date', 'Time'])

        all_matches.append(team_scores_and_fixtures_df)
        prog_bar.update(1)

  0%|          | 0/126 [00:00<?, ?it/s]
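
The merge on `Date` and `Time` can be illustrated with two toy tables. The values below are invented for the example and are not fbref data. Note that `DataFrame.merge` performs an inner join by default, so a fixture without a matching row in the stats table is dropped; `how='left'` would keep it with missing values instead.

```python
import pandas as pd

# Toy fixtures table (invented values).
fixtures = pd.DataFrame({
    'Date': ['2000-01-01', '2000-01-08'],
    'Time': ['15:30', '18:30'],
    'Result': ['W', 'D'],
})

# Toy shooting table that only covers the first match.
shooting = pd.DataFrame({
    'Date': ['2000-01-01'],
    'Time': ['15:30'],
    'shooting Standard Sh': [12],
})

inner = fixtures.merge(shooting, on=['Date', 'Time'])             # default: inner join
left = fixtures.merge(shooting, on=['Date', 'Time'], how='left')  # keeps all fixtures

print(len(inner), len(left))
```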

## 3. Merging the data into one dataframe

In [18]:
all_matches_df = pd.concat(all_matches)
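
One detail worth knowing about `pd.concat`: by default it keeps each input frame's own index, so the combined frame contains repeated index labels. The toy sketch below (invented values) shows the difference that `ignore_index=True` makes; whether the notebook needs a fresh index depends on how the frame is used later.

```python
import pandas as pd

team_a = pd.DataFrame({'Team': ['A', 'A'], 'GF': [1, 2]})
team_b = pd.DataFrame({'Team': ['B'], 'GF': [0]})

stacked = pd.concat([team_a, team_b])                       # index: 0, 1, 0
reindexed = pd.concat([team_a, team_b], ignore_index=True)  # index: 0, 1, 2

print(stacked.index.tolist(), reindexed.index.tolist())
```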

## 4. Processing the data further

Check the shape of the dataframe (rows, columns).

In [20]:
all_matches_df.shape

(4425, 239)

Inspect the columns to make sure there are no anomalies.

In [35]:
all_matches_df.columns.tolist()

['Date',
 'Time',
 'Comp',
 'Round',
 'Day',
 'Venue',
 'Result',
 'GF',
 'GA',
 'Opponent',
 'xG',
 'xGA',
 'Poss',
 'Attendance',
 'Captain',
 'Formation',
 'Referee',
 'Match Report',
 'Notes',
 'Team',
 'shooting  Comp',
 'shooting  Round',
 'shooting  Day',
 'shooting  Venue',
 'shooting  Result',
 'shooting  GF',
 'shooting  GA',
 'shooting  Opponent',
 'shooting Standard Gls',
 'shooting Standard Sh',
 'shooting Standard SoT',
 'shooting Standard SoT%',
 'shooting Standard G/Sh',
 'shooting Standard G/SoT',
 'shooting Standard Dist',
 'shooting Standard FK',
 'shooting Standard PK',
 'shooting Standard PKatt',
 'shooting Expected xG',
 'shooting Expected npxG',
 'shooting Expected npxG/Sh',
 'shooting Expected G-xG',
 'shooting Expected np:G-xG',
 'shooting Unnamed: 25_level_0 Match Report',
 'keeper  Comp',
 'keeper  Round',
 'keeper  Day',
 'keeper  Venue',
 'keeper  Result',
 'keeper  GF',
 'keeper  GA',
 'keeper  Opponent',
 'keeper Performance SoTA',
 'keeper Performance GA

There are numerous columns that have double spaces in their names. Replace them with single spaces.

In [36]:
all_matches_df.columns = all_matches_df.columns.str.replace('  ', ' ')
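
The double spaces come from the way the two header levels were flattened in step 2: when the top-level header is empty, joining it to the second level with a space leaves a leading space, and the category prefix adds another one. A minimal sketch with a toy two-level header (invented values):

```python
import pandas as pd

# Toy table with a two-level header; the first top-level label is empty.
columns = pd.MultiIndex.from_tuples([('', 'Comp'), ('Standard', 'Sh')])
df = pd.DataFrame([['Bundesliga', 12]], columns=columns)

flat = df.columns.map(' '.join)                   # ' Comp', 'Standard Sh'
prefixed = [f'shooting {name}' for name in flat]  # 'shooting  Comp' has two spaces
cleaned = pd.Index(prefixed).str.replace('  ', ' ', regex=False)

print(prefixed)
print(cleaned.tolist())
```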

Get rid of all columns that contain `Match Report` as they are not needed.

In [38]:
all_matches_df = all_matches_df.loc[:,~all_matches_df.columns.str.contains('Match Report')]
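
The same boolean-mask technique on a toy frame, using column names that appear in the listing above. `str.contains` returns one boolean per column name, and `~` inverts the mask so that only the non-matching columns are kept.

```python
import pandas as pd

df = pd.DataFrame(columns=[
    'Date',
    'Match Report',
    'shooting Unnamed: 25_level_0 Match Report',
    'GF',
])

# True for every column whose name contains the literal text 'Match Report'.
mask = df.columns.str.contains('Match Report', regex=False)
trimmed = df.loc[:, ~mask]

print(trimmed.columns.tolist())
```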