# Scraping data from saved pages

Open the saved pages, extract the match data from them, and save it to a `.csv` file.

Imports.

In [2]:
from os import listdir
import re
import pandas as pd
from tqdm.notebook import tqdm

List all saved pages and display how many there are.

In [26]:
data_path = '../data'
pages_path = f'{data_path}/pages'
pages = listdir(pages_path)
print(f'Saved pages: {len(pages)}')

Saved pages: 1142
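The page names look like URLs with `:` replaced by `_` and `/` replaced by `(`. The saving step is not shown in this notebook, so that encoding is an assumption, and `page_name_to_url` is a hypothetical helper rather than part of the original code. A minimal sketch of undoing the substitution:

```python
def page_name_to_url(page_name):
    """Hypothetical helper: undo the assumed ':' -> '_' and '/' -> '(' encoding."""
    # Split only on the first '_((' so underscores later in the name
    # (e.g. 'all_comps') are left alone.
    scheme, separator, rest = page_name.partition('_((')
    if not separator:
        raise ValueError(f'Unexpected page name: {page_name}')
    return f'{scheme}://' + rest.replace('(', '/')


example = 'https_((fbref.com(en(squads(2818f8bc(2018-2019(Hertha-BSC-Stats'
print(page_name_to_url(example))
```

This can help when checking a saved file against the live page it came from.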


Keep only the stats pages, i.e. the pages whose names end in `-Stats`.

In [12]:
stats_page_pattern = re.compile(r'^.*-Stats$')
stats_pages = list(filter(stats_page_pattern.match, pages))
stats_pages[:7]

['https_((fbref.com(en(squads(f0ac8ee6(2020-2021(Eintracht-Frankfurt-Stats',
 'https_((fbref.com(en(squads(2818f8bc(2018-2019(Hertha-BSC-Stats',
 'https_((fbref.com(en(squads(32f3ee20(2021-2022(Monchengladbach-Stats',
 'https_((fbref.com(en(squads(62add3bf(2020-2021(Werder-Bremen-Stats',
 'https_((fbref.com(en(squads(60b5e41f(2018-2019(Hannover-96-Stats',
 'https_((fbref.com(en(comps(20(2017-2018(2017-2018-Bundesliga-Stats',
 'https_((fbref.com(en(squads(b1278397(2019-2020(Dusseldorf-Stats']
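To see what the pattern keeps and drops, it can be run on a few sample names. The three names below are taken from this notebook. For names without newlines, the pattern should behave like a plain `str.endswith('-Stats')` check, which the sketch also compares against.

```python
import re

stats_page_pattern = re.compile(r'^.*-Stats$')

sample_pages = [
    'https_((fbref.com(en(squads(2818f8bc(2018-2019(Hertha-BSC-Stats',
    'https_((fbref.com(en(comps(20(2017-2018(2017-2018-Bundesliga-Stats',
    'https_((fbref.com(en(squads(f0ac8ee6(2017-2018(matchlogs(all_comps(passing('
    'Eintracht-Frankfurt-Match-Logs-All-Competitions',
]

# Team and league stats pages match; the match-log page does not.
matched = [page for page in sample_pages if stats_page_pattern.match(page)]
same_as_endswith = matched == [page for page in sample_pages if page.endswith('-Stats')]
print(len(matched), same_as_endswith)
```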

Remove the league-level `Bundesliga-Stats` pages (the `comps` entries), as they are not needed here. This way only team stats pages are left.

In [11]:
teams_stats_pages = [page for page in stats_pages if not page.endswith('Bundesliga-Stats')]
teams_stats_pages[:6]

['https_((fbref.com(en(squads(f0ac8ee6(2020-2021(Eintracht-Frankfurt-Stats',
 'https_((fbref.com(en(squads(2818f8bc(2018-2019(Hertha-BSC-Stats',
 'https_((fbref.com(en(squads(32f3ee20(2021-2022(Monchengladbach-Stats',
 'https_((fbref.com(en(squads(62add3bf(2020-2021(Werder-Bremen-Stats',
 'https_((fbref.com(en(squads(60b5e41f(2018-2019(Hannover-96-Stats',
 'https_((fbref.com(en(squads(b1278397(2019-2020(Dusseldorf-Stats']

Utility function that returns one statistics category for a team as a dataframe.

An example `stats_page` is `https_((fbref.com(en(squads(f0ac8ee6(2017-2018(matchlogs(all_comps(passing(Eintracht-Frankfurt-Match-Logs-All-Competitions`, which contains the passing statistics for Eintracht Frankfurt for the 2017-2018 season.
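The function finds the match-log page by combining the team and season prefix with the category name wrapped in parentheses. Here is a minimal sketch of that lookup on a small sample list. Only the `passing` match-log name comes from this notebook; the other three names are illustrative and follow the same pattern.

```python
sample_pages = [
    'https_((fbref.com(en(squads(f0ac8ee6(2017-2018(Eintracht-Frankfurt-Stats',
    'https_((fbref.com(en(squads(f0ac8ee6(2017-2018(matchlogs(all_comps(passing('
    'Eintracht-Frankfurt-Match-Logs-All-Competitions',
    'https_((fbref.com(en(squads(f0ac8ee6(2017-2018(matchlogs(all_comps(passing_types('
    'Eintracht-Frankfurt-Match-Logs-All-Competitions',
    'https_((fbref.com(en(squads(f0ac8ee6(2018-2019(matchlogs(all_comps(passing('
    'Eintracht-Frankfurt-Match-Logs-All-Competitions',
]

team_stats_page = sample_pages[0]
stat_href = 'passing'

# Drop the last segment (the team name) to get the team/season prefix.
link_without_team = '('.join(team_stats_page.split('(')[:-1])

# The surrounding parentheses stop 'passing' from also matching 'passing_types'.
matches = [page for page in sample_pages
           if link_without_team in page and f'({stat_href}(' in page]
print(link_without_team)
print(len(matches))
```

Only the 2017-2018 `passing` page should match: the `passing_types` page fails the parenthesised check, and the 2018-2019 page fails the prefix check.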


In [20]:
def get_stats_dataframe(team_stats_page, stat_href, stat_caption):
    # Drop the last segment (the team name) to get the team/season prefix
    link_without_team = '('.join(team_stats_page.split('(')[:-1])
    # Find the match-log page for this team, season and statistics category
    matching_pages = [page for page in pages
                      if link_without_team in page and f'({stat_href}(' in page]
    if not matching_pages:
        print(f'No saved {stat_href} page found for {team_stats_page}')
        return None
    stats_page = matching_pages[0]
    try:
        # Let read_html parse the open file; the file is closed afterwards
        with open(f'{pages_path}/{stats_page}', 'r') as stats_file:
            stats_df = pd.read_html(stats_file, match=stat_caption)[0]
        # Blank out the 'For <team name>' header as it is unique to each team
        stats_df = stats_df.rename(columns=lambda x: re.sub('^For .+', '', x))
        # Join the first two header rows
        stats_df.columns = stats_df.columns.map(' '.join)
        # The first two columns are the merge keys; prefix the rest with the category
        stats_df.columns = (['Date', 'Time']
                            + [f'{stat_href} {column}'
                               for column in stats_df.columns[2:]])
        return stats_df
    except Exception as e:
        print(f'Failed to create stats dataframe for {stats_page}')
        print(e)
        return None
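The header handling in `get_stats_dataframe` can be tried on a small hand-built frame without parsing any HTML. The two-level header and the values below are made up for illustration; they only mimic the shape assumed for the match-log tables.

```python
import re

import pandas as pd

# Made-up two-level header: a team-specific group and a shared 'Total' group.
columns = pd.MultiIndex.from_tuples([
    ('For Eintracht Frankfurt', 'Date'),
    ('For Eintracht Frankfurt', 'Time'),
    ('Total', 'Cmp'),
    ('Total', 'Att'),
])
df = pd.DataFrame([['2017-08-20', '15:30', 300, 400]], columns=columns)

# Blank out the team-specific label, then join both header levels.
df = df.rename(columns=lambda x: re.sub('^For .+', '', x))
df.columns = df.columns.map(' '.join)
flattened = list(df.columns)

# Name the key columns and prefix every other column with the category.
stat_href = 'passing'
df.columns = ['Date', 'Time'] + [f'{stat_href} {column}' for column in df.columns[2:]]
print(flattened)
print(list(df.columns))
```

Joining an empty first level with `' '` leaves a leading space (`' Date'`), which is why the key columns are renamed explicitly afterwards.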

Iterate over all team stats pages and extract the data from them into a list of dataframes.

In [21]:
all_matches = []
statistics = {
    'shooting': 'Shooting',
    'keeper': 'Goalkeeping',
    'passing': 'Passing',
    'passing_types': 'Pass Types',
    'gca': 'Goal and Shot Creation',
    'defense': 'Defensive Actions',
    'possession': 'Possession',
    'misc': 'Miscellaneous Stats'
}

for team_stats_page in tqdm(teams_stats_pages):
    # e.g. 'Hertha-BSC-Stats' -> 'Hertha BSC'
    team_name = (team_stats_page
                 .split('(')[-1]
                 .replace('-Stats', '')
                 .replace('-', ' '))
    # Let read_html parse the open file; the file is closed afterwards
    with open(f'{pages_path}/{team_stats_page}', 'r') as team_stats_file:
        team_scores_and_fixtures_df = pd.read_html(team_stats_file, match='Scores & Fixtures')[0]
    team_scores_and_fixtures_df['Team'] = team_name

    for stat_href, stat_caption in statistics.items():
        stat_df = get_stats_dataframe(team_stats_page, stat_href, stat_caption)
        if stat_df is None:
            # The helper already reported the failure; skip this category
            continue
        team_scores_and_fixtures_df = team_scores_and_fixtures_df.merge(stat_df, on=['Date', 'Time'])

    all_matches.append(team_scores_and_fixtures_df)
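`DataFrame.merge` defaults to an inner join. Any fixture whose `Date` and `Time` do not appear in a category table is therefore dropped from the result. The small sketch below shows the difference from a left join, using made-up rows.

```python
import pandas as pd

# Made-up fixtures; the middle one has no row in the stats table.
fixtures = pd.DataFrame({
    'Date': ['2017-08-20', '2017-08-26', '2017-09-09'],
    'Time': ['15:30', '15:30', '15:30'],
    'Result': ['D', 'L', 'W'],
})
passing = pd.DataFrame({
    'Date': ['2017-08-20', '2017-09-09'],
    'Time': ['15:30', '15:30'],
    'passing Total Cmp': [300, 350],
})

inner = fixtures.merge(passing, on=['Date', 'Time'])             # default how='inner'
left = fixtures.merge(passing, on=['Date', 'Time'], how='left')  # keeps every fixture
print(len(inner), len(left))
```

Whether the inner join is what you want depends on the data. It silently removes fixtures without statistics, while `how='left'` keeps them with missing values.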

Merge all dataframes into one.

In [22]:
all_matches_df = pd.concat(all_matches)
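By default, `pd.concat` keeps each frame's own row labels and takes the union of their columns. The sketch below uses two made-up frames to show both effects, and the optional `ignore_index=True` that renumbers the rows.

```python
import pandas as pd

# Made-up per-team frames; the second one has an extra column.
team_a = pd.DataFrame({'Team': ['A', 'A'], 'GF': [1, 2]})
team_b = pd.DataFrame({'Team': ['B'], 'GF': [0], 'Notes': ['x']})

default = pd.concat([team_a, team_b])
renumbered = pd.concat([team_a, team_b], ignore_index=True)

print(list(default.index))      # row labels repeat across teams
print(list(renumbered.index))   # rows are numbered consecutively
print(list(renumbered.columns))
```

The repeated labels do not affect a CSV written with `index=False`, but they matter if rows are later selected by label.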

Save the dataframe to a `.csv` file.

In [29]:
all_matches_df.to_csv(f'{data_path}/csv/raw/bundesliga_matches.csv', index=False)
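`to_csv` does not create missing directories, so the target folder has to exist before saving. The sketch below creates it first and reads the file back as a check. It uses a temporary directory and a one-row made-up frame in place of the real `data_path` and data.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Team': ['Hertha BSC'], 'GF': [2]})  # made-up row

with tempfile.TemporaryDirectory() as data_path:
    csv_dir = os.path.join(data_path, 'csv', 'raw')
    os.makedirs(csv_dir, exist_ok=True)  # no error if it already exists
    csv_path = os.path.join(csv_dir, 'bundesliga_matches.csv')
    df.to_csv(csv_path, index=False)
    round_trip = pd.read_csv(csv_path)

print(round_trip.to_dict('records'))
```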