# Scraping data from fbref.com

The purpose of this notebook is to scrape football match and player data from [https://fbref.com](https://fbref.com). The data will be used to create a model that predicts certain events in a football match.

The scraping will be done in the following steps:
1. **Save the Bundesliga standings pages and every team's pages for the last 7 seasons**
   We save the raw webpages first so that each page is requested only once and the server is not spammed with requests. The data can then be extracted offline as often as needed. A further advantage is that the crawler can be left to run overnight with little chance of crashing.
2. **Scrape the data off the saved webpages**
   The data is stored in HTML tables. We iterate over the saved webpages and extract the data from those tables.
3. **Merge the data into one dataframe**
4. **Process the data further**

## 1. Scraping webpages

In [20]:
from time import sleep

import requests
from bs4 import BeautifulSoup
from requests import Response

In [14]:
base_url = 'https://fbref.com'
stats_href = '/en/comps/20/Bundesliga-Stats'

For every team, save the page that lists its matches in each of the past seasons. In addition, save the team's match-log page for each stat category (shooting, passing, etc.).

Sleep for 7 seconds after each request to avoid spamming the server.

In [17]:
seasons_to_scrape = 7
time_to_sleep = 7
current_stats_href = stats_href

def save_page(href) -> Response|None:
    # An empty href means a category link was not found; skip it rather
    # than requesting the bare base URL.
    if not href:
        return None
    try:
        url = f'{base_url}{href}'
        html = requests.get(url, headers={'User-agent': 'bot123'})
        # Build a flat file name from the URL: '/' -> '(' and ':' -> '_'.
        filename = url.replace('/', '(').replace(':', '_')
        with open(f'./data/webpages/{filename}', 'w') as f:
            f.write(html.text)
        print(f'✅ {filename}')
        sleep(time_to_sleep)
        return html
    except Exception as e:
        print(e)
        print(f'Failed to save {href}')
        return None

def get_category_href(soup, category):
    try:
        return soup.select(f'div.filter div a[href*="all_comps/{category}"]')[0]['href']
    except Exception as e:
        print(f'Failed to get href for {category}')
        print(e)
        return ''

stat_categories = ['shooting', 'keeper', 'passing', 'passing_types', 'gca', 'defense', 'possession', 'misc']

for season_no in range(seasons_to_scrape):
    stats_html = save_page(current_stats_href)
    if stats_html is None:
        # Without the standings page there are no team links and no
        # link to the previous season, so stop here.
        break
    stats_soup = BeautifulSoup(stats_html.text, 'html.parser')

    standings_table = stats_soup.select('table.stats_table')[0]
    teams_anchors = standings_table.select('tr td:nth-of-type(1) a')
    team_hrefs = [anchor["href"] for anchor in teams_anchors]

    for team_href in team_hrefs:
        team_html = save_page(team_href)
        if team_html is None:
            continue
        team_soup = BeautifulSoup(team_html.text, 'html.parser')

        # Save the match-log page of every stat category for this team.
        for category in stat_categories:
            save_page(get_category_href(team_soup, category))

    href_to_previous_season = stats_soup.select('div.prevnext a:-soup-contains("Previous Season")')[0]['href']
    current_stats_href = href_to_previous_season
    sleep(time_to_sleep)

✅ https_((fbref.com(en(comps(20(Bundesliga-Stats
✅ https_((fbref.com(en(squads(054efa67(Bayern-Munich-Stats
✅ https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(shooting(Bayern-Munich-Match-Logs-All-Competitions
✅ https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(keeper(Bayern-Munich-Match-Logs-All-Competitions
✅ https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(passing(Bayern-Munich-Match-Logs-All-Competitions
✅ https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(passing_types(Bayern-Munich-Match-Logs-All-Competitions
✅ https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(gca(Bayern-Munich-Match-Logs-All-Competitions
✅ https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(defense(Bayern-Munich-Match-Logs-All-Competitions
✅ https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(possession(Bayern-Munich-Match-Logs-All-Competitions
✅ https_((fbref.com(en(squads(054efa67(2023-2024(match
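
The file names in the log above come from replacing `/` with `(` and `:` with `_` in the URL. The sketch below isolates that mapping and uses it to check whether a page is already on disk, which would let an interrupted crawl resume without repeating requests. `url_to_filename` and `is_already_saved` are hypothetical helper names, not part of the notebook.

```python
from pathlib import Path

BASE_URL = 'https://fbref.com'
PAGES_DIR = Path('data/webpages')  # same folder the crawler writes to


def url_to_filename(url: str) -> str:
    """Mirror the crawler's naming: '/' becomes '(' and ':' becomes '_'."""
    return url.replace('/', '(').replace(':', '_')


def is_already_saved(href: str, pages_dir: Path = PAGES_DIR) -> bool:
    """Hypothetical helper: True if the page for this href is on disk."""
    return (pages_dir / url_to_filename(f'{BASE_URL}{href}')).is_file()


example = url_to_filename(f'{BASE_URL}/en/comps/20/Bundesliga-Stats')
print(example)
```

A crawler could call `is_already_saved(href)` before `save_page(href)` and skip the request when it returns `True`.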

## 2. Scraping data off the webpages

In [1]:
from os import listdir
import re
import pandas as pd
from tqdm.notebook import tqdm

Get a list of all the webpages we saved and check how many there are.

In [2]:
webpages = listdir('data/webpages')
len(webpages)

1142

Extract team stats pages. A team stats page is a page that contains the stats for a team for a given season. For example, Eintracht Frankfurt's stats for the 2019/2020 season. This page contains a table with all matches played by the team in the season, and links to the stats pages for each match. The stats are divided into categories, such as shooting, passing, etc. Each category has its own table.

We exclude the Bundesliga-Stats page because it contains the standings for the Bundesliga, which we don't need at the current stage.

In [3]:
team_stats_pattern = re.compile(r'^(?!.*Bundesliga-Stats$).*-Stats$')
team_stats_pages = list(filter(team_stats_pattern.match, webpages))
team_stats_pages[:5]

['https_((fbref.com(en(squads(f0ac8ee6(2020-2021(Eintracht-Frankfurt-Stats',
 'https_((fbref.com(en(squads(2818f8bc(2018-2019(Hertha-BSC-Stats',
 'https_((fbref.com(en(squads(32f3ee20(2021-2022(Monchengladbach-Stats',
 'https_((fbref.com(en(squads(62add3bf(2020-2021(Werder-Bremen-Stats',
 'https_((fbref.com(en(squads(60b5e41f(2018-2019(Hannover-96-Stats']
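
To see how the negative lookahead in the pattern behaves, here is a small check on three file names taken from the crawler log earlier in the notebook. It is only an illustration of the regular expression, run on a hand-picked sample.

```python
import re

team_stats_pattern = re.compile(r'^(?!.*Bundesliga-Stats$).*-Stats$')

sample_pages = [
    # League standings page: ends in 'Bundesliga-Stats', so the lookahead rejects it.
    'https_((fbref.com(en(comps(20(Bundesliga-Stats',
    # Team page: ends in '-Stats' but not in 'Bundesliga-Stats', so it is kept.
    'https_((fbref.com(en(squads(054efa67(Bayern-Munich-Stats',
    # Match-log page: does not end in '-Stats', so it is rejected.
    'https_((fbref.com(en(squads(054efa67(2023-2024(matchlogs(all_comps(shooting(Bayern-Munich-Match-Logs-All-Competitions',
]

team_pages = [page for page in sample_pages if team_stats_pattern.match(page)]
print(team_pages)
```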

Iterate over team stats pages and create a dataframe for each team for each season. The initial dataframe contains basic statistics but we merge it with all the other statistics tables to get a complete picture of the team's performance in the season.

In [4]:
from io import StringIO

all_matches = []

def get_stats_dataframe(team_stats_page, stats, table_caption):
    # Everything before the final '<Team>-Stats' segment is the prefix
    # used to find the team's match-log pages. Caveat: a page name without
    # a season segment yields a prefix that matches every season of that team.
    link_without_team = '('.join(team_stats_page.split('(')[:-1])
    try:
        stats_page = [page for page in webpages if link_without_team in page and f'({stats}(' in page][0]
        with open(f'data/webpages/{stats_page}', 'r') as stats_file:
            stats_file_contents = stats_file.read()
        # Wrap the HTML in StringIO: recent pandas versions deprecate
        # passing a literal HTML string to read_html.
        stats_df = pd.read_html(StringIO(stats_file_contents), match=table_caption)[0]
        # Blank out the 'For <team>' header group, then flatten the two header rows.
        stats_df = stats_df.rename(columns=lambda x: re.sub('^For .+','',x))
        stats_df.columns = stats_df.columns.map(' '.join)
        stats_df.rename(columns={stats_df.columns[0]: 'Date', stats_df.columns[1]: 'Time'}, inplace=True)
        # Prefix every column except the merge keys with the category name.
        stats_df.columns = ['Date', 'Time'] + [f'{stats} {column}' for column in stats_df.columns if column != 'Date' and column != 'Time']
        return stats_df
    except Exception as e:
        print(f'Failed to create {stats} dataframe for {team_stats_page}')
        print(e)
        return None


with tqdm(total=len(team_stats_pages)) as prog_bar:
    for team_stats_page in team_stats_pages:
        team_name = team_stats_page \
                        .split('(')[-1] \
                        .replace('-Stats', '') \
                        .replace('-', ' ')
        with open(f'data/webpages/{team_stats_page}', 'r') as team_stats_file:
            team_stats_file_contents = team_stats_file.read()

        team_scores_and_fixtures_df = pd.read_html(StringIO(team_stats_file_contents), match='Scores & Fixtures')[0]
        team_scores_and_fixtures_df['Team'] = team_name

        statistics_types = ['shooting', 'keeper', 'passing', 'passing_types', 'gca', 'defense', 'possession', 'misc']
        statistics_titles = ['Shooting', 'Goalkeeping', 'Passing', 'Pass Types', 'Goal and Shot Creation', 'Defensive Actions', 'Possession', 'Miscellaneous Stats']

        for stat_type, stat_title in zip(statistics_types, statistics_titles):
            stat_df = get_stats_dataframe(team_stats_page, stat_type, stat_title)
            if stat_df is None:
                # The failure was already reported; skip this category.
                continue
            team_scores_and_fixtures_df = team_scores_and_fixtures_df.merge(stat_df, on=['Date', 'Time'])

        all_matches.append(team_scores_and_fixtures_df)
        prog_bar.update(1)

  0%|          | 0/126 [00:00<?, ?it/s]
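
One property of the merge step is worth keeping in mind: `DataFrame.merge` performs an inner join by default, so a fixture with no row in a category table is dropped silently. The sketch below uses small made-up frames (the shot counts are invented for illustration) to show the effect, and how `how='left'` with `indicator=True` can reveal the unmatched fixtures.

```python
import pandas as pd

# Toy fixtures table with three matches.
fixtures = pd.DataFrame({
    'Date': ['2020-09-12', '2020-09-19', '2020-09-25'],
    'Time': ['15:30', '15:30', '20:30'],
    'Opponent': ['1860 Munich', 'Arminia', 'Hertha BSC'],
})

# Toy category table that lacks the first match (invented values).
shooting = pd.DataFrame({
    'Date': ['2020-09-19', '2020-09-25'],
    'Time': ['15:30', '20:30'],
    'shooting Standard Sh': [14, 9],
})

# Default inner join: only matches present in both tables survive.
inner = fixtures.merge(shooting, on=['Date', 'Time'])

# Left join with an indicator column to list fixtures without stats.
checked = fixtures.merge(shooting, on=['Date', 'Time'], how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']

print(len(inner), unmatched['Opponent'].tolist())
```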

## 3. Merging the data into one dataframe

In [32]:
all_matches_df = pd.concat(all_matches)
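
Note that `pd.concat` keeps each frame's own row labels, so index values repeat across teams (the combined sample further down shows label 0 twice). That is harmless here because the CSV is written with `index=False`. If unique labels were wanted, `ignore_index=True` would renumber the rows. A minimal sketch on two invented frames:

```python
import pandas as pd

# Two toy per-team frames, each with its own 0-based index.
team_a = pd.DataFrame({'team': ['A', 'A'], 'gf': [2, 1]})
team_b = pd.DataFrame({'team': ['B', 'B'], 'gf': [0, 3]})

kept = pd.concat([team_a, team_b])                           # labels repeat
renumbered = pd.concat([team_a, team_b], ignore_index=True)  # labels 0..n-1

print(kept.index.tolist(), renumbered.index.tolist())
```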

## 4. Processing the data further

Check the shape of the dataframe (rows, columns).

In [33]:
all_matches_df.shape

(4425, 239)

Inspect the columns to make sure there are no anomalies.

In [34]:
all_matches_df.columns.tolist()

['Date',
 'Time',
 'Comp',
 'Round',
 'Day',
 'Venue',
 'Result',
 'GF',
 'GA',
 'Opponent',
 'xG',
 'xGA',
 'Poss',
 'Attendance',
 'Captain',
 'Formation',
 'Referee',
 'Match Report',
 'Notes',
 'Team',
 'shooting  Comp',
 'shooting  Round',
 'shooting  Day',
 'shooting  Venue',
 'shooting  Result',
 'shooting  GF',
 'shooting  GA',
 'shooting  Opponent',
 'shooting Standard Gls',
 'shooting Standard Sh',
 'shooting Standard SoT',
 'shooting Standard SoT%',
 'shooting Standard G/Sh',
 'shooting Standard G/SoT',
 'shooting Standard Dist',
 'shooting Standard FK',
 'shooting Standard PK',
 'shooting Standard PKatt',
 'shooting Expected xG',
 'shooting Expected npxG',
 'shooting Expected npxG/Sh',
 'shooting Expected G-xG',
 'shooting Expected np:G-xG',
 'shooting Unnamed: 25_level_0 Match Report',
 'keeper  Comp',
 'keeper  Round',
 'keeper  Day',
 'keeper  Venue',
 'keeper  Result',
 'keeper  GF',
 'keeper  GA',
 'keeper  Opponent',
 'keeper Performance SoTA',
 'keeper Performance GA

There are numerous columns that have double spaces in their names. Replace them with single spaces.

In [35]:
all_matches_df.columns = all_matches_df.columns.str.replace('  ', ' ')
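
The double spaces most likely come from the header flattening in `get_stats_dataframe`: the `For <team>` group name is replaced by an empty string, and joining an empty string with the column name leaves a leading space. The sketch below reproduces this with plain tuples standing in for the two header rows; the header contents are an assumption based on the regular expression used in that function.

```python
import re

# Hypothetical two-level header, written as (group, name) pairs.
header = [
    ('For Bayern Munich', 'Comp'),
    ('Standard', 'Sh'),
    ('Expected', 'xG'),
]
stats = 'shooting'

# Blank out the 'For <team>' group, then join the two levels with a space.
flat = [' '.join((re.sub('^For .+', '', group), name)) for group, name in header]

# Prefix with the category name, as the notebook does.
prefixed = [f'{stats} {column}' for column in flat]

# Collapse the double space left behind by the empty group name.
cleaned = [column.replace('  ', ' ') for column in prefixed]

print(prefixed)
print(cleaned)
```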

Get rid of all columns that contain `Match Report` as they are not needed.

In [36]:
all_matches_df = all_matches_df.loc[:,~all_matches_df.columns.str.contains('Match Report')]

Convert all columns to lowercase for consistency.

In [37]:
all_matches_df.columns = all_matches_df.columns.str.lower()

Replace all spaces in column names with underscores for ease of parsing later when the dataframe is saved to a csv file.

In [38]:
all_matches_df.columns = all_matches_df.columns.str.replace(' ', '_')

Check the columns again to make sure everything is in order and look for further improvements.

In [39]:
all_matches_df.columns.tolist()

['date',
 'time',
 'comp',
 'round',
 'day',
 'venue',
 'result',
 'gf',
 'ga',
 'opponent',
 'xg',
 'xga',
 'poss',
 'attendance',
 'captain',
 'formation',
 'referee',
 'notes',
 'team',
 'shooting_comp',
 'shooting_round',
 'shooting_day',
 'shooting_venue',
 'shooting_result',
 'shooting_gf',
 'shooting_ga',
 'shooting_opponent',
 'shooting_standard_gls',
 'shooting_standard_sh',
 'shooting_standard_sot',
 'shooting_standard_sot%',
 'shooting_standard_g/sh',
 'shooting_standard_g/sot',
 'shooting_standard_dist',
 'shooting_standard_fk',
 'shooting_standard_pk',
 'shooting_standard_pkatt',
 'shooting_expected_xg',
 'shooting_expected_npxg',
 'shooting_expected_npxg/sh',
 'shooting_expected_g-xg',
 'shooting_expected_np:g-xg',
 'keeper_comp',
 'keeper_round',
 'keeper_day',
 'keeper_venue',
 'keeper_result',
 'keeper_gf',
 'keeper_ga',
 'keeper_opponent',
 'keeper_performance_sota',
 'keeper_performance_ga',
 'keeper_performance_saves',
 'keeper_performance_save%',
 'keeper_performan

Each category has several columns whose names contain `_unnamed:_<some number>_level_0`, for example:
```
 'passing_unnamed:_24_level_0_ast',
 'passing_unnamed:_25_level_0_xag',
 'passing_unnamed:_26_level_0_xa',
 'passing_unnamed:_27_level_0_kp',
 'passing_unnamed:_28_level_0_1/3',
 'passing_unnamed:_29_level_0_ppa',
 'passing_unnamed:_30_level_0_crspa',
 'passing_unnamed:_31_level_0_prgp',
```
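This happens because these columns have no group name in the first header row of the table, so pandas fills in a placeholder of the form `Unnamed: <number>_level_0`. Remove the placeholder from the column names.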
![no-name-in-first-row](./no-name-in-first-row.png)

In [40]:
all_matches_df.columns = all_matches_df.columns.str.replace(r'(.*)_unnamed:_\d+_level_0_(.*)', r'\1_\2', regex=True)
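
As a quick check of what this substitution does, the same regular expression can be applied to a few of the column names listed above with the standard `re` module. The sketch also verifies that the shortened names stay unique, which is an extra sanity check and not part of the notebook.

```python
import re

pattern = re.compile(r'(.*)_unnamed:_\d+_level_0_(.*)')

sample = [
    'passing_unnamed:_24_level_0_ast',
    'passing_unnamed:_28_level_0_1/3',
    'shooting_standard_sh',  # no placeholder, must stay unchanged
    'date',
]

# Keep the category prefix (\1) and the real column name (\2).
renamed = [pattern.sub(r'\1_\2', name) for name in sample]
still_unique = len(set(renamed)) == len(renamed)

print(renamed, still_unique)
```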

One final check of the column names. Everything looks good.

In [21]:
all_matches_df.columns.tolist()

['date',
 'time',
 'comp',
 'round',
 'day',
 'venue',
 'result',
 'gf',
 'ga',
 'opponent',
 'xg',
 'xga',
 'poss',
 'attendance',
 'captain',
 'formation',
 'referee',
 'notes',
 'team',
 'shooting_comp',
 'shooting_round',
 'shooting_day',
 'shooting_venue',
 'shooting_result',
 'shooting_gf',
 'shooting_ga',
 'shooting_opponent',
 'shooting_standard_gls',
 'shooting_standard_sh',
 'shooting_standard_sot',
 'shooting_standard_sot%',
 'shooting_standard_g/sh',
 'shooting_standard_g/sot',
 'shooting_standard_dist',
 'shooting_standard_fk',
 'shooting_standard_pk',
 'shooting_standard_pkatt',
 'shooting_expected_xg',
 'shooting_expected_npxg',
 'shooting_expected_npxg/sh',
 'shooting_expected_g-xg',
 'shooting_expected_np:g-xg',
 'keeper_comp',
 'keeper_round',
 'keeper_day',
 'keeper_venue',
 'keeper_result',
 'keeper_gf',
 'keeper_ga',
 'keeper_opponent',
 'keeper_performance_sota',
 'keeper_performance_ga',
 'keeper_performance_saves',
 'keeper_performance_save%',
 'keeper_performan

See the first and last ten rows of the dataframe and inspect the data.

In [41]:
all_matches_df.head(10)

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,misc_performance_crs,misc_performance_int,misc_performance_tklw,misc_performance_pkwon,misc_performance_pkcon,misc_performance_og,misc_performance_recov,misc_aerial_duels_won,misc_aerial_duels_lost,misc_aerial_duels_won%
0,2020-09-12,15:30,DFB-Pokal,First round,Sat,Away,W,2,1,1860 Munich,...,22,7,8,,,0,,,,
1,2020-09-19,15:30,Bundesliga,Matchweek 1,Sat,Home,D,1,1,Arminia,...,35,11,12,0.0,0.0,0,63.0,17.0,15.0,53.1
2,2020-09-25,20:30,Bundesliga,Matchweek 2,Fri,Away,W,3,1,Hertha BSC,...,15,8,12,1.0,0.0,1,57.0,23.0,12.0,65.7
3,2020-10-03,15:30,Bundesliga,Matchweek 3,Sat,Home,W,2,1,Hoffenheim,...,22,7,6,0.0,0.0,0,56.0,24.0,9.0,72.7
4,2020-10-18,15:30,Bundesliga,Matchweek 4,Sun,Away,D,1,1,Köln,...,24,7,10,1.0,0.0,0,65.0,36.0,32.0,52.9
5,2020-10-24,15:30,Bundesliga,Matchweek 5,Sat,Away,L,0,5,Bayern Munich,...,14,6,11,0.0,0.0,0,66.0,12.0,8.0,60.0
6,2020-10-31,15:30,Bundesliga,Matchweek 6,Sat,Home,D,1,1,Werder Bremen,...,35,11,8,0.0,0.0,0,59.0,14.0,19.0,42.4
7,2020-11-07,15:30,Bundesliga,Matchweek 7,Sat,Away,D,2,2,Stuttgart,...,29,12,13,0.0,1.0,0,55.0,23.0,9.0,71.9
8,2020-11-21,18:30,Bundesliga,Matchweek 8,Sat,Home,D,1,1,RB Leipzig,...,15,22,13,0.0,0.0,0,59.0,16.0,23.0,41.0
9,2020-11-28,15:30,Bundesliga,Matchweek 9,Sat,Away,D,3,3,Union Berlin,...,19,20,9,0.0,1.0,0,49.0,31.0,19.0,62.0


In [42]:
all_matches_df.tail(10)

Unnamed: 0,date,time,comp,round,day,venue,result,gf,ga,opponent,...,misc_performance_crs,misc_performance_int,misc_performance_tklw,misc_performance_pkwon,misc_performance_pkcon,misc_performance_og,misc_performance_recov,misc_aerial_duels_won,misc_aerial_duels_lost,misc_aerial_duels_won%
27,2022-03-06,17:30,Bundesliga,Matchweek 25,Sun,Home,L,0,1,Hoffenheim,...,25,7,6,0.0,0.0,0,53.0,19.0,17.0,52.8
28,2022-03-13,15:30,Bundesliga,Matchweek 26,Sun,Away,W,1,0,Leverkusen,...,9,20,8,0.0,0.0,0,59.0,19.0,17.0,52.8
29,2022-03-20,19:30,Bundesliga,Matchweek 27,Sun,Home,D,1,1,Dortmund,...,25,18,15,0.0,0.0,0,66.0,13.0,16.0,44.8
30,2022-04-01,20:30,Bundesliga,Matchweek 28,Fri,Away,L,0,1,Union Berlin,...,18,16,11,0.0,0.0,0,68.0,17.0,18.0,48.6
31,2022-04-09,15:30,Bundesliga,Matchweek 29,Sat,Home,W,3,2,Mainz 05,...,23,5,10,0.0,0.0,0,71.0,24.0,23.0,51.1
32,2022-04-16,18:30,Bundesliga,Matchweek 30,Sat,Away,W,3,1,M'Gladbach,...,23,12,12,0.0,0.0,0,63.0,16.0,14.0,53.3
33,2022-04-23,15:30,Bundesliga,Matchweek 31,Sat,Home,W,3,1,Arminia,...,22,6,12,0.0,0.0,1,63.0,20.0,18.0,52.6
34,2022-04-30,15:30,Bundesliga,Matchweek 32,Sat,Away,W,4,1,Augsburg,...,19,9,16,1.0,0.0,0,54.0,15.0,21.0,41.7
35,2022-05-07,15:30,Bundesliga,Matchweek 33,Sat,Home,L,0,1,Wolfsburg,...,49,11,9,0.0,0.0,0,57.0,18.0,25.0,41.9
36,2022-05-14,15:30,Bundesliga,Matchweek 34,Sat,Away,L,1,2,Stuttgart,...,35,2,14,0.0,1.0,0,52.0,7.0,24.0,22.6


From the above extraction, we can further clean the data:

The first observation is that `comp` and `shooting_comp` contain the same information, and every other category has its own `comp` column as well. Check all columns that contain `comp` in their name, then do the same for `round`, `day`, `venue`, `result`, `gf`, `ga` and `opponent`.

In [43]:
all_matches_df.filter(regex='.*comp.*').head(5)

Unnamed: 0,comp,shooting_comp,keeper_comp,passing_comp,passing_types_comp,gca_comp,defense_comp,possession_comp,misc_comp
0,DFB-Pokal,DFB-Pokal,DFB-Pokal,DFB-Pokal,DFB-Pokal,DFB-Pokal,DFB-Pokal,DFB-Pokal,DFB-Pokal
1,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga
2,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga
3,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga
4,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga,Bundesliga


In [44]:
all_matches_df.filter(regex='.*round.*').head(5)

Unnamed: 0,round,shooting_round,keeper_round,passing_round,passing_types_round,gca_round,defense_round,possession_round,misc_round
0,First round,First round,First round,First round,First round,First round,First round,First round,First round
1,Matchweek 1,Matchweek 1,Matchweek 1,Matchweek 1,Matchweek 1,Matchweek 1,Matchweek 1,Matchweek 1,Matchweek 1
2,Matchweek 2,Matchweek 2,Matchweek 2,Matchweek 2,Matchweek 2,Matchweek 2,Matchweek 2,Matchweek 2,Matchweek 2
3,Matchweek 3,Matchweek 3,Matchweek 3,Matchweek 3,Matchweek 3,Matchweek 3,Matchweek 3,Matchweek 3,Matchweek 3
4,Matchweek 4,Matchweek 4,Matchweek 4,Matchweek 4,Matchweek 4,Matchweek 4,Matchweek 4,Matchweek 4,Matchweek 4


In [45]:
all_matches_df.filter(regex='.*day.*').head(5)

Unnamed: 0,day,shooting_day,keeper_day,passing_day,passing_types_day,gca_day,defense_day,possession_day,misc_day
0,Sat,Sat,Sat,Sat,Sat,Sat,Sat,Sat,Sat
1,Sat,Sat,Sat,Sat,Sat,Sat,Sat,Sat,Sat
2,Fri,Fri,Fri,Fri,Fri,Fri,Fri,Fri,Fri
3,Sat,Sat,Sat,Sat,Sat,Sat,Sat,Sat,Sat
4,Sun,Sun,Sun,Sun,Sun,Sun,Sun,Sun,Sun


In [46]:
all_matches_df.filter(regex='.*venue.*').head(5)

Unnamed: 0,venue,shooting_venue,keeper_venue,passing_venue,passing_types_venue,gca_venue,defense_venue,possession_venue,misc_venue
0,Away,Away,Away,Away,Away,Away,Away,Away,Away
1,Home,Home,Home,Home,Home,Home,Home,Home,Home
2,Away,Away,Away,Away,Away,Away,Away,Away,Away
3,Home,Home,Home,Home,Home,Home,Home,Home,Home
4,Away,Away,Away,Away,Away,Away,Away,Away,Away


In [47]:
all_matches_df.filter(regex='.*result.*').head(5)

Unnamed: 0,result,shooting_result,keeper_result,passing_result,passing_types_result,gca_result,defense_result,possession_result,misc_result
0,W,W,W,W,W,W,W,W,W
1,D,D,D,D,D,D,D,D,D
2,W,W,W,W,W,W,W,W,W
3,W,W,W,W,W,W,W,W,W
4,D,D,D,D,D,D,D,D,D


In [48]:
all_matches_df.filter(regex='.*gf.*').head(5)

Unnamed: 0,gf,shooting_gf,keeper_gf,passing_gf,passing_types_gf,gca_gf,defense_gf,possession_gf,misc_gf
0,2,2,2,2,2,2,2,2,2
1,1,1,1,1,1,1,1,1,1
2,3,3,3,3,3,3,3,3,3
3,2,2,2,2,2,2,2,2,2
4,1,1,1,1,1,1,1,1,1


In [50]:
all_matches_df.filter(regex='.*ga.*').head(5)

Unnamed: 0,ga,xga,shooting_ga,keeper_ga,keeper_performance_ga,passing_ga,passing_types_ga,gca_ga,defense_ga,possession_ga,misc_ga
0,1,,1,1,1,1,1,1,1,1,1
1,1,0.8,1,1,1,1,1,1,1,1,1
2,1,1.2,1,1,1,1,1,1,1,1,1
3,1,1.4,1,1,1,1,1,1,1,1,1
4,1,1.4,1,1,1,1,1,1,1,1,1


In [52]:
all_matches_df.filter(regex='.*opponent.*').head(5)

Unnamed: 0,opponent,shooting_opponent,keeper_opponent,passing_opponent,passing_types_opponent,gca_opponent,defense_opponent,possession_opponent,misc_opponent
0,1860 Munich,1860 Munich,1860 Munich,1860 Munich,1860 Munich,1860 Munich,1860 Munich,1860 Munich,1860 Munich
1,Arminia,Arminia,Arminia,Arminia,Arminia,Arminia,Arminia,Arminia,Arminia
2,Hertha BSC,Hertha BSC,Hertha BSC,Hertha BSC,Hertha BSC,Hertha BSC,Hertha BSC,Hertha BSC,Hertha BSC
3,Hoffenheim,Hoffenheim,Hoffenheim,Hoffenheim,Hoffenheim,Hoffenheim,Hoffenheim,Hoffenheim,Hoffenheim
4,Köln,Köln,Köln,Köln,Köln,Köln,Köln,Köln,Köln


Drop all columns that contain the same information as other columns.

In [57]:
# Drop every column whose name contains one of these fragments.
duplicated_fragments = ['_comp', '_round', '_day', '_venue', '_result', '_gf', '_ga', '_opponent']
all_matches_df = all_matches_df.loc[:,~all_matches_df.columns.str.contains('|'.join(duplicated_fragments))]
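
Because `str.contains` matches anywhere in the name, the `_ga` filter also removes `keeper_performance_ga`, not just the per-category `ga` copies. In the `ga` preview above that column shows the same values as `ga`, so nothing appears to be lost, but whether the removal was intended is not stated. The sketch below contrasts the substring match with a stricter, anchored pattern; the anchored variant is an alternative for illustration, not what the notebook does.

```python
import pandas as pd

columns = pd.Index(['ga', 'xga', 'shooting_ga', 'keeper_ga', 'keeper_performance_ga'])

# Substring match, as used in the notebook.
substring_hits = columns[columns.str.contains('_ga')].tolist()

# Stricter alternative: only '<category>_ga' and nothing else.
categories = ['shooting', 'keeper', 'passing', 'passing_types',
              'gca', 'defense', 'possession', 'misc']
anchored = rf"^(?:{'|'.join(categories)})_ga$"
anchored_hits = columns[columns.str.contains(anchored)].tolist()

print(substring_hits)
print(anchored_hits)
```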

The columns above were simple to spot, but I assume there are others that contain the same information. The following approach, found at [https://stackoverflow.com/a/58002867/9553927](https://stackoverflow.com/a/58002867/9553927), compares every pair of columns and lists the pairs whose values are identical.

In [65]:
from itertools import combinations

list_of_equal_cols = [(i, j) for i,j in combinations(all_matches_df, 2) if all_matches_df[i].equals(all_matches_df[j])]
list_of_equal_cols

[('xg', 'shooting_expected_xg'),
 ('poss', 'possession_poss'),
 ('defense_tackles_tklw', 'misc_performance_tklw'),
 ('defense_int', 'misc_performance_int')]

Pick the second item from each tuple and drop it from the dataframe.

In [67]:
cols_to_drop = [item[1] for item in list_of_equal_cols]
all_matches_df = all_matches_df.drop(cols_to_drop, axis=1)
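
The pairwise comparison relies on `Series.equals`, which treats `NaN` values in the same positions as equal. That matters here because many rows contain missing values. A minimal sketch on an invented three-column frame shows the behaviour and the subsequent drop:

```python
from itertools import combinations

import pandas as pd

nan = float('nan')
df = pd.DataFrame({
    'a': [1.0, nan, 3.0],
    'b': [1.0, 2.0, 3.0],
    'c': [1.0, nan, 3.0],  # same values and same NaN position as 'a'
})

# Compare every pair of columns once.
equal_pairs = [(i, j) for i, j in combinations(df.columns, 2) if df[i].equals(df[j])]

# Keep the first column of each pair, drop the second.
deduplicated = df.drop(columns=[second for _, second in equal_pairs])

print(equal_pairs, deduplicated.columns.tolist())
```

The number of comparisons grows quadratically with the number of columns, which is acceptable for a few hundred columns.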

Remove the `notes` column as it doesn't provide any useful information.

In [73]:
all_matches_df = all_matches_df.drop('notes', axis=1)

Show 3 matches from the Bundesliga and 3 matches from other competitions, with a new first column that counts the `NaN` values in each row. Matches outside the Bundesliga have far more `NaN` values (115 to 116 per row in this sample, against 0 for the Bundesliga rows).

In [86]:
bundesliga_df = all_matches_df[all_matches_df['comp'] == 'Bundesliga'].head(3)
other_df = all_matches_df[all_matches_df['comp'] != 'Bundesliga'].head(3)

combined_df = pd.concat([bundesliga_df, other_df])
combined_df.insert(loc=0, column='NaNs', value=combined_df.isnull().sum(axis=1))
combined_df

Unnamed: 0,NaNs,date,time,comp,round,day,venue,result,gf,ga,...,misc_performance_fld,misc_performance_off,misc_performance_crs,misc_performance_pkwon,misc_performance_pkcon,misc_performance_og,misc_performance_recov,misc_aerial_duels_won,misc_aerial_duels_lost,misc_aerial_duels_won%
1,0,2020-09-19,15:30,Bundesliga,Matchweek 1,Sat,Home,D,1,1,...,11,2,35,0.0,0.0,0,63.0,17.0,15.0,53.1
2,0,2020-09-25,20:30,Bundesliga,Matchweek 2,Fri,Away,W,3,1,...,13,0,15,1.0,0.0,1,57.0,23.0,12.0,65.7
3,0,2020-10-03,15:30,Bundesliga,Matchweek 3,Sat,Home,W,2,1,...,7,9,22,0.0,0.0,0,56.0,24.0,9.0,72.7
0,116,2020-09-12,15:30,DFB-Pokal,First round,Sat,Away,W,2,1,...,13,6,22,,,0,,,,
16,116,2021-01-12,20:45,DFB-Pokal,Second round,Tue,Away,L,1,4,...,7,6,12,,,0,,,,
0,115,2018-08-20,18:30,DFB-Pokal,First round,Mon,Away,W,2,1,...,13,1,11,,,0,,,,
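
Six sample rows are only a spot check. One way to summarise the missing values over the whole dataframe would be to average the per-row `NaN` count for each competition. The sketch below does this on a tiny frame that reuses a few values from the rows shown earlier; the helper column name `nans` is an assumption.

```python
import pandas as pd

matches = pd.DataFrame({
    'comp': ['Bundesliga', 'Bundesliga', 'DFB-Pokal'],
    'gf': [1, 3, 2],
    'misc_performance_recov': [63.0, 57.0, None],
    'misc_aerial_duels_won': [17.0, 23.0, None],
})

# Count NaNs per row; to_numpy() assigns by position, so repeated
# index labels in the real dataframe would not cause alignment problems.
nan_counts = matches.isnull().sum(axis=1).to_numpy()

mean_nans_per_comp = (
    matches.assign(nans=nan_counts)
           .groupby('comp')['nans']
           .mean()
)

print(mean_nans_per_comp)
```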


Drop all rows whose comp isn't `Bundesliga`.

In [87]:
all_matches_df = all_matches_df[all_matches_df['comp'] == 'Bundesliga']

In [88]:
all_matches_df.shape

(3682, 160)

Save the dataframe to a csv file.

In [90]:
all_matches_df.to_csv('./data/all_bundesliga_matches.csv', index=False)