# Step 1: Data Scraping #

In this first step, we scrape Bundesliga (BuLi) data from a popular German sports website. For legal reasons, I have removed all URLs pointing to the website from this notebook, but the general procedure should still be clear. We capture data starting with season 2013/2014, the first season that includes statistics on the distance covered by players during a game. The latest finished season we capture is 2019/2020.

In [1]:
%load_ext autoreload
%autoreload 2
import warnings
import dateparser
from pprint import pprint
import pandas as pd
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup
import requests
warnings.filterwarnings('ignore')

We define three functions:
- **get_match_data:** Appends the statistics of every game of one matchday to the nested list `data`. Requires three input values: (1) the year in YY format in which the season starts (e.g. 13 for season 2013/2014), (2) the year in YY format in which the season ends (e.g. 14 for season 2013/2014), and (3) the matchday (1 to 34 for BuLi).
- **get_links_matchday:** Returns the URLs of the statistics pages of all games of a matchday. Requires two input values: (1) the season in the format YY-YY and (2) the matchday for which the links should be returned.
- **get_game_stats:** Returns the game statistics for a single match. Requires three input values: (1) the URL of the statistics page, (2) the season in the format YY-YY, and (3) the matchday.
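
To make the season string concrete, here is a minimal sketch of how the label and the URL fragment are assembled. The helper names `season_label` and `season_url_part` are hypothetical and not part of the notebook; the string logic mirrors what get_match_data and get_links_matchday do in the cells below.

```python
def season_label(end_year):
    """Return the YY-YY label for the season ending in `end_year` (YY format)."""
    return str(end_year - 1) + '-' + str(end_year)


def season_url_part(label):
    """Prefix the century, as get_links_matchday does when it builds the URL."""
    return '20' + label


print(season_label(14))                   # 13-14
print(season_url_part(season_label(20)))  # 2019-20
```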

In [2]:
data = []  # nested list: one inner list per scraped game

In [3]:
def get_match_data(start_year, end_year, match_day):
    """Scrape all games of one matchday and append one row per game to `data`."""
    season = str(start_year) + '-' + str(end_year)  # e.g. '13-14'
    links = get_links_matchday(season, match_day)
    for link in links:
        game_stats = get_game_stats(link, season, match_day)
        data.append(game_stats)

In [4]:
def get_links_matchday(season, match_day):
    URL = 'https://www.website_removed_due_to_legal_issues.de/bundesliga/spieltag/20' + season + '/' + str(match_day)
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
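    # Each scoreboard anchor on the matchday page links to one game's analysis page
    results = soup.find_all('a', {'class': 'kick__v100-scoreBoard kick__v100-scoreBoard--standard'})
    links = []
    for elem in results:
        link = elem.get('href')
        # Point to the match-data ("spieldaten") page instead of the analysis page
        link_spieldaten = link.replace("analyse", "spieldaten")
        game_url = 'https://www.website_removed_due_to_legal_issues.de/' + link_spieldaten
        links.append(game_url)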
    return links

In [5]:
def get_game_stats(URL, season, match_day):
    page = requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    stats = []
    stats.append(season)
    stats.append(match_day)
    # Team names: home team first, away team second
    teams = soup.find_all('div', {'class': 'kick__v100-gameCell__team__name'})
    stats.append(teams[0].text)
    stats.append(teams[1].text)
    # Score values in page order (final and half-time goals)
    score = soup.find_all('div', {'class': 'kick__v100-scoreBoard__scoreHolder__score'})
    for goal in score:
        stats.append(goal.text)
    stats_home = soup.find_all('div', {'class': 'kick__stats-bar__value kick__stats-bar__value--opponent1'})
    stats_away = soup.find_all('div', {'class': 'kick__stats-bar__value kick__stats-bar__value--opponent2'})
    # Statistic bars 1 to 12 are stored, home value before away value; bar 0 is skipped
    for i in range(1, 13):
        stats.append(stats_home[i].text)
        stats.append(stats_away[i].text)
    return stats
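
get_game_stats assumes that every page provides two team names, the score values, and at least 13 statistic bars per side. A page that deviates from this would raise an IndexError or produce a row of the wrong length. The following is a minimal sketch of a length check, assuming the 32-column layout used for the data frame later in this notebook; the helper `is_complete_row` and the sample rows are hypothetical.

```python
EXPECTED_FIELDS = 32  # season + matchday + 2 teams + 4 scores + 12 * 2 statistics


def is_complete_row(row, expected=EXPECTED_FIELDS):
    """Return True if a scraped row has exactly the expected number of fields."""
    return len(row) == expected


# Made-up rows for illustration only (not real match data)
good_row = ['13-14', 1, 'Team A', 'Team B'] + ['0'] * 28
short_row = ['13-14', 1, 'Team A', 'Team B', '0', '0']

print(is_complete_row(good_row))   # True
print(is_complete_row(short_row))  # False
```

Rows that fail such a check could be logged and skipped instead of being appended to `data`.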

Now the actual scraping takes place. It takes approximately 45 minutes for the 7 seasons from 2013/2014 to 2019/2020.

In [None]:
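start_season = 14  # end year (YY) of the first season, 2013/2014
end_season = 21    # exclusive upper bound: the last season scraped is 2019/2020

for season in tqdm(range(start_season, end_season)):
    for matchday in tqdm(range(1, 35)):  # matchdays 1 to 34
        get_match_data(season - 1, season, matchday)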
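
As a rough plausibility check on the runtime, the number of page requests can be estimated. This sketch assumes 9 games per matchday (18 teams in the Bundesliga) and one request per matchday page plus one per game; the variable names are illustrative only.

```python
seasons = len(range(14, 21))   # 7 seasons: 2013/2014 to 2019/2020
matchdays = 34
games_per_matchday = 9         # assumption: 18 teams, every game listed

matchday_pages = seasons * matchdays               # one overview page per matchday
game_pages = matchday_pages * games_per_matchday   # one statistics page per game
total_requests = matchday_pages + game_pages

print(matchday_pages, game_pages, total_requests)  # 238 2142 2380
print(round(45 * 60 / total_requests, 2))          # 1.13
```

Under these assumptions, a 45-minute run corresponds to a little over one second per request.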

We then load the scraped rows into a data frame with one named column per value. The prefixes `h_` and `a_` mark the home and away team.

In [None]:
df_data = pd.DataFrame(data,columns=['season',
'matchday',
'h_team',
'a_team',
'h_goals',
'a_goals',
'h_ht_goals',
'a_ht_goals',
'h_shots_on_goal',
'a_shots_on_goal',
'h_distance',
'a_distance',
'h_total_passes',
'a_total_passes',
'h_success_passes',
'a_success_passes',
'h_failed_passes',
'a_failed_passes',
'h_pass_ratio',
'a_pass_ratio',
'h_possesion',
'a_possesion',
'h_tackle_ratio',
'a_tackle_ratio',
'h_fouls',
'a_fouls',
'h_got_fouled',
'a_got_fouled',
'h_offside',
'a_offside',
'h_corners',
'a_corners'])
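
The column order has to match the order in which get_game_stats appends values: home before away for each statistic. As a sketch, the same 32 names could be generated from the statistic names instead of being typed out. `build_columns` is a hypothetical helper, and every name keeps the exact spelling used in the list above so that later steps still find their columns.

```python
def build_columns():
    """Build the 32 column names in the order in which the values are scraped."""
    meta = ['season', 'matchday', 'h_team', 'a_team']
    stat_names = ['goals', 'ht_goals', 'shots_on_goal', 'distance', 'total_passes',
                  'success_passes', 'failed_passes', 'pass_ratio', 'possesion',
                  'tackle_ratio', 'fouls', 'got_fouled', 'offside', 'corners']
    columns = list(meta)
    for name in stat_names:
        columns.append('h_' + name)  # home value first
        columns.append('a_' + name)  # then the away value
    return columns


columns = build_columns()
print(len(columns))   # 32
print(columns[4:8])   # ['h_goals', 'a_goals', 'h_ht_goals', 'a_ht_goals']
```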

Finally, we store the data in CSV format for later use.

In [None]:
df_data.to_csv('./data/data_BuLi_13_20.csv', index=False) 
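
Note that `to_csv` does not create missing directories, so `./data` has to exist before the cell above runs. A minimal sketch that creates the parent folder first, using only the standard library; the helper name `ensure_parent_dir` is hypothetical.

```python
from pathlib import Path


def ensure_parent_dir(file_path):
    """Create the parent directory of `file_path` if it is missing; return the Path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


csv_path = ensure_parent_dir('./data/data_BuLi_13_20.csv')
# df_data.to_csv(csv_path, index=False)  # same call as in the cell above
```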