# Web Scraping Notebook

## Scraping the transfermarkt.com Website
This notebook scrapes [transfermarkt.com](https://www.transfermarkt.com/) for penalty shootouts in major tournaments. The site was chosen because its database includes player profiles (age, team, market value, etc.) for each season.

**Major tournaments:**
- **European Championship** *aka Euro Cup*
- **South American Championship** *aka Copa America*
- **Champions League/European Cup**
- **Europa League/UEFA Cup**

In [114]:
import requests
from bs4 import BeautifulSoup

In [41]:
def get_soup(url):
    '''
    Function to request a web page and return a Beautiful Soup object
    '''
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
    }
    
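    # Time out on slow responses and raise on HTTP errors rather than parsing an error page
    page = requests.get(url, headers=headers, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.content, 'html.parser')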
    
    return soup

## Game page

In [171]:
# Sample game page
url = 'https://www.transfermarkt.com/spielbericht/index/spielbericht/3605575'
soup = get_soup(url)

In [182]:
def get_match_data(soup):
    '''
    Function to retrieve general match data
    '''
    
    # Home and away teams
    home_team = soup.find(attrs={'class': 'sb-team sb-heim'}).find('img')['alt']
    away_team = soup.find(attrs={'class': 'sb-team sb-gast'}).find('img')['alt']

    # Stadium link
    stad = soup.find(attrs={'class': 'sb-zusatzinfos'}).find('a')
    stad_link = stad['href']

    # Stadium page soup
    base_url = 'https://www.transfermarkt.us'
    stad_soup = get_soup(base_url + stad_link)
    
    # Stadium home team
    stad_home = stad_soup.find(attrs={'itemprop': 'name'}).find('span').text
    
    # Stadium address
    tds = stad_soup.find_all('table')[1].find_all('td')[:4]
    address = [td.text.replace('\xa0', ' ') for td in tds]
    
    # Neutral venue: neither team is the stadium's home team or appears in its address
    if (home_team == stad_home) or (home_team in address):
        neutral = 'False'
        true_home = home_team
    elif (away_team == stad_home) or (away_team in address):
        neutral = 'False'
        true_home = away_team
    else:
        neutral = 'True'
        true_home = stad_home
        
    # Match date: the last 10 characters of the date link (YYYY-MM-DD)
    match_date = soup.find(attrs={'class': 'sb-spieldaten'}).find('a')['href'][-10:]
    
    return home_team, away_team, neutral, true_home, match_date

In [184]:
home_team, away_team, neutral, true_home, match_date = get_match_data(soup)

print('Home Team:', home_team)
print('Away Team:', away_team)
print('Neutral Venue:', neutral)
print('True Home:', true_home)
print('Match Date:', match_date)

Home Team: Italy
Away Team: England
Neutral Venue: False
True Home: England
Match Date: 2021-07-11
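
The neutral-venue rule inside `get_match_data` can be checked without any network requests by pulling it into a small pure function. The sketch below is illustrative only: `classify_venue` is a hypothetical helper name, it returns a boolean rather than the `'True'`/`'False'` strings used above, and the stadium values in the example call are made up rather than scraped.

```python
def classify_venue(home_team, away_team, stad_home, address):
    """Return (neutral, true_home), following the rule in get_match_data."""
    # Listed home team plays at its own stadium or in its own country/city
    if home_team == stad_home or home_team in address:
        return False, home_team
    # Listed away team is actually the one playing at home
    if away_team == stad_home or away_team in address:
        return False, away_team
    # Neither team is tied to the stadium, so the venue is neutral
    return True, stad_home


# Hypothetical stadium values, for illustration only (not scraped)
print(classify_venue('Italy', 'England', 'England', ['London', 'England']))
```

Keeping the rule separate from the scraping code makes it easier to try edge cases, such as a team name that appears in neither the stadium's home team nor its address.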


In [47]:
# Penalty shootout section of game page
shootout = soup.find(attrs={'id': 'sb-elfmeterscheissen'}).find('ul')

# Home and away shots
home = shootout.find_all(attrs={'class': 'sb-aktion-heim'})
away = shootout.find_all(attrs={'class': 'sb-aktion-gast'})

In [139]:
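def get_shot_data(shot):
    '''
    Function to retrieve the result and player name for a single penalty
    '''
    
    # The result is stored in the span's title, the taker in the image's title
    result = shot.find('span')['title']
    player = shot.find('img')['title']
    
    return result, player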
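
One possible way to combine the home and away lists into a single set of records is sketched below. It runs against a small hand-written HTML snippet, not the live site: `SAMPLE_HTML` is a hypothetical fragment that imitates only the tags and classes used above, the result labels and player names in it are placeholders, and `collect_shots` is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics only the tags and classes used above
SAMPLE_HTML = '''
<div id="sb-elfmeterscheissen">
  <ul>
    <li class="sb-aktion-heim"><span title="Goal"></span><img title="Player A"></li>
    <li class="sb-aktion-gast"><span title="Saved"></span><img title="Player B"></li>
    <li class="sb-aktion-heim"><span title="Missed"></span><img title="Player C"></li>
  </ul>
</div>
'''


def collect_shots(shootout):
    """Return one dict per penalty, tagged with the side that took it."""
    rows = []
    sides = (('home', 'sb-aktion-heim'), ('away', 'sb-aktion-gast'))
    for side, css_class in sides:
        for shot in shootout.find_all(attrs={'class': css_class}):
            rows.append({
                'side': side,
                'result': shot.find('span')['title'],
                'player': shot.find('img')['title'],
            })
    return rows


sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
sample_shootout = sample_soup.find(attrs={'id': 'sb-elfmeterscheissen'}).find('ul')
for row in collect_shots(sample_shootout):
    print(row)
```

Because the rows are grouped by side, all home penalties come before the away ones. If the original kick order matters, it would have to be recovered from the order of the list items instead.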