<a href="https://colab.research.google.com/github/rodmart21/Sports_analytics/blob/main/Webscraping_football_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web scraping important football data from the most significant specialized websites.

## 1) Scraping from Transfermarkt. It is possible to get some interesting information.

### 1.1) Using the URL

Import all necessary libraries.

In [1]:
import re
import requests
from bs4 import BeautifulSoup

Every player has a specific ID on Transfermarkt that identifies him uniquely; it is the last segment of the profile URL.

In [7]:
url = "https://www.transfermarkt.us/erling-haaland/profil/spieler/418560"
player_id = url.split('/')[-1]
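Splitting on `/` works as long as the URL has no trailing slash or query string. A slightly more defensive sketch, assuming the ID is always the run of digits that follows `/spieler/` as in the URL above (the helper name `extract_player_id` is made up for this illustration):

```python
import re
from urllib.parse import urlparse


def extract_player_id(profile_url: str) -> str:
    """Return the numeric ID that follows '/spieler/' in a profile URL."""
    path = urlparse(profile_url).path  # drops any query string or fragment
    match = re.search(r"/spieler/(\d+)", path)
    if match is None:
        raise ValueError(f"No player ID found in {profile_url!r}")
    return match.group(1)


url = "https://www.transfermarkt.us/erling-haaland/profil/spieler/418560"
print(extract_player_id(url))  # → 418560
```

Raising an error on a URL without an ID makes a wrong link fail early instead of producing a bad request later.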

The request needs a User-Agent header. You can look up your own at https://www.whatismybrowser.com/detect/what-http-headers-is-my-browser-sending and copy the USER-AGENT value.


In [13]:
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36'}

In [14]:
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
# BeautifulSoup() parses the HTML content into a BeautifulSoup object, which makes it easier to navigate and analyze the HTML.

In [15]:
response.status_code    # This code means the response has been successful.

200
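Other codes can show up when a site rejects or short-circuits the request. A small sketch, using only the standard library, that turns a numeric code into its standard reason phrase (the helper names `describe_status` and `is_success` are hypothetical; `requests` also offers `response.raise_for_status()` to raise on 4xx and 5xx responses):

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the status code followed by its standard reason phrase."""
    return f"{code} {HTTPStatus(code).phrase}"


def is_success(code: int) -> bool:
    """True for any 2xx status code."""
    return 200 <= code < 300


for code in (200, 304, 403):
    print(describe_status(code), "| success:", is_success(code))
```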

To inspect the page structure, open the website in your browser, right-click and select "Inspect" from the menu. The page's HTML is visible there, so you can find the element and class behind every part of the website.

In [17]:
p=soup.select_one('h1[class="data-header__headline-wrapper"]')

In [18]:
print(p)

<h1 class="data-header__headline-wrapper">
<span class="data-header__shirt-number">
                        #9                    </span>
                                Erling <strong>Haaland</strong> </h1>


In [19]:
player_name = soup.select_one('h1[class="data-header__headline-wrapper"]').text.split('\n')[-1].strip()
player_number = soup.select_one('span[class="data-header__shirt-number"]').text.strip().replace('#', '')

In [36]:
print(player_name)

Erling Haaland
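Splitting on `'\n'` depends on the exact whitespace layout of the headline. A sketch of an alternative that normalizes whitespace first, assuming the shirt number always appears as `#` followed by digits at the start of the text (the helper `clean_headline` and the sample string, which mimics the text of the `<h1>` printed above, are illustrative):

```python
import re


def clean_headline(headline_text: str) -> str:
    """Collapse whitespace and drop a leading shirt number such as '#9'."""
    collapsed = " ".join(headline_text.split())
    return re.sub(r"^#\d+\s*", "", collapsed)


# Text content resembling the <h1> element shown earlier.
sample = "\n\n                        #9                    \n                                Erling Haaland "
print(clean_headline(sample))  # → Erling Haaland
```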


Transfermarkt is not the easiest website to scrape data from. Using regular expressions (RegEx), it is possible to find the specific text we are interested in.

In [37]:
a=re.search("Contract expires: .*__content\">(.*?)</span>",str(soup)).group(1)
print(a)

Jun 30, 2027
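The extracted value is still a string. A minimal sketch for turning it into a `date` object, assuming the site keeps the `Mon DD, YYYY` layout shown above and that Python runs with its default English month names (the function name `parse_contract_date` is hypothetical):

```python
from datetime import date, datetime


def parse_contract_date(text: str) -> date:
    """Parse a date written like 'Jun 30, 2027' (abbreviated English month)."""
    return datetime.strptime(text, "%b %d, %Y").date()


expiry = parse_contract_date("Jun 30, 2027")
print(expiry.isoformat())  # → 2027-06-30
```

A real `date` can then be compared or sorted, which a raw string cannot do reliably.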


Some other features can be extracted, such as the contract expiry date, the birthplace, the player agent and the height.

In [25]:
# Raw strings keep the regex escapes intact; [A-Za-z] matches letters only
# (the range [A-z] would also match characters such as '[', '\' and '_').
player_contract_expiry = re.search(r"Contract expires: (.*)", soup.text).group(1)
player_birthplace = re.search(r"Place of birth:.*?([A-Za-z].*?) ", soup.text, re.DOTALL).group(1).strip()
player_agent = re.search(r"Agent:.*?([A-Za-z].*?)\n", soup.text, re.DOTALL).group(1).strip()
player_height = re.search(r"Height:.*?([0-9].*?)\n", soup.text, re.DOTALL).group(1).strip()

In [26]:
player_height

'1,95 m'
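The height comes back as text with a decimal comma. A short sketch for converting it to a number, assuming the value always looks like `'1,95 m'` (the helper `parse_height_m` is introduced here for illustration):

```python
def parse_height_m(text: str) -> float:
    """Convert a string like '1,95 m' into a float number of metres."""
    number = text.replace("m", "").strip()  # "1,95"
    return float(number.replace(",", "."))  # decimal comma -> decimal point


print(parse_height_m("1,95 m"))  # → 1.95
```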

Let's extract the historical value of Haaland during his career.

In [28]:
response = requests.get(f'https://www.transfermarkt.es/ceapi/marketValueDevelopment/graph/{player_id}', headers=headers)

In [30]:
response.json()  # We get the list with all of his market values.

{'list': [{'x': 1482015600000,
   'y': 200000,
   'mw': '200 mil €',
   'datum_mw': '18/12/2016',
   'verein': 'Bryne FK',
   'age': '16',
   'wappen': 'https://tmssl.akamaized.net/images/wappen/profil/1057.png?lm=1480871779'},
  {'x': 1513983600000,
   'y': 300000,
   'mw': '300 mil €',
   'datum_mw': '23/12/2017',
   'verein': 'Molde FK',
   'age': '17',
   'wappen': 'https://tmssl.akamaized.net/images/wappen/profil/687.png?lm=1409159512'},
  {'x': 1536530400000,
   'y': 2000000,
   'mw': '2,00 mill. €',
   'datum_mw': '10/09/2018',
   'verein': 'Molde FK',
   'age': '18',
   'wappen': ''},
  {'x': 1546124400000,
   'y': 5000000,
   'mw': '5,00 mill. €',
   'datum_mw': '30/12/2018',
   'verein': 'Molde FK',
   'age': '18',
   'wappen': ''},
  {'x': 1559512800000,
   'y': 5000000,
   'mw': '5,00 mill. €',
   'datum_mw': '03/06/2019',
   'verein': 'Red Bull Salzburgo',
   'age': '18',
   'wappen': 'https://tmssl.akamaized.net/images/wappen/profil/409_1557781653.png?lm=1557781653'},
  {

In [31]:
response.json().keys()

dict_keys(['list', 'current', 'highest', 'highest_date', 'last_change', 'details_url', 'thread', 'translations'])
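The entries under `'list'` can be reshaped into simpler records. A sketch under a few assumptions: `datum_mw` is day/month/year as in the output above, `y` is the value in euros, and the output field names (`date`, `value_eur`, `club`, `age`) are chosen here for illustration. The sample holds the first two entries copied from the response, trimmed to the fields used.

```python
from datetime import datetime

# First two entries copied from the output above, trimmed to the fields used.
sample = {
    "list": [
        {"y": 200000, "datum_mw": "18/12/2016", "verein": "Bryne FK", "age": "16"},
        {"y": 300000, "datum_mw": "23/12/2017", "verein": "Molde FK", "age": "17"},
    ]
}


def tidy_market_values(payload: dict) -> list:
    """Turn the raw 'list' entries into records with typed fields."""
    rows = []
    for entry in payload["list"]:
        rows.append({
            "date": datetime.strptime(entry["datum_mw"], "%d/%m/%Y").date(),
            "value_eur": entry["y"],
            "club": entry["verein"],
            "age": int(entry["age"]),
        })
    return rows


rows = tidy_market_values(sample)
for row in rows:
    print(row["date"].isoformat(), row["club"], row["value_eur"])
```

With the real response, `tidy_market_values(response.json())` would be the equivalent call.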

### 1.2) Using API endpoints.

OK, now that we have some basic regex patterns, let's move on to using API endpoints with curlconverter.com. This method is developed in more depth in section 2.2.

In [35]:
api_endpoints = [
    f"marketValueDevelopment/graph/{player_id}",
    f"transferHistory/list/{player_id}",
    f"player/{player_id}/performance"
]

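api_data = {}
for endpoint in api_endpoints:
    # Keep each endpoint's parsed JSON instead of overwriting it on every iteration.
    api_data[endpoint] = requests.get(
        'https://www.transfermarkt.us/ceapi/' + endpoint,
        headers=headers
    ).json()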

## 2) Scraping data from SofaScore.

### 2.1) Using the URL.

We will scrape data from the Real Madrid vs Villarreal match played on May 19th.

In [40]:
# https://www.sofascore.com/es/real-madrid-villarreal/ugbsEgb   #id:11368619,tab:details

In [41]:
# response = requests.get(
#     'https://www.sofascore.com/arsenal-manchester-city/rsR#10385636',
#     headers={'User-Agent': 'Mozilla/5.0'} # you'll be blocked if you don't use some type of user agent
#)

In this case it was not necessary to pass our headers. The most important thing is to get a successful status code.

In [2]:
response = requests.get('https://www.sofascore.com/es/real-madrid-villarreal/ugbsEgb#id:11368619,tab:details')
response.status_code

200

The code soup = BeautifulSoup(response.text, 'html.parser') uses the BeautifulSoup library to parse the HTML content of an HTTP response.

In [3]:
soup = BeautifulSoup(response.text, 'html.parser')

This selects a part of the website that should return a list of the different elements and their nested attributes.

In [4]:
soup.select('g[cursor="pointer"]')

[]

This returns an empty list because the content is loaded by JavaScript after the initial page load, and `requests` only retrieves the initial HTML without executing any scripts.

### 2.2) Using the API.

Instead, we will call the APIs that the page itself uses to load that data.
Again, click on Inspect on the page we are interested in and then go to the `Network` tab of the developer tools, where we can see the API calls the page makes.


Steps:
1. Click on Network and refresh the page.
2. Find the API call `shotmap` in the network tab.
3. Right click and copy as a cURL (bash).
4. Go to curlconverter.com and paste the cURL.
5. Copy the Python code.

In [18]:
cookies = {
    '_gcl_au': '1.1.2081927633.1716149495',
    '_ga': 'GA1.1.1941725413.1716149497',
    'FCCDCF': '%5Bnull%2Cnull%2Cnull%2C%5B%22CP-3IoAP-3IoAEsACBENA1EoAP_gAEPgAA6II3gB5C5ETSFBYH51KIsUYAEHwAAAIsAgAAYBAQABQBKQAIQCAGAAEAhAhCACgAAAIEYBIAEACAAQAAAAAAAAIAAEIAAQAAAIICAAAAAAAABIAAAIAAAAEAAAwCAABAAA0AgEAJIISMgAAAAAAAAAAgAAAAAAAgAAAEhAAAEIAAAAACgAEABAEAAAAAEIABBII3gB5C5ETSFBYHhVIIMUIAERQAAAIsAgAAQBAQAAQBKQAIQCEGAAAAgAACAAAAAAIEQBIAEAAAgAAAAAAAAAIAAEAAAAAAAIICAAAAAAAABAAAAIAAAAAAAAwCAABAAAwQhEAJIASEgAAAAgAAAAAoAAAAAAAgAAAEhAAAEAAAAAAAAAEAAAEAAAAAAAABBIAAA.dnAACAgAAAA%22%2C%222~41.70.89.108.149.211.313.358.415.486.540.621.981.1029.1046.1092.1097.1126.1205.1301.1516.1558.1584.1598.1651.1697.1716.1753.1810.1832.1985.2328.2373.2440.2571.2572.2575.2577.2628.2642.2677.2767.2860.2878.2887.2922.3182.3190.3234.3290.3292.3331.10631~dv.%22%2C%22D59998C3-B901-4C02-BBB2-BDDFE43E2F71%22%5D%5D',
    '__qca': 'P0-256114773-1716149503180',
    'cto_bundle': 'FoUvXF82R2hvSGRab0ZDaDA2cVBJalB4dm8zNlpJb2VROWw4aW1SaUhSa1ZmQURlV3FuYkd0MGxLT096QlZ6U1pQNjZXS0RBbjRkU2xJeHpyUzBuaFhQYnRuaFBmamFTVHBTN2ZTd0lvTkp2eks5NFRzQ2MyY096aFRPUGgwYmxoRWd0dGU3Um1QZ3ViNnpGWTZZNmlncWw4V21xbXhPSzlJTXFmOWV4RXJvVjdLU1klM0Q',
    '__gads': 'ID=5d75f2defa542b67:T=1716149503:RT=1716151368:S=ALNI_MZw9Ix0nE-Z5ANRGCYWN9kD7mdITQ',
    '__gpi': 'UID=00000d77dc0a78c4:T=1716149503:RT=1716151368:S=ALNI_MYOwWy2DfRrGHLHTg67ZGbALi6DRw',
    '__eoi': 'ID=f8c69ad5d57acdd5:T=1716149504:RT=1716151368:S=AA-AfjbGVbAMiWc-xXla9msMNOPW',
    'FCNEC': '%5B%5B%22AKsRol8v_YosFsb-eEuh-Off_X3IiDVmt-cjmf-Kg37lmygsEGbrPYoY50QH3lcrTiu-R0I6uJ8cHB1uaxdu1rnIIQYUW4xbhbodJMZmXKNgDif0Mii9m2Nzkpg9OcBe9HQYJ414g5uoNcDvaXmMieRjOOFm2aRacw%3D%3D%22%5D%5D',
    '_ga_QH2YGS7BB4': 'GS1.1.1716149496.1.1.1716151386.0.0.0',
    '_ga_3KF4XTPHC4': 'GS1.1.1716149496.1.1.1716151386.57.0.0',
    '_ga_HNQ9P9MGZR': 'GS1.1.1716149498.1.1.1716151398.54.0.0',
}

headers = {
    'accept': '*/*',
    'accept-language': 'en-US,en;q=0.9',
    'cache-control': 'max-age=0',
    # 'cookie': '_gcl_au=1.1.2081927633.1716149495; _ga=GA1.1.1941725413.1716149497; FCCDCF=%5Bnull%2Cnull%2Cnull%2C%5B%22CP-3IoAP-3IoAEsACBENA1EoAP_gAEPgAA6II3gB5C5ETSFBYH51KIsUYAEHwAAAIsAgAAYBAQABQBKQAIQCAGAAEAhAhCACgAAAIEYBIAEACAAQAAAAAAAAIAAEIAAQAAAIICAAAAAAAABIAAAIAAAAEAAAwCAABAAA0AgEAJIISMgAAAAAAAAAAgAAAAAAAgAAAEhAAAEIAAAAACgAEABAEAAAAAEIABBII3gB5C5ETSFBYHhVIIMUIAERQAAAIsAgAAQBAQAAQBKQAIQCEGAAAAgAACAAAAAAIEQBIAEAAAgAAAAAAAAAIAAEAAAAAAAIICAAAAAAAABAAAAIAAAAAAAAwCAABAAAwQhEAJIASEgAAAAgAAAAAoAAAAAAAgAAAEhAAAEAAAAAAAAAEAAAEAAAAAAAABBIAAA.dnAACAgAAAA%22%2C%222~41.70.89.108.149.211.313.358.415.486.540.621.981.1029.1046.1092.1097.1126.1205.1301.1516.1558.1584.1598.1651.1697.1716.1753.1810.1832.1985.2328.2373.2440.2571.2572.2575.2577.2628.2642.2677.2767.2860.2878.2887.2922.3182.3190.3234.3290.3292.3331.10631~dv.%22%2C%22D59998C3-B901-4C02-BBB2-BDDFE43E2F71%22%5D%5D; __qca=P0-256114773-1716149503180; cto_bundle=FoUvXF82R2hvSGRab0ZDaDA2cVBJalB4dm8zNlpJb2VROWw4aW1SaUhSa1ZmQURlV3FuYkd0MGxLT096QlZ6U1pQNjZXS0RBbjRkU2xJeHpyUzBuaFhQYnRuaFBmamFTVHBTN2ZTd0lvTkp2eks5NFRzQ2MyY096aFRPUGgwYmxoRWd0dGU3Um1QZ3ViNnpGWTZZNmlncWw4V21xbXhPSzlJTXFmOWV4RXJvVjdLU1klM0Q; __gads=ID=5d75f2defa542b67:T=1716149503:RT=1716151368:S=ALNI_MZw9Ix0nE-Z5ANRGCYWN9kD7mdITQ; __gpi=UID=00000d77dc0a78c4:T=1716149503:RT=1716151368:S=ALNI_MYOwWy2DfRrGHLHTg67ZGbALi6DRw; __eoi=ID=f8c69ad5d57acdd5:T=1716149504:RT=1716151368:S=AA-AfjbGVbAMiWc-xXla9msMNOPW; FCNEC=%5B%5B%22AKsRol8v_YosFsb-eEuh-Off_X3IiDVmt-cjmf-Kg37lmygsEGbrPYoY50QH3lcrTiu-R0I6uJ8cHB1uaxdu1rnIIQYUW4xbhbodJMZmXKNgDif0Mii9m2Nzkpg9OcBe9HQYJ414g5uoNcDvaXmMieRjOOFm2aRacw%3D%3D%22%5D%5D; _ga_QH2YGS7BB4=GS1.1.1716149496.1.1.1716151386.0.0.0; _ga_3KF4XTPHC4=GS1.1.1716149496.1.1.1716151386.57.0.0; _ga_HNQ9P9MGZR=GS1.1.1716149498.1.1.1716151398.54.0.0',
    'if-none-match': 'W/"256303f320"',
    'priority': 'u=1, i',
    'referer': 'https://www.sofascore.com/es/real-madrid-villarreal/ugbsEgb',
    'sec-ch-ua': '"Chromium";v="124", "Google Chrome";v="124", "Not-A.Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'x-requested-with': '337e31',
}

response = requests.get('https://www.sofascore.com/api/v1/event/11368619/shotmap', cookies=cookies, headers=headers)
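The event ID in the API URL matches the `id` in the fragment of the match page URL used in section 2.1. A sketch that builds the shotmap URL from the page URL, assuming that pattern holds for other matches too (the names `API_TEMPLATE` and `shotmap_url` are hypothetical):

```python
import re

API_TEMPLATE = "https://www.sofascore.com/api/v1/event/{event_id}/shotmap"


def shotmap_url(match_url: str) -> str:
    """Build the shotmap API URL from a match page URL carrying '#id:<digits>'."""
    match = re.search(r"#id:(\d+)", match_url)
    if match is None:
        raise ValueError("No '#id:<digits>' fragment found in the URL")
    return API_TEMPLATE.format(event_id=match.group(1))


page = "https://www.sofascore.com/es/real-madrid-villarreal/ugbsEgb#id:11368619,tab:details"
print(shotmap_url(page))
```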

In [23]:
response.status_code  # We initially got a 304 here

200

At first it is normal to get a status code of 304 (Not Modified). If that is the case, apply:

In [25]:
headers['If-Modified-Since'] = 'Tue, 18 Jul 2023 00:00:00 GMT'  # HTTP dates use three-letter day names

In [24]:
response = requests.get('https://www.sofascore.com/api/v1/event/11368619/shotmap', headers=headers)
response

<Response [200]>
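A 304 is the server's answer to a conditional request, and the headers copied from the browser include `if-none-match`. An alternative sketch, not verified against SofaScore here, is to remove the conditional fields before sending the request (the helper `drop_conditional_headers` and the shortened `copied` dictionary are illustrative):

```python
CONDITIONAL_HEADERS = {"if-none-match", "if-modified-since"}


def drop_conditional_headers(headers: dict) -> dict:
    """Return a copy of the headers without the conditional-request fields."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in CONDITIONAL_HEADERS
    }


# Shortened stand-in for the headers dictionary generated by curlconverter.
copied = {
    "accept": "*/*",
    "if-none-match": 'W/"256303f320"',
    "user-agent": "Mozilla/5.0",
}
cleaned = drop_conditional_headers(copied)
print(cleaned)
```

The cleaned dictionary could then be passed as `headers=cleaned` in the `requests.get` call.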

In [26]:
response.json()

{'shotmap': [{'player': {'name': 'Bertrand Traoré',
    'firstName': '',
    'lastName': '',
    'slug': 'bertrand-traore',
    'shortName': 'B. Traoré',
    'position': 'M',
    'jerseyNumber': '25',
    'userCount': 4965,
    'id': 218160,
    'fieldTranslations': {'nameTranslation': {'ar': 'برتران تراوري إيزيدور'},
     'shortNameTranslation': {'ar': 'ب. ت. إيزيدور'}}},
   'isHome': True,
   'shotType': 'save',
   'situation': 'assisted',
   'playerCoordinates': {'x': 9.5, 'y': 69.3, 'z': 0},
   'bodyPart': 'left-foot',
   'goalMouthLocation': 'low-right',
   'goalMouthCoordinates': {'x': 0, 'y': 46, 'z': 3.8},
   'blockCoordinates': {'x': 1.5, 'y': 54.2, 'z': 0},
   'xg': 0.046590391546488,
   'xgot': 0.1969,
   'id': 3262267,
   'time': 90,
   'timeSeconds': 5365,
   'draw': {'start': {'x': 69.3, 'y': 9.5},
    'block': {'x': 54.2, 'y': 1.5},
    'end': {'x': 54, 'y': 0},
    'goal': {'x': 54, 'y': 96.2}},
   'reversedPeriodTime': 1,
   'reversedPeriodTimeSeconds': 35,
   'inciden

Now we can see all the shots taken in that match. The same approach can be applied to any other use case you may want to develop.
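As one possible next step, the nested shot records can be flattened into simple rows. A sketch using a single shot trimmed to a few fields copied from the response above; the output field names and the helper `flatten_shots` are choices made for this illustration, and treating `xg` as possibly missing is an assumption.

```python
# One shot, trimmed to a few fields copied from the response above.
sample = {
    "shotmap": [
        {
            "player": {"name": "Bertrand Traoré"},
            "isHome": True,
            "shotType": "save",
            "bodyPart": "left-foot",
            "xg": 0.046590391546488,
            "time": 90,
        }
    ]
}


def flatten_shots(payload: dict) -> list:
    """Turn each nested shot record into a flat dictionary."""
    rows = []
    for shot in payload["shotmap"]:
        rows.append({
            "player": shot["player"]["name"],
            "is_home": shot["isHome"],
            "shot_type": shot["shotType"],
            "body_part": shot["bodyPart"],
            "xg": shot.get("xg"),  # assumed optional, so fall back to None
            "minute": shot["time"],
        })
    return rows


rows = flatten_shots(sample)
total_xg = sum(row["xg"] or 0.0 for row in rows)
print(rows[0]["player"], round(total_xg, 3))
```

With the real response, `flatten_shots(response.json())` would produce one row per shot in the match.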