# Extract Game Ids

This notebook extracts the game ids for a given round of a season by inspecting web pages of the form:

```
https://nbl.com.au/schedule?round=<ROUND-ID>&season=<SEASON-ID>
```

For example, this is the pre-season for 2022-2023:

https://nbl.com.au/schedule?round=PS&season=34173

Those pages expose game links of the form `https://nbl.com.au/games/<GAME-ID>`, but only after JavaScript has run. So we need a real browser, driven headlessly (with no visible window), to load the page and let its scripts execute before we read the HTML. We do this with the `selenium` module, which provides drivers for browsers. Here is [an explanation](https://stackoverflow.com/questions/11047348/is-this-possible-to-load-the-page-after-the-javascript-execute-using-python) of how to load a page after JavaScript has executed.

**Note:** the original page source, before JavaScript runs, also exposes the game ids in structures of the form `matchId:<GAME-ID>`, but it lists all the games of the season, without filtering by round.
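
To make the two id shapes concrete, here is a minimal sketch that applies one regular expression per shape. The two sample strings are hand-written assumptions that only mimic the forms described above; the real page markup may differ.

```python
import re

# Hypothetical fragments, written by hand for illustration only.
rendered_html = (
    '<a href="https://nbl.com.au/games/2120048">Game</a> '
    '<a href="/games/2120049">Game</a>'
)
raw_source = 'matches:[{matchId:2120048},{matchId:2116379}]'

# After JavaScript has run: ids appear inside /games/<GAME-ID> links.
rendered_ids = sorted(set(re.findall(r'/games/(\d+)', rendered_html)))

# Before JavaScript has run: ids appear as matchId:<GAME-ID>.
raw_ids = sorted(set(re.findall(r'matchId:(\d+)', raw_source)))

print("From rendered links:", rendered_ids)
print("From raw source:", raw_ids)
```

Both patterns capture only the digits, so the ids come back as strings.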


In [9]:
import re

# Download geckodriver (https://github.com/mozilla/geckodriver/releases) and put it on your PATH
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

In [10]:
# def get_game_ids(season, round) -> list:
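round = "PS"    # round id ("PS" is the pre-season); note that this name shadows the built-in round()
season = 34173  # season id as it appears in the site's URLs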

url = f"https://nbl.com.au/schedule?round={round}&season={season}"

print("Link to inspect:", url)

Link to inspect: https://nbl.com.au/schedule?round=PS&season=34173


In [11]:
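# We need an actual browser so that the JavaScript runs and the links https://.../games/<game_id> are generated
options = Options()
# Run Firefox without a visible window; the `options.headless` setter is deprecated in newer Selenium 4 releases
options.add_argument("-headless")
browser = webdriver.Firefox(options=options)
# If geckodriver is not on the PATH, older Selenium versions accept its location explicitly:
# browser = webdriver.Firefox(options=options, executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')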

browser.get(url)
html_text = browser.page_source
browser.quit()

Now find all the game ids and pair each one with its round.

In [12]:
game_ids = set(re.findall(r'/games/(\d+)', html_text))

print("Number of games extracted: ", len(game_ids))

# Pair each id with the round number; non-numeric rounds such as "PS" are recorded as 0
games = [(x, round if isinstance(round, int) else 0) for x in game_ids]
print(games)

Number of games extracted:  25
[('2122059', 0), ('2120077', 0), ('2120053', 0), ('2120079', 0), ('2120054', 0), ('2122060', 0), ('2141127', 0), ('2116579', 0), ('2135117', 0), ('2120050', 0), ('2141126', 0), ('2120049', 0), ('2135116', 0), ('2120048', 0), ('2120078', 0), ('2120056', 0), ('2134935', 0), ('2120055', 0), ('2120080', 0), ('2120058', 0), ('2116576', 0), ('2120052', 0), ('2120057', 0), ('2124207', 0), ('2120051', 0)]
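
The commented-out `def get_game_ids(season, round)` line above suggests these steps were meant to end up in a function. The following is one possible sketch of the pure parts (building the URL and parsing the HTML), with the browser step left out so it can run anywhere. The names `schedule_url`, `games_for_round`, `round_id` and `season_id` are hypothetical and not part of the original notebook; `round_id` is used instead of `round` to avoid shadowing the built-in.

```python
import re


def schedule_url(round_id, season_id):
    """Build the schedule URL for one round of one season."""
    return f"https://nbl.com.au/schedule?round={round_id}&season={season_id}"


def games_for_round(html_text, round_id):
    """Return sorted (game_id, round_number) pairs found in rendered HTML.

    Non-integer rounds such as "PS" are recorded as round 0,
    matching the convention used in the cell above.
    """
    game_ids = sorted(set(re.findall(r'/games/(\d+)', html_text)))
    round_number = round_id if isinstance(round_id, int) else 0
    return [(game_id, round_number) for game_id in game_ids]


# Usage with a hand-written HTML fragment (an assumption, not real page output):
sample_html = '<a href="/games/2120048"></a><a href="/games/2116579"></a>'
print(schedule_url("PS", 34173))
print(games_for_round(sample_html, "PS"))
```

Sorting the ids makes the output reproducible, whereas iterating over a `set` directly gives an arbitrary order.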


# Option 2: Get all the games with no browser

Here we don't use a web browser: we extract all the games directly from the original page source, from items of the form `matchId:<GAME-ID>`.

With this approach the page gives ALL the games of the season, so we don't know which round each one belongs to.
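
If per-round id sets are already available from the browser-based approach, the missing round information can be recovered by lookup. This is only a sketch of that idea, not something the original notebook does: `label_rounds` and `ids_by_round` are hypothetical names, and the sample mapping below is illustrative rather than real output.

```python
def label_rounds(season_ids, ids_by_round):
    """Map every season-wide game id to its round, or None when unknown.

    season_ids   -- iterable of game ids taken from the raw page source
    ids_by_round -- dict {round_number: set of game ids for that round}
    """
    round_of = {
        game_id: round_number
        for round_number, ids in ids_by_round.items()
        for game_id in ids
    }
    return {game_id: round_of.get(game_id) for game_id in sorted(season_ids)}


# Illustrative input: two ids known to be in round 0, one id with no round yet.
season_ids = {'2120048', '2120049', '2116433'}
ids_by_round = {0: {'2120048', '2120049'}}
print(label_rounds(season_ids, ids_by_round))
```

Ids that appear in no round set map to `None`, which makes it easy to see which rounds still need to be scraped.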

In [5]:
import re
import bs4
import requests
from bs4 import BeautifulSoup # https://stackabuse.com/guide-to-parsing-html-with-beautifulsoup-in-python/

In [8]:
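# `url` is the schedule URL built earlier; the raw source lists the whole season whatever the round
r = requests.get(url)
html_text = r.text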

game_ids = set(re.findall(r'matchId:(\d+)', html_text))

print("Number of games extracted: ", len(game_ids))

# The real round of each game is unknown here; with round = "PS" every game is tagged 0
games = [(x, round if isinstance(round, int) else 0) for x in game_ids]
print(games)

Number of games extracted:  165
[('2122059', 0), ('2116433', 0), ('2116506', 0), ('2116473', 0), ('2116459', 0), ('2116423', 0), ('2116466', 0), ('2120054', 0), ('2116517', 0), ('2116502', 0), ('2116472', 0), ('2120049', 0), ('2116492', 0), ('2116397', 0), ('2116468', 0), ('2116390', 0), ('2116429', 0), ('2116406', 0), ('2116418', 0), ('2116487', 0), ('2116437', 0), ('2116395', 0), ('2116464', 0), ('2116485', 0), ('2116512', 0), ('2116460', 0), ('2116453', 0), ('2116501', 0), ('2116426', 0), ('2116498', 0), ('2116470', 0), ('2116434', 0), ('2116507', 0), ('2116504', 0), ('2116488', 0), ('2116452', 0), ('2116481', 0), ('2116486', 0), ('2120077', 0), ('2116394', 0), ('2116510', 0), ('2120053', 0), ('2116467', 0), ('2116449', 0), ('2116409', 0), ('2116515', 0), ('2116448', 0), ('2116425', 0), ('2122060', 0), ('2116411', 0), ('2116579', 0), ('2116435', 0), ('2116443', 0), ('2120050', 0), ('2116379', 0), ('2116414', 0), ('2116450', 0), ('2116444', 0), ('2116440', 0), ('2120048', 0), ('21200