# Extract Game Ids

This notebook extracts the game ids for a given round of a given season by inspecting web pages of the form

```
https://nbl.com.au/schedule?round=<ROUND-ID>&season=<SEASON-ID>
```

For example, this is the pre-season for 2022-2023:

https://nbl.com.au/schedule?round=PS&season=34173
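
As a minimal sketch of the URL pattern above, the address can be assembled from its two parameters. The helper name `build_schedule_url` and the constant `BASE_URL` are hypothetical, introduced only for illustration; the assumption that a round can also be a plain number comes from how the notebook later treats integer rounds.

```python
from urllib.parse import urlencode

# Base address of the schedule page, taken from the URL pattern above
BASE_URL = "https://nbl.com.au/schedule"


def build_schedule_url(round_id, season_id):
    """Return the schedule URL for one round of one season (illustrative helper)."""
    # urlencode escapes the values and joins them as round=...&season=...
    query = urlencode({"round": round_id, "season": season_id})
    return f"{BASE_URL}?{query}"


print(build_schedule_url("PS", 34173))
```

Using `urlencode` rather than plain string formatting keeps the URL valid should a round id ever contain characters that need escaping.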

Those pages expose game links of the form `https://nbl.com.au/games/<GAME-ID>`, but only after JavaScript has run. So we need a real browser, driven headlessly (with no visible window), to load the page and let the scripts execute before we read the HTML. We do this with the `selenium` module, which provides drivers for browsers. Here is [an explanation](https://stackoverflow.com/questions/11047348/is-this-possible-to-load-the-page-after-the-javascript-execute-using-python) of how to load a page after JavaScript has executed.

**Note:** the original page source, before JavaScript runs, also exposes the game ids in structures of the form `matchId:<GAME-ID>`, but it lists every game of the season, with no filtering by round.
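
To make the difference between the two sources concrete, here is a small sketch that applies one regular expression to each kind of text. The two sample strings are invented for illustration and only mimic the forms described above; the real markup of the site may look different.

```python
import re

# Hypothetical fragments, made up for this example (not copied from the site)
rendered_html = (
    '<a href="https://nbl.com.au/games/1001">Game</a> '
    '<a href="https://nbl.com.au/games/1002">Game</a>'
)
raw_source = "{matchId:1001},{matchId:1002},{matchId:1003}"

# Links generated by JavaScript: only the games of the requested round
round_ids = sorted(set(re.findall(r"/games/(\d+)", rendered_html)))

# Ids embedded in the raw source: every game of the season
season_ids = sorted(set(re.findall(r"matchId:(\d+)", raw_source)))

print("From rendered links:", round_ids)
print("From raw source:", season_ids)
```

In this made-up sample the raw source contains one id more than the rendered links, which is the situation the note describes.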


In [None]:
import re

# Download geckodriver (https://github.com/mozilla/geckodriver/releases) and put it on your PATH
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

In [None]:
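# def get_game_ids(season, round) -> list:
round = "PS"    # round id ("PS" is the pre-season); note that this name shadows Python's built-in round()
season = 34173  # season id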

url = f"https://nbl.com.au/schedule?round={round}&season={season}"

print("Link to inspect:", url)

In [None]:
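# We need an actual browser so that the JavaScript runs and the links https://.../games/<game_id> are generated
options = Options()
# Run Firefox without a window; the older `options.headless = True` attribute is deprecated in Selenium 4 and removed in later releases
options.add_argument("-headless")
browser = webdriver.Firefox(options=options)
# If geckodriver is not on the PATH, pass its location explicitly (Selenium 4 style):
# from selenium.webdriver.firefox.service import Service
# browser = webdriver.Firefox(options=options, service=Service(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe'))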

try:
    browser.get(url)
    html_text = browser.page_source
finally:
    browser.quit()  # always close the browser, even if loading the page fails

Now find all the game ids and build a list of `(game_id, round)` pairs.

In [None]:
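# Each rendered link has the form /games/<GAME-ID>; the set removes duplicate ids
game_ids = set(re.findall(r'/games/(\d+)', html_text))

print("Number of games extracted:", len(game_ids))

# Pair each game id with its round number; a non-numeric round such as "PS" is stored as 0
games = [(x, round if isinstance(round, int) else 0) for x in game_ids]
print(games)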
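
The pairing step can also be written as a small function, which gives a stable ordering and avoids relying on the global `round` variable. This is only a sketch: the name `label_games` is hypothetical, and sorting by numeric id is a choice made here, not something the notebook requires.

```python
def label_games(game_ids, round_id):
    """Pair each game id with a numeric round, sorted by game id (illustrative helper)."""
    # Same convention as the notebook: non-numeric rounds such as "PS" map to 0
    round_number = round_id if isinstance(round_id, int) else 0
    # Deduplicate, then sort numerically so the output order is reproducible
    return [(game_id, round_number) for game_id in sorted(set(game_ids), key=int)]


print(label_games({"1002", "1001"}, "PS"))
```

Because a `set` has no guaranteed order, sorting makes the result identical from one run to the next.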

# Option 2: Get all the games with no browser

Here we don't use a web browser: we extract all the games directly from the original page source, from items of the form `matchId:<GAME-ID>`.

With this solution the page gives ALL the game ids of the season, so we don't know which round each game belongs to.

In [None]:
import re
import bs4
import requests
from bs4 import BeautifulSoup # https://stackabuse.com/guide-to-parsing-html-with-beautifulsoup-in-python/

In [None]:
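r = requests.get(url, timeout=30)
r.raise_for_status()  # stop on an HTTP error instead of parsing an error page
html_text = r.text

# The raw source embeds every game of the season as matchId:<GAME-ID>
game_ids = set(re.findall(r'matchId:(\d+)', html_text))

print("Number of games extracted:", len(game_ids))

# These ids cover the whole season, so the round value below is only a placeholder
games = [(x, round if isinstance(round, int) else 0) for x in game_ids]
print(games)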