# Extract Game Ids

This notebook extracts the game ids for a given round of a season by inspecting web pages of the form:

```
https://nbl.com.au/schedule?round=<ROUND-ID>&season=<SEASON-ID>
```

For example, this is the pre-season for 2022-2023:

https://nbl.com.au/schedule?round=PS&season=34173
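
As a minimal sketch of how such a URL can be assembled, the snippet below uses the standard library's `urlencode`. The helper name `schedule_url` and the constant `BASE_URL` are hypothetical and only used for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://nbl.com.au/schedule"  # taken from the URL form above


def schedule_url(round_id, season_id):
    """Build the schedule URL for one round of one season (hypothetical helper)."""
    # urlencode escapes the values and joins them as round=...&season=...
    query = urlencode({"round": round_id, "season": season_id})
    return f"{BASE_URL}?{query}"


print(schedule_url("PS", 34173))
# https://nbl.com.au/schedule?round=PS&season=34173
```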

Those pages expose game links of the form `https://nbl.com.au/games/<GAME-ID>`, but only after JavaScript has run. So we need a real browser, driven headlessly (with no visible window), to load the page and let the scripts execute before we read the HTML. We do this with the `selenium` module, which provides drivers for browsers. Here is [an explanation](https://stackoverflow.com/questions/11047348/is-this-possible-to-load-the-page-after-the-javascript-execute-using-python) of how to load a page after JavaScript has executed.

**Note:** the original page, before JavaScript runs, also exposes the game ids in structures of the form `matchId:<GAME-ID>`, but it lists all the games of the season, without filtering by round.
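
To make the difference between the two sources concrete, here is a small sketch that runs both regular expressions on made-up fragments. The sample strings and ids are assumptions for illustration only; the real markup of the site may look different.

```python
import re

# Made-up fragments for illustration only; the real markup may differ.
rendered_html = (
    '<a href="https://nbl.com.au/games/1001">Game A</a> '
    '<a href="https://nbl.com.au/games/1002">Game B</a>'
)
raw_html = "{matchId:1001,x:1},{matchId:1002,x:2},{matchId:1003,x:3}"

# After JavaScript: only the games of the requested round appear as links
round_ids = set(re.findall(r"/games/(\d+)", rendered_html))

# Before JavaScript: every game of the season appears as matchId:<GAME-ID>
season_ids = set(re.findall(r"matchId:(\d+)", raw_html))

print(sorted(round_ids))   # ['1001', '1002']
print(sorted(season_ids))  # ['1001', '1002', '1003']
```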


In [1]:
import re

# Download geckodriver (https://github.com/mozilla/geckodriver/releases) and add it to your PATH
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

In [2]:
# def get_game_ids(season, round) -> list:
ROUNDS = list(range(1, 22)) + ["SF", "F"]
SEASON=27725    # 2020-2021
# SEASON=30249    # 2021-2022

# Alternative: pre-season only
# ROUNDS = ["PS"]
# SEASON=34173    # 2022-2023

def get_url(season, round):
    return f"https://nbl.com.au/schedule?round={round}&season={season}"


print("Season", SEASON)
print("Rounds: ", ROUNDS)

Season 27725
Rounds:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 'SF', 'F']


In [3]:
# We need an actual browser so that the JavaScript runs and the links https://.../games/<game_id> are generated
options = Options()
# Run Firefox with no visible window (the `options.headless = True` setter was removed in recent Selenium 4 releases)
options.add_argument("-headless")
browser = webdriver.Firefox(options=options)
# browser = webdriver.Firefox(options=options, executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')

html_texts = []

for rno in ROUNDS:
    url = get_url(SEASON, rno)
    print(f"Extracting web HTML for round {rno} at ", url)
    browser.get(url)
    html_text = browser.page_source
    html_texts.append((rno, html_text))

browser.quit()

Extracting web HTML for round 1 at  https://nbl.com.au/schedule?round=1&season=27725
Extracting web HTML for round 2 at  https://nbl.com.au/schedule?round=2&season=27725
Extracting web HTML for round 3 at  https://nbl.com.au/schedule?round=3&season=27725
Extracting web HTML for round 4 at  https://nbl.com.au/schedule?round=4&season=27725
Extracting web HTML for round 5 at  https://nbl.com.au/schedule?round=5&season=27725
Extracting web HTML for round 6 at  https://nbl.com.au/schedule?round=6&season=27725
Extracting web HTML for round 7 at  https://nbl.com.au/schedule?round=7&season=27725
Extracting web HTML for round 8 at  https://nbl.com.au/schedule?round=8&season=27725
Extracting web HTML for round 9 at  https://nbl.com.au/schedule?round=9&season=27725
Extracting web HTML for round 10 at  https://nbl.com.au/schedule?round=10&season=27725
Extracting web HTML for round 11 at  https://nbl.com.au/schedule?round=11&season=27725
Extracting web HTML for round 12 at  https://nbl.com.au/sched

Now find all the game ids in each page and build a list of (game id, round) pairs.

In [4]:
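pattern = r'/games/(\d+)'

# Numeric codes for the rounds that have no number
MAP_ROUNDS = { "SF" : 100, "F" : 101, "PS" : 0}

games = []
for (rnd, html_text) in html_texts:
    # A set removes repeated links to the same game within a page
    game_ids = set(re.findall(pattern, html_text))
    print(f"Number of games extracted for round {rnd}: ", len(game_ids))

    # Store (game_id, round) pairs, mapping named rounds to their numeric code
    games.extend([(x, rnd if isinstance(rnd, int) else MAP_ROUNDS[rnd]) for x in game_ids])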

print("Total games:", len(games))
print(games)

Number of games extracted for round 1:  5
Number of games extracted for round 2:  6
Number of games extracted for round 3:  7
Number of games extracted for round 4:  6
Number of games extracted for round 5:  7
Number of games extracted for round 6:  8
Number of games extracted for round 7:  8
Number of games extracted for round 8:  10
Number of games extracted for round 9:  10
Number of games extracted for round 10:  7
Number of games extracted for round 11:  7
Number of games extracted for round 12:  7
Number of games extracted for round 13:  9
Number of games extracted for round 14:  8
Number of games extracted for round 15:  7
Number of games extracted for round 16:  8
Number of games extracted for round 17:  4
Number of games extracted for round 18:  10
Number of games extracted for round 19:  11
Number of games extracted for round 20:  8
Number of games extracted for round 21:  9
Number of games extracted for round SF:  6
Number of games extracted for round F:  3
Total games: 171
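
The flat `games` list of `(game_id, round)` pairs can be regrouped by round if that is more convenient downstream. The following is a minimal sketch on made-up ids; `group_by_round` and `sample_games` are hypothetical names, not part of the notebook.

```python
from collections import defaultdict

# Made-up sample with the same (game_id, round_code) shape as `games`
sample_games = [("1001", 1), ("1002", 1), ("1003", 2), ("1004", 100)]


def group_by_round(games):
    """Return {round_code: sorted list of game ids} (hypothetical helper)."""
    by_round = defaultdict(list)
    for game_id, round_code in games:
        by_round[round_code].append(game_id)
    # Sort rounds and ids so the result is stable across runs
    return {rnd: sorted(ids) for rnd, ids in sorted(by_round.items())}


grouped = group_by_round(sample_games)
print(grouped)
# {1: ['1001', '1002'], 2: ['1003'], 100: ['1004']}
```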


# Option 2: Get all the games with no browser

Here we don't use a web browser: we extract all the games directly from the original page source, from the items `matchId:<GAME-ID>`.

With this approach the page gives ALL the games of the season, so we don't know which round each one belongs to.
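
One possible use of the season-wide list is as a cross-check on a per-round extraction: any id that appears in the season but in no round points to a page that was missed. This is only a sketch on made-up ids; `per_round`, `season_ids` and `ids_without_round` are hypothetical names.

```python
# Made-up data: ids found per round, and ids found season-wide
per_round = {1: {"1001", "1002"}, 2: {"1003"}}
season_ids = {"1001", "1002", "1003", "1004"}


def ids_without_round(per_round, season_ids):
    """Return the season ids that were not seen in any round (hypothetical helper)."""
    seen = set().union(*per_round.values())
    return sorted(season_ids - seen)


print(ids_without_round(per_round, season_ids))  # ['1004']
```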

In [None]:
import re
import bs4
import requests
from bs4 import BeautifulSoup # https://stackabuse.com/guide-to-parsing-html-with-beautifulsoup-in-python/

In [None]:
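# Any schedule page of the season will do: the raw HTML lists every game of the season.
# get_url and SEASON are defined in the cells above.
url = get_url(SEASON, 1)
r = requests.get(url, timeout=30)
r.raise_for_status()
html_text = r.text

game_ids = set(re.findall(r'matchId:(\d+)', html_text))

print("Number of games extracted: ", len(game_ids))

# The round is unknown with this method, so every game gets 0 as a placeholder
games = [(x, 0) for x in game_ids]
print(games)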