# Extract Game Ids

This notebook allows to extract the games id for a season in a round by inspecting the web pages of the form 

```
https://nbl.com.au/schedule?round=<ROUND-ID>&season=<SEASON-ID>
```

For example this is pre-season for 2022-2033:

https://nbl.com.au/schedule?round=PS&season=34173

Those pages expose game links of the form `https://nbl.com.au/games/<GAME-ID>`, but only after Javascript has run. So, we need to use a virtual webdriver to actually browse the page (in silent) after that. We do this with module `selenium` that provides drivers for browsers. Here is [an explanation](https://stackoverflow.com/questions/11047348/is-this-possible-to-load-the-page-after-the-javascript-execute-using-python) how to load a page after Javascript has executed.

**Note:** the original page, before Javascript, will also expose the game ids in structures of the form `matchId:<GAME-ID>`, but it will give all of them of the season, without filtering on the round.


## Option 1: Via Salenium virtual browser

In [1]:
import re

# Download geckodriver (https://github.com/mozilla/geckodriver/releases) and put it in path
# Salenium webdriver: https://www.selenium.dev/documentation/overview/
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

In [27]:
# def get_game_ids(season, round) -> list:
# ROUNDS = list(range(1, 22)) + ["SF", "F"]
# SEASON=27725    # 2020-2021
# SEASON=30249    # 2021-2022

# ROUNDS = list(range(1, 19)) + ["PS"] + ['PlayOff 1']
ROUNDS = ["F"]
# ROUNDS = ["PS"]
# ROUNDS = ['PlayOff%201']

SEASON=34173    # 2022-2023
SEASON=35847    # 2023-2024

def get_url(season, round):
    return "https://nbl.com.au/schedule?season=35847&round=F"
    return f"https://nbl.com.au/schedule?round={round}&season={season}"


print("Season", SEASON)
print("Rounds: ", ROUNDS)

Season 35847
Rounds:  ['F']


In [28]:
# We need an actual browser so that the JavaScript is loaded and the links https://.../games/<game_id> are generated
options = Options()
options.headless = True
browser = webdriver.Firefox(options=options)
# browser = webdriver.Firefox(options=options, executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')

html_texts = []

for rno in ROUNDS:
    url = get_url(SEASON, rno)
    print(f"Extracting web HTML for round {rno} at ", url)
    browser.get(url)
    html_text = browser.page_source
    html_texts.append((rno, html_text))

browser.quit()

Extracting web HTML for round F at  https://nbl.com.au/schedule?season=35847&round=F


In [29]:
print(html_text)

<html><head>
    <meta data-n-head="ssr" charset="utf-8"><meta data-n-head="ssr" name="viewport" content="width=device-width, initial-scale=1"><meta data-n-head="ssr" name="twitter:card" content="summary_large_image"><meta data-n-head="ssr" name="theme-color" content="#EF7627"><meta data-n-head="ssr" name="og:url" data-hid="og:url" content="/schedule"><meta data-n-head="ssr" name="twitter:url" data-hid="twitter:url" content="/schedule"><meta data-n-head="ssr" name="description" data-hid="description" content="The NBL is the pre-eminent professional men's basketball league in Australia and New Zealand. The league was founded in 1979 and is currently contested by 10 teams; 9 from AUS and 1 from NZ."><meta data-n-head="ssr" name="og:description" data-hid="og:description" content="The NBL is the pre-eminent professional men's basketball league in Australia and New Zealand. The league was founded in 1979 and is currently contested by 10 teams; 9 from AUS and 1 from NZ."><meta data-n-head="s

Now find all the game ids and generate list as per round.

In [31]:
pattern = r'/games/(\d+)'
pattern = r'matchId:(\d+),matchStatus:hm'


MAP_ROUNDS = { "SF" : 100, "F" : 200, "PS" : 0, "PlayOff%201" : 101, "PlayOff%202" : 102}

games = []
for (rnd, html_text) in html_texts:
    game_ids = set(re.findall(pattern, html_text))
    print(f"Number of games extracted for round {rnd}: ", len(game_ids))

    games.extend([(x, rnd if isinstance(rnd, int) else MAP_ROUNDS[rnd]) for x in game_ids])

print("Total games:", len(games))
print(games)

Number of games extracted for round F:  5
Total games: 5
[('2426437', 200), ('2426440', 200), ('2426441', 200), ('2426439', 200), ('2426438', 200)]


### Check for new games from previous set

Replace `PREVIOUS` with the existing set of games.

In [21]:
PREVIOUS = [('2266873', '1'), ('2266875', '1'), ('2266877', '1'), ('2266879', '1'), ('2266881', '1'), ('2266883', '1'), ('2266885', '1'), ('2266871', '2'), ('2266870', '2'), ('2266872', '2'), ('2266874', '2'), ('2266876', '2'), ('2266878', '2'), ('2266880', '2'), ('2266882', '2'), ('2266884', '3'), ('2266886', '3'), ('2266887', '3'), ('2266890', '3'), ('2266893', '3'), ('2266896', '3'), ('2266901', '3'), ('2266902', '4'), ('2266905', '4'), ('2266908', '4'), ('2266911', '4'), ('2266914', '4'), ('2266888', '4'), ('2266892', '5'), ('2266894', '5'), ('2266897', '5'), ('2266899', '5'), ('2266903', '5'), ('2266906', '5'), ('2266910', '5'), ('2266913', '5'), ('2266915', '6'), ('2266889', '6'), ('2266891', '6'), ('2266895', '6'), ('2266898', '6'), ('2266900', '6'), ('2266904', '6'), ('2266907', '6'), ('2266909', '7'), ('2266912', '7'), ('2266916', '7'), ('2266917', '7'), ('2266921', '7'), ('2266919', '8'), ('2266923', '8'), ('2266925', '8'), ('2266927', '8'), ('2266929', '8'), ('2266931', '8'), ('2266932', '8'), ('2266935', '9'), ('2266918', '9'), ('2266920', '9'), ('2266922', '9'), ('2266924', '9'), ('2266926', '9'), ('2266928', '9'), ('2266930', '10'), ('2266933', '10'), ('2266934', '10'), ('2266936', '10'), ('2266937', '10'), ('2266938', '10'), ('2266939', '11'), ('2266940', '11'), ('2266941', '11'), ('2266942', '11'), ('2266943', '11'), ('2266944', '11'), ('2266945', '11'), ('2266946', '12'), ('2266949', '12'), ('2266952', '12'), ('2266957', '12'), ('2266958', '12'), ('2266961', '12'), ('2266964', '12'), ('2266965', '12'), ('2266947', '12'), ('2266951', '13'), ('2266954', '13'), ('2266955', '13'), ('2266959', '13'), ('2266962', '13'), ('2266966', '13'), ('2266968', '14'), ('2266970', '14'), ('2266972', '14'), ('2266948', '14'), ('2266950', '14'), ('2266953', '14'), ('2266956', '15'), ('2266960', '15'), ('2266963', '15'), ('2266967', '15'), ('2266969', '15'), ('2266971', '15'), ('2266973', '15'), ('2266974', '15'), ('2266975', '16'), ('2266976', '16'), ('2266977', '16'), ('2266978', '16'), ('2266979', '16'), ('2266980', '16'), ('2266981', '16'), ('2266982', '16'), ('2266983', '17'), ('2266985', '17'), ('2266987', '17'), ('2266989', '17'), ('2266991', '17'), ('2266993', '17'), ('2266994', '17'), ('2266997', '17'), ('2266999', '18'), ('2267001', '18'), ('2267002', '18'), ('2266984', '18'), ('2266986', '18'), ('2266988', '18'), ('2266990', '19'), ('2266992', '19'), ('2266995', '19'), ('2266996', '19'), ('2266998', '19'), ('2267000', '19'), ('2267003', '20'), ('2267004', '20'), ('2267005', '20'), ('2267006', '20'), ('2267007', '20'), ('2267008', '20'), ('2267009', '20')]

new_games = [x for x in games if x not in PREVIOUS]
print("Number of new games:", len(new_games))
new_games

Number of new games: 165


[('2266911', 0),
 ('2266883', 0),
 ('2266897', 0),
 ('2266895', 0),
 ('2266964', 0),
 ('2266979', 0),
 ('2266912', 0),
 ('2266902', 0),
 ('2266956', 0),
 ('2266992', 0),
 ('2266879', 0),
 ('2266970', 0),
 ('2291377', 0),
 ('2266880', 0),
 ('2267001', 0),
 ('2266997', 0),
 ('2267007', 0),
 ('2266975', 0),
 ('2266944', 0),
 ('2266991', 0),
 ('2291376', 0),
 ('2266928', 0),
 ('2266972', 0),
 ('2291378', 0),
 ('2296188', 0),
 ('2266936', 0),
 ('2266981', 0),
 ('2267004', 0),
 ('2296357', 0),
 ('2266894', 0),
 ('2266893', 0),
 ('2266965', 0),
 ('2266954', 0),
 ('2266988', 0),
 ('2266993', 0),
 ('2266990', 0),
 ('2266945', 0),
 ('2266872', 0),
 ('2266995', 0),
 ('2266871', 0),
 ('2266974', 0),
 ('2296183', 0),
 ('2266889', 0),
 ('2291375', 0),
 ('2266882', 0),
 ('2266977', 0),
 ('2299029', 0),
 ('2266884', 0),
 ('2266978', 0),
 ('2266980', 0),
 ('2291374', 0),
 ('2266923', 0),
 ('2266969', 0),
 ('2266957', 0),
 ('2266962', 0),
 ('2266887', 0),
 ('2296186', 0),
 ('2266950', 0),
 ('2266913', 0

# Option 2: Get all the games with no browser

Here we don't use a web-browser and we extract just all the games from the original source from items `matchId:<GAME-ID>`.

With this solution the page gives ALL the game links, so we don't know what round each is.

In [None]:
import re
import bs4
import requests
from bs4 import BeautifulSoup # https://stackabuse.com/guide-to-parsing-html-with-beautifulsoup-in-python/

In [None]:
base_url = "https://nbl.com.au/schedule"
params = dict()
params["round"] = "3"
params["season"] = "35847"
# params["team"] = "3682"

r = requests.get(base_url, params=params)
print(r.url)
html_text = r.text

soup = BeautifulSoup(html_text, 'html.parser')

game_ids = set(re.findall(r'matchId:(\d+)', html_text))
print("Number of games extracted: ", len(game_ids))
games = [(x, round if isinstance(round, int) else 0) for x in game_ids]
print(games)

## Option 3: fully parsing HTML

As of June 2023, the above methods do not seem to obtain the HTML page for the specific round, all games are there, so it is not possible to assign it to the round!

However, on the HTML there is a call to the script managing the dropbox selection widgests and it has both the round ids, and each game associated to a round.

For example, the HTML will contain this text encoding the rounds filtering:

```
roundsFilter: { name: "Round", queryLabel: "round", type: iN, dropdownOptions: [{label:"Pre-season",value:cU},{label:"All Rounds",value:"Full"}, { label: "Round 1", value: cW }, { label: "Round 2", value: dW }, { label: "Round 3", value: ep }, { label: "Round 4", value: eO }, { label: "Round 5", value: ea }, { label: "Round 6", value: eP }, { label: "Round 7", value: fI }, { label: "Round 8", value: ff }, { label: "Round 9", value: fg }, { label: "Round 10", value: fv }, { label: "Round 11", value: eQ }, { label: "Round 12", value: eA }, { label: "Round 13", value: fh }, { label: "Round 14", value: fw }, { label: "Round 15", value: eR }, { label: "Round 16", value: eS }, { label: "Round 17", value: eT }, { label: "Round 18", value: fx }, { label: "Round 19", value: fy }, { label: "Round 20", value: fi }] }
```

So, for example, Round 1 is code `cW`.  Then, each game is listed with its number and the round id (e.g., `cW`) in a list of matches. Note that there are also non-numbered rounds, like "Pre-season"

We then do full regular expression extraction and mapping of round ids to round numbers.

**NOTE:** you must run Option 1 above first to get `html_text`

In [46]:
# example: { label: "Round 1", value: cW } or {label:"Pre-season",value:cU}
p1 = re.compile('{label:"(Round (.+?)|(.+?))",value:(.+?)}')    # https://docs.python.org/3/howto/regex.html
# print(p1.findall(html_text))
rounds_dict = {id : round for (_, round, _, id) in p1.findall(html_text) if round != ''}    # extract numbered rounds, e.g., "Rounds N"
rounds_dict.update({id : round for (_, _, round, id) in p1.findall(html_text) if round != ''})  # extract non-numbered rounds, e.g., "Pre-season"

# Next match games id with their round number
# example: matches: [{ matchId: 1928319, ........, roundNumber: cw, .... }, ....]
p2 = re.compile('matchId:(\d+?),.*?,roundNumber:(.+?),')
games_ids = p2.findall(html_text)
games = [(g[0], rounds_dict[g[1]]) for g in p2.findall(html_text)]

print("Rounds: ", rounds_dict)
print("Games:", games)

Rounds:  {'dq': '1', 'eL': '2', 'eM': '3', 'fE': '4', 'eW': '5', 'fq': '6', 'gj': '7', 'fF': '8', 'fG': '9', 'f$': '10', 'eX': '11', 'eY': '12', 'fr': '13', 'ga': '14', 'fs': '15', 'ft': '16', 'fu': '17', 'gb': '18', 'df': '19', 'fH': '20', 'cU': 'Pre-season', '"Full"': 'All Rounds'}
Games: [('2299028', 'Pre-season'), ('2294592', 'Pre-season'), ('2291374', 'Pre-season'), ('2292533', 'Pre-season'), ('2299029', 'Pre-season'), ('2292534', 'Pre-season'), ('2291375', 'Pre-season'), ('2291376', 'Pre-season'), ('2291378', 'Pre-season'), ('2291377', 'Pre-season'), ('2296183', 'Pre-season'), ('2296184', 'Pre-season'), ('2296186', 'Pre-season'), ('2296185', 'Pre-season'), ('2296188', 'Pre-season'), ('2296187', 'Pre-season'), ('2296189', 'Pre-season'), ('2296190', 'Pre-season'), ('2296191', 'Pre-season'), ('2296192', 'Pre-season'), ('2296193', 'Pre-season'), ('2296356', 'Pre-season'), ('2296357', 'Pre-season'), ('2296358', 'Pre-season'), ('2296359', 'Pre-season'), ('2266873', '1'), ('2266875', '1

In [49]:
PRESESON = [(x[0], 0) for x in games if x[1] == "Pre-season"]
PRESESON

[('2299028', 0),
 ('2294592', 0),
 ('2291374', 0),
 ('2292533', 0),
 ('2299029', 0),
 ('2292534', 0),
 ('2291375', 0),
 ('2291376', 0),
 ('2291378', 0),
 ('2291377', 0),
 ('2296183', 0),
 ('2296184', 0),
 ('2296186', 0),
 ('2296185', 0),
 ('2296188', 0),
 ('2296187', 0),
 ('2296189', 0),
 ('2296190', 0),
 ('2296191', 0),
 ('2296192', 0),
 ('2296193', 0),
 ('2296356', 0),
 ('2296357', 0),
 ('2296358', 0),
 ('2296359', 0)]

## Playground

In [None]:
import re

# Download geckodriver (https://github.com/mozilla/geckodriver/releases) and put it in path
# Salenium webdriver: https://www.selenium.dev/documentation/overview/
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# url = get_url(SEASON, rno)
url = "https://nbl.com.au/schedule?round=1&season=35847"
#url="https://bit.ly/3NhGZ9O"

# We need an actual browser so that the JavaScript is loaded and the links https://.../games/<game_id> are generated
options = Options()
options.headless = True
options.page_load_strategy = 'eager'

browser = webdriver.Firefox(options=options)
# browser = webdriver.Chrome()
# browser.maximize_window()

print(f"Extracting web HTML at ", url)
# browser.get(url)
browser.execute_script(url)
html_text = browser.page_source


with open("page_source.html", "w") as f:
    f.write(html_text)


browser.quit()
