# Extract Game Ids

This notebook extracts the game ids for a given round of a given season by inspecting web pages of the form

```
https://nbl.com.au/schedule?round=<ROUND-ID>&season=<SEASON-ID>
```

For example, this is the pre-season for 2022-2023:

https://nbl.com.au/schedule?round=PS&season=34173
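
As a minimal sketch, such a URL can be assembled with the standard library's `urllib.parse.urlencode`. The helper name `schedule_url` is hypothetical and not part of this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://nbl.com.au/schedule"

def schedule_url(round_id, season_id):
    """Build the schedule URL for one round of one season (hypothetical helper)."""
    # urlencode converts the values to strings and escapes them if needed
    query = urlencode({"round": round_id, "season": season_id})
    return f"{BASE_URL}?{query}"

print(schedule_url("PS", 34173))
```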

Those pages expose game links of the form `https://nbl.com.au/games/<GAME-ID>`, but only after JavaScript has run. So we need a real browser, driven headlessly (without a visible window), to load the page and let its scripts execute. We do this with the `selenium` module, which provides drivers for browsers. Here is [an explanation](https://stackoverflow.com/questions/11047348/is-this-possible-to-load-the-page-after-the-javascript-execute-using-python) of how to load a page after JavaScript has executed.

**Note:** the original page source, before JavaScript runs, also exposes the game ids in structures of the form `matchId:<GAME-ID>`, but it lists all the games of the season, without filtering by round.
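
To make the two forms concrete, the sketch below runs both patterns over a made-up fragment. The `sample` string is an assumption for illustration only and is not copied from the real page, whose markup may differ.

```python
import re

# Hypothetical fragment mixing both forms; the real page markup may differ.
sample = (
    '<a href="https://nbl.com.au/games/2116412">Game</a>'
    '<a href="https://nbl.com.au/games/2116391">Game</a>'
    '<script>matches:[{matchId:2116412},{matchId:2116391},{matchId:2116381}]</script>'
)

# Links rendered by JavaScript: only the games of the selected round.
rendered_ids = set(re.findall(r"/games/(\d+)", sample))

# Ids embedded in the raw source: every game of the season.
embedded_ids = set(re.findall(r"matchId:(\d+)", sample))

print(sorted(rendered_ids))
print(sorted(embedded_ids - rendered_ids))  # games that belong to other rounds
```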


## Option 1: Via Selenium virtual browser

In [None]:
import re

# Download geckodriver (https://github.com/mozilla/geckodriver/releases) and put it in path
# Selenium webdriver: https://www.selenium.dev/documentation/overview/
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

In [None]:
# def get_game_ids(season, round) -> list:
# ROUNDS = list(range(1, 22)) + ["SF", "F"]
# SEASON=27725    # 2020-2021
# SEASON=30249    # 2021-2022

ROUNDS = list(range(1, 19)) + ["PS"]
# ROUNDS = ["F"]
# ROUNDS = ["PS"]
# SEASON=34173    # 2022-2023
SEASON=35847    # 2023-2024

def get_url(season, round):
    return f"https://nbl.com.au/schedule?round={round}&season={season}"


print("Season", SEASON)
print("Rounds: ", ROUNDS)

In [None]:
# We need an actual browser so that the JavaScript is loaded and the links https://.../games/<game_id> are generated
options = Options()
options.add_argument("-headless")  # `options.headless = True` is deprecated in recent Selenium 4 releases
browser = webdriver.Firefox(options=options)
# browser = webdriver.Firefox(options=options, executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')

html_texts = []

for rno in ROUNDS:
    url = get_url(SEASON, rno)
    print(f"Extracting web HTML for round {rno} at ", url)
    browser.get(url)
    html_text = browser.page_source
    html_texts.append((rno, html_text))

browser.quit()

Now find all the game ids and build the list of `(game_id, round)` pairs, one round at a time.

In [None]:
pattern = r'/games/(\d+)'

MAP_ROUNDS = { "SF" : 100, "F" : 101, "PS" : 0}

games = []
for (rnd, html_text) in html_texts:
    game_ids = set(re.findall(pattern, html_text))
    print(f"Number of games extracted for round {rnd}: ", len(game_ids))

    games.extend([(x, rnd if isinstance(rnd, int) else MAP_ROUNDS[rnd]) for x in game_ids])

print("Total games:", len(games))
print(games)

In [None]:
print(html_texts[1][1])

### Check for new games from previous set

Replace `PREVIOUS` with the existing set of games.
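
The comparison itself is a membership test on `(game_id, round)` tuples. A minimal sketch with made-up data, using a set for the lookup, could look as follows; the helper name `find_new_games` is hypothetical.

```python
def find_new_games(games, previous):
    """Return the (game_id, round) tuples in `games` that are not in `previous`."""
    seen = set(previous)  # set lookup avoids rescanning the list for every game
    return [g for g in games if g not in seen]

# Made-up data for illustration only.
previous = [("2116412", 1), ("2116391", 1)]
games = [("2116412", 1), ("2116381", 2), ("2116391", 1)]

print(find_new_games(games, previous))
```

Note that a game whose round changed counts as new, because the whole tuple is compared.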

In [None]:
PREVIOUS = [('2116412', 1), ('2116391', 1), ('2116406', 1), ('2116390', 1), ('2116402', 1), ('2120056', 0), ('2122059', 0), ('2134935', 0), ('2120058', 0), ('2120079', 0), ('2124207', 0), ('2116576', 0), ('2120054', 0), ('2120077', 0), ('2141127', 0), ('2120052', 0), ('2120055', 0), ('2122060', 0), ('2141126', 0), ('2120057', 0), ('2135116', 0), ('2135117', 0), ('2120080', 0), ('2120051', 0), ('2116579', 0), ('2120049', 0), ('2120050', 0), ('2120078', 0), ('2120048', 0), ('2120053', 0), ('2116381', 2), ('2116379', 2), ('2116429', 2), ('2116437', 2), ('2116420', 2), ('2116436', 2), ('2116423', 2), ('2116435', 2), ('2116384', 3), ('2116518', 3), ('2116419', 3), ('2116422', 3), ('2116386', 3), ('2116385', 3), ('2116407', 3), ('2116413', 3), ('2116378', 4), ('2116389', 4), ('2116380', 4), ('2116382', 4), ('2116388', 4), ('2116387', 4), ('2116383', 4), ('2116411', 4), ('2116396', 5), ('2116417', 5), ('2116408', 5), ('2116414', 5), ('2116403', 5), ('2116393', 5), ('2116418', 5), ('2116400', 5), ('2116394', 6), ('2116424', 6), ('2116421', 6), ('2116401', 6), ('2116430', 6), ('2116432', 6), ('2116398', 6), ('2116427', 7), ('2116434', 7), ('2116410', 7), ('2116428', 7), ('2116433', 7), ('2116416', 7), ('2116405', 7), ('2116409', 8), ('2116397', 8), ('2116404', 8), ('2116395', 8), ('2116415', 8), ('2116399', 8), ('2116426', 9), ('2116438', 9), ('2116431', 9), ('2116455', 9), ('2116425', 9), ('2116454', 9), ('2116448', 9), ('2116443', 9), ('2116486', 10), ('2116444', 10), ('2116463', 10), ('2116473', 10), ('2116487', 10), ('2116439', 10), ('2116468', 10), ('2116478', 10), ('2116469', 11), ('2116456', 11), ('2116459', 11), ('2116449', 11), ('2116483', 11), ('2116482', 11), ('2116440', 11), ('2116475', 11), ('2116445', 11), ('2116453', 12), ('2116471', 12), ('2116450', 12), ('2116480', 12), ('2116467', 12), ('2116477', 12), ('2116462', 12), ('2116447', 13), ('2116442', 13), ('2116465', 13), ('2116452', 13), ('2116458', 13), ('2116460', 13), ('2116485', 13), ('2116441', 14), 
('2116446', 14), ('2116474', 14), ('2116457', 14), ('2116484', 14), ('2116479', 14), ('2116470', 14), ('2116451', 14), ('2116481', 15), ('2116488', 15), ('2116466', 15), ('2116492', 15), ('2116472', 15), ('2116476', 15), ('2116489', 15), ('2116491', 15), ('2116490', 15), ('2116461', 15), ('2116505', 16), ('2116493', 16), ('2116495', 16), ('2116500', 16), ('2116497', 16), ('2116496', 16), ('2116494', 16), ('2116498', 16), ('2116502', 16), ('2116508', 17), ('2116510', 17), ('2116499', 17), ('2116514', 17), ('2116506', 17), ('2116516', 17), ('2116512', 17), ('2116517', 17), ('2116501', 17), ('2116504', 18), ('2116507', 18), ('2116513', 18), ('2116511', 18), ('2116503', 18), ('2116515', 18), ('2116509', 18)]


new_games = [x for x in games if x not in PREVIOUS]
new_games

## Option 2: Get all the games with no browser

Here we do not use a web browser: we extract all the games directly from the original page source, from the items of the form `matchId:<GAME-ID>`.

With this solution the page gives ALL the games of the season, so we do not know which round each one belongs to.

In [None]:
import re
import bs4
import requests
from bs4 import BeautifulSoup # https://stackabuse.com/guide-to-parsing-html-with-beautifulsoup-in-python/

In [None]:
base_url = "https://nbl.com.au/schedule"
params = dict()
params["round"] = "3"
params["season"] = "35847"
# params["team"] = "3682"

r = requests.get(base_url, params=params)
print(r.url)
html_text = r.text

soup = BeautifulSoup(html_text, 'html.parser')

game_ids = set(re.findall(r'matchId:(\d+)', html_text))
print("Number of games extracted: ", len(game_ids))
# The round of each game is unknown here, so every game gets the placeholder round 0
games = [(x, 0) for x in game_ids]
print(games)

## Option 3: fully parsing HTML

As of June 2023, neither of the above methods seems to obtain the HTML page for a specific round: all the games of the season are present, so a game cannot be assigned to its round.

However, the HTML contains a call to the script managing the dropdown selection widgets, and that call includes both the round ids and the round id associated with each game.

For example, the HTML will contain this text encoding the rounds filtering:

```
roundsFilter: { name: "Round", queryLabel: "round", type: iN, dropdownOptions: [{ label: "All Rounds", value: "Full" }, { label: "Round 1", value: cW }, { label: "Round 2", value: dW }, { label: "Round 3", value: ep }, { label: "Round 4", value: eO }, { label: "Round 5", value: ea }, { label: "Round 6", value: eP }, { label: "Round 7", value: fI }, { label: "Round 8", value: ff }, { label: "Round 9", value: fg }, { label: "Round 10", value: fv }, { label: "Round 11", value: eQ }, { label: "Round 12", value: eA }, { label: "Round 13", value: fh }, { label: "Round 14", value: fw }, { label: "Round 15", value: eR }, { label: "Round 16", value: eS }, { label: "Round 17", value: eT }, { label: "Round 18", value: fx }, { label: "Round 19", value: fy }, { label: "Round 20", value: fi }] }
```

So, for example, Round 1 has code `cW`. Each game is then listed with its number and its round id (e.g., `cW`) in a list of matches.

We then extract both with regular expressions and map the round ids to round numbers.
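
The two-step mapping can be illustrated on a small made-up fragment. The `sample` string below is an assumption that mimics a source without spaces; the `venue` and `x` fields and the `None` fallback for unknown codes are hypothetical.

```python
import re

# Made-up fragment without spaces; the real page contains many more fields.
sample = (
    'roundsFilter:{dropdownOptions:[{label:"All Rounds",value:"Full"},'
    '{label:"Round 1",value:cW},{label:"Round 2",value:dW}]},'
    'matches:[{matchId:2116412,venue:"A",roundNumber:cW,x:1},'
    '{matchId:2116381,venue:"B",roundNumber:dW,x:2}]'
)

# Step 1: map each round code (e.g. cW) to its round number.
round_re = re.compile(r'\{label:"Round (.+?)",value:(.+?)\}')
round_of_code = {code: int(number) for number, code in round_re.findall(sample)}

# Step 2: pair every match id with the round its code refers to.
match_re = re.compile(r'matchId:(\d+),.*?,roundNumber:(.+?),')
games = [
    (game_id, round_of_code.get(code))  # None if the code is not a numbered round
    for game_id, code in match_re.findall(sample)
]

print(round_of_code)
print(games)
```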

In [None]:
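# example: { label: "Round 1", value: cW }
# Note: the patterns contain no spaces, so they assume the raw source is not prettified like the example.
p1 = re.compile(r'\{label:"Round (.+?)",value:(.+?)\}')
# maps round id (e.g. cW) to round number (kept as a string, e.g. "1")
rounds_dict = {round_id: round_no for (round_no, round_id) in p1.findall(html_text)}

# example: matches: [{ matchId: 1928319, ........, roundNumber: cW, .... }, ....]
p2 = re.compile(r'matchId:(\d+?),.*?,roundNumber:(.+?),')
games_ids = p2.findall(html_text)  # list of (game_id, round_id) pairs
games = [(game_id, rounds_dict[round_id]) for (game_id, round_id) in games_ids]
print(games)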

## Playground

In [None]:
import re

# Download geckodriver (https://github.com/mozilla/geckodriver/releases) and put it in path
# Selenium webdriver: https://www.selenium.dev/documentation/overview/
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

# url = get_url(SEASON, rno)
url = "https://nbl.com.au/schedule?round=1&season=35847"
#url="https://bit.ly/3NhGZ9O"

# We need an actual browser so that the JavaScript is loaded and the links https://.../games/<game_id> are generated
options = Options()
options.add_argument("-headless")  # `options.headless = True` is deprecated in recent Selenium 4 releases
options.page_load_strategy = 'eager'

browser = webdriver.Firefox(options=options)
# browser = webdriver.Chrome()
# browser.maximize_window()

print("Extracting web HTML at ", url)
# execute_script() expects JavaScript source, not a URL, so load the page with get()
browser.get(url)
html_text = browser.page_source


with open("page_source.html", "w", encoding="utf-8") as f:
    f.write(html_text)


browser.quit()
