# Webscraping II
#### CAS Applied Data Science 2025 ####

#### Exercise Solutions ####

## Exercise 1

Use ``selenium`` to go to https://job-room.ch and search for jobs related to Python (you may first need to close the orange message asking employers to register). Fetch the source code of the page with the search results, and convert it to a ``BeautifulSoup`` object. Can you print out the number of jobs that were found?

Hints:
 * You might need to tell Python to wait a moment before retrieving the source code of the page (otherwise the page might not have finished loading). This can be done using the ``sleep`` function in the ``time`` module (or using explicit waits in selenium).
 * To work out how to locate the element you are looking for with ``find_element()``, right-click it in your browser and choose "Inspect" to find a suitable handle (e.g. the ``id`` or ``class`` attribute).
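
The waiting idea from the first hint can be sketched without a browser. The helper below is a minimal illustration, not part of selenium: `wait_until` and `page_is_ready` are hypothetical names, and the "page" is simulated by a counter. It polls a condition until it becomes truthy or a timeout expires, which is roughly what an explicit wait does.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds pass without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for a page that needs three checks to finish loading
attempts = {"count": 0}


def page_is_ready():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(page_is_ready, timeout=2.0, poll_interval=0.01)
print(attempts["count"])  # 3
```

With a real browser session, the condition could be something like `lambda: browser.find_elements(By.CSS_SELECTOR, "span[data-test='resultCount']")`, since `find_elements` returns an empty (falsy) list while nothing matches. Selenium also provides `WebDriverWait` for this purpose. Compared with a fixed `time.sleep(5)`, polling returns as soon as the content is there.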


In [None]:
# Import libraries
from selenium import webdriver
from selenium.webdriver.common.by import By # To find elements
from selenium.webdriver.common.keys import Keys # To enter keys such as Enter, delete etc.
import time
from bs4 import BeautifulSoup

In [None]:
# Initialize browser session and go to https://job-room.ch
browser = webdriver.Chrome()
browser.get('https://job-room.ch/home/job-seeker')

In [None]:
# Close orange message asking employers to register
browser.find_element(By.CSS_SELECTOR, "button.close").click()

In [None]:
# Navigate to the second search field (Skills) and enter Python
elem = browser.find_element(By.ID, "alv-multi-typeahead-portal.job-ad.search.query-panel.keywords.placeholder-0")
elem.send_keys('python')
elem.click() # To make black pop-up message about multiple search terms disappear

In [None]:
# Click on search button
elem = browser.find_element(By.CSS_SELECTOR, "button[type='submit']")
elem.click()

In [None]:
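# Wait for the results to load, then fetch the source code and parse it with BeautifulSoup
time.sleep(5)
html = browser.page_source
soup = BeautifulSoup(html, "html.parser") # name the parser explicitly to avoid a warning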

In [None]:
# Print number of jobs
soup.select("span[data-test='resultCount']")[0].text

'757'
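
To try out the selectors without opening a browser, the same calls can be run on a small static snippet. The HTML below is made up for illustration and only imitates the attributes used in the solution (`data-test='resultCount'` and the `result-list-item` class); the real page is more complex and may change.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment imitating the structure assumed in the solution
html = """
<div class="results">
  <span data-test="resultCount">757</span>
  <a class="d-block result-list-item" href="/job-search/abc">Job A</a>
  <a class="d-block result-list-item" href="/job-search/def">Job B</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match or None, so a missing element can be handled
count_tag = soup.select_one("span[data-test='resultCount']")
n_jobs = int(count_tag.text.strip()) if count_tag is not None else None
print(n_jobs)  # 757

# A class selector matches any element that has this class among its classes
hrefs = [a["href"] for a in soup.select("a.result-list-item")]
print(hrefs)  # ['/job-search/abc', '/job-search/def']
```

Using `select_one` and checking for `None` avoids the `IndexError` that `select(...)[0]` raises when the page has not loaded yet.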

## Exercise 2

Try to extract all the links to the pages of the individual jobs and store them in a list. How many links do you get?

In [None]:
# Solution 1 (with the source code you have already downloaded)
links = soup.select("a.result-list-item")
urls = [link["href"] for link in links]
print(urls)
len(urls)

['/job-search/fc3b9404-e55f-11ed-8ba0-d20bff28451e', '/job-search/e7e988e6-e16d-11ed-97ee-d20bff28451e', '/job-search/638ae2e8-e16b-11ed-97ee-d20bff28451e', '/job-search/cc7d0696-cf5c-11ed-97ee-d20bff28451e', '/job-search/10d2b125-dbff-11ed-97ee-d20bff28451e', '/job-search/623a4c11-cedf-11ed-97ee-d20bff28451e', '/job-search/abb0238d-cf64-11ed-97ee-d20bff28451e', '/job-search/a3f28bf9-c53a-11ed-b60e-d20bff28451e', '/job-search/facb2915-c078-11ed-a6ef-d20bff28451e', '/job-search/2565a2ae-e17e-11ed-97ee-d20bff28451e', '/job-search/3858b44b-dbee-11ed-97ee-d20bff28451e', '/job-search/baeea593-e3d3-11ed-8ba0-d20bff28451e', '/job-search/c3038b33-c46a-11ed-b930-d20bff28451e', '/job-search/fa27c3b0-e0ac-11ed-97ee-d20bff28451e', '/job-search/2b07b0e5-e572-11ed-8ba0-d20bff28451e', '/job-search/d32b840e-e4a3-11ed-8ba0-d20bff28451e', '/job-search/cbe124e7-d5b5-11ed-97ee-d20bff28451e', '/job-search/1904dc1c-d1c3-11ed-97ee-d20bff28451e', '/job-search/10eb5f1e-d1c0-11ed-97ee-d20bff28451e', '/job-searc

20

In [None]:
# Solution 2: With selenium in the browser
links = browser.find_elements(By.CSS_SELECTOR, "a[class='d-block result-list-item flex-height position-relative']")
# Store the result; selenium usually returns the resolved (absolute) URL for href
urls = [link.get_attribute("href") for link in links]
print(urls)
len(urls)

['/job-search/fc3b9404-e55f-11ed-8ba0-d20bff28451e', '/job-search/e7e988e6-e16d-11ed-97ee-d20bff28451e', '/job-search/638ae2e8-e16b-11ed-97ee-d20bff28451e', '/job-search/cc7d0696-cf5c-11ed-97ee-d20bff28451e', '/job-search/10d2b125-dbff-11ed-97ee-d20bff28451e', '/job-search/623a4c11-cedf-11ed-97ee-d20bff28451e', '/job-search/abb0238d-cf64-11ed-97ee-d20bff28451e', '/job-search/a3f28bf9-c53a-11ed-b60e-d20bff28451e', '/job-search/facb2915-c078-11ed-a6ef-d20bff28451e', '/job-search/2565a2ae-e17e-11ed-97ee-d20bff28451e', '/job-search/3858b44b-dbee-11ed-97ee-d20bff28451e', '/job-search/baeea593-e3d3-11ed-8ba0-d20bff28451e', '/job-search/c3038b33-c46a-11ed-b930-d20bff28451e', '/job-search/fa27c3b0-e0ac-11ed-97ee-d20bff28451e', '/job-search/2b07b0e5-e572-11ed-8ba0-d20bff28451e', '/job-search/d32b840e-e4a3-11ed-8ba0-d20bff28451e', '/job-search/cbe124e7-d5b5-11ed-97ee-d20bff28451e', '/job-search/1904dc1c-d1c3-11ed-97ee-d20bff28451e', '/job-search/10eb5f1e-d1c0-11ed-97ee-d20bff28451e', '/job-searc

20

## Exercise 3 (advanced and optional!)

You may have noticed that you only got the urls for the first 20 search results. This happens because the other results are not rendered immediately, but only when you scroll down. Can you find a way to extract all the urls?

Hint: You can tell the browser to scroll down until the end of the page is reached and then retrieve the source code. One approach would be to use ``find_element()`` to locate an element that resides within the scrollable container and then send it a couple of ``Keys.PAGE_DOWN`` key presses (but there might also be other ways).

In [None]:
# Get number of results
nr_links = int(soup.select("span[data-test='resultCount']")[0].text)
print(nr_links)

# Select the first result link (an element inside the scrollable container)
elem = browser.find_element(By.CSS_SELECTOR, "a.result-list-item")

# Scroll down until all results are loaded
# Caution: this loop does not terminate if fewer than nr_links results ever load
nfound = 0
i = 0
while nfound < nr_links:
    elem.send_keys(Keys.PAGE_DOWN * 12) # send a dozen page down keys
    i += 1
    if i % 10 == 0: # every ten loops check how far we are
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        links = soup.select("a.result-list-item")
        urls = [link["href"] for link in links]
        nfound = len(urls)
    time.sleep(0.001) # Tiny waiting time

print(len(urls))

757
757
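
The loop above relies on the displayed result count being reached exactly. A more defensive variant stops once the number of loaded items no longer grows. The sketch below shows that idea under stated assumptions: `scroll_until_stable` and `FakePage` are hypothetical names, and the fake page merely simulates loading 20 results per scroll so that the logic can be run without a browser.

```python
import time


def scroll_until_stable(scroll, count_items, target=None, max_stalls=3, pause=0.0):
    """Scroll until `target` items are loaded or the count stops growing.

    `scroll` and `count_items` are callables supplied by the caller.
    Returns the number of items seen when the loop ends.
    """
    seen = count_items()
    stalls = 0
    while (target is None or seen < target) and stalls < max_stalls:
        scroll()
        time.sleep(pause)
        current = count_items()
        if current > seen:
            seen = current
            stalls = 0  # progress was made, reset the stall counter
        else:
            stalls += 1  # nothing new appeared after this scroll
    return seen


class FakePage:
    """Simulated result list that reveals `batch` more items per scroll."""

    def __init__(self, total, batch=20):
        self.total = total
        self.batch = batch
        self.loaded = min(batch, total)

    def scroll(self):
        self.loaded = min(self.loaded + self.batch, self.total)

    def count(self):
        return self.loaded


page = FakePage(total=757)
# The target is deliberately too high: the loop still ends after three stalls
print(scroll_until_stable(page.scroll, page.count, target=1000))  # 757
```

In the notebook, `scroll` could be `lambda: elem.send_keys(Keys.PAGE_DOWN * 12)` and `count_items` could be `lambda: len(browser.find_elements(By.CSS_SELECTOR, "a.result-list-item"))`, with a small `pause` to give the page time to render.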


In [None]:
# Get source code and extract urls
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
links = soup.select("a.result-list-item")
urls = [link["href"] for link in links]
len(urls)

757

Now, navigate to **one single url** from your list and fetch its source code (we want to avoid sending too many requests to the site during the course). Again, you might have to introduce a waiting time between loading the page and fetching the source code. Print out (1) the title and (2) the workload of the job.

In [None]:
# Fetch page and convert to BeautifulSoup object
url = "https://job-room.ch" + urls[0]
browser.get(url)
time.sleep(2)
html = browser.page_source
soup = BeautifulSoup(html, "html.parser")
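
String concatenation works here because the extracted links are relative paths. As a hedged alternative, `urljoin` from the standard library handles both relative and already absolute links. The helper name `to_absolute` is made up for this sketch; the example path is taken from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://job-room.ch"


def to_absolute(href, base=BASE_URL):
    """Resolve a link against the base URL; absolute links pass through unchanged."""
    return urljoin(base, href)


relative = "/job-search/fc3b9404-e55f-11ed-8ba0-d20bff28451e"
absolute = "https://job-room.ch/job-search/fc3b9404-e55f-11ed-8ba0-d20bff28451e"

print(to_absolute(relative))  # https://job-room.ch/job-search/fc3b9404-e55f-11ed-8ba0-d20bff28451e
print(to_absolute(absolute))  # https://job-room.ch/job-search/fc3b9404-e55f-11ed-8ba0-d20bff28451e
```

This matters if the list of urls was collected with selenium rather than from the parsed source, because the two approaches can return links in different forms.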

In [None]:
# Print job title
soup.find("h2").text

'Python Java Developer Junior'

In [None]:
# Print workload
soup.find(class_="job-description ng-star-inserted").find_all("li")[1].text

'Workload:\n100%'
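
The extracted text still contains the label and a line break. A small post-processing step can separate the two; `split_label_value` is a hypothetical helper, and it assumes the text has the `label:value` shape shown in the output above.

```python
def split_label_value(text):
    """Split 'Label:\\nvalue' into a (label, value) pair with whitespace removed."""
    label, _, value = text.partition(":")
    return label.strip(), value.strip()


label, value = split_label_value("Workload:\n100%")
print(label)  # Workload
print(value)  # 100%
```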