# Webscraping II

## Exercise 1

Use ``selenium`` to go to https://job-room.ch and search for jobs related to Python (you may first need to close the orange message asking employers to register). Fetch the source code of the page with the search results, and convert it to a ``BeautifulSoup`` object. Can you print out the number of jobs that were found?

Hints:
 * You might need to tell Python to wait a bit before retrieving the source code of the page (otherwise the page might not have finished loading). This can be done with the ``sleep`` function in the ``time`` module (or with explicit waits in selenium).
 * To work out how to locate what you are looking for with ``find_element()``, right-click the element in your browser and choose "Inspect" to find a suitable locator (e.g. the ``id`` or ``class`` attribute).


In [86]:
# Import libraries
import requests
import math
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService # Renamed Service to avoid conflict
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By # To find elements
from selenium.webdriver.common.keys import Keys # For special keys (Enter, delete, down etc.)

In [87]:
# Initialize browser session and go to https://www.job-room.ch
# Automatically download and manage ChromeDriver
# The Service object will use the path provided by ChromeDriverManager
service = ChromeService(executable_path=ChromeDriverManager().install())

# Pass the service object when creating the driver instance
browser = webdriver.Chrome(service=service)

browser.get('https://www.job-room.ch')

In [88]:
# Close orange message (optional)


In [89]:
# Navigate to the second search field (Skills) and enter "Python"
elem = browser.find_element(By.ID, 'alv-multi-typeahead-portal.job-ad.search.query-panel.keywords.placeholder-0')

elem.send_keys('python')
elem.send_keys(Keys.ENTER)

In [90]:
# Click on search button
elem = browser.find_element(By.CSS_SELECTOR, 'button[type = "submit"]')
elem.click()
time.sleep(3)

In [91]:
# Fetch source code and parse it with BeautifulSoup
source = browser.page_source

# Name the parser explicitly to avoid a warning and parser-dependent results
soup = BeautifulSoup(source, "html.parser")


In [92]:
# Print number of jobs
n_jobs = soup.select("span[data-test='resultCount']")[0].text
n_jobs

'403'

## Exercise 2

Try to extract all the links to the pages of the individual jobs and store them in a list. How many links do you get?

In [93]:
job_pages = [a["href"] for a in soup.select(".d-block.result-list-item")]
len(job_pages)

20
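
The ``href`` values collected this way may be relative paths rather than full URLs; that is an assumption worth checking by printing a few entries. If so, they need to be joined with the site's base URL before the browser can open them. A minimal sketch using the standard library follows; the helper ``to_absolute`` and the example paths are hypothetical and only illustrate the idea.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.job-room.ch"


def to_absolute(hrefs, base=BASE_URL):
    """Turn relative hrefs into absolute URLs; absolute ones pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]


# Hypothetical example values, not real job pages
example_hrefs = [
    "/job-search/abc-123",
    "https://www.job-room.ch/job-search/def-456",
]
absolute_urls = to_absolute(example_hrefs)
print(absolute_urls)
```

In the notebook, ``to_absolute(job_pages)`` would give a list that can be passed to ``browser.get()`` one URL at a time.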

## Exercise 3 (advanced and optional!)

You may have noticed that you only got the URLs for the first 20 search results. This happens because the other results are not rendered immediately, but only when you scroll down. Can you find a way to extract all the URLs?

Hint: You can tell the browser to scroll down until the end of the page is reached and then retrieve the source code. One approach would be to use ``find_element()`` to locate an element that resides within the scrollable container and then send it a couple of ``Keys.PAGE_DOWN`` key presses (but there might also be other ways).

In [94]:
n_jobs = int(n_jobs)
job_per_scroll = 20

# how many times do we have to scroll to show all the jobs?
n_scrolls = math.ceil(n_jobs / job_per_scroll)

In [95]:
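# Locate the scrollable container via its two classes
scrollable = browser.find_element(By.CSS_SELECTOR, ".container-fluid.ng-star-inserted")

for i in range(n_scrolls):
    browser.execute_script("arguments[0].scrollTop = arguments[0].scrollTop + arguments[0].offsetHeight;", scrollable)
    time.sleep(0.1)

# Re-fetch the source now that more results are rendered and extract the links again
soup = BeautifulSoup(browser.page_source, "html.parser")
job_pages = [a["href"] for a in soup.select(".d-block.result-list-item")]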
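
Scrolling a precomputed number of times relies on each scroll loading exactly 20 results in time. A more defensive variant keeps scrolling until the container's ``scrollHeight`` stops growing. The sketch below is one possible way to do this, not the course's reference solution: the function name ``scroll_until_stable``, the pause length and the ``max_rounds`` cap are illustrative assumptions.

```python
import time


def scroll_until_stable(browser, scrollable, pause=0.5, max_rounds=100):
    """Scroll `scrollable` to its bottom until its scrollHeight stops growing.

    Returns the number of scroll rounds performed. `max_rounds` is a
    safety cap so the loop cannot run forever.
    """
    last_height = browser.execute_script(
        "return arguments[0].scrollHeight;", scrollable
    )
    for rounds in range(1, max_rounds + 1):
        # Jump to the current bottom of the container
        browser.execute_script(
            "arguments[0].scrollTop = arguments[0].scrollHeight;", scrollable
        )
        # Give the page time to render the next batch of results
        time.sleep(pause)
        new_height = browser.execute_script(
            "return arguments[0].scrollHeight;", scrollable
        )
        if new_height == last_height:
            # Nothing new was rendered, so we assume the end was reached
            return rounds
        last_height = new_height
    return max_rounds
```

Calling ``scroll_until_stable(browser, scrollable)`` before fetching ``browser.page_source`` would then replace the fixed loop. If the site loads slowly, a pause that is too short can make the function stop early, so the value may need tuning.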

Now, navigate to **one single URL** from your list and fetch its source code (we want to avoid sending too many requests to the site from this course). Again, you might have to introduce a waiting time between loading the page and fetching the source code. Print out (1) the title and (2) the workload of the job.

In [None]:
# Fetch page and convert to BeautifulSoup object


In [None]:
# Print job title


In [None]:
# Print workload
