# Web Scraping II
#### CAS Applied Data Science 2025 ####


In the previous tutorials, we learned how to automatically retrieve pages from the internet through **web scraping**. We sent HTTP requests with the ``requests`` library and used the ``BeautifulSoup`` library to parse and work with the HTML code of the response. This is a good start and works well for so-called **static** pages, but when scraping **dynamic** websites that use JavaScript in the background, you will notice that this approach often fails. Today, we will learn how to deal with this problem.

## Scraping dynamic websites

Imagine you want to scrape all currently listed open positions of "data scientists" in "Bern" on www.indeed.com.

Let's have a look at the output in a browser: https://ch.indeed.com/jobs?q=data+scientist&l=Bern%2C+BE

We could load the website with ``requests`` and extract the parts that we want using ``BeautifulSoup``:

In [None]:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://ch.indeed.com/jobs?q=data+scientist&l=Bern%2C+BE&vjk=4d0278f36b754f74")
res.text

# information sits in <li> elements under <ul class='jobsearch-ResultsList'>, but there might be other ways to get to it too...
soup = BeautifulSoup(res.text, "html.parser") # name the parser explicitly to avoid a warning
soup
soup.find_all("p")

That looks bad. The webserver recognises that we are not a regular user entering normally through a browser. What could we do from here?

In [None]:
url = 'https://ch.indeed.com/jobs?q=data+scientist&l=Bern%2C+BE'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

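response = requests.get(url, headers=headers) # send the request with a browser-like User-Agent
soup = BeautifulSoup(response.text, "html.parser") # parse the NEW response, not the old "res"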

soup.find_all("p")

Still not working. What is going on here? If the HTML source code you retrieve from a page does not correspond to what you see when viewing this page in your browser, you are probably dealing with a **dynamic web page**. Typically, this means that some JavaScript code is executed in the background to "create" the source code of the final page you see. When you scrape a dynamic page with the ``requests`` library, you will not get the final HTML code, but often an incomplete version where you might only see the instructions to run the JavaScript files. Some web pages also use dynamic features to actively prevent web scraping. You might then get an error message like the one in the example above. Luckily, there is another Python package we can use to address such problems.
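Before turning to it, a rough way to tell these cases apart is to look at the status code and to check whether a piece of text that is visible in the browser also occurs in the raw response. The helper below is only a heuristic sketch; the function name ``classify_response`` and the example strings are made up for illustration.

```python
def classify_response(status_code, html, marker):
    """Roughly classify a response that was fetched without a browser.

    status_code: HTTP status code of the response (e.g. res.status_code)
    html:        raw response body (e.g. res.text)
    marker:      text you can see on the page in your browser
    """
    if status_code >= 400:
        # e.g. 403 Forbidden: the server refused the request outright
        return "blocked or failed"
    if marker.lower() in html.lower():
        # The visible text is already part of the raw HTML
        return "static"
    # The request succeeded but the visible text is missing:
    # the content is probably added later by JavaScript
    return "probably dynamic"


# Hand-written examples instead of real requests
print(classify_response(200, "<h1>Data Scientist</h1>", "data scientist"))
print(classify_response(200, "<script src='app.js'></script>", "data scientist"))
print(classify_response(403, "<p>Access denied</p>", "data scientist"))
```

With a real request you would call, for example, ``classify_response(res.status_code, res.text, "Data Scientist")``.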

## Selenium

``selenium`` is a framework for automating browsers (e.g. Chrome, Firefox, Edge). You can think of it as an interface that allows you to control what your browser does through code. It was developed for automated website testing, but is also very useful for scraping web pages, especially those that require JavaScript rendering, because it can automate interactions like clicking buttons or filling out forms. Selenium offers bindings for several programming languages, including Python. The Python package itself does not need Java (only the standalone Selenium Server/Grid does), but it does need a browser driver, and the setup can sometimes be a bit tedious.

Let's start by installing the ``selenium`` Python package.

**Note that you cannot use ``selenium`` from within Colab as the Python instance is running in the cloud where no browser is available. Copy the notebook of this tutorial to your local machine and run the code in, e.g., Jupyter Notebook.**

In [None]:
# Selenium can be installed via pip
!pip install selenium

Selenium requires a driver (e.g. chromedriver) to communicate with your favorite browser (e.g. Chrome). Recent versions of the ``selenium`` package (4.6 and later) include Selenium Manager, which should download a matching driver automatically. If this is not the case, you can install the driver manually.

**A list of available drivers:**

* Chrome:
 - https://chromedriver.chromium.org/downloads

* Edge:
 - https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/

* Firefox:
 - https://github.com/mozilla/geckodriver/releases

* Safari:
 - https://webkit.org/blog/6900/webdriver-support-in-safari-10/

If the code below does not work, make sure you have an appropriate driver (matching your browser version) installed and that the driver is accessible, i.e. it is located in a directory listed in your ``PATH`` environment variable. If you want to know where the selenium package was stored on your computer, you can import it and then type ``print(selenium.__file__)``.
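The following sketch collects the information you typically need when debugging the setup: the installed selenium version and whether a driver executable can be found on the ``PATH``. The function name ``environment_report`` is made up for illustration; the executable names are the usual ones for the Chrome and Firefox drivers.

```python
import shutil
from importlib import metadata


def environment_report(driver_names=("chromedriver", "geckodriver")):
    """Return the selenium version and the driver locations found on PATH."""
    try:
        selenium_version = metadata.version("selenium")
    except metadata.PackageNotFoundError:
        selenium_version = None  # selenium is not installed in this environment

    # shutil.which returns the full path of an executable, or None if not found
    drivers = {name: shutil.which(name) for name in driver_names}
    return {"selenium": selenium_version, "drivers": drivers}


report = environment_report()
print(report)
```

A value of ``None`` for a driver is not necessarily a problem, because recent selenium versions can fetch a matching driver themselves.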

You can read up on the setup instructions here: https://pypi.org/project/selenium/

Let's try to import the webdriver, initialize it and go to 'https://ch.indeed.com' (i.e. send a GET request).


In [None]:
from selenium import webdriver
browser = webdriver.Chrome()
browser.get('https://ch.indeed.com')

This opens a new Chrome window (the browser we chose with ``webdriver.Chrome()``). With selenium you can now take control of what the browser does, e.g.
* fill in textfields,
* click on elements,
* scroll,
* etc.


><font color = 4e1585> SIDENOTE: If the version of the driver and the version of your browser do not match, you will get an error. One way to address it is the third-party package ``webdriver_manager``, which automatically installs the driver version that corresponds to your browser (and also bypasses problems regarding the ``PATH`` environment variable). In selenium 4, the path to the driver is passed via a ``Service`` object:
>```python
!pip install webdriver-manager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
browser = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
```

---

<font color='teal'> **In-class exercise**:
Use Jupyter or an IDE (Spyder/PyCharm). Install ``selenium`` and try to start a selenium-controlled browser window (Chrome/Firefox/Edge/Safari). If it does not work immediately, don't panic. Depending on your operating system, Python version, selenium version, browser version etc., all kinds of problems may occur! If you cannot get it to work quickly, take your time and try to find the problem after the course. Also note: we might get blocked by Indeed when we do this in the course.
<font color='teal'> Make sure that:

* <font color='teal'>Both your browser and selenium are up to date (e.g. force an update if you have an old version of selenium)
* <font color='teal'>and/or update your browser
* <font color='teal'>Depending on version and system, you might need to manually install the driver for your browser.
* <font color='teal'>If the versions of your driver and your browser don't match, you can install a matching driver with the third-party package ``webdriver_manager`` (see sidenote above).



---


### Filling in text fields

Now that we have a browser window running that is controlled by our Python code, selenium allows us to do many different things (see here for all available methods: https://www.simplilearn.com/tutorials/python-tutorial/selenium-with-python). For example, we could send keystrokes to the searchbar.

We want to search for positions for data scientists. First, we need to locate the search fields. This can be done using the ``find_element`` method. It allows us to find matching elements on a website based on locator values such as the tag name (``By.TAG_NAME``), the id (``By.ID``), the class (``By.CLASS_NAME``), CSS selectors (``By.CSS_SELECTOR``) and more. If we inspect the source code of the page, we will see that the id of the tag containing the "What" field (for the job title) is "text-input-what" while the one for the "Where" field is "text-input-where". Let's start by locating the "What" field:

In [None]:
from selenium.webdriver.common.by import By # To find elements
from selenium.webdriver.common.keys import Keys # For special keys (Enter, delete, down etc.)
#elem = browser.find_element(By.CSS_SELECTOR, "[type='checkbox']")
#elem.click()

In [None]:
elem = browser.find_element(By.ID, 'text-input-what') # Find element

Now we can send the keystrokes to the selected field:

In [None]:
elem.send_keys('data scientist') # enter search query for "data scientist"

Let's limit our search to Bern and hit Enter to get the job openings:

In [None]:
elem = browser.find_element(By.ID, 'text-input-where') # find "where" field
elem.send_keys('Bern, BE' + Keys.RETURN) # enter "Bern, BE" and hit Enter!

---

 <font color='teal'> **In-class exercise**:
Can you click (method ``.click()``) on the drop down menu for the language?

---

### Extracting elements

From here we can fetch the page's source code and hand it over to ``BeautifulSoup()``.

In [None]:
# now save the source code
html = browser.page_source # get the source code

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser") # name the parser explicitly to avoid a warning

Let us try to extract some elements from the website, e.g. the headlines which hold the job titles.
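The selection relies on CSS selectors, which ``BeautifulSoup`` supports through ``select()``. As a sketch of how a descendant selector such as ``ul.jobsearch-ResultsList li h2`` narrows the match, consider the hand-written fragment below. The fragment is made up for illustration; Indeed's real markup is more complex and may change over time.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment that mimics the structure of a result list
fragment = """
<ul class="jobsearch-ResultsList">
  <li><h2><a href="/viewjob?jk=abc123">Data Scientist</a></h2></li>
  <li><h2><a href="/viewjob?jk=def456">Data Analyst</a></h2></li>
</ul>
<h2><a href="/about">About us</a></h2>
"""

demo_soup = BeautifulSoup(fragment, "html.parser")

# Only <h2> tags inside <li> tags inside the <ul> of the given class match;
# the "About us" headline outside the list is ignored
demo_titles = [h2.get_text() for h2 in demo_soup.select("ul.jobsearch-ResultsList li h2")]
demo_links = [a["href"] for a in demo_soup.select("ul.jobsearch-ResultsList li h2 a")]

print(demo_titles)
print(demo_links)
```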

In [None]:
job_headlines = soup.select("ul.jobsearch-ResultsList li h2")
headlines = [job.get_text() for job in job_headlines]
headlines

Let's also store the links to the job descriptions.

In [None]:
job_links = soup.select("ul.jobsearch-ResultsList li h2 a") # "a" tag within "h2" tag within "li" tag within "ul" tag of class "jobsearch-ResultsList"

urls = [link["href"] for link in job_links]

print(len(urls))
urls[0]

---

<font color='teal'> **In-class exercise**:
Can you extract the number of hits that were found?

---

### Crawling through a list of links

Let's assume we want to store information on all the jobs we found. Good practice would be to split the process into two steps:
 1. Loop through all URLs to fetch and store the HTML source code
 2. Extract the relevant information from the stored files

This way you do not need to bother the webserver more than once. This is also important because you often do not know beforehand what you actually need and might otherwise end up making an unnecessarily large number of requests.

Let's start with step 1. To be able to store each html file we need to make sure of a few things:

 - create a new directory on the file system (where we can store the files)
 - add the base url in front
 - think about the filenames to use
 - use a file handler that saves the pages to the filesystem


><font color = 4e1585>SIDENOTE ON FILE HANDLERS: We can use Python's built-in ``open`` function to write to (or read from) different types of files (html, txt, csv etc.). To create a new file, you must first open it in *write* mode (``w``) using the ``open`` function. Then you can use the *write* method to write into your file:
>```python
my_file = open('page0.html', 'w')  # Open a (new) file in write (w) mode
my_file.write(myString)             # Write to file
my_file.close()                     # Close file
```

><font color = 4e1585> A more elegant way to do the same looks as follows (closes the file automatically after the ``with`` block):
>```python
with open('page0.html', 'w') as my_file:
  my_file.write(myString)
```
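When saving web pages, it is safer to state the encoding explicitly, because the default can depend on your operating system and pages often contain characters such as umlauts. The sketch below writes a string to a file and reads it back; the example string is made up and a temporary directory is used so that nothing is left behind.

```python
import os
import tempfile

page_source = "<p>Zürich, Genève, Graubünden</p>"  # made-up example content

# Work in a temporary directory that is removed automatically afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "page0.html")

    # Write with an explicit encoding ...
    with open(path, "w", encoding="utf-8") as my_file:
        my_file.write(page_source)

    # ... and read back with the same encoding
    with open(path, "r", encoding="utf-8") as my_file:
        restored = my_file.read()

print(restored == page_source)
```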

So let's change to the directory we created and then loop through all urls. For each job, we will append the url to the base url, fetch the source code and save it to our computer. For simplicity, we will save the files as job0, job1 etc. To get the index in each loop, we can use the ``enumerate()`` function.
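As a small illustration of these two ingredients, the sketch below joins relative links to the base URL with ``urljoin`` from the standard library (an alternative to plain string concatenation) and uses ``enumerate()`` to build the file names. The two relative links are made up.

```python
from urllib.parse import urljoin

base_url = "https://ch.indeed.com"
example_urls = ["/viewjob?jk=abc123", "/viewjob?jk=def456"]  # made-up links

targets = []
for i, url in enumerate(example_urls):
    full_url = urljoin(base_url, url)  # add the base URL in front
    filename = f"job{i}.html"          # job0.html, job1.html, ...
    targets.append((filename, full_url))

print(targets)
```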

In [None]:
import os
# Adjust this path to a directory on your own machine. The directory must already exist;
# use os.mkdir(path) to create it from within Python.
os.chdir(r"C:\Users\jakob\Programming2\indeed")

# Loop through all urls
for i, url in enumerate(urls):
    url = "https://ch.indeed.com" + url # Add base url in front
    browser.get(url) # Go to page
    with open('job' + str(i) + '.html', 'w', encoding="utf-8") as file: # Open file in write mode
        file.write(browser.page_source) # Write source code of each page into file
    print("Stored page for job " + str(i))

**But wait, what if there are more than 15 hits?** So far, we have only retrieved the jobs that were displayed on the first page.

We could make the browser click on the "next" button to create a list of the links to all the jobs (rather than just those on the first page):

In [None]:
# Re-initialize browser session
browser.quit() # Close the browser
browser = webdriver.Chrome()
browser.get('https://ch.indeed.com')

elem = browser.find_element(By.ID, 'text-input-what')
elem.send_keys('data scientist')
elem = browser.find_element(By.ID, 'text-input-where')
#elem.clear()
elem.send_keys('Bern, BE' + Keys.RETURN)

Let's first click on the button to accept cookies because it seems to get in our way when we want to click on "next page".

In [None]:
browser.find_element(By.CSS_SELECTOR, "#onetrust-accept-btn-handler").click()

Now, we can write a loop that clicks on the ``>`` (next) button as long as this is possible and retrieves all the URLs of the individual job descriptions. We can do this using an infinite while loop and a try-except block. We also have to deal with a window inviting us to register that pops up on the second page.

In [None]:
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.common.exceptions import NoSuchElementException

urls = [] # Start with empty list
while True:
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    job_links = soup.select("ul.jobsearch-ResultsList li h2 a") # retrieve "a" (link) tags
    page_urls = [link["href"] for link in job_links] # Get "href" attribute of each link and write into a list
    urls += page_urls # add to url list
    try:
        elem = browser.find_element(By.CSS_SELECTOR, "a[data-testid='pagination-page-next']")
        elem.click()
    except ElementClickInterceptedException: # Close pop-up window that gets into our way
        browser.find_element(By.CSS_SELECTOR, "button.icl-CloseButton").click()
        elem = browser.find_element(By.CSS_SELECTOR, "a[data-testid='pagination-page-next']")
        elem.click()
    except NoSuchElementException: # Break on last page where > button does not exist
        break

len(urls)

Now that we have the complete list of URLs, we can scrape all the respective pages and save them to our computer:

In [None]:
for i, url in enumerate(urls):
    url = "https://ch.indeed.com" + url
    browser.get(url)
    with open('job' + str(i) + '.html', 'w', encoding="utf-8") as file:
        file.write(browser.page_source)
        print("Stored page for job " + str(i))

><font color = 4e1585> SIDENOTE: *Indeed* might realize that you are not a normal user and ask you to verify that you are human. To make your scraper more robust to such problems (and make sure it is less of a burden to the page you are scraping), you could add some sleep time in each iteration. For example, ``time.sleep(10)`` (from the ``time`` module) would tell Python to wait for 10 seconds in each iteration.
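A minimal sketch of such a pause is shown below. It waits for a random time between a lower and an upper bound rather than for a fixed time, which is a common variation and not a requirement of the page; the function name ``polite_pause`` and the bounds are made up for illustration.

```python
import random
import time


def polite_pause(min_seconds=5.0, max_seconds=10.0):
    """Sleep for a random duration between the two bounds and return it."""
    duration = random.uniform(min_seconds, max_seconds)
    time.sleep(duration)
    return duration


# Very short bounds here so that the demonstration finishes quickly
for i in range(3):
    waited = polite_pause(0.01, 0.05)
    print(f"Iteration {i}: waited {waited:.3f} seconds")
```

In the download loop, you would call ``polite_pause()`` once per iteration, for example right after ``browser.get(url)``.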


### Extracting information from html files
Now that we have stored the pages about the jobs separately, we can proceed to step 2 and start to work with them. We would like to create a nice pandas DataFrame with information on all the jobs we found. Suppose we are interested in the job title and the employer. Let's define a function that extracts these elements and returns them in a list:

In [None]:
# define a function that extracts the elements we want from the files with source code for each page
def getStuff(page):
    with open(page, encoding = "utf-8") as file:
        content = file.read() # Open file in read mode and assign to variable "content"

    soup = BeautifulSoup(content, "html.parser")

    # extract elements we like
    jobtitle = soup.select("h1 span")[0].get_text()
    employer = soup.select("div[data-company-name='true'] a")[0].get_text()
    return [jobtitle, employer]

Now, let's loop through all the html files, apply our function and write everything into a nested list:

In [None]:
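# Get names of all stored HTML files in the current directory
pages = [name for name in os.listdir() if name.endswith(".html")]

# Loop through html files
job_summary = []
for page in pages:
    try:
        job_summary.append(getStuff(page))
    except Exception as err: # e.g. IndexError if an element is missing on the page
        print("Problem with file", page, "-", err)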

Finally, we can convert our nested list into a pandas DataFrame.

In [None]:
import pandas as pd
df = pd.DataFrame.from_records(job_summary, columns = ["jobtitle", "employer"])
df

### What to do from here?

We wanted to give you some first impressions of what ``selenium`` is capable of -- but there is much more to learn! You could extend/finetune our example project in different ways:

 * make sure that each page is downloaded correctly and only once:
   - check if file exists on the system and is not empty (e.g. if exists(somefile): skip downloading)
   - use better filenames, e.g. based on the ID (jk=...) to make this check easier
   - introduce a short wait time in your loop if necessary (selenium can even wait/check until a certain element is present)
 * extract other elements:
   - skills
   - full texts
   - ...
 * Split your code into two separate scripts (one for data collection and one for data processing)
 * Think about a strategy to extend the search terms and/or locations
 * Make your scraper check for new job openings once a week
 * ...
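The first group of ideas in the list above can be sketched as follows. The helper derives a file name from the ``jk`` parameter of a job link, and the check reports files that are missing or empty. The function names, the fallback name and the example link are made up for illustration; the sketch also assumes that job links carry a ``jk`` parameter, which may not hold for every link.

```python
import os
from urllib.parse import parse_qs, urlparse


def filename_from_url(url, fallback="unknown"):
    """Build a file name such as 'job_abc123.html' from the jk parameter."""
    query = parse_qs(urlparse(url).query)
    job_id = query.get("jk", [fallback])[0]
    return f"job_{job_id}.html"


def needs_download(path):
    """Return True if the file is missing or empty."""
    return not os.path.exists(path) or os.path.getsize(path) == 0


example_url = "/viewjob?jk=abc123&from=serp"  # made-up link
filename = filename_from_url(example_url)
print(filename, needs_download(filename))
```

In the download loop, you would only call ``browser.get(url)`` and write the file if ``needs_download(filename)`` is true.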

You might like to check out the following tutorials for more ideas: https://youtube.com/playlist?list=PLzMcBGfZo4-n40rB1XaJ0ak1bemvlqumQ
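Regarding the wait mentioned in the list above: selenium's ``WebDriverWait`` can pause until a given element is present instead of sleeping for a fixed time. The helper below is a sketch under assumptions; the function name ``wait_for_element`` and the default timeout are made up, and the selector in the usage comment is the one used earlier in this tutorial.

```python
def wait_for_element(browser, css_selector, timeout=10):
    """Wait until an element matching css_selector is present and return it.

    Raises selenium.common.exceptions.TimeoutException if no such element
    shows up within `timeout` seconds.
    """
    # Imported here so that defining the helper does not require selenium
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.CSS_SELECTOR, css_selector)
    return WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located(locator)
    )


# Usage with the running browser session from this tutorial:
# first_link = wait_for_element(browser, "ul.jobsearch-ResultsList li h2 a")
```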

---

<font color='teal'> **In-class exercise**:
Using the files you saved to your computer, create a pandas DataFrame that contains not only the job title and the employer, but also the place (Thun, Bern etc.).