# Web Scraping II

### Important:  use Jupyter instead of Colab today!!

In last week's tutorial, we learned how to automatically retrieve pages from the internet through **web scraping**. We sent HTTP requests with the  ``requests`` library and used the ``BeautifulSoup`` library to parse and work with the HTML code from the response we got. This is a good start and works well for so-called **static** pages, but when scraping **dynamic** websites that use JavaScript in the background, you will notice that this approach often fails. Today, we will learn how to deal with this problem.

## Scraping dynamic websites and its problems (an example which is not working any longer)

Imagine you want to scrape all currently listed open positions for "data analyst" in Bern.

Let's have a look at the output in a browser: https://ch.indeed.com/jobs?q=data+analyst&l=Bern%2C+BE

We could load the website with ``requests`` and extract the parts that we want using ``BeautifulSoup``:

In [30]:
import requests
from bs4 import BeautifulSoup

res = requests.get("https://ch.indeed.com/jobs?q=data+analyst&l=Bern%2C+BE")
res.text

# information sits in <li> elements of class="jobTitle", but there might be other ways to get to it too...
soup = BeautifulSoup(res.text)
soup
soup.find_all("p")

[<p>You have been blocked. If you believe this in error, please go to support.indeed.com and reference the following information:</p>,
 <p>Your Ray ID for this request is <span style="font-style:italic">93761be29adc6aa1</span></p>,
 <p>Your current IP for this request is <span style="font-style:italic">130.92.232.189</span></p>,
 <p><a href="https://support.indeed.com/hc/en-us/articles/33465379855501-Troubleshooting-Cloudflare-Errors">Troubleshooting Cloudflare Errors</a></p>,
 <p>Need more help? <a href="https://www.indeed.com/support/contact">Contact us</a></p>]

That looks bad. The webserver recognises that we are not a regular user entering normally through a browser. What could we do from here?

In [31]:
url = 'https://ch.indeed.com/jobs?q=data+analyst&l=Bern%2C+BE'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text) # Parse the new response (not the earlier `res`)

soup.find_all("p")

[<p>You have been blocked. If you believe this in error, please go to support.indeed.com and reference the following information:</p>,
 <p>Your Ray ID for this request is <span style="font-style:italic">93761be29adc6aa1</span></p>,
 <p>Your current IP for this request is <span style="font-style:italic">130.92.232.189</span></p>,
 <p><a href="https://support.indeed.com/hc/en-us/articles/33465379855501-Troubleshooting-Cloudflare-Errors">Troubleshooting Cloudflare Errors</a></p>,
 <p>Need more help? <a href="https://www.indeed.com/support/contact">Contact us</a></p>]

Still not working. What is going on here? If the HTML source code you retrieve from a page does not correspond to what you see when viewing the page in your browser, you are probably dealing with a **dynamic web page**. Typically, this means that some JavaScript code is executed in the background to "create" the source code of the final page you see. When you scrape a dynamic page with the ``requests`` library, you do not get the final HTML code, but often an incomplete version in which you might only see the instructions to run the JavaScript files. Some web pages also use dynamic features to actively prevent web scraping. You might then get an error message like the one in the example above. There is another Python package we can use to address such problems.
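Before parsing a response, it can help to check whether it looks like a block page at all. The following is only a minimal sketch: the helper name `looks_blocked` and the marker phrases are assumptions taken from the output above, and real block pages may look different.

```python
# Hypothetical helper: guess whether a response is a block page rather than real content.
# The marker phrases are assumptions based on the output shown above.
BLOCK_MARKERS = ("you have been blocked", "ray id")

def looks_blocked(status_code, html):
    """Return True if the status code or the page text suggests we were blocked."""
    if status_code in (403, 429):  # "Forbidden" / "Too Many Requests"
        return True
    text = html.lower()
    return any(marker in text for marker in BLOCK_MARKERS)

# Example with hard-coded strings (no network access needed)
blocked_page = "<p>You have been blocked. Your Ray ID for this request is ...</p>"
normal_page = "<h2 class='jobTitle'>Data Analyst</h2>"
print(looks_blocked(403, blocked_page))  # True
print(looks_blocked(200, blocked_page))  # True
print(looks_blocked(200, normal_page))   # False
```

With ``requests``, you could call it as `looks_blocked(res.status_code, res.text)` before handing the text to ``BeautifulSoup``.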

## Selenium

``selenium`` is a browser automation framework that lets you control what a browser (e.g. Chrome, Firefox or Edge) does through code. It was developed for automated website testing, but is also very useful for scraping web pages, especially those that require JavaScript rendering. It can automate web browsing and interactions, like clicking buttons or filling out forms. Selenium can be used from multiple programming languages, including Python. However, it needs a browser driver to communicate with the browser and can sometimes be a bit tedious to set up.

Let's start by installing the ``selenium`` Python package.

**Note that you cannot use ``selenium`` from within Colab as the Python instance is running in the cloud where no browser is available. Copy the notebook of this tutorial to your local machine and run the code in, e.g., Jupyter Notebook.**

In [32]:
# Selenium can be installed via pip
!pip install selenium
!pip uninstall webdriver-manager -y
!pip install webdriver-manager




[notice] A new release of pip is available: 25.0.1 -> 25.1
[notice] To update, run: C:\Users\Raphael\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip


Found existing installation: webdriver-manager 4.0.2
Uninstalling webdriver-manager-4.0.2:
  Successfully uninstalled webdriver-manager-4.0.2
Collecting webdriver-manager
  Using cached webdriver_manager-4.0.2-py2.py3-none-any.whl.metadata (12 kB)
Using cached webdriver_manager-4.0.2-py2.py3-none-any.whl (27 kB)
Installing collected packages: webdriver-manager
Successfully installed webdriver-manager-4.0.2





Selenium requires a driver (e.g. chromedriver) to communicate with your browser (e.g. Chrome). Recent versions of the ``selenium`` package (4.6 and later) include Selenium Manager, which should download a matching driver automatically. If this is not the case, you can install a driver manually.

**A list of available drivers:**

* Chrome:
 - https://chromedriver.chromium.org/downloads

* Edge:
 - https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/

* Firefox:
 - https://github.com/mozilla/geckodriver/releases

* Safari:
 - https://webkit.org/blog/6900/webdriver-support-in-safari-10/

If the code below does not work, make sure you have an appropriate driver (matching your browser version) installed and that the driver is accessible, i.e. it is located in a directory listed in your ``PATH`` environment variable. If you want to know where the selenium package is stored on your computer, you can import it and then type ``print(selenium.__file__)``.
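To check whether a driver is reachable through ``PATH``, a small sketch with the standard library can help. The helper name `find_drivers` and the list of driver names are assumptions for illustration; adjust them to the browser you use.

```python
import shutil

# Common driver executable names (an assumption; adjust to your browser)
DRIVER_NAMES = ("chromedriver", "geckodriver", "msedgedriver")

def find_drivers(names=DRIVER_NAMES):
    """Map each driver name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}

for name, path in find_drivers().items():
    print(name, "->", path if path else "not found on PATH")
```

A driver that is reported as not found is not necessarily a problem if Selenium or ``webdriver_manager`` downloads one for you.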

You can read the setup instructions here: https://pypi.org/project/selenium/

Let's try to import the webdriver, initialize it and go to 'https://ch.indeed.com' (i.e. send a GET request).


In [33]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService # Renamed Service to avoid conflict
from webdriver_manager.chrome import ChromeDriverManager

# Automatically download and manage ChromeDriver
# The Service object will use the path provided by ChromeDriverManager
service = ChromeService(executable_path=ChromeDriverManager().install())

# Pass the service object when creating the driver instance
browser = webdriver.Chrome(service=service)

browser.get('https://ch.indeed.com')


In [34]:
service = ChromeService(executable_path=ChromeDriverManager().install())

# Pass the service object when creating the driver instance
browser = webdriver.Chrome(service=service)

browser.get('https://youtube.com')

This opens a Chrome window. With selenium you can now take control of what the browser does, e.g.
* fill in text fields,
* click on elements,
* scroll,
* etc.


><font color = ffffff> SIDENOTE: If the version of the driver and the version of your browser do not match, you will get an error. One way to address this is the third-party package ``webdriver_manager``, which automatically installs the driver version that corresponds to your browser (and also avoids problems regarding the ``PATH`` environment variable):
>```python  
!pip install webdriver_manager
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
# In Selenium 4 the driver path is passed via a Service object
browser = webdriver.Chrome(service=ChromeService(executable_path=ChromeDriverManager().install()))
```

### Filling in text fields

Now that we have a browser window running that is controlled by our Python code, selenium allows us to do many different things (see here for all available methods: https://www.simplilearn.com/tutorials/python-tutorial/selenium-with-python). For example, we could send keystrokes to the search bar.

We want to search for positions for data analysts. First, we need to locate the search fields. This can be done using the ``find_element`` method. It allows us to find matching elements on a website based on locator values such as the tag name (``By.TAG_NAME``), the id (``By.ID``), the class (``By.CLASS_NAME``), CSS selectors (``By.CSS_SELECTOR``) and more. If we inspect the source code of the page, we will see that the id of the tag containing the "What" field (for the job title) is "text-input-what" while the one for the "Where" field is "text-input-where". Let's start by locating the "What" field:

In [35]:
from selenium.webdriver.common.by import By # To find elements
from selenium.webdriver.common.keys import Keys # For special keys (Enter, delete, down etc.)

In [36]:
elem = browser.find_element(By.ID, 'text-input-what') # Find element

NoSuchWindowException: Message: no such window: target window already closed
from unknown error: web view not found
  (Session info: chrome=135.0.7049.115)


Now we can send the keystrokes to the selected field:

In [None]:
elem.send_keys('data analyst') # enter search query for "data analyst"

Let's limit our search to Bern and hit Enter to get the job openings:

In [None]:
elem = browser.find_element(By.ID, 'text-input-where') # find "where" field
elem.send_keys('Bern, BE' + Keys.RETURN) # enter "Bern, BE" and hit Enter!

As you can see, it's really hard to scrape this website as it is protected by Cloudflare.

We will turn to a different example to demonstrate the capabilities of selenium.

---

<font color='teal'> **In-class exercise**:
Use Jupyter or an IDE (Spyder/PyCharm). Install ``selenium`` and try to start a Selenium-controlled browser window (Chrome/Firefox/Edge/Safari). If it does not work immediately, don't panic. Depending on your operating system, Python version, Selenium version, browser version etc., all kinds of problems may occur! If you cannot get it to work quickly, take your time and try to find the problem after the course. Also note: load a different website (not indeed.ch), e.g. www.google.ch
<font color='teal'> If it does not work after the standard installation of selenium/webdriver, make sure that:

* <font color='teal'>``selenium`` is the current version (e.g. force an update if you have an old version of selenium)
* <font color='teal'>and/or your browser is up to date
* <font color='teal'>Depending on version and system, you might need to manually install the driver for your browser.
* <font color='teal'>If the versions of your driver and your browser don't match, you can install a matching driver with the ``webdriver_manager`` package (see sidenote above).


---
## job-room.ch

Let's now try a different example that is a bit less restrictive: **job-room.ch**.


In [38]:
service = ChromeService(executable_path=ChromeDriverManager().install())

# Pass the service object when creating the driver instance
browser = webdriver.Chrome(service=service)
browser.get("https://www.job-room.ch/home/job-seeker")

In [39]:
# Close orange message asking employers to register
browser.find_element(By.CSS_SELECTOR, "button.close").click()

NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":"button.close"}
  (Session info: chrome=135.0.7049.115); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
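The ``NoSuchElementException`` above is raised because no element matches the selector (presumably the message is not shown on every visit). Unlike ``find_element``, the plural ``find_elements`` returns an empty list when nothing matches, which makes optional elements easier to handle. A minimal sketch follows; `click_if_present` is a hypothetical helper, and `FakeBrowser`/`FakeElement` are stand-ins that exist only so the example runs without a real browser.

```python
def click_if_present(browser, css_selector):
    """Click the first element matching css_selector; return False if there is none."""
    # "css selector" is the string value behind By.CSS_SELECTOR
    matches = browser.find_elements("css selector", css_selector)
    if not matches:
        return False
    matches[0].click()
    return True


# Stand-ins so the sketch runs without a real browser (purely illustrative)
class FakeElement:
    def __init__(self):
        self.clicked = False

    def click(self):
        self.clicked = True


class FakeBrowser:
    def __init__(self, elements):
        self.elements = elements  # maps selector -> list of elements

    def find_elements(self, by, value):
        return self.elements.get(value, [])


button = FakeElement()
fake = FakeBrowser({"button.close": [button]})
print(click_if_present(fake, "button.close"))    # True
print(click_if_present(fake, "button.missing"))  # False
```

With a real session, the call would be `click_if_present(browser, "button.close")`, and the notebook would continue even if the message is absent.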


In [40]:
# Locate the occupation search field and enter "data scientist"
import time

elem = browser.find_element(By.ID, "alv-multi-typeahead-portal.job-ad.search.query-panel.occupations.placeholder-0")
elem.send_keys('data scientis')
elem.send_keys('t') # The last character is sent separately
time.sleep(1) # Wait briefly for the page to react
elem = browser.find_element(By.ID, "ngb-typeahead-0-0")
elem.click() # To make black pop-up message about multiple search terms disappear
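The ID of the search field used above contains dots. That is fine for ``By.ID``, but it matters if you ever switch to a CSS selector, because in CSS `#a.b` means "an element with id `a` and class `b`". One possible workaround is an attribute selector. The sketch below uses a hypothetical helper `css_for_id`, which is not part of Selenium.

```python
def css_for_id(element_id):
    """Build a CSS attribute selector that matches an element by its exact id."""
    # Escape backslashes and double quotes so the value stays a valid CSS string
    escaped = element_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'[id="{escaped}"]'

selector = css_for_id("alv-multi-typeahead-portal.job-ad.search.query-panel.occupations.placeholder-0")
print(selector)
# [id="alv-multi-typeahead-portal.job-ad.search.query-panel.occupations.placeholder-0"]
```

The resulting string could then be passed to `browser.find_element(By.CSS_SELECTOR, selector)`.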

In [42]:
# Locate the search button
elem = browser.find_element(By.CSS_SELECTOR, "button[type='submit']")
elem.click() # Click on search button

In [43]:
# Fetch source code and parse it with BeautifulSoup
time.sleep(1)
html = browser.page_source
soup = BeautifulSoup(html)
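A fixed pause such as `time.sleep(1)` is either too short on a slow connection or needlessly long on a fast one. Selenium offers explicit waits (``WebDriverWait``) for this purpose; the idea behind them can be sketched in plain Python. The helper `wait_until` below is hypothetical and only illustrates the polling principle.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)

# Toy condition that becomes true on the third call (stands in for "element is present")
calls = {"n": 0}

def ready():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until(ready, timeout=2.0, interval=0.01))  # True
print(calls["n"])                                     # 3
```

With a browser session, the condition could be something like `lambda: browser.find_elements(By.CSS_SELECTOR, "span[data-test='resultCount']")`, which returns a non-empty (truthy) list once the result count is on the page.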

### Extracting elements

We have now fetched the page's source code and handed it over to ``BeautifulSoup()``. From here, we can extract the elements we are interested in.

In [44]:
# Print number of jobs
soup.select("span[data-test='resultCount']")[0].text

'9'

Nice, now let's try to extract all the links to the pages on the individual jobs and store them in a list. How many links do you get?

In [45]:
# Solution 1 (with the source code you have already downloaded)
links = soup.select("a.result-list-item")
urls = [link["href"] for link in links]
print(urls)
len(urls)

['/job-search/f0c4d218-2411-11f0-8c83-b28d11879f6f', '/job-search/fd399a11-087d-11f0-ae8d-82588dc8ab8d', '/job-search/f60a8224-211c-11f0-8c83-b28d11879f6f', '/job-search/a7173234-2046-11f0-8c83-b28d11879f6f', '/job-search/ea731a48-19f1-11f0-9127-8e2c7651dd26', '/job-search/eb604dee-1394-11f0-ae8d-82588dc8ab8d', '/job-search/86af0cf4-1391-11f0-ae8d-82588dc8ab8d', '/job-search/3ce1909a-138c-11f0-ae8d-82588dc8ab8d', '/job-search/7ad83129-fee1-11ef-b3a5-82588dc8ab8d']


9

In [46]:
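# Solution 2: With selenium in the browser
links = browser.find_elements(By.CSS_SELECTOR, "a[class='d-block result-list-item flex-height position-relative']")
# get_dom_attribute returns the href as written in the HTML (a relative link here);
# get_attribute("href") would return the resolved absolute URL instead
urls = [link.get_dom_attribute("href") for link in links]
print(urls)
len(urls)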

['/job-search/f0c4d218-2411-11f0-8c83-b28d11879f6f', '/job-search/fd399a11-087d-11f0-ae8d-82588dc8ab8d', '/job-search/f60a8224-211c-11f0-8c83-b28d11879f6f', '/job-search/a7173234-2046-11f0-8c83-b28d11879f6f', '/job-search/ea731a48-19f1-11f0-9127-8e2c7651dd26', '/job-search/eb604dee-1394-11f0-ae8d-82588dc8ab8d', '/job-search/86af0cf4-1391-11f0-ae8d-82588dc8ab8d', '/job-search/3ce1909a-138c-11f0-ae8d-82588dc8ab8d', '/job-search/7ad83129-fee1-11ef-b3a5-82588dc8ab8d']


9

---

We could also extract the job titles from here:

In [47]:
job_titles_soup = soup.select("a.result-list-item h3")
job_titles = [job.get_text() for job in job_titles_soup]
job_titles

['Hochschulpraktikant:in Data Science',
 'Scientist in Battery Modeling & Testing 60-100%',
 'Data Manager/in Kantonale Verwaltung',
 'Hochschulpraktikantin/-praktikant Data Analytics 80-100 %',
 'Data Managerin/ Manager Medizinische Qualitätsprogramme 80-90%',
 'Data Analyst – Wealth & Investment Management',
 'Data Analyst (f|m|d) (80-100%) - Zurich - Hybrid Work',
 'Biotechnology Data Scientist',
 'Teamleiter:in Insel Data Platform']

However, a better way might be to get the job-specific information directly from each of the job URLs.

---

### Crawling through a list of links

Let's assume we want to store information on all the jobs we found. Good practice would be to split the process into two steps:
 1. Loop through all URLs to fetch and store the HTML source code
 2. Extract the relevant information from the stored files

This way you do not need to bother the web server more than once. This is also important because you often do not know beforehand what you will actually need, and might otherwise end up making an unnecessarily large number of requests.

Let's start with step 1. To be able to store each html file we need to make sure of a few things:

 - create a new directory on the file system (where we can store the files)
 - add the base url in front of the links if they are relative links
 - think about the filenames to use
 - use a file handler that saves the pages to the filesystem
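The preparation steps listed above can be sketched with the standard library. This is only one possible approach: the helper names `absolute_url` and `filename_for` are hypothetical, and the directory is created in a temporary location so the example does not touch your project folder.

```python
from pathlib import Path
from urllib.parse import urljoin
import tempfile

BASE_URL = "https://www.job-room.ch"

def absolute_url(link, base=BASE_URL):
    """Turn a relative link into an absolute URL (absolute links are returned unchanged)."""
    return urljoin(base, link)

def filename_for(link):
    """Use the last path segment (the job ID) as the file name."""
    job_id = link.rstrip("/").split("/")[-1]
    return job_id + ".html"

link = "/job-search/f0c4d218-2411-11f0-8c83-b28d11879f6f"
print(absolute_url(link))  # https://www.job-room.ch/job-search/f0c4d218-2411-11f0-8c83-b28d11879f6f
print(filename_for(link))  # f0c4d218-2411-11f0-8c83-b28d11879f6f.html

# Create the target directory if it does not exist yet (a temporary location here)
target_dir = Path(tempfile.mkdtemp()) / "jobroom"
target_dir.mkdir(parents=True, exist_ok=True)
print(target_dir.is_dir())  # True
```

File names based on the job ID stay stable across runs, whereas names based on the position in the result list change whenever the search results change.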


><font color = ffffff>SIDENOTE ON FILE HANDLERS: We can use Python's built-in ``open`` function to write to (or read from) different types of files (html, txt, csv etc.). To create a new file, you must first open it in *write* mode (``w``) using the ``open`` function. Then you can use the *write* method to write into your file:
>```python
my_file = open('page0.html', 'w')  # Open a (new) file in write (w) mode
my_file.write(myString)             # Write to file
my_file.close()                     # Close file
```

><font color = ffffff> A more elegant way to do the same looks as follows (closes the file automatically after the ``with`` block):
>```python
with open('page0.html', 'w') as my_file:
  my_file.write(myString)
```
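Job pages often contain non-ASCII characters such as German umlauts, so it is safer to state the encoding explicitly when writing and reading. The following sketch shows a write/read round trip; the sample text and the temporary file location are chosen only for illustration.

```python
import tempfile
from pathlib import Path

text = "<h3>Data Managerin/ Manager Medizinische Qualitätsprogramme</h3>"
path = Path(tempfile.mkdtemp()) / "page0.html"

# Write with an explicit encoding so characters such as "ä" are stored reliably
with open(path, "w", encoding="utf-8") as my_file:
    my_file.write(text)

# Read the file back with the same encoding
with open(path, encoding="utf-8") as my_file:
    restored = my_file.read()

print(restored == text)  # True
```

Without the `encoding` argument, Python falls back to a platform-dependent default, which can differ between Windows and other systems.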

So let's change to the directory we created and then loop through all urls. For each job, we will append the url to the base url, fetch the source code and save it to our computer. For simplicity, we will save the files as job0, job1 etc. To get the index in each loop, we can use the ``enumerate()`` function.

In [48]:
import os
#os.chdir("C:/Users/farys/Documents/jobroom")
os.chdir("../4_Data/jobroom") # was created manually before! Use os.mkdir(path) to create a directory from within Python

# Loop through all urls
for i, url in enumerate(urls):
    url = "https://www.job-room.ch" + url # Add base url in front (the links already start with "/")
    browser.get(url) # Go to page
    time.sleep(1) # maybe optional
    with open('job' + str(i) + '.html', 'w', encoding="utf-8") as file: # Open file in write mode
        file.write(browser.page_source) # Write source code of each page into file
    print("Stored page for job " + str(i))

Stored page for job 0
Stored page for job 1
Stored page for job 2
Stored page for job 3
Stored page for job 4
Stored page for job 5
Stored page for job 6
Stored page for job 7
Stored page for job 8


><font color = ffffff> SIDENOTE: The website might realize that you are not a normal user and ask you to verify that you are human. To make your scraper more robust to such problems (and make sure it is less of a burden to the page you are scraping), you could add some sleep time in each iteration. For example, ``time.sleep(10)`` (from the ``time`` module) would tell Python to wait for 10 seconds in each iteration.
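A pause between requests can be wrapped in a small helper. Adding a random extra on top of the base delay is an assumption here (a common convention, not something the site requires), and the name `polite_pause` is hypothetical.

```python
import random
import time

def polite_pause(base=10.0, jitter=5.0, sleep=time.sleep):
    """Wait for `base` seconds plus a random extra of up to `jitter` seconds.

    Returns the delay that was used. The `sleep` argument can be replaced for testing.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay

# Demonstration with tiny values so it finishes quickly
used = polite_pause(base=0.01, jitter=0.01)
print(round(used, 3))
```

Inside the download loop, a call such as `polite_pause()` would take the place of `time.sleep(1)`.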


### Extracting information from html files
Now that we have stored the job pages separately, we can proceed to step 2 and start working with them. We would like to create a nice pandas DataFrame with information on all the jobs we found. Suppose we are interested in the job title, the employer and the job description. Let's define a function that extracts these elements and returns them in a list:

In [49]:
# define a function that extracts the elements we want from the files with source code for each page
def getStuff(page):
    with open(page, encoding = "utf-8") as file:
        content = file.read() # Open file in read mode and assign to variable "content"

    soup = BeautifulSoup(content)

    # extract elements we like
    title = soup.select(".job-title")[0].get_text()
    company = soup.select(".job-company")[0].get_text()
    description = soup.select(".job-description")[0].get_text()
    return [title, company, description]

Now, let's loop through all the html files, apply our function and write everything into a nested list:

In [50]:
# Get names of all files
pages = os.listdir()
pages

['job0.html',
 'job1.html',
 'job2.html',
 'job3.html',
 'job4.html',
 'job5.html',
 'job6.html',
 'job7.html',
 'job8.html']
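Note that ``os.listdir()`` returns every entry in the directory, in arbitrary order, so stray files would also end up in the list. A minimal sketch of a stricter listing follows; the helper name `list_job_pages` is hypothetical, and it assumes the `job<number>.html` naming used above.

```python
import re
import tempfile
from pathlib import Path

def list_job_pages(directory="."):
    """Return the job<number>.html files in `directory`, sorted by their number."""
    pattern = re.compile(r"job(\d+)\.html")
    numbered = []
    for path in Path(directory).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            numbered.append((int(match.group(1)), path.name))
    return [name for _, name in sorted(numbered)]

# Demonstration in a temporary directory with a few empty files
demo_dir = Path(tempfile.mkdtemp())
for name in ["job10.html", "job2.html", "job0.html", "notes.txt"]:
    (demo_dir / name).touch()

print(list_job_pages(demo_dir))  # ['job0.html', 'job2.html', 'job10.html']
```

Sorting by the number rather than by the string keeps `job10.html` after `job2.html`.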

In [51]:
with open('job0.html', encoding = "utf-8") as file:
    content = file.read()

soup = BeautifulSoup(content)

title = soup.select(".job-title")[0].get_text()
company = soup.select(".job-company")[0].get_text()
description = soup.select(".job-description")[0].get_text()

In [53]:
# Get names of all files
pages = os.listdir()

# Loop through html files
job_summary = []
for page in pages:
    try:
        job_summary.append(getStuff(page))
    except Exception as err: # e.g. a missing element or an unreadable file
        print("Problem with file", page, "-", err)

job_summary

[['Hochschulpraktikant:in Data Science',
  'Post CH AG',
  '80 - 100%, Bern und Homeoffice\nHast du kürzlich dein Masterstudium abgeschlossen und möchtest in einem 12-monatigen Praktikum im Bereich Data Science durchstarten? Dann bist du bei uns genau richtig! Du bekommst die Möglichkeit, in einem dynamischen Team an spannenden Projekten zu arbeiten und deine Fähigkeiten im Bereich Data Science weiter auszubauen. Mit dir werden Daten zu Insights und Modelle zum Erfolg.\nDas kannst du bewirken\n\nDu sammelst, bereinigst und bereitest Kundendaten auf und führst explorative Analysen durch, um Erkenntnisse für die Segmentierung zu gewinnen.\nDu entwickelst und bewertest Machine-Learning-Modelle -- beispielsweise zur Kundensegmentierung oder Churn-Vorhersage. Gemeinsam mit dem Kampagnenmanagement arbeitest du zudem an der datenbasierten Optimierung der Leadausspielung.\nZur wirkungsvollen Darstellung der Analyseergebnisse erstellst du Visualisierungen und Dashboards, du entwickelst neue Rep

In [55]:
!pip install itables

Collecting itables
  Downloading itables-2.3.0-py3-none-any.whl.metadata (8.6 kB)
Downloading itables-2.3.0-py3-none-any.whl (2.3 MB)
   ---------------------------------------- 0.0/2.3 MB ? eta -:--:--
   ---------------------------------------- 2.3/2.3 MB 14.5 MB/s eta 0:00:00
Installing collected packages: itables
Successfully installed itables-2.3.0





Finally, we can convert our nested list into a pandas DataFrame.

In [56]:
import pandas as pd
df = pd.DataFrame.from_records(job_summary, columns = ["title", "company", "description"])

from itables import show
show(df)



### What to do from here?

Now that you have a first impression of what ``selenium`` is capable of, there is much more to learn! You could extend or fine-tune the example in different ways:

 * make sure that each site is downloaded correctly and only once:
   - check if file exists on the system and is not empty (e.g. if exists(somefile): skip downloading)
   - use better filenames, e.g. based on some ID to make this check easier
   - introduce a short wait time in your loop if necessary (selenium can even wait/check until a certain element is present)
 * extract other elements:
   - skills
   - full texts
   - working hours
   - ...
 * Split your code into two separate scripts (one for data collection and one for data processing)
 * Think about a strategy to extend the search terms and/or locations
 * Make your scraper check for new job openings once a week, e.g. using cron jobs or the task scheduler
 * ...
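The first suggestion above (download each page only once) could be sketched as follows. The helper name `needs_download` is hypothetical, and the files are created in a temporary directory purely for demonstration.

```python
import tempfile
from pathlib import Path

def needs_download(path):
    """Return True if the file is missing or empty, i.e. the page should be fetched (again)."""
    path = Path(path)
    return not path.is_file() or path.stat().st_size == 0

demo_dir = Path(tempfile.mkdtemp())
missing = demo_dir / "job-a.html"
empty = demo_dir / "job-b.html"
stored = demo_dir / "job-c.html"
empty.touch()
stored.write_text("<html>...</html>", encoding="utf-8")

print(needs_download(missing))  # True
print(needs_download(empty))    # True
print(needs_download(stored))   # False
```

In the download loop, you would then skip `browser.get(url)` whenever `needs_download(...)` returns `False` for the corresponding file.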

You might like to check out the following tutorials for more ideas: https://youtube.com/playlist?list=PLzMcBGfZo4-n40rB1XaJ0ak1bemvlqumQ