# **Webscraping of Pracuj.pl**
**Zofia Broszczak** 446277

**Anna Lorenz** 429840

The goal of this project was to collect job offer data and build a dataset specifically for Data Science Master’s students who are looking for a job in their field. The project gathered job offer information from https://www.pracuj.pl, focusing on roles relevant to three key areas within the IT sector: Data Science, Business Analytics and AI/ML.

The resulting dataset can be used for various purposes, such as analyzing current job market trends, understanding employer requirements and preparing for applications. This project was carried out solely for academic purposes and will not be used for any commercial gain.


## Ethical considerations, robots.txt and Terms of Use
The first step was to check whether scraping job offer information was permitted. 

The robots.txt file from pracuj.pl (https://www.pracuj.pl/robots.txt) prohibits agents from accessing technical directories such as stylesheets, scripts, images, templates, user accounts (/konto/) and internal media folders. However, job offer listings are not disallowed.

Therefore, scraping job offer information was allowed under the site's robots.txt, as long as we avoided the restricted technical or user-specific paths.
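
The same check can be scripted with Python's standard `urllib.robotparser`. The sketch below is only an illustration: the rule excerpt contains just the `/konto/` path mentioned above rather than the live file, and the two example URLs and the `is_allowed` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative excerpt only: /konto/ is the user-account path named above.
# The live file at https://www.pracuj.pl/robots.txt contains more rules.
ROBOTS_EXCERPT = """\
User-agent: *
Disallow: /konto/
"""


def is_allowed(url, robots_text=ROBOTS_EXCERPT, user_agent="*"):
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical example paths, used only to exercise the rules above.
print(is_allowed("https://www.pracuj.pl/praca/example-offer"))
print(is_allowed("https://www.pracuj.pl/konto/settings"))
```

To test against the live file instead of an excerpt, `RobotFileParser` also offers `set_url()` followed by `read()`, at the cost of one extra request to the server.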


Because the robots.txt file is not legally binding, we also checked whether the Terms of Use prohibit scraping.
The only Terms of Service document we found covers the provision of services on the website (https://dlafirm.pracuj.pl/static/terms/Regulamin_swiadczenia_Uslug_w_ramach_Sklepu_20241118.pdf). The other related documents (Privacy Policy and Your Rights) contain no information about scraping or any related concepts or keywords.

Based on this, we concluded that scraping job offers for academic and non-commercial purposes is acceptable. To remain on the safe side, we also ensured that we complied with the guidelines outlined in robots.txt and did not place unnecessary load on the server or access private or sensitive parts of the site.

## Project structure
Each code cell in the project includes both the necessary code from earlier sections and newly introduced code. This allows every cell to be executed independently. As a result, the final cell in the *Dynamic Scraping* section contains the complete code for that section.

To improve readability, newly added code is clearly separated from reused code using the following visual divider: '#|||||||||||||||||||||||||||||||||||||||||'.


## Importing necessary packages

In [1]:
# FOR DATA PROCESSING:
import pandas as pd
import numpy as np

# FOR MEASURING COMPUTATION TIME, CREATING FIXED DELAYS:
import time

# FOR APPLYING SELENIUM:
import selenium
from selenium import webdriver

from webdriver_manager.firefox import GeckoDriverManager

# the Firefox driver needs the Firefox (geckodriver) Service class, not the Chrome one
from selenium.webdriver.firefox.service import Service

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# FOR APPLYING BEAUTIFULSOUP:
import requests
from bs4 import BeautifulSoup

# FOR USING REGEX:
import re

# FOR SAVING DATA:
import pickle

def save_object(obj, filename):
    with open(filename, 'wb') as output:
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

In [2]:
firefoxpath = GeckoDriverManager().install() # downloads geckodriver if needed and returns its path
print(firefoxpath)

C:\Users\zosia\.wdm\drivers\geckodriver\win64\v0.36.0\geckodriver.exe


## Dynamic scraping

The first part of our project focuses on dynamic scraping.

We first start a Firefox browser using Selenium and open it with default settings.
We then maximize the browser window so that all elements on the page are easy to see and interact with. Finally, we use the .get() method to load the main content of the website.

In [3]:
website = 'https://www.pracuj.pl'

service_firefox = Service(executable_path = firefoxpath) 
options_firefox = webdriver.FirefoxOptions()
driver_firefox = webdriver.Firefox(service = service_firefox, options = options_firefox) # opens Firefox

driver_firefox.maximize_window() # maximizes browser's window
driver_firefox.get(website) # opens a website
driver_firefox.close()

When the website opens, a cookie banner appears. The script waits up to 30 seconds for the 'Accept cookies' button to show up on the page.

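Before clicking the button, the script pauses for a short, random amount of time to create human-like delays. Each pause is a fixed 3 seconds plus a draw from a chi-square distribution with one degree of freedom. This distribution is strongly right-skewed, so pauses average about 4 seconds and long pauses are rare.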

After the wait, the script clicks the button to accept the cookies.
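
To give a feel for this delay, here is a minimal sketch that samples it many times. The notebook itself uses `np.random.chisquare(1) + 3`. The sketch swaps in the standard library instead, relying on the fact that a chi-square variable with one degree of freedom is a squared standard normal. The helper name `human_delay` and its parameters are our own illustration.

```python
import random


def human_delay(base=3.0, rng=random):
    """Return base seconds plus a chi-square(1) draw (a squared standard normal)."""
    return base + rng.gauss(0.0, 1.0) ** 2


# Sample the delay with a fixed seed to inspect its range and average.
rng = random.Random(42)
samples = [human_delay(rng=rng) for _ in range(10_000)]
mean_delay = sum(samples) / len(samples)
print(f"min={min(samples):.2f}s mean={mean_delay:.2f}s max={max(samples):.2f}s")
```

The delay can never drop below the 3-second base, while the long right tail occasionally produces much longer pauses.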

In [4]:
website = 'https://www.pracuj.pl'

service_firefox = Service(executable_path = firefoxpath) 
options_firefox = webdriver.FirefoxOptions()
driver_firefox = webdriver.Firefox(service = service_firefox, options = options_firefox)

driver_firefox.maximize_window()
driver_firefox.get(website)

cookies_button_xpath = '''/html/body/div[1]/div[12]/div/div/div[3]/div/button[1]'''

# ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

# waits at most 30 seconds until cookies button is visible
WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, cookies_button_xpath))) 

# + waits a random time drawn from a strongly right-skewed distribution to better imitate human behavior
time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', cookies_button_xpath) # finds the button
content.click() # clicks the button
time.sleep(3)

driver_firefox.close()

### Filtering offers

Filtering starts with clicking the 'IT' button, typing 'Warszawa' as the location and selecting three key areas within the IT sector: 'Big Data / Data Science', 'Business analytics' and 'AI/ML'. Between actions, the script pauses for a random time drawn from a chi-square distribution to simulate human behavior and avoid detection.
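
Every click in this section follows the same pattern: wait for the element, pause, click, pause again. As a possible way to see that pattern at a glance, the sketch below lists the click-only steps as data and runs them through one loop. The XPaths are copied from the cells in this section. `run_steps` and its `click` and `pause` parameters are hypothetical names, and the dry run records XPaths instead of driving a browser.

```python
import time

# (label, XPath) pairs copied from the cells in this section.
FILTER_STEPS = [
    ("IT", "/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/div[1]/div/div[1]/button[2]"),
    ("Big Data / Data Science", "/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[8]"),
    ("Business analytics", "/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[17]"),
    ("AI/ML", "/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[21]"),
]


def run_steps(steps, click, pause=time.sleep, settle=3.0):
    """Call click(xpath) for each step, then pause `settle` seconds; return the labels done."""
    done = []
    for label, xpath in steps:
        click(xpath)
        pause(settle)
        done.append(label)
    return done


# Dry run with stand-ins: collect the XPaths and skip the real pauses.
clicked = []
completed = run_steps(FILTER_STEPS, click=clicked.append, pause=lambda seconds: None)
print(completed)
```

In the notebook, `click` would be a small function performing the `WebDriverWait(...).until(...)`, random delay, `find_element` and `.click()` sequence for one XPath. Typing 'Warszawa' is not a plain click, so it would stay a separate step.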

In [5]:
website = 'https://www.pracuj.pl'

service_firefox = Service(executable_path = firefoxpath) 
options_firefox = webdriver.FirefoxOptions()
driver_firefox = webdriver.Firefox(service = service_firefox, options = options_firefox)

driver_firefox.maximize_window()
driver_firefox.get(website)

cookies_button_xpath = '''/html/body/div[1]/div[12]/div/div/div[3]/div/button[1]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, cookies_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', cookies_button_xpath)
content.click()
time.sleep(3)

# ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

# Clicking IT button -----------------------------------------------------------------------------------------------------------
IT_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/div[1]/div/div[1]/button[2]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, IT_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', IT_button_xpath)
content.click()
time.sleep(3)

# Typing Warszawa --------------------------------------------------------------------------------------------------------------
location_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[1]/div[2]/div/div/div/input'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, location_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', location_button_xpath)
content.click()
time.sleep(np.random.chisquare(1)+3)

location_box = WebDriverWait(driver_firefox, 10).until(
    EC.element_to_be_clickable((By.XPATH, location_button_xpath))
)

location_box.send_keys('Warszawa') # types 'Warszawa'
location_box.send_keys(Keys.RETURN) # simulates pressing Enter

# Clicking Big Data/ Data Science button-----------------------------------------------------------------------------------------
data_science_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[8]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, data_science_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', data_science_button_xpath)
content.click()
time.sleep(3)

# Clicking Business Analytics button---------------------------------------------------------------------------------------------
business_analytics_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[17]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, business_analytics_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', business_analytics_button_xpath)
content.click()
time.sleep(3)

# Clicking AI/ML button----------------------------------------------------------------------------------------------------------
AI_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[21]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, AI_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', AI_button_xpath)
content.click()
time.sleep(3)

driver_firefox.close()

The second step is choosing the Intern, Junior and Mid/Regular positions from the drop-down list. This was more complicated than the earlier step, because the list only becomes visible after clicking the 'position' button, and that button was not visible to the script until the page was scrolled down. Scrolling to the exact position of the button was not the best choice, because a Google sign-in window then covered the list. We could not close this window, as it was not visible from the script's perspective, so instead we scrolled down by 350 pixels, which ensured that the Google sign-in window did not cover the list. After that there were no more obstacles, so we could choose the positions.

We were interested only in hybrid and office work positions, so the next step was choosing those work modes. This step was carried out analogously to the previous one.

In [6]:
website = 'https://www.pracuj.pl'

service_firefox = Service(executable_path = firefoxpath) 
options_firefox = webdriver.FirefoxOptions()
driver_firefox = webdriver.Firefox(service = service_firefox, options = options_firefox)

driver_firefox.maximize_window()
driver_firefox.get(website)

cookies_button_xpath = '''/html/body/div[1]/div[12]/div/div/div[3]/div/button[1]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, cookies_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', cookies_button_xpath)
content.click()
time.sleep(3)

# Clicking IT button -----------------------------------------------------------------------------------------------------------
IT_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/div[1]/div/div[1]/button[2]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, IT_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', IT_button_xpath)
content.click()
time.sleep(3)

# Typing Warszawa --------------------------------------------------------------------------------------------------------------
location_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[1]/div[2]/div/div/div/input'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, location_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', location_button_xpath)
content.click()
time.sleep(np.random.chisquare(1)+3)

location_box = WebDriverWait(driver_firefox, 10).until(
    EC.element_to_be_clickable((By.XPATH, location_button_xpath))
)

location_box.send_keys('Warszawa')
location_box.send_keys(Keys.RETURN)

# Clicking Big Data/ Data Science button-----------------------------------------------------------------------------------------
data_science_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[8]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, data_science_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', data_science_button_xpath)
content.click()
time.sleep(3)

# Clicking Business Analytics button---------------------------------------------------------------------------------------------
business_analytics_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[17]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, business_analytics_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', business_analytics_button_xpath)
content.click()
time.sleep(3)

# Clicking AI/ML button----------------------------------------------------------------------------------------------------------
AI_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[21]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, AI_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', AI_button_xpath)
content.click()
time.sleep(3)

# ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

# Clicking position button ------------------------------------------------------------------------------------------------------
position_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[3]/div[1]/div/div/button/span[1]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, position_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

driver_firefox.execute_script('window.scrollBy(0, 350);') # scrolls down 350 pixels
time.sleep(3)

content = driver_firefox.find_element('xpath', position_button_xpath)
content.click()
time.sleep(3)

# Ticking Intern box -------------------------------------------------------------------------------------------------------------
intern_button_xpath = '''/html/body/div[2]/div/div/ul/li[1]/label/span[2]/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, intern_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', intern_button_xpath)
content.click()
time.sleep(3)

# Ticking Junior box -------------------------------------------------------------------------------------------------------------
junior_button_xpath = '''/html/body/div[2]/div/div/ul/li[3]/label/span[2]/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, junior_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', junior_button_xpath)
content.click()
time.sleep(3)

# Ticking Mid/Regular box ---------------------------------------------------------------------------------------------------------
mid_button_xpath = '''/html/body/div[2]/div/div/ul/li[4]/label/span[2]/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, mid_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', mid_button_xpath)
content.click()
time.sleep(3)

# Clicking work mode button -------------------------------------------------------------------------------------------------------
work_mode_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[3]/div[4]/div/div/button'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, work_mode_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', work_mode_button_xpath)
content.click()
time.sleep(3)

# Ticking Office work box ----------------------------------------------------------------------------------------------------------
office_button_xpath = '''/html/body/div[2]/div/div/ul/li[1]/label/span[2]/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, office_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', office_button_xpath)
content.click()
time.sleep(3)

# Ticking Hybrid work box -----------------------------------------------------------------------------------------------------------
hybrid_button_xpath = '''/html/body/div[2]/div/div/ul/li[2]/label/span[2]/span/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, hybrid_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', hybrid_button_xpath)
content.click()
time.sleep(3)

# Clicking Search button ------------------------------------------------------------------------------------------------------------
search_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/button'''

WebDriverWait(driver_firefox, 10).until(EC.visibility_of_element_located((By.XPATH, search_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', search_button_xpath)
content.click()
time.sleep(3)

driver_firefox.close()

After clicking 'Search', a pracuj.pl banner asking about location sharing sometimes appears. It showed up only after the search button was clicked, and only a few times during our runs. To stay on the safe side, we kept a step that waits up to 15 seconds for the banner. If the banner appears, it is closed. If it does not, the script moves on after 15 seconds.
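
The "wait, then close or move on" logic can be sketched without a browser as follows. This is a simplified stand-in, not the notebook's code: `wait_until` only mimics the polling behaviour of Selenium's `WebDriverWait`, raising Python's built-in `TimeoutError` where Selenium would raise its own `TimeoutException`. `dismiss_if_present`, `find_banner` and `close_banner` are hypothetical names.

```python
import time


def wait_until(condition, timeout, poll=0.5, clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it is truthy; raise TimeoutError after `timeout` seconds."""
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        sleep(poll)


def dismiss_if_present(find_banner, close_banner, timeout=15):
    """Close an optional banner if it shows up in time; report whether it did."""
    try:
        banner = wait_until(find_banner, timeout)
    except TimeoutError:
        return False  # the banner never appeared, so there is nothing to close
    close_banner(banner)
    return True


# Stand-ins for the browser: one run where the banner exists, one where it does not.
closed = []
print(dismiss_if_present(lambda: "banner", closed.append, timeout=0))
print(dismiss_if_present(lambda: None, closed.append, timeout=0))
```

Catching only the timeout, rather than every exception, keeps genuine errors (for example a failed click) from being silently reported as "banner not found".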

In [7]:
website = 'https://www.pracuj.pl'

service_firefox = Service(executable_path = firefoxpath) 
options_firefox = webdriver.FirefoxOptions()
driver_firefox = webdriver.Firefox(service = service_firefox, options = options_firefox)

driver_firefox.maximize_window()
driver_firefox.get(website)

cookies_button_xpath = '''/html/body/div[1]/div[12]/div/div/div[3]/div/button[1]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, cookies_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', cookies_button_xpath)
content.click()
time.sleep(3)

# Clicking IT button -----------------------------------------------------------------------------------------------------------
IT_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/div[1]/div/div[1]/button[2]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, IT_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', IT_button_xpath)
content.click()
time.sleep(3)

# Typing Warszawa --------------------------------------------------------------------------------------------------------------
location_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[1]/div[2]/div/div/div/input'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, location_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', location_button_xpath)
content.click()
time.sleep(np.random.chisquare(1)+3)

location_box = WebDriverWait(driver_firefox, 10).until(
    EC.element_to_be_clickable((By.XPATH, location_button_xpath))
)

location_box.send_keys('Warszawa')
location_box.send_keys(Keys.RETURN)

# Clicking Big Data/ Data Science button-----------------------------------------------------------------------------------------
data_science_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[8]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, data_science_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', data_science_button_xpath)
content.click()
time.sleep(3)

# Clicking Business Analytics button---------------------------------------------------------------------------------------------
business_analytics_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[17]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, business_analytics_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', business_analytics_button_xpath)
content.click()
time.sleep(3)

# Clicking AI/ML button----------------------------------------------------------------------------------------------------------
AI_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[21]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, AI_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', AI_button_xpath)
content.click()
time.sleep(3)

# Clicking position button ------------------------------------------------------------------------------------------------------
position_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[3]/div[1]/div/div/button/span[1]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, position_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

driver_firefox.execute_script('window.scrollBy(0, 350);')
time.sleep(3)

content = driver_firefox.find_element('xpath', position_button_xpath)
content.click()
time.sleep(3)

# Ticking Intern box -------------------------------------------------------------------------------------------------------------
intern_button_xpath = '''/html/body/div[2]/div/div/ul/li[1]/label/span[2]/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, intern_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', intern_button_xpath)
content.click()
time.sleep(3)

# Ticking Junior box -------------------------------------------------------------------------------------------------------------
junior_button_xpath = '''/html/body/div[2]/div/div/ul/li[3]/label/span[2]/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, junior_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', junior_button_xpath)
content.click()
time.sleep(3)

# Ticking Mid/Regular box ---------------------------------------------------------------------------------------------------------
mid_button_xpath = '''/html/body/div[2]/div/div/ul/li[4]/label/span[2]/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, mid_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', mid_button_xpath)
content.click()
time.sleep(3)

# Clicking work mode button -------------------------------------------------------------------------------------------------------
work_mode_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[3]/div[4]/div/div/button'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, work_mode_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', work_mode_button_xpath)
content.click()
time.sleep(3)

# Ticking Office work box ----------------------------------------------------------------------------------------------------------
office_button_xpath = '''/html/body/div[2]/div/div/ul/li[1]/label/span[2]/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, office_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', office_button_xpath)
content.click()
time.sleep(3)

# Ticking Hybrid work box -----------------------------------------------------------------------------------------------------------
hybrid_button_xpath = '''/html/body/div[2]/div/div/ul/li[2]/label/span[2]/span/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, hybrid_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', hybrid_button_xpath)
content.click()
time.sleep(3)

# Clicking Search button ------------------------------------------------------------------------------------------------------------
search_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/button'''

WebDriverWait(driver_firefox, 10).until(EC.visibility_of_element_located((By.XPATH, search_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', search_button_xpath)
content.click()
time.sleep(3)

# ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

# Closing the location sharing button -----------------------------------------------------------------------------------------------
loc_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/button''' # XPath for the location selection button (that may or may not appear)

try:
    WebDriverWait(driver_firefox, 15).until(EC.visibility_of_element_located((By.XPATH, loc_button_xpath)))

    time.sleep(np.random.chisquare(1)+3)

    content = driver_firefox.find_element('xpath', loc_button_xpath)
    content.click()
    time.sleep(3)

except Exception: # avoids a bare except, which would also swallow KeyboardInterrupt
    print('Location button not found, continuing.') # continues if the button does not appear


# waits at most 5 seconds until the offer links on the reloaded webpage are visible
WebDriverWait(driver_firefox, 5).until(EC.visibility_of_element_located((By.XPATH, '''//a[contains(@data-test,'link-offer')]'''))) 
time.sleep(np.random.chisquare(1)+3)

driver_firefox.close()

Location button not found, continuing.


### Final cell for dynamic scraping containing the whole code from this section
The next step is to collect links to individual job offer subpages. To gather all the links on a given page, we use a loop that scrapes each offer’s URL. After completing one page, the 'Next' button is clicked to move to the next page, and the process repeats until the 'Next' button is no longer visible.
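
The control flow of this loop can be sketched independently of the browser. In the example below, `get_links` and `go_next` are hypothetical callables standing in for the Selenium calls (reading the offer links on the current page and clicking 'Next'). The fake pages, the de-duplication and the `max_pages` safety cap are assumptions added for illustration.

```python
def collect_offer_links(get_links, go_next, max_pages=100):
    """Gather unique links page by page until go_next() reports no next page."""
    seen = set()
    ordered = []
    for _ in range(max_pages):
        for url in get_links():
            if url not in seen:  # keep the first occurrence, preserve order
                seen.add(url)
                ordered.append(url)
        if not go_next():
            break
    return ordered


# Stand-in for the browser: three fake result pages with one repeated offer.
pages = [["/offer/1", "/offer/2"], ["/offer/2", "/offer/3"], ["/offer/4"]]
state = {"page": 0}


def get_links():
    """Return the links on the current fake page."""
    return pages[state["page"]]


def go_next():
    """Move to the next fake page; return False when there is none."""
    if state["page"] + 1 >= len(pages):
        return False
    state["page"] += 1
    return True


links = collect_offer_links(get_links, go_next)
print(links)
```

Returning `False` from `go_next` plays the role of the 'Next' button no longer being visible, which is what ends the loop in the description above.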

In [18]:
start1 = time.time()

website = 'https://www.pracuj.pl'

service_firefox = Service(executable_path = firefoxpath) 
options_firefox = webdriver.FirefoxOptions()
driver_firefox = webdriver.Firefox(service = service_firefox, options = options_firefox)

driver_firefox.maximize_window()
driver_firefox.get(website)

cookies_button_xpath = '''/html/body/div[1]/div[12]/div/div/div[3]/div/button[1]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, cookies_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', cookies_button_xpath)
content.click()
time.sleep(3)

# Clicking IT button -----------------------------------------------------------------------------------------------------------
IT_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[1]/div[1]/div[1]/div/div[1]/button[2]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, IT_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', IT_button_xpath)
content.click()
time.sleep(3)

# Typing Warszawa --------------------------------------------------------------------------------------------------------------
location_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[1]/div[2]/div/div/div/input'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, location_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', location_button_xpath)
content.click()
time.sleep(3)


location_box = WebDriverWait(driver_firefox, 10).until(
    EC.element_to_be_clickable((By.XPATH, location_button_xpath))
)

location_box.send_keys('Warszawa')
time.sleep(np.random.chisquare(1)+3)
location_box.send_keys(Keys.RETURN)

# Clicking Big Data/ Data Science button-----------------------------------------------------------------------------------------
data_science_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[8]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, data_science_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', data_science_button_xpath)
content.click()
time.sleep(3)

# Clicking Business Analytics button---------------------------------------------------------------------------------------------
business_analytics_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[17]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, business_analytics_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', business_analytics_button_xpath)
content.click()
time.sleep(3)

# Clicking AI/ML button----------------------------------------------------------------------------------------------------------
AI_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[1]/div/div[2]/span[21]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, AI_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', AI_button_xpath)
content.click()
time.sleep(3)

# Clicking position button ------------------------------------------------------------------------------------------------------
position_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[3]/div[1]/div/div/button/span[1]'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, position_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

driver_firefox.execute_script('window.scrollBy(0, 350);')
time.sleep(3)

content = driver_firefox.find_element('xpath', position_button_xpath)
content.click()
time.sleep(3)

# Ticking Intern box -------------------------------------------------------------------------------------------------------------
intern_button_xpath = '''/html/body/div[2]/div/div/ul/li[1]/label/span[2]/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, intern_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', intern_button_xpath)
content.click()
time.sleep(3)

# Ticking Junior box -------------------------------------------------------------------------------------------------------------
junior_button_xpath = '''/html/body/div[2]/div/div/ul/li[3]/label/span[2]/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, junior_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', junior_button_xpath)
content.click()
time.sleep(3)

# Ticking Mid/Regular box ---------------------------------------------------------------------------------------------------------
mid_button_xpath = '''/html/body/div[2]/div/div/ul/li[4]/label/span[2]/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, mid_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', mid_button_xpath)
content.click()
time.sleep(3)

# Clicking work mode button -------------------------------------------------------------------------------------------------------
work_mode_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/div[3]/div[4]/div/div/button'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, work_mode_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', work_mode_button_xpath)
content.click()
time.sleep(3)

# Ticking Office work box ----------------------------------------------------------------------------------------------------------
office_button_xpath = '''/html/body/div[2]/div/div/ul/li[1]/label/span[2]/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, office_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', office_button_xpath)
content.click()
time.sleep(3)

# Ticking Hybrid work box -----------------------------------------------------------------------------------------------------------
hybrid_button_xpath = '''/html/body/div[2]/div/div/ul/li[2]/label/span[2]/span/span/span'''

WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, hybrid_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', hybrid_button_xpath)
content.click()
time.sleep(3)

# Clicking Search button ------------------------------------------------------------------------------------------------------------
search_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/button'''

WebDriverWait(driver_firefox, 10).until(EC.visibility_of_element_located((By.XPATH, search_button_xpath))) 

time.sleep(np.random.chisquare(1)+3)

content = driver_firefox.find_element('xpath', search_button_xpath)
content.click()
time.sleep(3)

# Closing the location sharing button -----------------------------------------------------------------------------------------------
# NOTE: this XPath is identical to search_button_xpath above
loc_button_xpath = '''/html/body/div[1]/div[5]/div[2]/div[2]/div/div[2]/button'''

try:
    WebDriverWait(driver_firefox, 15).until(EC.visibility_of_element_located((By.XPATH, loc_button_xpath)))

    time.sleep(np.random.chisquare(1)+3)

    content = driver_firefox.find_element('xpath', loc_button_xpath)
    content.click()
    time.sleep(3)

except Exception:
    print('Location button not found, continuing.')

WebDriverWait(driver_firefox, 5).until(EC.visibility_of_element_located((By.XPATH, '''//a[contains(@data-test,'link-offer')]'''))) 
time.sleep(np.random.chisquare(1)+3)

# ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

# Collecting links ------------------------------------------------------------------------------------------------------------------
hrefs = []
next_button_xpath = '''//button[@data-test='bottom-pagination-button-next']'''

while True:
    try:
        tags = driver_firefox.find_elements('xpath','''//a[contains(@data-test,'link-offer')]''') # finds all elements with offer links
        
        for tag in tags:
            href = tag.get_attribute('href') # extracts links from each <a> element
            if href and href not in hrefs: # checks if the link is not already in the list
                hrefs.append(href)

        WebDriverWait(driver_firefox, 30).until(EC.visibility_of_element_located((By.XPATH, next_button_xpath))) 

        time.sleep(np.random.chisquare(1)+3)

        content = driver_firefox.find_element('xpath', next_button_xpath)
        content.click()
        time.sleep(3)
        
    except Exception:
        break # stops once the 'Next' button can no longer be found or clicked
            
print('Number of job offers: ', len(hrefs))

driver_firefox.quit() # this ends the session and closes the browser


end1 = time.time()
print('Time duration: ',(end1-start1)/60,'minutes')

Location button not found, continuing.
Number of job offers:  252
Time duration:  3.6842697739601133 minutes
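
Every click in the cell above is preceded by `time.sleep(np.random.chisquare(1)+3)`, that is, a fixed 3-second pause plus chi-square jitter. As a minimal sketch, the same idea can be wrapped in one helper using only the standard library. The name `jittered_delay`, the `cap` parameter and the use of `random.gammavariate` (a chi-square variable with one degree of freedom is a Gamma variable with shape 0.5 and scale 2) are illustrative assumptions, not part of the notebook.

```python
import random


def jittered_delay(base=3.0, cap=15.0, rng=random):
    """Return a pause length in seconds: `base` plus chi-square(1) jitter.

    Hypothetical helper, not part of the original notebook. `cap` bounds the
    rare large draws so that a single pause cannot stall the run.
    """
    # chi-square with 1 degree of freedom == Gamma(shape=0.5, scale=2.0)
    jitter = rng.gammavariate(0.5, 2.0)
    return min(base + jitter, cap)
```

It could then be used as `time.sleep(jittered_delay())` before each click.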


## Static scraping

After collecting all the offer links, the next step was to scrape detailed information about each job from the individual subpages. Each URL from the list of hrefs was visited and the Job Title, Company, Location, Deadline, Contract Type, Work Time, Work Mode and Requirements were extracted.

Unfortunately, we could not extract all the details using our preferred method (locating elements by HTML tags, classes or attributes), because fields like Deadline, Contract Type, Work Time and Work Mode share an identical structure. To tell them apart, we matched the text of each element against specific keywords or regular expressions.
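
The keyword and regular-expression matching can be sketched as a single lookup over the badge texts. This is a minimal illustration, assuming the badges have already been reduced to plain strings. The function name `classify_badges`, the sample strings and the exact patterns are assumptions modelled on the scraping cell, not the notebook's own code.

```python
import re

# Field name -> patterns. These are modelled on the keywords used in the
# notebook, but the exact patterns here are illustrative assumptions.
FIELD_PATTERNS = {
    'Deadline': [r'ważna', r'valid'],
    'Contract Type': [r'umowa', r'kontrakt', r'contract'],
    'Work Time': [r'(pełny|pełen|część)\setat(u)?', r'(full|part)[-\s]?time'],
    'Work Mode': [r'praca\s(hybrydowa|stacjonarna)', r'(hybrid|full office)\swork'],
}


def classify_badges(badge_texts):
    """Assign each badge text to every field whose pattern it matches."""
    result = {field: [] for field in FIELD_PATTERNS}
    for text in badge_texts:
        for field, patterns in FIELD_PATTERNS.items():
            if any(re.search(p, text, flags=re.IGNORECASE) for p in patterns):
                result[field].append(text)
    # multiple matches for one field are joined into a single string
    return {field: ', '.join(texts) for field, texts in result.items()}


# Hypothetical badge strings, made up for this example
sample = ['valid for 12 days', 'contract of employment', 'full-time', 'hybrid work']
print(classify_badges(sample))
```

Keeping the patterns in one dictionary means a new field or an extra keyword only requires a new entry, not another copy of the loop.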

Once all the data was scraped, it was compiled into a list and then converted into a data frame.
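
In the printed data frame, the Company column carries a trailing label from the page, for example `CLOUDICA sp. z o.o.About the company` or `Wirtualna Polska Media S.A.O firmie`. A possible post-processing step, sketched here under the assumption that only these two labels occur, strips them with a regular expression. The helper name `clean_company` is hypothetical.

```python
import re

# Trailing labels seen in the printed output; assumed to be the only variants
_SUFFIX = re.compile(r'\s*(About the company|O firmie)\s*$')


def clean_company(name):
    """Remove a trailing 'About the company' / 'O firmie' label, if present."""
    if name is None:
        return None
    return _SUFFIX.sub('', name)


print(clean_company('CLOUDICA sp. z o.o.About the company'))
print(clean_company('Wirtualna Polska Media S.A.O firmie'))
```

On the data frame this could be applied as `data['Company'] = data['Company'].map(clean_company)`.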

In [9]:
start2 = time.time()

data = []

def extract_offer_details(url):
    offer = []

    # sends a request to the URL and gets the HTML content
    response = requests.get(url, timeout=30)
    html = response.text

    # parses the HTML with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')

# extract JOB TITLE -----------------------------------------------------------------------------------------------------------------
    try:
        job_title = soup.find('h1').text # finds the job title
    except AttributeError:
        job_title = None
    offer.append(job_title)

# extract COMPANY NAME --------------------------------------------------------------------------------------------------------------
    try:
        company = soup.find('h2').text # finds the company name
    except AttributeError:
        company = None
    offer.append(company)

# extract LOCATION ------------------------------------------------------------------------------------------------------------------
    details = soup.find_all('div', attrs={'data-test': 'offer-badge-title'})    
    matched_phrases = []

    # only the first badge is inspected (it is assumed to hold the location)
    if details:
        text = details[0].get_text(strip=True)
        if re.search('Warszawa', text, flags=re.IGNORECASE):
            matched_phrases.append(text)
        else:
            matched_phrases.append('Different city')

    offer.append(', '.join(matched_phrases))
    
# extract DEADLINE ------------------------------------------------------------------------------------------------------------------
    keywords = ['ważna', 'valid']
    matched_phrases = []

    for div in details:
        text = div.get_text(strip=True)
        
        for keyword in keywords:
            if keyword in text:
                matched_phrases.append(text)
                break
            
    offer.append(', '.join(matched_phrases))

# extract CONTRACT TYPE -------------------------------------------------------------------------------------------------------------
    keywords = ['umowa', 'kontrakt', 'contract']
    matched_phrases = []

    for div in details:
        text = div.get_text(strip=True)

        for keyword in keywords:
            if keyword in text:
                matched_phrases.append(text)
                break
            
    offer.append(', '.join(matched_phrases))
    
# extract WORK TIME -----------------------------------------------------------------------------------------------------------------
    keywords = [r'(pełny|pełen|część)\setat(u)?', r'(full|part)[-\s]?time'] # also matches 'part-time' and 'full time'
    matched_phrases = []

    # loops through the divs and applies regex matching
    for div in details:
        text = div.get_text(strip=True)

        for pattern in keywords:
            if re.search(pattern, text, flags=re.IGNORECASE):
                matched_phrases.append(text)
                break
                
    offer.append(', '.join(matched_phrases))

# extract WORK MODE -----------------------------------------------------------------------------------------------------------------
    keywords = [r'praca\s(hybrydowa|stacjonarna)', r'(hybrid|full office)\swork']
    matched_phrases = []

    # loops through the divs and applies regex matching
    for div in details:
        text = div.get_text(strip=True)

        for pattern in keywords:
            if re.search(pattern, text, flags=re.IGNORECASE):
                matched_phrases.append(text)
                break
               
    offer.append(', '.join(matched_phrases))
    
# extract REQUIREMENTS --------------------------------------------------------------------------------------------------------------
    req = []
    requirements = soup.find('div', attrs={'data-test': 'offer-sub-section','data-scroll-id': 'requirements-expected-1'})

    # guards against offers without a requirements section or list
    ul = requirements.find('ul') if requirements is not None else None
    if ul is not None:
        for item in ul.find_all('li'):
            req.append(item.text.strip())
    offer.append(', '.join(req))
    
    return offer


# loops through each link in the hrefs list and collects the offers
for url in hrefs:
    data.append(extract_offer_details(url))

columns = ['Job Title', 'Company', 'Location', 'Deadline', 'Contract Type', 'Work Time', 'Work Mode', 'Requirements']    
data = pd.DataFrame(data, columns=columns) # creates a dataframe from the offers list
print(data)

end2 = time.time()
print('Time duration: ',(end2-start2)/60,'minutes')

                                             Job Title  \
0                                 Power Apps Developer   
1                                Deployment Strategist   
2             Administrator / Administratorka Big Data   
3                             Data Warehouse Developer   
4                            Backend Engineer (Python)   
..                                                 ...   
247  Administrator Baz Danych (DBA) PostgreSQL - In...   
248                           Quantitative AML Analyst   
249                     Analityk biznesowo - systemowy   
250  Specjalista ds. analiz danych w Dziale Analiz ...   
251                            Junior Business Analyst   

                                            Company  \
0              CLOUDICA sp. z o.o.About the company   
1    AICONIC PROSTA SPÓŁKA AKCYJNAAbout the company   
2               Wirtualna Polska Media S.A.O firmie   
3                                Connectis_O firmie   
4    AICONIC PROSTA SPÓŁKA A

## Saving results

The results are saved in pickle and Excel formats.

In [10]:
save_object(data, r'C:/Users/zosia/OneDrive/Dokumenty/DS&BA/SEM2/Webscraping/output_data.pkl')
data.to_excel('C:/Users/zosia/OneDrive/Dokumenty/DS&BA/SEM2/Webscraping/output_data.xlsx', index=False)
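
`save_object` is not defined in this part of the notebook. As a rough sketch of what such a helper and a follow-up check might look like, the example below pickles an object and reads it back. The bodies of `save_object` and `load_object`, the made-up rows and the temporary directory used in place of the hard-coded path are assumptions for illustration only.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, path):
    """Hypothetical stand-in for the notebook's save_object helper."""
    with open(path, 'wb') as f:  # 'wb': write in binary mode
        pickle.dump(obj, f)


def load_object(path):
    """Read a pickled object back from disk."""
    with open(path, 'rb') as f:  # 'rb': read in binary mode
        return pickle.load(f)


# Made-up rows standing in for the scraped data frame
rows = [['Junior Business Analyst', 'Example Company', 'Warszawa']]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'output_data.pkl'
    save_object(rows, target)
    restored = load_object(target)

print(restored == rows)
```

Building the path with `pathlib` keeps the output location in one place, which makes the notebook easier to run on another machine.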

The pickle file is then read back to check that the results were saved correctly.

In [11]:
with open('C:/Users/zosia/OneDrive/Dokumenty/DS&BA/SEM2/Webscraping/output_data.pkl', 'rb') as f:  # 'rb' opens the file for reading in binary mode
    data = pickle.load(f)
print(data)

                                             Job Title  \
0                                 Power Apps Developer   
1                                Deployment Strategist   
2             Administrator / Administratorka Big Data   
3                             Data Warehouse Developer   
4                            Backend Engineer (Python)   
..                                                 ...   
247  Administrator Baz Danych (DBA) PostgreSQL - In...   
248                           Quantitative AML Analyst   
249                     Analityk biznesowo - systemowy   
250  Specjalista ds. analiz danych w Dziale Analiz ...   
251                            Junior Business Analyst   

                                            Company  \
0              CLOUDICA sp. z o.o.About the company   
1    AICONIC PROSTA SPÓŁKA AKCYJNAAbout the company   
2               Wirtualna Polska Media S.A.O firmie   
3                                Connectis_O firmie   
4    AICONIC PROSTA SPÓŁKA A