# Webscraping and Social Media Scraping Project
## 08.04.2025

### Anna Kędzierska-Głodek (332945), Stanisław Godlewski (473016)

# 1. Choice of the website
    https://www.arbeitnow.com/
   

## 1.1. Reason
The Arbeitnow portal was selected because its job listings are publicly available, transparent, and up to date. The postings include salary ranges, location, work mode (e.g., remote or on-site), and technical requirements, which makes them valuable for analyzing the job market. This dataset enables the assessment of differences in compensation, demand for specific technologies, and the popularity of remote work. It also allows occupational trends to be tracked over time, which makes it a useful resource for job seekers.


## 1.2. https://www.arbeitnow.com/robots.txt
  
       
           User-agent: *              # applies to any user agent
           Disallow:                  # empty value: this rule blocks nothing
           Disallow: /*?__hstc        # URLs whose query string starts with __hstc are excluded
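
The rules above can also be checked programmatically. The following is a minimal sketch using Python's standard `urllib.robotparser`, with the rules pasted in as a string so that no network request is made. The helper name `is_allowed` and the example URL are assumptions for illustration. One caveat: this parser matches rule paths as plain prefixes and does not implement the `*` wildcard, so the `/*?__hstc` rule is not evaluated the way a wildcard-aware crawler would evaluate it.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt excerpt above
ROBOTS_TXT = """\
User-agent: *
Disallow:
Disallow: /*?__hstc
"""

def is_allowed(url, user_agent="*"):
    """Return True if the pasted rules allow `user_agent` to fetch `url` (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)

# Example search URL, assumed for illustration
search_url = "https://www.arbeitnow.com/?search=&sort_by=newest&page=1"
allowed = is_allowed(search_url)
print(f"Allowed to fetch search pages: {allowed}")
```

Parsing a pasted copy keeps the check reproducible, but the live file may change, so it is worth re-reading it before each scraping run.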





## 1.3. https://www.arbeitnow.com/terms
  
      Use License
            a. Permission is granted to temporarily download one copy of the materials (information or software) on arbeitnow's website for personal, non-commercial transitory viewing only. This is the grant of a license, not a transfer of title, and under this license you may not:
                i. modify or copy the materials;
                ii. use the materials for any commercial purpose, or for any public display (commercial or non-commercial);
                iii. attempt to decompile or reverse engineer any software contained on arbeitnow's website;
                iv. remove any copyright or other proprietary notations from the materials; or
                v. transfer the materials to another person or "mirror" the materials on any other server.
            b. This license shall automatically terminate if you violate any of these restrictions and may be terminated by arbeitnow at any time. Upon terminating your viewing of these materials or upon the termination of this license, you must destroy any downloaded materials in your possession whether in electronic or printed format.



# 2. Collecting data from the website

## 2.1. Install packages, import libraries and tools

In [3]:
# Install required packages (beautifulsoup4 provides the bs4 module imported below)
%pip install pandas numpy selenium webdriver_manager beautifulsoup4

# Import system and data libraries
import time  # used to pause for JS loading and measure execution time
import re  # used for extracting text
import pandas as pd  # used to store and export scraped data
import numpy as np  # used for random time delays
import csv # used to write CSV files
import os # used to create directories

# Import web scraping tools
from bs4 import BeautifulSoup  # used to parse HTML
from selenium import webdriver  # controls the web browser
from selenium.webdriver.common.by import By  # used to locate elements on page
from selenium.webdriver.chrome.service import Service  # manages ChromeDriver
from selenium.webdriver.support.ui import Select  # used to select dropdown options
from webdriver_manager.chrome import ChromeDriverManager  # installs the ChromeDriver

Note: you may need to restart the kernel to use updated packages.


## 2.2. Check total number of pages for English-speaking jobs

In [4]:
# Launch Chrome browser
service = Service(ChromeDriverManager().install())
options_chrome = webdriver.ChromeOptions() # set Chrome options
options_chrome.add_argument("--incognito")  # open browser in incognito mode
options_chrome.add_argument("--disable-notifications")  # block system push notifications
options_chrome.add_argument("--disable-infobars")  # hide infobars
options_chrome.add_argument("--disable-extensions")  # disable browser extensions
options_chrome.add_argument("--no-sandbox") # disable sandboxing
options_chrome.add_argument("--disable-dev-shm-usage") # overcome limited resource problems
options_chrome.add_argument("--headless")  # run browser without a visible UI

driver = webdriver.Chrome(service=service, options=options_chrome)  # launch Chrome with specified options

# Open website
url = "https://www.arbeitnow.com"
driver.get(url) # request the page with the specified URL

# Click button to search English-speaking jobs
button = driver.find_element(By.CSS_SELECTOR, 'button[aria-label="tag english speaking"]')
button.click()

# Wait to ensure the content is fully loaded
time.sleep(3) # wait for 3 seconds

# Read the total number of pages
try:
    last_page_element = driver.find_element(By.ID, "last_page")
    last_page = int(last_page_element.text.strip())  # strip whitespace and convert to integer
except Exception:  # element not found or its text is not an integer
    last_page = 1  # fall back to a single page
print(f"Total number of pages: {last_page}")

url_1 = driver.current_url
print("url:", url_1)

url_base = url_1.rsplit('=', 1)[0] + '=' # get the base URL by removing the last part after '='
url_base = url_base.replace("sort_by=relevance", "sort_by=newest")
print("url_base:", url_base)

# Close this browser session; section 2.3 launches a new one
driver.quit()

Total number of pages: 31
url: https://www.arbeitnow.com/?search=&tags=%5B%22english+speaking%22%5D&sort_by=relevance&page=1
url_base: https://www.arbeitnow.com/?search=&tags=%5B%22english+speaking%22%5D&sort_by=newest&page=
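
The page URLs are built above by string manipulation (`rsplit` and `replace`), which relies on `page` being the last query parameter. As a hedged alternative, the sketch below rebuilds the query string with the standard `urllib.parse` module. The function name `build_page_url` is hypothetical, and the input URL is the one printed in the output above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def build_page_url(search_url, page, sort_by="newest"):
    """Return `search_url` with its `page` and `sort_by` query parameters replaced."""
    parts = urlsplit(search_url)
    # keep_blank_values=True preserves empty parameters such as `search=`
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params["sort_by"] = sort_by
    params["page"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(params)))

first_page = (
    "https://www.arbeitnow.com/?search=&tags=%5B%22english+speaking%22%5D"
    "&sort_by=relevance&page=1"
)
print(build_page_url(first_page, 3))
```

This version does not depend on the order of the parameters, so it would keep working if the site appended another parameter after `page`.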


## 2.3. Collect links

In [6]:
# Start tracking execution time
start_time = time.time()

# Launch Chrome browser again for scraping job offers
driver = webdriver.Chrome(service=service, options=options_chrome)

# Create an empty list to store job offer URLs
job_links = []

# Loop through each search result page
for i in range(1, last_page+1):
#for i in range(1, 2): # limit for testing
    url = url_base + str(i)
    driver.get(url)  # open the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scroll down to load content
    time.sleep(np.random.chisquare(1) + 2.5)  # wait for content to load

    # Find all job links on the page
    tags = driver.find_elements(By.CSS_SELECTOR, 'a[data-job-item-link="true"]')
    for tag in tags:
        href = tag.get_attribute('href')  # extract URL from link
        if href and href not in job_links:
            job_links.append(href)  # store only unique links

    print(f"Page {i} – found: {len(tags)} links, total collected: {len(job_links)}")

driver.quit()  # close the browser used for link collection
end_time = time.time()

pd.DataFrame({'Job_URL': job_links}).to_csv("arbeitnow_job_links.csv", index=False)

print(f"\nTotal number of job links collected: {len(job_links)}")
print(f"Time: {end_time - start_time:.2f} seconds")

Page 1 – found: 35 links, total collected: 35
Page 2 – found: 35 links, total collected: 70
Page 3 – found: 35 links, total collected: 105
Page 4 – found: 35 links, total collected: 140
Page 5 – found: 35 links, total collected: 175
Page 6 – found: 35 links, total collected: 210
Page 7 – found: 35 links, total collected: 245
Page 8 – found: 35 links, total collected: 280
Page 9 – found: 35 links, total collected: 315
Page 10 – found: 35 links, total collected: 350
Page 11 – found: 35 links, total collected: 385
Page 12 – found: 35 links, total collected: 420
Page 13 – found: 35 links, total collected: 455
Page 14 – found: 35 links, total collected: 490
Page 15 – found: 35 links, total collected: 525
Page 16 – found: 35 links, total collected: 560
Page 17 – found: 35 links, total collected: 595
Page 18 – found: 35 links, total collected: 630
Page 19 – found: 35 links, total collected: 665
Page 20 – found: 35 links, total collected: 700
Page 21 – found: 35 links, total collected: 735
Pag
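
The loop above keeps links unique by testing `href not in job_links` against a list, which rescans the whole list for every link. A minimal sketch of the same idea with a companion set for the membership test is shown below. The helper `add_unique` and the sample links are made up for illustration.

```python
def add_unique(links, seen, new_links):
    """Append unseen, non-empty links to `links`, preserving first-seen order."""
    for href in new_links:
        if href and href not in seen:
            seen.add(href)       # set lookup instead of scanning the list
            links.append(href)
    return links

job_links_demo = []
seen_links = set()

# Made-up page results: the second page repeats one link and contains an empty href
add_unique(job_links_demo, seen_links, ["/jobs/a", "/jobs/b"])
add_unique(job_links_demo, seen_links, ["/jobs/b", None, "/jobs/c"])
print(job_links_demo)
```

For roughly a thousand links the difference is negligible, so this is mainly a matter of style; the list still records the order in which the links were found.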

## 2.4. Define list of keywords and function to extract keywords from text

In [7]:
# Define a list of keywords to extract from job descriptions
keywords_list = [
    "Python", "SQL", "R", "Pandas", "NumPy", "Matplotlib", "TensorFlow", "Keras", 
    "PyTorch", "Scikit-learn", "Hadoop", "Spark", "AWS", "GCP", "Azure", "Docker", 
    "Kubernetes", "PostgreSQL", "MongoDB", "Airflow", "Tableau", "Power BI", "Linux", 
    "Git", "GitHub", "Bitbucket", "Java", "Scala", "NoSQL", "Flask", "Django", "Machine Learning", "ML",
    "BigQuery", "ETL", "Data Warehouse", "Data Lake", "Data Pipeline", "Data Science",
    "Data Engineering", "Business Intelligence", "BI", "Data Visualization", "Deep Learning",
    "NLP", "Natural Language Processing", "Statistics", "Algorithms", "Data Mining", 
    "Statistical Analysis", "Agile", "Scrum", "Kanban", "DevOps", "CI/CD",
    "Microservices", "REST", "GraphQL", "API", "Web Scraping", "Web Crawling", 
    "Google Cloud", "Cloud", "SAP", "Apache", "Jira", "Confluence"
]

# Define function to extract keywords from text using regex
def extract_keywords(text, keywords_list):
    found = []  # create empty list to store matched keywords
    for tech in keywords_list:
        if tech == "R":
            # match "R" only if it's a separate word 
            if re.search(r'\bR\b', text):
                found.append(tech)
        else:
            # match other keywords case-insensitively
            if re.search(rf'\b{re.escape(tech)}\b', text, re.IGNORECASE):
                found.append(tech)
    return ", ".join(sorted(set(found)))  # return sorted, unique keywords as a comma-separated string
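
One possible source of false positives in the function above is the pattern for `R`: the word boundary `\b` also sits between a letter and `&`, so `\bR\b` matches the `R` in `R&D`. The sketch below compares it with a stricter pattern that also rejects `&` next to the letter. The name `STRICT_R` and the sample sentences are assumptions for illustration, not part of the original pipeline.

```python
import re

LOOSE_R = re.compile(r"\bR\b")                  # pattern used in extract_keywords
STRICT_R = re.compile(r"(?<![\w&])R(?![\w&])")  # hypothetical stricter variant

samples = [
    "Experience with Python, R, or SQL",  # mentions the language
    "You will work with our R&D team",    # does not mention the language
]

for text in samples:
    loose = bool(LOOSE_R.search(text))
    strict = bool(STRICT_R.search(text))
    print(f"{text!r}: loose={loose}, strict={strict}")
```

The stricter pattern is still a heuristic; other standalone uses of the letter would need their own exclusions.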

## 2.5. Scrape job details

In [8]:
# Load job links from file 
links_df = pd.read_csv("arbeitnow_job_links.csv") 
job_links = links_df['Job_URL'].dropna().unique().tolist() # load job links from CSV file to list

# Define how many job offers to scrape
LIMIT_LINKS = len(job_links)  # all links
#LIMIT_LINKS = 50 # limit for testing
BATCH_SIZE = 50  # number of job offers written to the CSV file per batch
output_file = "arbeitnow_jobs.csv"

print(f"Loaded {len(job_links[:LIMIT_LINKS])} job links.")

# Remove old CSV file if exists
if os.path.exists(output_file):
    os.remove(output_file)

# Define headers
fieldnames = [
    "Date_Posted", "Title", "Company", "Location", "Remote",
    "Salary", "Keywords", "Description", "URL", "Apply_Link"
]

# Create CSV file with headers before scraping
with open(output_file, mode="w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()

# Launch Chrome browser for scraping job offers
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options_chrome)

# Track execution time
start_time = time.time()

batch = [] # list to store job data for batch processing
total_saved = 0 # total number of saved job offers

# Loop over job offers
for idx, url in enumerate(job_links[:LIMIT_LINKS]):
    print(f"Scraping job offer {idx + 1}/{LIMIT_LINKS} ({(idx + 1) / LIMIT_LINKS * 100:.2f}%): {url}")
    try:
        driver.get(url) # open job offer page

        # Parse page source with BeautifulSoup
        soup = BeautifulSoup(driver.page_source, "html.parser")

        # Extract description; skip the offer if it is missing
        desc_tag = soup.select_one('div[itemprop="description"]')
        if not desc_tag:
            print(f"Skipping {url}: No description found.")
            continue

        description = desc_tag.get_text(separator="\n", strip=True)
        
        # Extract keywords from description
        keywords = extract_keywords(description, keywords_list)
        if not keywords:
            print("----> skipping - no keywords found")
            continue  
        else:
            print(f"-> {keywords}")

        # Extract posting date
        time_tag = soup.find("time")
        date_posted = time_tag["datetime"] if time_tag and "datetime" in time_tag.attrs else ""

        # Extract title
        title_tag = soup.find("h1")
        title = title_tag.get_text(strip=True) if title_tag else ""

        # Extract company
        company_tag = soup.select_one('a[itemprop="hiringOrganization"]')
        company = company_tag.get_text(strip=True) if company_tag else ""

        # Extract location and check if job is remote
        location_tag = soup.select_one("p.flex.items-center span.text-gray-600")
        location = location_tag.get_text(strip=True) if location_tag else ""
        # Mark as remote if the location mentions it or any button label contains "remote"
        # (e.g. Remote, Remote full time); otherwise "No"
        remote = "Yes" if "Remote" in location or "Home Office" in location else "No"
        for button in soup.select('button[aria-label]'):
            if "remote" in button.get("aria-label", "").lower():
                remote = "Yes"
                break

        # Search for "Salary Information" only in elements preceding the description
        salary = ""
        if desc_tag:
            # Find all previous elements
            previous_elements = desc_tag.find_all_previous()
            for elem in previous_elements:
                if elem.name == "div" and elem.get("title") == "Salary Information":
                    salary = elem.get_text(strip=True).replace("Salary Icon", "").strip()
                    break

        # Add to batch
        batch.append({
            "Date_Posted": date_posted,
            "Title": title,
            "Company": company,
            "Location": location,
            "Remote": remote,
            "Salary": salary,
            "Keywords": keywords,
            "Description": description,
            "URL": url,
            "Apply_Link": url + "/apply"
        })

        # Write to CSV every batch
        if len(batch) >= BATCH_SIZE:
            pd.DataFrame(batch).to_csv(output_file, mode="a", header=False, index=False)
            total_saved += len(batch)
            print(f"* Saved {len(batch)} job offers to CSV. Total saved: {total_saved} ({total_saved / LIMIT_LINKS * 100:.2f}%)\n")
            batch = []

    except Exception as e:
        print(f"Skipping {url}: {e}")
        continue

# Save remaining data
if batch:
    pd.DataFrame(batch).to_csv(output_file, mode="a", header=False, index=False)
    total_saved += len(batch)
    print(f"* Saved remaining {len(batch)} job offers to CSV. Total saved: {total_saved} ({total_saved / LIMIT_LINKS * 100:.2f}%)\n")

# Close browser after scraping
driver.quit()
end_time = time.time()

# Load the CSV file to count the rows
df = pd.read_csv(output_file)

# Print summary information
print(f"Total number of saved job offers: {len(df)}")
print(f"Time: {end_time - start_time:.2f} seconds")

Loaded 1074 job links.
Scraping job offer 1/1074 (0.09%): https://www.arbeitnow.com/jobs/companies/tradelink/solution-consultant-all-genders-munich-448140
-> API, R
Scraping job offer 2/1074 (0.19%): https://www.arbeitnow.com/jobs/companies/pathway-solutions-gmbh/senior-backend-devops-engineer-spring-boot-google-cloud-hamburg-141884
-> API, CI/CD, Cloud, DevOps, GCP, Google Cloud, GraphQL, Java, NoSQL, REST
Scraping job offer 3/1074 (0.28%): https://www.arbeitnow.com/jobs/companies/lautsprecherteufel/working-student-embedded-testing-berlin-222858
----> skipping - no keywords found
Scraping job offer 4/1074 (0.37%): https://www.arbeitnow.com/jobs/companies/bigrep-gmbh/working-student-it-berlin-82630
-> AWS, Azure, Cloud
Scraping job offer 5/1074 (0.47%): https://www.arbeitnow.com/jobs/companies/bigrep-gmbh/working-student-hardware-projects-berlin-150237
----> skipping - no keywords found
Scraping job offer 6/1074 (0.56%): https://www.arbeitnow.com/jobs/companies/justplay-gmbh/head-of-da
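
The cell above writes the header with `csv.DictWriter` and then appends each batch with pandas. As a hedged sketch of the same batch-append idea using only the standard library, the helper below writes the header once and appends rows on later calls. The function `append_batch`, the shortened header, and the demo rows are made up for illustration.

```python
import csv
import os
import tempfile

FIELDNAMES = ["Title", "Company", "Keywords"]  # shortened, illustrative header

def append_batch(path, rows, fieldnames=FIELDNAMES):
    """Append `rows` (a list of dicts) to `path`, writing the header only once."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, mode="a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
    return len(rows)

# Demo with made-up rows written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "jobs_demo.csv")
saved = append_batch(demo_path, [
    {"Title": "Data Engineer", "Company": "Example GmbH", "Keywords": "Python, SQL"},
])
saved += append_batch(demo_path, [
    {"Title": "Analyst", "Company": "Sample AG", "Keywords": "R"},
])
print(f"Rows saved: {saved}")
```

Because each batch is flushed to disk when the file is closed, an interrupted run loses at most the offers collected since the last batch.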

# 3. Prepare collected data for further analysis

In [None]:
# Load the CSV file
df = pd.read_csv("arbeitnow_jobs.csv")

# Remove duplicates: rows that match on every column except 'URL' and 'Apply_Link'
df = df.drop_duplicates(subset=['Date_Posted', 'Title', 'Company', 'Location', 
                                'Remote', 'Salary', 'Keywords', 'Description'])
print(f"Number of unique saved job offers: {len(df)}")

# Convert 'Date_Posted' to datetime
df['Date_Posted'] = pd.to_datetime(df['Date_Posted'], errors='coerce')

# Split 'Salary' into 'Min_Salary' and 'Max_Salary'
def split_salary(s):
    try:
        if pd.isna(s) or not isinstance(s, str) or s.strip() == "":
            return pd.Series([None, None])

        # Replace non-standard characters
        s = s.replace("–", "-").replace("\n", "").replace("€", "").strip()
        
        # Handle different salary formats
        parts = s.split("-")
        parts = [p.strip().replace(".", "").replace(",", "") for p in parts]

        if len(parts) == 1 and parts[0].isdigit():
            return pd.Series([int(parts[0]), int(parts[0])])
        elif len(parts) == 2 and all(p.isdigit() for p in parts):
            return pd.Series([int(parts[0]), int(parts[1])])
        else:
            return pd.Series([None, None])

    except Exception as e:
        print(f"Error parsing salary: {s} -> {e}")
        return pd.Series([None, None])

df[['Min_Salary [€]', 'Max_Salary [€]']] = df['Salary'].apply(split_salary)

# Prepare for column reordering
min_salary_col = 'Min_Salary [€]'
max_salary_col = 'Max_Salary [€]'
cols = df.columns.tolist()
salary_index = cols.index('Salary')

# Reorder columns
new_order = (
    cols[:salary_index + 1] +
    [min_salary_col, max_salary_col] +
    [col for col in cols if col not in [min_salary_col, max_salary_col] and col not in cols[:salary_index + 1]]
)

df = df[new_order]

# Save cleaned and updated data
df.to_csv("arbeitnow_jobs_preprocessed.csv", index=False)
print(f"Total number of saved job offers: {len(df)}")

# Show sample
df[['Date_Posted', 'Title', 'Company', 'Location', 'Remote', 'Salary', 
    'Min_Salary [€]', 'Max_Salary [€]', 'Keywords']]

Number of unique saved job offers: 718
Total number of saved job offers: 718


|  | Date_Posted | Title | Company | Location | Remote | Salary | Min_Salary [€] | Max_Salary [€] | Keywords |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-04-07 21:49:07 | Solution Consultant (all genders) | TradeLink | Munich | Yes |  |  |  | API, R |
| 1 | 2025-04-07 19:49:03 | Senior Backend & DevOps Engineer (Spring Boot ... | pathway solutions gmbh | Hamburg | Yes |  |  |  | API, CI/CD, Cloud, DevOps, GCP, Google Cloud, ... |
| 2 | 2025-04-07 19:09:42 | Working Student IT | BigRep GmbH | Berlin | No |  |  |  | AWS, Azure, Cloud |
| 3 | 2025-04-07 17:49:05 | Head of Data (all genders) | JustPlay GmbH | Berlin | Yes |  |  |  | Cloud, Docker, GCP, Git, Google Cloud, Python,... |
| 4 | 2025-04-07 16:49:06 | Working Student - Frontend Developer | perto GmbH | Berlin | No |  |  |  | API, Agile |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 714 | 2025-02-17 13:20:18 | Working Student Digital Transformation (all ge... | Q Energy | Berlin | No |  |  |  | Agile, BI, DevOps, Power BI |
| 715 | 2025-02-17 12:49:03 | (Senior) Tech Recruiter (m/f/d) | Project A Ventures | Berlin | No |  |  |  | R |
| 716 | 2025-02-17 08:45:16 | Senior Software Engineer - iOS, Swift | Funded.club | Freiburg im Breisgau | No |  |  |  | Agile, Git, GraphQL, REST |
| 717 | 2025-02-17 07:04:07 | Working Student -- Software Engineer: Munich | Uncountable | Munich | No |  |  |  | Flask, Python |
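
The `split_salary` function above returns empty values whenever a part of the string is not purely numeric, for example if a salary were followed by text such as "per year". The sketch below is a hypothetical regex-based variant that picks out the numeric amounts instead. The function name `parse_salary_range` and the sample strings are assumptions, since the real salary formats on the site are not shown here. Like the original, it treats `.` and `,` as thousands separators and would misread amounts with a decimal part.

```python
import re

def parse_salary_range(text):
    """Return (min, max) as integers, or (None, None) if no amount is found."""
    if not isinstance(text, str):
        return (None, None)
    # An amount is a run of digits that may contain "." or "," as thousands separators
    amounts = re.findall(r"\d[\d.,]*", text)
    values = [int(re.sub(r"[.,]", "", amount)) for amount in amounts]
    if not values:
        return (None, None)
    if len(values) == 1:
        return (values[0], values[0])  # a single amount is used as both bounds
    return (values[0], values[1])

# Assumed example formats, not taken from the scraped data
for sample in ["€50.000 – €70.000", "60,000 per year", ""]:
    print(repr(sample), "->", parse_salary_range(sample))
```

Returning a plain tuple keeps the sketch independent of pandas; wrapping the result in `pd.Series` would make it a drop-in replacement for `df['Salary'].apply(...)`.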
