Project Overview

Keeping up with new openings in Artificial Intelligence and Machine Learning takes time. This project automates part of that job hunt with Selenium: a script visits listings on popular job portals such as rozee.pk and Indeed and pulls out the details that matter for a job search. The libraries used are Selenium and Pandas.

I know this code is far from the best, and it isn't dynamic: the job URLs are hard-coded rather than discovered at run time. I'm sharing it anyway to show others how I started web scraping. It isn't easy, and dynamic pages add a layer of complexity, but you have to do your best!

In [3]:
# Import the libraries used throughout the notebook (duplicates removed)
import time

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

In [4]:
# Open Chrome; webdriver_manager downloads a matching ChromeDriver if needed.
# webdriver, ChromeService, and ChromeDriverManager were imported in the previous cell.
driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

In [5]:
driver.get('https://www.rozee.pk/')  # open the website

# Find the search bar and submit the job title
search_bar = driver.find_element(By.ID, 'search')
search_bar.send_keys('AI/ML Engineer')
search_bar.send_keys(Keys.RETURN)

# To keep this first attempt at web scraping quick, I hand-picked a few job URLs from the site.
# Some of these postings may no longer be available, and the URLs are not
# fetched dynamically from the search results.

hrefs = [
    "https://www.rozee.pk/teamsolve-senior-qa-engineer-ai-islamabad-faisalabad-lahore-jobs-1424979?utm_source=jobSearch&utm_medium=website&utm_content=jobSearch_1424979&utm_campaign=ROZEE.PK_job_search&utm_term=undefined",
    "https://www.rozee.pk/legalator-ai-ai-engineer-islamabad-jobs-1427337?utm_source=jobSearch&utm_medium=website&utm_content=jobSearch_1427337&utm_campaign=ROZEE.PK_job_search&utm_term=undefined",
    "https://www.rozee.pk/ai-technologies-design-sales-consultant-karachi-jobs-1429743?utm_source=jobSearch&utm_medium=website&utm_content=jobSearch_1429743&utm_campaign=ROZEE.PK_job_search&utm_term=undefined",
    "https://www.rozee.pk/ai-professionals-pvt-limited-business-development-officer-islamabad-jobs-1433231?utm_source=jobSearch&utm_medium=website&utm_content=jobSearch_1433231&utm_campaign=ROZEE.PK_job_search&utm_term=undefined",
    "https://www.rozee.pk/production-king-ai-developer-sialkot-jobs-1430149?utm_source=jobSearch&utm_medium=website&utm_content=jobSearch_1430149&utm_campaign=ROZEE.PK_job_search&utm_term=undefined"
]

# Collect the job cards from the main listing page.
# Note: `jobs` is not used below; the loop works from the hard-coded `hrefs` list.
jobs_xpath = "//div[@class='jobt float-left']"
jobs = driver.find_elements(By.XPATH, jobs_xpath)

# Now, visit each job detail page and scrape more information
for index, job_href in enumerate(hrefs):
    driver.get(job_href)
    
    try:
        # Wait for the job title (<h1>) to be present on the page
        job_title_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//h1"))
        )
        job_title = job_title_element.text
    except TimeoutException:
        job_title = "Not specified"
    print(f"Job Title: {job_title}")
        
    
    try:
        # Wait for the company name link to be present on the page
        job_company_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//h2[contains(@class, 'cname')]/a[@class='text-dark']"))
        )
        job_company = job_company_element.text
    except TimeoutException:
        job_company = "Not specified"
    print(f"Job Company: {job_company}")
    
    try:
        # Wait for the location heading to be present on the page
        job_location_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//h4[@class='lh1 cname im2 font18 text-dark d-flex align-items-center']"))
        )
        job_location = job_location_element.text
    except TimeoutException:
        job_location = "Not specified"
    print(f"Job Location: {job_location}")
    
    
    try:
        # Wait for the Gender Preference element to be present on the page
        gender_pref_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//div[b[contains(text(), 'Gender:')]]/following-sibling::div[1]"))
        )
        gender_pref = gender_pref_element.text
    except TimeoutException:
        gender_pref = "Not specified"
    print(f"Gender Preference: {gender_pref}")
    
    
    try:
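        # Try to find the age requirement element.
        # NOTE: "xpath_for_age" is a placeholder, not a real locator, so this lookup
        # matches nothing and the field falls back to "Not specified".
        age_requirement_element = driver.find_element(By.XPATH, "xpath_for_age")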
        age_requirement = age_requirement_element.text.strip()
    except NoSuchElementException:
        # If the age requirement element is not found, it's not present for this job
        age_requirement = "Not specified"
    
    # Output the result or add it to your data structure
    print(f"Age Requirement: {age_requirement}")
     
    try:
        # Wait for the Minimum Education value to be present on the page
        min_education_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//div[b[contains(text(), 'Minimum Education:')]]/following-sibling::div[1]"))
        )
        min_education = min_education_element.text
    except TimeoutException:
        min_education = "Not specified"
    print(f"Minimum Education: {min_education}")
    
    try:
        # Wait for the job details list (second <ul> following a <p> that contains <strong>)
        experience_detail_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//p[strong]/following-sibling::ul[2]"))
        )
        experience_detail = experience_detail_element.text
    except TimeoutException:
        experience_detail = "Not specified"
    print(f"Job Details: {experience_detail}")

Job Title: Senior QA Engineer - AI
Job Company: TeamSolve
Job Location: Multiple Cities,
Pakistan
Gender Preference: No Preference
Age Requirement: Not specified
Minimum Education: Bachelors
Job Details: Bachelor's degree in Computer Science or related field. 
Understanding and experience of testing methodologies, techniques, and tools. 
Understanding and experience of manual and automated testing. 
Understanding and experience of JIRA. 
Manage a team of Junior QA resources. 
At least 3 to 4 years of relevant experience. 
Excellent analytical and problem-solving skills. 
Strong attention to detail and ability to identify and document defects. 
Excellent communication and collaboration skills. 
Strong work ethic and willingness to learn and grow. 
Job Title: AI Engineer
Job Company: Legalator AI
Job Location: Islamabad ,
Pakistan
Gender Preference: No Preference
Age Requirement: Not specified
Minimum Education: Bachelors
Job Details: Bachelor's or Master's degree in Computer Science, Ar
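
The loop above repeats the same wait-then-fallback pattern for every field. One way to shorten it, sketched here under assumptions, is to keep the XPaths in a dictionary and run a single helper over them. The names `FIELD_XPATHS`, `scrape_fields`, and `fake_lookup` are hypothetical and not part of the original notebook; the `lookup` callable stands in for the `WebDriverWait(...).until(...)` call so that the sketch runs without a browser.

```python
# Hypothetical refactor: one helper instead of a try/except block per field.
# The XPaths are copied from the loop above; the helper names are illustrative.
FIELD_XPATHS = {
    "Job Title": "//h1",
    "Job Company": "//h2[contains(@class, 'cname')]/a[@class='text-dark']",
    "Gender Preference": "//div[b[contains(text(), 'Gender:')]]/following-sibling::div[1]",
    "Minimum Education": "//div[b[contains(text(), 'Minimum Education:')]]/following-sibling::div[1]",
}


def scrape_fields(lookup, field_xpaths, missing=(LookupError,), default="Not specified"):
    """Return {field name: text}, using `default` when `lookup` raises one of `missing`."""
    record = {}
    for name, xpath in field_xpaths.items():
        try:
            record[name] = lookup(xpath)
        except missing:
            record[name] = default
    return record


# Stand-in for a real page: only the title "exists" here.
def fake_lookup(xpath):
    page = {"//h1": "AI Engineer"}
    return page[xpath]  # raises KeyError (a LookupError) for any other XPath


record = scrape_fields(fake_lookup, FIELD_XPATHS)
print(record["Job Title"])          # AI Engineer
print(record["Minimum Education"])  # Not specified
```

With Selenium, `lookup` could be `lambda xp: WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, xp))).text`, with `missing=(TimeoutException,)`.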

In [6]:
# New cell for dataframe creation (pandas was already imported above as pd)

# Initialize an empty list to store job data dictionaries
job_data_list = []

for index, job_href in enumerate(hrefs):
    driver.get(job_href)
    
    try:
        # Wait for the job title (<h1>) to be present on the page
        job_title_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//h1"))
        )
        job_title = job_title_element.text
    except TimeoutException:
        job_title = "Not specified"
#    print(f"Job Title: {job_title}")
        
    
    try:
        # Wait for the company name link to be present on the page
        job_company_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//h2[contains(@class, 'cname')]/a[@class='text-dark']"))
        )
        job_company = job_company_element.text
    except TimeoutException:
        job_company = "Not specified"
#    print(f"Job Company: {job_company}")
    
    try:
        # Wait for the location heading to be present on the page
        job_location_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//h4[@class='lh1 cname im2 font18 text-dark d-flex align-items-center']"))
        )
        job_location = job_location_element.text
    except TimeoutException:
        job_location = "Not specified"
#    print(f"Job Location: {job_location}")
    
    
    try:
        # Wait for the Gender Preference element to be present on the page
        gender_pref_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//div[b[contains(text(), 'Gender:')]]/following-sibling::div[1]"))
        )
        gender_pref = gender_pref_element.text
    except TimeoutException:
        gender_pref = "Not specified"
#    print(f"Gender Preference: {gender_pref}")
    
    
    try:
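        # Try to find the age requirement element.
        # NOTE: "xpath_for_age" is a placeholder, not a real locator, so this lookup
        # matches nothing and the field falls back to "Not specified".
        age_requirement_element = driver.find_element(By.XPATH, "xpath_for_age")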
        age_requirement = age_requirement_element.text.strip()
    except NoSuchElementException:
        # If the age requirement element is not found, it's not present for this job
        age_requirement = "Not specified"
    
    # Output the result or add it to your data structure
#    print(f"Age Requirement: {age_requirement}")
     
    try:
        # Wait for the Minimum Education value to be present on the page
        min_education_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//div[b[contains(text(), 'Minimum Education:')]]/following-sibling::div[1]"))
        )
        min_education = min_education_element.text
    except TimeoutException:
        min_education = "Not specified"
#    print(f"Minimum Education: {min_education}")
    
    try:
        # Wait for the job details list (second <ul> following a <p> that contains <strong>)
        experience_detail_element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//p[strong]/following-sibling::ul[2]"))
        )
        experience_detail = experience_detail_element.text
    except TimeoutException:
        experience_detail = "Not specified"
#    print(f"Job Details: {experience_detail}")

    job_data = {
        'Job Title': job_title,
        'Job Company': job_company,
        'Job Location': job_location,
        'Gender Preference': gender_pref,
        'Minimum Education': min_education,
        'Experience Details': experience_detail,
        'Job Link': job_href  # Assuming job_href is the URL of the job
    }
    job_data_list.append(job_data)
    
    #driver.quit()

# After the loop, convert the list of dictionaries to a dataframe
jobs_df = pd.DataFrame(job_data_list)
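
Two of the scraped values could be tidied before they go into the DataFrame: the job links carry `utm_*` tracking parameters, and the location text contains a line break (for example `Islamabad ,` followed by `Pakistan`). The sketch below uses only the standard library. The function names `strip_tracking` and `squash_whitespace` are hypothetical, and whether rozee.pk serves the same page without the tracking parameters is an assumption that has not been checked here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking(url):
    """Drop query parameters whose names start with 'utm_'."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if not key.startswith("utm_")]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def squash_whitespace(text):
    """Collapse runs of whitespace (including newlines) and turn ' ,' into ','."""
    return " ".join(text.split()).replace(" ,", ",")


link = ("https://www.rozee.pk/legalator-ai-ai-engineer-islamabad-jobs-1427337"
        "?utm_source=jobSearch&utm_medium=website")
print(strip_tracking(link))
# https://www.rozee.pk/legalator-ai-ai-engineer-islamabad-jobs-1427337
print(squash_whitespace("Islamabad ,\nPakistan"))
# Islamabad, Pakistan
```

Inside the loop, these could be applied as `'Job Location': squash_whitespace(job_location)` and `'Job Link': strip_tracking(job_href)` when building `job_data`.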

In [7]:
# Check whether any listing has 'Not specified' for Gender Preference, i.e. the
# scraper found no gender field on the page.
# Such jobs are treated as potentially more inclusive, since they state no gender requirement.
no_gender_mask = jobs_df['Gender Preference'].str.contains("Not specified")

if no_gender_mask.any():
    # If there are such jobs, keep only those
    inclusive_jobs_df = jobs_df[no_gender_mask]
else:
    # Otherwise, consider all jobs
    inclusive_jobs_df = jobs_df

# Ideally the jobs would be sorted by 'Minimum Education' to prioritize roles that require higher qualifications,
# on the assumption that such roles offer more complex and advanced work.
# 'Minimum Education' is a string, though, so it would first have to be mapped to a numerical scale.
# That mapping is not done here; as a stand-in, the jobs are sorted by 'Job Title' in descending order.

sorted_jobs_df = inclusive_jobs_df.sort_values(by='Job Title', ascending=False)

# Display the top listing after sorting. With this stand-in criterion it is simply the
# title that comes last alphabetically, not a true "best fit".
best_fit_job = sorted_jobs_df.head(1)
best_fit_job

Unnamed: 0,Job Title,Job Company,Job Location,Gender Preference,Minimum Education,Experience Details,Job Link
0,Senior QA Engineer - AI,TeamSolve,"Multiple Cities,\nPakistan",No Preference,Bachelors,Bachelor's degree in Computer Science or relat...,https://www.rozee.pk/teamsolve-senior-qa-engin...
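
The comments in the cell above mention mapping 'Minimum Education' to a numerical scale. A minimal sketch of such a mapping follows. Only `Bachelors` and `Masters` appear in the scraped output; the other labels, the rank values, the made-up third row, and the names `EDUCATION_RANK` and `education_rank` are assumptions for illustration.

```python
# Hypothetical ordinal scale; only "Bachelors" and "Masters" are seen in the output above.
EDUCATION_RANK = {
    "Intermediate": 1,  # assumed label
    "Bachelors": 2,
    "Masters": 3,
    "Doctorate": 4,     # assumed label
}


def education_rank(label):
    """Map an education label to its rank; unknown labels (e.g. 'Not specified') get 0."""
    return EDUCATION_RANK.get(label.strip(), 0)


# Illustrative rows; the third one is made up to show the fallback rank.
jobs = [
    {"Job Title": "AI Engineer", "Minimum Education": "Bachelors"},
    {"Job Title": "AI Developer", "Minimum Education": "Masters"},
    {"Job Title": "Example Role", "Minimum Education": "Not specified"},
]

# Highest requirement first
ranked = sorted(jobs, key=lambda job: education_rank(job["Minimum Education"]), reverse=True)
print([job["Job Title"] for job in ranked])
# ['AI Developer', 'AI Engineer', 'Example Role']
```

With the DataFrame from this notebook, the same function could be passed to pandas as `jobs_df.sort_values(by='Minimum Education', key=lambda col: col.map(education_rank), ascending=False)`.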


In [8]:
# Sort the DataFrame by the length of the 'Minimum Education' string.
# The assumption here is that a shorter string indicates a lower minimum education.
# This is only a crude proxy: "Masters" is shorter than "Bachelors", so a Masters-level job sorts first.
sorted_jobs_df = jobs_df.sort_values(by='Minimum Education', key=lambda col: col.str.len())

# Display the top listing after sorting. It has the shortest 'Minimum Education' string,
# which does not necessarily mean the lowest requirement or the most accessible job.
best_fit_job = sorted_jobs_df.head(1)
best_fit_job

# To save the dataframe to a CSV file, uncomment the following line:
# jobs_df.to_csv('job_listings.csv', index=False)

Unnamed: 0,Job Title,Job Company,Job Location,Gender Preference,Minimum Education,Experience Details,Job Link
4,AI Developer,Production King,"Sialkot ,\nPakistan",No Preference,Masters,Bachelor's or higher degree in Computer Scienc...,https://www.rozee.pk/production-king-ai-develo...
