# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 4: Web Scraping Job Postings

## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting the salary amounts for these data.
   2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?


To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Indeed.com](https://www.indeed.com) regularly pool job postings from a variety of markets and industries. 

**Goal:** Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer these two questions.

---

## Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the two questions described above.

### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).
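
As a minimal sketch of the classification framing, the snippet below labels each salary as high (1) or low (0) relative to the median. The function name `label_by_median` and the sample figures are hypothetical and are not taken from the scraped data.

```python
from statistics import median

def label_by_median(salaries):
    """Return 1 for salaries above the median and 0 otherwise."""
    threshold = median(salaries)
    return [1 if s > threshold else 0 for s in salaries]

# Hypothetical monthly salaries, for illustration only
sample_salaries = [4000, 5500, 7000, 9000, 12000]
labels = label_by_median(sample_salaries)
print(labels)
```

Salaries exactly at the median fall into the low class here; that is a design choice, and a different threshold (for example a quartile) could be swapped in the same way.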

You have learned a variety of new skills and models that may be useful for this problem:
- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.


In [1]:
#Set up libraries and functions

import os
import time
import requests
import numpy as np
import pandas as pd
import re

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
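from selenium.common.exceptions import NoSuchElementException

def check_exists_by_xpath(xpath):
    """Return True if the global `driver` finds an element matching xpath."""
    #Selenium 3 API; Selenium 4 uses driver.find_element(By.XPATH, xpath)
    try:
        driver.find_element_by_xpath(xpath)
    except NoSuchElementException:
        return False
    return True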
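
The scraping cells call `check_exists_by_xpath` inside `while` loops that wait for an element to appear, and those loops never give up if the element is missing. A bounded version of that polling pattern might look like the sketch below. `wait_until` and its `timeout` and `interval` parameters are hypothetical names, and the `ready` predicate stands in for a call such as `check_exists_by_xpath(...)` so the example runs without a browser.

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns True or timeout seconds elapse.

    Returns True if the predicate succeeded, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    # One last check so a predicate that turns True right at the deadline is not missed
    return bool(predicate())

# Stand-in predicate: becomes True on the third call (no browser needed)
calls = {"n": 0}

def ready():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until(ready, timeout=2.0, interval=0.01))
```

In the notebook this could be used as `wait_until(lambda: check_exists_by_xpath(xpath))`, skipping the link when it returns `False`. Selenium also provides `WebDriverWait` for the same purpose.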

In [11]:
#Load Chrome Driver and settings
chromedriver = "./chromedriver/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver

chrome_options = webdriver.ChromeOptions()
#Skip image loading to speed up page loads
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_options.add_experimental_option("prefs", prefs)
chrome_options.add_argument("--disable-extensions")
driver = webdriver.Chrome(executable_path=chromedriver, options=chrome_options)

#Shorter tests
job_search = ['data scientist']
#job_search = ['data scientist', 'data science']


#Jobs to search for
"""job_search = ['data scientist', 'data science', 'data analyst', 'data analytics',\
        'business analytics', 'financial analytics', 'marketing analytics', \
        'data visualisation', 'data operations', 'data strategist', \
        'data engineer', 'data architect', 'data manager', 'data mining','data intern', \
        'data lead', 'data consultant', 'machine learning', 'deep learning', \
        'big data', 'business intelligence']"""

driver.get("https://www.mycareersfuture.sg/")
assert "MyCareersFuture" in driver.title

#Get all the links for the jobs
links = []
for job in job_search:    
    i=0
    driver.get("https://www.mycareersfuture.sg/search?search="+\
                job+\
               "&sortBy=new_posting_date")
    #Check whether links have appeared. Otherwise wait longer
    while not check_exists_by_xpath('//*[@id="search-results"]/div[3]//a[@href]'):
        time.sleep(max([np.random.normal(0.5,0.01),np.random.normal(0.4,0.1)]))

    #Check to see if next page exists
    while check_exists_by_xpath("//*[contains(text(), '❯')]"):
        #Close popup if it shows
        if check_exists_by_xpath('//*[@id="snackbar"]/div[1]/div/span'):
            ActionChains(driver).click(driver.find_element_by_xpath('//*[@id="snackbar"]/div[1]/div/span')).perform()
            time.sleep(0.1)
        #Get links
        elems = driver.find_elements_by_xpath('//*[@id="search-results"]/div[3]//a[@href]')
        for elem in elems:
            links.append(elem.get_attribute("href"))
        #Click next page
        ActionChains(driver).click(driver.find_element_by_xpath("//*[contains(text(), '❯')]")).perform()
        #Keep track of pages
        i += 1
        time.sleep(max([np.random.normal(2,0.01),np.random.normal(3,0.2)]))
        print(job,i)
    #Repeat for last page for particular job search
    #Close popup
    if check_exists_by_xpath('//*[@id="snackbar"]/div[1]/div/span'):
        ActionChains(driver).click(driver.find_element_by_xpath('//*[@id="snackbar"]/div[1]/div/span')).perform()
        time.sleep(0.1)
    #Get HTML of page
    html = driver.page_source
    elems = driver.find_elements_by_xpath('//*[@id="search-results"]/div[3]//a[@href]')
    for elem in elems:
        links.append(elem.get_attribute("href"))
    #Keep track of how many pages have been checked    
    print(job,i+1)
#Remove duplicates
links = list(dict.fromkeys(links))
#Remove ads from links
links = [x for x in links if not x.startswith('https://content.mycareersfuture.sg/')]
print('Number of unique results: ' + str(len(links)))

data scientist 1
data scientist 2
data scientist 3
data scientist 4
data scientist 5
data scientist 6
data scientist 7
Number of unique results: 122
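
The cell above builds the search URL by string concatenation and then removes duplicate and ad links. The sketch below isolates those two steps as small functions so they can be checked without a browser. The function names `build_search_url` and `clean_links` are hypothetical; the URLs and the `sortBy` parameter are copied from the cell above, and percent-encoding the search term is an assumption about what the site accepts.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://www.mycareersfuture.sg/search"
AD_PREFIX = "https://content.mycareersfuture.sg/"

def build_search_url(job):
    """Build the search URL for one term, percent-encoding spaces and symbols."""
    params = {"search": job, "sortBy": "new_posting_date"}
    return BASE_URL + "?" + urlencode(params, quote_via=quote)

def clean_links(links):
    """Drop duplicate links (keeping first-seen order) and ad links."""
    unique = list(dict.fromkeys(links))
    return [x for x in unique if not x.startswith(AD_PREFIX)]

print(build_search_url("data scientist"))
```

`dict.fromkeys` is used rather than `set` because it preserves the order in which links were collected, which keeps the newest postings first.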


In [12]:
#Get Job Details for each link
#chrome_options = webdriver.ChromeOptions()
#prefs = {"profile.managed_default_content_settings.images": 2}
#chrome_options.add_experimental_option("prefs", prefs)
#chrome_options.add_argument("--disable-extensions")
#driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver", options=chrome_options)

driver.get('https://www.mycareersfuture.sg/')
time.sleep(3)

j=0
jobs = []

xpaths = [['//*[@id="job_details"]/div[1]/div[2]/div[1]/section[1]/p', None],\
['//*[@id="job_title"]', None],\
['//*[@id="address"]', None],\
['//*[@id="employment_type"]', None],\
['//*[@id="seniority"]', None],\
['//*[@id="min_experience"]', None],\
['//*[@id="job-categories"]', None],\
['//*[@id="job_details"]/div[1]/div[2]/div[1]/div/section[2]/div/span[2]/div/span[1]', None],\
['//*[@id="job_details"]/div[1]/div[2]/div[1]/div/section[2]/div/span[2]/div/span[2]', None],\
['//*[@id="job_details"]/div[1]/div[2]/div[1]/div/section[2]/div/span[3]', None],\
['//*[@id="description-content"]', None],\
['//*[@id="requirements-content"]', None],\
['//*[@id="skills-needed"]/div/div', []]]

for link in links:
    driver.get(link)
    j +=1
    if j%50==0:
        print(j)
    #print(link) #Check URL
    time.sleep(1)
    
    #Check to see if company name loaded. if so, record, if not, wait.
    while not check_exists_by_xpath(xpaths[0][0]):
        time.sleep(0.5)
    xpaths[0][1] = driver.find_elements_by_xpath(xpaths[0][0])[0].text
    
    #Record all other values.
    for i in range(1, 12):
        if check_exists_by_xpath(xpaths[i][0]):
            xpaths[i][1] = driver.find_elements_by_xpath(xpaths[i][0])[0].text
        else: xpaths[i][1] = None
    
    #Record skills into list (empty when the posting lists none, so a previous job's skills never carry over)
    skills = [elem.text for elem in driver.find_elements_by_xpath(xpaths[12][0])]
    xpaths[12][1] = skills
    #Create row of data
    job = [link]
    for i in range(13):
        job.append(xpaths[i][1])
    #Append jobs with job
    jobs.append(job)
driver.close()

50
100


In [13]:
#Make into Dataframe and clean
jobs = pd.DataFrame(jobs)
jobs.columns = ['link','company', 'job_title', 'address', 'employment_type', 'seniority', 'min_experience',\
                'job_category', 'salary_low', 'salary_high', 'salary_time', 'role_description', 'job_requirements', 'skills']
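
The `salary_low`, `salary_high` and `salary_time` columns hold scraped text, so they need converting to numbers before any modelling. The sketch below is one possible approach, assuming the text looks something like `$5,000` and that the period labels are words such as `Monthly` or `Annually`; both formats are assumptions and should be checked against the real data. `parse_salary` and `to_monthly` are hypothetical helper names.

```python
import re

def parse_salary(text):
    """Extract the first number from salary text, or None if there is none."""
    if not isinstance(text, str):
        return None  # missing values (None/NaN) stay missing
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))

def to_monthly(amount, period):
    """Convert an amount to a monthly figure; the period labels are assumed."""
    if amount is None:
        return None
    if isinstance(period, str) and period.strip().lower() in ("annually", "yearly"):
        return amount / 12
    return amount

print(parse_salary("$5,000"), parse_salary("to $8,500"), parse_salary(None))
```

With the DataFrame above this could be applied column-wise, for example `jobs['salary_low'].apply(parse_salary)`.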

In [14]:
jobs.to_csv('out_test.csv', sep='\t', index=False)

In [15]:
jobs = pd.read_csv('out_test.csv', delimiter='\t')
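
Writing a list-valued column to CSV stores its string representation, so after reloading, each `skills` entry is a string such as `"['Python', 'SQL']"` rather than a list. A minimal sketch of converting it back is shown below; `parse_skills` is a hypothetical helper, and treating missing or malformed cells as an empty list is an assumption.

```python
import ast

def parse_skills(cell):
    """Turn a stringified list such as "['SQL', 'Python']" back into a list."""
    if isinstance(cell, list):
        return cell
    if not isinstance(cell, str):
        return []  # NaN or None: treat as no skills listed
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []

print(parse_skills("['Python', 'SQL']"))
```

On the reloaded DataFrame this could be applied as `jobs['skills'] = jobs['skills'].apply(parse_skills)`. `ast.literal_eval` is used instead of `eval` because it only accepts Python literals and does not execute arbitrary code from the file.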