# Web Scraping - Indeed.com
General steps for Web Scraping
1. Check whether the website allows web scraping
2. Obtain the source code (HTML File) by using the website URL
3. Download the website content
4. Parse the content using the HTML tags and CSS selectors of the elements of interest
5. Extract relevant data/features
6. Organize raw data in structured format (e.g., CSV)
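
Step 1 can be sketched with the standard library's `urllib.robotparser`. This is a minimal illustration, not a check this notebook actually runs: `SAMPLE_ROBOTS_TXT`, `is_allowed`, the `my-scraper` user agent, and the `example.com` URLs are all hypothetical, and the rules are inlined so the example works offline. A real check would point the parser at the site's own `robots.txt` with `set_url()` and `read()`, and the site's terms of service should be read as well.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
# For a live site, use parser.set_url("<site>/robots.txt") and parser.read().
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/jobs?q=data"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/page"))
```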

### Import Dependencies 

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from datetime import datetime
from arsenic import get_session
from arsenic.browsers import Firefox
from arsenic.services import Geckodriver
import asyncio
import time

# disable arsenic logging to stdout
import structlog
import logging

logger = logging.getLogger()
logger.setLevel(logging.WARN)
structlog.configure(logger_factory=lambda: logger)

### Path to webdriver (Firefox, Chrome) 

In [2]:
# Ensure that the driver path is correct before running this script.
# Microsoft Windows
# driver_path = "./drivers/windows/geckodriver.exe"
#driver_path = "./programs/geckodriver.exe"
driver_path = r"C:\Users\ranga\Downloads\geckodriver-v0.32.1-win64\geckodriver.exe"
# Linux
# driver_path = "./drivers/linux/geckodriver"
# driver_path = "/usr/bin/geckodriver"

options = {
  'moz:firefoxOptions': {
    # if you want it to be headless
    'args': ['-headless'],
    'log': {'level': 'warn'},
    # Needed for windows / non-default firefox install
    'binary': 'C:\\Program Files\\Mozilla Firefox\\firefox.exe'
  }
}

### Define position and location 

In [3]:
## Enter a job position
position = "data analyst"
## Enter a location (City, State or Zip or remote)
locations = "san francisco"

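from urllib.parse import urlencode

def get_url(position, location):
    """Build the Indeed search URL, URL-encoding the position and location."""
    # urlencode escapes spaces and special characters ("data analyst" -> "data+analyst").
    return "https://www.indeed.com/jobs?" + urlencode({"q": position, "l": location})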

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])

### Scrape job postings

In [4]:
## Number of postings to scrape
postings = 1600

## Number of browser instances to use
n = 3

pages = list(range(0, postings, 7))

state = {
  'lock': asyncio.Lock(),
  'ids': set(),
  'n': 0
}
             
async def get_jobs(url, pages, state):
  data = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for i in pages:
      await session.get(url + "&start=" + str(i))
      jobs = await session.get_elements("[class='job_seen_beacon']")

      for job in jobs:
        result_html = await job.get_property('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        liens = await job.get_elements("a")
        link = await liens[0].get_attribute("href")

        title = soup.select('.jobTitle')[0].get_text().strip()
        # Skip cards that have no company name.
        try:
          company = soup.select('.companyName')[0].get_text().strip()
        except IndexError:
          continue
        location = soup.select('.companyLocation')[0].get_text().strip()
        # Optional fields: select() returns an empty list when the element
        # is missing, so indexing raises IndexError and a placeholder is used.
        try:
          salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
          salary = 'NaN'
        try:
          rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
          rating = 'NaN'
        try:
          date = soup.select('.date')[0].get_text().strip()
        except IndexError:
          date = 'NaN'
        try:
          description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
          description = ''
            
        Id = f"{title}{company}{location}{rating}{date}{salary}{description}"
        dupe = False
        async with state['lock']:
          if Id in state['ids']:
            dupe = True
          else:
            state['ids'].add(Id)
            state['n'] = state['n'] + 1
            print("Job number {0:4d} added - {1:s}".format(state['n'],title))
        if dupe:
          continue

        data.append({
          'Title': title,
          "Company": company,
          'Location': location,
          'Rating': rating,
          'Date': date,
          "Salary": salary,
          "Description": description,
          "Links": link
        })

      # The for loop already advances i through pages; reassigning i here would have no effect.
  return data

tasks = [asyncio.create_task(get_jobs(url, p, state)) for p in np.array_split(pages, n)]
dataframe = pd.DataFrame([j for task in tasks for j in await task])

Job number    1 added - Data Analyst
Job number    2 added - Senior Data Analyst
Job number    3 added - Classification and Compensation Analyst (Confidential Administrative Support II) - Office of Human Resources
Job number    4 added - Data Analyst
Job number    5 added - Programs & Outreach Coordinator (Administrative Analyst/Specialist - Exempt I) - Institute for Civic & Community Engagement
Job number    6 added - Data Analyst
Job number    7 added - Academic Office Coordinator (Administrative Analyst/Specialist, Exempt I) - Computer Science, College of Science & Engineering
Job number    8 added - Production Analyst
Job number    9 added - Education Services Analyst - Administrative Analyst Specialist Exempt I
Job number   10 added - Data Analyst
Job number   11 added - Grant Support Coordinator (Administrative Analyst/Specialist, Non-Exempt) - Office of Research and Sponsored Programs
Job number   12 added - Compliance Coordinator (Administrative Analyst/Specialist, Non-Exempt) 

Job number  145 added - Programs & Outreach Coordinator (Administrative Analyst/Specialist - Exempt I) - Institute for Civic & Community Engagement
Job number  146 added - Academic Office Coordinator (Administrative Analyst/Specialist, Exempt I) - Computer Science, College of Science & Engineering
Job number  147 added - Education Services Analyst - Administrative Analyst Specialist Exempt I
Job number  148 added - Grant Support Coordinator (Administrative Analyst/Specialist, Non-Exempt) - Office of Research and Sponsored Programs
Job number  149 added - Compliance Coordinator (Administrative Analyst/Specialist, Non-Exempt) - Office of Research and Sponsored Program
Job number  150 added - Position Control Analyst - Office of Sr. Business Officer (2022-23)
Job number  151 added - Compliance and Systems Analyst
Job number  152 added - Research Analyst
Job number  153 added - Senior Programmer Analyst (Senior IT Business Analyst)
Job number  154 added - Program Analyst, Electronic Revenu

Job number  269 added - Data Analyst Lead - Data Scientist
Job number  270 added - Lead Talent Business Analyst
Job number  271 added - Senior Software Engineer - Analyst Experience
Job number  272 added - Senior Business Intelligence Analyst - Sales Finance
Job number  273 added - Business Analyst
Job number  274 added - Management Analyst II
Job number  275 added - Senior Management Analyst
Job number  276 added - Senior Project Business Analyst
Job number  277 added - Senior Business Analyst
Job number  278 added - Senior SuccessFactors Business Analyst, HR Systems
Job number  279 added - Programming Analyst, Medical Affairs
Job number  280 added - Senior Analyst, QC Analytical Training and Validation
Job number  281 added - Senior Consultant and Business Analyst in Regulatory Technology & Compliance Analytics
Job number  282 added - (Agile1)IT- Business Analyst - Senior
Job number  283 added - Senior Payroll Analyst
Job number  284 added - Sr. Salesforce Business Analyst - REMOTE
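
The shared `state` dictionary above is what keeps the browser instances from recording the same posting twice: each coroutine checks and updates the `ids` set while holding the lock. The sketch below reproduces that pattern with plain strings instead of scraped postings. `collect`, `main`, and the sample chunks are hypothetical names used only for illustration. Inside a notebook, where an event loop is already running, `asyncio.run(main())` would be replaced by `await main()`.

```python
import asyncio

async def collect(items, state):
    """Keep only the items that no other worker has already claimed."""
    kept = []
    for item in items:
        async with state["lock"]:
            if item in state["ids"]:
                continue  # another worker already recorded this item
            state["ids"].add(item)
            state["n"] += 1
        kept.append(item)
        await asyncio.sleep(0)  # yield control so the workers interleave
    return kept

async def main():
    state = {"lock": asyncio.Lock(), "ids": set(), "n": 0}
    # Overlapping chunks stand in for result pages that repeat postings.
    chunks = [["a", "b", "c"], ["b", "c", "d"], ["c", "d", "e"]]
    results = await asyncio.gather(*(collect(chunk, state) for chunk in chunks))
    return results, state

results, state = asyncio.run(main())
unique_items = sorted(item for chunk in results for item in chunk)
print(unique_items, state["n"])
```

Each item ends up in exactly one worker's list, so the total number kept matches `state["n"]` no matter how the workers interleave.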

### Scrape full job descriptions

In [5]:
Links_list = dataframe['Links'].tolist()

import random

async def get_description(urls):
  descriptions = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for url in urls:
      await session.get("https://www.indeed.com"+url)
      await asyncio.sleep(1)  # one-second delay that does not block the other browser sessions
      jd = await session.get_element('#jobDescriptionText')
      descriptions.append(await jd.get_text())
      await asyncio.sleep(random.random() * 1.5)
  return descriptions

## Number of browser instances to use
n = 3

tasks = [asyncio.create_task(get_description(urls)) for urls in np.array_split(Links_list, n)]
dataframe['Descriptions'] = [desc for task in tasks for desc in await task]
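
The cell above assigns the descriptions back to the dataframe by position, so it relies on two things: the links are split into contiguous chunks, and the results are read back in the same chunk order. The sketch below illustrates that with stand-ins and no browser. `fetch_all`, `split`, and `main` are hypothetical helpers, and `split` mimics the chunk sizes `np.array_split` produces. Inside a notebook, `asyncio.run(...)` would be replaced by `await main(urls)`.

```python
import asyncio
import random

async def fetch_all(urls):
    """Stand-in for get_description: one result per URL, in input order."""
    results = []
    for url in urls:
        # Non-blocking pause: other tasks keep running while this one waits.
        await asyncio.sleep(random.random() * 0.01)
        results.append(f"description for {url}")
    return results

def split(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for k in range(n):
        end = start + size + (1 if k < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

async def main(urls, n=3):
    # gather returns results in the order the awaitables were passed,
    # regardless of which task finishes first.
    per_chunk = await asyncio.gather(*(fetch_all(c) for c in split(urls, n)))
    return [d for chunk in per_chunk for d in chunk]

urls = [f"/job/{i}" for i in range(8)]
descriptions = asyncio.run(main(urls))
print(descriptions[:2])
```

Because every chunk returns exactly one entry per URL, the flattened list lines up with the original link order. If a page were skipped on error, the lengths would no longer match and the column assignment would fail, so a placeholder value per failed page would be needed.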

### Save results

In [6]:
# Convert the dataframe to a csv file
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + ".csv", index=False)

In [7]:
dataframe

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
0,Data Analyst,RPX,"San Francisco, CA 94105 (Financial District area)",,PostedPosted 9 days ago,"$80,000 - $92,000 a year",Provide data-based answers to questions posed ...,/rc/clk?jk=a269faf55acfd4cc&fccid=3bd140ae8fb1...,RPX Corporation is the leading provider of a c...
1,Senior Data Analyst,North East Medical Services,"Daly City, CA 94014",2.7,PostedJust posted,"$125,544 - $145,496 a year",Perform research by gathering data from a vari...,/rc/clk?jk=a23f5b0b575886ab&fccid=9bf2c3dc844c...,The Senior Data Analyst will be responsible fo...
2,Data Analyst,Attentive Mobile,"Hybrid remote in San Francisco, CA",3.6,PostedPosted 6 days ago,"$85,200 - $127,000 a year",Our primary data visualization tool is Looker....,/rc/clk?jk=c350edef327bb8ab&fccid=6bab5fc4ce26...,"SAN FRANCISCO, CA /\nENGINEERING /\nFULL-TIME\..."
3,Data Analyst,Vetro Tech Inc,"Remote in San Francisco, CA",,EmployerActive 4 days ago,"$83,000 - $93,000 a year",Proven working experience as a data analyst or...,/company/Vetro-Tech-Inc/jobs/Data-Analyst-b338...,We are looking for a passionate certified Data...
4,Production Analyst,Procter & Gamble,"San Francisco, CA",4.1,PostedPosted 25 days ago,"$71,700 - $102,400 a year",Strong background in data analysis and ability...,/rc/clk?jk=869cc4f55b98f8f3&fccid=2da0dedf6df9...,P&G is the largest consumer packaged goods com...
...,...,...,...,...,...,...,...,...,...
287,Senior Talent Acquisition Analyst (Senior Huma...,City of Santa Ana,"Civic Center Plaza, CA",4.3,PostedPosted 14 days ago,,Provides complex professional staff assistance...,/rc/clk?jk=076fd2c44b6bff66&fccid=4b76342a7e1a...,JOB\nThe City of Santa Ana is looking for indi...
288,Health IT Senior Business Analyst – Limited Te...,County of San Mateo,"San Mateo, CA",3.9,PostedPosted 3 days ago,,Performs other duties as assignedIdeal candida...,/rc/clk?jk=1c10e09ff29f326f&fccid=12de6aa257aa...,JOB\nSan Mateo County Health is seeking a well...
289,Human Resources Analyst I or II (Employee and ...,"Superior Court of California, County of Alameda","Oakland, CA 94612",2.4,PostedPosted 21 days ago,,"Assists in the development, maintenance, and a...",/rc/clk?jk=d50afde47e8681db&fccid=7d7bd3fb4c71...,"JOB\nThe Superior Court of California, County ..."
290,Associate Program Analyst,Alameda County Transportation Commission,"Oakland, CA 94607 (Downtown area)",,PostedPosted 17 days ago,,Three (3) years of responsible professional-le...,/rc/clk?jk=980ac5fae7eca87c&fccid=300684e1f07e...,JOB\nTHE OPPORTUNITY\n\nUnder general supervis...
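
The `Date` and `Salary` columns above are still raw strings (for example `PostedPosted 9 days ago` and `$80,000 - $92,000 a year`). A possible cleaning step is sketched below. `parse_salary` and `parse_days_ago` are hypothetical helpers, not part of the notebook, and they assume only the formats visible in the rows shown: whole-dollar amounts and "N days ago" or "Just posted" date text.

```python
import re

def parse_salary(text):
    """Return (low, high) as ints from text like '$80,000 - $92,000 a year'.

    Returns None for missing values or text without a dollar amount.
    """
    if not isinstance(text, str):
        return None
    amounts = [int(m.replace(",", "")) for m in re.findall(r"\$([\d,]+)", text)]
    if not amounts:
        return None
    return min(amounts), max(amounts)

def parse_days_ago(text):
    """Return the posting age in days, or None if it cannot be determined."""
    if not isinstance(text, str):
        return None
    if "Just posted" in text:
        return 0
    match = re.search(r"(\d+)\+? days? ago", text)
    return int(match.group(1)) if match else None

print(parse_salary("$80,000 - $92,000 a year"))
print(parse_days_ago("PostedPosted 9 days ago"))
```

With pandas, these could be applied column-wise, for example `dataframe['Date'].apply(parse_days_ago)`.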
