# Web Scraping - Indeed.com
The general steps for web scraping are:
1. Check whether the website allows web scraping
2. Request the page (HTML source) using the website URL
3. Download the page content
4. Parse the content using the tags and class names of the elements of interest
5. Extract relevant data/features
6. Organize raw data in structured format (e.g., CSV)
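
Step 1 can be partly automated by reading the site's `robots.txt`. The sketch below uses the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, the `my-scraper` user agent and the helper name `is_allowed` are all made up for illustration so that the example runs offline; they are not the rules of any real site. A site's terms of service may add restrictions that `robots.txt` does not express.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption, not taken from a real site).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /jobs
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/jobs"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the file instead of parsing a string.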

### Import Dependencies 

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from datetime import datetime
from arsenic import get_session
from arsenic.browsers import Firefox
from arsenic.services import Geckodriver
import asyncio
import time

# disable arsenic logging to stdout
import structlog
import logging

logger = logging.getLogger()
logger.setLevel(logging.WARN)
structlog.configure(logger_factory=lambda: logger)

### Path to webdriver (Firefox, Chrome) 

In [2]:
# Ensure that the driver path is correct before running this script.
# Microsoft Windows
# driver_path = "./drivers/windows/geckodriver.exe"
#driver_path = "./programs/geckodriver.exe"
driver_path = r"C:\Users\ranga\Downloads\geckodriver-v0.32.1-win64\geckodriver.exe"
# Linux
# driver_path = "./drivers/linux/geckodriver"
# driver_path = "/usr/bin/geckodriver"

options = {
  'moz:firefoxOptions': {
    # if you want it to be headless
    'args': ['-headless'],
    'log': {'level': 'warn'},
    # Needed for windows / non-default firefox install
    'binary': 'C:\\Program Files\\Mozilla Firefox\\firefox.exe'
  }
}

### Define position and location 

In [3]:
## Enter a job position
position = "director of analytics"
## Enter a location (City, State or Zip or remote)
locations = "california"

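from urllib.parse import quote_plus

def get_url(position, location):
    # URL-encode the search terms (e.g. spaces become "+")
    url_template = "https://www.indeed.com/jobs?q={}&l={}"
    return url_template.format(quote_plus(position), quote_plus(location))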

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])
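
Search terms such as "director of analytics" contain spaces, which have to be encoded before they go into a query string. As a possible alternative to string formatting, the sketch below builds the whole query with `urllib.parse.urlencode`. The helper name `build_search_url` and its `start` parameter are illustrative assumptions; the `q`, `l` and `start` parameter names are the ones this notebook already uses.

```python
from urllib.parse import urlencode

def build_search_url(position: str, location: str, start: int = 0) -> str:
    """Build a search URL with encoded query parameters (hypothetical helper)."""
    params = {"q": position, "l": location}
    if start:
        # Result offset, matching the "&start=" suffix used when paging
        params["start"] = start
    return "https://www.indeed.com/jobs?" + urlencode(params)

print(build_search_url("director of analytics", "california"))
print(build_search_url("director of analytics", "california", start=10))
```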

### Scrape job postings

In [4]:
## Number of postings to scrape
postings = 1000

## Number of browser instances to use
n = 3

pages = list(range(0, postings, 10))

state = {
  'lock': asyncio.Lock(),
  'ids': set(),
  'n': 0
}
             
async def get_jobs(url, pages, state):
  data = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for i in pages:
      await session.get(url + "&start=" + str(i))
      jobs = await session.get_elements("[class='job_seen_beacon']")

      for job in jobs:
        result_html = await job.get_property('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        liens = await job.get_elements("a")
        link = await liens[0].get_attribute("href")

        title = soup.select('.jobTitle')[0].get_text().strip()
        # A missing element makes select(...)[0] raise IndexError.
        # Catching only IndexError avoids swallowing task cancellation.
        try:
          company = soup.select('.companyName')[0].get_text().strip()
        except IndexError:
          # skip cards without a company name
          continue
        location = soup.select('.companyLocation')[0].get_text().strip()
        try:
          salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
          salary = 'NaN'
        try:
          rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
          rating = 'NaN'
        try:
          date = soup.select('.date')[0].get_text().strip()
        except IndexError:
          date = 'NaN'
        try:
          description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
          description = ''
            
        Id = f"{title}{company}{location}{rating}{date}{salary}{description}"
        dupe = False
        async with state['lock']:
          if Id in state['ids']:
            dupe = True
          else:
            state['ids'].add(Id)
            state['n'] = state['n'] + 1
            print("Job number {0:4d} added - {1:s}".format(state['n'],title))
        if dupe:
          continue

        data.append({
          'Title': title,
          "Company": company,
          'Location': location,
          'Rating': rating,
          'Date': date,
          "Salary": salary,
          "Description": description,
          "Links": link
        })

  return data

tasks = [asyncio.create_task(get_jobs(url, p, state)) for p in np.array_split(pages, n)]
dataframe = pd.DataFrame([j for task in tasks for j in await task])

Job number    1 added - Head of Partnerships Tech & Analytics
Job number    2 added - Director, Geospatial Products & Analytics
Job number    3 added - Director, Product Analytics
Job number    4 added - Director of Operations
Job number    5 added - Director of Threat Research - Intrusion Prevention
Job number    6 added - Assoc Director/Director, Strategy & Analytics
Job number    7 added - Medical Director/Senior Medical Director
Job number    8 added - Assistant Director Evaluation
Job number    9 added - Director of Business and Portfolio Analytics
Job number   10 added - Director Clinical Pharmacology
Job number   11 added - Transaction Advisory Services Director – Technology, Media, and Telecommunications (TMT)
Job number   12 added - Director of Product Management (SASE - Visibility and Analytics)
Job number   13 added - Associate Director, Prospect Management and Research
Job number   14 added - Director of Analytics & Performance Reporting
Job number   15 added - Credit Analy

Job number  133 added - Director, Data Science and Analytics
Job number  134 added - Relationship Executive, Technology Group, Corporate Client Banking, Executive Director
Job number  135 added - Director of Trading
Job number  136 added - Director, Payment Risk & Fraud
Job number  137 added - Director, Product Management
Job number  138 added - Director, Data Analytics
Job number  139 added - DIRECTOR, BUSINESS INTELLIGENCE
Job number  140 added - Director of Finance
Job number  141 added - Director, Competitive Intelligence
Job number  142 added - Director of Information Technology
Job number  143 added - Executive Director, Insights and Business Analysis - Pharmaceuticals
Job number  144 added - Director of Strategic Quality & Results
Job number  145 added - Senior Director, System Data Analytics (IT/Data Processing)
Job number  146 added - Director, Revenue Operations
Job number  147 added - Associate Dir/Assistant Director - Research (Hybrid or Remote)
Job number  148 added - Bios

Job number  259 added - Director of Operations
Job number  260 added - (USA) Director, User Experience - Customer Platform
Job number  261 added - Vice President, Head of Web Strategy and Governance
Job number  262 added - Director, Talent Management & Development
Job number  263 added - IT Business Solutions Director
Job number  264 added - Senior Director, SMB Programs
Job number  265 added - Director of Adoption, Consulting Processes and Tools
Job number  266 added - Head of Product
Job number  267 added - Director of Machine Learning
Job number  268 added - Director, Portfolio Monetization Analytics & Strategy
Job number  269 added - Director, Data Science
Job number  270 added - Sr. Principal Cloud Specialist - Customer Success Director
Job number  271 added - Director of Field Operations - Americas
Job number  272 added - (Associate) Director, Recruiting
Job number  273 added - Director, Financial Services
Job number  274 added - Director, Data Strategy & Governance
Job number  2

Job number  390 added - Director of Business Planning & Analysis, Commercial Finance
Job number  391 added - Production Systems Support Director
Job number  392 added - Sr. Director, Digital Data & Analytics
Job number  393 added - Director, Controller for the FEP Business Unit - (M6)
Job number  394 added - Corporate Director Information Technology (IT)
Job number  395 added - Director, Black Maternal Health Program
Job number  396 added - Director, Platform Engineering Leader, Development Services and Experiences
Job number  397 added - Statistical Science Director, Late Oncology
Job number  398 added - Associate Director, Statistics
Job number  399 added - Director, Product Management- Software
Job number  400 added - IT Enterprise Business Systems Director
Job number  401 added - Director, Financial Analysis & Enterprise Risk Management
Job number  402 added - Director Licensing Product Management
Job number  403 added - Director, Financial Planning & Analysis
Job number  404 added

Job number  512 added - Science Director
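
The cell above splits the page offsets across several browser sessions and uses a shared set guarded by an `asyncio.Lock` to drop postings that more than one session sees. The sketch below reproduces that pattern without a browser. `FAKE_SITE`, `split_evenly`, `worker` and `scrape_all` are hypothetical names, and the `asyncio.sleep(0)` call only stands in for a page load.

```python
import asyncio

def split_evenly(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for k in range(n):
        end = start + size + (1 if k < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

async def worker(offsets, fake_site, state):
    """Collect postings for the given offsets, skipping ones already seen."""
    kept = []
    for offset in offsets:
        await asyncio.sleep(0)  # stand-in for loading one results page
        for posting in fake_site.get(offset, []):
            async with state["lock"]:
                if posting in state["ids"]:
                    continue  # another worker (or page) already recorded it
                state["ids"].add(posting)
            kept.append(posting)
    return kept

async def scrape_all(fake_site, n):
    state = {"lock": asyncio.Lock(), "ids": set()}
    chunks = split_evenly(sorted(fake_site), n)
    results = await asyncio.gather(*(worker(c, fake_site, state) for c in chunks))
    return [p for chunk in results for p in chunk]

# Hypothetical data: result offset -> postings shown on that page
FAKE_SITE = {0: ["a", "b"], 10: ["b", "c"], 20: ["c", "d"], 30: ["a", "e"]}
unique = asyncio.run(scrape_all(FAKE_SITE, n=2))
print(sorted(unique))
```

Inside a notebook, where an event loop is already running, `await scrape_all(FAKE_SITE, n=2)` would take the place of `asyncio.run(...)`.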


### Scrape full job descriptions

In [5]:
Links_list = dataframe['Links'].tolist()

import random

async def get_description(urls):
  descriptions = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for url in urls:
      await session.get("https://www.indeed.com"+url)
      await asyncio.sleep(1)  # one-second delay that does not block the other browser sessions
      jd = await session.get_element('#jobDescriptionText')
      descriptions.append(await jd.get_text())
      await asyncio.sleep(random.random() * 1.5)
  return descriptions

## Number of browser instances to use
n = 3

tasks = [asyncio.create_task(get_description(urls)) for urls in np.array_split(Links_list, n)]
dataframe['Descriptions'] = [desc for task in tasks for desc in await task]
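
Assigning the `Descriptions` column only works if every link yields exactly one entry, so a page that fails to load or lacks the description element needs a placeholder rather than an exception. The sketch below shows one way to do that with a retry and a default value. `fetch_with_fallback` and `fake_fetch` are hypothetical; `fake_fetch` stands in for the browser calls, and the retry count and pause length are arbitrary.

```python
import asyncio
import random

async def fetch_with_fallback(fetch, url, retries=2, default=""):
    """Await fetch(url); retry on failure, then fall back to default."""
    for _ in range(retries + 1):
        try:
            return await fetch(url)
        except Exception:
            # Short randomized pause before the next attempt
            await asyncio.sleep(random.random() * 0.01)
    return default

async def fake_fetch(url):
    # Hypothetical stand-in for loading a page and reading its description
    if "missing" in url:
        raise LookupError("no description element on " + url)
    return "description for " + url

async def collect(urls):
    # One result per URL, in order, so the list lines up with the dataframe rows
    return [await fetch_with_fallback(fake_fetch, u) for u in urls]

urls = ["/rc/clk?jk=1", "/rc/clk?jk=missing", "/rc/clk?jk=3"]
descriptions = asyncio.run(collect(urls))
print(descriptions)
```

Inside a notebook, `await collect(urls)` would take the place of `asyncio.run(...)`.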

### Save results

In [6]:
# Convert the dataframe to a csv file
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + ".csv", index=False)
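
The file name above embeds the raw search terms, so a position such as "director of analytics" produces a name with spaces. If that is inconvenient, the terms can be reduced to a slug first. This is an optional sketch; `csv_filename` is a hypothetical helper and the slug rule (lowercase letters and digits joined by hyphens) is an arbitrary choice.

```python
import re
from datetime import date

def csv_filename(position: str, location: str, day: date) -> str:
    """Return a file name like 2023-01-15_director-of-analytics_california.csv."""
    def slug(text: str) -> str:
        # Replace every run of non-alphanumeric characters with one hyphen
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
    return f"{day.isoformat()}_{slug(position)}_{slug(location)}.csv"

print(csv_filename("director of analytics", "california", date(2023, 1, 15)))
```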

In [7]:
dataframe

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
0,Head of Partnerships Tech & Analytics,Snapchat,"Los Angeles, CA 90291 (Venice area)",3.6,PostedPosted 12 days ago,"$196,000 - $285,000 a year",The Strategy & Ops Team is a core group within...,/rc/clk?jk=84b5e99096caca3e&fccid=f368300325e8...,Snap Inc\nis a technology company. We believe ...
1,"Director, Geospatial Products & Analytics",Prologis,"Hybrid remote in San Francisco, CA",3.9,PostedPosted 5 days ago,"$170,000 - $210,000 a year",Refine and execute the strategic plan for the ...,/rc/clk?jk=22251926e7343040&fccid=594d10b1768e...,Prologis is the global leader in logistics rea...
2,"Director, Product Analytics",ServiceNow,"Santa Clara, CA 95054",3.7,PostedPosted 1 day ago,"$198,500 - $347,500 a year",Provide thought leadership in the area of Prod...,/rc/clk?jk=bdc9c70e47688613&fccid=7442885bc0fa...,"Company Description\n\nAt ServiceNow, our tech..."
3,Director of Operations,PSA (Professional Sports Authenticator),"Santa Ana, CA",,PostedPosted 27 days ago,"$150,682 - $234,443 a year","In this role, you will lead all operations and...",/rc/clk?jk=17cbc7f854984298&fccid=dd616958bd9d...,Collectors is building the future of how colle...
4,"Assoc Director/Director, Strategy & Analytics",Sony Music Entertainment US,"Los Angeles, CA",4.3,PostedPosted 30+ days ago,"$100,000 - $120,000 a year","As Associate Director/Director, Strategy & Ana...",/rc/clk?jk=00c843be7039172a&fccid=c5284eeef83d...,"At AWAL, we are on a mission to partner with i..."
...,...,...,...,...,...,...,...,...,...
507,Chinese Marketing Director,The One Pioneer,"Arcadia, CA 91006",,EmployerActive 9 days ago,"$50,000 - $75,000 a year",The Director of Marketing will be responsible ...,/company/The-One-Pioneer/jobs/Chinese-Marketin...,The Director of Marketing will be responsible ...
508,Transaction Advisory Services Director – Techn...,RSM US LLP,"San Francisco, CA 94104 (Financial District area)",3.6,PostedPosted 5 days ago,,RSM is seeking a Transaction Advisory Services...,/rc/clk?jk=f6f5ecba3c3f39b9&fccid=198b7374e1f8...,RSM is seeking a Transaction Advisory Services...
509,Director of Information Technology,Cornbread Hustle,"Imperial Valley, CA",,PostedPosted 3 days ago,"$115,000 - $140,000 a year",Had Enough of “Typical” Tech Jobs?\nDrive Inno...,/pagead/clk?mo=r&ad=-6NYlbfkN0AryF7OTLCWq605df...,Had Enough of “Typical” Tech Jobs? Drive Innov...
510,Senior Manager/Director - Building Simulation ...,TRC,"Remote in Oakland, CA 94612",3.5,PostedPosted 23 days ago,,Strong quantitative and data analytics skills....,/rc/clk?jk=75ab6b6451ebe0c4&fccid=68cfe21d4d32...,TRC Advanced Energy (TRC) is a nationally know...
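
The scraped `Date` and `Salary` columns are still raw strings, for example `PostedPosted 12 days ago` and `$196,000 - $285,000 a year`. The sketch below shows one possible cleanup using only the `re` module. `clean_date` and `parse_salary` are hypothetical helpers, and treating the leading `Posted` or `Employer` as a label to strip is an assumption based on the rows shown above.

```python
import re

def clean_date(text: str) -> str:
    """Drop a leading 'Posted'/'Employer' label fused onto the date text."""
    return re.sub(r"^(?:Posted|Employer)(?=[A-Z])", "", text)

def parse_salary(text):
    """Return (low, high, period) from a salary string, or None if absent."""
    if not text or text == "NaN":
        return None
    amounts = [float(a.replace(",", ""))
               for a in re.findall(r"\$([\d,]+(?:\.\d+)?)", text)]
    if not amounts:
        return None
    period = re.search(r"\b(?:an?|per)\s+(hour|day|week|month|year)\b", text)
    return (min(amounts), max(amounts), period.group(1) if period else None)

print(clean_date("PostedPosted 12 days ago"))
print(clean_date("EmployerActive 9 days ago"))
print(parse_salary("$196,000 - $285,000 a year"))
print(parse_salary("NaN"))
```

With pandas, these could be applied per column, for instance `dataframe["Date"].map(clean_date)`.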
