# Web Scraping - Indeed.com
General steps for Web Scraping
1. Check whether the website allows web scraping
2. Obtain the source code (HTML File) by using the website URL
3. Download the website content
4. Parse the content using the tags and class names of the elements of interest
5. Extract relevant data/features
6. Organize raw data in structured format (e.g., CSV)
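
Step 1 can be checked programmatically against a site's `robots.txt`. The sketch below uses the standard-library `urllib.robotparser`. The helper name `is_allowed`, the sample rules, and the `example.com` URLs are assumptions made for illustration. They are not Indeed's actual rules.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "https://example.com/jobs?q=data+analyst"))
print(is_allowed(sample_rules, "https://example.com/private/report"))
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the file instead of parsing a string. A site's terms of service may add restrictions that `robots.txt` does not express.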

### Import Dependencies 

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from datetime import datetime
from arsenic import get_session
from arsenic.browsers import Firefox
from arsenic.services import Geckodriver
import asyncio
import time

# disable arsenic logging to stdout
import structlog
import logging

logger = logging.getLogger()
logger.setLevel(logging.WARN)
structlog.configure(logger_factory=lambda: logger)

### Path to webdriver (Firefox, Chrome) 

In [2]:
# Ensure that the driver path is correct before running this script.
# Microsoft Windows
# driver_path = "./drivers/windows/geckodriver.exe"
#driver_path = "./programs/geckodriver.exe"
driver_path = r"C:\Users\ranga\Downloads\geckodriver-v0.32.1-win64\geckodriver.exe"
# Linux
# driver_path = "./drivers/linux/geckodriver"
# driver_path = "/usr/bin/geckodriver"

options = {
  'moz:firefoxOptions': {
    # if you want it to be headless
    'args': ['-headless'],
    'log': {'level': 'warn'},
    # Needed for windows / non-default firefox install
    'binary': 'C:\\Program Files\\Mozilla Firefox\\firefox.exe'
  }
}

### Define position and location 

In [3]:
## Enter a job position
position = "data analyst"
## Enter a location (City, State or Zip or remote)
locations = "seattle"

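from urllib.parse import quote_plus

def get_url(position, location):
    # URL-encode the query values so spaces and special characters are safe
    url_template = "https://www.indeed.com/jobs?q={}&l={}"
    url = url_template.format(quote_plus(position), quote_plus(location))
    return url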

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])

### Scrape job postings

In [4]:
## Number of postings to scrape
postings = 1600

## Number of browser instances to use
n = 3

pages = list(range(0, postings, 7))

state = {
  'lock': asyncio.Lock(),
  'ids': set(),
  'n': 0
}
             
async def get_jobs(url, pages, state):
  data = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for i in pages:
      await session.get(url + "&start=" + str(i))
      jobs = await session.get_elements("[class='job_seen_beacon']")

      for job in jobs:
        result_html = await job.get_property('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        liens = await job.get_elements("a")
        link = await liens[0].get_attribute("href")

        title = soup.select('.jobTitle')[0].get_text().strip()
        try:
          company = soup.select('.companyName')[0].get_text().strip()
        except IndexError:
          # skip cards that have no company name
          continue
        location = soup.select('.companyLocation')[0].get_text().strip()
        # optional fields: fall back to a placeholder when the element is missing
        try:
          salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
          salary = 'NaN'
        try:
          rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
          rating = 'NaN'
        try:
          date = soup.select('.date')[0].get_text().strip()
        except IndexError:
          date = 'NaN'
        try:
          description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
          description = ''
            
        Id = f"{title}{company}{location}{rating}{date}{salary}{description}"
        dupe = False
        async with state['lock']:
          if Id in state['ids']:
            dupe = True
          else:
            state['ids'].add(Id)
            state['n'] = state['n'] + 1
            print("Job number {0:4d} added - {1:s}".format(state['n'],title))
        if dupe:
          continue

        data.append({
          'Title': title,
          "Company": company,
          'Location': location,
          'Rating': rating,
          'Date': date,
          "Salary": salary,
          "Description": description,
          "Links": link
        })

  return data

tasks = [asyncio.create_task(get_jobs(url, p, state)) for p in np.array_split(pages, n)]
dataframe = pd.DataFrame([j for task in tasks for j in await task])

Job number    1 added - DATA ANALYST
Job number    2 added - Data Analyst - Tiktok Ads
Job number    3 added - Senior Financial Analyst - Customer Service
Job number    4 added - Sr. Business Intelligence Analyst (Remote Eligible)
Job number    5 added - Data Analyst
Job number    6 added - Business Analyst Consultant
Job number    7 added - Data Analyst (Remote)
Job number    8 added - Senior Compensation Analyst - PSE
Job number    9 added - Data Operations Analyst
Job number   10 added - Tax - Business Analyst - Lead
Job number   11 added - Data Analyst
Job number   12 added - Business Analyst, Revenue Accounting
Job number   13 added - Data Analyst I - Human Cell Types
Job number   14 added - digital asset management business analyst - hybrid
Job number   15 added - Data Analyst (Seattle or US Remote)
Job number   16 added - Salesforce Business Analyst - Remote
Job number   17 added - DATA ANALYST, EVALUATIONS
Job number   18 added - Data Analytics Lead, Cell Therapy
Job number   1

Job number  154 added - Sr. Business Analyst
Job number  155 added - Stroke RN Data Analyst
Job number  156 added - Business Analyst (306)
Job number  157 added - Business Intelligence Analyst
Job number  158 added - HR Analyst
Job number  159 added - Staff Analyst
Job number  160 added - Business Analyst (W2)
Job number  161 added - Senior-Technology Business Analysis
Job number  162 added - Senior Asset & Performance Systems Analyst (Senior Management Systems Analyst)
Job number  163 added - Sr Analyst, Business Analysis
Job number  164 added - Data Analyst
Job number  165 added - Financial Reporting & Analytics Analyst
Job number  166 added - Senior Business Analyst, Global E-commerce
Job number  167 added - Business Analyst
Job number  168 added - Senior Business Analyst
Job number  169 added - Sr. Technical Business Analyst (308)
Job number  170 added - ERP Data Migration Analyst
Job number  171 added - Cost Analyst/Accountant
Job number  172 added - Asset Analyst - Software Asset
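
The scraping cell shares one `state` dictionary across all browser sessions, so a posting already recorded by one session is skipped by the others. Here is a minimal sketch of that pattern without a browser. The `collect` function and the `job-N` identifiers are hypothetical stand-ins for the page loop and the concatenated `Id` string.

```python
import asyncio


async def collect(batches, state):
    """Keep only items that no worker sharing `state` has seen before."""
    kept = []
    for batch in batches:
        await asyncio.sleep(0)  # stand-in for an awaited page load
        for item in batch:
            async with state["lock"]:
                if item in state["ids"]:
                    continue  # duplicate: already recorded by some worker
                state["ids"].add(item)
            kept.append(item)
    return kept


async def main():
    state = {"lock": asyncio.Lock(), "ids": set()}
    # Hypothetical overlapping result pages for two workers.
    worker_a = [["job-1", "job-2"], ["job-2", "job-3"]]
    worker_b = [["job-3", "job-4"], ["job-1", "job-5"]]
    results = await asyncio.gather(collect(worker_a, state), collect(worker_b, state))
    return [item for worker in results for item in worker]


unique_jobs = asyncio.run(main())
print(sorted(unique_jobs))  # ['job-1', 'job-2', 'job-3', 'job-4', 'job-5']
```

Inside a notebook, where an event loop is already running, `await main()` replaces `asyncio.run(main())`.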

### Scrape full job descriptions

In [5]:
Links_list = dataframe['Links'].tolist()

import random

async def get_description(urls):
  descriptions = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for url in urls:
      await session.get("https://www.indeed.com"+url)
      await asyncio.sleep(1)  # one-second delay that does not block the other sessions
      jd = await session.get_element('#jobDescriptionText')
      descriptions.append(await jd.get_text())
      await asyncio.sleep(random.random() * 1.5)
  return descriptions

## Number of browser instances to use
n = 3

tasks = [asyncio.create_task(get_description(urls)) for urls in np.array_split(Links_list, n)]
dataframe['Descriptions'] = [desc for task in tasks for desc in await task]
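
Assigning the `Descriptions` column only works if exactly one value comes back per link, in the original order. One way to keep that alignment when a page has no description element is sketched below. The names `fetch_all` and `fake_fetch` and the sample paths are assumptions for illustration, with `fake_fetch` standing in for the browser session.

```python
import asyncio
import random


async def fetch_all(urls, fetch, max_delay=0.0):
    """Fetch urls in order, recording None when a fetch fails."""
    results = []
    for url in urls:
        try:
            results.append(await fetch(url))
        except Exception:
            results.append(None)  # keep the position aligned with the input
        # randomized pause between requests; yields to the other tasks
        await asyncio.sleep(random.random() * max_delay)
    return results


async def fake_fetch(url):
    # Hypothetical stand-in for loading a page and reading its description.
    if "missing" in url:
        raise LookupError("no description element")
    return f"description for {url}"


async def main():
    chunks = [["/a", "/missing"], ["/b"], ["/c", "/d"]]
    tasks = [asyncio.create_task(fetch_all(chunk, fake_fetch)) for chunk in chunks]
    # awaiting the tasks in creation order preserves the original link order
    return [text for task in tasks for text in await task]


descriptions = asyncio.run(main())
print(descriptions)
```

The failed fetch yields `None` in the second position, so the list still has one entry per link. Inside a notebook, `await main()` replaces `asyncio.run(main())`.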

### Save results

In [6]:
# Convert the dataframe to a csv file
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + ".csv", index=False)

In [7]:
dataframe

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
0,DATA ANALYST,University of Washington,"Seattle, WA 98195 (University District area)",4.1,PostedPosted 5 days ago,"$5,747 - $6,304 a month","Bring together data, analytic engines, and dat...",/rc/clk?jk=67267e3e9ee7c149&fccid=142783ac2edb...,"Benefits:\nAs a UW employee, you will enjoy ge..."
1,Data Analyst - Tiktok Ads,TikTok,"Seattle, WA",3.4,PostedPosted 30+ days ago,"$114,000 - $209,000 a year",Expert experience pulling large and complex da...,/rc/clk?jk=ea2bf81d89f746f0&fccid=caed318a9335...,Responsibilities\nTikTok is the leading destin...
2,Data Analyst,Foundation for Tacoma Students,"Hybrid remote in Tacoma, WA 98405",4.5,PostedPosted 3 days ago,"$55,000 - $70,000 a year",Audits systems and troubleshoot data submissio...,/pagead/clk?mo=r&ad=-6NYlbfkN0Alt2oG4DKRumPSQw...,Job Description | Data Analyst | Exempt\nAre y...
3,Data Analyst (Remote),Path Mental Health,"Remote in Seattle, WA",4.7,PostedPosted 26 days ago,,"High focus to detail, including knowing explic...",/rc/clk?jk=eb7fe532e33d887a&fccid=d8ebbe8fe3f1...,Who we are\nPath is a healthtech company dedic...
4,Data Operations Analyst,BlackRock,"Hybrid remote in Seattle, WA",3.8,PostedPosted 7 days ago,"$76,000 - $90,000 a year",Data Operations and Solutions is entrusted wit...,/rc/clk?jk=bcd87601c39e0922&fccid=58c732f14833...,Description\nAbout this role\nBlackRock is one...
...,...,...,...,...,...,...,...,...,...
195,Senior Business Analyst ERP - Sourcing,Seagen,"Bothell, WA 98021",3.5,PostedPosted 30+ days ago,"$127,000 - $164,000 a year","Seagen is a global, multi-product biotechnolog...",/rc/clk?jk=be373dcac19941d8&fccid=52558a648544...,"Seagen is a global, multi-product biotechnolog..."
196,Board Certified Behavior Analyst (BCBA),Holly Ridge Center,"Bremerton, WA 98312",4.2,PostedPosted 30+ days ago,$29.25 - $32.95 an hour,Holly Ridge Center is currently recruiting for...,/rc/clk?jk=3de916be95553e62&fccid=df45baffc979...,Holly Ridge Center is currently recruiting for...
197,Operations Systems Analyst,City of SeaTac,"SeaTac, WA 98188",3.4,PostedPosted 30+ days ago,,JOB This position is overtime eligible and a m...,/rc/clk?jk=e32298bcc9b93f6b&fccid=73601eb07244...,JOB\nThis position is overtime eligible and a ...
198,Sr. Product Strategy & Operations Analyst (USA),Lyft,"Remote in Seattle, WA",3.5,PostedPosted 3 days ago,,"At Lyft, our mission is to improve people's li...",/rc/clk?jk=ab6c26a3f5abad8b&fccid=7dcf8477ac40...,"At Lyft, our mission is to improve people's li..."
