# Web Scraping - Indeed.com
General steps for web scraping:
1. Check whether the website allows web scraping
2. Request the page's source code (HTML) using the website URL
3. Download the page content
4. Parse the content, using HTML tags and keywords to locate the elements of interest
5. Extract the relevant data/features
6. Organize the raw data in a structured format (e.g., CSV)
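
Step 1 can be partly automated by reading a site's `robots.txt`. The sketch below uses the standard library's `urllib.robotparser`. The `SAMPLE_ROBOTS_TXT` rules, the `my-scraper` user agent, and the `example.com` URLs are made up so the example runs offline; they do not reflect any real site's policy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

allowed_public = is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/jobs?q=data")
allowed_private = is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/page")
print(allowed_public, allowed_private)
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` fetches the real file instead of parsing a string. A permissive `robots.txt` does not replace reading the site's terms of service.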

### Import Dependencies 

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from datetime import datetime
from arsenic import get_session
from arsenic.browsers import Firefox
from arsenic.services import Geckodriver
import asyncio
import time

# disable arsenic logging to stdout
import structlog
import logging

logger = logging.getLogger()
logger.setLevel(logging.WARN)
structlog.configure(logger_factory=lambda: logger)

### Path to webdriver (Firefox, Chrome) 

In [2]:
# Ensure that the driver path is correct before running this script.
# Microsoft Windows
# driver_path = "./drivers/windows/geckodriver.exe"
#driver_path = "./programs/geckodriver.exe"
driver_path = r"C:\Users\ranga\Downloads\geckodriver-v0.32.1-win64\geckodriver.exe"
# Linux
# driver_path = "./drivers/linux/geckodriver"
# driver_path = "/usr/bin/geckodriver"

options = {
  'moz:firefoxOptions': {
    # if you want it to be headless
    'args': ['-headless'],
    'log': {'level': 'warn'},
    # Needed for windows / non-default firefox install
    'binary': 'C:\\Program Files\\Mozilla Firefox\\firefox.exe'
  }
}
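
Rather than commenting paths in and out per machine, the driver location could be resolved at run time. This is a minimal sketch, assuming the candidate locations listed in the comments above. `find_driver` is a hypothetical helper, not part of arsenic.

```python
import shutil
from pathlib import Path
from typing import Iterable, Optional

def find_driver(candidates: Iterable[str], name: str = "geckodriver") -> Optional[str]:
    """Return the first candidate path that exists, else search PATH."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return str(Path(candidate))
    # shutil.which returns None when the executable is not on PATH either
    return shutil.which(name)

# Candidate locations taken from the comments in the cell above
candidate_paths = [
    "./drivers/windows/geckodriver.exe",
    "./drivers/linux/geckodriver",
    "/usr/bin/geckodriver",
]
resolved = find_driver(candidate_paths)
print(resolved if resolved else "geckodriver not found; set driver_path manually")
```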

### Define position and location 

In [3]:
## Enter a job position
position = "data scientist"
## Enter a location (City, State or Zip or remote)
locations = "remote"

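def get_url(position, location):
    # URL-encode the values so spaces and special characters are safe in the query string
    from urllib.parse import quote_plus
    url_template = "https://www.indeed.com/jobs?q={}&l={}"
    url = url_template.format(quote_plus(position), quote_plus(location))
    return url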

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])

### Scrape job postings

In [4]:
## Number of postings to scrape
postings = 1600

## Number of browser instances to use
n = 3

pages = list(range(0, postings, 10))

state = {
  'lock': asyncio.Lock(),
  'ids': set(),
  'n': 0
}
             
async def get_jobs(url, pages, state):
  data = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for i in pages:
      await session.get(url + "&start=" + str(i))
      jobs = await session.get_elements("[class='job_seen_beacon']")

      for job in jobs:
        result_html = await job.get_property('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        liens = await job.get_elements("a")
        link = await liens[0].get_attribute("href")

        title = soup.select('.jobTitle')[0].get_text().strip()
        # select(...)[0] raises IndexError when the element is missing from the card
        try:
          company = soup.select('.companyName')[0].get_text().strip()
        except IndexError:
          continue  # skip cards without a company name
        location = soup.select('.companyLocation')[0].get_text().strip()
        try:
          salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
          salary = 'NaN'
        try:
          rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
          rating = 'NaN'
        try:
          date = soup.select('.date')[0].get_text().strip()
        except IndexError:
          date = 'NaN'
        try:
          description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
          description = ''
            
        Id = f"{title}{company}{location}{rating}{date}{salary}{description}"
        dupe = False
        async with state['lock']:
          if Id in state['ids']:
            dupe = True
          else:
            state['ids'].add(Id)
            state['n'] = state['n'] + 1
            print("Job number {0:4d} added - {1:s}".format(state['n'],title))
        if dupe:
          continue

        data.append({
          'Title': title,
          "Company": company,
          'Location': location,
          'Rating': rating,
          'Date': date,
          "Salary": salary,
          "Description": description,
          "Links": link
        })

  return data

tasks = [asyncio.create_task(get_jobs(url, p, state)) for p in np.array_split(pages, n)]
dataframe = pd.DataFrame([j for task in tasks for j in await task])

Job number    1 added - Statistician III
Job number    2 added - Chief Data Officer
Job number    3 added - Software Developer (AI/ML)
Job number    4 added - Statistician
Job number    5 added - Sr Data Engineer
Job number    6 added - Associate Data Scientist - Rotational Program
Job number    7 added - Senior Reliability Data Analyst
Job number    8 added - Data Scientist - RWD
Job number    9 added - Staff Data Scientist - Marketing
Job number   10 added - Jr. Data Scientist
Job number   11 added - Data Scientist
Job number   12 added - Senior IT Data Analyst
Job number   13 added - Interdisciplinary-Microbiologist/Data Scientist
Job number   14 added - Senior Staff Machine Learning Engineer, Perception
Job number   15 added - Data Scientist I, Product Analytics
Job number   16 added - AI Developer/Software Engineer (Remote)
Job number   17 added - Jr. Data Scientist
Job number   18 added - Genesys Business Analyst with AI and CX Experience
Job number   19 added - Computational Bio

Job number  151 added - Senior Data Analyst / Data Modeler
Job number  152 added - Data Scientist
Job number  153 added - Senior Analyst/Statistician
Job number  154 added - Senior Data Science Analyst
Job number  155 added - Senior/Staff Machine Learning Engineer, Infrastructure
Job number  156 added - Senior Financial Data Analyst
Job number  157 added - Associate Data Scientist (Remote)
Job number  158 added - Senior Data Scientist
Job number  159 added - Data Scientist (Hybrid)
Job number  160 added - Senior Data Analyst
Job number  161 added - Data Scientist Advisor
Job number  162 added - Data Scientist
Job number  163 added - Data Scientist
Job number  164 added - Lead Data Scientist
Job number  165 added - Data Forensics Scientist
Job number  166 added - Data Science Specialist
Job number  167 added - Data Science, eBay Shipping Initiatives
Job number  168 added - Staff Software Engineer - Data Infrastructure - (Permanent Remote, US)
Job number  169 added - Artificial Intellige

Job number  312 added - AI Research Engineer
Job number  313 added - Sr. Data Analyst
Job number  314 added - Machine Learning Software Engineer
Job number  315 added - Machine Learning Engineer
Job number  316 added - Associate Principal Statistical Programmer, Submission Data Standards Quality Management (SDS QM) (Remote)
Job number  317 added - Sr. Product Data Scientist (Remote)
Job number  318 added - Senior Machine Learning Product Owner (100% Remote)
Job number  319 added - Data Scientist
Job number  320 added - Machine Learning Engineer in the Optimization team - US Remote
Job number  321 added - Data Specialist
Job number  322 added - Data Scientist
Job number  323 added - Senior Data Analyst
Job number  324 added - Full stack data scientist (able to do pre-sales)
Job number  325 added - Data Scientist, Statistics
Job number  326 added - Senior Azure Data Engineer (Remote - US)
Job number  327 added - Data Scientist - remote
Job number  328 added - Senior Statistical Programme

Job number  459 added - Senior Data Analyst
Job number  460 added - Data Science Engineer
Job number  461 added - Data Scientist II
Job number  462 added - Actuarial Data Scientist - Remote
Job number  463 added - Senior Data Scientist
Job number  464 added - Senior Data Analyst- Evernorth Health Services- Remote
Job number  465 added - Lead Data Scientist
Job number  466 added - Staff Data Scientist - Revenue Cycle
Job number  467 added - Vitria Data Scientist
Job number  468 added - Data Scientist / Data Analyst
Job number  469 added - Data Scientist
Job number  470 added - Senior Data Scientist
Job number  471 added - Data Scientist
Job number  472 added - Senior Machine Learning Engineer
Job number  473 added - Senior Data Scientist
Job number  474 added - Sr. Data Scientist
Job number  475 added - Sr. Marketing Data Analyst
Job number  476 added - Global Data Scientist
Job number  477 added - Sr. Data Analyst/Project Lead - Health Plan Sales Analytics - 100% Remote
Job number  478

Job number  615 added - Risk Intelligence - Data Analytics Lead (REMOTE/Contract))
Job number  616 added - Manager of Manufacturing Analytics/ Data Science (Remote)
Job number  617 added - Data Engineer/Scientist (Remote)
Job number  618 added - Machine Learning Engineer (Remote)
Job number  619 added - Data Scientist
Job number  620 added - Associate Director - Data Science - Remote
Job number  621 added - Data Scientist
Job number  622 added - Data Scientist Principal
Job number  623 added - Enterprise Data Specialist - Machine Learning
Job number  624 added - Machine Learning Engineer
Job number  625 added - Machine Learning Research Team Lead
Job number  626 added - Senior Data Scientist | Machine Learning
Job number  627 added - Embedded BI Data Scientist SIBU510
Job number  628 added - Data Scientist
Job number  629 added - Sr Clinical Data Scientist
Job number  630 added - Senior Data Scientist
Job number  631 added - Business Data Analyst, Sr Consultant - Remote
Job number  632

Job number  767 added - Machine Learning Engineer
Job number  768 added - Senior Data Scientist | Remote-US
Job number  769 added - Staff Software Engineer, AI
Job number  770 added - Senior Data Analyst
Job number  771 added - Statistician
Job number  772 added - Conversational AI Design Specialist, NLP/NLU *REMOTE*
Job number  773 added - Machine Learning Engineer
Job number  774 added - Damage Detection Machine Learning Engineer
Job number  775 added - Data Scientist I
Job number  776 added - Senior Data Analyst
Job number  777 added - Principal Data Scientist I (Remote)
Job number  778 added - Senior Research Scientist, Real World Evidence, Data Analytics & Evidence Synthesis
Job number  779 added - Staff Data Scientist, Analytics
Job number  780 added - Data Science, Inventory Planning Sr. Manager
Job number  781 added - Principal Data Scientist
Job number  782 added - Sr. Data Scientist - Machine Learning
Job number  783 added - 100% Remote Cloud Engineer (Python, AWS, AI/ML)
Job
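
The scraping cell shares one `state` dictionary between several browser sessions and guards it with an `asyncio.Lock`, so a posting seen by one session is not counted again by another. The sketch below isolates that pattern with no browser involved. The postings are made-up sample data and `collect_unique` is a hypothetical stand-in for the body of `get_jobs`.

```python
import asyncio

async def collect_unique(records, state):
    """Keep only the records whose key no worker has seen yet."""
    kept = []
    for record in records:
        key = f"{record['Title']}{record['Company']}{record['Location']}"
        async with state["lock"]:
            if key in state["ids"]:
                continue  # duplicate: the lock is released on the way out
            state["ids"].add(key)
            state["n"] += 1
        kept.append(record)
        await asyncio.sleep(0)  # yield control so the workers interleave
    return kept

async def main():
    state = {"lock": asyncio.Lock(), "ids": set(), "n": 0}
    # Hypothetical postings; the second chunk repeats one from the first.
    chunks = [
        [{"Title": "Data Scientist", "Company": "A", "Location": "Remote"},
         {"Title": "Statistician", "Company": "B", "Location": "Remote"}],
        [{"Title": "Data Scientist", "Company": "A", "Location": "Remote"},
         {"Title": "ML Engineer", "Company": "C", "Location": "Remote"}],
    ]
    results = await asyncio.gather(*(collect_unique(c, state) for c in chunks))
    return [r for chunk in results for r in chunk], state["n"]

unique_jobs, count = asyncio.run(main())
print(count, [job["Title"] for job in unique_jobs])
```

`asyncio.run` is meant for a standalone script. Inside a notebook, where an event loop is already running, `await main()` would be used instead.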

### Scrape full job descriptions

In [5]:
Links_list = dataframe['Links'].tolist()

import random

async def get_description(urls):
  descriptions = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for url in urls:
      await session.get("https://www.indeed.com"+url)
      await asyncio.sleep(1)  # one-second delay that does not block the other sessions
      jd = await session.get_element('#jobDescriptionText')
      descriptions.append(await jd.get_text())
      await asyncio.sleep(random.random() * 1.5)
  return descriptions

## Number of browser instances to use
n = 3

tasks = [asyncio.create_task(get_description(urls)) for urls in np.array_split(Links_list, n)]
dataframe['Descriptions'] = [desc for task in tasks for desc in await task]
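
Pauses inside a coroutine should go through `asyncio.sleep`, because `time.sleep` blocks the event loop and stalls every other browser session while it waits. The sketch below illustrates the difference with a hypothetical `polite_fetch` stand-in for one page visit. The delay values are arbitrary.

```python
import asyncio
import random
import time

async def polite_fetch(name, base_delay=0.2, jitter=0.1):
    """Stand-in for one page visit: waits without blocking other tasks."""
    await asyncio.sleep(base_delay + random.random() * jitter)
    return name

async def main():
    start = time.perf_counter()
    # Three workers wait at the same time, so the total is close to one delay
    results = await asyncio.gather(*(polite_fetch(f"worker-{i}") for i in range(3)))
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

With a blocking `time.sleep` in place of `asyncio.sleep`, the three waits would run one after another and take at least three times `base_delay`.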

### Save results

In [6]:
# Save the dataframe to a CSV file named <date>_<position>_<location>.csv
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + ".csv", index=False)
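
The file name above is built by plain concatenation, so a position such as "data scientist" leaves a space in it. A possible variant, assuming underscore-separated lowercase names are acceptable, is sketched below. `build_filename` is a hypothetical helper and the date is only an example value.

```python
import re
from datetime import date

def build_filename(day: date, position: str, location: str) -> str:
    """Join the parts into a lowercase, underscore-separated CSV file name."""
    parts = [day.isoformat(), position, location]
    # Replace every run of characters other than letters, digits, or hyphens
    slugs = [re.sub(r"[^A-Za-z0-9-]+", "_", p.strip().lower()).strip("_") for p in parts]
    return "_".join(slugs) + ".csv"

print(build_filename(date(2023, 1, 15), "data scientist", "remote"))
```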

In [7]:
dataframe

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
0,Associate Data Scientist - Rotational Program,Mutual of Omaha,Remote,3.7,PostedToday,"$72,000 - $114,000 a year",The program lets the associate data scientists...,/rc/clk?jk=20c8f542b3f17ed7&fccid=7f07a4e82fbe...,Location: Various Locations\nWork Type: Full T...
1,Data Scientist - RWD,Norstella,Remote,,EmployerActive 7 days ago,"$125,000 - $175,000 a year",Design data pipelines and queries and analyze ...,/company/NorStella/jobs/Data-Scientist-98a669a...,Job Summary:\nWe are seeking an experienced Da...
2,Jr. Data Scientist,Net2Aspire,Remote,,EmployerActive 5 days ago,"$65,000 - $80,000 a year", Create data dashboards and other data visual...,/company/net2aspire/jobs/Junior-Data-Scientist..., Apply Statistical and Machine Learning metho...
3,Data Scientist,Synchrony Systems,Remote,,PostedToday,$70 - $75 an hour,Must have strong data visualization skills.\nP...,/company/Synchrony-Systems/jobs/Data-Scientist...,Role: Data Scientist\nLocation: REMOTE\nJob De...
4,Interdisciplinary-Microbiologist/Data Scientist,USDA,Remote,4.1,EmployerActive 1 day ago,"$98,496 - $158,432 a year",Analyzing data provided by NAHLN laboratories ...,/company/US-Department-of-Agriculture/jobs/Int...,MUST APPLY ON USAJOBS - https://www.usajobs.go...
...,...,...,...,...,...,...,...,...,...
895,Job Title: ML Engineer Job Location: Remote Jo...,Global Information Technology,"Remote in Southfield, MI 48076",4.1,PostedPosted 30+ days ago,,Job Description: Work closely with Tech Anchor...,/rc/clk?jk=efc9366f75880b50&fccid=e23935f4162c...,Job Description:\nWork closely with Tech Ancho...
896,Machine Learning Engineer: Recall Team (Remote),Constructor,Remote,4.3,PostedPosted 30+ days ago,,About us Constructor.io powers product search ...,/rc/clk?jk=277d4ed20b74f7f9&fccid=546a4f6b509c...,About us\nConstructor.io powers product search...
897,Senior Research Scientist/Director - Epidemiol...,Thermo Fisher Scientific,"Remote in Wilmington, NC",3.5,PostedPosted 6 days ago,,"At Thermo Fisher Scientific, you’ll discover m...",/rc/clk?jk=047ce743223110a7&fccid=126e3afd205c...,"At Thermo Fisher Scientific, you’ll discover m..."
898,"Senior Data Scientist, Product Analytics",Coalition,"Remote in Austin, TX",,PostedPosted 30+ days ago,,Lead cross-functional projects using quantitat...,/rc/clk?jk=8fe89543d44ab679&fccid=5da39dba7be7...,"About Us\nFounded in 2017, Coalition combines ..."
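
The `Date` column in the output above carries fused prefixes such as `PostedPosted` and `EmployerActive`, which appear to come from adjacent text nodes being joined by `get_text()`. A small clean-up pass could normalize them before analysis. This is a sketch based only on the label shapes visible in the rows above; `clean_date_label` and `days_ago` are hypothetical helpers.

```python
import re
from typing import Optional

def clean_date_label(raw: str) -> str:
    """Strip leading 'Posted'/'Employer' fragments from a scraped date label."""
    return re.sub(r"^(?:Posted|Employer)+", "", raw.strip()).strip()

def days_ago(raw: str) -> Optional[int]:
    """Return the posting age in days, or None when it cannot be read."""
    label = clean_date_label(raw)
    if label.lower() == "today":
        return 0
    match = re.search(r"(\d+)\+?\s+days?\s+ago", label)
    return int(match.group(1)) if match else None

for sample in ["PostedToday", "EmployerActive 7 days ago", "PostedPosted 30+ days ago"]:
    print(repr(clean_date_label(sample)), days_ago(sample))
```

Labels such as `30+ days ago` are lower bounds, so the returned number is the minimum age for those rows.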
