# Web Scraping - Indeed.com
General steps for web scraping:
1. Check whether the website allows web scraping (robots.txt and terms of service)
2. Obtain the source code (HTML file) by requesting the website URL
3. Download the website content
4. Parse the content using the tags and class names of the elements of interest
5. Extract the relevant data/features
6. Organize the raw data in a structured format (e.g., CSV)
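
Step 1 can be partly automated. The following is a minimal sketch, not part of the original notebook, that uses the standard library's `urllib.robotparser` to test a URL against robots.txt rules. The robots.txt text and the `example.com` URLs are hypothetical so that the sketch runs offline; a real check would load the site's own file with `set_url()` and `read()`, and a site's terms of service still need to be read separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/jobs?q=data+analyst"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/report"))
```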

### Import Dependencies 

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from datetime import datetime
from arsenic import get_session
from arsenic.browsers import Firefox
from arsenic.services import Geckodriver
import asyncio
import time

# disable arsenic logging to stdout
import structlog
import logging

logger = logging.getLogger()
logger.setLevel(logging.WARN)
structlog.configure(logger_factory=lambda: logger)

### Path to webdriver (Firefox, Chrome) 

In [2]:
# Ensure that the driver path is correct before running this script.
# Microsoft Windows
# driver_path = "./drivers/windows/geckodriver.exe"
#driver_path = "./programs/geckodriver.exe"
driver_path = r"C:\Users\ranga\Downloads\geckodriver-v0.32.1-win64\geckodriver.exe"
# Linux
# driver_path = "./drivers/linux/geckodriver"
# driver_path = "/usr/bin/geckodriver"

options = {
  'moz:firefoxOptions': {
    # if you want it to be headless
    'args': ['-headless'],
    'log': {'level': 'warn'},
    # Needed for windows / non-default firefox install
    'binary': 'C:\\Program Files\\Mozilla Firefox\\firefox.exe'
  }
}
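
The cell above asks the reader to make sure the driver path is correct before running. One way to fail early, sketched here under the assumption that falling back to a `geckodriver` found on `PATH` is acceptable, is a small helper; `resolve_driver` is a hypothetical name and not part of arsenic.

```python
import os
import shutil

def resolve_driver(path: str, name: str = "geckodriver") -> str:
    """Return a usable driver path, or raise FileNotFoundError early."""
    if os.path.isfile(path):
        return path
    found = shutil.which(name)  # search the directories listed on PATH
    if found:
        return found
    raise FileNotFoundError(f"{name} not found at {path!r} or on PATH")

# Possible usage in the notebook (assumption):
# driver_path = resolve_driver(driver_path)
```

Raising here gives a clearer error than a failure deep inside the browser session setup.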

### Define position and location 

In [3]:
## Enter a job position
position = "data analyst"
## Enter a location (City, State or Zip or remote)
locations = "remote"

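from urllib.parse import quote_plus

def get_url(position, location):
    # URL-encode the search terms so spaces and special characters are valid in the query string
    url_template = "https://www.indeed.com/jobs?q={}&l={}"
    url = url_template.format(quote_plus(position), quote_plus(location))
    return url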

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])
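
The scraping loop later in the notebook pages through results by appending `&start=<offset>` to this URL. As a sketch of the same idea in one place, the hypothetical helper `get_page_url` below builds the full query, including the offset, with `urllib.parse.urlencode`, which also encodes spaces in the search terms.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"

def get_page_url(position: str, location: str, start: int = 0) -> str:
    """Build a search URL with encoded terms and a result offset."""
    query = {"q": position, "l": location, "start": start}
    return BASE_URL + "?" + urlencode(query)

print(get_page_url("data analyst", "remote", start=10))
# https://www.indeed.com/jobs?q=data+analyst&l=remote&start=10
```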

### Scrape job postings

In [4]:
## Number of postings to scrape
postings = 1600

## Number of browser instances to use
n = 3

pages = list(range(0, postings, 7))

state = {
  'lock': asyncio.Lock(),
  'ids': set(),
  'n': 0
}
             
async def get_jobs(url, pages, state):
  data = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for i in pages:
      await session.get(url + "&start=" + str(i))
      jobs = await session.get_elements("[class='job_seen_beacon']")

      for job in jobs:
        result_html = await job.get_property('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        liens = await job.get_elements("a")
        link = await liens[0].get_attribute("href")

        title = soup.select('.jobTitle')[0].get_text().strip()
        try:
          company = soup.select('.companyName')[0].get_text().strip()
        except IndexError:
          # skip cards that have no company name
          continue
        location = soup.select('.companyLocation')[0].get_text().strip()
        # optional fields: select() returns an empty list when the element is missing
        try:
          salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
          salary = 'NaN'
        try:
          rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
          rating = 'NaN'
        try:
          date = soup.select('.date')[0].get_text().strip()
        except IndexError:
          date = 'NaN'
        try:
          description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
          description = ''
            
        Id = f"{title}{company}{location}{rating}{date}{salary}{description}"
        dupe = False
        async with state['lock']:
          if Id in state['ids']:
            dupe = True
          else:
            state['ids'].add(Id)
            state['n'] = state['n'] + 1
            print("Job number {0:4d} added - {1:s}".format(state['n'],title))
        if dupe:
          continue

        data.append({
          'Title': title,
          "Company": company,
          'Location': location,
          'Rating': rating,
          'Date': date,
          "Salary": salary,
          "Description": description,
          "Links": link
        })

  return data

tasks = [asyncio.create_task(get_jobs(url, p, state)) for p in np.array_split(pages, n)]
dataframe = pd.DataFrame([j for task in tasks for j in await task])

Job number    1 added - Sr. Data Analyst
Job number    2 added - Data Reporting and Analytics Consultant II, Cost and Managerial Accounting *REMOTE*
Job number    3 added - Sr. Business Intelligence Analyst
Job number    4 added - IT BUSINESS SYSTEMS ANALYST SENIOR
Job number    5 added - Analyst, Value-Based Care
Job number    6 added - Product Analyst
Job number    7 added - IT Project Business Analyst - Remote
Job number    8 added - Agile Business Analyst (Hybrid)
Job number    9 added - Senior Associate, Digital Data Analyst/Architect
Job number   10 added - Sr. Business Intelligence Analyst
Job number   11 added - Data Analyst Level 2 (W2 ONLY)(REMOTE)
Job number   12 added - Data Analyst/Business Intelligence Developer
Job number   13 added - Business Analyst - Remote
Job number   14 added - Sr Risk Score Data Analyst, remote
Job number   15 added - Functional Analyst, Senior
Job number   16 added - AWS Data & Reporting Analyst
Job number   17 added - Security Analyst II
...

Job number  166 added - DATA ANALYST
Job number  167 added - Business Data Analyst I- ADS Auxiliary Services
Job number  168 added - Data Analyst
Job number  169 added - DTC Data Analyst
Job number  170 added - Sr. Data Analyst
Job number  171 added - Quality Assurance Business Analyst
Job number  172 added - Technical Risk Analyst
Job number  173 added - Technical Business Analyst (contract)
Job number  174 added - IT Business Analyst
Job number  175 added - Data Analyst-Remote!
Job number  176 added - BI Analyst
Job number  177 added - Data Analyst/Engineer/Scientist
Job number  178 added - Healthcare Industry Data Analyst (100% Remote)
Job number  179 added - Data Analyst II
Job number  180 added - Data Analyst
Job number  181 added - Business Intelligence Analyst
Job number  182 added - Data Health Analyst
Job number  183 added - Data Analyst (Intermediate Level)
Job number  184 added - Jr. Business Analyst
Job number  185 added - Business Analyst (Must live within Sacramento Calif

Job number  330 added - Risk Analyst
Job number  331 added - Business Analyst - EDI
Job number  332 added - Sr. Business Analyst
Job number  333 added - GRC Business Analyst
Job number  334 added - Operations Analyst – Evernorth Care Group – Remote
Job number  335 added - Business Analyst - EDI
Job number  336 added - Senior Marketing Data Analyst – Channel Optimization - Remote
Job number  337 added - Data Governance Analyst---Remote (GC and Citizens)
Job number  338 added - BA/QA
Job number  339 added - Data Analyst
Job number  340 added - Business Analyst II
Job number  341 added - Data Analyst- IT (Fully Remote)
Job number  342 added - Sr. Product Data Analyst
Job number  343 added - Data Acquisition and Optimization Analyst
Job number  344 added - Business Analyst
Job number  345 added - Data Analyst - REQ 617
Job number  346 added - Sr. Data Analyst
Job number  347 added - Junior Business Analyst
Job number  348 added - CRM Data Analyst
Job number  349 added - Senior Data Analyst

Job number  492 added - Business Analyst
Job number  493 added - Social Media Data Analyst
Job number  494 added - Data Analyst Remote
Job number  495 added - Data Analyst (Case Management Experience)
Job number  496 added - Data Analyst
Job number  497 added - Data Analyst
Job number  498 added - ECOSYSTEM DATA ANALYST
Job number  499 added - Data & Compliance Analyst
Job number  500 added - IT Business Analyst
Job number  501 added - Data Analyst
Job number  502 added - Lead IT Data Analyst
Job number  503 added - Data Systems Analyst -100% remote in Texas
Job number  504 added - Healthcare Data Analyst
Job number  505 added - Data Analyst
Job number  506 added - Data & Compliance Analyst
Job number  507 added - IT Business Analyst
Job number  508 added - Clinical Data Analyst II
Job number  509 added - Data Analyst
Job number  510 added - SQL Developer/Data Analyst
Job number  511 added - Material Data Analyst
Job number  512 added - Data Quality Analyst (REMOTE)
...

Job number  650 added - Lead Medicare Stars Data Analyst
Job number  651 added - Business Analyst - Managed Care
Job number  652 added - Data Analyst Specialist
Job number  653 added - Philanthropy Data Analyst
Job number  654 added - Business Analyst (Tableau)
Job number  655 added - IT Business Analyst - REMOTE
Job number  656 added - Business Analyst- Behavioral Healthcare Systems- Clinical
Job number  657 added - Commodity Risk Analyst
Job number  658 added - Resource Management Analyst III: Business Services (Austin District)
Job number  659 added - Business Analyst
Job number  660 added - Business Analyst - Remote
Job number  661 added - Remote MES BA/Technical Writer 707648
Job number  662 added - Business Analyst
Job number  663 added - Senior Data Analyst, Research Success
Job number  664 added - Research/Data Analyst
Job number  665 added - Business System Data Analyst
Job number  666 added - Senior Financial Data Analyst
Job number  667 added - Data Analyst
...

Job number  797 added - Data Analyst - Business Development (Full Time/Part Time)
Job number  798 added - Business Analyst - Specialty Pharmacy
Job number  799 added - Business Analyst
Job number  800 added - Lead Data and Reporting Analyst Call Center
Job number  801 added - Senior Product Data Analyst - Fully Remote
Job number  802 added - Database Analyst/Administrator - IM-Data Services
Job number  803 added - Marketing Data Analyst
Job number  804 added - Analytics & Insights Analyst
Job number  805 added - Senior Financial Data Analyst - Health Care Industry
Job number  806 added - Contract Activation Analyst (Remote)
Job number  807 added - Customer Data Integration Analyst
Job number  808 added - Analyst, Business Intelligence
Job number  809 added - Procurement Data Analyst
Job number  810 added - Senior Labor Market Data Analyst
Job number  811 added - Marketing Data Analyst
Job number  812 added - Senior Business Intelligence Analyst
Job number  813 added - Data Analyst with
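
The scraping cell shares one `state` dictionary between several browser tasks and uses an `asyncio.Lock` so that each posting is counted once. The sketch below isolates that pattern without a browser: the `worker` and `main` names and the letter "postings" are made up for illustration. In a notebook, where an event loop is already running, `await main()` would replace `asyncio.run(main())`.

```python
import asyncio

async def worker(items, state, out):
    """Append unseen items to out, using shared state guarded by a lock."""
    for item in items:
        async with state["lock"]:
            if item in state["ids"]:
                continue  # another worker already recorded this item
            state["ids"].add(item)
            state["n"] += 1
        out.append(item)
        await asyncio.sleep(0)  # yield control, as a browser call would

async def main():
    state = {"lock": asyncio.Lock(), "ids": set(), "n": 0}
    out = []
    # Overlapping chunks stand in for result pages that repeat postings.
    chunks = [["a", "b", "c"], ["b", "c", "d"], ["d", "e", "a"]]
    await asyncio.gather(*(worker(chunk, state, out) for chunk in chunks))
    return sorted(out), state["n"]

result = asyncio.run(main())
print(result)  # (['a', 'b', 'c', 'd', 'e'], 5)
```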

### Scrape full job descriptions

In [5]:
Links_list = dataframe['Links'].tolist()

import random

async def get_description(urls):
  descriptions = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for url in urls:
      await session.get("https://www.indeed.com"+url)
      await asyncio.sleep(1)  # one-second delay that does not block the other browser sessions
      jd = await session.get_element('#jobDescriptionText')
      descriptions.append(await jd.get_text())
      await asyncio.sleep(random.random() * 1.5)
  return descriptions

## Number of browser instances to use
n = 3

tasks = [asyncio.create_task(get_description(urls)) for urls in np.array_split(Links_list, n)]
dataframe['Descriptions'] = [desc for task in tasks for desc in await task]
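
Assigning the flattened results to `dataframe['Descriptions']` only lines up with the rows if each task returns its descriptions in input order and the chunks are contiguous. The sketch below checks that property with a stand-in for the browser session; `split_chunks`, `fake_fetch`, `fetch_all`, and the `/job/<k>` links are hypothetical, and the chunk sizes are meant to mirror what `np.array_split` produces.

```python
import asyncio
import random

def split_chunks(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for k in range(n):
        end = start + size + (1 if k < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

async def fake_fetch(urls):
    """Stand-in for a browser session: returns one result per URL, in order."""
    results = []
    for url in urls:
        await asyncio.sleep(random.random() * 0.01)  # simulated page load
        results.append("description of " + url)
    return results

async def fetch_all(urls, n=3):
    # gather() returns results in the order the coroutines were passed in,
    # regardless of which one finishes first.
    per_chunk = await asyncio.gather(*(fake_fetch(c) for c in split_chunks(urls, n)))
    return [desc for chunk in per_chunk for desc in chunk]

links = [f"/job/{k}" for k in range(7)]
descriptions = asyncio.run(fetch_all(links))
print(descriptions == ["description of " + u for u in links])
```

If a page fails to load and a task raises, no results are returned for that chunk, so a real run may want to record a placeholder per URL instead.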

### Save results

In [6]:
# Convert the dataframe to a csv file
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + ".csv", index=False)

In [7]:
dataframe

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
0,AWS Data & Reporting Analyst,Stand Up Wireless,Remote,3.9,EmployerActive 3 days ago,"$75,000 - $80,000 a year",Implement best practices in data visualization...,/pagead/clk?mo=r&ad=-6NYlbfkN0BjCBRdhP65IZiSCQ...,AWS Data & Reporting Analyst\n1. As an AWS Dat...
1,"Data Analyst, Advanced Excel",ATL International (formerly DeNuke),Remote,,EmployerActive 2 days ago,$30 - $45 an hour,I. Connecting to data from multiple sources.\n...,/company/ATL-International-(formerly-DeNuke)/j...,Seeking a Data Analyst with advanced Excel and...
2,Business Analyst,Aceolution Inc,Remote,,PostedPosted 30+ days ago,$80 an hour,Experience working as a Retail e-commerce data...,/pagead/clk?mo=r&ad=-6NYlbfkN0BT75f5_1uiyebGPe...,Job Description\nRole Description:\n● Business...
3,Business Information Analyst II,Elevance Health,"Remote in Chicago, IL 60610",3.6,PostedPosted 24 days ago,,This position is responsible for serving as an...,/pagead/clk?mo=r&ad=-6NYlbfkN0CYKz7WkjjIBo9g-U...,Business Information Analyst II\nThis position...
4,Data Analyst,Phoenix Loss Control,Remote,4.2,EmployerActive 6 days ago,"$60,000 - $120,000 a year",Help analyze KPI report data and deduce busine...,/company/Phoenix-Loss-Control/jobs/Data-Analys...,Phoenix Loss Control\n` 382 NE 191st St. PMB 9...
...,...,...,...,...,...,...,...,...,...
920,Data Analyst (Case Management Experience),Devcare Solutions,Remote,3.6,EmployerActive 18 days ago,$50 - $60 an hour,"*Job Title:* Data Analyst *Location:* Jackson,...",/company/Devcare-Solutions/jobs/Data-Analyst-3...,"Job Title: Data Analyst\nLocation: Jackson, MS..."
921,Implementation Business Analyst,Viz.ai,"Remote in San Francisco, CA",,PostedPosted 30+ days ago,"$89,000 - $115,000 a year",About Viz.ai Viz.ai is the pioneer in the use ...,/rc/clk?jk=7e479f5a759b425a&fccid=309afdc18f5b...,About Viz.ai\nViz.ai is the pioneer in the use...
922,Data Analyst (Intermediate Level),Topline Group,Remote,,PostedPosted 30+ days ago,"$50,000 - $60,000 a year",Topline Group is looking for a strong intermed...,/rc/clk?jk=da284515bf36a61f&fccid=010548e0d754...,Topline Group is looking for a strong intermed...
923,Implementation Business Analyst,Viz.ai,"Remote in San Francisco, CA",,PostedPosted 30+ days ago,"$89,000 - $115,000 a year",Increased data integrity and data gathering wi...,/rc/clk?jk=7e479f5a759b425a&fccid=309afdc18f5b...,About Viz.ai\nViz.ai is the pioneer in the use...
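
The `Date` column above holds raw strings such as `EmployerActive 3 days ago` and `PostedPosted 30+ days ago`, where a label is fused to the posting age. As a possible clean-up step for the structured output, the sketch below extracts the number of days with a regular expression. `parse_age_days` is a hypothetical helper, and the pattern covers only the "N days ago" forms visible in this output.

```python
import re

# Matches "3 days ago", "1 day ago", and "30+ days ago".
AGE_PATTERN = re.compile(r"(\d+)\+?\s+days?\s+ago")

def parse_age_days(text: str):
    """Return the posting age in days, or None if no age is found.

    For "30+ days ago" the result is 30, which is only a lower bound.
    """
    match = AGE_PATTERN.search(text)
    return int(match.group(1)) if match else None

for raw in ["EmployerActive 3 days ago", "PostedPosted 30+ days ago", "NaN"]:
    print(raw, "->", parse_age_days(raw))
```

Applied to the dataframe, this could look like `dataframe['Date'].map(parse_age_days)`, assuming the column contains only strings.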
