# Web Scraping - Indeed.com
General steps for web scraping:
1. Check whether the website allows web scraping
2. Obtain the source code (HTML file) by using the website URL
3. Download the website content
4. Parse the content, using HTML tags and class names to locate the elements of interest
5. Extract relevant data/features
6. Organize raw data in structured format (e.g., CSV)
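
Step 1 can be partly automated. The sketch below uses the standard-library `urllib.robotparser` to test whether a URL is allowed by a `robots.txt` file. The rules in `SAMPLE_ROBOTS_TXT`, the `example.com` URLs, and the helper name `is_allowed` are made up for illustration; they are not Indeed's actual rules, and a site's terms of service may add restrictions that `robots.txt` does not express.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/jobs?q=analyst"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/report"))
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the file instead of parsing a string.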

### Import Dependencies 

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from datetime import datetime
from arsenic import get_session
from arsenic.browsers import Firefox
from arsenic.services import Geckodriver
import asyncio
import time

# disable arsenic logging to stdout
import structlog
import logging

logger = logging.getLogger()
logger.setLevel(logging.WARNING)
structlog.configure(logger_factory=lambda: logger)

### Path to webdriver (Firefox, Chrome) 

In [2]:
# Ensure that the driver path is correct before running this script.
# Microsoft Windows
# driver_path = "./drivers/windows/geckodriver.exe"
#driver_path = "./programs/geckodriver.exe"
driver_path = r"C:\Users\ranga\Downloads\geckodriver-v0.32.1-win64\geckodriver.exe"
# Linux
# driver_path = "./drivers/linux/geckodriver"
# driver_path = "/usr/bin/geckodriver"

options = {
  'moz:firefoxOptions': {
    # if you want it to be headless
    'args': ['-headless'],
    'log': {'level': 'warn'},
    # Needed for windows / non-default firefox install
    'binary': 'C:\\Program Files\\Mozilla Firefox\\firefox.exe'
  }
}
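
Because the driver path differs per machine, one option is to try a list of candidate locations and fall back to the system `PATH`. This is a minimal sketch, not part of the original notebook; `find_driver` is a hypothetical helper name.

```python
import shutil
from pathlib import Path
from typing import Iterable, Optional

def find_driver(candidates: Iterable[str], name: str = "geckodriver") -> Optional[str]:
    """Return the first candidate path that exists, else look name up on PATH."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    # shutil.which returns None when the executable is not on PATH.
    return shutil.which(name)
```

With the paths mentioned in the comments above, usage could look like `driver_path = find_driver(["./drivers/windows/geckodriver.exe", "./drivers/linux/geckodriver", "/usr/bin/geckodriver"])`, followed by a check that the result is not `None`.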

### Define position and location 

In [3]:
## Enter a job position
position = "manager of analytics"
## Enter a location (City, State or Zip or remote)
locations = "california"

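from urllib.parse import quote_plus

def get_url(position, location):
    # URL-encode the search terms so spaces and special characters are valid in the query string
    url_template = "https://www.indeed.com/jobs?q={}&l={}"
    url = url_template.format(quote_plus(position), quote_plus(location))
    return url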

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])
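
The scraping cell later appends `&start=<offset>` to this URL to page through results. As an illustration, the whole query string can also be built in one place with the standard-library `urllib.parse.urlencode`, which encodes spaces and special characters. `build_search_url` and `BASE_URL` are hypothetical names used only in this sketch.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"

def build_search_url(position: str, location: str, start: int = 0) -> str:
    """Build a search URL with URL-encoded query parameters."""
    params = {"q": position, "l": location}
    if start:
        # Offset of the first result on the page (0, 10, 20, ...).
        params["start"] = start
    return BASE_URL + "?" + urlencode(params)

print(build_search_url("manager of analytics", "california"))
# https://www.indeed.com/jobs?q=manager+of+analytics&l=california
print(build_search_url("manager of analytics", "california", start=10))
# https://www.indeed.com/jobs?q=manager+of+analytics&l=california&start=10
```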

### Scrape job postings

In [4]:
## Number of postings to scrape
postings = 1000

## Number of browser instances to use
n = 3

pages = list(range(0, postings, 10))

state = {
  'lock': asyncio.Lock(),
  'ids': set(),
  'n': 0
}
             
async def get_jobs(url, pages, state):
  data = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for i in pages:
      await session.get(url + "&start=" + str(i))
      jobs = await session.get_elements("[class='job_seen_beacon']")

      for job in jobs:
        result_html = await job.get_property('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        liens = await job.get_elements("a")
        link = await liens[0].get_attribute("href")

        title = soup.select('.jobTitle')[0].get_text().strip()
        try:
          company = soup.select('.companyName')[0].get_text().strip()
        except IndexError:
          # select() found no company element: skip this card
          continue
        location = soup.select('.companyLocation')[0].get_text().strip()
        # Optional fields: fall back to a placeholder when the element is missing
        try:
          salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
          salary = 'NaN'
        try:
          rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
          rating = 'NaN'
        try:
          date = soup.select('.date')[0].get_text().strip()
        except IndexError:
          date = 'NaN'
        try:
          description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
          description = ''
            
        Id = f"{title}{company}{location}{rating}{date}{salary}{description}"
        dupe = False
        async with state['lock']:
          if Id in state['ids']:
            dupe = True
          else:
            state['ids'].add(Id)
            state['n'] = state['n'] + 1
            print("Job number {0:4d} added - {1:s}".format(state['n'],title))
        if dupe:
          continue

        data.append({
          'Title': title,
          "Company": company,
          'Location': location,
          'Rating': rating,
          'Date': date,
          "Salary": salary,
          "Description": description,
          "Links": link
        })

  return data

tasks = [asyncio.create_task(get_jobs(url, p, state)) for p in np.array_split(pages, n)]
dataframe = pd.DataFrame([j for task in tasks for j in await task])

Job number    1 added - Program Manager, Analytics
Job number    2 added - Information Technology Product/ Project Manager [HYBRID]
Job number    3 added - Manager of Operations
Job number    4 added - User Support Manager
Job number    5 added - Pre-Award Manager
Job number    6 added - Manager, CRM Data Science Analytics
Job number    7 added - Growth Analytics Manager
Job number    8 added - Manager of Planning & Analysis
Job number    9 added - Product Manager - Feed Foundations
Job number   10 added - Senior Manager of Strategy & Analytics
Job number   11 added - Manager, Insights and Analytics
Job number   12 added - AR/VR Media Tech Lead/ Engineering Manager
Job number   13 added - Associate Manager, Digital Analytics
Job number   14 added - Sr Manager, Financial Planning and Analysis
Job number   15 added - Enterprise Account Manager - Analytics
Job number   16 added - Accounting Manager
Job number   17 added - Sr. Manager, Business Analytics
Job number   18 added - Internal Au

Job number  144 added - Lead Application Admin
Job number  145 added - Manager, VIC & Client Development- West Coast
Job number  146 added - IT Manager
Job number  147 added - Business Operations Manager Senior-IO/IT
Job number  148 added - Senior Manager ITSM
Job number  149 added - Manager, Digital Data & Analytics (Data Analysis)
Job number  150 added - Lead IT Auditor (Remote)
Job number  151 added - Senior Project Manager
Job number  152 added - Sr. Manager, HRIS Technology and Strategy
Job number  153 added - Group Product Manager - Virtual Expert Platform - Collaboration Capabilities
Job number  154 added - Data Engineering Manager (San Diego)
Job number  155 added - Manager, Database Administration
Job number  156 added - Project Manager Sr-IT
Job number  157 added - Evaluation Lead (RSCH DATA ANL 4)-SOM: MED: Equity Diversity Inclusion-Sacramento Campus
Job number  158 added - Product Manager, Upsell Experience
Job number  159 added - Sr Manager, Diversity, Equity and Inclusio

Job number  281 added - Manager, Analytics
Job number  282 added - Product Manager, Creator Experience
Job number  283 added - Project Manager/Financial Analyst
Job number  284 added - Product Manager
Job number  285 added - Employee Experience Manager
Job number  286 added - Senior Product Manager
Job number  287 added - Senior Manager, CRM Lifecycle
Job number  288 added - Lead Research Data Analyst
Job number  289 added - Product Manager, Web
Job number  290 added - Manager, Quality Engineering
Job number  291 added - Sr. Product Manager
Job number  292 added - Program Manager, Learning and Talent Management
Job number  293 added - Manager, Financial Planning and Analysis - Supply Chain
Job number  294 added - Research Manager Ad Solutions
Job number  295 added - AIML - Sr Manager, Speech Technologies Product & Program Management
Job number  296 added - Business Solutions Manager (Remote)
Job number  297 added - Manager, Competitive Intelligence (US Strategy) | Businesses, Global an

Job number  419 added - Manager, Business Intelligence & Analysis
Job number  420 added - Production Manager
Job number  421 added - Manager, Data & Analytics - Apple Retail Online
Job number  422 added - Principal Vehicle Product Manager
Job number  423 added - Data Manager
Job number  424 added - Senior Product Manager - TikTok Effect
Job number  425 added - Senior Product Manager
Job number  426 added - Senior Product Manager - Healthy Light Software Applications
Job number  427 added - Digital Data Technical Project Manager
Job number  428 added - Sr HIPAA Compliance Program Manager - REMOTE
Job number  429 added - Product Manager, Silicon SoC
Job number  430 added - New NaaS Introduction Program Manager
Job number  431 added - Principal Product Manager, UI and Application Platforms
Job number  432 added - Sr. Program Manager - Talent
Job number  433 added - Product Manager
Job number  434 added - Senior Product Manager I
Job number  435 added - Strategic Planning Manager
Job numbe

Job number  550 added - Database Operations Manager (San Diego)
Job number  551 added - Senior Manager, Residuals and Data Management
Job number  552 added - DEI Transformation Manager
Job number  553 added - Analytic Lab Manager
Job number  554 added - Senior Technical Product Manager - Family Tree Intelligence
Job number  555 added - Senior Commercial Business Systems Manager
Job number  556 added - Manager, Business Planning & Financial Analysis
Job number  557 added - Data Manager Analyst
Job number  558 added - Senior Associate/Manager, Quality Control
Job number  559 added - Manager - Financial Planning & Analytics (1.0 FTE, Days)
Job number  560 added - Senior Product Manager, Lingo Biowearables
Job number  561 added - Technical Project Manager-USDS
Job number  562 added - Technical Product Manager - Development Infrastructure Product Team
Job number  563 added - Voice of the Customer, Insights Lead, Remote
Job number  564 added - Tech Lead Manager, Data Platform - USDS
Job numb

Job number  681 added - Clinical Project Manager
Job number  682 added - Senior Technical Product Manager
Job number  683 added - Sr. Business Coach / Regional Manager - Las Vegas
Job number  684 added - Consumer Insights, Senior Manager Advertising Effectiveness
Job number  685 added - Senior Product Manager, International
Job number  686 added - Senior Manager, Data Science - Inventory Forecasting (San Bruno)
Job number  687 added - Sr Manager, Trust Strategic Insights
Job number  688 added - Lead Product Manager, Mobile Games
Job number  689 added - Project Manager, REVOLVE Owned Brands
Job number  690 added - Ads Data Science Manager
Job number  691 added - Product Manager
Job number  692 added - Manager, Data & Analytics – Santa Monica, California
Job number  693 added - Project Manager - Structural Analysis
Job number  694 added - Sales Product Manager - Server
Job number  695 added - Manager, IT Business Engagement Analyst
Job number  696 added - Tech Lead Manager, Architect and

Job number  816 added - MANAGER, FINANCIAL PLANNING & ANALYSIS
Job number  817 added - Engineering Manager- Product Development
Job number  818 added - Product Manager
Job number  819 added - Manager - Applications Software Engineering - Federal
Job number  820 added - Principal Program Manager
Job number  821 added - Senior Product Manager - Banking
Job number  822 added - Product Manager, Charging (Digital Experience)
Job number  823 added - Healthcare Analytics Manager/Senior Analyst
Job number  824 added - Product Manager
Job number  825 added - Senior Engineering Manager, Data & Analytics
Job number  826 added - Space Systems Program Manager
Job number  827 added - Senior Product Manager, Fulfillment Solutions (Remote Positions Available)
Job number  828 added - Lead Experimentation Scientist
Job number  829 added - Patient Experience Project Manager - FT - Days - Patient Experience
Job number  830 added - Manager, Operations
Job number  831 added - Manager, Service Desk
Job numbe
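
The cell above splits the page offsets across `n` browser sessions and uses a shared lock-protected set to drop postings that more than one session encounters. The sketch below reproduces that pattern without a browser so it can be run anywhere. `fake_fetch`, `worker`, and `scrape` are hypothetical names, the job IDs are made up, and the pages are split by striding rather than with `np.array_split`.

```python
import asyncio
from typing import Dict, List

async def fake_fetch(page: int) -> List[str]:
    """Stand-in for loading one results page; returns made-up job IDs."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    # Adjacent pages deliberately share one ID to simulate a repeated posting.
    return [f"job-{page}", f"job-{page + 10}"]

async def worker(pages: List[int], state: Dict) -> List[str]:
    """Process this worker's pages, keeping only IDs no worker has seen yet."""
    kept = []
    for page in pages:
        for job_id in await fake_fetch(page):
            async with state["lock"]:
                if job_id in state["ids"]:
                    continue  # duplicate: already recorded by some worker
                state["ids"].add(job_id)
            kept.append(job_id)
    return kept

async def scrape(pages: List[int], n_workers: int) -> List[str]:
    # Create the lock inside the running event loop.
    state = {"lock": asyncio.Lock(), "ids": set()}
    # Strided split: worker k takes pages k, k + n_workers, ...
    chunks = [pages[k::n_workers] for k in range(n_workers)]
    results = await asyncio.gather(*(worker(chunk, state) for chunk in chunks))
    return [job for chunk in results for job in chunk]
```

In a notebook this can be run with `await scrape(list(range(0, 40, 10)), 3)`; in a plain script, wrap the call in `asyncio.run(...)`.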

### Scrape full job descriptions

In [5]:
Links_list = dataframe['Links'].tolist()

import random

async def get_description(urls):
  descriptions = []
  async with get_session(Geckodriver(binary=driver_path, log_file=asyncio.subprocess.PIPE), Firefox(**options)) as session:
    for url in urls:
      await session.get("https://www.indeed.com"+url)
      await asyncio.sleep(1)  # one-second delay that does not block the other browser sessions
      jd = await session.get_element('#jobDescriptionText')
      descriptions.append(await jd.get_text())
      await asyncio.sleep(random.random() * 1.5)
  return descriptions

## Number of browser instances to use
n = 3

tasks = [asyncio.create_task(get_description(urls)) for urls in np.array_split(Links_list, n)]
dataframe['Descriptions'] = [desc for task in tasks for desc in await task]
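
The assignment to `dataframe['Descriptions']` needs exactly one description per link, in the same order. If a page fails to load or the description element cannot be found, an exception in one task would presumably abort the whole step. The sketch below shows one way to keep the list aligned by storing a default value on failure. `collect_descriptions` and `fake_fetch_description` are hypothetical names, and the fetch function is a stand-in for the browser session.

```python
import asyncio
from typing import Awaitable, Callable, List

async def collect_descriptions(
    fetch: Callable[[str], Awaitable[str]],
    urls: List[str],
    default: str = "",
) -> List[str]:
    """Fetch each URL in order; store default when a fetch fails."""
    descriptions = []
    for url in urls:
        try:
            descriptions.append(await fetch(url))
        except Exception:
            # Keep one entry per URL so the list stays aligned with the rows.
            descriptions.append(default)
    return descriptions

async def fake_fetch_description(url: str) -> str:
    """Stand-in for a browser session; fails for one made-up URL."""
    if url.endswith("missing"):
        raise LookupError("description element not found")
    return "description for " + url
```

In the notebook, the body of the `for url in urls` loop would take the place of `fetch`; catching a narrower exception type than `Exception` is preferable once the failure modes are known.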

### Save results

In [6]:
# Convert the dataframe to a csv file
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + ".csv", index=False)
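
The file name above embeds the search terms verbatim, so a position such as "manager of analytics" produces a name with spaces, and other characters (for example `/`) could make the path invalid. A minimal sketch of a safer file name follows. `build_filename` is a hypothetical helper, and the date in the example is arbitrary.

```python
import re
from datetime import date

def build_filename(position: str, location: str, day: date) -> str:
    """Return '<YYYY-MM-DD>_<position>_<location>.csv' using safe characters."""
    def slug(text: str) -> str:
        # Replace each run of characters other than letters, digits, '-' with '_'.
        return re.sub(r"[^A-Za-z0-9-]+", "_", text.strip()).strip("_")
    return f"{day.isoformat()}_{slug(position)}_{slug(location)}.csv"

print(build_filename("manager of analytics", "california", date(2023, 4, 27)))
# 2023-04-27_manager_of_analytics_california.csv
```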

In [7]:
dataframe

| | Title | Company | Location | Rating | Date | Salary | Description | Links | Descriptions |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Program Manager, Analytics | PayPal | Remote in California+1 location | 3.9 | PostedPosted 30+ days ago | $84,500 - $204,600 a year | You will have exposure and visibility to PayPa... | /rc/clk?jk=d80c44b2b2ed76a1&fccid=978d9fd9799d... | At PayPal (NASDAQ: PYPL), we believe that ever... |
| 1 | Manager of Operations | SpaceX | Hawthorne, CA 90250 (North Hawthorne area) | 3.6 | PostedPosted 30+ days ago | $130,000 - $180,000 a year | SpaceX was founded under the belief that a fut... | /rc/clk?jk=e713533258f4bd4f&fccid=8bc01be85b97... | SpaceX was founded under the belief that a fut... |
| 2 | User Support Manager | TikTok | Los Angeles, CA | 3.4 | PostedPosted 30+ days ago | $101,333 - $188,522 a year | TikTok is the leading destination for short-fo... | /rc/clk?jk=af2cfc39b44a3fb8&fccid=caed318a9335... | Responsibilities\nTikTok is the leading destin... |
| 3 | Pre-Award Manager | UCLA Health | Los Angeles, CA 90095 | 4.0 | PostedPosted 30+ days ago | $61,400 - $121,400 a year | Minimum 3-years pre-award experience with NIH ... | /rc/clk?jk=c8d09b99bad832d6&fccid=74540dcd08f5... | Description\nThe Semel Institute for Neuroscie... |
| 4 | Manager, CRM Data Science Analytics | Sephora USA | Remote in San Francisco, CA 94105 | 3.8 | PostedPosted 13 days ago | $179,878 - $180,878 a year | Define marketing analytics requirements for ML... | /rc/clk?jk=8dbca01bb3417326&fccid=afda98bfc822... | Position\nManager, CRM Data Science Analytics\... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 844 | Remote - Sr. Product Manager, Healthcare Innov... | City of Hope | Remote in Duarte, CA 91009 | 3.7 | PostedPosted 22 days ago | $59.44 - $99.27 an hour | City of Hope's mission is to deliver the cures... | /rc/clk?jk=5e27837917748aeb&fccid=e4c4ad85ebf2... | About City of Hope\nCity of Hope's mission is ... |
| 845 | Technology Support Manager | Pyramid Technology Services | Riverside, CA | | PostedPosted 22 days ago | $86,000 - $90,000 a year | Riverside, California - 2023-04-05 Technology ... | /rc/clk?jk=a0078795d802b6c4&fccid=07954fd3e787... | Riverside, California - 2023-04-05 Technology ... |
| 846 | Project Manager - Product | Alignment Healthcare | California | 2.8 | PostedPosted 30+ days ago | | Alignment Health was founded with a mission to... | /rc/clk?jk=c875ad0748f6ed34&fccid=b671a434bbdb... | Job Number 4903\nRemote - CA , California\nAli... |
| 847 | Senior Product Manager, Gaming | Yoom | Los Angeles, CA | | PostedPosted 30+ days ago | $150,000 a year | We are seeking an experienced Senior Product M... | /rc/clk?jk=c2ecfa7ea5877354&fccid=95ab4146c159... | We are seeking an experienced Senior Product M... |
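
The scraped `Date` values carry a doubled "Posted" label and the `Salary` values are free text, so some cleanup is usually needed before analysis. The sketch below is one possible approach, based only on the value formats visible in the rows above; `clean_date` and `parse_salary` are hypothetical helpers and may not cover every format the site produces.

```python
import re
from typing import Optional, Tuple

def clean_date(raw: str) -> str:
    """Drop the leading 'Posted' label(s), e.g. 'PostedPosted 13 days ago'."""
    return re.sub(r"^(?:Posted)+\s*", "", raw)

def parse_salary(raw: str) -> Optional[Tuple[float, float, str]]:
    """Return (low, high, period) from text like '$84,500 - $204,600 a year'."""
    amounts = [
        float(amount.replace(",", ""))
        for amount in re.findall(r"\$([\d,]+(?:\.\d+)?)", raw)
    ]
    period = re.search(r"\ban?\s+(\w+)\s*$", raw)
    if not amounts or period is None:
        return None  # missing or unrecognized salary text
    return (min(amounts), max(amounts), period.group(1))

print(clean_date("PostedPosted 30+ days ago"))      # 30+ days ago
print(parse_salary("$84,500 - $204,600 a year"))    # (84500.0, 204600.0, 'year')
print(parse_salary("$59.44 - $99.27 an hour"))      # (59.44, 99.27, 'hour')
```

A single amount such as "$150,000 a year" yields the same value for both bounds, and placeholder values such as 'NaN' yield `None`.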
