# Web Scraping - Indeed.com
General steps for web scraping:
1. Check whether the website allows web scraping
2. Obtain the source code (HTML file) using the website URL
3. Download the website content
4. Parse the content using tags and keywords for the elements of interest
5. Extract relevant data/features
6. Organize the raw data in a structured format (e.g., CSV)
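
Step 1 can be partly automated. The sketch below uses Python's standard `urllib.robotparser` to test a URL against a set of robots.txt rules. The rules, the `my-scraper` user agent, and the `example.com` URLs are made-up placeholders, not Indeed's actual policy; a site's terms of service should be read as well.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules supplied as a list of lines
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(sample_rules, "my-scraper", "https://example.com/private/page"))
print(is_allowed(sample_rules, "my-scraper", "https://example.com/public/page"))
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` loads the real robots.txt instead of a hand-written list.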

### Install Firefox, Selenium, geckodriver, Beautiful Soup

In [8]:
#Install Firefox (these are Linux shell commands; on Windows they fail, as the output below shows)
!apt-get update
!apt install firefox

#Install selenium
!pip install selenium

#Updating and installing firefox libraries
!apt-get update && apt-get install -y wget bzip2 libxtst6 libgtk-3-0 libx11-xcb-dev libdbus-glib-1-2 libxt6 libpci-dev && rm -rf /var/lib/apt/lists/*

#Installing geckodriver
!wget https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
!tar -xvzf geckodriver*
!chmod +x geckodriver
#Note: each "!" command runs in its own subshell, so this export does not persist to later cells
!export PATH=$PATH:/path-to-extracted-file/.

#Install Beautiful Soup
!pip install beautifulsoup4

'apt-get' is not recognized as an internal or external command,
operable program or batch file.
'apt' is not recognized as an internal or external command,
operable program or batch file.




'apt-get' is not recognized as an internal or external command,
operable program or batch file.
'wget' is not recognized as an internal or external command,
operable program or batch file.
tar: Error opening archive: Failed to open 'geckodriver*'
'chmod' is not recognized as an internal or external command,
operable program or batch file.
'export' is not recognized as an internal or external command,
operable program or batch file.




### Import Dependencies

In [1]:
import random
import time
from datetime import datetime

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.firefox.service import Service

### Define Position and Location

In [75]:
## Enter a job position
position = "data+scientist"
## Enter a location (City, State or Zip or remote)
locations = "NewYork"

def get_url(position, location):
    url_template = "https://www.indeed.com/jobs?q={}&l={}"
    url = url_template.format(position, location)
    return url

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])
print(url)

https://www.indeed.com/jobs?q=data+scientist&l=NewYork
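
The template above expects the position to be URL-encoded by hand (`data+scientist`). As a sketch of an alternative, the standard library's `urlencode` can do the escaping, so plain strings with spaces or commas can be passed in. `build_search_url` and its `start` parameter are hypothetical names introduced here, and the sample location differs from the `NewYork` value used in this notebook.

```python
from urllib.parse import urlencode


def build_search_url(position, location, start=0):
    """Build an Indeed search URL, letting urlencode escape the query values."""
    params = {"q": position, "l": location}
    if start:
        params["start"] = start  # result offset used for pagination
    return "https://www.indeed.com/jobs?" + urlencode(params)


print(build_search_url("data scientist", "New York, NY", start=10))
# https://www.indeed.com/jobs?q=data+scientist&l=New+York%2C+NY&start=10
```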


### Set Path to Webdriver

In [3]:
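# Path to the extracted geckodriver binary. It is only needed when passed explicitly,
# e.g. webdriver.Firefox(service=Service(executable_path=driver_path), options=firefox_options)
driver_path = '/content/geckodriver'
user_agent = 'Mozilla'  # defined but not applied to the browser in this notebook
firefox_options = FirefoxOptions()
firefox_options.add_argument('--headless')  # run Firefox without a visible window
# No Service is passed here, so Selenium has to locate geckodriver itself
driver = webdriver.Firefox(options=firefox_options)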

### Scrape Job Postings

In [81]:
## Upper bound on the result offset (each page is requested with &start= in steps of 10)
postings = 200
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])
jn = 0  # running count of scraped postings

# The implicit wait applies to the whole session, so it is set once
driver.implicitly_wait(3)

# Note: the range starts at offset 100, so earlier result pages are skipped;
# use range(0, postings, 10) to start from the first page
for i in range(100, postings, 10):
    driver.get(url + "&start=" + str(i))

    # Each job card on the results page carries the class 'job_seen_beacon'
    jobs = driver.find_elements(By.CLASS_NAME, 'job_seen_beacon')

    for job in jobs:
        # Parse the card's HTML with Beautiful Soup
        result_html = job.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        jn += 1

        # The first anchor in the card links to the posting
        anchors = job.find_elements(By.TAG_NAME, "a")
        links = anchors[0].get_attribute("href")

        title = soup.select('.jobTitle')[0].get_text().strip()
        print(title)

        # Optional fields fall back to a placeholder when the element is missing
        try:
            company = soup.find_all(attrs={'data-testid': 'company-name'})[0].get_text().strip()
        except IndexError:
            company = 'NaN'
        print(company)
        location = soup.find_all(attrs={'data-testid': 'text-location'})[0].get_text().strip()
        print(location)
        try:
            salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
            salary = 'NaN'
        try:
            rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
            rating = 'NaN'
        try:
            date = soup.select('.date')[0].get_text().strip()
        except IndexError:
            date = 'NaN'
        try:
            description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
            description = ''

        # Append the posting as a new row
        dataframe = pd.concat([dataframe, pd.DataFrame([{'Title': title,
                                          "Company": company,
                                          'Location': location,
                                          'Rating': rating,
                                          'Date': date,
                                          "Salary": salary,
                                          "Description": description,
                                          "Links": links}])], ignore_index=True)
        print("Job number {0:4d} added - {1:s}".format(jn,title))

Research Computing and Data Science Facilitator
University at Buffalo
Hybrid remote in Buffalo, NY 14260
Job number    1 added - Research Computing and Data Science Facilitator
Artificial Intelligence Engineer (AI Engineer)
Integrass
Hybrid remote in New York, NY 10001
Job number    2 added - Artificial Intelligence Engineer (AI Engineer)
Senior AI Specialist, Solution Consulting
ServiceNow
Remote in New York, NY
Job number    3 added - Senior AI Specialist, Solution Consulting
Lead Decision Scientist - EDA Product
CVS Health
New York, NY
Job number    4 added - Lead Decision Scientist - EDA Product
Senior Data Analyst, Amazon
Harry's
New York, NY 10004 (Financial District area)
Job number    5 added - Senior Data Analyst, Amazon
Director, Data Science Engineering
Dassault Systèmes
New York, NY
Job number    6 added - Director, Data Science Engineering
Machine Learning Engineering Lead
MMC Corporate
New York, NY 10036 (Midtown area)
Job number    7 added - Machine Learning Engineering 

Senior Data Analyst
Prutech Solutions
New York, NY 10007 (Financial District area)
Job number   61 added - Senior Data Analyst
Machine Learning Engineer
Altice USA
Long Island City, NY 11101
Job number   62 added - Machine Learning Engineer
Data Scientist
FARM
Lancaster, NY 14043
Job number   63 added - Data Scientist
Lead Data Scientist
Fusemachines
New York, NY
Job number   64 added - Lead Data Scientist
Machine Learning Engineer (Entry Level)
SRC, Inc.
Syracuse, NY
Job number   65 added - Machine Learning Engineer (Entry Level)
Member of Technical Staff, Machine Learning
Runway
Remote in New York, NY 10014
Job number   66 added - Member of Technical Staff, Machine Learning
Senior Data Analyst, Emerging Business
Peloton
New York, NY 10011 (Chelsea area)
Job number   67 added - Senior Data Analyst, Emerging Business
Senior Data Analyst
NYC Careers
Manhattan, NY
Job number   68 added - Senior Data Analyst
Generative AI Strategist, Generative AI Innovation Center
Amazon Web Services, In

Senior Risk Adjustment Data Analyst - Remote
EmblemHealth
Remote in New York, NY
Job number  119 added - Senior Risk Adjustment Data Analyst - Remote
Senior Decision Scientist - Enterprise Digital Analytics
CVS Health
Hybrid remote in New York, NY
Job number  120 added - Senior Decision Scientist - Enterprise Digital Analytics
Data Scientist
NYC Careers
Manhattan, NY 10004 (Financial District area)
Job number  121 added - Data Scientist
Senior Data Engineer
Codeignitors inc
Irving, NY 14081
Job number  122 added - Senior Data Engineer
Principal Machine Learning Engineer
Figure
Remote in New York, NY 10027
Job number  123 added - Principal Machine Learning Engineer
Senior Data Quality Engineer, Corporate Vice President
New York Life Insurance Co
New York, NY
Job number  124 added - Senior Data Quality Engineer, Corporate Vice President
Senior Software Engineer - AI Platform (NY)
AlphaSense
New York, NY 10003 (Flatiron area)
Job number  125 added - Senior Software Engineer - AI Platform 
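
The scraping loop repeats the same try/except pattern for every optional field. One possible way to condense it is a small helper built on Beautiful Soup's `select_one`, sketched below. `first_text` is a hypothetical name, and the sample HTML is a made-up stand-in for a job card, not real Indeed markup.

```python
from bs4 import BeautifulSoup


def first_text(soup, selector, default="NaN"):
    """Return the stripped text of the first element matching a CSS selector,
    or the default when nothing matches."""
    element = soup.select_one(selector)
    return element.get_text().strip() if element is not None else default


# Made-up card markup for illustration
sample_card = (
    '<div><h2 class="jobTitle"> Data Scientist </h2>'
    '<span data-testid="company-name">Acme</span></div>'
)
soup = BeautifulSoup(sample_card, "html.parser")

print(first_text(soup, ".jobTitle"))                     # Data Scientist
print(first_text(soup, '[data-testid="company-name"]'))  # Acme
print(first_text(soup, ".salary-snippet-container"))     # NaN
```

With such a helper, each field in the loop becomes a single line, for example `salary = first_text(soup, '.salary-snippet-container')`.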

In [82]:
dataframe

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links
0,Research Computing and Data Science Facilitator,University at Buffalo,"Hybrid remote in Buffalo, NY 14260",,PostedPosted 30+ days ago,,The Center for Computational Research (CCR) at...,https://www.indeed.com/rc/clk?jk=33e3b9decf5d0...
1,Artificial Intelligence Engineer (AI Engineer),Integrass,"Hybrid remote in New York, NY 10001",,PostedPosted 30+ days ago,,AI Engineer - New York Experience Hire: BAM is...,https://www.indeed.com/rc/clk?jk=86a5360917a93...
2,"Senior AI Specialist, Solution Consulting",ServiceNow,"Remote in New York, NY",,PostedPosted 30+ days ago,"$165,825 - $273,675 a year","Company Description At ServiceNow, our technol...",https://www.indeed.com/rc/clk?jk=f199691e2f7a7...
3,Lead Decision Scientist - EDA Product,CVS Health,"New York, NY",,PostedPosted 22 days ago,"$132,250 - $260,000 a year",Bring your heart to CVS Health. Every one of u...,https://www.indeed.com/rc/clk?jk=300ff83543345...
4,"Senior Data Analyst, Amazon",Harry's,"New York, NY 10004 (Financial District area)",,PostedPosted 30+ days ago •Many applications i...,,About Harry's Harry's Inc. started in 2013 wit...,https://www.indeed.com/rc/clk?jk=d44a402a00875...
...,...,...,...,...,...,...,...,...
145,Data Coordinator I - Medicine Clinical Trials ...,Mount Sinai,"New York, NY 10029 (Yorkville area)",,PostedPosted 30+ days ago,,Description Strength Through Diversity Ground ...,https://www.indeed.com/rc/clk?jk=f8e5356af5611...
146,Director Operational Data Science and Systems,Metropolitan Transportation Authority,"New York, NY 11221 (Bedford Stuyvesant area)",,PostedPosted 30+ days ago,"$115,568.00 - $151,683.53 a year",Description POSTING NO. 3979 JOB TITLE: Direct...,https://www.indeed.com/rc/clk?jk=e194740b1bf10...
147,"Sr. SDE - ML + Big Data, Measurement, Ad Tech,...",Amazon.com Services LLC - A57,"New York, NY",,PostedPosted 7 days ago,"From $134,500 a year",- 5+ years of non-internship professional soft...,https://www.indeed.com/rc/clk?jk=92340f0ca3e03...
148,"Post Doc Researcher - Artificial Intelligence,...",Microsoft,"New York, NY",,PostedPosted 30+ days ago,"$94,300 - $182,600 a year",Microsoft Research New York City (MSR NYC) is ...,https://www.indeed.com/rc/clk?jk=52df5e48bb86d...
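
The Date column above contains values such as `PostedPosted 30+ days ago`, sometimes followed by extra text after a bullet. A minimal cleanup sketch follows, assuming the doubled word and the bullet suffix are the only noise; `clean_posted_date` is a hypothetical helper, and other date formats on the site are not handled.

```python
import re


def clean_posted_date(raw):
    """Collapse a repeated leading 'Posted' and drop any text after a bullet."""
    text = raw.split("•")[0].strip()
    return re.sub(r"^(Posted)+\s*", "Posted ", text)


print(clean_posted_date("PostedPosted 30+ days ago"))
# Posted 30+ days ago
print(clean_posted_date("PostedPosted 30+ days ago •Many applications"))
# Posted 30+ days ago
```

It could be applied to the whole column with `dataframe['Date'] = dataframe['Date'].map(clean_posted_date)`, provided every value is a string.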


### Scrape Full Job Descriptions

In [83]:
Links_list = dataframe['Links'].tolist()
#Links_list

In [84]:
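from selenium.common.exceptions import NoSuchElementException

descriptions = []
for link in Links_list:
    driver.get(link)
    # Implicit wait: how long find_element keeps retrying before giving up
    driver.implicitly_wait(random.randint(3, 8))
    try:
        jd = driver.find_element(By.XPATH, '//div[@id="jobDescriptionText"]').text
    except NoSuchElementException:
        jd = ''  # keep the list aligned with the dataframe rows
    descriptions.append(jd)
    # Random pause between requests to avoid hammering the site
    time.sleep(random.randint(5, 10))

dataframe['Descriptions'] = descriptions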

### Save Results

In [85]:
# Convert the dataframe to a csv file
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + "2.csv", index=False)

In [86]:
dataframe.head()

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
0,Research Computing and Data Science Facilitator,University at Buffalo,"Hybrid remote in Buffalo, NY 14260",,PostedPosted 30+ days ago,,The Center for Computational Research (CCR) at...,https://www.indeed.com/rc/clk?jk=33e3b9decf5d0...,The Center for Computational Research (CCR) at...
1,Artificial Intelligence Engineer (AI Engineer),Integrass,"Hybrid remote in New York, NY 10001",,PostedPosted 30+ days ago,,AI Engineer - New York Experience Hire: BAM is...,https://www.indeed.com/rc/clk?jk=86a5360917a93...,AI Engineer - New York\nExperience Hire:\nBAM ...
2,"Senior AI Specialist, Solution Consulting",ServiceNow,"Remote in New York, NY",,PostedPosted 30+ days ago,"$165,825 - $273,675 a year","Company Description At ServiceNow, our technol...",https://www.indeed.com/rc/clk?jk=f199691e2f7a7...,"Company Description\n\nAt ServiceNow, our tech..."
3,Lead Decision Scientist - EDA Product,CVS Health,"New York, NY",,PostedPosted 22 days ago,"$132,250 - $260,000 a year",Bring your heart to CVS Health. Every one of u...,https://www.indeed.com/rc/clk?jk=300ff83543345...,Bring your heart to CVS Health. Every one of u...
4,"Senior Data Analyst, Amazon",Harry's,"New York, NY 10004 (Financial District area)",,PostedPosted 30+ days ago •Many applications i...,,About Harry's Harry's Inc. started in 2013 wit...,https://www.indeed.com/rc/clk?jk=d44a402a00875...,About Harry's\nHarry's Inc. started in 2013 wi...
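
The Salary column is stored as free text (for example `$165,825 - $273,675 a year` or `From $134,500 a year`). As a rough sketch of turning it into numeric features, the hypothetical `parse_salary` helper below pulls out the dollar amounts and the pay period with regular expressions. Treating a single amount as both the low and the high value is a simplifying assumption.

```python
import re


def parse_salary(raw):
    """Return (low, high, period) from a salary string, or None if no amount is found."""
    amounts = [
        float(match.replace(",", ""))
        for match in re.findall(r"\$([\d,]+(?:\.\d+)?)", raw)
    ]
    if not amounts:
        return None
    period = re.search(r"\ban? (year|month|week|day|hour)\b", raw)
    return (min(amounts), max(amounts), period.group(1) if period else None)


print(parse_salary("$165,825 - $273,675 a year"))  # (165825.0, 273675.0, 'year')
print(parse_salary("From $134,500 a year"))        # (134500.0, 134500.0, 'year')
print(parse_salary("NaN"))                         # None
```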
