# Web Scraping - Indeed.com
General steps for web scraping:
1. Check whether the website allows web scraping
2. Obtain the source code (HTML File) by using the website URL
3. Download the website content
4. Parse the content using the tags and attributes of the elements of interest
5. Extract relevant data/features
6. Organize raw data in structured format (e.g., CSV)
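
Step 1 can be partly automated by reading the site's `robots.txt`. The sketch below uses the standard-library `urllib.robotparser` on a made-up `robots.txt` string; `SAMPLE_ROBOTS_TXT`, `is_allowed`, and the `example.com` URLs are illustrative assumptions, not rules published by any real site. A real check would load the target site's own file, and would not replace reading its terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/jobs?q=data"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/page"))
```

Rules are matched in file order, so the `Disallow: /private/` line wins for the second URL while everything else falls through to `Allow: /`.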

### Install Firefox, Selenium, Gecko Driver, Beautiful Soup

In [1]:
#Note: the apt-get, wget, tar, chmod and export commands below assume a Debian/Ubuntu-based Linux environment
#Install firefox
!apt-get update
!apt install firefox

#Install selenium
%pip install selenium

#Updating and installing firefox libraries
!apt-get update && apt-get install -y wget bzip2 libxtst6 libgtk-3-0 libx11-xcb-dev libdbus-glib-1-2 libxt6 libpci-dev && rm -rf /var/lib/apt/lists/*

#Installing Gecko Driver
!wget https://github.com/mozilla/geckodriver/releases/download/v0.24.0/geckodriver-v0.24.0-linux64.tar.gz
!tar -xvzf geckodriver*
!chmod +x geckodriver
#Note: each "!" command runs in its own subshell, so this export does not persist to later cells
!export PATH=$PATH:/path-to-extracted-file/.

#Install beautifulsoup
%pip install beautifulsoup4

'apt-get' is not recognized as an internal or external command,
operable program or batch file.
'apt' is not recognized as an internal or external command,
operable program or batch file.


Collecting selenium
  Downloading selenium-4.15.2-py3-none-any.whl.metadata (6.9 kB)
Collecting urllib3<3,>=1.26 (from urllib3[socks]<3,>=1.26->selenium)
  Downloading urllib3-2.1.0-py3-none-any.whl.metadata (6.4 kB)
Collecting trio~=0.17 (from selenium)
  Downloading trio-0.23.1-py3-none-any.whl.metadata (4.9 kB)
Collecting trio-websocket~=0.9 (from selenium)
  Downloading trio_websocket-0.11.1-py3-none-any.whl.metadata (4.7 kB)
Collecting certifi>=2021.10.8 (from selenium)
  Downloading certifi-2023.7.22-py3-none-any.whl.metadata (2.2 kB)
Collecting attrs>=20.1.0 (from trio~=0.17->selenium)
  Downloading attrs-23.1.0-py3-none-any.whl (61 kB)
     ---------------------------------------- 0.0/61.2 kB ? eta -:--:--
     -------------------------- ------------- 41.0/61.2 kB 1.9 MB/s eta 0:00:01
     ---------------------------------------- 61.2/61.2 kB 1.6 MB/s eta 0:00:00
Collecting sortedcontainers (from trio~=0.17->selenium)
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (2

'apt-get' is not recognized as an internal or external command,
operable program or batch file.
'wget' is not recognized as an internal or external command,
operable program or batch file.
tar: Error opening archive: Failed to open 'geckodriver*'
'chmod' is not recognized as an internal or external command,
operable program or batch file.
'export' is not recognized as an internal or external command,
operable program or batch file.


Collecting beautifulsoup4
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
     ---------------------------------------- 0.0/143.0 kB ? eta -:--:--
     ------- ----------------------------- 30.7/143.0 kB 660.6 kB/s eta 0:00:01
     ----------------------------- -------- 112.6/143.0 kB 1.3 MB/s eta 0:00:01
     -------------------------------------- 143.0/143.0 kB 1.4 MB/s eta 0:00:00
Collecting soupsieve>1.2 (from beautifulsoup4)
  Downloading soupsieve-2.5-py3-none-any.whl.metadata (4.7 kB)
Downloading soupsieve-2.5-py3-none-any.whl (36 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.12.2 soupsieve-2.5


### Import Dependencies

In [16]:
import random
import time
from datetime import datetime

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.firefox.service import Service

### Define Position and Location

In [17]:
## Enter a job position
position = "data+scientist"
## Enter a location (City, State or Zip or remote)
locations = "Canada"

def get_url(position, location):
    """Build the search URL; position and location must already be URL-encoded."""
    url_template = "https://www.indeed.com/jobs?q={}&l={}"
    return url_template.format(position, location)

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])
print(url)

https://www.indeed.com/jobs?q=data+scientist&l=Canada
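
`get_url` above relies on the caller writing the query already encoded (`data+scientist`). As a hedged alternative, the sketch below lets the standard library do the encoding with `urllib.parse.urlencode`; `build_url` and its `start` parameter are hypothetical names introduced here, not part of the original notebook.

```python
from urllib.parse import urlencode


def build_url(position: str, location: str, start: int = 0) -> str:
    """Build a search URL, percent-encoding the query parameters."""
    params = {"q": position, "l": location}
    if start:
        # Offset of the first result on the page, as used by the scraping loop
        params["start"] = start
    return "https://www.indeed.com/jobs?" + urlencode(params)


print(build_url("data scientist", "Canada"))
print(build_url("data scientist", "Toronto, ON", start=10))
```

With plain-text inputs, spaces and commas in a location such as `Toronto, ON` no longer need to be escaped by hand.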


### Set Path to Webdriver

In [18]:
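# Path to the extracted geckodriver binary and a user-agent string.
# Note: neither value is passed to the driver below; recent Selenium 4 releases
# can locate geckodriver on their own when no Service is given.
driver_path = '/content/geckodriver'
user_agent = 'Mozilla'
firefox_options = FirefoxOptions()
firefox_options.add_argument('--headless')  # run Firefox without a visible window
driver = webdriver.Firefox(options=firefox_options)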

### Scrape Job Postings

In [19]:
## Number of postings to scrape
postings = 1000

jn = 0  # running count of postings added
for i in range(0, postings, 10):
    # The "start" query parameter steps through the results 10 postings at a time
    driver.get(url + "&start=" + str(i))
    driver.implicitly_wait(3)

    # Each job card on the results page carries the 'job_seen_beacon' class
    jobs = driver.find_elements(By.CLASS_NAME, 'job_seen_beacon')

    for job in jobs:
        # Parse the card's inner HTML with BeautifulSoup
        result_html = job.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')

        jn += 1

        # The first anchor in the card is taken as the link to the full posting
        liens = job.find_elements(By.TAG_NAME, "a")
        links = liens[0].get_attribute("href")

        title = soup.select('.jobTitle')[0].get_text().strip()
        print(title)

        # Optional fields fall back to a placeholder when the element is missing
        try:
            company = soup.find_all(attrs={'data-testid': 'company-name'})[0].get_text().strip()
        except IndexError:
            company = 'NaN'
        print(company)
        location = soup.find_all(attrs={'data-testid': 'text-location'})[0].get_text().strip()
        print(location)
        try:
            salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
            salary = 'NaN'
        try:
            rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
            rating = 'NaN'
        try:
            date = soup.select('.date')[0].get_text().strip()
        except IndexError:
            date = 'NaN'
        try:
            description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
            description = ''

        dataframe = pd.concat([dataframe, pd.DataFrame([{'Title': title,
                                          "Company": company,
                                          'Location': location,
                                          'Rating': rating,
                                          'Date': date,
                                          "Salary": salary,
                                          "Description": description,
                                          "Links": links}])], ignore_index=True)
        print("Job number {0:4d} added - {1:s}".format(jn,title))

VISS/AXIS Ai Developer - (Ai modeling, data engineering, X-ray imaging) - Kentucky (BOSK)
SK Battery America
Kentucky
Job number    1 added - VISS/AXIS Ai Developer - (Ai modeling, data engineering, X-ray imaging) - Kentucky (BOSK)
Data Manager - DoD TS/SCI - Camp Humphrey - Korea
Peraton
United States
Job number    2 added - Data Manager - DoD TS/SCI - Camp Humphrey - Korea
Principal Statistician - CRO - Remote
Compass Life Sciences
Remote in United States
Job number    3 added - Principal Statistician - CRO - Remote
Principal Statistical Programmer - CRO - Remote
Compass Life Sciences
Remote in United States
Job number    4 added - Principal Statistical Programmer - CRO - Remote
VISS/AXIS Ai Developer - (Ai modeling, data engineering, X-ray imaging) - Kentucky (BOSK)
SK Battery America
Kentucky
Job number    5 added - VISS/AXIS Ai Developer - (Ai modeling, data engineering, X-ray imaging) - Kentucky (BOSK)
Principal Statistician - CRO - Remote
Compass Life Sciences
Remote in United S
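
The log above shows the same title, company, and location appearing more than once (jobs 1 and 5, for example). A minimal sketch of dropping such repeats follows, written against plain dictionaries so it runs without a browser. Treating an identical (Title, Company, Location) triple as the same posting is an assumption, and `drop_duplicate_postings` is a hypothetical helper; on the dataframe itself, `dataframe.drop_duplicates(subset=[...])` plays the same role.

```python
def drop_duplicate_postings(rows):
    """Keep the first occurrence of each (Title, Company, Location) triple."""
    seen = set()
    unique = []
    for row in rows:
        key = (row["Title"], row["Company"], row["Location"])
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Sample rows copied from the log above
rows = [
    {"Title": "Principal Statistician - CRO - Remote",
     "Company": "Compass Life Sciences", "Location": "Remote in United States"},
    {"Title": "Data Manager - DoD TS/SCI - Camp Humphrey - Korea",
     "Company": "Peraton", "Location": "United States"},
    {"Title": "Principal Statistician - CRO - Remote",
     "Company": "Compass Life Sciences", "Location": "Remote in United States"},
]
unique_rows = drop_duplicate_postings(rows)
print(len(unique_rows))
```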

In [20]:
dataframe.head()

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links
0,"VISS/AXIS Ai Developer - (Ai modeling, data en...",SK Battery America,Kentucky,,EmployerActive 4 days ago,,Come join us and build your future with SK bat...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...
1,Data Manager - DoD TS/SCI - Camp Humphrey - Korea,Peraton,United States,,PostedPosted 30+ days ago,"$112,000 - $179,000 a year",Peraton Overview Peraton drives missions of co...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...
2,Principal Statistician - CRO - Remote,Compass Life Sciences,Remote in United States,,PostedPosted 30+ days ago,"$140,000 a year",Compass Life Sciences have partnered with a ra...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...
3,Principal Statistical Programmer - CRO - Remote,Compass Life Sciences,Remote in United States,,PostedPosted 30+ days ago,"$145,000 a year",Compass Life Sciences are working with a growt...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...
4,"VISS/AXIS Ai Developer - (Ai modeling, data en...",SK Battery America,Kentucky,,EmployerActive 4 days ago,,Data Management & Retrieval: Design and manage...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...
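
The `Date` column above carries fused labels such as `PostedPosted 30+ days ago` and `EmployerActive 4 days ago`. One plausible cause, stated here as an assumption, is that `get_text()` also picks up a hidden label inside the date element. The sketch below strips only those two observed prefixes; `clean_date` is a hypothetical helper and other label variants may exist.

```python
import re

# Matches a leading "Posted" that is followed by another "Posted",
# or a leading "Employer" that is followed by "Active".
_LABEL_PREFIX = re.compile(r"^(?:Posted(?=Posted)|Employer(?=Active))")


def clean_date(raw: str) -> str:
    """Remove the duplicated label prefix from a scraped date string."""
    return _LABEL_PREFIX.sub("", raw.strip())


print(clean_date("PostedPosted 30+ days ago"))
print(clean_date("EmployerActive 4 days ago"))
```

Strings without either prefix, including the `'NaN'` placeholder, pass through unchanged.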


### Scrape Full Job Descriptions

In [21]:
Links_list = dataframe['Links'].tolist()
#Links_list

In [22]:
print(len(Links_list))

142


In [23]:
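descriptions = []
for link in Links_list:
    driver.get(link)
    driver.implicitly_wait(random.randint(3, 8))
    # The full description sits in the div with id "jobDescriptionText"
    jd = driver.find_element(By.XPATH, '//div[@id="jobDescriptionText"]').text
    descriptions.append(jd)
    # Pause for a random 5-10 seconds between requests
    time.sleep(random.randint(5, 10))

# One description per link, in the same order as the dataframe rows
dataframe['Descriptions'] = descriptions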
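
A single failed page in the loop above raises an exception and discards every description collected so far. A hedged sketch of a more forgiving loop follows: it takes the page-fetching step as a callable, so it can be tried without a browser. `collect_descriptions`, `fetch`, and `fake_fetch` are hypothetical names; in the notebook, `fetch` would wrap the `driver.get` and `find_element` calls.

```python
def collect_descriptions(links, fetch, default=""):
    """Call fetch(link) for each link, substituting default when it fails."""
    descriptions = []
    for link in links:
        try:
            descriptions.append(fetch(link))
        except Exception as exc:
            # Keep one entry per link so the list stays aligned with the rows
            print(f"Skipping {link}: {exc}")
            descriptions.append(default)
    return descriptions


def fake_fetch(link):
    """Stand-in for the Selenium lookup; fails for one link on purpose."""
    if link == "b":
        raise ValueError("description element not found")
    return "text for " + link


result = collect_descriptions(["a", "b", "c"], fake_fetch)
print(result)
```

Because the result always has one entry per link, it can still be assigned to `dataframe['Descriptions']` after a partial failure.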


### Save Results

In [None]:
# Convert the dataframe to a csv file
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + ".csv", index=False)
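
The file name above embeds `position` and `locations` verbatim, so characters such as `+`, spaces, or commas end up in the name. A minimal sketch of building a tidier name is shown below; `make_filename` is a hypothetical helper and the date in the example is made up.

```python
import re
from datetime import date


def make_filename(position: str, location: str, day: date) -> str:
    """Join date, position and location into a CSV name with safe characters."""
    parts = [day.isoformat(), position, location]
    # Replace every run of characters other than letters, digits or "-" with "_"
    safe = [re.sub(r"[^A-Za-z0-9-]+", "_", part).strip("_") for part in parts]
    return "_".join(safe) + ".csv"


print(make_filename("data+scientist", "Toronto, ON", date(2023, 11, 20)))
```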

In [None]:
dataframe.head()

Unnamed: 0,Title,Company,Location,Rating,Date,Salary,Description,Links,Descriptions
0,"Data Scientist, Core AI",Indeed,Remote,,EmployerActive 3 days ago,"$122,000 - $178,000 a year",Participate in code review and process innovat...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Our Mission\nAs the world’s number 1 job site*...
1,Post Marketing Pharmacovigilance Scientist,Medasource,Remote,,PostedPosted 30+ days ago,$80 - $120 an hour,"Irrespective, the PV Scientist is a critical t...",https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Title: Post Marketing PV Scientist\nDuration: ...
2,Mid-level Conversational Banking AI Training –...,ALTA IT Services,"Remote in Vienna, VA 22180",,PostedPosted 1 day ago,,Ability to process large data sets and provide...,https://www.indeed.com/pagead/clk?mo=r&ad=-6NY...,Mid-level Conversational Banking AI Training –...
3,Data Scientist,COGNIZE TECH SOLUTIONS,Remote,,EmployerActive 2 days ago,"$65,000 - $80,000 a year"," Using tools such as Tableau, Looker and Goog...",https://www.indeed.com/company/Cognize-tech-so...,Apply Statistical and Machine Learning methods...
4,Data Scientist,The Chefs' Warehouse,Remote,,PostedPosted 24 days ago,"$90,000 - $98,000 a year",Strong knowledge of machine learning technique...,https://www.indeed.com/rc/clk?jk=bcabc61c15e08...,About The Chefs' Warehouse\nThe Chefs' Warehou...
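
The `Salary` column above is free text such as `$122,000 - $178,000 a year` or `$80 - $120 an hour`. A hedged sketch of turning it into numbers follows; `parse_salary` is a hypothetical helper that covers only the formats visible in this output, and other formats return `None`.

```python
import re

# "$<low>" optionally followed by "- $<high>", then "a"/"an" and the pay period
_SALARY = re.compile(
    r"\$([\d,]+(?:\.\d+)?)(?:\s*-\s*\$([\d,]+(?:\.\d+)?))?\s+an?\s+(\w+)"
)


def parse_salary(raw):
    """Return (low, high, period), or None if the text is not recognised."""
    if not isinstance(raw, str):
        return None
    match = _SALARY.search(raw)
    if match is None:
        return None
    low = float(match.group(1).replace(",", ""))
    # A single figure is treated as both ends of the range
    high = float(match.group(2).replace(",", "")) if match.group(2) else low
    return low, high, match.group(3)


print(parse_salary("$122,000 - $178,000 a year"))
print(parse_salary("$80 - $120 an hour"))
```

Hourly and yearly figures stay separate through the returned period, so they are not compared directly by mistake.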
