# Web Scraping - Indeed.com
General steps for web scraping:
1. Check whether the website allows web scraping
2. Request the page by its URL to obtain the source code (HTML)
3. Download the page content
4. Parse the content, using tags and class names to locate the elements of interest
5. Extract the relevant data/features
6. Organize the raw data in a structured format (e.g., CSV)
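
Step 1 can be partly automated by consulting the site's `robots.txt`. The sketch below uses the standard-library `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` are made-up sample rules so the example runs offline (they are not Indeed's actual file), and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /jobs
"""

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.example.com/jobs?q=data+scientist"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.example.com/private/report"))
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` fetches the file instead of calling `parse`. A `robots.txt` check does not replace reading the site's terms of service.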

### Import Dependencies 

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.by import By

### Path to webdriver (Firefox, Chrome) 

In [3]:
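# Ensure that the driver path is correct before running this script.
# Selenium 4 takes the driver path through a Service object; the old
# executable_path keyword was deprecated in 4.0 and removed in later releases.
from selenium.webdriver.chrome.service import Service

# Microsoft Windows (Chrome)
driver_path = "C:/Users/10063/Desktop/Web_scrapper/chromedriver.exe"
# Linux (Firefox): use webdriver.Firefox with the geckodriver path instead
# driver_path = "./drivers/linux/geckodriver"
driver = webdriver.Chrome(service=Service(executable_path=driver_path))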

### Define position and location 

In [4]:
## Enter a job position
position = "data scientist"
## Enter a location (City, State or Zip or remote)
locations = "remote"

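from urllib.parse import quote_plus

def get_url(position, location):
    """Build the Indeed search URL, URL-encoding both query values."""
    url_template = "https://www.indeed.com/jobs?q={}&l={}"
    url = url_template.format(quote_plus(position), quote_plus(location))
    return url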

url = get_url(position, locations)
dataframe = pd.DataFrame(columns=["Title", "Company", "Location", "Rating", "Date", "Salary", "Description", "Links"])

### Scrape job postings

In [5]:
## Number of postings to scrape
postings = 1000

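jn = 0  # running count of scraped postings
driver.implicitly_wait(3)  # the implicit wait applies to all lookups, so set it once
for i in range(0, postings, 10):  # results are paged in steps of 10 via &start=
    driver.get(url + "&start=" + str(i))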

    jobs = driver.find_elements(By.CLASS_NAME, 'job_seen_beacon')

    for job in jobs:
        result_html = job.get_attribute('innerHTML')
        soup = BeautifulSoup(result_html, 'html.parser')
        
        jn += 1
        
        liens = job.find_elements(By.TAG_NAME, "a")
        links = liens[0].get_attribute("href")
        
        title = soup.select('.jobTitle')[0].get_text().strip()
        company = soup.select('.companyName')[0].get_text().strip()
        location = soup.select('.companyLocation')[0].get_text().strip()
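        # Optional fields: select() returns an empty list when the element is
        # missing, so indexing [0] raises IndexError. Catch only that error
        # rather than using a bare except.
        try:
            salary = soup.select('.salary-snippet-container')[0].get_text().strip()
        except IndexError:
            salary = 'NaN'
        try:
            rating = soup.select('.ratingNumber')[0].get_text().strip()
        except IndexError:
            rating = 'NaN'
        try:
            date = soup.select('.date')[0].get_text().strip()
        except IndexError:
            date = 'NaN'
        try:
            description = soup.select('.job-snippet')[0].get_text().strip()
        except IndexError:
            description = ''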
       
        dataframe = pd.concat([dataframe, pd.DataFrame([{'Title': title,
                                          "Company": company,
                                          'Location': location,
                                          'Rating': rating,
                                          'Date': date,
                                          "Salary": salary,
                                          "Description": description,
                                          "Links": links}])], ignore_index=True)
        print("Job number {0:4d} added - {1:s}".format(jn,title))

Job number    1 added - Data Scientist
Job number    2 added - Jr. Data Scientist
Job number    3 added - Analyst I, Data Science
Job number    4 added - Junior Data Scientist
Job number    5 added - Data Scientist
Job number    6 added - Data Scientist
Job number    7 added - Data Scientist
Job number    8 added - Data Scientist (Remote)
Job number    9 added - Junior Data Scientist
Job number   10 added - Data Scientist Co-Op (Spring 2023)
Job number   11 added - Data Scientist
Job number   12 added - Data Scientist, Baseball Research & Development
Job number   13 added - Data Scientist, Analytics II #0000
Job number   14 added - Data Scientist
Job number   15 added - Data Scientist or Statistician
Job number   16 added - Data Scientist
Job number   17 added - Data Scientist
Job number   18 added - Data Scientist
Job number   19 added - Data Scientist
Job number   20 added - Data Scientist
Job number   21 added - Data Scientist
Job number   22 added - Data Scientist
Job number   23 a
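
The loop above pages through the results by appending `&start=` offsets in steps of 10. As a sketch, the page URLs can also be generated up front with `urllib.parse.urlencode`, which encodes spaces in the query values as well. `page_urls` is a hypothetical helper, and the default step of 10 simply mirrors the loop above.

```python
from urllib.parse import urlencode

def page_urls(position, location, postings, step=10):
    """Return one search URL per results page, using start=0, step, 2*step, ..."""
    base = "https://www.indeed.com/jobs"
    return [
        base + "?" + urlencode({"q": position, "l": location, "start": start})
        for start in range(0, postings, step)
    ]

for u in page_urls("data scientist", "remote", postings=30):
    print(u)
```

With `postings=30` this yields three URLs, the second being `https://www.indeed.com/jobs?q=data+scientist&l=remote&start=10`.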

In [6]:
driver.quit()

### Scrape full job descriptions

In [7]:
Links_list = dataframe['Links'].tolist()
#Links_list

In [8]:
import random
import time

In [9]:
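from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(executable_path=driver_path))
descriptions = []
for link in Links_list:
    driver.get(link)
    driver.implicitly_wait(random.randint(3, 8))  # max time to wait for elements
    try:
        jd = driver.find_element(By.XPATH, '//div[@id="jobDescriptionText"]').text
    except NoSuchElementException:
        jd = ''  # keep one entry per link so the new column stays aligned
    descriptions.append(jd)
    time.sleep(random.randint(5, 10))  # pause between requests

dataframe['Descriptions'] = descriptions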

In [10]:
driver.quit()

### Save results

In [11]:
# Save the dataframe as a CSV file named <date>_<position>_<location>_Assignment3.csv
date = datetime.today().strftime('%Y-%m-%d')
dataframe.to_csv(date + "_" + position + "_" + locations + "_Assignment3" + ".csv", index=False)
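
Because `position` may contain spaces (for example "data scientist"), the resulting file name does too. A minimal sketch of a hypothetical `csv_filename` helper that replaces runs of non-alphanumeric characters with underscores; the naming scheme follows the cell above, and the fixed date is only there to make the example reproducible.

```python
import re
from datetime import date

def csv_filename(position, location, suffix="Assignment3", day=None):
    """Build '<YYYY-MM-DD>_<position>_<location>_<suffix>.csv' without spaces."""
    day = day or date.today()
    parts = [position, location, suffix]
    # Collapse anything that is not a letter or digit into a single underscore.
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "_", p).strip("_") for p in parts]
    return day.strftime("%Y-%m-%d") + "_" + "_".join(cleaned) + ".csv"

print(csv_filename("data scientist", "remote", day=date(2023, 1, 15)))
```

The returned string could then be passed to `dataframe.to_csv(..., index=False)` in place of the concatenated name.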

In [12]:
dataframe

| Unnamed: 0 | Title | Company | Location | Rating | Date | Salary | Description | Links | Descriptions |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Data Scientist | Shaw Industries Group, Inc. | Remote | 3.8 | PostedPosted 7 days ago | | Partner with data scientists across the enterp... | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | We are looking for a data scientist to join ou... |
| 1 | Jr. Data Scientist | Talentheed Inc | Remote | 4.6 | PostedPosted 4 days ago | $56,951 - $119,187 a year | To apply to data sets, create unique data mode... | https://www.indeed.com/company/Talentheed-Inc/... | Responsibilities : -\nCoordinate with differen... |
| 2 | Analyst I, Data Science | Liberty Mutual Insurance | Remote | 3.6 | PostedPosted 3 days ago | $70,100 - $161,600 a year | Competencies typically acquired through a Mast... | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | The Product Design and Modeling Department of ... |
| 3 | Junior Data Scientist | Evolven Software | Remote | | PostedJust posted | | Work with Product Managers, Engineers, and Cus... | https://www.indeed.com/rc/clk?jk=e54b34a429376... | Location: Remote\nRole Description:\nWe are lo... |
| 4 | Data Scientist | Procal Technologies | Remote | | PostedPosted 1 day ago | $80,000 a year | Develops statistical, machine learning and AI ... | https://www.indeed.com/rc/clk?jk=38ae3e9b86111... | $80k USD/year\nRemote Job\nFull-time\nBrief Ov... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1494 | Head of Machine Learning | Ursus, Inc. | Remote in New York, NY 10001 | 4.9 | PostedPosted 11 days ago | $180,000 - $250,000 a year | You will be responsible for leading a team tha... | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | JOB TITLE: Head of Machine Learning\nLOCATION:... |
| 1495 | Sr Software Engineer (AI) - Telecommute | UnitedHealth Group | Remote in Boston, MA 02112 | 3.6 | PostedPosted 2 days ago | | In addition to your salary, UnitedHealth Group... | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | Combine two of the fastest-growing fields on t... |
| 1496 | Project Manager III / Senior Quality Data Analyst | Atlas | Remote | | PostedPosted 5 days ago | $50 - $68 an hour | Understanding of industry trends in analytics,... | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | No C2C - Only W2\nJob Title: Project Manager I... |
| 1497 | Senior Statistical Programmer Analyst - Remote | Penfield Search Partners | Remote in Fairfield, CT | | PostedPosted 30+ days ago | | Bachelor’s degree or higher and/or equivalent ... | https://www.indeed.com/pagead/clk?mo=r&ad=-6NY... | Salary: commensurate with experience\nReferenc... |
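
The `Date` values in the output above come through with a doubled prefix (`PostedPosted 7 days ago`, `PostedJust posted`). One possible post-processing step, sketched here with the standard `re` module, converts them to a number of days. `days_since_posted` is a hypothetical helper, and treating "Just posted" and "Today" as 0 and "30+" as 30 are assumptions for illustration, not Indeed's definitions.

```python
import re

def days_since_posted(raw):
    """Convert strings such as 'PostedPosted 7 days ago' to an integer day count.

    Returns None when no age can be recognised.
    """
    text = raw.lower()
    if "just posted" in text or "today" in text:
        return 0
    match = re.search(r"(\d+)\+?\s+days?\s+ago", text)
    return int(match.group(1)) if match else None

samples = ["PostedPosted 7 days ago", "PostedJust posted",
           "PostedPosted 1 day ago", "PostedPosted 30+ days ago", "NaN"]
print([days_since_posted(s) for s in samples])  # [7, 0, 1, 30, None]
```

Assuming every value in the column is a string, as it is when produced by the scraping loop, the helper could be applied with `dataframe['Date'].map(days_since_posted)`.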
