<a href="https://www.kaggle.com/code/lumarian/web-scraping-indeed-job-postings?scriptVersionId=159976536" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Web Scraping Indeed Job Postings
This is my first web scraping project and it's still in progress. I'm gathering data from Indeed job postings and then analyzing them to find any trends or patterns.

In [1]:
import bs4
from bs4 import BeautifulSoup
import requests
import time
import random as ran
import sys
import pandas as pd
import re

## Getting Started
Importing the packages and setting up the API key for Scrapingdog, the web scraping service I'm using.

In [2]:
url = "https://api.scrapingdog.com/indeed"
api_key = "65ae0b013dc7fb6d1a5d3145"
jobs_url = "https://www.indeed.com/"

params = {"api-key": api_key, "url": jobs_url}
print(params)

{'api-key': '65ae0b013dc7fb6d1a5d3145', 'url': 'https://www.indeed.com/'}


In [3]:
target_url = "https://www.indeed.com/"
scrape_url = "https://api.scrapingdog.com/scrape?api_key=65ae0b013dc7fb6d1a5d3145&url=https://www.indeed.com/jobs?q=&l=Los+Angeles%2C+CA&from=searchOnHP&vjk=89eb443ee8b65264&dynamic=false"
api_key = "65ae0b013dc7fb6d1a5d3145"
payload = {'api-key': api_key, 'url':scrape_url}
head= {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}

resp = requests.get(scrape_url)

print("status is", resp.status_code)

status is 200
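
The request above embeds the Indeed URL, unencoded, inside the Scrapingdog URL, so query parameters such as `l=` may be read as belonging to the outer request rather than to the Indeed search. A minimal sketch of one way to avoid that is to percent-encode the target URL as a single parameter. The endpoint and the parameter names (`api_key`, `url`, `dynamic`) are taken from the URL used above; `build_scrape_url` and the `SCRAPINGDOG_API_KEY` environment variable are hypothetical names introduced only for illustration.

```python
import os
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint as it appears in the request above.
SCRAPE_ENDPOINT = "https://api.scrapingdog.com/scrape"


def build_scrape_url(api_key, target_url, dynamic=False):
    """Return the scraper URL with target_url percent-encoded as one parameter."""
    query = urlencode({
        "api_key": api_key,
        "url": target_url,
        "dynamic": "true" if dynamic else "false",
    })
    return f"{SCRAPE_ENDPOINT}?{query}"


# Hypothetical: read the key from an environment variable instead of the notebook.
api_key = os.environ.get("SCRAPINGDOG_API_KEY", "YOUR_API_KEY")
target_url = "https://www.indeed.com/jobs?q=&l=Los+Angeles%2C+CA"
scrape_url = build_scrape_url(api_key, target_url)

# Round-trip check: decoding the query gives back the original Indeed URL.
recovered = parse_qs(urlsplit(scrape_url).query)["url"][0]
print(recovered == target_url)
```

The resulting `scrape_url` can be passed to `requests.get` as before. Reading the key from the environment also keeps it out of the published notebook.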


## Extracting Items from HTML
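I'm making lists of the features that I want to analyze later. I wrote a function for each variable since each one has its own selector and needs its own handling for missing values. With a function per variable, I can apply it to every posting with `map()` instead of writing a separate for loop each time, which keeps the code short (the runtime difference is negligible at this scale).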

In [4]:
# Making the soup and identifying the unit containers for the job postings
soup = BeautifulSoup(resp.text, "html.parser")
allData = soup.find_all('div', attrs={'data-testid': 'slider_container'})

In [5]:
# Writing functions for each variable that I'm interested in to map onto the main dataset
# Every function returns "Not Found" when the element is missing, so each list holds one string per posting

# Job Title
def get_title(x):
    new_x = x.find('span', id=re.compile("jobTitle"))
    if new_x is None:
        return "Not Found"
    return new_x.text

# Salary
def get_pay(x):
    new_x = x.find('div', attrs={'data-testid': 'attribute_snippet_testid'})
    # The snippet can hold attributes other than pay, so keep it only if it contains a dollar amount
    if new_x is None or '$' not in new_x.text:
        return "Not Found"
    return new_x.text

# Location
def get_loc(x):
    new_x = x.find('div', attrs={'data-testid': 'text-location'})
    if new_x is None:
        return "Not Found"
    return new_x.text

# Company
def get_comp(x):
    new_x = x.find('span', attrs={'data-testid': 'company-name'})
    if new_x is None:
        return "Not Found"
    return new_x.text

pays = list(map(get_pay, allData))
locs = list(map(get_loc, allData))
comps = list(map(get_comp, allData))
titles = list(map(get_title, allData))

# checking lengths of the lists
print(len(titles), len(locs), len(pays), len(comps))

15 15 15 15


## Creating Dataframe
Now that I have stored the relevant data in lists and confirmed that they are the same length, I can combine them into a pandas dataframe.

In [6]:
jobData = pd.DataFrame({'Job Title':titles, 'Location': locs, 'Salary':pays, 'Company':comps})
jobData

| | Job Title | Location | Salary | Company |
|---|---|---|---|---|
| 0 | Quality Control Manager | Los Angeles, CA | $80,000 - $120,000 a year | Confidential |
| 1 | Boba Time HP/USC Team Member ($18.25/$19.5 to ... | Los Angeles, CA 90007 (Exposition Park area) | $18.25 - $30.00 an hour | It's Boba Time HP |
| 2 | Mid Level provider for neurology clinic (spani... | Los Angeles, CA 90057 (Westlake area) | $12,000 - $18,000 a month | NILA |
| 3 | Special Needs Assistant | Los Angeles, CA | $23 - $28 an hour | Birch Agency |
| 4 | Dog Walker | Santa Monica, CA | $16 - $50 an hour | Spot Dog Walking |
| 5 | Project Manager / Water Technician | Los Angeles, CA | $22 - $26 an hour | ServiceMaster by C2C Restoration |
| 6 | Legal Assistant | Los Angeles, CA 90015 (Downtown area) | $20 - $25 an hour | Downtown LA Law Group |
| 7 | Business Office Manager | Los Angeles, CA 90029 | $30 - $35 an hour | Palazzo Post Acute |
| 8 | 988 Crisis Counselor | Los Angeles, CA 90067 | $24 an hour | Didi Hirsch Mental Health Services |
| 9 | Appointment Clerk | Anaheim, CA | $25.30 - $28.03 an hour | Kaiser Permanente |


## Project Loading...
This is where I'm at for now on this project. My planned next steps are:
* Convert the Salary variable to numeric and standardize the units to per hour, since some postings list pay per month or per year.
* Clean the job title field to make the titles more concise.
* Categorize jobs as on-site, hybrid, or remote and compare salaries across those groups (this would be more useful when comparing jobs within the same field, which is the reason for my next point).
* Categorize the job types using NLP to find trends in the salaries.
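
The first step in the list, converting salaries to a numeric hourly figure, could be sketched roughly as follows. This is only one possible approach and rests on assumptions that are not part of the project so far: a range is collapsed to its midpoint, and a full-time schedule of 8 hours a day, 40 hours a week, and 2,080 hours a year is used for the conversion. The names `to_hourly`, `HOURS_PER_PERIOD`, `AMOUNT_RE`, and `PERIOD_RE` are hypothetical.

```python
import re

# Assumed full-time schedule (an assumption, not something Indeed states):
# 8 h/day, 40 h/week, 2,080 h/year, and a month as one twelfth of a year.
HOURS_PER_PERIOD = {
    "hour": 1.0,
    "day": 8.0,
    "week": 40.0,
    "month": 2080.0 / 12,
    "year": 2080.0,
}

# Dollar amounts such as "$24", "$18.25" or "$80,000".
AMOUNT_RE = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)")
# Pay period such as "an hour", "a month" or "per year".
PERIOD_RE = re.compile(r"\b(?:an?|per)\s+(hour|day|week|month|year)\b", re.IGNORECASE)


def to_hourly(salary_text):
    """Return an approximate hourly rate, or None if the text cannot be parsed."""
    if not salary_text:
        return None
    amounts = [float(a.replace(",", "")) for a in AMOUNT_RE.findall(salary_text)]
    period = PERIOD_RE.search(salary_text)
    if not amounts or period is None:
        return None
    # Collapse a range like "$23 - $28" to its midpoint (assumption).
    midpoint = (min(amounts) + max(amounts)) / 2
    return midpoint / HOURS_PER_PERIOD[period.group(1).lower()]


for sample in ["$80,000 - $120,000 a year", "$24 an hour", "Not Found"]:
    print(sample, "->", to_hourly(sample))
```

Applied to the dataframe, this could look like `jobData['Salary'].map(to_hourly)`. Keeping the low and high ends of each range as separate columns may be preferable to a midpoint if the spread itself is of interest.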

Some other thoughts:
* It would be nice if I could obtain company size in number of employees to see if that has an impact on the pay.
* It would be interesting to import the data into mapping software like ArcGIS and plot the postings there to observe any location-based patterns.
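
For the mapping idea, one possible starting point is to split the Location strings into city, state, ZIP code, and area, so that postings can later be geocoded or joined to ZIP-level boundaries. The sketch below assumes the formats seen in the table above, such as "Los Angeles, CA 90007 (Exposition Park area)"; `split_location` and `LOCATION_RE` are hypothetical names, and other formats Indeed may use (for example a "Remote in ..." prefix) are not handled.

```python
import re

# Matches "City, ST", optionally followed by a 5-digit ZIP and a "(... area)" note.
LOCATION_RE = re.compile(
    r"^(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})"
    r"(?:\s+(?P<zip>\d{5}))?"
    r"(?:\s+\((?P<area>[^)]+)\))?$"
)


def split_location(text):
    """Return a dict with city, state, zip and area, or None if the text does not match."""
    match = LOCATION_RE.match(text.strip()) if text else None
    if match is None:
        return None
    # Parts missing from the text (ZIP or area) come back as None.
    return match.groupdict()


for sample in [
    "Los Angeles, CA 90007 (Exposition Park area)",
    "Santa Monica, CA",
    "Not Found",
]:
    print(sample, "->", split_location(sample))
```

Postings without a ZIP code would still need geocoding by city name before they can be plotted as points.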
