<a href="https://www.kaggle.com/code/lumarian/web-scraping-indeed-job-postings?scriptVersionId=159975507" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Web Scraping Indeed Job Postings
This is my first web scraping project, and it is still in progress. I'm gathering data from Indeed job postings and analyzing it to find trends or patterns.

In [1]:
import bs4
from bs4 import BeautifulSoup
import requests
import time
import random as ran
import sys
import pandas as pd
import re

## Getting Started
Importing the packages and setting up the API key for the web scraping service (Scrapingdog).

In [2]:
url = "https://api.scrapingdog.com/indeed"
api_key = "65ae0b013dc7fb6d1a5d3145"
jobs_url = "https://www.indeed.com/"

params = {"api-key": api_key, "url": jobs_url}
print(params)

{'api-key': '65ae0b013dc7fb6d1a5d3145', 'url': 'https://www.indeed.com/'}


In [3]:
target_url = "https://www.indeed.com/"
scrape_url = "https://api.scrapingdog.com/scrape?api_key=65ae0b013dc7fb6d1a5d3145&url=https://www.indeed.com/jobs?q=&l=Los+Angeles%2C+CA&from=searchOnHP&vjk=89eb443ee8b65264&dynamic=false"
api_key = "65ae0b013dc7fb6d1a5d3145"
payload = {'api-key': api_key, 'url':scrape_url}
head= {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}

resp = requests.get(scrape_url)

print("status is", resp.status_code)

status is 200
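
The request URL above embeds the Indeed search URL without encoding it, so the search's own `&l=...` and `&from=...` parameters may be read as parameters of the API call instead of as part of the target page. Below is a minimal sketch of one way to avoid that, using only the standard library. The parameter names `api_key`, `url`, and `dynamic` are copied from the URL in the cell above; `build_scrape_url` is a hypothetical helper, `YOUR_API_KEY` is a placeholder, and the search is reduced to the location query for brevity.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API_ENDPOINT = "https://api.scrapingdog.com/scrape"  # endpoint used in the cell above


def build_scrape_url(api_key, target_url, dynamic="false"):
    """Return the API URL with the target URL percent-encoded as a single parameter."""
    query = urlencode({"api_key": api_key, "url": target_url, "dynamic": dynamic})
    return f"{API_ENDPOINT}?{query}"


# Indeed search for all jobs in Los Angeles, CA (empty keyword query).
indeed_search = "https://www.indeed.com/jobs?" + urlencode({"q": "", "l": "Los Angeles, CA"})

encoded_scrape_url = build_scrape_url("YOUR_API_KEY", indeed_search)  # placeholder key
print(encoded_scrape_url)

# Round-trip check: decoding the query should yield exactly the three API
# parameters, with the Indeed URL intact inside "url".
parsed = parse_qs(urlsplit(encoded_scrape_url).query)
print(sorted(parsed))
print(parsed["url"][0] == indeed_search)
```

The result can be passed to `requests.get` exactly like the hand-written URL. Keeping the key in a variable (or an environment variable) also avoids repeating it inside the URL string.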


## Extracting Items from HTML
I'm making lists of the features that I want to analyze later. Some of them, like location and salary, need to be reformatted and cleaned before they go into the dataframe.

In [4]:
soup = BeautifulSoup(resp.text, "html.parser")
allData = soup.find_all('div', attrs={'data-testid': 'slider_container'})

In [5]:
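# Each helper returns the element's text, or "Not Found" when it is missing,
# so the resulting lists never contain None.
def get_title(x):
    # The job title sits in a <span> whose id contains "jobTitle"
    new_x = x.find('span', id=re.compile("jobTitle"))
    if new_x is None:
        return "Not Found"
    return new_x.text

def get_pay(x):
    # Keep the snippet only when it actually holds a dollar amount
    new_x = x.find('div', attrs={'data-testid': 'attribute_snippet_testid'})
    if new_x is None or '$' not in new_x.text:
        return "Not Found"
    return new_x.text

def get_loc(x):
    new_x = x.find('div', attrs={'data-testid': 'text-location'})
    if new_x is None:
        return "Not Found"
    return new_x.text

def get_comp(x):
    new_x = x.find('span', attrs={'data-testid': 'company-name'})
    if new_x is None:
        return "Not Found"
    return new_x.text
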
pays = list(map(get_pay, allData))
locs = list(map(get_loc, allData))
comps = list(map(get_comp, allData))
titles = list(map(get_title, allData))

# checking lengths of the lists
print(len(titles), len(locs), len(pays), len(comps))    

15 15 15 15


## Creating Dataframe
Now that the relevant data is stored in lists, I can combine them into a pandas dataframe.

In [6]:
jobData = pd.DataFrame({'Job Title':titles, 'Location': locs, 'Salary':pays, 'Company':comps})
jobData

Unnamed: 0,Job Title,Location,Salary,Company
0,Estimator,"Los Angeles, CA 90022 (East Los Angeles area)","$72,000 - $110,000 a year","Duran's Body Shop, Inc."
1,Substitute Teacher - Highest Pay in LA,"Los Angeles, CA",$25 - $41 an hour,Teachers On Reserve
2,Boba Time HP/USC Team Member ($18.25/$19.5 to ...,"Los Angeles, CA 90007 (Exposition Park area)",$18.25 - $30.00 an hour,It's Boba Time HP
3,School Bus Driver. *Training Provided*,"Los Angeles, CA 90012",$26 - $33 an hour,Zum SF Inc.
4,Dog Walker,"Santa Monica, CA",$16 - $50 an hour,Spot Dog Walking
5,Roofer,"Valley Village, CA 91607",$25.56 - $27.89 an hour,PAC Properties
6,Front Desk Agent,"Los Angeles, CA 90027 (Los Feliz area)",$21 - $23 an hour,Cara Hotel
7,Full Charge Bookkeeper (Music Touring),"El Segundo, CA 90245","$80,000 - $99,000 a year",Woodwest Business Management
8,988 Crisis Counselor,"Los Angeles, CA 90067",$24 an hour,Didi Hirsch Mental Health Services
9,AI Content Writer,"Remote in Los Angeles, CA",$20 - $25 an hour,DataAnnotation


## Project Loading...
This is where the project stands for now. My next steps are:
* Convert the Salary variable to numeric values and standardize the units to per hour, since some postings list pay per month or per year.
* Categorize the job types using NLP to find trends with the salaries
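
The salary step could be sketched as follows. This is one possible approach, not the final cleaning code: it pulls every dollar amount out of the string with a regular expression and divides by an assumed number of working hours per pay period. The conversion factors (40 hours per week, 52 weeks per year) are my assumptions, not something the postings state, and `salary_to_hourly` is a hypothetical helper name.

```python
import re

# Assumed conversion factors (not from the scraped data): a full-time
# schedule of 8 hours/day, 40 hours/week, 52 weeks/year.
HOURS_PER = {"hour": 1, "day": 8, "week": 40, "month": 2080 / 12, "year": 2080}

AMOUNT_RE = re.compile(r"\$([\d,]+(?:\.\d+)?)")
UNIT_RE = re.compile(r"\b(?:an|a|per)\s+(hour|day|week|month|year)\b", re.IGNORECASE)


def salary_to_hourly(text):
    """Return (low, high) hourly pay for a salary string, or None if it cannot be parsed."""
    if not text:
        return None
    amounts = [float(a.replace(",", "")) for a in AMOUNT_RE.findall(text)]
    unit = UNIT_RE.search(text)
    if not amounts or unit is None:
        return None
    hours = HOURS_PER[unit.group(1).lower()]
    # A single amount ("$24 an hour") yields the same value for low and high.
    return round(min(amounts) / hours, 2), round(max(amounts) / hours, 2)


for s in ["$72,000 - $110,000 a year", "$25 - $41 an hour", "$24 an hour", "Not Found"]:
    print(s, "->", salary_to_hourly(s))
```

Returning a (low, high) pair keeps the range information, so a midpoint or either bound can be chosen later during analysis.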

Some other thoughts:
* It would be nice to obtain each company's size (number of employees) to see whether it has an impact on pay.
* It would be interesting to import the data into mapping software such as ArcGIS and plot the postings to look for location-based patterns.
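
As groundwork for the mapping idea, the Location strings would first need to be split into parts that mapping software can use, such as city, state, and ZIP code. Here is a rough sketch, assuming the strings follow the patterns visible in the dataframe above ("City, ST", optionally followed by a ZIP code and a parenthesized area, optionally prefixed with "Remote in"). `parse_location` is a hypothetical helper, and other location formats would return `None`.

```python
import re

LOCATION_RE = re.compile(
    r"^(?P<remote>Remote in\s+)?"            # optional "Remote in " prefix
    r"(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})"  # "Los Angeles, CA"
    r"(?:\s+(?P<zip>\d{5}))?"                # optional 5-digit ZIP code
    r"(?:\s+\((?P<area>.+?)\s+area\))?\s*$"  # optional "(Los Feliz area)"
)


def parse_location(text):
    """Split a location string into its parts, or return None if it does not match."""
    m = LOCATION_RE.match(text or "")
    if m is None:
        return None
    return {
        "city": m.group("city").strip(),
        "state": m.group("state"),
        "zip": m.group("zip"),    # None when the posting lists no ZIP code
        "area": m.group("area"),  # None when no neighborhood is given
        "remote": m.group("remote") is not None,
    }


for loc in [
    "Los Angeles, CA 90022 (East Los Angeles area)",
    "El Segundo, CA 90245",
    "Remote in Los Angeles, CA",
    "Not Found",
]:
    print(loc, "->", parse_location(loc))
```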


In [7]:
print(allData)