
- [Beautiful Soup: Build a Web Scraper With Python (realpython)](https://realpython.com/beautiful-soup-web-scraper-python/)
    - [Fake Jobs](https://realpython.github.io/fake-jobs/)

- [Python Jobs](https://pythonjobs.github.io/)
- [Remote Developers](https://remote.co/remote-jobs/developer/)
- [Indeed: Sr Data Engineer in SF Bay Area](https://www.indeed.com/jobs?q=Senior+data+engineer&l=San+Francisco+Bay+Area%2C+CA&from=searchOnHP&vjk=21174414366273e2)

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
# URL = "https://www.cs.cornell.edu/people/faculty"
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)

Pass page.content instead of page.text to avoid problems with character encoding. The .content attribute holds the raw bytes of the response, which Beautiful Soup can decode more reliably than the already-decoded string in the .text attribute.
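
To see why the raw bytes matter, here is a minimal offline sketch. The sample markup and the deliberately wrong latin-1 decode are assumptions made for illustration (they stand in for a response whose text encoding was guessed incorrectly) and do not come from the Fake Jobs page.

```python
from bs4 import BeautifulSoup

# Hypothetical page body, encoded as UTF-8 bytes (what page.content would hold).
raw = '<html><head><meta charset="utf-8"></head><body><p>Café</p></body></html>'.encode("utf-8")

# Simulate a wrongly guessed encoding, as can happen when text is decoded too early.
wrong_text = raw.decode("latin-1")

# Given bytes, Beautiful Soup works out the encoding itself.
soup_from_bytes = BeautifulSoup(raw, "html.parser")
# Given a string, it has to trust whatever decoding already happened.
soup_from_text = BeautifulSoup(wrong_text, "html.parser")

good = soup_from_bytes.p.text
bad = soup_from_text.p.text
print(good)
print(bad)
```

When the server declares its encoding correctly, both attributes usually give the same result; the bytes are simply the safer input.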

In [3]:
soup = BeautifulSoup(page.content, "html.parser")

In [4]:
soup

<!DOCTYPE html>

<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>Fake Python</title>
<link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
</head>
<body>
<section class="section">
<div class="container mb-5">
<h1 class="title is-1">
        Fake Python
      </h1>
<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
</div>
<div class="container">
<div class="columns is-multiline" id="ResultsContainer">
<div class="column is-half">
<div class="card">
<div class="card-content">
<div class="media">
<div class="media-left">
<figure class="image is-48x48">
<img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
</figure>
</div>
<div class="media-content">
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
</div>
</div>

## Find Elements by ID

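Locate the container that holds the content you want, by its id or its class. On this page the job cards sit inside the element with id="ResultsContainer".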

In [5]:
results = soup.find(id="ResultsContainer")
print(results.prettify())

<div class="columns is-multiline" id="ResultsContainer">
 <div class="column is-half">
  <div class="card">
   <div class="card-content">
    <div class="media">
     <div class="media-left">
      <figure class="image is-48x48">
       <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
      </figure>
     </div>
     <div class="media-content">
      <h2 class="title is-5">
       Senior Python Developer
      </h2>
      <h3 class="subtitle is-6 company">
       Payne, Roberts and Davis
      </h3>
     </div>
    </div>
    <div class="content">
     <p class="location">
      Stewartbury, AA
     </p>
     <p class="is-small has-text-grey">
      <time datetime="2021-04-08">
       2021-04-08
      </time>
     </p>
    </div>
    <footer class="card-footer">
     <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
      Learn
     </a>
     <a class="card-footer-item" href=
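
One detail worth guarding against: find() returns None when nothing matches, so a mistyped id only fails later, at the first attribute access. The sketch below uses a small hand-written HTML string, and the id "NoSuchContainer" is a made-up example.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page.
html = """
<div class="container">
  <div class="columns is-multiline" id="ResultsContainer">
    <div class="card-content">job card</div>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

results = soup.find(id="ResultsContainer")
missing = soup.find(id="NoSuchContainer")  # hypothetical id that does not exist

# Check for None before calling methods such as prettify() or find_all().
if missing is None:
    print("No element with that id; check the page source.")
if results is not None:
    print(results["id"])
```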

## Find Elements by HTML Class Name

find_all() returns an iterable of every matching element. Each item in job_elements is a Beautiful Soup Tag object (not a new BeautifulSoup() object), so you can call the same methods on it as you did on its parent element, results.

In [6]:
job_elements = results.find_all("div", class_="card-content")
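
The class_ argument matches an element if any one of its classes equals the given name, which is why a single name is enough for elements that carry several classes. A minimal sketch on a hand-written snippet, with the equivalent CSS selector shown for comparison:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hand-written snippet; the second <h3> is an invented non-matching element.
html = """
<div>
  <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
  <h3 class="subtitle is-6">Not a company line</h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches the first <h3> only: "company" is one of its three classes.
by_class = soup.find_all("h3", class_="company")

# The same query written as a CSS selector.
by_selector = soup.select("h3.company")

for element in by_class:
    print(type(element).__name__, element["class"])
```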

### Extract Text From HTML Elements

You can add .text to a Beautiful Soup object to return only the text content of the HTML elements that the object contains:

In [7]:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    # Skip jobs whose title does not mention Python.
    if "python" not in title_element.text.lower():
        continue
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()

Senior Python Developer
Payne, Roberts and Davis
Stewartbury, AA

Software Engineer (Python)
Garcia PLC
Ericberg, AE

Python Programmer (Entry-Level)
Moss, Duncan and Allen
Port Sara, AE

Python Programmer (Entry-Level)
Cooper and Sons
West Victor, AE

Software Developer (Python)
Adams-Brewer
Brockburgh, AE

Python Developer
Rivera and Sons
East Michaelfort, AA

Back-End Web Developer (Python, Django)
Stewart-Alexander
South Kimberly, AA

Back-End Web Developer (Python, Django)
Jackson, Ali and Mckee
New Elizabethside, AA

Python Programmer (Entry-Level)
Mathews Inc
Robertborough, AP

Software Developer (Python)
Moreno-Rodriguez
Martinezburgh, AE
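
The loop above assumes every card contains all three elements. On less tidy pages find() may return None, and .text would then raise an AttributeError. A small sketch of a defensive helper follows; text_or_default is a hypothetical name introduced here, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hand-written card that has a title but no location.
html = """
<div class="card-content">
  <h2 class="title is-5">
    Senior Python Developer
  </h2>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", class_="card-content")


def text_or_default(parent, name, class_name, default=""):
    """Return the stripped text of the first match, or default if none exists."""
    element = parent.find(name, class_=class_name)
    if element is None:
        return default
    # get_text(strip=True) drops the surrounding whitespace, like .text.strip().
    return element.get_text(strip=True)


title = text_or_default(card, "h2", "title")
location = text_or_default(card, "p", "location", default="(no location)")
print(title)
print(location)
```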



## Find Elements by Class Name and Text Content, Passing a Function as the Filter

In [8]:
python_jobs = results.find_all(
    "h2", class_="title", 
    string=lambda text: "python" in text.lower()
)
print(len(python_jobs))

10
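
The string argument is tested against each tag's .string, which is None when the tag has mixed children (for example, text plus a nested tag). In at least some Beautiful Soup versions the filter function can then be called with None, so a bare text.lower() may raise an AttributeError. The sketch below guards against that; the sample HTML and the name mentions_python are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written titles; the second one mixes a nested tag with plain text.
html = """
<div>
  <h2 class="title">Senior Python Developer</h2>
  <h2 class="title"><b>Python</b> Intern</h2>
  <h2 class="title">Java Developer</h2>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def mentions_python(text):
    """Filter for string=...; tolerates None for tags without a single string."""
    return text is not None and "python" in text.lower()


matches = soup.find_all("h2", class_="title", string=mentions_python)
titles = [h2.text.strip() for h2 in matches]
print(titles)
```

Note that the mixed-content title is not matched at all, because it has no single string to test. To include such titles, loop over the h2 elements and filter on get_text() instead.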


### Access Parent Elements

In [9]:
python_jobs = results.find_all(
    "h2", string=lambda text: "python" in text.lower()
)

# Climb three levels from each <h2> (media-content, media, card-content)
# to reach the card that contains the whole job posting.
python_job_elements = [
    h2_element.parent.parent.parent for h2_element in python_jobs
]

for job_element in python_job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    # No title filter is needed: python_job_elements holds only Python jobs.
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()

Senior Python Developer
Payne, Roberts and Davis
Stewartbury, AA

Software Engineer (Python)
Garcia PLC
Ericberg, AE

Python Programmer (Entry-Level)
Moss, Duncan and Allen
Port Sara, AE

Python Programmer (Entry-Level)
Cooper and Sons
West Victor, AE

Software Developer (Python)
Adams-Brewer
Brockburgh, AE

Python Developer
Rivera and Sons
East Michaelfort, AA

Back-End Web Developer (Python, Django)
Stewart-Alexander
South Kimberly, AA

Back-End Web Developer (Python, Django)
Jackson, Ali and Mckee
New Elizabethside, AA

Python Programmer (Entry-Level)
Mathews Inc
Robertborough, AP

Software Developer (Python)
Moreno-Rodriguez
Martinezburgh, AE
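
Chaining .parent three times depends on the exact nesting depth and breaks if the page adds or removes a wrapper. As an alternative sketch, find_parent() looks up the nearest ancestor by name and class instead; the HTML below is a trimmed, hand-written copy of one card.

```python
from bs4 import BeautifulSoup

# Trimmed copy of a single job card.
html = """
<div class="card-content">
  <div class="media">
    <div class="media-content">
      <h2 class="title is-5">Senior Python Developer</h2>
    </div>
  </div>
  <div class="content"><p class="location">Stewartbury, AA</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2", class_="title")

# Three explicit hops: media-content -> media -> card-content.
card_by_hops = h2.parent.parent.parent

# The same card, found by describing the ancestor rather than counting levels.
card_by_name = h2.find_parent("div", class_="card-content")

location = card_by_name.find("p", class_="location").text.strip()
print(card_by_hops is card_by_name, location)
```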



### Extract Attributes From HTML Elements
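
Attributes of a Tag can be read with dictionary-style indexing. Indexing raises a KeyError when the attribute is missing, whereas .get() returns None. The sketch below is hand-written: the relative href and the link without an href are assumptions added for illustration, and urljoin from the standard library resolves the relative link against the page URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://realpython.github.io/fake-jobs/"

# Hand-written footer; the relative href and the href-less link are invented.
html = """
<footer class="card-footer">
  <a class="card-footer-item" href="https://www.realpython.com">Learn</a>
  <a class="card-footer-item" href="jobs/senior-python-developer-0.html">Apply</a>
  <a class="card-footer-item">Broken</a>
</footer>
"""
soup = BeautifulSoup(html, "html.parser")

links = {}
for link in soup.find_all("a"):
    href = link.get("href")  # None instead of KeyError when href is absent
    if href is None:
        continue
    # Absolute URLs pass through unchanged; relative ones are resolved.
    links[link.text.strip()] = urljoin(BASE_URL, href)

for label, url in links.items():
    print(f"{label}: {url}")
```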

In [10]:
for job_element in python_job_elements:
    # -- snip --
    links = job_element.find_all("a")
    for link in links:
        print(f"{link.text.strip()}: {link['href']}")

Learn: https://www.realpython.com
Apply: https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html
Learn: https://www.realpython.com
Apply: https://realpython.github.io/fake-jobs/jobs/software-engineer-python-10.html
Learn: https://www.realpython.com
Apply: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-20.html
Learn: https://www.realpython.com
Apply: https://realpython.github.io/fake-jobs/jobs/python-programmer-entry-level-30.html
Learn: https://www.realpython.com
Apply: https://realpython.github.io/fake-jobs/jobs/software-developer-python-40.html
Learn: https://www.realpython.com
Apply: https://realpython.github.io/fake-jobs/jobs/python-developer-50.html
Learn: https://www.realpython.com
Apply: https://realpython.github.io/fake-jobs/jobs/back-end-web-developer-python-django-60.html
Learn: https://www.realpython.com
Apply: https://realpython.github.io/fake-jobs/jobs/back-end-web-developer-python-django-70.html
Learn: https://www.realpython.c