# Title
***
**Beautiful Soup: Build a Web Scraper With Python** <br>
URL Source: https://realpython.com/beautiful-soup-web-scraper-python/ <br>

# Introduction

## What is Web Scraping
***
Web scraping is the process of gathering information from the Internet, especially from websites.

# Start Scraping
***
URL Target For Scraping: https://realpython.github.io/fake-jobs/ <br>
We will retrieve the **job title, the company that offers the job, the job location, and the posted date**.

## Import Libraries

In [4]:
from bs4 import BeautifulSoup
import requests

## Scrape HTML Content From a Page
***
BeautifulSoup can build the parse tree with any of the following parsers: <br>
1. lxml
2. html.parser
3. html5lib
4. xml
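
The parsers listed above differ mainly in availability, speed, and how they repair malformed markup: `html.parser` ships with Python, `lxml` and `html5lib` are separate installs, and `xml` is lxml's XML mode. The following is a minimal sketch, assuming the deliberately invalid snippet `<a></p>` as illustrative input (it is not taken from the target page). It parses that snippet with whichever parsers are installed so the repaired trees can be compared.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately invalid markup (illustrative input, not from the target page).
broken_html = "<a></p>"

parsed = {}
for parser in ("html.parser", "lxml", "html5lib", "xml"):
    try:
        parsed[parser] = str(BeautifulSoup(broken_html, parser))
    except FeatureNotFound:
        # Raised when the library behind the requested parser is not installed.
        parsed[parser] = None

for parser, markup in parsed.items():
    shown = markup if markup is not None else "not installed"
    print(f"{parser:12} -> {shown}")
```

Each installed parser may print a different tree for the same input, which is why it helps to name the parser explicitly rather than rely on the default.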

In [7]:
base_url = 'https://realpython.github.io/fake-jobs/'
get_url = requests.get(base_url, timeout=5)
page_content = BeautifulSoup(get_url.content, 'lxml')
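
The cell above parses whatever the server returns, including an error page. A hedged sketch of a small helper that checks the HTTP status first is shown below; `fetch_soup` and its `session` parameter are hypothetical names introduced for illustration, and `html.parser` is used here only so the sketch does not depend on lxml being installed.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, session=None, timeout=5):
    """Download a page and return it as a BeautifulSoup tree.

    `session` is optional; pass a requests.Session to reuse connections.
    """
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing them.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")


# Example call (performs a real network request):
# page_content = fetch_soup("https://realpython.github.io/fake-jobs/")
```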

### Find element by ID

In [9]:
results = page_content.find(id='ResultsContainer')
print(results.prettify())

<div class="columns is-multiline" id="ResultsContainer">
 <div class="column is-half">
  <div class="card">
   <div class="card-content">
    <div class="media">
     <div class="media-left">
      <figure class="image is-48x48">
       <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
      </figure>
     </div>
     <div class="media-content">
      <h2 class="title is-5">
       Senior Python Developer
      </h2>
      <h3 class="subtitle is-6 company">
       Payne, Roberts and Davis
      </h3>
     </div>
    </div>
    <div class="content">
     <p class="location">
      Stewartbury, AA
     </p>
     <p class="is-small has-text-grey">
      <time datetime="2021-04-08">
       2021-04-08
      </time>
     </p>
    </div>
    <footer class="card-footer">
     <a class="card-footer-item" href="https://www.realpython.com" target="_blank">
      Learn
     </a>
     <a class="card-footer-item" href=
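
`find()` returns `None` when nothing matches, so a misspelled ID only shows up later as an `AttributeError`. The sketch below, which assumes a small inline stand-in for the page modelled on the output above, shows that behaviour and the equivalent CSS-selector lookup with `select_one()`.

```python
from bs4 import BeautifulSoup

# Small stand-in for the page, modelled on the output above.
html = """
<div class="columns is-multiline" id="ResultsContainer">
  <div class="card-content">
    <h2 class="title is-5">Senior Python Developer</h2>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

results = soup.find(id="ResultsContainer")
# "NoSuchContainer" is a made-up ID; find() returns None when nothing matches.
missing = soup.find(id="NoSuchContainer")

# select_one() takes a CSS selector and locates the same element.
same_results = soup.select_one("#ResultsContainer")

if missing is None:
    print("No element with id='NoSuchContainer'")
print(results["id"], same_results["id"])
```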

### Find element by Class Name

In [13]:
job_elements = results.find_all('div', class_='card-content')

for job_element in job_elements:
    job_title = job_element.find('h2', class_='title')
    company = job_element.find('h3', class_='company')
    job_location = job_element.find('p', class_='location')
    posted_date = job_element.find('time')
    
    print(job_title)
    print(company)
    print(job_location)
    print(posted_date)
    print('+'*30)

<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
<p class="location">
        Stewartbury, AA
      </p>
<time datetime="2021-04-08">2021-04-08</time>
++++++++++++++++++++++++++++++
<h2 class="title is-5">Energy engineer</h2>
<h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
<p class="location">
        Christopherville, AA
      </p>
<time datetime="2021-04-08">2021-04-08</time>
++++++++++++++++++++++++++++++
<h2 class="title is-5">Legal executive</h2>
<h3 class="subtitle is-6 company">Jackson, Chambers and Levy</h3>
<p class="location">
        Port Ericaburgh, AA
      </p>
<time datetime="2021-04-08">2021-04-08</time>
++++++++++++++++++++++++++++++
<h2 class="title is-5">Fitness centre manager</h2>
<h3 class="subtitle is-6 company">Savage-Bradley</h3>
<p class="location">
        East Seanview, AP
      </p>
<time datetime="2021-04-08">2021-04-08</time>
++++++++++++++++++++++++++++++
<h2 class="title is-5">Pro
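
The `class_` filter used in the loop above matches an element if any one of its CSS classes equals the given value, which is why `class_='title'` finds `<h2 class="title is-5">`. A minimal sketch on an inline snippet modelled on the output above:

```python
from bs4 import BeautifulSoup

html = """
<h2 class="title is-5">Senior Python Developer</h2>
<h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
"""
soup = BeautifulSoup(html, "html.parser")

# Either single class name selects the same <h2>.
by_one_class = soup.find("h2", class_="title")
by_other_class = soup.find("h2", class_="is-5")

# The exact attribute string matches too, but a reordered string does not.
exact_string = soup.find("h2", class_="title is-5")
wrong_order = soup.find("h2", class_="is-5 title")

print(by_one_class["class"])   # the class attribute is stored as a list
print(wrong_order)
```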

### Retrieve Text From HTML Element

In [20]:
job_elements = results.find_all('div', class_='card-content')

for job_element in job_elements:
    job_title = job_element.find('h2', class_='title').text.strip()
    company = job_element.find('h3', class_='company').text.strip()
    job_location = job_element.find('p', class_='location').text.strip()
    posted_date = job_element.find('time').text.strip()
    
    print(job_title)
    print(company)
    print(job_location)
    print(posted_date)
    print('+'*30)

Senior Python Developer
Payne, Roberts and Davis
Stewartbury, AA
2021-04-08
++++++++++++++++++++++++++++++
Energy engineer
Vasquez-Davidson
Christopherville, AA
2021-04-08
++++++++++++++++++++++++++++++
Legal executive
Jackson, Chambers and Levy
Port Ericaburgh, AA
2021-04-08
++++++++++++++++++++++++++++++
Fitness centre manager
Savage-Bradley
East Seanview, AP
2021-04-08
++++++++++++++++++++++++++++++
Product manager
Ramirez Inc
North Jamieview, AP
2021-04-08
++++++++++++++++++++++++++++++
Medical technical officer
Rogers-Yates
Davidville, AP
2021-04-08
++++++++++++++++++++++++++++++
Physiological scientist
Kramer-Klein
South Christopher, AE
2021-04-08
++++++++++++++++++++++++++++++
Textile designer
Meyers-Johnson
Port Jonathan, AE
2021-04-08
++++++++++++++++++++++++++++++
Television floor manager
Hughes-Williams
Osbornetown, AE
2021-04-08
++++++++++++++++++++++++++++++
Waste management officer
Jones, Williams and Villa
Scotttown, AP
2021-04-08
++++++++++++++++++++++++++++++
Software 
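
Chaining `.text.strip()` directly onto `find()` fails with an `AttributeError` if a card lacks one of the elements. The following sketch collects each job into a dictionary and tolerates missing fields; `text_or_none` and `parse_job` are hypothetical helper names, and the second card deliberately omits its location to show the fallback (the real page includes it).

```python
from bs4 import BeautifulSoup

# Inline stand-in for the page; the second card has no location on purpose.
html = """
<div id="ResultsContainer">
  <div class="card-content">
    <h2 class="title is-5">Senior Python Developer</h2>
    <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
    <p class="location">
        Stewartbury, AA
    </p>
    <time datetime="2021-04-08">2021-04-08</time>
  </div>
  <div class="card-content">
    <h2 class="title is-5">Energy engineer</h2>
    <h3 class="subtitle is-6 company">Vasquez-Davidson</h3>
    <time datetime="2021-04-08">2021-04-08</time>
  </div>
</div>
"""


def text_or_none(parent, name, **filters):
    """Return the stripped text of the first match, or None if absent."""
    element = parent.find(name, **filters)
    return element.get_text(strip=True) if element is not None else None


def parse_job(card):
    """Turn one card-content <div> into a plain dictionary."""
    return {
        "title": text_or_none(card, "h2", class_="title"),
        "company": text_or_none(card, "h3", class_="company"),
        "location": text_or_none(card, "p", class_="location"),
        "posted": text_or_none(card, "time"),
    }


soup = BeautifulSoup(html, "html.parser")
jobs = [parse_job(card) for card in soup.find_all("div", class_="card-content")]
for job in jobs:
    print(job)
```

Keeping the records as dictionaries makes it straightforward to filter them or write them to a file afterwards.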

### Find element by Class Name and Text Content
***
Instead of printing out all the jobs listed on the website, you’ll first filter them using keywords.

In [21]:
python_job = results.find_all('h2', string='Python')

In [22]:
python_job

[]

In [23]:
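# string='Python' only matches text that is exactly "Python", so the result above is empty.
# Pass a function instead to check whether the substring "python" appears in the element's text.
python_job = results.find_all(
    'h2', string=lambda job: 'python' in job.lower()
)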

In [24]:
python_job

[<h2 class="title is-5">Senior Python Developer</h2>,
 <h2 class="title is-5">Software Engineer (Python)</h2>,
 <h2 class="title is-5">Python Programmer (Entry-Level)</h2>,
 <h2 class="title is-5">Python Programmer (Entry-Level)</h2>,
 <h2 class="title is-5">Software Developer (Python)</h2>,
 <h2 class="title is-5">Python Developer</h2>,
 <h2 class="title is-5">Back-End Web Developer (Python, Django)</h2>,
 <h2 class="title is-5">Back-End Web Developer (Python, Django)</h2>,
 <h2 class="title is-5">Python Programmer (Entry-Level)</h2>,
 <h2 class="title is-5">Software Developer (Python)</h2>]
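
The function passed to `string=` is evaluated against each tag's `.string`, which is `None` for a tag with no text or with several children. Depending on the Beautiful Soup version, the function may be called with that `None`, so a guard keeps the filter from raising. A minimal sketch on inline HTML, where the empty `<h2>` and the helper name `mentions_python` are assumptions for illustration:

```python
from bs4 import BeautifulSoup

html = """
<h2 class="title is-5">Senior Python Developer</h2>
<h2 class="title is-5">Energy engineer</h2>
<h2 class="title is-5"></h2>
"""
soup = BeautifulSoup(html, "html.parser")


def mentions_python(text):
    # text is the tag's .string; treat a missing string as "no match".
    return text is not None and "python" in text.lower()


exact = soup.find_all("h2", string="Python")            # exact-match filter
partial = soup.find_all("h2", string=mentions_python)   # substring filter

print(exact)
print([h2.get_text() for h2 in partial])
```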

### Access Parent Elements from the Filtered Child Elements
***
One way to get access to all the information you need is to step up in the hierarchy of the DOM, starting from the \<h2> elements that you identified. <br>
The \<div> element with the card-content class contains all the information you want. It’s a third-level parent of the \<h2> title element that you found using your filter. <br>
With this in mind, you can take the elements in **python_job** and fetch their great-grandparent elements to get access to all the information you want:

In [29]:
python_job = results.find_all(
    'h2', string=lambda job: 'python' in job.lower()
)
python_job_elements = [
    h2_element.parent.parent.parent for h2_element in python_job
]
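
Counting `.parent` hops ties the code to the exact nesting depth of the page. As a hedged alternative, `find_parent()` searches upward for the nearest ancestor that matches a filter. The sketch below uses an inline card modelled on the page structure shown earlier and checks that both approaches reach the same element.

```python
from bs4 import BeautifulSoup

# One card, modelled on the structure printed earlier.
html = """
<div class="card-content">
  <div class="media">
    <div class="media-content">
      <h2 class="title is-5">Senior Python Developer</h2>
      <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
    </div>
  </div>
  <div class="content">
    <p class="location">Stewartbury, AA</p>
  </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
h2_element = soup.find("h2", class_="title")

# Three explicit steps up: media-content -> media -> card-content.
by_steps = h2_element.parent.parent.parent

# Search upward for the nearest <div> carrying the card-content class.
by_search = h2_element.find_parent("div", class_="card-content")

print(by_steps is by_search)
print(by_search.find("p", class_="location").get_text(strip=True))
```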

### Extract Attributes From HTML Elements
***
Instead of extracting text from an HTML element, we will extract an *attribute value* from it. For example, we will retrieve the URL from an \<a> tag.

In [39]:
python_job_elements[0].find_all('a')

[<a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>]

The first element in **python_job_elements** contains two \<a> tags. We will retrieve the *href* value from the second one, the Apply link:

In [40]:
python_job_elements[0].find_all('a')[1]['href']

'https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html'
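
Indexing with `[1]` relies on the Apply link always being the second \<a> tag. A minimal sketch of selecting the link by its text instead, and of reading an attribute that may be absent, is shown below; the third `<a>` without an `href` is a hypothetical element added only to illustrate `.get()`.

```python
from bs4 import BeautifulSoup

html = """
<footer class="card-footer">
  <a class="card-footer-item" href="https://www.realpython.com" target="_blank">Learn</a>
  <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>
  <a class="card-footer-item">No link</a>
</footer>
"""
soup = BeautifulSoup(html, "html.parser")

# Pick the link by its visible text rather than by position.
apply_link = soup.find("a", string="Apply")
apply_url = apply_link["href"]

# tag["href"] raises KeyError when the attribute is missing; .get() returns None.
no_link = soup.find("a", string="No link")
missing_url = no_link.get("href")

print(apply_url)
print(missing_url)
```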

In [46]:
# loop through all python_job_elements to retrieve each href value
for job_element in python_job_elements:
    link_url = job_element.find_all('a')[1]['href']
    print(f"Apply here: {link_url}\n")

Apply here: https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html

...

