# Extract with BeautifulSoup: PBS.org
Now that you can collect the contents of a single HTML file, it is time to extract data from it. For this we are going to use the package [BeautifulSoup4](https://pypi.org/project/beautifulsoup4/). BeautifulSoup is a library that makes it easy to scrape information from web pages. It transforms the textual HTML file into an object (a parse tree) that you can iterate over, search, and modify, which makes it one of the easiest ways to extract data from HTML files.

In this Notebook you will learn to:
1. collect data
2. extract data based on HTML element
3. extract data based on class attribute
4. extract data based on other HTML attribute

### 1. Collect data

In [None]:
import requests

# retrieve an individual article
url = 'https://www.pbs.org/newshour/economy/google-ceo-calls-for-regulation-of-artificial-intelligence'

page = requests.get(url)

### 2. Read file with BeautifulSoup
BeautifulSoup parses the raw HTML in `page.content`, as returned by `requests`, into a BS4 object.

In [None]:
from bs4 import BeautifulSoup
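# 'html.parser' pins Python's built-in parser; 'html' would pick whichever HTML parser is installed
soup = BeautifulSoup(page.content, 'html.parser')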


### 3. Extract the title
Now that we have a `soup` object we can run queries to extract the data. There is extensive [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) about BS4, so look at it when in doubt. This Notebook will cover the basics. This part also requires some familiarity with HTML and CSS. We will discuss parts in class, but a good primer can be found [here](https://www.w3schools.com/). For now we are going to extract content from the `<title></title>` element.

Finding HTML elements such as `<h1></h1>` or `<title></title>` can easily be done with `soup.find()`. The response in this case will be `<title>Google CEO calls for regulation of artificial intelligence | PBS NewsHour</title>`. Use the `.get_text()` method to extract the text from the `<title>` element.

In [None]:
# find the <title></title> in the document. 
title = soup.find('title').get_text()
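
One thing to keep in mind: `soup.find()` returns `None` when nothing matches, so chaining `.get_text()` directly onto it raises an `AttributeError` on pages that lack the element. The sketch below shows one way to guard against that. It uses a small made-up HTML string (`sample_soup` and the page content are assumptions for illustration, not the PBS page).

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page: it has a <title> but no <h1>.
sample_soup = BeautifulSoup(
    '<html><head><title>Example page</title></head><body></body></html>',
    'html.parser',
)

title_tag = sample_soup.find('title')  # a Tag object
h1_tag = sample_soup.find('h1')        # None, because there is no <h1>

# Only call .get_text() when the element was actually found.
title_text = title_tag.get_text() if title_tag is not None else None
h1_text = h1_tag.get_text() if h1_tag is not None else None

print(title_text)
print(h1_text)
```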

### 4. Cleaning up the response
Cleaning is a large part of Web data extraction. In this case we want to get rid of the ` | PBS NewsHour` suffix and will use the versatile String method `.replace()`.

In [None]:
title = title.replace(' | PBS NewsHour', '')
title
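
Note that `.replace()` removes the text wherever it occurs in the string, not only at the end. If you want to be stricter, a possible alternative is to strip the suffix only when the title really ends with it. The helper name `strip_site_suffix` below is made up for this sketch.

```python
def strip_site_suffix(title, suffix=' | PBS NewsHour'):
    """Remove the site suffix only when it sits at the very end of the title."""
    if title.endswith(suffix):
        return title[:-len(suffix)]
    return title


raw_title = 'Google CEO calls for regulation of artificial intelligence | PBS NewsHour'
clean_title = strip_site_suffix(raw_title)
print(clean_title)
```

On Python 3.9 or newer, the built-in `str.removesuffix()` does the same job.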

### 5. Getting more data
Unfortunately, not all data is as easy to extract as the title. Consider elements such as `<a>`, `<span>`, or `<div>` that can occur many times. Let's extract the author of this article. Go to your browser and determine where the content is on the page. Right-click on the link and inspect the element.

If everything works as expected you will see that the element is `<a class="post__byline-name-hyphenated" href="https://www.pbs.org/newshour/author/kelvin-chan-associated-press">Kelvin Chan, Associated Press</a>`. How would you extract this? Searching for an `<a>` element will return a value, but will it be the right value? `soup.find('a')` returns only the first `<a>` it finds in the document, which is unlikely to be the byline.

A way to avoid this is by searching for the attributes of an HTML element, in this case the CSS class `post__byline-name-hyphenated`:

In [None]:
author = soup.find(class_='post__byline-name-hyphenated')
author
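
To see the difference between searching by tag and searching by class, here is a small sketch on a hypothetical HTML snippet with two links (the snippet and the names `sample_html` and `sample_soup` are assumptions for illustration; only the byline markup is taken from the text above).

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: two links, only the second one is the byline.
sample_html = """
<div>
  <a href="/newshour/economy">Economy</a>
  <a class="post__byline-name-hyphenated" href="https://www.pbs.org/newshour/author/kelvin-chan-associated-press">Kelvin Chan, Associated Press</a>
</div>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

first_link = sample_soup.find('a')  # first <a> in document order
byline = sample_soup.find('a', class_='post__byline-name-hyphenated')
all_links = sample_soup.find_all('a')  # every <a>, as a list

print(first_link.get_text())
print(byline.get_text())
print(len(all_links))
```

`find_all()` is the companion of `find()`: it returns all matches instead of only the first, which helps when you want to check how many candidates there are.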

##### 5a. Extract and clean the author

In [None]:
author = soup.find(class_='post__byline-name-hyphenated')
author = author.get_text().replace(', Associated Press', '').strip()
author

##### 5b. Extracting the date and time
A quick inspection shows that there is a `<time>` element from which you can extract the date and time: `<time class="post__date" itemprop="datePublished" content="2020-01-20T09:13:07-05:00">Jan 20, 2020 9:13 AM EST</time>`. You have two options:
1. extract the text data `Jan 20, 2020 9:13 AM EST` and transform it into a datetime format
2. extract the info from the `content="2020-01-20T09:13:07-05:00"` attribute.

Let's explore the second option. In this case there is only one `<time>` element, so we can use `soup.find()`. Extract the attribute by adding `['content']` after the `.find()`:

In [None]:
datetime = soup.find('time')['content']
datetime
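
The value in the `content` attribute is an ISO 8601 string, so it can be turned into a real date object with the standard library. The sketch below assumes the string shown above. The module is imported as `dt` so that it does not clash with the `datetime` variable used in this Notebook.

```python
import datetime as dt  # aliased to avoid clashing with the `datetime` variable

published_raw = '2020-01-20T09:13:07-05:00'

# Parse the ISO 8601 string, including its UTC offset.
published = dt.datetime.fromisoformat(published_raw)
print(published.year, published.month, published.day)
print(published.utcoffset())

# Convert to UTC, which is convenient when comparing articles across time zones.
published_utc = published.astimezone(dt.timezone.utc)
print(published_utc.isoformat())
```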

##### 5c. Extracting the post category
With the knowledge you have now, you can extract the post category. Do so below:

In [None]:
# code goes here

### 6. Meta data
If you are in luck, and in this case you are, the website has meta data describing the post. This makes life a lot easier. Go to the webpage and look at the HTML code by inspecting the page. Go to the `<head>` element at the top of the code. Here you will see all the different meta elements with properties such as type, title, description, image, section, etc.

In this case we cannot use a CSS class as we did earlier. Luckily, we can search for specific attributes!

In [None]:
title = soup.find('meta', attrs={'property': 'og:title'})['content']
title

##### 6a. Extract the meta data and save in a dictionary
Make sure you have the following dictionary:
`{'title': 'Google CEO calls for regulation of artificial intelligence', 'description': '“There is no question in my mind that artificial intelligence needs to be regulated. The question is how best to approach this,” Sundar Pichai said.','datetime': '2020-01-20T09:13:07-05:00', 'section': 'Economy', 'tags': 'artificial intelligence, google, technology'}`

In [None]:
# each <meta> element is located by its `property` attribute; the value is in `content`
title = soup.find('meta', attrs={'property': 'og:title'})['content']
description = soup.find('meta', attrs={'property': 'og:description'})['content']
datetime = soup.find('meta', attrs={'property': 'article:published_time'})['content']
section = soup.find('meta', attrs={'property': 'article:section'})['content']
tags = soup.find('meta', attrs={'property': 'article:tag'})['content']

data = {
  'title': title,
  'description': description,
  'datetime': datetime,
  'section': section,
  'tags': tags
}
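
The five lookups above follow the same pattern, so they can also be written as a loop. The sketch below is one possible way to do that, not the only one. It runs on a hypothetical `<head>` fragment (`sample_head`) in which the description is deliberately missing, and the helper name `meta_content` is made up for this example. Because `find()` returns `None` when a meta element is absent, the helper returns `None` instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical <head> fragment; og:description is left out on purpose.
sample_head = """
<html><head>
<meta property="og:title" content="Google CEO calls for regulation of artificial intelligence">
<meta property="article:section" content="Economy">
</head><body></body></html>
"""
sample_soup = BeautifulSoup(sample_head, 'html.parser')


def meta_content(soup, prop):
    """Return the content of <meta property=prop>, or None if it is missing."""
    tag = soup.find('meta', attrs={'property': prop})
    if tag is not None and tag.has_attr('content'):
        return tag['content']
    return None


# Map the dictionary keys we want to the meta properties that hold them.
fields = {
    'title': 'og:title',
    'description': 'og:description',
    'section': 'article:section',
}
sample_data = {key: meta_content(sample_soup, prop) for key, prop in fields.items()}
print(sample_data)
```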