# Adapted from CS109a Introduction to Data Science

Slides are available [here](https://docs.google.com/presentation/d/1EaZWSQBCAUyVgXZbvOFBiVWpSL25yx8O90j96p2DlJA/edit?usp=sharing).

## Seminar 6, Exercise 1: Intro to BS4

## Description

**OVERVIEW**

As we learned in class, the three most common sources of data used for Data Science are:

- files (e.g., `.csv`, `.txt`) that already contain the dataset
- APIs (e.g., Twitter or Facebook)
- web scraping (e.g., Requests)

Here, we get practice with web scraping by using **Requests**. Once we fetch the page contents, we will need to extract the information that we actually care about. We rely on <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank">BeautifulSoup</a> to help with this.

In [1]:
import re
import requests
from bs4 import BeautifulSoup

For this exercise, we will be grabbing data (the Top News stories) from [AP News](https://apnews.com), a not-for-profit news agency.

In [2]:
# the URL of the webpage that has the desired info
url = "https://apnews.com/hub/ap-top-news"

Let's use [`requests`](https://requests.readthedocs.io/en/master/user/quickstart/) to fetch the contents. Specifically, the [`requests`](https://requests.readthedocs.io/en/master/user/quickstart/) library has a `.get()` function that returns a [Response object](https://www.w3schools.com/python/ref_requests_response.asp). A Response object contains the server's _response_ to the HTTP request, and thus contains all the information that we could want from the page.

Below, fill in the blank to fetch AP News' Top News page.

In [3]:
home_page = requests.get(url)
home_page.status_code

200

You should have received a status code of 200, which means the page was successfully found on the server and sent to the receiver (aka client/user/you). [Again, you can click here](https://www.restapitutorial.com/httpstatuscodes.html) for a full list of status codes. Recall that sometimes, while browsing the Internet, webpages will report a 404 error, possibly with an entertaining graphic to ease your pain. That 404 is a status code, just like the one we are inspecting here!
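
To make the meaning of these numbers concrete, here is a minimal sketch that looks up the standard phrase for a status code using Python's built-in `http.HTTPStatus`. The helper name `describe_status` is hypothetical and only for illustration; it does not touch the network.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return the code with its standard phrase (hypothetical helper)."""
    status = HTTPStatus(code)
    return f"{code} {status.phrase}"


ok_message = describe_status(200)       # the code we received above
missing_message = describe_status(404)  # the familiar "page not found" code

# Codes in the 200-299 range signal success.
is_success = 200 <= 200 < 300

print(ok_message)
print(missing_message)
```

With a real Response object, `home_page.ok` and `home_page.raise_for_status()` offer similar checks without comparing numbers by hand.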

`home_page` is now a [Response object](https://www.w3schools.com/python/ref_requests_response.asp). It has many attributes, including `.text`. Run the cell below and note that the output is identical to what we would see if we visited the webpage in our browser and clicked 'View Page Source'.

In [7]:
home_page.text

'<!DOCTYPE html><html lang="en"><head><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1"><script>const MOBILE_REGEX =\n\'/Mobile|iP(hone|od|ad)|Android|BlackBerry|IEMobile|Kindle|NetFront|Silk-Accelerated|(hpw|web)OS|Fennec|Minimo|Opera M(obi|ini)|Blazer|Dolfin|Dolphin|Skyfire|Zune/\';\nconst ua = window.navigator.userAgent;\nlet regex = new RegExp(MOBILE_REGEX);\nwindow.isMobile = regex.test(ua);\nconsole.log(\'isMobile: \' + window.isMobile);\nwindow.PWT = window.PWT || {};\n(window.PWT).jsLoaded = function () {\n    \n}\nvar pwtow = document.createElement(\'script\');\nvar useSSL = \'https:\' == document.location.protocol;\npwtow.src = (useSSL ? \'https:\' : \'http:\') + \'//ads.pubmatic.com/AdServer/js/pwt/160964/\' + (window.isMobile ? \'4959\' : \'4958\') + \'/pwt.js\';\npwtow.async = true;\nvar node = document.getElementsByTagName(\'script\')[0];\nnode.parentNode.insertBefore(pwtow, node);\n\nvar gads = document.createElement(\'script\');\n

The above `.text` property is atrocious to view and make sense of. Sure, we could write Regular Expressions to extract all of the contents that we're interested in. Instead, let's first use [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse the content into more manageable chunks.

Below, fill in the blank to construct an HTML-parsed [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) object from our website.

In [14]:
soup = BeautifulSoup(home_page.content, 'html.parser')
soup

<!DOCTYPE html>
<html lang="en"><head><meta charset="utf-8"/><meta content="width=device-width, initial-scale=1" name="viewport"/><script>const MOBILE_REGEX =
'/Mobile|iP(hone|od|ad)|Android|BlackBerry|IEMobile|Kindle|NetFront|Silk-Accelerated|(hpw|web)OS|Fennec|Minimo|Opera M(obi|ini)|Blazer|Dolfin|Dolphin|Skyfire|Zune/';
const ua = window.navigator.userAgent;
let regex = new RegExp(MOBILE_REGEX);
window.isMobile = regex.test(ua);
console.log('isMobile: ' + window.isMobile);
window.PWT = window.PWT || {};
(window.PWT).jsLoaded = function () {
    
}
var pwtow = document.createElement('script');
var useSSL = 'https:' == document.location.protocol;
pwtow.src = (useSSL ? 'https:' : 'http:') + '//ads.pubmatic.com/AdServer/js/pwt/160964/' + (window.isMobile ? '4959' : '4958') + '/pwt.js';
pwtow.async = true;
var node = document.getElementsByTagName('script')[0];
node.parentNode.insertBefore(pwtow, node);

var gads = document.createElement('script');
var useSSL = 'https:' == document.locati

You'll notice that the `soup` object is better formatted than just looking at the entire text. It's still dense, but it helps.

Below, fill in the blank to set `webpage_title` equal to the text of the webpage's title (no HTML tags included).

In [16]:
webpage_title = soup.title.text
print(webpage_title)

Top News: US & International Top News Stories Today | AP News


Again, our BeautifulSoup object allows for quick, convenient searching and access to the web page contents.
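
As a small illustration of the difference between a tag and its text, the sketch below parses a hand-written snippet (an assumption: it contains only the `<title>` element seen on the page, wrapped in a bare `<html>` skeleton) rather than the live page.

```python
from bs4 import BeautifulSoup

# Hand-written snippet containing only the page's <title> element.
title_html = (
    '<html><head><title data-rh="true">Top News: US &amp; International '
    'Top News Stories Today | AP News</title></head><body></body></html>'
)
snippet = BeautifulSoup(title_html, "html.parser")

title_tag = snippet.title            # a Tag object, markup included
title_text = title_tag.get_text()    # only the text, entities decoded

print(repr(title_tag))
print(title_text)
```

Here `title_tag` still carries the `<title>` markup and its `data-rh` attribute, while `title_text` is a plain string in which `&amp;` has been decoded to `&`.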


Anytime you wish to extract specific contents from a webpage, it is necessary to:
- **Step 1.** While viewing the page in your browser, identify which contents of the page you're interested in.
- **Step 2.** Look at the HTML returned from the BeautifulSoup object, and pinpoint the specific context that surrounds each item you're interested in.
- **Step 3.** Devise a pattern using BeautifulSoup and/or regular expressions to extract those contents.

For example:
### **Step 1:**
Let's say, for every news article found on the AP's Top News page, you want to extract the link and associated title. In this screenshot
<img src="https://github.com/Harvard-IACS/2020-CS109A/blob/master/content/lectures/lecture03/images/apnews_sample.png?raw=true">

we can see one news article (there are many more below on the page). Its title is `"California fires bring more chopper rescues, power shutoffs"` and its link is to `/c0aa17fff978e9c4768ee32679b8555c`. Since the current page is served from apnews.com, the article link's full address is [apnews.com/c0aa17fff978e9c4768ee32679b8555c](https://apnews.com/c0aa17fff978e9c4768ee32679b8555c).
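
Resolving a relative link against the page it came from can be sketched with the standard library's `urljoin`. The variable names here are illustrative; the page URL and the relative link are the ones quoted in this exercise.

```python
from urllib.parse import urljoin

# The page we fetched, and the relative link found in its HTML.
page_url = "https://apnews.com/hub/ap-top-news"
relative_link = "/c0aa17fff978e9c4768ee32679b8555c"

# A leading "/" means "relative to the site root", so the /hub/... path is dropped.
full_link = urljoin(page_url, relative_link)
print(full_link)
```

Using `urljoin` instead of string concatenation keeps the scheme (`https://`) and also handles links that are already absolute.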



### **Step 2:**

After printing the `soup` object, we still saw a huge mess of HTML. So, let's drill down into certain sections. As illustrated in the [official documentation here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names), we can retrieve all `<a>` links by running the cell below:

In [17]:
soup.find_all("a")

[<a class="header-logo" href="/"><svg class="SvgSprite"><use xlink:href="/dist/spritemap.svg#sprite-logo"></use></svg></a>,
 <a class="topic-link" href="/hub/us-news?utm_source=apnewsnav&amp;utm_medium=navigation">U.S. News</a>,
 <a class="topic-link" href="/hub/world-news?utm_source=apnewsnav&amp;utm_medium=navigation">World News</a>,
 <a class="topic-link" href="/hub/politics?utm_source=apnewsnav&amp;utm_medium=navigation">Politics</a>,
 <a class="topic-link" href="/hub/sports?utm_source=apnewsnav&amp;utm_medium=navigation">Sports</a>,
 <a class="topic-link" href="/hub/entertainment?utm_source=apnewsnav&amp;utm_medium=navigation">Entertainment</a>,
 <a class="topic-link" href="/hub/business?utm_source=apnewsnav&amp;utm_medium=navigation">Business</a>,
 <a class="topic-link" href="/hub/technology?utm_source=apnewsnav&amp;utm_medium=navigation">Technology</a>,
 <a class="topic-link" href="/hub/health?utm_source=apnewsnav&amp;utm_medium=navigation">Health</a>,
 <a class="topic-link" hre

This is still a ton of text (links). So, let's get more specific. I now search for the title text `California fires bring more chopper rescues, power shutoffs` within the output of the previous cell (the HTML of all links). I notice the following:

`<a class="Component-headline-0-2-110" data-key="card-headline" href="/c0aa17fff978e9c4768ee32679b8555c"><h1 class="Component-h1-0-2-111">California fires bring more chopper rescues, power shutoffs</h1></a>`

I also see that this is repeatable; every news article on the Top News page has such text! Great!
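
To see which pieces of that snippet we are after, here is a minimal sketch that parses the single headline link quoted above on its own, with no network access. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# The single headline link quoted above, copied verbatim.
headline_html = (
    '<a class="Component-headline-0-2-110" data-key="card-headline" '
    'href="/c0aa17fff978e9c4768ee32679b8555c">'
    '<h1 class="Component-h1-0-2-111">California fires bring more chopper '
    'rescues, power shutoffs</h1></a>'
)

snippet = BeautifulSoup(headline_html, "html.parser")
link = snippet.find("a")

href = link["href"]          # attribute access works like a dictionary
title = link.get_text()      # text of the nested <h1>, without tags
css_classes = link["class"]  # class is multi-valued, so this is a list

print(href)
print(title)
print(css_classes)
```

The `href` attribute and the nested heading text are exactly the two values we want for every article.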

### **Step 3:**

The pattern is that we want the value of the `href` attribute, along with the text of the link. There are many ways to get at this information. Below, I show just a few:

In [26]:
# EXAMPLE 1
# returns all `a` links whose CSS class is `Component-headline-0-2-123`
# (the class suffix on the live page differs from the `110` seen in the earlier snippet)
print(soup.find_all("a", "Component-headline-0-2-123"))

[<a class="Component-headline-0-2-123" data-key="card-headline" href="/article/buffalo-supermarket-shooting-442c6d97a073f39f99d006dbba40f64b"><h2 class="Component-heading-0-2-124 Component-headingMobile-0-2-125 -cardHeading">10 dead in Buffalo supermarket attack police call hate crime</h2></a>, <a class="Component-headline-0-2-123" data-key="card-headline" href="/article/russia-ukraine-kyiv-war-crimes-4cc0ea6b166aa0fd9fb1306f280739fc"><h2 class="Component-heading-0-2-124 Component-headingMobile-0-2-125 -cardHeading">Ukraine: Russians withdraw from around Kharkiv, batter east</h2></a>, <a class="Component-headline-0-2-123" data-key="card-headline" href="/article/abortion-us-supreme-court-new-york-city-88bd8dd83f9df333f61ec2de647f955d"><h2 class="Component-heading-0-2-124 Component-headingMobile-0-2-125 -cardHeading">Abortion rights backers rally in anger over post-Roe future</h2></a>, <a class="Component-headline-0-2-123" data-key="card-headline" href="/article/trump-russia-probe-trial-

In [27]:
urls, titles = [], []
# iterates over each link and extracts the href and title
for link in soup.find_all("a", "Component-headline-0-2-123"):
    # build an absolute link; a separate name avoids overwriting `url` from above
    article_url = "https://apnews.com" + link["href"]
    urls.append(article_url)

    title = link.text
    titles.append(title)

titles

['10 dead in Buffalo supermarket attack police call hate crime',
 'Ukraine: Russians withdraw from around Kharkiv, batter east',
 'Abortion rights backers rally in anger over post-Roe future',
 'EXPLAINER: Why stakes are high in trial tied to Russia probe',
 'Ukrainian band Kalush Orchestra wins Eurovision amid war',
 'Radio station elevates voices of Hungary’s Roma minority',
 'North Korea reports 15 more suspected COVID-19 deaths ',
 '‘Reprehensible’: Oz condemns GOP opponent’s tweet on Islam',
 'Clarence Thomas says abortion leak has changed Supreme Court',
 'As Musk buyout looms, Twitter searches for its soul',
 'McConnell, GOP senators meet Zelenskyy in surprise Kyiv stop',
 'Israel police to investigate conduct at journalist funeral',
 'Sheikh Mohammed bin Zayed Al Nahyan becomes UAE’s president',
 'Britney Spears says she’s lost baby due to miscarriage',
 'Shootings near Milwaukee Bucks playoff game prompt curfew',
 'Small plane crashes on bridge near Miami, striking an SUV',
 '

As mentioned in the official documentation [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes) and [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments), a tag (such as `a`) may have many attributes, and you can search on them by passing a dictionary of attribute names and values as the `attrs` argument.
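
The same kind of lookup can first be tried on a small hand-written snippet. The HTML below is an assumption shaped like the headline links on the page (the `/article/example-...` paths and headlines are made up). A dictionary is needed for `data-key` because the hyphen makes it an invalid Python keyword argument name.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one navigation link and two headline links.
sample_html = """
<a class="topic-link" href="/hub/sports">Sports</a>
<a class="Component-headline-0-2-123" data-key="card-headline" href="/article/example-one"><h2>First example headline</h2></a>
<a class="Component-headline-0-2-123" data-key="card-headline" href="/article/example-two"><h2>Second example headline</h2></a>
"""
snippet = BeautifulSoup(sample_html, "html.parser")

# Search by an arbitrary attribute via the attrs dictionary ...
by_data_key = snippet.find_all("a", attrs={"data-key": "card-headline"})
# ... or by CSS class via the class_ keyword (class is a reserved word).
by_class = snippet.find_all("a", class_="Component-headline-0-2-123")

# Pair each link's href with its headline text.
records = [(link["href"], link.get_text(strip=True)) for link in by_data_key]
print(records)
```

Both searches select the same two headline links and skip the navigation link. Since the class suffix in this exercise changed between page snapshots (`110` versus `123`), searching on `data-key` may turn out to be the less brittle choice, though that is only a guess about how the site is built.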

In [28]:
# EXAMPLE 2
# this returns the same exact subset of links as the example above
# so, we could iterate through the list just like above
soup.find_all("a", attrs={"data-key": "card-headline"})

[<a class="Component-headline-0-2-123" data-key="card-headline" href="/article/buffalo-supermarket-shooting-442c6d97a073f39f99d006dbba40f64b"><h2 class="Component-heading-0-2-124 Component-headingMobile-0-2-125 -cardHeading">10 dead in Buffalo supermarket attack police call hate crime</h2></a>,
 <a class="Component-headline-0-2-123" data-key="card-headline" href="/article/russia-ukraine-kyiv-war-crimes-4cc0ea6b166aa0fd9fb1306f280739fc"><h2 class="Component-heading-0-2-124 Component-headingMobile-0-2-125 -cardHeading">Ukraine: Russians withdraw from around Kharkiv, batter east</h2></a>,
 <a class="Component-headline-0-2-123" data-key="card-headline" href="/article/abortion-us-supreme-court-new-york-city-88bd8dd83f9df333f61ec2de647f955d"><h2 class="Component-heading-0-2-124 Component-headingMobile-0-2-125 -cardHeading">Abortion rights backers rally in anger over post-Roe future</h2></a>,
 <a class="Component-headline-0-2-123" data-key="card-headline" href="/article/trump-russia-probe-tri

Alternatively, we could use Regular Expressions if we were confident that our Regex pattern only matched on the relevant links.
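
Before running a pattern against the live page, it can help to check it on a small snippet. The sketch below uses made-up HTML shaped like the headline links (an assumption, not real page content) and shows how the pattern stops matching as soon as the attribute order changes.

```python
import re

# Hypothetical snippet shaped like the headline links on the page.
sample_html = (
    '<a class="Component-headline-0-2-123" data-key="card-headline" '
    'href="/article/example-one"><h2 class="Component-heading-0-2-124">'
    'First example headline</h2></a>'
    '<a class="topic-link" href="/hub/sports">Sports</a>'
    '<a class="Component-headline-0-2-123" data-key="card-headline" '
    'href="/article/example-two"><h2 class="Component-heading-0-2-124">'
    'Second example headline</h2></a>'
)

# Raw string with single quotes, so the double quotes need no escaping.
headline_pattern = re.compile(
    r'<a class="Component-headline.*?href="(.+?)"><h2.+?>(.+?)</h2></a>'
)
pairs = headline_pattern.findall(sample_html)
print(pairs)

# The same link with href placed before class no longer matches.
reordered_html = (
    '<a href="/article/example-three" class="Component-headline-0-2-123">'
    '<h2>Third example headline</h2></a>'
)
reordered_pairs = headline_pattern.findall(reordered_html)
print(reordered_pairs)
```

The empty result for the reordered link is the reason for the caveat above: a regex depends on the exact character layout of the HTML, whereas a parser such as BeautifulSoup does not care about attribute order.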

In [30]:
# EXAMPLE 3
# instead of using BeautifulSoup, we handle all of the parsing
# ourselves and work directly with the original Requests text;
# a raw string in single quotes avoids escaping the double quotes
re.findall(r'<a class="Component-headline.*?href="(.+?)"><h2.+?>(.+?)</h2></a>', home_page.text)

[('/article/buffalo-supermarket-shooting-442c6d97a073f39f99d006dbba40f64b',
  '10 dead in Buffalo supermarket attack police call hate crime'),
 ('/article/russia-ukraine-kyiv-war-crimes-4cc0ea6b166aa0fd9fb1306f280739fc',
  'Ukraine: Russians withdraw from around Kharkiv, batter east'),
 ('/article/abortion-us-supreme-court-new-york-city-88bd8dd83f9df333f61ec2de647f955d',
  'Abortion rights backers rally in anger over post-Roe future'),
 ('/article/trump-russia-probe-trial-durham-why-stakes-are-high-a9a92fa5c4cd932be948f4f7da7865af',
  'EXPLAINER: Why stakes are high in trial tied to Russia probe'),
 ('/article/russia-ukraine-eurovision-song-contest-entertainment-turin-3ecdb3b0f75c563667b66525afa42f23',
  'Ukrainian band Kalush Orchestra wins Eurovision amid war'),
 ('/article/hungary-social-issues-budapest-c1d4b38061a87c163a8fb41a744b8424',
  'Radio station elevates voices of Hungary’s Roma minority'),
 ('/article/covid-health-pandemics-south-korea-flu-d6c20af012cd035bcf411dbf952a3a5e'