### Types of HTML tags

HTML defines many different tags.  The complete list is in the HTML specification: https://html.spec.whatwg.org/multipage/

The tags used in the example above include:

| Tag name | Description |
| :- | :- |
| \<html> | A file marked up using HTML |
| \<head> | Header: information that's not visible in the web page |
| \<title> | Title of the page |
| \<style> | Formatting class definitions |
| \<body> | Body: the part that's visible in the web page |
| \<h1> | A top-level heading (\<h2>, \<h3>, and \<h4> are lower-level headings) |
| \<p> | A paragraph |
| \<b> | Boldface text |
| \<a> | A hypertext link |
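
To make the idea of opening and closing tags concrete, here is a minimal sketch that uses Python's built-in `html.parser` module to list the tags in a small HTML string. The sample markup and the `TagLister` class are made up for illustration; they are not part of the example page above.

```python
from html.parser import HTMLParser

# A tiny, made-up page that uses several of the tags from the table.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Welcome</h1>
    <p>Visit <a href="https://wikipedia.org">Wikipedia</a> for <b>more</b>.</p>
  </body>
</html>
"""

class TagLister(HTMLParser):
    """Record every opening and closing tag in the order it appears."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_endtag(self, tag):
        self.closed.append(tag)

lister = TagLister()
lister.feed(SAMPLE_HTML)
print("Opening tags:", lister.opened)
print("Closing tags:", lister.closed)
```

Every tag that is opened is also closed, and inner envelopes close before the outer ones that contain them.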
 


#### Real web pages

A real web page is just like the one above, but more complicated.  To see a useful example, go to <a href="https://www.npr.org/">https://www.npr.org/</a>.  In your browser menu, find the option that says **View Page Source** (in Firefox, that's inside the **Tools** menu), and click on it.

Notice that the top of the file is a very long header, including `<script>` and `<style>` tags that will be used later in the page.

After the very long header you will find a body, with lists formatted using `<ul>` and `<li>` tags, and with news content in plaintext between the tags.

<a id='section_2'></a>

## 2. Using BeautifulSoup to extract the content you want

<a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">Beautiful Soup</a> is a Python package that makes it relatively easy to extract content from web pages.

In [40]:
import bs4

with open("testpage.html") as f:
    example_soup = bs4.BeautifulSoup(f, "html.parser")

ptags = example_soup.find_all("p")
print("There are", len(ptags), "paragraphs in the document.\n")
print("The first one is:\n")
print(ptags[0], "\n")
print("The third one is:\n")
print(ptags[2], "\n")

print("The children of the third paragraph are:\n")
print(ptags[2].contents)

There are 4 paragraphs in the document.

The first one is:

<p>
        Web pages are written using the hypertext markup language (HTML).  You can write HTML using special tools, but you can also write it using any plaintext editor.   The markup in an HTML file is done using tags.  Each tag either opens an envelope, or closes an envelope.  The text in between the opening tag and the closing tag is called the content of the envelope.  Envelopes can be nested, one inside another, as this <b>p</b> tag is nested inside the <b>body</b> tag.
        </p> 

The third one is:

<p class="leftmargin">
<a class="bluetext" href="https://wikipedia.org">Wikipedia</a>
</p> 

The children of the third paragraph are:

['\n', <a class="bluetext" href="https://wikipedia.org">Wikipedia</a>, '\n']
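
The `'\n'` entries in that list are whitespace text nodes, which count as children just like tags do. As a minimal sketch, the code below keeps only the children that are tags. The inline `snippet` string is an assumption modelled on the third paragraph, so the example runs without `testpage.html`.

```python
import bs4
from bs4.element import Tag

# Inline HTML modelled on the third paragraph of the test page.
snippet = '<p class="leftmargin">\n<a class="bluetext" href="https://wikipedia.org">Wikipedia</a>\n</p>'
soup = bs4.BeautifulSoup(snippet, "html.parser")
paragraph = soup.find("p")

# .contents holds every child, including whitespace-only strings.
print(len(paragraph.contents))

# Keep only the children that are tags.
tag_children = [child for child in paragraph.contents if isinstance(child, Tag)]
print([child.name for child in tag_children])
```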


In [41]:
atag = ptags[2].find("a")

print("The href attribute of the hyperlink in paragraph 3 is:",atag['href'])

print("The text content of that hyperlink is:", atag.text)


The href attribute of the hyperlink in paragraph 3 is: https://wikipedia.org
The text content of that hyperlink is: Wikipedia
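
Indexing a tag with `atag['href']` raises a `KeyError` when the attribute is missing, which can happen on real pages. The sketch below shows one possible alternative, `atag.get("href")`, which returns `None` instead. The markup is made up for illustration.

```python
import bs4

# Made-up markup: one link with an href, one without.
snippet = '<p><a href="https://wikipedia.org"> Wikipedia </a><a>No target</a></p>'
soup = bs4.BeautifulSoup(snippet, "html.parser")

links = []
for atag in soup.find_all("a"):
    href = atag.get("href")            # None when the attribute is absent
    text = atag.get_text(strip=True)   # text with surrounding whitespace stripped
    links.append((text, href))

print(links)
```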


#### Using BeautifulSoup to explore a real-world web page

Now let's use BeautifulSoup to explore the NPR web page.

In [42]:
import bs4, requests
webpage = requests.get("https://npr.org")
npr_soup = bs4.BeautifulSoup(webpage.text, "html.parser")

ptags = npr_soup.find_all("p")
print("There are", len(ptags), "paragraphs in the document.\n")
print("The first one is:\n")
print(ptags[0], "\n")
print("The third one is:\n")
print(ptags[2], "\n")


There are 33 paragraphs in the document.

The first one is:

<p>
                The time-tracking software TimeCamp is able to monitor what files are accessed, and for how long, and whether other non-work activities, such as streaming services, are used on a laptop.
                <b aria-label="Image credit" class="credit">
                    
                    Elise Amendola/AP
                    
                </b>
<b class="hide-caption"><b>hide caption</b></b>
</p> 

The third one is:

<p>
                A 6-year-old boy shot and wounded his teacher earlier this month at Richneck Elementary School in Newport News, Va., according to police and school officials.
                <b aria-label="Image credit" class="credit">
                    
                    Billy Schuerman/Virginian Pilot/Tribune News Service via Getty Images
                    
                </b>
<b class="hide-caption"><b>hide caption</b></b>
</p> 
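
A request to a live web site can fail or hang, so it may be worth checking the response before parsing it. The following is a minimal sketch, not the only way to do it. The helper name `fetch_html`, the 10-second timeout, and the injectable `get` argument are all assumptions made for illustration.

```python
import requests

def fetch_html(url, timeout=10, get=requests.get):
    """Return the HTML text at `url`, or raise if the request fails.

    `get` is injectable so the function can be exercised without a network.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()   # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

With this helper, the soup could be built as `bs4.BeautifulSoup(fetch_html("https://npr.org"), "html.parser")`.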



The news items in the NPR web page are stored in `<div>` envelopes with a special class: they are called `<div class="story-text">`.  Let's list those.

In [43]:
div_tags = npr_soup.find_all('div', 'story-text')

print("There are", len(div_tags), "story text sections.\n")
print("The first one is:\n")
print(div_tags[0],"\n")
print("The third one is:\n")
print(div_tags[2],"\n")

There are 21 story text sections.

The first one is:

<div class="story-text">
<div class="slug-wrap">
<h2 class="slug">
<a data-metrics='{"action":"Click Slug"}' data-metrics-ga4='{"action":"homepage_curation_click","clickPosition":1,"clickType":"section slug","clickUrl":"https://www.npr.org/sections/business/"}' href="https://www.npr.org/sections/business/">
                        Business
                        </a>
</h2>
</div>
<a data-metrics='{"action" : "Click Story 1"}' data-metrics-ga4='{"action":"homepage_curation_click","clickPosition":1,"clickType":"curated story","clickUrl":"https://www.npr.org/2023/01/13/1148985075/time-tracking-software-canadian-woman-reach-cpa-court"}' href="https://www.npr.org/2023/01/13/1148985075/time-tracking-software-canadian-woman-reach-cpa-court">
<h3 class="title">A woman is ordered to repay $2,000 after her employer used software to track her time</h3>
</a>
<a data-metrics='{"action" : "Click Story 1"}' data-metrics-ga4='{"action":"homepage_c

If you look through those `story-text` sections, you can see that there are only two parts that might sound good if spoken out loud:

* Each of them has a title, called `<h3 class="title">`
* One of them also has a teaser, called `<p class="teaser">`.
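
Beautiful Soup offers several equivalent ways to look up tags by class. The sketch below compares three of them on made-up markup that imitates the structure of one NPR story block. The real page is larger and may change at any time.

```python
import bs4

# Made-up markup that imitates the structure of one NPR story block.
snippet = """
<div class="story-text">
  <h3 class="title">Example headline</h3>
  <p class="teaser">Example teaser.</p>
</div>
<div class="other">Not a story</div>
"""
soup = bs4.BeautifulSoup(snippet, "html.parser")

by_position = soup.find_all("div", "story-text")        # class as 2nd positional argument
by_keyword = soup.find_all("div", class_="story-text")  # explicit class_ keyword
by_selector = soup.select("div.story-text")             # CSS selector

print(len(by_position), len(by_keyword), len(by_selector))
print(by_selector[0].select_one("h3.title").get_text())
```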

Let's write a function that extracts a list of story-texts from the NPR web page, and returns the title and (if it exists) the teaser for each of them.

In [44]:
def get_stories(soup):
    stories = []
    for div_tag in soup.find_all('div', 'story-text'):
        title = div_tag.find('h3', 'title')
        teaser = div_tag.find('p', 'teaser')
        story = title.text + ". "
        if teaser is not None:
            story += teaser.text
        stories.append(story)
    return stories
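
`get_stories` assumes that every `story-text` block contains an `<h3 class="title">`. If a block has no title, `title.text` raises an `AttributeError`. As a hedged sketch, the variant below (the name `get_stories_safe` is hypothetical) skips such blocks, and is run on made-up markup rather than the live NPR page.

```python
import bs4

def get_stories_safe(soup):
    """Like get_stories, but skip story blocks that have no title."""
    stories = []
    for div_tag in soup.find_all("div", "story-text"):
        title = div_tag.find("h3", "title")
        if title is None:
            continue  # nothing to read aloud for this block
        story = title.get_text(strip=True) + ". "
        teaser = div_tag.find("p", "teaser")
        if teaser is not None:
            story += teaser.get_text(strip=True)
        stories.append(story)
    return stories

# Made-up markup: one full story, one title-only story, one block with no title.
snippet = """
<div class="story-text"><h3 class="title">First headline</h3><p class="teaser">A teaser.</p></div>
<div class="story-text"><h3 class="title">Second headline</h3></div>
<div class="story-text"><p class="teaser">Orphan teaser</p></div>
"""
demo_soup = bs4.BeautifulSoup(snippet, "html.parser")
print(get_stories_safe(demo_soup))
```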

In [45]:
stories = get_stories(npr_soup)
print("There are", len(stories), "stories.\n")
for n in range(5):
    print("Story number %d:"%(n))
    print(stories[n], "\n")


There are 21 stories.

Story number 0:
A woman is ordered to repay $2,000 after her employer used software to track her time. The remote employee had charged her company for 50 hours that were not associated with her job, a Canadian court found. The company used time-tracking software installed on her laptop. 

Story number 1:
The school searched a 6-year-old's backpack for a gun before he shot his teacher.  

Story number 2:
The Pentagon got hundreds of new reports of UFOs in 2022, a government report says.  

Story number 3:
How to save what you need for retirement.  

Story number 4:
Tesla slashes prices across all its models in a bid to boost sales.  



<a id='section_3'></a>

## 3. Automatic news announcer

#### Making speech_package available everywhere on your machine

Last week we created the `speech_package`.  Unfortunately, in order to make it available everywhere on your machine, there is one step we left out!

Please use a terminal to navigate to the parent directory of your `speech_package`, and type the following:

```
python setup.py install
```

(Last week, we left out the word `install`.)
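
One way to confirm that the installation worked is to ask Python whether it can find the package. The sketch below uses the standard library's `importlib.util.find_spec`. The helper name `is_installed` is an assumption made for illustration.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can find a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None

print("speech_package installed:", is_installed("speech_package"))
```

If this prints `False` from the directory where you are working, the install step has probably not taken effect yet.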


#### Using speech_package to automatically read the news

Now that we've installed your `speech_package`, it should be available everywhere on your PC, including the directory where you're doing this week's work.  Let's use it to make an automatic news announcer.



In [46]:
import speech_package

def read_nth_story(stories, n, filename):
    speech_package.synthesize(stories[n],"en",filename)


In [47]:
import librosa, IPython

read_nth_story(stories, 10, 'test.mp3')
x, fs = librosa.load('test.mp3')

IPython.display.Audio(data=x, rate=fs)
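
`read_nth_story` raises an `IndexError` if `n` is larger than the number of stories found that day. Below is a minimal sketch of a bounds-checked variant. The names `read_story_checked` and `fake_synthesize` are hypothetical, and the `synthesize` argument stands in for `speech_package.synthesize`, which is assumed to take `(text, language, filename)` as in the call above.

```python
def read_story_checked(stories, n, filename, synthesize):
    """Synthesize stories[n] to filename, after checking that n is a valid index.

    `synthesize` is any callable taking (text, language, filename); in the
    notebook it would be speech_package.synthesize.
    """
    if not 0 <= n < len(stories):
        raise IndexError(f"story index {n} is out of range for {len(stories)} stories")
    synthesize(stories[n], "en", filename)

# Stand-in synthesizer that records its arguments instead of producing audio.
calls = []
def fake_synthesize(text, language, filename):
    calls.append((text, language, filename))

read_story_checked(["First story.", "Second story."], 1, "test.mp3", fake_synthesize)
print(calls)
```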


<a id='homework'></a>

## Homework for Week 12

Create a plaintext file called `week12.py`.  Copy into it the following template code:

```
import bs4, speech_package, librosa

def extract_stories_from_NPR_text(webpage_text):
    '''
    Input: 
    webpage_text (string): the text of a webpage
    
    Output:
    stories (list of strings): a list of the news stories in the web page
    '''
    raise RuntimeError('You need to write this part!')
    return stories
    
def synthesize_nth_story(stories, n, filename):
    '''
    Input:
    stories (list of strings): a list of the news stories from a web page
    n (int): the index of the story you want me to read
    filename (str): filename in which to store the synthesized audio

    Output: None
    '''
    raise RuntimeError('You need to write this part!')
```

Notice that `extract_stories_from_NPR_text` is almost the same as `get_stories`, but it starts from the webpage text rather than from the soup.  You therefore need to run `soup = bs4.BeautifulSoup(webpage_text, "html.parser")` first, and then reuse the rest of the code from `get_stories`.

Once you've created the file,

1. Replace the `raise RuntimeError` lines with code that works
1. Try it in the following code block.  Once you get it to work here, then
1. Try uploading it to Gradescope

In [52]:
import week12, requests, IPython, importlib, librosa
importlib.reload(week12)

webpage = requests.get("https://npr.org")
stories = week12.extract_stories_from_NPR_text(webpage.text)
week12.synthesize_nth_story(stories, 10, 'test.mp3')

x, fs = librosa.load('test.mp3')
IPython.display.Audio(data=x, rate=fs)
