# Extracting text from HTML web pages

by Koenraad De Smedt at UiB

---
HTML (HyperText Markup Language) is the standard format for webpages. It has tags for headings, paragraphs, lists, etc.

This notebook shows how to:
1.  Extract plain text from the HTML of a webpage.
2.  Inspect the data access and extraction process at every step.

Note: There is also a package [textract](https://textract.readthedocs.io/en/stable/python_package.html) which provides text extraction from many types of files, but this package is complex and beyond the scope of the present course.

---

You need to import a module that can request access to a webpage based on its URL.

In [None]:
import requests

The following is an example of a webpage with usable HTML. Also open this URL in a browser window to see what the page looks like. We are interested in extracting the text in normal paragraphs.


In [None]:
chess_url = 'https://en.wikipedia.org/wiki/Chess'

Request the webpage and get its content. In the printed output, the `b` in front of the string indicates that the content is a `bytes` object (raw bytes) rather than a text string.

In [None]:
chess_html = requests.get(chess_url).content
print(chess_html[:125])
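
As a small offline illustration of the difference between raw bytes and text, the sketch below uses a made-up byte string (`sample_bytes` is hypothetical and not taken from the Wikipedia page) and decodes it, assuming the bytes are UTF-8 encoded.

```python
# A made-up byte string standing in for downloaded page content (hypothetical).
sample_bytes = b'<p>Caf\xc3\xa9 chess</p>'

print(type(sample_bytes))   # a bytes object
print(sample_bytes)         # printed with a leading b and escaped non-ASCII bytes

# Decoding turns raw bytes into a str; UTF-8 is an assumption here.
sample_text = sample_bytes.decode('utf-8')
print(type(sample_text))    # a str object
print(sample_text)
```

With `requests`, the `.content` attribute of a response holds the raw bytes, while `.text` holds an already decoded string. Beautiful Soup accepts either.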

We see that the content is HTML. In order to get plain text out of the HTML, we need the `bs4` module (Beautiful Soup).

In [None]:
from bs4 import BeautifulSoup

We are interested in the text contained in paragraphs, indicated with `<p>` tags. First, we use the `html.parser` to parse the HTML, then we find all paragraphs in the parsed HTML. Optionally, print the number of paragraphs for information, and print the first few of them.

(Note: not all HTML has paragraphs. If the content is a table, for instance, we may need to find all `td` elements instead.)


In [None]:
parsed_html = BeautifulSoup(chess_html, 'html.parser')
paragraphs = parsed_html.find_all('p')
print(len(paragraphs), 'paragraphs were extracted.')
print(paragraphs[:4])
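
To see what `find_all` returns without going online, here is a minimal sketch on a tiny hand-written HTML string. The names `sample_html`, `sample_soup`, `sample_paragraphs` and `sample_cells` are made up for this illustration. It also shows the `td` variant mentioned in the note above.

```python
from bs4 import BeautifulSoup

# A tiny hand-written HTML document (hypothetical, for illustration only).
sample_html = """
<html><body>
<h1>Board games</h1>
<p>Chess is played on a board.</p>
<p>Go uses <b>stones</b>.</p>
<table><tr><td>Chess</td><td>64 squares</td></tr></table>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find_all returns a list-like collection of all matching elements.
sample_paragraphs = sample_soup.find_all('p')
print(len(sample_paragraphs), 'paragraphs were found.')

# get_text() keeps the text and drops nested tags such as <b>.
print([p.get_text() for p in sample_paragraphs])

# For tabular content, the same call works with 'td'.
sample_cells = [td.get_text() for td in sample_soup.find_all('td')]
print(sample_cells)
```

The heading is not included in either result, because only elements with the requested tag are returned.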

Optionally, if you know that some paragraphs are not relevant, you can skip them.

In [None]:
paragraphs = paragraphs[2:]
print(paragraphs[:4])

Next, concatenate the plain text of all the paragraphs into a single string. Print its length and the first thousand characters.

In [None]:
chess_text = ''.join([node.text for node in paragraphs]).strip()
print('Text is', len(chess_text), 'characters long.\n')
print(chess_text[:1000])

Voilà! Now you have plain text that you can process further. Let's tokenize and compute the frequencies of the most common words.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases look for this resource in word_tokenize
from nltk import word_tokenize, FreqDist

In [None]:
chess_tokens = word_tokenize(chess_text.casefold())
print(len(chess_tokens), 'tokens were found.')
frequencies = FreqDist(chess_tokens)
frequencies.most_common(20)
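
To make the shape of the `most_common` output concrete, the sketch below counts a hand-made token list. NLTK's `FreqDist` is a subclass of the standard library's `collections.Counter`, so `Counter` is used here as a stand-in that needs no download. The token list `toy_tokens` is invented for the example.

```python
from collections import Counter

# A hand-made token list (hypothetical); real tokens would come from word_tokenize.
toy_tokens = ['the', 'king', 'and', 'the', 'queen', 'and', 'the', 'rook']

toy_counts = Counter(toy_tokens)

# most_common(n) returns (item, count) pairs, highest count first.
print(toy_counts.most_common(2))   # [('the', 3), ('and', 2)]
```

In real text the top of such a list is usually dominated by function words and punctuation, as in the Chess article.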

We can plot some counts. First we import the pyplot module.

In [None]:
import matplotlib.pyplot as plt

Then we make a list of the frequency counts of the 500 most common words and plot them as dots. Notice what happens when you uncomment the line that sets the *y* scale to logarithmic.

In [None]:
freq_counts = [count for word, count in frequencies.most_common(500)]

plt.figure(dpi=100)
plt.plot(freq_counts, '.')
#plt.yscale('log')
plt.title('Word frequency counts in Chess article')
plt.ylabel('Frequency')
plt.xlabel('Index')
plt.show()

Let's do some n-grams as well. The result is not a list but a lazy iterator (a generator or similar, depending on the NLTK version) that produces n-grams on demand. With the `next` function we can get the next item from it.

In [None]:
from nltk.util import ngrams

In [None]:
ng3 = ngrams(chess_tokens, 3)
for i in range(12):
  print(next(ng3))
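
The idea of producing trigrams one at a time can be sketched in plain Python. This is not how NLTK implements `ngrams`. The function `toy_trigrams` and the list `move_tokens` are made up here only to show the behaviour of `next` on a lazy sequence.

```python
# Hand-made tokens (hypothetical); any list of strings would do.
move_tokens = ['white', 'moves', 'first', 'in', 'chess']

def toy_trigrams(tokens):
    """Yield consecutive 3-token windows, one at a time."""
    for i in range(len(tokens) - 2):
        yield tuple(tokens[i:i + 3])

gen = toy_trigrams(move_tokens)
print(next(gen))   # first trigram
print(next(gen))   # second trigram
print(list(gen))   # whatever remains; the generator is then exhausted
```

A list of five tokens yields only three trigrams. Calling `next` on an exhausted generator raises `StopIteration`, so a loop over `range(12)` only works when there are at least 14 tokens.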

### Exercises

Put things together and make a function to extract plain text from HTML. Define a function `get_text_from_webpage` with two arguments:

1.   a string representing a URL to a webpage assumed to contain HTML
2.   a tag (default is *p*)

The function should return a string with the plain text of all elements with that tag on the webpage (don't forget to use `return`). Also print the number of characters read.

In [None]:
# Write your function here.
def get_text_from_webpage(url, tag='p'):
  ...

In [None]:
# This is a test of your function. Run this cell after defining your function.
tennis_url = 'https://en.wikipedia.org/wiki/Tennis'
tennis_text = get_text_from_webpage(tennis_url)
print(tennis_text[:150])