# Reading plain text from the web

by Koenraad De Smedt at UiB

---
The web holds a large amount of textual material that can be processed, but it comes in many different encodings and formats. This notebook deals with the simplest case, namely, reading a webpage that consists of plain Unicode text.

---

# 0. Getting the text

We need to import the `requests` module, which can open a webpage given its URL.

In [None]:
import requests

Here is an example URL that points to a plain text file on the web. You can open it in a new browser tab to check that it contains plain text.

In [None]:
emma_url = 'https://raw.githubusercontent.com/fbkarsdorp/python-course/master/data/austen-emma.txt'

The function `requests.get` opens a webpage given its URL. The response contains several kinds of information, but here we are only interested in the textual content as a Unicode string, which we get by means of `.text`.

In this example, we take only the first 800 characters because the whole text is too big to display.

In [None]:
emma_text = requests.get(emma_url).text[:800]
print(emma_text)

# 1. Computing tokens, types and frequencies

Now that we have plain text, we can process it further. We use the `nltk` module, which provides some useful text manipulation and counting functions.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead
from nltk import word_tokenize, FreqDist

Make a list of word tokens from the lowercased text.

In [None]:
emma_tokens = word_tokenize(emma_text.lower())
print(emma_tokens[:30])

The distinct tokens in a text are called its types. The set of types can also be called the vocabulary of the text.

In [None]:
emma_types = set(emma_tokens)
print(emma_types)

We can compute the lexical variation by dividing the number of types by the number of tokens. The larger this number, the more varied the use of words. The smaller this number, the more words are repeated. For a very short text, this number is not very informative.

In [None]:
len(emma_types) / len(emma_tokens)

Let's define a function for lexical variation based on this proportion.

In [None]:
def lexical_variation(text):
  tokens = word_tokenize(text.lower())
  types = set(tokens)
  return len(types) / len(tokens)

lexical_variation(emma_text)
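
To see why this proportion means little for a very short text, the following sketch computes it for prefixes of growing length of a small made-up sample. As an assumption for this illustration only, it splits on whitespace with `str.split` instead of calling `word_tokenize`, so that it runs without NLTK. The name `simple_variation` and the sample sentence are hypothetical.

```python
# Sketch only: whitespace splitting stands in for word_tokenize (an assumption).
def simple_variation(text):
    """Return the number of types divided by the number of tokens."""
    tokens = text.lower().split()
    if not tokens:  # avoid dividing by zero for an empty text
        return 0.0
    return len(set(tokens)) / len(tokens)

# Hypothetical sample: a short sentence repeated many times.
sample = "the cat saw the dog and the dog saw the cat " * 50

# Compare prefixes of 10, 50 and 200 characters with the full sample.
for n_chars in (10, 50, 200, len(sample)):
    print(n_chars, round(simple_variation(sample[:n_chars]), 3))
```

The shortest prefix scores 1.0 simply because no word has had a chance to repeat yet, while the full sample scores much lower.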

Make a frequency distribution.

In [None]:
freq_dist = FreqDist(emma_tokens)
freq_dist['emma'] # assume tokens are all lowercase
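
`FreqDist` is a subclass of Python's `collections.Counter`, so the same kind of counting can be sketched with the standard library alone. The token list below is a hypothetical stand-in for `emma_tokens`.

```python
from collections import Counter

# Hypothetical, already lowercased tokens standing in for emma_tokens.
tokens = ['emma', 'woodhouse', ',', 'handsome', ',', 'clever', ',', 'and',
          'rich', ',', 'emma', 'was', 'happy']

counts = Counter(tokens)

print(counts['emma'])         # frequency of one word
print(counts['knightley'])    # a word that does not occur has frequency 0
print(counts.most_common(2))  # the two most frequent tokens with their counts
```

Because of the subclass relation, `most_common` can be called on a `FreqDist` in the same way.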

# 2. Streaming line by line

Instead of reading the whole text of the webpage into a string, which may be very long, it is also possible to read and process *streamed* content line by line. The stream works like a generator, so only as many lines are read as the program asks for. The code in the following cell reads only the first 20 lines and prints each one together with its counter.

By default, `iter_lines` yields each line as bytes, so we pass `decode_unicode=True` to have each line decoded into text.

In [None]:
emma_stream = requests.get(emma_url, stream=True)
linestream = emma_stream.iter_lines(decode_unicode=True)
for n in range(20):
  print(n, next(linestream))
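
The generator behaviour can be illustrated without a network connection. In the following sketch, the hypothetical generator function `fake_lines` stands in for `iter_lines` and records every line it hands out, which shows that calling `next` three times produces exactly three lines.

```python
produced = []  # records every line the generator actually hands out

def fake_lines():
    """Hypothetical stand-in for a line stream: yields numbered lines on demand."""
    n = 0
    while True:  # conceptually endless; nothing runs until next() asks for a line
        line = f'line {n}'
        produced.append(line)
        yield line
        n += 1

linestream = fake_lines()
for n in range(3):
    print(n, next(linestream))

print(len(produced))  # only the lines that were asked for have been produced
```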

Here is an alternative way of reading and printing 20 lines. We zip a range of numbers and a stream of lines. Again, this is very efficient because `zip` stops after 20 lines, so no more lines are read.

In [None]:
emma_stream = requests.get(emma_url, stream=True)
for n, line in zip(range(20), emma_stream.iter_lines(decode_unicode=True)):
  print(n, line)
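
The order of the arguments to `zip` matters here. `zip` asks each of its arguments for an item in turn and stops as soon as one of them is exhausted, so with the range first, no extra line is taken from the stream. The sketch below uses a plain iterator over a hypothetical list of lines instead of a web stream to show the difference.

```python
lines = [f'line {i}' for i in range(10)]  # hypothetical stand-in for streamed lines

# Range first: zip stops when the range runs out, before touching the stream again.
stream = iter(lines)
for n, line in zip(range(3), stream):
    print(n, line)
after_range_first = next(stream)
print(after_range_first)  # the stream continues with the fourth line

# Stream first: zip takes a fourth line before noticing that the range has run out.
stream = iter(lines)
for line, n in zip(stream, range(3)):
    print(n, line)
after_stream_first = next(stream)
print(after_stream_first)  # the fourth line was consumed and discarded by zip
```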

Notice that the text has some empty lines. To read 20 lines but print only the non-empty ones, we add a condition with `if`.

In [None]:
emma_stream = requests.get(emma_url, stream=True)
for n, line in zip(range(20), emma_stream.iter_lines(decode_unicode=True)):
  if line:
    print(n, line)

The previous cell counts 20 lines read, not 20 lines printed. To print 20 non-empty lines, we should use a counter that is increased only after we know we have a non-empty line.

In [None]:
emma_stream = requests.get(emma_url, stream=True)
linestream = emma_stream.iter_lines(decode_unicode=True)
printed = 0
while printed < 20:
  line = next(linestream)
  if line:
    print(printed, line)
    printed += 1
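
The `while` loop above raises `StopIteration` if the stream ends before 20 non-empty lines have been found. One possible alternative, sketched here on a hypothetical list of lines rather than a web stream, is to filter out empty lines and let `itertools.islice` take at most the requested number. The function name `first_nonempty` is made up for this illustration.

```python
from itertools import islice

# Hypothetical stand-in for streamed lines, including empty ones.
lines = ['Emma', '', 'by Jane Austen', '', '', 'VOLUME I', 'CHAPTER I']

def first_nonempty(linestream, limit):
    """Return up to `limit` non-empty lines, stopping quietly if the stream ends."""
    nonempty = filter(None, linestream)  # drops empty strings lazily
    return list(islice(nonempty, limit))

for n, line in enumerate(first_nonempty(iter(lines), 3)):
    print(n, line)

# Asking for more lines than exist simply returns what is available.
print(len(first_nonempty(iter(lines), 20)))
```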

## Exercises

1.  Read the *full* text of Emma into Python. Do not print the whole text, because it is too long, but lowercase it, tokenize it and compute the lexical variation.

2.  Given the list `tokens` and the frequency distribution `freq_dist`, print the frequencies of *she* and *he*. Also, compute the relative frequency of these words per million words. Optionally, plot both relative frequencies together in a barplot.

3.  Extend the code for reading non-empty lines from Emma so that two counters are printed, one that counts the lines read and another that counts the non-empty lines printed.

4.  Find a large word list online with one word on each line. Iterate over its lines and print only the lines that are palindromes. Reuse the palindrome function from the earlier notebook about palindromes. The following are possible URLs for a large English word list. Alternatively, you can look for a list in another language.

 *   http://wiki.puzzlers.org/pub/wordlists/unixdict.txt
 *   https://raw.githubusercontent.com/quinnj/Rosetta-Julia/master/unixdict.txt
 *   https://searchcode.com/codesearch/raw/29038705/

5.  Suppose you need help in solving a crossword puzzle. Write a function `search_words` that iterates over an online word list, as suggested above, and prints all lines matching a given regex. For instance, `search_words('^[db]a...$')` will look for five-letter words starting with *d* or *b* followed by *a*. You may want to limit the number of words that are printed because there could be many.

6.  (optional) Also using an online word list as above, write a comprehension that collects all words longer than, for instance, 15 characters.

## Notes

1.  See the [documentation of requests](https://docs.python-requests.org/en/master/user/quickstart/) if you need more ways to access webpages.
In case a webpage is encoded in anything other than UTF-8 and `.text` does not decode it correctly, one can instead use `.content.decode(encoding)`, which gets the raw content and interprets it according to the given encoding, for instance `.content.decode('cp1252')`.

2.  In some versions of Python on macOS, using the `requests` module may cause an error about certificates. If that is the case, you can run the following command on macOS.

><img src="https://git.app.uib.no/desmedt/teaching/-/raw/main/Install-certificates.png" alt = "Install Certificates command" width = 240px>
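
To illustrate note 1, the following sketch shows why the encoding matters when bytes are decoded. The byte string is a hypothetical stand-in for the `.content` of a response from a page encoded in cp1252.

```python
# Hypothetical raw content of a page encoded in cp1252.
content = 'café à la crème'.encode('cp1252')

print(content.decode('cp1252'))  # decoded with the right encoding

try:
    content.decode('utf-8')      # the wrong encoding cannot interpret these bytes
except UnicodeDecodeError as err:
    print('not valid UTF-8:', err.reason)
```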