Using the Web Scraper

In order to make proper use of the nlplib, we need to acquire samples of natural language. Fortunately, the internet is filled with a ridiculous amount of natural language. Most of it is even of decent quality, with the notable exception of YouTube comments (seriously, how did those ever become so terrible?).

To take advantage of the massive amounts of data, the nlplib comes equipped with the web scraper decorator. This decorator transforms a generator function that yields URLs into a generator function that yields response objects.

This web scraper retrieves the text from python.org's main page.

import nlplib

@nlplib.scraper
def scrape_python_dot_org () :
    yield 'http://python.org'

To get the response objects, simply iterate over the generator. The text of each response can then be retrieved using the built-in str function.

for response in scrape_python_dot_org() :
    # The URL of the response; this can differ from the original input URL if you were redirected.
    print(response.url)

    # Prints the first 100 characters of the string containing the response's text, typically HTML code.
    print(str(response)[:100] + '...')
    print()

Most resources contain HTML, and a parser is necessary to make use of it. Processing HTML is outside the scope of the nlplib; however, Python has several options available for handling raw HTML. The standard library features an HTML parser, but a significant portion of the content on the internet is malformed HTML, so it's often necessary to use a parser that handles broken markup gracefully. The BeautifulSoup parser does this job quite admirably.
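
As a minimal sketch, here is one way to feed the scraped HTML to BeautifulSoup. This assumes the third-party beautifulsoup4 package is installed; the scraper usage mirrors the example above, and 'html.parser' is the standard library backend.

from bs4 import BeautifulSoup

import nlplib

@nlplib.scraper
def scrape_python_dot_org () :
    yield 'http://python.org'

for response in scrape_python_dot_org() :
    # Parse the raw HTML using the standard library's parser backend.
    soup = BeautifulSoup(str(response), 'html.parser')

    # Print the first 100 characters of the page's visible text, with the
    # markup stripped away.
    print(soup.get_text(separator=' ', strip=True)[:100] + '...')

The get_text call collapses the markup down to the page's visible text, which is usually the part you want as a natural language sample.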

Because the internet is a wild and unpredictable place, it's advisable to set the silent argument to True. This silences exceptions caused by a resource being too large (MemoryError), in an unfamiliar encoding (UnicodeError), or unavailable for other reasons (urllib.request.URLError). If any of these situations is encountered, the response for that URL is simply not yielded. Otherwise, if the silent argument is set to False (the default), the scraper can raise an nlplib.CouldNotOpenURL exception (which is chained to the aforementioned errors).

@nlplib.scraper(silent=True)
def some_scraped_stuff () :
    # Scrape from multiple different URLs.
    yield 'http://python.org'
    yield 'http://wikipedia.org'
    yield 'http://github.com'

for response in some_scraped_stuff() :
    print(response.url)
    print(str(response)[:100] + '...')
    print()
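
For contrast, here's a sketch of handling a failure yourself when silent is left at its default of False. The function name and the unreachable URL below are made up purely for illustration.

@nlplib.scraper
def scrape_unreachable_stuff () :
    # A deliberately bogus URL, so opening it fails.
    yield 'http://an.unreachable.example.invalid'

try :
    for response in scrape_unreachable_stuff() :
        print(response.url)
except nlplib.CouldNotOpenURL :
    # The underlying MemoryError, UnicodeError, or URLError is chained to
    # this exception.
    print('could not open the URL')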

The full source text for this demo can be found here.
