# Chapter 1: Getting text data for analysis

One of the best sources of data for text mining is, of course, the web. There are two main ways of sourcing text data from it. If we are lucky, a website will have a public API, which is simply an endpoint that we can query and extract data from. If we are unlucky, we will need to scrape the website, which is much messier and can require a bit more clean-up.

For this book, we will be analysing Grimm's fairytales. The [full set](https://en.wikipedia.org/wiki/Grimms%27_Fairy_Tales) of these comprises 201 stories and 10 legends, and was originally published in German as *Kinder- und Hausmärchen* (*Children's and Household Tales*) in 1812, and translated into English as *Household Tales by the Brothers Grimm* in 1884. Given that there are no sites hosting these tales with a public API, we will need to scrape some sites to put together our dataset. As such, I won't cover accessing data through an API in this book, but if you are interested in getting data this way you can have a look at a couple of previous text mining projects I have done using data from the [reddit](http://t-redactyl.io/blog/2015/11/analysing-reddit-data-part-1-setting-up-the-environment.html) and [Twitter](http://t-redactyl.io/blog/2017/04/applying-sentiment-analysis-with-vader-and-the-twitter-api.html) APIs.

## Getting the text - the hacky version

The idea behind web scraping is not too complicated: we want to retrieve data which is on a website (usually in HTML format) and convert it into a format that we can use for an analysis (generally a dataframe). The tricky part can be finding the correct way to select the content that is relevant to us, as websites contain a lot of stuff that we probably don't need. To do this, we need to find the HTML tag(s) that identify the content we want, and pray that the people who built the website have used the tags consistently for the same types of information! I find that the easiest way to identify the relevant tags for your data is to use the developer tools built into browsers like Chrome. Let's talk through this with a concrete example.

### English-language tales

We will be pulling the English-language tales from [this website](http://www.worldoftales.com/fairy_tales/Grimm_fairy_tales_Margaret_Hunt.html). Once we've navigated to this page, we can open up our developer tools (in Chrome) by right-clicking and selecting 'Inspect'.

<img src="/figure/Chapter 1.1.png" title="Developer tools 1" style="display: block; margin: auto;" />

Once you've done that, the developer tools will open on the right of the screen. In the image above, I have highlighted a button that allows you to view the tags associated with any element of the page. If you click on this and select one of the tale names, it will take you to the tag for that title, like so:

<img src="/figure/Chapter 1.2.png" title="Developer tools 2" style="display: block; margin: auto;" />

Looking at the first tale, *The Frog-King, or Iron Henry*, we can see that the link is an ['a'](https://www.w3schools.com/tags/tag_a.asp) tag, which indicates it is a hyperlink, with an 'href' attribute, which gives the link's destination. A quick look down the list of tales indicates that they all have the same tagging, which means we can use this to pull out a list of the URLs linking to the full tale text. We can now use a package called `bs4` (short for BeautifulSoup 4), as well as `urllib` and `urlparse`, to help us extract these URLs from this mess of HTML. Let's start by importing the packages (the module names below are the Python 2 ones; in Python 3, `urlopen` lives in `urllib.request` and `urljoin` in `urllib.parse`):

In [21]:
from bs4 import BeautifulSoup
import urllib
import urlparse

We then download the page's HTML using the `urlopen()` function from the `urllib` package, calling `read()` on the connection it opens to get the raw page content.

In [62]:
conn = urllib.urlopen('http://www.worldoftales.com/fairy_tales/Grimm_fairy_tales_Margaret_Hunt.html').read()
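
For readers on Python 3, the fetch step might look something like the sketch below. It is not the code used in this chapter: the helper name `fetch_html`, the timeout value and the UTF-8 fallback are all assumptions made for illustration. The demonstration uses a `data:` URL so that it runs without touching the network.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in the response, falling back to UTF-8
        # (the fallback is an assumption, not something the site guarantees).
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Offline demonstration: a data: URL carries its own content.
demo = fetch_html("data:text/html,<title>Demo</title>")
print(demo)
```

Swapping the `data:` URL for the address of the tale list would download the real page in the same way.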

We can now pass this HTML to BeautifulSoup, which will parse out all of the website elements and allow us to start finding and extracting our URLs.

In [63]:
main_page = BeautifulSoup(conn, "html5lib")

The method we need to use from BeautifulSoup is `find_all()`. [This method](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) has a couple of arguments of interest to us. The first, the `name` argument, restricts the search to tags with a given name. We will tell it to only search for tags called 'a'.

In [None]:
main_page.find_all('a')

You can see this has returned *all* of the links on this page, including stuff that is not relevant to our analysis. Luckily, `find_all()` also accepts [keyword arguments](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments) that filter on tag attributes. Passing `href` with a compiled regular expression keeps only the links whose destination matches it. Let's only keep those that contain 'Brothers_Grimm'.

In [65]:
import re

main_page.find_all('a', href = re.compile('Brothers_Grimm'))

[<a href="Brothers_Grimm/Margaret_Hunt/The_Frog-King,_or_Iron_Henry.html">1.The Frog-King, or Iron Henry</a>,
 <a href="Brothers_Grimm/Margaret_Hunt/Cat_and_Mouse_in_Partnership.html">2.Cat and Mouse in Partnership</a>,
 <a href="Brothers_Grimm/Margaret_Hunt/Our_Lady's_Child.html"> 3.Our Lady's Child</a>,
 <a href="Brothers_Grimm/Margaret_Hunt/The_Story_of_the_Youth_who_Went_Forth_to_Learn_What_Fear_Was.html">  4.The Story of the Youth who Went Forth to Learn What Fear Was</a>,
 <a href="Brothers_Grimm/Margaret_Hunt/The_Wolf_and_The_Seven_Little_Kids.html"> 5.The Wolf and The Seven Little Kids </a>,
 <a href="Brothers_Grimm/Margaret_Hunt/Faithful_John.html">6.Faithful John</a>,
 <a href="Brothers_Grimm/Margaret_Hunt/The_Good_Bargain.html">7.The Good Bargain</a>,
 <a href="Brothers_Grimm/Margaret_Hunt/The_Wonderful_Musician.html">8.The Wonderful Musician</a>,
 <a href="Brothers_Grimm/Margaret_Hunt/The_Twelve_Brothers.html">9.The Twelve Brothers</a>,
 <a href="Brothers_Grimm/Margaret_Hun
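
To make the filtering step concrete, here is a small Python 3 sketch of what the regular expression is doing, applied to a plain list of `href` strings rather than to the parsed page. The two non-tale entries are made up for illustration. BeautifulSoup tests a compiled pattern with its `search()` method, so the match can occur anywhere in the attribute value.

```python
import re

# href values of the kind a listing page might contain; the first and
# last entries are hypothetical examples of links we do not want.
hrefs = [
    "../index.html",
    "Brothers_Grimm/Margaret_Hunt/Faithful_John.html",
    "Brothers_Grimm/Margaret_Hunt/The_Good_Bargain.html",
    "http://www.worldoftales.com/contact.html",
]

pattern = re.compile("Brothers_Grimm")

# search() looks for the pattern anywhere in the string, not just at the start.
tale_hrefs = [href for href in hrefs if pattern.search(href)]
print(tale_hrefs)
```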

We can now loop over all of the links that match our criteria. As you would have seen from the last result, the `href` values on this page are relative links rather than full URLs. As such, we'll need to join each one to a prefix containing the rest of the URL using the `urljoin` function from `urlparse`. We can create an empty list and append each complete URL to it.

In [66]:
urls = []
url = "http://www.worldoftales.com/fairy_tales/"
for tag in main_page.find_all('a', href = re.compile('Brothers_Grimm')):
    match = urlparse.urljoin(url, tag['href'])
    urls.append(match)

In [67]:
urls[0:5]

[u'http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/The_Frog-King,_or_Iron_Henry.html',
 u'http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/Cat_and_Mouse_in_Partnership.html',
 u"http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/Our_Lady's_Child.html",
 u'http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/The_Story_of_the_Youth_who_Went_Forth_to_Learn_What_Fear_Was.html',
 u'http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/The_Wolf_and_The_Seven_Little_Kids.html']
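
The joining behaviour is worth a quick look on its own. The sketch below uses the Python 3 location of `urljoin` (`urllib.parse`) and shows three cases: a relative link joined to the prefix, the same join when the trailing slash is missing from the prefix, and a link that is already absolute.

```python
from urllib.parse import urljoin

base = "http://www.worldoftales.com/fairy_tales/"
href = "Brothers_Grimm/Margaret_Hunt/Faithful_John.html"

# A relative link is resolved against the directory the base points to.
joined = urljoin(base, href)

# Without the trailing slash, 'fairy_tales' is treated as a file name
# and is replaced rather than kept.
no_slash = urljoin(base.rstrip("/"), href)

# An absolute link is returned unchanged, whatever the base is.
absolute = urljoin(base, "http://www.grimmstories.com/de/grimm_maerchen/list")

print(joined)
print(no_slash)
print(absolute)
```

The second case is the one to watch for: dropping the final `/` from the prefix silently produces URLs that point at the wrong directory.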

Now that we have our URLs, we can use them to grab the text of each of the stories. Let's start by looking at one of the specific stories on the website.

<img src="/figure/Chapter 1.3.png" title="Developer tools 3" style="display: block; margin: auto;" />

We can see that the text sits in a 'div' tag, which just [marks a section](https://www.w3schools.com/tags/tag_div.asp) in the HTML document. Its `id` attribute, 'text', is a name that is meant to identify one specific element on the page. Finally, the class is 'GM', which is a name [shared by similar blocks of text](https://www.w3schools.com/html/html_classes.asp), typically so they can be styled the same way. Let's use this information to try and pull out the text of the first tale.

In [None]:
conn = urllib.urlopen("http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/The_Frog-King,_or_Iron_Henry.html").read()
tale_page = BeautifulSoup(conn, "html5lib")
tale_page.find_all("div", id = 'text', class_ = "GM")

Great, we have all of the text! The problem is that it is not actually text yet - it is stored in something called a ResultSet, which is a list of the matching tags. In order to pull the text out (and get rid of those leftover HTML tags), we need to use the `text` attribute on each element of the ResultSet.

In [None]:
for div in tale_page.find_all("div", id = 'text', class_ = "GM"):
    print(div.text)
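
To see what `find_all()` returns without hitting the website, here is a minimal Python 3 sketch run against a made-up snippet of HTML. The markup is an assumption based on the tags described above, not a copy of the real page, and it uses the built-in `html.parser` so that nothing beyond `bs4` needs installing.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a tale page; the real markup may differ.
html = """
<html><head><title>Brothers Grimm fairy tales - Example </title></head>
<body>
  <div id="menu" class="GM">Home</div>
  <div id="text" class="GM"><p>In old times when wishing still helped one.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Both conditions must hold, so the menu div (same class, other id) is skipped.
matches = soup.find_all("div", id="text", class_="GM")
print(len(matches))

# Each element is a Tag; .text drops the markup and keeps the characters.
tale_text = matches[0].text.strip()
print(tale_text)
```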

We can now grab the text of every tale using a similar approach to the one we used for the URLs. The difference is that we need to download and parse each URL inside the loop. Once that is done, we loop over the results to extract the text from each one.

In [70]:
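raw_tales = []
for url in urls:
    # Download and parse each tale page
    conn = urllib.urlopen(url).read()
    tale_page = BeautifulSoup(conn, "html5lib")
    # Keep the ResultSet of matching divs for this page
    tale = tale_page.find_all("div", id = 'text', class_ = "GM")
    raw_tales.append(tale)

# Flatten the ResultSets into a list of plain-text tales
english_tales = []
for tale in raw_tales:
    for div in tale:
        english_tales.append(div.text)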

We could use a similar method to extract the titles. However, BeautifulSoup exposes some parts of the page, such as the `<title>` tag, directly as attributes of the parsed document. This means we simply need to chain the `title` and `text` attributes to pull the title out as clean text.

In [71]:
conn = urllib.urlopen("http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/The_Frog-King,_or_Iron_Henry.html").read()
tale_page = BeautifulSoup(conn, "html5lib")
tale_page.title.text

u'Brothers Grimm fairy tales - The Frog-King, or Iron Henry '

Let's now do this for all of the titles.

In [72]:
english_titles = []
for url in urls:
    # Download and parse each tale page, then keep the text of its <title> tag
    conn = urllib.urlopen(url).read()
    tale_page = BeautifulSoup(conn, "html5lib")
    title = tale_page.title.text
    english_titles.append(title)

In [73]:
english_titles[0:5]

[u'Brothers Grimm fairy tales - The Frog-King, or Iron Henry ',
 u'Brothers Grimm fairy tales - Cat and Mouse in Partnership (Margaret Hunt)',
 u"Brothers Grimm fairy tales - Our Lady's Child",
 u'Brothers Grimm fairy tales - The Story of the Youth who Went Forth to Learn What Fear Was (Margaret Hunt)',
 u'Brothers Grimm fairy tales - The Wolf and The Seven Little Kids (Margaret Hunt)']
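
The two loops above download every page twice, and the list of tales will drift out of step with the list of titles if any page yields zero or several matching divs. One possible alternative, sketched here in Python 3, is to pull the title and the text out of each page in a single pass. The helper name `parse_tale` and the sample pages are hypothetical; the tag names are the ones identified above.

```python
from bs4 import BeautifulSoup


def parse_tale(html):
    """Return the title and body text of one tale page (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text().strip() if soup.title else None
    divs = soup.find_all("div", id="text", class_="GM")
    # Join every matching div so each page yields exactly one text entry.
    text = "\n".join(div.get_text().strip() for div in divs) or None
    return {"title": title, "text": text}


# Made-up pages standing in for downloaded HTML; the second has no tale div.
pages = [
    "<html><head><title>Tale one </title></head>"
    "<body><div id='text' class='GM'>Once upon a time.</div></body></html>",
    "<html><head><title>Tale two</title></head><body><p>Moved.</p></body></html>",
]

records = [parse_tale(html) for html in pages]
for record in records:
    print(record)
```

Because every page produces exactly one record, a page with missing text shows up as `None` rather than shifting all later tales up by one position.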

### German-language tales

We can apply the same methods to pull our German tales from [this site](http://www.grimmstories.com/de/grimm_maerchen/list). Bringing up the developer tools shows us that again, the hyperlinks are 'a' tags with an 'href' attribute.

<img src="/figure/Chapter 1.4.png" title="Developer tools 4" style="display: block; margin: auto;" />

Let's have a look at the URLs from the site, keeping only those containing 'grimm_maerchen', as this seems to be common to the tale links.

In [None]:
conn = urllib.urlopen("http://www.grimmstories.com/de/grimm_maerchen/list").read()
main_page = BeautifulSoup(conn, "html5lib")
main_page.find_all('a', href = re.compile('grimm_maerchen'))

Ahhh, we have a bit of an issue. We have picked up all of the tale links, but we have also picked up a number of extraneous links such as 'index' and 'favourites'. Luckily, these seem to be at the beginning and the end of the set of links, so we can simply trim them off once we've created our list. The German-language list also contains a number of additional tales not present in the English-language list (those under *In der Ausgabe letzter Hand nicht mehr enthalten*, or those no longer contained in the final version), so we should get rid of those so our datasets from the two languages match. You might have also noticed that this time we've gotten full links back, so we don't need to add a prefix.

In [75]:
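# Download and parse the German index page
conn = urllib.urlopen("http://www.grimmstories.com/de/grimm_maerchen/list").read()
main_page = BeautifulSoup(conn, "html5lib")

# These hrefs are already full URLs, so no prefix is needed
urls = []
for tag in main_page.find_all('a', href = re.compile('grimm_maerchen')):
    match = tag['href']
    urls.append(match)

# Trim the extraneous links at both ends and the tales missing from the English set
urls = urls[9:220]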
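
Slicing by position works, but it will silently return the wrong links if the site adds or removes a navigation link. A cheap safeguard is to check that the slice has the expected length, which should be the 201 stories plus 10 legends mentioned earlier. The Python 3 sketch below uses a made-up list in place of the scraped links; only the slice bounds come from the code above, and the number of trailing links is an assumption.

```python
# Hypothetical stand-in for the scraped list: 9 leading links, 211 tale
# links, then some trailing links (12 here, chosen arbitrarily).
links = (
    ["nav_%d" % i for i in range(9)]
    + ["tale_%d" % i for i in range(1, 212)]
    + ["extra_%d" % i for i in range(12)]
)

tale_links = links[9:220]

# 201 stories + 10 legends
expected = 201 + 10
if len(tale_links) != expected:
    raise ValueError("Expected %d tale links, got %d" % (expected, len(tale_links)))

print(len(tale_links), tale_links[0], tale_links[-1])
```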

Now let's look at pulling out our tale text. You can see from the developer tools that our text is tagged as 'div', with class 'text' and itemprop 'text'. Let's use this to grab all of our text in a list, as we did for the English-language tales.

In [76]:
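raw_tales = []
for url in urls:
    # Download and parse each tale page
    conn = urllib.urlopen(url).read()
    tale_page = BeautifulSoup(conn, "html5lib")
    # Keep the ResultSet of matching divs for this page
    tale = tale_page.find_all("div", itemprop = "text", class_ = "text")
    raw_tales.append(tale)

# Flatten the ResultSets into a list of plain-text tales
german_tales = []
for tale in raw_tales:
    for div in tale:
        german_tales.append(div.text)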

As with the English-language texts, we can also extract the titles using the `title` and `text` attributes. Let's run another loop to pull out these results.

In [77]:
german_titles = []
for url in urls:
    # Download and parse each tale page, then keep the text of its <title> tag
    conn = urllib.urlopen(url).read()
    tale_page = BeautifulSoup(conn, "html5lib")
    title = tale_page.title.text
    german_titles.append(title)

In [78]:
german_titles[0:5]

[u'Der Froschk\xf6nig oder der eiserne Heinrich - Br\xfcder Grimm',
 u'Katze und Maus in Gesellschaft - Br\xfcder Grimm',
 u'Marienkind - Br\xfcder Grimm',
 u'Von einem, der auszog, das F\xfcrchten zu lernen - Br\xfcder Grimm',
 u'Der Wolf und die sieben jungen Gei\xdflein - Br\xfcder Grimm']
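
The `\xf6`, `\xfc` and `\xdf` sequences in this output are not corrupted text. They are how the list display escapes the non-ASCII characters ö, ü and ß, and the strings themselves are intact. A small Python 3 sketch of the same idea, using the first title from the output above:

```python
# The escape \xf6 and the literal character are the same code point.
title = "Der Froschk\xf6nig oder der eiserne Heinrich - Br\xfcder Grimm"

# Printing the string shows the actual characters.
print(title)

# ascii() shows the escaped form, similar to the list output above.
print(ascii(title))
```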

### Putting it all in a DataFrame

Of course, there's not much we can do with a bunch of lists. The easiest way to make some useful data is to pop our four lists into a `pandas` DataFrame. To do this, we import the `DataFrame` class from `pandas` and pass it a dictionary in which each list is the value for one column name. The `columns` argument fixes the order of the columns.

In [86]:
from pandas import DataFrame

texts = DataFrame(
    {
        'english_titles': english_titles,
        'english_tales': english_tales,
        'german_titles': german_titles,
        'german_tales': german_tales
    },
    columns = ['english_titles', 'english_tales', 'german_titles', 'german_tales'])

In [87]:
texts[0:5]

| | english_titles | english_tales | german_titles | german_tales |
|---|---|---|---|---|
| 0 | Brothers Grimm fairy tales - The Frog-King, or... | \n In old times when wishing still hel... | Der Froschkönig oder der eiserne Heinrich - Br... | In den alten Zeiten, wo das Wünschen noch geho... |
| 1 | Brothers Grimm fairy tales - Cat and Mouse in ... | \n A certain cat had made the acquaint... | Katze und Maus in Gesellschaft - Brüder Grimm | Eine Katze hatte Bekanntschaft mit einer Maus ... |
| 2 | Brothers Grimm fairy tales - Our Lady's Child | \n Hard by a great forest dwelt a wood... | Marienkind - Brüder Grimm | Vor einem großen Walde lebte ein Holzhacker mi... |
| 3 | Brothers Grimm fairy tales - The Story of the ... | \n A certain father had two sons, the ... | Von einem, der auszog, das Fürchten zu lernen ... | Ein Vater hatte zwei Söhne, davon war der älte... |
| 4 | Brothers Grimm fairy tales - The Wolf and The ... | \n There was once on a time an old goa... | Der Wolf und die sieben jungen Geißlein - Brüd... | Es war einmal eine alte Geiß, die hatte sieben... |
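
Building a DataFrame from a dictionary of lists only works when all of the lists have the same length, so it can be worth checking this before constructing it. The Python 3 sketch below uses tiny made-up lists in place of the scraped ones; the column names match those used above, and the explicit length check is an addition for illustration.

```python
import pandas as pd

# Tiny hypothetical stand-ins for the four scraped lists.
columns = {
    "english_titles": ["Title A", "Title B"],
    "english_tales": ["Text A", "Text B"],
    "german_titles": ["Titel A", "Titel B"],
    "german_tales": ["Text A (de)", "Text B (de)"],
}

# Fail early, with a readable message, if any list is out of step.
lengths = {name: len(values) for name, values in columns.items()}
if len(set(lengths.values())) != 1:
    raise ValueError("Lists differ in length: %r" % lengths)

texts = pd.DataFrame(columns, columns=list(columns))
print(texts.shape)
```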
