# Chapter 1: Getting text data for analysis

One of the best sources of data for text mining is, of course, the web. There are two main ways of sourcing text data. If we are lucky, a website will have a public API, which is simply an endpoint that we can query and extract data from. If we are unlucky, we will need to scrape the website, which is much messier and can require a bit more clean up.

For this book, we will be analysing Grimm's fairytales. The [full set](https://en.wikipedia.org/wiki/Grimms%27_Fairy_Tales) of these comprises 201 stories and 10 legends, and was originally published in German as *Kinder- und Hausmärchen* (*Children's and Household Tales*) in 1812, and translated into English as *Household Tales by the Brothers Grimm* in 1884. Given that there are no sites hosting these tales with a public API, we will need to scrape some sites to put together our dataset. As such, I won't cover accessing data through an API in this book, but if you are interested in getting data this way you can have a look at a couple of previous text mining projects I have done using data from the [reddit](http://t-redactyl.io/blog/2015/11/analysing-reddit-data-part-1-setting-up-the-environment.html) and [Twitter](http://t-redactyl.io/blog/2017/04/applying-sentiment-analysis-with-vader-and-the-twitter-api.html) APIs.

## Getting the text - the hacky version

The idea behind web scraping is not too complicated: we want to retrieve data which is on a website (usually in HTML format) and convert it into a format that we can use for an analysis (generally a dataframe). The tricky part can be finding the correct way to select the content that is relevant to us, as websites contain a lot of stuff that we probably don't need. To do this, we need to find the HTML tag(s) that identify the content we want, and pray that the people who built the website have used the tags consistently for the same types of information! I find that the easiest way to identify the relevant tags for your data is to use the developer tools built into browsers like Chrome. Let's talk through this with a concrete example.

We will be pulling the English-language tales from [this website](http://www.worldoftales.com/fairy_tales/Grimm_fairy_tales_Margaret_Hunt.html). Once we've navigated to this page, we can open up our developer tools (in Chrome) by right-clicking and selecting 'Inspect'.

<img src="/figure/Chapter 1.1.png" title="Developer tools 1" style="display: block; margin: auto;" />

Once you’ve done that, the developer tools will open on the right of the screen. In the image above, I have highlighted a button that allows you to view the tags associated with any element of the page. If you click on this and select one of the tale names, it will take you to the tag for that title, like so:

<img src="/figure/Chapter 1.2.png" title="Developer tools 2" style="display: block; margin: auto;" />

Looking at the first tale, *The Frog-King, or Iron Henry*, we can see that the link is tagged with ['a'](https://www.w3schools.com/tags/tag_a.asp), which indicates it is a hyperlink, and has an 'href' attribute, which gives the link's destination. A quick look down the list of tales indicates that they all have the same tagging, which means we can use this to pull out a list of the URLs linking to the full tale text. We can now use a package called `bs4` (short for BeautifulSoup 4), as well as `urllib` and `urlparse`, to help us extract these URLs from this mess of HTML. Note that the code in this chapter is written for Python 2, where `urlopen` lives in `urllib` and `urljoin` in `urlparse`. Let's start by importing the packages:

In [21]:
from bs4 import BeautifulSoup
import urllib
import urlparse

We then download the page using the `urlopen` function from the `urllib` package and call `read()` on the result, which gives us the raw HTML of the page as a string.

In [2]:
conn = urllib.urlopen('http://www.worldoftales.com/fairy_tales/Grimm_fairy_tales_Margaret_Hunt.html').read()
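
For readers on Python 3, the same fetch can be sketched as follows. This is a translation under stated assumptions rather than the code used to build the dataset: in Python 3 `urlopen` lives in `urllib.request`, and `read()` returns bytes that need decoding. The helper name `fetch_html`, the 10-second timeout and the UTF-8 fallback are illustrative choices, and the `data:` URL simply stands in for the real page so that the example runs offline.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes in Python 3
        # Use the declared charset if there is one, otherwise assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# A data: URL stands in for the real site so this runs without a network connection.
demo_html = fetch_html("data:text/html,<p>Once-upon-a-time</p>")
print(demo_html)
```

Swapping the `data:` URL for the address of the index page should give the same HTML that `conn` holds above, only as decoded text.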

We can now pass this HTML to BeautifulSoup, which will parse out all of the website elements and allow us to start finding and extracting our URLs. The second argument, `"html5lib"`, names the parser that BeautifulSoup should use.

In [5]:
main_page = BeautifulSoup(conn, "html5lib")

The method we need to use from BeautifulSoup is `find_all()`. [This method](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all) has a couple of arguments of interest to us. The first, the `name` argument, will only search for tags with certain names. We will tell it to only search for tags called 'a'.

In [None]:
main_page.find_all('a')

You can see this has returned *all* of the links on this page, including stuff that is not relevant to our analysis. Luckily we can use [another argument](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-keyword-arguments), `href`, which filters tags on the value of their 'href' attribute and accepts a regular expression. Let's only keep those whose 'href' contains 'Brothers_Grimm'.

In [None]:
import re

main_page.find_all('a', href = re.compile('Brothers_Grimm'))
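
To see the two filters side by side without hitting the website, here is a small sketch on a made-up HTML fragment. The markup is an assumption about what the index page roughly looks like, not a copy of it, and it uses Python's built-in `"html.parser"` rather than `html5lib` so that nothing beyond `bs4` needs installing. When `find_all()` is given a compiled regular expression, it matches with `search()`, so the pattern can appear anywhere in the attribute value.

```python
import re

from bs4 import BeautifulSoup

# A made-up fragment standing in for the index page (the real markup may differ).
sample_html = """
<ul>
  <li><a href="Brothers_Grimm/Margaret_Hunt/The_Frog-King,_or_Iron_Henry.html">The Frog-King, or Iron Henry</a></li>
  <li><a href="Brothers_Grimm/Margaret_Hunt/Cat_and_Mouse_in_Partnership.html">Cat and Mouse in Partnership</a></li>
  <li><a href="/contact.html">Contact</a></li>
  <li><a name="top">An anchor with no href</a></li>
</ul>
"""

sample_page = BeautifulSoup(sample_html, "html.parser")

# Every 'a' tag, relevant or not.
all_links = sample_page.find_all("a")

# Only the 'a' tags whose href contains 'Brothers_Grimm'.
grimm_links = sample_page.find_all("a", href=re.compile("Brothers_Grimm"))

print(len(all_links), "links in total")
for tag in grimm_links:
    print(tag["href"])
```

Tags without an 'href' at all, like the last one in the fragment, are simply skipped by the filter.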

We can now loop over all of the links that match our criteria. As you would have seen from the last result, the 'href' of each link only contains a partial (relative) URL. As such, we'll need to join each one to a prefix containing the rest of the address, using the `urljoin` function from `urlparse`. We can now create an empty list and append all of the full URLs to this.

In [22]:
urls = []
url = "http://www.worldoftales.com/fairy_tales/"
for tag in main_page.find_all('a', href = re.compile('Brothers_Grimm')): # Specifically only want URLS for the Grimm's fairytales
    match = urlparse.urljoin(url, tag['href'])
    urls.append(match)

In [27]:
urls[0:5]

[u'http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/The_Frog-King,_or_Iron_Henry.html',
 u'http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/Cat_and_Mouse_in_Partnership.html',
 u"http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/Our_Lady's_Child.html",
 u'http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/The_Story_of_the_Youth_who_Went_Forth_to_Learn_What_Fear_Was.html',
 u'http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/The_Wolf_and_The_Seven_Little_Kids.html']
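
The joining step is worth a closer look, because the trailing slash on the prefix matters. The sketch below uses the Python 3 location of `urljoin` (`urllib.parse`), and the relative link is an assumed example of what an 'href' on the index page looks like.

```python
from urllib.parse import urljoin  # Python 3 home of urlparse.urljoin

base = "http://www.worldoftales.com/fairy_tales/"
# Assumed form of a relative href taken from the index page.
relative = "Brothers_Grimm/Margaret_Hunt/Cat_and_Mouse_in_Partnership.html"

# With the trailing slash, the relative link is appended to the directory.
full_url = urljoin(base, relative)
print(full_url)

# Without it, 'fairy_tales' is treated as a file name and gets replaced.
wrong_url = urljoin("http://www.worldoftales.com/fairy_tales", relative)
print(wrong_url)
```

The first call reproduces the kind of URL shown in the output above, while the second silently drops the `fairy_tales` part of the path.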

Now that we have our URLs, we can use them to grab the text of each of the stories. Let's start by looking at one of the specific stories on the website.

<img src="/figure/Chapter 1.3.png" title="Developer tools 3" style="display: block; margin: auto;" />

We can see that the text is tagged as 'div', which just [marks a section](https://www.w3schools.com/tags/tag_div.asp) in the HTML document. The class is 'GM', which just means that [all similar blocks of text](https://www.w3schools.com/html/html_classes.asp) have been given this name. To pull it out, we can again use `find_all()`, this time matching on the tag name, its id ('text') and its class.

In [None]:
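soup = BeautifulSoup(urllib.urlopen(urls[0]).read(), "html5lib") # parse the first tale's page
tale = soup.find_all("div", id = 'text', class_ = "GM") # the div holding the story text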

In [None]:
#tale

In [None]:
# Extract all of the tale URLs from the index page and put them in a list.
from bs4 import BeautifulSoup
import urllib

r = urllib.urlopen('http://www.worldoftales.com/fairy_tales/Grimm_fairy_tales_Margaret_Hunt.html').read()
soup = BeautifulSoup(r, "html5lib") # name the parser explicitly, as before

In [None]:
urls = []
url = "http://www.worldoftales.com/fairy_tales/"
for tag in soup.find_all('a', href = re.compile('Brothers_Grimm')): # Specifically only want URLs for the Grimm's fairytales
    match = urlparse.urljoin(url, tag['href'])
    urls.append(match)

In [None]:
#urls

In [None]:
# Now extract the text of each tale from its URL and put the results in a list.
tales = []
for url in urls:
    r = urllib.urlopen(url).read()
    soup = BeautifulSoup(r, "html5lib")
    tale = soup.find_all("div", id = 'text', class_ = "GM") # the div(s) holding the story text
    tales.append(tale)

In [None]:
#tales[0]

## To do

- [ ] Extract titles for each tale (in a separate list for ease of feeding into Pandas dataframe)
- [ ] Work out if it is possible to pull out cleaner text from the URLs without all of the HTML tags

In [None]:
clean_tales = []

for tale in tales:
    for div in tale:
        clean_tales.append(div.text)
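
The nested loop above quietly assumes that every page has exactly one matching div: a page with none adds nothing to `clean_tales`, and a page with two adds two entries. A slightly more defensive version might look like the sketch below. The function name `extract_tale_text` and the sample page are made up for illustration, and `"html.parser"` is used only so that the example needs no extra parser installed.

```python
from bs4 import BeautifulSoup


def extract_tale_text(html):
    """Return the story text from one tale page, or None if the div is missing."""
    page = BeautifulSoup(html, "html.parser")
    div = page.find("div", id="text", class_="GM")
    if div is None:
        return None
    # Join the text nodes with spaces, then collapse runs of whitespace.
    return " ".join(div.get_text(" ").split())


# A made-up page standing in for one downloaded tale.
sample_tale = """
<html><head><title>The Frog-King, or Iron Henry</title></head>
<body>
  <div id="menu">Home</div>
  <div id="text" class="GM"><p>In old times when wishing still helped one,</p>
  <p>there lived a king.</p></div>
</body></html>
"""

print(extract_tale_text(sample_tale))
print(extract_tale_text("<html><body><p>No story here</p></body></html>"))
```

Returning `None` for a missing div keeps one entry per URL, so the list of texts stays the same length as the list of URLs.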

In [None]:
#clean_tales

In [None]:
from pandas import DataFrame

In [None]:
p = DataFrame({'text': clean_tales})

In [None]:
p[0:5]

In [None]:
# Check that the page's <title> tag holds the name of the tale.
r = urllib.urlopen('http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/The_Spindle,_The_Shuttle,_and_the_Needle.html').read()
soup = BeautifulSoup(r, "html5lib")
soup.title.text

In [None]:
# Collect the title of each tale, in the same order as the URLs.
titles = []
for url in urls:
    r = urllib.urlopen(url).read()
    soup = BeautifulSoup(r, "html5lib")
    title = soup.title.text
    titles.append(title)
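
The titles loop downloads every page a second time. One possible alternative, sketched here under assumptions, is to parse each page once and pull out the title and the text together. The function name `parse_tale`, the `pages` dictionary and its contents are all hypothetical stand-ins for the downloaded HTML, and `"html.parser"` is used to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup


def parse_tale(html):
    """Return (title, text) for one tale page; either may be None if absent."""
    page = BeautifulSoup(html, "html.parser")
    title_tag = page.title
    title = title_tag.get_text(strip=True) if title_tag is not None else None
    div = page.find("div", id="text", class_="GM")
    text = div.get_text(" ", strip=True) if div is not None else None
    return title, text


# Made-up pages standing in for downloaded HTML, keyed by a hypothetical URL.
pages = {
    "tale_1.html": "<html><head><title> Tale one </title></head>"
                   "<body><div id='text' class='GM'>Once upon a time.</div></body></html>",
    "tale_2.html": "<html><body><div id='other'>No title, no story.</div></body></html>",
}

records = []
for url, html in pages.items():
    title, text = parse_tale(html)
    records.append({"url": url, "title": title, "text": text})

for record in records:
    print(record)
```

Because each record carries its own URL, title and text, a page that is missing one of them shows up as a `None` rather than shifting everything after it out of line.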

In [None]:
titles[0:5]

In [None]:
len(clean_tales) # the DataFrame is built from clean_tales, so this is the length that must match titles

In [None]:
len(titles)

In [None]:
p = DataFrame({'titles': titles,
               'text': clean_tales})

In [None]:
p[0:5]

In [None]:
p.to_csv('/Users/jodieburchell/Documents/text-cleaning/Scraping the project text/raw_data.csv',
         encoding='utf-8')
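
As a quick sanity check on this last step, the sketch below builds a small dataframe from made-up lists, writes it out as CSV and reads it back. The data, the `tales_df` name and the in-memory buffer are illustrative assumptions; with a real file you would pass a path to `to_csv` instead. Passing `index=False` is a design choice that leaves the row numbers out of the file.

```python
import io

import pandas as pd

# Made-up stand-ins for the scraped lists.
titles = ["Tale one", "Tale two"]
clean_tales = ["Once upon a time.", "There was once a king."]

# DataFrame raises a ValueError if the columns differ in length, so check first.
if len(titles) != len(clean_tales):
    raise ValueError("titles and texts are out of step")

tales_df = pd.DataFrame({"titles": titles, "text": clean_tales})

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
tales_df.to_csv(buffer, index=False)

# Read the CSV back in to confirm nothing was lost on the way.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip)
```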

In [None]:
# Inspect all of the links on a second site's index page.
r = urllib.urlopen('http://www.grimmstories.com/de/grimm_maerchen/list').read()
soup = BeautifulSoup(r, "html5lib")
soup.find_all('a')

