# Chapter 1: Getting text data for analysis

One of the best sources of data for text mining is, of course, the web. There are two main ways of getting text data from it. If we are lucky, a website will have a public API, which is simply an endpoint that we can query and extract data from. If we are unlucky, we will need to scrape the website, which is much messier and can require a bit more clean-up.

For this book, we will be analysing Grimm's fairytales. The [full set](https://en.wikipedia.org/wiki/Grimms%27_Fairy_Tales) of these comprises 201 stories and 10 legends, and was originally published in German as *Kinder- und Hausmärchen* (*Children's and Household Tales*) in 1812, and translated into English as *Household Tales by the Brothers Grimm* in 1884. Given that there are no sites hosting these tales with a public API, we will need to scrape some sites to put together our dataset. As such, I won't cover accessing data through an API in this book, but if you are interested in getting data this way you can have a look at a couple of previous text mining projects I have done using data from the [reddit](http://t-redactyl.io/blog/2015/11/analysing-reddit-data-part-1-setting-up-the-environment.html) and [Twitter](http://t-redactyl.io/blog/2017/04/applying-sentiment-analysis-with-vader-and-the-twitter-api.html) APIs.

## Getting the text - the hacky version

The idea behind web scraping is not too complicated: we want to retrieve data which is on a website (usually in HTML format) and convert it into a format that we can use for an analysis (generally a dataframe). The tricky part can be finding the correct way to select the content that is relevant to us, as websites contain a lot of stuff that we probably don't need. To do this, we need to find the HTML tag(s) that identify the content we want, and pray that the people who built the website have used the tags consistently for the same sort of information!

I find that the easiest way to identify the relevant tags for your data is to use the developer tools built into browsers like Chrome. Essentially, all you need to do to use these (in Chrome) is go to the website you want to scrape, right-click on the content you are interested in, and select 'Inspect'. This will open the developer tools panel with the HTML for that part of the page highlighted.




In [21]:
# Experiment: fetch the page for one specific fairytale and parse its HTML.
# Note: this notebook was run under Python 2, where urlopen lives directly in urllib.
from bs4 import BeautifulSoup
import urllib

r = urllib.urlopen('http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/The_Spindle,_The_Shuttle,_and_the_Needle.html').read()
soup = BeautifulSoup(r)
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [114]:
#print(soup.prettify())

In [36]:
# The text of the story sits inside <div id="text" class="GM">.
tale = soup.find_all("div", id='text', class_="GM")

In [115]:
#tale
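
To make the selection step a bit more concrete without hitting the network, here is a minimal sketch that runs `find_all` on a small hand-written HTML snippet. The snippet is made up for illustration (only the `div` with `id="text"` and `class="GM"` mirrors the selector used above), and passing `"html.parser"` explicitly is my own choice so that the result does not depend on which parser happens to be installed. Unlike the notebook cells, it is written for Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: only the <div id="text" class="GM"> mirrors
# the selector used in the notebook; everything else is made up.
html = """
<html>
  <head><title>Example tale</title></head>
  <body>
    <div id="menu" class="GM">Home | About</div>
    <div id="text" class="GM">
      <p>Once upon a time there was a spindle.</p>
    </div>
    <div id="footer">Contact us</div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet of every tag matching ALL the filters,
# so the "menu" div (right class, wrong id) is not selected.
matches = soup.find_all("div", id="text", class_="GM")

# get_text() drops the markup and keeps only the text inside the tag.
tale_text = matches[0].get_text().strip()
print(len(matches), repr(tale_text))
```

Because `find_all` always returns a collection, even a single match has to be pulled out by index (or looped over) before its text can be read.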

In [38]:
# Now having a look at how to extract all of the URLs and put them in a list.
from bs4 import BeautifulSoup
import urllib

r = urllib.urlopen('http://www.worldoftales.com/fairy_tales/Grimm_fairy_tales_Margaret_Hunt.html').read()
soup = BeautifulSoup(r)

In [54]:
import re
import urlparse

# Keep only the links to the Grimm's fairytales and turn each into a full URL.
urls = []
url = "http://www.worldoftales.com/fairy_tales/"
for tag in soup.find_all('a', href=re.compile('Brothers_Grimm')):
    match = urlparse.urljoin(url, tag['href'])
    urls.append(match)

In [124]:
#urls
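
The URL-collecting loop relies on two small pieces: a regular expression that keeps only the links mentioning `Brothers_Grimm`, and `urljoin`, which resolves each `href` against a base address. The sketch below shows both on a made-up list of `href` values (the list is an assumption, not what the index page actually contains). It is written for Python 3, where `urljoin` lives in `urllib.parse` rather than in the Python 2 `urlparse` module.

```python
import re
from urllib.parse import urljoin  # Python 3 home of Python 2's urlparse.urljoin

base = "http://www.worldoftales.com/fairy_tales/"
pattern = re.compile("Brothers_Grimm")

# Hypothetical href values, standing in for what an index page might contain.
hrefs = [
    "Brothers_Grimm/Margaret_Hunt/Example_Tale.html",  # relative link
    "http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Other_Tale.html",  # already absolute
    "../index.html",  # not a Grimm tale, so it is filtered out
]

# Keep only hrefs that mention 'Brothers_Grimm', then resolve them against base.
urls = [urljoin(base, href) for href in hrefs if pattern.search(href)]

for u in urls:
    print(u)
```

Relative links are appended to the base, while links that are already absolute pass through unchanged, so the same loop copes with either style of `href`.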

In [59]:
# Now extract all of the texts from each URL and put in a list.
tales = []
for url in urls:
    r = urllib.urlopen(url).read()
    soup = BeautifulSoup(r)
    tale = soup.find_all("div", id = 'text', class_ = "GM")
    tales.append(tale)

In [117]:
#tales[0]
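
The loop above makes one request per tale, so a single failed download would stop the whole run. As a rough sketch of a more forgiving version, assuming Python 3, the function below pauses between requests and records failures instead of raising. The names `fetch_page`, `fetch_pages`, `fetch` and `delay` are my own, and the demonstration uses a stand-in fetcher so that no real requests are made.

```python
import time
from urllib.error import URLError
from urllib.request import urlopen


def fetch_page(url):
    """Download one page and return its raw bytes (Python 3 urllib)."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def fetch_pages(urls, fetch=fetch_page, delay=1.0):
    """Fetch every URL in order; return (pages, failed).

    `pages` maps each successfully fetched URL to its content, and `failed`
    lists (url, error message) pairs. `fetch` is injectable so the loop can
    be exercised without network access.
    """
    pages, failed = {}, []
    for i, url in enumerate(urls):
        try:
            pages[url] = fetch(url)
        except (URLError, OSError) as err:
            # Record the failure and carry on rather than losing earlier work.
            failed.append((url, str(err)))
        if delay and i < len(urls) - 1:
            time.sleep(delay)  # be polite to the server between requests
    return pages, failed


# Offline demonstration with a stand-in fetcher (no real requests are made).
def fake_fetch(url):
    if url.endswith("missing.html"):
        raise URLError("not found")
    return b"<html>" + url.encode("utf-8") + b"</html>"


pages, failed = fetch_pages(["a.html", "missing.html", "b.html"],
                            fetch=fake_fetch, delay=0)
print(len(pages), len(failed))
```

Keying the results by URL keeps each page attached to its address even when some downloads fail, which matters later when texts and titles are paired up.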

## To do

- [ ] Extract titles for each tale (in a separate list for ease of feeding into Pandas dataframe)
- [ ] Work out if it is possible to pull out cleaner text from the URLs without all of the HTML tags

In [99]:
clean_tales = []

for tale in tales:
    for div in tale:
        clean_tales.append(div.text)

In [118]:
#clean_tales
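
Taking `div.text` removes the HTML tags but leaves the page's whitespace behind, which is why the texts in the dataframe preview further down start with `\n` and a run of spaces. One simple way to tidy this up is sketched below. The helper name `normalise_whitespace` and the sample string are made up for illustration, and the code is written for Python 3.

```python
def normalise_whitespace(text):
    """Collapse runs of spaces, tabs and newlines into single spaces."""
    # str.split() with no arguments splits on any whitespace and drops empty
    # strings, so joining the pieces leaves exactly one space between words.
    return " ".join(text.split())


# Made-up sample in the shape of the scraped text (leading newline, padding).
raw = "\n        In old times when wishing\n still   helped one, there lived a king \n"
clean = normalise_whitespace(raw)
print(repr(clean))
```

Note that this also flattens paragraph breaks into spaces, so it only suits analyses that do not need the original paragraph structure.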

In [83]:
from pandas import DataFrame

In [122]:
p = DataFrame({'text': clean_tales})

In [123]:
p[0:5]

|   | text |
|---|------|
| 0 | \n In old times when wishing still hel... |
| 1 | \n A certain cat had made the acquaint... |
| 2 | \n Hard by a great forest dwelt a wood... |
| 3 | \n A certain father had two sons, the ... |
| 4 | \n There was once on a time an old goa... |


In [104]:
r = urllib.urlopen('http://www.worldoftales.com/fairy_tales/Brothers_Grimm/Margaret_Hunt/The_Spindle,_The_Shuttle,_and_the_Needle.html').read()
soup = BeautifulSoup(r)
soup.title.text

u'Brothers Grimm fairy tales - The Spindle, The Shuttle, and the Needle'

In [105]:
titles = []
for url in urls:
    r = urllib.urlopen(url).read()
    soup = BeautifulSoup(r)
    title = soup.title.text
    titles.append(title)

In [120]:
titles[0:5]

[u'Brothers Grimm fairy tales - The Frog-King, or Iron Henry ',
 u'Brothers Grimm fairy tales - Cat and Mouse in Partnership (Margaret Hunt)',
 u"Brothers Grimm fairy tales - Our Lady's Child",
 u'Brothers Grimm fairy tales - The Story of the Youth who Went Forth to Learn What Fear Was (Margaret Hunt)',
 u'Brothers Grimm fairy tales - The Wolf and The Seven Little Kids (Margaret Hunt)']
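
The page titles all carry the same site-wide prefix, and some also end with the translator's name or a stray space. If only the tale name is wanted, something along these lines could strip the extras. This is a sketch based purely on the titles shown above, written for Python 3, and the helper name `clean_title` is hypothetical; other pages may follow patterns it does not handle.

```python
import re

PREFIX = "Brothers Grimm fairy tales - "
# Optional translator note at the end of some page titles.
SUFFIX = re.compile(r"\s*\(Margaret Hunt\)\s*$")


def clean_title(title):
    """Strip the site-wide prefix, the translator suffix and stray whitespace."""
    if title.startswith(PREFIX):
        title = title[len(PREFIX):]
    return SUFFIX.sub("", title).strip()


# The first three page titles from the output above.
raw_titles = [
    "Brothers Grimm fairy tales - The Frog-King, or Iron Henry ",
    "Brothers Grimm fairy tales - Cat and Mouse in Partnership (Margaret Hunt)",
    "Brothers Grimm fairy tales - Our Lady's Child",
]
short_titles = [clean_title(t) for t in raw_titles]
print(short_titles)
```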

In [107]:
len(tales)

211

In [108]:
len(titles)

211

In [110]:
p = DataFrame({'titles': titles,
               'text': clean_tales})

In [121]:
p[0:5]

|   | text | titles |
|---|------|--------|
| 0 | \n In old times when wishing still hel... | Brothers Grimm fairy tales - The Frog-King, or... |
| 1 | \n A certain cat had made the acquaint... | Brothers Grimm fairy tales - Cat and Mouse in ... |
| 2 | \n Hard by a great forest dwelt a wood... | Brothers Grimm fairy tales - Our Lady's Child |
| 3 | \n A certain father had two sons, the ... | Brothers Grimm fairy tales - The Story of the ... |
| 4 | \n There was once on a time an old goa... | Brothers Grimm fairy tales - The Wolf and The ... |
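
The dataframe pairs titles and texts purely by position, which is why it was worth checking that both lists have 211 entries first. The sketch below makes that check explicit and also pins the column order. As far as I know, older pandas versions sorted dictionary keys alphabetically, which would explain `text` appearing before `titles` above. The two short lists are stand-ins for the real data, the name `tales_df` is my own, and the code assumes Python 3 with pandas installed.

```python
import pandas as pd

# Tiny stand-ins for the scraped lists (the real ones hold 211 items each).
titles = ["The Frog-King, or Iron Henry", "Cat and Mouse in Partnership"]
clean_tales = ["In old times when wishing still helped one ...",
               "A certain cat had made the acquaintance of a mouse ..."]

# The two lists are paired purely by position, so a length mismatch would
# mean at least one tale is attached to the wrong title.
assert len(titles) == len(clean_tales), "titles and texts are out of step"

# Passing `columns` fixes the column order instead of relying on dict ordering.
tales_df = pd.DataFrame({"titles": titles, "text": clean_tales},
                        columns=["titles", "text"])
print(tales_df.shape)
```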


In [113]:
p.to_csv('/Users/jodieburchell/Documents/text-cleaning/Scraping the project text/raw_data.csv',
         encoding='utf-8')

In [125]:
r = urllib.urlopen('http://www.grimmstories.com/de/grimm_maerchen/list').read()
soup = BeautifulSoup(r)
soup.findAll('a')


[<a href="http://www.grimmstories.com/zh/grimm_tonghua/list">ZH</a>,
 <a href="http://www.grimmstories.com/vi/grimm_truyen/list">VI</a>,
 <a href="http://www.grimmstories.com/tr/grimm_masallari/list">TR</a>,
 <a href="http://www.grimmstories.com/ru/grimm_skazki/list">RU</a>,
 <a href="http://www.grimmstories.com/ro/grimm_basme/list">RO</a>,
 <a href="http://www.grimmstories.com/pt/grimm_contos/list">PT</a>,
 <a href="http://www.grimmstories.com/pl/grimm_basnie/list">PL</a>,
 <a href="http://www.grimmstories.com/nl/grimm_sprookjes/list">NL</a>,
 <a href="http://www.grimmstories.com/ko/grimm_donghwa/list">KO</a>,
 <a href="http://www.grimmstories.com/ja/grimm_dowa/list">JA</a>,
 <a href="http://www.grimmstories.com/it/grimm_fiabe/list">IT</a>,
 <a href="http://www.grimmstories.com/hu/grimm_mesek/list">HU</a>,
 <a href="http://www.grimmstories.com/fr/grimm_contes/list">FR</a>,
 <a href="http://www.grimmstories.com/fi/grimm_sadut/list">FI</a>,
 <a href="http://www.grimmstories.com/es/grimm