
# Downloading and analysing text with Python and NLTK

Welcome to *Python* and the *IPython Notebook*! Today, we're demonstrating **Python**, a programming language, and **NLTK**, a Python library for working with language.

Whatever your area of study, Python can speed up repetitive tasks and make your work reproducible: anyone can quickly rerun exactly what you did.

In [None]:
subjects = ['Environmental Law', 'Family Law', 'Mergers & Acquisitions']

for subject in subjects:
    print("You're taking %s!? How interesting!" % subject)

# Downloading a lot of text

You can use Python to automatically download a bunch of text from the web.

## Import HTML parser, define and read a URL

In [None]:
from bs4 import BeautifulSoup
from urllib import urlopen

In [None]:
url = 'http://www.lawyersweekly.com.au'
raw = urlopen(url).read()
print(raw[:2000])  # peek at the start of the downloaded HTML
soup = BeautifulSoup(raw, 'html.parser')  # name the parser explicitly

## Get a list of all links in that URL

In [None]:
links = []
for anchor in soup.find_all('a'):
    href = anchor.get('href')
    if href and '/news/' in href and 'disqus' not in href:
        # only prefix the site URL onto relative links
        links.append(href if href.startswith('http') else url + href)

# remove duplicates
links = sorted(set(links))
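
The standard library's `urljoin` is another way to turn `href` values into full URLs, since it copes with relative and absolute links alike. Below is a minimal sketch of that approach; the helper name `absolute_news_links` and the sample hrefs are made up for illustration, and the import is written so that it should work on either Python 2 or Python 3.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

def absolute_news_links(base_url, hrefs):
    """Resolve hrefs against base_url, keep news links, drop duplicates."""
    found = set()
    for href in hrefs:
        if href and '/news/' in href and 'disqus' not in href:
            found.add(urljoin(base_url, href))
    return sorted(found)

# hypothetical hrefs, as they might come back from link.get('href')
sample = ['/news/first-story', '/news/first-story', None,
          'http://www.lawyersweekly.com.au/news/second-story',
          '/about-us']
for link in absolute_news_links('http://www.lawyersweekly.com.au', sample):
    print(link)
```

The duplicate, the missing `href` and the non-news link are all dropped, leaving two full URLs.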

## Our URLs

In [None]:
for link in links:
    print(link)

## Get article text from each URL

In [None]:
texts = []
for link in links:
    raw = urlopen(link).read()
    soup = BeautifulSoup(raw, 'html.parser')
    paras = soup.find_all('p')
    # keep paragraph text, skipping the site's own 'Lawyers Weekly' lines
    text = '\n'.join(para.text for para in paras
                     if not para.text.startswith('Lawyers Weekly'))
    texts.append(text)
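
Downloads can fail partway through a long list of links. Here is a minimal sketch of one way to skip failed pages and pause between requests. `fetch_all` and `fake_fetch` are hypothetical names, and the fetch function is passed in so that the example runs without a network connection; in this notebook you could pass something like `lambda link: urlopen(link).read()` instead.

```python
import time

def fetch_all(links, fetch, delay=1.0):
    """Call fetch(link) for each link, skipping any that raise an error."""
    pages, failed = [], []
    for link in links:
        try:
            pages.append(fetch(link))
        except Exception as err:  # broad on purpose: this is only a sketch
            failed.append((link, str(err)))
        time.sleep(delay)  # be polite to the server
    return pages, failed

# stand-in for a real download, so this runs offline
def fake_fetch(link):
    if 'broken' in link:
        raise IOError('could not open ' + link)
    return '<p>Story at %s</p>' % link

pages, failed = fetch_all(['/news/a', '/news/broken', '/news/b'],
                          fake_fetch, delay=0)
print('%d pages fetched, %d failed' % (len(pages), len(failed)))
```

Keeping a list of failures makes it easy to retry just those links later.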

## What did we get?

In [None]:
print('We have %d stories!\n' % len(texts))

In [None]:
print(texts[2])

# Analysing these texts

Let's turn our texts into a single item:

In [None]:
text = '\n'.join(texts)
print(text[:500])

Then, we turn our text into a list of words with NLTK:

In [None]:
import nltk
# if this raises a LookupError, NLTK's tokenizer data is missing: try nltk.download('punkt')
words = nltk.word_tokenize(text)
print(words[:50])
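
To get a feel for what tokenising does, here is a rough stand-in that uses only the standard library. It is not NLTK's algorithm (`word_tokenize` treats contractions, abbreviations and much else more carefully), and `simple_tokenize` is a made-up name for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("The court ruled, finally."))
```

Each word and each punctuation mark comes back as its own list item, which is the shape the later steps expect.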

With a list of words, we can then search for interesting patterns.

## Concordancing

In [None]:
searchable_text = nltk.Text(words)  # formats our tokens for concordancing
searchable_text.concordance("Australia")
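
A concordance shows every occurrence of a word together with the words around it. The sketch below does the same thing by hand on a plain list of tokens, so you can see the idea. It is a simplified illustration rather than NLTK's implementation; `concordance_lines` and the sample sentence are invented for the example, and matching is case-insensitive by choice.

```python
def concordance_lines(tokens, target, width=4):
    """Return each hit on target with `width` tokens of context either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append('%s [%s] %s' % (left, token, right))
    return lines

tokens = 'Firms in Australia are hiring while firms outside Australia wait'.split()
for line in concordance_lines(tokens, 'australia', width=2):
    print(line)
```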

## Keywording

In [None]:
from corpkit import keywords as get_keywords  # alias, so the results below don't overwrite the function
encoded_text = text.encode('utf-8', errors='ignore')
keywords, ngrams = get_keywords(encoded_text, dictionary='bnc.p')

Results?

In [None]:
for key in keywords[:25]:
    print(key)

In [None]:
for ngram in ngrams[:25]:
    print(ngram)
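
If you don't have a keywording tool to hand, the simplest starting point is raw counting. The sketch below counts the most frequent words and two-word sequences with the standard library. It is not corpkit's method and makes no comparison with a reference word list, so common words like "the" will dominate; `top_words` and `top_bigrams` are made-up names.

```python
from collections import Counter

def top_words(tokens, n=3):
    """Most frequent lowercased alphabetic tokens."""
    counts = Counter(t.lower() for t in tokens if t.isalpha())
    return counts.most_common(n)

def top_bigrams(tokens, n=3):
    """Most frequent pairs of adjacent tokens."""
    pairs = zip(tokens, tokens[1:])
    return Counter(pairs).most_common(n)

tokens = 'the court heard the appeal and the court adjourned'.split()
print(top_words(tokens, 2))
print(top_bigrams(tokens, 1))
```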

## Other ideas ...

Using Python and/or NLTK, you can automatically:

* Group texts into topics
* Analyse sentiment
* Annotate the text for grammatical features, for more advanced searching
* Quantify the tone of texts
* Sort texts by the likelihood of their containing what you want
* Archive texts cleanly
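
As one small illustration of the sorting idea in the list above, texts can be ranked by how often they mention the terms you care about. This is a deliberately crude sketch using plain substring counts, not a library routine; `rank_texts` and the sample texts are invented for the example.

```python
def rank_texts(texts, terms):
    """Sort texts by total occurrences of the search terms, highest first."""
    def score(text):
        lowered = text.lower()
        return sum(lowered.count(term.lower()) for term in terms)
    return sorted(texts, key=score, reverse=True)

sample_texts = ['Merger talks stall.',
                'Family law reform: a family court overhaul.',
                'New family law partner appointed.']
for text in rank_texts(sample_texts, ['family', 'court']):
    print(text)
```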

## Use this Notebook and code!

**Head to [github.com/resbaz/nltk](https://www.github.com/resbaz/nltk) to access these materials and more.**