<br>
<img style="float:left" src="http://ipython.org/_static/IPy_header.png" />
<br>

# Session 2: Common NLTK tasks

<br>
In this session we provide a quick introduction to the field of *corpus linguistics*. We then work through common uses of NLTK in this field, such as sentence segmentation, tokenisation and stemming. NLTK often has built-in methods for performing these tasks. As a learning exercise, however, we will sometimes build basic tools from scratch.

## Corpus linguistics

Though corpus linguistics has been around since the 1950s, it is only in the last 20 years that its methods have been made available to individual researchers, through GUIs including [Wordsmith Tools](http://www.lexically.net/wordsmith/) and [AntConc](http://www.laurenceanthony.net/software.html).

Alongside the development of GUIs, there has also been a shift from *general, balanced corpora* (corpora seeking to represent a language generally) toward *specialised corpora* (corpora containing texts of one specific type, from one speaker, etc.). More and more commonly, texts are taken from the Web.

> **Note:** We'll discuss building corpora from online texts in a bit more detail tomorrow afternoon.

After a long period of resistance, corpus linguistics has gained acceptance within a number of research areas. A few popular applications are within:

* **Lexicography** (creating usage-based definitions of words and locating real examples)
* **Language pedagogy** (advanced language learners can use a concordancing GUI or collocation tests to understand how certain words are used in the target language)
* **Discourse analysis** (researching how meaning is made beyond the level of the clause/sentence)

Notably, corpus linguistic methods have been embraced within the emerging paradigm of Digital Humanities, where it's sometimes called *distant reading*.

### Corpora and discourse

As hardware, software and data become more and more available, people have started using corpus linguistic methods for discourse-analytic work. Paul Baker refers to the combination of corpus linguistics and (critical) discourse analysis as a [*useful methodological synergy*](#ref:baker). Corpora bring objectivity and empiricism to a qualitative, interpretative tradition, while discourse-analytic methods provide corpus linguistics with a means of contextualising abstracted results.

Within this area, researchers rely on corpora to varying extents. In *corpus-driven* discourse analysis, researchers interpret the corpus based on the findings of the corpus interrogation. In *corpus-assisted* discourse analysis, researchers may use corpora to provide evidence about the way a given person/idea/discourse is commonly represented by certain people/in certain publications etc.

Our work here falls under the *corpus-driven* heading, as we are exploring the dataset without any major hypotheses in mind.

> **Note:** Some linguists remain skeptical of corpus linguistics generally. In a well-known critique, Henry Widdowson ([2000, p. 6-7](#ref:widdowson)) said:
>
> Corpus linguistics \[...\] (there) is no doubt that this is an immensely important development in descriptive linguistics. That is not the issue here. The quantitative analysis of text by computer reveals facts about actual language behaviour which are not, or at least not immediately, accessible to intuition. There are frequencies of occurrence of words, and regular patterns of collocational co-occurrence, which users are unaware of, though they must be part of their competence in a procedural sense since they would not otherwise be attested. They are third person observed data ('When do they use the word X?') which are different from the first person data of introspection ('When do I use the word X?'), and the second person data of elicitation ('When do you use the word X?'). Corpus analysis reveals textual facts, fascinating profiles of produced language, and its concordances are always springing surprises. They do indeed reveal a reality about language usage which was hitherto not evident to its users.
>
> But this achievement of corpus analysis at the same time necessarily defines its limitations. For one thing, since what is revealed is contrary to intuition, then it cannot represent the reality of first person awareness. We get third person facts of what people do, but not the facts of what people know, nor what they think they do: they come from the perspective of the observer looking on, not the introspective of the insider. In ethnomethodological terms, we do not get member categories of description. Furthermore, it can only be one aspect of what they do that is captured by such quantitative analysis. For, obviously enough, the computer can only cope with the material products of what people do when they use language. It can only analyse the textual traces of the processes whereby meaning is achieved: it cannot account for the complex interplay of linguistic and contextual factors whereby discourse is enacted. It cannot produce ethnographic descriptions of language use. In reference to Hymes's components of communicative competence (Hymes 1972), we can say that corpus analysis deals with the textually attested, but not with the encoded possible, nor the contextually appropriate.
> 
> To point out these rather obvious limitations is not to undervalue corpus analysis but to define more clearly where its value lies. What it can do is reveal the properties of text, and that is impressive enough. But it is necessarily only a partial account of real language. For there are certain aspects of linguistic reality that it cannot reveal at all. In this respect, the linguistics of the attested is just as partial as the linguistics of the possible.

## Loading a corpus

First, we have to load a corpus. We'll use a text file containing posts to an Australian online forum for discussing politics. It's full of very interesting natural language data!

In [None]:
from IPython.display import display
from IPython.display import HTML
HTML('<iframe src="http://www.ozpolitic.com/forum/YaBB.pl?board=global" width="700" height="350"></iframe>')

This file is available online, at the [ResBaz GitHub](https://github.com/resbaz). We can ask Python to get it for us. 

> Later in the course, we'll discuss how to extract data from the Web and turn this data into a corpus.

In [None]:
from urllib import urlopen # a library for working with urls
url = "https://raw.githubusercontent.com/resbaz/nltk/corpora/oz_politics/ozpol.txt" # define the url
raw = urlopen(url).read() # download and read the corpus into raw variable
raw = unicode(raw.lower(), 'utf-8') # make it lowercase and unicode
len(raw) # how many characters does it contain?
raw[:2000] # first 2000 characters

We actually already downloaded this file when we first cloned the ResBaz GitHub repository. It's in our *corpora* folder. We can access it like this:

In [2]:
f = open('../../corpora/oz_politics/ozpol.txt')
raw = f.read()
raw = unicode(raw.lower(), 'utf-8') # make it lowercase and unicode
len(raw)
raw[:2000]

u"no greens-win, many of us right wingers want to stay the hell out of the middle east. nothing is going to stop that sh.thole of the world from tearing each others' throats out.\nafter all, they have been doing it successfully for centuries.\nthen you better start educating your hard right mates about greens' renewable energy. sooner we end our addiction to arab oil and middle eastern exports, like live animals, the sooner we can distance ourselves.\ni am not saying that all muslims are like this. however, i am saying that many, many of them are. they are flying under the radar and getting strong, just like hitler's storm troopers before little adolph rose to power.\nsensationalist right? no one thought much about little adolph and his brown shirted followers, until they stormed to power, whilst the do gooders stood by and tut tutted.\nif you are a student of history you can see exactly what is happening here. soon it will be too late, it most probably already is.\njust when sydney-si
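A slightly more defensive way of loading the file is sketched below. It assumes the corpus is UTF-8 encoded (as the `unicode(..., 'utf-8')` call above implies). `load_corpus` is a hypothetical helper name, not part of NLTK. `io.open` decodes while reading, and the `with` block closes the file for us.

```python
import io

def load_corpus(path, encoding='utf-8'):
    """Read a text file, decode it, and return it lowercased."""
    # io.open behaves the same way under Python 2 and Python 3
    with io.open(path, encoding=encoding) as f:
        return f.read().lower()

# usage, assuming the repository layout used in the cell above:
# raw = load_corpus('../../corpora/oz_politics/ozpol.txt')
# raw[:2000]
```

Because the decoding happens inside the helper, the result is already a unicode string and needs no further conversion.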

## Sentence segmentation

So, with a basic understanding of regex, we can now start to turn our corpus into a structured resource. At present, we have 'raw', a very, very long string of text.

We should break the string into segments. First, we'll split the corpus into sentences. This task is a pretty boring one, and it's tough for us to improve on existing resources, so we'll use the pre-trained Punkt tokenizer that ships with NLTK.

In [4]:
import nltk
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
sents = sent_tokenizer.tokenize(raw)
sents[101:111]

[u'post some better information.',
 u"there's been hundreds in iraq alone.",
 u'the iraq terrorist attacks are reported.',
 u'in the last 10 to 15 years, most (more than half) terrorist attacks have been carried out by non-muslims.',
 u"i've already provided the proof of this.",
 u"if you refuse to believe, that's your problem.",
 u'go on crossing the road every time you see a muslim on your side of the street, if that makes you feel safer.',
 u"no, actually you have not..your 'proof' ends in 2005, which is 9 years ago..how many non-mulim terror attacks happened between 2005 and 2014, and how many muslim terror attacks happened in the same period??",
 u'well just for the july section of that wiki link, about 26 out of 31 attacks were muslim inspired attacks.']
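To see why sentence segmentation is harder than it looks, here is a deliberately naive splitter. It is a sketch for comparison, not a replacement for Punkt. `naive_sent_tokenize` is a hypothetical name, and the example string is made up from fragments of the forum data.

```python
import re

def naive_sent_tokenize(text):
    """Split text after '.', '!' or '?' whenever whitespace follows."""
    # the lookbehind keeps the punctuation attached to its sentence
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

example = ("post some better information. e.g. the u.s. figures. "
           "no, actually you have not..your 'proof' ends in 2005.")
for s in naive_sent_tokenize(example):
    print(s)
```

The abbreviations *e.g.* and *u.s.* are wrongly treated as sentence endings, while *not..your* is not split at all because no whitespace follows the full stops. Handling abbreviations is one of the things Punkt is trained to do.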

Alright, we have sentences. Now what?

## Tokenisation

Tokenisation is simply the process of breaking texts down into words. We already did a little bit of this in Session 1. We won't build our own tokenizer, because it's not much fun. NLTK has one we can rely on.

Keep in mind that definitions of tokens are not standardised, especially for languages other than English. Serious problems arise when comparing two corpora that have been tokenised differently.
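As a small illustration of how much the definition of a token matters, the sketch below tokenises one made-up sentence in two ways using only the standard library. Neither scheme is NLTK's; both are assumptions chosen for contrast.

```python
import re

sentence = "i don't think the so-called 'proof' ends in 2005..."

# scheme 1: split on whitespace only
whitespace_tokens = sentence.split()

# scheme 2: words (allowing internal hyphens/apostrophes) or runs of punctuation
regex_tokens = re.findall(r"\w+(?:[-']\w+)*|[^\w\s]+", sentence)

print(len(whitespace_tokens))
print(whitespace_tokens)
print(len(regex_tokens))
print(regex_tokens)
```

The same sentence yields 9 tokens under the first scheme and 12 under the second, so any frequency counts built on top of them would not be directly comparable.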

> **Note:** It is also possible to use NLTK to break tokens into morphemes, syllables, or phonemes. We're not going to go down those roads, though.

In [5]:
tokenized_sents = [nltk.word_tokenize(i) for i in sents]
print tokenized_sents[:10]
# another view:
# tokenized_sents[:10]

[[u'no', u'greens-win', u',', u'many', u'of', u'us', u'right', u'wingers', u'want', u'to', u'stay', u'the', u'hell', u'out', u'of', u'the', u'middle', u'east', u'.'], [u'nothing', u'is', u'going', u'to', u'stop', u'that', u'sh.thole', u'of', u'the', u'world', u'from', u'tearing', u'each', u'others', u"'", u'throats', u'out', u'.'], [u'after', u'all', u',', u'they', u'have', u'been', u'doing', u'it', u'successfully', u'for', u'centuries', u'.'], [u'then', u'you', u'better', u'start', u'educating', u'your', u'hard', u'right', u'mates', u'about', u'greens', u"'", u'renewable', u'energy', u'.'], [u'sooner', u'we', u'end', u'our', u'addiction', u'to', u'arab', u'oil', u'and', u'middle', u'eastern', u'exports', u',', u'like', u'live', u'animals', u',', u'the', u'sooner', u'we', u'can', u'distance', u'ourselves', u'.'], [u'i', u'am', u'not', u'saying', u'that', u'all', u'muslims', u'are', u'like', u'this', u'.'], [u'however', u',', u'i', u'am', u'saying', u'that', u'many', u',', u'many', u'of

## Stemming

Stemming is the task of finding the stem of a word. So, *cats --> cat*, or *taking --> take*. It is an important task when counting words, as counting each inflection separately is often not particularly helpful: forms of the verb 'to be' might seem under-represented if we count *is, are, were, was, am, be, being, been* separately.

NLTK has pre-programmed stemmers, but we can build our own using some of the skills we've already learned.

A stemmer is the kind of thing that would make a good function, so let's do that.

In [6]:
def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']: # list of suffixes
        if word.endswith(suffix):
            return word[:-len(suffix)] # delete the suffix
    return word

Let's run it over some text and see how it performs.

In [7]:
# empty list for our output
stemmed_sents = []
for sent in tokenized_sents:
    # empty list for stemmed sentence:
    stemmed = []
    for word in sent:
        # append the stem of every word
        stemmed.append(stem(word))
    # append the stemmed sentence to the list of sentences
    stemmed_sents.append(stemmed)
# pretty print the output
stemmed_sents[:10]

[[u'no',
  u'greens-win',
  u',',
  u'many',
  u'of',
  u'u',
  u'right',
  u'winger',
  u'want',
  u'to',
  u'stay',
  u'the',
  u'hell',
  u'out',
  u'of',
  u'the',
  u'middle',
  u'east',
  u'.'],
 [u'noth',
  u'i',
  u'go',
  u'to',
  u'stop',
  u'that',
  u'sh.thole',
  u'of',
  u'the',
  u'world',
  u'from',
  u'tear',
  u'each',
  u'other',
  u"'",
  u'throat',
  u'out',
  u'.'],
 [u'after',
  u'all',
  u',',
  u'they',
  u'have',
  u'been',
  u'do',
  u'it',
  u'successful',
  u'for',
  u'centur',
  u'.'],
 [u'then',
  u'you',
  u'better',
  u'start',
  u'educat',
  u'your',
  u'hard',
  u'right',
  u'mat',
  u'about',
  u'green',
  u"'",
  u'renewable',
  u'energy',
  u'.'],
 [u'sooner',
  u'we',
  u'end',
  u'our',
  u'addiction',
  u'to',
  u'arab',
  u'oil',
  u'and',
  u'middle',
  u'eastern',
  u'export',
  u',',
  u'like',
  u'l',
  u'animal',
  u',',
  u'the',
  u'sooner',
  u'we',
  u'can',
  u'distance',
  u'ourselv',
  u'.'],
 [u'i',
  u'am',
  u'not',
  u'say',
  u

Looking at the output, we can see that the stemmer works: *wingers* becomes *winger*, and *tearing* becomes *tear*. But sometimes it does things we don't want: *nothing* becomes *noth*, and *mates* becomes *mat*. Even so, as a learning exercise, let's rewrite our function with a regex:
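The regex version is not shown at this point in the notebook. One way it might look, reusing the suffix list from `stem` above, is sketched below. `regex_stem` and `SUFFIXES` are names chosen for illustration.

```python
import re

# an optional suffix group, anchored to the end of the word
SUFFIXES = r'(ing|ly|ed|ious|ies|ive|es|s|ment)?$'

def regex_stem(word):
    """Strip one of the listed suffixes, if the word ends with one."""
    # the non-greedy (.*?) takes as little as possible, which leaves
    # the longest matching suffix for the second group
    match = re.match(r'^(.*?)' + SUFFIXES, word, re.DOTALL)
    return match.group(1)

for w in ['wingers', 'tearing', 'nothing', 'mates', 'the']:
    print(w + ' --> ' + regex_stem(w))
```

This version makes the same mistakes as the original (*nothing* still becomes *noth*, and *mates* still becomes *mat*), because the problem lies in the suffix-stripping idea rather than in how it is coded.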

We can see that this approach has obvious limitations. So, let's rely on a purpose-built stemmer. NLTK's Porter and Lancaster stemmers apply much larger, carefully ordered sets of suffix-stripping rules. Note the subtle differences between the two:

In [10]:
tokens = [] 
for sent in tokenized_sents:
    for word in sent:
        tokens.append(word)

In [11]:
# define stemmers
lancaster = nltk.LancasterStemmer()
porter = nltk.PorterStemmer()
# stem each word in tokens
stems = [lancaster.stem(t) for t in tokens]  # replace lancaster with porter here
print stems[:100]

[u'no', u'greens-win', u',', u'many', u'of', u'us', u'right', u'wing', u'want', u'to', u'stay', u'the', u'hel', u'out', u'of', u'the', u'middl', u'east', u'.', u'noth', u'is', u'going', u'to', u'stop', u'that', u'sh.thole', u'of', u'the', u'world', u'from', u'tear', u'each', u'oth', u"'", u'throats', u'out', u'.', u'aft', u'al', u',', u'they', u'hav', u'been', u'doing', u'it', u'success', u'for', u'century', u'.', u'then', u'you', u'bet', u'start', u'educ', u'yo', u'hard', u'right', u'mat', u'about', u'green', u"'", u'renew', u'energy', u'.', u'soon', u'we', u'end', u'our', u'addict', u'to', u'arab', u'oil', u'and', u'middl', u'eastern', u'export', u',', u'lik', u'liv', u'anim', u',', u'the', u'soon', u'we', u'can', u'dist', u'ourselv', u'.', u'i', u'am', u'not', u'say', u'that', u'al', u'muslim', u'ar', u'lik', u'thi', u'.', u'howev']


Notice that both stemmers handle some things rather poorly. The main reason for this is that they are not aware of the *word class* of any particular word: *nothing* is a noun, and nouns ending in *ing* should not have *ing* removed by the stemmer (swing, bling, ring...). Later in the course, we'll start annotating corpora with grammatical information. This improves the accuracy of stemmers a lot.

> **Note:** Stemming is not *always* the best thing to do: though *thing* is the stem of *things*, *things* has a unique meaning, as in *things will improve*. If we are interested in vague language, we may not want to collapse *things --> thing*.

## Keywording: 'the aboutness of a text'

Keywording is the process of generating a list of words that are unusually frequent in the corpus of interest. To do it, you need a *reference corpus*, or at least a *reference wordlist* to which your *target corpus* can be compared. Often, *reference corpora* take the form of very large collections of language drawn from a variety of spoken and written sources.

Keywording is what generates word-clouds beside online news stories, blog posts, and the like. In combination with speech-to-text, it's used in Oxford University's [Spindle Project](http://openspires.oucs.ox.ac.uk/spindle/) to automatically archive recorded lectures with useful tags.
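To make the comparison between target and reference concrete, here is a minimal sketch using log-likelihood, one commonly used keyness statistic. Whether corpkit or Spindle computes its scores in exactly this way is an assumption. `log_likelihood` and `keywords_by_ll` are hypothetical names, and the two token lists are toy data.

```python
from collections import Counter
import math

def log_likelihood(a, b, c, d):
    """a, b: word frequency in target/reference; c, d: corpus sizes."""
    # expected frequencies if the word were equally common in both corpora
    e1 = c * (a + b) / float(c + d)
    e2 = d * (a + b) / float(c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords_by_ll(target_tokens, reference_tokens, top=10):
    """Rank words that are relatively more frequent in the target."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    c = sum(target.values())
    d = sum(reference.values())
    scored = []
    for word, a in target.items():
        b = reference.get(word, 0)
        # skip words that are no more frequent than in the reference
        if a / float(c) > b / float(d):
            scored.append((log_likelihood(a, b, c, d), word))
    return sorted(scored, reverse=True)[:top]

toy_target = 'isis isis isis the the'.split()
toy_reference = 'the the the the cat'.split()
print(keywords_by_ll(toy_target, toy_reference))
```

In the toy data, *isis* never occurs in the reference list, so it is returned as a keyword, while *the* is dropped because it is relatively more common in the reference.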

We'll use corpkit, which relies on Spindle.

In [12]:
#! pip install corpkit
import corpkit
from corpkit import keywords

In [13]:
# this tool works with raw text, not tokens!
keys, ngrams = keywords(raw.encode("UTF-8"))
for key in keys[:20]:
    print key

[0, 'isis', 716.823485041825]
[1, 'terrorist', 498.3939674455444]
[2, 'muslims', 424.3632631632848]
[3, 'muslim', 423.25935057638816]
[4, 'three', 306.8193032226369]
[5, 'genocide', 304.39492316647016]
[6, 'isil', 245.2290869879928]
[7, 'iraq', 242.37140350399955]
[8, 'time', 225.6694491556164]
[9, 'attacks', 214.58846725527616]
[10, 'sort', 172.9831520838031]
[11, 'moslems', 170.87994026322178]
[12, 'australian', 169.24670158499322]
[13, 'work', 148.97528640889706]
[14, 'bit', 148.83300159093574]
[15, 'islamic', 147.7422380777213]
[16, 'lot', 146.8443398846189]
[17, 'good', 146.54802603896655]
[18, 'people', 143.73019189795275]
[19, 'things', 133.9788295328285]


Success! We have keywords.

> Keep in mind, the British National Corpus (BNC), our reference corpus here, was created before ISIS and ISIL existed. *Moslem/moslems* is a dispreferred spelling of Muslim, used more frequently in anti-Islamic discourse. Also, it's unlikely that a transcriber of the spoken BNC would choose the Moslem spelling. *Having an inappropriate reference corpus is a common methodological problem in discourse analytic work*.

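Now, we can fiddle with the stemmer and BNC frequency threshold to get different keyword lists. Note that `newstemmer` and `keywords_and_ngrams` are helper functions that are not defined in this notebook, so the cells below raise a `NameError` unless you define or import them first.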

In [14]:
# try raising the threshold if there are still bad spellings!
stemmed = newstemmer(raw, 'Lancaster', 10)

NameError: name 'newstemmer' is not defined

In [None]:
keys = keywords_and_ngrams(stemmed)
keys[0] # only keywords
keys[1] # only n-grams

## Collocation

> *You shall know a word by the company it keeps.* - J.R. Firth, 1957

Collocation is a very common area of interest in corpus linguistics. Words pattern together in both expected and unexpected ways. In some contexts, *drug* and *medication* are synonymous, but it would be very rare to hear about *illicit* or *street medication*. Similarly, doctors are unlikely to prescribe the *correct* or *appropriate drug*.

This kind of information may be useful to lexicographers, discourse analysts, or advanced language learners.

In NLTK, collocation works from ordered lists of tokens. Let's put our tokenised sentences into a single, huge list of tokens:

In [15]:
allwords = []
# for each sentence,
for sent in tokenized_sents:
    # for each word,
    for word in sent:
        # make a list of all words
        allwords.append(word)
print allwords[:20]
# small challenge: can you think of any other ways to do this?

[u'no', u'greens-win', u',', u'many', u'of', u'us', u'right', u'wingers', u'want', u'to', u'stay', u'the', u'hell', u'out', u'of', u'the', u'middle', u'east', u'.', u'nothing']
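For the small challenge above, two other ways of flattening a list of lists are sketched here on a toy stand-in for `tokenized_sents` (the name `toy_sents` is made up for the example).

```python
from itertools import chain

# a tiny stand-in for tokenized_sents
toy_sents = [['no', 'greens-win', ','], ['nothing', 'is', 'going']]

# option 1: a nested list comprehension
flat_a = [word for sent in toy_sents for word in sent]

# option 2: itertools.chain joins the sentence lists end to end
flat_b = list(chain.from_iterable(toy_sents))

print(flat_a == flat_b)
```

Both give the same flat list as the nested `for` loop; which one reads best is largely a matter of taste.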


Now, let's feed these to an NLTK function for measuring collocations:

In [16]:
# get all the functions needed for collocation work
from nltk.collocations import *
# define statistical tests for bigrams
bigram_measures = nltk.collocations.BigramAssocMeasures()
# go and find bigrams
finder = BigramCollocationFinder.from_words(allwords)
# measure which bigrams are important and print the top 30
sorted(finder.nbest(bigram_measures.raw_freq, 30))

[(u',', u'and'),
 (u',', u'but'),
 (u',', u'i'),
 (u',', u'the'),
 (u'.', u'and'),
 (u'.', u'i'),
 (u'.', u'if'),
 (u'.', u'it'),
 (u'.', u'the'),
 (u'.', u'we'),
 (u'.', u'you'),
 (u'?', u'?'),
 (u'and', u'the'),
 (u'do', u"n't"),
 (u'for', u'the'),
 (u'have', u'been'),
 (u'in', u'the'),
 (u'is', u'a'),
 (u'it', u"'s"),
 (u'it', u'is'),
 (u'middle', u'east'),
 (u'of', u'the'),
 (u'on', u'the'),
 (u'that', u'the'),
 (u'the', u'middle'),
 (u'the', u'us'),
 (u'the', u'world'),
 (u'they', u'are'),
 (u'to', u'be'),
 (u'to', u'the')]

So, that tells us a little: *middle east* is a frequent bigram, but most of the list is taken up by punctuation and function words. At present, we are only looking for immediately adjacent words. So, let's expand our search to a window of *five tokens*.
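Before using a window in NLTK, it may help to see what a window does to the counts. The sketch below is a rough illustration of the idea, not NLTK's implementation: each token is paired with the tokens that follow it inside the window. `window_pairs` and the toy sentence are made up for the example.

```python
from collections import Counter

def window_pairs(words, window_size=2):
    """Count ordered pairs (w1, w2) where w2 follows w1 within the window."""
    counts = Counter()
    for i, w1 in enumerate(words):
        # pair w1 with each of the next window_size - 1 tokens
        for w2 in words[i + 1:i + window_size]:
            counts[(w1, w2)] += 1
    return counts

toy = 'attacks in the middle east and attacks across the middle east'.split()
print(window_pairs(toy, 2)[('attacks', 'middle')])
print(window_pairs(toy, 5)[('attacks', 'middle')])
```

With a window of two, *attacks* and *middle* are never paired because they are not adjacent. With a window of five they are paired twice, which is how wider windows surface collocates that are near each other without being neighbours.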

In [None]:
# window_size specifies the distance at which
# two tokens can still be considered collocates
finder = BigramCollocationFinder.from_words(allwords, window_size=5)
sorted(finder.nbest(bigram_measures.raw_freq, 30))

Now we have the appearance of very common words! Let's use NLTK's stopwords list to remove entries containing these:

In [17]:
finder = BigramCollocationFinder.from_words(allwords, window_size=5)
# get a list of stopwords from nltk
ignored_words = nltk.corpus.stopwords.words('english')
# make sure no part of the bigram is in stopwords
finder.apply_word_filter(lambda w: len(w) < 2 or w.lower() in ignored_words)
finder.apply_freq_filter(2)
#print the sorted collocations
sorted(finder.nbest(bigram_measures.raw_freq, 30))

[(u'...', u'..'),
 (u'...', u'...'),
 (u'2001', u'carried'),
 (u'``', u"''"),
 (u'``', u'...'),
 (u'around', u'world'),
 (u'attacks', u'around'),
 (u'attacks', u'since'),
 (u'attacks', u'world'),
 (u'ca', u"n't"),
 (u'carried', u'non-muslims'),
 (u'etc', u'etc'),
 (u'iraq', u'syria'),
 (u'let', u"'s"),
 (u'majority', u'around'),
 (u'majority', u'attacks'),
 (u'majority', u'terrorist'),
 (u'middle', u'east'),
 (u"n't", u'know'),
 (u"n't", u'think'),
 (u'shale', u'oil'),
 (u'since', u'2001'),
 (u'spirit', u'anzac'),
 (u'syria', u'iraq'),
 (u'terrorist', u'around'),
 (u'terrorist', u'attacks'),
 (u'terrorist', u'groups'),
 (u'terrorist', u'world'),
 (u'world', u'2001'),
 (u'world', u'since')]

There! Now we have some interesting collocates. Finally, let's remove punctuation-only entries, and entries that are *n't*, which is an artefact of how the tokeniser splits contractions (*don't* becomes *do* + *n't*):

In [19]:
import re
finder = BigramCollocationFinder.from_words(allwords, window_size=5)
ignored_words = nltk.corpus.stopwords.words('english')
# anything starting with a letter or number (re.match anchors at the start)
regex = r'[A-Za-z0-9]'
# the n't token
nonot = r'n\'t'
# lots of conditions!
finder.apply_word_filter(lambda w: len(w) < 2 or w.lower() in ignored_words or not re.match(regex, w) or re.match(nonot, w))
finder.apply_freq_filter(2)
sorted(finder.nbest(bigram_measures.raw_freq, 30))

[(u'2001', u'carried'),
 (u'air', u'strikes'),
 (u'around', u'2001'),
 (u'around', u'since'),
 (u'around', u'world'),
 (u'attacks', u'around'),
 (u'attacks', u'carried'),
 (u'attacks', u'since'),
 (u'attacks', u'world'),
 (u'boots', u'ground'),
 (u'carried', u'non-muslims'),
 (u'etc', u'etc'),
 (u'foreign', u'policy'),
 (u'iraq', u'syria'),
 (u'majority', u'around'),
 (u'majority', u'attacks'),
 (u'majority', u'terrorist'),
 (u'middle', u'east'),
 (u'muslim', u'groups'),
 (u'shale', u'oil'),
 (u'since', u'2001'),
 (u'since', u'carried'),
 (u'spirit', u'anzac'),
 (u'syria', u'iraq'),
 (u'terrorist', u'around'),
 (u'terrorist', u'attacks'),
 (u'terrorist', u'groups'),
 (u'terrorist', u'world'),
 (u'world', u'2001'),
 (u'world', u'since')]

You can get a lot more info on collocation at the [NLTK homepage](http://www.nltk.org/howto/collocations.html).

## Clustering/n-grams

Clustering is the task of finding words that are commonly **immediately** adjacent (as opposed to collocates, which may just be nearby). This is also often called n-grams: bigrams are two tokens that appear together, trigrams are three, etc.

Clusters/n-grams have a spooky ability to tell us what a text is about.

We can use *Spindle*/corpkit for bigram searching as well:

In [20]:
# an argument here to stop keywords from being produced.
keys, ngrams = keywords(raw.encode("UTF-8"))
for ngram in ngrams[:50]:
    print ngram

[0, 'middle east', 34]
[1, 'terrorist attacks', 25]
[2, 'shale oil', 10]
[3, 'muslim terrorist', 6]
[4, 'wholesale genocide', 5]
[5, 'terror attacks', 5]
[6, 'islamic terrorism', 4]
[7, 'long time', 3]
[8, 'anzac day', 3]


There's also a method for n-gram production in NLTK. We can use this to understand how n-gramming works.

Below, we get lists of any ten adjacent tokens:

In [21]:
from nltk.util import ngrams
# define a sentence
sentence = 'give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime'  
# length of ngram
n = 10
# use builtin tokeniser (but we could use a different one)
tengrams = ngrams(sentence.split(), n)
for gram in tengrams:
  print gram

('give', 'a', 'man', 'a', 'fish', 'and', 'you', 'feed', 'him', 'for')
('a', 'man', 'a', 'fish', 'and', 'you', 'feed', 'him', 'for', 'a')
('man', 'a', 'fish', 'and', 'you', 'feed', 'him', 'for', 'a', 'day;')
('a', 'fish', 'and', 'you', 'feed', 'him', 'for', 'a', 'day;', 'teach')
('fish', 'and', 'you', 'feed', 'him', 'for', 'a', 'day;', 'teach', 'a')
('and', 'you', 'feed', 'him', 'for', 'a', 'day;', 'teach', 'a', 'man')
('you', 'feed', 'him', 'for', 'a', 'day;', 'teach', 'a', 'man', 'to')
('feed', 'him', 'for', 'a', 'day;', 'teach', 'a', 'man', 'to', 'fish')
('him', 'for', 'a', 'day;', 'teach', 'a', 'man', 'to', 'fish', 'and')
('for', 'a', 'day;', 'teach', 'a', 'man', 'to', 'fish', 'and', 'you')
('a', 'day;', 'teach', 'a', 'man', 'to', 'fish', 'and', 'you', 'feed')
('day;', 'teach', 'a', 'man', 'to', 'fish', 'and', 'you', 'feed', 'him')
('teach', 'a', 'man', 'to', 'fish', 'and', 'you', 'feed', 'him', 'for')
('a', 'man', 'to', 'fish', 'and', 'you', 'feed', 'him', 'for', 'a')
('man', 'to',

So, there are plenty of tengrams in there! What we're interested in, however, is duplicated n-grams:

In [22]:
# arguments: a text, ngram size, and a threshold (ngrams must occur more often than this)
def ngrammer(text, gramsize, threshold = 4):
    """Get any repeating ngram containing gramsize tokens"""
    # we need to import this in order to find the duplicates:
    from collections import defaultdict
    from nltk.util import ngrams
    # a subdefinition to get duplicate lists in a list
    def list_duplicates(seq):
        tally = defaultdict(list)
        for i,item in enumerate(seq):
            tally[item].append(i)
            # return the number of occurrences and the ngram itself:
        return ((len(locs),key) for key,locs in tally.items() 
               if len(locs) > threshold)
    # get ngrams of gramsize    
    raw_grams = ngrams(text.split(), gramsize)
    # use our duplication detector to find duplicates
    dupes = list_duplicates(raw_grams)
    # return them, sorted by most frequent
    return sorted(dupes, reverse = True)

Now that it's defined, let's run it, looking for trigrams:

In [23]:
ngrammer(raw, 3)

[(31, (u'the', u'middle', u'east')),
 (24, (u'in', u'the', u'middle')),
 (18, (u'carried', u'out', u'by')),
 (17, (u'the', u'majority', u'of')),
 (15, (u'of', u'terrorist', u'attacks')),
 (13, (u'out', u'by', u'non-muslims.')),
 (13, (u'majority', u'of', u'terrorist')),
 (12, (u'around', u'the', u'world')),
 (10, (u'we', u'need', u'to')),
 (10, (u'the', u'middle', u'east.')),
 (10, (u'terrorist', u'attacks', u'around')),
 (10, (u'have', u'been', u'carried')),
 (10, (u'been', u'carried', u'out')),
 (10, (u'attacks', u'around', u'the')),
 (9, (u'the', u'world', u'since')),
 (8, (u'to', u'do', u'with')),
 (8, (u'the', u'rest', u'of')),
 (8, (u'out', u'of', u'the')),
 (7, (u'world', u'since', u'2001')),
 (7, (u'to', u'deal', u'with')),
 (7, (u'this', u'is', u'a')),
 (7, (u'the', u'spirit', u'of')),
 (7, (u'since', u'2001', u'have')),
 (7, (u'of', u'the', u'world')),
 (7, (u'do', u'you', u'think')),
 (7, (u'2001', u'have', u'been')),
 (6, (u'you', u'are', u'a')),
 (6, (u'there', u'is', u'no

Too many results? Let's set a higher threshold than the default.

In [24]:
ngrammer(raw, 3, threshold = 10)

[(31, (u'the', u'middle', u'east')),
 (24, (u'in', u'the', u'middle')),
 (18, (u'carried', u'out', u'by')),
 (17, (u'the', u'majority', u'of')),
 (15, (u'of', u'terrorist', u'attacks')),
 (13, (u'out', u'by', u'non-muslims.')),
 (13, (u'majority', u'of', u'terrorist')),
 (12, (u'around', u'the', u'world'))]
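As an aside, the counting step inside `ngrammer` can also be sketched with `collections.Counter`, and the n-grams themselves can be built with `zip` instead of NLTK. This is an alternative sketch under those assumptions, not the approach used above. `simple_ngrams` and `repeated_ngrams` are hypothetical names, and the threshold keeps the same "more than" meaning as in `ngrammer`.

```python
from collections import Counter

def simple_ngrams(tokens, n):
    """Return a list of n-grams (as tuples) from a list of tokens."""
    # zip together n copies of the list, each shifted by one position
    return list(zip(*[tokens[i:] for i in range(n)]))

def repeated_ngrams(text, n, threshold=4):
    """Count n-grams, keeping those seen more than `threshold` times."""
    counts = Counter(simple_ngrams(text.split(), n))
    kept = [(count, gram) for gram, count in counts.items()
            if count > threshold]
    # most frequent first
    return sorted(kept, reverse=True)

toy_text = 'the middle east ' * 3 + 'in the middle'
print(repeated_ngrams(toy_text, 3, threshold=2))
```

In the toy text only *the middle east* occurs more than twice, so it is the only trigram returned.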

## Concordancing with regular expressions

We've already done a bit of concordancing. In discourse-analytic research, concordancing is often used to perform thematic categorisation.

In [25]:
text = nltk.Text(tokens)  # formats our tokens for concordancing
text.concordance("muslims")

Displaying 25 of 60 matches:
urselves . i am not saying that all muslims are like this . however , i am sayi
 a result of long term migration of muslims into our country we now have a burg
 shut the gate as we now have young muslims , embracing isis ideals and as far 
 of over-sensationalized crap . the muslims are actually less trouble than are 
aying the statistics will show that muslims are more likely to be criminal than
ty . what about crimes committed by muslims whose grandparents emigrated here ,
, would be collateral damage as the muslims promote their promise to 'have the 
anda ) greens -win , i am comparing muslims to storm troopers . do you have any
ers . do you have any idea what the muslims are up to in iraq ? isis are commit
uld be 'aussie ' , not muslim . all muslims ? you are on a hate speech , hiding
iding behind a keyboard , about all muslims and that makes you a fanatical terr
nt history ) . the sunni and shiite muslims have been killing each other for 12
ehalf of is

We could even use our stemmed corpus here:

In [None]:
text = nltk.Text(stems)  # the stemmed tokens from the stemming section
text.concordance("muslims")

You get no matches in the latter case, because all instances of *muslims* were stemmed to *muslim*.

A problem with the NLTK concordancer is that it only works with individual tokens. What if we want to find words that end with `*ment`, or words beginning with `poli*`?
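With regular expressions, both kinds of search are short one-liners. The sketch below runs them over a made-up sample string rather than the corpus. `ment_words` and `poli_words` are names chosen for the example.

```python
import re

sample = ("the government's policy on policing drew comment "
          "from politicians in parliament")

# words ending in 'ment': one or more word characters, then 'ment'
ment_words = re.findall(r'\b\w+ment\b', sample)
# words beginning with 'poli': 'poli', then any further word characters
poli_words = re.findall(r'\bpoli\w*\b', sample)

print(ment_words)
print(poli_words)
```

The `\b` markers anchor each pattern to word boundaries, so *government* is found inside *government's* but *ment* on its own is not matched in the middle of a longer word.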

We already searched text with Regular Expressions. It's not much more work to build regex functionality into our own concordancer.

From running the code below, you can see that bracketing sections of our regex causes results to split into lists:

In [26]:
# define a regex for different aussie words
aussie = r'(aussie|australia)'
searchpattern = re.compile(r"(.*)" + aussie + r"(.*)")
search = re.findall(searchpattern, raw)
search[:5]

[(u'as they woke to another day\x92s work the people of sydney [and melbourne] learned of a joint operation between their states\x92 police forces and the ',
  u'australia',
  u'n federal police, against an imminent terrorist attack in their cities.'),
 (u'four men - all ',
  u'australia',
  u"n citizens - were arrested this morning as federal and state police, armed with search warrants, swooped on members of the suspected terror cell this morning in the second-largest counter-terrorism operation in the nation's history\x85. about 400 police raided homes in the northern melbourne suburbs of glenroy, meadow heights, roxburgh park, broadmeadows, westmeadows, preston and epping. they also raided homes at carlton in inner melbourne and colac in southwestern victoria. (source)"),
 (u"authorities believe the group is at an advanced stage of preparing to storm an australian army base, using automatic weapons, as punishment for australia's military involvement in muslim countries. it is under

Well, it's ugly, but it works. We can see five bracketed results, each containing three strings. The first and third strings are the left-context and right-context. The second of the three strings is the search term.

These three sections are, with a bit of tweaking, the same as the output given by a concordancer.

Let's go ahead and turn our regex searcher into a concordancer:

In [27]:
def concordancer(text, regex):
    """Concordance using regular expressions"""
    import re
    # limit context to 30 characters max
    searchpattern = re.compile(r"(.{,30})(\b" + regex + r"\b)(.{,30})")
    # find all instances of our regex
    search = re.findall(searchpattern, text)
    for result in search:
        #join each result with a tab, and print
        print("\t".join(result).expandtabs(20))
        # expand tabs helps align results

In [28]:
concordancer(raw, r'aus.*?')

states police forces and the           australian           federal police, against an im
four men - all      australian           citizens - were arrested this
tage of preparing to storm an           australian           army base, using automatic we
apons, as punishment for                australia           's military involvement in mus
be the worst terror attack on           australian           soil.
lly less trouble than are the           aussies             . and if we had never interfer
re likely to be criminal than           aussies             ? and we've been interfering i
....muslim is a religion, and           australian           is a nationality. what about 
ou classify them as muslim or           aussie              ???
may have been slaughtered by '          aussies             ', i don't blame christianity 
se? they would have had their           aussieness           brow-beaten into them.
think there would be too many           australians          who would like t

Great! With six lines of code, we've officially created a function that improves on the one provided by NLTK! And think how easy it would be to add more functionality, such as an argument dictating the size of the window (currently 30 characters), or line numbers printed beside matches.

> Adding too much functionality is known as *feature creep*. It's often best to keep your functions simple and more varied. An old adage in programming is to *make each program do one thing well*.

In the cells below, try concordancing a few things. Also try creating variables with concordance results, and then manipulate the lists. If you encounter problems with the way the concordancer runs, alter the function and redefine it. If you want, try implementing the window size variable!
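If you get stuck on the window size, here is one possible sketch, not the only way to do it. The function name `concordancer_window`, the `window` argument and the `sample` text are all assumptions made up for this example. It also returns the lines it prints, so that you can store the results in a variable and manipulate them:

```python
import re

def concordancer_window(text, regex, window=30):
    """Concordance with an adjustable context window (hypothetical variant)"""
    # build a context group that allows up to `window` characters
    context = r"(.{0,%d})" % window
    searchpattern = re.compile(context + r"(\b" + regex + r"\b)" + context)
    lines = []
    for result in re.findall(searchpattern, text):
        # join with tabs, then expand the tabs so the columns roughly align
        line = "\t".join(result).expandtabs(window + 5)
        print(line)
        lines.append(line)
    return lines

# a made-up sample text, purely for illustration
sample = "there are many australians here. the aussies were not to blame."
lines = concordancer_window(sample, r"aus\w*", window=10)
```

Because the context groups are built from `window`, the tab width is scaled along with it. Note that the search regex should not contain capture groups of its own, or each result will have more than three parts.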

> **Tip:** If you wanted to get really creative, you could try stemming concordance or n-gram results!
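As a starting point for that tip, here is a rough sketch that stems a list of matched words and counts the results. The `crude_stem` function is a hypothetical stand-in that strips only a handful of suffixes, so it is not a real stemmer. The `matches` list is typed in by hand from the word forms in the concordance output above:

```python
import re
from collections import Counter

def crude_stem(word):
    """Strip a few common suffixes (illustrative only, not a real stemmer)"""
    return re.sub(r"(ness|ns|n|s)$", "", word)

# word forms copied by hand from the concordance output
matches = ['australian', 'australia', 'aussies', 'aussie',
           'australians', 'aussieness']

# count how many matched forms share each stem
stem_counts = Counter(crude_stem(m) for m in matches)
print(stem_counts.most_common())
```

Swapping `crude_stem` for a stemmer of your own, or one of NLTK's, is a good way to see how much the counts depend on the normalisation you choose.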

In [None]:
#

In [None]:
#

In [None]:
#

In [None]:
#

In [None]:
#

## Summary

That's the end of session two! Great work.

So, some of these tasks are a little dry---seeing results as lists of words and scores isn't always a lot of fun. But ultimately, they're pretty important things to know if you want to avoid the 'black box approach', where you simply dump words into a machine and analyse what the machine spits out.

Remember that almost every task in corpus linguistics/distant reading depends on how we segment our data into sentences, clauses, words, etc.
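As a small illustration of how much segmentation matters, the sketch below tokenises the same made-up sentence in two simple ways. Neither is *the* correct tokeniser, and the `sample` text is an assumption for the example. The point is only that the number of tokens, and the count for a word like *it*, change with the method:

```python
import re

# a made-up sample, purely for illustration
sample = "It's ugly, but it works. Well, it's done."

# method 1: split on whitespace (punctuation stays attached to words)
whitespace_tokens = sample.lower().split()

# method 2: keep only runs of word characters (splits "it's" in two)
regex_tokens = re.findall(r"\w+", sample.lower())

print(len(whitespace_tokens), whitespace_tokens.count("it"))
print(len(regex_tokens), regex_tokens.count("it"))
```

The first method finds 8 tokens and a single *it*, while the second finds 10 tokens and three.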

Building a stemmer from scratch taught us how to use regular expressions, and showed us how powerful they are. But we also saw that they weren't perfect for the task. In later lessons, we'll use more advanced methods to normalise our data.

*See you tomorrow!*

# Bibliography

<a id="ref:baker"></a>
Baker, P., Gabrielatos, C., Khosravinik, M., Krzyzanowski, M., McEnery, T., & Wodak, R. (2008). A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse & Society, 19(3), 273-306.

<a id="firth"></a>
Firth, J. (1957). *A Synopsis of Linguistic Theory 1930-1955*. In: Studies in Linguistic Analysis, Philological Society, Oxford; reprinted in Palmer, F. (ed.) 1968 Selected Papers of J. R. Firth, Longman, Harlow.

<a id="ref:hymes"></a>
Hymes, D. (1972). On communicative competence. In J. Pride & J. Holmes (Eds.), Sociolinguistics (pp. 269-293). Harmondsworth: Penguin Books. Retrieved from [http://humanidades.uprrp.edu/smjeg/reserva/Estudios%20Hispanicos/espa3246/Prof%20Sunny%20Cabrera/ESPA%203246%20-%20On%20Communicative%20Competence%20p%2053-73.pdf](http://humanidades.uprrp.edu/smjeg/reserva/Estudios%20Hispanicos/espa3246/Prof%20Sunny%20Cabrera/ESPA%203246%20-%20On%20Communicative%20Competence%20p%2053-73.pdf)

<a id="ref:widdowson"></a>
Widdowson, H. G. (2000). On the limitations of linguistics applied. Applied Linguistics, 21(1), 3. Available at [http://applij.oxfordjournals.org/content/21/1/3.short](http://applij.oxfordjournals.org/content/21/1/3.short).

### Workspace

Here are a few blank cells, in case you need them for anything:

In [37]:
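def gutenberger(list_of_nums):
    """Download Project Gutenberg texts by ID number"""
    try:
        from urllib.request import urlopen  # Python 3
    except ImportError:
        from urllib import urlopen  # Python 2
    text = []
    for num in list_of_nums:
        num = str(num)
        url = 'https://www.gutenberg.org/cache/epub/' + num + '/pg' + num + '.txt'
        # download the book and decode the bytes to unicode
        raw = urlopen(url).read().decode('utf-8')
        # keep any metadata lines that begin with 'Title:'
        title = [line for line in raw.splitlines() if line.startswith('Title:')]
        if title:
            print(title[0])
        # store the title line(s) alongside the full text
        text.append([title, raw])
    return text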

In [38]:
booknums = ['24510', '19073', '21592']

In [39]:
books = gutenberger(booknums)

Title: The Production of Vinegar from Honey
Title: Cocoa and Chocolate
Title: The Art of Making Whiskey


In [36]:
for title, text in books:
    # show the first 100 characters of each book
    print(text[:100])

project gutenberg's the production of vinegar from honey, by gerard w bancks

this ebook is for th
the project gutenberg ebook of cocoa and chocolate, by arthur w. knapp

this ebook is for the use 
project gutenberg's the art of making whiskey, by anthony boucherie

this ebook is for the use of 


In [None]:
#

In [None]:
#

In [None]:
#

In [None]:
#

In [None]:
#

In [None]:
#

In [None]:
# 