# Introduction to NLTK

Use computers to identify patterns in language and textual data

## Orientation: Where am I?
<img src='res/launch_1.jpg'>


<i>Credits: Kerbal space program: Falcon 9 Space X</i>

## Our command module: Interactive Python (IPython) with Jupyter Notebooks


Let's play around with our environment:
- Set up and get ready with [Anaconda](https://www.continuum.io/downloads). It's free.
- What is a notebook?
- What can I do in the cells?
- Most used features

### Challenge: Explore the Jupyter notebooks
- Open the notebook [material](http://a.com) or make a copy for yourself in the cloud
- Add 5 cells to the notebook
- Delete 3 cells
- Type "Hello world" inside the last empty cell and run it (we will learn how to put more things inside next!)
- Move the cell to be the first


Now that we know where we stand, a few comments:

1. The main strength of IPython is that you can run bits of code individually
2. IPython allows you to display images alongside code, and to save the input and output together.
3. IPython makes learning a bit easier, as mistakes are easier to find and do not break an entire workflow.

### Wait... what is Python?

<img src='res/missionpython_cover.png'>

Python is an easy-to-use programming language that comes with handy, efficient tools for manipulating linguistic data. We will learn just the basics needed to perform reproducible research.

## What is the Natural Language Toolkit?
<img src='res/NLTK.png'>

NLTK is a Python library for working with written language data.

##### NLTK is free and extensively documented [here](http://www.nltk.org/).
> Note: NLTK provides tools for tasks ranging from very simple (counting words in a text) to very complex (writing and training parsers, etc.)

We will start by importing NLTK, setting a path to NLTK resources, and downloading some additional stuff.

In [1]:
# import nltk library
import nltk
# assumption: on a first run the book data may need fetching with nltk.download('book')

We need some data to experiment with...

In [4]:
# get some example data from nltk
# asterisk means 'everything'
from nltk.book import *

sorted(set(text1))

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

['!',
 '!"',
 '!"--',
 "!'",
 '!\'"',
 '!)',
 '!)"',
 '!*',
 '!--',
 '!--"',
 ...]

## Python as a calculator
We can use the IPython environment as a calculator; try doing some basic mathematics with Python. *Hint*: use ```*``` and ```/``` as you would on your smartphone.
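
As a minimal sketch of what "calculator" use looks like, the cell below tries a few operators. The variable names are only illustrative.

```python
# Basic arithmetic, as you might type it into a notebook cell.
total = 3 + 4 * 2    # multiplication is done before addition
ratio = 10 / 4       # "/" gives a decimal number in Python 3
whole = 10 // 4      # "//" drops the fractional part
power = 2 ** 5       # "**" raises to a power

print(total, ratio, whole, power)
```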

## Methods and functions

<img src='res/methods_functions.jpg'>

<i>Credits: Kerbal space program: Falcon 9 Space X</i>

The syntax we'll use the most involves two types of commands, **functions** and **methods**: functions look like ```len()``` and methods look like ```.count()```.

Both need an object (text data in our case) to work on; for example, ```len(text1)``` or ```text1.count("Whale")```.
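
To see the two call styles side by side without loading any NLTK data, here is a small sketch on a hand-made token list (the list is hypothetical and simply stands in for a text).

```python
# A tiny made-up token list standing in for an NLTK text.
tokens = ['the', 'whale', 'and', 'the', 'sea', 'and', 'the', 'whale']

n_tokens = len(tokens)           # function: the object goes inside the parentheses
n_whale = tokens.count('whale')  # method: the object comes first, followed by a dot

print(n_tokens, n_whale)
```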

## Quick start: Let computers do the reading and count



### Exploring vocabulary:  Useful functions

NLTK makes it really easy to get basic information about the size of a text and the complexity of its vocabulary using **Python functions**.
*Please* note that all these commands use the same *syntax*; this is the first Python syntax we'll learn.

```len(text1)``` gives the number of symbols or 'tokens' in your text. This is the total number of words and items of punctuation.

In [None]:
print(len(text1))
print(set(text1))

```set(text2)``` gives you the set of all the tokens in the text, without the duplicates. Hence, ```len(set(text3))``` will give you the total number of unique tokens. Remember this still includes punctuation.

```sorted(text4)``` places items in the list into alphabetical order, with punctuation symbols and capitalised words first.
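
The three functions can be tried on a short made-up token list first, where the results are easy to check by eye. This is only a sketch; the tokens are invented for illustration.

```python
tokens = ['Call', 'me', 'Ishmael', '.', 'Call', 'me', 'again', '.']

n_tokens = len(tokens)         # every token, duplicates and punctuation included
n_unique = len(set(tokens))    # distinct tokens only
ordered = sorted(set(tokens))  # punctuation first, then capitalised, then lowercase

print(n_tokens, n_unique)
print(ordered)
```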

#### Challenge: Lexical richness

We can investigate the *lexical richness* of a text. For example, by dividing the total number of words by the number of unique words, we can see the average number of times each word is used. 

For this challenge you will have to combine your knowledge of the syntax we've learnt so far and IPython's mathematical abilities.

Have a go at calculating the lexical richness of text3.

We can also use methods of the **text** object to count words.

In fact, NLTK can count every token for us: see the method ```.vocab()```.


#### Challenge: Percentage taken by a word in text
Store in a variable the number of times the word "sea" occurs in text3. Then calculate the percentage of the whole text taken up by this word.

### Exploring text - useful methods to search inside text
NLTK has useful methods that help us search within a text.


**Concordance** (```.concordance()```) shows you a word in context and is useful if you want to be able to discuss the ways in which a word is used in a text.

**Similar** (```.similar()```) will find words used in similar contexts; it is not looking for synonyms, although the results may include synonyms.

**Common contexts** (```.common_contexts()```) allows us to examine just the contexts that are shared by two or more words, such as *monstrous* and *very*. We have to enclose these words in square brackets as well as parentheses, and separate them with a comma, for example ```text2.common_contexts(['monstrous', 'very'])```.

We can also find words that typically occur together, which tend to be very specific to a text or genre of texts. A **collocation** is a sequence of words that occur together unusually often; ```.collocations()``` lists them.
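
To make the idea of a concordance concrete, here is a rough pure-Python sketch of a "keyword in context" search. It is not how NLTK implements ```.concordance()```; the function name ```keyword_in_context``` and the sample sentence are made up for illustration.

```python
def keyword_in_context(tokens, keyword, width=2):
    """Return each occurrence of keyword with `width` tokens on either side."""
    lines = []
    for position, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, position - width):position]
            right = tokens[position + 1:position + 1 + width]
            lines.append(' '.join(left + [token] + right))
    return lines

tokens = 'the old whale saw the sea and the young whale swam away'.split()
for line in keyword_in_context(tokens, 'whale'):
    print(line)
```

Reading the printed lines one under another is what lets you compare how the same word is used in different places.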

#### Challenge: Text exploration

1. Find the collocations in the Inaugural Address text. 

2. Choose one of the words to concordance. 

3. Investigate how the word is used. What words are used similarly? 

4. And what are the common contexts of these words? 

5. Report your findings to the person next to you. 

6. Do the same with the chat forum.

### Exploring text: Plotting dispersion of words
If we can find words in a text, we can also take note of their position within the text. Python lets you create graphs to analyze textual data, so we can generate a **dispersion plot** that shows where given words occur in a text.
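
A dispersion plot is drawn from word positions. As a sketch of the data behind such a plot, the cell below collects the offsets of two words in a made-up token list; the helper name ```offsets``` is hypothetical, and no plotting library is needed.

```python
tokens = 'war and peace then war again then peace at last'.split()

def offsets(tokens, word):
    """Positions (counting from 0) at which word occurs in tokens."""
    return [i for i, token in enumerate(tokens) if token == word]

for word in ['war', 'peace']:
    print(word, offsets(tokens, word))
```

On a real text, each of these positions would become one mark on the plot's horizontal axis.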

In [None]:
# different roles played by the male and female protagonists in Sense and Sensibility
#['Elinor', 'Marianne', 'Edward', 'Willoughby']


In [None]:
# happy faces and sad faces in chat text

#### Challenge: Dispersion of words in a text

Create a dispersion plot for the terms "democracy", "freedom", "America", "Government", "peace", "war", "happiness" and "fear" in the Inaugural Address Corpus.
What do you think it tells you? 

## Data structures: Texts as Lists of Words
Python treats a text as a long list of words. First, we'll make some lists of our own, to give you an idea of how a list behaves.


In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']
# Note we use Square brackets here to define our list

In [None]:
sent1

A few things to consider...
<br>
<img src='res/lists.jpg'>
*Credit: Head First Python by Paul Barry*

You can add lists together, creating a new list containing all the items from both lists. You can do this by typing out the two lists or you can add two or more pre-defined lists. This is called concatenation.

You can think of a text as a concatenation of sentences, and of a sentence as a concatenation of words.

What if we want to add a single item to a list? This is known as appending. When we append() to a list, the list itself is updated as a result of the operation.
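
A short sketch of both operations, using two invented sentences, shows the difference: concatenation builds a new list, while ```append()``` changes the list it is called on.

```python
sent_a = ['Call', 'me', 'Ishmael', '.']
sent_b = ['It', 'was', 'a', 'whale', '.']

both = sent_a + sent_b   # concatenation: a new list, the originals are untouched
sent_a.append('Really')  # appending: sent_a itself is updated

print(both)
print(sent_a)
```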

#### Indexing Lists
We can navigate this list with the help of indexes. Just as we can find out the number of times a word occurs in a text, we can also find where a word first occurs. We can navigate to different points in a text without restriction, so long as we can describe where we want to be.

In [None]:
print(text4.index('awaken')) # print is a Python function we can use to show the result of an operation

This works in reverse as well. We can ask Python to show us the item at index 158 in our list (note that we use square brackets here, not parentheses).

As well as pulling out individual items from a list, indexes can be used to pull out selections of text from a large corpus to inspect. We call this **slicing**.

If we're asking for the beginning or end of a text, we can leave out the first or second number. For instance, ```[:5]``` will give us the first five items in a list, while ```[8:]``` will give us everything from index 8 (the ninth item) to the end.
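
The cell below is a small sketch of indexing and slicing on a made-up list whose items name their own positions, which makes the results easy to verify.

```python
words = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine']

print(words.index('three'))  # where does 'three' first occur?
print(words[3])              # which item sits at index 3?
print(words[2:5])            # slice: indexes 2, 3 and 4 (the end index is excluded)
print(words[:5])             # the first five items
print(words[8:])             # from index 8 to the end
```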

To help you understand how indexes work, let's create a short test sentence.
We start by giving our list a name and then adding the items. You probably won't build lists by hand in your own work, but you may want to manipulate them in other ways. Pay attention to the quote marks and commas when you create your test sentence.

Note that the first element in the list has index zero. This is because we are telling Python to go zero steps forward in the list. If we use an index that is too large (that is, we ask for something that doesn't exist), we'll get an error.
We can modify elements in a list by assigning new data to one of its index values. We can also replace a slice with new material.
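
Here is a minimal sketch of those three points on an invented test sentence: the first item is at index 0, an out-of-range index raises an ```IndexError```, and both single items and slices can be replaced.

```python
sentence = ['word1', 'word2', 'word3', 'word4', 'word5']

print(sentence[0])  # the first item lives at index 0

try:
    sentence[10]    # there is no index 10 in a five-item list
except IndexError as error:
    print('IndexError:', error)

sentence[0] = 'First'                         # replace a single item
sentence[1:3] = ['Second', 'Third', 'Extra']  # replace a slice (lengths may differ)
print(sentence)
```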

#### Challenge: Lists 

Take a few minutes to define a sentence of your own and modify individual words and groups of words (slices) using the same methods used earlier

## Strings: A useful object to store texts

A string is a sequence of characters; you can think of it as a list of characters (although, unlike a list, a string cannot be changed in place). For example, we can assign a string to a variable, index a string, and slice a string.

We can also apply some arithmetic operators to strings, like ```+``` and ```*```:
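
A small sketch with an arbitrary example string:

```python
name = 'Monty'

first = name[0]       # indexing works as it does for lists
start = name[:3]      # so does slicing
shout = name + '!'    # "+" glues strings together
twice = name * 2      # "*" repeats a string

print(first, start, shout, twice)
```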

We can ```join``` the words of a list to make a single string, or ```split``` a string into a list, as follows:
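
A minimal sketch, using an invented three-word list:

```python
words = ['Call', 'me', 'Ishmael']

sentence = ' '.join(words)     # the string before .join is used as the separator
print(sentence)

words_again = sentence.split() # with no argument, split cuts on whitespace
print(words_again)
```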

It is also helpful to normalize your text, for example by converting it all to lowercase or uppercase.
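
As a quick sketch of why this matters, two spellings of the same word only compare as equal once their case has been normalized (the example strings are arbitrary).

```python
title = 'Moby Dick'
lowered = title.lower()
raised = title.upper()
print(lowered, raised)

same_before = 'Whale' == 'whale'                # capitalisation makes these differ
same_after = 'Whale'.lower() == 'whale'.lower() # equal once both are lowercased
print(same_before, same_after)
```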

## Let computers do the repetitive work: Python Loops

We can use Python to automate tasks, such as performing a function on all items in a list. For instance, we could ask it to tell us the size of all the files in a directory. To do that, we'll have to teach the computer how to repeat things. We do this by creating something called a *loop*.

An example task that we might want to repeat is printing each character in a word on a line of its own. One way to do this would be to use a series of print statements:
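
For a four-letter word (chosen arbitrarily here), the print-statement approach might look like this:

```python
word = 'lead'
print(word[0])
print(word[1])
print(word[2])
print(word[3])
```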

What do you think are the drawbacks of doing this?

<img src='res/loop.jpg'>
<br>
<i>Credit: Head First Python by Paul Barry</i>

This is called a 'for loop'. Try it with other words.
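
As runnable text, a loop version of the letter-printing task could be sketched like this. The list ```printed``` is an extra added for illustration, so that the result can be inspected after the loop.

```python
word = 'lead'
printed = []            # keeps a record of what the loop visited
for char in word:
    print(char)         # runs once per character, whatever the word's length
    printed.append(char)
```

Changing ```word``` to a longer or shorter string requires no other change to the code.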

The general form of a loop is:
```python
for variable in collection:
    # do this
    # do that
```
We can call the loop variable whatever we want, but there must be a colon at the end of the line starting the loop, and we must indent anything we want to run inside the loop. Python uses this indentation, rather than a closing symbol, to tell where the loop ends.

This is a loop that repeatedly updates a variable:

In [None]:
length = 0
for char in 'Python':
    length = length + 1
print('There are', length, 'letters in this word')

Let's go through this line by line.
The variable ```length``` is set to 0, so that Python starts counting from 0.
The second line opens a for loop that loops over the characters in 'Python'.
The third line tells Python to add one to the count each time around. Since there are six characters in 'Python', the statement on line 3 will be executed six times.
The first time around, char is 'P' and length is zero (the value assigned to it on line 1). The statement adds 1 to the old value of length, producing 1, and updates length to refer to that new value. The next time around, char is 'y' and length is 1, so length is updated to be 2. After four more updates, length is 6.
Since there is nothing left in 'Python', the loop finishes and the print statement on line 4 shows our final answer in between two strings.

Of course, we already know that we could use ```len()``` to find the length of a string, but it's just an example... Let's compare results:
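
A minimal sketch of the comparison, repeating the counting loop and checking it against ```len()```:

```python
word = 'Python'

length = 0
for char in word:
    length = length + 1

print(length, len(word), length == len(word))
```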

Now for another example using a list! Fruit salad, yummy, yummy.

In [None]:
fruits = ['banana', 'apple', 'mango']
for fruit in fruits:
    print('Current fruit :', fruit)
print('Done!')  # not indented, so it runs once, after the loop has finished

#### Challenge: Loops
Define a list called Library with the 9 NLTK books in it. Write a for loop that will  print the lexical diversity over each book and tell you its score.

## Frequency Distributions: Counting for analysis
We can use Python's ability to perform statistical analysis of data to do further exploration of vocabulary. For instance, we might want to be able to find the most common or least common words in a text. We'll start by looking at frequency distribution.
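
Before turning to NLTK's own class, the idea can be sketched with ```collections.Counter``` from the standard library. NLTK's ```FreqDist``` builds on ```Counter```, so calls such as ```most_common()``` behave the same way; using ```Counter``` here is a stand-in chosen so that the sketch runs without any NLTK data, and the token list is made up.

```python
from collections import Counter

tokens = 'the whale and the sea and the ship'.split()
counts = Counter(tokens)

print(counts['the'])          # how often a single word occurs
print(counts.most_common(2))  # the two most frequent words with their counts
```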

In [None]:
# import FreqDist from the nltk.probability module


#### Challenge: Frequency distributions
Use a loop to compare the 10 most common words of the texts in the NLTK book.

## Exploring vocabulary (cont.)
As well as counting individual words, we can count other features of vocabulary, such as how often words of different lengths occur. We do this by putting together a number of the commands we've already learned.

We could start like this: 

```[len(word) for word in text1]```

... but this would print the length of every word in the whole book, so let's skip that bit!

In [None]:
# try to get the frequency distribution for the length of words in text1


In [None]:
# what is the most common word length?


In [None]:
# how frequent is that length in the overall text?


These last two commands tell us the most common word length and that words of this length account for about 20% of the book. We can see this just by visually inspecting the list produced by ```fdist2.most_common()```, but if this list were too long to inspect readily, or we didn't want to print it, there are other ways to explore it.
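
The same steps can be sketched on a toy example. As an assumption made for portability, ```collections.Counter``` stands in for ```FreqDist``` (which builds on it), and the ten tokens are invented, so the numbers say nothing about text1.

```python
from collections import Counter

tokens = 'the whale and the sea and the ship sailed on'.split()
lengths = Counter(len(word) for word in tokens)  # count word lengths, not words

most_common_length, count = lengths.most_common(1)[0]
share = count / len(tokens)                      # proportion of all tokens

print(most_common_length, count, share)
```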

There are a number of functions defined for NLTK's frequency distributions:

 | Function | Purpose  |
 |--------------|------------|
 | fdist = FreqDist(samples) | create a frequency distribution containing the given samples |
 | fdist[sample] += 1 | increment the count for this sample |
 | fdist['monstrous']  | count of the number of times a given sample occurred |
 | fdist.freq('monstrous') | frequency of a given sample |
 | fdist.N()  |  total number of samples |
 | fdist.most_common(n)   |  the n most common samples and their frequencies |
 | for sample in fdist:   |  iterate over the items in fdist, when in the loop, we refer to each item as sample |
 | fdist.max() | sample with the greatest count |
 | fdist.tabulate()   |  tabulate the frequency distribution |
 | fdist.plot()  |   graphical plot of the frequency distribution |
 | fdist.plot(cumulative=True) | cumulative plot of the frequency distribution |
 | fdist1 < fdist2 | test if samples in fdist1 occur less frequently than in fdist2 |

It is possible to select the longest words in a text, which may tell you something about its vocabulary and style.

In [None]:
vocab = set(text4)
long_words=[]
for word in ...:
    if len(word)>15:
        long_words....(word)
        
sorted(long_words)

We can use this other template to do exactly the same thing in just one line. We are not going to go deep into this, but if you are interested, it's called a **list comprehension**.

In [None]:
vocab = set(text4)
long_words = [word for word in vocab if len(...) > 15]
sorted(long_words)

We can also use comparisons to refine the types of searches we ask Python to run, using the following relational operators:


### Common relationals
 |  Relational | Meaning |
 |--------------:|:------------|
 | <    |  less than |
 | <=   |   less than or equal to |
 | ==  |    equal to (note this is two "=" signs, not one) |
 | !=   |   not equal to |
 | \>   |   greater than |
 | \>= |   greater than or equal to |
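
As a sketch of how these operators slot into a list comprehension, the cell below filters a made-up sentence. The conditions are deliberately different from the ones in the challenge that follows.

```python
sentence = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']

short = [w for w in sentence if len(w) < 4]    # fewer than four characters
not_the = [w for w in sentence if w != 'the']  # everything except lowercase 'the'

print(short)
print(len(not_the))
```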

#### Challenge: Use operator to explore text

Using one of the pre-defined sentences in the NLTK corpus, use the relational operators above to find:
- Words longer than four characters
- Words of four or more characters
- Words of exactly four characters

We can fine-tune our selection even further by adding other conditions. For instance, we might want to find long words that occur frequently (or rarely).  
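
Conditions are combined with ```and```. The sketch below uses an invented token list and ```collections.Counter``` for the counts (an assumption; with NLTK you would typically use a ```FreqDist```), and the thresholds are arbitrary.

```python
from collections import Counter

tokens = 'whales whales ships sea sea sea harpoon harpoon harpoon'.split()
counts = Counter(tokens)

# Keep words longer than 5 letters that also occur at least 3 times.
selected = sorted(w for w in set(tokens) if len(w) > 5 and counts[w] >= 3)
print(selected)
```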

#### Challenge: Search with conditions

Can you find all the words in a text that are more than seven letters long and occur more than seven times?

### Common methods for strings

 | Method  | Purpose  |
 |--------------|------------|
 | s.startswith(t) | test if s starts with t |
 | s.endswith(t)  |  test if s ends with t | 
 | t in s         |  test if t is a substring of s | 
 | s.islower()    |  test if s contains cased characters and all are lowercase | 
 | s.isupper()    |  test if s contains cased characters and all are uppercase | 
 | s.isalpha()    |  test if s is non-empty and all characters in s are alphabetic | 
 | s.isalnum()    |  test if s is non-empty and all characters in s are alphanumeric | 
 | s.isdigit()    |  test if s is non-empty and all characters in s are digits | 
 | s.istitle()    |  test if s contains cased characters and is titlecased (i.e. all words in s have initial capitals) | 
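
A few of these tests, sketched on a small hypothetical token list:

```python
tokens = ['Whale', 'agreeableness', '1851', 'sea', ',', 'Ishmael', 'USA']

ness_words = [t for t in tokens if t.endswith('ness')]
digits = [t for t in tokens if t.isdigit()]
alphabetic = [t for t in tokens if t.isalpha()]
titlecased = [t for t in tokens if t.istitle()]
uppercase = [t for t in tokens if t.isupper()]

print(ness_words, digits, alphabetic, titlecased, uppercase, sep='\n')
```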

In [None]:
# get all the words from text1 that end with 'ableness'

In [None]:
# get all the words from sent7 which are digits


#### Challenge

You'll remember right at the beginning we started looking at the size of the vocabulary of a text, but there were two problems with the results we got from using:

```len(set(text1))```

This count includes items of punctuation and treats capitalised and non-capitalised words as different things (*This* vs *this*). We can now fix these problems. We start by converting every word to lowercase, then we get rid of the punctuation and numbers.

In [None]:
len(set(text1))

In [None]:
## Normalize by lower case

In [None]:
## Get rid of numbers and punctuation

## Explore your own text: Accessing a corpus   

**Corpus** - Structured set of texts. 
*Corpora* is the plural of this. Example: A collection of medical journals.

Now, let's load in our text.

Visit [Project Gutenberg](https://www.gutenberg.org) and download a book as a plain text file. 

I chose [A Modest Proposal](https://www.gutenberg.org/ebooks/1080)

We can also look at file contents within the IPython Notebook itself:

In [None]:
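import os

# path to the downloaded book, relative to the notebook
text_path = os.path.join('books', 'pg1080.txt')
print(text_path)

# "with" closes the file automatically once the block ends
with open(text_path, 'r', encoding='utf-8') as file:
    text = file.read()
print(text)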

Or we can get it from the web directly into our Jupyter notebook:

In [None]:
from urllib import request
url = "http://www.gutenberg.org/cache/epub/1080/pg1080.txt"
response = request.urlopen(url)
raw = response.read().decode('utf-8-sig')
print(raw)

We can inspect the type of object we just got in Python:

In [None]:
type(raw)

We have a big string, that is, a sequence of characters. We already know how to work with these. For example, how long is this string?

In [None]:
len(raw)

In [None]:
raw[:5]

We are interested in words and sentences. In our first lessons, we analyzed texts that were already split into words and sentences; we can do the same to our own text with an operation called "tokenization".

**Tokenization = cut the text into pieces like sentences or words**
<img src="res/token.jpg"/>
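
To show what "cutting into pieces" means, here is a deliberately crude sketch using a regular expression from the standard library. It is not NLTK's ```word_tokenize```, which handles many more cases; the pattern and the sample string are assumptions chosen for illustration.

```python
import re

raw = "It's a modest proposal, isn't it? Yes."

# A token is either a run of letters/digits, or a run of punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]+", raw)
print(tokens)
```

Notice how the apostrophes are split off as tokens of their own; a purpose-built tokenizer makes more careful decisions about cases like these.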

In [None]:
from nltk import word_tokenize,sent_tokenize,wordpunct_tokenize

In [None]:
## get the tokens

In [None]:
# create a text object out of the tokens

In [None]:
# explore your text

#### Challenge: Explore your vocabulary with your text

Get a text from Project Gutenberg

1. Explore the lexical richness
2. Calculate the percentage taken by a word
3. Find the collocations
4. Choose one of the words to concordance. 
5. Investigate how the word is used
6. Create a dispersion plot
7. Create a frequency distribution
8. Get the top 50 words


## Structuring the code: Writing our own functions (yay!)
One thing that makes Python unique is that whitespace at the start of the line (use four spaces for consistency!) is meaningful. In many other languages, whitespace at the start of lines is simply a readability convention.

In [None]:
# Fix this whitespace problem!

string = 'user'
if string == 'user':
print('Phew, fixed.')

So, whitespace tells both Python and human readers where things start and stop.

### Defining a function

<img src='res/function.jpg'>

*Credit: Head First Python by Paul Barry*

In [None]:
def welcomer(name):
    print('Welcome, %s!' % name)  # here '%s' marks the spot where the string in name is inserted

Here, the word 'name' is a placeholder. It stands in for any argument we might care to place in the brackets. The placeholder could be anything. It could be n or fsdlkfjs; it will still work. What matters for the program is that you use the same one consistently. On the other hand, what matters for us is that you use descriptive names so you remember what the code does!

Notice that the function doesn't do anything by itself. It needs to actually be called, and given some data:

In [None]:
welcomer('jack')

#### Challenge: Functions
Previously, we calculated the lexical diversity of a text. In Python, we can create a function called ```lexical_diversity``` that runs a single line of code. We can then call this function to quickly determine the lexical diversity of a corpus or subcorpus.
Write a function to calculate the lexical diversity of a text; test it out on the books in the NLTK corpus.

In [None]:
def lexical_diversity(text):
    return len(text)/len(set(text))

In [None]:
#After the function has been defined, we can run it:
lexical_diversity(text2)

Other functions that we've used already include ```len()``` and ```sorted()``` - these were predefined. ```lexical_diversity()``` is one we set up ourselves; note that it's conventional to put a set of parentheses after a function, to make it clear what we're talking about.

**--bonus--**
#### Challenge: Functions to set up our text from the web 
Write a function that receives a URL (the internet address where the text is) and returns a ```Text``` object that you can use to explore.