# FLIP(01):  Advanced Data Science
**(Module 03: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, but NOT allowed to change or distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au)

---


# Session 00 - Language Processing and Python
### Getting Started with NLTK

Before going further you should install NLTK, downloadable for free from http://www.nltk.org/. Follow the instructions there to download the version required for your platform.

In [None]:
import nltk
nltk.download()

In [None]:
from nltk.book import *

In [None]:
text6

In [None]:
text1

In [None]:
# show every occurrence of 'monstrous' in text1, together with its surrounding context
text1.concordance('monstrous')

In [None]:
# show every occurrence of 'hello' in text5, together with its surrounding context
text5.concordance('hello')

In [None]:
# find words that appear in contexts similar to those of 'monstrous' in text1
text1.similar('monstrous')

In [None]:
# find words that appear in contexts similar to those of 'hello' in text5
text5.similar('hello')

In [None]:
# use common_contexts() to find the contexts shared by two or more words
text2.common_contexts(['monstrous','very'])

In [None]:
#a small test
#compare the function 'similar()' and 'common_contexts()' 

In [None]:
# plot where each word occurs across the text (a lexical dispersion plot)
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
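
A dispersion plot marks the position of every occurrence of a word, counted in tokens from the start of the text. As a rough sketch of the underlying idea only (not NLTK's implementation), the offsets can be collected in plain Python. The token list and the helper name `word_offsets` are made up for illustration.

```python
# Toy token list standing in for an NLTK text (illustrative data, not a real corpus).
tokens = ['we', 'the', 'people', 'and', 'the', 'citizens', 'of', 'the', 'land']

def word_offsets(text, target):
    """Return the index of every token in text that equals target."""
    return [i for i, w in enumerate(text) if w == target]

offsets = word_offsets(tokens, 'the')
print(offsets)
```

Each offset corresponds to one tick mark on the row for that word in the plot.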

In [None]:
#get the length of a text
len(text3)

In [None]:
# get the vocabulary of text3 (the set of distinct tokens)
set(text3)

In [None]:
sorted(set(text3))

In [None]:
# only needed on Python 2; on Python 3, / already performs true division
from __future__ import division

In [None]:
# measure lexical diversity: the average number of times each distinct token is used
len(text3) / len(set(text3))

In [None]:
#Calculates the number of times a word appears in the text
text3.count('smote')

In [None]:
#Calculate the percentage of a particular word in the text
100*text4.count('a')/len(text4)

In [None]:
#a small test
#Calculate the percentage of a particular word---'lol' in the text5

In [None]:
#define a function to calculate the text vocabulary richness
def lexical_diversity(text):
    return len(text)/len(set(text))

In [None]:
#define a function to calculate the percentage of a particular word in the text
def percentage(count,total):
    return 100*count/total
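
The two helper functions above can be tried without downloading any corpus. The following is a minimal sketch on a made-up token list (an assumption for illustration; the real texts behave the same way because they support `len()` and `count()`).

```python
# Toy token list standing in for an NLTK text (illustrative data only).
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']

def lexical_diversity(text):
    """Average number of times each distinct token is used."""
    return len(text) / len(set(text))

def percentage(count, total):
    """Express count as a percentage of total."""
    return 100 * count / total

diversity = lexical_diversity(tokens)                        # 7 tokens, 6 distinct
share_of_the = percentage(tokens.count('the'), len(tokens))  # 2 of 7 tokens
print(diversity, share_of_the)
```

A value of `lexical_diversity` close to 1 means almost every token is distinct; larger values mean words are reused more often.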

In [None]:
def per(word,text):
    # percentage of the tokens in text that are equal to word
    count = text.count(word)
    total = len(text)
    return 100*count/total

In [None]:
per('a',text4)

# A Closer Look at Python: Texts as Lists of Words

## List

What is a text? At one level, it is a sequence of symbols on a page such as this one. At another level, it is a sequence of chapters, made up of a sequence of sections, where each section is a sequence of paragraphs, and so on. However, for our purposes, we will think of a text as nothing more than a sequence of words and punctuation. Here’s how we represent text in Python, in this case the opening sentence of Moby Dick:

In [None]:
sent1 = ['Call','me','Ishmael','.']

In [None]:
sent1

In [None]:
def lexical_diversity(text):
    return len(text)/len(set(text))

In [None]:
lexical_diversity(sent1)

In [None]:
sent2 = ['The','family','of','Dashwood','had','long','been','settled','in','Sussex','.']

In [None]:
sent2

In [None]:
sent3 = ['In','the','beginning','God','created','the','heaven','and','the','earth','.']

In [None]:
sent3

In [None]:
# small test
# Make up a few sentences of your own, by typing a name, equals sign, and a list of words, 
# like this: ex1 = ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail'].

Adding two lists creates a new list with everything from the first list, followed by everything from the second list:

In [None]:
['Monty','Python']+['and','the','Holy','Grail']

This special use of the addition operation is called concatenation; it combines the lists together into a single list. We can concatenate sentences to build up a text. We don’t have to literally type the lists either; we can use short names that refer to predefined lists.

In [None]:
sent3 + sent1

In [None]:
sent1.append("some")

In [None]:
sent1
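
The difference between concatenation and `append()` is easy to miss: `+` builds a new list and leaves its operands alone, while `append()` changes the existing list in place. A small sketch with made-up variable names:

```python
first = ['Call', 'me']
second = ['Ishmael', '.']

combined = first + second   # a new list; first and second are unchanged
first.append('please')      # modifies first in place and returns None

print(combined)
print(first)
```

Because `append()` returns `None`, writing `first = first.append('x')` would discard the list.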

## Indexing Lists

As we have seen, a text in Python is a list of words, represented using a combination of brackets and quotes. Just as with an ordinary page of text, we can count up the total number of words in text1 with len(text1), and count the occurrences in a text of a particular word, say heaven, using text1.count('heaven').
Each word also has a position in the text: the 1st word, the 173rd word, and so on. In the same way, we can identify the elements of a Python list by their order of occurrence. The number that represents this position is the item’s index. We instruct Python to show us the item that occurs at an index such as 173 in a text by writing the name of the text followed by the index inside square brackets:

In [None]:
from nltk.book import *

In [None]:
text4

In [None]:
text4[173]

In [None]:
# find the index at which a word first occurs
text4.index('awaken')

Indexes are a common way to access the words of a text, or, more generally, the elements of any list. Python permits us to access sublists as well, extracting manageable pieces of language from large texts, a technique known as slicing.

In [None]:
text5[16715:16735]

In [None]:
text6[1600:1625]

In [None]:
sent = ['word1', 'word2', 'word3', 'word4', 'word5',
        'word6', 'word7', 'word8', 'word9', 'word10']

In [None]:
sent[0]

In [None]:
sent[9]

Notice that our indexes start from zero: sent element zero, written sent[0], is the first word, 'word1', whereas sent element 9 is 'word10'. If we accidentally use an index that is too large, we get an error:

In [None]:
sent[10]

In [None]:
sent[5:8]

In [None]:
sent[5]

In [None]:
sent[7]

In [None]:
sent[:3]

In [None]:
text2[141525:]

In [None]:
sent[0]

In [None]:
sent[9]

In [None]:
len(sent)
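
The slicing rules used in the cells above can be summarized in a short sketch: a slice `sent[m:n]` starts at index `m` and stops just before index `n`, an omitted bound means "from the start" or "to the end", and negative indexes count from the end of the list. The variable names below are made up for illustration.

```python
# Rebuild the ten-word example list: 'word1' ... 'word10'.
sent = ['word%d' % i for i in range(1, 11)]

middle = sent[5:8]   # indexes 5, 6 and 7; the end index 8 is excluded
head = sent[:3]      # the first three items
tail = sent[-2:]     # the last two items

print(middle, head, tail)
```

A slice `sent[m:n]` therefore always contains `n - m` items, as long as both bounds lie inside the list.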

Assigning to a slice modifies the list itself. The next cell replaces the eight items at indexes 1 through 8 with just two items, so the list shrinks from 10 items to 4. As a result, sent[9] in the third cell below no longer exists and raises an IndexError.

In [None]:
sent[1:9] = ['Second','Third']

In [None]:
sent

In [None]:
sent[9]
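
The effect of slice assignment can be checked step by step. This sketch rebuilds the example list, so it does not depend on the state left by earlier cells; the variable name `message` is made up for illustration.

```python
sent = ['word%d' % i for i in range(1, 11)]   # ten items
sent[1:9] = ['Second', 'Third']               # replace eight items with two
sent[0] = 'First'                             # replace a single item by index

try:
    sent[9]                                   # the list now has only four items
except IndexError as err:
    message = str(err)

print(sent, len(sent), message)
```

Catching the `IndexError` here is only for demonstration; in a notebook the error is normally shown directly.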

In [None]:
# small test:
# Take a few minutes to define a sentence of your own and modify individual words and groups of words (slices) using the same
# methods used earlier.

## Variables
You have had access to texts called text1, text2, and so on. It saves a lot of typing to be able to refer to a 250,000-word book with a short name like this! In general, we can make up names for anything we care to calculate. We did this ourselves earlier when defining a variable sent1, as follows:

In [None]:
sent1 = ['Call','me','Ishmael','.']

In [None]:
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
           'forth', 'from', 'Camelot', '.']

In [None]:
noun_phrase = my_sent[1:4]

In [None]:
noun_phrase

In [None]:
wOrDs = sorted(noun_phrase)

In [None]:
wOrDs

We will often use variables to hold intermediate steps of a computation, especially when this makes the code easier to follow. Thus len(set(text1)) could also be written:

In [None]:
vocab = set(text1)
vocab_size = len(vocab)
vocab_size

## Strings
Some of the methods we used to access the elements of a list also work with individual words, or strings. For example, we can assign a string to a variable, index a string, and slice a string.

In [None]:
name = 'Monty'

In [None]:
name[0]

In [None]:
name[:4]

In [None]:
# We can also perform multiplication and addition with strings:
name * 2

In [None]:
name + "!"

In [None]:
# We can join the words of a list to make a single string, or split a string into a list, as follows:
' '.join(['Monty', 'Python'])

In [None]:
'Monty Python'.split()
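
`join()` and `split()` are inverses of each other for words separated by single spaces, which a short round trip makes visible. The variable names are made up for illustration.

```python
words = ['Monty', 'Python']

phrase = ' '.join(words)                  # list of words -> one string
round_trip = phrase.split()               # string -> list of words again
initials = ''.join(w[0] for w in words)   # indexing works on each string

print(phrase, round_trip, initials)
```

Note that `join()` is called on the separator string, not on the list.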

# Computing with Language: Simple Statistics

## Frequency Distribution
A frequency distribution tells us the frequency of each vocabulary item in the text. (In general, it could count any kind of observable event.) It is a “distribution” since it tells us how the total number of word tokens in the text is distributed across the vocabulary items. Since we often need frequency distributions in language processing, NLTK provides built-in support for them. Let’s use a FreqDist to find the most frequent words of Moby Dick.

In [None]:
saying = ['After', 'all', 'is', 'said', 'and', 'done','more', 'is', 'said', 'than', 'done']

In [None]:
tokens = set(saying)

In [None]:
tokens = sorted(tokens)

In [None]:
tokens[-2:]

In [None]:
from nltk.book import *

In [None]:
fdist1 = FreqDist(text1)

In [None]:
fdist1

In [None]:
# the 20 most frequent tokens with their counts (NLTK 3 on Python 3)
fdist1.most_common(20)

In [None]:
fdist1['whale']

In [None]:
fdist1['from']

In [None]:
fdist1.plot(50, cumulative=True)

In [None]:
fdist2 = FreqDist(text2)

In [None]:
fdist2

In [None]:
fdist2.most_common(10)

In [None]:
fdist2['common']

In [None]:
fdist1.hapaxes()
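
In NLTK 3, `FreqDist` is built on Python's `collections.Counter`, so the core idea can be tried without any corpus. The sketch below uses `Counter` as a stand-in on the short `saying` list; the hapaxes (words that occur exactly once) are computed by hand here rather than with `FreqDist.hapaxes()`.

```python
from collections import Counter

saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']

counts = Counter(saying)            # maps each token to its number of occurrences
top_three = counts.most_common(3)   # the three most frequent tokens with counts
hapaxes = sorted(w for w, n in counts.items() if n == 1)

print(counts['said'], top_three, hapaxes)
```

In a real text the most frequent items are mostly function words and punctuation, while hapaxes make up a large share of the vocabulary.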

## Fine-Grained Selection of Words
We would like to find the words from the vocabulary of the text that are more than 15 characters long. Let’s call this property P, so that P(w) is true if and only if w is more than 15 characters long. Now we can express the words of interest using mathematical set notation as shown in (1a). This means “the set of all w such that w is an element of V (the vocabulary) and w has property P.”
           
               (1) a. {w | w ∈ V & P(w)}
                   b. [w for w in V if p(w)]
The corresponding Python expression is given in (1b). (Note that it produces a list, not a set, which means that duplicates are possible.) Observe how similar the two notations are. Let’s go one more step and write executable Python code:

In [None]:
from nltk.book import *

In [None]:
V = set(text1)

In [None]:
long_words = [w for w in V if len(w) > 15]

In [None]:
sorted(long_words)

In [None]:
W = set(text6)

In [None]:
long_word = [w for w in W if len(w) > 12]

In [None]:
sorted(long_word)

In [None]:
fdist5 = FreqDist(text5)

In [None]:
sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
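
The same selection pattern can be tested on a small made-up token list. This sketch uses `collections.Counter` in place of `FreqDist`, and the thresholds are lowered to suit the toy data; both are assumptions for illustration.

```python
from collections import Counter

# Toy token list (illustrative data only).
tokens = ['the', 'circumnavigation', 'of', 'the', 'uncomfortableness',
          'the', 'whale', 'whale']

V = set(tokens)            # the vocabulary: each distinct token once
counts = Counter(tokens)   # how often each token occurs

long_words = sorted(w for w in V if len(w) > 15)
frequent_long = sorted(w for w in V if len(w) > 4 and counts[w] > 1)

print(long_words, frequent_long)
```

Iterating over the set `V` rather than over `tokens` is what keeps duplicates out of the result.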

## Collocations and Bigrams
A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses. To get a handle on collocations, we start off by extracting from a text a list of word pairs, also known as bigrams. This is easily accomplished with the function bigrams():

In [None]:
from nltk import bigrams
# bigrams() returns a generator in NLTK 3, so wrap it in list() to display the pairs
list(bigrams(['more', 'is', 'said', 'than', 'done']))

In [None]:
text4.collocations()

In [None]:
text8.collocations()
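
Bigrams are simply adjacent word pairs, so they can also be produced in plain Python by pairing a list with a copy of itself shifted by one position. The helper name `make_bigrams` and the second token list are made up for illustration. Counting pairs, as done at the end, is only a crude first step toward collocations, which also take into account how frequent the individual words are.

```python
from collections import Counter

def make_bigrams(words):
    """Pair each word with the word that follows it."""
    return list(zip(words, words[1:]))

pairs = make_bigrams(['more', 'is', 'said', 'than', 'done'])

# Count repeated pairs in a toy token list (illustrative data only).
text = ['red', 'wine', 'and', 'red', 'wine']
most_frequent_pair = Counter(make_bigrams(text)).most_common(1)

print(pairs, most_frequent_pair)
```

A list of n words always yields n - 1 bigrams.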

## Counting Other Things
Counting words is useful, but we can count other things too. For example, we can look at the distribution of word lengths in a text, by creating a FreqDist out of a long list of numbers, where each number is the length of the corresponding word in the text:

In [None]:
[len(w) for w in text1]

In [None]:
fdist = FreqDist([len(w) for w in text1])

In [None]:
fdist

In [None]:
fdist.items()

In [None]:
fdist.max()

In [None]:
fdist[3]

In [None]:
fdist.freq(6)
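
The word-length distribution can be reproduced on a small scale with `collections.Counter`. In this sketch, `Counter` stands in for `FreqDist`, and the relative frequency that `freq()` reports is computed by hand as count divided by the total number of tokens. The token list is made up for illustration.

```python
from collections import Counter

# Toy token list (illustrative data only).
tokens = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago']

length_counts = Counter(len(w) for w in tokens)   # word length -> number of tokens
most_common_length = length_counts.most_common(1)[0][0]
freq_of_four = length_counts[4] / len(tokens)     # share of tokens with 4 characters

print(sorted(length_counts.items()), most_common_length, freq_of_four)
```

Here the events being counted are numbers (lengths) rather than words, which shows that a frequency distribution can count any kind of observable event.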

# Back to Python: Making Decisions and Taking Control
## Conditionals
Python supports a wide range of operators, such as < and >=, for testing the relationship between values. We can use relational operators to select different words from a sentence of news text. Here are some examples; notice that only the operator changes from one line to the next.

In [None]:
from nltk.book import *

In [None]:
sent7

In [None]:
[w for w in sent7 if len(w) < 4]

In [None]:
[w for w in sent7 if len(w) <= 4]

In [None]:
[w for w in sent7 if len(w) == 4]

In [None]:
[w for w in sent7 if len(w) != 4]

In [None]:
sorted([w for w in set(text1) if w.endswith('ableness')])

In [None]:
sorted([term for term in set(text4) if 'gnt' in term])

In [None]:
sorted([item for item in set(text6) if item.istitle()])

In [None]:
sorted([item for item in set(sent7) if item.isdigit()])
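
The conditions used above fall into two groups: relational operators applied to `len(w)`, and string methods such as `endswith()`, `istitle()` and `isdigit()` that return True or False. A compact sketch on a made-up token list shows one of each kind side by side.

```python
# Toy token list (illustrative data only).
tokens = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join']

short = [w for w in tokens if len(w) < 4]            # relational test on the length
titlecase = [w for w in tokens if w.istitle()]       # initial capital, rest lowercase
digits = [w for w in tokens if w.isdigit()]          # made up of digits only
ends_in_n = [w for w in tokens if w.endswith('n')]   # suffix test

print(short, titlecase, digits, ends_in_n)
```

Because the comprehension runs over the token list rather than a set, repeated tokens such as the comma appear more than once in the result.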

In [None]:
# small test
# Review the examples above and try to explain what is going on in each one. Next, try to make up some conditions of your own.

## Operating on Every Element

In [None]:
[len(w) for w in text1]

In [None]:
[w.upper() for w in text1]

In [None]:
len(text1)

In [None]:
len(set(text1))

In [None]:
len(set([word.lower() for word in text1]))

In [None]:
len(set([word.lower() for word in text1 if word.isalpha()]))
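
The last three cells count the vocabulary in increasingly strict ways: raw tokens, tokens with case folded, and alphabetic tokens with case folded. The effect of each step is easier to see on a small made-up list.

```python
# Toy token list (illustrative data only).
tokens = ['The', 'whale', ',', 'the', 'Whale', '!', 'THE', 'sea']

raw_vocab = set(tokens)                                     # case variants counted separately
lowered = set(w.lower() for w in tokens)                    # case variants merged
words_only = set(w.lower() for w in tokens if w.isalpha())  # punctuation removed as well

print(len(raw_vocab), len(lowered), len(words_only))
```

Which count is appropriate depends on the task; for example, folding case also merges proper nouns with ordinary words that happen to share a spelling.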

## Nested Code Blocks
Most programming languages permit us to execute a block of code when a conditional expression, or if statement, is satisfied. We already saw examples of conditional tests in code like [w for w in sent7 if len(w) < 4]. In the following program, we have created a variable called word containing the string value 'cat'. The if statement checks whether the test len(word) < 5 is true. It is, so the body of the if statement is invoked and the print() call is executed, displaying a message to the user.

In [None]:
word = 'cat'
if len(word) < 5:
    print('word length is less than 5')

In [None]:
if len(word) >= 5:
    print('word length is greater than or equal to 5')

In [None]:
for word in ['Call', 'me', 'Ishmael', '.']:
    print(word)

## Looping with Conditions
Now we can combine the if and for statements. We will loop over every item of the list, and print the item only if it ends with the letter l. We’ll pick another name for the variable to demonstrate that Python doesn’t try to make sense of variable names.

In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']
for xyzzy in sent1:
    if xyzzy.endswith('l'):
        print(xyzzy)

In [None]:
for token in sent1:
    if token.islower():
        print(token, 'is a lowercase word')
    elif token.istitle():
        print(token, 'is a titlecase word')
    else:
        print(token, 'is punctuation')
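
The if/elif/else chain above can be wrapped in a function so that it returns a label instead of printing. This is a minimal sketch; the function name `describe` and the label 'other' are made up for illustration. It also makes one limitation visible: the final else branch catches everything that is neither lowercase nor titlecase, which includes all-caps words and numbers as well as punctuation.

```python
def describe(token):
    """Classify a token by its capitalization."""
    if token.islower():
        return 'lowercase'
    elif token.istitle():
        return 'titlecase'
    else:
        return 'other'   # punctuation, but also tokens such as 'NASA' or '61'

labels = [(t, describe(t)) for t in ['Call', 'me', 'Ishmael', '.']]
print(labels)
```

Only the first branch whose test is true runs, so the order of the tests matters whenever more than one condition could match.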

In [None]:
tricky = sorted([w for w in set(text2) if 'cie' in w or 'cei' in w])

In [None]:
for word in tricky:
    # end=' ' keeps all the words on one line instead of printing one per line
    print(word, end=' ')