# FLIP(01):  Advanced Data Science
**(Module 07: Natural Language Processing)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.
- If you find any issue or bug in this document, please submit an issue at [tulip-lab/mds](https://github.com/tulip-lab/mds/issues)

Prepared by and for 
**Student Members** |
2006-2019 [TULIP Lab](http://www.tulip.org.au)

---

# Session A - Language Processing and Python

## Contents

1 [Computing with Language: Texts and Words](#Computing)
* Getting Started with Python
* Getting Started with NLTK
* Searching Text
* Counting Vocabulary


2 [A Closer Look at Python: Texts as Lists of Words](#Words)
* Lists
* Indexing Lists
* Variables
* Strings


3 [Computing with Language: Simple Statistics](#Statistics)
* Frequency Distributions
* Fine-Grained Selection of Words
* Collocations and Bigrams
* Counting Other Things


4 [Back to Python: Making Decisions and Taking Control](#Control)
* Conditionals
* Operating on Every Element
* Nested Code Blocks
* Looping with Conditions

<a id = "Computing"></a>

## <span style="color:#0b486b">1. Computing with Language: Texts and Words</span>

### Getting Started with Python

### Getting Started with NLTK

In [None]:
import nltk

#### NLTK book

In [None]:
nltk.download()

<img src='https://github.com/tulip-lab/mds/raw/master/Jupyter/image/nlp/download.png' width = '600' height = '600' align = center>

<img src='https://github.com/tulip-lab/mds/raw/master/Jupyter/image/nlp/download2.png' width = '600' height = '600' align = center>

In [None]:
from nltk.book import *

In [None]:
text1

In [None]:
dir(text1)

In [None]:
text2

### Searching Text

#### concordance

##### A concordance view shows us every occurrence of a given word, together with some context.

In [None]:
text1.concordance("monstrous")
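To make the idea of a concordance concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation: the helper name `simple_concordance`, the `width` parameter, and the toy token list are assumptions made up for illustration.

```python
def simple_concordance(tokens, target, width=3):
    """Return (left context, word, right context) for each occurrence of target."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# A made-up token list, used only for illustration.
toy_tokens = "the monstrous size of the whale was a monstrous surprise".split()

for left, word, right in simple_concordance(toy_tokens, "monstrous"):
    # Right-align the left context so the target words line up in a column.
    print(f"{left:>20} [{word}] {right}")
```

Each hit is shown with up to `width` tokens on either side, which is the same kind of view that `concordance` prints for a full text.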

### Your Turn: 

Try searching for other words; to save re-typing, you might
be able to use up-arrow, Ctrl-up-arrow, or Alt-p to access the previous
command and modify the word being searched. You can also try searches
on some of the other texts we have included. For example, search
Sense and Sensibility for the word affection, using
text2.concordance("affection"). Search the book of Genesis to find out how long
some people lived, using: text3.concordance("lived"). You could look
at text4, the Inaugural Address Corpus, to see examples of English going
back to 1789, and search for words like nation, terror, god to see how
these words have been used differently over time. We’ve also included
text5, the NPS Chat Corpus: search this for unconventional words like
im, ur, lol. (Note that this corpus is uncensored!)

In [None]:
# Let's try it!

#### similar

A concordance permits us to see words in context. For example, we saw that monstrous occurred in contexts such as "the ... pictures" and "the ... size". What other words appear in a similar range of contexts? We can find out by appending the term similar to the name of the text in question, then inserting the relevant word in parentheses.

In [None]:
text1.similar("monstrous")

In [None]:
text2.similar("monstrous")

#### common_contexts

The term common_contexts allows us to examine just the contexts that are shared by two or more words, such as monstrous and very.

In [None]:
text2.common_contexts(["monstrous", "very"])
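The notion of a shared context can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (a context is the pair of words immediately before and after a target, with no case folding); NLTK's own method may differ in detail, and the helper name `contexts` and the toy sentence are made up.

```python
def contexts(tokens, target):
    """Collect (previous word, next word) pairs around each occurrence of target."""
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i] == target:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found

# A made-up token list, used only for illustration.
toy_tokens = "a very pretty house and a monstrous pretty garden is very nice".split()

# Contexts shared by both words are the intersection of the two sets.
shared = contexts(toy_tokens, "monstrous") & contexts(toy_tokens, "very")
print(shared)  # {('a', 'pretty')}
```

Here both words occur in the frame "a ... pretty", so that frame is the only common context.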

### Your Turn:

 Pick another pair of words and compare their usage in two
different texts, using the similar() and common_contexts() functions.

In [None]:
# Let's try it!

#### dispersion plot

We can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of a word, and each row represents the entire text.

In [None]:
text4

In [None]:
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
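The positional information behind a dispersion plot is simply the list of offsets at which a word occurs. The sketch below computes those offsets for a made-up token list and draws a crude text version of the plot; the helper name `word_offsets` and the toy data are assumptions for illustration, not part of NLTK.

```python
def word_offsets(tokens, target):
    """Return the positions (counted from 0) at which target occurs."""
    return [i for i, tok in enumerate(tokens) if tok == target]

# A made-up token list, used only for illustration.
toy_tokens = "freedom and duties of citizens ; freedom for all citizens".split()

for word in ["citizens", "freedom", "democracy"]:
    offsets = word_offsets(toy_tokens, word)
    # One row per word: '|' marks an occurrence, '.' marks any other token.
    row = "".join("|" if i in offsets else "." for i in range(len(toy_tokens)))
    print(f"{word:>10} {row}")
```

A word that never occurs, such as democracy in this toy list, yields an empty row.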

#### generate

Now, just for fun, let’s try generating some random text in the various styles we have just seen. To do this, we type the name of the text followed by the term generate.

In [None]:
text3

In [None]:
text3.generate()

### Counting Vocabulary

#### token

A token is the technical name for a sequence of characters—such as hairy, his, or :)—that we want to treat as a group.

In [None]:
len(text3)

So Genesis has 44,764 words and punctuation symbols, or “tokens.”


##### token example

###### "to be or not to be"

When we count the number of tokens in a text, say, the phrase to
be or not to be, we are counting occurrences of these sequences. Thus, in our example
phrase there are two occurrences of to, two of be, and one each of or and not.

* to, to, be, be, or, not


#### vocabulary

The vocabulary of a text is just the set of tokens that it uses, since in a set, all duplicates are collapsed together.

##### vocabulary example

###### "to be or not to be"

* to, be, or, not

In [None]:
sorted(set(text3))

#### word type

Although it has 44,764 tokens, this book has only 2,789 distinct words, or “word types.” A word type is the form or spelling of the word independently of its specific occurrences in a text—that is, the word considered as a unique item of vocabulary.

In [None]:
len(set(text3))
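The distinction between tokens and word types can be checked on the phrase "to be or not to be" without any corpus. This is a small sketch; the variable names `phrase_tokens` and `phrase_types` are chosen here only for illustration.

```python
phrase = "to be or not to be"

# Tokens: every occurrence counts, so repeated words are counted again.
phrase_tokens = phrase.split()

# Types: a set collapses duplicates, leaving each distinct word once.
phrase_types = set(phrase_tokens)

print(len(phrase_tokens))    # 6
print(sorted(phrase_types))  # ['be', 'not', 'or', 'to']
print(len(phrase_types))     # 4
```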

#### measure of the lexical richness of the text.

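Now, let’s calculate a measure of the lexical richness of the text. The next example
shows us that each word is used 16 times on average. In Python 2 we need to make sure Python
uses floating-point division, which is what the __future__ import below does; in Python 3 the / operator already performs floating-point division.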


In [None]:
from __future__ import division

In [None]:
len(text3) / len(set(text3))
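As a worked check of the arithmetic, the figures quoted in this section (44,764 tokens and 2,789 word types in Genesis) can be divided directly. The variable names are illustrative only, and the snippet assumes Python 3 semantics for `/` and `//`.

```python
token_count = 44764  # number of tokens in Genesis, as reported above
type_count = 2789    # number of distinct word types, as reported above

# True division keeps the fractional part.
average_uses = token_count / type_count
print(round(average_uses, 2))  # 16.05

# Floor division discards the fractional part.
print(token_count // type_count)  # 16
```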

#### let’s focus on particular words

In [None]:
text3.count("smote")

In [None]:
100 * text4.count('a') / len(text4)

### Your Turn : 

How many times does the word lol appear in text5? How
much is this as a percentage of the total number of words in this text?


In [None]:
# Give it a try!

#### function

In [None]:
def lexical_diversity(text):
    return len(text) / len(set(text))

In [None]:
def percentage(count, total): 
    return 100 * count / total

In [None]:
lexical_diversity(text3)

In [None]:
percentage(4, 5)

In [None]:
percentage(text4.count('a'), len(text4))

<a id = "Words"></a>

## <span style="color:#0b486b">2. A Closer Look at Python: Texts as Lists of Words</span>

### Lists

In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']

In [None]:
sent1

In [None]:
len(sent1)

In [None]:
lexical_diversity(sent1)

In [None]:
sent2

In [None]:
sent3

#### concatenation

In [None]:
['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']

In [None]:
sent4 + sent1

#### appending

In [None]:
sent1.append("Some")

In [None]:
sent1

### Indexing Lists

#### index

In [None]:
text4[173]

In [None]:
text4.index('awaken')

#### slicing

In [None]:
text5[16715:16735]

In [None]:
sent = ['word1', 'word2', 'word3', 'word4', 'word5', 'word6', 'word7', 'word8', 'word9', 'word10']

In [None]:
sent[0]

In [None]:
sent[1]

#### runtime error

In [None]:
sent[10]

In [None]:
sent[5:8]

In [None]:
sent[5]

In [None]:
sent[6]

In [None]:
sent[7]

In [None]:
sent[:3]

In [None]:
text2[141525:]

In [None]:
sent[0] = 'First'

In [None]:
sent[9] = 'Last'

In [None]:
len(sent)

In [None]:
sent[1:9] = ['Second', 'Third']

In [None]:
sent

### Variables

#### variable

In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']

#### assignment

In [None]:
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode', 'forth', 'from', 'Camelot', '.']

In [None]:
noun_phrase = my_sent[1:4]

In [None]:
noun_phrase

In [None]:
wOrDs = sorted(noun_phrase)

In [None]:
wOrDs

In [None]:
vocab = set(text1)

In [None]:
vocab_size = len(vocab)

In [None]:
vocab_size

### Strings

In [None]:
name = 'Monty'

In [None]:
name[0]

In [None]:
name[:4]

In [None]:
name * 2

In [None]:
name + '!'

In [None]:
'/'.join(['Monty', 'Python'])

In [None]:
'Monty Python'.split()

<a id = "Statistics"></a>

## <span style="color:#0b486b">3. Computing with Language: Simple Statistics</span>

In [None]:
saying = ['After', 'all', 'is', 'said', 'and', 'done', 'more', 'is', 'said', 'than', 'done']

In [None]:
tokens = set(saying)

In [None]:
tokens = sorted(tokens)

In [None]:
tokens[-2:]

### Frequency Distributions

<img src='https://github.com/tulip-lab/mds/raw/master/Jupyter/image/nlp/tbl1-3.png' width = '600' height = '600' align = center>

In [None]:
fdist1 = FreqDist(text1)

In [None]:
fdist1

In [None]:
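# In Python 3, keys() returns a view that cannot be sliced,
# so build a list of words ordered from most to least frequent.
vocabulary1 = [word for word, count in fdist1.most_common()]

In [None]:
vocabulary1[:50]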

In [None]:
fdist1.plot(50, cumulative=True)

In [None]:
dir(fdist1)

#### hapaxes

If the frequent words don’t help us, how about the words that occur once only, the so-called hapaxes?

In [None]:
fdist1.hapaxes()
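What a frequency distribution and its hapaxes contain can be sketched with the standard library's `collections.Counter`. This is an illustration on a short made-up token list, not a replacement for `FreqDist`; the name `toy_tokens` is an assumption.

```python
from collections import Counter

# A short token list, used only for illustration.
toy_tokens = ['After', 'all', 'is', 'said', 'and', 'done',
              'more', 'is', 'said', 'than', 'done']

# Count how often each token occurs.
counts = Counter(toy_tokens)
print(counts['said'])  # 2

# Hapaxes are the tokens that occur exactly once.
hapaxes = sorted(word for word, count in counts.items() if count == 1)
print(hapaxes)  # ['After', 'all', 'and', 'more', 'than']
```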

### Fine-Grained Selection of Words

#### long words

Next, let’s look at the long words of a text; perhaps these will be more characteristic and informative.

In [None]:
V = set(text1)

In [None]:
long_words = [w for w in V if len(w) > 15]

In [None]:
sorted(long_words)

#### frequent long words

In [None]:
fdist5 = FreqDist(text5)

In [None]:
sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])

### Collocations and Bigrams

#### collocation

A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses; for example, maroon wine sounds very odd.

#### bigrams

To get a handle on collocations, we start off by extracting from a text a list of word pairs, also known as bigrams.

In [None]:
# In NLTK 3, bigrams() returns a generator, so wrap it in list() to see the pairs.
list(bigrams(['more', 'is', 'said', 'than', 'done']))
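Extracting word pairs needs nothing more than pairing each word with its successor. The following is a minimal sketch using `zip`; the helper name `simple_bigrams` is hypothetical and is not NLTK's function.

```python
def simple_bigrams(tokens):
    """Pair every token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

print(simple_bigrams(['more', 'is', 'said', 'than', 'done']))
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```

A list of n tokens yields n - 1 bigrams, so a single-word list produces none.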

In [None]:
text4.collocations()

In [None]:
text8.collocations()

### Counting Other Things

In [None]:
[len(w) for w in text1]

In [None]:
fdist = FreqDist([len(w) for w in text1])

In [None]:
fdist

In [None]:
fdist.keys()

In [None]:
fdist.items()

In [None]:
fdist.max()

In [None]:
fdist[3]

In [None]:
fdist.freq(3)

<img src='https://github.com/tulip-lab/mds/raw/master/Jupyter/image/nlp/tbl1-2.png' width = '600' height = '600' align = center>

<a id = "Control"></a>

## <span style="color:#0b486b">4. Back to Python: Making Decisions and Taking Control</span>

### Conditionals

<img src='https://github.com/tulip-lab/mds/raw/master/Jupyter/image/nlp/tbl1-3.png' width = '600' height = '600' align = center>

In [None]:
sent7

In [None]:
[w for w in sent7 if len(w) < 4]

In [None]:
[w for w in sent7 if len(w) == 4]

In [None]:
[w for w in sent7 if len(w) != 4]

<img src='https://github.com/tulip-lab/mds/raw/master/Jupyter/image/nlp/tbl1-2.png' width = '600' height = '600' align = center>

In [None]:
sorted([w for w in set(text1) if w.endswith('ableness')])

In [None]:
sorted([term for term in set(text4) if 'gnt' in term])

In [None]:
sorted([item for item in set(text6) if item.istitle()])

In [None]:
sorted([item for item in set(sent7) if item.isdigit()])

### Operating on Every Element

In [None]:
[len(w) for w in text1]

In [None]:
[w.upper() for w in text1]

In [None]:
len(text1)

In [None]:
len(set(text1))

In [None]:
len(set([word.lower() for word in text1]))

In [None]:
len(set([word.lower() for word in text1 if word.isalpha()]))

### Nested Code Blocks

In [None]:
word = 'cat'

In [None]:
if len(word) < 5:
    print('word length is less than 5')


In [None]:
if len(word) >= 5:
    print('word length is greater than or equal to 5')


In [None]:
for word in ['Call', 'me', 'Ishmael', '.']:
    print(word)

### Looping with Conditions

In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']

In [None]:
for xyzzy in sent1:
    if xyzzy.endswith('l'):
        print(xyzzy)

In [None]:
for token in sent1:
    if token.islower():
        print(token, 'is a lowercase word')
    elif token.istitle():
        print(token, 'is a titlecase word')
    else:
        print(token, 'is punctuation')

In [None]:
tricky = sorted([w for w in set(text2) if 'cie' in w or 'cei' in w])

In [None]:
for word in tricky:
    # end=' ' keeps the words on one line, like the Python 2 trailing comma.
    print(word, end=' ')