# <u> Chapter 1 - Language Processing and Python </u>

### Chapter Summary

**Focus Areas**
* What can be achieved by combining simple programming techniques with large quantities of text
* How to automatically extract key words and phrases that sum up the style and content of a text
* Tools and techniques that the Python programming language provides for this work
* Some of the interesting challenges of natural language processing

**Summary of Learnings**
* Texts are represented in Python using lists: ['Monty', 'Python']. We can use indexing, slicing, and the len() function on lists.
* A word "token" is a particular appearance of a given word in a text; a word "type" is the unique form of the word as a particular sequence of letters. We count word tokens using len(text) and word types using len(set(text)).
* We obtain the vocabulary of a text t using sorted(set(t)).
* We operate on each item of a text using [f(x) for x in text].
* To derive the vocabulary, collapsing case distinctions and ignoring punctuation, we can write set([w.lower() for w in text if w.isalpha()]).
* We process each word in a text using a for statement, such as for w in t: or for word in text:. This must be followed by the colon character and an indented block of code, to be executed each time through the loop.
* We test a condition using an if statement: if len(word) < 5:. This must be followed by the colon character and an indented block of code, to be executed only if the condition is true.
* A frequency distribution is a collection of items along with their frequency counts (e.g., the words of a text and their frequency of appearance).
* A function is a block of code that has been assigned a name and can be reused. Functions are defined using the def keyword, as in def mult(x, y); x and y are parameters of the function, and act as placeholders for actual data values.
* A function is called by specifying its name, followed by zero or more arguments inside parentheses, like this: mult(3, 4) or len(text1).
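
As a minimal sketch of the points above, the following uses a small hand-made token list rather than an NLTK corpus (the names `tokens`, `vocabulary`, and `normalized` are illustrative assumptions):

```python
# A tiny illustrative "text"; the NLTK texts are much longer lists of tokens.
tokens = ['The', 'cat', 'saw', 'the', 'other', 'cat', '.']

num_tokens = len(tokens)       # every appearance counts
num_types = len(set(tokens))   # each distinct form counts once

# Vocabulary: the sorted distinct forms (punctuation and capitals sort first)
vocabulary = sorted(set(tokens))

# Normalized vocabulary: lowercased, alphabetic tokens only
normalized = sorted(set(w.lower() for w in tokens if w.isalpha()))

print(num_tokens, num_types)
print(vocabulary)
print(normalized)
```

Note that 'The' and 'the' count as two types until case is collapsed with `lower()`.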

### Chapter Index

[Chapter Content / Exploratory Exercises](#Chapter-Content-/-Exploratory-Exercises)

[1.1 Computing with Language: Texts and Words](#1.1-Computing-with-Language:-Texts-and-Words)

[1.2 A Closer Look at Python: Texts as Lists of Words](#1.2-A-Closer-Look-at-Python:-Texts-as-Lists-of-Words)

[1.3 Computing with Language: Simple Statistics](#1.3-Computing-with-Language:-Simple-Statistics)

[1.4 Back to Python: Making Decisions and Taking Control](#1.4-Back-to-Python:-Making-Decisions-and-Taking-Control)

[Exercises & Solution Code](#Exercises-&-Solution-Code)

## <U> Chapter Content / Exploratory Exercises </U>

##### 1.1 Computing with Language: Texts and Words

Importing the library and downloading content (note: a window prompt will appear; this must be closed before continuing to execute code in this notebook):

In [None]:
import nltk 
nltk.download()

In [None]:
from nltk.book import *

Examples of books imported:

In [None]:
text1

In [None]:
text2

In [None]:
text3

The **Concordance** function shows every occurrence of a given word (in this example, 'monstrous') along with additional sentence context:

In [None]:
text1.concordance("monstrous")

In [None]:
text2.concordance("affection")

In [None]:
text3.concordance("lived")

In [None]:
text4.concordance("nation")

In [None]:
text4.concordance("terror")

In [None]:
text4.concordance("god")

In [None]:
text5.concordance("im")

In [None]:
text5.concordance("ur")

In [None]:
text5.concordance("lol")

The **Similar** function shows other words that appear in contexts similar to those of a given keyword (in this case, 'monstrous'):

In [None]:
text1.similar('monstrous')

In [None]:
text2.similar('monstrous')

The **Common Contexts** function shows only contexts shared by two or more words:

In [None]:
text2.common_contexts(['monstrous', 'very'])

A **Dispersion Plot** can help visualize positional information for a word in a given text. Each row in the diagram represents the entire length of text, and each strike indicates an instance of the word. An example is as follows:

In [None]:
import numpy
import matplotlib
text4.dispersion_plot(['citizens','democracy', 'freedom', 'duties', 'America'])
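
The data behind such a plot is simply the list of positions (offsets) at which each word occurs. Here is a minimal sketch on a hand-made token list; the names `words`, `targets`, and `offsets` are illustrative assumptions, and this is not a description of how NLTK draws the plot:

```python
# Hand-made token list standing in for a full text
words = ['we', 'the', 'people', 'and', 'the', 'nation', 'of', 'the', 'people']
targets = ['the', 'people']

# For each target word, collect the index of every position where it appears
offsets = {t: [i for i, w in enumerate(words) if w == t] for t in targets}

for target, positions in offsets.items():
    print(target, positions)
```

Each position would correspond to one strike on that word's row in the plot.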

The **Generate** function uses a text base to generate a random set of output text:

In [None]:
text3.generate()

For example, here is random text in the style of an internet chat room:

In [None]:
text5.generate()

Note that the text is random but reuses common words and phrases from the source text. Additionally, punctuation is split off from the preceding word, because words and punctuation marks are treated as separate tokens in NLP work.

The **Len** function determines the number of words (or tokens) in a text:

In [None]:
len(text3)

The **Set** function displays the unique tokens in a text, without duplicates:

In [None]:
set(text3)

The vocabulary set can also be sorted in alphabetical order, with capitalized tokens preceding lowercase tokens:

In [None]:
sorted(set(text3))

The lexical richness of the text is as follows:

In [None]:
from __future__ import division
len(text3) / len(set(text3))

This number represents, on average, how many times each word is used in the text.
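
To make the ratio concrete, here is a small sketch on a hand-made token list (`sample` is an illustrative assumption, not corpus data):

```python
sample = ['to', 'be', 'or', 'not', 'to', 'be']

# 6 tokens divided by 4 distinct types
ratio = len(sample) / len(set(sample))
print(ratio)
```

Here each distinct word is used 1.5 times on average.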

The percentage of text containing the word **"lol"** is as follows:

In [None]:
100 * text5.count('lol') / len(text5)

Examples of code for a **Lexical Diversity** function and **Percentage** function are as follows:

In [None]:
def lexical_diversity(text):
    return len(text) / len(set(text))

def percentage(count, total):
    return 100 * count / total

These functions can now be called with different arguments:

In [None]:
lexical_diversity(text3)

In [None]:
lexical_diversity(text5)

In [None]:
percentage(4, 5)

In [None]:
percentage(text4.count('a'), len(text4))

##### 1.2 A Closer Look at Python: Texts as Lists of Words

Here is an example of how sentence text is represented in a Python list:

In [None]:
sent1 = ['Call', 'me', 'Ishmael', '.']

Different functions can be performed on this list:

In [None]:
len(sent1)

In [None]:
lexical_diversity(sent1)

The first sentences of each of the nine books are as follows:

In [None]:
sent1

In [None]:
sent2

In [None]:
sent3

In [None]:
sent4

In [None]:
sent5

In [None]:
sent6

In [None]:
sent7

In [None]:
sent8

In [None]:
sent9

The concatenation of two sentences can be done as follows:

In [None]:
['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']

In [None]:
sent4 + sent1

A word can be appended to a list of words as follows:

In [None]:
sent1.append("Some")
sent1

A list of words can be indexed to retrieve the word at a given position, and it can also return the index of the first occurrence of a word of interest:

In [None]:
text4[173]

In [None]:
text4.index('awaken')

A list of words can be sliced to produce a sublist of words:

In [None]:
text5[16715:16735]

In [None]:
text6[1600:1625]

In [None]:
text6[:3]

In [None]:
text2[141525:]

The convention m:n returns elements m...n-1. 
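
A short sketch of this convention on a small list (the list `letters` is an illustrative assumption):

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']

middle = letters[2:5]   # elements at indexes 2, 3, 4 (index 5 is excluded)
head = letters[:3]      # omitting m starts at the beginning
tail = letters[3:]      # omitting n runs to the end

print(middle)
print(head + tail == letters)
```

Because the end index is excluded, `letters[:3]` and `letters[3:]` split the list with no overlap and no gap.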

Here are examples of Python variables and assignments:

In [None]:
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode', 'forth', 'from', 'Camelot', '.']

In [None]:
noun_phrase = my_sent[1:4]
noun_phrase

In [None]:
wOrDs = sorted(noun_phrase)
wOrDs

Additionally, some similar operations can be performed on Python strings:

In [None]:
name = 'Monty'
name[0]

In [None]:
name[:4]

In [None]:
name * 2

In [None]:
name + '!'

In [None]:
' '.join(['Monty', 'Python'])

In [None]:
'Monty Python'.split()

##### 1.3 Computing with Language: Simple Statistics 

The following code first sorts the vocabulary set used to construct the phrase ***After all is said and done more is said than done***. It then returns the last two tokens in the set.

In [None]:
saying = ['After', 'all', 'is', 'said', 'and', 'done', 'more', 'is', 'said', 'than', 'done']
tokens = set(saying)
tokens = sorted(tokens)
tokens[-2:]

A sorted frequency distribution, in the form of a key-value pair, is produced with the **FreqDist** function as follows:

In [None]:
fdist1 = FreqDist(text1)
fdist1

The 50 most frequent words in this distribution can be identified as follows:

In [None]:
vocabulary1 = [word for word, count in fdist1.most_common(50)]  # most_common() returns (word, count) pairs in decreasing order of count
vocabulary1

Here is a cumulative frequency plot of the 50 most frequently used words in Moby Dick:

In [None]:
fdist1.plot(50, cumulative=True)

Meanwhile, the hapaxes (words occurring once in the text) are as follows:

In [None]:
fdist1.hapaxes()
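
The same ideas can be tried without a corpus by using Python's `collections.Counter` as a rough stand-in for a frequency distribution. This is a sketch on the short saying used earlier, and the names `counts`, `top_three`, and `hapaxes` are illustrative assumptions:

```python
from collections import Counter

saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']

counts = Counter(saying)                 # maps each word to its count

# The three most frequent words (all three occur twice here)
top_three = {word for word, count in counts.most_common(3)}

# Hapaxes: words that occur exactly once
hapaxes = sorted(word for word, count in counts.items() if count == 1)

print(counts['said'])
print(top_three)
print(hapaxes)
```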

Here is an example of using set theory notation to select words greater than 15 characters long:

In [None]:
V = set(text1)
long_words = [word for word in V if len(word) > 15]
sorted(long_words)

Here is an example of finding words that are longer than 7 characters and occur more than 7 times:

In [None]:
fdist5 = FreqDist(text5)
sorted([word for word in set(text5) if len(word) > 7 and fdist5[word] > 7])

The **Bigrams** function extracts a list of word pairs from a text:

In [None]:
wordList = ['more', 'is', 'said', 'than', 'done']
list(bigrams(wordList))
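
Conceptually, a bigram list pairs each word with its successor. A plain-Python sketch of that idea follows; it uses `zip` and is not necessarily how NLTK implements `bigrams`:

```python
word_list = ['more', 'is', 'said', 'than', 'done']

# Pair each word with the word that follows it
pairs = list(zip(word_list, word_list[1:]))
print(pairs)
```

A list of n words yields n - 1 pairs, since the last word has no successor.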

The **Collocation List** function finds word pairs that occur together unusually often in a text:

In [None]:
text4.collocation_list()

In [None]:
text8.collocation_list()

The distribution of word lengths in a text is calculated as follows:

In [None]:
fdist = FreqDist([len(word) for word in text1])
fdist

The frequency of each word length is below:

In [None]:
fdist.items()

The most frequently occurring word length is as follows:

In [None]:
fdist.max()

The number of occurrences for this word length is as follows:

In [None]:
maxLength = fdist.max()
fdist[maxLength]

The proportion of the text made up of words of this length (a value between 0 and 1) is as follows:

In [None]:
maxLength = fdist.max()
fdist.freq(maxLength)
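
The word-length steps above can be followed on a small example. This sketch uses `collections.Counter` on a hand-made sentence, and the names `sent`, `length_counts`, and `proportion` are illustrative assumptions:

```python
from collections import Counter

sent = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago']

# Count how many tokens have each length
length_counts = Counter(len(w) for w in sent)

# Most common length and how often it occurs
most_common_length, count = length_counts.most_common(1)[0]

# Share of all tokens that have this length
proportion = count / sum(length_counts.values())

print(most_common_length, count, proportion)
```

Multiplying `proportion` by 100 gives the corresponding percentage.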

The following are important NLTK frequency distribution functions:

| Example | Description |
| --- | --- |
| fdist = FreqDist(samples) | Create a frequency distribution containing the given samples | 
| fdist[sample] += 1 | Increment the count for this sample |
| fdist['monstrous'] | Count the number of times a given sample occurred |
| fdist.freq('monstrous') | Frequency of a given sample |
| fdist.N() | Total number of samples |
| fdist.most_common(n) | The n most common samples and their frequencies |
| for sample in fdist: | Iterate over the samples |
| fdist.max() | Sample with the greatest count |
| fdist.tabulate() | Tabulate the frequency distribution |
| fdist.plot() | Graphical plot of the frequency distribution |
| fdist.plot(cumulative=True) | Cumulative plot of the frequency distribution |
| fdist1 < fdist2 | Test if samples in fdist1 occur less frequently than in fdist2 |



##### 1.4 Back to Python: Making Decisions and Taking Control

This is the first sentence from the Wall Street Journal text:

In [None]:
sent7

Here are some examples of using conditional operators in Python:

In [None]:
[w for w in sent7 if len(w) < 4]

In [None]:
[w for w in sent7 if len(w) <= 4]

In [None]:
[w for w in sent7 if len(w) == 4]

In [None]:
[w for w in sent7 if len(w) != 4]

This is a list of frequently used word comparison operators:

| Function | Meaning |
| --- | --- |
| s.startswith(t) | Test if s starts with t |
| s.endswith(t) | Test if s ends with t |
| t in s | Test if t is contained inside s |
| s.islower() | Test if all cased characters in s are lowercase |
| s.isupper() | Test if all cased characters in s are uppercase | 
| s.isalpha() | Test if all characters in s are alphabetic |
| s.isalnum() | Test if all characters in s are alphanumeric |
| s.isdigit() | Test if all characters in s are digits |
| s.istitle() | Test if s is titlecased (all words in s have initial capitals) |
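
A quick sketch of these tests applied to individual strings (the example strings are illustrative assumptions):

```python
word = 'Sovereignty'

results = {
    "startswith('Sov')": word.startswith('Sov'),
    "endswith('ty')": word.endswith('ty'),
    "'gnt' in word": 'gnt' in word,
    "islower()": word.islower(),   # False: the initial letter is uppercase
    "isalpha()": word.isalpha(),
    "istitle()": word.istitle(),
}

for test, outcome in results.items():
    print(test, outcome)

# Digits and mixed letter/digit strings
print('1851'.isdigit(), 'text1'.isalnum(), 'text1'.isalpha())
```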


Here are some examples of using word comparison operators in Python:

In [None]:
sorted([word for word in set(text1) if word.endswith('ableness')])

In [None]:
sorted([term for term in set(text4) if 'gnt' in term])

In [None]:
sorted([item for item in set(text6) if item.istitle()])

In [None]:
sorted([item for item in set(sent7) if item.isdigit()])

Here are some examples of list comprehensions on word sets:

In [None]:
output = set([word.lower() for word in text1])
output

In [None]:
output = set([word.lower() for word in text1 if word.isalpha()])
output

Here is an example of Python control structures:

In [None]:
tricky = sorted([word for word in set(text2) if 'cie' in word or 'cei' in word])
for word in tricky:
    print(word, end=" ")
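
The same pattern, combining a `for` loop with an `if`/`else` decision, can be tried on a small hand-made word list (the names `words` and `labels` are illustrative assumptions):

```python
words = ['ceiling', 'science', 'whale', 'receive', 'ancient']

# Keep only the words containing either letter sequence
tricky = sorted(w for w in words if 'cie' in w or 'cei' in w)

labels = []
for word in tricky:
    # Record which of the two sequences the word contains
    if 'cei' in word:
        labels.append((word, 'cei'))
    else:
        labels.append((word, 'cie'))

print(labels)
```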

## <U> Exercises & Solution Code </U>

**1) Try using the Python interpreter as a calculator, and typing expressions like 12 / (4 + 1)**

In [None]:
12 / (4+1)

**2) Given an alphabet of 26 letters, there are 26 to the power 10, or 26 ** 10, 10-letter strings we can form. That works out to 141167095653376L (the L at the end just indicates that this is Python's long-number format). How many hundred-letter strings are possible?**

In [None]:
26 ** 100
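
As a sanity check, the sketch below confirms the 10-letter figure quoted in the exercise and reports how many digits the 100-letter count has. In Python 3, integers have arbitrary precision and are displayed without the trailing L (the variable names are illustrative assumptions):

```python
ten_letter = 26 ** 10
hundred_letter = 26 ** 100

print(ten_letter)                 # matches the figure quoted in the exercise
print(len(str(hundred_letter)))   # number of decimal digits in 26 ** 100
```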

**3) The Python multiplication operation can be applied to lists. What happens when you type ['Monty', 'Python'] * 20, or 3 * sent1?**

In [None]:
['Monty', 'Python'] * 20

In [None]:
3 * sent1
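
Multiplying a list by an integer repeats its contents that many times, and the order of the operands does not matter. A smaller sketch of the same behavior (the names `pair` and `repeated` are illustrative assumptions):

```python
pair = ['Monty', 'Python']

repeated = pair * 3
print(repeated)
print(len(repeated))        # three times the length of the original list
print(3 * pair == pair * 3)
```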

**4) Review Section 1.1 on computing with language. How many words are there in text2? How many distinct words are there?**

The number of words in text2 is as follows:

In [None]:
len(text2)

The number of distinct words in text2 is as follows:

In [None]:
len(set(text2))

**5) Compare the lexical diversity scores for humor and romance fiction in Table 1-1. Which genre is more lexically diverse?**

**6) Produce a dispersion plot of the four main protagonists in Sense and Sensibility: Elinor, Marianne, Edward, and Willoughby. What can you observe about the different roles played by the males and females in this novel? Can you identify the couples?**