# IMPORTS

In [1]:
import re
import nltk
import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
from bs4 import BeautifulSoup

In [3]:
%matplotlib inline

# Exercise 1

Define a string s = 'colorless'. Write a Python statement that changes this to
“colourless” using only the slice and concatenation operations.

In [4]:
s = 'colorless'

In [5]:
s[:4] + 'u' + s[4:]

'colourless'

# Exercise 2

We can use the slice notation to remove morphological endings on words. For
example, 'dogs'[:-1] removes the last character of dogs, leaving dog. Use slice
notation to remove the affixes from these words (we’ve inserted a hyphen to indicate
the affix boundary, but omit this from your strings): dish-es, run-ning, nation-ality,
un-do, pre-heat.

In [6]:
w = 'dishes'
w[:-2] + '-' + w[-2:]

'dish-es'

In [7]:
w = 'running'
w[:3] + '-'  + w[3:]

'run-ning'

In [8]:
w = 'nationality'
w[:6] + '-' + w[6:]

'nation-ality'

In [9]:
w = 'undo'
w[:2] + '-' + w[2:]

'un-do'

In [10]:
w = 'preheat'
w[:3] + '-' + w[3:]

'pre-heat'
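The cells above mark the affix boundary with a hyphen. The exercise asks for the affix to be removed, which takes one slice per word. A minimal sketch (the list name `stems` is only for illustration):

```python
# Suffixes are dropped with a negative stop index, prefixes with a start index.
stems = [
    'dishes'[:-2],       # drop the suffix -es
    'running'[:-4],      # drop the suffix -ning
    'nationality'[:-5],  # drop the suffix -ality
    'undo'[2:],          # drop the prefix un-
    'preheat'[3:],       # drop the prefix pre-
]
print(stems)
```

This should print `['dish', 'run', 'nation', 'do', 'heat']`.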

# Exercise 3

We saw how we can generate an IndexError by indexing beyond the end of a
string. Is it possible to construct an index that goes too far to the left, before the
start of the string?

In [11]:
w = 'abc'
for i in range(len(w) + 2):
    print(i, w[-i])
# w[0], w[-1], w[-2], w[-3], w[-4]

0 a
1 c
2 b
3 a


IndexError: string index out of range

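Yes. Any negative index below -len(w) points before the start of the string and raises an IndexError, as w[-4] does above. The smallest valid index is -len(w).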
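Indexing and slicing treat out-of-range positions differently. A small sketch of the contrast (the variable names are illustrative):

```python
w = 'abc'

# Indexing one position before the start raises IndexError.
try:
    w[-len(w) - 1]
    too_far_left = False
except IndexError:
    too_far_left = True

# A slice with the same kind of out-of-range bound is clipped instead of failing.
clipped = w[-10:2]

print(too_far_left, clipped)
```

So an index can go "too far to the left", but a slice bound cannot: it is silently clipped to the start of the string.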

# Exercise 4

We can specify a “step” size for the slice. The following returns every second
character within the slice: monty[6:11:2]. It also works in the reverse direction:
monty[10:5:-2]. Try these for yourself, and then experiment with different step
values.

In [12]:
w = 'Hello world!'

In [13]:
w[2:9:2]

'lowr'

In [14]:
w[-1:0:-3]

'!r l'

# Exercise 5

What happens if you ask the interpreter to evaluate monty[::-1]? Explain why
this is a reasonable result.

In [15]:
w[::-1]

'!dlrow olleH'

The string is reversed. With both bounds omitted and a step of -1, the slice starts at the last character and moves back to the first one.

# Exercise 6

Describe the class of strings matched by the following regular expressions:

a. [a-zA-Z]+

b. [A-Z][a-z]*

c. p[aeiou]{,2}t

d. \d+(\.\d+)?

e. ([^aeiou][aeiou][^aeiou])*

f. \w+|[^\w\s]+

Test your answers using nltk.re_show().

a. One or more ASCII letters, in any mix of upper and lower case.

b. A single uppercase letter followed by zero or more lowercase letters, e.g. 'A' or 'Hello' (a capitalized word).

c. The letter 'p', then at most two lowercase vowels, then 't': 'pt', 'pat', 'peat', and so on.

d. An unsigned integer or decimal number: one or more digits (leading zeroes allowed), optionally followed by a dot and one or more digits.

e. Zero or more repetitions of a three-character group: a character that is not a lowercase vowel, then a lowercase vowel, then another character that is not a lowercase vowel. The empty string also matches.

f. Either a run of word characters (letters, digits, underscore) or a run of characters that are neither word characters nor whitespace, such as punctuation.
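The descriptions can also be checked without NLTK. The sketch below uses re.fullmatch, which succeeds only if the whole string matches. The example strings are my own picks, not taken from the book:

```python
import re

# (pattern, strings that should match in full, strings that should not)
cases = [
    (r'[a-zA-Z]+',                  ['abc', 'ABC', 'aBc'],  ['', 'ab1']),
    (r'[A-Z][a-z]*',                ['A', 'Hello'],         ['hello', 'HeLLo']),
    (r'p[aeiou]{,2}t',              ['pt', 'pat', 'peat'],  ['paeit', 'pbt']),
    (r'\d+(\.\d+)?',                ['7', '007', '3.14'],   ['.5', '3.']),
    (r'([^aeiou][aeiou][^aeiou])*', ['', 'cat', 'catdog'],  ['ca', 'eat']),
    (r'\w+|[^\w\s]+',               ['abc_1', '?!'],        ['a b', 'a!']),
]

# For each pattern: do all "good" strings match, and are all "bad" ones rejected?
results = {
    pattern: (
        all(re.fullmatch(pattern, s) for s in good),
        not any(re.fullmatch(pattern, s) for s in bad),
    )
    for pattern, good, bad in cases
}
for pattern, outcome in results.items():
    print(pattern, outcome)
```

Every pattern should report `(True, True)`.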

# Exercise 7

Write regular expressions to match the following classes of strings:

a. A single determiner (assume that a, an, and the are the only determiners)

b. An arithmetic expression using integers, addition, and multiplication, such as 2*3+8

In [16]:
nltk.re_show(
    regexp='\\b([aA]|[aA]n|[tT]he)\\b',
    string='A single determiner (assume that a, an, and the are the only determiners)'
)

{A} single determiner (assume that {a}, {an}, and {the} are {the} only determiners)


In [17]:
nltk.re_show(
    regexp=r'\d+([*+]\d+)+',
    string='An arithmetic expression using integers, addition, and multiplication, such as 2*3+8'
)

An arithmetic expression using integers, addition, and multiplication, such as {2*3+8}


# Exercise 8

Write a utility function that takes a URL as its argument, and returns the contents
of the URL, with all HTML markup removed. Use urllib.urlopen to access the contents of the URL, e.g.:
raw_contents = urllib.urlopen('http://www.nltk.org/').read()

In [18]:
html = requests.get('http://www.nltk.org/').text

In [19]:
raw = BeautifulSoup(html, "lxml").get_text()

In [20]:
raw

'\n\n\nNatural Language Toolkit — NLTK 3.4.4 documentation\n\n\n\n\n\n\n\n\n\n\n\n\n\nNLTK 3.4.4 documentation\n\nnext |\n          modules |\n          index\n\n\n\n\n\n\n\n\n\n\nNatural Language Toolkit¶\nNLTK is a leading platform for building Python programs to work with human language data.\nIt provides easy-to-use interfaces to over 50 corpora and lexical\nresources such as WordNet,\nalong with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning,\nwrappers for industrial-strength NLP libraries,\nand an active discussion forum.\nThanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation,\nNLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike.\nNLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project.\nNLTK has been c
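The exercise asks for a utility function, so the steps above could be wrapped as follows. This is a sketch that uses only the standard library (html.parser and urllib.request) instead of requests and BeautifulSoup. The names `strip_markup` and `url_to_text` are made up for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def strip_markup(html):
    """Return the text of an HTML string with whitespace normalized."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r'\s+', ' ', ''.join(parser.chunks)).strip()


def url_to_text(url, encoding='utf8'):
    """Download a page and return its text (needs network access)."""
    with urlopen(url) as response:
        return strip_markup(response.read().decode(encoding, errors='replace'))


sample = '<p>Hello, <b>world</b>!</p><script>var x = 1;</script>'
print(strip_markup(sample))
```

Calling `url_to_text('http://www.nltk.org/')` would then play the role of cells 18 and 19. The UTF-8 default is an assumption and may need adjusting per site.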

# Exercise 9

Save some text into a file corpus.txt. Define a function load(f) that reads from
the file named in its sole argument, and returns a string containing the text of the
file.

a. Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the various
kinds of punctuation in this text. Use one multiline regular expression, with inline
comments, using the verbose flag (?x).

b. Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the following
kinds of expressions: monetary amounts; dates; names of people and
organizations.

In [21]:
def load(filename, encoding='utf8'):
    with open(filename, encoding=encoding) as f:
        text = f.read()
    return text

In [22]:
text = load('ch3_ex09.txt')
text

'Save some text into a file corpus.txt. Define a function load(f) that reads €100 from the file named in its sole argument, and returns a string containing the text of the file.\n\n$56.12 ₽56.56\n\na. Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multiline regular expression inline comments, using the verbose flag (?x).\nb. Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the following kinds of expressions: monetary amounts; dates; names of people and organizations    !"#%&\'()*+,-./:;<=>?@[]^_`{|}~\\ffffff\n\n2019-04-15\n\nName Surname\n'

In [23]:
print(nltk.regexp_tokenize(text, """[!"#%&'()*+,-./:;<=>?@\[\]^_`{|}~\\\\]"""))

['.', '.', '(', ')', ',', '.', '.', '.', '.', '.', '_', '(', ')', '.', ',', '(', '?', ')', '.', '.', '.', '_', '(', ')', ':', ';', ';', '!', '"', '#', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', ']', '^', '_', '`', '{', '|', '}', '~', '\\', '-', '-']


In [24]:
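pattern = r"""(?x)
    [$€₽]\d+(?:\.\d+)?    # monetary amounts, e.g. €100 or $56.12
  | \d{4}-\d{2}-\d{2}     # dates in YYYY-MM-DD form, e.g. 2019-04-15
  | [A-Z]\w+              # capitalized words (candidate names)
"""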

In [25]:
print(nltk.regexp_tokenize(text, pattern))

['Save', 'Define', '€100', '$56.12', '₽56.56', 'Use', 'Use', 'Use', '2019-04-15', 'Name', 'Surname']


# Exercise 10

Rewrite the following loop as a list comprehension:

In [26]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = []
for word in sent:
    word_len = (word, len(word))
    result.append(word_len)
result

[('The', 3),
 ('dog', 3),
 ('gave', 4),
 ('John', 4),
 ('the', 3),
 ('newspaper', 9)]

In [27]:
result = [(w, len(w)) for w in sent]
result

[('The', 3),
 ('dog', 3),
 ('gave', 4),
 ('John', 4),
 ('the', 3),
 ('newspaper', 9)]

# Exercise 11

Define a string raw containing a sentence of your own choosing. Now, split raw
on some character other than space, such as 's'.

In [28]:
raw = """Define a string raw containing a sentence of your own choosing. Now, split raw
on some character other than space, such as 's'"""

In [29]:
raw.split('s')

['Define a ',
 'tring raw containing a ',
 'entence of your own choo',
 'ing. Now, ',
 'plit raw\non ',
 'ome character other than ',
 'pace, ',
 'uch a',
 " '",
 "'"]

# Exercise 12

Write a for loop to print out the characters of a string, one per line.

In [30]:
for c in raw:
    print(c)

D
e
f
i
n
e
 
a
 
s
t
r
i
n
g
 
r
a
w
 
c
o
n
t
a
i
n
i
n
g
 
a
 
s
e
n
t
e
n
c
e
 
o
f
 
y
o
u
r
 
o
w
n
 
c
h
o
o
s
i
n
g
.
 
N
o
w
,
 
s
p
l
i
t
 
r
a
w


o
n
 
s
o
m
e
 
c
h
a
r
a
c
t
e
r
 
o
t
h
e
r
 
t
h
a
n
 
s
p
a
c
e
,
 
s
u
c
h
 
a
s
 
'
s
'


# Exercise 13

What is the difference between calling split on a string with no argument and
one with ' ' as the argument, e.g., sent.split() versus sent.split(' ')? What
happens when the string being split contains tab characters, consecutive space
characters, or a sequence of tabs and spaces? (In IDLE you will need to use '\t' to
enter a tab character.)

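With no argument (or sep=None), split() treats any run of whitespace (spaces, tabs, newlines) as a single separator and drops empty strings from the result. With ' ' as the argument, only the single space character is a separator, so tabs and newlines stay inside the tokens and consecutive spaces produce empty strings.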

In [31]:
'asd sad\n\t\t\t\t\r   asdasds'.split()

['asd', 'sad', 'asdasds']

In [32]:
'asd sad\n\t\t\t\t\r   asdasds'.split(' ')

['asd', 'sad\n\t\t\t\t\r', '', '', 'asdasds']

# Exercise 14

Create a variable words containing a list of words. Experiment with
words.sort() and sorted(words). What is the difference?

In [33]:
words = nltk.word_tokenize('Create a variable words containing a list of words')
words

['Create', 'a', 'variable', 'words', 'containing', 'a', 'list', 'of', 'words']

In [34]:
words.sort()
words

['Create', 'a', 'a', 'containing', 'list', 'of', 'variable', 'words', 'words']

In [35]:
sorted(words)

['Create', 'a', 'a', 'containing', 'list', 'of', 'variable', 'words', 'words']

According to https://stackoverflow.com/questions/22442378/what-is-the-difference-between-sortedlist-vs-list-sort

<b>sorted() returns a new sorted list</b>, leaving the original list unaffected. <b>list.sort() sorts the list in-place</b>, mutating the list indices, and returns None (like all in-place operations).

sorted() works on any iterable, not just lists. Strings, tuples, dictionaries (you'll get the keys), generators, etc., returning a list containing all elements, sorted.

Use list.sort() when you want to mutate the list, sorted() when you want a new sorted object back. Use sorted() when you want to sort something that is an iterable, not a list yet.

For lists, <b>list.sort() is faster than sorted()</b> because it doesn't have to create a copy. For any other iterable, you have no choice.
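In the cells above, words had already been sorted in place before sorted(words) was called, so both outputs look the same. A small sketch that makes the difference visible (the word list is made up):

```python
words = ['words', 'Create', 'a', 'list']

new_list = sorted(words)   # returns a new list; words is left untouched
before = list(words)       # snapshot taken before the in-place sort
returned = words.sort()    # sorts words in place and returns None

print(new_list, before, returned, words)
```

Note that uppercase letters sort before lowercase ones, so 'Create' comes first.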

# Exercise 15

Explore the difference between strings and integers by typing the following at a
Python prompt: "3" * 7 and 3 * 7. Try converting between strings and integers
using int("3") and str(3).

In [36]:
"3" * 7,  3 * 7

('3333333', 21)

In [37]:
int("3"),  str(3)

(3, '3')

# Exercise 16

Earlier, we asked you to use a text editor to create a file called test.py, containing the single line monty = 'Monty Python'. If you haven’t already done this (or can’t find the file), go ahead and do it now. Next, start up a new session with the Python interpreter, and enter the expression monty at the prompt. You will get an error from the interpreter. Now, try the following (note that you have to leave off the .py part of the filename):

<b>from test import msg</b>

<b>msg</b>

This time, Python should return with a value. You can also try import test, in which case Python should be able to evaluate the expression test.monty at the prompt.

In [38]:
monty

NameError: name 'monty' is not defined

In [39]:
from ch3_ex16 import monty

In [40]:
monty

'Monty Python'

In [41]:
import ch3_ex16

In [42]:
ch3_ex16.monty

'Monty Python'

# Exercise 17

What happens when the formatting strings %6s and %-6s are used to display
strings that are longer than six characters?

In [43]:
'%6s' % 'aaaaaaa', '%6s' % 'bb'

('aaaaaaa', '    bb')

In [44]:
'%-6s' % 'bbbbbbb', '%-6s' % 'bb'

('bbbbbbb', 'bb    ')
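The outputs show that the field width is only a minimum: longer strings are printed in full, and neither %6s nor %-6s truncates. If truncation is wanted, a precision can be added. A short sketch (variable names are illustrative):

```python
long_word = 'aaaaaaa'   # seven characters

padded_only = '%6s' % long_word    # width is a minimum: nothing is cut
truncated = '%.6s' % long_word     # precision caps the length at six
fixed = '%-6.6s|' % 'bb'           # pad and cap together, left-aligned

print(repr(padded_only), repr(truncated), repr(fixed))
```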

# Exercise 18

Read in some text from a corpus, tokenize it, and print the list of all wh-word
types that occur. (wh-words in English are used in questions, relative clauses, and
exclamations: who, which, what, and so on.) Print them in order. Are any words
duplicated in this list, because of the presence of case distinctions or punctuation?

In [45]:
from nltk.corpus import gutenberg
from nltk.corpus import stopwords

In [46]:
raw_text = gutenberg.raw('carroll-alice.txt')
words = [w.lower() for w in nltk.word_tokenize(raw_text) if w.isalpha()]

In [47]:
wh_word_types = sorted(
    set(w for w in words if w.startswith('wh'))
    & set(stopwords.words('english') + ['whose'])
)

In [48]:
wh_word_types

['what', 'when', 'where', 'which', 'while', 'who', 'whom', 'whose', 'why']

# Exercise 19

Create a file consisting of words and (made up) frequencies, where each line
consists of a word, the space character, and a positive integer, e.g., fuzzy 53. Read
the file into a Python list using open(filename).readlines(). Next, break each line
into its two fields using split(), and convert the number into an integer using
int(). The result should be a list of the form: [['fuzzy', 53], ...].

In [49]:
result = []
with open('ch3_ex19.txt') as f:
    for line in f.readlines():
        w, freq = line.split()
        result.append([w, int(freq)])
result

[['x', 53], ['y', 156], ['z', 567]]

# Exercise 20

Write code to access a favorite web page and extract some text from it. For
example, access a weather site and extract the forecast top temperature for your
town or city today.

In [50]:
response = requests.get('https://www.tut.by/')

In [51]:
content = BeautifulSoup(response.text, "lxml")

In [52]:
dollar_rate = (
    content
    .select_one('a[href="https://finance.tut.by/kurs/"]')
    .get_text(strip=True)
)
dollar_rate

'$2.0488'

# Exercise 21

Write a function unknown() that takes a URL as its argument, and returns a list
of unknown words that occur on that web page. In order to do this, extract all
substrings consisting of lowercase letters (using re.findall()) and remove any
items from this set that occur in the Words Corpus (nltk.corpus.words). Try to
categorize these words manually and discuss your findings.

In [53]:
from nltk.corpus import words

In [54]:
def unknown(url, regex='[a-z]+'):
    response = requests.get(url)
    text = BeautifulSoup(response.text, "lxml").get_text()
    words_set = set(words.words())
    return sorted(set([
        s for s in re.findall(regex, text) 
        if s not in words_set
    ]))

In [55]:
unknown_words = unknown("https://en.wikipedia.org/wiki/Plain_text")
len(unknown_words)

552

In [56]:
unknown_words

['abel',
 'ac',
 'accented',
 'acintosh',
 'ackend',
 'adding',
 'adges',
 'adget',
 'ahasa',
 'ain',
 'ais',
 'allbacks',
 'allocated',
 'allows',
 'ames',
 'amespace',
 'amespaces',
 'andom',
 'anguage',
 'anguages',
 'anonical',
 'anuary',
 'applications',
 'arametric',
 'arget',
 'argued',
 'ariant',
 'ariants',
 'arkup',
 'articles',
 'assigning',
 'assigns',
 'ata',
 'atal',
 'ategories',
 'atin',
 'ational',
 'autostart',
 'av',
 'avoided',
 'became',
 'bfloat',
 'bi',
 'bignum',
 'bitmapped',
 'bits',
 'bject',
 'books',
 'breaks',
 'browsers',
 'bstract',
 'bytes',
 'cachereport',
 'categories',
 'ccording',
 'centralauth',
 'centralautologin',
 'challenged',
 'changes',
 'characters',
 'charinsert',
 'charset',
 'checksum',
 'chema',
 'citations',
 'cleartext',
 'codes',
 'commands',
 'compactlinks',
 'companies',
 'completed',
 'computers',
 'computing',
 'concerns',
 'config',
 'conflicts',
 'considers',
 'consisting',
 'containing',
 'contains',
 'conventions',
 'countries

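Many of the unknown items are inflected forms that are missing from the Words Corpus: plural nouns ('applications', 'browsers') and verb forms ('allows', 'argued'). Others, such as 'acintosh' and 'anguage', look like fragments of capitalized words (e.g. 'Macintosh'), because the pattern [a-z]+ also matches the lowercase tail of a word.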

# Exercise 22

Examine the results of processing the URL http://news.bbc.co.uk/ using the regular
expressions suggested above. You will see that there is still a fair amount of
non-textual data there, particularly JavaScript commands. You may also find that
sentence breaks have not been properly preserved. Define further regular expressions
that improve the extraction of text from this web page.

In [57]:
unknown_bbc = unknown("http://news.bbc.co.uk/", regex='\\b[a-z]+\\b')
len(unknown_bbc)

397

In [58]:
unknown_bbc

['abs',
 'accuses',
 'adchoices',
 'adding',
 'additions',
 'ads',
 'adverts',
 'alsos',
 'amp',
 'antialiased',
 'ap',
 'api',
 'app',
 'apps',
 'arguments',
 'articles',
 'arts',
 'async',
 'attributes',
 'autocapitalize',
 'autocomplete',
 'autocorrect',
 'auxclick',
 'balatarin',
 'bbc',
 'bbccookies',
 'bbcdotcom',
 'bbci',
 'bbcpage',
 'bbcprivacy',
 'bbcredirection',
 'bbcthree',
 'bbcuser',
 'bbcworldwide',
 'biowaste',
 'bitesize',
 'blog',
 'bossa',
 'bramstein',
 'branding',
 'browsers',
 'btn',
 'bubbles',
 'bugs',
 'bundles',
 'calc',
 'callback',
 'called',
 'cbbc',
 'cbeebies',
 'cc',
 'ccauds',
 'cdn',
 'cf',
 'challenging',
 'charset',
 'chartbeat',
 'checksum',
 'children',
 'choices',
 'claims',
 'closeable',
 'cmd',
 'co',
 'com',
 'comments',
 'comp',
 'conditions',
 'config',
 'configurable',
 'const',
 'contexts',
 'cookie',
 'cookies',
 'cps',
 'criticised',
 'crwdcntrl',
 'css',
 'cta',
 'ctrl',
 'cy',
 'dcdcdc',
 'debug',
 'declarations',
 'defied',
 'degrees'

# Exercise 23

Are you able to write a regular expression to tokenize text in such a way that the
word don’t is tokenized into do and n’t? Explain why this regular expression won’t
work: «n't|\w+».

In [59]:
re.findall(r"n't|\w+", "don't")  # at position 0 n't fails, so the greedy \w+ takes 'don', including the n

['don', 't']

In [60]:
re.findall(r"\bdo|n't\b|\w+", "bla don't blabla")

['bla', 'do', "n't", 'blabla']
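The pattern in cell 60 handles only words that start with do. A more general variant is sketched below; it is my own suggestion, not the book's answer. It uses a lookahead so that \w+ stops right before any n't.

```python
import re

# First branch: word characters that are directly followed by n't.
# Second branch: the clitic itself. Third branch: any other word.
pattern = r"\w+(?=n't)|n't|\w+"

tokens = re.findall(pattern, "I don't think they can't")
print(tokens)
```

The lookahead forces \w+ to backtrack from 'don' to 'do', which leaves the n available for the n't branch.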

# Exercise 24

Try to write code to convert text into hAck3r, using regular expressions and
substitution, where e → 3, i → 1, o → 0, l → |, s → 5, . → 5w33t!, ate → 8. Normalize
the text to lowercase before converting it. Add more substitutions of your own.
Now try to map s to two different values: $ for word-initial s, and 5 for word-internal
s.

In [61]:
text = """Examine the results of processing the URL http://news.bbc.co.uk/ 
using the regular expressions suggested above. You will see that there is still a 
fair amount of non-textual data there, particularly JavaScript commands."""

In [62]:
text_v01 = (
    text
    .lower()
    .replace('e', '3')
    .replace('i', '1')
    .replace('l', '|')
    .replace('s', '5')
    .replace('.', '5w33t!')
    .replace('ate', '8')
)
text_v01

'3xam1n3 th3 r35u|t5 of proc3551ng th3 ur| http://n3w55w33t!bbc5w33t!co5w33t!uk/ \nu51ng th3 r3gu|ar 3xpr3551on5 5ugg35t3d abov35w33t! you w1|| 533 that th3r3 15 5t1|| a \nfa1r amount of non-t3xtua| data th3r3, part1cu|ar|y java5cr1pt command55w33t!'

In [63]:
text_v02 = (
    text.lower()
    .replace('e', '3')
    .replace('i', '1')
    .replace('l', '|')
    .replace('.', '5w33t!')
    .replace('ate', '8')
)
text_v02 = re.sub('\\bs', '$', text_v02)
text_v02 = re.sub('s', '5', text_v02)
text_v02

'3xam1n3 th3 r35u|t5 of proc3551ng th3 ur| http://n3w55w33t!bbc5w33t!co5w33t!uk/ \nu51ng th3 r3gu|ar 3xpr3551on5 $ugg35t3d abov35w33t! you w1|| $33 that th3r3 15 $t1|| a \nfa1r amount of non-t3xtua| data th3r3, part1cu|ar|y java5cr1pt command55w33t!'
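The order of the replacements matters. In the chains above, 'ate' is replaced only after every e has become 3, so that rule can never fire, and the o → 0 rule is missing. Here is one possible ordering as a sketch, with the rules kept in a list and applied with re.sub. The names `SUBSTITUTIONS` and `to_hacker` are made up.

```python
import re

# Rules are applied top to bottom, so multi-letter rules come first.
SUBSTITUTIONS = [
    (r'ate', '8'),       # before 'e' is rewritten
    (r'\.', '5w33t!'),   # before 's' is rewritten, so this literal stays intact
    (r'\bs', '$'),       # word-initial s
    (r's', '5'),         # every remaining (word-internal) s
    (r'e', '3'),
    (r'i', '1'),
    (r'o', '0'),
    (r'l', '|'),
]


def to_hacker(text):
    """Lowercase the text and apply the substitutions in order."""
    text = text.lower()
    for pattern, repl in SUBSTITUTIONS:
        text = re.sub(pattern, repl, text)
    return text


print(to_hacker("She ate less."))
```

Dots inside URLs are still rewritten by this version. Protecting them would need an extra rule.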

# Exercise 25

Pig Latin is a simple transformation of English text. Each word of the text is
converted as follows: move any consonant (or consonant cluster) that appears at
the start of the word to the end, then append ay, e.g., string → ingstray, idle →
idleay (see http://en.wikipedia.org/wiki/Pig_Latin).

a. Write a function to convert a word to Pig Latin.

b. Write code that converts text, instead of individual words.

c. Extend it further to preserve capitalization, to keep qu together (so that
quiet becomes ietquay, for example), and to detect when y is used as a consonant
(e.g., yellow) versus a vowel (e.g., style).

In [64]:
def word_to_pig_latin(word):
    result = word
    if word.isalpha():
        pattern = re.compile('^(y|qu|[^aeiouy]*)', re.IGNORECASE)
        beginning = re.findall(pattern, word)[0]
        stem = word.replace(beginning, '', 1)
        result = stem + beginning + 'ay'
        # preserve capitalization
        result = result.title() if word.istitle() else result
    return result

In [65]:
def text_to_pig_latin(text):
    return ' '.join(list(map(word_to_pig_latin, nltk.word_tokenize(text))))

In [66]:
list(map(word_to_pig_latin, ['string', 'idle', '000...', 'yellow', 'style', 'Yellow']))

['ingstray', 'idleay', '000...', 'ellowyay', 'ylestay', 'Ellowyay']

In [67]:
text = """Pig Latin is a simple transformation of English text. Each word of the text is converted as follows: 
move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ay"""

In [68]:
text_to_pig_latin(text)

'Igpay Atinlay isay aay implesay ansformationtray ofay Englishay exttay . Eachay ordway ofay ethay exttay isay onvertedcay asay ollowsfay : ovemay anyay onsonantcay ( oray onsonantcay usterclay ) atthay appearsay atay ethay artstay ofay ethay ordway otay ethay enday , enthay appenday ayay'

# Exercise 26

Download some text from a language that has vowel harmony (e.g., Hungarian),
extract the vowel sequences of words, and create a vowel bigram table.

In [69]:
from nltk.corpus import udhr
from nltk.probability import ConditionalFreqDist

In [70]:
turkish_vowel_harmony = ConditionalFreqDist(
    bigram
    for w in udhr.words('Turkish_Turkce-UTF8')
    for bigram in nltk.bigrams(re.findall('[aıоueiöü]', w.lower()))
)

In [71]:
turkish_vowel_harmony.tabulate(conditions='aıоueiöü', samples='aıоueiöü')

    a   ı   о   u   e   i   ö   ü 
a 205 247   0  21  65  83   0   6 
ı  97  64   0   1   2   1   0   0 
о   0   0   0   0   0   0   0   0 
u  62   0   0  72   6   3   0   3 
e  56   0   0   4 246 236   0   0 
i  58   0   0   3 189 156   0   0 
ö   0   0   0   0  30   0   2   7 
ü   7   0   0   0  34  13   0  26 


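The table is consistent with Turkish vowel harmony: each vowel is followed mostly by vowels of the same class.

'a' -> ('a', 'ı')

'ı' -> ('a', 'ı')

'u' -> ('a', 'u')

'e' -> ('e', 'i')

'i' -> ('e', 'i')

'ö' -> 'e'

'ü' -> ('e', 'ü', 'i')

The 'о' row and column are entirely zero, which suggests that the character typed in the pattern is not the Latin letter o (possibly a look-alike from another alphabet), so bigrams involving o are missing from the table.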
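The counting step can be tried on a handful of words without NLTK. A sketch with collections.Counter follows; the sample words are made up for illustration and the vowel string uses the Latin letter o.

```python
import re
from collections import Counter

VOWELS = 'aıoueiöü'  # Latin o

# Hypothetical sample words, chosen only for illustration.
sample_words = ['evler', 'kitaplar', 'gözlük', 'okullar']


def vowel_bigrams(word):
    """Return the pairs of consecutive vowels in a word."""
    vowels = re.findall('[' + VOWELS + ']', word)
    return list(zip(vowels, vowels[1:]))


bigram_counts = Counter(b for w in sample_words for b in vowel_bigrams(w))
print(bigram_counts)
```

Passing the same generator to ConditionalFreqDist instead of Counter should give a table like the one above.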

# Exercise 27

Python’s random module includes a function choice() which randomly chooses
an item from a sequence; e.g., choice("aehh ") will produce one of four possible
characters, with the letter h being twice as frequent as the others. Write a generator
expression that produces a sequence of 500 randomly chosen letters drawn from
the string "aehh ", and put this expression inside a call to the ''.join() function,
to concatenate them into one long string. You should get a result that looks like
uncontrolled sneezing or maniacal laughter: he haha ee heheeh eha. Use split()
and join() again to normalize the whitespace in this string.

In [72]:
' '.join(''.join(np.random.choice(list("aehh "), 500)).split())

'h haah ahh hahe ha ah hhhhah eh eh hhh ea hhheha ehhaheahhaaahhhah hhhhhe eaha hahhaeahhhhhh he aeh ahehe hehhehha eahhhahheh ah he a h ehhh ahaea aha eaahe eeaahaa ahhehaehhhah ee hhe ea hh he hee he heeeeaahhh eeh e hahhh hhhheehhh ehhhee hheaaa hhea h ehhhhhhhaehhh eahheh eheahhhea hhehhheeh a hheeeeh h hhhah h hh hheaa hhehhhhehhhh aeeh ha heeh aea hhh eaeh ehh aa h eheh h ahehaah aah ahhea ehe hh h ahaheeaea heahaahhhahhh hhahaaheheeeheeh hh ha h heaahhah ahehhh a'
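The exercise asks for random.choice() inside a generator expression, whereas the cell above uses np.random.choice. A sketch of the standard-library version (the seed is there only to make the run repeatable):

```python
import random

random.seed(0)  # only to make the run repeatable

# Generator expression inside ''.join(): 500 letters drawn from "aehh ".
laughter = ''.join(random.choice("aehh ") for _ in range(500))

# split() drops runs of spaces; ' '.join() puts single spaces back.
normalized = ' '.join(laughter.split())
print(normalized[:60])
```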

# Exercise 28

Consider the numeric expressions in the following sentence from the MedLine
Corpus: <b>The corresponding free cortisol fractions in these sera were 4.53 +/- 0.15%
and 8.16 +/- 0.23%, respectively.</b> Should we say that the numeric expression 4.53
+/- 0.15% is three words? Or should we say that it’s a single compound word? Or
should we say that it is actually nine words, since it’s read “four point five three,
plus or minus fifteen percent”? Or should we say that it’s not a “real” word at all,
since it wouldn’t appear in any dictionary? Discuss these different possibilities. Can
you think of application domains that motivate at least two of these answers?

The right interpretation depends on the text analysis task. For a voice assistant, the expression should be expanded into the nine words that are read aloud. For extracting numbers, <b>4.53 +/- 0.15%</b> is better treated as two numeric tokens, the value and its tolerance. For text classification, digits and punctuation can often be dropped altogether.

# Exercise 29

Readability measures are used to score the reading difficulty of a text, for the
purposes of selecting texts of appropriate difficulty for language learners. Let us
define μw to be the average number of letters per word, and μs to be the average
number of words per sentence, in a given text. The Automated Readability Index
(ARI) of the text is defined to be: 4.71 μw + 0.5 μs - 21.43. Compute the ARI score
for various sections of the Brown Corpus, including section f (popular lore) and
j (learned). Make use of the fact that nltk.corpus.brown.words() produces a sequence
of words, whereas nltk.corpus.brown.sents() produces a sequence of
sentences.

In [73]:
from nltk.corpus import brown

In [74]:
def get_ari(words, sents):
    mu_w = sum(len(w) for w in words) / len(words)
    mu_s = sum(len(s) for s in sents) / len(sents)
    return 4.71 * mu_w + 0.5 * mu_s - 21.43
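brown.words() also yields punctuation tokens, so get_ari() counts them as (very short) words. If only alphabetic tokens should count, a variant could look like the sketch below. The name `get_ari_alpha` and the toy sentences are made up for illustration.

```python
def get_ari_alpha(sents):
    """ARI computed over alphabetic tokens only (hypothetical variant)."""
    # Keep only alphabetic tokens in each sentence.
    word_sents = [[w for w in s if w.isalpha()] for s in sents]
    words = [w for s in word_sents for w in s]
    mu_w = sum(len(w) for w in words) / len(words)   # letters per word
    mu_s = len(words) / len(word_sents)              # words per sentence
    return 4.71 * mu_w + 0.5 * mu_s - 21.43


toy_sents = [['The', 'cat', 'sat', '.'], ['It', 'ran', '.']]
print(round(get_ari_alpha(toy_sents), 3))
```

For the toy input, μw = 14/5 = 2.8 and μs = 5/2 = 2.5, so the score is 4.71 · 2.8 + 0.5 · 2.5 − 21.43 = −6.992.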

In [75]:
ari_df = pd.DataFrame(
    data=[(category, get_ari(brown.words(categories=category),
                             brown.sents(categories=category)))
          for category in brown.categories()],
    columns=['category', 'ari']
)

In [76]:
ari_df.sort_values('ari', inplace=True)

In [77]:
ari_df

| | category | ari |
|---|---|---|
| 9 | mystery | 3.833552 |
| 0 | adventure | 4.084168 |
| 13 | romance | 4.349224 |
| 3 | fiction | 4.910474 |
| 14 | science_fiction | 4.978058 |
| 6 | humor | 7.887805 |
| 5 | hobbies | 8.922356 |
| 2 | editorial | 9.471025 |
| 10 | news | 10.176685 |
| 11 | religion | 10.20311 |


# Exercise 30

Use the Porter Stemmer to normalize some tokenized text, calling the stemmer
on each word. Do the same thing with the Lancaster Stemmer, and see if you observe
any differences.

In [78]:
from nltk.corpus import brown
from nltk.stem import PorterStemmer, LancasterStemmer

In [79]:
porter_stemmer  = PorterStemmer()
lancaster_stemmer = LancasterStemmer()

In [80]:
sent = brown.sents(categories='romance')[42]
print(sent)

[')', 'And', 'his', 'eyes', '--', 'those', 'miniature', 'sundials', 'of', 'variegated', 'yellow', '--', 'had', 'not', 'altered', 'their', 'expression', 'or', 'direction', '.']


In [81]:
porter_output = [porter_stemmer.stem(w) for w in sent]
print(porter_output)

[')', 'and', 'hi', 'eye', '--', 'those', 'miniatur', 'sundial', 'of', 'varieg', 'yellow', '--', 'had', 'not', 'alter', 'their', 'express', 'or', 'direct', '.']


In [82]:
lancaster_output = [lancaster_stemmer.stem(w) for w in sent]
print(lancaster_output)

[')', 'and', 'his', 'ey', '--', 'thos', 'miny', 'sund', 'of', 'varieg', 'yellow', '--', 'had', 'not', 'alt', 'their', 'express', 'or', 'direct', '.']


The Porter stemmer over-stems 'his' to 'hi', while the Lancaster stemmer leaves it unchanged. In general, though, the Lancaster stemmer is more aggressive than the Porter stemmer ('sundials' -> 'sund', 'eyes' -> 'ey', 'miniature' -> 'miny', etc.).

# Exercise 31

Define the variable saying to contain the list ['After', 'all', 'is', 'said',
'and', 'done', ',', 'more', 'is', 'said', 'than', 'done', '.']. Process the list using a for loop, and store the result in a new list lengths. Hint: begin by assigning
the empty list to lengths, using lengths = []. Then each time through the loop,
use append() to add another length value to the list.

In [83]:
saying = ['After', 'all', 'is', 'said', 'and', 'done', ',', 'more', 'is', 'said', 'than', 'done', '.']

In [84]:
lengths = []
for w in saying:
    lengths.append(len(w))
lengths

[5, 3, 2, 4, 3, 4, 1, 4, 2, 4, 4, 4, 1]

# Exercise 32

Define a variable silly to contain the string: 'newly formed bland ideas are
inexpressible in an infuriating way'. (This happens to be the legitimate interpretation
that bilingual English-Spanish speakers can assign to Chomsky’s famous
nonsense phrase colorless green ideas sleep furiously, according to Wikipedia). Now
write code to perform the following tasks:

a. Split silly into a list of strings, one per word, using Python’s split() operation,
and save this to a variable called bland.

b. Extract the second letter of each word in silly and join them into a string, to
get 'eoldrnnnna'.

c. Combine the words in bland back into a single string, using join(). Make sure
the words in the resulting string are separated with whitespace.

d. Print the words of silly in alphabetical order, one per line.

In [85]:
silly = 'newly formed bland ideas are inexpressible in an infuriating way'

In [86]:
bland = silly.split()
print(bland)

['newly', 'formed', 'bland', 'ideas', 'are', 'inexpressible', 'in', 'an', 'infuriating', 'way']


In [87]:
''.join([w[1] for w in bland])

'eoldrnnnna'

In [88]:
' '.join(bland)

'newly formed bland ideas are inexpressible in an infuriating way'

In [89]:
for w in sorted(bland):
    print(w)

an
are
bland
formed
ideas
in
inexpressible
infuriating
newly
way


# Exercise 33

The index() function can be used to look up items in sequences. For example,
'inexpressible'.index('e') tells us the index of the first position of the letter e.

a. What happens when you look up a substring, e.g., 'inexpressible'.index('re')?

b. Define a variable words containing a list of words. Now use words.index() to
look up the position of an individual word.

c. Define a variable silly as in Exercise 32. Use the index() function in combination
with list slicing to build a list phrase consisting of all the words up to
(but not including) in in silly.

In [90]:
'inexpressible'.index('re')

5

In [91]:
words = ['aaa', 'bbb', 'ccc']
words.index('bbb')

1

In [92]:
silly = 'newly formed bland ideas are inexpressible in an infuriating way'

In [93]:
words = silly.split()
phrase = words[:words.index('in')]
phrase

['newly', 'formed', 'bland', 'ideas', 'are', 'inexpressible']

# Exercise 34

Write code to convert nationality adjectives such as Canadian and Australian to their corresponding nouns Canada and Australia (see http://en.wikipedia.org/wiki/List_of_adjectival_forms_of_place_names ).

In [94]:
from nltk.corpus import wordnet

In [95]:
def adjective2country(w):
    words = set(wordnet.words())
    if w.endswith('ian'):
        stem = w[:-len('ian')]
        for suffix in ['ium', 'ia', 'a', 'i', '']:
            if (stem + suffix).lower() in words:
                return stem + suffix
    elif w.endswith('an'):
        stem = w[:-len('an')]
        for suffix in ['a', '']:
            if (stem + suffix).lower() in words:
                return stem + suffix
    elif w.endswith('ese'):
        stem = w[:-len('ese')]
        for suffix in ['a', '']:
            if (stem + suffix).lower() in words:
                return stem + suffix
    elif w.endswith('ish'):
        stem = w[:-len('ish')]
        for suffix in ['eland', 'land', 'mark', 'and', 'en', 'ey']:
            if (stem + suffix).lower() in words:
                return stem + suffix
    # And more...
    return w

In [96]:
adjective2country('Turkish'), adjective2country('Chinese'), adjective2country('Canadian')

('Turkey', 'China', 'Canada')

# Exercise 35

Read the LanguageLog post on phrases of the form as best as p can and as best p
can, where p is a pronoun. Investigate this phenomenon with the help of a corpus
and the findall() method for searching tokenized text described in Section 3.5.
The post is at http://itre.cis.upenn.edu/~myl/languagelog/archives/002733.html.

In [97]:
from nltk.corpus import brown

In [98]:
pronouns = ['I', 'you', 'he', 'she', 'it', 'we', 'they']

In [99]:
text = nltk.Text(brown.words())

In [100]:
text.findall('<[Aa]s><best><as><{}><can|could>'.format('|'.join(pronouns)))

As best as I could


In [101]:
text.findall('<[Aa]s><best><{}><can|could>'.format('|'.join(pronouns)))

as best he could; as best I could; as best I could


# Exercise 36

Study the lolcat version of the book of Genesis, accessible as
nltk.corpus.genesis.words('lolcat.txt'), and the rules for converting text into lolspeak at
http://www.lolcatbible.com/index.php?title=How_to_speak_lolcat. Define regular expressions
to convert English words into corresponding lolspeak words.

In [102]:
from nltk.corpus import genesis

In [103]:
text = genesis.raw('english-web.txt')[:500]
print(text)

In the beginning God created the heavens and the earth.
Now the earth was formless and empty.  Darkness was on the surface
of the deep.  God's Spirit was hovering over the surface
of the waters.
God said, "Let there be light," and there was light.
God saw the light, and saw that it was good.  God divided
the light from the darkness.
God called the light Day, and the darkness he called Night.
There was evening and there was morning, one day.
God said, "Let there be an expanse in the middle of the


In [104]:
conversions = [
    # Rules are applied in order, so specific patterns come before general ones
    # (for example, the whole word "the" before the bare "th").
    (r'ight\b', 'iet'),
    (r'\bI\b', 'ai'),
    (r'\bhe\b', 'him'),
    (r'\bhis\b', 'him'),
    (r'\bshe\b', 'her'),
    (r'\bhers\b', 'her'),
    (r'\bthey\b', 'dem'),
    (r'\btheir\b', 'dem'),
    (r'\bthem\b', 'dem'),
    (r'\bthe\b', 'teh'),
    (r'th', 'f'),
    (r'\bam\b', 'iz'),
    (r'\bme\b', 'meh'),
    (r'\byou\b', 'yu'),
    (r'ee\b', 'e'),
    (r'ee', 'ea'),
    (r'le\b', 'el'),
    (r'oa', 'ow'),
    (r'er\b', 'ah'),
    (r'ing\b', 'in')
]

In [105]:
for pattern, repl in conversions:
    text = re.sub(pattern, repl, text, flags=re.IGNORECASE)

In [106]:
print(text)

In teh beginnin God created teh heavens and teh earf.
Now teh earf was formless and empty.  Darkness was on teh surface
of teh deap.  God's Spirit was hoverin ovah teh surface
of teh waters.
God said, "Let fere be liet," and fere was liet.
God saw teh liet, and saw fat it was good.  God divided
teh liet from teh darkness.
God called teh liet Day, and teh darkness him called Niet.
fere was evenin and fere was mornin, one day.
God said, "Let fere be an expanse in teh middel of teh
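Because the substitutions run with re.IGNORECASE and a fixed lowercase replacement, a capitalized match such as "There" comes out as "fere". One possible refinement, sketched here with a small subset of the rules, is to pass a callable to re.sub that capitalizes the replacement whenever the matched text started with a capital. The helper names `match_case` and `to_lolspeak` are made up for this illustration.

```python
import re

# A small subset of the rules; order still matters ("the" before "th").
RULES = [
    (r'ight\b', 'iet'),
    (r'\bthe\b', 'teh'),
    (r'th', 'f'),
    (r'ing\b', 'in'),
]

def match_case(repl):
    """Build a re.sub callback that capitalizes repl for capitalized matches."""
    def _substitute(match):
        if match.group(0)[0].isupper():
            return repl.capitalize()
        return repl
    return _substitute

def to_lolspeak(text):
    for pattern, repl in RULES:
        text = re.sub(pattern, match_case(repl), text, flags=re.IGNORECASE)
    return text

print(to_lolspeak('There was evening and the light called Night.'))
# Fere was evenin and teh liet called Niet.
```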


# Exercise 37

Read about the re.sub() function for string substitution using regular expressions,
using help(re.sub) and by consulting the further readings for this chapter.
Use re.sub in writing code to remove HTML tags from an HTML file, and to
normalize whitespace.

In [107]:
help(re.sub)

Help on function sub in module re:

sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the match object and must return
    a replacement string to be used.



In [108]:
html = requests.get('http://www.nltk.org/').text

In [109]:
html_cleaned = re.sub('<.*?>', '', html, flags=re.DOTALL)
html_cleaned

'\n\n\n\n  \n    \n    Natural Language Toolkit &#8212; NLTK 3.4.4 documentation\n    \n    \n    \n    \n    \n    \n    \n    \n    \n     \n  \n    \n      \n        NLTK 3.4.4 documentation\n        \n          next |\n          modules |\n          index\n        \n       \n    \n\n    \n      \n        \n            \n      \n        \n          \n            \n  \nNatural Language Toolkit¶\nNLTK is a leading platform for building Python programs to work with human language data.\nIt provides easy-to-use interfaces to over 50 corpora and lexical\nresources such as WordNet,\nalong with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning,\nwrappers for industrial-strength NLP libraries,\nand an active discussion forum.\nThanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation,\nNLTK is suitable for linguists, engineers, stu

In [110]:
html_normalized = re.sub(r'\s+', ' ', html_cleaned).strip()
html_normalized

'Natural Language Toolkit &#8212; NLTK 3.4.4 documentation NLTK 3.4.4 documentation next | modules | index Natural Language Toolkit¶ NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum. Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike. NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project. NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics usi
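The cleaned text above still contains character references such as `&#8212;`, and a plain tag-stripping pattern keeps the contents of script and style elements. The following is a minimal sketch of a slightly more thorough cleaner, assuming regex-based stripping is good enough for the page at hand (a real HTML parser such as BeautifulSoup is more robust). `clean_html` is a name chosen for this example.

```python
import re
from html import unescape

def clean_html(raw):
    """Strip tags, decode entities and normalize whitespace."""
    # Drop <script> and <style> elements together with their contents.
    text = re.sub(r'<(script|style)\b.*?</\1\s*>', '', raw,
                  flags=re.DOTALL | re.IGNORECASE)
    # Remove the remaining tags.
    text = re.sub(r'<[^>]*>', '', text)
    # Decode character references such as &amp; or &#8212;.
    text = unescape(text)
    # Collapse every run of whitespace into a single space.
    return re.sub(r'\s+', ' ', text).strip()

sample = ('<html><head><style>p {color: red}</style></head><body>'
          '<p>NLTK &#8212; docs</p>\n<script>var x = 1;</script>\n'
          '  <p>Tom &amp; Jerry</p></body></html>')
print(clean_html(sample))
```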

# Exercise 38

An interesting challenge for tokenization is words that have been split across a
linebreak. E.g., if long-term is split, then we have the string long-\nterm.
a. Write a regular expression that identifies words that are hyphenated at a linebreak.
The expression will need to include the \n character.
b. Use re.sub() to remove the \n character from these words.
c. How might you identify words that should not remain hyphenated once the
newline is removed, e.g., 'encyclo-\npedia'?

In [111]:
from nltk.corpus import wordnet

In [112]:
def fix_hyphenated_words(text):
    valid_words = set(wordnet.words())
    result = text[:]
    for w in set(re.findall(r'\w+-\n\w+', result)):
        w_new = w.replace('\n', '')
        if w_new not in valid_words:
            w_new = w_new.replace('-', '')
        result = result.replace(w, w_new)
    return result
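Part (b) asks for re.sub(), so here is a sketch of the same idea written as a single substitution with a callback. The `vocabulary` argument is an assumption made to keep the example self-contained; in the notebook it would be a set such as the WordNet word list. The hyphen is kept only when the hyphenated form is a known word.

```python
import re

def join_linebreak_hyphens(text, vocabulary):
    """Rejoin words split as 'left-\\nright', keeping the hyphen only if known."""
    def _join(match):
        left, right = match.group(1), match.group(2)
        hyphenated = left + '-' + right
        if hyphenated.lower() in vocabulary:
            return hyphenated
        return left + right
    return re.sub(r'(\w+)-\n(\w+)', _join, text)

# Toy vocabulary (hypothetical stand-in for a real word list).
vocab = {'long-term'}
sample = 'a long-\nterm plan for the encyclo-\npedia'
print(join_linebreak_hyphens(sample, vocab))
# a long-term plan for the encyclopedia
```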

In [113]:
text = """An interesting challenge for tokenization is words that have been split across a linebreak. E.g., if long-term is split, then we have the string long-\nterm.
a. Write a regular expression that identifies words that are hyphenated at a linebreak. The expression will need to include the \n character.
b. Use re.sub() to remove the \n character from these words.
c. How might you identify words that should not remain hyphenated once the newline is removed, e.g., 'encyclo-\npedia'?"""

In [114]:
fix_hyphenated_words(text)

"An interesting challenge for tokenization is words that have been split across a linebreak. E.g., if long-term is split, then we have the string long-term.\na. Write a regular expression that identifies words that are hyphenated at a linebreak. The expression will need to include the \n character.\nb. Use re.sub() to remove the \n character from these words.\nc. How might you identify words that should not remain hyphenated once the newline is removed, e.g., 'encyclopedia'?"

# Exercise 39

Read the Wikipedia entry on Soundex. Implement this algorithm in Python.

In [115]:
from itertools import groupby

In [116]:
def soundex_v01(word):
    step0 = word[0]
    step1 = re.sub('[hw]', '', word[1:])
    step20 = re.sub('[bfpv]+', '1', step1)
    step21 = re.sub('[cgjkqsxz]+', '2', step20)
    step22 = re.sub('[dt]+', '3', step21)
    step23 = re.sub('l+', '4', step22)
    step24 = re.sub('[mn]+', '5', step23)
    step25 = re.sub('r+', '6', step24)
    step3 = re.sub('[aeiouy]+', '', step25)
    step4 = step0.upper() + step3
    step5 = step4[:4].ljust(4, '0')
    return step5

In [117]:
def soundex_v02(word):
    step0, step1 = word[0], word[1:]
    step2 = step1.translate(
        ''.maketrans(
            'bfpvcgjkqsxzdtlmnr', 
            '111122222222334556', 
            'hw'
        )
    )
    step3 = ''.join(k for k, g in groupby(step2))
    step4 = step3.translate(
        ''.maketrans('', '', 'aeiouy')
    )
    step5 = step0.upper() + step4
    step6 = step5[:4].ljust(4, '0')
    return step6

In [118]:
words = [
    'ammonium',
    'implementation',
    'Robert',
    'Rupert',
    'Rubin',
    'Ashcraft',
    'Ashcroft',
    'Tymczak'
]

In [119]:
list(map(soundex_v01, words))

['A555', 'I514', 'R163', 'R163', 'R150', 'A261', 'A261', 'T522']

In [120]:
list(map(soundex_v02, words))

['A555', 'I514', 'R163', 'R163', 'R150', 'A261', 'A261', 'T522']
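Both versions above set the first letter aside before coding the rest, so a second letter with the same code as the first is not merged with it (the commonly cited example is "Pfister", usually given as P236). The sketch below is one way to apply the adjacency rule to the first letter as well. It assumes the usual American Soundex description in which only h and w are transparent, and it expects a non-empty alphabetic word. `soundex_first_letter` is a name chosen for this example.

```python
def soundex_first_letter(word):
    """Soundex sketch in which the first letter also takes part in merging."""
    codes = {}
    groups = [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
              ('l', '4'), ('mn', '5'), ('r', '6')]
    for letters, digit in groups:
        for letter in letters:
            codes[letter] = digit

    word = word.lower()
    result = word[0].upper()
    previous = codes.get(word[0], '')
    for letter in word[1:]:
        if letter in 'hw':
            continue  # h and w do not separate letters with equal codes
        digit = codes.get(letter, '')
        if digit and digit != previous:
            result += digit
        # Vowels reset previous to '', so a repeated code is counted again.
        previous = digit
    return result[:4].ljust(4, '0')

print([soundex_first_letter(w) for w in ['Pfister', 'Ashcraft', 'Tymczak']])
# ['P236', 'A261', 'T522']
```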

# Exercise 40

Obtain raw texts from two or more genres and compute their respective reading
difficulty scores as in the earlier exercise on reading difficulty. E.g., compare ABC
Rural News and ABC Science News (nltk.corpus.abc). Use Punkt to perform sentence
segmentation.

In [121]:
from nltk.corpus import abc

In [122]:
ari_df = pd.DataFrame(
    data=[
        (fileid, get_ari(nltk.word_tokenize(abc.raw(fileid)),
                         [nltk.word_tokenize(sent) for sent in nltk.sent_tokenize(abc.raw(fileid))]))
        for fileid in abc.fileids()
    ],
    columns=['fileid', 'ari']
)

In [123]:
ari_df

        fileid        ari
0    rural.txt  12.616620
1  science.txt  12.773322
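For reference, the Automated Readability Index is 4.71 times the mean word length in characters, plus 0.5 times the mean sentence length in words, minus 21.43. The sketch below takes the same two arguments as the call above (a token list and a list of tokenized sentences). It is an illustration only and may differ in detail from the `get_ari` helper defined in the earlier exercise.

```python
def ari_sketch(words, sents):
    """ARI from a list of tokens and a list of tokenized sentences."""
    mean_word_len = sum(len(w) for w in words) / len(words)
    mean_sent_len = sum(len(s) for s in sents) / len(sents)
    return 4.71 * mean_word_len + 0.5 * mean_sent_len - 21.43

# One three-word sentence of three-letter words: 4.71*3 + 0.5*3 - 21.43
print(round(ari_sketch(['The', 'cat', 'sat'], [['The', 'cat', 'sat']]), 2))
# -5.8
```

With scores as close as the two above, the difference between the genres is small, so it is worth checking how much punctuation tokens contribute to each mean.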


# Exercise 41

Rewrite the following nested loop as a nested list comprehension:

In [124]:
words = ['attribution', 'confabulation', 'elocution',
         'sequoia', 'tenacious', 'unidirectional']

In [125]:
vsequences = set()
for word in words:
    vowels = []
    for char in word:
        if char in 'aeiou':
            vowels.append(char)
    vsequences.add(''.join(vowels))
sorted(vsequences)

['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']

In [126]:
sorted({''.join(char for char in word if char in 'aeiou') for word in words})

['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']

In [127]:
sorted({re.sub('[^aeiou]', '', word) for word in words})

['aiuio', 'eaiou', 'eouio', 'euoia', 'oauaio', 'uiieioa']

# Exercise 42

Use WordNet to create a semantic index for a text collection. Extend the concordance
search program in Example 3-1, indexing each word using the offset of
its first synset, e.g., wn.synsets('dog')[0].offset (and optionally the offset of some
of its ancestors in the hypernym hierarchy).

In [128]:
from nltk.corpus import wordnet

In [129]:
class IndexedTextExercise42(object):
    def __init__(self, text, strategy):
        self._text = text
        self._strategy = strategy
        assert self._strategy in ['first_synset', 'first_hypernym'], \
        "Valid strategy values are ['first_synset', 'first_hypernym']"
        self._offset = (
            self._offset_first_synset 
            if self._strategy == 'first_synset' 
            else self._offset_first_hypernym
        )
        self._index = nltk.Index(
            (self._offset(word), i)
            for (i, word) in enumerate(text)
        )

    def concordance(self, word, width=40):
        key = self._offset(word)
        wc = width // 4  # words of context
        for i in self._index[key]:
            lcontext = ' '.join(self._text[max(0, i-wc):i])
            rcontext = ' '.join(self._text[i:i+wc])
            ldisplay = '%*s' % (width, lcontext[-width:])
            rdisplay = '%-*s' % (width, rcontext[:width])
            print(ldisplay, rdisplay)

    def _offset_first_synset(self, word):
        offset = -1
        synsets = wordnet.synsets(word)
        if synsets:
            offset = synsets[0].offset()
        return offset
    
    def _offset_first_hypernym(self, word):
        offset = -1
        synsets = wordnet.synsets(word)
        if synsets:
            hypernyms = synsets[0].hypernyms()
            if hypernyms:
                offset = hypernyms[0].offset()
        return offset

In [130]:
grail = nltk.corpus.webtext.words('grail.txt')

In [131]:
text = IndexedTextExercise42(grail, strategy='first_synset')

In [132]:
text.concordance('lie')

 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !   
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well  
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which 
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t


In [133]:
text.concordance('palace')

thur , son of Uther Pendragon , from the castle of Camelot . King of the Britons 
: Man . Sorry . What knight live in that castle over there ? DENNIS : I ' m thirt
  Arthur , King of the Britons . Who ' s castle is that ? WOMAN : King of the who
     . I am in haste . Who lives in that castle ? WOMAN : No one live there . ART
 my Knights of the Round Table . Who ' s castle is this ? FRENCH GUARD : This is 
tle is this ? FRENCH GUARD : This is the castle of my master Guy de Loimbard . AR
t show us the Grail , we shall take your castle by force ! FRENCH GUARD : You don
TOR : Action ! HISTORIAN : Defeat at the castle seems to have utterly disheartene
lcome gentle Sir Knight . Welcome to the Castle Anthrax . GALAHAD : The Castle An
me to the Castle Anthrax . GALAHAD : The Castle Anthrax ? ZOOT : Yes . Oh , it   
        and - a - half , cut off in this castle with no one to protect us . Oooh 
   Grail ! I have seen it , here in this castle ! DINGO : Oh no . Oh , no        
d she must pay t
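The search for 'palace' returns lines containing 'castle' because both words are indexed under the same key. The toy sketch below shows that mechanism without WordNet. The `CONCEPT` table and its numbers are invented stand-ins for first-synset offsets. It also makes one side effect visible: every word without an entry shares the key -1, so looking up an unknown word would return all other unknown words.

```python
from collections import defaultdict

# Hypothetical word -> concept id table (stand-in for synset offsets).
CONCEPT = {'palace': 101, 'castle': 101, 'lie': 202, 'lies': 202}

def build_index(tokens):
    """Map each concept id to the positions of the tokens that carry it."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[CONCEPT.get(token.lower(), -1)].append(position)
    return index

tokens = 'Who lives in that castle ? The palace lies north'.split()
index = build_index(tokens)

# Searching for 'palace' also finds 'castle', since they share an id.
print([tokens[i] for i in index[CONCEPT['palace']]])
# ['castle', 'palace']

# All words missing from the table are lumped together under -1.
print(len(index[-1]))
# 7
```

Skipping tokens whose key is -1 when building the index, or refusing to search for them, would avoid the lumping.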

In [134]:
text = IndexedTextExercise42(grail, strategy='first_hypernym')

In [135]:
text.concordance('lie')

 beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
       Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !   
doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well  
ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which 
   you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
not stop our fight ' til each one of you lies dead , and the Holy Grail returns t


In [136]:
text.concordance('bridge')

 with you , good Sir Knight , but I must cross this bridge . BLACK KNIGHT : Then 
 good Sir Knight , but I must cross this bridge . BLACK KNIGHT : Then you shall d
uiet ! Quiet ! Quiet ! Quiet ! There are ways of telling whether she is a witch .
   made of wood ? VILLAGER # 1 : Build a bridge out of her . BEDEVERE : Ah , but 
    : Who are you who are so wise in the ways of science ? ARTHUR : I am Arthur ,
cred quest . If he will give us food and shelter for the night he can join us in 
 each of the knights went their separate ways . Sir Robin rode north , through th
 not at all afraid to be killed in nasty ways . Brave , brave , brave , brave Sir
  is the Grail ?! OLD MAN : Seek you the Bridge of Death . ARTHUR : The Bridge of
k you the Bridge of Death . ARTHUR : The Bridge of Death , which leads to the Gra
   come and rescue me . I am in the Tall Tower of Swamp Castle .' At last ! A cal
tter . FATHER : You fell out of the Tall Tower , you creep ! HERBERT : No , I    
    : Let us tau

# Exercise 43

With the help of a multilingual corpus such as the Universal Declaration of
Human Rights Corpus (nltk.corpus.udhr), along with NLTK’s frequency distribution
and rank correlation functionality (nltk.FreqDist, nltk.spearman_correlation),
develop a system that guesses the language of a previously unseen text. For
simplicity, work with a single character encoding and just a few languages.

In [137]:
from nltk.corpus import udhr
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.metrics import spearman_correlation

In [138]:
def get_ranks():
    cfd = ConditionalFreqDist(
        (fileid.replace('-Latin1', ''), char.lower())
        for fileid in udhr.fileids()
        for char in udhr.raw(fileid)
        if fileid.endswith('Latin1')
        and char.isalpha()
    )
    ranks = {
        lang: [(c, i) for i, (c, f) in enumerate(cfd[lang].most_common())]
        for lang in cfd.conditions()
    }
    return ranks

In [139]:
def define_language(text, ranks):
    fd = FreqDist(char.lower() for char in text if char.isalpha())
    rank_test = [(c, i) for i, (c, f) in enumerate(fd.most_common())]
    return max(
        (spearman_correlation(rank_test, ranks[lang]), lang) 
        for lang in ranks
    )

In [140]:
ranks = get_ranks()

In [141]:
text_english = """
With the help of a multilingual corpus such as the Universal Declaration of Human Rights Corpus (nltk.corpus.udhr),
along with NLTK’s frequency distribution and rank correlation functionality (nltk.FreqDist, nltk.spearman_correla tion), 
develop a system that guesses the language of a previously unseen text. For simplicity, work with a single character 
encoding and just a few languages.
"""

In [142]:
define_language(text_english, ranks)

(0.9299999999999999, 'English')

In [143]:
text_german = """
Julian und Stefan wollen gemeinsam verreisen. Als Urlaubsort haben sie sich Teneriffa ausgesucht. Zusammen kümmern 
sie sich um die Organisation der Reise. Julian schaut im Internet nach besonders günstigen Flügen. Der Flug ab Hamburg
hat den besten Preis. Allerdings müssen sie zwei Stunden Zeit für die Fahrt zum Flughafen einplanen. Stefan kümmert sich 
um die Unterkunft. Auf Teneriffa gibt es sehr viele Hotels mit unterschiedlichen Preisen. Vieles ist zu teuer, deswegen 
schaut Stefan erst einmal nach einem Hostel. Allerdings muss man sich dort selbst um die Verpflegung kümmern. Stefan 
entscheidet sich schließlich für ein günstiges Hotel, das drei Kilometer vom Strand entfernt liegt. Dafür hat das Hotel
auf der Internetseite gute Bewertungen. 
"""

In [144]:
define_language(text_german, ranks)

(0.9265384615384615, 'German_Deutsch')

In [145]:
text_french = """
La région des Alpes est située à l’Est de la France. Les Alpes sont la chaîne de montagnes la plus haute d’Europe, 
c’est aussi une frontière naturelle avec d’autres pays européens : la Suisse et l’Italie. La montagne la plus haute 
des Alpes s’appelle le Mont Blanc, on peut y monter grâce à un téléphérique. Les français aiment beaucoup cette 
région car ils peuvent y passer leurs vacances en été et en hiver. En effet, en hiver il y a beaucoup de neige dans 
les Alpes et les touristes peuvent pratiquer le ski, la luge ou le snowboard. Il fait très froid et pour se réchauffer 
ils mangent des plats typiques et boivent du vin chaud. En été, les touristes aiment se balader dans les montagnes, 
ils font de la randonnée. Ils peuvent se baigner dans les lacs et observer les animaux sauvages qui vivent dans cette 
région : les marmottes et les chamois. Mais on peut voir aussi beaucoup d’animaux domestiques pendant l’été : des vaches, 
des chèvres et des montons. 
"""

In [146]:
define_language(text_french, ranks)

(0.9442735042735043, 'French_Francais')
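The scores above come from comparing character frequency ranks. As a sketch of what such a comparison computes, the following stdlib-only code builds a rank table with collections.Counter and applies the Spearman formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)) over the characters present in both rankings. This reflects my understanding of what nltk.metrics.spearman_correlation does; treat the details as an assumption rather than a description of the library.

```python
from collections import Counter

def char_ranks(text):
    """Rank alphabetic characters by frequency, most frequent first."""
    counts = Counter(ch.lower() for ch in text if ch.isalpha())
    return {ch: rank for rank, (ch, _) in enumerate(counts.most_common())}

def spearman(ranks1, ranks2):
    """Spearman rank correlation over the keys shared by both rankings."""
    common = [key for key in ranks1 if key in ranks2]
    n = len(common)
    if n < 2:
        return 0.0
    d_squared = sum((ranks1[k] - ranks2[k]) ** 2 for k in common)
    return 1 - 6 * d_squared / (n * (n * n - 1))

same = {'a': 0, 'b': 1, 'c': 2}
reverse = {'a': 2, 'b': 1, 'c': 0}
print(spearman(same, same), spearman(same, reverse))
# 1.0 -1.0
```

Because only shared characters are compared, a short test text with few distinct letters can score highly against several languages, which is one reason to prefer longer samples.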

# Exercise 44

Write a program that processes a text and discovers cases where a word has been
used with a novel sense. For each word, compute the WordNet similarity between
all synsets of the word and all synsets of the words in its context. (Note that this
is a crude approach; doing it well is a difficult, open research problem.)

In [147]:
from collections import defaultdict
from itertools import combinations
from nltk.corpus import wordnet as wn
from nltk.corpus import brown, stopwords

In [148]:
def get_similarity(word1, word2):
    return max(
        [synset1.path_similarity(synset2)
         for synset1 in wn.synsets(word1)
         for synset2 in wn.synsets(word2)], 
        default=0, 
        key=lambda sim: sim if sim else 0
    )

In [149]:
sents = brown.sents(categories='romance')
scores = defaultdict(lambda : defaultdict(list))
stops = set(nltk.corpus.stopwords.words('english'))

In [150]:
for i, sent in enumerate(sents[:20]):
    sent[0] = sent[0].lower()
    sent = {
        word
        for word in sent
        if word not in stops
        and word.isalpha()
        and word.islower()
    }

    if len(sent) < 3:
        continue

    for word1, word2 in combinations(sent, 2):
        top_score = get_similarity(word1, word2)
        if top_score:
            scores[i][word1].append(top_score)
            scores[i][word2].append(top_score)

    for word, word_scores in scores[i].items():
        avg_score = sum(word_scores) / len(word_scores)
        if avg_score < 0.1:
            print('In sentence {} the word "{}" may be used in '
                  'a novel sense. Its average similarity score is {}.'.format(i, word, round(avg_score, 4)))

In sentence 1 the word "noon" may be used in a novel sense. Its average similarity score is 0.0747.
In sentence 1 the word "boredom" may be used in a novel sense. Its average similarity score is 0.0824.
In sentence 4 the word "aluminum" may be used in a novel sense. Its average similarity score is 0.0948.
In sentence 4 the word "trays" may be used in a novel sense. Its average similarity score is 0.0833.
In sentence 4 the word "cheesecloth" may be used in a novel sense. Its average similarity score is 0.0857.
In sentence 4 the word "tomato" may be used in a novel sense. Its average similarity score is 0.0887.
In sentence 19 the word "suitcase" may be used in a novel sense. Its average similarity score is 0.0811.
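The flagging logic can be separated from WordNet, which makes it easier to test. In the sketch below, `similarity` is any function returning a score for two words, and the toy table is invented for illustration. As in the loop above, a word is flagged when the mean of its non-zero pairwise scores falls below the threshold.

```python
from itertools import combinations
from statistics import mean

def low_similarity_words(words, similarity, threshold=0.1):
    """Return words whose mean similarity to the other words is low."""
    unique = sorted(set(words))
    scores = {word: [] for word in unique}
    for word1, word2 in combinations(unique, 2):
        score = similarity(word1, word2)
        if score:  # ignore pairs with no usable score (0 or None)
            scores[word1].append(score)
            scores[word2].append(score)
    return sorted(w for w, s in scores.items() if s and mean(s) < threshold)

# Hypothetical similarity table standing in for WordNet path similarity.
TOY = {
    frozenset(('cat', 'dog')): 0.2,
    frozenset(('cat', 'noon')): 0.05,
    frozenset(('dog', 'noon')): 0.07,
}

def toy_similarity(word1, word2):
    return TOY.get(frozenset((word1, word2)), 0)

print(low_similarity_words(['cat', 'dog', 'noon'], toy_similarity))
# ['noon']
```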


# Exercise 45

Read the article on normalization of non-standard words (Sproat et al., 2001),
and implement a similar system for text normalization.

Install the normalise package from https://github.com/EFord36/normalise before running the cells below.

In [151]:
from normalise import normalise



In [152]:
text = ["On", "the", "13", "Feb.", "2007", ",", "Theresa", "May", "MP", "announced",
        "on", "ITV", "News", "that", "the", "rate", "of", "childhod", "obesity", "had", "risen",
        "from", "7.3-9.6%", "in", "just", "3", "years", ",", "costing", "the", "Gov.", "£20m", "."]

In [153]:
normalise(text, verbose=True)


CREATING NSW DICTIONARY
-----------------------

10 NSWs found

TAGGING NSWs
------------

10 of 10 tagged

SPLITTING NSWs
--------------

0 of 0 split

RETAGGING SPLIT NSWs
--------------------

0 of 0 retagged

CLASSIFYING ALPHABETIC NSWs
---------------------------

5 of 5 classified

CLASSIFYING NUMERIC NSWs
------------------------

5 of 5 classified

CLASSIFYING MISCELLANEOUS NSWs
------------------------------

0 of 0 classified

EXPANDING ALPHABETIC NSWs
-------------------------

5 of 5 expanded

EXPANDING NUMERIC NSWs
----------------------

5 of 5 expanded

EXPANDING MISCELLANEOUS NSWs
----------------------------

0 of 0 expanded



['On',
 'the',
 'thirteenth of',
 'February',
 'two thousand and seven',
 ',',
 'Theresa',
 'May',
 'M P',
 'announced',
 'on',
 'I T V',
 'News',
 'that',
 'the',
 'rate',
 'of',
 'childhood',
 'obesity',
 'had',
 'risen',
 'from',
 'seven point three to nine point six percent',
 'in',
 'just',
 'three',
 'years',
 ',',
 'costing',
 'the',
 'government',
 'twenty million pounds',
 '.']
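To give a feel for what such a system does per token, here is a deliberately tiny rule-based sketch. It is not how the normalise package works, and the abbreviation table is a hypothetical sample. It expands known abbreviations, spells out integers below 100, and reads all-caps tokens letter by letter. Unlike the output above, it has no notion of context, so "13" becomes "thirteen" rather than "thirteenth of".

```python
# Hypothetical abbreviation table for illustration only.
ABBREVIATIONS = {'Feb.': 'February', 'Gov.': 'government'}

ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven',
        'eight', 'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen',
        'fifteen', 'sixteen', 'seventeen', 'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy',
        'eighty', 'ninety']

def number_to_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ('-' + ONES[ones] if ones else '')

def normalise_token(token):
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    if token.isdecimal() and int(token) < 100:
        return number_to_words(int(token))
    if token.isalpha() and token.isupper() and len(token) > 1:
        return ' '.join(token)  # read acronyms letter by letter
    return token  # anything else is left unchanged

sample = ['On', 'the', '13', 'Feb.', '2007', 'ITV', 'Gov.', '3']
print([normalise_token(t) for t in sample])
```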