# NLTK BOOK. Chapter 3.

In [1]:
%pprint

Pretty printing has been turned OFF


In [2]:
import nltk

import re
from bs4 import BeautifulSoup
from urllib import request

### Exercise 2.

We can use the slice notation to remove morphological endings on words. For example, 'dogs'[:-1] removes the last character of dogs, leaving dog. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): dish-es, run-ning, nation-ality, un-do, pre-heat.

In [3]:
def remove_affix(word, suffix=True, prefix=False):
    if suffix:
        hyphen = word.rfind('-')
        word = word[:hyphen]
    if prefix:
        hyphen = word.find('-')
        word = word[hyphen+1:]
    return word

print(remove_affix('dish-es'))
print(remove_affix('run-ning'))
print(remove_affix('nation-ality'))
print(remove_affix('un-do', False, True))
print(remove_affix('pre-heat', False, True))
print(remove_affix('un-do-ing', prefix=True))
print(remove_affix('cat', suffix=False))

dish
run
nation
do
heat
do
cat


### Exercise 3.

We saw how we can generate an IndexError by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string?

In [4]:
word = "NLTK"

try:
    i = word[-6] # 6th character from the right doesn't exist, so an IndexError is raised.
except IndexError:
    print("Looks like there's an Index error here.")    

Looks like there's an Index error here.


### Exercise 5.

In [5]:
monty = 'Monty Python'
monty[::-1]

'nohtyP ytnoM'

Because the first two slice parameters are omitted, the whole string is selected. The third parameter is the step, i.e. how many positions to advance between the characters we keep. A magnitude of `1` keeps every character, and because the value is negative, the string is read from the end.
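As a small illustration of how the step value behaves, the sketch below slices the same string with a few different steps. The variable names are only illustrative and are not part of the original exercise.

```python
monty = 'Monty Python'

# Start and stop are omitted, so each slice covers the whole string;
# only the step (third value) changes.
every_char = monty[::1]        # step 1: every character, left to right
every_second = monty[::2]      # step 2: every second character
reversed_text = monty[::-1]    # step -1: every character, right to left
reversed_second = monty[::-2]  # step -2: every second character, right to left

print(every_char)       # Monty Python
print(every_second)     # MnyPto
print(reversed_text)    # nohtyP ytnoM
print(reversed_second)  # nhy to
```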

### Exercise 6.

Describe the class of strings matched by the following regular expressions.

    [a-zA-Z]+                      one or more alphabetic (ASCII) characters
    [A-Z][a-z]*                    an uppercase letter followed by zero or more lowercase letters
    p[aeiou]{,2}t                  a 'p' followed by zero, one or two vowels, followed by a 't'
    \d+(\.\d+)?                    one or more digits, optionally followed by (a period and one or more digits)
    ([^aeiou][aeiou][^aeiou])*     zero or more repetitions of (a character that is not a lowercase vowel,
                                   a lowercase vowel, and another character that is not a lowercase vowel)
    \w+|[^\w\s]+                   (one or more word characters: letters, digits or underscore) or (one or
                                   more characters that are neither word characters nor whitespace)
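One way to check these descriptions is to test each pattern against a string that should match in full and one that should not. The sketch below does that with `re.fullmatch`; the test strings are made-up examples, not taken from the book.

```python
import re

# (pattern, string expected to match in full, string expected not to match)
cases = [
    (r'[a-zA-Z]+', 'Python', 'Py3'),
    (r'[A-Z][a-z]*', 'Monty', 'monty'),
    (r'p[aeiou]{,2}t', 'pout', 'pieut'),
    (r'\d+(\.\d+)?', '3.14', '3.'),
    (r'([^aeiou][aeiou][^aeiou])*', 'bandit', 'band'),
    (r'\w+|[^\w\s]+', '...', 'a b'),
]

results = []
for pattern, good, bad in cases:
    matches_good = re.fullmatch(pattern, good) is not None
    matches_bad = re.fullmatch(pattern, bad) is not None
    results.append((matches_good, matches_bad))
    print('{:30} {!r}: {}  {!r}: {}'.format(pattern, good, matches_good, bad, matches_bad))

# The starred group can repeat zero times, so the empty string also matches.
empty_ok = re.fullmatch(r'([^aeiou][aeiou][^aeiou])*', '') is not None
```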

### Exercise 7.

Write regular expressions to match the following classes of strings:

        1. A single determiner (assume that a, an, and the are the only determiners).

In [6]:
regexp = r'\b([aA][n]?|[tT]he)\b'
re.findall(regexp, 'the tank Anna scandal then an a April The')

['the', 'an', 'a', 'The']

       2. An arithmetic expression using integers, addition, and multiplication, such as 2*3+8.

In [7]:
regexp = r'(?:^|\s)(\d+(?:\*\d+\+|\+\d+\*)\d+)(?:\s|[\.,!\?]+ |$)'
#  matches expressions that
    # contain one '+' and one '*', in any order
    # don't contain any '-'
    # open the string or are preceded by a whitespace
    # close the string or are followed by a whitespace or punctuation mark, followed by a space char.
seq = '9+23*45 10*23+45?! 666*2+14,p 9*237+11+*3+2 1+2+4  600*93+5'

def find_with_overlap(regexp, seq, capgroup_order=1):
    """ Returns a list of overlapping matches of a regular expression
    found in a string.
    If the regular expression has several capturable groups, only the match 
    for one of the groups (the first one, by default) is considered. 
    """
    results=[]
    while True:
        match = re.search(regexp, seq)
        if match is None:
            break
        results.append(match.groups()[capgroup_order-1])
        seq = seq[match.end():]
    return results

print(find_with_overlap(regexp, seq))

['9+23*45', '10*23+45', '600*93+5']


### Exercise 8.

Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use from urllib import request and then request.urlopen('http://nltk.org/').read().decode('utf8') to access the contents of the URL.

In [8]:
# This solution is mostly copied from Ranveer Aggarwal's answer to https://www.quora.com/How-can-I-extract-only-text-data-from-HTML-pages

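from bs4 import Comment

def visible(element):
    # Text inside these tags is never rendered on the page.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # HTML comments are parsed as Comment objects, so test the type directly.
    elif isinstance(element, Comment):
        return False
    return True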

def text_from_web(url, decoder='utf8'):
    html = request.urlopen(url).read().decode(decoder)
    soup = BeautifulSoup(html, "lxml")
    data = soup.findAll(text=True)
    lines = filter(visible, data)
    lines = [line.strip().replace('\n', ' ') for line in lines if line != '\n']
    raw = ' '.join(list(lines))
    return raw

text_from_web('http://nltk.org/')

'NLTK 3.2.5 documentation next | modules | index Natural Language Toolkit ¶ NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum . Thanks to a hands-on guide introducing programming fundamentals alongside topics in computational linguistics, plus comprehensive API documentation, NLTK is suitable for linguists, engineers, students, educators, researchers, and industry users alike. NLTK is available for Windows, Mac OS X, and Linux. Best of all, NLTK is a free, open source, community-driven project. NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural

### Exercise 9.

Save some text into a file corpus.txt. Define a function load(f) that reads from the file named in its sole argument, and returns a string containing the text of the file.

   * Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag (?x).

In [9]:
with open('../data/corpus.txt') as f:
    raw = ' '.join([line.strip() for line in f.readlines()])

pattern = r"""(?x)  # set flag to allow verbose regexps
          [\w]+     # get alphanumeric sequences
          |\S       # get punctuation
         
"""
tokenized = nltk.regexp_tokenize(raw, pattern)
tokenized[1500:1550]    

['bufo', 'o', 'grotesco', 'y', 'lo', 'trágico', 'estén', 'mezclados', 'o', 'yuxtapuestos', ',', 'sino', 'fundidos', 'y', 'confundidos', 'en', 'uno', '.', 'Y', 'como', 'yo', 'le', 'hiciese', 'observar', 'que', 'eso', 'no', 'es', 'sino', 'el', 'más', 'desenfrenado', 'romanticismo', ',', 'me', 'contestó', ':', '«', 'no', 'lo', 'niego', ',', 'pero', 'con', 'poner', 'motes', 'a', 'las', 'cosas', 'no']

   * Use nltk.regexp_tokenize() to create a tokenizer that tokenizes the following kinds of expression: monetary amounts; dates; names of people and organizations.

In [10]:
raw2 = """My friend John-Michael Doe from Chicago in the U.S.A. bought a house for $ 2500.8 (i.e. €2012,6)
       in Spain on 6/7/2018."""

pattern2 = r"""(?x)  # set flag to allow verbose regexps
            \b((?:\d{1,2}/){2}(?:\d{4}|\d{2}))\b     # dates, e.g. 5/12/2009
          | \ ((?:[A-Z][\w\-\.]+\ *)*(?:[A-Z]+[\w\-\.]+)) # proper names, e.g. United Nations or Jean-Paul Sartre 
          | ([\$€]\ ?\d+(?:[\.,]\d+)?)  # monetary amounts, e.g. $12.40
"""

tokenized2 = nltk.regexp_tokenize(raw2, pattern2)
tokenized2 = [token for tokens in tokenized2 for token in tokens if token]
tokenized2

['John-Michael Doe', 'Chicago', 'U.S.A.', '$ 2500.8', '€2012,6', 'Spain', '6/7/2018']

### Exercise 13.

What is the difference between calling split on a string with no argument or with ' ' as the argument, e.g. sent.split() versus sent.split(' ')? What happens when the string being split contains tab characters, consecutive space characters, or a sequence of tabs and spaces? (In IDLE you will need to use '\t' to enter a tab character.)

In [11]:
sent = "Back in the 90s,\tI was in a very famous\n\n TV show."

print("Splitting with no argument:")
print(sent.split())
print("Splitting with one space argument:")
print(sent.split(' '))

Splitting with no argument:
['Back', 'in', 'the', '90s,', 'I', 'was', 'in', 'a', 'very', 'famous', 'TV', 'show.']
Splitting with one space argument:
['Back', 'in', 'the', '90s,\tI', 'was', 'in', 'a', 'very', 'famous\n\n', 'TV', 'show.']


`.split()` with no argument splits the string on spaces, tabs and newlines. Any run of these characters produces a single split.
If we pass an argument, `.split()` uses a different algorithm: it splits at every occurrence of that argument. For example, if the argument is a space, it splits each time it finds a space (so two consecutive spaces yield an empty string between them), and it does not treat tabs or newlines as separators.
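The difference is easiest to see on a short string that mixes consecutive spaces, a tab and a newline. This is a minimal sketch; the test string is invented for the purpose.

```python
sent = "a  b\t c\n"   # two spaces, then a tab followed by a space, then a newline

no_arg = sent.split()        # any run of whitespace is a single separator
space_arg = sent.split(' ')  # every single space is a separator

print(no_arg)     # ['a', 'b', 'c']
print(space_arg)  # ['a', '', 'b\t', 'c\n']
```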

### Exercise 14.

Create a variable words containing a list of words. Experiment with words.sort() and sorted(words). What is the difference?

In [12]:
my_list = 'Ziggy played guitar, jamming good with Weird and Gilly and the spiders from Mars.'.split()
sorted(my_list)

['Gilly', 'Mars.', 'Weird', 'Ziggy', 'and', 'and', 'from', 'good', 'guitar,', 'jamming', 'played', 'spiders', 'the', 'with']

In [13]:
my_list

['Ziggy', 'played', 'guitar,', 'jamming', 'good', 'with', 'Weird', 'and', 'Gilly', 'and', 'the', 'spiders', 'from', 'Mars.']

In [14]:
my_list.sort()

In [15]:
my_list

['Gilly', 'Mars.', 'Weird', 'Ziggy', 'and', 'and', 'from', 'good', 'guitar,', 'jamming', 'played', 'spiders', 'the', 'with']

`sorted(my_list)` returns a new sorted list, while the original list stays unchanged.

`my_list.sort()` sorts the original list in place but does not return it (the call evaluates to `None`).
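A compact way to see both behaviours side by side is to capture the return values. The short list below is just an illustrative sample.

```python
words = ['the', 'Ziggy', 'and', 'Mars']

result_sorted = sorted(words)  # new list; `words` is untouched at this point
before_sort = list(words)      # copy of the original order, for comparison

result_sort = words.sort()     # sorts `words` in place and returns None

print(result_sorted)  # ['Mars', 'Ziggy', 'and', 'the']
print(before_sort)    # ['the', 'Ziggy', 'and', 'Mars']
print(result_sort)    # None
print(words)          # ['Mars', 'Ziggy', 'and', 'the']
```

Uppercase letters sort before lowercase ones because the comparison is based on code points.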

### Exercise 17.

What happens when the formatting strings %6s and %-6s are used to display strings that are longer than six characters?

In [16]:
print("{:>6}".format("py"))
print("{:6}{}".format("py", "thon"))

    py
py    thon


With the newer `str.format` method, `%6s` and `%-6s` become `:>6` and `:6` respectively.

They print x spaces before or after the string, respectively, where x is `6 minus the length of the string`. In our case, the length of `py` is 2, so 4 spaces are printed. A string longer than six characters is printed in full: the width is only a minimum, so nothing is padded and nothing is truncated.
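To answer the exercise directly, the sketch below applies the same format specifications to a string longer than six characters. The sample word `pythonic` is an arbitrary choice; the last line uses a precision (`.6`), which is a separate feature from the width.

```python
long_word = 'pythonic'  # 8 characters, longer than the field width of 6

right_old = '%6s' % long_word          # width is a minimum: nothing is cut
left_old = '%-6s' % long_word
right_new = '{:>6}'.format(long_word)
left_new = '{:6}'.format(long_word)

# Truncation needs a precision instead of (or in addition to) a width.
truncated = '%.6s' % long_word

print(repr(right_old))  # 'pythonic'
print(repr(left_old))   # 'pythonic'
print(repr(right_new))  # 'pythonic'
print(repr(left_new))   # 'pythonic'
print(repr(truncated))  # 'python'
```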

### Exercise 18.

Read in some text from a corpus, tokenize it, and print the list of all wh-word types that occur. (wh-words in English are used in questions, relative clauses and exclamations: who, which, what, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation?

In [17]:
wh_words = ("what", "where", "when", "why", "who", "which", "whose", "whom")

def normalize_token(token):
    """Removes non-alphabetic characters and lowercases the token."""
    return re.compile('[^a-zA-Z]').sub('', token).lower()

url = "http://www.gutenberg.org/cache/epub/7028/pg7028.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')

first = raw.find("THE CLICKING OF CUTHBERT")
last = raw.rfind("End of Project Gutenberg's The Clicking of Cuthbert")
raw = raw[first:last]

tokens = nltk.word_tokenize(raw)

wh_tokens = [token for token in tokens 
            if normalize_token(token) in wh_words]


wh_tokens = sorted(set(wh_tokens), key=lambda s: normalize_token(s))
wh_tokens

['What', 'what', 'When', 'when', 'Where', 'where', 'which', 'Which', 'who', 'Who', 'WHO', 'whom', 'whose', 'why', 'Why', "'Why"]

### Exercise 19.

 Create a file consisting of words and (made up) frequencies, where each line consists of a word, the space character, and a positive integer, e.g. fuzzy 53. Read the file into a Python list using open(filename).readlines(). Next, break each line into its two fields using split(), and convert the number into an integer using int(). The result should be a list of the form: [['fuzzy', 53], ...].

In [18]:
with open('../data/2_19.txt') as f:
    raw = [line.strip() for line in f.readlines()]
    
freqs = [[word.split()[0], int(word.split()[1])] for word in raw]
    
freqs

[['random', 10], ['words', 44], ['collection', 90], ['for', 75], ['the', 89], ['exercise', 36], ['number', 19], ['nineteen', 34], ['from', 6], ['chapter', 2], ['two', 906]]

### Exercise 20.

Write code to access a favorite webpage and extract some text from it. For example, access a weather site and extract the forecast top temperature for your town or city today.

In [19]:
url = "http://www.aemet.es/es/eltiempo/prediccion/municipios/madrid-id28079"
html = request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
divs = soup.findAll("div", { 'class' : 'no_wrap'})

for div in divs:
    if div.get_text()[-2:] == '°C': # 1st "no_wrap" div with content ending in '°C' corresponds to the current temp.
        print(div.get_text())
        break

8°C


### Exercise 21.

Write a function unknown() that takes a URL as its argument, and returns a list of unknown words that occur on that webpage. In order to do this, extract all substrings consisting of lowercase letters (using re.findall()) and remove any items from this set that occur in the Words Corpus (nltk.corpus.words). Try to categorize these words manually and discuss your findings.

In [20]:
# This solution uses the text_from_web function from exercise 8.

def unknown(url):
    raw = text_from_web(url)
    words = re.findall(r'\b[a-z]+\b', raw)
    words = sorted(set(words))
    # A set makes each membership test fast compared with scanning a list.
    wordlist = set(w for w in nltk.corpus.words.words('en') if w.islower())
    unknown_words = [word for word in words if word not in wordlist]
    return unknown_words

unknown('https://www.ecigarettedirect.co.uk/ashtray-blog/2013/10/interview-inventor-e-cigarette-herbert-a-gilbert.html')



Because the word list provided by `nltk.corpus.words` is rather incomplete, most of the "unknown" words are in fact perfectly valid, common words ('credits', 'box', 'looks', etc.). Only a few are misspelled words ('cigerette'), fragments of HTML that BeautifulSoup failed to detect and remove ('endif'), proper nouns written in lowercase ('uk') or parts of contractions ('didn').

### Exercise 22.

TODO

### Exercise 23.

Are you able to write a regular expression to tokenize text in such a way that the word don't is tokenized into do and n't? Explain why this regular expression won't work: «n't|\w+».

In [21]:
sent = "I don't smoke.     ¿¡Don't you know?!"
matches = re.findall(r"([Dd]o)(n't)|(\w+)|(\S)", sent)

matches = [token for match in matches for token in match if token]

matches

['I', 'do', "n't", 'smoke', '.', '¿', '¡', 'Do', "n't", 'you', 'know', '?', '!']

The expression `«n't|\w+»` does not work. Applied to "don't", both "n't" and "don" are valid matches. However, the matches returned by `findall` never overlap, so only one of the two can be returned. After trying it on more strings, I concluded that the match closest to the start of the string takes priority. In this case that is "don", which consumes the "n" and leaves only "t" to be matched after the apostrophe.
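The behaviour can be reproduced in a couple of lines. The second pattern is one possible alternative to the solution above, not the book's answer: it uses a lookahead so that the word characters standing right before "n't" are taken as their own token.

```python
import re

sent = "I don't smoke."

# Pattern from the exercise: \w+ matches "don" first and consumes the "n".
naive = re.findall(r"n't|\w+", sent)

# Lookahead variant: word characters directly followed by "n't" form one token,
# then "n't" itself is matched, then any other run of word characters.
split_clitic = re.findall(r"\w+(?=n't)|n't|\w+", sent)

print(naive)         # ['I', 'don', 't', 'smoke']
print(split_clitic)  # ['I', 'do', "n't", 'smoke']
```

Neither pattern keeps punctuation; adding `|\S` as a final alternative, as in the solution above, would keep it.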

### Exercise 24.

Try to write code to convert text into hAck3r, using regular expressions and substitution, where e → 3, i → 1, o → 0, l → |, s → 5, . → 5w33t!, ate → 8. Normalize the text to lowercase before converting it. Add more substitutions of your own. Now try to map s to two different values: $ for word-initial s, and 5 for word-internal s.

In [22]:
def multiple_replace(text, dic):
    for key, value in dic.items():
        text = re.sub(key, value, text)
    return text


    
text = """Python is an interpreted,  7 object-oriented, high-level programming language with dynamic semantics. 
Its high-level built in data structures, combined with dynamic typing and dynamic binding, 
make it very attractive for Rapid Application Development, as well as for use as a scripting 
or glue language to connect existing components together. I ate a skinny python."""

# Substitutions are applied in insertion order, so the multi-letter key 'ate'
# must come before the single-letter rules that would otherwise alter it.
dic = {
    r'ate': r'8',
    r'e': r'3',
    r'i': r'1',
    r'o': r'0',
    r'l': r'|',
    r'\bs': r'$',
    r'(?P<start>\w+)s(?P<end>\w+)': r'\g<start>5\g<end>',
    r'\.': r'5w33t!',
    r'd': r'stu',
    r'y': r'zz'
}

multiple_replace(text.lower(), dic)

'pzzth0n 1s an 1nt3rpr3t3stu,  7 0bj3ct-0r13nt3stu, h1gh-|3v3| pr0gramm1ng |anguag3 w1th stuzznam1c $3mant1cs5w33t! \n1ts h1gh-|3v3| bu1|t 1n stuata $tructur3s, c0mb1n3stu w1th stuzznam1c tzzp1ng anstu stuzznam1c b1nstu1ng, \nmak3 1t v3rzz attract1v3 f0r rap1stu app|1cat10n stu3v3|0pm3nt, as w3|| as f0r u53 as a $cr1pt1ng \n0r g|u3 |anguag3 t0 c0nn3ct 3x15t1ng c0mp0n3nts t0g3th3r5w33t! 1 8 a $k1nnzz pzzth0n5w33t!'

### Exercise 25.

Pig Latin is a simple transformation of English text. Each word of the text is converted as follows: move any consonant (or consonant cluster) that appears at the start of the word to the end, then append ay, e.g. string → ingstray, idle → idleay. http://en.wikipedia.org/wiki/Pig_Latin

    Write a function to convert a word to Pig Latin.
    Write code that converts text, instead of individual words.
    Extend it further to preserve capitalization, to keep qu together (i.e. so that quiet becomes ietquay), and to detect when y is used as a consonant (e.g. yellow) vs a vowel (e.g. style).

In [23]:
def word_to_pig(word):
    if not word.isalpha():
        return word
    # start = optional initial y, then consonants (q, u and y excluded), then an optional "qu"
    pattern = re.compile(r'(?P<start>^([yY])?[^aeiouyqAEIOUYQ]*([qQ][uU])?)(?P<end>\w*)')
    pig_word = re.sub(pattern, 
                      r'\g<end>\g<start>ay', 
                      word)
    if word.istitle():
        pig_word = pig_word[0].title() + pig_word[1:]
    return pig_word

def tokenize_text(raw):
    tokens= re.findall(r"[a-zA-Z]+|\S|\s", raw)
    return tokens

def sent_to_pig(sent):
    tokens = tokenize_text(sent)
    pig_tokens = [word_to_pig(token)  
                  for token in tokens]
    pig_sent = ''.join(pig_tokens)
    return pig_sent


words = ['look', 'stamp', 'GRound', 'Cherry', 'idle', 'Quick', 
         'squeeze', 'yellow', 'style', 'You', 'SYD']
sent = """This is a quite long "sentence" 5+2... Yes, it is!!! ¿Aren't there two sentences here?"""

print('Converting words to Pig Latin:')
for word in words:
    print('{} --> {}'.format(word, word_to_pig(word)))

print('\n')
print("Converting sentence into Pig Latin:")
print(sent)
print('------->')
print(sent_to_pig(sent))

Converting words to Pig Latin:
look --> ooklay
stamp --> ampstay
GRound --> oundGRay
Cherry --> ErryChay
idle --> idleay
Quick --> IckQuay
squeeze --> eezesquay
yellow --> ellowyay
style --> ylestay
You --> OuYay
SYD --> YDSay


Converting sentence into Pig Latin:
This is a quite long "sentence" 5+2... Yes, it is!!! ¿Aren't there two sentences here?
------->
IsThay isay aay itequay onglay "entencesay" 5+2... EsYay, itay isay!!! ¿Arenay'tay erethay otway entencessay erehay?


### Exercise 26.

Download some text from a language that has vowel harmony (e.g. Hungarian), extract the vowel sequences of words, and create a vowel bigram table.

In [24]:
raw = ' '.join(nltk.corpus.udhr.words('Hungarian_Magyar-Latin1'))
vowel_seqs = set(re.findall(r'[aeiouAEIOUáéíóöőúüűÁÉÍÓÖŐÚÜŰ]{2,}', raw))
vowel_seqs = [tuple(seq) for seq in vowel_seqs]

cfd = nltk.ConditionalFreqDist(vowel_seqs)
cfd.tabulate()

  a e i u á é 
a 0 0 1 1 0 0 
e 1 1 1 0 0 1 
i 1 1 1 0 1 0 
á 0 0 0 0 0 1 
ó 1 0 0 0 0 0 
