# Chapter 3: Processing Raw Text

In [2]:
import nltk
nltk.data.path.append('./nltk_data')

## Exercises

#### 1. Define a string `s = 'colorless'`. Write a Python statement that changes this to `"colourless"` using only the slice and concatenation operations.

In [3]:
s = 'colorless'

In [5]:
s[:4] + 'u' + s[4:]

'colourless'

#### 2. We can use the slice notation to remove morphological endings on words. For example, `'dogs'[:-1]` removes the last character of dogs, leaving dog. Use slice notation to remove the affixes from these words (we've inserted a hyphen to indicate the affix boundary, but omit this from your strings): `dish-es`, `run-ning`, `nation-ality`, `un-do`, `pre-heat`.

In [7]:
'dishes'[:-2], 'running'[:-4], 'nationality'[:-5], 'undo'[2:], 'preheat'[3:]

('dish', 'run', 'nation', 'do', 'heat')

#### 3. We saw how we can generate an `IndexError` by indexing beyond the end of a string. Is it possible to construct an index that goes too far to the left, before the start of the string? 

In [8]:
'IndexError'[-99]

IndexError: string index out of range
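
To see exactly where the left boundary lies, here is a small sketch (the string `word` is only an illustration). Negative indices from `-1` down to `-len(word)` are valid; one step further left raises the same `IndexError` as indexing past the end.

```python
word = 'abc'

# The leftmost valid negative index is -len(word).
first = word[-len(word)]

# One position further left is out of range.
try:
    word[-len(word) - 1]
    went_too_far = False
except IndexError:
    went_too_far = True

print(first, went_too_far)
```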

#### 4. We can specify a "step" size for the slice. The following returns every second character within the slice: `monty[6:11:2]`. It also works in the reverse direction: `monty[10:5:-2]`. Try these for yourself, then experiment with different step values.
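
A short sketch of the experiment, assuming `monty` holds `'Monty Python'` as in the chapter text. The comments list which indices each slice visits; the last slice shows that a negative step with a start to the left of the stop selects nothing.

```python
# Assumption: `monty` is the same string used in the chapter.
monty = 'Monty Python'

forward = monty[6:11:2]      # visits indices 6, 8, 10
backward = monty[10:5:-2]    # visits indices 10, 8, 6
every_third = monty[::3]     # visits indices 0, 3, 6, 9
empty = monty[6:11:-2]       # negative step, but start is left of stop

print(forward, backward, every_third, repr(empty))
```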

#### 5. What happens if you ask the interpreter to evaluate `monty[::-1]`? Explain why this is a reasonable result.

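Slicing a string with `[::-1]` returns the reversed string. This is reasonable because the step of `-1` walks the string from right to left, and the omitted bounds default to the ends that suit that direction: the slice starts at the last character and stops after the first. I don't find the syntax particularly intuitive, though.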

#### 6. Describe the class of strings matched by the following regular expressions.
* `[a-zA-Z]+`
* `[A-Z][a-z]*`
* `p[aeiou]{,2}t`
* `\d+(\.\d+)?`
* `([^aeiou][aeiou][^aeiou])*`
* `\w+|[^\w\s]+`

In [14]:
# [a-zA-Z]+ : one or more ASCII letters, in either case
# re_show() needs both the pattern and a string to search.
nltk.re_show(r'[a-zA-Z]+', 'Hello, world 42!')

{Hello}, {world} 42!
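
One way to describe the six classes: `[a-zA-Z]+` matches one or more ASCII letters; `[A-Z][a-z]*` matches a capital letter followed by any number of lowercase letters; `p[aeiou]{,2}t` matches `p`, then at most two lowercase vowels, then `t`; `\d+(\.\d+)?` matches unsigned integers and decimals; `([^aeiou][aeiou][^aeiou])*` matches zero or more three-character groups with a lowercase vowel in the middle and any other character (not only consonants) on either side; `\w+|[^\w\s]+` matches either a run of word characters or a run of punctuation, which makes it a simple tokenizer. The sketch below checks these descriptions with the standard `re` module; the test strings are my own.

```python
import re

# Each pattern is paired with strings it should and should not match in full.
cases = {
    r'[a-zA-Z]+':                  (['Hello', 'abc'], ['abc1', '']),
    r'[A-Z][a-z]*':                (['Hello', 'A'], ['hello', 'HeLLo']),
    r'p[aeiou]{,2}t':              (['pt', 'pat', 'pout'], ['peeet', 'part']),
    r'\d+(\.\d+)?':                (['3', '3.14'], ['3.', '.5']),
    r'([^aeiou][aeiou][^aeiou])*': (['', 'cat', 'catdog'], ['ca', 'eat']),
}

for pattern, (accepted, rejected) in cases.items():
    assert all(re.fullmatch(pattern, s) for s in accepted), pattern
    assert not any(re.fullmatch(pattern, s) for s in rejected), pattern

# The last pattern is easier to understand as a tokenizer.
tokens = re.findall(r'\w+|[^\w\s]+', 'Hello, world!!')
print(tokens)
```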

#### 7. Write regular expressions to match the following classes of strings:
* A single determiner (assume that a, an, and the are the only determiners).
* An arithmetic expression using integers, addition, and multiplication, such as 2*3+8.
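
A possible pair of patterns, sketched with the standard `re` module. The helper names are hypothetical, and two choices are assumptions on my part: matching is case-sensitive and must cover the whole string, and an arithmetic expression needs at least one operator, so a bare integer is rejected.

```python
import re

# 'a', 'an', or 'the' and nothing else.
DETERMINER = re.compile(r'an?|the')
# Integers joined by at least one '+' or '*' operator.
ARITHMETIC = re.compile(r'\d+(?:[+*]\d+)+')

def is_determiner(s):
    """True if s is exactly one of the three determiners."""
    return DETERMINER.fullmatch(s) is not None

def is_arithmetic(s):
    """True if s is an expression such as 2*3+8."""
    return ARITHMETIC.fullmatch(s) is not None

print(is_determiner('an'), is_arithmetic('2*3+8'))
```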

#### 8. Write a utility function that takes a URL as its argument, and returns the contents of the URL, with all HTML markup removed. Use `from urllib import request` and then `request.urlopen('http://nltk.org/').read().decode('utf8')` to access the contents of the URL.
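
A minimal sketch using only the standard library. The names `TextExtractor`, `strip_markup`, and `url_text` are hypothetical. It swaps in `html.parser` for markup removal, skips the bodies of `<script>` and `<style>` elements, and collapses whitespace. It does not insert spaces at block boundaries, so text from adjacent elements can run together.

```python
from html.parser import HTMLParser
from urllib import request

class TextExtractor(HTMLParser):
    """Collects text content, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping:
            self.chunks.append(data)

def strip_markup(html):
    """Return the text of an HTML string with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by the removed tags.
    return ' '.join(''.join(parser.chunks).split())

def url_text(url):
    """Fetch a URL (assumed to be UTF-8 encoded) and strip its markup."""
    html = request.urlopen(url).read().decode('utf8')
    return strip_markup(html)
```

Calling `url_text('http://nltk.org/')` needs network access; `strip_markup` can be tried offline on any HTML string.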

#### 9. Save some text into a file `corpus.txt`. Define a function `load(f)` that reads from the file named in its sole argument, and returns a string containing the text of the file.
* Use `nltk.regexp_tokenize()` to create a tokenizer that tokenizes the various kinds of punctuation in this text. Use one multi-line regular expression, with inline comments, using the verbose flag `(?x)`.
* Use `nltk.regexp_tokenize()` to create a tokenizer that tokenizes the following kinds of expression: monetary amounts; dates; names of people and organizations.
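
A partial sketch, with several assumptions. It uses `re.findall` so that it runs without NLTK; since the pattern contains only non-capturing groups, passing the same pattern to `nltk.regexp_tokenize(text, PATTERN)` should behave the same way. The formats for monetary amounts and dates are my own guesses at what the text might contain, and names of people and organizations are not attempted.

```python
import re

def load(f):
    """Return the contents of the file named f as a single string."""
    with open(f, encoding='utf8') as handle:
        return handle.read()

PATTERN = r'''(?x)          # verbose flag: whitespace and comments are ignored
    \$\d+(?:\.\d\d)?        # monetary amounts such as $12.50
  | \d{4}-\d{2}-\d{2}       # dates written as 2024-01-31
  | \w+(?:-\w+)*            # words, with optional internal hyphens
  | \.\.\.                  # ellipsis
  | [^\w\s]                 # any other single punctuation mark
'''

def tokenize(text):
    """Split text into tokens using the verbose pattern above."""
    return re.findall(PATTERN, text)

print(tokenize('Pay $12.50 by 2024-01-31... ok?'))
```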

#### 10. Rewrite the following loop as a list comprehension:

`>>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']`  
`>>> result = []`  
`>>> for word in sent:`  
`...     word_len = (word, len(word))`  
`...     result.append(word_len)`  
`>>> result`  
`[('The', 3), ('dog', 3), ('gave', 4), ('John', 4), ('the', 3), ('newspaper', 9)]`

In [17]:
sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
result = [(word, len(word)) for word in sent]
result

[('The', 3),
 ('dog', 3),
 ('gave', 4),
 ('John', 4),
 ('the', 3),
 ('newspaper', 9)]

#### 11. Define a string `raw` containing a sentence of your own choosing. Now, split `raw` on some character other than space, such as `'s'`.

In [18]:
raw = 'This is a sentence.'
raw.split('i')

['Th', 's ', 's a sentence.']

#### 12. Write a `for` loop to print out the characters of a string, one per line.

In [19]:
for char in 'Loop':
    print(char)

L
o
o
p


#### 13. What is the difference between calling `split` on a string with no argument or with `' '` as the argument, e.g. `sent.split()` versus `sent.split(' ')`? What happens when the string being split contains tab characters, consecutive space characters, or a sequence of tabs and spaces? (In IDLE you will need to use `'\t'` to enter a tab character.)

In [20]:
test_string = 'hi\tthere  '

In [21]:
test_string.split()

['hi', 'there']

In [22]:
test_string.split(' ')

['hi\tthere', '', '']
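
In short, `split()` with no argument splits on runs of any whitespace (spaces, tabs, newlines) and never returns empty strings, whereas `split(' ')` splits at every single space character, leaves tabs alone, and returns an empty string between consecutive spaces. A second illustration, with a made-up string that mixes both:

```python
sample = 'a \t b  c'

# Runs of mixed whitespace are treated as one separator.
on_whitespace = sample.split()

# Every single space is a separator; the tab survives as a field,
# and the two adjacent spaces produce an empty string.
on_single_space = sample.split(' ')

print(on_whitespace)
print(on_single_space)
```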

#### 14. Create a variable words containing a list of words. Experiment with `words.sort()` and `sorted(words)`. What is the difference?

`words.sort()` sorts the list in place, so it mutates the original `words` list and returns `None`, while `sorted(words)` returns a new sorted list and leaves the original untouched. The `.sort()` method is slightly faster because it avoids making a copy, so it is preferable when mutating the original list is not a problem.

In [23]:
words = ['apple', 'orange', 'banana', 'grape']

In [30]:
sorted(words)

['apple', 'banana', 'grape', 'orange']

In [32]:
words.sort()
words

['apple', 'banana', 'grape', 'orange']
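
The difference is easier to see when the list does not start out sorted. A small sketch with a differently ordered list (the variable names are only illustrative):

```python
words = ['orange', 'apple', 'grape', 'banana']

new_list = sorted(words)     # a sorted copy
snapshot = list(words)       # `words` itself is still in its original order
returned = words.sort()      # sorts in place and returns None

print(new_list, snapshot, returned, words)
```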

#### 15. Explore the difference between strings and integers by typing the following at a Python prompt: `"3" * 7` and `3 * 7`. Try converting between strings and integers using `int("3")` and `str(3)`.

In [36]:
'3' * 7

'3333333'

In [34]:
3 * 7

21

In [37]:
int('3') * str(7)

'777'

#### 17. What happens when the formatting strings `%6s` and `%-6s` are used to display strings that are longer than six characters?

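The field width is only a minimum, so a string longer than six characters is printed in full by both `%6s` and `%-6s`, with no truncation and no padding.

In [40]:
word = 'dictionary'
# %-style format strings are applied with the % operator, not str.format().
print('%6s' % word)
print('%-6s' % word)

dictionary
dictionary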


#### 18. Read in some text from a corpus, tokenize it, and print the list of all *wh*-word types that occur. (*wh*-words in English are used in questions, relative clauses and exclamations: *who*, *which*, *what*, and so on.) Print them in order. Are any words duplicated in this list, because of the presence of case distinctions or punctuation?
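
A rough sketch of one approach. Two assumptions are mine: a short made-up text stands in for a real corpus, and any alphabetic token beginning with `wh` counts as a *wh*-word, which would wrongly include words like *white* and miss *how*.

```python
import re

# Assumption: this invented text stands in for a tokenized corpus.
text = 'Who knows what happened? What, indeed. Which one, who can say...'
tokens = re.findall(r'\w+|[^\w\s]+', text)

# Crude test: alphabetic tokens that begin with "wh", ignoring case.
wh_types = sorted({t for t in tokens if t.isalpha() and t.lower().startswith('wh')})

# Lowercasing first removes the duplicates caused by case distinctions.
normalized = sorted({t.lower() for t in wh_types})

print(wh_types)
print(normalized)
```

Here the case-sensitive list contains both `What` and `what`, and both `Who` and `who`; capitalized forms sort before lowercase ones.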

#### 19. Create a file consisting of words and (made up) frequencies, where each line consists of a word, the space character, and a positive integer, e.g. `fuzzy 53`. Read the file into a Python list using `open(filename).readlines()`. Next, break each line into its two fields using `split()`, and convert the number into an integer using `int()`. The result should be a list of the form: `[['fuzzy', 53], ...]`.
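
A minimal sketch of the steps described above. The function name `read_frequencies` is hypothetical, and the file is assumed to contain exactly two whitespace-separated fields per line.

```python
def read_frequencies(filename):
    """Read 'word count' lines into a list of [word, int] pairs."""
    with open(filename) as f:
        lines = f.readlines()
    pairs = []
    for line in lines:
        # split() with no argument also discards the trailing newline.
        word, count = line.split()
        pairs.append([word, int(count)])
    return pairs
```

For a file containing the lines `fuzzy 53` and `wuzzy 7`, this returns `[['fuzzy', 53], ['wuzzy', 7]]`.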

#### 20. Write code to access a favorite webpage and extract some text from it. For example, access a weather site and extract the forecast top temperature for your town or city today.

#### 21. Write a function `unknown()` that takes a URL as its argument, and returns a list of unknown words that occur on that webpage. In order to do this, extract all substrings consisting of lowercase letters (using `re.findall()`) and remove any items from this set that occur in the Words Corpus (`nltk.corpus.words`). Try to categorize these words manually and discuss your findings.
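
A sketch of the core step only, kept offline so it can be run anywhere. The function `unknown_in_text` is hypothetical: it takes text that has already been fetched, and a vocabulary that in a full solution could come from `nltk.corpus.words.words()`. It uses `\b[a-z]+\b` rather than a bare `[a-z]+` so that fragments of capitalized words (such as `ello` from `Hello`) are not extracted.

```python
import re

def unknown_in_text(text, vocabulary):
    """Return the sorted lowercase words in text that are not in vocabulary."""
    candidates = set(re.findall(r'\b[a-z]+\b', text))
    return sorted(candidates - set(vocabulary))

print(unknown_in_text('the blorp sat on the Mat', ['the', 'sat', 'on', 'mat']))
```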

#### 22. Examine the results of processing the URL `http://news.bbc.co.uk/` using the regular expressions suggested above. You will see that there is still a fair amount of non-textual data there, particularly Javascript commands. You may also find that sentence breaks have not been properly preserved. Define further regular expressions that improve the extraction of text from this web page.

#### 23. Are you able to write a regular expression to tokenize text in such a way that the word *don't* is tokenized into *do* and *n't*? Explain why this regular expression won't work: `«n't|\w+»`.
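
The pattern `n't|\w+` fails because the regex engine scans from left to right. At the `d` of *don't* the alternative `n't` cannot match, so `\w+` matches and greedily consumes `don`, and the `n` is gone before `n't` gets a chance. One possible fix, sketched with the standard `re` module, is a lookahead that makes the word stop just before `n't`:

```python
import re

# The original pattern: \w+ swallows the "n" that n't needs.
naive = re.findall(r"n't|\w+", "don't")

# A word followed by n't stops early; then n't itself; then ordinary words.
fixed = re.findall(r"\w+(?=n't)|n't|\w+", "didn't they say don't")

print(naive)
print(fixed)
```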

#### 24. Try to write code to convert text into *hAck3r*, using regular expressions and substitution, where `e` → `3`, `i` → `1`, `o` → `0`, `l` → `|`, `s` → `5`, `.` → `5w33t!`, `ate` → `8`. Normalize the text to lowercase before converting it. Add more substitutions of your own. Now try to map s to two different values: `$` for word-initial `s`, and `5` for word-internal `s`.

In [None]:
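# Longer patterns come first: once 'e' has become '3', 'ate' can no longer match.
# The name avoids shadowing the built-in `map`.
substitutions = {
    'ate': '8',
    'e': '3',
    'i': '1',
    'o': '0',
    'l': '|',
    's': '5',
    '.': '5w33t!',
}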
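
One way to apply a mapping like the one above is a single pass with `re.sub` and a replacement function, which avoids one substitution interfering with another. This is a sketch under my own assumptions: the names `SUBSTITUTIONS`, `PATTERN`, and `to_hacker` are hypothetical, `ate` is tried before the single letters, and every `s` that is not word-initial (including a word-final one) becomes `5`.

```python
import re

SUBSTITUTIONS = {
    'ate': '8', 'e': '3', 'i': '1', 'o': '0',
    'l': '|', 's': '5', '.': '5w33t!',
}

# Alternatives are tried left to right: word-initial s, then 'ate',
# then the single characters.
PATTERN = re.compile(r'(?P<initial_s>\bs)|ate|[eiols.]')

def to_hacker(text):
    """Lowercase text and apply the hAck3r substitutions in one pass."""
    def replace(match):
        if match.group('initial_s'):
            return '$'
        return SUBSTITUTIONS[match.group()]
    return PATTERN.sub(replace, text.lower())

print(to_hacker('I ate some sushi.'))
```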

#### 25a. Write a function to convert a word to Pig Latin.

#### 25b. Write code that converts text, instead of individual words.

#### 25c. Extend it further to preserve capitalization, to keep `qu` together (i.e. so that `quiet` becomes `ietquay`), and to detect when `y` is used as a consonant (e.g. `yellow`) vs a vowel (e.g. `style`). 
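
A sketch covering all three parts, under stated assumptions. The names `LEADING`, `pig_latin`, and `pig_latin_text` are hypothetical. Words that begin with a vowel simply get `ay` appended (some variants use `way`). `y` is treated as a consonant only when it is the first letter of the word, which is a simplification. A capitalized word yields a capitalized result with the remaining letters lowercased.

```python
import re

# Leading cluster: an optional word-initial y, then consonants,
# with "qu" kept together as a single unit.
LEADING = re.compile(r'(y?(?:qu|[bcdfghjklmnpqrstvwxz])*)(.*)', re.IGNORECASE)

def pig_latin(word):
    """Move the leading consonant cluster to the end and add 'ay'."""
    head, rest = LEADING.match(word).groups()
    result = rest + head + 'ay'
    return result.capitalize() if word[:1].isupper() else result

def pig_latin_text(text):
    """Convert every alphabetic word in text, leaving punctuation in place."""
    return re.sub(r'[A-Za-z]+', lambda m: pig_latin(m.group()), text)

print(pig_latin_text('Yellow quiet style.'))
```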