# Exercise Sheet 1.0 - Text Processing with Python


## Learning Objectives

The goal of this exercise is to gain familiarity with the Python programming language. We are going to do some basic text processing and analysis on a plaintext corpus. If you are not familiar with Python or Jupyter notebooks, it is recommended to start with the Python Tutorial notebook before attempting this exercise.

---


## Exercise 0

For this exercise, we are going to count the 25 most frequent words in **Alice’s Adventures in Wonderland** by Lewis Carroll. You are free to use any other text of your choice instead. This notebook contains step-by-step instructions (with some hints), and you are required to fill in the code blocks based on the material covered in the Python Tutorial notebook.

### 0. Download the text file.
Run the cell below to download the book **Alice’s Adventures in Wonderland** as a text file from [Project Gutenberg](http://www.gutenberg.org) and save it to a file called `alice.txt`.

In [1]:
!curl https://www.gutenberg.org/files/11/11-0.txt > alice.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  170k  100  170k    0     0   122k      0  0:00:01  0:00:01 --:--:--  122k


---
### 1. Read text from file.
Open the text file `alice.txt` and read all the lines into a list.

In [23]:
lines = open("alice.txt", "r", encoding="utf-8").readlines()  # read lines from alice.txt into this list


#### Hint:

The `open()` function can be used to read the file.
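As a minimal sketch of the same idea, the file can also be opened in a `with` block, which closes it automatically once the block ends. The temporary file name and its contents below are made up so that the example runs on its own; they are not part of the exercise.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained
# (hypothetical file name and contents, for illustration only).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

# The `with` statement closes the file automatically when the block ends.
with open(sample_path, "r", encoding="utf-8") as f:
    sample_lines = f.readlines()

print(sample_lines)  # ['first line\n', 'second line\n']
```

Note that `readlines()` keeps the trailing `\n` on every line.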

---
### 2. Filter out the metadata.
The text file contains some metadata about the book which is not relevant for our analysis. Discard this information by dropping the first 54 lines and the last 356 lines of the file.

In [24]:
lines = lines[54:-356]
lines[-1]  # display the last remaining line as a sanity check

'THE END \n'

#### Hint:

Use index slicing to select the required lines.
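The following sketch shows how a slice with a positive start and a negative stop drops items from both ends of a list. The list `sample` is toy data standing in for the lines of a file.

```python
# A toy list standing in for the lines of a file (made-up data).
sample = ["header 1", "header 2", "body a", "body b", "body c", "footer"]

# Drop the first 2 items and the last 1 item.
body = sample[2:-1]

print(body)      # ['body a', 'body b', 'body c']
print(body[-1])  # body c
```

Checking the first and last element after slicing is a quick way to confirm that the cut-off points are where you expect them.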

---
### 3. Remove leading and trailing spaces from each line in the list.
Each line contains a newline character `\n` at the end, and some lines also contain leading and trailing spaces. This formatting is there for presentation purposes and is not relevant for our analysis.

In [25]:
clean_lines = []  # store the lines in this list after removing the leading and trailing spaces

for line in lines:
    clean_lines.append(line.strip())

#### Hint:

The `strip()` string method can be used to remove leading and trailing whitespace, including the newline character.
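A small sketch of how `strip()` and its one-sided variants behave, using a made-up string. `repr()` is used only to make the remaining whitespace visible.

```python
raw = "   Down the Rabbit-Hole  \n"

print(repr(raw.strip()))   # 'Down the Rabbit-Hole'
print(repr(raw.rstrip()))  # '   Down the Rabbit-Hole'
print(repr(raw.lstrip()))  # 'Down the Rabbit-Hole  \n'

# The loop-and-append pattern can also be written as a list comprehension.
raw_lines = ["  one \n", "two\n", "\n"]
stripped = [line.strip() for line in raw_lines]
print(stripped)  # ['one', 'two', '']
```

A line that held only whitespace becomes an empty string after stripping.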

---
### 4. Remove empty lines from the list.
After removing the newline character `\n` from each line in the list, some strings are now empty and can be discarded safely.

In [26]:
non_empty_lines = []  # store non empty lines in this list
# your code goes here
for line in clean_lines:
    if line != '':
        non_empty_lines.append(line)

#### Hint:

An empty string in Python is represented by `''` or `""`.
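As an illustration on toy data, empty strings can be filtered out either by comparing against `""` or by relying on the fact that an empty string is falsy in Python. Both variants are sketches; the variable names are made up.

```python
stripped = ["one", "", "two", "", ""]

# Explicit comparison with the empty string.
kept_explicit = [s for s in stripped if s != ""]

# Empty strings are falsy, so the bare condition works as well.
kept_truthy = [s for s in stripped if s]

print(kept_explicit)                 # ['one', 'two']
print(kept_explicit == kept_truthy)  # True
```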

---
### 5. Join all the non empty lines into a single string.
Now that we have cleaned the corpus by removing some editorial details and formatting, we can focus on the actual text. Create a single string which contains all the lines from the text.



In [27]:
text = ' '.join(non_empty_lines)

#### Hint:

The `join()` string method can be used to join a list of strings into a single string.
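The sketch below, on two made-up fragments, shows why the separator matters: `join()` is called on the separator string, and joining with an empty separator would fuse the last word of one line with the first word of the next.

```python
parts = ["alice was beginning", "to get very tired"]

joined_with_space = " ".join(parts)
joined_without = "".join(parts)

print(joined_with_space)  # alice was beginning to get very tired
print(joined_without)     # alice was beginningto get very tired
```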

---
### 6. Convert to lowercase
To keep the word counts consistent, we are going to convert everything to lowercase. If we don't do this, the words `the`, `The` and `THE` would be considered distinct.

In [28]:
text = text.lower()

#### Hint:

Use the `lower()` function.
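A tiny sketch of the effect, using three made-up tokens: before lowercasing they count as three distinct strings, afterwards as one.

```python
tokens = ["the", "The", "THE"]
print(len(set(tokens)))   # 3

lowered = [t.lower() for t in tokens]
print(len(set(lowered)))  # 1
```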

---
### 7. Get a list of all the words in the text.

In [37]:
words = text.split()
words[40:50]

['it',
 'had',
 'no',
 'pictures',
 'or',
 'conversations',
 'in',
 'it,',
 '“and',
 'what']

#### Hint:

The `split()` string method can be used to get a list of words from a string.
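As a brief sketch on a made-up string: called without arguments, `split()` splits on any run of whitespace (spaces, tabs, newlines), whereas passing an explicit separator such as `" "` produces empty strings when separators repeat.

```python
sentence = "it had no   pictures\tor conversations\nin it,"

tokens = sentence.split()
print(tokens)
# ['it', 'had', 'no', 'pictures', 'or', 'conversations', 'in', 'it,']

# With an explicit separator, consecutive spaces yield empty strings.
pieces = "a  b".split(" ")
print(pieces)  # ['a', '', 'b']
```

Punctuation stays attached to the neighbouring word, as in `'it,'`.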

---
### 8. Remove punctuation

For a machine, the character sequences `rabbit`, `rabbit,` and `rabbit!` are different words, although we as humans understand that they are the same word with or without a punctuation mark after it. To avoid this confusion, we can remove punctuation, because it is unnecessary for our task.

In [39]:
punct = '.,?!:;—"«»()[]{}–~*@#$^&\\/„“‘’-|+=`'  # `\\` is an escaped backslash
new_words = [w.strip(punct) for w in words]

new_words[40:50]

['it',
 'had',
 'no',
 'pictures',
 'or',
 'conversations',
 'in',
 'it',
 'and',
 'what']

#### Hint:

Use the `strip()` method with the punctuation characters as its argument. List comprehensions may also come in handy!
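The sketch below illustrates two properties of `strip()` with a character argument. It removes characters only from the two ends of a string, so word-internal apostrophes and hyphens survive, and a token made up only of punctuation becomes an empty string, which may be worth filtering out before counting. The character set `demo_punct` and the sample tokens are made up for illustration.

```python
# A small, hypothetical set of punctuation characters.
demo_punct = ".,!?“”'"

samples = ["rabbit,", "rabbit!", "“and", "don't", "rabbit-hole."]
cleaned = [w.strip(demo_punct) for w in samples]
print(cleaned)  # ['rabbit', 'rabbit', 'and', "don't", 'rabbit-hole']

# A token consisting only of punctuation is stripped down to ''.
print("?!".strip(demo_punct) == "")  # True
```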

---
### 9. How many total words are there in the text?

Individual elements in a text (usually words, but not only) are called **tokens** in NLP.

In [41]:
len(new_words)

26441

#### Hint:

This can be found by taking the length of the list of words (here, `new_words`).

---
### 10. How many unique words are there in the text?

Unique words are also called **types** in NLP.

In [43]:
len(set(new_words))

3505

#### Hint:

The `set` data type can be used to find unique values.
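To make the token/type distinction concrete, here is a minimal sketch on a made-up sentence: the number of tokens is the length of the list, and the number of types is the size of the set built from it.

```python
tokens = "the cat saw the other cat".split()
types = set(tokens)

print(len(tokens))    # 6
print(len(types))     # 4
print(sorted(types))  # ['cat', 'other', 'saw', 'the']
```

A set has no defined order, which is why the types are sorted before printing.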

---
### 11. What are the 25 most frequent words?

In [57]:
# your code goes here (this can be done in less than 10 lines of code)
word_count = dict()

for word in new_words:
    if word in word_count:
        word_count[word] += 1
    else:
        word_count[word] = 1
word_count = list(word_count.items())
sorted(word_count, key=lambda x: x[1], reverse=True)[:25]

[('the', 1629),
 ('and', 843),
 ('to', 715),
 ('a', 626),
 ('she', 534),
 ('of', 505),
 ('it', 477),
 ('said', 456),
 ('alice', 383),
 ('i', 380),
 ('in', 360),
 ('was', 347),
 ('you', 329),
 ('as', 262),
 ('her', 246),
 ('that', 240),
 ('at', 208),
 ('on', 183),
 ('had', 177),
 ('with', 176),
 ('all', 169),
 ('but', 164),
 ('for', 149),
 ('so', 145),
 ('be', 139)]

#### Hints:

1. Use a dictionary to store count of each word.
2. Convert the dictionary into a list of tuples and sort by counts in descending order.
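The two hints can be sketched on a made-up sentence as follows. This variant uses `dict.get()` with a default of `0` instead of an explicit `if`/`else`; it is one possible way to write the counting loop, not the only one.

```python
tokens = "the cat saw the other cat and the dog".split()

# Hint 1: count each word in a dictionary.
counts = {}
for tok in tokens:
    counts[tok] = counts.get(tok, 0) + 1

# Hint 2: turn the dictionary into (word, count) tuples and sort by count,
# largest first, then keep the top 2.
top = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)[:2]
print(top)  # [('the', 3), ('cat', 2)]
```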

#### Alternate Solutions:

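1. Since Python 3.7, dictionaries are guaranteed to preserve insertion order (in CPython 3.6 this was an implementation detail), so the sorted `(word, count)` pairs can be stored back in a `dict` instead of being kept as a list of tuples.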
2. Look up the `Counter` container in the `collections` module in the [Python docs](https://docs.python.org/3/library/collections.html#collections.Counter).

In [48]:
from collections import Counter

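# Note: this counts the raw `words` list (punctuation still attached), so the
# totals below differ from the result above; use `new_words` to match it.
word_counts = Counter(words)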
word_counts.most_common(25)

[('the', 1604),
 ('and', 765),
 ('to', 706),
 ('a', 614),
 ('she', 518),
 ('of', 493),
 ('said', 420),
 ('it', 362),
 ('in', 349),
 ('was', 328),
 ('you', 257),
 ('as', 249),
 ('i', 249),
 ('alice', 221),
 ('that', 216),
 ('her', 207),
 ('at', 204),
 ('had', 176),
 ('with', 170),
 ('all', 154),
 ('on', 142),
 ('be', 138),
 ('for', 135),
 ('very', 126),
 ('so', 126)]
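The first alternate solution can be sketched on toy data as follows: because a `dict` keeps insertion order, building one from the sorted pairs preserves the ranking. The sentence and variable names are made up for illustration.

```python
from collections import Counter

tokens = "the cat saw the other cat and the dog".split()
counts = Counter(tokens)

# Build a dict from the pairs sorted by count (largest first);
# the dict then iterates in that same order.
ranked = dict(sorted(counts.items(), key=lambda pair: pair[1], reverse=True))

print(list(ranked)[:2])       # ['the', 'cat']
print(counts.most_common(2))  # [('the', 3), ('cat', 2)]
```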