# Exercise 0

The goal of this exercise is to gain familiarity with the Python programming language. We are going to do some basic text processing and analysis on a plaintext corpus. If you are not familiar with Python or Jupyter notebooks, we recommend starting with the Python Tutorial notebook before attempting this exercise.

---

For this exercise, we are going to count the 25 most frequent words in **Alice’s Adventures in Wonderland** by Lewis Carroll. You are free to use any other text of your choice instead. This notebook contains step-by-step instructions (with some hints), and you are required to fill in the code blocks based on the material covered in the Python Tutorial notebook.

### 0. Download the text file.
Run the cell below to download the book **Alice’s Adventures in Wonderland** as a text file from [Project Gutenberg](http://www.gutenberg.org) and save it to a file called `alice.txt`.

In [None]:
!curl https://www.gutenberg.org/files/11/11-0.txt > alice.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100  170k  100  170k    0     0  1057k      0 --:--:-- --:--:-- --:--:-- 1057k


---
### 1. Read text from file.
Open the text file `alice.txt` and read all the lines into a list.

In [None]:
lines = []  # read lines from alice.txt into this list
with open('alice.txt', 'r', encoding='utf-8') as f:
  lines = f.readlines()

In [None]:
# another way to do it 
f = open('alice.txt', 'r', encoding='utf-8')
lines = f.readlines()
f.close()
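Both cells above produce the same list. The `with` form has the advantage that the file is closed automatically when the block ends. The following is a minimal sketch of that behavior, assuming a small throwaway file (`sample.txt` is a hypothetical name used only for illustration, not part of the exercise). It also shows that `readlines()` keeps the trailing `\n` on each line.

```python
import os
import tempfile

# Write a tiny stand-in file so the example does not depend on alice.txt.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.txt')  # hypothetical file name
with open(sample_path, 'w', encoding='utf-8') as f:
    f.write('first line\nsecond line\n')

# The with-statement closes the file automatically when the block ends.
with open(sample_path, 'r', encoding='utf-8') as f:
    sample_lines = f.readlines()

print(sample_lines)  # each string still ends with '\n'
print(f.closed)      # True: no explicit f.close() was needed
```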

---
### 2. Filter out the metadata.
The text file contains some metadata about the book which is not relevant for our analysis. Discard this information by removing the first 54 lines from the beginning and the last 356 lines from the end.

In [None]:
lines = lines[54:-356]

In [None]:
print(lines[0])
print(lines[-1])

CHAPTER I.

THE END 
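The offsets 54 and 356 only fit this particular version of the file. As a hedged alternative, the sketch below looks for marker lines instead of counting. It assumes the file contains lines starting with `*** START OF` and `*** END OF`, as Project Gutenberg texts commonly do; the helper name `strip_gutenberg_metadata` and the sample data are hypothetical. Note that this is not equivalent to the slice above: anything between the start marker and `CHAPTER I.` (such as a title or table of contents) would be kept.

```python
def strip_gutenberg_metadata(raw_lines):
    """Return the lines strictly between the START and END marker lines."""
    start = None
    end = None
    for i, line in enumerate(raw_lines):
        if line.startswith('*** START OF'):
            start = i
        elif line.startswith('*** END OF'):
            end = i
            break
    if start is None or end is None:
        raise ValueError('marker lines not found')
    return raw_lines[start + 1:end]


# Made-up sample standing in for the downloaded file.
sample = [
    'License header\n',
    '*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n',
    'CHAPTER I.\n',
    'Some text.\n',
    'THE END\n',
    '*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n',
    'License footer\n',
]
body = strip_gutenberg_metadata(sample)
print(body)
```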



---
### 3. Remove leading and trailing spaces from each line in the list.
Each line contains a newline character `\n` at the end, and some lines also contain leading and trailing spaces. This formatting is done for presentation purposes and is not relevant for our analysis.

In [None]:
clean_lines = []  # store the lines in this list after removing the leading and trailing spaces
for line in lines:
  clean_lines.append(line.strip())

---
### 4. Remove empty lines from the list.
After removing the newline character `\n` from each line in the list, some strings are now empty and can be discarded safely.

In [None]:
non_empty_lines = []  # store non empty lines in this list
for line in clean_lines:
  if line != '':
    non_empty_lines.append(line)
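Steps 3 and 4 can also be combined into a single list comprehension. This is a sketch on made-up sample lines rather than the book itself; it relies on the fact that an empty string is falsy in Python.

```python
# Made-up sample lines with newlines, padding, and blank entries.
sample_lines = [
    '  CHAPTER I.\n',
    '\n',
    'Down the Rabbit-Hole\n',
    '   \n',
    'Alice was beginning  \n',
]

# Strip each line and keep it only if something is left after stripping.
cleaned = [line.strip() for line in sample_lines if line.strip()]
print(cleaned)
```

Calling `strip()` twice per line is slightly wasteful; the two explicit loops above avoid that and are easier to read for beginners.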

---
### 5. Join all the non-empty lines into a single string.
Now that we have cleaned the corpus by removing some editorial details and formatting, we can focus on the actual text. Create a single string which contains all the lines from the text.



In [None]:
text = ' '.join(non_empty_lines)

---
### 6. Convert to lowercase
To keep the word counts consistent, we are going to convert everything to lowercase. If we don't do this, the words `the`, `The` and `THE` would be considered distinct.

In [None]:
text = text.lower()

---
### 7. Get a list of all the words in the text.

In [None]:
words = text.split(' ')
words[40:50]

['it',
 'had',
 'no',
 'pictures',
 'or',
 'conversations',
 'in',
 'it,',
 '“and',
 'what']
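One detail worth knowing about `split`: with an explicit `' '` separator, every single space is a split point, so two consecutive spaces produce an empty string, and tabs are not treated as separators. Called without arguments, `split()` splits on any run of whitespace. The sample string below is made up to illustrate the difference; whether it matters for the book text depends on the file.

```python
sample = 'alice  was\tbeginning'  # two spaces, then a tab

by_space = sample.split(' ')   # splits on every single space character
by_whitespace = sample.split() # splits on any run of whitespace

print(by_space)
print(by_whitespace)
```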

### 8. Remove punctuation

To a machine, the character sequences `rabbit`, `rabbit,` and `rabbit!` are different words, although we as humans understand that they are the same word with or without a punctuation mark after it. To avoid this confusion, we remove punctuation, which is unnecessary for our task.

In [None]:
punct = '.,?!:;—"«»()[]{}–~*@#$^&\\/„“‘’-|+=`'
new_words = [w.strip(punct) for w in words]  # list comprehension: strip punctuation from both ends of each word
new_words = [w for w in new_words if w not in punct]  # drop tokens that are empty after stripping
new_words[40:50]

['it',
 'had',
 'no',
 'pictures',
 'or',
 'conversations',
 'in',
 'it',
 'and',
 'what']

In [None]:
# this list comprehension, which strips punctuation from every word...
words = [w.strip(punct) for w in words]

# ...does the same work as this explicit loop
new_words = []
for w in words:
    new_words.append(w.strip(punct))
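Note that `str.strip` only removes characters from the two ends of a string, so punctuation inside a word (such as a hyphen) is kept, and a token made up entirely of punctuation becomes an empty string. The sketch below uses a shortened, made-up punctuation set (`sample_punct`) and a few sample tokens to show this.

```python
sample_punct = '.,?!“”—-'  # shortened set, for illustration only
tokens = ['it,', '“and', 'rabbit-hole', '—']

# strip() removes matching characters from both ends, never from the middle.
stripped = [t.strip(sample_punct) for t in tokens]
print(stripped)

# Tokens that consisted only of punctuation are now empty and can be dropped.
non_empty = [t for t in stripped if t]
print(non_empty)
```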

---
### 9. How many total words are there in the text?

Individual elements of a text (usually words, but not only words) are called **tokens** in NLP.

In [None]:
len(words)

26381

---
### 10. How many unique words are there in the text?

Unique words are also called **types** in NLP.

In [None]:
len(set(words))

3504
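The difference between the two counts is easiest to see on a tiny made-up example: `len` on the list counts every occurrence (tokens), while converting to a `set` first removes duplicates (types).

```python
sample_tokens = ['the', 'cat', 'saw', 'the', 'other', 'cat']

num_tokens = len(sample_tokens)      # every occurrence counts
num_types = len(set(sample_tokens))  # duplicates collapse in a set

print(num_tokens, num_types)
```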

---
### 11. What are the 25 most frequent words?

In [None]:
word_counts = dict()    # create an empty dictionary for word counts
for word in words:
  if word in word_counts:
    word_counts[word] += 1
  else:
    word_counts[word] = 1

word_counts = list(word_counts.items())  # convert dict to a list of tuples for word counts
sorted_by_word_counts = sorted(word_counts, key=lambda x: x[1], reverse=True)
sorted_by_word_counts[:25]

[('the', 1629),
 ('and', 843),
 ('to', 715),
 ('a', 626),
 ('she', 534),
 ('of', 505),
 ('it', 477),
 ('said', 456),
 ('alice', 383),
 ('i', 380),
 ('in', 360),
 ('was', 347),
 ('you', 329),
 ('as', 262),
 ('her', 246),
 ('that', 240),
 ('at', 208),
 ('on', 183),
 ('had', 177),
 ('with', 176),
 ('all', 169),
 ('but', 164),
 ('for', 149),
 ('so', 145),
 ('be', 139)]
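The `if`/`else` in the counting loop can be shortened with `dict.get`, which returns a default value when the key is missing. This is a sketch on a made-up word list, not a replacement for the solution above; it also shows that `sorted` accepts `counts.items()` directly.

```python
sample_words = ['the', 'cat', 'and', 'the', 'hat', 'and', 'the', 'bat']

counts = {}
for word in sample_words:
    # get() returns 0 for a word that has not been seen yet
    counts[word] = counts.get(word, 0) + 1

# Sort the (word, count) pairs by count, largest first, and keep the top two.
top_two = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:2]
print(top_two)
```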

#### Alternate Solutions:

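1. Dictionaries preserve insertion order in Python >= 3.7 (and in CPython 3.6 as an implementation detail), so the sorted counts can be stored back in a `dict`. In addition, `sorted()` accepts `word_counts.items()` directly, so the explicit conversion to a list of tuples is not needed.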
2. Look up the `Counter` container in the `collections` module in the [Python docs](https://docs.python.org/3/library/collections.html#collections.Counter).

In [None]:
from collections import Counter

word_counts = Counter(words)
word_counts.most_common(25)

[('the', 1629),
 ('and', 843),
 ('to', 715),
 ('a', 626),
 ('she', 534),
 ('of', 505),
 ('it', 477),
 ('said', 456),
 ('alice', 383),
 ('i', 380),
 ('in', 360),
 ('was', 347),
 ('you', 329),
 ('as', 262),
 ('her', 246),
 ('that', 240),
 ('at', 208),
 ('on', 183),
 ('had', 177),
 ('with', 176),
 ('all', 169),
 ('but', 164),
 ('for', 149),
 ('so', 145),
 ('be', 139)]