# Exercise 0

The goal of this exercise is to gain familiarity with the Python programming language. We are going to do some basic text processing and analysis on a plaintext corpus. If you are not familiar with Python or Jupyter notebooks, it is recommended that you start with the Python Tutorial notebook before attempting this exercise.

---

For this exercise, we are going to find the 25 most frequent words in **The Adventures of Sherlock Holmes**, by Sir Arthur Conan Doyle. You are free to use any other text of your choice instead. This notebook contains step-by-step instructions (with some hints), and you are required to fill in the code blocks based on the material covered in the Python Tutorial notebook.

### 0. Download the text file.
Run the cell below to download the book **The Adventures of Sherlock Holmes** as a text file from [Project Gutenberg](http://www.gutenberg.org) and save it to a file called `sherlock.txt`.

In [1]:
!curl https://www.gutenberg.org/files/1661/1661-0.txt > sherlock.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  2  593k    2 14071    0     0  11677      0  0:00:52  0:00:01  0:00:51 11686
 15  593k   15 95991    0     0  42417      0  0:00:14  0:00:02  0:00:12 42436
 29  593k   29  173k    0     0  55132      0  0:00:11  0:00:03  0:00:08 55149
 48  593k   48  285k    0     0  68540      0  0:00:08  0:00:04  0:00:04 68556
 67  593k   67  397k    0     0  77342      0  0:00:07  0:00:05  0:00:02 81718
 88  593k   88  525k    0     0  85603      0  0:00:07  0:00:06  0:00:01  100k
100  593k  100  593k    0     0  89883      0  0:00:06  0:00:06 --:--:--  111k


---
### 1. Read text from file.
Open the text file `sherlock.txt` and read all the lines into a list.

In [2]:
lines = []  # read lines from sherlock.txt into this list
with open('sherlock.txt', 'r', encoding='utf-8') as f:
  lines = f.readlines()

---
### 2. Filter out the metadata.
The text file contains some metadata about the book which is not relevant for our analysis. Discard this information by removing the first 32 lines from the beginning and the last 368 lines from the end.

In [3]:
lines = lines[32:-368]
print(lines)

 whole house in my head. There was\n', 'one wing, however, which appeared not to be inhabited at all. A door\n', 'which faced that which led into the quarters of the Tollers opened into\n', 'this suite, but it was invariably locked. One day, however, as I\n', 'ascended the stair, I met Mr. Rucastle coming out through this door,\n', 'his keys in his hand, and a look on his face which made him a very\n', 'different person to the round, jovial man to whom I was accustomed. His\n', 'cheeks were red, his brow was all crinkled with anger, and the veins\n', 'stood out at his temples with passion. He locked the door and hurried\n', 'past me without a word or a look.\n', '\n', '“This aroused my curiosity, so when I went out for a walk in the\n', 'grounds with my charge, I strolled round to the side from which I could\n', 'see the windows of this part of the house. There were four of them in a\n', 'row, three of which were simply dirty, while the fourth was shuttered\n', 'up. They were evidently
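The fixed offsets above (32 and 368) only fit this particular download. As a hedged alternative, the sketch below trims the metadata by searching for marker lines instead. It assumes the file contains lines beginning with `*** START OF` and `*** END OF`, which is how Project Gutenberg files are commonly delimited; the helper name `strip_gutenberg_metadata` and the sample lines are hypothetical, and the resulting slice will not necessarily match `lines[32:-368]` exactly.

```python
def strip_gutenberg_metadata(lines):
    """Return only the lines between the START and END marker lines.

    Assumption: the markers begin with '*** START OF' and '*** END OF'.
    If either marker is missing, the input is returned unchanged.
    """
    start = end = None
    for i, line in enumerate(lines):
        if line.startswith('*** START OF'):
            start = i + 1   # first line after the start marker
        elif line.startswith('*** END OF'):
            end = i         # stop just before the end marker
            break
    if start is None or end is None:
        return lines
    return lines[start:end]


# Hypothetical miniature file, used only to illustrate the idea
sample = [
    'Title: Example\n',
    '*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n',
    'First line of the story.\n',
    'Second line of the story.\n',
    '*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n',
    'License text\n',
]
body = strip_gutenberg_metadata(sample)
print(body)
```

Searching for markers is less brittle than counting lines, but it is worth printing the first and last few lines of the result to check what was kept.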

---
### 3. Remove leading and trailing spaces from each line in the list.
Each line ends with a newline character `\n`, and some lines also contain leading and trailing spaces. This formatting exists for presentation purposes and is not relevant to our analysis.

In [4]:
clean_lines = []  # store the lines in this list after removing the leading and trailing spaces
for line in lines:
  clean_lines.append(line.strip())
print(clean_lines)

e peculiar tint, and', 'the same thickness. But then the impossibility of the thing obtruded', 'itself upon me. How could my hair have been locked in the drawer? With', 'trembling hands I undid my trunk, turned out the contents, and drew', 'from the bottom my own hair. I laid the two tresses together, and I', 'assure you that they were identical. Was it not extraordinary? Puzzle', 'as I would, I could make nothing at all of what it meant. I returned', 'the strange hair to the drawer, and I said nothing of the matter to the', 'Rucastles as I felt that I had put myself in the wrong by opening a', 'drawer which they had locked.', '', '“I am naturally observant, as you may have remarked, Mr. Holmes, and I', 'soon had a pretty good plan of the whole house in my head. There was', 'one wing, however, which appeared not to be inhabited at all. A door', 'which faced that which led into the quarters of the Tollers opened into', 'this suite, but it was invariably locked. One day, however, as I', 

---
### 4. Remove empty lines from the list.
After removing the newline character `\n` from each line in the list, some strings are now empty and can be safely discarded.

In [5]:
non_empty_lines = []  # store non empty lines in this list
for line in clean_lines:
  if line != '':
    non_empty_lines.append(line)
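
Steps 3 and 4 can also be combined into a single pass. The following is a minimal sketch on a made-up list (`sample_lines` is hypothetical, not the real corpus); it relies on the fact that an empty string is falsy in Python.

```python
# Hypothetical sample standing in for the lines read from the file
sample_lines = [
    '  The Adventures of Sherlock Holmes \n',
    '\n',
    'by Arthur Conan Doyle\n',
    '   \n',
]

# Strip each line once (lazily), then keep only the non-empty results
stripped = (line.strip() for line in sample_lines)
sample_non_empty = [line for line in stripped if line]
print(sample_non_empty)
# ['The Adventures of Sherlock Holmes', 'by Arthur Conan Doyle']
```

The explicit loops above are easier to follow when learning; the comprehension is simply a more compact way to express the same filtering.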

---
### 5. Join all the non-empty lines into a single string.
Now that we have cleaned the corpus by removing the presentation details, we can focus on the actual text. Create a single string which contains all the lines from the text.



In [8]:
# join all the lines into this string
text = ' '.join(non_empty_lines)
print(text)

the moonshine I saw what it was. It was a giant dog, as large as a calf, tawny tinted, with hanging jowl, black muzzle, and huge projecting bones. It walked slowly across the lawn and vanished into the shadow upon the other side. That dreadful sentinel sent a chill to my heart which I do not think that any burglar could have done. “And now I have a very strange experience to tell you. I had, as you know, cut off my hair in London, and I had placed it in a great coil at the bottom of my trunk. One evening, after the child was in bed, I began to amuse myself by examining the furniture of my room and by rearranging my own little things. There was an old chest of drawers in the room, the two upper ones empty and open, the lower one locked. I had filled the first two with my linen, and as I had still much to pack away I was naturally annoyed at not having the use of the third drawer. It struck me that it might have been fastened by a mere oversight, so I took out my bunch of keys and tried 

---
### 6. Convert to lowercase
To keep the word counts consistent, we are going to convert everything to lowercase. If we don't do this, the words **the**, **The** and **THE** would be considered distinct.

In [11]:
text = text.lower()

---
### 7. Get a list of all the words in the text.

In [12]:
words = text.split(' ')
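
Splitting on a single space is the simplest tokenizer, but it leaves punctuation attached to words and yields empty strings wherever two spaces are adjacent. The sketch below compares it with two alternatives on a made-up sentence. The regular expression is only one assumed definition of a "word" (lowercase ASCII letters, optionally joined by an apostrophe); switching tokenizers would change the counts reported in the following steps.

```python
import re

# Hypothetical lowercase sample with a double space and curly quotes
sample = 'he said,  “i am here.” i waited.'

by_space = sample.split(' ')     # a double space produces an empty string
by_whitespace = sample.split()   # any run of whitespace is one separator
# Assumed word pattern: letters, optionally joined by ' or ’ (e.g. don't)
by_regex = re.findall(r"[a-z]+(?:['’][a-z]+)*", sample)

print(by_space)       # ['he', 'said,', '', '“i', 'am', 'here.”', 'i', 'waited.']
print(by_whitespace)  # ['he', 'said,', '“i', 'am', 'here.”', 'i', 'waited.']
print(by_regex)       # ['he', 'said', 'i', 'am', 'here', 'i', 'waited']
```

With the space-based split, `'“i'` and `'i'` are counted as different words, which is why tokens with a leading quotation mark can show up in a frequency list.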

---
### 8. How many total words are there in the text?

In [13]:
len(words)

97063

---
### 9. How many unique words are there in the text?

In [14]:
len(set(words))

13518

In [39]:
from collections import Counter
Counter(words).most_common(15)

[('the', 5124),
 ('and', 2592),
 ('to', 2474),
 ('of', 2452),
 ('a', 2425),
 ('i', 2341),
 ('in', 1568),
 ('that', 1480),
 ('was', 1245),
 ('he', 1209),
 ('it', 1176),
 ('his', 1088),
 ('you', 1041),
 ('is', 995),
 ('my', 870)]

---
### 10. What are the 25 most frequent words?

In [43]:
from nltk.book import *


*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [45]:
fdist1 = FreqDist(words)  # FreqDist is made available by the star import from nltk.book
fdist1.most_common(50)

[('the', 5124),
 ('and', 2592),
 ('to', 2474),
 ('of', 2452),
 ('a', 2425),
 ('i', 2341),
 ('in', 1568),
 ('that', 1480),
 ('was', 1245),
 ('he', 1209),
 ('it', 1176),
 ('his', 1088),
 ('you', 1041),
 ('is', 995),
 ('my', 870),
 ('have', 841),
 ('had', 767),
 ('as', 767),
 ('with', 754),
 ('which', 694),
 ('at', 685),
 ('for', 658),
 ('be', 548),
 ('not', 537),
 ('but', 496),
 ('we', 452),
 ('from', 446),
 ('upon', 439),
 ('said', 415),
 ('this', 389),
 ('me', 379),
 ('been', 363),
 ('there', 359),
 ('she', 353),
 ('very', 349),
 ('your', 349),
 ('her', 335),
 ('on', 329),
 ('“i', 324),
 ('were', 309),
 ('so', 307),
 ('an', 306),
 ('by', 306),
 ('would', 293),
 ('all', 290),
 ('what', 287),
 ('are', 282),
 ('one', 280),
 ('when', 277),
 ('no', 263)]

In [41]:
word_counts = dict()    # create an empty dictionary for word counts
for word in words:
  if word in word_counts:
    word_counts[word] += 1
  else:
    word_counts[word] = 1

# convert dict to a list of tuples for word counts
word_counts = list(word_counts.items())
sorted_by_word_counts = sorted(word_counts, key=lambda x: x[1], reverse=True)
sorted_by_word_counts[:25]

[('the', 5124),
 ('and', 2592),
 ('to', 2474),
 ('of', 2452),
 ('a', 2425),
 ('i', 2341),
 ('in', 1568),
 ('that', 1480),
 ('was', 1245),
 ('he', 1209),
 ('it', 1176),
 ('his', 1088),
 ('you', 1041),
 ('is', 995),
 ('my', 870),
 ('have', 841),
 ('had', 767),
 ('as', 767),
 ('with', 754),
 ('which', 694),
 ('at', 685),
 ('for', 658),
 ('be', 548),
 ('not', 537),
 ('but', 496)]

#### Alternate Solutions:

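1. Since Python 3.7, dictionaries are guaranteed to preserve insertion order (CPython 3.6 already did so as an implementation detail), so the sorted items can be stored back in a `dict` instead of being kept as a list of tuples.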
2. Look up the `Counter` container in the `collections` module in the [Python docs](https://docs.python.org/3/library/collections.html#collections.Counter).
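
Both alternate solutions can be sketched on a small made-up sentence (`sample_words` is hypothetical, not the real corpus). This is an illustration under those assumptions rather than a reference answer.

```python
from collections import Counter

# Hypothetical sample standing in for the full list of words
sample_words = 'the cat and the dog and the bird'.split()

# Alternate solution 2: Counter tallies the words and ranks them by count
top_two = Counter(sample_words).most_common(2)

# Alternate solution 1: count with a plain dict, sort its items by count,
# and rebuild a dict, which keeps the sorted (insertion) order
counts = {}
for word in sample_words:
    counts[word] = counts.get(word, 0) + 1
sorted_counts = dict(
    sorted(counts.items(), key=lambda item: item[1], reverse=True)
)

print(top_two)                          # [('the', 3), ('and', 2)]
print(list(sorted_counts.items())[:2])  # [('the', 3), ('and', 2)]
```

`Counter.most_common(n)` is usually the shortest route; the dictionary version shows what happens underneath.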