### Count 'em words:
- The objective of this notebook is to count word frequencies across eight texts by Jane Austen.

In [1]:
# Imports:
import os
import re
from nltk.corpus import stopwords
from collections import defaultdict

In [2]:
path = './austen/'
fnames = sorted(os.listdir(path)) # os.listdir returns entries in arbitrary order, so sort for a reproducible listing
print(fnames)

['1790 Love And Freindship.txt', '1805 Lady Susan.txt', '1811 Sense and Sensibility.txt', '1813 Pride and Prejudice.txt', '1814 Mansfield Park.txt', '1815 Emma.txt', '1818 Northanger Abbey.txt', '1818 Persuasion.txt']


In [3]:
word_count = defaultdict(int) # Initialise a dictionary where the default value for a new key is 0.

stop = set(stopwords.words('english')) # English stopwords from NLTK (a set makes the membership test below fast)

for fname in fnames:
    with open(os.path.join(path, fname), 'r') as f:
        content = f.read().splitlines() # Read and split the contents of 'fname' into lines
        
    for line in content:
        words = line.split() # Split the line into words (based on whitespace only)
        for word in words:
            word = word.lower() # Lowercase
            word = re.sub(r'[^a-z0-9]', '', word) # Strip everything except letters and digits (underscores included)
            if word and word not in stop: # Skip stopwords and tokens left empty after stripping (e.g. '--')
                word_count[word] += 1 # Increment the word count as you see the word (Note: default value is 0)
            
dict(list(word_count.items())[:5]) # print preview

{'love': 554, 'freindship': 18, 'early': 198, 'works': 15, 'friendship': 113}

In [4]:
# Sort a dictionary by value (counts): https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value
sorted_word_counts = sorted(word_count.items(), key=lambda kv: kv[1], reverse=True)
sorted_word_counts[:10] # preview 10 most frequent words

[('could', 3743),
 ('would', 3387),
 ('mr', 3071),
 ('mrs', 2508),
 ('must', 2238),
 ('said', 2140),
 ('much', 2024),
 ('one', 1958),
 ('miss', 1926),
 ('every', 1537)]
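
As an aside, the same ranking can be obtained without writing the sort by hand. A minimal sketch using `collections.Counter` from the standard library follows; `toy_counts` is a made-up dictionary for illustration, not data from the corpus.

```python
from collections import Counter

# Hypothetical toy counts, standing in for the notebook's word_count dictionary.
toy_counts = {'could': 5, 'would': 3, 'mr': 4, 'emma': 1}

# Counter accepts any mapping of item -> count;
# most_common(n) returns the n highest counts in descending order.
top_two = Counter(toy_counts).most_common(2)
print(top_two)
```

In the notebook itself, `Counter(word_count).most_common(10)` should give the same kind of preview as the sorted list above.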

### Reflections:
To make sure the word counts make sense (and match Voyant's), I had to take care of three things:
- Remove non-alphanumeric characters, because .split() only splits on whitespace and leaves punctuation attached to the words. For example, it produced '(love' as a word, which I cleaned up with a regular expression. Note that I did not use \W, because \W does not match the underscore and I wanted to remove underscores too.
- Make each word lowercase, because 'LOVE' and 'love' are the same word.
- Remove stopwords such as 'the', 'a', 'an', etc. (imported from NLTK).
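
The three steps above can be seen on a single line of text. This is only an illustrative sketch: the sample line and the tiny `demo_stop` set are made up here (the notebook uses NLTK's full English list), and the `if token` guard additionally drops tokens that become empty once punctuation is stripped.

```python
import re

# Hypothetical sample line, not taken from the corpus.
line = "(LOVE and _Freindship_) -- a novel, by Jane Austen."

# Tiny illustrative stopword set; the notebook uses NLTK's English stopwords instead.
demo_stop = {"and", "a", "by"}

# Step 0: whitespace splitting leaves punctuation attached, e.g. '(LOVE'.
raw_tokens = line.split()

cleaned = []
for token in raw_tokens:
    token = token.lower()                    # 'LOVE' and 'love' become the same word
    token = re.sub(r'[^a-z0-9]', '', token)  # drop punctuation and underscores
    if token and token not in demo_stop:     # skip empty strings and stopwords
        cleaned.append(token)

print(raw_tokens)
print(cleaned)
```

Printing both lists side by side shows what each step contributes: the raw tokens still carry brackets, underscores, and trailing punctuation, while the cleaned list contains only lowercase content words.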

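Conclusion: The output counts are still off by a small margin and do not exactly match Voyant's. One likely contributor is that splitting on whitespace and then stripping punctuation fuses words joined by dashes (for example, 'so--said' becomes 'sosaid'). Using nltk's word_tokenize or a more careful regular expression would probably narrow the gap, although some difference may remain if Voyant tokenizes or filters stopwords differently.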
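
One way the "more careful regular expression" could look is sketched below. The `tokenize` helper and its pattern are assumptions made for illustration: they are one possible choice and are not claimed to reproduce Voyant's tokenization. The idea is to extract runs of letters and digits directly, instead of splitting on whitespace first, so that words joined by dashes are kept apart and internal apostrophes survive.

```python
import re

# Hypothetical pattern: runs of letters/digits, optionally joined by apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")

def tokenize(text):
    """Return lowercase word tokens, keeping internal apostrophes."""
    return TOKEN_PATTERN.findall(text.lower())

# Hypothetical sample sentence, not taken from the corpus.
sample = "Don't you think so?--said Mr. Darcy's friend."

tokens = tokenize(sample)
print(tokens)

# For comparison: split on whitespace, then strip non-alphanumeric characters.
fused = [re.sub(r'[^a-z0-9]', '', w.lower()) for w in sample.split()]
print(fused)
```

On this sample, the split-then-strip version yields 'sosaid' and 'dont', whereas the pattern-based version keeps 'so' and 'said' separate and preserves "don't". Whether this brings the totals closer to Voyant's would still need to be checked against the actual texts.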