# (3E-2) Word Counts

In this notebook, we'll learn:

* How to loop over lists
* How to count words (list of words -> word counts as dictionary)

We'll apply what we learned in the previous two notebooks about lists and dictionaries. Specifically, we're going to learn how to transform lists of words into dictionaries of word counts.

## Looping over lists

How do we count the words in a list? We need to be able to **"loop"** over each one individually. To do that, we use a for loop. Here's the syntax:

```python
for thing in list_of_things:
    print(thing)   # or do something else with thing
```

You can read this in human terms as: **For each** thing in this list of things, print the thing.

Let's try it out.

In [1]:
# First, here's a list of words to get us started
marx = ['all', 'that', 'is', 'solid', 'melts', 'into', 'air', ',', 
        'all', 'that', 'is', 'holy', 'is', 'profaned', ',', 
        'and', 'man', 'is', 'at', 'last', 'compelled', 'to', 'face', 'with', 'sober', 'senses',
        'his', 'real', 'conditions', 'of', 'life', ',', 
        'and', 'his', 'relations', 'with', 'his', 'kind', '.']

print(marx)

['all', 'that', 'is', 'solid', 'melts', 'into', 'air', ',', 'all', 'that', 'is', 'holy', 'is', 'profaned', ',', 'and', 'man', 'is', 'at', 'last', 'compelled', 'to', 'face', 'with', 'sober', 'senses', 'his', 'real', 'conditions', 'of', 'life', ',', 'and', 'his', 'relations', 'with', 'his', 'kind', '.']


In [2]:
# Ok, now let's loop!
for word in marx:
    print(word)

all
that
is
solid
melts
into
air
,
all
that
is
holy
is
profaned
,
and
man
is
at
last
compelled
to
face
with
sober
senses
his
real
conditions
of
life
,
and
his
relations
with
his
kind
.


In [4]:
# @TODO: Loop over each word in the Marx passage, but print the word only if it is longer than two letters
#

for word in marx:
    if len(word) > 2:
        print(word)


all
that
solid
melts
into
air
all
that
holy
profaned
and
man
last
compelled
face
with
sober
senses
his
real
conditions
life
and
his
relations
with
his
kind


In [14]:
# @TODO: Loop over each word in the Marx passage, but print the word only if the first letter is alphabetic
# hint: use the .isalpha() method of strings, which returns True if the string is alphabetic (otherwise False)
#

#word[0].isalpha() # is the first letter alphabetic? 

marx_nopunct = []

for word in marx:
    if word[0].isalpha():
        #print(word)
        marx_nopunct.append(word)

print(marx_nopunct)

['all', 'that', 'is', 'solid', 'melts', 'into', 'air', 'all', 'that', 'is', 'holy', 'is', 'profaned', 'and', 'man', 'is', 'at', 'last', 'compelled', 'to', 'face', 'with', 'sober', 'senses', 'his', 'real', 'conditions', 'of', 'life', 'and', 'his', 'relations', 'with', 'his', 'kind']
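
To see what `.isalpha()` does on its own, here is a small sketch. The sample strings are made up for illustration. The method checks every character in the string, which is why the cell above tests only `word[0]`.

```python
# .isalpha() is True only if the string is non-empty and every character is a letter
samples = ['all', ',', "don't", 'ch01', '']

for s in samples:
    print(repr(s), s.isalpha())

# checking just the first character keeps words like "don't"
first_letter_ok = "don't"[0].isalpha()
print(first_letter_ok)
```

One caveat: `word[0]` raises an `IndexError` on an empty string, so this approach assumes the list contains no empty tokens.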


### Advanced loops

#### How do we remember how far into the list we've looped?

Use the `enumerate()` wrapper around any list, and then iterate like this:

In [15]:
for index,word in enumerate(marx):   # 'index' is the index in the list marx at which 'word' is located
    print(index, word, marx[index])    

0 all all
1 that that
2 is is
3 solid solid
4 melts melts
5 into into
6 air air
7 , ,
8 all all
9 that that
10 is is
11 holy holy
12 is is
13 profaned profaned
14 , ,
15 and and
16 man man
17 is is
18 at at
19 last last
20 compelled compelled
21 to to
22 face face
23 with with
24 sober sober
25 senses senses
26 his his
27 real real
28 conditions conditions
29 of of
30 life life
31 , ,
32 and and
33 his his
34 relations relations
35 with with
36 his his
37 kind kind
38 . .
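
As a small aside, `enumerate()` also accepts a `start` argument if you'd rather count from 1. The sketch below uses a shortened sample list rather than the full passage.

```python
words = ['all', 'that', 'is', 'solid']   # a short sample list

# enumerate() counts from 0 by default; start=1 gives human-style numbering
for position, word in enumerate(words, start=1):
    print(position, word)

# enumerate() yields (index, item) pairs, which list() makes visible
pairs = list(enumerate(words))
print(pairs)
```

Note that with `start=1` the number no longer matches the list index, so `words[position]` would point at the *next* word.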


#### Going backwards

Just wrap the list in `reversed()`.

In [16]:
for word in reversed(marx):
    print(word,end=' ')     # print without adding a newline afterward; instead add a ' '

. kind his with relations his and , life of conditions real his senses sober with face to compelled last at is man and , profaned is holy is that all , air into melts solid is that all 
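
It's worth noting that `reversed()` does not change the list itself; it only hands back the items in reverse order. A minimal sketch, again on a shortened sample list:

```python
words = ['all', 'that', 'is', 'solid']   # a short sample list

backwards = list(reversed(words))   # reversed() gives an iterator; list() collects it

print(backwards)   # the words in reverse order
print(words)       # the original list is unchanged
```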

#### How to stop a loop short

Use `break`.

In [17]:
for index,word in enumerate(marx):
    if index>=10:
        break      # stop the loop!
    
    print(index,word)

0 all
1 that
2 is
3 solid
4 melts
5 into
6 air
7 ,
8 all
9 that


#### How to skip to the next iteration of the loop

Use `continue`.

In [18]:
for word in marx:
    if len(word)<3:
        continue          # skip right to the next iteration in the loop, don't even keep reading below
    
    print(word,end=' ')   # this won't run if we've already hit continue

all that solid melts into air all that holy profaned and man last compelled face with sober senses his real conditions life and his relations with his kind 
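
`break` and `continue` can also work together in one loop. The sketch below (on the first ten tokens of the passage) collects the longer words of the first clause: it skips short words with `continue` and stops entirely at the first comma with `break`. The name `first_clause` is just for illustration.

```python
tokens = ['all', 'that', 'is', 'solid', 'melts', 'into', 'air', ',', 'all', 'that']

first_clause = []
for word in tokens:
    if word == ',':
        break            # stop the whole loop at the first comma
    if len(word) < 3:
        continue         # skip short words, but keep looping
    first_clause.append(word)

print(first_clause)
```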

## Counting words in a list with a loop

How do we count the words in a list? Let's:

1. Create an empty dictionary of word counts
2. Loop over each word in the text
3. For each word, add 1 to its entry in the dictionary of word counts


In [41]:
# 1. Create an empty dictionary of word counts
wordcounts={}

# 2. Loop over each word in the text
for word in marx:
    print('word in loop is now',word)
    
    # 3. For each word, add 1 to its entry in the dictionary of word counts
    if word not in wordcounts:
        print('word not in wordcounts, setting to 1')
        wordcounts[word]  = 1
    else:
        print('word in wordcounts already, adding 1 to it')
        wordcounts[word] += 1

        
wordcounts

word in loop is now all
word not in wordcounts, setting to 1
word in loop is now that
word not in wordcounts, setting to 1
word in loop is now is
word not in wordcounts, setting to 1
word in loop is now solid
word not in wordcounts, setting to 1
word in loop is now melts
word not in wordcounts, setting to 1
word in loop is now into
word not in wordcounts, setting to 1
word in loop is now air
word not in wordcounts, setting to 1
word in loop is now ,
word not in wordcounts, setting to 1
word in loop is now all
word in wordcounts already, adding 1 to it
word in loop is now that
word in wordcounts already, adding 1 to it
word in loop is now is
word in wordcounts already, adding 1 to it
word in loop is now holy
word not in wordcounts, setting to 1
word in loop is now is
word in wordcounts already, adding 1 to it
word in loop is now profaned
word not in wordcounts, setting to 1
word in loop is now ,
word in wordcounts already, adding 1 to it
word in loop is now and
word not in wordcounts, setting to 1
...

{'all': 2,
 'that': 2,
 'is': 4,
 'solid': 1,
 'melts': 1,
 'into': 1,
 'air': 1,
 ',': 3,
 'holy': 1,
 'profaned': 1,
 'and': 2,
 'man': 1,
 'at': 1,
 'last': 1,
 'compelled': 1,
 'to': 1,
 'face': 1,
 'with': 2,
 'sober': 1,
 'senses': 1,
 'his': 3,
 'real': 1,
 'conditions': 1,
 'of': 1,
 'life': 1,
 'relations': 1,
 'kind': 1,
 '.': 1}
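
The `if`/`else` in step 3 can also be written with the dictionary's `.get()` method, which returns a default value when the key is missing. This is one common alternative, not a replacement for the version above. A minimal sketch on a shortened word list:

```python
words = ['all', 'that', 'is', 'solid', 'all', 'that', 'is', 'holy', 'is']

wordcounts = {}
for word in words:
    # .get(word, 0) returns the current count, or 0 if the word is new
    wordcounts[word] = wordcounts.get(word, 0) + 1

print(wordcounts)
```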

In [31]:
wordcounts['Ryan'] = 1

In [40]:
wordcounts['Ryan']

7

In [39]:
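wordcounts['Ryan'] += 1   # adds 1 on every run; these cells were re-run out of order (see the In [..] numbers), hence the 6 and 7 shown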

In [38]:
wordcounts

{'all': 2,
 'that': 2,
 'is': 4,
 'solid': 1,
 'melts': 1,
 'into': 1,
 'air': 1,
 ',': 3,
 'holy': 1,
 'profaned': 1,
 'and': 2,
 'man': 1,
 'at': 1,
 'last': 1,
 'compelled': 1,
 'to': 1,
 'face': 1,
 'with': 2,
 'sober': 1,
 'senses': 1,
 'his': 3,
 'real': 1,
 'conditions': 1,
 'of': 1,
 'life': 1,
 'relations': 1,
 'kind': 1,
 '.': 1,
 'Ryan': 6}

In [46]:
# @TODO: Write a function to produce a dictionary of counts from any list
def count(tokens):
    # create an empty dictionary
    wordcounts = {}
    
    # for each word in the list of tokens...
    for word in tokens:
        
        # if the word is in the dictionary of counts
        if word in wordcounts:
            
            # add 1 to it
            wordcounts[word] += 1

        # otherwise
        else:
            
            # set it to 1
            wordcounts[word] = 1
            
    # return the dictionary of counts
    return wordcounts
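
One way to sanity-check a function like this is to compare it against `collections.Counter` from Python's standard library, which does the same kind of counting. In the sketch below, `count()` is restated so the cell runs on its own, and the word list is a shortened sample.

```python
from collections import Counter

def count(tokens):
    """Same logic as count() above, restated so this cell runs on its own."""
    wordcounts = {}
    for word in tokens:
        if word in wordcounts:
            wordcounts[word] += 1
        else:
            wordcounts[word] = 1
    return wordcounts

words = ['all', 'that', 'is', 'solid', 'all', 'that', 'is', 'holy', 'is']

ours = count(words)
builtin = Counter(words)   # Counter is a dictionary subclass built for counting

agree = (ours == dict(builtin))   # do the two approaches agree?
print(agree)
print(builtin.most_common(1))     # the single most frequent word and its count
```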

In [None]:
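import nltk

# read the first chapter into a single string
with open('../corpora/tropic_of_orange/texts/ch01.txt') as file:
    txt = file.read()

# lower-case the text, then split it into word and punctuation tokens
tokens = nltk.word_tokenize(txt.lower())

count(tokens)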

In [55]:
# @TODO: Write a function to produce a dictionary of counts from any list
# WITH RELATIVE COUNTS


def tf(tokens):
    # create an empty dictionary
    wordcounts = {}
    
    # get the number of words by the length of the incoming list of words
    num_words = len(tokens)
    
    # for each word in this list of words...
    for word in tokens:
        
        # if the word is already in the wordcounts dictionary...
        if word in wordcounts:
            # add 1 to it
            wordcounts[word] += 1

        # otherwise...
        else:
            # initialize it to 1
            wordcounts[word] = 1
            
    # for each word in the dictionary wordcounts
    for word in wordcounts:
        
        # divide the word's count by the total number of words
        wordcounts[word] = wordcounts[word] / num_words
    
    # return the dictionary of word counts
    return wordcounts

In [None]:
tf(tokens)
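
A quick way to check relative counts: because each count is divided by the total number of tokens, the values should add up to 1 (give or take floating-point rounding). The sketch below uses a condensed version of `tf()` and a shortened sample list so that it runs on its own.

```python
import math

def tf(tokens):
    """Relative counts, condensed from tf() above so this cell runs on its own."""
    wordcounts = {}
    for word in tokens:
        wordcounts[word] = wordcounts.get(word, 0) + 1
    for word in wordcounts:
        wordcounts[word] = wordcounts[word] / len(tokens)
    return wordcounts

words = ['all', 'that', 'is', 'solid', 'all', 'that', 'is', 'holy', 'is']
freqs = tf(words)

print(freqs['is'])             # 3 occurrences out of 9 tokens
total = sum(freqs.values())
print(math.isclose(total, 1.0))
```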

## Text to Count Pipeline

First, let's recapitulate some steps we already know:

### 1. Open a text

In [61]:
# @TODO: Write this function
#

def file2string(filename):
    """
    This function takes a filename,
    opens the file,
    and returns a string corresponding to the file's contents.
    """
    
    with open(filename) as file:
        string_contents = file.read()
    
    string_contents = string_contents.strip()
    
    return string_contents
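
To try the function without depending on a particular corpus file, we can write a small throwaway file first. This is only a sketch: the file contents are made up, and `file2string()` is restated so the cell runs on its own.

```python
import os
import tempfile

def file2string(filename):
    """Same as file2string() above, restated so this cell runs on its own."""
    with open(filename) as file:
        string_contents = file.read()
    return string_contents.strip()

# write a small throwaway file (made-up contents) to test with
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('\n  All that is solid melts into air.  \n\n')
    tmp_path = tmp.name

result = file2string(tmp_path)
os.remove(tmp_path)   # clean up the throwaway file

print(repr(result))   # leading and trailing whitespace is gone
```

If one of your own texts raises a `UnicodeDecodeError`, passing `encoding='utf-8'` to `open()` is a common thing to try, assuming the file is actually saved as UTF-8.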

In [63]:
# @TODO: Use your function on either one of your texts or one of Yamashita's
#

cool_txt = file2string('../corpora/my_corpus/texts/ryans_diss_chap3.txt')

#cool_txt

### 2. Tokenize the text

In [64]:
# @TODO: Write this function
#

def tokenize(string):
    """
    This function takes in a string,
    lower-cases it,
    and returns a list of words using NLTK's tokenizer.
    """

    string_lower = string.lower()
    
    import nltk
    return nltk.word_tokenize(string_lower)

In [65]:
# @TODO: Use your function to tokenize the text you opened above
#

cool_tokens = tokenize(cool_txt)

cool_tokens[:10]

['ryan',
 'heuser',
 'abstraction',
 ':',
 'a',
 'literary',
 'history',
 'chapter',
 '3',
 ':']
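
To get a feel for what a tokenizer does, here is a rough stand-in built on Python's `re` module. It is **not** NLTK's algorithm, just an approximation that splits words from punctuation. The name `rough_tokenize` and the sample string (modeled on the output above) are assumptions for illustration.

```python
import re

def rough_tokenize(string):
    """Hypothetical stand-in for tokenize(): NOT NLTK's algorithm, only a rough approximation."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", string.lower())

sample = 'Ryan Heuser Abstraction: A Literary History Chapter 3:'
rough_tokens = rough_tokenize(sample)
print(rough_tokens)
```

The two approaches diverge on harder cases such as contractions, which this pattern breaks apart at the apostrophe, so the NLTK-based `tokenize()` remains the one to use in the pipeline.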

### 3. [New] Count the tokens!

In [68]:
# @TODO: Use the function developed above to count the tokens in your text
#

cool_counts = count(cool_tokens)

#cool_counts

### 4. Repeat steps 1-3 for another text

In [73]:
# @TODO: Repeat steps 1-3 for another text
#

rafaela_str = file2string('../corpora/tropic_of_orange/texts/ch01.txt')
rafaela_tokens = tokenize(rafaela_str)
rafaela_counts = count(rafaela_tokens)

bobby_str = file2string('../corpora/tropic_of_orange/texts/ch02.txt')
bobby_tokens = tokenize(bobby_str)
bobby_counts = count(bobby_tokens)

rafaela_counts['he'], bobby_counts['he']

(41, 29)

### 5. Compare word counts

In [None]:
# @TODO: Loop over a list of words that are interesting to you,
# and print their relative counts in both your texts
#

interesting_words = ['orange','tropic','tree','border','watch','rain']
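
One possible way to finish this step is sketched below. The two token lists are toy data standing in for your tokenized texts, `tf()` is condensed from the version above, and the names `text_a`, `text_b`, `tf_a`, and `tf_b` are made up for illustration. Looking words up with `.get(word, 0)` matters here: square brackets would raise a `KeyError` for any word that never appears in a text.

```python
def tf(tokens):
    """Relative counts, condensed from tf() above so this cell runs on its own."""
    wordcounts = {}
    for word in tokens:
        wordcounts[word] = wordcounts.get(word, 0) + 1
    for word in wordcounts:
        wordcounts[word] = wordcounts[word] / len(tokens)
    return wordcounts

# toy token lists standing in for two tokenized chapters (made up for illustration)
text_a = ['the', 'orange', 'tree', 'and', 'the', 'rain', 'on', 'the', 'border', '.']
text_b = ['watch', 'the', 'border', ',', 'watch', 'the', 'orange', '.']

tf_a = tf(text_a)
tf_b = tf(text_b)

interesting_words = ['orange', 'tropic', 'tree', 'border', 'watch', 'rain']

for word in interesting_words:
    # .get(word, 0) returns 0 when a word never appears in a text
    print(word, tf_a.get(word, 0), tf_b.get(word, 0))
```

Relative counts are the better basis for comparison than raw counts, since the two texts will usually differ in length.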