# (3E-2) Word Counts

In this notebook, we'll learn:

* How to loop over lists
* How to count words (list of words -> word counts as dictionary)

We'll apply what we learned in the previous two notebooks about lists and dictionaries. Specifically, we're going to learn how to transform lists of words into dictionaries of word counts.

## Looping over lists

How do we count the words in a list? First we need to visit each word one at a time, that is, to **"loop"** over the list. To do that, we use a `for` loop. Here's the syntax:

```python
for thing in list_of_things:
    print(thing)   # or do something else with thing
```

You can read this in human terms as: **For each** thing in this list of things, print the thing.

Let's try it out.

In [None]:
# First, here's a list of words to get us started
marx = ['all', 'that', 'is', 'solid', 'melts', 'into', 'air', ',', 
        'all', 'that', 'is', 'holy', 'is', 'profaned', ',', 
        'and', 'man', 'is', 'at', 'last', 'compelled', 'to', 'face', 'with', 'sober', 'senses',
        'his', 'real', 'conditions', 'of', 'life', ',', 
        'and', 'his', 'relations', 'with', 'his', 'kind', '.']

print(marx)

In [None]:
# Ok, now let's loop!

for word in marx:
    print(word)
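
A loop body can also contain an `if` statement, so that something happens only for some of the items. Here is a minimal sketch, using a short stand-in list called `words` and a helper list called `a_words` (both names are made up for this illustration, not part of the notebook):

```python
# A short stand-in list (hypothetical, just for this illustration)
words = ['all', 'that', 'is', 'solid', 'melts', 'into', 'air']

a_words = []   # collect the matches so we can inspect them afterwards
for word in words:
    if word.startswith('a'):   # True only for words beginning with 'a'
        print(word)
        a_words.append(word)
```

The indented lines under `if` run only when the condition is true; for every other word the loop simply moves on to the next item.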

In [None]:
# @TODO: Loop over each word in the Marx passage, but print the word only if it is longer than two letters
#



In [None]:
# @TODO: Loop over each word in the Marx passage, but print the word only if the first letter is alphabetic
# hint: use the .isalpha() method of strings, which returns True if the string is alphabetic (otherwise False)
#



### Advanced loops

#### How do we remember how far into the list we've looped?

Use the `enumerate()` wrapper around any list, and then iterate like this:

In [None]:
for index,word in enumerate(marx):   # 'index' is the index in the list marx at which 'word' is located
    print(index, word, marx[index])    
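
As a side note, `enumerate()` also accepts an optional `start` argument, which is handy when you want human-style numbering that begins at 1. A small sketch, again with a made-up list `words` and a helper list `numbered` (both assumptions for illustration):

```python
# A short stand-in list (hypothetical, just for this illustration)
words = ['all', 'that', 'is', 'solid']

numbered = []   # keep the (position, word) pairs so we can inspect them
for position, word in enumerate(words, start=1):   # count from 1 instead of 0
    print(position, word)
    numbered.append((position, word))
```

Keep in mind that with `start=1` the number no longer matches the word's index in the list, so `words[position]` would point one item too far.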

#### Going backwards

Just wrap the list in `reversed()`.

In [None]:
for word in reversed(marx):
    print(word,end=' ')     # print without adding a newline afterward; instead add a ' '

#### How to stop a loop short

Use `break`.

In [None]:
for index,word in enumerate(marx):
    if index>=10:
        break      # stop the loop!
    
    print(index,word)

#### How to skip to the next iteration of the loop

Use `continue`.

In [None]:
for word in marx:
    if len(word)<3:
        continue          # skip right to the next iteration in the loop, don't even keep reading below
    
    print(word,end=' ')   # this won't run if we've already hit continue

## Counting words in a list with a loop

How do we count the words in a list? Let's:

1. Create an empty dictionary of word counts
2. Loop over each word in the text
3. For each word, add 1 to its entry in the dictionary of word counts


In [None]:
# 1. Create an empty dictionary of word counts
wordcounts={}

# 2. Loop over each word in the text
for word in marx:
    
    # 3. For each word, add 1 to its entry in the dictionary of word counts
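    # (a new word has no entry yet, so start it at 0 to avoid a KeyError)
    if word not in wordcounts:
        wordcounts[word] = 0
    wordcounts[word] += 1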


In [None]:
wordcounts
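
For comparison, Python's standard library includes `collections.Counter`, which does this kind of tallying in a single call. The sketch below is an aside, not a replacement for understanding the loop; it uses a short made-up list `sample` rather than the full passage:

```python
from collections import Counter

# A short stand-in list (hypothetical; any list of words would work)
sample = ['all', 'that', 'is', 'solid', 'all', 'that', 'is', 'holy', 'is']

# Counter tallies how many times each distinct item appears
sample_counts = Counter(sample)

print(sample_counts['is'])            # the count for one word
print(sample_counts.most_common(1))   # the most frequent word with its count
```

A `Counter` behaves much like a dictionary of counts, so you can look up a word with square brackets just as with `wordcounts`.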

In [None]:
# @TODO: Write a function to produce a dictionary of counts from any list
def count(tokens):
    pass  # replace this line and write the function

## Text to Count Pipeline

First, let's recapitulate some steps we already know:

### 1. Open a text

In [None]:
with open('../corpora/tropic_of_orange/texts/ch05.txt') as file:
    manzanar_txt = file.read()

In [None]:
# Print the first 1000 characters
print(manzanar_txt[:1000])

### 2. Tokenize the text

In [None]:
# let's use NLTK's tokenizer
import nltk

In [None]:
# Let's tokenize a lowercase version of manzanar
manzanar_txt_lowercase = manzanar_txt.lower()

In [None]:
# Let's pass the lower case version to NLTK's tokenizer
manzanar_tokens = nltk.word_tokenize(manzanar_txt_lowercase)

In [None]:
# Print the first 100 tokens
print(manzanar_tokens[:100])

### 3. [New] Count the tokens!

In [None]:
# @TODO: Count the tokens in manzanar
#


### 4. Normalizing counts