For citation information, please see the "Source Information" section listed in the associated README file: https://github.com/stephbuon/digital-history/tree/master/hist3368-week2-critical-word-count

# Week 2 Assignment: For Loops Tutorial & Critical Word Count

In this week's assignment, you'll learn how to loop over lists of data.  You'll also start the process of thinking critically about which words matter to you for the purposes of text mining, and how to use a thesaurus and the powers of reason to expand your expert vocabulary and divide it into categories of information. 

# For Loops Tutorial: HOW TO WRITE FOR LOOPS IN PYTHON

## Iterating over lists with for

[based on Lauren Klein's Lists and Loops https://github.com/laurenfklein/emory-qtm340/tree/0c3d0935ecd0a7920e331a8efd78240c49997606/notebooks]

The list comprehension syntax discussed earlier is very powerful: it allows you to succinctly transform one list into another list by thinking in terms of filtering and modification. But sometimes your primary goal isn't to make a new list, but simply to perform a set of operations on an existing list.

Let's say that you want to print every string in a list. Here's a short text:

In [1]:
text = "it was the best of times, it was the worst of times"

We can make a list of all the words in the text by splitting on whitespace:

In [2]:
words = text.split()

Of course, we can see what's in the list simply by evaluating the variable:

In [3]:
words

['it',
 'was',
 'the',
 'best',
 'of',
 'times,',
 'it',
 'was',
 'the',
 'worst',
 'of',
 'times']

But let's say that we want to print out each word on a separate line, without any of Python's weird punctuation. In other words, I want the output to look like:


    it
    was
    the
    best
    of
    times,
    it
    was
    the
    worst
    of
    times

But how can this be accomplished? We know that the print() function can display an individual string in this manner:

In [4]:
print("hello")

hello


So what we need, clearly, is a way to call the print() function with every item of the list. We could do this by writing a series of print() statements, one for every item in the list:

In [5]:
print(words[0])
print(words[1])
print(words[2])
print(words[3])
print(words[4])
print(words[5])
print(words[6])
print(words[7])
print(words[8])
print(words[9])
print(words[10])
print(words[11])

it
was
the
best
of
times,
it
was
the
worst
of
times



Nice, but there are some problems with this approach:

- It's kind of verbose---we're doing exactly the same thing multiple times, only with slightly different expressions. Surely there's an easier way to tell the computer to do this?
- It doesn't scale. What if we wrote a program that needs to produce hundreds or thousands of lines? Would we really need to write a print statement for each of those expressions?
- It requires us to know how many items are going to end up in the list to begin with.

Things are looking grim! But there's hope. Performing the same operation on all items of a list is an extremely common task in computer programming, so common that Python has some built-in syntax to make the task easy: the for loop.

Here's how a for loop looks:

for tempvar in sourcelist:
    statements

The words for and in just have to be there---that's how Python knows it's a for loop. Here's what each of those parts means.

- tempvar: A name for a variable. Inside of the for loop, this variable will contain the current item of the list.
- sourcelist: This can be any Python expression that evaluates to a list---a variable that contains a list, or a list slice, or even a list literal that you just type right in!
- statements: One or more Python statements. Everything tabbed over underneath the for will be executed once for each item in the list. The statements tabbed over underneath the for line are called the body of the loop.
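
To make these three parts concrete, here is a minimal sketch that uses a list variable, a list slice, and a list literal as the source list. The names `colors`, `color`, and `word` are made up for illustration and are not used elsewhere in this tutorial.

```python
colors = ["red", "green", "blue", "violet"]

# sourcelist is a variable that contains a list
for color in colors:
    print(color)

# sourcelist is a list slice: only the first two items are visited
for color in colors[:2]:
    print(color)

# sourcelist is a list literal typed directly into the for line
for word in ["it", "was", "the"]:
    print(word)
```

In each case the temporary variable holds one item at a time, and the body runs once per item.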

Here's what the for loop for printing out every item in a list might look like:

In [6]:
for item in words:
    print(item)

it
was
the
best
of
times,
it
was
the
worst
of
times


The variable name item is arbitrary. You can pick whatever variable name you like, as long as you're consistent about using the same variable name in the body of the loop. If you wrote out this loop in a long-hand fashion, it might look like this:


    item = words[0]
    print(item)
    item = words[1]
    print(item)
    item = words[2]
    print(item)
    item = words[3]
    print(item)
    # etc.


    
    it
    was
    the
    best
    
Of course, the body of the loop can have more than one statement, and you can assign values to variables inside the loop:


In [7]:
for item in words:
    yelling = item.upper()
    print(yelling)

IT
WAS
THE
BEST
OF
TIMES,
IT
WAS
THE
WORST
OF
TIMES
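
Variables assigned inside the loop can also build on values from earlier passes. As a small sketch (the name `total_characters` is illustrative, not part of the original tutorial), this loop keeps a running total of the characters in every word, including the comma attached to "times,":

```python
text = "it was the best of times, it was the worst of times"
words = text.split()

total_characters = 0  # start the running total before the loop begins
for item in words:
    # add the length of the current word to the total so far
    total_characters = total_characters + len(item)

print(total_characters)  # prints 40
```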


You can also include other kinds of nested statements inside the for loop, like if/else:

In [8]:

for item in words:
    if len(item) == 2:
        print(item.upper())
    elif len(item) == 3:
        print("   " + item)
    else:
        print(item)

IT
   was
   the
best
OF
times,
IT
   was
   the
worst
OF
times


This structure is called a "loop" because when Python reaches the end of the statements in the body, it "loops" back to the beginning of the body, and executes the same statements again (this time with the next item in the list).
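
One way to watch the looping happen is to number each pass. The sketch below uses Python's built-in enumerate() function, which this tutorial does not otherwise cover; the names `pass_number` and `trace` are illustrative.

```python
words = "it was the best of times".split()

trace = []
# enumerate() hands back a counter alongside each item; start=1 begins counting at 1
for pass_number, item in enumerate(words, start=1):
    line = "pass " + str(pass_number) + ": " + item
    trace.append(line)
    print(line)
```

The body runs six times here, once for each word, and `item` holds a different word on each pass.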


Python programmers tend to use for loops most often when the problem would otherwise be too tricky or complicated to solve using a list comprehension. It's easy to paraphrase any list comprehension in for loop syntax. For example, this list comprehension, which evaluates to a list of the squares of even integers from 1 to 25:


In [9]:
[x * x for x in range(1, 26) if x % 2 == 0]


[4, 16, 36, 64, 100, 144, 196, 256, 324, 400, 484, 576]

You can rewrite this list comprehension as a for loop by starting out with an empty list, then appending an item to the list inside the loop. The source list remains the same:


In [10]:
result = []
for x in range(1, 26):
    if x % 2 == 0:
        result.append(x * x)
result

[4, 16, 36, 64, 100, 144, 196, 256, 324, 400, 484, 576]
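
The same paraphrase works on lists of strings. As a sketch (the names `long_words_comp` and `long_words_loop` are illustrative), here is a comprehension and an equivalent for loop that both keep only the words longer than three characters and convert them to upper case:

```python
text = "it was the best of times, it was the worst of times"
words = text.split()

# list comprehension version
long_words_comp = [w.upper() for w in words if len(w) > 3]

# for loop version: start with an empty list, then append inside the loop
long_words_loop = []
for w in words:
    if len(w) > 3:
        long_words_loop.append(w.upper())

print(long_words_loop)                     # ['BEST', 'TIMES,', 'WORST', 'TIMES']
print(long_words_comp == long_words_loop)  # True
```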

## Join: Making strings from lists

Once we've created a list of words, it's a common task to want to take that list and "glue" it back together, so it's a single string again, instead of a list. So, for example:

In [11]:
element_list = ["hydrogen", "helium", "lithium", "beryllium", "boron"]
glue = ", and "
glue.join(element_list)

'hydrogen, and helium, and lithium, and beryllium, and boron'

The .join() method needs a "glue" string to the left of it---this is the string that will be placed in between the list elements. In the parentheses to the right, you need to put an expression that evaluates to a list. Very frequently with .join(), programmers don't bother to assign the "glue" string to a variable first, so you end up with code that looks like this:


In [12]:
words = ["this", "is", "a", "test"]
" ".join(words)

'this is a test'
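
The glue can be any string, including a newline. The sketch below also shows one limitation worth knowing: .join() only accepts strings, so a list of numbers has to be converted first. The names `one_per_line`, `counts`, and `as_text` are illustrative.

```python
words = ["this", "is", "a", "test"]

# a newline as the glue puts each word on its own line when printed
one_per_line = "\n".join(words)
print(one_per_line)

counts = [3, 1, 4]
# " ".join(counts) would raise a TypeError, because the items are integers;
# converting each item with str() first avoids the error
as_text = " ".join([str(n) for n in counts])
print(as_text)  # prints 3 1 4
```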


When we're working with .split() and .join(), our workflow usually looks something like this:

1. Split a string to get a list of units (usually words).
2. Use some of the list operations discussed above to modify or slice the list.
3. Join that list back together into a string.
4. Do something with that string (e.g., print it out).
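
The four steps above can be sketched as follows. The choice of slicing off the first six words and reversing them is arbitrary, and the names `first_half` and `sentence` are illustrative.

```python
text = "it was the best of times, it was the worst of times"

words = text.split()              # step 1: split the string into a list of words
first_half = words[:6]            # step 2: slice the list...
first_half.reverse()              # ...and modify it (reverse the order in place)
sentence = " ".join(first_half)   # step 3: join the list back into one string
print(sentence)                   # step 4: prints "times, of best the was it"
```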

With this in mind, here's a program that splits a string into words, randomizes the order of the words, then prints out the results:


In [13]:
# shuffle() belongs to the random module, so import it first
import random

text = "it was a dark and stormy night"
words = text.split() # split() returns a list of words, which is what shuffle() needs
random.shuffle(words) # shuffle() rearranges the list in place
' '.join(words)

'stormy it a night was dark and'

The line above is only one possible result. Because the order is random, your own output will most likely differ, and it will change each time you run the cell.

In [14]:
# alternatively, sort the words alphabetically with sort() instead of shuffling them

text = "it was a dark and stormy night"
words = text.split()
words.sort()
for word in words:
    print(word)

a
and
dark
it
night
stormy
was
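
Note that .sort() changes the list itself. If you want to keep the original word order as well, Python's built-in sorted() function returns a new sorted list and leaves the original alone. This is a brief sketch; the names `alphabetical` and `by_length` are illustrative.

```python
text = "it was a dark and stormy night"
words = text.split()

alphabetical = sorted(words)        # a new list in alphabetical order
by_length = sorted(words, key=len)  # a new list ordered from shortest to longest word

print(alphabetical)  # ['a', 'and', 'dark', 'it', 'night', 'stormy', 'was']
print(by_length)     # ['a', 'it', 'was', 'and', 'dark', 'night', 'stormy']
print(words)         # the original order is unchanged
```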


EXERCISE: Write a Python command-line program that prints out the lines of a text file in random order.

## Nested For Loops

Sometimes, I want to use multiple for loops to do my business.  This usually happens when data is 'structured,' that is, when the data exists in multiple separate lists, dictionaries, or dataframes, to each of which we want to apply separate conditions.

First, let's make a set of three lists, one for each of three novelists.  Each list contains strings that correspond to the names of novels written by that novelist.

In [15]:
dickens = ['oliver twist', 'bleak house', 'a tale of two cities']
austen = ['sense and sensibility', 'emma', 'pride and prejudice']
trollope = ['doctor thorne', 'barchester towers', 'the land leaguers']

In [16]:
austen[1]

'emma'

Now, let's make a list of the novelists' names.

In [17]:
novelists = ['dickens', 'austen', 'trollope']

Importantly, I can now call up novels by the strings in the variable 'novelists.'  Here are two ways of getting Dickens' novels:

In [18]:
globals()['dickens'] # this looks for variables called 'dickens'

['oliver twist', 'bleak house', 'a tale of two cities']

In [19]:
globals()[novelists[0]] # this looks for variables that correspond to the first item in the list, 'novelists'

['oliver twist', 'bleak house', 'a tale of two cities']

Both expressions return the same list.
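
Looking up variables through globals() works in a notebook, but it depends on each string matching a variable name exactly. A common alternative, which this notebook does not use, is to store the lists in a dictionary keyed by novelist. The sketch below assumes a hypothetical dictionary named `novels_by_author`:

```python
# hypothetical dictionary: each key is a novelist, each value is their list of novels
novels_by_author = {
    "dickens": ["oliver twist", "bleak house", "a tale of two cities"],
    "austen": ["sense and sensibility", "emma", "pride and prejudice"],
    "trollope": ["doctor thorne", "barchester towers", "the land leaguers"],
}

novelists = list(novels_by_author)  # the keys, in the order they were added

print(novels_by_author["dickens"])       # look up a list by a literal string
print(novels_by_author[novelists[0]])    # look up the same list via the first novelist
```

The rest of this section keeps using globals() with the separate list variables defined above.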

Let's put that into a for loop:

In [20]:
for novelist in novelists: # cycle through each novelist
    their_novels = globals()[novelist] # for each novelist, pull up the list that corresponds to their name -- thus for 'dickens,' call up the variable called 'dickens'
    print(novelist)
    print(their_novels)

dickens
['oliver twist', 'bleak house', 'a tale of two cities']
austen
['sense and sensibility', 'emma', 'pride and prejudice']
trollope
['doctor thorne', 'barchester towers', 'the land leaguers']


That's nicely formatted output. 

But say we want to do something with each of the novel names -- like creating a new dataset where each novel name is accurately annotated with the name of its author.  How do I glue them together, when what I want to glue changes for each novel but also for each novelist? 

To do that, we'll need a *double* for loop, or a "nested" for loop.  

The outside for loop cycles through each novelist and calls up their list of novels in the variable 'their_novels'.

The inner for loop cycles through each of the items in 'their_novels'.

I can use these nested for loops to output a really nicely formatted list of authors and novels.

In [21]:

for novelist in novelists: # for each novelist,
    their_novels = globals()[novelist] 
    for novel in their_novels: # cycle through each of the novels for that novelist. for each novel of each novelist:
        print(novel)
        print(novelist)
        print("")
    print("------------------")


oliver twist
dickens

bleak house
dickens

a tale of two cities
dickens

------------------
sense and sensibility
austen

emma
austen

pride and prejudice
austen

------------------
doctor thorne
trollope

barchester towers
trollope

the land leaguers
trollope

------------------


That's a nicely formatted print-out.  

But what I really want is a dataset where every entry is the name of a novelist and the novel they wrote.  Can I tweak the double for loop to do that?

In [22]:
novels_with_authors = []

for novelist in novelists: # for each novelist,
    their_novels = globals()[novelist] 
    for novel in their_novels: # cycle through each of the novels for that novelist. for each novel of each novelist:
        new_entry = novelist + "-" + novel # create a dummy variable, 'new_entry', which lists the novelist and novel
        novels_with_authors.append(new_entry) # add the dummy variable to my master list, novels_with_authors
        
novels_with_authors

['dickens-oliver twist',
 'dickens-bleak house',
 'dickens-a tale of two cities',
 'austen-sense and sensibility',
 'austen-emma',
 'austen-pride and prejudice',
 'trollope-doctor thorne',
 'trollope-barchester towers',
 'trollope-the land leaguers']

Notice that I produced this output -- and the output above -- with a "double for loop" or a "nested for loop."

The first "for loop" iterates through the novelists, one at a time:

    for novelist in novelists:

The second "for loop" takes each novelist, and iterates through the list of their novels:

    for novel in their_novels:


Because the loops are nested, I'm not randomly applying Trollope's or Austen's names to random novel titles; I'm creating a list where each author's name corresponds to the right novel.

In [23]:
novels_with_authors

['dickens-oliver twist',
 'dickens-bleak house',
 'dickens-a tale of two cities',
 'austen-sense and sensibility',
 'austen-emma',
 'austen-pride and prejudice',
 'trollope-doctor thorne',
 'trollope-barchester towers',
 'trollope-the land leaguers']
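
Because each entry glues the novelist and the novel together with a hyphen, the two parts can be pulled apart again later with .split(). This is a minimal sketch using a shortened copy of the list; the name `labels` is illustrative. Passing 1 as the second argument to .split() tells Python to split only at the first hyphen, which would matter if a title itself contained a hyphen.

```python
# a shortened copy of the list built above, so this example runs on its own
novels_with_authors = ['dickens-oliver twist', 'austen-emma', 'trollope-doctor thorne']

labels = []
for entry in novels_with_authors:
    # split at the first hyphen only: the left part is the novelist, the right part is the novel
    novelist, novel = entry.split("-", 1)
    labels.append(novel + " (by " + novelist + ")")

for label in labels:
    print(label)
```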

What if we want the list of novelists and their novels to be randomly arranged? For this, we can use the function random.shuffle(), which rearranges a list in place. It comes from the random module imported earlier.

In [24]:
novels_with_authors_random = []

for novelist in novelists: # for each novelist,
    their_novels = globals()[novelist] 
    for novel in their_novels: # cycle through each of the novels for that novelist. for each novel of each novelist:
        new_entry_random = novelist + "-" + novel # create a temporary variable, 'new_entry_random', which lists the novelist and novel
        novels_with_authors_random.append(new_entry_random) # add the temporary variable to my master list, novels_with_authors_random

random.shuffle(novels_with_authors_random) # notice this new line here! it shuffles the finished list once, after both loops are done

novels_with_authors_random

['trollope-the land leaguers',
 'austen-sense and sensibility',
 'austen-emma',
 'trollope-barchester towers',
 'austen-pride and prejudice',
 'dickens-oliver twist',
 'dickens-a tale of two cities',
 'trollope-doctor thorne',
 'dickens-bleak house']
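
random.shuffle() reorders the list you give it. If you would rather keep the original order and get a shuffled copy, one option is random.sample(), asking for as many items as the list contains. The sketch below uses a shortened list; the names `entries` and `shuffled_copy` are illustrative, and the seed value 42 is arbitrary.

```python
import random

# a shortened list of entries, so this example runs on its own
entries = ["dickens-bleak house", "austen-emma", "trollope-doctor thorne"]

random.seed(42)  # optional: fixing the seed makes the "random" order repeatable between runs
shuffled_copy = random.sample(entries, k=len(entries))  # a new list in random order

print(shuffled_copy)
print(entries)  # the original list keeps its order
```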

# If you're totally new to Python

If you're totally new to Python, you might need some more time drilling with the basic operations to really master them.  The following notebooks are short -- and totally optional. However, I encourage you to work through them on your own if this is your first time learning to code.  Please make a note of them, in any case, so that you can return to them if you feel lost in future coding assignments.


- for loops : https://problemsolvingwithpython.com/09-Loops/09.01-For-Loops/
- counting things: https://github.com/laurenfklein/emory-qtm340/blob/master/notebooks/counting.ipynb