Session 4: Lots of Text Processing
=====

Today, we'll learn another new data type and some more fun things we can do with lists. We'll then move along into some strategies for text processing, and finish up reading Moby-Dick... sort of.

Specific topics:

 * Conditional expressions (expressions that produce `bool` values)
 * More fun with the `list` data type
 * The `set` data type
 * The `if-elif-else` conditional execution statement
 * Opening a text file and reading its contents 
 * Introducing the `with` statement for file processing
 * Using string functions to modify strings and slice them up
 
 

Conditional Expressions
------

A conditional expression is an expression that results in a `bool` value. Some obvious conditional expressions are the ones that compare values.

 * `==` is the *equality* operator stating that two values are equal
 * `!=` is the *inequality* operator stating that two values are not equal
 * `>` is the *greater than* operator
 * `>=` is the *greater than or equal to* operator
 * `<` is the *less than* operator
 * `<=` is the *less than or equal to* operator
 
 Recall the `and` and `or` operators, which perform Boolean logic on `bool` values, that we discussed last time.
 
 We'll talk about some special conditional operators in the coming weeks.

In [2]:
# Some things to compare to...

int_value = 100
float_value = 2.9
str_value = 'string'
some_ints = (1, 2, 3)

In [4]:
# Inequality test
int_value != 105

True

In [5]:
# Comparison
float_value > 3.0

False

In [7]:
# String length comparison

len(str_value) > 5

True

In [9]:
# Tuple comparison

some_ints == (1, 2, 3)

True
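
Comparisons can be combined with the `and` and `or` operators recalled above. Here is a small sketch that reuses two of the values from the cells above; the names `in_range` and `either` are only for illustration.

```python
int_value = 100
float_value = 2.9

# Both comparisons must be True for `and` to give True
in_range = int_value >= 0 and int_value <= 100

# Only one comparison needs to be True for `or` to give True
either = float_value > 3.0 or int_value == 100

print(in_range, either)
```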

More features of the `list` data type
-----
There are a few new features of lists that we'll be needing for our text processing work. 

 * The `in` operator
 * Looking up the position of a value using the `index` method
 * Counting up matching entries in a list using the `count` method
 * Adding new items to a list with the `append` and `insert` methods
 * Using the `del` operator and the `remove` method to remove entries


In [15]:
# Note the extra , in the list... why should I do that?

words = ['this', 'is', 'a', 'collection', 
         'of', 'words', 'kaboom', 'is',
        ]

words

['this', 'is', 'a', 'collection', 'of', 'words', 'kaboom', 'is']

In [14]:
# Tuple with one entry?

(10,)

(10,)

In [16]:
# The `in` operator lets us know if a value lies within the list, resulting in a bool
'kaboom' in words

True

In [18]:
# Let's look for a word that we KNOW isn't in there...

'fred' in words

False

In [19]:
# If we want to know the position of a value in the list, use the `index` method

words.index('a')

2

In [20]:
# Change the indefinite article to the definite one

idx = words.index('a')
words[idx] = 'the'
words

['this', 'is', 'the', 'collection', 'of', 'words', 'kaboom', 'is']

In [21]:
# Count up the number of times we find some entries in there

words.count('is')

2

In [23]:
sentence = ' '.join(words)
sentence.count('is')

3
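
The two counts differ because `words.count('is')` counts list entries that are exactly `'is'`, while `sentence.count('is')` counts every occurrence of the substring `is`, including the one inside `'this'`. A short sketch to make that visible (the variable names are illustrative):

```python
words = ['this', 'is', 'the', 'collection', 'of', 'words', 'kaboom', 'is']
sentence = ' '.join(words)

# List count: entries equal to 'is'
list_count = words.count('is')

# String count: non-overlapping occurrences of the substring 'is'
substring_count = sentence.count('is')

# Splitting the sentence back into words recovers the list behavior
split_count = sentence.split().count('is')

print(list_count, substring_count, split_count)
```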

In [26]:
# What if a word isn't in there?

words.count('apogee')

0

In [27]:
# We may add an entry to the end of a list with the 'append' method

words.append('gopher')
words

['this', 'is', 'the', 'collection', 'of', 'words', 'kaboom', 'is', 'gopher']

In [28]:
# We can insert a new item in the list using the `insert` method

words.insert(0, 'peanut')
words

['peanut',
 'this',
 'is',
 'the',
 'collection',
 'of',
 'words',
 'kaboom',
 'is',
 'gopher']

In [29]:
# Remove the first entry in the list

del words[0]
words

['this', 'is', 'the', 'collection', 'of', 'words', 'kaboom', 'is', 'gopher']

In [30]:
# We can use the `remove` method to remove a value (if there are multiple entries,
# it removes the first one)

words.remove('is')

words

['this', 'the', 'collection', 'of', 'words', 'kaboom', 'is', 'gopher']

In [34]:
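# Careful: removing entries from a list while looping over it makes the loop
# skip entries (note that 'the' and 'gopher' are never printed below)

words = ['this', 'the', 'collection', 'of', 'words', 'kaboom', 'is', 'gopher']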

for word in words:
    print(word)
    if 'is' in word:
        words.remove(word)
words

this
collection
of
words
kaboom
is


['the', 'collection', 'of', 'words', 'kaboom', 'gopher']
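
Notice that the loop above never prints `'the'` or `'gopher'`. Removing an entry while looping over the same list shifts the remaining entries, so the loop skips whichever one slides into the vacated position. One common way around this, sketched here as an option rather than the only approach, is to build a new list of the entries we want to keep (`kept` is an illustrative name):

```python
words = ['this', 'the', 'collection', 'of', 'words', 'kaboom', 'is', 'gopher']

# Build a new list rather than removing from the one being looped over
kept = []
for word in words:
    if 'is' not in word:
        kept.append(word)

print(kept)
```

Every entry is now visited exactly once, and the original `words` list is left untouched.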

More about the `in` operator
----

Guess what? As you might expect, the `in` operator also works with tuples.

The `in` operator also works with dictionaries, but it tests the *keys*, not the *values*. Let's see!


In [35]:
# Here's a tuple to test
numbers = (1, 4, 9, 12, -3)

# look for -4
-4 in numbers

False

In [36]:
# Look for 9

9 in numbers

True

In [37]:
# Here's a dictionary for testing

d = {'a':9, 'b':100, 'd':-9,}

'z' in d

False

In [38]:
# Look for something that's in there

'a' in d

True

In [39]:
# Exercise: What if we want to see if the number -9 is in the values associated with
# the dictionary?

-9 in d.values()

True

The `set` data type
------

A `set` in Python is similar to the `dict` type, except it only has keys -- no values. 

Another way to say it is... a `set` is like a list, except it can't have duplicate values and its entries have no fixed order.

We write a set as a sequence of values between curly braces.

Sets offer some nice set operations, like `union` and `intersection`. 

We can add items to a set with the `add` method.


In [48]:
# Let's build a set!

my_set = {'today', 'is', 'thursday', 'whee', 'sunday', 'thursday'}
my_set

{'is', 'sunday', 'thursday', 'today', 'whee'}

In [47]:
# It will not surprise you to know that the `in` keyword works with `set` objects

'thursday' in my_set

True

In [43]:
# let's build another set!
days = {'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'}
days

{'friday', 'monday', 'saturday', 'sunday', 'thursday', 'tuesday', 'wednesday'}

In [44]:
# The union of the sets?

my_set.union(days)

{'friday',
 'is',
 'monday',
 'saturday',
 'sunday',
 'thursday',
 'today',
 'tuesday',
 'wednesday',
 'whee'}

In [45]:
# The intersection of the sets?

my_set.intersection(days)

{'sunday', 'thursday'}

In [50]:
my_set.add('IU')
my_set

{'IU', 'is', 'sunday', 'thursday', 'today', 'whee'}
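
Two more things that sets make easy, shown here as a small sketch: the `|` and `&` operators are shorthand for `union` and `intersection`, and passing a list to `set()` drops its duplicates. The names `all_words`, `common`, and `unique_words` are illustrative.

```python
my_set = {'today', 'is', 'thursday', 'whee', 'sunday'}
days = {'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'}

# Operator forms of union and intersection
all_words = my_set | days
common = my_set & days

# Converting a list to a set removes duplicate entries
words = ['this', 'is', 'a', 'collection', 'of', 'words', 'kaboom', 'is']
unique_words = set(words)

print(sorted(common))
print(len(words), len(unique_words))
```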

Conditional Execution
-----

We can use the `if`, `elif` (that is, else-if), and `else` keywords to execute code based on boolean tests (e.g. from conditional expressions). Here comes an example of a portion of a factoring algorithm...

In [51]:
# Factoring an integer value (version 1) -- check for multiples of 5

def factor_check_5(value):
    '''
    Checks to see if 5 is a prime factor of 'value'. Returns a tuple (None, value)
    if the value is not divisible by 5, else the tuple (5, value // 5).
    '''
    if value % 5 == 0:
        return (5, value // 5)
    else:
        return (None, value)
    
print(f'Check 15 {factor_check_5(15)}')
print(f'Check 12 {factor_check_5(12)}')
    

Check 15 (5, 3)
Check 12 (None, 12)


In [52]:
# We can add checks for 2 and 3 as factors...

def factor_check_235(value):
    '''
    Returns a tuple with None and the value if the value is not divisible by 2, 3, or 5, 
    else a tuple of the divisor and the quotient.
    '''
    if value % 5 == 0:
        return (5, value // 5)
    elif value % 3 == 0:
        return (3, value // 3)
    elif value % 2 == 0:
        return (2, value // 2)
    else:
        return (None, value)
    
print(f'Check 15 {factor_check_235(15)}')
print(f'Check 12 {factor_check_235(12)}')
print(f'Check 11 {factor_check_235(11)}')
    

Check 15 (5, 3)
Check 12 (3, 4)
Check 11 (None, 11)


In [53]:
# Exercise -- use the `for` and `if-elif-else` blocks to check the prime 
# factors 2, 3, 5, 7, 11, 13, 17, and 19

def factor_check(value):
    '''
    Returns a tuple with None and the value if the value is not divisible by any of
    the primes from 2 through 19, else a tuple of the divisor and the quotient
    '''
    primes = [2, 3, 5, 7, 11, 13, 17, 19,]
    
    for p in primes:
        if value % p == 0:
            return (p, value // p)
    return (None, value)
    
print(f'Check 15 {factor_check(15)}')
print(f'Check 12 {factor_check(12)}')
print(f'Check 11 {factor_check(11)}')
print(f'Check 38 {factor_check(38)}')
    

Check 15 (3, 5)
Check 12 (2, 6)
Check 11 (11, 1)
Check 38 (2, 19)
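
Since `factor_check` returns one prime factor and the remaining quotient, calling it again on the quotient peels off factors one at a time. The sketch below is one possible way to do that, assuming a positive integer input. `small_prime_factors` is a made-up helper name, and it uses a `while` loop, which repeats as long as its test is `True`.

```python
def factor_check(value):
    '''Returns (p, value // p) for the first listed prime p dividing value, else (None, value).'''
    primes = [2, 3, 5, 7, 11, 13, 17, 19]
    for p in primes:
        if value % p == 0:
            return (p, value // p)
    return (None, value)


def small_prime_factors(value):
    '''
    Hypothetical helper: collects the prime factors up to 19 of a positive
    integer, and returns them along with whatever is left over.
    '''
    factors = []
    divisor, remainder = factor_check(value)
    while divisor is not None:
        factors.append(divisor)
        value = remainder
        divisor, remainder = factor_check(value)
    return (factors, value)


print(small_prime_factors(60))
print(small_prime_factors(46))
```

A leftover of 1 means the value was factored completely; anything larger has no prime factor in the list.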


Opening a text file and processing its contents with a `for` loop
----

There's a lot of detail in this code that we'll cover later. 

With text files, the `open` function connects a variable to a file as a handle for reading data, line by line. We use the `encoding=` argument to tell Python how to decode the bytes in the file into characters.

The `with` block is a way to protect our code from leaving files open if the code that does the reading crashes. As I said, we'll deal with it all later. For now, it's sufficient to say that the code below is the appropriate way to process a text file:

 1. Open the file and connect it to a variable within a `with` block.
 1. Process the file in a `for` loop.
 1. Finish the block and the file is automatically closed by Python.
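
The three steps above can be sketched end to end. To keep the example self-contained, it first writes a small stand-in file (`sample.txt` in a temporary folder, both assumptions for this sketch) rather than relying on `moby-dick.txt` being present.

```python
import os
import tempfile

# Create a small stand-in text file to read back (not the real book)
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'sample.txt')
with open(path, 'w', encoding='UTF-8') as output_file:
    output_file.write('first line\nsecond line\n')

# 1. Open the file inside a `with` block
with open(path, encoding='UTF-8') as input_file:
    # 2. Process the file line by line in a `for` loop
    for line in input_file:
        print(line.strip())

# 3. Leaving the block closes the file automatically
print(input_file.closed)
```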
 

In [54]:
# Count up the number of lines in the file containing the book "Moby-Dick"

with open('moby-dick.txt', encoding='UTF-8') as input_file:
    line_count = 0
    for line in input_file:
        line_count += 1
print(line_count)

21116


In [55]:
# Find the first line that mentions 'Ishmael', print it, then stop with `break`

with open('moby-dick.txt', encoding='UTF-8') as input_file:
    for line in input_file:
        if 'Ishmael' in line:
            print(line)
            break

Call me Ishmael. Some years ago-never mind how long precisely-having
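
To also report which line number the match is on, a counter has to be incremented on every pass through the loop. Here is a sketch of that idea. It uses `io.StringIO` with a few made-up lines as a stand-in for the open file, so it runs without `moby-dick.txt`; `found_at` is an illustrative name.

```python
import io

# Stand-in for an open text file (an assumption so the example runs anywhere)
input_file = io.StringIO('Title page\n\nCall me Ishmael.\nmore text\n')

line_count = 0
found_at = None
for line in input_file:
    # Count every line we read, whether or not it matches
    line_count += 1
    if 'Ishmael' in line:
        found_at = line_count
        break

print(found_at)
```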


Using string functions to modify strings and slice them up
-----

We'll use some string methods to process our files.

 * `strip` -- Removes leading and trailing whitespace
 * `startswith` -- Checks the beginning of a string
 * `lower` -- Converts all characters to lower-case
 * `replace` -- Replaces substrings with other substrings
 

In [56]:
# Remove leading and trailing whitespace...
s = '   This was a day we had all waited for.   '
s.strip()

'This was a day we had all waited for.'

In [57]:
s

'   This was a day we had all waited for.   '

In [58]:
'cow pig sheep'.startswith('cow')


True

In [59]:
'Roger Rabbit'.lower()

'roger rabbit'
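
Two of these methods accept optional arguments that come in handy for text cleanup. The sketch below uses a sample heading line: `startswith` can take a tuple of prefixes and matches any of them, and `strip` can be told which characters to remove from both ends.

```python
line = '  CHAPTER 1. Loomings.\n'

# Chain strip and lower to normalize the line
cleaned = line.strip().lower()

# startswith also accepts a tuple of prefixes and matches any of them
is_heading = cleaned.startswith(('chapter', 'epilogue'))

# strip can be given the characters to remove from both ends
no_dots = cleaned.strip('.')

print(cleaned)
print(is_heading)
print(no_dots)
```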

Chaining Function Calls
----
In Python, string methods return a new string object instead of changing the original, because strings cannot be modified in place. (Many list methods, such as `append` and `remove`, do change the list itself.) For example, we can replace the word 'a' with 'the' in a sentence using the `replace` method.

In [60]:
# Here's a string -- replace 'a' with 'the' (note the spaces)
s = 'This was a day we had all waited for.'

s.replace(' a ', ' the ')

'This was the day we had all waited for.'

In [61]:
# Note that the original string is intact

s

'This was a day we had all waited for.'
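
The spaces around `' a '` matter because `replace` works on substrings, not words. A quick sketch of what happens without them (`careless` and `careful` are illustrative names):

```python
s = 'This was a day we had all waited for.'

# Without the surrounding spaces, every letter 'a' is replaced
careless = s.replace('a', 'the')

# With the spaces, only the standalone word is replaced
careful = s.replace(' a ', ' the ')

print(careless)
print(careful)
```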

In [None]:
# If we want to change the string, we need to assign the result of the `replace` function



If we also wanted to replace 'was' with 'is' and 'had' with 'have', we could do this...

In [62]:
s = s.replace(' a ', ' the ')
s = s.replace(' was ', ' is ')
s = s.replace(' had ', ' have ')
s

'This is the day we have all waited for.'

Since `replace` returns a string, we can call another method directly on its result, so we can simply "chain" the operations.

In [63]:
s = 'This was a day we had all waited for.'
s = s.replace(' a ', ' the ').replace(' was ', ' is ').replace(' had ', ' have ')
s

'This is the day we have all waited for.'

Parenthesized expressions can span across more than one line, so the chained expression can be written more clearly this way...

In [64]:
s = ( s.replace(' a ', ' the ')
       .replace(' was ', ' is ')
       .replace(' had ', ' have ')
    )
s

'This is the day we have all waited for.'

Accumulating some calculations in a loop
----
Sometimes, we'll want to process a sequence of values, and accumulate some calculation or operation.

In this case, we can often use this coding pattern:

 1. Initialize the result
 1. Begin to loop over the sequence
    1. Update the result using some operation
    1. Continue looping
 1. When the loop is finished, we have the final result.

In [65]:
# Example: sum up the values in a list

my_list = [1, 2, 3, 4, 5]

# Initialize the sum
total = 0

# Now the loop
for i in my_list:
    total += i
    
# And... we're done!
print(total)

15


Python has more powerful ways to do this -- we'll see them later!
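
The same initialize-then-update pattern works for results other than sums. As a sketch, here it is finding the longest word in a list (`longest` is an illustrative name), followed by a preview of the built-in `sum` function, which does the addition loop in one call.

```python
words = ['this', 'is', 'the', 'collection', 'of', 'words']

# Initialize the result
longest = ''

# Update the result on each pass through the loop
for word in words:
    if len(word) > len(longest):
        longest = word

print(longest)

# The built-in sum function adds up a sequence of numbers
total = sum([1, 2, 3, 4, 5])
print(total)
```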

Now, let's take all we've learned and print the cleaned-up lines of text in Moby-Dick!

In [77]:
# Process "Moby-Dick"... skipping the Chapter and Epilogue headings along the way...
# also force lower case and remove periods and commas (hyphens become spaces)

with open('moby-dick.txt', encoding='UTF-8') as input_file:
    for line in input_file:
        line = ( line.strip()
                     .lower()
                     .replace('.', '')
                     .replace('-', ' ')
                     .replace(',', '')
               )
        
        if line.startswith('chapter') or line.startswith('epilogue'):
            # 'continue' finishes this pass through the loop
            continue
            
        if len(line) == 0:
            continue
            
        print(line)

call me ishmael some years ago never mind how long precisely having
little or no money in my purse and nothing particular to interest me
on shore i thought i would sail about a little and see the watery part
of the world it is a way i have of driving off the spleen and
regulating the circulation whenever i find myself growing grim about
the mouth; whenever it is a damp drizzly november in my soul; whenever
i find myself involuntarily pausing before coffin warehouses and
bringing up the rear of every funeral i meet; and especially whenever
my hypos get such an upper hand of me that it requires a strong moral
principle to prevent me from deliberately stepping into the street and
methodically knocking people’s hats off then i account it high time to
get to sea as soon as i can this is my substitute for pistol and ball
with a philosophical flourish cato throws himself upon his sword; i
quietly take to the ship there is nothing surprising in this if they
but knew it almost all men in their 

now when i looked about the quarter deck for some one having
authority in order to propose myself as a candidate for the voyage at
first i saw nobody; but i could not well overlook a strange sort of
tent or rather wigwam pitched a little behind the main mast it
seemed only a temporary erection used in port it was of a conical
shape some ten feet high; consisting of the long huge slabs of limber
black bone taken from the middle and highest part of the jaws of the
right whale planted with their broad ends on the deck a circle of
these slabs laced together mutually sloped towards each other and at
the apex united in a tufted point where the loose hairy fibres waved
to and fro like the top knot on some old pottowottamie sachem’s head a
triangular opening faced towards the bows of the ship so that the
insider commanded a complete view forward
and half concealed in this queer tenement i at length found one who by
his aspect seemed to have authority; and who it being noon and the
ship’s work 

in the american fishery he is not only an important officer in the
boat but under certain circumstances (night watches on a whaling
ground) the command of the ship’s deck is also his; therefore the grand
political maxim of the sea demands that he should nominally live apart
from the men before the mast and be in some way distinguished as their
professional superior; though always by them familiarly regarded as
their social equal
now the grand distinction drawn between officer and man at sea is
this the first lives aft the last forward hence in whale ships and
merchantmen alike the mates have their quarters with the captain; and
so too in most of the american whalers the harpooneers are lodged in
the after part of the ship that is to say they take their meals in
the captain’s cabin and sleep in a place indirectly communicating with
it
though the long period of a southern whaling voyage (by far the longest
of all voyages now or ever made by man) the peculiar perils of it and
the communit

make distant unobtrusive salutations to him in the street lest if they
pursued the acquaintance further they might receive a summary thump
for their presumption
but not only did each of these famous whales enjoy great individual
celebrity nay you may call it an ocean wide renown; not only was he
famous in life and now is immortal in forecastle stories after death
but he was admitted into all the rights privileges and distinctions
of a name; had as much a name indeed as cambyses or cæsar was it not
so o timor tom! thou famed leviathan scarred like an iceberg who so
long did’st lurk in the oriental straits of that name whose spout was
oft seen from the palmy beach of ombay? was it not so o new zealand
jack! thou terror of all cruisers that crossed their wakes in the
vicinity of the tattoo land? was it not so o morquan! king of japan
whose lofty jet they say at times assumed the semblance of a snow white
cross against the sky? was it not so o don miguel! thou chilian whale
marked like an 

inshore in a calm and lazily taking water on board; the loosened
sails of the ship and the long leaves of the palms in the background
both drooping together in the breezeless air the effect is very fine
when considered with reference to its presenting the hardy fishermen
under one of their few aspects of oriental repose the other engraving
is quite a different affair: the ship hove to upon the open sea and in
the very heart of the leviathanic life with a right whale alongside;
the vessel (in the act of cutting in) hove over to the monster as if to
a quay; and a boat hurriedly pushing off from this scene of activity
is about giving chase to whales in the distance the harpoons and
lances lie levelled for use; three oarsmen are just setting the mast in
its hole; while from a sudden roll of the sea the little craft stands
half erect out of the water like a rearing horse from the ship the
smoke of the torments of the boiling whale is going up like the smoke
over a village of smithies; and t

infidel as to one of the most appalling but not the less true events
perhaps anywhere to be found in all recorded history
you observe that in the ordinary swimming position of the sperm whale
the front of his head presents an almost wholly vertical plane to the
water; you observe that the lower part of that front slopes
considerably backwards so as to furnish more of a retreat for the long
socket which receives the boom like lower jaw; you observe that the
mouth is entirely under the head much in the same way indeed as
though your own mouth were entirely under your chin moreover you
observe that the whale has no external nose; and that what nose he
has his spout hole is on the top of his head; you observe that his eyes
and ears are at the sides of his head nearly one third of his entire
length from the front wherefore you must now have perceived that the
front of the sperm whale’s head is a dead blind wall without a single
organ or tender prominence of any sort whatsoever furthermore y

not in a condemned cell and as for the other whale why i’ll agree to
get more oil by chopping up and trying out these three masts of ours
than he’ll get from that bundle of bones; though now that i think of
it it may contain something worth a good deal more than oil; yes
ambergris i wonder now if our old man has thought of that it’s worth
trying yes i’m for it;” and so saying he started for the
quarter deck
by this time the faint air had become a complete calm; so that whether
or no the pequod was now fairly entrapped in the smell with no hope
of escaping except by its breezing up again issuing from the cabin
stubb now called his boat’s crew and pulled off for the stranger
drawing across her bow he perceived that in accordance with the
fanciful french taste the upper part of her stem piece was carved in
the likeness of a huge drooping stalk was painted green and for
thorns had copper spikes projecting from it here and there; the whole
terminating in a symmetrical folded bulb of a brigh

as air; and i’m down in the whole world’s books i am so rich i could
have given bid for bid with the wealthiest prætorians at the auction of
the roman empire (which was the world’s); and yet i owe for the flesh
in the tongue i brag with by heavens! i’ll get a crucible and into
it and dissolve myself down to one small compendious vertebra so
carpenter (_resuming his work_)
well well well! stubb knows him best of all and stubb always says
he’s queer; says nothing but that one sufficient little word queer;
he’s queer says stubb; he’s queer queer queer; and keeps dinning it
into mr starbuck all the time queer sir queer queer very queer and
here’s his leg! yes now that i think of it here’s his bedfellow! has
a stick of whale’s jaw bone for a wife! and this is his leg; he’ll
stand on this what was that now about one leg standing in three
places and all three places standing in one hell how was that? oh! i
don’t wonder he looked so scornful at me! i’m a sort of
strange thoughted sometimes the

stay on board on board! lower not when i do; when branded ahab gives
chase to moby dick that hazard shall not be thine no no! not with
the far away home i see in that eye!”
“oh my captain! my captain! noble soul! grand old heart after all!
why should any one give chase to that hated fish! away with me! let us
fly these deadly waters! let us home! wife and child too are
starbuck’s wife and child of his brotherly sisterly play fellow
youth; even as thine sir are the wife and child of thy loving
longing paternal old age! away! let us away! this instant let me alter
the course! how cheerily how hilariously o my captain would we bowl
on our way to see old nantucket again! i think sir they have some
such mild blue days even as this in nantucket”
“they have they have i have seen them some summer days in the
morning about this time yes it is his noon nap now the boy
vivaciously wakes; sits up in bed; and his mother tells him of me of
cannibal old me; how i am abroad upon the deep but will yet 
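
The same loop can count the lines of text instead of printing them, by combining it with the accumulation pattern. Here is a sketch that uses `io.StringIO` with a few stand-in lines in place of `moby-dick.txt`, so it runs without the file; `text_lines` is an illustrative name.

```python
import io

# Stand-in for the open book file (an assumption for this sketch)
input_file = io.StringIO(
    'CHAPTER 1. Loomings.\n'
    '\n'
    'Call me Ishmael. Some years ago\n'
    'little or no money in my purse,\n'
    '\n'
    'Epilogue\n'
)

# Initialize the result
text_lines = 0

for line in input_file:
    line = line.strip().lower()

    # Skip headings and blank lines
    if line.startswith(('chapter', 'epilogue')):
        continue
    if len(line) == 0:
        continue

    # Update the result
    text_lines += 1

print(text_lines)
```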