Session 4: Lots of Text Processing
=====

Today, we'll learn another new data type and some more fun things we can do with lists. We'll then move along into some strategies for text processing, and finish up reading Moby-Dick... sort of.

Specific topics:

 * Conditional expressions (expressions that evaluate to a `bool` value)
 * More fun with the `list` data type
 * The `set` data type
 * The `if-elif-else` conditional execution statement
 * Opening a text file and reading its contents 
 * Introducing the `with` statement for file processing
 * Using string functions to modify strings and slice them up
 
 

Conditional Expressions
------

A conditional expression is an expression that results in a `bool` value. Some obvious conditional expressions are the ones that compare values.

 * `==` is the *equality* operator, which tests whether two values are equal
 * `!=` is the *inequality* operator, which tests whether two values are not equal
 * `>` is the *greater than* operator
 * `>=` is the *greater than or equal to* operator
 * `<` is the *less than* operator
 * `<=` is the *less than or equal to* operator
 
 Recall the `and` and `or` operators from last time, which combine `bool` values using Boolean logic.
 
 We'll talk about some special conditional operators in the coming weeks.
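
As a quick sketch of the comparison operators listed above (the values and variable names here are chosen only for illustration), each expression below evaluates to a `bool`:

```python
# Hypothetical sample values, chosen only for illustration
int_value = 100
float_value = 2.9

is_equal = int_value == 100               # True
is_different = int_value != 100           # False
is_smaller = float_value < int_value      # True: int and float compare numerically
str_before = 'apple' < 'banana'           # True: strings compare character by character
tuple_before = (1, 2, 3) < (1, 2, 4)      # True: tuples compare element by element
both = int_value > 50 and float_value > 3 # False: the second test fails

print(is_equal, is_different, is_smaller, str_before, tuple_before, both)
```

Note that string and tuple comparisons work left to right, stopping at the first position where the two values differ.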

In [None]:
# Some things to compare to...

int_value = 100
float_value = 2.9
str_value = 'string'
some_ints = (1, 2, 3)

In [None]:
# Equality test


In [None]:
# Comparison


In [None]:
# String comparison


In [None]:
# Tuple comparison


More features of the `list` data type
-----
There are a few new features of lists that we'll be needing for our text processing work. 

 * The `in` operator
 * Looking up the position of a value using the `index` method
 * Counting up matching entries in a list using the `count` method
 * Adding new items to a list with the `append` and `insert` methods
 * Using the `del` operator and the `remove` method to remove entries


In [None]:
# Note the extra , in the list... why should I do that?

words = ['this', 'is', 'a', 'collection', 'of', 'words', 'kaboom', 'is',]

In [None]:
# The `in` operator lets us know if a value lies within the list, resulting in a bool
'kaboom' in words

In [None]:
# Let's look for a word that we KNOW isn't in there...

'fred' in words

In [None]:
# If we want to know the position of a value in the list, use the `index` method

words.index('a')

In [None]:
# Change the indefinite article to the definite one

idx = words.index('a')
words[idx] = 'the'
words

In [None]:
# Count up the number of times we find some entries in there

words.count('is')

In [None]:
# What if a word isn't in there?

words.count('apogee')

In [None]:
# We may add an entry to the end of a list with the 'append' method

words.append('gopher')
words

In [None]:
# We can insert a new item in the list using the `insert` method

words.insert(0, 'peanut')
words

In [None]:
# Remove the first entry in the list

del words[0]
words

In [None]:
# We can use the `remove` method to remove a value (if there are multiple entries,
# it removes the first one)

words.remove('is')

words

More about the `in` operator
----

Guess what? As you might expect, the `in` operator also works with tuples.

The `in` operator also works with dictionaries, but it tests the *keys*, not the *values*. Let's see!


In [None]:
# Here's a tuple to test
numbers = (1, 4, 9, 12, -3)

# look for -4
-4 in numbers

In [None]:
# Look for 9

9 in numbers

In [None]:
# Here's a dictionary for testing

d = {'a':9, 'b':100, 'd':-9,}

'z' in d

In [None]:
# Look for something that's in there

'a' in d

In [None]:
# Exercise: What if we want to see if the number -9 is in the values associated with
# the dictionary?



The `set` data type
------

A `set` in Python is similar to the `dict` type, except it only has keys -- no values. 

Another way to say it is... a `set` is like a list, except it can't have duplicate values and it doesn't keep its items in any particular order.

We write a set as a sequence of values between curly braces.

Sets offer some nice set operations, like `union` and `intersection`. 

We can add items to a set with the `add` method.
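
Here is a minimal sketch of the `add` method and of how a set discards duplicates. The names `colors`, `word_list`, and `unique_words` are made up for this illustration.

```python
# Hypothetical example values, chosen only for illustration
colors = {'red', 'green'}
colors.add('blue')   # a new value is added
colors.add('red')    # 'red' is already there, so the set is unchanged
print(len(colors))   # 3

# Passing a list to set() keeps one copy of each value
word_list = ['is', 'a', 'is', 'the', 'a']
unique_words = set(word_list)
print(len(unique_words))   # 3

# An empty set is written set(); empty curly braces {} make an empty dict
empty_set = set()
print(type(empty_set))
```

Because sets are unordered, printing one may show its items in a different order than you typed them.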


In [None]:
# Let's build a set!

my_set = {'today', 'is', 'thursday', 'whee', 'sunday'}
type(my_set)

In [None]:
# It will not surprise you to know that the `in` operator works with `set` objects

'thursday' in my_set

In [None]:
# let's build another set!
days = {'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday'}

In [None]:
# The union of the sets?

my_set.union(days)

In [None]:
# The intersection of the sets?

my_set.intersection(days)

Conditional Execution
-----

We can use the `if`, `elif` (that is, else-if), and `else` keywords to execute code based on boolean tests (e.g. from conditional expressions). Here comes an example of a portion of a factoring algorithm...

In [None]:
# Factoring an integer value (version 1) -- check for multiples of 5

def factor_check_5(value):
    '''
    Checks to see if 5 is a factor of 'value'. Returns the tuple (None, value)
    if the value is not divisible by 5, else the tuple (5, value // 5).
    '''
    if value % 5 == 0:
        return (5, value // 5)
    else:
        return (None, value)
    
print(f'Check 15 {factor_check_5(15)}')
print(f'Check 12 {factor_check_5(12)}')
    

In [None]:
# We can add checks for 2 and 3 as factors...

def factor_check_235(value):
    '''
    Returns a tuple with None and the value if the value is not divisible by 2, 3, or 5, 
    else a tuple of the divisor and the quotient.
    '''
    if value % 5 == 0:
        return (5, value // 5)
    elif value % 3 == 0:
        return (3, value // 3)
    elif value % 2 == 0:
        return (2, value // 2)
    else:
        return (None, value)
    
print(f'Check 15 {factor_check_235(15)}')
print(f'Check 12 {factor_check_235(12)}')
print(f'Check 11 {factor_check_235(11)}')
    

In [None]:
# Exercise -- use the `for` and `if-elif-else` blocks to check the prime 
# factors 2, 3, 5, 7, 11, 13, 17, and 19

def factor_check(value):
    '''
    Returns a tuple with None and the value if the value is not divisible by any of
    the primes 2 through 19 listed above, else a tuple of the divisor and the quotient
    '''
    
    # Put your code here!
    pass

print(f'Check 15 {factor_check(15)}')
print(f'Check 12 {factor_check(12)}')
print(f'Check 11 {factor_check(11)}')
print(f'Check 38 {factor_check(38)}')
    

Opening a text file and processing its contents with a `for` loop
----

There's a lot of detail in this code, and we'll deal with most of it later. 

With text files, the `open` function returns a file handle, which we assign to a variable and use to read the data line by line. We use the `encoding=` argument to tell Python how to decode the bytes in the file into Unicode characters.

The `with` block is a way to protect our code from leaving files open if the code that does the reading crashes. As I said, we'll deal with it all later. For now, it's sufficient to say that the code below is the appropriate way to process a text file:

 1. Open the file and connect it to a variable within a `with` block.
 1. Process the file in a `for` loop.
 1. Finish the block and the file is automatically closed by Python.
 

In [None]:
# Count up the number of lines in the file containing the book "Moby-Dick"

with open('moby-dick.txt', encoding='UTF-8') as input_file:
    line_count = 0
    for line in input_file:
        line_count += 1
print(line_count)
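
To see what the `with` block does for us, here is a small sketch that does not depend on `moby-dick.txt`. It writes a throwaway file first (the file name `sample.txt` and its three lines are made up for this example), then reads it back and checks the `closed` attribute of the file handle.

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.txt')
with open(sample_path, 'w', encoding='UTF-8') as output_file:
    output_file.write('Call me Ishmael.\n\nSome years ago...\n')

# Same pattern as above: open in a with block, loop over the lines
with open(sample_path, encoding='UTF-8') as input_file:
    line_count = 0
    for line in input_file:
        line_count += 1
    print(input_file.closed)   # False: we are still inside the with block

print(input_file.closed)       # True: Python closed the file when the block ended
print(line_count)              # 3 (the blank line counts as a line)

# Clean up the throwaway file and directory
os.remove(sample_path)
os.rmdir(tmp_dir)
```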

Using string functions to modify strings and slice them up
-----

We'll use some string methods to process our files.

 * strip -- Removes leading and trailing whitespace
 * startswith -- Checks the beginning of a string
 * lower -- Converts all characters to lower-case
 * replace -- Replaces substrings with other substrings
 

In [None]:
# Remove leading and trailing whitespace...
s = '   This was a day we had all waited for.   '
s.strip()

In [None]:
'cow pig sheep'.startswith('cow')

In [None]:
'Roger Rabbit'.lower()
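
These methods become more useful in combination. The sketch below sorts a few lines into blank lines, headings, and ordinary text. The sample lines are made up for illustration, not read from the real file, and the rule "a heading starts with the word chapter" is an assumption for this example.

```python
# Hypothetical sample lines, made up for illustration
sample_lines = ['  CHAPTER 1. Loomings.\n', '\n', 'Call me Ishmael.\n']

labels = []
for line in sample_lines:
    cleaned = line.strip()   # drop surrounding spaces and the trailing newline
    if cleaned == '':
        labels.append('blank')
    elif cleaned.lower().startswith('chapter'):
        labels.append('heading')   # lower() makes the test ignore case
    else:
        labels.append('text')

print(labels)   # ['heading', 'blank', 'text']
```

Calling `lower()` before `startswith` means 'CHAPTER', 'Chapter', and 'chapter' are all treated the same way.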

Chaining Function Calls
----
In Python, strings can't be changed in place, so the string methods return a new string object and leave the original alone. (Many list methods, such as `append`, modify the list itself instead.) For example, we can replace the word 'a' with 'the' in a sentence using the `replace` method.

In [None]:
# Here's a string -- replace 'a' with 'the' (note the spaces)
s = 'This was a day we had all waited for.'

s.replace(' a ', ' the ')

In [None]:
# Note that the original string is intact



In [None]:
# If we want to change the string, we need to assign the result of the `replace` method back to `s`



If we also wanted to replace 'was' with 'is' and 'had' with 'have', we can do this...

In [None]:
s = s.replace(' a ', ' the ')
s = s.replace(' was ', ' is ')
s = s.replace(' had ', ' have ')
s

Since each call to `replace` returns a new string, we can call another method directly on that result, and simply "chain" the operations.

In [None]:
s = s.replace(' a ', ' the ').replace(' was ', ' is ').replace(' had ', ' have ')
s

Parenthesized expressions can span across more than one line, so the chained expression can be written more clearly this way...

In [None]:
s = ( s.replace(' a ', ' the ')
       .replace(' was ', ' is ')
       .replace(' had ', ' have ')
    )
s

Accumulating some calculations in a loop
----
Sometimes, we'll want to process a sequence of values, and accumulate some calculation or operation.

In this case, we can often use this coding pattern:

 1. Initialize the result
 1. Begin to loop over the sequence
    1. Update the result using some operation
    1. Continue looping
 1. When the loop is finished, we have the final result.

In [None]:
# Example: sum up the values in a list

my_list = [1, 2, 3, 4, 5]

# Initialize the sum
total = 0

# Now the loop
for i in my_list:
    total += i
    
# And... we're done!
print(total)

Python has more powerful ways to do this -- we'll see them later!
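
The same pattern works for results other than a sum. As a sketch (the word list and the "longer than 4 characters" rule are made up for this example), the loop below accumulates two results at once: a count of the long words, and the longest word seen so far.

```python
# Hypothetical sample data, chosen only for illustration
sample_words = ['call', 'me', 'ishmael', 'some', 'years', 'ago']

# Initialize both results
long_word_count = 0
longest = ''

# Update the results as we loop
for word in sample_words:
    if len(word) > 4:
        long_word_count += 1
    if len(word) > len(longest):
        longest = word

print(long_word_count)   # 2 ('ishmael' and 'years')
print(longest)           # ishmael
```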

Now, let's take all we've learned and count up the lines with text in Moby-Dick!

In [None]:
# Process "Moby-Dick"... removing the Chapter and Epilogue headings along the way...

with open('moby-dick.txt', encoding='UTF-8') as input_file:
    pass