# Think Python, Week 12: Data Structure Selection

<img src='../meta/images/python-logo.png' style="float:right">

## Objectives
---

* Work some exercises together

## Exercise 13-1
---

> “Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the words, and converts them to lowercase.”

* Process
  * Decompose the problem into separate pieces for simplicity, re-usability, and easier debugging
  * I usually prefer to work bottom-up, from the smallest pieces to the biggest ones
  * I've included a simple text file, 'forgiveness.txt'

Identifying the components of the solution (bottom-up order): discuss together.

* (Step 1)

**Try finding a solution before looking below**

<hr />

In [None]:
# clean and lowercase word
import string

def clean(word):
    cleaned = []
    for c in word:
        if c not in string.punctuation:
            cleaned.append(c.lower())
    return ''.join(cleaned)

In [None]:
string.punctuation

In [None]:
clean("What'sNew?")

In [None]:
clean("*")

<hr />

In [None]:
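# split line into words
def linewords(line):
    # split() with no argument splits on any run of whitespace and
    # drops the trailing newline, so no empty strings are produced
    return line.split()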

In [None]:
myline = "* The geographical significance of the two feeding miracles brings fresh understanding to Jesus' warning."
linewords(myline)

In [None]:
for w in linewords(myline):
    print(clean(w))

<hr />

In [None]:
# read a file, get words for each line, and clean them
def filewords(file):
    words = []
    # the with block closes the file even if an error occurs
    with open(file) as f:
        for line in f:
            for w in linewords(line):
                w = clean(w.strip())
                if w:
                    words.append(w)
    return words

filewords('forgiveness.txt')

**Discussion**: what is a word? 
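
To make the question concrete, here is a small sketch comparing two answers on a made-up sentence. The first mirrors the split-then-clean approach above; the second is an alternative (a regular expression, not the book's method) that keeps internal apostrophes and hyphens. The sample sentence and the pattern are assumptions chosen for illustration.

```python
import re
import string

def clean(word):
    # Same idea as clean() above: drop punctuation, lowercase the rest.
    return ''.join(c.lower() for c in word if c not in string.punctuation)

sample = "Well-known writers don't always agree."

# Approach 1: split on whitespace, then strip all punctuation.
split_words = [clean(w) for w in sample.split()]

# Approach 2 (an alternative): treat letters joined by a single
# apostrophe or hyphen as one word.
regex_words = re.findall(r"[a-z]+(?:['-][a-z]+)*", sample.lower())

print(split_words)  # ['wellknown', 'writers', 'dont', 'always', 'agree']
print(regex_words)  # ['well-known', 'writers', "don't", 'always', 'agree']
```

Neither result is "right"; which one you want depends on what you plan to count.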

## Exercise 13-2
---

* Count the number of times each 'word' occurs
* How many different words are there? 

**Terminology**: For problems like this:

* Each different class (here, a word) is called a 'type' (for word problems, the set of all types is also called the 'vocabulary')
* Each occurrence of a type is called a 'token' or an 'instance'

So we're counting both the number of *types*, and the *token* count for each type. 
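
A tiny illustration of the distinction, using a made-up word list rather than the text file:

```python
# A hypothetical list of cleaned words (tokens), for illustration only.
tokens = ['the', 'cat', 'saw', 'the', 'dog', 'the']

# The types are the distinct words; a set removes the repeats.
types = set(tokens)

print(len(tokens))          # 6 tokens in total
print(len(types))           # 4 types: the, cat, saw, dog
print(tokens.count('the'))  # 3 tokens of the type 'the'
```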

In [None]:
# dictionaries make counting easy
def counter(items):
    cdict = dict()
    for item in items:
        if item not in cdict:
            cdict[item] = 0
        cdict[item] += 1
    return cdict

In [None]:
worddict = counter(filewords('forgiveness.txt'))
print("Number of word types: ", len(worddict))
print("Count for 'the': ", worddict['the'])

**Discussion**: what else might you want a counter to do? 
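
One possible answer, as a sketch: the standard library's `collections.Counter` already covers some common wishes, such as listing the most frequent items and returning 0 for items it has never seen. The word list here is made up for illustration.

```python
from collections import Counter

# A hypothetical list of cleaned words, for illustration only.
words = ['the', 'cat', 'saw', 'the', 'dog', 'the', 'cat']

counts = Counter(words)

# The two most frequent types, with their token counts.
top_two = counts.most_common(2)
print(top_two)                # [('the', 3), ('cat', 2)]

# A missing word gives 0 instead of raising KeyError.
print(counts['unicorn'])      # 0

# Total number of tokens.
print(sum(counts.values()))   # 7
```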

## Using the Biblia API
---

In [None]:
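# standard Python modules for building and opening URLs
# (import the submodules explicitly; a bare `import urllib` does not
# guarantee that urllib.parse and urllib.request are loaded)
import urllib.parse
import urllib.request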

# put your API key in key.py like this
# KEY = 'abcdefghi'
from key import KEY


In [None]:
KEY

In [None]:
# the foundation for a URL
def construct_base_url(bible, format):
    """Return the base URL for BIBLE and FORMAT.
    BIBLE is 'KJV' or 'LEB'
    FORMAT is 'xml' or 'json' or 'txt'
    """
    base_url = 'http://api.biblia.com/v1/bible/content/'
    url = base_url + bible + '.' + format
    return url

In [None]:
base = construct_base_url('LEB', 'txt')
base

In [None]:
def construct_url(base_url, passage, apikey=KEY):
    """Ensure BASE_URL, PASSAGE, and APIKEY are properly combined and
    encoded for opening a resource. Assumes the Bible version and
    return type are already in BASE_URL.
    """
    return base_url + '?' + urllib.parse.urlencode({'passage': passage,
                                                    'key': apikey})

In [None]:
url = construct_url(base, 'Mk4:1-9')
url
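
To see what the encoding step does without showing a real key, here is a minimal sketch that uses the placeholder key `'abcdefghi'` from the comment above. The point to notice is that `urlencode` escapes characters such as `:` that are not safe in a query string.

```python
import urllib.parse

# Placeholder values: 'abcdefghi' is not a real API key.
params = {'passage': 'Mk4:1-9', 'key': 'abcdefghi'}

# urlencode joins the pairs with '&' and percent-encodes unsafe characters.
query = urllib.parse.urlencode(params)
print(query)  # passage=Mk4%3A1-9&key=abcdefghi

full_url = 'http://api.biblia.com/v1/bible/content/LEB.txt' + '?' + query
print(full_url)
```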

In [None]:
# get the content from the URL
def fetch_url(url):
    # the with block closes the connection once the body has been read
    with urllib.request.urlopen(url) as response:
        return response.read()

In [None]:
content = fetch_url(url)
content

In [None]:
content.decode('utf-8')
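
Why the extra `decode` step? Reading from a URL returns `bytes`, not `str`. The sketch below shows the difference with a stand-in value built locally (no network access, and not real API output).

```python
# Stand-in for bytes returned by a network read (not real API output).
raw = 'café'.encode('utf-8')

# Decoding turns the bytes back into text.
text = raw.decode('utf-8')

print(type(raw).__name__, len(raw))    # bytes 5 (the é takes two bytes)
print(type(text).__name__, len(text))  # str 4
```

String methods such as `split` and `lower` then work on `text` just as they do on lines read from a file.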

## Discussion: How Far Do We Want To Go?
---

## Homework
---

* Read Chapter 14 and do the exercises. 