# Think Python: Week 12

<img src="reassuring.png" style="align: right;" />
Slides: http://github.com/sboisen/training/ThinkPython/Week12

## Word Frequency Analysis
* Python is great because you can easily do incredibly powerful things ... like this
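
As a minimal sketch of what that can look like (the sample text and the `word_histogram` helper are illustrative assumptions, not the course solution), a plain dictionary is enough to count how often each word occurs. The code is written to run under both Python 2 and Python 3.

```python
import string

def word_histogram(text):
    """Map each word in text to the number of times it occurs."""
    hist = {}
    for word in text.split():
        # normalize: strip surrounding punctuation and fold case
        word = word.strip(string.punctuation).lower()
        if word:
            hist[word] = hist.get(word, 0) + 1
    return hist

# made-up sample text, for illustration only
sample = "The quick brown fox jumps over the lazy dog. The dog sleeps."
hist = word_histogram(sample)
print(hist['the'])   # 3
print(hist['dog'])   # 2
```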


## Decorate, Sort, Undecorate (DSU)
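
DSU sorts items by a computed key: decorate each item by pairing it with its key, sort the pairs, then strip the keys off again. A minimal sketch, sorting a made-up word list from longest to shortest (the `sort_by_length` name and the sample words are only for illustration):

```python
def sort_by_length(words):
    """Return words sorted from longest to shortest using DSU."""
    # decorate: pair each word with its sort key
    decorated = [(len(word), word) for word in words]
    # sort: tuples compare by first element, then by second on ties
    decorated.sort(reverse=True)
    # undecorate: drop the keys, keep the words
    return [word for length, word in decorated]

words = ['spam', 'eggs', 'ham', 'bacon', 'a']
print(sort_by_length(words))   # ['bacon', 'spam', 'eggs', 'ham', 'a']
```

The built-in `sorted(words, key=len, reverse=True)` does the decoration for you, though ties then keep their original order instead of being broken by the word itself.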

## Optional Parameters

In [None]:
def foo(arg=10): 
    print arg
foo()

In [None]:
def foo(arg1=10, arg2=20): 
    print arg1 + arg2
foo(12)

In [None]:
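# a parameter without a default can't follow one that has a default:
# this definition raises a SyntaxError
def foo(arg1=10, arg2): 
    print arg1 + arg2
foo(12)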

## Sets

In [None]:
set1 = set('abecedarian')
print set1, type(set1)
set2 = set(['d', 'e', 'f', 'g'])
print set2 

In [None]:
print set1.intersection(set2)
print set1.union(set2)
print set1.difference(set2)
print set2.difference(set1)

## Repetition Operator `*`

In [None]:
# repeat a string
'spam ' * 3

In [None]:
# repeat the items of a list: the result is a list
['spam '] * 3

In [None]:
['spam', 'ham',] * 3

In [None]:
['s', 'p', 'a', 'm', ' '] * 3

## Choosing Data Structures

* Factors: ease of implementation, ease of understanding, speed versus storage, ...
* The same problem may use different data structures for different purposes: one size doesn't fit all
* Optimizing implies you know the costs of your different resources
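
One way to learn those costs is to measure them. The sketch below (sizes and variable names chosen only for illustration) uses the standard `timeit` module to compare membership tests on a list and on a set holding the same items. Actual timings will vary from machine to machine.

```python
import timeit

items = list(range(100000))
as_list = items
as_set = set(items)
target = 99999          # worst case for the list: the last item

# time 100 membership tests against each structure
list_time = timeit.timeit(lambda: target in as_list, number=100)
set_time = timeit.timeit(lambda: target in as_set, number=100)

print('list: %.6f seconds' % list_time)
print('set:  %.6f seconds' % set_time)
```

The list has to be scanned item by item, while the set finds the item by hashing, so the set typically trades some extra storage for much faster lookups.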

> "The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming."
>
> --  [Donald Knuth](https://en.wikipedia.org/wiki/Donald_Knuth), *Computer Programming as an Art*

## Debugging

> Debugging is like an experimental science. You should have at least one hypothesis about what the problem is. If there are two or more possibilities, try to think of a test that would eliminate one of them.

## Code Review: `createcompxml.py`
Review areas:
* bugs and logic flaws
* generality
* re-usability
* maintainability
* best practices
* idiomatic style ("Pythonic" code)
* general style

### How to Read Other People's Code
* Start with purpose
* Look at input and output
    * Preferably with examples
* Get the big picture (top-down)
* Understand additional modules

## Your Code Should Model Your Data
* Either give your data table a header row, or add one to your code

In [None]:
columns = ['category', 'id', '', '', 'label', 'productId']
row23 = ['', u'LLS:NLGALLENGRENO', '', '', u'Allen and Greenough\'s New Latin Grammar', 27101.0]
row23dict = dict(zip(columns, row23))
print row23dict
row23dict['label']

## Comments
* Put a single comment block at the top of the file explaining
    * the overall purpose in a single sentence
    * high-level assumptions


        """Reads an Excel sheet with resource and product IDs, and outputs an XML-formatted version. 

        Input file format: ...
        Output file format:
        """
        
* Comment on code that's unusual, odd, complex, ...
* Don't comment on code that's obvious

            # check to make sure elem isn't empty
            if len(elem):


## Import Style
* group your imports at the top of your file so people don't have to hunt for them
    * CI's standard: first built-in modules, then 3rd-party modules, then in-house modules


        import xlrd
        import xml.etree.cElementTree as ET             ## review: lower-case here would be more standard
        
        DATAFILE = 'compcharttext.xlsx'

## Factor out Commonalities and Redundancies

In [None]:
# instead of this
if len(elem):
    if not elem.text or not elem.text.strip():
        elem.text = i + "  "
    if not elem.tail or not elem.tail.strip():
        elem.tail = i
    for elem in elem:
        indent(elem, level+1)
    if not elem.tail or not elem.tail.strip():
        elem.tail = i
else:
    if level and (not elem.tail or not elem.tail.strip()):
        elem.tail = i

In [None]:
# one function to re-use, named positively 
def hasnochars(text):
    # the "not text" test is needed: elem.text and elem.tail can be None
    return not text or not text.strip()

if len(elem):
    if hasnochars(elem.text):
        elem.text = i + "  "
    if hasnochars(elem.tail):
        elem.tail = i
    for elem in elem:
        indent(elem, level+1)
    if hasnochars(elem.tail):
        elem.tail = i
else:
    if level and hasnochars(elem.tail):
        elem.tail = i


## General Style
* If you have both input and output files, name them distinctly
* Generality: functions that take an open file stream (rather than a file name) can be reused in more situations
    * Especially for output: easier to switch from console output (debugging) to file output (production)

In [None]:
def output_it(value, outstr=None):
    print >> outstr, "My value is:", value
    
output_it('some value')
# or, output_it('some value', open('outputfile.txt', 'w'))

## Rework Your Data Formats
* Sometimes the best way to improve your code is to make your input or output format better
    * Example: rather than having two kinds of lines in the input file for section headers, just put the section on every line
        * Then track that column, and emit a new section marker when it changes
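
A minimal sketch of that approach, assuming hypothetical rows whose first column carries the section name (the row layout, the sample rows, and the `emit_sections` helper are invented for illustration and are not taken from `createcompxml.py`):

```python
def emit_sections(rows):
    """Return output lines, adding a section marker whenever column 0 changes."""
    lines = []
    current = None
    for section, label in rows:
        if section != current:
            # the section column changed: start a new section
            lines.append('== %s ==' % section)
            current = section
        lines.append('  ' + label)
    return lines

# hypothetical input: the section name is repeated on every row
rows = [
    ('Grammars', 'New Latin Grammar'),
    ('Grammars', 'Greek Grammar'),
    ('Lexicons', 'Latin Dictionary'),
]
for line in emit_sections(rows):
    print(line)
```

Because every row is self-describing, the loop needs only one kind of line and one piece of state (`current`).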

In [None]:
## review: comment your program and its high-level assumptions
"""Reads an Excel sheet with resource IDs and ___ and outputs an XML-formatted version. 

Input file format: ...

"""
## review: group your imports at the top: don't sprinkle them throughout
import xlrd
import xml.etree.cElementTree as ET             ## review: lower-case here would be more standard

DATAFILE = 'compcharttext.xlsx'

def sheet_matrix(file=DATAFILE):
    """Return the data from DATAFILE as a list of lists. Doesn't include the headers. 
    """
    ## review: use the file parameter here rather than DATAFILE
    sheet = xlrd.open_workbook(DATAFILE).sheet_by_index(0)
    rows = []
    for index in range(0, sheet.nrows):
        rows.append(sheet.row_values(index))
    return rows

x = sheet_matrix()

## Additional Resources

* <img src="bd.png" style="display: inline;" />The first four exercises introduce simple kinds of natural language processing (NLP). The [Natural Language Toolkit](http://www.nltk.org/) (NLTK) is a feature-rich Python library for all kinds of NLP tasks: CI uses this in multiple ways. 
    * <img src="bd.png" style="display: inline;" /><img src="bd.png" style="display: inline;" />I gave [a talk on NLTK at LinuxFest NW in 2008](http://www.semanticbible.com/other/talks/2008/nltk/main.html) if you're interested in more details. 
    * <img src="bd.png" style="display: inline;" />We could spend a whole session introducing NLP and NLTK if people are interested
* <img src="bd.png" style="display: inline;" />Python's `collections` module has a [`Counter` class](https://docs.python.org/2/library/collections.html) that simplifies histograms
    * NLTK has a much more powerful `FreqDist` class
* <img src="bd.png" style="display: inline;" />[Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law) (Exercise 13.9) describes many patterns in social behavior, including much of language: it's worth understanding how it works and thinking about where it applies. 
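
As a quick sketch of the `Counter` approach mentioned above (the sample sentence is made up), counting words and asking for the most frequent ones takes only a few lines:

```python
from collections import Counter

# made-up sample sentence, for illustration only
words = "the cat and the hat and the bat".split()
counts = Counter(words)

print(counts['the'])            # 3
print(counts.most_common(2))    # [('the', 3), ('and', 2)]
```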
