In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

# Dealing with Strings in Python
<!-- requirement: small_data/fha_by_tract.csv -->

### Goals

 - Strings in Python (...and other things)
 - Basic string processing in Python
 - The `StringIO` module in Python
 - Regular expressions

## The string data structure

A string is a sequence of characters.  In Python it's indicated by surrounding it with either single or double quotes:

        'The quick brown fox jumped over the lazy dog'
        "The quick brown fox jumped over the lazy dog"

They are pretty much interchangeable.  The only difference has to do with __escaping__:


### Escaping

Suppose you wanted to enter the string 

        I'm Anatoly, but some people call me "Toly."

how would you do it?  You can't just surround it with `".."` like 

        "I'm Anatoly, but some people call me "Toly.""
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

because Python would get confused and think that the string was over early (as shown).  Similarly, you can't just enclose it in single quotes because of the single quote in `I'm`.  Instead, when you want to insert a quote into a quoted string, you _escape_ it by writing it as `\"` or `\'`.  So we could represent the previous string as either:

In [None]:
str1 = "I'm Anatoly, but some people call me \"Toly.\""
str2 = 'I\'m Anatoly, but some people call me "Toly."' 
print str1
print str2
str1 == str2

> ### Gotcha  (file away for later, don't think about now)
>
> This is for Python.  Other languages have their own rules and conventions and you often have to interface with them.  You should try to avoid dealing with these quotation and escaping rules in code -- it is a source of many bugs.  
>
> In many languages, notably bash shell scripts and SQL, the two types of quotes are not the same.  What's even worse, each of three popular SQL backends (SQLite, MySQL, and Postgres) has substantively different rules about strings and about quoting!
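
One way to stay out of the quoting business, sketched here with the standard-library `sqlite3` module, is to pass values as query parameters rather than pasting them into the SQL text.  The `people` table and its `name` column are made up for illustration.

```python
import sqlite3

# In-memory database; the "people" table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")

tricky = 'I\'m Anatoly, but some people call me "Toly."'

# The ? placeholder lets the driver deal with the quotes in the value.
conn.execute("INSERT INTO people (name) VALUES (?)", (tricky,))

stored = conn.execute("SELECT name FROM people").fetchone()[0]
print(stored == tricky)
conn.close()
```

The value round-trips unchanged even though it contains both kinds of quotes, and no SQL-specific escaping appears in the Python code.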

### Raw strings

The use of the backslash as the escape character means that if you want to represent a backslash, you have to write `\\`.  This can get very annoying if you are writing strings that need a lot of backslashes, but not any other escaped characters.  (Regexes are the common example.)  To help with this, Python offers *raw strings*: if the opening quote is preceded by an `r`, Python leaves backslashes in the string as they are instead of treating them as the start of an escape sequence.

In [None]:
s1 = '\t'
s2 = r'\t'
len(s1), len(s2)

This is just a convenience of the interpreter; the strings that are created are exactly the same.

In [None]:
'\\t' == r'\t'  # compare values; `is` would test object identity

### Triple-quoted strings

The string literals we have seen so far must be written on a single line.  That is, they cannot contain literal newline characters.  Strings surrounded by three quote characters can, which is useful when dealing with large blocks of text.

In [None]:
print """Triple-quoted string can contain
many lines.
    All whitespace in the strings
is preserved.
    "This is another way to embed
quotes," he said."""

Again, this is just syntactic sugar.

In [None]:
"""a
b""" == "a\nb"  # compare values; `is` would test object identity

### Exercises


1. Fill in the following Python code
        >>> s = ...
        >>> print s
   
   so that the resulting output is
   
        Bob said "I'm not sure, but I think that the quick brown fox said it would 'jump over' the lazy dog."
   
2. Without running your code, what do you think is the output of just typing
        >>> s
   at the REPL?


### Docstrings

If a string literal is the first statement in a function, class, or module definition, that string becomes the [*docstring*](https://www.python.org/dev/peps/pep-0257/) for that object.  It is accessible as the `.__doc__` attribute, and Python tools will use it as documentation for that object.

In [None]:
def func():
    """This is a docstring."""
    pass

func.__doc__

By convention, docstrings use triple quotes.  This is because they are usually more than one line long.

In [None]:
def func():
    """This is a longer docstring.
    
    By convention, docstrings start with a brief
    single line description, followed by an empty
    line and a longer description."""
    pass

help(func)

## Unicode

The strings we have seen so far are *byte strings*: each character must be stored in a single byte (we'll discuss how below), meaning we're limited to 256 characters.  While this is enough for some languages, it is certainly not enough to encode all characters from all languages.

The solution to this is [Unicode](http://en.wikipedia.org/wiki/Unicode), which defines a mapping between characters and numbers, referred to as *code points*.  There are over 1.1 million code points in the unicode standard, but only about 10% of them have been used.

Python supports unicode via *unicode strings*.  They look and operate like byte strings, but start with a `u` prefix.  (In Python 3, unicode strings are the default, and a `b` prefix is necessary to mark byte strings.)  For instance:

In [None]:
str3 = u'I\'m Anatoly, but some people call me "Toly."'
print str1 == str3  # True in Python 2, which converts the ASCII byte string to unicode before comparing
print type(str1), type(str3)
print type(str1) == type(str3)

Both byte strings and unicode strings are subclasses of *basestring*.

In [None]:
isinstance(str1, basestring) and isinstance(str3, basestring)

Unicode becomes more important once we move beyond English characters.

In [None]:
u_string = u"a b c d e α β γ δ ε"
print u_string
u_string

IPython can display unicode characters.  The `repr` of a unicode string, however, is still restricted to printable [ASCII](http://en.wikipedia.org/wiki/ASCII) characters.  Code points outside of ASCII are shown with a `\u1234` escape.
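
To make the link between characters, code points, and `\u` escapes concrete, here is a small sketch.  It sticks to escapes and the `u` prefix so that it should run the same way under Python 2 and Python 3; the variable name `alpha` is just for illustration.

```python
import unicodedata

alpha = u"\u03b1"  # Greek small letter alpha, written as an escape

print(ord(u"a"))                # code point of an ASCII letter
print(hex(ord(alpha)))          # code point of alpha, in hex
print(unicodedata.name(alpha))  # the character's official Unicode name
print(alpha == u"\N{GREEK SMALL LETTER ALPHA}")  # escape by name instead
```

The four hex digits in the escape are exactly the code point that `ord` reports.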

When text is sent over the network or written to disk, it needs to be serialized to bytes.  This process is known as *encoding*, and there are many encodings to choose from.  The most popular currently is UTF-8.

In [None]:
b_string = u_string.encode('utf-8')
b_string

UTF-8 is convenient because it encodes all 128 ASCII characters to a single byte, using the same values as the ASCII encoding.  This means that all ASCII-encoded text is automatically UTF-8 encoded text.  Other characters are encoded as sequences of two or more bytes.  In this case, the Greek letters are encoded as two-byte sequences.  This is important when considering lengths and offsets.  Unicode strings count characters; byte strings count bytes.

In [None]:
len(u_string), len(b_string)
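
The two claims above (ASCII text is already valid UTF-8, and other characters take more bytes) can be checked separately.  This is a minimal sketch with its own illustrative variables, `ascii_text` and `greek_text`, rather than the notebook's `u_string`.

```python
ascii_text = u"a b c d e"
greek_text = u"\u03b1 \u03b2 \u03b3 \u03b4 \u03b5"  # alpha through epsilon

# ASCII-only text encodes to identical bytes under both encodings.
print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))

# Five letters plus four spaces: 9 characters.
print(len(greek_text))

# Each Greek letter takes two bytes, each space one: 5 * 2 + 4 = 14 bytes.
print(len(greek_text.encode("utf-8")))
```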

Different encodings take different approaches.  UTF-16 encodes each code point into two or four bytes, with all the common characters, in the "Basic Multilingual Plane", needing only two bytes.

In [None]:
u_string.encode('utf-16')

The various [ISO-8859-X encodings](https://en.wikipedia.org/wiki/ISO/IEC_8859) each encode the Latin alphabet (as in ASCII) and another alphabet in the high bytes.  For example, ISO-8859-7 encodes the Greek alphabet.

In [None]:
u_string.encode('iso8859-7')

If you try to encode with an encoding that doesn't support some of the characters in your string, you will get a `UnicodeEncodeError`.  This can be avoided by passing a second argument to the `.encode()` method, specifying what to do.  For example, unencodable characters can be replaced by one that can be encoded.

In [None]:
u_string.encode('ascii', 'replace')
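
Besides `'replace'`, the standard codecs machinery defines a few other error handlers.  The sketch below tries several of them on a short illustrative string (`text`, not the notebook's `u_string`) and also shows the error that the default, `'strict'`, raises.

```python
text = u"a \u03b1"  # ASCII letter, space, Greek alpha

try:
    text.encode("ascii")  # the default handler is "strict"
except UnicodeEncodeError as err:
    print("strict raises UnicodeEncodeError: " + err.reason)

results = {}
for handler in ("replace", "ignore", "xmlcharrefreplace", "backslashreplace"):
    results[handler] = text.encode("ascii", handler)
    print("{0}: {1!r}".format(handler, results[handler]))
```

Which handler is appropriate depends on where the bytes are going: `'ignore'` silently loses data, while `'xmlcharrefreplace'` and `'backslashreplace'` keep the code point recoverable.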

Byte strings can be decoded to unicode strings, but their encoding must be specified.

In [None]:
print b_string.decode('utf-8')

If you decode with a different encoding, you will get a different string.

In [None]:
print u_string.encode('iso8859-7').decode('iso8859-6')

As a rule of thumb, programs dealing with text should use unicode strings internally.  Only when you need to interface with the network or disk should encoding to or decoding from byte strings occur.  This way, you need only keep track of one piece of data, the unicode string, instead of two, the byte string and the encoding.

All this said, this is something that you _shouldn't_ have to worry about: so long as you use libraries, and those libraries are smart, all the needed conversions should be handled for you.  Unfortunately, it is still possible to hit these rough edges so it's good to know about them.
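
The rule of thumb above (unicode inside the program, bytes only at the edges) might look like the following when the "edge" is a file on disk.  This is only a sketch: it uses `io.open`, which accepts an `encoding` argument in both Python 2 and Python 3, and a throwaway temporary file created just for the demonstration.

```python
import io
import os
import tempfile

text = u"\u03b1 \u03b2 \u03b3"  # unicode string used inside the program

# Scratch file that exists only for this demonstration.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    # Encode on the way out to disk...
    with io.open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    # ...and decode on the way back in.
    with io.open(path, "r", encoding="utf-8") as fh:
        round_tripped = fh.read()
finally:
    os.remove(path)

print(round_tripped == text)
```

The encoding is named once, where the file is opened, and the rest of the code never sees a byte string.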

> #### Aside
>
> If byte strings just represent bytes, how does Python know which characters to use when printing them?  It doesn't: Python 2 writes the bytes out unchanged, and the terminal or notebook that displays them decodes them using its own encoding.  On this system (and most modern systems), that encoding is UTF-8.  This means that you can print UTF-8-encoded byte strings and have things just work.  Printing byte strings in other encodings doesn't work as well.

In [None]:
print u_string.encode('utf-8')
print u_string.encode('iso-8859-7')

## Basic string processing

The Python standard library provides a bunch of basic string functions.  For a complete list see https://docs.python.org/2/library/string.html

The general pattern is that everything is invoked in `str.operation(arguments)` notation.  Let's just jump to examples for:

- `split`: Splits a string along a substring

In [None]:
"Once upon a midnight dreary, while I pondered weak and weary, over many a quaint and curious volume of forgotten lore.".split(",")

In [None]:
"Note that the splitting string does not have to be just one character long.".split("not")

- `join` : The opposite of split

In [None]:
", ".join(["a","b","c"])

In [None]:
print "\n".join(["Look I can make a string", "that crosses lines!"])

- `strip` : Removes leading / trailing whitespace

In [None]:
"    why is there so much whitespace around this? \n\t   ".strip()

- `startswith` : Checks if one string starts with another

In [None]:
"The quick brown fox...".startswith("The")

- `format` : String substitution and formatting

In [None]:
print "Plug {0} into {1}".format("this", "that")

print "Hi {first} {last}!   Bye {first}.".format(first="Jane", last="Doe")

print "Bob is {:+.2f} feet tall".format(5.526)

location = { 'city' : 'New York', 'state': 'NY' }
print "Welcome to {city}, {state}".format( **location )

There are also several operators that work on strings.
- `%` : An alternate string substitution method

In [None]:
print "Can I buy a %s for $%.2f" % ("salad", 2.56)

- `in`: Check if one string is contained in another one

In [None]:
"ea" in "team", "I" in "team"

- `+`: Concatenate strings

In [None]:
"Left me" + "et right"

Strings behave like sequences (lists, for example), so you can "slice" them:

In [None]:
"The quick brown fox.."[4:9]

String formatting can be used to print out tables

In [None]:
import string
for i, c in enumerate(string.ascii_lowercase):
    print "{num:<2} {lower:>2} {upper:>2}".format(num=i, lower=c, upper=c.upper())

## StringIO in Python

If you have a file, it's easy to turn it into a string using `open` (likely wrapped in a `with` statement).  What if you want to turn a string into a file object?  Some Python libraries take file objects as arguments.  How might you use their functionality on strings (e.g. from web scraping)?  The answer is `StringIO`.

In [None]:
with open("small_data/fha_by_tract.csv") as fh:
    data = [row for row in fh]
data[:5]

In [None]:
from StringIO import StringIO

# Named `text` so that it doesn't shadow the `string` module imported earlier
text = "".join(data)

fh = StringIO(text)
string_data = [row for row in fh]
string_data[:5]
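
As an example of a library that wants a file object, here is a sketch that feeds a string to the standard `csv` module through `StringIO`.  The CSV text is made up (it is not the FHA file), and the `try`/`except` import is there only so the snippet runs on Python 3 as well, where `StringIO` lives in `io`.

```python
import csv

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO        # Python 3

# Made-up CSV text standing in for something scraped from the web.
text = "name,count\nalpha,1\nbeta,2\n"

# csv.reader accepts any object that yields lines, such as a file object.
rows = list(csv.reader(StringIO(text)))
print(rows)
```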

**Exercises**

1. Here's a string (gotten from running `ps auxww|tail -5` somewhere):

        root     31457  0.0  0.0  65996  3444 ?        Ss   04:21   0:00 sshd: preygel [priv]
        preygel  31459  0.0  0.0  65996  1444 ?        S    04:21   0:00 sshd: preygel@pts/3 
        preygel  31460  0.1  0.0  22492  3632 pts/3    Ss   04:21   0:00 -bash
        preygel  31478  0.0  0.0  18448  1256 pts/3    R+   04:22   0:00 ps auxww
        preygel  31479  0.0  0.0   7236   684 pts/3    S+   04:22   0:00 tail -10
   
   make a Python string that contains this as its contents.  
2. Write a function to extract just the second column of each row.
3. In the above example, why do we use `with` when opening a true file but not for `StringIO`?

## Regular expressions

Tasks like #2 in the exercises above are ubiquitous.  In this case -- because we have it on good authority that the column layout is fixed -- it's easy to do just by counting.  What if instead we had a file containing lines like the following

        Docket S13-396 . ID 30546 :  A photonic micro-structured vacuum-ultraviolet radiation source based on solid-state frequency conversion .  4/3/2014
        Docket S13-202 . ID 30260 :  Performance Enhancement of Transparent Conducting Electrodes by Mesoscale Metal Wires .  3/28/2014
        Docket S13-211 . ID 30257 :  The Self-Assembly of Semiconducting Single-Walled Carbon Nanotubes into Dense and Aligned Rafts on Patterned Substrates .  4/3/2014
        Docket S13-198 . ID 30246 :  Polymer matrices for ambient ionization mass spectrometry .  3/12/2014
               ^^^^^^^      ^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^
mixed in among other content, and we wanted to pick them out -- and to pick out the four underlined parts of each?

(P.S. Each of these is intended to be one line -- it just wraps in this view!)

This is a general class of problems: We want to be able to identify all strings that "look like _this_," and then to "extract _that_ bit of the string."  __Regular expressions__ provide a concise language for specifying _this_ and _that_, and regular expression engines in most programming languages (including Python) are a great tool for solving this class of problems.

Before we get down to the dry stuff, let's see what _this_ and _that_ look like in our example:

In [None]:
import re

lines = ["Docket S13-396 . ID 30546 :  A photonic micro-structured vacuum-ultraviolet radiation source based on solid-state frequency conversion .  4/3/2014",
"Docket S13-202 . ID 30260 :  Performance Enhancement of Transparent Conducting Electrodes by Mesoscale Metal Wires .  3/28/2014",
"One might imagine I could go back to the file I copied from and insert the lines that were actually there.  But why?",
"Docket S13-211 . ID 30257 :  The Self-Assembly of Semiconducting Single-Walled Carbon Nanotubes into Dense and Aligned Rafts on Patterned Substrates .  4/3/2014",
"Docket S13-198 . ID 30246 :  Polymer matrices for ambient ionization mass spectrometry .  3/12/2014",
"I surely must be a fish out of water.",
"Docket S13-360 . ID 30476 :  High-Performance Silicon Photoanode Passivated with an Ultrathin Nickel Film .  3/19/2014",
"  Docket S66-666 . ID 66666 :  On the effects of white space at the start of the string .  5/16/2014"]

# only create regex once!
regex = re.compile(r"Docket (S.*) [.] ID (\d*) : (.*) [.]  (\d+/\d+/\d+)")
for line in lines:
    m = regex.match(line)
    if m:
        print "Aha, we've found them", m.groups()
    else:
        print "Can't fool me that easily"

Regular expressions provide a very concise way of specifying _sets of strings_ from a few building blocks and operations.  It is good to think of a regular expression as a special type of program that tries to "eat" a string, but is picky about what it eats.  For a given regular expression, the "set of strings" mentioned earlier consists of those strings that it's willing to eat.  The more formal word for this is **matches**: A regular expression *matches* some set of strings.

Here are some building blocks that apply to matching a single character:
  - `.` : Matches any character (except a newline)
  - `\s`: Matches any whitespace character (`\S` is the opposite)
  - `\d`: Matches any digit (`\D` is the opposite)
  - `\w`: Matches any alphanumeric character or underscore (`\W` is the opposite)
  - `c` : Matches the character 'c' (and similarly for all characters that don't have some special meaning like `.` or `\` or `[`)
  - `[   ]`: Lists of characters, with ranges allowed: e.g. `[a-zA-Z0-9_]` is the same as `\w` in ordinary ASCII
  - `[^  ]`: Some characters have special meaning inside brackets, for instance a leading caret `^` indicates negation.  That is: `[^a-zA-Z0-9]` matches any single character _other than_ those that `[a-zA-Z0-9]` matches.
  - `$`: Matches the end of the string (or right before a newline)
  
Now the fun comes in when we build in notation for repetition and concatenation:
  - `AB`: If `A` and `B` are regular expressions, then a string $s$ will match `AB` if and only if it is the concatenation $s = s_A s_B$ of a string $s_A$ matching `A` with a string $s_B$ matching `B`.  In other words, `A B` will eat a string only if it can first let `A` eat some amount and then let `B` eat from what's left over -- and they must both be happy with what they get.
  - `*`: If `A` is any regular expression, then `A*` matches zero or more repetitions of `A`.
  - `+`: ... matches one or more repetitions.
  - `?`: ... matches 0 or 1 repetitions.
  - `{m,n}`: ... matches between m and n repetitions.
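
A quick way to get a feel for the repetition operators is to try them on tiny inputs.  The helper `full_match` below is hypothetical (it is not part of `re`); it appends `$` so that the pattern has to eat the whole string, and it relies on `re.match`, described just below, to anchor the start.

```python
import re

def full_match(pattern, s):
    """Return True if pattern matches all of s (note the appended $)."""
    return re.match(pattern + "$", s) is not None

print(full_match(r"a*", ""))              # True: zero repetitions are fine for *
print(full_match(r"a+", ""))              # False: + needs at least one
print(full_match(r"colou?r", "color"))    # True: ? makes the u optional
print(full_match(r"colou?r", "colouur"))  # False: at most one u
print(full_match(r"\d{2,4}", "123"))      # True: between 2 and 4 digits
print(full_match(r"\d{2,4}", "12345"))    # False: 5 digits is too many
```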

Python implements regular expressions in the `re` module.  You should know:

1. `re.match` requires that the match begin at the start of the string (though it need not eat all the way to the end).  This is not the "normal" behavior for regular expression libraries, which allow the match to begin anywhere: `re.search` gives this behavior.

2. `re.match` returns a "match object".  It is not just a boolean value, but it _is_ "truthy" which is why we could write `if m:`
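
The difference between the two functions shows up as soon as the interesting part is not at the very start of the string.  A small sketch, using a shortened version of the last test line from the example above:

```python
import re

line = "  Docket S66-666 . ID 66666"

print(re.match(r"Docket", line))           # None: the line starts with spaces
print(re.search(r"Docket", line).start())  # 2: found after the leading spaces
print(bool(re.match(r"\s*Docket", line)))  # True: match objects are truthy
```

If leading whitespace should be tolerated, either switch to `re.search` or let the pattern itself eat the whitespace, as in the last line.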


In [None]:
def ismatch(regex, string):
    print "Yes" if re.match(regex, string) else "No"

In [None]:
ismatch('..', 'a')

In [None]:
ismatch(r'\d\w\D\s', '1a. Item')

In [None]:
ismatch('[ABC][^ABC][A-Z]', 'AAA')

In [None]:
ismatch('[ABC][^ABC]*[A-Z]*', 'AAA')

Finally, if a regular expression _does_ match there's another verb that applies: **captures**.  If we put part of our expression in _parentheses_ `(   )` then it will be _captured_.  This  means that the matcher will remember which part of the string was eaten by the sub-expression inside of the parentheses, so that we can access it afterwards.

The capture groups are accessed through the match object: `m.group(..)`.  Note that the groups are **one-indexed**, not zero-indexed.  (More precisely, `m.group(0)` is the entire portion of the string that matched.  This is useful because the match doesn't have to eat the whole string.)  A tuple of all the captured groups is available from `m.groups()`.

In [None]:
m = re.match('([ABC])([^ABC]*)([A-Z]*)', 'AAA')
m.group(1), m.group(2), m.group(3)
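
To see how `group(0)` differs from the numbered groups, it helps to use a string that the pattern does not eat completely.  The `key=value` input here is made up for illustration.

```python
import re

m = re.match(r"(\w+)=(\w+)", "color=blue; size=large")

print(m.group(0))  # color=blue  (everything the pattern ate)
print(m.group(1))  # color       (first capture group)
print(m.group(2))  # blue        (second capture group)
print(m.groups())  # tuple of both capture groups
```

The rest of the string, `; size=large`, is simply left uneaten.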

Parentheses are also used to indicate that an entire sub-expression should be repeated.  To avoid these operating as capture groups, non-capturing parentheses `(?:   )` can be used.

In [None]:
m = re.match('(abc)+(?:def)+', 'abcabcdefdef')
m.groups()

**Breaking down our example: the RE**

Now we're ready to break down our example.  Let's start with just the regular expression:

                         6
                     |vvvvvvvv|
        r"Docket (S.*) [.] ID (\d*) : (.*) [.]  (\d+/\d+/\d+)"
        ^ |^^^^^^|^^^| ^^^     ^^^
        1   2       3   4       5
        
1. We saw before that `u"..."` told Python that something is a Unicode string.  Well, `r"..."` tells it that it is a _raw_ string. That means that escaping rules we talked about earlier do not apply!  This is helpful for regular expressions because otherwise we'd have to write things like `\\d` in place of `\d`.
2. The regular expression `r"Docket "` would match exactly the string `"Docket "`.  None of the characters involved are special, not even space.
3. This is a capture group.  The regular expression inside matches any string that starts with an `'S'`: an `'S'` followed by zero or more of any character.  Why doesn't it gobble up the rest of the string in our example?  Because for the whole expression to match, the next bit, 4, has to get to "eat" as well.
4. This regular expression matches precisely one string: `"."`
    We could also have written this as `\.`, but we could _not_ have written just `.`.  That would match _any_ one character string.
5. This matches a string of digits (possibly empty, since `*` allows zero repetitions).
6. What does this segment match?  Notice that it has no `+`, `*`, `?`, etc. in it so that it matches exactly one string: " . ID "
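
The question in item 3 (why `.*` doesn't gobble everything) is easier to see on a line where the following pieces could match in more than one place.  The line below is made up so that `" . ID "` appears twice.  The second pattern uses `*?`, the non-greedy form of `*`, which is not covered above and is shown only for contrast.

```python
import re

# Made-up line containing " . ID " twice, to expose the greedy behavior.
line = "Docket S1 . ID 2 . ID 3"

# Greedy: .* eats as much as it can, then gives characters back
# until the rest of the pattern is able to match.
greedy = re.match(r"Docket (S.*) [.] ID (\d*)", line)
print(greedy.groups())

# Non-greedy: .*? eats as little as it can, stopping at the first
# place where the rest of the pattern can match.
lazy = re.match(r"Docket (S.*?) [.] ID (\d*)", line)
print(lazy.groups())
```

In the docket lines from the example, `" . ID "` appears only once, so both forms would capture the same thing there.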

So in short, we're matching a string that looks like _this_:
  - It starts with "Docket ";
  - Then comes a string, starting with an "S", that we capture;  
  - Then the string " . ID ";
  - Then comes a string of digits that we capture;
  - Then comes the string " : ", then anything (which we capture), followed by a space and a period;
  - Then come two spaces and a date like "4/3/2014", which we capture.

Regexes can easily become nearly incomprehensible, so Python offers a "verbose" option.  In these regexes, whitespace is ignored unless it is in a character class or preceded by a backslash, and `#` acts as a line comment character.

In [None]:
regex_v = re.compile(r"""Docket\ 
                         (S.*)\ [.]\    # Match the docket number, followed by " . "
                         ID\ (\d*)\ :\  # ID number is a set of digits
                         (.*)\          # The title could be anything, followed by a space
                         [.]\ \ (\d+/\d+/\d+)  # A period, two spaces, and a date
                      """, re.VERBOSE)
for line in lines:
    m = regex_v.match(line)
    if m:
        print "Aha, we've found them", m.groups()
    else:
        print "Can't fool me that easily"

### Exercises
1. Write a regular expression that'll match (US) phone numbers.  Use this to write a function that'll take in a string and output the area code and separately the rest of the number (all punctuation, etc. removed).
1. Write a regular expression that matches [IPv4 addresses](https://en.wikipedia.org/wiki/IP_address).
1. Check out http://regexone.com/ for a **fun** interactive "tutorial".

### Further resources

See https://docs.python.org/2/library/re.html for Python's syntax / support of regular expressions.  

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*