# Chapter 07: File I/O and Regular Expressions
___

## Part 1: Files

Python uses `file` objects to interact with external files on your computer. These can be any sort of file you have: audio files, text files, emails, Excel documents, and so on.

__Note__: You will probably need to install additional libraries or modules to work with some of these file types, but they are easily available. (We will cover installing modules later in the course.)

Python has a built-in `open()` function that lets us open and work with basic file types. First we need a file, though, so we'll use some IPython magic to create a text file!

In [249]:
%%writefile test.txt
Hello, this is a quick test file

Writing test.txt


### 1. Opening a file

We can open a file with the `open()` function, which takes arguments (also called parameters). Let's see how it is used:

In [250]:
# open the test.txt we made earlier
my_file = open('test.txt')

In [273]:
open?

In [275]:
print file.__doc__

file(name[, mode[, buffering]]) -> file object

Open a file.  The mode can be 'r', 'w' or 'a' for reading (default),
writing or appending.  The file will be created if it doesn't exist
when opened for writing or appending; it will be truncated when
opened for writing.  Add a 'b' to the mode for binary files.
Add a '+' to the mode to allow simultaneous reading and writing.
If the buffering argument is given, 0 means unbuffered, 1 means line
buffered, and larger numbers specify the buffer size.  The preferred way
to open a file is with the builtin open() function.
Add a 'U' to mode to open the file for input with universal newline
support.  Any line ending in the input file will be seen as a '\n'
in Python.  Also, a file so opened gains the attribute 'newlines';
the value for this attribute is one of None (no newline read yet),
'\r', '\n', '\r\n' or a tuple containing all the newline types seen.

'U' cannot be combined with 'w' or '+' mode.
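
The three basic modes from the docstring can be tried out in a few lines. This is only a minimal sketch, written for Python 3; the file name `modes_demo.txt` and the temporary directory are made up for the illustration so that `test.txt` is left alone.

```python
import os
import tempfile

# Hypothetical scratch file; a temporary directory keeps test.txt untouched.
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

# 'w' creates the file (or truncates an existing one) for writing.
f = open(path, 'w')
f.write('first\n')
f.close()

# 'a' appends to the end instead of truncating.
f = open(path, 'a')
f.write('second\n')
f.close()

# 'r' (the default) opens the file for reading only.
f = open(path, 'r')
contents = f.read()
f.close()

print(repr(contents))
```

Opening the same file with `'w'` a second time would have discarded `first`, which is why `'a'` is used for the second write.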



### 2. Reading a file

In [251]:
# we can now read the file
my_file.read()

'Hello, this is a quick test file'

In [252]:
# but what happens if we try to read it again?
my_file.read()

''

#### What happens?

After the first `read()`, the reading "cursor" sits at the end of the file, so there is nothing left to read. We can move the cursor back to the start like this:

In [253]:
# seek to the start of file (index 0)
my_file.seek(0)

In [254]:
# Now read again
my_file.read()

'Hello, this is a quick test file'
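
The position of the "cursor" can also be inspected with the file object's `tell()` method. A minimal sketch, assuming Python 3; the scratch file `cursor_demo.txt` is hypothetical and only stands in for `test.txt`.

```python
import os
import tempfile

# Hypothetical scratch file with the same content as test.txt.
path = os.path.join(tempfile.mkdtemp(), 'cursor_demo.txt')
with open(path, 'w') as f:
    f.write('Hello, this is a quick test file')

f = open(path)
start = f.tell()      # 0: nothing has been read yet
first = f.read()      # reads everything and moves the cursor to the end
second = f.read()     # '' because nothing is left after the cursor
f.seek(0)             # move the cursor back to the start
third = f.read()      # the full text again
f.close()

print(start, repr(second), third == first)
```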

To get the contents as a list of lines rather than as one string, we can use the `readlines()` method. Like `read()`, it starts from the current cursor position, so we seek back to the start first. Be cautious about using it on large files, since everything will be held in memory. We will learn how to iterate over large files later.

In [256]:
# readlines returns a list of the lines in the file.
my_file.seek(0)
my_file.readlines()

['Hello, this is a quick test file']

### 3. Writing a file

By default, `open()` opens the file for reading only. To write to the file we need to pass a mode as the second argument: `'w'` opens it for writing and truncates any existing content, and `'w+'` additionally allows reading it back. For example:

In [257]:
# add a second argument to the function: 'w+' opens for writing (truncating the file) and also allows reading
my_file = open('test.txt','w+')

In [258]:
# write to the file
my_file.write('This is a new line')

In [261]:
# read the file
my_file.seek(0)
my_file.read()

'This is a new line'

### 4. Iterating through a File

Let's get a quick preview of a `for` loop by iterating over a text file. First let's make a new text file with some IPython magic:

In [262]:
%%writefile test.txt
First Line
Second Line

Overwriting test.txt


Now we can use a bit of flow control to loop over every line of the file with `for` and do something with each one:

In [263]:
for line in open('test.txt'):
    print(line)

First Line

Second Line


<font color="red">When iterating this way, the whole text file is never stored in memory at once, so be careful not to call `.read()` on a large file</font>.
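
The blank lines in the output above appear because each `line` still ends with its own newline and printing adds another. A minimal sketch, assuming Python 3 and a hypothetical scratch file `lines_demo.txt`, that strips the trailing newline while still reading one line at a time:

```python
import os
import tempfile

# Hypothetical scratch file with the same two lines as test.txt.
path = os.path.join(tempfile.mkdtemp(), 'lines_demo.txt')
with open(path, 'w') as f:
    f.write('First Line\nSecond Line\n')

line_count = 0
longest = ''
with open(path) as f:
    for line in f:                    # the file object yields one line per iteration
        text = line.rstrip('\n')      # drop the newline that the line carries
        print(text)
        line_count += 1
        if len(text) > len(longest):
            longest = text
```

Only the current line and two small summary values are kept, so the same loop would work on a file much larger than memory.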

In [266]:
with open('test.txt') as f:
    f.readline()

In [267]:
help(f.readline)

Help on built-in function readline:

readline(...)
    readline([size]) -> next line from the file, as a string.
    
    Retain newline.  A non-negative size argument limits the maximum
    number of bytes to return (an incomplete line may be returned then).
    Return an empty string at EOF.



### 5. Do we need to close the `file`?

In [269]:
my_file.close()

In [272]:
del f
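
One way to look at the question in the heading: a file opened with a plain `open()` stays open until `close()` is called, whereas a `with` block closes it automatically when the block ends. The `closed` attribute of a file object shows the difference. A minimal sketch, assuming Python 3 and a hypothetical scratch file `close_demo.txt`:

```python
import os
import tempfile

# Hypothetical scratch file.
path = os.path.join(tempfile.mkdtemp(), 'close_demo.txt')
with open(path, 'w') as f:
    f.write('data')

# Opened by hand: stays open until close() is called.
manual = open(path)
closed_before = manual.closed     # False
manual.close()
closed_after = manual.closed      # True

# Opened in a with block: closed automatically on leaving the block.
with open(path) as managed:
    closed_inside = managed.closed    # False
closed_outside = managed.closed       # True

print(closed_before, closed_after, closed_inside, closed_outside)
```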

## Part 2: Regular Expression in Python

As we mentioned in the previous chapter, regexes are one of the most useful things you can work with in programming. They are a mini-language for matching text.

`re` is the Python module for working with regular expressions.

In [100]:
import re

In [165]:
re?

In [None]:
help(re)

In [104]:
s = 'hello'

### 1. `re.compile`: create a regex pattern object

`re.compile()` will return an `SRE_Pattern` object.

In [101]:
re.compile?

In [110]:
pat = re.compile('e')

In [111]:
pat, type(pat)

(re.compile(r'e'), _sre.SRE_Pattern)
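
A compiled pattern offers the same operations as the module-level functions, but as methods, which is convenient when one pattern is reused many times. A small sketch; the list of sample words is made up for the illustration.

```python
import re

pat = re.compile('e')
words = ['hello', 'world', 'regex']   # made-up sample data

# pat.search(w) plays the role of re.search(pat, w).
has_e = [w for w in words if pat.search(w)]

# pat.findall(w) plays the role of re.findall(pat, w).
counts = [len(pat.findall(w)) for w in words]

print(has_e)
print(counts)
```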

### 2. `re.search`: Looking for the first match to a given pattern

`re.search()` will return an `SRE_Match` object.

In [109]:
re.search?

In [153]:
result = re.search(pat, s)

In [154]:
result

<_sre.SRE_Match at 0x7f24343436b0>

In [155]:
help(result)

Help on SRE_Match object:

class SRE_Match(__builtin__.object)
 |  The result of re.match() and re.search().
 |  Match objects always have a boolean value of True.
 |  
 |  Methods defined here:
 |  
 |  __copy__(...)
 |  
 |  __deepcopy__(...)
 |  
 |  end(...)
 |      end([group=0]) -> int.
 |      Return index of the end of the substring matched by group.
 |  
 |  expand(...)
 |      expand(template) -> str.
 |      Return the string obtained by doing backslash substitution
 |      on the string template, as done by the sub() method.
 |  
 |  group(...)
 |      group([group1, ...]) -> str or tuple.
 |      Return subgroup(s) of the match by indices or names.
 |      For 0 returns the entire match.
 |  
 |  groupdict(...)
 |      groupdict([default=None]) -> dict.
 |      Return a dictionary containing all the named subgroups of the match,
 |      keyed by the subgroup name. The default argument is used for groups
 |      that did not participate in the match
 |  
 |  groups(...)
 | 

In [132]:
result.group(0)

'e'

You can try more complex regular expressions, of course.

In [133]:
sentence = ("A symmetry of a pattern is, loosely speaking, a way of transforming "
            "the pattern so that the pattern looks exactly the same after the "
            "transformation.")

In [217]:
results = re.search(r"(?P<PAT>pattern).*?(?P=PAT)", sentence, re.DOTALL+re.M)

Here `(?P<PAT>pattern)` creates a __named capturing group__ called `PAT` whose regular expression is `pattern`. The captured group can then be __backreferenced__ with `(?P=PAT)`.

In [220]:
print(results.groupdict())

{'PAT': 'pattern'}
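
Besides `groupdict()`, a named group can be read directly by name or by position. A short sketch on the same sentence, assuming Python 3; the variable names are chosen only for this illustration.

```python
import re

sentence = ("A symmetry of a pattern is, loosely speaking, a way of transforming "
            "the pattern so that the pattern looks exactly the same after the "
            "transformation.")

m = re.search(r"(?P<PAT>pattern).*?(?P=PAT)", sentence)

by_name = m.group('PAT')     # the named group, looked up by name
by_index = m.group(1)        # the same group, looked up by position
whole = m.group(0)           # the entire match, backreference included
located = sentence[m.start('PAT'):m.end('PAT')]   # slice given by the group's span

print(by_name, by_index, located)
print(whole)
```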


For an __unnamed capturing group__, you can refer back to the group by its number:

In [229]:
results = re.search(r"(pattern).*?\1", sentence, re.DOTALL+re.MULTILINE)

In [230]:
print(results.group(0))

pattern is, loosely speaking, a way of transforming the pattern


In [200]:
result = re.search(r"([\S]*?)(?=ing)", sentence)

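Note that `(?=ing)` here is a __zero-width lookahead assertion__. Although it is written in parentheses, it is not a capturing group: it consumes no text and only asserts that `ing` follows at the current position. That is why the pattern has a group 1 but no group 2.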

In [202]:
result.group(1)

'speak'

In [203]:
result.group(2)

IndexError: no such group
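
Because a lookahead is zero-width, the text it checks never becomes part of the match. A short sketch comparing it with an ordinary group; the string `'speaking'` is just a shortened stand-in for the sentence.

```python
import re

text = 'speaking'

look = re.search(r"(\S*?)(?=ing)", text)   # lookahead: checks for 'ing' without consuming it
plain = re.search(r"(\S*?)(ing)", text)    # ordinary group: consumes and captures 'ing'

look_whole, look_groups, look_end = look.group(0), look.groups(), look.end()
plain_whole, plain_groups, plain_end = plain.group(0), plain.groups(), plain.end()

print(look_whole, look_groups, look_end)
print(plain_whole, plain_groups, plain_end)
```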

### 3. `re.findall`: Find all matches to a given pattern

In [151]:
re.findall?

In [145]:
result = re.findall(r"pattern", sentence)

In [143]:
result.count('pattern')

3

In [146]:
type(result)

list

In [148]:
results = re.findall(r".at.", sentence)

In [149]:
results

['patt', 'patt', 'hat ', 'patt', 'mati']

In [150]:
# Case-insensitive matching
print(re.findall(r"h", "Hello there! How may I help you?"))
print(re.findall(r"h", "Hello there! How may I help you?", re.IGNORECASE))

['h', 'h']
['H', 'h', 'H', 'h']


### 4. `re.match`: Apply the pattern at the start of the string

`re.match()` will also return an `SRE_Match` object.

In [240]:
result = re.match(r'.*?pattern', sentence)

In [241]:
result

<_sre.SRE_Match at 0x7f243434c780>

In [242]:
result.group(0)

'A symmetry of a pattern'
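
The difference from `re.search` is that `re.match` only tries the pattern at position 0 of the string. A small sketch on a made-up string:

```python
import re

text = 'loosely speaking'

at_start = re.match(r'speaking', text)        # None: the string does not start with 'speaking'
anywhere = re.search(r'speaking', text)       # found further along the string
with_prefix = re.match(r'.*?speaking', text)  # matches because '.*?' absorbs the prefix

print(at_start)
print(anywhere.start())
print(with_prefix.group(0))
```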

### 5. `re.sub`, `re.subn`: substitution

In [233]:
re.subn?

In [158]:
re.sub('pattern', 'PATTERN', sentence)

'A symmetry of a PATTERN is, loosely speaking, a way of transforming the PATTERN so that the PATTERN looks exactly the same after the transformation.'

In [234]:
re.subn('pattern', 'PATTERN', sentence, count=1)

('A symmetry of a PATTERN is, loosely speaking, a way of transforming the pattern so that the pattern looks exactly the same after the transformation.',
 1)
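
The replacement passed to `re.sub` can also refer to captured groups, or be a function that receives each match object. A sketch on made-up strings:

```python
import re

# Backreferences \1, \2, \3 in the replacement reorder the captured groups.
date = '2015-03-14'
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', date)

# A function as replacement is called once per match with the match object.
shouted = re.sub(r'pattern', lambda m: m.group(0).upper(), 'a pattern in a pattern')

# subn returns the new string together with the number of substitutions made.
replaced, n = re.subn(r'pattern', 'P', 'a pattern in a pattern')

print(swapped)
print(shouted)
print(replaced, n)
```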

### 6. `flags`: an integer bit mask that selects the options

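Each option corresponds to one bit of the integer, so a flag value can be read as a binary number such as `00000000`:
* For example, `flags=2` is `00000010` in binary, which switches on `re.IGNORECASE`.
* Flags are combined by setting several bits at once: `flags=66` is `01000010` (64 + 2), i.e. `re.VERBOSE` together with `re.IGNORECASE`.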

In [160]:
re.IGNORECASE

2

In [170]:
re.LOCALE

4

In [171]:
re.MULTILINE

8

In [169]:
re.DOTALL

16

In [172]:
re.UNICODE

32

In [173]:
re.VERBOSE

64

In [174]:
re.DEBUG

128

In [182]:
re.findall(r'this', "This is a string", flags=66)

['This']

In [183]:
re.findall(r'this', "This is a string", re.IGNORECASE+re.VERBOSE)

['This']
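
Since each flag occupies its own bit, flags are more commonly combined with the bitwise OR operator `|` than with `+`; the two agree only as long as no flag is repeated. A minimal sketch:

```python
import re

# Combine two flags with bitwise OR.
combined = re.IGNORECASE | re.VERBOSE
combined_value = int(combined)                        # 64 + 2

# Test whether a particular flag's bit is set.
has_ignorecase = bool(combined & re.IGNORECASE)
has_dotall = bool(combined & re.DOTALL)

matches = re.findall(r'this', 'This is a string', flags=combined)

# OR-ing a flag with itself is harmless; adding it to itself changes the value.
or_twice = int(re.IGNORECASE | re.IGNORECASE)         # still the IGNORECASE bit
add_twice = int(re.IGNORECASE) + int(re.IGNORECASE)   # now equals the LOCALE bit

print(combined_value, has_ignorecase, has_dotall, matches)
print(or_twice, add_twice)
```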

### 7. `re.escape`: Escape all non-alphanumeric characters in a pattern

In [247]:
re.escape(r'str.*\b')

'str\\.\\*\\\\b'

In [248]:
re.findall(re.escape(r'str.*\b'), "This is a string", re.IGNORECASE+re.VERBOSE)

[]
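
`re.escape` is mostly useful when a literal piece of text, for instance user input, has to be searched for as-is. A sketch with a made-up price string:

```python
import re

literal = '$9.99'                          # made-up text to look for literally
text = 'Was $9.99, now $9x99 or 1$9999'

# Unescaped, '$' means end-of-string and '.' means any character,
# so the pattern cannot match anything here.
unescaped = re.findall(literal, text)

# Escaped, every special character is matched literally.
escaped = re.findall(re.escape(literal), text)

print(unescaped)
print(escaped)
```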

## Exercises

#### 1. What will the following Python code output?

In [None]:
re.search(r"(a+b){2}", "abaaaabaab").group(0)

In [None]:
re.findall(r"[\.,;?!]", sentence)

In [None]:
re.findall(r"[^A-Za-z0-9 ]", sentence)

In [None]:
re.search(r"[0-9]{3}-[0-9]{3}-[0-9]{4}", "My phone number is 919-555-1212.")

#### 2. What will the following Python code output?

In [None]:
re.findall(r"\d{3}-\d{3}-\d{4}", "My phone number is 919-555-1212.")

In [None]:
re.findall(r"[^\w\s]", sentence)

In [None]:
re.findall(r"[\W\S]", sentence)

There are a few less common ones:

* `\A` matches the beginning of the string. This is a lot like `^`, but with `re.MULTILINE`, `^` also matches at the start of every line while `\A` does not.
* `\Z` matches the end of the string. This is a lot like `$`, but with `re.MULTILINE`, `$` also matches at the end of every line while `\Z` does not.
* `\b` matches a word boundary. This means it matches an empty string at the beginning or end of a word.
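
A short sketch of these three on made-up strings, with `re.MULTILINE` switched on so that `^` and `$` behave differently from `\A` and `\Z`:

```python
import re

text = "cat one\ncat two"   # made-up two-line string

line_starts = re.findall(r"^cat", text, re.MULTILINE)     # start of every line
string_start = re.findall(r"\Acat", text, re.MULTILINE)   # start of the string only

line_ends = re.findall(r"\w+$", text, re.MULTILINE)       # last word of every line
string_end = re.findall(r"\w+\Z", text, re.MULTILINE)     # last word of the string only

# \b on both sides keeps 'cat' from matching inside 'concat' or 'cats'.
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

print(line_starts, string_start)
print(line_ends, string_end)
print(whole_words)
```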

#### 3. What will the following Python code output?

In [None]:
re.findall(r"\b\w{3,5}\b", sentence)

In [None]:
text = """This is a multi-line string.
It has newlines in it."""

print(re.findall(r"\w\.$", text, re.MULTILINE))
print(re.findall(r"\w\.\Z", text, re.MULTILINE))

In [None]:
possible_emails = ["clinton", "clinton@dreisbach.us", "beanguy@example.org", 
                   "Email help@example.org for more information",
                   "terry@example.org", "@carmen", "what@what", "hi@example.org"]
[possibility 
 for possibility in possible_emails 
 if re.search(r"\A\w+@\w+\.\w{2,3}\Z", possibility)]

#### 4. Capturing matches

In [None]:
possibilities = ["Queenland, HK", "Xuhui, SH", "Nanjing", "Huairou, BJ", "CQ"]
for possibility in possibilities:
    match = re.search(r"^([\w\s]+), ([A-Z]{2})", possibility)
    if match:
        town, city = match.groups()
        print("Town: {} | City: {}".format(town, city))

In [None]:
phone_nums = ["999-555-1212", "(703) 555-9999", "800.555.7341", "3145558286"]
cleaned = []
for num in phone_nums:
    match = re.search(r"\(?(\d{3})\)?[\-\.]?\s*(\d{3})[\-\.]?(\d{4})", num)
    cleaned.append("{}-{}-{}".format(*match.groups()))
print(cleaned)

#### 5. Non-capturing group

Use `(?:...)` to group part of a pattern without capturing it.

In [None]:
phone_num_with_possible_area_code = r"(?:\(?(\d{3})\)?[\-\.]?\s*)?(\d{3})[\-\.]?(\d{4})"
phone_nums = ["999-555-1212", "(703) 555-9999", "800.555.7341", "3145558286", "555-1212"]
cleaned = []
for num in phone_nums:
    match = re.search(phone_num_with_possible_area_code, num)
    cleaned.append("{}-{}-{}".format(*match.groups()))
print(cleaned)

In [None]:
phone_num_with_possible_area_code = r"(?:\(?(\d{3})\)?[-.]?\s*)?(\d{3})[-.]?(\d{4})"
phone_nums = ["999-555-1212", "(703) 555-9999", "800.555.7341", "3145558286", "555-1212"]
for num in phone_nums:
    match = re.search(phone_num_with_possible_area_code, num)
    print(match.group(1))

#### 6. Parsing a GenBank file

Here is a [GenBank file](data/NT_033777.gbk) for Chr-3R of *__D. melanogaster__*. Try to extract the CDSs (coding sequences) and save them all in a file in FASTA format, using what you have learned about regular expressions.

```
     CDS             join(1115..1913,7784..8649,9439..9771)
                     /gene="CG12581"
                     /locus_tag="Dmel_CG12581"
                     /note="CG12581 gene product from transcript CG12581-RA"
                     /codon_start=1
                     /product="CG12581-PA, isoform A"
                     /protein_id="NP_649435.2"
                     /db_xref="GI:24643831"
                     /db_xref="FLYBASE:FBgn0037213"
                     /db_xref="GeneID:40522"
                     /translation="MGDSTPICRCRVLYLGSAVPRQSKDGLQGIQEPLRSLYPSEGAV
                     GAKGIDSWLSVWSNGILLENVDENLKQITRFFPIESLHYCAAVRQVLIPERGNTHPEP
                     KFLPLDSPFARMPRAQHPPIFAAILRRTTGIKVLECHVFICKREAAANALVRCCFHAY
                     ADNSYARQLETGGGSSVYGTLKSGAISKSSSDLTGVGLANGVGNGSGGGNHHLSLSAQ
                     GGWRSRTGSTTTLNSLGRASNGHANGSAIGMNGSSAVSAAEGYTSVKNFYGSSADLNV
                     AVDDGDASFNGDENHKVWNGSQDQLDSIGPLESPYELFAGNTSTLGRPLRARQISTPI
                     DVPPPPVKDERKTKRDKKLTKSSGSQSLSGTLIRPKPVHPAPQHRSGFQGPSGPGSVT
                     YGHVSGHGLHAARYHTISHRGIPPGSHSHLTHHPPPHPSQNQVHMMHHPHIGGLPPMQ
                     IPVMMPQQYATLQPSRSTGKKKKKDKKSGAGGGVPVGMPIVPPIYAFQQQVVGVPAPQ
                     LAQSLIGETRPLGHSSRKLAASMGNGLDDSGNSGAESPSPGGTGIYKRKGHLNERAFS
                     YSIRQEHRSRSHGSLASLQFNPPDIKKEREIAQMVAGLDLNEGERPMGPNTLQRKHAM
                     TMSNGGLHGPGPSQHPHAHPPHPHAIYGPLGPASSFGMPRR"
```
or
```
     CDS             complement(15530..15955)
                     /gene="Dsk"
                     /locus_tag="Dmel_CG18090"
                     /note="CG18090 gene product from transcript CG18090-RA"
                     /codon_start=1
                     /product="CG18090-PA"
                     /protein_id="NP_524845.2"
                     /db_xref="GI:28573265"
                     /db_xref="FLYBASE:FBgn0000500"
                     /db_xref="GeneID:45845"
                     /translation="MGPRSCTHFATLFMPLWALAFCFLVVLPIPAQTTSLQNAKDDRR
                     LQELESKIGGEIDQPIANLVGPSFSLFGDRRNQKTMSFGRRVPLISRPIIPIELDLLM
                     DNDDERTKAKRFDDYGHMRFGKRGGDDQFDDYGHMRFGR"
```

This is a medium-sized file. Be careful not to read all the lines into memory at once; instead, read it as a stream, one line at a time.
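
As a starting point only, the streaming pattern might look like the sketch below (Python 3). It covers just the first step, finding each CDS feature and its gene name line by line, and leaves the translation and FASTA output to the exercise. The names `cds_line`, `gene_line` and `records` are made up for the illustration, the in-memory `sample` holds a few lines copied from the excerpts above in place of the real file, and the sketch assumes each CDS location fits on a single line, which may not hold everywhere in the real file.

```python
import io
import re

# A few lines copied from the excerpts above; a real run would iterate over
# open('data/NT_033777.gbk') instead of this in-memory stand-in.
sample = io.StringIO(
    '     CDS             join(1115..1913,7784..8649,9439..9771)\n'
    '                     /gene="CG12581"\n'
    '                     /codon_start=1\n'
    '     CDS             complement(15530..15955)\n'
    '                     /gene="Dsk"\n'
)

cds_line = re.compile(r'^\s{5}CDS\s+(\S+)')        # feature key plus its location
gene_line = re.compile(r'^\s+/gene="([^"]+)"')     # /gene qualifier

records = []        # (gene, location) pairs found so far
location = None     # location of the CDS feature currently being read
for line in sample:             # one line at a time, never the whole file
    m = cds_line.match(line)
    if m:
        location = m.group(1)
        continue
    m = gene_line.match(line)
    if m and location is not None:
        records.append((m.group(1), location))
        location = None

for gene, loc in records:
    print(gene, loc)
```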