# Lecture 9: Working in Python Volume 3

Topics:
* Iterators and list comprehension
* Matplotlib
* File I/O and file system exploration
* Classes

## List comprehension

Understanding "list comprehension" is when Python really clicked for me. Let's say we have a list of the numbers 1 to 10 and we want to apply a function to just the even numbers, storing the results in a new list. We could do this with a loop:

In [6]:
def f(x):
    return x*x

In [8]:
f(3)

9

In [15]:
values = list(range(1,11))

In [16]:
values

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [22]:
for value in values:
    if value % 2 == 0:
        print(f(value))

4
16
36
64
100


In [24]:
new_values = []
for value in values:
    if value % 2 == 0:
        new_values.append(f(value))

In [25]:
new_values

[4, 16, 36, 64, 100]

Instead, this can be done with a single list comprehension:

In [32]:
[f(value) for value in values if value % 2 == 0]

[4, 16, 36, 64, 100]

To step through this construction:

In [42]:
[value for value in values]

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [34]:
[value for value in values if value % 2 == 0]

[2, 4, 6, 8, 10]

In [35]:
[f(value) for value in values if value % 2 == 0]

[4, 16, 36, 64, 100]
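
One detail that often trips people up is where the `if` goes. The sketch below contrasts a trailing `if`, which filters items out, with a conditional expression in front, which keeps every item but chooses what to store. The names `filtered` and `replaced` are made up for this illustration.

```python
def f(x):
    return x * x

values = list(range(1, 11))

# Trailing `if`: a filter, odd numbers are dropped
filtered = [f(value) for value in values if value % 2 == 0]

# Leading `if ... else`: a conditional expression, odd numbers are kept unchanged
replaced = [f(value) if value % 2 == 0 else value for value in values]

print(filtered)  # [4, 16, 36, 64, 100]
print(replaced)  # [1, 4, 3, 16, 5, 36, 7, 64, 9, 100]
```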

Similarly, we can construct a dictionary comprehension with:

In [50]:
mapping = {value: f(value) for value in values if value % 2 == 0}

In [51]:
mapping

{2: 4, 4: 16, 6: 36, 8: 64, 10: 100}

In [52]:
mapping[4]

16
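
A dictionary comprehension can also loop over an existing dictionary. As a small sketch, iterating over `mapping.items()` lets us flip keys and values; the name `inverse` is hypothetical and used only for illustration.

```python
def f(x):
    return x * x

values = list(range(1, 11))
mapping = {value: f(value) for value in values if value % 2 == 0}

# Swap keys and values: look up the original number from its square
inverse = {square: value for value, square in mapping.items()}

print(inverse)      # {4: 2, 16: 4, 36: 6, 64: 8, 100: 10}
print(inverse[16])  # 4
```

This only works cleanly here because every square is unique; duplicate values would overwrite each other as keys.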

## Functional programming

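Apply a function to every element of a list with `map`, and keep only the elements that pass a test with `filter`. In Python 3 both return lazy iterators rather than lists, so wrap them in `list` to see the values.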

In [36]:
values

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [39]:
map(f, values)

<map at 0x10b6a8ef0>

In [40]:
list(map(f, values))

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

In [45]:
list(map(lambda x: f(x), values))

[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

In [47]:
list(filter(lambda x: x % 2 == 0, values))

[2, 4, 6, 8, 10]
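
To connect this back to list comprehension, here is a minimal sketch that chains `filter` and `map` to reproduce the earlier result, and that shows why `map` prints as `<map at ...>`: it is an iterator that yields its values once. The variable names are illustrative only.

```python
def f(x):
    return x * x

values = list(range(1, 11))

# filter keeps the even numbers, map then squares them
evens_squared = list(map(f, filter(lambda x: x % 2 == 0, values)))

# The same result written as a list comprehension
evens_squared_comp = [f(x) for x in values if x % 2 == 0]

print(evens_squared)                        # [4, 16, 36, 64, 100]
print(evens_squared == evens_squared_comp)  # True

# map returns a lazy iterator, so it can only be consumed once
squares = map(f, values)
first_pass = list(squares)
second_pass = list(squares)  # the iterator is already exhausted

print(first_pass)   # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
print(second_pass)  # []
```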

## Working with the web

The `requests` library fetches the contents of a URL. Here we download a JSON file describing a measles tree and parse it into nested Python dictionaries and lists with `json`.

In [53]:
import requests

In [65]:
response = requests.get('http://data.nextstrain.org/measles_tree.json')

In [71]:
response

<Response [200]>

In [73]:
response.text

'{\n  "attr": {\n    "div": 0, \n    "num_date": 1934.992759310427, \n    "num_date_confidence": [\n      1932.0738536704116, \n      1936.9073417770612\n    ]\n  }, \n  "tvalue": 0.0, \n  "clade": 0, \n  "strain": "NODE_0000097", \n  "children": [\n    {\n      "xvalue": 0.02481, \n      "muts": [\n        "C32T", \n        "A40G", \n        "A99T", \n        "T100A", \n        "A143G", \n        "A230G", \n        "C254T", \n        "G299A", \n        "T362C", \n        "T392A", \n        "G410A", \n        "G434A", \n        "T509C", \n        "T518C", \n        "C584T", \n        "G656T", \n        "C779T", \n        "C797T", \n        "T824C", \n        "C839T", \n        "T881C", \n        "G896A", \n        "C917T", \n        "T1002C", \n        "C1040T", \n        "G1076A", \n        "A1091G", \n        "T1324C", \n        "A1346G", \n        "G1358C", \n        "C1365T", \n        "G1421A", \n        "A1442G", \n        "G1446A", \n        "G1456A", \n        "A1459C", \n     

In [74]:
import json

In [75]:
tree = json.loads(response.text)

In [76]:
tree

{'attr': {'div': 0,
  'num_date': 1934.992759310427,
  'num_date_confidence': [1932.0738536704116, 1936.9073417770612]},
 'tvalue': 0.0,
 'clade': 0,
 'strain': 'NODE_0000097',
 'children': [{'xvalue': 0.02481,
   'muts': ['C32T',
    'A40G',
    'A99T',
    'T100A',
    'A143G',
    'A230G',
    'C254T',
    'G299A',
    'T362C',
    'T392A',
    'G410A',
    'G434A',
    'T509C',
    'T518C',
    'C584T',
    'G656T',
    'C779T',
    'C797T',
    'T824C',
    'C839T',
    'T881C',
    'G896A',
    'C917T',
    'T1002C',
    'C1040T',
    'G1076A',
    'A1091G',
    'T1324C',
    'A1346G',
    'G1358C',
    'C1365T',
    'G1421A',
    'A1442G',
    'G1446A',
    'G1456A',
    'A1459C',
    'G1515A',
    'T1525C',
    'C1549A',
    'T1559C',
    'G1562A',
    'A1616G',
    'T1620C',
    'C1621T',
    'T1664C',
    'C1667T',
    'G1688A',
    'G1690A',
    'A1699G',
    'C1725T',
    'G1769A',
    'C1770T',
    'C1778T',
    'G1879A',
    'A1895G',
    'A1966G',
    'C1999T',
    'A201

In [78]:
tree["strain"]

'NODE_0000097'

In [79]:
tree["children"]

[{'xvalue': 0.02481,
  'muts': ['C32T',
   'A40G',
   'A99T',
   'T100A',
   'A143G',
   'A230G',
   'C254T',
   'G299A',
   'T362C',
   'T392A',
   'G410A',
   'G434A',
   'T509C',
   'T518C',
   'C584T',
   'G656T',
   'C779T',
   'C797T',
   'T824C',
   'C839T',
   'T881C',
   'G896A',
   'C917T',
   'T1002C',
   'C1040T',
   'G1076A',
   'A1091G',
   'T1324C',
   'A1346G',
   'G1358C',
   'C1365T',
   'G1421A',
   'A1442G',
   'G1446A',
   'G1456A',
   'A1459C',
   'G1515A',
   'T1525C',
   'C1549A',
   'T1559C',
   'G1562A',
   'A1616G',
   'T1620C',
   'C1621T',
   'T1664C',
   'C1667T',
   'G1688A',
   'G1690A',
   'A1699G',
   'C1725T',
   'G1769A',
   'C1770T',
   'C1778T',
   'G1879A',
   'A1895G',
   'A1966G',
   'C1999T',
   'A2017G',
   'G2071A',
   'T2131C',
   'A2201G',
   'A2224G',
   'C2227T',
   'T2413C',
   'C2416T',
   'C2451T',
   'G2461A',
   'G2481A',
   'A2511G',
   'A2576G',
   'A2709G',
   'C2769T',
   'A2814G',
   'A2951G',
   'T3033C',
   'A3042G',
   'T3117
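
Because the parsed tree is just nested dictionaries and lists, it can be walked with a short recursive function. The sketch below uses a tiny stand-in tree so that it runs without a network connection. The strain names are made up, and it assumes that every node has a `"strain"` key and that tips simply lack a `"children"` key, which should be checked against the real file.

```python
import json

# Small stand-in with the same "strain" / "children" layout as the measles tree.
# The strain names here are invented for illustration.
text = """
{"strain": "NODE_0", "children": [
    {"strain": "tip_A"},
    {"strain": "NODE_1", "children": [
        {"strain": "tip_B"},
        {"strain": "tip_C"}
    ]}
]}
"""
example_tree = json.loads(text)

def collect_tips(node):
    """Return the strain names of all nodes without children, depth first."""
    children = node.get("children", [])
    if not children:
        return [node["strain"]]
    tips = []
    for child in children:
        tips.extend(collect_tips(child))
    return tips

print(collect_tips(example_tree))  # ['tip_A', 'tip_B', 'tip_C']
```

With the real data, `collect_tips(tree)` would be the analogous call, provided the assumptions above hold.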

## Using APIs

### Using the NCBI Entrez API to download a single GenBank record

GenBank records can be searched on the NCBI website like so: https://www.ncbi.nlm.nih.gov/nuccore?term=measles%5Btitle%5D%20AND%20viruses%5Bfilter%5D%20AND%20%28%225000%22%5BSLEN%5D%20%3A%20%2220000%22%5BSLEN%5D%29&cmd=DetailsSearch

The NCBI Entrez API makes the same kind of query available from code. The Biopython implementation of the Entrez API is described here: https://biopython.org/DIST/docs/api/Bio.Entrez-module.html

In [112]:
from Bio import Entrez

In [113]:
Entrez.email = "tbedford@fredhutch.org"

In [114]:
handle = Entrez.efetch(db="nucleotide", id="LC420351", rettype="gb", retmode="text")

In [115]:
from Bio import GenBank

In [117]:
record = GenBank.read(handle)

In [118]:
record

<Bio.GenBank.Record.Record at 0x10c974eb8>

In [122]:
dir(record)

['BASE_FEATURE_FORMAT',
 'BASE_FORMAT',
 'GB_BASE_INDENT',
 'GB_FEATURE_INDENT',
 'GB_FEATURE_INTERNAL_INDENT',
 'GB_INTERNAL_INDENT',
 'GB_LINE_LENGTH',
 'GB_OTHER_INTERNAL_INDENT',
 'GB_SEQUENCE_INDENT',
 'INTERNAL_FEATURE_FORMAT',
 'INTERNAL_FORMAT',
 'OTHER_INTERNAL_FORMAT',
 'SEQUENCE_FORMAT',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_accession_line',
 '_base_count_line',
 '_comment_line',
 '_contig_line',
 '_db_source_line',
 '_dblink_line',
 '_definition_line',
 '_features_line',
 '_keywords_line',
 '_locus_line',
 '_nid_line',
 '_organism_line',
 '_origin_line',
 '_pid_line',
 '_project_line',
 '_segment_line',
 '_sequence_line',
 '_source_line',
 '_ver

In [125]:
record.accession

['LC420351']

In [131]:
record.source

'Measles morbillivirus'

In [143]:
record.sequence[0:500]

'ACCAAACAAAGTTGGGTAAGGATAGATCAATCAATGATCATATTCTAGTACACTTAGGATTCAAGATCCTATTATCAGGGACAAGAGCAGGATTAGGGATATCCGAGGGCGCGCCATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCA'
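
Since `record.sequence` behaves like an ordinary string here, the usual Python tools apply to it. As a rough sketch, the block below counts bases in the first 20 characters of the sequence shown above; `prefix`, `counts` and `gc_fraction` are hypothetical names, and the prefix is pasted in by hand so the example runs offline.

```python
from collections import Counter

# First 20 bases of the sequence printed above, pasted in as a plain string
prefix = "ACCAAACAAAGTTGGGTAAG"

# Count how many times each base occurs
counts = Counter(prefix)

# Fraction of bases that are G or C
gc_fraction = (counts["G"] + counts["C"]) / len(prefix)

print(dict(counts))
print(gc_fraction)  # 0.4
```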

### Using the NCBI Entrez API to search for GenBank records

In [212]:
handle = Entrez.esearch(db="nucleotide",
            retmax=10,
            term="measles[title] AND viruses[filter]",
            idtype="acc")

In [213]:
handle

<_io.TextIOWrapper encoding='utf-8'>

In [214]:
records = Entrez.read(handle)

In [210]:
dir(records)

['__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'attributes',
 'clear',
 'copy',
 'fromkeys',
 'get',
 'items',
 'keys',
 'pop',
 'popitem',
 'setdefault',
 'tag',
 'update',
 'values']

In [215]:
records["Count"]

'15613'

In [226]:
handle = Entrez.esearch(db="nucleotide", retmax=500,
            term="measles[title] AND viruses[filter] AND (5000[SLEN] : 20000[SLEN])", 
            idtype="acc")

In [227]:
records = Entrez.read(handle)

In [228]:
records["Count"]

'207'

In [229]:
records["IdList"]

['LC420351.1', 'NC_001498.1', 'MF775733.1', 'MH638233.1', 'MH173047.1', 'MG972194.1', 'LZ951275.1', 'MG912594.1', 'MG912593.1', 'MG912592.1', 'MG912591.1', 'MG912590.1', 'MG912589.1', 'LC336599.1', 'MF449469.1', 'KY969481.1', 'KY969480.1', 'KY969479.1', 'KY969478.1', 'KY969477.1', 'KY969476.1', 'KY656518.1', 'LF947387.1', 'KT588921.1', 'KX838946.2', 'HZ785298.1', 'KU728743.1', 'KU728742.1', 'KJ755982.1', 'KJ755981.1', 'KJ755980.1', 'KJ755979.1', 'KJ755978.1', 'KJ755977.1', 'KJ755976.1', 'KJ755975.1', 'KJ755974.1', 'AF266290.1', 'KT732261.1', 'KT732260.1', 'KT732259.1', 'KT732258.1', 'KT732257.1', 'KT732256.1', 'KT732255.1', 'KT732254.1', 'KT732253.1', 'KT732252.1', 'KT732251.1', 'KT732250.1', 'KT732249.1', 'KT732248.1', 'KT732247.1', 'KT732246.1', 'KT732245.1', 'KT732244.1', 'KT732243.1', 'KT732242.1', 'KT732241.1', 'KT732240.1', 'KT732239.1', 'KT732238.1', 'KT732237.1', 'KT732236.1', 'KT732235.1', 'KT732234.1', 'KT732233.1', 'KT732232.1', 'KT732231.1', 'KT732230.1', 'KT732229.1', 'KT7

In [230]:
len(records["IdList"])

207
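
Two things in the output above are easy to miss. First, `records["Count"]` is the string `'207'`, while `len(records["IdList"])` is the integer `207`. Second, the IDs carry a version suffix such as `.1`. A list comprehension handles both nicely. The sketch below pastes in a few IDs from the result so it runs offline; `id_list`, `accessions` and `total` are illustrative names.

```python
# A few accessions copied from the search result above
id_list = ['LC420351.1', 'NC_001498.1', 'MF775733.1', 'KX838946.2']
count = '207'  # "Count" comes back as a string

# Drop the ".version" suffix; rsplit splits only at the last dot
accessions = [acc.rsplit('.', 1)[0] for acc in id_list]

# Convert the count to an integer before comparing it with len(...)
total = int(count)

print(accessions)    # ['LC420351', 'NC_001498', 'MF775733', 'KX838946']
print(total == 207)  # True
```

Note that `"Count"` reports all matches, while `"IdList"` holds at most `retmax` of them. The two agree here only because `retmax=500` exceeds the 207 hits.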