# CSC 593

## Week 5

### Reading and Writing Files

The built-in function [`open()`](https://docs.python.org/3/library/functions.html#open) returns a "file object" that can be used to read from and/or write to a file.

In general, we do file input/output inside a code block started with the `with` keyword. The `with` statement manages the file object for us and closes it when the block ends, even if an error occurs inside the block.
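
To make the role of `with` concrete, here is a minimal sketch comparing it with closing the file by hand. The scratch file `demo.txt` and its temporary directory are assumptions made for this illustration only; they are not part of the course data.

```python
import os
import tempfile

# Hypothetical scratch file, so the example does not touch the course data.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Manual version: we have to remember to close the file ourselves.
f = open(path, 'w')
try:
    f.write('hello\n')
finally:
    f.close()

# Equivalent `with` version: the file is closed when the block ends,
# even if an exception is raised inside it.
with open(path, 'r') as g:
    text = g.read()

print(text, f.closed, g.closed)
```

Both file objects report `closed` as `True` afterwards; the `with` form simply gets there with less code.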

We will usually pass `open()` two arguments: a path to the file on the filesystem, and the *mode* we want to open the file in. The mode is optional and defaults to `'r'`, which opens the file for reading. Use the file object's `read()` method to read the entire file, or `readline()` to read one line at a time.

In [2]:
with open('../data/textfile.txt', 'r') as f:
    print(f.read())

This is just a short text file.

Here is another line of text.


Use mode 'w' to write to the file:

In [3]:
with open('../data/newfile.txt', 'w') as f:
    f.write("""This is some text.
    
This is some more text.""")

In [4]:
with open('../data/newfile.txt', 'r') as f:
    print(f.readline())

This is some text.



You can also iterate over the file object, line by line:

In [279]:
with open('../data/newfile.txt', 'r') as f:
    for ln in f:
        print(ln)

This is some text.

    

This is some more text.
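
The output above is double-spaced because each line read from the file still ends with its own newline, and `print()` adds another. A small sketch of one way to avoid that follows; `io.StringIO` stands in for an open file here, which is an assumption made so the example runs without any file on disk.

```python
import io

# io.StringIO behaves like an open text file (a stand-in for this sketch).
f = io.StringIO("This is some text.\n\nThis is some more text.")

# Strip the trailing newline from each line before printing.
lines = [ln.rstrip('\n') for ln in f]
for ln in lines:
    print(ln)
```

Calling `print(ln, end='')` on the unstripped lines would have a similar effect.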


Be careful: `open(filename, 'w')` will overwrite existing files. The file is emptied as soon as it is opened, even if you never write anything to it.

In [5]:
open('../data/newfile.txt', 'w').close()  # our new file is now empty.

with open('../data/newfile.txt', 'r') as f:
    print(f.read())




Using mode 'x' will open a new file for writing, but raise a `FileExistsError` if the file already exists.

In [7]:
with open('../data/textfile.txt', 'x') as f:
    pass

FileExistsError: [Errno 17] File exists: '../data/textfile.txt'
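
If we would rather handle that error than stop, we can catch it. The following is a sketch only: the helper `write_once` and the file name `report.txt` are hypothetical names chosen for this example.

```python
import os
import tempfile

# Hypothetical file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'report.txt')

def write_once(path, text):
    """Write text to path only if the file does not exist yet."""
    try:
        with open(path, 'x') as f:
            f.write(text)
        return True
    except FileExistsError:
        # The file is already there; leave it untouched.
        return False

first = write_once(path, 'original contents')
second = write_once(path, 'this would have clobbered the file')

with open(path) as f:
    contents = f.read()

print(first, second, contents)
```

The second call reports failure and the original contents survive, which is the protection mode 'w' does not give us.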

#### Practice

Try opening the class syllabus ('../README.md') and printing the first line.

In [12]:
with open('../README.md', 'r') as f:
    print(f.readline())

# University of Rhode Island



##### `CSV`

Comma-separated values (CSV) files are a common data exchange format. Python's standard library supports them through the `csv` module:

In [16]:
import csv

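To read a CSV file, open it like any other, then pass the file object to `csv.reader()`. (The `csv` documentation recommends opening the file with `newline=''` so that newlines inside quoted fields are handled correctly.) Here we use the `next()` function to retrieve the first row of the `familyxx.csv` file, then print the header labels.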

`familyxx.csv` is part of the data release from the 2018 [National Health Interview Survey](https://www.cdc.gov/nchs/nhis/index.htm).

In [17]:
with open('../data/nhis/familyxx.csv') as f:
    rdr = csv.reader(f)
    hdr = next(rdr)
    for name in hdr:
        print(name)

FINT_Y_P
FINT_M_P
FMX
RECTYPE
SRVY_YR
HHX
FM_SIZE
FM_STRCP
FM_TYPE
FM_STRP
TELN_FLG
CURWRKN
TELCELN
WRKCELN
PHONEUSE
FLNGINTV
WTFA_FAM
FM_KIDS
FM_ELDR
FM_EDUC1
F10DVCT
F10DVYN
FDMEDCT
FDMEDYN
FHCDVCT
FHCDVYN
FHCHMCT
FHCHMYN
FHCPHRCT
FHCPHRYN
FHOSP2CT
FHOSP2YN
FNMEDCT
FNMEDYN
FSRUNOUT
FSLAST
FSBALANC
FSSKIP
FSSKDAYS
FSLESS
FSHUNGRY
FSWEIGHT
FSNOTEAT
FSNEDAYS
FHDSTCT
FDGLWCT1
FDGLWCT2
FWRKLWCT
FCHLMYN
FSPEDYN
FLAADLYN
FLIADLYN
FWKLIMYN
FWALKYN
FREMEMYN
FANYLYN
FCHLMCT
FSPEDCT
FLIADLCT
FWKLIMCT
FWALKCT
FREMEMCT
FANYLCT
FHSTATEX
FHSTATVG
FHSTATG
FHSTATFR
FHSTATPR
FLAADLCT
FGAH
HOUSEOWN
FSNAP
FSNAPMYR
INCGRP4
INCGRP5
RAT_CAT4
RAT_CAT5
FSALYN
FSEINCYN
FSSRRYN
FPENSYN
FOPENSYN
FSSIYN
FTANFYN
FOWBENYN
FINTR1YN
FDIVDYN
FCHSPYN
FINCOTYN
FSSAPLYN
FSDAPLYN
FWICYN
FSALCT
FSEINCCT
FSSRRCT
FPENSCT
FOPENSCT
FSSICT
FTANFCT
FOWBENCT
FINTR1CT
FDIVDCT
FCHSPCT
FINCOTCT
FSSAPLCT
FSDAPLCT
FWICCT
FHIPRVCT
FHISINCT
FHICARCT
FHICADCT
FHICHPCT
FHIMILCT
FHIPUBCT
FHIOGVCT
FHIIHSCT
FHIEXCT
COVCONF
FHICOST
FMEDBILL


##### `zip()`

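The [`zip`](https://docs.python.org/3/library/functions.html#zip) function combines two or more iterables (like lists or strings) element by element, yielding one tuple per position. It stops as soon as the shortest iterable runs out.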

In [242]:
l1 = [1, 2, 3]
l2 = ['a', 'b', 'c']
l3 = ['x', 'y', 'z']
for x in zip(l1, l2, l3):
    print(x)

for x in zip('foo', 'bar'):
    print(x)

(1, 'a', 'x')
(2, 'b', 'y')
(3, 'c', 'z')
('f', 'b')
('o', 'a')
('o', 'r')
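
One detail worth seeing in action: `zip()` silently drops the extra items when its inputs have different lengths. The sketch below contrasts it with `itertools.zip_longest`, which pads the shorter input instead. The variable names are made up for this illustration.

```python
from itertools import zip_longest

short = [1, 2]
longer = ['a', 'b', 'c']

# zip() stops at the end of the shortest input, so 'c' is dropped.
pairs = list(zip(short, longer))

# zip_longest() keeps going and fills the gaps with fillvalue.
padded = list(zip_longest(short, longer, fillvalue=None))

print(pairs)
print(padded)
```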


This gives us another way to answer the last question from homework assignment 2:

In [286]:
string1 = 'ABCDEFGHIJ'
string2 = 'ABCDEEGHIJ'

for x in zip(string1, string2):
    print(*x, sep='')

AA
BB
CC
DD
EE
FE
GG
HH
II
JJ


More importantly, it's a convenient way to create dictionaries from two lists:

In [287]:
dict(zip(l2, l1))

{'a': 1, 'b': 2, 'c': 3}

Here, we create a list of dictionaries, each containing one row of the `familyxx` data.

In [288]:
with open('../data/nhis/familyxx.csv') as f:
    rdr = csv.reader(f)
    hdr = next(rdr)
    nhis = [dict(zip(hdr, row)) for row in rdr]

In [294]:
print(len(nhis))
print(nhis[0])

30309
{'FINT_Y_P': '2018', 'FINT_M_P': '1', 'FMX': '01', 'RECTYPE': '60', 'SRVY_YR': '2018', 'HHX': '000001', 'FM_SIZE': '1', 'FM_STRCP': '11', 'FM_TYPE': '1', 'FM_STRP': '11', 'TELN_FLG': '1', 'CURWRKN': '1', 'TELCELN': '1', 'WRKCELN': '1', 'PHONEUSE': '3', 'FLNGINTV': '1', 'WTFA_FAM': '3539', 'FM_KIDS': '0', 'FM_ELDR': '1', 'FM_EDUC1': '2', 'F10DVCT': '0', 'F10DVYN': '2', 'FDMEDCT': '0', 'FDMEDYN': '2', 'FHCDVCT': '1', 'FHCDVYN': '1', 'FHCHMCT': '0', 'FHCHMYN': '2', 'FHCPHRCT': '1', 'FHCPHRYN': '1', 'FHOSP2CT': '0', 'FHOSP2YN': '2', 'FNMEDCT': '0', 'FNMEDYN': '2', 'FSRUNOUT': '3', 'FSLAST': '3', 'FSBALANC': '3', 'FSSKIP': '', 'FSSKDAYS': '', 'FSLESS': '', 'FSHUNGRY': '', 'FSWEIGHT': '', 'FSNOTEAT': '', 'FSNEDAYS': '', 'FHDSTCT': '', 'FDGLWCT1': '0', 'FDGLWCT2': '0', 'FWRKLWCT': '', 'FCHLMYN': '', 'FSPEDYN': '', 'FLAADLYN': '2', 'FLIADLYN': '2', 'FWKLIMYN': '2', 'FWALKYN': '2', 'FREMEMYN': '2', 'FANYLYN': '2', 'FCHLMCT': '', 'FSPEDCT': '', 'FLIADLCT': '0', 'FWKLIMCT': '0', 'FWALKCT': 
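
The standard library also offers `csv.DictReader`, which does the header/row pairing for us. The sketch below checks that the two approaches agree. The three-column sample is hypothetical data standing in for a real CSV file, and `io.StringIO` is used so the example does not need a file on disk.

```python
import csv
import io

# Hypothetical sample standing in for a real CSV file.
sample = "HHX,FM_SIZE,FM_KIDS\n000001,1,0\n000002,3,1\n"

# The approach used above: read the header, then zip it with each row.
rdr = csv.reader(io.StringIO(sample))
hdr = next(rdr)
rows_zip = [dict(zip(hdr, row)) for row in rdr]

# csv.DictReader reads the header itself and yields one mapping per row.
rows_dict = [dict(row) for row in csv.DictReader(io.StringIO(sample))]

print(rows_zip == rows_dict)
```

Either way, every value is still a string; converting to numbers is a separate step.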

#### Practice
Import your own dataset, or the NHIS persons file (`../data/nhis/personsx.csv`). Create a list of dictionaries, as I have above.

In [20]:
with open('../dataset/drive.csv') as f:
    rdr = csv.reader(f)
    hdr = next(rdr)
    nhis = [dict(zip(hdr, row)) for row in rdr]

In [22]:
print(len(nhis))
print(nhis[2])

3350
{'Timestamp(ms)': '1500', 'Altitude[m]': '0', 'HV Battery Current[A]': '-1.25', 'HV Battery SOC[%]': '33', 'HV Battery Voltage[V]': '319', 'Is Driving[bool]': '1', 'Latitude[deg]': '0', 'Longitude[deg]': '0', 'OAT[degC]': '10', 'Vehicle Speed[km/h]': '0'}


### Web Scraping



We'll use the [`requests`](https://3.python-requests.org/) module to retrieve data from the web, and [`Beautiful Soup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to read these pages.

In [42]:
import requests
from bs4 import BeautifulSoup as bs

Get [Wikipedia's list of Rhode Island municipalities](https://en.wikipedia.org/wiki/List_of_municipalities_in_Rhode_Island). A response code of 200 means "OK".

In [43]:
page = requests.get('https://en.wikipedia.org/wiki/List_of_municipalities_in_Rhode_Island')
page

<Response [200]>

Parse the source of the page with BeautifulSoup and find the table. We know it has the class 'wikitable'. We have to do some tinkering here--the table is messier than the CDC's file.

In [44]:
soup = bs(page.text, 'html.parser')
table = soup.find('table', class_='wikitable')

#Find all the table headers (th elements).
#Remove the footnotes/references from the header cell labels.
headers = [th.text.strip().split('[')[0] for th in table.find_all('th')]

print(headers)

#There are two subheaders under Land Area. We need to make some adjustments to our headers.
lahead = headers[-4]
headers[-4] = lahead + ' sq mi'

#the list.insert() method adds an element to the list at a specified location.
headers.insert(-3, lahead + ' km2')

#Remove the last two elements from the headers list.
headers = headers[:-2]
print(headers)

['Name', 'Type', 'County', 'Year established', 'Year incorporated', 'Form of government', 'Population(2010)', 'Population(2000)', 'Change', 'Land area(2010)', 'Population density', 'sq mi', 'km2']
['Name', 'Type', 'County', 'Year established', 'Year incorporated', 'Form of government', 'Population(2010)', 'Population(2000)', 'Change', 'Land area(2010) sq mi', 'Land area(2010) km2', 'Population density']
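
Negative indexes combined with `insert()` can be hard to follow, so here is the same header fix-up on a shortened list, printing each intermediate step. The five-item list is a cut-down copy of the scraped headers, used only for illustration.

```python
# Shortened stand-in for the scraped header list.
headers = ['Name', 'Land area(2010)', 'Population density', 'sq mi', 'km2']

parent = headers[-4]                 # 'Land area(2010)'
headers[-4] = parent + ' sq mi'      # rename the parent header in place
print(headers)

# insert(-3, ...) puts the new item *before* the element currently at index -3.
headers.insert(-3, parent + ' km2')
print(headers)

# The two stray subheaders at the end are no longer needed.
headers = headers[:-2]
print(headers)
```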


In [45]:
ridata = []
for row in table.find_all('tr')[2:]:
    rowdata = [cell.text.strip() for cell in row.find_all('td')]
    ridata.append(dict(zip(headers, rowdata)))

ridata

[{'Name': 'Barrington',
  'Type': 'Town',
  'County': 'Bristol',
  'Year established': '1653',
  'Year incorporated': '1770',
  'Form of government': 'Council–manager',
  'Population(2010)': '16,310',
  'Population(2000)': '16,819',
  'Change': '−3.0%',
  'Land area(2010) sq mi': '8.22',
  'Land area(2010) km2': '21.3',
  'Population density': '1,984.2/sq\xa0mi (766.1/km2)'},
 {'Name': 'Bristol',
  'Type': 'Town',
  'County': 'Bristol',
  'Year established': '1680',
  'Year incorporated': '1746',
  'Form of government': 'Council–manager',
  'Population(2010)': '22,954',
  'Population(2000)': '22,469',
  'Change': '+2.2%',
  'Land area(2010) sq mi': '9.82',
  'Land area(2010) km2': '25.4',
  'Population density': '2,337.5/sq\xa0mi (902.5/km2)'},
 {'Name': 'Burrillville',
  'Type': 'Town',
  'County': 'Providence',
  'Year established': '1730',
  'Year incorporated': '1806',
  'Form of government': 'Council–manager',
  'Population(2010)': '15,955',
  'Population(2000)': '15,796',
  'Chan

#### Practice
Find another table on Wikipedia (try searching for "list of..."). Import that table, as I have the RI towns list.

In [32]:
import requests
from bs4 import BeautifulSoup as bs
page = requests.get('https://en.wikipedia.org/wiki/List_of_Rhode_Island_railroads')
page
soup = bs(page.text, 'html.parser')
table = soup.find('table', class_='wikitable')

#Find all the table headers (th elements).
#Remove the footnotes/references from the header cell labels.
headers = [th.text.strip().split('[')[0] for th in table.find_all('th')]

print(headers)


['Name', 'Mark', 'System', 'From', 'To', 'Successor', 'Notes']


In [39]:
railroaddata = []
for row in table.find_all('tr')[2:]:
    rowdata = [cell.text.strip() for cell in row.find_all('td')]
    railroaddata.append(dict(zip(headers, rowdata)))

railroaddata


[{'Name': 'Boston and Providence Railroad',
  'Mark': '',
  'System': 'NH',
  'From': '1853',
  'To': '1972',
  'Successor': 'Penn Central Transportation Company'},
 {'Name': 'Boston and Providence Railroad and Transportation Company',
  'Mark': '',
  'System': 'NH',
  'From': '1834',
  'To': '1853',
  'Successor': 'Boston and Providence Railroad'},
 {'Name': 'Consolidated Rail Corporation',
  'Mark': 'CR',
  'System': '',
  'From': '1976',
  'To': '1982',
  'Successor': 'Providence and Worcester Railroad'},
 {'Name': 'Fall River, Warren and Providence Railroad',
  'Mark': '',
  'System': 'NH',
  'From': '1862',
  'To': '1892',
  'Successor': 'Old Colony Railroad'},
 {'Name': 'Hartford, Providence and Fishkill Railroad',
  'Mark': '',
  'System': 'NH',
  'From': '1849',
  'To': '1879',
  'Successor': 'New York and New England Railroad'},
 {'Name': 'Moshassuck Valley Railroad',
  'Mark': 'MOV',
  'System': '',
  'From': '1874',
  'To': '1981',
  'Successor': 'Providence and Worcester Ra
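
Notice that most of these rows have no 'Notes' key: when a table row has fewer cells than there are headers, `zip()` stops early and the remaining headers are never added to the dictionary. A minimal sketch of how to cope with that, using `dict.get()` with a default, follows. The row values are copied from the output above.

```python
headers = ['Name', 'Mark', 'System', 'From', 'To', 'Successor', 'Notes']

# A row with only six cells: zip() stops early, so 'Notes' is never set.
rowdata = ['Moshassuck Valley Railroad', 'MOV', '', '1874', '1981',
           'Providence and Worcester Railroad']
row = dict(zip(headers, rowdata))

has_notes = 'Notes' in row

# row['Notes'] would raise KeyError; .get() returns a default instead.
notes = row.get('Notes', '')

# Build a copy of the row that has every header, filling gaps with ''.
complete = {h: row.get(h, '') for h in headers}

print(has_notes, repr(notes))
print(complete)
```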

### Working with lists of data

#### Selecting "rows" or "columns"

Picking a single row by its index is easy--we've been doing this since the second class.

In [46]:
ridata[5]

{'Name': 'Coventry',
 'Type': 'Town',
 'County': 'Kent',
 'Year established': '1639',
 'Year incorporated': '1743[b]',
 'Form of government': 'Council–manager',
 'Population(2010)': '35,014',
 'Population(2000)': '33,668',
 'Change': '+4.0%',
 'Land area(2010) sq mi': '59.05',
 'Land area(2010) km2': '152.9',
 'Population density': '593.0/sq\xa0mi (228.9/km2)'}

In [47]:
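#The last row is the table's "Total" line, not a municipality.
#Print it, then remove it from the list.
print(ridata[-1])
del ridata[-1]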

{'Name': 'Total', 'Type': '—', 'County': '—', 'Year established': '—', 'Year incorporated': '—', 'Form of government': '—', 'Population(2010)': '1,052,567', 'Population(2000)': '1,048,319', 'Change': '+0.4%', 'Land area(2010) sq mi': '1,033.82', 'Land area(2010) km2': '2,677.6', 'Population density': '1,018.13/sq\xa0mi (393.10/km2)'}


We can also choose one or more rows using a list comprehension.

In [48]:
[x for x in ridata if x['County']=="Washington"]

[{'Name': 'Charlestown',
  'Type': 'Town',
  'County': 'Washington',
  'Year established': '1669',
  'Year incorporated': '1738',
  'Form of government': 'Council–manager',
  'Population(2010)': '7,827',
  'Population(2000)': '7,859',
  'Change': '−0.4%',
  'Land area(2010) sq mi': '36.45',
  'Land area(2010) km2': '94.4',
  'Population density': '214.7/sq\xa0mi (82.9/km2)'},
 {'Name': 'Exeter',
  'Type': 'Town',
  'County': 'Washington',
  'Year established': '1641',
  'Year incorporated': '1641[d]',
  'Form of government': 'Town meeting',
  'Population(2010)': '6,425',
  'Population(2000)': '6,045',
  'Change': '+6.3%',
  'Land area(2010) sq mi': '57.47',
  'Land area(2010) km2': '148.8',
  'Population density': '111.8/sq\xa0mi (43.2/km2)'},
 {'Name': 'Hopkinton',
  'Type': 'Town',
  'County': 'Washington',
  'Year established': '1639',
  'Year incorporated': '1757',
  'Form of government': 'Town meeting',
  'Population(2010)': '8,188',
  'Population(2000)': '7,836',
  'Change': '+4.

Another option is the [`filter()`](https://docs.python.org/3/library/functions.html#filter) function. For this we need a new language feature: [_lambda_ expressions](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions). These are small anonymous functions, limited to a single expression, that can be passed as function or method arguments without first defining them with `def`.

Here's the equivalent of the last expression using `filter()`:

In [49]:
list(filter(lambda x: x['County'] == 'Washington', ridata))

[{'Name': 'Charlestown',
  'Type': 'Town',
  'County': 'Washington',
  'Year established': '1669',
  'Year incorporated': '1738',
  'Form of government': 'Council–manager',
  'Population(2010)': '7,827',
  'Population(2000)': '7,859',
  'Change': '−0.4%',
  'Land area(2010) sq mi': '36.45',
  'Land area(2010) km2': '94.4',
  'Population density': '214.7/sq\xa0mi (82.9/km2)'},
 {'Name': 'Exeter',
  'Type': 'Town',
  'County': 'Washington',
  'Year established': '1641',
  'Year incorporated': '1641[d]',
  'Form of government': 'Town meeting',
  'Population(2010)': '6,425',
  'Population(2000)': '6,045',
  'Change': '+6.3%',
  'Land area(2010) sq mi': '57.47',
  'Land area(2010) km2': '148.8',
  'Population density': '111.8/sq\xa0mi (43.2/km2)'},
 {'Name': 'Hopkinton',
  'Type': 'Town',
  'County': 'Washington',
  'Year established': '1639',
  'Year incorporated': '1757',
  'Form of government': 'Town meeting',
  'Population(2010)': '8,188',
  'Population(2000)': '7,836',
  'Change': '+4.

`filter()` takes two arguments:
    
1. A function that returns `True` if we should keep the list (or other iterable) item or `False` otherwise; and
2. our list.

Our first argument above is a lambda function:

`lambda x: x['County'] == 'Washington'`

This is simply a shorthand method of creating a function and using it once. We can get the same effect this way:

In [50]:
def wash_county(x):
    return x['County'] == 'Washington'

list(filter(wash_county, ridata))

[{'Name': 'Charlestown',
  'Type': 'Town',
  'County': 'Washington',
  'Year established': '1669',
  'Year incorporated': '1738',
  'Form of government': 'Council–manager',
  'Population(2010)': '7,827',
  'Population(2000)': '7,859',
  'Change': '−0.4%',
  'Land area(2010) sq mi': '36.45',
  'Land area(2010) km2': '94.4',
  'Population density': '214.7/sq\xa0mi (82.9/km2)'},
 {'Name': 'Exeter',
  'Type': 'Town',
  'County': 'Washington',
  'Year established': '1641',
  'Year incorporated': '1641[d]',
  'Form of government': 'Town meeting',
  'Population(2010)': '6,425',
  'Population(2000)': '6,045',
  'Change': '+6.3%',
  'Land area(2010) sq mi': '57.47',
  'Land area(2010) km2': '148.8',
  'Population density': '111.8/sq\xa0mi (43.2/km2)'},
 {'Name': 'Hopkinton',
  'Type': 'Town',
  'County': 'Washington',
  'Year established': '1639',
  'Year incorporated': '1757',
  'Form of government': 'Town meeting',
  'Population(2010)': '8,188',
  'Population(2000)': '7,836',
  'Change': '+4.

We can select a single "column" with a simple list comprehension:

In [51]:
[row['Name'] for row in ridata]

['Barrington',
 'Bristol',
 'Burrillville',
 'Central Falls',
 'Charlestown',
 'Coventry',
 'Cranston',
 'Cumberland',
 'East Greenwich',
 'East Providence',
 'Exeter',
 'Foster',
 'Glocester',
 'Hopkinton',
 'Jamestown',
 'Johnston',
 'Lincoln',
 'Little Compton',
 'Middletown',
 'Narragansett',
 'Newport',
 'New Shoreham',
 'North Kingstown',
 'North Providence',
 'North Smithfield',
 'Pawtucket',
 'Portsmouth',
 'Providence',
 'Richmond',
 'Scituate',
 'Smithfield',
 'South Kingstown',
 'Tiverton',
 'Warren',
 'Warwick',
 'Westerly',
 'West Greenwich',
 'West Warwick',
 'Woonsocket']
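
To keep more than one "column" at a time, we can nest a dictionary comprehension inside the list comprehension. This is a sketch on a two-row sample; the `towns` and `wanted` names are made up for the example, and the values are copied from the table above.

```python
# Two-row sample with values taken from the municipalities table.
towns = [
    {'Name': 'Barrington', 'County': 'Bristol', 'Type': 'Town'},
    {'Name': 'Charlestown', 'County': 'Washington', 'Type': 'Town'},
]

wanted = ['Name', 'County']

# For each row, build a smaller dictionary holding only the wanted keys.
subset = [{key: row[key] for key in wanted} for row in towns]

print(subset)
```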

##### Practice
Experiment with selecting rows or columns from one of the datasets you've loaded (your data, `personsx.csv`, or your Wikipedia table).

In [54]:
#Choose a subset of rows
railroaddata[5:]

[{'Name': 'Moshassuck Valley Railroad',
  'Mark': 'MOV',
  'System': '',
  'From': '1874',
  'To': '1981',
  'Successor': 'Providence and Worcester Railroad'},
 {'Name': 'Narragansett Pier Railroad',
  'Mark': 'NAP',
  'System': 'NH',
  'From': '1868',
  'To': '1981',
  'Successor': 'N/A'},
 {'Name': 'New England Railroad',
  'Mark': '',
  'System': 'NH',
  'From': '1895',
  'To': '1908',
  'Successor': 'New York, New Haven and Hartford Railroad'},
 {'Name': 'New York and Boston Railroad',
  'Mark': '',
  'System': 'NH',
  'From': '1853',
  'To': '1865',
  'Successor': 'Boston, Hartford and Erie Railroad'},
 {'Name': 'New York and New England Railroad',
  'Mark': '',
  'System': 'NH',
  'From': '1873',
  'To': '1895',
  'Successor': 'New England Railroad'},
 {'Name': 'New York, New Haven and Hartford Railroad',
  'Mark': 'NH',
  'System': 'NH',
  'From': '1893',
  'To': '1969',
  'Successor': 'Penn Central Transportation Company'},
 {'Name': 'New York, Providence and Boston Railroad',


In [55]:
#Choose the rows whose 'System' column is "NH"
[x for x in railroaddata if x['System']=="NH"]

[{'Name': 'Boston and Providence Railroad',
  'Mark': '',
  'System': 'NH',
  'From': '1853',
  'To': '1972',
  'Successor': 'Penn Central Transportation Company'},
 {'Name': 'Boston and Providence Railroad and Transportation Company',
  'Mark': '',
  'System': 'NH',
  'From': '1834',
  'To': '1853',
  'Successor': 'Boston and Providence Railroad'},
 {'Name': 'Fall River, Warren and Providence Railroad',
  'Mark': '',
  'System': 'NH',
  'From': '1862',
  'To': '1892',
  'Successor': 'Old Colony Railroad'},
 {'Name': 'Hartford, Providence and Fishkill Railroad',
  'Mark': '',
  'System': 'NH',
  'From': '1849',
  'To': '1879',
  'Successor': 'New York and New England Railroad'},
 {'Name': 'Narragansett Pier Railroad',
  'Mark': 'NAP',
  'System': 'NH',
  'From': '1868',
  'To': '1981',
  'Successor': 'N/A'},
 {'Name': 'New England Railroad',
  'Mark': '',
  'System': 'NH',
  'From': '1895',
  'To': '1908',
  'Successor': 'New York, New Haven and Hartford Railroad'},
 {'Name': 'New York

#### [Sorting](https://docs.python.org/3/howto/sorting.html)

Sorting simple lists is simple.

In [56]:
from random import randrange
somelist = [randrange(100) for x in range(10)]
print(somelist)
print(sorted(somelist))

[31, 60, 67, 17, 92, 78, 0, 38, 94, 53]
[0, 17, 31, 38, 53, 60, 67, 78, 92, 94]
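
Two related tools are worth knowing: the `reverse` argument, and the list method `.sort()`, which sorts in place rather than returning a new list. A short sketch, using the first few numbers from the list above:

```python
somelist = [31, 60, 67, 17]

# sorted() returns a new list; reverse=True gives descending order.
descending = sorted(somelist, reverse=True)
unchanged = list(somelist)   # copy, to show sorted() left the original alone

# list.sort() rearranges the list itself and returns None.
result = somelist.sort()

print(descending)
print(unchanged)
print(somelist, result)
```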


Our lists of dictionaries are slightly more complex. We must provide a `key` argument: a function that takes one item from the list and returns the value to sort by. We can use a `lambda`.

In [57]:
sorted(ridata, key=lambda muni: muni['Year established'])

[{'Name': 'Warren',
  'Type': 'Town',
  'County': 'Bristol',
  'Year established': '1620',
  'Year incorporated': '1747',
  'Form of government': 'Council–manager',
  'Population(2010)': '10,611',
  'Population(2000)': '11,360',
  'Change': '−6.6%',
  'Land area(2010) sq mi': '6.12',
  'Land area(2010) km2': '15.9',
  'Population density': '1,733.8/sq\xa0mi (669.4/km2)'},
 {'Name': 'Foster',
  'Type': 'Town',
  'County': 'Providence',
  'Year established': '1636',
  'Year incorporated': '1781',
  'Form of government': 'Town meeting',
  'Population(2010)': '4,606',
  'Population(2000)': '4,274',
  'Change': '+7.8%',
  'Land area(2010) sq mi': '50.80',
  'Land area(2010) km2': '131.6',
  'Population density': '90.7/sq\xa0mi (35.0/km2)'},
 {'Name': 'Johnston',
  'Type': 'Town',
  'County': 'Providence',
  'Year established': '1636',
  'Year incorporated': '1759',
  'Form of government': 'Mayor–council',
  'Population(2010)': '28,769',
  'Population(2000)': '28,195',
  'Change': '+2.0%',
 

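We can also use `itemgetter` from the `operator` module. Note that the population values are still strings at this point, so the result is in text order (`'1,051'` sorts before `'10,329'`), not numeric order: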

In [58]:
from operator import itemgetter
sorted(ridata, key=itemgetter('Population(2010)'))

[{'Name': 'New Shoreham',
  'Type': 'Town',
  'County': 'Washington',
  'Year established': '1664',
  'Year incorporated': '1672',
  'Form of government': 'Council–manager',
  'Population(2010)': '1,051',
  'Population(2000)': '1,010',
  'Change': '+4.1%',
  'Land area(2010) sq mi': '9.08',
  'Land area(2010) km2': '23.5',
  'Population density': '115.7/sq\xa0mi (44.7/km2)'},
 {'Name': 'Scituate',
  'Type': 'Town',
  'County': 'Providence',
  'Year established': '1636',
  'Year incorporated': '1730',
  'Form of government': 'Town meeting',
  'Population(2010)': '10,329',
  'Population(2000)': '10,324',
  'Change': '0.0%',
  'Land area(2010) sq mi': '48.16',
  'Land area(2010) km2': '124.7',
  'Population density': '214.5/sq\xa0mi (82.8/km2)'},
 {'Name': 'Warren',
  'Type': 'Town',
  'County': 'Bristol',
  'Year established': '1620',
  'Year incorporated': '1747',
  'Form of government': 'Council–manager',
  'Population(2010)': '10,611',
  'Population(2000)': '11,360',
  'Change': '−6.6
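
Because the population values are strings such as `'10,329'`, the sort above compares them character by character. A sketch of a numeric sort follows, using a `key` function that strips the commas and converts to `int`. The four-row sample uses values from the table; the variable names are assumptions for this example.

```python
from operator import itemgetter

towns = [
    {'Name': 'Barrington', 'Population(2010)': '16,310'},
    {'Name': 'Foster', 'Population(2010)': '4,606'},
    {'Name': 'New Shoreham', 'Population(2010)': '1,051'},
    {'Name': 'Scituate', 'Population(2010)': '10,329'},
]

# Text order: '4,606' lands last because '4' sorts after '1'.
by_text = sorted(towns, key=itemgetter('Population(2010)'))

# Numeric order: remove the commas and convert before comparing.
by_number = sorted(towns, key=lambda t: int(t['Population(2010)'].replace(',', '')))

as_text = [t['Name'] for t in by_text]
as_number = [t['Name'] for t in by_number]
print(as_text)
print(as_number)
```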

##### Practice
Experiment with sorting your data.

In [59]:
#Sort one of your open datasets.
sorted(railroaddata, key=lambda muni: muni['To'])

[{'Name': 'Seekonk Branch Railroad',
  'Mark': '',
  'System': 'NH',
  'From': '1836',
  'To': '1839',
  'Successor': 'Boston and Providence Railroad',
  'Notes': 'Was located in what was part of Massachusetts before the boundary was moved in 1862'},
 {'Name': 'Providence and Plainfield Railroad',
  'Mark': '',
  'System': 'NH',
  'From': '1846',
  'To': '1851',
  'Successor': 'Hartford, Providence and Fishkill Railroad'},
 {'Name': 'Providence and Bristol Railroad',
  'Mark': '',
  'System': 'NH',
  'From': '1850',
  'To': '1852',
  'Successor': 'Providence, Warren and Bristol Railroad'},
 {'Name': 'Boston and Providence Railroad and Transportation Company',
  'Mark': '',
  'System': 'NH',
  'From': '1834',
  'To': '1853',
  'Successor': 'Boston and Providence Railroad'},
 {'Name': 'Woonsocket Union Railroad',
  'Mark': '',
  'System': 'NH',
  'From': '1850',
  'To': '1853',
  'Successor': 'New York and Boston Railroad'},
 {'Name': 'Warren and Fall River Railroad',
  'Mark': '',
  'Sy

#### Cleaning

We can loop over the list to make changes to our data. Here we use the `.isnumeric()` [string method](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str) to check whether a string contains only numeric characters before converting it to an `int` or `float`. Periods and commas are not numeric characters, so we use the `.replace()` method to remove them before the check as needed.

In [60]:
for row in ridata:
    if row['Year established'].isnumeric():
        row['Year established'] = int(row['Year established'])
    if row['Land area(2010) sq mi'].replace('.', '').isnumeric(): 
        row['Land area(2010) sq mi'] = float(row['Land area(2010) sq mi'])
    if row['Population(2010)'].replace('.', '').replace(',','').isnumeric():
        row['Population(2010)'] = int(row['Population(2010)'].replace(',',''))

In [61]:
ridata[0]

{'Name': 'Barrington',
 'Type': 'Town',
 'County': 'Bristol',
 'Year established': 1653,
 'Year incorporated': '1770',
 'Form of government': 'Council–manager',
 'Population(2010)': 16310,
 'Population(2000)': '16,819',
 'Change': '−3.0%',
 'Land area(2010) sq mi': 8.22,
 'Land area(2010) km2': '21.3',
 'Population density': '1,984.2/sq\xa0mi (766.1/km2)'}
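
An alternative to testing with `.isnumeric()` is to attempt the conversion and catch the error if it fails. The helper below, `to_number`, is a hypothetical name and only a sketch of that idea; it leaves any value it cannot parse unchanged.

```python
def to_number(text):
    """Return text as an int or float if possible, otherwise return it unchanged."""
    cleaned = text.replace(',', '').strip()
    # Try int first so that '1653' becomes 1653 rather than 1653.0.
    for convert in (int, float):
        try:
            return convert(cleaned)
        except ValueError:
            pass
    return text

print(to_number('16,310'))    # thousands separator removed
print(to_number('8.22'))      # parsed as a float
print(to_number('1743[b]'))   # footnote marker: left as a string
```

Values with leftover footnote markers, like `'1743[b]'`, stay as strings, so it is still worth checking the results before doing arithmetic.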

We can use the `.split()` string method to extract specific parts of a string when we know the string has some regular formatting. For example:

In [62]:
pd = ridata[0]['Population density']
print(ridata[0]['Name']+"'s population density:", pd)
#Population density per square mile:
print("Per square mile:", pd.split('/')[0])

#per square km:
print("Per square kilometer:", pd.split('(')[1].split('/')[0])

Barrington's population density: 1,984.2/sq mi (766.1/km2)
Per square mile: 1,984.2
Per square kilometer: 766.1
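
The pieces returned by `.split()` are still strings. A short sketch of taking them the rest of the way to numbers follows; the `density` string is copied from Barrington's row, including its non-breaking space (`\xa0`).

```python
density = '1,984.2/sq\xa0mi (766.1/km2)'

# Everything before the first '/' is the per-square-mile figure.
per_sq_mi = float(density.split('/')[0].replace(',', ''))

# The per-square-kilometer figure sits between '(' and the next '/'.
per_km2 = float(density.split('(')[1].split('/')[0].replace(',', ''))

print(per_sq_mi, per_km2)
```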


##### Practice 
Find a field in your data that should be numeric and convert it to integers or floating-point numbers.

In [74]:
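for row in railroaddata:
    #Convert both year fields; 'From' is needed later to calculate durations.
    for field in ('From', 'To'):
        if row[field].isnumeric():
            row[field] = int(row[field])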

#### Derived fields

Sometimes, the numbers we want to analyze are not provided in the data we have, but can be calculated from that data. We'll want to add new fields to the data with our calculated figures.

Earlier, I showed how we could extract population density from the text of the 'Population density' field. But we can also calculate it from the population and area numbers we've already converted to numeric variables:

In [67]:
pop  = ridata[0]['Population(2010)']
area = ridata[0]['Land area(2010) sq mi']
print(ridata[0]['Name']+"'s population density:", pop/area, "/square mile")

Barrington's population density: 1984.184914841849 /square mile


We can add this figure to every row of our data:

In [68]:
for row in ridata:
    row['population_density'] = row['Population(2010)'] / row['Land area(2010) sq mi']

ridata

[{'Name': 'Barrington',
  'Type': 'Town',
  'County': 'Bristol',
  'Year established': 1653,
  'Year incorporated': '1770',
  'Form of government': 'Council–manager',
  'Population(2010)': 16310,
  'Population(2000)': '16,819',
  'Change': '−3.0%',
  'Land area(2010) sq mi': 8.22,
  'Land area(2010) km2': '21.3',
  'Population density': '1,984.2/sq\xa0mi (766.1/km2)',
  'population_density': 1984.184914841849},
 {'Name': 'Bristol',
  'Type': 'Town',
  'County': 'Bristol',
  'Year established': 1680,
  'Year incorporated': '1746',
  'Form of government': 'Council–manager',
  'Population(2010)': 22954,
  'Population(2000)': '22,469',
  'Change': '+2.2%',
  'Land area(2010) sq mi': 9.82,
  'Land area(2010) km2': '25.4',
  'Population density': '2,337.5/sq\xa0mi (902.5/km2)',
  'population_density': 2337.4745417515273},
 {'Name': 'Burrillville',
  'Type': 'Town',
  'County': 'Providence',
  'Year established': 1730,
  'Year incorporated': '1806',
  'Form of government': 'Council–manager',
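
The same pattern works for other derived fields. As a sketch, the percentage change in population can be recomputed from the two census counts. The key names `pop_2000`, `pop_2010`, and `pct_change` are hypothetical, and the counts are assumed to have been converted to integers already.

```python
# Two rows with counts taken from the table, already converted to integers.
towns = [
    {'Name': 'Barrington', 'pop_2000': 16819, 'pop_2010': 16310},
    {'Name': 'Bristol', 'pop_2000': 22469, 'pop_2010': 22954},
]

for row in towns:
    change = (row['pop_2010'] - row['pop_2000']) / row['pop_2000'] * 100
    # Round to one decimal place, like the table's 'Change' column.
    row['pct_change'] = round(change, 1)

for row in towns:
    print(row['Name'], row['pct_change'])
```

The results agree with the 'Change' values Wikipedia reports for these two towns (−3.0% and +2.2%).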


#### Summary Statistics

We've already discussed reading "columns" of data; with the functions in the [`statistics`](https://docs.python.org/3/library/statistics.html) module and the `min()` and `max()` functions, we can summarize those columns.

In [69]:
import statistics

print(statistics.mean([x['Land area(2010) sq mi'] for x in ridata]))
print(min([x['Land area(2010) sq mi'] for x in ridata]), max([x['Land area(2010) sq mi'] for x in ridata]))

26.508205128205127
1.2 59.05
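
Since the same column is used several times, it can be tidier to build it once and reuse it. The sketch below does that for five land areas taken from the table and adds the median alongside the mean; the `areas` and `summary` names are assumptions for this example.

```python
import statistics

# Land areas (sq mi) for five towns from the table.
areas = [8.22, 9.82, 59.05, 36.45, 57.47]

summary = {
    'mean': statistics.mean(areas),
    'median': statistics.median(areas),
    'min': min(areas),
    'max': max(areas),
}

print(summary)
```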


##### Practice

Calculate the mean and range (maximum and minimum values) for a numeric field in one of the loaded datasets.

In [75]:
for row in railroaddata:
    row['Duration'] = row['To'] - row['From']

railroaddata


[{'Name': 'Boston and Providence Railroad',
  'Mark': '',
  'System': 'NH',
  'From': 1853,
  'To': 1972,
  'Successor': 'Penn Central Transportation Company',
  'Duration': 119},
 {'Name': 'Boston and Providence Railroad and Transportation Company',
  'Mark': '',
  'System': 'NH',
  'From': 1834,
  'To': 1853,
  'Successor': 'Boston and Providence Railroad',
  'Duration': 19},
 {'Name': 'Consolidated Rail Corporation',
  'Mark': 'CR',
  'System': '',
  'From': 1976,
  'To': 1982,
  'Successor': 'Providence and Worcester Railroad',
  'Duration': 6},
 {'Name': 'Fall River, Warren and Providence Railroad',
  'Mark': '',
  'System': 'NH',
  'From': 1862,
  'To': 1892,
  'Successor': 'Old Colony Railroad',
  'Duration': 30},
 {'Name': 'Hartford, Providence and Fishkill Railroad',
  'Mark': '',
  'System': 'NH',
  'From': 1849,
  'To': 1879,
  'Successor': 'New York and New England Railroad',
  'Duration': 30},
 {'Name': 'Moshassuck Valley Railroad',
  'Mark': 'MOV',
  'System': '',
  'From

In [77]:
import statistics

print(statistics.mean([x['Duration'] for x in railroaddata]))
print(min([x['Duration'] for x in railroaddata]), max([x['Duration'] for x in railroaddata]))

31.157894736842106
1 119
