# UNCLASSIFIED

Transcribed from FOIA Doc ID: 6689695

https://archive.org/details/comp3321

# (U) Setup

(U) For this notebook, you will need the following files: 

- user_file.csv 

- user_file.xml

Note: I had to guess at the contents of these files since they weren't included in the source materials.

(U) Right-click each to download and "Save As," then, from your Jupyter home, navigate to the folder containing this notebook and click the "Upload" button to upload each file from your local system. 

Note: I'm including my homemade versions of these files in the github.

# (U) Introduction: It's Sad, But True 

(U) Much of computing involves reading and writing structured data. Too much, probably. Often that data is contained in files--not even a database. We've already worked with opening, closing, reading from, and writing to text files. We've also frequently used string methods. At first, it might seem that that's all we need to work with CSV, XML, and other structured data formats. 

(U) After all, what could go wrong with the following? 

In [None]:
my_csv_file = open('user_file.csv', 'r')

In [None]:
csv_lines = my_csv_file.read().splitlines()

In [None]:
csv_lines

In [None]:
comma_separated_records = [line.split(',') for line in csv_lines]

In [None]:
comma_separated_records

In [None]:
xml_formatter = """
  <person>
    <name>{}</name>
    <address>{}</address>
    <city>{}</city>
    <state>{}</state>
    <zip>{}</zip>
    <phone>{}</phone>
    <primary_workstation>{}</primary_workstation>
    <username>{}</username>
  </person>"""

In [None]:
xml_records = "\n".join([xml_formatter.format(*record) for record in comma_separated_records])

In [None]:
print(xml_records)

In [None]:
xml_records = "<people>" + xml_records + "</people>"

In [None]:
with open('file.xml', 'w') as f:
    f.write(xml_records)

(U) In a rapidly-developed prototype with controlled input, this may not cause a problem. Given the way the real world works, though, someday this little snippet from a one-off script will become the long-forgotten key component of a huge, enterprise-wide project. Somebody will try to feed it data in just the wrong way at a crucial moment, and it will fail catastrophically. 

(U) When that happens, you'll wish you had used a fully-developed library that would have had a better chance against the malformed data. Thankfully, there are several, and they actually aren't any harder to get started with. 
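A minimal sketch of two ways the hand-rolled approach can break. The sample record and the name `Smith & Sons` are made up for illustration and are not taken from `user_file.csv`.

```python
import csv
import io
from xml.etree import ElementTree

# Hypothetical record: the address contains a comma, so the field is quoted.
raw_line = 'Ada Lovelace,"12 St James Square, Apt 4",London\n'

naive_fields = raw_line.strip().split(',')
csv_fields = next(csv.reader(io.StringIO(raw_line)))

print(len(naive_fields))  # 4: the quoted address was cut in two
print(len(csv_fields))    # 3: the quoted address stays in one field

# Hypothetical name containing a character that is special in XML.
naive_xml = '<name>{}</name>'.format('Smith & Sons')
try:
    ElementTree.fromstring(naive_xml)
    xml_is_well_formed = True
except ElementTree.ParseError:
    xml_is_well_formed = False

print(xml_is_well_formed)  # False: the bare ampersand was never escaped
```

Neither failure shows up with tidy input, which is why string splitting and string formatting tend to survive until real data arrives.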

# (U) Comma Separated Values (CSV) 

(U) The most exciting things about the `csv` module are the `DictReader` and `DictWriter` classes. First, let's look at the plain vanilla options for reading and writing. 

In [None]:
import csv

In [None]:
f = open('user_file.csv', newline='')  # newline='' lets the csv module handle line endings itself

In [None]:
reader = csv.reader(f)

In [None]:
header = next(reader)

In [None]:
header

In [None]:
all_lines = [line for line in reader]

In [None]:
all_lines

In [None]:
all_lines.sort()

In [None]:
g = open('user_file_sorted.csv', 'w', newline='')  # newline='' avoids blank rows on some platforms

In [None]:
writer = csv.writer(g)

In [None]:
writer.writerow(header)

In [None]:
writer.writerows(all_lines) 

In [None]:
g.close() 

(U) CSV readers and writers have other options involving _dialects_ and _delimiters_. Note that the argument to `csv.reader` must be an open file (or any other iterable that yields lines of text), and reading starts at the file's current position. 
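As a rough illustration of those options, the sketch below reads pipe-separated text by passing `delimiter='|'` and writes it back out using the built-in `excel-tab` dialect. The sample data is invented, and `io.StringIO` stands in for real files.

```python
import csv
import io

# Invented pipe-separated input; io.StringIO behaves like an open text file.
pipe_data = io.StringIO('name|city\nAda|London\nGrace|New York\n')
pipe_rows = list(csv.reader(pipe_data, delimiter='|'))
print(pipe_rows[1])

# Write the same rows as tab-separated values using a named dialect.
tab_out = io.StringIO()
csv.writer(tab_out, dialect='excel-tab').writerows(pipe_rows)
print(tab_out.getvalue())

# The dialects registered in this Python installation.
print(csv.list_dialects())
```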

(U) Accessing categorical data positionally is not ideal. That is why `csv` also provides the `DictReader` and `DictWriter` classes, which can also handle records with more or fewer fields than you expect. When given only a file as an argument, a `DictReader` uses the first line as the keys for the remaining lines; however, it is also possible to pass in `fieldnames` as an additional parameter.

In [None]:
f.seek(0)

In [None]:
d_reader = csv.DictReader(f)

In [None]:
records = [line for line in d_reader]

In [None]:
records
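To see how a `DictReader` copes with ragged rows, here is a small sketch on invented data with no header line, so `fieldnames` is passed explicitly. The `restkey` and `restval` arguments are real `DictReader` parameters; the key name `overflow` is just an illustrative choice.

```python
import csv
import io

# Invented headerless data: one normal row, one too long, one too short.
ragged = io.StringIO('Ada,London\nGrace,New York,NY,extra\nAlan\n')

ragged_reader = csv.DictReader(
    ragged,
    fieldnames=['name', 'city'],
    restkey='overflow',  # extra values are collected in a list under this key
    restval='',          # missing values are filled in with this default
)
ragged_rows = list(ragged_reader)

for row in ragged_rows:
    print(dict(row))
```

The long row keeps its surplus values under `overflow`, and the short row gets an empty string for `city` instead of raising an error.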

(U) To see the differences between reader and DictReader, look at how we might extract cities from the records in each. 

In [None]:
# for the object from csv.reader 
cities0 = [record[2] for record in all_lines]

In [None]:
# for the object from csv.DictReader 
cities1 = [record['city'] for record in records]

In [None]:
cities0

In [None]:
sorted(cities0) == sorted(cities1)  # all_lines was sorted in place earlier, so compare without regard to order

(U) In a `DictWriter`, the `fieldnames` parameter is required and headers are not written by default. If you want a header row, add it with the `writeheader` method. If a dictionary passed to the `DictWriter` contains keys that are not in `fieldnames`, the writer raises a `ValueError` by default; pass the keyword argument `extrasaction='ignore'` to drop the extra keys instead.

In [None]:
g = open('names_only.csv', 'w', newline='')  # newline='' avoids blank rows on some platforms

In [None]:
d_writer = csv.DictWriter(g, ['name', 'primary_workstation'], extrasaction='ignore')

In [None]:
d_writer.writeheader()

In [None]:
d_writer.writerows(records)

In [None]:
g.close()
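The effect of `extrasaction` can be checked in isolation. The sketch below uses an invented record and writes to `io.StringIO` rather than a file; the first writer keeps the default behavior and the second passes `extrasaction='ignore'`.

```python
import csv
import io

# Invented record with one key ('username') that is not in fieldnames.
record = {'name': 'Ada', 'city': 'London', 'username': 'alovelace'}

# Default behavior: an unexpected key raises ValueError.
strict_writer = csv.DictWriter(io.StringIO(), fieldnames=['name', 'city'])
try:
    strict_writer.writerow(record)
    strict_raised = False
except ValueError as err:
    strict_raised = True
    print('ValueError:', err)

# With extrasaction='ignore', the unexpected key is silently dropped.
lenient_out = io.StringIO()
lenient_writer = csv.DictWriter(
    lenient_out, fieldnames=['name', 'city'], extrasaction='ignore'
)
lenient_writer.writeheader()
lenient_writer.writerow(record)
print(lenient_out.getvalue())
```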

# (U) JavaScript Object Notation (JSON) 

(U) JSON is another structured data format. In many cases it looks very similar to nested Python `dict`s and `list`s. However, there are enough notable differences from those (e.g. strings must use double quotation marks, boolean values have a lowercase initial letter) that it's wise to use a dedicated module to parse JSON data. Still, _serializing_ and _deserializing_ JSON data structures is relatively painless.
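A quick check of where Python literals and JSON text part ways, using an invented dictionary. It shows how `json.dumps` renders `True` and `None`, and what `json.loads` does with text that is a valid Python `dict` literal but not valid JSON.

```python
import json

# Invented example value containing a boolean and None.
python_value = {'title': 'Emma', 'in_print': True, 'sequel': None}

json_text = json.dumps(python_value)
print(json_text)  # {"title": "Emma", "in_print": true, "sequel": null}

# A Python-style literal with single quotes is not valid JSON.
try:
    json.loads("{'title': 'Emma'}")
    single_quotes_accepted = True
except json.JSONDecodeError:
    single_quotes_accepted = False

print(single_quotes_accepted)  # False
```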

(U) For this section, our example will be a list of novels:

In [None]:
import json

In [None]:
novel_list = []

In [None]:
novel_list.append({'title': 'Pride and Prejudice', 'author': 'Jane Austen'})

In [None]:
novel_list.append({'title': 'Crime and Punishment', 'author': 'Fyodor Dostoevsky'})

In [None]:
novel_list.append({'title': 'The Unconsoled' , 'author': 'Kazuo Ishiguro'}) 

In [None]:
json.dumps(novel_list) # to string

In [None]:
with open('novel_list.json', 'w') as f: 
    json.dump(novel_list, f) # to file

In [None]:
the_hobbit = '{"title": "The Hobbit", "author": "J.R.R. Tolkien"}'

In [None]:
novel_list.append(json.loads(the_hobbit)) # from string 

In [None]:
with open('war_and_peace.json') as f: # <-- if this file existed 
    novel_list.append(json.load(f)) # from file 

(U) By default, the `load` and `loads` functions return JSON strings as Python `str` objects, which are Unicode. It's possible to use the **json** module to define custom encoders and decoders, but this is not usually required. 
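For the cases where customization is needed, one lightweight route is the `default` argument of `json.dumps` together with the `object_hook` argument of `json.loads`. The sketch below is only one possible convention: the helper names `encode_extra` and `decode_extra`, the `__date__` marker key, and the `date_added` field are all made up for this example.

```python
import json
from datetime import date

def encode_extra(obj):
    """Called by json.dumps for objects it cannot serialize itself."""
    if isinstance(obj, date):
        return {'__date__': obj.isoformat()}
    raise TypeError(f'Cannot serialize {type(obj).__name__}')

def decode_extra(d):
    """Called by json.loads for every JSON object it decodes."""
    if '__date__' in d:
        return date.fromisoformat(d['__date__'])
    return d

# Hypothetical record with a field that plain JSON cannot represent.
novel = {'title': 'The Unconsoled', 'date_added': date(2020, 1, 15)}

novel_text = json.dumps(novel, default=encode_extra)
print(novel_text)

restored_novel = json.loads(novel_text, object_hook=decode_extra)
print(restored_novel)
```

Subclassing `json.JSONEncoder` and `json.JSONDecoder` is the heavier alternative when the same rules have to be reused in many places.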

# (U) Extensible Markup Language (XML) 

(U) This lesson is supposed to be simple, but XML is complicated. We'll cover only the basics of reading data from and writing data to files in a very basic XML format using the **ElementTree** API, which is just the most recent of at least three approaches to dealing with XML in the Python Standard Library. We will not discuss attributes or namespaces at all, which are very common features of XML. If you need to process lots of XML quickly, it's probably best to look outside the standard library (probably at a package called **lxml**). 

(U) Although there are other ways to get started, an `ElementTree` can be created from a file by initializing with the keyword argument `file`: 

In [None]:
from xml.etree import ElementTree

In [None]:
xml_file = open('user_file.xml')

In [None]:
user_tree = ElementTree.ElementTree(file=xml_file)

(U) To do much of anything, it's best to pull the root element out of the `ElementTree`. Elements are iterable, so they can be expanded in list comprehensions. To see what is inside an element, the **ElementTree** module provides two module-level functions: `dump` (which prints to the screen and returns `None`) and `tostring`. Each element has a `text` attribute, although in our example these contain at most whitespace except on leaf nodes. 

In [None]:
root_elt = user_tree.getroot()

In [None]:
root_elt

In [None]:
users = [u for u in root_elt]

In [None]:
users

In [None]:
print(ElementTree.tostring(users[1], encoding='unicode'))  # encoding='unicode' returns str rather than bytes

In [None]:
u_children = [x for x in users[1]]

In [None]:
u_children[2].text

In [None]:
u_children[2].text = 'north-x5-1234'

In [None]:
ElementTree.dump(users[1])
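Since the contents of `user_file.xml` are a guess, here is the same idea on a tiny inline document, so the results do not depend on that file. The XML string is invented; `ElementTree.fromstring` parses it directly into a root element.

```python
from xml.etree import ElementTree

# Invented document, written without whitespace between the tags.
people = ElementTree.fromstring(
    '<people><person><name>Ada</name><city>London</city></person></people>'
)

person = people[0]                       # elements support indexing
child_tags = [child.tag for child in person]
child_texts = [child.text for child in person]

print(child_tags)    # ['name', 'city']
print(child_texts)   # ['Ada', 'London']
print(person.text)   # None: no text sits directly inside <person>

person_xml = ElementTree.tostring(person, encoding='unicode')
print(person_xml)
```

In a pretty-printed file, the `text` of a non-leaf element would hold the newline and indentation before its first child rather than `None`.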

(U) To get nested descendant elements directly, use `findall`, which returns a list of all matches, or `find`, which returns the first matched element. Note that these are the actual elements, not copies, so changes made here are visible in the whole element tree.

In [None]:
all_usernames = root_elt.findall('user/name/username')

In [None]:
all_usernames

In [None]:
[n.text for n in all_usernames[1:10]]
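The behavior of `find` and `findall` can also be sketched on an invented inline document. The tag names below are assumptions made for this example and may not match the layout of `user_file.xml`.

```python
from xml.etree import ElementTree

# Invented document with two <person> records.
staff = ElementTree.fromstring(
    '<people>'
    '<person><name>Ada</name><username>alovelace</username></person>'
    '<person><name>Grace</name><username>ghopper</username></person>'
    '</people>'
)

# Paths are relative to the element the method is called on.
usernames = staff.findall('person/username')
print([u.text for u in usernames])

# find returns the first match, or None when nothing matches.
first_username = staff.find('person/username')
missing = staff.find('person/phone')

# The returned element is the one in the tree, so edits show up everywhere.
first_username.text = 'ada'
updated = [u.text for u in staff.findall('.//username')]  # './/' searches all depths
print(updated)  # ['ada', 'ghopper']
```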

(U) To construct an XML document: 

- make an `Element`, 
- `append` other `Element`s to it (repeating as necessary), 
- wrap it all up in an `ElementTree`, and 
- use the `ElementTree.write` method (which takes a file name or a file object opened for writing).

In [None]:
apple = ElementTree.Element('apple')

In [None]:
apple.attrib['color'] = 'red'

In [None]:
apple.set('variety', 'honeycrisp')

In [None]:
apple.text = "Tasty"

In [None]:
ElementTree.dump(apple)

In [None]:
fruit_basket = ElementTree.Element('basket')

In [None]:
fruit_basket.append(apple)

In [None]:
fruit_basket.append(ElementTree.XML('<orange color="orange" variety="navel"></orange>'))

In [None]:
ElementTree.dump(fruit_basket)

In [None]:
fruit_tree = ElementTree.ElementTree(fruit_basket)

In [None]:
fruit_tree.write('fruit_basket.xml')
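As a sanity check on the whole cycle, the sketch below builds a small tree, writes it to disk, and parses it back. It uses `ElementTree.SubElement` as a shortcut for creating and appending in one step, and a temporary directory so nothing is left behind; the `pear` element and file name are invented.

```python
import os
import tempfile
from xml.etree import ElementTree

# Build a small tree; SubElement creates the child and appends it.
basket = ElementTree.Element('basket')
pear = ElementTree.SubElement(basket, 'pear', color='green')
pear.text = 'Crisp'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'basket.xml')
    # xml_declaration=True adds the <?xml ...?> line at the top of the file.
    ElementTree.ElementTree(basket).write(
        path, encoding='utf-8', xml_declaration=True
    )
    reloaded = ElementTree.parse(path).getroot()

reloaded_pear = reloaded.find('pear')
print(reloaded.tag, reloaded_pear.get('color'), reloaded_pear.text)
```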

# (U) Bonus Material: Pickles and Shelves 

(U) At the expense of compatibility with other languages, Python also provides built-in serialization and data storage capabilities in the form of the **pickle** and **shelve** modules.

**pickle** lets you serialize objects into a byte stream that can be saved to a binary file, and **shelve** builds on that by letting you store pickled objects under string keys in a persistent, dictionary-like file.
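A short sketch of the trade-off, using an invented record that holds a `set` and a `tuple`. `pickle.dumps` and `pickle.loads` work on in-memory bytes, so no file is needed here.

```python
import json
import pickle

# Invented record using types that JSON has no direct equivalent for.
book_record = {'tags': {'python', 'data'}, 'page_range': (1, 543)}

blob = pickle.dumps(book_record)      # bytes, readable only by Python
unpickled = pickle.loads(blob)
print(type(blob), unpickled == book_record)

# The same record cannot be written as JSON without extra work.
try:
    json.dumps(book_record)
    json_handles_sets = True
except TypeError:
    json_handles_sets = False

print(json_handles_sets)  # False
```

One caution from the standard library documentation applies to both modules: unpickling can execute arbitrary code, so only load pickles and shelves from sources you trust.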

## (U) Pickling

In [None]:
import pickle

In [None]:
pickleme = {}

In [None]:
pickleme['Title'] = 'Python is Cool'

In [None]:
pickleme['PageCount'] = 543

In [None]:
pickleme['Author'] = 'PythonFanboy1994'

In [None]:
pickleme

In [None]:
with open('/tmp/pickledData.pick', 'wb') as p:
    pickle.dump(pickleme, p)  # dump returns None, so there is nothing to assign

In [None]:
with open('/tmp/pickledData.pick', 'rb') as p:
    w = pickle.load(p)

In [None]:
print(w)

## (U) Shelving 
### (U) Creating a Shelve 

In [None]:
import shelve

In [None]:
pickleme = {}

In [None]:
pickleme['Title'] = 'Python is Cool'

In [None]:
pickleme['PageCount'] = 543

In [None]:
pickleme['Author'] = 'PythonFanboy1994'

In [None]:
db = shelve.open('/tmp/shelve.dat')

In [None]:
db['book1'] = pickleme

In [None]:
db.sync()

In [None]:
pickleme['Title'] = 'Python is Cool -- The Next Phase'

In [None]:
pickleme['PageCount'] = 123

In [None]:
pickleme['Author'] = 'PythonFanboy1994'

In [None]:
pickleme

In [None]:
db['book2'] = pickleme

In [None]:
db.sync()

In [None]:
db.close()

### (U) Opening a Shelve 

In [None]:
bookshelf = shelve.open('/tmp/shelve.dat')

In [None]:
z = list(bookshelf.keys())  # copy the keys into a list so they print readably

In [None]:
a = bookshelf['book1']

In [None]:
b = bookshelf['book2']

In [None]:
print(a)

In [None]:
print(b)

In [None]:
print(z)

In [None]:
bookshelf.close()

### (U) Modifying a Shelve 

In [None]:
db = shelve.open('/tmp/shelve.dat')

In [None]:
z = list(db.keys())  # copy the keys into a list so they print readably

In [None]:
a = db['book1']

In [None]:
b = db['book2']

In [None]:
print(a)

In [None]:
print(b)

In [None]:
print(z)

In [None]:
a['PageCount'] = 544

In [None]:
b['PageCount'] = 129

In [None]:
db['book1'] = a

In [None]:
db['book2'] = b

In [None]:
db.close()
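The cells above fetch each book, change it, and then assign it back to its key. The sketch below suggests why that last step matters: by default a shelf hands back a fresh copy on every lookup, so editing the copy in place is not saved. Passing `writeback=True` to `shelve.open` is the documented alternative. The shelf name `demo_shelf` and the temporary directory are choices made for this example only.

```python
import os
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    shelf_path = os.path.join(tmp_dir, 'demo_shelf')

    with shelve.open(shelf_path) as shelf:
        shelf['book'] = {'PageCount': 543}

    # In-place change on a default shelf: only a temporary copy is modified.
    with shelve.open(shelf_path) as shelf:
        shelf['book']['PageCount'] = 544
    with shelve.open(shelf_path) as shelf:
        after_in_place = shelf['book']['PageCount']

    # With writeback=True, accessed entries are cached and saved on close.
    with shelve.open(shelf_path, writeback=True) as shelf:
        shelf['book']['PageCount'] = 544
    with shelve.open(shelf_path) as shelf:
        after_writeback = shelf['book']['PageCount']

print(after_in_place)   # 543: the in-place edit was lost
print(after_writeback)  # 544: the edit was written back on close
```

`writeback=True` is convenient but keeps every accessed entry in memory and rewrites them all on close, so the explicit reassignment used above is often the cheaper habit.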
