In [2]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 144

In [3]:
import numpy as np

import expectexception

# Python Data Formats

Data is often stored in files.

Your data can be encoded into one or more files and saved to disk. Ideally, the data is saved in a way that lets us retrieve it later exactly as it was when we started.

The simplest file example is saving arbitrary string data to a text file. That can be done as follows:

In [4]:
with open('example.txt', 'w') as f:
    f.write('hello world')

This creates a new file in the current directory called `example.txt`. Take this opportunity to use Jupyter's Terminal feature to verify the existence of this file.

We can also verify the file using Jupyter's ability to execute shell commands with the `!` character:

In [5]:
!ls -l example.txt

-rw-rw-r-- 1 vagrant vagrant 11 Aug 11 00:32 example.txt


11 bytes, just as expected.

The contents of the file can be retrieved by opening the same file and reading it.

In [6]:
with open('example.txt', 'r') as f:
    print f.read()

hello world


Reading and writing data files are just more sophisticated versions of our simple example above. A table of data could be written to a csv file by joining each row of data with commas and writing that line to a file. Of course nobody writes Python code to write data to a csv file like that because all of the data packages we will use provide utilities for writing data to csv and many other formats. Nevertheless, it is worth realizing that writing a csv file boils down to writing comma separated lists to a file for easy retrieval later.

In [7]:
# don't write csv files like this...
with open('csv_example.txt', 'w') as f:
    f.write('column1,column2,column3\n')
    f.write('1,2,3\n')
    f.write('4,5,6\n')
    f.write('7,8,9\n')

Use the Linux `cat` command to see the contents of the file:

In [8]:
!cat csv_example.txt

column1,column2,column3
1,2,3
4,5,6
7,8,9


If we had a Pandas `DataFrame` we could more simply write the data to a csv file using the `to_csv` method. We'll cover that extensively later.

## File Reading and Writing in more detail

Before continuing, let's understand a little bit more about reading and writing files.

The three steps for writing to a file are:

* Open a file for writing with the `open` command, returning a file handle
* Write to the file handle
* Close the file handle

The last step is necessary and can become a source of errors. Python has the convenient `with` keyword that ensures that the open file handle is always closed after the block of code is executed. Other programming languages don't always have this feature.

If we were to code our example without the `with` keyword, it would look like this:

In [9]:
f = open('example.txt', 'w')
f.write('hello world')
f.close()  # don't forget!

If we forgot to close the file, the file handle `f` would still be open. The operating system limits the number of file handles that can be open at once; the system-wide limit is typically in the hundreds of thousands, if not more, although the limit for a single process is often much lower. That limit isn't usually the problem though: a more likely scenario is that an exception (error) is thrown after the file is opened and before the file is closed. If this happens, the `close` method is never called and the file stays open. This can cause problems for you later, particularly if you are using Windows. The Windows OS puts a lock on an opened file that prevents other processes from opening and writing to the same file; a residual lock can later cause a Python process to fail.
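A minimal sketch of what the `with` keyword protects against: the handle is closed in a `finally` block, so it is released even though an error is raised between opening and closing. The file name, the temporary directory and the simulated `RuntimeError` are assumptions made for this illustration, and `print` is called as a function so the snippet also runs on Python 3.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'try_finally_example.txt')

f = open(path, 'w')
try:
    try:
        f.write('hello world')
        # Simulate an error after the file is opened and before it is closed.
        raise RuntimeError('simulated failure')
    finally:
        # Runs whether or not an exception was raised above.
        f.close()
except RuntimeError as err:
    print('caught: %s; file closed: %s' % (err, f.closed))
```

A `with open(...) as f:` block gives the same guarantee with less code, which is why it is the preferred form.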

Also, note that Python and your operating system may be somewhat *lazy* in that they won't always write the contents of your file to disk right away. Try this:

In [10]:
f = open('mystery.txt', 'w')
f.write('when will this text appear in the file?')

Examine the current contents of `mystery.txt` with the `cat` command:

In [11]:
!cat mystery.txt

Egad!! The file is empty!

If we had just attempted to save precious research results, we would be very upset right now.

The issue is that the contents of `mystery.txt` are currently stored in a buffer somewhere, waiting for a convenient time to write it to disk. Usually Python will write it to disk for real when the buffer of pending writes gets to be a certain size. If we are impatient, we can use the `flush()` method:

In [12]:
f.flush()

The `flush()` method empties the file buffer and pushes the content to disk. The `close()` method does the same thing but `flush()` keeps the file open and allows subsequent writes.

Our file now contains the expected text:

In [13]:
!cat mystery.txt

when will this text appear in the file?

This makes us happy, but let's not allow our exuberance to cause us to forget to close the file:

In [14]:
f.close()

In the future, prefer to read and write files using the `with` statement as above. This won't work in every situation. For example, you might be monitoring financial price data on a website, with the goal of writing data to a file continuously during market hours. In that situation you should keep the file open and flush after each write. This will prevent data loss should your program crash at the end of the day.
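The keep-open-and-flush pattern might look like the sketch below. The price ticks, the log file name and the temporary directory are all made up for illustration; a real feed would replace the hard-coded list. After every `flush()` the file is re-opened through a second handle to show that the record is already visible outside the writing handle.

```python
import os
import tempfile

# Hypothetical price ticks standing in for a live data feed.
ticks = [('09:30:00', 101.25), ('09:30:01', 101.30), ('09:30:02', 101.10)]

# Scratch log file in a temporary directory (illustrative name).
log_path = os.path.join(tempfile.mkdtemp(), 'prices.log')

lines_visible = []  # number of lines another reader sees after each flush

f = open(log_path, 'w')
try:
    for timestamp, price in ticks:
        f.write('%s,%.2f\n' % (timestamp, price))
        f.flush()  # push Python's buffer to the operating system
        with open(log_path) as reader:
            lines_visible.append(len(reader.readlines()))
finally:
    f.close()

print(lines_visible)  # [1, 2, 3]
```

Note that `flush()` hands the data to the operating system; if the data must also survive a power failure, `os.fsync(f.fileno())` additionally asks the OS to commit it to the storage device.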

## Reading and Writing, or more precisely, Reading OR Writing, but not Both

Observe that when we opened the files in our examples, we instructed the open command that we were opening the file for reading OR for writing. We must decide in advance how we will access the file. If we open a file for reading and attempt to write to it, we will get an error.

In [15]:
%%expect_exception IOError

with open('fail.txt', 'r') as f:
    f.write('this will fail')

IOError                                   Traceback (most recent call last)
<ipython-input-15-5df4ba8f91c7> in <module>()
      1
----> 2 with open('fail.txt', 'r') as f:
      3     f.write('this will fail')

IOError: [Errno 2] No such file or directory: 'fail.txt'


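Note that the error shown above was actually raised by `open`: `fail.txt` does not exist, so it could not be opened for reading in the first place. Had the file existed, `open` would have succeeded and the `write` call would have raised an `IOError` instead. In that case, because we opened the file with the `with` keyword, the file would have been closed for us despite the error.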
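A minimal sketch of the case where the file does exist, so that the failure comes from `write` rather than from `open`. The file name and temporary directory are assumptions for this illustration. Python 2 raises `IOError` here; Python 3 raises `io.UnsupportedOperation`, which is a subclass of `OSError` (of which `IOError` is an alias), so the same `except` clause catches both.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'read_only_example.txt')

# Create the file first so that open(..., 'r') succeeds.
with open(path, 'w') as f:
    f.write('hello world')

try:
    with open(path, 'r') as f:
        f.write('this will fail')  # the handle was opened for reading only
except IOError as err:
    error_name = type(err).__name__

print(error_name)
print(f.closed)  # True: the with block closed the file despite the error
```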

The `'r'` mode argument in the open command indicates we wish to read from an existing file. If the file did not exist, we would get an error. If we used a `'w'` mode argument, we would write to the file. If the file already existed, we would overwrite the file. If we want to append to the file instead, use the mode `'a'`, like this:

In [16]:
with open('example.txt', 'a') as f:
    f.write('\nhello again')

In [17]:
!cat example.txt 

hello world
hello again
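The difference between the `'w'` and `'a'` modes can be checked with a short sketch like this one. The file name and temporary directory are assumptions for illustration; the three strings are arbitrary.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'modes_example.txt')

with open(path, 'w') as f:   # creates the file
    f.write('first\n')

with open(path, 'a') as f:   # appends to the existing contents
    f.write('second\n')

with open(path, 'r') as f:
    after_append = f.read()

with open(path, 'w') as f:   # 'w' discards the previous contents
    f.write('third\n')

with open(path, 'r') as f:
    after_overwrite = f.read()

print(repr(after_append))
print(repr(after_overwrite))
```

The first `print` shows both lines, while the second shows only `'third\n'`, because opening with `'w'` truncates the file before writing.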

We can also open files for reading and writing in binary mode using a `'b'`, like so:

In [18]:
with open('binary_example.txt', 'wb') as f:
    f.write('hello world as binary')

In [19]:
!cat binary_example.txt

hello world as binary

The result looks like ordinary text but this feature will be more important later when we write Python pickle files.

There are some other file modes like `'r+'` and `'w+'` that allow for reading AND writing at the same time, but that requires more complex code to manage the file. For our purposes it is a better practice to either read OR write to a file, not both.

## GZip Files

GZip files are compressed files that reduce the size or footprint of files on disk.

Random bytes of data are difficult or impossible to compress. A text file of English text is easy to compress because the letters in the English alphabet correspond to a limited, non-random set of bytes that can be encoded in a way that uses less space.

As Data Scientists, this can be useful when working with large data files.

Consider a situation where you must read data files from a network server across a slow network or from a slow hard drive. Of course it is always better and faster to read from fast local storage, but you might not always have that option, particularly for larger datasets. In these situations, the size of the files and your ability to read them may be a bottleneck limiting the speed of your overall data analysis. A compressed file will require your computer's CPU to uncompress the file contents, but that extra cost will often be smaller than the cost of reading a larger file over a network.

Empirical testing of your data access can help determine whether compressed files can speed up your analysis.

Python has several libraries for reading and writing compressed files. One of these libraries is the `gzip` library. It can read and write files in the same format that the Linux `gzip` and `gunzip` commands use. Reading and writing gzip files parallels reading and writing ordinary text files.

In [20]:
import gzip

sample_text = 'Python is awesome!'

with gzip.open('test_gzip.txt.gz', 'wb') as f:
    f.write(sample_text)

Note the use of file open mode 'wb', not 'w'. The extra 'b' is for binary.

If we like we can look at the contents of our file `test_gzip.txt.gz` with the Linux `cat` command:

In [21]:
!cat test_gzip.txt.gz

���Y�test_gzip.txt �,���S�,VH,O-��MU ���V   

That's ugly!

A better choice is the Linux command `zcat`, which is the same as `cat` but for gzipped files:

In [22]:
!zcat test_gzip.txt.gz

Python is awesome!

Our compressed file has been written, but the astute reader will notice that the compressed file we just wrote is larger than the text string we wrote.

In [23]:
!ls -l test_gzip.txt.gz

-rw-rw-r-- 1 vagrant vagrant 52 Aug 11 00:32 test_gzip.txt.gz


52 bytes, compared to 18 bytes in our `sample_text` string.

This is because a gzip file carries some fixed overhead, including a header and a trailing checksum. We need a much larger amount of data to see a difference. We'll get to that in a moment, but first, let's read the file back:

In [24]:
with gzip.open('test_gzip.txt.gz', 'rb') as f:
    print f.read()

Python is awesome!


Truly, it is.

Now let's create a much larger file to demonstrate the size differences between compressed and uncompressed files.

In [25]:
import string

random_text = ''.join(np.random.choice(list(string.lowercase), size=100000))

print random_text[:20]

with open('random_text.txt', 'w') as f:
    f.write(random_text)

with gzip.open('random_text.txt.gz', 'wb') as f:
    f.write(random_text)

zikqxcldswcytbzdgxbg


In [26]:
!ls -lh random_text*

-rw-rw-r-- 1 vagrant vagrant 98K Aug 11 00:32 random_text.txt
-rw-rw-r-- 1 vagrant vagrant 62K Aug 11 00:32 random_text.txt.gz


The compressed file has a compression ratio of 98 / 62 ≈ 1.6.

The random text frustrates compression because of the lack of repeated patterns, but the limited character set (only lower case letters) offers room for compression.

Real text or real data can often get a better compression ratio than 1.6.
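A rough sketch of both claims, done entirely in memory. It uses `gzip.compress`, which is available on Python 3 only, and the standard `random` module with a fixed seed instead of NumPy. The repeated sentence is a deliberately easy stand-in for "real text", so its ratio will be far better than typical prose. The bound `8 / log2(26)` is the best ratio any compressor could reach on uniformly random lowercase letters stored one byte each.

```python
import gzip
import math
import random
import string

random.seed(0)  # fixed seed so the run is repeatable

n = 100000
random_text = ''.join(random.choice(string.ascii_lowercase) for _ in range(n))

# Highly repetitive stand-in for real text (an assumption for illustration).
sentence = 'the quick brown fox jumps over the lazy dog. '
repetitive_text = (sentence * (n // len(sentence) + 1))[:n]


def compression_ratio(text):
    """Return uncompressed size divided by gzip-compressed size."""
    raw = text.encode('ascii')
    return len(raw) / float(len(gzip.compress(raw)))


ratio_random = compression_ratio(random_text)
ratio_repetitive = compression_ratio(repetitive_text)

# Each random letter carries log2(26) bits but occupies 8 bits on disk.
best_possible_random = 8 / math.log(26, 2)

print('random letters:   %.2f' % ratio_random)
print('theoretical max:  %.2f' % best_possible_random)
print('repetitive text:  %.2f' % ratio_repetitive)
```

The ratio for random letters should land a little below the theoretical maximum of about 1.70, consistent with the 1.6 measured above.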

## Pickle Files

Pickle files are special Python files for serializing and de-serializing Python objects. The `pickle` module takes everything the Python interpreter knows about an object in memory and writes that to a binary file. It can later retrieve that information to put the object back in memory again, just as it was before.

In [27]:
# pickle file example code
import pickle

sample_data = [sample_text, 42, 3.1415926535, [1, 2, 3], {1: 'a', 2: 'b', 3: 'c'}]

with open('sample_text.p', 'wb') as f:
    pickle.dump(sample_data, f)

The written file is 112 bytes.

In [28]:
!ls -l sample_text.p

-rw-rw-r-- 1 vagrant vagrant 112 Aug 11 00:32 sample_text.p


The file contents look like they kind of make sense...sort of?

In [29]:
!cat sample_text.p

(lp0
S'Python is awesome!'
p1
aI42
aF3.1415926535
a(lp2
I1
aI2
aI3
aa(dp3
I1
S'a'
p4
sI2
S'b'
p5
sI3
S'c'
p6
sa.

This can be retrieved using the pickle `load` method.

In [30]:
with open('sample_text.p', 'rb') as f:
    retrieved_sample_data = pickle.load(f)

retrieved_sample_data

['Python is awesome!', 42, 3.1415926535, [1, 2, 3], {1: 'a', 2: 'b', 3: 'c'}]

That's great! Same as before.
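The same round trip can be done in memory with `pickle.dumps` and `pickle.loads`, the string-based counterparts of `dump` and `load`. This is a small sketch with made-up sample data; it includes a tuple to show that pickle preserves Python types.

```python
import pickle

# Illustrative data, similar to the list pickled above but with a tuple added.
original = ['Python is awesome!', 42, 3.1415926535, (1, 2, 3), {1: 'a', 2: 'b'}]

payload = pickle.dumps(original)   # serialized form, held in memory
restored = pickle.loads(payload)   # rebuilt Python objects

print(restored == original)          # True
print(type(restored[3]).__name__)    # tuple
```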

Python pickle files are very versatile and can save arbitrary objects, including user defined classes. There are a few limitations in what it can save though. It can't save open file handles:

In [31]:
%%expect_exception TypeError

test_file_handle = open('test_file_handle.txt', 'w')

with open('error.p', 'wb') as f:
    pickle.dump(test_file_handle, f)

TypeError                                 Traceback (most recent call last)
<ipython-input-31-cafa40db43cd> in <module>()
      3
      4 with open('error.p', 'wb') as f:
----> 5     pickle.dump(test_file_handle, f)

/opt/conda/lib/python2.7/pickle.pyc in dump(obj, file, protocol)
   1374
   1375 def dump(obj, file, protocol=None):
-> 1376     Pickler(file, protocol).dump(obj)


This might seem like an unimportant limitation, but it can become a problem when you are using custom Python classes while logging data to a file. Don't worry about this for now though.

Before we move on, let's be disciplined Python coders and close `test_file_handle`:

In [32]:
test_file_handle.close()

The last thing to mention is that Python 2 (usually) comes with not one but two pickle libraries. The second one, `cPickle`, reads and writes faster than the other, but on some installations it is not available. Therefore, some Python code you will see in the wild imports pickle as shown below, using the faster version when it is available and falling back to the standard one when it is not.

If you are not reading and writing a lot of pickle files, don't worry about this.

In [33]:
try:
    import cPickle as pickle
except ImportError:
    import pickle

## JSON Files

[JSON](https://en.wikipedia.org/wiki/JSON) (JavaScript Object Notation) files are text files that store serialized data as nested key-value pairs (objects) and ordered lists (arrays).

A Data Scientist will find this data format useful for writing unstructured data. Websites frequently use this format to communicate to and from your browser. You can easily scrape data from a website if you can consume these communications from the web server.

Consider this example data:

In [34]:
student1 = {'name': 'Gary',
            'employment': ('librarian', 'research assistant'),
            'age': 22,
            'major': 'computer science',
            'hobbies': ['running', 'climbing trees', 'eating ice cream'],
            'grades': {'english': 82,
                       'linear algebra': 97,
                       'cpu design': 94}}

student2 = {'name': 'Jill',
            'age': 23,
            'major': 'electrical engineering',
            'minor': 'management',
            'hobbies': ['swimming', 'reading', 'drawing', 'public speaking'],
            'grades': {'french': 88,
                       'calculus': 94,
                       'electronics': 99,
                       'control systems': 95}}

student_list = [student1, student2]

You'll note that `student1` and `student2` are both dictionaries that contain nested dictionaries and lists. Neither could easily fit into an ordinary structured table because they have a variable number of grades and hobbies. They could be shoehorned in (and tragically we've seen this happen) but not without complications.

In [35]:
student_list

[{'age': 22,
  'employment': ('librarian', 'research assistant'),
  'grades': {'cpu design': 94, 'english': 82, 'linear algebra': 97},
  'hobbies': ['running', 'climbing trees', 'eating ice cream'],
  'major': 'computer science',
  'name': 'Gary'},
 {'age': 23,
  'grades': {'calculus': 94,
   'control systems': 95,
   'electronics': 99,
   'french': 88},
  'hobbies': ['swimming', 'reading', 'drawing', 'public speaking'],
  'major': 'electrical engineering',
  'minor': 'management',
  'name': 'Jill'}]

This data structure lends itself to being written to a json file. This can be done like so.

In [36]:
import json

with open('test_json.json', 'w') as f:
    json.dump(student_list, f, indent=2)

In this example we used the optional `indent` parameter to write the content in an easy to read format. Leaving it out produces more compact output on a single line.

In [37]:
!cat test_json.json

[
  {
    "major": "computer science", 
    "name": "Gary", 
    "age": 22, 
    "grades": {
      "linear algebra": 97, 
      "cpu design": 94, 
      "english": 82
    }, 
    "hobbies": [
      "running", 
      "climbing trees", 
      "eating ice cream"
    ], 
    "employment": [
      "librarian", 
      "research assistant"
    ]
  }, 
  {
    "major": "electrical engineering", 
    "name": "Jill", 
    "age": 23, 
    "grades": {
      "calculus": 94, 
      "electronics": 99, 
      "control systems": 95, 
      "french": 88
    }, 
    "hobbies": [
      "swimming", 
      "reading", 
      "drawing", 
      "public speaking"
    ], 
    "minor": "management"
  }
]
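The effect of `indent` can be seen without touching the disk by using `json.dumps`, which returns a string instead of writing to a file. This sketch uses a trimmed, illustrative record and adds `sort_keys=True` so the key order is predictable.

```python
import json

# Trimmed, illustrative record.
record = {'name': 'Gary', 'hobbies': ['running', 'climbing trees']}

compact = json.dumps(record, sort_keys=True)            # single line
pretty = json.dumps(record, indent=2, sort_keys=True)   # one item per line

print(compact)
print(pretty)

# Both strings describe the same data.
print(json.loads(compact) == json.loads(pretty))  # True
```

The compact form is the better choice for files read only by programs; the indented form is easier for people to inspect.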

We can read the file back by parsing the file contents with the `load` method.

In [38]:
with open('test_json.json', 'r') as f:
    student_list2 = json.load(f)
    
student_list2

[{u'age': 22,
  u'employment': [u'librarian', u'research assistant'],
  u'grades': {u'cpu design': 94, u'english': 82, u'linear algebra': 97},
  u'hobbies': [u'running', u'climbing trees', u'eating ice cream'],
  u'major': u'computer science',
  u'name': u'Gary'},
 {u'age': 23,
  u'grades': {u'calculus': 94,
   u'control systems': 95,
   u'electronics': 99,
   u'french': 88},
  u'hobbies': [u'swimming', u'reading', u'drawing', u'public speaking'],
  u'major': u'electrical engineering',
  u'minor': u'management',
  u'name': u'Jill'}]

You should see a few differences. First, the strings (`str`) were converted to `unicode` objects. In Python 2, a `str` holds raw bytes while a `unicode` object holds decoded text characters. The difference isn't important here.

You'll also notice that the first student had an `employment` key that was previously mapped to a `tuple`, but is now mapped to a list. The JSON data format has no concept of a `tuple`, so when a tuple is serialized, it is recreated as a list.

The object types are all `list`s and `dict`s.

In [39]:
print type(student_list2[0])
print type(student_list2[0]['employment'])
print type(student_list2[0]['grades'])

<type 'dict'>
<type 'list'>
<type 'dict'>
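The tuple-to-list conversion can be reproduced in memory with `json.dumps` and `json.loads`. This is a small sketch with a trimmed, illustrative record; it also shows that the tuple has to be rebuilt explicitly if the original type matters.

```python
import json

# Trimmed, illustrative record containing a tuple.
original = {'employment': ('librarian', 'research assistant'), 'age': 22}

restored = json.loads(json.dumps(original))

print(type(restored['employment']).__name__)  # list
print(restored == original)                   # False: list != tuple

# JSON cannot remember the tuple, so convert it back by hand.
restored['employment'] = tuple(restored['employment'])
print(restored == original)                   # True
```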


Interesting side note: The IPython notebook you are looking at right now is a JSON file. The data in each cell and its output can easily be represented in JSON.

In [40]:
!head -n 25 IW_Data_Formats.ipynb

{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib\n",
    "import seaborn as sns\n",
    "matplotlib.rcParams['savefig.dpi'] = 144"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",


### JSON and APIs

The JSON data format is commonly used for communicating with modern APIs.

Let's look at a simple example of consuming a JSON API.  The example we'll look at is a *geocoder*: That is, a service for converting between addresses and normalized geographic information (e.g. latitude and longitude).  Going from addresses to normalized form is "forward geocoding" and going the other way is "reverse geocoding".

We'll interact with a free (and non-authenticated) geocoder run by OpenStreetMap.  The geocoded information is available by sending a GET request to `http://nominatim.openstreetmap.org/search?q=<address>&addressdetails=1&format=json`, where `<address>` is the address to look up.  The portion before the question mark (`http://nominatim.openstreetmap.org/search`) is the endpoint on the server, while the portion following, known as the *query string*, contains the data being sent to the server.  (Thus, a GET request can be repeated simply by requesting the same URL again.  In contrast, the data sent in a POST request is contained in the request body, not in the URL.)

As is typical, the query string consists of several key=value pairs, separated by ampersands.  The requested address is specified with the `q` key in this case.  Some characters, like spaces and commas, cannot be used in the URL, so they must be encoded.  To save you the pain of doing that manually, the `requests` module takes a dictionary of key-value pairs and formats the query string for you.
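To see what that encoding involves without making a network request, here is a sketch that builds the query string with the standard library's `urlencode` rather than `requests`. The address is the one used in this section's request; the second string with a comma is made up to show how a comma is escaped. A list of pairs is used instead of a dictionary only to keep the parameter order fixed.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

params = [('q', '11604 Fulham Street 20902'),
          ('addressdetails', 1),
          ('format', 'json')]

query_string = urlencode(params)
url = 'http://nominatim.openstreetmap.org/search?' + query_string
print(url)

# Spaces become '+', and a comma becomes '%2C'.
print(urlencode([('q', 'extra, comma')]))  # q=extra%2C+comma
```

The printed URL should match the one that `requests` reports for the same parameters.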

In [42]:
import requests
address = "11604 Fulham Street 20902"

response = requests.get("http://nominatim.openstreetmap.org/search",
                        params={'q': address, 
                                'addressdetails': 1, 
                                'format': 'json'})

The response object that is returned records the URL that was used...

In [43]:
response.url

u'http://nominatim.openstreetmap.org/search?q=11604+Fulham+Street+20902&addressdetails=1&format=json'

...as well as the response that the server gave.  (200 is the HTTP status code for OK.)

In [44]:
response.status_code, response.reason

(200, 'OK')

The data returned by the server is available in the `.text` attribute.

In [45]:
response.text[:200]

u'[{"place_id":"63209438","licence":"Data \xa9 OpenStreetMap contributors, ODbL 1.0. http:\\/\\/www.openstreetmap.org\\/copyright","osm_type":"way","osm_id":"5975353","boundingbox":["39.043525","39.047622","-'

Note that the text has been properly decoded into unicode.  (The string is prefixed by `u`.)  If you need the raw bytes for some reason, they are available in the `.content` attribute.

In [46]:
response.content[:200]

'[{"place_id":"63209438","licence":"Data \xc2\xa9 OpenStreetMap contributors, ODbL 1.0. http:\\/\\/www.openstreetmap.org\\/copyright","osm_type":"way","osm_id":"5975353","boundingbox":["39.043525","39.047622","'
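The relationship between `.content` and `.text` can be sketched with a short fragment of the response above, assuming the server's encoding is UTF-8 (which is what the `\xc2\xa9` byte pair for the copyright sign suggests). Decoding turns the two bytes into a single character.

```python
# Fragment of the raw bytes shown above.
raw = b'Data \xc2\xa9 OpenStreetMap contributors'

# Decoding as UTF-8 (an assumption about the server's encoding).
text = raw.decode('utf-8')

print(len(raw) - len(text))          # 1: two bytes became one character
print(text == u'Data \xa9 OpenStreetMap contributors')  # True
print(text.encode('utf-8') == raw)   # True: encoding reverses the decoding
```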

We could interpret this text as JSON with `json.loads`, but `requests` provides a convenience method `.json()` on the response object that does this for us.

In [47]:
response.json()

[{u'address': {u'country': u'United States of America',
   u'country_code': u'us',
   u'county': u'Montgomery County',
   u'locality': u'Kemp Mill',
   u'neighbourhood': u'Northwood Forest',
   u'postcode': u'20902',
   u'road': u'Fulham Street',
   u'state': u'Maryland'},
  u'boundingbox': [u'39.043525', u'39.047622', u'-77.027832', u'-77.024993'],
  u'class': u'highway',
  u'display_name': u'Fulham Street, Northwood Forest, Kemp Mill, Montgomery County, Maryland, 20902, United States of America',
  u'importance': 0.41,
  u'lat': u'39.044986',
  u'licence': u'Data \xa9 OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright',
  u'lon': u'-77.027793',
  u'osm_id': u'5975353',
  u'osm_type': u'way',
  u'place_id': u'63209438',
  u'type': u'residential'}]

In [48]:
response.json()[0]['boundingbox']

[u'39.043525', u'39.047622', u'-77.027832', u'-77.024993']
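Note that the coordinates come back as strings, so they need converting before any arithmetic. The sketch below parses a trimmed copy of the response shown above from a hard-coded JSON string, so it needs no network access. Reading the bounding box as south, north, west, east is an assumption based on the values themselves.

```python
import json

# Trimmed copy of the response shown above, kept as a JSON string.
sample = ('[{"lat": "39.044986", "lon": "-77.027793", '
          '"boundingbox": ["39.043525", "39.047622", '
          '"-77.027832", "-77.024993"]}]')

results = json.loads(sample)
first = results[0]

# The API returns numbers as strings; convert them to floats.
lat = float(first['lat'])
lon = float(first['lon'])

# Assumed order: south, north, west, east.
south, north, west, east = [float(x) for x in first['boundingbox']]

print(lat, lon)
print(south <= lat <= north and west <= lon <= east)  # True
```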

## CSV Files

The Python standard library includes a library for reading and writing csv files. It doesn't have some of the advanced csv features that the Pandas library comes with, but it is still a respectable library in its own right.

Here is an example of writing some sample data to a file using this module.

In [49]:
import csv

data = []
data.append(['this', 'is', 'a', 'test'])
data.append(['what', 'will an', 'extra, comma', 'do?'])
data.append(['some', 'numbers', 42, 3.14])

with open('csv_test.csv', 'w') as csvfile:
    test_writer = csv.writer(csvfile)
    for row in data:
        test_writer.writerow(row)

We can observe that the file was written to disk, as expected. Take note of how it handled the extra comma in one of the strings: the writer wrapped that field in double quotes. Without the quotes, a reader would be confused by the comma and think that there is an extra column in that row.

In [50]:
!cat csv_test.csv

this,is,a,test
what,will an,"extra, comma",do?
some,numbers,42,3.14
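A short sketch of why the quoting matters, comparing a naive `','.join` with the `csv` module on the row that contains a comma. It works in memory with `io.StringIO` and is written for Python 3, where the `csv` module reads and writes text strings.

```python
import csv
import io

row = ['what', 'will an', 'extra, comma', 'do?']

# Naive approach: join with commas, then split on commas.
naive_line = ','.join(row)
naive_fields = naive_line.split(',')
print(len(naive_fields))  # 5: the embedded comma created a spurious field

# csv module: the writer quotes the field, and the reader undoes the quoting.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
quoted_line = buffer.getvalue()
parsed = next(csv.reader(io.StringIO(quoted_line)))

print(quoted_line.strip())
print(parsed == row)  # True
```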


In [51]:
data2 = []

with open('csv_test.csv', 'r') as csvfile:
    test_reader = csv.reader(csvfile)
    for row in test_reader:
        data2.append(row)
        
data2

[['this', 'is', 'a', 'test'],
 ['what', 'will an', 'extra, comma', 'do?'],
 ['some', 'numbers', '42', '3.14']]

You'll notice that `data2` is not quite the same as `data`...can you see the difference?

The numbers `42` and `3.14` are now strings. Unlike pickle files, the types of the data are not preserved.
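If the types matter, they have to be restored by hand after reading. The helper below, `parse_cell`, is a hypothetical name invented for this sketch; it tries `int`, then `float`, and otherwise leaves the string alone.

```python
def parse_cell(value):
    """Return value as an int or float when possible, else unchanged."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass  # not parseable with this type; try the next one
    return value


# Rows as the csv reader returns them: every cell is a string.
rows = [['this', 'is', 'a', 'test'],
        ['some', 'numbers', '42', '3.14']]

typed_rows = [[parse_cell(cell) for cell in row] for row in rows]

print(typed_rows[1])  # ['some', 'numbers', 42, 3.14]
```

This kind of guessing is fragile: for example, a cell containing the text `nan` or `inf` would be turned into a float. Explicit per-column conversion is safer when the layout of the file is known.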

Later we will learn about Pandas and its `read_csv` function. Pandas has many advanced features for inferring column types as integers or floats. It can also automatically parse date columns. This can save you time when working with complex CSV files.

## XML Files

XML files are less popular these days than JSON files. As a data format they have a reputation for being larger and more cumbersome. There is some truth to this, but it's also true that XML can be more than a data format, whereas JSON files are only a data format. We won't get into these differences here, as we are only concerned with data right now.

To read and write XML files, we will use the third-party library `lxml`. We will start by creating XML elements.

In [52]:
from lxml import etree

student_list = etree.Element("student_list")

student1 = etree.SubElement(student_list, "student")
name1 = etree.SubElement(student1, "name")
name1.text = "Gary"
name1.set('type', 'first name')

major1 = etree.SubElement(student1, "major")
major1.text = "computer science"

hobbies1 = etree.SubElement(student1, "hobbies")
hobby1 = etree.SubElement(hobbies1, "hobby")
hobby1.text = 'running'
hobby2 = etree.SubElement(hobbies1, "hobby")
hobby2.text = 'climbing trees'
hobby3 = etree.SubElement(hobbies1, "hobby")
hobby3.text = 'eating ice cream'

student_list_xml = etree.tostring(student_list, pretty_print=True)

print student_list_xml

<student_list>
  <student>
    <name type="first name">Gary</name>
    <major>computer science</major>
    <hobbies>
      <hobby>running</hobby>
      <hobby>climbing trees</hobby>
      <hobby>eating ice cream</hobby>
    </hobbies>
  </student>
</student_list>



As you can see, preparing XML this way is tedious. If you are working with data in Python it is much simpler to use JSON.

Nevertheless, it can be written to a text file:

In [53]:
with open('test_xml.xml', 'w') as f:
    f.write(student_list_xml)

In [54]:
!cat test_xml.xml

<student_list>
  <student>
    <name type="first name">Gary</name>
    <major>computer science</major>
    <hobbies>
      <hobby>running</hobby>
      <hobby>climbing trees</hobby>
      <hobby>eating ice cream</hobby>
    </hobbies>
  </student>
</student_list>


XML consists of a nested set of tags, with the tag hierarchy defined by the ordering of the open and close tags.

Observe that the XML tags are similar to the keys used in the previous JSON file. The content between the open and close tags can be other tags or text. The tags can also have attributes, as we did above with `type="first name"`.

This can be retrieved and parsed.

In [55]:
student_list_xml2 = etree.parse('test_xml.xml')

print etree.tostring(student_list_xml2)

<student_list>
  <student>
    <name type="first name">Gary</name>
    <major>computer science</major>
    <hobbies>
      <hobby>running</hobby>
      <hobby>climbing trees</hobby>
      <hobby>eating ice cream</hobby>
    </hobbies>
  </student>
</student_list>


Data can also be extracted from the XML tree.

In [56]:
for student in student_list_xml2.findall('student'):
    name = student.find('name').text
    print name
    print '=' * len(name)
    hobbies = student.find('hobbies')
    for hobby in hobbies.findall('hobby'):
        print hobby.text

Gary
====
running
climbing trees
eating ice cream
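If `lxml` is not installed, the standard library's `xml.etree.ElementTree` offers a similar `find`/`findall` interface. The sketch below swaps it in for `lxml`, parses the same XML from a string, and collects each student into a dictionary. The dictionary layout, including the `name_type` key, is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

xml_text = """<student_list>
  <student>
    <name type="first name">Gary</name>
    <major>computer science</major>
    <hobbies>
      <hobby>running</hobby>
      <hobby>climbing trees</hobby>
      <hobby>eating ice cream</hobby>
    </hobbies>
  </student>
</student_list>"""

root = ET.fromstring(xml_text)  # root is the <student_list> element

students = []
for student in root.findall('student'):
    name = student.find('name')
    students.append({
        'name': name.text,
        'name_type': name.get('type'),  # attribute value, not element text
        'major': student.find('major').text,
        'hobbies': [h.text for h in student.find('hobbies').findall('hobby')],
    })

print(students)
```

The resulting list of dictionaries has the same shape as the JSON student data earlier, which makes it straightforward to write back out with `json.dump`.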


XML files as a data format are not as easy to use but you may need to read them at some point in your career as a Data Scientist.

Also note that strict HTML can also be valid XML, and some HTML parsers (BeautifulSoup) will try to leverage this to quickly parse a web page. SVG image files are also XML, so you can use the lxml package to read and write them.

### Exercises

1. Find a CSV file somewhere on the Internet that contains data that interests you. Open the file with the `csv` library we learned about.
1. Same thing, but for a JSON file.
1. Using OpenStreetMap, search for your own address. Extract your latitude and longitude from the returned JSON.

### Exit Tickets

1. Why is it important to close files after opening them?
1. What is file compression and when would it be useful?
1. What is a Pickle file? Are there any Python objects that cannot be pickled?

*Copyright &copy; 2016 The Data Incubator.  All rights reserved.*