# Chapter 6 - Data Loading, Storage, and File Formats

Accessing data is a necessary first step for using most of the tools in this book. I’m
going to focus on data input and output using pandas, though there are numerous
tools in other libraries to help with reading and writing data in various formats.
Input and output typically fall into a few main categories: reading text files and other
more efficient on-disk formats, loading data from databases, and interacting with
network sources like web APIs.

## 6.1 Reading and Writing Data in Text Format

pandas features a number of functions for reading tabular data as a DataFrame
object. Table 6-1 summarizes some of them, though read_csv and read_table are
likely the ones you’ll use the most.

![](read_file.jpg)

![](read_file2.jpg)

I’ll give an overview of the mechanics of these functions, which are meant to convert
text data into a DataFrame. The optional arguments for these functions may fall into
a few categories:

* __Indexing__.
Can treat one or more columns as the index of the returned DataFrame, and whether
to get column names from the file, the user, or not at all.

*  __Type inference and data conversion__.
This includes the user-defined value conversions and custom list of missing value
markers.

* __Datetime parsing__.
Includes the ability to combine date and time information spread over multiple
columns into a single column in the result.

* __Iterating__.
Support for iterating over chunks of very large files.

* __Unclean data issues__.
Skipping rows or a footer, comments, or other minor things like numeric data
with thousands separated by commas.
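As a rough sketch of how several of these option categories combine in a single call, the following reads a small made-up table from an in-memory buffer. The data and the column names (id, when, amount) are invented for illustration; comment, index_col, parse_dates, and thousands are existing read_csv parameters.

```python
import io

import pandas as pd

# Made-up input: a comment line, a header row, and amounts written
# with a comma as the thousands separator.
raw = io.StringIO(
    "# exported by a hypothetical tool\n"
    "id,when,amount\n"
    'a,2020-01-01,"1,200"\n'
    'b,2020-01-02,"3,400"\n'
)

df = pd.read_csv(
    raw,
    comment="#",           # unclean data: ignore the comment line
    index_col="id",        # indexing: use the id column as the row index
    parse_dates=["when"],  # datetime parsing: convert this column to datetimes
    thousands=",",         # unclean data: read "1,200" as the number 1200
)
print(df)
print(df.dtypes)
```

Reading from io.StringIO instead of a file keeps the example self-contained; the same arguments apply when a path is passed.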

Because of how messy data in the real world can be, some of the data loading functions
(especially read_csv) have grown very complex in their options over time. It’s
normal to feel overwhelmed by the number of different parameters (read_csv has
over 50 as of this writing). The online pandas documentation has many examples
about how each of them works, so if you’re struggling to read a particular file, there
might be a similar enough example to help you find the right parameters.

Some of these functions, like pandas.read_csv, perform type inference, because the
column data types are not part of the data format. That means you don’t necessarily
have to specify which columns are numeric, integer, boolean, or string. Other data
formats, like HDF5, Feather, and msgpack, have the data types stored in the format.
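To make the idea of type inference concrete, here is a minimal sketch on invented data. It reads the same text twice: once letting read_csv infer the column types, and once overriding the inference for one column with the dtype parameter.

```python
import io

import pandas as pd

csv_text = "a,b,message\n1,2.5,hello\n5,6.0,world\n"

# Let pandas infer the types: 'a' holds only whole numbers and 'b' has decimals.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# Override inference for column 'a' by requesting a float type explicitly.
forced = pd.read_csv(io.StringIO(csv_text), dtype={"a": "float64"})
print(forced.dtypes)
```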

Handling dates and other custom types can require extra effort. Let’s start with a
small comma-separated (CSV) text file:

In [9]:
# Print the contents of a text file.
!type examples\ex1.csv 

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


Since this is comma-delimited, we can use read_csv to read it into a DataFrame:

In [11]:
import pandas as pd

In [17]:
df = pd.read_csv('examples/ex1.csv')
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


We could also have used read_table and specified the delimiter:

In [18]:
pd.read_table('examples/ex1.csv', sep=',')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


A file will not always have a header row. Consider this file:

In [19]:
!type examples\ex2.csv

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


To read this file, you have a couple of options. You can allow pandas to assign default
column names, or you can specify names yourself:

In [20]:
pd.read_csv('examples/ex2.csv', header=None)

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [22]:
pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Suppose you wanted the message column to be the index of the returned DataFrame.
You can either indicate you want the column at index 4 or named 'message' using
the index_col argument:

In [23]:
names = ['a', 'b', 'c', 'd', 'message']

In [24]:
pd.read_csv('examples/ex2.csv', names=names, index_col='message')

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In the event that you want to form a hierarchical index from multiple columns, pass a
list of column numbers or names:

In [25]:
!type examples\csv_mindex.csv

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In [26]:
parsed = pd.read_csv('examples/csv_mindex.csv', index_col=['key1','key2'])
parsed

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In some cases, a table might not have a fixed delimiter, using whitespace or some
other pattern to separate fields. Consider a text file that looks like this:

In [27]:
list(open('examples/ex3.txt'))

['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

While you could do some munging by hand, the fields here are separated by a variable
amount of whitespace. In these cases, you can pass a regular expression as the
delimiter to read_csv or read_table. *__Variable whitespace can be expressed by the regular expression \s+__*, so we
have then:

In [31]:
result = pd.read_csv('examples/ex3.txt', sep=r'\s+')
result

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491


The parser functions have many additional arguments to help you handle the wide
variety of irregular file formats that occur (see a partial listing in Table 6-2). For
example, you can skip the first, third, and fourth rows of a file with skiprows:

In [32]:
!type examples\ex4.csv

# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo


In [35]:
pd.read_csv('examples/ex4.csv', skiprows=[0,2,3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Handling missing values is an important and frequently nuanced part of the file parsing
process. Missing data is usually either not present (empty string) or marked by
some sentinel value. By default, pandas uses a set of commonly occurring sentinels,
such as NA and NULL:

In [36]:
!type examples\ex5.csv

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo


In [37]:
result = pd.read_csv('examples/ex5.csv')
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [39]:
pd.isnull(result)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


The na_values option can take either a list or set of strings to consider missing
values:

In [40]:
result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Different NA sentinels can be specified for each column in a dict:

In [41]:
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
sentinels

{'message': ['foo', 'NA'], 'something': ['two']}

In [42]:
pd.read_csv('examples/ex5.csv', na_values=sentinels)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,


Table 6-2 lists some frequently used options in pandas.read_csv and
pandas.read_table.

![](read_csv1.jpg)
![](read_csv2.jpg)

### Reading Text Files in Pieces

When processing very large files or figuring out the right set of arguments to correctly
process a large file, you may only want to read in a small piece of a file or iterate
through smaller chunks of the file.

Before we look at a large file, we make the pandas display settings more compact:

In [56]:
pd.options.display.max_rows = 10

In [44]:
result = pd.read_csv('examples/ex6.csv')
result

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
...,...,...,...,...,...
9995,2.311896,-0.417070,-1.409599,-0.515821,L
9996,-0.479893,-0.650419,0.745152,-0.646038,E
9997,0.523331,0.787112,0.486066,1.093156,K
9998,-0.362559,0.598894,-1.843201,0.887292,G


If you want to only read a small number of rows (avoiding reading the entire file),
specify that with nrows:

In [45]:
pd.read_csv('examples/ex6.csv', nrows=5)

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


To read a file in pieces, specify a chunksize as a number of rows:

In [46]:
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
chunker

<pandas.io.parsers.TextFileReader at 0x2431cf996a0>

The TextFileReader object returned by read_csv allows you to iterate over the parts of
the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:

In [49]:
from pandas import Series, DataFrame

In [57]:
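# Give the empty Series an explicit dtype; newer pandas versions warn without one.
tot = pd.Series([], dtype='float64')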
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

In [62]:
tot = tot.sort_values(ascending=False)

In [64]:
tot[:10]

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64

TextFileReader is also equipped with a get_chunk method that enables you to read
pieces of an arbitrary size.
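A small sketch of get_chunk on invented data follows. It assumes the reader is created with iterator=True, which returns the reader object without fixing a chunk size, so each call can ask for a different number of rows.

```python
import io

import pandas as pd

csv_text = "key,value\nA,1\nB,2\nA,3\nC,4\nA,5\n"

# iterator=True returns a reader object instead of parsing the whole input.
reader = pd.read_csv(io.StringIO(csv_text), iterator=True)

first = reader.get_chunk(3)   # the first three data rows
second = reader.get_chunk(2)  # the next two data rows
reader.close()

print(first)
print(second)
```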

### Writing Data to Text Format

Data can also be exported to a delimited format. Let’s consider one of the CSV files
read before:

In [65]:
data = pd.read_csv('examples/ex5.csv')
data

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Using DataFrame’s to_csv method, we can write the data out to a comma-separated
file:

In [66]:
data.to_csv('examples/out.csv')

In [68]:
!type examples\out.csv

,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


Other delimiters can be used, of course (writing to sys.stdout so it prints the text
result to the console):

In [69]:
import sys

In [70]:
data.to_csv(sys.stdout, sep='|')

|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo


Missing values appear as empty strings in the output. You might want to denote them
by some other sentinel value:

In [71]:
data.to_csv(sys.stdout, na_rep='NULL')

,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo


With no other options specified, both the row and column labels are written. Both of
these can be disabled:

In [72]:
data.to_csv(sys.stdout, index=False, header=False)

one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo


You can also write only a subset of the columns, and in an order of your choosing:

In [73]:
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])

a,b,c
1,2,3.0
5,6,
9,10,11.0


Series also has a to_csv method:

In [74]:
dates = pd.date_range('1/1/2000', periods=7)
dates

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07'],
              dtype='datetime64[ns]', freq='D')

In [76]:
import numpy as np

In [77]:
ts = pd.Series(np.arange(7), index=dates)
ts

2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
Freq: D, dtype: int32

In [79]:
# Use a forward slash: in 'examples\tseries.csv' the '\t' is read as a tab character.
ts.to_csv('examples/tseries.csv', header=False)
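A Series written with to_csv can be read back with read_csv. The sketch below does the round trip through an in-memory buffer rather than a file, an assumption made only to keep it self-contained. The arguments header=None, index_col=0, and parse_dates=True tell read_csv that there is no header row and that the first column holds the dates.

```python
import io

import numpy as np
import pandas as pd

ts = pd.Series(np.arange(7), index=pd.date_range("2000-01-01", periods=7))

# Write the Series without a header row, then rewind the buffer for reading.
buf = io.StringIO()
ts.to_csv(buf, header=False)
buf.seek(0)

# With header=None the columns are numbered; column 0 becomes the index
# and the values end up in the column labeled 1.
frame = pd.read_csv(buf, header=None, index_col=0, parse_dates=True)
restored = frame[1]
print(restored)
```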

### Working with Delimited Formats

It’s possible to load most forms of tabular data from disk using functions like
pandas.read_table. In some cases, however, some manual processing may be necessary.
It’s not uncommon to receive a file with one or more malformed lines that trip up
read_table. To illustrate the basic tools, consider a small CSV file:

In [80]:
!type examples\ex7.csv

"a","b","c"
"1","2","3"
"1","2","3"


For any file with a single-character delimiter, you can use Python’s built-in csv module.
To use it, pass any open file or file-like object to csv.reader:

In [81]:
import csv

In [82]:
f = open('examples/ex7.csv')
reader = csv.reader(f)

Iterating through the reader like a file yields lists of values with any quote characters
removed:

In [85]:
for line in reader:
    print(line)

From there, it’s up to you to do the wrangling necessary to put the data in the form
that you need it. Let’s take this step by step. First, we read the file into a list of lines:

In [84]:
with open('examples/ex7.csv') as f:
    lines = list(csv.reader(f))

Then, we split the lines into the header line and the data lines:

In [86]:
header, values = lines[0], lines[1:]

Then we can create a dictionary of data columns using a dictionary comprehension
and the expression zip(*values), which transposes rows to columns:

In [87]:
data_dict = {h: v for h,v in zip(header, zip(*values))}
data_dict

{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}

CSV files come in many different flavors. To define a new format with a different
delimiter, string quoting convention, or line terminator, we define a simple subclass
of csv.Dialect:

In [88]:
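class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

# The earlier with block closed f, so open the file again before creating the reader.
with open('examples/ex7.csv') as f:
    reader = csv.reader(f, dialect=my_dialect)
    rows = list(reader)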

We can also give individual CSV dialect parameters as keywords to csv.reader
without having to define a subclass:

In [89]:
# Open the file again here: csv.reader needs an open file to iterate over.
with open('examples/ex7.csv') as f:
    reader = csv.reader(f, delimiter='|')
    rows = list(reader)

The possible options (attributes of csv.Dialect) and what they do can be found in
Table 6-3.

![](csv_dialect1.jpg)
![](csv_dialect2.jpg)

For files with more complicated or fixed multicharacter delimiters,
you will not be able to use the csv module. In those cases, you’ll
have to do the line splitting and other cleanup using string’s split
method or the regular expression method re.split.
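As a minimal sketch of the re.split approach, suppose the fields are separated by a two-character `::` delimiter with inconsistent surrounding whitespace. The sample lines and the delimiter are invented for illustration.

```python
import re

# Made-up lines using a multicharacter delimiter with uneven spacing.
raw_lines = ["a :: b :: c", "1 ::2::  3", "4::5 :: 6"]

# Split on '::' together with any whitespace around it.
pattern = re.compile(r"\s*::\s*")
rows = [pattern.split(line.strip()) for line in raw_lines]

# Separate the header from the data rows and transpose rows to columns.
header, values = rows[0], rows[1:]
columns = {h: list(col) for h, col in zip(header, zip(*values))}
print(columns)
```

The resulting dict of columns has the same shape as the one built with csv.reader earlier in this section, so it can be handled in the same way.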

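To write delimited files manually, you can use csv.writer. It accepts an open, writable
file object and the same dialect and format options as csv.reader:

In [90]: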
with open('mydata.csv','w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
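To check that a custom dialect behaves as intended, one option is a round trip: write a few rows with csv.writer and read them back with csv.reader using the same dialect. The sketch below uses an in-memory io.StringIO buffer and a dialect class named SemicolonDialect, both chosen here only for illustration.

```python
import csv
import io


class SemicolonDialect(csv.Dialect):
    # Same settings as the dialect defined above, under a hypothetical name.
    lineterminator = "\n"
    delimiter = ";"
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL


buf = io.StringIO()
writer = csv.writer(buf, dialect=SemicolonDialect)
writer.writerow(("one", "two", "three"))
writer.writerow(("1", "2", "3"))
text = buf.getvalue()

# Rewind and parse the same text with the same dialect.
buf.seek(0)
rows = list(csv.reader(buf, dialect=SemicolonDialect))
print(text)
print(rows)
```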

### JSON Data

*__JSON (short for JavaScript Object Notation) has become one of the standard formats
for sending data by HTTP request between web browsers and other applications__*. It is
a much more free-form data format than a tabular text form like CSV. Here is an
example:

In [91]:
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
{"name": "Katie", "age": 38,
"pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

In [92]:
obj

'\n{"name": "Wes",\n"places_lived": ["United States", "Spain", "Germany"],\n"pet": null,\n"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},\n{"name": "Katie", "age": 38,\n"pets": ["Sixes", "Stache", "Cisco"]}]\n}\n'

JSON is very nearly valid Python code with the exception of its null value null and
some other nuances (such as disallowing trailing commas at the end of lists). The
basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls. All
of the keys in an object must be strings. There are several Python libraries for reading
and writing JSON data. I’ll use json here, as it is built into the Python standard
library. To convert a JSON string to Python form, use json.loads:

In [93]:
import json

In [94]:
result = json.loads(obj)
result

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

json.dumps, on the other hand, converts a Python object back to JSON:

In [95]:
asjson = json.dumps(result)
asjson

'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'

How you convert a JSON object or list of objects to a DataFrame or some other data
structure for analysis will be up to you. Conveniently, you can pass a list of dicts
(which were previously JSON objects) to the DataFrame constructor and select a subset
of the data fields:

In [96]:
siblings = pd.DataFrame(result['siblings'], columns=['name','age'] )
siblings

Unnamed: 0,name,age
0,Scott,30
1,Katie,38


The pandas.read_json function can automatically convert JSON datasets in specific
arrangements into a Series or DataFrame. For example:

In [97]:
!type examples\example.json

[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]


The default options for pandas.read_json assume that each object in the JSON array
is a row in the table:

In [98]:
data = pd.read_json('examples/example.json')
data

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


For an extended example of reading and manipulating JSON data (including nested
records), see the USDA Food Database example in Chapter 7.

If you need to export data from pandas to JSON, one way is to use the to_json methods
on Series and DataFrame:

In [99]:
print(data.to_json())

{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}


In [101]:
print(data.to_json(orient='records'))

[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
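The records layout is the same arrangement that read_json expects by default, so the export can be read straight back. The following sketch rebuilds a frame with the same values as the one above and round-trips it through a JSON string. Wrapping the string in io.StringIO is a choice made here, since newer pandas versions prefer a file-like object over a literal JSON string.

```python
import io
import json

import pandas as pd

data = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

# One JSON object per row.
as_records = data.to_json(orient="records")

# Read the string back into a DataFrame.
restored = pd.read_json(io.StringIO(as_records), orient="records")

# The same string is ordinary JSON, so the json module can parse it too.
first_row = json.loads(as_records)[0]
print(restored)
```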


### XML and HTML: Web Scraping

Python has many libraries for reading and writing data in the ubiquitous HTML and
XML formats. Examples include lxml, Beautiful Soup, and html5lib. While lxml is
comparatively much faster in general, the other libraries can better handle malformed
HTML or XML files.

pandas has a built-in function, read_html, which uses libraries like lxml and Beautiful
Soup to automatically parse tables out of HTML files as DataFrame objects. To
show how this works, I downloaded an HTML file (used in the pandas documentation)
from the United States FDIC government agency showing bank failures. First,
you must install some additional libraries used by read_html:

In [103]:
conda install lxml

In [104]:
pip install beautifulsoup4 html5lib

Note: you may need to restart the kernel to use updated packages.


If you are not using conda, pip install lxml will likely also work.

The pandas.read_html function has a number of options, but by default it searches
for and attempts to parse all tabular data contained within `<table>` tags. The result is
a list of DataFrame objects:

In [2]:
import pandas as pd
from pandas import Series, DataFrame

In [6]:
tables = pd.read_html('examples/fdic_failed_bank_list.html')

In [22]:
tables

[                                  Bank Name             City  ST   CERT  \
 0                               Allied Bank         Mulberry  AR     91   
 1              The Woodbury Banking Company         Woodbury  GA  11297   
 2                    First CornerStone Bank  King of Prussia  PA  35312   
 3                        Trust Company Bank          Memphis  TN   9956   
 4                North Milwaukee State Bank        Milwaukee  WI  20364   
 5                    Hometown National Bank         Longview  WA  35156   
 6                       The Bank of Georgia   Peachtree City  GA  35259   
 7                              Premier Bank           Denver  CO  34112   
 8                            Edgebrook Bank          Chicago  IL  57772   
 9                    Doral Bank  En Espanol         San Juan  PR  32102   
 10        Capitol City Bank & Trust Company          Atlanta  GA  33938   
 11                  Highland Community Bank          Chicago  IL  20290   
 12         

In [15]:
len(tables)

1

In [16]:
tables[0]

Unnamed: 0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Allied Bank,Mulberry,AR,91,Today's Bank,"September 23, 2016","November 17, 2016"
1,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,"August 19, 2016","November 17, 2016"
2,First CornerStone Bank,King of Prussia,PA,35312,First-Citizens Bank & Trust Company,"May 6, 2016","September 6, 2016"
3,Trust Company Bank,Memphis,TN,9956,The Bank of Fayette County,"April 29, 2016","September 6, 2016"
4,North Milwaukee State Bank,Milwaukee,WI,20364,First-Citizens Bank & Trust Company,"March 11, 2016","June 16, 2016"
5,Hometown National Bank,Longview,WA,35156,Twin City Bank,"October 2, 2015","April 13, 2016"
6,The Bank of Georgia,Peachtree City,GA,35259,Fidelity Bank,"October 2, 2015","October 24, 2016"
7,Premier Bank,Denver,CO,34112,"United Fidelity Bank, fsb","July 10, 2015","August 17, 2016"
8,Edgebrook Bank,Chicago,IL,57772,Republic Bank of Chicago,"May 8, 2015","July 12, 2016"
9,Doral Bank En Espanol,San Juan,PR,32102,Banco Popular de Puerto Rico,"February 27, 2015","May 13, 2015"


In [17]:
failures = tables[0]
failures.head()

Unnamed: 0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date,Updated Date
0,Allied Bank,Mulberry,AR,91,Today's Bank,"September 23, 2016","November 17, 2016"
1,The Woodbury Banking Company,Woodbury,GA,11297,United Bank,"August 19, 2016","November 17, 2016"
2,First CornerStone Bank,King of Prussia,PA,35312,First-Citizens Bank & Trust Company,"May 6, 2016","September 6, 2016"
3,Trust Company Bank,Memphis,TN,9956,The Bank of Fayette County,"April 29, 2016","September 6, 2016"
4,North Milwaukee State Bank,Milwaukee,WI,20364,First-Citizens Bank & Trust Company,"March 11, 2016","June 16, 2016"


As you will learn in later chapters, from here we could proceed to do some data
cleaning and analysis, like computing the number of bank failures by year:

In [19]:
failures['Closing Date'].head()

0    September 23, 2016
1       August 19, 2016
2           May 6, 2016
3        April 29, 2016
4        March 11, 2016
Name: Closing Date, dtype: object

In [20]:
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.head()

0   2016-09-23
1   2016-08-19
2   2016-05-06
3   2016-04-29
4   2016-03-11
Name: Closing Date, dtype: datetime64[ns]

In [23]:
close_timestamps.dt.year.value_counts()

2010    157
2009    140
2011     92
2012     51
2008     25
2013     24
2014     18
2002     11
2015      8
2016      5
2004      4
2001      4
2007      3
2003      3
2000      2
Name: Closing Date, dtype: int64

#### Parsing XML with lxml.objectify

XML (eXtensible Markup Language) is another common structured data format supporting hierarchical, nested data with metadata. The book you are currently reading
was actually created from a series of large XML documents.

Earlier, I showed the pandas.read_html function, which uses either lxml or Beautiful
Soup under the hood to parse data from HTML. XML and HTML are structurally
similar, but XML is more general. Here, I will show an example of how to use lxml to
parse data from a more general XML format.

The New York Metropolitan Transportation Authority (MTA) publishes a number of
data series about its bus and train services. Here we’ll look at the performance data,
which is contained in a set of XML files. Each train or bus service has a different file
(like Performance_MNR.xml for the Metro-North Railroad) containing monthly data
as a series of XML records that look like this:

![](xml1.jpg)
![](xml2.jpg)

Using lxml.objectify, we parse the file and get a reference to the root node of the
XML file with getroot:

In [25]:
from lxml import objectify

In [27]:
path = 'examples/mta_perf/Performance_MNR.xml'
parsed = objectify.parse(open(path))
root = parsed.getroot()

root.INDICATOR returns a generator yielding each `<INDICATOR>` XML element. For
each record, we can populate a dict of tag names (like YTD_ACTUAL) to data values
(excluding a few tags):

In [29]:
data = []

skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE', 'DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data={}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)

Lastly, convert this list of dicts into a DataFrame:

In [30]:
perf = pd.DataFrame(data)
perf.head()

Unnamed: 0,AGENCY_NAME,CATEGORY,DESCRIPTION,FREQUENCY,INDICATOR_NAME,INDICATOR_UNIT,MONTHLY_ACTUAL,MONTHLY_TARGET,PERIOD_MONTH,PERIOD_YEAR,YTD_ACTUAL,YTD_TARGET
0,Metro-North Railroad,Service Indicators,Percent of commuter trains that arrive at thei...,M,On-Time Performance (West of Hudson),%,96.9,95,1,2008,96.9,95
1,Metro-North Railroad,Service Indicators,Percent of commuter trains that arrive at thei...,M,On-Time Performance (West of Hudson),%,95.0,95,2,2008,96.0,95
2,Metro-North Railroad,Service Indicators,Percent of commuter trains that arrive at thei...,M,On-Time Performance (West of Hudson),%,96.9,95,3,2008,96.3,95
3,Metro-North Railroad,Service Indicators,Percent of commuter trains that arrive at thei...,M,On-Time Performance (West of Hudson),%,98.3,95,4,2008,96.8,95
4,Metro-North Railroad,Service Indicators,Percent of commuter trains that arrive at thei...,M,On-Time Performance (West of Hudson),%,95.8,95,5,2008,96.6,95
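For comparison, the same record-to-dict pattern can be sketched with the standard library's xml.etree.ElementTree instead of lxml.objectify. This is a swapped-in technique, not the approach used above: the tiny document below is invented (including the PERFORMANCE root tag and all values), and ElementTree returns every value as a string rather than converting it the way pyval does.

```python
import xml.etree.ElementTree as ET

# Invented miniature document; only the INDICATOR layout mirrors the text above.
xml_text = """<PERFORMANCE>
  <INDICATOR>
    <INDICATOR_SEQ>1</INDICATOR_SEQ>
    <AGENCY_NAME>Example Agency</AGENCY_NAME>
    <PERIOD_YEAR>2008</PERIOD_YEAR>
    <MONTHLY_ACTUAL>96.9</MONTHLY_ACTUAL>
  </INDICATOR>
  <INDICATOR>
    <INDICATOR_SEQ>2</INDICATOR_SEQ>
    <AGENCY_NAME>Example Agency</AGENCY_NAME>
    <PERIOD_YEAR>2008</PERIOD_YEAR>
    <MONTHLY_ACTUAL>95.0</MONTHLY_ACTUAL>
  </INDICATOR>
</PERFORMANCE>"""

root = ET.fromstring(xml_text)
skip_fields = {"INDICATOR_SEQ"}

records = []
for elt in root.findall("INDICATOR"):
    # Iterating over an element yields its child elements.
    record = {child.tag: child.text for child in elt if child.tag not in skip_fields}
    records.append(record)

print(records)
```

A list of dicts like records can be passed to the DataFrame constructor in the same way as the lxml result.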


XML data can get much more complicated than this example. Each tag can have
metadata, too. Consider an HTML link tag, which is also valid XML:

In [31]:
from io import StringIO

In [33]:
tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()

You can now access any of the fields (like href) in the tag or the link text:

In [34]:
root

<Element a at 0x282932bac08>

In [35]:
root.get('href')

'http://www.google.com'

In [36]:
root.text

'Google'

## 6.2 Binary Data Formats

One of the easiest ways to store data (also known as serialization) efficiently in binary
format is using Python’s built-in pickle serialization. pandas objects all have a
to_pickle method that writes the data to disk in pickle format:

In [3]:
import pandas as pd
from pandas import Series, DataFrame

In [4]:
frame = pd.read_csv('examples/ex1.csv')
frame

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [5]:
frame.to_pickle('examples/frame_pickle')

You can read any “pickled” object stored in a file by using the built-in pickle directly,
or even more conveniently using pandas.read_pickle:

In [6]:
pd.read_pickle('examples/frame_pickle')

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
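As a sketch of the first option, reading a pickled frame with the built-in pickle module, the following writes a small invented DataFrame to a temporary directory and loads it back. The temporary path and the frame contents are assumptions made to keep the example self-contained.

```python
import os
import pickle
import tempfile

import pandas as pd

frame = pd.DataFrame({"a": [1, 5, 9], "message": ["hello", "world", "foo"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame_pickle")
    frame.to_pickle(path)

    # The file is an ordinary pickle, so the standard library can load it.
    with open(path, "rb") as fh:
        loaded = pickle.load(fh)

print(loaded)
```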


pickle is only recommended as a short-term storage format. The
problem is that it is hard to guarantee that the format will be stable
over time; an object pickled today may not unpickle with a later
version of a library. We have tried to maintain backward compatibility
when possible, but at some point in the future it may be necessary
to “break” the pickle format.

pandas has built-in support for two more binary data formats: HDF5 and MessagePack
(MessagePack support was deprecated in pandas 0.25 and removed in pandas 1.0).
I will give some HDF5 examples in the next section, but I encourage you to
explore different file formats to see how fast they are and how well they work for your
analysis. Some other storage formats for pandas or NumPy data include:

*  *bcolz*.
A compressible column-oriented binary format based on the Blosc compression
library.

*  *Feather*.
A cross-language column-oriented file format I designed with the R programming
community’s Hadley Wickham. Feather uses the Apache Arrow columnar
memory format.

### Using HDF5 Format

*__HDF5 is a well-regarded file format intended for storing large quantities of scientific
array data__*. It is available as a C library, and it has interfaces available in many other
languages, including Java, Julia, MATLAB, and Python. *__The “HDF” in HDF5 stands
for hierarchical data format__*. Each HDF5 file can store multiple datasets and supporting
metadata. Compared with simpler formats, HDF5 supports on-the-fly compression
with a variety of compression modes, enabling data with repeated patterns to be
stored more efficiently. *__HDF5 can be a good choice for working with very large datasets
that don’t fit into memory__*, as you can efficiently read and write small sections of
much larger arrays.

While it’s possible to directly access HDF5 files using either the PyTables or h5py
libraries, pandas provides a high-level interface that simplifies storing Series and
DataFrame objects. The HDFStore class works like a dict and handles the low-level
details:

In [8]:
import numpy as np

In [9]:
frame = pd.DataFrame({'a': np.random.randn(100)})

In [10]:
store = pd.HDFStore('mydata.h5')

In [11]:
store['obj1'] = frame
store['obj1_col'] = frame['a']
store

<class 'pandas.io.pytables.HDFStore'>
File path: mydata.h5

Objects contained in the HDF5 file can then be retrieved with the same dict-like API:

In [12]:
store['obj1']

Unnamed: 0,a
0,1.121275
1,-0.269566
2,0.969891
3,0.104056
4,0.699564
5,-1.662573
6,0.833922
7,-0.950682
8,0.295731
9,0.463830


HDFStore supports two storage schemas, 'fixed' and 'table'. The latter is generally
slower, but it supports query operations using a special syntax:

In [13]:
store.put('obj2', frame, format='table')

In [14]:
store.select('obj2', where=['index >= 10 and index <= 15'])

Unnamed: 0,a
10,-0.426853
11,-1.295238
12,-0.249434
13,0.167114
14,0.479221
15,-0.278924


In [15]:
store.close()

The put method is an explicit version of the store['obj2'] = frame assignment, but it
allows us to set other options like the storage format.

The pandas.read_hdf function gives you a shortcut to these tools:

In [16]:
frame.to_hdf('mydata.h5', 'obj3', format='table')

In [17]:
pd.read_hdf('mydata.h5', 'obj3', where=['index<5'])

Unnamed: 0,a
0,1.121275
1,-0.269566
2,0.969891
3,0.104056
4,0.699564


If you work with large quantities of data locally, I would encourage you to explore
PyTables and h5py to see how they can suit your needs. Since many data analysis
problems are I/O-bound (rather than CPU-bound), using a tool like HDF5 can massively
accelerate your applications.

HDF5 is not a database. It is best suited for write-once, read-many
datasets. While data can be added to a file at any time, if multiple
writers do so simultaneously, the file can become corrupted.

### Reading Microsoft Excel Files

pandas also supports reading tabular data stored in Excel 2003 (and higher) files
using either the ExcelFile class or pandas.read_excel function. Internally these
tools use the add-on packages xlrd and openpyxl to read XLS and XLSX files, respectively.
You may need to install these manually with pip or conda.

To use ExcelFile, create an instance by passing a path to an xls or xlsx file:

In [19]:
xlsx = pd.ExcelFile('examples/ex1.xlsx')

Data stored in a sheet can then be read into a DataFrame by passing the ExcelFile object to pandas.read_excel:

In [20]:
pd.read_excel(xlsx, 'Sheet1')

Unnamed: 0,"a,b,c,d,message"
0,"1,2,3,4,hello"
1,"5,6,7,8,world"
2,"9,10,11,12,foo"


If you are reading multiple sheets in a file, then it is faster to create the ExcelFile,
but you can also simply pass the filename to pandas.read_excel:

In [22]:
frame = pd.read_excel('examples/ex1.xlsx', 'Sheet1')
frame

Unnamed: 0,"a,b,c,d,message"
0,"1,2,3,4,hello"
1,"5,6,7,8,world"
2,"9,10,11,12,foo"


To write pandas data to Excel format, you must first create an ExcelWriter, then
write data to it using pandas objects’ to_excel method:

In [23]:
writer = pd.ExcelWriter('examples/ex2.xlsx')

In [24]:
frame.to_excel(writer, 'Sheet1')

In [25]:
writer.close()  # writes the file to disk; writer.save() was removed in pandas 2.0

You can also pass a file path to to_excel and avoid the ExcelWriter:

In [26]:
frame.to_excel('examples/ex3.xlsx')
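
To tie the reading and writing pieces together, here is a small round-trip sketch. It writes two sheets through a single ExcelWriter used as a context manager, then reopens the file once with ExcelFile and reads every sheet. The file name, sheet names, and sample data are assumptions made for the example, and openpyxl must be installed:

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data for two sheets.
sales = pd.DataFrame({'region': ['east', 'west'], 'units': [10, 7]})
costs = pd.DataFrame({'region': ['east', 'west'], 'cost': [3.5, 4.25]})

path = os.path.join(tempfile.mkdtemp(), 'sketch.xlsx')

# Leaving the with block saves and closes the workbook.
with pd.ExcelWriter(path) as writer:
    sales.to_excel(writer, sheet_name='sales', index=False)
    costs.to_excel(writer, sheet_name='costs', index=False)

# Open the file once, then read each sheet from the same ExcelFile object.
with pd.ExcelFile(path) as xlsx:
    sheets = {name: pd.read_excel(xlsx, sheet_name=name)
              for name in xlsx.sheet_names}

print(sheets['costs'])
```

Passing index=False keeps the row labels out of the file, so the sheets read back without an extra unnamed column.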

## 6.3 Interacting with Web APIs

Many websites have public APIs providing data feeds via JSON or some other format.
There are a number of ways to access these APIs from Python; one easy-to-use
method that I recommend is the requests package.

To find the last 30 issues for pandas on GitHub, we can make a GET HTTP
request using the add-on requests library:

In [27]:
import requests

In [28]:
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'

In [29]:
resp = requests.get(url)

In [30]:
resp

<Response [200]>

The Response object’s json method returns the JSON parsed into native Python objects,
either a dictionary or a list depending on the payload; here it is a list of dictionaries:

In [31]:
data = resp.json()

In [32]:
len(data)

30

In [33]:
len(data[0])

25

In [35]:
data[0]

{'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/35650',
 'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
 'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/35650/labels{/name}',
 'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/35650/comments',
 'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/35650/events',
 'html_url': 'https://github.com/pandas-dev/pandas/issues/35650',
 'id': 675987834,
 'node_id': 'MDU6SXNzdWU2NzU5ODc4MzQ=',
 'number': 35650,
 'title': 'BUG:',
 'user': {'login': 'ruchirgarg05',
  'id': 10705513,
  'node_id': 'MDQ6VXNlcjEwNzA1NTEz',
  'avatar_url': 'https://avatars0.githubusercontent.com/u/10705513?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/ruchirgarg05',
  'html_url': 'https://github.com/ruchirgarg05',
  'followers_url': 'https://api.github.com/users/ruchirgarg05/followers',
  'following_url': 'https://api.github.com/users/ruchirgarg05/following{/

In [36]:
data[0]['title']

'BUG:'
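
The calls above assume the request succeeds. A slightly more defensive version might look like the following sketch, where fetch_json is a hypothetical helper name (not part of requests) and the get argument exists only so the function can be exercised without network access. The state and per_page query parameters are ones the GitHub issues endpoint accepts:

```python
import requests


def fetch_json(url, params=None, timeout=10, get=requests.get):
    """Return the parsed JSON body of a GET request, raising on HTTP errors.

    fetch_json is a hypothetical helper for this sketch; `get` defaults to
    requests.get and can be swapped for a stand-in when testing offline.
    """
    resp = get(url, params=params, timeout=timeout)
    resp.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return resp.json()


# Example call (performs a real network request, so it is left commented out):
# data = fetch_json('https://api.github.com/repos/pandas-dev/pandas/issues',
#                   params={'state': 'open', 'per_page': 10})
```

Setting a timeout keeps a stalled connection from hanging the session, and raise_for_status turns an error response into an exception instead of letting it pass silently into later parsing steps.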

Each element in data is a dictionary containing all of the data found on a GitHub
issue page (except for the comments). We can pass data directly to DataFrame and
extract fields of interest:

In [38]:
issues = pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])

In [39]:
issues

Unnamed: 0,number,title,labels,state
0,35650,BUG:,"[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open
1,35649,Refactor tables latex,[],open
2,35647,BUG: Support custom BaseIndexers in groupby.ro...,"[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open
3,35646,"BUG: groupby(..., dropna=False).indices with s...","[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open
4,35645,BUG/ENH: consistent gzip compression arguments,[],open
5,35644,BUG: `Series.sum` & `Series.mean` are inconsit...,[],open
6,35643,ENH: Styler tooltips feature,[],open
7,35642,BUG:Resample with groupby & agg(),"[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open
8,35641,REF/PERF: Move MultiIndex._tuples to MultiInde...,[],open
9,35640,DOC: Add specific Visual Studio Installer inst...,[],open


With a bit of elbow grease, you can create some higher-level interfaces to common
web APIs that return DataFrame objects for easy analysis.
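
As one possible shape for such an interface, the sketch below turns a list of issue dictionaries into a DataFrame and reduces each nested labels entry to a plain list of label names. The function name issues_to_frame and the sample records are hypothetical; the records only imitate the structure of the GitHub payload shown earlier:

```python
import pandas as pd

# Hypothetical records shaped like elements of the GitHub issues payload.
sample = [
    {'number': 1, 'title': 'BUG: example', 'state': 'open',
     'labels': [{'id': 10, 'name': 'Bug'}, {'id': 11, 'name': 'Groupby'}]},
    {'number': 2, 'title': 'DOC: example', 'state': 'open', 'labels': []},
]


def issues_to_frame(records):
    """Build a DataFrame of issues with labels flattened to lists of names."""
    frame = pd.DataFrame(records,
                         columns=['number', 'title', 'labels', 'state'])
    # Each labels cell is a list of dicts; keep only the 'name' of each label.
    frame['labels'] = frame['labels'].apply(
        lambda labels: [label['name'] for label in labels])
    return frame


issues = issues_to_frame(sample)
print(issues)
```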

## 6.4 Interacting with Databases

In a business setting, most data may not be stored in text or Excel files. SQL-based
relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use,
and many alternative databases have become quite popular. The choice of database is
usually dependent on the performance, data integrity, and scalability needs of an
application.

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has
some functions to simplify the process. As an example, I’ll create a SQLite database
using Python’s built-in sqlite3 driver:

In [6]:
import sqlite3

In [7]:
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
);"""

In [8]:
con = sqlite3.connect('mydata.sqlite')

In [9]:
con.execute(query)

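OperationalError: table test already exists

This error appears only because mydata.sqlite already held a test table from an
earlier run; against a fresh database file the statement succeeds and returns a cursor.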

In [10]:
con.commit()

Then, insert a few rows of data:

In [11]:
data = [('Atlanta', 'Georgia', 1.25, 6), ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]

In [12]:
data

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

In [13]:
type(data)

list

In [14]:
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

In [15]:
con.executemany(stmt, data)

<sqlite3.Cursor at 0x17fc65f5500>

In [16]:
con.commit()
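
The create-and-insert steps above can also be written so that they are safe to rerun. This is a minimal sketch under two assumptions of mine: it uses an in-memory database (':memory:') instead of mydata.sqlite, and it adds IF NOT EXISTS to the CREATE TABLE statement. Using the connection as a context manager commits the transaction on success and rolls it back if an exception is raised:

```python
import sqlite3

rows_to_insert = [('Atlanta', 'Georgia', 1.25, 6),
                  ('Tallahassee', 'Florida', 2.6, 3),
                  ('Sacramento', 'California', 1.7, 5)]

create_stmt = """
CREATE TABLE IF NOT EXISTS test
(a VARCHAR(20), b VARCHAR(20),
 c REAL, d INTEGER
);"""

# ':memory:' gives a throwaway database that vanishes when the connection closes.
con = sqlite3.connect(':memory:')

# The with block commits on success and rolls back on error; it does not close con.
with con:
    con.execute(create_stmt)
    con.execute(create_stmt)  # running it again is a no-op rather than an error
    con.executemany("INSERT INTO test VALUES(?, ?, ?, ?)", rows_to_insert)

row_count = con.execute("SELECT COUNT(*) FROM test").fetchone()[0]
print(row_count)
```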

Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return a list
of tuples when selecting data from a table:

In [18]:
cursor = con.execute('select * from test')

In [20]:
rows = cursor.fetchall()

In [21]:
rows

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

You can pass the list of tuples to the DataFrame constructor, but you also need the
column names, contained in the cursor’s description attribute:

In [22]:
cursor.description

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

In [24]:
import pandas as pd
from pandas import Series, DataFrame

In [25]:
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

Unnamed: 0,a,b,c,d
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5
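
If you stay with a plain database driver, the two steps above (fetching the rows and pulling column names out of cursor.description) can be wrapped in a small helper. The name query_to_frame is hypothetical, and the sketch uses an in-memory SQLite database with the same sample rows so that it runs on its own:

```python
import sqlite3

import pandas as pd


def query_to_frame(con, sql, params=()):
    """Run a query on a DB-API connection and return the result as a DataFrame."""
    cursor = con.execute(sql, params)
    # The first item of each description entry is the column name.
    columns = [col[0] for col in cursor.description]
    return pd.DataFrame(cursor.fetchall(), columns=columns)


# Throwaway in-memory database holding the same rows as the example above.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
con.executemany("INSERT INTO test VALUES(?, ?, ?, ?)",
                [('Atlanta', 'Georgia', 1.25, 6),
                 ('Tallahassee', 'Florida', 2.6, 3),
                 ('Sacramento', 'California', 1.7, 5)])
con.commit()

result = query_to_frame(con, 'select * from test')
print(result)
```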


This is quite a bit of munging that you’d rather not repeat each time you query the
database. The SQLAlchemy project is a popular Python SQL toolkit that abstracts
away many of the common differences between SQL databases. pandas has a
read_sql function that enables you to read data easily from a general SQLAlchemy
connection. Here, we’ll connect to the same SQLite database with SQLAlchemy and
read data from the table created before:

In [26]:
import sqlalchemy as sqla

In [27]:
db = sqla.create_engine('sqlite:///mydata.sqlite')

In [28]:
pd.read_sql('select * from test', db)

Unnamed: 0,a,b,c,d
0,Atlanta,Georgia,1.25,6
1,Tallahassee,Florida,2.6,3
2,Sacramento,California,1.7,5
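
read_sql also accepts query parameters, which keeps values out of the SQL string itself. The sketch below makes one simplifying assumption: it passes a plain sqlite3 connection (which pandas also supports) to an in-memory database rather than a SQLAlchemy engine, because the placeholder style depends on the driver. For sqlite3 the placeholder is a question mark:

```python
import sqlite3

import pandas as pd

# Throwaway in-memory database with the same rows as the test table above.
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
con.executemany("INSERT INTO test VALUES(?, ?, ?, ?)",
                [('Atlanta', 'Georgia', 1.25, 6),
                 ('Tallahassee', 'Florida', 2.6, 3),
                 ('Sacramento', 'California', 1.7, 5)])
con.commit()

# The value 4 is bound to the ? placeholder by the driver, not pasted into the SQL.
big_d = pd.read_sql('select * from test where d > ? order by d desc',
                    con, params=(4,))
print(big_d)
```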


## 6.5 Conclusion

Getting access to data is frequently the first step in the data analysis process. We have
looked at a number of useful tools in this chapter that should help you get started. In
the upcoming chapters we will dig deeper into data wrangling, data visualization,
time series analysis, and other topics.