# 6.2 Binary Data Formats

1. [General Info](#general)
1. [Reading Microsoft Excel Files](#excel)
1. [Using HDF5 Format](#hdf5)

<a name="general"></a>
# General Info

Python has a built-in module called `pickle` that can save data in a binary format.  

All pandas objects have a `to_pickle` method that can do this.  

**Pickle is recommended only for short-term storage. The format is not guaranteed to stay stable across library versions, so an object pickled today may fail to unpickle, or load differently, after an upgrade.**
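As a minimal sketch of what the built-in module does, the snippet below round-trips a small dictionary through `pickle` in memory. The `record` data is made up for illustration. The protocol is passed explicitly only to make the version choice visible.

```python
import pickle

# Hypothetical sample data, not taken from the example files.
record = {"a": [1, 5, 9], "message": ["hello", "world", "foo"]}

# Serialize to bytes, stating the protocol version explicitly.
payload = pickle.dumps(record, protocol=pickle.DEFAULT_PROTOCOL)

# Deserialize again. Only unpickle data that comes from a trusted source.
restored = pickle.loads(payload)

print(restored == record)
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)
```

The two protocol constants depend on the Python version in use, which is one reason a pickle written by one environment may not load in another.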

pandas also supports other binary formats, including HDF5, ORC, and Apache Parquet.

In [39]:
import pandas as pd
import numpy as np

In [40]:
# Read in some data
frame = pd.read_csv("../../examples/ex1.csv")
frame

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [41]:
# Write it out as pickle
frame.to_pickle("../../examples/frame_pickle")

In [42]:
# Read in again
pd.read_pickle("../../examples/frame_pickle")


Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [43]:
# Read in parquet data
fec = pd.read_parquet('../../datasets/fec/fec.parquet')
fec

Unnamed: 0,cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num
0,C00410118,P20002978,"Bachmann, Michelle","HARVEY, WILLIAM",MOBILE,AL,366010290,RETIRED,RETIRED,250.0,20-JUN-11,,,,SA17A,736166
1,C00410118,P20002978,"Bachmann, Michelle","HARVEY, WILLIAM",MOBILE,AL,366010290,RETIRED,RETIRED,50.0,23-JUN-11,,,,SA17A,736166
2,C00410118,P20002978,"Bachmann, Michelle","SMITH, LANIER",LANETT,AL,368633403,INFORMATION REQUESTED,INFORMATION REQUESTED,250.0,05-JUL-11,,,,SA17A,749073
3,C00410118,P20002978,"Bachmann, Michelle","BLEVINS, DARONDA",PIGGOTT,AR,724548253,NONE,RETIRED,250.0,01-AUG-11,,,,SA17A,749073
4,C00410118,P20002978,"Bachmann, Michelle","WARDENBURG, HAROLD",HOT SPRINGS NATION,AR,719016467,NONE,RETIRED,300.0,20-JUN-11,,,,SA17A,736166
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1001726,C00500587,P20003281,"Perry, Rick","GORMAN, CHRIS D. MR.",INFO REQUESTED,XX,99999,INFORMATION REQUESTED PER BEST EFFORTS,INFORMATION REQUESTED PER BEST EFFORTS,5000.0,29-SEP-11,REATTRIBUTION / REDESIGNATION REQUESTED (AUTOM...,,REATTRIBUTION / REDESIGNATION REQUESTED (AUTOM...,SA17A,751678
1001727,C00500587,P20003281,"Perry, Rick","DUFFY, DAVID A. MR.",INFO REQUESTED,XX,99999,DUFFY EQUIPMENT COMPANY INC.,BUSINESS OWNER,2500.0,30-SEP-11,,,,SA17A,751678
1001728,C00500587,P20003281,"Perry, Rick","GRANE, BRYAN F. MR.",INFO REQUESTED,XX,99999,INFORMATION REQUESTED PER BEST EFFORTS,INFORMATION REQUESTED PER BEST EFFORTS,500.0,29-SEP-11,,,,SA17A,751678
1001729,C00500587,P20003281,"Perry, Rick","TOLBERT, DARYL MR.",INFO REQUESTED,XX,99999,T.A.C.C.,LONGWALL MAINTENANCE FOREMAN,500.0,30-SEP-11,,,,SA17A,751678


<a name="excel"></a>
# Reading Microsoft Excel Files

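pandas can read most Excel files (2003 and later) in two ways:

1. `pandas.ExcelFile` (open the workbook once, then parse one or more sheets from it)
1. `pandas.read_excel` (read a sheet directly into a DataFrame)

Both work with old-style XLS and newer XLSX files. Under the hood they rely on the `xlrd` package for XLS and the `openpyxl` package for XLSX.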

## pandas.ExcelFile

First create an instance from the path to an XLS or XLSX file, then work with it through its methods and attributes.

In [44]:
# Create instance
xlsx = pd.ExcelFile("../../examples/ex1.xlsx")

In [45]:
# View sheet names
xlsx.sheet_names

['Sheet1']

In [46]:
# Parse a sheet into a DataFrame
df1 = xlsx.parse(sheet_name="Sheet1")
df1

Unnamed: 0.1,Unnamed: 0,a,b,c,d,message
0,0,1,2,3,4,hello
1,1,5,6,7,8,world
2,2,9,10,11,12,foo


In [47]:
# Default behavior didn't grab the index column correctly.
df2 = xlsx.parse(sheet_name="Sheet1", index_col=0)
df2

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


## pandas.read_excel

`read_excel` can read directly from a file path, or from an `ExcelFile` instance that is already loaded.

In [48]:
# ExcelFile Instance
foo = pd.read_excel(xlsx, index_col=0)
foo

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [49]:
# From file path
frame = pd.read_excel("../../examples/ex1.xlsx", sheet_name="Sheet1")
frame

Unnamed: 0.1,Unnamed: 0,a,b,c,d,message
0,0,1,2,3,4,hello
1,1,5,6,7,8,world
2,2,9,10,11,12,foo
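When a workbook has several sheets, passing `sheet_name=None` to `read_excel` returns a dictionary of DataFrames keyed by sheet name. The sketch below assumes `openpyxl` is installed. It builds a two-sheet workbook in memory purely as setup; the sheet names and data are hypothetical.

```python
import io

import pandas as pd

# Setup only: build a small two-sheet workbook in memory.
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    pd.DataFrame({"a": [1, 5, 9]}).to_excel(writer, sheet_name="first", index=False)
    pd.DataFrame({"b": [2, 6]}).to_excel(writer, sheet_name="second", index=False)
buffer.seek(0)

# sheet_name=None loads every sheet into a dict of DataFrames.
sheets = pd.read_excel(buffer, sheet_name=None)

for name, sheet in sheets.items():
    print(name, sheet.shape)
```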


## pandas.ExcelWriter

To write data to Excel:

1. Create an `ExcelWriter`
2. Use the `to_excel` method to write it out.
3. Don't forget to close!

-or-

1. Just `object.to_excel("path/to/file.xlsx")`
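One way to avoid forgetting the close step is to use `ExcelWriter` as a context manager, which closes the file when the `with` block ends. This is a minimal sketch that assumes `openpyxl` is installed. The temporary path and the sample frame are hypothetical stand-ins for the example files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data and a throwaway output path.
frame = pd.DataFrame({"a": [1, 5, 9], "message": ["hello", "world", "foo"]})
path = os.path.join(tempfile.mkdtemp(), "ex2_sketch.xlsx")

# The writer is closed automatically when the block exits.
with pd.ExcelWriter(path) as writer:
    frame.to_excel(writer, sheet_name="Sheet1", index=False)

# Read the file back to confirm the write.
round_trip = pd.read_excel(path, sheet_name="Sheet1")
print(round_trip)
```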

In [50]:
# Make writer
writer = pd.ExcelWriter("../../examples/ex2.xlsx")
type(writer)

pandas.io.excel._openpyxl.OpenpyxlWriter

In [51]:
# Write (pass sheet_name by keyword; recent pandas versions warn when it is positional)
frame.to_excel(writer, sheet_name="Sheet1")
writer.close()


In [52]:
# Write V2
frame.to_excel("../../examples/ex2b.xlsx")

<a name="hdf5"></a>
# Using HDF5 Format

HDF5 is a file format for storing large quantities of scientific array data.  

"HDF" = hierarchical data format.  

An HDF5 file can store multiple datasets and their supporting metadata.  

It is efficient to read and write small sections of much larger arrays in this format, which makes it a good choice for datasets that don't fit into memory.  
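To illustrate reading small sections, the sketch below stores a frame in `table` format and then pulls back either a row range or fixed-size chunks. It assumes the PyTables package (`tables`) is installed. The file path, the key `"obj"`, and the data are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical data and a throwaway file path.
frame = pd.DataFrame({"a": np.arange(100, dtype="float64")})
path = os.path.join(tempfile.mkdtemp(), "sketch.h5")

with pd.HDFStore(path, mode="w") as store:
    store.put("obj", frame, format="table")

    # Read only rows 0 through 4 instead of the whole object.
    first_rows = store.select("obj", start=0, stop=5)

    # Iterate over the object in chunks of 25 rows.
    chunk_lengths = [len(chunk) for chunk in store.select("obj", chunksize=25)]

print(first_rows)
print(chunk_lengths)
```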

pandas provides a high-level interface for loading HDF5 files.  

## Example 1 - HDFStore

The `HDFStore` class is like a dictionary. Objects within an HDF5 file can be retrieved with a dictionary-like API.

It supports two storage schemas: `fixed` (the default) is generally faster, while `table` supports query operations.
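The difference between the two schemas can be sketched as follows, assuming PyTables is installed. The same frame is stored once in each format. The `table` copy also declares `a` as a data column so that it can be filtered by value. The keys, path, and data are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data and a throwaway file path.
frame = pd.DataFrame({"a": [-2.0, -1.0, 0.0, 1.0, 2.0]})
path = os.path.join(tempfile.mkdtemp(), "schemas_sketch.h5")

with pd.HDFStore(path, mode="w") as store:
    # Default schema ("fixed"): retrieved as a whole object.
    store.put("fixed_obj", frame)

    # "table" schema: data_columns makes column "a" queryable.
    store.put("table_obj", frame, format="table", data_columns=["a"])

    # Query by column value; this relies on the table schema.
    positive = store.select("table_obj", where="a > 0")
    keys = sorted(store.keys())

print(keys)
print(positive)
```

Using `HDFStore` in a `with` block also closes the file connection automatically.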

In [53]:
# Make a DataFrame
frame = pd.DataFrame({"a": np.random.standard_normal(100)})
print(type(frame))
frame

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,a
0,-0.736642
1,0.676655
2,0.950254
3,-1.108687
4,0.063422
...,...
95,-0.913463
96,0.851711
97,-0.146080
98,-0.775539


In [54]:
# Create file connection
store = pd.HDFStore("../../examples/mydata.h5")
print(type(store))
store

<class 'pandas.io.pytables.HDFStore'>


<class 'pandas.io.pytables.HDFStore'>
File path: ../../examples/mydata.h5

In [55]:
# Add the DataFrame to store
store["obj1"] = frame

# Add a single column from the DataFrame to store
store["obj1_col"] = frame["a"]

In [56]:
# View something that's currently in the store
store["obj1"]

Unnamed: 0,a
0,-0.736642
1,0.676655
2,0.950254
3,-1.108687
4,0.063422
...,...
95,-0.913463
96,0.851711
97,-0.146080
98,-0.775539


In [57]:
# Add something to the store using the "table" storage schema
# note that store.put("obj2", frame) and store["obj2"] = frame are the same, but put allows us to add options
store.put("obj2", frame, format="table")

In [58]:
# Select from a "table"-stored hdf5 object using the index
store.select("obj2", where=["index >= 10 and index <= 15"])

Unnamed: 0,a
10,-0.404136
11,1.522086
12,-1.38357
13,0.049891
14,-0.694978
15,-0.042148


In [59]:
# Have to close the file connection
store.close()

In [60]:
# Add something to store using the `to_hdf` shortcut (don't have to open and close the connection)
frame.to_hdf("../../examples/mydata.h5", key="obj3", format="table")

In [61]:
# Can read using the read_hdf shorthand as well
obj3 = pd.read_hdf("../../examples/mydata.h5", "obj3", where=["index < 5"])
obj3

Unnamed: 0,a
0,-0.736642
1,0.676655
2,0.950254
3,-1.108687
4,0.063422
