# Parsing text files 

There is a vast array of formats that are widely used to contain or transport data between applications. Here, we will examine parsing three of the most widely used of these formats in Python, namely:

 *  JSON
 *  XML
 *  HDF5

## The json library

JSON stands for <b>J</b>ava<b>S</b>cript <b>O</b>bject <b>N</b>otation.

This is a widespread format for passing data between servers and clients. As the name implies, it is based on structures found in JavaScript, but it has found much wider application, especially for REST (<b>RE</b>presentational <b>S</b>tate <b>T</b>ransfer) APIs. It has a straightforward structure, and this simplicity has contributed to its popularity.

It made web pages more dynamic in how they access information, since a page can request structured data from a server and parse it directly.

Python has a Standard Library package to interface with JSON structures, called `json`.
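As a minimal sketch of what the `json` package does, the snippet below parses a JSON string and serializes it back again. The payload is a small made-up example shaped like the exchange-rate data used in this section, not the real API response.

```python
import json

# A small, made-up payload shaped like the exchange-rate data (illustrative only)
raw = '{"base": "USD", "rates": {"EUR": 0.89, "JPY": 109.4}, "date": "2019-05-28"}'

# json.loads parses a JSON string into Python objects:
# JSON objects become dicts, numbers become int/float, strings become str
data = json.loads(raw)
print(type(data))
print(data["rates"]["EUR"])

# json.dumps goes the other way, serializing Python objects to JSON text
round_trip = json.dumps(data, sort_keys=True)
print(round_trip)
```

`json.load` and `json.dump` do the same for file objects rather than strings.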

Here, we will look at the exchange rate for major global currencies, comparing those to the USD. This information, accessed directly from the web via an API, is stored in a JSON file:

Source: https://api.exchangeratesapi.io/latest?base=USD


In [1]:
import os
import json
import pandas as pd

Set your data directory:

In [2]:
data_dir = "/home/ra/host/BH_Analytics/Discover/DataEngineering/data/"

In [3]:
json_name = os.path.join(data_dir, "USD_comparison.json")

Here, we can very easily open the JSON file (with the usual context manager, just to be safe). 

Note the encoding specification is often necessary:

In [8]:
with open(json_name, encoding='utf-8', errors='ignore') as json_file:
     usd_data = json.load(json_file, strict=False)
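Note that `errors='ignore'` silently drops any bytes that cannot be decoded, so a damaged file may load without complaint. A more defensive variant, sketched below under the assumption that we would rather be told about malformed input, catches `json.JSONDecodeError`. The helper name `load_json` and the two temporary files are hypothetical and exist only for this illustration.

```python
import json
import os
import tempfile


def load_json(path):
    """Return the parsed contents of a JSON file, or None if it is not valid JSON."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except json.JSONDecodeError as err:
        print(f"Could not parse {path}: {err}")
        return None


# Write one valid and one truncated file into a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.json")
    bad = os.path.join(tmp, "bad.json")
    with open(good, "w", encoding="utf-8") as fh:
        fh.write('{"base": "USD"}')
    with open(bad, "w", encoding="utf-8") as fh:
        fh.write('{"base": "USD"')  # missing closing brace

    good_result = load_json(good)
    bad_result = load_json(bad)
```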

In [None]:
f = open(json_name, 'r')
type(f)

In [5]:
text = f.readlines()
text

['{"base":"USD","rates":{"BGN":1.747498213,"NZD":1.5268048606,"ILS":3.6137419585,"RUB":64.5899749821,"CAD":1.3465868477,"USD":1.0,"PHP":52.2596497498,"CHF":1.0055396712,"AUD":1.4440671909,"JPY":109.4085060758,"TRY":6.0337741244,"HKD":7.8483738385,"MYR":4.1885275197,"HRK":6.6346497498,"CZK":23.0906004289,"IDR":14375.0,"DKK":6.6734274482,"NOK":8.6816476054,"HUF":291.815582559,"GBP":0.789608649,"MXN":19.1026626162,"THB":31.8102215868,"ISK":123.9278055754,"ZAR":14.6081129378,"BRL":4.0479807005,"SGD":1.3777698356,"PLN":3.8376518942,"INR":69.6055218013,"KRW":1188.2505360972,"RON":4.2550929235,"CNY":6.9107398142,"SEK":9.5483380986,"EUR":0.8934953538},"date":"2019-05-28"}']

In [6]:
f.close()

Query the data structure:

In [7]:
f

<_io.TextIOWrapper name='/home/ra/host/BH_Analytics/Discover/DataEngineering/data/USD_comparison.json' mode='r' encoding='UTF-8'>

In [11]:
usd_data

{'base': 'USD',
 'rates': {'BGN': 1.747498213,
  'NZD': 1.5268048606,
  'ILS': 3.6137419585,
  'RUB': 64.5899749821,
  'CAD': 1.3465868477,
  'USD': 1.0,
  'PHP': 52.2596497498,
  'CHF': 1.0055396712,
  'AUD': 1.4440671909,
  'JPY': 109.4085060758,
  'TRY': 6.0337741244,
  'HKD': 7.8483738385,
  'MYR': 4.1885275197,
  'HRK': 6.6346497498,
  'CZK': 23.0906004289,
  'IDR': 14375.0,
  'DKK': 6.6734274482,
  'NOK': 8.6816476054,
  'HUF': 291.815582559,
  'GBP': 0.789608649,
  'MXN': 19.1026626162,
  'THB': 31.8102215868,
  'ISK': 123.9278055754,
  'ZAR': 14.6081129378,
  'BRL': 4.0479807005,
  'SGD': 1.3777698356,
  'PLN': 3.8376518942,
  'INR': 69.6055218013,
  'KRW': 1188.2505360972,
  'RON': 4.2550929235,
  'CNY': 6.9107398142,
  'SEK': 9.5483380986,
  'EUR': 0.8934953538},
 'date': '2019-05-28'}

Or just the keys:

In [12]:
usd_data.keys()

dict_keys(['base', 'rates', 'date'])

We can index as with any other Python dictionary:

In [13]:
usd_data["date"]

'2019-05-28'

What is the going rate for the `BGN`? What currency _is_ that?

In [8]:
usd_data["rates"]["BGN"]

1.747498213

These codes aren't all that descriptive on their own. Let's join them with some more information:

In [14]:
curr_codes_file = os.path.join(data_dir, "XE_ISO4217_CurrencyCodes.csv")

Create a dataframe from these codes:

In [15]:
curr_codes_df = pd.read_csv(curr_codes_file)

And create another from our JSON data:

In [16]:
usd_df = pd.DataFrame({"code": list(usd_data['rates'].keys())})

In [17]:
usd_df["rate"] = [usd_data['rates'][x] for x in usd_df["code"]]
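The same two-column frame can also be built in a single step from the dictionary's key/value pairs. This is only an alternative sketch; the `rates` dictionary below is a three-entry excerpt of the data above, and the name `rates_df` is chosen for illustration.

```python
import pandas as pd

# A three-entry excerpt of the 'rates' dictionary (illustrative subset)
rates = {"BGN": 1.747498213, "NZD": 1.5268048606, "EUR": 0.8934953538}

# Each (key, value) pair becomes one row; name the two columns explicitly
rates_df = pd.DataFrame(list(rates.items()), columns=["code", "rate"])
print(rates_df)
```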

Before joining, check the path to the currency codes file and page through its contents:

In [18]:
curr_codes_file

'/home/ra/host/BH_Analytics/Discover/DataEngineering/data/XE_ISO4217_CurrencyCodes.csv'

In [19]:
%less /home/ra/host/BH_Analytics/Discover/DataEngineering/data/XE_ISO4217_CurrencyCodes.csv

Perform a left join to get the country code references:

In [20]:
usd_df = pd.merge(usd_df, curr_codes_df, left_on="code", right_on="Code", how="left")
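A left join keeps every row of the left frame, so a currency code with no match in the lookup table ends up with missing values rather than being dropped. One way to spot such rows, sketched here on made-up data, is the `indicator=True` option of `pd.merge`. The frames `rates_df` and `codes_df` and the code `XXX` are hypothetical; only the column names follow the ones used above.

```python
import pandas as pd

# Made-up stand-ins for the rates frame and the currency-code lookup table
rates_df = pd.DataFrame({"code": ["BGN", "EUR", "XXX"], "rate": [1.75, 0.89, 1.0]})
codes_df = pd.DataFrame(
    {"Code": ["BGN", "EUR"], "Country Name": ["Bulgaria Lev", "Euro Member Countries"]}
)

# indicator=True adds a '_merge' column saying where each row came from
merged = pd.merge(
    rates_df, codes_df, left_on="code", right_on="Code", how="left", indicator=True
)

# Rows marked 'left_only' had no matching entry in the lookup table
unmatched = merged.loc[merged["_merge"] == "left_only", "code"].tolist()
print(unmatched)
```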

Print out each currency and its rate relative to USD (the amount of that currency that one US dollar buys):

In [21]:
for Idx in usd_df.index.values:
    print(f"The {usd_df.loc[Idx, 'Country Name']} is at {usd_df.loc[Idx, 'rate']} USD")

The Bulgaria Lev is at 1.747498213 USD
The New Zealand Dollar is at 1.5268048606 USD
The Israel Shekel is at 3.6137419585 USD
The Russia Ruble is at 64.5899749821 USD
The Canada Dollar is at 1.3465868477 USD
The United States Dollar is at 1.0 USD
The Philippines Peso is at 52.2596497498 USD
The Switzerland Franc is at 1.0055396712 USD
The Australia Dollar is at 1.4440671909 USD
The Japan Yen is at 109.4085060758 USD
The Turkey Lira is at 6.0337741244 USD
The Hong Kong Dollar is at 7.8483738385 USD
The Malaysia Ringgit is at 4.1885275197 USD
The Croatia Kuna is at 6.6346497498 USD
The Czech Republic Koruna is at 23.0906004289 USD
The Indonesia Rupiah is at 14375.0 USD
The Denmark Krone is at 6.6734274482 USD
The Norway Krone is at 8.6816476054 USD
The Hungary Forint is at 291.815582559 USD
The United Kingdom Pound is at 0.789608649 USD
The Mexico Peso is at 19.1026626162 USD
The Thailand Baht is at 31.8102215868 USD
The Iceland Krona is at 123.9278055754 USD
The South Africa Rand is at 

Or, more succinctly and with better performance, use vectorized string concatenation:

In [15]:
print("The " + usd_df['Country Name'] + " is at " + usd_df['rate'].astype('str') + " USD")

0                The Bulgaria Lev is at 1.747498213 USD
1         The New Zealand Dollar is at 1.5268048606 USD
2              The Israel Shekel is at 3.6137419585 USD
3              The Russia Ruble is at 64.5899749821 USD
4              The Canada Dollar is at 1.3465868477 USD
5                The United States Dollar is at 1.0 USD
6          The Philippines Peso is at 52.2596497498 USD
7          The Switzerland Franc is at 1.0055396712 USD
8           The Australia Dollar is at 1.4440671909 USD
9                The Japan Yen is at 109.4085060758 USD
10               The Turkey Lira is at 6.0337741244 USD
11          The Hong Kong Dollar is at 7.8483738385 USD
12          The Malaysia Ringgit is at 4.1885275197 USD
13              The Croatia Kuna is at 6.6346497498 USD
14    The Czech Republic Koruna is at 23.0906004289 USD
15               The Indonesia Rupiah is at 14375.0 USD
16             The Denmark Krone is at 6.6734274482 USD
17              The Norway Krone is at 8.6816476

So `BGN` was the Bulgarian Lev!

## Parsing XML documents with xml.etree

Another popular format for storing structured documents is the e<b>X</b>tensible <b>M</b>arkup <b>L</b>anguage (XML). It looks a lot like the HTML that encodes most static websites. It supports nested structures and can be almost human-readable!

We will use the `xml.etree.ElementTree` module from the Standard Library, although there is also the third-party `lxml` package.

Because these files can be large, we have additionally compressed this one with the `bzip2` utility. Hence we'll need to decompress it with the `bz2` Python library.

We'll use the results of a small search for gravitational waves from the LIGO project.

In [22]:
import xml.etree

import bz2

xml_dir = "../data"

input_file = bz2.BZ2File(os.path.join(xml_dir, 'search_bands.xml.bz2'), 'rb')

The XML document is first parsed into a tree structure with the `ElementTree.parse` function:

In [23]:
import xml.etree.ElementTree as ET
tree = ET.parse( input_file )
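`ET.parse` accepts any binary file-like object, so the compressed file never has to be unpacked on disk. The self-contained sketch below writes a tiny made-up XML document to a temporary `.bz2` file and parses it back; the file name and XML content are invented for illustration and are unrelated to the LIGO data.

```python
import bz2
import os
import tempfile
import xml.etree.ElementTree as ET

# A tiny, made-up XML document (not the LIGO file)
xml_text = "<results><band><job>0</job></band></results>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.xml.bz2")

    # Write the document through bz2 so it is compressed on disk
    with bz2.open(path, "wt", encoding="utf-8") as fh:
        fh.write(xml_text)

    # Reopen in binary mode; the context manager closes the file for us
    with bz2.open(path, "rb") as fh:
        example_tree = ET.parse(fh)

root_tag = example_tree.getroot().tag
job_text = example_tree.getroot().find("band/job").text
print(root_tag, job_text)
```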

We need a starting point with which to navigate the tree. 

A common starting point is the root of the tree structure:

In [24]:
root = tree.getroot()

We'll set up some variables to catch the information stored in the XML:

In [29]:
twoF = []
freq = []
jobId = []
num_templates = []

Walk through the XML tree, collecting the 2F value and the frequency for each job:

In [20]:
for jobNumber in root.iter('job'):
    nodeInfo = root[int(jobNumber.text)].find('loudest_nonvetoed_template')
    # Skip jobs that have no loudest template or no 2F value recorded
    if nodeInfo is not None and nodeInfo.find('twoF') is not None:
        twoF.append( float( nodeInfo.find('twoF').text ) )
        freq.append( float( nodeInfo.find('freq').text ) )
        jobId.append( jobNumber.text )
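To make the navigation pattern easier to follow, here is a self-contained sketch on a small made-up document. The tag names `job`, `loudest_nonvetoed_template`, `twoF` and `freq` come from the loop above, but the surrounding layout (`search` and `band` elements) is an assumption for illustration and may not match the real file. It relies on `find` returning `None` for a missing child and on `findtext` returning the child's text directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: one <band> per job, the second without a template
xml_text = """
<search>
  <band>
    <job>0</job>
    <loudest_nonvetoed_template><twoF>45.2</twoF><freq>100.5</freq></loudest_nonvetoed_template>
  </band>
  <band>
    <job>1</job>
  </band>
</search>
"""
example_root = ET.fromstring(xml_text)

two_f, freq, job_id = [], [], []
for band in example_root.iter("band"):
    template = band.find("loudest_nonvetoed_template")
    if template is None:
        continue  # this band has no template recorded
    two_f_text = template.findtext("twoF")
    freq_text = template.findtext("freq")
    if two_f_text is None or freq_text is None:
        continue  # incomplete record
    two_f.append(float(two_f_text))
    freq.append(float(freq_text))
    job_id.append(band.findtext("job"))

print(job_id, two_f, freq)
```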

In [36]:
root[0][0].text

'0.0701275049410.0701275049410.070127504941'

Get the number of templates in each band:

In [37]:
for nTempl in root.iter('num_templates'):
    num_templates.append( float( nTempl.text ) )

In [38]:
num_templates

[919023910.0,
 919077732.0,
 919027936.0,
 919028760.0,
 918977341.0,
 918977078.0,
 919027559.0,
 918973907.0,
 918919543.0,
 918968292.0,
 918964322.0,
 918959641.0,
 918954282.0,
 918896528.0,
 283795892.0]

In [40]:
# The hallowed function for pretty-printing
# From http://effbot.org/zone/element-lib.htm#prettyprint
# Note that lxml, an external library, has a pretty-print option
# but it breaks for a lot of corner cases
def indent(elem, level=0):
    """This is a function to make the XML output pretty, with the right level
    of indentation. See
    http://effbot.org/zone/element-lib.htm#prettyprint
    for the original version"""
    i = "\n" + level*"  "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:
            indent(elem, level+1)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i

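Modify an element and add an attribute, then apply the indentation:

In [45]:
root[0][0].text = "Mickey Mouse"
# Attribute names cannot contain spaces in well-formed XML
root[0][0].set("new_tag", "5")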

In [46]:
indent(root)
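On Python 3.9 or newer, the Standard Library offers `ET.indent`, which does a similar job to the hand-written helper. The sketch below assumes such an interpreter and uses a tiny made-up document rather than the search results.

```python
import xml.etree.ElementTree as ET

# A tiny made-up document with no whitespace between elements
example_root = ET.fromstring("<a><b>1</b><c><d>2</d></c></a>")

# ET.indent (Python 3.9+) rewrites text/tail whitespace in place
ET.indent(example_root, space="  ")

pretty = ET.tostring(example_root, encoding="unicode")
print(pretty)
```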

Write out an XML file:

In [48]:
#search_bands_xml = ET.ElementTree(root)
#search_bands_xml
tree.write("new_output.xml", xml_declaration=True, encoding='UTF-8', method='xml')

In [50]:
%less new_output.xml

## HDF5 data format and the pandas HDFStore

## The HDF5 Format

A popular format for keeping data on disk rather than in memory when working with Pandas is HDF5. HDF5 is a data model and file format that can store multiple data frames, as well as different data types, in a single file. It was used extensively outside of Python long before Pandas was written, and Pandas has since adopted it.

From the website:

>HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data.

https://support.hdfgroup.org/HDF5/

Pandas has documentation for the HDF5 format under the Data I/O section of the documentation:

https://pandas.pydata.org/pandas-docs/stable/io.html

This documentation goes through a series of functions that allow basic I/O, along with some subsetting capability. It is quite good and worth reviewing. Since those functions were developed, another package called `Dask` has appeared that handles this kind of larger-than-memory work more flexibly. We will go through the Dask package in this lecture set. However, if you find it a bit overwhelming, you could use the more basic Pandas functions detailed in the links above.

## Creating an HDF5 data store using Pandas functions
Using this file format, you can create static tables, that is, tables that cannot be appended to and do not support querying, like so:

First, read in a dataset:

In [27]:
transactions = pd.read_csv(r'../data/retail_sales/transactions.csv')

Establish an HDF file connection (you may need to create the `temp` folder first)

In [28]:
! mkdir -p temp
store = pd.HDFStore('temp/store.h5')
store

<class 'pandas.io.pytables.HDFStore'>
File path: temp/store.h5

Write the table as static

In [29]:
store['transact'] = transactions

Note that the `transact` frame is now part of the store.

In [30]:
store 

<class 'pandas.io.pytables.HDFStore'>
File path: temp/store.h5

You can then retrieve items like so:

In [31]:
transaction2 = store['transact']
transaction2.head()

Unnamed: 0,InvoiceNo,StockCode,Quantity,InvoiceDate,UnitPrice,CustomerID
0,536365,85123A,6,12/1/2010 8:26,2.55,17850.0
1,536365,71053,6,12/1/2010 8:26,3.39,17850.0
2,536365,84406B,8,12/1/2010 8:26,2.75,17850.0
3,536365,84029G,6,12/1/2010 8:26,3.39,17850.0
4,536365,84029E,6,12/1/2010 8:26,3.39,17850.0


Behind the scenes, pandas is leveraging the PyTables library to interact with HDF5 data. PyTables uses C code and Cython to achieve fast performance behind a friendly interface.

Here, the store acts like a dict: we can store or retrieve data items via their name (key). With this syntax (`store['name']`), we are using the `HDFStore.put` method, which creates a fixed array format. Fixed stores are not appendable once written (although they can be replaced). This format is the default when using `put` or `to_hdf`, and it can be requested explicitly with `format='fixed'` or `format='f'`.

## Speed gains

Note the difference in speed between retrieving the data from the HDF store and reading in the CSV file. On our system, it took roughly $\frac{1}{5}$ of the time to bring in the HDF5 data.

In [32]:
%time df2 = pd.read_csv(r'../data/retail_sales/transactions.csv')

CPU times: user 336 ms, sys: 60 ms, total: 396 ms
Wall time: 593 ms


In [33]:
%time df1 = store['transact']

CPU times: user 64 ms, sys: 20 ms, total: 84 ms
Wall time: 135 ms


However, the typical use case for HDF5 involves saving a large amount of data to a single data store (on disk dataframe), so you can later query or add to the data (append).

## Appending to the data store
Let's show how to add to an existing store of data. First, we will delete the static table:

In [34]:
del store['transact']

Now we can create a non-static dataframe in the store, using the table format (or 't'). The table format supports delete and query operations. A table is created by passing `format='table'` or `format='t'` to `append`, `put` or `to_hdf`, or simply by using the `append` method.

For example:
`store.put('tabl1', t1, format='table')`
or
`store.put('tabl1', t1, format='t')`
or
`df.to_hdf(path, key='tabl1', format='table')`
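As a rough sketch of what querying a table-format store can look like, the example below writes a tiny made-up frame and selects rows with a `where` expression. It assumes PyTables is installed. The key `transact_tbl`, the sample rows and the temporary file are hypothetical, and the queried column has to be declared in `data_columns` for the `where` clause to work.

```python
import os
import tempfile

import pandas as pd

# Made-up rows using column names from the transactions data
df = pd.DataFrame(
    {"StockCode": ["85123A", "71053", "84406B"], "Quantity": [6, 6, 8]}
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")

    # The context manager closes the store when the block ends
    with pd.HDFStore(path) as example_store:
        # data_columns makes 'Quantity' queryable on disk
        example_store.put("transact_tbl", df, format="table", data_columns=["Quantity"])

        # Only the matching rows are read back into memory
        big_orders = example_store.select("transact_tbl", where="Quantity > 6")

print(big_orders)
```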

In [35]:
store

<class 'pandas.io.pytables.HDFStore'>
File path: temp/store.h5

In [36]:
t1 = transactions[0:100]
t2 = transactions[100:200]

`append` creates the table if it does not exist, and appends to it otherwise:

In [37]:
store.append('transact_all', t1, min_itemsize=50)

In [38]:
store.append('transact_all',t2)

Strings have a fixed width in the underlying data store. Therefore, we must decide on the width of each string column when writing the table. You can pass the `min_itemsize` argument when creating the table to establish the minimum length for a given column (or all columns). In the code above, we set a width of 50 for all strings. Alternatively, you could pass a dict with column names as keys and widths as values to set the width per column.
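The per-column form might look like the following sketch, which again assumes PyTables is installed. The frames `first` and `second`, the key `example_all` and the width of 20 are made up for illustration. The second frame contains a longer string than the first, which is the situation a reserved width is meant to handle.

```python
import os
import tempfile

import pandas as pd

# Made-up frames; the second holds a longer StockCode than the first
first = pd.DataFrame({"StockCode": ["85123A", "71053"], "Quantity": [6, 6]})
second = pd.DataFrame({"StockCode": ["A_MUCH_LONGER_CODE", "84406B"], "Quantity": [8, 2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.h5")
    with pd.HDFStore(path) as example_store:
        # Reserve 20 characters for StockCode only, keyed by column name
        example_store.append("example_all", first, min_itemsize={"StockCode": 20})

        # The longer string fits because the width was reserved up front
        example_store.append("example_all", second)

        combined = example_store["example_all"]

print(combined)
```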

The data store is not thread-safe and does not support concurrent reading and writing.

A data table in PyTables is defined as a collection of records whose values are stored in fixed-length fields. All records have the same structure and all values in each field have the same data type. We can specify the size of the fields and their datatypes explicitly, but pandas provides a high-level interface that takes care of these aspects for us.

In [39]:
%time df2 = store['transact_all']  # great performance

CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 47.5 ms


For more advanced usage, review the excellent Pandas documentation linked above.

## Conclusion

We covered how to parse three of the most widely used data formats in the technology sector: JSON, using the `json` library; XML documents, using `xml.etree`; and HDF5 files, using `pandas` (there is also `h5py`).