# Parallel parsing of XML files with Dask and ElementTree

We have two XML files (though it could just as well be a million), and we want to parse them in parallel with Dask. We use ElementTree to parse the XML.

1. Dask Bags: https://dask.readthedocs.io/en/latest/bag.html
2. ElementTree: https://docs.python.org/3/library/xml.etree.elementtree.html

This borrows heavily from the Dask example at https://examples.dask.org/bag.html

In [1]:
# Imports
import xml.etree.ElementTree as ET
from dask.distributed import Client, progress
import dask.bag as db

Setting up a local Dask cluster with four workers, each running a single thread:

In [2]:
client = Client(n_workers=4, threads_per_worker=1)
client

Client: Scheduler: tcp://127.0.0.1:44197, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 4, Memory: 12.58 GB


We need to treat each file as a single object and process it separately. That makes `db.read_text` a poor fit, because it treats each line as an element of the bag. Instead, we define a function that parses a file and returns its root element, and map it over a list of XML file names.

In [3]:
def parse_xmlfile(x):
    # Parse the file at path x and return the root element of its tree
    return ET.parse(x).getroot()

In [4]:
b = db.from_sequence(['xmlfile1.xml', 'xmlfile2.xml']).map(parse_xmlfile)
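The input files themselves are not shown in this notebook. As a minimal sketch for reproducing the cells, the following writes two files whose country names and ranks mirror the final output further down. The `<data>` root tag, the `SAMPLE_FILES` mapping, the `write_sample_files` helper, and the use of a temporary directory are all assumptions made for illustration; the real files may contain more fields.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical file contents: names and ranks are taken from the notebook's
# final output; the <data> root tag is an assumption.
SAMPLE_FILES = {
    "xmlfile1.xml": [("Liechtenstein", "1"), ("Singapore", "4"), ("Panama", "68")],
    "xmlfile2.xml": [("Sweden", "7"), ("Denmark", "12"), ("Norway", "3")],
}


def write_sample_files(directory):
    """Write the sample XML files into `directory` and return their paths."""
    paths = []
    for filename, countries in SAMPLE_FILES.items():
        root = ET.Element("data")
        for name, rank in countries:
            country = ET.SubElement(root, "country", {"name": name})
            ET.SubElement(country, "rank").text = rank
        path = os.path.join(directory, filename)
        ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
        paths.append(path)
    return paths


def parse_xmlfile(x):
    # Same parsing step as in the notebook
    return ET.parse(x).getroot()


sample_dir = tempfile.mkdtemp()
sample_paths = write_sample_files(sample_dir)

# Sequential check without Dask: each file should hold three countries
roots = [parse_xmlfile(p) for p in sample_paths]
country_counts = [len(r.findall("country")) for r in roots]
print(country_counts)
```

Passing `sample_paths` to `db.from_sequence` in place of the hard-coded file names should then let the remaining cells run against these files.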

Find every `country` element under each root and count the elements of the resulting bag. The count should equal the number of files (the same as counting `b`), but each element is now a list with multiple entries.

In [5]:
elm = db.map(lambda x: x.findall('country'), b)

In [6]:
elm.count().compute()

2
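One detail worth keeping in mind: `findall('country')` only matches direct children of the element it is called on. The sketch below illustrates the difference using a small inline document; the `<region>` wrapper is a hypothetical structure, not something present in the files used here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one country directly under the root,
# one nested inside a <region> wrapper.
doc = ET.fromstring(
    "<data>"
    "<country name='Sweden'><rank>7</rank></country>"
    "<region><country name='Norway'><rank>3</rank></country></region>"
    "</data>"
)

# Direct children only
direct = [c.get("name") for c in doc.findall("country")]

# All descendants, at any depth
nested = [c.get("name") for c in doc.findall(".//country")]

print(direct)  # ['Sweden']
print(nested)  # ['Sweden', 'Norway']
```

If the country elements sit deeper in the tree, the path `.//country` can be used in the lambda instead.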

Create a function that extracts the info we want from each country element and returns a list of JSON-like records (dicts). Since each element in `elm` is itself a list of elements, the structure is nested, as shown below.

In [11]:
elm.compute()

[[<Element 'country' at 0x7f6ab07f4a48>,
  <Element 'country' at 0x7f6ab07f4c28>,
  <Element 'country' at 0x7f6ab07f4db8>],
 [<Element 'country' at 0x7f6ab07f44a8>,
  <Element 'country' at 0x7f6ab07f4688>,
  <Element 'country' at 0x7f6ab07f4818>]]

In [8]:
def make_rows(cntrs):
    # Build one dict per country element: the name attribute and the text of its <rank> child
    return [{'name': x.get('name'), 'rank': x.find('rank').text} for x in cntrs]

Map `make_rows` over the elements, flatten the result to merge the two lists (one per file) into a single list, and compute the result. Dask is lazy, so nothing runs until `compute()` is called.

In [9]:
jsons = db.map(make_rows, elm).flatten().compute()

In [10]:
jsons

[{'name': 'Liechtenstein', 'rank': '1'},
 {'name': 'Singapore', 'rank': '4'},
 {'name': 'Panama', 'rank': '68'},
 {'name': 'Sweden', 'rank': '7'},
 {'name': 'Denmark', 'rank': '12'},
 {'name': 'Norway', 'rank': '3'}]
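Note that the ranks come back as strings, and that `x.find('rank').text` raises an `AttributeError` when a country has no `<rank>` child. A possible variant, sketched here in plain Python without Dask, converts the rank to an integer and tolerates a missing element. The function name `make_typed_rows` and the rank-less Panama entry are assumptions for illustration; `itertools.chain.from_iterable` plays the role of `flatten()`.

```python
from itertools import chain
import xml.etree.ElementTree as ET


def make_typed_rows(countries):
    """Return one dict per country, with rank as int or None if missing."""
    rows = []
    for country in countries:
        rank_text = country.findtext("rank")  # None when <rank> is absent
        rows.append({
            "name": country.get("name"),
            "rank": int(rank_text) if rank_text is not None else None,
        })
    return rows


# Hypothetical inline documents standing in for the two files
file1 = ET.fromstring(
    "<data>"
    "<country name='Liechtenstein'><rank>1</rank></country>"
    "<country name='Panama'/>"
    "</data>"
)
file2 = ET.fromstring(
    "<data><country name='Norway'><rank>3</rank></country></data>"
)

# One list of rows per file, then merged into a single flat list
per_file = [make_typed_rows(root.findall("country")) for root in (file1, file2)]
rows = list(chain.from_iterable(per_file))
print(rows)
```

The same function could be passed to `db.map` in place of `make_rows`, since it takes and returns the same kind of objects.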