# Description

High-throughput workflows often process large datasets for purposes such as screening for materials of interest or searching for compounds that meet certain criteria. Because these datasets are large and each system is typically analyzed independently of the rest, we use parallelization methods to distribute the tasks.

As a practical example, consider the [QM09 dataset](https://www.nature.com/articles/sdata201422) by R. Ramakrishnan, P. O. Dral, M. Rupp, and O. A. von Lilienfeld, *Sci. Data*, **1**, 140022 (2014). We already discussed the data reported therein in Notebook 16_Graph_attention_networks. Briefly, the authors computed different properties for 133,885 small molecules. The geometries are included in $n$.xyz files, where $n \in \mathbb{Z}$ and $1 \leq n \leq 133\,885$. For ease of implementation, we will consider a random subset of 1,000 systems, sampled with the following code, which produces 1,000 random indexes for $n$:

In [15]:
import random

sampling_size = 1_000

# range() excludes its upper bound, so 133_886 keeps n = 133,885 eligible
subset_idx    = random.sample(range(1, 133_886), sampling_size)

print(f'random {sampling_size} indexes contain {subset_idx[:5]} ... {subset_idx[-5:]}'.replace('[', '').replace(']', ''))

random 1000 indexes contain 45199, 27763, 27036, 81765, 1611 ... 111502, 99554, 3803, 45888, 13115
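
Because `random.sample` draws without replacement, the indexes are unique, but they change on every run. The following is a minimal sketch of a reproducible variant, assuming you want the same subset each time. The seed value `0` and the names `reproducible_idx` and `repeat_idx` are illustrative choices, not part of the original notebook. The upper bound of `range` is exclusive, so `133_886` is used here to keep $n = 133\,885$ eligible.

```python
import random

sampling_size = 1_000

# Assumption: a fixed seed (0 is arbitrary) so that reruns give the same subset.
reproducible_idx = random.Random(0).sample(range(1, 133_886), sampling_size)

# A second generator with the same seed reproduces the same draw.
repeat_idx = random.Random(0).sample(range(1, 133_886), sampling_size)

print(reproducible_idx == repeat_idx)                # same subset on both draws
print(len(set(reproducible_idx)) == sampling_size)   # no repeated indexes
```

Using a dedicated `random.Random` instance leaves the global random state untouched, which helps when other cells also rely on `random`.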


As reported by the authors, each [.xyz file](https://springernature.figshare.com/articles/dataset/Data_for_6095_constitutional_isomers_of_C7H10O2/1057646?backTo=/collections/Quantum_chemistry_structures_and_properties_of_134_kilo_molecules/978904) is formatted as follows:

|Line       |Content|
|---|---|
|1          |Number of atoms na|
|2          |Properties 1-17 (see below)|
|3,...,na+2 |Element type, coordinate (x,y,z) (Angstrom), and Mulliken partial charge (e) of atom|
|na+3       |Frequencies (3na-5 or 3na-6)|
|na+4       |SMILES from GDB9 and for relaxed geometry|
|na+5       |InChI for GDB9 and for relaxed geometry|

The second line of each file stores the following properties:

|Index  |Property  |Unit         |Description|
|---|---|---|---|
| 0  |tag       |-            |"gdb9"; string constant to ease extraction via grep|
| 1  |index     |-            |Consecutive, 1-based integer identifier of molecule|
| 2  |A         |GHz          |Rotational constant A|
| 3  |B         |GHz          |Rotational constant B|
| 4  |C         |GHz          |Rotational constant C|
| 5  |mu        |Debye        |Dipole moment|
| 6  |alpha     |Bohr^3       |Isotropic polarizability|
| 7  |HOMO      |Hartree      |Energy of Highest occupied molecular orbital (HOMO)|
| 8  |LUMO      |Hartree      |Energy of Lowest unoccupied molecular orbital (LUMO)|
| 9  |gap       |Hartree      |Gap, difference between LUMO and HOMO|
|10  |r2        |Bohr^2       |Electronic spatial extent|
|11  |zpve      |Hartree      |Zero point vibrational energy|
|12  |U0        |Hartree      |Internal energy at 0 K|
|13  |U         |Hartree      |Internal energy at 298.15 K|
|14  |H         |Hartree      |Enthalpy at 298.15 K|
|15  |G         |Hartree      |Free energy at 298.15 K|
|16  |Cv        |cal/(mol K)  |Heat capacity at 298.15 K|
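
Given the table above, $H$ sits at position 14 of the whitespace-separated property line. The following is a minimal sketch of how one such line could be parsed. The function name `parse_enthalpy`, the constant `H_FIELD`, and the synthetic input line are hypothetical, and the handling of `*^` exponents is a defensive assumption about how some values may be written, so please check it against the files you download.

```python
from typing import Tuple

H_FIELD = 14  # 0-based position of H in the property table, counting the tag as field 0


def parse_enthalpy(property_line: str) -> Tuple[int, float]:
    """Return (molecule index, H in Hartree) from the second line of an .xyz file."""
    fields = property_line.split()  # split on any whitespace (tabs or spaces)
    index = int(fields[1])
    # Assumption: exponents may be written as "*^" instead of "e"; harmless if absent.
    enthalpy = float(fields[H_FIELD].replace("*^", "e"))
    return index, enthalpy


# Synthetic line with placeholder numbers (not real QM09 data): field i holds i + 0.5.
synthetic_line = " ".join(["gdb9", "7"] + [f"{i}.5" for i in range(2, 17)])
print(parse_enthalpy(synthetic_line))  # (7, 14.5)
```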

## Your task

Download the QM09 dataset and use a parallel computing approach of your choice to extract $H$, the enthalpy at 298.15 K, for the molecules whose indexes appear in `subset_idx`. Collect all the indexes and $H$ values in a pandas `DataFrame`.

> ### Assignment
>
> Use `with pd.option_context('display.max_rows', None): display(dataframe)` to show your data. Note that the code snippet assumes `import pandas as pd`; please change it if necessary.
>
> Use a histogram to report the distribution for the $H$ values you extracted.

## Considerations

Don't open all data files before deploying the parallel processes. Avoid unnecessary data load by handling the IO operations in each parallel task.
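
The pattern described above can be sketched as follows: the main program only passes file paths to the workers, and each task opens and reads its own file. This is a sketch under stated assumptions, not a reference solution. It uses a thread pool from `concurrent.futures` as one possible choice, and it runs on small synthetic two-line files written to a temporary directory (placeholder numbers, not real QM09 data). The names `read_enthalpy`, `write_synthetic_file`, and `demo_idx` are hypothetical.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Tuple

H_FIELD = 14  # 0-based position of H on the property line


def read_enthalpy(path: Path) -> Tuple[int, float]:
    """Open one file inside the worker and return (index, H) from its second line."""
    with open(path) as handle:
        handle.readline()                    # line 1: number of atoms, not needed here
        fields = handle.readline().split()   # line 2: tag, index, and properties
    return int(fields[1]), float(fields[H_FIELD])


def write_synthetic_file(directory: Path, n: int) -> Path:
    """Write a made-up, two-line stand-in for n.xyz where H is set to -n."""
    fields = ["gdb9", str(n)] + ["0.0"] * 12 + [str(-float(n))] + ["0.0"] * 2
    path = directory / f"{n}.xyz"
    path.write_text("1\n" + " ".join(fields) + "\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    demo_idx = [3, 11, 42]  # stand-in for subset_idx
    paths = [write_synthetic_file(Path(tmp), n) for n in demo_idx]
    # Only paths are handed out; every file is opened inside its own task.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(read_enthalpy, paths))

print(results)  # [(3, -3.0), (11, -11.0), (42, -42.0)]
```

`ProcessPoolExecutor` exposes the same `map` interface, so it can be swapped in if you prefer separate processes. In that case the worker function must be importable, and on some platforms the pool has to be created under an `if __name__ == "__main__":` guard.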

# Your implementation

> **You will earn extra credit for the organization, implementation, and legibility of your code**