## Outline

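Ingestion tasks are often written to simply dump data to disk, leaving a second job to sort the data on disk afterwards.
This two-step approach is slow because the data must be read back and rewritten after it has already been saved.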

This notebook shows how to perform a distributed ingest while also sorting.

## Create sample table

In [1]:
//Imports and setup
p)import warnings
p)warnings.filterwarnings("ignore")
p)import pandas as pd
p)import numpy as np
p)import pyarrow as pa
p)import pyarrow.parquet as pq

In [2]:
p)times=[np.datetime64('2012-07-01T01:00:00.000000000')] * 4
p)table=pd.DataFrame(columns=['time','sym','price','size'])
p)table['time'] = times
p)table['sym'] = ['a','b','a','b']
p)table['price'] = [4.0,3.0,2.0,1.0]
p)table['size'] = [100,200,300,400]
p)print(table)

                 time sym  price  size
0 2012-07-01 01:00:00   a    4.0   100
1 2012-07-01 01:00:00   b    3.0   200
2 2012-07-01 01:00:00   a    2.0   300
3 2012-07-01 01:00:00   b    1.0   400


In [3]:
p)pq.write_table(pa.Table.from_pandas(table), 'example.parquet')

## Sorting

The important change in this example is that the master process extracts the columns we wish to sort on.

From these columns it creates the sort index for the data.

This index is then sent to all worker processes, which use it to save each column in the same sort order.
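The idea above can be sketched in plain Python. This is a minimal illustration, not the code in convert.q: the names `build_sort_index` and `reorder_column` are hypothetical, the choice to sort by `sym` and then `time` is an assumption, and a thread pool stands in for the separate worker processes.

```python
from concurrent.futures import ThreadPoolExecutor

# Sample columns mirroring the table created earlier in this notebook.
columns = {
    "time": ["2012-07-01T01:00:00"] * 4,
    "sym": ["a", "b", "a", "b"],
    "price": [4.0, 3.0, 2.0, 1.0],
    "size": [100, 200, 300, 400],
}


def build_sort_index(sort_columns):
    """Master step (hypothetical helper): row order that sorts by the given columns."""
    n = len(sort_columns[0])
    # sorted() is stable, so rows with equal keys keep their original order.
    return sorted(range(n), key=lambda i: tuple(col[i] for col in sort_columns))


def reorder_column(name, values, sort_index):
    """Worker step (hypothetical helper): reorder one column with the shared index."""
    return name, [values[i] for i in sort_index]


# Assumption: sort by sym first, then by time.
sort_index = build_sort_index([columns["sym"], columns["time"]])

# Threads stand in for the worker processes; each task handles one column
# and only needs that column plus the shared sort index.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        pool.submit(reorder_column, name, values, sort_index)
        for name, values in columns.items()
    ]
    sorted_columns = dict(f.result() for f in futures)

print(sort_index)             # [0, 2, 1, 3]
print(sorted_columns["sym"])  # ['a', 'a', 'b', 'b']
```

Because every worker applies the same index, the columns stay row-aligned even though they are reordered and saved independently, and only the sort columns ever need to be loaded by the master.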

## Running the example

Start your worker processes

Run the master process to distribute the work

The output shows that the parquet data has been successfully converted to a q splayed table with the correct sort order and attributes.

## Files

### convert.q

This script coordinates the conversion of the parquet file by distributing the work across multiple processes.

### qparquet.q

This file contains the required imports and functions.