## Testing Streaming Output into Parquet with PyArrow

In [45]:
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np
import torch
from datasets import load_dataset

### Testing out basic table creation

In [10]:
table = pa.table([
    pa.array([
        "The quick brown fox jumped over the log.",
        "Color dreams sleep furiously.",
    ]),
    pa.array([
        torch.zeros(256).numpy(),
        torch.zeros(256).numpy(),
    ]),
], names=["text", "embeddings"])

In [11]:
table.schema

text: string
embeddings: list<item: float>
  child 0, item: float

### Testing streaming out into Parquet file
TODO: Best practices for breaking up into multiple files / shards?

#### Write random data

In [12]:
# Each write_table call appends the batch to the open file as a new row group.
with pq.ParquetWriter('testing.tmp.parquet', table.schema) as writer:
    for i in range(3):
        table = pa.table([
            pa.array([
                f"{i}: The quick brown fox jumped over the log.",
                f"{i}: Color dreams sleep furiously.",
            ]),
            pa.array([
                torch.randn(256).numpy(),
                torch.randn(256).numpy(),
            ]),
        ], names=["text", "embeddings"])

        writer.write_table(table)

In [18]:
!du -h *.parquet

12K	testing.tmp.parquet


#### Read it back via HuggingFace iterable dataset

In [53]:
parquet_file = pq.ParquetFile('testing.tmp.parquet')
metadata = parquet_file.metadata
print(metadata)

<pyarrow._parquet.FileMetaData object at 0x7f31b8bcfce0>
  created_by: parquet-cpp-arrow version 17.0.0
  num_columns: 2
  num_rows: 1668492
  num_row_groups: 1630
  format_version: 2.6
  serialized_size: 5905946


In [51]:
dataset = load_dataset("parquet", data_files={'train': 'testing.tmp.parquet'}, streaming=True, batch_size=1024)

In [52]:
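# Note: this cell reads a 'chunks' column, and the outputs below show 384-dim
# embeddings, so it was run against a file with that layout rather than the
# two-row test tables written above (whose columns are 'text' and 'embeddings').
for x in dataset['train'].iter(batch_size=32):
    print(list(x.keys()), len(x['chunks']), len(x['embeddings']))
    embeddings = np.array(x['embeddings'])
    chunks = x['chunks']

    print(type(embeddings), embeddings.shape)
    print(type(chunks), type(chunks[0]), chunks[0])
    print('')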

['chunks', 'embeddings'] 32 32
<class 'numpy.ndarray'> (32, 384)
<class 'list'> <class 'str'> [CLS] anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy, typically including nation - states, and capitalism. anarchism advocates for the replacement of the state with stateless societies and voluntary free associations. as a historically left - wing movement, this reading of anarchism is placed on the farthest left of the political spectrum, usually described as the libertarian wing of the socialist movement ( libertarian socialism ). [SEP]

['chunks', 'embeddings'] 32 32
<class 'numpy.ndarray'> (32, 384)
<class 'list'> <class 'str'> [CLS] during the classical era, anarchists had a militant tendency. not only did they confront state armed forces, as in spain and ukraine, but some of them also employed terrorism as propaganda of the deed. assassinatio

#### Test Output from `embed_wikipedia_articles.ipynb`
Run that notebook first. I start with only a small number of chunks in `benchmark_document_embedding.ipynb`, just to make sure the output is being saved correctly.