# File Transformations

## For Use in Cloud Data Throughput Benchmarking

-------------------------------------------------------------------------------------------------------------------------------

**WARNING:**

**RUNNING THESE CELLS WILL RAISE AN ERROR OR OVERWRITE DATA ALREADY STORED WITHIN gs://cloud-data-benchmarks** 

**CHANGE THE BUCKET AND/OR SOURCE & OUTPUT FILE(S) IF YOU WISH TO USE THIS NOTEBOOK**

-------------------------------------------------------------------------------------------------------------------------------
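One way to act on the warning above is to keep the bucket name in a single variable and build every source and output URL from it, so that pointing the notebook at a different bucket is a one-line change. The sketch below is illustrative only: `BUCKET`, `BASENAME`, `gcs_url` and the bucket name `my-benchmark-bucket` are hypothetical and do not appear in the cells of this notebook.

```python
# Hypothetical bucket name; replace it with a bucket you own.
BUCKET = "my-benchmark-bucket"
# Base name shared by the source and output files used in this notebook.
BASENAME = "ETOPO1_Ice_g_gmt4"


def gcs_url(*parts, bucket=BUCKET):
    """Join path components into a gs:// URL inside the given bucket."""
    return "gs://" + "/".join((bucket,) + parts)


# URLs that would stand in for the hard-coded gs://cloud-data-benchmarks paths.
source_csv = gcs_url(f"{BASENAME}.csv")
parquet_dir = gcs_url("parquetpartitions")
csv_dir = gcs_url("csvpartitions")
zarr_store = gcs_url(f"{BASENAME}.zarr")
```

With this in place, calls such as `dd.read_csv(source_csv, assume_missing=True)` would read from the chosen bucket instead of the original one.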

In [None]:
import pandas as pd
import dask.dataframe as dd
import dask.array as dsa
import zarr
import xarray as xr
import numpy as np
import intake
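# Path to the Google Cloud credentials JSON file, passed to the storage backend via storage_options
token = '/home/ubuntu/Cloud-Data-Transfer-Speed-Benchmarks/cloud-data-benchmarks.json'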

Note: `name_function` only names the output files; it does not make them sort in partition order. When using this method to split a CSV file into partitions of the same (or a different) file type, build the ordering into the file names, for example by zero-padding the partition index.

Here the files are only used to measure read speed, so the order in which Dask concatenates them when the timing program loads them does not matter. If this method is used for machine learning or data analysis, it is a good idea to preserve the partition order.
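As a minimal sketch of a naming function with a "sorting feature", the example below zero-pads the partition index so that the file names sort lexicographically in partition order. The helper `make_name_function` and the padding width of 4 are assumptions made for illustration; the cells in this notebook use unpadded indices.

```python
def make_name_function(prefix, extension, width=4):
    """Return a function mapping a partition index to a zero-padded file name."""
    def name_function(i):
        # Pad the index so that, e.g., partition 2 sorts before partition 10.
        return f"{prefix}_{i:0{width}d}.{extension}"
    return name_function


# Names as produced by an unpadded naming function, for 12 partitions.
unpadded = [f"ETOPO1_Ice_g_gmt4_{i}.parquet" for i in range(12)]

# Names produced by the zero-padded variant.
padded_name = make_name_function("ETOPO1_Ice_g_gmt4", "parquet")
padded = [padded_name(i) for i in range(12)]

# A lexicographic sort keeps partition order only for the padded names.
unpadded_in_order = sorted(unpadded) == unpadded
padded_in_order = sorted(padded) == padded
```

The returned function could be passed as `name_function=padded_name` in place of the unpadded versions when partition order matters. The width should be large enough for the highest partition index.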

## CSV to Partitioned Parquets

In [None]:
df = dd.read_csv('gs://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.csv', assume_missing=True)

In [None]:
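# Name each output file after its partition index
name_function = lambda x: f"ETOPO1_Ice_g_gmt4_{x}.parquet"
# Write one Parquet file per Dask partition to the bucket
dd.to_parquet(df, 'gs://cloud-data-benchmarks/parquetpartitions', name_function=name_function, storage_options={'token':token})
del df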

## CSV to One Parquet File

In [None]:
df = pd.read_csv('/home/ubuntu/Cloud-Data-Transfer-Speed-Benchmarks/ETOPO1_Ice_g_gmt4.csv')

In [None]:
df.to_parquet('/home/ubuntu/Cloud-Data-Transfer-Speed-Benchmarks/ETOPO1_Ice_g_gmt4.parquet', engine='fastparquet')
del df

## CSV to Partitioned CSVs

In [None]:
df = dd.read_csv('gs://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.csv', assume_missing=True)

In [None]:
def name_function(i):
    return "ETOPO1_Ice_g_gmt4_" + str(i) + ".csv"
dd.to_csv(df, 'gs://cloud-data-benchmarks/csvpartitions', name_function=name_function, storage_options={'token':token})
del df

## NetCDF to Zarr

### Zarr Group

This approach uses Xarray to store the contents of the NetCDF file in a Zarr group. Note that no method was found for retrieving the NetCDF file directly from cloud storage.

In [None]:
ds = intake.open
ds.to_zarr(store='gs://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.zarr', storage_options={'token':token}, consolidated=True)

### Zarr Array

In [None]:
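# Open the Zarr group as an Xarray Dataset (Dask-backed by default)
ds = xr.open_zarr(store='gs://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.zarr', storage_options={'token':token}, consolidated=True)
# Stack the data variables into one DataArray and take its underlying array
darray = ds.to_array()
# asarray passes an existing Dask array through; from_array raises an error on one
da = dsa.asarray(darray.data)
# Note: this writes to the same store URL the group was read from; change the path to keep both
dsa.to_zarr(da, 'gs://cloud-data-benchmarks/ETOPO1_Ice_g_gmt4.zarr', storage_options={'token':token})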
