utilization of parquet for intermediate storage #26
I can now confirm that dbdreader can be used to reproduce similar results to the slocum binaries. At this stage, the merged dataframe can be tossed into the parquet framework.
A quick comparison was done across 9 separate files against the values obtained via the existing conversion. The deviations occur as ... The code will have to be refactored a bit to handle both pairs of files; the method should remain the same. Once that is written, the ...
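A minimal sketch of how the sbd/tbd dataframes feeding that merge might be built with dbdreader; the file names, the sensor choices, and the decision to place both files' timestamps in a shared m_present_time column are illustrative assumptions, not this project's actual code.

import dbdreader
import pandas as pd

# Hypothetical file pair; real deployments have matched *.sbd/*.tbd names.
sbd = dbdreader.DBD("unit_507-2022-043-0-0.sbd")
tbd = dbdreader.DBD("unit_507-2022-043-0-0.tbd")

# dbdreader returns (time, value) arrays per sensor.
tm, m_depth = sbd.get("m_depth")
sbd_data = pd.DataFrame({"m_present_time": tm, "m_depth": m_depth})

tm, water_temp = tbd.get("sci_water_temp")
tbd_data = pd.DataFrame({"m_present_time": tm, "sci_water_temp": water_temp})

# Outer merge keeps flight and science records on a common time axis.
dbd_data = pd.merge(sbd_data, tbd_data, on=["m_present_time"], how="outer", sort=True)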
Or convert the ascii reader using the above to parquet, and then adapt the netcdf part to pull from parquet instead of ascii.
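A rough sketch of what "pull from parquet instead of ascii" might look like on the netcdf side; the file name is a placeholder and the handoff into the existing profile/NetCDF processing is only implied here, not part of this codebase yet.

import pandas as pd

# Read the intermediate parquet table back into a dataframe; the 'time'
# index written by the dbdreader/parquet side is preserved by pandas.
dbd_data = pd.read_parquet("unit_507-2022-043-0-0.parquet")

# Hand the dataframe to the existing nc/xarray processing instead of
# parsing a *.dat file (profile filtering, attributes, etc. unchanged).
ds = dbd_data.to_xarray()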
Yes, will make this a gradual transition. Added a hook to deployment.json:

{
  "glider": "unit_507",
  "trajectory_date": "20220212T0000",
  "filters": {
    "tsint": 20,
    "filter_z": 1,
    "filter_time": 5,
    "filter_points": 5,
    "filter_distance": 1
  },
  "extra_kwargs": {
    "enable_parquet": true,
    "echograms": {
      "enable_nc": true,
      "enable_ascii": true,
      "enable_image": true,
      "plot_type": "profile",
      "plot_cmap": "ek80",
      "svb_thresholds": [-10, -20, -25, -30, -35, -45, -55],
      "svb_limits": [-10.0, -55.0],
      "echogram_range_bins": 20,
      "echogram_range": -60.0,
      "echogram_range_units": "meters"
    }
  },
  "attributes": {
    "acknowledgement": "This work was supported by funding from NOAA/IOOS/AOOS.",
    "comment": "",
    ...
  }
}

Enabling parquet also assumes dbdreader will also be used.
Now to bolt on the decoding code... and push certain things over to the nc processing side. Getting parquet tables to store datetime[ns, UTC] was giving me a problem initially. This gets us to the goal of parquet tables and nc/xarray while preserving the "seconds since 1990-01-01" units. Once the merge happens:

import datetime
import pandas as pd
import pyarrow.parquet as pq
import xarray as xr

# sbd_data, tbd_data and the output basename (bn) are defined earlier in the script
dbd_data = pd.merge(sbd_data, tbd_data, on=['m_present_time'], how='outer', sort=True)
dbd_data = dbd_data.reset_index(drop=True)

# This works for xarray and parquet
dbd_data['time'] = [datetime.datetime.fromtimestamp(tm, tz=datetime.timezone.utc) for tm in dbd_data['time'].values]
# Convert time to datetime and preserve fractional seconds
# NOTE: works for xarray; not for parquet
#dbd_data['time'] = pd.to_datetime(dbd_data['time'], utc=True, unit='s')
# Make time the index
dbd_data = dbd_data.set_index('time')
# Write files to storage
# write netcdf
fn_xr = "%s%s" % (bn, ".nc")
ds = dbd_data.to_xarray()
ds['time'] = ds['time'].astype('datetime64[ns]')
encodings = {}
for dvar in ds.data_vars:
encodings[dvar] = {"_FillValue": None}
encodings.update({
"time": {
"_FillValue": None,
"calendar": "standard",
"units": "seconds since 1990-01-01T00:00:00Z"
}
})
ds.to_netcdf(fn_xr, encoding = encodings)
# write parquet
fn_pq = "%s%s" % (bn, ".parquet")
dbd_data.to_parquet(fn_pq)
# Verify what we wrote can be read back in
# read netcdf
#nc_ver = xr.open_dataset(fn_xr)
# NOTE: initial checks showed that original values differ using xarray
# read parquet
#pq_ver = pq.read_table(fn_pq)

Instead of *.dat files, there will be .parquet files in the ascii directory for the nc side to utilize with the flag. Temporary until things get fully moved about.
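One possible way to flesh out that verification step; the column-wise comparison and the decision to only report (not assert on) the netcdf time round-trip, given the note above about xarray values differing, are choices made here for illustration.

import numpy as np
import pyarrow.parquet as pq
import xarray as xr

# Round-trip the parquet table and compare against the in-memory dataframe.
pq_ver = pq.read_table(fn_pq).to_pandas()
assert (pq_ver.index == dbd_data.index).all()
for col in dbd_data.columns:
    np.testing.assert_allclose(pq_ver[col].values, dbd_data[col].values, equal_nan=True)

# Round-trip the netcdf file; xarray decodes "seconds since 1990-01-01" back
# to datetime64[ns], so just report how far the time axis drifts.
nc_ver = xr.open_dataset(fn_xr)
time_diff = nc_ver['time'].values - dbd_data.index.tz_localize(None).values
print("max netcdf time round-trip error:", abs(time_diff).max())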
As expected, picking up extra resolution by utilizing dbdreader and parquet for processing. Plots of the original m_depth and m_altitude versus the same sensors via dbdreader/parquet are omitted here.
Still some work to do...
There are several steps involved in migration to parquet for intermediate processing of slocum glider data:

- dbdreader reproduces similar results to slocum binaries (Feature: optionally include first record in data payload smerckel/dbdreader#18)
- convertDbds.sh with dbdreader/parquet
- parquet (just using tables for now)

Is there a particular storage pattern or design desired for the parquet data structures? See the sketch after this comment.
REF: https://arrow.apache.org/docs/python/parquet.html#parquet-file-writing-options
Have to nail down a potential dbdreader issue first.
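On the storage-pattern question, one option suggested by the pyarrow documentation referenced above is to be explicit about the file-writing options instead of relying on defaults; a small sketch, where the version, compression, and timestamp choices are only examples, not a decided design:

import pyarrow as pa
import pyarrow.parquet as pq

# Build an Arrow table from the merged dataframe (index preserved as a column).
table = pa.Table.from_pandas(dbd_data, preserve_index=True)

# Be explicit about writer options rather than relying on defaults.
pq.write_table(
    table,
    fn_pq,
    version="2.6",            # parquet format version with nanosecond timestamp support
    coerce_timestamps=None,   # keep datetime[ns, UTC] as-is
    compression="snappy",
)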