# SEG-Y to Vector DataFrames and Back

The integration of `segysak` with `xarray` greatly simplifies vectorising 3D SEG-Y data and returning it to SEG-Y. The key is the close relationship between `pandas` and `xarray`: a `Dataset` can be converted to a `DataFrame` and back again.

## Loading Data

We start by loading the data as usual, opening the SEG-Y file with `xr.open_dataset`. For this example we will use the Volve example sub-cube.

In [1]:
# Disable progress bars for small examples
from segysak.progress import Progress
Progress.set_defaults(disable=True)

In [2]:
import pathlib
import xarray as xr
from IPython.display import display

volve_3d_path = pathlib.Path("data/volve10r12-full-twt-sub3d.sgy")
print("3D", volve_3d_path.exists())

volve_3d = xr.open_dataset(volve_3d_path, dim_byte_fields={'iline': 5, 'xline': 21}, extra_byte_fields={'cdp_x': 73, 'cdp_y': 77})

3D True


## Vectorisation

Once the data is loaded, it can be converted to a `pandas.DataFrame` directly from the loaded `Dataset`. The DataFrame has a multi-index (`iline`, `xline`, `samples`) and contains a column for each variable in the originally loaded dataset. This includes the seismic amplitude as `data` and the `cdp_x` and `cdp_y` locations. If you require smaller volumes from the input data, you can use xarray selection methods prior to conversion to a DataFrame.

In [3]:
volve_3d_df = volve_3d.to_dataframe()
display(volve_3d_df)

| iline | xline | samples | cdp_x | cdp_y | data |
| --- | --- | --- | --- | --- | --- |
| 10090 | 2150 | 4.0 | 43640052 | 647744704 | 0.020575 |
| 10090 | 2150 | 8.0 | 43640052 | 647744704 | 0.022041 |
| 10090 | 2150 | 12.0 | 43640052 | 647744704 | 0.019659 |
| 10090 | 2150 | 16.0 | 43640052 | 647744704 | 0.025421 |
| 10090 | 2150 | 20.0 | 43640052 | 647744704 | 0.025436 |
| ... | ... | ... | ... | ... | ... |
| 10150 | 2351 | 3384.0 | 43414413 | 647878266 | 0.000000 |
| 10150 | 2351 | 3388.0 | 43414413 | 647878266 | 0.000000 |
| 10150 | 2351 | 3392.0 | 43414413 | 647878266 | 0.000000 |
| 10150 | 2351 | 3396.0 | 43414413 | 647878266 | 0.000000 |
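
To illustrate subsetting before conversion, the following is a minimal sketch on a small synthetic cube rather than the Volve data. The `cube` dataset and its coordinate values are hypothetical stand-ins; only `Dataset.sel` and `Dataset.to_dataframe` are standard xarray calls.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for a loaded cube: 3 inlines x 4 crosslines x 5 samples.
rng = np.random.default_rng(0)
cube = xr.Dataset(
    {"data": (("iline", "xline", "samples"), rng.normal(size=(3, 4, 5)))},
    coords={
        "iline": [10090, 10091, 10092],
        "xline": [2150, 2151, 2152, 2153],
        "samples": [4.0, 8.0, 12.0, 16.0, 20.0],
    },
)

# Label-based selection: one inline (kept as a length-1 dimension),
# a crossline range and a vertical window. Slices are inclusive at both ends.
subset = cube.sel(
    iline=[10091],
    xline=slice(2151, 2152),
    samples=slice(8.0, 16.0),
)

# Only the selected samples become rows of the DataFrame.
subset_df = subset.to_dataframe()
print(subset_df.shape)
```

Selecting first keeps the DataFrame small, which matters because the flattened table holds one row per sample.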


We can remove the multi-index by resetting the index of the DataFrame. Vectorized workflows such as machine learning can then be easily applied to the DataFrame.

In [4]:
volve_3d_df_reindex = volve_3d_df.reset_index()
display(volve_3d_df_reindex)

|  | iline | xline | samples | cdp_x | cdp_y | data |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 10090 | 2150 | 4.0 | 43640052 | 647744704 | 0.020575 |
| 1 | 10090 | 2150 | 8.0 | 43640052 | 647744704 | 0.022041 |
| 2 | 10090 | 2150 | 12.0 | 43640052 | 647744704 | 0.019659 |
| 3 | 10090 | 2150 | 16.0 | 43640052 | 647744704 | 0.025421 |
| 4 | 10090 | 2150 | 20.0 | 43640052 | 647744704 | 0.025436 |
| ... | ... | ... | ... | ... | ... | ... |
| 10473695 | 10150 | 2351 | 3384.0 | 43414413 | 647878266 | 0.000000 |
| 10473696 | 10150 | 2351 | 3388.0 | 43414413 | 647878266 | 0.000000 |
| 10473697 | 10150 | 2351 | 3392.0 | 43414413 | 647878266 | 0.000000 |
| 10473698 | 10150 | 2351 | 3396.0 | 43414413 | 647878266 | 0.000000 |
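
As a rough illustration of what a vectorised workflow on the flat table can look like, the sketch below computes a per-trace RMS amplitude with plain `pandas`. The tiny `flat` table and the `abs_amp`, `sq` and `rms` names are hypothetical; only the column layout follows the DataFrame above.

```python
import pandas as pd

# Hypothetical flat table with the same layout as the reset-index DataFrame:
# two traces of three samples each.
flat = pd.DataFrame(
    {
        "iline": [1, 1, 1, 1, 1, 1],
        "xline": [10, 10, 10, 11, 11, 11],
        "samples": [4.0, 8.0, 12.0, 4.0, 8.0, 12.0],
        "data": [3.0, -4.0, 0.0, 1.0, 1.0, 1.0],
    }
)

# Column-wise arithmetic acts on every sample at once.
flat["abs_amp"] = flat["data"].abs()

# Group rows by trace position to derive one RMS amplitude per trace.
trace_rms = (
    flat.assign(sq=flat["data"] ** 2)
    .groupby(["iline", "xline"])["sq"]
    .mean()
    .pow(0.5)
    .rename("rms")
)
print(trace_rms)
```

The same flat layout can be passed to other tabular tools, since every row is an independent observation with its position stored in ordinary columns.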


## Return to Xarray

It is possible to convert the DataFrame back to a Dataset for output to SEG-Y. To do this, the multi-index must first be restored with `set_index`. `pandas` then provides the `to_xarray` method to perform the conversion.

In [5]:
volve_3d_df_multi = volve_3d_df_reindex.set_index(["iline", "xline", "samples"])
display(volve_3d_df_multi)
volve_3d_ds = volve_3d_df_multi.to_xarray()
display(volve_3d_ds)

| iline | xline | samples | cdp_x | cdp_y | data |
| --- | --- | --- | --- | --- | --- |
| 10090 | 2150 | 4.0 | 43640052 | 647744704 | 0.020575 |
| 10090 | 2150 | 8.0 | 43640052 | 647744704 | 0.022041 |
| 10090 | 2150 | 12.0 | 43640052 | 647744704 | 0.019659 |
| 10090 | 2150 | 16.0 | 43640052 | 647744704 | 0.025421 |
| 10090 | 2150 | 20.0 | 43640052 | 647744704 | 0.025436 |
| ... | ... | ... | ... | ... | ... |
| 10150 | 2351 | 3384.0 | 43414413 | 647878266 | 0.000000 |
| 10150 | 2351 | 3388.0 | 43414413 | 647878266 | 0.000000 |
| 10150 | 2351 | 3392.0 | 43414413 | 647878266 | 0.000000 |
| 10150 | 2351 | 3396.0 | 43414413 | 647878266 | 0.000000 |
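
The full round trip can be checked on a small synthetic cube. This is a minimal sketch under the assumption that the DataFrame still contains every (`iline`, `xline`, `samples`) combination; the `ds` dataset and the doubling of `data` are purely illustrative.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 3 x 4 cube standing in for the loaded seismic volume.
ds = xr.Dataset(
    {
        "data": (
            ("iline", "xline", "samples"),
            np.arange(24, dtype="float64").reshape(2, 3, 4),
        )
    },
    coords={
        "iline": [1, 2],
        "xline": [10, 11, 12],
        "samples": [0.0, 4.0, 8.0, 12.0],
    },
)

# Dataset -> multi-index DataFrame -> flat DataFrame.
flat = ds.to_dataframe().reset_index()

# Any column-wise processing can happen here; doubling is a placeholder.
flat["data"] = flat["data"] * 2.0

# Restore the multi-index, then convert back to a Dataset.
indexed = flat.set_index(["iline", "xline", "samples"])
round_trip = indexed.to_xarray()

print(round_trip["data"].dims)
print(bool((round_trip["data"] == ds["data"] * 2.0).all()))
```

The order of the levels passed to `set_index` determines the dimension order of the resulting Dataset, so it is worth keeping it consistent with the original cube.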


The resulting dataset requires some changes to make it compatible with SEG-Y export again.
First, the attributes need to be set. The simplest way is to copy these from the original SEG-Y input. Otherwise they can be set manually. `segysak` specifically needs the `sample_rate` and the `coord_scalar` attributes.

In [6]:
volve_3d_ds.attrs = volve_3d.attrs
display(volve_3d_ds.attrs)

{'seisnc': '{"coord_scalar": -100.0, "coord_scaled": false}'}
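
If the original attributes are not available, one possible way to set them manually is sketched below. It assumes the `seisnc` key and its JSON payload simply mirror the attributes printed above; whether `segysak` expects additional keys, or offers a dedicated helper for this, is not confirmed here.

```python
import json

import xarray as xr

# Empty placeholder dataset; in practice this would be the round-tripped cube.
ds = xr.Dataset()

# Assumption: these keys and values are copied from the attrs printed above.
seisnc_attrs = {"coord_scalar": -100.0, "coord_scaled": False}

# The attribute is stored as a JSON string under the "seisnc" key.
ds.attrs["seisnc"] = json.dumps(seisnc_attrs)
print(ds.attrs)
```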

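After the round trip, `cdp_x` and `cdp_y` are 3D variables that repeat the same position for every sample of a trace. They must be reduced to 2D by collapsing the vertical `samples` dimension and then set as coordinates.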

In [7]:
volve_3d_ds["cdp_x"]

In [8]:
volve_3d_ds["cdp_x"] = volve_3d_ds["cdp_x"].mean(dim=["samples"])
volve_3d_ds["cdp_y"] = volve_3d_ds["cdp_y"].mean(dim=["samples"])
volve_3d_ds = volve_3d_ds.set_coords(["cdp_x", "cdp_y"])
volve_3d_ds
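
The effect of this reduction is easier to see on a small synthetic example. The sketch below builds a hypothetical dataset in which `cdp_x` is repeated along `samples`, as it is after `to_xarray`, and then collapses it; the array sizes and values are illustrative only.

```python
import numpy as np
import xarray as xr

n_il, n_xl, n_s = 2, 3, 4

# Hypothetical 2D map of X positions, one value per trace.
x_2d = np.arange(n_il * n_xl, dtype="float64").reshape(n_il, n_xl)

ds = xr.Dataset(
    {
        "data": (("iline", "xline", "samples"), np.zeros((n_il, n_xl, n_s))),
        # After DataFrame.to_xarray() the position is repeated for every sample.
        "cdp_x": (
            ("iline", "xline", "samples"),
            np.repeat(x_2d[:, :, None], n_s, axis=2),
        ),
    },
    coords={
        "iline": [1, 2],
        "xline": [10, 11, 12],
        "samples": [0.0, 4.0, 8.0, 12.0],
    },
)

# Collapse the vertical dimension; all values along it are identical.
ds["cdp_x"] = ds["cdp_x"].mean(dim="samples")

# Promote the 2D variable from a data variable to a coordinate.
ds = ds.set_coords("cdp_x")
print(ds["cdp_x"].dims)
```

Because the values along `samples` are identical, taking the first sample with `isel(samples=0, drop=True)` would give the same positions and, unlike `mean`, would not convert integer header values to floating point.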

Afterwards, use the `to_segy` method as normal to write the data back to SEG-Y.

In [9]:
volve_3d_ds.seisio.to_segy("data/test.segy", iline=189, xline=193, trace_header_map={'cdp_x':181, 'cdp_y':185})

## Very large datasets

If you have a very large dataset (SEG-Y file), it may be possible to use `ds.to_dask_dataframe()`, which can perform operations, including writing data, lazily.