# Tutorial: Merging Data

One of xarray's killer features is the ability to merge many individual data files into a single Dataset.
This allows scientists to operate at a high mental level, focusing on their scientific questions rather than the details of how the dataset happened to be provided.
For example, it's common for geospatial data with dimensions *latitude, longitude* to be distributed with one file per day.
Xarray allows you to combine all the files into a single Dataset with dimensions *time, latitude, longitude*.
This eliminates the need to loop over the files and process them one by one, as was often done in the past.
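
As a minimal sketch of that idea, the snippet below builds three small in-memory Datasets that stand in for daily files and stacks them along a `time` dimension with `xr.concat`. The coarse grid, the variable name, and the dates are made up for illustration; no files are read.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Three hypothetical "daily files", represented as in-memory Datasets with
# dimensions (lat, lon) and a scalar time coordinate.
lat = np.arange(-90, 90, 45) + 22.5
lon = np.arange(-180, 180, 90) + 45.0
days = pd.date_range('2018-01-01', periods=3, freq='D')

daily = [
    xr.Dataset(
        {'temperature': (('lat', 'lon'), np.random.rand(lat.size, lon.size))},
        coords={'lat': lat, 'lon': lon, 'time': day},
    )
    for day in days
]

# Concatenating along 'time' turns the scalar coordinate into a dimension.
combined = xr.concat(daily, dim='time')
print(combined.temperature.dims)
print(dict(combined.sizes))
```

After this step, any operation on `combined` applies to all days at once, with no explicit loop over the pieces.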

Xarray tries to combine files automatically via its `open_mfdataset` function. 
This function examines the file metadata and tries to make reasonable choices about how the user wants the data to be combined.
But this doesn't always work right.
Xarray can't read your mind.
Furthermore, frequently data files have inconsistent or incorrect metadata, or fail to follow established conventions.
These are "dirty" data.
Dirty data can also cause `open_mfdataset` to fail.
This is not Xarray's fault.

Xarray power users have a range of tricks up their sleeve to overcome these situations.
This tutorial explains some strategies for merging dirty data.
Understanding these techniques requires a deeper understanding of how Xarray combines data in general.

**Note**: This tutorial makes frequent use of Python [list comprehensions](#) to iterate over collections concisely. New Python users should make sure they are familiar with this code pattern before moving forward.

## Clean Data

We start with an example of when `open_mfdataset` *does* work right.
In this example and all others, we will generate toy data, rather than using real data.
Our toy example has 1x1 degree lat/lon resolution and one value every 6 months for two years.

In [None]:
import xarray as xr
import numpy as np
import pandas as pd
import os
from matplotlib import pyplot as plt
%matplotlib inline

# dataset dimensions
ntime, nlat, nlon = 4, 180, 360
dims = ['time', 'lat', 'lon']

# a dataset with random values but realistic coordinates
ds = xr.Dataset({'temperature': (dims, np.random.rand(ntime, nlat, nlon)),
                 'pressure': (dims, np.random.rand(ntime, nlat, nlon))},
                coords={'time': ('time',
                                 pd.date_range('2018-01-01', freq='6MS', periods=ntime)),
                        'lat': ('lat', np.arange(-90, 90) + 0.5),
                        'lon': ('lon', np.arange(-180, 180) + 0.5)})
ds

We will now write this dataset into 4 distinct files, one per variable per year.
We do this by first splitting the dataset into 4 distinct Dataset objects.

In [None]:
times, dsets_temp = zip(*ds[['temperature']].groupby('time.year'))
times, dsets_pres = zip(*ds[['pressure']].groupby('time.year'))

Let's examine one of these individual datasets. Examining datasets is very important for understanding what's happening under the hood.

In [None]:
dsets_temp[0]

This dataset just has one 3D array in its data variables.
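
The `zip(*...)` idiom used for the split can look cryptic at first. The sketch below unpacks it on a tiny stand-in dataset (a 1D `temperature` variable invented for illustration): iterating over a groupby yields `(label, group)` pairs, and `zip(*pairs)` transposes those pairs into one tuple of labels and one tuple of groups.

```python
import numpy as np
import pandas as pd
import xarray as xr

# A small stand-in for the tutorial's dataset: four time steps, six months apart.
time = pd.date_range('2018-01-01', freq='6MS', periods=4)
toy = xr.Dataset({'temperature': ('time', np.arange(4.0))}, coords={'time': time})

# Iterating over a groupby yields (label, group) pairs.
pairs = list(toy.groupby('time.year'))

# zip(*pairs) transposes them into a tuple of labels and a tuple of groups.
years, groups = zip(*pairs)

print([int(y) for y in years])
print([g.sizes['time'] for g in groups])
```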

We now write to disk.

In [None]:
!rm -rf clean_data # just in case you are running this repeatedly

import os
os.mkdir('clean_data')

dsets = dsets_temp + dsets_pres
fnames = ([f'clean_data/temperature_{n:02d}.nc' for n in range(len(dsets_temp))] + 
          [f'clean_data/pressure_{n:02d}.nc' for n in range(len(dsets_pres))])
          
xr.save_mfdataset(dsets, fnames)
!ls clean_data

Let's examine just one of the files:

In [None]:
xr.open_dataset('clean_data/temperature_00.nc')

This is identical to the data we generated.
We can verify this with:

In [None]:
ds_loaded = xr.open_dataset('clean_data/temperature_00.nc')
ds_loaded.identical(dsets_temp[0])

### What does `open_mfdataset` do?

The aim of the following section is to help de-mystify the `open_mfdataset` function, which is powerful but often a source of user confusion.

Let's try to open all the files we just wrote in one go using `open_mfdataset`:

In [None]:
ds_mf = xr.open_mfdataset('clean_data/*.nc')
ds_mf

Everything just worked without any special options!
😁

We got back the same dataset we created back up at the top of this notebook, which we verify by:

In [None]:
ds_mf.identical(ds)

However, we got a nasty warning. 😟

This `FutureWarning` tells us that this way of combining data will be deprecated in the future, once Xarray 0.13 is released.
Long-time Xarray users should pay close attention to this warning.
Code that previously worked may stop working in the future.

Let's see what the future behavior will be.

In [None]:
xr.open_mfdataset('clean_data/*.nc', combine='by_coords')

Whew! 😌 It still works.

What's happening under the hood in `open_mfdataset`? It's calling the function `combine_by_coords`.
We can mimic this behavior by opening each file individually and then calling that function ourselves.

In [None]:
from glob import glob
all_files = glob('clean_data/*.nc')
all_dsets = [xr.open_dataset(fname) for fname in all_files]
xr.combine_by_coords(all_dsets)

### Controlling Dask Chunks

You may have noticed that, unlike `open_mfdataset`, the explicit `combine_by_coords` approach above did not produce Dask arrays. Instead, it operated eagerly, loading all the data into memory. This is not what we want with big data. `open_mfdataset` always automatically applies `.chunk()` to the datasets it combines. We can replicate this behavior with the following:

In [None]:
all_dsets_chunked = [xr.open_dataset(fname, chunks={}) for fname in all_files]
xr.combine_by_coords(all_dsets_chunked)

We could also supply a `chunks` keyword to `open_mfdataset` to control chunking more explicitly:

In [None]:
xr.open_mfdataset('clean_data/*.nc', combine='by_coords', chunks={'time': 1, 'lat': 90})

The same thing is possible with the manual approach:

In [None]:
all_dsets_chunked = [xr.open_dataset(fname, chunks={'time': 1, 'lat': 90})
                     for fname in all_files]
xr.combine_by_coords(all_dsets_chunked)

It is always better to apply chunking in this way, right when you open the individual files, rather than later, after you have already combined files.

### More Explicit Manual Combining

`combine_by_coords` itself does a few different things under the hood.
It uses both `concat` to combine the files along the time dimension and `merge` to combine the two different variables (`temperature` and `pressure`) into a single Dataset. We can do all of these things manually if we want:

In [None]:
temp_dsets = [xr.open_dataset(fname, chunks={}) for fname in glob('clean_data/temperature_*.nc')]
temp_concat = xr.concat(temp_dsets, dim='time')
pres_dsets = [xr.open_dataset(fname, chunks={}) for fname in glob('clean_data/pressure_*.nc')]
pres_concat = xr.concat(pres_dsets, dim='time')
xr.merge([temp_concat, pres_concat])
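
The same two-step logic can be checked entirely in memory, without touching the files. The sketch below uses a small made-up dataset with 1D variables, splits it into four pieces, and compares the manual `concat` plus `merge` result with a single `combine_by_coords` call. It is an illustration of the idea rather than a description of Xarray's internals.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range('2018-01-01', freq='6MS', periods=4)
full = xr.Dataset(
    {'temperature': ('time', np.random.rand(4)),
     'pressure': ('time', np.random.rand(4))},
    coords={'time': time},
)

# Split into four pieces: one per variable per half of the time axis.
halves = (slice(0, 2), slice(2, 4))
temp_parts = [full[['temperature']].isel(time=s) for s in halves]
pres_parts = [full[['pressure']].isel(time=s) for s in halves]

# Step 1: concat joins pieces of the same variable along 'time'.
temp_concat = xr.concat(temp_parts, dim='time')
pres_concat = xr.concat(pres_parts, dim='time')

# Step 2: merge brings the different variables together.
manual = xr.merge([temp_concat, pres_concat])

# combine_by_coords handles both steps in one call.
automatic = xr.combine_by_coords(temp_parts + pres_parts)

print(manual.equals(full), automatic.equals(full))
```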

Some important differences in this manual approach are:

1. We had to know in advance that different variables were stored in different files and write some repetitive code. Fortunately this was obvious from the file names, but this is not always the case for real datasets.
1. We had to manually specify the `dim` keyword of `concat` and know in advance that `'time'` was the dimension to concatenate over.
1. We had to specify the files in the correct order (more on this below).

Manual dataset combining is the most powerful and flexible approach, but, for new, unfamiliar datasets, it requires that you **manually inspect your files carefully!** This is an important general piece of advice, especially once dirty data comes along.
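
One way to make that inspection systematic is a small summary helper applied to every dataset before combining. The sketch below is only a suggestion: `summarize` is a hypothetical helper, not part of Xarray, and the two stand-in datasets are invented so the example runs without files.

```python
import numpy as np
import pandas as pd
import xarray as xr

def summarize(ds):
    """Return a small dict describing one dataset (hypothetical helper)."""
    info = {'variables': sorted(ds.data_vars), 'sizes': dict(ds.sizes)}
    if 'time' in ds.coords:
        # Record the time span so gaps or overlaps between files stand out.
        info['time_start'] = str(ds.time.values.min())
        info['time_end'] = str(ds.time.values.max())
    return info

# Stand-ins for two opened files.
time = pd.date_range('2018-01-01', freq='6MS', periods=4)
pieces = [
    xr.Dataset({'temperature': ('time', np.random.rand(2))},
               coords={'time': time[:2]}),
    xr.Dataset({'pressure': ('time', np.random.rand(2))},
               coords={'time': time[2:]}),
]

summaries = [summarize(piece) for piece in pieces]
for summary in summaries:
    print(summary)
```

With real data, the list comprehension would run over datasets returned by `xr.open_dataset`, and differences in variable names, sizes, or time spans become visible at a glance.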

### Order Matters for Concatenation!

In the example above, we were lucky that `glob('clean_data/temperature_*.nc')` gave us the files in correct chronological order. This is not guaranteed: `glob` makes no promise about the order of its results, and even sorted filenames can be out of chronological order if the files follow a weird naming convention. Let's see what happens if we call `concat` on files in the wrong order.

In [None]:
temp_fnames_wrong_order = ['clean_data/temperature_01.nc', 'clean_data/temperature_00.nc']
temp_dsets_wrong_order = [xr.open_dataset(fname, chunks={})
                          for fname in temp_fnames_wrong_order]
ds_wrong = xr.concat(temp_dsets_wrong_order, dim='time')
ds_wrong

As we can see, Xarray put the data together in the order we provided it.

In [None]:
plt.plot(ds.time.data, 'o-', label='original')
plt.plot(ds_wrong.time.data, '^-', label='wrong time order')
plt.legend()

The `combine_by_coords` function includes some special logic to try to order the datasets such that the values in their dimension coordinates are monotonic.

In [None]:
ds_combine_by_coords = xr.combine_by_coords(temp_dsets_wrong_order)

plt.plot(ds.time.data, 'o-', label='original')
plt.plot(ds_combine_by_coords.time.data, '^-',
         label='wrong order but combine_by_coords fixed me')
plt.legend()
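
If you prefer to call `concat` directly, one option is to sort the datasets yourself before concatenating. The sketch below sorts by each piece's first time value. It uses a small made-up dataset and assumes the pieces do not overlap in time; overlapping pieces would need more care.

```python
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range('2018-01-01', freq='6MS', periods=4)
full = xr.Dataset({'temperature': ('time', np.arange(4.0))}, coords={'time': time})

# Two pieces supplied in the wrong chronological order.
wrong_order = [full.isel(time=slice(2, 4)), full.isel(time=slice(0, 2))]
unsorted_result = xr.concat(wrong_order, dim='time')

# Sort by each piece's first time value, then concatenate.
# Assumption: the pieces cover non-overlapping time ranges.
in_order = sorted(wrong_order, key=lambda d: d.time.values[0])
fixed = xr.concat(in_order, dim='time')

print(unsorted_result.indexes['time'].is_monotonic_increasing)
print(fixed.indexes['time'].is_monotonic_increasing)
```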

### Explicitly Enumerate Files

The examples above assumed that you wanted to do some wildcard matching (e.g. `*.nc`) to combine files.
This is good for exploratory data analysis where you don't know exactly what you're looking for.
But for more mature code, or code used in production data processing systems, explicit is better than implicit.
If you know the naming conventions that were used to generate your files, you should use this information to explicitly specify the filenames you want to open. This also has performance implications: `glob` can be very slow on very large directories over some network filesystems.
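
A side benefit of explicit enumeration is that you can check for missing files before opening anything. The sketch below builds the expected names from the `{variable}_{suffix}.nc` convention used in this tutorial and reports which ones are absent. `expected_files` is a hypothetical helper written for this illustration.

```python
import os

def expected_files(data_dir, varnames, time_suffixes):
    """Build file names from a '{variable}_{suffix}.nc' naming convention."""
    return [os.path.join(data_dir, f'{vname}_{suffix}.nc')
            for vname in varnames
            for suffix in time_suffixes]

fnames = expected_files('clean_data', ['temperature', 'pressure'], ['00', '01'])

# Collect names that do not exist on disk, so failures surface early.
missing = [fname for fname in fnames if not os.path.exists(fname)]
print(f'{len(fnames)} files expected, {len(missing)} missing')
```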

Here is an example of a fully explicit manual combine:

In [None]:
data_dir = './clean_data'
varnames = ['temperature', 'pressure']
time_suffixes = ['00', '01']
concat_dim = 'time'

variable_dsets = []
for vname in varnames:
    fnames = [os.path.join(data_dir, f'{vname}_{time_suffix}.nc')
              for time_suffix in time_suffixes]
    dsets = [xr.open_dataset(fname, chunks={}) for fname in fnames]
    ds_concat = xr.concat(dsets, dim=concat_dim)
    variable_dsets.append(ds_concat)
ds_manually_combined = xr.merge(variable_dsets)
ds_manually_combined


Let's verify that this is also the same as the original dataset we started with.

In [None]:
ds_manually_combined.identical(ds)