# Subset

A subset is a set of points (GPIs) on a pygeogrids Grid; every point in a subset is part of the overarching grid. Each subset point has a value attached that represents it, e.g. when the subset is written to file or visualised. By default the value is 1, so the stored subset values form a binary mask. A subset must have a "name" assigned. It can optionally have a longer description ("meaning") and any additional metadata as a dictionary ("attributes"). A subset also has a shape, which depends on the number and format of the subset points passed upon initialisation.

## Creating subsets
To create a subset, simply pass a 1d or 2d array (or list) of points and a name. Optionally a meaning, values and any additional attributes can be given.

In [19]:
# Example: Define 2 arbitrary subsets
import numpy as np
from pygeogrids.subset import Subset

# for the first subset pass gpis as a 1d array; all points are assigned the default value 1.
subset1 = Subset(name='Subset1', gpis=np.array([10,11,12,13]))

# for the second subset pass gpis as a 2d array, not all points are assigned the same value.
subset2 = Subset(name='Subset2', meaning='Advanced case', gpis=np.array([[11,12],[19,20]]), values=[1,1,2,2],
                 attrs={'AnyAttributeName': 'AttributeValue'})

Notice the different shapes and values. The shape depends on whether a 1d or 2d gpis array was passed when creating the subset.

In [20]:
print(f'{subset1.name} has values {subset1.values} and shape {subset1.shape}')
print(f'{subset2.name} has values {subset2.values} and shape {subset2.shape}')

Subset1 has values [1 1 1 1] and shape (4,)
Subset2 has values [1 1 2 2] and shape (2, 2)


Subset objects can be returned as (reduced) dictionaries.

A dictionary with all available properties is returned by the `Subset.as_dict(format='all')` function. The format keyword defines which attributes are returned.

In [21]:
from pprint import pprint
print("When returning `all` properties: \n")
pprint(subset2.as_dict('all'))

When returning `all` properties: 

{'Subset2': {'attrs': {'AnyAttributeName': 'AttributeValue', 'shape': (2, 2)},
             'gpis': array([[11, 12],
       [19, 20]]),
             'meaning': 'Advanced case',
             'values': array([1, 1, 2, 2])}}


In [22]:
print("When returning `gpis` only: \n")
print(subset2.as_dict('gpis'))

When returning `gpis` only: 

{'Subset2': array([11, 12, 19, 20])}


## Filter subset

Finally, we can filter a subset by its values by passing a whitelist of allowed values. A new Subset is created from the points that pass the filter; any arguments other than the values to filter for are passed on as kwargs to create the new Subset. Here we select the points of subset2 that have the value 1 assigned.

In [23]:
filtered_subset2 = subset2.select_by_val(vals=[1], meaning='Filtered version of subset2')

In [24]:
print("The filtered subset looks like this: \n")
pprint(filtered_subset2.as_dict())

The filtered subset looks like this: 

{'filtered_Subset2': {'attrs': {'shape': (2,)},
                      'gpis': array([11, 12]),
                      'meaning': 'Filtered version of subset2',
                      'values': array([1, 1])}}
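
The whitelist filtering described above can be illustrated without pygeogrids. The following is only a plain-Python sketch of the idea, not the library's implementation; the helper `filter_by_values` is a hypothetical name introduced here, and the GPIs and values are those of subset2 in flattened form.

```python
# Plain-Python sketch of whitelist filtering (NOT the pygeogrids code).
# `filter_by_values` is a hypothetical helper used for illustration only.

gpis = [11, 12, 19, 20]   # flattened GPIs of subset2
values = [1, 1, 2, 2]     # one value per GPI


def filter_by_values(gpis, values, allowed):
    """Keep only the (gpi, value) pairs whose value is in `allowed`."""
    allowed = set(allowed)
    kept = [(g, v) for g, v in zip(gpis, values) if v in allowed]
    kept_gpis = [g for g, _ in kept]
    kept_values = [v for _, v in kept]
    return kept_gpis, kept_values


filtered_gpis, filtered_values = filter_by_values(gpis, values, allowed=[1])
print(filtered_gpis, filtered_values)
```

Only the points 11 and 12 carry the value 1, which agrees with the filtered subset printed above.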


## Combining 2 subsets
Basic operations to combine two subsets are provided. Combining subsets is done based on their GPIs. The result of combining two subsets is a new subset containing the respective points. Only the GPIs are taken from the input subsets. All other attributes must be defined by the user (as kwargs for this function). A name is generated from the names of the input subsets if no new name is specified.

The following methods to combine subsets are currently implemented:


1) `Subset.intersect(other[Subset], **new_subset_kwargs)`:  

Creates a new subset from points that are in **this subset AND in the other subset**. In this example we also pass a new 'meaning' of the intersection. The new name is created from the names of the input subsets.

In [25]:
inter = subset2.intersect(subset1, meaning='intersection of subset1 and subset2')
pprint(inter.as_dict())

{'Subset2_inter_Subset1': {'attrs': {'shape': (2,)},
                           'gpis': array([11, 12]),
                           'meaning': 'intersection of subset1 and subset2',
                           'values': array([1, 1])}}


2) `Subset.union(other[Subset], **new_subset_kwargs)`:  
    
Creates a new subset from points that are in **this subset OR in the other subset**. In this example we also define a new value for all points of the union, so that the default value 1 is not used. The new name is created from the names of the input subsets.

In [26]:
union = subset2.union(subset1, values=3)
pprint(union.as_dict())

{'Subset2_union_Subset1': {'attrs': {'shape': (6,)},
                           'gpis': array([10, 11, 12, 13, 19, 20]),
                           'meaning': '',
                           'values': array([3, 3, 3, 3, 3, 3])}}


3) `Subset.diff(other[Subset], **new_subset_kwargs)`:
    
Creates a new subset from points that are in **this subset, but NOT in the other subset**. In this example we also define a name to use for the new subset and some attributes.

In [27]:
diff = subset2.diff(subset1, name='subset_diff', attrs={'method': 'subset2 minus subset1'})
pprint(diff.as_dict())

{'subset_diff': {'attrs': {'method': 'subset2 minus subset1', 'shape': (2,)},
                 'gpis': array([19, 20]),
                 'meaning': '',
                 'values': array([1, 1])}}
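
Since all three operations act on the GPIs only, they correspond to ordinary set operations. The sketch below uses Python's built-in `set` on the flattened GPIs of subset1 and subset2; the function names are hypothetical and the code only mirrors the semantics described above, it is not taken from pygeogrids.

```python
# Set-based sketch of the three combination methods (illustration only,
# not the pygeogrids implementation). Function names are hypothetical.

gpis_subset1 = [10, 11, 12, 13]
gpis_subset2 = [11, 12, 19, 20]   # 2d GPIs of subset2, flattened


def intersect_gpis(this, other):
    """Points that are in `this` AND in `other`."""
    return sorted(set(this) & set(other))


def union_gpis(this, other):
    """Points that are in `this` OR in `other`."""
    return sorted(set(this) | set(other))


def diff_gpis(this, other):
    """Points that are in `this` but NOT in `other`."""
    return sorted(set(this) - set(other))


inter_gpis = intersect_gpis(gpis_subset2, gpis_subset1)
all_gpis = union_gpis(gpis_subset2, gpis_subset1)
only_in_subset2 = diff_gpis(gpis_subset2, gpis_subset1)
print(inter_gpis, all_gpis, only_in_subset2)
```

Note that `diff` is not symmetric: swapping the two inputs would return the points 10 and 13 instead of 19 and 20.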


# SubsetCollection

A SubsetCollection holds multiple subsets and provides functions to add, drop and combine/merge several of them at once. A collection can be written to and read from a netcdf (definition) file.

## Create, load SubsetCollection
There are several ways to create a subset collection.
* By passing a list of subsets
* By passing a dictionary of subset attributes (as returned by `Subset.as_dict()`)
* By loading a netcdf file

The following command creates a new Collection from 2 already existing subsets.

In [28]:
from pygeogrids.subset import SubsetCollection

# create a subset collection from a list of subsets:
collection = SubsetCollection(subsets=[subset1, subset2])

Afterwards, similar to the Subset, the Collection can be returned as a dictionary. We will use this command in the examples to give a quick overview of the current state of the collection.

In [29]:
subset_collection_dict = collection.as_dict()
pprint(subset_collection_dict)

{'Subset1': {'attrs': {'shape': (4,)},
             'gpis': array([10, 11, 12, 13]),
             'meaning': '',
             'values': array([1, 1, 1, 1])},
 'Subset2': {'attrs': {'AnyAttributeName': 'AttributeValue', 'shape': (2, 2)},
             'gpis': array([[11, 12],
       [19, 20]]),
             'meaning': 'Advanced case',
             'values': array([1, 1, 2, 2])}}


A dictionary like this could also be used to create a collection directly, without the need to create single Subsets first.

In [30]:
# a dictionary like this, which contains dictionaries of subset attributes (see `Subset.as_dict()`) 
# is another easy way to create a SubsetCollection directly
collection_from_dict = SubsetCollection.from_dict(subset_collection_dict)

The result is the same, which can be checked with `==`.

In [31]:
collection == collection_from_dict

True

Additional subsets can be added to an existing Collection.

In [32]:
collection.add(Subset('added_subset', gpis=np.array([5, 12, 19, 26]), values=5.))
pprint(collection.as_dict())

{'Subset1': {'attrs': {'shape': (4,)},
             'gpis': array([10, 11, 12, 13]),
             'meaning': '',
             'values': array([1, 1, 1, 1])},
 'Subset2': {'attrs': {'AnyAttributeName': 'AttributeValue', 'shape': (2, 2)},
             'gpis': array([[11, 12],
       [19, 20]]),
             'meaning': 'Advanced case',
             'values': array([1, 1, 2, 2])},
 'added_subset': {'attrs': {'shape': (4,)},
                  'gpis': array([ 5, 12, 19, 26]),
                  'meaning': '',
                  'values': array([5, 5, 5, 5])}}


## Combine and Merge multiple subsets in a collection
Combining is not the same as merging subsets! 

### combine()
`combine()` takes one of the methods implemented for `Subset` (e.g. `intersect`, `union`, `diff`) and applies it to the subsets one by one, in the passed order, until a single new Subset is created. `combine` only uses the GPIs of the input subsets and assigns new `values` to the resulting Subset. Therefore, if no values are specified, all combined GPIs will have the default value 1 afterwards. A `new_name` for the new subset must be passed to `combine`.

In [33]:
collection.combine(['Subset1', 'added_subset'], method='union', new_name='combined_union')
pprint(collection.as_dict())

{'Subset1': {'attrs': {'shape': (4,)},
             'gpis': array([10, 11, 12, 13]),
             'meaning': '',
             'values': array([1, 1, 1, 1])},
 'Subset2': {'attrs': {'AnyAttributeName': 'AttributeValue', 'shape': (2, 2)},
             'gpis': array([[11, 12],
       [19, 20]]),
             'meaning': 'Advanced case',
             'values': array([1, 1, 2, 2])},
 'added_subset': {'attrs': {'shape': (4,)},
                  'gpis': array([ 5, 12, 19, 26]),
                  'meaning': '',
                  'values': array([5, 5, 5, 5])},
 'combined_union': {'attrs': {'shape': (7,)},
                    'gpis': array([ 5, 10, 11, 12, 13, 19, 26]),
                    'meaning': '',
                    'values': array([1, 1, 1, 1, 1, 1, 1])}}
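
Applying one method "one by one in the passed order" is a left fold over the GPI sets. The following sketch, assuming plain lists of GPIs and a hypothetical helper `combine_gpis`, shows this with `functools.reduce`; it is an illustration of the idea rather than the way `combine()` is actually implemented.

```python
# Sketch of combining several GPI lists by folding a set operation over
# them (illustration only; `combine_gpis` is a hypothetical helper).
from functools import reduce

SET_OPERATIONS = {
    'intersect': set.intersection,
    'union': set.union,
    'diff': set.difference,
}


def combine_gpis(gpi_lists, method):
    """Apply `method` pairwise, left to right, to all passed GPI lists."""
    operation = SET_OPERATIONS[method]
    combined = reduce(operation, (set(gpis) for gpis in gpi_lists))
    return sorted(combined)


gpis_subset1 = [10, 11, 12, 13]
gpis_added = [5, 12, 19, 26]

combined_gpis = combine_gpis([gpis_subset1, gpis_added], method='union')
# The input values are discarded; every combined point gets the default 1.
combined_values = [1] * len(combined_gpis)
print(combined_gpis, combined_values)
```

The order of the inputs matters for `diff` but not for `union` or `intersect`.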


### merge()
`merge()` is similar to `combine(method='union')`, as the new subset will contain all points that were in any of the merged subsets.
Unlike `combine()`, however, it keeps the values of the merged subsets (with higher priority for subsets that were merged last). By default the input subsets are not kept after merging; they are kept only if `keep=True` is passed. Note in the following example that 'merged_subsets' contains values from both merged subsets. Point 12 (which was in both subsets) was assigned the value from the subset that appeared later in the passed list.

In [34]:
collection.merge(['Subset1', 'added_subset'], new_name='merged_subsets')
pprint(collection.as_dict())

{'Subset2': {'attrs': {'AnyAttributeName': 'AttributeValue', 'shape': (2, 2)},
             'gpis': array([[11, 12],
       [19, 20]]),
             'meaning': 'Advanced case',
             'values': array([1, 1, 2, 2])},
 'combined_union': {'attrs': {'shape': (7,)},
                    'gpis': array([ 5, 10, 11, 12, 13, 19, 26]),
                    'meaning': '',
                    'values': array([1, 1, 1, 1, 1, 1, 1])},
 'merged_subsets': {'attrs': {'shape': (7,)},
                    'gpis': array([ 5, 10, 11, 12, 13, 19, 26]),
                    'meaning': 'Merged subsets Subset1, added_subset',
                    'values': array([5, 1, 1, 5, 1, 5, 5])}}
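
The "last subset wins" rule can be sketched with a dictionary that maps each GPI to its value and is updated in the passed order. This is a minimal illustration under the assumption that subsets are given as plain (gpis, values) pairs; `merge_subsets` is a hypothetical helper, not the pygeogrids implementation.

```python
# Sketch of merging with priority for later subsets (illustration only;
# `merge_subsets` is a hypothetical helper).

def merge_subsets(subsets):
    """Merge (gpis, values) pairs; later pairs overwrite earlier ones."""
    merged = {}
    for gpis, values in subsets:
        # Points already present are overwritten by the later subset.
        merged.update(zip(gpis, values))
    merged_gpis = sorted(merged)
    merged_values = [merged[gpi] for gpi in merged_gpis]
    return merged_gpis, merged_values


subset1 = ([10, 11, 12, 13], [1, 1, 1, 1])
added_subset = ([5, 12, 19, 26], [5, 5, 5, 5])

merged_gpis, merged_values = merge_subsets([subset1, added_subset])
print(merged_gpis, merged_values)
```

Point 12 ends up with the value 5 because `added_subset` comes last; reversing the list would give it the value 1 instead.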


## Write/Load to/from netcdf file
Finally, we write the subset collection to a netcdf file. This is not the same as storing a MetaGrid: only the subsets are stored, without context.

In [35]:
import tempfile
import os

filename = os.path.join(tempfile.mkdtemp(), 'ssc.nc')

collection.to_file(filename)
pprint(collection.as_dict())

{'Subset2': {'attrs': {'AnyAttributeName': 'AttributeValue',
                       'meaning': 'Advanced case',
                       'shape': (2, 2)},
             'gpis': array([[11, 12],
       [19, 20]]),
             'meaning': 'Advanced case',
             'values': array([1, 1, 2, 2])},
 'combined_union': {'attrs': {'meaning': '', 'shape': (7,)},
                    'gpis': array([ 5, 10, 11, 12, 13, 19, 26]),
                    'meaning': '',
                    'values': array([1, 1, 1, 1, 1, 1, 1])},
 'merged_subsets': {'attrs': {'meaning': 'Merged subsets Subset1, added_subset',
                              'shape': (7,)},
                    'gpis': array([ 5, 10, 11, 12, 13, 19, 26]),
                    'meaning': 'Merged subsets Subset1, added_subset',
                    'values': array([5, 1, 1, 5, 1, 5, 5])}}


And load the same collection again from file:

In [36]:
loaded_collection = SubsetCollection.from_file(filename)
pprint(loaded_collection.as_dict())
collection == loaded_collection

{'Subset2': {'attrs': OrderedDict([('AnyAttributeName', 'AttributeValue'),
                                   ('shape', (2, 2))]),
             'gpis': array([[11, 12],
       [19, 20]]),
             'meaning': 'Advanced case',
             'values': array([1, 1, 2, 2])},
 'combined_union': {'attrs': OrderedDict([('shape', (7,))]),
                    'gpis': array([ 5, 10, 11, 12, 13, 19, 26]),
                    'meaning': '',
                    'values': array([1, 1, 1, 1, 1, 1, 1])},
 'merged_subsets': {'attrs': OrderedDict([('shape', (7,))]),
                    'gpis': array([ 5, 10, 11, 12, 13, 19, 26]),
                    'meaning': 'Merged subsets Subset1, added_subset',
                    'values': array([5, 1, 1, 5, 1, 5, 5])}}


True