TEMP: source files for cereal and linalg wikis #3420

Closed
wants to merge 4 commits into from
208 changes: 208 additions & 0 deletions doc/wikis/cereal.md
@@ -0,0 +1,208 @@
## SHOGUN `Cereal` serialization framework

#### Table of Contents

- [Motivation](#motivation)
- [For SHOGUN developers](#for-shogun-developers)
- [Examples](#examples)
- [For serialization framework developers](#for-serialization-framework-developers)
- [Serialization interface](#serialization-interface)
- [Serialization methods in `SGObject`](#serialization-methods-in-sgobject)
- [Serialization methods in `Any`](#serialization-methods-in-any)
- [Serialization methods in `SGVector`, `SGMatrix` and `SGReferencedData`](#serialization-methods-in-sgvector-sgmatrix-and-sgreferenceddata)

Similar points for all the above links in the table of contents.


### Motivation

[`Cereal`](http://uscilab.github.io/cereal/) is a header-only C++11 serialization library that is fast, light-weight, and easy to extend.

The `Cereal` serialization framework in SHOGUN builds on the new tag parameter framework, which makes archiving `SGObject` class data easy and readable.

### For SHOGUN developers

- The `Cereal` serialization library is required to compile SHOGUN. If `Cereal` is not found, SHOGUN automatically downloads the library to `third_party/`.

- SHOGUN now supports serializing data into three archive formats: binary, XML, and JSON. The three pairs of save/load methods are:

```cpp
save_binary(filename);
load_binary(filename);

save_json(filename);
load_json(filename);

save_xml(filename);
load_xml(filename);
```

- All parameters registered in the tag parameter list of an `SGObject` can be saved and loaded with:

```cpp
SGObject obj_save;
obj_save.save_json(filename);

SGObject obj_load;
obj_load.load_json(filename);
```

- Customized archives can be added as shown [here](http://uscilab.github.io/cereal/serialization_archives.html).

#### Examples

The `CCerealObject` class defined in [`CerealObject.h`](https://github.com/shogun-toolbox/shogun/blob/feature/cereal/tests/unit/io/CerealObject.h)
is derived from `CSGObject` and used for the `Cereal` serialization unit tests.
We also use `CCerealObject` here to show how to serialize an `SGObject` in SHOGUN.

In the `CCerealObject` class, we initialize a member `SGVector<float64_t> m_vector` and register it to the parameter list in the constructors:

```cpp
#include <shogun/base/SGObject.h>
#include <shogun/lib/SGVector.h>

using namespace shogun;

class CCerealObject : public CSGObject
{
public:
    // Construct CCerealObject from input SGVector
    CCerealObject(SGVector<float64_t> vec) : CSGObject()
    {
        m_vector = vec;
        init_params();
    }

    // Default constructor
    CCerealObject() : CSGObject()
    {
        m_vector = SGVector<float64_t>(5);
        m_vector.set_const(0);
        init_params();
    }

    const char* get_name() const { return "CerealObject"; }

protected:
    // Register m_vector to parameter list with name(tag) "test_vector"
    void init_params()
    {
        register_member("test_vector", m_vector);
    }

    SGVector<float64_t> m_vector;
};
```

`m_vector` will be archived when we call the serialization methods on a `CCerealObject` instance.

```cpp
#include "CerealObject.h"
#include <shogun/lib/SGVector.h>

using namespace shogun;

// Create a CCerealObject instance with assigned SGVector values.
// The vector needs a length before it can be filled; 5 elements
// starting at 0 match the JSON output shown below.
SGVector<float64_t> vec(5);
vec.range_fill(0);
CCerealObject obj_save(vec);

// Serialization
obj_save.save_json("serialization_test_json.cereal");

// Create another CCerealObject instance for data loading.
// Note: `CCerealObject obj_load();` would declare a function instead.
CCerealObject obj_load;
obj_load.load_json("serialization_test_json.cereal");

// We can extract the loaded parameter:
SGVector<float64_t> vec_load;
vec_load = obj_load.get<SGVector<float64_t>>("test_vector");
```

The JSON file `serialization_test_json.cereal` will be:
```
{
    "CerealObject": {                    // Class name
        "test_vector": {                 // The tag of the parameter to be saved
            "value0": 2,                 // Container type for internal use
            "value1": 12,                // Primitive type for internal use
            "value2": {                  // Data to archive
                "ReferencedData": {      // Reference Data
                    "ref_counting": true,
                    "refcount number": 3
                },
                "length": 5,             // Length of the vector
                "value1": 0,             // values of the vector
                "value2": 1,
                "value3": 2,
                "value4": 3,
                "value5": 4
            }
        }
    }
}
```


### For serialization framework developers

The serialization framework has two components:

- Serialization interfaces implemented in `SGObject`, and

- Serialization (load/save) methods implemented in `SGObject` and in non-`SGObject` based data structures.

#### Serialization interface

- The `save_binary()` method in `SGObject.h` creates a `cereal::BinaryOutputArchive` object and saves the `SGObject` to a binary file by calling the `cereal_save()` method of `SGObject`. The `load_binary()` method creates a `cereal::BinaryInputArchive` object and loads the parameters from the binary file back into the `SGObject` by calling its `cereal_load()` method. The JSON and XML archives work the same way.

#### Serialization methods in `SGObject`

- The `cereal_save()` method iterates through the parameter list of `SGObject`, registered as `self::map`, and archives each [`name value pair`](https://uscilab.github.io/cereal/assets/doxygen/classcereal_1_1NameValuePair.html), using `basetag.name()` as the name and calling `any.cereal_save()` for the value.

- The `cereal_load()` method iterates through the parameter list and resets each parameter by calling `any.cereal_load()`.
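
The iteration described above can be sketched without SHOGUN or `Cereal`. Everything in this sketch (`ToyArchive`, `ToyObject`, `toy_save()`, `toy_load()`, and the `double`-valued parameters) is a hypothetical stand-in rather than the actual SHOGUN code. It only illustrates walking a tag-to-value map, emitting one name-value pair per tag, and restoring the values by name.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Cereal archive: it only records name-value pairs.
struct ToyArchive
{
    std::vector<std::pair<std::string, double>> pairs;

    void write(const std::string& name, double value)
    {
        pairs.emplace_back(name, value);
    }

    // Returns true and fills `value` if `name` was archived.
    bool read(const std::string& name, double& value) const
    {
        for (const auto& p : pairs)
        {
            if (p.first == name)
            {
                value = p.second;
                return true;
            }
        }
        return false;
    }
};

// Hypothetical stand-in for an SGObject with a registered parameter map.
class ToyObject
{
public:
    void register_param(const std::string& tag, double value)
    {
        m_params[tag] = value;
    }

    double get(const std::string& tag) const
    {
        assert(m_params.count(tag) == 1);
        return m_params.at(tag);
    }

    // Save: iterate over the parameter map and archive one pair per tag.
    void toy_save(ToyArchive& ar) const
    {
        for (const auto& kv : m_params)
            ar.write(kv.first, kv.second);
    }

    // Load: iterate over the registered tags and reset each value from
    // the archive; tags missing from the archive keep their old value.
    void toy_load(const ToyArchive& ar)
    {
        for (auto& kv : m_params)
            ar.read(kv.first, kv.second);
    }

private:
    std::map<std::string, double> m_params;
};
```

In this sketch, loading is driven by the tags already registered on the receiving object, which is why the example in the previous section constructs a default `CCerealObject` before calling `load_json()`.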

#### Serialization methods in `Any`

- The namespace `serial` and the member `serial::DataType m_datatype` in `Any.h` record the data type of a parameter's value: the `Any` constructors convert it into the following enums.

```
enum EnumContainerType
{
    CT_UNDEFINED,
    CT_PRIMITIVE,
    CT_SGVECTOR,
    CT_SGMATRIX
};

enum EnumPrimitiveType
{
    PT_UNDEFINED,
    PT_BOOL_TYPE,
    PT_CHAR_TYPE,
    PT_INT_8,
    PT_UINT_8,
    PT_INT_16,
    PT_UINT_16,
    PT_INT_32,
    PT_UINT_32,
    PT_INT_64,
    PT_UINT_64,
    PT_FLOAT_32,
    PT_FLOAT_64,
    PT_FLOAT_MAX,
    PT_COMPLEX_128,
};
```

- `cereal_save()`, together with `cereal_save_helper()`, casts the `storage` object to its input type and archives the value.

- `cereal_load()`, together with `cereal_load_helper()`, reads the saved value back into `storage` and resets the `policy` based on the data type.
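
The tag-then-cast idea can be illustrated with a small self-contained sketch. `ToyAny`, `ToyPrimitiveType` and `toy_tag_of()` are hypothetical names, the text format `tag:value` is invented for the example, and the `policy` reset is not modelled. The two numeric tag values follow the enum order listed above, where `PT_INT_32` is 7 and `PT_FLOAT_64` is 12, which is also why the JSON example shows `"value1": 12` for a `float64_t` vector.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Subset of the primitive type tags; values follow the enum order above.
enum ToyPrimitiveType
{
    TOY_PT_UNDEFINED = 0,
    TOY_PT_INT_32 = 7,
    TOY_PT_FLOAT_64 = 12
};

// Map a C++ type to its tag; unknown types fall back to "undefined".
template <typename T>
ToyPrimitiveType toy_tag_of(const T&) { return TOY_PT_UNDEFINED; }
inline ToyPrimitiveType toy_tag_of(const std::int32_t&) { return TOY_PT_INT_32; }
inline ToyPrimitiveType toy_tag_of(const double&) { return TOY_PT_FLOAT_64; }

// Hypothetical type-erased holder: untyped storage plus a type tag
// that is recorded once, in the constructor.
class ToyAny
{
public:
    template <typename T>
    explicit ToyAny(T* value) : m_storage(value), m_type(toy_tag_of(*value))
    {
        assert(value != nullptr);
    }

    ToyPrimitiveType type() const { return m_type; }

    // Save: cast the storage back to its original type, then write it.
    std::string save() const
    {
        std::ostringstream out;
        out << static_cast<int>(m_type) << ':';
        switch (m_type)
        {
        case TOY_PT_INT_32:
            out << *static_cast<const std::int32_t*>(m_storage);
            break;
        case TOY_PT_FLOAT_64:
            out << *static_cast<const double*>(m_storage);
            break;
        default:
            out << '?';
            break;
        }
        return out.str();
    }

    // Load: check the archived tag, then read the value back into storage.
    bool load(const std::string& text)
    {
        std::istringstream in(text);
        int tag = 0;
        char sep = 0;
        if (!(in >> tag >> sep) || sep != ':' || tag != static_cast<int>(m_type))
            return false;
        if (m_type == TOY_PT_INT_32)
            return static_cast<bool>(in >> *static_cast<std::int32_t*>(m_storage));
        if (m_type == TOY_PT_FLOAT_64)
            return static_cast<bool>(in >> *static_cast<double*>(m_storage));
        return false;
    }

private:
    void* m_storage;
    ToyPrimitiveType m_type;
};
```

Storing the tag next to the value lets the loader reject an archive entry whose type does not match the registered parameter before it touches the storage.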

#### Serialization methods in `SGVector`, `SGMatrix` and `SGReferencedData`

Both `SGVector` and `SGMatrix` are derived from the `SGReferencedData` class.

- `SGReferencedData` archives whether `ref_counting` is on by saving `true`/`false`. If `ref_counting` is on, i.e. `m_refcount != NULL`, it also archives the reference count.

- `SGVector` and `SGMatrix` archive the `ref_counting` value by calling the base class load/save methods through `cereal::base_class<SGReferencedData>(this)` ([see the introduction](http://uscilab.github.io/cereal/inheritance.html)).
For `SGVector`, the length and the vector values are archived; for `SGMatrix`, the number of rows, the number of columns, and the matrix values in `T* matrix` are archived. Data of `complex128_t` type is cast to `float64_t` before archiving.
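
The "base class first, then own members" ordering can be sketched as follows. `ToyReferencedData`, `ToyVector`, `toy_archive()` and the `key=value;` text format are hypothetical. The real classes go through `cereal::base_class` and a `Cereal` archive instead of a stream.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base: archives the ref-counting flag and, only if
// ref counting is on, the current count.
struct ToyReferencedData
{
    bool ref_counting;
    int refcount;

    ToyReferencedData(bool on, int count) : ref_counting(on), refcount(count)
    {
        assert(count >= 0);
    }

    void save(std::ostream& out) const
    {
        out << "ref_counting=" << (ref_counting ? "true" : "false") << ';';
        if (ref_counting)
            out << "refcount=" << refcount << ';';
    }
};

// Hypothetical derived vector: base-class data first, then the length
// and the values.
struct ToyVector : ToyReferencedData
{
    std::vector<double> values;

    explicit ToyVector(std::vector<double> v)
        : ToyReferencedData(true, 1), values(std::move(v))
    {
    }

    void save(std::ostream& out) const
    {
        ToyReferencedData::save(out);
        out << "length=" << values.size() << ';';
        for (double v : values)
            out << v << ';';
    }
};

// Convenience wrapper that returns the archived text.
std::string toy_archive(const ToyVector& vec)
{
    std::ostringstream out;
    vec.save(out);
    return out.str();
}
```

The field order produced here mirrors the JSON example above, where `ReferencedData` precedes `length` and the values.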
171 changes: 171 additions & 0 deletions doc/wikis/linalg.md
@@ -0,0 +1,171 @@
## Internal linear algebra library

#### Table of Contents

- [Motivation](#motivation)
- [For SHOGUN developers](#for-shogun-developers)
- [Setting `linalg` backend](#setting-linalg-backend)
- [Using `linalg` operations](#using-linalg-operations)
- [Examples](#examples)
- [For `linalg` developers](#for-linalg-developers)
- [Understanding operation interface `LinalgNamespace.h`](#understanding-operation-interface-linalgnamespaceh)
- [Understanding backend interfaces](#understanding-backend-interfaces)
- [Understanding operation implementations of different backends](#understanding-operation-implementations-of-different-backends)
- [Extend external libraries](#extend-external-libraries)

### Motivation

Linear algebra operations form the backbone for most of the computation components in any Machine Learning library. However, writing all of the required linear algebra operations from scratch is rather redundant and undesired, especially when we have some excellent open source alternatives. In Shogun, we prefer

- [`Eigen3`](http://eigen.tuxfamily.org/index.php?title=Main_Page) for its speed and simplicity at the usage level,
- [`ViennaCL`](http://viennacl.sourceforge.net/) version 1.5 for GPU powered linear algebra operations.

For Shogun maintainers, however, using different external libraries for different operations makes the code painful to maintain.

- For example, consider part of an algorithm originally written with the `Eigen3` API. A Shogun user may wish to use `ViennaCL` for that algorithm instead, hoping for better performance on a GPU-powered platform. There is no way to do that without the developers _rewriting_ the algorithm with `ViennaCL`, which leads to _duplication_ of code and effort.
- Also, developers have no way to do a _performance comparison_ between _different_ external linear algebra libraries for the _same_ algorithm in Shogun code.
- It is also somewhat frustrating for a _new_ developer, who has to invest a significant amount of time and effort in learning each of these external APIs _just_ to add a new algorithm to Shogun.


### Features of internal linear algebra library

Shogun's **internal linear algebra library** (referred to as `linalg` from here on) is a work-in-progress attempt to overcome these issues. We designed `linalg` as a modularized internal **header only** library in order to

- provide a uniform API that lets Shogun developers choose any supported backend without having to worry about the syntactical differences between the external libraries' operations,
- have the backend set for each operation at compile time (for lower runtime overhead), which is why it is intended to be used internally by Shogun developers,
- allow Shogun developers to add new linear algebra backend plug-ins easily.

### For Shogun developers
#### Setting `linalg` backend
Users can switch between `linalg` backends via the global variable `sg_linalg`.
- Shogun uses the `Eigen3` backend as the default linear algebra backend.
- Enabling a GPU backend allows data transfer between CPU and GPU, as well as operations on the GPU. The `ViennaCL` (GPU) backend is enabled by assigning a new `ViennaCL` backend instance to `sg_linalg`, and disabled by passing `nullptr`:
```
sg_linalg->set_gpu_backend(new LinalgBackendViennaCL());
sg_linalg->set_gpu_backend(nullptr);
```

- Although backends can be extended, only one CPU backend and one GPU backend can be registered at a time.

#### Using `linalg` operations
The `linalg` library works with both `SGVector` and `SGMatrix`. Operations are called as:

```
#include <shogun/mathematics/linalg/LinalgNamespace.h>
shogun::linalg::operation(args)
```

- To use `linalg` operations on GPU data (vectors or matrices), transfer the data between CPU and GPU with the `to_gpu` and `from_gpu` methods. Both methods return their results as new instances.

```
auto result_on_gpu = linalg::to_gpu(arg);
auto result = linalg::from_gpu(arg_on_gpu);
```
- The `to_gpu` method returns the original CPU vector or matrix if no GPU backend is available. The `from_gpu` method returns the input argument if it is already on CPU, and raises an error if the data is on GPU but no GPU backend is available anymore.

- The status of data can be checked with `data.on_gpu()`: `true` means the data is on GPU and `false` means the data is on CPU.

- An operation is carried out on GPU __only if__ the data passed to it is on GPU __and__ a GPU backend is registered: `sg_linalg->get_gpu_backend() == true`. If the data is on CPU, the operation is carried out on CPU.

- `linalg` reports an error if the data is on GPU but no GPU backend is available anymore. An error also occurs when an operation takes multiple inputs and the inputs are not on the same backend.

- A warning is generated if an operation is not available on a specific backend.
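
The transfer and dispatch rules above can be condensed into a small self-contained sketch. `ToyData`, `ToyBackend`, `toy_to_gpu()`, `toy_from_gpu()` and `toy_infer_backend()` are hypothetical names, the GPU registration state is passed in as a plain `bool`, and errors are modelled as exceptions. None of this is the actual `linalg` code.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical data handle: it only tracks where the data lives.
struct ToyData
{
    bool on_gpu;
};

enum ToyBackend
{
    TOY_CPU,
    TOY_GPU
};

// to_gpu: returns a new handle; without a GPU backend the data stays on CPU.
ToyData toy_to_gpu(const ToyData& data, bool gpu_backend_registered)
{
    ToyData result = data;
    if (gpu_backend_registered)
        result.on_gpu = true;
    return result;
}

// from_gpu: CPU data is returned unchanged; GPU data needs a GPU backend.
ToyData toy_from_gpu(const ToyData& data, bool gpu_backend_registered)
{
    if (!data.on_gpu)
        return data;
    if (!gpu_backend_registered)
        throw std::runtime_error("data is on GPU but no GPU backend is registered");
    ToyData result = data;
    result.on_gpu = false;
    return result;
}

// Decide where a two-input operation runs.
ToyBackend toy_infer_backend(
    const ToyData& a, const ToyData& b, bool gpu_backend_registered)
{
    if (a.on_gpu != b.on_gpu)
        throw std::runtime_error("inputs are not on the same backend");
    if (a.on_gpu && !gpu_backend_registered)
        throw std::runtime_error("data is on GPU but no GPU backend is registered");
    return a.on_gpu ? TOY_GPU : TOY_CPU;
}
```

Note that CPU data always runs on CPU in this sketch, even when a GPU backend is registered. Moving work to the GPU is an explicit choice made by calling the transfer function first.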

#### Examples

Here we show how to compute a vector dot product with `linalg` operations on CPU and GPU.

```
// CPU dot operation

#include <shogun/lib/SGVector.h>
#include <shogun/mathematics/linalg/LinalgNamespace.h>

using namespace shogun;

// Create SGVectors
const index_t size = 3;
SGVector<int32_t> a(size), b(size);
a.range_fill(0);
b.range_fill(0);

auto result = linalg::dot(a, b);
```

```
// GPU dot operation

#include <shogun/lib/SGVector.h>
#include <shogun/mathematics/linalg/LinalgNamespace.h>
#include <shogun/mathematics/linalg/LinalgBackendViennaCL.h>

using namespace shogun;

// Set gpu backend
sg_linalg->set_gpu_backend(new LinalgBackendViennaCL());

// Create SGVectors
const index_t size = 3;
SGVector<int32_t> a(size), b(size), a_gpu, b_gpu;
a.range_fill(0);
b.range_fill(0);

// Transfer vectors to GPU
a_gpu = linalg::to_gpu(a);
b_gpu = linalg::to_gpu(b);

// run dot operation
auto result = linalg::dot(a_gpu, b_gpu);
```
If the result is a vector or matrix, it needs to be transferred back to the CPU:
```
#include <shogun/lib/SGVector.h>
#include <shogun/mathematics/linalg/LinalgNamespace.h>
#include <shogun/mathematics/linalg/LinalgBackendViennaCL.h>

using namespace shogun;

// set gpu backend
sg_linalg->set_gpu_backend(new LinalgBackendViennaCL());

// Create a SGVector
SGVector<float32_t> a(5), a_gpu;
a.range_fill(0);

// Transfer the vector to gpu
a_gpu = linalg::to_gpu(a);

// Run scale operation and transfer the result back to CPU
auto result_gpu = linalg::scale(a_gpu, 0.3);
auto result = linalg::from_gpu(result_gpu);
```


### For `linalg` developers
The structure of `linalg` consists of three groups of components:
- The interface that decides which backend to use for each operation (`LinalgNamespace.h`)
- The structures that serve as interfaces to the backend libraries (`GPUMemory*.h`)
- The operation implementations in each backend (`LinalgBackend*.h`).

#### Understanding operation interface `LinalgNamespace.h`

- `LinalgNamespace.h` defines the `linalg` operation interfaces in the namespace `linalg`. Every operation calls the `infer_backend()` method on its inputs to decide which backend to call.

#### Understanding backend interfaces

- The `GPUMemoryBase` class is a generic base class that serves as the interface to GPU memory libraries.
Once GPU data has been created by the `to_gpu()` method, it is referred to through a `GPUMemoryBase` pointer, which is cast back to the specific GPU memory type during operations.

- `GPUMemoryViennaCL` is the `ViennaCL`-specific GPU memory interface. It defines how data on the GPU is accessed and manipulated with `ViennaCL` operations.

#### Understanding operation implementations of different backends

- `LinalgBackendBase` is the base class for operations on all backends. The macros in the `LinalgBackendBase` class define the `linalg` operations and data transfer operations that are available in at least one backend.

- `LinalgBackendGPUBase` has two pure virtual methods: `to_gpu()` and `from_gpu()`. `LinalgBackendViennaCL` and any user-defined GPU backend class must derive from `LinalgBackendGPUBase`, and therefore must implement the GPU transfer methods.

- The `LinalgBackendEigen` and `LinalgBackendViennaCL*` classes provide the specific implementations of the linear algebra operations with the `Eigen3` and `ViennaCL` libraries.

#### Extend external libraries

The current `linalg` framework makes it easy to add external linear algebra libraries. To add a CPU-based library, derive a class from `LinalgBackendBase` and re-implement the methods with the new library. For a GPU-based library, add a new class derived from `LinalgBackendGPUBase`, as well as a GPU memory interface class derived from the `GPUMemoryBase` class.
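
The extension pattern can be sketched in plain C++ as shown below. All class names here (`ToyBackendBase`, `ToyBackendPlainLoops`, `ToyBackendGPUBase`, `ToyBackendFakeGPU`, `ToyLinalg`) are hypothetical and only mirror the roles described above. The base class throws for an unsupported operation where the real framework is described as issuing a warning, and the "GPU" backend just copies vectors instead of using device memory.

```cpp
#include <cassert>
#include <memory>
#include <numeric>
#include <stdexcept>
#include <vector>

typedef std::vector<double> ToyVec;

// Base class: every operation has a default that reports "not available".
class ToyBackendBase
{
public:
    virtual ~ToyBackendBase() {}

    virtual double dot(const ToyVec&, const ToyVec&) const
    {
        throw std::logic_error("dot is not available on this backend");
    }
};

// A new CPU backend only overrides the operations it supports.
class ToyBackendPlainLoops : public ToyBackendBase
{
public:
    double dot(const ToyVec& a, const ToyVec& b) const override
    {
        assert(a.size() == b.size());
        return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
    }
};

// GPU backends must additionally implement the two transfer methods.
class ToyBackendGPUBase : public ToyBackendBase
{
public:
    virtual ToyVec to_gpu(const ToyVec& host) const = 0;
    virtual ToyVec from_gpu(const ToyVec& device) const = 0;
};

// Placeholder "GPU" backend: it copies data instead of using a device.
class ToyBackendFakeGPU : public ToyBackendGPUBase
{
public:
    ToyVec to_gpu(const ToyVec& host) const override { return host; }
    ToyVec from_gpu(const ToyVec& device) const override { return device; }
};

// Registry with exactly one CPU slot and one optional GPU slot.
class ToyLinalg
{
public:
    ToyLinalg() : m_cpu(new ToyBackendPlainLoops()) {}

    // Takes ownership of `backend`; pass nullptr to disable the GPU slot.
    void set_gpu_backend(ToyBackendGPUBase* backend) { m_gpu.reset(backend); }

    bool has_gpu_backend() const { return m_gpu != nullptr; }

    const ToyBackendBase& cpu() const { return *m_cpu; }

private:
    std::unique_ptr<ToyBackendBase> m_cpu;
    std::unique_ptr<ToyBackendGPUBase> m_gpu;
};
```

Making the transfer methods pure virtual in the GPU base class means a GPU backend that forgets to implement them fails at compile time rather than at the first transfer.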