# Configuration and Parser

In many application scenarios, `Configuration` is an important component.
People use `Configuration` to specify the arguments that later tasks rely on.
Generally, `Configuration` is expected to have the following traits:
1. it contains several variables that can be read and written in the following tasks;
2. it can be dumped to a file and reloaded from a specified file.

In addition, there are some advanced needs:
1. it can be easily converted into a command line interface;
2. the configurable variables can be specified flexibly, rather than including all variables.

Thus, we propose our `Configuration` and the accompanying `ConfigurationParser`.
Our `Configuration`:
1. contains several readable and writable variables, each of which can be flexibly marked as configurable or not;
2. can be dumped to a file (the default format is `json`, but `yaml` and `toml` are also supported)
and reloaded from the specified file.

With the help of `ConfigurationParser`,
the variables in `Configuration` can be easily converted into CLI arguments.
Furthermore, to let the parser handle complicated data structures passed via the console
(e.g., lists, dicts, and even calculation expressions), we propose a dedicated input grammar.

In the following chapters, we show the basic usage of `Configuration`
and explain the expression grammar in `ConfigurationParser`.

## Basic Usage

Assume we need to set a logger in `Configuration` along with some other task-relevant variables,
and we only want to preserve the variables rather than the logger
(in fact, a logger object cannot be dumped to json).

Thus, `Configuration` is expected to hold all variables both at run time and in the dumped `json` file,
but to hold the logger only at run time.
By inheriting from `Configuration` in `longling.lib.parser`,
which excludes `logger` and some other 'basic' variables common to all class objects
(e.g., '\_\_dict\_\_', '\_\_weakref\_\_'), we can achieve that:

In [1]:
from longling import set_logging_info, config_logging
from longling.lib.parser import Configuration

set_logging_info()

class Params(Configuration):
    a = 1
    b = 2

params = Params()
params.dump("params.json", override=True)
with open("params.json") as f:
    print(f.read())

INFO:root:writing configuration parameters to G:\program\longling\tutorials\params.json


{
  "a": 1,
  "b": 2
}


In [2]:
params.a = 10  # this will not change the value in "params.json", thus the loaded "a" is still 1
logger = config_logging(logger="params")
params = Params.load("params.json", logger=logger)  # we specify a new logger here, if not, logging will be used.
print(params)

logger: <Logger params (INFO)>
a: 1
b: 2


As we can see, variables can be included or excluded when dumping and loading `Configuration` via a `json` file:
here `a` and `b` are written to the file, while `logger` exists only at run time.
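
To make the include/exclude idea concrete, here is a minimal standard-library sketch. It is not the `longling` implementation: the class `SketchConfig`, its `excluded` set, and the `dumps`/`loads` helpers are hypothetical names chosen for illustration.

```python
import json
import logging


class SketchConfig:
    # Hypothetical set of run-time-only names that are skipped when dumping.
    excluded = {"logger"}

    def __init__(self, logger=logging, **kwargs):
        self.logger = logger
        self.a = 1
        self.b = 2
        # Values loaded from a file (or passed by the caller) override the defaults.
        for name, value in kwargs.items():
            setattr(self, name, value)

    def to_dict(self):
        # Keep only the instance variables that are safe to serialize.
        return {k: v for k, v in vars(self).items() if k not in self.excluded}

    def dumps(self):
        return json.dumps(self.to_dict(), indent=2)

    @classmethod
    def loads(cls, text, **kwargs):
        # Run-time objects such as the logger are supplied again at load time.
        return cls(**{**json.loads(text), **kwargs})


cfg = SketchConfig()
text = cfg.dumps()
restored = SketchConfig.loads(text, logger=logging.getLogger("params"))
print(text)
print(restored.logger.name, restored.a, restored.b)
```

The dumped text contains only `a` and `b`; the logger is re-attached when the configuration is loaded.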

Also, `yaml` and `toml` are supported:

In [3]:
params.dump("params.toml", override=True, file_format="toml")
with open("params.toml") as f:
    print(f.read())

params.dump("params.yaml", override=True, file_format="yaml")
with open("params.yaml") as f:
    print(f.read())


a = 1
b = 2

{a: 1, b: 2}



## Converting `Configuration` into CLI

### One step to CLI
The easiest way to convert `Configuration` into CLI is to use the `ConfigurationParser` in `longling.lib.parser`.

Here is an example:

In [4]:
from longling.lib.parser import ConfigurationParser

cfg_parser = ConfigurationParser(Params)
cfg_parser.print_help()
print(cfg_parser(
    "--a 5 --b 7"
))

usage: ipykernel_launcher.py [-h] [--a A] [--b B] [--kwargs KWARGS]

optional arguments:
  -h, --help       show this help message and exit
  --a A            set a, default is 1
  --b B            set b, default is 2
  --kwargs KWARGS  add extra argument here, use format:
                   <key>=<value>(;<key>=<value>)
{'a': '5', 'b': '7'}


As shown, this is quite easy. The next section gives a more complicated example that uses our `console input grammar`
to receive more data structures or expressions.
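
To give a feel for what such a conversion involves, the following sketch builds an `argparse` parser from the class attributes of a plain class. It only illustrates the idea and is not how `ConfigurationParser` is actually implemented; `build_parser` is a hypothetical helper.

```python
import argparse


class Params:
    a = 1
    b = 2


def build_parser(cls):
    """Create one optional CLI argument per public class attribute."""
    parser = argparse.ArgumentParser()
    for name, default in vars(cls).items():
        if name.startswith("_"):
            # Skip '__dict__', '__weakref__', '__module__' and similar entries.
            continue
        parser.add_argument(
            f"--{name}", default=default, help=f"set {name}, default is {default}"
        )
    return parser


parser = build_parser(Params)
args = vars(parser.parse_args("--a 5".split()))
print(args)  # {'a': '5', 'b': 2}
```

As in the output above, a value passed on the command line arrives as a string, while an untouched default keeps its original type.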

### Grammar in `ConfigurationParser`

Numbers and strings are easy to pass to `Configuration` with the help of `ConfigurationParser`.
However,
we often need to use more complicated data structures like `list` and `dict` as the variables in `Configuration`.
Is it possible to pass these via the console?
The answer is yes.
We propose a Console Input Grammar (CIG) to support passing complicated data structures via the console.
Furthermore, the grammar also allows users to use simple `python` expressions in the console, such as `for` and `if`.

Here is an example where `int`, `float`, `str`, `list`, `tuple` and `dict` are all used:

In [5]:
class ComplicatedParams(Configuration):
    a = 1
    b = 0.2
    c = "hello world"
    d = [1, 2, 3]
    e = ("Tom", "Jerry")
    f = {"Ada": 1.0, "Sher": 3.0}

cfg_parser = ConfigurationParser(ComplicatedParams)
cfg_parser.print_help()
cfg_parser([
    "--a", "int(10)",
    "--b", "float(0.8)",
    "--c", "hello longling",
    "--d", "list([i for i in range(8)])",
    "--e", "tuple((bool(True), bool(False)))",
    "--f", "Ada=2.0;Sher=dict(a=1,b=None)",
    "--kwargs", "g=a;h=None"
])

usage: ipykernel_launcher.py [-h] [--a A] [--b B] [--c C] [--d D] [--e E]
                             [--f F] [--kwargs KWARGS]

optional arguments:
  -h, --help       show this help message and exit
  --a A            set a, default is 1
  --b B            set b, default is 0.2
  --c C            set c, default is hello world
  --d D            set d, default is [1, 2, 3]
  --e E            set e, default is ('Tom', 'Jerry')
  --f F            set f, default is {'Ada': 1.0, 'Sher': 3.0}, dict
                   variables, use format: <key>=<value>(;<key>=<value>)
  --kwargs KWARGS  add extra argument here, use format:
                   <key>=<value>(;<key>=<value>)


{'a': 10,
 'b': 0.8,
 'c': 'hello longling',
 'd': [0, 1, 2, 3, 4, 5, 6, 7],
 'e': (True, False),
 'f': {'Ada': '2.0', 'Sher': {'a': 1, 'b': None}},
 'g': 'a',
 'h': None}

All passed values are treated as `str` by default.
To enable evaluation, which recovers the data structure or expression from the string,
a type declaration is required, as illustrated in the previous example.
Only a `dict` at the top level is a little special: `;` is used to separate its `<key>=<value>` items.
In addition, once evaluation is enabled, the values inside the expression are parsed automatically,
so the special top-level format is no longer needed there, as in `Sher=dict(a=1,b=None)`.
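
The rules above can be approximated with a short sketch. This is an assumption-laden illustration rather than the grammar's real implementation: `parse_value` and `parse_dict` are hypothetical helpers, the set of recognized type names is a guess, and `eval` is used only for brevity (it should not be applied to untrusted input).

```python
import re

# Names available inside an evaluated expression (an assumed, minimal set).
_ALLOWED = {
    "int": int, "float": float, "bool": bool, "str": str,
    "list": list, "tuple": tuple, "dict": dict, "set": set, "range": range,
}
# A value is evaluated only when it is wrapped in a type declaration.
_TYPED = re.compile(r"^(int|float|bool|str|list|tuple|dict|set)\(.*\)$", re.S)


def parse_value(text):
    if text == "None":
        # The example output above shows 'h=None' becoming None.
        return None
    if _TYPED.match(text):
        return eval(text, {"__builtins__": _ALLOWED}, {})
    return text  # no type declaration: keep the raw string


def parse_dict(text):
    """Parse the top-level '<key>=<value>(;<key>=<value>)' format."""
    result = {}
    for item in text.split(";"):
        key, value = item.split("=", 1)
        result[key] = parse_value(value)
    return result


print(parse_value("int(10)"))                      # 10
print(parse_value("hello longling"))               # hello longling
print(parse_value("list([i for i in range(8)])"))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(parse_dict("Ada=2.0;Sher=dict(a=1,b=None)"))
```

Note that `Ada=2.0` stays the string `'2.0'` because it carries no type declaration, matching the output of the example above.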



## Advanced

### `Configuration` template in machine learning
First, we analyze our needs, and then we tailor `Configuration` for machine learning tasks.

#### File Structure

In [13]:
from longling.ML import Configuration

cfg = Configuration()
print(cfg)

logger: <Logger MLModel (INFO)>
model_name: MLModel
root: ./
dataset: 
timestamp: 20201117094110
workspace: 
root_data_dir: ./\data
data_dir: ./\data\data
root_model_dir: ./\data\model\MLModel
model_dir: ./\data\model\MLModel
cfg_path: data\model\MLModel\configuration.json
caption: 
validation_result_file: data\model\MLModel\result.json


For better illustration, we print the table of contents (ToC) of the relevant directories and files:
```
tutorials/      <-- root
├── data/       <-- root_data_dir
│   ├── data/       <-- data_dir
│   │   ├── test
│   │   ├── train
│   │   └── valid
│   └── model/
│       └── MLModel/        <-- root_model_dir (also model_dir)
│           ├── configuration.json      <-- cfg_path
│           ├── params.txt
│           └── result.json     <-- validation_result_file
└── parser.ipynb        <-- the script you run
```

To verify the generalization ability of a model, we need to conduct experiments on more than one dataset.
So how do we specify the dataset?

In [14]:
cfg = Configuration(dataset="dataset1")
print(cfg)

logger: <Logger MLModel (INFO)>
model_name: MLModel
root: ./
dataset: dataset1
timestamp: 20201117094110
workspace: 
root_data_dir: ./\data\dataset1
data_dir: ./\data\dataset1\data
root_model_dir: ./\data\dataset1\model\MLModel
model_dir: ./\data\dataset1\model\MLModel
cfg_path: data\dataset1\model\MLModel\configuration.json
caption: 
validation_result_file: data\dataset1\model\MLModel\result.json


We print the ToC as below:
```
tutorials/      <-- root
├── data/
│   └── dataset1/       <-- root_data_dir
│       ├── data/       <-- data_dir
│       │   ├── test
│       │   ├── train
│       │   └── valid
│       └── model/
│           └── MLModel/        <-- root_model_dir (also model_dir)
│               ├── configuration.json      <-- cfg_path
│               ├── params.txt
│               └── result.json     <-- validation_result_file
└── parser.ipynb        <-- the script you run
```
As shown, a new level named after the dataset is added under the `data` directory.
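
The way the dataset name shifts the whole layout can be sketched with `pathlib`. This is only an illustration of the directory scheme shown in the trees above: the function `layout` is hypothetical, and POSIX-style separators are used instead of the Windows paths in the printed output.

```python
from pathlib import PurePosixPath


def layout(root="tutorials", dataset="", model_name="MLModel", workspace=""):
    """Derive the directory layout; empty components add no extra level."""
    root_data_dir = PurePosixPath(root) / "data" / dataset
    data_dir = root_data_dir / "data"
    root_model_dir = root_data_dir / "model" / model_name
    model_dir = root_model_dir / workspace
    return {
        "root_data_dir": str(root_data_dir),
        "data_dir": str(data_dir),
        "root_model_dir": str(root_model_dir),
        "model_dir": str(model_dir),
        "cfg_path": str(model_dir / "configuration.json"),
    }


print(layout()["model_dir"])                    # tutorials/data/model/MLModel
print(layout(dataset="dataset1")["model_dir"])  # tutorials/data/dataset1/model/MLModel
```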

Sometimes, we want to distinguish our models by timestamp, so we propose RunTimePath (RTP) to implement this.
By using "$variable_name" in a path variable, we can dynamically determine the directory and file paths.

For example, we use `timestamp` as `workspace`:

In [16]:
cfg = Configuration(dataset="dataset1", workspace="$timestamp")
print(cfg)

logger: <Logger MLModel (INFO)>
model_name: MLModel
root: ./
dataset: dataset1
timestamp: 20201117094110
workspace: 20201117094110
root_data_dir: ./\data\dataset1
data_dir: ./\data\dataset1\data
root_model_dir: ./\data\dataset1\model\MLModel
model_dir: ./\data\dataset1\model\MLModel\20201117094110
cfg_path: data\dataset1\model\MLModel\20201117094110\configuration.json
caption: 
validation_result_file: data\dataset1\model\MLModel\20201117094110\result.json


The ToC is changed as below:
```
tutorials/      <-- root
├── data/
│   └── dataset1/       <-- root_data_dir
│       ├── data/       <-- data_dir
│       │   ├── test
│       │   ├── train
│       │   └── valid
│       └── model/
│           └── MLModel/        <-- root_model_dir
│               └── 20201117094110/     <-- model_dir
│                   ├── configuration.json      <-- cfg_path
│                   ├── params.txt
│                   └── result.json     <-- validation_result_file
└── parser.ipynb        <-- the script you run
```

See, the `model_dir` has been changed, where `timestamp` is used as the workspace.

We list the default variables that support run-time sequential assignment:
```
"workspace",
"root_data_dir",
"data_dir",
"root_model_dir",
"model_dir"
```
Note that, during run time, these variables are resolved strictly in the order listed above.
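
The sequential resolution of "$variable_name" placeholders can be sketched with `string.Template` from the standard library. This is an assumption about the mechanism, not the actual RTP code: `resolve` is a hypothetical helper, and the path templates below are illustrative values written with forward slashes.

```python
from string import Template

# Resolution order, mirroring the list above.
ORDER = ["workspace", "root_data_dir", "data_dir", "root_model_dir", "model_dir"]


def resolve(variables):
    """Substitute $name placeholders one variable at a time, in ORDER."""
    resolved = dict(variables)
    for name in ORDER:
        # Each variable may refer to anything resolved before it.
        resolved[name] = Template(resolved[name]).safe_substitute(resolved)
    return resolved


values = resolve({
    "root": ".",
    "model_name": "MLModel",
    "timestamp": "20201117094110",
    "workspace": "$timestamp",
    "root_data_dir": "$root/data",
    "data_dir": "$root_data_dir/data",
    "root_model_dir": "$root_data_dir/model/$model_name",
    "model_dir": "$root_model_dir/$workspace",
})
print(values["model_dir"])  # ./data/model/MLModel/20201117094110
```

Because the substitution is sequential, a variable can only depend on variables that appear earlier in the order; referring to a later one would leave an unresolved placeholder in this sketch.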

Although our predefined data and model structure can greatly help organize the data and models,
users may still want to specify the paths as they wish.
In fact, you can specify the data paths like this:

In [18]:
cfg = Configuration(
    data_dir="data/",
    model_dir="model/",
    cfg_path="$model_dir/cfg.json",
    vadidation_result_file="result.json"
)
print(cfg)

logger: <Logger MLModel (INFO)>
model_name: MLModel
root: ./
dataset: 
timestamp: 20201117094110
workspace: 
root_data_dir: ./\data
data_dir: data/
root_model_dir: ./\data\model\MLModel
model_dir: model/
cfg_path: model\configuration.json
caption: 
vadidation_result_file: result.json
validation_result_file: model\result.json


```
tutorials/
├── data/       <-- data_dir
│   ├── test
│   ├── train
│   └── valid
├── model/      <-- model_dir
│   ├── configuration.json      <-- cfg_path
│   ├── params.txt
│   └── result.json     <-- validation_result_file
└── parser.ipynb        <-- the script you run
```
More simply, you can pass any path variables and use them.

In [20]:
cfg = Configuration(train_path="train", valid_path="valid", test_path="test")
print("train_path:", cfg.train_path)
print("valid_path:", cfg.valid_path)
print("test_path:", cfg.test_path)

train_path: train
valid_path: valid
test_path: test


Considering that various frameworks have their own variables that cannot be dumped,
we design a separate `Configuration` template for some of them (see the Variants section).

#### Dynamic Parameters

We often set default parameter combinations in ML models to help us initialize components.
For example, we use the following `NCfg` to include two properties, `train_params` and `lr_params`:

In [32]:
class NCfg(Configuration):
    train_params = {"batch_size": 64, "begin_epoch": 0, "end_epoch": 10}
    lr_params = {"learning_rate": 0.01, "max_update_steps": 1000}
cfg = NCfg()
cfg.train_params, cfg.lr_params

({'batch_size': 64, 'begin_epoch': 0, 'end_epoch': 10},
 {'learning_rate': 0.01, 'max_update_steps': 1000})

Usually, we change these parameter combinations by passing new values:

In [34]:
cfg = NCfg(
    train_params={"batch_size": 128, "begin_epoch": 0, "end_epoch": 10},
    lr_params={"learning_rate": 0.1, "max_update_steps": 1000}
)
cfg.train_params, cfg.lr_params

({'batch_size': 128, 'begin_epoch': 0, 'end_epoch': 10},
 {'learning_rate': 0.1, 'max_update_steps': 1000})

Although we only want to change two sub-parameters (i.e., `batch_size` and `learning_rate`),
we need to pass all parameters again.
To simplify this, we can use the `_update` suffix to update `*_params` variables
by passing only the changed sub-parameters.

In [35]:
cfg = NCfg(
    train_params_update={"batch_size": 32},
    lr_params_update={"learning_rate": 0.001}
)
cfg.train_params, cfg.lr_params

({'batch_size': 32, 'begin_epoch': 0, 'end_epoch': 10},
 {'learning_rate': 0.001, 'max_update_steps': 1000})
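
One way such an `_update` suffix could be handled is sketched below. It is a guess at the mechanism for illustration only, not the `longling` source; `SketchCfg` is a hypothetical stand-in for `NCfg`.

```python
import copy


class SketchCfg:
    train_params = {"batch_size": 64, "begin_epoch": 0, "end_epoch": 10}
    lr_params = {"learning_rate": 0.01, "max_update_steps": 1000}

    def __init__(self, **kwargs):
        for name, value in kwargs.items():
            if name.endswith("_params_update"):
                # 'train_params_update' -> merge into a copy of 'train_params'.
                target = name[: -len("_update")]
                merged = copy.deepcopy(getattr(self, target))
                merged.update(value)
                setattr(self, target, merged)
            else:
                setattr(self, name, value)


cfg = SketchCfg(train_params_update={"batch_size": 32})
print(cfg.train_params)        # {'batch_size': 32, 'begin_epoch': 0, 'end_epoch': 10}
print(SketchCfg.train_params)  # {'batch_size': 64, 'begin_epoch': 0, 'end_epoch': 10}
```

Copying before merging keeps the class-level defaults intact, so later instances still start from the original values.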

#### One step to CLI

By cooperating with `ConfigurationParser`, we can use CLI commands to specify the variables from the console:

In [30]:
from longling.ML import ConfigurationParser
cfg = Configuration()
cfg_parser = ConfigurationParser(Configuration)

cfg_parser.print_help()
cfg_kwargs = cfg_parser([
    "--data_dir", "$root_data_dir",
    "--workspace", "model1",
    "--caption", "this is a test message",
    "--kwargs", "test_path=test"
])
cfg_kwargs

usage: ipykernel_launcher.py [-h] [--model_name MODEL_NAME] [--root ROOT]
                             [--dataset DATASET] [--timestamp TIMESTAMP]
                             [--workspace WORKSPACE]
                             [--root_data_dir ROOT_DATA_DIR]
                             [--data_dir DATA_DIR]
                             [--root_model_dir ROOT_MODEL_DIR]
                             [--model_dir MODEL_DIR] [--cfg_path CFG_PATH]
                             [--caption CAPTION] [--kwargs KWARGS]

optional arguments:
  -h, --help            show this help message and exit
  --model_name MODEL_NAME
                        set model_name, default is MLModel
  --root ROOT           set root, default is ./
  --dataset DATASET     set dataset, default is
  --timestamp TIMESTAMP
                        set timestamp, default is 20201117094110
  --workspace WORKSPACE
                        set workspace, default is
  --root_data_dir ROOT_DATA_DIR
                        set ro

{'model_name': 'MLModel',
 'root': './',
 'dataset': '',
 'timestamp': '20201117094110',
 'workspace': 'model1',
 'root_data_dir': '$root\\data',
 'data_dir': '$root_data_dir',
 'root_model_dir': '$root_data_dir\\model\\$model_name',
 'model_dir': PureWindowsPath('$root_data_dir/model/$model_name'),
 'cfg_path': PureWindowsPath('$model_dir/configuration.json'),
 'caption': 'this is a test message',
 'test_path': 'test'}

In [29]:
print(Configuration(**cfg_kwargs))

logger: <Logger MLModel (INFO)>
model_name: MLModel
root: ./
dataset: 
timestamp: 20201117094110
workspace: model1
root_data_dir: ./\data
data_dir: ./\data
root_model_dir: ./\data\model\MLModel
model_dir: ./\data\model\MLModel\model1
cfg_path: data\model\MLModel\model1\configuration.json
caption: this is a test message
test_path: test
validation_result_file: data\model\MLModel\model1\result.json


#### Variants

##### `Configuration` for `mxnet`

[`mxnet`](https://mxnet.apache.org) is a widely used deep learning framework that contains several handy but framework-specific variable classes.
For example, when using `mxnet`, the context (i.e., `ctx`) should be specified both in data loading and in model training.
Thus, `ctx` is a variable that should be included in `Configuration`.
However, as stated before, such framework-specific variables cannot be directly dumped into a `json` file,
and `ctx` is obviously a run-time variable, so there is no point in preserving it in the dumped file.

##### `Configuration` for `pytorch`