### imports:
    - os - for building file paths
    - yaml - for file i/o
    - xrsdkit.models.train - for training models
    - xrsdkit.tools.ymltools - for reading the dataset files

In [1]:
import os

import yaml
from xrsdkit.models import train as xrsdtrain
from xrsdkit.tools import ymltools as xrsdyml

# a message callback is used to print information about model training.
# assign msg_callback to None or print, depending on desired verbosity.
msg_callback = None
#msg_callback = print
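
Since `print` is an accepted value, it seems reasonable to assume that the callback is invoked with a single message argument. Under that assumption, any callable of the same shape could be substituted, for example one that collects messages in a list for later inspection. The helper `make_collecting_callback` below is hypothetical and not part of xrsdkit.

```python
def make_collecting_callback():
    """Return (callback, messages); the callback appends each message to the list.

    Assumption: the training code calls the callback with one positional
    argument (the message), the same way it would call print.
    """
    messages = []

    def callback(msg):
        messages.append(str(msg))

    return callback, messages


msg_callback, training_log = make_collecting_callback()
msg_callback('example message')  # stands in for a call made by the library
```

After training, `training_log` would hold whatever messages were emitted, which keeps the notebook output quiet without discarding the information.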

### read datasets:

This example uses two datasets, combined, to train the models.
xrsdyml.read_local_dataset opens all of the files in the dataset
and loads the relevant data into a pandas.DataFrame table.

In [2]:
batch_dataset_dir = os.path.join('..','batch_pd_nanoparticles','dataset')
flow_dataset_dir = os.path.join('..','flowreactor_pd_nanoparticles','dataset')

dataset, dataset_index = xrsdyml.read_local_dataset(
    [batch_dataset_dir,flow_dataset_dir],message_callback=msg_callback)
dataset

Unnamed: 0,experiment_id,sample_id,noise_model,pop0_distribution,pop0_form,pop0_interaction,pop0_lattice,pop1_distribution,pop1_form,pop1_interaction,...,pearson_expq,pearson_invexpq,q_best_hump,q_best_trough,best_hump_qwidth,best_trough_qwidth,q_best_hump_log,q_best_trough_log,best_hump_qwidth_log,best_trough_qwidth_log
0,RxnA_201602,RxnA_201602_1455304815,flat,,guinier_porod,,,,,,...,-0.755270,0.796819,0.027656,0.598075,0.000287,0.087023,0.027792,0.598300,0.000966,0.015458
1,RxnA_201602,RxnA_201602_1455307289,low_q_scatter,single,spherical,,F_cubic,,,,...,-0.248760,0.272581,0.093201,0.021478,0.000273,0.046967,0.095669,0.024021,0.001433,0.032981
2,RxnA_201602,RxnA_201602_1455308244,low_q_scatter,single,spherical,,F_cubic,,,,...,-0.242065,0.265221,0.091976,0.035677,0.000276,0.018308,0.094621,0.036239,0.001482,0.011924
3,RxnA_201602,RxnA_201602_1455308448,low_q_scatter,single,spherical,,F_cubic,,,,...,-0.243288,0.266665,0.091822,0.033143,0.000272,0.022151,0.094383,0.033937,0.001437,0.014507
4,RxnA_201602,RxnA_201602_1455306181,flat,,guinier_porod,,,,,,...,-0.557588,0.605769,0.117777,0.077262,0.004637,0.037793,-0.010435,0.078791,0.006381,0.020781
5,RxnA_201602,RxnA_201602_1455306807,low_q_scatter,single,spherical,,F_cubic,,,,...,-0.249071,0.272774,0.094297,0.184135,0.000271,0.282947,0.096668,-0.128794,0.001406,0.278410
6,RxnA_201602,RxnA_201602_1455307532,low_q_scatter,single,spherical,,F_cubic,,,,...,-0.253328,0.277930,0.516015,0.025868,1.359423,0.033237,0.517144,0.027534,0.001724,0.026654
7,RxnA_201602,RxnA_201602_1455306206,flat,,guinier_porod,,,,,,...,-0.560888,0.607783,0.015589,0.092796,0.000884,0.006439,0.020862,0.092888,0.001714,0.004332
8,RxnA_201602,RxnA_201602_1455304606,flat,,guinier_porod,,,,,,...,-0.758761,0.799748,0.606521,0.643803,0.126548,0.380625,0.605263,0.668456,0.017827,0.083519
9,RxnA_201602,RxnA_201602_1455305991,flat,,guinier_porod,,,,,,...,-0.707013,0.749449,0.023684,0.121353,0.000477,0.286984,0.024926,0.167317,0.001237,0.736573


### train models:

Using the dataset table built in the previous step, 
a set of models is trained
to automate the analysis
of samples that are similar to the training set.
Feature selection and hyperparameter optimization
usually improve model performance,
but they take time, so both are turned off for this first run.

At the end of the training, 
the output_dir will be populated with two new directories,
'classifiers' and 'regressors'.
Each of these directory trees contains a hierarchy of models.
Each trained model produces three files:

* model_name.txt - Human-readable report of training results.
* model_name.pickle - Machine-readable model, created by scikit-learn, for re-loading the model later without re-training.
* model_name.yml - Human- and machine-readable data file containing training results, model parameters, selected features, and trained hyperparameters.

In [3]:
model_output_dir = os.path.join('..','tutorial_models')
regressors,classifiers = xrsdtrain.train_from_dataframe(dataset,
                                                         train_hyperparameters=False,
                                                         select_features=False,
                                                         output_dir=model_output_dir,
                                                         message_callback=msg_callback
                                                        )
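
One way to check that training produced the three files per model described above is to walk the output directory and group files by model. This is a minimal sketch that assumes only the layout stated in the text ('classifiers' and 'regressors' subdirectories holding .txt, .pickle, and .yml files); `group_model_files` is a hypothetical helper, not an xrsdkit function.

```python
import os
from collections import defaultdict


def group_model_files(output_dir):
    """Map each model's relative path stem to the sorted extensions found for it."""
    found = defaultdict(set)
    for subdir in ('classifiers', 'regressors'):
        # os.walk yields nothing if the directory does not exist yet
        for root, _dirs, files in os.walk(os.path.join(output_dir, subdir)):
            for fname in files:
                stem, ext = os.path.splitext(fname)
                if ext in ('.txt', '.pickle', '.yml'):
                    rel = os.path.relpath(os.path.join(root, stem), output_dir)
                    found[rel].add(ext)
    return {stem: sorted(exts) for stem, exts in found.items()}


# Report any model that is missing one of its three files.
model_output_dir = os.path.join('..', 'tutorial_models')
for stem, exts in sorted(group_model_files(model_output_dir).items()):
    if len(exts) < 3:
        print(stem, 'has only', exts)
```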



### inspect the training summary:

To decide which models need optimization,
check the performance metrics for the latest training run.
They are recorded in the overall training summary
and in each model's own training summary.

In [4]:
training_summary_file = os.path.join(model_output_dir,'training_summary.yml')
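# yaml.load without a Loader is rejected by newer PyYAML versions;
# safe_load assumes the file holds only plain YAML types.
with open(training_summary_file,'r') as f:
    summary = yaml.safe_load(f)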
#summary
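
The layout of the summary is not described here, so a generic way to browse it is to flatten the nested mapping into dotted keys. The sketch below assumes only that the loaded summary is a nested dict; `flatten` is a hypothetical helper, and the example data is a placeholder rather than real training output.

```python
def flatten(node, prefix=''):
    """Yield (dotted_key, value) pairs for every leaf of a nested dict."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f'{prefix}.{key}' if prefix else str(key))
    else:
        yield prefix, node


# Hypothetical stand-in for the loaded summary; the real keys may differ.
example_summary = {'group_a': {'model_x': {'some_metric': 0.5}}, 'group_b': {}}
for key, value in flatten(example_summary):
    print(key, value)
```

Passing the actual `summary` object instead of `example_summary` would list every recorded value on its own line, which makes it easier to search for a particular model or metric.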


### add feature selection and hyperparameter training

Model performance can be improved by selecting input features and tuning hyperparameters.
Feature selection is performed by a recursive leave-one-feature-out process,
and hyperparameter training is performed by grid search.
In both cases, the cross-validated value of the model's training metric
is used as the objective function.

In [5]:
# TODO
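
Until this cell is filled in, the two procedures can be illustrated generically. The sketch below is not xrsdkit's implementation: it is one plain-Python reading of "recursive leave-one-feature-out" (greedy backward elimination) and of grid search, with the objective passed in as a `score` callable. In practice `score` would return a cross-validated metric; here a toy function stands in for it, and all names are hypothetical.

```python
from itertools import product


def backward_eliminate(features, score):
    """Greedily drop features while dropping one improves score(features).

    `score` maps a tuple of feature names to a number (higher is better).
    """
    current = tuple(features)
    best = score(current)
    while len(current) > 1:
        # Score every subset obtained by leaving exactly one feature out.
        trials = [tuple(f for f in current if f != left_out) for left_out in current]
        trial_scores = [score(t) for t in trials]
        top = max(trial_scores)
        if top <= best:
            break  # no single removal helps, so stop
        current, best = trials[trial_scores.index(top)], top
    return current, best


def grid_search(grid, score):
    """Evaluate score(params) on every combination in `grid`; return the best."""
    names = sorted(grid)
    candidates = [dict(zip(names, values))
                  for values in product(*(grid[n] for n in names))]
    return max(candidates, key=score)


def toy_score(feats):
    # Toy objective: only 'a' and 'b' are informative; each feature costs 0.1.
    return len({'a', 'b'} & set(feats)) - 0.1 * len(feats)


selected, selected_score = backward_eliminate(['a', 'b', 'c', 'd'], toy_score)
best_params = grid_search({'C': [0.1, 1, 10], 'alpha': [0.01, 0.1]},
                          lambda p: -abs(p['C'] - 1) - abs(p['alpha'] - 0.1))
```

Keeping the objective as a callable means the same loop works for any metric, at the cost of one model evaluation per candidate subset or parameter combination, which is why both steps are slow on real data.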

### tune the model configuration:

The model configuration file can be edited to select modeling algorithms,
choose a target performance metric,
and (TODO) tune cross-validation and hyperparameter selection processes.

In [6]:
model_config_file = os.path.join(model_output_dir,'model_config.yml')
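# yaml.load without a Loader is rejected by newer PyYAML versions;
# safe_load assumes the file holds only plain YAML types.
with open(model_config_file,'r') as f:
    config = yaml.safe_load(f)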
#config