## Training on multiple files

The trainer class supports training on a dataset split across multiple files.
This matters when the dataset is large and training takes a long time. With the right configuration, training can proceed in phases, so you can stop at a certain point and resume later.

**Consider this script inside the following directory layout:**

    this_dir
    |----this.ipynb -> the current notebook
    |----mod
    |	|----here is where the model files would be
    |----verb-conj
    |	|----es-verbs.txt -> file containing verb conjugation in Spanish
    |	|----en-verbs.txt -> file containing verb conjugation in English
    |----data
    |	|----example1.txt
    |	|----example2.txt
    |	|----example3.txt
    |	|----... as many training files
    |----other files

The layout matters because training, prediction, and the choice of model and language all require you to specify where the files are.

In [None]:
from PictogramsPredictionLibrary.trainerClass import trainer
from PictogramsPredictionLibrary.predictorClass import predictor

# Training process: since we are training on multiple files, we can pass an empty string as the file_path parameter
tr = trainer("", data_format = "txt", model = "example", model_dir = "./mod", language = "en")
tr.multiple_train(
    "./data",  # directory containing the training data files
    save_progress = True,  # save a version of the model after each training file is done
    delete_done = True  # delete each file from the data folder after it has been trained on
)

As you can see, `trainer.multiple_train` is a method that trains the model on different files, which can be used to complete the training in multiple phases given the proper configuration:

- although the `file_path` passed when initializing the trainer is unused, `data_format` is still used to filter which files are trained on and which are ignored.
- the `save_progress` parameter, when set to True, saves the model after each training file is finished instead of waiting for the whole training process to complete.
- the `delete_done` parameter, when set to True, deletes each file from the data folder once it has been trained on. This way, if you run the same code again later, training continues only on the files that are left.
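
To illustrate the kind of filtering `data_format` implies, here is a minimal sketch that lists the files in a directory whose extension matches a given format. The helper name `list_training_files` is hypothetical and is not part of the library. The library's actual filtering logic is not shown in this notebook and may differ.

```python
import os

def list_training_files(data_dir, data_format="txt"):
    """Return the sorted paths of regular files in data_dir ending in .<data_format>.

    Hypothetical helper for illustration only; it is an assumption about how
    extension-based filtering could work, not the library's implementation.
    """
    suffix = "." + data_format.lower()
    paths = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        # Skip sub-directories and files with a different extension
        if os.path.isfile(path) and name.lower().endswith(suffix):
            paths.append(path)
    return paths

# Example usage (assumes the directory layout shown earlier):
# list_training_files("./data", "txt")
```

Sorting the names makes the training order reproducible between runs, which `os.listdir` alone does not guarantee.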

**CAUTION: ALWAYS BACK UP YOUR DATA FILES BEFORE ENABLING** `delete_done` **MODE, IN CASE SOMETHING GOES WRONG.**
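
One possible way to make such a backup is to copy the whole data folder with the standard library before training starts. This is only a sketch: the function name `backup_data_dir` and the `./data_backup` destination are assumptions chosen for illustration, not something the library provides.

```python
import shutil
from pathlib import Path

def backup_data_dir(data_dir, backup_dir):
    """Copy data_dir (recursively) to backup_dir and return the backup path.

    Hypothetical helper for illustration. It refuses to overwrite an existing
    backup so that an earlier, complete copy is never clobbered.
    """
    src = Path(data_dir)
    dst = Path(backup_dir)
    if dst.exists():
        raise FileExistsError(f"backup already exists: {dst}")
    shutil.copytree(src, dst)
    return dst

# Example usage, run once before calling multiple_train with delete_done = True:
# backup_data_dir("./data", "./data_backup")
```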

To get a sense of what the method does, here is another way of achieving roughly the same functionality without it:

In [None]:
import os

files_dir = "./data"
files_name = [os.path.join(files_dir, n) for n in os.listdir(files_dir)]

tr = trainer("", data_format = "txt", model = "example", model_dir = "./mod", language = "en")
start = True
for f in files_name:
    tr.new_data(f)
    if start:
        tr.train()  # the first call checks the model status
        start = False
    else:
        tr.train(check_model_status = False)  # skip the check so the model status is only checked once per run
    os.remove(f)  # same effect as delete_done = True
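
If deleting the data files feels too risky, a loop like the one above could move each finished file into a separate folder instead, which keeps the run resumable without losing data. The sketch below shows that idea in isolation. The names `process_and_archive`, `done_dir`, and the `train_one` callback are hypothetical; in this notebook the callback would wrap the `tr.new_data(f)` and `tr.train(...)` calls.

```python
import os
import shutil

def process_and_archive(data_dir, done_dir, train_one):
    """Call train_one(path) on each file in data_dir, then move it to done_dir.

    Hypothetical helper for illustration. Files are moved only after
    train_one returns, so an interrupted run leaves unfinished files in place.
    done_dir should be located outside data_dir.
    """
    os.makedirs(done_dir, exist_ok=True)
    processed = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if not os.path.isfile(path):
            continue  # ignore sub-directories
        train_one(path)
        shutil.move(path, os.path.join(done_dir, name))
        processed.append(name)
    return processed

# Example usage (assumes `tr` was created as in the cell above):
# def train_one(path):
#     tr.new_data(path)
#     tr.train()
# process_and_archive("./data", "./data_done", train_one)
```

Running the function a second time only sees the files that were not yet moved, which mirrors the resume behaviour described for `delete_done`.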