# Importer
<div style="position: absolute; right:0;top:0"><a href="../evaluation.py.ipynb" style="text-decoration: none"> <font size="5">↑</font></a></div>

The Importer Module loads data from various sources into a standard format for further processing.

---
## [Datasets](./datasets.ipynb)

List of all supported datasets.

---
## Configuration

Each dataset is specified as an entry in `config.datasets`:

```json
"identifier": {
    "name": STRING,
    "run": BOOLEAN,
    "labels": BOOLEAN,
    "num_topics": LIST,
    "mod": STRING,
    "cls": STRING,
    "OTHER": ANY (optional)
}
```

- `identifier` is used for saving and identifying the dataset.
- `name` is the full name used when printing output.
- `run` defines whether the evaluation script will run anything related to this dataset.
- `labels` must be true if the dataset contains ground truth class information.
- `num_topics` is a list of the numbers of topics that should be extracted from the dataset. Each entry may be
  - a number (e.g. `5`, `10`, `100`, ...)
  - a string of the form `"#gt"` (e.g. `"gt"`, `"2gt"`, `"0.5gt"`), where `#` is an optional numeric factor. It sets the number of topics to `#` times the number of ground truth classes; a bare `"gt"` means exactly the number of ground truth classes.  
    You probably want `"gt"` in there.
- `mod` defines the Python module and `cls` the class within it that is used to import the data.
- `OTHER` stands for any additional parameters; define as many as you like. Inside the import class they are available as `self.info['data_info']['OTHER']`.
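
The `num_topics` rules above can be sketched as a small helper. This is only an illustration: the function name `resolve_num_topics`, the regular expression, and the choice to round to the nearest integer (with a minimum of one topic) are assumptions, not the evaluation script's actual code.

```python
import re

# Optional numeric factor followed by "gt", e.g. "gt", "2gt", "0.5gt"
_GT_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)?gt$")


def resolve_num_topics(entries, num_classes):
    """Turn `num_topics` config entries into concrete topic counts.

    Hypothetical helper: rounding to the nearest integer is an assumption.
    """
    resolved = []
    for entry in entries:
        if isinstance(entry, int):
            # Plain numbers are used as they are
            resolved.append(entry)
            continue
        match = _GT_PATTERN.match(entry) if isinstance(entry, str) else None
        if match is None:
            raise ValueError(f"unsupported num_topics entry: {entry!r}")
        # A bare "gt" has no factor and means 1 x the number of classes
        factor = float(match.group(1)) if match.group(1) else 1.0
        resolved.append(max(1, round(factor * num_classes)))
    return resolved


print(resolve_num_topics([5, "gt", "2gt", "0.5gt"], num_classes=20))
```

For a dataset with 20 ground truth classes, the entries `5`, `"gt"`, `"2gt"` and `"0.5gt"` resolve to 5, 20, 40 and 10 topics respectively.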
    
## Import Class

Each module has to implement a class that inherits from `util.ImporterBase` and provides a `run(self)` method that handles the import. Useful tools:
- `data.document_writer(info)` saves the imported documents; use it together with `importer.util.DocumentInfo`. Example usage:

```python
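from importer.util import DocumentInfo

# `data`, `info` and `your_dataset` are assumed to be available in the import module
with data.document_writer(info) as document_writer:
    docinfo = DocumentInfo(document_writer)
    for text, class_id in your_dataset:
        # Write each document together with its class id
        docinfo.add_document(text, class_id)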
```

- After the import, call the `save_meta` functions of `importer.util.DocumentInfo` and `importer.util.ClassInfo` to save and show some information about the dataset.

- `importer.util.ClassInfo` builds the list of classes and counts the documents in each class. Save the resulting list with `data.save_classes(classes, info)`. Example usage:

```python
import importer.util

# Count occurrences of each class
classinfo = importer.util.ClassInfo()
for document in your_dataset:
    class_id = classinfo.increase_class_count(document.classname)
# Save classes
classes = classinfo.make_class_list()
data.save_classes(classes, info)
```

- Both steps can be combined. See available files for more examples.
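
A combined import might look roughly like the sketch below. So that it runs on its own, `ImporterBase`, `ClassInfo`, `DocumentInfo`, `document_writer` and `save_classes` are simplified in-memory stand-ins written for this example; they are not the project's implementations, and the real constructor signatures and return formats may differ. `ToyImporter` and the `samples` parameter are likewise hypothetical.

```python
from collections import Counter
from contextlib import contextmanager


# --- Hypothetical stand-ins, NOT the project's real implementations ---
class ImporterBase:
    def __init__(self, info):
        self.info = info


class ClassInfo:
    def __init__(self):
        self.ids = {}
        self.counts = Counter()

    def increase_class_count(self, classname):
        # Assign the next free id on first sight, then count the document
        class_id = self.ids.setdefault(classname, len(self.ids))
        self.counts[classname] += 1
        return class_id

    def make_class_list(self):
        return [(name, self.counts[name]) for name in self.ids]


class DocumentInfo:
    def __init__(self, document_writer):
        self.document_writer = document_writer

    def add_document(self, text, class_id):
        self.document_writer.append((text, class_id))


@contextmanager
def document_writer(info):
    # Stand-in for data.document_writer(info): collects documents in memory
    yield info.setdefault("documents", [])


def save_classes(classes, info):
    # Stand-in for data.save_classes(classes, info)
    info["classes"] = classes


# --- The combined import: count classes and write documents in one pass ---
class ToyImporter(ImporterBase):
    def run(self):
        classinfo = ClassInfo()
        with document_writer(self.info) as writer:
            docinfo = DocumentInfo(writer)
            # "samples" plays the role of an OTHER parameter from the config
            for text, classname in self.info["data_info"]["samples"]:
                class_id = classinfo.increase_class_count(classname)
                docinfo.add_document(text, class_id)
        save_classes(classinfo.make_class_list(), self.info)


info = {"data_info": {"samples": [("goal!", "sports"), ("vote", "politics"), ("match", "sports")]}}
ToyImporter(info).run()
print(info["documents"])
print(info["classes"])
```

The point of the pattern is that each document is read only once: its class is registered to obtain a `class_id`, and the document is written with that id inside the same loop.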