In this notebook we build our working dataset by combining two source datasets.

<hr>

Domain dataset: species names obtained from the GBIF site.

Out-of-domain dataset: general-knowledge text in plain-text files. For this example we use abstracts from the PLOS site, but feel free to replace them with other files.

Then we draw some insights to guide the model design and the text handling, such as the size and number of the samples and the strategy to follow.

We also use the <a href="https://bitbucket.org/conabio_cmd/conabio_ml_text">CONABIO ML Text</a> library to template an end-to-end pipeline.

In [None]:
import pandas as pd
import numpy as np

import pydash
import re
import string
import random

from pathlib import Path
from pprint import pprint

In [None]:
# If you use the CONABIO_ML library code always remember to update your PYTHONPATH env variable with
# export PYTHONPATH=`pwd`:`pwd`/conabio_ml_text/conabio_ml:`pwd`/conabio_ml_text

In [None]:
from conabio_ml_text.datasets import Dataset

We build the dataset using two types of samples: species names and common knowledge words.

In [None]:
base_dataset_path = Path("dataset")

d_dataset_path = base_dataset_path / "species.txt"
ood_dataset_path = base_dataset_path / "text_files"

The domain-specific dataset is a plain-text file (<i>we assume it is located at `base_dataset_path / "species.txt"`</i>) that contains taxonomic trees of species in the following format:

`species_parent, species, … `.
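As a minimal sketch of how one line in this format can be turned into individual samples, the snippet below normalizes a made-up line. The sample names and the `normalize_line` helper are illustrative assumptions, not part of the dataset tooling. Each name is lowercased, trimmed of surrounding whitespace (an extra step added here for safety), and multi-word names are joined with underscores.

```python
from typing import List


def normalize_line(line: str) -> List[str]:
    """Split one comma-separated line into lowercase, underscore-joined names."""
    names = []
    for raw in line.split(","):
        name = raw.strip().lower()
        if name:  # skip empty fields, e.g. from a trailing comma
            names.append(name.replace(" ", "_"))
    return names


# Hypothetical sample line, used only for illustration.
sample_line = "Panthera, Panthera leo, Panthera onca"
print(normalize_line(sample_line))
```

Joining the words of a name with underscores keeps every name as a single whitespace-free sample.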

<hr>

We obtained this dataset from the <a href="https://www.gbif.org/developer/species">GBIF species API</a>, using `Animalia` as the root of the taxonomic tree.

You can gather the JSON representations of the taxonomic tree and then convert them to plain text, according to your needs, using the `dataset_builder.py` script.
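The internals of `dataset_builder.py` are not shown here, so the following is only a sketch of one way to flatten a nested tree into comma-separated lines. The JSON shape (a `name` key and a `children` list), the one-line-per-parent layout, and the `tree_to_lines` helper are all assumptions made for illustration; the real GBIF responses and the script may be organized differently.

```python
import json
from typing import Dict, Iterator


def tree_to_lines(node: Dict) -> Iterator[str]:
    """Yield one 'parent,child,child,...' line per node that has children."""
    children = node.get("children", [])
    if children:
        yield ",".join([node["name"]] + [child["name"] for child in children])
    for child in children:
        yield from tree_to_lines(child)


# Hypothetical tree; the key names are assumptions, not the GBIF schema.
raw = """
{"name": "Felidae", "children": [
    {"name": "Panthera", "children": [
        {"name": "Panthera leo"}, {"name": "Panthera onca"}]},
    {"name": "Felis", "children": [{"name": "Felis catus"}]}]}
"""
lines = list(tree_to_lines(json.loads(raw)))
print("\n".join(lines))
```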

The dataset for this example is available for download <a href="https://tctp-datasets.s3.us-south.cloud-object-storage.appdomain.cloud/species.txt">HERE</a>.

In [None]:
species_names = ""
with open(d_dataset_path) as _f:
    species_names = _f.read()
    
species_trees = species_names.split("\n")
species = set(pydash.chain(species_trees)\
              .filter(lambda x: len(x) > 0)\
              .map(lambda x: x.lower().split(","))\
              .flatten()\
              .map(lambda x: x.replace(" ", "_"))\
              .value())

domain_dataset = pd.DataFrame(list(species), columns = ["item"])
domain_dataset["label"] = "species"
domain_dataset

In [None]:
len(domain_dataset["item"].unique())

Now we draw some basic insights from the domain dataset (d-dataset).

In [None]:
species_lengths = domain_dataset["item"].str.len()
# Names are underscore-joined (spaces were replaced above), so split on "_" to count words.
species_words = domain_dataset["item"].apply(lambda x: len(x.split("_")))

MIN_CHAR_SIZE = int(np.min(species_lengths))
pprint(f"Mean char size of species: {np.mean(species_lengths)}.")
pprint(f"Max char size of species: {np.max(species_lengths)}.")
pprint(f"Min char size of species: {MIN_CHAR_SIZE}.")

MEAN_SPECIES_WORDS = int(np.mean(species_words))
pprint(f"Mean word size of species: {MEAN_SPECIES_WORDS}. Max word size of species: {np.max(species_words)}")



pprint(f"Dataset size: {len(domain_dataset)}")

Finally, at the word level, we count the number of unique tokens in the dataset.

In [None]:
# Items are underscore-joined, so split on "_" to recover the individual words.
word_level = pydash.chain(domain_dataset["item"])\
            .map(lambda x: set(x.split("_")))\
            .reduce(lambda x, y: x.union(y), set())\
            .value()

pprint(f"And we have {len(word_level)} unique species words.")

<hr>

For the out-of-domain dataset, we use a set of `1000` abstracts with the subject `health sciences`, obtained from the <a href="https://plos.org/">PLOS site</a> and gathered using the <a href="https://github.com/thecopy-and-thepaste/qtod">qtod module</a>.

You can use your own plain-text files or download the files we are working with from <a href="https://tctp-datasets.s3.us-south.cloud-object-storage.appdomain.cloud/clsspec_text_files.zip">HERE</a> and extract them. We assume the files are located at `base_dataset_path / "text_files"`.

Then we extract 1- to 3-grams (groups of one to three consecutive tokens), making sure that samples do not repeat.
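To make the chunking concrete, here is a small sketch on an in-memory token list. The `chunk_tokens` helper and the sample sentence are illustrative assumptions. It walks the tokens once, takes a random group of 1 to 3 at each step, and keeps only the first occurrence of each group. In the notebook itself, deduplication happens later with `unique()`; the sketch folds it into the helper for brevity.

```python
import random
from typing import List


def chunk_tokens(tokens: List[str], rng: random.Random) -> List[str]:
    """Group consecutive tokens into random 1- to 3-grams, dropping repeats."""
    seen = set()
    chunks = []
    ix = 0
    while ix < len(tokens):
        step = rng.randint(1, 3)  # size of the next n-gram
        chunk = "_".join(tokens[ix:ix + step])
        if chunk not in seen:
            seen.add(chunk)
            chunks.append(chunk)
        ix += step
    return chunks


# Hypothetical sentence; a seeded generator makes the run repeatable.
tokens = "the study measured blood pressure in adult patients".split()
chunks = chunk_tokens(tokens, random.Random(0))
print(chunks)
```

Because the groups do not overlap, every token ends up in exactly one sample before deduplication.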

In [None]:
# Raw string avoids invalid-escape warnings for \d.
re_numbers = re.compile(r'^[-+]?[\d.]+(?:e-?\d+)?$')

def ood_preproc(item_path: str):
    try:
        with open(item_path, mode="r", encoding='utf-8') as _f:
            item = _f.read()

        tokens = []
        # Remove punctuation and numbers; URLs are only split into tokens by the punctuation step.
        item = item.lower()

        # Replace every punctuation character with a space.
        item = item.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation)))

        for token in item.split():
            # Skip purely numeric tokens.
            if re_numbers.match(token):
                continue

            tokens.append(token)

        ix = 0
        while ix < len(tokens):
            step = random.randint(1, 3)

            yield "_".join(tokens[ix: ix+step])
            ix += step
    except Exception as ex:
        print(ex)
        print(item_path)

In [None]:
ood = Dataset.from_folder(source_path=ood_dataset_path,
                          extensions=["txt"],
                          recursive=False,
                          label_by_folder_name=True,
                          split_by_folder=False,
                          include_id=False,
                          item_reader = ood_preproc)

In [None]:
ood_items = ood.data["item"].unique()
ood_dataset = pd.DataFrame(ood_items, columns=["item"])
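# Keep only items longer than the shortest species name.
ixs = ood_dataset["item"].str.len() > MIN_CHAR_SIZE

# Copy the filtered frame so the label assignment does not trigger SettingWithCopyWarning.
ood_dataset = ood_dataset.loc[ixs].copy()
ood_dataset["label"] = "non_species"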
ood_dataset

Then we save the combined data as a new dataset.

In [None]:
dataset = pd.concat([ood_dataset, domain_dataset])
dataset = dataset.reset_index(drop=True)
dataset.to_csv(base_dataset_path / "dataset.csv")
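Because `to_csv` writes the DataFrame index as an unnamed first column by default, reading the file back with plain pandas needs `index_col=0` to recover the same two columns. The sketch below checks that round trip on a tiny made-up frame in a temporary directory; the sample rows are illustrative only, and how `Dataset.from_csv` treats the index column is not covered here.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical rows standing in for the real dataset.
toy = pd.DataFrame({
    "item": ["panthera_leo", "blood_pressure"],
    "label": ["species", "non_species"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "dataset.csv"
    toy.to_csv(csv_path)  # the index is written as the first column
    restored = pd.read_csv(csv_path, index_col=0)

print(restored)
```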

In [None]:
len(dataset), len(ood_dataset["item"].unique()) + len(domain_dataset["item"].unique())
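The comparison above confirms that no item is repeated within a class, but an item could still appear under both labels (for example, a one-word name that also occurs in the abstracts). As a hedged sketch on made-up rows, the snippet below lists such conflicting items; the `toy` frame and its values are illustrative assumptions, not results from the real dataset.

```python
import pandas as pd

# Hypothetical rows: "panthera" appears under both labels.
toy = pd.DataFrame({
    "item": ["panthera", "panthera_leo", "panthera", "blood_pressure"],
    "label": ["species", "species", "non_species", "non_species"],
})

# Count how many distinct labels each item carries.
labels_per_item = toy.groupby("item")["label"].nunique()
conflicts = sorted(labels_per_item[labels_per_item > 1].index)
print(conflicts)
```

Whether to drop such items or keep them under a single label is a modeling decision left to the reader.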

In [None]:
dataset

In [None]:
pprint(f"{len(dataset)} samples")
pprint(f'Species samples: {len(dataset[dataset["label"] == "species"])}')
pprint(f'Non-Species samples: {len(dataset[dataset["label"] == "non_species"])}')

In [None]:
ds = Dataset.from_csv(base_dataset_path / "dataset.csv")
ds.reporter(".", {})