This notebook explores the dataset of abstracts gathered from PLOS.

It then prepares the dataset for use in a classifier whose main purpose is to detect the following environments: `['coastal', 'freshwater', 'marine', 'terrestrial']`
<hr>

Remember to add the paths of `conabio_ml` and `conabio_ml_text` to your PYTHONPATH with
``export PYTHONPATH=`pwd`:`pwd`/conabio_ml_text/conabio_ml:`pwd`/conabio_ml_text``

In [None]:
# The output must include the paths of both the conabio_ml and conabio_ml_text libs
!echo $PYTHONPATH

In [None]:
import numpy as np
import pandas as pd
import pydash
import json

import conabio_ml

from pathlib import Path
from pprint import pprint
from collections import OrderedDict

from conabio_ml.pipeline import Pipeline
from conabio_ml.assets import AssetTypes
from conabio_ml_text.datasets.dataset import Dataset, Partitions

In [None]:
base_dataset_path = Path("dataset")
dataset_path = base_dataset_path / "plos_2021-01-06.csv"
results_path = Path("results")
report_path = Path("report")

We load the dataset to find out the following:
- Number of classes
- Stats about the item column (abstract): min/max/mean number of words

In [None]:
dataset = Dataset.from_csv(dataset_path)

The dataset contains the following labels.

In [None]:
labels = pd.unique(dataset.data["label"]).tolist()
pprint(labels)

We split the classes into 2 types:
- domain samples: ['coastal', 'freshwater', 'marine', 'terrestrial']
- out-of-domain-samples: ['health-sciences', 'earth-sciences', 'life-sciences']

In [None]:
items = dataset.data["item"].apply(lambda x: x.split())
word_count = items.apply(lambda x: len(x))

pprint(f"Max: {np.max(word_count)}, Min: {np.min(word_count)}, Mean: {np.mean(word_count)}, Std: {np.std(word_count)}")

These are the items with the maximum and minimum number of words:

In [None]:
# np.argmax / np.argmin return positions, so we index by position with .iloc
print(f"Max sentence:\n {dataset.data.iloc[int(np.argmax(word_count))]['item']}")
print("------------")
print(f"Min sentence:\n {dataset.data.iloc[int(np.argmin(word_count))]['item']}")

Then, we constrain the minimum number of words per sample.

That's because, after preprocessing, our samples will be transformed into fixed-length tensors of word indices (`[w0, w1, w2] -> [ix, ix, ix]`), so very short samples would be mostly padding.
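
The `[w0, w1, w2] -> [ix, ix, ix]` transformation can be sketched as follows. This is only an illustration under stated assumptions, not the preprocessing that `conabio_ml_text` performs: the helper names `build_vocab` and `encode`, the reserved indices `PAD_IX` and `UNK_IX`, and the toy sentences are all hypothetical.

```python
from typing import Dict, List

PAD_IX = 0  # assumed index reserved for padding
UNK_IX = 1  # assumed index for out-of-vocabulary words


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign an integer index to every distinct whitespace-separated word."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 2  # 0 and 1 are reserved
    return vocab


def encode(text: str, vocab: Dict[str, int], length: int) -> List[int]:
    """Map words to indices, then truncate or right-pad to `length`."""
    ixs = [vocab.get(word, UNK_IX) for word in text.split()][:length]
    return ixs + [PAD_IX] * (length - len(ixs))


# Toy sentences, for illustration only (not taken from the PLOS data)
toy_texts = ["coral reefs host many fish", "rivers host fish"]
toy_vocab = build_vocab(toy_texts)
encoded = [encode(t, toy_vocab, length=4) for t in toy_texts]
print(encoded)  # [[2, 3, 4, 5], [7, 4, 6, 0]]
```

The first sentence is truncated to 4 indices and the second is padded with `PAD_IX`. The shorter a sample is relative to the fixed length, the larger the share of padding in its tensor.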

In [None]:
MIN_NUM_WORDS = 16
min_dataset_path = base_dataset_path / "min_words_dataset.csv"

ix_min_words = dataset.data["item"].apply(lambda x: len(x.split()) >= MIN_NUM_WORDS)
min_words_dataset = dataset.data[ix_min_words].reset_index()
min_words_dataset.to_csv(min_dataset_path)

Using the min-words dataset, we compute the same stats as before.

In [None]:
dataset = Dataset.from_csv(min_dataset_path)
dataset.reporter(report_path / "dataset", {})

items = dataset.data["item"].apply(lambda x: x.split())
word_count = items.apply(lambda x: len(x))

pprint(f"Max: {np.max(word_count)}, Min: {np.min(word_count)}, Mean: {np.mean(word_count)}, Std: {np.std(word_count)}")

Broadly, we have the following stats for the number of words:

- Max: 1524
- Min: 16
- Mean: ~261
- Std: ~89

We need to calculate the fraction of samples that fall below some values of `word_count`.

In [None]:
TH_1 = 261                    # mean
TH_2 = 261 + (1 * 89)         # mean + 1 std
TH_25 = 261 + int(1.5 * 89)   # mean + 1.5 std
TH_3 = 261 + (2 * 89)         # mean + 2 std
# Fraction of samples whose word count is below each threshold
(f"With {TH_1} words: {(word_count < TH_1).mean()} of the dataset",
 f"{TH_2} words: {(word_count < TH_2).mean()} of the dataset",
 f"{TH_25} words: {(word_count < TH_25).mean()} of the dataset",
 f"{TH_3} words: {(word_count < TH_3).mean()} of the dataset")

So, for padding purposes during training, we use a sample length of `450` words.
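
As a minimal sketch of how a padding length can be checked against the distribution, the snippet below computes the fraction of samples that fit within a candidate length. The word counts are made up for illustration and the `coverage` helper is hypothetical. It uses `<=` (a sample fits without truncation), whereas the cell above uses a strict `<`.

```python
from statistics import mean, pstdev

# Hypothetical word counts, for illustration only (not the PLOS data)
toy_word_count = [120, 180, 240, 260, 275, 300, 340, 410, 520, 1500]


def coverage(word_count, max_len):
    """Fraction of samples that fit in max_len words without truncation."""
    return sum(n <= max_len for n in word_count) / len(word_count)


# pstdev is the population standard deviation, matching np.std's default
mu, sigma = mean(toy_word_count), pstdev(toy_word_count)
for k in (0, 1, 2):
    max_len = int(mu + k * sigma)
    print(f"mean + {k} std = {max_len}: {coverage(toy_word_count, max_len):.2f}")
print(f"450 words: {coverage(toy_word_count, 450):.2f}")
```

A single very long sample inflates both the mean and the standard deviation, so it can be worth comparing the mean-plus-std thresholds against a fixed candidate such as `450` before settling on a length.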
<hr>

Finally, we drop one class of the out-of-domain set (`[health-sciences, life-sciences, earth-sciences]`) to balance the dataset, and group the remaining two under a single `other` label.

This leaves the following labels to classify:

`['coastal', 'freshwater', 'marine', 'terrestrial', 'other']`
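
On a toy table, dropping one out-of-domain class and grouping the rest can be sketched as follows. The rows and the names `dropped` and `grouped` are hypothetical, and this simplified version ignores the repeated `DOC_ID` handling that the notebook needs on the real data.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "DOC_ID": [1, 2, 3, 4, 5],
    "label": ["marine", "health-sciences", "life-sciences",
              "earth-sciences", "coastal"],
})

dropped = ["earth-sciences"]
grouped = {"health-sciences": "other", "life-sciences": "other"}

# Remove the dropped class, then rename the remaining out-of-domain labels
toy = toy[~toy["label"].isin(dropped)].copy()
toy["label"] = toy["label"].replace(grouped)
print(toy["label"].tolist())  # ['marine', 'other', 'other', 'coastal']
```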

In [None]:
# We remove 1 (earth-sciences) of the out-of-domain (ood) classes
classes = ["health-sciences", "life-sciences", "terrestrial", "marine", "freshwater", "coastal"]
result_labels = ["other", "terrestrial", "marine", "freshwater", "coastal"]
# Copy so the label assignment below does not write into a slice of dataset.data
dataset_5classes = dataset.data[dataset.data["label"].isin(classes)].copy()

labels_to_group = ["health-sciences", "life-sciences"]

# Some items might be repeated in the sets: keep life-sciences rows only when
# their DOC_ID is not already among the health-sciences rows
hs_rows = dataset_5classes[dataset_5classes["label"] == "health-sciences"]
temp = dataset_5classes[dataset_5classes["label"] == "life-sciences"]
ls_rows = temp[~temp["DOC_ID"].isin(hs_rows["DOC_ID"])]

# Index labels of the rows to relabel (recent pandas versions reject a set as a .loc indexer)
ixs = hs_rows.index.union(ls_rows.index)
dataset_5classes.loc[ixs, "label"] = "other"

# life-sciences rows that repeat a health-sciences DOC_ID keep their label and are dropped here
prunned_dataset = dataset_5classes[dataset_5classes["label"].isin(result_labels)]

In [None]:
labels = pd.unique(prunned_dataset["label"])
unique_items = pd.unique(prunned_dataset["DOC_ID"])

len(unique_items), len(prunned_dataset)

The dataset contains fewer unique items (`DOC_ID` values) than rows, so some items appear under more than one label.

We produce 2 datasets with this info.

- The simplified version of the dataset with one class for each item.
- The multilabel version of the dataset, using all samples.
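
The difference between the two versions can be illustrated on a toy table. This is a sketch under assumptions: the rows are made up, and `keep="first"` is only a stand-in for the rule the notebook applies, which resolves repeated items by the order in which labels are processed.

```python
import pandas as pd

# Hypothetical rows: documents 1 and 3 each appear under two labels
toy = pd.DataFrame({
    "DOC_ID": [1, 1, 2, 3, 3],
    "label": ["marine", "coastal", "freshwater", "terrestrial", "other"],
})

# Multiclass: keep a single row (hence a single label) per document
multiclass = toy.drop_duplicates(subset="DOC_ID", keep="first")

# Multilabel: keep every row; collecting labels per document shows the overlap
labels_per_doc = toy.groupby("DOC_ID")["label"].apply(list)

print(len(multiclass), len(toy))  # 3 5
print(labels_per_doc.to_dict())
```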

In [None]:
samples_count = OrderedDict()
frames = []
doc_ids = set()

# Key by label: keying by amount would overwrite labels that share the same count
for label in labels:
    samples_count[label] = len(prunned_dataset[prunned_dataset["label"] == label])

for label, sample_count in samples_count.items():
    # Items with the current label
    temp_dataset = prunned_dataset[prunned_dataset["label"] == label]

    # Unique doc_ids not yet assigned to a previously processed label
    unique_items = set(temp_dataset["DOC_ID"])
    label_items = unique_items - doc_ids

    frames.append(temp_dataset[temp_dataset["DOC_ID"].isin(label_items)])

    doc_ids = doc_ids.union(unique_items)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
simplified_dataset = pd.concat(frames)

In [None]:
simplified_dataset.to_csv(base_dataset_path / "dataset_multiclass.csv")
prunned_dataset.to_csv(base_dataset_path / "dataset_multilabel.csv")

In [None]:
res = Dataset.from_csv(base_dataset_path / "dataset_multiclass.csv")
res.reporter(report_path / "dataset", {})

In [None]:
res = Dataset.from_csv(base_dataset_path / "dataset_multilabel.csv")
res.reporter(report_path / "dataset_merged", {})