### Extracting additional labels from text

This notebook explores the algorithm that the first-place Kaggle submission
uses to extract labels that were not included in the training set.

[notebook here](https://github.com/Coleridge-Initiative/rc-kaggle-models/blob/main/1st%20ZALO%20FTW/notebooks/get_candidate_labels.ipynb)

The first-place submission uses the discovered labels for validation only,
not for training. The code below is adapted from its notebooks.

In [1]:
import json
import regex as re
from itertools import chain
from typing import List

import spacy
import pandas as pd
from tqdm import tqdm
from unidecode import unidecode

nlp = spacy.load('en_core_web_trf')

In [2]:
kaggle_labels = pd.read_csv("../data/kaggle/train.csv")
kaggle_labels.head(2)

Unnamed: 0,Id,pub_title,dataset_title,dataset_label,cleaned_label
0,d0fa7568-7d8e-4db9-870f-f9c6f668c17b,The Impact of Dual Enrollment on College Degre...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study
1,2f26f645-3dec-485d-b68d-f013c9e05e60,Educational Attainment of High School Dropouts...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study


In [3]:
aggregated_labels = pd.DataFrame({"id": kaggle_labels["Id"].unique()})

def aggregate_clean_label(row: pd.DataFrame):
    labels = list(map(lambda x: x.strip(), row["dataset_label"].unique()))
    return "|".join(labels)

unique_labels = kaggle_labels.groupby("Id").apply(aggregate_clean_label)
aggregated_labels["label"] = aggregated_labels["id"].apply(lambda x: unique_labels[x])
aggregated_labels.head(2)

Unnamed: 0,id,label
0,d0fa7568-7d8e-4db9-870f-f9c6f668c17b,National Education Longitudinal Study|Educatio...
1,2f26f645-3dec-485d-b68d-f013c9e05e60,National Education Longitudinal Study|Educatio...


In [4]:
def get_text(document_id: str) -> str:
    with open("../data/kaggle/train/" + document_id + ".json") as f:
        document = json.load(f)

    text = unidecode(" ".join(list(map(
        lambda x: x["text"].strip().replace("\n", " "), 
        document
    ))))

    return text

In [5]:
text = get_text("d0fa7568-7d8e-4db9-870f-f9c6f668c17b")
text[:100]

'This study used data from the National Education Longitudinal Study (NELS:88) to examine the effects'

The description from the notebook says that candidates are selected in the 
following way:

```
2. (Optional) We detect the keywords (Dataset, Database, Study, Survey, ...) 
position in the input string then look forward/backward of that keyword util
meet two consecutive lowercase words.
```
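
To make the quoted rule concrete, here is a minimal sketch of one way to read it. It is not the first-place implementation: the whitespace tokenization, the punctuation handling, the reduced keyword set, and the helper names (`is_lowercase_word`, `expand_around_keyword`, `find_candidates`) are all assumptions made for illustration.

```python
from typing import List

# Illustrative subset of the keywords; the matching is case-sensitive on purpose
KEYWORDS = {"Dataset", "Database", "Study", "Survey"}
PUNCTUATION = ".,;:()"


def is_lowercase_word(token: str) -> bool:
    """True for a purely alphabetic, all-lowercase token (edge punctuation ignored)."""
    word = token.strip(PUNCTUATION)
    return word.isalpha() and word.islower()


def expand_around_keyword(tokens: List[str], idx: int) -> str:
    """Grow a span around tokens[idx] until two consecutive lowercase
    words (or the edge of the text) are met on each side."""
    start = idx
    while start > 0:
        one_before = is_lowercase_word(tokens[start - 1])
        two_before = start < 2 or is_lowercase_word(tokens[start - 2])
        if one_before and two_before:
            break
        start -= 1

    end = idx
    while end < len(tokens) - 1:
        one_after = is_lowercase_word(tokens[end + 1])
        two_after = end + 2 >= len(tokens) or is_lowercase_word(tokens[end + 2])
        if one_after and two_after:
            break
        end += 1

    return " ".join(tokens[start:end + 1])


def find_candidates(text: str) -> List[str]:
    """Return one expanded span per keyword occurrence in the text."""
    tokens = text.split()
    return [
        expand_around_keyword(tokens, i)
        for i, token in enumerate(tokens)
        if token.strip(PUNCTUATION) in KEYWORDS
    ]


sample = (
    "This study used data from the National Education Longitudinal Study "
    "(NELS:88) to examine the effects"
)
candidates = find_candidates(sample)
print(candidates)
```

A single lowercase word (such as "of" inside a title) does not stop the expansion under this reading; only a run of two does, which is why the span can cover multi-word dataset names.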

Let's try running the entity extraction model and then filtering its
predictions by the selected keywords. That seems analogous to what the
first-place submission does.

In [6]:
from importlib import reload
import src.models.regex_model as rm
import src.data.kaggle_repository as kr

In [7]:
model = rm.RegexModel(config={})
repo = kr.KaggleRepository()
data = repo.get_training_data()

In [8]:
outputs = model.inference({}, data)

In [9]:
from functools import partial


keywords = [
    "Database", "Dataset", "Databases", "Datasets",
    "Data Set", "Data System", "Data Systems", "Data Sets", "Dataset System", "Dataset Systems",
    "Survey", "Surveys", "Study", "Studies",
]

def filter_labels_by_keywords(keywords: List[str], row: pd.Series) -> str:
    # Keep only the predictions that contain at least one keyword
    preds = row["model_prediction"].strip().split("|")
    filtered = [pred for pred in preds if any(keyword in pred for keyword in keywords)]
    # Drop predictions that are already among the known labels
    labels = row["label"].strip().split("|")
    not_already_listed = [pred for pred in filtered if pred.lower() not in labels]

    return "|".join(not_already_listed)

filter_f = partial(filter_labels_by_keywords, keywords)
outputs["filtered"] = outputs.apply(filter_f, axis=1)
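
To see the effect of this filter in isolation, the sketch below applies the same idea to a two-row toy frame. The rows are made up for illustration (they only mimic the `label` and `model_prediction` columns), the keyword list is a reduced subset, and `filter_by_keywords` is a hypothetical stand-in for the function in the cell.

```python
from functools import partial
from typing import List

import pandas as pd

demo_keywords = ["Survey", "Study"]  # reduced keyword list, for illustration only


def filter_by_keywords(keywords: List[str], row: pd.Series) -> str:
    """Keep predictions that contain a keyword and are not already a known label."""
    preds = [p.strip() for p in row["model_prediction"].split("|") if p.strip()]
    known = set(row["label"].strip().split("|"))
    kept = [
        p for p in preds
        if any(k in p for k in keywords) and p.lower() not in known
    ]
    return "|".join(kept)


# Toy data: the labels are lowercase, the predictions keep their original casing
demo = pd.DataFrame({
    "label": [
        "common core of data",
        "agricultural resource management survey",
    ],
    "model_prediction": [
        "American Community Survey|Census Bureau",
        "Agricultural Resource Management Survey|"
        "Agricultural Resource Management Survey (ARMS)",
    ],
})

demo["filtered"] = demo.apply(partial(filter_by_keywords, demo_keywords), axis=1)
print(demo["filtered"].tolist())
```

Note that the comparison against known labels is an exact match on the lowercased string, so a variant such as a name with its acronym appended still passes through as a candidate.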

Let's see which candidates we caught that the original labels may have
missed.

In [10]:
outputs.loc[:, ["id", "label", "filtered"]]

Unnamed: 0,id,label,filtered
0,5b466b5d-6b6f-48cf-8364-3893ce09c8ec,common core of data,American Community Survey (ACS)|Census Bureau'...
1,0a2c7004-f763-4846-b95f-1fdf537f8a04,agricultural resource management survey,Agricultural Resource Management Survey (ARMS)
2,86cef975-b9a2-44c7-a480-cbe918e72159,early childhood longitudinal study,NICHD Study of Early Child Care and Youth Deve...
3,baec0fbc-4ef7-4b27-af66-843d393640bd,national water level observation network,
4,24a55c45-eaf8-4066-98e3-349c6eff6186,adni|alzheimer's disease neuroimaging initiati...,
...,...,...,...
11447,e1c78694-d96b-487f-b445-fd692c5fb84e,adni,
11448,10a7d47c-cd38-4763-bb4b-e5804a670b90,our world in data,
11449,622123b8-bed9-4f4f-b026-158e552f0839,adni|alzheimer's disease neuroimaging initiati...,
11450,90dad306-ae3b-4016-9f60-cf45d76bc0f2,baltimore longitudinal study of aging (blsa)|b...,Framingham Heart Study |National Health and Nu...


In [11]:
outputs.loc[outputs["id"]=="5b466b5d-6b6f-48cf-8364-3893ce09c8ec", ["id", "label", "filtered"]].values

array([['5b466b5d-6b6f-48cf-8364-3893ce09c8ec', 'common core of data',
        "American Community Survey (ACS)|Census Bureau's Center for Economic Studies |American Community Survey"]],
      dtype=object)

Let's look at one of the examples.

In `5b466b5d-6b6f-48cf-8364-3893ce09c8ec`, the listed labels are: 
- `common core of data`

The candidate labels are:
- `American Community Survey (ACS)` (https://www.census.gov/programs-surveys/acs/) This seems to be a dataset
- `Census Bureau's Center for Economic Studies` This seems to be a false positive
- `American Community Survey` (https://www.census.gov/programs-surveys/acs/) This seems to be a dataset

The approach used in the first-place submission excludes these candidates from training, which seems to be a good idea: they mix real datasets with false positives.

Let's convert a document into `positive`, `negative`, and `candidate` samples.
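
As a rough illustration of the three sample types, the sketch below labels each sentence of a toy document. The definitions are an assumption, not taken from the first-place notebook: a sentence is `positive` if it mentions a known label, `candidate` if it mentions only a candidate label, and `negative` otherwise. The naive regex sentence splitter and the helper names are also hypothetical; the notebook itself relies on spaCy for sentence segmentation.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter on sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentence(sentence: str, labels: List[str], candidates: List[str]) -> str:
    """Assign a sample type using case-insensitive substring matching."""
    lowered = sentence.lower()
    if any(label.lower() in lowered for label in labels):
        return "positive"
    if any(candidate.lower() in lowered for candidate in candidates):
        return "candidate"
    return "negative"


# Toy document and labels, made up for illustration
document = (
    "We use the Common Core of Data. "
    "Income comes from the American Community Survey. "
    "Results are reported below."
)
known_labels = ["common core of data"]
candidate_labels = ["American Community Survey"]

sample_types = [
    label_sentence(sentence, known_labels, candidate_labels)
    for sentence in split_sentences(document)
]
print(sample_types)
```

Under this reading, `candidate` sentences can simply be left out of the training set while remaining available for validation.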

In [12]:
type(nlp)

spacy.lang.en.English

In [13]:



def detect_labels(labels:List[re.Pattern], sentence:str) -> List[List[str]]:
    return list(map(
        lambda match: match.captures(), # It's possible to have more than one match
        filter(
            bool,
            map(
                lambda rl: rl.search(sentence),
                labels
            )
        )
    ))

def tag_sentence(regex_labels:List[re.Pattern], sentence:spacy.tokens.span.Span):
    match_lists = sorted(
        detect_labels(regex_labels, sentence.text), 
        key=lambda x: max(map(len, x)), 
        reverse=True
    )

    tokens = [token.text for token in sentence]
    tags = [token.tag_ for token in sentence]
    ner_tags = ["O"] * len(sentence) # assume no match

    for match in chain.from_iterable(match_lists):
        label_tokens = nlp(match)
        # NOTE: index() returns the first occurrence of the label's first token,
        # which may not be where the match starts if that token appears earlier
        start_idx = tokens.index(label_tokens[0].text)
        idxs = list(range(start_idx, start_idx + len(label_tokens)))

        prev_tag = ner_tags[start_idx - 1] if start_idx > 0 else "O"
        # If any of these tokens is already tagged, this match could be a
        # subset of another (longer) match, so skip it
        if all(tag == "O" for tag in ner_tags[start_idx: start_idx + len(label_tokens)]):
            # IOB1-style tagging: "B-DAT" is used only when the span directly
            # follows another tagged span; otherwise the span starts with "I-DAT"
            if prev_tag == "O":
                ner_tags[start_idx] = "I-DAT"
            else:
                ner_tags[start_idx] = "B-DAT"

            for idx in idxs[1:]:
                ner_tags[idx] = "I-DAT"

    return tokens, tags, ner_tags

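def expand_row(nlp: spacy.lang.en.English, row: pd.Series) -> pd.DataFrame:
    labels = row["label"].strip().split("|")
    # "filtered" is an empty string when there are no candidates; drop empty
    # entries so that no pattern is compiled from an empty keyword
    candidate_labels = [c for c in row["filtered"].strip().split("|") if c]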

    regex_labels = list(map(
        re.compile,
        map(
            rm.RegexModel.regexify_keyword,
            labels
        )
    ))

    regex_candidate_labels = list(map(
        re.compile,
        map(
            rm.RegexModel.regexify_keyword,
            candidate_labels
        )
    ))

    # process the text so that it can be turned into sentences and tokenized
    text = unidecode(row["text"]).strip()
    processed = nlp(text)

    

