This notebook analyzes the performance improvements to Models 2 and 3 with
respect to particular keywords or datasets.

The models are evaluated on the Kaggle competition's held-out private validation
set.

Improvements on the larger Elsevier corpus can be evaluated in the future,
but will require further validation because there is no ground truth.

**Note on Methodology:** At a high level, the submissions in the Kaggle
competition were all some form of ensemble. Although each has a main component
or model/methodology, the submitted models were all wrapped in extra
heuristics to improve the final results. Evaluating a submission is therefore
analogous to measuring an ensemble consisting of the core model plus the
heuristic model. This notebook analyzes only the core methodology of the
models, not the extra heuristics. The heuristics, though helpful for the
competition, are not robust and tend not to add value over and above the
overall ensemble of all the Kaggle submissions.

In [1]:
import re

import pandas as pd

import democratizing_data_ml_algorithms.models.kaggle_model2 as km2
import democratizing_data_ml_algorithms.models.kaggle_model3 as km3
import democratizing_data_ml_algorithms.models.kaggle_model3_regex_inference as km3r
import democratizing_data_ml_algorithms.models.regex_model as rm
import democratizing_data_ml_algorithms.models.schwartz_hearst_model as shm
import democratizing_data_ml_algorithms.evaluate.model as em
import democratizing_data_ml_algorithms.data.kaggle_repository as kr

In [2]:
from importlib import reload

In [3]:
class MockRepo:
    """Stand-in repository that serves a pre-filtered validation DataFrame."""
    def __init__(self, df):
        self.df = df
    def get_validation_data(self):
        return self.df

In [4]:
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

In [5]:
# Load the NCSES priority dataset names (first CSV column, header row skipped)
with open("../data/entity_classification/ncses_priorities.csv", "r") as f:
    ncses_priorities = [line.strip().split(",")[0].replace('"', '') for line in f.readlines()[1:]]
    ncses_priorities_cleaned = list(map(clean_text, ncses_priorities))

kaggle_validation = pd.read_csv("../data/kaggle/validation.csv")

In [6]:
kaggle_validation

Unnamed: 0,Id,PredictionString,Usage
0,7bfd8bb51,dbgap,Ignored
1,7d3e31302,sass|sass data|school and staffing sass data|s...,Private
2,3644f959a,foodaps|national household food acquisition an...,Public
3,ed3527cf3,database of genotypes and phenotypes|database ...,Ignored
4,236061129,dbgap,Ignored
...,...,...,...
8058,3cb8b0e09,database of genotypes and phenotypes|database ...,Ignored
8059,324fe1310,dbgap,Ignored
8060,39e66a274,1000 genomes project|1000 genomes project 1000...,Private
8061,c411b1b6c,dbgap|american cancer society cancer preventio...,Public


In [8]:
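# Keep rows whose pipe-delimited labels include at least one NCSES priority dataset
ncses_names = set(ncses_priorities_cleaned)
ncses_mask = kaggle_validation["PredictionString"].apply(lambda s: not ncses_names.isdisjoint(s.split("|")))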

ncses_mask.sum()

kaggle_validation_ncses = kaggle_validation.loc[ncses_mask, :]

In [9]:
valid_ids = kaggle_validation_ncses["Id"].values
kaggle_validation_ncses

Unnamed: 0,Id,PredictionString,Usage
46,d7d2d07d4,business r d and innovation survey|brdis|surve...,Public
57,b8f279039,business r d and innovation survey|r d|brdis|s...,Public
62,4b871ab7c,national survey of college graduates|decennial...,Public
64,152b44241,nscg|national survey of college graduates,Private
66,e18be232b,national survey of recent college graduates|na...,Public
...,...,...,...
8013,8799c68ad,nsf survey of federal funds for research and d...,Ignored
8019,c6fb73a43,national survey of college graduates|cgs gre s...,Public
8030,2355997d9,current population survey|nls72|nscg|national ...,Private
8040,ea59e8f5f,national survey of recent college graduates|be...,Public


In [10]:
repo = kr.KaggleRepository()
validation_dataframe = repo.get_validation_data()
validation_dataframe_ncses = validation_dataframe.loc[validation_dataframe["id"].isin(valid_ids), :]

### Model 3 

Model 3 is the fastest of the Kaggle models. It is also the most
rigid. It uses a set of heuristics to extract datasets from a new set of
documents. Its search is quite limited, because it relies on a simple string
search for the datasets being looked for.

**Improving Model 3**

One idea to improve this model is to relax some of the constraints by using
regular expressions. Rather than doing a simple string search, we transform
each of the datasets being searched for into a regular expression. This allows
us to be flexible about casing when we want to be, but remain strict when
flexibility would be less helpful (e.g. acronyms that share a spelling with
other common words).
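
A minimal sketch of this idea, assuming dataset labels arrive as plain strings and that a single all-uppercase token should be treated as an acronym. The helper name `dataset_to_pattern` and the acronym rule are illustrative assumptions, not the logic inside `km3r.Kaggle3RegexInference`.

```python
import re


def dataset_to_pattern(name: str) -> re.Pattern:
    """Turn one dataset label into a compiled regular expression (illustrative)."""
    tokens = name.split()
    # Assumption: a single all-uppercase token is an acronym and stays case-sensitive.
    is_acronym = len(tokens) == 1 and name.isupper()
    # Escape each word and allow any run of whitespace between words.
    body = r"\s+".join(re.escape(token) for token in tokens)
    flags = 0 if is_acronym else re.IGNORECASE
    # Lookarounds stop the label from matching inside a longer word.
    return re.compile(rf"(?<!\w){body}(?!\w)", flags)


long_name = dataset_to_pattern("National Survey of College Graduates")
acronym = dataset_to_pattern("SASS")

# The long name matches regardless of casing or extra spaces.
found_long = long_name.search("the national survey of  college graduates was used")
# The acronym only matches in its exact uppercase form.
found_lower = acronym.search("compiled with sass")
found_upper = acronym.search("the SASS data")
```

Keeping acronyms case-sensitive is what prevents a label such as `SASS` from matching the unrelated lowercase word, while long names tolerate the casing and spacing variation found in running text.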

In [17]:
config = {
    "model_path": "../models/kaggle_model3/baseline/params.txt",
}

In [18]:
evaluation = em.evaluate_model(
    MockRepo(validation_dataframe_ncses.copy()),
    km3.KaggleModel3(),
    config,
)
evaluation


        Model Evaluation:

        - Run time: 6.493399143218994 seconds, avg: 0.016480708485327396 seconds per sample
        - True Postive Count: 554, avg: 1.4060913705583757 per sample
        - Precision: 0.6367816091954023
        - Recall: 0.17328745699092898
        

In [19]:
evaluation_regex = em.evaluate_model(
    MockRepo(validation_dataframe_ncses.copy()),
    km3r.Kaggle3RegexInference(config),
    config,
)
evaluation_regex


        Model Evaluation:

        - Run time: 70.35855960845947 seconds, avg: 0.17857502438695297 seconds per sample
        - True Postive Count: 1966, avg: 4.98984771573604 per sample
        - Precision: 0.8920145190562614
        - Recall: 0.6184334696445423
        

### Model 2

Model 2 from the Kaggle competition can be broken into two parts. The first part
performs *entity extraction* and the second part performs *entity
classification*.

*Entity Extraction*

In the original submission, the entity extraction methodology was based on the
[Schwartz & Hearst (2003)](https://pubmed.ncbi.nlm.nih.gov/12603049/) algorithm,
which is an algorithm for extracting abbreviations and their definitions from
biomedical text.
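
The core matching step of this algorithm can be sketched as follows. This is a simplified port of the long-form search described in the paper, not the `shm.SchwartzHearstModel` implementation evaluated in this notebook; the function names and the parenthesis pattern are assumptions made for illustration, and several of the paper's validity checks are omitted.

```python
import re
from typing import List, Optional, Tuple

# Assumed pattern for a short form in parentheses, e.g. "(NELS)".
PAREN = re.compile(r"\(([A-Za-z0-9][A-Za-z0-9\-&]{1,9})\)")


def find_best_long_form(short_form: str, candidate: str) -> Optional[str]:
    """Match short-form characters right to left against the candidate text."""
    s_idx = len(short_form) - 1
    l_idx = len(candidate) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if not ch.isalnum():
            s_idx -= 1
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the start of a word.
        while (l_idx >= 0 and candidate[l_idx].lower() != ch) or (
            s_idx == 0 and l_idx > 0 and candidate[l_idx - 1].isalnum()
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    start = candidate.rfind(" ", 0, l_idx + 1) + 1
    return candidate[start:]


def extract_pairs(sentence: str) -> List[Tuple[str, str]]:
    """Return (long form, short form) pairs of the shape LONG NAME (ABBR)."""
    pairs = []
    for match in PAREN.finditer(sentence):
        short = match.group(1)
        words = sentence[: match.start()].split()
        # Search window from the paper: min(|A| + 5, |A| * 2) words.
        max_words = min(len(short) + 5, len(short) * 2)
        long_form = find_best_long_form(short, " ".join(words[-max_words:]))
        if long_form:
            pairs.append((long_form, short))
    return pairs


pairs = extract_pairs("We used the National Education Longitudinal Study (NELS) data.")
```

Because every pair must be anchored on a parenthesised short form, a dataset that is mentioned only by its long name produces no pair at all.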

*Entity Classification*

To classify the extracted entities, Model 2 fine-tuned a
[RoBERTa](https://arxiv.org/abs/1907.11692) model for sentence classification.
The training dataset was hand-curated by the team and consists of a mix of
entities that satisfy the entity extraction criteria, only some of which are
datasets (e.g. both United States of America (USA) and National Educational
Longitudinal Study (NELS) satisfy the criteria, but only the latter is a
dataset).

**Improving Model 2**

To improve the approach in Model 2, we looked at improving both the entity
extraction and the entity classification algorithms. At a high level, we can
imagine that the two parts of the algorithm have different purposes. Ideally,
the entity extraction algorithm will produce high recall and the entity
classification algorithm will produce high precision.
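
This division of labour can be sketched as a two-stage filter. The extractor and scorer below are toy stand-ins invented for illustration, not the models used in this notebook; only the idea of a probability cutoff (the `min_prob` value of 0.9 that appears in the evaluation configs) is taken from the notebook.

```python
from typing import Callable, List


def two_stage_predict(
    text: str,
    extract: Callable[[str], List[str]],
    score: Callable[[str], float],
    min_prob: float = 0.9,
) -> List[str]:
    """Extract candidates liberally, then keep only the confident ones."""
    candidates = extract(text)  # stage 1: favour recall
    return [c for c in candidates if score(c) >= min_prob]  # stage 2: favour precision


# Hypothetical stand-ins; a real extractor and classifier would replace these.
def toy_extract(text: str) -> List[str]:
    return [chunk.strip() for chunk in text.split(",")]


def toy_score(candidate: str) -> float:
    return 0.95 if "survey" in candidate.lower() else 0.10


kept = two_stage_predict(
    "United States of America, National Survey of College Graduates",
    toy_extract,
    toy_score,
)
```

Anything the extractor misses can never be recovered by the classifier, which is why the first stage is the one that bounds recall.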

*Entity Extraction*

The entity extraction algorithm was limited in two ways. First, although the
Schwartz & Hearst (2003) algorithm was a generally good choice, it will
necessarily miss some entities, namely those that aren't identified using the
form LONG NAME (ABBREVIATION). This approach will, by design, limit recall.
Second, the implementation of the algorithm in the submission wasn't
robust. To improve recall, we drop the Schwartz & Hearst algorithm and instead
use a regular expression to extract entities.
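
To illustrate why a regular expression can recover entities that the LONG NAME (ABBREVIATION) form misses, here is a hypothetical pattern that captures runs of capitalised words with an optional trailing acronym. It is only an illustration and is not the pattern used by `rm.RegexModel`; the connector word list is an assumption.

```python
import re

# Hypothetical candidate pattern (illustrative only).
CANDIDATE = re.compile(
    r"\b[A-Z][A-Za-z0-9\-]*"                        # first capitalised word
    r"(?:\s+(?:of|and|for|the|in|on)(?=\s+[A-Z])"   # connector followed by a capitalised word
    r"|\s+[A-Z][A-Za-z0-9\-]*)+"                    # or another capitalised word
    r"(?:\s+\([A-Z][A-Za-z0-9\-]*\))?"              # optional "(ACRONYM)"
)


def extract_candidates(text: str) -> list:
    """Return every capitalised phrase, with or without an acronym."""
    return [match.group(0) for match in CANDIDATE.finditer(text)]


candidates = extract_candidates(
    "We analysed the National Survey of College Graduates (NSCG) "
    "and the Current Population Survey."
)
```

A pattern this permissive will also pick up many non-dataset phrases, so it relies on the downstream classifier to restore precision.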

*Entity Classification*

To improve the entity classification algorithm, we again fine-tune a
RoBERTa-based transformer model, but include the NCSES targeted datasets as
positive training samples. A difficult limitation of this approach is that we're
using a sentence classification model, but we're only feeding it a short amount
of text. Additionally, there aren't many positive training samples. We
experimented with balancing the training dataset by oversampling the positive
training samples, but this didn't improve the model's performance. Finally,
training Model 2 wasn't very consistent, even when using the same
hyperparameters given by the authors.
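
For reference, oversampling of the kind described here can be sketched as below. The `(text, label)` tuple format and the function name are assumptions for illustration; this is not the training code, and as noted above the technique did not improve results in our experiments.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, int]  # (entity text, label) with 1 = dataset, 0 = not a dataset


def oversample_positives(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Duplicate random positive samples until both classes are the same size."""
    rng = random.Random(seed)
    positives = [s for s in samples if s[1] == 1]
    negatives = [s for s in samples if s[1] == 0]
    if not positives or len(positives) >= len(negatives):
        return list(samples)
    # Draw with replacement to fill the gap between the two classes.
    extra = rng.choices(positives, k=len(negatives) - len(positives))
    balanced = list(samples) + extra
    rng.shuffle(balanced)
    return balanced


toy_samples = [
    ("National Survey of College Graduates", 1),
    ("United States of America", 0),
    ("World Health Organization", 0),
    ("Department of Education", 0),
]
balanced = oversample_positives(toy_samples)
```

Oversampling only repeats existing positives, so it adds no new information about what a dataset name looks like.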


In [11]:

reload(em)
reload(shm)
reload(km2)
evaluation = em.evaluate_model(
    MockRepo(validation_dataframe_ncses.copy()),
    km2.KaggleModel2(),
    dict(
        pretrained_model="../models/kaggle_model2/models/kaggle_model2/baseline",
        min_prob=0.9,
        extractor=shm.SchwartzHearstModel(),
        extractor_config=dict(),
    ),
)
evaluation


        Model Evaluation:

        - Run time: 313.28103828430176 seconds, avg: 0.795129538792644 seconds per sample
        - True Postive Count: 222, avg: 0.5634517766497462 per sample
        - Precision: 0.3881118881118881
        - Recall: 0.06939668646452016
        

In [11]:
reload(km2)
evaluation = em.evaluate_model(
    MockRepo(validation_dataframe_ncses.copy()),
    km2.KaggleModel2(),
    dict(
        batch_size=16,
        pretrained_model="../models/kaggle_model2/models/kaggle_model2/accessible_slope",
        extractor=rm.RegexModel(config=dict()),
        extractor_config=dict(),
        min_prob=0.9,
    ),
)
evaluation


        Model Evaluation:

        - Run time: 2270.1137869358063 seconds, avg: 5.761710119126412 seconds per sample
        - True Postive Count: 1889, avg: 4.7944162436548226 per sample
        - Precision: 0.4275690357627886
        - Recall: 0.5921630094043887
        