# Trying out a Regular Expression as a Baseline for High Recall

Kaggle Model 2 uses the Schwartz-Hearst (SW) algorithm to extract candidate
entities and then classifies them with a binary classifier (see
`explore_schwartz_heart_baseline.ipynb` for more info). The SW algorithm
misses any entity that doesn't match the pattern LONG FORM (ACRONYM). In the
evaluation on the Kaggle private data set, this produced a recall of 0.65, so
any model that relies on the SW algorithm for candidate extraction can reach a
recall of at most 0.65 on that data.

This notebook tries a regular-expression-based extraction method for
generating candidates, which is more flexible than the SW algorithm.
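
To make the contrast concrete, here is a minimal sketch using only the standard library. Both patterns are hypothetical simplifications written for this illustration: the first only looks for a parenthesized short form (the anchor that the LONG FORM (ACRONYM) shape depends on), and the second accepts any standalone acronym-like token. Neither is the actual SW algorithm nor the pattern used by `RegexModel`.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# Strict: a short form in parentheses, the anchor the LONG FORM (ACRONYM) shape needs.
PAREN_SHORT_FORM = re.compile(r"\(([A-Za-z]{2,10})\)")
# Loose: any standalone token containing at least two uppercase letters.
ACRONYM_LIKE = re.compile(r"\b[A-Za-z]*[A-Z][A-Za-z]*[A-Z][A-Za-z]*\b")

# One mention that defines its acronym, and one that uses acronyms bare.
defined = "Data came from the Database of Genotypes and Phenotypes (dbGaP)."
bare = "Genotypes were obtained from dbGaP and GTEx."

strict_defined = PAREN_SHORT_FORM.findall(defined)  # short form found
strict_bare = PAREN_SHORT_FORM.findall(bare)        # nothing in parentheses to find
loose_bare = ACRONYM_LIKE.findall(bare)             # bare acronyms are still candidates

print(strict_defined)
print(strict_bare)
print(loose_bare)
```

The looser pattern picks up mentions that never define their acronym, which is where the extra recall would come from. The price is many more spurious candidates, so precision is expected to drop.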


In [1]:
from itertools import chain
import json

import pandas as pd
from thefuzz import fuzz, process

import democratizing_data_ml_algorithms.models.regex_model as rm
from democratizing_data_ml_algorithms.data.kaggle_repository import KaggleRepository
from democratizing_data_ml_algorithms.evaluate.model import evaluate_model, evaluate_kaggle_private

In [2]:
repo = KaggleRepository()

In [5]:
# the `scorer` and `processor` arguments are explained in the notebook
# `defining_a_match_1.ipynb`

evaluation = evaluate_kaggle_private(
    rm.RegexModel(dict()),
    dict(),  # this model doesn't have any configuration params
    scorer=fuzz.partial_ratio,  # use fuzzy string matching
    processor=lambda s: s.lower(),  # convert to lowercase
)
evaluation

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.




        Model Evaluation:

        - Run time: 15.608129501342773 seconds, avg: 0.0019357719832993642 seconds per sample
        - True Postive Count: 24137, avg: 2.993550787548059 per sample
        - Precision: 0.0744222616202883
        - Recall: 0.6834192196613624
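
For reference, the sketch below shows how precision and recall relate to true positive, false positive and false negative counts. It assumes the evaluation report uses the standard definitions; the function name and the toy counts are made up for this illustration and are not taken from the run above.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from raw outcome counts."""
    # Precision: share of extracted candidates that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: share of true labels that were extracted.
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy counts: many spurious candidates (low precision), few misses (high recall).
precision, recall = precision_recall(tp=7, fp=93, fn=3)
print(precision, recall)
```

A candidate extractor with this profile finds most labels but leaves a lot of filtering to whatever step consumes its candidates.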
        

In [6]:
stats = evaluation.output_statistics
# Flatten the per-sample label and outcome lists into two parallel flat lists
all_labels = list(chain.from_iterable(x["labels"] for x in stats["statistics"].values))
global_stats = list(chain.from_iterable(x["stats"] for x in stats["statistics"].values))

In [7]:
# Count false negatives (FN) per label, most frequently missed labels first
stats_df = pd.DataFrame({"labels": all_labels, "stats": global_stats})
stats_df.loc[stats_df["stats"] == "FN", :].groupby("labels").count().sort_values(
    "stats", ascending=False
)

| labels | stats (FN count) |
| --- | --- |
| dbgap | 3710 |
| database of genotypes and phenotypes | 129 |
| gtex | 112 |
| 1000 genomes project | 110 |
| database of genotypes and phenotypes dbgap | 83 |
| ... | ... |
| genemania | 1 |
| genenetwork | 1 |
| genenetwork org | 1 |
| generation scotland | 1 |
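
The same tally can be sketched without pandas. The records below are toy data that only mimic the `labels`/`stats` layout used in the cells above; they are not taken from the evaluation.

```python
from collections import Counter
from itertools import chain

# Hypothetical per-sample records: parallel lists of labels and outcomes.
samples = [
    {"labels": ["dbgap", "gtex"], "stats": ["FN", "TP"]},
    {"labels": ["dbgap"], "stats": ["FN"]},
    {"labels": ["gtex", "pisa"], "stats": ["FN", "FP"]},
]

# Flatten both columns, keeping labels and outcomes aligned by position.
flat_labels = chain.from_iterable(s["labels"] for s in samples)
flat_stats = chain.from_iterable(s["stats"] for s in samples)

# Keep only the false negatives and count them per label.
fn_counts = Counter(
    label for label, stat in zip(flat_labels, flat_stats) if stat == "FN"
)
print(fn_counts.most_common())
```

Sorting by count puts the labels that cost the most recall at the top, which is what makes this view useful for deciding what to fix first.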


Let's try adding explicit keywords for datasets we want to capture but that don't always match the regex rules, such as `dbgap`, the most frequent false negative above.

In [8]:
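# Dataset names that the regex rules do not always catch
keywords = [
    "database of genotypes and phenotypes",
    "dbgap",
    "DART buoy",
    "pisa",
]

evaluation_with_keywords = evaluate_model(
    repo,
    rm.RegexModel(dict(keywords=keywords)),  # pass the explicit keywords to the model
    dict(),  # this model doesn't have any configuration params
)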

evaluation_with_keywords

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.




        Model Evaluation:

        - Run time: 28.748844623565674 seconds, avg: 0.003565527052407996 seconds per sample
        - True Postive Count: 28028, avg: 3.47612551159618 per sample
        - Precision: 0.08533284620419844
        - Recall: 0.7984730214802576
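
How `RegexModel` uses the `keywords` option is not shown in this notebook. One plausible approach, sketched here under that assumption, is to compile the keywords into a single case-insensitive alternation. The helper name `build_keyword_pattern` is hypothetical.

```python
import re


def build_keyword_pattern(keywords):
    """Compile keywords into one case-insensitive, whole-word pattern."""
    # Longest first, so a long phrase wins over a shorter keyword it contains.
    ordered = sorted(keywords, key=len, reverse=True)
    alternation = "|".join(re.escape(keyword) for keyword in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


pattern = build_keyword_pattern(
    ["database of genotypes and phenotypes", "dbgap", "DART buoy", "pisa"]
)

text = "Samples are available through dbGaP; scores come from PISA."
found = [match.group(0) for match in pattern.finditer(text)]
print(found)
```

The word boundaries keep a short keyword such as `pisa` from matching inside a longer word, and escaping each keyword keeps punctuation in dataset names from being read as regex syntax.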
        

In [9]:
stats = evaluation_with_keywords.output_statistics
# Flatten the per-sample label and outcome lists into two parallel flat lists
all_labels = list(chain.from_iterable(x["labels"] for x in stats["statistics"].values))
global_stats = list(chain.from_iterable(x["stats"] for x in stats["statistics"].values))

In [10]:
# Count false negatives (FN) per label, most frequently missed labels first
stats_df = pd.DataFrame({"labels": all_labels, "stats": global_stats})
stats_df.loc[stats_df["stats"] == "FN", :].groupby("labels").count().sort_values(
    "stats", ascending=False
)

| labels | stats (FN count) |
| --- | --- |
| gtex | 112 |
| 1000 genomes project | 110 |
| business r d and innovation survey | 79 |
| dbsnp | 75 |
| foodaps | 72 |
| ... | ... |
| genereviews | 1 |
| generif | 1 |
| genes and genomes database | 1 |
| genesis study | 1 |
