### Description
This notebook is part of our 27th place solution, which is based on simple regular expressions and a few heuristics. We started the competition only 10 days before the deadline, so we are happy with where we stand. We tried training some models, but none worked better than the heuristics-based solution, and we did not find a proper validation strategy for them. With a heuristics-based solution, we can use all of the train data as a validation set, and there is less leakage into the public LB scores.

We further evaluated our solution on the RichContext competition data (https://coleridgeinitiative.org/richcontext/richcontextcompetition/), which gave us a better idea of what may or may not work on the unseen test set. In this notebook, we only predict datasets we are very confident about, to maintain high precision with sufficient recall. We were a little worried about low recall on the RichContext competition dataset, so in the final submission we took the union of the string-matching predictions and the heuristics-based predictions. Our best solution (a variant of this notebook) scored above 0.4 on the private LB, but we did not have enough reason to select it as our final submission.

### Challenges in proper validation

1. The train data was weakly labelled.
2. The public test set makes up only 12% of the total test set.
3. The RichContext competition had slightly longer dataset names, which mention dates, places, etc. that do not appear in the train set.

### Statistics
1. Heuristics-based solution: 0.391 (CV), 0.499 (Public), 0.397 (Private)
2. String matching: 0.575 (Public), 0.105 (Private)
3. Heuristics + string matching (selected submission): 0.559 (Public), 0.395 (Private)
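
The selected submission takes, per publication, the union of the string-matching predictions and the heuristics-based predictions. The string-matching step itself is not shown in this notebook, so the following is only a minimal sketch, assuming it looks up known training labels in the cleaned text. The helper names `clean`, `string_match` and `union_predictions` are hypothetical.

```python
import re

def clean(text):
    # Same normalisation idea as clean_text_post in this notebook, plus strip()
    return re.sub(r'[^A-Za-z0-9]+', ' ', str(text).lower()).strip()

def string_match(text, known_labels):
    """Return the known labels that occur verbatim in the cleaned text (assumed approach)."""
    cleaned = f" {clean(text)} "
    return [label for label in known_labels if f" {clean(label)} " in cleaned]

def union_predictions(heuristic_preds, matched_preds):
    """Order-preserving union of two prediction lists, joined with '|'."""
    seen, merged = set(), []
    for name in list(heuristic_preds) + list(matched_preds):
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return "|".join(merged)

# Toy inputs, made up for illustration
known = ["Alzheimer's Disease Neuroimaging Initiative (ADNI)", "Census of Agriculture"]
text = "We used data from the Census of Agriculture and a national education survey."

matched = string_match(text, known)
prediction = union_predictions(["national education survey"], [clean(x) for x in matched])
print(prediction)
```

Padding both strings with spaces keeps a label from matching inside a longer word.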

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
import json
import re
from tqdm.auto import tqdm
import nltk.data
from IPython.display import display
jaccard_threshold = 0.5
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

In [None]:
train_df = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/train.csv',index_col='Id')
temp_1 = [x.lower() for x in train_df['dataset_label'].unique()]
temp_2 = [x.lower() for x in train_df['dataset_title'].unique()]
temp_3 = [x.lower() for x in train_df['cleaned_label'].unique()]
existing_labels = set(temp_1 + temp_2 + temp_3)

In [None]:
IS_SUBMIT = len(os.listdir('../input/coleridgeinitiative-show-us-the-data/test/'))>4

# IS_SUBMIT = True
print(IS_SUBMIT)

if IS_SUBMIT:
    test_root = '../input/coleridgeinitiative-show-us-the-data/test/'
else:
    test_root = '../input/coleridgeinitiative-show-us-the-data/train/'
#     test_root = '../input/show-us-the-data-sample-test/test/'

In [None]:
sample_submission = pd.DataFrame(os.listdir(test_root),columns=['Id'])
sample_submission['Id'] = sample_submission['Id'].apply(lambda x:x.replace('.json',''))
sample_submission['PredictionString'] = None
sample_submission = sample_submission.set_index("Id").sort_index()
print(sample_submission.shape)

In [None]:
def read_text(id_):
    with open(f'{test_root}{id_}.json') as f:
        text_set = json.load(f)
        text_set = {entry['section_title']:entry['text'] for entry in text_set}
    return text_set

def split_sentences(id_,debug=False):
    """
    Split text into sentences
    """
    all_sentences = []
    text_set = read_text(id_)
    for key in text_set:
        text = text_set[key]
        sentences = pd.DataFrame(tokenizer.tokenize(text),columns=['sentence'])
        sentences['section'] = key
        sentences['section_line'] = sentences.index
        all_sentences.append(sentences)
    all_sentences = pd.concat(all_sentences).reset_index(drop=True)
    all_sentences['line'] = all_sentences.index
    all_sentences['id'] = id_
    return all_sentences

In [None]:
df = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/train.csv')

In [None]:
def clean_text_pre(txt):
    return re.sub('[^A-Za-z0-9-]+', ' ', str(txt))

def clean_text_post(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

In [None]:
# Earlier, broader keyword list; it is overridden by the assignment below.
dataset_keywords = ['study', 'dataset', 'model','survey','data','adni','codes', 'genome', 'program','assessment','database','census','initiative','sequences',
                    'gauge','system','stewardship','surge']

dataset_keywords = ['study', 'dataset','survey','database','sequences','census','sequence', 'data', 'data set','poll']
# dataset_keywords = ['study', 'dataset','survey','data']

dataset_keywords = list(map(lambda x:x.capitalize(),dataset_keywords))

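# Raw string avoids the invalid-escape warning for \w. The alternation is not grouped,
# so the leading \w* applies only to the first keyword and the trailing \w* only to the last.
regex_pattern = r'\w*{}\w*'.format("|".join(dataset_keywords))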
regex_pattern = re.compile(regex_pattern)

regex_pattern
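
Because `|` has the lowest precedence in a regular expression, the pattern above attaches the leading `\w*` only to the first keyword and the trailing `\w*` only to the last. The sketch below compares it with a grouped variant on a made-up sentence and a shortened keyword list. Whether the grouped behaviour is what was intended is an assumption, and the reported scores come from the ungrouped pattern.

```python
import re

keywords = ['Study', 'Survey', 'Data']  # shortened list, for illustration only

# Same construction as in the notebook: no group around the alternation
ungrouped = re.compile(r'\w*{}\w*'.format("|".join(keywords)))
# Hypothetical variant: a non-capturing group makes \w* apply to every keyword
grouped = re.compile(r'\w*(?:{})\w*'.format("|".join(keywords)))

sentence = "results from the Household Surveys and the MetaData archive"

ungrouped_matches = [m.group() for m in ungrouped.finditer(sentence)]
grouped_matches = [m.group() for m in grouped.finditer(sentence)]
print(ungrouped_matches)  # ['Survey', 'Data']
print(grouped_matches)    # ['Surveys', 'MetaData']
```

Both variants find the same keyword positions, but the match boundaries differ. That matters here because `extract_dataset_names` splits the sentence at the end of each match.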

In [None]:
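#### Remove >.5 Jaccard matches from predictions
def jaccard_similarity(s1, s2):
    """Token-level Jaccard similarity of two space-separated strings."""
    l1 = s1.split(" ")
    l2 = s2.split(" ")
    intersection = len(set(l1).intersection(l2))
    # The union size is derived from list lengths, so repeated tokens count more than once.
    union = (len(l1) + len(l2)) - intersection
    return float(intersection) / union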
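
A short worked example of the similarity above, on made-up strings. The function is repeated so the snippet runs on its own. `jaccard_sets` is a hypothetical set-based variant, included only to show where the two differ.

```python
def jaccard_similarity(s1, s2):
    # Same logic as the notebook's function
    l1 = s1.split(" ")
    l2 = s2.split(" ")
    intersection = len(set(l1).intersection(l2))
    union = (len(l1) + len(l2)) - intersection
    return float(intersection) / union

def jaccard_sets(s1, s2):
    # Hypothetical variant that uses unique tokens for the union as well
    a, b = set(s1.split(" ")), set(s2.split(" "))
    return len(a & b) / len(a | b)

a = "national education longitudinal study"
b = "education longitudinal study"
print(jaccard_similarity(a, b))   # 3 shared tokens / (4 + 3 - 3) = 0.75

c = "survey of survey data"       # 'survey' appears twice
d = "survey data"
print(jaccard_similarity(c, d))   # 2 / (4 + 2 - 2) = 0.5
print(jaccard_sets(c, d))         # 2 / 3
```

The two versions agree whenever neither string repeats a token, which is the common case for dataset names.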

In [None]:
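def preprocess_sentence(sentence):
    """Lowercase the first character, presumably so a sentence-initial capital is not treated as part of a name."""
#     for word in dataset_keywords:
#         sentence = sentence.replace(" "+word.lower()+" "," "+word+" ").strip()
    if not sentence:
        # Guard against empty strings, which would raise IndexError below
        return sentence
    sentence = sentence[0].lower()+sentence[1:]
    return sentence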

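def find_keyword(df,col):
    """Flag rows of df whose text in `col` contains any capitalized dataset keyword."""
    df['keyword_present'] = 0
    for word in dataset_keywords:
        # str.find returns 0 for a match at the very start, so compare with >= 0
        df['keyword_present'] += (df[col].apply(lambda x:x.find(word.capitalize()))>=0).astype(int)
    df['keyword_present'] = df['keyword_present']>0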
    
def is_highlighted(word):
    return any(x.isupper() for x in word)

def extract_dataset_names(sentence):
    all_datasets = []
    for m in regex_pattern.finditer(sentence):
        start_pos = m.start()
        end_pos = m.start()+len(m.group())
        #To Do -  Use regex to split
        pretext = clean_text_pre(sentence[:end_pos].split(',')[-1]).strip().split()
        posttext = clean_text_pre(sentence[end_pos:].split(',')[0]).strip().split()

        limit = 1
        dataset_start_idx = -1
        for i,x in enumerate(reversed(pretext)):
            if not is_highlighted(x):
                limit -= 1
            if is_highlighted(x):
                dataset_start_idx = -1*(i+1)
            if limit == -1:
                break

        limit = 1     
        dataset_end_idx = 0
        for i,x in enumerate(posttext):
            if not is_highlighted(x):
                limit -= 1
            if is_highlighted(x):
                dataset_end_idx = (i+1)
            if limit == -1:
                break

        dataset_name = pretext[dataset_start_idx:] + posttext[:dataset_end_idx]
        dataset_name = " ".join(dataset_name)
        all_datasets.append(dataset_name)
    return all_datasets

def filter_duplicates(dataset_names):
    dataset_names = list(sorted(dataset_names,key = lambda x:-1*len(x)))
    filtered_dataset_names = []
    for dataset in dataset_names:
        is_duplicate = False
        for reference in filtered_dataset_names:
            if jaccard_similarity(dataset,reference)>=0.9:
                is_duplicate=True
        if not is_duplicate:
            filtered_dataset_names.append(dataset)
    return filtered_dataset_names

def filter_by_length(dataset_names):
    return [x for x in dataset_names if len(x.split())>3 and len(x.split())<=10]

def filter_by_count(dataset_names):
    if len(dataset_names)>10:
        return []
#         return dataset_names[:10]
    else:
        return dataset_names
    
def clean_label(dataset_names):
    return [clean_text_post(x) for x in dataset_names]

def remove_existing_labels(dataset_names):
    return [x for x in dataset_names if x not in existing_labels]

In [None]:
def get_abbreviations(text):
    cleaned = clean_text_pre(text)
    words = cleaned.split()
    abbreviations = [word for word in words if sum(x.isupper() for x in word)>=3]
    return abbreviations

def get_abbreviation_details(row):
    abbreviations = {}
    cleaned = clean_text_pre(row.sentence)
    for abbreviation in row.abbreviations:
        start_pos = cleaned.find(abbreviation)
        end_pos = start_pos+len(abbreviation)
        MAX_LENGTH = len(abbreviation)+2
        left = cleaned[:start_pos].split()[-1*MAX_LENGTH:]
        right = cleaned[end_pos:].split()[:MAX_LENGTH]
        if sum(is_highlighted(x) for x in left)>2:
            first_letters = [x[0] for x in left]
            pos = -1
            for i in range(len(first_letters)-1,-1,-1):
                x = first_letters[i]
                if x.lower()==abbreviation[pos].lower():
                    if pos==-1:
                        end_word = i
                    pos-=1
                if pos<-1*len(abbreviation):
                    start_word = i
                    abbreviations[abbreviation] = " ".join(left[start_word:end_word+1])
                    break
        elif sum(is_highlighted(x) for x in right)>2:
            first_letters = [x[0] for x in right]
            pos = 0
            for i,x in enumerate(first_letters):
                if x.lower()==abbreviation[pos].lower():
                    if pos==0:
                        start_word = i
                    pos+=1
                if pos==len(abbreviation):
                    end_word = i
                    abbreviations[abbreviation] = " ".join(right[start_word:end_word+1])
                    break
    return [f"{x}::{abbreviations[x]}" for x in abbreviations]

In [None]:
SIMULATE = test_root == '../input/coleridgeinitiative-show-us-the-data/train/'
SIMULATE

## Get All the abbreviations pair from the text

An abbreviation pair is typically a sequence of capitalized words followed by an uppercase short form. Example: Alzheimer's Disease Neuroimaging Initiative (ADNI)
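
The idea can be sketched in isolation as follows. This is a simplified illustration, not the notebook's `get_abbreviation_details`. It assumes the short form appears in parentheses directly after the long form, and it only looks to the left. The function name `find_abbreviation_pairs` is hypothetical.

```python
import re

def find_abbreviation_pairs(sentence):
    """Return {short_form: long_form} for 'Long Form (ABBR)' patterns (simplified sketch)."""
    pairs = {}
    for m in re.finditer(r'\(([A-Z]{3,})\)', sentence):
        short = m.group(1)
        # Words before the parenthesis, with punctuation replaced by spaces
        words = re.sub(r'[^A-Za-z0-9-]+', ' ', sentence[:m.start()]).split()
        window = words[-(len(short) + 2):]   # allow a couple of extra words
        pos = len(short) - 1                 # index of the letter still to match
        end_word = None
        # Walk backwards, matching word initials against the abbreviation letters
        for i in range(len(window) - 1, -1, -1):
            if window[i][0].lower() == short[pos].lower():
                if end_word is None:
                    end_word = i
                pos -= 1
                if pos < 0:
                    pairs[short] = " ".join(window[i:end_word + 1])
                    break
    return pairs

sentence = ("Data were obtained from the Alzheimer's Disease "
            "Neuroimaging Initiative (ADNI) database.")
pairs = find_abbreviation_pairs(sentence)
print(pairs)
```

The apostrophe in "Alzheimer's" becomes a space during cleaning, so the recovered long form contains a stray "s".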

In [None]:
if SIMULATE:
    all_abbreviations = pd.read_csv('../input/coleridge-abbreviations/abbreviations.csv')
else:
    all_abbreviations = []
    predictions = pd.DataFrame(columns = ['PredictionString'])
    for id_,row in tqdm(sample_submission.iterrows()):
        sentences = split_sentences(id_)
        sentences['abbreviations'] = sentences.sentence.apply(get_abbreviations)
        sentences['abbreviations'] = sentences.apply(get_abbreviation_details,axis=1)
        abbreviations = np.concatenate(sentences.abbreviations.values)
        abbreviations = {x.split('::')[0]:x.split('::')[1] for x in abbreviations}
        abbreviations = pd.Series(abbreviations).reset_index()
        abbreviations.columns = ['short_form','long_form']
        abbreviations['Id'] = id_
        all_abbreviations.append(abbreviations)
    all_abbreviations = pd.concat(all_abbreviations).reset_index(drop=True)

## Filter dataset names from the shortform-longform list

Logic: a dataset name should either end with one of the dataset keywords or contain a dataset keyword followed by of/from/on.
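
A self-contained sketch of this rule on made-up candidates. `KEYWORDS` is a shortened keyword list and `looks_like_dataset` is a hypothetical helper. The notebook's own version is `filter_false_positives_from_longforms`.

```python
import re

KEYWORDS = ['study', 'dataset', 'survey', 'database', 'census', 'data']  # subset, for illustration

def looks_like_dataset(name):
    """True if the name ends with a keyword or contains '<keyword> of/from/on'."""
    cleaned = re.sub(r'[^A-Za-z0-9]+', ' ', name.lower()).strip()
    has_connector = any(f"{k} {c}" in cleaned
                        for k in KEYWORDS for c in ("of", "from", "on"))
    ends_with_keyword = any(cleaned.endswith(k) for k in KEYWORDS)
    return has_connector or ends_with_keyword

candidates = [
    "National Education Longitudinal Study",  # ends with a keyword
    "Survey of Earned Doctorates",            # keyword + "of"
    "World Health Organization",              # neither
]
kept = [c for c in candidates if looks_like_dataset(c)]
print(kept)
```

As in the notebook, the check is a plain substring test, so it is deliberately permissive.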

In [None]:
long_forms = all_abbreviations.long_form.apply(clean_text_post).drop_duplicates()
short_forms = all_abbreviations.short_form.apply(clean_text_post).drop_duplicates()
dataset_short_forms = np.unique(all_abbreviations[all_abbreviations.long_form.str.lower().apply(lambda x: any(y.lower() in x for y in dataset_keywords))].short_form.values)
dataset_short_forms = [x for x in dataset_short_forms if len(x)>3]
regex_pattern_abbreviations = r'\w*{}\w*'.format("|".join(dataset_short_forms))
regex_pattern_abbreviations = re.compile(regex_pattern_abbreviations)
long_forms.shape,short_forms.shape

In [None]:
dataset_keywords_lowered = [x.lower() for x in dataset_keywords]
dataset_keywords_cleaned = [clean_text_post(x) for x in dataset_keywords]
dataset_keywords,dataset_keywords_lowered,dataset_keywords_cleaned

In [None]:
def filter_false_positives_from_longforms(dataset_names):
    filtered = []
    for dataset in dataset_names:
        dataset_cleaned = clean_text_post(dataset)
        if any(x+" of" in dataset_cleaned for x in dataset_keywords_cleaned):
            filtered.append(dataset)
        elif any(x+" from" in dataset_cleaned for x in dataset_keywords_cleaned):
            filtered.append(dataset)
        elif any(x+" on" in dataset_cleaned for x in dataset_keywords_cleaned):
            filtered.append(dataset)
        elif any(dataset_cleaned.endswith(x) for x in dataset_keywords_cleaned):
            filtered.append(dataset)
    return filtered

In [None]:
long_forms_filtered = filter_false_positives_from_longforms(long_forms)

In [None]:
def filter_false_positives(dataset_names):
    filtered = []
    for dataset in dataset_names:
        dataset_cleaned = clean_text_post(dataset)
        if any(x+" of" in dataset_cleaned for x in dataset_keywords_cleaned):
            filtered.append(dataset)
        elif any(x+" from" in dataset_cleaned for x in dataset_keywords_cleaned):
            filtered.append(dataset)
        elif any(x+" on" in dataset_cleaned for x in dataset_keywords_cleaned):
            filtered.append(dataset)
    return filtered

def filter_by_abbreviations(dataset_names):
    filtered = []
    for dataset1 in dataset_names:
        accepted = False
        for dataset2 in long_forms_filtered:
            if jaccard_similarity(dataset1,dataset2)>=0.7:
                accepted = True
        if accepted:
            filtered.append(dataset1)
    return filtered

## Heuristics Explained
A dataset name should either end with a dataset keyword or contain a dataset keyword followed by of/from/on. The pipeline is:
1. Extract all dataset candidates with regex-based heuristics: find every keyword, then look left and right to collect the surrounding sequence of capitalized words.
2. `filter_by_length`: drop candidates with fewer than 4 or more than 10 words.
3. Filter by abbreviation: keep candidates whose Jaccard similarity with any entry of the filtered long-form list is >= 0.7.
4. If step 3 finds no dataset, filter false positives from the candidates in (2) using logic similar to that used for abbreviations (DATASET-KEYWORD + of/from/on).
5. If step 4 finds no dataset and (2) contains fewer than 3 candidates, use the list from (2).
6. If step 4 finds no dataset and (2) contains 3 or more candidates, return no predictions.
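
The fallback order of steps 3 to 6 can be sketched as a small function. This is an illustration only: `select_predictions` and the two toy predicates are hypothetical stand-ins for `filter_by_abbreviations` and `filter_false_positives`.

```python
def select_predictions(candidates, accept_by_abbreviation, accept_by_connector,
                       max_fallback=3):
    """Fallback cascade over length-filtered candidates (steps 3 to 6)."""
    # Step 3: keep candidates confirmed by the long-form list
    selected = [c for c in candidates if accept_by_abbreviation(c)]
    # Step 4: otherwise keep candidates with 'keyword + of/from/on'
    if not selected:
        selected = [c for c in candidates if accept_by_connector(c)]
    # Step 5: otherwise trust the raw list only if it is short
    if not selected and len(candidates) < max_fallback:
        selected = list(candidates)
    # Step 6: otherwise `selected` stays empty and nothing is predicted
    return selected

# Toy stand-ins for the real checks
known_long_forms = {"national education longitudinal study"}
by_abbr = lambda c: c in known_long_forms   # stands in for the Jaccard >= 0.7 test
by_conn = lambda c: any(k in c for k in ("survey of", "data from", "study on"))

r1 = select_predictions(["national education longitudinal study",
                         "early childhood program data"], by_abbr, by_conn)
r2 = select_predictions(["survey of earned doctorates",
                         "early childhood program data"], by_abbr, by_conn)
r3 = select_predictions(["early childhood program data"], by_abbr, by_conn)
r4 = select_predictions(["a b c d", "e f g h", "i j k l"], by_abbr, by_conn)
print(r1, r2, r3, r4, sep="\n")
```

Each stage runs only when the previous one returned nothing, so a single confirmed name suppresses all weaker candidates for that publication.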

In [None]:
predictions = pd.DataFrame(index=sample_submission.index,columns = ['PredictionString'])
for id_,row in tqdm(sample_submission.iterrows()):
    if SIMULATE:
        sentences = pd.read_csv(f'../input/coleridge-sud-sentences/sentences/tmp/sentences/{id_}.csv')
    else:
        sentences = split_sentences(id_)
    sentences['sentence'] = sentences.sentence.apply(preprocess_sentence)
    sentences['predicted_datasets'] = sentences.sentence.apply(extract_dataset_names)
    sentences['predicted_datasets'] = sentences.predicted_datasets.apply(filter_by_length)
    predicted_datasets_orig = np.unique(np.concatenate(sentences.predicted_datasets.values))
    predicted_datasets = [clean_text_post(x) for x in predicted_datasets_orig]
    predicted_datasets = filter_by_abbreviations(predicted_datasets)
    if len(predicted_datasets)==0:
        predicted_datasets = filter_false_positives(predicted_datasets_orig)
        predicted_datasets = [clean_text_post(x) for x in predicted_datasets]
    if len(predicted_datasets)==0 and len(predicted_datasets_orig)<3:
        predicted_datasets = [clean_text_post(x) for x in predicted_datasets_orig]
    predicted_datasets = filter_duplicates(predicted_datasets)
    predicted_datasets = "|".join(predicted_datasets)
    if predicted_datasets!="":
        predictions.loc[id_,'PredictionString'] = predicted_datasets

In [None]:
assert sample_submission.shape[0]==predictions.shape[0]
assert (sample_submission.index.values==predictions.index.values).sum()==sample_submission.shape[0]
submission = predictions.reset_index()
submission.columns = ['Id','PredictionString']
submission.to_csv('submission.csv',index=False)
print(submission.shape)
submission.sample(2)

In [None]:
submission.PredictionString.value_counts().head(1)

In [None]:
submission = pd.read_csv('submission.csv')
submission.PredictionString.isna().sum()

## Metric Implementation

In [None]:
def calculate_tp_fp_fn(pred_row):
    
    if pd.isnull(pred_row.cleaned_label):
        true = []
    else:
        true = pred_row.cleaned_label.split('|')
        
    if pd.isnull(pred_row.PredictionString):
        predicted = []
    else:
        predicted = pred_row.PredictionString.split('|')
        
    true = [clean_text_post(x) for x in true]
    predicted = [clean_text_post(x) for x in predicted]
    scores = pd.DataFrame(columns=['true','predicted','score'])
    i = 0
    for j,sample1 in enumerate(true):
        for k,sample2 in enumerate(predicted):
            scores.loc[i] = [j,k,jaccard_similarity(sample1,sample2)]
            i += 1
    scores = scores[scores.score>0.5].sort_values('score',ascending=False).reset_index(drop=True)
    true_done = {}
    predicted_done = {}
    tp = 0
    for i,row in scores.iterrows():
        if (row.true not in true_done) and (row.predicted not in predicted_done):
            tp += 1
            true_done[row.true] = True
            predicted_done[row.predicted] = True
    fp = len(predicted) - tp
    fn = len(true) - tp
    pred_row['tp'] = tp
    pred_row['fp'] = fp
    pred_row['fn'] = fn
    return pred_row

def F_beta(precision,recall,beta=0.5):
    epsilon = 0.0000001
    return (1+beta**2)*precision*recall/(((beta**2)*precision)+recall+epsilon)
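
As implemented in this notebook, true positives, false positives and false negatives are summed over all publications before precision, recall and F-beta are computed. With beta = 0.5, precision weighs more than recall. A small worked example with hypothetical counts (not results from this notebook):

```python
def f_beta(precision, recall, beta=0.5):
    # Same formula as F_beta above; epsilon avoids division by zero
    epsilon = 1e-7
    return (1 + beta**2) * precision * recall / ((beta**2) * precision + recall + epsilon)

# Hypothetical counts, for illustration only
tp, fp, fn = 30, 10, 30
precision = tp / (tp + fp)   # 0.75
recall = tp / (tp + fn)      # 0.5

high_precision = round(f_beta(precision, recall), 4)
# Swapping precision and recall shows the asymmetry of beta = 0.5
high_recall = round(f_beta(recall, precision), 4)
print(high_precision)  # 0.6818
print(high_recall)     # 0.5357
```

The same pair of values scores higher when the larger one is precision, which is why the heuristics aim for confident predictions rather than broad coverage.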

In [None]:
if SIMULATE:
    train_df = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/train.csv',index_col='Id')
    all_labels = set(train_df.cleaned_label.unique())
    ground_truth = train_df.reset_index().groupby('Id').cleaned_label.apply(lambda x: '|'.join(x)).reset_index()
    print(ground_truth.shape)
    predictions_df = pd.merge(ground_truth,submission,on='Id',how='inner')
    tqdm.pandas()
    scores = predictions_df.progress_apply(calculate_tp_fp_fn,axis=1)
    scores.to_csv('scores.csv')
    tp,fp,fn = scores[['tp','fp','fn']].sum().values
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)
    Fbeta05 = F_beta(precision,recall,beta=0.5)
    print("tp =",tp)
    print("fp =",fp)
    print("fn =",fn)
    print("Precision =",precision)
    print("Recall =",recall)
    print("F Beta at 0.5 =",Fbeta05)

## False Positives

In [None]:
if SIMULATE:
    print(scores[scores.fp>1].PredictionString.sample(50).values)

## False Negatives

In [None]:
if SIMULATE:
    print(scores[scores.fn>1].cleaned_label.drop_duplicates().sample(50).values)