# Classifier & NER V3
Third attempt solution with classifier (RF - balanced) and spaCy (NER) models (Score: unscored - notebook timedout on Kaggle)

Workflow:
1. tokenize test sentences
2. pass test sentences through the binary classifier; retain sentences with high prob score
3. pass high prob test sentences through the NER model; extract and append entity matches

In [1]:
# pip install spacy-transformers --user

In [2]:
import glob
import json
import re
import pickle
import pandas as pd
import numpy as np

import nltk
from nltk.tokenize import sent_tokenize

import spacy
from spacy import displacy

In [3]:
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

def jaccard(str1, str2): 
    """
    defined by the competition
    """
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

def clean_text(txt):
    """
    Convert to lowercase, remove special characters, and punctuation.
    """
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

def find_acronyms(txt):
    """
    finds and returns a sequence of capital letters
    for use on dataset_titles, dataset_labels, or full text
    """
    matches = re.findall(r"\b[A-Z\.]{2,}s?\b", txt)
    if matches:
        return 1
    else:
        return 0

def count_acronyms(txt):
    """
    finds and returns a sequence of capital letters
    for use on dataset_titles, dataset_labels, or full text
    """
    matches = re.findall(r"\b[A-Z\.]{2,}s?\b", txt)
    if matches:
        return len(matches)
    else:
        return 0

In [4]:
DATA_DIR = '/nfs/turbo/hrg/coleridge/'

In [5]:
custom_ner_model = spacy.load('/nfs/turbo/hrg/coleridge/spaCy/output/model-best') # output model is stored as "model-best" and "model-last"

## Data

In [6]:
submission_df = pd.read_csv('../data/sample_submission.csv')

In [7]:
test_files = glob.glob('../data/test/*.json')

df_test_pubs = pd.DataFrame()
for test_file in test_files: 
    file_data = pd.read_json(test_file)
    file_data.insert(0,'Id', test_file.split('/')[-1].split('.')[0])
    df_test_pubs = pd.concat([df_test_pubs, file_data])

df_test_pubs['clean_text'] = df_test_pubs['text'].apply(clean_text)
df_test_pubs.head()

Unnamed: 0,Id,section_title,text,clean_text
0,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Introduction,A significant body of research has been conduc...,a significant body of research has been conduc...
1,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Literature review,We reviewed the literature that explored retai...,we reviewed the literature that explored retai...
2,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Food shopping patterns: where do people shop?,Diversification in the food retail sector offe...,diversification in the food retail sector offe...
3,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Food shopping patterns: what do people buy?,"Many factors, including income, participation ...",many factors including income participation in...
4,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,2,Anne Palmer et al. shopping at the same store ...,anne palmer et al shopping at the same store h...


## Target
We are predicting whether a `sent` contains a data reference based on its features

In [8]:
test_sentences = []

for row in df_test_pubs.itertuples():
    sentences = row[3].split(".")
    for sent in sentences:
#     for sent in sent_tokenize(row[3]):
        test_sentences.append((row[1], row[2], sent))

df_test_sent = pd.DataFrame(test_sentences, columns=['Id', 'section_title', 'sent'])

df_test_sent['sent'] = df_test_sent['sent'].astype(str)
df_test_sent['section_title'] = df_test_sent['section_title'].astype(str)

df_test_sent['sent_clean'] = df_test_sent['sent'].apply(clean_text)
df_test_sent['section_clean']= df_test_sent['section_title'].apply(clean_text)

df_test_sent

Unnamed: 0,Id,section_title,sent,sent_clean,section_clean
0,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Introduction,A significant body of research has been conduc...,a significant body of research has been conduc...,introduction
1,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Introduction,Even though lowincome (LI) households' choice...,even though lowincome li households choices a...,introduction
2,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Introduction,Findings on purchasing patterns of LI househo...,findings on purchasing patterns of li househo...,introduction
3,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Introduction,Understanding the specific factors related to...,understanding the specific factors related to...,introduction
4,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Introduction,This study contributes to the literature by e...,this study contributes to the literature by e...,introduction
...,...,...,...,...,...
2717,2100032a-7c33-4bff-97ef-690822c43466,Figure 1.,\nTrampush et al,trampush et al,figure 1
2718,2100032a-7c33-4bff-97ef-690822c43466,Figure 1.,Page 19 Table 3 Meta-analytic results of th...,page 19 table 3 meta analytic results of the ...,figure 1
2719,2100032a-7c33-4bff-97ef-690822c43466,Figure 1.,1 associated with educational attainment in CO...,1 associated with educational attainment in co...,figure 1
2720,2100032a-7c33-4bff-97ef-690822c43466,Figure 1.,,,figure 1


Apply NER model to all test sentences

In [9]:
candidates = df_test_sent['sent'].to_list()

for candidate in candidates:
    label = set()
    doc = custom_ner_model(candidate)
    if len(doc.ents) > 0:
        label.add(clean_text(doc.ents))
        displacy.render(doc, style="ent", jupyter=True)

## Features

### has Indicator terms

In [10]:
df_test_sent['freqData'] = df_test_sent['sent_clean'].str.count('data')
df_test_sent['freqEdu'] = df_test_sent['sent_clean'].str.count('edu')
df_test_sent['freqSample'] = df_test_sent['sent_clean'].str.count('sample')
df_test_sent['freqNational'] = df_test_sent['sent_clean'].str.count('national')
df_test_sent['freqSurvey'] = df_test_sent['sent_clean'].str.count('survey')
df_test_sent['freqPublic'] = df_test_sent['sent_clean'].str.count('public')
df_test_sent['freqAvail'] = df_test_sent['sent_clean'].str.count('avail')
df_test_sent['freqNSF'] = df_test_sent['sent_clean'].str.count('nsf')
df_test_sent['freqGov'] = df_test_sent['sent_clean'].str.count('gov')
df_test_sent['freqAccess'] = df_test_sent['sent_clean'].str.count('access')

df_test_sent['hasData'] = np.where(df_test_sent['sent_clean'].str.contains('data'), 1, 0)
df_test_sent['hasEdu'] = np.where(df_test_sent['sent_clean'].str.contains('edu'), 1, 0)
df_test_sent['hasSample'] = np.where(df_test_sent['sent_clean'].str.contains('sample'), 1, 0)
df_test_sent['hasNational'] = np.where(df_test_sent['sent_clean'].str.contains('national'), 1, 0)
df_test_sent['hasSurvey'] = np.where(df_test_sent['sent_clean'].str.contains('survey'), 1, 0)
df_test_sent['hasPublic'] = np.where(df_test_sent['sent_clean'].str.contains('public'), 1, 0)
df_test_sent['hasAvail'] = np.where(df_test_sent['sent_clean'].str.contains('survey'), 1, 0)
df_test_sent['hasNSF'] = np.where(df_test_sent['sent_clean'].str.contains('nsf'), 1, 0)
df_test_sent['hasGov'] = np.where(df_test_sent['sent_clean'].str.contains('gov'), 1, 0)
df_test_sent['hasAccess'] = np.where(df_test_sent['sent_clean'].str.contains('access'), 1, 0)

df_test_sent.sample(n=10)

Unnamed: 0,Id,section_title,sent,sent_clean,section_clean,freqData,freqEdu,freqSample,freqNational,freqSurvey,...,hasData,hasEdu,hasSample,hasNational,hasSurvey,hasPublic,hasAvail,hasNSF,hasGov,hasAccess
1755,2f392438-e215-4169-bebf-21ac4ff253e1,Indicator,Primary education refers to ISCED97 level 1,primary education refers to isced97 level 1,indicator,0,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2237,2f392438-e215-4169-bebf-21ac4ff253e1,How to read the charts,"Although not included in the charts, postseco...",although not included in the charts postsecon...,how to read the charts,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1423,2f392438-e215-4169-bebf-21ac4ff253e1,Definitions and Methodology,94 and less than or equal to 409,94 and less than or equal to 409,definitions and methodology,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
103,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,2,A small study in Houston found that African A...,a small study in houston found that african a...,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1842,2f392438-e215-4169-bebf-21ac4ff253e1,Indicator 10,"These range from level 1 to level 6, with lev...",these range from level 1 to level 6 with leve...,indicator 10,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2019,2f392438-e215-4169-bebf-21ac4ff253e1,19,"In the United States, the use for school achi...",in the united states the use for school achie...,19,1,0,0,0,0,...,1,0,0,0,0,1,0,0,0,0
2208,2f392438-e215-4169-bebf-21ac4ff253e1,ISCED97 levels,The ISCED97 allows researchers to compile sta...,the isced97 allows researchers to compile sta...,isced97 levels,0,1,0,1,0,...,0,1,0,1,0,0,0,0,0,0
859,3f316b38-1a24-45a9-8d8c-4e05a42257c6,CONCLUSION,nccoastalatlas,nccoastalatlas,conclusion,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
116,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Primary data: customer intercept survey-data c...,The two eligibility criteria were that the st...,the two eligibility criteria were that the st...,primary data customer intercept survey data co...,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
627,3f316b38-1a24-45a9-8d8c-4e05a42257c6,Storm Surge Vulnerability,This effort included a program to collect fir...,this effort included a program to collect fir...,storm surge vulnerability,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


### in Section

In [11]:
df_test_sent['inIntro'] = np.where(df_test_sent['section_clean'].str.contains('intro'), 1, 0)
df_test_sent['inDisc'] = np.where(df_test_sent['section_clean'].str.contains('discus'), 1, 0)
df_test_sent['inAbst'] = np.where(df_test_sent['section_clean'].str.contains('abstr'), 1, 0)
df_test_sent['inResult'] = np.where(df_test_sent['section_clean'].str.contains('resul'), 1, 0)
df_test_sent['inConcl'] = np.where(df_test_sent['section_clean'].str.contains('conclu'), 1, 0)
df_test_sent['inMethod'] = np.where(df_test_sent['section_clean'].str.contains('meth'), 1, 0)
df_test_sent['inBack'] = np.where(df_test_sent['section_clean'].str.contains('back'), 1, 0)
df_test_sent['inData'] = np.where(df_test_sent['section_clean'].str.contains('data'), 1, 0)
df_test_sent['inSumm'] = np.where(df_test_sent['section_clean'].str.contains('summ'), 1, 0)
df_test_sent['inAckno'] = np.where(df_test_sent['section_clean'].str.contains('acknowl'), 1, 0)

df_test_sent.sample(n=10)

Unnamed: 0,Id,section_title,sent,sent_clean,section_clean,freqData,freqEdu,freqSample,freqNational,freqSurvey,...,inIntro,inDisc,inAbst,inResult,inConcl,inMethod,inBack,inData,inSumm,inAckno
262,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,What people buy,The CES analysis finds that a larger share of...,the ces analysis finds that a larger share of...,what people buy,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1628,2f392438-e215-4169-bebf-21ac4ff253e1,Definitions and Methodology,eighth-graders whose principals reported inti...,eighth graders whose principals reported inti...,definitions and methodology,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2536,2100032a-7c33-4bff-97ef-690822c43466,Introduction,,,introduction,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1762,2f392438-e215-4169-bebf-21ac4ff253e1,Indicator,dollars using 2006 national purchasing power ...,dollars using 2006 national purchasing power ...,indicator,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2003,2f392438-e215-4169-bebf-21ac4ff253e1,19,Principals of 15-yearold students were given ...,principals of 15 yearold students were given ...,19,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
469,3f316b38-1a24-45a9-8d8c-4e05a42257c6,Relative Vulnerability,"As relative sea level reaches 1-2ft, the mars...",as relative sea level reaches 1 2ft the marsh...,relative vulnerability,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1795,2f392438-e215-4169-bebf-21ac4ff253e1,G-8 Countries,This was higher than in all other participati...,this was higher than in all other participati...,g 8 countries,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2324,2f392438-e215-4169-bebf-21ac4ff253e1,Postsecondary and tertiary:,,,postsecondary and tertiary,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
404,3f316b38-1a24-45a9-8d8c-4e05a42257c6,STUDY AREA,The boundary of the CAHA was used as the exten...,the boundary of the caha was used as the exten...,study area,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1631,2f392438-e215-4169-bebf-21ac4ff253e1,Definitions and Methodology,e,e,definitions and methodology,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


## has Acronyms

In [12]:
df_test_sent['hasAcronym'] = df_test_sent['sent'].apply(find_acronyms)
df_test_sent['hasAcronym'].value_counts()

0    2054
1     668
Name: hasAcronym, dtype: int64

In [13]:
df_test_sent['freqAcronym'] = df_test_sent['sent'].apply(count_acronyms)
df_test_sent['freqAcronym'].unique()

array([0, 2, 1, 7, 3, 4, 5, 8, 6])

## has Titles
Check if a sentence has a dataset title from Data.gov or ICPSR

In [14]:
icpsr = pd.read_csv(DATA_DIR + 'labels/icpsr_studies.csv')
icpsr_labels = icpsr['NAME'].apply(clean_text).str.replace('\d+', '')

%time df_test_sent['hasICPSRTitle'] = df_test_sent['sent_clean'].apply(lambda x: any([k in x for k in icpsr_labels]))
df_test_sent['hasICPSRTitle'] = df_test_sent['hasICPSRTitle'].astype('category').cat.codes

df_test_sent['hasICPSRTitle'].value_counts()

CPU times: user 5.24 s, sys: 0 ns, total: 5.24 s
Wall time: 5.26 s


0    2716
1       6
Name: hasICPSRTitle, dtype: int64

In [15]:
datagov = pd.read_csv(DATA_DIR + 'labels/kaggle_data_800.csv')
datagov_labels = datagov['title'].apply(clean_text).str.replace('\d+', '')

%time df_test_sent['hasDATAGOVTitle'] = df_test_sent['sent_clean'].apply(lambda x: any([k in x for k in datagov_labels]))
df_test_sent['hasDATAGOVTitle'] = df_test_sent['hasDATAGOVTitle'].astype('category').cat.codes

df_test_sent['hasDATAGOVTitle'].value_counts()

CPU times: user 1.05 s, sys: 0 ns, total: 1.05 s
Wall time: 1.05 s


0    2707
1      15
Name: hasDATAGOVTitle, dtype: int64

## Inspect test data

In [16]:
df_test_sent.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2722 entries, 0 to 2721
Data columns (total 39 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Id               2722 non-null   object
 1   section_title    2722 non-null   object
 2   sent             2722 non-null   object
 3   sent_clean       2722 non-null   object
 4   section_clean    2722 non-null   object
 5   freqData         2722 non-null   int64 
 6   freqEdu          2722 non-null   int64 
 7   freqSample       2722 non-null   int64 
 8   freqNational     2722 non-null   int64 
 9   freqSurvey       2722 non-null   int64 
 10  freqPublic       2722 non-null   int64 
 11  freqAvail        2722 non-null   int64 
 12  freqNSF          2722 non-null   int64 
 13  freqGov          2722 non-null   int64 
 14  freqAccess       2722 non-null   int64 
 15  hasData          2722 non-null   int64 
 16  hasEdu           2722 non-null   int64 
 17  hasSample        2722 non-null   

In [17]:
# df_test_sent.to_csv(DATA_DIR + 'test_sents_all.csv')

In [18]:
df_test_sent = df_test_sent.drop(columns=['section_title', 
                                          'sent_clean', 
                                          'section_clean'])

df_test_sent.head()

Unnamed: 0,Id,sent,freqData,freqEdu,freqSample,freqNational,freqSurvey,freqPublic,freqAvail,freqNSF,...,inConcl,inMethod,inBack,inData,inSumm,inAckno,hasAcronym,freqAcronym,hasICPSRTitle,hasDATAGOVTitle
0,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,A significant body of research has been conduc...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Even though lowincome (LI) households' choice...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,2,0,0
2,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Findings on purchasing patterns of LI househo...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,1,0,0
3,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Understanding the specific factors related to...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,This study contributes to the literature by e...,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
df_test_sent.shape

(2722, 36)

In [20]:
X_new = df_test_sent.iloc[:, 2:]

print(X_new.shape)

(2722, 34)


## Classifier
Pass tokenized test sentences to the classifier and keep predicted sentences  with prob > 0.5

In [21]:
model_path = DATA_DIR + "classifier"

pkl_filename = model_path+"/RF_balanced_model.pkl"

with open(pkl_filename, 'rb') as file:
    best_model = pickle.load(file)

best_model

RandomForestClassifier(class_weight='balanced', n_jobs=2, random_state=42)

Apply the trained classifier to the test data

In [22]:
np.random.seed(1)

%time predict = best_model.predict(X_new)

CPU times: user 40.6 ms, sys: 4.08 ms, total: 44.7 ms
Wall time: 30.3 ms


Append the predicted probability of 0, 1 to test dataframe

In [23]:
prob = best_model.predict_proba(X_new)
df_test_sent['prob_0'] = prob[:,0] 
df_test_sent['prob_1'] = prob[:,1]

df_test_sent

Unnamed: 0,Id,sent,freqData,freqEdu,freqSample,freqNational,freqSurvey,freqPublic,freqAvail,freqNSF,...,inBack,inData,inSumm,inAckno,hasAcronym,freqAcronym,hasICPSRTitle,hasDATAGOVTitle,prob_0,prob_1
0,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,A significant body of research has been conduc...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.997355,0.002645
1,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Even though lowincome (LI) households' choice...,0,0,0,0,0,0,0,0,...,0,0,0,0,1,2,0,0,0.941944,0.058056
2,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Findings on purchasing patterns of LI househo...,0,0,0,0,0,0,0,0,...,0,0,0,0,1,1,0,0,0.963614,0.036386
3,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Understanding the specific factors related to...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.997355,0.002645
4,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,This study contributes to the literature by e...,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2717,2100032a-7c33-4bff-97ef-690822c43466,\nTrampush et al,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.999807,0.000193
2718,2100032a-7c33-4bff-97ef-690822c43466,Page 19 Table 3 Meta-analytic results of th...,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.999807,0.000193
2719,2100032a-7c33-4bff-97ef-690822c43466,1 associated with educational attainment in CO...,0,1,0,0,0,0,0,0,...,0,0,0,0,1,6,0,0,1.000000,0.000000
2720,2100032a-7c33-4bff-97ef-690822c43466,,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0.999807,0.000193


Filter the dataframe to extract high likelihood sentences by keeping rows with `prob_1` >= 0.1
* Try different thresholds

In [24]:
df_candidates = df_test_sent.query('prob_1 >= 0.01')

# for sent in df_candidates.sent:
#     print(sent, "\n")

What percent of sentences have a higher likelihood of being citances?

In [25]:
print((len(df_candidates)/len(df_test_sent))*100)

21.491550330639235


## NER
Apply custom model to candidate sentences only

In [26]:
candidates = df_candidates['sent'].to_list()

for candidate in candidates:
    label = set()
    doc = custom_ner_model(candidate)
    if len(doc.ents) > 0:
        label.add(clean_text(doc.ents))
        displacy.render(doc, style="ent", jupyter=True)

In [27]:
result = []

for index in submission_df.Id:
    publication_text = df_candidates[df_candidates['Id'] == index].sent
    label = set()
    for candidate in publication_text:
        doc = custom_ner_model(candidate)
        if len(doc.ents) > 0:
            label.add(clean_text(doc.ents))
    label_list = sorted(list(label))
    result.append('|'.join(label_list))

for hit in result:
    print(hit, "\n")

 adni  

 nces common core of data | trends in international mathematics and science study  

 coastal change hazards portal | nc sea level rise risk management study | noaa storm surge | slosh and | slosh basin models | slosh display | slosh grid | slosh grids | slosh inundation grid | slosh meows | slosh model | slosh mom | slosh moms | slosh output | slosh point | slosh safir | slosh storm  

 rural urban continuum codes  



In [28]:
submission_df['PredictionString'] = result
submission_df

Unnamed: 0,Id,PredictionString
0,2100032a-7c33-4bff-97ef-690822c43466,adni
1,2f392438-e215-4169-bebf-21ac4ff253e1,nces common core of data | trends in internat...
2,3f316b38-1a24-45a9-8d8c-4e05a42257c6,coastal change hazards portal | nc sea level ...
3,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,rural urban continuum codes
