# Feature engineering
Analyzing publication sections and sentences to learn which features can predict data citations

To do: 
- [ ] weights for features (based on frequency)
- [ ] include more features (classification methods can handle many more inputs)

Functions:
- [x] extract acronyms
- [x] make acronyms
- [x] find ngrams
- [x] search window

Features:
- [x] SECTION (2): inIntro (0,1); inAbstract (0,1)
- [x] INDICATOR TERM (1): hasData (0,1)
- [x] ENTITY (1): hasOrg (0,1)

In [62]:
import pandas as pd
import numpy as np
import re
from collections import defaultdict, Counter
from itertools import chain
import pickle

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

import spacy
import en_core_web_sm

from imblearn.over_sampling import SMOTE

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


In [1]:
path = '../data/'
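DATA_DIR = '/nfs/turbo/hrg/coleridge/'
turbo = DATA_DIR  # assumption: `turbo`, used by the read calls under Labels, is this same directory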


In [33]:
nlp = en_core_web_sm.load()

## Functions
Custom helpers: `clean_text`, `extract_acronyms`, `make_acronyms`, `find_ngrams`, and `search_window`

The first cells define a text-cleaning helper and the example strings used in the demos:

In [3]:
def clean_text(txt):
    """
    Convert to lowercase, remove special characters, and punctuation.
    """
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

In [8]:
sec1 = "The SSGAC reanalyzed their GWAS data utilizing a two-stage design in order to examine whether educational attainment is a valid proxy phenotype for cognitive ability. (Rietveld et al., 2014b ) Subjects with available neurocognitive data (n = 24,189) were removed from the larger cohort, and a GWAS of years of education was then conducted in the remaining ~106K participants (Stage 1). "
sec2 = "This report draws on the most current information about education from four primary sources: the Indicators of National Education Systems (INES) at the Organization for Economic Cooperation and Development (OECD); the Progress in International reading Literacy Study (PIrLS); the Program for International Student Assessment (PISA); and the Trends in International Mathematics and Science Study (TIMSS). "
title1 = "Social Science Genetic Association Consortium and Genome Wide Association Study"
title2 = "Alzheimer's Disease Neuroimaging Initiative (ADNI)"
title3 = "Trends in International Mathematics and Science Study"
false1 = "no dataset acronym here"

In [184]:
def extract_acronyms(txt):
    """
    finds and returns a sequence of capital letters
    for use on dataset_titles, dataset_labels, or full text
    """
    matches = re.findall(r"\b[A-Z\.]{2,}s?\b", txt)
    if matches:
        return matches
    else:
        return None

In [158]:
def make_acronyms(txt):
    """
    finds runs of two or more capitalized words, returns the initials of each run
    for use on dataset_titles, dataset_labels, or full text
    """
    constructs = re.findall(r"([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)", str(txt))
    if not constructs:
        return None
    # e.g. "Genome Wide Association Study" -> "GWAS"
    return ["".join(word[0] for word in construct.split() if word != "and")
            for construct in constructs]

In [60]:
stop = set(stopwords.words('english'))

def find_ngrams(txt, n):
    return list(zip(*[txt[i:] for i in range(n)]))
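
`find_ngrams` has no demo in this notebook. A small usage sketch, assuming the input is an already tokenized list of words (the toy title is made up for illustration):

```python
def find_ngrams(txt, n):
    # Zip n staggered copies of the token list to form sliding windows of length n.
    return list(zip(*[txt[i:] for i in range(n)]))

tokens = "current population survey supplement".split()
bigrams = find_ngrams(tokens, 2)
trigrams = find_ngrams(tokens, 3)
print(bigrams)   # [('current', 'population'), ('population', 'survey'), ('survey', 'supplement')]
print(trigrams)  # [('current', 'population', 'survey'), ('population', 'survey', 'supplement')]
```

A list shorter than `n` yields an empty list, so short titles simply contribute no n-grams.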

In [11]:
def search_window(section, phrase, window_size):
    """
    returns the first occurrence of phrase in section with up to window_size words
    of context on each side (None if the phrase is not found)
    for use on full text as a preprocessing step
    """
    section = section.split()
    phrase = phrase.split()
    words = len(phrase)

    for i, word in enumerate(section):
        if word == phrase[0] and section[i:i+words] == phrase:
            start = max(0, i-window_size)
            return ' '.join(section[start:i+words+window_size])

Demonstrate functions with examples

In [12]:
clean_text("The India Human Development Survey (Desai et al., 2005) ")

'the india human development survey desai et al 2005 '

In [13]:
jaccard('India Human Development Survey', 'from the human development survey')

0.5
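
`jaccard` is called above but is not defined in this notebook. A minimal sketch that reproduces the output shown, assuming the helper lowercases both strings and compares their sets of whitespace-separated tokens (the original implementation may differ):

```python
def jaccard(str1, str2):
    # Assumed behavior: case-insensitive token-set overlap, |A & B| / |A | B|.
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

score = jaccard('India Human Development Survey', 'from the human development survey')
print(score)  # 0.5
```

Here the two strings share 3 tokens out of 6 distinct tokens overall, which gives 0.5.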

In [14]:
print(sec1)

candidates = extract_acronyms(sec1)

for candidate in candidates:
    print(search_window(sec1,candidate,5))

The SSGAC reanalyzed their GWAS data utilizing a two-stage design in order to examine whether educational attainment is a valid proxy phenotype for cognitive ability. (Rietveld et al., 2014b ) Subjects with available neurocognitive data (n = 24,189) were removed from the larger cohort, and a GWAS of years of education was then conducted in the remaining ~106K participants (Stage 1). 
The SSGAC reanalyzed their GWAS data utilizing
The SSGAC reanalyzed their GWAS data utilizing a two-stage design
The SSGAC reanalyzed their GWAS data utilizing a two-stage design


In [15]:
print(title1)

candidates = make_acronyms(title1)

for candidate in candidates:
    print(search_window(sec1,candidate,5))

Social Science Genetic Association Consortium and Genome Wide Association Study
The SSGAC reanalyzed their GWAS data utilizing
The SSGAC reanalyzed their GWAS data utilizing a two-stage design


In [16]:
window_ex = search_window(sec2,title3,2)
window_ex

'and the Trends in International Mathematics and Science Study (TIMSS).'

## Labels

In [94]:
icpsr = pd.read_excel(turbo+'studies.xlsx')

icpsr['cleaned_name'] = icpsr['NAME'].apply(clean_text)
icpsr['cleaned_name'] = icpsr['cleaned_name'].str.replace("restricted", "")
icpsr['cleaned_name'] = icpsr['cleaned_name'].str.replace(r'\d+', '', regex=True)
icpsr['cleaned_name'] = icpsr['cleaned_name'].str.strip()
icpsr['cleaned_name'] = icpsr['cleaned_name'].str.replace('  ', ' ')
icpsr_names = icpsr['cleaned_name']

icpsr['no_stops'] = icpsr['cleaned_name'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
Counter(" ".join(icpsr['no_stops']).split()).most_common(10)

icpsr['bigrams'] = icpsr['no_stops'].map(lambda x: find_ngrams(x.split(" "), 2))
icpsr_bigrams = icpsr['bigrams'].tolist()
icpsr_bigrams = list(chain(*icpsr_bigrams))
icpsr_bigrams = [(x.lower(), y.lower()) for x,y in icpsr_bigrams]

icpsr_bigrams_counts = Counter(icpsr_bigrams)
icpsr_bigrams_counts.most_common(10)

icpsr['trigrams'] = icpsr['no_stops'].map(lambda x: find_ngrams(x.split(" "), 3))
icpsr_trigrams = icpsr['trigrams'].tolist()
icpsr_trigrams = list(chain(*icpsr_trigrams))
icpsr_trigrams = [(x.lower(), y.lower(), z.lower()) for x,y,z in icpsr_trigrams]

icpsr_trigrams_counts = Counter(icpsr_trigrams)
icpsr_trigrams_counts.most_common(10)

[(('survey', 'consumer', 'attitudes'), 505),
 (('consumer', 'attitudes', 'behavior'), 505),
 (('new', 'york', 'times'), 410),
 (('federal', 'justice', 'statistics'), 407),
 (('justice', 'statistics', 'program'), 407),
 (('news', 'new', 'york'), 373),
 (('cbs', 'news', 'new'), 364),
 (('census', 'population', 'housing'), 329),
 (('current', 'population', 'survey'), 326),
 (('population', 'housing', 'united'), 315)]
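
The same stopword removal and n-gram counting steps are repeated for every label source. A sketch of one way to wrap them in a helper; `top_ngrams` and the toy titles are hypothetical, and the short stopword set stands in for NLTK's English list:

```python
from collections import Counter

def top_ngrams(titles, n, stop_words, k=10):
    """Count n-grams over cleaned titles after dropping stopwords."""
    counts = Counter()
    for title in titles:
        tokens = [w for w in title.lower().split() if w not in stop_words]
        # Sliding windows of length n, same idea as find_ngrams.
        counts.update(zip(*[tokens[i:] for i in range(n)]))
    return counts.most_common(k)

toy_titles = [
    "survey of consumer attitudes and behavior",
    "survey of consumer attitudes and behavior",
    "current population survey",
]
toy_stop = {"of", "and", "the"}
top = top_ngrams(toy_titles, 3, toy_stop, k=2)
print(top)
```

With a helper like this, each source would need only its cleaned title column and a value of `n`.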

In [97]:
datagov_long = pd.read_csv(turbo+'data_set_26897.csv')

datagov_long['cleaned_title'] = datagov_long['title'].apply(clean_text)
datagov_long['cleaned_title'] = datagov_long['cleaned_title'].str.replace(r'\d+', '', regex=True)
datagov_long['cleaned_title'] = datagov_long['cleaned_title'].str.strip()
datagov_long['cleaned_title'] = datagov_long['cleaned_title'].str.replace('  ', ' ')
datagov_titles = datagov_long['cleaned_title']

datagov_long['no_stops'] = datagov_long['cleaned_title'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
Counter(" ".join(datagov_long['no_stops']).split()).most_common(10)

datagov_long['bigrams'] = datagov_long['no_stops'].map(lambda x: find_ngrams(x.split(" "), 2))
datagov_long_bi = datagov_long['bigrams'].tolist()
datagov_long_bi = list(chain(*datagov_long_bi))
datagov_long_bi = [(x.lower(), y.lower()) for x,y in datagov_long_bi]

datagov_long_bi_counts = Counter(datagov_long_bi)
datagov_long_bi_counts.most_common(10)

datagov_long['trigrams'] = datagov_long['no_stops'].map(lambda x: find_ngrams(x.split(" "), 3))
datagov_long_tri = datagov_long['trigrams'].tolist()
datagov_long_tri = list(chain(*datagov_long_tri))
datagov_long_tri = [(x.lower(), y.lower(), z.lower()) for x,y,z in datagov_long_tri]

datagov_long_tri_counts = Counter(datagov_long_tri)
datagov_long_tri_counts.most_common(10)

[(('temperature', 'salinity', 'profile'), 2165),
 (('north', 'atlantic', 'ocean'), 2091),
 (('global', 'temperature', 'salinity'), 1516),
 (('salinity', 'profile', 'program'), 1514),
 (('profile', 'program', 'gtspp'), 1512),
 (('program', 'gtspp', 'submitted'), 1484),
 (('north', 'pacific', 'ocean'), 1274),
 (('gtspp', 'submitted', 'nodc'), 1168),
 (('submitted', 'nodc', 'accession'), 1168),
 (('ocean', 'weather', 'station'), 1089)]

In [98]:
datagov_short = pd.read_csv(turbo+'data_set_800.csv')

datagov_short['cleaned_title'] = datagov_short['title'].apply(clean_text)
datagov_short['cleaned_title'] = datagov_short['cleaned_title'].str.replace(r'\d+', '', regex=True)
datagov_short['cleaned_title'] = datagov_short['cleaned_title'].str.strip()
datagov_short['cleaned_title'] = datagov_short['cleaned_title'].str.replace('  ', ' ')
datagov_short['cleaned_title'] = datagov_short['cleaned_title'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
datagov_names = datagov_short['cleaned_title']

datagov_short['no_stops'] = datagov_short['cleaned_title'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
Counter(" ".join(datagov_short['no_stops']).split()).most_common(10)

datagov_short['bigrams'] = datagov_short['no_stops'].map(lambda x: find_ngrams(x.split(" "), 2))
datagov_short_bi = datagov_short['bigrams'].tolist()
datagov_short_bi = list(chain(*datagov_short_bi))
datagov_short_bi = [(x.lower(), y.lower()) for x,y in datagov_short_bi]

datagov_short_bi_counts = Counter(datagov_short_bi)
datagov_short_bi_counts.most_common(10)

datagov_short['trigrams'] = datagov_short['no_stops'].map(lambda x: find_ngrams(x.split(" "), 3))
datagov_short_tri = datagov_short['trigrams'].tolist()
datagov_short_tri = list(chain(*datagov_short_tri))
datagov_short_tri = [(x.lower(), y.lower(), z.lower()) for x,y,z in datagov_short_tri]

datagov_short_tri_counts = Counter(datagov_short_tri)
datagov_short_tri_counts.most_common(10)

[(('nndss', 'table', 'ii'), 80),
 (('conus', 'image', 'service'), 78),
 (('terrestrial', 'condition', 'assessment'), 57),
 (('condition', 'assessment', 'tca'), 57),
 (('covid', 'state', 'profile'), 53),
 (('state', 'profile', 'report'), 53),
 (('version', 'conus', 'image'), 42),
 (('alaska', 'image', 'service'), 38),
 (('tree', 'canopy', 'cover'), 36),
 (('state', 'drug', 'utilization'), 30)]

In [113]:
df_train['no_stops'] = df_train['cleaned_title'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
Counter(" ".join(df_train['no_stops']).split()).most_common(10)

df_train['bigrams'] = df_train['no_stops'].map(lambda x: find_ngrams(x.split(" "), 2))
df_train_bigrams = df_train['bigrams'].tolist()
df_train_bigrams = list(chain(*df_train_bigrams))
df_train_bigrams = [(x.lower(), y.lower()) for x,y in df_train_bigrams]

df_train_bigrams_counts = Counter(df_train_bigrams)
df_train_bigrams_counts.most_common(10)

df_train['trigrams'] = df_train['no_stops'].map(lambda x: find_ngrams(x.split(" "), 3))
df_train_trigrams = df_train['trigrams'].tolist()
df_train_trigrams = list(chain(*df_train_trigrams))
df_train_trigrams = [(x.lower(), y.lower(), z.lower()) for x,y,z in df_train_trigrams]

df_train_trigrams_counts = Counter(df_train_trigrams)
df_train_trigrams_counts.most_common(10)

[(('alzheimer', 'disease', 'neuroimaging'), 6144),
 (('disease', 'neuroimaging', 'initiative'), 6144),
 (('neuroimaging', 'initiative', 'adni'), 6144),
 (('baltimore', 'longitudinal', 'study'), 1589),
 (('longitudinal', 'study', 'aging'), 1589),
 (('study', 'aging', 'blsa'), 1589),
 (('education', 'longitudinal', 'study'), 1226),
 (('trends', 'international', 'mathematics'), 1163),
 (('international', 'mathematics', 'science'), 1163),
 (('mathematics', 'science', 'study'), 1163)]

In [207]:
df_train['acronyms'] = df_train['dataset_title'].apply(extract_acronyms)
acronyms = df_train['acronyms'].tolist()
ac_res = list(filter(None, acronyms))
ac_flat = [ac for sublist in ac_res for ac in sublist]  # flatten the list of acronym lists
ac_lower = [i.lower() for i in ac_flat]
acronym_labels = set(ac_lower)
acronym_labels

{'adni',
 'agid',
 'anss',
 'bbs',
 'blsa',
 'cas',
 'cccsl',
 'charybdis',
 'cord',
 'cov',
 'covid',
 'crown',
 'ffrdc',
 'jh',
 'niagads',
 'noaa',
 'ricord',
 'rsna',
 'sars'}

## Data

DataFrame of all publication sections. `match == True` means the section text contains the dataset label (a data citation); `match == False` means it does not.

- remove records with a missing section title or text (we can't learn from these examples)

In [5]:
df_train = pd.read_csv(path+'train.csv',low_memory=False)
df_pubs = pd.read_csv(path+'test_publications.csv',low_memory=False)
# Drop incomplete records first so clean_text never turns NaN into the string 'nan'
df_pubs = df_pubs.dropna(subset=['section_title', 'text'])
df_pubs['clean_section'] = df_pubs['section_title'].apply(clean_text)

df_full = df_train.merge(df_pubs, on='Id')
df_full['dataset_label'] = df_full['dataset_label'].astype(str)
df_full['text'] = df_full['text'].astype(str)
df_full['match'] = df_full.apply(lambda x: x.dataset_label in x.text, axis=1)

df_full

Unnamed: 0,Id,pub_title,dataset_title,dataset_label,cleaned_label,section_title,text,clean_text,clean_section,match
0,d0fa7568-7d8e-4db9-870f-f9c6f668c17b,The Impact of Dual Enrollment on College Degre...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study,What is this study about?,This study used data from the National Educati...,this study used data from the national educati...,what is this study about,True
1,d0fa7568-7d8e-4db9-870f-f9c6f668c17b,The Impact of Dual Enrollment on College Degre...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study,Features of Dual Enrollment Programs,Dual enrollment programs allow high school stu...,dual enrollment programs allow high school stu...,features of dual enrollment programs,False
2,d0fa7568-7d8e-4db9-870f-f9c6f668c17b,The Impact of Dual Enrollment on College Degre...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study,WWC Single Study Review,"What did the study find?\nThe study reported, ...",what did the study find the study reported and...,wwc single study review,False
3,d0fa7568-7d8e-4db9-870f-f9c6f668c17b,The Impact of Dual Enrollment on College Degre...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study,WWC Rating,The research described in this report meets WW...,the research described in this report meets ww...,wwc rating,False
4,d0fa7568-7d8e-4db9-870f-f9c6f668c17b,The Impact of Dual Enrollment on College Degre...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study,Intervention group,The intervention group was comprised of those ...,the intervention group was comprised of those ...,intervention group,False
...,...,...,...,...,...,...,...,...,...,...
345487,fd23e7e0-a5d2-4f98-992d-9209c85153bb,A ligand-based computational drug repurposing ...,CAS COVID-19 antiviral candidate compounds dat...,CAS COVID-19 antiviral candidate compounds data,cas covid 19 antiviral candidate compounds data,Retrieval of GLUT-1 data,By mapping GLUT-1 retrieved from the Open Targ...,by mapping glut 1 retrieved from the open targ...,retrieval of glut 1 data,False
345488,fd23e7e0-a5d2-4f98-992d-9209c85153bb,A ligand-based computational drug repurposing ...,CAS COVID-19 antiviral candidate compounds dat...,CAS COVID-19 antiviral candidate compounds data,cas covid 19 antiviral candidate compounds data,GLUT-1 case: substructure searches in DrugBank,Hierarchical clustering of available Murcko sc...,hierarchical clustering of available murcko sc...,glut 1 case substructure searches in drugbank,False
345489,fd23e7e0-a5d2-4f98-992d-9209c85153bb,A ligand-based computational drug repurposing ...,CAS COVID-19 antiviral candidate compounds dat...,CAS COVID-19 antiviral candidate compounds data,cas covid 19 antiviral candidate compounds data,Table 2 (continued),"The structural fragment, SMARTS string, the nu...",the structural fragment smarts string the numb...,table 2 continued,False
345490,fd23e7e0-a5d2-4f98-992d-9209c85153bb,A ligand-based computational drug repurposing ...,CAS COVID-19 antiviral candidate compounds dat...,CAS COVID-19 antiviral candidate compounds data,cas covid 19 antiviral candidate compounds data,Experiences when using the workflow in the cla...,The workflow described herein has been used in...,the workflow described herein has been used in...,experiences when using the workflow in the cla...,False


## Heuristics

### inSection
We want to see if there's a relationship between `match`==True and section
- Introduction
- Abstract
- Discussion

In [64]:
pub_sections = ["title", 
                "abstract", 
                "introduction", 
                "background",
                "method",
                "methods", 
                "methodology",
                "preprocessing",
                "design",
                "analysis",
                "sample", 
                "results", 
                "discussion", 
                "conclusion", 
                "conclusions",
                "summary",
                "references", 
                "data", 
                "material",
                "materials", 
                "supplement",
                "supplements",
                "supplementary",
                "table",
                "tables", 
                "figure",
                "figures",
                "footnote",
                "footnotes",
                "acknowledgement",
                "acknowledgements",
                "appendix",
                "appendices"]

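# .copy() makes df_normal an independent frame, so the column assignments below do not raise SettingWithCopyWarning
df_normal = df_full.query('clean_section in @pub_sections').copy()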

def func(a):
    """Collapse singular/plural and variant section names to one canonical name."""
    a_lower = a.lower()
    if "method" in a_lower:           # method, methods, methodology
        return "methods"
    elif "conclusion" in a_lower:     # conclusion, conclusions
        return "conclusions"
    elif "materials" in a_lower:
        return "material"
    elif "supplement" in a_lower:     # supplement, supplements, supplementary
        return "supplement"
    elif "figures" in a_lower:        # the original duplicate branch left "figures" unchanged
        return "figure"
    elif "footnote" in a_lower:       # footnote, footnotes
        return "footnotes"
    elif "tables" in a_lower:
        return "table"
    elif "acknowledgements" in a_lower:
        return "acknowledgement"
    elif "appendices" in a_lower:
        return "appendix"
    else:
        return a

df_normal['clean_section'] = df_normal['clean_section'].apply(lambda x: func(x))
df_normal['inIntro'] = np.where(df_normal['clean_section'].str.contains('intro'), 1, 0)
df_normal['inAbstract'] = np.where(df_normal['clean_section'].str.contains('abstract'), 1, 0)
df_normal['inDiscussion'] = np.where(df_normal['clean_section'].str.contains('discuss'), 1, 0)

### hasData
We want to see if there's a relationship between `match`==True and indicator terms

- data
- edu
- survey

In [65]:
indicator_terms = ["accession", 
                   "available at", 
                   "available from", 
                   "com", 
                   "commercial", 
                   "corp", 
                   "data",
                   "dataset", 
                   "datasets", 
                   "database", 
                   "deposited",
                   "doi",
                   "donated",
                   "edu",
                   "ftp:", 
                   "gift", 
                   "gov", 
                   "inc", 
                   "national institutes of health", 
                   "national science foundation",
                   "nih", 
                   "nsf",
                   "grant",
                   "obtained from", 
                   "publicly available", 
                   "purchased from", 
                   "repository", 
                   "sample sets stored", 
                   "suppl", 
                   "supplemental",
                   "survey"]

ind_pattern = '|'.join(indicator_terms)

def search_pat(search_str:str, search_list:str):
    search_obj = re.search(search_list, search_str)
    if search_obj:
        return_str = search_str[search_obj.start(): search_obj.end()]
    else:
        return_str = 'NA'
    return return_str

df_normal['clean_text'] = df_normal['clean_text'].astype(str)
df_normal['indicator'] = df_normal['clean_text'].apply(lambda x: search_pat(search_str=x, search_list=ind_pattern))
df_normal['hasData'] = np.where(df_normal['indicator'].str.contains('data'), 1, 0)
df_normal['hasEdu'] = np.where(df_normal['indicator'].str.contains('edu'), 1, 0)
df_normal['hasSurvey'] = np.where(df_normal['indicator'].str.contains('survey'), 1, 0)
df_normal.sample(n=10)

Unnamed: 0,Id,pub_title,dataset_title,dataset_label,cleaned_label,section_title,text,clean_text,clean_section,match,inIntro,inAbstract,inDiscussion,indicator,hasData,hasEdu,hasSurvey
167171,56951ddb-95e5-4a51-a2e4-f9b2c5ce93b0,Changing Urbanization Patterns in US Lung Canc...,Rural-Urban Continuum Codes,Rural-Urban Continuum Codes,rural urban continuum codes,Methods,To analyze rural-urban disparities in lung can...,to analyze rural urban disparities in lung can...,methods,False,0,0,0,data,1,0,0
236805,6fdca189-26af-4f14-ae09-72e4ba82d9f9,Black and Latino College Enrollment: Effects o...,National Education Longitudinal Study,National Education Longitudinal Study,national education longitudinal study,Results,High School Graduates' College Enrollment Over...,high school graduates college enrollment over ...,results,False,0,0,0,edu,0,1,0
166930,67722736-9ee1-4c96-857a-05b877166e0b,The Descriptive Epidemiology of Gastric Cancer...,Rural-Urban Continuum Codes,Rural-Urban Continuum Codes,rural urban continuum codes,Discussion,This is the first study to describe the epidem...,this is the first study to describe the epidem...,discussion,False,0,0,1,inc,0,0,0
16215,c9125c0d-5e8c-418b-a967-dede06239aa6,Waist circumference is an independent risk fac...,Baltimore Longitudinal Study of Aging (BLSA),Baltimore Longitudinal Study of Aging (BLSA),baltimore longitudinal study of aging blsa,Abstract,Conclusions: Study results showed that waist c...,conclusions study results showed that waist ci...,abstract,False,0,1,0,,0,0,0
144448,387bc909-8db4-4303-a627-4a44647523c5,Educational infrastructures and organisational...,Trends in International Mathematics and Scienc...,Trends in International Mathematics and Scienc...,trends in international mathematics and scienc...,Background,The Swedish National Agency for Education intr...,the swedish national agency for education intr...,background,True,0,0,0,edu,0,1,0
134733,62d6178e-0a74-4c0c-bc40-97e29346c0ae,The ICCAM platform study: An experimental medi...,Alzheimer's Disease Neuroimaging Initiative (A...,ADNI,adni,Abstract,We aimed to set up a robust multi-centre clini...,we aimed to set up a robust multi centre clini...,abstract,False,0,1,0,inc,0,0,0
35206,644a6445-62ac-40e9-ae29-6cc6b0917b5e,Producing Humanities PhDs among BAs at Doctora...,Survey of Earned Doctorates,Survey of Earned Doctorates,survey of earned doctorates,Abstract,This paper investigates which attributes of a ...,this paper investigates which attributes of a ...,abstract,True,0,1,0,survey,0,0,1
112089,91b8b4f5-4ade-433a-b3dc-216738e58e1d,Diffusion tensor imaging in Alzheimer's diseas...,Alzheimer's Disease Neuroimaging Initiative (A...,ADNI,adni,CONCLUSION,"In summary, the existing literature supports t...",in summary the existing literature supports th...,conclusions,False,0,0,0,corp,0,0,0
86275,7b53be02-2d42-4404-894e-60bebc8155fa,Sparse Scale-Space Decomposition of Volume Cha...,Alzheimer's Disease Neuroimaging Initiative (A...,Alzheimer's Disease Neuroimaging Initiative (A...,alzheimer s disease neuroimaging initiative adni,Abstract,Abstract. Anatomical changes like brain atroph...,abstract anatomical changes like brain atrophy...,abstract,False,0,1,0,inc,0,0,0
105953,3d73578e-3932-4d2c-8e36-5bd9800c79ec,Prediction of clinical scores from neuroimagin...,Alzheimer's Disease Neuroimaging Initiative (A...,Alzheimer's Disease Neuroimaging Initiative (A...,alzheimer s disease neuroimaging initiative adni,Abstract,"Abstract-In this paper, we explore the use of ...",abstract in this paper we explore the use of c...,abstract,False,0,1,0,data,1,0,0
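
`search_pat` returns only the first match of the joined pattern, and the pattern matches inside longer words (the sample above shows `inc`, `corp`, and `edu` as indicators). A hedged alternative, not the method used for the results in this notebook: compile the terms with word boundaries and flag each term of interest independently. `find_indicators` and the shortened term list are illustrative.

```python
import re

terms = ["available at", "data", "dataset", "edu", "inc", "survey"]
# Longest terms first so "dataset" is preferred over "data";
# \b keeps "inc" from matching inside "since".
ordered = sorted(terms, key=len, reverse=True)
pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b")

def find_indicators(text):
    """Return the set of all indicator terms that occur as whole words."""
    return set(pattern.findall(text.lower()))

found = find_indicators("since the survey data are available at the archive")
flags = {name: int(name in found) for name in ("data", "edu", "survey")}
print(sorted(found))  # ['available at', 'data', 'survey']
print(flags)          # {'data': 1, 'edu': 0, 'survey': 1}
```

Flagging every term this way would change the feature values, so the pivot tables would need to be recomputed before any comparison.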


### Pivot table (full)
None of the features we checked separates the two classes strongly. However, sections with true data citations tend to:
- contain "data" indicator terms more often (`hasData` mean 0.263 vs. 0.175)
- be "introduction" sections more often (`inIntro` mean 0.365 vs. 0.199)

In [66]:
pd.pivot_table(df_normal,index="match",values = ["inIntro", "inAbstract", "inDiscussion", "hasData", "hasEdu", "hasSurvey"])

match,hasData,hasEdu,hasSurvey,inAbstract,inDiscussion,inIntro
False,0.175454,0.101242,0.020564,0.235178,0.159362,0.198619
True,0.26301,0.098251,0.023626,0.220649,0.159138,0.365425
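
The comparison can be made explicit by subtracting the `False` row from the `True` row and ranking the differences. A small sketch using the values copied from the pivot table above:

```python
false_row = {"hasData": 0.175454, "hasEdu": 0.101242, "hasSurvey": 0.020564,
             "inAbstract": 0.235178, "inDiscussion": 0.159362, "inIntro": 0.198619}
true_row = {"hasData": 0.26301, "hasEdu": 0.098251, "hasSurvey": 0.023626,
            "inAbstract": 0.220649, "inDiscussion": 0.159138, "inIntro": 0.365425}

# Difference in feature means between sections with and without a data citation.
diff = {name: true_row[name] - false_row[name] for name in false_row}
ranked = sorted(diff.items(), key=lambda kv: kv[1], reverse=True)

for name, d in ranked:
    print(f"{name:>12}: {d:+.3f}")
```

`inIntro` and `hasData` come out on top; the remaining features differ by less than 0.02 in either direction.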


### hasOrg
We want to see if there's a relationship between `match`==True and entity type (ORGANIZATION), encoded as `hasOrg`==1
- many citances contain a reference to something that spaCy tags as an ORG and/or a DATE entity; only ORG is used as a rule below
- sample to reduce dataframe size: first 10% of the sections for each dataset title, then 1,000 sentences per `match` class
- split section text into sentences
- from here we can work with true and false examples of citances

In [130]:
sample_df = df_normal.groupby('dataset_title').apply(lambda x: x.sample(frac=0.1))
sample_df = sample_df.reset_index(drop=True)  # assign the result; reset_index is not in-place

sent = pd.DataFrame(sample_df['text'].apply(sent_tokenize).tolist(), index=sample_df.Id).stack()
sent = sent.reset_index()[[0, 'Id']]
sent.columns = ['sent', 'Id']

df_sent = sent.merge(sample_df, on='Id')
df_sent = df_sent.drop(columns=['clean_text', 
                      'clean_section', 
                      'match', 
                      'text', 
                      'cleaned_label'])

df_sent['dataset_label'] = df_sent['dataset_label'].astype(str)
df_sent['sent'] = df_sent['sent'].astype(str)
df_sent['match'] = df_sent.apply(lambda x: x.dataset_label in x.sent, axis=1)
df_sent = df_sent[['Id', 'pub_title', 'section_title', 'dataset_title', 'dataset_label','sent', 'match', "inIntro", "inAbstract", "inDiscussion", "hasData", "hasEdu", "hasSurvey"]]

sent_sample = df_sent.groupby('match').apply(lambda x: x.sample(n=1000)).reset_index(drop = True)
sent_sample['match'].value_counts()

True     1000
False    1000
Name: match, dtype: int64
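
Because `Id` identifies a publication rather than a section, merging sentences back onto `sample_df` by `Id` pairs each sentence with every sampled row of that publication whenever the publication contributes more than one row. A sketch of an alternative that keeps each sentence attached to its own section row, assuming pandas 0.25+ for `DataFrame.explode`; the regex splitter is a stand-in for `sent_tokenize` and the toy rows are made up:

```python
import re
import pandas as pd

def split_sentences(text):
    # Stand-in for nltk's sent_tokenize so the sketch runs without NLTK data.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

toy = pd.DataFrame({
    "Id": ["p1", "p1"],
    "section_title": ["Introduction", "Methods"],
    "dataset_label": ["ADNI", "ADNI"],
    "text": ["We study aging. Data came from ADNI.",
             "Scans were processed. Models were fit."],
})

# One row per sentence; every other column of the section row is repeated.
toy_sent = (
    toy.assign(sent=toy["text"].apply(split_sentences))
       .explode("sent")
       .reset_index(drop=True)
)
toy_sent["match"] = [label in s for label, s in
                     zip(toy_sent["dataset_label"], toy_sent["sent"])]
print(toy_sent[["section_title", "sent", "match"]])
```

With this layout the section flags (`inIntro`, `inAbstract`, `inDiscussion`) on each row describe the section the sentence actually came from.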

Define functions for extracting named entities in dataframe

In [131]:
def extract_named_ents(text):
    """Extract named entities as (text, label) tuples using spaCy's out-of-the-box model.
    
    Keyword arguments:
    text -- the actual text source from which to extract entities
    
    """
    return [(ent.text, ent.label_) for ent in nlp(text).ents]

def add_named_ents(df):
    """Create new column in data frame with named entity tuples extracted from the 'sent' column.
    
    Keyword arguments:
    df -- a dataframe object (modified in place)
    
    """
    # Use the argument, not the global sent_sample, so the function works on any frame
    df['named_ents'] = df['sent'].apply(extract_named_ents)

In [132]:
add_named_ents(sent_sample)

named_entities = ['ORG']

ent_pattern = '|'.join(named_entities)

def ent_pat(search_str:str, search_list:str):
    search_obj = re.search(search_list, search_str)
    if search_obj:
        return_str = search_str[search_obj.start(): search_obj.end()]
    else:
        return_str = 'NA'
    return return_str

sent_sample['named_ents'] = sent_sample['named_ents'].astype(str)
sent_sample['named_ents'] = sent_sample['named_ents'].apply(lambda x: ent_pat(search_str=x, search_list=ent_pattern))
sent_sample['hasOrg'] = np.where(sent_sample['named_ents']!= 'NA', 1, 0)
sent_sample = sent_sample[['Id', 'pub_title', 'section_title', 'dataset_title', 'dataset_label','sent', 'match', 'named_ents', 'inIntro', 'inAbstract', 'inDiscussion', 'hasData', 'hasEdu', 'hasSurvey', 'hasOrg']]
sent_sample.sample(n=10)

Unnamed: 0,Id,pub_title,section_title,dataset_title,dataset_label,sent,match,named_ents,inIntro,inAbstract,inDiscussion,hasData,hasEdu,hasSurvey,hasOrg
550,ca2b1382-8ead-45c8-ad30-8877da49a72d,Sex-specific differences in progressive glucos...,Discussion,Baltimore Longitudinal Study of Aging (BLSA),Baltimore Longitudinal Study of Aging,"With the population rapidly aging, the prevale...",False,,0,0,1,0,0,0,0
69,120a0831-611c-45ac-8545-97af5c18ab19,Gene-based GWAS and biological pathway analysi...,Methods,Alzheimer's Disease Neuroimaging Initiative (A...,ADNI,Genebased (and pathway-based) analyses can ove...,False,ORG,0,0,0,1,0,0,1
705,7e9548f8-f50e-45fc-92ac-49b57fae9469,Vestibular Function and Beta-Amyloid Depositio...,INTRODUCTION,Baltimore Longitudinal Study of Aging (BLSA),Baltimore Longitudinal Study of Aging (BLSA),Beta-amyloid (Aβ) plaque deposition is a key f...,False,,1,0,0,0,0,0,0
1525,b4a32e23-0b36-4fc0-92bb-1002aa71e577,What’s Driving U.S. Broiler Farm Profitability?,Introduction,Census of Agriculture,Census of Agriculture,"In 2012, almost 33,000 U.S. farms sold nearly ...",True,,1,0,0,0,0,0,0
1660,9686e290-9fa8-494d-8a75-ac44a2ed17be,Production expenses of specialized vegetable ...,Conclusions,Agricultural Resource Management Survey,Agricultural Resource Management Survey,The farm-level data for this report were deriv...,True,ORG,0,0,0,1,0,0,1
707,bd554e04-556b-49df-b337-8843ee6b997b,Differentiated Curriculum Enhancement in Inclu...,Materials,National Education Longitudinal Study,National Education Longitudinal Study,"Names of all activities, with key concepts and...",False,,0,0,0,0,0,0,0
57,17b44def-f81d-45bb-96a5-d6ef6f3cb22d,Longitudinal Associations Between Childhood Ob...,Discussion,Early Childhood Longitudinal Study,Early Childhood Longitudinal Study,"However, discrepancies arise for other subject...",False,,0,0,1,0,0,0,0
1120,c70db48a-205c-43c2-80bf-a4cea8b8e97a,Predictive markers for AD in a multi-modality ...,Abstract,Alzheimer's Disease Neuroimaging Initiative (A...,ADNI,Data used in the evaluations of our algorithm ...,True,,0,1,0,0,0,0,0
1364,f59da1b0-bb75-41ea-8492-6b1ebca2548d,A novel framework for longitudinal atlas const...,Abstract,Alzheimer's Disease Neuroimaging Initiative (A...,Alzheimers Disease Neuroimaging Initiative,1 Data used in preparation of this article wer...,True,ORG,0,1,0,0,0,0,1
456,a1aa5055-fe22-4014-bd5f-8255f6b5f212,Incident cognitive impairment: longitudinal ch...,Results,Alzheimer's Disease Neuroimaging Initiative (A...,ADNI,"In contrast to the PIB results, CSF biomarkers...",False,ORG,0,0,0,1,0,0,1
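
`ent_pat` searches the string form of the entity list, so an entity whose text happens to contain the letters "ORG" would also be counted. A sketch that checks the labels directly, assuming `named_ents` still holds the list of `(text, label)` tuples returned by `extract_named_ents`; `has_entity_label` is a hypothetical helper and the example entities are made up:

```python
def has_entity_label(ents, labels=("ORG",)):
    """Return 1 if any (text, label) tuple carries one of the given labels, else 0."""
    return int(any(label in labels for _text, label in ents))

ents_a = [("National Institutes of Health", "ORG"), ("2014", "DATE")]
ents_b = [("MORGAN", "PERSON")]  # text contains "ORG" but the label is not ORG
print(has_entity_label(ents_a))  # 1
print(has_entity_label(ents_b))  # 0
```

Passing `labels=("ORG", "DATE")` would extend the same check to DATE entities.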


### Pivot table (sample)
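Comparing true and false 'citances' (citation sentences) in the balanced sample, indicator terms and the Introduction section separate the two classes only weakly. The ORG entity is the strongest signal.

True citances more often:
- mention the term "data" (`hasData` mean 0.272 vs. 0.202)
- contain an "ORG" entity (`hasOrg` mean 0.604 vs. 0.276)
- are in the Abstract section (`inAbstract` mean 0.261 vs. 0.177)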

In [133]:
pd.pivot_table(sent_sample,index="match",values = ["inIntro", "inAbstract", "inDiscussion", "hasData", "hasEdu", "hasSurvey", "hasOrg"])

match,hasData,hasEdu,hasOrg,hasSurvey,inAbstract,inDiscussion,inIntro
False,0.202,0.09,0.276,0.033,0.177,0.249,0.219
True,0.272,0.092,0.604,0.011,0.261,0.175,0.293
