## Test on MedRxiv data

This is an interesting test. We trained on ArXiv, so we shouldn't necessarily expect the tracker to perform well on data from another field.

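We need to make some adjustments to track MedRxiv because the RAT (rejected article tracker) searches Crossref for candidate articles, and MedRxiv preprints are themselves registered in Crossref. A preprint we query with can therefore come back as one of its own matches, so we need to filter out these query articles when they come up in the results.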

## Reference

The dataset comes from this Zenodo repository:

Cabanac, Guillaume, Oikonomidi, Theodora, & Boutron, Isabelle. (2021). Day-to-day discovery of preprint–publication links [Data set]. In Scientometrics. Zenodo. https://doi.org/10.5281/zenodo.4432116

In [1]:
import pandas as pd
import os, json, requests

# Load the MedRxiv data


In [2]:

p = r'../../../data'
fp = os.path.join(p,'20200916-medRxiv_analysis.xlsx')
medrxiv = pd.read_excel(fp)
medrxiv.shape

(740, 13)

# First, use Crossref API to get correct data from medrxiv data

This is the data we will use to check our results. 

In [3]:
class CrossrefAPI:
    
    def __init__(self, doi):
        self.doi = doi
      
    def get_data(self):
        doi = self.doi
        api_resp = self.api_call(doi)
        output = self.selected_output(api_resp)
        return output
    
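    @staticmethod
    def api_call(doi: str):
        """Return the Crossref 'message' dict for a DOI, or None on failure."""
        base_url = 'https://api.crossref.org/works/'
        url = base_url + doi
        headers = {'User-Agent': 'Analytics'}
        # Crossref reads the contact address from the `mailto` query
        # parameter (or the User-Agent string), not from a `mailto` header.
        email = os.environ.get('MY_EMAIL', '')
        params = {'mailto': email} if email else None
        r = requests.get(url, headers=headers, params=params)
        if r.status_code == 200:
            try:
                crdata = r.json()
            except ValueError:
                # The body was not valid JSON.
                return None
            if 'message' in crdata:
                return crdata['message']
        return None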
            
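    def selected_output(self, msg):
        """Pick title, authors, DOI and abstract; return None if unusable."""
        if msg is None:
            # The API call failed, so there is nothing to extract.
            return None
        output = dict()
        output['title'] = (msg.get('title') or [''])[0]
        if len(output['title']) > 0:
            authors = msg.get('author')
            if isinstance(authors, list):
                output['authors'] = self.authors_list(authors)
                output['DOI'] = msg['DOI']
                output['abstract'] = msg.get('abstract', '')
                return output
        return None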
            
    @staticmethod
    def authors_list(authors: list) -> str:
        match_authors = []
        for author in authors:
            given = author.get('given', '')
            family = author.get('family', '')
            match_authors.append(family + ', ' + given)
        return '; '.join(match_authors)
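
To make the shape of the extracted record concrete, here is a small offline sketch. The `sample_msg` dict is hand-written to imitate the fields of a Crossref `message` that the class reads (`title`, `author`, `DOI`, `abstract`); it is not a real API response, and `summarise_message` is a hypothetical stand-alone helper that mirrors `selected_output` and `authors_list` rather than part of the RAT.

```python
def summarise_message(msg):
    """Hypothetical stand-alone version of CrossrefAPI.selected_output."""
    if not msg:
        return None
    titles = msg.get('title') or ['']
    title = titles[0]
    authors = msg.get('author')
    if not title or not isinstance(authors, list):
        return None
    # Same 'Family, Given' format as authors_list, joined with '; '.
    names = [a.get('family', '') + ', ' + a.get('given', '') for a in authors]
    return {
        'title': title,
        'authors': '; '.join(names),
        'DOI': msg.get('DOI', ''),
        'abstract': msg.get('abstract', ''),
    }

# Hand-written sample imitating a Crossref 'message'; values are illustrative only.
sample_msg = {
    'title': ['An example title'],
    'author': [{'given': 'Ada', 'family': 'Lovelace'}, {'family': 'Consortium'}],
    'DOI': '10.1234/example',
}
record = summarise_message(sample_msg)
print(record['authors'])
```

Authors without a given name (for example group authors) keep a trailing `, ` in this format, which may be worth bearing in mind when comparing author strings.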
            

In [4]:
from tqdm import tqdm

testdf_path = os.path.join(p,'testdf.csv')
if os.path.exists(testdf_path):
    testdf = pd.read_csv(testdf_path)
else:

    test_data = []
    for i,row in tqdm(medrxiv.iterrows(),total = len(medrxiv)):
        output_row = dict()
        try:
            query_data = CrossrefAPI(row['DOI preprint']).get_data()
            match_data = CrossrefAPI(row['DOI publication']).get_data()
        except Exception:
            # Skip the pair if either lookup fails.
            query_data = None
            match_data = None
        if query_data is not None and match_data is not None:
            for key in query_data:
                output_row['query_'+key] = query_data[key]
            for key in match_data:
                output_row['match_'+key] = match_data[key]
            test_data.append(output_row)
    testdf = pd.DataFrame(test_data)
    testdf.to_csv(testdf_path, index = False, encoding = 'utf-8-sig')
testdf.shape

(740, 8)

In [5]:
testdf.head(2)

Unnamed: 0,query_title,query_authors,query_DOI,query_abstract,match_title,match_authors,match_DOI,match_abstract
0,Optimising the use of molecular tools for the ...,"Munson, Morgan; Creswell, Benjamin; Kondobala,...",10.1101/19000232,<jats:title>Abstract</jats:title><jats:sec><ja...,Optimising the use of molecular tools for the ...,"Munson, Morgan; Creswell, Benjamin; Kondobala,...",10.1093/trstmh/trz083,<jats:title>Abstract</jats:title><jats:sec><ja...
1,"The prevalence of scabies, pyoderma and other ...","Marks, Michael; Sammut, Thomas; Cabral, Marito...",10.1101/19000257,<jats:title>Abstract</jats:title><jats:sec><ja...,"The prevalence of scabies, pyoderma and other ...","Marks, Michael; Sammut, Thomas; Cabral, Marito...",10.1371/journal.pntd.0007820,


In [6]:
# What share of the published articles have an abstract in Crossref?
# (Guessing there are more abstracts in PubMed than in Crossref.)
n = len(testdf)
n_missing_abstract = len(testdf[testdf['match_abstract'].isna()])
n, n_missing_abstract, 1 - (n_missing_abstract / n)

(740, 462, 0.3756756756756757)
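
So 462 of the 740 published articles have no abstract in Crossref, and about 38% do. One thing to watch: `CrossrefAPI` stores a missing abstract as an empty string, and it only becomes `NaN` once the frame has been written to CSV and read back. The sketch below, on a small made-up frame rather than `testdf`, treats both forms as missing; the column values are invented for illustration.

```python
import pandas as pd

# Made-up stand-in for testdf['match_abstract']: None/NaN and '' both mean "no abstract".
toy = pd.DataFrame({'match_abstract': ['<jats:p>text</jats:p>', None, '', 'more text']})

# Treat empty strings as missing, then take the mean of the boolean mask.
has_abstract = toy['match_abstract'].fillna('').str.strip().ne('')
share_with_abstract = has_abstract.mean()
print(int(has_abstract.sum()), len(toy), share_with_abstract)
```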

# Check RAT recall

In [7]:
def testdfrow_to_article(row):
    article = dict()
    article['Submission Date'] = '1900-01-01'
    article['Decision Date'] = '1900-01-01'
    article['Accept or Reject Final Decision'] = 'Reject'
    article['Manuscript ID'] = row['query_DOI']
    # article['query_doi'] = row['match_DOI']
    article['Manuscript Title'] = row['query_title']
    article['Journal Name'] = 'medrxiv'
    article['doi_number'] = row['match_DOI']
    article['Author Names'] = row['query_authors']
    article['abstract'] = row['query_abstract']
    return article
    
# sample_size = 1000 # len(testdf)
# articles = list(testdf.sample(sample_size, random_state =7).T.to_dict().values())
articles = list(testdf.T.to_dict().values())
articles = [testdfrow_to_article(row) for row in articles]
len(articles)

740

In [8]:
from rejected_article_tracker.src.ScholarOneRejectedArticlesMatch import ScholarOneRejectedArticlesMatch
# for each row, run the RAT and see whether we get the DOI
# keep a running total of recall
config = {
    # From 0.0 to 1.0.
    # Set higher for better precision, lower for better recall.
    "threshold": 0.0,
    # Any number from 1 to 10. If there are multiple versions of the article
    # out there, it's worth picking a number > 1.
    "max_results_per_article": 10,
    # Limit results to these types; see https://api.crossref.org/types
    "article_types": [],
}
# The CrossRef API requires an email address for lookups.    
email = os.getenv('MY_EMAIL')
# Define a 'results' list.
results = []

# Run match
ScholarOneRejectedArticlesMatch(
    articles=articles,
    config=config,
    email=email,
    results=results
).match()

result_df = pd.DataFrame(results)

len(result_df)

7389

In [9]:
# add the correct DOIs that we were searching for

result_df = result_df.merge(testdf[['query_DOI','match_DOI']].rename(columns={'match_DOI':'correct_doi'}),
                            left_on = 'manuscript_id', 
                            right_on='query_DOI', 
                            how = 'left')
result_df = result_df.drop(['query_DOI'], axis = 1)
result_df.head(2)

Unnamed: 0,manuscript_id,raw_manuscript_id,journal_name,manuscript_title,title_for_search,submission_date,decision_date,authors,text_sub_date,final_decision,...,match_earliest_date,match_similarity,match_one,match_all,classifier_score,match_crossref_score,match_crossref_cites,match_rank,match_total_decision_days,correct_doi
0,10.1101/19000232,10.1101/19000232,medrxiv,Optimising the use of molecular tools for the ...,Optimising use molecular tools diagnosis yaws,1900-01-01,1900-01-01,"Morgan+Munson, Benjamin+Creswell, Kofi+Kondoba...",1900-01-01,Reject,...,2019-07-24T00:00:00Z,97,1,0,0.873461,179.86777,0,1,43668,10.1093/trstmh/trz083
1,10.1101/19000232,10.1101/19000232,medrxiv,Optimising the use of molecular tools for the ...,Optimising use molecular tools diagnosis yaws,1900-01-01,,"Morgan+Munson, Benjamin+Creswell, Kofi+Kondoba...",1900-01-01,Reject,...,2019-06-25T00:00:00Z,97,1,1,0.961794,187.41878,0,2,0,10.1093/trstmh/trz083


In [10]:
p = r'../../../data'
fp = os.path.join(p,'resultdf.csv')
result_df.to_csv(fp,index=False,encoding ='utf-8-sig')

In [11]:
# filter out the preprints themselves

result_df = result_df[ result_df['manuscript_id'] != result_df['match_doi'] ]
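
The filter above relies on an exact string comparison. DOIs are case-insensitive and are sometimes written with a resolver prefix, so a more defensive variant could normalise both sides first. The sketch below is only an illustration on a toy frame: `normalise_doi` is a hypothetical helper, not part of the RAT, and the prefixed DOI in the first row is constructed for the example.

```python
import pandas as pd

def normalise_doi(doi):
    """Hypothetical helper: lower-case a DOI and strip a resolver prefix."""
    doi = str(doi).strip().lower()
    for prefix in ('https://doi.org/', 'http://doi.org/', 'doi:'):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi

# Toy candidates: the first row is the preprint matching itself.
toy = pd.DataFrame({
    'manuscript_id': ['10.1101/19000232', '10.1101/19000232'],
    'match_doi': ['https://doi.org/10.1101/19000232', '10.1093/trstmh/trz083'],
})
is_self = toy['manuscript_id'].map(normalise_doi) == toy['match_doi'].map(normalise_doi)
filtered = toy[~is_self]
print(filtered['match_doi'].tolist())
```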

In [12]:
# articles[0]

In [13]:
cols = ['manuscript_id',
       'manuscript_title', 'authors',
       'correct_doi', 'match_doi',
       'match_type', 'match_title', 'match_authors', 'match_publisher',
       'match_journal', 'match_pub_date', 'match_earliest_date',
       'match_similarity', 'match_one', 'match_all', 'match_crossref_score',
       'match_crossref_cites', 'match_rank', 'match_total_decision_days',
       'classifier_score',
       'correct']

In [14]:
result_df.head(2)

Unnamed: 0,manuscript_id,raw_manuscript_id,journal_name,manuscript_title,title_for_search,submission_date,decision_date,authors,text_sub_date,final_decision,...,match_earliest_date,match_similarity,match_one,match_all,classifier_score,match_crossref_score,match_crossref_cites,match_rank,match_total_decision_days,correct_doi
0,10.1101/19000232,10.1101/19000232,medrxiv,Optimising the use of molecular tools for the ...,Optimising use molecular tools diagnosis yaws,1900-01-01,1900-01-01,"Morgan+Munson, Benjamin+Creswell, Kofi+Kondoba...",1900-01-01,Reject,...,2019-07-24T00:00:00Z,97,1,0,0.873461,179.86777,0,1,43668,10.1093/trstmh/trz083
2,10.1101/19000232,10.1101/19000232,medrxiv,Optimising the use of molecular tools for the ...,Optimising use molecular tools diagnosis yaws,1900-01-01,,"Morgan+Munson, Benjamin+Creswell, Kofi+Kondoba...",1900-01-01,Reject,...,2011-04-01T00:00:00Z,43,0,0,0.002512,42.2879,4,3,0,10.1093/trstmh/trz083


In [15]:
result_df['correct'] = result_df['match_doi']==result_df['correct_doi']
correct_dois = set(testdf['match_DOI'])
threshold = 0.5

crossref_recall = result_df[(result_df['match_doi'].isin(correct_dois))]['match_doi'].nunique() / len(correct_dois)
correct_results = result_df[(result_df['classifier_score']>=threshold) & (result_df['match_doi'].isin(correct_dois))]
n_true_positives = correct_results['manuscript_id'].nunique()

# Check the ML classifier specifically: of the correct DOIs that Crossref
# returned, what share does the classifier score at or above the threshold?
results_that_ml_can_find = result_df[(result_df['match_doi'].isin(correct_dois))]['match_doi'].unique()
results_ml_found = result_df[(result_df['classifier_score']>=threshold)&(result_df['match_doi'].isin(correct_dois))]['match_doi']
ml_recall = len(results_ml_found) / len(results_that_ml_can_find)

# Note: FP and FN count candidate rows, whereas TP counts manuscripts,
# so TP + FN need not equal the number of preprints.
n_false_positives = len(result_df[(result_df['classifier_score']>=threshold) & (~result_df['match_doi'].isin(correct_dois))])
n_false_neg = len(result_df[(result_df['classifier_score']<threshold) & (result_df['match_doi'].isin(correct_dois))])

precision = n_true_positives / (n_true_positives+n_false_positives)
recall = n_true_positives / len(correct_dois)

output = [
dict(label='Number of MedRxiv preprints', value=len(testdf)),
dict(label='Results from RAT search', value=len(result_df)),
dict(label='Crossref\'s recall', value=crossref_recall),
dict(label='Max possible correct results', value=len(correct_dois)),
dict(label='ML recall', value=ml_recall),
dict(label='TP', value=n_true_positives),
dict(label='FP', value=n_false_positives),
dict(label='FN', value=n_false_neg),
dict(label='RAT recall', value=recall),
dict(label='RAT precision', value=precision),
]
pd.DataFrame(output)

| | label | value |
|---|---|---|
| 0 | Number of MedRxiv preprints | 740.0 |
| 1 | Results from RAT search | 6649.0 |
| 2 | Crossref's recall | 0.991892 |
| 3 | Max possible correct results | 740.0 |
| 4 | ML recall | 0.923706 |
| 5 | TP | 678.0 |
| 6 | FP | 119.0 |
| 7 | FN | 133.0 |
| 8 | RAT recall | 0.916216 |
| 9 | RAT precision | 0.85069 |
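
These counts mix units. TP is a number of manuscripts, while FP and FN are numbers of candidate rows, and membership in `correct_dois` does not check that a DOI is the correct one for that row's own manuscript. This is why TP + FN (678 + 133 = 811) exceeds the 740 preprints. As a cross-check, the metrics could be computed per manuscript from a row-wise `correct` flag. The following is a minimal sketch on invented toy data, not a rerun of the experiment; `pair_metrics` is a hypothetical helper.

```python
import pandas as pd

def pair_metrics(df, n_manuscripts, threshold=0.5):
    """Hypothetical helper: score (manuscript, candidate) pairs at a threshold."""
    accepted = df[df['classifier_score'] >= threshold]
    # A manuscript is found if one of its accepted candidates is its own correct DOI.
    tp = accepted.loc[accepted['correct'], 'manuscript_id'].nunique()
    # Every accepted candidate that is not the correct DOI is a false positive.
    fp = int((~accepted['correct']).sum())
    # Every manuscript without an accepted correct candidate is a miss.
    fn = n_manuscripts - tp
    precision = tp / (tp + fp) if (tp + fp) else float('nan')
    return {'TP': tp, 'FP': fp, 'FN': fn,
            'precision': precision, 'recall': tp / n_manuscripts}

# Invented toy results: three manuscripts, five candidate rows.
toy = pd.DataFrame({
    'manuscript_id':    ['a', 'a', 'b', 'b', 'c'],
    'classifier_score': [0.9, 0.6, 0.4, 0.1, 0.8],
    'correct':          [True, False, True, False, False],
})
at_050 = pair_metrics(toy, n_manuscripts=3, threshold=0.5)
at_030 = pair_metrics(toy, n_manuscripts=3, threshold=0.3)
print(at_050)
print(at_030)
```

With this definition TP + FN always equals the number of manuscripts, and lowering the threshold trades precision for recall as the `config` comments describe.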


In [16]:
# Inspect the candidates that are not the correct published DOI
result_df[(result_df['match_doi'] != 'No Match') & (~result_df['correct'])][cols]

Unnamed: 0,manuscript_id,manuscript_title,authors,correct_doi,match_doi,match_type,match_title,match_authors,match_publisher,match_journal,...,match_earliest_date,match_similarity,match_one,match_all,match_crossref_score,match_crossref_cites,match_rank,match_total_decision_days,classifier_score,correct
2,10.1101/19000232,Optimising the use of molecular tools for the ...,"Morgan+Munson, Benjamin+Creswell, Kofi+Kondoba...",10.1093/trstmh/trz083,10.1109/parelec.2011.40,proceedings-article,Optimising Heterogeneous 3D Networks-on-Chip,"Michael Opoku+Agyeman, Ali+Ahmadinia",IEEE,2011 Sixth International Symposium on Parallel...,...,2011-04-01T00:00:00Z,43,0,0,42.287900,4,3,0,0.002512,False
3,10.1101/19000232,Optimising the use of molecular tools for the ...,"Morgan+Munson, Benjamin+Creswell, Kofi+Kondoba...",10.1093/trstmh/trz083,10.1101/2020.02.20.20025122,posted-content,Mapping of yaws endemicity in Ghana; Lessons t...,"Laud Anthony Wihibeturo+Basing, Moses+Djan, Sh...",Cold Spring Harbor Laboratory,,...,2020-02-23T00:00:00Z,38,1,0,35.551506,0,4,0,0.001209,False
4,10.1101/19000232,Optimising the use of molecular tools for the ...,"Morgan+Munson, Benjamin+Creswell, Kofi+Kondoba...",10.1093/trstmh/trz083,10.3390/tropicalmed5040157,journal-article,Multiplex Recombinase Polymerase Amplification...,"Michael+Frimpong, Shirley Victoria+Simpson, Hu...",MDPI AG,Tropical Medicine and Infectious Disease,...,2020-10-06T00:00:00Z,34,1,0,34.838630,2,5,0,0.000673,False
5,10.1101/19000232,Optimising the use of molecular tools for the ...,"Morgan+Munson, Benjamin+Creswell, Kofi+Kondoba...",10.1093/trstmh/trz083,10.20944/preprints202008.0569.v1,posted-content,Multiplex Recombinase Polymerase Amplification...,"Michael+Frimpong, Shirley Victoria+Simpson, Hu...",MDPI AG,,...,2020-08-26T00:00:00Z,35,1,0,35.434326,0,6,0,0.000779,False
6,10.1101/19000232,Optimising the use of molecular tools for the ...,"Morgan+Munson, Benjamin+Creswell, Kofi+Kondoba...",10.1093/trstmh/trz083,10.5465/ambpp.2018.18072abstract,journal-article,Environmental and Organizational Factors Assoc...,William+Opoku-Agyeman,Academy of Management,Academy of Management Proceedings,...,2018-07-09T00:00:00Z,36,0,0,34.665800,0,7,0,0.000902,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7384,10.1101/2020.07.05.20146878,Effect of Systemic Inflammatory Response to SA...,"Catia+Marzolini, Felix+Stader, Marcel+Stoeckle...",10.1128/aac.01177-20,10.1101/2020.07.07.20148163,posted-content,Epidemiology of SARS-CoV-2 Emergence Amidst Co...,"Karoline+Leuzinger, Tim+Roloff, Rainer+Gosert,...",Cold Spring Harbor Laboratory,,...,2020-07-08T00:00:00Z,40,1,0,59.595337,3,6,0,0.001628,False
7385,10.1101/2020.07.05.20146878,Effect of Systemic Inflammatory Response to SA...,"Catia+Marzolini, Felix+Stader, Marcel+Stoeckle...",10.1128/aac.01177-20,10.1002/jmv.26731,journal-article,Epidemiology and precision of SARS‐CoV‐2 detec...,"Karoline+Leuzinger, Rainer+Gosert, Kirstine K....",Wiley,Journal of Medical Virology,...,2020-12-14T00:00:00Z,43,1,0,58.310135,10,7,0,0.002525,False
7386,10.1101/2020.07.05.20146878,Effect of Systemic Inflammatory Response to SA...,"Catia+Marzolini, Felix+Stader, Marcel+Stoeckle...",10.1128/aac.01177-20,10.1101/2020.09.22.20198697,posted-content,Epidemiology and precision of SARS-CoV-2 detec...,"Karoline+Leuzinger, Rainer+Gosert, Kirstine K....",Cold Spring Harbor Laboratory,,...,2020-09-23T00:00:00Z,45,1,0,59.318394,1,8,0,0.003382,False
7387,10.1101/2020.07.05.20146878,Effect of Systemic Inflammatory Response to SA...,"Catia+Marzolini, Felix+Stader, Marcel+Stoeckle...",10.1128/aac.01177-20,10.1186/s13756-021-00912-z,journal-article,Systematic screening on admission for SARS-CoV...,"Rahel N.+Stadler, Laura+Maurer, Lisandra+Aguil...",Springer Science and Business Media LLC,Antimicrobial Resistance &amp; Infection Control,...,2021-02-27T00:00:00Z,46,1,0,54.636375,7,9,0,0.003914,False
