# RDoC Expert Survey Author Finder
Chris Iyer
Updated 3/23/2023

This notebook finds authors' email addresses so that we can send them our RDoC Expert Survey Screener. Using functions from `author_finder_functions.py`, it does the following for each of the tasks we are using:
1. Search PubMed Central (PMC) for open-access articles from the past 10 years with task keywords in the abstract.
2. Obtain correspondence/author emails for as many of these articles as possible.
3. Retrieve the number of PMC articles that cite each PMC article.
4. Select the top `n` most-cited papers (`n=100` in the step-by-step example below) and collect their emails so that we can send the authors our expert screener.
5. Write these emails to a CSV (or a text file).
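
As a rough illustration of steps 3 to 5, the sketch below ranks papers by citation count and collects the unique emails of the top `n`. It is not the code in `author_finder_functions.py`: the function name `top_cited_emails`, the record layout (`pmcid`, `citations`, `emails`), and the example data are all hypothetical, and dropping papers without an email before ranking is an assumption.

```python
from typing import Dict, List


def top_cited_emails(papers: List[Dict], n: int = 100) -> List[str]:
    """Return the sorted, unique emails of the n most-cited papers.

    Hypothetical helper: assumes each paper is a dict with a
    'citations' count and an 'emails' list.
    """
    # Assumption: papers without any email are dropped before ranking
    with_email = [p for p in papers if p.get("emails")]
    # Most-cited first
    ranked = sorted(with_email, key=lambda p: p["citations"], reverse=True)
    # Collect emails from the top n papers, removing duplicates
    emails = {e for p in ranked[:n] for e in p["emails"]}
    return sorted(emails)


# Made-up records for illustration only
example_papers = [
    {"pmcid": "PMC0000001", "citations": 12, "emails": ["a@example.org"]},
    {"pmcid": "PMC0000002", "citations": 40, "emails": ["b@example.org", "c@example.org"]},
    {"pmcid": "PMC0000003", "citations": 25, "emails": []},
    {"pmcid": "PMC0000004", "citations": 3, "emails": ["d@example.org"]},
]

print(top_cited_emails(example_papers, n=2))
```

With `n=2`, the paper without an email is skipped and the two most-cited remaining papers contribute their addresses.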


In [3]:
import os  # used below to build paths
from author_finder_functions import *  # provides task_keywords and the helper functions used below

In [11]:
ROOT_PATH = '/Users/chrisiyer/_Current/lab/code/author_finder/pubget_data/' # change this to your local output directory

# tasks: 'spatial_cueing', 'visual_search', 'cued_ts', 'ax_cpt', 'flanker', 'stroop', 'stop_signal', 'go_nogo', 'span', 'change_detection', 'n_back'
task_to_run = 'stop_signal'

# if you would like to manually change the keywords to search through, do so here:
task_keywords[task_to_run] = ['stop-signal task', 'stop signal task']

### Option 1: run all-in-one

In [3]:
run_author_finder(task_to_run, ROOT_PATH, output='csv') # alternatively, output='txt'

INFO	2023-04-05T11:23:36-0700	pubget._download	Downloading data in /Users/chrisiyer/_Current/lab/code/author_finder/pubget_data/stop_signal/query_985725b7fed8fc29d2aca0ec88f29238/articlesets
INFO	2023-04-05T11:23:36-0700	pubget._download	Performing search
INFO	2023-04-05T11:23:37-0700	pubget._entrez	Search returned 337 results
INFO	2023-04-05T11:23:37-0700	pubget._entrez	Downloading 337 articles (in 1 batches)
INFO	2023-04-05T11:23:37-0700	pubget._entrez	getting batch 1 / 1
INFO	2023-04-05T11:23:47-0700	pubget._download	Finished downloading articles in /Users/chrisiyer/_Current/lab/code/author_finder/pubget_data/stop_signal/query_985725b7fed8fc29d2aca0ec88f29238/articlesets
INFO	2023-04-05T11:23:47-0700	pubget._download	All articles matching the query have been downloaded
INFO	2023-04-05T11:23:47-0700	pubget._articles	Extracting articles from /Users/chrisiyer/_Current/lab/code/author_finder/pubget_data/stop_signal/query_985725b7fed8fc29d2aca0ec88f29238/articlesets to /Users/chrisiyer/_

Article set #1:
Out of 337 papers, 329 had emails and 8 did not.
Result: 337 papers; 329 with emails.


array(['Chambersc1@cardiff.ac.uk', 'D.vanRooij@donders.ru.nl',
       'David.sharp@imperial.ac.uk', 'F.L.J.Verbruggen@exeter.ac.uk',
       'Hoptman@nki.rfmh.org', 'Jan-Wessel@uiowa.edu',
       'Marieke.mur@mrc-cbu.cam.ac.uk', 'Natalia.Lawrence@exeter.ac.uk',
       'a.c.sanchesferreira@bham.ac.uk', 'a.hampshire@imperial.ac.uk',
       'adamaron@ucsd.edu', 'adamsrc1@cardiff.ac.uk', 'agrethe@ucsd.edu',
       'ajj@liv.ac.uk', 'am2505@medschl.cam.ac.uk', 'amdale@ucsd.edu',
       'amy_caswell@brown.edu', 'andbari@gmail.com',
       'angie.kehagia@gmail.com', 'anne_gaertner@tu-dresden.de',
       'assari@umich.edu', 'auda@ncu.edu.tw', 'ayayak-tky@umin.ac.jp',
       'b.c.h.huijgen@umcg.nl', 'b.franke@donders.ru.nl',
       'b.vanhulst@umcutrecht.nl', 'c.beckmann@donders.ru.nl',
       'c.hartman@accare.nl', 'c.lavallee@uni-oldenburg.de',
       'c.padilla@uib.es', 'c.rae@bsms.ac.uk', 'c3chen@ntu.edu.tw',
       'charlotte.rae@mrc-cbu.cam.ac.uk', 'chiang-shan.li@yale.edu',
       'chijuan

### Option 2: run step-by-step

In [None]:
outpath = os.path.join(ROOT_PATH, task_to_run)

In [None]:
# 1. pubmed search
do_pubget_query(task_keywords[task_to_run], outpath) # writes directory with search results

In [12]:
# 2. Pull emails and PMCIDs
# Use the first query_* directory that the search step wrote inside outpath
query_dir = [d for d in sorted(os.listdir(outpath)) if d.startswith('query')][0]
query_path = os.path.join(outpath, query_dir, 'articlesets')
papers = get_all_emails(query_path) # PMCIDs and emails

Article set #1:
Out of 337 papers, 329 had emails and 8 did not.
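
For intuition about why some papers come back without an email, here is a minimal sketch of one way emails could be pulled from an article's XML. It assumes the downloaded articles are JATS-style XML in which addresses appear in `<email>` elements; it is not necessarily how `get_all_emails` works, and the function name `extract_emails` and the sample article are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def extract_emails(article_xml: str) -> List[str]:
    """Return the unique <email> values in an article, in document order.

    Hypothetical helper: assumes JATS-style XML with un-namespaced
    <email> elements.
    """
    root = ET.fromstring(article_xml)
    emails: List[str] = []
    for element in root.iter("email"):
        text = (element.text or "").strip()
        # The same address often appears in both the contributor list
        # and the correspondence note, so skip repeats
        if text and text not in emails:
            emails.append(text)
    return emails


# Made-up article fragment for illustration only
sample_article = """<article>
  <front><article-meta>
    <contrib-group>
      <contrib contrib-type="author" corresp="yes">
        <name><surname>Doe</surname></name>
        <email>j.doe@example.org</email>
      </contrib>
    </contrib-group>
    <author-notes>
      <corresp>Correspondence: <email>j.doe@example.org</email></corresp>
    </author-notes>
  </article-meta></front>
</article>"""

print(extract_emails(sample_article))
```

Under this approach, an article whose XML contains no `<email>` element yields an empty list, which is one plausible reason a paper would be counted as having no email.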


In [13]:
# 3. Top 100 most cited
papers_top = get_most_cited(papers, n=100)

In [43]:
# 4. Write output (choose one)
write_papers_csv(papers_top, outpath)
# OR
write_email_txt(papers_top, outpath)

array(['Chambersc1@cardiff.ac.uk', 'F.L.J.Verbruggen@exeter.ac.uk',
       'James.Rowe@mrc-cbu.cam.ac.uk', 'M.J.Howard@kent.ac.uk',
       'Marieke.mur@mrc-cbu.cam.ac.uk', 'Natalia.Lawrence@exeter.ac.uk',
       'a.c.sanchesferreira@bham.ac.uk', 'a.hampshire@imperial.ac.uk',
       'adam.frost@ucsf.edu', 'adamaron@ucsd.edu',
       'adamsrc1@cardiff.ac.uk', 'agrethe@ucsd.edu', 'ajj@liv.ac.uk',
       'am2505@medschl.cam.ac.uk', 'amy_caswell@brown.edu',
       'andbari@gmail.com', 'angie.kehagia@gmail.com',
       'anna.wredenberg@ki.se', 'assari@umich.edu',
       'ayayak-tky@umin.ac.jp', 'b.c.h.huijgen@umcg.nl',
       'bernd-fritzsch@uiowa.edu',
       'bernhard.nieswandt@virchow.uni-wuerzburg.de',
       'c.lavallee@uni-oldenburg.de', 'c.rae@bsms.ac.uk',
       'c3chen@ntu.edu.tw', 'charlotte.rae@mrc-cbu.cam.ac.uk',
       'chiang-shan.li@yale.edu', 'chijuan@cc.ncu.edu.tw',
       'choijs73@gmail.com', 'christoph.freyer@ki.se',
       'claire.ocallaghan@sydney.edu.au', 'd.matzke@uva

### CURRENT CAVEATS

1. The keywords are imperfect. Requiring "task" in the search phrase excludes many relevant papers, but leaving it out returns articles like [this article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4359377/) that are about the 'stop codon acting as a stop signal' and not about the stop-signal task at all.

2. A few emails are lost (roughly 2%; in the stop-signal run above, 8 of 337 papers had no email). I'm not worried.

3. We sort by the number of citing papers within PMC only, which is not necessarily a paper's total citation count.

4. This search only returns open-access papers, which are not all of the possible search results.
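
The keyword trade-off in caveat 1 can be illustrated with a toy example. The sketch below uses plain case-insensitive substring matching, which is only a stand-in for how the PMC search actually matches abstracts; the function name `matches_any`, the `loose` keyword list, and the three abstracts are invented for illustration.

```python
from typing import List


def matches_any(abstract: str, keywords: List[str]) -> bool:
    """Return True if any keyword occurs in the abstract (case-insensitive)."""
    text = abstract.lower()
    return any(keyword.lower() in text for keyword in keywords)


# Invented abstracts for illustration only
abstracts = {
    "task_paper": "Participants completed a stop-signal task during fMRI.",
    "paradigm_paper": "Response inhibition was measured with the stop signal paradigm.",
    "codon_paper": "The stop codon acts as a stop signal for translation.",
}

strict = ["stop-signal task", "stop signal task"]  # keywords used in this notebook
loose = ["stop-signal", "stop signal"]             # hypothetical variant without "task"

for name, text in abstracts.items():
    print(name, matches_any(text, strict), matches_any(text, loose))
```

In this toy data the strict keywords miss the relevant "paradigm" abstract, while the loose keywords recover it but also match the unrelated stop-codon abstract.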