# RDoC Expert Survey Author Finder
Chris Iyer
Updated 3/23/2023

This notebook finds authors' email addresses so that we can send them our RDoC Expert Survey Screener. Using functions from `author_finder_functions.py`, it does the following for each task:
1. Search PubMed Central (PMC) for open-access articles from the past 10 years with task keywords in the abstract.
2. Obtain correspondence/author emails for as many of these articles as possible.
3. Retrieve the number of PMC articles that cite each PMC article.
4. Select the top `n` most-cited papers (currently `n=100`) and collect their emails so we can send the authors our expert screener.
5. Write these emails to a CSV (or a text file).
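
As a rough illustration of step 4, the selection can be thought of as a sort on citation count followed by a cut at `n`. The sketch below is not the code in `author_finder_functions.py`; the record layout (`pmcid`, `emails`, `citations`), the function name `top_cited`, and the choice to drop papers without an email before ranking are all assumptions made for the example.

```python
from typing import Dict, List


def top_cited(papers: List[Dict], n: int = 100) -> List[Dict]:
    """Return the n most-cited papers that have at least one email.

    Assumes each paper is a dict with 'pmcid', 'emails' and 'citations' keys
    (a hypothetical layout, not necessarily the one used by the real functions).
    """
    # Assumption: a paper without any email is of no use for the mailing list.
    with_email = [p for p in papers if p.get("emails")]
    # Sort from most to least cited, then keep the first n.
    ranked = sorted(with_email, key=lambda p: p["citations"], reverse=True)
    return ranked[:n]


# Made-up records, for illustration only.
example_papers = [
    {"pmcid": "PMC0000001", "emails": ["a@example.org"], "citations": 12},
    {"pmcid": "PMC0000002", "emails": [], "citations": 40},
    {"pmcid": "PMC0000003", "emails": ["b@example.org"], "citations": 30},
]

top_two = top_cited(example_papers, n=2)
print([p["pmcid"] for p in top_two])  # ['PMC0000003', 'PMC0000001']
```

Filtering before ranking means the final list always contains up to `n` contactable papers, rather than `n` papers of which some have no email.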


In [1]:
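import os  # used below for os.path.join / os.listdir; imported explicitly rather than relying on the wildcard import

from author_finder_functions import *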

In [2]:
ROOT_PATH = '/Users/chrisiyer/_Current/lab/code/author_finder/pubget_data/' # change this to match your local desired path

# tasks: 'spatial_cueing', 'visual_search', 'cued_ts', 'ax_cpt', 'flanker', 'stroop', 'stop_signal', 'go_nogo', 'span', 'change_detection', 'n_back'
all_tasks = ['spatial_cueing', 'visual_search', 'cued_ts', 'ax_cpt', 'flanker', 'stroop', 'stop_signal', 'go_nogo', 'span', 'change_detection', 'n_back']
tasks_to_run = all_tasks

# To manually change the keywords searched for a task, index task_keywords by task name, e.g.:
# task_keywords['stop_signal'] = ['stop-signal task', 'stop signal task']

### Option 1: run all-in-one

In [3]:
run_author_finder(tasks_to_run, ROOT_PATH, output = 'csv') # output = 'txt'

INFO	2023-04-07T14:47:47-0700	pubget._download	Nothing to do: current processing step 'download' already completed in /Users/chrisiyer/_Current/lab/code/author_finder/pubget_data/spatial_cueing/query_985725b7fed8fc29d2aca0ec88f29238/articlesets
INFO	2023-04-07T14:47:47-0700	pubget._articles	Nothing to do: current processing step 'extract_articles' already completed in /Users/chrisiyer/_Current/lab/code/author_finder/pubget_data/spatial_cueing/query_985725b7fed8fc29d2aca0ec88f29238/articles
INFO	2023-04-07T14:47:47-0700	pubget._data_extraction	Nothing to do: current processing step 'extract_data' already completed in /Users/chrisiyer/_Current/lab/code/author_finder/pubget_data/spatial_cueing/query_985725b7fed8fc29d2aca0ec88f29238/subset_allArticles_extractedData


NameError: name 'ROOT_PATH' is not defined

### Option 2: Run step-by-step

In [None]:
task_to_run = tasks_to_run[0] # CHANGE THIS

outpath = os.path.join(ROOT_PATH, task_to_run)

In [None]:
# 1. pubmed search
do_pubget_query(task_keywords[task_to_run], outpath) # writes directory with search results

In [None]:
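# 2. Pull emails and PMCIDs
# pubget writes its results into a 'query_<hash>' subdirectory of outpath; take the first one found
query_dir = [i for i in os.listdir(outpath) if i.startswith('query')][0]
query_path = os.path.join(outpath, query_dir, 'articlesets')
papers = get_all_emails(query_path) # PMCIDs and emails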

In [None]:
# 3. Top 100 most cited
papers_top = get_most_cited(papers, n=100)

In [None]:
# 4. Write output (pick one of the two formats)
write_papers_csv(papers_top, outpath)
# OR
# write_email_txt(papers_top, outpath)
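
For readers curious about what step 2 involves, here is a minimal sketch of pulling email addresses out of PMC article XML with the standard library. It is not the implementation of `get_all_emails`; the function name `emails_by_pmcid` and the sample XML are invented for illustration, and the sketch assumes the articleset files follow the usual JATS layout, with `<article-id pub-id-type="pmc">` holding the numeric ID and addresses wrapped in `<email>` tags.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the assumed JATS-like layout (not real data).
SAMPLE_XML = """<pmc-articleset>
  <article>
    <front>
      <article-meta>
        <article-id pub-id-type="pmc">0000001</article-id>
        <author-notes>
          <corresp>Correspondence: <email>first.author@example.org</email></corresp>
        </author-notes>
      </article-meta>
    </front>
  </article>
  <article>
    <front>
      <article-meta>
        <article-id pub-id-type="pmc">0000002</article-id>
      </article-meta>
    </front>
  </article>
</pmc-articleset>"""


def emails_by_pmcid(xml_text):
    """Map each article's PMCID to the sorted, de-duplicated emails found in it."""
    root = ET.fromstring(xml_text)
    result = {}
    for article in root.iter("article"):
        id_el = article.find(".//article-id[@pub-id-type='pmc']")
        if id_el is None or not id_el.text:
            continue  # skip articles without a PMC identifier
        pmcid = id_el.text.strip()
        # Assumption: the ID may or may not already carry the 'PMC' prefix.
        if not pmcid.startswith("PMC"):
            pmcid = "PMC" + pmcid
        emails = sorted({e.text.strip() for e in article.iter("email") if e.text})
        result[pmcid] = emails
    return result


print(emails_by_pmcid(SAMPLE_XML))
```

Articles that expose no `<email>` element end up with an empty list, which is one plausible way emails can go missing.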

### CURRENT CAVEATS

1. The keywords are imperfect. Including "task" will exclude a lot of good papers, but leaving it out means we get articles like [this article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4359377/) that are about the 'stop codon acting as a stop signal' and not about the stop signal task at all.

2. A few emails are lost during extraction (somewhere in the ballpark of 2%). I'm not worried about a loss that small.

3. We sort by the number of citing papers within PMC only, which is not necessarily a paper's total citation count.

4. This search only returns open-access papers, which are a subset of all possible search results.
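
The keyword trade-off in caveat 1 can be seen in a toy example. The sketch below uses plain case-insensitive substring matching, which is only an approximation of how the PMC search treats abstract keywords; the abstracts are made up, and `matches_any` and the two keyword lists are hypothetical names chosen for the illustration.

```python
def matches_any(abstract, keywords):
    """True if any keyword appears in the abstract (case-insensitive substring match)."""
    text = abstract.lower()
    return any(k.lower() in text for k in keywords)


# Keyword sets with and without the word "task".
strict = ["stop-signal task", "stop signal task"]
loose = ["stop-signal", "stop signal"]

# Made-up abstracts, for illustration only.
abstracts = {
    "task_paper": "Participants completed a stop-signal task while EEG was recorded.",
    "paradigm_paper": "We used the stop-signal paradigm to measure response inhibition.",
    "codon_paper": "The stop codon acts as a stop signal for translation.",
}

strict_hits = [name for name, text in abstracts.items() if matches_any(text, strict)]
loose_hits = [name for name, text in abstracts.items() if matches_any(text, loose)]

print(strict_hits)  # ['task_paper']
print(loose_hits)   # ['task_paper', 'paradigm_paper', 'codon_paper']
```

The strict list misses the relevant "paradigm" paper, while the loose list also picks up the unrelated codon paper.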