In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import requests
from urllib.parse import urlencode, quote_plus
import numpy as np
import sys

import nltk
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

**Pandas Version**<br>
Pandas is imported in its own cell because this code requires Pandas version 1.5.3.<br> If the user's Pandas version differs from 1.5.3, run the 'pip install' line in that cell to install the required version.
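The version check can also be done programmatically before anything else is imported. The sketch below is only an illustration: the helper name `pandas_matches` is hypothetical, and it reads the installed version through the standard-library `importlib.metadata` module rather than through Pandas itself.

```python
from importlib import metadata

REQUIRED_PANDAS = "1.5.3"  # version this notebook was designed for


def pandas_matches(required=REQUIRED_PANDAS):
    """Return True if the installed Pandas version equals `required`.

    Hypothetical helper: returns False when Pandas is not installed at all.
    """
    try:
        installed = metadata.version("pandas")
    except metadata.PackageNotFoundError:
        return False
    return installed == required


if not pandas_matches():
    print(f"Pandas {REQUIRED_PANDAS} not found; "
          f"consider: pip install pandas=={REQUIRED_PANDAS} --user")
```

If the message is printed, installing the pinned version and restarting the runtime should bring the environment in line with the one the notebook was written for.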

In [None]:
#We designed the code to work with Pandas 1.5.3
import pandas as pd
print(pd.__version__)

#If the Pandas version differs from 1.5.3, run the following:
#pip install pandas==1.5.3 --user

1.5.3


# **Citing this code**
This code is the second version of an expertise-finding tool developed by Volz et al. 2023 (https://ui.adsabs.harvard.edu/abs/2023AAS...24210207V/abstract).<br>
It uses the NASA ADS API to query for articles (refereed or not) in the "Astronomy" database (cite ADS).
Please cite "Helfenbein et al. 2023 (in prep)" and refer to the README file in the GitHub repository.

**Directory set up**<br>
The file *stopwords.txt* is used to create meaningful N-grams. Make sure to provide an accurate path in the following cell.<br> The code also uses this path to locate other files it needs.
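Because the next cell builds the full file path by plain string concatenation, the folder path needs to end with a separator. A minimal sketch that avoids this dependence is shown here; the helper `build_stop_path` and the example folder are hypothetical and only illustrate the idea.

```python
import os


def build_stop_path(folder, filename="stopwords.txt"):
    """Join folder and file name, with or without a trailing separator."""
    return os.path.join(folder, filename)


# Hypothetical folder, for illustration only.
candidate = build_stop_path("/content/drive/MyDrive/expertise_finder")
if not os.path.isfile(candidate):
    print(f"stopwords file not found at {candidate}; check the path")
```

Checking that the file exists before running a search makes a wrong path show up immediately instead of partway through a query.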

In [None]:
path_stop= 'Insert Path Here'
stop_file='stopwords.txt'
stop_dir=path_stop+stop_file
sys.path.append(path_stop)

In [None]:
#For the TextAnalysis file, please refer to Volz et al. 2023
import TextAnalysis as TA
import ADSsearcherpkg as AP

In [None]:
#token = 'Your own token from ADS API page ' #Insert your API token

# **Example 1: Searching the expertise of a single person based on their name**

The search focuses on papers published by a specific author in the past 15 years, regardless of their current affiliation.<br>
The format for a single-author search is: **"Last, First"**<br>
In the following example we search for Dr. Joshua Pepper's expertise. <br>
**Note:** to query ONLY refereed papers, add the following keyword argument before the token keyword:<br>
**refereed="property:refereed"**
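The optional filter can be switched on and off without rewriting the call by assembling the keyword arguments first. This is a sketch, not part of the package: `search_kwargs` is a hypothetical helper, and only the keyword names `name`, `refereed`, `token` and `stop_dir` are taken from the calls in this notebook.

```python
def search_kwargs(name, token, stop_dir, refereed_only=False):
    """Collect keyword arguments for an author search.

    Hypothetical helper; the keyword names mirror the calls in this notebook.
    """
    kwargs = {"name": name, "token": token, "stop_dir": stop_dir}
    if refereed_only:
        # Value quoted from the note above.
        kwargs["refereed"] = "property:refereed"
    return kwargs


kwargs = search_kwargs("Pepper, Joshua", token="YOUR_TOKEN",
                       stop_dir="stopwords.txt", refereed_only=True)
# With the package imported as AP, the search would then be:
# datf = AP.ads_search(**kwargs)
```

Since these are all keyword arguments, Python does not care about the order in which they are written in the call.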


In [None]:
datf=AP.ads_search(name="Pepper, Joshua",
               token=token, stop_dir=stop_dir)

In [None]:
# To display the data frame run the following:
datf
# To save it in Excel format, run the following:
#datf.to_excel(path_stop+"output.xlsx")

Unnamed: 0,Input Author,Input Institution,First Author,Bibcode,Title,Publication Date,Keywords,Affiliations,Abstract,Top 10 Words,Top 10 Bigrams,Top 10 Trigrams,Data Type
0,"Pepper, Joshua","None, None, None, None, None, None, None, None...","Pepper, Joshua, Pepper, Joshua, Pepper, Joshua...","2021plat.confE..96P, 2021tsc2.confE.139P, 2020...","The TESS Input Catalog and lessons for PLATO, ...","2021-10-00, 2021-07-00, 2020-06-00, 2020-01-00...","Input catalogue, TESS, Zenodo community plato2...","Lehigh University, Lehigh University, Departme...",The presentation compares the input catalogues...,"[(star, 78), (kelt, 62), (planet, 46), (transi...","[((radial, velocity), 18), ((transiting, plane...","[((transiting, planets, bright), 6), ((planets...","Dirty, Dirty, Clean, Dirty, Clean, Clean, Clea..."


# **Example 2: Searching the expertise of ALL scientists who published as first authors while affiliated with a single institution**

The search focuses on all first authors who published in the past 15 years while affiliated with a specific institution (academic or otherwise): <br>
The format for a single institution is: **institution="Institution Name"**. <br>
**Caveat**: The institution name entered by the user may not match what is cataloged in ADS. If the final output is empty, try different versions of the institution name (e.g. Cal Poly Pomona, Cal Poly, California Polytechnic State University) to get the most complete list of authors.
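Trying several spellings can be automated. The sketch below takes the search as a callable so that it does not depend on the package; `first_nonempty` and `fake_search` are hypothetical names. In practice the callable would wrap a call such as `AP.ads_search(institution=..., token=token, stop_dir=stop_dir)`, under the assumption that its result supports `len()` as a Pandas dataframe does.

```python
def first_nonempty(search, variants):
    """Try each institution spelling until `search` returns a non-empty result.

    `search` is any callable taking one institution name and returning an
    object with a length (a pandas DataFrame qualifies).
    Returns (variant, result), or (None, None) if every attempt is empty.
    """
    for variant in variants:
        result = search(variant)
        if result is not None and len(result) > 0:
            return variant, result
    return None, None


# Stand-in for the real query, for illustration only.
def fake_search(institution):
    if institution == "California Polytechnic State University":
        return ["row"]
    return []


variants = ["Cal Poly Pomona", "Cal Poly",
            "California Polytechnic State University"]
matched, rows = first_nonempty(fake_search, variants)
print(matched)
```

Different spellings may return different authors, so running every variant and comparing the results can be worthwhile when completeness matters more than speed.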

In [None]:
datf=AP.ads_search(institution="Hampton University",refereed="property:refereed",
               token=token, stop_dir=stop_dir)

I will search for every paper who first authors is afiliated with Hampton University and published in the past 15 years.

I am now querying ADS.



In [None]:
# To display the data frame run the following:
datf
# To save it in Excel format, run the following:
#datf.to_excel(path_stop+"output.xlsx")

# **Example 3: Searching a single author's publications while affiliated with a specific institution**

The search focuses on papers published in the past 15 years by a single author while affiliated with a specific institution:<br>

The format for a single author and institution is: **name= 'Last, First', institution= 'Institution Name'**.

In [None]:
datf=AP.ads_search(name= 'Capper, Daniel', institution="University of Southern Mississippi",
               token=token, stop_dir=stop_dir)

In [None]:
# To display the data frame run the following:
datf
# To save it in Excel format, run the following:
#datf.to_excel(path_stop+"output.xlsx")

# **Example 4: Searching a single author name within a different time-frame**

The search focuses on papers from a single author published in a time frame other than the default past 15 years. There are two options:
   - A single year (e.g. 2010): the code queries ADS for articles published by the specified author from one year before to four years after that year. So year='2010' searches articles between 2009 and 2014.<br>
   - A year range: the syntax is year='[YEAR TO YEAR]' (e.g. year='[2009 TO 2023]'). <br>

The format for a single author name remains the same as before: **name= 'Last, First'**. <br>

Here are two examples:
- Searching for Dr. Pepper's articles between 1999 and 2004 (year='2000')
- Searching for Dr. Pepper's articles between 2019 and 2023 (year='[2019 TO 2023]')
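The single-year rule described above can be checked with a few lines of arithmetic. This sketch reproduces the rule as stated in the text (one year before to four years after); it is not the package's own code, and the function names are hypothetical.

```python
def year_window(year):
    """Return (start, end) for a single-year search: one year before to four after."""
    y = int(year)
    return y - 1, y + 4


def as_range(start, end):
    """Format two years in the '[YEAR TO YEAR]' syntax."""
    return f"[{start} TO {end}]"


print(year_window("2010"))             # (2009, 2014)
print(as_range(*year_window("2000")))  # [1999 TO 2004]
```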

In [None]:
datf=AP.ads_search(name= 'Pepper, Joshua', year='2000',
               token=token, stop_dir=stop_dir)

In [None]:
# To display the data frame run the following:
datf
# To save it in Excel format, run the following:
#datf.to_excel(path_stop+"output.xlsx")

In [None]:
datf=AP.ads_search(name= 'Pepper, Joshua', year='[2019 TO 2023]',
               token=token, stop_dir=stop_dir)

In [None]:
datf

# **Example 5: Searching a single institution name within a specific time-frame**

The search focuses on authors who published as first authors while affiliated with a specific institution in a defined time span. <br>
The format for the institution name is the same as in Example 2 (**institution="Institution Name"**), and the year options are the same as in Example 4:<br>
   - A single year (e.g. 2010): the code queries ADS for articles from the specified institution from one year before to four years after that year. So year='2010' searches articles between 2009 and 2014.<br>
   - A year range: the syntax is year='[YEAR TO YEAR]' (e.g. year='[2009 TO 2023]'). <br>
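A malformed range string is easy to catch before it is sent to ADS. The sketch below assumes that four-digit years and an uppercase `TO` are required, which is what the examples in this notebook show; the package itself may accept other forms. `parse_year_range` is a hypothetical helper.

```python
import re

# Pattern for the '[YEAR TO YEAR]' syntax used in this notebook.
_RANGE = re.compile(r"^\[(\d{4}) TO (\d{4})\]$")


def parse_year_range(text):
    """Return (start, end) for a '[YEAR TO YEAR]' string, or raise ValueError."""
    match = _RANGE.match(text)
    if match is None:
        raise ValueError(f"expected '[YEAR TO YEAR]', got {text!r}")
    start, end = int(match.group(1)), int(match.group(2))
    if start > end:
        raise ValueError("start year is after end year")
    return start, end


print(parse_year_range("[1990 TO 2000]"))  # (1990, 2000)
```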

Two examples follow:

In [None]:
datf=AP.ads_search(institution="University of Southern Mississippi",year='2000',
               token=token, stop_dir=stop_dir)

In [None]:
# To display the data frame run the following:
datf
# To save it in Excel format, run the following:
#datf.to_excel(path_stop+"output.xlsx")

In [None]:
datf=AP.ads_search(institution="University of Southern Mississippi",year='[1990 TO 2000]',
               token=token, stop_dir=stop_dir)

In [None]:
datf

# **Example 6: Searching a single author at a specific institution within a specific time-frame**

The following example combines several of the previous ones in a single search.
Specifically:<br>
   - A single author<br>
   - Affiliated with a single institution<br>
   - In a specific time frame of publication<br>
    
Please refer to the previous examples for the required syntax. <br>
Here is an example:

In [None]:
datf=AP.ads_search(name= 'Brown, Beth A.', institution="Howard university",year='[2009 TO 2022]',
               token=token, stop_dir=stop_dir)

In [None]:
# To display the data frame run the following:
datf
# To save it in Excel format, run the following:
#datf.to_excel(path_stop+"output.xlsx")

# **Example 7: Searching through a list of institutions**

The search focuses on papers from a list of institutions. The input is a csv file containing multiple institution names, and the search finds all papers from those institutions (**see the Caveat in Example 2 above regarding institution names**):<br>

The input file must be a .csv file (e.g. "top10inst.csv") and must contain at least one column titled **"Current Institution"** or **"Institution"** (the first cell of the column is usually interpreted as the title). The file can contain other columns; they will be ignored.<br>
If the file is in a different directory from the code, include the full path. <br>

The code runs as in Example 2 above for each institution and appends the results at each iteration, producing a final dataframe with all the researchers at all the institutions in the list.<br>
**NOTE: at the moment, if an institution query returns an empty dataframe, the code ignores it and continues to the next one.**
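Before running a long batch, it can help to confirm that the input file has one of the expected column titles. The sketch below uses only the standard-library `csv` module on made-up file content; `institution_column` is a hypothetical helper and not part of the package.

```python
import csv
import io

ACCEPTED_COLUMNS = ("Current Institution", "Institution")  # titles quoted above


def institution_column(csv_text):
    """Return the first accepted column title found in the CSV header, else None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    for column in ACCEPTED_COLUMNS:
        if column in header:
            return column
    return None


# Made-up file content, for illustration only.
sample = ('Institution,Notes\n'
          '"University of California, Berkeley",example\n'
          'Hunter College,example\n')
column = institution_column(sample)
names = [row[column] for row in csv.DictReader(io.StringIO(sample))]
print(column, names)
```

Institution names that contain a comma need to be quoted in the file, as in the first row of the sample; otherwise the name is split across two columns. For a file on disk, the same check would read the text from `open(filename, newline="")`.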


In [None]:
datf=AP.run_file_insts(filename= '/example3.csv',
               token=token, stop_dir=stop_dir)


I will search for every paper who first authors is afiliated with University of California, Berkeley and published in the past 15 years.

I am now querying ADS.

DataFrame is empty! Something is wrong with the institution
I am querying ADS in a different way, stay tuned!/n
I will search for every paper who first authors is afiliated with University of California, Berkeley and published in the past 15 years.



  final_df= final_df.append(data1, ignore_index= True)


1 iterations done
I will search for every paper who first authors is afiliated with Hunter College and published in the past 15 years.

I am now querying ADS.



  final_df= final_df.append(data1, ignore_index= True)


2 iterations done
I will search for every paper who first authors is afiliated with Yale University  and published in the past 15 years.

I am now querying ADS.

DataFrame is empty! Something is wrong with the institution
I am querying ADS in a different way, stay tuned!/n
I will search for every paper who first authors is afiliated with Yale University  and published in the past 15 years.

3 iterations done


  final_df= final_df.append(data1, ignore_index= True)


In [None]:
# To display the data frame run the following:
datf
# To save it in Excel format, run the following:
#datf.to_excel(path_stop+"output.xlsx")

Unnamed: 0,Input Author,Input Institution,First Author,Bibcode,Title,Publication Date,Keywords,Affiliations,Abstract,Top 10 Words,Top 10 Bigrams,Top 10 Trigrams,Data Type
0,"Abdurashidova, Zara","University of California, Berkeley","Abdurashidova, Zara, Abdurashidova, Zara","2022ApJ...925..221A, 2022ApJ...924...51A",First Results from HERA Phase I: Upper Limits ...,"2022-02-00, 2022-01-00","1383, 343, 1858, Astrophysics - Cosmology and ...","Department of Astronomy, University of Califor...",We report upper limits on the Epoch of Reioniz...,"[(limit, 9), (hera, 7), (radio, 5), (upper, 4)...","[((upper, limit), 4), ((epoch, reionization), ...","[((cm, power, spectrum), 2), ((hydrogen, epoch...","Clean, Clean"
1,"Abdurrahman, F. N.","University of California, Berkeley","Abdurrahman, F. N.",2018AJ....156..100A,Improved Image Quality over 10‧ Fields with th...,2018-09-00,"instrumentation: adaptive optics, site testing...","Department of Astronomy, University of Califor...",´Imaka is a ground-layer adaptive optics (GLAO...,"[(ao, 8), (field, 5), (psf, 4), (telescope, 3)...","[((ao, psf), 3), ((ao, ao), 2), ((delivered, i...","[((delivered, image, quality), 2), ((imaka, gr...",Clean
2,"Abdurrahman, Fatima N.","University of California, Berkeley","Abdurrahman, Fatima N., Abdurrahman, Fatima N.","2021ApJ...912..146A, 2021PhDT.........6A",On the Possibility of Stellar Lenses in the Bl...,"2021-05-00, 2021-00-00","Gravitational microlensing, High-resolution mi...","Department of Astronomy, University of Califor...",Although stellar-mass black holes (BHs) are li...,"[(lens, 20), (field, 12), (source, 11), (image...","[((lens, source), 8), ((black, hole), 6), ((pr...","[((eliminate, possibility, stellar), 4), ((ste...","Clean, Dirty"
3,"Abrahams, Ellianna S.","University of California, Berkeley","Abrahams, Ellianna S.",2022ApJ...938...46A,Informing the Cataclysmic Variable Sequence fr...,2022-10-00,"Cataclysmic variable stars, Variable stars, Su...","Department of Astrophysics, University of Cali...",The orbital-period (P <SUB>orb</SUB>) gap in t...,"[(cv, 6), (orb, 4), (color, 4), (period, 3), (...","[((absolute, magnitude), 3), ((orbital, period...","[((color, absolute, magnitude), 2), ((dwarf, n...",Clean
4,"Aczel, Miriam R.","University of California, Berkeley","Aczel, Miriam R.",2021ConPh..62...52A,"The cosmos: astronomy in the new millennium, 5...",2021-01-00,,California Institute for Energy and Environmen...,,[],[],[],Dirty
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1884,"van Dokkum, Pieter",Yale University,"van Dokkum, Pieter, van Dokkum, Pieter, van Do...","2023NatAs...7..514V, 2023RNAAS...7...83V, 2023...","An exciting era of exploration, A Direct Conne...","2023-05-00, 2023-05-00, 2023-05-00, 2023-04-00...",", Supermassive black holes, 1663, Astrophysics...","Astronomy Department, Yale University, New Hav...",Long-exposure spectra taken with the James Web...,"[(galaxy, 187), (cluster, 61), (mass, 58), (ng...","[((globular, cluster), 44), ((dark, matter), 3...","[((low, surface, brightness), 10), ((hubble, s...","Dirty, Clean, Dirty, Clean, Clean, Dirty, Dirt..."
1885,"van Dokkum, Pieter G.",Yale University,"van Dokkum, Pieter G., van Dokkum, Pieter G., ...","2015ApJ...813...23V, 2015ApJ...804L..26V, 2015...","Forming Compact Massive Galaxies, Spectroscopi...","2015-11-00, 2015-05-00, 2015-01-00, 2014-12-00...","galaxies: evolution, galaxies: structure, Astr...","Department of Astronomy, Yale University, New ...",In this paper we study a key phase in the form...,"[(galaxy, 114), (mass, 64), (stellar, 40), (st...","[((stellar, mass), 18), ((massive, galaxy), 13...","[((early, type, galaxies), 7), ((star, forming...","Clean, Clean, Clean, Clean, Clean, Clean, Clea..."
1886,"van de Voort, Freeke",Yale University,"van de Voort, Freeke, van de Voort, Freeke, va...","2020MNRAS.494.4867V, 2019MNRAS.482L..85V, 2018...",Neutron star mergers and rare core-collapse su...,"2020-06-00, 2019-01-00, 2018-06-00, 2018-05-00...","methods: numerical, stars: abundances, stars: ...","Max Planck Institute for Astrophysics, Karl-Sc...","We use cosmological, magnetohydrodynamical sim...","[(gas, 29), (galaxy, 28), (star, 16), (mass, 1...","[((star, formation), 8), ((gas, accretion), 8)...","[((rare, core, collapse), 3), ((metal, poor, s...","Clean, Clean, Clean, Clean, Clean"
1887,"van den Bosch, Frank",Yale University,"van den Bosch, Frank, van den Bosch, Frank, va...","2023DDA....5420004V, 2019atp..prop...59V, 2017...",On the Tidal Evolution of Dark Matter Substruc...,"2023-09-00, 2019-00-00, 2017-00-00",", ,","Yale University, Yale University, Yale University",The statistics of dark matter (DM) substructur...,"[(halo, 29), (galaxy, 29), (matter, 27), (dark...","[((dark, matter), 26), ((halo, mass), 7), ((ma...","[((dark, matter, halos), 6), ((dark, matter, s...","Dirty, Dirty, Dirty"


# **Example 8: Searching through a list of author names**

The search focuses on papers from a list of author names (same format as in Example 1 above, **'Last, First'**). <br>
The input is a .csv file with multiple author names stored under a column titled **"Name"**. <br>
The ADS search covers the period 2003 to 2023.
<br>
If the file is in a different directory from the code, include the full path. <br>

The code then executes the search one name at a time and appends each result to the previous ones.<br>
In the following example we use, for convenience, the same example file as before, which also contains a list of researcher names.
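Names that are not written as 'Last, First' can be flagged before the batch search starts. This is a sketch on made-up file content using the standard-library `csv` module; `misformatted_names` is a hypothetical helper and the check is deliberately simple (a comma with text on both sides).

```python
import csv
import io


def misformatted_names(csv_text, column="Name"):
    """Return names in `column` that do not look like 'Last, First'."""
    bad = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get(column) or "").strip()
        last, sep, first = name.partition(",")
        if not (sep and last.strip() and first.strip()):
            bad.append(name)
    return bad


# Names taken from the example output; the last row is deliberately wrong.
sample = 'Name\n"Browning, Matthew"\n"Cruz, Kelle"\nEric Gawiser\n'
print(misformatted_names(sample))  # ['Eric Gawiser']
```

Because every name contains a comma, each one has to be quoted in the csv file so that it stays in a single column.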


In [None]:
datf=AP.run_file_names(filename= '/example3.csv',
               token=token, stop_dir=stop_dir)

I will go through each name in the list. Name should be formatted in a single column called "Last, First".  We will search by default any pubblication in the past 20 years by these authors, independently of the institutions they were  affiliated to. 

Browning, Matthew
I will search for every paper who first authors is Browning, Matthew and has published between 2003 and 2023. 

I am now querying ADS.



  final_df= final_df.append(data1, ignore_index= True)


1 iterations done
Cruz, Kelle
I will search for every paper who first authors is Cruz, Kelle and has published between 2003 and 2023. 

I am now querying ADS.



  final_df= final_df.append(data1, ignore_index= True)


2 iterations done
Gawiser, Eric
I will search for every paper who first authors is Gawiser, Eric and has published between 2003 and 2023. 

I am now querying ADS.

3 iterations done


  final_df= final_df.append(data1, ignore_index= True)


In [None]:
# To display the data frame run the following:
datf
# To save it in Excel format, run the following:
#datf.to_excel(path_stop+"output.xlsx")

Unnamed: 0,Input Author,Input Institution,First Author,Bibcode,Title,Publication Date,Keywords,Affiliations,Abstract,Top 10 Words,Top 10 Bigrams,Top 10 Trigrams,Data Type
0,"Browning, Matthew","None, None, None, None, None, None, None, None...","Browning, M. K., Browning, Matthew K., Brownin...","2021csss.confE..80B, 2020mdps.conf..141B, 2019...",Modelling X-ray and radio emission from a flar...,"2021-03-00, 2020-01-00, 2019-05-00, 2017-05-00...","Young stars, Flares, Radio emission, , Astroph...","Uni of Manchester, -, -, University of Exeter,...",T-Tauri stars exhibit strong flaring activity....,"[(field, 88), (magnetic, 66), (rotation, 64), ...","[((magnetic, field), 42), ((differential, rota...","[((magnetic, dynamo, action), 9), ((low, mass,...","Dirty, Dirty, Dirty, Dirty, Dirty, Dirty, Dirt..."
1,"Cruz, Kelle","None, None, None, None, None, None, None, None...","Cruz, Kelle, Cruz, K. L., Cruz, Kelle, Cruz, K...","2021csss.confE.248C, 2018yCat..51550034C, 2018...",SIMPLE Archive of Complex Objects: A new colla...,"2021-03-00, 2018-09-00, 2018-07-00, 2018-01-00...","Very low mass stars, Astronomy, archives, Star...","CUNY Hunter College, -, -, Department of Physi...",The SIMPLE Archive project -- the Substellar a...,"[(dwarf, 120), (mass, 44), (brown, 44), (low, ...","[((brown, dwarf), 44), ((ultracool, dwarf), 20...","[((stars, brown, dwarfs), 9), ((low, mass, sta...","Dirty, Dirty, Dirty, Dirty, Clean, Dirty, Dirt..."
2,"Gawiser, Eric","None, None, None, None, None, None, None, None...","Gawiser, Eric, Gawiser, Eric, Gawiser, Eric, G...","2023AAS...24142705G, 2021adap.prop..169G, 2018...","ODIN: Blobs, Galaxies, and Protoclusters Found...","2023-01-00, 2021-00-00, 2018-01-00, 2016-06-00...",", , , , , , , , , , , , , , , 98.62.Ai, 98.62....","Rutgers University, Rutgers University, New Br...",We present a new NOIRLab Survey called ODIN (O...,"[(galaxy, 161), (formation, 75), (star, 67), (...","[((star, formation), 57), ((formation, rate), ...","[((star, formation, rate), 18), ((star, format...","Clean, Dirty, Clean, Dirty, Dirty, Dirty, Dirt..."
