## Segmenting ML researchers based on publications

My goal is a rough estimate of the number of top senior ML researchers currently active in the field. My methodology is as follows:

1. Generate a list of authors who published in 2020 at one of the top 3 ML conferences by h5-index (NeurIPS, ICLR, ICML), by scraping data from the conference websites.
2. Obtain the publication history of each of these authors from Google Scholar, using the `scholarly` Python package.
3. Determine how many authors from this group meet a given threshold, e.g. more than 5 papers published in the last 5 years.
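
Step 3 is not implemented in the cells shown here, so the following is only a minimal sketch of how the threshold check might look. It assumes each publication is a dict with a `bib` entry holding a string `pub_year`, as in the Google Scholar output further down; the helper names `count_recent_papers` and `meets_threshold` are hypothetical, and the window is assumed to include the current year.

```python
from datetime import date


def count_recent_papers(publications, window_years=5, current_year=None):
    """Count publications whose pub_year falls in the last `window_years` years.

    Assumes each publication looks like {"bib": {"title": ..., "pub_year": "2018"}}.
    """
    if current_year is None:
        current_year = date.today().year
    # The window includes the current year, e.g. 2016..2020 for a 5-year window in 2020
    earliest_year = current_year - window_years + 1
    count = 0
    for pub in publications:
        year = pub.get("bib", {}).get("pub_year")
        # Some entries carry no usable year; skip them instead of guessing
        if year is None or not str(year).isdigit():
            continue
        if earliest_year <= int(year) <= current_year:
            count += 1
    return count


def meets_threshold(publications, min_papers=5, window_years=5, current_year=None):
    """True if the author has strictly more than `min_papers` recent papers."""
    return count_recent_papers(publications, window_years, current_year) > min_papers


# Toy data in the assumed shape (not real results)
example = [
    {"bib": {"title": "A", "pub_year": "2018"}},
    {"bib": {"title": "B", "pub_year": "2017"}},
    {"bib": {"title": "C", "pub_year": "2014"}},
    {"bib": {"title": "D"}},  # missing year, ignored
]
print(count_recent_papers(example, window_years=5, current_year=2020))
```

Passing `current_year` explicitly keeps the count reproducible when the notebook is re-run in a later year.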

### (1) Generate initial author list from conference data (2020)

In [5]:
# icml
from src.icml import read_icml_papers

icml_data = read_icml_papers(filename="./data/icml_2020_papers.txt")
icml_authors = icml_data[3]  # dict linking authors to affiliations

In [7]:
# neurips
from src.neurips import read_neurips_papers

neurips_data = read_neurips_papers(filename="./data/neurips_2020_accepted.txt")
neurips_authors = neurips_data[3]  # dict linking authors to affiliations

In [8]:
# iclr
from src.iclr import read_iclr_papers

iclr_data = read_iclr_papers()
icrl_authors = list(
    set(
        [item for sublist in iclr_data.authors.to_list() for item in sublist]
    )
)  # list of authors

In [10]:
print(f"Total number of authors from ICML: {len(icml_authors)}")
print(f"Total number of authors from NeurIPS: {len(neurips_authors)}")
print(f"Total number of authors from ICLR: {len(icrl_authors)}")

Total number of authors from ICML: 3422
Total number of authors from NeurIPS: 5926
Total number of authors from ICLR: 6872


In [11]:
# combine
import pandas as pd
    
all_authors = {
    **{a: "" for a in icrl_authors},
    **icml_authors,
    **neurips_authors,
}


author_df = pd.DataFrame({
    "author_name": list(all_authors.keys()),
    "affiliations": list(all_authors.values())
})

In [12]:
print(f"Total number of authors: {len(author_df)}")
author_df.head(10)

Total number of authors: 12948


| | author_name | affiliations |
|---|---|---|
| 0 | Ryan Lowe* | |
| 1 | Hanshu YAN | |
| 2 | Dylan Shell | [Texas A&M University] |
| 3 | Amarjot Singh | |
| 4 | Piotr Indyk | [MIT] |
| 5 | Amin Ghiasi | |
| 6 | Keizo Kato | [Fujitsu Laboratories Ltd.] |
| 7 | Yanzhi Wang | |
| 8 | Nenghai Yu | [University of Science, University of Science] |
| 9 | Andrew Willis | |
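
Some scraped names carry formatting residue, such as the trailing `*` in "Ryan Lowe*" or the upper-case surname in "Hanshu YAN". This can split one person into several keys when the three lists are merged, and may hurt name-based search. The sketch below shows one possible clean-up step; `normalize_author_name` is a hypothetical helper, not part of the original pipeline, and its rules are heuristics that will mishandle some names (for example all-caps hyphenated names).

```python
import re


def normalize_author_name(raw_name):
    """Heuristically clean a scraped author name (illustrative assumption only)."""
    # Drop footnote markers such as '*' that conferences use for equal contribution
    name = re.sub(r"[*†‡§]+", "", raw_name)
    # Collapse runs of whitespace and trim the ends
    name = re.sub(r"\s+", " ", name).strip()
    # Convert fully upper-case tokens longer than two characters ('YAN' -> 'Yan'),
    # leaving initials such as 'J.' untouched
    tokens = [
        token.capitalize() if token.isupper() and len(token) > 2 else token
        for token in name.split(" ")
    ]
    return " ".join(tokens)


for raw in ["Ryan Lowe*", "Hanshu YAN", "  Piotr   Indyk "]:
    print(repr(raw), "->", repr(normalize_author_name(raw)))
```

If used, it would have to be applied to every source before the merge, so that identical people end up under identical keys.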


### (2) Get publication history for each author from Google Scholar

In [None]:
from scholarly import scholarly
from tqdm import tqdm

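def get_author_history(author_name):
    """Return the filled Google Scholar profile of the first search hit, or {} if there is none."""
    print(author_name)
    search_query = scholarly.search_author(author_name)
    try:
        # Take the first match only; names are not disambiguated by affiliation here
        author_history = scholarly.fill(next(search_query))
    except StopIteration:
        print("Couldn't find history")
        author_history = {}
    return author_history

# Work on an explicit copy so that adding a column does not write to a view of author_df
test_df = author_df.head(100).copy()

test_df["full_author_history"] = test_df.apply(
#     lambda a: get_author_history(f"{a.author_name} ({a.affiliations[0] if len(a.affiliations) > 0 else ''})"),
    lambda a: get_author_history(a.author_name),
    axis=1
)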

# # Print the titles of the author's publications
# print([pub['bib']['title'] for pub in author['publications']])

# # Take a closer look at the first publication
# pub = scholarly.fill(author['publications'][0])
# print(pub)

# # Which papers cited that publication?
# print([citation['bib']['title'] for citation in scholarly.citedby(pub)])

Ryan Lowe*
Couldn't find history
Hanshu YAN
Dylan Shell
Amarjot Singh
Piotr Indyk
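
Running one lookup per author over roughly 13,000 names is slow, and a single failure part-way through would lose everything fetched so far. A possible way to make the loop resumable is sketched below. `fetch_histories` is a hypothetical helper that takes the lookup function as a parameter, so the demo uses a stand-in instead of calling Google Scholar; the delay value is an assumption, not a documented rate limit.

```python
import time


def fetch_histories(names, lookup, cache=None, delay_seconds=0.0):
    """Look up each name once, storing results in `cache` so a re-run can resume."""
    cache = {} if cache is None else cache
    for name in names:
        if name in cache:
            continue  # already fetched in an earlier run
        try:
            cache[name] = lookup(name)
        except Exception as exc:
            # Record the failure as an empty history so one bad lookup
            # does not abort the whole batch
            print(f"Lookup failed for {name!r}: {exc}")
            cache[name] = {}
        time.sleep(delay_seconds)  # pause between requests
    return cache


# Stand-in lookup used only for this demonstration
calls = []

def fake_lookup(name):
    calls.append(name)
    if name == "Unknown Person":
        raise ValueError("no profile")
    return {"name": name, "publications": []}


cache = fetch_histories(["Dylan Shell", "Unknown Person"], fake_lookup)
cache = fetch_histories(["Dylan Shell", "Piotr Indyk"], fake_lookup, cache=cache)
print(len(calls), sorted(cache))
```

In the notebook, the call might look like `fetch_histories(test_df["author_name"], get_author_history, delay_seconds=2)`, with the cache periodically written to disk.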


In [49]:
# author["citedby5y"]
# author["hindex"]
# author["hindex5y"]
author["publications"]

[{'container_type': 'Publication',
  'source': <PublicationSource.AUTHOR_PUBLICATION_ENTRY: 2>,
  'bib': {'title': 'Densepose: Dense human pose estimation in the wild',
   'pub_year': '2018'},
  'filled': False,
  'author_pub_id': 'cLPaHcIAAAAJ:R3hNpaxXUhUC',
  'num_citations': 440},
 {'container_type': 'Publication',
  'source': <PublicationSource.AUTHOR_PUBLICATION_ENTRY: 2>,
  'bib': {'title': 'Houdini: Fooling deep structured visual and speech recognition models with adversarial examples',
   'pub_year': '2017'},
  'filled': False,
  'author_pub_id': 'cLPaHcIAAAAJ:hFOr9nPyWt4C',
  'num_citations': 230},
 {'container_type': 'Publication',
  'source': <PublicationSource.AUTHOR_PUBLICATION_ENTRY: 2>,
  'bib': {'title': 'ModDrop: adaptive multi-modal gesture recognition',
   'pub_year': '2014'},
  'filled': False,
  'author_pub_id': 'cLPaHcIAAAAJ:YsMSGLbcyi4C',
  'num_citations': 218},
 {'container_type': 'Publication',
  'source': <PublicationSource.AUTHOR_PUBLICATION_ENTRY: 2>,
  'bi