## Segmenting ML researchers based on publications

My goal is to estimate at a high level the number of top senior ML researchers currently active in the field. My methodology is as follows:

1. Generate a list of authors who published at one of the top 3 ML conferences in 2020, as measured by h5-index (NeurIPS, ICLR, ICML), by scraping data from conference websites
2. Using the `scholarly` Python package to query Google Scholar, obtain the publication history of each of these authors
3. Determine how many authors from this group meet a given threshold, e.g. > 5 papers published in the last 5 years

### (1) Generate initial author list from conference data (2020)

In [1]:
# icml
from src.icml import read_icml_papers

icml_data = read_icml_papers(filename="./data/icml_2020_papers.txt")
icml_authors = icml_data[3]  # dict linking authors to affiliations

In [2]:
# neurips
from src.neurips import read_neurips_papers

neurips_data = read_neurips_papers(filename="./data/neurips_2020_accepted.txt")
neurips_authors = neurips_data[3]  # dict linking authors to affiliations

In [3]:
# iclr
from src.iclr import read_iclr_papers

iclr_data = read_iclr_papers()
icrl_authors = list(
    set(
        [item for sublist in iclr_data.authors.to_list() for item in sublist]
    )
)  # list of authors

In [4]:
print(f"Total number of authors from ICML: {len(icml_authors)}")
print(f"Total number of authors from NeurIPS: {len(neurips_authors)}")
print(f"Total number of authors from ICLR: {len(icrl_authors)}")

Total number of authors from ICML: 3422
Total number of authors from NeurIPS: 5926
Total number of authors from ICLR: 6872


In [5]:
# combine
import pandas as pd
    
all_authors = {
    **{a: "" for a in icrl_authors},
    **icml_authors,
    **neurips_authors,
}


author_df = pd.DataFrame({
    "author_name": list(all_authors.keys()),
    "affiliations": list(all_authors.values())
})
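
The order of the merge above matters: when the same name appears in more than one source, the later mapping wins, so the empty-string placeholder used for ICLR authors is replaced by an ICML or NeurIPS affiliation list where one exists. A minimal sketch of that behaviour, using made-up names and affiliations (all values here are hypothetical):

```python
# Hypothetical toy inputs standing in for the three conference author sets.
iclr_names = ["Ada Example", "Bo Sample"]
icml_affiliations = {"Ada Example": ["Uni A"]}
neurips_affiliations = {"Ada Example": ["Uni A", "Lab B"], "Cy Demo": ["Lab C"]}

# Later mappings overwrite earlier ones for duplicate keys.
merged = {
    **{name: "" for name in iclr_names},
    **icml_affiliations,
    **neurips_affiliations,
}

for name, affiliations in merged.items():
    print(name, affiliations)
```

Names are matched as exact strings, so spelling variants of the same person (for instance a name carrying a trailing `*`) end up as separate entries.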

In [6]:
print(f"Total number of authors: {len(author_df)}")
author_df.head(10)

Total number of authors: 12948


|  | author_name | affiliations |
|---|---|---|
| 0 | Tal Linzen |  |
| 1 | Taewoo Kim |  |
| 2 | Dongjin Song | [NEC Labs America] |
| 3 | Zirui Wang* |  |
| 4 | Kaizhi Qian | [UIUC] |
| 5 | Hirokatsu Kataoka |  |
| 6 | Ping Li | [The Hong Kong Polytechnic University, Baidu R... |
| 7 | Ramzan Umarov |  |
| 8 | Byungsoo Kim |  |
| 9 | Hao He | [Massachusetts Institute of Technology] |


### (2) Get publication history for each author from Google Scholar

In [99]:
from scholarly import scholarly
from tqdm import tqdm

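def get_author_history(author_name):
    """Return the filled profile of the first Google Scholar hit for author_name, or {} if none."""
    search_query = scholarly.search_author(author_name)
    try:
        # Only the first search result is used
        author_history = scholarly.fill(next(search_query))
    except StopIteration:
        print(f"Couldn't find history for {author_name}")
        author_history = {}
    return author_history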


# Each query is slow (Google Scholar takes a while to respond), so the author pool
# is split into groups of 100 that are processed and saved separately. That way an
# error (e.g. a spotty connection) doesn't send us back to square one.
# Running the below took ~10 hrs total.

group_size = 100
# Note: this range stops at len(author_df) - group_size, so a final partial group
# (here the last 48 of 12948 authors) is not included.
groups = [author_df.iloc[i:i + group_size] for i in range(0, len(author_df) - group_size + 1, group_size)]

print(f"{len(groups)} groups")

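# Set this to the index of the first unsaved group to resume after an interruption
start_from_group = 0

for i, group in enumerate(groups):
    if i < start_from_group:
        continue
    print(f"group {i}")
    # Work on a copy so the new column isn't written into a slice of author_df
    group = group.copy()
    group["full_author_history"] = group.apply(
        # Alternative query that appends the first affiliation to the name:
        # lambda a: get_author_history(f"{a.author_name} ({a.affiliations[0] if len(a.affiliations) > 0 else ''})"),
        lambda a: get_author_history(a.author_name),
        axis=1
    )
    group.to_csv(f"./data/group_{i}.csv")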


129 groups
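
The batching-and-resume idea above can be sketched independently of Google Scholar. This version steps through the full length of the dataframe, so a final partial batch is kept, and it skips any batch whose output file already exists. `split_into_batches`, `process_in_batches`, the `fetch` argument, and the `.pkl` file naming are hypothetical stand-ins, not the code that produced the results in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def split_into_batches(df, batch_size):
    """Return consecutive slices of df; the last one may be shorter."""
    return [df.iloc[i:i + batch_size] for i in range(0, len(df), batch_size)]


def process_in_batches(df, batch_size, out_dir, fetch):
    """Apply fetch to each author name, saving one pickle per batch.

    Batches whose file already exists are skipped, so an interrupted run
    can simply be started again. Returns the number of batches processed.
    """
    out_dir = Path(out_dir)
    processed = 0
    for i, batch in enumerate(split_into_batches(df, batch_size)):
        path = out_dir / f"group_{i}.pkl"
        if path.exists():
            continue
        batch = batch.copy()  # avoid writing into a slice of df
        batch["full_author_history"] = batch["author_name"].apply(fetch)
        batch.to_pickle(path)
        processed += 1
    return processed


# Toy run with a fake fetch function instead of a real Scholar query.
toy_df = pd.DataFrame({"author_name": [f"author {n}" for n in range(250)]})
with tempfile.TemporaryDirectory() as tmp:
    first_run = process_in_batches(toy_df, 100, tmp, fetch=lambda name: {"name": name})
    second_run = process_in_batches(toy_df, 100, tmp, fetch=lambda name: {"name": name})
    last_batch = pd.read_pickle(Path(tmp) / "group_2.pkl")

print(first_run, second_run, len(last_batch))
```

Checking for the output file replaces the manual `start_from_group` counter, at the cost of having to delete a file by hand if a batch needs to be redone.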


### (3) Determine members of this group who meet certain thresholds 

In [1]:
# We saved our Google Scholar data as CSV, so the author history field (which was a dict), 
# will need to be re-converted from a string (which it was saved as) back to a dict

# Note - just pickle dataframes in step (2) in future, as there's a ~5% false read rate 
# with the below method, and the size difference is negligible

import re
from ast import literal_eval

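def remove_arrow_bracket_content(s):
    # Despite the name, the content is kept: "<" and ">" are each replaced by a
    # single quote, with the aim of turning angle-bracketed reprs into string literals
    no_left = re.sub(r'<', "'", s)
    return re.sub(r'>', "'", no_left)

def remove_empty_sets(s):
    # Swap the text "set()" for a plain string placeholder
    return re.sub(r'set\(\)', "'EmptySetPlaceholder'", s)

def clean_author_history(history):
    no_brackets = remove_arrow_bracket_content(history)
    return remove_empty_sets(no_brackets)

def author_history_to_dict(history):
    try:
        cleaned = clean_author_history(history)
        d = literal_eval(cleaned)
        d["false_read"] = "n"
        return d
    except Exception:
        # Anything that fails to clean or parse is flagged as a false read
        return {"false_read": "y"}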
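
The note above about pickling can be illustrated with a small round trip: a dict stored in a dataframe cell comes back from CSV as a string, while a pickle returns it as a dict with no parsing step. A minimal sketch with a made-up record (the values and file names are hypothetical):

```python
import tempfile
from pathlib import Path

import pandas as pd

# One toy row whose history cell holds a dict, as in step (2).
df = pd.DataFrame({
    "author_name": ["Ada Example"],
    "full_author_history": [{"name": "Ada Example", "hindex": 3}],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "group.csv"
    pkl_path = Path(tmp) / "group.pkl"
    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)
    from_pickle = pd.read_pickle(pkl_path)

csv_value = from_csv.loc[0, "full_author_history"]
pickle_value = from_pickle.loc[0, "full_author_history"]

print(type(csv_value).__name__, type(pickle_value).__name__)
```

Pickle files should only be loaded from sources you trust, which is fine here because the files are produced by the same notebook.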

In [2]:
# These are helper functions for creating full author records from a
# Google Scholar author history

# where an entry is missing, we return "" for string fields, and None for number fields

def get_scholar_id(history):
    return history.get("scholar_id", "")

def get_name(history):
    return history.get("name", "")

def get_url_picture(history):
    return history.get("url_picture", "")

def get_cited_by(history):
    return history.get("citedby", None)

def get_cited_by_5y(history):
    return history.get("citedby5y", None)

def get_cites_per_year(history):
    return history.get("cites_per_year", {})

def get_interests(history):
    return history.get("interests", [])

def get_num_coauthors(history):
    return len(history["coauthors"]) if "coauthors" in history else None

def get_hindex(history):
    return history.get("hindex", None)

def get_hindex_5y(history):
    return history.get("hindex5y", None)

def get_num_publications(history):
    return len(history["publications"]) if "publications" in history else None

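def get_num_publications_5y(history):
    # Caveat: "pub_year" is looked up at the top level of each publication, with 2021
    # as the default. In the sample output further down, this count equals
    # num_publications in every row, which suggests the key is not found there and
    # nothing is filtered out (get_publication_titles reads the title from p['bib'],
    # so the year may be nested there as well).
    return len([p for p in history["publications"] if p.get("pub_year", 2021) >= 2015]) if "publications" in history else None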

def get_publication_titles(history):
    return [p['bib']['title'] for p in history['publications']] if "publications" in history else []

def get_google_scholar_affiliation(history):
    return history.get("affiliation", "")
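
For the five-year publication count, a variant that looks for the year inside each publication's `bib` entry might look like the sketch below. It assumes, without this having been checked against the `scholarly` documentation, that the year sits under `bib` as `pub_year` next to the `title` that `get_publication_titles` reads, and that it may be an int or a numeric string. `count_recent_publications` is a hypothetical name, and unlike the helper above it skips publications with no usable year instead of counting them.

```python
def count_recent_publications(history, since_year=2015):
    """Count publications whose year is known and >= since_year.

    Assumed publication shape: {"bib": {"title": ..., "pub_year": ...}}.
    Returns None when the history has no "publications" entry.
    """
    if "publications" not in history:
        return None
    count = 0
    for pub in history["publications"]:
        year = pub.get("bib", {}).get("pub_year")
        try:
            year = int(year)
        except (TypeError, ValueError):
            continue  # year missing or unparseable: not counted
        if year >= since_year:
            count += 1
    return count


# Toy history with an old, a recent, and an undated publication.
toy_history = {
    "publications": [
        {"bib": {"title": "Old paper", "pub_year": "2009"}},
        {"bib": {"title": "Recent paper", "pub_year": 2019}},
        {"bib": {"title": "Undated paper"}},
    ]
}
recent = count_recent_publications(toy_history)
print(recent)
```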


In [3]:
def get_author_records_from_author_group(group):
    group["history"] = group.full_author_history.apply(author_history_to_dict)
    return pd.DataFrame({
        'false_read': group.history.apply(lambda h: h["false_read"]),
        'scholar_id': group.history.apply(get_scholar_id),
        'scholarly_name': group.author_name,
        'google_scholar_name': group.history.apply(get_name),
        'url_picture': group.history.apply(get_url_picture),
        'interests': group.history.apply(get_interests),
        'scholarly_affiliation': group.affiliations,
        'google_scholar_affiliation': group.history.apply(get_google_scholar_affiliation),
        'cited_by': group.history.apply(get_cited_by),
        'cited_by_5y': group.history.apply(get_cited_by_5y),
        'cites_per_year': group.history.apply(get_cites_per_year),
        'hindex': group.history.apply(get_hindex),
        'hindex_5y': group.history.apply(get_hindex_5y),
        'num_publications': group.history.apply(get_num_publications),
        'num_publications_5y': group.history.apply(get_num_publications_5y),
        'publication_titles': group.history.apply(get_publication_titles),
        'num_coauthors': group.history.apply(get_num_coauthors),
    })

In [5]:
from tqdm import tqdm
import pandas as pd

# Convert all our saved author data into a single dataframe with 
# the fields we care about
# Should take a couple of minutes to load and process 

num_groups = 129

author_record_dfs = []

for group_num in tqdm(range(num_groups)):
    group_df = pd.read_csv(f"./data/group_{group_num}.csv")
    author_record_dfs.append(get_author_records_from_author_group(group_df))

# Concatenate once at the end instead of growing the dataframe inside the loop
full_author_df = pd.concat(author_record_dfs)

100%|██████████| 129/129 [01:48<00:00,  1.19it/s]


In [12]:
false_reads = sum(full_author_df.false_read == "y")
print(f"Number of true reads {len(full_author_df) - false_reads}")
print(f"Number of false reads {false_reads}")
print(f"False read %: {round((false_reads / len(full_author_df)) * 100, 2)}")

Number of true reads 12008
Number of false reads 892
False read %: 6.91


In [16]:
# Number of authors with > 5 papers published in last 5 years
print(f"Number of authors with > 5 papers published in last 5 years: {sum(full_author_df.num_publications_5y > 5)}")

Number of authors with > 5 papers published in last 5 years: 9213
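
Since the aim is to compare different cut-offs for "senior", it may help to count authors against several thresholds at once. A minimal sketch on a toy frame that reuses column names from `full_author_df`; the threshold values and the `count_meeting_thresholds` helper are illustrative assumptions, not part of the analysis above.

```python
import pandas as pd


def count_meeting_thresholds(df, thresholds):
    """Count rows where every listed column strictly exceeds its threshold.

    Comparisons against missing values (NaN) are False, so authors with no
    Scholar data are never counted.
    """
    mask = pd.Series(True, index=df.index)
    for column, minimum in thresholds.items():
        mask &= df[column] > minimum
    return int(mask.sum())


# Toy data: three authors with Scholar data and one without.
toy = pd.DataFrame({
    "num_publications_5y": [112, 64, 3, None],
    "hindex": [18, 13, 2, None],
    "cited_by": [2120, 856, 40, None],
})

papers_only = count_meeting_thresholds(toy, {"num_publications_5y": 5})
papers_and_hindex = count_meeting_thresholds(
    toy, {"num_publications_5y": 5, "hindex": 15}
)
print(papers_only, papers_and_hindex)
```

Passing the thresholds as a dict keeps each segment definition in one place, which makes it easy to see how the count changes as the criteria tighten.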


In [17]:
# Sample of authors
full_author_df.sample(10)

|  | false_read | scholar_id | scholarly_name | google_scholar_name | url_picture | interests | scholarly_affiliation | google_scholar_affiliation | cited_by | cited_by_5y | cites_per_year | hindex | hindex_5y | num_publications | num_publications_5y | publication_titles | num_coauthors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 47 | n |  | Min-Gwan Seo |  |  | [] | ['ESTsoft'] |  |  |  | {} |  |  |  |  | [] |  |
| 93 | n | T3RhPdoAAAAJ | Siddharth Swaroop | Siddharth Swarup Rautaray | https://scholar.google.com/citations?view_op=m... | [Big Data Analytics, Human Computer Interaction] | ['University of Cambridge', 'University of Cam... | KIIT Deemed to be University | 2120.0 | 1880.0 | {2012: 27, 2013: 60, 2014: 107, 2015: 165, 201... | 18.0 | 17.0 | 112.0 | 112.0 | [Vision based hand gesture recognition for hum... | 7.0 |
| 1 | n | 7QcT04EAAAAJ | Peter Sunehag | Peter Sunehag | https://scholar.google.com/citations?view_op=m... | [Machine Learning, Reinforcement Learning, Dee... | ['Google - DeepMind'] | Google - DeepMind | 856.0 | 703.0 | {2004: 3, 2005: 2, 2006: 3, 2007: 3, 2008: 3, ... | 13.0 | 11.0 | 64.0 | 64.0 | [Deep reinforcement learning in large discrete... | 11.0 |
| 44 | n | 6Tzhi4MAAAAJ | Jingcheng Pang | Jincheng Pang | https://scholar.google.com/citations?view_op=m... | [Biomedical Image Analysis, Signal and Image P... | ['Nanjing University'] | Pfizer, Tufts University | 260.0 | 197.0 | {2011: 1, 2012: 2, 2013: 21, 2014: 30, 2015: 2... | 9.0 | 8.0 | 27.0 | 27.0 | [Evaluation of bone marrow lesion volume as a ... | 20.0 |
| 12 | n | ss-IvjMAAAAJ | Jason Saragih | Jason Saragih | https://scholar.google.com/citations?view_op=m... | [Computer Vision, Machine Learning] | ['Facebook'] | Research Scientist, Facebook Reality Labs | 5628.0 | 4107.0 | {2008: 17, 2009: 45, 2010: 65, 2011: 136, 2012... | 22.0 | 20.0 | 70.0 | 70.0 | [The extended cohn-kanade dataset (ck+): A com... | 0.0 |
| 96 | n | iBeDoRAAAAAJ | Richard Zemel | Richard Zemel | https://scholar.google.com/citations?view_op=m... | [Machine Learning, Computer Vision, Neural Cod... | ['Vector Institute', 'Vector Institute', 'Vect... | Professor of Computer Science, University of T... | 32384.0 | 26512.0 | {1996: 91, 1997: 102, 1998: 106, 1999: 112, 20... | 63.0 | 49.0 | 275.0 | 275.0 | [Show, attend and tell: Neural image caption g... | 0.0 |
| 56 | n | I-Hs9twAAAAJ | YUFAN ZHAO | Yifan Zhao | https://scholar.google.com/citations?view_op=m... | [nonlinear system identification, non-destruct... | ['Microsoft'] | Senior Lecturer, Cranfield University | 1107.0 | 927.0 | {2006: 3, 2007: 4, 2008: 6, 2009: 8, 2010: 20,... | 18.0 | 16.0 | 122.0 | 122.0 | [Part-regularized near-duplicate vehicle re-id... | 4.0 |
| 61 | n | mqtdKfkAAAAJ | Stephen Pasteris | Stephen Pasteris | https://scholar.google.com/citations?view_op=m... | [Machine Learning] | ['University College London', 'University Coll... | Research Associate, University College London | 96.0 | 94.0 | {2013: 1, 2014: 1, 2015: 3, 2016: 6, 2017: 5, ... | 6.0 | 6.0 | 19.0 | 19.0 | [Service placement with provable guarantees in... | 0.0 |
| 10 | n | uQXB6-oAAAAJ | Yueming Lyu | Yueming Lyu (吕月明) | https://scholar.google.com/citations?view_op=m... | [machine learning, optimization, approximation... |  | University of Technology Sydney | 122.0 | 122.0 | {2015: 3, 2016: 16, 2017: 12, 2018: 19, 2019: ... | 6.0 | 6.0 | 15.0 | 15.0 | [Asymmetric cyclical hashing for large scale i... | 5.0 |
| 28 | n | chATYTUAAAAJ | Andreas Lehrmann | Andreas C. Lehmann | https://scholar.google.com/citations?view_op=m... | [musicology, music psychology] | ['Borealis AI'] | Professor of Musicology, University of Music, ... | 6565.0 | 2598.0 | {1997: 38, 1998: 39, 1999: 48, 2000: 63, 2001:... | 29.0 | 20.0 | 97.0 | 97.0 | [Expert and exceptional performance: Evidence ... | 12.0 |
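
Because each author is resolved by taking the first Google Scholar search hit for their name, some rows in the sample pair a conference author with what looks like a different person (for example, `Andreas Lehrmann` is matched to a musicology profile). One way to flag such rows is to compare `scholarly_name` with `google_scholar_name`. The sketch below uses `difflib.SequenceMatcher` from the standard library; `name_similarity`, `is_likely_match`, and the 0.9 threshold are hypothetical choices for illustration, not tuned values.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    a, b = a.strip().casefold(), b.strip().casefold()
    if not a or not b:
        return 0.0  # a missing name never counts as a match
    return SequenceMatcher(None, a, b).ratio()


def is_likely_match(conference_name, scholar_name, threshold=0.9):
    # threshold is an arbitrary illustrative value, not a tuned one
    return name_similarity(conference_name, scholar_name) >= threshold


# An identical pair, a clearly different pair, and a pair with no Scholar name.
pairs = [
    ("Peter Sunehag", "Peter Sunehag"),
    ("Jason Saragih", "Richard Zemel"),
    ("Min-Gwan Seo", ""),
]
flags = [is_likely_match(a, b) for a, b in pairs]
print(flags)
```

A string-similarity check is only a rough filter: it can reject legitimate variants such as added middle names or transliterations, and it cannot tell apart two different people who share a name.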
