# Preprocess metadata

In this notebook, we start with the full, unprocessed metadata JSON downloaded directly from Kaggle.  
Cleaning:
- Parse some of the columns to include useful info (v1_date, first_category, etc.)
- Drop unnecessary columns  
- Parse author name lists to be a list of Firstname {Middle} Lastname.  

Subsetting:
- Only include papers since 2018
- Only include papers with at least one 'cs' or 'stat' category

LLM annotation:
- Annotate papers based on whether they contain an LLM-related keyword ('mentions_LM_keyword'); include all such keywords in a list column
- Create an ```lm_metadata``` sub-dataframe of LLM papers with mentions_LM_keyword == True, for use in downstream steps (extracting S2 info, getting fulltexts, etc.)

(Optional) Use paper fulltexts to extract affiliations for the LLM papers. These data are necessary for the analyses involving affiliations.
- *Outside of this notebook:* Download fulltext PDFs from Google Cloud. You can't restrict to only a certain subset of papers, so you'll need to download all papers, but you can start at 2018. e.g. ```gsutil -m cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/18* /path/to/output``` will download only 2018 papers and you can use analogous commands for 2019-present.
- For the LLM papers, convert PDFs to fulltext using the pdftotext tool (code is in the notebook).
- Parse each paper's converted .txt file to look for email addresses to infer affiliation. Store emails in the metadata dataframe. 

Output: 
- Output the processed overall metadata df and the LLM-specific metadata df as JSON files to '/share/pierson/raj/LLM_bibliometrics_v2/processed_data/'

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from datetime import datetime
from tqdm import tqdm
import os
import sys

if '/home/rm868/LLM-publication-patterns/data_prep' not in sys.path:
    sys.path.append('/home/rm868/LLM-publication-patterns/data_prep')

from preprocess_utils import BASE_DATA_DIR, PROCESSED_DATA_DIR
from preprocess_utils import get_lm_terms

%load_ext autoreload
%autoreload 2

### Load + clean columns

In [7]:
# Takes about 1m30s to load all metadata
metadata_date = '20230910'
metadata_path = os.path.join(BASE_DATA_DIR, f'{metadata_date}-arxiv-metadata-oai-snapshot.json')
metadata = pd.read_json(metadata_path, lines=True)

In [8]:
metadata.sample(5)

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
800237,1612.04428,David Renfrew,"Nicholas A. Cook, Walid Hachem, Jamal Najim an...",Non-Hermitian random matrices with a variance ...,50 pages. The original arXiv submission has be...,,10.1214/18-EJP230,,math.PR,http://arxiv.org/licenses/nonexclusive-distrib...,"For each $n$, let $A_n=(\sigma_{ij})$ be an ...","[{'version': 'v1', 'created': 'Tue, 13 Dec 201...",2020-08-03,"[[Cook, Nicholas A., ], [Hachem, Walid, ], [Na..."
924574,1712.05975,Pablo Portilla Cuadrado,Pablo Portilla Cuadrado,General t\^ete-\`a-t\^ete graphs and Seifert m...,"22 pages, 20 figures. arXiv admin note: text o...",,,,math.GT math.GN,http://arxiv.org/licenses/nonexclusive-distrib...,T\^ete-\`a-t\^ete graphs and relative t\^ete...,"[{'version': 'v1', 'created': 'Sat, 16 Dec 201...",2017-12-19,"[[Cuadrado, Pablo Portilla, ]]"
1484853,2106.07089,Luca Cardelli,"Luca Cardelli, Marta Kwiatkowska, Luca Laurenti",A Language for Modeling And Optimizing Experim...,,,,,q-bio.QM,http://creativecommons.org/licenses/by/4.0/,Automation is becoming ubiquitous in all lab...,"[{'version': 'v1', 'created': 'Sun, 13 Jun 202...",2021-11-30,"[[Cardelli, Luca, ], [Kwiatkowska, Marta, ], [..."
2129115,hep-ph/0410225,Terunuma Sachiko,"M. Bando, T. Kugo, A. Sugamoto, S. Terunuma",Pentaquark Baryons in String Theory -- Talk at...,talk given by A. Sugamoto at International Wor...,,,,hep-ph,,Pentaquark baryons $\Theta^{+}$ and $\Xi^{--...,"[{'version': 'v1', 'created': 'Fri, 15 Oct 200...",2007-05-23,"[[Bando, M., ], [Kugo, T., ], [Sugamoto, A., ]..."
908557,1711.01547,Daniel Rohrlich,Agung Budiyono and Daniel Rohrlich,Quantum mechanics as classical statistical mec...,12 pages; comments welcome,"Nature Communications 8, 1306 (2017)",10.1038/s41467-017-01375-w,,quant-ph,http://arxiv.org/licenses/nonexclusive-distrib...,Where does quantum mechanics part ways with ...,"[{'version': 'v1', 'created': 'Sun, 5 Nov 2017...",2017-11-07,"[[Budiyono, Agung, ], [Rohrlich, Daniel, ]]"


In [9]:
metadata['categories'] = metadata['categories'].apply(lambda x: x.split(' '))
metadata['first_category'] = metadata['categories'].apply(lambda x: x[0])
metadata['authors'] = metadata['authors_parsed']
metadata['v1_date'] = metadata['versions'].apply(lambda x: x[0]['created'])
metadata['v1_date'] = pd.to_datetime(metadata['v1_date'], format='%a, %d %b %Y %H:%M:%S GMT')
metadata = metadata.drop(['submitter', 'update_date', 'authors_parsed', 'comments', 'journal-ref', 'doi', 'report-no', 'license'], axis=1)
display(metadata.sample(5))

Unnamed: 0,id,authors,title,categories,abstract,versions,first_category,v1_date
851178,1705.07637,"[[Bordalba, Ricard, ], [Ros, Lluís, ], [Porta,...",Kinodynamic Planning on Constraint Manifolds,[cs.RO],This paper presents a motion planner for sys...,"[{'version': 'v1', 'created': 'Mon, 22 May 201...",cs.RO,2017-05-22 09:59:09
1738359,2210.17004,"[[Liu, Aiwei, ], [Yu, Honghai, ], [Hu, Xuming,...",Character-level White-Box Adversarial Attacks ...,[cs.CL],We propose the first character-level white-b...,"[{'version': 'v1', 'created': 'Mon, 31 Oct 202...",cs.CL,2022-10-31 01:46:29
1993870,cond-mat/0009148,"[[Kezsmarki, I., ], [Csonka, Sz., ], [Berger, ...",Pressure dependence of the spin gap in BaVS_3,"[cond-mat.str-el, cond-mat.mtrl-sci]",We carried out magnetotransport experiments ...,"[{'version': 'v1', 'created': 'Mon, 11 Sep 200...",cond-mat.str-el,2000-09-11 12:15:24
1497429,2107.03417,"[[Kukavica, Igor, ], [Nguyen, Trinh, ], [Vicol...",On the Euler+Prandtl expansion for the Navier-...,[math.AP],We establish the validity of the Euler$+$Pra...,"[{'version': 'v1', 'created': 'Wed, 7 Jul 2021...",math.AP,2021-07-07 18:03:56
1754283,2211.15377,"[[Carneiro, Hugo, ], [Weber, Cornelius, ], [We...",Whose Emotion Matters? Speaking Activity Local...,"[eess.AS, cs.CV, cs.LG, cs.NE, cs.SD]",The task of emotion recognition in conversat...,"[{'version': 'v1', 'created': 'Wed, 23 Nov 202...",eess.AS,2022-11-23 09:57:17
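
The column parsing above can be illustrated on a single record without pandas. This is a minimal sketch, assuming a made-up record (`toy_record`) and a hypothetical helper name (`parse_record`); only the split on spaces and the date format string come from the cell above.

```python
from datetime import datetime

# Made-up record with the same shape as one line of the metadata snapshot
# (only the fields used here; the values are illustrative, not real data).
toy_record = {
    "categories": "cs.CL stat.ML",
    "versions": [
        {"version": "v1", "created": "Mon, 01 Jan 2018 12:00:00 GMT"},
        {"version": "v2", "created": "Tue, 02 Jan 2018 09:00:00 GMT"},
    ],
}

# Same format string as the pd.to_datetime call in the notebook.
CREATED_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"


def parse_record(record):
    """Hypothetical helper: derive categories, first_category and v1_date."""
    categories = record["categories"].split(" ")
    # The first entry of 'versions' is v1, so its 'created' field is the v1 date.
    v1_date = datetime.strptime(record["versions"][0]["created"], CREATED_FORMAT)
    return {
        "categories": categories,
        "first_category": categories[0],
        "v1_date": v1_date,
    }


parsed = parse_record(toy_record)
```

Note that `%a` and `%b` are locale-dependent, so this sketch assumes an English locale.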


### Subset + fix author names

In [10]:
# Subset the dataframe to papers that:
# - Were posted since 2018-01-01 (the date filter is commented out below, so this run keeps all dates)
# - Have at least one category that starts with 'cs.' or 'stat.'

# metadata = metadata[metadata['v1_date'] >= datetime(2018, 1, 1, 0, 0, 0)]
# metadata = metadata[metadata['v1_date'] >= datetime(2008, 1, 1, 0, 0, 0)]
metadata = metadata[metadata['categories'].apply(lambda x: any(c.startswith(('cs.', 'stat.')) for c in x))]

display(metadata)

Unnamed: 0,id,authors,title,categories,abstract,versions,first_category,v1_date
1,0704.0002,"[[Streinu, Ileana, ], [Theran, Louis, ]]",Sparsity-certifying Graph Decompositions,"[math.CO, cs.CG]","We describe a new algorithm, the $(k,\ell)$-...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",math.CO,2007-03-31 02:26:18
45,0704.0046,"[[Csiszar, I., ], [Hiai, F., ], [Petz, D., ]]",A limit relation for entropy and channel capac...,"[quant-ph, cs.IT, math.IT]","In a quantum mechanical model, Diosi, Feldma...","[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",quant-ph,2007-04-01 16:37:36
46,0704.0047,"[[Kosel, T., ], [Grabec, I., ]]",Intelligent location of simultaneously active ...,"[cs.NE, cs.AI]",The intelligent acoustic emission locator is...,"[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",cs.NE,2007-04-01 13:06:50
49,0704.0050,"[[Kosel, T., ], [Grabec, I., ]]",Intelligent location of simultaneously active ...,"[cs.NE, cs.AI]",Part I describes an intelligent acoustic emi...,"[{'version': 'v1', 'created': 'Sun, 1 Apr 2007...",cs.NE,2007-04-01 18:53:13
61,0704.0062,"[[Šrámek, Rastislav, ], [Brejová, Broňa, ], [V...",On-line Viterbi Algorithm and Its Relationship...,[cs.DS],"In this paper, we introduce the on-line Vite...","[{'version': 'v1', 'created': 'Sat, 31 Mar 200...",cs.DS,2007-03-31 23:52:33
...,...,...,...,...,...,...,...,...
2321538,quant-ph/9909094,"[[Knill, E., ], [Laflamme, R., ]]",Quantum Computation and Quadratically Signed W...,"[quant-ph, cs.CC]",We prove that quantum computation is polynom...,"[{'version': 'v1', 'created': 'Thu, 30 Sep 199...",quant-ph,1999-09-30 22:24:33
2321571,quant-ph/9910033,"[[Hemaspaandra, Edith, , RIT], [Hemaspaandra, ...",Almost-Everywhere Superiority for Quantum Comp...,"[quant-ph, cs.CC]",Simon as extended by Brassard and H{\o}yer s...,"[{'version': 'v1', 'created': 'Fri, 8 Oct 1999...",quant-ph,1999-10-08 03:48:56
2321625,quant-ph/9910087,"[[Kent, Adrian, , DAMTP, University of Cambrid...",Unconditionally Secure Commitment of a Certifi...,"[quant-ph, cs.CR]",In a secure bit commitment protocol involvin...,"[{'version': 'v1', 'created': 'Wed, 20 Oct 199...",quant-ph,1999-10-20 21:09:56
2321706,quant-ph/9911043,"[[Hardy, Lucien, , The Perimeter Institute], [...",Cheat Sensitive Quantum Bit Commitment,"[quant-ph, cs.CR]",We define cheat sensitive cryptographic prot...,"[{'version': 'v1', 'created': 'Tue, 9 Nov 1999...",quant-ph,1999-11-09 22:53:16
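
The subsetting rule can be written as a small predicate, which makes the optional date cutoff explicit. This is a sketch under assumptions: `keep_paper` and its `min_date` parameter are hypothetical names, not part of `preprocess_utils`.

```python
from datetime import datetime


def keep_paper(categories, v1_date, min_date=None):
    """Hypothetical predicate mirroring the subsetting rule.

    categories: list of arXiv category strings, e.g. ['math.CO', 'cs.CG']
    v1_date:    datetime of the first version
    min_date:   optional cutoff; None disables the date filter
    """
    # Prefix match on 'cs.' / 'stat.' so that e.g. 'cond-mat.stat-mech' is not matched.
    in_scope = any(c.startswith(("cs.", "stat.")) for c in categories)
    recent_enough = min_date is None or v1_date >= min_date
    return in_scope and recent_enough


# With no cutoff, an older cs paper is kept; with a 2018 cutoff it is dropped.
kept_without_cutoff = keep_paper(["math.CO", "cs.CG"], datetime(2007, 3, 31))
kept_with_cutoff = keep_paper(
    ["math.CO", "cs.CG"], datetime(2007, 3, 31), min_date=datetime(2018, 1, 1)
)
```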


In [11]:
from preprocess_utils import fix_author_name_list

metadata['authors'] = metadata['authors'].apply(fix_author_name_list)
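
The implementation of `fix_author_name_list` is not shown in this notebook. As a rough illustration of the intended transformation, here is a minimal sketch assuming each `authors_parsed` entry has the form `[last, first, suffix, ...optional affiliation]`, as in the sample output above. The names `format_author` and `format_author_list` are hypothetical, and the real function may treat suffixes and edge cases differently.

```python
def format_author(parsed_name):
    """Turn one [last, first, suffix, ...] entry into 'First {Middle} Last'.

    Assumption: the suffix and any trailing affiliation fields are ignored.
    """
    last, first = parsed_name[0].strip(), parsed_name[1].strip()
    # Skip empty parts so single-name authors do not get stray spaces.
    return " ".join(part for part in (first, last) if part)


def format_author_list(authors_parsed):
    """Apply format_author to every entry of an authors_parsed list."""
    return [format_author(entry) for entry in authors_parsed]


example = format_author_list(
    [["Streinu", "Ileana", ""], ["Hardy", "Lucien", "", "The Perimeter Institute"]]
)
```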

### Annotate whether LLM paper

In [12]:
# Add column for list of LLM terms in title or abstract
metadata['LM_related_terms'] = metadata['title'].apply(get_lm_terms) + metadata['abstract'].apply(get_lm_terms)

# Remove duplicate entries, i.e. apply list(set()) to each list
metadata['LM_related_terms'] = metadata['LM_related_terms'].apply(lambda x: list(set(x)))
metadata['mentions_LM_keyword'] = metadata['LM_related_terms'].apply(lambda x: len(x) > 0)
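
The keyword list and matching rules live in `get_lm_terms`, which is not shown here. The sketch below only illustrates the general shape of the annotation. The term list `EXAMPLE_TERMS`, the case-handling rule, and the function names are assumptions, not the project's actual definitions.

```python
import re

# Hypothetical term list; the real one is defined in preprocess_utils.
EXAMPLE_TERMS = ["language model", "large language model", "BERT", "ChatGPT"]


def find_lm_terms(text, terms=EXAMPLE_TERMS):
    """Return the terms from `terms` that occur in `text`.

    Assumption: terms containing uppercase letters (e.g. 'BERT') are matched
    case-sensitively; all-lowercase terms are matched case-insensitively.
    """
    found = []
    for term in terms:
        flags = 0 if term != term.lower() else re.IGNORECASE
        # \b at the start avoids matching inside a longer word.
        if re.search(r"\b" + re.escape(term), text, flags):
            found.append(term)
    return found


def annotate(title, abstract):
    """Combine title and abstract matches and drop duplicates, keeping order."""
    combined = find_lm_terms(title) + find_lm_terms(abstract)
    unique_terms = list(dict.fromkeys(combined))
    return unique_terms, len(unique_terms) > 0


terms, mentions = annotate(
    "Distilling Knowledge Learned in BERT for Text Generation",
    "Large-scale pre-trained language model such as BERT ...",
)
```

`dict.fromkeys` is used here instead of `list(set(...))` so that the order of terms is stable across runs; that is a design choice of the sketch, not something the notebook requires.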

In [13]:
lm_metadata = metadata[metadata['mentions_LM_keyword']].copy()
display(lm_metadata.sample(5), lm_metadata.shape)

Unnamed: 0,id,authors,title,categories,abstract,versions,first_category,v1_date,LM_related_terms,mentions_LM_keyword
1530334,2109.07513,"[Rami Botros, Tara N. Sainath, Robert David, E...",Tied & Reduced RNN-T Decoder,"[cs.CL, cs.LG, cs.SD, eess.AS]",Previous works on the Recurrent Neural Netwo...,"[{'version': 'v1', 'created': 'Wed, 15 Sep 202...",cs.CL,2021-09-15 18:19:16,[language model],True
760927,1608.04465,"[Ehsan Shareghi, Matthias Petri, Gholamreza Ha...","Fast, Small and Exact: Infinite-order Language...",[cs.CL],Efficient methods for storing and querying a...,"[{'version': 'v1', 'created': 'Tue, 16 Aug 201...",cs.CL,2016-08-16 02:33:21,[language model],True
1202423,1911.03829,"[Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Li...",Distilling Knowledge Learned in BERT for Text ...,"[cs.CL, cs.LG]",Large-scale pre-trained language model such ...,"[{'version': 'v1', 'created': 'Sun, 10 Nov 201...",cs.CL,2019-11-10 02:12:38,"[language model, BERT]",True
1822533,2304.04487,"[Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, D...",Inference with Reference: Lossless Acceleratio...,"[cs.CL, cs.AI]","We propose LLMA, an LLM accelerator to lossl...","[{'version': 'v1', 'created': 'Mon, 10 Apr 202...",cs.CL,2023-04-10 09:55:14,"[language model, large language model]",True
1792674,2302.07740,"[Wei-Wei Du, Hong-Wei Wu, Wei-Yao Wang, Wen-Ch...",Team Triple-Check at Factify 2: Parameter-Effi...,"[cs.CL, cs.AI, cs.CV, cs.LG]",Multi-modal fact verification has become an ...,"[{'version': 'v1', 'created': 'Sun, 12 Feb 202...",cs.CL,2023-02-12 18:08:54,[foundation model],True


(17651, 10)

## Get emails from fulltexts to determine affiliations

### Convert downloaded PDFs to fulltexts

Paper PDF fulltexts need to be downloaded from GCP before running these cells.
The cells only need to be run once to produce a parsed .txt file for each paper. 

In [None]:
from preprocess_utils import get_paper_filenames

# Create a list of filenames in the correct format for all the papers in lm_metadata
paper_filenames = lm_metadata.apply(get_paper_filenames, axis=1)
print(paper_filenames[:5])

In [None]:
from preprocess_utils import PAPER_PDF_DIR
LLM_FULLTEXT_DIR = os.path.join(BASE_DATA_DIR, 'llm_paper_fulltexts')

# Make a subdir in the destination directory for each arXiv ID prefix (YYMM, e.g. '2109')
all_monthyears = lm_metadata['id'].apply(lambda x: x.split('.')[0]).unique()
for monthyear in all_monthyears:
    os.makedirs(f'{LLM_FULLTEXT_DIR}/{monthyear}', exist_ok=True)

# Create a list of PDF and text file paths
pdf_paths = []
txt_paths = []

# Build the source PDF path and destination .txt path for each paper
for filename, monthyear in paper_filenames:
    pdf_paths.append(f'{PAPER_PDF_DIR}/{monthyear}/{filename}.pdf')
    txt_paths.append(f'{LLM_FULLTEXT_DIR}/{monthyear}/{filename}.txt')

In [None]:
from preprocess_utils import convert_pdfs_in_parallel

# Call the function to convert PDFs to text
# This took 18 minutes, but may take longer with fewer threads
convert_pdfs_in_parallel(pdf_paths, txt_paths)

print("All LLM paper PDFs converted to txt files.")
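
`convert_pdfs_in_parallel` is defined in `preprocess_utils` and not shown here. One plausible way to structure such a helper is sketched below, assuming the `pdftotext` command-line tool is on the PATH. The function names, the `max_workers` default, and the injectable `run` argument are all hypothetical and exist only to keep the sketch testable.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def pdftotext_command(pdf_path, txt_path):
    """Basic invocation: pdftotext <input.pdf> <output.txt>."""
    return ["pdftotext", pdf_path, txt_path]


def convert_one(pdf_path, txt_path, run=subprocess.run):
    """Convert a single PDF; return True on success, False otherwise."""
    try:
        result = run(pdftotext_command(pdf_path, txt_path), capture_output=True)
    except OSError:
        # Raised e.g. when the pdftotext binary is not installed.
        return False
    return result.returncode == 0


def convert_all(pdf_paths, txt_paths, max_workers=8, run=subprocess.run):
    """Convert all (pdf, txt) pairs using a thread pool; results keep input order."""
    pairs = list(zip(pdf_paths, txt_paths))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: convert_one(pair[0], pair[1], run=run), pairs))
```

Threads are sufficient in this sketch because each worker mostly waits on an external process. Returning a per-file success flag makes it easy to list the PDFs that failed to convert.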

### Extract emails from fulltexts

In [59]:
%load_ext autoreload
%autoreload 2
from preprocess_utils import get_all_emails

fulltext_dir = os.path.join(BASE_DATA_DIR, 'llm_paper_fulltexts')

# Add emails as a column by applying the get_all_emails function to the dataframe (it takes in rows)
n_lines_to_check = 100
lm_metadata['emails'] = None
for i, row in tqdm(lm_metadata.iterrows(), miniters=1000):
    lm_metadata.at[i, 'emails'] = get_all_emails(row, fulltext_dir, n_lines_to_check)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


16979it [00:25, 662.13it/s]
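
The extraction logic itself is in `get_all_emails`, which is not shown. The sketch below illustrates one way to pull addresses from the first lines of a converted text file, including the grouped form `{a, b}@domain` that appears in the stored `emails` values. The regular expressions and the function name `extract_emails` are assumptions, not the project's actual patterns.

```python
import re

# One or more dot-separated labels, e.g. cs.example.edu
_DOMAIN = r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
# Plain addresses such as name@example.com
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@" + _DOMAIN)
# Grouped addresses such as {alice, bob}@cs.example.edu
GROUP_RE = re.compile(r"\{[^{}@]+\}@" + _DOMAIN)


def extract_emails(lines, n_lines_to_check=100):
    """Return grouped and plain email strings found in the first lines of a paper."""
    head = "\n".join(lines[:n_lines_to_check])
    grouped = GROUP_RE.findall(head)
    # Blank out grouped matches first so the plain pattern does not re-match them.
    remainder = GROUP_RE.sub(" ", head)
    plain = EMAIL_RE.findall(remainder)
    return grouped + plain


toy_lines = [
    "A Made-Up Paper Title",
    "{alice, bob}@cs.example.edu",
    "carol@example.com",
    "Abstract",
]
found = extract_emails(toy_lines)
```

Grouped addresses are kept unexpanded in this sketch, which matches how they appear in the `emails` column of the sample output.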


In [60]:
# Count papers with no fulltext (emails is None) and papers with no extracted emails (None or empty list)
print("Number of papers w/o fulltext", lm_metadata['emails'].isnull().sum())
print("Number of papers w/o emails", lm_metadata['emails'].apply(lambda x: x is None or len(x) == 0).sum())

Number of papers w/o fulltext 10
Number of papers w/o emails 2360


### Add affiliation column

In [55]:
from preprocess_utils import add_affiliation_info

lm_metadata_affil = add_affiliation_info(lm_metadata)
display(lm_metadata_affil.sample(5))

Unnamed: 0,id,authors,title,categories,abstract,versions,first_category,v1_date,LM_related_terms,mentions_LM_keyword,emails,domains,industry,academic
15477,2307.07854,"[Iman Saberi, Fatemeh Fard, Fuxiang Chen]",Multilingual Adapter-based Knowledge Aggregati...,[cs.SE],Multilingual fine-tuning (of a multilingual ...,"[{'version': 'v1', 'created': 'Sat, 15 Jul 202...",cs.SE,1689441436000,[language model],True,"[iman.saberi@ubc.ca, fatemeh.fard@ubc.ca, fuxi...","[ubc.ca, leicester.ac.uk]",False,True
447,1904.02099,"[Dan Kondratyuk, Milan Straka]","75 Languages, 1 Model: Parsing Universal Depen...","[cs.CL, cs.LG]","We present UDify, a multilingual multi-task ...","[{'version': 'v1', 'created': 'Wed, 3 Apr 2019...",cs.CL,1554310375000,[BERT],True,[dankondratyuk@gmail.com],[],False,False
7744,2205.05807,"[Patrick Wilken, Evgeny Matusov]",AppTek's Submission to the IWSLT 2022 Isometri...,[cs.CL],To participate in the Isometric Spoken Langu...,"[{'version': 'v1', 'created': 'Thu, 12 May 202...",cs.CL,1652313744000,[BERT],True,"[pwilken@apptek.com, ematusov@apptek.com]",[apptek.com],False,False
12377,2304.11852,[Didier El Baz],"Can we Trust Chatbots for now? Accuracy, repro...",[cs.CY],Large Language Models (LLM) are studied. App...,"[{'version': 'v1', 'created': 'Mon, 24 Apr 202...",cs.CY,1682319250000,"[language model, large language model, ChatGPT...",True,[elbaz@laas.fr],[laas.fr],False,False
8359,2207.01672,[Ramon Ruiz-Dolz],A Cascade Model for Argument Mining in Japanes...,[cs.CL],The rVRAIN team tackled the Budget Argument ...,"[{'version': 'v1', 'created': 'Mon, 4 Jul 2022...",cs.CL,1656960558000,[BERT],True,[raruidol@dsic.upv.es],[dsic.upv.es],False,False
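
The rules inside `add_affiliation_info` are not shown here. Judging from the sample above, the `domains` column holds the unique domains of a paper's emails, and a gmail address yields an empty list. The sketch below reproduces only that step. The provider set `GENERIC_PROVIDERS` and the function name are hypothetical, and the `industry`/`academic` classification is deliberately left out because its rules are not visible in this notebook.

```python
# Hypothetical set of generic mail providers to ignore; the real list may differ.
GENERIC_PROVIDERS = {"gmail.com", "outlook.com", "yahoo.com"}


def email_domains(emails, generic_providers=GENERIC_PROVIDERS):
    """Return unique, lowercased domains in first-seen order, skipping generic providers."""
    domains = []
    for email in emails:
        # Everything after the last '@' also works for grouped '{a, b}@domain' strings.
        domain = email.rsplit("@", 1)[-1].strip().lower()
        if domain and domain not in generic_providers and domain not in domains:
            domains.append(domain)
    return domains


example_domains = email_domains(
    ["a@cs.example.edu", "b@cs.example.edu", "{c, d}@example.ac.uk", "e@gmail.com"]
)
```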


## Save the cleaned metadata files, one for all papers and one for LLM papers only

In [63]:
lm_metadata.sample(10)[['id', 'authors', 'title', 'emails']]

Unnamed: 0,id,authors,title,emails
1809103,2303.09306,"[HAZ Sameen Shahgir, Ramisa Alam, Md. Zarif Ul...",BanglaCoNER: Towards Robust Bangla Complex Nam...,[]
1822098,2304.04052,"[Zihao Fu, Wai Lam, Qian Yu, Anthony Man-Cho S...",Decoder-Only or Encoder-Decoder? Interpreting ...,"[{zf268,nhc30}@cam.ac.uk, {wlam,manchoso}@se.c..."
1656802,2205.12390,"[Yau-Shian Wang, Yingshan Chang]",Toxicity Detection with Generative Prompt-base...,"[{yaushiaw, yingshac}@cs.cmu.edu]"
1231253,2001.05840,"[Lei Shi, Shijie Geng, Kai Shuang, Chiori Hori...",Multi-Layer Content Interaction Through Quater...,[]
1440505,2103.10360,"[Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding...",GLM: General Language Model Pretraining with A...,"[kimi_yang@rcrai.com, jietang@tsinghua.edu.cn]"
1591154,2201.05782,"[Jibao Qiu, C. L. Philip Chen, Tong Zhang]",A Novel Multi-Task Learning Method for Symboli...,"[csj.b.qiu@mail.scut.edu.cn, {philipchen, tony..."
943760,1802.04220,"[Francisco J. R. Ruiz, Michalis K. Titsias, Ad...",Augment and Reduce: Stochastic Inference for L...,"[f.ruiz@eng.cam.ac.uk, f.ruiz@columbia.edu]"
1361675,2010.05345,"[Xikun Zhang, Deepak Ramachandran, Ian Tenney,...",Do Language Embeddings Capture Scales?,"[xikunz2@cs.stanford.edu, ramachandrand@google..."
1177021,1909.06983,"[Fang Liu, Ge Li, Bolin Wei, Xin Xia, Zhiyi Fu...",A Self-Attentional Neural Architecture for Cod...,"[liufang816@pku.edu.cn, lige@pku.edu.cn, bolin..."
1874947,2307.03952,"[Yu Ji, Wen Wu, Hong Zheng, Yi Hu, Xi Chen, Li...",Is ChatGPT a Good Personality Recognizer? A Pr...,[]


In [65]:
# Save metadata to json
metadata.to_json(os.path.join(PROCESSED_DATA_DIR, f'cs_stat_metadata_{metadata_date}.json'), 
                 orient='records', lines=True)
lm_metadata.to_json(os.path.join(PROCESSED_DATA_DIR, f'lm_papers_metadata_{metadata_date}.json'), 
                    orient='records', lines=True)

In [14]:
# All metadata, without date filtering
metadata.to_json(os.path.join(PROCESSED_DATA_DIR, f'fulldaterange_cs_stat_metadata_{metadata_date}.json'), 
                 orient='records', lines=True)
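
When reading these JSON Lines files back in, two details are worth handling explicitly. In the pandas versions this sketch assumes, `to_json` writes datetimes as epoch milliseconds by default, and `read_json` may infer a numeric dtype for an `id` column whose values all look like numbers, which drops trailing zeros. The following is a minimal round-trip sketch on a made-up frame; the column names match the notebook, but the values are illustrative.

```python
import io

import pandas as pd

# Made-up frame with the same column names as the processed metadata.
toy = pd.DataFrame(
    {
        "id": ["2301.00010", "1801.00020"],
        "v1_date": pd.to_datetime(["2023-01-01 12:00:00", "2018-01-01 08:30:00"]),
    }
)

# Same call shape as the save step above, but into an in-memory string.
json_lines = toy.to_json(orient="records", lines=True)

# dtype keeps IDs as strings; convert_dates restores v1_date as a datetime column.
restored = pd.read_json(
    io.StringIO(json_lines),
    lines=True,
    dtype={"id": str},
    convert_dates=["v1_date"],
)
```

Without `convert_dates`, a column named `v1_date` would likely come back as integers, since it does not match the column-name patterns that `read_json` treats as dates by default.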