# Preprocess metadata

In this notebook, we start from the full, unprocessed arXiv metadata JSON downloaded directly from Kaggle.

Cleaning:
- Derive helper columns from the raw fields (v1_date, first_category, etc.)
- Drop unnecessary columns
- Convert each paper's author list to a list of Firstname {Middle} Lastname strings.

Subsetting:
- Only include papers since 2018
- Only include papers with at least one 'cs' or 'stat' category

LLM annotation:
- Flag each paper whose title or abstract contains an LLM-related keyword ('mentions_LM_keyword'), and store all matched keywords in a list column ('LM_related_terms')
- Create an ```lm_metadata``` sub-dataframe of the papers with mentions_LM_keyword == True, for use in downstream steps (extracting S2 info, getting fulltexts, etc.)

(Optional) Use paper fulltexts to extract affiliations for the LLM papers. These data are required for the analyses involving affiliations.
- *Outside of this notebook:* Download the fulltext PDFs from Google Cloud. The download cannot be restricted to a specific set of papers, so you need to download all papers, but you can start at 2018. For example, ```gsutil -m cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/18* /path/to/output``` downloads only the 2018 papers; use analogous commands for 2019 to the present.
- For the LLM papers, convert the PDFs to text using the pdftotext tool (the code is in this notebook).
- Parse each paper's converted .txt file for email addresses, which are used to infer affiliation, and store the emails in the metadata dataframe.

Output:
- Write the processed overall metadata dataframe and the LLM-specific metadata dataframe as JSON files to '/share/pierson/raj/LLM_bibliometrics_v2/processed_data/'

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from datetime import datetime
from tqdm import tqdm
import os
import sys

if '/home/rm868/LLM-publication-patterns/data_prep' not in sys.path:
    sys.path.append('/home/rm868/LLM-publication-patterns/data_prep')

from preprocess_utils import BASE_DATA_DIR, PROCESSED_DATA_DIR
from preprocess_utils import get_lm_terms

%load_ext autoreload
%autoreload 2

### Load + clean columns

In [7]:
# Takes about 1m30s to load all metadata
metadata_date = '20230910'
metadata_path = os.path.join(BASE_DATA_DIR, f'{metadata_date}-arxiv-metadata-oai-snapshot.json')
metadata = pd.read_json(metadata_path, lines=True)

In [3]:
metadata.sample(5)

Unnamed: 0,id,submitter,authors,title,comments,journal-ref,doi,report-no,categories,license,abstract,versions,update_date,authors_parsed
350788,1206.4695,Katherine Deck,"Katherine M. Deck, Matthew J. Holman, Eric Ago...",Rapid dynamical chaos in an exoplanetary system,"6 pages, 5 figures",,10.1088/2041-8205/755/1/L21,,astro-ph.EP,http://arxiv.org/licenses/nonexclusive-distrib...,We report on the long-term dynamical evoluti...,"[{'version': 'v1', 'created': 'Wed, 20 Jun 201...",2015-06-05,"[[Deck, Katherine M., ], [Holman, Matthew J., ..."
791355,1611.05908,Mykhailo Klymenko Dr,"M.V. Klymenko, S. Rogge and F. Remacle",Multi-valley envelope function equations and e...,,"Phys. Rev. B 92, 195302 (2015)",10.1103/PhysRevB.92.195302,,cond-mat.mes-hall,http://arxiv.org/licenses/nonexclusive-distrib...,We propose a system of real-space envelope f...,"[{'version': 'v1', 'created': 'Thu, 17 Nov 201...",2016-11-21,"[[Klymenko, M. V., ], [Rogge, S., ], [Remacle,..."
1096952,1903.04566,Mohammad Rostami,"Mohammad Rostami, Soheil Kolouri, Praveen K. P...",Complementary Learning for Overcoming Catastro...,,,,,cs.LG stat.ML,http://arxiv.org/licenses/nonexclusive-distrib...,"Despite huge success, deep networks are unab...","[{'version': 'v1', 'created': 'Mon, 11 Mar 201...",2019-06-04,"[[Rostami, Mohammad, ], [Kolouri, Soheil, ], [..."
1824055,2304.06009,Alexander Naumann,"Alexander Naumann, Felix Hertlein, Laura D\""or...",Literature Review: Computer Vision Application...,,,,,cs.CV cs.LG,http://arxiv.org/licenses/nonexclusive-distrib...,Computer vision applications in transportati...,"[{'version': 'v1', 'created': 'Wed, 12 Apr 202...",2023-06-08,"[[Naumann, Alexander, ], [Hertlein, Felix, ], ..."
2230811,math/0408225,Frederic Rochon,Frederic Rochon,Bott Periodicity for Fibred Cusp Operators,"38 pages, corrected typos",,,,math.DG math.AP,,In the framework of fibred cusp operators on...,"[{'version': 'v1', 'created': 'Tue, 17 Aug 200...",2007-05-23,"[[Rochon, Frederic, ]]"


In [4]:
metadata['categories'] = metadata['categories'].apply(lambda x: x.split(' '))
metadata['first_category'] = metadata['categories'].apply(lambda x: x[0])
metadata['authors'] = metadata['authors_parsed']
metadata['v1_date'] = metadata['versions'].apply(lambda x: x[0]['created'])
metadata['v1_date'] = pd.to_datetime(metadata['v1_date'], format='%a, %d %b %Y %H:%M:%S GMT')
metadata = metadata.drop(['submitter', 'update_date', 'authors_parsed', 'comments', 'journal-ref', 'doi', 'report-no', 'license'], axis=1)
display(metadata.sample(5))

Unnamed: 0,id,authors,title,categories,abstract,versions,first_category,v1_date
601599,1502.06939,"[[Dascaliuc, Radu, ], [Michalowski, Nicholas, ...",Symmetry Breaking and Uniqueness for the Incom...,"[math.AP, math-ph, math.MP, math.PR, physics.f...",The present article establishes connections ...,"[{'version': 'v1', 'created': 'Tue, 24 Feb 201...",math.AP,2015-02-24 20:23:27
1517639,2108.08824,"[[Zubida, Assaf, ], [Yitzhaki, Elad, ], [Lindn...",Optimal short-time measurements for Hamiltonia...,"[quant-ph, cond-mat.quant-gas, cond-mat.str-el]",Characterizing noisy quantum devices require...,"[{'version': 'v1', 'created': 'Thu, 19 Aug 202...",quant-ph,2021-08-19 17:48:48
1911592,astro-ph/0003231,"[[Avelino, P. P., , CAUP, Porto], [Martins, C....",Topological defects: fossils of an anisotropic...,[astro-ph],We consider the evolution of domain walls pr...,"[{'version': 'v1', 'created': 'Thu, 16 Mar 200...",astro-ph,2000-03-16 09:58:37
1446981,2103.16836,"[[Courteille, Hermann, , LISTIC], [Benoît, A.,...",Channel-Based Attention for LCC Using Sentinel...,"[cs.CV, cs.LG, cs.NE, eess.IV]",Deep Neural Networks (DNNs) are getting incr...,"[{'version': 'v1', 'created': 'Wed, 31 Mar 202...",cs.CV,2021-03-31 06:24:15
962721,1804.01028,"[[Tourigny-Plante, Alex, ], [Michaud-Belleau, ...",An open and flexible digital phase-locked loop...,[eess.SP],This paper presents an open and flexible dig...,"[{'version': 'v1', 'created': 'Tue, 3 Apr 2018...",eess.SP,2018-04-03 15:16:43
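
To make the column derivations above concrete, here is a minimal sketch that applies the same steps to a single hand-written record in plain Python. The record values are illustrative rather than taken from the dataset, and `derive_fields` is a hypothetical helper, not part of `preprocess_utils`.

```python
from datetime import datetime

# Illustrative record in the shape of one raw metadata row (values are made up).
record = {
    'categories': 'cs.LG stat.ML',
    'versions': [
        {'version': 'v1', 'created': 'Wed, 3 Jan 2018 10:00:00 GMT'},
        {'version': 'v2', 'created': 'Fri, 5 Jan 2018 12:30:00 GMT'},
    ],
}

def derive_fields(rec):
    """Hypothetical helper mirroring the column operations in the cell above."""
    categories = rec['categories'].split(' ')
    # Assumes the versions list is ordered with v1 first, as the notebook cell does.
    v1_created = rec['versions'][0]['created']
    return {
        'categories': categories,
        'first_category': categories[0],
        'v1_date': datetime.strptime(v1_created, '%a, %d %b %Y %H:%M:%S GMT'),
    }

derived = derive_fields(record)
print(derived['first_category'], derived['v1_date'])
```

Note that the parsed timestamp is timezone-naive, which matches how the notebook later compares `v1_date` against a naive `datetime(2018, 1, 1)`.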


### Subset + fix author names

In [5]:
# Subset the dataframe, keeping only papers that:
# - Were posted on or after 2018-01-01
# - Have at least one category that starts with 'cs.' or 'stat.'

metadata = metadata[metadata['v1_date'] >= datetime(2018, 1, 1, 0, 0, 0)]
metadata = metadata[metadata['categories'].apply(lambda x: any(c.startswith(('cs.', 'stat.')) for c in x))]

display(metadata)

Unnamed: 0,id,authors,title,categories,abstract,versions,first_category,v1_date
929308,1801.00377,"[[Shalaby, Walid, ], [AlAila, BahaaEddin, ], [...",Help Me Find a Job: A Graph-based Approach for...,"[cs.IR, cs.SI]",Online job boards are one of the central com...,"[{'version': 'v1', 'created': 'Mon, 1 Jan 2018...",cs.IR,2018-01-01 00:47:44
929311,1801.00380,"[[Kang, Xiaoning, ], [Deng, Xinwei, ]]",On Variable Ordination of Modified Cholesky De...,"[math.ST, stat.TH]",Estimation of large sparse covariance matric...,"[{'version': 'v1', 'created': 'Mon, 1 Jan 2018...",math.ST,2018-01-01 01:33:54
929313,1801.00382,"[[Cheng, Yu-Hsiang, ], [Huang, Tzee-Ming, ], [...",A clustering method for misaligned curves,[stat.ME],We consider the problem of clustering misali...,"[{'version': 'v1', 'created': 'Mon, 1 Jan 2018...",stat.ME,2018-01-01 02:14:08
929315,1801.00384,"[[Najafi, Mehrnaz, ], [He, Lifang, ], [Yu, Phi...",Error-Robust Multi-View Clustering,[cs.LG],"In the era of big data, data may come from m...","[{'version': 'v1', 'created': 'Mon, 1 Jan 2018...",cs.LG,2018-01-01 02:42:04
929316,1801.00385,"[[Desai, Ruta, ], [Li, Beichen, ], [Yuan, Ye, ...",Interactive Co-Design of Form and Function for...,[cs.RO],Our goal is to make robotics more accessible...,"[{'version': 'v1', 'created': 'Mon, 1 Jan 2018...",cs.RO,2018-01-01 03:19:01
...,...,...,...,...,...,...,...,...
1908703,2309.03899,"[[Lamdouar, Hala, ], [Xie, Weidi, ], [Zisserma...",The Making and Breaking of Camouflage,[cs.CV],"Not all camouflages are equally effective, a...","[{'version': 'v1', 'created': 'Thu, 7 Sep 2023...",cs.CV,2023-09-07 17:58:05
1908704,2309.03900,"[[Chen, Su-Kai, ], [Yen, Hung-Lin, ], [Liu, Yu...",Learning Continuous Exposure Value Representat...,"[eess.IV, cs.CV]",Deep learning is commonly used to reconstruc...,"[{'version': 'v1', 'created': 'Thu, 7 Sep 2023...",eess.IV,2023-09-07 17:59:03
1908707,2309.03903,"[[Cheng, Ho Kei, ], [Oh, Seoung Wug, ], [Price...",Tracking Anything with Decoupled Video Segment...,[cs.CV],Training data for video segmentation are exp...,"[{'version': 'v1', 'created': 'Thu, 7 Sep 2023...",cs.CV,2023-09-07 17:59:41
1908708,2309.03904,"[[Zhu, Jiapeng, ], [Yang, Ceyuan, ], [Zheng, K...",Exploring Sparse MoE in GANs for Text-conditio...,[cs.CV],"Due to the difficulty in scaling up, generat...","[{'version': 'v1', 'created': 'Thu, 7 Sep 2023...",cs.CV,2023-09-07 17:59:43


In [None]:
from preprocess_utils import fix_author_name_list

metadata['authors'] = metadata['authors'].apply(fix_author_name_list)
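
The implementation of `fix_author_name_list` is not shown in this notebook. As a rough sketch of the intended transformation, assuming each `authors_parsed` entry starts with the last name followed by the first and middle names (as the sample rows above suggest), a hypothetical version could look like this. Elements after the first two are ignored here.

```python
def format_author(parsed):
    """Hypothetical sketch: turn [last, first_and_middle, ...] into 'First Middle Last'."""
    last, first = parsed[0], parsed[1]
    # Skip empty pieces so that single-name authors do not get stray spaces.
    return ' '.join(part.strip() for part in (first, last) if part and part.strip())

def format_author_list(authors_parsed):
    """Apply format_author to every entry of one paper's author list."""
    return [format_author(entry) for entry in authors_parsed]

# Entries in the shape shown in the sample output above.
example = [['Deck', 'Katherine M.', ''], ['Avelino', 'P. P.', '', 'CAUP, Porto']]
print(format_author_list(example))
```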

### Annotate whether LLM paper

In [11]:
# Add column for list of LLM terms in title or abstract
metadata['LM_related_terms'] = metadata['title'].apply(get_lm_terms) + metadata['abstract'].apply(get_lm_terms)

# Remove duplicate entries, i.e. apply list(set()) to each list
metadata['LM_related_terms'] = metadata['LM_related_terms'].apply(lambda x: list(set(x)))
metadata['mentions_LM_keyword'] = metadata['LM_related_terms'].apply(lambda x: len(x) > 0)
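
The keyword list and matching rules live in `get_lm_terms`, which is not shown here. The following is only a sketch of one way such matching could work, with a hypothetical keyword list: phrases are matched case-insensitively, while model names are matched case-sensitively as whole words.

```python
import re

# Hypothetical keyword lists; the real ones are defined in preprocess_utils.get_lm_terms.
PHRASES = ['language model', 'large language model']  # matched case-insensitively
MODEL_NAMES = ['BERT', 'ChatGPT']                      # matched case-sensitively, whole word

def find_lm_terms(text):
    """Return the keywords found in text, in list order (may be empty)."""
    found = []
    lowered = text.lower()
    for phrase in PHRASES:
        if phrase in lowered:
            found.append(phrase)
    for name in MODEL_NAMES:
        if re.search(r'\b' + re.escape(name) + r'\b', text):
            found.append(name)
    return found

# Illustrative title and abstract (made up).
title = 'A study of neural language models'
abstract = 'We fine-tune BERT and compare it with a large language model.'

# Same combine-then-deduplicate logic as the cell above; sorted for a stable order.
terms = sorted(set(find_lm_terms(title) + find_lm_terms(abstract)))
print(terms, len(terms) > 0)
```

Sorting after `set()` is a design choice of this sketch: it makes the stored list deterministic across runs, whereas `list(set(x))` alone does not guarantee an order.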

In [19]:
lm_metadata = metadata[metadata['mentions_LM_keyword']].copy()
display(lm_metadata.sample(5), lm_metadata.shape)

Unnamed: 0,id,authors,title,categories,abstract,versions,first_category,v1_date,LM_related_terms,mentions_LM_keyword
1348746,2009.07243,"[Moin Nadeem, Tianxing He, Kyunghyun Cho, Jame...",A Systematic Characterization of Sampling Algo...,"[cs.CL, cs.AI, cs.LG]",This work studies the widely adopted ancestr...,"[{'version': 'v1', 'created': 'Tue, 15 Sep 202...",cs.CL,2020-09-15 17:28:42,[language model],True
1457062,2104.09644,"[Bhavani Singh Agnikula Kshatriya, Nicolas A N...",Neural Language Models with Distant Supervisio...,"[cs.CL, cs.AI, cs.IR]",Major depressive disorder (MDD) is a prevale...,"[{'version': 'v1', 'created': 'Mon, 19 Apr 202...",cs.CL,2021-04-19 21:11:41,"[language model, BERT]",True
1778453,2301.07389,"[Roy Ganz, Oren Nuriel, Aviad Aberdam, Yair Ki...",Towards Models that Can See and Read,"[cs.CV, cs.LG]",Visual Question Answering (VQA) and Image Ca...,"[{'version': 'v1', 'created': 'Wed, 18 Jan 202...",cs.CV,2023-01-18 09:36:41,[language model],True
1334130,2008.06408,"[Juan Manuel Pérez, Aymé Arango, Franco Luque]",ANDES at SemEval-2020 Task 12: A jointly-train...,[cs.CL],This paper describes our participation in Se...,"[{'version': 'v1', 'created': 'Thu, 13 Aug 202...",cs.CL,2020-08-13 16:07:00,[BERT],True
1789909,2302.04975,"[Nay San, Martijn Bartelds, Blaine Billings, E...",Leveraging supplementary text data to kick-sta...,[cs.CL],Recent research using pre-trained transforme...,"[{'version': 'v1', 'created': 'Thu, 9 Feb 2023...",cs.CL,2023-02-09 23:30:49,[language model],True


(16979, 10)

## Get emails from fulltexts to determine affiliations

### Convert downloaded PDFs to fulltexts

The paper PDFs must be downloaded from GCP before running these cells.
The cells only need to be run once to produce a converted .txt file for each paper.

In [None]:
from preprocess_utils import get_paper_filenames

# Create a list of filenames in the correct format for all the papers in lm_metadata
paper_filenames = lm_metadata.apply(get_paper_filenames, axis=1)
print(paper_filenames[:5])

In [None]:
from preprocess_utils import PAPER_PDF_DIR
LLM_FULLTEXT_DIR = os.path.join(BASE_DATA_DIR, 'llm_paper_fulltexts')

# Make all the monthyear subdirs in the destination directory
all_monthyears = lm_metadata['id'].apply(lambda x: x.split('.')[0]).unique()
for monthyear in all_monthyears:
    os.makedirs(f'{LLM_FULLTEXT_DIR}/{monthyear}', exist_ok=True)

# Create a list of PDF and text file paths
pdf_paths = []
txt_paths = []

# Build the source PDF path and destination .txt path for each (filename, monthyear) pair
for filename, monthyear in paper_filenames:
    pdf_paths.append(f'{PAPER_PDF_DIR}/{monthyear}/{filename}.pdf')
    txt_paths.append(f'{LLM_FULLTEXT_DIR}/{monthyear}/{filename}.txt')

In [None]:
from preprocess_utils import convert_pdfs_in_parallel

# Call the function to convert PDFs to text
# This took 18 minutes, but may take longer with fewer threads
convert_pdfs_in_parallel(pdf_paths, txt_paths)

print("All LLM paper PDFs converted to txt files.")
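
The body of `convert_pdfs_in_parallel` is not shown in this notebook. A minimal sketch of the general approach, assuming the `pdftotext` command-line tool is installed and using a thread pool, might look like the following. The names `run_pdftotext` and `convert_all` and the `max_workers` default are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_pdftotext(pdf_path, txt_path):
    """Convert one PDF with the pdftotext CLI; return True on success."""
    try:
        result = subprocess.run(['pdftotext', pdf_path, txt_path],
                                capture_output=True)
    except OSError:
        # Raised when the pdftotext executable cannot be started.
        return False
    return result.returncode == 0

def convert_all(pdf_paths, txt_paths, convert=run_pdftotext, max_workers=8):
    """Hypothetical analogue of convert_pdfs_in_parallel.

    Runs `convert` on each (pdf, txt) pair in a thread pool and returns
    the list of PDF paths whose conversion failed.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(convert, pdf_paths, txt_paths))
    return [pdf for pdf, ok in zip(pdf_paths, results) if not ok]
```

Threads are sufficient in this sketch because each worker spends its time waiting on an external process. Returning the failed paths makes it easy to see which papers end up without a fulltext.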

### Extract emails from fulltexts

In [59]:
%load_ext autoreload
%autoreload 2
from preprocess_utils import get_all_emails

fulltext_dir = os.path.join(BASE_DATA_DIR, 'llm_paper_fulltexts')

# Add an 'emails' column by calling get_all_emails on each row of the dataframe
n_lines_to_check = 100
lm_metadata['emails'] = None
for i, row in tqdm(lm_metadata.iterrows(), miniters=1000):
    lm_metadata.at[i, 'emails'] = get_all_emails(row, fulltext_dir, n_lines_to_check)


16979it [00:25, 662.13it/s]
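
`get_all_emails` is defined in `preprocess_utils` and is not reproduced here. As an illustration of the idea, the sketch below scans only the first `n_lines_to_check` lines of a paper's text with a regular expression. The pattern and the `find_emails` name are assumptions. The pattern also accepts grouped addresses of the form `{a,b}@host`, since such strings appear in the extracted emails later in this notebook.

```python
import re

# Matches plain addresses (name@host.tld) and grouped ones ({a,b}@host.tld).
EMAIL_RE = re.compile(
    r'(?:\{[^{}@]+\}|[A-Za-z0-9._%+-]+)@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+'
)

def find_emails(lines, n_lines_to_check=100):
    """Hypothetical sketch: collect address-like strings from the first lines."""
    emails = []
    for line in lines[:n_lines_to_check]:
        for match in EMAIL_RE.findall(line):
            if match not in emails:  # keep first occurrence, preserve order
                emails.append(match)
    return emails

# Illustrative first lines of a converted paper (made up).
lines = [
    'A Paper Title',
    'Ada Lovelace, Alan Turing',
    '{ada, alan}@cs.example.edu',
    'Contact: grace@example.com.',
    'Abstract',
]
print(find_emails(lines))
```

Restricting the search to the top of the file is a heuristic: author contact details usually appear near the title, and stopping early avoids picking up addresses quoted in the body or references.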


In [60]:
# Count papers whose 'emails' entry is None (no fulltext) and papers with no emails at all
# (the second count includes the papers without fulltext)
print("Number of papers w/o fulltext", lm_metadata['emails'].isnull().sum())
print("Number of papers w/o emails", lm_metadata['emails'].apply(lambda x: x is None or len(x) == 0).sum())

Number of papers w/o fulltext 10
Number of papers w/o emails 2360


### Add affiliation column

In [55]:
from preprocess_utils import add_affiliation_info

lm_metadata_affil = add_affiliation_info(lm_metadata)
display(lm_metadata_affil.sample(5))

Unnamed: 0,id,authors,title,categories,abstract,versions,first_category,v1_date,LM_related_terms,mentions_LM_keyword,emails,domains,industry,academic
15477,2307.07854,"[Iman Saberi, Fatemeh Fard, Fuxiang Chen]",Multilingual Adapter-based Knowledge Aggregati...,[cs.SE],Multilingual fine-tuning (of a multilingual ...,"[{'version': 'v1', 'created': 'Sat, 15 Jul 202...",cs.SE,1689441436000,[language model],True,"[iman.saberi@ubc.ca, fatemeh.fard@ubc.ca, fuxi...","[ubc.ca, leicester.ac.uk]",False,True
447,1904.02099,"[Dan Kondratyuk, Milan Straka]","75 Languages, 1 Model: Parsing Universal Depen...","[cs.CL, cs.LG]","We present UDify, a multilingual multi-task ...","[{'version': 'v1', 'created': 'Wed, 3 Apr 2019...",cs.CL,1554310375000,[BERT],True,[dankondratyuk@gmail.com],[],False,False
7744,2205.05807,"[Patrick Wilken, Evgeny Matusov]",AppTek's Submission to the IWSLT 2022 Isometri...,[cs.CL],To participate in the Isometric Spoken Langu...,"[{'version': 'v1', 'created': 'Thu, 12 May 202...",cs.CL,1652313744000,[BERT],True,"[pwilken@apptek.com, ematusov@apptek.com]",[apptek.com],False,False
12377,2304.11852,[Didier El Baz],"Can we Trust Chatbots for now? Accuracy, repro...",[cs.CY],Large Language Models (LLM) are studied. App...,"[{'version': 'v1', 'created': 'Mon, 24 Apr 202...",cs.CY,1682319250000,"[language model, large language model, ChatGPT...",True,[elbaz@laas.fr],[laas.fr],False,False
8359,2207.01672,[Ramon Ruiz-Dolz],A Cascade Model for Argument Mining in Japanes...,[cs.CL],The rVRAIN team tackled the Budget Argument ...,"[{'version': 'v1', 'created': 'Mon, 4 Jul 2022...",cs.CL,1656960558000,[BERT],True,[raruidol@dsic.upv.es],[dsic.upv.es],False,False
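
The rules used by `add_affiliation_info` are not shown in this notebook. The sketch below only illustrates the general shape of such a step: extract the domain from each email, drop public email providers, and look the remaining domains up against reference lists. All three lookup collections are hypothetical placeholders, so this sketch will not reproduce the flags in the table above.

```python
# Hypothetical lookup collections; the real lists are not shown in this notebook.
PUBLIC_PROVIDERS = {'gmail.com', 'outlook.com', 'qq.com'}
ACADEMIC_SUFFIXES = ('.edu', '.ac.uk', '.edu.cn')
INDUSTRY_DOMAINS = {'example-corp.com'}

def email_domains(emails):
    """Return the unique non-public domains of a list of emails, in order."""
    domains = []
    for email in emails:
        # Splitting on the last '@' also handles grouped forms like {a,b}@host.
        domain = email.rsplit('@', 1)[-1].lower()
        if domain not in PUBLIC_PROVIDERS and domain not in domains:
            domains.append(domain)
    return domains

def affiliation_flags(domains):
    """Return (industry, academic) booleans for one paper's domains."""
    industry = any(d in INDUSTRY_DOMAINS for d in domains)
    academic = any(d.endswith(ACADEMIC_SUFFIXES) for d in domains)
    return industry, academic

# Illustrative emails (made up).
emails = ['ada@cs.example.edu', 'alan@gmail.com', '{grace,linus}@example-corp.com']
domains = email_domains(emails)
print(domains, affiliation_flags(domains))
```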


## Save the cleaned metadata files, one for all papers and one for LLM papers only

In [63]:
lm_metadata.sample(10)[['id', 'authors', 'title', 'emails']]

Unnamed: 0,id,authors,title,emails
1809103,2303.09306,"[HAZ Sameen Shahgir, Ramisa Alam, Md. Zarif Ul...",BanglaCoNER: Towards Robust Bangla Complex Nam...,[]
1822098,2304.04052,"[Zihao Fu, Wai Lam, Qian Yu, Anthony Man-Cho S...",Decoder-Only or Encoder-Decoder? Interpreting ...,"[{zf268,nhc30}@cam.ac.uk, {wlam,manchoso}@se.c..."
1656802,2205.1239,"[Yau-Shian Wang, Yingshan Chang]",Toxicity Detection with Generative Prompt-base...,"[{yaushiaw, yingshac}@cs.cmu.edu]"
1231253,2001.0584,"[Lei Shi, Shijie Geng, Kai Shuang, Chiori Hori...",Multi-Layer Content Interaction Through Quater...,[]
1440505,2103.1036,"[Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding...",GLM: General Language Model Pretraining with A...,"[kimi_yang@rcrai.com, jietang@tsinghua.edu.cn]"
1591154,2201.05782,"[Jibao Qiu, C. L. Philip Chen, Tong Zhang]",A Novel Multi-Task Learning Method for Symboli...,"[csj.b.qiu@mail.scut.edu.cn, {philipchen, tony..."
943760,1802.0422,"[Francisco J. R. Ruiz, Michalis K. Titsias, Ad...",Augment and Reduce: Stochastic Inference for L...,"[f.ruiz@eng.cam.ac.uk, f.ruiz@columbia.edu]"
1361675,2010.05345,"[Xikun Zhang, Deepak Ramachandran, Ian Tenney,...",Do Language Embeddings Capture Scales?,"[xikunz2@cs.stanford.edu, ramachandrand@google..."
1177021,1909.06983,"[Fang Liu, Ge Li, Bolin Wei, Xin Xia, Zhiyi Fu...",A Self-Attentional Neural Architecture for Cod...,"[liufang816@pku.edu.cn, lige@pku.edu.cn, bolin..."
1874947,2307.03952,"[Yu Ji, Wen Wu, Hong Zheng, Yi Hu, Xi Chen, Li...",Is ChatGPT a Good Personality Recognizer? A Pr...,[]


In [65]:
# Save metadata to json
metadata.to_json(os.path.join(PROCESSED_DATA_DIR, f'cs_stat_metadata_{metadata_date}.json'), 
                 orient='records', lines=True)
lm_metadata.to_json(os.path.join(PROCESSED_DATA_DIR, f'lm_papers_metadata_{metadata_date}.json'), 
                    orient='records', lines=True)
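
When these files are reloaded later, note that `pd.read_json` infers dtypes by default, so an `id` column that contains only numeric-looking strings may be read back as floats and lose trailing zeros. A minimal sketch of a reload that keeps ids as strings, using made-up rows and an in-memory buffer instead of the real files:

```python
import io
import pandas as pd

# Illustrative rows (made up); both ids deliberately end in a zero.
df = pd.DataFrame({'id': ['1801.00010', '1801.00100'],
                   'title': ['Paper A', 'Paper B']})

# Same writer options as the cell above, but written to a string instead of a file.
json_lines = df.to_json(orient='records', lines=True)

# dtype={'id': str} asks pandas not to infer a numeric type for the id column.
reloaded = pd.read_json(io.StringIO(json_lines), lines=True, dtype={'id': str})
print(reloaded['id'].tolist())
```

The same `dtype` argument can be passed when reading the saved files by path.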