In [1]:
import pandas as pd

In [2]:
papers = pd.read_csv('papers.csv')

In [3]:
papers.shape

(20286, 3)

In [4]:
papers.head()

Unnamed: 0,year,name,text
0,1987,Bit-Serial Neural Networks,573 \n\nBIT - SERIAL NEURAL NETWORKS \n\nAlan...
1,1987,Connectivity Versus Entropy,1 \n\nCONNECTIVITY VERSUS ENTROPY \n\nYaser S...
2,1987,The Hopfield Model with Multi-Level Neurons,278 \n\nTHE HOPFIELD MODEL WITH MUL TI-LEVEL N...
3,1987,How Neural Nets Work,442 \n\nAlan Lapedes \nRobert Farber \n\nThe...
4,1987,Spatial Organization of Neural Networks: A Pro...,740 \n\nSPATIAL ORGANIZATION OF NEURAL NEn...


In [5]:
papers.tail()

Unnamed: 0,year,name,text
20281,2023,Optimal testing using combined test statistics...,Optimal testing using combined test statistics...
20282,2023,Regret-Optimal Model-Free Reinforcement Learni...,Regret-Optimal Model-Free Reinforcement Learni...
20283,2023,Convolutional State Space Models for Long-Rang...,Convolutional State Space Models for\nLong-Ran...
20284,2023,"CRoSS: Diffusion Model Makes Controllable, Rob...","CRoSS: Diffusion Model Makes\nControllable, Ro..."
20285,2023,American Stories: A Large-Scale Structured Tex...,American Stories: A Large-Scale Structured Tex...


Papers from 1987-2019 were downloaded as plaintext directly from the NeurIPS conference website. Papers from 2020 onward were downloaded as PDFs and then converted to plaintext using PyMuPDF. This may have resulted in some formatting differences. Let's take a look at the plaintext of a few papers.

In [12]:
# Let's take a look at the last paper in the dataset - one from 2023. This paper is representative of papers that
# were converted from PDFs

print(papers.loc[20285, 'text'])

American Stories: A Large-Scale Structured Text
Dataset of Historical U.S. Newspapers
Melissa Dell1,2∗, Jacob Carlson1, Tom Bryan1, Emily Silcock1, Abhishek Arora1, Zejiang Shen3,
Luca D’Amico-Wong1, Quan Le4, Pablo Querubin2,5, Leander Heldring6
1Harvard University; Cambridge, MA, USA.
2National Bureau of Economic Research; Cambridge, MA, USA.
3Massachusetts Institute of Technology; Cambridge, MA, USA.
4Princeton University; Princeton, NJ, USA.
5New York University; New York, NY, USA.
6Kellogg School of Management, Northwestern University, Evanston, IL, USA.
∗Corresponding author: melissadell@fas.harvard.edu.
Abstract
Existing full text datasets of U.S. public domain newspapers do not recognize the
often complex layouts of newspaper scans, and as a result the digitized content
scrambles texts from articles, headlines, captions, advertisements, and other lay-
out regions. OCR quality can also be low. This study develops a novel, deep learn-
ing pipeline for extracting full article text

Comparing the plaintext to the PDF, the following elements did not carry over well:

 - images/figures - the image and figure titles made it into the plaintext but, as expected, the images themselves did not. In the RAG response, we'll have to decide whether to display the rendered images to the user (most likely relying on the PDF document itself, which might require PDF manipulation if we only want to display the part containing the image) _or_ whether to leave images out. Leaving images out could be viable, since we're planning to provide access to the papers involved in a given RAG response regardless.
 - tables - tables and their data are carried over into plaintext, but not in tabular format. This raises the question of how to handle them, both from an embedding perspective (would including tabular data add extra "noise" to the embedding?) and in terms of whether they should be displayed in the RAG response (similar to the question for images).
 - equations - similar to tables, equations are carried over into plaintext but lose their formatting. This makes them hard for a human to read, and equations are a niche notation that might not embed well. So the question of whether to include them in the embedding and the display applies to equations as well.

In [18]:
# Now let's take a look at the first paper in our dataset - one from 1987. This paper's plaintext was provided by
# the NeurIPS website directly.

print(papers.loc[0, 'text'])

573 

BIT - SERIAL NEURAL  NETWORKS 

Alan F.  Murray,  Anthony V . W.  Smith  and Zoe F.  Butler. 

Department of Electrical Engineering,  University of Edinburgh, 

The King's Buildings, Mayfield Road,  Edinburgh, 

Scotland,  EH93JL. 

ABSTRACT 

A  bit  - serial  VLSI  neural  network  is  described  from  an  initial  architecture  for  a 
synapse array through to silicon layout and board design.  The issues surrounding bit 
- serial  computation,  and  analog/digital  arithmetic  are  discussed  and  the  parallel 
development  of  a  hybrid  analog/digital  neural  network  is  outlined.  Learning  and 
recall  capabilities  are  reported  for  the  bit  - serial  network  along  with  a  projected 
specification  for  a  64  - neuron,  bit  - serial  board  operating  at 20 MHz.  This tech(cid:173)
nique  is  extended  to  a  256  (2562  synapses)  network  with  an  update  time  of 3ms, 
using  a  "paging"  technique  to  time  - multiplex  calculations  through  the  synapse

Here we can see that the formatting is slightly different from the PDF-converted papers. There is a bit more whitespace, and page numbers are present. Outside of that, the same challenges regarding images and equations are present.
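
The whitespace and page-number differences noted above could be smoothed out with a small normalization pass. The following is a minimal sketch, not a tested pipeline step: `normalize_plaintext` is a hypothetical helper, and the rules it applies are assumptions based only on the sample printed above (runs of spaces, a bare page number on its own line, and the `(cid:173)` marker that appears where a word was hyphenated across a line break). Dropping every digits-only line is a blunt heuristic that could also remove table cells, so it would need checking against more papers before use.

```python
import re


def normalize_plaintext(text: str) -> str:
    """Lightly normalize one paper's plaintext (hypothetical helper, heuristic rules)."""
    # Remove "(cid:173)" markers together with the line break that follows them,
    # which rejoins words such as "tech(cid:173)\nnique" into "technique".
    text = re.sub(r"\(cid:173\)\s*", "", text)

    # Collapse runs of spaces/tabs inside each line and trim the line ends.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]

    # ASSUMPTION: a line holding only 1-4 digits is a page number.
    lines = [line for line in lines if not re.fullmatch(r"\d{1,4}", line)]

    # Collapse three or more consecutive newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return text.strip()


sample = "573 \n\nBIT - SERIAL NEURAL  NETWORKS \n\nThis tech(cid:173)\nnique  is  extended"
cleaned = normalize_plaintext(sample)
print(cleaned)
```

In the notebook this could be applied to the older papers with something like `papers['text'].map(normalize_plaintext)`, keeping the original column around for comparison.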

In [19]:
# Now let's take a look at another paper from before 2020, where the plaintext was provided by the NeurIPS website
# directly. This paper was manually selected because it contains images, equations, and tables.

paper = papers[papers['name'] == 'Probabilistic Inference of Hand Motion from Neural Activity in Motor Cortex']
print(paper['text'].iloc[0])

Probabilistic Inference of Hand Motion from Neural

Activity in Motor Cortex

Y. Gao M. J. Black

E. Bienenstock

S. Shoham

J. P. Donoghue

 Division of Applied Mathematics, Brown University, Providence, RI 02912

 Dept. of Computer Science, Brown University, Box 1910, Providence, RI 02912

 Princeton University, Dept. of Molecular Biology Princeton, NJ, 08544

 Dept. of Neuroscience, Brown University, Providence, RI 02912

gao@cfm.brown.edu, black@cs.brown.edu, elie@dam.brown.edu,

sshoham@princeton.com, john donoghue@brown.edu

Abstract

Statistical learning and probabilistic inference techniques are used to in-
fer the hand position of a subject from multi-electrode recordings of neu-
ral activity in motor cortex. First, an array of electrodes provides train-
ing data of neural ﬁring conditioned on hand kinematics. We learn a non-
parametric representation of this ﬁring activity using a Bayesian model
and rigorously compare it with previous models using cross-validation.
Se

Page numbers are absent from this paper, so they appear to have been dropped from the plaintext at some point after the conference's early years. Additionally, the table comes across much as it does in the PDF-converted papers: it contains all the data but is not laid out as a table.

Given that there may be critical information in the images, tables, and equations of these papers, it might be worthwhile exploring the use of PyMuPDF's Markdown formatter - which converts PDFs into Markdown. This might aid in preserving some of the information. It also might make it easier to parse the papers into chunks - if we decide that would improve the RAG system.
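
To illustrate why Markdown output might make chunking easier, here is a minimal sketch that splits a Markdown string into one chunk per heading. It is an assumption that the converted papers would mark section titles with `#`-style headings; that needs to be verified on real conversions. `split_markdown_by_heading` is a hypothetical helper, not part of PyMuPDF.

```python
import re


def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading (hypothetical helper)."""
    chunks = []
    # Text that appears before the first heading is filed under "preamble".
    title, body = "preamble", []

    def flush():
        # Only keep sections that contain some non-blank text.
        if any(line.strip() for line in body):
            chunks.append({"title": title, "text": "\n".join(body).strip()})

    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()
            title, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


example = "# Title\n\nAuthors\n\n## Abstract\n\nSome text.\n\n## 1 Introduction\n\nMore text."
sections = split_markdown_by_heading(example)
print([section["title"] for section in sections])
```

This sketch treats any line starting with `#` as a heading, so it would misfire on `#` comments inside fenced code blocks. Whether section-level chunks are the right granularity for retrieval is a separate question to test.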

Also, since the plaintext of papers from before 2020 (provided directly by the NeurIPS website) and papers from 2020 onward (converted from PDFs using PyMuPDF) is formatted differently, it might be worthwhile to use the same PDF conversion process for all papers to make the formatting more standardized. That said, the underlying paper PDFs might not have consistent formatting from year to year.

Of note, all of these would be preprocessing steps, which only happen once. That gives us a bit more leeway as far as scalability is concerned (i.e., adding more preprocessing steps may make the preprocessing pipeline take longer to run, but since it's only run once or on a batch basis, that might be worth it if it makes the RAG responses better).
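
One way to make the "run once" property concrete is to cache the preprocessed output on disk and reuse it on later runs. The sketch below is only illustrative: `preprocess_once`, the JSON cache file, and the stand-in `slow_step` function are all hypothetical names, and it assumes the processed texts fit comfortably in memory.

```python
import json
import tempfile
from pathlib import Path


def preprocess_once(texts, preprocess, cache_path):
    """Run `preprocess` over `texts` once, then reuse the cached result (hypothetical helper)."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # A previous run already did the expensive work, so just load it.
        return json.loads(cache_path.read_text(encoding="utf-8"))
    processed = [preprocess(text) for text in texts]
    cache_path.write_text(json.dumps(processed), encoding="utf-8")
    return processed


calls = []


def slow_step(text):
    # Stand-in for an expensive preprocessing step; records each call.
    calls.append(text)
    return text.strip()


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "processed.json"
    first = preprocess_once(["  a ", " b"], slow_step, cache_file)
    second = preprocess_once(["  a ", " b"], slow_step, cache_file)

print(first, second, len(calls))
```

The second call reads the cache instead of invoking `slow_step` again. Note that the cache is not invalidated when the input texts or the preprocessing function change, so the file would need to be deleted manually after either is updated.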