# Data Exploration

This notebook explores two aspects of the data:
- The formatting of raw text in the papers
- The number of tokens in the papers

## Formatting of text

### Basic exploration

In [1]:
import pandas as pd

In [2]:
papers = pd.read_csv('papers.csv')

In [3]:
papers.shape

(20286, 3)

In [4]:
papers.head()

Unnamed: 0,year,name,text
0,1987,Bit-Serial Neural Networks,573 \n\nBIT - SERIAL NEURAL NETWORKS \n\nAlan...
1,1987,Connectivity Versus Entropy,1 \n\nCONNECTIVITY VERSUS ENTROPY \n\nYaser S...
2,1987,The Hopfield Model with Multi-Level Neurons,278 \n\nTHE HOPFIELD MODEL WITH MUL TI-LEVEL N...
3,1987,How Neural Nets Work,442 \n\nAlan Lapedes \nRobert Farber \n\nThe...
4,1987,Spatial Organization of Neural Networks: A Pro...,740 \n\nSPATIAL ORGANIZATION OF NEURAL NEn...


In [5]:
papers.tail()

Unnamed: 0,year,name,text
20281,2023,Optimal testing using combined test statistics...,Optimal testing using combined test statistics...
20282,2023,Regret-Optimal Model-Free Reinforcement Learni...,Regret-Optimal Model-Free Reinforcement Learni...
20283,2023,Convolutional State Space Models for Long-Rang...,Convolutional State Space Models for\nLong-Ran...
20284,2023,"CRoSS: Diffusion Model Makes Controllable, Rob...","CRoSS: Diffusion Model Makes\nControllable, Ro..."
20285,2023,American Stories: A Large-Scale Structured Tex...,American Stories: A Large-Scale Structured Tex...


### Investigating Formatting of Papers in Plaintext

Papers from 1987-2019 were downloaded as plaintext directly from the NeurIPS conference website. Papers from 2020 onward were downloaded as PDFs and then converted to plaintext using PyMuPDF. This may have resulted in some formatting differences. Let's take a look at the plaintext of a few papers.

#### PDF-Converted Papers

Let's take a look at the last paper in the dataset - one from 2023. This paper is representative of papers that were converted from PDFs. The PDF for this paper can be viewed [here](https://papers.nips.cc/paper_files/paper/2023/file/ffeb860479ccae44d84c0de32acd693d-Paper-Datasets_and_Benchmarks.pdf).

In [6]:
text = papers.loc[len(papers)-1, 'text']

Let's first just take a look at the general format of the paper.

In [7]:
print(text[:1000])

American Stories: A Large-Scale Structured Text
Dataset of Historical U.S. Newspapers
Melissa Dell1,2∗, Jacob Carlson1, Tom Bryan1, Emily Silcock1, Abhishek Arora1, Zejiang Shen3,
Luca D’Amico-Wong1, Quan Le4, Pablo Querubin2,5, Leander Heldring6
1Harvard University; Cambridge, MA, USA.
2National Bureau of Economic Research; Cambridge, MA, USA.
3Massachusetts Institute of Technology; Cambridge, MA, USA.
4Princeton University; Princeton, NJ, USA.
5New York University; New York, NY, USA.
6Kellogg School of Management, Northwestern University, Evanston, IL, USA.
∗Corresponding author: melissadell@fas.harvard.edu.
Abstract
Existing full text datasets of U.S. public domain newspapers do not recognize the
often complex layouts of newspaper scans, and as a result the digitized content
scrambles texts from articles, headlines, captions, advertisements, and other lay-
out regions. OCR quality can also be low. This study develops a novel, deep learn-
ing pipeline for extracting full article text

Looks like the paper comes across pretty well with simple newlines in between sections.

#### Images

Let's take a look at how images were converted from the PDF. Using the PDF linked [here](https://papers.nips.cc/paper_files/paper/2023/file/ffeb860479ccae44d84c0de32acd693d-Paper-Datasets_and_Benchmarks.pdf), we can find a segment of the PDF that contains an image and print it.

In [8]:
img_index = text.find('into the public domain')
print(text[img_index:img_index+150])

into the public domain.
(a) Scans Across Time
(b) Scans Across Space
Figure 1: Scans in the Chronicling America database across time and space.
We sho


Here we see that the image itself is skipped entirely during the PDF conversion process. However, the subfigure titles and the figure caption are preserved, and those are probably more relevant to the RAG than the image would be. So we likely don't need to adjust the PDF conversion for images. Instead, we'll just have to make sure we display the PDF itself so users can see the images when exploring the results from the RAG.

#### Tables

Now let's take a look at how tables were converted from the PDFs. Using the PDF linked [here](https://papers.nips.cc/paper_files/paper/2023/file/ffeb860479ccae44d84c0de32acd693d-Paper-Datasets_and_Benchmarks.pdf), we can find a segment of the PDF that contains a table and print it.

In [9]:
table_index = text.find('billion tokens')
print(text[table_index:table_index+500])

billion tokens.
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
Total
Text Bounding Boxes
Other Bounding Boxes
Boxes
Articles
Headlines
Captions
Bylines
Images
Ads
Tables
Mastheads
Legible
-
335M
368M
9.7M
14.7M
-
-
-
-
Illegible
-
26M
27M
0.9M
2.5M
-
-
-
-
Borderline
-
77M
22M
1.3M
1.2M
-
-
-
-
Total
1.14B
438M
417M
11.9M
18.4M
9.1M
221M
16.3M
4.9M
Table 1: American Stories dataset statistics.
American Stories provides the classes and coordinates for all content regions. Using the provided
metadata, it is 


Here we see that the tabular data is converted into text, but with one cell per line and no row or column structure. This could still be useful to the RAG as-is, and if we display the PDF (so users can view the rendered table), then we might not need to adjust the PDF conversion process for tables. That said, converting the tables so they keep more of their structure, and hence retain a bit more semantic information during the embedding process, might give better results for the RAG. We'll have to try both strategies to see.
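To get a feel for what "more formatting" could look like, here is a minimal sketch that regroups the flattened cells of Table 1 into a Markdown table. It assumes the number of columns is known in advance (here it was read off the rendered PDF), and the `cells_to_markdown` helper and the `Row` header label are hypothetical names introduced for illustration, not part of the existing pipeline.

```python
def cells_to_markdown(cells, header):
    """Group a flat list of table cells into rows and render a Markdown table.

    Assumes every row has exactly len(header) cells. That assumption has to
    be checked against the rendered PDF for each table.
    """
    n_cols = len(header)
    if len(cells) % n_cols != 0:
        raise ValueError("cell count is not a multiple of the column count")
    # Slice the flat list into consecutive rows of n_cols cells each.
    rows = [cells[i:i + n_cols] for i in range(0, len(cells), n_cols)]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


# Body cells of Table 1, one per line, as they appear in the converted text.
flat = """Legible
-
335M
368M
9.7M
14.7M
-
-
-
-
Illegible
-
26M
27M
0.9M
2.5M
-
-
-
-
Borderline
-
77M
22M
1.3M
1.2M
-
-
-
-
Total
1.14B
438M
417M
11.9M
18.4M
9.1M
221M
16.3M
4.9M"""

cells = flat.splitlines()
# "Row" is a placeholder label; (1)-(9) are the column numbers from the output.
header = ["Row"] + [f"({i})" for i in range(1, 10)]
table = cells_to_markdown(cells, header)
print(table)
```

This only works when the row width is known and every cell survived the conversion, so it is not a general solution. It mainly shows that the structure is recoverable in principle.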

#### Equations

Now let's take a look at how equations were converted from the PDFs. Using the PDF linked [here](https://papers.nips.cc/paper_files/paper/2023/file/ffeb860479ccae44d84c0de32acd693d-Paper-Datasets_and_Benchmarks.pdf), we can find a segment of the PDF that contains an equation and print it.

In [10]:
eq_index = text.find('formulation')
print(text[eq_index:eq_index+200])

formulation
Lsup
out =
X
i∈I
Lsup
out ,i =
X
i∈I
−1
|P(i)|
X
p∈P (i)
log
exp (zi · zp/τ)
P
a∈A(i) exp (zi · za/τ)
as implemented in PyTorch Metric Learning [22], where τ is a temperature parameter, i 


Similar to tables, equations were converted into text, but do not retain their formatting. This could be useful to the RAG as-is, but might be *more* useful if the equations are converted with some consistent formatting. In any case, we'll want to display the PDF itself as well so users can see the rendered equations.

### Papers with directly-provided plaintext

The plaintext for papers pre-2020 is available directly from the NeurIPS website. Let's investigate how those papers' images, tables, and equations compare to the PDF-converted ones.

Let's take a look at the very first paper in the dataset, one from 1987. This paper can be found [here](https://papers.nips.cc/paper_files/paper/1987/file/02e74f10e0327ad868d138f2b4fdd6f0-Paper.pdf).

In [11]:
text = papers.loc[0, 'text']

Let's take a quick look at how the paper comes across.

In [12]:
print(text[:1000])

573 

BIT - SERIAL NEURAL  NETWORKS 

Alan F.  Murray,  Anthony V . W.  Smith  and Zoe F.  Butler. 

Department of Electrical Engineering,  University of Edinburgh, 

The King's Buildings, Mayfield Road,  Edinburgh, 

Scotland,  EH93JL. 

ABSTRACT 

A  bit  - serial  VLSI  neural  network  is  described  from  an  initial  architecture  for  a 
synapse array through to silicon layout and board design.  The issues surrounding bit 
- serial  computation,  and  analog/digital  arithmetic  are  discussed  and  the  parallel 
development  of  a  hybrid  analog/digital  neural  network  is  outlined.  Learning  and 
recall  capabilities  are  reported  for  the  bit  - serial  network  along  with  a  projected 
specification  for  a  64  - neuron,  bit  - serial  board  operating  at 20 MHz.  This tech(cid:173)
nique  is  extended  to  a  256  (2562  synapses)  network  with  an  update  time  of 3ms, 
using  a  "paging"  technique  to  time  - multiplex  calculations  through  the  synapse

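Looks like this paper includes page numbers (573), and has a bit more newline formatting for sections. It also contains extraction artifacts such as `(cid:173)` (character code 173, a soft hyphen) where a word was hyphenated across a line break.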

#### Equations

Now let's take a look at how equations are formatted with the directly-provided papers. Using the PDF linked [here](https://papers.nips.cc/paper_files/paper/1987/file/02e74f10e0327ad868d138f2b4fdd6f0-Paper.pdf), we can find a segment of the PDF that contains an equation and print it.

In [13]:
eq_index = text.find('The neural output')
print(text[eq_index:eq_index+150])

The neural output state at time t, V[,  is related to x[ by 

V[  = F (xf) 

(1) 

The  activation  function  is  a  "squashing"  function  ensuring  


Looks like the equations from the directly-provided text have a similar structure to those in the PDF-converted text.

#### Images

Now let's take a look at how images were converted for the directly-provided text. Using the PDF linked [here](https://papers.nips.cc/paper_files/paper/1987/file/02e74f10e0327ad868d138f2b4fdd6f0-Paper.pdf), we can find a segment of the PDF that contains an image and print it.

In [14]:
img_index = text.find('stream networks')
print(text[img_index: img_index+100])

stream networks. 

Synapse 

States { Vj  } 

Figure 1. Generic architecture for  a  network of n to


Similar to the PDF-generated text, the directly-provided text simply omits the images.

#### Tables

We need to find another paper that has a table in it. We'll go with [this one](https://papers.nips.cc/paper_files/paper/2004/file/026a39ae63343c68b5223a95f3e17616-Paper.pdf).

In [15]:
paper_with_table = papers[papers['name'] == 'PAC-Bayes Learning of Conjunctions and Classification of Gene-Expression Data'].reset_index()
text = paper_with_table.loc[0, 'text']

We can find and print the table by referencing the PDF [here](https://papers.nips.cc/paper_files/paper/2004/file/026a39ae63343c68b5223a95f3e17616-Paper.pdf).

In [16]:
table_index = text.find('error on the training set')
print(text[table_index:table_index+400])

error on the training set.

Data Set

Name
Colon
B MD
C MD
ALL/AML

#exs
62
34
60
72

SVM SVM+gs
size
errs
256
12
32
12
1024
29
18
64

errs
11
6
21
10

ratio
0.42
0.10
0.077
0.002

Soft Greedy

size G-errs B-errs Bound
1
1
3
2

9
6
22
17

18
20
40
38

12
6
24
19

Table 1: DNA micro-array data sets and results.

For each algorithm, the “errs” columns of Table 1 contain the 5-fold CV error
expresse


Similar to the PDF-generated papers, the directly-provided papers contain tabular data but it is not formatted.

### Considerations regarding text formatting

Given that there may be critical information in the images, tables, and equations of these papers, it might be worthwhile exploring the use of PyMuPDF's Markdown formatter - which converts PDFs into Markdown. This might aid in preserving some of the information. It also might make it easier to parse the papers into chunks - if we decide that would improve the RAG system.
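As a rough sketch of what that conversion could look like: this assumes the Markdown formatter in question is the `pymupdf4llm` companion package and its `to_markdown` function, which is our reading rather than something confirmed above. The `pdf_to_markdown` wrapper and the file name in the usage comment are hypothetical.

```python
from pathlib import Path


def pdf_to_markdown(pdf_path):
    """Convert one PDF to a Markdown string (sketch, assumes pymupdf4llm)."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No PDF found at {path}")
    # Imported lazily so the rest of the notebook still runs when the
    # optional dependency is not installed.
    import pymupdf4llm
    return pymupdf4llm.to_markdown(str(path))


# Hypothetical usage, assuming the PDF has been downloaded locally:
# md_text = pdf_to_markdown("american_stories.pdf")
# print(md_text[:1000])
```

We'd want to compare the Markdown output against the plaintext for the same table and equation snippets printed above before committing to this route.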

Also, the plaintext of papers pre-2020 (provided directly by the NeurIPS website) and of papers from 2020 onward (converted from PDFs using PyMuPDF) is formatted differently. It might be worthwhile to run the same PDF conversion process over all the papers to make the formatting more standardized, although the underlying paper PDFs might not have consistent formatting from year to year.
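Short of re-converting everything, a lighter option would be to normalize the existing plaintext. The sketch below targets only the differences seen in the 1987 sample above: a leading page number, `(cid:173)` soft-hyphen artifacts at line breaks, doubled spaces, and trailing spaces. The `normalize_plaintext` name and the specific rules are assumptions for illustration and would need checking against many more papers.

```python
import re

# A page number alone on the first line, e.g. "573 \n\n".
LEADING_PAGE_NUMBER = re.compile(r"\A\s*\d+[ \t]*\n+")
# "(cid:173)" followed by a line break: a word hyphenated across lines.
SOFT_HYPHEN_BREAK = re.compile(r"\(cid:173\)[ \t]*\n[ \t]*")
RUN_OF_SPACES = re.compile(r"[ \t]+")
SPACE_AROUND_NEWLINE = re.compile(r" ?\n ?")
EXTRA_BLANK_LINES = re.compile(r"\n{3,}")


def normalize_plaintext(text):
    """Apply a few conservative whitespace and artifact fixes to paper text."""
    text = LEADING_PAGE_NUMBER.sub("", text)
    # Rejoin words split by a soft hyphen at a line break.
    text = SOFT_HYPHEN_BREAK.sub("", text)
    text = RUN_OF_SPACES.sub(" ", text)
    text = SPACE_AROUND_NEWLINE.sub("\n", text)
    text = EXTRA_BLANK_LINES.sub("\n\n", text)
    return text.strip()


sample = "573 \n\nBIT - SERIAL NEURAL  NETWORKS \n\nThis tech(cid:173)\nnique  is  extended"
print(normalize_plaintext(sample))
```

Note that the page-number rule would also strip a first line that legitimately consists of digits only, so it should probably be restricted to the pre-2020 papers.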

Note that all of these would be preprocessing steps, which only happen once. That gives us a bit more leeway as far as scalability is concerned. (I.e., adding more preprocessing steps may make the preprocessing pipeline take longer to run, but since it's only run once or on a batch basis, that might be worth it if it makes the RAG responses better.)

## Number of tokens

This section explores how many tokens are in each raw, unprocessed paper - so that we can determine, at a minimum, how much chunking we'll need to do before we can fit the papers into OpenAI's embedding models (which have a max input of 8,191 tokens).

In [17]:
import tiktoken
import numpy as np

In [18]:
tokenizer = tiktoken.encoding_for_model('text-embedding-3-small')

In [19]:
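def n_tokens(text):
    """Return the number of tokens in `text`, or NaN if it cannot be encoded."""
    try:
        return len(tokenizer.encode(text))
    except Exception:
        # E.g. missing text: pandas stores it as NaN (a float, not a str).
        return np.nan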

In [20]:
papers['n_tokens'] = papers.text.apply(n_tokens)

In [21]:
papers.head()

Unnamed: 0,year,name,text,n_tokens
0,1987,Bit-Serial Neural Networks,573 \n\nBIT - SERIAL NEURAL NETWORKS \n\nAlan...,9136.0
1,1987,Connectivity Versus Entropy,1 \n\nCONNECTIVITY VERSUS ENTROPY \n\nYaser S...,5220.0
2,1987,The Hopfield Model with Multi-Level Neurons,278 \n\nTHE HOPFIELD MODEL WITH MUL TI-LEVEL N...,4445.0
3,1987,How Neural Nets Work,442 \n\nAlan Lapedes \nRobert Farber \n\nThe...,11220.0
4,1987,Spatial Organization of Neural Networks: A Pro...,740 \n\nSPATIAL ORGANIZATION OF NEURAL NEn...,8575.0


Let's also drop any rows with missing values.

In [22]:
papers.isna().sum()

year        0
name        0
text        3
n_tokens    9
dtype: int64

In [23]:
papers.dropna(inplace=True)
papers.isna().sum()

year        0
name        0
text        0
n_tokens    0
dtype: int64

In [24]:
papers.n_tokens.describe()

count     20277.000000
mean      10168.549835
std        6816.756597
min           1.000000
25%        6374.000000
50%        9823.000000
75%       12609.000000
max      299550.000000
Name: n_tokens, dtype: float64

The median paper has 9,823 tokens, so more than half of the papers exceed the max input for the OpenAI embedding models (8,191 tokens), and the longest one runs to 299,550 tokens. We'll definitely have to chunk these down.
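To put rough numbers on that, the minimum number of chunks a paper needs is `ceil(n_tokens / 8191)`. The sketch below computes that for a few of the summary statistics above and shows a naive fixed-size split over a list of token ids. `n_chunks_needed` and `split_tokens` are hypothetical helper names, and a real chunking strategy would likely split on section or sentence boundaries instead.

```python
import math

MAX_TOKENS = 8191  # max input of the embedding models, as noted above


def n_chunks_needed(n_tokens, max_tokens=MAX_TOKENS):
    """Lower bound on the number of chunks for a paper with n_tokens tokens."""
    return math.ceil(n_tokens / max_tokens)


def split_tokens(tokens, max_tokens=MAX_TOKENS):
    """Naively split a list of token ids into consecutive fixed-size chunks."""
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


# Values taken from the describe() output above.
for label, n in [("25%", 6374), ("50%", 9823), ("75%", 12609), ("max", 299550)]:
    print(label, n_chunks_needed(n))

# Stand-in token ids; in the notebook these would come from tokenizer.encode(text).
chunks = split_tokens(list(range(20000)))
print([len(c) for c in chunks])
```

With the `tiktoken` tokenizer used in this notebook, each chunk of token ids could presumably be turned back into text with `tokenizer.decode(chunk)` before embedding.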

In [25]:
papers.to_csv('papers_with_n_tokens.csv', index=False, header=True)