## Introduction

This notebook plays with the [Alexandria Index](https://alex.macrocosm.so/download).

First, download the datasets. 

Install `pyarrow` and `fastparquet` by `pip install pyarrow fastparquet`. Then copy in one of the `abstracts` parquet files from the archive, and run the next cells to load the dataset. I used the `abstracts_1.parquet` file. You can try others.

First, use `%ls%` to find out where you are. Then use `%cd` to get to the notebook's folder if you are not in it.

In [162]:
%ls

[0m[01;32mabstracts_1.parquet[0m*   [01;32mabstracts_4.parquet[0m*  [01;32mtitles_17.parquet[0m*
[01;32mabstracts_10.parquet[0m*  [01;32mabstracts_5.parquet[0m*  [01;32mtitles_18.parquet[0m*
[01;32mabstracts_11.parquet[0m*  [01;32mabstracts_6.parquet[0m*  [01;32mtitles_19.parquet[0m*
[01;32mabstracts_12.parquet[0m*  [01;32mabstracts_7.parquet[0m*  [01;32mtitles_2.parquet[0m*
[01;32mabstracts_13.parquet[0m*  [01;32mabstracts_8.parquet[0m*  [01;32mtitles_20.parquet[0m*
[01;32mabstracts_14.parquet[0m*  [01;32mabstracts_9.parquet[0m*  [01;32mtitles_21.parquet[0m*
[01;32mabstracts_15.parquet[0m*  [01;32malexandria.ipynb[0m*     [01;32mtitles_22.parquet[0m*
[01;32mabstracts_16.parquet[0m*  [01;32moutput_1.txt[0m*         [01;32mtitles_23.parquet[0m*
[01;32mabstracts_17.parquet[0m*  [01;32moutput_full.txt[0m*      [01;32mtitles_3.parquet[0m*
[01;32mabstracts_18.parquet[0m*  [01;32mtitles_1.parquet[0m*     [01;32mtitles_4.parquet[0m*

In [121]:
import glob
import os

directory_path = os.getcwd()
filepaths = glob.glob(os.path.join(directory_path, 'abstracts_*.parquet'))
# print(filepaths)

## Exploring the `abstracts` dataset

The `abstracts` dataset contains 2.25 million 

In [131]:
import pandas as pd
df = pd.read_parquet(filepaths[0])

Let's see if the data contains duplicates:

In [132]:
print(df["doi"].duplicated().sum())
print(df["abstract"].duplicated().sum())

0
137


We see that there are no doi duplicates, good! But what about the abstract duplicates?

In [239]:
def export_duplicates(df, filename, truncation_length=100, column='abstract'):
    with open(filename, 'w') as file:
        duplicates = df[df.duplicated(subset=column, keep=False)]
        truncated_df = duplicates.copy()
        truncated_df['doi'] = truncated_df['doi'].str[:truncation_length]
        truncated_df[column] = truncated_df[column].str.replace('\n', ' ').str[:truncation_length]
        
        sorted_df = truncated_df.sort_values(by=column)
        sorted_df[['doi', column]].to_csv(file, sep='\t', header=False, index=False)

In [151]:
export_duplicates(df, "output_1.txt", column='abstract')

We see several types of duplicates:

* Papers that are comments on other papers. For example, there are 3 papers with the same abstract: "Comment: Expert Elicitation for Reliable System Design [arXiv:0708.0279]".
* Papers withdrawn, with the abstract "This paper has been withdrawn by the author." or its variants.
* Multi-part papers.
    * For example, a series of 7 papers ("Some series and integrals involving the Riemann zeta function..." by Donal F. Connon) with the same abstract, because it's just one 1400-pages long paper filled with nothing but large integrals and summations. I imagine they just split it into 7 parts either due to upload limits.

And as it should, every duplicate abstract has the same embedding vector.

Now let's import *all* the abstracts into one large dataframe.

In [123]:
dfs = []
for filepath in filepaths:
    df = pd.read_parquet(filepath)
    dfs.append(df)

cdf = pd.concat(dfs, ignore_index=True)

In [127]:
cdf

Unnamed: 0,abstract,embeddings,doi
0,A fully differential calculation in perturba...,"[-0.035151865, 0.022851437, 0.025942933, -0.02...",0704.0001
1,"We describe a new algorithm, the $(k,\ell)$-...","[0.035485767, -0.0015772493, -0.0016615744, -0...",0704.0002
2,The evolution of Earth-Moon system is descri...,"[-0.014510429, 0.010210799, 0.049661566, -0.01...",0704.0003
3,We show that a determinant of Stirling cycle...,"[0.029191103, 0.047992915, -0.0061754594, -0.0...",0704.0004
4,In this paper we show how to compute the $\L...,"[-0.015174898, 0.01603887, 0.04062805, -0.0246...",0704.0005
...,...,...,...
2254193,A promising theory in modifying general rela...,"[0.02845307, 0.010213018, -0.0065456596, 0.024...",1710.04612
2254194,We consider an $\ell_0$-minimization problem...,"[0.0020157294, 0.0043197623, 0.03604705, -0.04...",1710.04613
2254195,"Given an ideal I in a polynomial ring, we co...","[0.029166956, -0.0078339875, 0.014820765, -0.0...",1710.04614
2254196,Imitation learning is a powerful paradigm fo...,"[0.039186474, -0.03989054, 0.009515166, -0.056...",1710.04615


How much space did that take up? The `abstract` and `doi` columns each are made of ASCII characters, each costing 1 byte. So we can just count. The `embeddings` column is made of vectors of float32, each costing 4 bytes. The following calculation shows
* `doi`: 23 MB
* `abstract`: 2000 MB
* `embeddings`: 6600 MB

In [161]:
average_length = cdf['doi'].str.len().mean()
num_rows = len(cdf)
size_requirement_bytes = average_length * num_rows
size_requirement_mb = size_requirement_bytes / 1048576
print(f"The estimated size requirement for the 'doi' column is approximately: {size_requirement_mb:.2f} MB.")

The estimated size requirement for the 'doi' column is approximately: 22.76 MB.


In [160]:
average_length = cdf['abstract'].str.len().mean()
num_rows = len(cdf)
size_requirement_bytes = average_length * num_rows
size_requirement_mb = size_requirement_bytes / 1048576
print(f"The estimated size requirement for the 'abstract' column is approximately: {size_requirement_mb:.2f} MB.")

The estimated size requirement for the 'abstract' column is approximately: 2007.99 MB.


In [159]:
num_rows = len(cdf)
sample_entry = cdf['embeddings'][0] 
entry_size_bytes = sample_entry.nbytes
size_requirement_bytes = entry_size_bytes * num_rows
size_requirement_mb = size_requirement_bytes / 1048576
print(f"The estimated space requirement for the 'embeddings' column is approximately: {size_requirement_mb:.2f} MB.")

The estimated space requirement for the 'embeddings' column is approximately: 6604.10 MB.


We see that it contains 2.25 million rows. Let's explore it!

In [128]:
print(cdf["doi"].duplicated().sum())
print(cdf["abstract"].duplicated().sum())

0
1718


So there are no doi duplicates, but 1718 abstract duplicates. Let's export the duplicates and see what they are.

In [152]:
export_duplicates(cdf, "output_full.txt")

So I took a brief look and some of what I found:
* Literally the same title and abstract, but one is longer than the other. <https://arxiv.org/abs/1907.05261>, <https://arxiv.org/abs/2006.13685>
* <https://arxiv.org/abs/2012.12178> is an extended abstract of <https://arxiv.org/abs/2104.09611>.
* <https://arxiv.org/abs/2303.04075> is an "evolved version" of <https://arxiv.org/abs/2209.12285>. Why couldn't they have submitted a second version?
* 500 papers withdrawn. Voluntary withdrawals are usually due to mistakes that invalidate the result, or the paper being superceded by later publications. The involuntary ones are usually due to plagerism or being a jerk. Notable examples:
    * 7 papers by N. Mebarki, A.Maireche are all plagerized.
    * 8 papers by Ramy Naboulsi, all plagerized. The situation was apparently that Naboulsi was trying to get into Japan. As a result we got probably the [weirdest abstract](https://arxiv.org/abs/hep-ph/0304045) ever, which is just an email from Yasushi Watanabe apologizing... The whole situation is described in [Preprint server seeks way to halt plagiarists | Nature](https://www.nature.com/articles/426007a).
    > The plagiarism case traces its origins to June 2002, when Yasushi Watanabe, a high-energy physicist at the Tokyo Institute of Technology, was contacted by Ramy Naboulsi, who said he was a mathematical physicist. Naboulsi asked for Watanabe's help in obtaining a research position in Japan. Impressed by Naboulsi's work, Watanabe agreed to upload some of his papers to ArXiv, which Naboulsi was unable to do himself as he had no academic affiliation. “I was so amazed at his productivity I began to think he was a genius,” Watanabe later wrote in an e-mail to the archive.
    * 3 papers by Tomasz Bodziony all about how Einstein faked his relativity papers. "withdrawn by arXiv administrators due to inflammatory content and unprofessional language".
    * 3 crackpot math papers by Asia Furones, "withdrawn by arXiv admin because of the use of a pseudonym, in violation of arXiv policy".
    * D. L. Khokhlov voluntarily withdrew 4 papers "due to the presented idea is wrong". Just the direct approach, huh?
    * 21 voluntary withdrawals due to "crucial sign error", 10 due to "crucial error".

Some other arXiv trivias:
* <https://arxiv.org/abs/1511.08771>: 11232 pages long. The main text is 102 pages long. The rest of it is basically what happens when someone has a csv file but wants to squeeze it into a pdf.


## Compiling a FAISS database

We can quickly search over the vector database by compiling them into a FAISS database, which is optimized for fast vector searches.

In [186]:
import faiss
import numpy as np

# Convert embeddings into a matrix (2D numpy array)
embedding_matrix = np.vstack(cdf['embeddings'].values)

# Get the dimension of the embeddings
dimension = embedding_matrix.shape[1] 

# Build the FAISS index
index = faiss.IndexFlatL2(dimension)
index.add(embedding_matrix.astype('float32'))  # FAISS uses float32

# Now, create a mapping from the index in the FAISS database to the corresponding doi and abstract.
# The i-th entry in the FAISS index corresponds to the i-th entry in the DataFrame
index_to_doi = cdf['doi'].values
index_to_abstract = cdf['abstract'].values

# Now, you can search the FAISS index
def search(query_vector, k=5):
    # Make sure query_vector is a 2D array
    query_vector = query_vector.reshape(1, -1).astype('float32')
    
    _, indices = index.search(query_vector, k)
    
    # Convert indices to original DOIs and abstracts
    result_doi = index_to_doi[indices]
    result_abstract = index_to_abstract[indices]
    result_abstract = np.array([[np.char.replace(s, '\n', ' ') for s in row] for row in result_abstract])
    
    return result_doi, result_abstract

In [215]:
import warnings
warnings.filterwarnings('ignore')

from InstructorEmbedding import INSTRUCTOR
model_ins = INSTRUCTOR('hkunlp/instructor-xl')

load INSTRUCTOR_Transformer
max_seq_length  512


In [216]:
import torch
# Check if CUDA is available and set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device = ", device)
model_ins.to(device)

device =  cuda


INSTRUCTOR(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: T5EncoderModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  (2): Dense({'in_features': 1024, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Normalize()
)

In [214]:
def query_abstract(query, prompt="Represent the query for retrieving relevant research paper abstracts:", k=5):
    query_vector = model_ins.encode(
        sentences=[[prompt, query]],
        batch_size=1,
        device=str(device)
    )
    return search(query_vector, k=k)

questions = [
    "Who else than Einstein developed relativity independently?",
    "Are we living in a simulation?",
    "Why is the Transformer architecture more scalable than LSTM?",
    "What is the role of quantum mechanics in biology?",
    "What is the role of dark matter in the universe?",
    "What are the recent developments in climate change research?",
    "Are there alternatives to the Big Bang theory?",
]
for question in questions:
    print(question)
    print('-'*80)
    print(query_abstract(question, k=2)[1])
    print('-'*80)

Who else than Einstein developed relativity independently?
--------------------------------------------------------------------------------
[['  The intermediate stage of the development of general relativity is inseparable of Marcel Grossmann\'s mathematical assistance. Einstein acknowledges Grossmann\'s help during 1912-1914 to the development of general relativity. In fact, as with special relativity so was it with General relativity, Einstein received assistance only from his old friends, Marcel Grossmann and Michele Besso. However, he continued to consider Besso as his eternal "sounding board"... '
  '  In 1895 Hendrik Antoon Lorentz derived the Fresnel dragging coefficient in his theory of immobile ether and electrons. This derivation did not explicitly involve electromagnetic theory at all. According to the 1922 Kyoto lecture notes, before 1905 Einstein tried to discuss Fizeau\'s experiment "as originally discussed by Lorentz" (in 1895). At this time he was still under the impre

It's working quite good!

### Counting the bits

The `Instruct-XL` model is 5 GB when stored in the disk, and about 10.5 GB when loaded onto the GPU VRAM. 

To estimate the FAISS index size, we have to serialize it first. It comes out to be 6.4 GB.

The Pandas dataframe is just 2.5 GB.

In [226]:
import os
import tempfile

def size_of_model(model):
    torch.save(model.state_dict(), 'temp.p')
    size = os.path.getsize('temp.p')
    os.remove('temp.p')
    return size

print(f"Instructor-XL model has size {size_of_model(model_ins)/(2**30):.4f} GB")


Instructor-XL model has size 4.6258 GB


In [227]:
def size_of_faiss_index(index):
    byte_array = faiss.serialize_index(index)
    return sys.getsizeof(byte_array)

print(f"FAISS index has size {size_of_faiss_index(index)/(2**30):.4f} GB")

FAISS index has size 6.4493 GB


In [228]:
def size_of_dataframe(df):
    return df.memory_usage(deep=True).sum()

print(f"Total abstract dataset as a Pandas DataFrame has size {size_of_dataframe(cdf)/(2**30):.4f} GB")

Total abstract dataset as a Pandas DataFrame has size 2.4744 GB


## Exploring the titles dataset

We can do the same exploration with the titles dataset.

In [234]:
del cdf
del df

In [236]:
filepaths = glob.glob(os.path.join(directory_path, 'titles_*.parquet'))
dfs = []
for filepath in filepaths:
    df = pd.read_parquet(filepath)
    dfs.append(df)

cdf = pd.concat(dfs, ignore_index=True)

In [237]:
cdf

Unnamed: 0,title,embeddings,doi
0,Calculation of prompt diphoton production cros...,"[-0.050620172, 0.041436385, 0.05363288, -0.029...",0704.0001
1,Sparsity-certifying Graph Decompositions,"[0.014515653, 0.023809524, -0.028145121, -0.04...",0704.0002
2,The evolution of the Earth-Moon system based o...,"[-4.766115e-05, 0.017415706, 0.04146007, -0.03...",0704.0003
3,A determinant of Stirling cycle numbers counts...,"[0.027208889, 0.046175897, 0.0010913888, -0.01...",0704.0004
4,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,"[0.0113909235, 0.0042667952, -0.0008565594, -0...",0704.0005
...,...,...,...
2254193,Thermodynamics of Black Holes in Rastall Gravity,"[0.02459273, 0.024434721, 0.025344433, 0.04852...",1710.04612
2254194,Tractable ADMM Schemes for Computing KKT Point...,"[-0.010883444, 0.0013427543, 0.0028294649, -0....",1710.04613
2254195,Mono: an algebraic study of torus closures,"[0.0011102908, -0.022653135, 0.054966096, -0.0...",1710.04614
2254196,Deep Imitation Learning for Complex Manipulati...,"[0.039771307, -0.010292426, 0.0242721, -0.0688...",1710.04615


In [240]:
export_duplicates(cdf, "output_titles.txt", column='title')

Interesting things I found.

Some really popular titles:
* 13 "Beyond the Standard Model"
* 9 "Physics beyond the Standard Model"
* 6 "Chiral perturbation theory"
* 6 "CP Violation in Hyperon Decays"
* 5 "CP violation"
Some papers are not informative: 2 "Title Redacted", 2 "withdrawn", 6 "Rejoinder".

Some uncommon title types:
* 73 starting with "Comment on" or "Comment:"
* 129 starting with "Discussion of" or "Discussion:"
* 11 "Matters of gravity", which turns out to be "The newsletter of the Division of Gravitational Physics of the American Physical Society".

### FAISS search

Now just like what we did with abstracts, we can compile a FAISS vector search database and search over them, to find paper titles that might answer the question.

In [241]:
import faiss
import numpy as np

# Convert embeddings into a matrix (2D numpy array)
embedding_matrix = np.vstack(cdf['embeddings'].values)

# Get the dimension of the embeddings
dimension = embedding_matrix.shape[1] 

# Build the FAISS index
index = faiss.IndexFlatL2(dimension)
index.add(embedding_matrix.astype('float32'))  # FAISS uses float32

# Now, create a mapping from the index in the FAISS database to the corresponding doi and abstract.
# The i-th entry in the FAISS index corresponds to the i-th entry in the DataFrame
index_to_doi = cdf['doi'].values
index_to_title = cdf['title'].values

# Now, you can search the FAISS index
def search(query_vector, k=5):
    # Make sure query_vector is a 2D array
    query_vector = query_vector.reshape(1, -1).astype('float32')
    
    _, indices = index.search(query_vector, k)
    
    # Convert indices to original DOIs and abstracts
    result_doi = index_to_doi[indices]
    result_title = index_to_title[indices]
    result_title = np.array([[np.char.replace(s, '\n', ' ') for s in row] for row in result_title])
    
    return result_doi, result_title

In [242]:
def query_title(query, prompt="Represent the query for retrieving relevant research paper titles:", k=5):
    query_vector = model_ins.encode(
        sentences=[[prompt, query]],
        batch_size=1,
        device=str(device)
    )
    return search(query_vector, k=k)

questions = [
    "Who else than Einstein developed relativity independently?",
    "Are we living in a simulation?",
    "Why is the Transformer architecture more scalable than LSTM?",
    "What is the role of quantum mechanics in biology?",
    "What is the role of dark matter in the universe?",
    "What are the recent developments in climate change research?",
    "Are there alternatives to the Big Bang theory?",
]
for question in questions:
    print(question)
    print('-'*80)
    print(query_abstract(question, k=10)[1])
    print('-'*80)

Who else than Einstein developed relativity independently?
--------------------------------------------------------------------------------
[['Einstein and Hilbert: The Creation of General Relativity'
  'From Newton to Einstein: the birth of Special Relativity'
  "Quanta: The Originality of Einstein's Approach to Relativity?"
  'A note on "Einstein\'s special relativity beyond the speed of light by   James M. Hill and Barry J. Cox"'
  'The contribution of Giordano Bruno to the principle of relativity'
  "A brief note on how Einstein's general relativity has influenced the   development of modern differential geometry"
  "Beyond Einstein's General Relativity"
  'Connection independent formulation of general relativity'
  "Max Born, Albert Einstein and Hermann Minkowski's Space-Time Formalism   of Special Relativity"
  'Derivation of Einstein Cartan theory from General Relativity']]
--------------------------------------------------------------------------------
Are we living in a simula