#### Part 1 - Build an ArXiv RAG search system using FAISS
Use natural language to search for research papers on ArXiv and<br>
get search results in a format that enables quick review.<br>
by vbookshelf<br>
20 Feb 2024

Part 2 - Search ArXiv using text vector similarity<br>
https://www.kaggle.com/code/vbookshelf/part-2-search-arxiv-using-text-vector-similarity

## Objectives

- Build a RAG (Retrieval Augmented Generation) search system that uses vector search to compare natural language search queries to research paper titles and abstracts in the ArXiv database.
- Use free tools like Sentence Transformers to create the vectors and FAISS to manage the vector search.
- Format the search results to enable quick review.
- Use OpenAi to create a natural language ouput.

##### Example search query: "I want to build an invisibility cloak like the one in Harry Potter."

## Approach

Arxiv has a dataset on Kaggle that gets updated weekly. It includes a file containing metadata for all papers stored in the Arxiv database. The metadata includes the title and abstract of each paper. There are approximately 2.4 million papers in the ArXiv dataset.

To create the RAG system we will take the title and abstract for each paper and convert it into a vector embedding. These vectors will be stored in a FAISS index.

FAISS (Facebook AI Similarity Search) is an open-source library designed for fast (GPU supported) vector similarity search in large datasets.

When a search query is submitted, the query text string will be vectorized. This query vector will then be compared to the vectors in the FAISS index. Then, the search results will be reranked and the top matches will be returned. The paper title, arxiv categories, paper abstract and a link to the pdf file will be displayed for each search result. This format will enable users to quickly scan through the search results to determine which papers are relevant to their work.

Finally, an OpenAi model wil be used to create a one sentence summary of each abstract in the search results.

This project has two notebooks. This notebook is Part 1. In the Part 2 notebook you can submit search queries and review the results. 

Please ensure that the GPU (P100) in your notebook is turned on.

## Resources to learn RAG basics

Here's a list of resources that will help you understand what's happening in this notebook:

- Faiss - Introduction to Similarity Search<br>
James Briggs<br>
https://www.youtube.com/watch?v=sKyvsdEv6rk

- Large Language Models with Semantic Search<br>
Deeplearning.Ai short course<br>
https://www.deeplearning.ai/short-courses/large-language-models-semantic-search/

- Colab Notebook that explains rerank<br>
retrieve_rerank_simple_wikipedia.ipynb<br>
https://colab.research.google.com/github/UKPLab/sentence-transformers/blob/master/examples/applications/retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb#scrollTo=UlArb7kqN3Re

- Vector Databases: from Embeddings to Applications<br>
Deeplearning.Ai short course<br>
(This approach is an alternative to using FAISS and Sentence Transformers)<br>
https://www.deeplearning.ai/short-courses/vector-databases-embeddings-applications/

- Sentence transformers docs<br>
https://www.sbert.net/



## Install packages

In [1]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-2.3.1-py3-none-any.whl.metadata (11 kB)
Downloading sentence_transformers-2.3.1-py3-none-any.whl (132 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m132.8/132.8 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: sentence-transformers
Successfully installed sentence-transformers-2.3.1


In [2]:
#!pip install faiss-cpu
!pip install faiss-gpu

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m85.5/85.5 MB[0m [31m18.7 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [3]:
!pip install openai

Collecting openai
  Downloading openai-1.12.0-py3-none-any.whl.metadata (18 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.26.0-py3-none-any.whl.metadata (7.6 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.3-py3-none-any.whl.metadata (20 kB)
Downloading openai-1.12.0-py3-none-any.whl (226 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m226.7/226.7 kB[0m [31m8.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading httpx-0.26.0-py3-none-any.whl (75 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m75.9/75.9 kB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading httpcore-1.0.3-py3-none-any.whl (77 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m77.0/77.0 kB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: httpcore, httpx, openai
Successfully installed httpcore-1.0.3 httpx-0.26.0 openai-1.12.0


In [4]:
!pip install rank-bm25

Collecting rank-bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank-bm25
Successfully installed rank-bm25-0.2.2


In [5]:
import pandas as pd
import numpy as np
import os

import json
import re

In [6]:
# Optional
# Please add your OpenAi API Key here.
# I will delete this key after committing this notebook.
OPENAI_API_KEY  = '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'

In [7]:
os.listdir('../input/arxiv')

['arxiv-metadata-oai-snapshot.json']

## All ArXiv categories

The ArXiv data has evolved over time and it seems that some category codes have changed or been added. It appears that the taxonomy on the ArXiv website needs to be updated because some of the category codes in the json metadata are not in the taxonomy.

In [8]:
# All Arxiv category codes
# Source: https://www.kaggle.com/code/artgor/arxiv-metadata-exploration

# https://arxiv.org/category_taxonomy
# https://info.arxiv.org/help/api/user-manual.html#subject_classifications


category_map = {
# These created errors when mapping categories to descriptions
'acc-phys': 'Accelerator Physics',
'adap-org': 'Not available',
'q-bio': 'Not available',
'cond-mat': 'Not available',
'chao-dyn': 'Not available',
'patt-sol': 'Not available',
'dg-ga': 'Not available',
'solv-int': 'Not available',
'bayes-an': 'Not available',
'comp-gas': 'Not available',
'alg-geom': 'Not available',
'funct-an': 'Not available',
'q-alg': 'Not available',
'ao-sci': 'Not available',
'atom-ph': 'Atomic Physics',
'chem-ph': 'Chemical Physics',
'plasm-ph': 'Plasma Physics',
'mtrl-th': 'Not available',
'cmp-lg': 'Not available',
'supr-con': 'Not available',
###

# Added
'econ.GN': 'General Economics', 
'econ.TH': 'Theoretical Economics', 
'eess.SY': 'Systems and Control', 
    
'astro-ph': 'Astrophysics',
'astro-ph.CO': 'Cosmology and Nongalactic Astrophysics',
'astro-ph.EP': 'Earth and Planetary Astrophysics',
'astro-ph.GA': 'Astrophysics of Galaxies',
'astro-ph.HE': 'High Energy Astrophysical Phenomena',
'astro-ph.IM': 'Instrumentation and Methods for Astrophysics',
'astro-ph.SR': 'Solar and Stellar Astrophysics',
'cond-mat.dis-nn': 'Disordered Systems and Neural Networks',
'cond-mat.mes-hall': 'Mesoscale and Nanoscale Physics',
'cond-mat.mtrl-sci': 'Materials Science',
'cond-mat.other': 'Other Condensed Matter',
'cond-mat.quant-gas': 'Quantum Gases',
'cond-mat.soft': 'Soft Condensed Matter',
'cond-mat.stat-mech': 'Statistical Mechanics',
'cond-mat.str-el': 'Strongly Correlated Electrons',
'cond-mat.supr-con': 'Superconductivity',
'cs.AI': 'Artificial Intelligence',
'cs.AR': 'Hardware Architecture',
'cs.CC': 'Computational Complexity',
'cs.CE': 'Computational Engineering, Finance, and Science',
'cs.CG': 'Computational Geometry',
'cs.CL': 'Computation and Language',
'cs.CR': 'Cryptography and Security',
'cs.CV': 'Computer Vision and Pattern Recognition',
'cs.CY': 'Computers and Society',
'cs.DB': 'Databases',
'cs.DC': 'Distributed, Parallel, and Cluster Computing',
'cs.DL': 'Digital Libraries',
'cs.DM': 'Discrete Mathematics',
'cs.DS': 'Data Structures and Algorithms',
'cs.ET': 'Emerging Technologies',
'cs.FL': 'Formal Languages and Automata Theory',
'cs.GL': 'General Literature',
'cs.GR': 'Graphics',
'cs.GT': 'Computer Science and Game Theory',
'cs.HC': 'Human-Computer Interaction',
'cs.IR': 'Information Retrieval',
'cs.IT': 'Information Theory',
'cs.LG': 'Machine Learning',
'cs.LO': 'Logic in Computer Science',
'cs.MA': 'Multiagent Systems',
'cs.MM': 'Multimedia',
'cs.MS': 'Mathematical Software',
'cs.NA': 'Numerical Analysis',
'cs.NE': 'Neural and Evolutionary Computing',
'cs.NI': 'Networking and Internet Architecture',
'cs.OH': 'Other Computer Science',
'cs.OS': 'Operating Systems',
'cs.PF': 'Performance',
'cs.PL': 'Programming Languages',
'cs.RO': 'Robotics',
'cs.SC': 'Symbolic Computation',
'cs.SD': 'Sound',
'cs.SE': 'Software Engineering',
'cs.SI': 'Social and Information Networks',
'cs.SY': 'Systems and Control',
'econ.EM': 'Econometrics',             
'eess.AS': 'Audio and Speech Processing',
'eess.IV': 'Image and Video Processing',
'eess.SP': 'Signal Processing',               
'gr-qc': 'General Relativity and Quantum Cosmology',
'hep-ex': 'High Energy Physics - Experiment',
'hep-lat': 'High Energy Physics - Lattice',
'hep-ph': 'High Energy Physics - Phenomenology',
'hep-th': 'High Energy Physics - Theory',
'math.AC': 'Commutative Algebra',
'math.AG': 'Algebraic Geometry',
'math.AP': 'Analysis of PDEs',
'math.AT': 'Algebraic Topology',
'math.CA': 'Classical Analysis and ODEs',
'math.CO': 'Combinatorics',
'math.CT': 'Category Theory',
'math.CV': 'Complex Variables',
'math.DG': 'Differential Geometry',
'math.DS': 'Dynamical Systems',
'math.FA': 'Functional Analysis',
'math.GM': 'General Mathematics',
'math.GN': 'General Topology',
'math.GR': 'Group Theory',
'math.GT': 'Geometric Topology',
'math.HO': 'History and Overview',
'math.IT': 'Information Theory',
'math.KT': 'K-Theory and Homology',
'math.LO': 'Logic',
'math.MG': 'Metric Geometry',
'math.MP': 'Mathematical Physics',
'math.NA': 'Numerical Analysis',
'math.NT': 'Number Theory',
'math.OA': 'Operator Algebras',
'math.OC': 'Optimization and Control',
'math.PR': 'Probability',
'math.QA': 'Quantum Algebra',
'math.RA': 'Rings and Algebras',
'math.RT': 'Representation Theory',
'math.SG': 'Symplectic Geometry',
'math.SP': 'Spectral Theory',
'math.ST': 'Statistics Theory',
'math-ph': 'Mathematical Physics',
'nlin.AO': 'Adaptation and Self-Organizing Systems',
'nlin.CD': 'Chaotic Dynamics',
'nlin.CG': 'Cellular Automata and Lattice Gases',
'nlin.PS': 'Pattern Formation and Solitons',
'nlin.SI': 'Exactly Solvable and Integrable Systems',
'nucl-ex': 'Nuclear Experiment',
'nucl-th': 'Nuclear Theory',
'physics.acc-ph': 'Accelerator Physics',
'physics.ao-ph': 'Atmospheric and Oceanic Physics',
'physics.app-ph': 'Applied Physics',
'physics.atm-clus': 'Atomic and Molecular Clusters',
'physics.atom-ph': 'Atomic Physics',
'physics.bio-ph': 'Biological Physics',
'physics.chem-ph': 'Chemical Physics',
'physics.class-ph': 'Classical Physics',
'physics.comp-ph': 'Computational Physics',
'physics.data-an': 'Data Analysis, Statistics and Probability',
'physics.ed-ph': 'Physics Education',
'physics.flu-dyn': 'Fluid Dynamics',
'physics.gen-ph': 'General Physics',
'physics.geo-ph': 'Geophysics',
'physics.hist-ph': 'History and Philosophy of Physics',
'physics.ins-det': 'Instrumentation and Detectors',
'physics.med-ph': 'Medical Physics',
'physics.optics': 'Optics',
'physics.plasm-ph': 'Plasma Physics',
'physics.pop-ph': 'Popular Physics',
'physics.soc-ph': 'Physics and Society',
'physics.space-ph': 'Space Physics',
'q-bio.BM': 'Biomolecules',
'q-bio.CB': 'Cell Behavior',
'q-bio.GN': 'Genomics',
'q-bio.MN': 'Molecular Networks',
'q-bio.NC': 'Neurons and Cognition',
'q-bio.OT': 'Other Quantitative Biology',
'q-bio.PE': 'Populations and Evolution',
'q-bio.QM': 'Quantitative Methods',
'q-bio.SC': 'Subcellular Processes',
'q-bio.TO': 'Tissues and Organs',
'q-fin.CP': 'Computational Finance',
'q-fin.EC': 'Economics',
'q-fin.GN': 'General Finance',
'q-fin.MF': 'Mathematical Finance',
'q-fin.PM': 'Portfolio Management',
'q-fin.PR': 'Pricing of Securities',
'q-fin.RM': 'Risk Management',
'q-fin.ST': 'Statistical Finance',
'q-fin.TR': 'Trading and Market Microstructure',
'quant-ph': 'Quantum Physics',
'stat.AP': 'Applications',
'stat.CO': 'Computation',
'stat.ME': 'Methodology',
'stat.ML': 'Machine Learning',
'stat.OT': 'Other Statistics',
'stat.TH': 'Statistics Theory'
}


## Load the Arxiv metadata

In [9]:
# https://www.kaggle.com/code/matthewmaddock/nlp-arxiv-dataset-transformers-and-umap

# This takes about 1 minute.


cols = ['id', 'title', 'abstract', 'categories']
data = []
file_name = '/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json'


with open(file_name, encoding='latin-1') as f:
    for line in f:
        doc = json.loads(line)
        lst = [doc['id'], doc['title'], doc['abstract'], doc['categories']]
        data.append(lst)

df_data = pd.DataFrame(data=data, columns=cols)

print(df_data.shape)

df_data.head()

(2421966, 4)


Unnamed: 0,id,title,abstract,categories
0,704.0001,Calculation of prompt diphoton production cros...,A fully differential calculation in perturba...,hep-ph
1,704.0002,Sparsity-certifying Graph Decompositions,"We describe a new algorithm, the $(k,\ell)$-...",math.CO cs.CG
2,704.0003,The evolution of the Earth-Moon system based o...,The evolution of Earth-Moon system is descri...,physics.gen-ph
3,704.0004,A determinant of Stirling cycle numbers counts...,We show that a determinant of Stirling cycle...,math.CO
4,704.0005,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,In this paper we show how to compute the $\L...,math.CA math.FA


## Convert the category codes into text

If a description was not available for a category code then I left it out when converting category codes into text.

In [10]:
def get_cat_text(x):
    
    cat_text = ''
    
    # Put the codes into a list
    cat_list = x.split(' ')
    
    for i, item in enumerate(cat_list):
        
        cat_name = category_map[item]
        
        # If there was no description available
        # for the category code then don't include it in the text.
        if cat_name != 'Not available':
            
            if i == 0:
                cat_text = cat_name
            else:
                cat_text = cat_text + ', ' + cat_name
 
    # Remove leading and trailing spaces
    cat_text = cat_text.strip()
    
    return cat_text
    

df_data['cat_text'] = df_data['categories'].apply(get_cat_text)

df_data.head()

Unnamed: 0,id,title,abstract,categories,cat_text
0,704.0001,Calculation of prompt diphoton production cros...,A fully differential calculation in perturba...,hep-ph,High Energy Physics - Phenomenology
1,704.0002,Sparsity-certifying Graph Decompositions,"We describe a new algorithm, the $(k,\ell)$-...",math.CO cs.CG,"Combinatorics, Computational Geometry"
2,704.0003,The evolution of the Earth-Moon system based o...,The evolution of Earth-Moon system is descri...,physics.gen-ph,General Physics
3,704.0004,A determinant of Stirling cycle numbers counts...,We show that a determinant of Stirling cycle...,math.CO,Combinatorics
4,704.0005,From dyadic $\Lambda_{\alpha}$ to $\Lambda_{\a...,In this paper we show how to compute the $\L...,math.CA math.FA,"Classical Analysis and ODEs, Functional Analysis"


In [11]:
# Print details of one paper

i = 1

print('Id:',df_data.loc[i, 'id'])
print()
print('Title:',df_data.loc[i, 'title'])
print()
print('Categories:',df_data.loc[i, 'cat_text'])
print()
print('Abstract:',df_data.loc[i, 'abstract'])

Id: 0704.0002

Title: Sparsity-certifying Graph Decompositions

Categories: Combinatorics, Computational Geometry

Abstract:   We describe a new algorithm, the $(k,\ell)$-pebble game with colors, and use
it obtain a characterization of the family of $(k,\ell)$-sparse graphs and
algorithmic solutions to a family of problems concerning tree decompositions of
graphs. Special instances of sparse graphs appear in rigidity theory and have
received increased attention in recent years. In particular, our colored
pebbles generalize and strengthen the previous results of Lee and Streinu and
give a new proof of the Tutte-Nash-Williams characterization of arboricity. We
also present a new decomposition that certifies sparsity based on the
$(k,\ell)$-pebble game with colors. Our work also exposes connections between
pebble game algorithms and previous sparse graph algorithms by Gabow, Gabow and
Westermann and Hendrickson.



## Clean the text
- Replace newline characters ('\n') with a space
- Remove leading and trailing spaces

In [12]:
# Replace newline characters ('\n') with a space
# Remove leading and trailing spaces

def clean_text(x):
    
    # Replace newline characters with a space
    new_text = x.replace("\n", " ")
    # Remove leading and trailing spaces
    new_text = new_text.strip()
    
    return new_text

df_data['title'] = df_data['title'].apply(clean_text)
df_data['abstract'] = df_data['abstract'].apply(clean_text)

#df_filtered.head()

## Create the text string that will be vectorized

Here the title will be appended to the abstract.

In [13]:
# Append the title to the abstract

df_data['prepared_text'] = df_data['title'] + ' {title} ' + df_data['abstract']

#df_data.head()

## Get the data ready for vectorizing

We need a list of text strings.

In [14]:
# Create a list of text chunks

chunk_list = list(df_data['prepared_text'])

# The ids are used to create web links to each paper.
# You can access each paper directly on ArXiv using these links:
# https://arxiv.org/abs/{id}: ArXiv page for the paper
# https://arxiv.org/pdf/{id}: Direct link to download the PDF

arxiv_id_list = list(df_data['id'])
cat_list = list(df_data['cat_text'])

print(len(chunk_list))
print(len(arxiv_id_list))
print(len(cat_list))

2421966
2421966
2421966


In [15]:
chunk_list[0]

'Calculation of prompt diphoton production cross sections at Tevatron and   LHC energies {title} A fully differential calculation in perturbative quantum chromodynamics is presented for the production of massive photon pairs at hadron colliders. All next-to-leading order perturbative contributions from quark-antiquark, gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as all-orders resummation of initial-state gluon radiation valid at next-to-next-to-leading logarithmic accuracy. The region of phase space is specified in which the calculation is most reliable. Good agreement is demonstrated with data from the Fermilab Tevatron, and predictions are made for more detailed tests with CDF and DO data. Predictions are shown for distributions of diphoton pairs produced at the energy of the Large Hadron Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs boson are contrasted with those produced from QCD processes at the LHC, showing that enhanced 

The data is now ready for vectorization.

## Create the embedding vetors

This step takes about 1 hour 40 minutes to create approx. 2.4 million vectors. I saved these vectors and created a Part 2 notebook where the saved vectors are loaded instead of being created from scratch. 

You can use the Part 2 notebook to quickly test this RAG system. There you can enter your own search queries and review the results.

In [16]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Sentences are encoded by calling model.encode()
embeddings = model.encode(chunk_list)

print(embeddings.shape)
print('Embedding length', embeddings.shape[1])

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

  return self.fget.__get__(instance, owner)()


tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Batches:   0%|          | 0/75687 [00:00<?, ?it/s]

(2421966, 384)
Embedding length 384


In [17]:
# Display one embedding

i = 1
print(chunk_list[i])
print(embeddings[i])

Sparsity-certifying Graph Decompositions {title} We describe a new algorithm, the $(k,\ell)$-pebble game with colors, and use it obtain a characterization of the family of $(k,\ell)$-sparse graphs and algorithmic solutions to a family of problems concerning tree decompositions of graphs. Special instances of sparse graphs appear in rigidity theory and have received increased attention in recent years. In particular, our colored pebbles generalize and strengthen the previous results of Lee and Streinu and give a new proof of the Tutte-Nash-Williams characterization of arboricity. We also present a new decomposition that certifies sparsity based on the $(k,\ell)$-pebble game with colors. Our work also exposes connections between pebble game algorithms and previous sparse graph algorithms by Gabow, Gabow and Westermann and Hendrickson.
[ 3.53147136e-03  4.11602966e-02  1.37148583e-02 -6.80273771e-02
  7.57682463e-03 -4.09503579e-02  3.72349769e-02 -1.04655609e-01
 -3.49298902e-02  3.47316

## Save the embedding vectors and the dataframe

In [18]:
type(embeddings)

numpy.ndarray

In [19]:
# Save the array in compressed format
np.savez_compressed('compressed_array.npz', array_data=embeddings)

!ls

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


__notebook__.ipynb  compressed_array.npz


In [20]:
# Check the size of the saved file

import os

# Get the size of the file in bytes
file_size_bytes = os.path.getsize('compressed_array.npz')

# Convert bytes to megabytes
file_size_mb = file_size_bytes / (1024 * 1024)

print("File size:", file_size_mb, "MB")

File size: 3289.1370239257812 MB


In [21]:
# How to load the saved array

# Load the compressed array
loaded_embeddings = np.load('compressed_array.npz')

# Access the array by the name you specified ('my_array' in this case)
loaded_embeddings = loaded_embeddings['array_data']

loaded_embeddings.shape

(2421966, 384)

In [22]:
# Save the DataFrame in compressed format

df_data.to_csv('compressed_dataframe.csv.gz', compression='gzip', index=False)

!ls

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


__notebook__.ipynb  compressed_array.npz  compressed_dataframe.csv.gz


In [23]:
# How to load the compressed DataFrame

df = pd.read_csv('compressed_dataframe.csv.gz', compression='gzip')

print(df.shape)

df.head(2)

  df = pd.read_csv('compressed_dataframe.csv.gz', compression='gzip')


(2421966, 6)


Unnamed: 0,id,title,abstract,categories,cat_text,prepared_text
0,704.0001,Calculation of prompt diphoton production cros...,A fully differential calculation in perturbati...,hep-ph,High Energy Physics - Phenomenology,Calculation of prompt diphoton production cros...
1,704.0002,Sparsity-certifying Graph Decompositions,"We describe a new algorithm, the $(k,\ell)$-pe...",math.CO cs.CG,"Combinatorics, Computational Geometry",Sparsity-certifying Graph Decompositions {titl...


## How to set up FAISS for Exhaustive Search

In an exhaustive search (brute-force search) we compare the query vector to every vector in the database. Therefore, we don't need to train an index. 

You'll need to have watched this video to understand all that we are going to do next:<br>
Faiss - Introduction to Similarity Search<br>
https://www.youtube.com/watch?v=sKyvsdEv6rk

In [24]:
import faiss

embed_length = embeddings.shape[1]

index = faiss.IndexFlatL2(embed_length)

# Check if the index is trained.
# No training needed when using greedy search i.e. IndexFlatL2
index.is_trained

True

In [25]:
# Add the embeddings to the index

index.add(embeddings)

# Check the total number of embeddings in the index
index.ntotal

2421966

In [26]:
# Run a query

query_text = """
I want to create an invisibility cloak similar to the one in Harry Potter.
"""
query = [query_text]


# Vectorize the query string
query_embedding = model.encode(query)

# Set the number of outputs we want
top_k = 3

# Run the query
# index_vals refers to the chunk_list index values
scores, index_vals = index.search(query_embedding, top_k)

print(index_vals)
print(scores)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[[236317 192746 581272]]
[[0.5331528  0.68929166 0.7126996 ]]


In [27]:
# Let's print the first search result

pred_indexes = index_vals[0]

i = 0
chunk_index = pred_indexes[i]
text = chunk_list[chunk_index]

text

'Harry Potter\'s Cloak {title} The magic "Harry Potter\'s cloak" has been the dream of human beings for really long time. Recently, transformation optics inspired from the advent of metamaterials offers great versatility for manipulating wave propagation at will to create amazing illusion effects. In the present work, we proposed a novel transformation recipe, in which the cloaking shell somehow behaves like a "cloaking lens", to provide almost all desired features one can expect for a real magic cloak. The most exciting feature of the current recipe is that an object with arbitrary characteristics (e.g., size, shape or material properties) can be invisibilized perfectly with positive-index materials, which significantly benefits the practical realization of a broad-band cloaking device fabricated with existing materials. Moreover, the one concealed in the hidden region is able to undistortedly communicate with the surrounding world, while the lens-like cloaking shell will protect the 

## Set up FAISS - Nearest Neighbor Search

Exhaustive (brute-force) search can be slow when searching over a large number of vectors. A nearest neighbor search is faster.

In [28]:
# How many clusters (voronoid cells) do we want?
# Example: For 4 centroilds we need at least 156 embeddings in
# order to train the index.
num_centroids = 5

quantizer = faiss.IndexFlatL2(embed_length)

index = faiss.IndexIVFFlat(quantizer, embed_length, num_centroids)

In [29]:
# Train the index
# After the index is trained it's ready to receive data

index.train(embeddings)

index.is_trained

True

In [30]:
# Add the embeddings to the index

index.add(embeddings)

# Check how many embeddings are in the index
index.ntotal

2421966

In [31]:
query = [query_text]
query_embedding = model.encode(query)

top_k = 5


# Run the query
# index_vals refers to the chunk_list index values
scores, index_vals = index.search(query_embedding, top_k)

print(index_vals)
print(scores)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[[385286  82609 399548 172403 967389]]
[[0.7165092  0.7456944  0.7569691  0.7853357  0.79582125]]


In [32]:
# Let's print the first search result

pred_indexes = index_vals[0]

i = 3
chunk_index = pred_indexes[i]
text = chunk_list[chunk_index]

text

"Transformation Optics, Generalized Cloaking and Superlenses {title} In this paper, transformation optics is presented together with a generalization of invisibility cloaking: instead of an empty region of space, an inhomogeneous structure is transformed via Pendry's map in order to give, to any object hidden in the central hole of the cloak, a completely arbitrary appearance. Other illusion devices based on superlenses considered from the point of view of transformation optics are also discussed."

## Change the nprobe value

Note: The code from this point onwards will be using the results from this search (index.nprobe = 4). But you could also use the results from the exhaustive search or the nearest neighbors search that we did above.

In [33]:
# So far we've just been searching the cell with 
# the nearest centroid.
# Setting nprobe allows us to search more of
# the nearest cells. e.g. nprobe = 4 means w will search 4 cells.
# This can be done if we were not getting good results and wanted
# to improve performance. The time taken also increases as we are
# comparing to more vectors.

index.nprobe = 4

In [34]:
query = [query_text]
query_embedding = model.encode(query)

top_k = 5

# Run the query
# index_vals refers to the chunk_list index values
scores, index_vals = index.search(query_embedding, top_k)

print(index_vals)
print(scores)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[[236317 192746 581272 385286 231606]]
[[0.5331528  0.68929166 0.7126996  0.7165092  0.7363473 ]]


In [35]:
# Let's print the third search result

pred_indexes = index_vals[0]

i = 3
chunk_index = pred_indexes[i]
text = chunk_list[chunk_index]

text

"Asymmetric Cloaking Theory Based on Finsler Geometry ~ How to design   Harry Potter's invisibility cloak with a scientific method ~ {title} Is it possible to actually make Harry's invisibility cloaks? The most promising approach for realizing such magical cloaking in our real world would be to use transformation optics, where an empty space with a distorted geometry is imitated with a non-distorted space but filled with transformation medium having appropriate permittivity and permeability. An important requirement for true invisibility cloaks is nonreciprocity; that is, a person in the cloak should not be seen from the outside but should be able to see the outside. This invisibility cloak, or a nonreciprocal shield, cannot be created as far as we stay in conventional transformation optics. Conventional transformation optics is based on Riemann geometry with a metric tensor independent of direction, and therefore cannot be used to design the nonreciprocal shield. To overcome this prob

In [36]:
# There's also one other method to speed up the search.
# Please refer to the video mentioned above.

## Re-ranking the predicted results

Vector search compares vectors but reranking compares text. Therefore, reranking improves search results because it re-orders the search results based on the relevance of the search result to the query i.e. it compares each text result to the query text and assigns a score.

In [37]:
from sentence_transformers import CrossEncoder

# We use a cross-encoder to re-rank the results
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

config.json:   0%|          | 0.00/794 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/316 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [38]:
# [1] Run a search

query = [query_text]
query_embedding = model.encode(query)

top_k = 10
D, I = index.search(query_embedding, top_k)

list(I[0])

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

[236317, 192746, 581272, 385286, 231606, 82609, 91810, 60126, 243078, 399548]

In [39]:
# [2] Get the text associated with each search result

pred_list = list(I[0])

# Replace the chunk index values with the corresponding strings
pred_strings_list = [chunk_list[item] for item in pred_list]

pred_strings_list[0]

'Harry Potter\'s Cloak {title} The magic "Harry Potter\'s cloak" has been the dream of human beings for really long time. Recently, transformation optics inspired from the advent of metamaterials offers great versatility for manipulating wave propagation at will to create amazing illusion effects. In the present work, we proposed a novel transformation recipe, in which the cloaking shell somehow behaves like a "cloaking lens", to provide almost all desired features one can expect for a real magic cloak. The most exciting feature of the current recipe is that an object with arbitrary characteristics (e.g., size, shape or material properties) can be invisibilized perfectly with positive-index materials, which significantly benefits the practical realization of a broad-band cloaking device fabricated with existing materials. Moreover, the one concealed in the hidden region is able to undistortedly communicate with the surrounding world, while the lens-like cloaking shell will protect the 

In [40]:
# Format the input for the cross encoder

# The input to the cross_encoder is a list of lists
# [[query_text, pred_text1], [query_text, pred_text2], ...]

cross_input_list = []

for item in pred_strings_list:
    
    new_list = [query[0], item]
    
    cross_input_list.append(new_list)
    

In [41]:
cross_input_list[2]

['\nI want to create an invisibility cloak similar to the one in Harry Potter.\n',
 'Electrostatic Field Invisibility Cloak {title} Invisibility cloak is drawing much attention due to its special camouflage when exposed to physical field varing from wave (electromagnetic field, acoustic field, elastic wave, etc.) to scalar field (thermal field, static magnetic field, dc electric field and mass diffusion). Here, an electrostatic field invisibility cloak has been theoretically investigated, and experimentally demonstrated for the first time to perfectly hide a certain region from sight without disturbing the external electrostatic field. The desired cloaking effect has been achieved via both scattering cancelling technology and transformation optics (TO).This present work will pave a novel way for manipulating of electrostatic field where would enable a wide range of potential applications and sustainable products made available.']

In [42]:
# Put the pred text into a dataframe

df = pd.DataFrame(cross_input_list, columns=['query_text', 'pred_text'])
df['original_index'] = I[0]

df.head()

Unnamed: 0,query_text,pred_text,original_index
0,\nI want to create an invisibility cloak simil...,"Harry Potter's Cloak {title} The magic ""Harry ...",236317
1,\nI want to create an invisibility cloak simil...,A near-perfect invisibility cloak constructed ...,192746
2,\nI want to create an invisibility cloak simil...,Electrostatic Field Invisibility Cloak {title}...,581272
3,\nI want to create an invisibility cloak simil...,Asymmetric Cloaking Theory Based on Finsler Ge...,385286
4,\nI want to create an invisibility cloak simil...,Macroscopic Invisibility Cloak for Visible Lig...,231606


In [43]:
# Now, score all retrieved passages using the cross_encoder

cross_scores = cross_encoder.predict(cross_input_list)

cross_scores

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

array([ 3.7353137 ,  0.9840179 , -1.5424991 ,  4.8000298 , -0.2437487 ,
        0.820004  , -0.37842786, -2.1118193 , -0.68046725, -0.8392715 ],
      dtype=float32)

In [44]:
# Add the scores to the dataframe

df['cross_scores'] = cross_scores

df.head()

Unnamed: 0,query_text,pred_text,original_index,cross_scores
0,\nI want to create an invisibility cloak simil...,"Harry Potter's Cloak {title} The magic ""Harry ...",236317,3.735314
1,\nI want to create an invisibility cloak simil...,A near-perfect invisibility cloak constructed ...,192746,0.984018
2,\nI want to create an invisibility cloak simil...,Electrostatic Field Invisibility Cloak {title}...,581272,-1.542499
3,\nI want to create an invisibility cloak simil...,Asymmetric Cloaking Theory Based on Finsler Ge...,385286,4.80003
4,\nI want to create an invisibility cloak simil...,Macroscopic Invisibility Cloak for Visible Lig...,231606,-0.243749


In [45]:
# Sort the DataFrame in descending order based on the scores

df_sorted = df.sort_values(by='cross_scores', ascending=False)

df_sorted.head(10)

Unnamed: 0,query_text,pred_text,original_index,cross_scores
3,\nI want to create an invisibility cloak simil...,Asymmetric Cloaking Theory Based on Finsler Ge...,385286,4.80003
0,\nI want to create an invisibility cloak simil...,"Harry Potter's Cloak {title} The magic ""Harry ...",236317,3.735314
1,\nI want to create an invisibility cloak simil...,A near-perfect invisibility cloak constructed ...,192746,0.984018
5,\nI want to create an invisibility cloak simil...,Invisibility cloak without singularity {title}...,82609,0.820004
4,\nI want to create an invisibility cloak simil...,Macroscopic Invisibility Cloak for Visible Lig...,231606,-0.243749
6,\nI want to create an invisibility cloak simil...,A complementary media invisibility cloak that ...,91810,-0.378428
8,\nI want to create an invisibility cloak simil...,Homogeneous optical cloak constructed with uni...,243078,-0.680467
9,\nI want to create an invisibility cloak simil...,Photorealistic rendering of unidirectional fre...,399548,-0.839271
2,\nI want to create an invisibility cloak simil...,Electrostatic Field Invisibility Cloak {title}...,581272,-1.542499
7,\nI want to create an invisibility cloak simil...,Cylindrical Cloak with Axial Permittivity/Perm...,60126,-2.111819


In [46]:
# Compare the orginal predicted index order and 
# the re-ranked index order

print('Original order:',I[0])
print('Reranked order:',list(df_sorted['original_index']))

Original order: [236317 192746 581272 385286 231606  82609  91810  60126 243078 399548]
Reranked order: [385286, 236317, 192746, 82609, 231606, 91810, 243078, 399548, 581272, 60126]


In [47]:
# Print the output

# Print three results
num_results = 3

for i in range(0,num_results):
    
    text = df_sorted.loc[i, 'pred_text']
    
    original_index = df_sorted.loc[i, 'original_index']
    arxiv_id = df_data.loc[original_index, 'id']
    cat_text = df_data.loc[original_index, 'cat_text']
    
    # Crete the link to the research paper pdf
    link_to_pdf = f'https://arxiv.org/pdf/{arxiv_id}'
    
    print('Link to pdf:',link_to_pdf)
    print('Categories:',cat_text)
    print('Abstract:',text)
    print()

Link to pdf: https://arxiv.org/pdf/1101.0904
Categories: Classical Physics
Abstract: Harry Potter's Cloak {title} The magic "Harry Potter's cloak" has been the dream of human beings for really long time. Recently, transformation optics inspired from the advent of metamaterials offers great versatility for manipulating wave propagation at will to create amazing illusion effects. In the present work, we proposed a novel transformation recipe, in which the cloaking shell somehow behaves like a "cloaking lens", to provide almost all desired features one can expect for a real magic cloak. The most exciting feature of the current recipe is that an object with arbitrary characteristics (e.g., size, shape or material properties) can be invisibilized perfectly with positive-index materials, which significantly benefits the practical realization of a broad-band cloaking device fabricated with existing materials. Moreover, the one concealed in the hidden region is able to undistortedly communicat

## Use OpenAI to create a natural language output

Here we will use OpenAi to create a summary of each of the paper abstracts in our search results. These will give the user a quick one sentence summary of each abstract.

This step is optional. OpenAi is not free. You'll need to assess whether or not this step provides value to the user. My view is that it doesn't add value. I've not included this step in Part 2.

In [48]:
# Get the top 3 search results
pred_text_list = list(df_sorted['pred_text'])
context = pred_text_list[0:3]

# Create the prompt

prompt = f"""
You will be provided with a list of titles and abstracts 
for research papers: 
{context}
Write a one sentence summary of each abstract at the level 
of a high school student.
"""

In [49]:
from openai import OpenAI

client = OpenAI(api_key=OPENAI_API_KEY)

completion = client.chat.completions.create(
  model="gpt-3.5-turbo-0301",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
  ]
)


print(completion.choices[0].message.content)

1. Theoretical research proposes a way to create an invisibility cloak using Finsler geometry and nonreciprocal shielding with anisotropic, nonreciprocal permittivity and permeability.

2. A novel transformation recipe is proposed to create a magic cloak that can invisibilize objects of any characteristics with positive-index materials, creating a virtual image to protect the cloaked source/sensor from being traced back by outside detectors.

3. A near-perfect invisibility cloak with a diamond cross section is achieved using a two-step coordinate transformation, and the cloak consists of four kinds and eight blocks of homogeneous transformation media.


## Evaluation

I think that it's best to evaluate the performance of this system empirically i.e. by using it. Not only do we want this search system to match our queries to papers that align with our subject matter, but we also want this system to lead us to relevant papers in related fields that we never would have considered.

## Summary of the RAG workflow

[ 1 ] Setup
1. Read in the text from the file
2. Chunk the text and store the chunks in a list
3. Use Sentence Transformer to create a vector for each chunk
4. Use FAISS to create the index (Set up either exhaustive search or nearest neighbor search)
5. Train the index, if needed
6. Load the vectors into the index
7. Set up re-rank

[2] Run similarity search
1. Get the query text (question)
2. Vector encode the question
3. Run similarity search

[3] Run re-rank

[4] Run OpenAi
1. Get the question
2. Get the context
2. Create the prompt
3. Run the prompt through OpenAi
4. Get the natural language answer to the question

# Appendix


## 1. How to do a keyword search using bm25

Keyword search can be useful when looking for things like unusual words or serial numbers. These will be hard to find with a vector search.

This is a simple example that shows how to use the bm25 alogorithm to do a keyword search. Knowing this can be helpful in cases where you want to do a hybrid search i.e. a combination of a keyword search and a vector search.

For hybid search you could try adding the text output from the keyword search to the text list that gets passed to the reranker.

In [50]:
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction import _stop_words
import string

#####

# Note:
# I'm not doing it in this simple example, but you should also:
# - remove punctuation
# - convert all text to lower case
# - remove leading and trailing spaces

# 'document.' will not be matched to 'document'

# This is an example function:
def bm25_tokenizer(text):
    tokenized_doc = []
    for token in text.lower().split():
        token = token.strip(string.punctuation)

        if len(token) > 0 and token not in _stop_words.ENGLISH_STOP_WORDS:
            tokenized_doc.append(token)
    return tokenized_doc

#####



# Sample text
documents = [
    "This is the first document",
    "This document is the second IS999333 document.",
    "And this is the third one.",
    "Is this the first document",
]

# Preprocess documents (split into tokens)
tokenized_documents = [doc.split(" ") for doc in documents]

# Initialize BM25 model
bm25 = BM25Okapi(tokenized_documents)

# Query - Search for a serial number
query = 'IS999333'

# Tokenize the query
tokenized_query = query.split(" ")

# Perform the keyword search
scores = bm25.get_scores(tokenized_query)

# Rank documents based on scores
ranked_documents = sorted(zip(scores, documents), reverse=True)

# Display ranked documents
for score, document in ranked_documents:
    print(f"Score: {score:.2f}, Document: {document}")


Score: 0.77, Document: This document is the second IS999333 document.
Score: 0.00, Document: This is the first document
Score: 0.00, Document: Is this the first document
Score: 0.00, Document: And this is the third one.


## 2. A quick way to implement RAG search on the ArXiv website by using a vector database

Weaviate is a cloud vector database. It's possible to send text to Weviate and have the database create the vectors using OpenAi or Cohere. Also, vector search and reranking can be done inside Weviate. The user search query can be sent as text via the API. The database will conduct the vector search and return the search results as text. Weviate supports API requests via Curl.

Okay, so what does this mean?

1. Vector database setup and weekly updates can be done with Python<br>
The ArXiv vector database can be set up and updated using Python in a Jupyter notebook. Weaviate supports CRUD (create, read, update, and delete) operations. Therefore, the weekly ArXiv updates can simply be uploaded to the database as text. The database will take care of creating vectors. Nothing else is needed.

2. The frontend UI and the backend API code can be created using common web languages like Javascript and Php.
There's no need to use Python because API requests, containing user search queries, can be sent using Curl from Javascript or Php.

The downside is that Weaviate, OpenAi and Cohere are paid solutions. They are not free python packages like FAISS and Sentence Transformers.

This is a good resource to learn how to use Weaviate:

Deeplearning.Ai Short Course<br>
Vector Databases: from Embeddings to Applications<br>
https://www.deeplearning.ai/short-courses/vector-databases-embeddings-applications/

## Conclusion

Many thanks to the team at ArXiv for making this data freely available and for updating it regularly. Also, many thanks to James Briggs whose clear and concise youtube tutorial helped me learn how to use FAISS. And thanks to Kaggle for this free GPU powered notebook environment.

Thank you for reading. Now go build your invisibility cloak.