In [1]:
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding

import os
import sys
import glob
import pandas as pd
import numpy as np
import pynndescent as nn

# Add the current directory to the path
sys.path.append(os.getcwd())
from preprocess import prepare_PDF



In [2]:
glob.glob(r"C:\Users\Steven\Desktop\*.pdf")

[]

In [3]:
# define paths
main = r"C:\Users\Steven\Documents\Python\super-search"
data = f"{main}/data/tests"
test_file = f"{data}/32286.pdf"

In [5]:
df = pd.DataFrame.from_dict(prepare_PDF(test_file))
df

Unnamed: 0,raw_chunk,processed_chunk,file_path
0,NBER WORKING PAPER SERIES\nPROMOTING PUBLIC HE...,PROMOTING PUBLIC HEALTH WITH BLUNT INSTRUMENTS...,C:\Users\Steven\Documents\Python\super-search/...
1,"32286\nMarch 2024\nJEL No. H70,I1,I11,J20\nABS...","32286 March 2024 JEL No. H70,I1,I11,J20 study ...",C:\Users\Steven\Documents\Python\super-search/...
2,"drawn to healthcare, relaxing shortages. On th...","drawn to healthcare, relaxing shortages. On ot...",C:\Users\Steven\Documents\Python\super-search/...
3,and out of the industry. Our findings suggest ...,and out industry. findings suggest vaccine man...,C:\Users\Steven\Documents\Python\super-search/...
4,Findings \nsuggest trade-offs faced by health ...,Findings suggest trade-offs faced health polic...,C:\Users\Steven\Documents\Python\super-search/...
...,...,...,...
242,vertical lines.\n55\n Table A1: Healthcare ind...,vertical lines. 55 Table A1 Healthcare industr...,C:\Users\Steven\Documents\Python\super-search/...
243,"(Flood et al., 2023).\n56\n Table A2: Healthca...","(Flood et al., 2023). 56 Table A2 Healthcare O...",C:\Users\Steven\Documents\Python\super-search/...
244,and ophthalmic medical technician\nall other\n...,and ophthalmic medical technician all other 35...,C:\Users\Steven\Documents\Python\super-search/...
245,Population Survey. We use the variable OCC inc...,Population Survey. use variable OCC included i...,C:\Users\Steven\Documents\Python\super-search/...


In [6]:
prepare_PDF(test_file)['processed_chunk'][0]

'PROMOTING PUBLIC HEALTH WITH BLUNT INSTRUMENTS EVIDENCE FROM VACCINE MANDATES Rahi Abouk John S. Earle Johanna Catherine Maclean Sungbin Park Working Paper 32286 http //www.nber.org/papers/w32286 1050 Massachusetts Avenue Cambridge, MA 02138 March 2024 Research reported in publication supported National Institute on Mental Health National Institutes Health under Award Number 1R01MH132552 (PI Johanna Catherine Maclean). John Earle also acknowledges support Russell Sage Foundation. views expressed herein authors not necessarily reflect views National Institutes Health or. NBER working papers circulated discussion comment purposes. not peer-reviewed or subject to review NBER Board Directors accompanies official NBER publications. 2024 Rahi Abouk, John S. Earle, Johanna Catherine Maclean, Sungbin Park. All rights reserved. Short sections text, not to exceed two paragraphs, may quoted without explicit permission provided full credit, including notice, given to source. Promoting Public Heal

## To-Do:

### Database management
- ~~We need a system to manage the vector database and allow fast retrieval of the file corresponding to each vector~~
    - ~~Each file needs a file_id~~
        - ~~Links to the filepath~~
    - ~~Each chunk needs a chunk_id~~
        - ~~Links to the chunk text~~
            - ~~Importantly, link to the original text, not the processed text used for embedding.~~
                - ~~**I think the solution is to do the sentence chunking on the original text, before processing**~~
    - ~~file_id + chunk_id should uniquely identify a vector in the database~~
- ~~After taking a user query and encoding it, perform similarity search in the database
    to identify a row, then link to the row's filepath and text~~
- ~~Then print a hyperlink to the file, and print the original text~~
- **Problem**: how to return images, given that we first caption them with an LLM and then encode the caption?
    - Also, should we treat each image description as a chunk, or subchunk the images for more accuracy?
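
A minimal sketch of the ID scheme above, assuming the chunks arrive in the same dict-of-lists shape that `prepare_PDF` returns (`raw_chunk`, `file_path`). The helper name `assign_ids` and the toy file names are hypothetical. It keeps the raw text next to each ID so that a hit in the vector index can be mapped back to the original chunk and its file.

```python
def assign_ids(chunk_data):
    """Attach a file_id and a per-file chunk_id to every chunk (hypothetical helper)."""
    file_ids = {}     # file_path -> file_id
    next_chunk = {}   # file_id -> next unused chunk_id
    rows = []
    for raw, path in zip(chunk_data["raw_chunk"], chunk_data["file_path"]):
        # A new path gets the next integer; a known path reuses its id.
        file_id = file_ids.setdefault(path, len(file_ids))
        chunk_id = next_chunk.get(file_id, 0)
        next_chunk[file_id] = chunk_id + 1
        rows.append({
            "file_id": file_id,
            "chunk_id": chunk_id,
            "file_path": path,
            "raw_chunk": raw,   # original text, not the processed text
        })
    return rows


# Toy input standing in for the output of prepare_PDF on two files.
example = {
    "raw_chunk": ["a1", "a2", "b1"],
    "file_path": ["a.pdf", "a.pdf", "b.pdf"],
}
rows = assign_ids(example)

# (file_id, chunk_id) uniquely identifies a row, and therefore a vector.
lookup = {(r["file_id"], r["chunk_id"]): r for r in rows}
```

Row `i` of `rows` lines up with row `i` of the embedding matrix, so an index hit can be turned into a `(file_id, chunk_id)` pair and then into a path and raw text.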

### Lexical search
- Use bm25s package to create ngram indices
- Incorporate lexical search into the queries to improve accuracy on exact matches to key phrases.
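
One possible way to combine lexical and semantic results is reciprocal rank fusion (RRF). This is a swapped-in technique, not something the notes commit to. The sketch below assumes each search returns a ranked list of chunk indices. The semantic list reuses the indices from the query test in this notebook, while the lexical list is made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk indices into a single ranking.

    Each list contributes 1 / (k + rank) to a chunk's score; k=60 is the
    constant commonly used with RRF.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_idx in enumerate(ranking, start=1):
            scores[chunk_idx] = scores.get(chunk_idx, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk index for determinism.
    return sorted(scores, key=lambda idx: (-scores[idx], idx))


semantic_hits = [10829, 10830, 10851]  # from the vector index query
lexical_hits = [10830, 20, 10829]      # hypothetical BM25 result
fused = reciprocal_rank_fusion([semantic_hits, lexical_hits])
```

Chunks that appear in both lists rise to the top, which is the behavior wanted for exact key-phrase matches.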

### Misc.
- Parallelized PDF processing
- Timing everything to understand where the bottlenecks are
- Save performance statistics to report in the app
    - How many PDF pages have been read? 
- ~~Switch to PyMuPDF~~ (It's much faster!!!)
- ~~Save page number to the index~~
    - ~~This actually doesn't seem possible because chunking works by combining all text to a single line.~~
- Allow reading text files, including code (.py, .R, .do, .sql)
    - pymupdf can do this
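
For the timing item, a small sketch of a per-stage timer that accumulates statistics which could later be reported in the app. The names `timed`, `report`, and the stage labels are hypothetical, and `sum(range(1000))` is only a stand-in for real work such as `prepare_PDF(path)` or `model.encode(...)`.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_seconds = defaultdict(float)  # total wall-clock time per stage
stage_calls = defaultdict(int)      # number of timed calls per stage


@contextmanager
def timed(stage):
    """Accumulate the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start
        stage_calls[stage] += 1


def report():
    """Return calls, total seconds, and mean seconds for each stage."""
    return {
        stage: {
            "calls": stage_calls[stage],
            "total_s": stage_seconds[stage],
            "mean_s": stage_seconds[stage] / stage_calls[stage],
        }
        for stage in stage_seconds
    }


for _ in range(3):
    with timed("parse"):
        sum(range(1000))  # stand-in for prepare_PDF(path)
with timed("encode"):
    sum(range(1000))      # stand-in for model.encode(chunks)

stats = report()
```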

### Incorporating Images
- Use PyMuPDF to extract images from each page of the PDF.
- Goal is to pass each image to a multi-modal LLM for summary, which is then fed into the cleaned text.
    - Possibly LLaVA for describing the images.
    - Since captioning the images would take a super long time, this should be an optional step.
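        - Ideally would be done after first parsing the text, but then you might have to regenerate the whole vector database
- **Why not just use an image encoder directly (if available)? Is it bad to mix embeddings from different encoders? Probably: unless the encoders were trained into a shared space, their vectors are not comparable, so distances across them are meaningless.**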
    - Instead could encode images separately from the text, and have a separate search function for them.

### GUI
- The following should be customizable inputs:
    - Chunk token size ("larger is faster but less accurate") (default: 256)
    - Chunk overlap ("larger gives more context per chunk") (default: chunk_size / 4)
    - Choice of sentence transformer: provide a few options based on speed/accuracy tradeoff.
        - Fastest: static-retrieval-mrl-en-v1
        - Medium: bge-m3
        - Slowest: gte-large-en-v1.5
        - (these are subject to change)
    - Index database save location
    - Similarity metric (default: cosine)
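
To make the chunk size and overlap settings concrete, here is a sketch of fixed-size windowing with the default overlap of `chunk_size // 4`. It works on a plain list of tokens and is an assumption for illustration only; the chunker actually used in `preprocess` may split differently (for example on sentence boundaries).

```python
def chunk_tokens(tokens, chunk_size=256, overlap=None):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if overlap is None:
        overlap = chunk_size // 4  # default from the notes above
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


small = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
default = chunk_tokens(list(range(500)))  # chunk_size=256, overlap=64
```

A larger overlap shrinks the step between windows, so the same text produces more chunks and more vectors to encode and store.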

In [4]:
# Current encoding model implementation: static-retrieval-mrl-en-v1
# https://huggingface.co/sentence-transformers/static-retrieval-mrl-en-v1
# Model defaults to 1024 dense dimensions, but can be truncated to save space/time

truncated_dimensions = 1024

model = SentenceTransformer(
    "sentence-transformers/static-retrieval-mrl-en-v1"
    , device="cpu"
    , truncate_dim=truncated_dimensions
    )
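
As a rough illustration of what truncation trades away, the sketch below keeps the leading dimensions of each vector and re-normalizes. It uses random data as a stand-in for `model.encode` output, and `truncate_and_normalize` is a hypothetical helper, not part of sentence-transformers. At float32, 19,619 vectors of 1024 dimensions take about 80 MB; 256 dimensions would take about 20 MB.

```python
import numpy as np


def truncate_and_normalize(vectors, dim):
    """Keep the first `dim` dimensions of each row and rescale to unit length."""
    truncated = np.asarray(vectors, dtype=np.float32)[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)  # guard against zero rows


rng = np.random.default_rng(0)
full = rng.normal(size=(5, 1024)).astype(np.float32)  # stand-in embeddings
small_vecs = truncate_and_normalize(full, 256)

bytes_full = full.nbytes         # 5 * 1024 * 4 bytes
bytes_small = small_vecs.nbytes  # 5 * 256 * 4 bytes
```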


In [5]:
## TESTING
# Importing a lot of PDFs to see how long this takes
papers_repo = r"C:\Users\Steven\Documents\Python\Data\NBER papers"

files = os.listdir(papers_repo)
files.sort(reverse=True)

full_dict = {
    'raw_chunk': []
    , 'processed_chunk': []
    , 'file_path': []
}

for counter, paper in enumerate(files[0:100], start=1):
    f = f"{papers_repo}/{paper}"
    iter_dict = prepare_PDF(f)
    full_dict['raw_chunk'].extend(iter_dict['raw_chunk'])
    full_dict['processed_chunk'].extend(iter_dict['processed_chunk'])
    full_dict['file_path'].extend(iter_dict['file_path'])
    # Report progress every fifth file
    if counter % 5 == 0:
        print(f"Finished file {counter}.")

df = pd.DataFrame.from_dict(full_dict)
df

# Currently takes around 1 second per file (with tokenization chunking)
# Takes ~ 0.7 seconds with approximate chunking
# After switching to PyMuPDF, 0.1 seconds per file, but 3 cmsOpenProfileFromMem errors

Finished filed 5.
Finished filed 10.
Finished filed 15.
Finished filed 20.
Finished filed 25.
Finished filed 30.
Finished filed 35.
Finished filed 40.
Finished filed 45.
Finished filed 50.
Finished filed 55.
Finished filed 60.
Finished filed 65.
Finished filed 70.
Finished filed 75.
MuPDF error: format error: cmsOpenProfileFromMem failed

Finished filed 80.
Finished filed 85.
Finished filed 90.
Finished filed 95.
MuPDF error: format error: cmsOpenProfileFromMem failed

Finished filed 100.


Unnamed: 0,raw_chunk,processed_chunk,file_path
0,NBER WORKING PAPER SERIES\nRETIREMENT AND THE ...,RETIREMENT AND THE EVOLUTION OF PENSION STRUCT...,C:\Users\Steven\Documents\Python\Data\NBER pap...
1,retirement ages. In this paper we find that th...,retirement ages. In paper find absence age-rel...,C:\Users\Steven\Documents\Python\Data\NBER pap...
2,NBER\nlfriedberg@virginia.edu\nAnthony Webb\nI...,NBER lfriedberg virginia.edu Anthony Webb Inte...,C:\Users\Steven\Documents\Python\Data\NBER pap...
3,in 1983 to 44% in 1998.1 \nPension wealth in ...,in 1983 to 44% in 1998.1 Pension wealth in tra...,C:\Users\Steven\Documents\Python\Data\NBER pap...
4,early on in order to gain access to \nlarge fu...,early on in order to gain access to large futu...,C:\Users\Steven\Documents\Python\Data\NBER pap...
...,...,...,...
19614,effects (as percentage of steady state consump...,effects (as percentage steady state consumptio...,C:\Users\Steven\Documents\Python\Data\NBER pap...
19615,The policy rule is \nˆ\n5.0\n0.0\n0.0\nt\nt\nt...,The policy rule ˆ 5.0 0.0 0.0 t t t t i Y s π ...,C:\Users\Steven\Documents\Python\Data\NBER pap...
19616,\n1.04 \n0.00 \n0.02 \n0.01 \n \n \n \n \n \nS...,1.04 0.00 0.02 0.01 Stochastic steady state d...,C:\Users\Steven\Documents\Python\Data\NBER pap...
19617,\n-0.086 \n-0.081 \n capital stock (foreign) \...,-0.086 -0.081 capital stock (foreign) 0.431 0....,C:\Users\Steven\Documents\Python\Data\NBER pap...


In [7]:
vecs = model.encode(df['processed_chunk'])
# This returns a np array of shape (n, d), where n is the
#     number of chunks and d is the embedding dimension.

df['vector'] = list(vecs)
# Add the embeddings to our dataframe as a single column,
#     so each cell contains the d-dimensional np vector.

df
# takes 3-4 seconds

Unnamed: 0,raw_chunk,processed_chunk,file_path,vector
0,NBER WORKING PAPER SERIES\nRETIREMENT AND THE ...,RETIREMENT AND THE EVOLUTION OF PENSION STRUCT...,C:\Users\Steven\Documents\Python\Data\NBER pap...,"[0.52801526, 0.42685816, 0.7900874, -2.884799,..."
1,retirement ages. In this paper we find that th...,retirement ages. In paper find absence age-rel...,C:\Users\Steven\Documents\Python\Data\NBER pap...,"[0.10414995, 1.8608961, -0.58080816, -3.008655..."
2,NBER\nlfriedberg@virginia.edu\nAnthony Webb\nI...,NBER lfriedberg virginia.edu Anthony Webb Inte...,C:\Users\Steven\Documents\Python\Data\NBER pap...,"[1.413939, 2.5478559, -0.47449192, -3.7103095,..."
3,in 1983 to 44% in 1998.1 \nPension wealth in ...,in 1983 to 44% in 1998.1 Pension wealth in tra...,C:\Users\Steven\Documents\Python\Data\NBER pap...,"[1.7579198, 0.9669083, -1.406159, -2.7612479, ..."
4,early on in order to gain access to \nlarge fu...,early on in order to gain access to large futu...,C:\Users\Steven\Documents\Python\Data\NBER pap...,"[1.0870267, 0.1609143, -1.9778296, -2.9505513,..."
...,...,...,...,...
19614,effects (as percentage of steady state consump...,effects (as percentage steady state consumptio...,C:\Users\Steven\Documents\Python\Data\NBER pap...,"[-1.0131456, -0.84284586, 2.846375, 1.1860936,..."
19615,The policy rule is \nˆ\n5.0\n0.0\n0.0\nt\nt\nt...,The policy rule ˆ 5.0 0.0 0.0 t t t t i Y s π ...,C:\Users\Steven\Documents\Python\Data\NBER pap...,"[-0.38758105, -0.39671537, 0.31152856, 0.02586..."
19616,\n1.04 \n0.00 \n0.02 \n0.01 \n \n \n \n \n \nS...,1.04 0.00 0.02 0.01 Stochastic steady state d...,C:\Users\Steven\Documents\Python\Data\NBER pap...,"[-0.30377403, -0.79760385, -0.4383789, 0.20054..."
19617,\n-0.086 \n-0.081 \n capital stock (foreign) \...,-0.086 -0.081 capital stock (foreign) 0.431 0....,C:\Users\Steven\Documents\Python\Data\NBER pap...,"[-0.25683185, -1.5001143, 0.6960092, 0.1808855..."


In [33]:
# testing querying the index

query = 'Madison and Jefferson reaction in January and February 1792' # reference to paper 9943

# Encode the query
query_vec = model.encode(query)

# Search for nearest neighbors in the df



In [13]:
vecs.shape

(19619, 1024)

In [16]:
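# Build the approximate nearest-neighbor index over all chunk vectors
index = nn.NNDescent(vecs)
# Prepare the index ahead of time so the first query is fast
index.prepare()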

In [34]:
index.query(query_vec.reshape(1,-1), k=3)
# 10829, 10830, 10851

(array([[10829, 10830, 10851]], dtype=int32),
 array([[133.33989, 139.48766, 140.08379]], dtype=float32))
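
The distances above (133 to 140) look like Euclidean distances, which is the default metric of `NNDescent`; a cosine index would need `metric="cosine"` at construction. As a sanity check on the approximate index, an exact brute-force cosine search over the same matrix can be sketched as below. `top_k_cosine` is a hypothetical helper and the toy vectors stand in for `vecs` and `query_vec`.

```python
import numpy as np


def top_k_cosine(query_vec, vecs, k=3):
    """Exact top-k search by cosine similarity (higher means more similar)."""
    vecs = np.asarray(vecs, dtype=np.float32)
    query_vec = np.asarray(query_vec, dtype=np.float32).ravel()
    denom = np.linalg.norm(vecs, axis=1) * np.linalg.norm(query_vec)
    sims = (vecs @ query_vec) / np.clip(denom, 1e-12, None)
    top = np.argsort(-sims)[:k]  # indices of the k largest similarities
    return top, sims[top]


toy_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_query = np.array([2.0, 0.1])
idx, sims = top_k_cosine(toy_query, toy_vecs, k=2)
```

At roughly 20,000 chunks a full matrix product like this is cheap, so it can serve as a ground truth when tuning the approximate index.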

In [35]:
for i in [10829, 10830, 10851]:
    print(df['raw_chunk'][i])
    print(df['file_path'][i])

strikingly similar to Hamilton’s earlier report that Jefferson and
 -3-
1  Annals of Congress, 1 (January 8, 1790), p. 969.
2  Annals of Congress, 1 (January 15, 1790), p. 1095.
Madison opposed.  
Given the report’s importance in the history of U.S. economic policy, this paper explores
the reception and immediate legislative impact of the report.  After briefly reviewing the
contents and proposals in the December 1791 report, the paper turns to Madison’s and
Jefferson’s reaction to it in January and February 1792.  In February and March 1792, Congress
debated bounties for the cod fisheries and additional revenue proposals involving tariffs, both of
which related to Hamilton’s report.  Finally, the paper examines the turn of manufacturing
interests away from the Federalists as the Jeffersonian Republican policy of reciprocity offered
the hope of
C:\Users\Steven\Documents\Python\Data\NBER papers/9943.pdf
1791 report, the paper turns to Madison’s and
Jefferson’s reaction to it in January 

In [1]:
import pymupdf
from preprocess import setup_chunker, preprocess

in_path=r"C:\Users\sevan\Desktop\Meet with Prof. Fulton.txt"
doc = pymupdf.open(in_path)

# combine all pages into one list
paper = []
for page in doc:
    # extract text from page
    page_text = page.get_text()

    # append to paper
    paper.append(page_text)

# convert list into string
paper_one_string = ' '.join(paper)

# chunk the raw text
chunker = setup_chunker()
chunks = chunker(paper_one_string)

# Organize chunks and processed chunks into a dictionary
chunk_data = {
    'raw_chunk': list(chunks)
    , 'processed_chunk': [preprocess(chunk) for chunk in chunks]
    , 'file_path': [in_path for _ in chunks]
}

chunk_data


UnboundLocalError: cannot access local variable '_chunk_size' where it is not associated with a value
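
The source of `setup_chunker` is not shown here, so the cause of this error is unknown. In general, this message appears when a local variable is assigned only on some code paths and then read on a path that skipped the assignment. The sketch below reproduces that pattern with hypothetical function names, modes, and sizes, plus one way to avoid it.

```python
def pick_chunk_size_buggy(mode):
    """_chunk_size is only assigned for known modes (hypothetical example)."""
    if mode == "tokens":
        _chunk_size = 256
    elif mode == "approx":
        _chunk_size = 1024
    return _chunk_size  # fails here if neither branch ran


def pick_chunk_size_fixed(mode):
    """Look the size up explicitly and fail with a clear message otherwise."""
    sizes = {"tokens": 256, "approx": 1024}  # hypothetical values
    if mode not in sizes:
        raise ValueError(f"unknown chunking mode: {mode!r}")
    return sizes[mode]


try:
    pick_chunk_size_buggy("sentences")
    buggy_error = None
except UnboundLocalError as err:
    buggy_error = err

fixed_size = pick_chunk_size_fixed("tokens")
```

If the real cause follows this pattern, the fix is to give `_chunk_size` a value on every path, or to raise an explicit error for unsupported input.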