Academic Paper Semantic Search is a local search engine that provides plain-text search over a collection of PDF papers.
Search methods include neural-network-based search (BERT-based) and simple dictionary-based matching (BM25).
It requires human intelligence on top of artificial intelligence, so it will not be a local ChatGPT.

The current version is designed to run on laptops without a GPU, so large neural network models are not used. However, this program modularizes the search algorithm, so you can easily swap in state-of-the-art models. Lite versions are strongly recommended.

If you are interested in fine-tuning large models with limited GPU resources, I strongly recommend one of the BOOM papers below (or the first figure). Training a model with 100+ billion parameters requires a different skill set. It is easy to torture one Nvidia 2080 Ti for a year, but that is not enough.
- Create a virtual environment. Recommend using `mamba` rather than `conda` to install packages.

  ```shell
  conda create -p env/ python=3.9
  conda activate env/
  pip install --upgrade pip
  pip install farm-haystack[sql,only-faiss,inmemorygraph] streamlit st-annotated-text
  ```

- You may want the GPU version if possible:

  ```shell
  conda activate env/
  pip install farm-haystack[only-faiss-gpu] transformers[torch]
  ```

  Windows users may have trouble installing faiss-gpu in farm-haystack. An alternative is:

  ```shell
  conda activate env/
  conda install -c conda-forge faiss-gpu
  ```

- Copy the `data/db-*` folders to `data/` and run:

  ```shell
  ../env/python -m streamlit run ui/Search.py --server.runOnSave=true --server.address=127.0.0.1
  ```

- This script starts FastAPI for queries:

  ```shell
  %~dp0./env/python.exe -m uvicorn rest_api.search_rest_gunicorn:app --host 127.0.0.1 --port 7999 --workers 1
  ```

- This script starts the webserver:

  ```shell
  %~dp0./env/python -m streamlit run ui/Search.py --server.runOnSave=true --server.address=127.0.0.1
  ```

  Note: without `--server.address=127.0.0.1`, `streamlit` will broadcast your IP address to the world.
- Install Zotero, and the plug-ins `ZotFile` and `DOI Manager`
  - `Tools -> ZotFile Preferences -> use subfolder defined by`: `[%a](%y){ %t}`
  - `ZotFile Preferences -> Renaming Rules -> Format for all Items & Patents`: `[%a](%y){ %t}`
  - `ZotFile Preferences -> Tablet Settings`: check `use ZotFile to send and get files from tablet`, and set the `base folder`
- Import PDF papers into Zotero, obtain the `doi`, and clean up the metadata
- Select all PDFs, right click, `Manage Attachments`, `Send to Tablet`
- Use `Adobe Pro` or `Abbyy` for batch text recognition
  - Use Adobe Pro to recognize the PDFs and export them to Word documents
  - Use Pandoc to convert to plain text:

    ```shell
    pandoc -f docx -i file_name.docx -t plain -o file_name.txt
    ```
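The Pandoc step above can be batch-scripted. Here is a minimal sketch; the `data/docx` folder name and the function names are my own, not the project's:

```python
# Convert every .docx in a folder to plain text via pandoc (assumes pandoc is on PATH).
import subprocess
from pathlib import Path

def pandoc_cmd(docx: Path) -> list[str]:
    """Build the pandoc command for one .docx -> .txt conversion."""
    return ["pandoc", "-f", "docx", "-i", str(docx),
            "-t", "plain", "-o", str(docx.with_suffix(".txt"))]

def batch_convert(folder: str) -> None:
    """Run pandoc on every .docx file found in `folder`."""
    for docx in sorted(Path(folder).glob("*.docx")):
        subprocess.run(pandoc_cmd(docx), check=True)

if __name__ == "__main__":
    batch_convert("data/docx")  # hypothetical folder of exported Word files
```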
Use a virtual environment to manage the Python packages; many of them may conflict with the packages you already have installed.
- Download the GROBID docker image and the Python client. The CRF-only image is enough.
- Run `src/extract_text.py` to convert the PDFs to `tei.xml` format and parse them into plain text
  - May need a spell check
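The real conversion lives in `src/extract_text.py`; as an illustration of what parsing GROBID's TEI output involves, here is a minimal standard-library sketch (function names are mine; only the standard TEI namespace is assumed):

```python
# Pull plain text out of a GROBID TEI document by collecting every <p> element.
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def tei_paragraphs(tei_xml: str) -> list[str]:
    """Return the text content of every <p> element in a TEI document."""
    root = ET.fromstring(tei_xml)
    return ["".join(p.itertext()).strip() for p in root.iter(f"{TEI_NS}p")]

if __name__ == "__main__":
    sample = (
        '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
        "<div><p>First paragraph.</p><p>Second paragraph.</p></div>"
        "</body></text></TEI>"
    )
    print(tei_paragraphs(sample))  # ['First paragraph.', 'Second paragraph.']
```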
- Recommend `mamba` to install packages
- Install the haystack Python package or docker image
  - need to install faiss; given the small number of documents, the CPU version is fast enough
  - if possible, recommend `mamba install -c conda-forge libfaiss-avx2`
  - if you want to embed documents, you need the `transformers[torch]` package
- Clone the GROBID Python client:

  ```shell
  git clone https://github.com/kermitt2/grobid_client_python
  ```

- Run `src/build_database.py`
- Simple dictionary search uses BM25 and an in-memory database.
- Documents are chunked into sentences; each chunk has at most 300 words, with a 10-word overlap
  - May need to tune the sentence length for better performance
- Neural-network-based search uses sentence-transformers to embed words; the data are stored in FAISS
  - Documents are chunked into sentences with at most 100 words
  - Embedding models used (need a GPU for fast processing):
    - `sentence-transformers/multi-qa-mpnet-base-dot-v1`
    - `sentence-transformers/msmarco-distilbert-base-tas-b`
- The databases are in the `data` folder. Copy `db-faiss` and `db-inmemory` to `deploy/data/`
- Copy the haystack-demos and modify the scripts in the `ui` folder. The main part is `webapp.py`
- The sample scripts use the `haystack` API and docker, but you can run your script directly without docker.
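The chunking scheme described above can be sketched as a simple word-window splitter. This is an illustrative re-implementation, not the project's code; the function and variable names are my own:

```python
# Split a document into word windows, consecutive windows sharing `overlap`
# words. Window sizes follow the README: 300 words with a 10-word overlap
# for the BM25 index, 100 words for the embedding index.
def chunk_words(text: str, max_words: int, overlap: int = 0) -> list[str]:
    """Chunks of at most max_words words; requires max_words > overlap."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(650))       # toy 650-word document
bm25_chunks = chunk_words(doc, max_words=300, overlap=10)  # for BM25
dense_chunks = chunk_words(doc, max_words=100)             # for embeddings
```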
Windows users who have an Intel CPU and want faster matching speed may want to compile the AVX2 version from source. You can link the MKL library for faster speed.
- Install `visual studio 2019` (desktop development with C++), the `cuda toolkit`, the `Intel OneAPI toolkit`, and `swig`
- Use a conda env to activate the desired Python version
- Download the latest `faiss` release
- Assuming installation with default settings, in `cmd`, activate the environment variables:

  ```shell
  conda activate "path to desired python env"
  "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
  ```

- The MKL library will be loaded automatically. If you don't need the GPU, set `-DFAISS_ENABLE_GPU=OFF`:

  ```shell
  "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" -B build ^
    -DFAISS_ENABLE_PYTHON=ON ^
    -DFAISS_ENABLE_GPU=ON ^
    -DBUILD_SHARED_LIBS=ON ^
    -DCMAKE_BUILD_TYPE=Release ^
    -DFAISS_OPT_LEVEL=avx2 ^
    -DBUILD_TESTING=OFF
  ```

- `maxCpuCount` defaults to 1; including the switch without a number will use all cores:

  ```shell
  MSBuild.exe build/faiss/faiss_avx2.vcxproj /property:Configuration=Release /maxCpuCount:12
  ```

- Or you may use Visual Studio to open `faiss/build/ALL_BUILD.vcxproj`, select `release`, and build `swigfaiss_avx2`
- Build the Python wheel:

  ```shell
  cd build/faiss/python/
  python setup.py bdist_wheel
  python setup.py install
  ```
- Spell check for plain text
- Fine-tune the embedding models
- Check quality & runtime for a joint model: combine multiple embedding models for neural-network-based search
