This repository hosts main.py, a Python script that trawls through a specified directory, processes various document formats (.doc, .xlsx, .pdf, .csv, and .txt), and indexes them using Facebook's FAISS (Facebook AI Similarity Search) library with the help of embeddings generated by OpenAI's models. The primary goal is to create an efficient search and retrieval system for a variety of text documents.
To successfully run this project, ensure that you have installed:
- Python 3.6+
- docx Python library (installed from the python-docx package).
- pandas Python library.
- Loggers provided by the standard logging Python library.
- langchain Python library version 0.1+. This library provides utilities to load various file types (TextLoader, CSVLoader, PyPDFLoader) and embeddings (OpenAIEmbeddings).
- faiss Python library.
To install these dependencies, run the following pip command (logging ships with Python and does not need to be installed):
pip install python-docx pandas langchain faiss-cpu
If your hardware meets the requirements, you can use faiss-gpu instead of faiss-cpu to leverage GPU acceleration.
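Before running the script, it can help to confirm that the packages are importable. The sketch below is not part of main.py; the helper name `missing_modules` is hypothetical. It relies on the fact that the import names differ from the pip names: python-docx is imported as `docx`, and both faiss-cpu and faiss-gpu are imported as `faiss`.

```python
import importlib.util

# Import names, not pip names: python-docx -> docx, faiss-cpu / faiss-gpu -> faiss.
REQUIRED_MODULES = ["docx", "pandas", "langchain", "faiss"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

This only checks that each module can be located; it does not check versions.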
To run the script, follow the steps outlined below:
- Clone this repository to your local machine.
    git clone <repo_url>
- Populate a directory with the documents you wish to process.
- Insert your personal OpenAI key into the script by replacing 'YOUR_OPENAI_KEY'.
    openai_key = 'YOUR_OPENAI_KEY'
- Set the path to your documents directory by replacing '/path/to/your/directory'.
    root_dir = '/path/to/your/directory'
- Run the script.
    python main.py
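If you would rather not hard-code the key in the file, one possible alternative is to read both values from environment variables. This is a sketch under assumptions, not how main.py is written: `load_settings`, `check_settings`, and the variable name `DOCS_ROOT_DIR` are hypothetical, while `OPENAI_API_KEY` is the variable name the OpenAI client library conventionally reads.

```python
import os


def load_settings(environ=os.environ):
    """Read the two values the script needs, falling back to the README placeholders."""
    return {
        "openai_key": environ.get("OPENAI_API_KEY", "YOUR_OPENAI_KEY"),
        # DOCS_ROOT_DIR is a made-up variable name used only for this sketch.
        "root_dir": environ.get("DOCS_ROOT_DIR", "/path/to/your/directory"),
    }


def check_settings(settings):
    """Return a list of problems; an empty list means the settings look usable."""
    problems = []
    if settings["openai_key"] == "YOUR_OPENAI_KEY":
        problems.append("openai_key is still the placeholder")
    if not os.path.isdir(settings["root_dir"]):
        problems.append("root_dir is not an existing directory")
    return problems
```

Calling `check_settings(load_settings())` before the main loop would catch a forgotten placeholder before any documents are processed.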
The script traverses all files in the specified directory and its sub-directories. It converts .doc files into .txt files, and .xls files into .csv files. These converted documents, alongside existing .pdf, .csv, and .txt files, are then loaded into memory one by one. Each file is transformed into an embedding using an OpenAI model, then added to the FAISS index. Once all documents have been processed, the final FAISS index is saved locally as faiss.index.
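The traversal step described above can be sketched with the standard library alone. This is an illustration, not the code in main.py: the function name `plan_files` and the returned dictionary layout are hypothetical, and the extension sets simply mirror the paragraph above. Conversion, embedding, and indexing are left out.

```python
import os

# Extensions taken from the description above: some are converted first, others load as-is.
CONVERT_FIRST = {".doc": ".txt", ".xls": ".csv"}
LOAD_DIRECTLY = {".pdf", ".csv", ".txt"}


def plan_files(root_dir):
    """Walk root_dir recursively and sort files into 'convert', 'load', and 'skipped'."""
    plan = {"convert": [], "load": [], "skipped": []}
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            ext = os.path.splitext(name)[1].lower()
            if ext in CONVERT_FIRST:
                # Keep the source path together with the extension it would be converted to.
                plan["convert"].append((path, CONVERT_FIRST[ext]))
            elif ext in LOAD_DIRECTLY:
                plan["load"].append(path)
            else:
                plan["skipped"].append(path)
    return plan
```

Extensions are lower-cased before matching, so a file named REPORT.PDF is treated the same as report.pdf.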
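Please note that generating embeddings with OpenAI's models means the text of your documents is sent to the OpenAI API, so avoid indexing files you are not permitted to share with that service. File conversion, the FAISS index itself, and the saved faiss.index file stay on your machine.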
Also be aware that the script logs any errors that occur while processing a document. These messages appear in your command-line console output.
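A common way to get this behaviour is to wrap the per-document work in a try/except and report failures through the standard logging module. The sketch below assumes that pattern; the logger name `indexer` and the helper `process_all` are hypothetical and may not match what main.py does.

```python
import logging

logging.basicConfig(level=logging.WARNING, format="%(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("indexer")  # hypothetical logger name


def process_all(paths, process_one):
    """Apply process_one to each path; log failures and continue with the next file."""
    succeeded, failed = [], []
    for path in paths:
        try:
            process_one(path)
            succeeded.append(path)
        except Exception as exc:  # broad on purpose: one bad file should not stop the run
            logger.warning("Could not process %s: %s", path, exc)
            failed.append(path)
    return succeeded, failed
```

Returning the list of failed paths alongside the log messages makes it easy to retry or inspect those files after the run.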
This project is released under the Unlicense, which allows you to use, modify, and distribute it freely.