This repository contains the Python code used for the experiments and figures in the paper.
➤ code for illustrative figures/ contains the code used to generate Figures 1, 2, 3, and 7 from the paper.
➤ synthetic experiment/ contains a main file (synthetic_experiment.py) that outputs a .pkl file and an auxiliary file (latex_table_from_pkl.py) that generates a LaTeX table from the .pkl file, summarizing the results of the experiments.
➤ realdata experiment/ contains the .py files necessary to run the experiments on real-world datasets: encode/ contains the code used to convert the dataset queries into embeddings for each text embedding model used; GHC_all_distances.py contains an implementation of the GHC algorithm supporting multiple distance functions; faiss.py contains an implementation of the GHC algorithm using the FAISS library, which significantly speeds up computation; run_CC.py and run_SKM.py contain implementations of the Center-Based Classifier and Sequential k-Means algorithms, respectively.
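The synthetic pipeline (a .pkl of results turned into a LaTeX table) can be pictured with a minimal stdlib sketch. The result keys, metric, and table layout below are illustrative assumptions, not the repository's actual schema:

```python
import pickle

# Hypothetical results dict: algorithm name -> accuracy (illustrative values).
results = {"GHC": 0.91, "CC": 0.84, "SKM": 0.79}

# synthetic_experiment.py would dump its results roughly like this:
with open("results.pkl", "wb") as f:
    pickle.dump(results, f)

# latex_table_from_pkl.py would read them back and emit LaTeX rows:
with open("results.pkl", "rb") as f:
    loaded = pickle.load(f)

rows = "\n".join(f"{name} & {acc:.2f} \\\\" for name, acc in loaded.items())
table = ("\\begin{tabular}{lc}\n"
         "Algorithm & Accuracy \\\\\n\\hline\n"
         + rows + "\n\\end{tabular}")
print(table)
```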
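run_SKM.py implements Sequential k-Means; as a reference point, the classic online update rule (a generic textbook sketch, not the repository's code) is:

```python
# Sequential (online) k-means: each arriving point moves its nearest
# center by a per-center learning rate of 1/n_j, where n_j is the number
# of points assigned to center j so far.

def closest(centers, point):
    """Index of the center nearest to `point` (squared Euclidean)."""
    dists = [sum((c - p) ** 2 for c, p in zip(center, point))
             for center in centers]
    return dists.index(min(dists))

def sequential_kmeans(stream, centers):
    counts = [0] * len(centers)
    centers = [list(c) for c in centers]  # copy so inputs stay untouched
    for point in stream:
        j = closest(centers, point)
        counts[j] += 1
        lr = 1.0 / counts[j]
        centers[j] = [c + lr * (p - c) for c, p in zip(centers[j], point)]
    return centers

# Two well-separated 1-D clusters: each center stays near its cluster.
stream = [(0.0,), (0.2,), (10.0,), (9.8,), (0.1,), (10.2,)]
final = sequential_kmeans(stream, [(0.0,), (10.0,)])
```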
The datasets are not directly provided in this repository due to their size. They are available at the following anonymous Google Drive link:
➤ https://drive.google.com/drive/folders/1K2A_8CkU6fXjy9dsDotJi8gJYrt2viNo?usp=sharing ⮜
There, the *_embeddings/ folders contain the datasets used for evaluation and the computed embeddings of the queries.
Install the necessary libraries before running the real data experiments:
```bash
pip install torch torchvision transformers datasets faiss-cpu
```
Run the FAISS-based experiment:

```bash
python faiss.py
```
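FAISS speeds up the nearest-neighbor lookups at the core of the algorithm. Conceptually, it replaces the brute-force scan below (a stdlib-only sketch of exact nearest-neighbor search, not the repository's code) with an optimized index such as `faiss.IndexFlatL2`:

```python
# Brute-force exact nearest neighbor over a list of embedding vectors.
# FAISS builds an index that answers the same query much faster.

def nearest_neighbor(database, query):
    """Return (index, squared_distance) of the closest database vector."""
    best_i, best_d = -1, float("inf")
    for i, vec in enumerate(database):
        d = sum((a - b) ** 2 for a, b in zip(vec, query))
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

db = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
idx, dist = nearest_neighbor(db, (0.9, 1.2))
```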
Run with a custom distance:

```bash
python GHC_all_distances.py --distance_type [your_choice]
```
Replace [your_choice] with one of the supported distance types (NEARESTQUERY, EUCLIDEAN, SPHERICAL).
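The EUCLIDEAN and SPHERICAL options correspond to two standard vector distances. The sketch below uses the common textbook definitions (spherical taken as cosine/angular distance); the exact formulas in GHC_all_distances.py may differ, and NEARESTQUERY is omitted since it is specific to the paper:

```python
import math

def euclidean(u, v):
    """Standard L2 (Euclidean) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def spherical(u, v):
    """Cosine distance: 1 minus the cosine similarity of the two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

u, v = (1.0, 0.0), (0.0, 1.0)  # orthogonal unit vectors
```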
MIT License.