Codebase of "Spiral of Silence: How is Large Language Model Killing Information Retrieval?—A Case Study on Open Domain Question Answering"

Table of Contents
  1. News and Updates
  2. Introduction
  3. Installation
  4. Usage
  5. Evaluation
  6. More Use Examples

News and Updates

  • [05/12/2024] Published the code used in our experiments.

Introduction

In this study, we construct and iteratively run a simulation pipeline to investigate the short-term and long-term effects of LLM-generated text on retrieval-augmented generation (RAG) systems. (arXiv)

[Figure: Pipeline structure]

What does our code currently provide?

  1. User-friendly Iteration Simulation Tool: We offer an easy-to-use iteration simulation tool that integrates functionalities from ElasticSearch, LangChain, and api-for-open-llm, allowing for convenient dataset loading, selection of various LLMs and retrieval-ranking models, and automated iterative simulation.
  2. Support for Multiple Datasets: including, but not limited to, Natural Questions, TriviaQA, WebQuestions, and PopQA. Any dataset converted to jsonl format can be used with this framework.
  3. Support for Various Retrieval and Re-ranking Models: BM25, Contriever, LLM-Embedder, BGE, UPR, MonoT5, and more.
  4. Support for Frequently Used LLMs: GPT-3.5-Turbo, chatglm3-6b, Qwen-14B-Chat, Llama-2-13b-chat-hf, and Baichuan2-13B-Chat.
  5. Support for Various RAG Pipeline Evolution Evaluation Methods: automatically organizes and evaluates the large volume of results produced by each experiment.

Installation

Our framework depends on ElasticSearch 8.11.1 and api-for-open-llm, so you need to install these two tools first. We recommend using the same ElasticSearch version (8.11.1). Before starting it, set an appropriate http.port and http.host in the config/elasticsearch.yml file; these values are required to configure the code in this repository.

When installing api-for-open-llm, follow the instructions in its repository to install all the dependencies and environment required to run the models you need. The PORT you configure in its .env file is also a required setting for this codebase.

Install via GitHub

First, clone the repo:

git clone --recurse-submodules git@github.com:VerdureChen/SOS-Retrieval-Loop.git

Then,

cd SOS-Retrieval-Loop

To install the required packages, you can create a conda environment:

conda create --name SOS_LOOP python=3.10

Then use pip to install the required packages:

pip install -r requirements.txt

Usage

Please see Installation to install the required packages.

Before running our framework, you need to start ElasticSearch and api-for-open-llm. When starting ElasticSearch, make sure http.port and http.host are set appropriately in the config/elasticsearch.yml file, as these values are needed to configure the code in this repository.

When starting api-for-open-llm, you need to set the PORT in the .env file, which will also serve as a required configuration for this codebase.
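
For reference, a minimal sketch of these two settings (the values below are illustrative placeholders, not the ones used in our experiments):

# config/elasticsearch.yml
http.host: 0.0.0.0
http.port: 9978

# api-for-open-llm .env
PORT=8000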

Configuration

Since our code involves many datasets, models, and indexing functions, we use a three-level configuration scheme to control how the code runs:

  1. In the config folder of each function (retrieval, re-ranking, generation, etc.) there is a configuration file template containing all the parameters you can set yourself. The file name indicates the function and dataset it corresponds to; for example, src/retrieval_loop/retrieve_configs/bge-base-config-nq.json is a retrieval configuration for the Natural Questions dataset using the BGE-base model. Take src/retrieval_loop/index_configs/bge-base-config-psgs_w100.json as an example:

     {
           "new_text_file": "../../data_v2/input_data/DPR/psgs_w100.jsonl", 
           "retrieval_model": "bge-base",
           "index_name": "bge-base_faiss_index",
           "index_path": "../../data_v2/indexes",
           "index_add_path": "../../data_v2/indexes",
           "page_content_column": "contents",
           "index_exists": false,
           "normalize_embeddings": true,
           "query_files": ["../../data_v2/input_data/DPR/nq-test-h10.jsonl"],
           "query_page_content_column": "question",
           "output_files": ["../../data_v2/ret_output/DPR/nq-test-h10-bge-base"],
           "elasticsearch_url": "http://124.16.138.142:9978"
     }

    The fields have the following meanings:
      • new_text_file: path to the documents to be newly added to the index.
      • retrieval_model: the retrieval model to use.
      • index_name: name of the index.
      • index_path: storage path of the index.
      • index_add_path: path where the IDs of documents added incrementally to the index are recorded (particularly useful when specific documents later need to be deleted from the index).
      • page_content_column: column name of the text to be indexed in the document file.
      • index_exists: whether the index already exists (if false, the corresponding index is created; otherwise the existing index is read from the path).
      • normalize_embeddings: whether to normalize the retrieval model's output embeddings.
      • query_files: paths to the query files.
      • query_page_content_column: column name of the query text in the query files.
      • output_files: paths of the output retrieval result files (one per entry in query_files).
      • elasticsearch_url: URL of ElasticSearch.

  2. Since we usually need to chain multiple steps in a pipeline (see the example script src/run_loop.sh), we provide a global configuration file, src/test_function/test_configs/template_total_config.json, in which you can configure the parameters of every stage at once. You do not need to set every parameter in it, only the ones you want to change relative to the template configuration files.

  3. To make script runs more efficient, the src/run_loop.sh script supports running multiple datasets and LLM-generated results for a single retrieval/re-ranking method at the same time. To configure such experiments flexibly, new configuration files can be generated during the pipeline run via rewrite_configs.py. For example, when running the pipeline in a loop we need to record the config of each round, so before each retrieval the script runs:

     python ../rewrite_configs.py --total_config "${USER_CONFIG_PATH}" \
                               --method "${RETRIEVAL_MODEL_NAME}" \
                               --data_name "nq" \
                               --loop "${LOOP_NUM}" \
                               --stage "retrieval" \
                               --output_dir "${CONFIG_PATH}" \
                               --overrides '{"query_files": ['"${QUERY_FILE_LIST}"'], "output_files": ['"${OUTPUT_FILE_LIST}"'] , "elasticsearch_url": "'"${elasticsearch_url}"'", "normalize_embeddings": false}'

    The arguments are:
      • --total_config: path to the global configuration file.
      • --method: name of the retrieval method.
      • --data_name: name of the dataset.
      • --loop: number of the current loop.
      • --stage: stage of the pipeline currently being run.
      • --output_dir: storage path of the newly generated configuration file.
      • --overrides: the parameters to modify (a subset of the template configuration for the task).

  4. When configuring, pay attention to the priority of the three configuration levels: level one is the default configuration in each task's config template, level two is the global configuration file, and level three is the configuration file newly generated during the pipeline run. During the run, level two overrides level one, and level three overrides level two.
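
A minimal illustration of this priority order, using two fields from the index configuration example above (the values are made up for illustration):

Level 1, task config template:        {"normalize_embeddings": true, "index_exists": false}
Level 2, global configuration file:   {"index_exists": true}
Level 3, generated during the loop:   {"normalize_embeddings": false}
Resulting effective configuration:    {"normalize_embeddings": false, "index_exists": true}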

Running the Code

The following steps reproduce our experiments. Before running them, please read the Configuration section to understand how the configuration files work:

  1. Dataset Preprocessing: Both queries and documents must be converted to jsonl format (see the sketch after these steps). In our experiments we use the data.wikipedia_split.psgs_w100 corpus, which can be downloaded into the data_v2/raw_data/DPR directory and unzipped following the instructions in the DPR repository. We provide a simple script, data_v2/gen_dpr_hc_jsonl.py, which converts the corpus to jsonl format and places it in data_v2/input_data/DPR. The query files used in the experiments are located in data_v2/input_data/DPR/sampled_query.
     cd data_v2
     python gen_dpr_hc_jsonl.py 
  2. Generate Zero-Shot RAG Results: Use src/llm_zero_generate/run_generate.sh. By modifying the configuration in the file you can generate zero-shot RAG results for all data and models in batches. Configure the following parameters at the beginning of the script:
    MODEL_NAMES=(chatglm3-6b) #chatglm3-6b qwen-14b-chat llama2-13b-chat baichuan2-13b-chat gpt-3.5-turbo
    GENERATE_BASE_AND_KEY=(
       "gpt-3.5-turbo http://XX.XX.XX.XX:XX/v1 xxx"
       "chatglm3-6b http://XX.XX.XX.XX:XX/v1 xxx"
       "qwen-14b-chat http://XX.XX.XX.XX:XX/v1 xxx"
       "llama2-13b-chat http://XX.XX.XX.XX:XX/v1 xxx"
       "baichuan2-13b-chat http://XX.XX.XX.XX:XX/v1 xxx"
      )
    DATA_NAMES=(tqa pop nq webq)
    CONTEXT_REF_NUM=1
    QUESTION_FILE_NAMES=(
      "-test-sample-200.jsonl"
      "-upr_rerank_based_on_bm25.json"
    )
    LOOP_CONFIG_PATH_NAME="../run_configs/original_retrieval_config"
    
    TOTAL_LOG_DIR="../run_logs/original_retrieval_log"
    QUESTION_FILE_PATH_TOTAL="../../data_v2/loop_output/DPR/original_retrieval_result"
    TOTAL_OUTPUT_DIR="../../data_v2/loop_output/DPR/original_retrieval_result"
    The parameters are:
      • MODEL_NAMES: list of model names for which results should be generated.
      • GENERATE_BASE_AND_KEY: entries consisting of the model name, API address, and key.
      • DATA_NAMES: list of dataset names.
      • CONTEXT_REF_NUM: number of context references (set it to 0 for the zero-shot case).
      • QUESTION_FILE_NAMES: list of query file name suffixes. Note that the script identifies the dataset by the file name prefix, so to query nq-test-sample-200.jsonl you need to include nq in DATA_NAMES and put only -test-sample-200.jsonl in this field.
      • LOOP_CONFIG_PATH_NAME and TOTAL_LOG_DIR: storage paths of the run configs and logs.
      • QUESTION_FILE_PATH_TOTAL: storage path of the query files.
      • TOTAL_OUTPUT_DIR: storage path of the generated results.
    After configuring, run the script:
    cd src/llm_zero_generate
    bash run_generate.sh
  3. Build Dataset Indexes: Use src/retrieval_loop/run_index_builder.sh. By modifying MODEL_NAMES and DATA_NAMES in the file you can build indexes for all data and models at once. You can also obtain retrieval results for the corresponding method by configuring query_files and output_files. In our experiments, all retrieval model checkpoints are placed in the ret_model directory. Run:
     cd src/retrieval_loop
     bash run_index_builder.sh
  4. Post-process the Zero-Shot RAG output (filter the generated text, rename IDs, etc.): Use src/post_process/post_process.sh. By modifying MODEL_NAMES and QUERY_DATA_NAMES in the file you can process the zero-shot generation results of all data and models in one run. For zero-shot data we set LOOP_NUM to 0, while LOOP_CONFIG_PATH_NAME, TOTAL_LOG_DIR, and TOTAL_OUTPUT_DIR specify the paths of the script configs, logs, and output, respectively. FROM_METHOD indicates how the text being processed was generated and is added as a tag to the processed document IDs. INPUT_FILE_PATH is the path to the text files to be processed, containing one directory per dataset, each holding the various zero-shot result files. Also make sure that INPUT_FILE_NAME matches the actual input file name. Run:
    cd src/post_process
    bash post_process.sh
  5. Add the Zero-Shot RAG output to the index and retrieve with the updated index: Use src/run_zero-shot.sh. By modifying GENERATE_MODEL_NAMES and QUERY_DATA_NAMES in the file you can add the zero-shot RAG results of all data and models to the index in one run and obtain the retrieval results after the zero-shot data is added. Note that the run_items list specifies the retrieval/re-ranking methods to run; each element has the form "item6 bm25 monot5", meaning that the sixth zero-shot RAG experiment in this run uses the BM25+MonoT5 retrieval/re-ranking method. Run:
    cd src
    bash run_zero-shot.sh
  6. Run the main LLM-generated text simulation loop: Use src/run_loop.sh. By modifying GENERATE_MODEL_NAMES and QUERY_DATA_NAMES in the file you can run the simulation loop for all data and models in batches, and TOTAL_LOOP_NUM controls the number of iterations. Because the index is updated repeatedly, only one retrieval/re-ranking pipeline can be run at a time. To change the number of context passages the LLM sees in the RAG pipeline, modify CONTEXT_REF_NUM (5 by default). Run:
    cd src
    bash run_loop.sh
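
As mentioned in step 1, both corpus and query files are expected in jsonl format, one JSON object per line. A minimal sketch is shown below; the content columns ("contents" for documents and "question" for queries) match the configuration examples above, while the other field names and values are illustrative assumptions, not the exact schema produced by the preprocessing script:

{"id": "doc_0001", "contents": "Example passage text that will be indexed ..."}
{"id": "q_0001", "question": "example question text from a query file"}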

Evaluation

Our framework supports several batch evaluation methods for the large volume of results produced by the experiments. After setting QUERY_DATA_NAMES and RESULT_NAMES in src/evaluation/run_context_eva.sh, you can choose any supported task for evaluation, including:

  1. TASK="retrieval": Evaluate the retrieval and re-ranking results of each iteration, including Acc@5 and Acc@20.
  2. TASK="QA": Evaluate the QA results of each iteration (EM).
  3. TASK="context_answer": Calculate the number of documents in the contexts (default top 5 retrieval results) that contain the correct answer when each LLM answers correctly (EM=1) or incorrectly (EM=0) at the end of each iteration.
  4. TASK="bleu": Calculate the SELF-BLEU value of the contexts (default top 5 retrieval results) at each iteration, with 2-gram and 3-gram calculated by default.
  5. TASK="percentage": Calculate the percentage of each LLM and human-generated text in the top 5, 20, and 50 contexts at each iteration.
  6. TASK="misQA": In the Misinformation experiment, calculate the EM of specific incorrect answers in the QA results at each iteration.
  7. TASK="QA_llm_mis" and TASK="QA_llm_right": In the Misinformation experiment, calculate the situation where specific incorrect or correct answers in the QA results at each iteration are determined to be supported by the text after being judged by GPT-3.5-Turbo (refer to EM_llm in the paper).
  8. TASK="filter_bleu_*" and TASK="filter_source_*": In the Filtering experiment, calculate the evaluation results of each iteration under different filtering methods, where * represents the evaluation content that has already appeared (retrieval, percentage, context_answer).

The evaluation results are stored by default in the corresponding RESULT_DIR/RESULT_NAME/QUERY_DATA_NAME/results directory.
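
A minimal usage sketch, assuming the variables are edited directly inside the script (the RESULT_NAMES value below is a placeholder):

# Hypothetical excerpt of the variables set inside src/evaluation/run_context_eva.sh:
#   QUERY_DATA_NAMES=(nq webq pop tqa)
#   RESULT_NAMES=(your_result_directory_name)
#   TASK="retrieval"   # or "QA", "bleu", "percentage", ...
cd src/evaluation
bash run_context_eva.sh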

More Use Examples

Using Different LLMs at Different Iteration Stages

By modifying the GENERATE_MODEL_NAMES configuration in src/run_loop.sh, you can use different LLMs at different stages. For example:

GENERATE_MODEL_NAMES_F3=(qwen-0.5b-chat qwen-1.8b-chat qwen-4b-chat)
GENERATE_MODEL_NAMES_F7=(qwen-7b-chat llama2-7b-chat baichuan2-7b-chat)
GENERATE_MODEL_NAMES_F10=(gpt-3.5-turbo qwen-14b-chat llama2-13b-chat)

This means that qwen-0.5b-chat, qwen-1.8b-chat, and qwen-4b-chat are used in the first three iterations; qwen-7b-chat, llama2-7b-chat, and baichuan2-7b-chat in the fourth to seventh iterations; and gpt-3.5-turbo, qwen-14b-chat, and llama2-13b-chat in the eighth to tenth iterations.

This can be used to simulate the impact of LLM performance improving over time on the RAG system. Remember to provide the corresponding entries in GENERATE_BASE_AND_KEY and set GENERATE_TASK to update_generate, as sketched below.
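
A hypothetical excerpt of the corresponding settings in src/run_loop.sh (API addresses and keys are placeholders; the exact variable layout in the script may differ):

GENERATE_BASE_AND_KEY=(
   "qwen-0.5b-chat http://XX.XX.XX.XX:XX/v1 xxx"
   "qwen-1.8b-chat http://XX.XX.XX.XX:XX/v1 xxx"
   "qwen-4b-chat http://XX.XX.XX.XX:XX/v1 xxx"
   "qwen-7b-chat http://XX.XX.XX.XX:XX/v1 xxx"
   "llama2-7b-chat http://XX.XX.XX.XX:XX/v1 xxx"
   "baichuan2-7b-chat http://XX.XX.XX.XX:XX/v1 xxx"
   "gpt-3.5-turbo http://XX.XX.XX.XX:XX/v1 xxx"
   "qwen-14b-chat http://XX.XX.XX.XX:XX/v1 xxx"
   "llama2-13b-chat http://XX.XX.XX.XX:XX/v1 xxx"
)
GENERATE_TASK=update_generate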

Misinformation Experiment

  1. We first use GPT-3.5-Turbo to generate five incorrect answers for each query, then randomly select one of them and have every LLM used in the experiment generate supporting text for it. You can modify the {dataset}_mis_config_answer_gpt.json files (used to generate the incorrect answers) and the {dataset}_mis_config_passage_{llm}.json files (used to generate text containing the incorrect answers) in the src/misinfo/mis_config directory to configure the datasets, output paths, and API information.
  2. Run the src/misinfo/run_gen_misinfo.sh script to generate misinformation text for all data and models in one run.
  3. Follow steps 4 to 6 of Running the Code to add the generated misinformation text to the index and obtain the RAG results after the misinformation data is added.
  4. Follow Evaluation and run the src/evaluation/run_context_eva.sh script to evaluate the results of the Misinformation experiment; we recommend the "retrieval", "context_answer", "QA_llm_mis", and "QA_llm_right" evaluation tasks.

Filtering Experiment

In our experiments, we conducted SELF-BLEU filtering and source filtering of the retrieved contexts separately.

  1. SELF-BLEU filtering: We compute the SELF-BLEU value of the top retrieval results to ensure that the context fed to the LLM stays sufficiently diverse (i.e., has a low SELF-BLEU value; the default threshold is 0.4). To enable this feature, set FILTER_METHOD_NAME=filter_bleu in src/run_loop.sh (see the snippet after this list).
  2. Source filtering: We identify LLM-generated text among the top retrieval results and exclude it, using Hello-SimpleAI/chatgpt-qa-detector-roberta for identification. Please place the checkpoint in the ret_model directory before the experiment starts. To enable this feature, set FILTER_METHOD_NAME=filter_source in src/run_loop.sh.
  3. Follow Evaluation and run the src/evaluation/run_context_eva.sh script to evaluate the results of the Filtering experiment; we recommend the "retrieval", "QA", "filter_bleu_*", and "filter_source_*" evaluation tasks.
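
As referenced above, a minimal sketch of the corresponding setting in src/run_loop.sh (its exact placement in the script is assumed):

# Choose one of the two filtering methods:
FILTER_METHOD_NAME=filter_bleu      # SELF-BLEU filtering (default threshold 0.4)
# FILTER_METHOD_NAME=filter_source  # source filtering via Hello-SimpleAI/chatgpt-qa-detector-roberta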

Deleting Corresponding Documents from the Index

Because our experiments update the index dynamically, rebuilding the index from scratch for every simulation is impractical. Instead, each simulation records the IDs of the newly added text in the index_add_logs directory under src/run_logs for that experiment. After the experiment is over, we delete the corresponding documents from the index using the src/post_process/delete_doc_from_index.py script.

  1. When you need to delete documents from the BM25 index, run:
cd src/post_process
python delete_doc_from_index.py --config_file_path delete_configs/delete_config_bm25.json

The corresponding configuration file, src/post_process/delete_configs/delete_config_bm25.json, contains the following parameters:

{
    "id_files": ["../run_logs/zero-shot_retrieval_log_low/index_add_logs"],
    "model_name": "bm25",
    "index_path": "../../data_v2/indexes",
    "index_name": "bm25_psgs_index",
    "elasticsearch_url": "http://xxx.xxx.xxx.xxx:xxx",
    "delete_log_path": "../run_logs/zero-shot_retrieval_log_low/index_add_logs"
}

Here id_files is the directory containing the document IDs to be deleted, model_name is the index model name, index_path is the index storage path, index_name is the index name, elasticsearch_url is the URL of ElasticSearch, and delete_log_path is where the deletion record is stored. ID files for different indexes can be mixed in the same directory; the script automatically reads the document IDs belonging to the configured index and deletes those documents.

  2. When you need to delete documents from the Faiss index, run:

cd src/post_process
python delete_doc_from_index.py --config_file_path delete_configs/delete_config_faiss.json

The corresponding configuration file, src/post_process/delete_configs/delete_config_faiss.json, contains the following parameters:

{
     "id_files": ["../run_logs/mis_filter_bleu_nq_webq_pop_tqa_loop_log_contriever_None_total_loop_10_20240206164013/index_add_logs"],
    "model_name": "faiss",
    "index_path": "../../data_v2/indexes",
    "index_name": "contriever_faiss_index",
    "elasticsearch_url": "http://xxx.xxx.xxx.xxx:xxx",
    "delete_log_path": "../run_logs/mis_filter_bleu_nq_webq_pop_tqa_loop_log_contriever_None_total_loop_10_20240206164013/index_add_logs"
}

The fields have the same meanings as in the BM25 configuration above.
