# Query Length Analysis Across BEIR Datasets

In this notebook, we analyze and compare query length distributions across multiple BEIR datasets. The goal is to:
- Compute descriptive statistics (mean, median, percentiles) for query lengths.
- Visualize distributions (histograms, box plots, density plots) for each dataset.
- Compare differences across datasets, particularly noting that scientific datasets tend to have longer queries.
- Define appropriate thresholds for short, medium, and long queries, which will inform subsequent experiments.

All outputs (plots and statistics) are saved in the `data_analysis_results` folder for future reference.
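
The statistics and thresholds listed above can be sketched in a few lines. This is only an illustration under stated assumptions: query length is taken to be the number of whitespace-separated tokens, and the short/medium/long boundaries are placed at the terciles of the length distribution. Neither choice is confirmed by this notebook, and `describe_lengths` and `length_bucket` are hypothetical helpers, not part of `IRutils`. The queries are toy strings, not drawn from any BEIR dataset.

```python
import statistics

def describe_lengths(queries):
    """Summarize whitespace-token lengths (an assumed definition of query length)."""
    lengths = sorted(len(q.split()) for q in queries)
    # quantiles(..., n=3) returns the two cut points that split the data into terciles
    short_max, medium_max = statistics.quantiles(lengths, n=3)
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "short_max": short_max,
        "medium_max": medium_max,
        "max": lengths[-1],
    }

def length_bucket(query, short_max, medium_max):
    """Label a query as short, medium, or long using the two thresholds."""
    n = len(query.split())
    if n <= short_max:
        return "short"
    if n <= medium_max:
        return "medium"
    return "long"

# Toy queries for illustration only
toy_queries = [
    "what is bm25",
    "effects of vitamin d on bone density in older adults",
    "capital of france",
]
stats = describe_lengths(toy_queries)
buckets = [
    length_bucket(q, stats["short_max"], stats["medium_max"]) for q in toy_queries
]
print(stats)
print(buckets)
```

Percentile-based thresholds adapt to each dataset, which makes buckets comparable in size but not in absolute length; fixed token counts would be the alternative if the same boundaries must hold across datasets.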


In [1]:
### DATASETS
datasets = {
    'msmarco': ['train', 'dev'],
    'hotpotqa': ['train', 'dev', 'test'],
    'arguana': ['test'],
    'quora': ['dev', 'test'],
    'scidocs': ['test'],
    'fever': ['train', 'dev', 'test'],
    'climate-fever': ['test'],
    'scifact': ['train', 'test'],
    'fiqa': ['train', 'dev', 'test'],
    'nfcorpus': ['train', 'dev', 'test']
}

### RESULTS FOLDER
results_folder = "data_analysis_results"
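
The mapping above does not list the same splits for every dataset (for example, `msmarco` lists only `train` and `dev`), so code that iterates over it may want to choose a split explicitly. A minimal sketch, assuming a hypothetical helper `pick_split` that prefers `test` and otherwise falls back to the last split listed; the fallback rule is an illustration, not something this notebook defines.

```python
def pick_split(available_splits, preferred="test"):
    """Return `preferred` if it is listed, otherwise the last listed split."""
    if not available_splits:
        raise ValueError("no splits listed for this dataset")
    if preferred in available_splits:
        return preferred
    return available_splits[-1]

# Two entries copied from the mapping above, used here as an example
example = {"msmarco": ["train", "dev"], "scifact": ["train", "test"]}
chosen = {name: pick_split(splits) for name, splits in example.items()}
print(chosen)
```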

In [2]:
import os
import matplotlib.pyplot as plt

plt.ioff()

# Create the folder if it does not exist yet, then return its path
def ensure_folder_exists(folder):
    os.makedirs(folder, exist_ok=True)
    return folder

# Create a folder for analysis results
results_folder = ensure_folder_exists(results_folder)
print("Results folder:", results_folder)

Results folder: data_analysis_results


In [3]:
from IRutils.analysis_util import QueryLengthAnalyzer

def analyze_and_save(dataset_name, split="test", results_folder="data_analysis_results") -> QueryLengthAnalyzer:
    """
    Loads a BEIR dataset, computes query length analysis,
    saves plots and statistics, and returns analyzer instance.
    :param dataset_name: dataset name
    :param split: split to analyze
    :param results_folder: result directory to save plots and statistics
    :return: a QueryLengthAnalyzer instance
    """

    dataset_folder = ensure_folder_exists(os.path.join(results_folder, dataset_name))

    # initialize the analyzer (this downloads and processes the dataset)
    analyzer = QueryLengthAnalyzer(dataset_name=dataset_name, split=split)

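    # save stats to a text file (the file name omits the split, so analyzing
    # another split of the same dataset overwrites this file)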
    stats_file = os.path.join(dataset_folder, f"{dataset_name}_stats.txt")
    with open(stats_file, "w") as f:
        f.write(f"Query Length Statistics for {dataset_name} (split {split})\n")
        for key, value in analyzer.stats.items():
            f.write(f"{key}: {value}\n")
    print(f"Statistics saved for {dataset_name} (split {split}) in {stats_file}")

    # save histogram plot
    histogram_file = os.path.join(dataset_folder, f"{dataset_name}_histogram.png")
    analyzer.plot_histogram(show=False, save_path=histogram_file)
    print(f"Histogram saved for {dataset_name} (split {split}) in {histogram_file}")


    # save box plot
    boxplot_file = os.path.join(dataset_folder, f"{dataset_name}_boxplot.png")
    analyzer.plot_boxplot(show=False, save_path=boxplot_file)
    print(f"Boxplot saved for {dataset_name} (split {split}) in {boxplot_file}")

    return analyzer




In [4]:
# dataset names to analyze; the test split is used for each of them
dataset_names = list(datasets.keys())
print("Datasets to analyze:", dataset_names)


Datasets to analyze: ['msmarco', 'hotpotqa', 'arguana', 'quora', 'scidocs', 'fever', 'climate-fever', 'scifact', 'fiqa', 'nfcorpus']


In [5]:
analyzers = {}

for name in dataset_names:
    analyzers[name] = analyze_and_save(dataset_name=name, split="test", results_folder=results_folder)

  0%|          | 0/8841823 [00:00<?, ?it/s]

Statistics saved for msmarco (split test) in data_analysis_results/msmarco/msmarco_stats.txt
Histogram saved for msmarco (split test) in data_analysis_results/msmarco/msmarco_histogram.png
Boxplot saved for msmarco (split test) in data_analysis_results/msmarco/msmarco_boxplot.png


  0%|          | 0/5233329 [00:00<?, ?it/s]

Statistics saved for hotpotqa (split test) in data_analysis_results/hotpotqa/hotpotqa_stats.txt
Histogram saved for hotpotqa (split test) in data_analysis_results/hotpotqa/hotpotqa_histogram.png
Boxplot saved for hotpotqa (split test) in data_analysis_results/hotpotqa/hotpotqa_boxplot.png


  0%|          | 0/8674 [00:00<?, ?it/s]

Statistics saved for arguana (split test) in data_analysis_results/arguana/arguana_stats.txt
Histogram saved for arguana (split test) in data_analysis_results/arguana/arguana_histogram.png
Boxplot saved for arguana (split test) in data_analysis_results/arguana/arguana_boxplot.png


  0%|          | 0/522931 [00:00<?, ?it/s]

Statistics saved for quora (split test) in data_analysis_results/quora/quora_stats.txt
Histogram saved for quora (split test) in data_analysis_results/quora/quora_histogram.png
Boxplot saved for quora (split test) in data_analysis_results/quora/quora_boxplot.png


  0%|          | 0/25657 [00:00<?, ?it/s]

Statistics saved for scidocs (split test) in data_analysis_results/scidocs/scidocs_stats.txt
Histogram saved for scidocs (split test) in data_analysis_results/scidocs/scidocs_histogram.png
Boxplot saved for scidocs (split test) in data_analysis_results/scidocs/scidocs_boxplot.png


  0%|          | 0/5416568 [00:00<?, ?it/s]

Statistics saved for fever (split test) in data_analysis_results/fever/fever_stats.txt
Histogram saved for fever (split test) in data_analysis_results/fever/fever_histogram.png
Boxplot saved for fever (split test) in data_analysis_results/fever/fever_boxplot.png


  0%|          | 0/5416593 [00:00<?, ?it/s]

Statistics saved for climate-fever (split test) in data_analysis_results/climate-fever/climate-fever_stats.txt
Histogram saved for climate-fever (split test) in data_analysis_results/climate-fever/climate-fever_histogram.png
Boxplot saved for climate-fever (split test) in data_analysis_results/climate-fever/climate-fever_boxplot.png


  0%|          | 0/5183 [00:00<?, ?it/s]

Statistics saved for scifact (split test) in data_analysis_results/scifact/scifact_stats.txt
Histogram saved for scifact (split test) in data_analysis_results/scifact/scifact_histogram.png
Boxplot saved for scifact (split test) in data_analysis_results/scifact/scifact_boxplot.png


  0%|          | 0/57638 [00:00<?, ?it/s]

Statistics saved for fiqa (split test) in data_analysis_results/fiqa/fiqa_stats.txt
Histogram saved for fiqa (split test) in data_analysis_results/fiqa/fiqa_histogram.png
Boxplot saved for fiqa (split test) in data_analysis_results/fiqa/fiqa_boxplot.png


  0%|          | 0/3633 [00:00<?, ?it/s]

Statistics saved for nfcorpus (split test) in data_analysis_results/nfcorpus/nfcorpus_stats.txt
Histogram saved for nfcorpus (split test) in data_analysis_results/nfcorpus/nfcorpus_histogram.png
Boxplot saved for nfcorpus (split test) in data_analysis_results/nfcorpus/nfcorpus_boxplot.png
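
Once the per-dataset statistics exist, the cross-dataset comparison can start from a simple ranking. The sketch below assumes each statistics dictionary contains a `mean` key; the actual keys of `analyzer.stats` are not shown in this notebook, so that key name is an assumption. `rank_by_stat` is a hypothetical helper, and the numbers are placeholders rather than measured results.

```python
def rank_by_stat(stats_by_dataset, key="mean"):
    """Sort dataset names by one statistic, largest first.

    Datasets whose statistics lack `key` are skipped.
    """
    present = {
        name: stats[key]
        for name, stats in stats_by_dataset.items()
        if key in stats
    }
    return sorted(present.items(), key=lambda item: item[1], reverse=True)

# Placeholder values for illustration only, not results from any dataset
placeholder = {
    "dataset_a": {"mean": 6.0, "median": 6.0},
    "dataset_b": {"mean": 11.5, "median": 10.0},
    "dataset_c": {"median": 4.0},
}
ranking = rank_by_stat(placeholder)
for name, value in ranking:
    print(f"{name:<12}{value:>8.2f}")
```

With the objects built in this notebook, the input would be `{name: a.stats for name, a in analyzers.items()}`, provided the chosen key is present in `stats`.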
