# Setup

In [1]:
!git clone https://github.com/tadtd/binsketch-algorithm

Cloning into 'binsketch-algorithm'...
remote: Enumerating objects: 395, done.
remote: Counting objects: 100% (31/31), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 395 (delta 10), reused 13 (delta 5), pack-reused 364 (from 1)
Receiving objects: 100% (395/395), 243.97 KiB | 4.28 MiB/s, done.
Resolving deltas: 100% (246/246), done.


In [2]:
%cd binsketch-algorithm

/kaggle/working/binsketch-algorithm


# Configuration

In [3]:
import os
import gdown
import zipfile

# Process data

In [4]:
DRIVE_URL = 'https://drive.google.com/drive/folders/1ARBY9cIGj_jigi5Y88CtUy-GMj2clrXj'
RAW_DATA_PATH = './raw'
PROCESSED_DATA_PATH = './data'

In [5]:
def download_dataset(drive_url, target_folder):
    """Download a Drive folder, or download and unzip a single Drive file, into target_folder."""
    print(f"Processing: {drive_url}")

    # Create the folder if it doesn't exist
    os.makedirs(target_folder, exist_ok=True)

    # Any URL containing "folder" (including ".../drive/folders/...") is treated as a folder link
    if "folder" in drive_url:
        print("Downloading individual files directly...")
        files = gdown.download_folder(drive_url, output=target_folder, quiet=False, use_cookies=False)
        if not files:
            print("[ERROR] Folder download failed or returned no files.")
            return
        print(f"\n[SUCCESS] Folder contents downloaded to: {target_folder}")
    else:
        print("\n[INFO] Detected a Drive FILE link.")
        zip_path = os.path.join(target_folder, "temp_dataset.zip")
        output = gdown.download(drive_url, zip_path, quiet=False, fuzzy=True)

        if not output:
            print("[ERROR] Download failed.")
            return

        print(f"\nUnzipping {output}...")
        try:
            with zipfile.ZipFile(output, 'r') as zip_ref:
                zip_ref.extractall(target_folder)
            print(f"[SUCCESS] Extracted to: {target_folder}")

            # Clean up the zip file
            os.remove(output)

        except zipfile.BadZipFile:
            print("[ERROR] The downloaded file was not a valid zip file.")
            print("Check if the file on Drive is actually a .zip archive.")

In [6]:
download_dataset(DRIVE_URL, RAW_DATA_PATH)

Processing: https://drive.google.com/drive/folders/1ARBY9cIGj_jigi5Y88CtUy-GMj2clrXj
Downloading individual files directly...


Retrieving folder contents


Retrieving folder 10Y_7o78v8HhztcE7bzgV3M94N0vXjt_U bbc
Processing file 1lgysq7G_lc_zc71dGGqWoHejDCixcOVr docword.bbc.txt.gz
Processing file 1kO9AOWWuACtNsgmA9yyp9C6pSoB8KW3U vocab.bbc.txt
Retrieving folder 1D9mMF6ealOinLAsmwXsZs9IMNPBqzuM0 enron
Processing file 1JuUxpaQRAl1yZGqb3xcSP8nfK3_1NZ8y docword.enron.txt.gz
Processing file 16Rn70xrTnYOIkVm2mz4ICR2bEbyzyvGQ vocab.enron.txt
Retrieving folder 1YMxNXk2-7Ok1_3gnIguPwWt365C5X7Et kos
Processing file 1c1bJ-eX5Rp729zGSeGGXqwyor4DfWsrx docword.kos.txt.gz
Processing file 1YL0wnFKLJz-h6emVWHYDfBET5cAZhdKi vocab.kos.txt
Retrieving folder 17JU-ouMBLAUilZiaxE1xzeKfpFvDy9PA nytimes
Processing file 1gsmnfyNEAtA_3kdU5GMhwnUX-vOlc_9A docword.nytimes.txt.gz
Processing file 1jAnAFekn8u-e_FO1tsElhr-gP_dvSXkx vocab.nytimes.txt


Retrieving folder contents completed
Building directory structure
Building directory structure completed
Downloading...
From: https://drive.google.com/uc?id=1lgysq7G_lc_zc71dGGqWoHejDCixcOVr
To: /kaggle/working/binsketch-algorithm/raw/bbc/docword.bbc.txt.gz
100%|██████████| 490k/490k [00:00<00:00, 85.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=1kO9AOWWuACtNsgmA9yyp9C6pSoB8KW3U
To: /kaggle/working/binsketch-algorithm/raw/bbc/vocab.bbc.txt
100%|██████████| 77.2k/77.2k [00:00<00:00, 53.7MB/s]
Downloading...
From: https://drive.google.com/uc?id=1JuUxpaQRAl1yZGqb3xcSP8nfK3_1NZ8y
To: /kaggle/working/binsketch-algorithm/raw/enron/docword.enron.txt.gz
100%|██████████| 12.3M/12.3M [00:00<00:00, 53.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=16Rn70xrTnYOIkVm2mz4ICR2bEbyzyvGQ
To: /kaggle/working/binsketch-algorithm/raw/enron/vocab.enron.txt
100%|██████████| 236k/236k [00:00<00:00, 80.0MB/s]
Downloading...
From: https://drive.google.com/uc?id=1c1bJ-eX5Rp729zGSeGGXqwyo


[SUCCESS] Folder contents downloaded to: ./raw



Download completed


In [7]:
!python scripts/convert.py


Processing BBC dataset
Loading vocabulary from raw/bbc/vocab.bbc.txt...
Loading document-word data from raw/bbc/docword.bbc.txt.gz...
  Documents: 2225, Words: 9635, Non-zeros: 158079
Creating sparse matrix...
Reading sparse data: 100%|████| 158079/158079 [00:00<00:00, 289155.68 entries/s]
Converting to DataFrame in batches of 500 documents...
Processing batches: 100%|█████████████████████| 5/5 [00:00<00:00, 76.62 batch/s]
Concatenating batches...
DataFrame shape: (2225, 9635)
Sparsity: 99.26%

BBC DataFrame Preview:
             ad  sale  boost  time  ...  quarterli  media  giant  jump
document_id                         ...                               
0             1     1      1     1  ...          0      1      1     0
1             0     0      1     1  ...          0      0      0     0
2             0     1      0     0  ...          0      0      1     0
3             0     0      0     0  ...          0      0      0     0
4             0     0      0  
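
The conversion log reports documents, words, non-zeros, and sparsity. The sketch below shows how such a binary document-term matrix and its sparsity could be obtained. It assumes the `docword.*.txt` files follow the UCI bag-of-words layout (three header lines, then `docID wordID count` triples with 1-based IDs). `load_docword_binary` is a hypothetical helper, not the repository's `scripts/convert.py`.

```python
import io
import numpy as np

def load_docword_binary(handle):
    """Parse a docword stream into a dense binary document-term matrix (illustrative helper)."""
    n_docs = int(handle.readline())
    n_words = int(handle.readline())
    n_nonzero = int(handle.readline())
    matrix = np.zeros((n_docs, n_words), dtype=np.int8)
    for line in handle:
        if not line.strip():
            continue
        doc_id, word_id, _count = map(int, line.split())
        # IDs are assumed to be 1-based; any positive count becomes 1.
        matrix[doc_id - 1, word_id - 1] = 1
    return matrix, n_nonzero

# Tiny in-memory stand-in for a decompressed docword file: 3 docs, 4 words, 5 entries.
sample = io.StringIO("3\n4\n5\n1 1 2\n1 3 1\n2 2 7\n3 1 1\n3 4 3\n")
binary, nnz = load_docword_binary(sample)

# Sparsity is the fraction of zero cells in the matrix.
sparsity = 1.0 - binary.sum() / binary.size
print(binary)
print(f"Sparsity: {sparsity:.2%}")
```

Applied to the BBC figures in the log, the same formula gives 1 - 158079 / (2225 × 9635) ≈ 99.26%, which matches the reported sparsity.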

# Experiments

## Experiment 1: Accuracy of Estimation

### Experiments on NYTimes to calculate $MSE$ using Inner Product

In [8]:
DATASET = 'nytimes'
SIMILARITY_SCORE = 'inner_product'
METRIC = 'mse'
data_path = f'./data/{DATASET}_binary.npy'
THRESHOLD = [120, 150, 180, 200, 220, 250, 270, 300]
threshold_str = ' '.join([str(t) for t in THRESHOLD])
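
`METRIC` takes the values `mse` and `minus_log_mse` in this notebook. Here is a minimal sketch of both, assuming the natural logarithm and a small `eps` guard against `log(0)`. Neither detail is confirmed by the repository's code, and the function names are illustrative.

```python
import numpy as np

def mse(true_sims, est_sims):
    """Mean squared error between true and estimated similarities."""
    true_sims = np.asarray(true_sims, dtype=float)
    est_sims = np.asarray(est_sims, dtype=float)
    return float(np.mean((true_sims - est_sims) ** 2))

def minus_log_mse(true_sims, est_sims, eps=1e-12):
    """Negative natural log of the MSE; larger values mean a better estimate."""
    return -float(np.log(mse(true_sims, est_sims) + eps))

truth = [1.0, 2.0, 3.0]
rough = [1.0, 2.0, 5.0]   # one estimate is off by 2
close = [1.0, 2.0, 3.1]   # one estimate is off by 0.1

print(mse(truth, rough), minus_log_mse(truth, rough))
print(mse(truth, close), minus_log_mse(truth, close))
```

With $-\log(MSE)$ the ordering is reversed relative to $MSE$: the closer estimate gets the higher score.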

In [9]:
!python scripts/save_ground_truth.py --experiment 1 \
                                     --data_path {data_path} \
                                     --similarity_score {SIMILARITY_SCORE} \
                                     --seed 42 \
                                     --use_gpu

✓ GPU acceleration enabled
Loading matrix from ./data/nytimes_binary.npy...
  Matrix Shape: (5000, 102660)
  Data type: int8
  Value range: [0, 1]
Transferring data to GPU...
✓ Data transferred to GPU (using cupy)
Calculating Ground Truth (inner_product) for 12497500 pairs...
  Using GPU-accelerated batch computation...
Computing ground truth: 100%|████████████████| 10/10 [00:00<00:00, 20.57batch/s]
Extracting pair similarities...

Saving ground truth to experiment/ground_truth/ground_truth_nytimes_inner_product.json...
✓ Saved 12497500 ground truth values
  Min similarity: 0.000000
  Max similarity: 775.000000
  Mean similarity: 12.607225
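
The pair count in the log is the number of unordered row pairs: 5000 × 4999 / 2 = 12497500. The sketch below shows how inner products for all pairs of a binary matrix could be read from one matrix product. It uses a small random matrix and plain NumPy; the repository's GPU batching is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(42)
# Small random binary matrix standing in for the (5000, 102660) data.
X = (rng.random((6, 20)) < 0.3).astype(np.int8)

n = X.shape[0]
n_pairs = n * (n - 1) // 2

# Cast up before multiplying so int8 accumulations cannot overflow.
wide = X.astype(np.int32)
gram = wide @ wide.T

# The strict upper triangle holds each unordered pair exactly once.
rows, cols = np.triu_indices(n, k=1)
inner_products = gram[rows, cols]

print(n_pairs, inner_products.shape)
```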


In [10]:
!python main.py --experiment 1 \
                --algo BinSketch BCS \
                --data_path {data_path} \
                --ground_truth_path ground_truth_exp1_{DATASET}_{SIMILARITY_SCORE}.json \
                --seed 42 \
                --threshold {threshold_str} \
                --similarity_score {SIMILARITY_SCORE} \
                --eval_metric {METRIC} \
                --use_gpu

✓ GPU acceleration enabled
Running Experiment 1: Accuracy of Similarity Estimation
Loading matrix from ./data/nytimes_binary.npy...
  Matrix Shape: (5000, 102660)

Loading/Computing Ground Truth
Calculating ground truth (this may take a while)...
Transferring data to GPU...
✓ Data transferred to GPU (using cupy)
Calculating Ground Truth (inner_product) for 12497500 pairs...
  Using GPU-accelerated batch computation...
Computing ground truth: 100%|███████████████| 10/10 [00:00<00:00, 119.03batch/s]
Extracting pair similarities...

Saving ground truth to experiment/ground_truth/ground_truth_exp1_nytimes_inner_product.json...
✓ Saved 12497500 ground truth values
  Min similarity: 0.000000
  Max similarity: 775.000000
  Mean similarity: 12.607225

Precomputing filtered pairs for 8 thresholds...
  Threshold 120.0: 1734 pairs
  Threshold 150.0: 1423 pairs
  Threshold 180.0: 1160 pairs
  Threshold 200.0: 1024 pairs
  Threshold 220.0: 887 pairs
  Threshold 250.0: 730 

### Experiments on ENRON to calculate $-\log(MSE)$ using Cosine Similarity

In [11]:
DATASET = 'enron'
SIMILARITY_SCORE = 'cosine_similarity'
METRIC = 'minus_log_mse'
data_path = f'./data/{DATASET}_binary.npy'
THRESHOLD = [.1, .2, .3, .4, .5, .7, .8, .9]
threshold_str = ' '.join([str(t) for t in THRESHOLD])

In [12]:
!python scripts/save_ground_truth.py --experiment 1 \
                                     --data_path {data_path} \
                                     --similarity_score {SIMILARITY_SCORE} \
                                     --seed 42 \
                                     --use_gpu

✓ GPU acceleration enabled
Loading matrix from ./data/enron_binary.npy...
  Matrix Shape: (5000, 28102)
  Data type: int8
  Value range: [0, 1]
Transferring data to GPU...
✓ Data transferred to GPU (using cupy)
Calculating Ground Truth (cosine_similarity) for 12497500 pairs...
  Using GPU-accelerated batch computation...
Computing ground truth: 100%|████████████████| 10/10 [00:00<00:00, 15.52batch/s]
Extracting pair similarities...

Saving ground truth to experiment/ground_truth/ground_truth_enron_cosine_similarity.json...
✓ Saved 12497500 ground truth values
  Min similarity: 0.000000
  Max similarity: 1.000005
  Mean similarity: 0.026620


In [13]:
!python main.py --experiment 1 \
                --algo BinSketch SimHash MinHash \
                --data_path {data_path} \
                --ground_truth_path ground_truth_exp1_{DATASET}_{SIMILARITY_SCORE}.json \
                --seed 42 \
                --threshold {threshold_str} \
                --similarity_score {SIMILARITY_SCORE} \
                --eval_metric {METRIC} \
                --use_gpu

✓ GPU acceleration enabled
Running Experiment 1: Accuracy of Similarity Estimation
Loading matrix from ./data/enron_binary.npy...
  Matrix Shape: (5000, 28102)

Loading/Computing Ground Truth
Calculating ground truth (this may take a while)...
Transferring data to GPU...
✓ Data transferred to GPU (using cupy)
Calculating Ground Truth (cosine_similarity) for 12497500 pairs...
  Using GPU-accelerated batch computation...
Computing ground truth: 100%|████████████████| 10/10 [00:00<00:00, 63.83batch/s]
Extracting pair similarities...

Saving ground truth to experiment/ground_truth/ground_truth_exp1_enron_cosine_similarity.json...
✓ Saved 12497500 ground truth values
  Min similarity: 0.000000
  Max similarity: 1.000005
  Mean similarity: 0.026620

Precomputing filtered pairs for 8 thresholds...
  Threshold 0.1: 332108 pairs
  Threshold 0.2: 37774 pairs
  Threshold 0.3: 19539 pairs
  Threshold 0.4: 12672 pairs
  Threshold 0.5: 9027 pairs
  Threshold 0.7: 5535 pairs
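
The filtered-pair counts shrink as the threshold rises because each threshold keeps only the pairs whose ground-truth similarity reaches it. The sketch below illustrates that filtering step, assuming the comparison is `>=` (the log does not say whether it is strict). The arrays are made-up values, not results from this run.

```python
import numpy as np

# Made-up ground-truth and estimated similarities for six pairs.
truth = np.array([0.05, 0.15, 0.35, 0.55, 0.75, 0.95])
estimate = np.array([0.10, 0.10, 0.30, 0.60, 0.70, 0.90])
thresholds = [0.1, 0.3, 0.5, 0.7]

results = {}
for t in thresholds:
    mask = truth >= t  # assumption: inclusive threshold
    err = float(np.mean((truth[mask] - estimate[mask]) ** 2))
    results[t] = (int(mask.sum()), err)

for t, (count, err) in results.items():
    print(f"Threshold {t}: {count} pairs, MSE {err:.4f}")
```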

### Experiments on NYTimes to calculate $-\log(MSE)$ using Jaccard Similarity

In [14]:
DATASET = 'nytimes'
SIMILARITY_SCORE = 'jaccard_similarity'
METRIC = 'minus_log_mse'
data_path = f'./data/{DATASET}_binary.npy'
THRESHOLD = [.1, .2, .3, .4, .5, .7, .8, .9]
threshold_str = ' '.join([str(t) for t in THRESHOLD])

In [15]:
!python scripts/save_ground_truth.py --experiment 1 \
                                     --data_path {data_path} \
                                     --similarity_score {SIMILARITY_SCORE} \
                                     --seed 42 \
                                     --use_gpu

✓ GPU acceleration enabled
Loading matrix from ./data/nytimes_binary.npy...
  Matrix Shape: (5000, 102660)
  Data type: int8
  Value range: [0, 1]
Transferring data to GPU...
✓ Data transferred to GPU (using cupy)
Calculating Ground Truth (jaccard_similarity) for 12497500 pairs...
  Using GPU-accelerated batch computation...
Computing ground truth: 100%|████████████████| 10/10 [00:00<00:00, 17.62batch/s]
Extracting pair similarities...

Saving ground truth to experiment/ground_truth/ground_truth_nytimes_jaccard_similarity.json...
✓ Saved 12497500 ground truth values
  Min similarity: 0.000000
  Max similarity: 1.000000
  Mean similarity: 0.027982
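
For binary rows, Jaccard and cosine similarity can both be derived from the inner product and the per-row counts of ones. The sketch below works through a three-row example in plain NumPy. It illustrates the definitions and is not the repository's GPU implementation.

```python
import numpy as np

X = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 1, 0, 0],
], dtype=np.int8)

wide = X.astype(np.int32)
inner = wide @ wide.T            # |a AND b| for every pair of rows
sizes = wide.sum(axis=1)         # |a| for every row

# Jaccard: |a AND b| / |a OR b|, with |a OR b| = |a| + |b| - |a AND b|.
union = sizes[:, None] + sizes[None, :] - inner
jaccard = np.divide(inner, union, out=np.zeros(inner.shape), where=union > 0)

# Cosine: |a AND b| / sqrt(|a| * |b|) for binary vectors.
norm = np.sqrt(np.outer(sizes, sizes))
cosine = np.divide(inner, norm, out=np.zeros(inner.shape), where=norm > 0)

print(jaccard[0, 1], cosine[0, 1])
```

Rows 0 and 1 share two of four distinct active positions, so their Jaccard similarity is 0.5 and their cosine similarity is 2/3.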


In [16]:
!python main.py --experiment 1 \
                --algo BinSketch BCS MinHash \
                --data_path {data_path} \
                --ground_truth_path ground_truth_exp1_{DATASET}_{SIMILARITY_SCORE}.json \
                --seed 42 \
                --threshold {threshold_str} \
                --similarity_score {SIMILARITY_SCORE} \
                --eval_metric {METRIC} \
                --use_gpu

✓ GPU acceleration enabled
Running Experiment 1: Accuracy of Similarity Estimation
Loading matrix from ./data/nytimes_binary.npy...
  Matrix Shape: (5000, 102660)

Loading/Computing Ground Truth
Calculating ground truth (this may take a while)...
Transferring data to GPU...
✓ Data transferred to GPU (using cupy)
Calculating Ground Truth (jaccard_similarity) for 12497500 pairs...
  Using GPU-accelerated batch computation...
Computing ground truth: 100%|████████████████| 10/10 [00:00<00:00, 35.43batch/s]
Extracting pair similarities...

Saving ground truth to experiment/ground_truth/ground_truth_exp1_nytimes_jaccard_similarity.json...
✓ Saved 12497500 ground truth values
  Min similarity: 0.000000
  Max similarity: 1.000000
  Mean similarity: 0.027982

Precomputing filtered pairs for 8 thresholds...
  Threshold 0.1: 30908 pairs
  Threshold 0.2: 3440 pairs
  Threshold 0.3: 2149 pairs
  Threshold 0.4: 1865 pairs
  Threshold 0.5: 1708 pairs
  Threshold 0.7: 1591 pa

### Experiments on BBC to calculate $-\log(MSE)$ using Jaccard Similarity

In [17]:
DATASET = 'bbc'
SIMILARITY_SCORE = 'jaccard_similarity'
METRIC = 'minus_log_mse'
data_path = f'./data/{DATASET}_binary.npy'
THRESHOLD = [.1, .2, .3, .4, .5, .7, .8, .9]
threshold_str = ' '.join([str(t) for t in THRESHOLD])

In [18]:
!python scripts/save_ground_truth.py --experiment 1 \
                                     --data_path {data_path} \
                                     --similarity_score {SIMILARITY_SCORE} \
                                     --seed 42 \
                                     --use_gpu

✓ GPU acceleration enabled
Loading matrix from ./data/bbc_binary.npy...
  Matrix Shape: (2225, 9635)
  Data type: int8
  Value range: [0, 1]
Transferring data to GPU...
✓ Data transferred to GPU (using cupy)
Calculating Ground Truth (jaccard_similarity) for 2474200 pairs...
  Using GPU-accelerated batch computation...
Computing ground truth: 100%|██████████████████| 5/5 [00:00<00:00, 33.96batch/s]
Extracting pair similarities...

Saving ground truth to experiment/ground_truth/ground_truth_bbc_jaccard_similarity.json...
✓ Saved 2474200 ground truth values
  Min similarity: 0.000000
  Max similarity: 1.000000
  Mean similarity: 0.065442


In [19]:
!python main.py --experiment 1 \
                --algo BinSketch BCS MinHash \
                --data_path {data_path} \
                --ground_truth_path ground_truth_exp1_{DATASET}_{SIMILARITY_SCORE}.json \
                --seed 42 \
                --threshold {threshold_str} \
                --similarity_score {SIMILARITY_SCORE} \
                --eval_metric {METRIC} \
                --use_gpu

✓ GPU acceleration enabled
Running Experiment 1: Accuracy of Similarity Estimation
Loading matrix from ./data/bbc_binary.npy...
  Matrix Shape: (2225, 9635)

Loading/Computing Ground Truth
Calculating ground truth (this may take a while)...
Transferring data to GPU...
✓ Data transferred to GPU (using cupy)
Calculating Ground Truth (jaccard_similarity) for 2474200 pairs...
  Using GPU-accelerated batch computation...
Computing ground truth: 100%|██████████████████| 5/5 [00:00<00:00, 34.45batch/s]
Extracting pair similarities...

Saving ground truth to experiment/ground_truth/ground_truth_exp1_bbc_jaccard_similarity.json...
✓ Saved 2474200 ground truth values
  Min similarity: 0.000000
  Max similarity: 1.000000
  Mean similarity: 0.065442

Precomputing filtered pairs for 8 thresholds...
  Threshold 0.1: 258072 pairs
  Threshold 0.2: 2196 pairs
  Threshold 0.3: 376 pairs
  Threshold 0.4: 230 pairs
  Threshold 0.5: 191 pairs
  Threshold 0.7: 168 pairs
  Threshol

## Experiment 2: Ranking

### Experiments on ENRON to calculate Accuracy using Jaccard Similarity

In [20]:
THRESHOLD = [.1, .2, .4, .5, .6, .7, .85, .95]
threshold_str = ' '.join([str(t) for t in THRESHOLD])
RETRIEVAL_METRIC = 'accuracy'
DATASET = 'enron'
data_path = f'./data/{DATASET}_binary.npy'
SIMILARITY_SCORE = 'jaccard_similarity'

In [21]:
!python scripts/save_ground_truth.py --experiment 2 \
                                     --data_path {data_path} \
                                     --similarity_score {SIMILARITY_SCORE} \
                                     --train_ratio .9 \
                                     --seed 42 \
                                     --use_gpu

✓ GPU acceleration enabled
Loading matrix from ./data/enron_binary.npy...
  Matrix Shape: (5000, 28102)
  Data type: int8
  Value range: [0, 1]
Dataset split: 4500 training, 500 query
Transferring data to GPU...
✓ Data transferred to GPU (using cupy)
Calculating Experiment 2 Ground Truth (jaccard_similarity)...
  Training samples: 4500
  Query samples: 500
  Using GPU-accelerated batch computation...
Computing similarities: 100%|██████████████████| 5/5 [00:00<00:00, 34.74batch/s]

  Min similarity: 0.000000
  Max similarity: 1.000000
  Mean similarity: 0.011788

Saving ground truth to experiment/ground_truth/ground_truth_exp2_enron_jaccard_similarity.json...
✓ Saved similarity matrix (500, 4500)
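
With `--train_ratio .9`, the 5000 rows are divided into 4500 training rows and 500 query rows. The sketch below shows one way to produce such a split. It assumes a seeded random permutation, which may differ from how the repository selects rows; `split_train_query` is a hypothetical helper.

```python
import numpy as np

def split_train_query(n_rows, train_ratio, seed):
    """Return disjoint (train, query) index arrays (illustrative helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = int(round(n_rows * train_ratio))
    return order[:n_train], order[n_train:]

train_idx, query_idx = split_train_query(5000, 0.9, seed=42)
print(len(train_idx), len(query_idx))
```

The ground-truth matrix then has one row per query and one column per training row, which is consistent with the `(500, 4500)` shape in the log.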


In [22]:
!python main.py --experiment 2 \
                --algo BinSketch BCS MinHash \
                --data_path {data_path} \
                --ground_truth_path ground_truth_exp2_{DATASET}_{SIMILARITY_SCORE}.json \
                --train_ratio .9 \
                --seed 42 \
                --threshold {threshold_str} \
                --similarity_score {SIMILARITY_SCORE} \
                --retrieval_metric {RETRIEVAL_METRIC} \
                --use_gpu

✓ GPU acceleration enabled
Running Experiment 2: Retrieval Performance Evaluation
GPU acceleration enabled
Loading matrix from ./data/enron_binary.npy...
  Matrix Shape: (5000, 28102)
Dataset split: 4500 training, 500 query samples

Loading/Computing Ground Truth Similarities
Loading ground truth from experiment/ground_truth/ground_truth_exp2_enron_jaccard_similarity.json...
  Loaded similarity matrix (500, 4500)
Using cached ground truth similarity matrix

Precomputing ground truth neighbors for 8 thresholds...
  Threshold 0.1: avg 15.78 neighbors per query
  Threshold 0.2: avg 6.41 neighbors per query
  Threshold 0.4: avg 3.55 neighbors per query
  Threshold 0.5: avg 2.58 neighbors per query
  Threshold 0.6: avg 2.14 neighbors per query
  Threshold 0.7: avg 1.67 neighbors per query
  Threshold 0.85: avg 1.01 neighbors per query
  Threshold 0.95: avg 0.73 neighbors per query

Algorithm: BinSketch

  Compression length k=100
    Compressing data...
  [GPU] Con

### Experiments on NYTimes to calculate F1 Score using Jaccard Similarity

In [23]:
THRESHOLD = [.1, .2, .4, .5, .6, .7, .85, .95]
threshold_str = ' '.join([str(t) for t in THRESHOLD])
RETRIEVAL_METRIC = 'f1'
DATASET = 'nytimes'
data_path = f'./data/{DATASET}_binary.npy'
SIMILARITY_SCORE = 'jaccard_similarity'

In [24]:
!python scripts/save_ground_truth.py --experiment 2 \
                                     --data_path {data_path} \
                                     --similarity_score {SIMILARITY_SCORE} \
                                     --train_ratio .9 \
                                     --seed 42 \
                                     --use_gpu

✓ GPU acceleration enabled
Loading matrix from ./data/nytimes_binary.npy...
  Matrix Shape: (5000, 102660)
  Data type: int8
  Value range: [0, 1]
Dataset split: 4500 training, 500 query
Transferring data to GPU...
✓ Data transferred to GPU (using cupy)
Calculating Experiment 2 Ground Truth (jaccard_similarity)...
  Training samples: 4500
  Query samples: 500
  Using GPU-accelerated batch computation...
Computing similarities: 100%|██████████████████| 5/5 [00:00<00:00, 32.96batch/s]

  Min similarity: 0.000000
  Max similarity: 1.000000
  Mean similarity: 0.027929

Saving ground truth to experiment/ground_truth/ground_truth_exp2_nytimes_jaccard_similarity.json...
✓ Saved similarity matrix (500, 4500)


In [25]:
!python main.py --experiment 2 \
                --algo BinSketch BCS MinHash \
                --data_path {data_path} \
                --ground_truth_path ground_truth_exp2_{DATASET}_{SIMILARITY_SCORE}.json \
                --train_ratio .9 \
                --seed 42 \
                --threshold {threshold_str} \
                --similarity_score {SIMILARITY_SCORE} \
                --retrieval_metric {RETRIEVAL_METRIC} \
                --use_gpu

✓ GPU acceleration enabled
Running Experiment 2: Retrieval Performance Evaluation
GPU acceleration enabled
Loading matrix from ./data/nytimes_binary.npy...
  Matrix Shape: (5000, 102660)
Dataset split: 4500 training, 500 query samples

Loading/Computing Ground Truth Similarities
Loading ground truth from experiment/ground_truth/ground_truth_exp2_nytimes_jaccard_similarity.json...
  Loaded similarity matrix (500, 4500)
Using cached ground truth similarity matrix

Precomputing ground truth neighbors for 8 thresholds...
  Threshold 0.1: avg 11.41 neighbors per query
  Threshold 0.2: avg 1.44 neighbors per query
  Threshold 0.4: avg 0.64 neighbors per query
  Threshold 0.5: avg 0.60 neighbors per query
  Threshold 0.6: avg 0.58 neighbors per query
  Threshold 0.7: avg 0.57 neighbors per query
  Threshold 0.85: avg 0.53 neighbors per query
  Threshold 0.95: avg 0.52 neighbors per query

Algorithm: BinSketch

  Compression length k=100
    Compressing data...
  [GPU
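
The F1 retrieval metric compares, for each query, the neighbors found from estimated similarities with the ground-truth neighbors at the same threshold. The sketch below covers a single query, assuming an inclusive threshold and a score of 0 when precision or recall is undefined. Both choices are assumptions, and `retrieval_f1` is not taken from the repository.

```python
import numpy as np

def retrieval_f1(true_sims, est_sims, threshold):
    """F1 of thresholded neighbors for one query (illustrative helper)."""
    relevant = np.asarray(true_sims) >= threshold
    retrieved = np.asarray(est_sims) >= threshold
    tp = int(np.logical_and(relevant, retrieved).sum())
    precision = tp / retrieved.sum() if retrieved.sum() else 0.0
    recall = tp / relevant.sum() if relevant.sum() else 0.0
    if precision + recall == 0:
        return 0.0
    return float(2 * precision * recall / (precision + recall))

# Made-up similarities between one query and four training rows.
true_sims = [0.9, 0.6, 0.2, 0.1]
est_sims = [0.8, 0.4, 0.55, 0.1]
print(retrieval_f1(true_sims, est_sims, threshold=0.5))
```

At high thresholds the log shows fewer than one ground-truth neighbor per query on average, so the handling of queries with no neighbors can noticeably affect an averaged score.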

### Experiments on KOS to calculate F1 Score using Cosine Similarity

In [26]:
THRESHOLD = [.1, .2, .4, .5, .6, .7, .85, .95]
threshold_str = ' '.join([str(t) for t in THRESHOLD])
RETRIEVAL_METRIC = 'f1'
DATASET = 'kos'
data_path = f'./data/{DATASET}_binary.npy'
SIMILARITY_SCORE = 'cosine_similarity'

In [27]:
!python scripts/save_ground_truth.py --experiment 2 \
                                     --data_path {data_path} \
                                     --similarity_score {SIMILARITY_SCORE} \
                                     --train_ratio .9 \
                                     --seed 42 \
                                     --use_gpu

✓ GPU acceleration enabled
Loading matrix from ./data/kos_binary.npy...
  Matrix Shape: (3430, 6906)
  Data type: int8
  Value range: [0, 1]
Dataset split: 3087 training, 343 query
Transferring data to GPU...
✓ Data transferred to GPU (using cupy)
Calculating Experiment 2 Ground Truth (cosine_similarity)...
  Training samples: 3087
  Query samples: 343
  Using GPU-accelerated batch computation...
Computing similarities: 100%|██████████████████| 4/4 [00:00<00:00, 27.70batch/s]

  Min similarity: 0.000000
  Max similarity: 0.977639
  Mean similarity: 0.070053

Saving ground truth to experiment/ground_truth/ground_truth_exp2_kos_cosine_similarity.json...
✓ Saved similarity matrix (343, 3087)


In [28]:
!python main.py --experiment 2 \
                --algo BinSketch BCS SimHash \
                --data_path {data_path} \
                --ground_truth_path ground_truth_exp2_{DATASET}_{SIMILARITY_SCORE}.json \
                --train_ratio .9 \
                --seed 42 \
                --threshold {threshold_str} \
                --similarity_score {SIMILARITY_SCORE} \
                --retrieval_metric {RETRIEVAL_METRIC} \
                --use_gpu

✓ GPU acceleration enabled
Running Experiment 2: Retrieval Performance Evaluation
GPU acceleration enabled
Loading matrix from ./data/kos_binary.npy...
  Matrix Shape: (3430, 6906)
Dataset split: 3087 training, 343 query samples

Loading/Computing Ground Truth Similarities
Loading ground truth from experiment/ground_truth/ground_truth_exp2_kos_cosine_similarity.json...
  Loaded similarity matrix (343, 3087)
Using cached ground truth similarity matrix

Precomputing ground truth neighbors for 8 thresholds...
  Threshold 0.1: avg 504.92 neighbors per query
  Threshold 0.2: avg 44.74 neighbors per query
  Threshold 0.4: avg 30.10 neighbors per query
  Threshold 0.5: avg 19.97 neighbors per query
  Threshold 0.6: avg 8.91 neighbors per query
  Threshold 0.7: avg 3.71 neighbors per query
  Threshold 0.85: avg 0.36 neighbors per query
  Threshold 0.95: avg 0.01 neighbors per query

Algorithm: BinSketch

  Compression length k=100
    Compressing data...
  [GPU] Conve