# BinSketch Algorithm Experiments

This notebook runs experiments to evaluate the accuracy of similarity estimation algorithms (BinSketch, SimHash, MinHash) across different compression lengths.

## Usage

1. **Setup**: Download and convert datasets (cells below)
2. **Configure**: Set experiment parameters (thresholds, dataset, algorithms)
3. **Run**: Execute experiment cells that call `main.py`

The `main.py` script handles:
- Loading binary matrices
- Computing ground truth similarities
- Running algorithms with different compression lengths
- Generating accuracy plots
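
As a rough illustration of the ground-truth step in the list above, the sketch below computes exact pairwise cosine similarity between the rows of a binary matrix with NumPy. It is not taken from `main.py`; the helper name `pairwise_cosine` and the toy matrix are assumptions made for illustration only.

```python
import numpy as np

def pairwise_cosine(X):
    """Exact cosine similarity between all rows of a 0/1 matrix (hypothetical helper)."""
    X = np.asarray(X, dtype=np.float64)
    inner = X @ X.T                   # inner[i, j] = number of 1-bits shared by rows i and j
    norms = np.sqrt(np.diag(inner))   # norm of a binary row = sqrt(number of 1-bits)
    denom = np.outer(norms, norms)
    # All-zero rows have norm 0; report their similarity as 0 instead of dividing by zero.
    return np.divide(inner, denom, out=np.zeros_like(inner), where=denom > 0)

toy = np.array([[1, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 0, 0, 0]])
sims = pairwise_cosine(toy)
print(sims)
```

The dense `X @ X.T` product is only practical for small samples; for the full datasets a sparse representation or a sampled set of pairs would likely be needed.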

In [1]:
%ls

 Volume in drive D is New Volume
 Volume Serial Number is 6085-9B4A

 Directory of d:\hcmus\introduction-to-algorithm-complexity-and-analysis\binsketch-algorithm

09-Jan-26  08:50 AM    <DIR>          .
08-Jan-26  08:37 PM    <DIR>          ..
08-Jan-26  11:56 PM                75 .gitignore
09-Jan-26  12:43 AM    <DIR>          .venv
09-Jan-26  12:51 AM            64,330 bin_sketch.ipynb
09-Jan-26  12:07 AM             7,903 convert.py
08-Jan-26  11:57 PM             1,673 download_dataset.py
08-Jan-26  03:40 PM                 0 main.py
09-Jan-26  12:35 AM                25 README.md
09-Jan-26  12:51 AM                51 requirements.txt
09-Jan-26  12:44 AM    <DIR>          src
               7 File(s)         74,057 bytes
               4 Dir(s)  473,662,222,336 bytes free


In [None]:
DRIVE_URL = 'https://drive.google.com/drive/folders/1ARBY9cIGj_jigi5Y88CtUy-GMj2clrXj'
RAW_DATA_PATH = './raw'
PROCESSED_DATA_PATH = './data'

In [None]:
import gdown
import zipfile
import os

# Prepare dataset

In [None]:
def download_dataset(drive_url, target_folder):
    print(f"Processing: {drive_url}")
    
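    # Create the target folder if it doesn't exist
    os.makedirs(target_folder, exist_ok=True)

    # Any URL containing "folder" (e.g. ".../drive/folders/<id>") is treated as a folder link
    if "folder" in drive_url:
        print("\n[INFO] Detected a Drive FOLDER link; downloading its files...")
        # download_folder returns the list of downloaded files, or None on failure
        files = gdown.download_folder(drive_url, output=target_folder, quiet=False, use_cookies=False)
        if not files:
            print("[ERROR] Folder download failed.")
            return
        print(f"\n[SUCCESS] Folder contents downloaded to: {target_folder}")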
    else:
        print("\n[INFO] Detected a Drive FILE link.")
        zip_path = os.path.join(target_folder, "temp_dataset.zip")
        output = gdown.download(drive_url, zip_path, quiet=False, fuzzy=True)
        
        if not output:
            print("[ERROR] Download failed.")
            return

        print(f"\nUnzipping {output}...")
        try:
            with zipfile.ZipFile(output, 'r') as zip_ref:
                zip_ref.extractall(target_folder)
            print(f"[SUCCESS] Extracted to: {target_folder}")
            
            # Clean up the zip file
            os.remove(output)
            
        except zipfile.BadZipFile:
            print("[ERROR] The downloaded file was not a valid zip file.")
            print("Check if the file on Drive is actually a .zip archive.")

In [None]:
download_dataset(DRIVE_URL, RAW_DATA_PATH)

In [None]:
!python convert.py
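
Before running the experiments, it may help to confirm that the conversion step produced usable input. The sketch below assumes that `convert.py` writes files named `<dataset>_binary.npy`, each holding a 2-D array of 0s and 1s (the file naming is taken from the experiment cells; the array layout is an assumption). The helper `check_binary_matrix` is hypothetical and is demonstrated on a tiny temporary file rather than on the real data.

```python
import os
import tempfile
import numpy as np

def check_binary_matrix(path):
    """Load a .npy file and verify it is a 2-D array containing only 0s and 1s."""
    X = np.load(path)
    if X.ndim != 2:
        raise ValueError(f"expected a 2-D matrix, got shape {X.shape}")
    if not np.isin(X, (0, 1)).all():
        raise ValueError("matrix contains values other than 0 and 1")
    density = float(X.mean())  # fraction of 1-bits; sparse data gives a small value
    return X.shape, density

# Demo on a small temporary file so the sketch runs without the real datasets.
demo_path = os.path.join(tempfile.mkdtemp(), "toy_binary.npy")
np.save(demo_path, np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=np.uint8))
shape, density = check_binary_matrix(demo_path)
print(shape, density)
```

For a real run, the same call could be pointed at a converted file such as `./data/enron_binary.npy`.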

### Experiment 1: Accuracy of Estimation

Run experiments to evaluate the accuracy of similarity estimation across different compression lengths.
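
For orientation, here is a minimal sketch of how BinSketch compresses a binary vector to a given compression length `N` and estimates similarity from the compressed form, following the published description of the algorithm: each input position is mapped to one of `N` sketch positions at random, a sketch bit is the OR of the input bits mapped to it, and the inner product is recovered from bit counts. All function names, the seed, and the toy sizes are assumptions; the implementation that `main.py` actually uses may differ.

```python
import numpy as np

def make_mapping(d, N, seed=0):
    """Random map from each of d input positions to one of N sketch positions."""
    return np.random.default_rng(seed).integers(0, N, size=d)

def binsketch(x, mapping, N):
    """Sketch bit j is the OR of all input bits whose position maps to j."""
    s = np.zeros(N, dtype=bool)
    s[mapping[np.flatnonzero(x)]] = True
    return s

def estimate_size(s):
    """Estimate the number of 1-bits in the original vector (sketch must not be all ones)."""
    N = s.size
    return np.log(1.0 - s.sum() / N) / np.log(1.0 - 1.0 / N)

def estimate_inner_product(sa, sb):
    """Estimate <a, b> from two sketches built with the same mapping."""
    N = sa.size
    n = 1.0 - 1.0 / N
    na, nb = estimate_size(sa), estimate_size(sb)
    overlap = np.logical_and(sa, sb).sum()
    # n**na + n**nb + overlap/N - 1 estimates n**|a OR b|; then <a, b> = |a| + |b| - |a OR b|.
    return na + nb - np.log(n**na + n**nb + overlap / N - 1.0) / np.log(n)

def estimate_cosine(sa, sb):
    """Cosine estimate; assumes neither sketch is empty."""
    return estimate_inner_product(sa, sb) / np.sqrt(estimate_size(sa) * estimate_size(sb))

# Toy comparison of exact and estimated values (sizes chosen arbitrarily).
d, N = 2000, 400
rng = np.random.default_rng(1)
a = (rng.random(d) < 0.02).astype(np.int64)
b = a.copy()
b[rng.choice(d, size=20, replace=False)] ^= 1  # flip a few bits so b differs from a
mapping = make_mapping(d, N)
sa, sb = binsketch(a, mapping, N), binsketch(b, mapping, N)
print("exact inner product:    ", int(a @ b))
print("estimated inner product:", float(estimate_inner_product(sa, sb)))
print("estimated cosine:       ", float(estimate_cosine(sa, sb)))
```

Larger `N` means fewer collisions in the sketch, so the estimates should generally move closer to the exact values as the compression length grows, which is the trend this experiment is meant to measure.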

In [None]:
# Configuration
THRESHOLD = [.1, .2, .3, .4, .5, .6, .7, .8, .9]
threshold_str = " ".join(map(str, THRESHOLD))

# Available datasets: bbc, enron, kos, nytimes
DATASET = 'enron'

#### Cosine Similarity

In [None]:
# Experiment 1: Cosine Similarity
data_path = f'./data/{DATASET}_binary.npy'

!python main.py --data_path {data_path} --algo BinSketch SimHash MinHash --metric cosine_similarity --threshold {threshold_str}

### Experiment 2: Ranking

In [None]:
# Configuration for Ranking Experiment
RANKING_THRESHOLD = [.1, .2, .5, .6, .8, .85, .9, .95]
ranking_threshold_str = " ".join(map(str, RANKING_THRESHOLD))

# Run ranking experiment
data_path = f'./data/{DATASET}_binary.npy'

!python main.py --data_path {data_path} --algo BinSketch SimHash MinHash --metric cosine_similarity --threshold {ranking_threshold_str}