# Analyze Similar Pairs
This small project does the following:
1. Collects all the similar pairs output by LSH and SetJoin.
2. Analyzes them (how many documents they cover, and whether they contain any false positives).
3. Compares them (whether one result set includes the other).


## Load Similar Pairs

In [1]:
from glob import glob

import tqdm

# Load the similar pairs produced by LSH.
# Each line holds two entries separated by " :: "; the document ID is the part after "@".
dup_dir_path = "/research/projects/zp128/RedPajama-Data-1T/RedPajama-Data-1T/RedPajama_norm/dup"
files = glob(f"{dup_dir_path}/*")
simp_lsh = set()
for fp in files:
    with open(fp, "r") as f:
        for line in tqdm.tqdm(f):
            ori_pair = tuple(line.strip().split(" :: "))
            pair = tuple(part.split("@")[1] for part in ori_pair)
            # Skip self-pairs (a document paired with itself)
            if pair[0] != pair[1]:
                simp_lsh.add((int(pair[0]), int(pair[1])))




7349it [00:00, 172164.23it/s]
7223it [00:00, 184080.75it/s]
7405it [00:00, 195763.27it/s]
7430it [00:00, 185020.12it/s]
7314it [00:00, 184638.54it/s]
7316it [00:00, 183057.29it/s]
7234it [00:00, 170651.10it/s]
7317it [00:00, 178719.56it/s]
7458it [00:00, 180986.24it/s]
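
The per-line parsing in the cell above can be isolated into a small helper, which makes it easier to test on its own. This is a minimal sketch: the function name `parse_pair_line` and the sample lines are hypothetical, since the actual contents of the `dup` files are not shown in this notebook. It only assumes what the loading code already assumes, namely a ` :: ` separator and an ID after `@`.

```python
def parse_pair_line(line, sep=" :: ", id_sep="@"):
    """Return (id_a, id_b) as ints, or None when both IDs are the same."""
    left, right = line.strip().split(sep)
    a = int(left.split(id_sep)[1])
    b = int(right.split(id_sep)[1])
    return None if a == b else (a, b)


# Hypothetical sample lines; the prefixes before "@" are made up for illustration.
sample_lines = ["shard0@12 :: shard3@57\n", "shard1@8 :: shard1@8\n"]
pairs = {p for p in map(parse_pair_line, sample_lines) if p is not None}
print(pairs)
```

The self-pair on the second sample line is dropped, mirroring the `pair[0] != pair[1]` check in the loading loop.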


In [2]:
# load from SetJoin
import util
simpairs_bin_path = "/research/projects/zp128/RedPajama_Analysis/SetJoin/similar_pairs/stackexchange_sim_pairs_0.800000.bin"
idmap_bin_path = "/research/projects/zp128/RedPajama_Analysis/SetJoin/sorted_sets/stackexchange_idmap.bin"

idmap = util.read_ints_from_binary(idmap_bin_path)
sim_pairs = util.read_pairs_from_binary(simpairs_bin_path)
simp_setjoin = util.map_elements(sim_pairs, idmap)

simp_setjoin = util.correct_pair_order(simp_setjoin)
simp_lsh = util.correct_pair_order(simp_lsh)
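
The source of `util.correct_pair_order` is not shown here. Presumably it stores every pair in a canonical order so that `(a, b)` and `(b, a)` compare equal when the two result sets are later intersected. A minimal sketch under that assumption (the name `normalize_pairs` is hypothetical, not part of `util`):

```python
def normalize_pairs(pairs):
    """Return a set in which every pair is stored as (smaller_id, larger_id)."""
    return {(a, b) if a <= b else (b, a) for a, b in pairs}


# Toy data: the same two pairs, written in opposite orders by the two methods.
lsh_like = {(7, 3), (1, 2)}
setjoin_like = {(3, 7), (2, 1)}
print(normalize_pairs(lsh_like) == normalize_pairs(setjoin_like))
```

Without this step, set operations on the raw tuples would treat `(7, 3)` and `(3, 7)` as different pairs and undercount the overlap.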

## Analyze them

In [3]:
thres = 0.8
# Load the actual documents referenced by the SetJoin pairs
ids_setjoin = util.extract_elements(simp_setjoin)
dataset = util.read_pajama_dataset_selected_docs("stackexchange", ids_setjoin)
ids_setjoin_size = len(ids_setjoin)
print(f"It includes {ids_setjoin_size} unique documents.")


There are total 29825086 documents in this /research/projects/zp128/RedPajama-Data-1T/RedPajama-Data-1T/stackexchange/tokenized_text_document.idx
It includes 44793 unique documents.


In [5]:
thres = 0.8
# Load the documents referenced by the LSH pairs and check the pairs against the threshold
ids_lsh = util.extract_elements(simp_lsh)
dataset = util.read_pajama_dataset_selected_docs("stackexchange", ids_lsh)
ids_lsh_size = len(ids_lsh)
print(f"It includes {ids_lsh_size} unique documents.")
util.check_jaccard_similarity(dataset, simp_lsh, thres)

There are total 29825086 documents in this /research/projects/zp128/RedPajama-Data-1T/RedPajama-Data-1T/stackexchange/tokenized_text_document.idx
It includes 32104 unique documents.
0.625


False
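
The `0.625` and `False` above suggest that at least one LSH pair falls below the 0.8 threshold, although the exact behavior of `util.check_jaccard_similarity` is not shown. The sketch below illustrates one way such a check could work. It assumes each document is a sequence of token IDs and that similarity is the Jaccard index of the token sets; the function names and the toy documents are hypothetical.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard index of two token sequences, treated as sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # two empty documents are considered identical
    return len(set_a & set_b) / len(set_a | set_b)


def find_false_positives(docs, pairs, threshold):
    """Return (id_a, id_b, similarity) for every pair below the threshold."""
    below = []
    for a, b in sorted(pairs):
        sim = jaccard(docs[a], docs[b])
        if sim < threshold:
            below.append((a, b, sim))
    return below


# Toy documents keyed by document ID (made up for illustration).
docs = {
    0: [1, 2, 3, 4, 5],
    1: [1, 2, 3, 4, 5, 6],
    2: [1, 2, 3, 4, 5, 9, 10, 11],
}
candidate_pairs = {(0, 1), (0, 2)}
print(find_false_positives(docs, candidate_pairs, 0.8))
```

Here the pair `(0, 1)` shares 5 of 6 distinct tokens and passes, while `(0, 2)` shares 5 of 8 and is reported as a false positive.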

## Compare them

In [6]:
print(f"There are {len(simp_setjoin)} pairs in simp_setjoin")
print(f"There are {len(simp_lsh)} pairs in simp_lsh")

union_set = simp_setjoin.union(simp_lsh)
union_size = len(union_set)
print(f"The union of two sets includes {union_size} unique pairs.")

# Intersection
intersection_set = simp_setjoin.intersection(simp_lsh)
intersection_size = len(intersection_set)
print(f"The intersection of two sets includes {intersection_size} common pairs.")

# Difference: pairs found by SetJoin but not by LSH
difference_set = simp_setjoin.difference(simp_lsh)  # simp_setjoin - simp_lsh
difference_size = len(difference_set)
print(f"The difference of the two sets (simp_setjoin - simp_lsh) includes {difference_size} pairs found only by SetJoin.")


There are 31950 pairs in simp_setjoin
There are 17389 pairs in simp_lsh
The union of two sets includes 35188 unique pairs.
The intersection of two sets includes 14151 common pairs.
The difference of the two sets (simp_setjoin - simp_lsh) includes 17799 pairs found only by SetJoin.
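
The inclusion question from the introduction can be answered directly with Python's set comparison operators. The sketch below is run on toy data; the helper name `compare_pair_sets` is hypothetical, and it assumes both inputs are sets of already order-normalized pairs.

```python
def compare_pair_sets(a, b):
    """Summarize how two sets of pairs overlap."""
    return {
        "a_subset_of_b": a <= b,
        "b_subset_of_a": b <= a,
        "only_a": len(a - b),
        "only_b": len(b - a),
        "common": len(a & b),
    }


# Toy data standing in for simp_setjoin and simp_lsh.
setjoin_like = {(1, 2), (3, 4), (5, 6)}
lsh_like = {(1, 2), (7, 8)}
report = compare_pair_sets(setjoin_like, lsh_like)
print(report)
```

For the counts printed above, 17389 - 14151 = 3238 LSH pairs are missing from SetJoin and 17799 SetJoin pairs are missing from LSH, so neither result set includes the other.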
