# End-to-end fuzzy deduplication

A GPU-accelerated implementation of MinHash-LSH-based fuzzy deduplication. For more information about fuzzy deduplication in NeMo Curator, refer to the [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html) section of the documentation.

This tutorial shows how to run fuzzy deduplication on text data by executing two end-to-end workflows: duplicate identification and duplicate removal.
Together, these two workflows roughly cover the following steps:

1. Read original dataset
2. Compute MinHash signatures of these documents
3. Perform LSH - Group MinHashes into bands/buckets and shuffle these bands/buckets so that documents in the same bucket are in the same batch/file.
4. Convert the LSH outputs (bucket_id -> doc_id mapping) into an edgelist in preparation for connected components.
5. Compute connected components across all potential duplicates found via LSH.
6. Generate list of duplicate documents by randomly selecting 1 document to keep from each group/component and dropping the rest.
7. Remove duplicates based on the generated duplicate list.

Users can also run these steps independently; this is covered in the step-by-step tutorial in the same directory as this tutorial.
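To make the steps above concrete, here is a minimal pure-Python sketch of the same idea on three toy documents. It is not how NeMo Curator implements these stages (which run on GPUs and at scale); the function names, the hash function (`blake2b`), the small parameter values, and the "keep the lowest ID" rule are all assumptions chosen for illustration. The workflow itself keeps one randomly selected document per group.

```python
import hashlib
from collections import defaultdict


def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Return the set of overlapping character n-grams of `text`."""
    return {text[i : i + n] for i in range(max(len(text) - n + 1, 1))}


def minhash_signature(text: str, num_hashes: int = 20, n: int = 5) -> list[int]:
    """One value per seed: the smallest seeded hash over all n-grams."""
    grams = char_ngrams(text, n)
    return [
        min(int.from_bytes(hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(), "big") for g in grams)
        for seed in range(num_hashes)
    ]


def lsh_buckets(signatures: dict[int, list[int]], hashes_per_band: int) -> dict[tuple, list[int]]:
    """Group doc IDs whose signatures agree on every hash of at least one band."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band, start in enumerate(range(0, len(sig), hashes_per_band)):
            buckets[(band, tuple(sig[start : start + hashes_per_band]))].append(doc_id)
    return {key: ids for key, ids in buckets.items() if len(ids) > 1}


def connected_components(edges: list[tuple[int, int]]) -> dict[int, int]:
    """Union-find: map every doc ID that appears in an edge to a group root."""
    parent: dict[int, int] = {}

    def find(x: int) -> int:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return {x: find(x) for x in list(parent)}


# Step 1: toy dataset (doc 0 and doc 1 differ only in the last character)
docs = {
    0: "Sara and Ben were friends who liked to play in the park every day.",
    1: "Sara and Ben were friends who liked to play in the park every day!",
    2: "A quick brown fox jumps over lazy dogs by a river bank at dawn.",
}
# Step 2: MinHash signatures
signatures = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
# Step 3: LSH banding (20 hashes split into 10 bands of 2)
buckets = lsh_buckets(signatures, hashes_per_band=2)
# Step 4: bucket -> edge list
edges = [(ids[j], ids[j + 1]) for ids in buckets.values() for j in range(len(ids) - 1)]
# Step 5: connected components
groups = connected_components(edges)
# Step 6: keep one document per group (lowest ID here), list the rest for removal
members_by_group = defaultdict(list)
for doc_id, root in groups.items():
    members_by_group[root].append(doc_id)
to_remove = sorted(doc_id for members in members_by_group.values() for doc_id in sorted(members)[1:])
# Step 7: drop the listed documents
deduplicated = {doc_id: text for doc_id, text in docs.items() if doc_id not in to_remove}
print(to_remove, sorted(deduplicated))
```

Documents that never share a bucket with another document do not appear in `groups` at all, which mirrors the fact that the connected components output in this tutorial only contains documents with at least one potential duplicate.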

In [1]:
import os

import fsspec

# Silence Curator logs via Loguru
os.environ["LOGURU_LEVEL"] = "ERROR"

import pandas as pd

input_dataset_path = "./input"  # Path to input dataset
fuzzy_output_dir = "./fuzzy_outputs"  # Path to store all fuzzy outputs including cache & deduped dataset

fuzzy_cache_path = os.path.join(
    fuzzy_output_dir, "cache"
)  # Path to store fuzzy deduplication intermediates (minhash, lsh etc.)
deduplicated_output_path = os.path.join(fuzzy_output_dir, "fuzzy_deduped_dataset")

input_filetype = (
    "parquet"  # this can be either of jsonl or parquet (you'll need to change how input data is generated)
)
# Note: It's important that this is constant across identification and removal.
# More information about choosing a good blocksize is mentioned in the performance considerations section below
input_blocksize = "512MiB"
output_filetype = "parquet"  # this can be either of jsonl or parquet

storage_options = None  # Optional additional cloud I/O args to pass into Pandas/cuDF during I/O operations.
io_kwargs = {"storage_options": storage_options} if storage_options is not None else None
fs, _ = fsspec.url_to_fs(fuzzy_cache_path, **storage_options if storage_options is not None else {})

### Downloading and saving a sample dataset

We download the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset and save it to the `input_dataset_path` specified above. This step can be skipped if you are running on a different dataset that's already present in `input_dataset_path`.

In [2]:
from nemo_curator.utils.file_utils import get_all_file_paths_under

if len(get_all_file_paths_under(input_dataset_path, storage_options=storage_options)) == 0:
    import os
    import uuid

    from datasets import load_dataset

    input_df = load_dataset("roneneldan/TinyStories", split="train").to_pandas()
    num_rows_per_file = 10_000

    os.makedirs(input_dataset_path, exist_ok=True)

    for i, start_idx in enumerate(range(0, len(input_df), num_rows_per_file)):
        if i % 50 == 0:
            print(f"Processing file {i}")
        end_idx = min(len(input_df), start_idx + num_rows_per_file)
        subset_df = input_df.iloc[start_idx:end_idx].copy()
        subset_df["id"] = [str(uuid.uuid4()) for _ in range(len(subset_df))]
        subset_df.to_parquet(
            os.path.join(input_dataset_path, f"part_{i}.parquet"), index=False, storage_options=storage_options
        )

    print(f"Created {i + 1} files")

Processing file 0
Processing file 50
Processing file 100
Processing file 150
Processing file 200
Created 212 files


## Running as a Single Stage (End-to-End)

See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.deduplication.fuzzy.workflow.html#api) for more information about the `FuzzyDeduplicationWorkflow` class.

### General Notes
#### ID Generation
1. The ID generation process requires a running Ray cluster. Start one before running the workflow, either from the CLI or with the `RayClient` API in Curator.
2. The `FuzzyDeduplicationWorkflow` API ignores any existing IDs in the input dataset and instead generates IDs on the fly using an ID Generator actor.
3. The ID Generator assigns each row a unique, increasing integer ID based on the order in which files are read.
4. This avoids an expensive ID-to-integer encoding step, since the underlying connected components algorithm only supports integer IDs.
5. When duplicates are found, their integer IDs are saved in sorted files with multiple row groups.
6. The workflow also saves a `fuzzy_id_generator.json` file, which maps input file partitions to the ID range assigned to each batch.
7. During removal, reading the same file groups yields the same integer IDs. Using the min/max ID values of a batch, we can then look up all duplicates that fall in that range, which makes the process faster.
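The range lookup described in the last point can be sketched with a binary search over sorted IDs. This is only an illustration of why sorted duplicate IDs plus per-batch min/max values make removal cheap; the `id_ranges` dictionary and its file names are hypothetical and do not reflect the actual schema of `fuzzy_id_generator.json` or how the removal stage reads row groups.

```python
from bisect import bisect_left, bisect_right

# Hypothetical mapping of file groups to the (min_id, max_id) range they were assigned.
id_ranges = {
    "part_0.parquet": (0, 9_999),
    "part_1.parquet": (10_000, 19_999),
}

# Duplicate IDs are stored sorted, so each range maps to one contiguous slice.
sorted_duplicate_ids = [13, 25, 32, 53, 56, 10_007, 15_500]


def duplicates_in_range(sorted_ids: list[int], min_id: int, max_id: int) -> list[int]:
    """Return the duplicate IDs that fall inside [min_id, max_id] using binary search."""
    lo = bisect_left(sorted_ids, min_id)
    hi = bisect_right(sorted_ids, max_id)
    return sorted_ids[lo:hi]


per_file = {
    name: duplicates_in_range(sorted_duplicate_ids, min_id, max_id)
    for name, (min_id, max_id) in id_ranges.items()
}
print(per_file)
```

Because both the batch ranges and the duplicate IDs are ordered, each batch only needs to inspect its own slice rather than the full removal list.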

#### Performance Considerations
1. LSH - `bands_per_iteration` controls how many bands are processed simultaneously in a single shuffle. Higher values can lead to faster performance but might increase memory pressure.
2. A low `input_blocksize` may not saturate the GPUs, while a high `input_blocksize` can lead to OOM errors during MinHash and excessive object store usage during removal. It's recommended to keep it between 512MiB and 1.5GiB, and to reduce it if you run into OOMs during MinHash.
3. The removal step can be memory intensive, so it's recommended to set a higher fraction of object store memory for removal (if the machine has enough RAM). The `RayDataExecutor` showed better results during duplicate removal.
4. The removal workflow is CPU-only and can be run on machines that don't have GPUs.

#### Hyperparameter Considerations
1. The current defaults for fuzzy deduplication (260 hashes, 13 hashes per band, i.e. 20 bands) approximate finding documents with a Jaccard similarity of 0.8. To select the number of bands and hashes, it's recommended to analyze the S-curve and the tolerable threshold for false positives (and negatives). More information about LSH can be found in section `3.4.2` [here](http://infolab.stanford.edu/~ullman/mmds/ch3n.pdf).
2. The `char_ngrams` value of 24 is chosen so that each character n-gram roughly corresponds to ~5 words.
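The S-curve mentioned above can be evaluated directly. With `b` bands of `r` hashes each, two documents with Jaccard similarity `s` become a candidate pair with probability `1 - (1 - s^r)^b`, and the curve's steepest rise is near `(1/b)^(1/r)`. The sketch below uses these standard LSH formulas with the defaults quoted above; the helper name `candidate_probability` is introduced here for illustration only.

```python
def candidate_probability(similarity: float, hashes_per_band: int, num_bands: int) -> float:
    """Probability that two documents share at least one LSH bucket."""
    return 1.0 - (1.0 - similarity**hashes_per_band) ** num_bands


num_hashes = 260
hashes_per_band = 13
num_bands = num_hashes // hashes_per_band  # 20 bands

# Approximate similarity threshold where the S-curve rises most steeply
threshold = (1.0 / num_bands) ** (1.0 / hashes_per_band)
print(f"Approximate threshold: {threshold:.3f}")

for similarity in (0.5, 0.7, 0.8, 0.9):
    prob = candidate_probability(similarity, hashes_per_band, num_bands)
    print(f"Jaccard {similarity:.1f} -> candidate probability {prob:.4f}")
```

Increasing the number of hashes per band pushes the threshold higher (fewer false positives, more false negatives), while adding bands pushes it lower.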


In [4]:
import time

import torch

from nemo_curator.backends.experimental.ray_data import RayDataExecutor
from nemo_curator.core.client import RayClient

NUM_GPUS = 2

if torch.cuda.device_count() < NUM_GPUS:
    error_msg = "The number of GPUs on this machine is lower than the default this tutorial was tested with; please update `num_gpus` passed into `RayClient`"
    raise ValueError(error_msg)

client = RayClient(num_cpus=64, num_gpus=NUM_GPUS)  # change as needed
client.start()

In [3]:
from nemo_curator.stages.deduplication.fuzzy import FuzzyDeduplicationWorkflow

# All workflows support passing in different kwargs and storage_options for the read, cache and output datasets
# We use a common one here for simplicity

identification_workflow = FuzzyDeduplicationWorkflow(
    cache_path=fuzzy_cache_path,
    output_path=fuzzy_output_dir,
    input_path=input_dataset_path,
    input_filetype=input_filetype,
    input_blocksize=input_blocksize,
    text_field="text",
    seed=42,
    char_ngrams=24,
    minhashes_per_band=13,
    bands_per_iteration=10,
    read_kwargs=io_kwargs,
    cache_kwargs=io_kwargs,
    write_kwargs=io_kwargs,
)

In [5]:
st = time.time()
_ = identification_workflow.run()
print(f"Identification workflow took: {(time.time() - st):.2f}s")

2025-12-15 23:08:45,515	INFO worker.py:1696 -- Using address 127.0.1.1:6380 set in the environment variable RAY_ADDRESS
2025-12-15 23:08:45,521	INFO worker.py:1837 -- Connecting to existing Ray cluster at address: 127.0.1.1:6380...
[2025-12-15 23:08:45,534 W 17220 17220] global_state_accessor.cc:505: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2025-12-15 23:08:46,536 W 17220 17220] global_state_accessor.cc:505: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2025-12-15 23:08:47,538 W 17220 17220] global_state_accessor.cc:505: Some processes that the driver needs to connect to have not registered with GCS, so retrying. Have you run 'ray start' on this node?
[2025-12-15 23:08:48,539 I 17220 17220] global_state_accessor.cc:487: This node has an IP address of 127.0.1.1, but we cannot find a local Raylet with the same

2025-12-15 23:08:40,526	INFO usage_lib.py:447 -- Usage stats collection is disabled.
2025-12-15 23:08:40,526	INFO scripts.py:919 -- Local node IP: 127.0.1.1
2025-12-15 23:08:48,520	SUCC scripts.py:963 -- --------------------
2025-12-15 23:08:48,521	SUCC scripts.py:964 -- Ray runtime started.
2025-12-15 23:08:48,521	SUCC scripts.py:965 -- --------------------
2025-12-15 23:08:48,521	INFO scripts.py:967 -- Next steps
2025-12-15 23:08:48,521	INFO scripts.py:970 -- To add another node to this Ray cluster, run
2025-12-15 23:08:48,521	INFO scripts.py:973 --   ray start --address='127.0.1.1:6380'
2025-12-15 23:08:48,521	INFO scripts.py:982 -- To connect to this Ray cluster:
2025-12-15 23:08:48,521	INFO scripts.py:984 -- import ray
2025-12-15 23:08:48,521	INFO scripts.py:985 -- ray.init(_node_ip_address='127.0.1.1')
2025-12-15 23:08:48,521	INFO scripts.py:997 -- To su

2025-12-15 23:08:51,273	INFO worker.py:1696 -- Using address 127.0.1.1:6380 set in the environment variable RAY_ADDRESS
2025-12-15 23:08:51,277	INFO worker.py:1837 -- Connecting to existing Ray cluster at address: 127.0.1.1:6380...
2025-12-15 23:08:51,284	INFO worker.py:2014 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8267
2025-12-15 23:09:14,403	INFO worker.py:1696 -- Using address 127.0.1.1:6380 set in the environment variable RAY_ADDRESS
2025-12-15 23:09:14,407	INFO worker.py:1837 -- Connecting to existing Ray cluster at address: 127.0.1.1:6380...
2025-12-15 23:09:14,414	INFO worker.py:2014 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8267
2025-12-15 23:09:58,710	INFO worker.py:1696 -- Using address 127.0.1.1:6380 set in the environment variable RAY_ADDRESS
2025-12-15 23:09:58,714	INFO worker.py:1837 -- Connecting to existing Ray cluster at address: 127.0.1.1:6380...
2025-12-15 23:09:58,720	INFO w

Identification workflow took: 121.86s


In [None]:
from nemo_curator.stages.deduplication.id_generator import CURATOR_DEDUP_ID_STR
from nemo_curator.stages.text.deduplication import TextDuplicatesRemovalWorkflow

removal_workflow = TextDuplicatesRemovalWorkflow(
    input_path=input_dataset_path,  # Must be identical to the path used during identification
    ids_to_remove_path=os.path.join(fuzzy_output_dir, "FuzzyDuplicateIds"),
    output_path=deduplicated_output_path,
    input_filetype=input_filetype,
    input_blocksize=input_blocksize,  # This must be identical to the blocksize used during identification
    duplicate_id_field=CURATOR_DEDUP_ID_STR,
    id_generator_path=os.path.join(fuzzy_output_dir, "fuzzy_id_generator.json"),
    output_filetype=output_filetype,
    input_kwargs=io_kwargs,  # read_kwargs for input dataset
    duplicate_id_read_kwargs=io_kwargs,  # read_kwargs for removal_id's generated by Fuzzy workflow
    id_generator_storage_options=storage_options,
    output_kwargs=io_kwargs,
)

In [7]:
st = time.time()
_ = removal_workflow.run(executor=RayDataExecutor())
print(f"Removal workflow took: {(time.time() - st):.2f}s")

2025-12-15 23:12:51,080	INFO worker.py:1696 -- Using address 127.0.1.1:6380 set in the environment variable RAY_ADDRESS
2025-12-15 23:12:51,084	INFO worker.py:1837 -- Connecting to existing Ray cluster at address: 127.0.1.1:6380...
2025-12-15 23:12:51,084	INFO worker.py:1855 -- Calling ray.init() again after it has already been called.
2025-12-15 23:12:51,317	INFO worker.py:1696 -- Using address 127.0.1.1:6380 set in the environment variable RAY_ADDRESS
2025-12-15 23:12:51,321	INFO worker.py:1837 -- Connecting to existing Ray cluster at address: 127.0.1.1:6380...
2025-12-15 23:12:51,327	INFO worker.py:2014 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8267
2025-12-15 23:12:59,895	INFO logging.py:397 -- Registered dataset logger for dataset dataset_5_0
2025-12-15 23:12:59,904	INFO streaming_executor.py:174 -- Starting execution of Dataset dataset_5_0. Full logs are in /tmp/ray/session_2025-12-15_23-08-40_527975_17737/logs/ray-data
2025-12-15 23:

Running 0: 0.00 row [00:00, ? row/s]

- MapBatches(FilePartitioningStageTask) 1: 0.00 row [00:00, ? row/s]

- StreamingRepartition 2: 0.00 row [00:00, ? row/s]

- MapBatches(ParquetReaderStageActor) 3: 0.00 row [00:00, ? row/s]

- MapBatches(TextDuplicatesRemovalStageTask)->MapBatches(ParquetWriterTask) 4: 0.00 row [00:00, ? row/s]

2025-12-15 23:13:22,279	INFO streaming_executor.py:300 -- ✔️  Dataset dataset_5_0 execution finished in 22.37 seconds
2025-12-15 23:13:22,288	INFO util.py:257 -- Exiting prefetcher's background thread
2025-12-15 23:13:22,304	INFO worker.py:1696 -- Using address 127.0.1.1:6380 set in the environment variable RAY_ADDRESS
2025-12-15 23:13:22,309	INFO worker.py:1837 -- Connecting to existing Ray cluster at address: 127.0.1.1:6380...
2025-12-15 23:13:22,316	INFO worker.py:2014 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8267


Removal workflow took: 31.28s


### Looking at Intermediate Results and Output

#### MinHash Results
1. `_curator_dedup_id` - The IDs assigned to this dataset on the fly during the initial read.
2. `_minhash_signature` - MinHash Signature

#### LSH Results
1. `_bucket_id` - The bucket/band identifier
2. `_curator_dedup_id` - List of all document IDs that belong to that bucket

#### Buckets To Edges Result
1. `_curator_dedup_id_x`, `_curator_dedup_id_y` - The edges of a graph, where the two columns of each row hold the IDs of a pair of documents that are potential duplicates.

In [8]:
minhash_path = os.path.join(fuzzy_cache_path, "MinHashStage")
display(pd.read_parquet(fs.unstrip_protocol(fs.find(minhash_path)[0]), storage_options=storage_options).head())

lsh_path = os.path.join(fuzzy_cache_path, "LSHStage")
display(pd.read_parquet(fs.unstrip_protocol(fs.find(lsh_path)[0]), storage_options=storage_options).head())

b2e_path = os.path.join(fuzzy_cache_path, "BucketsToEdgesStage")
display(pd.read_parquet(fs.unstrip_protocol(fs.find(b2e_path)[0]), storage_options=storage_options).head())

| | `_curator_dedup_id` | `_minhash_signature` |
|---|---|---|
| 0 | 0 | [218051, 2965574, 2358869, 20793331, 9567445, ... |
| 1 | 1 | [13231761, 1801895, 1933976, 3402840, 8234515,... |
| 2 | 2 | [13972691, 2206484, 3887953, 1782578, 7445153,... |
| 3 | 3 | [5066913, 6771503, 375732, 841498, 7703292, 45... |
| 4 | 4 | [4066453, 951833, 9469185, 3399185, 1533452, 6... |

| | `_bucket_id` | `_curator_dedup_id` |
|---|---|---|
| 0 | b0_000055fd7daae1e46223e8b7e06bf2e0 | [68375, 969489] |
| 1 | b0_0000f975e5bcda25838df43b0d37737f | [224885, 1975572] |
| 2 | b0_0001c9dff36e10d709d64123cb0dee4d | [826007, 1309488] |
| 3 | b0_00020b2c889483bd6a78ffe9a8d7deb1 | [908278, 1270888] |
| 4 | b0_00024c2b7321353410dd908eb31499bd | [1222795, 2000426] |

| | `_curator_dedup_id_x` | `_curator_dedup_id_y` |
|---|---|---|
| 0 | 68375 | 969489 |
| 1 | 224885 | 1975572 |
| 2 | 826007 | 1309488 |
| 3 | 908278 | 1270888 |
| 4 | 1222795 | 2000426 |
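In the LSH and edge outputs above, every bucket shown holds exactly two documents, so each bucket becomes a single edge. For larger buckets, one common way to build an edge list is to chain consecutive IDs, which connects all `k` documents of a bucket with `k - 1` edges instead of all `k * (k - 1) / 2` pairs. The sketch below shows that approach; whether `BucketsToEdgesStage` uses exactly this strategy is an assumption, and the bucket IDs and the helper `bucket_to_edges` are made up for illustration.

```python
def bucket_to_edges(doc_ids: list[int]) -> list[tuple[int, int]]:
    """Chain the sorted doc IDs of one bucket into consecutive (x, y) edges."""
    ordered = sorted(doc_ids)
    return list(zip(ordered, ordered[1:]))


# Hypothetical buckets: one pair (as in the output above) and one bucket of three documents
buckets = {
    "b0_aaaa": [68375, 969489],
    "b0_bbbb": [5, 3, 9],
}

# Deduplicate edges, since the same pair can share buckets in several bands
edges = sorted({edge for ids in buckets.values() for edge in bucket_to_edges(ids)})
print(edges)
```

Chaining is sufficient because connected components only needs each group to be connected, not fully pairwise linked.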


#### Connected Components Result

1. `_curator_dedup_id` - The document IDs
2. `_duplicate_group_id` - The group ID that document belongs to. Documents with the same duplicate group ID are duplicates

In [9]:
cc_path = os.path.join(fuzzy_cache_path, "ConnectedComponentsStage")
cc_df = pd.read_parquet(cc_path, storage_options=storage_options)  # works with pandas since the input here is small
display(cc_df)
grouped_cc_df = cc_df.groupby("_duplicate_group_id")._curator_dedup_id.agg(list)
display(grouped_cc_df)
duplicate_cluster_sizes = cc_df._duplicate_group_id.value_counts()
display(duplicate_cluster_sizes)

| | `_curator_dedup_id` | `_duplicate_group_id` |
|---|---|---|
| 0 | 3 | 171083 |
| 1 | 5 | 491932 |
| 2 | 6 | 491933 |
| 3 | 7 | 320428 |
| 4 | 8 | 171086 |
| ... | ... | ... |
| 640509 | 2119713 | 132508 |
| 640510 | 2119714 | 320421 |
| 640511 | 2119715 | 320422 |
| 640512 | 2119716 | 453258 |


_duplicate_group_id
0                [603797, 0]
1                [603798, 1]
2                [2, 603799]
5               [12, 603809]
6               [603812, 15]
                 ...        
640502    [1237637, 2119693]
640506    [2119701, 1237645]
640507    [2119702, 1237646]
640510    [2119706, 1237650]
640511    [2119707, 1237651]
Name: _curator_dedup_id, Length: 320043, dtype: object

_duplicate_group_id
14100     230
153774      3
269755      3
269739      3
269745      3
         ... 
427192      2
198728      2
198726      2
213120      2
310717      2
Name: count, Length: 320043, dtype: int64

Based on the distribution above, there is one cluster/group of 230 documents that are all duplicates of each other, followed by many smaller clusters of 2 or 3 duplicate documents.

#### FuzzyDuplicateIds Results (List of duplicate docs to remove)
1. `_curator_dedup_id` - ID of docs in the removal list

In [10]:
duplicate_ids_path = os.path.join(fuzzy_output_dir, "FuzzyDuplicateIds")
duplicates_df = pd.read_parquet(duplicate_ids_path, storage_options=storage_options)
display(duplicates_df.head())

print(f"Number of duplicate documents found for removal: {len(duplicates_df)}")

| | `_curator_dedup_id` |
|---|---|
| 0 | 13 |
| 1 | 25 |
| 2 | 32 |
| 3 | 53 |
| 4 | 56 |


Number of duplicate documents found for removal: 320471


#### Checking that the duplicate IDs list contains all but one document per group

In [11]:
# As an example let's look at the group with the largest number of duplicates
largest_duplicate_cluster = grouped_cc_df.loc[duplicate_cluster_sizes.index[0]]

# number of docs in the removal list from this group
docs_to_remove_in_group = duplicates_df._curator_dedup_id.isin(largest_duplicate_cluster).sum()

print(f"Number of documents in the duplicate group: {len(largest_duplicate_cluster)}")
print(f"Number of documents in the removal list from the same group: {docs_to_remove_in_group}")
assert docs_to_remove_in_group == (len(largest_duplicate_cluster) - 1)  # noqa: S101

Number of documents in the duplicate group: 230
Number of documents in the removal list from the same group: 229
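The check above looks at a single group. The same invariant (exactly one document kept per group) can be verified for every group at once with a pandas group-by. The sketch below uses small made-up dataframes in place of `cc_df` and `duplicates_df`, so the values are illustrative only; like the analysis in this section, it assumes the connected components results fit in memory.

```python
import pandas as pd

# Toy stand-ins for cc_df and duplicates_df
cc = pd.DataFrame(
    {
        "_curator_dedup_id": [0, 1, 2, 3, 4, 5],
        "_duplicate_group_id": [10, 10, 10, 20, 20, 30],
    }
)
removal = pd.DataFrame({"_curator_dedup_id": [1, 2, 4]})

# Number of documents in each duplicate group
group_sizes = cc.groupby("_duplicate_group_id")["_curator_dedup_id"].size()

# Number of documents from each group that appear in the removal list
removed_per_group = (
    cc[cc["_curator_dedup_id"].isin(removal["_curator_dedup_id"])]
    .groupby("_duplicate_group_id")["_curator_dedup_id"]
    .size()
    .reindex(group_sizes.index, fill_value=0)
)

# Every group should keep exactly one document
kept_per_group = group_sizes - removed_per_group
bad_groups = kept_per_group[kept_per_group != 1]
print(f"Groups violating the invariant: {len(bad_groups)}")
```

An empty `bad_groups` means no group was removed entirely and no group retained more than one document.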


#### Advanced: Looking at examples of duplicate documents

1. This analysis re-reads the input data with the same ID mapping that was used during duplicate identification.
2. It then merges the input data with the connected components results on the `_curator_dedup_id` column, which associates each document with the duplicate group it belongs to for further analysis.

**NOTE**: This analysis approach is intended as an example for smaller datasets and only works when the connected components dataframe is small and fits comfortably in memory. It is not recommended for larger datasets.

In [12]:
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.base import ProcessingStage
from nemo_curator.stages.resources import Resources
from nemo_curator.stages.text.io.reader import JsonlReader, ParquetReader
from nemo_curator.tasks.document import DocumentBatch


class CustomMergeStage(ProcessingStage[DocumentBatch, DocumentBatch]):
    """
    Warning: This should not be attempted with large connected components results.
    A small stage that merges the input data (using the id's generated) with the connected components result.
    Works because CC results are small enough to fit per batch.
    """

    resources = Resources(cpus=1.0)

    def process(self, batch: DocumentBatch) -> DocumentBatch:
        df = batch.to_pandas().merge(cc_df, how="inner", on=[CURATOR_DEDUP_ID_STR])
        return DocumentBatch(
            task_id=batch.task_id, dataset_name=batch.dataset_name, data=df, _stage_perf=batch._stage_perf
        )


ReaderClass = ParquetReader if input_filetype == "parquet" else JsonlReader
pipeline = Pipeline(
    name="Explore duplicates",
    stages=[
        ReaderClass(file_paths=input_dataset_path, blocksize=input_blocksize, _assign_ids=True, read_kwargs=io_kwargs),
        CustomMergeStage(),
    ],
)

In [13]:
from nemo_curator.stages.deduplication.id_generator import create_id_generator_actor, kill_id_generator_actor

try:
    create_id_generator_actor(
        filepath=os.path.join(fuzzy_output_dir, "fuzzy_id_generator.json"), storage_options=storage_options
    )
    merged_results = pipeline.run()
    merged_df = pd.concat([batch.to_pandas() for batch in merged_results]).sort_values("_duplicate_group_id")
finally:
    kill_id_generator_actor()

2025-12-15 23:13:58,592	INFO worker.py:1696 -- Using address 127.0.1.1:6380 set in the environment variable RAY_ADDRESS
2025-12-15 23:13:58,597	INFO worker.py:1837 -- Connecting to existing Ray cluster at address: 127.0.1.1:6380...
2025-12-15 23:13:58,597	INFO worker.py:1855 -- Calling ray.init() again after it has already been called.
2025-12-15 23:13:59,704	INFO worker.py:1696 -- Using address 127.0.1.1:6380 set in the environment variable RAY_ADDRESS
2025-12-15 23:13:59,709	INFO worker.py:1837 -- Connecting to existing Ray cluster at address: 127.0.1.1:6380...
2025-12-15 23:13:59,716	INFO worker.py:2014 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8267
2025-12-15 23:13:59,736	INFO worker.py:1696 -- Using address 127.0.1.1:6380 set in the environment variable RAY_ADDRESS
2025-12-15 23:13:59,740	INFO worker.py:1837 -- Connecting to existing Ray cluster at address: 127.0.1.1:6380...
2025-12-15 23:13:59,740	INFO worker.py:1855 -- Calling ray.in

In [14]:
display(merged_df[merged_df._curator_dedup_id.isin(largest_duplicate_cluster)])

| | `text` | `id` | `_curator_dedup_id` | `_duplicate_group_id` |
|---|---|---|---|---|
| 34010 | | 92d6e01b-3292-494a-b139-9479bdb6e624 | 115098 | 14100 |
| 176822 | | cbcc728f-089d-4e63-99d4-a23f35736955 | 610909 | 14100 |
| 176823 | | 830bd2ba-0f07-4e89-ae60-ba14d0a2fa09 | 610910 | 14100 |
| 94578 | | 7cab3473-0f8e-4b8b-a22e-6961151572d0 | 327273 | 14100 |
| 176824 | | db416c14-9ce3-4ac8-87a6-3ea546ddb1bc | 610912 | 14100 |
| ... | ... | ... | ... | ... |
| 28991 | | c2c4f14a-f405-472c-aece-e36a2c4a8762 | 98886 | 14100 |
| 57528 | | 37173cac-340e-4a27-9e22-d0fb23bd0aee | 1188225 | 14100 |
| 28988 | | 2dab220a-bd13-4b1a-ba1c-f0459b16c268 | 98880 | 14100 |
| 28990 | | 22ec3467-721d-4f9d-a0c1-3d654d909ae7 | 98884 | 14100 |


The largest cluster/group of duplicates in this dataset seems to be all documents with empty/no text.

Let's look at the second largest cluster of documents.

In [15]:
duplicates = merged_df[merged_df._curator_dedup_id.isin(grouped_cc_df.loc[duplicate_cluster_sizes.index[1]])]
display(duplicates)

print(f"\nDocument1\n----------\n{duplicates.iloc[0].text}")
print(f"\nDocument2\n----------\n{duplicates.iloc[1].text}")

| | `text` | `id` | `_curator_dedup_id` | `_duplicate_group_id` |
|---|---|---|---|---|
| 300414 | Sara and Ben were friends who liked to play in... | dccdcb21-13c0-4df9-b4eb-cf6871f57969 | 1994745 | 153774 |
| 106660 | Sara and Ben were friends who liked to play in... | 2012a26a-c0fb-448d-be3a-955a5fb8f165 | 373063 | 153774 |
| 209932 | Sara and Ben were friends who liked to play in... | 296aad0d-3fcc-4bcd-9852-2d7907f715af | 1698008 | 153774 |



Document1
----------
Sara and Ben were friends who liked to play in the park. One day, they saw a big dog with a red bow on its neck. Sara wanted to pet the dog, but Ben was scared.

"Come on, Ben, the dog is nice. Look, it has a bow. It wants to be our friend," Sara said.

"No, Sara, the dog is big and loud. It might bite us. We should go away," Ben said.

Sara did not listen to Ben. She ran to the dog and tried to touch its bow. The dog did not like that. It growled and barked at Sara. It showed its teeth and snapped at her hand. Sara was scared and ran back to Ben.

"Are you okay, Sara?" Ben asked.

"Yes, Ben, I am okay. But the dog was terrible. It did not want me to pet it. It was mean to me," Sara said.

"I told you, Sara, the dog was big and loud. You should have listened to me. We should not bother animals we do not know. They might hurt us," Ben said.

Sara nodded. She was sorry she did not listen to Ben. She learned her lesson. She and Ben went to play with their own toys. T

In [16]:
client.stop()

### Conclusion
We were able to find and remove ~320,000 duplicate documents from a dataset of ~2.1 million rows.