
snip-dedup


This repo is a WIP

You can no longer filter the LAION dataset to remove duplicates, as LAION has disabled the webdataset on Hugging Face. I'll focus on adding deduplication functionality for future webdatasets using CLIP features.

  • Compress features using pretrained SNIP networks (for ViT-H-14, ViT-L-14, ViT-B-32)
  • Read our research paper
  • Train SNIP on your CLIP features
  • Run a de-duplication of your dataset using our de-dup code

SNIP is a technique for compressing CLIP features. It is competitive with previous work on large-scale retrieval of deep features, and has some nice properties for multi-modal features. Read more about it here.

We used SNIP together with the faiss library to deduplicate a billion-scale dataset, and found a high level of duplication (roughly 700M duplicates out of 2 billion samples). This webdataset is no longer distributed by LAION.

Install

pip install --upgrade snip-dedup

Usage

# List available commands
snip --help
snip download --help

# Download and deduplicate the 10 first shards of the dataset
snip download --start 0 --end 10

Then, you may download the (deduplicated) LAION-2B images with the awesome img2dataset.
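As a rough illustration, here is a minimal sketch of such a download using img2dataset's Python API. The metadata file name and column names (laion2b_dedup.parquet, URL, TEXT) are placeholders assumed for the example, not files shipped by this repo.

# Hedged sketch: download images for deduplicated metadata with img2dataset.
# The parquet path and column names below are hypothetical placeholders.
from img2dataset import download

download(
    url_list="laion2b_dedup.parquet",  # hypothetical deduplicated metadata file
    input_format="parquet",
    url_col="URL",                     # assumed LAION-style column names
    caption_col="TEXT",
    output_format="webdataset",
    output_folder="laion2b_dedup_images",
    image_size=256,
    processes_count=16,
    thread_count=64,
)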

See the colab for a demo on search.

What is a Duplicate?

In our first iteration, we merely marked duplicates pairwise and removed one sample from each duplicate pair (the above code downloads a binary array indicating which samples to remove). In our latest run, we recorded the entire adjacency matrix of duplication. For instance, suppose SNIP has labeled feature $k$ as a duplicate of feature $j$. Then $A[k,j] = A[j,k] = 1$ in the adjacency matrix. We're currently having trouble computing the full connected components of this matrix, see this issue.

If you allow connected components with only one node, then counting the "unique" samples amounts to keeping one sample from each connected component: with $N$ nodes and $|\mathcal{C}|$ components, the number of duplicates is $D := N - |\mathcal{C}|$.
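For small graphs this count is easy to verify. Below is a minimal sketch using SciPy's connected-components routine on a sparse adjacency matrix; the pair arrays (rows, cols) are illustrative, and this in-memory approach would not scale to the full 2B-node graph.

# Hedged sketch: count duplicates from a sparse adjacency matrix of duplicate pairs.
# `rows`/`cols` (indices of detected duplicate pairs) are illustrative placeholders.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

N = 6
rows = np.array([0, 2, 3])          # pairs (0,1), (2,3), (3,4) are duplicates
cols = np.array([1, 3, 4])
data = np.ones(len(rows), dtype=np.int8)

# Symmetric adjacency matrix with A[k, j] = A[j, k] = 1 for duplicate pairs.
A = coo_matrix((data, (rows, cols)), shape=(N, N))
n_components, labels = connected_components(A, directed=False)

D = N - n_components                # duplicates = nodes minus components
print(n_components, D)              # 3 components -> 3 duplicates here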

Approximate CCs of Duplicates

Currently, we have an approximation of the CCs of the duplicates. During the de-duplication, we label nodes as follows. Suppose we are at node $n$; the pseudocode for one labeling step is

labels = np.arange(0, N)  # every node starts with its own label
...
d, i = index.search(feats[n:n+1, :], k)       # k nearest neighbors of node n (faiss expects a 2D query)
dups = get_dups(d, i)                         # adaptive threshold on ADC (see paper)
labels[dups] = resolve_labels_one_step(dups)  # point duplicates' labels at node n

where N is the number of nodes (2B for LAION-2B). Here resolve_labels_one_step simply re-writes any unlabeled node to point to the current node $n$. This can be thought of as a tree. We then connect nodes with common ancestors via a fixed-point iteration:

# iterate until the labels reach a fixed point (every node points to its root)
while not np.array_equal(labels, labels[labels]):
    labels = labels[labels]

The labels produced by the above loop can be found on Hugging Face: vitl14_labels.
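Assuming the labels are stored as a NumPy array with one root label per sample (the file name below is a hypothetical placeholder, not the actual artifact name), counting unique samples from them could look like this minimal sketch:

# Hedged sketch: count unique samples from a per-sample root-label array.
# The file name "vitl14_labels.npy" is a hypothetical placeholder.
import numpy as np

labels = np.load("vitl14_labels.npy")   # labels[i] = root node of sample i's duplicate set
n_components = np.unique(labels).size   # one root per connected component
n_duplicates = labels.size - n_components
print(f"{n_components} unique samples, {n_duplicates} duplicates")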

Other:

cumulative sizes of features (for indexing sharded files)

Finding images overfit by Stable Diffusion

By analyzing the most duplicated images, we have found several more images copied verbatim by Stable Diffusion, posing a copyright problem:

[Images: sylvester_overfit, hopped up logo]

Note on False Positives

We noticed that many images labeled as duplicates by SNIP but not by the raw features are in fact near duplicates, for example:

[Images: Chess1, Chess2]

You may check a list of (randomly sampled) detected duplicate pairs here.

Semantic Search

You may use the compressed features to do semantic search with faiss (see for instance, the clip-retrieval repository).
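As a rough illustration, here is a minimal sketch of text-to-image search over a faiss index of image features. The index file name and the snip_compress step are hypothetical placeholders; in practice the text query must be mapped into the same (compressed) feature space as the indexed image features, which is not shown here.

# Hedged sketch: text-to-image search over a faiss index of image features.
# "image_features.index" is a hypothetical placeholder; the query must live in
# the same feature space as the indexed (possibly SNIP-compressed) features.
import faiss
import numpy as np
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

with torch.no_grad():
    q = model.encode_text(tokenizer(["a photo of a chess board"]))
    q = torch.nn.functional.normalize(q, dim=-1).numpy().astype(np.float32)

index = faiss.read_index("image_features.index")  # hypothetical prebuilt index
distances, ids = index.search(q, 10)              # top-10 nearest images
print(ids[0])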

Contribute

Contributions are welcome. Usually, the best way is first to open an issue to discuss things.

This Python project uses the hatch project manager. Dependencies are specified in the pyproject.toml file, and build configs in the hatch.toml file. As such, you can enter the isolated development environment with hatch shell from inside the repository.

The code should be documented following the Numpy docstring standard.

To avoid silly mistakes, the code is checked with pyright. To ensure consistent styling, all Python code is formatted with black and linted with ruff. Note that these can usually be installed in your editor, such as VS Code, to view the checks directly in the code. Once you have installed them (suggested via pipx), you can check that the code is consistent with:

hatch run check           # check for mistakes via static analysis with pyright
black --check snip_dedup/ # check formatting of all python files
ruff check snip_dedup/    # check linting rules

STILL TODO:

Citation

@misc{webster2023deduplication,
      title={On the De-duplication of LAION-2B}, 
      author={Ryan Webster and Julien Rabin and Loic Simon and Frederic Jurie},
      year={2023},
      eprint={2303.12733},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
