
snip-dedup


This repo is a WIP

You can no longer filter the LAION dataset to remove duplicates, as LAION has disabled the webdataset on Hugging Face. I'll focus on adding deduplication functionality for future webdatasets using CLIP features.

  • Compress features using pretrained SNIP networks (for ViT-H-14, ViT-L-14, ViT-B-32)
  • Read our research paper
  • Train SNIP on your CLIP features
  • Run a de-duplication of your dataset using our de-dup code

SNIP is a technique for compressing CLIP features. It is competitive with previous work on large-scale retrieval of deep features, and has some nice properties for multi-modal features. Read more about it here.

We used SNIP together with the faiss library to deduplicate a billion-scale dataset, and found a high level of duplication (roughly 700M duplicates out of 2 billion samples). This webdataset is no longer distributed by LAION.

Install

pip install --upgrade snip-dedup

Usage

# List available commands
snip --help
snip download --help

# Download and deduplicate the 10 first shards of the dataset
snip download --start 0 --end 10

Then, you may download the (deduplicated) LAION-2B images with the awesome img2dataset.
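As a rough illustration, here is a minimal sketch of such a download using img2dataset's Python API. The metadata file name and column names (laion2b_dedup.parquet, URL, TEXT) are placeholders assumed for the example, not files shipped by this repo.

# Hedged sketch: download images for deduplicated metadata with img2dataset.
# The parquet path and column names below are hypothetical placeholders.
from img2dataset import download

download(
    url_list="laion2b_dedup.parquet",  # hypothetical deduplicated metadata file
    input_format="parquet",
    url_col="URL",                     # assumed LAION-style column names
    caption_col="TEXT",
    output_format="webdataset",
    output_folder="laion2b_dedup_images",
    image_size=256,
    processes_count=16,
    thread_count=64,
)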

See the colab for a demo on search.

What is a Duplicate?

In our first iteration, we merely marked duplicates pairwise and removed one sample from each duplicate pair (the above code downloads a binary array indicating which samples to remove). In our latest run, we recorded the entire adjacency matrix of duplication. For instance, suppose SNIP has labeled feature $k$ as a duplicate of feature $j$. Then $A[k,j] = A[j,k] = 1$ in the adjacency matrix. We're currently having trouble computing the full connected components of this matrix, see this issue.

If you allow connected components with only one node, then counting the "unique" samples amounts to keeping one sample from each connected component: with $N$ nodes and $|\mathcal{C}|$ components, the number of duplicates is $D := N - |\mathcal{C}|$.
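For small graphs this count is easy to verify. Below is a minimal sketch using SciPy's connected-components routine on a sparse adjacency matrix; the pair arrays (rows, cols) are illustrative, and this in-memory approach would not scale to the full 2B-node graph.

# Hedged sketch: count duplicates from a sparse adjacency matrix of duplicate pairs.
# `rows`/`cols` (indices of detected duplicate pairs) are illustrative placeholders.
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

N = 6
rows = np.array([0, 2, 3])          # pairs (0,1), (2,3), (3,4) are duplicates
cols = np.array([1, 3, 4])
data = np.ones(len(rows), dtype=np.int8)

# Symmetric adjacency matrix with A[k, j] = A[j, k] = 1 for duplicate pairs.
A = coo_matrix((data, (rows, cols)), shape=(N, N))
n_components, labels = connected_components(A, directed=False)

D = N - n_components                # duplicates = nodes minus components
print(n_components, D)              # 3 components -> 3 duplicates here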

Approximate CCs of Duplicates

Currently, we have an approximation of the CCs of the duplicates. During the de-duplication, we label nodes as follows. Suppose we are at node $n$; the pseudocode for one labeling step is

labels = np.arange(0, N)  # every node starts with its own label
...
d, i = index.search(feats[n:n+1, :], k)       # k nearest neighbors of node n (faiss expects a 2D query)
dups = get_dups(d, i)                         # adaptive threshold on ADC (see paper)
labels[dups] = resolve_labels_one_step(dups)  # point duplicates' labels at node n

where N is the number of nodes (2B for LAION-2B). Here resolve_labels_one_step simply re-writes any unlabeled node to point to the current node $n$. This can be thought of as a tree. We then connect nodes with common ancestors via a fixed-point iteration:

# iterate until the labels reach a fixed point (every node points to its root)
while not np.array_equal(labels, labels[labels]):
    labels = labels[labels]

The labels produced by the above loop can be found on Hugging Face: vitl14_labels.
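Assuming the labels are stored as a NumPy array with one root label per sample (the file name below is a hypothetical placeholder, not the actual artifact name), counting unique samples from them could look like this minimal sketch:

# Hedged sketch: count unique samples from a per-sample root-label array.
# The file name "vitl14_labels.npy" is a hypothetical placeholder.
import numpy as np

labels = np.load("vitl14_labels.npy")   # labels[i] = root node of sample i's duplicate set
n_components = np.unique(labels).size   # one root per connected component
n_duplicates = labels.size - n_components
print(f"{n_components} unique samples, {n_duplicates} duplicates")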

Other:

cumulative sizes of features (for indexing sharded files)

Finding images overfit by Stable Diffusion

By analyzing the most duplicated images, we have found several more images copied verbatim by Stable Diffusion, posing a copyright problem:

[Images: sylvester_overfit, hopped up logo]

Note on False Positives

We noticed that many images labeled as duplicates by SNIP but not by the raw features are in fact near duplicates, for example:

[Images: Chess1, Chess2]

You may check a list of (randomly sampled) detected duplicate pairs here.

Semantic Search

You may use the compressed features to do semantic search with faiss (see for instance, the clip-retrieval repository).
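As a rough illustration, here is a minimal sketch of text-to-image search over a faiss index of image features. The index file name and the snip_compress step are hypothetical placeholders; in practice the text query must be mapped into the same (compressed) feature space as the indexed image features, which is not shown here.

# Hedged sketch: text-to-image search over a faiss index of image features.
# "image_features.index" is a hypothetical placeholder; the query must live in
# the same feature space as the indexed (possibly SNIP-compressed) features.
import faiss
import numpy as np
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")

with torch.no_grad():
    q = model.encode_text(tokenizer(["a photo of a chess board"]))
    q = torch.nn.functional.normalize(q, dim=-1).numpy().astype(np.float32)

index = faiss.read_index("image_features.index")  # hypothetical prebuilt index
distances, ids = index.search(q, 10)              # top-10 nearest images
print(ids[0])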

Contribute

Contributions are welcome. Usually, the best way is first to open an issue to discuss things.

This Python project uses the hatch project manager. Dependencies are specified in the pyproject.toml file, and build configs in the hatch.toml file. As such, you can enter the isolated development environment with hatch shell from inside the repository.

The code should be documented following the Numpy docstring standard.

To avoid silly mistakes, the code is checked with pyright. To ensure consistent styling, all Python code is formatted with black and linted with ruff. Note that these can usually be installed in your editor, such as VS Code, to view the checks directly in the code. Once you have installed them (suggested via pipx), you can check that the code is consistent with:

hatch run check           # check for mistakes via static analysis with pyright
black --check snip_dedup/ # check formatting of all python files
ruff check snip_dedup/    # check linting rules

STILL TODO:

Citation

@misc{webster2023deduplication,
      title={On the De-duplication of LAION-2B}, 
      author={Ryan Webster and Julien Rabin and Loic Simon and Frederic Jurie},
      year={2023},
      eprint={2303.12733},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
