## Loading

Let's load an existing index (one that was created with `clip index`).

In [1]:
import os
# Hide all GPUs so that everything runs on CPU; set this before importing clip_retrieval.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
from clip_retrieval.clip_back import load_clip_indices, KnnService

In [2]:
columns = ["url", "caption"]
indices_loaded, indices, device, model, preprocess, mclip_model = load_clip_indices(
    "/home/rom1504/indices_paths.json", True, True, columns, False
)
knn_service = KnnService(
    indices_loaded=indices_loaded,
    device=device,
    model=model,
    preprocess=preprocess,
    columns_to_return=columns,
    metadata_is_ordered_by_ivf=False,
    mclip_model=mclip_model,
)

loading clip...
loading metadata...
loading indices...
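
For reference, the JSON file passed to `load_clip_indices` is assumed here to map each index name to the folder produced by `clip index`. The sketch below writes such a file; the name `laion_400m` matches the `indice_name` used in the query cell, while the folder path and the output location are hypothetical placeholders.

```python
import json
import os
import tempfile

# Hypothetical location of the index folder created by `clip index`.
index_folder = "/path/to/laion_400m_index"

# Assumed layout: one entry per index, mapping index name -> index folder.
indices_paths = {"laion_400m": index_folder}

# Written to a temporary directory here; point load_clip_indices at your own copy.
indices_paths_file = os.path.join(tempfile.mkdtemp(), "indices_paths.json")
with open(indices_paths_file, "w") as f:
    json.dump(indices_paths, f)
```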


## Doing a query

Let's extract a subset of the dataset composed mostly of watermarked images by querying the index with the text "watermark".

In [5]:
results = knn_service.query(text_input="watermark", modality="image", indice_name="laion_400m", num_images=1000, num_result_ids=1000)

In [8]:
import pandas as pd
url_captions = pd.DataFrame([(e['url'], e['caption']) for e in results], columns=["url", "caption"])
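
The list comprehension above raises a `KeyError` if any result lacks a `url` or `caption` key. Whether that can happen depends on the index metadata, so the following is only a defensive sketch: `to_rows` is a hypothetical helper, and `sample_results` is made-up data shaped like the query output (a list of dicts).

```python
def to_rows(results, columns=("url", "caption")):
    """Keep only the results that carry a non-null value for every column."""
    rows = []
    for entry in results:
        if all(entry.get(col) is not None for col in columns):
            rows.append(tuple(entry[col] for col in columns))
    return rows


# Made-up sample; real results come from knn_service.query(...).
sample_results = [
    {"url": "https://example.com/a.jpg", "caption": "water background"},
    {"caption": "entry without a url"},
    {"url": "https://example.com/b.jpg", "caption": None},
]
rows = to_rows(sample_results)
```

The resulting `rows` can be passed to `pd.DataFrame(rows, columns=["url", "caption"])` in place of the list comprehension.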

In [9]:
url_captions

| | url | caption |
|---|---|---|
| 0 | https://static2.bigstockphoto.com/thumbs/7/8/3... | Milch spritzt Sammlung, isolated on white back... |
| 1 | https://t2.ftcdn.net/jpg/00/62/03/77/400_F_620... | Abstract background |
| 2 | https://as2.ftcdn.net/jpg/00/57/88/83/500_F_57... | Spoed Foto op Canvas Abstract wave Abstract pa... |
| 3 | https://image.shutterstock.com/image-photo/sto... | Marine pattern with stylized blue waves. Cosme... |
| 4 | https://thumb7.shutterstock.com/image-photo/st... | water background - stock photo |
| ... | ... | ... |
| 995 | https://t1.ftcdn.net/jpg/00/54/99/42/400_F_549... | Abstract business background |
| 996 | https://img3.stockfresh.com/files/sstk/200/550... | Brickwall as Background for Product Placement ... |
| 997 | https://thumb1.shutterstock.com/image-photo/st... | motorboat and nature - stock vector |
| 998 | https://thumb1.shutterstock.com/image-photo/st... | Molecular structure scientific vertical backgr... |


In [10]:
url_captions.to_parquet("/tmp/mysubset.parquet")

## Downloading

Finally, let's download this subset with img2dataset.

In [13]:
!img2dataset --input_format=parquet --url_list=/tmp/mysubset.parquet --output_folder=/tmp/myoutput --processes_count=16 --thread_count=64 --output_format=files --url_col="url" --caption_col="caption"

Downloading file number 1 of 1 called /tmp/mysubset.parquet
  0%|                                                     | 0/1 [00:00<?, ?it/s]success=1.00 failed download=0.00 failed resize=0.00
100%|█████████████████████████████████████████████| 1/1 [00:08<00:00,  8.00s/it]


In [15]:
!ls /tmp/myoutput/*

/tmp/myoutput/00000.parquet

/tmp/myoutput/00000:
0000.jpg   0143.jpg   0286.jpg	 0429.jpg   0572.jpg   0715.jpg   0858.jpg
0000.json  0143.json  0286.json  0429.json  0572.json  0715.json  0858.json
0000.txt   0143.txt   0286.txt	 0429.txt   0572.txt   0715.txt   0858.txt
0001.jpg   0144.jpg   0287.jpg	 0430.jpg   0573.jpg   0716.jpg   0859.jpg
0001.json  0144.json  0287.json  0430.json  0573.json  0716.json  0859.json
0001.txt   0144.txt   0287.txt	 0430.txt   0573.txt   0716.txt   0859.txt
0002.jpg   0145.jpg   0288.jpg	 0431.jpg   0574.jpg   0717.jpg   0860.jpg
0002.json  0145.json  0288.json  0431.json  0574.json  0717.json  0860.json
0002.txt   0145.txt   0288.txt	 0431.txt   0574.txt   0717.txt   0860.txt
0003.jpg   0146.jpg   0289.jpg	 0432.jpg   0575.jpg   0718.jpg   0861.jpg
0003.json  0146.json  0289.json  0432.json  0575.json  0718.json  0861.json
0003.txt   0146.txt   0289.txt	 0432.txt   0575.txt   0718.txt   0861.txt
0004.jpg   0147.jpg   0290.jpg	 0433.jpg   0576.jpg   
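
The listing shows three files per sample key: a `.jpg`, a `.json` and a `.txt`. A minimal sketch of grouping them back into samples follows. It assumes the `.json` file holds per-sample metadata as a JSON object; `list_samples` is a hypothetical helper, and the demo folder is a stand-in built on the fly rather than the real `/tmp/myoutput/00000`.

```python
import json
import os
import tempfile


def list_samples(shard_folder):
    """Return one dict per sample key, with file paths and parsed metadata."""
    samples = []
    for name in sorted(os.listdir(shard_folder)):
        key, ext = os.path.splitext(name)
        if ext != ".json":
            continue  # use the .json file as the anchor for each sample
        with open(os.path.join(shard_folder, name)) as f:
            metadata = json.load(f)
        samples.append({
            "key": key,
            "image": os.path.join(shard_folder, key + ".jpg"),
            "caption_file": os.path.join(shard_folder, key + ".txt"),
            "metadata": metadata,
        })
    return samples


# Stand-in shard folder with two fake samples (empty images, made-up captions).
demo_folder = tempfile.mkdtemp()
for key in ("0000", "0001"):
    open(os.path.join(demo_folder, key + ".jpg"), "wb").close()
    with open(os.path.join(demo_folder, key + ".txt"), "w") as f:
        f.write("caption " + key)
    with open(os.path.join(demo_folder, key + ".json"), "w") as f:
        json.dump({"caption": "caption " + key}, f)

samples = list_samples(demo_folder)
```

On the real output, `list_samples("/tmp/myoutput/00000")` would be the corresponding call.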

In [16]:
from IPython.display import Image
Image(filename='/tmp/myoutput/00000/0000.jpg') 

<IPython.core.display.Image object>