Do notebook to explore assignment and centroids of index #196
Get the 2 files from https://huggingface.co/datasets/laion/laion5B-index/tree/main/image.index

Load the index:

    import faiss
    index = faiss.read_index("populated.index", faiss.IO_FLAG_ONDISK_SAME_DIR)

Centroids (the transformation needs to be reversed to get back to the initial space):

    ind = faiss.extract_index_ivf(index)
    centroids = ind.quantizer.reconstruct_n(0, ind.nlist)
    centroids = index.chain.at(0).reverse_transform(centroids)

Assignments, something like:

    il = faiss.extract_index_ivf(index).invlists
    import numpy as np
    from tqdm import tqdm

    # d[old_id] will hold the position of that item once items are grouped by cluster
    d = np.ones((index.ntotal,), "int64")
    begin_list = []  # begin_list[i] is the first position of cluster i
    current_begin = 0
    for i in tqdm(range(il.nlist)):
        begin_list.append(current_begin)
        ids = il.get_ids(i)
        list_size = il.list_size(int(i))
        items = faiss.rev_swig_ptr(ids, list_size)
        new_ids = range(current_begin, current_begin + list_size)
        d.put(np.array(items, "int"), np.array(new_ids, "int"))
        il.release_ids(ids=ids, list_no=i)
        current_begin += list_size

Get all cluster list sizes:

    l = [il.list_size(i) for i in range(il.nlist)]
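
To make the remapping loop above easier to follow, here is a minimal pure-Python sketch of the same idea on toy data. Plain Python lists stand in for what `il.get_ids(i)` returns, and the names `build_reordering` and `toy_lists` are hypothetical (they are not part of faiss or clip-retrieval). It only illustrates the bookkeeping, not the faiss calls.

```python
def build_reordering(inverted_lists, ntotal):
    """Map each original id to its position once items are grouped by cluster."""
    old_to_new = [-1] * ntotal  # plays the role of `d` in the snippet above
    begin_list = []             # begin_list[i] = first position of cluster i
    current_begin = 0
    for ids in inverted_lists:
        begin_list.append(current_begin)
        # Items of this cluster get consecutive positions starting at current_begin.
        for offset, old_id in enumerate(ids):
            old_to_new[old_id] = current_begin + offset
        current_begin += len(ids)
    return old_to_new, begin_list


# Hypothetical toy data: 3 clusters holding 6 items in total.
toy_lists = [[3, 0], [4], [1, 5, 2]]
old_to_new, begin_list = build_reordering(toy_lists, ntotal=6)
print(old_to_new)   # [1, 3, 5, 0, 2, 4]
print(begin_list)   # [0, 2, 3]
```

With this layout, cluster `i` occupies the positions from `begin_list[i]` up to (but excluding) `begin_list[i]` plus its list size, which is what makes the reordered data easy to slice per cluster.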
Get the items of the largest cluster:

    def id_to_items(i):
        ids = il.get_ids(i)
        list_size = il.list_size(int(i))
        items = faiss.rev_swig_ptr(ids, list_size)
        items = np.array(items, "int")
        return items

    l = [il.list_size(i) for i in range(il.nlist)]
    i = int(np.argmax(l))  # index of the largest cluster
    ids = id_to_items(i)   # ids stored in that cluster
Get the URLs from some metadata (paste in the browser console):

    fetch("https://knn5.laion.ai/metadata", {
      "headers": {
        "accept": "*/*",
        "accept-language": "en,fr-FR;q=0.9,fr;q=0.8,en-US;q=0.7,zh-CN;q=0.6,zh;q=0.5,zh-TW;q=0.4",
        "content-type": "text/plain;charset=UTF-8",
        "sec-ch-ua": "\"Chromium\";v=\"106\", \"Google Chrome\";v=\"106\", \"Not;A=Brand\";v=\"99\"",
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "\"Linux\"",
        "sec-fetch-dest": "empty",
        "sec-fetch-mode": "cors",
        "sec-fetch-site": "cross-site"
      },
      "referrer": "https://rom1504.github.io/",
      "referrerPolicy": "strict-origin-when-cross-origin",
      "body": "{\"ids\":[4832458793,4832458822],\"indice_name\":\"laion5B\"}",
      "method": "POST",
      "mode": "cors",
      "credentials": "omit"
    }).then(a => a.json()).then(a => console.log(a.map(e => e.metadata)))

=> plug that into a part of the UI
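
Since the goal is a notebook, the same metadata request could probably be issued from Python instead of the browser console. The sketch below uses only the standard library; the URL, the request body and the `metadata` field come from the snippet above, while the helper names `build_metadata_request` and `fetch_metadata` are hypothetical. Whether the service answers without the browser-specific headers is an assumption that has not been checked here.

```python
import json
import urllib.request

METADATA_URL = "https://knn5.laion.ai/metadata"  # endpoint from the browser snippet


def build_metadata_request(ids, indice_name="laion5B"):
    """Build the POST request with the same JSON body as the browser call."""
    body = json.dumps({"ids": [int(i) for i in ids], "indice_name": indice_name})
    return urllib.request.Request(
        METADATA_URL,
        data=body.encode("utf-8"),
        headers={"content-type": "text/plain;charset=UTF-8"},
        method="POST",
    )


def fetch_metadata(ids, indice_name="laion5B", timeout=30):
    """Send the request and keep only the `metadata` field of each result."""
    request = build_metadata_request(ids, indice_name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        results = json.load(response)
    return [entry["metadata"] for entry in results]


# Building the request does not touch the network; only fetch_metadata does.
request = build_metadata_request([4832458793, 4832458822])
```

In a notebook, `fetch_metadata(ids[:10])` would then be the call to try on the first few ids of a cluster, assuming the endpoint is reachable.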
@rom1504, thanks for the code for getting the clusters. I am trying to understand the code. What does the function
facebookresearch/faiss#2266 for centroids
https://github.com/rom1504/clip-retrieval/blob/main/clip_retrieval/ivf_metadata_ordering.py#L46