# Cap3D Dataset Filtering

This notebook is a step-by-step, visually aided walkthrough of the [Cap3D dataset](https://huggingface.co/datasets/tiange/Cap3D) filtering logic.

The filtering process uses the 3D asset descriptions provided by [Cap3D](https://cap3d-um.github.io/). These descriptions cover assets from large-scale collections of open-source 3D models spanning a broad range of categories (namely [Objaverse](https://arxiv.org/abs/2212.08051), [Objaverse-XL](https://arxiv.org/abs/2307.05663), and [ABO](https://arxiv.org/abs/2110.06199)).

The purpose of this notebook is to filter these assets for alignment with an architectural, building, and interior design use case, using basic natural language processing techniques: tokenising each caption and matching its tokens against a curated keyword list.

## Imports

In [1]:
import json
import os
from collections import Counter, OrderedDict
from pathlib import Path
from typing import List, Union

import nltk
import pandas as pd
from nltk.tokenize import word_tokenize

nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     /home/yunusskeete/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Visualise Captions

In [2]:
PATH_TO_CAPTIONS: Union[str, Path] = "captions/Cap3D_automated_Objaverse_full.csv"

df = pd.read_csv(
    PATH_TO_CAPTIONS,
    header=None,
    names=["id", "desc"],
)

Note that captions frequently contain references to a "3D model":

In [3]:
df.loc[df["desc"].str.contains("3D")]

Unnamed: 0,id,desc
3,28d43a218cd8466a8c1f82b29b71e314,3D model of a cluttered outdoor scene with veg...
7,d7340a5b05b6460facbd90aaafb7f1f1,3D topographic landscape model with color-code...
8,bb3552fe9b074acf8ea531c2e9e25fe7,A stylized 3D corner room with textured walls ...
19,dcd9cb4775934a4d9572d923a3c8fc2c,"Abstract, amorphous 3D structure with varying ..."
23,578f5a1461a7441aad8a3a92f16ae94e,Monochromatic 3D model of a simplified frying ...
...,...,...
1002414,3536f5c1446f761882018b2f98f4d2aa901942a826cf36...,A voxel-based 3D model of a stylized monkey fa...
1002417,3623e74f34c1c3c523af6b2bb8ffcbe2d2dce897ef61b9...,Abstract 3D composition with human figures and...
1002418,64e9f7b7a1fc4c4ec56ed8b5917dfd610930043ac5e15f...,"3D object with a rough, irregular pink surface..."
1002419,fcd089d6a237fee21dfd5f0d6d9b74b2fd1150cdc61c7f...,Bright pink abstract 3D model of a building wi...
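
To put a number on "frequently", the share of captions mentioning "3D" can be measured directly. The sketch below is a minimal stand-alone illustration on a toy list rather than the notebook's DataFrame: three captions are taken from samples shown in this notebook, and the fourth is a hypothetical caption added to show why case-insensitive matching may be preferable to the case-sensitive `"3D"` check used above.

```python
import re

# Toy captions standing in for df["desc"]. The last one is hypothetical and
# only illustrates a lowercase "3d" that a case-sensitive check would miss.
captions = [
    "a Soviet tank",
    "3D model of a textured, ribbed pumpkin with a vibrant orange color and a short, green stem.",
    "a sword with a yellow handle and flames.",
    "a 3d printed vase",
]

# Match "3d" as a whole word, ignoring case
pattern = re.compile(r"\b3d\b", re.IGNORECASE)

# One boolean per caption: does it reference "3D"?
matches = [bool(pattern.search(caption)) for caption in captions]

# Fraction of captions that contain a reference
share = sum(matches) / len(captions)
print(f"{share:.0%} of the toy captions mention 3D")
```

On the notebook's own data, the analogous quantity would be the mean of the boolean mask returned by `df["desc"].str.contains("3D")`, assuming no caption is missing.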


## Removing references to a 3D model

(Not implemented)

In [4]:
# A dictionary to hold references and replacements
_3D_phrase_replacements = {"3D model of a": ""}


# A function to be applied to the captions to remove references to a 3D model
def remove_3D_word(description: str) -> str:
    raise NotImplementedError()
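
The function above is left unimplemented in this notebook. One possible way to fill it in, offered only as a sketch and not as the intended implementation, is to apply each entry of the replacement dictionary as a case-insensitive substitution and then collapse the whitespace left behind. The trailing word boundary and the whitespace clean-up are assumptions added here; the notebook itself only defines the single phrase `"3D model of a"`.

```python
import re

# Same dictionary as in the cell above
_3D_phrase_replacements = {"3D model of a": ""}


def remove_3D_word(description: str) -> str:
    """Remove known "3D model" phrases from a caption (illustrative sketch)."""
    for phrase, replacement in _3D_phrase_replacements.items():
        # The trailing \b stops "3D model of a" from matching inside "3D model of an ..."
        pattern = re.escape(phrase) + r"\b"
        description = re.sub(pattern, replacement, description, flags=re.IGNORECASE)
    # Collapse any doubled or leading/trailing whitespace left by the removal
    return " ".join(description.split())


cleaned = remove_3D_word(
    "3D model of a textured, ribbed pumpkin with a vibrant orange color and a short, green stem."
)
untouched = remove_3D_word("a Soviet tank")
print(cleaned)
```

A dictionary-driven approach keeps the list of phrases easy to extend, at the cost of missing variants (for example "3D object with") until they are added explicitly.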

## Tokenisation

In [5]:
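# Convert descriptions to lowercase and pad "/" and "-" with spaces
# so that compounds such as "single-story" are split into separate tokens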
df["desc tokens"] = df["desc"].apply(
    lambda desc: desc.lower().replace("/", " / ").replace("-", " - ")
)

# Tokenize each description
df["desc tokens"] = df["desc tokens"].apply(word_tokenize)
df

Unnamed: 0,id,desc,desc tokens
0,ed51a51909ee46c780db3a85e821feb2,"Matte green rifle with a long barrel, stock, a...","[matte, green, rifle, with, a, long, barrel, ,..."
1,9110b606f6c547b2980fcb3c8c4b6a1c,Rustic single-story building with a weathered ...,"[rustic, single, -, story, building, with, a, ..."
2,80d9caaa1fa04502af666135196456e1,a pair of purple and black swords with white h...,"[a, pair, of, purple, and, black, swords, with..."
3,28d43a218cd8466a8c1f82b29b71e314,3D model of a cluttered outdoor scene with veg...,"[3d, model, of, a, cluttered, outdoor, scene, ..."
4,75582285fab442a2ba31733f9c8fae66,Floating terrain piece with grassy landscape a...,"[floating, terrain, piece, with, grassy, lands..."
...,...,...,...
1002417,3623e74f34c1c3c523af6b2bb8ffcbe2d2dce897ef61b9...,Abstract 3D composition with human figures and...,"[abstract, 3d, composition, with, human, figur..."
1002418,64e9f7b7a1fc4c4ec56ed8b5917dfd610930043ac5e15f...,"3D object with a rough, irregular pink surface...","[3d, object, with, a, rough, ,, irregular, pin..."
1002419,fcd089d6a237fee21dfd5f0d6d9b74b2fd1150cdc61c7f...,Bright pink abstract 3D model of a building wi...,"[bright, pink, abstract, 3d, model, of, a, bui..."
1002420,f812dc980050f2d5f4b37df2a8620372f810dd6456a5f2...,Monochromatic gray 3D model of a stylized huma...,"[monochromatic, gray, 3d, model, of, a, styliz..."
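
The effect of padding `/` and `-` before tokenising can be seen on a single caption fragment. The sketch below uses plain whitespace splitting as a simplified stand-in for `word_tokenize` (an assumption made so the example needs no NLTK data); the helper name `pad_separators` is hypothetical.

```python
def pad_separators(description: str) -> str:
    """Lowercase a caption and surround "/" and "-" with spaces, as in the cell above."""
    return description.lower().replace("/", " / ").replace("-", " - ")


caption = "Rustic single-story building"

# Without padding, the hyphenated compound stays in one piece
unpadded_tokens = caption.lower().split()

# With padding, "single" and "story" become tokens of their own
padded_tokens = pad_separators(caption).split()

print(unpadded_tokens)
print(padded_tokens)
```

Splitting compounds this way matters for the keyword matching later on, since a keyword can only match a whole token.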


## Vocabulary extraction

In [6]:
# Create the vocabulary
vocab = set()

# Create the word counts
word_counts = {}

for tokens in df["desc tokens"]:
    # Update the vocabulary (apostrophes are stripped here, but not in word_counts below)
    vocab.update([token.replace("'", "") for token in tokens])

    # Count the occurrences of each word
    for word in tokens:
        if word in word_counts:
            word_counts[word] += 1
        else:
            word_counts[word] = 1

len(vocab)

36864
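
The same bookkeeping can be written more compactly with `collections.Counter`. The following is a sketch on hypothetical toy token lists, not a replacement for the cell above; it derives the vocabulary from the counted tokens, which should give the same set because stripping apostrophes from every token and from every distinct token yields identical results.

```python
from collections import Counter

# Hypothetical token lists standing in for df["desc tokens"]
token_lists = [
    ["a", "white", "chair", "."],
    ["a", "soviet", "tank"],
    ["a", "child", "'s", "chair"],
]

# Count every token in a single pass
word_counter = Counter(token for tokens in token_lists for token in tokens)

# Vocabulary: distinct tokens with apostrophes removed, as in the cell above
vocab = {token.replace("'", "") for token in word_counter}

print(word_counter.most_common(2))
print(len(vocab))
```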

View word counts:

In [7]:
# Convert the dictionary to a Counter object
word_counter = Counter(word_counts)

# Order the Counter object by the number of occurrences
ordered_word_counter = OrderedDict(word_counter.most_common())
list(ordered_word_counter.items())[:15]

[('a', 1834283),
 (',', 1208508),
 ('.', 971807),
 ('and', 888411),
 ('with', 820817),
 ('of', 336260),
 ('white', 301931),
 ('-', 231668),
 ('in', 171788),
 ('on', 159463),
 ('featuring', 156252),
 ('the', 153565),
 ('3d', 143301),
 ('black', 135066),
 ('model', 130748)]

## Load keywords by category

In [8]:
PATH_TO_KEYWORDS_BY_CATEGORY = "keywords/keywords_by_category.json"

with open(PATH_TO_KEYWORDS_BY_CATEGORY, "rb") as f:
    keywords_dict = json.load(f)

keywords_dict.keys()

dict_keys(['Furniture', 'Materials', 'Appliances', 'Decor', 'Kitchenware', 'Bedroom', 'Bathroom', 'Living Room', 'Outdoor', 'Miscellaneous'])

In [9]:
EXCLUDED_CATEGORIES: List[str] = [
    "Appliances",
    "Kitchenware",
    "Outdoor",
    "Miscellaneous",
]
# Flatten the remaining categories into a single lowercase keyword set
keywords = {
    item.lower()
    for category, sublist in keywords_dict.items()
    for item in sublist
    if category not in EXCLUDED_CATEGORIES
}
len(keywords)

271
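
The code above implies that the keywords file maps each category name to a list of keyword strings. The sketch below illustrates that assumed layout with a small hypothetical JSON document (the real file's contents are not shown in this notebook) and applies the same flattening, showing that keywords are lowercased and that duplicates across categories collapse into one entry.

```python
import json

# Hypothetical stand-in for keywords/keywords_by_category.json
raw_json = """
{
    "Furniture": ["Chair", "Sofa"],
    "Bedroom": ["Bed", "Chair"],
    "Outdoor": ["Fence"]
}
"""
toy_keywords_dict = json.loads(raw_json)

# Categories to leave out of the keyword set
toy_excluded_categories = {"Outdoor"}

# Flatten, lowercase, and deduplicate the remaining keywords
toy_keywords = {
    item.lower()
    for category, sublist in toy_keywords_dict.items()
    if category not in toy_excluded_categories
    for item in sublist
}

print(sorted(toy_keywords))
```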

## Filter dataset

In [10]:
def check_keywords(tokens: List[str]) -> bool:
    """
    Check if any token is a keyword, or becomes one when "s" is appended
    (i.e. the token is the singular form of a plural keyword).
    """
    return any(((token in keywords) or (f"{token}s" in keywords)) for token in tokens)


# Determine if rows should be included
df["include"] = df["desc tokens"].apply(check_keywords)

filtered_df = df[df["include"]].drop(columns=["include"])

excluded_df = df[~df["include"]].drop(columns=["include"])

len(filtered_df), len(excluded_df)

(452263, 550159)
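
The filter keeps 452263 of 1002422 captions, roughly 45%. One property of the matching rule is worth illustrating: appending "s" to the token catches a singular token against a plural keyword, but not a plural token against a singular keyword. The sketch below reproduces the rule with the keyword set passed as a parameter (a change made here so the example is self-contained) and contrasts it with a hypothetical symmetric variant that is not part of this notebook.

```python
from typing import Iterable, Set


def check_keywords(tokens: Iterable[str], keywords: Set[str]) -> bool:
    """Same rule as the cell above, with the keyword set passed explicitly."""
    return any((token in keywords) or (f"{token}s" in keywords) for token in tokens)


def check_keywords_symmetric(tokens: Iterable[str], keywords: Set[str]) -> bool:
    """Hypothetical variant that also strips a trailing "s" from the token."""
    return any(
        token in keywords
        or f"{token}s" in keywords
        or (token.endswith("s") and token[:-1] in keywords)
        for token in tokens
    )


toy_keywords = {"chairs", "table"}

# Singular token, plural keyword: matched by the notebook's rule
singular_hit = check_keywords(["a", "chair"], toy_keywords)

# Plural token, singular keyword: missed by the notebook's rule
plural_miss = check_keywords(["two", "tables"], toy_keywords)

# The symmetric variant catches the second case as well
plural_hit = check_keywords_symmetric(["two", "tables"], toy_keywords)

print(singular_hit, plural_miss, plural_hit)
```

Whether the asymmetry matters in practice depends on whether the keyword file already lists both singular and plural forms.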

## Inspect filtered dataset

In [11]:
# Set the display width
pd.set_option("display.max_colwidth", 150)

In [12]:
filtered_df["desc"].sample(50)

883540             Stylized orange bear figurine with a simplified form, featuring a darker contrasting muzzle, a light chest patch, and wearing a grey collar.
400428                                                                                                                                     White Satellite Dish
602639                                                                                 a stack of paper rolls on a wooden bench under a white curved-edge roof.
157442                                          wooden side table with a round top and three legs, accompanied by a wooden stool with a round top and two legs.
793807    Modular purple 3D structure with rectilinear shapes, featuring an upper floor with cut-out windows, lower spaces resembling storefronts, and protr...
78426     A three-dimensional geometric assembly composed of multiple colored lines and shaded planes, creating an abstract visualization that resembles a s...
457228                                  

In [13]:
excluded_df["desc"].sample(50)

467625                                                                                                                                            a Soviet tank
499860                                                                                                                                   Royalty-free ant model
297189                                                                                                                 a sword with a yellow handle and flames.
758621                                                                                    a wooden boat, a leaf, an island with a small tree, and a tree stump.
538328                                                                                                              a white building with a pink and blue roof.
941776                                                              3D model of a textured, ribbed pumpkin with a vibrant orange color and a short, green stem.
150953                                  

## Save filtered dataset

In [14]:
ENCODING: str = "utf-8"
PATH_TO_INCLUDED_IDS: Union[str, Path] = "ids/included_ids.json"
PATH_TO_EXCLUDED_IDS: Union[str, Path] = "ids/excluded_ids.json"

# Create the output directories (including any missing parents) if they do not already exist
os.makedirs(os.path.dirname(PATH_TO_INCLUDED_IDS), exist_ok=True)
os.makedirs(os.path.dirname(PATH_TO_EXCLUDED_IDS), exist_ok=True)

with open(PATH_TO_INCLUDED_IDS, "w", encoding=ENCODING) as f:
    json.dump(filtered_df["id"].tolist(), f, indent=4)

with open(PATH_TO_EXCLUDED_IDS, "w", encoding=ENCODING) as f:
    json.dump(excluded_df["id"].tolist(), f, indent=4)
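
After saving, a quick round trip can confirm that the two ID lists load back unchanged and do not overlap. This is a sketch under stated assumptions: it writes to a temporary directory instead of `ids/`, and the IDs are hypothetical placeholders, not real Cap3D identifiers.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical placeholder IDs standing in for filtered_df["id"] and excluded_df["id"]
included_ids = ["id_a", "id_b"]
excluded_ids = ["id_c"]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = Path(tmp_dir) / "ids"
    # Create the directory, tolerating the case where it already exists
    out_dir.mkdir(parents=True, exist_ok=True)

    included_path = out_dir / "included_ids.json"
    excluded_path = out_dir / "excluded_ids.json"

    # Write both lists in the same format as the cell above
    included_path.write_text(json.dumps(included_ids, indent=4), encoding="utf-8")
    excluded_path.write_text(json.dumps(excluded_ids, indent=4), encoding="utf-8")

    # Read them back
    loaded_included = json.loads(included_path.read_text(encoding="utf-8"))
    loaded_excluded = json.loads(excluded_path.read_text(encoding="utf-8"))

# The two sets should share no IDs
overlap = set(loaded_included) & set(loaded_excluded)
print(len(loaded_included), len(loaded_excluded), len(overlap))
```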