In [1]:
!pip install top2vec
!pip install top2vec[sentence_encoders]

This project used several Python libraries across its stages of web scraping, preprocessing, and search. The preprocessing step, needed to obtain coherent topics from “Top2Vec”, relied on “nltk”, regular expressions (“re”) and “string” to normalize and tokenize the text and to remove stopwords. The project also used nltk’s WordNet functions to expand search queries, and “cv2” and “matplotlib” to display images.

In [None]:
import string
import re
from nltk.corpus import stopwords
from IPython.display import display
import nltk
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
import pandas as pd
import numpy as np
import ast
import matplotlib.pyplot as plt
from google.colab.patches import cv2_imshow
import cv2 as cv
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
from top2vec import Top2Vec

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Scraping the two websites produced a dataset of 6112 artworks. Not all of them have comprehensive metadata: some were not for sale or lacked descriptions, styles or dimensions, and a few had only a title and an artist. Because the scraping code was not run in this notebook, the two scraped datasets were loaded into it and preprocessed to convert dimensions and prices into plain numbers and to make the two datasets ready to be merged. For instance, the raw datasets used different names for the column holding the image path, which was fixed in this step. Moreover, prices under 100 dollars were removed from the Artmajeur dataset, as in most cases they referred to digital licensing or print reproductions rather than to the artwork itself. After this preprocessing, the datasets were merged.
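As a compact illustration of the dimension parsing described above, the sketch below handles both raw formats with one helper. It is only a sketch under assumptions: `parse_dimensions` and the two pattern names are hypothetical, the string "50x70 cm" is an invented sample, and reading the first number as the height follows the notebook's own convention rather than anything verified against the websites.

```python
import re

# Patterns for the two raw dimension formats (same regular expressions as in the notebook)
ANASAEA_PATTERN = re.compile(r"Height (\d+)cm, Width (\d+)cm")
ARTMAJEUR_PATTERN = re.compile(r"(\d+)x(\d+) cm")


def parse_dimensions(dimensions, pattern):
    """Return (height, width, area_cm2) as floats, or (None, None, None) if unparseable."""
    # Missing values arrive as NaN floats, so anything that is not a string is skipped
    if not isinstance(dimensions, str):
        return (None, None, None)
    match = pattern.match(dimensions)
    if match is None:
        return (None, None, None)
    height = float(match.group(1))
    width = float(match.group(2))
    return (height, width, height * width)


print(parse_dimensions("Height 40cm, Width 40cm", ANASAEA_PATTERN))  # (40.0, 40.0, 1600.0)
print(parse_dimensions("50x70 cm", ARTMAJEUR_PATTERN))               # (50.0, 70.0, 3500.0)
print(parse_dimensions(float("nan"), ANASAEA_PATTERN))               # (None, None, None)
```

Returning all three values from one function avoids running the same regular expression three times per row, at the cost of having to unpack the tuple into separate columns.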

In [None]:
# Load the Anasaea dataset produced by the scraping code (run outside this notebook)
anasaea = pd.read_csv('/content/drive/MyDrive/Uni/Digital Humanities Lab/digital humanities lab/raw datasets/anasaea.csv', encoding='utf-8')

# Anasaea preprocessing
anasaea.rename(columns={"Image Path": "Image"}, inplace=True)

# These three functions get the dimensions of the artwork through a regular expression
# and respectively return the dimensions in cm2, height and width
def dimensionstocm2(dimensions):
  regex = r"Height (\d+)cm, Width (\d+)cm"
  if type(dimensions) == str:
    match = re.match(regex, dimensions)
    if match:
      height = match.group(1).strip()
      width = match.group(2).strip()
      finaldimensions = float(height)*float(width)
      return finaldimensions

def height(dimensions):
  regex = r"Height (\d+)cm, Width (\d+)cm"
  if type(dimensions) == str:
    match = re.match(regex, dimensions)
    if match:
      height = match.group(1).strip()
      return height

def width(dimensions):
  regex = r"Height (\d+)cm, Width (\d+)cm"
  if type(dimensions) == str:
    match = re.match(regex, dimensions)
    if match:
      width = match.group(2).strip()
      return width

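# Convert a price string into an integer by dropping its first character
# (the currency symbol); missing or non-numeric prices become ""
def convertprices(price):
  try:
    intprice = int(price[1:])
  except (TypeError, ValueError):
    intprice = ""
  return intprice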

# Apply functions to the Dataframe
anasaea['Height'] = anasaea['Dimensions'].apply(height)
anasaea['Width'] = anasaea['Dimensions'].apply(width)
anasaea['Dimensions in cm2'] = anasaea['Dimensions'].apply(dimensionstocm2)
anasaea['Price'] = anasaea['Price'].apply(convertprices)

In [None]:
# Load the dataset obtained through the scraping code cells
artmajeur = pd.read_csv('/content/drive/MyDrive/Uni/Digital Humanities Lab/digital humanities lab/raw datasets/artmajeur.csv', encoding='utf-8')

# Artmajeur preprocessing

# Remove the text after the artwork's description since it goes into details about
# general stylistic explanations that are not relevant to the single artwork
def removeabout(description):
  description = description.split('About this artwork:')
  description = description[0]
  return description

# These three functions get the dimensions of the artwork through a regular expression
# and respectively return the dimensions in cm2, height and width
def dimensionstocm2(dimensions):
  regex = r"(\d+)x(\d+) cm"
  if type(dimensions) == str:
    match = re.match(regex, dimensions)
    if match:
      height = match.group(1).strip()
      width = match.group(2).strip()
      finaldimensions = float(height)*float(width)
      return finaldimensions

def height(dimensions):
  regex = r"(\d+)x(\d+) cm"
  if type(dimensions) == str:
    match = re.match(regex, dimensions)
    if match:
      height = match.group(1).strip()
      return height

def width(dimensions):
  regex = r"(\d+)x(\d+) cm"
  if type(dimensions) == str:
    match = re.match(regex, dimensions)
    if match:
      width = match.group(2).strip()
      return width

# Remove prices under 100 as most of them are not for art sale,
# but for the sale of art prints or licensed use.
def removelicenseprices(price):
  try:
    if float(price) < 100:
      price = ""
    else:
      price = float(price)
  except (TypeError, ValueError):
    pass  # Leave values that cannot be parsed as numbers unchanged
  return price


# Turn the stringified list of themes into a single lowercase string
def normalizethemes(themes):
  if type(themes) == str:
    themes = ast.literal_eval(themes)
    themes = (" ".join(themes)).lower()
  return themes

# Apply functions to the Dataframe
artmajeur['Description'] = artmajeur['Description'].apply(removeabout)
artmajeur['Height'] = artmajeur['Dimensions'].apply(height)
artmajeur['Width'] = artmajeur['Dimensions'].apply(width)
artmajeur['Dimensions in cm2'] = artmajeur['Dimensions'].apply(dimensionstocm2)
artmajeur['Price'] = artmajeur['Price'].apply(removelicenseprices)
artmajeur['Themes'] = artmajeur['Themes'].apply(normalizethemes)

In [None]:
# Merge the two Dataframes into one
df = pd.concat([anasaea, artmajeur], ignore_index=True)

In [None]:
df

Unnamed: 0,Title,Artist,Price,Description,Dimensions,Style,Link,Image,Height,Width,Dimensions in cm2,Themes
0,Zebra underwater No. 2,Shari Blackwell,60,"digital art, watermark will be removed upon pu...",,Modern art,https://anasaea.com/viewArtPiece/wFJmgvhhAstnd...,downloaded_images/wFJmgvhhAstndrvHt_pic1024.jpeg,,,,
1,Fokker DR I,Constantin Baghici,,The scene of a combat clash with the participa...,,Classicism,https://anasaea.com/viewArtPiece/CzpXJcBTZLkZG...,downloaded_images/CzpXJcBTZLkZG6pDW_pic1024.jpeg,,,,
2,Dreamer,Mistake Ann,,"When I was a child, I liked to run away from h...",,Surrealism,https://anasaea.com/viewArtPiece/Dubg6zfsEy6hi...,downloaded_images/Dubg6zfsEy6hijoBa_pic1024.jpeg,,,,
3,Pedro Pascal,Antonella Torquati,650,I portrayed him for my daughter :) // This pai...,"Height 40cm, Width 40cm",Contemporary art,https://anasaea.com/viewArtPiece/MiuSEBy6KeFLw...,downloaded_images/MiuSEBy6KeFLwAaQt_pic1024.jpeg,40,40,1600.0,
4,Softly Sunrise,Joseph Liberti,400,Sunrise softly tears the sky on Colorado.\n\nF...,"Height 71cm, Width 71cm",Abstract expressionism,https://anasaea.com/viewArtPiece/cozFLQpCp4L22...,downloaded_images/cozFLQpCp4L22yaTg_pic1024.jpeg,71,71,5041.0,
...,...,...,...,...,...,...,...,...,...,...,...,...
6107,Zeitlose Eleganz: Ein Hase,Birgit Wichmann,,Die Zeichnung zeigt einen Hasen in einer leben...,,Figurative,https://www.artmajeur.com/birgit-wichmann/en/a...,downloaded_images/17674213_default-create-a-pe...,,,,aquarell tierzeichnung naturwunder
6108,Temptation,Lorraine Lyn,,Can you resist the temptation?The beauty desir...,,Expressionism,https://www.artmajeur.com/lorrainelyn/en/artwo...,downloaded_images/16794052_temptation.jpg,,,,digitalart aiart desire temptation tentazione ...
6109,Amy Winehouse,Whiteline,,Amy Winehouse était une chanteuse et compositr...,,,https://www.artmajeur.com/whiteline/en/artwork...,downloaded_images/17415298_amy-winehouse-conve...,,,,
6110,Fiat 126P C64 Pixel Art,Rm64,159.99,,,Minimalism,https://www.artmajeur.com/rm64/en/artworks/173...,downloaded_images/17320972_maluch-3200-chbrgri...,,,,fiat 126 pixel pixelart commodore polish retro...


The following preprocessing was necessary to model coherent topics. It removes not only elements such as punctuation and short words, but also different kinds of stopwords. On the one hand, as the dataset includes descriptions in multiple languages, the stopwords of the most prevalent languages were removed; on the other, a custom stopword list was compiled, mainly of words relating to the field of digital art, its styles, techniques and colors. The custom list also includes other words that are not indicative of the artworks’ themes, which were spotted by going through the dataset and the results of LDA trial runs. Stopwords relating to digital art include “digital”, “art”, “painting”, “artwork” and “sale”, as well as words in languages other than English, such as “kunst” (art), “edição” (edition), “rouge” (red), “digitalmente” (digitally) and “técnica” (technique). Other stopwords, such as “captivant” (captivating), “special” and “reveals”, were included to avoid generic terms. The preprocessed text consisted not only of the artworks’ textual descriptions, but also of their titles and themes (the latter only present for the Artmajeur artworks).
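To make the role of the custom stopword list concrete, here is a small self-contained sketch of combining it with language stopwords before filtering. It is illustrative only: `CUSTOM_STOPWORDS` holds just the words quoted in the paragraph above (the project's full list is longer and is not reproduced here), `LANGUAGE_STOPWORDS` is a tiny stand-in for the nltk lists, `filter_tokens` is a hypothetical helper, and the regex tokenizer is a simplification of `word_tokenize`.

```python
import re

# Only the custom stopwords quoted in the text; the real list contains more entries
CUSTOM_STOPWORDS = {
    "digital", "art", "painting", "artwork", "sale", "kunst", "edição",
    "rouge", "digitalmente", "técnica", "captivant", "special", "reveals",
}

# Tiny stand-in for the nltk stopword lists (assumption, for self-containment)
LANGUAGE_STOPWORDS = {"the", "and", "with", "une", "der", "und"}


def filter_tokens(text, stop_words):
    """Lowercase the text, split it into word tokens and drop unwanted ones."""
    tokens = re.findall(r"\w+", text.lower())
    return [
        token for token in tokens
        if len(token) >= 3                            # drop short words
        and token not in stop_words                   # drop stopwords
        and not any(c.isdigit() for c in token)       # drop tokens containing digits
    ]


all_stop_words = LANGUAGE_STOPWORDS | CUSTOM_STOPWORDS
print(filter_tokens("Digital art painting of a butterfly, with rouge wings", all_stop_words))
# ['butterfly', 'wings']
```

Keeping the two lists as sets and merging them once means each token needs only a single membership test.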

In [None]:
# Build the stopword set for the most prevalent languages in the dataset
# (a set keeps the membership test in the filtering loop fast)
stop_words = set()
for language in ('english', 'french', 'german', 'italian', 'spanish', 'dutch'):
    stop_words.update(stopwords.words(language))

def filter_descriptions(descriptions):
    """
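    Normalize art descriptions by lowercasing them and removing punctuation,
    stop words, tokens containing digits, short tokens and URL-like tokens.

    Args:
        descriptions (iterable of str): Description strings (e.g. a list or a pandas Series).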

    Returns:
        list of str: List of normalized descriptions.
    """
    normalized_descriptions = []

    for description in descriptions:

        # Split into tokens
        tokens = word_tokenize(description.lower())

        normalized = []
        for token in tokens:
            if token in string.punctuation:   # Remove punctuation
                continue
            elif token in stop_words:         # Remove stop words
                continue
            elif any(c.isdigit() for c in token):   # Remove tokens that include numbers
                continue
            elif len(token) < 3:                       # Remove short words
                continue
            elif token.startswith(('http', 'www', '(', ':')):   # Remove links & tokens with leading symbols
                continue
            elif token.endswith(('com', ')', '-', '.', ';', ',', '"', ':')):   # Remove links & tokens with trailing symbols
                continue
            else:
                normalized.append(token)

        # Join tokens back into a string and convert to lowercase
        normalized_description = " ".join(normalized).lower()
        normalized_descriptions.append(normalized_description)

    return normalized_descriptions


In [None]:
df['Title + Description + Themes'] = df['Title'].astype(str) + ' ' + df['Description'].astype(str) + ' ' + df['Themes'].astype(str)
descriptions = df['Title + Description + Themes']
normalized_descriptions = filter_descriptions(descriptions)
raw_keywords = []
for desc in normalized_descriptions:
  words = list(set(desc.split())) # Convert to set to remove repetitive words
  raw_keywords.append(words)
df['Keywords'] = raw_keywords

Although LDA was used in trial runs, the final topic modeling algorithm was Top2Vec, as LDA’s results were not satisfactory. By contrast, Top2Vec, built on a pretrained multilingual model, produced better results and found relevant and distinct topics in the dataset. After the topics were modeled, they were assigned back to the DataFrame rows; a topic was kept only when its score exceeded a threshold of 0.5, so that weak assignments were left out.
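The thresholding step can be sketched in plain Python as follows. This is not the notebook's pandas code but an equivalent illustration: `assign_topics` is a hypothetical helper, and the scores are invented toy values rather than output from the trained model.

```python
def assign_topics(score_rows, threshold=0.5):
    """For each document, keep the best-scoring topic only if its score exceeds the threshold.

    score_rows: list of dicts mapping topic number -> score for one document.
    Returns a list of (topic, score) tuples, with (None, None) for rejected documents.
    """
    assigned = []
    for scores in score_rows:
        best_topic = max(scores, key=scores.get)   # topic with the highest score
        best_score = scores[best_topic]
        if best_score > threshold:
            assigned.append((best_topic, best_score))
        else:
            assigned.append((None, None))          # too weak: leave the document unassigned
    return assigned


# Toy scores (invented for illustration)
toy_rows = [
    {22: 0.72, 27: 0.10},   # clearly topic 22
    {22: 0.05, 27: 0.64},   # clearly topic 27
    {22: 0.31, 27: 0.42},   # best score below the threshold
]
print(assign_topics(toy_rows))
# [(22, 0.72), (27, 0.64), (None, None)]
```

A strict `>` comparison mirrors the notebook's `where(... > 0.5)`, so a score of exactly 0.5 is left unassigned.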

In [None]:
# Create the Top2Vec model
documents = df['Keywords'].apply(lambda tokens: ' '.join(tokens)).tolist()
model = Top2Vec(documents=documents, speed='learn', embedding_model='universal-sentence-encoder-multilingual')

# Get model information
topic_words, word_scores, topic_nums = model.get_topics()

for topic_num in topic_nums:
    print(f"Topic {topic_num}:")
    print("Words:", topic_words[topic_num])
    print("Scores:", word_scores[topic_num])
    print("\n")

2024-06-18 21:33:02,101 - top2vec - INFO - Pre-processing documents for training
INFO:top2vec:Pre-processing documents for training
2024-06-18 21:33:02,914 - top2vec - INFO - Downloading universal-sentence-encoder-multilingual model
INFO:top2vec:Downloading universal-sentence-encoder-multilingual model
2024-06-18 21:33:06,111 - top2vec - INFO - Creating joint document/word embedding
INFO:top2vec:Creating joint document/word embedding
2024-06-18 21:33:30,845 - top2vec - INFO - Creating lower dimension embedding of documents
INFO:top2vec:Creating lower dimension embedding of documents
2024-06-18 21:33:48,565 - top2vec - INFO - Finding dense areas of documents
INFO:top2vec:Finding dense areas of documents
2024-06-18 21:33:48,828 - top2vec - INFO - Finding topics
INFO:top2vec:Finding topics


Topic 0:
Words: ['evoke' 'get' 'self' 'mind' 'wonder' 'intricate' 'deep' 'look' 'feel'
 'time' 'sense' 'see' 'street' 'represents' 'long' 'soul' 'nude' 'dance'
 'sun' 'body' 'hand' 'water' 'moment' 'girl' 'femme' 'journey' 'face'
 'line' 'old' 'wall' 'heart' 'used' 'back' 'female' 'earth' 'place'
 'animal' 'living' 'sky' 'dreams' 'home' 'night' 'human' 'large'
 'imagination' 'day' 'fantasy' 'reflection' 'decor' 'power']
Scores: [0.7375897  0.718816   0.66986704 0.6613444  0.65603787 0.6493392
 0.6459558  0.64400613 0.64135444 0.6404442  0.640195   0.6316486
 0.62484896 0.62031674 0.6113976  0.6113609  0.60989904 0.6083824
 0.6067419  0.60626704 0.60491204 0.60314584 0.59890807 0.5984658
 0.5973082  0.59613883 0.59452474 0.5942867  0.5940006  0.5930214
 0.5890195  0.5879015  0.5871559  0.5852127  0.5842141  0.58348584
 0.58245546 0.5813176  0.57845485 0.5720142  0.56337184 0.56307864
 0.55881345 0.55791366 0.5550075  0.5548378  0.55468565 0.5526205
 0.54986733 0.5482695 ]


Topic 1:
Wor

In [None]:
model = Top2Vec.load("/content/drive/MyDrive/Uni/Digital Humanities Lab/digital humanities lab/top2vec model.model")

# Get topic numbers and sizes
topic_sizes, topic_nums = model.get_topic_sizes()

# Create a copy of the original Dataframe
topic_doc = df.copy()
# Create a DataFrame to store the topic scores for all artworks
# (initialized with float zeros, since the scores assigned below are floats)
topic_scores = pd.DataFrame(0.0, index=topic_doc.index, columns=topic_nums)

# Loop over each topic
for t in topic_nums:
    # Get every document assigned to topic t with its score and store the scores in that topic's column
    documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=t, num_docs=topic_sizes[t])
    topic_scores.loc[document_ids,t] = document_scores

# Add the DataFrame with as many columns as topics to the copy of the dataset's DataFrame
topic_doc = topic_doc.join(topic_scores)

# Determine the dominant topic and its score for each document
topic_doc['Dominant Topic'] = topic_scores.idxmax(axis=1)
topic_doc['Score'] = topic_scores.max(axis=1)

# Add them to the original DataFrame if the score is higher than the 0.5 threshold
df['Topic'] = topic_doc['Dominant Topic'].where(topic_doc['Score'] > 0.5)
df['Score'] = topic_doc['Score'].where(topic_doc['Score'] > 0.5)

While the keywords listed for each topic seem repetitive and many topics appear to overlap, analysing the artworks actually assigned to each topic provided more insight. The following cells show examples of artworks from different topics. Topic 49, whose top five words (as listed in the cell printing all topics) are ['flower' 'flowers' 'eyes' 'dance' 'heart'], is in fact entirely about butterflies, while topic 11, with the words ['get' 'see' 'time' 'feel' 'moment'], contains artworks without keywords. Topics 27 and 22, both present in the tail of the dataset, are about pop culture figures and animals respectively, as their word lists in the aforementioned cell suggest.

In [None]:
# Tail of the dataset's DataFrame
display(df.tail(5)[["Title", "Description", "Topic", "Score"]])

Unnamed: 0,Title,Description,Topic,Score
6107,Zeitlose Eleganz: Ein Hase,Die Zeichnung zeigt einen Hasen in einer leben...,22.0,0.501584
6108,Temptation,Can you resist the temptation?The beauty desir...,1.0,0.533925
6109,Amy Winehouse,Amy Winehouse était une chanteuse et compositr...,27.0,0.694456
6110,Fiat 126P C64 Pixel Art,,,
6111,Jesus Escape From Guantanamo,"I put the title inside the work, precisely so ...",,


In [None]:
display(df[df['Topic'].isin([27])].head()[['Title', 'Description', 'Topic', 'Score']])

Unnamed: 0,Title,Description,Topic,Score
2194,Monica Bellucci Hollywood Actress Retro,"Retro art Monica Bellucci, born on September 3...",27.0,0.536951
2255,"""Tina Turner: Energy Of The Stage""","Das digitale Bild ""Tina Turner: Energy of the ...",27.0,0.528699
2360,Gainsbourg,GAINSBOURG impression sur dibond brillant prêt...,27.0,0.508188
2369,Dernier Bus Pour Bruford,Mr Strange (Jean-Marie Gitard) est un artiste ...,27.0,0.515874
2395,Bowie Ii,David Bowie: A Musical Chameleon and Icon \n \...,27.0,0.767684


In [None]:
display(df[df['Topic'].isin([22])].tail()[['Title', 'Description', 'Topic', 'Score']])

Unnamed: 0,Title,Description,Topic,Score
5625,Cute Love - Pet,Step into the realm of Digital art with this c...,22.0,0.70475
5642,Greyhound,The elegance and grace of the greyhound. This ...,22.0,0.585914
5653,Le Sauvage,"Une image graphique et colorée. Un chien racé,...",22.0,0.546805
6005,Because I Live,Cette œuvre a été créée suite à un questionnem...,22.0,0.588668
6107,Zeitlose Eleganz: Ein Hase,Die Zeichnung zeigt einen Hasen in einer leben...,22.0,0.501584


In [None]:
display(df[df['Topic'].isin([49])].tail(10)[['Title', "Artist", 'Description', 'Topic', 'Score']])

Unnamed: 0,Title,Artist,Description,Topic,Score
4176,Lutine Aux Bulles De Papillons,Madeleine Gendron,,49.0,0.69813
4286,Papillonne,Jérôme Baraniak,La beauté de la nature cache toujours une arme...,49.0,0.631043
4441,Gazing With Butterfly,Arija Paikule,Digital art with artist's signature.\n\n ...,49.0,0.648213
4518,Paràthyro,Lula Lp,Boho art\n\n \n ...,49.0,0.527753
4694,Butterfly,Natalia Nozdrina,There are butterflies that live only a day. On...,49.0,0.648332
5448,Boréale,Thomas Blondeau-Dumoulin,,49.0,0.673721
5451,Bly,Jiyonisus,It's a virtual picture made in monochromatic s...,49.0,0.633513
5507,Woman And Butterflies,Mariana,Mujer y mariposas\n\n \n ...,49.0,0.870023
5769,Contrast,Roxana Ferllini,Contrast is digital a mixed media art - \n \n*...,49.0,0.620613
6086,Butterfly Mask,Guillaume Bellebault,Artwork qui sera imprimé\n\n \n...,49.0,0.801913


In [None]:
display(df[df['Topic'].isin([11])].tail(5)[['Title', "Artist", 'Keywords', 'Topic', 'Score']])

Unnamed: 0,Title,Artist,Keywords,Topic,Score
5961,Lido15,Mariano Moriconi,[],11.0,0.999986
5966,24,Alex,[],11.0,0.999986
5968,Abstract Expressionism-[Vol.12]-2/12,Nabil Zeineddine,[],11.0,0.999986
5996,梦幻15,Tina Liao,[],11.0,0.999986
6003,Angel07,Nevio Massaro,[],11.0,0.999986


In [None]:
display(df[df['Topic'].isin([34])][['Title', "Artist", 'Keywords', 'Topic', 'Score']])

Unnamed: 0,Title,Artist,Keywords,Topic,Score
29,Shark Warriors,Shari Blackwell,"[warriors, shark, indian, native, american]",34.0,0.773282
74,She Wolf,Shari Blackwell,"[wolf, american, native, indian]",34.0,0.88204
98,Tribal Queen,Yoalah Brinson,"[queen, yoalah, tribal, brinson]",34.0,0.528603
130,Beautiful Dreams,Shari Blackwell,"[indian, native, dreams, american]",34.0,0.840136
183,Lofty Thoughts and Dreams,Shari Blackwell,"[dreams, indian, dreaming, thoughts, lofty, na...",34.0,0.694171
212,Whale Warrior,Shari Blackwell,"[whale, indian, warrior, native, american]",34.0,0.830724
306,Lion King,Shari Blackwell,"[indian, american, king, native, lion]",34.0,0.798335
337,Freedom,Shari Blackwell,"[freedom, native, american, indian]",34.0,0.81605
380,Running Free,Shari Blackwell,"[free, indian, running, native, american]",34.0,0.79765
396,Fresh Air,Shari Blackwell,"[indian, fresh, air, native, american]",34.0,0.804803


Interpreting all 58 topics by reviewing their artworks allowed the modelled topics to be categorized into themes, as displayed in the table below. The most common theme relates to animals, with one topic, as shown in the example above, relating specifically to butterflies. Other prevalent themes are emotions, the representation of women, landscapes and urban contexts. At the same time, some topics are highly specific, such as those relating to butterflies, Native Americans, vision and sight, and the future. These topics highlight the diversity of the dataset, but also reveal how a relatively small dataset can skew topic modeling. For instance, the unexpected topic regarding Native Americans likely emerged because of multiple artworks by a single artist depicting this culture, and may not correspond to a frequent theme in digital art. Overall, the themes mostly reflect common subjects of traditional art, with some additions, such as robots and the future, that might be more common in digital art.

<table>
    <thead>
        <tr>
            <th>Theme</th>
            <th>Topic Number</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Landscapes</td>
            <td>9, 10, 21, 24</td>
        </tr>
        <tr>
            <td>Flowers</td>
            <td>25</td>
        </tr>
        <tr>
            <td>Sea</td>
            <td>28, 24</td>
        </tr>
        <tr>
            <td>Animals</td>
            <td>29, 22, 33, 32, 49, 54</td>
        </tr>
        <tr>
            <td>Universe</td>
            <td>20, 26</td>
        </tr>
        <tr>
            <td>Natural Elements</td>
            <td>12, 19</td>
        </tr>
        <tr>
            <td>Emotions</td>
            <td>1, 2, 3, 6, 44</td>
        </tr>
        <tr>
            <td>Native Americans</td>
            <td>34</td>
        </tr>
        <tr>
            <td>Love</td>
            <td>16</td>
        </tr>
        <tr>
            <td>Family</td>
            <td>43</td>
        </tr>
        <tr>
            <td>Music/Pop Culture</td>
            <td>27</td>
        </tr>
        <tr>
            <td>Dance</td>
            <td>42</td>
        </tr>
        <tr>
            <td>Vision</td>
            <td>37</td>
        </tr>
        <tr>
            <td>Women</td>
            <td>4, 36, 13, 18</td>
        </tr>
        <tr>
            <td>Robots/Futuristic</td>
            <td>15</td>
        </tr>
        <tr>
            <td>Urban</td>
            <td>41, 7, 30, 52</td>
        </tr>
    </tbody>
</table>
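
For use in code, the table above can be written as a dictionary and inverted into a topic-to-theme lookup. The sketch below is only an illustration: the names `THEME_TOPICS`, `invert_themes` and `TOPIC_THEMES` are hypothetical, and the values are copied from the table. Because topic 24 is listed under both Landscapes and Sea, each topic maps to a list of themes rather than a single one.

```python
# Theme -> topic numbers, copied from the table above
THEME_TOPICS = {
    "Landscapes": [9, 10, 21, 24],
    "Flowers": [25],
    "Sea": [28, 24],
    "Animals": [29, 22, 33, 32, 49, 54],
    "Universe": [20, 26],
    "Natural Elements": [12, 19],
    "Emotions": [1, 2, 3, 6, 44],
    "Native Americans": [34],
    "Love": [16],
    "Family": [43],
    "Music/Pop Culture": [27],
    "Dance": [42],
    "Vision": [37],
    "Women": [4, 36, 13, 18],
    "Robots/Futuristic": [15],
    "Urban": [41, 7, 30, 52],
}


def invert_themes(theme_topics):
    """Build a topic number -> list of themes lookup from a theme -> topics mapping."""
    topic_themes = {}
    for theme, topics in theme_topics.items():
        for topic in topics:
            topic_themes.setdefault(topic, []).append(theme)
    return topic_themes


TOPIC_THEMES = invert_themes(THEME_TOPICS)
print(TOPIC_THEMES[49])          # ['Animals']
print(TOPIC_THEMES[24])          # ['Landscapes', 'Sea']
print(TOPIC_THEMES.get(11, []))  # [] (topic 11 is not assigned to any theme in the table)
```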

The query expansion aims to improve the relevance of results by adding synonyms and meronyms to the query, using the WordNet (2010) database. However, as synonyms are added automatically, detached from the context of the query, the expansion may also include unrelated words.

In [None]:
def expand_query(query):
    """
    Expand query through synonyms and meronyms

    Args:
    query (str)

    Returns:
    list : a list containing the word(s) of the original query and its synonyms and meronyms from WordNet
    """
    tokens = nltk.word_tokenize(query)
    expanded_query = []
    expanded_query.extend(tokens)

    for token in tokens:
        synonyms = set()
        meronyms = set()
        synsets = wordnet.synsets(token)
        for synset in synsets:
            # Keep lemmas only from senses close to the token's first sense;
            # wup_similarity can return None when no common ancestor is found
            similarity = synset.wup_similarity(synsets[0])
            if similarity is not None and similarity >= 0.6:
                for lemma in synset.lemmas():
                    synonyms.add(lemma.name())
            # Add the part and substance meronyms of every sense to the set
            for meronym in synset.part_meronyms() + synset.substance_meronyms():
                for lemma in meronym.lemmas():
                    meronyms.add(lemma.name())
        # Add to the expanded query list
        expanded_query.extend(synonyms)
        expanded_query.extend(meronyms)

    # Remove duplicates and convert back to list
    expanded_query = list(set(expanded_query))
    return expanded_query

The following test cells show how the query expansion works. Including synonyms and meronyms (words with similar meanings and words denoting parts of the queried word, respectively) aims to broaden the search. However, the WordNet senses of the word “cat” also include words like “caterpillar” and “vomit”, amongst others. To mitigate this problem, we introduced a similarity threshold of 0.6 for synonyms to reduce overexpansion. Despite this measure, as words are not considered in their context, irrelevant keywords can still be added. On the other hand, meronyms appear effective in retrieving relevant words from the same field: the query “computer” is expanded with “keyboard” and “microchip”, and “room” with “wall” and “ceiling”.
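The 0.6 threshold is applied to WordNet's Wu-Palmer similarity (`wup_similarity`), which compares two senses through their deepest shared ancestor: 2 × depth(shared ancestor) / (depth(a) + depth(b)). The sketch below illustrates the idea on a hypothetical toy taxonomy. The taxonomy, the helper names and the convention that the root has depth 1 are assumptions for illustration, so the numbers will not match WordNet's actual values.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root)
PARENT = {
    "entity": None,
    "animal": "entity",
    "feline": "animal",
    "cat": "feline",
    "big_cat": "feline",
    "person": "entity",
    "guy": "person",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while PARENT[node] is not None:
        node = PARENT[node]
        path.append(node)
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(deepest shared ancestor) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # Walking up from `a`, the first node also above `b` is the deepest shared ancestor
    shared = next(node for node in path_to_root(a) if node in ancestors_b)
    return 2 * depth(shared) / (depth(a) + depth(b))


print(wu_palmer("cat", "big_cat"))        # 0.75 -> kept with a 0.6 threshold
print(round(wu_palmer("cat", "guy"), 3))  # 0.286 -> rejected
```

In this toy setting, senses that share a deep ancestor pass the threshold while senses that only meet at the root are filtered out, which is the effect the threshold is meant to have on unrelated senses of a query word.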

In [None]:
for synonym in wordnet.synsets("cat"):
  print(synonym.lemmas())

[Lemma('cat.n.01.cat'), Lemma('cat.n.01.true_cat')]
[Lemma('guy.n.01.guy'), Lemma('guy.n.01.cat'), Lemma('guy.n.01.hombre'), Lemma('guy.n.01.bozo')]
[Lemma('cat.n.03.cat')]
[Lemma('kat.n.01.kat'), Lemma('kat.n.01.khat'), Lemma('kat.n.01.qat'), Lemma('kat.n.01.quat'), Lemma('kat.n.01.cat'), Lemma('kat.n.01.Arabian_tea'), Lemma('kat.n.01.African_tea')]
[Lemma('cat-o'-nine-tails.n.01.cat-o'-nine-tails'), Lemma('cat-o'-nine-tails.n.01.cat')]
[Lemma('caterpillar.n.02.Caterpillar'), Lemma('caterpillar.n.02.cat')]
[Lemma('big_cat.n.01.big_cat'), Lemma('big_cat.n.01.cat')]
[Lemma('computerized_tomography.n.01.computerized_tomography'), Lemma('computerized_tomography.n.01.computed_tomography'), Lemma('computerized_tomography.n.01.CT'), Lemma('computerized_tomography.n.01.computerized_axial_tomography'), Lemma('computerized_tomography.n.01.computed_axial_tomography'), Lemma('computerized_tomography.n.01.CAT')]
[Lemma('cat.v.01.cat')]
[Lemma('vomit.v.01.vomit'), Lemma('vomit.v.01.vomit_up'), Lemm

In [None]:
query = "cat"
print(expand_query(query))

['cat', 'big_cat', 'true_cat']


In [None]:
query = "computer"
print(expand_query(query))

['bus', 'central_processing_unit', 'keyboard', 'microchip', 'CPU', 'monitoring_device', 'computer_peripheral', 'computer', 'computer_storage', 'floppy_disk', 'central_processor', 'chip', 'disk_cache', 'computer_memory', 'processor', 'silicon_chip', 'memory', 'computer_circuit', 'peripheral_device', 'information_processing_system', 'store', 'busbar', 'computing_machine', 'monitor', 'memory_board', 'hardware', 'micro_chip', 'data_converter', 'CRT', 'cathode-ray_tube', 'storage', 'diskette', 'mainframe', 'microprocessor_chip', 'computer_hardware', 'data_processor', 'computing_device', 'C.P.U.', 'peripheral', 'electronic_computer', 'floppy', 'computer_accessory']


In [None]:
query = "room"
print(expand_query(query))

['room', 'flooring', 'wall', 'floor', 'ceiling', 'room_light']


In [None]:
query = "father"
print(expand_query(query))

['father', 'begetter', 'sire', 'forefather', 'male_parent']


In [None]:
# Replace empty strings with NaN values
df = df.replace({"": np.nan})

# Convert multiple columns to numeric values
df['Dimensions in cm2'] = pd.to_numeric(df['Dimensions in cm2'], errors='coerce')
df['Height'] = pd.to_numeric(df['Height'], errors='coerce')
df['Width'] = pd.to_numeric(df['Width'], errors='coerce')
df['Price'] = pd.to_numeric(df['Price'], errors='coerce')

This search function serves as the backend of the final tool. It takes the user’s input as arguments, applies the corresponding filters, and returns a filtered DataFrame together with mean prices. As the project focuses on price fairness, the DataFrame is sorted by price, unless a theme filter is applied, in which case the order depends on the topic score assigned by the trained model.