<h3>Importing all the useful libraries

In [1]:
import torch
import clip
from PIL import Image
from torchvision import transforms
import numpy as np
import pandas as pd
import os
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer, util
import matplotlib.pyplot as plt
import torch.nn as nn
import spacy
nlp = spacy.load("en_core_web_sm")
from nltk.corpus import wordnet



In [2]:
import gc
gc.collect()
torch.cuda.empty_cache()

In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"
# device="cpu"
device

'cpu'

<h3>First Approach - Directly feeding the test data into the pretrained model

<h5>Loading a pretrained model that is trained only for the English language

This is the Image & Text model CLIP, which maps text and images to a shared vector space. 

In [4]:
model, preprocess = clip.load("ViT-B/32", device=device)

Loading the English test dataset, which is in TSV format

In [5]:
# Load the English test dataset
en_test_data_path = 'en.test.data.v1.1.txt'
en_test_gold_path = 'en.test.gold.v1.1.txt'
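
Judging by how the later cells split each line, every row of the data file holds the target word, the full phrase, and then the candidate image file names, all separated by tabs. Below is a minimal sketch of that parsing step on a made-up line. The helper name `parse_instance` and the file names in the example are hypothetical and not taken from the dataset.

```python
def parse_instance(line):
    """Split one tab-separated data line into target word, phrase, and image names."""
    target_word, full_phrase, *image_paths = line.strip().split("\t")
    return target_word, full_phrase, image_paths


# Hypothetical example line; real lines list more candidate images.
example_line = "goal\tfootball goal\timage.1.jpg\timage.2.jpg\timage.3.jpg\n"

target_word, full_phrase, image_paths = parse_instance(example_line)
print(target_word, "|", full_phrase, "|", image_paths)
```

The starred assignment collects every remaining field into `image_paths`, so the same code works regardless of how many candidate images a line lists.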

Functions to encode text and images easily

In [6]:

def encode_image(image_path):
    image = Image.open(image_path).convert('RGB')
    image = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        return model.encode_image(image)

def encode_text(text):
    text = clip.tokenize([text]).to(device)
    with torch.no_grad():
        return model.encode_text(text)

In [7]:
# Load test data and gold labels
with open(en_test_data_path, "r", encoding="utf-8") as data_file, \
        open(en_test_gold_path, "r", encoding="utf-8") as gold_file:
    test_data_lines = data_file.readlines()
    gold_labels = [line.strip() for line in gold_file.readlines()]


First, we feed the text and image data directly to the pretrained model, without any changes, to see the baseline result.

In [8]:
# Lists to store calculated cosine similarities and predictions
cosine_similarities = []
predictions = []

# Further analysis for each instance
for i, (data_line, gold_label) in enumerate(zip(test_data_lines, gold_labels)):
    # Parse the test instance
    target_word, full_phrase, *image_paths = data_line.strip().split("\t")

    # Encode text features
    text_features = encode_text(full_phrase)

    # Encode image features for each image
    image_features = [encode_image('test_images/' + image_path) for image_path in image_paths]

    # Calculate cosine similarities
    similarities = [cosine_similarity(text_features.flatten().unsqueeze(0).cpu(), image.flatten().unsqueeze(0).cpu()).item()
                    for image in image_features]

    # Predict the image with the highest similarity
    predicted_label = image_paths[similarities.index(max(similarities))]

    # Store the results
    cosine_similarities.append(similarities)
    predictions.append(predicted_label)

    # Further analysis for each instance
    print(f"Instance {i + 1}:")
    print(f"Target Word: {target_word}")
    print(f"Full Phrase: {full_phrase}")
    print(f"Gold Image: {gold_label}")
    print(f"Predicted Image: {predicted_label}")
    print(f"Cosine Similarities: {similarities}")
    print("\n")


Instance 1:
Target Word: goal
Full Phrase: football goal
Gold Image: image.2166.jpg
Predicted Image: image.2166.jpg
Cosine Similarities: [0.26201966404914856, 0.20519670844078064, 0.23524916172027588, 0.22653508186340332, 0.19679154455661774, 0.24347779154777527, 0.206990584731102, 0.2362373322248459, 0.30433389544487, 0.24262438714504242]


Instance 2:
Target Word: mustard
Full Phrase: mustard seed
Gold Image: image.4429.png
Predicted Image: image.4429.png
Cosine Similarities: [0.29952043294906616, 0.22497449815273285, 0.28654956817626953, 0.21882936358451843, 0.2025444507598877, 0.2692073881626129, 0.22509106993675232, 0.27111658453941345, 0.2130971997976303, 0.1958523839712143]


Instance 3:
Target Word: seat
Full Phrase: eating seat
Gold Image: image.4432.jpg
Predicted Image: image.4432.jpg
Cosine Similarities: [0.2158469259738922, 0.2072405368089676, 0.19851523637771606, 0.20309880375862122, 0.2107028365135193, 0.2628728151321411, 0.23018671572208405, 0.2001945823431015, 0.2384347
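
The prediction rule can be checked by hand on the similarities printed for Instance 1 above. The sketch below copies those ten values and applies the same "largest similarity wins" rule as the loop. The margin between the best and second-best candidate is an extra quantity added here for illustration; the notebook itself does not compute it.

```python
# Similarities printed for Instance 1 in the output above.
similarities = [
    0.26201966404914856, 0.20519670844078064, 0.23524916172027588,
    0.22653508186340332, 0.19679154455661774, 0.24347779154777527,
    0.206990584731102, 0.2362373322248459, 0.30433389544487,
    0.24262438714504242,
]

# Same rule as the loop: index of the largest similarity.
best_index = similarities.index(max(similarities))

# Gap between the best and second-best candidate (illustrative only).
ranked = sorted(similarities, reverse=True)
margin = ranked[0] - ranked[1]

print(best_index)  # 8, i.e. the ninth candidate image
print(round(margin, 4))
```

Since the gold and predicted images agree for this instance, the ninth candidate in that line is the gold image.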

As shown below, the accuracy of the pretrained model on the English dataset is about 57.7%.

In [9]:
# Calculate accuracy
correct_predictions = sum(1 for pred, gold in zip(predictions, gold_labels) if pred == gold)
accuracy = correct_predictions / len(gold_labels)

print(f"Accuracy for the English is: {accuracy * 100:.2f}%")

Accuracy for the English is: 57.67%


Saving the predicted results to a separate text file in the desired format

In [11]:
# Save predictions to a separate file
with open("en.pred.test.txt", "w") as predictions_file:
    predictions_file.write("\n".join(predictions))

<h4>Second Approach

<h2>Definitions + Context Using WordNet

To increase the accuracy, I tried adding the definition of each target word and combining the encoded vector of the definition with the context given in our dataset. After combining them, we feed the result to our pretrained model to see the outcome.
I used WordNet to get the definition of each word that WordNet can provide.

In [12]:
def get_word_definition(word):
    synsets = wordnet.synsets(word)

    if synsets:
        # Take the first synset as the definition
        definition = synsets[0].definition()
        return definition
    else:
        return 'This is a picture of ' + word  # neutral sentence if no definition is available; an empty vector would also work

Function to encode a definition into the desired output size

In [13]:
def encode_definition(definition, req_shape, max_length=20, embedding_dim=300):
    # Tokenize the definition using spaCy
    doc = nlp(definition)

    # Keep alphabetic tokens only and truncate to max_length
    tokens = [token.text.lower() for token in doc if token.is_alpha][:max_length]

    # Pad with '<pad>' up to max_length
    tokens += ['<pad>'] * (max_length - len(tokens))

    # Convert tokens to word embeddings (zeros where spaCy has no vector)
    embeddings = [token.vector if token.has_vector else np.zeros(embedding_dim) for token in nlp(" ".join(tokens))]
    encoding = torch.tensor(embeddings)

    # Mean over the token axis (axis 0), giving a single 1 x dim vector
    encoding_mean = torch.mean(encoding, dim=0, keepdim=True)

    # Project to the required size.
    # Note: this nn.Linear is created on every call with random, untrained weights.
    input_size = encoding_mean.shape[1]
    output_size = req_shape
    linear_layer = nn.Linear(input_size, output_size)
    output_tensor = linear_layer(encoding_mean)

    return output_tensor

Now we feed the combined encoding of context and definition to the pretrained model to see how it performs.

In [14]:
# Lists to store calculated cosine similarities and predictions
cosine_similarities = []
predictions = []

# Further analysis for each instance
for i, (data_line, gold_label) in enumerate(zip(test_data_lines, gold_labels)):
    # Parse the test instance
    target_word, full_phrase, *image_paths = data_line.strip().split("\t")

    # Encode text features
    text_features = encode_text(full_phrase)
    req_shape=text_features.shape[1]

    # Target word definition
    definition = get_word_definition(target_word)

    # Encode the definition into a tensor
    encoded_definition = encode_definition(definition, req_shape).to(device)
    
    # Calculate the average or mean along the first dimension (axis 0)
    average_tensor = torch.mean(torch.stack([encoded_definition, text_features], dim=0), dim=0)
    

    # Encode image features for each image
    image_features = [encode_image('test_images/' + image_path) for image_path in image_paths]
    
    # Print the output shapes during the loop
    # print(f"\nInstance {i + 1}:")
    # print(f"Text Features Shape: {average_tensor.shape}")
    # for j, image_feature in enumerate(image_features):
    #     print(f"Image {j + 1} Features Shape: {image_feature.shape}")

    # Calculate cosine similarities
    similarities = [cosine_similarity(average_tensor.flatten().unsqueeze(0).detach().cpu(), image.flatten().unsqueeze(0).detach().cpu()).item()
                    for image in image_features]

    # Predict the image with the highest similarity
    predicted_label = image_paths[similarities.index(max(similarities))]

    # Store the results
    cosine_similarities.append(similarities)
    predictions.append(predicted_label)

    # Further analysis for each instance
    print(f"Instance {i + 1}:")
    print(f"Target Word: {target_word}")
    print(f"Full Phrase: {full_phrase}")
    print(f"Gold Image: {gold_label}")
    print(f"Predicted Image: {predicted_label}")
    print(f"Cosine Similarities: {similarities}")
    print("\n")




Instance 1:
Target Word: goal
Full Phrase: football goal
Gold Image: image.2166.jpg
Predicted Image: image.2166.jpg
Cosine Similarities: [0.24141016602516174, 0.20047661662101746, 0.21590130031108856, 0.23127561807632446, 0.19256630539894104, 0.24634824693202972, 0.20952269434928894, 0.2196078896522522, 0.3041743338108063, 0.2338143140077591]


Instance 2:
Target Word: mustard
Full Phrase: mustard seed
Gold Image: image.4429.png
Predicted Image: image.4429.png
Cosine Similarities: [0.24506056308746338, 0.15197470784187317, 0.1735268235206604, 0.1538206785917282, 0.0881265252828598, 0.15875665843486786, 0.1474067121744156, 0.20085132122039795, 0.15291760861873627, 0.10173965990543365]


Instance 3:
Target Word: seat
Full Phrase: eating seat
Gold Image: image.4432.jpg
Predicted Image: image.4432.jpg
Cosine Similarities: [0.15684126317501068, 0.12714096903800964, 0.1481904685497284, 0.17517465353012085, 0.1440087854862213, 0.19453269243240356, 0.13413575291633606, 0.09543253481388092, 0.1

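As you can see, the accuracy did not improve; it dropped from 57.67% to 46.00%. A likely contributor is that `encode_definition` passes the definition through a newly created, untrained `nn.Linear` layer on every call, so the resulting vector does not live in CLIP's embedding space. Knowing that, I decided not to try this definition encoding on the Farsi and Italian datasets.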

In [15]:
# Calculate accuracy
correct_predictions = sum(1 for pred, gold in zip(predictions, gold_labels) if pred == gold)
accuracy = correct_predictions / len(gold_labels)

print(f"Accuracy for the English is: {accuracy * 100:.2f}%")

Accuracy for the English is: 46.00%
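
One thing that may be worth checking in the averaging step of the loop above: when two vectors of very different magnitude are averaged, the larger one dominates the mean. The sketch below is not the notebook's method. It only illustrates, with toy 3-dimensional vectors made up for this example, how L2-normalizing each vector before averaging gives both an equal say in the result. The helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy stand-ins: a large-magnitude "context" and a small-magnitude "definition".
context = [10.0, 0.0, 0.0]
definition = [0.0, 1.0, 0.0]

raw_mean = mean_vector([context, definition])
normalized_mean = mean_vector([l2_normalize(context), l2_normalize(definition)])

print(raw_mean)         # [5.0, 0.5, 0.0]: the context dominates
print(normalized_mean)  # [0.5, 0.5, 0.0]: both contribute equally
```

Whether this matters for the real embeddings depends on their actual norms, which the notebook does not print.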


<h4>Here we load the multilingual CLIP model. Note that this model can only encode text. If you need embeddings for images, you must load the 'clip-ViT-B-32' model

In [22]:
# Here we load the multilingual CLIP model. Note, this model can only encode text.
# If you need embeddings for images, you must load the 'clip-ViT-B-32' model
multi_model = SentenceTransformer('clip-ViT-B-32-multilingual-v1')

Loading the model for images

In [23]:
# We use the original clip-ViT-B-32 for encoding images
img_model = SentenceTransformer('clip-ViT-B-32')

<h2>Italian Language

<h4>First Approach - Directly feeding the test data into the pretrained model

Loading the test dataset for the Italian language

In [20]:
# Load the Italian test dataset
data_file_path = 'it.test.data.v1.1.txt'
gold_file_path = 'it.test.gold.v1.1.txt'
pred_file_path = 'it.test.preds.txt'

In [25]:
# Define function to calculate cosine similarity
def calculate_cosine_similarity(image_embedding, text_embedding):
    cos_sim = util.pytorch_cos_sim(image_embedding, text_embedding)
    return cos_sim.item()
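
For reference, cosine similarity is the dot product of the two vectors divided by the product of their lengths, so it depends only on direction and not on magnitude. A plain-Python sketch with toy vectors follows; the function name is hypothetical and the code is only meant to show the arithmetic, not to replace the library call.

```python
import math


def cosine_similarity_plain(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity_plain([1.0, 0.0], [2.0, 0.0]))  # 1.0: same direction, different length
print(cosine_similarity_plain([1.0, 0.0], [0.0, 3.0]))  # 0.0: orthogonal vectors
```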

Feeding our Italian dataset into the pretrained model to see the result

In [71]:
with open(data_file_path, 'r', encoding='utf-8') as data_file, \
        open(gold_file_path, 'r', encoding='utf-8') as gold_file, \
        open(pred_file_path, 'w', encoding='utf-8') as pred_file:
    
    correct_predictions = 0
    total_instances = 0
    

    for i, (line_data, line_gold) in enumerate(zip(data_file, gold_file), 1):
        # Parse data
        data_parts = line_data.strip().split('\t')
        target_word, full_phrase, *image_paths = data_parts

        # Load images
        image_embeddings = [img_model.encode(Image.open(os.path.join('test_images', img_path))) for img_path in image_paths]

        # Encode text
        text_embedding = multi_model.encode(full_phrase)

        # Calculate cosine similarity for each image
        similarities = [calculate_cosine_similarity(image_embedding, text_embedding) for image_embedding in image_embeddings]

        # Find the index of the image with the highest similarity
        predicted_index = similarities.index(max(similarities))

        # Get gold label
        gold_label = line_gold.strip()

        # Get predicted label
        predicted_label = image_paths[predicted_index]

        # Check accuracy
        if gold_label == predicted_label:
            correct_predictions += 1

        # Save predicted label to file
        pred_file.write(predicted_label + '\n')

        total_instances += 1


        # Further analysis for each instance
        print(f"Instance {i}:")
        print(f"Target Word: {target_word}")
        print(f"Full Phrase: {full_phrase}")
        print(f"Gold Image: {gold_label}")
        print(f"Predicted Image: {predicted_label}")
        print(f"Cosine Similarities: {similarities}")
        
        print("\n")

    # Calculate accuracy
    accuracy = correct_predictions / total_instances



    print(f"Accuracy: {accuracy:.2%}")
  

Instance 1:
Target Word: gomma
Full Phrase: gomma per smacchiare
Gold Image: image.8.jpg
Predicted Image: image.0.jpg
Cosine Similarities: [0.22921927273273468, 0.22401492297649384, 0.22230525314807892, 0.22764545679092407, 0.22010408341884613, 0.23065254092216492, 0.22437046468257904, 0.22302085161209106, 0.2307814508676529, 0.21099971234798431]


Instance 2:
Target Word: asino
Full Phrase: asino gioco di carte
Gold Image: image.18.jpg
Predicted Image: image.18.jpg
Cosine Similarities: [0.26544082164764404, 0.2032451182603836, 0.1727599948644638, 0.21278910338878632, 0.20795506238937378, 0.21338549256324768, 0.1984795331954956, 0.20422445237636566, 0.20711947977542877, 0.1925031691789627]


Instance 3:
Target Word: colonna
Full Phrase: colonna missione
Gold Image: image.20.jpg
Predicted Image: image.22.jpg
Cosine Similarities: [0.26893672347068787, 0.23251423239707947, 0.240172877907753, 0.2592426538467407, 0.23082196712493896, 0.2444537878036499, 0.25345972180366516, 0.23317265510559

The accuracy here is much lower than for English. We also tried adding definitions, but that dropped the accuracy even further.

In [93]:
print(f"For Italian;")
print(f"Accuracy: {accuracy:.2%}")


For Italian;
Accuracy: 28.50%
MMR Score: 27.8601
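
The output above includes an "MMR Score" line that none of the cells shown here computes. If it refers to mean reciprocal rank (MRR), which is only an assumption, it could be computed from the per-instance similarities roughly as follows. The function name and the toy data are hypothetical.

```python
def reciprocal_rank(similarities, image_paths, gold_label):
    """Return 1 / rank of the gold image when candidates are sorted best-first."""
    ranked = sorted(zip(similarities, image_paths), key=lambda pair: pair[0], reverse=True)
    for rank, (_, path) in enumerate(ranked, start=1):
        if path == gold_label:
            return 1.0 / rank
    return 0.0  # gold image not among the candidates


# Toy instances: (similarities, candidate images, gold image).
instances = [
    ([0.30, 0.20, 0.10], ["a.jpg", "b.jpg", "c.jpg"], "a.jpg"),  # gold ranked 1st
    ([0.30, 0.20, 0.10], ["a.jpg", "b.jpg", "c.jpg"], "b.jpg"),  # gold ranked 2nd
]

mrr = sum(reciprocal_rank(*instance) for instance in instances) / len(instances)
print(mrr)  # 0.75
```

Unlike accuracy, this metric gives partial credit when the gold image is ranked second or third rather than first.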


<h4>Second Approach

<h3>Definition + Context Using WordNet and Deep Translator

<h5>Now we want to do for Italian what we did for the English dataset: feed the model both the context and the definition obtained from WordNet. The problem is that WordNet only provides English definitions of English words. To work around this, we use the deep-translator library to first translate the Italian target word into English and then, after getting the definition, translate the definition back into Italian. After that, we combine the context and the definition to get the text encoding.

In [18]:
from deep_translator import GoogleTranslator
import spacy
nlp = spacy.load("en_core_web_sm")
from nltk.corpus import wordnet

As mentioned above, we translate the Italian target word into English to get its English definition, and then translate that definition back into Italian.

In [17]:
def get_tran_word_definition(word):

    tran_word = GoogleTranslator(source='it', target='en').translate(word)
    synsets = wordnet.synsets(tran_word)

    if synsets:
        # Take the first synset as the definition
        tran_en_definition = synsets[0].definition()
        tran_it_definition = GoogleTranslator(source='en', target='it').translate(tran_en_definition)
        return tran_it_definition
    else:
        return 'Questa è una foto di ' + word
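
Each call to `get_tran_word_definition` makes up to two requests to the translation service, so looking up the same target word again repeats those round trips. Below is a minimal sketch of memoizing such a lookup with `functools.lru_cache`. Here `slow_lookup` is a hypothetical stand-in for the real function so that the example runs offline, and whether target words actually repeat in the dataset is not something the notebook shows.

```python
from functools import lru_cache

call_count = 0


def slow_lookup(word):
    """Stand-in for a network-backed definition lookup (hypothetical)."""
    global call_count
    call_count += 1
    return "Questa è una foto di " + word


@lru_cache(maxsize=None)
def cached_lookup(word):
    """Same result as slow_lookup, but each distinct word is looked up only once."""
    return slow_lookup(word)


for word in ["gomma", "asino", "gomma"]:
    cached_lookup(word)

print(call_count)  # 2: the second "gomma" is served from the cache
```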

In [26]:
with open(data_file_path, 'r', encoding='utf-8') as data_file, \
        open(gold_file_path, 'r', encoding='utf-8') as gold_file, \
        open(pred_file_path, 'w', encoding='utf-8') as pred_file:
    
    correct_predictions = 0
    total_instances = 0
    

    for i, (line_data, line_gold) in enumerate(zip(data_file, gold_file), 1):
        # Parse data
        data_parts = line_data.strip().split('\t')
        target_word, full_phrase, *image_paths = data_parts

        # Load images
        image_embeddings = [img_model.encode(Image.open(os.path.join('test_images', img_path))) for img_path in image_paths]

        # Target word definition
        definition = get_tran_word_definition(target_word)

        # Encode text
        text_embedding = multi_model.encode(full_phrase + ' and '+ definition)

        # Calculate cosine similarity for each image
        similarities = [calculate_cosine_similarity(image_embedding, text_embedding) for image_embedding in image_embeddings]

        # Find the index of the image with the highest similarity
        predicted_index = similarities.index(max(similarities))

        # Get gold label
        gold_label = line_gold.strip()

        # Get predicted label
        predicted_label = image_paths[predicted_index]

        # Check accuracy
        if gold_label == predicted_label:
            correct_predictions += 1

        # Save predicted label to file
        pred_file.write(predicted_label + '\n')

        total_instances += 1

     

        # Further analysis for each instance
        print(f"Instance {i}:")
        print(f"Target Word: {target_word}")
        print(f"Full Phrase: {full_phrase}")
        print(f"Gold Image: {gold_label}")
        print(f"Predicted Image: {predicted_label}")
        print(f"Cosine Similarities: {similarities}")
       
        print("\n")

    # Calculate accuracy
    accuracy = correct_predictions / total_instances

    print(f"Accuracy: {accuracy:.2%}")
 

Instance 1:
Target Word: gomma
Full Phrase: gomma per smacchiare
Gold Image: image.8.jpg
Predicted Image: image.5.jpg
Cosine Similarities: [0.25127485394477844, 0.20806270837783813, 0.2508394420146942, 0.2473309338092804, 0.25201091170310974, 0.26131466031074524, 0.23534278571605682, 0.2099171131849289, 0.24460753798484802, 0.24802476167678833]


Instance 2:
Target Word: asino
Full Phrase: asino gioco di carte
Gold Image: image.18.jpg
Predicted Image: image.18.jpg
Cosine Similarities: [0.3086782991886139, 0.1440698802471161, 0.10222669690847397, 0.2802508771419525, 0.19566607475280762, 0.2596437931060791, 0.17638692259788513, 0.10360850393772125, 0.24431662261486053, 0.13124696910381317]


Instance 3:
Target Word: colonna
Full Phrase: colonna missione
Gold Image: image.20.jpg
Predicted Image: image.20.jpg
Cosine Similarities: [0.24080626666545868, 0.22608904540538788, 0.23451261222362518, 0.2437828630208969, 0.23716208338737488, 0.2465818077325821, 0.2689940929412842, 0.227204307913780

The accuracy for Italian when we feed the model both definition and context (27.21%) is almost the same as when we feed it the context alone (28.50%).

In [27]:
print(f"For Italian Language with Context + Definition;")
print(f"Accuracy: {accuracy:.2%}")

For Italian Language with Context + Definition;
Accuracy: 27.21%


<h4>Farsi Language

<h4>First Approach - Directly feeding the test data into the pretrained model

Now loading the test dataset for the Farsi language to see the result on the pretrained model

In [28]:
# Load the Farsi test dataset
fa_data_file_path = 'fa.test.data.txt'
fa_gold_file_path = 'fa.test.gold.txt'
fa_pred_file_path = 'fa.test.preds.txt'

Feeding our Farsi dataset into the pretrained model to see the result

In [29]:

accuracies = []
mmr_scores = []

with open(fa_data_file_path, 'r', encoding='utf-8') as data_file, \
        open(fa_gold_file_path, 'r', encoding='utf-8') as gold_file, \
        open(fa_pred_file_path, 'w', encoding='utf-8') as pred_file:
    
    correct_predictions = 0
    total_instances = 0


    for i, (line_data, line_gold) in enumerate(zip(data_file, gold_file), 1):
        # Parse data
        data_parts = line_data.strip().split('\t')
        target_word, full_phrase, *image_paths = data_parts

        # Load images
        image_embeddings = [img_model.encode(Image.open(os.path.join('test_images', img_path))) for img_path in image_paths]

        # Encode text
        text_embedding = multi_model.encode(full_phrase)

        # Calculate cosine similarity for each image
        similarities = [calculate_cosine_similarity(image_embedding, text_embedding) for image_embedding in image_embeddings]

        # Find the index of the image with the highest similarity
        predicted_index = similarities.index(max(similarities))

        # Get gold label
        gold_label = line_gold.strip()

        # Get predicted label
        predicted_label = image_paths[predicted_index]

        # Check accuracy
        if gold_label == predicted_label:
            correct_predictions += 1

        # Save predicted label to file
        pred_file.write(predicted_label + '\n')

        total_instances += 1

 

        # Further analysis for each instance
        print(f"Instance {i}:")
        print(f"Target Word: {target_word}")
        print(f"Full Phrase: {full_phrase}")
        print(f"Gold Image: {gold_label}")
        print(f"Predicted Image: {predicted_label}")
        print(f"Cosine Similarities: {similarities}")
        print("\n")

        # Running accuracy over the instances seen so far
        accuracy = correct_predictions / total_instances


        accuracies.append(accuracy)
     


Instance 1:
Target Word: برنج‎
Full Phrase: فلز برنج
Gold Image: image.2731.jpg
Predicted Image: image.2726.jpg
Cosine Similarities: [0.21873752772808075, 0.28382185101509094, 0.24252140522003174, 0.23285827040672302, 0.21533432602882385, 0.2839623689651489, 0.24119487404823303, 0.24333912134170532, 0.24546955525875092, 0.23266051709651947]


Instance 2:
Target Word: ملخ
Full Phrase: ملخ بادی
Gold Image: image.921.jpg
Predicted Image: image.921.jpg
Cosine Similarities: [0.2313474714756012, 0.215951070189476, 0.2246669977903366, 0.2580552399158478, 0.21614138782024384, 0.21615564823150635, 0.22258412837982178, 0.21115709841251373, 0.21341641247272491, 0.21109427511692047]


Instance 3:
Target Word: شام
Full Phrase: سرزمین شام
Gold Image: image.2750.jpg
Predicted Image: image.2750.jpg
Cosine Similarities: [0.22924411296844482, 0.2219386249780655, 0.22480764985084534, 0.216277077794075, 0.2563123106956482, 0.23069876432418823, 0.24047861993312836, 0.23283442854881287, 0.25792813301086426,

Accuracy for the Farsi dataset

In [30]:
print(f"For Farsi;")
print(f"Accuracy: {accuracy:.2%}")

For Farsi;
Accuracy: 28.50%
