# **ViLT Model Fine-Tuning**

**Datasets used**



*   [Control Net Deep Fashion](https://huggingface.co/datasets/ldhnam/deepfashion_controlnet)
*   [Deep Fashion with masks](https://huggingface.co/datasets/SaffalPoosh/deepFashion-with-masks)



# Install Dependencies



In [1]:
!pip install transformers
!pip install datasets
!pip install torch torchvision
!pip install tensorflow
!pip install flax



**Loading the datasets**

In [2]:
from datasets import load_dataset

In [3]:
model_checkpoint = "dandelin/vilt-b32-finetuned-vqa"

saffal_possh_df = load_dataset("SaffalPoosh/deepFashion-with-masks")



In [4]:
# Print the gender, cloth_type and caption columns of every split
for data in saffal_possh_df.items():
  print(data[1]["gender"])
  print(data[1]["cloth_type"])
  print(data[1]["caption"])


['WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'MEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'MEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'MEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'MEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'MEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'MEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'MEN', 'WOMEN', 'MEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'MEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'MEN', 'WOMEN', 'WOMEN', 'WOMEN', 'MEN', 'MEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'MEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'WOMEN', 'MEN', 'WOMEN', 'WOMEN', 'MEN', 'WO

In [5]:
control_net_deep_fashion = load_dataset("ldhnam/deepfashion_controlnet")

In [6]:
# Print the caption column of every split
for data in control_net_deep_fashion.items():
  print(data[1]["caption"])


['a woman wearing a black shirt and jeans stands in front of a white background', 'a pregnant woman in a blue top and jeans poses for a picture in front of a white background', 'a woman wearing a white t - shirt with a painting on it in front of a white background', 'a woman wearing a black top and tan pants in front of a white background', "a woman wearing a white t - shirt with a heart and the word d'day printed on in front of a white background", 'a woman in a blue and white shirt and jeans in front of a white background', 'a woman in black pants and a polka dot blouse in front of a white background', 'a woman in a black top and blue jeans is looking down in front of a white background', 'a woman wearing a navy top and yellow pants in front of a white background', 'a woman wearing a black top and a tan skirt in front of a white background', 'a woman wearing a white crop top with a red slogan in front of a white background', 'a woman wearing a pink shirt with a heart and arrow on it 
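The two cells above print whole columns, which is why the output is so long. To look at a single record instead, one option is to pick one index out of each column. The sketch below does this on a plain dictionary of lists; `row_at` and `toy_columns` are hypothetical names, and the values are made up for illustration. With a loaded dataset, indexing a split with an integer (for example `saffal_possh_df["train"][0]`) should give the same kind of per-row dictionary.

```python
from typing import Any, Dict, Sequence


def row_at(columns: Dict[str, Sequence[Any]], index: int) -> Dict[str, Any]:
    """Return one record from a column-oriented mapping (column name -> values)."""
    return {name: values[index] for name, values in columns.items()}


# Hypothetical stand-in for a small slice of a split; the values are made up.
toy_columns = {
    "gender": ["WOMEN", "MEN"],
    "cloth_type": ["top", "jacket"],
    "caption": ["a woman wearing a black shirt", "a man in a grey jacket"],
}

first = row_at(toy_columns, 0)
print(first["gender"], "|", first["cloth_type"], "|", first["caption"])
```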

# **Pseudo-Labels and Pseudo-Questions**



In [7]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


tokenizer = AutoTokenizer.from_pretrained("potsawee/t5-large-generation-squad-QuestionAnswer")
model = AutoModelForSeq2SeqLM.from_pretrained("potsawee/t5-large-generation-squad-QuestionAnswer")

In [32]:
from tqdm import tqdm

# Words expected at the third position of a caption (e.g. "a woman wearing ...").
# Note: "is wearing" contains a space, so it can never equal a single token.
action_key_words = ["in", "wearing", "standing", "is wearing",
                      "posing", "sitting", "walking", "carrying",
                      "leaning"]

# Create pseudo-labels and pseudo question/answer pairs from the captions
def create_pseudo_questions_for_saffal_possh(data, size=50):
  # Slicing a split returns a plain dict that maps column names to lists
  dataset_selection = data["train"][0: size]
  labels = []
  questions = []
  answers = []

  print("Loading the dataset..")

  for caption in tqdm(dataset_selection["caption"]):
    caption_tokens = caption.split(" ")
    # Skip captions that are too short or lack an action keyword
    if len(caption_tokens) < 3 or caption_tokens[2] not in action_key_words:
      continue
    label = f"{caption_tokens[1]}_{caption_tokens[2]}"

    inputs = tokenizer(caption, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=100)
    question_answer = tokenizer.decode(outputs[0], skip_special_tokens=False)
    question_answer = question_answer.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "")
    # Skip generations that lack the separator between question and answer
    if tokenizer.sep_token not in question_answer:
      continue
    question, answer = question_answer.split(tokenizer.sep_token, 1)

    labels.append(label)
    questions.append(question)
    answers.append(answer)

  # These lists only cover the captions that passed the filters above, so they
  # may be shorter than the image and caption columns.
  dataset_selection["questions"] = questions
  dataset_selection["answers"] = answers
  dataset_selection["labels"] = labels

  return dataset_selection

saffal_possh_df_processed = create_pseudo_questions_for_saffal_possh(saffal_possh_df)
print(saffal_possh_df_processed)

control_net_deep_fashion_processed = create_pseudo_questions_for_saffal_possh(control_net_deep_fashion)
print(control_net_deep_fashion_processed)


Loading the dataset..


100%|██████████| 50/50 [02:00<00:00,  2.42s/it]


{'images': [<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F5A80D1F490>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F5A80D1C850>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F5A81371B40>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F5A813725F0>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F5A813727D0>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F5A81372D40>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F5A80D1CEB0>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F5A80D1EC50>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F5A80D1E4D0>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F5A80D1C430>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F5A80D1E380>, <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F5A80D

100%|██████████| 50/50 [02:05<00:00,  2.50s/it]

{'image': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512 at 0x7F5A80D80340>, <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512 at 0x7F5A80D815D0>, <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512 at 0x7F5A80D81600>, <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512 at 0x7F5A80D82B60>, <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512 at 0x7F5A80D81660>, <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512 at 0x7F5A80D80D90>, <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512 at 0x7F5A80D83280>, <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512 at 0x7F5A80D82770>, <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512 at 0x7F5A80D83DF0>, <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512 at 0x7F5A80D80B50>, <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512 at 0x7F5A80D83730>, <PIL.PngImagePlugin.PngImageFile image mode=RGB size=512x512 at 0x7F5A80D836A0>, <PIL.PngImagePlug
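The pseudo-label rule used above can be checked without loading the T5 model. The following is a minimal sketch of the two string-handling steps only: building a `subject_action` label from the second and third caption tokens, and splitting a decoded generation into a question and an answer. The helper names `make_label` and `split_question_answer` are hypothetical, and the `"<sep>"` string and the sample question are assumptions for illustration; in the notebook the separator comes from `tokenizer.sep_token`.

```python
from typing import Optional, Sequence, Tuple

ACTION_KEY_WORDS = ("in", "wearing", "standing", "posing",
                    "sitting", "walking", "carrying", "leaning")


def make_label(caption: str,
               key_words: Sequence[str] = ACTION_KEY_WORDS) -> Optional[str]:
    """Build '<second token>_<third token>' if the third token is a keyword."""
    tokens = caption.split()
    if len(tokens) < 3 or tokens[2] not in key_words:
        return None  # caption does not follow the "a <subject> <action> ..." shape
    return f"{tokens[1]}_{tokens[2]}"


def split_question_answer(decoded: str, sep: str) -> Optional[Tuple[str, str]]:
    """Split decoded text into (question, answer) at the first separator."""
    if sep not in decoded:
        return None  # the generation did not contain the separator
    question, answer = decoded.split(sep, 1)
    return question.strip(), answer.strip()


label_a = make_label("a woman wearing a black shirt and jeans")
label_b = make_label("a pregnant woman in a blue top and jeans")
# "<sep>" and the question text are made-up stand-ins for real model output.
pair = split_question_answer("What is the woman wearing? <sep> a black shirt", "<sep>")
print(label_a, label_b, pair)
```

The second caption yields no label because its third token is `woman` rather than an action keyword, so captions with an adjective before the subject are filtered out by this rule.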




# **Model Training**


**deepFashion-with-masks**

**TBD**