# Paper Information

- __Paper title__: Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification
- __Date Published__: 4/9/2023
- __Conference__: ICCV 2023 (rank A)
- __GitHub link__: https://github.com/ZhiyinShao-H/UniPT
- __Paper link__: https://arxiv.org/abs/2309.01420

# Overview/Main contribution

- Construct a new dataset for general pretraining, called LUPerson-T
- Two pretext tasks for pretraining:
    - Contrastive Learning (CL): similar to CLIP
    - Masked Language Modeling (MLM):
        - Behaves like the original BERT
            - Uses only self-attention, with query, key, and value all taken from the text representation
        - Serves only to avoid the overfitting caused by CL, not to learn relations between text and image as in papers such as APTM and IRRA
            - APTM and IRRA use cross-attention, with the query from the text representation and the key and value from the visual representation

- Objectives used for fine-tuning on TBPR:
    - ID loss
    - Ranking loss
        - A plain triplet loss between text and image
    - PGU loss (from the LGUR paper, ACM MM 2022)
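
As a rough illustration of the ranking loss listed above, the sketch below computes a bidirectional triplet loss on cosine similarities for one matched image-text pair. The margin value, the function names, and the toy 2-D embeddings are assumptions made for illustration; the paper's exact formulation and negative-sampling strategy may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_ranking_loss(img, txt_pos, txt_neg, img_neg, margin=0.2):
    """Bidirectional triplet loss for one matched image-text pair.

    margin=0.2 is an assumed value, not taken from the paper.
    """
    s_pos = cosine(img, txt_pos)
    # Image as anchor: the matched text should beat a mismatched text.
    i2t = max(0.0, margin - s_pos + cosine(img, txt_neg))
    # Text as anchor: the matched image should beat a mismatched image.
    t2i = max(0.0, margin - s_pos + cosine(img_neg, txt_pos))
    return i2t + t2i


# Toy embeddings (hypothetical): the first call is a well-separated pair,
# the second deliberately swaps the positive and negative text.
image, other_image = [1.0, 0.0], [0.0, 1.0]
matched_text, other_text = [1.0, 0.0], [0.0, 1.0]
loss_easy = triplet_ranking_loss(image, matched_text, other_text, other_image)
loss_hard = triplet_ranking_loss(image, other_text, matched_text, other_image)
```

The loss is zero once the matched pair is more similar than both mismatched pairs by at least the margin, and it grows linearly when that ordering is violated.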

# LUPerson-T Dataset

## Overview dataset

- Large-scale dataset: 1.3M image-caption pairs

- Based on LUPerson, a large-scale dataset for person Re-ID:
    - A text caption is simply attached to each image in the original dataset
    - The new dataset contains an ID label and a text caption for each image

![image.png](attachment:257cc92b-3efe-415c-a354-d22afd2ec8a2.png)

## How was the dataset created?

The dataset is built with a divide-conquer-combine strategy.

__Phase 1 (Divide): Create a set of Attribute Phrases__

- The authors examined most person images in the dataset and extracted 14 kinds of attributes based on their frequency of occurrence, then grouped these attributes into two sets:
    - The __required set__ includes 6 attribute categories, all of which every person must possess:
    ```text
        <age>, <gender>, <upper clothes>, <lower clothes>, <action>, <hair length>
    ``` 
    - The __optional set__ contains 8 categories of attributes that a person is likely to have: 
    ```text 
            <bag>, <glasses>, <smoke>, <hat>, <cellphone>, <umbrella>, <gloves>, <vehicle>
    ```
 
- These 14 attribute categories cover most basic aspects of pedestrian appearance and already facilitate good pre-training.

- Each attribute category has a set of corresponding prompts that act as the classes in a classification problem.

![image.png](attachment:8b4774d2-608d-4d82-9350-76ffb3bf29aa.png)

__Phase 2 (Conquer): Find the matched attribute phrases for each image__

- Convert the attribute phrases into the standard prompt format: “A photo of a person <phrases>”
- Use CLIP to perform zero-shot classification for each attribute category
- For each person image feature, calculate the cosine similarity between it and all prompt features.
    - For each required attribute category, choose the phrase with the highest score
    - For each optional attribute category, choose a phrase only if its softmax probability is larger than 0.9.
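
The selection rule in Phase 2 can be sketched as follows. This is a minimal, dependency-free illustration: the toy 2-D vectors stand in for real CLIP image and prompt features, the phrases are hypothetical, and the logit scale of 100 applied before the softmax is an assumption (it mirrors the scale pretrained CLIP models typically end up with, but these notes do not say what the paper uses).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_phrase(image_feat, phrase_feats, required, threshold=0.9, scale=100.0):
    """Pick the phrase for one attribute category, or None if nothing qualifies.

    phrase_feats maps each phrase to its prompt feature.
    scale is an assumed logit scale applied before the softmax.
    """
    phrases = list(phrase_feats)
    sims = [cosine(image_feat, phrase_feats[p]) for p in phrases]
    best = max(range(len(phrases)), key=lambda i: sims[i])
    if required:
        # Required category: always keep the highest-scoring phrase.
        return phrases[best]
    # Optional category: keep the phrase only if CLIP is confident enough.
    probs = softmax([scale * s for s in sims])
    return phrases[best] if probs[best] > threshold else None


# Toy features (hypothetical stand-ins for CLIP outputs).
image_feat = [1.0, 0.0]
gender = {"man": [1.0, 0.1], "woman": [0.1, 1.0]}
hat = {"wearing a hat": [0.9, 0.1], "wearing a cap": [0.1, 0.9]}
umbrella = {"holding an umbrella": [0.70, 0.71], "carrying an umbrella": [0.71, 0.70]}

picked_gender = select_phrase(image_feat, gender, required=True)
picked_hat = select_phrase(image_feat, hat, required=False)
picked_umbrella = select_phrase(image_feat, umbrella, required=False)
```

In this toy setup the umbrella category is dropped because its two phrases score almost identically, so neither reaches the 0.9 threshold.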

__Phase 3 (Combine): Fill the matched attribute phrases into blanks__

- For each person image and its matched attribute phrases from Phase 2:
    - Choose a random template from the 456 pre-defined templates
    - Fill the attribute phrases into the correct positions
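
A small sketch of the combine step is shown below. The two templates, the slot names, and the way optional phrases are appended as an extra sentence are all hypothetical; the real 456 templates are defined by the authors and are not reproduced here.

```python
import random

# Hypothetical templates; the paper draws from 456 pre-defined ones.
TEMPLATES = [
    "The {age} {gender} with {hair_length} hair is {action}, "
    "wearing {upper_clothes} and {lower_clothes}.",
    "A {age} {gender} in {upper_clothes} and {lower_clothes} is {action}. "
    "The person has {hair_length} hair.",
]


def build_caption(attributes, optional_phrases=(), rng=random):
    """Fill a randomly chosen template with the matched attribute phrases."""
    caption = rng.choice(TEMPLATES).format(**attributes)
    # Assumed handling: optional phrases are appended as one extra sentence.
    if optional_phrases:
        caption += " The person is " + " and ".join(optional_phrases) + "."
    return caption


# Matched phrases for one image (hypothetical values).
matched = {
    "age": "young",
    "gender": "man",
    "hair_length": "short",
    "action": "walking",
    "upper_clothes": "a black jacket",
    "lower_clothes": "blue jeans",
}
caption = build_caption(matched, ["carrying a bag"], rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the template choice reproducible, which is convenient when regenerating captions for the same image.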

# Method for pretraining

## Architecture for Pretraining

![image.png](attachment:93c363ef-b0f6-49d5-8de3-63f1ec4aed62.png)

- Visual encoder: Vision Transformer (ViT-B/16) or DeiT
    - Input is an image
    - Output is a sequence of embedding vectors
- Textual encoder: BERT
    - Input is a text
    - Output is a sequence of embedding vectors
- Visual/Textual projections: 1x1 convolution with $D_{in}$ input channels and $D_{out}$ kernels
    - Applied to each vector (1x1x$D_{in}$) in the output sequence
- Max pooling, rather than the CLS token, is used to get the global representation:
    - L x D -> 1 x D
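
To make the projection and pooling steps concrete, here is a dependency-free sketch. A 1x1 convolution over a token sequence is equivalent to applying the same linear map to every token, and the global max pooling takes the element-wise maximum over the sequence axis. The function names, the bias term, and the toy numbers are assumptions for illustration; a real implementation would use a deep learning framework.

```python
def project(tokens, weight, bias):
    """Apply a 1x1 convolution, i.e. the same linear map to every token.

    tokens: L x D_in, weight: D_out x D_in, bias: D_out (bias is assumed).
    Returns an L x D_out list of lists.
    """
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def global_max_pool(tokens):
    """Element-wise max over the sequence axis: L x D -> D."""
    return [max(channel) for channel in zip(*tokens)]


# Toy example with L = 2 tokens, D_in = 3, D_out = 2.
tokens = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
bias = [0.5, 0.0]

projected = project(tokens, weight, bias)  # [[1.5, 5.0], [0.5, 3.0]]
pooled = global_max_pool(projected)        # [1.5, 5.0]
```

Unlike the CLS token, the pooled vector can take each channel from a different token, so every position in the sequence can contribute to the global representation.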

## Objective Functions

- Contrastive loss: like CLIP, APTM
- MLM: like BERT

Final loss: $L = L_{CL} + L_{MLM}$
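
For reference, a CLIP-style symmetric contrastive loss can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the temperature of 0.07 is CLIP's initial value and is assumed here, and the features are toy vectors in which the image and text at the same batch index form the positive pair.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric image-text contrastive loss over one batch.

    temperature=0.07 is an assumed value (CLIP's initialization).
    """
    images = [l2_normalize(v) for v in image_feats]
    texts = [l2_normalize(t) for t in text_feats]
    # Image-to-text logits: scaled cosine similarity of every pair.
    i2t = [[sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
           for v in images]
    # Text-to-image logits are the transpose.
    t2i = [list(col) for col in zip(*i2t)]
    return 0.5 * (diagonal_cross_entropy(i2t) + diagonal_cross_entropy(t2i))


images = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
loss_uniform = contrastive_loss(images, [[1.0, 1.0], [1.0, 1.0]])
```

Correctly aligned pairs give a loss near zero, swapped pairs give a large loss, and text features that cannot distinguish the images give a loss of $\log N$ for batch size $N$.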

# Method for fine-tuning

# Training phase

## Dataset & Data Augmentation

## Implementation details

- __Pre-training__: 
    - Input image resolution is set to 384×128
    - Text token length is 100. 
    - AdamW optimizer with a base learning rate of 1e-5. The learning rate is warmed up for the first 10% of the total steps
    - Batch size is set to 512. 
    - Total number of epochs
    - Model is pre-trained on 8 Nvidia Tesla V100 GPUs.
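
The warm-up setting above can be sketched as a simple schedule function. The linear shape of the warm-up and the constant rate afterwards are assumptions, since these notes only record the base learning rate and the 10% warm-up fraction; the function name and the step count in the example are hypothetical.

```python
def warmup_lr(step, total_steps, base_lr=1e-5, warmup_ratio=0.1):
    """Learning rate at a given 0-indexed step.

    Assumes a linear ramp up to base_lr over the first warmup_ratio of the
    steps, followed by a constant rate.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


# Example with a hypothetical run of 1000 optimizer steps.
schedule = [warmup_lr(s, total_steps=1000) for s in range(1000)]
```

With 1000 total steps, the rate rises over the first 100 steps and stays at the base value for the remaining 900.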

## Evaluation result

# Inference phase

# Conclusion

## New points in this paper

## Pro

## Cons

## How to improve?

# Demo in notebook

## Set up

### Define path

### Import libraries / local modules

### Load config

### Load model checkpoint

## Get and summarize model