#  Paper Information

- __Paper title__: Learning Granularity-Unified Representations for Text-to-Image Person Re-identification
- __Date Published__: 16/7/2022
- __Conference__: ACM MM 2022 (rank A)
- __GitHub link__: https://github.com/ZhiyinShao-H/LGUR

# Overview/Main contribution

## Main contribution

- Design two modules for learning the relation between text and image (replacing a cross-encoder, which has a high computational cost)
    - __Dictionary-based Granularity Alignment (DGA)__: reconstructs the textual representation $T$ and the visual representation $V$ in a common space $D$
        - Let $\mathcal{T}_D(Q, K, V)$ be a Transformer encoder block in which self-attention is replaced by cross-attention (written $\mathcal{T}_V$ and $\mathcal{T}_T$ below for the visual and textual branches)
    $$V_{re} = \mathcal{T}_V(V, D, D)$$ 
    <br>
    $$ T_{re} = \mathcal{T}_T(T, D, D) $$
        - $D$ is a set of special, modality-independent vectors called atoms
        - $D$ is used as the key and value
    - __Prototype-based Granularity Unification (PGU)__:
        - Learns a fine-grained representation of the text/image from the common reconstruction space $D$
        - $P$ is a set of special queries called prototypes, used to learn fine-grained, detailed representations
        - $P$ is used as the query, while the reconstructed $V_{re}$ or $T_{re}$ is used as the key and value
    $$V_{pgu} = \mathcal{P}[\mathcal{T_P}(P, V_{re}, V_{re})]$$ 
    <br>
    $$ T_{pgu} = \mathcal{P}[\mathcal{T_P}(P, T_{re}, T_{re})] $$
    
        - The output representation of this module is also in space $D$
    
    - The number of vectors in $D$ and $P$ is much smaller than the number of vectors in $V$ and $T$, so the computational cost is lower than that of image-text cross-attention as used in APTM, IRRA, ...
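To make the cost argument concrete, the snippet below counts the entries of the attention score matrices involved. It is only a back-of-the-envelope sketch: all sizes (`L_V`, `L_T`, `s`, `K`) are hypothetical values picked for illustration, not numbers from the paper, and `score_entries` is a helper introduced here.

```python
def score_entries(num_queries: int, num_keys: int) -> int:
    """Number of entries in one attention score matrix (queries x keys)."""
    return num_queries * num_keys


# Hypothetical sizes, chosen only for illustration (not taken from the paper).
L_V = 24 * 8   # image tokens (H x W)
L_T = 64       # text tokens
s = 16         # atoms in the dictionary D
K = 6          # prototypes in P

# DGA: V and T each attend to the s atoms.
dga_entries = score_entries(L_V, s) + score_entries(L_T, s)
# PGU: the K prototypes attend to V_re and to T_re.
pgu_entries = score_entries(K, L_V) + score_entries(K, L_T)
lgur_entries = dga_entries + pgu_entries

# One image-to-text cross-attention layer: every image token attends to every text token.
cross_entries = score_entries(L_V, L_T)

print(lgur_entries, cross_entries)
```

Under these assumed sizes the DGA and PGU score matrices together are smaller than a single image-to-text score matrix. If, as the formulas suggest, $V_{re}$ and $T_{re}$ are computed independently per modality, that cost is also paid once per image or text rather than once per candidate pair.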

![image.png](attachment:d7b7d9cb-6c36-4075-8d3f-4e3ec1f22762.png)

- Lightweight architecture for textual / visual encoders

# Model Architecture / Method

## Architecture

__Image/Link to image: ...__

### Vision Encoder

![image.png](attachment:2de95f83-25f9-4f9f-a5f0-28456556eb7e.png)

- Two choices for the vision encoder:
    - ResNet50: 
        - Encodes image $I$ into a tensor H x W x C
    - DeiT Transformer: a Vision Transformer trained with knowledge distillation (in the original DeiT work the teacher is a convolutional network, not a Vision Transformer), used here as a lightweight backbone
        - Splits image $I$ into L = H x W patches
        - Encodes the sequence of patches into a sequence of vectors L x d (as a Vision Transformer does)
        - The sequence is then reshaped into H x W x d, a tensor with d channels
- The output of the vision encoder is a tensor H x W x C
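The reshape step for the Transformer branch can be sketched with NumPy. This is a minimal illustration, assuming the patch tokens are stored in row-major order and that any class or distillation tokens have already been removed; `tokens_to_feature_map` is a hypothetical helper, not a function from the LGUR repository.

```python
import numpy as np


def tokens_to_feature_map(tokens: np.ndarray, height: int, width: int) -> np.ndarray:
    """Reshape patch tokens (L, d) into a feature map (H, W, d), with L = H * W."""
    length, dim = tokens.shape
    if length != height * width:
        raise ValueError(f"expected {height * width} tokens, got {length}")
    # Row-major order: token index r * width + c lands at position (r, c).
    return tokens.reshape(height, width, dim)


# Toy example: L = 6 tokens of dimension d = 4, arranged as a 2 x 3 grid.
tokens = np.arange(6 * 4, dtype=np.float32).reshape(6, 4)
feature_map = tokens_to_feature_map(tokens, height=2, width=3)
print(feature_map.shape)
```

With this layout both encoders expose the same interface to the later modules: a tensor with spatial axes H and W and a channel axis.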

### Text Encoder

- A learnable bi-directional LSTM on top of a frozen BERT, which encodes the text $T$ into a sequence of vectors L x d 

![image.png](attachment:de36bb74-3a08-4294-890f-fa085ef5f8eb.png)

### DGA module

- Let:
    - $\mathbf{T} \in \mathbb{R}^{L_{T} \times d_{T}}, \mathbf{V} \in \mathbb{R}^{H_V \times W_V \times d_{V}}$ be the outputs of the encoders
    --> Input of the DGA module
    - __Multi-modality shared dictionary__ (MSD): 
        - $\textbf{D} \in \mathbb{R}^{s \times d}$, where $s$ is the number of atoms and $d$ is the dimension of each atom vector
        - $\textbf{D}$ is randomly initialized and treated as a sequence of atoms

- DGA consists of 2 modules: Visual Feature Reconstruction (VFR) and Text Feature Reconstruction (TFR)

![image.png](attachment:c93961a7-71d5-4479-8f15-dadf41d4372d.png)

- What is $\textbf{MHA}$? It is just a Transformer encoder block in which self-attention is replaced by cross-attention
$$MHA(Q, K, V) = FFN(MultiHead(Q, K, V))$$

__Text Feature Reconstruction__

- For the text representation: the input is $\mathbf{T}$, the output is $\textbf{T}_{re}$ together with the unchanged $\textbf{T}$
    $$\textbf{T}_{re} = MHA(\textbf{T},\textbf{D}, \textbf{D})$$
    ![image.png](attachment:6711bfd1-c506-4052-97a4-8002f1ecd5ab.png)
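The $MHA$ block and the text reconstruction $\textbf{T}_{re} = MHA(\textbf{T}, \textbf{D}, \textbf{D})$ can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: residual connections and layer normalization are omitted, the weights are random, and all names (`init_params`, `multi_head`, `ffn`, `mha`) and sizes are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def init_params(d, d_ff, rng):
    """Random weights for the attention projections and the feed-forward network."""
    scale = 1.0 / np.sqrt(d)
    params = {name: rng.normal(0.0, scale, (d, d)) for name in ("w_q", "w_k", "w_v", "w_o")}
    params["w_1"] = rng.normal(0.0, scale, (d, d_ff))
    params["w_2"] = rng.normal(0.0, 1.0 / np.sqrt(d_ff), (d_ff, d))
    return params


def multi_head(q, k, v, params, num_heads):
    """Multi-head cross-attention: q is (L_q, d), k and v are (L_k, d)."""
    l_q, d = q.shape
    if d % num_heads != 0:
        raise ValueError("d must be divisible by num_heads")
    d_head = d // num_heads

    def split(x):  # (L, d) -> (heads, L, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    qh, kh, vh = split(q @ params["w_q"]), split(k @ params["w_k"]), split(v @ params["w_v"])
    scores = qh @ kh.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, L_q, L_k)
    weights = softmax(scores, axis=-1)
    merged = (weights @ vh).transpose(1, 0, 2).reshape(l_q, d)
    return merged @ params["w_o"], weights


def ffn(x, params):
    """Position-wise feed-forward network with a ReLU."""
    return np.maximum(x @ params["w_1"], 0.0) @ params["w_2"]


def mha(q, k, v, params, num_heads=2):
    """MHA(Q, K, V) = FFN(MultiHead(Q, K, V)), without residuals or normalization."""
    attended, _ = multi_head(q, k, v, params, num_heads)
    return ffn(attended, params)


rng = np.random.default_rng(0)
d = 8
params = init_params(d, d_ff=16, rng=rng)
T = rng.normal(size=(5, d))   # L_T = 5 text tokens
D = rng.normal(size=(3, d))   # s = 3 atoms
T_re = mha(T, D, D, params)
print(T_re.shape)
```

The output keeps one row per text token, but each row is built only from the atoms in $\textbf{D}$, which is what places the reconstruction in the shared space.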

__Visual Feature Reconstruction__

- For visual representation:
    - Input is $\mathbf{V}$
    - Output is $\textbf{V}_{re}, \textbf{V}_{g}$
        - $\textbf{V}_{re}$ is the representation of $\textbf{V}$ in the new reconstruction space $\mathcal{D}$, guided by $\textbf{D}$
        - $\textbf{V}_{g}$ is the new representation of $\textbf{V}$ in the text space $\mathcal{T}$, guided by $\textbf{T}$
        
    <br>
    
    - A score mask $M$ with the same spatial resolution as $\mathbf{V}$ and a single channel is generated as a spatial attention mechanism 
        - It is produced by a 1x1 convolution with a single filter, followed by a sigmoid activation 
    ![image.png](attachment:18d5951d-01c2-4e43-b9e4-b01724072629.png)
    
    - $\mathbf{V}$ can be viewed either as a sequence of vectors L x C or as a tensor of feature maps H x W x C (with L = H x W)
    $$\mathbf{V}_{re} = MHA(\mathbf{V},\mathbf{D}, \mathbf{D}) \odot M$$ 
    <br>
    $$\mathbf{V}_{g} = MHA(\mathbf{V},\mathbf{T}, \mathbf{T}) \odot M$$ 


- In summary, the output of the DGA module is $\textbf{V}_{re}, \textbf{V}_{g}, \textbf{T}_{re}, \textbf{T} = DGA(\textbf{T}, \textbf{V})$:
    - $\textbf{V}_{re}, \textbf{V}_{g} = VFR(\textbf{V})$
    
    - $\textbf{T}_{re}, \textbf{T} = TFR(\textbf{T})$
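The visual branch above, including the spatial mask $M$, can be sketched as follows. Several simplifications are assumed and are not confirmed by the notes: the mask is computed from $\mathbf{V}$ itself, $MHA$ is replaced by a single-head attention without learned projections or FFN, and $\mathbf{V}$, $\mathbf{T}$ and $\mathbf{D}$ share one feature dimension. All function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    """Single-head scaled dot-product attention, a stand-in for MHA."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ values


def spatial_mask(feature_map, conv_weight, conv_bias):
    """A 1x1 convolution with one filter is a per-pixel dot product over channels."""
    logits = feature_map @ conv_weight + conv_bias   # (H, W)
    return sigmoid(logits)[..., None]                # (H, W, 1)


def vfr(V, D, T, conv_weight, conv_bias):
    """Return V_re, V_g and the mask M for a feature map V of shape (H, W, C)."""
    h, w, c = V.shape
    M = spatial_mask(V, conv_weight, conv_bias)      # assumption: mask is computed from V
    tokens = V.reshape(h * w, c)                     # view V as a sequence L x C
    V_re = attention(tokens, D, D).reshape(h, w, c) * M   # guided by the atoms D
    V_g = attention(tokens, T, T).reshape(h, w, c) * M    # guided by the text T
    return V_re, V_g, M


rng = np.random.default_rng(0)
V = rng.normal(size=(4, 2, 8))   # H = 4, W = 2, C = 8
D = rng.normal(size=(3, 8))      # s = 3 atoms
T = rng.normal(size=(5, 8))      # L_T = 5 text tokens
conv_weight = rng.normal(size=8)
V_re, V_g, M = vfr(V, D, T, conv_weight, conv_bias=0.0)
print(V_re.shape, V_g.shape, M.shape)
```

Because $M$ has a single channel, the element-wise product broadcasts it over all channels, so each spatial position is scaled by one score between 0 and 1.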

### PGU module

- The input of PGU is $\textbf{V}_{re}, \textbf{V}_{g}, \textbf{T}_{re}, \textbf{T}$, each treated as a sequence of vectors $L \times d_{in}$

- $\textbf{P} \in \mathbb{R}^{K \times d}$, where $K$ is the number of prototypes and $d$ is the dimension of each prototype vector (= $d_{in}$)
    - $P$ can be seen as $K$ query sequences, where sequence $i$ contains a single vector $\mathrm{p}_i$

- For convenience, let $\mathbf{F}$ denote any one of $\{ \textbf{V}_{re}, \textbf{V}_{g}, \textbf{T}_{re}, \textbf{T} \}$
    - Then:
    $$\widetilde{\mathbf{F}}=P G U(\mathbf{P}, \mathbf{F}) =\operatorname{Concat}\left(f_1\left(\mathrm{p}_1, \mathbf{F}\right), \ldots, f_K\left(\mathrm{p}_K, \mathbf{F}\right)\right) $$
    
    - What is $f_i$:
        - $MHA_2(Q, K, V) = FFN(MultiHead(Q, K, V))$ is just a Transformer encoder block (with cross-attention, as in DGA)
    
    
        - $f_i\left(\mathrm{p}_i, \mathbf{F}\right) =\mathbf{W}_i\left(MHA_2\left(\mathrm{p}_i, \mathbf{F}, \mathbf{F}\right)\right)$
    
    
        - $\mathbf{W}_i \in \mathbb{R}^{d_{out} \times d_{in}}$ is a separate projection matrix for each prototype query $\mathrm{p}_i$
        
        - $\mathrm{p}_i$ is a query sequence of length 1
        
        - $\widetilde{\mathbf{F}} \in \mathbb{R}^{K \times d_{out}}$

- In summary, the output of the PGU module is $\widetilde{\textbf{V}}_{re}, \widetilde{\textbf{V}}_{g}, \widetilde{\textbf{T}}_{re}, \widetilde{\textbf{T}}$:
    - $\widetilde{\mathbf{V}}_{re} = PGU(\mathbf{P}, \mathbf{V}_{re})$, ...
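The PGU computation can be sketched as below. It is a minimal illustration under stated assumptions: $MHA_2$ is reduced to a single-head attention without learned projections or FFN, the weights are random, and the names `attend` and `pgu` are introduced here rather than taken from the LGUR code.

```python
import numpy as np


def softmax(x):
    shifted = x - x.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def attend(query, keys, values):
    """One prototype (d_in,) attends over a sequence (L, d_in); stand-in for MHA_2."""
    scores = keys @ query / np.sqrt(query.shape[0])   # (L,)
    return softmax(scores) @ values                   # (d_in,)


def pgu(P, F, W):
    """P: (K, d_in) prototypes, F: (L, d_in) features, W: (K, d_out, d_in) projections."""
    # f_i(p_i, F) = W_i (attention(p_i, F, F)); the K results are stacked row by row.
    rows = [W[i] @ attend(P[i], F, F) for i in range(P.shape[0])]
    return np.stack(rows)                             # (K, d_out)


rng = np.random.default_rng(0)
K, d_in, d_out = 4, 8, 6
P = rng.normal(size=(K, d_in))
W = rng.normal(size=(K, d_out, d_in))

V_re = rng.normal(size=(12, d_in))   # 12 visual tokens
T_re = rng.normal(size=(5, d_in))    # 5 text tokens
V_tilde = pgu(P, V_re, W)
T_tilde = pgu(P, T_re, W)
print(V_tilde.shape, T_tilde.shape)
```

Whatever the input length $L$, the output always has $K$ rows, so image and text end up with representations of the same size that can be compared row by row.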

## Objective Functions

### ID loss

ID loss is applied 

# Training phase

## Dataset & Data Augmentation

## Implementation details

## Evaluation result

# Inference phase

# Conclusion

## New points in this paper

## Pro

## Cons

## How to improve?

# Demo in notebook

## Set up

### Define path

### Import libraries / local modules

### Load config

### Load model checkpoint

## Get and summarize model