# Visual Relationship Detection

In this tutorial, we focus on the task of classifying visual relationships. A given image may contain many such relationships, each defined formally as a `subject <predicate> object` triple (e.g. `person <riding> bike`).

Each relationship holds between a pair of objects in an image (e.g. "man riding bicycle"), where "man" and "bicycle" are the subject and object, respectively, and "riding" is the relationship predicate.

![Visual Relationships](https://cs.stanford.edu/people/ranjaykrishna/vrd/dataset.png)

For the purpose of this tutorial, we operate over the [Visual Relationship Detection (VRD) dataset](https://cs.stanford.edu/people/ranjaykrishna/vrd/) and focus on action relationships. We define our three-class classification task as **identifying whether a pair of bounding boxes represents a particular relationship.**

In the examples of the relationships shown below, the red box represents the _subject_ while the green box represents the _object_. The _predicate_ (e.g. kick) denotes what relationship connects the subject and the object.

In [1]:
import os

if os.path.basename(os.getcwd()) == "scene_graph":
    os.chdir("..")

## 1. Load Dataset
We load the VRD dataset and keep only images that contain at least one action predicate, since action predicates are more difficult to classify than geometric relationships like `above` or `next to`. We load the train, valid, and test sets as Pandas DataFrame objects with the following fields:
- `label`: The relationship between the objects. 0: `RIDE`, 1: `CARRY`, 2: `OTHER` action predicates
- `object_bbox`: coordinates of the bounding box for the object `[ymin, ymax, xmin, xmax]`
- `object_category`: category of the object
- `source_img`: filename for the corresponding image the relationship is in
- `subject_bbox`: coordinates of the bounding box for the subject `[ymin, ymax, xmin, xmax]`
- `subject_category`: category of the subject
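
To make the schema concrete, here is a minimal sketch of a single row, written as a plain dictionary rather than a real DataFrame row. The category names, filename, and coordinates are invented for illustration, and `bbox_height_width` is a hypothetical helper that is not part of `scene_graph.utils`.

```python
# Hypothetical example row; all values are invented for illustration only.
example_row = {
    "label": 0,  # 0: RIDE, 1: CARRY, 2: OTHER
    "object_bbox": [300, 500, 100, 400],   # [ymin, ymax, xmin, xmax]
    "object_category": "bike",
    "source_img": "example.jpg",
    "subject_bbox": [100, 450, 150, 350],  # [ymin, ymax, xmin, xmax]
    "subject_category": "person",
}


def bbox_height_width(bbox):
    """Return (height, width) of a box stored as [ymin, ymax, xmin, xmax]."""
    ymin, ymax, xmin, xmax = bbox
    return ymax - ymin, xmax - xmin


subject_hw = bbox_height_width(example_row["subject_bbox"])
object_hw = bbox_height_width(example_row["object_bbox"])
print(subject_hw, object_hw)
```

Because the two y values come first in this layout, any code that indexes into a box needs to use positions 0 and 1 for the vertical extent and positions 2 and 3 for the horizontal extent.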

Note that the `label` field of the training DataFrame contains only -1s, which denotes that this split has no labels. In this tutorial, we will assign probabilistic labels to the training set by writing labeling functions over attributes of the subjects and objects.

In [2]:
%load_ext autoreload
%autoreload 2

import numpy as np

In [3]:
from scene_graph.utils import load_vrd_data

train_df, valid_df, test_df = load_vrd_data()

print("Train Relationships: ", len(train_df))
print("Dev Relationships: ", len(valid_df))
print("Test Relationships: ", len(test_df))

Train Relationships:  26
Dev Relationships:  26
Test Relationships:  26


## 2. Writing Labeling Functions
We now write labeling functions to detect which relationship exists between a pair of bounding boxes. The labeling functions encode two kinds of intuition. _Categorical_ intuition is knowledge about the categories of subjects and objects usually involved in these relationships (e.g., `person` is usually the subject for predicates like `ride` and `carry`). _Spatial_ intuition is knowledge about the relative positions of the subject and object (e.g., the subject is usually higher than the object for the predicate `ride`).
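
As a rough illustration of the spatial intuition, the sketch below compares the vertical centers of two boxes. It assumes the usual image convention in which y grows downward, so "higher in the image" means a smaller y value; this section does not state the convention explicitly. `center_y` and `subject_is_higher` are hypothetical helpers and are not among the tutorial's labeling functions.

```python
def center_y(bbox):
    """Vertical center of a box stored as [ymin, ymax, xmin, xmax]."""
    ymin, ymax, _, _ = bbox
    return (ymin + ymax) / 2


def subject_is_higher(subject_bbox, object_bbox):
    # Assumption: y increases downward, so a smaller center_y is higher up.
    return center_y(subject_bbox) < center_y(object_bbox)


# Invented boxes: a rider positioned above a horse.
rider = [50, 300, 120, 260]
horse = [200, 480, 80, 340]

rider_above_horse = subject_is_higher(rider, horse)
horse_above_rider = subject_is_higher(horse, rider)
print(rider_above_horse, horse_above_rider)
```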

In [4]:
RIDE = 0
CARRY = 1
OTHER = 2
ABSTAIN = -1

We begin with labeling functions that encode categorical intuition: we use knowledge about subject-object category pairs that are common for `RIDE` and `CARRY`, as well as knowledge about which subjects or objects are unlikely to be involved in either relationship.

In [5]:
from snorkel.labeling.lf import labeling_function

# Category-based LFs
@labeling_function()
def LF_ride_object(x):
    if x.subject_category == "person":
        if x.object_category in ["bike", "snowboard", "motorcycle", "horse"]:
            return RIDE
    return ABSTAIN


@labeling_function()
def LF_ride_rare_object(x):
    if x.subject_category == "person":
        if x.object_category in ["bus", "truck", "elephant"]:
            return RIDE
    return ABSTAIN


@labeling_function()
def LF_carry_object(x):
    if x.subject_category == "person":
        if x.object_category in ["bag", "surfboard", "skis"]:
            return CARRY
    return ABSTAIN


@labeling_function()
def LF_carry_subject(x):
    if x.object_category == "person":
        if x.subject_category in ["chair", "bike", "snowboard", "motorcycle", "horse"]:
            return CARRY
    return ABSTAIN


@labeling_function()
def LF_person(x):
    if x.subject_category != "person":
        return OTHER
    return ABSTAIN
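
To see what a category-based rule returns on individual rows, the sketch below re-implements the logic of `LF_ride_object` and `LF_person` as plain functions and calls them on invented rows. It deliberately omits the `@labeling_function()` decorator so that it runs without Snorkel; the rows are hypothetical and use `types.SimpleNamespace` to mimic attribute access on a DataFrame row.

```python
from types import SimpleNamespace

RIDE, CARRY, OTHER, ABSTAIN = 0, 1, 2, -1


def ride_object_rule(x):
    # Same logic as LF_ride_object, without the Snorkel decorator.
    if x.subject_category == "person":
        if x.object_category in ["bike", "snowboard", "motorcycle", "horse"]:
            return RIDE
    return ABSTAIN


def person_rule(x):
    # Same logic as LF_person: a non-person subject votes OTHER.
    if x.subject_category != "person":
        return OTHER
    return ABSTAIN


# Invented rows for illustration.
rows = [
    SimpleNamespace(subject_category="person", object_category="horse"),
    SimpleNamespace(subject_category="person", object_category="table"),
    SimpleNamespace(subject_category="dog", object_category="bike"),
]

# One (ride_object_rule, person_rule) pair of votes per row.
votes = [(ride_object_rule(r), person_rule(r)) for r in rows]
print(votes)
```

The second row shows that both rules can abstain on the same example, which is why several complementary labeling functions are needed to cover the dataset.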

We now encode our spatial intuition, which includes measuring the distance between the bounding boxes and comparing their relative areas.

In [6]:
# Distance-based LFs
@labeling_function()
def LF_ydist(x):
    # Note: index 3 is xmax under the [ymin, ymax, xmin, xmax] layout listed above.
    if x.subject_bbox[3] < x.object_bbox[3]:
        return OTHER
    return ABSTAIN


@labeling_function()
def LF_dist(x):
    if np.linalg.norm(np.array(x.subject_bbox) - np.array(x.object_bbox)) <= 1000:
        return OTHER
    return ABSTAIN


# Size-based LF
@labeling_function()
def LF_area(x):
    subject_area = (x.subject_bbox[1] - x.subject_bbox[0]) * (
        x.subject_bbox[3] - x.subject_bbox[2]
    )
    object_area = (x.object_bbox[1] - x.object_bbox[0]) * (
        x.object_bbox[3] - x.object_bbox[2]
    )

    if subject_area / object_area <= 0.5:
        return OTHER
    return ABSTAIN
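
The two computations behind `LF_dist` and `LF_area` can be checked by hand on a pair of invented boxes. The sketch below uses only the standard library, computing the Euclidean norm directly instead of calling `np.linalg.norm`; the helper names and the box coordinates are hypothetical.

```python
import math


def bbox_area(bbox):
    """Area of a box stored as [ymin, ymax, xmin, xmax]."""
    ymin, ymax, xmin, xmax = bbox
    return (ymax - ymin) * (xmax - xmin)


def bbox_vector_distance(a, b):
    """Euclidean distance between two boxes treated as 4-vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Invented boxes for illustration.
subject_bbox = [100, 200, 100, 200]  # 100 x 100 box
object_bbox = [100, 300, 100, 300]   # 200 x 200 box

area_ratio = bbox_area(subject_bbox) / bbox_area(object_bbox)
distance = bbox_vector_distance(subject_bbox, object_bbox)

print(area_ratio)          # 10000 / 40000 = 0.25, below the 0.5 threshold
print(round(distance, 2))  # sqrt(100**2 + 100**2), about 141.42
```

Note that the distance is taken between the raw coordinate vectors, not between box centers, so it mixes differences in position with differences in size.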

Note that the labeling functions have varying empirical accuracies and coverages. Due to class imbalance in our chosen relationships, labeling functions that label the `OTHER` class have higher coverage than labeling functions for `RIDE` or `CARRY`. This reflects the distribution of classes in the dataset as well.

In [7]:
from snorkel.labeling.apply import PandasLFApplier

lfs = [
    LF_ride_object,
    LF_ride_rare_object,
    LF_carry_object,
    LF_carry_subject,
    LF_person,
    LF_ydist,
    LF_dist,
    LF_area,
]

applier = PandasLFApplier(lfs)
L_train = applier.apply(train_df)
L_valid = applier.apply(valid_df)


In [8]:
from snorkel.labeling.analysis import LFAnalysis

Y_valid = valid_df.label.values
LFAnalysis(L_valid, lfs).lf_summary(Y_valid)

| LF | j | Polarity | Coverage | Overlaps | Conflicts | Correct | Incorrect | Emp. Acc. |
|---|---|---|---|---|---|---|---|---|
| LF_ride_object | 0 | [0] | 0.230769 | 0.230769 | 0.230769 | 5 | 1 | 0.833333 |
| LF_ride_rare_object | 1 | [] | 0.0 | 0.0 | 0.0 | 0 | 0 | 0.0 |
| LF_carry_object | 2 | [1] | 0.076923 | 0.076923 | 0.076923 | 2 | 0 | 1.0 |
| LF_carry_subject | 3 | [1] | 0.038462 | 0.038462 | 0.038462 | 1 | 0 | 1.0 |
| LF_person | 4 | [2] | 0.307692 | 0.307692 | 0.038462 | 5 | 3 | 0.625 |
| LF_ydist | 5 | [2] | 0.576923 | 0.576923 | 0.307692 | 7 | 8 | 0.466667 |
| LF_dist | 6 | [2] | 1.0 | 0.846154 | 0.346154 | 13 | 6 | 0.5 |
| LF_area | 7 | [2] | 0.346154 | 0.346154 | 0.153846 | 5 | 4 | 0.555556 |
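
The Coverage, Overlaps, and Conflicts columns can be understood as fractions computed over the label matrix. The sketch below computes them for a small invented matrix under the following reading, which is an assumption about `LFAnalysis` rather than a copy of its code: coverage is the fraction of rows where the LF does not abstain, overlaps is the fraction where the LF and at least one other LF both vote, and conflicts is the fraction where they both vote and disagree. `lf_fractions` is a hypothetical helper.

```python
ABSTAIN = -1

# Invented label matrix: rows are data points, columns are three LFs.
L = [
    [0, -1, 2],
    [0, 0, -1],
    [-1, -1, 2],
    [-1, -1, -1],
]


def lf_fractions(L, j):
    """Return (coverage, overlaps, conflicts) for column j of L."""
    n = len(L)
    covered = overlapped = conflicted = 0
    for row in L:
        if row[j] == ABSTAIN:
            continue
        covered += 1
        # Votes cast by the other LFs on this row.
        others = [v for k, v in enumerate(row) if k != j and v != ABSTAIN]
        if others:
            overlapped += 1
        if any(v != row[j] for v in others):
            conflicted += 1
    return covered / n, overlapped / n, conflicted / n


summary = [lf_fractions(L, j) for j in range(3)]
for j, fractions in enumerate(summary):
    print(j, fractions)
```

In the summary above, the `LF_ride_object` row reports 5 correct and 1 incorrect votes, and 5 / 6 ≈ 0.833 matches its Emp. Acc. value.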


## 3. Train Label Model
We now train a multi-class `LabelModel` to assign training labels to the unlabeled training set.

In [9]:
from snorkel.labeling.model import LabelModel

label_model = LabelModel(cardinality=3, verbose=True)
label_model.fit(L_train, seed=123, lr=0.01, log_freq=10, n_epochs=100)

Computing O...
Estimating \mu...
[0 epochs]: TRAIN:[loss=1.612]
[10 epochs]: TRAIN:[loss=0.474]
[20 epochs]: TRAIN:[loss=0.248]
[30 epochs]: TRAIN:[loss=0.092]
[40 epochs]: TRAIN:[loss=0.084]
[50 epochs]: TRAIN:[loss=0.062]
[60 epochs]: TRAIN:[loss=0.049]
[70 epochs]: TRAIN:[loss=0.042]
[80 epochs]: TRAIN:[loss=0.033]
[90 epochs]: TRAIN:[loss=0.026]
Finished Training
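
For intuition about what it means to combine several LF votes into one label per example, the sketch below implements a plain majority vote over a small invented label matrix. This is a simpler baseline and not what `LabelModel` does; the function name and the tie-handling rule (abstain on ties) are choices made for this illustration only.

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(row):
    """Return the most common non-abstain vote, or ABSTAIN on ties or no votes."""
    votes = [v for v in row if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # tie between the two most frequent classes
    return counts[0][0]


# Invented label matrix: one row per example, one column per LF.
L_toy = [
    [0, -1, 2, 0],
    [-1, -1, -1, -1],
    [1, 2, -1, -1],
    [-1, 2, 2, -1],
]

toy_labels = [majority_vote(row) for row in L_toy]
print(toy_labels)
```

Majority vote treats every LF as equally reliable, which is a poor fit when empirical accuracies differ as much as they do in the summary table.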


We use the micro-averaged [F1](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) score for the multiclass setting, which calculates metrics globally by counting the total true positives, false negatives, and false positives.
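
The micro-averaging described above can be sketched in a few lines: true positives, false positives, and false negatives are summed over all classes before precision and recall are computed. The labels below are invented, and `micro_f1` is a hypothetical helper written for this illustration rather than the scorer Snorkel uses internally.

```python
def micro_f1(y_true, y_pred, classes):
    """Micro-averaged F1 from globally pooled TP, FP, and FN counts."""
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented gold and predicted labels over the classes 0, 1, 2.
y_true = [0, 0, 1, 2, 2, 2]
y_pred = [0, 2, 1, 2, 2, 0]

score = micro_f1(y_true, y_pred, classes=[0, 1, 2])
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, accuracy)
```

When every example receives exactly one predicted label, each error counts once as a false positive and once as a false negative, so micro-averaged F1 coincides with accuracy.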

In [10]:
label_model.score(L_valid, Y_valid, metrics=["f1_micro"])

{'f1_micro': 0.5769230769230769}

## 4. Train a Classifier
You can then use these training labels to train any standard discriminative model, such as [an off-the-shelf ResNet](https://github.com/KaimingHe/deep-residual-networks), which should learn to generalize beyond the LFs we've developed.

### Create DataLoaders for Classifier

In [11]:
from snorkel.classification.data import DictDataLoader
from scene_graph.model import FlatConcat, SceneGraphDataset, WordEmb, init_fc

# change to "scene_graph/data/VRD/sg_dataset/sg_train_images" for full set
TRAIN_DIR = "scene_graph/data/VRD/sg_dataset/samples"
train_df["labels"] = label_model.predict(L_train)

train_dl = DictDataLoader(
    SceneGraphDataset("train_dataset", "train", TRAIN_DIR, train_df),
    batch_size=16,
    shuffle=True,
)

valid_dl = DictDataLoader(
    SceneGraphDataset("valid_dataset", "valid", TRAIN_DIR, valid_df),
    batch_size=16,
    shuffle=False,
)

### Define Model Architecture

In [12]:
import torchvision.models as models
import torch.nn as nn

from functools import partial
from snorkel.classification.scorer import Scorer
from snorkel.classification.task import ce_loss, softmax
from snorkel.classification.task import Task


# initialize pretrained feature extractor
cnn = models.resnet18(pretrained=True)

# freeze the resnet weights
for param in cnn.parameters():
    param.requires_grad = False

# define input features
in_features = cnn.fc.in_features
feature_extractor = nn.Sequential(*list(cnn.children())[:-1])

# initialize FC layer: maps 3 sets of image features to class logits
WEMB_SIZE = 100
fc = nn.Linear(in_features * 3 + 2 * WEMB_SIZE, 3)
init_fc(fc)

# define layers
module_pool = nn.ModuleDict(
    {
        "feat_extractor": feature_extractor,
        "prediction_head": fc,
        "feat_concat": FlatConcat(),
        "word_emb": WordEmb(),
    }
)
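
The input size of the prediction head follows from simple arithmetic. The sketch below assumes that `cnn.fc.in_features` is 512 for torchvision's `resnet18`, and that the head receives three image feature vectors plus two word embeddings of size `WEMB_SIZE`, as the expression `in_features * 3 + 2 * WEMB_SIZE` suggests. Which image regions the three vectors come from is not stated in this section, and the constant names below are introduced for illustration.

```python
# Assumption: torchvision's resnet18 exposes fc.in_features == 512.
RESNET18_FEATURES = 512
WEMB_SIZE = 100  # word-embedding size, as in the cell above
NUM_IMAGE_FEATURE_SETS = 3
NUM_WORD_EMBEDDINGS = 2
NUM_CLASSES = 3  # RIDE, CARRY, OTHER

fc_input_dim = (
    RESNET18_FEATURES * NUM_IMAGE_FEATURE_SETS
    + NUM_WORD_EMBEDDINGS * WEMB_SIZE
)
# nn.Linear stores its weight as (out_features, in_features).
fc_weight_shape = (NUM_CLASSES, fc_input_dim)

print(fc_input_dim)     # 512 * 3 + 2 * 100 = 1736
print(fc_weight_shape)  # (3, 1736)
```

Because the ResNet weights are frozen, this single linear layer holds the only image-side parameters that are updated during training in this setup.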

In [13]:
from scene_graph.model import get_task_flow

# define task flow through modules
task_flow = get_task_flow()
pred_cls_task = Task(
    name="scene_graph_task",
    module_pool=module_pool,
    task_flow=task_flow,
    loss_func=partial(ce_loss, "head_op"),
    output_func=partial(softmax, "head_op"),
    scorer=Scorer(metrics=["f1_micro"]),
)

### Train and Evaluate Model

In [14]:
from snorkel.classification.snorkel_classifier import SnorkelClassifier
from snorkel.classification.training import Trainer

model = SnorkelClassifier([pred_cls_task])
trainer = Trainer(
    n_epochs=1,
    lr=1e-3,
    checkpointing=True,
    checkpointer_config={"checkpoint_dir": "checkpoint"},
)
trainer.fit(model, [train_dl])

Epoch 0:: 100%|██████████| 2/2 [00:02<00:00,  1.23s/it, model/all/train/loss=1.07, model/all/train/lr=0.001]




In [15]:
model.score([valid_dl])

{'scene_graph_task/valid_dataset/valid/f1_micro': 0.6153846153846154}