<a href="https://colab.research.google.com/github/xujinglin/TimeCycle/blob/master/TCC_Playground.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TCC Playground

Use this Colab to try the Temporal Cycle Consistency (TCC) algorithm on your own data. TCC is a self-supervised method that trains image/video encoders (such as ResNet) by looking for similarities across sets of images or frames of videos.

![alt text](https://1.bp.blogspot.com/-zhVafTWua44/XUtMTlA36sI/AAAAAAAAEZU/ZyREcK5HIpoup5XtFErogXz66XDyxKJBwCLcBGAs/s640/image2.gif)
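To make the idea of a "cycle" concrete, here is a minimal NumPy sketch, not the training code used later in this Colab. It takes a frame embedding from video U, finds its soft nearest neighbour in video V, and checks whether the nearest neighbour of that point back in U is the original frame. The function names, the toy 2-D embeddings, and the temperature value are assumptions chosen for illustration; the actual loss in this notebook also scales similarities by the embedding size.

```python
import numpy as np


def soft_nearest_neighbor(query, candidates, temperature=0.01):
    """Softmax-weighted average of candidates, weighted by closeness to query."""
    sq_dist = np.sum((candidates - query) ** 2, axis=1)
    logits = -sq_dist / temperature
    # Subtract the max before exponentiating for numerical stability.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ candidates


def cycle_back_index(embs_u, embs_v, i, temperature=0.01):
    """Start at frame i of U, hop to V, and return the index in U we land on."""
    v_tilde = soft_nearest_neighbor(embs_u[i], embs_v, temperature)
    return int(np.argmin(np.sum((embs_u - v_tilde) ** 2, axis=1)))


# Toy embeddings (made up): both "videos" trace the same curve, but V is
# sampled at a different rate than U.
t_u = np.linspace(0, 1, 5)
t_v = np.linspace(0, 1, 9)
embs_u = np.stack([t_u, t_u ** 2], axis=1)
embs_v = np.stack([t_v, t_v ** 2], axis=1)

# A frame is cycle-consistent if the cycle returns to where it started.
cycle_consistent = [
    cycle_back_index(embs_u, embs_v, i) == i for i in range(len(embs_u))
]
print(cycle_consistent)
```

Training with TCC pushes the encoder towards embeddings for which as many frames as possible pass this check.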

Once this encoder has been trained, it can be used for a number of downstream applications, two of which are included in this Colab. One task is to propagate temporal labels from a handful of videos to the entire dataset (few-shot learning). For example, you can mark a phase/segment (lifting the bottle up) of an action (pouring) in one video, and these labels can then be transferred to other videos using the TCC embeddings. Another task is synchronizing similar actions in videos. The following is an example result of aligning videos of people pitching a baseball using TCC.

![alt text](https://1.bp.blogspot.com/-x5R7gEPTyWE/XUtMYCn3API/AAAAAAAAEZ8/9WyGQMi2cOEtEqi6hwMN9TO-LSrAkG7_ACEwYBhgL/s640/image1.gif)
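Both downstream tasks reduce to comparing per-frame embeddings across videos. The sketch below is a simplified illustration under stated assumptions: it aligns two videos by plain nearest-neighbour lookup and transfers labels the same way. The helper names and the toy 1-D embeddings are hypothetical. The code later in this Colab fits an SVM for label propagation and can optionally use dynamic time warping for alignment, so treat this only as the simplest possible variant.

```python
import numpy as np


def align_by_nearest_neighbor(query_embs, candidate_embs):
    """For each query frame, return the index of the closest candidate frame."""
    # Pairwise squared distances, shape [num_query_frames, num_candidate_frames].
    diffs = query_embs[:, None, :] - candidate_embs[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    return np.argmin(sq_dists, axis=1)


def transfer_labels(labeled_embs, labels, target_embs):
    """Give each target frame the label of its nearest labeled frame."""
    nn_ids = align_by_nearest_neighbor(target_embs, labeled_embs)
    return np.asarray(labels)[nn_ids]


# Toy 1-D embeddings (made up): the embedding value grows as the action
# progresses through phases 0, 1 and 2.
labeled_embs = np.array([[0.0], [0.1], [0.5], [0.9], [1.0]])
labels = [0, 0, 1, 2, 2]
target_embs = np.array([[0.02], [0.45], [0.6], [0.97]])

aligned_frames = align_by_nearest_neighbor(target_embs, labeled_embs)
propagated = transfer_labels(labeled_embs, labels, target_embs)
print(aligned_frames, propagated)
```

Playing frame `aligned_frames[i]` of the labeled video next to frame `i` of the target video gives a synchronized side-by-side view.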

More details can be found in the [blogpost](https://ai.googleblog.com/2019/08/video-understanding-using-temporal.html) and [paper](https://arxiv.org/abs/1904.07846). If you find this useful, the full [codebase](https://github.com/google-research/google-research/tree/master/tcc) has a lot more functionality.


# Setup

Ensure you are running on a GPU Colab runtime. Run the following two cells, which set up all the code needed to run the playground.
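One way to confirm that a GPU is visible is sketched below. It assumes TensorFlow 2.1 or newer, where `tf.config.list_physical_devices` is available; older 2.x releases expose the same call under `tf.config.experimental`. The helper name `gpu_available` is made up for this example.

```python
def gpu_available():
    """Return True if TensorFlow can see at least one GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is preinstalled on Colab; elsewhere it may be missing.
        return False
    return len(tf.config.list_physical_devices('GPU')) > 0


print('GPU available:', gpu_available())
```

If this prints `False` on Colab, switch the runtime type to GPU in the Runtime menu before continuing.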

In [None]:
#@title Installs and imports
import cv2
import glob
import os
import math

from google.colab import drive

import matplotlib
from matplotlib.animation import FuncAnimation
import matplotlib.pyplot as plt

from IPython.display import HTML
from base64 import b64encode

import numpy as np
from sklearn.svm import SVC

import tensorflow as tf

!pip install dtw
from dtw import dtw

from tensorboard import notebook
%load_ext tensorboard

Collecting dtw
  Downloading https://files.pythonhosted.org/packages/66/a0/21d6ec377b8d5832218700e236205f8cdea38b3b2cdd0a732be170e2809b/dtw-1.4.0.tar.gz
Building wheels for collected packages: dtw
  Building wheel for dtw (setup.py) ... done
  Created wheel for dtw: filename=dtw-1.4.0-cp36-none-any.whl size=5315 sha256=fd10b690b9639d7918699a704de9b9d986a50c32ea729f51644fae2edb86671c
  Stored in directory: /root/.cache/pip/wheels/8c/8b/7a/947d67b53cd54948890a173527b0470ef56998812fc9d0a803
Successfully built dtw
Installing collected packages: dtw
Successfully installed dtw-1.4.0


In [None]:
#@title TCC Codebase.

# Data Loading Utils
def read_video(video_filename, width=224, height=224):
  cap = cv2.VideoCapture(video_filename)
  frames = []
  if cap.isOpened():
    while True:
      success, frame_bgr = cap.read()
      if not success:
        break
      frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
      frame_rgb = cv2.resize(frame_rgb, (width, height))
      frames.append(frame_rgb)
  frames = np.asarray(frames)
  return frames

def pad_zeros(frames, max_seq_len):
  """Pads a [T, H, W, C] frame array with all-zero frames up to max_seq_len."""
  npad = ((0, max_seq_len-len(frames)), (0, 0), (0, 0), (0, 0))
  frames = np.pad(frames, pad_width=npad, mode='constant', constant_values=0)
  return frames


def load_videos(path_to_raw_videos):
  drive.mount('/content/gdrive')
  video_filenames = sorted(glob.glob(os.path.join(path_to_raw_videos, '*.mp4')))
  print('Found %d videos to align.'%len(video_filenames))
  videos = []
  video_seq_lens = []
  for video_filename in video_filenames:
    frames = read_video(video_filename)
    videos.append(frames)
    video_seq_lens.append(len(frames))
  max_seq_len = max(video_seq_lens)
  videos = np.asarray([pad_zeros(x, max_seq_len) for x in videos])
  return videos, video_seq_lens


def play_video(video, video_seq_len):
  video = video[:video_seq_len]
  path_to_output_video = '/tmp/video.mp4'
  num_frames = len(video)
  fig, ax = plt.subplots(ncols=1, figsize=(5, 5), tight_layout=True)

  im0 = ax.imshow(unnorm(video[0]))
  def update(i):
    """Update plot with next frame."""
    im0.set_data(unnorm(video[i]))
    # Hide grid lines
    ax.grid(False)
    ax.set_title('Frame # %d'%i)
    # Hide axes ticks
    ax.set_xticks([])
    ax.set_yticks([])
    plt.tight_layout()

  anim = FuncAnimation(
      fig,
      update,
      frames=np.arange(num_frames),
      interval=200,
      blit=False)
  anim.save(path_to_output_video, dpi=80)
  plt.close()
  return show_video(path_to_output_video)


def viz_propagated_labels(video,
                          labels,
                          video_seq_len,
                          label_strings=None):
  video = video[:video_seq_len]
  path_to_output_video = '/tmp/labeled_video.mp4'
  if not label_strings:
    # Labels are 0-based, so the largest label needs its own string too.
    label_strings = [str(x) for x in range(np.max(labels) + 1)]
  num_frames = len(video)
  
  fig, ax = plt.subplots(ncols=1, figsize=(5, 5), tight_layout=True)

  im0 = ax.imshow(unnorm(video[0]))
  def update(i):
    """Update plot with next frame."""
    im0.set_data(unnorm(video[i]))
    # Hide grid lines
    ax.grid(False)
    ax.set_title('Label: %s'%label_strings[labels[i]])
    # Hide axes ticks
    ax.set_xticks([])
    ax.set_yticks([])
    plt.tight_layout()

  anim = FuncAnimation(
      fig,
      update,
      frames=np.arange(num_frames),
      interval=100,
      blit=False)
  anim.save(path_to_output_video, dpi=80)
  plt.close()
  return show_video(path_to_output_video)


def create_dataset(videos, seq_lens, batch_size, num_steps,
                   num_context_steps, context_stride): 
  ds = tf.data.Dataset.from_tensor_slices((videos, seq_lens))
  ds = ds.repeat()
  ds = ds.shuffle(len(videos))

  def sample_and_preprocess(video, seq_len):
    steps = tf.sort(tf.random.shuffle(tf.range(seq_len))[:num_steps])
    
    def get_context_steps(step):
      return tf.clip_by_value(
          tf.range(step - (num_context_steps - 1) * context_stride,
                   step + context_stride,
                   context_stride),
                   0, seq_len-1)

    steps_with_context = tf.reshape(
        tf.map_fn(get_context_steps, steps), [-1])
    frames = tf.gather(video, steps_with_context)
    frames = tf.cast(frames, tf.float32)
    frames = (frames/127.5) - 1.0
    frames = tf.image.resize(frames, (168, 168))
    return {'frames': frames,
            'seq_lens': seq_len,
            'steps': steps}

  ds = ds.map(sample_and_preprocess,
              num_parallel_calls=tf.data.experimental.AUTOTUNE)
  ds = ds.batch(batch_size)
  ds = ds.prefetch(1)
  return ds


# TCC Loss
def classification_loss(logits, labels, label_smoothing):
  """Loss function based on classifying the correct indices.
  In the paper, this is called Cycle-back Classification.
  Args:
    logits: Tensor, Pre-softmax scores used for classification loss. These are
      similarity scores after cycling back to the starting sequence.
    labels: Tensor, One hot labels containing the ground truth. The index where
      the cycle started is 1.
    label_smoothing: Float, label smoothing factor which can be used to
      determine how hard the alignment should be.
  Returns:
    loss: Tensor, A scalar classification loss calculated using standard softmax
      cross-entropy loss.
  """
  # Just to be safe, we stop gradients from labels as we are generating labels.
  labels = tf.stop_gradient(labels)
  return tf.reduce_mean(tf.keras.losses.categorical_crossentropy(
      y_true=labels, y_pred=logits, from_logits=True,
      label_smoothing=label_smoothing))


def regression_loss(logits, labels, num_steps, steps, seq_lens, loss_type,
                    normalize_indices, variance_lambda, huber_delta):
  """Loss function based on regressing to the correct indices.
  In the paper, this is called Cycle-back Regression. There are 3 variants
  of this loss:
  i) regression_mse: MSE of the predicted indices and ground truth indices.
  ii) regression_mse_var: MSE of the predicted indices that takes into account
  the variance of the similarities. This is important when the rate at which
  sequences go through different phases changes a lot. The variance scaling
  allows dynamic weighting of the MSE loss based on the similarities.
  iii) regression_huber: Huber loss between the predicted indices and ground
  truth indices.
  Args:
    logits: Tensor, Pre-softmax similarity scores after cycling back to the
      starting sequence.
    labels: Tensor, One hot labels containing the ground truth. The index where
      the cycle started is 1.
    num_steps: Integer, Number of steps in the sequence embeddings.
    steps: Tensor, step indices/frame indices of the embeddings of the shape
      [N, T] where N is the batch size, T is the number of the timesteps.
    seq_lens: Tensor, Lengths of the sequences from which the sampling was done.
      This can provide additional temporal information to the alignment loss.
    loss_type: String, This specifies the kind of regression loss function.
      Currently supported loss functions: regression_mse, regression_mse_var,
      regression_huber.
    normalize_indices: Boolean, If True, normalizes indices by sequence lengths.
      Useful for ensuring numerical instabilities don't arise as sequence
      indices can be large numbers.
    variance_lambda: Float, Weight of the variance of the similarity
      predictions while cycling back. If this is high then the low variance
      similarities are preferred by the loss while making this term low results
      in high variance of the similarities (more uniform/random matching).
    huber_delta: float, Huber delta described in tf.keras.losses.huber_loss.
  Returns:
     loss: Tensor, A scalar loss calculated using a variant of regression.
  """
  # Just to be safe, we stop gradients from labels as we are generating labels.
  labels = tf.stop_gradient(labels)
  steps = tf.stop_gradient(steps)

  if normalize_indices:
    float_seq_lens = tf.cast(seq_lens, tf.float32)
    tile_seq_lens = tf.tile(
        tf.expand_dims(float_seq_lens, axis=1), [1, num_steps])
    steps = tf.cast(steps, tf.float32) / tile_seq_lens
  else:
    steps = tf.cast(steps, tf.float32)

  beta = tf.nn.softmax(logits)
  true_time = tf.reduce_sum(steps * labels, axis=1)
  pred_time = tf.reduce_sum(steps * beta, axis=1)

  if loss_type in ['regression_mse', 'regression_mse_var']:
    if 'var' in loss_type:
      # Variance aware regression.
      pred_time_tiled = tf.tile(tf.expand_dims(pred_time, axis=1),
                                [1, num_steps])

      pred_time_variance = tf.reduce_sum(
          tf.square(steps - pred_time_tiled) * beta, axis=1)

      # Use the log of the variance as it is more numerically stable.
      pred_time_log_var = tf.math.log(pred_time_variance)
      squared_error = tf.square(true_time - pred_time)
      return tf.reduce_mean(tf.math.exp(-pred_time_log_var) * squared_error
                            + variance_lambda * pred_time_log_var)

    else:
      return tf.reduce_mean(
          tf.keras.losses.mean_squared_error(y_true=true_time,
                                             y_pred=pred_time))
  elif loss_type == 'regression_huber':
    return tf.reduce_mean(tf.keras.losses.huber_loss(
        y_true=true_time, y_pred=pred_time,
        delta=huber_delta))
  else:
    raise ValueError('Unsupported regression loss %s. Supported losses are: '
                     'regression_mse, regression_mse_var and regression_huber.'
                     % loss_type)
    
    
def pairwise_l2_distance(embs1, embs2):
  """Computes pairwise distances between all rows of embs1 and embs2."""
  norm1 = tf.reduce_sum(tf.square(embs1), 1)
  norm1 = tf.reshape(norm1, [-1, 1])
  norm2 = tf.reduce_sum(tf.square(embs2), 1)
  norm2 = tf.reshape(norm2, [1, -1])

  # Max to ensure matmul doesn't produce anything negative due to floating
  # point approximations.
  dist = tf.maximum(
      norm1 + norm2 - 2.0 * tf.matmul(embs1, embs2, False, True), 0.0)

  return dist


def get_scaled_similarity(embs1, embs2, similarity_type, temperature):
  """Returns similarity between all rows of embs1 and all rows of embs2.
  The similarity is scaled by the number of channels/embedding size and
  temperature.
  Args:
    embs1: Tensor, Embeddings of the shape [M, D] where M is the number of
      embeddings and D is the embedding size.
    embs2: Tensor, Embeddings of the shape [N, D] where N is the number of
      embeddings and D is the embedding size.
    similarity_type: String, Either one of 'l2' or 'cosine'.
    temperature: Float, Temperature used in scaling logits before softmax.
  Returns:
    similarity: Tensor, [M, N] tensor denoting similarity between embs1 and
      embs2.
  """
  channels = tf.cast(tf.shape(embs1)[1], tf.float32)
  # Go from embs1 to embs2.
  if similarity_type == 'cosine':
    similarity = tf.matmul(embs1, embs2, transpose_b=True)
  elif similarity_type == 'l2':
    similarity = -1.0 * pairwise_l2_distance(embs1, embs2)
  else:
    raise ValueError('similarity_type can either be l2 or cosine.')

  # Scale the distance by number of channels. This normalization helps with
  # optimization.
  similarity /= channels
  # Scale the distance by a temperature that helps with how soft/hard the
  # alignment should be.
  similarity /= temperature
  
  return similarity


def align_pair_of_sequences(embs1,
                            embs2,
                            similarity_type,
                            temperature):
  """Align a given pair of embedding sequences.
  Args:
    embs1: Tensor, Embeddings of the shape [M, D] where M is the number of
      embeddings and D is the embedding size.
    embs2: Tensor, Embeddings of the shape [N, D] where N is the number of
      embeddings and D is the embedding size.
    similarity_type: String, Either one of 'l2' or 'cosine'.
    temperature: Float, Temperature used in scaling logits before softmax.
  Returns:
    logits: Tensor, Pre-softmax similarity scores after cycling back to the
      starting sequence.
    labels: Tensor, One hot labels containing the ground truth. The index where
      the cycle started is 1.
  """
  max_num_steps = tf.shape(embs1)[0]

  # Find distances between embs1 and embs2.
  sim_12 = get_scaled_similarity(embs1, embs2, similarity_type, temperature)
  
  # Softmax the distance.
  softmaxed_sim_12 = tf.nn.softmax(sim_12, axis=1)
  # Calculate soft-nearest neighbors.
  
  nn_embs = tf.matmul(softmaxed_sim_12, embs2)
  # Find distances between nn_embs and embs1.
  sim_21 = get_scaled_similarity(nn_embs, embs1, similarity_type, temperature)
  logits = sim_21
  labels = tf.one_hot(tf.range(max_num_steps), max_num_steps)

  return logits, labels

def _align_single_cycle(cycle, embs, cycle_length, num_steps,
                        similarity_type, temperature):
  """Takes a single cycle and returns logits (similarity scores) and labels."""
  # Choose random frame.
  n_idx = tf.random.uniform((), minval=0, maxval=num_steps, dtype=tf.int32)
  # Create labels
  onehot_labels = tf.one_hot(n_idx, num_steps)

  # Choose query feats for first frame.
  query_feats = embs[cycle[0], n_idx:n_idx+1]

  num_channels = tf.shape(query_feats)[-1]
  for c in range(1, cycle_length+1):
    candidate_feats = embs[cycle[c]]

    if similarity_type == 'l2':
      # Find L2 distance.
      mean_squared_distance = tf.reduce_sum(
          tf.square(tf.tile(query_feats, [num_steps, 1])- candidate_feats), axis=1)
      # Convert L2 distance to similarity.
      similarity = -mean_squared_distance

    elif similarity_type == 'cosine':
      # Dot product of embeddings.
      similarity = tf.squeeze(tf.matmul(candidate_feats, query_feats,
                                        transpose_b=True))
    else:
      raise ValueError('similarity_type can either be l2 or cosine.')

    # Scale the distance by number of channels. This normalization helps with
    # optimization.
    similarity = tf.truediv(similarity,
                            tf.cast(num_channels, tf.float32))
    # Scale the distance by a temperature that helps with how soft/hard the
    # alignment should be.
    similarity = tf.truediv(similarity, temperature)

    beta = tf.nn.softmax(similarity)
    beta = tf.expand_dims(beta, axis=1)
    beta = tf.tile(beta, [1, num_channels])

    # Find weighted nearest neighbour.
    query_feats = tf.reduce_sum(beta * candidate_feats,
                                axis=0, keepdims=True)

  return similarity, onehot_labels


def _align(cycles, embs, num_steps, num_cycles, cycle_length,
           similarity_type, temperature):
  """Align by finding cycles in embs."""
  logits_list = []
  labels_list = []
  for i in range(num_cycles):
    logits, labels = _align_single_cycle(cycles[i],
                                         embs,
                                         cycle_length,
                                         num_steps,
                                         similarity_type,
                                         temperature)
    logits_list.append(logits)
    labels_list.append(labels)

  logits = tf.stack(logits_list)
  labels = tf.stack(labels_list)

  return logits, labels


def gen_cycles(num_cycles, batch_size, cycle_length=2):
  """Generates cycles for alignment.
  Generates a batch of indices to cycle over. For example setting num_cycles=2,
  batch_size=5, cycle_length=3 might return something like this:
  cycles = [[0, 3, 4, 0], [1, 2, 0, 3]]. This means we have 2 cycles for which
  the loss will be calculated. The first cycle starts at sequence 0 of the
  batch, then we find a matching step in sequence 3 of that batch, then we
  find matching step in sequence 4 and finally come back to sequence 0,
  completing a cycle.
  Args:
    num_cycles: Integer, Number of cycles that will be matched in one pass.
    batch_size: Integer, Number of sequences in one batch.
    cycle_length: Integer, Length of the cycles. If we are matching between
      2 sequences (cycle_length=2), we get cycles that look like [0,1,0].
      This means that we go from sequence 0 to sequence 1 then back to sequence
      0. A cycle length of 3 might look like [0, 1, 2, 0].
  Returns:
    cycles: Tensor, Batch indices denoting cycles that will be used for
      calculating the alignment loss.
  """
  sorted_idxes = tf.tile(tf.expand_dims(tf.range(batch_size), 0),
                         [num_cycles, 1])
  sorted_idxes = tf.reshape(sorted_idxes, [batch_size, num_cycles])
  cycles = tf.reshape(tf.random.shuffle(sorted_idxes),
                      [num_cycles, batch_size])
  cycles = cycles[:, :cycle_length]
  # Append the first index at the end to create cycle.
  cycles = tf.concat([cycles, cycles[:, 0:1]], axis=1)
  return cycles


def compute_stochastic_alignment_loss(embs,
                                      steps,
                                      seq_lens,
                                      num_steps,
                                      batch_size,
                                      loss_type,
                                      similarity_type,
                                      num_cycles,
                                      cycle_length,
                                      temperature,
                                      label_smoothing,
                                      variance_lambda,
                                      huber_delta,
                                      normalize_indices):
  """Compute cycle-consistency loss by stochastically sampling cycles.
  Args:
    embs: Tensor, sequential embeddings of the shape [N, T, D] where N is the
      batch size, T is the number of timesteps in the sequence, D is the size of
      the embeddings.
    steps: Tensor, step indices/frame indices of the embeddings of the shape
      [N, T] where N is the batch size, T is the number of the timesteps.
    seq_lens: Tensor, Lengths of the sequences from which the sampling was done.
      This can provide additional information to the alignment loss.
    num_steps: Integer/Tensor, Number of timesteps in the embeddings.
    batch_size: Integer/Tensor, Batch size.
    loss_type: String, This specifies the kind of loss function to use.
      Currently supported loss functions: 'classification', 'regression_mse',
      'regression_mse_var', 'regression_huber'.
    similarity_type: String, Currently supported similarity metrics: 'l2',
      'cosine'.
    num_cycles: Integer, number of cycles to match while aligning
      stochastically.  Only used in the stochastic version.
    cycle_length: Integer, Lengths of the cycle to use for matching. Only used
      in the stochastic version. By default, this is set to 2.
    temperature: Float, temperature scaling used to scale the similarity
      distributions calculated using the softmax function.
    label_smoothing: Float, Label smoothing argument used in
      tf.keras.losses.categorical_crossentropy function and described in this
      paper https://arxiv.org/pdf/1701.06548.pdf.
    variance_lambda: Float, Weight of the variance of the similarity
      predictions while cycling back. If this is high then the low variance
      similarities are preferred by the loss while making this term low results
      in high variance of the similarities (more uniform/random matching).
    huber_delta: float, Huber delta described in tf.keras.losses.huber_loss.
    normalize_indices: Boolean, If True, normalizes indices by sequence lengths.
      Useful for ensuring numerical instabilities don't arise as sequence
      indices can be large numbers.
  Returns:
    loss: Tensor, Scalar loss tensor that imposes the chosen variant of the
      cycle-consistency loss.
  """
  # Generate cycles.
  cycles = gen_cycles(num_cycles, batch_size, cycle_length)

  logits, labels = _align(cycles, embs, num_steps, num_cycles, cycle_length,
                          similarity_type, temperature)

  if loss_type == 'classification':
    loss = classification_loss(logits, labels, label_smoothing)
  elif 'regression' in loss_type:
    steps = tf.gather(steps, cycles[:, 0])
    seq_lens = tf.gather(seq_lens, cycles[:, 0])
    loss = regression_loss(logits, labels, num_steps, steps, seq_lens,
                           loss_type, normalize_indices, variance_lambda,
                           huber_delta)
  else:
    raise ValueError('Unidentified loss type %s. Currently supported loss '
                     'types are: regression_mse, regression_mse_var, '
                     'regression_huber, classification.'
                     % loss_type)
  return loss


def compute_deterministic_alignment_loss(embs,
                                         steps,
                                         seq_lens,
                                         num_steps,
                                         batch_size,
                                         loss_type,
                                         similarity_type,
                                         temperature,
                                         label_smoothing,
                                         variance_lambda,
                                         huber_delta,
                                         normalize_indices):
  """Compute cycle-consistency loss for all steps in each sequence.
  This aligns each pair of videos in the batch except with itself.
  When aligning it also matters which video is the starting video. So for N
  videos in the batch, we have N * (N-1) alignments happening.
  For example, a batch of size 3 has 6 pairs of sequence alignments.
  Args:
    embs: Tensor, sequential embeddings of the shape [N, T, D] where N is the
      batch size, T is the number of timesteps in the sequence, D is the size
      of the embeddings.
    steps: Tensor, step indices/frame indices of the embeddings of the shape
      [N, T] where N is the batch size, T is the number of the timesteps.
    seq_lens: Tensor, Lengths of the sequences from which the sampling was
      done. This can provide additional information to the alignment loss.
    num_steps: Integer/Tensor, Number of timesteps in the embeddings.
    batch_size: Integer, Size of the batch.
    loss_type: String, This specifies the kind of loss function to use.
      Currently supported loss functions: 'classification', 'regression_mse',
      'regression_mse_var', 'regression_huber'.
    similarity_type: String, Currently supported similarity metrics: 'l2',
      'cosine'.
    temperature: Float, temperature scaling used to scale the similarity
      distributions calculated using the softmax function.
    label_smoothing: Float, Label smoothing argument used in
      tf.keras.losses.categorical_crossentropy function and described in this
      paper https://arxiv.org/pdf/1701.06548.pdf.
    variance_lambda: Float, Weight of the variance of the similarity
      predictions while cycling back. If this is high then the low variance
      similarities are preferred by the loss while making this term low
      results in high variance of the similarities (more uniform/random
      matching).
    huber_delta: float, Huber delta described in tf.keras.losses.huber_loss.
    normalize_indices: Boolean, If True, normalizes indices by sequence
      lengths. Useful for ensuring numerical instabilities don't arise as
      sequence indices can be large numbers.
  Returns:
    loss: Tensor, Scalar loss tensor that imposes the chosen variant of the
      cycle-consistency loss.
  """
  labels_list = []
  logits_list = []
  steps_list = []
  seq_lens_list = []

  for i in range(batch_size):
    for j in range(batch_size):
      # We do not align the sequence with itself.
      if i != j:
        logits, labels = align_pair_of_sequences(embs[i],
                                                 embs[j],
                                                 similarity_type,
                                                 temperature)
        logits_list.append(logits)
        labels_list.append(labels)
        steps_list.append(tf.tile(steps[i:i+1], [num_steps, 1]))
        seq_lens_list.append(tf.tile(seq_lens[i:i+1], [num_steps]))

  logits = tf.concat(logits_list, axis=0)
  labels = tf.concat(labels_list, axis=0)
  steps = tf.concat(steps_list, axis=0)
  seq_lens = tf.concat(seq_lens_list, axis=0)

  if loss_type == 'classification':
    loss = classification_loss(logits, labels, label_smoothing)
  elif 'regression' in loss_type:
    loss = regression_loss(logits, labels, num_steps, steps, seq_lens,
                           loss_type, normalize_indices, variance_lambda,
                           huber_delta)
  else:
    raise ValueError('Unidentified loss_type %s. Currently supported loss '
                     'types are: regression_mse, regression_mse_var, '
                     'regression_huber, classification.' % loss_type)

  return loss

# Visualization code
def dist_fn(x, y):
  dist = np.sum((x-y)**2)
  return dist


def get_nn(embs, query_emb):
  dist = np.linalg.norm(embs - query_emb, axis=1)
  assert len(dist) == len(embs)
  return np.argmin(dist), np.min(dist)


def unnorm(query_frame):
  min_v = query_frame.min()
  max_v = query_frame.max()
  query_frame = (query_frame - min_v) / (max_v - min_v)
  return query_frame


def viz_align(query_feats, candidate_feats, use_dtw):
  """Align videos based on dynamic time warping."""
  if use_dtw:
    _, _, _, path = dtw(query_feats, candidate_feats, dist=dist_fn)
    _, uix = np.unique(path[0], return_index=True)
    nns = path[1][uix] 
  else:
    nns = []
    for i in range(len(query_feats)):
      nn_frame_id, _ = get_nn(candidate_feats, query_feats[i])
      nns.append(nn_frame_id)
  return nns

def show_video(video_path):
  # Read the file inside a context manager so the handle is closed promptly.
  with open(video_path, 'rb') as f:
    mp4 = f.read()
  data_url = "data:video/mp4;base64," + b64encode(mp4).decode()
  return HTML("""<video width=600 controls>
      <source src="%s" type="video/mp4"></video>
  """ % data_url)


def create_video(embs, frames, video_path, use_dtw, query=0):
  """Create aligned videos."""
  # Lay the videos out on a square grid. Only the first ncols * ncols videos
  # are shown.
  
  ncols = int(math.sqrt(len(embs)))
  fig, ax = plt.subplots(
      ncols=ncols,
      nrows=ncols,
      figsize=(5 * ncols, 5 * ncols),
      tight_layout=True)

  nns = []
  for candidate in range(len(embs)):
    nns.append(viz_align(embs[query], embs[candidate], use_dtw))
  ims = []

  def init():
    k = 0
    for k in range(ncols):
      for j in range(ncols):
        ims.append(ax[j][k].imshow(
            unnorm(frames[k * ncols + j][nns[k * ncols + j][0]])))
        ax[j][k].grid(False)
        ax[j][k].set_xticks([])
        ax[j][k].set_yticks([])
    return ims

  ims = init()

  def update(i):
    for k in range(ncols):
      for j in range(ncols):
        ims[k * ncols + j].set_data(
            unnorm(frames[k * ncols + j][nns[k * ncols + j][i]]))
    plt.tight_layout()
    return ims

  anim = FuncAnimation(
      fig,
      update,
      frames=np.arange(len(embs[query])),
      interval=100,
      blit=False)
  anim.save(video_path, dpi=40)
  plt.close()

def create_dynamic_video(embs, frames, video_path, use_dtw, query=0):
  """Create aligned videos."""
  fig, ax = plt.subplots(ncols=2, figsize=(10, 5), tight_layout=True)

  ax[0].set_title('Reference Frame')
  ax[1].set_title('Aligned Frame using TCC')
  nns = []
  for candidate in range(len(embs)):
    nns.append(viz_align(embs[query], embs[candidate], use_dtw))

  switch_video = max(1, len(embs[query])//len(embs))

  im0 = ax[0].imshow(unnorm(frames[0][0]))
  im1 = ax[1].imshow(unnorm(frames[1][nns[1][0]]))

  def update(i):
    """Update plot with next frame."""
    candidate = min(i // switch_video + 1,
                    len(embs)-1)

    im0.set_data(unnorm(frames[query][i]))
    im1.set_data(unnorm(frames[candidate][nns[candidate][i]]))
    # Hide grid lines
    ax[0].grid(False)
    ax[1].grid(False)

    # Hide axes ticks
    ax[0].set_xticks([])
    ax[1].set_xticks([])
    ax[0].set_yticks([])
    ax[1].set_yticks([])
    plt.tight_layout()

  anim = FuncAnimation(
      fig,
      update,
      frames=np.arange(len(embs[query])),
      interval=100,
      blit=False)
  anim.save(video_path, dpi=80)
  plt.close()


def viz_alignment(embs,
                  frames,
                  video_path,
                  grid_mode=True,
                  use_dtw=False,
                  query=0):
  """Visualize alignment."""

  if grid_mode:
    return create_video(
        embs,
        frames,
        video_path,
        use_dtw,
        query)
  else:
    return create_dynamic_video(
        embs,
        frames,
        video_path,
        use_dtw,
        query)

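def convert_label_list(label_list, max_seq_len):
  """Expands phase boundaries into one label per frame.

  label_list[k] is the last frame index that belongs to phase k, so
  label_list=[3, 7] labels frames 0-3 as phase 0 and frames 4-7 as phase 1.
  """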
  labels = []
  curr_label = 0
  for i in range(max_seq_len):
    if i > label_list[curr_label]:
      curr_label += 1
    labels.append(curr_label)
  return labels


def fit_svm_model(train_embs, train_labels):
  """Fit an SVM classifier."""
  svm_model = SVC(decision_function_shape='ovo', verbose=2)
  svm_model.fit(train_embs, train_labels)
  train_acc = svm_model.score(train_embs, train_labels)
  print('Label propagation model accuracy:', train_acc)
  print('If this is too low, propagation will not work properly.')
  return svm_model

# Propagate labels
def propagate_labels(embs, labels):
  train_embs = []
  train_labels = []
  for video_id in labels:
    max_frame_id = max(labels[video_id])
    train_embs.extend(embs[video_id][:max_frame_id])
    train_labels.extend(convert_label_list(labels[video_id],
                                           max_frame_id))
  model = fit_svm_model(train_embs, train_labels)

  propagated_labels = []
  for video_id in range(len(embs)):
    pred_labels = model.predict(embs[video_id])
    propagated_labels.append(pred_labels)
  return propagated_labels


# Embed videos using model
def get_embs(model, videos, video_seq_lens, frames_per_batch,
             num_context_steps, context_stride):
  tf.keras.backend.set_learning_phase(0)
  embs_list = []
  for video, seq_len in zip(videos, video_seq_lens):
    embs = []
    num_batches = int(np.ceil(float(seq_len)/frames_per_batch))
    for i in range(num_batches):
      steps = np.arange(i*frames_per_batch, (i+1)*frames_per_batch)
      steps = np.clip(steps, 0, seq_len-1)
      def get_context_steps(step):
        return tf.clip_by_value(
          tf.range(step - (num_context_steps - 1) * context_stride,
                   step + context_stride,
                   context_stride),
                   0, seq_len-1)
      steps_with_context = tf.reshape(
        tf.map_fn(get_context_steps, steps), [-1])
      frames = tf.gather(video, steps_with_context)
      frames = tf.cast(frames, tf.float32)
      frames = (frames/127.5)-1.0
      frames = tf.image.resize(frames, (168, 168)) 
      frames = tf.expand_dims(frames, 0) 
      embs.extend(model(frames, training=False).numpy()[0])
    embs = embs[:seq_len]
    assert len(embs) == seq_len
    embs = np.asarray(embs)
    embs_list.append(embs)
  return embs_list


# Embedding Model
class Embedder(tf.keras.Model):
  def __init__(self, embedding_size, normalize_embeddings,
               num_context_steps):
    super().__init__()

    base_model = tf.keras.applications.resnet_v2.ResNet50V2(include_top=False,
                                        weights='imagenet',
                                        pooling='max')
    layer = 'conv4_block3_out'
    self.num_context_steps = num_context_steps
    self.base_model = tf.keras.Model(
        inputs=base_model.input,
        outputs=base_model.get_layer(layer).output)
    self.conv_layers = [tf.keras.layers.Conv3D(256, 3, padding='same')
                        for _ in range(2)]
    self.bn_layers = [tf.keras.layers.BatchNormalization()
                        for _ in range(2)]

    self.fc_layers = [tf.keras.layers.Dense(256,
                                            activation=tf.nn.relu) for _ in range(2)]
    
    self.embedding_layer = tf.keras.layers.Dense(embedding_size)
    self.normalize_embeddings = normalize_embeddings
    self.dropout = tf.keras.layers.Dropout(0.1)
  
  def call(self, frames, training):
    batch_size, _, h,  w, c = frames.shape
    frames = tf.reshape(frames,[-1, h, w, c])

    x = self.base_model(frames , training=training)
    _, h,  w, c = x.shape
    x = tf.reshape(x, [-1, self.num_context_steps, h, w, c])

    x = self.dropout(x)

    for conv_layer, bn_layer in zip(self.conv_layers,
                                    self.bn_layers):
      x = conv_layer(x)
      x = bn_layer(x)
      x = tf.nn.relu(x)
             
    x = tf.reduce_max(x, [1, 2, 3])

    _, c = x.shape
    x = tf.reshape(x, [batch_size, -1, c]) 
    
    for fc_layer in self.fc_layers:
      x = self.dropout(x)
      x = fc_layer(x)

    x = self.embedding_layer(x)
    
    if self.normalize_embeddings:
      x = tf.nn.l2_normalize(x, axis=-1)
    
    return x  

# Training an Encoder with TCC 

## Load data from your Google Drive

If you want to run TCC on your own dataset, put the videos in a GDrive
folder and update PATH_TO_RAW_VIDEOS. Avoid using too many or too large videos, since all data is loaded into memory at once.
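
To get a rough sense of what fits in memory, the footprint can be estimated before loading anything. The sketch below is only back-of-the-envelope arithmetic: `estimate_video_memory_gib` is a hypothetical helper, and the frame count, the 224x224 resolution, and the `uint8` storage are assumptions, not values taken from `load_videos`.

```python
def estimate_video_memory_gib(num_videos, frames_per_video, height, width,
                              channels=3, bytes_per_value=1):
    """Estimate memory (in GiB) needed to hold all decoded frames at once."""
    total_bytes = (num_videos * frames_per_video * height * width
                   * channels * bytes_per_value)
    return total_bytes / 1024 ** 3


# Hypothetical dataset: 70 videos of 300 frames each, stored as uint8 RGB.
gib = estimate_video_memory_gib(70, 300, 224, 224)
print(f"{gib:.2f} GiB")  # about 2.94 GiB
```

If the frames were held as `float32` instead, the same dataset would need four times as much memory (`bytes_per_value=4`).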

If you want to run TCC on the demo pouring videos, download https://drive.google.com/file/d/1GVyv1oPv7-a08zKx_aANikgtaSJrSvEt/view?usp=sharing, unzip it, upload the videos to a GDrive folder, and update PATH_TO_RAW_VIDEOS.

You will be asked to give authorization to run this code on data on your GDrive. Paste the authorization code into the cell below and press Enter.

In [None]:
PATH_TO_RAW_VIDEOS = '/content/gdrive/My Drive/Colab Notebooks/data/'
videos, video_seq_lens = load_videos(PATH_TO_RAW_VIDEOS)

Mounted at /content/gdrive
Found 70 videos to align.


## Set Hyperparams

In [None]:
##@title 
BATCH_SIZE = 6 #@param {type:"integer"}
NUM_STEPS = 32 #@param {type:"integer"}
NUM_CONTEXT_STEPS = 2 #@param {type:"integer"}
CONTEXT_STRIDE = 15 #@param {type:"integer"}

LOSS_TYPE = 'regression_mse_var' #@param ["regression_mse_var", "regression_mse", "regression_huber", "classification"]
STOCHASTIC_MATCHING = False #@param ["False", "True"] {type:"raw"}
SIMILARITY_TYPE = 'l2' #@param ["l2", "cosine"]
EMBEDDING_SIZE =  128 #@param {type:"integer"}
TEMPERATURE = 0.1 #@param {type:"number"}
LABEL_SMOOTHING = 0.0 #@param {type:"slider", min:0, max:1, step:0.05}                                   
VARIANCE_LAMBDA = 0.001 #@param {type:"number"}                                       
HUBER_DELTA = 0.1 #@param {type:"number"}                                        
NORMALIZE_INDICES = True #@param ["False", "True"] {type:"raw"}
NORMALIZE_EMBEDDINGS = False #@param ["False", "True"] {type:"raw"}

CYCLE_LENGTH = 2 #@param {type:"integer"}
NUM_CYCLES = 32 #@param {type:"integer"}

LEARNING_RATE = 1e-4 #@param {type:"number"}
DEBUG = False #@param ["False", "True"] {type:"raw"}

## Create Model and Training Loop based on Hyperparams
After setting the hyperparams above, initialize the data loader, model, optimizer, and training loop.

In [None]:
LOGDIR = '/content/gdrive/My Drive/Colab Notebooks/tcc/'

# Uncomment the next line to clear LOGDIR. Be careful: this removes files from
# your Google Drive.
# %rm -r "$LOGDIR"
# If LOGDIR is on Drive, old checkpoints deleted by the training loop end up in
# the Google Drive trash. You may need to empty it to free up space.

train_ds = create_dataset(videos, video_seq_lens,
                        batch_size=BATCH_SIZE,
                        num_steps=NUM_STEPS,
                        num_context_steps=NUM_CONTEXT_STEPS,
                        context_stride=CONTEXT_STRIDE)
model = Embedder(EMBEDDING_SIZE, NORMALIZE_EMBEDDINGS, NUM_CONTEXT_STEPS)
optimizer = tf.keras.optimizers.Adam(LEARNING_RATE) 
ckpt = tf.train.Checkpoint(optimizer=optimizer, model=model)
manager = tf.train.CheckpointManager(ckpt, LOGDIR, max_to_keep=3)
summary_writer = tf.summary.create_file_writer(LOGDIR, flush_millis=1000)

@tf.function
def train_one_iter(data):
  frames = data['frames']
  steps = data['steps']
  seq_lens = data['seq_lens']
  with tf.GradientTape() as tape:
    embs = model(frames, training=True)
    trainable_variables = model.trainable_variables
    if STOCHASTIC_MATCHING:
      loss = compute_stochastic_alignment_loss(embs,
                                        steps,
                                        seq_lens,
                                        num_cycles=NUM_CYCLES,
                                        cycle_length=CYCLE_LENGTH,
                                        num_steps=NUM_STEPS,
                                        batch_size=BATCH_SIZE,
                                        loss_type=LOSS_TYPE,
                                        similarity_type=SIMILARITY_TYPE,
                                        temperature=TEMPERATURE,
                                        label_smoothing=LABEL_SMOOTHING,
                                        variance_lambda=VARIANCE_LAMBDA,
                                        huber_delta=HUBER_DELTA,
                                        normalize_indices=NORMALIZE_INDICES)
    else:
      loss = compute_deterministic_alignment_loss(embs,
                                        steps,
                                        seq_lens,
                                        num_steps=NUM_STEPS,
                                        batch_size=BATCH_SIZE,
                                        loss_type=LOSS_TYPE,
                                        similarity_type=SIMILARITY_TYPE,
                                        temperature=TEMPERATURE,
                                        label_smoothing=LABEL_SMOOTHING,
                                        variance_lambda=VARIANCE_LAMBDA,
                                        huber_delta=HUBER_DELTA,
                                        normalize_indices=NORMALIZE_INDICES)
    # Add regularization losses.
    if model.losses:
      loss += tf.add_n(model.losses)

  grads = tape.gradient(loss, trainable_variables)
  optimizer.apply_gradients(zip(grads, trainable_variables))

  with tf.summary.record_if(tf.math.equal(
      tf.math.mod(optimizer.iterations, 10), 0)):
    tf.summary.scalar('loss', loss, optimizer.iterations)
    if DEBUG:
      for var_ in model.variables:
        tf.summary.histogram(var_.name, var_, step=optimizer.iterations)
      _, n, h, w, c = frames.shape
      tf.summary.image('frames', tf.cast(
        255.0*(tf.squeeze(tf.concat(tf.split(frames, n, axis=1),
                                    axis=3), axis=1)+1.0)/2.0, tf.uint8),
                      optimizer.iterations)

  return loss

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50v2_weights_tf_dim_ordering_tf_kernels_notop.h5


## Train the model 

In [None]:
%tensorboard --logdir "$LOGDIR"

NUM_TRAINING_STEPS = 10000
SAVE_CKPT_STEPS = 500

# Uncomment this to load previous checkpoint.
# ckpt.restore(manager.latest_checkpoint)

tf.keras.backend.set_learning_phase(1)
i = 0
with summary_writer.as_default():
  for data in train_ds.take(NUM_TRAINING_STEPS):
    loss = train_one_iter(data)
    if i % SAVE_CKPT_STEPS == 0:
      manager.save()
    i += 1


In [None]:
# Save the model if you interrupted or finished training.
manager.save()

'/content/gdrive/My Drive/Colab Notebooks/tcc/ckpt-16'

# Applications using encoder trained with TCC

## Extract Per-frame Embeddings
Use the trained encoder to extract per-frame embeddings. 
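
The indexing inside `get_embs` can be easier to follow in isolation. The sketch below re-creates, in plain Python, which frame indices are gathered as context for one target frame and how a video is split into batches. The helper names `context_steps` and `batched_steps` are made up for this illustration; the index arithmetic is meant to mirror `get_context_steps` and the clipping in `get_embs`, not to replace them.

```python
import math


def context_steps(step, seq_len, num_context_steps=2, context_stride=15):
    """Frame indices fed to the encoder for a single target frame."""
    start = step - (num_context_steps - 1) * context_stride
    # The range ends at the target frame itself; earlier entries are context.
    raw = range(start, step + context_stride, context_stride)
    # Clip to valid indices, as tf.clip_by_value does in get_embs.
    return [min(max(s, 0), seq_len - 1) for s in raw]


def batched_steps(seq_len, frames_per_batch):
    """Target frame indices per batch; the last batch repeats the final frame."""
    num_batches = math.ceil(seq_len / frames_per_batch)
    return [[min(s, seq_len - 1)
             for s in range(i * frames_per_batch, (i + 1) * frames_per_batch)]
            for i in range(num_batches)]


print(context_steps(40, seq_len=100))  # [25, 40]
print(context_steps(5, seq_len=100))   # [0, 5], clipped at the first frame
print(batched_steps(5, 3))             # [[0, 1, 2], [3, 4, 4]]
```

The repeated frame at the end of the last batch is why `get_embs` truncates the collected embeddings to `seq_len` afterwards.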

In [None]:
FRAMES_PER_BATCH = 160 # Change if you have more GPU memory.
embs = get_embs(model, videos, video_seq_lens,
                frames_per_batch=FRAMES_PER_BATCH, 
                num_context_steps=NUM_CONTEXT_STEPS,
                context_stride=CONTEXT_STRIDE)

# Save the embeddings so that you don't have to use a GPU for later experiments.
# Videos can differ in length, so store them as an object array.
PATH_TO_EMBS = os.path.join(LOGDIR, 'embeddings.npy')
np.save(PATH_TO_EMBS, np.asarray(embs, dtype=object))

# Load previously saved embeddings in case you have them stored.
# embs = np.load(PATH_TO_EMBS, allow_pickle=True)

## Align/Synchronize actions in the videos

One application of TCC embeddings is synchronizing different videos. The first video in the output is the reference video; each frame shown from the other videos is the nearest neighbor of the current reference frame in the embedding space. Set USE_DTW to True to use Dynamic Time Warping (DTW) instead of nearest neighbors. While DTW might result in smoother videos, the alignment video without DTW is more representative of how good or bad the embeddings are by themselves. Setting GRID_MODE to True creates a grid of many videos aligned together; with GRID_MODE set to False you get the reference frame on the left and the aligned frame from another video on the right.
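
The difference between the two alignment modes can be illustrated on toy data. The sketch below is dependency-free and is not the code used by `viz_alignment`: `nearest_neighbor_alignment` and `dtw_alignment` are hypothetical helpers, the embeddings are made-up 1-D values, and the DTW variant is the textbook dynamic program with squared Euclidean cost, which may differ from the DTW routine the notebook calls.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_alignment(query, candidate):
    """For each query frame, the index of the closest candidate frame."""
    return [min(range(len(candidate)), key=lambda j: sq_dist(q, candidate[j]))
            for q in query]


def dtw_alignment(query, candidate):
    """Monotonic alignment: one candidate index per query frame via DTW."""
    n, m = len(query), len(candidate)
    inf = float('inf')
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = sq_dist(query[i - 1], candidate[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j],
                                 cost[i][j - 1])
    # Backtrack from the end; keep the first match seen for each query frame.
    match = [None] * n
    i, j = n, m
    while i > 0 and j > 0:
        if match[i - 1] is None:
            match[i - 1] = j - 1
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    return match


query = [[0.0], [1.0], [2.0], [3.0]]
candidate = [[0.0], [1.0], [3.1], [2.0], [3.2]]
print(nearest_neighbor_alignment(query, candidate))  # [0, 1, 3, 2]
print(dtw_alignment(query, candidate))               # [0, 1, 3, 4]
```

In this toy case the nearest-neighbor result jumps backwards in time (frame 3, then frame 2), which is the kind of jitter visible in videos rendered without DTW. DTW keeps the matches in temporal order, at the cost of hiding such embedding errors.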

In [None]:
OUTPUT_PATH = '/content/gdrive/My Drive/Colab Notebooks/tcc/output.mp4'
NUM_VIDEOS = 18
GRID_MODE = False
USE_DTW = True
viz_alignment(embs[:NUM_VIDEOS],
              videos[:NUM_VIDEOS],
              OUTPUT_PATH,
              grid_mode=GRID_MODE,
              use_dtw=USE_DTW)
show_video(OUTPUT_PATH)

## Few-shot learning to propagate phase/segment labels

Since the model finds similarities across videos, we can manually label the segments for a handful of videos (as few as one) and let TCC automatically propagate these labels across different videos. Potentially these labels can be used to create datasets for other downstream tasks.

![alt text](https://temporal-cycle-consistency.github.io/assets/fig/annotation.png)

We provide a simple utility to label videos manually. If you play the video using the `play_video` function, it shows the frame number. You can use that to provide label segments for some videos in the following format. 

`labels = {
  video_id_0: [Frame # when segment 0 ends,
              Frame # when segment 1 ends, ... ],
  video_id_1: [Frame # when segment 0 ends,
              Frame # when segment 1 ends, ... ],
              }`
              
where `video_id_x` corresponds to the index in the videos array.

Maintain a labels dictionary in the above format and add labels as you play multiple videos.
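
To check that a label entry means what you intend, it can help to expand it into one label per frame. The sketch below does this with the standard library; `segment_ends_to_frame_labels` is a hypothetical helper written for illustration, intended to give the same result as `convert_label_list` when the end frames are strictly increasing.

```python
from bisect import bisect_left
from collections import Counter


def segment_ends_to_frame_labels(segment_ends):
    """Expand [end of segment 0, end of segment 1, ...] to per-frame labels."""
    num_frames = max(segment_ends)
    # Frame i belongs to the first segment whose end frame is >= i.
    return [bisect_left(segment_ends, i) for i in range(num_frames)]


print(segment_ends_to_frame_labels([3, 5, 8]))  # [0, 0, 0, 0, 1, 1, 2, 2]

# Frames per segment for an entry such as {0: [35, 64, 131, 145, 186]}.
frame_labels = segment_ends_to_frame_labels([35, 64, 131, 145, 186])
print(Counter(frame_labels))
```

A segment with very few frames in this count usually points to a mistyped frame number.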


In [None]:
VIDEO_ID = 67
play_video(videos[VIDEO_ID], video_seq_lens[VIDEO_ID])

Add labels to a label dict as the video plays above:

In [None]:
labels = {0: [35, 64, 131, 145, 186],
          67: [46, 80, 180, 210, 255]}

Optionally, create human-readable names for the class labels. For example:

`label_strings = ['Hand Reaching', 'Lifting Bottle', 'Pouring Liquid', 'Placing Bottle', 'Hand Receding']`


In [None]:
label_strings = ['Hand Reaching', 'Lifting Bottle', 'Pouring Liquid', 'Placing Bottle', 'Hand Receding']


Propagate labels from the few manually labeled videos to the entire dataset. `propagated_labels` is a list of per-frame labels for each video. 

In [None]:
propagated_labels = propagate_labels(embs, labels)

[LibSVM]Label propagation model accuracy: 0.9682539682539683
If this is too low, propagation will not work properly.


Visualize the propagated labels to get a sense of how well the label propagation worked.
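
Besides watching the video, the per-frame predictions can be summarized as contiguous segments, which makes label flicker easy to spot. The sketch below is one possible way to do that; `frame_labels_to_segments` is a hypothetical helper and the label sequence is made up, not actual output of `propagate_labels`.

```python
from itertools import groupby


def frame_labels_to_segments(frame_labels, label_strings=None):
    """Collapse per-frame labels into (label, first_frame, last_frame) runs."""
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        name = label_strings[label] if label_strings else label
        segments.append((name, start, start + length - 1))
        start += length
    return segments


# Made-up predictions for an 8-frame video.
toy_labels = [0, 0, 0, 1, 1, 0, 2, 2]
print(frame_labels_to_segments(toy_labels))
# [(0, 0, 2), (1, 3, 4), (0, 5, 5), (2, 6, 7)]
```

The one-frame run `(0, 5, 5)` between segments 1 and 2 is the kind of isolated misprediction that suggests the embeddings or the manual labels need another look.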

In [None]:
VIDEO_ID = np.random.choice(len(videos))
print('Visualizing video %d'%VIDEO_ID)
viz_propagated_labels(videos[VIDEO_ID],
                      propagated_labels[VIDEO_ID],
                      video_seq_lens[VIDEO_ID],
                      label_strings)

Visualizing video 49


# Citation

If you found our paper/code useful in your research, consider citing our paper:


```
@InProceedings{Dwibedi_2019_CVPR,
author = {Dwibedi, Debidatta and Aytar, Yusuf and Tompson, Jonathan and Sermanet, Pierre and Zisserman, Andrew},
title = {Temporal Cycle-Consistency Learning},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2019},
}
```

