<a href="https://colab.research.google.com/github/rahiakela/tensorflow-computer-vision-cookbook/blob/main/10-applying-deep-learning-to-video/01_detecting_emotions_in_real_time.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Detecting emotions in real time

Computer vision is concerned with understanding visual data, and that includes video. At its core, a video is a sequence of images, so most of what we know about deep learning for image processing carries over to video with great results.

In this recipe, we'll start by training a convolutional neural network to detect emotions in human faces, and then we'll learn how to apply it in real time using our webcam.

Then, in the remaining recipes, we'll use advanced implementations of architectures hosted on TensorFlow Hub (TFHub), tailored to tackle interesting video-related problems such as action recognition, frame generation, and text-to-video retrieval.

In its most basic form, a video is just a series of images. By leveraging this seemingly trivial fact, we can adapt what we know about image classification to create very interesting video processing pipelines powered by deep learning.

In this recipe, we'll build an algorithm to detect emotions in real time (webcam streaming) or from video files. Pretty interesting, right?
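To make the "a video is a series of images" idea concrete, here is a minimal sketch of a frame-by-frame pipeline. The names `classify_frames` and `toy_classifier` are hypothetical, and the frames are tiny nested lists rather than real images; in the recipe itself the frames would come from a webcam or a video file, and the classifier would be the trained network.

```python
def classify_frames(frames, classify_frame):
    """Apply an image classifier to each frame of a video, in order."""
    for index, frame in enumerate(frames):
        yield index, classify_frame(frame)


def toy_classifier(frame):
    """Stand-in for a real model: label a frame by its mean brightness."""
    pixels = [value for row in frame for value in row]
    mean = sum(pixels) / len(pixels)
    return "bright" if mean >= 128 else "dark"


# A fake three-frame "video" made of 2x2 grayscale images (assumed data).
video = [
    [[0, 10], [20, 30]],
    [[200, 210], [220, 230]],
    [[100, 120], [140, 160]],
]

predictions = list(classify_frames(video, toy_classifier))
print(predictions)
```

Because the per-frame classifier is passed in as an argument, the same loop works whether the frames are read from a file or captured live.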

## Setup

Let's download the Facial Expression Recognition Challenge dataset from [Kaggle](https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge):

In [None]:
from google.colab import files
files.upload() # upload kaggle.json file

In [None]:
%%shell

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
ls ~/.kaggle
chmod 600 ~/.kaggle/kaggle.json

# download dataset from kaggle
kaggle competitions download -c challenges-in-representation-learning-facial-expression-recognition-challenge

# unzip files
unzip -qq train.csv.zip
unzip -qq test.csv.zip
unzip -qq icml_face_data.csv.zip
gzip -d fer2013.tar.gz

rm -rf train.csv.zip
rm -rf test.csv.zip
rm -rf icml_face_data.csv.zip
rm -rf fer2013.tar.gz

# untar file
mkdir emotion_recognition
tar -xvf fer2013.tar -C emotion_recognition/
rm -rf fer2013.tar

In [None]:
%%shell

wget -q https://github.com/rahiakela/tensorflow-computer-vision-cookbook/raw/main/10-applying-deep-learning-to-video/videos/emotions.mp4
wget -q https://github.com/rahiakela/tensorflow-computer-vision-cookbook/raw/main/10-applying-deep-learning-to-video/videos/emotions2.mp4

wget -q https://github.com/rahiakela/tensorflow-computer-vision-cookbook/raw/main/10-applying-deep-learning-to-video/resources/haarcascade_frontalface_default.xml
wget -q https://github.com/rahiakela/tensorflow-computer-vision-cookbook/raw/main/10-applying-deep-learning-to-video/models/model-ep061-loss0.795-val_loss0.882.h5

mkdir -p resources
mkdir -p models
mkdir -p test_videos
mv haarcascade_frontalface_default.xml resources/
mv model-ep061-loss0.795-val_loss0.882.h5 models/
# the downloaded videos are named emotions.mp4 and emotions2.mp4
mv *.mp4 test_videos/

In [1]:
import csv
import glob
import pathlib
import cv2
import imutils
import numpy as np
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.image import *
from tensorflow.keras.utils import to_categorical

## Define some helper functions

Let's define a list of all possible emotions in our dataset, along with a color associated with each one.

In [None]:
EMOTIONS = ["angry", "scared", "happy", "sad", "surprised", "neutral"]
# Map each emotion to the color used when drawing its label
COLORS = {
  "angry": (0, 0, 255),
  "scared": (0, 128, 255),
  "happy": (0, 255, 255),
  "sad": (255, 0, 0),
  "surprised": (178, 255, 102),
  "neutral": (160, 160, 160)
}
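As a rough illustration of how these two structures might be used together, the sketch below picks the most likely emotion from a list of probabilities and looks up its color. The helper name `describe_prediction` is hypothetical, the probabilities are made up, and the sketch assumes the model's outputs follow the order of `EMOTIONS` and that the color tuples are in OpenCV's BGR order.

```python
EMOTIONS = ["angry", "scared", "happy", "sad", "surprised", "neutral"]
COLORS = {
    "angry": (0, 0, 255),
    "scared": (0, 128, 255),
    "happy": (0, 255, 255),
    "sad": (255, 0, 0),
    "surprised": (178, 255, 102),
    "neutral": (160, 160, 160),
}


def describe_prediction(probabilities):
    """Return (label, confidence, color) for the most likely emotion.

    Assumes one probability per entry in EMOTIONS, in the same order.
    """
    if len(probabilities) != len(EMOTIONS):
        raise ValueError("expected one probability per emotion")
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    label = EMOTIONS[best_index]
    return label, probabilities[best_index], COLORS[label]


# Made-up probabilities standing in for a model's softmax output.
label, confidence, color = describe_prediction([0.05, 0.05, 0.7, 0.1, 0.05, 0.05])
print(label, confidence, color)
```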

Let's define a function to build the emotion classifier architecture. It receives the input shape and the number of classes in the dataset:

In [None]:
def build_network(input_shape, classes):

  input = Input(shape=input_shape)
  # Each block in the network consists of two ELU-activated, batch-normalized
  # convolutions, followed by a max pooling layer, and ends with a dropout layer.
  x = Conv2D(filters=32,
             kernel_size=(3, 3),
             padding="same",
             kernel_initializer="he_normal")(input)
  x = ELU()(x)
  x = BatchNormalization(axis=-1)(x)
  x = Conv2D(filters=32,
             kernel_size=(3, 3),
             kernel_initializer="he_normal",
             padding="same")(x)
  x = ELU()(x)
  x = BatchNormalization(axis=-1)(x)
  x = MaxPooling2D(pool_size=(2, 2))(x)
  x = Dropout(rate=0.25)(x)

  x = Conv2D(filters=64,
             kernel_size=(3, 3),
             kernel_initializer="he_normal",
             padding="same")(x)
  x = ELU()(x)
  x = BatchNormalization(axis=-1)(x)
  x = Conv2D(filters=64,
             kernel_size=(3, 3),
             kernel_initializer="he_normal",
             padding="same")(x)
  x = ELU()(x)
  x = BatchNormalization(axis=-1)(x)
  x = MaxPooling2D(pool_size=(2, 2))(x)
  x = Dropout(rate=0.25)(x)

  x = Conv2D(filters=128,
             kernel_size=(3, 3),
             kernel_initializer="he_normal",
             padding="same")(x)
  x = ELU()(x)
  x = BatchNormalization(axis=-1)(x)
  x = Conv2D(filters=128,
             kernel_size=(3, 3),
             kernel_initializer="he_normal",
             padding="same")(x)
  x = ELU()(x)
  x = BatchNormalization(axis=-1)(x)
  x = MaxPooling2D(pool_size=(2, 2))(x)
  x = Dropout(rate=0.25)(x)

  x = Flatten()(x)

  # Next, we have two dense, ELU activated, batch-normalized layers, also followed by a dropout
  x = Dense(units=64, kernel_initializer="he_normal")(x)
  x = ELU()(x)
  x = BatchNormalization(axis=-1)(x)
  x = Dropout(rate=0.5)(x)

  x = Dense(units=64, kernel_initializer="he_normal")(x)
  x = ELU()(x)
  x = BatchNormalization(axis=-1)(x)
  x = Dropout(rate=0.5)(x)

  # Finally, we encounter the output layer, with as many neurons as classes in the dataset. 
  # Of course, it's softmax-activated:
  x = Dense(units=classes, kernel_initializer="he_normal")(x)
  output = Softmax()(x)

  return Model(input, output)
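It can help to trace how the feature map shrinks through the three blocks. The sketch below does that arithmetic in plain Python, assuming a 48x48 input (the image size used by this dataset), "same"-padded convolutions that keep the spatial size, and 2x2 max pooling with the default stride, which halves each dimension. The helper name `flattened_size` is hypothetical.

```python
def flattened_size(height, width, block_filters, pool=2):
    """Return (height, width, features) seen by Flatten() after all blocks.

    Each block ends with a max pooling layer that divides both spatial
    dimensions by `pool` (rounding down); the convolutions use "same"
    padding, so they do not change the spatial size.
    """
    for _ in block_filters:
        height //= pool
        width //= pool
    return height, width, height * width * block_filters[-1]


h, w, features = flattened_size(48, 48, [32, 64, 128])
print(h, w, features)  # 6 6 4608
```

Under these assumptions the first dense layer receives 4,608 inputs and maps them to 64 units.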

The `load_dataset()` function loads both the images and labels for the training, validation, and test datasets:

In [None]:
def load_dataset(dataset_path, classes):
  train_images = []
  train_labels = []
  val_images = []
  val_labels = []
  test_images = []
  test_labels = []

  # Let's parse the emotion column first. Although the dataset contains faces
  # for seven classes, we'll combine angry and disgust (encoded as 0 and 1,
  # respectively) because both share most of their facial features, and
  # merging them leads to better results.
  with open(dataset_path, "r") as f:
    reader = csv.DictReader(f)

    for line in reader:
      label = int(line["emotion"])
      if label <= 1:
        label = 0   # This merges classes 1 and 0
      if label > 0:
        label -= 1  # All classes start from 0

      # Next, we parse the pixels column, which holds 2,304 whitespace-separated
      # integers corresponding to the grayscale pixels of the image (48x48=2304).
      image = np.array(line["pixels"].split(" "), dtype="uint8")
      image = image.reshape((48, 48))
      image = img_to_array(image)

      # Now, to figure out which subset this image and label belong to,
      # we look at the Usage column.
      if line["Usage"] == "Training":
        train_images.append(image)
        train_labels.append(label)
      elif line["Usage"] == "PrivateTest":
        val_images.append(image)
        val_labels.append(label)
      else:
        test_images.append(image)
        test_labels.append(label)

  # Convert all the images to NumPy arrays
  train_images = np.array(train_images)
  val_images = np.array(val_images)
  test_images = np.array(test_images)

  # Then, one-hot encode all the labels
  train_labels = to_categorical(np.array(train_labels), classes)
  val_labels = to_categorical(np.array(val_labels), classes)
  test_labels = to_categorical(np.array(test_labels), classes)
  
  # Return all the images and labels
  return (train_images, train_labels), (val_images, val_labels), (test_images, test_labels)
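The label merging and pixel parsing can be checked in isolation on a tiny, made-up CSV. This is only a sketch: `remap_label` and `parse_pixels` are hypothetical helpers that mirror the logic inside `load_dataset()`, and the sample rows use 2x2 images instead of the real 48x48 ones.

```python
import csv
import io


def remap_label(label):
    """Merge original classes 0 and 1, then shift the remaining ones down."""
    if label <= 1:
        return 0
    return label - 1


def parse_pixels(pixels, side):
    """Turn a whitespace-separated pixel string into a side x side grid."""
    values = [int(value) for value in pixels.split(" ")]
    if len(values) != side * side:
        raise ValueError(f"expected {side * side} pixels, got {len(values)}")
    return [values[row * side:(row + 1) * side] for row in range(side)]


# Made-up rows with the same column names as the dataset.
sample_csv = (
    "emotion,pixels,Usage\n"
    "1,0 64 128 255,Training\n"
    "6,10 20 30 40,PrivateTest\n"
)

parsed = []
for line in csv.DictReader(io.StringIO(sample_csv)):
    label = remap_label(int(line["emotion"]))
    image = parse_pixels(line["pixels"], side=2)
    parsed.append((line["Usage"], label, image))

print(parsed)
print([remap_label(original) for original in range(7)])
```

The second print shows the seven original classes collapsing into six consecutive labels, which is why the network is built with six output units.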

Now, let's define a function to compute the area of a rectangle. We'll use this later to get the largest face detection.

In [None]:
def rectangle_area(r):
  # Assumes r = (start_x, start_y, end_x, end_y), i.e. two opposite corners.
  # For boxes given as (x, y, width, height), the area would be r[2] * r[3].
  return (r[2] - r[0]) * (r[3] - r[1])
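As a sketch of how an area helper could be used to keep only the largest face detection, the snippet below passes it as the key to `max()`. It is self-contained, so it carries its own area helper; the corner-based box format `(start_x, start_y, end_x, end_y)`, the `largest_rectangle` name, and the sample boxes are all assumptions for illustration.

```python
def rectangle_area(r):
    """Area of a box given as (start_x, start_y, end_x, end_y)."""
    return (r[2] - r[0]) * (r[3] - r[1])


def largest_rectangle(rectangles):
    """Return the box with the largest area, or None if there are no boxes."""
    if not rectangles:
        return None
    return max(rectangles, key=rectangle_area)


# Made-up detections standing in for the output of a face detector.
detections = [(10, 10, 50, 50), (0, 0, 100, 120), (30, 40, 60, 90)]
best = largest_rectangle(detections)
print(best, rectangle_area(best))
```

Keeping only the largest box is a simple heuristic that favors the face closest to the camera.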