# Introduction

Ever wanted to make Black Mirror or 1984 more of a reality?

If you have a security camera being pointed somewhere - say, that mysterious image capturing device outside Rashid - you can detect and identify the faces of people who are passing by.

It turns out that this is pretty useful in practice!

1. Apple FaceID: label one face, face recognition for authentication

2. Bank vaults: pre-label employees, sound an alarm if unrecognizable people appear

3. Just creeping: you don't need to label any faces at all! With the magic of machine learning, you can cluster similar faces together automatically. You can then review the clusters manually to see who's been passing by.

Option 3 is the hardest since we don't know the faces ahead of time. So we'll do that!

<div style="text-align:center;">
    <font size="5">We'll do this:</font> <img src="https://i.imgur.com/3V63A1s.gif" width="300" style="display:inline-block; margin:1em auto">
    <b>--></b>
    <img src="https://i.imgur.com/bli4HcP.gif" width="250" style="display:inline-block; margin:1em auto">
    <b>--></b>
    <img src="https://i.imgur.com/wAyzaFQ.gif" width="250" style="display:inline-block; margin:1em auto">
</div>
<p style="clear: both;">

### Tutorial content

In this tutorial, we'll walk through extracting and clustering faces from videos with [OpenCV](https://opencv.org/) and [face_recognition](https://github.com/ageitgey/face_recognition).

We'll use the following YouTube video as a data source, but it is trivial to use any video or webcam feed instead:

- [Highlights: CMU Welcomes Tenth President](https://www.youtube.com/watch?v=s00G1xyKVd0)


We will cover the following topics in this tutorial:
- [Installing the libraries](#Installing-the-libraries)
- [Getting some videos](#Loading-data-and-plotting)
- [A brief introduction to Haar Cascades](#A-brief-introduction-to-Haar-Cascades)
- [Face clustering and picking representatives](#Face-clustering-and-picking-representatives)
- [Summary](#Summary)
- [Limitations](#Limitations)
- [Extensions](#Extensions)

# Installing the libraries

**Note: All the libraries used are Python 3 compatible. Feel free to use pip3 instead of pip to install.**

## pytube (optional)

[pytube](https://github.com/nficano/pytube/): download YouTube videos with ease. Supports many options such as audio only, different video resolutions, exposing all the available video streams.

Just run ```pip install pytube```.

## OpenCV

[OpenCV](https://opencv.org/): leading library for open-source computer vision

**Getting OpenCV**

You have a couple of options, just pick one! We recommend the conda approach.

1. [conda](https://conda.io/docs/)  
    You can try installing the latest version with ```conda install opencv```, but this is currently broken for most people.  
    Instead, you can install an older version with ```conda install -c menpo opencv3```

2. [pip](https://pypi.python.org/pypi/pip)  
    Unofficial binaries are available for opencv on pip. The "contrib" version includes patented algorithms/non-free modules, and can be installed with ```pip install opencv-contrib-python```. The base version can be installed as ```pip install opencv-python```. It doesn't matter which you pick for our tutorial.

3. your system package manager  
    If you use a package manager, chances are they have opencv bindings, e.g. ```brew install opencv```.

4. from scratch  
    You can otherwise build opencv from scratch, as detailed on the [official opencv website](https://docs.opencv.org/master/df/d65/tutorial_table_of_content_introduction.html).
    
**Getting the pretrained classifiers**

OpenCV comes with some batteries included. We're interested in the pretrained object classifiers.

So you'll want to keep track of the folder that OpenCV was installed to and change HAAR_PREFIX in the cell below to match.

Alternatively, you can download the [OpenCV HaarCascades GitHub folder](https://github.com/opencv/opencv/tree/master/data/haarcascades) to your computer and point HAAR_PREFIX there.

## dlib

[dlib](http://dlib.net/): machine learning library

Try ```pip install dlib```. If you're on a Mac, though, Apple's clang compiler will give you a lot of trouble.

So if that didn't work, just run ```conda install -c menpo dlib```.

## face_recognition

[face_recognition](https://github.com/ageitgey/face_recognition): simple face recognition library

Try ```pip install face_recognition```. If you get dlib errors, try ```pip install face_recognition --no-dependencies```.

## Final steps

Once the above libraries are installed, make sure the following cell runs!

In [1]:
import os                          # filesystem functions

import cv2                         # computer vision: face extraction
import face_recognition            # computer vision: face recognition
import matplotlib.pyplot as plt    # drawing various images in the notebook
import pytube                      # for downloading videos from YouTube (optional)
import random                      # random sampling for representative face

from IPython import display        # useful display functions

# point this to the location of your OpenCV installation
# this must be an absolute path
# otherwise, you will see "error: (-215) !empty() in function detectMultiScale"
HAAR_PREFIX = r'/Users/.../anaconda/pkgs/opencv-3.4.1-py35_blas_openblas_200/share/OpenCV/haarcascades/'

In [2]:
# this cell just supports type annotations that we will be using

# type annotations
from typing import Callable, List, Tuple, Union

# type aliases
import numpy as np
RGBImage = np.ndarray

# Getting some videos

## Downloading with pytube

To download a video with pytube, all we need is its YouTube link. The abovementioned video has the URL of https://www.youtube.com/watch?v=s00G1xyKVd0

In [3]:
def download_video(youtube_url : str, filename : str = 'video') -> str:
    """
    Downloads the video at youtube_url into filename.
    Returns full name of downloaded file, including extension.
    
    The filename should not include any extensions,
    as the type of the downloaded file depends on YouTube itself.
    We do not provide any video encoding capability,
    look into ffmpeg if you want this feature.
    
    We download the first available stream returned from YouTube."""
    
    yt = pytube.YouTube(youtube_url)            # we create a YouTube object
    stream = yt.streams.first()                 # we get the first downloadable stream
    sname = stream.default_filename
    stream.download(filename=filename)          # and then we download it
    extension = sname[sname.index('.'):]
    return filename + extension
    
    # there are many more settings for downloading a video - audio only, quality settings, etc
    # check out https://github.com/nficano/pytube for the full list of options


VIDEO_URL = 'https://www.youtube.com/watch?v=s00G1xyKVd0'
downloaded_name = download_video(VIDEO_URL)
print("Downloaded {} to {}".format(VIDEO_URL, downloaded_name))

Downloaded https://www.youtube.com/watch?v=s00G1xyKVd0 to video.mp4


## Displaying the video and preparing a processing pipeline

It is possible to show videos inside the Jupyter Notebook too. This isn't ideal, since the Notebook is laggy compared to Jupyter qtconsole, but it can be very helpful in a pinch.

Here, we're defining a function that will display the video inside the notebook. A few key ideas:

- the video is loaded from a file with [cv2.VideoCapture](https://docs.opencv.org/2.4/modules/highgui/doc/reading_and_writing_images_and_video.html).

    - If you provide VideoCapture with a path to a local video, it reads that video. For example, VideoCapture('video.mp4').
    - If you provide it with a number, that becomes device ID. For example, VideoCapture(0) defaults to your webcam.


- we accept a list of transformations to make it easy to form an image processing pipeline. Each frame of the video will have the list of transformations applied to it.
    - For example, we're currently interested in detecting faces. So for every frame, we want to perform detectFaces(frame). However, in the future we may also want to detect cats. We can reuse our existing function by providing [detectFaces, detectCats].


- it often isn't useful to process _every_ frame of a video. If someone's face first appears in frame X, chances are that they still appear in frame X+1. We can save time by skipping every N frames.

In [4]:
class Transformation:
    """
    Class to represent an image transformation.
    
    It defines a transform method that allows for the chaining of transformations.
    """
    
    def transform(self, image : RGBImage) -> RGBImage:
        """
        Transforms the supplied image and returns the image after transformation.
        """
        raise NotImplementedError

        
def process_video(filename : str = None,
                  transformations : List[Transformation] = None,
                  show : bool = True,
                  skip : int = 0) -> None:
    """
    Displays the video located at filename in the notebook.
    
    filename:
        If no filename is provided, it displays the webcam instead.
        
    transformations:
        applies all the transformations to every frame of the video.
        If transformations are provided as [T1, T2, ..., Tn]
        Then they are applied as Tn(...(T2(T1(frame)))).
    
    show:
        if show is false, will not show the video.
    
    skip:
        will skip this many frames without transforming or showing them
    
    
    
    Note that the refresh rate is very choppy on most computers.
    This function is just a quick way to view your videos within
    the Jupyter notebook workflow.
    If you want your video to look nicer, look into double buffers.
    
    To get better performance, you should run this code from
    jupyter qtconsole.
    """
    
    # filename of 0 corresponds to the webcam, if you have one
    if filename is None:
        filename = 0
    
    vid = cv2.VideoCapture(filename)

    # https://stackoverflow.com/q/16703345
    # Mac doesn't detect end of video properly
    # so we need to hack around it instead of depending
    # on VideoCapture.read's return code
    total_frames = vid.get(cv2.CAP_PROP_FRAME_COUNT)
    cur_frame = 0
    
    try:
        while cur_frame < total_frames:
            # we read a frame of the video
            read_success, frame = vid.read()

            cur_frame = cur_frame % (skip + 1)
            
            # if we couldn't read a frame, we stop
            if not read_success:
                vid.release()
                break

            if cur_frame == 0:
                
                # OpenCV stores images in BGR format
                # matplotlib expects images in RGB format
                # therefore we need to convert the two
                frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

                # apply all our transformations to the frame
                if transformations is not None:
                    for t in transformations:
                        frame = t.transform(frame)

                if show:
                    plt.axis('off')                 # remove the axis
                    plt.title("Video Stream")       # provide a title
                    plt.imshow(frame)               # load the frame
                    plt.show()                      # show the frame
                    display.clear_output(wait=True) # keep the current frame until we have a new one

    except KeyboardInterrupt: # we may choose to prematurely end the stream
        vid.release()

In [None]:
# we can display our video by running the following cell
process_video('video.mp4')

# A brief introduction to Haar Cascades

## Background

<div style="text-align:center;">
    <img src="https://imgs.xkcd.com/comics/tasks.png" width="180" style="display:inline-block; margin:1em auto">
    <img src="https://imgs.xkcd.com/comics/machine_learning.png" width="250" style="display:inline-block; margin:1em auto">
</div>
<p style="clear: both;">

Back before TensorFlow and neural networks appeared everywhere, object recognition was a pretty difficult problem in computer vision. People came up with the concept of [Haar features](https://en.wikipedia.org/wiki/Haar-like_feature). Intuitively, Haar features are just math with weighted rectangles. You have a detection rectangle that you're moving around, and you compute pixel intensities based on adjacent rectangular regions. This is best understood through examples. Below, Haar feature C mimics the detection region for the bridge of a nose.

<div style="text-align:center;">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/2f/Prm_VJ_fig1_featureTypesWithAlpha.png/600px-Prm_VJ_fig1_featureTypesWithAlpha.png" width="200" style="display:block;margin:0 auto">
        <img src="https://upload.wikimedia.org/wikipedia/commons/8/8a/Haar_Feature_that_looks_similar_to_the_bridge_of_the_nose_is_applied_onto_the_face.jpg" width="200" style="display:block;margin:0 auto">
    <caption><em>Example of HAAR Features, source: [Wikipedia](https://en.wikipedia.org/wiki/Viola%E2%80%93Jones_object_detection_framework)</em></caption>
</div>

Because the algorithm is so simple, a single Haar feature is pretty useless by itself. Instead, you'll need a LOT of Haar features, trained with a cascade classifier. A cascade classifier is just a multistage classifier that repeatedly feeds the classified result of one stage to the next.

You could also have noticed that we're implicitly assuming that faces are aligned to face forward, i.e., it wouldn't work very well for faces that were turned sideways.

So if this method sucks, why do we use it? The primary advantage of Haar-like features is calculation speed. Because we're just computing average intensities and areas of rectangles, computers are able to perform quick math. It's also relatively easy to parallelize.

And now with all that out of the way, let's get started!

## Training 

In the interests of tutorial length, we're not covering how to train a Haar cascade classifier. You can check the official [OpenCV](https://docs.opencv.org/3.3.0/dc/d88/tutorial_traincascade.html) tutorial for details. Essentially, you'd need a folder containing positive samples and a folder containing negative samples, e.g., one folder containing cropped images of bananas and another folder containing anything that isn't a banana. Going by advice online, you probably want at least 50 positive samples and 500 negative samples. Once you have that, OpenCV will use the following Haar features [by default](https://docs.opencv.org/2.4/modules/objdetect/doc/cascade_classification.html):

![haar features](https://docs.opencv.org/2.4/_images/haarfeatures.png)

## Use

For the purposes of this tutorial, we're just going to use the [inbuilt](https://github.com/opencv/opencv/tree/master/data/haarcascades) OpenCV classifiers. We're interested in the one that recognizes faces from the front angle, ```haarcascade_frontalface_default.xml```.

In [12]:
# this is global since we only need to initialize it once
# the haarcascade is supplied by OpenCV
face_cascade = cv2.CascadeClassifier(HAAR_PREFIX + 'haarcascade_frontalface_default.xml')

class FaceExtractingTransformation(Transformation):
    """
    Transformation that extracts and saves faces from a frame.
    """
    
    def __init__(self,
                 speed : float = 1.1,
                 face_threshold : int = 15,
                 min_face_size : int = 30,
                 base_path : str = 'faces/img{}.png',
                 save_index : int = 0,
                 draw_face : bool = True):
        """
        Creates a FaceExtracting transformation.
        
            speed: as speed increases, speed of facial recognition increases
                cost: accuracy
                note: for facial recognition, people typically use values between 1.05 and 1.4
            
            face_threshold: as threshold increases, false positives decreases
                cost: losing true positives
                note: typically in the range of 3 to 20
                
            min_face_size: minimum length and height of the faces found
            base_path: path that images are saved to
            start_index: first number in numbering for saving images
            draw_face: true if we should draw a rectangle around the faces we extracted
        """
        self.speed = speed
        self.face_threshold = face_threshold
        self.min_face_size = min_face_size
        self.base_path = base_path
        self.save_index = save_index
        self.draw_face = draw_face
        

    def get_faces(self, image : RGBImage,
                  speed : float,
                  threshold : int,
                  min_face_size : int) -> List[Tuple[int,int,int,int]]:
        """
        Gets the faces from the image.

        Faces are returned as bounding box coordinates, (x,y,w,h).

        speed is tunable.
            As speed increases, computation time decreases.
            But accuracy decreases too.
        threshold is tunable.
            As threshold increases, the number of false positives decreases.
            But true positives may be lost.
        """
        gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
        faces = face_cascade.detectMultiScale(gray,
                                              speed,
                                              threshold,
                                              minSize=(min_face_size, min_face_size))
        return faces
    
    
    def save_face(self, face : RGBImage) -> None:
        """
        Saves the provided face to disk.
        """
        plt.imsave(self.base_path.format(self.save_index), face)
        self.save_index += 1

    
    def draw_rectangle(self, 
                       image : RGBImage,
                       top_left : Tuple[int,int],
                       bot_right : Tuple[int,int],
                       color : Tuple[int,int,int] = (255,0,0),
                       thickness : int = 2) -> None:
        """
        Draws a rectangle touching the top_left and bot_right coordinates.
        """
        if self.draw_face:
            cv2.rectangle(image, top_left, bot_right, color, thickness)
        
    
    def transform(self, image : RGBImage) -> RGBImage:
        faces = self.get_faces(image, self.speed, self.face_threshold, self.min_face_size)
        
        for (x,y,w,h) in faces:
            face = image[y:y+h, x:x+w]
            self.save_face(face)
            self.draw_rectangle(image, (x,y), (x+w,y+h))

        return image

In [13]:
# this tends to take a while
# locally, this generated 3.1k images

# the recommended order of tuning for speed/accuracy is:
#   0. test that it works with show=True. But generate data with show=False
#   1. try increasing min_face_size
#   2. try adjusting skip
#   2. try increasing speed

process_video('video.mp4', 
              transformations=[FaceExtractingTransformation()], 
              show=False,
              skip=5)

<img src="https://i.imgur.com/bli4HcP.gif" width="300" style="float:left">

# Face clustering and picking representatives

With a folder full of faces, we'd like to group faces belonging to the same people together.

It is really easy to do this with the ```face_recognition``` library.

For each image file,

1. we load it with ```face_recognition.load_image_file(filepath)```
2. we encode it with ```face_recognition.face_encodings(image)```. Because we used a relatively inaccurate Haar cascade to get faces, there is a chance that this fails, in which case we will ignore it and proceed to the next face.
3. we can compare an encoded face against a list of other encoded faces with ```face_recognition.compare_faces(other_faces, current_face)```

## Picking a representative face

Upon successfully clustering images such that all of the images represent the same person, you'd normally just output "yup, person X is here" and call it a day. But we have no names! Instead, we'll aim to pick any one of their faces with equal probability. If we knew how many faces they have in total, we could just pick face number ```random.randint(0, len(faces))```. But we don't know how many faces there are in advance! (e.g. live security camera)

We pull out a probability party trick here.

## Reservoir sampling

Suppose that your phone only has memory for one photo. You can take photos multiple times, overwriting the old one. You see an interesting row of houses in the distance. As you walk from the first house to the last, you want to take a photo of any house with uniform $\frac{1}{n}$ probability. But you don't know how many houses there are, and you don't want to walk past the houses more than once! How can you do this?

It turns out that the strategy is to take a photo of house $i$ with probability $\frac{1}{i}$. To convince yourself of this, consider different situations for $n$.

- 1 house: you definitely want to take the photo with probability 1
- 2 houses: you want both houses to have probability half. Well, photograph the first house and with probability half replace it with the second house.

And in general you can do a proof by induction to show that everything works out.

## Back to our problem

We'll simply need to maintain counts of each person's face as we come across them, and replace our representative face for that person with probability $\frac{1}{\text{number of faces for that person}}$.

In [6]:
faces = [] # encodings of faces
face_counts = {} # counts for reservoir sampling
face_stored = {} # int label -> filepath to representative face


def add_faces(filepath : str) -> Union[int, None]:
    """
    Loads the face located at filepath and compares it against all known faces.
    
    If the face could not be loaded, it returns None.
    
    If the face matches with an existing face, we output the match index,
    i.e. the index of the face in our list of known faces which matched.
    
    Otherwise, we save the new face to our list of faces and output its index.
    
    In both cases, the index output is the index of the face's representative
    in our list of known faces.
    """
    image = face_recognition.load_image_file(filepath)
    encoding = face_recognition.face_encodings(image)

    if len(encoding) == 0:
        # if we couldn't load a face
        return None
    else:
        # otherwise grab the first face
        encoding = encoding[0]

    # try matching the face to existing faces, if it matched, return
    matches = face_recognition.compare_faces(faces, encoding)
    try:
        first_match_index = matches.index(True)
        return first_match_index
    except ValueError:
        pass
    
    # if no match, add the face and return
    faces.append(encoding)
    return len(faces) - 1


# we load our faces and populate our faces, face_counts, face_stored
for filename in os.listdir('faces'):
    if not filename.startswith('.'):
        filepath = 'faces/' + filename
        face_label = add_faces(filepath)
        
        # if we couldn't add the face, skip it
        if face_label is None:
            continue
        
        # otherwise perform reservoir sampling
        face_counts[face_label] = face_counts.get(face_label, 0) + 1
        rand = random.randint(1, face_counts[face_label])
        # only accept one of the count possibilities
        if rand == 1:
            face_stored[face_label] = filepath

## Viewing our representatives

What's a convenient way to see many pictures at once?

A video! Well, a photo [collage](http://answers.opencv.org/question/15589/make-a-collage-with-other-images/) works, but that gets overwhelming with too many faces. A video scales better.

OpenCV includes VideoWriter, simple but limited:

- keep it in AVI format
- use the "fourcc mp4v" codec

For actual video editing, you'd probably use [moviepy](https://zulko.github.io/moviepy/) or [ffmpeg](https://www.ffmpeg.org/) directly. The OpenCV developers aren't interested in expanding video support.

In [8]:
output_file = 'out.avi' # must be avi
output_codec = cv2.VideoWriter_fourcc('m', 'p','4','v') # this one tends to work better
fps = 1 # 1 frame per second = 1 face per second
width, height = (600, 600) # video dimensions
out = cv2.VideoWriter(output_file, output_codec, fps, (width, height))

for filepath in face_stored.values():
    image = cv2.imread(filepath)
    out.write(cv2.resize(image, (width, height))) # important: video and image dimensions must match

out.release()

print("Wrote {} which has {} faces.".format(output_file, len(face_stored)))

Wrote out.avi which has 36 faces.


In [None]:
process_video('out.avi')

<img src="https://i.imgur.com/wAyzaFQ.gif" height=300 width=300 style="float:left" />

# Summary

In this tutorial, we have:

1. Learned to use ```pytube``` to download YouTube videos for data
2. Prepared an image processing pipeline for videos
3. Learned a little theory about Haar classifiers - math with rectangles for describing faces
4. Used the built-in Haar classifiers of ```OpenCV``` to extract all the faces from a video
5. Learned how to recognize faces using ```face_recognition```, which uses machine learning behind the scenes
6. Used face classification to separate a collection of faces into clusters of faces, where each cluster is of the same face
6. Learned a little theory behind reservoir sampling, useful when you want to have uniform samples of an unknown-possibly-huge population
7. Used reservoir sampling to pick a cluster representative
7. Used ```OpenCV``` to create simple videos

# Limitations

1. We're only using the frontal face Haar feature. To *really* capture as many faces as possible, we should also find and/or train a Haar feature for side profiles of faces.

2. Haar is 2000s technology. Most people are all about neural networks now. I thought it would be neat to look at classic Haar to see that it still performs reasonably well.

3. I had to run this code on my computer, which is decidedly not strong enough for this. I had to use a relatively fast speed. If you had access to fancy GPUs and/or powerful CPUs, try speed=1.05 and a min_face_size of 0.

# Extensions

Once you have a picture of a specific face you're interested in, e.g., in your faces/ folder after running the above code: you can pass it as the argument to ```facial_recognition.compare_faces``` to check if that's the person you want.

- You can check for all the known people this way (e.g. employees at a bank)
- You can extract all the video frames in which someone appears and make that a new video (focusing on specific person)

We can also preprocess the frame to improve face detection. For example, we can auto-level the image brightness to improve the lighting conditions.