*Note: a GPU runtime is recommended. A CPU runtime also works, but you'll have to wait longer (10-30 min).*

As per the usual Colab routine, start by clicking the "Copy to Drive" button above.

Then execute the following cell to get the sources and do a typical editable Python install.

<font size="2">--no-build-isolation flag might be necessary due to https://stackoverflow.com/a/70805903.</font>

In [None]:
!git clone 'https://github.com/nsndp/video-to-faces.git'
!pip install --no-build-isolation -e video-to-faces

Now you should RESTART THE KERNEL (Runtime -> Restart session), or "import from ..." won't work.

After restarting, you can run the cell below and confirm the installation by seeing '/content/video-to-faces/src' among the outputs.

In [None]:
import sys
sys.path
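
The same check can be done programmatically instead of by eye. This is a minimal sketch, assuming the repository was cloned into /content as in the install cell; the helper name `has_src_path` is made up for this example.

```python
import sys

def has_src_path(paths, suffix='video-to-faces/src'):
    """Return True if any entry of `paths` ends with `suffix` (trailing slashes ignored)."""
    return any(str(p).rstrip('/').endswith(suffix) for p in paths)

# True after a successful editable install and kernel restart.
print(has_src_path(sys.path))
```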

For testing, we'll download some YouTube videos using [yt-dlp](https://github.com/yt-dlp/yt-dlp), a fork of the [youtube-dl](https://github.com/ytdl-org/youtube-dl) library.

<font size="2">Installing like this instead of just "pip install youtube-dl" due to https://stackoverflow.com/a/75504772.</font>

<font size="2">-f 22 corresponds to a 720p mp4 version of a video (all formats can be listed with "yt-dlp -F \<url\>").</font>

In [None]:
!python3 -m pip install --force-reinstall https://github.com/yt-dlp/yt-dlp/archive/master.tar.gz

In [None]:
# -p keeps the cell re-runnable: no error if a folder already exists
!mkdir -p videos
# URLs are quoted so the shell doesn't treat '?' as a glob character
!yt-dlp -o videos/DrHorribleAct1.mp4 -f 22 'https://www.youtube.com/watch?v=MPImhprnUsE'
!yt-dlp -o videos/Railgun1FanTrailer.mp4 -f 22 'https://www.youtube.com/watch?v=jL_vEw6hObw'
!yt-dlp -o videos/LycoRecoTrailer.mp4 -f 22 'https://www.youtube.com/watch?v=F5DMjhg3A6c'
!mkdir -p output1 output2 output3 output4

# Example 1

Let's start with detection on a live-action video: [Dr. Horrible's Sing-Along Blog - Act 1](https://www.youtube.com/watch?v=MPImhprnUsE).

We'll take a frame every 2 seconds, enlarge the boxes from faces to ~portraits and square them.

Every detection with a score <= 0.2 or width/height <= 20px will be rejected.

Every face with an average hash difference <= 10 to some other face will be marked as a duplicate.

We'll also save all intermediate results and logs for closer examination.

In [None]:
!python -m videotofaces -i 'videos/DrHorribleAct1.mp4' -o 'output1' -s 'live' -m 'detection' \
  --video-step 2 --det-scale 1.2 1.2 1.7 1.3 --det-square --det-min-score 0.2 --det-min-size 20 \
  --save-frames --save-rejects --save-dupes --hash-thr 10

Once done, you can browse the results using a helper function (change page_number to see the rest):

In [None]:
from videotofaces import image_gallery
image_gallery('output1/faces', page_size=36, page_number=0, height=110)

For the intermediate results, you can browse the frames (rejects will be drawn red, others green):

In [None]:
image_gallery('output1/intermediate/frames', page_size=12, page_number=23, height=200)

Or the rejection log using different filters (e.g. detections with a score >= 0.15 but <= 0.2):

In [None]:
from videotofaces import dataframe_with_images
dataframe_with_images('output1/intermediate/log_rejects.csv', 'output1', filter=('score', 0.15, 0.2), height=50)

Or the duplicates log, by browsing the pairs with hash diffs from 9 (marked as duplicates) to 11 (barely passed):

In [None]:
dataframe_with_images('output1/intermediate/log_dupes1.csv', 'output1', sort_by='hash_diff', filter=('hash_diff', 9, 11), height=100)
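
To make the "hash diff" column more concrete, here is a rough sketch of how an average hash and the difference between two hashes can be computed. It is not the library's actual code: the 8x8 hash size, the block-averaging downscale and the helper names are all assumptions made for illustration, and the images are plain nested lists so the example runs without any dependencies.

```python
def average_hash(gray, hash_size=8):
    """Average hash of a grayscale image given as a list of rows of 0-255 values.

    Assumes the image height and width are multiples of hash_size
    (block averaging stands in for a proper image resize).
    """
    h, w = len(gray), len(gray[0])
    bh, bw = h // hash_size, w // hash_size
    # Downscale to hash_size x hash_size by averaging each block of pixels.
    small = []
    for by in range(hash_size):
        for bx in range(hash_size):
            block = [gray[y][x]
                     for y in range(by * bh, (by + 1) * bh)
                     for x in range(bx * bw, (bx + 1) * bw)]
            small.append(sum(block) / len(block))
    mean = sum(small) / len(small)
    # One bit per cell: 1 if the cell is brighter than the overall mean.
    return [1 if v > mean else 0 for v in small]

def hash_diff(a, b):
    """Hamming distance: the number of bit positions where two hashes differ."""
    return sum(x != y for x, y in zip(a, b))

def is_duplicate(a, b, thr=10):
    """Mirror of the rule described above: a diff <= thr counts as a duplicate."""
    return hash_diff(a, b) <= thr

# Synthetic 16x16 images: the same layout at two contrast levels, and a different layout.
left_bright = [[255 if x < 8 else 0 for x in range(16)] for y in range(16)]
left_dimmer = [[200 if x < 8 else 30 for x in range(16)] for y in range(16)]
top_bright = [[255 if y < 8 else 0 for x in range(16)] for y in range(16)]

h1, h2, h3 = (average_hash(img) for img in (left_bright, left_dimmer, top_bright))
print(hash_diff(h1, h2), is_duplicate(h1, h2))  # 0 True
print(hash_diff(h1, h3), is_duplicate(h1, h3))  # 32 False
```

Because the hash only records which cells are brighter than the mean, a change in contrast leaves it untouched, while a different composition flips many bits.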

Now let's continue with grouping the detected results using K-means clustering, autoselecting the best number of clusters from 2 to 6.

The duplicate check will also do another pass, now with embedding distances (<=0.2) instead of hash diffs.

In [None]:
!python -m videotofaces -o 'output1' -s 'live' -m 'grouping' --clusters '2-6' --group-log --save-dupes --enc-dup-thr 0.2

Browsing the grouped results in their subfolders (you can see that it indeed managed to split them into 3 characters and 1 "other" category):

In [None]:
image_gallery('output1/faces', page_size=36, page_number=0, height=110, subfolders=True, centered=True)

Looking at the clustering log (4 clusters were chosen because of the highest silhouette score):

In [None]:
import pandas as pd
pd.read_csv('output1/faces/log_clustering.csv')
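
For intuition on how a silhouette score picks the number of clusters, here is a small dependency-free sketch. It is an illustration only, not the code video-to-faces runs: the 2D points stand in for face embeddings, the candidate labelings are hard-coded where K-means would normally produce them, and Euclidean distance is an assumption.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a point alone in its cluster contributes 0
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in idx) / len(idx)
            for other, idx in clusters.items() if other != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

# Two obvious groups of points; labelings for k=2 and k=3 are hard-coded here.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
candidates = {2: [0, 0, 1, 1], 3: [0, 1, 2, 2]}

scores = {k: silhouette_score(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Splitting the left pair into two singletons (k=3) lowers the score, so k=2 wins, which is the same kind of comparison the clustering log records for each candidate.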

Also looking at the borderline cases of the duplicate check, just like with hashes:

In [None]:
dataframe_with_images('output1/intermediate/log_dupes3.csv', 'output1', sort_by='distance', filter=('distance', 0.18, 0.22))
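
As a rough picture of what an embedding-based duplicate pass does, the sketch below compares every pair of vectors and reports those within a threshold. Cosine distance, the toy 2D vectors and the function names are assumptions for illustration; the metric actually used by the tool is not stated in this notebook.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def find_duplicates(embeddings, thr=0.2):
    """Return (i, j, distance) for every pair i < j whose distance is <= thr."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            d = cosine_distance(embeddings[i], embeddings[j])
            if d <= thr:
                pairs.append((i, j, d))
    return pairs

# Toy "embeddings": the first two are nearly parallel, the third is orthogonal to the first.
embeddings = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]
dupes = find_duplicates(embeddings, thr=0.2)
print(dupes)
```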

# Example 2

Now let's try an anime video: [a fan trailer to S1 of "A Certain Scientific Railgun"](https://www.youtube.com/watch?v=jL_vEw6hObw).

We'll launch both detection and grouping in one go this time.

But instead of automatic clustering, we'll classify the results into predefined groups.

For this, we need a folder with reference images. Here's one prepared and uploaded to Google Drive for this demo:

In [None]:
!gdown 1mQS5wDyZ4zeb4NI-mbQYHYWkVrOBXUXN
!unzip vtf_demo_ref.zip

You can take a look at the "ref" folder contents either with the helper function below or through Colab's file browser on the left, confirming that it contains 4 subfolders named after the anime's 4 main characters, each with a face image inside.

In [None]:
from videotofaces import image_gallery
image_gallery('ref', subfolders=True)

Let's read 2 frames every second this time (i.e. step = 0.5 sec) and set the threshold for "other" (meaning none of the 4 main characters) to 0.75.

Launching the processing:

In [None]:
!python -m videotofaces -i 'videos/Railgun1FanTrailer.mp4' -o 'output2' -s 'anime' \
  --video-step 0.5 --enc-oth-thr 0.75 \
  --group-mode 'classification' --ref-dir 'ref' --group-log

Looking at the classified results, you can see that everything was sorted correctly, including 4 images in the "other" group:

In [None]:
image_gallery('output2/faces', subfolders=True, centered=True)

Looking at the classification log, you can see that, for example, **000120_0.jpg** was classified as "2.kuroko" because it had the smallest distance to that reference image (0.4624).

And scrolling down to **002136_0.jpg**, you can see that the shortest distance there is 0.7743, meaning that for a higher "other" threshold (e.g. --enc-oth-thr 0.8) this image would've been wrongly classified as "4.saten".

In [None]:
import pandas as pd
pd.read_csv('output2/faces/log_classification.csv')
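
The decision rule described above (nearest reference wins unless it is farther than the "other" threshold) can be sketched in a few lines. This is a simplified illustration, not the library's code: the function name is made up, the `<=` comparison is an assumption, and in the sample dictionaries only 0.4624 and 0.7743 come from the log, while the second distance in each is invented to complete the example.

```python
def classify(distances, other_thr=0.75, other_label='other'):
    """Pick the reference with the smallest distance, or `other_label` if it is too far.

    distances: dict mapping a reference label to the distance from the face to it.
    """
    label, dist = min(distances.items(), key=lambda kv: kv[1])
    return label if dist <= other_thr else other_label

# Smallest distances taken from the text above; the second value in each dict is made up.
face_000120 = {'2.kuroko': 0.4624, '4.saten': 0.90}
face_002136 = {'2.kuroko': 0.95, '4.saten': 0.7743}

print(classify(face_000120))                 # 2.kuroko
print(classify(face_002136))                 # other
print(classify(face_002136, other_thr=0.8))  # 4.saten
```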

# Example 3

Let's do one more short anime video: [Lycoris Recoil Trailer](https://www.youtube.com/watch?v=F5DMjhg3A6c).

Taking 4 frames per second from 0:10 to 1:15, since the rest is intro/outro without faces.

Clustering into 2, 4 or 6 groups and saving all 3 options for perusal, without autoselecting the best.

In [None]:
!python -m videotofaces -i 'videos/LycoRecoTrailer.mp4' -o 'output3' -s 'anime' \
  --video-step 0.25 --video-fragment 0.15 1.25 \
  --group-mode 'clustering' --clusters '2,4,6' --clusters-save-all

Exploring the results for 4 groups (change "G4" to "G2" or "G6" to explore the results for 2 or 6 groups):

In [None]:
image_gallery('output3/faces/G4', subfolders=True, centered=True)
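
The sampling arithmetic behind "4 frames per second from 0:10 to 1:15" can be checked with a short sketch. Timestamps here are plain seconds, and treating the end of the fragment as exclusive is an assumption; the function name is made up, and how the tool maps timestamps to actual frames may differ.

```python
import math

def sample_times(start, end, step):
    """Timestamps (in seconds) at which frames are sampled, covering [start, end)."""
    count = math.ceil((end - start) / step)
    # Multiply instead of accumulating to avoid floating-point drift.
    return [start + i * step for i in range(count)]

times = sample_times(10, 75, 0.25)
print(len(times), times[:3], times[-1])  # 260 [10.0, 10.25, 10.5] 74.75
```

So a 65-second fragment at a 0.25 sec step yields about 260 frames to run detection on, compared with about 33 for the same fragment at the 2 sec step used in Example 1.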

# Example 4

Finally, let's run the very same command as in Example 1 but using GPU video decoding with the Decord library instead of the default OpenCV.

For this, we'll need to build from source as per [Decord Readme instructions](https://github.com/dmlc/decord?tab=readme-ov-file#install-from-source):

In [None]:
!sudo add-apt-repository ppa:jonathonf/ffmpeg-4
!sudo apt-get update
!sudo apt-get install -y build-essential python3-dev python3-setuptools make cmake
!sudo apt-get install -y ffmpeg libavcodec-dev libavfilter-dev libavformat-dev libavutil-dev

# (see https://github.com/dmlc/decord/issues/102#issuecomment-710781378)
# (run "!find '/' -name libnvcuvid.so.1" to ensure that the 1st path in your Colab runtime is the same)
!ln -s /usr/lib64-nvidia/libnvcuvid.so.1 /usr/local/cuda/libnvcuvid.so

!git clone --recursive https://github.com/dmlc/decord
%cd decord
!mkdir build
%cd build

!cmake .. -DUSE_CUDA=ON -DCMAKE_BUILD_TYPE=Release
!make

%cd ../python
!python3 setup.py install --user

%cd /content
import decord

Launching the detection process:

In [None]:
!python -m videotofaces -i 'videos/DrHorribleAct1.mp4' -o 'output4' -s 'live' -m 'detection' \
  --video-step 2 --det-scale 1.2 1.2 1.7 1.3 --det-square --det-min-score 0.2 --det-min-size 20 \
  --save-frames --save-rejects --save-dupes --hash-thr 10 \
  --video-reader 'decord'

You should see a noticeable speedup compared to Example 1 (~45 sec vs >2 min).

The difference could be much larger for long 1080p videos (where CPU video reading becomes a bottleneck), so for high-load processing on good GPUs this option is recommended.

Feel free to browse the results as usual (note that they might differ slightly from Example 1, because Decord and OpenCV grab frames at slightly different points):

In [None]:
from videotofaces import image_gallery
image_gallery('output4/faces', page_size=36, page_number=0, height=110)
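
For reference, reading frames at a fixed time step with Decord directly looks roughly like the sketch below. It uses Decord's public API (`VideoReader`, `get_avg_fps`, `get_batch`), but it is not the code path video-to-faces takes internally; the helper names `frame_indices` and `read_frames` are made up for this example.

```python
def frame_indices(total_frames, fps, step_sec):
    """Indices of frames spaced roughly `step_sec` seconds apart."""
    stride = max(1, round(fps * step_sec))
    return list(range(0, total_frames, stride))

def read_frames(path, step_sec=2.0, use_gpu=True):
    """Decode every `step_sec`-th second of a video into a numpy array."""
    # Imported lazily so frame_indices() works even without Decord installed.
    from decord import VideoReader, cpu, gpu
    ctx = gpu(0) if use_gpu else cpu(0)  # gpu(0) needs a CUDA-enabled Decord build
    vr = VideoReader(path, ctx=ctx)
    idx = frame_indices(len(vr), vr.get_avg_fps(), step_sec)
    # get_batch returns frames of shape (N, H, W, 3) in RGB order.
    return vr.get_batch(idx).asnumpy()

print(frame_indices(100, 25.0, 2))  # [0, 50]
```

Calling `read_frames('videos/DrHorribleAct1.mp4')` would load all sampled frames at once, which is fine for a short clip; for long videos, reading in smaller batches keeps memory use in check.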