# Trim the `Steam-OneFace` dataset

This notebook trims the `Steam-OneFace` dataset by intersecting the detection results of three face detection modules:
-   the `dlib` module,
-   the `face_alignment` (fa) module,
-   the `retinaface` (rf) module.

The trimmed datasets are:
-   `Steam-OneFace-small`:
    - 993 images,
    - obtained by intersecting the fa and rf results,
-   `Steam-OneFace-tiny`:
    - 168 images,
    - obtained by intersecting the results of all three modules (dlib, fa, rf).
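
The two intersections listed above can be sketched on toy data. The appIDs below are made up for illustration and do not come from the dataset.

```python
# Hypothetical appIDs, for illustration only (not taken from the dataset).
dlib_ids = {'10', '20', '30'}
fa_ids = {'20', '30', '40', '50'}
rf_ids = {'30', '40', '50', '60'}

# "small": appIDs kept by both face_alignment (fa) and retinaface (rf).
small = fa_ids & rf_ids

# "tiny": appIDs kept by all three modules.
tiny = small & dlib_ids

print(sorted(small))
print(sorted(tiny))
```

By construction, the tiny set is always a subset of the small set.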

For more information, see:
-   the [`Steam-OneFace`][steam-oneface-section] section of my GitHub repository.

[steam-oneface-section]: <https://github.com/woctezuma/steam-filtered-image-data#steam-oneface-dataset>

## Install requirements

In [38]:
%pip install Google-Colab-Transfer > /dev/null

In [39]:
import colab_transfer

colab_transfer.mount_google_drive()

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


## Load detection results

### With the `dlib` module

In [40]:
!wget -q https://raw.githubusercontent.com/woctezuma/steam-face-detection/main/data/app_ids_with_faces_with_dlib.json
!wget -q https://raw.githubusercontent.com/woctezuma/steam-face-detection/main/data/app_ids_with_faces_with_dlib_resized.json

In [41]:
import json

# AppIDs for which AT LEAST ONE face was detected by dlib on ORIGINAL images:
with open('app_ids_with_faces_with_dlib.json', 'r') as f:
  dlib_original = json.load(f)

# AppIDs for which AT LEAST ONE face was detected by dlib on RESIZED images:
with open('app_ids_with_faces_with_dlib_resized.json', 'r') as f:
  dlib_resized = json.load(f)

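# Keep AppIDs for which EXACTLY ONE face was detected by dlib on both:
# - original images,
# - resized images.
dlib_detections = {
  app_id
  for app_id in set(dlib_original).intersection(dlib_resized)
  if dlib_original[app_id] == 1 and dlib_resized[app_id] == 1
}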

In [42]:
print(len(dlib_detections))

345
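
The filter above keeps an appID only when it appears in both files with a value of 1 in each. A toy illustration follows, assuming (as the `== 1` test suggests) that each JSON file maps an appID to a number of detected faces; the counts are made up.

```python
# Made-up face counts per appID, for illustration only.
toy_original = {'10': 1, '20': 2, '30': 1, '40': 1}
toy_resized = {'10': 1, '20': 1, '30': 3, '50': 1}

# Same logic as the dlib filter: the appID must be present in both
# dictionaries, with exactly one face in each.
toy_detections = {
    app_id
    for app_id in set(toy_original).intersection(toy_resized)
    if toy_original[app_id] == 1 and toy_resized[app_id] == 1
}

print(toy_detections)
```

Here `'20'` and `'30'` are rejected because one version of the image has more than one face, while `'40'` and `'50'` are rejected because they appear in only one dictionary.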


### With the `face_alignment` module

In [43]:
!gdown --id 1YWoC6Ux2OnzZcc3EyRGqTnqKUdxDlfDM

Downloading...
From: https://drive.google.com/uc?id=1YWoC6Ux2OnzZcc3EyRGqTnqKUdxDlfDM
To: /content/app_ids.txt
100% 12.7k/12.7k [00:00<00:00, 19.7MB/s]


In [44]:
# AppIDs for which EXACTLY ONE face was detected by face_alignment (fa) on both:
# - original images,
# - resized images,
# and for which duplicates have been removed.
with open('app_ids.txt', 'r') as f:
  fa_detections = [i.strip() for i in f.readlines()]

In [45]:
print(len(fa_detections))

1688


### With the `retinaface` module

In [46]:
!gdown --id 1-0jW1kCBx_jf1oAAdKCvu8iA_Z2PmZIW

Downloading...
From: https://drive.google.com/uc?id=1-0jW1kCBx_jf1oAAdKCvu8iA_Z2PmZIW
To: /content/app_ids_with_retinaface.txt
100% 18.7k/18.7k [00:00<00:00, 15.6MB/s]


In [47]:
# AppIDs for which EXACTLY ONE face was detected by retinaface (rf) on both:
# - original images,
# - resized images,
# and for which duplicates have been removed.
with open('app_ids_with_retinaface.txt', 'r') as f:
  rf_detections = [i.strip() for i in f.readlines()]

In [48]:
print(len(rf_detections))

2472
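
The intersections computed later compare appIDs as strings: keys loaded from JSON objects are always strings, and the text files are read line by line. Below is a minimal sketch of a helper that normalizes IDs before comparison, assuming a source might contain integers or blank lines. `normalize_app_ids` is a hypothetical name, not part of this notebook.

```python
def normalize_app_ids(raw_ids):
    """Return a set of non-empty appIDs as stripped strings."""
    normalized = set()
    for raw_id in raw_ids:
        app_id = str(raw_id).strip()
        # Skip blank entries, e.g. a trailing empty line in a text file.
        if app_id:
            normalized.add(app_id)
    return normalized

# Made-up values mixing an integer, padded strings and blank lines.
example = normalize_app_ids([10, ' 20\n', '30', '', '\n'])
print(sorted(example))
```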


## Load image data

### High-resolution images (300x450)

In [49]:
import glob
from pathlib import Path

archive_name = 'filtered_banners.tar.gz'
image_dir_hr = 'data/original_vertical_steam_banners/'

if not Path(archive_name).exists():
  !gdown --id 1etzhe-EYyT86DYK8QbEHoYoAeU1nqRWy
  !tar -xf {archive_name}

file_names = glob.glob(image_dir_hr + '*.jpg')
print('#images = {}'.format(len(file_names)))

Downloading...
From: https://drive.google.com/uc?id=1etzhe-EYyT86DYK8QbEHoYoAeU1nqRWy
To: /content/filtered_banners.tar.gz
948MB [00:04, 213MB/s]
#images = 17492


### Low-resolution images (256x256)

In [50]:
import glob
from pathlib import Path

archive_name = 'resized_banners.tar.gz'
image_dir_lr = 'data/resized_vertical_steam_banners/'

if not Path(archive_name).exists():
  !gdown --id 1-7ukPUIZKWPyg-Lcj9b59Rr3rSJY8SuH
  !tar -xf {archive_name}

file_names = glob.glob(image_dir_lr + '*.jpg')
print('#images = {}'.format(len(file_names)))

Downloading...
From: https://drive.google.com/uc?id=1-7ukPUIZKWPyg-Lcj9b59Rr3rSJY8SuH
To: /content/resized_banners.tar.gz
522MB [00:06, 51.2MB/s]
#images = 17492
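
The copy step in `copy_files` expects every selected appID to have a file named `<appID>.jpg` in both folders. Below is a minimal sketch, assuming that naming pattern, of how to list the appIDs with no image before copying. `find_missing_app_ids` is a hypothetical helper, and the demo runs on a temporary folder with made-up appIDs.

```python
import tempfile
from pathlib import Path

def find_missing_app_ids(app_ids, image_dir):
    """Return the sorted appIDs for which `<image_dir>/<appID>.jpg` is absent."""
    available = {p.stem for p in Path(image_dir).glob('*.jpg')}
    return sorted(set(app_ids) - available)

# Demo on a temporary folder containing two empty placeholder files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for app_id in ['10', '20']:
        (Path(tmp_dir) / (app_id + '.jpg')).touch()
    missing = find_missing_app_ids(['10', '20', '30'], tmp_dir)

print(missing)
```

An empty result for both `image_dir_hr` and `image_dir_lr` would mean that `shutil.copyfile` should not fail on a missing source file.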


## Boilerplate functions

In [51]:
from pathlib import Path

def get_input_folder(resolution='hr'):
  if resolution == 'hr':
    input_folder = image_dir_hr
  else:
    input_folder = image_dir_lr

  return input_folder

def get_output_folder(resolution='hr', dataset_suffixe=''):
  output_folder = 'steam-oneface{}-{}/'.format(dataset_suffixe, resolution)
  Path(output_folder).mkdir(exist_ok=True)

  return output_folder

def get_output_archive(output_folder):
  # Remove only the trailing slash, so that a leading slash (absolute path) is kept.
  archive_name = output_folder.rstrip('/') + '.tar.gz'

  return archive_name

In [52]:
import shutil

def copy_files(filtered_app_ids, 
               dataset_suffixe):
  
  for resolution in ['hr', 'lr']:
    input_folder = get_input_folder(resolution)
    output_folder = get_output_folder(resolution, 
                                      dataset_suffixe=dataset_suffixe)

    for app_id in filtered_app_ids:
      fname = str(app_id) + '.jpg'
      shutil.copyfile(input_folder + fname, 
                      output_folder + fname)

  return

def archive_folder(dataset_suffixe):

  for resolution in ['hr', 'lr']:
    output_folder = get_output_folder(resolution, 
                                      dataset_suffixe=dataset_suffixe)
    archive_name = get_output_archive(output_folder=output_folder)
    
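    !du -sh {output_folder}
    # NB: `-cf` writes an uncompressed tar, despite the `.tar.gz` name.
    # Use `-czf` to gzip the archive.
    !tar -cf {archive_name} {output_folder}
    !du -sh {archive_name}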

  return

def export_archived_folder(dataset_suffixe):

  for resolution in ['hr', 'lr']:
    output_folder = get_output_folder(resolution, 
                                      dataset_suffixe=dataset_suffixe)
    archive_name = get_output_archive(output_folder=output_folder)
    
    colab_transfer.copy_file(file_name=archive_name,
                             source = colab_transfer.get_path_to_home_of_local_machine(),
                             destination = colab_transfer.get_path_to_home_of_google_drive())
  return  
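
The archiving step can also be written without shell commands. Below is a minimal sketch using the standard `tarfile` module in `w:gz` mode, so that the archive is gzip-compressed as the `.tar.gz` name suggests. `archive_with_tarfile` is a hypothetical helper, not part of this notebook, and the demo uses a temporary directory with an empty placeholder file.

```python
import tarfile
import tempfile
from pathlib import Path

def archive_with_tarfile(folder, archive_name):
    """Write `folder` into a gzip-compressed tar archive and return its member names."""
    folder = Path(folder)
    with tarfile.open(str(archive_name), 'w:gz') as tar:
        # arcname keeps only the folder name inside the archive.
        tar.add(str(folder), arcname=folder.name)
    # Re-open the archive to list what was actually written.
    with tarfile.open(str(archive_name), 'r:gz') as tar:
        return tar.getnames()

# Demo with a made-up folder name and a single placeholder image.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_folder = Path(tmp_dir) / 'steam-oneface-demo-hr'
    demo_folder.mkdir()
    (demo_folder / '10.jpg').touch()
    archive_path = Path(tmp_dir) / 'steam-oneface-demo-hr.tar.gz'
    names = archive_with_tarfile(demo_folder, archive_path)

print(sorted(names))
```

Since JPEG files are already compressed, gzip is unlikely to shrink the archive much; the main benefit is that the file content matches its extension.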

## Build the `Steam-OneFace-small` dataset

In [53]:
steam_oneface_small = set(rf_detections).intersection(fa_detections)

print(len(steam_oneface_small))

993


In [54]:
filtered_app_ids = steam_oneface_small
dataset_suffixe = '-small'

copy_files(filtered_app_ids = filtered_app_ids,
           dataset_suffixe = dataset_suffixe)

archive_folder(dataset_suffixe = dataset_suffixe)

export_archived_folder(dataset_suffixe = dataset_suffixe)

54M	steam-oneface-small-hr/
53M	steam-oneface-small-hr.tar.gz
31M	steam-oneface-small-lr/
29M	steam-oneface-small-lr.tar.gz
Copying /content/steam-oneface-small-hr.tar.gz to /content/drive/My Drive/steam-oneface-small-hr.tar.gz
Copying /content/steam-oneface-small-lr.tar.gz to /content/drive/My Drive/steam-oneface-small-lr.tar.gz


In [60]:
!ls steam-oneface-small-hr/*.jpg | wc -l
!ls steam-oneface-small-lr/*.jpg | wc -l

993
993


## Build the `Steam-OneFace-tiny` dataset

In [55]:
steam_oneface_tiny = set(steam_oneface_small).intersection(dlib_detections)

print(len(steam_oneface_tiny))

168


In [56]:
filtered_app_ids = steam_oneface_tiny
dataset_suffixe = '-tiny'

copy_files(filtered_app_ids = filtered_app_ids,
           dataset_suffixe = dataset_suffixe)

archive_folder(dataset_suffixe = dataset_suffixe)

export_archived_folder(dataset_suffixe = dataset_suffixe)

8.8M	steam-oneface-tiny-hr/
8.6M	steam-oneface-tiny-hr.tar.gz
5.0M	steam-oneface-tiny-lr/
4.8M	steam-oneface-tiny-lr.tar.gz
Copying /content/steam-oneface-tiny-hr.tar.gz to /content/drive/My Drive/steam-oneface-tiny-hr.tar.gz
Copying /content/steam-oneface-tiny-lr.tar.gz to /content/drive/My Drive/steam-oneface-tiny-lr.tar.gz


In [59]:
!ls steam-oneface-tiny-hr/*.jpg | wc -l
!ls steam-oneface-tiny-lr/*.jpg | wc -l

168
168
