In [51]:
import os
import numpy as np
import pandas as pd

# Dataset Collection

## Assignee: 
+ Wen Hao: Build scraper module to gather images of faces
+ Gin, Xiaoxin: Data cleaning. Manually remove images with inaccurate labels.

## Description:

We chose to gather our images from Google Image search by searching for faces with a specified emotion and gender, e.g. `"happy man face"`. The raw images obtained with the search keyword `"happy man face"` are given the true labels `"happy"` and `"man"`. The word `"face"` was added to the search term to reduce the number of images that do not contain a face. These images form the raw dataset of our project.

Data was collected with Selenium, a powerful browser automation tool, which we used to automate scraping from Google Images through chromedriver.

The steps for scraping the images are as follows:
1. Open a **headless** browser and navigate to "https://www.google.com/search?q={search_term}&tbm=isch&ijn=0"
  - The parameter `tbm=isch` specifies that this is an image search
  - The parameter `q={search_term}` is the term to search for
2. Keep scrolling down until all results are loaded on the page
3. Loop through each image result to get the URL of its full-resolution version (not the thumbnail)
4. Download the image from the URL
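
As a minimal sketch of step 1, the search URL can be assembled from the two parameters described above. The helper name `build_search_url` is hypothetical and not taken from the scraper module; the `ijn=0` parameter is simply copied from the URL in step 1.

```python
from urllib.parse import urlencode

GOOGLE_SEARCH_URL = "https://www.google.com/search"


def build_search_url(search_term: str) -> str:
    """Return the Google Images search URL for a search term (hypothetical helper)."""
    params = {
        "q": search_term,  # the term to search for
        "tbm": "isch",     # marks the request as an image search
        "ijn": "0",        # copied as-is from the URL in step 1
    }
    # urlencode escapes the search term; spaces become "+"
    return f"{GOOGLE_SEARCH_URL}?{urlencode(params)}"


print(build_search_url("happy man face"))
# → https://www.google.com/search?q=happy+man+face&tbm=isch&ijn=0
```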

The above steps were repeated for all 14 search terms (7 emotions × 2 genders). There is also a wait of about 3 seconds per image in the loop, because the UI needs time to load before the full-resolution image is available. Hence, the challenge we met when running this locally (on our MacBooks) was that running the searches sequentially, one by one, takes a significant amount of time.
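
To make the numbers concrete, the sketch below enumerates the 14 search terms and gives a rough estimate of the sequential run time. The emotion names are read off the folder names in the Data Exploration section; the variable and function names are illustrative assumptions, and the estimate ignores scrolling and download time.

```python
from itertools import product

# Emotion names taken from the category folder names in Data Exploration
EMOTIONS = ["angry", "disgusted", "happy", "neutral", "sad", "scared", "surprised"]
GENDERS = ["man", "woman"]

# One search term per (emotion, gender) pair, e.g. "happy man face"
search_terms = [f"{emotion} {gender} face" for emotion, gender in product(EMOTIONS, GENDERS)]


def sequential_minutes(num_images: int, seconds_per_image: float = 3.0) -> float:
    """Rough lower bound on the time one search takes, given the per-image wait."""
    return num_images * seconds_per_image / 60


print(len(search_terms))                  # 14
print(round(sequential_minutes(703), 1))  # 35.2
```

With the 703 results reported for "sad man face" in the demo log, a single search already needs about 35 minutes of waiting, so 14 sequential searches of a similar size would take several hours.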

The next approach to speed this up was to distribute the image search and run all 14 searches in parallel. To do this, we built a Docker container with the scraper and its dependencies built in, and built an Airflow DAG (directed acyclic graph) to run it on an instance of Airflow, with the search term as a parameter.

However, the next challenge we found was that Selenium WebDriver/Chromedriver behaves very differently when running in a Linux-based container, often throwing the cryptic error `Selenium::WebDriver::Error::InvalidSessionIdError: invalid session id`. Linux operating systems have a shared memory space called `/dev/shm`. Any Linux process can create a shared memory segment within `/dev/shm` if it wants to share memory with another process. This is often done to improve the performance of cooperating processes. The shared memory space is often used by web browsers such as Chrome when they are orchestrated by a Selenium WebDriver. Containers often get only a small `/dev/shm` by default (64 MB under Docker), which Chrome can exhaust. The solution was to mount memory at the path `/dev/shm`.
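
One way to check whether a container has enough shared memory is to inspect `/dev/shm` from inside it. The snippet below is a small diagnostic sketch, not part of the scraper; the function name `shm_free_mib` is an assumption.

```python
import os
import shutil


def shm_free_mib(path: str = "/dev/shm"):
    """Return the free space at `path` in MiB, or None if the path does not exist."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free / (1024 * 1024)


free = shm_free_mib()
if free is None:
    print("/dev/shm not found (probably not a Linux system)")
else:
    print(f"/dev/shm free: {free:.0f} MiB")
```

Running this inside the container before and after mounting memory at `/dev/shm` should show whether the mount took effect.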

While running a single search, I also realised that the WebDriver session seems to stay active for only about 30 minutes; beyond that, most containers start hitting session timeout errors while trying to retrieve the image elements.

Hence, we implemented a way to save progress at periodic checkpoints and resume the search from the index at which it failed. With this, the overall solution was a brute-force one: we increased the CPU and memory limits to the maximum possible and kept rerunning until all results were saved. The only caveat is that the image results are not static and the number of results keeps changing, so resuming from the failed index in a new search may yield a slightly different set of results, but the difference is not significant in this case.
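
The checkpoint-and-resume idea can be sketched as follows. This is an illustration under assumptions rather than the scraper's actual code: the function names, the JSON checkpoint file and the `every` interval are all hypothetical.

```python
import json
import os


def load_checkpoint(path):
    """Return the index to resume from (0 when no checkpoint file exists)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def save_checkpoint(path, next_index):
    """Write the checkpoint to a temp file first, then rename it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp_path, path)


def process_all(items, handle, checkpoint_path, every=10):
    """Process items from the last checkpoint onwards, saving progress periodically."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(items)):
        handle(items[i])  # may raise, e.g. on a session timeout
        if (i + 1) % every == 0:
            save_checkpoint(checkpoint_path, i + 1)
    save_checkpoint(checkpoint_path, len(items))
```

If `handle` raises partway through, rerunning `process_all` with the same checkpoint path restarts from the last saved index, so at most `every - 1` items are processed twice.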

<br>

## Airflow Screenshots

Below are the records of all our runs from the Airflow UI.

![scrape-man](images/scrape-man.png)

![scrape-woman](images/scrape-woman.png)

![download-preprocess](images/download-preprocess.png)

<br>

## Resource Monitoring

We also set up Grafana to monitor resource usage:

![monitor](images/monitor.png)

## Conclusion

From the above chart, we can see that resource usage increased linearly over time up to a point and then dropped back to 0. This suggests a possible memory leak somewhere within Selenium WebDriver/Chromedriver. As time was limited, we worked around this by adding checkpoints and brute-forcing it, providing more compute resources so that it could run to completion.

Selenium might not have been the best tool for this scraping task, and we can explore other tools in the future.

### Scraper Demo 

In the next cells, we run the scraper in dev mode, which scrapes only the first 20 images, to show how it works.

In [2]:
!python scraper.py -h

usage: scraper.py [-h] [-s S] [-o O] [-r R] [-d D]

optional arguments:
  -h, --help  show this help message and exit
  -s S        search term to scrape images from
  -o O        output root directory
  -r R        index to resume from
  -d D        whether to run in dev mode


In [1]:
!python scraper.py -s "sad man face" -d true

[WDM] - 

[2021-11-13 21:40:01,342] [INFO] - 

[WDM] - Current google-chrome version is 95.0.4638
[2021-11-13 21:40:01,401] [INFO] - Current google-chrome version is 95.0.4638
[WDM] - Get LATEST driver version for 95.0.4638
[2021-11-13 21:40:01,401] [INFO] - Get LATEST driver version for 95.0.4638
[WDM] - Driver [/Users/wenhao.lau/.wdm/drivers/chromedriver/mac64/95.0.4638.69/chromedriver] found in cache
[2021-11-13 21:40:01,460] [INFO] - Driver [/Users/wenhao.lau/.wdm/drivers/chromedriver/mac64/95.0.4638.69/chromedriver] found in cache
[2021-11-13 21:40:02,269] [INFO] - Search term: sad man face
[2021-11-13 21:40:03,901] [INFO] - Scrolling down ..
[2021-11-13 21:40:22,237] [INFO] - Reached the end ..
[2021-11-13 21:40:22,383] [INFO] - Total results found: 703
[2021-11-13 21:40:22,383] [INFO] - is_retry: False
[2021-11-13 21:40:22,383] [INFO] - dev: True
search_term=sad man face | is_retry=False : 100%|█| 20/20 [01:06<00:00,  3.34s/i
[2021-11-13 21:41:29,278] [INFO] - Total image URLs e

# Data Pre-processing
Assignee: Wen Hao

Description: Pre-process the gathered images: resize to a standard size, convert to grayscale, etc.

# Data Exploration
Assignee: Gin, Xiaoxin

In [72]:
IMAGE_FOLDER = "path/to/images"  # placeholder: replace it with your own path

# Count the files in each category subfolder
totalImages = {}
for subfolder in os.listdir(IMAGE_FOLDER):
    subfolder_path = os.path.join(IMAGE_FOLDER, subfolder)
    if os.path.isdir(subfolder_path):
        totalImages[subfolder] = len(os.listdir(subfolder_path))

pd.DataFrame(totalImages.items(), columns = ['category', '# of sample'])

| | category | # of sample |
|---|---|---|
| 0 | neutral man face | 241 |
| 1 | sad woman face | 848 |
| 2 | angry woman face | 753 |
| 3 | disgusted woman face | 746 |
| 4 | happy man face | 192 |
| 5 | surprised woman face | 851 |
| 6 | sad man face | 359 |
| 7 | scared man face | 394 |
| 8 | angry man face | 331 |
| 9 | disgusted man face | 514 |


In [73]:
print(sum(totalImages.values()), "files in total.")

7981 files in total.


# Model Development, Improvement & Evaluation