<h1>Build a dataset from Google Images<span class="tocSkip"></span></h1>

In this notebook we'll build a dataset of Asian facial images from scratch, using Google Images. We'll collect Korean, Japanese and Chinese faces for later classification.

# Imports

In [None]:
%%capture
from notebook import notebookapp
server = list(notebookapp.list_running_servers())[0]

if server['hostname'] == 'localhost':
  # Local environment
  %reload_ext autoreload
  %autoreload 2
  %matplotlib inline
else:
  # Cloud
  !pip install git+https://github.com/fastai/fastai.git
  !curl https://course.fast.ai/setup/colab | bash

In [None]:
# from fastai.utils.show_install import *
# show_install()

In [None]:
from fastai.vision import *

# Folder structure

We'll create this folder structure:
```
  data
    |-faces
        |-cn
           |-m
           |-w
        |+jp
        |+kr
    |+urls
```

In [None]:
nationalities = ['cn', 'jp', 'kr']
genders = ['m', 'w']

In [None]:
img_path = Path('data/faces')
url_path = Path('data/urls')

for p in (img_path, url_path):
    p.mkdir(parents=True, exist_ok=True)
    for n in nationalities:
        for g in genders:
            folder = p/n/g
            folder.mkdir(parents=True, exist_ok=True)

# Image search

1. We run a search in our browser like [this](https://www.google.com/search?cr=countryCN&as_st=y&biw=1351&bih=725&tbs=itp%3Aface%2Ciar%3As%2Cislt%3Aqsvga%2Cisz%3Aex%2Ciszw%3A200%2Ciszh%3A200%2Cctr%3AcountryCN&tbm=isch&sa=1&ei=BVoPXbeHCayXlwTSh4qQDQ&q=intitle%3Achen+site%3Acn.linkedin.com%2Fin+male+-female&oq=intitle%3Achen+site%3Acn.linkedin.com%2Fin+male+-female&gs_l=img.3...4235.6387..6677...0.0..1.130.964.10j2......0....1..gws-wiz-img.xWgLoyiPnks). In this particular case we are looking for pictures of men from LinkedIn China, using a Chinese surname and restricting the results to the 200x200 size indexed by Google. We scroll down until no more images load.


2. Then we open the browser console (`F12`) and run this JavaScript snippet:


```
urls = Array.from(document.querySelectorAll('.rg_di .rg_meta')).map(el=>JSON.parse(el.textContent).tu);
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));
```

&emsp;&emsp;which gives us a file containing lines like this one:

```
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTIsU_k_6WKlX67lp1eyy8iy9IR9JtcC-ynIjgluBLYr6WKylklqg
```

&emsp;&emsp;We save the file in 'data/urls', under the matching nationality and gender folder. For instance, for Chinese men we'll use 'data/urls/cn/m'.

&emsp;&emsp;NOTE: we can get the LinkedIn profile URLs if we replace `.tu` with `.ru` (or maybe `.ou`). But remember that scraping is against LinkedIn's Terms of Service.
 
3. We repeat steps 1-2 for other surnames/names until we have enough samples for one class.
4. We repeat steps 1-3 for the other gender, and then for the remaining nationalities.
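
Since steps 1-4 mean running many near-identical searches, the search URL can be generated rather than typed by hand. The following is only a sketch: `build_search_url` is a hypothetical helper, and its parameters (`cr`, `tbs`, `tbm`, `q`) are simply copied from the example link in step 1, so they may stop working if Google changes its URL format. The `site` values for Japan and Korea are not given in this notebook and have to be supplied by the reader.

```python
from urllib.parse import urlencode


def build_search_url(name, country_code, site, gender_terms="male -female"):
    """Build a Google Images search URL modelled on the example link in step 1.

    Hypothetical helper: every query parameter below is copied from that
    example link, not taken from any documented API.
    """
    params = {
        "cr": f"country{country_code}",
        # Same image filters as the example link (face type, 200x200 size).
        "tbs": f"itp:face,iar:s,islt:qsvga,isz:ex,iszw:200,iszh:200,ctr:country{country_code}",
        "tbm": "isch",
        "q": f"intitle:{name} site:{site} {gender_terms}",
    }
    return "https://www.google.com/search?" + urlencode(params)


# Example: the search from step 1 (Chinese surname "chen", men, LinkedIn China).
print(build_search_url("chen", "CN", "cn.linkedin.com/in"))
```

The printed URL can be opened in the browser, after which steps 1-2 proceed as described. For the other gender, `gender_terms` could be set to something like `"female -male"`, which is an assumption mirroring the original query.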

## Searches performed

1. China: popular surnames
2. Japan: popular names
3. Korea: popular names

# Download images 

In the last step we download all the images whose URLs are listed in the CSV files:

In [None]:
for n in nationalities:
    for g in genders:
        # concatenate all URL files
        ! rm -f data/urls/{n}/{g}/all.csv
        ! cat data/urls/{n}/{g}/* > data/urls/{n}/{g}/all.csv
        # download the images
        download_images(url_path/n/g/'all.csv', img_path/n/g, max_pics=2000)
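
The `rm`/`cat` shell commands above only work on Unix-like systems. As a rough, portable alternative, the URL files can be merged in pure Python. This is a sketch, not part of fastai: `merge_url_files` is a hypothetical helper, and unlike `cat` it also drops blank lines and duplicate URLs, which changes the contents of `all.csv` slightly.

```python
from pathlib import Path


def merge_url_files(folder, out_name="all.csv"):
    """Merge every URL file in `folder` into `out_name`, skipping duplicates.

    Hypothetical helper replacing the `rm` + `cat` shell commands.
    Returns the list of unique URLs in first-seen order.
    """
    folder = Path(folder)
    out_file = folder / out_name
    seen = set()
    urls = []
    for f in sorted(folder.iterdir()):
        # Skip the merged output itself and anything that is not a file.
        if f == out_file or not f.is_file():
            continue
        for line in f.read_text().splitlines():
            url = line.strip()
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    # Overwrite the merged file, one URL per line.
    out_file.write_text("\n".join(urls) + ("\n" if urls else ""))
    return urls
```

Inside the loop above, `merge_url_files(url_path/n/g)` could then stand in for the two shell lines before calling `download_images`.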