<a href="https://colab.research.google.com/github/vtecftwy/fastbook/blob/master/02_manual_download_images_with_google.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Alternative method to manually download images
Creating your own dataset from Google Images. 

_Heavily based on the notebook by Francisco Ingham and Jeremy Howard in [`fastai/course-v3`](https://github.com/fastai/course-v3/blob/master/nbs/dl1/lesson2-download.ipynb), itself inspired by [Adrian Rosebrock](https://www.pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/)._

If you are not able to use the Bing Image Search API, this tutorial explains an alternative method using Google Images and some manual work.

**Steps:**
1. Get and save a list of image URLs for each class/category you define
2. Download the images, each in their respective folder

> You will have to repeat these steps for each new category you want to search for (e.g. once for black bears, once for grizzly bears and once for teddy bears).

In [42]:
from fastai.vision import *

## 1. Get a list of URLs

### 1.1 Search and scroll

Go to [Google Images](http://images.google.com) and search for the images you are interested in. The more specific you are in your Google Search, the better the results and the less manual pruning you will have to do.

**Scroll down until you've seen all the images you want to download, or until you see a button that says 'Show more results'**. All the images you scrolled past are now available to download. To get more, click on the button, and continue scrolling. The maximum number of images Google Images shows is 700.

It is a good idea to put the things you want to exclude into the search query. For instance, if you are searching for the Eurasian wolf, "canis lupus lupus", you may want to exclude other variants:

    "canis lupus lupus" -dog -arctos -familiaris -baileyi -occidentalis

You can also limit your results to show only photos by clicking on Tools and selecting Photos from the Type dropdown.
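
If you build several such queries, a small helper keeps them consistent. The sketch below is only an illustration: `build_query` is a hypothetical name, and the search URL layout (`tbm=isch` for image results) is an assumption that may change on Google's side.

```python
from urllib.parse import quote_plus


def build_query(phrase, exclude=()):
    """Return a search string: an exact phrase followed by excluded terms."""
    parts = [f'"{phrase}"'] + [f"-{term}" for term in exclude]
    return " ".join(parts)


query = build_query(
    "canis lupus lupus",
    ["dog", "arctos", "familiaris", "baileyi", "occidentalis"],
)
print(query)

# Assumption: this URL layout opens an image search for the query.
print("https://www.google.com/search?tbm=isch&q=" + quote_plus(query))
```

The first printed line can be pasted directly into the Google Images search box.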

### 1.2 Download URLs into a file

Now you must run some JavaScript code in your browser, which will save the URLs of all the images you want for your dataset.

In Google Chrome, press <kbd>Ctrl</kbd><kbd>Shift</kbd><kbd>j</kbd> on Windows/Linux or <kbd>Cmd</kbd><kbd>Opt</kbd><kbd>j</kbd> on macOS, and a small window, the JavaScript 'Console', will appear.

In Firefox press <kbd>Ctrl</kbd><kbd>Shift</kbd><kbd>k</kbd> on Windows/Linux or <kbd>Cmd</kbd><kbd>Opt</kbd><kbd>k</kbd> on macOS. That is where you will paste the JavaScript commands.

You will need to get the URL of each image.

When you run the code below in the console, a window will open and you should save the URL file with the appropriate name, e.g. `urls_black.csv`.

Before running the following commands, you may want to disable ad blocking extensions (uBlock, AdBlockPlus etc.) in Chrome. Otherwise the window.open() command doesn't work. Then you can run the following commands:

```javascript
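// Collect one URL per thumbnail: use `data-src` when present, otherwise `data-iurl`
urls=Array.from(document.querySelectorAll('.rg_i')).map(el=> el.hasAttribute('data-src')?el.getAttribute('data-src'):el.getAttribute('data-iurl'));
// Open the list as CSV text in a new window so that it can be saved to a file
window.open('data:text/csv;charset=utf-8,' + escape(urls.join('\n')));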
```


Repeat for each image category you are building.
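
The saved file may contain blank lines or duplicates, for example when a thumbnail has neither attribute. Below is a minimal sketch of a cleanup step, assuming the file holds one URL per line; `clean_urls` is a hypothetical helper and not part of fastai.

```python
def clean_urls(text):
    """Return the unique http(s) URLs found in newline-separated text, in order."""
    seen, urls = set(), []
    for line in text.splitlines():
        url = line.strip()
        # Keep only lines that look like web URLs and were not seen before.
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Made-up sample content standing in for a saved `urls_black.csv`.
sample = (
    "https://example.com/a.jpg\n"
    "\n"
    "https://example.com/a.jpg\n"
    "null\n"
    "http://example.com/b.png\n"
)
cleaned = clean_urls(sample)
print(cleaned)
```

To apply it to a real file, read the text with `Path(...).read_text()` and write back `'\n'.join(cleaned)`.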

### 1.3 Create directories and upload the URL files to your server

**The next steps must be done from the notebook where you are building your model. Copy this code into that notebook.**

Steps:
- Create a folder for each class/category of images.
- Upload the `urls_xxx.csv` files.
- Download the images, once for each URL file you have.

Choose an appropriate folder name for each category of images. You can run these steps multiple times to create different labels.

In [43]:
SAVE_ON_GDRIVE = False

In [44]:
if SAVE_ON_GDRIVE:
    root_dir = "/content/gdrive/My Drive/unpackai/my-model"
else:
    root_dir = "/content/"

path = Path(root_dir) / 'data'
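
When `SAVE_ON_GDRIVE` is `True`, the Drive folder only exists after Google Drive has been mounted. The sketch below assumes the notebook runs in Google Colab and that Drive is mounted at `/content/gdrive`, which matches the path used above; `resolve_root` is a hypothetical helper name.

```python
from pathlib import Path


def resolve_root(save_on_gdrive):
    """Return the root folder, mounting Google Drive first when requested."""
    if save_on_gdrive:
        # Assumption: running inside Google Colab, where this module exists.
        from google.colab import drive

        drive.mount("/content/gdrive")
        return Path("/content/gdrive/My Drive/unpackai/my-model")
    return Path("/content/")


path = resolve_root(save_on_gdrive=False) / "data"
print(path)
```

Mounting asks for authorization the first time it runs in a session.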

In [45]:
# Run this cell once for each category/class in your dataset
img_folder_name = 'black'
file = 'urls_black.csv'

dest = path / img_folder_name
dest.mkdir(parents=True, exist_ok=True)

In [49]:
img_folder_name = 'teddys'
file = 'urls_teddys.csv'

dest = path / img_folder_name
dest.mkdir(parents=True, exist_ok=True)

In [50]:
img_folder_name = 'grizzly'
file = 'urls_grizzly.csv'

dest = path / img_folder_name
dest.mkdir(parents=True, exist_ok=True)
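
The three cells above differ only in the category name, so they can also be written as one loop. This is a sketch with a hypothetical helper, `make_class_folders`; the demo uses a temporary directory so that it can run anywhere, whereas in the notebook you would pass `path`.

```python
import tempfile
from pathlib import Path


def make_class_folders(data_path, classes):
    """Create one sub-folder per class and return a {class: folder} mapping."""
    data_path = Path(data_path)
    folders = {}
    for c in classes:
        dest = data_path / c
        dest.mkdir(parents=True, exist_ok=True)  # safe to re-run
        folders[c] = dest
    return folders


# Demo in a throwaway directory; replace `demo_root` with `path` in the notebook.
demo_root = Path(tempfile.mkdtemp()) / "data"
folders = make_class_folders(demo_root, ["black", "teddys", "grizzly"])
created = sorted(p.name for p in demo_root.iterdir())
print(created)
```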

Check that all image folders are ready as expected

In [51]:
path.ls()

[PosixPath('/content/data/teddys'),
 PosixPath('/content/data/url_teddy.csv'),
 PosixPath('/content/data/black'),
 PosixPath('/content/data/grizzly'),
 PosixPath('/content/data/.ipynb_checkpoints'),
 PosixPath('/content/data/url_grizzly.csv'),
 PosixPath('/content/data/urls_black.csv')]

Finally, upload the corresponding `urls_xxx.csv` file for each category/class in your dataset.

You just need to click the folder icon in the left vertical menu, select the `data` folder created above, right-click on it and then select `Upload`. See the screenshot below.

![upload file](https://raw.githubusercontent.com/vtecftwy/fastbook/master/images/dwnl_0000.png)
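
A file uploaded under a slightly different name (for example `url_teddy.csv` instead of `urls_teddys.csv`) will not be found later. Below is a minimal sketch of a check, assuming the naming pattern `urls_<class>.csv`; `missing_url_files` is a hypothetical helper, and the demo files are made up.

```python
import tempfile
from pathlib import Path


def missing_url_files(data_path, classes):
    """Return the expected `urls_<class>.csv` names not present in data_path."""
    data_path = Path(data_path)
    expected = [f"urls_{c}.csv" for c in classes]
    return [name for name in expected if not (data_path / name).is_file()]


# Demo with made-up files in a throwaway directory.
demo = Path(tempfile.mkdtemp())
(demo / "urls_black.csv").write_text("https://example.com/a.jpg\n")
(demo / "url_teddy.csv").write_text("https://example.com/b.jpg\n")  # misnamed on purpose

missing = missing_url_files(demo, ["black", "teddys", "grizzly"])
print(missing)
```

In the notebook, calling it with `path` and your list of classes should return an empty list before you start downloading.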

## 2. Download images

Now you will need to download your images from their respective URLs.

fast.ai has a function, `download_images`, that does just that. You only have to specify the URL file and the destination folder, and the function will download and save every image that can be opened. Images that cannot be opened are not saved.

Let's download our images! Note that `max_pics` sets the maximum number of images to download. Here it is set to 200, so not all the URLs will necessarily be downloaded.

You will need to run this line once for every category.

In [52]:
classes = ['teddys','grizzly','black']

In [64]:
dest = path / 'black'
file = 'urls_black.csv'
download_images(path/file, dest, max_pics=200)

dest = path / 'teddys'
file = 'urls_teddys.csv'
download_images(path/file, dest, max_pics=200)

dest = path / 'grizzly'
file = 'urls_grizzly.csv'
download_images(path/file, dest, max_pics=200)
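
The repeated calls above can be written as a loop over the categories. In this sketch the download function is passed in as an argument so that the loop can be tried without fastai; `download_all` and `fake_download` are hypothetical names, and the `urls_<class>.csv` naming is an assumption taken from the cells above.

```python
from pathlib import Path


def download_all(data_path, classes, download_fn, max_pics=200):
    """Call download_fn(url_file, dest, max_pics=...) once per class."""
    data_path = Path(data_path)
    for c in classes:
        download_fn(data_path / f"urls_{c}.csv", data_path / c, max_pics=max_pics)


# Stand-in that only records its arguments; it downloads nothing.
calls = []


def fake_download(url_file, dest, max_pics):
    calls.append((url_file.name, dest.name, max_pics))


download_all("/content/data", ["black", "teddys"], fake_download, max_pics=200)
print(calls)
```

In the notebook, the equivalent call would be `download_all(path, classes, download_images)`.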

In [None]:
# If you have problems downloading, try with `max_workers=0` to see exceptions:
download_images(path/file, dest, max_pics=20, max_workers=0)



Then we can remove any images that can't be opened:

In [65]:
for c in classes:
    print(c)
    verify_images(path/c, delete=True, max_size=500)

teddys
grizzly
black
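
After verification it is useful to know how many images are left in each class. Below is a minimal sketch using only the standard library; `count_images` is a hypothetical helper, and the set of file suffixes is an assumption about how the images were saved.

```python
import tempfile
from pathlib import Path

# Assumption: downloaded images carry one of these suffixes.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}


def count_images(data_path, classes):
    """Return {class: number of image files in data_path/class}."""
    data_path = Path(data_path)
    return {
        c: sum(
            1
            for f in (data_path / c).iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
        for c in classes
    }


# Demo with empty placeholder files in a throwaway directory.
demo = Path(tempfile.mkdtemp())
(demo / "black").mkdir()
(demo / "teddys").mkdir()
for name in ("a.jpg", "b.PNG", "notes.txt"):
    (demo / "black" / name).touch()

counts = count_images(demo, ["black", "teddys"])
print(counts)
```

A class with very few remaining images is a sign that its URL file should be regenerated with a broader search.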
