<a href="https://colab.research.google.com/github/simecek/dspracticum2020/blob/master/lecture_05/Image_Downloader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Image Downloader

This script downloads a set of images using the Bing Image Search API.

It is heavily inspired by [Fastbook](https://github.com/fastai/fastbook/blob/master/02_production.ipynb) but adjusted to the Azure changes made after October 30.

In [1]:
import requests
import yaml
from pathlib import Path
from fastai.vision.utils import download_images, verify_images, get_image_files

To run this script, you need an Azure Bing Search API secret key. You can try it for free for 7 days. Change the following cell to:

```subscription_key = 'YOUR KEY'```

I am reading mine from a file with secrets on my Google Drive.

In [2]:
# mounts Google Drive
from google.colab import drive
drive.mount('/gdrive')

with open(r'/gdrive/My Drive/ds_praktikum/SECRETS.yaml') as file:
    secrets = yaml.safe_load(file)
subscription_key = secrets['bing_search_key']

Drive already mounted at /gdrive; to attempt to forcibly remount, call drive.mount("/gdrive", force_remount=True).
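Whichever way the key is obtained, it can help to fail early with a readable message when it is missing. The following is only a minimal sketch, assuming the secrets were parsed into a plain dictionary as in the cell above; `get_subscription_key` is a hypothetical helper, not part of any library.

```python
def get_subscription_key(secrets, name="bing_search_key"):
    """Return the API key stored under `name`, or raise a readable error."""
    # an empty YAML file parses to None, so fall back to an empty dict
    key = (secrets or {}).get(name)
    # 'YOUR KEY' is the placeholder used in the text above
    if not key or key == "YOUR KEY":
        raise ValueError(f"No usable value for '{name}' in the secrets file")
    return key
```

With this in place, `subscription_key = get_subscription_key(secrets)` could replace the direct dictionary lookup.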


In [3]:
# based on fastai forum discussion on Nov 3, 2020
# be careful: on the Free Tier you can make at most 3 queries per second
# the maximum for the `count` parameter is 150; use `offset` to page beyond that
def bing_image_search(search_term, count=150, offset=0):
    search_url = "https://api.bing.microsoft.com/v7.0/images/search"
    headers = {"Ocp-Apim-Subscription-Key" : subscription_key}

    params = {"q": search_term, 
              "license": "public", 
              "imageType": "photo", 
              "count": str(count),
              "offset": str(offset)}

    response = requests.get(search_url, headers=headers, params=params)
    response.raise_for_status()
    search_results = response.json()

    return [img['contentUrl'] for img in search_results["value"]]
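To collect more images than a single query returns, the `offset` parameter can be advanced page by page while pausing between requests to respect the rate limit. This is a rough sketch under stated assumptions: `collect_image_urls` and its `pause` parameter are hypothetical, and `search_fn` is assumed to accept `(search_term, count, offset)` like `bing_image_search` above.

```python
import time

def collect_image_urls(search_fn, search_term, total=300, page_size=150, pause=0.5):
    """Collect up to `total` unique URLs by querying `search_fn` page by page."""
    urls, seen = [], set()
    offset = 0
    while offset < total:
        count = min(page_size, total - offset)
        page = search_fn(search_term, count, offset)
        if not page:
            break  # no more results for this search term
        for url in page:
            # pages may overlap, so keep only the first occurrence of each URL
            if url not in seen:
                seen.add(url)
                urls.append(url)
        offset += count
        time.sleep(pause)  # stay below the queries-per-second limit
    return urls
```

For example, `collect_image_urls(bing_image_search, 'tree', total=300)` would issue two queries of 150 results each. Because duplicates are dropped, the returned list may be shorter than `total`.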

In [8]:
search_term = 'tree'
img_urls = bing_image_search(search_term, 150, 0) +  bing_image_search(search_term, 150, 150)
print(len(img_urls))
img_urls[:10]

300


['https://images.pexels.com/photos/1067333/pexels-photo-1067333.jpeg?cs=srgb&dl=photography-of-tree-1067333.jpg&fm=jpg',
 'https://get.pxhere.com/photo/tree-nature-grass-branch-plant-wood-leaf-flower-trunk-green-autumn-soil-botany-garden-flora-tree-trunk-art-roots-woodland-strong-tree-roots-tree-with-roots-woody-plant-land-plant-arecales-617647.jpg',
 'https://c.pxhere.com/photos/11/5d/tree_trees_branch_branches_park_relaxation_season_spring-1023172.jpg!d',
 'http://www.publicdomainpictures.net/pictures/100000/velka/hilltop-oak-tree.jpg',
 'https://get.pxhere.com/photo/tree-plant-woody-plant-grass-trunk-branch-botany-lawn-garden-park-plantation-landscape-moss-shrub-oak-landscaping-botanical-garden-California-live-oak-1554447.jpg',
 'https://publicdomainpictures.net/pictures/60000/velka/eucalyptus-tree-trunk.jpg',
 'https://get.pxhere.com/photo/tree-forest-branch-plant-wood-flower-trunk-overgrown-bark-log-produce-soil-botany-flora-root-tribe-deciduous-gnarled-woodland-tree-stump-tree-ro

In [9]:
dest = Path('/gdrive/My Drive/ds_praktikum/data/image_data') / search_term
# create the folder together with any missing parent folders
dest.mkdir(parents=True, exist_ok=True)
download_images(dest, urls=img_urls)
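When images for several search terms are needed, one folder per term keeps the classes apart. A small sketch of how such folders might be prepared; `make_class_dirs` is a hypothetical helper, and replacing spaces with underscores is just one possible naming choice.

```python
from pathlib import Path

def make_class_dirs(root, search_terms):
    """Create one sub-folder per search term and return {term: folder}."""
    root = Path(root)
    dirs = {}
    for term in search_terms:
        # spaces are replaced so that 'oak tree' becomes 'oak_tree'
        folder = root / term.strip().replace(" ", "_")
        folder.mkdir(parents=True, exist_ok=True)
        dirs[term] = folder
    return dirs
```

Each returned folder can then be passed as `dest` to `download_images` together with the URLs found for that term.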

In [10]:
img_paths = get_image_files(dest)
failed = verify_images(img_paths)
failed

(#1) [Path('/gdrive/My Drive/ds_praktikum/data/image_data/tree/00000217.png')]

In [11]:
# delete failed images
failed.map(Path.unlink);
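`failed.map(Path.unlink)` calls `Path.unlink` on every path in the list. A plain-Python sketch of the same step, which additionally skips files that are already gone, might look like this (`remove_files` is a hypothetical name):

```python
from pathlib import Path

def remove_files(paths):
    """Delete every existing file in `paths`; return how many were removed."""
    removed = 0
    for p in map(Path, paths):
        if p.exists():
            p.unlink()
            removed += 1
    return removed
```

Calling `remove_files(failed)` a second time would then return 0 instead of raising an error.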