# Search Bing Images

The idea of this notebook is to use Bing to automatically download images of the user's choice, e.g. cat, dog, donald trump. The user specifies the search terms and the notebook goes off and gets around 1000 pictures per term, which can be used to train toy models. For this to work you will need a Bing API key, which is free for one month. Put your key in the `KEY` variable below (as a Python string) and everything else should work.

Sign up for it [here](https://azure.microsoft.com/en-gb/try/cognitive-services/?api=bing-web-search-api)

Or else adapt the code to work with some other service.
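
If you would rather not paste the key into the notebook itself, one option is to read it from an environment variable. The sketch below is only an illustration: the variable name `BING_API_KEY` and the helper `load_key` are assumptions made for this example, not something the notebook or Bing defines. It falls back to the same placeholder string that is used for `KEY` further down.

```python
import os


def load_key(env=None, name="BING_API_KEY", placeholder="MY-API-KEY-HERE"):
    """Return the API key stored in the environment, or the placeholder.

    `name` is a hypothetical variable name chosen for this example.
    """
    # Use the real process environment unless a mapping is passed in.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    # An unset or empty variable falls back to the placeholder string.
    return key or placeholder


KEY = load_key()
```

With this in place, setting the environment variable before starting Jupyter is enough, and the key never ends up in a saved copy of the notebook.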

Note that Google has stopped its free API service at the time of writing, so a more hacky (or more expensive) solution would have to be used.

When choosing the search terms, think of animals or people for which Bing is likely to return at least 1000 unique pictures; something very obscure may not work.

The data sets here are going to be on the order of 1000 pictures per class, so this is very much a toy example by the standards of modern big-data projects; this exercise is meant to be educational, not production standard.

Also set the base directory. If you use the suggested one then everything downstream will also work out of the box, so I don't recommend changing it. The directory needs to already exist, which it should if you have downloaded the whole thing from GitHub as is.
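
If you are not sure whether the base directory is there, a small check like the following can create it. This is a minimal sketch, not part of the original notebook; `ensure_base_dir` is a hypothetical helper name.

```python
import os


def ensure_base_dir(path):
    """Create `path` (and any missing parents) if needed and return it."""
    # exist_ok=True makes this a no-op when the directory is already there.
    os.makedirs(path, exist_ok=True)
    return path
```

Calling `ensure_base_dir("../data/convnet")` before running the cells below should be enough when the suggested directory is used.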

In [None]:
import requests
from IPython.display import HTML

from PIL import Image
from io import BytesIO

import os

In [None]:
KEY = "MY-API-KEY-HERE"
base_dir = "../data/convnet"

In [1]:
search_terms = ['cat', 'dog', 'monkey', 'donkey', 'donald trump']

In [None]:
class ImageIterator:
    """
    Class that is used as an iterator. Runs over urls of images for a given
    search term, up to a maximum number.
    
    e.g.
    ```
    for url in ImageIterator("cat", 64):
        print(url)
    ```    
    """
    
    # These are class variables rather than belonging to an object, as they are
    # always the same.
    end_point = "https://api.cognitive.microsoft.com/bing/v7.0/images/search"
    key = KEY
    # This is the maximum number of images that bing lets us fetch in one
    # go.
    MAX_IMAGES = 150
    
    def __init__(self, search_term, max_values):
        """
        Input:
        search_term: str of what to search for, e.g. "cat"
        max_values: int
            Stop iterating after this many values, may stop
            before this if fewer than this are returned by
            bing.
        """

        self.search_term = search_term
        self.max_values = max_values
        
    # Does a search and returns urls
    def _search(self, total, skip):
        headers = {"Ocp-Apim-Subscription-Key" : self.key}
        params  = {"q": self.search_term, "count":total, "offset":skip}
        response = requests.get(self.end_point, headers=headers, params=params)
        response.raise_for_status()
        search_results = response.json()
        self.total_images = search_results['totalEstimatedMatches']
        self.max_values = min([self.max_values, self.total_images])
        return (el['contentUrl'] for el in search_results['value'])

    def __iter__(self):
        self.page_number = 0
        self.images_returned = 0
        self.search_results = self._search(min([self.MAX_IMAGES, self.max_values]), 0)
        return self
    
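    def __next__(self):
        # Stop once the requested number of urls has been handed out.
        if self.images_returned >= self.max_values:
            raise StopIteration
        try:
            url = next(self.search_results)
        # This means that the current page is exhausted, but the user still
        # wants more, which means we have to call the API again, using the
        # number of urls already returned as the offset.
        except StopIteration:
            remaining_images = self.max_values - self.images_returned
            next_return = min([self.MAX_IMAGES, remaining_images])
            self.search_results = self._search(next_return, self.images_returned)
            # If bing has nothing more to give, this raises StopIteration
            # and ends the loop.
            url = next(self.search_results)
        # Only count a url once it has actually been fetched, so that the
        # offset of the next page does not skip an image.
        self.images_returned += 1
        return url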
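
The paging arithmetic that `ImageIterator` relies on can be looked at in isolation. The sketch below is not part of the class; `plan_requests` is a hypothetical helper that lists the `count` and `offset` values a full download would send, assuming every request returns exactly as many results as were asked for (in practice Bing may return fewer).

```python
def plan_requests(total, page_size=150):
    """Yield (count, offset) pairs that together cover `total` results."""
    offset = 0
    while offset < total:
        # Never ask for more than one page, or more than is still missing.
        count = min(page_size, total - offset)
        yield count, offset
        # The next page starts right after the results requested so far.
        offset += count
```

For example, `list(plan_requests(400))` gives `[(150, 0), (150, 150), (100, 300)]`: two full pages followed by a final partial one.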


In [None]:
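def process_image(url, directory=None, name=None, size=400):
    """
    Gets the image from a url, resizes it to size x size pixels (without
    preserving the aspect ratio), converts it to greyscale and saves it
    to a directory. If name is None the image is returned instead.
    """
    
    try:
        response = requests.get(url, timeout=1.)
        # Don't try to decode an error page as an image.
        response.raise_for_status()
        img = Image.open(BytesIO(response.content))
        img = img.resize((size,size)).convert('L')
        if name is None:
            return img
        file_name = os.path.join(directory, name)
        img.save(file_name)
    # Catch Exception rather than using a bare except, so that interrupting
    # the kernel (KeyboardInterrupt) still stops a long download.
    except Exception as err:
        print("Can't do {}: {}".format(url, err))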
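
Note that `resize` squashes non-square pictures into a square, which distorts them. One possible alternative, shown here only as a sketch, is to crop the largest centred square first. `center_square_box` is a hypothetical helper, not something `process_image` currently uses.

```python
def center_square_box(width, height):
    """Return the (left, top, right, bottom) box of the largest centred square."""
    side = min(width, height)
    # Split the surplus along the longer axis evenly between both edges.
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)
```

With PIL this could be applied as `img.crop(center_square_box(*img.size))` before the call to `img.resize`, since `img.size` is a `(width, height)` tuple and `crop` takes a box in this form.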

In [None]:
def do_for_term(term, total=1000):
    """
    Run the whole thing for a search term: create a directory for the
    term under base_dir and save up to `total` processed images in it.
    """
    
    safe_term = term.replace(" ", "_")

    DIR = os.path.join(base_dir, safe_term)

    # Create the directory for this term unless it already exists.
    try:
        os.mkdir(DIR)
    except FileExistsError:
        pass
    
    for i,url in enumerate(ImageIterator(term, total)):
        print(i)
        name="{}_{}.jpg".format(safe_term, i)
        # Failed downloads are reported by process_image and simply skipped.
        process_image(url, size=128, directory=DIR, name=name)

In [4]:
for term in search_terms:
    print("Looking for pictures of {}".format(term))
    do_for_term(term)

Looking for pictures of cat
Looking for pictures of dog
Looking for pictures of monkey
Looking for pictures of donkey
Looking for pictures of donald trump
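
Since some downloads fail and are skipped, it can be worth checking how many pictures actually ended up in each class directory. The following is a minimal sketch under the naming scheme used above (spaces replaced by underscores, files ending in `.jpg`); `count_images` is a hypothetical helper, not part of the original notebook.

```python
import os


def count_images(base_dir, terms):
    """Return a dict mapping each search term to its number of saved .jpg files."""
    counts = {}
    for term in terms:
        # Same directory naming as in do_for_term.
        folder = os.path.join(base_dir, term.replace(" ", "_"))
        if os.path.isdir(folder):
            counts[term] = sum(
                1 for name in os.listdir(folder) if name.endswith(".jpg")
            )
        else:
            # Nothing was downloaded for this term.
            counts[term] = 0
    return counts
```

Running `count_images(base_dir, search_terms)` after the download gives a quick view of how balanced the classes are.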
