## Motivation
We want to train a model to classify images as either dogs or cats.

## Problem
We need a dataset to do this. We will get data from the [Unsplash Image API](https://unsplash.com/developers).

## Caveats
Sometimes, raw data is unsuitable for machine learning algorithms. For instance, we may want:
- Only images that are landscape (i.e. width > height)
- All our images to be of the same resolution

---
## Step 1: Get cat and dog image URLs from the API
We will use the [`search/photos` GET method](https://unsplash.com/documentation#search-photos).

In [1]:
import requests as re
from IPython.display import JSON

# API variables
root_endpoint = 'https://api.unsplash.com/'
client_id = 'dZAXqQ7AEOsa9Y0Gw4hYRiBc-Kb1qVzNs7wHHhiPF9c'

# Convenience function for making API calls and grabbing results
def search_photos(search_term):
    api_method = 'search/photos'
    endpoint = root_endpoint + api_method
    response = re.get(endpoint, 
                      params={'query': search_term, 'per_page': 30, 'client_id': client_id})
    
    # Stop on a failed request; an error response has no 'results' key
    if response.status_code != 200:
        print(f'Bad status code: {response.status_code}')
        return []
        
    result = response.json()
    image_urls = [img['urls']['small'] for img in result['results']]
    
    return image_urls

In [2]:
dog_urls = search_photos('dog')
cat_urls = search_photos('cat')
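
Each call above returns a single page of results (30 here). If more images are needed, the search endpoint also accepts a `page` parameter and reports `total_pages` in its response. The following is a minimal sketch of walking through several pages, not part of the original notebook: `collect_image_urls`, `fetch_page`, and `fake_fetch_page` are hypothetical names, and `fetch_page(page)` is assumed to return the parsed JSON of one `search/photos` response. A fake fetcher with made-up URLs stands in for the API so the example runs offline.

```python
def collect_image_urls(fetch_page, max_pages=3, size='small'):
    """Gather image URLs across several result pages.

    fetch_page(page) is assumed to return the parsed JSON of one
    search/photos response: a dict with 'total_pages' and 'results'.
    """
    urls = []
    for page in range(1, max_pages + 1):
        result = fetch_page(page)
        urls.extend(img['urls'][size] for img in result['results'])
        # Stop early once the API reports no further pages
        if page >= result.get('total_pages', page):
            break
    return urls


# Stand-in for the real API so the example runs offline (made-up data)
def fake_fetch_page(page):
    return {
        'total_pages': 2,
        'results': [{'urls': {'small': f'https://example.com/p{page}_{i}.jpg'}}
                    for i in range(2)],
    }


demo_urls = collect_image_urls(fake_fetch_page, max_pages=5)
print(len(demo_urls))
```

In the notebook, `fetch_page` could wrap the same `re.get(...)` call used in `search_photos`, with `'page': page` added to `params`.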

---
## Step 2: Download the images from the URLs
(Step 2a: Google [how to download an image from a URL in Python](https://stackoverflow.com/a/40944159))

For now, we'll just define a function that downloads a single image. In Step 3, we'll call it on each URL in turn and process the result before saving it.

In [9]:
from PIL import Image

def download_image(url):
    image = Image.open(re.get(url, stream=True).raw)
    return image

In [10]:
test_img = download_image(cat_urls[0])
test_img.show()
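
`download_image` reads from `response.raw`, which hands PIL the byte stream as it arrives, without the content decoding that `requests` normally applies. One alternative, sketched here under the assumption that each image fits comfortably in memory, is to decode the full response body from a `BytesIO` buffer. `image_from_bytes` is a hypothetical helper, and the demonstration builds its own JPEG bytes so that it runs without network access.

```python
from io import BytesIO
from PIL import Image


def image_from_bytes(data):
    """Decode raw image bytes (e.g. response.content) into a PIL Image."""
    image = Image.open(BytesIO(data))
    image.load()  # force decoding now so errors surface here, not later
    return image


# Offline demonstration: encode a small image to JPEG bytes, then decode it
buffer = BytesIO()
Image.new('RGB', (64, 48), color='gray').save(buffer, format='JPEG')
decoded = image_from_bytes(buffer.getvalue())
print(decoded.size, decoded.format)
```

In the notebook, this could be called as `image_from_bytes(re.get(url, timeout=10).content)`.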

---
## Step 3: Download and save images that meet our requirements
We'll need to know how to work with the [PIL Image data type](https://pillow.readthedocs.io/en/stable/reference/Image.html), which is what our `download_image(url)` function returns. Namely, we need to be able to a) get its resolution and b) resize it.
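
Both operations can be tried on a blank image before touching real downloads. This is a small illustrative sketch; the 640x480 starting size is an arbitrary choice.

```python
from PIL import Image

# A blank stand-in image; a downloaded photo behaves the same way
image = Image.new('RGB', (640, 480))

# a) Resolution: .size is a (width, height) tuple; .width and .height also exist
width, height = image.size
print(width, height, image.width > image.height)

# b) Resizing: resize() returns a new image and leaves the original unchanged
resized = image.resize((256, 256))
print(resized.size, image.size)
```

Note that `resize` does not preserve aspect ratio, so resizing a landscape image to a square target stretches it.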

In [21]:
import os


def is_landscape(image):
    return image.width > image.height


def save_category_images(urls, category_name, resolution=(256, 256)):
    save_folder = f'saved_images/{category_name}'
    # makedirs also creates the parent 'saved_images' folder if it is missing
    os.makedirs(save_folder, exist_ok=True)
        
    for i, url in enumerate(urls):
        image = download_image(url)
        if is_landscape(image):
            image = image.resize(resolution)
            filename = f'{i:05d}.jpg'
            image.save(os.path.join(save_folder, filename))

In [20]:
save_category_images(dog_urls, 'dogs')
save_category_images(cat_urls, 'cats')
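
After saving, it may be worth confirming that every file really has the same resolution. The sketch below is one possible check, not part of the original notebook: `saved_resolutions` is a hypothetical helper, and the demonstration writes blank images to a temporary folder instead of reading `saved_images/`.

```python
import os
import tempfile
from PIL import Image


def saved_resolutions(folder):
    """Return the set of (width, height) sizes found among files in folder."""
    sizes = set()
    for name in sorted(os.listdir(folder)):
        with Image.open(os.path.join(folder, name)) as image:
            sizes.add(image.size)
    return sizes


# Offline demonstration with a temporary folder instead of saved_images/
with tempfile.TemporaryDirectory() as demo_folder:
    for i in range(3):
        Image.new('RGB', (256, 256)).save(os.path.join(demo_folder, f'{i:05d}.jpg'))
    demo_sizes = saved_resolutions(demo_folder)

print(demo_sizes)
```

A result containing a single size means the resizing step was applied consistently; calling `saved_resolutions('saved_images/dogs')` would run the same check on the real output.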