# Download Images

There are a few things to keep in mind here.
1. We want a broad sample of images, so I group the records by family and genus and build lists that pick one (or two) images from each family/genus combination.
2. There may be more than one image for a coreid. In that case we cannot know which image the annotations refer to, so we need to exclude these coreids.
3. There will be bad image files, bad image URLs, and uncooperative hosts.
4. We shuffle the records so that we don't hit a particular host too hard.
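
Points 1, 2, and 4 above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in `herbarium.pylib.download_images`: the function name `pick_sample`, the record keys (`coreid`, `family`, `genus`, `uri`), and the `per_group` and `seed` parameters are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def pick_sample(records, per_group=1, seed=None):
    """Pick up to per_group records per (family, genus), then shuffle.

    Hypothetical helper: each record is assumed to be a dict with
    'coreid', 'family', 'genus', and 'uri' keys.
    """
    rng = random.Random(seed)

    # Point 2: drop every coreid that has more than one image, because
    # we cannot tell which image the annotations refer to.
    image_counts = Counter(rec['coreid'] for rec in records)
    unambiguous = [rec for rec in records if image_counts[rec['coreid']] == 1]

    # Point 1: group by family/genus and sample from each group.
    groups = defaultdict(list)
    for rec in unambiguous:
        groups[(rec['family'], rec['genus'])].append(rec)

    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))

    # Point 4: shuffle so consecutive requests are spread across hosts.
    rng.shuffle(sample)
    return sample
```

Dropping ambiguous coreids before grouping means a family/genus combination whose only records are ambiguous contributes nothing to the sample, which seems preferable to downloading an image that cannot be matched to its annotations.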

In [1]:
import sys

sys.path.append('..')

In [2]:
import multiprocessing
from pathlib import Path

import pandas as pd

from herbarium.pylib import download_images as di

In [3]:
DATA_DIR = Path('..') / 'data'

URI_DIR = DATA_DIR / 'temp'
IMAGE_DIR = DATA_DIR / 'images'

ERROR1 = DATA_DIR / 'temp' / 'download_errors.txt'
ERROR2 = DATA_DIR / 'temp' / 'validate_errors.txt'

DB = DATA_DIR / 'angiosperms.sqlite'

## Sample records from each family/genus combination

In [None]:
di.sample_records(DB, URI_DIR, count=2000)

## Download images

In [None]:
csvs = list(URI_DIR.glob('uris_*.csv'))

In [None]:
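# Hand each CSV file to a worker process in the pool.
with multiprocessing.Pool(processes=6) as pool:
    results = [
        pool.apply_async(di.download_images, (csv_file, IMAGE_DIR, ERROR1))
        for csv_file in csvs
    ]
    # Block until every task finishes; get() re-raises any worker exception.
    all_results = [result.get() for result in results]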
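
To illustrate how a single download might cope with bad URLs and uncooperative hosts, here is a minimal sketch using only the standard library. I am not showing what `di.download_images` actually does; `fetch_image` and its parameters are hypothetical names chosen for this example.

```python
import http.client
import shutil
import urllib.error
import urllib.request
from pathlib import Path


def fetch_image(url, dest, error_log, timeout=30):
    """Download one URL to dest; log failures and return True on success.

    Hypothetical helper, not part of herbarium.pylib.
    """
    dest = Path(dest)
    if dest.exists():
        return True  # already downloaded on an earlier run

    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            with open(dest, 'wb') as out_file:
                shutil.copyfileobj(response, out_file)
    except (urllib.error.URLError, http.client.HTTPException,
            OSError, ValueError) as err:
        # Bad URL, unreachable host, timeout, or broken transfer:
        # remove any partial file, record the failure, and move on.
        dest.unlink(missing_ok=True)
        with open(error_log, 'a') as log:
            log.write(f'{url}\t{err}\n')
        return False
    return True
```

Logging the failure and returning `False`, rather than raising, lets a long batch continue past individual bad records. The timeout keeps a slow host from stalling a worker indefinitely.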

## Validate images

In [4]:
di.validate_images(IMAGE_DIR, DB, error=ERROR2)
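
As a rough idea of what a cheap validation pass could look like, the sketch below flags files that do not start and end with the JPEG markers, which catches many truncated downloads and HTML error pages saved under an image name. It assumes the images are JPEGs with a `.jpg` extension, which the notebook does not state, and it is not how `di.validate_images` is known to work; `looks_like_jpeg` and `find_bad_images` are hypothetical names. The check is only a heuristic: a thorough check would decode each file with an imaging library, and some valid files carry extra bytes after the end marker.

```python
from pathlib import Path

JPEG_START = b'\xff\xd8'  # start-of-image marker
JPEG_END = b'\xff\xd9'    # end-of-image marker


def looks_like_jpeg(path):
    """Return True if the file starts and ends with the JPEG markers."""
    data = Path(path).read_bytes()
    return (len(data) >= 4
            and data.startswith(JPEG_START)
            and data.endswith(JPEG_END))


def find_bad_images(image_dir, error_log):
    """Append every *.jpg in image_dir that fails the check to error_log."""
    bad = sorted(
        path for path in Path(image_dir).glob('*.jpg')
        if not looks_like_jpeg(path)
    )
    with open(error_log, 'a') as log:
        for path in bad:
            log.write(f'{path}\n')
    return bad
```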