<center> <font size='6' font-weight='bold'> DataBase Downloader </font> </center>  
<center> <i> Projet Navee</i> </center>
<center> <i> Tony </i> </center>  

**Objective:**  
Re-downloading the whole training set every time we want to run tests on it is impractical, so we implement a quick and simple routine that downloads a given number of pictures into the `data_training` folder.  

Each file is named after its ID in the database (e.g. `19388.jpg`).  
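As a minimal sketch of how this naming convention can be used later, the helper below recovers the IDs from the file names found in a folder. The function name `ids_in_folder` is hypothetical (it is not part of `data.py` or `Data_Gen.py`), and it assumes the `<ID>.jpg` naming used by the download cell further down.

```python
import os

def ids_in_folder(folder, extension='.jpg'):
    """Return the sorted integer IDs of the files named <ID><extension> in folder.

    Hypothetical helper, not part of the project's modules.
    """
    ids = []
    for name in os.listdir(folder):
        stem, ext = os.path.splitext(name)
        # Keep only files such as '19388.jpg'; skip anything else (e.g. '.DS_Store').
        if ext.lower() == extension and stem.isdigit():
            ids.append(int(stem))
    return sorted(ids)
```

Calling `ids_in_folder('data_training')` would then give the list of IDs that are actually available on disk.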

Next, we will create a class that inherits from *DataGenerator* (from Data_Gen.py) and change its behaviour so that everything works the same way (for instance, how we look up particular labels in the database), except that instead of downloading the files it reads them directly from the `data_training` directory. Moreover, we'll add '... WHERE ID in {list_IDs}' to the queries to make sure we never get IDs that weren't downloaded.
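The `WHERE ID in {list_IDs}` restriction could be sketched as follows with the standard `sqlite3` module. This is only an illustration under stated assumptions: the table name `images`, the column name `ID` and the function `select_downloaded` are hypothetical, since the actual schema of `database_BAM.sqlite` and the queries in `data.py` are not shown here.

```python
import sqlite3

def select_downloaded(conn, list_IDs, table='images', id_column='ID'):
    """Return the rows of `table` whose ID is in list_IDs.

    `table` and `id_column` are hypothetical names, not taken from the real database.
    """
    if not list_IDs:
        return []
    # One '?' placeholder per ID, so the values are bound by sqlite3 instead of
    # being formatted into the SQL string.
    placeholders = ', '.join('?' for _ in list_IDs)
    query = f'SELECT * FROM {table} WHERE {id_column} IN ({placeholders})'
    return conn.execute(query, list(list_IDs)).fetchall()
```

Binding the IDs through placeholders avoids building the `IN (...)` list by string formatting. SQLite limits the number of bound parameters per statement, so a very long list of IDs would have to be split into chunks.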

---

**Parameters:**  
Set `nb_images` to the number of images to include in the training set.  

In [45]:
nb_images = 100

# Preparations

In [3]:
import os
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from urllib.request import urlretrieve # to download images from the web

## Change to the project root

In [4]:
os.getcwd()

'/Users/Tony/Desktop/projet-navee/data'

In [5]:
os.chdir('..')

In [6]:
os.getcwd()

'/Users/Tony/Desktop/projet-navee'

## Imports

In [7]:
from data import * # imports data.py
from Data_Gen import * # imports Data_Gen.py

Using TensorFlow backend.


## Config verification

In [8]:
db_path = 'data/database_BAM.sqlite'

Checking if the database has been correctly imported.

In [9]:
assert os.path.exists(db_path), "Database not found 👎🏻\n\
    Please check that you've successfully copied the database in the data\
    directory after having cloned the project ‼️"
print ('Dataset found 🤙🏻')

Dataset found 🤙🏻


# Retrieving Data

## Connecting to the database

In [10]:
db = data_base(db_path)

`db` is a custom object defined in `data.py`

## Getting a subset of the data

NB: `db.get_images_` has the following behavior:
- Given the same number of images, it always returns the same IDs (no seed needed).
- However, for different values of `nb_images`, it returns completely different images.

In [46]:
list_IDs = db.get_images_(nb_images)

In [47]:
labels = {i:db.get_label(i) for i in list_IDs}

`labels` has the shape:  
{4288:
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ...}
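Since `labels` maps each ID to a fixed-length binary vector, it can be stacked into a single NumPy array when a model needs one. The sketch below is one possible way to do it; `labels_to_array` is a hypothetical helper, not something defined in `data.py`.

```python
import numpy as np

def labels_to_array(labels, list_IDs):
    """Stack the label vectors of list_IDs into an array of shape (len(list_IDs), n_classes).

    Row i of the result is the label vector of list_IDs[i].
    """
    return np.array([labels[ID] for ID in list_IDs], dtype=np.int64)
```

Passing only the IDs that were successfully downloaded keeps the rows aligned with the files present in `data_training`.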

In [48]:
len(list_IDs)

100

In [49]:
list_IDs[:5]

[171966541, 16777352, 54526109, 129050139, 34253499]

## Downloading images

In [36]:
def download_from_IDs(db, list_IDs, skip_errors=False, verbose=True):
    '''Download all images with the corresponding IDs from the specified database.
    
    Inputs:
        - db = data_base object
        - list_IDs = list of IDs
        - skip_errors = if False, stop at the first failed download
        - verbose = if True, print a message for each failed download
    Output:
        - list_errors = list of the IDs of the images that couldn't be downloaded.
    '''
    
    list_errors = []
    os.makedirs('data_training', exist_ok=True) # make sure the target folder exists
    
    for ID in list_IDs:
        url = db.get_image(ID)
    
        try:
            # NB: All images from the BAM dataset are .jpg.
            filePath = os.path.join('data_training', str(ID)+'.jpg')
            urlretrieve(url, filePath)
        
        except Exception: # e.g. HTTP error or unreachable host
            if verbose:
                print(f'Couldn\'t download image from {url} which corresponds to ID #{ID}.\n')
            
            list_errors.append(ID)
            
            if not skip_errors:
                print('Download interrupted.')
                break # actually stop downloading, as the message says
    return list_errors

def remove_IDs(list_IDs, IDs_to_remove):
    '''Return a new list containing the IDs of list_IDs that are not in
    IDs_to_remove. list_IDs itself is left unchanged.
    
    Inputs:
        - list_IDs = list
        - IDs_to_remove = list
    Output:
        - A new list.
    '''
    
    return [ID for ID in list_IDs if ID not in IDs_to_remove]

In [50]:
list_errors = download_from_IDs(db, list_IDs, skip_errors=True, verbose=True)
list_errors

Couldn't download image from https://mir-s3-cdn-cf.behance.net/project_modules/disp/5416d333859704.56baca3371cd9.jpg which corresponds to ID #213909634.

Couldn't download image from https://mir-s3-cdn-cf.behance.net/project_modules/disp/67997521968169.5630a6d5d3265.jpg which corresponds to ID #146800737.

Couldn't download image from https://mir-s3-cdn-cf.behance.net/project_modules/disp/4d446c34359583.56cdcb5bf2703.jpg which corresponds to ID #216705711.

Couldn't download image from https://mir-s3-cdn-cf.behance.net/project_modules/disp/b5af1024093285.5633010610133.jpg which corresponds to ID #159383615.

Couldn't download image from https://mir-s3-cdn-cf.behance.net/project_modules/disp/a8ff3b16777240.5603af934184e.jpg which corresponds to ID #16777240.

Couldn't download image from https://mir-s3-cdn-cf.behance.net/project_modules/disp/c3964517895527.562c0e8c6df45.jpg which corresponds to ID #121634965.

Couldn't download image from https://mir-s3-cdn-cf.behance.net/project_module

[213909634,
 146800737,
 216705711,
 159383615,
 16777240,
 121634965,
 83886227,
 113246303,
 113246295,
 117440577,
 92274881]

In [51]:
new_list = remove_IDs(list_IDs, list_errors)

In [57]:
print(f'{len(new_list)} downloaded images instead of {nb_images}.')

89 downloaded images instead of 100.
