<center> <font size='6' font-weight='bold'> DataBase Downloader </font> </center>  
<center> <i> Projet Navee</i> </center>
<center> <i> Tony </i> </center>  

**Objective:**  
Since redownloading the whole training set every time we want to run tests on it is impractical, we want to implement a quick and simple algorithm that downloads a given number of pictures into the `data_training` folder.  

Those files are named after their corresponding ID in the database (e.g. `19388.jpg`).  
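To make the naming convention concrete, here is a minimal sketch of how one could work out which images still need to be downloaded. `image_path` and `missing_IDs` are hypothetical helpers (they are not part of `data.py` or `Data_Gen.py`), and the extension is a parameter that defaults to `.jpg` because the download code in this notebook saves the BAM images as `.jpg`.

```python
import os

def image_path(ID, folder='data_training', ext='.jpg'):
    '''Return the path where the image with the given ID is expected to be stored.'''
    return os.path.join(folder, str(ID) + ext)

def missing_IDs(list_IDs, folder='data_training', ext='.jpg'):
    '''Return the IDs whose image file is not yet present in the folder.'''
    return [ID for ID in list_IDs
            if not os.path.exists(image_path(ID, folder, ext))]
```

Calling `missing_IDs(IDs)` before a download run would then restrict the download to the files that are not on disk yet.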

Next, we'll have to create a class that inherits from *DataGenerator* (from Data_Gen.py) and change its behaviour so that everything works the same way (for instance, how we can seek particular labels in the database), except that instead of downloading the files, it looks directly in the `data_training` directory.

However, since it's not practical to keep the whole database on our computers, when we generate the `data_training` folder we'll also create a `database_BAM_training.sqlite` file that contains only the instances that were actually retrieved.
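As a rough illustration of that idea, the sketch below copies a SQLite file and deletes every row whose ID was not retrieved. It is only a sketch under assumptions: the table name `images` and the column name `id` are placeholders, since the real schema of `database_BAM.sqlite` is not shown here, and `copy_and_filter_db` is a hypothetical helper name.

```python
import shutil
import sqlite3

def copy_and_filter_db(src_path, dst_path, list_IDs,
                       table='images', id_column='id'):
    '''Copy the SQLite file at src_path to dst_path, then delete every row
    of `table` whose `id_column` value is not in list_IDs.

    NB: `table` and `id_column` are placeholders; adapt them to the real schema.
    '''
    shutil.copyfile(src_path, dst_path)

    conn = sqlite3.connect(dst_path)
    try:
        # Stage the IDs to keep in a temporary table, which avoids the limit
        # on the number of bound parameters in a single query.
        conn.execute('CREATE TEMP TABLE keep_ids (id INTEGER PRIMARY KEY)')
        conn.executemany('INSERT OR IGNORE INTO keep_ids VALUES (?)',
                         [(int(ID),) for ID in list_IDs])
        conn.execute(f'DELETE FROM "{table}" '
                     f'WHERE "{id_column}" NOT IN (SELECT id FROM keep_ids)')
        conn.execute('DROP TABLE keep_ids')
        conn.commit()
        # Reclaim the disk space freed by the deleted rows.
        conn.execute('VACUUM')
    finally:
        conn.close()
```

The original file is never modified, so the full database stays available to anyone who needs it.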

---

**Parameters:**  
Please set `nb_images` to the number of images to include in the training set.  
Make sure that `image_size` is the same in all the different algorithms!

In [3]:
nb_images = 100

In [12]:
image_size = (224,224) # resize parameter

# Preparations

In [23]:
import os
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import urllib # to download images from the web
from shutil import copyfile # to create a copy of the dataset

## Change to the project root directory

In [5]:
os.getcwd()

'/Users/Tony/Desktop/projet-navee/data'

In [6]:
os.chdir('..')

In [21]:
os.getcwd()

'/Users/Tony/Desktop/projet-navee'

## Imports

In [8]:
from data import * # imports data.py
from Data_Gen import * # imports Data_Gen.py

Using TensorFlow backend.


## Config verification

In [9]:
db_path = 'data/database_BAM.sqlite'

Checking if the database has been correctly imported.

In [10]:
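# The message is split across adjacent string literals so that it contains
# no stray indentation.
assert os.path.exists(db_path), (
    "Database not found 👎🏻\n"
    "Please check that you've successfully copied the database into the data "
    "directory after having cloned the project ‼️"
)
print('Dataset found 🤙🏻')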

Dataset found 🤙🏻


# Retrieving Data

## Connecting to the database

In [11]:
db = data_base(db_path)

`db` is an instance of the custom `data_base` class defined in `data.py`.

## Getting a subset of the data

In [13]:
IDs = db.get_images_(nb_images)

In [14]:
labels = {i:db.get_label(i) for i in IDs}

`labels` is a dictionary that maps each image ID to its label vector, for instance:  
{4288:
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], ...}
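Given that structure, seeking the images that carry a particular label can be sketched as a simple filter over the dictionary. `IDs_with_label` is a hypothetical helper, and the toy dictionary below uses made-up IDs and shortened vectors purely for illustration.

```python
def IDs_with_label(labels, label_index):
    '''Return the IDs whose label vector has a 1 at position label_index.'''
    return [ID for ID, vector in labels.items() if vector[label_index] == 1]

# Toy dictionary with the same structure as `labels` (IDs and vectors are made up).
toy_labels = {
    1: [1, 0, 0],
    2: [0, 1, 0],
    3: [1, 1, 0],
}
print(IDs_with_label(toy_labels, 0))  # → [1, 3]
```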

In [15]:
len(IDs)

100

In [18]:
IDs[:5]

[159078039, 171966483, 143825721, 117440577, 224022099]

## Downloading images

In [None]:
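import requests
from io import BytesIO
from PIL import Image

def download_from_IDs(db, list_IDs, skip_errors=False):
    '''Download all images with the corresponding IDs from the specified database.
    
    Inputs:
        - db = data_base object
        - list_IDs = list of IDs
        - skip_errors = if True, carry on with the next ID after a failed download
    Output:
        - list_errors = list of the IDs of the images that couldn't be downloaded.
    '''
    
    list_errors = []
    
    # Make sure the target folder exists before writing into it.
    os.makedirs('data_training', exist_ok=True)
    
    for ID in list_IDs:
        url = db.get_image(ID)
    
        try:
            reponse = requests.get(url)
            reponse.raise_for_status()
            
            # Check that the payload really is an image before saving it.
            Image.open(BytesIO(reponse.content)).verify()
            
            # NB: All images from the BAM dataset are .jpg.
            filePath = os.path.join('data_training', str(ID)+'.jpg')
            
            # Write the bytes we already have instead of downloading them a second time.
            with open(filePath, 'wb') as f:
                f.write(reponse.content)
        
        except Exception:
            print(f'Couldn\'t download image from {url} which corresponds to ID #{ID}.\n')
            
            list_errors.append(ID)
            
            if not skip_errors:
                print('Download interrupted.')
                return list_errors
    
    # Always return the list, even when every download succeeded.
    return list_errors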

In [None]:
def create_subset_db(db_path, list_IDs):
    '''Create a copy of the given database in data_training and keep only the instances
    that have their ID in list_IDs.
    
    Inputs:
        - db_path = path to the original database file
        - list_IDs = list of IDs
    Output:
        - None.
    '''
    
    db_path_training = os.path.join('data_training', 'database_BAM_training.sqlite')
    
    # First, we create a copy of the database.
    copyfile(db_path, db_path_training)
    
    # Then we 