# Data preparation - download from Microsoft OneDrive

---

In this mandatory exercise, you will implement an image captioning network. Training and validation require images with corresponding descriptions; the dataset used here is "Common Objects in Context" (COCO) 2017. To solve the image captioning problem, you will use a neural network based on a CNN encoder and an RNN decoder. The CNN encoder is the "Wide ResNet-101-2" network pretrained on the ImageNet dataset. More information about the structure of the neural network is given in the "exercise" notebook.

You don't need to run this notebook to start on the implementation part in the "exercise" notebook, but you must run it before you can train the network. The COCO dataset is large (~19 GB). To reduce the amount of data that has to be downloaded and preprocessed, this notebook downloads the following directly:

- The features (as pickle files) produced by the Wide ResNet-101-2 network on the training and validation images.
- The validation images.
- A vocabulary file (pickle) including information about the words in the COCO dataset.

The files will take up approximately 4 GB. Check the amount of free space on your hard drive before you start downloading and extracting the files. After the zip files have been downloaded and extracted, you can delete them to free up disk space if needed.
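
The free-space check can also be done from Python before any download starts. The following is a minimal sketch using the standard library's `shutil.disk_usage`. The function names and the 4 GB threshold (taken from the estimate above) are illustrative assumptions and not part of the exercise code. It is also an assumption that 4 GB is enough while the zip archives and the extracted files exist side by side, so some extra headroom may be wise.

```python
import shutil
from pathlib import Path

# Approximate size quoted in the text above (assumption: 1 GB = 10**9 bytes).
REQUIRED_GB = 4


def free_space_gb(path="."):
    """Return the free space in GB on the drive that holds `path`."""
    usage = shutil.disk_usage(Path(path).resolve())
    return usage.free / 1e9


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """Return True if at least `required_gb` GB are free at `path`."""
    return free_space_gb(path) >= required_gb


# Check the drive of the current working directory, where "data/coco/" will be created.
print(f"Free space: {free_space_gb('.'):.1f} GB")
print("Enough space for the download:", has_enough_space("."))
```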

Software version:
- Python 3.7

In [None]:
import utils_data_preparation_download_onedrive.download_from_onedrive as onedrive
download_dir = "data/coco/"

---


<a id='Task1'></a>
### Step 1: Download and extract COCO validation images

After extraction, the images can be found in the "val2017" subfolder of "data/coco".


**Note**: If the process fails at some point, you may need to go into the "data/coco" folder and delete the files which were not downloaded or extracted correctly before trying again.
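
One way to tell whether a downloaded archive is one of the files that needs deleting is to test it with the standard library's `zipfile` module. This is only a sketch: `zip_is_intact` and `remove_if_broken` are hypothetical helper names, not functions of the `download_from_onedrive` module, and the check covers the zip file itself, not a partially extracted folder.

```python
import zipfile
from pathlib import Path


def zip_is_intact(zip_path):
    """Return True if `zip_path` exists, is a zip archive, and passes the CRC check."""
    zip_path = Path(zip_path)
    if not zip_path.is_file() or not zipfile.is_zipfile(zip_path):
        return False
    try:
        with zipfile.ZipFile(zip_path) as archive:
            # testzip() returns the name of the first corrupt member, or None.
            return archive.testzip() is None
    except zipfile.BadZipFile:
        return False


def remove_if_broken(zip_path):
    """Delete the archive if it is incomplete or corrupt; return True if it was deleted."""
    zip_path = Path(zip_path)
    if zip_path.exists() and not zip_is_intact(zip_path):
        zip_path.unlink()
        return True
    return False
```

For example, `remove_if_broken("data/coco/val2017.zip")` would delete a truncated download so that the cell can be run again, assuming the archive was saved under that name in "data/coco".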

In [None]:
# download validation images
filename = "val2017.zip"
data_url = "http://images.cocodataset.org/zips/val2017.zip"
file_path = onedrive.download(url=data_url, download_dir=download_dir, filename=filename)
onedrive.extract(file_path=file_path, download_dir=download_dir)

---

<a id='Task2'></a>
### Step 2: Download vocabulary ###


The vocabulary will be stored as a pickle file in "data/coco/vocabulary/".

**Note**: If the process fails at some point, you may need to go into the "data/coco" folder and delete the files which were not downloaded or extracted correctly before trying again.

In [None]:
# download vocabulary from OneDrive
filename = 'vocabulary.zip'
data_url = 'https://onedrive.live.com/download?cid=36039A0F53011CF6&resid=36039A0F53011CF6%21165454&authkey=ACGCj1C2lRMWfCk'
file_path = onedrive.download(url=data_url, download_dir=download_dir, filename=filename)
onedrive.extract(file_path=file_path, download_dir=download_dir)
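
Once the vocabulary has been extracted, a quick way to confirm that it is readable is to load the pickle and look at its type and size. The sketch below makes several assumptions: the file name and the structure of the stored object are not stated here, so the hypothetical helper `describe_pickles` simply scans the folder for `.pickle`/`.pkl` files and reports what it finds. Only unpickle files from a source you trust.

```python
import pickle
from pathlib import Path


def describe_pickles(folder):
    """Load every pickle file under `folder` and return {file name: summary string}."""
    summaries = {}
    folder = Path(folder)
    if not folder.is_dir():
        return summaries  # nothing downloaded yet
    for path in sorted(folder.rglob("*")):
        if path.suffix not in {".pickle", ".pkl"}:
            continue
        with open(path, "rb") as handle:
            obj = pickle.load(handle)
        size = len(obj) if hasattr(obj, "__len__") else "n/a"
        summaries[path.name] = f"{type(obj).__name__}, length {size}"
    return summaries


# Folder name taken from the text above; prints nothing if the folder is missing.
for name, summary in describe_pickles("data/coco/vocabulary").items():
    print(name, "->", summary)
```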

---
<a id='Task3'></a>
### Step 3: Download Wide ResNet-101-2 features for the training and validation images ###



In [None]:
# download validation data CNN (Wide ResNet-101-2) features from OneDrive
filename = 'Val2017_cnn_features.zip'
data_url = 'https://onedrive.live.com/download?cid=36039A0F53011CF6&resid=36039A0F53011CF6%21165455&authkey=AGWzw3droynoGpg'
file_path = onedrive.download(url=data_url, download_dir=download_dir, filename=filename)
onedrive.extract(file_path=file_path, download_dir=download_dir)

In [None]:
# download training data CNN (Wide ResNet-101-2) features from OneDrive
filename = 'Train2017_cnn_features.zip'
data_url = 'https://onedrive.live.com/download?cid=36039A0F53011CF6&resid=36039A0F53011CF6%21165456&authkey=APOhU3ntqJPN_Us'
file_path = onedrive.download(url=data_url, download_dir=download_dir, filename=filename)
onedrive.extract(file_path=file_path, download_dir=download_dir)
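
As mentioned at the top of the notebook, the zip files can be deleted after extraction to free up disk space. A minimal sketch of that cleanup follows. It assumes the archives were saved directly inside "data/coco/" under the file names used above, which depends on how the download helper stores them. `remove_zip_files` is a hypothetical helper and defaults to a dry run that only lists the files.

```python
from pathlib import Path


def remove_zip_files(download_dir, dry_run=True):
    """List (dry_run=True) or delete (dry_run=False) the zip files directly in `download_dir`.

    Returns the names of the zip files that were found.
    """
    folder = Path(download_dir)
    if not folder.is_dir():
        return []
    zip_paths = sorted(p for p in folder.glob("*.zip") if p.is_file())
    if not dry_run:
        for path in zip_paths:
            path.unlink()
    return [path.name for path in zip_paths]


# Dry run: show which archives would be deleted.
print(remove_zip_files("data/coco/"))
```

After checking the listed names and confirming that the extracted folders are complete, calling `remove_zip_files("data/coco/", dry_run=False)` removes the archives.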