# Processing dataset

This notebook illustrates how to process the datasets used in the cycle-WGAN experiments. It shows how to apply the splits proposed by Xian et al. [1] to the CUB, AWA1 and SUN datasets. The h5 files for CUB, SUN, FLO and AWA1 are also provided at the end of this page, in case you need them for your experiments.

### Requirements

```
jupyter                   1.0.0
jupyter_client            5.2.3
jupyter_console           5.2.0
jupyter_core              4.4.0

git clone https://github.com/rfelixmg/util.git
```

In [18]:
import os
import sys
sys.path.append(os.path.abspath('../../'))
sys.path.append(os.path.abspath('../'))
sys.path.append(os.path.abspath('./'))

from scipy.io import loadmat
from util.datasets import load_h5
from util.storage import Container, DataH5py

import numpy as np
from copy import deepcopy

print("importing dependencies...")


importing dependencies...


In [2]:
print("Preparing directories")
!mkdir ../../data/cub ../../data/awa1 ../../data/sun
!mkdir ../../data/._original_

!wget -nc http://datasets.d2.mpi-inf.mpg.de/xian/xlsa17.zip -O ../../data/._original_/xian.etal.zip
!unzip -u -n -q ../../data/._original_/xian.etal.zip -d ../../data/._original_/ 
!mv ../../data/._original_/xlsa17/* ../../data/._original_/
!wget -q -nc https://www.dropbox.com/s/7yf0b1cx900ardo/cub_attributes_reed.npy?dl=0 -O ../../data/._original_/data/CUB/reed.etal.npy
#!ls ../../data/._original_/

_download_folder_ = '../../data/._original_/'
_outdir = '../../data/'

print("done")

Preparing directories
mkdir: cannot create directory ‘../../data/cub’: File exists
mkdir: cannot create directory ‘../../data/awa1’: File exists
mkdir: cannot create directory ‘../../data/sun’: File exists
mkdir: cannot create directory ‘../../data/._original_’: File exists
File `../../data/._original_/xian.etal.zip' already there; not retrieving.
mv: cannot move '../../data/._original_/xlsa17/code' to '../../data/._original_/code': Directory not empty
mv: cannot move '../../data/._original_/xlsa17/data' to '../../data/._original_/data': Directory not empty
done
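
The `File exists` messages above come from `mkdir` being re-run on directories that are already there. A minimal sketch of an idempotent alternative in Python is shown below; the helper name `ensure_data_dirs` is hypothetical and is not part of the `util` package. The demo runs in a temporary directory rather than in `../../data`.

```python
from pathlib import Path
import tempfile


def ensure_data_dirs(root, names=("cub", "awa1", "sun", "._original_")):
    """Create root/<name> for each name; existing directories are left untouched."""
    root = Path(root)
    created = []
    for name in names:
        target = root / name
        # parents=True builds missing parents, exist_ok=True avoids "File exists" errors
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created


# Demo in a throwaway directory: calling the helper twice raises no error.
with tempfile.TemporaryDirectory() as tmp:
    first = ensure_data_dirs(tmp)
    second = ensure_data_dirs(tmp)
    print([p.name for p in second])
```

In the notebook itself, the equivalent call would be `ensure_data_dirs('../../data')`.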


In [3]:
def prepare_dataset(dname, basedir, has_semantic=False):
    from scipy.io import loadmat
    import numpy as np
    from util.storage import DictContainer, Container

    _dir_src = '{}/data/{}/'.format(basedir, dname)
    
    # For more information, please check Xian et al. [1]
    _all_splits = loadmat('{}/{}'.format(_dir_src, 'att_splits.mat'))
    _odata = loadmat('{}/data/{}/res101.mat'.format(basedir, dname))
    
    # Check whether the features are stored as R^(NxD) or R^(DxN),
    # where N is the number of samples and D = 2048; transpose to NxD if needed
    if _odata['features'].shape[0] == 2048:
        _odata['features'] = _odata['features'].T
    # from 0 to (N-1)
    all_ids = np.arange(0, _odata['features'].shape[0])
    
    # Labels from 0 to (|C|-1)
    _odata['y'] = _odata['labels'] - 1
    
    # For more information, please check Xian et al. [1]
    if has_semantic != False:
        # If the semantic space is different from that of Xian et al. [1]
        att_continuous = np.load(has_semantic)
    else:
        att_continuous = _all_splits['att'].T

    # Although we don't use the binary attributes, you can easily compute them following [2]
    # (here: threshold at the global mean of the attribute matrix)
    binary_attributes = (att_continuous >= att_continuous.mean()).astype(np.float64)

    # Sorting semantic features by class label
    all_continuous_attributes = np.array([att_continuous[_label] for _label in _odata['y']])
    all_binary_attributes = np.array([binary_attributes[_label] for _label in _odata['y']])
    
    # naming by Xian et al [1]
    _sets = ['train_loc', 'val_loc', 'trainval_loc', 'test_seen_loc', 'test_unseen_loc']
    # our naming for h5Files
    _namespaces = ['train_val', 'val', 'train', 'test/seen', 'test/unseen']
    
    ndb = DictContainer()
    for _set, _name in zip(_sets, _namespaces):
        
        #print(_set, _name)
        # split locations are 1-based sample indices, so we subtract 1
        _ids = _all_splits[_set] - 1
        ndb.set_param('{}/A/continuous'.format(_name), all_continuous_attributes[_ids].squeeze())
        ndb.set_param('{}/A/binary'.format(_name), all_binary_attributes[_ids].squeeze())
        ndb.set_param('{}/X'.format(_name), _odata['features'][_ids].squeeze())
        ndb.set_param('{}/Y'.format(_name), _odata['labels'][_ids].squeeze())
        ndb.set_param('{}/y'.format(_name), _odata['y'][_ids].squeeze())
    ndb = Container(ndb.as_dict())
    ndb.name = dname
    if has_semantic != False:
        ndb.semantic = has_semantic
    return ndb
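
The indexing steps inside `prepare_dataset` can be sketched on toy arrays. The values below are made up and only mimic the shapes `loadmat` typically returns (column vectors for labels and split locations); they are assumptions, not data from the real `.mat` files.

```python
import numpy as np

# Toy stand-ins for the .mat contents (assumed shapes, not real data).
features = np.arange(2048 * 6, dtype=np.float64).reshape(2048, 6)  # stored as D x N
labels = np.array([[1], [3], [2], [1], [3], [2]])   # 1-based class labels, shape (N, 1)
att = np.array([[0.9, 0.1, 0.0, 0.5],
                [0.2, 0.8, 0.4, 0.1],
                [0.0, 0.3, 0.7, 0.6]])              # one attribute row per class
test_unseen_loc = np.array([[2], [5]])              # 1-based sample positions

# Put samples on the first axis, as the notebook does when the first axis is 2048.
if features.shape[0] == 2048:
    features = features.T

y = labels.squeeze() - 1                 # 0-based class ids
ids = test_unseen_loc.squeeze() - 1      # 0-based sample positions
per_sample_att = att[y]                  # attribute vector of each sample's class

X_unseen = features[ids]
A_unseen = per_sample_att[ids]
y_unseen = y[ids]

# Binarize against the global mean of the attribute matrix.
binary = (att >= att.mean()).astype(np.float64)

print(X_unseen.shape, A_unseen.shape, y_unseen.tolist())
```

Two different offsets are at play: the labels are shifted so that class ids start at 0, and the split locations are shifted so that they can index rows of the feature matrix.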

In [4]:
def prepare_knn(dname, basedir, has_semantic=False):
    from scipy.io import loadmat
    import numpy as np
    from util.storage import DictContainer, Container

    _dir_src = '{}/data/{}/'.format(basedir, dname)
    
    # For more information, please check Xian et al. [1]
    _all_splits = loadmat('{}/{}'.format(_dir_src, 'att_splits.mat'))
    _odata = loadmat('{}/data/{}/res101.mat'.format(basedir, dname))
    
    # Labels from 0 to (|C|-1)
    _odata['y'] = _odata['labels'].squeeze() - 1
    
    # For more information, please check Xian et al. [1]
    if has_semantic != False:
        # If the semantic space is different from that of Xian et al. [1]
        att_continuous = np.load(has_semantic)
    else:
        att_continuous = _all_splits['att'].T
    
    nknn = DictContainer()
    def _setting_domain_(_domain, _classes):
        nknn.set_param('{}/data'.format(_domain), 
                       np.array([att_continuous[i] for i in _classes]))
        nknn.set_param('{}/ids'.format(_domain), _classes + 1)
        nknn.set_param('{}/ys'.format(_domain), _classes)
        nknn.set_param('{}/labels'.format(_domain), 
                       np.array([_all_splits['allclasses_names'][i][0][0] for i in _classes]))
        
        nknn.set_param('{}/knn2id'.format(_domain), 
                       {key: value for key, value in enumerate(nknn.get_param('{}/ids'.format(_domain)))})
        nknn.set_param('{}/id2knn'.format(_domain), 
                       {value: key for key, value in enumerate(nknn.get_param('{}/ids'.format(_domain)))})
        nknn.set_param('{}/id2class'.format(_domain), 
        {value: nknn.get_param('{}/labels'.format(_domain))[key] for key, value in enumerate(nknn.get_param('{}/ids'.format(_domain)))})

        
    _setting_domain_('openset', np.arange(0, att_continuous.shape[0]))
    
    _classes = np.unique(_odata['y'][_all_splits['trainval_loc'] - 1].squeeze())
    _setting_domain_('openval', _classes)
    
    _classes = np.unique(_odata['y'][_all_splits['test_unseen_loc'] - 1].squeeze())
    _setting_domain_('zsl', _classes)

    return nknn
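
The lookup tables built by `_setting_domain_` can be illustrated with a toy domain. The class ids and names below are hypothetical; the sketch also casts the ids to plain `int`, whereas the notebook keeps the NumPy integers as dictionary keys and values.

```python
import numpy as np

classes = np.array([4, 7, 9])        # 0-based class ids of a hypothetical domain
ids = classes + 1                    # 1-based ids, as stored under '<domain>/ids'
names = ["zebra", "seal", "bat"]     # hypothetical class names, one per row

# Row position inside the domain's attribute matrix -> 1-based class id
knn2id = {row: int(cid) for row, cid in enumerate(ids)}
# 1-based class id -> row position inside the domain's attribute matrix
id2knn = {int(cid): row for row, cid in enumerate(ids)}
# 1-based class id -> class name
id2class = {int(cid): names[row] for row, cid in enumerate(ids)}

print(knn2id)
print(id2knn)
print(id2class)
```

A nearest-neighbour search over the rows of `'<domain>/data'` returns a row position; `knn2id` translates it back to the dataset-wide class id.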

In [14]:
_dname = 'AWA1'
_data = prepare_dataset(dname=_dname, basedir=_download_folder_)
print("Data prepared: {}".format(_dname.lower()))

_knn = prepare_knn(dname=_dname, basedir=_download_folder_)
print("KNN prepared: {}".format(_dname.lower()))


DataH5py().save(dic=_data, filename='{}/{}/data.h5'.format(_outdir, _dname.lower()))
DataH5py().save(dic=_knn, filename='{}/{}/knn.h5'.format(_outdir, _dname.lower()))
print("Data saved")




Data prepared: awa1
KNN prepared: awa1
Data saved


In [17]:
_dname = 'CUB'

# For more information, please check Reed et al. [3]
_semantic_dir = '../../data/._original_/data/CUB/reed.etal.npy'

_data = prepare_dataset(dname=_dname, basedir=_download_folder_, has_semantic=_semantic_dir)
print("Data prepared: {}".format(_dname.lower()))

_knn = prepare_knn(dname=_dname, basedir=_download_folder_, has_semantic=_semantic_dir)
print("KNN prepared: {}".format(_dname.lower()))


DataH5py().save(dic=_data, filename='{}/{}/data.h5'.format(_outdir, _dname.lower()))
DataH5py().save(dic=_knn, filename='{}/{}/knn.h5'.format(_outdir, _dname.lower()))
print("Data saved")


Data prepared: cub
KNN prepared: cub
Data saved


In [15]:
_dname = 'SUN'
_data = prepare_dataset(dname=_dname, basedir=_download_folder_)
print("Data prepared: {}".format(_dname.lower()))

_knn = prepare_knn(dname=_dname, basedir=_download_folder_)
print("KNN prepared: {}".format(_dname.lower()))


DataH5py().save(dic=_data, filename='{}/{}/data.h5'.format(_outdir, _dname.lower()))
DataH5py().save(dic=_knn, filename='{}/{}/knn.h5'.format(_outdir, _dname.lower()))
print("Data saved")

Data prepared: sun
KNN prepared: sun
Data saved


# Download H5files



To make the results reported in Felix et al. [4] reproducible, we provide links to download: (1) the h5 files used to train the models; (2) the semantic space used for CUB; and (3) the pseudo-features generated to train the final classifier.


(1) Dataset (h5file):
https://drive.google.com/open?id=1wUNnqdNTapl7fZsGczUZIlQnGkpAtTpv

(2) CUB semantic space:
https://drive.google.com/open?id=1Jfltw0qbSCUvr4ekRl_3pekkCGzGolyA

(3) Pseudo features generated by cycle-WGAN:
https://drive.google.com/open?id=1MTGiOL3rbW6vi-WeTKbdix1hHcq6i5ki
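
Once an h5 file has been downloaded (or written by the cells above), its contents can be inspected with `h5py`. The sketch below is a generic walker and assumes nothing about how `DataH5py` lays out the file. The helper name `list_datasets` is hypothetical, and the toy file written in the demo only imitates the namespaces used in `prepare_dataset`.

```python
import os
import tempfile

import h5py
import numpy as np


def list_datasets(filename):
    """Return {dataset path: shape} for every dataset in an HDF5 file."""
    found = {}

    def _visit(name, obj):
        # visititems calls this for every group and dataset; keep datasets only
        if isinstance(obj, h5py.Dataset):
            found[name] = obj.shape

    with h5py.File(filename, "r") as handle:
        handle.visititems(_visit)
    return found


# Demo on a toy file; the group names here are an assumption for illustration.
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy.h5")
    with h5py.File(toy_path, "w") as handle:
        handle.create_dataset("train/X", data=np.zeros((3, 5)))
        handle.create_dataset("test/unseen/y", data=np.arange(2))
    toy_layout = list_datasets(toy_path)

print(toy_layout)
```

For a real file, the call would be `list_datasets('../../data/cub/data.h5')`.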

# References

[1] Xian, Yongqin, Bernt Schiele, and Zeynep Akata. "Zero-shot learning-the good, the bad and the ugly." arXiv preprint arXiv:1703.04394 (2017).

[2] Lampert, Christoph H., Hannes Nickisch, and Stefan Harmeling. "Learning to detect unseen object classes by between-class attribute transfer." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.

[3] Reed, Scott, et al. "Learning deep representations of fine-grained visual descriptions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

[4] Felix, Rafael, et al. "Multi-modal Cycle-consistent Generalized Zero-Shot Learning." Proceedings of the European Conference on Computer Vision (ECCV). 2018.

