# Get CIFAR-10 Data

In directory ~/data/wcl-data: 
- wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
- tar -xvf cifar-10-python.tar.gz

The dataset page describes the archive layout and how to load the batches: [https://www.cs.toronto.edu/~kriz/cifar.html]

### Looking at data

In [4]:
import os

os.getcwd()
os.chdir('/home/sshad/data/wcl-data/cifar-10-batches-py')
os.getcwd() 

'/home/sshad/data/wcl-data/cifar-10-batches-py'

In [5]:
os.listdir()

['data_batch_4',
 'readme.html',
 'test_batch',
 'data_batch_3',
 'data_batch_2',
 'data_batch_1',
 'data_batch_5',
 'batches.meta']

In [6]:
batches =[file for file in os.listdir() if '_batch' in file]
batches

['data_batch_4',
 'test_batch',
 'data_batch_3',
 'data_batch_2',
 'data_batch_1',
 'data_batch_5']

In [7]:
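# os.listdir() returns entries in arbitrary order, so select batches by name
# rather than by position.
val = 'data_batch_4'  # batch 4 is the validation set
test = 'test_batch'
train = sorted(b for b in batches if b not in (val, test))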

### Unpickling 1 data file

In [27]:
import pickle

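def unpickle(file):
    # encoding='bytes' reads the Python 2 strings in the file as bytes objects
    with open(file, 'rb') as fo:
        batch = pickle.load(fo, encoding='bytes')  # avoid shadowing built-in dict
    return batch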


In [14]:
# Unpickling a batch results in a dictionary
batch1 = unpickle('data_batch_1')
type(batch1)

dict

In [16]:
# Dictionary keys
batch1.keys()

dict_keys([b'batch_label', b'labels', b'data', b'filenames'])

In [27]:
# What are the values?
for key in batch1.keys():
    print(key, type(batch1[key]))

b'batch_label' <class 'bytes'>
b'labels' <class 'list'>
b'data' <class 'numpy.ndarray'>
b'filenames' <class 'list'>


**data** -- a 10000x3072 numpy array of uint8s. Each row of the array stores a 32x32 colour image. The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue. The image is stored in row-major order, so that the first 32 entries of the array are the red channel values of the first row of the image.
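The layout described above implies that a row can be turned into a height x width x channel image by reshaping it to (3, 32, 32) and moving the channel axis last. A minimal sketch, using a synthetic row in place of a real CIFAR-10 row; the function name `row_to_image` is hypothetical:

```python
import numpy as np

def row_to_image(row):
    """Convert one 3072-value CIFAR-10 row into a (32, 32, 3) RGB image."""
    # The row is laid out as [R plane | G plane | B plane], each plane row-major.
    return row.reshape(3, 32, 32).transpose(1, 2, 0)

# Synthetic stand-in for one row of batch1[b'data'] (not real image data).
row = (np.arange(3072) % 256).astype(np.uint8)
img = row_to_image(row)
print(img.shape)
```

The resulting array has the channel-last layout that plotting functions such as matplotlib's `imshow` accept for RGB images.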

**labels** -- a list of 10000 numbers in the range 0-9. The number at index i indicates the label of the ith image in the array data.

In [35]:
# Look inside: 

print("Batch Label: ", batch1[b'batch_label'])
print("Labels (examples, shape): ", batch1[b'labels'][:3], len(batch1[b'labels']))
print("Data (shape): ", batch1[b'data'].shape)
print("Filenames (examples, shape): ", batch1[b'filenames'][:3], len(batch1[b'filenames']))

Batch Label:  b'training batch 1 of 5'
Labels (examples, shape):  [6, 9, 9] 10000
Data (shape):  (10000, 3072)
Filenames (examples, shape):  [b'leptodactylus_pentadactylus_s_000004.png', b'camion_s_000148.png', b'tipper_truck_s_001250.png'] 10000


The dataset contains another file, called batches.meta. It too contains a Python dictionary object. It has the following entries:

**label_names** -- a 10-element list which gives meaningful names to the numeric labels in the labels array described above. 

For example, `label_names[0] == "airplane"`, `label_names[1] == "automobile"`, etc.
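Because the files are unpickled with `encoding='bytes'`, the key and the names are expected to come back as bytes rather than str. A minimal sketch of decoding them, assuming that format; the dictionary below is a two-entry stand-in for the result of `unpickle('batches.meta')`, and `decode_label_names` is a hypothetical helper:

```python
def decode_label_names(meta):
    """Return the class names in batches.meta as ordinary Python strings."""
    # Assumes both the key and the names are bytes, as with encoding='bytes'.
    return [name.decode('utf-8') for name in meta[b'label_names']]

# Stand-in for unpickle('batches.meta'); only the first two names are included.
meta = {b'label_names': [b'airplane', b'automobile']}
label_names = decode_label_names(meta)
print(label_names[0], label_names[1])
```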

### How to efficiently load data into dataset? 

Unpickling a batch loads the whole batch into memory. Unpickling inside the Dataset class might therefore not be efficient, because each batch (or all batches) still ends up in memory.

I will load data here, and save it on disk in a way that makes it easy to work with Dataset.

Plan: 
- do not need batch label, filename
- need labels(y) & data(x)

**Not a great idea**, but I saved them as pickle files and will unpickle them inside the Dataset class. This is no different from unpickling the original batches inside the Dataset class. That implementation could be useful for reproducing this later.
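One possible alternative to pickle files, not used in this notebook, is to store the image array with `np.save` and open it with `np.load(..., mmap_mode='r')`, so rows are read from disk on demand. A minimal sketch under that assumption, with small synthetic arrays and a plain Python class standing in for a Dataset; `MemmapImages` and the file names are hypothetical:

```python
import os
import tempfile
import numpy as np

# Synthetic stand-ins for the real arrays (100 rows instead of 60000).
demo_images = (np.arange(100 * 3072) % 256).astype(np.uint8).reshape(100, 3072)
demo_labels = np.arange(100) % 10

tmp_dir = tempfile.mkdtemp()
images_path = os.path.join(tmp_dir, 'images.npy')
labels_path = os.path.join(tmp_dir, 'labels.npy')
np.save(images_path, demo_images)
np.save(labels_path, demo_labels)

class MemmapImages:
    """Map-style container exposing __len__ and __getitem__."""

    def __init__(self, images_path, labels_path):
        # mmap_mode='r' maps the file read-only instead of reading it all in.
        self.images = np.load(images_path, mmap_mode='r')
        self.labels = np.load(labels_path)  # labels are small; load fully

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # np.array(...) copies just this row out of the memory map.
        return np.array(self.images[idx]), int(self.labels[idx])

dataset = MemmapImages(images_path, labels_path)
x, y = dataset[3]
print(len(dataset), x.shape, x.dtype)
```

The same two methods are what a map-style `torch.utils.data.Dataset` subclass would define, so the class body could be moved there largely unchanged.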

Adapted from [SimCLR in Pytorch](https://medium.com/the-owl/simclr-in-pytorch-5f290cb11dd7): 

In [2]:
import numpy as np

In [9]:
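# Load batches in a fixed order (train, then val, then test) so that
# positional slices of the result pick out the intended splits.
ordered = train + [val, test]
dicts = [unpickle(batch) for batch in ordered]
images = np.concatenate([d[b'data'] for d in dicts], axis=0)
labels = np.concatenate([np.asarray(d[b'labels'], dtype=np.int64) for d in dicts])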

images.shape, labels.shape

((60000, 3072), (60000,))

In [28]:
def pickle_file(obj, file):
    with open(file, 'wb') as f:
        pickle.dump(obj, f)

In [20]:
os.getcwd()
os.chdir('/home/sshad/data/wcl-data/')

In [30]:
pickle_file(images, 'images')
pickle_file(labels, 'labels')

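In case I want separate train/val/test files, I split the data here. The slices rely on the rows of `images` being ordered as 40,000 training, then 10,000 validation, then 10,000 test images, so the batches must have been concatenated in that order.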

In [31]:
train_images, train_labels = images[:40000], labels[:40000]
val_images, val_labels = images[40000: 50000], labels[40000:50000]
test_images, test_labels = images[50000:], labels[50000:]

pickle_file(train_images, 'train_images')
pickle_file(train_labels, 'train_labels')
pickle_file(val_images, 'val_images')
pickle_file(val_labels, 'val_labels')
pickle_file(test_images, 'test_images')
pickle_file(test_labels, 'test_labels')
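A quick way to confirm that the split files read back unchanged is a round trip through a temporary directory. A minimal sketch, using a 60-row synthetic array with the same 4:1:1 proportions instead of the real data; `load_pickle` is a hypothetical helper:

```python
import os
import pickle
import tempfile
import numpy as np

def pickle_file(obj, file):
    with open(file, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle(file):
    with open(file, 'rb') as f:
        return pickle.load(f)

# Small synthetic stand-in for the (60000, 3072) image array.
demo_images = (np.arange(60 * 3072) % 256).astype(np.uint8).reshape(60, 3072)
splits = {
    'train_images': demo_images[:40],
    'val_images': demo_images[40:50],
    'test_images': demo_images[50:],
}

# Write each split, then read it back and record its row count.
out_dir = tempfile.mkdtemp()
for name, arr in splits.items():
    pickle_file(arr, os.path.join(out_dir, name))

restored = {name: load_pickle(os.path.join(out_dir, name)) for name in splits}
sizes = {name: arr.shape[0] for name, arr in restored.items()}
print(sizes)
```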

In [34]:
train_images[0].shape  # shape is an attribute, not a method

(3072,)