# Cars dataset

Here I make the [cars dataset](http://ai.stanford.edu/~jkrause/cars/car_dataset.html) usable.
The dataset has 16,185 images classified into 196 classes of car. Each class is a specific make, model, and year.

### Reading the label file

In [1]:
from scipy.io import loadmat
import os
import pickle
import custom.tools as ctools

In [2]:
labels = loadmat(os.path.join('data', 'cars_dataset', 'cars_annos.mat'))
labels.keys()

dict_keys(['__header__', '__version__', '__globals__', 'annotations', 'class_names'])

In [3]:
annotations = labels['annotations'].flatten()
class_names = labels['class_names'].flatten()
print('annotations shape: {}'.format(annotations.shape))
print('class_names shape: {}'.format(class_names.shape))

annotations shape: (16185,)
class_names shape: (196,)


### Understanding the components of the labels
Each annotation is a record of seven fields. The image name is at index 0 and the class label is at index 5.
Indices 1-4 hold the bounding box coordinates.
I am not sure what the last field (index 6) is for; it may be the train/test split flag, but I have not verified this.

In [4]:
annotations[0]

(array(['car_ims/000001.jpg'], 
      dtype='<U18'), array([[112]], dtype=uint8), array([[7]], dtype=uint8), array([[853]], dtype=uint16), array([[717]], dtype=uint16), array([[1]], dtype=uint8), array([[0]], dtype=uint8))

In [5]:
print('Image name: {}, class label: {}'.format(annotations[0][0], annotations[0][5]))
print('Image name: {}, class label: {}'.format(annotations[88][0], annotations[88][5]))
print('Image name: {}, class label: {}'.format(annotations[89][0], annotations[89][5]))

Image name: ['car_ims/000001.jpg'], class label: [[1]]
Image name: ['car_ims/000089.jpg'], class label: [[1]]
Image name: ['car_ims/000090.jpg'], class label: [[2]]
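The field positions described above can be wrapped in a small helper that turns one nested record into plain Python values. This is only a sketch: the field names (`image_name`, `bbox`, `class_label`, `extra_flag`) are my own labels, not names taken from the .mat file, and nested lists stand in for the arrays so the example runs without the dataset.

```python
from collections import namedtuple

# Field names are illustrative; only the positions come from the notebook.
Annotation = namedtuple('Annotation',
                        ['image_name', 'bbox', 'class_label', 'extra_flag'])

def unpack_annotation(record):
    """Turn one nested annotation record into plain Python values."""
    image_name = str(record[0][0])
    # indices 1-4: the four bounding box values, kept in their stored order
    bbox = tuple(int(record[i][0][0]) for i in range(1, 5))
    class_label = int(record[5][0][0])   # 1-indexed class label
    extra_flag = int(record[6][0][0])    # meaning not verified in this notebook
    return Annotation(image_name, bbox, class_label, extra_flag)

# Nested lists standing in for the arrays shown in annotations[0]
example = (['car_ims/000001.jpg'], [[112]], [[7]], [[853]], [[717]], [[1]], [[0]])
parsed = unpack_annotation(example)
print(parsed.image_name, parsed.bbox, parsed.class_label)
```

The same integer indexing should work on the real records, since the cells above already read `annotations[0][0]` and `annotations[0][5]` that way.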


From visual inspection, the class labels correspond to positions in class_names.
__The class labels are 1-indexed__, so label 1 refers to class_names[0].

In [6]:
class_names[0]

array(['AM General Hummer SUV 2000'], 
      dtype='<U26')

In [7]:
class_names[1]

array(['Acura RL Sedan 2012'], 
      dtype='<U19')

In [8]:
# convert class_names into list for ease of indexing
label_full_list = [name[0] for name in class_names]
label_full_list[0:3]

['AM General Hummer SUV 2000', 'Acura RL Sedan 2012', 'Acura TL Sedan 2012']
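With the names in a plain list, looking up the name for a class label only needs an offset of one. A minimal sketch, assuming the labels run from 1 to the number of classes; `name_for_label` is a hypothetical helper, and the three names are copied from the output above.

```python
names = ['AM General Hummer SUV 2000', 'Acura RL Sedan 2012', 'Acura TL Sedan 2012']

def name_for_label(label, names):
    """Map a 1-indexed class label to its name in a 0-indexed list."""
    if not 1 <= label <= len(names):
        # guard against label 0, which would otherwise silently return names[-1]
        raise ValueError('label {} is outside 1..{}'.format(label, len(names)))
    return names[label - 1]

print(name_for_label(1, names))
print(name_for_label(2, names))
```

The range check matters because Python accepts negative indexes: an off-by-one label of 0 would return the last class instead of failing.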

### Cleaning the labels
1) Provide naming that cleanly separates make, model & year with '-'; inside each field, spaces, '-' and '/' become '_'.

2) Pair up each image name with its corresponding label.

In [9]:
def conditional_replace(string, old_substring, new_substring, verbose=False):
    # replace old_substring if present; with verbose=True, print before and after
    if old_substring in string:
        if verbose:
            print(string)
        string = string.replace(old_substring, new_substring)
        if verbose:
            print(string)
    return string

label_tuples = []
for name in label_full_list:
    words = name.split(' ')
    make = words[0]
    if make in ['Aston', 'Land']:
        # two-word makes: the second word belongs to the make, not the model
        make = '_'.join(words[:2])
        model = '_'.join(words[2:-1])
    elif make == 'Ram':
        # file Ram under Dodge and keep 'Ram' as part of the model name
        make = 'Dodge'
        model = 'Ram_' + '_'.join(words[1:-1])
        print(make, model)
    else:
        model = '_'.join(words[1:-1])
    year = words[-1]
    make = conditional_replace(make, '-', '_')
    make = conditional_replace(make, '/', '_')
    model = conditional_replace(model, '-', '_')
    model = conditional_replace(model, '/', '_', verbose=True)
    tup = (make, model, year)
    label_tuples.append(tup)

label_tuples[-25:]

Dodge Ram_C/V_Cargo_Van_Minivan
Ram_C/V_Cargo_Van_Minivan
Ram_C_V_Cargo_Van_Minivan


[('Plymouth', 'Neon_Coupe', '1999'),
 ('Porsche', 'Panamera_Sedan', '2012'),
 ('Dodge', 'Ram_C_V_Cargo_Van_Minivan', '2012'),
 ('Rolls_Royce', 'Phantom_Drophead_Coupe_Convertible', '2012'),
 ('Rolls_Royce', 'Ghost_Sedan', '2012'),
 ('Rolls_Royce', 'Phantom_Sedan', '2012'),
 ('Scion', 'xD_Hatchback', '2012'),
 ('Spyker', 'C8_Convertible', '2009'),
 ('Spyker', 'C8_Coupe', '2009'),
 ('Suzuki', 'Aerio_Sedan', '2007'),
 ('Suzuki', 'Kizashi_Sedan', '2012'),
 ('Suzuki', 'SX4_Hatchback', '2012'),
 ('Suzuki', 'SX4_Sedan', '2012'),
 ('Tesla', 'Model_S_Sedan', '2012'),
 ('Toyota', 'Sequoia_SUV', '2012'),
 ('Toyota', 'Camry_Sedan', '2012'),
 ('Toyota', 'Corolla_Sedan', '2012'),
 ('Toyota', '4Runner_SUV', '2012'),
 ('Volkswagen', 'Golf_Hatchback', '2012'),
 ('Volkswagen', 'Golf_Hatchback', '1991'),
 ('Volkswagen', 'Beetle_Hatchback', '2012'),
 ('Volvo', 'C30_Hatchback', '2012'),
 ('Volvo', '240_Sedan', '1993'),
 ('Volvo', 'XC90_SUV', '2007'),
 ('smart', 'fortwo_Convertible', '2012')]

In [10]:
label_list = ['-'.join(tup) for tup in label_tuples]
label_list[0:10]

['AM-General_Hummer_SUV-2000',
 'Acura-RL_Sedan-2012',
 'Acura-TL_Sedan-2012',
 'Acura-TL_Type_S-2008',
 'Acura-TSX_Sedan-2012',
 'Acura-Integra_Type_R-2001',
 'Acura-ZDX_Hatchback-2012',
 'Aston_Martin-V8_Vantage_Convertible-2012',
 'Aston_Martin-V8_Vantage_Coupe-2012',
 'Aston_Martin-Virage_Convertible-2012']
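Because '-' is used only as the field separator and never inside a field, a cleaned label can be split back into its three parts. The sketch below assumes exactly that property; `split_label` is a hypothetical helper, and the sample labels are built from values shown in the outputs above.

```python
def split_label(label):
    """Split a 'make-model-year' label back into (make, model, year)."""
    parts = label.split('-')
    if len(parts) != 3:
        # a stray '-' inside a field would break the naming scheme
        raise ValueError('expected exactly two separators in {!r}'.format(label))
    make, model, year = parts
    return make, model, int(year)

print(split_label('Acura-RL_Sedan-2012'))
print(split_label('Rolls_Royce-Ghost_Sedan-2012'))
```

This is why the hyphen in a make such as Rolls-Royce has to become an underscore before the fields are joined.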

In [11]:
clean_pairs = [(annotation[0][0], label_list[annotation[5][0][0]-1])
               for annotation in annotations]
# keep only the file name; str.strip removes a set of characters, not a prefix
clean_pairs = [(os.path.basename(tup[0]), tup[1]) for tup in clean_pairs]

In [12]:
clean_pairs[0], clean_pairs[88], clean_pairs[89], clean_pairs[-1]

(('000001.jpg', 'AM-General_Hummer_SUV-2000'),
 ('000089.jpg', 'AM-General_Hummer_SUV-2000'),
 ('000090.jpg', 'Acura-RL_Sedan-2012'),
 ('016185.jpg', 'smart-fortwo_Convertible-2012'))
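One subtlety when trimming the 'car_ims/' directory from the image names: `str.strip(chars)` treats its argument as a set of characters to remove from both ends, not as a prefix. It happens to be harmless for purely numeric file names, but `os.path.basename` is the safer choice. The file name below is made up to show the difference and is not from the dataset.

```python
import os

path = 'car_ims/scar.jpg'  # hypothetical file name, not from the dataset

# strip keeps removing any of the characters c, a, r, _, i, m, s, / from both ends
stripped = path.strip('car_ims/')
# basename drops the directory part and leaves the file name untouched
base = os.path.basename(path)

print(stripped)  # → .jpg
print(base)      # → scar.jpg
```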

### Saving the data

In [13]:
save_dir = os.path.join('data', 'notebooks', '1_make_dataset_usable')
dicto_to_save = {
    'clean_pairs': clean_pairs,
    'label_tuples': label_tuples
}

ctools.pickle_variable_to_path(dicto_to_save, 'pair_dicto.pkl', save_dir)
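To use the saved pairs in a later notebook, the dictionary has to be read back. `ctools` is a custom module whose behaviour is not shown here, so the sketch below assumes it writes an ordinary pickle file and demonstrates the round trip with the standard `pickle` module and a temporary directory instead.

```python
import os
import pickle
import tempfile

# A tiny stand-in for dicto_to_save, using values from the outputs above
pairs = {
    'clean_pairs': [('000001.jpg', 'AM-General_Hummer_SUV-2000')],
    'label_tuples': [('AM', 'General_Hummer_SUV', '2000')],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'pair_dicto.pkl')
    # write the dictionary the way a plain pickle-based helper would
    with open(path, 'wb') as f:
        pickle.dump(pairs, f)
    # read it back, as a downstream notebook might
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

print(loaded['clean_pairs'][0])
```

Pickle preserves the tuples inside the lists, so the loaded structure compares equal to what was saved.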