# Data pre-processing for Yelp Restaurant Photo Classification competition

---

## 1. Import Libraries and Datasets

In [1]:
import numpy as np
import pandas as pd
from common import *

%matplotlib inline

# Business id to labels
biz2labels = pd.read_csv('data/train.csv', header = 0, names = ['business','labels']).fillna('')

# Photo id to business id for the training dataset
photo2biz_train = pd.read_csv('data/train_photo_to_biz_ids.csv', header = 0, names = ['photo','business'])

# Get list of images
all_files, all_ids = path_to_images('data/train_photos')

print('There are %d training images' % len(all_ids))

Using TensorFlow backend.


There are 234842 training images


## 2. Extract Labels for Each Photo in Training Dataset
The `biz2labels` data frame is rearranged so that the labels associated with each business can be looked up efficiently: each label string is parsed into a sorted tuple of integers, and the business id becomes the index.

In [2]:
biz2labels['labels'] = biz2labels['labels'].apply(lambda x: tuple(sorted(int(t) for t in x.split())))
biz2labels.set_index('business', inplace=True)
biz2labels.head(n=10)

| business | labels |
|---|---|
| 1000 | (1, 2, 3, 4, 5, 6, 7) |
| 1001 | (0, 1, 6, 8) |
| 100 | (1, 2, 4, 5, 6, 7) |
| 1006 | (1, 2, 4, 5, 6) |
| 1010 | (0, 6, 8) |
| 101 | (1, 2, 3, 4, 5, 6) |
| 1011 | (2, 3, 5, 6) |
| 1012 | (1, 2, 3, 5, 6) |
| 1014 | (1, 2, 4, 5, 6) |
| 1015 | (1, 5, 6, 7) |
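
The parsing step in cell 2 can be read as a small named function. The sketch below is only an illustration (`parse_labels` is a hypothetical name, not something defined in `common`). It also shows why the earlier `fillna('')` matters: a business without labels becomes an empty tuple instead of failing on a missing value.

```python
def parse_labels(raw):
    """Turn a space-separated label string into a sorted tuple of ints."""
    return tuple(sorted(int(token) for token in raw.split()))


# Labels come out sorted regardless of their order in the CSV.
print(parse_labels('6 1 0 8'))

# An empty string (what fillna('') produces for a missing entry) gives ().
print(parse_labels(''))
```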


There are nine different labels:
* 0 = good_for_lunch
* 1 = good_for_dinner
* 2 = takes_reservations
* 3 = outdoor_seating
* 4 = restaurant_is_expensive
* 5 = has_alcohol
* 6 = has_table_service
* 7 = ambience_is_classy
* 8 = good_for_kids
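
For inspecting results it can help to map label ids back to the names listed above. A minimal sketch, where `LABEL_NAMES` and `label_names` are hypothetical helpers introduced here for illustration:

```python
# Label names in id order, copied from the list above.
LABEL_NAMES = (
    'good_for_lunch',
    'good_for_dinner',
    'takes_reservations',
    'outdoor_seating',
    'restaurant_is_expensive',
    'has_alcohol',
    'has_table_service',
    'ambience_is_classy',
    'good_for_kids',
)


def label_names(labels):
    """Translate a tuple of label ids into their human-readable names."""
    return [LABEL_NAMES[i] for i in labels]


# Business 1010 in the table above has labels (0, 6, 8).
print(label_names((0, 6, 8)))
```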

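Each photo inherits the labels of its business. The label tuples are then encoded with `encode_label` (imported from `common`) and stacked into a single array with one row per photo and one column per label.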

In [3]:
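# Look up the label tuple of each photo's business, in photo2biz_train row order.
# Assumption to verify: all_files (from path_to_images) follows the same photo order,
# otherwise image paths and targets will be misaligned.
all_targets = np.vstack(biz2labels.loc[photo2biz_train['business']]['labels'].apply(encode_label))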
print('Number of rows: %d - Number of columns: %d' % (all_targets.shape[0], all_targets.shape[1]))

Number of rows: 234842 - Number of columns: 9
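
The implementation of `encode_label` lives in `common` and is not shown here. Given the nine output columns, one plausible reading is a multi-hot encoding, where position `i` is 1 if label `i` is present. The sketch below illustrates that idea under this assumption; `encode_label_sketch` is a hypothetical stand-in, not the actual helper.

```python
import numpy as np

NUM_LABELS = 9  # assumption: one column per label id, 0 through 8


def encode_label_sketch(labels, num_labels=NUM_LABELS):
    """Multi-hot encode a tuple of label ids (hypothetical stand-in for encode_label)."""
    vec = np.zeros(num_labels, dtype=np.uint8)
    idx = np.array(labels, dtype=np.intp)  # an empty tuple yields an empty index array
    vec[idx] = 1
    return vec


# Stack one encoded row per photo, as np.vstack does in the cell above.
encoded = np.vstack([encode_label_sketch((0, 1, 6, 8)),
                     encode_label_sketch((0, 6, 8))])
print(encoded.shape)
```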


## 3. Split Training Dataset into Training, Validation and Test Datasets
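The original training dataset is split into training, validation and test datasets. A quarter of the data is first held out, then divided in half, so the validation and test datasets each receive 12.5% of the original dataset and the training dataset keeps the remaining 75%.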

In [4]:
from sklearn.model_selection import train_test_split

train_files, test_files, train_targets, test_targets = train_test_split(all_files,
                                                                        all_targets,
                                                                        test_size=0.25,
                                                                        random_state=7)

# Split the held-out quarter in half: first half for validation, second half for test
half = len(test_files) // 2
valid_files, valid_targets = test_files[:half], test_targets[:half]
test_files, test_targets = test_files[half:], test_targets[half:]

print('There are %d images in the training dataset' % len(train_files))
print('There are %d images in the validation dataset' % len(valid_files))
print('There are %d images in the test dataset' % len(test_files))

There are 176131 images in the training dataset
There are 29355 images in the validation dataset
There are 29356 images in the test dataset
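
The three counts can be checked with a little arithmetic. The sketch below assumes that the held-out count is rounded up when `test_size` is a fraction, which is consistent with the numbers printed above; `split_sizes` is a hypothetical helper written for this check.

```python
import math


def split_sizes(n_samples, holdout_fraction=0.25):
    """Return (train, validation, test) sizes for a hold-out split cut in half."""
    n_holdout = math.ceil(holdout_fraction * n_samples)  # assumed rounding: up
    n_train = n_samples - n_holdout
    n_valid = n_holdout // 2          # first half of the held-out part
    n_test = n_holdout - n_valid      # remainder, one larger when n_holdout is odd
    return n_train, n_valid, n_test


print(split_sizes(234842))
```

Because the held-out part has an odd number of images, the test dataset ends up one image larger than the validation dataset.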


## 4. Save Arrays
The paths to the images and the corresponding labels are saved as compressed-free `.npz` archives, one per dataset.

In [5]:
np.savez('data/preprocess/train.npz', img=train_files, target=train_targets)
np.savez('data/preprocess/valid.npz', img=valid_files, target=valid_targets)
np.savez('data/preprocess/test.npz', img=test_files, target=test_targets)
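
A later notebook can read the arrays back with `np.load`, using the same keys (`img` and `target`) that were passed to `np.savez`. The round trip below is a minimal sketch on toy data written to a temporary directory; the file names and values are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for train_files and train_targets.
files = ['data/train_photos/1.jpg', 'data/train_photos/2.jpg']
targets = np.array([[1, 1, 0, 0, 0, 0, 1, 0, 1],
                    [1, 0, 0, 0, 0, 0, 1, 0, 1]], dtype=np.uint8)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'train.npz')
    np.savez(path, img=files, target=targets)

    # np.load returns a lazy archive; read the arrays while it is open.
    with np.load(path) as data:
        loaded_files = data['img']
        loaded_targets = data['target']

print(loaded_files.shape, loaded_targets.shape)
```

Note that `np.savez` does not create missing parent directories, so `data/preprocess` has to exist before cell 5 is run.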