# Model for Yelp Restaurant Photo Classification competition
## Import Libraries and Datasets

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from preprocess import *

%matplotlib inline

# Business id to labels
biz2labels = pd.read_csv('data/train.csv', header = 0, names = ['business','labels']).fillna('')

# Photo id to business id for the training dataset
photo2biz_train = pd.read_csv('data/train_photo_to_biz_ids.csv', header = 0, names = ['photo','business'])

# Photo id to business id for the test dataset
photo2biz_test = pd.read_csv('data/test_photo_to_biz.csv', header = 0, names = ['photo','business'])

# Get list of images
train_files, train_ids = path_to_images('data/train_photos')
test_files, test_ids = path_to_images('data/test_photos')

print('There are %d training images' % len(train_ids))
print('There are %d test images' % len(test_ids))

Using TensorFlow backend.


There are 234842 training images
There are 237152 test images


## Extract Labels for Each Photo in Training Dataset
The `biz2labels` data frame is rearranged so that the labels associated with each business can be looked up efficiently: each label string is parsed into a sorted tuple of integers, and the business id becomes the index.

In [2]:
biz2labels['labels'] = biz2labels['labels'].apply(lambda x: tuple(sorted(int(t) for t in x.split())))
biz2labels.set_index('business', inplace=True)
biz2labels.head(n=10)

| business | labels |
|---|---|
| 1000 | (1, 2, 3, 4, 5, 6, 7) |
| 1001 | (0, 1, 6, 8) |
| 100 | (1, 2, 4, 5, 6, 7) |
| 1006 | (1, 2, 4, 5, 6) |
| 1010 | (0, 6, 8) |
| 101 | (1, 2, 3, 4, 5, 6) |
| 1011 | (2, 3, 5, 6) |
| 1012 | (1, 2, 3, 5, 6) |
| 1014 | (1, 2, 4, 5, 6) |
| 1015 | (1, 5, 6, 7) |


There are nine different labels:
* 0 = good_for_lunch
* 1 = good_for_dinner
* 2 = takes_reservations
* 3 = outdoor_seating
* 4 = restaurant_is_expensive
* 5 = has_alcohol
* 6 = has_table_service
* 7 = ambience_is_classy
* 8 = good_for_kids

For each photo in the training dataset, the corresponding labels are retrieved.

In [3]:
train_targets = np.vstack(biz2labels.loc[photo2biz_train['business']]['labels'].apply(encode_label))
print('Number of rows: %d - Number of columns: %d' % (train_targets.shape[0], train_targets.shape[1]))

Number of rows: 234842 - Number of columns: 9
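`encode_label` comes from the local `preprocess` module, whose source is not shown here. The following is a minimal sketch of what such a helper might look like, assuming it produces a multi-hot vector of length 9 (which would be consistent with the nine columns reported above). The name `encode_label_sketch` and the `n_classes` parameter are hypothetical.

```python
import numpy as np

def encode_label_sketch(labels, n_classes=9):
    """Hypothetical stand-in for encode_label: tuple of label ids -> multi-hot vector."""
    target = np.zeros(n_classes, dtype=np.float32)
    # Set a 1 at every position named in the tuple; an empty tuple leaves all zeros
    target[np.asarray(labels, dtype=int)] = 1.0
    return target

# Business 1001 from the table above carries labels (0, 1, 6, 8)
example = encode_label_sketch((0, 1, 6, 8))
print(example)
```

Under this assumption, a business with no labels (an empty string in `train.csv`, turned into an empty tuple) maps to a vector of nine zeros.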


## Split Training Dataset into Training and Validation Datasets
The original training dataset is split into a training dataset and a validation dataset. The validation dataset is allocated 20% of the original data.

In [4]:
from sklearn.model_selection import train_test_split

seed = 7
valid_size = 0.2
train_files, valid_files, train_targets, valid_targets = train_test_split(train_files, 
                                                                          train_targets, 
                                                                          test_size=valid_size, 
                                                                          random_state=seed)

print('There are %d images in the training dataset' % len(train_files))
print('There are %d images in the validation dataset' % len(valid_files))

There are 187873 images in the training dataset
There are 46969 images in the validation dataset
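As a quick sanity check, the reported sizes can be reproduced by hand. This sketch assumes that scikit-learn rounds the validation count up when `test_size` is given as a fraction and assigns the remainder to training, which agrees with the numbers printed above.

```python
import math

n_images = 234842   # size of the original training dataset reported earlier
valid_size = 0.2    # fraction held out for validation

# Assumed rounding rule: validation count is rounded up, training gets the rest
n_valid = math.ceil(valid_size * n_images)
n_train = n_images - n_valid

print('training: %d, validation: %d' % (n_train, n_valid))
```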


## Preprocess the Data
The images are rescaled by dividing every pixel in every image by 255, so that 8-bit pixel values in [0, 255] fall in the range [0, 1].

In [None]:
from PIL import ImageFile
# Let PIL load truncated image files instead of raising an error
ImageFile.LOAD_TRUNCATED_IMAGES = True

# pre-process the data for Keras
train_tensors = paths_to_tensor(train_files).astype('float32')/255
valid_tensors = paths_to_tensor(valid_files).astype('float32')/255
test_tensors = paths_to_tensor(test_files).astype('float32')/255
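
The effect of the rescaling step can be illustrated on a small synthetic batch. This is only a sketch: the array `raw` is a made-up stand-in for the output of `paths_to_tensor`, and its shape (2 images, 4x4 pixels, 3 channels) is an assumption, since the real shape depends on the `preprocess` module.

```python
import numpy as np

# Synthetic 8-bit image batch standing in for paths_to_tensor output (assumed layout)
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(2, 4, 4, 3), dtype=np.uint8)

# Same operation as in the cell above: cast to float32, then divide by 255
scaled = raw.astype('float32') / 255

print(scaled.dtype, scaled.shape)
print(scaled.min() >= 0.0, scaled.max() <= 1.0)
```

Casting to `float32` before dividing keeps the result in single precision, which halves the memory footprint compared with NumPy's default `float64`.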