In [259]:
import pandas as pd
import numpy as np
import os
import re
import string
from collections import defaultdict
import unidecode 

# Create Image Splits

In this notebook I split the 206,949 photos provided into those that have captions and those that don't. I then further split the captioned photos into training and development (validation) sets, so that the development set can be used for hyperparameter tuning in future experiments.

## 1. Load image metadata

The file `photos.json`, downloaded directly from [the Yelp dataset page](https://www.yelp.com/dataset/download), lists the photo and business identifiers for each photo. The caption, if any, is also provided.

In [3]:
df = pd.read_json("../data/yelp_photos/photos.json", lines = True)

In [14]:
df.head(20)

Unnamed: 0,business_id,caption,label,photo_id
0,OnAzbTDn79W6CFZIriqLrA,,inside,soK1szeyan202jnsGhUDmA
1,OnAzbTDn79W6CFZIriqLrA,,inside,dU7AyRB_fHOZkflodEyN5A
2,OnAzbTDn79W6CFZIriqLrA,,outside,6T1qlbBdKkXA1cDNqMjg2g
3,OnAzbTDn79W6CFZIriqLrA,Bakery area,inside,lHhMNhCA7rAZmi-MMfF3ZA
4,XaeCGHZzsMwvFcHYq3q9sA,,food,oHSCeyoK9oLIGaCZq-wRJw
5,XaeCGHZzsMwvFcHYq3q9sA,,food,EN9qzZpxfv00B_4X6q5lYA
6,XaeCGHZzsMwvFcHYq3q9sA,,food,M6c0qxQQwWkUzAxIvoTFuQ
7,XaeCGHZzsMwvFcHYq3q9sA,,food,876EKnk6deA7xA4i1aipJg
8,XaeCGHZzsMwvFcHYq3q9sA,,food,NFCDwGr_-TEiw9bzx3nFKw
9,XaeCGHZzsMwvFcHYq3q9sA,,food,YDmNsu38Nk2rPZMFEWBzvw


## 2. Isolate images with captions

For training purposes, the images which have captions are what interest me. The ones that don't can be used to manually validate that my model is doing something reasonable, but won't be helpful for training purposes.  

In [15]:
# what do the missing values look like? 
df['caption'][0]

''

In [16]:
# boolean mask marking the rows whose caption is missing (empty string)
missing_idx = df['caption'] == ''
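
The comparison above assumes that every missing caption is exactly the empty string, which is what the check on the first row suggests. As a hedged sketch on a toy frame (the data and the names `toy` and `toy_missing_idx` are made up for illustration), a slightly more defensive mask would also treat NaN and whitespace-only captions as missing:

```python
import pandas as pd

# Toy data (hypothetical): a real caption, an empty one,
# a whitespace-only one, and a missing value
toy = pd.DataFrame({
    "photo_id": ["a", "b", "c", "d"],
    "caption": ["Bakery area", "", "   ", None],
})

# Treat NaN, empty, and whitespace-only captions as missing
toy_missing_idx = toy["caption"].fillna("").str.strip() == ""

print(toy_missing_idx.tolist())
```

Whether the Yelp file actually contains such cases is not checked here; if it does not, this mask and the simpler one select the same rows.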

In [260]:
# separate the rows with captions from the rows without
df_full = df[np.logical_not(missing_idx)]
df_missing = df[missing_idx]

Now we have two datasets - one of images _with_ captions and one _without_. How many images have captions? 

In [261]:
print("Num Images with captions:    %d"%  df_full.shape[0])
print("Num Images without captions: %d"%  df_missing.shape[0])

Num Images with captions:    100807
Num Images without captions: 106142


Pretty good! About half of the images (100,807 of 206,949, roughly 49%) have captions.

In [262]:
df_full[['photo_id', 'caption']].sample(20)

Unnamed: 0,photo_id,caption
44542,8S_K6cRyccNTQMw-N51u-Q,"Bigeye Tuna Poke : Roasted Macadamia Nuts, Fri..."
202839,nSZsgQD3OuvdH8cHErNCPg,Taco salad
170180,y90COAKwbyljtTdm6XElgw,Top Marmorierung
115885,l2gl4gImt_nkrqKrqjUZaw,Com Dia
47893,HWYKDyM649lKcxG9FCfjuw,Kung Pao Beef and Fried Rice
75167,Q9BUuhDigVDS2eyMlJaVrA,Shitake Mushroom
55510,y5wXpoC0NvbT2GMiJ0JyZQ,Vanilla caramel bourbon shake
180540,npcjI1I8eNBMcG2v_6oGjg,Sushi delux
103121,EqACZHqtNCovRi4KwOOgAA,Duo de crevette et petoncle.
115649,C1PCS2gPmIRhqFrVIybJnw,Appetizer Cold Beef dish - tasty but chewy be...


## 3. Text cleaning

Before I save the captions, it is a good idea to clean the text.

First, remove newline characters from the captions:

In [263]:
df_full['caption'] = df_full['caption'].apply(lambda x: re.sub(string = x, pattern = r'(\n|\r)', repl = " "))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
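
The warning above appears because `df_full` was created by boolean indexing on `df`, so pandas cannot tell whether the assignment is also meant to modify `df`. One common way to avoid it is to take an explicit copy when creating the subset. The sketch below shows this on a toy frame; the names `toy` and `toy_full` and the data are illustrative only, not part of the notebook.

```python
import re
import pandas as pd

toy = pd.DataFrame({
    "photo_id": ["a", "b", "c"],
    "caption": ["line one\nline two", "", "Taco\r\nsalad"],
})

# .copy() makes the subset an independent DataFrame, so later
# column assignments do not trigger SettingWithCopyWarning
toy_full = toy[toy["caption"] != ""].copy()

# Same newline replacement as in the cell above
toy_full["caption"] = toy_full["caption"].apply(
    lambda x: re.sub(r"(\n|\r)", " ", x)
)

print(toy_full["caption"].tolist())
```

The original frame `toy` is left untouched, which is the behaviour the rest of this notebook relies on for `df`.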


Many captions contain the ampersand `&`, as in "sweet & spicy". Later I'll remove punctuation, but to keep the captions coherent it is useful to first replace `&` with the word "and".

In [264]:
# raw string avoids the invalid "\&" escape; assign the result back instead of using inplace
df_full['caption'] = df_full['caption'].replace(r"&", " and ", regex = True)



Another common use of punctuation is in website addresses, for example _https://www.greatfood.com_. These are unlikely to appear in a word embedding dictionary. If we replace all website-like strings with a common token, say **website**, it should be easier to learn interesting patterns across captions, and the vocabulary will be smaller.

In [265]:
# raw string so the regex escapes are passed through unchanged; assign the result back
df_full['caption'] = df_full['caption'].replace(r"[a-z:/.0-9]+\.(org|com|net)", " website ", regex = True)



Next, replace numbers with a common token, say **num**:

In [266]:
# a number, optionally preceded by another number and a space, becomes one token
df_full['caption'] = df_full['caption'].replace(r"([0-9]+ |)[0-9]+", " num ", regex = True)



Next, convert Unicode characters to their closest ASCII equivalents. This turns strings like "hélen" into "helen", which helps reduce the vocabulary size, since some people write the same words with and without accents.

In [267]:
df_full.caption = df_full.caption.apply(lambda x: unidecode.unidecode(x))



Remove punctuation:

In [268]:
trans_table = str.maketrans('', '', string.punctuation + "–")
df_full['caption'] = df_full['caption'].apply(lambda x: x.translate(trans_table))



Finally, lowercase the captions and strip leading and trailing whitespace:

In [269]:
df_full.caption = df_full.caption.apply(lambda x: x.lower().strip())



In [278]:
df_full[['photo_id', 'caption']].sample(20)

Unnamed: 0,photo_id,caption
124699,mWkW2iw8UpCu_FQvZYZgPg,devilled eggs smoked bacon herbs souffletin...
131325,nfNSrMdVUobXmG2WtEkdhA,german chocolate cake slice
69622,T_kSjt7N-oD3GPhbtmeAhA,enjoying lunch at sonatas
138284,3t_vXBifzA69pviaz4gNww,ninja sushi and hiba hi
5781,ajUpavvax7YvlJW0fKsywA,veggie egg rolls
39846,t-xNrdtz1hR4Azk-navtFA,guacamole and bacon burger
33935,cv6CoBtd2IqWE4tTvP3KKA,lgros luxe cocktails and food
74004,ObdtJYhONq53Z9t0IsuiBw,duck and waffles
108374,vhiARXfAWyLvL4jGq7TX_A,initial table setting they take them away imme...
29983,__oEMTjoOi7NHyNfNgb2Jg,tempura
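
The cleaning steps of this section can also be collected into a single function, which makes them easier to test on individual strings. This is a minimal sketch rather than the code used above: the name `clean_caption` is hypothetical, and the standard-library `unicodedata` module is used as a stand-in for `unidecode`, so characters that `unidecode` would transliterate may simply be dropped here.

```python
import re
import string
import unicodedata

# Same patterns as in the cells above, written as raw strings
WEBSITE_RE = re.compile(r"[a-z:/.0-9]+\.(org|com|net)")
NUMBER_RE = re.compile(r"([0-9]+ |)[0-9]+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation + "–")


def clean_caption(text):
    """Apply the cleaning steps of this section, in the same order."""
    text = re.sub(r"(\n|\r)", " ", text)       # newlines -> spaces
    text = text.replace("&", " and ")          # "&" -> "and"
    text = WEBSITE_RE.sub(" website ", text)   # website-like strings -> token
    text = NUMBER_RE.sub(" num ", text)        # numbers -> token
    # Assumption: stdlib stand-in for unidecode.unidecode
    text = (unicodedata.normalize("NFKD", text)
            .encode("ascii", "ignore").decode("ascii"))
    text = text.translate(PUNCT_TABLE)         # drop punctuation
    return text.lower().strip()


print(clean_caption("Sweet & spicy wings"))
print(clean_caption("Hélen's café"))
print(clean_caption("Visit https://www.greatfood.com today"))
```

Two properties of this ordering are worth keeping in mind. The replacements pad their tokens with spaces and nothing collapses repeated spaces afterwards, so cleaned captions can contain double spaces. And because lowercasing happens last, the website pattern only matches lowercase letters, so an address written in capitals would not be replaced.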


## 4. Separate dataset

Now I separate the photos with captions into train and validation splits. Those which don't have captions are set aside for later, and can be our "playground" set.

In [279]:
# create a vector that will index the training data
train_idx = np.repeat(False, df_full.shape[0])
np.random.seed(8)
train_idx[np.random.choice(a = int(len(train_idx)), size = int(.8*len(train_idx)), replace = False)] = True

# a vector that indexes the validation data
valid_idx = np.logical_not(train_idx)

In [280]:
# check that the two masks never overlap and together cover every row
print(np.sum(np.logical_and(train_idx, valid_idx)) == 0)
print(np.sum(np.logical_or(train_idx, valid_idx)) == df_full.shape[0])


True
True
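
The same kind of 80/20 mask can be wrapped in a small helper so that the split fraction and seed are explicit. This is a sketch under assumptions: the function name `make_train_mask` is hypothetical, and it uses NumPy's `default_rng` generator rather than `np.random.seed`, so it will not reproduce the exact rows selected above even with the same seed value.

```python
import numpy as np


def make_train_mask(n_rows, train_frac=0.8, seed=8):
    """Return a boolean mask with int(train_frac * n_rows) True entries."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(n_rows, dtype=bool)
    # sample row positions without replacement and mark them as training rows
    chosen = rng.choice(n_rows, size=int(train_frac * n_rows), replace=False)
    mask[chosen] = True
    return mask


train_mask = make_train_mask(100807)
valid_mask = ~train_mask

print(train_mask.sum(), valid_mask.sum())
```

For the 100,807 captioned photos this gives 80,645 training rows and 20,162 validation rows, since `int(.8 * 100807)` rounds down.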


In [281]:
# isolating training and validation set
df_train = df_full[['photo_id', 'caption']][train_idx]
df_valid = df_full[['photo_id', 'caption']][valid_idx]

In [31]:
mkdir ../data/split_lists

mkdir: ../data/split_lists: File exists
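
The `mkdir` error above is harmless, but it shows up on every re-run of the notebook. A re-runnable alternative from Python is `os.makedirs` with `exist_ok=True`. The sketch below uses a temporary directory as a stand-in so that it does not touch the project tree; in this notebook the target would be `../data/split_lists`.

```python
import os
import tempfile

# Assumption: a temporary base directory stands in for ../data
base_dir = tempfile.mkdtemp()
split_dir = os.path.join(base_dir, "split_lists")

# exist_ok=True makes the call safe to repeat
os.makedirs(split_dir, exist_ok=True)
os.makedirs(split_dir, exist_ok=True)  # second call raises no error

print(os.path.isdir(split_dir))
```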


In [282]:
df_train.to_csv("../data/split_lists/train_ids.csv", index = False)
df_valid.to_csv("../data/split_lists/valid_ids.csv", index = False)
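
One thing to watch when these files are read back: a caption made up only of punctuation (for example "!!!") becomes an empty string after cleaning, and `pd.read_csv` reads an empty field as NaN by default. The sketch below shows the effect on toy data with an in-memory buffer; I have not checked whether the real splits contain such rows, and the names `toy_train` and `toy_nonempty` are illustrative.

```python
import io
import pandas as pd

toy_train = pd.DataFrame({
    "photo_id": ["a", "b"],
    "caption": ["taco salad", ""],  # second caption is empty after cleaning
})

# Round-trip through CSV in memory instead of a file on disk
buffer = io.StringIO()
toy_train.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The empty caption comes back as NaN rather than ""
print(reloaded["caption"].isna().tolist())

# Optional: keep only rows with a non-empty caption before saving
toy_nonempty = toy_train[toy_train["caption"].str.len() > 0]
print(len(toy_nonempty))
```

If empty captions should be kept as strings instead, passing `keep_default_na=False` to `pd.read_csv` is another option.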

In [283]:
# isolate the list of photo id's that don't have a caption
df_missing[['photo_id']].to_csv("../data/split_lists/playground_ids.csv", index = False)

In [254]:
ls -l ../data/split_lists/

total 7048
-rw-r--r--  1 timibennatan  staff  2441275 May  6 10:46 playground_ids.csv
-rw-r--r--@ 1 timibennatan  staff  4680466 May  6 10:46 train_ids.csv
-rw-r--r--  1 timibennatan  staff  1160594 May  6 10:46 valid_ids.csv


K. Peace. 

In [286]:
df_valid.head(20)

Unnamed: 0,photo_id,caption
18,_ExrVJTjGcChfzLH51etAw,shanghai rainbow trout
20,yPUPhsJvT6yx6l8QwShw1Q,grill rainbow trout
28,zvESg-w2JIBL5FhU7F2d-g,chicken parm
45,uqdXqfB8MXW6XU7Hk1gGIQ,mcg holiday jazz
69,VMedbsDZnCxmCE3Pndvtng,dining room
80,Y3OxcrMgt_wvxPqVNq5tPg,meatloaf is back topped with bob evans wildfir...
84,ACfgPpp3q6oaO0ytpx3cgw,quaint bar
85,bw1IaF8FVcmShr6xbE0umA,jerk quesadillas and empty crab soup bowl
128,q6VRi5FwTMqffR5bsuRumQ,table seating to the left
133,MZ6lyOh87ELi3S3Re3r5IQ,seafood variety platters
