# Google Vision API

This notebook applies the Google Vision API to Iens restaurant pictures to detect what food is shown in them. The notebook consists of 3 sections:

1. An example of calling the Vision API for a single image (for the blog)
2. Doing a batch query for all images of all restaurants and uploading the result to BigQuery
3. Querying the results / Analysis

### Initialize

In [None]:
import pandas as pd
import pandas_gbq as gbq 
import json
import matplotlib.pyplot as plt
pd.set_option('display.max_colwidth', 250) # Show long cell values (e.g. URLs) without truncation
%matplotlib inline

In [None]:
# project specifics
PRIVATE_KEY = '../google-credentials/gsdk-credentials.json'
with open(PRIVATE_KEY) as f:
    PROJECT_ID = json.load(f)['project_id']
with open('../google-credentials/gc-API-key.txt') as f:
    APIKEY = f.read().strip()  # drop any trailing newline from the key file

In [None]:
# dataset specifics
city = 'amsterdam'
date = '20180124'
bq_table = '_'.join(['iens.iens', city, date])  # use iens.iens_comments when querying on the comments table
bq_table_out = '_'.join(['iens.iens_images', city, date])  

In [None]:
# select restaurant id and name, plus image_urls
query = "SELECT info.id, info.name, image_urls FROM {} WHERE info.nr_images > 0".format(bq_table)

df = gbq.read_gbq(query, project_id=PROJECT_ID, private_key=PRIVATE_KEY)
df.shape

## 1. Calling the vision API 

First practice - just run a single request. See if it works!

Useful documentation:
* https://developers.google.com/api-client-library/python/start/get_started
* https://github.com/GoogleCloudPlatform/cloud-vision/tree/master/python/text

In [None]:
from googleapiclient.discovery import build
service = build('vision', 'v1', developerKey=APIKEY)
collection = service.images()

### Single request for Blog:

In [None]:
def make_request(url):
    return {'image': {'source': {'imageUri': url}},
            'features': [{
                'type': 'LABEL_DETECTION',
                'maxResults': 10}]}

def execute_request(url):
    # 'requests' must be a list, even for a single image
    return collection.annotate(body={'requests': [make_request(url)]}).execute()

In [None]:
burger_url = 'https://www.okokorecepten.nl/i/recepten/kookboeken/2014/jamies-comfort-food/jamie-oliver-hamburger-500.jpg'
result = execute_request(burger_url)

In [None]:
from IPython.display import Image, display

display(Image(burger_url))
display(pd.DataFrame(result['responses'][0]['labelAnnotations']).drop('mid', axis=1))

## 2. Set up batch request for all restaurants

Now that we know it works for a single image, extend the query to do batch requests.

In [None]:
def make_batch_request(url_list):
    return collection.annotate(body={'requests' : [make_request(url) for url in url_list]})

def execute_batch_request(url_list, num_retries=1):
    return make_batch_request(url_list).execute(num_retries=num_retries)['responses']

In [None]:
examples = {'burger' : 'https://u.tfstatic.com/restaurant_photos/811/352811/169/612/barasti-killer-burger-b42ea.jpg',
            'steak' : 'https://u.tfstatic.com/restaurant_photos/811/352811/169/612/barasti-ribstuk-2c5f9.jpg'}

# execute with
# execute_batch_request(examples.values())

What we want eventually is a dictionary with the following structure, to upload into Google BigQuery:

* restaurant id = integer
* images = list of dicts:
    * image url = string
    * labelAnnotation = list of dicts:
        - description
        - mid
        - score
        - topicality
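
As a concrete illustration of that structure, a single record might look like the sketch below. The id, URL and label values are invented for illustration (the `mid` is a placeholder, not a real identifier); the key names `info_id`, `images`, `image_url` and `label_annotations` follow the ones used later in this notebook.

```python
import json

# Hypothetical record for one restaurant; all values are made up.
example_record = {
    'info_id': 12345,
    'images': [
        {
            'image_url': 'https://example.com/photo-1.jpg',
            'label_annotations': [
                {'description': 'food', 'mid': '/m/xxxx',
                 'score': 0.95, 'topicality': 0.95},
            ],
        },
    ],
}

# One record becomes one line of newline-delimited JSON.
line = json.dumps(example_record)
print(line)
```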
        
Note that Google allows a maximum of 16 images per request: https://cloud.google.com/vision/quotas. As there are not many restaurants with that many photos, let's just aggregate by restaurant for each batch and limit it to 16 images in case it does happen:

In [None]:
# convert to Series for batch request per restaurant
restaurant_image_list = df.groupby(['info_id'])['image_urls'].apply(lambda x: list(x)[0:16])
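
Truncating to 16 images drops any extra photos. If those should be annotated as well, one alternative is to split each restaurant's list into chunks of at most 16 URLs and send one request per chunk. The helper below is only a sketch: `chunk_urls` and `MAX_IMAGES_PER_REQUEST` are names introduced here, not part of the original notebook, and the URLs are placeholders.

```python
MAX_IMAGES_PER_REQUEST = 16  # per-request limit mentioned above

def chunk_urls(url_list, size=MAX_IMAGES_PER_REQUEST):
    """Yield consecutive slices of url_list with at most `size` items each."""
    url_list = list(url_list)
    for start in range(0, len(url_list), size):
        yield url_list[start:start + size]

# Example with placeholder URLs: 35 images become chunks of 16, 16 and 3.
urls = ['https://example.com/{}.jpg'.format(i) for i in range(35)]
print([len(chunk) for chunk in chunk_urls(urls)])  # → [16, 16, 3]
```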

A problem that we encounter while calling the API is that some URLs can be inaccessible to the API, leading to error responses. We could build a loop of some sort (with a delay between each repetition) to keep trying a specific URL until it succeeds. However, as it happens for roughly 10% of all our images, we choose to simply ignore this problem for now and don't return any labels for that specific URL:

In [None]:
def parse_images(image_url, label_annotations):
    try: 
        return {
            'image_url' : image_url,
            'label_annotations' : label_annotations['labelAnnotations']
        }
    except KeyError:
        # don't return label_annotations if not found
        return {
            'image_url' : image_url
        }

Note that the `num_retries` parameter in the `execute()` method doesn't solve our problem. It simply repeats the call, but doesn't automatically combine the successful responses of each attempt into a better final response for the batch.
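
For reference, a retry loop of the kind mentioned earlier could look roughly like the sketch below. It assumes a function that takes a list of URLs and returns one response dict per URL in the same order (like `execute_batch_request`), and it treats a response containing an `error` key as failed. `annotate_with_retries`, `max_attempts` and `delay_seconds` are hypothetical names, and `fake_batch` is only a stub standing in for the real API so that the example runs offline.

```python
import time

def annotate_with_retries(urls, batch_fn, max_attempts=3, delay_seconds=1.0):
    """Call batch_fn on urls, re-sending only URLs whose response has an 'error' key."""
    results = {}
    pending = list(urls)
    for attempt in range(max_attempts):
        if not pending:
            break
        if attempt > 0:
            time.sleep(delay_seconds)  # wait before retrying the failed URLs
        responses = batch_fn(pending)
        failed = []
        for url, response in zip(pending, responses):
            if 'error' in response:
                failed.append(url)
            results[url] = response  # keep the latest response, error or not
        pending = failed
    return [results[url] for url in urls]

# Stub standing in for the API: every URL fails on its first call only.
seen = set()

def fake_batch(urls):
    out = []
    for url in urls:
        if url in seen:
            out.append({'labelAnnotations': [{'description': 'food'}]})
        else:
            seen.add(url)
            out.append({'error': {'message': 'could not fetch image'}})
    return out

responses = annotate_with_retries(['a.jpg', 'b.jpg'], fake_batch, delay_seconds=0)
print(all('labelAnnotations' in r for r in responses))
```

URLs that still fail after the last attempt keep their error response, so `parse_images` would handle them exactly as it does now.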

Let's run it! *(This may take a while..)*

In [None]:
result = []
printcounter = 0
for restaurant_id, image_urls in restaurant_image_list.items():  # iteritems() was removed in pandas 2.0
    # do batch request
    responses = execute_batch_request(image_urls)
    # create images object for one restaurant
    images = [
        parse_images(image_url, label_annotations)
        for image_url, label_annotations in 
        zip(image_urls, responses)
    ]
    # add results for one restaurant to list
    result.append({'info_id' : restaurant_id, 'images' : images})
    if (printcounter % 100 == 0):
        print('Finished restaurant', printcounter, '/', len(restaurant_image_list))
    printcounter += 1
    
len(result)

#### Write to jsonlines

To upload to BigQuery, save the results as JSON Lines:

In [None]:
# write one JSON object per line; json.dumps produces valid JSON, unlike str(dict)
with open('../iens_scraper/output/' + bq_table_out + '.jsonlines', 'w') as file:
    for item in result:
        file.write(json.dumps(item) + '\n')

#### upload to BigQuery

It would be nicer to do this directly from Python, for example with `gbq.to_gbq` (which works for dataframes only).
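
One possible way to skip the intermediate file is the `google-cloud-bigquery` client, which can load a list of dicts directly. This is a sketch and was not part of the original workflow: it assumes that package is installed, and `upload_records` and `full_table_id` are hypothetical helper names.

```python
def full_table_id(project_id, table):
    """Build a fully qualified 'project.dataset.table' id from 'dataset.table'."""
    return '{}.{}'.format(project_id, table)

def upload_records(records, project_id, table, key_path):
    """Load a list of JSON-serializable dicts into BigQuery, replacing the table."""
    # Imported here so that full_table_id works without the package installed.
    from google.cloud import bigquery

    client = bigquery.Client.from_service_account_json(key_path)
    job_config = bigquery.LoadJobConfig(
        autodetect=True,  # counterpart of --autodetect
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # counterpart of --replace
    )
    job = client.load_table_from_json(
        records, full_table_id(project_id, table), job_config=job_config)
    job.result()  # block until the load job finishes

# Usage (needs credentials, so not executed here):
# upload_records(result, PROJECT_ID, bq_table_out, PRIVATE_KEY)
```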

In [None]:
!bq load --autodetect --replace --source_format=NEWLINE_DELIMITED_JSON \
        {bq_table_out} ../iens_scraper/output/{bq_table_out}.jsonlines

## 3. Query images

For example, getting the 15 labels most often returned by the Vision API:

In [None]:
query = """
SELECT images.label_annotations.description, COUNT(*) AS count 
FROM {} 
GROUP BY images.label_annotations.description 
ORDER BY count DESC
LIMIT 15;
""".format(bq_table_out)

query_result = gbq.read_gbq(query, project_id=PROJECT_ID, private_key=PRIVATE_KEY)
query_result.head(15)

Or getting the highest-scoring hamburger image for each restaurant:

In [None]:
keywords = ('hamburger', 'cheeseburger', 'veggie burger', 'slider') 
query = """
SELECT
  info_id, images.image_url, images.label_annotations.score, images.label_annotations.description
FROM (
  SELECT 
      *,
      ROW_NUMBER() OVER(PARTITION BY info_id ORDER BY images.label_annotations.score DESC, images.image_url DESC) AS highest_score
  FROM {}
  WHERE images.label_annotations.description IN {}
)
WHERE highest_score = 1
ORDER BY images.label_annotations.score DESC
""".format(bq_table_out, keywords)

query_result = gbq.read_gbq(query, project_id=PROJECT_ID, private_key=PRIVATE_KEY)
query_result.shape

Looking at the distribution of returned scores, we see that most have a confidence above 80%. The question is: which threshold score do we pick for claiming that there is indeed a burger in the picture?

In [None]:
query_result['images_label_annotations_score'].hist();

Examining some random cases, starting from the lowest scores, we find that the threshold for good burger classification seems to lie around a score of 75%. We also note that the descriptions 'veggie burger' and 'slider' might not really be what we are looking for.

**Conclusion:** Use a score of 75% and up, together with the description 'hamburger', to determine whether a restaurant serves them or not!
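
Applied to a query result, that rule could be sketched as follows. The column name `images_label_annotations_score` is the one used in the histogram cell; `images_label_annotations_description` is assumed to follow the same naming pattern. The small DataFrame is made-up data standing in for `query_result`, and scores are assumed to be on a 0 to 1 scale.

```python
import pandas as pd

SCORE_THRESHOLD = 0.75  # the 75% cut-off chosen above

# Made-up rows standing in for query_result.
sample = pd.DataFrame({
    'info_id': [1, 2, 3, 4],
    'images_label_annotations_description': [
        'hamburger', 'hamburger', 'slider', 'veggie burger'],
    'images_label_annotations_score': [0.91, 0.62, 0.88, 0.80],
})

# Keep only rows labelled 'hamburger' with a score at or above the threshold.
is_burger = (
    (sample['images_label_annotations_description'] == 'hamburger')
    & (sample['images_label_annotations_score'] >= SCORE_THRESHOLD)
)
burger_ids = sample.loc[is_burger, 'info_id']
print(burger_ids.tolist())  # → [1]
```

The same mask could be applied to `query_result` before the ids are written to file.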

### Write ids of restaurants with hamburgers to file

In [None]:
query_result['info_id'].to_csv('../iens_scraper/output/image_tags.csv', index=False)

Done.