In [None]:
from IPython.core.display import HTML
HTML(""" <link href="https://fonts.googleapis.com/css2?family=Inter:wght@600&family=Noto+Sans+JP&display=swap" rel="stylesheet"> 
<style>
    div.text_cell_render h1 {
        font-family: 'Inter';
        font-size: 1.7em;
        line-height:1.4em;
        text-align:center;
        }

    div.text_cell_render { 
        font-family: 'Noto Sans JP';
        font-size:1.05em;
        line-height:1.5em;
        padding-left:3em;
        padding-right:3em;
        }
</style>""")

# The Google Vision API

This notebook describes how to interact with the Google Cloud Vision API (Application Programming Interface). The API supports a range of tasks, from identifying objects in images to extracting text from them. One feature, the so-called "web detection" module, is particularly useful for humanities research: it identifies replications of a specific image in the Google index.** The module returns a list of web addresses (URLs) that can be used to gather information about the afterlives of images. This notebook explains how to use the API.

The interaction with the API is explained in this Jupyter Notebook. A notebook is a more interactive and visually comprehensible form of a Python script. A Python script is usually a ".py" file that contains code. If you have installed Python on your system, you can run such scripts in a terminal, where a script runs from start to finish in one go. A Jupyter Notebook, by contrast, runs cell by cell: you can execute every cell separately. This allows for step-by-step execution of the script and, for example, inspection of the data you work with.

Cells are executed by pressing ```Shift``` + ```Enter```.

--------------------------------------------------

** Google's index of the Internet is not the same as the Internet itself, but only a fine-grained "roadmap" of it. Webpages that host an image may not be indexed by Google: it is estimated that Google has indexed only [4% of all existing webpages](https://eu.tennessean.com/story/money/tech/2014/05/02/jj-rosen-popular-search-engines-skim-surface/8636081/) (which still amounts to several trillion pages). Moreover, websites that existed in the past but are no longer present on the web fall outside the scope of the API. These limitations must be kept in mind when using the API and analyzing its output.

### What is an API?

An Application Programming Interface (API) is a way to connect to a service or dataset. In the case of Google Cloud Vision, the API handles the communication between the Google service (which runs on a server somewhere) and your machine. An API can be compared with a "save as" button or a search interface: it offers a fixed set of actions without showing how the service carries them out.

### Setting up your Google Cloud Account

Before we begin the interaction, we need to set up our account at Google. All the steps are described [here](https://cloud.google.com/vision/docs/before-you-begin), but we will walk through them in this notebook as well.

1. Make a Google Account (or use an existing one). You can register [here](https://accounts.google.com/signup/v2/webcreateaccount?service=mail&continue=https%3A%2F%2Fmail.google.com%2Fmail%2F&ltmpl=default&dsh=S-1022536791%3A1581929757665542&gmb=exp&biz=false&flowName=GlifWebSignIn&flowEntry=SignUp). It is advised to make a new Google account.

2. Google has a broad range of API services. We use only the Cloud Vision API. Activate it [here](https://console.cloud.google.com/flows/enableapi?apiid=vision.googleapis.com&_ga=2.55858594.803327971.1581929623-746618159.1580742281). 

3. Once we have activated the API, it requires authentication. This means that we need to create a project that comes with files that _identify_ that specific project. To do this, create a "service account key" [here](https://console.cloud.google.com/apis/credentials/serviceaccountkey?_ga=2.52048288.803327971.1581929623-746618159.1580742281). Select "new service account" and enter a name for your project. Then, select the role of "owner" for yourself in the project. Lastly, click on "create". This will prompt the download of a .json file. JSON is a popular file type for storing information. In this .json file your project credentials are found. 

### Interaction with the API

Interaction with the API is done by creating URLs that request information, similar to typing URLs in the address bar. Creating requests can be done in many ways. Google has a nice explanation of making an API request from scratch [here](https://cloud.google.com/vision/docs/internet-detection).
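
As a rough sketch of what such a request can look like, the code below builds the JSON body of a web detection request for the public REST endpoint. The function name ```build_web_detection_request``` and its ```max_results``` parameter are illustrative choices of ours, and we do not claim that the ```scrapelib``` scripts build their requests in exactly this way.

```python
import base64
import json

# Endpoint of the Vision REST API; the API key is passed as the "key" query parameter.
ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"

def build_web_detection_request(image_bytes, max_results=100):
    """Return the JSON-serialisable body for one web detection request.

    Hypothetical helper for illustration only.
    """
    return {
        "requests": [
            {
                # The image is sent as base64-encoded text
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "WEB_DETECTION", "maxResults": max_results}],
            }
        ]
    }

# Placeholder bytes stand in for the contents of a real image file
body = build_web_detection_request(b"fake image bytes", max_results=10)
print(json.dumps(body, indent=2))
```

Sending this body as an HTTP POST to the endpoint, with your API key in the ```key``` query parameter, returns the detection results as JSON.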

Because building an API request from scratch is a relatively complex task, we created a set of Python scripts that handle most of the coding. These scripts are found in the ```scrapelib``` folder. They can be loaded into this notebook by placing the ```scrapelib``` folder and the ```gcv_api.py``` and ```functions.py``` files in the same folder as this notebook. We then import them, along with some other libraries.

This immediately touches upon one of the core elements of Python: modules. Modules are secondary sets of Python scripts that handle specific tasks. There is for example a module for making visualizations, and a module for working with images. The ```scrapelib``` folder, together with the ```gcv_api.py``` and ```functions.py``` files can be seen as a module for working with the API. Additionally, some other modules need to be loaded. We do so by literally importing them:

In [None]:
import os,sys
from gcv_api import main
from functions import *
from multiprocessing.dummy import Pool as ThreadPool
import concurrent.futures

Now it's time to run some code. When working with large datasets, it's important to have a proper folder structure. In this case, the code needs the following structure:

```
+-- photo_folder
|   +-- example_photo_1_folder
|   |   +-- example_photo_1_folder_source
|   |       +-- example_photo_1.jpg
|   +-- example_photo_2_folder
|       +-- example_photo_2_folder_source
|           +-- example_photo_2.jpg
```

The scripts look at all the photo folders in the main folder. In every photo folder, the original image file needs to be located in the subfolder whose name ends in ```_source```.

Besides having a folder structure, we need to define some variables. We define:
- the API key from your own account (usually a very long code, something like 912421-09821f-n13r39)
- the path to the folder where all the photos are stored (```photo_folder``` in the example)
- the name of the photo folder we want to scrape (```example_photo_1_folder```)
- the input folder (a combination of the top ```photo_folder```, the name of the photo folder and its ```source``` folder)
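
To make the path logic concrete, here is a minimal sketch that recreates the example layout in a temporary directory and builds the input folder from its parts. The helper ```source_folder``` is a hypothetical name introduced for illustration, and empty placeholder files stand in for the real images.

```python
import os
import tempfile

def source_folder(base_path, photo):
    """Return the path of the source folder that belongs to one photo folder."""
    return os.path.join(base_path, photo, photo + "_source")

# Build a throwaway copy of the layout so the path logic can be checked
base = os.path.join(tempfile.mkdtemp(), "photo_folder")
for photo_name in ["example_photo_1_folder", "example_photo_2_folder"]:
    folder = source_folder(base, photo_name)
    os.makedirs(folder, exist_ok=True)
    # An empty placeholder file stands in for the real image
    image_name = photo_name.replace("_folder", "") + ".jpg"
    open(os.path.join(folder, image_name), "wb").close()

print(os.listdir(source_folder(base, "example_photo_1_folder")))
```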

In [None]:
os

In [None]:
api_key = "YOUR_API_KEY"  # paste the key from your own Google Cloud project; never share or publish it
base_path = "media/ruben/FEF44259F44213F5/Users/Ruben/Documents/GitHub/ReACT_GCV/notebooks/photo_folder"  # set this to the path of notebooks/photo_folder on your own machine
photo = "example_photo_1_folder"
# The source folder is named after the photo folder, with "_source" appended
input_folder_ = os.path.join(base_path, photo, photo + "_" + "source")
print("Image in input folder: {}".format(os.listdir(input_folder_)[0]))

Now that we have our input folder, it's time to call the API! As said, this only requires feeding some variables into the ```scrapelib``` library, which we imported above through the ```gcv_api.py``` file placed in the same folder as this notebook. If you want to know more about the inner workings of the scraper function, take a look at the Python scripts in the ```scrapelib``` folder.

Below we call the API using the ```main.main``` function. We pass the variables we just defined and set ```iteration``` to ```1```.

In [None]:
main.main(
            input_folder = input_folder_,
            key = api_key,
            output_folder = os.path.join(base_path, photo, photo + "_"),
            iteration = 1
            )

After running this function, you will find a .json file in the output folder. JSON files are very helpful in ordering data. They are structured hierarchically: they contain identifier/value pairs, and a value can itself contain further pairs or lists. One of the identifiers is ```pagesWithMatchingImages```. It is under this identifier that we find a list of URLs. These are the pages where your image is located.
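
The following sketch shows how such a hierarchy can be walked to collect the page URLs. The nesting used here (```responses```, then ```webDetection```) is an assumption based on the shape of the public API response, so the files written by the ```scrapelib``` scripts may be organised differently. The sample data and the helper ```page_urls``` are made up for illustration.

```python
import json

# A hand-written stand-in for one output file; real files are much larger
sample = json.loads("""
{
  "responses": [
    {
      "webDetection": {
        "pagesWithMatchingImages": [
          {"url": "https://example.org/page-a"},
          {"url": "https://example.org/page-b"}
        ]
      }
    }
  ]
}
""")

def page_urls(result):
    """Collect the page URLs listed under pagesWithMatchingImages."""
    urls = []
    for response in result.get("responses", []):
        pages = response.get("webDetection", {}).get("pagesWithMatchingImages", [])
        # Skip entries that carry no URL
        urls.extend(page["url"] for page in pages if "url" in page)
    return urls

print(page_urls(sample))
```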

We will discuss how to handle these files in the next notebooks. In the remainder of this notebook, we will focus on something else. When your image is famous or iconic, it is likely that there are hundreds, thousands or even millions of associated webpages. The Google API seems to have an internal limit on the number of webpages it returns in one .json file. This is problematic, because we do not know which URLs are returned and which ones are omitted. In fact, it seems that Google just returns the most recent URLs that host your image. 

To solve this (i.e. to extend the number of results) we make use of a rather simple trick: we feed the images associated with the URLs found in the [output].json file back into the API. The images found by the API are likely to contain small variations: different color tones, dimensions or sizes. Because of this variation, the API tracks down new instances of similar images. We can repeat this trick until we encounter no more new URLs.
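
The stopping rule ("repeat until no new URLs appear") can be sketched with toy data. In the code below, ```lookup``` stands in for a call to the API, and both ```expand``` and ```toy_index``` are hypothetical names used only for this illustration.

```python
def expand(seed, lookup):
    """Repeat the lookup until an iteration yields no unseen URLs."""
    seen = set()
    frontier = [seed]
    iteration = 0
    while frontier:
        iteration += 1
        found = []
        for item in frontier:
            for url in lookup(item):
                if url not in seen:
                    seen.add(url)
                    found.append(url)
        print("iteration {}: {} new URLs".format(iteration, len(found)))
        # Only the newly found images are fed back in the next round
        frontier = found
    return seen

# Toy data standing in for the API: each image maps to the images it leads to
toy_index = {
    "source.jpg": ["a.jpg", "b.jpg"],
    "a.jpg": ["b.jpg", "c.jpg"],
    "b.jpg": ["a.jpg"],
    "c.jpg": [],
}
result = expand("source.jpg", lambda item: toy_index.get(item, []))
```

With this toy data the loop stops after the third iteration, because that round returns nothing new.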

This method has some implications. In theory, we are dealing with exponential growth in the number of results. Assuming that each scraped image returns about 100 URLs and that the images on those webpages are uploaded in turn, we end up with 100 x 100 x 100 x 100 x 100 = 10 billion results after five iterations. Luckily, this is only theory, because every iteration yields many duplicate URLs: URLs found earlier are likely to reappear in later iterations. This is why we have to remove the duplicates, not least because you will burn through your free Google Cloud credits fast if you upload the same image over and over.
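
The arithmetic, and the effect of duplicates, can be checked in a few lines. The figure of 100 URLs per image is the assumption made in the text, and the short URL list is made-up example data.

```python
urls_per_image = 100  # assumed number of URLs returned per uploaded image
iterations = 5

# Upper bound if every returned URL were new
upper_bound = urls_per_image ** iterations
print("{:,}".format(upper_bound))  # 10,000,000,000

# Duplicates collapse once the URLs are stored in a set
returned = ["u1", "u2", "u1", "u3", "u2"]
unique = set(returned)
print(len(returned), len(unique))  # 5 3
```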

Below we upload the results from the first iteration "back" into the API. To do so, we need to gather all the web addresses of the images. These are also included in the .json files. To make things easier, we included a function to gather them in the ```functions.py``` file. We indicate the iteration number in the variable n.

In [None]:
n = 1

In [None]:
# Collect the .json output files of iteration n
iteration_folder = os.path.join(base_path, photo, photo + "_" + str(n))
list_json = [os.path.join(iteration_folder, f) for f in os.listdir(iteration_folder) if '.json' in f]
print('looking for scraped image URLs in {}'.format(list_json))
image_url_current = Json.extract_image_folder(list_json)
# Show the first ten image URLs
for x in image_url_current[0:10]:
    print(x)

Now we have all the URLs referring to the actual images in the variable ```image_url_current```. After checking that the list is not empty (otherwise there is nothing to download), we download the images to our own computer. The previously defined variables are used to create a new folder called ```img``` in the folder belonging to the nth iteration. Because scraping uses some additional libraries that are a bit complex, we wrapped it in a function: simply call ```Img.Scrape``` on your list of URLs. Don't forget to change the working directory to the destination folder first (with ```os.chdir```).

In [None]:
# Check if there are images to scrape; if not, move on to the next photo.
# Test for None first: len(None) would raise a TypeError.
if image_url_current is None or len(image_url_current) == 0:
    print("No URLs found in Iteration {}, going to next photo".format(n))

In [None]:
# Scrape images
images_destination = os.path.join(base_path,photo,photo + "_" + str(n), "img")
if not os.path.exists(images_destination):
    os.makedirs(images_destination)

os.chdir(images_destination)
Img.Scrape(image_url_current)

After downloading all the images, there is one thing left to do: remove the duplicates based on the URLs in the JSON. In subsequent iterations of scraping we do this in advance (remove duplicate URLs from the list). Additionally, we remove images that are very small. The threshold size is currently 4000 bytes.
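
To illustrate the two clean-up steps, the sketch below removes files under a size threshold and then removes files with identical content. Note that it detects duplicates by hashing file contents, which is a technique we swap in for illustration; it is not a description of how ```Img.RemoveSmall``` or ```Duplicates.remove``` work internally. The names ```remove_small``` and ```remove_identical``` are hypothetical.

```python
import hashlib
import os
import tempfile

def remove_small(folder, min_bytes):
    """Delete files smaller than min_bytes; return the removed names."""
    removed = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path) and os.path.getsize(path) < min_bytes:
            os.remove(path)
            removed.append(name)
    return removed

def remove_identical(folder):
    """Keep the first file for each distinct content hash; delete the rest."""
    seen, removed = set(), []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            digest = hashlib.sha256(handle.read()).hexdigest()
        if digest in seen:
            os.remove(path)
            removed.append(name)
        else:
            seen.add(digest)
    return removed

# Demonstration on throwaway files
demo = tempfile.mkdtemp()
for name, data in [("a.jpg", b"x" * 5000), ("b.jpg", b"x" * 5000), ("c.jpg", b"tiny")]:
    with open(os.path.join(demo, name), "wb") as handle:
        handle.write(data)

small = remove_small(demo, 4000)
dupes = remove_identical(demo)
print(small, dupes, sorted(os.listdir(demo)))
```

Hashing only catches byte-for-byte copies. Images that differ slightly in size or color are kept, which suits the approach of feeding variations back into the API.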

In [None]:
# Remove duplicates
if len(image_url_current) > 1:
    Img.RemoveSmall(images_destination,4000)
    Duplicates.remove(images_destination)

## Full Pipeline

That's it! You now have the code to find images on the internet. To search for images beyond the first iteration, use the cell below and adjust the iteration number (here we start with 2 because we already did the first iteration).

In [None]:
n = 2


# Post Images to API
if int(n) == 1:
    input_folder_ = os.path.join(base_path, photo, photo + "_" + "source")

if int(n) > 1:
    input_folder_ = os.path.join(base_path, photo, photo + "_" + str(int(n)-1), "img")

try:
    main.main(
            input_folder = input_folder_,
            key = api_key,
            output_folder = os.path.join(base_path, photo, photo + "_"),
            iteration = n
            )
except Exception as e:
    print(e)

# Gather Image URLs from Output files (.json files) and (if n > 1) remove duplicates
list_json = [os.path.join(base_path, photo, photo + "_" + str(n),f) for f in os.listdir(os.path.join(base_path,photo,photo + "_" + str(n))) if '.json' in f]
print('looking for scraped URLs in {}'.format(list_json))
image_url_current = Json.extract_image_folder(list_json)

if int(n) > 1:
    processed_urls = []
    for iter_previous in range(1,int(n)):
        list_json_prev = [os.path.join(base_path, photo, photo + "_" +str(iter_previous),f) for f in os.listdir(os.path.join(base_path,photo,photo + "_" + str(iter_previous))) if '.json' in f]
        print('looking for previous URLs in {}'.format(list_json_prev))
        image_url = Json.extract_image_folder(list_json_prev)
        processed_urls = processed_urls + image_url

    processed_urls = set(processed_urls)  # a set makes the membership tests below fast
    duplicates = [u for u in image_url_current if u in processed_urls]
    print("{}/{} image-URLs removed (duplicates)".format(len(duplicates),len(image_url_current)))
    image_url_current = [u for u in image_url_current if u not in processed_urls]

# Check if there are images to scrape; if not, stop and go to the next photo.
# Test for None first: len(None) would raise a TypeError.
if image_url_current is None or len(image_url_current) == 0:
    print("No URLs found in Iteration {}, going to next photo".format(n))
    exit()

# Scrape images
images_destination = os.path.join(base_path,photo,photo + "_" + str(n), "img")
if not os.path.exists(images_destination):
    os.makedirs(images_destination)

os.chdir(images_destination)
Img.Scrape(image_url_current)

# Remove duplicates
if len(image_url_current) > 1:
    Img.RemoveSmall(images_destination,4000)
    Duplicates.remove(images_destination)