<a href="https://colab.research.google.com/github/sayakpaul/GCP-ML-API-Demos/blob/master/Abstract_Locator_Reader.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook presents a small demo using the [Vision](https://cloud.google.com/vision) and [Text-to-Speech](https://cloud.google.com/text-to-speech) APIs offered by GCP. The workflow of the demo is as follows:

<div align="center"><img src="https://i.ibb.co/jLXqJRw/image.png"></img></div> 

This demo requires a billing-enabled GCP project with the Vision and Text-to-Speech APIs enabled. You also need your GCP credentials key in `json` format (see [here](https://cloud.google.com/docs/authentication/getting-started)). I followed the official samples and tutorials of the APIs (available at the links above) to develop this demo. Additionally, I used the [`pdf2image`](https://pypi.org/project/pdf2image/) and [`pytesseract`](https://pypi.org/project/pytesseract/) libraries for PDF-to-PNG conversion and local OCR, respectively. 


Thanks to the [GDE program](https://developers.google.com/programs/experts/) for providing the GCP credit support that made this demo possible. 

<div align="center"><img src="https://i.ibb.co/ZXtwJjV/Webp-net-resizeimage.png" width="100" height="100"></img></div> 

In [None]:
#@title Upload your GCP credentials key to Colab
from google.colab import files
files.upload()

In [None]:
#@title Install dependencies
!sudo apt install poppler-utils
!sudo apt-get install tesseract-ocr
!pip install --upgrade google-cloud-vision
!pip install --upgrade google-cloud-texttospeech
!pip install pdf2image
!pip install pytesseract

In [1]:
#@title Set the path to GCP credentials key
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/content/fast-ai-exploration-f32c198aac7e.json' 
!echo $GOOGLE_APPLICATION_CREDENTIALS

/content/fast-ai-exploration-f32c198aac7e.json
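
Before calling the APIs, it can help to confirm that the key file is actually where the environment variable points. The following is a minimal sketch, not part of the original demo: `check_credentials_file` is a hypothetical helper, and the list of fields it looks for is an assumption about what a service-account key normally contains.

```python
import json
import os

def check_credentials_file(path):
    """Return a list of problems found with a service-account key file."""
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    if not os.path.isfile(path):
        return ["no file at {}".format(path)]
    try:
        with open(path, "r") as f:
            key = json.load(f)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError
        return ["{} is not valid JSON".format(path)]
    if not isinstance(key, dict):
        return ["{} does not contain a JSON object".format(path)]
    # Fields assumed to be present in a service-account key
    required = ("type", "project_id", "private_key", "client_email")
    return ["missing field: {}".format(name) for name in required if name not in key]

# An empty list means no problems were found
print(check_credentials_file(os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")))
```

The check only inspects the file locally; it does not verify that the key is still valid or that the APIs are enabled for the project.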


In [2]:
#@title Imports
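from google.cloud import vision
try:
    from google.cloud.vision import types  # google-cloud-vision < 2.0
except ImportError:
    types = vision  # 2.0+ exposes Image and the other types at the top level
from google.cloud import texttospeech

from PIL import Image, ImageDraw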

from IPython.display import Audio

from pdf2image import convert_from_path

import matplotlib.pyplot as plt
import numpy as np
import pytesseract
import html
import io
import re

In [3]:
#@title Utility for converting a PDF to PNG (only the first page)
def pdf2png_first_page(pdf_path, png_path="first_page.png"):
	# First convert the PDF to PNG images and then serialize the first
	# page (since it contains the abstract) as an image
	pages = convert_from_path(pdf_path, dpi=500)
	pages[0].save(png_path)
	print("First page is serialized as {}".format(png_path))

In [4]:
#@title Download an arXiv paper and serialize its first page as an image
paper_download_link = "https://storage.googleapis.com/video-api-storage/ResNet.pdf" #@param {type:"string"}
first_page_image_name = "first_page.png" #@param {type:"string"}

!wget -q $paper_download_link -O paper.pdf
pages = convert_from_path('paper.pdf', 500)
pages[0].save(first_page_image_name, 'PNG')
print("First page of the paper serialized to {}".format(first_page_image_name))

First page of the paper serialized to first_page.png
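
Since only the first page is needed, rendering every page at 500 DPI does more work than necessary. The sketch below limits the conversion to page 1 using the `first_page` and `last_page` arguments of `convert_from_path`. The function name `render_first_page` is hypothetical, and the import sits inside the function only so that the definition loads even where `pdf2image` is not installed.

```python
def render_first_page(pdf_path, png_path="first_page.png", dpi=500):
    """Render only page 1 of a PDF and save it as a PNG."""
    # Imported lazily; requires pdf2image and poppler-utils at call time
    from pdf2image import convert_from_path

    # first_page and last_page are 1-indexed, so this renders a single page
    pages = convert_from_path(pdf_path, dpi=dpi, first_page=1, last_page=1)
    pages[0].save(png_path, "PNG")
    return png_path
```

Calling `render_first_page("paper.pdf")` should produce the same `first_page.png` as the cell above, without holding the remaining pages in memory.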


In [5]:
#@title Utility functions
#@markdown References:
#@markdown - https://cloud.google.com/vision/docs/fulltext-annotations
#@markdown - https://stackoverflow.com/questions/22588074/polygon-crop-clip-using-python-pil
def get_document_blocks(image_file):
	client = vision.ImageAnnotatorClient()
	bounds = []

	# Use a separate name for the handle so the path argument is not shadowed
	with io.open(image_file, 'rb') as f:
		content = f.read()

	image = types.Image(content=content)
	response = client.document_text_detection(image=image)
	document = response.full_text_annotation

	# Segregate the blocks
	for page in document.pages:
		for block in page.blocks:
			bounds.append(block.bounding_box)

	# `bounds` only contains the coordinates for blocks.
	# Pages→Blocks→Paragraphs→Words→Symbols
	return bounds

def extract_text_from_blocks(image, bounds,
	char_threshold=1000, debug=False):
	# Reference: https://stackoverflow.com/questions/22588074/polygon-crop-clip-using-python-pil
	texts = []
	# ==================Take polygon crops=====================
	for i, bound in enumerate(bounds):
		imArray = np.asarray(image)
		maskIm = Image.new('L', (imArray.shape[1], imArray.shape[0]), 0)
		ImageDraw.Draw(maskIm).polygon([
			(bound.vertices[0].x, bound.vertices[0].y),
			(bound.vertices[1].x, bound.vertices[1].y),
			(bound.vertices[2].x, bound.vertices[2].y),
			(bound.vertices[3].x, bound.vertices[3].y)],
				outline=1, fill=1)
		mask = np.array(maskIm)

		# Assemble new image (uint8: 0-255)
		newImArray = np.empty(imArray.shape,dtype='uint8')

		# Copy color values (RGB)
		newImArray[:,:,:3] = imArray[:,:,:3]

		# Filtering image by mask
		newImArray[:,:,0] = newImArray[:,:,0] * mask
		newImArray[:,:,1] = newImArray[:,:,1] * mask
		newImArray[:,:,2] = newImArray[:,:,2] * mask

		# =========Employ Tesseract to perform OCR locally=========
		text = pytesseract.image_to_string(newImArray)
		if len(text) > char_threshold:
			texts.append(text)
			if debug:
				plt.imshow(newImArray)
				plt.show()
				print(text)

	return texts

def draw_boxes(image, bounds, color):
	draw = ImageDraw.Draw(image)
	for bound in bounds:
		draw.polygon([
			bound.vertices[0].x, bound.vertices[0].y,
			bound.vertices[1].x, bound.vertices[1].y,
			bound.vertices[2].x, bound.vertices[2].y,
			bound.vertices[3].x, bound.vertices[3].y], None, color)
	return image

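def get_probable_abstract(texts):
	# Return the shortest block that starts with an upper-case word and
	# contains at most one "[" (i.e. few citation markers)
	texts_sorted = sorted(texts, key=len)
	for text in texts_sorted:
		words = text.split()
		if words and words[0].isupper() and text.count("[") <= 1:
			return text
	# Fall back to the shortest block when none passes the check
	return texts_sorted[0] if texts_sorted else ""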

def render_doc_text(filein, fileout):
	image = Image.open(filein)
	bounds = get_document_blocks(filein)
	
	draw_boxes(image, bounds, 'red')
	if fileout != 0:
		image.save(fileout)    
		print("Image serialized as {}".format(str(fileout)))
	
	return image, bounds

In [6]:
#@title Draw bounding boxes around dense text blocks
input_image_path = "first_page.png" #@param {type:"string"}
output_image_path = "first_page_bounded.png" #@param {type:"string"}
_, bounds = render_doc_text(input_image_path, fileout=output_image_path)

Image serialized as first_page_bounded.png
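
Each bounding box returned by the Vision API is a four-vertex polygon. The polygon masking in `extract_text_from_blocks` keeps the full page size for every block; an alternative is to reduce each polygon to its enclosing rectangle and crop to it. The sketch below shows that reduction under stated assumptions: `Vertex` is a stand-in for the API's vertex objects, `polygon_to_crop_box` is a hypothetical helper, and the coordinates are made up for illustration.

```python
from collections import namedtuple

# Stand-in for the vertex objects in a Vision API bounding box (illustrative only)
Vertex = namedtuple("Vertex", ["x", "y"])

def polygon_to_crop_box(vertices, padding=0):
    """Return the (left, upper, right, lower) rectangle enclosing a polygon."""
    xs = [v.x for v in vertices]
    ys = [v.y for v in vertices]
    # Clamp the top-left corner so padding never produces negative coordinates
    return (
        max(min(xs) - padding, 0),
        max(min(ys) - padding, 0),
        max(xs) + padding,
        max(ys) + padding,
    )

block = [Vertex(120, 300), Vertex(980, 305), Vertex(978, 640), Vertex(118, 636)]
print(polygon_to_crop_box(block))  # (118, 300, 980, 640)
```

The resulting tuple has the box layout that `Image.crop` from Pillow accepts, so a smaller image could be passed to Tesseract for each block.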


In [7]:
#@title Generate a probable abstract from the first page of the paper
abstract_output_file = "abstract_summary.txt" #@param {type:"string"}

image = Image.open(input_image_path)
texts = extract_text_from_blocks(image, bounds)
probable_abstract = get_probable_abstract(texts)
with open(abstract_output_file, "w") as f:
    f.write(probable_abstract)
print("Abstract written to {}".format(str(abstract_output_file)))

Abstract written to abstract_summary.txt
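
The rule that `get_probable_abstract` appears to aim for is: among the long text blocks, take the shortest one that starts with an upper-case word and has at most one `[` (a rough proxy for "few citations"). The toy example below illustrates that rule on made-up strings. `pick_abstract` and the sample candidates are hypothetical and only meant to show how the checks interact.

```python
def pick_abstract(candidates, max_brackets=1):
    """Return the shortest candidate that passes both checks, or None."""
    for text in sorted(candidates, key=len):
        words = text.split()
        # `and` is required here: `&` would bind tighter than `<=`
        if words and words[0].isupper() and text.count("[") <= max_brackets:
            return text
    return None

candidates = [
    "INTRODUCTION Deep networks [1] have led to breakthroughs [2, 3] in image classification.",
    "ABSTRACT Deeper neural networks are more difficult to train.",
    "deep residual learning for image recognition",
]
print(pick_abstract(candidates))
```

Here the lower-case title is skipped, the introduction is longer and carries two citation markers, and the block starting with `ABSTRACT` is returned.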


In [8]:
#@title Utility functions for generating SSML and audio
#@markdown Reference: https://cloud.google.com/text-to-speech/docs/ssml-tutorial
def text_to_ssml(abstract_text_file):
    # Read the whole input file
    with open(abstract_text_file, "r") as f:
        raw_lines = f.read()
    
    # Escape SSML-reserved characters, then strip citation markers
    # such as [1] or [2, 3]
    raw_lines = html.escape(raw_lines)
    raw_lines = re.sub(r"\[[\d\s,]*\]", "", raw_lines)
    
    # Convert plaintext to SSML
    # Pause for 200 ms at each line break
    ssml = "<speak>{}</speak>".format(
        raw_lines.replace("\n", '\n<break time="200ms"/>')
    )

    # Return the concatenated string of ssml script
    return ssml

def ssml_to_audio(ssml_text, outfile="sample_audio.mp3"):
    # Instantiates a client
    client = texttospeech.TextToSpeechClient()

    # Sets the text input to be synthesized
    synthesis_input = texttospeech.SynthesisInput(ssml=ssml_text)

    # Builds the voice request, selects the language code ("en-US") and
    # the SSML voice gender ("MALE")
    voice = texttospeech.VoiceSelectionParams(
        language_code="en-US", ssml_gender=texttospeech.SsmlVoiceGender.MALE
    )

    # Selects the type of audio file to return
    audio_config = texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    )

    # Performs the text-to-speech request on the text input with the selected
    # voice parameters and audio file type
    response = client.synthesize_speech(
        input=synthesis_input, voice=voice, audio_config=audio_config
    )

    # Writes the synthetic audio to the output file.
    with open(outfile, "wb") as out:
        out.write(response.audio_content)
        print("Audio content written to file " + outfile)

    return str(outfile)

In [9]:
#@title Generate audio for the abstract
ssml = text_to_ssml(abstract_output_file)
audio_filename = ssml_to_audio(ssml, "abstract.mp3")
Audio(filename=audio_filename, autoplay=True)

Audio content written to file abstract.mp3
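
To see what the SSML step produces without calling the Text-to-Speech API, the text processing can be run on a short sample string. This is a minimal sketch using only the standard library. `strip_citations`, `to_ssml`, and the sample text are hypothetical, and the citation pattern assumes markers of the form `[1]` or `[1, 2]`.

```python
import html
import re

def strip_citations(text):
    """Remove numeric citation markers such as [1] or [2, 3]."""
    return re.sub(r"\s*\[[\d\s,]*\]", "", text)

def to_ssml(text, pause="200ms"):
    """Escape reserved characters and insert a pause at each line break."""
    escaped = html.escape(strip_citations(text))
    body = escaped.replace("\n", '\n<break time="{}"/>'.format(pause))
    return "<speak>{}</speak>".format(body)

sample = "Residual nets [1, 2] are easier to optimize.\nDepth < 152 layers & more."
print(to_ssml(sample))
```

The printed string keeps the line break, adds one `<break>` tag after it, and replaces `<` and `&` with their escaped forms so the result stays valid SSML.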
