# Usage

Follow these steps:

1. Import and run the initial cells
2. Choose a document and run `crawl_doc` and `generate_keys` cells for the document:
    - There are 9 examples (documents) in the notebook: [Doc1](#doc1), [Doc2](#doc2), [Doc3](#doc3), [Doc4](#doc4), [Doc5](#doc5), [Doc6](#doc6), [Doc7](#doc7), [Doc8](#doc8), [Doc9](#doc9). Just pick one to get the keywords, then proceed to the next step.
3. Next, execute the [FIREBASE pipeline](#fatih) for the selected document.
4. Finally, you can [visualize the results](#visualize) and inspect all the obtained images together with the computed semantic similarity scores.

In [1]:
%load_ext autoreload
%autoreload 2

In [5]:
import warnings
warnings.filterwarnings('ignore')

import pathlib
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import display

import sys
repo_root = str((pathlib.Path.cwd().parent / 'code').resolve())
sys.path.append(repo_root)

from firebase_utils import firebase
from wiki_utils import save_wiki_image, get_wiki_images
from bing_utils import get_bing_results_per_paragraph
from img_utils import get_image_with_url
from doc_utils import crawl_doc

# Initialize FIREBASE instance

In [6]:
firebase_model = firebase()

<a id='doc1'></a>
# Doc1. Human digestive system

In [400]:
url = 'https://www.asge.org/home/about-asge/newsroom/media-backgrounders-detail/human-digestive-system'
pars = crawl_doc(url,th=25)

"par1" The human digestive system, (also known as the digestive tract, the GI tract, the alimentary canal) is a series of connected organs leading from the mouth to the anus. The digestive system allows us to break down the food we eat to obtain energy and nourishment. 

"par2" The digestive system -- which can be up to 30 feet in length in adults -- is usually divided into eight parts: the mouth, the esophagus, the stomach, the small intestine (or "small bowel") and the large intestine (also called "large bowel" or "colon") with the liver, pancreas, and gallbladder adding secretions to help digestion. These organs combine to perform six tasks: ingestion, secretion, propulsion, digestion, absorption, and defecation. 

"par3" The mouth starts the process by ingesting and mechanically breaking down the food we eat into a swallowable form, adding some early secretions to start the process of digestion. The esophagus is the muscular tube connecting the mouth to the stomach. A ring-like mus

In [401]:
keys, par_embeddings = firebase_model.generate_keys(pars, num_words=2, thresh=0.51, top_n=1)
keys

{'par1': 'digestive tract', 'par2': 'large intestine'}

# Notes
- Threshold on paragraph length: I set a 25-word limit (`th=25`).
- A single-word keyword works; with a similarity threshold of 0.42 we get two images for the document, which is fine because those images cover most of this short article.
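The paragraph threshold mentioned above can be illustrated with a minimal sketch. `filter_paragraphs` is a hypothetical helper, not the implementation of `crawl_doc`; it assumes the limit is a minimum number of whitespace-separated words and that kept paragraphs are numbered `par1`, `par2`, and so on.

```python
def filter_paragraphs(paragraphs, min_words=25):
    """Keep only paragraphs with at least `min_words` whitespace-separated words.

    Hypothetical helper: it mirrors what the `th` argument appears to do,
    it is not the actual code of `crawl_doc`.
    """
    kept = {}
    for text in paragraphs:
        if len(text.split()) >= min_words:
            # Number the surviving paragraphs consecutively: par1, par2, ...
            kept[f"par{len(kept) + 1}"] = text
    return kept


short = "Too short to keep."
long_enough = " ".join(["word"] * 30)
pars_demo = filter_paragraphs([short, long_enough], min_words=25)
print(list(pars_demo))
```

Raising the limit trades coverage for cleaner input: very short blocks such as captions or menu items are less likely to reach keyword extraction.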

<a id='doc2'></a>
# Doc2. Human visual system

In [7]:
url = 'https://blogs.aalto.fi/neuroscience/2016/10/15/visual-system/'
pars = crawl_doc(url,th=35)

"par1" The visual system of any specie is a sensory system developed to perceive the environment through the light: it uses light to form images of the surroundings. Light in the form of electromagnetic radiation propagates as a wave with a certain amplitude, frequency and wavelength. The human visual system has evolved to transduce the electromagnetic waves to meaningful information about the world. 

"par2" Vision begins with the eye. Images are formed in the eye by refraction, the bending of light rays that can occur when they travel from one transparent medium to another. Lightwaves enter the eye through cornea, refracts, goes through the pupil and reaches the retina. The pigmentation of the eye is provided by the iris; the white part of the eye is called sclera and it forms the eyeball; extraocular muscles move the eye and the eye’s orbit. The axons exit the retina thought the optic nerve in the back of the eye. Image formation in the eye happens in the retina according to basic o

In [8]:
keys, par_embeddings = firebase_model.generate_keys(pars, num_words=2, thresh=0.58, top_n=1)
keys

{'par1': 'electromagnetic radiation',
 'par2': 'eyeball extraocular',
 'par6': 'cortical processing'}

# Notes
- Threshold on paragraph length: I set a 35-word limit (`th=35`), because this article has longer paragraphs than the first one.
- A single word doesn't work, so I use a keyword of size 2. With a similarity threshold of 0.58, we get three images for the document, which is again fine because those images cover most of the article.
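One plausible reading of the similarity threshold is that a paragraph only receives a keyword when its best candidate phrase is similar enough to the paragraph embedding. The sketch below shows that idea with plain cosine similarity on toy vectors. `select_keys`, the vectors, and the phrases are all made up for illustration; this is not the implementation of `generate_keys`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_keys(par_vecs, cand_vecs, thresh):
    """Keep the best candidate phrase per paragraph if it clears `thresh`.

    par_vecs:  {paragraph id: paragraph embedding}
    cand_vecs: {paragraph id: {candidate phrase: phrase embedding}}
    """
    keys = {}
    for par_id, par_vec in par_vecs.items():
        scored = [(cosine(par_vec, vec), phrase)
                  for phrase, vec in cand_vecs.get(par_id, {}).items()]
        if not scored:
            continue
        best_score, best_phrase = max(scored)
        if best_score >= thresh:
            keys[par_id] = best_phrase
    return keys


# Toy embeddings, invented for this example only.
par_vecs = {"par1": [1.0, 0.0], "par2": [0.0, 1.0]}
cand_vecs = {
    "par1": {"alpha": [1.0, 0.1], "beta": [0.0, 1.0]},
    "par2": {"gamma": [1.0, 0.2]},
}
strict = select_keys(par_vecs, cand_vecs, thresh=0.58)
loose = select_keys(par_vecs, cand_vecs, thresh=0.1)
print(strict)
print(loose)
```

Under this reading, a higher threshold yields fewer keywords (and therefore fewer images), which would explain why only some paragraphs appear in `keys`.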

<a id='doc3'></a>
# Doc3. Basic components of a building

In [415]:
url = 'https://theconstructor.org/building/12-basic-components-building-structure/34024/'
pars = crawl_doc(url,th=30)

Paragraph contains special characters. Skipping...
"par1" The basic components of a building structure are the foundation, floors, walls, beams, columns, roof, stair, etc. These elements serve the purpose of supporting, enclosing and protecting the building structure.  

"par2" The roof forms the topmost component of a building structure. It covers the top face of the building. Roofs can be either flat or sloped based on the location and weather conditions of the area. 

"par3" Lintels are constructed above the wall openings like doors, windows, etc. These structures support the weight of the wall coming over the opening. Normally, lintels are constructed by reinforced cement concrete. In residential buildings, lintels can be either constructed from concrete or from bricks. 

"par4" Beams and slabs form the horizontal members in a building. For a single storey building, the top slab forms the roof. In case of a multi-storey building, the beam transfers the load coming from the floor ab

In [416]:
keys, par_embeddings = firebase_model.generate_keys(pars, num_words=2, thresh=0.72, top_n=1)
keys

{'par1': 'building structure',
 'par2': 'roof forms',
 'par3': 'reinforced cement',
 'par4': 'cement concrete',
 'par8': 'stairs like'}

# Notes
- Threshold on paragraph length = 30 words, but we still get a dummy paragraph at the beginning.
- Bigrams are needed as keywords (i.e., l=2, passed as `num_words=2`).
- Embedding these keywords is a problem because the bigrams are out-of-vocabulary for Word2Vec. I searched for the keyword of each paragraph manually and observed that the baseline (just picking the first Google image) works well.
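The fallback described in the last note can be sketched as follows. `pick_image`, `vocab`, and `score_fn` are hypothetical names used only for illustration: when the keyword is missing from the embedding vocabulary, the first search result is returned (the baseline); otherwise the highest-scoring candidate is chosen.

```python
def pick_image(keyword, candidates, vocab, score_fn):
    """Pick one image URL for a keyword.

    candidates: image URLs in search-engine order (first hit first).
    vocab:      set of phrases the embedding model can represent (assumed).
    score_fn:   maps a URL to a similarity score (assumed).
    """
    if not candidates:
        return None  # nothing was found for this keyword
    if keyword not in vocab:
        return candidates[0]  # baseline: first search result
    return max(candidates, key=score_fn)


vocab = {"volcano"}
candidates = ["a.jpg", "b.jpg"]
scores = {"a.jpg": 0.2, "b.jpg": 0.9}

in_vocab_choice = pick_image("volcano", candidates, vocab, scores.get)
oov_choice = pick_image("reinforced cement", candidates, vocab, scores.get)
no_result = pick_image("volcano", [], vocab, scores.get)
print(in_vocab_choice, oov_choice, no_result)
```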

<a id='doc4'></a>
# Doc4. Volcano eruption

In [418]:
url = 'https://www.dogonews.com/2021/1/6/hawaiis-kilauea-volcano-eruption-creates-spectacular-lava-lake'
pars = crawl_doc(url,th=25)

"par1" After erupting almost continuously for over three decades — from 1983 to 2018 —  Hawaii's Kilauea volcano finally seemed to lose steam, producing no lava for nearly two years. The slumber ended on the night of December 20, 2020, when the active volcano began spewing out dramatic lava fountains and giant puffs of gas and steam from a fissure in the northwest wall of the Halemaʻumaʻu crater. 

"par2" As of December 31, 2020, the volcano had ejected over 27 million cubic meters (953 million cubic feet) of molten rock — enough to fill more than 8,000 Olympic-sized swimming pools — and replaced the existing water lake with a nearly 600-foot-deep lava lake.  Fortunately, the magma is contained inside the volcano's crater and does not pose a risk to people or property as it did in 2018, when the molten rock flowed through a residential neighborhood, destroying over 700 homes.  

"par3" Residents have, however, been asked to limit outdoor activities in areas with high volcanic smog leve

In [419]:
keys, par_embeddings = firebase_model.generate_keys(pars, num_words=1, thresh=0.37)
keys

{'par4': 'volcano'}

# Notes
- Threshold on paragraph length = 25 words, but we still get a dummy paragraph at the end.
- Unigrams are fine as keywords (i.e., l=1, passed as `num_words=1`) because the whole article is just about volcanoes.
- For the dummy paragraph (par6), the keyword is, as expected, out-of-vocabulary for the embedding models, but the baseline (i.e., returning the first Bing result) returns no images, which is the desired outcome.
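To make the unigram/bigram distinction (l=1 vs. l=2) concrete, here is a minimal sketch of how candidate phrases of a given length can be enumerated from a paragraph. `ngram_candidates` is a hypothetical helper with a deliberately simple tokenizer and no stop-word removal; it is not how `generate_keys` builds its candidates.

```python
import re


def ngram_candidates(text, n):
    """Return all contiguous n-word phrases of `text`, lowercased."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The roof forms the topmost component"
unigrams = ngram_candidates(sentence, 1)
bigrams = ngram_candidates(sentence, 2)
print(unigrams)
print(bigrams)
```

A single word is enough when the article has one dominant topic; multi-word phrases become useful when individual words are too generic to retrieve relevant images.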

<a id='doc5'></a>
# Doc5. NASA finds water on the Moon

In [421]:
url = 'https://www.dogonews.com/2020/11/6/nasas-sofia-finds-water-on-the-moons-sunlit-side'
pars = crawl_doc(url,th=25)

"par1" The presence of ice in the permanently shadowed craters around the Moon's poles has been known for some time. However, researchers had been unsure if the hydration detected on the satellite's sunlit areas was "molecular" water (H2O), or hydroxyl (OH), a molecule that's one hydrogen atom shy of becoming water. On October 26, 2020, NASA confirmed that the liquid was indeed water. 

"par2" “We had indications that H2O – the familiar water we know – might be present on the sunlit side of the Moon,” said Paul Hertz,  Director of Astrophysics at NASA.  “Now we know it is there. This discovery challenges our understanding of the lunar surface and raises intriguing questions about resources relevant for deep space exploration.” 

"par3" The discovery was made using data collected by NASA’s Stratospheric Observatory for Infrared Astronomy (SOFIA). The modified Boeing 747 aircraft, which can fly its large telescope at altitudes of up to 45,000 feet, allows researchers to clearly observe s

In [422]:
keys, par_embeddings = firebase_model.generate_keys(pars, num_words=1, thresh=0.35)
keys

{'par2': 'lunar', 'par3': 'moon', 'par5': 'meteorites'}

# Notes
- Threshold on paragraph length = 25 words, but we again get a dummy paragraph at the end (because Doc5 and Doc4 come from the same website).
- Unigrams are fine as keywords (i.e., l=1, passed as `num_words=1`) because the whole article is just about the Moon.
- For the dummy paragraph (par6), the keyword is, as expected, out-of-vocabulary for the embedding models, but the baseline (i.e., returning the first Bing result) returns no images, which is the desired outcome.

<a id='doc6'></a>
# Doc6. Human auditory system. An example of complete failure

This document essentially doesn't have any paragraphs with more than one sentence.

In [338]:
url = 'https://w3.ual.es/~vruiz/Docencia/Apuntes/Perception/Sistema_Auditivo/index.html'
pars = crawl_doc(url,th=25,disp=False)

Paragraph contains special characters. Skipping...
[the line above is printed 19 times in total]


In [339]:
keys, par_embeddings = firebase_model.generate_keys(pars, num_words=2, thresh=0.4)
keys

{}

# Notes
- If you set `disp=True` when crawling the data above, you'll see that the article is formatted so that the whole document is collected as a single paragraph, and the extracted paragraphs are multiple repetitive chunks. Our framework therefore doesn't work on this document.
- [This article](http://www.cs.tut.fi/~ypsilon/80545/HumanAuditorySystem.html) is another example of our framework's failure: the framework is not suited to bullet-point-style articles. It is meant for well-written articles, i.e., ones structured in organized (and relatively long) paragraphs.
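A rough pre-check for this kind of failure could look like the sketch below. It is purely illustrative and not part of the framework: `looks_paragraph_structured` and its default values are assumptions. It flags a document as suitable only if a sufficient fraction of its text blocks are long enough to count as paragraphs.

```python
def looks_paragraph_structured(blocks, min_words=25, min_fraction=0.5):
    """Heuristic: are enough of the text blocks real paragraphs?

    blocks:       list of text blocks extracted from a page.
    min_words:    minimum word count for a block to count as a paragraph.
    min_fraction: required share of such blocks (assumed default).
    """
    if not blocks:
        return False
    long_blocks = sum(1 for b in blocks if len(b.split()) >= min_words)
    return long_blocks / len(blocks) >= min_fraction


bullet_page = ["Outer ear", "Middle ear", "Inner ear"]
article_page = [" ".join(["word"] * 40)] * 3 + ["Short caption"]
print(looks_paragraph_structured(bullet_page))
print(looks_paragraph_structured(article_page))
```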

<a id='doc7'></a>
# Doc7. BERT explained

- This is largely a failure case, because KeyBERT extracts keywords related to Google or researchers rather than anything about NLP and BERT.

In [424]:
url = 'https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270'
pars = crawl_doc(url,th=45)

"par1" BERT (Bidirectional Encoder Representations from Transformers) is a recent paper published by researchers at Google AI Language. It has caused a stir in the Machine Learning community by presenting state-of-the-art results in a wide variety of NLP tasks, including Question Answering (SQuAD v1.1), Natural Language Inference (MNLI), and others. 

"par2" BERT’s key technical innovation is applying the bidirectional training of Transformer, a popular attention model, to language modelling. This is in contrast to previous efforts which looked at a text sequence either from left to right or combined left-to-right and right-to-left training. The paper’s results show that a language model which is bidirectionally trained can have a deeper sense of language context and flow than single-direction language models. In the paper, the researchers detail a novel technique named Masked LM (MLM) which allows bidirectional training in models in which it was previously impossible. 

"par3" In the 

In [425]:
keys, par_embeddings = firebase_model.generate_keys(pars, num_words=3, thresh=0.6, top_n=1)
keys

{'par1': 'researchers at google', 'par3': 'computer vision researchers'}

<a id='doc8'></a>
# Doc8. WHO - Coronavirus disease (COVID-19): Herd immunity, lockdowns and COVID-19

- This is another failure case. We do find good images, but our scoring system selects the worst one.

In [427]:
url = 'https://www.who.int/news-room/q-a-detail/herd-immunity-lockdowns-and-covid-19'
pars = crawl_doc(url,th=45)

Paragraph contains special characters. Skipping...
"par1" 'Herd immunity', also known as 'population immunity', is the indirect protection from an infectious disease that happens when a population is immune either through vaccination or immunity developed through previous infection. WHO supports achieving 'herd immunity' through vaccination, not by allowing a disease to spread through any segment of the population, as this would result in unnecessary cases and deaths. 

"par2" Vaccines train our immune systems to create proteins that fight disease, known as ‘antibodies’, just as would happen when we are exposed to a disease but – crucially – vaccines work without making us sick. Vaccinated people are protected from getting the disease in question and passing on the pathogen, breaking any chains of transmission. Visit our webpage on COVID-19 and vaccines for more detail.  

"par3" To safely achieve herd immunity against COVID-19, a substantial proportion of a population would need to be

In [428]:
keys, par_embeddings = firebase_model.generate_keys(par_merger(pars), num_words=3, thresh=0.3, top_n=1)
keys

{'par1': 'immunity through vaccination'}

<a id='doc9'></a>
# Doc9. Healthy Habits We Can Learn From Our Birds

In [430]:
url = 'https://lafeber.com/pet-birds/healthy-habits-can-learn-birds/'
pars = crawl_doc(url,th=25)

"par1" Looking for some healthy habits to try in the year ahead? Good news! Chances are you have your very own health coach living with you right now — look no further than your feathered friend! Here are a few of your bird’s healthy habits that you should follow, too. 

"par2" Eat the rainbow. This trending phrase is a simple way to remind us to eat a variety of fruits and vegetables to help get the vitamins and minerals our bodies need. The same goes for our pet birds. Parrots cannot live on seed alone — they thrive when fed a healthy diet that is supplemented by a variety of fruits and veggies. Likewise, their human stewards cannot and should not consume a diet devoid of healthy selections. In the wild, parrots are natural-born foragers — they eat what’s in season, which adds variety; we can reap health benefits by doing the same. Try some of these winter fruits and veggies. 

"par3" Eat slow, and enjoy your food. Ever noticed that your bird takes his/her time to thoroughly enjoy a 

In [431]:
keys, par_embeddings = firebase_model.generate_keys(pars, num_words=2, thresh=0.4, top_n=1)
keys

{'par2': 'wild parrots', 'par4': 'breakfast birds'}

# Final notes

In the documents above, image quality is generally good. The failure cases are: (1) dummy paragraphs (as shown in Doc 4 and Doc 5, their keywords typically return no Bing images); (2) irrelevant keywords (this doesn't occur in the documents above, but by changing the keyword-extraction threshold we can squeeze irrelevant terms out of the returned keywords); and (3) keywords that are out-of-vocabulary for the embedding model, in which case we need to fall back to the baseline (i.e., returning the first Bing result).

<a id='fatih'></a>
# Run FIREBASE for the selected document

In [9]:
results = get_bing_results_per_paragraph(firebase_model, keys, par_embeddings)

[INFO] Total of 35 images found.
Obtaining paragraphs for the bing images for the keyword electromagnetic radiation...
This is an image tag, but does not contain an image source: <img alt="TRUSTe" class="comp lazyload badge-image mntl-block" data-src="//privacy-policy.truste.com/privacy-seal/seal?rid=e166d0ee-e663-4ad0-9384-f5bd78093a89" id="badge-image_1-0"/>
DONE!
Computing scores for the images...
DONE!
[INFO] Total of 35 images found.
Obtaining paragraphs for the bing images for the keyword eyeball extraocular...
This is an image tag, but does not contain an image source: <img class="j-next-thumb thumb"/>
This is an image tag, but does not contain an image source: <img class="j-tooltip-thumb tooltip-thumb" onerror="this.src=''" slide-thumb-1="https://image.slidesharecdn.com/lacrimalapparatuseyelidandexternalfeaturesofeyeball-161021005848/85/lacrimal-apparatus-eye-lid-and-external-features-of-eye-ball-1-320.jpg?cb=1508905076" slide-thumb-10="https://image.slidesharecdn.com/lacrimala

This is an image tag, but does not contain an image source: <img id="linkedsharedimg"> </img>
DONE!
Computing scores for the images...
DONE!
[INFO] Total of 35 images found.
Obtaining paragraphs for the bing images for the keyword cortical processing...
This is an image tag, but does not contain an image source: <img aria-hidden="true" class="what-image b-lazy" data-src="//els-jbs-prod-cdn.jbs.elsevierhealth.com/cms/attachment/553ccbcc-03bf-422f-9735-e334d4720bda/gr1.jpg" loading="lazy"/>
DONE!
Computing scores for the images...
DONE!


<a id='visualize'></a>
# Visualize the results

In [10]:
# Import from IPython.display; importing HTML from IPython.core.display is deprecated.
from IPython.display import HTML, Javascript, display
from ipywidgets import widgets
import tqdm
import numpy as np
from matplotlib import image
import os
import json
import operator
from PIL import Image
import io
from IPython.core.debugger import set_trace

import random
import pathlib
import pickle

In [12]:
num_images_to_show_most = 9
img_offset = 0

# A negative offset would index from the end of the sorted image list.
assert img_offset >= 0

num_cols = 3
num_rows = int(min(3, np.ceil(num_images_to_show_most / num_cols)))
num_per_tab = num_cols * num_rows
num_tabs = len(list(results.keys()))
scale = 3

checkboxes = {}

tab_contents = []
for kk, cur_par in tqdm.tqdm(enumerate(results.keys()), desc='Setting up scored images'):
    cur_keyword = results[cur_par]['keyword']
    cur_images = results[cur_par]['images']
    cur_scores = results[cur_par]['scores']
    num_images_to_show = int(min(num_images_to_show_most, len(cur_images)))
    num_rows = int(min(25, np.ceil(num_images_to_show / num_cols)))
    num_per_tab = num_cols * num_rows
    
    cur_sort_idx = list(np.argsort(cur_scores)[::-1])
    
    cur_images = [cur_images[i] for i in cur_sort_idx]
    cur_scores = [cur_scores[i] for i in cur_sort_idx]
    
    print('cur_keyword: {}\nnum_rows: {}\nnum_images_to_show: {}'.format(cur_keyword, num_rows, num_images_to_show))
    
    rows = []
    cur_num_rows = num_rows
    for ii in range(cur_num_rows):
        cur_row = []
        cur_num_cols = int(min(num_cols, num_images_to_show))
        if ii == cur_num_rows - 1:
            cur_num_cols = num_images_to_show - (cur_num_rows - 1) * num_cols
        for jj in range(cur_num_cols):
            cur_index = img_offset + ii * num_cols + jj
            cur_img = widgets.Image(value=get_image_with_url(cur_images[cur_index], binary=True, save_image=True, title=cur_keyword, size=(512,512), score=cur_scores[cur_index]), width=256, height=256)
            # Layout sizes are CSS strings, so the height needs a unit.
            cur_checkbox = widgets.Checkbox(disabled=True, description=str(cur_scores[cur_index]), indent=False, layout=widgets.Layout(width='100px', height='28px'))
            cur_checkbox.width = '90px'
            # checkboxes[random_imgs[cur_index].stem] = cur_checkbox
            cur_box = widgets.VBox([cur_img, cur_checkbox])
            cur_box.layout.align_items = 'center'
            cur_box.layout.padding = '6px'
            cur_row.append(cur_box)
        cur_hbox = widgets.HBox(cur_row)
        rows.append(cur_hbox)
    tab_contents.append(widgets.VBox(rows))
    
    
tab = widgets.Tab()
tab.children = tab_contents
for i in range(len(tab.children)):
    tab.set_title(i, list(results.values())[i]['keyword'])
    
display(tab)

Setting up scored images: 0it [00:00, ?it/s]

cur_keyword: electromagnetic radiation
num_rows: 2
num_images_to_show: 5


Setting up scored images: 1it [00:00,  1.48it/s]

cur_keyword: eyeball extraocular
num_rows: 1
num_images_to_show: 2


Setting up scored images: 2it [00:00,  1.78it/s]

cur_keyword: cortical processing
num_rows: 1
num_images_to_show: 3


Setting up scored images: 3it [00:01,  2.10it/s]


Tab(children=(VBox(children=(HBox(children=(VBox(children=(Image(value=b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\…
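
The layout arithmetic in the visualization cell (sort images by score in descending order, keep at most `num_images_to_show_most`, then fill rows of `num_cols`) can be isolated from the widget code. The sketch below is a simplified restatement under those assumptions; `grid_rows` is a hypothetical helper and returns image indices instead of widgets.

```python
def grid_rows(scores, max_images=9, num_cols=3):
    """Return image indices arranged row by row, best score first."""
    # Indices sorted by descending score (ties keep their original order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    shown = order[:max_images]
    # Chunk into rows of `num_cols`; the last row may be shorter.
    return [shown[r:r + num_cols] for r in range(0, len(shown), num_cols)]


demo_scores = [0.1, 0.9, 0.5, 0.7, 0.3]
demo_rows = grid_rows(demo_scores)
print(demo_rows)
```

With five scored images and three columns this gives two rows, matching the `num_rows: 2` and `num_images_to_show: 5` printed for the first keyword above.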