# Extracting text essence

To get nicely matching pictures for a sentence, we need to find the right words in that sentence to query for. This notebook explores ways to extract the essence of a sentence. Let's use `spaCy`.

In [None]:
import numpy as np

import spacy
from spacy.lang.en import English
from spacy import displacy

Also load the `Unsplash` helper with its `get_photo_url()` method (from notebook `random-line-random-pic.ipynb`) to experiment with our results at the end.

In [None]:
import sys
sys.path.append('../')

from src.images.unsplash import Unsplash
from IPython.display import Image

## Getting the data

In [None]:
# read the whole corpus as one string
with open('../data/cipsum.txt', 'r') as f:
    corpus = f.read()

# remove newlines, split into sentences on '.', strip surrounding whitespace
# and drop empty pieces (compare by value; `is not ''` tests identity)
lines = [s.strip() for s in corpus.replace("\n", "").split('.') if s.strip()]
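To see what the splitting step does, here is a minimal sketch on a made-up string (the sample text and the helper name `split_sentences` are assumptions, not part of the corpus or the project). It filters after stripping, so whitespace-only pieces are dropped as well.

```python
def split_sentences(text):
    """Split text on full stops and drop empty or whitespace-only pieces."""
    flat = text.replace("\n", "")
    pieces = (piece.strip() for piece in flat.split('.'))
    return [piece for piece in pieces if piece]


# Made-up sample with a blank line and trailing whitespace.
sample = "First sentence. Second one.\n  \nThird. "
sentences = split_sentences(sample)
print(sentences)
```

Splitting on `'.'` is a naive rule: abbreviations and decimal numbers would also be cut, which may or may not matter for this corpus.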

Let's start by checking this for one random line.

In [None]:
line = str(np.random.choice(lines))
print(line)

## Using `spaCy`'s linguistic features

spaCy's language models come with a part-of-speech tagger and a dependency parser, which predict how the words within a sentence relate to each other. Let's have a look. Documentation can be found [here](https://spacy.io/usage/linguistic-features#pos-tagging).

First load a language model to create an `nlp` object, then call it on the line to get a `doc`.

In [None]:
nlp = spacy.load('en_core_web_sm')
doc = nlp(line)

Use the part-of-speech tagger from the language model: for each token in the `doc`, we can print the text and the `pos_` attribute, the predicted part-of-speech tag.

In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object. The `dep_` attribute returns the predicted dependency label.

The `head` attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.

In [None]:
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

This is all easier to understand when visualised.

P.S. You can ask spaCy to explain its abbreviations, for example `spacy.explain('amod')`.

In [None]:
displacy.render(doc, style='dep', jupyter=True)
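The labels shown in the rendering can be looked up with `spacy.explain`. A small sketch, where the labels in the list are picked purely for illustration:

```python
import spacy

# Look up the glossary text for a few common tags and dependency labels.
for label in ['NOUN', 'amod', 'nsubj']:
    print(label, '->', spacy.explain(label))

# Labels that are not in the glossary give None rather than an error.
unknown = spacy.explain('not-a-real-label')
print(unknown)
```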

### Noun chunks

The above allows us to parse whatever we think we need. However, `spaCy` has already done some of the engineering for us and generates so-called noun chunks.

> "Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”. To get the noun chunks in a document, simply iterate over `Doc.noun_chunks`."

This is very nice indeed! What if we just extract those and use them to find a picture?

In [None]:
for chunk in doc.noun_chunks:
    print(chunk.text)
#     print(chunk.root.text, chunk.root.dep_, chunk.root.head.text)

Let's see what a picture would look like:

In [None]:
query = ' '.join([chunk.text for chunk in doc.noun_chunks])

print('Sentence:', line)
print('Query   :', query)

unsplash = Unsplash()
Image(url=unsplash.get_photo_url(query))

This might still be too much text. Let's consider only the nouns, without their adjectives.

In [None]:
query = ' '.join([chunk.root.text for chunk in doc.noun_chunks])

print('Sentence:', line)
print('Query   :', query)

unsplash = Unsplash()
Image(url=unsplash.get_photo_url(query))

That's better already!

How about we also include the root verb of the sentence, that is, the token with dependency label `ROOT` if it is tagged as a `VERB`?

In [None]:
nouns = [chunk.root.text for chunk in doc.noun_chunks]
verbs = [token.text for token in doc if token.dep_ == 'ROOT' and token.pos_ == 'VERB']
query = ' '.join(nouns + verbs)

print('Sentence:', line)
print('Query   :', query)

unsplash = Unsplash()
Image(url=unsplash.get_photo_url(query))

Mmm...
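The three query variants tried above differ only in which parts of the sentence they keep, so they could be folded into one helper for easier experimenting. This is a sketch, not part of the project: the name `build_query` and its parameters are hypothetical. It only relies on `doc.noun_chunks` and on the `text`, `pos_` and `dep_` attributes of tokens, so the demo below uses hand-made stand-in objects instead of a real spaCy `doc` to stay runnable without a language model.

```python
from types import SimpleNamespace


def build_query(doc, full_chunks=False, include_root_verb=False):
    """Build a search query from noun chunks and, optionally, the root verb."""
    if full_chunks:
        words = [chunk.text for chunk in doc.noun_chunks]
    else:
        # keep only the head noun of each chunk, dropping its modifiers
        words = [chunk.root.text for chunk in doc.noun_chunks]
    if include_root_verb:
        words += [t.text for t in doc if t.dep_ == 'ROOT' and t.pos_ == 'VERB']
    return ' '.join(words)


# --- stand-ins for a parsed sentence (assumption: mimics a spaCy doc) ---
def tok(text, pos, dep):
    return SimpleNamespace(text=text, pos_=pos, dep_=dep)


class StandInDoc(list):
    """A list of tokens that can also carry a `noun_chunks` attribute."""


grass = tok('grass', 'NOUN', 'nsubj')
covers = tok('covers', 'VERB', 'ROOT')
hill = tok('hill', 'NOUN', 'dobj')
demo_doc = StandInDoc([tok('The', 'DET', 'det'), tok('green', 'ADJ', 'amod'),
                       grass, covers, tok('the', 'DET', 'det'), hill])
demo_doc.noun_chunks = [SimpleNamespace(text='The green grass', root=grass),
                        SimpleNamespace(text='the hill', root=hill)]

print(build_query(demo_doc, full_chunks=True))        # noun chunks
print(build_query(demo_doc))                          # nouns only
print(build_query(demo_doc, include_root_verb=True))  # nouns + root verb
```

In the notebook, the real `doc` from `nlp(line)` would be passed instead of `demo_doc`, and the result handed to `unsplash.get_photo_url(query)` as before.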