## Document Embedding
   * An embedding is a multi-dimensional representation of a Document.
   * An embedding can be any of the following:
        * NumPy **ndarray**
        * SciPy **sparse matrix**
        * TensorFlow and PyTorch **sparse tensors**

In [1]:
import numpy as np
from jina import Document
d1 = Document(embedding = np.array([1,2,3]))
d2 = Document(embedding = np.array([[1,2,3], [4,5,6]]))

In [4]:
display(d1,d2)

In [12]:
# Access each Document's embedding attribute
display(d1.embedding)
print("------")
display(d2.embedding)

array([1, 2, 3])

------


array([[1, 2, 3],
       [4, 5, 6]])

#### Sparse Embedding

In [14]:
import scipy.sparse as sp

d1 = Document(embedding = sp.coo_matrix([0,0,0,1,0]))
d2 = Document(embedding = sp.csr_matrix([0,0,0,1,0]))
d3 = Document(embedding = sp.bsr_matrix([0,0,0,1,0]))
d4 = Document(embedding = sp.csc_matrix([0,0,0,1,0]))

d5 = Document(blob=sp.coo_matrix([0, 0, 0, 1, 0]))
d6 = Document(blob=sp.csr_matrix([0, 0, 0, 1, 0]))
d7 = Document(blob=sp.bsr_matrix([0, 0, 0, 1, 0]))
d8 = Document(blob=sp.csc_matrix([0, 0, 0, 1, 0]))

In [22]:
display(d1)
display(d1.embedding)

<1x5 sparse matrix of type '<class 'numpy.int32'>'
	with 1 stored elements in COOrdinate format>

In [21]:
display(d5)
display(d5.blob)

<1x5 sparse matrix of type '<class 'numpy.int32'>'
	with 1 stored elements in COOrdinate format>

In [30]:
# TensorFlow and PyTorch sparse tensors
import torch
import tensorflow as tf

# More information about SparseTensor: https://www.tensorflow.org/api_docs/python/tf/sparse/SparseTensor

# Note: tf.SparseTensor reads `indices` as one [row, col] pair per value,
# while torch.sparse_coo_tensor reads it as one list of positions per dimension.
indices = [[0, 0], [1, 2]]  # positions of the nonzero elements
values = [1, 2]  # one value per nonzero element
dense_shape = [3, 4]  # shape of the equivalent dense tensor

d1 = Document(embedding=torch.sparse_coo_tensor(indices, values, dense_shape))
d2 = Document(embedding=tf.SparseTensor(indices, values, dense_shape))
d3 = Document(blob=torch.sparse_coo_tensor(indices, values, dense_shape))
d4 = Document(blob=tf.SparseTensor(indices, values, dense_shape))

In [31]:
display(d1)
display(d1.embedding)

<3x4 sparse matrix of type '<class 'numpy.int64'>'
	with 2 stored elements in COOrdinate format>
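The `indices`, `values`, and `dense_shape` triple used above describes a sparse tensor in coordinate (COO) form. As a minimal sketch of what that triple encodes, the pure-Python helper below expands it into nested lists. The function name `coo_to_dense` is hypothetical and not part of Jina, SciPy, TensorFlow, or PyTorch, and it assumes the per-element index layout (one `[row, col]` pair per value) that `tf.SparseTensor` uses.

```python
def coo_to_dense(indices, values, dense_shape):
    """Expand a 2-D COO description into a dense list of lists.

    Assumes `indices` holds one [row, col] pair per entry of `values`.
    """
    n_rows, n_cols = dense_shape
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (row, col), value in zip(indices, values):
        dense[row][col] = value
    return dense


indices = [[0, 0], [1, 2]]
values = [1, 2]
dense_shape = [3, 4]

dense = coo_to_dense(indices, values, dense_shape)
for row in dense:
    print(row)
```

Only two of the twelve cells are nonzero, which is why storing just the coordinates and values is cheaper than storing the full 3x4 grid.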

## Document tags
   * A **Document** has a **tags** field that holds a map-like structure with arbitrary, possibly nested, values.

In [35]:
from jina import Document

doc = Document(tags={'dimensions': {'height': 5.0, 'weight': 10.0, 'last_modified': 'Monday'}})

print(doc.tags['dimensions'])

{'last_modified': 'Monday', 'weight': 10.0, 'height': 5.0}


In [36]:
# Use `__` (double underscore) to access nested fields
print(doc.tags__dimensions__height)

5.0
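One way to picture the double-underscore lookup is as a path that is split on `__` and followed key by key through the nested map. The sketch below does this on a plain `dict`. The helper `lookup_nested` is a hypothetical name used only for illustration, and this is not a claim about how Jina implements the attribute access.

```python
def lookup_nested(mapping, path, sep='__'):
    """Follow a `sep`-separated path of keys through nested dicts."""
    current = mapping
    for key in path.split(sep):
        current = current[key]  # raises KeyError if a level is missing
    return current


tags = {'dimensions': {'height': 5.0, 'weight': 10.0, 'last_modified': 'Monday'}}

print(lookup_nested(tags, 'dimensions__height'))
print(lookup_nested(tags, 'dimensions__last_modified'))
```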


## Visualize Recursive and Nested Documents
   * A **Document** can be recursive both horizontally and vertically.
   * Each **Document** can have multiple **chunks** and **matches**.
   * Every item in **chunks** and **matches** is itself a **Document** object.

In [4]:
import numpy as np
from jina import Document

d0 = Document(id='🐲', embedding=np.array([0, 0]))
d1 = Document(id='🐦', embedding=np.array([1, 0]))
d2 = Document(id='🐢', embedding=np.array([0, 1]))
d3 = Document(id='🐯', embedding=np.array([1, 1]))
# chunks and matches are recursive attributes
d0.chunks.append(d1)  # chunks is the list of sub-Documents
d0.chunks[0].chunks.append(d2)  # nest d2 one level deeper, as a chunk of d1
d0.matches.append(d3)  # matches is the list of matched (neighbouring) Documents of this Document

display(d0)

* **granularity** is the recursion "depth" of the nested chunks structure
* **adjacency** is the recursion "width" of the nested matches structure
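To make the two counters concrete, here is a small sketch that rebuilds the same four-node structure with a hypothetical `Node` class instead of a real `Document`. It assumes that adding a chunk increases `granularity` by one relative to the parent and that adding a match increases `adjacency` by one. That bookkeeping is an illustration of the definitions above, not Jina's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a Document with nested chunks and matches."""
    id: str
    granularity: int = 0  # depth in the chunks hierarchy
    adjacency: int = 0    # distance along the matches hierarchy
    chunks: List['Node'] = field(default_factory=list)
    matches: List['Node'] = field(default_factory=list)

    def add_chunk(self, child: 'Node') -> None:
        # Assumption: a chunk sits one level deeper than its parent.
        child.granularity = self.granularity + 1
        self.chunks.append(child)

    def add_match(self, other: 'Node') -> None:
        # Assumption: a match sits one step further out than its reference.
        other.adjacency = self.adjacency + 1
        self.matches.append(other)


d0, d1, d2, d3 = Node('d0'), Node('d1'), Node('d2'), Node('d3')
d0.add_chunk(d1)            # d1 becomes a chunk of d0
d0.chunks[0].add_chunk(d2)  # d2 becomes a chunk of d1
d0.add_match(d3)            # d3 becomes a match of d0

for node in (d0, d1, d2, d3):
    print(node.id, 'granularity:', node.granularity, 'adjacency:', node.adjacency)
```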

## Construct a Document

In [5]:
from jina import Document
d = Document(uri='https://jina.ai',
             mime_type='text/plain',
             granularity=1,
             adjacency=3,
             tags={'foo': 'bar'})

In [6]:
d

In [7]:
# Construct a Document from a dict or a JSON string
import json

d = {'id': 'hello123', 'content': 'world'}
d1 = Document(d)

d = json.dumps({'id': 'hello123', 'content': 'world'})
d2 = Document(d)
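Both inputs above carry the same information: a JSON string is just the serialized form of the dict. The sketch below shows one way a constructor could normalize either input to a dict before reading the fields. The helper `normalize_source` is hypothetical and is not Jina's code.

```python
import json


def normalize_source(source):
    """Return a dict whether `source` is already a dict or a JSON string."""
    if isinstance(source, dict):
        return dict(source)  # shallow copy so the caller's dict is untouched
    if isinstance(source, str):
        return json.loads(source)
    raise TypeError(f'unsupported source type: {type(source).__name__}')


as_dict = {'id': 'hello123', 'content': 'world'}
as_json = json.dumps(as_dict)

# Both forms normalize to the same fields.
print(normalize_source(as_dict) == normalize_source(as_json))
```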

In [10]:
# Update a Document from another source Document

s = Document(
    id='🐲',
    content='hello-world',
    tags={'a': 'b'},
    chunks=[Document(id='🐢')],
)
d = Document(
    id='🐦',
    content='goodbye-world',
    tags={'c': 'd'},
    chunks=[Document(id='🐯')],
)
print("document before updating", d)
# only update `id` field
d.update(s, fields=['id'])
print("\n\ndocument after update", d)

document before updating {'id': '🐦', 'chunks': [{'id': '🐯', 'mime_type': 'text/plain', 'granularity': 1, 'parent_id': '🐦'}], 'mime_type': 'text/plain', 'tags': {'c': 'd'}, 'text': 'goodbye-world'}


document after update {'id': '🐲', 'chunks': [{'id': '🐯', 'mime_type': 'text/plain', 'granularity': 1, 'parent_id': '🐦'}], 'mime_type': 'text/plain', 'tags': {'c': 'd'}, 'text': 'goodbye-world'}


In [11]:
# Update all fields
d.update(s)
print(d)  # the `tags` field is a `dict`, so it is merged rather than replaced

{'id': '🐲', 'chunks': [{'id': '🐢', 'mime_type': 'text/plain', 'granularity': 1, 'parent_id': '🐲'}], 'mime_type': 'text/plain', 'tags': {'a': 'b', 'c': 'd'}, 'text': 'hello-world'}
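The outputs above suggest two rules: a `fields` list restricts which fields are copied, and a dict-valued field such as `tags` is merged while other fields are overwritten. The sketch below reproduces that behaviour on plain dicts. The function `update_fields` is hypothetical, the ids are plain strings for readability, and the choice that the source wins when tag keys collide is an assumption that the outputs above do not confirm.

```python
def update_fields(dest, src, fields=None):
    """Copy `fields` (default: every field in `src`) from `src` into `dest`.

    Dict-valued fields are merged; all other fields are overwritten.
    """
    names = fields if fields is not None else list(src.keys())
    for name in names:
        value = src[name]
        if isinstance(value, dict) and isinstance(dest.get(name), dict):
            # Assumption: on a key collision the source value wins.
            dest[name] = {**dest[name], **value}
        else:
            dest[name] = value
    return dest


s = {'id': 'dragon', 'content': 'hello-world', 'tags': {'a': 'b'}}
d = {'id': 'bird', 'content': 'goodbye-world', 'tags': {'c': 'd'}}

update_fields(d, s, fields=['id'])  # only `id` changes
print(d)

update_fields(d, s)  # everything is updated, `tags` is merged
print(d)
```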


A **Document** can also be constructed from JSON, an ndarray, or files.
See **https://docs.jina.ai/fundamentals/document/document-api/** for details.

## Add Relevance to a Document
   * To record relevance, use the following attributes of the Document object:
        * **scores** attribute: the relevance information of this Document.
        * **evaluations** attribute: the evaluation information of this Document.

In [12]:
d = Document()
d.scores['cosine similarity'] = 0.96
d.scores['cosine similarity'].description = 'cosine similarity'
d.scores['cosine similarity'].op_name = 'cosine()'
d.evaluations['recall'] = 0.56
d.evaluations['recall'].description = 'recall at 10'
d.evaluations['recall'].op_name = 'recall()'
d

In [14]:
for score_key, value_score in d.scores.items():
    print(f' {score_key} => {value_score.description}: {value_score.value}')
    
for evaluation_key, evaluation_score in d.evaluations.items():
    print(f' {evaluation_key} => {evaluation_score.description}: {evaluation_score.value}')

 cosine similarity => cosine similarity: 0.9599999785423279
 recall => recall at 10: 0.5600000023841858
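The printed values are `0.9599999785423279` and `0.5600000023841858` rather than the assigned `0.96` and `0.56`. That pattern is what a round trip through a 32-bit float produces. The sketch below reproduces the effect with the standard `struct` module. It assumes, without confirmation from the output above, that the score value is stored in single precision, and the helper `as_float32` is a hypothetical name.

```python
import struct


def as_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack('f', struct.pack('f', x))[0]


for value in (0.96, 0.56):
    print(value, '->', as_float32(value))
```

When comparing such scores against literals, a tolerance-based check such as `math.isclose(score, 0.96, rel_tol=1e-6)` is safer than `==`.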
