Message Format
Clare edited this page Feb 12, 2016 · 10 revisions
All messages should be a subset of the following object:
{
    word: str
    crawl_date: str                  # ISO8601-formatted
    urls: [                          # list of URL objects
        {
            url: str
            title: str
            author: str
            source: str              # e.g. "The Atlantic" or abcxyz.com
            search_provider: str     # e.g. 'google'
            date: str                # ISO8601-formatted
            doc: str                 # contains the cleaned body of the URL
            features: {
                italic: bool         # Does the term occur in emphasis in the document?
                bold: bool
                quotes: bool
                readability: float   # Reading score
            }
            variants: [...]          # List of spelling variants in this document
            sentences: [             # List of sentences mentioning the term
                {
                    s: str
                    s_clean: str           # Cleaned sentence, search terms replaced with '_TERM_'
                    frd: int               # binary classification of sentence being an FRD (1 or 0)
                    frd_likelihood: float  # likelihood of sentence being an FRD, as a fraction between 0 and 1 (e.g. 0.1 = 10% likely)
                    pos_tags: str          # Penn Treebank tagged sentence in format 'A/DT _TERM_/JJ culture/NN is/VBZ'
                    rating: float          # Relative rating of the FRDs to one another (e.g. 1st best, 2nd best, etc.)
                }
            ]
        }
    ]
}
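As a rough illustration of what "a subset of the object" could mean in practice, the sketch below mirrors the schema as nested Python types and checks that a message uses only known keys with the expected value types. The names `MESSAGE`, `URL`, `SENTENCE` and `conforms` are hypothetical and not part of the application, the sample values are made up, and treating `variants` as a list of strings is an assumption, since the schema leaves its element type open.

```python
from datetime import datetime, timezone

# Hypothetical mirror of the schema: leaves are Python types, and a list
# holds one template element. Illustration only, not project code.
SENTENCE = {
    "s": str, "s_clean": str, "frd": int, "frd_likelihood": float,
    "pos_tags": str, "rating": float,
}
URL = {
    "url": str, "title": str, "author": str, "source": str,
    "search_provider": str, "date": str, "doc": str,
    "features": {"italic": bool, "bold": bool, "quotes": bool,
                 "readability": float},
    "variants": [str],  # assumption: variants are plain strings
    "sentences": [SENTENCE],
}
MESSAGE = {"word": str, "crawl_date": str, "urls": [URL]}


def conforms(value, schema):
    """Return True if value uses only keys and types allowed by schema."""
    if isinstance(schema, dict):
        # Missing keys are fine (subset); unknown keys are not.
        return isinstance(value, dict) and all(
            key in schema and conforms(item, schema[key])
            for key, item in value.items()
        )
    if isinstance(schema, list):
        return isinstance(value, list) and all(
            conforms(item, schema[0]) for item in value
        )
    if schema is float:
        # Accept ints where a float is expected, but never bools.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if schema is int:
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, schema)


# A partial message with made-up sample values.
message = {
    "word": "hygge",
    "crawl_date": datetime(2016, 2, 12, tzinfo=timezone.utc).isoformat(),
    "urls": [{
        "url": "http://abcxyz.com/article",
        "source": "abcxyz.com",
        "search_provider": "google",
        "features": {"italic": True, "readability": 61.5},
        "sentences": [{
            "s": "A hygge culture is",
            "s_clean": "A _TERM_ culture is",
            "frd": 0,
            "frd_likelihood": 0.1,
            "pos_tags": "A/DT _TERM_/JJ culture/NN is/VBZ",
        }],
    }],
}

print(conforms(message, MESSAGE))
```

A message that omits fields still passes this check, while one that introduces a key outside the schema or stores a value of the wrong type does not.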
All strings are expected to be UTF-8 encoded when the message is saved to S3 or to the database, and to be Unicode strings while inside the application. If the application requires strings to be UTF-8 encoded for some operation, this will be clearly marked in the code, and the strings should be decoded back into Unicode as soon as possible.
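The encode-at-the-boundary rule might look like the following minimal sketch. It assumes Python 3, where `str` is already a Unicode string and `bytes` is the encoded form, and it assumes messages are serialized as JSON, which this page does not state. The helper names `to_storage_bytes` and `from_storage_bytes` are hypothetical.

```python
import json


def to_storage_bytes(message):
    """Serialize a message dict to UTF-8 bytes just before it is saved."""
    # ensure_ascii=False keeps non-ASCII characters as themselves, so the
    # stored bytes are real UTF-8 rather than \uXXXX escapes.
    return json.dumps(message, ensure_ascii=False).encode("utf-8")


def from_storage_bytes(raw):
    """Decode stored UTF-8 bytes back into a dict of Unicode strings."""
    return json.loads(raw.decode("utf-8"))


message = {"word": "café", "crawl_date": "2016-02-12T00:00:00+00:00"}

raw = to_storage_bytes(message)        # bytes: only at the storage boundary
restored = from_storage_bytes(raw)     # str again as soon as it is read back

print(type(raw).__name__, restored["word"])
```

Keeping the encoding step inside two small helpers means the rest of the application only ever handles Unicode strings.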