Clare edited this page Feb 12, 2016 · 10 revisions

All messages should be a subset of the following object:

{
  word: str
  crawl_date: str # ISO8601-formatted
  urls: [ # list of URL objects
    {
      url: str
      title: str
      author: str
      source: str  # e.g. "The Atlantic" or abcxyz.com
      search_provider: str  # e.g. 'google'
      date: str # ISO8601-formatted
      doc: str # contains the cleaned body of the URL
      features: {
        italic: bool # Does the term occur in emphasis in the document?
        bold: bool
        quotes: bool
        readability: float # Reading score
      }
      variants: [...] # List of spelling variants in this document
      sentences: [ # List of sentences mentioning the term
        {
          s: str
          s_clean: str # Cleaned sentence, search terms replaced with '_TERM_'
          frd: int # Binary classification of the sentence as an FRD (1 or 0)
          frd_likelihood: float # Likelihood of the sentence being an FRD, as a fraction (e.g. 0.1 = 10% likely)
          pos_tags: str # Penn Treebank tagged sentence in format 'A/DT _TERM_/JJ culture/NN is/VBZ'
          rating: float # Relative rating of the FRDs to one another (e.g. 1st best, 2nd best, etc.)
        }
      ]
    }
  ]
}
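
As an illustration of what "subset" means here, the following is a minimal sketch of a check that reports any keys not present in the object above. The function name `unknown_keys` and the sample values are hypothetical and not part of the Serapis code; only the key names are taken from the schema.

```python
# Allowed keys at each level, copied from the schema above.
TOP_KEYS = {"word", "crawl_date", "urls"}
URL_KEYS = {"url", "title", "author", "source", "search_provider", "date",
            "doc", "features", "variants", "sentences"}
FEATURE_KEYS = {"italic", "bold", "quotes", "readability"}
SENTENCE_KEYS = {"s", "s_clean", "frd", "frd_likelihood", "pos_tags", "rating"}


def unknown_keys(message):
    """Return a sorted list of key paths that are not part of the schema.

    Hypothetical helper: it checks key names only, not value types.
    """
    bad = [k for k in message if k not in TOP_KEYS]
    for i, url in enumerate(message.get("urls", [])):
        bad += ["urls[%d].%s" % (i, k) for k in url if k not in URL_KEYS]
        bad += ["urls[%d].features.%s" % (i, k)
                for k in url.get("features", {}) if k not in FEATURE_KEYS]
        for j, sent in enumerate(url.get("sentences", [])):
            bad += ["urls[%d].sentences[%d].%s" % (i, j, k)
                    for k in sent if k not in SENTENCE_KEYS]
    return sorted(bad)


# A partial message is valid: every key it uses appears in the schema.
# All values below are made up for illustration.
example = {
    "word": "hygge",
    "crawl_date": "2016-02-12T10:00:00Z",
    "urls": [
        {
            "url": "http://abcxyz.com/article",
            "search_provider": "google",
            "features": {"italic": True},
            "sentences": [{"s": "Hygge is a kind of coziness.", "frd": 1}],
        }
    ],
}
```

Calling `unknown_keys(example)` returns an empty list, while a message carrying an extra key such as `urls[0].language` would have that path reported.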

String encoding

All strings are expected to be UTF-8 encoded when the message is saved to S3 or to the database, and to be Unicode strings within the application. If an operation requires UTF-8 encoded strings, this is clearly marked in the code, and the strings should be decoded back into Unicode as soon as possible.
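
The encode-at-the-boundary rule can be sketched roughly as follows. This assumes messages are serialized as JSON, which this page does not state, and the helper names `to_storage` and `from_storage` are hypothetical.

```python
import json


def to_storage(message):
    """Serialize a message to UTF-8 bytes just before it leaves the application.

    Assumption: JSON is the storage format; ensure_ascii=False keeps
    non-ASCII characters as real UTF-8 rather than \\u escapes.
    """
    return json.dumps(message, ensure_ascii=False).encode("utf-8")


def from_storage(raw):
    """Decode UTF-8 bytes back into Unicode strings as soon as they are read."""
    return json.loads(raw.decode("utf-8"))


# Round trip: Unicode inside the application, UTF-8 bytes at the boundary.
stored = to_storage({"word": "café"})
restored = from_storage(stored)
```

Keeping the encode and decode calls in two small functions at the storage boundary means the rest of the code only ever handles Unicode strings.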

Serapis

This Wiki was created and is maintained by summer.ai. Talk to Manuel if you run into problems!
