
Save original data on translation error #335

Merged
merged 9 commits into from
Jul 20, 2020

Conversation

@c0c0n3 (Member) commented Jul 8, 2020

This PR implements #318 with what boils down to a glorified try/catch: if an insert fails because of a JSON-to-tabular translation error, we save all original entities in the batch as JSON instead of the QL canonical tabular format. This PR also upgrades all our Timescale test images to the latest stable release, Timescale 1.7.1 / Postgres 12, since our current version (Timescale 1.7.1 / Postgres 10) is not fully compatible with the changes implemented by this PR and is, besides, deprecated and no longer maintained by the Timescale team.

Implementation details

QL divides up the entities received at the notify endpoint by grouping them by type. Each type batch then gets inserted separately into the DB. For each such batch of n entities, this PR catches any ProgrammingError raised on insert, then handles the exception by converting all n entities to JSON and storing them in their corresponding entity table (a minimal sketch of this fallback follows the list below), populating

  • entity ID and type columns as usual, but making sure no exception can occur in the process;
  • time index with the current time---this too avoids unexpected exceptions that could stop error handling and also comes in handy to correlate log entries of e.g. QL and DB services;
  • a new column with the original entity's JSON---well, to the extent that Python's JSON serialisation and deserialisation procedures are inverses of each other; I don't think that holds in general, but for practical purposes we can probably assume it does.
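Here is that sketch. It only illustrates the control flow described above; names like translator.sql_insert, safe_str and the fallback column list are assumptions for the example, not the PR's actual code:

from datetime import datetime, timezone
from uuid import uuid4
import json

from psycopg2 import ProgrammingError   # the Crate driver raises its own ProgrammingError

ORIGINAL_ENTITY_COL = '__original_ngsi_entity__'   # column name as used in the queries below

def safe_str(entity, key):
    # Never let the fallback itself blow up: default to an empty string.
    value = entity.get(key, '') if isinstance(entity, dict) else ''
    return value if isinstance(value, str) else str(value)

def insert_batch(translator, table_name, col_names, rows, entities):
    try:
        # Normal flow: entities already translated to tabular rows.
        translator.sql_insert(table_name, col_names, rows)
    except ProgrammingError as err:
        # Fallback: store every entity in the batch as raw JSON, tagged
        # with a shared batch ID and the reason the tabular insert failed.
        batch_id = uuid4().hex
        time_index = datetime.now(timezone.utc).isoformat()   # current time, cannot fail
        fallback_cols = ['entity_id', 'entity_type', 'time_index', ORIGINAL_ENTITY_COL]
        fallback_rows = [
            [safe_str(e, 'id'), safe_str(e, 'type'), time_index,
             json.dumps({'data': e, 'failedBatchID': batch_id, 'error': str(err)},
                        default=str)]
            for e in entities
        ]
        translator.sql_insert(table_name, fallback_cols, fallback_rows)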

Additionally if the KEEP_RAW_ENTITY env var is set to true, each entity's JSON will be stored in this new column even when no exception gets raised, i.e. during the normal flow. (This may result in the table needing up to 10x more storage.) KEEP_RAW_ENTITY gets read in on each API call to the notify endpoint so it can be set dynamically and it will affect every subsequent insert operation.
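For instance, a per-call read might look like this (a sketch only; keep_raw_entity is a made-up helper name and the accepted truth values are an assumption):

import os

def keep_raw_entity() -> bool:
    # Re-read the env var on every call to the notify endpoint so the
    # setting can be flipped at runtime and affects subsequent inserts.
    return os.getenv('KEEP_RAW_ENTITY', 'false').strip().lower() in ('true', 'yes', '1')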

The original entity column's DB type is jsonb for Timescale and (dynamic) object for Crate. The format of the data stored in it in the case of an insert failure is

{
    "data": { ... original NGSI entity ... },
    "failedBatchID": "... a unique string to identify the batch of entities that failed ...",
    "error": "... reason why the normal insert failed ..."
}

whereas if KEEP_RAW_ENTITY = true and the insert did not fail, the format will be

{
    "data": { ... original NGSI entity ... }
}

i.e. there will be no failedBatchID or error fields. This way, even with KEEP_RAW_ENTITY set to true, we can still easily pick out which rows correspond to an insert failure and, for each such row, determine the batch it was in. Knowing which batch an entity was in can help with debugging since it may happen that just one entity in the batch caused the error whereas all the remaining n - 1 could've been inserted just fine, so, in general, looking at just one row in isolation isn't enough to pin down the cause of the failure. Here is an example Timescale query to see which batches failed and how many entities were in each

SELECT (__original_ngsi_entity__ ->> 'failedBatchID') AS batch, count(*) as count
FROM some_entity_table 
WHERE (__original_ngsi_entity__ ->> 'failedBatchID') IS NOT NULL
GROUP BY (__original_ngsi_entity__ ->> 'failedBatchID')

whereas this would be the Crate equivalent

SELECT __original_ngsi_entity__['failedBatchID'] AS batch, count(*) as count
FROM some_entity_table 
WHERE __original_ngsi_entity__['failedBatchID'] IS NOT NULL
GROUP BY __original_ngsi_entity__['failedBatchID']

Limitations

  • Best-effort approach to data loss prevention. The implementation is an unsophisticated workflow but hopefully gives us a fighting chance to avoid losing entire datasets in the most common error scenarios---i.e. those related to the conversion of JSON to tabular format, e.g. an NGSI attribute changing its type over time. It is not meant to be a robust data loss prevention solution---e.g. recovering from network failures, internal exceptions (reporter/translator), etc. We talked about it in the past, and, IMHO, to really improve reliability we'd have to move away from synchronous messaging (e.g. what's the use of returning errors to Orion?) to an async messaging arch with buffering, durability and persistence guarantees...
  • Error recovery. The mechanism by which we should recover from failure (e.g. fix original entity and insert it again) is not in scope.
  • Error detection/alerting. Ditto.
  • Original entity column index. Currently there's no index, but we may well need one to speed up queries when debugging failures. Since how we're going to actually recover from failures is still up in the air, it's best to hold off on this one.
  • Not catering for internal or validation errors. If there's an upstream error in the Reporter or even in the Translator before the insert (see ValueErrors), we'll still lose the whole batch.
  • Translation error detection. We catch ProgrammingError exceptions that happen on insert and handle them by inserting the original JSON entities instead of their tabular counterpart, blindly assuming such an error was caused by a problem in translating JSON to tabular format---e.g. an attribute changed its type. While catching ProgrammingError is better than the overly broad Exception (e.g. if the DB is overloaded, it's kind of pointless to try a new insert right after one has just failed), it could still be too broad but I haven't managed to narrow it down to something more specific so that's the best we can do for now.
  • Not separating the wheat from the chaff. Ignoring race conditions for a moment, in principle we should be able to figure out which entities in the batch can't be converted to tabular format and insert those that can. Instead, if there's any translation error, I ditch the whole batch---e.g. even if nine out of ten entities could've been inserted without a glitch, all ten wind up in the error batch.
  • Brittle code. Much of the insert workflow implicitly assumes col_names[k] <-> rows[k] <-> entities[k], but we never actually check anywhere that this always holds true, so slight changes to the insert workflow could cause nasty bugs (see the small sketch after this list). Note this is something that was there already and this PR simply piggybacks on that mechanism. I tried making the code more robust but it turned out to be too much work, so I had to roll those changes back.
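As for the last point, here is the kind of explicit check that could at least turn a silent misalignment into a loud failure (a sketch only, not part of the PR; it assumes rows[k] is the tabular form of entities[k] and each row holds one value per column):

def check_batch_alignment(col_names, rows, entities):
    # Make the implicit invariant explicit: one row per entity,
    # one value per column in every row.
    if len(rows) != len(entities):
        raise ValueError('expected one row per entity')
    if any(len(row) != len(col_names) for row in rows):
        raise ValueError('expected one value per column in every row')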

@chicco785 (Contributor):
@c0c0n3 it seems it needs a rebase

col_list = ', '.join(['"{}"'.format(c.lower()) for c in col_names])
placeholders = ','.join(['?'] * len(col_names))
stmt = f"insert into {table_name} ({col_list}) values ({placeholders})"
try:
Contributor:

so this should be the "safe" way, right?

Member Author:

if sql injection proof is what you mean, yes, in principle the question marks should stop an attack based on dodgy values---well, as long as the driver we use and the server do their job right. but i didn't actually change the way it used to work, i just factored that bit of code out into a separate method without even thinking about sql injection, to be honest. now that i look at it with malicious eyes, this line could still be a problem actually

col_list = ', '.join(['"{}"'.format(c.lower()) for c in col_names])

the "safe" stuff i had in mind for this PR was to stop as much as reasonably possible the second insert from bombing out---i.e. the insert we do to recover from a SQL failure in the existing one. so e.g. we only save the bare min and in the plainest possible way, see code below.


# Define column types
# {column_name -> crate_column_type}
table = {
    'entity_id': self.NGSI_TO_SQL['Text'],
    'entity_type': self.NGSI_TO_SQL['Text'],
    self.TIME_INDEX_NAME: self.NGSI_TO_SQL[TIME_INDEX],
    FIWARE_SERVICEPATH: self.NGSI_TO_SQL['Text'],
    ORIGINAL_ENTITY_COL: self.NGSI_TO_SQL[NGSI_TEXT]
}
Contributor:

why not NGSI_STRUCTURED_VALUE?

Member Author:

yea, that's what i did initially, but then i thought plain text would be more bomb proof. in fact, if we use a db json field, the db parser may not necessarily agree w/ the python serializer and puke at us. so plain text is the safest---see my comment above about it.

Contributor:

i thought about "object" so that we could query it and eventually use it to "fix" data

Member Author:

stashing away the original entity as json would have some advantages though, e.g. we can run json queries on that column to debug failures---postgres has an integrated json query lang, not sure about crate. perhaps, we should store it as json, since the likelihood of python producing a json the sql engine won't be able to parse is, frankly, just a corner case in my opinion---famous last words.

Member Author:

i thought about "object" so that we could query it and eventually use it to "fix" data

yep, see my comment, i think you commented just before i hit the comment button :-)
should we do json then? which is more important to you: a bomb-proof insert or being able to query the json?

Contributor:

i think we store it as json, and if the message is not valid json, we log the error

@c0c0n3 (Member Author) commented Jul 8, 2020

> it seems it needs a rebase

oh dear, didn't realise that, i ported the code from another local branch i had where i was experimenting w/ different implementation options. will rebase when i'm finished, there's still a couple of commits i need to squeeze in before i can call it a day

c0c0n3 marked this pull request as draft July 8, 2020 11:58
@c0c0n3 (Member Author) commented Jul 8, 2020

@chicco785 could you please have a look at this PR again and read through the limitations section above?
Still to do:

  • Should we store the original entity as JSON instead of plain text? I reckon the benefits outweigh the risks?
  • Not sure if we should improve anything at this stage---see limitations section.
  • One thing that springs to mind though is error recovery. As it stands now, picking up which entities failed is more work than I'd like it to be. In fact, I only realised after the fact that we might need an extra column or two to tell us explicitly that a row is a "failure" record and why it failed! If we decide to go the JSON route, we wouldn't need extra columns as we can query the JSON, so we could store a structure along the lines of { failed: yes, cause: nasty, original: data }? Thoughts?

c0c0n3 marked this pull request as ready for review July 8, 2020 18:14
@c0c0n3 (Member Author) commented Jul 9, 2020

@chicco785 so I've implemented storing the original entity and associated recovery metadata as JSON---see the updated PR description at the top. Besides being able to easily query the column (see examples in the PR description), QL's memory footprint is now back where it was before---which is not to say it's good, but at least this PR didn't make it worse. (If we want to be able to handle huge notification payloads, we'll have to move away from the current paradigm of "suck everything in memory" to constant-space processing, e.g. with streams---this too would be another big change to the architecture, so we can't do it now.)
