
Save original data on translation error #335

Merged
merged 9 commits into from
Jul 20, 2020

Conversation

@c0c0n3 (Member) commented Jul 8, 2020

This PR implements #318 with what boils down to a glorified try/catch: if an insert fails because of a JSON-to-tabular translation error, we save all original entities in the batch as JSON instead of the QL canonical tabular format. This PR also upgrades all our Timescale test images to the latest stable release, Timescale 1.7.1 / Postgres 12, since our current version (Timescale 1.7.1 / Postgres 10) is not fully compatible with the changes implemented by this PR and is, besides, deprecated and no longer maintained by the Timescale team.

Implementation details

QL divides up the entities received at the notify endpoint by grouping them by type. Each type batch then gets inserted separately into the DB. For each such batch of n entities, this PR catches any ProgrammingError raised on insert, then handles the exception by converting all n entities to JSON and storing them in their corresponding entity table (a minimal sketch of this fallback follows the list below), populating

  • entity ID and type columns as usual, but making sure no exception can occur in the process;
  • time index with the current time---this too avoids unexpected exceptions that could stop error handling and also comes in handy to correlate log entries of e.g. QL and DB services;
  • a new column with the original entity's JSON---well, to the extent that Python's JSON serialisation and deserialisation procedures are inverses of each other; I don't think that holds in general, but for practical purposes we can probably assume it does.
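Here is that sketch. It only illustrates the control flow described above; names like translator.sql_insert, safe_str and the fallback column list are assumptions for the example, not the PR's actual code:

from datetime import datetime, timezone
from uuid import uuid4
import json

from psycopg2 import ProgrammingError   # the Crate driver raises its own ProgrammingError

ORIGINAL_ENTITY_COL = '__original_ngsi_entity__'   # column name as used in the queries below

def safe_str(entity, key):
    # Never let the fallback itself blow up: default to an empty string.
    value = entity.get(key, '') if isinstance(entity, dict) else ''
    return value if isinstance(value, str) else str(value)

def insert_batch(translator, table_name, col_names, rows, entities):
    try:
        # Normal flow: entities already translated to tabular rows.
        translator.sql_insert(table_name, col_names, rows)
    except ProgrammingError as err:
        # Fallback: store every entity in the batch as raw JSON, tagged
        # with a shared batch ID and the reason the tabular insert failed.
        batch_id = uuid4().hex
        time_index = datetime.now(timezone.utc).isoformat()   # current time, cannot fail
        fallback_cols = ['entity_id', 'entity_type', 'time_index', ORIGINAL_ENTITY_COL]
        fallback_rows = [
            [safe_str(e, 'id'), safe_str(e, 'type'), time_index,
             json.dumps({'data': e, 'failedBatchID': batch_id, 'error': str(err)},
                        default=str)]
            for e in entities
        ]
        translator.sql_insert(table_name, fallback_cols, fallback_rows)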

Additionally if the KEEP_RAW_ENTITY env var is set to true, each entity's JSON will be stored in this new column even when no exception gets raised, i.e. during the normal flow. (This may result in the table needing up to 10x more storage.) KEEP_RAW_ENTITY gets read in on each API call to the notify endpoint so it can be set dynamically and it will affect every subsequent insert operation.
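For instance, a per-call read might look like this (a sketch only; keep_raw_entity is a made-up helper name and the accepted truth values are an assumption):

import os

def keep_raw_entity() -> bool:
    # Re-read the env var on every call to the notify endpoint so the
    # setting can be flipped at runtime and affects subsequent inserts.
    return os.getenv('KEEP_RAW_ENTITY', 'false').strip().lower() in ('true', 'yes', '1')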

The original entity column's DB type is jsonb for Timescale and (dynamic) object for Crate. The format of the data stored in it in the case of an insert failure is

{
    "data": { ... original NGSI entity ... },
    "failedBatchID": "... a unique string to identify the batch of entities that failed ...",
    "error": "... reason why the normal insert failed ..."
}

whereas if KEEP_RAW_ENTITY = true and the insert did not fail, the format will be

{
    "data": { ... original NGSI entity ... }
}

i.e. there will be no failedBatchID or error fields. This way, even with KEEP_RAW_ENTITY set to true, we can still easily pick out which rows correspond to an insert failure and, for each such row, determine the batch it was in. Knowing which batch an entity was in can help with debugging since it may happen that just one entity in the batch caused the error whereas all the remaining n - 1 could've been inserted just fine, so, in general, looking at just one row in isolation isn't enough to pin down the cause of the failure. Here is an example Timescale query to see which batches failed and how many entities were in each

SELECT (__original_ngsi_entity__ ->> 'failedBatchID') AS batch, count(*) as count
FROM some_entity_table 
WHERE (__original_ngsi_entity__ ->> 'failedBatchID') IS NOT NULL
GROUP BY (__original_ngsi_entity__ ->> 'failedBatchID')

whereas this would be the Crate equivalent

SELECT __original_ngsi_entity__['failedBatchID'] AS batch, count(*) as count
FROM some_entity_table 
WHERE __original_ngsi_entity__['failedBatchID'] IS NOT NULL
GROUP BY __original_ngsi_entity__['failedBatchID']

Limitations

  • Best-effort approach to data loss prevention. The implementation is an unsophisticated workflow but hopefully gives us a fighting chance to avoid losing entire datasets in the most common error scenarios---i.e. those related to the conversion of JSON to tabular format, e.g. an NGSI attribute changing its type over time. It is not meant to be a robust data loss prevention solution---e.g. recovering from network failures, internal exceptions (reporter/translator), etc. We talked about it in the past, and, IMHO, to really improve reliability we'd have to move away from synchronous messaging (e.g. what's the use of returning errors to Orion?) to an async messaging arch with buffering, durability and persistence guarantees...
  • Error recovery. The mechanism by which we should recover from failure (e.g. fix original entity and insert it again) is not in scope.
  • Error detection/alerting. Ditto.
  • Original entity column index. Currently there's no index, but we may well need one to speed up queries when debugging failures. Since how we're going to actually recover from failures is still up in the air, it's best to hold off on this one.
  • Not catering for internal or validation errors. If there's an upstream error in the Reporter or even in the Translator before the insert (see ValueErrors), we'll still lose the whole batch.
  • Translation error detection. We catch ProgrammingError exceptions that happen on insert and handle them by inserting the original JSON entities instead of their tabular counterpart, blindly assuming such an error was caused by a problem in translating JSON to tabular format---e.g. an attribute changed its type. While catching ProgrammingError is better than the overly broad Exception (e.g. if the DB is overloaded, it's kind of pointless to try a new insert right after one has just failed), it could still be too broad but I haven't managed to narrow it down to something more specific so that's the best we can do for now.
  • Not separating the wheat from the chaff. Ignoring race conditions for a moment, in principle we should be able to figure out which entities in the batch can't be converted to tabular format and insert those that can. Instead, if there's any translation error, I ditch the whole batch---e.g. even if nine out of ten entities could've been inserted without a glitch, all ten wind up in the error batch.
  • Brittle code. Much of the insert workflow implicitly assumes col_names[k] <-> rows[k] <-> entities[k], but we never actually check anywhere that this always holds true, so slight changes to the insert workflow could cause nasty bugs (see the small sketch after this list). Note this is something that was there already and this PR simply piggybacks on that mechanism. I tried making the code more robust but it turned out to be too much work, so I had to roll those changes back.
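As for the last point, here is the kind of explicit check that could at least turn a silent misalignment into a loud failure (a sketch only, not part of the PR; it assumes rows[k] is the tabular form of entities[k] and each row holds one value per column):

def check_batch_alignment(col_names, rows, entities):
    # Make the implicit invariant explicit: one row per entity,
    # one value per column in every row.
    if len(rows) != len(entities):
        raise ValueError('expected one row per entity')
    if any(len(row) != len(col_names) for row in rows):
        raise ValueError('expected one value per column in every row')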

@chicco785 (Contributor):
@c0c0n3 it seems it needs a rebase

col_list = ', '.join(['"{}"'.format(c.lower()) for c in col_names])
placeholders = ','.join(['?'] * len(col_names))
stmt = f"insert into {table_name} ({col_list}) values ({placeholders})"
try:
Contributor:

so this should be the "safe" way, right?

Member Author:

if sql injection proof is what you mean, yes, in principle the question marks should stop an attack based on dodgy values---well, as long as the driver we use and the server do their job right. but i didn't actually change the way it used to work, i just factored that bit of code out into a separate method without even thinking about sql injection, to be honest. now that i look at it with malicious eyes, this line could still be a problem actually

col_list = ', '.join(['"{}"'.format(c.lower()) for c in col_names])

the "safe" stuff i had in mind for this PR was to stop as much as reasonably possible the second insert from bombing out---i.e. the insert we do to recover from a SQL failure in the existing one. so e.g. we only save the bare min and in the plainest possible way, see code below.


# Define column types
# {column_name -> crate_column_type}
table = {
    'entity_id': self.NGSI_TO_SQL['Text'],
    'entity_type': self.NGSI_TO_SQL['Text'],
    self.TIME_INDEX_NAME: self.NGSI_TO_SQL[TIME_INDEX],
    FIWARE_SERVICEPATH: self.NGSI_TO_SQL['Text'],
    ORIGINAL_ENTITY_COL: self.NGSI_TO_SQL[NGSI_TEXT]
}
Contributor:

why not NGSI_STRUCTURED_VALUE?

Member Author:

yea, that's what i did initially, but then i thought plain text would be more bomb proof. in fact, if we use a db json field, the db parser may not necessarily agree w/ the python serializer and puke at us. so plain text is the safest---see my comment above about it.

Contributor:

i thought about "object" so that we could query it and eventually use it to "fix" data

Member Author:

stashing away the original entity as json would have some advantages though, e.g. we can run json queries on that column to debug failures---postgres has an integrated json query lang, not sure about crate. perhaps, we should store it as json, since the likelihood of python producing a json the sql engine won't be able to parse is, frankly, just a corner case in my opinion---famous last words.

Member Author:

i thought about "object" so that we could query it and eventually use it to "fix" data

yep, see my comment, i think you commented just before i hit the comment button :-)
should we do json then? which is more important to you: a bomb-proof insert or being able to query the json?

Contributor:

i think we store it as json, and if the message is not valid json, we log the error

@c0c0n3 (Member Author) commented Jul 8, 2020

> it seems it needs a rebase

oh dear, didn't realise that, i ported the code from another local branch i had where i was experimenting w/ different implementation options. will rebase when i'm finished, there's still a couple of commits i need to squeeze in before i can call it a day

c0c0n3 marked this pull request as draft July 8, 2020 11:58
@c0c0n3 (Member Author) commented Jul 8, 2020

@chicco785 could you please have a look at this PR again and read through the limitations section above?
Still to do:

  • Should we store the original entity as JSON instead of plain text? I reckon the benefits outweigh the risks?
  • Not sure if we should improve anything at this stage---see limitations section.
  • One thing that springs to mind though is error recovery. As it stands now, picking up which entities failed is more work than I'd like it to be. In fact, I only realised after the fact that we might need an extra column or two to tell us explicitly that a row is a "failure" record and why it failed! If we decide to go the JSON route, we wouldn't need extra columns as we can query the JSON, so we could store a structure along the lines of { failed: yes, cause: nasty, original: data }? Thoughts?

c0c0n3 marked this pull request as ready for review July 8, 2020 18:14
@c0c0n3 (Member Author) commented Jul 9, 2020

@chicco785 so I've implemented storing the original entity and associated recovery metadata as JSON---see the updated PR description at the top. Besides being able to easily query the column (see examples in the PR description), QL's memory footprint is now back where it was before---which is not to say it's good, but at least this PR didn't make it worse. (If we want to be able to handle huge notification payloads, we'll have to move away from the current paradigm of "suck everything in memory" to constant-space processing, e.g. with streams---this too would be another big change to the architecture, so we can't do it now.)
