Save original data on translation error #335
Conversation
@c0c0n3 it seems it needs a rebase
col_list = ', '.join(['"{}"'.format(c.lower()) for c in col_names])
placeholders = ','.join(['?'] * len(col_names))
stmt = f"insert into {table_name} ({col_list}) values ({placeholders})"
try:
so this should be the "safe" way, right?
if sql injection proof is what you mean, yes in principle the question marks should stop an attack based on dodgy values---well as long as the driver we use and the server do their job right. but i didn't actually change the way it used to work, just factored that bit of code out into a separate method without even thinking about sql injection to be honest. now that i look at it with malicious eyes, this line could still be a problem actually
col_list = ', '.join(['"{}"'.format(c.lower()) for c in col_names])
the "safe" stuff i had in mind for this PR was to stop as much as reasonably possible the second insert from bombing out---i.e. the insert we do to recover from a SQL failure in the existing one. so e.g. we only save the bare min and in the plainest possible way, see code below.
src/translators/sql_translator.py
# Define column types
# {column_name -> crate_column_type}
table = {
    'entity_id': self.NGSI_TO_SQL['Text'],
    'entity_type': self.NGSI_TO_SQL['Text'],
    self.TIME_INDEX_NAME: self.NGSI_TO_SQL[TIME_INDEX],
    FIWARE_SERVICEPATH: self.NGSI_TO_SQL['Text'],
    ORIGINAL_ENTITY_COL: self.NGSI_TO_SQL[NGSI_TEXT]
why not NGSI_STRUCTURED_VALUE?
yea, that's what i did initially, but then i thought plain text would be more bomb proof. in fact, if we use a db json field, the db parser may not necessarily agree w/ the python serializer and puke at us. so plain text is the safest---see my comment above about it.
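(A concrete, if contrived, example of the kind of serializer/parser mismatch meant above: Python's json.dumps emits NaN by default, which isn't valid JSON, so a strict DB-side JSON parser would reject the payload.)

```python
import json

# json.dumps happily serialises NaN/Infinity by default (allow_nan=True),
# but the result is not standard JSON, so e.g. a jsonb column would choke on it.
payload = json.dumps({'temperature': {'type': 'Number', 'value': float('nan')}})
print(payload)  # ... "value": NaN ...
```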
i thought about "object" so that we could query it and eventually use it to "fix" data
stashing away the original entity as json would have some advantages though, e.g. we can run json queries on that column to debug failures---postgres has an integrated json query lang, not sure about crate. perhaps, we should store it as json, since the likelihood of python producing a json the sql engine won't be able to parse is, frankly, just a corner case in my opinion---famous last words.
> i thought about "object" so that we could query it and eventually use it to "fix" data
yep, see my comment, i think you commented just before i hit the comment button :-)
should we do json then? which is more important to you: a bomb-proof insert or being able to query the json?
i think we store it as json, and if the message is not "json", we log the error
oh dear, didn't realise that, i ported the code from another local branch i had where i was experimenting w/ different implementation options. will rebase when i'm finished, there's still a couple of commits i need to squeeze in before i can call it a day
@chicco785 if you could please have a look at this PR again and read through the limitations section above.
@chicco785 so I've implemented the storing of original entity and associated recovery metadata as JSON---see updated PR description at the top. Besides being able to easily query the column (see examples in PR description), QL's memory footprint is now back where it was before---which is not to say it's good, but at least this PR didn't make it worse. (If we want to be able to handle huge notification payloads, we'll have to move away from the current paradigm of "suck everything in memory" to constant space processing e.g. with streams---this too would be another big change in the architecture, so we can't do it now.)
This PR implements #318 with what boils down to a glorified try/catch: if an insert fails because of a JSON to tabular translation error, we save all original entities in the batch as JSON instead of QL's canonical tabular format.

This PR also upgrades all our Timescale test images to the latest stable release, Timescale 1.7.1 / Postgres 12, since our current version (Timescale 1.7.1 / Postgres 10) is not fully compatible with the changes implemented by this PR, as well as being deprecated and no longer maintained by the Timescale team.

Implementation details
QL divides up entities received at the `notify` endpoint by grouping them by type. Then each type batch gets inserted separately into the DB. For each such batch of `n` entities, this PR catches any `ProgrammingError` raised on inserting, then handles the exception by converting all `n` entities to JSON and storing them in their corresponding entity table, populating the new original entity column.

Additionally, if the `KEEP_RAW_ENTITY` env var is set to true, each entity's JSON will be stored in this new column even when no exception gets raised, i.e. during the normal flow. (This may result in the table needing up to 10x more storage.)
`KEEP_RAW_ENTITY` gets read in on each API call to the `notify` endpoint, so it can be set dynamically and it will affect every subsequent insert operation.
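A rough sketch of that flow, not the actual code in this PR: the helper names (`_insert_rows`, `_insert_original_entities`, `_new_batch_id`) are made up, and the concrete `ProgrammingError` class depends on the DB driver in use.

```python
import os
from sqlalchemy.exc import ProgrammingError  # assumption: the real driver's error class may differ


def insert_entities_of_type(self, table_name, col_names, rows, entities):
    keep_raw = os.environ.get('KEEP_RAW_ENTITY', 'false').strip().lower() == 'true'
    try:
        # Normal path: insert the batch in QL's canonical tabular format.
        self._insert_rows(table_name, col_names, rows)
    except ProgrammingError as e:
        # Tabular insert bombed out, presumably because of a JSON-to-tabular
        # translation problem: stash the whole batch away as JSON instead,
        # tagging each row with a batch id and the error message.
        batch_id = self._new_batch_id()
        self._insert_original_entities(table_name, entities,
                                       batch_id=batch_id, error=str(e))
        return
    if keep_raw:
        # KEEP_RAW_ENTITY: also keep the raw entity JSON in the normal flow.
        self._insert_original_entities(table_name, entities)
```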
The original entity column's DB type is `jsonb` for Timescale and (dynamic) `object` for Crate. In the case of an insert failure, the stored document holds the original entity plus recovery metadata, whereas if `KEEP_RAW_ENTITY = true` and the insert did not fail, there will be no `failedBatchID` and `error` fields (see the sketch below).
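A hypothetical illustration of the two shapes (field names other than `failedBatchID` and `error` are guesses; the exact layout in the PR may differ):

```python
# Row written after a failed insert: original entity plus recovery metadata.
failed_insert_doc = {
    'data': {'id': 'Room:1', 'type': 'Room',
             'temperature': {'type': 'Number', 'value': 23.0}},
    'failedBatchID': 'batch-0001',      # made-up identifier format
    'error': 'ProgrammingError: ...'    # truncated error message
}

# Row written in the normal flow with KEEP_RAW_ENTITY = true: no metadata.
normal_flow_doc = {
    'data': {'id': 'Room:1', 'type': 'Room',
             'temperature': {'type': 'Number', 'value': 23.0}}
}
```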
This way, even with `KEEP_RAW_ENTITY` set to true, we can still easily pick up which rows correspond to an insert failure and, for each row, determine the batch it was in. Knowing which batch an entity was in can help with debugging since it may happen that just one entity in the batch caused the error whereas all the remaining `n - 1` could've been inserted just fine, so, in general, looking at just one row in isolation isn't enough to pin down the cause of the failure. Sketched below is an example Timescale query to see what batches failed and how many entities were in each, along with the Crate equivalent.
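A sketch of what such queries could look like; the column and table names (`original_entity`, `etroom`) are made up here, since the real ones depend on QL's naming scheme.

```python
# Timescale / Postgres: the column is jsonb, so use the JSON operators.
timescale_query = """
    SELECT original_entity->>'failedBatchID' AS failed_batch,
           count(*) AS num_entities
    FROM etroom
    WHERE original_entity->>'failedBatchID' IS NOT NULL
    GROUP BY original_entity->>'failedBatchID'
"""

# Crate: the column is a dynamic object, so use bracket notation instead.
crate_query = """
    SELECT original_entity['failedBatchID'] AS failed_batch,
           count(*) AS num_entities
    FROM etroom
    WHERE original_entity['failedBatchID'] IS NOT NULL
    GROUP BY original_entity['failedBatchID']
"""
```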
Limitations
- If the batch insert fails for reasons other than a DB error (e.g. `ValueError`s raised while translating the payload), we'll still lose the whole batch.
- We only catch `ProgrammingError` exceptions that happen on insert and handle them by inserting the original JSON entities instead of their tabular counterpart, blindly assuming such an error was caused by a problem in translating JSON to tabular format---e.g. an attribute changed its type. While catching `ProgrammingError` is better than the overly broad `Exception` (e.g. if the DB is overloaded, it's kind of pointless to try a new insert right after one has just failed), it could still be too broad, but I haven't managed to narrow it down to something more specific, so that's the best we can do for now.
- The recovery logic relies on the positional correspondence `col_names[k] <-> rows[k] <-> entities[k]`, but we never really check anywhere that it always holds true. So slight changes to the insert workflow could cause nasty bugs... Note this is something that was there already and this PR simply piggybacks on that mechanism. I tried making the code more robust but it turned out to be too much work, so I had to roll those changes back.
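For what it's worth, a trivial sketch of the invariant in question (the hard part, per the note above, is guaranteeing it throughout the workflow, not asserting it at one point):

```python
def check_alignment(col_names, rows, entities):
    # Purely illustrative, not part of this PR: the recovery logic assumes
    # rows[k] was built from entities[k] and each row lines up with col_names.
    assert len(rows) == len(entities), "rows[k] must correspond to entities[k]"
    assert all(len(r) == len(col_names) for r in rows), \
        "each row must have one value per column"
```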