Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document how to use a --convert function that runs initialization code first #420

Closed
strada opened this issue Mar 23, 2022 · 12 comments
Closed

Comments

@strada
Copy link

strada commented Mar 23, 2022

When I have an insert command with transform like this:

cat items.json | jq '.data' | sqlite-utils insert listings.db listings - --convert '
d = enchant.Dict("en_US")
row["is_dictionary_word"] = d.check(row["name"])
'  --import=enchant --ignore

I noticed as the number of rows increases the operation becomes quite slow, likely due to the creation of the d = enchant.Dict("en_US") object for each row. Is there a way to share that instance d between transform function calls, like a shared context?

@simonw simonw added the enhancement New feature or request label Mar 24, 2022
@simonw
Copy link
Owner

simonw commented Mar 24, 2022

Yeah, this is WAY harder than it should be.

There's a clumsy workaround you could use which looks something like this: create a file my_enchant.py containing:

import enchant
d = enchant.Dict("en_US")

def check(word):
    return d.check(word)

Then run sqlite-utils like this:

PYTHONPATH=. cat items.json | jq '.data' | sqlite-utils insert listings.db listings - --convert 'my_enchant.check(value)' --import my_enchant

Except I tried that and it doesn't work! I don't know the right pattern for getting --import to work with modules in the same directory.

So yeah, this is definitely a big feature gap.

@simonw
Copy link
Owner

simonw commented Mar 24, 2022

I can think of three ways forward:

  • Figure out a pattern that gets that local file import workaround to work
  • Add another option such as --convert-init that lets you pass code that will be executed once at the start
  • Come up with a pattern where the --convert code can run some initialization code and then return a function which will be called against each value

I quite like the idea of that third option - I'm going to prototype it and see if I can work something out.

@simonw
Copy link
Owner

simonw commented Mar 24, 2022

Here's how the _compile_code() mechanism works at the moment:

def _compile_code(code, imports, variable="value"):
locals = {}
globals = {"r": recipes, "recipes": recipes}
# If user defined a convert() function, return that
try:
exec(code, globals, locals)
return locals["convert"]
except (AttributeError, SyntaxError, NameError, KeyError, TypeError):
pass
# Try compiling their code as a function instead
body_variants = [code]
# If single line and no 'return', try adding the return
if "\n" not in code and not code.strip().startswith("return "):
body_variants.insert(0, "return {}".format(code))
code_o = None
for variant in body_variants:
new_code = ["def fn({}):".format(variable)]
for line in variant.split("\n"):
new_code.append(" {}".format(line))
try:
code_o = compile("\n".join(new_code), "<string>", "exec")
break
except SyntaxError:
# Try another variant, e.g. for 'return row["column"] = 1'
continue
if code_o is None:
raise SyntaxError("Could not compile code")
for import_ in imports:
globals[import_.split(".")[0]] = __import__(import_)
exec(code_o, globals, locals)
return locals["fn"]

At the end it does this:

    return locals["fn"]

So it's already building and then returning a function.

The question is if there's a sensible way to allow people to further customize that function by executing some code first, in a way that's easy to explain.

@simonw
Copy link
Owner

simonw commented Mar 24, 2022

Aha! This may be possible already:

# If user defined a convert() function, return that
try:
exec(code, globals, locals)
return locals["convert"]
except (AttributeError, SyntaxError, NameError, KeyError, TypeError):
pass

And yes, this does indeed work - you can do something like this:

echo '{"name": "harry"}' | sqlite-utils insert db.db people - --convert '
import time
# Simulate something expensive
time.sleep(1)

def convert(row):
    row["upper"] = row["name"].upper()
'

And after running that:

sqlite-utils dump db.db 
BEGIN TRANSACTION;
CREATE TABLE [people] (
   [name] TEXT,
   [upper] TEXT
);
INSERT INTO "people" VALUES('harry','HARRY');
COMMIT;

So this is a documentation issue - there's a trick for it but I didn't know what the trick was!

@simonw simonw added documentation and removed enhancement New feature or request labels Mar 24, 2022
@simonw simonw changed the title Transform command with shared context Document how to use a --convert function that runs initialization code first Mar 24, 2022
@simonw
Copy link
Owner

simonw commented Mar 25, 2022

That documentation is split across a few places. This is the only bit that talks about def convert() pattern right now:

But that's for sqlite-utils convert - the documentation for sqlite-utils insert --convert at https://sqlite-utils.datasette.io/en/stable/cli.html#applying-conversions-while-inserting-data doesn't mention it.

Since both sqlite-utils convert and sqlite-utils insert --convert apply the same rules to the code, they should link to a shared explanation in the documentation.

@simonw
Copy link
Owner

simonw commented Mar 25, 2022

Actually this doesn't work as I thought. This demo shows that the initialization code is run once per item, not a single time at the start of the run:

% sqlite-utils insert dogs.db dogs dogs.json --convert '
import random
print("seeding")
random.seed(10)
print(random.random())

def convert(row):
    print(row)
    row["random_score"] = random.random()
'
seeding
0.5714025946899135
seeding
0.5714025946899135
seeding
0.5714025946899135
seeding
0.5714025946899135

Also that print(row) line is not being printed anywhere that gets to the console for some reason.

... my mistake, that happened because I changed this line in order to try to get local imports to work:

    try:
        exec(code, globals, locals)
        return globals["convert"]
    except (AttributeError, SyntaxError, NameError, KeyError, TypeError):

It should be locals["convert"]

@simonw
Copy link
Owner

simonw commented Mar 25, 2022

This works:

% sqlite-utils insert dogs.db dogs dogs.json --convert '
import random
print("seeding")
random.seed(10)
print(random.random())

def convert(row):
    global random
    print(row)
    row["random_score"] = random.random()
'
seeding
0.5714025946899135
{'id': 1, 'name': 'Cleo'}
{'id': 2, 'name': 'Pancakes'}
{'id': 3, 'name': 'New dog'}
(sqlite-utils) sqlite-utils % sqlite-utils rows dogs.db dogs
[{"id": 1, "name": "Cleo", "random_score": 0.4288890546751146},
 {"id": 2, "name": "Pancakes", "random_score": 0.5780913011344704},
 {"id": 3, "name": "New dog", "random_score": 0.20609823213950174}]

Having to use global random inside the function is frustrating but apparently necessary. https://stackoverflow.com/a/56552138/6083

@simonw simonw closed this as completed in 0b7b80b Mar 25, 2022
@simonw
Copy link
Owner

simonw commented Mar 25, 2022

@simonw
Copy link
Owner

simonw commented Mar 28, 2022

So now this should solve your problem:

echo '[{"name": "notaword"}, {"name": "word"}]
' | python3 -m sqlite_utils insert listings.db listings - --convert '
import enchant
d = enchant.Dict("en_US")

def convert(row):
    global d
    row["is_dictionary_word"] = d.check(row["name"])
'

@simonw
Copy link
Owner

simonw commented Mar 28, 2022

@strada
Copy link
Author

strada commented Mar 29, 2022

@simonw Thanks for looking into it and documenting the solution!

simonw added a commit that referenced this issue Apr 13, 2022
@simonw
Copy link
Owner

simonw commented Aug 28, 2022

I found a fix that makes that global workaround unnecessary:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants