Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StringType() allows casting from int to str #315

Closed
xihan opened this issue Jul 9, 2015 · 19 comments
Closed

StringType() allows casting from int to str #315

xihan opened this issue Jul 9, 2015 · 19 comments

Comments

@xihan
Copy link

xihan commented Jul 9, 2015

Hi,

I noticed that StringType defines allow_casts = (str, int).
I wonder what the motivation is for this. I was quite surprised that
a StringType() field silently casts int input to str.

I can always make a subclass that overrides the allow_casts class variable though.

Thanks for your help!

@kracekumar
Copy link
Member

I am not aware of the reason, probably early contributors can chip in @schematics/owners

@jokull
Copy link
Contributor

jokull commented Jul 11, 2015

Blame points to me 😁

I honestly can’t remember. I agree that it is weird.

@kstrauser
Copy link
Contributor

I think this is an acceptable compromise, because it's such a common thing. First, consider that all Python ints can be represented as strings. Second, consider a common serialization format: YAML. You can tag values to be of a specific type, but that's pretty rare. Instead, the default mode for YAML parsers is to infer a value's type based on regular expression. See how it handles ints, for example.

Now consider this YAML document:

    name: Joe Smith
    address: 123 Any Street.
    zip: 12345

YAML would parse the zip code as an int, even though semantically it's a string. Other examples include USPS tracking numbers and phone numbers without the dashes. And YAML's not alone in representing int-looking strings as integers:

$ sqlite3 foo.db
SQLite version 3.8.8 2015-01-16 12:24:00
Enter ".help" for usage hints.
sqlite> create table bar(baz str);
sqlite> insert into bar (baz) values ("123");
sqlite> .dump bar
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE bar(baz str);
INSERT INTO "bar" VALUES(123);
COMMIT;

In any of those cases, we'd expect a Model with a StringType zip (or phone or tracking_number) field to accept int values without complaint. Even Python itself acts this way: "%s" % 123 will silently cast an int to a string when it's expecting a string value.

While I certainly get that it's not philosophically pure to implicitly cast ints to strings, it's so common (especially in the domains were Schematics is likely to be used) that I think it's a reasonable default behavior.

@xihan xihan closed this as completed Jul 16, 2015
@tuukkamustonen
Copy link

Just bumped into this. I wasn't really expecting the behavior, and couldn't find mentioning about it from the docs.

Would it make sense to add something like following to StringType's docstring:

StringType also accepts ``bool`` and ``int`` values, by casting them into
string representation and validating the resulting ``str`` as usual.

StringType also has allow_casts = (int, str), that specifies which types can be converted into string in to_native() but it doesn't list bool and boolean output from to_native(), so I'm not sure how it works and what the docstring should accurately be?

@bintoro bintoro reopened this Apr 11, 2016
@bintoro
Copy link
Member

bintoro commented Apr 11, 2016

By and large, Schematics has a lax approach to type casting. I've been wanting to make it less so, or at least make the behavior configurable. This has already been addressed w.r.t numbers; see my monologue and the resulting patch in #364.

In short, I ended up making IntType stricter across the board. A super-strict mode can be enabled by setting the strict flag on a field-by-field basis. Something like that should probably be implemented for StringType as well.

One thing I'd love to hear people's thoughts on is the relative importance of controlling type strictness at data import time as opposed to field definition time. Should we have both? If yes, how should the settings interact?

@lkraider
Copy link
Contributor

If you just want to validate the input type you could skip the conversion step.

Maybe split an identity test step (currently done in the conversion, eg. isinstance(value, str) ) to allow skip the to_native conversion code.

*edited for clarity

@lkraider
Copy link
Contributor

Just adding here: schematics is about importing data from possibly untrusted sources and making it fit into a schema, by doing the necessary lossless conversions when possible.

If you want to enforce a specific input datatype, that is a subset of what the default schematics types do, and not supported by default. You can always subclass and do your own to_native acceptance logic.

@bintoro
Copy link
Member

bintoro commented Apr 11, 2016

If you just want to validate data you could skip the conversion step.

This is not really supported usage. Validator functions are not required to do any form of type-checking, so if the type is wrong, all bets are off — see #338. This is why BaseType.validate() now performs conversion automatically before running the validators.

@lkraider
Copy link
Contributor

Yes, you are right, sorry for not being clear.

I mean that maybe splitting the current conversion into [conversion] + [identity test] would be a way to cleanly skip the conversion step when just type checking of input is needed.

Edit: the logical order is identity test first and then conversion, with an exception raised in the identity test if strict type checking is requested.

@bintoro
Copy link
Member

bintoro commented Apr 11, 2016

Okay I get it. Well, it's true that everything can be solved by subclassing, but the default behavior should at least be reasonable. As @tuukkamustonen pointed out, StringType().to_native(True) returns the string "True", which I think is just too surprising. When does that even make sense? Integers in general are probably fine but my vote is for excluding bool.

@lkraider
Copy link
Contributor

Is the issue about

  1. the specific cases where to_native -> String are reasonable; or
  2. the general case of only type-checking the input (and raising if type(input) != str)?

For (1), [see my next message] the following line could be removed:

return str(value)

For (2), while applying the same solution above would resolve for the String type, I would propose to split the type test out of the to_native function, so that all types would basically refactor the isinstance checks out, enabling strict input type-checking for all types.

@lkraider
Copy link
Contributor

Just looked again, I didn't match the if/else clause correctly for (1), the problem is that:
isinstance(True, int) == True

The solution would be to replace isinstance with a strict type(value) in self.allow_casts, not sure if it would break other code though.

@bintoro
Copy link
Member

bintoro commented Apr 11, 2016

I would just special-case int and say that its subclasses are not okay. Integer coercion is needed because of the YAML issue and similar scenarios, and they will always involve plain integers, not random subclasses of int.

For example, consider this class from util.py:

class Constant(int):

    def __new__(cls, name, value):
        return int.__new__(cls, value)

    def __init__(self, name, value):
        self.name = name
        int.__init__(self)

    def __repr__(self):
        return self.name

It's an int subclass whose string representation is a name completely unrelated to the integer value. If someone wants to feed such an object to a StringType field, it should be the user's responsibility to first decide on the desired representation and cast it to either a pure int or a str.

@lkraider
Copy link
Contributor

Do you mean something like:

if isinstance(value, self.allow_casts):
    if isinstance(value, int):
        if type(value) != int: raise ConversionError(...)

I agree it's a simple fix for this case. Not sure how complete.

@tuukkamustonen
Copy link

I only speak from a "customer's" perspective now: I am using schematics to validate incoming request data in a Flask application. I want to be strict with data - a string needs to be a string, an int needs to be an int, and so on. This is both to avoid accidentally introducing potential security problems and to keep the API as straightforward and simple as possible. I don't want to run in those scenarios were client accidentally sends us a number, and then we process that as a string, resulting in something else than what client expected.

I do see it now that I should have read the project description better - maybe schematics just isn't intended for this use? Anyhow, I feel the documentation is a bit lacking in this sense - the docs do hint about type coercion, but an example or two (or the table from #364) would do better to enlighten reader's awareness of the data casting.

schematics is about importing data from possibly untrusted sources and making it fit into a schema, by doing the necessary lossless conversions when possible.

You put it nicely there, @lkraider. I would love to see this under the project description.

StringType().to_native(True) returns the string "True", which I think is just too surprising

I would say this is correct, if the input was "True" and not True. The problem (for me) is that "True" is accepted in the first place.

One thing I'd love to hear people's thoughts on is the relative importance of controlling type strictness at data import time as opposed to field definition time. Should we have both? If yes, how should the settings interact?

Can you clarify the difference between data import and field definition time?

I've been also following https://github.com/sloria/webargs and https://github.com/marshmallow-code/marshmallow lately. If the focus of schematics is not in API validation, would you encourage to rather use webargs for input validation in an API and leave schematics for YAML parsing (and similar)?

@bintoro
Copy link
Member

bintoro commented Apr 12, 2016

First, Schematics is very much intended for API validation — it's even mentioned as a use case on the project's front page.

YAML was brought up only as an example of a serialization format that doesn't automatically distinguish between integers and conceptual strings that look like integers, and I think it's a good point to keep in mind.

Like I said earlier, the library seems to have a history of coercing datatypes at almost any cost. There's something to be said for this approach (https://en.wikipedia.org/wiki/Robustness_principle), but I've felt the behavior has been too loose, so I've been trying to weed out the questionable cases: IntType no longer truncates floats and ListType no longer turns 1 into [1]. The boolean issue here falls into the same category.

@jvantuyl's comment in #324 mentioned giving users better control over the coercion logic. I agree this would be good. But I think having saner defaults should be the first step. Well-thought-out behavior out of the box will cover the vast majority of use cases.

There are basically three ways a string field could work:

  • Call str() on everything. If the type implements __str__, it's good.
  • Call str() on a few predefined types. This is what we currently do.
  • Only accept str and its subclasses.

It's easy to see all of these can make sense, depending on circumstances.

Can you clarify the difference between data import and field definition time?

I mean

class Address(Model):
    zipcode = StringType(strict=False)  # strictness fixed at definition time

a = Address(any_data)

vs

class Address(Model):
    zipcode = StringType()

a1 = Address(typed_data, strict_types=True)  # strictness specified at invocation time
a2 = Address(yaml_data, strict_types=False)

@lkraider Yeah, that's the gist of what I was suggesting, at least to fix the immediate issue. Alternatively, we could special-case bool and always reject it. But I'm unable to imagine any scenario where a strict subclass of int could legitimately occur when a string is expected.

@bintoro
Copy link
Member

bintoro commented Apr 12, 2016

One more thing: if we're coercing integers, shouldn't we be coercing floats too? What's the logic behind accepting one and rejecting the other?

A deserialization layer that will incorrectly infer int for a zip code is likely to also infer float for a version number such as 2.1.

@tuukkamustonen
Copy link

I agree, making sane defaults should be the first goal.

Let me take back my thoughts a bit. Now that I realize (this is not in the docs, afaik) that the library does these data conversions, and after thinking about it, it doesn't feel so bad. It's actually something AWS does with their CloudFormation templates - you can give 1 or "1" for a number, true or true for a boolean, and so on. I'm still doubtful about the benefits, but it hasn't harmed me so far.

So, if I document that a field is of type str and it interprets whatever you throw at it as a string, then I think it's fine, and reading True as "True" is OK. For an API, it's more a documentation and convention matter than a question of "what's right technically".

Conversion for the other datatypes cannot be so loose, however. I think what you did in #364 is indeed the way to go. The old mode is confusing (silently converting 3.2 into 3). As long as the conversion doesn't discard any details (in target type), it should be fine. And yeah, floats should work the same... although I'm not sure how they work at the moment.

For me, configurability in the field definitions would suffice. Whatever the style, all interfaces should function the same, so either I would enable the strict mode or not. Then of course, it would be rather boring to define strict=True for each field individually, so toggling the strict mode during data import would be nice.

Just one more thing, from #364:

I believe "sensible or strict" is a more useful menu of configurations than "sensible or silly".

It kinda reads that the default mode is now "sensible and silly"... :D

@abo123456789
Copy link

Hi,

I noticed that StringType defines allow_casts = (str, int). I wonder what the motivation is for this. I was quite surprised that a StringType() field silently casts int input to str.

I can always make a subclass that overrides the allow_casts class variable though.

Thanks for your help!
I fix this problem by publish a new package:
pip install schemv==2.1.1.1
i will check the input value type equals to the actual type

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants