Deserialization of X12 data format / Backtracking during Option<T> deserialization #2705

Closed
kurtbuilds opened this issue Feb 26, 2024 · 1 comment


kurtbuilds commented Feb 26, 2024

I'm writing a serde implementation for the X12 data format, which is a very old, widely used, and enterprise-y format that functions vaguely similarly to a CSV. Delimiters are configurable, but generally speaking, each file will contain a series of segments separated by ~. Each segment begins with a segment code, followed by *, followed by however many fields are contained in that segment, delimited by *. Elements/fields can themselves be containers of multiple values, separated by :. Here's an example (simplified for brevity):

ISA*00*00~
GS*HC*123*1234~
ST*837*000000001*0000~

One approach is a generic representation. I haven't spent as much time on it, but it feels fairly straightforward (assume #[derive(Serialize, Deserialize)] on all types):

struct Document {
    segments: Vec<GenericSegment>
}
struct GenericSegment {
    name: String,
    elements: Vec<X12Value>,
}
enum X12Value {
    Value(String),
    Container(Vec<String>),
}

So the above document would get parsed into:

Document { segments: [
    GenericSegment { name: "ISA", elements: [X12Value::Value("00"), X12Value::Value("00") ]},
    GenericSegment { name: "GS", elements: [X12Value::Value("HC"), X12Value::Value("123"), X12Value::Value("1234") ]},
    GenericSegment { name: "ST", elements: [X12Value::Value("837"), X12Value::Value("000000001"), X12Value::Value("0000") ]}
]}
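
As a rough illustration of the generic approach, here is a minimal sketch that splits raw X12 text into these structures by hand, without serde. The function name `parse_generic` is made up for this example, and the delimiters (`~`, `*`, `:`) are hard-coded even though real files can configure them.

```rust
#[derive(Debug, PartialEq)]
enum X12Value {
    Value(String),
    Container(Vec<String>),
}

#[derive(Debug, PartialEq)]
struct GenericSegment {
    name: String,
    elements: Vec<X12Value>,
}

#[derive(Debug, PartialEq)]
struct Document {
    segments: Vec<GenericSegment>,
}

/// Hypothetical helper: splits raw X12 text into generic segments.
/// Assumes the default delimiters from the example above.
fn parse_generic(input: &str) -> Document {
    let segments = input
        .split('~')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(|segment| {
            let mut fields = segment.split('*');
            // The first field of every segment is its segment code.
            let name = fields.next().unwrap_or_default().to_string();
            let elements = fields
                .map(|field| {
                    if field.contains(':') {
                        // A field containing ':' holds multiple values.
                        X12Value::Container(field.split(':').map(String::from).collect())
                    } else {
                        X12Value::Value(field.to_string())
                    }
                })
                .collect();
            GenericSegment { name, elements }
        })
        .collect();
    Document { segments }
}
```

Calling `parse_generic` on the three-line example above should yield three segments named ISA, GS, and ST.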

However, to give these structs semantic meaning, I'd like to parse them into data structures like this:

struct Document {
    header: Isa,
    subheader: Gs,
    transaction: St,
}
#[serde(rename = "ISA")]
struct Isa {
    field1: String,
    field2: String,
}
#[serde(rename = "GS")]
struct Gs {
    field1: String,
    field2: String,
    field3: String,
}
#[serde(rename = "ST")]
struct St {
    field1: String,
    field2: String,
    field3: String,
}

For this to work, I distinguish between "passthrough" structs, which are just containers with no name to match on (e.g. Document above has no "name" associated), and "segment" structs, which fail to parse if the first consumed token does not match their name.

It's a little hacky to "detect" whether a struct is passthrough or not, but so far, the story works out pretty well.
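
To make the "segment" rule concrete, here is a minimal sketch outside of serde's traits. In serde itself, `Deserializer::deserialize_struct` receives the struct's (possibly renamed) name as a `&'static str`, which is what the check below stands in for. `Cursor`, `expect_segment`, and `X12Error` are hypothetical names for this example, not part of any library.

```rust
#[derive(Debug, PartialEq)]
enum X12Error {
    /// The next segment's code did not match the struct being parsed.
    InvalidType { expected: &'static str, found: String },
    /// No segments are left.
    Eof,
}

/// Hypothetical cursor over the segments of a document.
struct Cursor<'de> {
    segments: Vec<&'de str>,
    pos: usize,
}

impl<'de> Cursor<'de> {
    fn new(input: &'de str) -> Self {
        let segments = input
            .split('~')
            .map(str::trim)
            .filter(|s| !s.is_empty())
            .collect();
        Cursor { segments, pos: 0 }
    }

    /// Consumes the next segment if its code equals `name` and returns its
    /// fields. On a mismatch the cursor is left untouched.
    fn expect_segment(&mut self, name: &'static str) -> Result<Vec<&'de str>, X12Error> {
        let segment = *self.segments.get(self.pos).ok_or(X12Error::Eof)?;
        let mut fields = segment.split('*');
        let code = fields.next().unwrap_or_default();
        if code != name {
            return Err(X12Error::InvalidType {
                expected: name,
                found: code.to_string(),
            });
        }
        self.pos += 1;
        Ok(fields.collect())
    }
}
```

A passthrough struct would skip this check entirely and simply parse its fields in order; only segment structs call `expect_segment`.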

This breaks down significantly when structs have optional fields. Consider a struct like:

// note there's no name, so this is passthrough/container
struct Person {
    name: Nm,
    address: Option<Loc>,
}
#[serde(rename = "NM")]
struct Nm {
    first: String,
    last: String,
}
#[serde(rename = "LOC")]
struct Loc {
    first: String,
    last: String,
}

What needs to happen is the deserializer needs to check the first token, and ask, "does this match what's inside Option?", and if so, give the value. If it doesn't match, then set the field to None. However, it seems like serde's API doesn't support that kind of backtracking, and it only supports looking at the string statically (i.e. without knowledge of what's contained in the option) to decide whether something is Some(T) or None.

In code, the solution I'd like to write would look something like this:

impl<'de, 'a> Deserializer<'de> for &'a mut X12Deserializer<'de> {
    fn deserialize_option<V>(self, visitor: V) -> Result<V::Value> where V: Visitor<'de>, {
        match visitor.visit_some(self) {
            Ok(v) => Ok(v),
            Err(X12DeserializerError::InvalidType { .. }) => visitor.visit_none(), // alternatively, there's no way to return V::default() or similar
            Err(e) => Err(e),
        }
    }
}

However, this code doesn't compile, because every Visitor method takes self by value, so only one of them can be called. There's also no way to call V::Value::default() or to return None directly, because V::Value has no generic bounds that would allow it.
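
For reference, the semantics I'm after can be sketched outside of serde's traits with a checkpoint and rewind. This does not solve the visitor ownership problem; it only shows the intended behavior. `Cursor`, `optional`, `two_fields`, and the `parse_*` functions are hypothetical names for this example.

```rust
#[derive(Debug, PartialEq)]
struct Nm { first: String, last: String }
#[derive(Debug, PartialEq)]
struct Loc { first: String, last: String }
#[derive(Debug, PartialEq)]
struct Person { name: Nm, address: Option<Loc> }

#[derive(Debug, PartialEq)]
enum X12Error { InvalidType, Eof, MissingField }

struct Cursor<'de> { segments: Vec<&'de str>, pos: usize }

impl<'de> Cursor<'de> {
    fn new(input: &'de str) -> Self {
        let segments = input.split('~').map(str::trim).filter(|s| !s.is_empty()).collect();
        Cursor { segments, pos: 0 }
    }

    /// Consumes the next segment if its code matches, returning its fields.
    fn segment(&mut self, code: &str) -> Result<Vec<&'de str>, X12Error> {
        let segment = *self.segments.get(self.pos).ok_or(X12Error::Eof)?;
        let mut fields = segment.split('*');
        if fields.next() != Some(code) {
            return Err(X12Error::InvalidType);
        }
        self.pos += 1;
        Ok(fields.collect())
    }

    /// Backtracking wrapper: remember the position, try the parse, and rewind
    /// to the checkpoint if the segment type did not match (or input ended).
    fn optional<T>(
        &mut self,
        parse: impl FnOnce(&mut Self) -> Result<T, X12Error>,
    ) -> Result<Option<T>, X12Error> {
        let checkpoint = self.pos;
        match parse(self) {
            Ok(value) => Ok(Some(value)),
            Err(X12Error::InvalidType) | Err(X12Error::Eof) => {
                self.pos = checkpoint;
                Ok(None)
            }
            // Any other error (e.g. a malformed LOC segment) is a real failure.
            Err(e) => Err(e),
        }
    }
}

fn two_fields(fields: &[&str]) -> Result<(String, String), X12Error> {
    match fields {
        [a, b] => Ok((a.to_string(), b.to_string())),
        _ => Err(X12Error::MissingField),
    }
}

fn parse_nm(c: &mut Cursor) -> Result<Nm, X12Error> {
    let (first, last) = two_fields(&c.segment("NM")?)?;
    Ok(Nm { first, last })
}

fn parse_loc(c: &mut Cursor) -> Result<Loc, X12Error> {
    let (first, last) = two_fields(&c.segment("LOC")?)?;
    Ok(Loc { first, last })
}

fn parse_person(c: &mut Cursor) -> Result<Person, X12Error> {
    Ok(Person { name: parse_nm(c)?, address: c.optional(parse_loc)? })
}
```

Only a type mismatch or end of input turns into None here. A LOC segment that is present but malformed still surfaces as an error, which is the distinction the InvalidType match arm above is meant to capture.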

Am I thinking about this problem in the right way? What is a valid solution to doing some kind of backtracking?

(I was able to get the code compiling and working correctly by using unsafe to forcibly copy the visitor. That's ridiculously unsafe voodoo magic, but it demonstrates that the logic works as desired. I think the only reason it doesn't segfault is that visitor.visit_none might be a no-op in many implementations.)

Interestingly, backtracking for Vec is supported, because each element of the Vec is a fallible deserialization: match the appropriate error, and backtracking is achieved. I thought I'd be able to call visit_seq from deserialize_option, using a custom SeqAccess implementation that validates that the seq has length 0 or 1. After all, Option can be thought of as an iterator of max length 1. However, that fails with an unexpected-type error: seq received, but Option expected.
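
The Vec case, and the idea of treating Option as a sequence of at most one element, can be sketched like this, again outside of serde's traits. `Cursor`, `many`, and `at_most_one` are hypothetical names for this example.

```rust
#[derive(Debug, PartialEq)]
enum X12Error { InvalidType, Eof, TooMany }

struct Cursor<'de> { segments: Vec<&'de str>, pos: usize }

impl<'de> Cursor<'de> {
    fn new(input: &'de str) -> Self {
        let segments = input.split('~').map(str::trim).filter(|s| !s.is_empty()).collect();
        Cursor { segments, pos: 0 }
    }

    /// Consumes the next segment if its code matches; otherwise consumes nothing.
    fn segment(&mut self, code: &str) -> Result<Vec<&'de str>, X12Error> {
        let segment = *self.segments.get(self.pos).ok_or(X12Error::Eof)?;
        let mut fields = segment.split('*');
        if fields.next() != Some(code) {
            return Err(X12Error::InvalidType);
        }
        self.pos += 1;
        Ok(fields.collect())
    }

    /// Collects consecutive segments with the given code. Each element is a
    /// fallible parse, so a mismatch simply ends the sequence.
    fn many(&mut self, code: &str) -> Result<Vec<Vec<&'de str>>, X12Error> {
        let mut items = Vec::new();
        loop {
            match self.segment(code) {
                Ok(fields) => items.push(fields),
                Err(X12Error::InvalidType) | Err(X12Error::Eof) => return Ok(items),
                Err(e) => return Err(e),
            }
        }
    }

    /// Option modeled as a sequence that must have length 0 or 1.
    fn at_most_one(&mut self, code: &str) -> Result<Option<Vec<&'de str>>, X12Error> {
        let mut items = self.many(code)?;
        if items.len() > 1 {
            return Err(X12Error::TooMany);
        }
        Ok(items.pop())
    }
}
```

This works in a hand-written parser because nothing constrains the return type. Inside serde, the visitor for Option<T> only accepts visit_some, visit_none, or visit_unit, which is why routing it through visit_seq is rejected.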

@kurtbuilds kurtbuilds changed the title Backtracking during deserialization Deserialization of X12 data format / Backtracking during Option<T> deserialization Feb 26, 2024
@kurtbuilds
Author

Closed in favor of #2712
