
Matching Algorithms #73

Closed
adammontville opened this issue Nov 18, 2015 · 21 comments

@adammontville
Contributor

We have had discussions on the list leading to sub-discussions about matching algorithms. While we have not taken explicit consensus for that aspect yet, I didn't notice anything in section 2.5 Requirements for Data Model Operations (or elsewhere) suggesting that we might consider such algorithms.

Does the operations section need to have some requirement pertaining to endpoint matching algorithms?

(See https://mailarchive.ietf.org/arch/msg/sacm/BZzBvwxGUcCqfMiK221cJLJoDeI as one point of reference.)

@jimsch
Contributor

jimsch commented Dec 2, 2015

I don't see this as something that needs to be mentioned in this document. I would expect that the exact matching algorithm used is going to be highly proprietary and dependent on specific data models. At this point it is not even clear whether the IM would provide the necessary points on which to hang the result of a matching algorithm, which means it is not even clear how the result of a matching algorithm would be expressed.

We have not even expressed that there is a set of operations that needs to be supported.

@ncamwing
Contributor

There isn't consensus on this notion.
I also don't see it needing to be mentioned as a requirement, nor in this document. Jim raises a good point, and I would agree that the matching algorithm is likely the "vendor differentiation" provided in SACM-based solutions.

@adammontville
Contributor Author

I'm not confident that there's consensus in either direction with respect to matching algorithms. If it doesn't need to be mentioned in this document, then where does treatment of the subject belong? Correctly matching endpoints between disparate implementations seems critical to the success of this effort.

@jimsch
Contributor

jimsch commented Jan 20, 2016

I think the important thing to say about matching algorithms is not what the algorithm is; that is not of interest in a standardization setting, because it will vary from one type of endpoint to another. Rather, what is important is to make sure that the data required to evaluate the matching algorithm is present in the information model.

If I look at the things that one can specify about the algorithm, it is really not very interesting. One can specify the set of return values <Match, No Match, Indeterminate, Insufficient Information, Error>, and one can specify some "mathematical properties": it should be commutative, in that Match(A, B) == Match(B, A), but it is not transitive, in that Match(A, B) and Match(B, C) do not necessarily imply Match(A, C). However, none of this is really interesting enough to make a requirement of.

Again, the important thing is the information in the IM and DM not the algorithm itself.
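The return values and algebraic properties described above can be sketched as follows. The attribute-dict representation and the toy matching rule are assumptions for illustration only; nothing here is a SACM-specified algorithm:

```python
from enum import Enum, auto

class MatchResult(Enum):
    """The set of return values enumerated above."""
    MATCH = auto()
    NO_MATCH = auto()
    INDETERMINATE = auto()
    INSUFFICIENT_INFORMATION = auto()
    ERROR = auto()

def match(a: dict, b: dict) -> MatchResult:
    """Toy matcher over attribute dicts; real algorithms are vendor-specific."""
    shared = set(a) & set(b)
    if not shared:
        return MatchResult.INSUFFICIENT_INFORMATION
    if all(a[k] == b[k] for k in shared):
        return MatchResult.MATCH
    return MatchResult.NO_MATCH

# Commutativity holds for this matcher: match(A, B) == match(B, A).
# Transitivity does not hold in general: A agrees with B, and B with C,
# yet A and C share no attributes at all.
A = {"mac": "00:11:22:33:44:55"}
B = {"mac": "00:11:22:33:44:55", "ip": "10.0.0.7"}
C = {"ip": "10.0.0.7"}
assert match(A, B) == match(B, A) == MatchResult.MATCH
assert match(B, C) == MatchResult.MATCH
assert match(A, C) == MatchResult.INSUFFICIENT_INFORMATION
```

The non-transitivity example also shows why such properties make a weak requirement: the "right" answer for Match(A, C) depends entirely on what data each component happened to collect.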

@adammontville
Contributor Author

I agree that the information used for matching needs to be standardized as part of the IM and related DMs. I'm not sure I agree that algorithm knowledge isn't important. We may not need to standardize the algorithms in this group, but I think we need to consider methods of identifying the preferred algorithms in the ecosystem. I think there may be more to matching than meets the eye, which makes it more interesting to me.

@henkbirkholz
Member

Is "the ecosystem" a specific SACM domain, or does it refer to architectural components in general?

Maybe the Information Model could elaborate on how to express custom, SACM-domain-specific, or even standardized (probably unlikely) labels for algorithms (or methods) used to create SACM content? If the exact "algorithm" has to be identified across SACM domains, a globally unique (and probably standardized) label is required, I think.

@adammontville
Contributor Author

I'm not sure how else to talk about the "ecosystem". Consider a reasonably sized enterprise that uses vulnerability assessment tools from Vendor A and configuration assessment tools from Vendor B. There may be one asset management component between them, which could alleviate the problem, but if there is no such component, and Vendor A's vulnerability assessment tool needs to communicate an asset list to Vendor B's configuration assessment tool, and vice versa, how is the enterprise to be assured that the way Vendor A's tool reconciles assets is identical to the way Vendor B's tool reconciles assets? Ideally there's a single asset management component in the ecosystem, but it seems plausible that there may not be.

This seems like a problem that shouldn't be hard to solve. I'm thinking that TLS has this sort of problem solved in the way it expresses cipher suites. Two disparate implementations can agree on a specific method of operation, because those methods are enumerated and well-defined.

@henkbirkholz
Member

This is how I understand it: at least inside a SACM domain there has to be a way for two SACM components from two different vendors to agree not only upon a SACM data model for transport encoding of SACM statements (a capability), but also to align on a shared understanding of how the actual content of that payload was created (another capability, probably a method)?

Yes, TLS solved this via a list of globally unique standardized labels: https://www.iana.org/assignments/tls-parameters/tls-parameters.xhtml#tls-parameters-4 (which could be called a "third component between them"). That's certainly one way to approach this, but I am not sure whether this approach is in scope for SACM?

@ncamwing
Contributor

I can potentially see the need for SACM to "name" the data models, and I agree with Jim that SACM needs a good IM (and DM) so that SACM components can discern the information afforded (by the DM). That said, I still do not see how an "algorithm" comes into play; rather, SACM has a well-understood information model from which SACM components can extract what they are looking for. Where I see the "naming" of DMs coming from is the group's understanding that there may be different DMs in use (but they all have to map to the IM if they are to be SACM compliant). We have requirements to allow for discovery of this, which I guess implies we are labeling the DMs.

@jimsch
Contributor

jimsch commented Jan 20, 2016

So IM-004 says that DMs have to be identifiable. That should address that need.

Adam, I am unsure of what the problem is that you are describing. Are you saying that Vendor A and Vendor B are going to assign different asset identifiers to the same asset, and that without some type of help it will be impossible to know that they are the same asset?

I have much larger worries than this: are they using the same DMs, and is there a converter between the two models in the system? Based on what I expect to see, it should be easy to enumerate all of the ids assigned by Vendor A and all of those assigned by Vendor B. The trick is then to try to figure out whether the asset ids refer to the same item. This will depend on the traits of the two assets. If the vendors do not populate their view of the asset with enough data for the matching to work, then you are going to be out of luck. This is a function of the IM being sufficiently complete in describing each of the different asset classes and specifying the correct minimal set of items. This is an IM problem and not a requirements problem.

If the IM has an asset ID field and it is filled in differently by the two different vendors, then there is a different problem. The IM, and thus the matching algorithm, would need to make statements about the uniqueness of a value for a specific asset. But again, these are statements about the IM/DM and not about the matching algorithm.

Clarifications would be appreciated. I also do not understand why the TLS reference makes any sense here. When looking at it I would call this the definition of a field in the IM about a specific attribute. Again not related to a matching algorithm.
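The enumerate-then-compare approach described above could look something like this sketch. The attribute names, the internal-index keys, and the agreement threshold are all hypothetical choices for illustration, not anything the IM or any DM defines:

```python
def reconcile(vendor_a: dict, vendor_b: dict, min_shared: int = 2) -> list:
    """Pair up vendor-local asset ids whose attribute views agree.

    A pair is emitted only when the two views share at least `min_shared`
    attributes and none of the shared attributes conflict. The threshold
    is an illustrative assumption, not a SACM requirement.
    """
    pairs = []
    for id_a, attrs_a in vendor_a.items():
        for id_b, attrs_b in vendor_b.items():
            shared = set(attrs_a) & set(attrs_b)
            agree = [k for k in shared if attrs_a[k] == attrs_b[k]]
            if len(agree) >= min_shared and len(agree) == len(shared):
                pairs.append((id_a, id_b))
    return pairs

# Each vendor keys assets by its own internal database index.
vendor_a = {1: {"mac": "00:11:22:33:44:55", "ip": "10.0.0.7"},
            2: {"mac": "66:77:88:99:aa:bb"}}
vendor_b = {"x": {"mac": "00:11:22:33:44:55", "ip": "10.0.0.7"},
            "y": {"ip": "10.0.0.9"}}
```

Here `reconcile(vendor_a, vendor_b)` pairs only asset 1 with asset "x"; vendor A's asset 2 carries too little data to match anything, which is exactly the "out of luck" case when a view is not populated with enough data.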

@adammontville
Contributor Author

Maybe I'm not taking in the view from the proper perspective. Maybe "algorithm" is not the right word either. As you mention, the trick is to figure out whether two distinct identifiers reference the same item - a "same as" relationship. If conditions for establishing this "same as" relationship must exist solely in the IM, and if these conditions are already known to be required, then my concern is satisfied.

@llorenzin

Adam, I understand your concern to be: how does one SACM component ensure that (an identity it constructed to refer to a particular endpoint) matches (an identity another SACM component constructed to refer to the same endpoint) and not (an identity another SACM component constructed to refer to a different endpoint). I believe that this is absolutely critical for SACM!
As I understand the differences between IM and DM, it appears to me that this is a requirement on any DM(s), not on the IM. Our hope was that OP-007 Data Abstraction combined with DM-007 Data Source would be sufficient. I note that OP-007 contains a MUST for the data model and a SHOULD for the interfaces, so perhaps it would be helpful to break OP-007 into two components - retain the SHOULD for the interfaces in OP-007, and create a new DM-00x with the MUST for the data model(s). Would that help?

@adammontville
Contributor Author

Yes, I think it would (as a contributor).

@jimsch
Contributor

jimsch commented Jan 27, 2016

I would disagree that this is a DM requirement. I believe that two different SACM components can and will construct different identities for the same endpoint (assuming that there is some type of identity field in the IM/DM). I cannot see this as being anything that is enforceable on any component. What if the type of identity that I assign is an index in my internal database? My internal database is not going to be the same as your database. What there needs to be is a sufficiently rich set of attributes so that a matching algorithm can make some statements about the probability of two sets of attributes referring to the same endpoint.

I do not think that a new requirement helps this in any way.

@adammontville
Contributor Author

If all you assign is an internal DB index, then you've got insufficient data to discover a "same as" relationship. We need to support reasonably deterministic endpoint identity matching, wherever that needs to be articulated.

@jimsch
Contributor

jimsch commented Jan 28, 2016

Does this mean that you are advocating that a unique identifier be assigned to every endpoint, and that every system creating data for SACM needs to be able to determine that identifier, even if it never talks to the endpoint? I think this is unrealistic. (It would also eliminate the need for a matching algorithm.) I expect that systems will need to look at attributes of the endpoint, such as what network it is on and what its IP and MAC addresses are, in order to make such a determination. If there is a requirement for a single global identifier for every endpoint, that is a new requirement and needs to be discussed and documented.

@adammontville
Contributor Author

No, that's not at all what I'm advocating. Did you perhaps misread "insufficient" as "sufficient"?

I want to ensure that our requirements don't make it hard for disparate implementations to reconcile endpoint data between each other. Wherever this sort of thing is best described in our documentation, is fine by me.

Lisa had suggested that these were aligned with DM requirements, and you disagreed. This is about where the conversation stands at this point, and I'd like to try to wrap it up so we can move ahead. As I said before, I think we need to support reasonably deterministic endpoint identity matching. I think Lisa agreed that this would be critical for SACM.

Further, the points you made in your last comment are those that I believe support the need for us to say as much as possible when it comes to how endpoints can be matched. You mention all the attributes we can look at and use to infer the match. What are those relationships? How many valid assertions do we need before we are confident in the match? Which assertions matter?

@GunnarEngelbach

Take CVSS as an example.

https://nvd.nist.gov/cvss.cfm

It's a list of characteristics about a vulnerability that determine its severity. But it's also a very specific formula that has to be applied to those characteristics in order for the resulting score to have any meaning.

If there were no formula associated with the standard, then the scores calculated by two different implementations would likely not match.
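As a concrete reminder of what that fixed formula looks like, here is a transcription of the published CVSS v2 base-score equations and lookup constants (the vector and metric values below are the standard v2 definitions, not anything SACM-specific):

```python
def cvss2_base(av, ac, au, c, i, a):
    """CVSS v2 base score, per the published first.org v2 equations."""
    impact = 10.41 * (1 - (1 - c) * (1 - i) * (1 - a))
    exploitability = 20 * av * ac * au
    f = 0.0 if impact == 0 else 1.176
    return round(((0.6 * impact) + (0.4 * exploitability) - 1.5) * f, 1)

# Vector AV:N/AC:L/Au:N/C:C/I:C/A:C, using the published v2 constants:
# AV Network = 1.0, AC Low = 0.71, Au None = 0.704, C/I/A Complete = 0.66.
score = cvss2_base(av=1.0, ac=0.71, au=0.704, c=0.66, i=0.66, a=0.66)
```

Because the formula and constants are pinned down by the standard, any two correct implementations compute 10.0 for this vector; that reproducibility is exactly what is missing if only the characteristics, and not the formula, are standardized.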

Determining whether or not two different sets of attributes refer to the same endpoint is a much more complicated algorithm, particularly given that there's no guarantee both sets have all the same attributes, or that all of the values are the same even when they really do refer to the same endpoint. It's a difficult and non-obvious problem, which is why the formula for comparing those attributes needs to be part of the standard; otherwise the results will be inconsistent.

Note that, for the purposes of SACM, the identifier for any endpoint would need to be the set of identity attributes. For convenience/human readability an implementation could choose to assign some other identifier referencing the record for that endpoint in the datastore, but such an identifier would only be usable in SACM if SACM also defined the full datastore.
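The incomplete-and-conflicting-attributes problem described above suggests weighted scoring rather than exact comparison. The attribute names, weights, and threshold here are invented purely for illustration; a real standard would have to pin these down:

```python
# Hypothetical evidence weights: a MAC address is stronger identity
# evidence than an IP address, which may change under DHCP.
WEIGHTS = {"mac": 0.9, "hostname": 0.6, "ip": 0.3}

def same_endpoint_score(a: dict, b: dict) -> float:
    """Weighted fraction of shared attributes that agree.

    Attributes missing from either side contribute nothing, so two
    views with different attribute coverage can still be compared.
    """
    shared = set(a) & set(b) & set(WEIGHTS)
    if not shared:
        return 0.0
    total = sum(WEIGHTS[k] for k in shared)
    agree = sum(WEIGHTS[k] for k in shared if a[k] == b[k])
    return agree / total

view_1 = {"mac": "00:11:22:33:44:55", "ip": "10.0.0.7"}
view_2 = {"mac": "00:11:22:33:44:55", "ip": "10.0.0.9", "hostname": "web01"}
score = same_endpoint_score(view_1, view_2)  # 0.9 / 1.2 = 0.75
```

With a (hypothetical) decision threshold of 0.7, the two views would be judged the same endpoint despite the differing IP, illustrating how values can differ even when both sets really do refer to the same endpoint, and why two implementations need the same weights and threshold to reach consistent answers.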

@jimsch
Contributor

jimsch commented Jan 28, 2016

Adam, If there is a requirement here, and I am not convinced that there is, then it would appear to be a requirement on the IM and not the DM. The requirement would run along the lines of - All of the elements of the SACM standard endpoint matching algorithm MUST be in the IM. I think that this would be a real problem for the WG if it was not the case and is therefore not needed.

Gunnar, Does this mean that as a vendor I cannot define additional attributes in my DM which would allow me to produce a better result than the standard SACM matching algorithm? This would violate the apparent requirement that you have stated for reproducibility. My vendor specific algorithm would not match the standard SACM algorithm in this case.

Based on what you are saying, there needs to be a milestone for an endpoint matching algorithm draft. This might be part of the IM document, but it would probably be better starting separately.

@GunnarEngelbach

Jim,

I think allowance has to be made for both. A standard algorithm is necessary for interoperability, but implementors need to have the option to use proprietary methods to account for usages not covered by the standard algorithm, because the standard isn't as likely to keep pace with changes, because the standard algorithm might be incorrect in some edge cases, etc.

An alternative would be that attribute-based endpoint matching be an MTI service available as part of a SACM deployment. Id est, some node in the SACM infrastructure provides a public service of indicating whether or not two sets of endpoint attributes refer to the same endpoint or not. But that has some serious drawbacks as well.

@adammontville
Contributor Author

Per VI.

6 participants