New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple-paragraph subsections and no-text subsections are handled incorrectly #113

Closed
waldoj opened this Issue Aug 1, 2012 · 39 comments

Comments

Projects
None yet
3 participants
@waldoj
Member

waldoj commented Aug 1, 2012

When a subsection spans paragraphs, only the first paragraph is indented (e.g.). Indent all paragraphs properly. Also, subsections that have no text themselves, only children, are not displayed at all.

@twneale

This comment has been minimized.

twneale commented Aug 28, 2012

Also permalink each paragraph in the subsection, or maybe move the ¶ to the end of the last paragraph?

@waldoj

This comment has been minimized.

Member

waldoj commented Aug 29, 2012

I've wondered about permalinking each paragraph in the subsection, but it would require inventing a citation method other than the actual citation method. That is, section 10-2, subsection A3b has a URL of /10-2/#A3b. But to link to the second paragraph of A3b would mean that the URL would depart from the actual citation scheme, which makes me nervous.

Right now I'm not really sure why the pilcrow is showing up at the end of the first paragraph. :) Debugging mentally, from memory, it seems like it ought to show up at the end of the last paragraph of the subsection. I'm going to tear that apart this week, though, and figure out what's going on there. My fear is that the parser is storing paragraphs, rather than subsections, which would be a mess.

@twneale

This comment has been minimized.

twneale commented Aug 29, 2012

Statutes are so hard to work with...I give you insane props for undertaking this.

@waldoj

This comment has been minimized.

Member

waldoj commented Aug 29, 2012

I'm only doing it because I don't know better. :)

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 6, 2012

The problem here lies within the parser, unfortunately, rather than the renderer. In the example section, there are two problems with B(7) that can be observed in the database.

  1. B(7)a is not recorded as such. It's just B(7), with the text stored as "a. Any sale of its securities by an..." That is, "a. " is not being recognized as a subsection prefix.
  2. The second paragraph of B(7)b isn't being included with B(7)b. It's recorded properly in the text table, but it has no record in the text_sections table.
@waldoj

This comment has been minimized.

Member

waldoj commented Sep 6, 2012

To solve the problem of having following paragraphs included, the foreach ($prefix_candidates as $prefix) loop within Parser::parse (line 294) needs to be modified. When a prefix cannot be identified for a section, then the prefix assigned to the prior section should be applied the instant one.

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 6, 2012

I'm pretty sure that the problem of B(7)a is that the parser is written expecting there to be a "B" section, inside of which is a "7" section, inside of which is an "a" section. Instead, there's a "B" section, inside of which is a "7a" section. That is, in the SGML, it looks like this:

7. a. Any sale of its securities by an issuer or any sale of securities by a registered
broker-dealer and its registered agent acting on behalf of an issuer if, after the sale,
such issuer has not more than 35 security holders, and if its securities have not been
offered to the general public by advertisement or solicitation; or

The parser assumes that it could only look like this:

7. Text text text, text text.

a. Any sale of its securities by an issuer or any sale of securities by a registered
broker-dealer and its registered agent acting on behalf of an issuer if, after the sale,
such issuer has not more than 35 security holders, and if its securities have not been
offered to the general public by advertisement or solicitation; or

The Florida Statutes have the same problem, too.

This is going to require some cogitating to fix. I'll sleep on it.

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 6, 2012

The solution to the "7a" problem is, I think, as follows:

  1. Hack off the first 12 characters.
  2. Check whether two or more sets of ". " are found in that string.
  3. If so, erase everything after the last ". ".
  4. Of what remains, break it into an array based on periods and spaces.
  5. Validate each element, sequentially, against $prefix_candidates. Break when a match cannot be made.

Also, don't forget to deal with the question of whether we need to store an entry for just "7" or if it's OK to just have a "7a" entry.

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 7, 2012

Note that part of the problem here arises from Virginia's use of a <section> SGML tag. That tag doesn't enclose sections, but instead paragraphs. Two consecutive paragraphs in a single section are both wrapped in <section> tags.

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 7, 2012

The "7b" problem has been solved, which has created the new problem of the duplicate display of the section prefix and the anchor. The Law::get_law() method prepends every section with its prefix and appends to every section its anchor. So we've now got two consecutive paragraphs, both labeled "b." and both with a #B7b anchor.

It's not clear to me whether this is a problem that should be solved in the parser (should multiple-paragraph sections be detected and stored differently?) or in Law::get_law() (should multiple-paragraph sections be detected and stored differently?)

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 10, 2012

Let's think through what the ideal behavior is.

It seems to me that, ideally, the first paragraph would be prepended with the section identifier, and the final paragraph would be followed by the section-identifying pilcrow. Since each section is enclosed within <section> tags, it it follows that each paragraph in a multi-paragraph section should be enclosed within <p> tags.

How do we make this happen?

Well, we already put every paragraph inside of both tags, so it should be straightforward to add additional paragraphs within each section. The logic of multiple paragraphs belonging to a single section belongs in the class that assembles the text, not in the page display. That's where I'll start.

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 10, 2012

Within law.php, we were using $section to describe each paragraph, which led to some semantic confusion. I changed every use of the variable to $paragraph, which should make the real functionality a bit more obvious.

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 10, 2012

The B(7)b problem is solved. It involved a series of conditionals in law.php, checking to see whether the current section identifier is the same as the prior or the next one. Those control when to start and end <section> tags.

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 10, 2012

The B(7)a problem might be easy to fix in theory, but in practice it's non-trivial. The portion of Law::parse that identifies and stores section identifiers is pretty complex. No more complex than is necessary, I don't think, but it's really premised on the idea that each section will have just a single identifier.

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 10, 2012

In the Code of Virginia, there are four instances of three-level-deep section identifiers, e.g. 38.2-1705 E(1)a. So two levels isn't enough.

Notably, the following PCRE yielded no false positives:

<section>\r([0-9a-z]{1})\. ([0-9a-z]{1})\. 

Apparently we can quite safely extract section identifiers in this format, though it'll need to be tested with optional parentheses on either side of each identifier and making the period optional. By expanding that PCRE to three segments, and making the latter two optional, we should be able to get a solid list of section identifiers for every section.

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 10, 2012

This is starting to get there:

<section>\r\(?([0-9a-z]{1})\)?\.? \(?([0-9a-z]{1})(\)|\.)+

What I really want, for that second stanza, is to say "this whole portion is optional, but if it does appear, then it needs to contain either the paren or a period." That's sure to be possible in PCRE, but I've never had to do that. Ditto for the third stanza that needs to be added, too.

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 10, 2012

Oh, hey, nested parens are a thing! This seems to work!

<section>\r\(?([0-9a-z]{1})\)?\.? (\(?([0-9a-z]{1})(\)|\.)+)? (\(?([0-9a-z]{1})(\)|\.)+)?
@waldoj

This comment has been minimized.

Member

waldoj commented Sep 11, 2012

Don't forget to resolve the problem of per-section vs. per-paragraph links! As things stand, there are links on every paragraph, which means duplicate anchors in multi-paragraph sections.

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 11, 2012

This is 90% worked out. The one sticking point—and this is a big one—is the question of what to do with blank sections. We can now separate out the components of B(7)a, but how are we to store it? An entry in text_sections requires an entry in text. An entry in text requires an entry in laws.

At the moment, my best idea is to establish a blank entry in text for B(7). I worry about what sort of trouble that could create, but I can't come up with a better system for this. law.php would need to check whether an ostensible section contains any content at all. Presumably those that don't would all be these sections that exist solely as containers. Semantically, I think this is accurate. It is a section without any text. Right?

@twneale

This comment has been minimized.

twneale commented Sep 12, 2012

For what it's worth, I agree--the nested enumerations just don't have any
content. I like to think as a statute as a tree of nodes, and in that
sense, a section without any content may not be such a weird thing. I'm
with you on it's semantic accurateness, and from there what remains is just
the PITA of figuring out how to store and re-display it. Whatever you
manage to come up with will probably feel a tad unclean, but I've seen this
same phenomenon in New York and in statutes of a few other states too.

On Tue, Sep 11, 2012 at 5:29 PM, Waldo Jaquith notifications@github.comwrote:

This is 90% worked out. The one sticking point—and this is a big one—is
the question of what to do with blank sections. We can now separate out the
components of B(7)a, but how are we to store it? An entry in text_sectionsrequires an entry in
text. An entry in text requires an entry in laws.

At the moment, my best idea is to establish a blank entry in text for
B(7). I worry about what sort of trouble that could create, but I can't
come up with a better system for this. law.php would need to check whether
an ostensible section contains any content at all. Presumably those that
don't would all be these sections that exist solely as containers.
Semantically, I think this is accurate. It is a section without any text.
Right?


Reply to this email directly or view it on GitHubhttps://github.com/waldoj/statedecoded/issues/113#issuecomment-8473458.

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 12, 2012

Thanks for the reassurance, Thom. :) I've been chewing this over all evening, and I do think that this is (at worst) the best bad option. At best, it might actually be a pretty sensible approach.

You're a bold man, not unsubscribing from this thread (with e-mailed updates!), as I've documented the slow unravelling of my sanity in dealing with this problem. :)

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 14, 2012

I'm closing in. This has been pretty unpleasant. I now have the parser identifying multi-prefix paragraphs and creating blank paragraphs to fill in those gaps, but there are some small bugs associated with it that need to be addressed.

@waldoj

This comment has been minimized.

Member

waldoj commented Sep 14, 2012

I initially assigned this issue to v0.4 because of its urgency. Because dealing with this still-unresolved issue has made the release of v0.4 now two weeks overdue, and because it doesn't actually apply to this release (which is for enhancements to the dictionary system). Perhaps most important, the problem itself is substantially—but not entirely—one that affects only the Code of Virginia. It's become an irrelevant obstacle to the v0.4 release. As a result, I have reassigned this to v0.5. I will pause working on this just long enough to package up and release v0.4, and then return to it.

@twneale

This comment has been minimized.

twneale commented Dec 19, 2012

Hmmm, the bug says this issue is closed, but your tweet said it's unresolved, so here goes nothing...

I was just staring at the example section (http://vacode.org/13.1-514/) and had the standard horrifying realization when dealing with statutes--namely, that it appears the unenumerated paragraph following (B)(7)(b) may not be part of paragraph (b) after all. Looking at (b) together with (a) I noticed that they're part of the same disjunctive statement (joined with or, and b provides the second part of the or). The unenumerated paragraph begins a new sentence and also begins "[w]ith respect to this subdivision 7" and not "with...paragraph b." So it could apply to both (a) and (b) in subdivision 7, and it's content seems at least to suggest that. So what I'm suggesting is that the last paragraph may be tail text (to borrow xml's parlance) on subdivision 7.

I mean, this is splitting hairs for sure, but sometimes it's possible to look at the print version and the indentation will solve the riddle. In the US Code, you can look at the typesetters' instructions to determine the level of indentation and infer is position in the hierarchy, but here it looks like their sgml dump doesn't give you much information to go on.

It depends how precise you want to be, I guess. For a free tool, exact precision may not be realistic, but if these occurrences are rare, maybe you could code up a way to flag the ambiguity and just add some manual data to override the imported data. We have to do that a lot in Open States with all the weird crap we get from states.

@waldoj

This comment has been minimized.

Member

waldoj commented Dec 19, 2012

Just a quick note—it is open, it's just that I referred to this in a duplicate issue that I closed, which is why it says "Closed" right above your comment.

Now back to reading your comment which, on a quick skim, contained alarming suggestions. :)

@waldoj

This comment has been minimized.

Member

waldoj commented Dec 19, 2012

You're right about this, Thom. (This is precisely the sort of mistake that I tend to make, incidentally—making a change in light of one requirement, then changing it later in light of a conflicting requirement, not realizing that I have undone my first change.) I don't think that the concern that you've raised negates the need to resolve this issue, since both Virginia and Florida are known to have this problem, but it does raise the possibility that making this change may yield an entirely new problem. :) That's something that I'll be sure to bug-test for thoroughly.

I'm hopeful that #167, due out in the next release, will push this problem out of The State Decoded's scope.

@waldoj

This comment has been minimized.

Member

waldoj commented Feb 7, 2013

I do believe that this is conceptually solved, with #167. I'm not going to put this release on hold to determine that, but this has the happy effect of pushing the problem back on the legal entity's code. Well-formed XML will be stored and rendered properly.

@waldoj

This comment has been minimized.

Member

waldoj commented Mar 1, 2013

There remains just one thing to check on: Does the parser store a blank subsection, and is that then rendered properly?

waldoj added a commit that referenced this issue Mar 22, 2013

@waldoj

This comment has been minimized.

Member

waldoj commented Mar 22, 2013

Nope, the parser wasn't storing blank subsections. But it should be now.

Now to go through the renderer and make sure that it handles blank subsections reasonably.

@waldoj

This comment has been minimized.

Member

waldoj commented Mar 22, 2013

With the important caveat that I still need to test this, I feel pretty good about it.

@waldoj

This comment has been minimized.

Member

waldoj commented Mar 22, 2013

There's another class of (potential) problem that needs to be addressed. This example is a use case that was planned for:

<section prefix="B">Any person who knows that a false statement has been made in writing
concerning the financial condition or ability to pay of himself or of any person for
whom he is acting, or any firm or corporation in which he is interested or for which he
is acting and who, with intent to defraud, procures, upon the faith thereof.

    <section prefix="i" type="table">
    +------------------+--------------------+
    | PERSON           | STATEMENT          |
    +------------------+--------------------+
    | any              | false              |
    | no               | true               |
    +------------------+--------------------+
    </section>

For his own benefit, or for the benefit of the person, firm or corporation in which he
is interested or for which he is acting, any such delivery, payment, loan, credit,
extension, discount making, acceptance, sale or endorsement, shall, if the value of the
thing or the amount of the loan, credit or benefit obtained is $200 or more, be guilty
of grand larceny or, if the value is less than $200, be guilty of petit
larceny.</section>

But I need to check on how that is stored, and how it's rendered. I think it will be stored as:

  • A
  • B
    • Bi
  • B
  • C

And that's fine, but the renderer needs to deal with that properly, recognizing that the second B section is a continuation of the first and styling it appropriately.

@waldoj

This comment has been minimized.

Member

waldoj commented Jun 13, 2013

It is not rendered like that, as it turns out. The whole of B appears, and then Bi:

  • A
  • B + B
    • Bi
  • C

I have seen this happen on the official Code of Virginia site, and now I know what that's about.

@waldoj

This comment has been minimized.

Member

waldoj commented Jun 13, 2013

It turns out that this is just how the XML is being parsed by SimpleXMLElement. And perhaps quite rightly—I'm not sure, because my experience with XML is too limited.

How should this XML be parsed?

<section prefix="IV">
    The vehicle shall be used by an employee whose duties are routinely related to public safety or response to life-threatening situations:
    <section prefix="a">
        A law-enforcement officer, with general or limited police powers;
    </section>
    except as otherwise defined.
</section>

As:

IV. The vehicle shall be used by an employee whose duties are routinely related to public safety or response to life-threatening situations:except as otherwise defined.

a. A law-enforcement officer, with general or limited police powers;

Or as:

IV. The vehicle shall be used by an employee whose duties are routinely related to public safety or response to life-threatening situations

a. A law-enforcement officer, with general or limited police powers;

IV. except as otherwise defined.

Or as:

IV. The vehicle shall be used by an employee whose duties are routinely related to public safety or response to life-threatening situations

a. A law-enforcement officer, with general or limited police powers;

except as otherwise defined.

Or something else entirely?

@waldoj

This comment has been minimized.

Member

waldoj commented Jun 14, 2013

I cannot identify a method of accomplishing this with SimpleXML.

@waldoj

This comment has been minimized.

Member

waldoj commented Jun 14, 2013

Kicking this along to v0.8.

@twneale

This comment has been minimized.

twneale commented Jun 14, 2013

I'm late to the party, but I think the last example is correct, so that the snippet has a structure like this:

--(IV)
  |
  \--The vehicle shall be used by an employee whose duties are routinely related to public safety or response to life-threatening situations:
  |  |
  |  \--(a)
  |      |
  |      \--A law-enforcement officer, with general or limited police powers; 
  \--except as otherwise defined.
@waldoj

This comment has been minimized.

Member

waldoj commented Jun 14, 2013

Oh, I'm kicking it along, but still working on it actively. :) I just don't want it to be a blocker to the very-nearly-finished v0.7. I'm glad you, too, think that it the last example is correct, because that's precisely how I expect (and need) it to come out. This morning I posted to a regional tech mailing list about this, and there's a useful discussion unfolding there. There seems to to be a consensus that this is valid XML, if non-ideal, and @fj suggests that rendering this as a DOM may well yield something closer to that last example.

I wish I had a sense of how common that this is in legal XML documents. Anecdotally, I've seen it a bunch of times, both as XML and as a document structure that seems best represented like this, but I sure wish I had some data to figure out how much time it's worth spending on perfecting this.

@waldoj

This comment has been minimized.

Member

waldoj commented Jun 14, 2013

Apparently the problem here is SimpleXML:

Mixed content simply exceeds the bounds of what SimpleXML was designed to handle.

Time to find a new tool.

@waldoj

This comment has been minimized.

Member

waldoj commented Jun 14, 2013

I'm experimenting with PHP's DOMDocument. I'm not wildly confident that it'll solve the problem, but after a bit of experimentation, I think it may well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment