Handle semicolon after colon #13

Open
wvengen opened this issue Oct 1, 2018 · 7 comments
Labels
bug:parsing Something is not parsed or parsed incorrectly parser:strict Affects the strict parser

Comments

@wvengen
Member

wvengen commented Oct 1, 2018

An ingredients list like "Schokolade (Süßungsmittel: Maltit; Kakaobutter, Kakaomasse)" contains mixed separators (; and ,). Here the semicolon is used to indicate the end of the second-level nesting for Maltit.

@wvengen wvengen added parser:strict Affects the strict parser bug:parsing Something is not parsed or parsed incorrectly labels Oct 1, 2018
@wvengen
Member Author

wvengen commented Oct 1, 2018

This is not really an issue for the loose parser, which handles each separator as equal. But if the need arises, it could be implemented there as well.
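To illustrate why the loose approach is unaffected, here is a minimal sketch in plain Ruby. It is not the actual loose parser; `loose_split` is a hypothetical helper that only shows what "each separator as equal" means for the example above.

```ruby
# Simplified illustration only, not the real loose parser.
# loose_split is a hypothetical helper name.
def loose_split(text)
  # ',' and ';' are treated as interchangeable separators
  text.split(/\s*[,;]\s*/).reject(&:empty?)
end

p loose_split("Süßungsmittel: Maltit; Kakaobutter, Kakaomasse")
```

With equal separators the semicolon carries no extra meaning, so the nesting information it encodes is simply not used.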

@wvengen
Member Author

wvengen commented Nov 18, 2019

The semicolon often carries meaning, e.g. in

glucosesiroop, suiker, water, gemodificeerd zetmeel, gelatine (rund), vitamine A, vitamine C, vitamine D3, vitamine E, vitamine B6, foliumzuur, vitamine B12, biotine, pantotheenzuur, kaliumjodide, zinkcitraat, magnesiumoxide, zuurteregelaar: citroenzuur; kleurstoffen: curcumine, anthocyanen (vlierbes); natuurlijke aroma’s: sinaasappel, kers, citroen; glansmiddel: carnaubawas; plantaardige olie: kokosnootolie (Cocos nucifera L.); emulgatoren: mono- en diglyceriden van vetzuren, citroenzuuresters van mono- en diglyceriden van vetzuren; maltodextrine.

Here the semicolon ends a list after a colon.
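The intended reading can be sketched in plain Ruby. This is only an illustration under assumptions, not the Treetop grammar of the strict parser: `split_top_level` and `parse_coloned` are hypothetical names, a colon anywhere in an item is taken to start a coloned list, and amounts and deeper nesting are ignored.

```ruby
# Split on ',' and ';' outside parentheses, remembering which
# separator followed each item (nil for the last item).
def split_top_level(text)
  items = []
  current = +""
  depth = 0
  text.each_char do |ch|
    depth += 1 if ch == "("
    depth -= 1 if ch == ")"
    if depth.zero? && [",", ";"].include?(ch)
      items << [current.strip, ch]
      current = +""
    else
      current << ch
    end
  end
  items << [current.strip, nil] unless current.strip.empty?
  items
end

# Group items so that "name: a, b;" becomes one nested entry.
# A ';' closes the currently open coloned list.
def parse_coloned(text)
  result = []
  group = nil # currently open coloned list, if any
  split_top_level(text.sub(/\.\s*\z/, "")).each do |item, sep|
    if item.include?(":")
      name, first = item.split(":", 2).map(&:strip)
      group = { name: name, contains: [first] }
      result << group
    elsif group
      group[:contains] << item
    else
      result << { name: item }
    end
    group = nil if sep == ";"
  end
  result
end

p parse_coloned("zuurteregelaar: citroenzuur; kleurstoffen: curcumine, anthocyanen (vlierbes); maltodextrine.")
```

In this sketch, `maltodextrine` ends up as a top-level ingredient because the preceding semicolon closed the `kleurstoffen` list.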

@wvengen
Member Author

wvengen commented Nov 12, 2020

Another example, where it also ends the list after a colon.

Water; plantaardige oliën (zonnebloem 15,2%, raapzaad 6%, lijnzaad 4,8%, palm, palmpit, geheel geharde palmpit, geheel geharde palm); mineraal: calciumzouten van orthofosforzuur; gemodificeerd maïszetmeel; palmstearine; emulgatoren: E471 (niet dierlijk) en zonnebloemlecithine; zout 0,2%; conserveermiddel: E202; voedingszuur: citroenzuur; antioxidant: E385; aroma; vitaminen: A, thiamine (B1), riboflavine (B2), B6, foliumzuur (B11), B12 en D2; kleurstof: carotenen

@wvengen
Member Author

wvengen commented Jun 17, 2024

Ok, I have something that seems to work ...

rule list
  # ...
  contains:( ( (ingredient ws* ',' ws*)* ingredient_coloned )+ ( ws* ingredient (ws* ',' ws* ingredient)* ) ) <ListNode>
  # ...
end

rule ingredient_coloned_inner_list
  # ...
  contains:( ingredient_coloned_simple_with_amount_and_nest ( ws* ',' ws* ingredient_coloned_simple_with_amount_and_nest )* ';' ) <ListNode>
end

@wvengen
Member Author

wvengen commented Jun 17, 2024

This seems to tackle it!
An ingredient listing like

Ingrediënten: mineraalwater, suiker, citroensap uit concentraat, aardbeiensap uit concentraat, smaakversterker: erythritol, natuurlijk aroma, zoetstof: steviolglycosiden; vitaminen: Vitamine B6, Vitamine B12.

used to put everything after ; in the notes, but it is properly parsed with this change!
Update: actually, this is a somewhat malformed line, as some coloned ingredients end with a comma and others with a semicolon. In this instance, one can understand that smaakversterker: erythritol is one nested ingredient, and natuurlijk aroma the next.

@wvengen
Member Author

wvengen commented Jun 17, 2024

Still having trouble parsing an ingredient list with a nesting IngredientColoned ending with a non-nested ingredient.

wvengen added a commit that referenced this issue Jun 19, 2024
Many cases are handled, but not yet:
- coloned list ending with a '.' (at the end of an ingredient listing)
- list having both a coloned list ending with ';' and one ending with ','
- ';'-separated lists (instead of regular comma-separated lists)
@wvengen
Member Author

wvengen commented Jun 19, 2024

Commit a4ca35c handles most cases. Pending:

  • coloned list ending with a '.' (at the end of an ingredient listing)
  • list having both a coloned list ending with ';' and one ending with ','
  • ';'-separated lists (instead of regular comma-separated lists)
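For the last pending case, one possible first step (an assumption, not something from the commit) would be to guess which separator a listing uses at the top level. The sketch below uses hypothetical helper names and counts separators outside parentheses. Decimal commas such as `0,2%` are counted as commas, so this is a rough heuristic only.

```ruby
# Count ',' and ';' that occur outside parentheses.
def top_level_separator_counts(text)
  counts = { "," => 0, ";" => 0 }
  depth = 0
  text.each_char do |ch|
    case ch
    when "("
      depth += 1
    when ")"
      depth -= 1
    when ",", ";"
      counts[ch] += 1 if depth.zero?
    end
  end
  counts
end

# Guess that ';' is the main separator when it outnumbers ','
# at the top level (hypothetical heuristic).
def semicolon_separated?(text)
  counts = top_level_separator_counts(text)
  counts[";"] > counts[","]
end

p semicolon_separated?("Water; palmstearine; zout 0,2%; aroma")
```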
