TG2-VALIDATION_GEOGRAPHY_CONSISTENT #95
Comments
Comment by John Wieczorek (@tucotuco) migrated from spreadsheet: |
This may not be the best example |
There has been discussion between Issues #95 and #139 - the wording converged over time such that the two tests appeared to be testing for the same thing. Discussion on Zoom has resulted in separating the two. #139 is testing individual terms for validity at that level - it looks at only one level in the hierarchy at a time and checks the validity of what is there at that level. |
The conclusion from the Zoom discussion 22/8/2022 with @chicoreus, @tucotuco and @ArthurChapman suggests that this test is more a test for consistency than ambiguity. My suggestion from 17/8/2022 was "...we get ambiguity when we have only two non-empty terms that conflict in some way". This same reasoning applies to #123. I have edited the specifications accordingly and would value a careful review of all relevant items. |
This misses the case "WA", which we had in the original discussion and the reason why we used "ambiguity" rather than "consistent" in the test name. The "WA" alone case does not have two terms with which to check consistency. It has one term where there are multiple different geographic entities it could refer to. That is purely ambiguous and not inconsistent. I think this is one of the tests we have gone in circles on, but I can no longer recall. If not, I think we are just about to. Is that a good indicator to separate the notions into two tests? The solutions for the two cases are distinct, so maybe it is a good idea. In the consistency case, something is definitively wrong and should be fixed. In the ambiguity case, something is missing and ought to be provided. |
#139 is currently testing that each NOT_EMPTY geography term has an unambiguous match at the same level in the source authority: ...COMPLIANT if the individual values of dwc:continent, dwc:country, dwc:countryCode, dwc:stateProvince, dwc:county, dwc:municipality can be unambiguously resolved from the bdq:sourceAuthority)... The example where we have only dwc:stateProvince="WA" will result in NOT_COMPLIANT if there is ambiguity.
#95 is currently testing for internal/input consistency between NOT_EMPTY input geography terms and the equivalents in the source authority, so the example above will be CONSISTENT. ...COMPLIANT if the combination of values of dwc:continent, dwc:country, dwc:countryCode, dwc:stateProvince, dwc:county, dwc:municipality are consistent with the bdq:sourceAuthority...
Are those tests what we require? Do the results of the white board examples still make sense for #95 and #139? Maybe I am (still) crazy.
| Continent | Country | State/Province | County | 95 | 139 | |
I maintain, as argued in the comments of test #139, that the test is problematic, less useful, and should be dropped. It is dubious whether there is any realistic way to make sure each name "matches" the level in the hierarchy where it is placed. The Darwin Core term names are a poor reflection of the impressive variety of terms for the actual administrative levels in the world.

What's really important is how many distinct geographic entities the geography combination corresponds to in the source authority. If there are zero matches, the input combination may have an error or the authority may be incomplete (or incorrect, such as out of date; there is no way to tell the difference, and it is potentially difficult to solve). Getting zero matches would alert the tester of a potential problem of a particular nature (unknown-ness). If there is one match, the input can be uniquely understood vis-à-vis the source authority (unambiguously understandable - nothing to solve). If there is more than one match, something is ambiguous (not enough information provided to make the distinction - usually relatively easy to solve).

Once one thinks about implementation, this test becomes a lot more complex than it appears on the surface. Begin with the results of a "simple" implementation that gets results for the number of exact string matches for the rightmost not empty entry in a row where a) every entry to the left of it in the row can also be found as an exact match parent somewhere in the hierarchy above it, and b) the same pattern holds true successively for every not empty field except continent, which has no parent in Darwin Core. There is no single API call against any existing service that will do this, though one is planned for BELS.

A slightly more sophisticated implementation would restrict the searches for values in the country field to record types (nation, dependent state, unincorporated territory, semi-independent political entity, etc.).
Given this more sophisticated implementation, the table below shows the results one would get today for the Getty Thesaurus of Geographic Names.
The thing is, the count as a result would make this a measure test, I think, and I don't think that is what we want. Does that mean we are forced into having two tests, one to see if there are any matches and one to see if there is more than one match? Seems wasteful in terms of processing. Needs more thought. Open to suggestions. |
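To make the "simple" implementation described in this comment concrete, here is a minimal sketch in Python. It is not an implementation against TGN or BELS: the authority is a hypothetical hand-made list of name paths, and `TOY_AUTHORITY`, `is_subsequence` and `count_matches` are names invented for illustration. It assumes that "every entry to the left can be found as a parent somewhere in the hierarchy above" can be read as "the non-empty values appear, in order, among the ancestors of the matched entity".

```python
from typing import Iterable, Sequence, Tuple

# Hypothetical toy authority (NOT real TGN data): each entity is the path of
# names from the top of the hierarchy down to the entity itself.
TOY_AUTHORITY: Sequence[Tuple[str, ...]] = [
    ("North America",),
    ("Oceania",),
    ("North America", "United States"),
    ("Oceania", "Australia"),
    ("North America", "United States", "WA"),
    ("Oceania", "Australia", "WA"),
    ("North America", "United States", "WA", "King"),
]


def is_subsequence(needles: Iterable[str], haystack: Iterable[str]) -> bool:
    """True if every needle occurs in haystack, in the same order."""
    remaining = iter(haystack)
    # `needle in remaining` consumes the iterator up to the first hit,
    # so later needles can only match later positions.
    return all(needle in remaining for needle in needles)


def count_matches(values: Sequence[str],
                  authority: Sequence[Tuple[str, ...]] = TOY_AUTHORITY) -> int:
    """Count entities matching the rightmost non-empty value.

    `values` is ordered from highest to lowest level, for example
    (continent, country, stateProvince, county). Empty strings are skipped.
    An entity counts when its own name equals the rightmost non-empty value
    exactly and all other non-empty values are among its ancestors, in order.
    """
    terms = [v for v in values if v]
    if not terms:
        return 0
    *parents, target = terms
    return sum(
        1
        for path in authority
        if path[-1] == target and is_subsequence(parents, path[:-1])
    )
```

With this toy data, `count_matches(("", "", "WA", ""))` finds two entities (the ambiguous case), `count_matches(("", "Australia", "WA", ""))` finds one, and `count_matches(("", "Canada", "WA", ""))` finds none. Restricting the country field to particular record types, as in the more sophisticated implementation, would need place-type information that this sketch does not model.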
Great summary, @tucotuco. I can see lots of value in an ideal world, but as you say, an implementation nightmare. Would there be value in just doing a few consistency checks: Country, State/Province and perhaps Municipality? Many countries use County consistently, others don't, and there is a lot of difference in what is meant by County in different countries, and it sometimes includes more than one level. ArcInfo decided to just use ADM0, ADM1 and ADM2 rather than labelling them - I know there are other levels, but basically they use just these three in much of the data. Would there be value in us keeping this test and just using these three levels for this test? |
@ArthurChapman First, I am not trying to get rid of this test. A functional implementation would be extremely useful. However, as to your proposals, there is no way to do consistency checks by admin level with TGN. The vocabulary for the feature types does not map uniquely to Darwin Core terms and, as you pointed out, the levels are arbitrary. Example, Brazil. If anyone uses a macroregion, the states get bumped down to level 2. If anyone uses a mesoregion or a microregion, counties and municipalities get bumped down for each of those. So a "county level" entity in Brazil could be at any of three different depths in the hierarchy, one of which could not even be captured in Darwin Core outside of dwc:higherGeography. Brazil is not unique in this phenomenon. By the way, for posterity, Julian Kapoor, working with Robert Hijmans on GADM under the Biogeomancer Project, assembled this list of administrative level terms: Single Administrative area, Administrative county, Administrative Region, Aimag, Amt, Aprinki, Apskritis, Area, Arrondissement, Arrondissements, Arrondissment, Atoll, Autonomou, Autonomous city, Autonomous Commune, Autonomous Community, Autonomous Island, autonomous province, Autonomous Region, Autonomous Republic, Autonomous sector, Avtonomiuri respublika, Avtonomnaya oblast, Avtonomnyy okrug, Aymag, Baladiyah, (Banner), (Barangay), (Barony), Bibhag, Borough, Bundeslander, Canton, Capital city, Capital district, Capital Metropolitan City, Capital region, Capital Territory, Capitale d'état - zone spéciale, Castello, Census Area, Census Division, Centrally Administered Area, Cercle, Chantun, Chuan-shih, Circle, City, City and Borough, City and County, City Municipality, City/Municipality, Ciudades autónomas, Comarca, Comisaría, Commissiary, Commonwealth, Commune, Commune Autonome, (Community), Comuna, Comunidad Autónoma, Comunidad autónomas, Concelho, Constituen y, Constituency, Corregimiento, Corregimiento de, Country, County, (Crown Dpendency), Daerah 
Khusus ibuk, Daerah Istimewa, daerah-daerah, Departament, Departamento, Département, Départements, Departments, Dependencias Federales, Dependency, Development Region, Diamerismata, Distirct, District, District Municipality, Distrikkaya, Distrikt, Distrito, Distrito Capital, Distrito Federal, Distrito Municipal, Distrito Nacio, Division, Do, (Duchy), Dzongkhag, Economic Prefecture, Eilandgebieden, Emirate, Entity, Estado, Faritany Mizakatena, Faritra, Federal Dependency, Federal District, (Federal Subject), Federal Territory, Fivondronana, Fovaros, Fu, Fylke, Gorod, Gorsovet, Governorate, grad, Gwangyeoksi, Hlavni mesto, Hoofdstedelijke gewest, Hsien, (Hundred), Independent City, Independent Town, Intendancy, Intendencia, Intendency, Island, Island council, Island group, Island Region, Judet, Kabupaten, Kaghak, K'alak'i, Kampeng nakhon, Kanton, Kaupstadir, Kayaing, Ken, Khêt, Khetphiset, Khoueng, Kingdom, Kommuner, Kotamadya, Kraj, Kraje, Kray, Kreisfreie Städte, Krong, Laen, Land, Länd, Lander, Landsvæðun, (Legal entity), Local Authority, (Local Council), Maakond, Magisterial district, Marz, Megye, Mehoz, Metropolis, Metropolitan City, Miesto savivaldybė, Mintaqah, Mkoa, Moughataas, Muhafazah, Munic¡pio, Municipality, Municipio, Municipio Especial, Municipiu, Muong, National capital - special zone, National Capital Area, National Dist, National Territory, Neutral City, Neutral Zone, Nomos, Oblast, Oblasy, Opcine, Opština, Ostan, Parish, Parròquia, Part, Partido, Police Station, Prefecture, préfecture, préfecture economique, Prefegitura, propinsi, (Principality), Province, Provincia, Província, Provincie, Provinsie, (Public body), Pyine, Qark, Région capitale, Raion, Raione, Rajoni, Rajono savivaldybė, Rayon, Reef, Regency, Região, Regierungsbezirk, Region, Région, Región Autónoma, Regional council, (Regional County Municipality), Regional District, Regional Municipality, Regione, Republic, Respublika, Ressort, (Riding), Rural District, (Rural Municipality), Sahar, 
Savivaldybė, Sector, Sector autónomo, See, Senatorial District, Sha`biyah, Sheng, Shih, (Shire), Si, sous-préfecture, Sous-régions, Special City, (Special administrative region), Special district, Special Municipal, Special municipality, Special region, Special region or zone, Srok, State, Statistical Region, Statisticna regij, Subdistrict, Sub-district, Sub-prefecture, Sub-region, Sýsla, Syssel, Taluk, Tarafa, Territoire, Territorial authority, Territorial Unit, Territorio Nacional, Territory, Teukbyeolsi, Thana, Thanh Pho, (Theme), Tinh, To, Todof, (Town), Town council, (Township), Traditional county, Union territo, Union territor, Unitary authority, United Counties, unknown, Upazila, Urban district, Urban prefectur, velayat, Vikas kshetra, Village, Ville Neutre, Voblasts', Voivodship, water bodies, Wilaya, Wilayah persekutuan, Wilayat, Wojewodztwa, Yin, Zila, Zizhiqu, Zupanija, županija. |
The Zoom discussion with @ArthurChapman, @tucotuco and @chicoreus today concluded that tests #95, #139 and #118 were going to be very difficult to implement properly, given the lack of a consistent geographic terms hierarchy by comparison with the taxonomic terms. Note, for example, the issues arising from the table above. We will therefore remove these tests from CORE. In their place, we will
|
Further notes from the zoom discussion with @ArthurChapman, @tucotuco and @Tasilee: Continent values and their use tend to be very inconsistent between data in the wild and source authorities. Conclusion was to focus on dwc:country and dwc:stateProvince values as noted above. The matches column in @tucotuco's table above clarified the problems we have been having untangling the concepts of consistency and unambiguity in hierarchically organized data. The concept we have been trying to label consistency aligns with the property of having one or more matches on the source authority. The concept we have been trying to label unambiguity aligns with the property of having exactly one match on the source authority. As noted in @tucotuco's list of divisions above, the ranks found in Getty do not neatly align with dwc:country and dwc:stateProvince, a simple example being United Kingdom (Nation), England (Country) in Getty, where, given #62, we would expect dwc:country to have a value that would match to the United Kingdom (Nation) value in Getty, rather than the included country level term in Getty. |
Specifically, what we mean by country in Darwin Core is an administrative entity corresponding to place types "nation", "dependent state", "unincorporated territory", "semi-independent political entity", etc. where the list covers all of the entities in the list of ISO country codes. |
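The distinction drawn above (consistency as one or more matches, unambiguity as exactly one match) can be sketched as a mapping from a single match count to two validation responses. This is only an illustration in Python: the function name `classify_match_count` and the dictionary keys are hypothetical, and the count is assumed to come from some lookup against the source authority that is not shown here.

```python
from typing import Dict


def classify_match_count(match_count: int) -> Dict[str, str]:
    """Derive both responses from one count of authority matches.

    Assumption for this sketch: `match_count` is the number of distinct
    entities in the source authority that match the input combination.
    """
    if match_count < 0:
        raise ValueError("match_count cannot be negative")
    return {
        # Consistency: the combination exists at least once in the authority.
        "consistent": "COMPLIANT" if match_count >= 1 else "NOT_COMPLIANT",
        # Unambiguity: the combination resolves to exactly one entity.
        "unambiguous": "COMPLIANT" if match_count == 1 else "NOT_COMPLIANT",
    }
```

Under these assumptions a lone "WA" with two matches would come out consistent but not unambiguous, while zero matches fails both. One lookup could serve both validations, which may ease the earlier concern about processing cost, though it does not tell a tester whether zero matches reflects an input error or a gap in the authority.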
Important summary here, but this test appears to be intractable to implement, so marking as non-core after discussion. |
Added to the Notes (see comments under #123 for discussion of reasons; this is a parallel case): "Note: that for this test to work, the lowest ranking element must be present and the higher ranking elements be consistent with it." Do we need to reword the Expected Response? |
Changed Field to TestField, added ActedUpon/Consulted, added date last modified. |
Changed "Output Type" to TestType and deleted "Warning Type". Updated Specification Last Updated |
@Tasilee - I thought this was a DO NOT IMPLEMENT given our definitions |
Aligned specifications to match current template |