ids for s and w in FOLIA #19

Open
berndmoos opened this issue Nov 13, 2023 · 1 comment

@berndmoos

I apologize if this is documented - I couldn't find it:

I am indexing a FOLIA corpus, to be queried via CQL. This works fine as far as "normal" annotations are concerned, i.e. I can query for (e.g.) POS or lemma on the token level, and also for annotations on the sentence level. However, it remains unclear to me how to account for the xml:id attribute on <s> and <w> elements. The XML looks like this:

<s class="line" xml:id="s3">
    <w xml:id="s3.w1">
        <t>are</t>
        <lemma class="be"/>
        <pos class="VBB"/>
    </w>
    <w xml:id="s3.w2">
        <t>you</t>
        <lemma class="you"/>
        <pos class="PNP"/>
    </w>
    <w xml:id="s3.w3">
        <t>ready</t>
        <lemma class="ready"/>
        <pos class="AV0"/>
    </w>
</s>

And I've tried several variants in the indexing configuration file such as:

<!-- id for the <w>-element -->
<token type="string" offset="false" realoffset="false" parent="false">
    <pre>
        <item type="string" value="word.id" />
    </pre>
    <post>
        <item type="attribute" name="#" />
    </post>
</token>

So far, I haven't been able to find or do anything with the xml:ids.

What I'd like to understand/do is:

  1. How to represent xml:id on both sentence and token level in the config file
  2. How to integrate them into a CQL query
  3. How to access the ids programmatically after having done a query

For (3), I currently test my attempts like so:

  // Fetch the word form ("t") and the token id ("word.id") for every position in the matched span
  List<String> prefixes = new ArrayList<>();
  prefixes.add("t");
  prefixes.add("word.id");
  List<CodecSearchTree.MtasTreeHit<String>> allHits =
          mtasCodecInfo.getPositionedTermsByPrefixesAndPositionRange(
                  "content", index, prefixes, spans.startPosition(), spans.endPosition() - 1);
  // Sort by start position so the terms are printed in token order
  allHits.sort((o1, o2) -> Integer.compare(o1.startPosition, o2.startPosition));
  for (CodecSearchTree.MtasTreeHit<String> hit : allHits) {
      System.out.print(CodecUtil.termValue(hit.data) + "(" + hit.startPosition + ") / ");
  }

I'd be grateful if somebody could point me in the right direction. Thanks in advance.

@matthijsbrouwer
Member

To represent them on both sentence and token level:

<mappings>
	<mapping type="word" name="w">
		<token type="string" offset="false" realoffset="false" parent="false">
			<pre>
				<item type="string" value="word.id" />
			</pre>
			<post>
				<item type="attribute" namespace="http://www.w3.org/XML/1998/namespace" name="id" />
			</post>
		</token>
	</mapping>
	<mapping type="group" name="s">
		<token type="string" offset="false">
			<pre>
				<item type="string" value="sentence.id" />
			</pre>
			<post>
				<item type="attribute" namespace="http://www.w3.org/XML/1998/namespace" name="id" />
			</post>
		</token>
	</mapping>
</mappings>

To query them with CQL, use [word.id="s3.w2"] for a token or <sentence.id="s3"/> for a sentence.
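The mapping above addresses the attribute by namespace URI and local name ("id") because xml:id lives in the reserved XML namespace, http://www.w3.org/XML/1998/namespace. As a rough illustration of that lookup, independent of Mtas, here is a minimal sketch that uses only the JDK's DOM parser to read the ids from a fragment like the one in the question and to build the two query forms shown above. The class and method names (XmlIdDemo, collectIds, tokenQuery, sentenceQuery) are made up for this example, and the query builders assume the prefixes word.id and sentence.id from the mapping and do no escaping of the id value.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Mtas: shows where xml:id lives and how the CQL strings are formed.
class XmlIdDemo {

    /** Returns the xml:id of every element with the given local name, in document order. */
    static List<String> collectIds(String xml, String localName) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Namespace awareness is required for the namespace-based attribute lookup below.
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // "*" matches the element in any namespace (or none), so a FoLiA default namespace is fine too.
        NodeList nodes = doc.getElementsByTagNameNS("*", localName);
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            // XMLConstants.XML_NS_URI is "http://www.w3.org/XML/1998/namespace"; the local name is "id".
            ids.add(element.getAttributeNS(XMLConstants.XML_NS_URI, "id"));
        }
        return ids;
    }

    /** Token-level query; assumes the prefix "word.id" from the mapping. No escaping is done. */
    static String tokenQuery(String id) {
        return "[word.id=\"" + id + "\"]";
    }

    /** Sentence-level query; assumes the prefix "sentence.id" from the mapping. No escaping is done. */
    static String sentenceQuery(String id) {
        return "<sentence.id=\"" + id + "\"/>";
    }

    public static void main(String[] args) throws Exception {
        String xml = "<s class=\"line\" xml:id=\"s3\">"
                + "<w xml:id=\"s3.w1\"><t>are</t></w>"
                + "<w xml:id=\"s3.w2\"><t>you</t></w>"
                + "</s>";
        for (String id : collectIds(xml, "s")) {
            System.out.println(sentenceQuery(id));
        }
        for (String id : collectIds(xml, "w")) {
            System.out.println(tokenQuery(id));
        }
    }
}
```

This only mirrors the attribute lookup and the query syntax; the actual indexing is still done by the mapping configuration.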
