## Collect Data

This notebook uses [AnnotationQueryPython](https://github.com/elsevierlabs-os/AnnotationQueryPython) to query a portion of the CAT Annotated ScienceDirect CCBy Open Access Corpus and extract sentences containing an ambiguous word, selected semi-randomly from a publicly available [list of Ambiguous Words](https://muse.dillfrog.com/lists/ambiguous).

Common Annotation Toolkit (CAT) annotations are standoff annotations: rather than embedding markup in the text, each annotation records the character offsets of the text span it describes. In our case, we will use the word and sentence annotations generated by the [Genia Tagger](http://www.nactem.ac.uk/GENIA/tagger/) to write out a set of sentences containing the selected ambiguous word.

In [2]:
from AQPython.Utilities import *
from AQPython.Annotation import *
from AQPython.Query import *
from AQPython.Concordancers import *

from pyspark.sql import SQLContext, Row
from pyspark.sql import functions as F
from pyspark.sql.types import *

In [3]:
OUTPUT_FOLDER = "/mnt/els/labs/projects/SujitsDatasets/nlp-examples-cs-03"
CCBY_OA_GENIA_DIR = "/mnt/cat-annots-01022019/sd/parquet/genia/ccby-open-access/"
CCBY_OA_GENIA_TEXT_DIR = "/dbfs/mnt/cat-annots-01022019/sd/str/"
display(dbutils.fs.ls(CCBY_OA_GENIA_DIR))

path,name,size
dbfs:/mnt/cat-annots-01022019/sd/parquet/genia/ccby-open-access/_SUCCESS,_SUCCESS,0
dbfs:/mnt/cat-annots-01022019/sd/parquet/genia/ccby-open-access/part-00000-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,part-00000-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,250249823
dbfs:/mnt/cat-annots-01022019/sd/parquet/genia/ccby-open-access/part-00001-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,part-00001-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,253776761
dbfs:/mnt/cat-annots-01022019/sd/parquet/genia/ccby-open-access/part-00002-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,part-00002-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,246402435
dbfs:/mnt/cat-annots-01022019/sd/parquet/genia/ccby-open-access/part-00003-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,part-00003-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,248621370
dbfs:/mnt/cat-annots-01022019/sd/parquet/genia/ccby-open-access/part-00004-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,part-00004-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,254645568
dbfs:/mnt/cat-annots-01022019/sd/parquet/genia/ccby-open-access/part-00005-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,part-00005-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,256516918
dbfs:/mnt/cat-annots-01022019/sd/parquet/genia/ccby-open-access/part-00006-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,part-00006-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,252104344
dbfs:/mnt/cat-annots-01022019/sd/parquet/genia/ccby-open-access/part-00007-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,part-00007-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,251146550
dbfs:/mnt/cat-annots-01022019/sd/parquet/genia/ccby-open-access/part-00008-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,part-00008-2174adbe-a1ee-4975-96c6-a96556081fc8.gz.parquet,258876943


In [4]:
spark.conf.set('spark.sql.shuffle.partitions', 1)
spark.conf.set('spark.sql.autoBroadcastJoinThreshold', -1) 

### CAT Annotation Structure

An example of CAT annotations is shown below. The columns are the same across all annotation types; any extra parameters specific to an annotation type are stored in the `other` column as URL-encoded name-value pairs.

Notice that only the `word` and `NP` annotations contain the text in the `orig` property in the `other` column. All annotations, including the two mentioned, specify their start and end character offsets. So the general strategy is to find the words or phrases of interest by their text, and then use their offsets to locate the larger entities (such as sentences) that contain them.

The AnnotationQuery (Scala) and AnnotationQueryPython libraries provide a set of tools that implement this strategy in an efficient way.

In [6]:
genia_ds = spark.read.parquet(CCBY_OA_GENIA_DIR + '/part-00000-*')
display(genia_ds)

annotId,annotSet,annotType,docId,endOffset,other,startOffset,text
8580,ge,NP,S0749641917302061,58811,parentId=6866&orig=e&lemma=e&pos=NN&tokidx=25.25&origAnnotID=6900,58810,
8581,ge,word,S0749641917302061,58811,parentId=6866&orig=e&lemma=e&pos=NN&tokidx=25&origAnnotID=6898,58810,
8582,ge,word,S0749641917302061,58812,parentId=6866&orig=%29&lemma=%29&pos=%29&tokidx=26&origAnnotID=6899,58811,
8583,ge,word,S0749641917302061,58813,parentId=6866&orig=%5D&lemma=%5D&pos=%29&tokidx=27&origAnnotID=6901,58812,
8584,ge,word,S0749641917302061,58817,parentId=6866&orig=and&lemma=and&pos=CC&tokidx=28&origAnnotID=6902,58814,
8585,ge,NP,S0749641917302061,58827,parentId=6866&orig=the+slope&lemma=the+slope&pos=DT+NN&tokidx=29.30&origAnnotID=6906,58818,
8586,ge,word,S0749641917302061,58821,parentId=6866&orig=the&lemma=the&pos=DT&tokidx=29&origAnnotID=6903,58818,
8587,ge,word,S0749641917302061,58827,parentId=6866&orig=slope&lemma=slope&pos=NN&tokidx=30&origAnnotID=6904,58822,
8588,ge,word,S0749641917302061,58830,parentId=6866&orig=of&lemma=of&pos=IN&tokidx=31&origAnnotID=6905,58828,
8589,ge,NP,S0749641917302061,58855,parentId=6866&orig=the+stress%E2%80%93strain+curves&lemma=the+stress%E2%80%93strain+curve&pos=DT+NN+NNS&tokidx=32.34&origAnnotID=6911,58831,
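
The `other` values in the output above look like URL query strings (`+` for spaces, `%29` for `)`), so they can be unpacked with the Python standard library. The following is a minimal sketch of that decoding, not the code AnnotationQueryPython uses internally; the helper name `parse_other` is hypothetical.

```python
from urllib.parse import parse_qsl

def parse_other(other: str) -> dict:
    """Decode a CAT `other` string into a dict of property name -> value.

    Hypothetical helper for illustration; not part of AnnotationQueryPython.
    """
    # parse_qsl splits on '&' and '=', and URL-decodes both names and values
    # ('+' becomes a space, '%29' becomes ')').
    return dict(parse_qsl(other, keep_blank_values=True))

# One of the NP rows shown in the output above.
props = parse_other(
    "parentId=6866&orig=the+slope&lemma=the+slope&pos=DT+NN&tokidx=29.30&origAnnotID=6906"
)
print(props["orig"])  # the slope
print(props["pos"])   # DT NN
```

All values come back as strings, so numeric properties such as `parentId` would need an explicit conversion before being compared with `annotId`.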


### Extract Sentences Containing the Ambiguous Word

We will create two Spark DataFrames: one containing the `word` annotations whose text (`orig`) matches the ambiguous word, and the other containing the `sentence` annotations.

We will then use the `Contains()` function to join the two DataFrames. It returns rows from the left-hand DataFrame, that is, the `sentence` annotations that contain a matched `word` annotation.

At this point, our DataFrame only contains the sentence offsets. We will use `Hydrate()` to extract the text between these offsets from the text of the document.

Finally, we convert the DataFrame to an RDD, format the docId, annotId, and text of each sentence as a tab-separated line, and write the result out to a single part-file.
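
The containment join and the hydration step described above can be sketched in plain Python on a toy document. This only illustrates the offset logic; it is not how `Contains()` or `Hydrate()` are implemented, and the `Annot` tuple, the two helper functions, and the sample text are all made up for the example. End offsets are treated as exclusive, which matches the word offsets shown in the annotation sample earlier.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for an annotation row.
Annot = namedtuple("Annot", ["docId", "annotType", "startOffset", "endOffset"])

def contains(outer, inner):
    """Return the outer annotations that fully enclose at least one
    inner annotation from the same document."""
    return [
        o for o in outer
        if any(
            i.docId == o.docId
            and o.startOffset <= i.startOffset
            and i.endOffset <= o.endOffset
            for i in inner
        )
    ]

def hydrate(annots, texts):
    """Pair each annotation with the document text between its offsets."""
    return [(a, texts[a.docId][a.startOffset:a.endOffset]) for a in annots]

# Toy document with two sentences and one matched word ("compound").
texts = {"doc1": "Salt is a compound. Water boils."}
sentences = [
    Annot("doc1", "sentence", 0, 19),
    Annot("doc1", "sentence", 20, 32),
]
words = [Annot("doc1", "word", 10, 18)]

matches = hydrate(contains(sentences, words), texts)
for annot, text in matches:
    print(annot.docId, text)  # doc1 Salt is a compound.
```

The nested loop is quadratic and only suitable for a toy example; on the real corpus the same comparison is expressed as a join over Spark DataFrames.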

In [8]:
# AMBIG_WORD = "direct"
# AMBIG_WORD = "charge"
# AMBIG_WORD = "round"
AMBIG_WORD = "compound"

In [9]:
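# Load one partition of the annotations, keeping only words and sentences.
genia_df = GetAQAnnotations(
  (spark.read.parquet(CCBY_OA_GENIA_DIR + "/part-00000-*")
        .filter(F.col("annotType").isin(["word", "sentence"]))),
  props=["attr", "orig", "parentId"],  # names from `other` to keep in `properties`
  decodeProps=["orig"],                # URL-decode the word text
  lcProps=["orig"])                    # lowercase it, so "Compound" matches too
genia_df.persist()  # reused by the word and sentence filters below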
display(genia_df)

annotId,annotSet,annotType,docId,endOffset,startOffset,properties
2,ge,word,S0028393217304220,7313,7312,"Map(orig -> a, parentId -> 6896)"
1,ge,sentence,S0028393217304220,7590,7312,Map()
3,ge,word,S0028393217304220,7314,7313,"Map(orig -> ), parentId -> 6896)"
5,ge,word,S0028393217304220,7318,7315,"Map(orig -> the, parentId -> 6896)"
6,ge,word,S0028393217304220,7337,7319,"Map(orig -> farnsworth-munsell, parentId -> 6896)"
7,ge,word,S0028393217304220,7341,7338,"Map(orig -> 100, parentId -> 6896)"
8,ge,word,S0028393217304220,7345,7342,"Map(orig -> hue, parentId -> 6896)"
9,ge,word,S0028393217304220,7350,7346,"Map(orig -> test, parentId -> 6896)"
10,ge,word,S0028393217304220,7357,7351,"Map(orig -> colour, parentId -> 6896)"
11,ge,word,S0028393217304220,7363,7358,"Map(orig -> chips, parentId -> 6896)"


In [10]:
word_df = FilterType(genia_df, annotType="word")
ambig_word_df = FilterProperty(word_df, name="orig", value=AMBIG_WORD)
display(ambig_word_df)

annotId,annotSet,annotType,docId,endOffset,startOffset,properties
1372,ge,word,S0278691518301698,23040,23032,"Map(orig -> compound, parentId -> 611)"
3920,ge,word,S0278691518301698,33879,33871,"Map(orig -> compound, parentId -> 3155)"
4751,ge,word,S0278691518301698,37844,37836,"Map(orig -> compound, parentId -> 4008)"
4849,ge,word,S0278691518301698,38201,38193,"Map(orig -> compound, parentId -> 4084)"
5700,ge,word,S0278691518301698,41824,41816,"Map(orig -> compound, parentId -> 4954)"
5763,ge,word,S0278691518301698,42111,42103,"Map(orig -> compound, parentId -> 4954)"
7615,ge,word,S0278691518301698,49980,49972,"Map(orig -> compound, parentId -> 6866)"
9570,ge,word,S0278691518301698,58731,58723,"Map(orig -> compound, parentId -> 8769)"
9604,ge,word,S0278691518301698,58921,58913,"Map(orig -> compound, parentId -> 8845)"
9200,ge,word,S0014489418300158,53302,53294,"Map(orig -> compound, parentId -> 8470)"


In [11]:
sentence_df = FilterType(genia_df, annotType="sentence")
display(sentence_df)

annotId,annotSet,annotType,docId,endOffset,startOffset,properties
1,ge,sentence,S0028393217304220,7590,7312,Map()
76,ge,sentence,S0028393217304220,7718,7591,Map()
116,ge,sentence,S0028393217304220,7770,7730,Map()
126,ge,sentence,S0028393217304220,7951,7771,Map()
173,ge,sentence,S0028393217304220,7993,7952,Map()
185,ge,sentence,S0028393217304220,8046,7994,Map()
200,ge,sentence,S0028393217304220,8156,8058,Map()
219,ge,sentence,S0028393217304220,8349,8157,Map()
262,ge,sentence,S0028393217304220,8455,8350,Map()
292,ge,sentence,S0028393217304220,8606,8456,Map()


In [12]:
ambig_sentences_df = Contains(sentence_df, ambig_word_df, 2000, False)
display(ambig_sentences_df)

annotId,annotSet,annotType,docId,endOffset,startOffset,properties
1338,ge,sentence,S0278691518301698,23140,22880,Map()
3882,ge,sentence,S0278691518301698,33958,33710,Map()
4735,ge,sentence,S0278691518301698,37983,37762,Map()
4811,ge,sentence,S0278691518301698,38202,38030,Map()
5681,ge,sentence,S0278691518301698,42112,41743,Map()
7593,ge,sentence,S0278691518301698,50089,49868,Map()
9496,ge,sentence,S0278691518301698,58732,58435,Map()
9572,ge,sentence,S0278691518301698,59038,58733,Map()
9161,ge,sentence,S0014489418300158,53398,53143,Map()
1586,ge,sentence,S0022519317304332,34346,34118,Map()


In [13]:
ambig_sent_text_df = Hydrate(ambig_sentences_df, txtPath=CCBY_OA_GENIA_TEXT_DIR)
display(ambig_sent_text_df)

annotId,annotSet,annotType,docId,endOffset,startOffset,properties
2942,ge,sentence,S000292971500333X,27966,27694,"Map(text -> We used primary skin fibroblasts from individuals with HS and primary skin fibroblast cell lines completely deficient of PEX1 (compound heterozygous for p.[Thr263Ilefs∗6];[Ile700Tyrfs∗42], c.[788_789del];[2097dup])19 or PEX6 (homozygous for p.Gly135Aspfs∗23 [c.402del]).20)"
4265,ge,sentence,S000292971500333X,33610,33408,"Map(text -> In family 5, the two affected individuals were compound heterozygous for a previously reported pathogenic c.821C>T (p.Pro274Leu)24 variant in PEX6 and an ultra-rare missense variant on the other allele.)"
5852,ge,sentence,S000292971500333X,40535,40392,Map(text -> Our combined findings show that HS is caused by compound heterozygosity for a loss-of-function allele and a hypomorphic allele in PEX1 or PEX6.)
6155,ge,sentence,S000292971500333X,41982,41827,"Map(text -> Compound heterozygosity of the hypomorphic PEX6 c.1802G>A allele has been reported previously in seven individuals with a Zellweger spectrum disorder.20,23)"
6369,ge,sentence,S000292971500333X,43146,42734,"Map(text -> Because the PEX6 c.1802G>A allele has a frequency of 0.41% in the European population (see ExAC Browser in the Web Resources), we expect that future WES studies will identify additional individuals who have a mild PBD due to compound heterozygosity of the PEX6 c.1802G>A allele and a severe PEX6 allele and who have not been suspected of or analyzed for a peroxisomal disorder on the basis of clinical diagnosis.)"
37,ge,sentence,S000292971630444X,9733,9654,Map(text -> Proband II.1 from family A is compound heterozygous for REEP6 variants (M1/M2).)
3434,ge,sentence,S000292971630444X,27575,27214,"Map(text -> Optic cups were selected after morphological assessment under a light microscope at several time points during the differentiation and fixed in 4% paraformaldehyde (PFA) for 40 min at 4°C, cryoprotected by incubation overnight in 30% sucrose in phosphate buffer saline (PBS), embedded in OCT compound (Sakura Finetek), frozen, and cryosectioned (6 μm sections).)"
6472,ge,sentence,S000292971630444X,40217,39890,"Map(text -> Individual A-II:1 (EG76) harbored compound heterozygous variants in REEP6: a missense mutation (GRCh37 [hg19] chr19:g.1496339T>C, GenBank: NM_138393.1; c.404T>C [p.Leu135Pro]) and a single-nucleotide deletion also in exon 4 (GRCh37 [hg19] chr19:g.1496383del, GenBank: NM_138393.1; c.448del [p.Ala150Pfs∗2]) (Figures 1A and 1B).)"
608,ge,sentence,S0002929717302021,38280,38127,"Map(text -> Compound heterozygous individuals with two rare alleles in RNU4ATAC were observed, and for each such individual a line is drawn linking the two variants.)"
2264,ge,sentence,S0002929717302021,46797,46696,"Map(text -> Thus, disease risk due to compound heterozygosity or X-linked inheritance is explicitly accommodated.)"


In [14]:
print("number of candidate sentences:", ambig_sent_text_df.count())

In [15]:
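# Remove any previous output first; saveAsTextFile fails if the folder exists.
dbutils.fs.rm(OUTPUT_FOLDER, True)
# Write one "docId<TAB>annotId<TAB>text" line per sentence to a single part-file.
(ambig_sent_text_df
  .coalesce(1)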
  .rdd
  .map(lambda aq: "{:s}\t{:d}\t{:s}".format(aq.docId, aq.annotId, aq.properties["text"]))
  .saveAsTextFile(OUTPUT_FOLDER)
)
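
The format string above writes the sentence text as-is. If a sentence could contain a tab or a line break (this notebook does not check), the record would no longer be a clean three-column line. The sketch below shows one possible way to guard against that and to read a line back; `to_tsv_line` and `from_tsv_line` are hypothetical helpers, and the sample values are invented.

```python
import re

def to_tsv_line(doc_id: str, annot_id: int, text: str) -> str:
    """Format one record, collapsing any run of whitespace in the text
    (including tabs and newlines) into a single space."""
    clean = re.sub(r"\s+", " ", text).strip()
    return "{:s}\t{:d}\t{:s}".format(doc_id, annot_id, clean)

def from_tsv_line(line: str):
    """Parse a record back into (docId, annotId, text)."""
    # Split on the first two tabs only, so the text stays in one piece.
    doc_id, annot_id, text = line.rstrip("\n").split("\t", 2)
    return doc_id, int(annot_id), text

# Invented record whose text contains a tab and a newline.
line = to_tsv_line("doc1", 37, "Salt is\ta compound\nof two elements.")
record = from_tsv_line(line)
print(record)
```

Collapsing whitespace changes the text slightly relative to the source document, so the stored offsets would no longer map exactly onto the cleaned string; whether that matters depends on how the file is used downstream.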