## Approach for searching the contexts
- For each pdf in pdf_names:
  - First, build a contexts dictionary keyed by context string. Each value holds the doc_id, block_id, sentence_number, and the **linear_order_number of every extraction** with that context, read from {this_pdf_name}_events.xlsx.
  - Build doc_sentence_map with doc_id, sentence_text, sentence_id, and doc_sent_linear_order from {this_pdf_name}_all.csv.
  - For each context key in the contexts dictionary:
    - Read this_doc_id and this_sentence_number from the contexts dictionary.
    - From doc_sentence_map, take all sentences up to this extraction's **doc_sent_linear_order**.
    - Find the 3 closest (doc_id, sentence_number) pairs where the context text occurs, searching only up to (this_doc_id, this_sentence_number).
    - For each of these nearest matches, compute the distance as the difference between the **doc_sent_linear_order** stored in the contexts dictionary and the **doc_sent_linear_order** of the matching sentence in doc_sentence_map.
- Plot a histogram of these distances.
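
The search step above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the function name `nearest_context_matches`, the toy sentences, and the case-insensitive substring match are all assumptions made for the example.

```python
def nearest_context_matches(sentences, context, event_order, k=3):
    """Return up to k (sentence_order, distance) pairs, closest first.

    sentences: list of (doc_sent_linear_order, sentence_text) tuples.
    Only sentences at or before event_order are searched (an assumption
    about what "up to" means in the list above).
    """
    needle = context.strip().lower()
    hits = [
        (order, event_order - order)
        for order, text in sentences
        if order <= event_order and needle in text.lower()
    ]
    # Smallest distance first, then keep the k nearest matches.
    hits.sort(key=lambda pair: pair[1])
    return hits[:k]


# Hypothetical sentences, numbered by document linear order.
sentences = [
    (1, "SARS spread in Hong Kong in 2003."),
    (2, "Cases were also reported in Beijing."),
    (3, "Hong Kong introduced quarantine measures."),
    (4, "The outbreak peaked in late March."),
]
print(nearest_context_matches(sentences, "Hong Kong", event_order=4))
# [(3, 1), (1, 3)]
```

The distances returned here (1 and 3) are the values that would feed the histogram.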

## Approach for ordering the documents, extractions
- For each pdf in pdf_names:
  - First, sort the extractions by doc_id, block_id, sentence_number and add linear_order_number, using {this_pdf_name}_all_linear_order.xlsx.
  - Construct doc_sentence_map with doc_id, sentence_text, sentence_id from {this_pdf_name}_all.csv, and add doc_sent_linear_order.
  - For each sorted (doc_id, block_id, sentence_number, linear_order_number) row, add the matching doc_sent_linear_order.
  - Keep only the events from this list and save doc_id, block_id, sentence_number, linear_order_number, and doc_sent_linear_order, along with the contexts and sentence text, to {this_pdf_name}_events.xlsx.
  - Why two orderings are needed: extractions["documents"] and extractions["mentions"] do not match. extractions["mentions"] have page numbers and block numbers, but the COSMOS JSON does not have block numbers. So the extractions are sorted by doc_id, block_id, sentence_number to get linear_order_number, the documents are sorted by doc_id and sentence_number to get a separate document linear order, and the two are combined so that contexts can be searched within documents by document linear order.
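
A small sketch of the two orderings, using plain Python on toy data. The `documents` and `mentions` values below are hypothetical stand-ins for the extractions JSON, and the field names mirror the columns used later in this notebook; the sort key is an assumption based on the list above.

```python
# Toy stand-ins for extractions["documents"] and extractions["mentions"].
documents = {
    "doc_b": ["Sentence b0.", "Sentence b1."],
    "doc_a": ["Sentence a0.", "Sentence a1.", "Sentence a2."],
}
mentions = [
    {"id": "E:2", "doc_id": "doc_b", "pg_num": 2, "blk_id": 1, "sentence_id": 1},
    {"id": "E:1", "doc_id": "doc_a", "pg_num": 1, "blk_id": 3, "sentence_id": 2},
    {"id": "T:9", "doc_id": "doc_a", "pg_num": 1, "blk_id": 3, "sentence_id": 0},
]

# Document order: sort doc_ids, then number sentences consecutively from 1.
doc_sent_linear_order = {}
counter = 1
for doc_id in sorted(documents):
    for sentence_id, _ in enumerate(documents[doc_id]):
        doc_sent_linear_order[(doc_id, sentence_id)] = counter
        counter += 1

# Extraction order: sort mentions by (doc_id, pg_num, blk_id, sentence_id).
ordered = sorted(
    mentions,
    key=lambda m: (m["doc_id"], m["pg_num"], m["blk_id"], m["sentence_id"]),
)
rows = []
for linear_order, m in enumerate(ordered, start=1):
    key = (m["doc_id"], m["sentence_id"])
    rows.append({**m, "linear_order": linear_order,
                 "doc_sent_linear_order": doc_sent_linear_order[key]})

# Keep only event mentions (ids starting with "E:"), as in the _events files.
events = [r for r in rows if r["id"].startswith("E:")]
for r in events:
    print(r["id"], r["linear_order"], r["doc_sent_linear_order"])
# E:1 2 3
# E:2 3 5
```

Each event row ends up with both numbers: its position among all extractions and the position of its sentence among all document sentences.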

In [2]:
!pip install altair vega_datasets
!pip install vega
!pip install altair_viewer
# textwrap is part of the Python standard library, so it does not need a pip install.

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting vega
  Downloading vega-4.0.0-py3-none-any.whl (3.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 34.4 MB/s eta 0:00:00
Collecting ipytablewidgets<0.4.0,>=0.3.0 (from vega)
  Downloading ipytablewidgets-0.3.1-py2.py3-none-any.whl (190 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.2/190.2 kB 18.5 MB/s eta 0:00:00
Collecting jupyter<2.0.0,>=1.0.0 (from vega)
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting traittypes>=0.0.6 (from ipytablewidgets<0.4.0,>=0.3.0->vega)
  Downloading traittypes-0.2.1-py2.py3-none-any.whl (8.6 kB)
Collecting lz4 (from ipytable

In [3]:
import numpy as np
import pandas as pd
import os
import json
import io

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
!ls
%cd "/content/drive/MyDrive/Colab Notebooks/skema/data"
!ls

drive  sample_data
/content/drive/MyDrive/Colab Notebooks/skema/data
'1471-2334-3-19 (2) (2).pdf'
 cosmos-and-extractions-jsons-for-3-papers
 data_modeling_covid_italy.json
 data-response-to-covid-19-was-italy-unprepared.json
 data-sars-double.json
 doc_pg_blk_sent_event.xlsx
 event_linear_order_modeling_covid_italy.xlsx
 event_linear_order_modeling.xlsx
 event_linear_order_response_to_covid_19_was_italy_unprepared.xlsx
 event_linear_order_sarsdouble.xlsx
 modeling_covid_italy_all.json
 modeling_covid_italy_all_linear_order_5_17.xlsx
 modeling_covid_italy_all_linear_order.xlsx
 modeling_covid_italy_all.xlsx
 modeling_covid_italy_events.xlsx
 modeling_covid_italy.json
 modeling_covid_italy.xlsx
'modelling_doc_event (1).xlsx'
 modelling_doc_event.xlsx
 response-to-covid-19-was-italy-unprepared_all.json
 response-to-covid-19-was-italy-unprepared_all_linear_order_5_17.xlsx
 response-to-covid-19-was-italy-unprepared_all_linear_order.xlsx
 response-to-covid-19-was-italy-unprepared_all.xlsx
 

In [6]:
path = "/content/drive/MyDrive/Colab Notebooks/skema/data/"
os.path.exists(path)

True

In [7]:
filenames = ["sarsdouble.xlsx", "modeling_covid_italy.xlsx", "response-to-covid-19-was-italy-unprepared.xlsx"]
diff_distance_map = []
for filename in filenames:
  print("\n**********************************************************************************\n")
  print("The annotated extractions file name %s " %(filename))
  df = pd.read_csv(os.path.join(path, filename))
  df['locationContext'] = df['locationContext'].replace({"^'|'$": ""}, regex=True)
  df['temporalContext'] = df['temporalContext'].replace({"^'|'$": ""}, regex=True)

  print("Number of page numbers different from extractions with that of manually annotated are : %d out of %d" %(len(df[df['pg_num'] != df["page_num"]]), df.shape[0]) )
  gp = df.groupby(['locationContext']).count()
  print(gp.reset_index()[['locationContext', 'event_id']])


**********************************************************************************

The annotated extractions file name sarsdouble.xlsx 
Number of page numbers different from extractions with that of manually annotated are : 11 out of 172
                                      locationContext  event_id
0                                             Beijing        17
1                                     China,Hong Kong        15
2                    China,Hong Kong,Beijing,Shanghai         1
3     China,Hong Kong,Mainland China,Beijing,Shanghai         1
4             Europe,Inner Mongolia,Hong Kong,Beijing         1
5   Guangdong Province,Hong Kong,China,Mainland Ch...         2
6   Guangdong Province,Hong Kong,Mainland China,Be...         2
7                           Guangdong,China,Hong Kong         1
8                                           Hong Kong        43
9                              Hong Kong,Amoy Gardens        21
10                                Hong Kong,Guangdong   

In [8]:
print("Number of rows where the annotated page number differs from the COSMOS page number: ", len(df[df['pg_num'] != df["page_num"]]))

Number of rows where the annotated page number differs from the COSMOS page number:  5


In [9]:
path = "/content/drive/MyDrive/Colab Notebooks/skema/data/"
extractions_path = "cosmos-and-extractions-jsons-for-3-papers"
annotated_files = ["data-sars-double.json", "data_modeling_covid_italy.json", "data-response-to-covid-19-was-italy-unprepared.json"]
extraction_files =["extractions_sarsdouble.json", "extractions_modeling_covid_italy--COSMOS-data.json","extractions_response-to-covid-19-was-italy-unprepared--COSMOS-data.json"]
paper_names = ["sarsdouble.pdf", "modeling_covid_italy.pdf", "response-to-covid-19-was-italy-unprepared.pdf"]
ann_extr_file_pairs = {}
for name,ann, extr in zip(paper_names,annotated_files, extraction_files):
  ann_extr_file_pairs[name] = [ann, extr]

In [10]:
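def save_extr_ann_file(path, filename, records):
  # Note: the table is written as CSV text even though the file name ends in ".xlsx";
  # later cells read it back with pd.read_csv, so the extension is kept unchanged.
  pd.DataFrame.from_records(records).to_csv(os.path.join(path, filename+".xlsx"))

  # Also keep a JSON copy of the same records.
  with io.open(os.path.join(path, filename+".json"), 'w', encoding='utf-8') as f:
    f.write(json.dumps(records, ensure_ascii=False))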

In [93]:
def combine_ann_extr_all(path, extractions_path, extr, ann,filename ):
  with open(os.path.join(path,extractions_path, extr ), "r", encoding='UTF-8') as f:
    contents = f.readlines()
    extractions = json.loads(contents[0])
  with open(os.path.join(path, ann ), "r", encoding='UTF-8') as f:
    contents = f.read()
    annotations = json.loads(contents)

  print(len(annotations), len(extractions["mentions"]), len(extractions["documents"]))
  doc_sentence_map = {}
  linear_order = 1
  doc_ids = sorted(list(extractions["documents"].keys()))
  for doc_id in doc_ids:
    #print(doc_id, document)
    document = extractions["documents"][doc_id]
    for i,sentence in enumerate(document['sentences']):
      doc_sentence_map[(doc_id,i)] = {"sentence_text":sentence['words'], "sno":linear_order}
      linear_order += 1

  doc_event_map = []
  event_doc_map = {}
  for mention in extractions['mentions']:
    # if mention['id'].startswith("E:"):
    for att in mention["attachments"]:
      if "pageNum" in att.keys():
        this_text = doc_sentence_map[(mention['document'], mention['sentence'])]['sentence_text']
        this_linear_order = doc_sentence_map[(mention['document'], mention['sentence'])]['sno']

        doc_event_map.append({"doc_id":mention['document'], "pg_num":att["pageNum"][0],"blk_id":att["blockIdx"][0],"sentence_id":mention['sentence'], 
                              "doc_sentence_count":len(extractions["documents"][mention['document']]["sentences"]), 
                              "event_id":mention["id"], "event":mention["text"] }) #corrected_sent_number -> 1,2,5,4,6,7, => 1,2,3,4,5,6
        event_doc_map[mention["id"]] = {"doc_id":mention['document'],"pg_num":att["pageNum"][0],"blk_id":att["blockIdx"][0],"sentence_id":mention['sentence'], 
                              "doc_sentence_count":len(extractions["documents"][mention['document']]["sentences"]), 
                              "event_id":mention["id"], "event":mention["text"] , "sentence_text":this_text, "doc_sent_linear_order":this_linear_order}
  event_ann_map = {}
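  # Use a separate loop variable so the `ann` argument (the annotation file name) is not shadowed.
  for annotation in annotations:
    event_ann_map[annotation['eventId']] = {"annotated_page_num":annotation["page_num"],"para_num":annotation["para_num"], "event":annotation["event"], 'locationContext': annotation['locationContext'],
    'temporalContext': annotation['temporalContext'],'explanation': annotation['explanation']}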
  
  empty_map = {"annotated_page_num":"","para_num":"", "event":"", 'locationContext': "",
    'temporalContext': "",'explanation': ""}
  event_extr_ann_map = []
  for event, values in event_doc_map.items():
    this_event = event_ann_map[event] if event in event_ann_map.keys() else empty_map
    event_extr_ann_map.append({"doc_id":values['doc_id'],"annotated_page_num":this_event["annotated_page_num"],"para_num":this_event["para_num"], "event_id":values["event_id"],
                               "event":this_event["event"], 'locationContext': ",".join(this_event['locationContext']),
    "sentence_text":",".join(values["sentence_text"]),
    'temporalContext': ",".join(this_event['temporalContext']),'explanation': this_event['explanation'], 'pg_num':values['pg_num'], 'blk_id':values['blk_id'], 
    'sentence_id':values['sentence_id'], 'doc_sentence_count':values['doc_sentence_count'], "doc_sent_linear_order":values["doc_sent_linear_order"]})


  df = pd.DataFrame.from_records(event_extr_ann_map)
  print(df.columns.to_list())
  print("\n")
  # Strip stray leading/trailing single quotes from the context strings.
  df['locationContext'] = df['locationContext'].replace({"^'|'$": ""}, regex=True)
  df['temporalContext'] = df['temporalContext'].replace({"^'|'$": ""}, regex=True)
  # Sort the extractions into reading order, then number them 1..N.
  df.sort_values(by=['doc_id', 'pg_num','blk_id', 'sentence_id', 'doc_sentence_count'], inplace=True)
  df['linear_order'] = list(range(1, len(df)+1))
  print("columns \n",df.columns)
  # Written as CSV text despite the ".xlsx" extension; later cells read it with pd.read_csv.
  df.to_csv(os.path.join(path, filename+"_linear_order_5_17"+".xlsx"))
  save_extr_ann_file(path, filename, event_extr_ann_map)
  

### Combine annotations and extractions for each paper

In [94]:
for name in paper_names:
  ann, extr = ann_extr_file_pairs[name]
  # name[:-4] drops the ".pdf" suffix.
  combine_ann_extr_all(path, extractions_path, extr, ann, name[:-4]+"_all")

174 6212 91
['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event', 'locationContext', 'sentence_text', 'temporalContext', 'explanation', 'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count', 'doc_sent_linear_order']


columns 
 Index(['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event',
       'locationContext', 'sentence_text', 'temporalContext', 'explanation',
       'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count',
       'doc_sent_linear_order', 'linear_order'],
      dtype='object')
302 10045 76
['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event', 'locationContext', 'sentence_text', 'temporalContext', 'explanation', 'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count', 'doc_sent_linear_order']


columns 
 Index(['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event',
       'locationContext', 'sentence_text', 'temporalContext', 'explanation',
       'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count',
       'doc_sent_l

In [99]:
def get_doc_id_sentence_text(path, extractions_path, extr,filename ):
  with open(os.path.join(path,extractions_path, extr ), "r", encoding='UTF-8') as f:
    contents = f.readlines()
    extractions = json.loads(contents[0])

  print(len(extractions["documents"]))
  doc_sentence_map = []
  linear_order = 1
  doc_ids = sorted(list(extractions["documents"].keys()))
  for doc_id in doc_ids:
    #print(doc_id, document)
    document = extractions["documents"][doc_id]
    for i,sentence in enumerate(document['sentences']):
      doc_sentence_map.append({"doc_id":doc_id,"sentence_id":i,"sentence_text":sentence['words'], "sno":linear_order})
      linear_order += 1
  pd.DataFrame.from_records(doc_sentence_map).to_csv(os.path.join(path, filename+".csv"))

In [100]:
for name in paper_names:
  ann, extr = ann_extr_file_pairs[name]
  # name[:-4] drops the ".pdf" suffix; the function appends ".csv" itself.
  get_doc_id_sentence_text(path, extractions_path, extr, name[:-4]+"_all")

91
76
47


In [14]:
filenames = ["sarsdouble_all.xlsx", "modeling_covid_italy_all.xlsx", "response-to-covid-19-was-italy-unprepared_all.xlsx"]

### Save all event extractions with linear orders

In [97]:
filenames = ["sarsdouble_all_linear_order_5_17.xlsx", "modeling_covid_italy_all_linear_order_5_17.xlsx", "response-to-covid-19-was-italy-unprepared_all_linear_order_5_17.xlsx"]
diff_distance_map = []
for filename in filenames:
  print("\n**********************************************************************************\n")
  print("The annotated extractions file name %s " %(filename))
  df = pd.read_csv(os.path.join(path, filename), index_col=False)
  df = df[df.columns[1:]]
  print(df.columns.to_list())
  print("\n")
  event_df = df[df['event_id'].str.startswith("E:")]
  print(len(event_df), len(df))
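  # Derive the paper stem: everything before "_all", with hyphens turned into underscores.
  f = filename.replace("-","_").split("_")
  f = "_".join(f[:f.index("all")])
  print(f,f+"_event.xlsx")  # note: the file written below is f+"_events.xlsx" (plural)
  # Written as CSV text despite the ".xlsx" extension.
  event_df.to_csv(os.path.join(path, f+"_events.xlsx"))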
  


**********************************************************************************

The annotated extractions file name sarsdouble_all_linear_order_5_17.xlsx 
['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event', 'locationContext', 'sentence_text', 'temporalContext', 'explanation', 'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count', 'doc_sent_linear_order', 'linear_order']


172 6108
sarsdouble sarsdouble_event.xlsx

**********************************************************************************

The annotated extractions file name modeling_covid_italy_all_linear_order_5_17.xlsx 
['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event', 'locationContext', 'sentence_text', 'temporalContext', 'explanation', 'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count', 'doc_sent_linear_order', 'linear_order']


302 10045
modeling_covid_italy modeling_covid_italy_event.xlsx

**********************************************************************************

The anno

In [16]:
event_df[['doc_id','pg_num', 'blk_id','sentence_id', 'doc_sentence_count', 'linear_order']]

Unnamed: 0,doc_id,pg_num,blk_id,sentence_id,doc_sentence_count,linear_order
94,-100047078,7,3,4,12,95
100,-100047078,7,3,4,12,101
188,-100047078,7,3,8,12,189
189,-100047078,7,3,8,12,190
241,-100047078,7,3,8,12,242
358,-1045735558,1,6,0,2,359
631,-1113525426,13,1,16,39,632
890,-1468384437,8,4,1,6,891
904,-1468384437,8,4,2,6,905
1101,-1657329247,5,5,2,6,1102


In [71]:
from collections import Counter

filenames = ["sarsdouble_all_linear_order_5_17.xlsx", "modeling_covid_italy_all_linear_order_5_17.xlsx", "response-to-covid-19-was-italy-unprepared_all_linear_order_5_17.xlsx"]
loc_context_counter = Counter()
temporal_context_counter = Counter()
loc_context_counters = {filename: Counter() for filename in filenames}
temporal_context_counters = {filename: Counter() for filename in filenames}

loc_tf = {}
with pd.ExcelWriter(os.path.join(path, "summarized_context_counter.xlsx")) as writer:

  for filename in filenames:
    print("\n**********************************************************************************\n")
    print("The annotated extractions file name %s " %(filename))
    df = pd.read_csv(os.path.join(path, filename), index_col=False)
    df = df[df.columns[1:]]
    df = df.fillna("")
    # Split each comma-separated context cell into individual context strings, skipping empties.
    loc_context_counter = Counter(l for cell in df['locationContext'] for l in cell.split(",") if l != '')
    temporal_context_counter = Counter(l for cell in df['temporalContext'] for l in cell.split(",") if l != '')
    print(loc_context_counter)
    print(temporal_context_counter)
    loc_context_counters[filename] = loc_context_counter
    temporal_context_counters[filename] = temporal_context_counter
    # Combine into a new Counter so the per-file location counter stored above is not mutated.
    combined_counter = loc_context_counter + temporal_context_counter
    # filename[:-27] strips "_all_linear_order_5_17.xlsx"; Excel limits sheet names to 31 characters.
    pd.DataFrame.from_dict(combined_counter, orient="index").reset_index().to_excel(writer, sheet_name=filename[:-27][:31], index=False)



**********************************************************************************

The annotated extractions file name sarsdouble_all_linear_order_5_17.xlsx 
Counter({'Hong Kong': 128, 'Inner Mongolia': 39, 'Beijing': 39, 'Guangdong': 24, 'Amoy Gardens': 21, 'China': 20, 'Guangdong Province': 7, 'Shanghai': 6, 'Mainland China': 5, 'Europe': 1, 'Paris': 1})
Counter({' 2003': 304, 'March 17': 135, 'May 10': 135, '1983': 21, '1985': 21, '03/17': 10, '03/20': 10, '03/23': 10, '03/26': 10, '03/29': 10, '04/01': 10, '04/04': 10, '04/07': 10, '04/10': 10, '04/13': 10, '04/16': 10, '04/19': 10, '04/22': 10, '04/25': 10, '04/28': 10, '05/01': 10, '05/04': 10, '05/07': 10, '05/10': 10, 'February 21st': 9, 'February 21': 9, '17 March': 8, '10 May': 8, 'November 2002': 5, 'February 22': 4})

**********************************************************************************

The annotated extractions file name modeling_covid_italy_all_linear_order_5_17.xlsx 
Counter({'Italy': 253, 'Lodi Province'



In [72]:
df.columns

Index(['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event',
       'locationContext', 'temporalContext', 'explanation', 'pg_num', 'blk_id',
       'sentence_id', 'doc_sentence_count', 'linear_order'],
      dtype='object')

In [None]:
'doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event',
       'locationContext', 'temporalContext', 'explanation', 'pg_num', 'blk_id',
       'sentence_id', 'doc_sentence_count', 'linear_order'

In [70]:
from collections import Counter

filenames = ["sarsdouble_all_linear_order_5_17.xlsx", "modeling_covid_italy_all_linear_order_5_17.xlsx", "response-to-covid-19-was-italy-unprepared_all_linear_order_5_17.xlsx"]

loc_tf = {}
# with pd.ExcelWriter(os.path.join(path, "summarized_context_counter.xlsx")) as writer:

for filename in filenames:
  print("\n**********************************************************************************\n")
  print("The annotated extractions file name %s " %(filename))
  df = pd.read_csv(os.path.join(path, filename), index_col=False)
  df = df[df.columns[1:]]
  df = df.fillna("")
  


'response-to-covid-19-was-italy-unprepared'

In [61]:
pd.DataFrame.from_dict(loc_context_counter, orient="index").reset_index()

Unnamed: 0,index,0
0,Hong Kong,128
1,Guangdong,24
2,Inner Mongolia,39
3,China,20
4,Amoy Gardens,21
5,Mainland China,5
6,Beijing,39
7,Shanghai,6
8,Europe,1
9,Guangdong Province,7


In [47]:
df = df.fillna("")
Counter([l for l in sum(df['locationContext'].apply(lambda x: x.split(",")).to_list(), []) if l != ''])
Counter([l for l in sum(df['temporalContext'].apply(lambda x: x.split(",")).to_list(), []) if l != ''])

Counter({'March 17': 135,
         ' 2003': 304,
         'May 10': 135,
         '1983': 21,
         '1985': 21,
         '03/17': 10,
         '03/20': 10,
         '03/23': 10,
         '03/26': 10,
         '03/29': 10,
         '04/01': 10,
         '04/04': 10,
         '04/07': 10,
         '04/10': 10,
         '04/13': 10,
         '04/16': 10,
         '04/19': 10,
         '04/22': 10,
         '04/25': 10,
         '04/28': 10,
         '05/01': 10,
         '05/04': 10,
         '05/07': 10,
         '05/10': 10,
         'February 21st': 9,
         'February 21': 9,
         'February 22': 4,
         'November 2002': 5,
         '17 March': 8,
         '10 May': 8})

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright.", "We can see the shining sun, the bright sun."]
count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
print ("Vocabulary:")
print(count_vectorizer.vocabulary_)
# vocabulary_ maps term -> column index; order the terms by column index so the labels line up with the matrix.
Vocab = sorted(count_vectorizer.vocabulary_, key=count_vectorizer.vocabulary_.get)
print(Vocab)

# Column order: ['blue', 'bright', 'is', 'sky', 'sun', 'the']
freq_term_matrix = count_vectorizer.transform(test_set)
print (freq_term_matrix.todense())

count_array = freq_term_matrix.toarray()
df = pd.DataFrame(data=count_array, columns=Vocab)
print(df)

from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
print ("IDF:")
print(tfidf.idf_)

Vocabulary:
{'the': 5, 'sky': 3, 'is': 2, 'blue': 0, 'sun': 4, 'bright': 1}
['blue', 'bright', 'is', 'sky', 'sun', 'the']
[[0 1 1 1 1 2]
 [0 1 0 0 2 2]]
   blue  bright  is  sky  sun  the
0     0       1   1    1    1    2
1     0       1   0    0    2    2
IDF:
[2.09861229 1.         1.40546511 1.40546511 1.         1.        ]


In [None]:
a = np.array([[9, 2, 3],
           [4, 5, 6],
           [7, 0, 5]])
a1 = a[a[:, 0].argsort()]
a2 = a1[a1[:, 1].argsort()]
a3 = a2[a2[:, 2].argsort()]
a1,a2,a3

(array([[4, 5, 6],
        [7, 0, 5],
        [9, 2, 3]]),
 array([[7, 0, 5],
        [9, 2, 3],
        [4, 5, 6]]),
 array([[9, 2, 3],
        [7, 0, 5],
        [4, 5, 6]]))
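
The cell above re-sorts by one column at a time, so the final array `a3` is ordered by the last column only. For a true multi-key ordering (such as doc_id, then block, then sentence), one option is `np.lexsort`, which takes the keys with the primary key last. The array below is a made-up example chosen to include ties in the first column.

```python
import numpy as np

a = np.array([[1, 2, 9],
              [0, 5, 3],
              [1, 0, 4],
              [0, 1, 7]])

# np.lexsort sorts by the LAST key first, so pass the keys in reverse
# priority: column 0 is the primary key, column 1 breaks ties, then column 2.
order = np.lexsort((a[:, 2], a[:, 1], a[:, 0]))
sorted_rows = a[order]
print(sorted_rows)
# [[0 1 7]
#  [0 5 3]
#  [1 0 4]
#  [1 2 9]]
```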

In [None]:
ann_extr_file_pairs

{'sarsdouble.pdf': ['data-sars-double.json', 'extractions_sarsdouble.json'],
 'modeling_covid_italy.pdf': ['data_modeling_covid_italy.json',
  'extractions_modeling_covid_italy--COSMOS-data.json'],
 'response-to-covid-19-was-italy-unprepared.pdf': ['data-response-to-covid-19-was-italy-unprepared.json',
  'extractions_response-to-covid-19-was-italy-unprepared--COSMOS-data.json']}

In [None]:
for ann, extr in list(ann_extr_file_pairs.values())[:1]:
  print("This extractions file ",extr)
  with open(os.path.join(path,extractions_path, extr ), "r", encoding='UTF-8') as f:
    contents = f.readlines()
    extractions = json.loads(contents[0])
    

This extractions file  extractions_sarsdouble.json


In [None]:
# events
#
# For each event:
#   eventId
#   page number, block number, sentence id -> linear order
#
# Example:
#   eventId:               1, 2, 3, 4, 5
#   linear sentence order: 6, 10, 12, 12, 14
#
# Contexts to locate:
#   location context
#   temporal context
#
# 1) The linear order is missing from the TextReadingPipeline output (the extractions JSON).
# 2) The location context comes from the annotations:
#    for each eventID in eventMentions in the annotations:
#      locations = ["Italy", "Rome"]
#      map the linear order over the entire extractions to the linear order for this eventID
#      use the sentenceID to look up the documents, search the sentence text starting from this sentenceID,
#      and get the closest sentenceID where this location's text is found
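
The plan in the notes above could be turned into a distance tally along these lines. This is only a sketch: `closest_sentence_with`, the toy sentences, and the `events` mapping are hypothetical, and searching backwards only (towards earlier sentences) is an assumption.

```python
from collections import Counter


def closest_sentence_with(text_by_order, phrase, start_order):
    """Walk backwards from start_order and return the first linear order
    whose sentence text contains phrase, or None if there is no match."""
    for order in range(start_order, 0, -1):
        if phrase.lower() in text_by_order.get(order, "").lower():
            return order
    return None


# Hypothetical sentences keyed by document linear order.
text_by_order = {
    1: "Italy reported its first cases in February.",
    2: "Rome introduced local restrictions.",
    3: "Hospital admissions rose sharply.",
    4: "Intensive care units reached capacity.",
}
# Hypothetical events: event id -> (sentence linear order, annotated locations).
events = {"E:1": (3, ["Italy", "Rome"]), "E:2": (4, ["Italy"])}

# Tally how far back each location was found; this is the histogram data.
distances = Counter()
for event_id, (order, locations) in events.items():
    for location in locations:
        found = closest_sentence_with(text_by_order, location, order)
        if found is not None:
            distances[order - found] += 1

print(sorted(distances.items()))
# [(1, 1), (2, 1), (3, 1)]
```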