## Approach for searching the contexts
- For each pdf in pdf_names:
  - Build a contexts dictionary from {this_pdf_name}_events.xlsx: each context is a key, and its value holds doc_id, block_id, sentence_number, and the **linear_order_number of every extraction** that carries that context.
  - Build doc_sentence_map from {this_pdf_name}_all.csv with doc_id, sentence_text, sentence_id, and doc_sent_linear_order.
  - For each context key in the contexts dictionary:
    - Read this_doc_id and sentence_number from the contexts dictionary.
    - From doc_sentence_map, take all sentences up to the event's **doc_sent_linear_order**.
    - Find the 3 closest (doc_id, sentence_number) pairs in which the context occurs, searching only up to (this_doc_id, this_sentence_number).
    - For each of these nearest matches, compute the distance as the difference in **doc_sent_linear_order** between the contexts dictionary value and the matching doc_sentence_map entry.
- Plot a histogram of the distances.
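
The search steps above can be sketched as follows. This is a minimal illustration rather than the notebook's implementation: the helper name `nearest_context_distances`, the `(doc_sent_linear_order, sentence_text)` tuple layout, and the case-insensitive substring test are all assumptions made for the example.

```python
def nearest_context_distances(sentences, context, event_order, k=3):
    """Distances from an event to the k closest earlier sentences mentioning `context`.

    sentences: list of (doc_sent_linear_order, sentence_text) pairs (assumed layout).
    event_order: doc_sent_linear_order of the sentence that contains the event.
    """
    needle = context.lower()
    # Only sentences up to and including the event's own sentence are searched.
    hits = [order for order, text in sentences
            if order <= event_order and needle in text.lower()]
    # Smallest gap in linear order first, keep at most k matches.
    distances = sorted(event_order - order for order in hits)
    return distances[:k]


# Toy data (hypothetical sentences, not taken from the papers).
sentences = [
    (1, "The outbreak started in Guangdong."),
    (2, "Cases were later reported in Hong Kong."),
    (3, "Hospitals were placed under strain."),
    (4, "Hong Kong closed its schools."),
    (5, "The number of cases then declined."),
]
distances = nearest_context_distances(sentences, "Hong Kong", event_order=5)
print(distances)  # [1, 3]
```

The distances collected over all events and contexts are the values that would feed the histogram.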

## Approach for ordering the documents, extractions
- For each pdf in pdf_names:
  - Sort by doc_id, block_id, sentence_number and add linear_order_number, using {this_pdf_name}_all_linear_order.xlsx.
  - Construct doc_sentence_map from {this_pdf_name}_all.csv with doc_id, sentence_text, sentence_id, and add doc_sent_linear_order.
  - To each sorted (doc_id, block_id, sentence_number, linear_order_number) row, add the matching doc_sent_linear_order.
  - Save only the events from this list, with doc_id, block_id, sentence_number, linear_order_number, doc_sent_linear_order, the contexts, and the sentence text, to {this_pdf_name}_events.xlsx.
  - Two orderings are needed because mentions["documents"] and mentions["extractions"] do not match: mentions["extractions"] have page numbers and block numbers, whereas the COSMOS JSON does not have block numbers. So the extractions are sorted by doc_id, block_id, sentence_number and given a linear_order_number, the documents are sorted by doc_id and sentence_number and given their own linear order number, and the two are combined so that contexts can be searched within documents by document linear order.
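
The two orderings and their combination can be sketched with plain Python. This is an illustrative sketch under assumptions: `add_linear_order` is a hypothetical helper, and the row dictionaries below are toy data that only mimic the column names used in this notebook.

```python
def add_linear_order(rows, keys, name):
    """Sort rows by `keys` and number them 1..n under the column `name`."""
    ordered = sorted(rows, key=lambda r: tuple(r[k] for k in keys))
    return [dict(row, **{name: i}) for i, row in enumerate(ordered, start=1)]


# Hypothetical extraction rows (block numbers are available here).
extractions = [
    {"doc_id": "b", "blk_id": 1, "sentence_number": 0, "event": "e3"},
    {"doc_id": "a", "blk_id": 2, "sentence_number": 2, "event": "e2"},
    {"doc_id": "a", "blk_id": 2, "sentence_number": 0, "event": "e1"},
]
# Hypothetical document sentences (no block numbers).
doc_sentences = [
    {"doc_id": "b", "sentence_number": 0},
    {"doc_id": "a", "sentence_number": 2},
    {"doc_id": "a", "sentence_number": 1},
    {"doc_id": "a", "sentence_number": 0},
]

ordered_events = add_linear_order(
    extractions, ["doc_id", "blk_id", "sentence_number"], "linear_order_number")
ordered_sents = add_linear_order(
    doc_sentences, ["doc_id", "sentence_number"], "doc_sent_linear_order")

# Combine the two orderings on (doc_id, sentence_number).
sent_order = {(s["doc_id"], s["sentence_number"]): s["doc_sent_linear_order"]
              for s in ordered_sents}
for ev in ordered_events:
    ev["doc_sent_linear_order"] = sent_order[(ev["doc_id"], ev["sentence_number"])]

summary = [(e["event"], e["linear_order_number"], e["doc_sent_linear_order"])
           for e in ordered_events]
print(summary)  # [('e1', 1, 1), ('e2', 2, 3), ('e3', 3, 4)]
```

The event order and the sentence order diverge as soon as a document contains sentences without events, which is why both numbers are kept.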

In [1]:
!pip install altair vega_datasets
!pip install vega
!pip install altair_viewer
# textwrap is part of the Python standard library and needs no pip install

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting vega
  Downloading vega-4.0.0-py3-none-any.whl (3.1 MB)
Collecting ipytablewidgets<0.4.0,>=0.3.0 (from vega)
  Downloading ipytablewidgets-0.3.1-py2.py3-none-any.whl (190 kB)
Collecting jupyter<2.0.0,>=1.0.0 (from vega)
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting traittypes>=0.0.6 (from ipytablewidgets<0.4.0,>=0.3.0->vega)
  Downloading traittypes-0.2.1-py2.py3-none-any.whl (8.6 kB)
Collecting lz4 (from ipytable

In [2]:
import numpy as np
import pandas as pd
import os
import json
import io

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
!ls
%cd "/content/drive/MyDrive/Colab Notebooks/skema/data"
!ls

drive  sample_data
/content/drive/MyDrive/Colab Notebooks/skema/data
'1471-2334-3-19 (2) (2).pdf'
 cosmos-and-extractions-jsons-for-3-papers
 data_modeling_covid_italy.json
 data-response-to-covid-19-was-italy-unprepared.json
 data-sars-double.json
 doc_pg_blk_sent_event.xlsx
 event_linear_order_modeling_covid_italy.xlsx
 event_linear_order_modeling.xlsx
 event_linear_order_response_to_covid_19_was_italy_unprepared.xlsx
 event_linear_order_sarsdouble.xlsx
 modeling_covid_italy_all.csv
 modeling_covid_italy_all.gsheet
 modeling_covid_italy_all.json
 modeling_covid_italy_all_linear_order_5_17.xlsx
 modeling_covid_italy_all_linear_order.xlsx
 modeling_covid_italy_all.xlsx
 modeling_covid_italy_events.csv
 modeling_covid_italy_events.gsheet
 modeling_covid_italy_events.xlsx
 modeling_covid_italy.json
 modeling_covid_italy.xlsx
'modelling_doc_event (1).xlsx'
 modelling_doc_event.xlsx
 response-to-covid-19-was-italy-unprepared_all.csv
 response-to-covid-19-was-italy-unprepared_all.gsheet
 re

In [5]:
path = "/content/drive/MyDrive/Colab Notebooks/skema/data/"
os.path.exists(path)

True

In [6]:
filenames = ["sarsdouble.xlsx", "modeling_covid_italy.xlsx", "response-to-covid-19-was-italy-unprepared.xlsx"]
diff_distance_map = []
for filename in filenames:
  print("\n**********************************************************************************\n")
  print("The annotated extractions file name %s " %(filename))
  # The files are read as CSV text: this notebook writes its .xlsx files with to_csv.
  df = pd.read_csv(os.path.join(path, filename))
  # Strip a leading or trailing single quote from the context strings.
  df['locationContext'] = df['locationContext'].replace({"^'|'$": ""}, regex=True)
  df['temporalContext'] = df['temporalContext'].replace({"^'|'$": ""}, regex=True)

  # Count rows whose pg_num and page_num columns disagree.
  n_mismatch = int((df['pg_num'] != df['page_num']).sum())
  print("Number of page numbers different from extractions with that of manually annotated are : %d out of %d" %(n_mismatch, df.shape[0]) )
  # Number of events per location context.
  gp = df.groupby(['locationContext']).count()
  print(gp.reset_index()[['locationContext', 'event_id']])


**********************************************************************************

The annotated extractions file name sarsdouble.xlsx 
Number of page numbers different from extractions with that of manually annotated are : 11 out of 172
                                      locationContext  event_id
0                                             Beijing        17
1                                     China,Hong Kong        15
2                    China,Hong Kong,Beijing,Shanghai         1
3     China,Hong Kong,Mainland China,Beijing,Shanghai         1
4             Europe,Inner Mongolia,Hong Kong,Beijing         1
5   Guangdong Province,Hong Kong,China,Mainland Ch...         2
6   Guangdong Province,Hong Kong,Mainland China,Be...         2
7                           Guangdong,China,Hong Kong         1
8                                           Hong Kong        43
9                              Hong Kong,Amoy Gardens        21
10                                Hong Kong,Guangdong   

In [7]:
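# This is a count of mismatching rows (not a distance), taken on the last file loaded in the loop above.
print("Distances between annotated page numbers cosmos page numbers: ", len(df[df['pg_num'] != df["page_num"]]))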

Distances between annotated page numbers cosmos page numbers:  5


In [8]:
path = "/content/drive/MyDrive/Colab Notebooks/skema/data/"
extractions_path = "cosmos-and-extractions-jsons-for-3-papers"
annotated_files = ["data-sars-double.json", "data_modeling_covid_italy.json", "data-response-to-covid-19-was-italy-unprepared.json"]
extraction_files =["extractions_sarsdouble.json", "extractions_modeling_covid_italy--COSMOS-data.json","extractions_response-to-covid-19-was-italy-unprepared--COSMOS-data.json"]
paper_names = ["sarsdouble.pdf", "modeling_covid_italy.pdf", "response-to-covid-19-was-italy-unprepared.pdf"]
ann_extr_file_pairs = {}
for name,ann, extr in zip(paper_names,annotated_files, extraction_files):
  ann_extr_file_pairs[name] = [ann, extr]

In [9]:
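def save_extr_ann_file(path, filename, records):
  # Parameter renamed from `map` so it no longer shadows the built-in.
  # Note: to_csv writes CSV text even though the file name ends in .xlsx,
  # which is why later cells read these files back with pd.read_csv.
  pd.DataFrame.from_records(records).to_csv(os.path.join(path, filename+".xlsx"))

  with io.open(os.path.join(path, filename+".json"), 'w', encoding='utf-8') as f:
    f.write(json.dumps(records, ensure_ascii=False))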

In [10]:
def combine_ann_extr_all(path, extractions_path, extr, ann,filename ):
  with open(os.path.join(path,extractions_path, extr ), "r", encoding='UTF-8') as f:
    contents = f.readlines()
    extractions = json.loads(contents[0])
  with open(os.path.join(path, ann ), "r", encoding='UTF-8') as f:
    contents = f.read()
    annotations = json.loads(contents)

  print(len(annotations), len(extractions["mentions"]), len(extractions["documents"]))
  doc_sentence_map = {}
  linear_order = 1
  doc_ids = sorted(list(extractions["documents"].keys()))
  for doc_id in doc_ids:
    #print(doc_id, document)
    document = extractions["documents"][doc_id]
    for i,sentence in enumerate(document['sentences']):
      doc_sentence_map[(doc_id,i)] = {"sentence_text":sentence['words'], "sno":linear_order}
      linear_order += 1

  doc_event_map = []
  event_doc_map = {}
  for mention in extractions['mentions']:
    # if mention['id'].startswith("E:"):
    for att in mention["attachments"]:
      if "pageNum" in att.keys():
        this_text = doc_sentence_map[(mention['document'], mention['sentence'])]['sentence_text']
        this_linear_order = doc_sentence_map[(mention['document'], mention['sentence'])]['sno']

        doc_event_map.append({"doc_id":mention['document'], "pg_num":att["pageNum"][0],"blk_id":att["blockIdx"][0],"sentence_id":mention['sentence'], 
                              "doc_sentence_count":len(extractions["documents"][mention['document']]["sentences"]), 
                              "event_id":mention["id"], "event":mention["text"] }) #corrected_sent_number -> 1,2,5,4,6,7, => 1,2,3,4,5,6
        event_doc_map[mention["id"]] = {"doc_id":mention['document'],"pg_num":att["pageNum"][0],"blk_id":att["blockIdx"][0],"sentence_id":mention['sentence'], 
                              "doc_sentence_count":len(extractions["documents"][mention['document']]["sentences"]), 
                              "event_id":mention["id"], "event":mention["text"] , "sentence_text":this_text, "doc_sent_linear_order":this_linear_order}
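  event_ann_map = {}
  # Index the manual annotations by event id. The loop variable is named
  # `annotation` so that it does not shadow the `ann` file-name parameter.
  for annotation in annotations:
    event_ann_map[annotation['eventId']] = {"annotated_page_num":annotation["page_num"],"para_num":annotation["para_num"], "event":annotation["event"], 'locationContext': annotation['locationContext'],
    'temporalContext': annotation['temporalContext'],'explanation': annotation['explanation']}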
  
  empty_map = {"annotated_page_num":"","para_num":"", "event":"", 'locationContext': "",
    'temporalContext': "",'explanation': ""}
  event_extr_ann_map = []
  for event, values in event_doc_map.items():
    this_event = event_ann_map[event] if event in event_ann_map.keys() else empty_map
    event_extr_ann_map.append({"doc_id":values['doc_id'],"annotated_page_num":this_event["annotated_page_num"],"para_num":this_event["para_num"], "event_id":values["event_id"],
                               "event":this_event["event"], 'locationContext': ",".join(this_event['locationContext']),
    "sentence_text":",".join(values["sentence_text"]),
    'temporalContext': ",".join(this_event['temporalContext']),'explanation': this_event['explanation'], 'pg_num':values['pg_num'], 'blk_id':values['blk_id'], 
    'sentence_id':values['sentence_id'], 'doc_sentence_count':values['doc_sentence_count'], "doc_sent_linear_order":values["doc_sent_linear_order"]})


  df = pd.DataFrame.from_records(event_extr_ann_map)
  print(df.columns.to_list())
  print("\n")
  df['locationContext'] = df['locationContext'].replace({"^'|'$": ""}, regex=True)
  df['temporalContext'] = df['temporalContext'].replace({"^'|'$": ""}, regex=True)
  # Order the extractions by document, page, block and sentence, then number them 1..n.
  df.sort_values(by=['doc_id', 'pg_num','blk_id', 'sentence_id'], inplace=True)
  df['linear_order'] = np.arange(1, len(df) + 1)
  print("columns \n",df.columns)
  # Written with to_csv, so the file holds CSV text despite the .xlsx extension.
  df.to_csv(os.path.join(path, filename+"_linear_order_5_17"+".xlsx"))
  save_extr_ann_file(path, filename, event_extr_ann_map)
  

### Run the following cell to combine annotations and extractions

In [11]:
for key in paper_names:
  name = key
  ann, extr = ann_extr_file_pairs[name]
  combine_ann_extr_all(path, extractions_path, extr, ann, name[:-4]+"_all")

174 6212 91
['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event', 'locationContext', 'sentence_text', 'temporalContext', 'explanation', 'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count', 'doc_sent_linear_order']


columns 
 Index(['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event',
       'locationContext', 'sentence_text', 'temporalContext', 'explanation',
       'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count',
       'doc_sent_linear_order', 'linear_order'],
      dtype='object')
302 10045 76
['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event', 'locationContext', 'sentence_text', 'temporalContext', 'explanation', 'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count', 'doc_sent_linear_order']


columns 
 Index(['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event',
       'locationContext', 'sentence_text', 'temporalContext', 'explanation',
       'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count',
       'doc_sent_l

In [12]:
def get_doc_id_sentence_text(path, extractions_path, extr,filename ):
  with open(os.path.join(path,extractions_path, extr ), "r", encoding='UTF-8') as f:
    contents = f.readlines()
    extractions = json.loads(contents[0])

  print(len(extractions["documents"]))
  doc_sentence_map = []
  linear_order = 1
  doc_ids = sorted(list(extractions["documents"].keys()))
  for doc_id in doc_ids:
    #print(doc_id, document)
    document = extractions["documents"][doc_id]
    for i,sentence in enumerate(document['sentences']):
      doc_sentence_map.append({"doc_id":doc_id,"sentence_id":i,"sentence_text":" ".join(sentence['words']),
                               "raw_text":" ".join(sentence['raw']), "text":document['text'], "sno":linear_order})
      if len(sentence['words']) != len(sentence['raw']):
        print("words and raw words lengths are not equal!")
      linear_order += 1
  pd.DataFrame.from_records(doc_sentence_map).to_csv(os.path.join(path, filename+".csv"))

In [13]:
for key in paper_names:
  name = key
  ann, extr = ann_extr_file_pairs[name]
  get_doc_id_sentence_text(path, extractions_path, extr, name[:-4]+"_all")

91
76
47


In [14]:
filenames = ["sarsdouble_all.xlsx", "modeling_covid_italy_all.xlsx", "response-to-covid-19-was-italy-unprepared_all.xlsx"]

### Save all event extractions with linear orders

In [15]:
filenames = ["sarsdouble_all_linear_order_5_17.xlsx", "modeling_covid_italy_all_linear_order_5_17.xlsx", "response-to-covid-19-was-italy-unprepared_all_linear_order_5_17.xlsx"]
diff_distance_map = []
for filename in filenames:
  print("\n**********************************************************************************\n")
  print("The annotated extractions file name %s " %(filename))
  df = pd.read_csv(os.path.join(path, filename), index_col=False)
  df = df[df.columns[1:]]
  print(df.columns.to_list())
  print("\n")
  # .copy() makes event_df an independent frame, so sorting it does not raise SettingWithCopyWarning.
  event_df = df[df['event_id'].str.startswith("E:")].copy()
  print(len(event_df), len(df))
  f = filename.replace("-","_").split("_")
  f = "_".join(f[:f.index("all")])
  # Note: the name printed here ends in _event.xlsx, but the file written below is f + "_events.csv".
  print(f,f+"_event.xlsx")
  event_df = event_df.sort_values(by=['doc_id', 'pg_num','blk_id', 'sentence_id', 'linear_order'])
  event_df.to_csv(os.path.join(path, f+"_events.csv"))
  
  


**********************************************************************************

The annotated extractions file name sarsdouble_all_linear_order_5_17.xlsx 
['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event', 'locationContext', 'sentence_text', 'temporalContext', 'explanation', 'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count', 'doc_sent_linear_order', 'linear_order']


172 6108
sarsdouble sarsdouble_event.xlsx





**********************************************************************************

The annotated extractions file name modeling_covid_italy_all_linear_order_5_17.xlsx 
['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event', 'locationContext', 'sentence_text', 'temporalContext', 'explanation', 'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count', 'doc_sent_linear_order', 'linear_order']


302 10045
modeling_covid_italy modeling_covid_italy_event.xlsx





**********************************************************************************

The annotated extractions file name response-to-covid-19-was-italy-unprepared_all_linear_order_5_17.xlsx 
['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event', 'locationContext', 'sentence_text', 'temporalContext', 'explanation', 'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count', 'doc_sent_linear_order', 'linear_order']


42 4429
response_to_covid_19_was_italy_unprepared response_to_covid_19_was_italy_unprepared_event.xlsx




In [16]:
event_df[['doc_id','pg_num', 'blk_id','sentence_id', 'doc_sentence_count', 'linear_order']]

Unnamed: 0,doc_id,pg_num,blk_id,sentence_id,doc_sentence_count,linear_order
1916,-1949274406,2,4,3,13,1917
2102,-1949274406,2,4,10,13,2103
2156,-1949274406,2,4,12,13,2157
1519,-1810507189,8,2,1,15,1520
1588,-1810507189,8,2,5,15,1589
1618,-1810507189,8,2,8,15,1619
1680,-1810507189,8,2,11,15,1681
1198,-1675118546,11,2,1,9,1199
1213,-1675118546,11,2,2,9,1214
1294,-1675118546,11,2,6,9,1295


In [17]:
from collections import Counter

filenames = ["sarsdouble_all_linear_order_5_17.xlsx", "modeling_covid_italy_all_linear_order_5_17.xlsx", "response-to-covid-19-was-italy-unprepared_all_linear_order_5_17.xlsx"]
loc_context_counter = Counter()
temporal_context_counter = Counter()
loc_context_counters = {filename: Counter() for filename in filenames}
temporal_context_counters = {filename: Counter() for filename in filenames}

loc_tf = {}
with pd.ExcelWriter(os.path.join(path, "summarized_context_counter.xlsx")) as writer:

  for filename in filenames:
    print("\n**********************************************************************************\n")
    print("The annotated extractions file name %s " %(filename))
    df = pd.read_csv(os.path.join(path, filename), index_col=False)
    df = df[df.columns[1:]]
    df = df.fillna("")
    # Split the comma-separated context strings and count each non-empty label.
    loc_context_counter = Counter([l for l in sum(df['locationContext'].apply(lambda x: x.split(",")).to_list(), []) if l != ''])
    temporal_context_counter = Counter([l for l in sum(df['temporalContext'].apply(lambda x: x.split(",")).to_list(), []) if l != ''])
    print(loc_context_counter)
    print(temporal_context_counter)
    loc_context_counters[filename] = loc_context_counter
    temporal_context_counters[filename] = temporal_context_counter
    # Merge into a new Counter for the sheet, so the per-file location counts stored above are not modified.
    combined_counter = loc_context_counter + temporal_context_counter
    # filename[:-27] drops the "_all_linear_order_5_17.xlsx" suffix to form the sheet name.
    pd.DataFrame.from_dict(combined_counter, orient="index").reset_index().to_excel(writer, sheet_name=filename[:-27], index=False)



**********************************************************************************

The annotated extractions file name sarsdouble_all_linear_order_5_17.xlsx 
Counter({'Hong Kong': 128, 'Inner Mongolia': 39, 'Beijing': 39, 'Guangdong': 24, 'Amoy Gardens': 21, 'China': 20, 'Guangdong Province': 7, 'Shanghai': 6, 'Mainland China': 5, 'Europe': 1, 'Paris': 1})
Counter({' 2003': 304, 'March 17': 135, 'May 10': 135, '1983': 21, '1985': 21, '03/17': 10, '03/20': 10, '03/23': 10, '03/26': 10, '03/29': 10, '04/01': 10, '04/04': 10, '04/07': 10, '04/10': 10, '04/13': 10, '04/16': 10, '04/19': 10, '04/22': 10, '04/25': 10, '04/28': 10, '05/01': 10, '05/04': 10, '05/07': 10, '05/10': 10, 'February 21st': 9, 'February 21': 9, '17 March': 8, '10 May': 8, 'November 2002': 5, 'February 22': 4})

**********************************************************************************

The annotated extractions file name modeling_covid_italy_all_linear_order_5_17.xlsx 
Counter({'Italy': 253, 'Lodi Province'



In [18]:
df.columns

Index(['doc_id', 'annotated_page_num', 'para_num', 'event_id', 'event',
       'locationContext', 'sentence_text', 'temporalContext', 'explanation',
       'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count',
       'doc_sent_linear_order', 'linear_order'],
      dtype='object')

In [19]:
from collections import Counter

filenames = ["sarsdouble_all_linear_order_5_17.xlsx", "modeling_covid_italy_all_linear_order_5_17.xlsx", "response-to-covid-19-was-italy-unprepared_all_linear_order_5_17.xlsx"]

loc_tf = {}
# with pd.ExcelWriter(os.path.join(path, "summarized_context_counter.xlsx")) as writer:

for filename in filenames:
  print("\n**********************************************************************************\n")
  print("The annotated extractions file name %s " %(filename))
  df = pd.read_csv(os.path.join(path, filename), index_col=False)
  df = df[df.columns[1:]]
  df = df.fillna("")
  



**********************************************************************************

The annotated extractions file name sarsdouble_all_linear_order_5_17.xlsx 

**********************************************************************************

The annotated extractions file name modeling_covid_italy_all_linear_order_5_17.xlsx 

**********************************************************************************

The annotated extractions file name response-to-covid-19-was-italy-unprepared_all_linear_order_5_17.xlsx 


In [20]:
pd.DataFrame.from_dict(loc_context_counter, orient="index").reset_index()

Unnamed: 0,index,0
0,Italy,38
1,Lombardy,8
2,Rome,4
3,western European countries,1
4,Spain,1
5,northern regions,1
6,Italian,1
7,Veneto,1
8,Emilia-Romagna,1
9,Piedmont,1


In [21]:
df = df.fillna("")
Counter([l for l in sum(df['locationContext'].apply(lambda x: x.split(",")).to_list(), []) if l != ''])
Counter([l for l in sum(df['temporalContext'].apply(lambda x: x.split(",")).to_list(), []) if l != ''])

Counter({'between 2000 and 2017': 3,
         '2017': 2,
         'pre-COVID-19': 1,
         '2014': 1,
         '12-week period': 2,
         '2020': 3,
         'April 2020': 2,
         'COVID-19 outbreak': 1,
         'COVID-19 pandemic': 1,
         '2019': 2,
         'COVID-19 health crisis': 1,
         'COVID-19 crisis': 5,
         '2005': 2,
         'more recently': 1,
         '2021–2023': 1,
         '31st January 2020': 1,
         'on 3rd February': 1,
         'between 31st January 2020': 1,
         'beginning of June': 1,
         'two months after the beginning of the first wave': 1,
         '1 January–30 April 2015–2019': 1,
         'during the COVID-19 emergency': 1,
         'during the early days of the COVID-19 outbreak': 2,
         'since the early 1990s': 4,
         '2004': 1,
         '2007': 1,
         'early on in the pandemic (March 2020)': 1,
         'between 1st February and 14th April': 2})
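
The counting cells above flatten the per-row lists with `sum(lists, [])`, which copies the accumulated list on every addition. A possible alternative, sketched here on toy strings and assuming the same comma-separated format, uses `itertools.chain`; `count_contexts` is a hypothetical helper name.

```python
from collections import Counter
from itertools import chain


def count_contexts(values):
    """Count comma-separated context labels, skipping empty entries."""
    parts = chain.from_iterable(v.split(",") for v in values)
    return Counter(p for p in parts if p != "")


# Toy input in the same shape as the temporalContext column after fillna("").
temporal = ["2020,April 2020", "", "2020", "COVID-19 crisis,2020"]
counts = count_contexts(temporal)
print(counts["2020"])  # 3
```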

### Explore TF-IDF solution and N-GRAM solution since contexts can be phrases

#### Exploring TF-IDF for finding closest context

In [23]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

filenames = ["sarsdouble_all.csv", "modeling_covid_italy_all.csv", "response-to-covid-19-was-italy-unprepared_all.csv"]
events_filenames = {"sarsdouble_all.csv":"sarsdouble_events.csv", "modeling_covid_italy_all.csv":"modeling_covid_italy_events.csv", 
                    "response-to-covid-19-was-italy-unprepared_all.csv":"response_to_covid_19_was_italy_unprepared_events.csv"}

loc_tf = {}
# with pd.ExcelWriter(os.path.join(path, "summarized_context_counter.xlsx")) as writer:

for filename in filenames:
  print("\n**********************************************************************************\n")
  print("The annotated extractions file name %s " %(filename))
  df = pd.read_csv(os.path.join(path, filename), index_col=False)

  sentences = df['sentence_text'].to_list()
  count_vectorizer = CountVectorizer()
  count_vectorizer.fit_transform(sentences)
  print ("Vocabulary: ")
  print(count_vectorizer.vocabulary_)
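  # Columns of the count matrix follow the feature index, which is not the
  # insertion order of vocabulary_, so take the names in index order.
  vocab = list(count_vectorizer.get_feature_names_out())
  print(vocab)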


  events_df = pd.read_csv(os.path.join(path, events_filenames[filename]), index_col=False)
  # fillna returns a new frame, so the result has to be assigned back.
  events_df = events_df.fillna("")
  events_df['locationContext'] = events_df['locationContext'].replace({"^'|'$": ""}, regex=True)
  events_df['temporalContext'] = events_df['temporalContext'].replace({"^'|'$": ""}, regex=True)
  events_df['locationContext'] = events_df['locationContext'].str.replace('nan', "")
  events_df['locationContext'] = events_df['locationContext'].str.replace('\'','')
  events_df['temporalContext'] = events_df['temporalContext'].str.replace('nan', "")
  events_df['temporalContext'] = events_df['temporalContext'].str.replace('\'','')
  
  location_contexts = []
  temporal_contexts = []
  for context in events_df["locationContext"].str.replace('nan', "").tolist():
    context = str(context).replace("nan", "").lower() if type(context) == float else context.lower()
    location_contexts.extend(context.split(",")) 
  for context in events_df["temporalContext"].str.replace('nan', "").tolist():
    context = str(context).replace("nan", "").lower() if type(context) == float else context.lower()
    temporal_contexts.extend(context.split(","))
  location_contexts = list(set(location_contexts))
  temporal_contexts = list(set(temporal_contexts))
  print(len(location_contexts), len(set(location_contexts)))
  print(len(temporal_contexts), len(set(temporal_contexts)))

  freq_term_matrix = count_vectorizer.transform(location_contexts)
  print("Frequency Term matrix")
  print (freq_term_matrix.todense())

  count_array = freq_term_matrix.toarray()
  df = pd.DataFrame(data=count_array, columns=vocab)
  #print(df)

  tfidf = TfidfTransformer(norm="l2")
  tfidf.fit(freq_term_matrix)
  print ("IDF:")
  print(tfidf.idf_)
  freq_term_matrix = count_vectorizer.transform(temporal_contexts)
  print("Frequency Term matrix")
  print (freq_term_matrix.todense())

  count_array = freq_term_matrix.toarray()
  df = pd.DataFrame(data=count_array, columns=vocab)
  #print(df)

  tfidf = TfidfTransformer(norm="l2")
  tfidf.fit(freq_term_matrix)
  print ("IDF:")
  print(tfidf.idf_)
  #print(df.head)
  


**********************************************************************************

The annotated extractions file name sarsdouble_all.csv 
Vocabulary: 
{'background': 181, 'since': 1149, 'november': 855, '2002': 29, 'and': 133, 'perhaps': 916, 'earlier': 431, 'an': 129, 'outbreak': 886, 'of': 867, 'very': 1332, 'contagious': 309, 'atypical': 172, 'pneumonia': 933, 'now': 856, 'named': 831, 'severe': 1125, 'acute': 94, 'respiratory': 1072, 'syndrome': 1222, 'initiated': 696, 'in': 660, 'the': 1239, 'guangdong': 586, 'province': 995, 'china': 254, 'this': 1247, 'started': 1182, 'world': 1374, 'wide': 1362, 'epidemic': 459, 'after': 106, 'medical': 798, 'doctor': 410, 'from': 547, 'guangzhou': 587, 'infected': 683, 'several': 1124, 'persons': 921, 'at': 169, 'hotel': 622, 'kowloon': 734, 'around': 158, 'february': 513, '21st': 32, '2003': 30, 'sar': 1096, 'hong': 617, 'kong': 733, 'although': 125, 'apparently': 142, 'classical': 264, 'its': 727, 'onset': 872, 'pattern': 909, 'became': 1
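
The cell above prints the vocabulary and IDF values but does not yet rank sentences against a context. A minimal sketch of that ranking step is given below in pure Python, as a stand-in for the scikit-learn objects used above. The helper names, the tokenizer, and the toy sentences are assumptions; the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, which is the formula documented for `TfidfTransformer` with `smooth_idf=True`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased tokens of two or more word characters (an assumption for this sketch).
    return re.findall(r"\w\w+", text.lower())


def tfidf_vectors(docs):
    """Return (vectors, idf): one L2-normalized {term: weight} dict per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter(term for c in counts for term in c)  # document frequency
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for c in counts:
        vec = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors, idf


def rank_sentences(sentences, context):
    """Rank sentence indices by cosine similarity to the context phrase."""
    vectors, idf = tfidf_vectors(sentences)
    q = Counter(t for t in tokenize(context) if t in idf)
    qvec = {t: tf * idf[t] for t, tf in q.items()}
    qnorm = math.sqrt(sum(w * w for w in qvec.values())) or 1.0
    scores = [sum(w * vec.get(t, 0.0) for t, w in qvec.items()) / qnorm
              for vec in vectors]
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy sentences (hypothetical, loosely modeled on the vocabulary above).
sentences = [
    "The epidemic started in Guangdong province.",
    "A doctor infected several persons at a hotel in Hong Kong.",
    "The number of cases declined in May.",
]
ranking, scores = rank_sentences(sentences, "Hong Kong")
print(ranking[0])  # 1
```

Because the weights are per word, a phrase such as "Hong Kong" scores on its individual words; matching it as a phrase is what the n-gram approach in the next section addresses.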

#### N-Gram solution since contexts can be phrases

In [25]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [26]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.collocations import *

filenames = ["sarsdouble_all.csv", "modeling_covid_italy_all.csv", "response-to-covid-19-was-italy-unprepared_all.csv"]
events_filenames = {"sarsdouble_all.csv":"sarsdouble_events.csv", "modeling_covid_italy_all.csv":"modeling_covid_italy_events.csv", 
                    "response-to-covid-19-was-italy-unprepared_all.csv":"response_to_covid_19_was_italy_unprepared_events.csv"}

loc_tf = {}
# with pd.ExcelWriter(os.path.join(path, "summarized_context_counter.xlsx")) as writer:

for filename in filenames:
  print("\n**********************************************************************************\n")
  print("The annotated extractions file name %s " %(filename))
  df = pd.read_csv(os.path.join(path, filename), index_col=False)

  words = df['sentence_text'].to_list()
  sentences = []
  for sent in df['text'].to_list():
    sentences.extend(sent.split("\n"))

  events_df = pd.read_csv(os.path.join(path, events_filenames[filename]), index_col=False)
  # fillna returns a new frame, so the result has to be assigned back.
  events_df = events_df.fillna("")
  events_df['locationContext'] = events_df['locationContext'].replace({"^'|'$": ""}, regex=True)
  # events_df['temporalContext'] = events_df['temporalContext'].replace({"^'|'$": ""}, regex=True)
  events_df['locationContext'] = events_df['locationContext'].str.replace('nan', "")
  events_df['locationContext'] = events_df['locationContext'].str.replace('\'','')
  # events_df['temporalContext'] = events_df['temporalContext'].str.replace('nan', "")
  # events_df['temporalContext'] = events_df['temporalContext'].str.replace('\'','')
  
  location_contexts = []
  temporal_contexts = []
  for context in events_df["locationContext"].str.replace('nan', "").tolist():
    context = str(context).replace("nan", "").lower() if type(context) == float else context.lower()
    location_contexts.extend(context.split(",")) 
  for context in events_df["temporalContext"].str.replace('nan', "").tolist():
    context = str(context).replace("nan", "").lower() if type(context) == float else context.lower()
    temporal_contexts.extend(context.split(","))
  location_contexts = list(set(location_contexts))
  temporal_contexts = list(set(temporal_contexts))
  print(len(location_contexts), len(set(location_contexts)))
  print(len(temporal_contexts), len(set(temporal_contexts)))

  tokens = [word.casefold() for sentence in sentences for sent in sent_tokenize(sentence) for word in word_tokenize(sent)]
  # Unigram and bigram frequency distributions feed the collocation finder.
  word_fd = nltk.FreqDist(tokens)
  bigram_fd = nltk.FreqDist(nltk.bigrams(tokens))
  finder = BigramCollocationFinder(word_fd, bigram_fd)
  bigram_measures = nltk.collocations.BigramAssocMeasures()
  # Score every bigram by relative frequency; also show the 5 best by PMI.
  scored = finder.score_ngrams(bigram_measures.raw_freq)
  print(finder.nbest(bigram_measures.pmi, 5))
  print(scored)

  fourgram_measures = nltk.collocations.QuadgramAssocMeasures()
  finder_4grams = QuadgramCollocationFinder.from_words(tokens)
  scored_4grams = finder_4grams.score_ngrams(fourgram_measures.raw_freq)
  print(scored_4grams)
  # finder_4grams.apply_word_filter(lambda w: len(w) < 3)


**********************************************************************************

The annotated extractions file name sarsdouble_all.csv 
12 12
31 31
[('3centre', "d'enseignement"), ('4genetique', 'des'), ('antoine', 'danchin'), ('calcul', 'scientifique-enpc'), ('contributions', 'tw')]
[(('of', 'the'), 0.011334990699049042), (('.', 'the'), 0.00565011039446463), (('in', 'the'), 0.004433163540272248), (('hong', 'kong'), 0.004224544079553554), (('(', ')'), 0.003459606056918343), (('from', 'the'), 0.0034248361467985604), (('the', 'epidemic'), 0.0033900662366787783), ((',', 'the'), 0.003355296326558996), (('number', 'of'), 0.00326837155125954), (('to', 'the'), 0.003077137045600737), ((')', '.'), 0.0029902122703012813), (('.', 'we'), 0.002746822899462805), (('0', ')'), 0.002746822899462805), (('(', '0'), 0.002729437944402914), (('that', 'the'), 0.0027120529893430224), (('the', 'disease'), 0.0026946680342831313), (('.', 'this'), 0.0026425131691034578), (('it', 'is'), 0.0026425131691034578)

In [27]:
location_contexts

['',
 'italian regions',
 'rome',
 'italian',
 'piedmont',
 'spain',
 'european countries and the usa',
 'emilia-romagna',
 'northern regions',
 'northern italy',
 'western european countries',
 'lombardy',
 'italy',
 'veneto']

In [28]:
for loc in location_contexts:
  if loc:
    print(loc)
    for t in scored_4grams:
      words, score = t
      text = " ".join(words)
      if loc in text:
        print(loc, text )

italian regions
italian regions affected italian regions ,
italian regions by the italian regions
italian regions italian regions , which
italian regions italian regions , with
italian regions the italian regions ,
italian regions unevenly affected italian regions
rome
rome cases in rome ,
rome covid-19 cases in rome
rome in rome , and
rome rome , and the
rome acute coronary syndrome at
rome coronary syndrome at 15
rome for acute coronary syndrome
rome syndrome at 15 hospitals
rome , rome , italy
rome of rome , rome
rome rome , italy and
rome rome , rome ,
rome sapienza university of rome
rome university of rome ,
italian
italian , the italian government
italian italian response to the
italian the italian response to
italian 1 of the italian
italian affected italian regions ,
italian by the italian regions
italian first italian covid-19 positive
italian has unevenly affected italian
italian italian covid-19 positive patient
italian italian regional policy responses
italian italian regi
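
Note that the substring test `loc in text` also fires inside longer words: in the output above, `rome` matches 4-grams such as `acute coronary syndrome at` because `syndrome` ends in `rome`. A minimal sketch of a token-level match, assuming the 4-grams are tuples of lowercased tokens as returned by `score_ngrams`; the helper name `contains_phrase` and the sample tuples are illustrative only:

```python
def contains_phrase(ngram_tokens, phrase):
    """Return True if the words of `phrase` occur as consecutive whole tokens."""
    phrase_tokens = phrase.split()
    n = len(phrase_tokens)
    if n == 0:
        return False  # an empty context never matches
    return any(
        list(ngram_tokens[i:i + n]) == phrase_tokens
        for i in range(len(ngram_tokens) - n + 1)
    )

# Hypothetical sample in the same (ngram, score) shape as scored_4grams.
sample_scored = [
    (("cases", "in", "rome", ","), 0.001),
    (("acute", "coronary", "syndrome", "at"), 0.001),
    (("the", "italian", "regions", ","), 0.001),
]

for loc in ["rome", "italian regions"]:
    for words, _score in sample_scored:
        if contains_phrase(words, loc):
            print(loc, "|", " ".join(words))
```

With this check, `rome` is reported only for 4-grams that contain the token `rome` itself.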

In [29]:
temporal_contexts

['',
 'between 2000 and 2017',
 'covid-19 crisis',
 'during the early days of the covid-19 outbreak',
 '31st january 2020',
 '2005',
 'on 3rd february',
 '1 january–30 april 2015–2019',
 'since the early 1990s',
 'covid-19 outbreak',
 'april 2020',
 '2019',
 '2020',
 '12-week period',
 'covid-19 pandemic',
 'pre-covid-19',
 'between 31st january 2020',
 '2017',
 'beginning of june',
 'more recently',
 '2007',
 '2021–2023',
 '2004',
 'during the covid-19 emergency',
 'early on in the pandemic (march 2020)',
 'covid-19 health crisis',
 'two months after the beginning of the first wave',
 '2014',
 'between 1st february and 14th april']

In [31]:
len(tfidf.idf_), len(vocab)

(1803, 1803)

In [32]:
for ctx in location_contexts:
  if ctx in count_vectorizer.vocabulary_.keys():
    print(ctx,count_vectorizer.vocabulary_[ctx])
  else:
    print(ctx, " does not exist in this vocab")

  does not exist in this vocab
italian regions  does not exist in this vocab
rome 1454
italian 939
piedmont 1228
spain 1538
european countries and the usa  does not exist in this vocab
emilia-romagna  does not exist in this vocab
northern regions  does not exist in this vocab
northern italy  does not exist in this vocab
western european countries  does not exist in this vocab
lombardy 1013
italy 941
veneto 1746
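
The multi-word contexts are most likely missing because the vectorizer was built with default settings (an assumption, since the cell that creates `count_vectorizer` is not shown here). By default `CountVectorizer` indexes single tokens matching `(?u)\b\w\w+\b`, so a phrase such as `northern italy` never becomes a vocabulary key and `emilia-romagna` is split at the hyphen. Passing an `ngram_range` that covers the longest context is one option; another is to look up each token of a context separately. A sketch of the latter, with a plain dict standing in for `count_vectorizer.vocabulary_` (the indices for `northern`, `emilia` and `romagna` are invented for illustration):

```python
import re

# Same pattern as scikit-learn's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def context_token_ids(ctx, vocabulary):
    """Return the vocabulary ids of every token in ctx, or None if any is missing."""
    tokens = TOKEN_PATTERN.findall(ctx.lower())
    if not tokens:
        return None
    ids = [vocabulary.get(tok) for tok in tokens]
    return None if any(i is None for i in ids) else ids

# Hypothetical stand-in for count_vectorizer.vocabulary_.
toy_vocab = {"italy": 941, "northern": 1100, "emilia": 600, "romagna": 1450}

for ctx in ["italy", "northern italy", "emilia-romagna", "spain", ""]:
    print(repr(ctx), context_token_ids(ctx, toy_vocab))
```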


In [33]:
if "italy" in count_vectorizer.vocabulary_.keys():
  print("Italy")

Italy


##### numpy way of sorting multiple columns

In [34]:
a = np.array([[9, 2, 3],
           [4, 5, 6],
           [7, 0, 5]])
a1 = a[a[:, 0].argsort()]
a2 = a1[a1[:, 1].argsort()]
a3 = a2[a2[:, 2].argsort()]
a1,a2,a3

(array([[4, 5, 6],
        [7, 0, 5],
        [9, 2, 3]]),
 array([[7, 0, 5],
        [9, 2, 3],
        [4, 5, 6]]),
 array([[9, 2, 3],
        [7, 0, 5],
        [4, 5, 6]]))
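
Chaining `argsort` calls as above re-sorts the whole array by one column at a time, so `a3` ends up ordered by the last column only, and with the default (non-stable) sort the earlier orderings are not guaranteed to survive as tie-breakers. A sketch of a true multi-column sort with `np.lexsort`, which takes its keys in reverse priority order (the last key is the primary one); the sample array `b` is illustrative:

```python
import numpy as np

b = np.array([[2, 1, 9],
              [1, 5, 3],
              [2, 0, 7],
              [1, 5, 1]])

# Sort by column 0, break ties with column 1, then with column 2.
order = np.lexsort((b[:, 2], b[:, 1], b[:, 0]))
b_sorted = b[order]
print(b_sorted)
```

The same ordering can be obtained with chained `argsort(kind="stable")` calls, provided the least significant column is sorted first and the most significant one last.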

## Solution: for searching the contexts
- For each pdf in pdf_names:
  - First construct a contexts dictionary keyed by context, whose values hold doc_id, block_id, sentence_number and **all extractions' linear_order_number** from {this_pdf_name}_events.xlsx
  - Build doc_sentence_map with doc_id, sentence_text, sentence_id, doc_sent_linear_order from {this_pdf_name}_all.csv
  - For each context key in the contexts dictionary:
    - take this_doc_id and sentence_number from the contexts dictionary
    - from doc_sentence_map, get all sentences up to and including the event's **doc_sent_linear_order**
    - among those sentences, get the 3 closest doc_id, sentence_number pairs where the context text occurs
    - for each of these nearest matches, calculate the distance as the difference between the event's **doc_sent_linear_order** and the matching sentence's **doc_sent_linear_order**
- Plot the histogram of these distances
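
The inner steps of this list can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: it assumes the document sentences are available as `(doc_sent_linear_order, sentence_text)` pairs, matches case-insensitively, and uses a hypothetical helper name `nearest_context_distances` with made-up sample sentences.

```python
def nearest_context_distances(ctx, event_order, doc_sentences, top_k=3):
    """Distances from an event to the closest earlier sentences mentioning ctx.

    doc_sentences: iterable of (doc_sent_linear_order, sentence_text) pairs.
    Only sentences at or before event_order are considered.
    """
    needle = ctx.casefold()
    if not needle:
        return []  # skip the empty context
    distances = [
        event_order - order
        for order, text in doc_sentences
        if order <= event_order and needle in text.casefold()
    ]
    return sorted(distances)[:top_k]

# Made-up sentences, indexed by their linear order in the document.
doc_sentences = [
    (1, "Italy was the first European country hit."),
    (2, "Lombardy reported the earliest cluster."),
    (3, "Hospitals in northern Italy were under strain."),
    (4, "Rome followed weeks later."),
    (5, "The response in Italy changed over time."),
    (6, "Spain saw a similar pattern."),
]

print(nearest_context_distances("italy", 5, doc_sentences))  # [0, 2, 4]
print(nearest_context_distances("rome", 5, doc_sentences))   # [1]
```

Collecting these distance lists over all events and contexts gives the values to histogram, for example with `matplotlib.pyplot.hist`.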

In [46]:

filenames = ["sarsdouble_all.csv", "modeling_covid_italy_all.csv", "response-to-covid-19-was-italy-unprepared_all.csv"]
events_filenames = {"sarsdouble_all.csv":"sarsdouble_events.csv", "modeling_covid_italy_all.csv":"modeling_covid_italy_events.csv", 
                    "response-to-covid-19-was-italy-unprepared_all.csv":"response_to_covid_19_was_italy_unprepared_events.csv"}

loc_tf = {}
# with pd.ExcelWriter(os.path.join(path, "summarized_context_counter.xlsx")) as writer:

for filename in filenames:
  print("\n**********************************************************************************\n")
  print("The annotated extractions file name %s " %(filename))
  df = pd.read_csv(os.path.join(path, filename), index_col=False)

  words = df['sentence_text'].to_list()
  sentences = []
  for sent in df['text'].to_list():
    sentences.extend(sent.split("\n"))

  events_df = pd.read_csv(os.path.join(path, events_filenames[filename]), index_col=False)
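  # fillna returns a new object, so assign the result back (context columns only)
  events_df[["locationContext", "temporalContext"]] = events_df[["locationContext", "temporalContext"]].fillna("")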
  events_df['locationContext'] = events_df['locationContext'].replace({"^'|'$": ""}, regex=True)
  # events_df['temporalContext'] = events_df['temporalContext'].replace({"^'|'$": ""}, regex=True)
  events_df['locationContext'] = events_df['locationContext'].str.replace('nan', "")
  events_df['locationContext'] = events_df['locationContext'].str.replace('\'','')
  # events_df['temporalContext'] = events_df['temporalContext'].str.replace('nan', "")
  # events_df['temporalContext'] = events_df['temporalContext'].str.replace('\'','')
  
  location_contexts = []
  temporal_contexts = []
  events_df["locationContext"] = events_df["locationContext"].apply(lambda x: str(x).replace("nan", "").lower()
                                if type(x) == float else x.lower())
  events_df["temporalContext"] = events_df["temporalContext"].apply(lambda x: str(x).replace("nan", "").lower()
                                if type(x) == float else x.lower())
  events_df["location_contexts"] = events_df["locationContext"].apply(lambda x: str(x).replace("nan", "").lower().split(",") 
                                if type(x) == float else x.lower().split(","))
  events_df["temporal_contexts"] = events_df["temporalContext"].apply(lambda x: str(x).replace("nan", "").lower().split(",") 
                                if type(x) == float else x.lower().split(","))
  for context in events_df["location_contexts"].tolist():
    location_contexts.extend(context) 
  for context in events_df["temporal_contexts"].tolist():
    temporal_contexts.extend(context)
  ulocation_contexts = list(set(location_contexts))
  utemporal_contexts = list(set(temporal_contexts))
  print(len(location_contexts), len(set(location_contexts)))
  print(len(temporal_contexts), len(set(temporal_contexts)))
  all_contexts = ulocation_contexts + utemporal_contexts
  contexts = {}
  for ctx in ulocation_contexts:
    # regex=False: treat the context as literal text, not as a regular expression
    contexts[ctx] = events_df[events_df["locationContext"].str.contains(ctx, regex=False)][['doc_id', 'event_id',
       'event', 'sentence_text', 'pg_num', 'blk_id', 'sentence_id', 'doc_sent_linear_order', 'linear_order']].values.tolist()


**********************************************************************************

The annotated extractions file name sarsdouble_all.csv 
295 12
853 31

**********************************************************************************

The annotated extractions file name modeling_covid_italy_all.csv 
308 8
353 28

**********************************************************************************

The annotated extractions file name response-to-covid-19-was-italy-unprepared_all.csv 
64 14
49 29
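
Because `str.contains(ctx)` is a substring test, the key `italy` also collects rows whose context is `northern italy`, and the empty key matches every row. If exact membership is what is wanted, one option is to test against the split list instead. A sketch on a toy frame that reuses the two column names above; the helper `rows_for_context` and the sample rows are illustrative:

```python
import pandas as pd

# Toy stand-in for events_df with comma-separated contexts.
events_df = pd.DataFrame({
    "event_id": ["E:1", "E:2", "E:3"],
    "locationContext": ["italy", "northern italy,rome", ""],
})
events_df["location_contexts"] = events_df["locationContext"].apply(
    lambda s: [part.strip() for part in s.split(",") if part.strip()]
)

def rows_for_context(df, ctx):
    """Rows whose split context list contains ctx as a whole entry."""
    mask = df["location_contexts"].apply(lambda parts: ctx in parts)
    return df[mask]

substring_hits = events_df[events_df["locationContext"].str.contains("italy", regex=False)]
exact_hits = rows_for_context(events_df, "italy")
print(len(substring_hits), len(exact_hits))
```

On this toy frame the substring test returns two rows while the exact test returns only the row annotated with `italy`.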


In [47]:
events_df["location_contexts"] = events_df["locationContext"].apply(lambda x: str(x).replace("nan", "").lower().split(",") 
                                if type(x) == float else x.lower().split(","))
events_df["locationContext"] = events_df["locationContext"].apply(lambda x: str(x).replace("nan", "").lower()
                                if type(x) == float else x.lower())

In [48]:
for ctx in ulocation_contexts:
  if ctx:
    contexts[ctx] 

In [49]:
ulocation_contexts

['',
 'italian regions',
 'rome',
 'italian',
 'piedmont',
 'spain',
 'european countries and the usa',
 'emilia-romagna',
 'northern regions',
 'northern italy',
 'western european countries',
 'lombardy',
 'italy',
 'veneto']

In [50]:
Counter(location_contexts)

Counter({'italy': 38,
         'northern regions': 1,
         'rome': 4,
         'lombardy': 8,
         '': 2,
         'western european countries': 1,
         'spain': 1,
         'italian regions': 1,
         'european countries and the usa': 2,
         'italian': 1,
         'veneto': 1,
         'emilia-romagna': 1,
         'piedmont': 1,
         'northern italy': 2})

In [None]:
events_df[events_df["locationContext"].str.contains('italy')]

In [53]:
df

Unnamed: 0.1,Unnamed: 0,doc_id,sentence_id,sentence_text,raw_text,text,sno
0,0,-100047078,0,6.1.4 Workforce shortages The effort to expand...,6.1.4 Workforce shortages The effort to expand...,6.1.4 Workforce shortages The effort to expand...,1
1,1,-100047078,1,This has meant also increasing working hours a...,This has meant also increasing working hours a...,6.1.4 Workforce shortages The effort to expand...,2
2,2,-100047078,2,"The emergency magnified , therefore , all the ...","The emergency magnified , therefore , all the ...",6.1.4 Workforce shortages The effort to expand...,3
3,3,-100047078,3,"Italy , as many other OECD countries , had bee...","Italy , as many other OECD countries , had bee...",6.1.4 Workforce shortages The effort to expand...,4
4,4,-100047078,4,According to the OECD Health at a Glance Indic...,According to the OECD Health at a Glance Indic...,6.1.4 Workforce shortages The effort to expand...,5
...,...,...,...,...,...,...,...
255,255,859959020,18,These differences are mainly due to multiple i...,These differences are mainly due to multiple i...,5.\nNational and regional policy responses The...,256
256,256,859959020,19,"In a companion paper , we discuss the differen...","In a companion paper , we discuss the differen...",5.\nNational and regional policy responses The...,257
257,257,859959020,20,Similarly to many European countries and the U...,Similarly to many European countries and the U...,5.\nNational and regional policy responses The...,258
258,258,859959020,21,The excess mortality recorded in these setting...,The excess mortality recorded in these setting...,5.\nNational and regional policy responses The...,259


In [67]:
print("\n**********************************************************************************\n")
print("The annotated extractions file name %s " %(filenames[2]))
df = pd.read_csv(os.path.join(path, filenames[2]), index_col=False)

words = df['text'].to_list()
sentences = []
for sent in df['text'].to_list():
  sentences.extend(sent.split("\n"))
for ctx, values in contexts.items():
  print("context ", ctx)
  for val in values:
    doc_id = val[0]
    sentence_num, doc_sent_linear_order, linear_order = val[6:]
    print(doc_id, sentence_num, doc_sent_linear_order, linear_order)
    sentences = df[df['sno']<=doc_sent_linear_order][['doc_id','sentence_id','sentence_text', 'sno']].values.tolist()
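    for sentence in sentences:
      # contexts were lowercased earlier, so compare against the case-folded sentence text
      if ctx in str(sentence[2]).casefold():
        print("context "+ctx,sentence[2],  doc_sent_linear_order - sentence[3])
      else:
        print("None")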

Streaming output truncated to the last 5000 lines.
None
...
None
-1949274406 12 124 2157
None
...
None
-1810507189 1 93 1520
None
...
None

In [66]:
contexts['rome']

[[-1675118546,
  'E:-1325102996',
  'Current government instructions are based on a three-tier regional risk assessment',
  'Current,government,instructions,are,based,on,a,three-tier,regional,risk,assessment,.',
  11,
  2,
  1,
  78,
  1199],
 [-1657329247,
  'E:166257913',
  'which, if completed, will result in a more than doubled overall pre-pandemic capacity',
  'A,further,30,%,expansion,(,almost,2400,beds,),to,the,already,expanded,ICU,bed,numbers,has,been,planned,(,April,2020,),which,,,if,completed,,,will,result,in,a,more,than,doubled,overall,pre-pandemic,capacity,.',
  5,
  5,
  2,
  73,
  1102],
 [-100047078,
  'E:1158695828',
  'number increased',
  'Despite,recent,attempts,by,the,Italian,government,to,address,this,imbalance,through,increasing,the,number,of,students,training,to,become,nurse,(,the,number,increased,to,13,000,in,2014,from,a,low,3100,),,,the,COVID-19,crisis,has,heightened,the,shortage,of,health,care,professionals,suffered,by,the,SSN,,,with,a,pre-COVID-19,incidence,o

In [43]:
events_df.columns

Index(['Unnamed: 0', 'doc_id', 'annotated_page_num', 'para_num', 'event_id',
       'event', 'locationContext', 'sentence_text', 'temporalContext',
       'explanation', 'pg_num', 'blk_id', 'sentence_id', 'doc_sentence_count',
       'doc_sent_linear_order', 'linear_order', 'location_contexts',
       'temporal_contexts'],
      dtype='object')

##### ignore

In [None]:
path = "/content/drive/MyDrive/Colab Notebooks/skema/data/"
extractions_path = "cosmos-and-extractions-jsons-for-3-papers"
annotated_files = ["data-sars-double.json", "data_modeling_covid_italy.json", "data-response-to-covid-19-was-italy-unprepared.json"]
extraction_files =["extractions_sarsdouble.json", "extractions_modeling_covid_italy--COSMOS-data.json","extractions_response-to-covid-19-was-italy-unprepared--COSMOS-data.json"]
paper_names = ["sarsdouble.pdf", "modeling_covid_italy.pdf", "response-to-covid-19-was-italy-unprepared.pdf"]
ann_extr_file_pairs = {}
for name,ann, extr in zip(paper_names,annotated_files, extraction_files):
  ann_extr_file_pairs[name] = [ann, extr]

In [None]:
ann_extr_file_pairs

In [None]:
for ann, extr in list(ann_extr_file_pairs.values())[:1]:
  print("This extractions file ",extr)
  with open(os.path.join(path,extractions_path, extr ), "r", encoding='UTF-8') as f:
    contents = f.readlines()
    extractions = json.loads(contents[0])
    

In [None]:
# events

# eventId
# page number , block number, sentence id -page -> linear order

# eventid 1,2,3,4,5

# linear sentence_order : 6,10,12,12,14

# location context
# temporal context 

# 1) missing linear order in TextReadingPipeline output -> extractions json
# 2) location context - annotations 
# 	for each eventID in eventMentions in annotations

# 		locations = [ "Italy", "Rome"]
# 		linear order entire extractions -> linear order for this eventID
# 			sentenceID -> documents and look in sentence text from this sentenceID 
# 			and get closest sentenceID where you find this location's text