# Analysis: Descriptions' and Annotations' Lengths
## Post Annotation and Aggregation

Outputs the files:
  * `../data/analysis_data/descriptions_with_counts.csv`: adds columns to `descriptions.csv` for word counts and sentence counts, where words are alphanumeric tokens (punctuation excluded)
  * `../data/analysis_data/descs_stats.csv`: contains the count, minimum, maximum, average, and standard deviation of all descriptions and each type of description
  * `../data/crc_metadata/descs_with_offsets.csv`: contains one row for every description in the annotated datasets with columns for the descriptions' corresponding id, eadid, field, file, start offset, and end offset

***

**Table of Contents**

[0.](#0) Loading

[1.](#1) Lengths of Descriptions and Annotations

  * [Lengths of Descriptions](#1.1)
  
  * TO DO: [Lengths of Annotations](#1.2)
  
[2.](#2) Offsets of Descriptions

[3.](#3) Offsets of Tokens
    
  * [BIO Tags](#3.1)

***

<a id="0"></a>
### 0. Loading
Begin by loading the Python libraries and the dataset to be analyzed.

In [1]:
import utils  # import custom functions
import config # import directory path variables

from pathlib import Path

import pandas as pd
import numpy as np
import string, csv, re, os, sys #,json

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag
from nltk.text import Text
from nltk.probability import FreqDist
from collections import Counter

%matplotlib inline
import matplotlib.pyplot as plt

from intervaltree import Interval, IntervalTree

<a id="1"></a>
## 1. Lengths of Descriptions and Annotations
**Find the minimum, maximum, average, and standard deviation of word and sentence counts...**
* Per description (by `desc_id` - a.k.a. per "document" for document classifiers)
* Per metadata field (Title, Biographical / Historical, Scope and Contents, and Processing Information)
* Per collection (identifiable with the `eadid` column)
* Per annotation label (Omission, Stereotype, Generalization, etc.)
* Per annotation category (Person Name, Linguistic, Contextual)
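
As a rough illustration of the statistics listed above, the sketch below computes the count, minimum, maximum, mean, and standard deviation of word counts per metadata field using only the standard library. The sample rows and the helper name `summarize` are hypothetical; the notebook itself relies on `utils.makeDescribeDf`, whose implementation is not shown here and may differ.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical (field, word_count) pairs standing in for rows of desc_df
rows = [
    ("Title", 4), ("Title", 6), ("Title", 8),
    ("Scope and Contents", 20), ("Scope and Contents", 40),
]

def summarize(values):
    """Return count, min, max, mean, and sample standard deviation."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        # stdev needs at least two values; report 0.0 otherwise
        "std": stdev(values) if len(values) > 1 else 0.0,
    }

# Group the word counts by metadata field
counts_by_field = defaultdict(list)
for field, word_count in rows:
    counts_by_field[field].append(word_count)

stats_by_field = {field: summarize(vals) for field, vals in counts_by_field.items()}
stats_all = summarize([wc for _, wc in rows])
print(stats_by_field["Title"])
```

The same grouping idea applies to collections (`eadid`), annotation labels, and annotation categories by swapping the grouping key.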

<a id="1.1"></a>
### 1.1 Lengths of Descriptions

In [37]:
descs_path = config.crc_meta_path+"all_descriptions.csv"     # descriptions in column of CSV file

In [38]:
desc_df = pd.read_csv(descs_path, index_col=0)
desc_df.head()

Unnamed: 0,eadid,description,field,desc_id
0,AA5,Professor James Aitken White was a leading Sco...,Biographical / Historical,0
1,AA5,Papers of The Very Rev Prof James Whyte (1920-...,Title,1
2,AA6,Rev Thomas Allan was born on 16 August 1916 in...,Biographical / Historical,2
3,AA6,Papers of Rev Tom Allan (1916-1965)\n\n,Title,3
4,AA7,Alec Cheyne was born on 1 June 1924 in Errol i...,Biographical / Historical,4


In [39]:
# # Remove metadata field name from each description
# new_descs = []
# descs = list(desc_df.description)
# fields = list(desc_df.field)
# i = 0
# maxI = len(descs)
# while i < maxI:
#     d, f = descs[i], fields[i]
#     to_remove = f+":\n"
#     d = d.replace(to_remove,"")
#     new_descs += [d]
#     i += 1
# assert len(new_descs) == len(descs)
# # new_descs[:10]            # Looks good

In [40]:
# # Update the CSV file
# desc_df.description = new_descs
# desc_df.head()
# desc_df.to_csv(descs_path)

In [42]:
# Write each description to a txt file named with desc_id
ids = list(desc_df.desc_id)
descs = list(desc_df.description)
desc_txt_dir = config.crc_meta_path+"descriptions_brat/"
utils.strToTxt(ids, descs, "description", desc_txt_dir)

In [None]:
corpus = PlaintextCorpusReader(desc_txt_dir, r"description\d+\.txt", encoding="utf8")  # raw string; escaped dot matches a literal "."
# print(len(corpus.fileids()), desc_df.shape[0])  # Looks good
print(corpus.fileids()[-20:]) # Looks good

#### Length per Description

In [None]:
desc_words, desc_lower_words, desc_sents = utils.getWordsSents(corpus)
print(desc_words[0][:10])
print(desc_lower_words[0][:10])
print(desc_sents[0][:2])

In [None]:
# Add word and sentence counts to DataFrame/CSV of descriptions
word_count = [len(word_list) for word_list in desc_words]  # includes digits but not punctuation
sent_count = [len(sent_list) for sent_list in desc_sents]
print(word_count[:2], sent_count[:4])  # Looks good
# len(desc_sents[2]) # 14
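
For readers without access to `utils.getWordsSents`, the following is a minimal sketch of the kind of counting described above, assuming words are runs of alphanumeric characters and sentences end at `.`, `!`, or `?`. The function name `count_words_and_sentences` is hypothetical, and the regular-expression sentence split is a simplification of NLTK's tokenizers, so its counts can differ from the notebook's (for example, around abbreviations such as "Rev.").

```python
import re

def count_words_and_sentences(text):
    """Approximate word and sentence counts for one description."""
    # Words: runs of letters, digits, or underscores (punctuation excluded)
    words = re.findall(r"\w+", text)
    # Sentences: naive split after ., !, or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return len(words), len(sentences)

sample = "Rev Thomas Allan was born in 1916. He studied in Glasgow."
print(count_words_and_sentences(sample))
```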

In [None]:
desc_df.insert(len(desc_df.columns), "word_count", word_count)
desc_df.insert(len(desc_df.columns), "sent_count", sent_count)
desc_df.head()

In [None]:
desc_df.to_csv("../data/analysis_data/descriptions_with_counts.csv")  # write a new CSV file with the word and sentence counts

In [None]:
desc_df = desc_df.reset_index()
desc_df.head(1)

In [None]:
desc_df_stats = utils.makeDescribeDf("All", desc_df)
desc_df_stats

#### Lengths per Metadata Field

In [None]:
field = "Biographical / Historical"
bh_stats = utils.makeDescribeDf(field, desc_df)
bh_stats

In [None]:
field = "Scope and Contents"
sc_stats = utils.makeDescribeDf(field, desc_df)
sc_stats

In [None]:
field = "Processing Information"
pi_stats = utils.makeDescribeDf(field, desc_df)
pi_stats

In [None]:
field = "Title"
t_stats = utils.makeDescribeDf(field, desc_df)
t_stats

#### Combine the Statistics

In [None]:
df_stats = pd.concat([desc_df_stats, t_stats, sc_stats, bh_stats, pi_stats], axis=0)
df_stats

In [None]:
df_stats.to_csv("../data/analysis_data/descs_stats.csv")

#### Prepare data for visualization in Observable

In [None]:
df_descs = pd.read_csv("../data/analysis_data/descriptions_with_counts.csv", index_col=0)
df_descs.head()

<a id="1.2"></a>
### 1.2 Lengths of Annotations

* Dataset: `annot-post/data/aggregated_final.csv`

<a id="2"></a>
## 2. Offsets of Descriptions

**Get the start and end offset of every description so that automated labels can be exported as .ann files for visualization with brat.**

The [standoff format](https://brat.nlplab.org/standoff.html) that the brat rapid annotation tool uses records the start offset and end offset of annotated text spans where:
* The **start offset** is the index of the *first character* in the annotated text span (which is also the number of characters in the document preceding the beginning of the annotated text span)
* The **end offset** is the index of the character *after the annotated text span* (which means the end offset corresponds to the character immediately following the annotated text span)

This means that a description at the very beginning of a document has a start offset of 0, and the end offset of the last description in a document equals the length (number of characters) of the document. In practice, each description follows its field name (for example, `Title:`), so the first description's start offset is usually greater than 0. There are multiple descriptions in each document, so we need to determine the intermediate start and end offsets as well, which we'll add as columns to the file `../data/crc_metadata/all_descriptions.csv`.
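
To make the offset convention concrete, here is a small sketch, assuming a document in which each description follows a field-name header. The document text and the helper `find_offsets` are hypothetical and only illustrate the start-inclusive, end-exclusive convention; they are not the implementation of `utils.getDescriptionsInFiles`.

```python
# Hypothetical document text with two field headers
doc = "Title:\nPapers of Rev Tom Allan\n\nScope and Contents:\nLetters and sermons."

def find_offsets(document, description, search_from=0):
    """Return brat-style (start, end) offsets for a description in a document."""
    start = document.index(description, search_from)  # index of the first character
    end = start + len(description)                     # index after the last character
    return start, end

title_start, title_end = find_offsets(doc, "Papers of Rev Tom Allan")
# Search from the previous end offset so repeated text is not matched twice
scope_start, scope_end = find_offsets(doc, "Letters and sermons.", title_end)

# Slicing with the offsets recovers the exact text span
assert doc[title_start:title_end] == "Papers of Rev Tom Allan"
print((title_start, title_end), (scope_start, scope_end), len(doc))
```

Because the end offset is exclusive, `end - start` always equals the number of characters in the span.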

<div class="alert-block alert-class alert-warning">
    <p><b>NOTE:</b> Need to re-assign description IDs to files in annotation_data directory (already reassigned to crc_metadata and aggregated_data files)</p>
</div>

In [2]:
file_type = ".txt"  # Read in only the PlainText files

In [3]:
filenames = os.listdir(config.doc_path)
filenames = [f for f in filenames if f.endswith(file_type)] # the descriptions are in the txt files
print(filenames[:3])

['Coll-227_00100.txt', 'La_03600.txt', 'PJM_03000.txt']


In [5]:
descs_details = utils.getDescriptionsInFiles(config.doc_path, filenames)

Great!  Now create a DataFrame of the description data:

In [27]:
ids_col = list(descs_details.keys())
desc_col, field_col, file_col, eadid_col, start_offset_col, end_offset_col = [], [], [], [], [], []
for desc_id in ids_col:
    desc_dict = descs_details[desc_id]
    
    eadid = re.findall(r"^.*(?=_\d+\.txt)", desc_dict["file"])[0]  # text before the final _<digits>.txt
    eadid_col += [eadid]
    
    field_col += [desc_dict["field"]]
    
    file_col += [desc_dict["file"]]
    
    desc_col += [desc_dict["description"]]
    
    start_offset_col += [desc_dict["start_offset"]]
    end_offset_col += [desc_dict["end_offset"]]

new_descs_df = pd.DataFrame({
    "desc_id":ids_col, "eadid":eadid_col, "field":field_col, "file":file_col, 
    "description":desc_col, "desc_start_offset":start_offset_col, "desc_end_offset":end_offset_col
})

new_descs_df.head()

Unnamed: 0,desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
0,0,Coll-227,Title,Coll-227_00100.txt,Records of the Phrenological Society of Edinburgh,29,79
1,1,Coll-227,Scope and Contents,Coll-227_00100.txt,The records of the Phrenological Society inclu...,100,610
2,2,Coll-227,Biographical / Historical,Coll-227_00100.txt,The Phrenological Society of Edinburgh was for...,638,2277
3,3,La,Title,La_03600.txt,"Letter: 1825 Jan. 10, 27 Lower Belgrave Place ...",7,117
4,4,La,Title,La_03600.txt,"Letter: 1825 Mar. 1, 27 Lower Belgrave Place [...",125,223
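
The `eadid` values above are derived from the file names. The sketch below shows one way such an extraction can be written and checked in isolation, assuming file names always end in an underscore, digits, and `.txt`. The helper name `eadid_from_filename` is hypothetical.

```python
import re

def eadid_from_filename(filename):
    """Return the collection identifier that precedes the final _<digits>.txt."""
    match = re.match(r"^(.*)_\d+\.txt$", filename)
    if match is None:
        raise ValueError("Unexpected filename: {}".format(filename))
    return match.group(1)

for name in ["Coll-227_00100.txt", "La_03600.txt", "EUA_IN1_56700.txt"]:
    print(name, "->", eadid_from_filename(name))
```

The greedy `.*` keeps underscores that belong to the identifier itself, so `EUA_IN1_56700.txt` yields `EUA_IN1` rather than `EUA`.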


Write the data to a CSV file:

In [2]:
# new_descs_df.to_csv(config.crc_meta_path+"descs_with_offsets.csv")
new_descs_df = pd.read_csv(config.crc_meta_path+"descs_with_offsets.csv", index_col=0)
new_descs_df.head()

Unnamed: 0,desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
0,0,Coll-227,Title,Coll-227_00100.txt,Records of the Phrenological Society of Edinburgh,29,79
1,1,Coll-227,Scope and Contents,Coll-227_00100.txt,The records of the Phrenological Society inclu...,100,610
2,2,Coll-227,Biographical / Historical,Coll-227_00100.txt,The Phrenological Society of Edinburgh was for...,638,2277
3,3,La,Title,La_03600.txt,"Letter: 1825 Jan. 10, 27 Lower Belgrave Place ...",7,117
4,4,La,Title,La_03600.txt,"Letter: 1825 Mar. 1, 27 Lower Belgrave Place [...",125,223


Now assign description IDs from this DataFrame to the aggregated annotated datasets:

In [13]:
df_grouped = pd.read_csv(config.agg_path+"desc_field_descid_label_eadid.csv", index_col=0)
# df_grouped.head()

In [6]:
df_merged = df_grouped.merge(new_descs_df, on=["description", "field", "eadid"])  # inner join on the shared key columns
df_merged.head()

Unnamed: 0,description,field,desc_id_x,label,eadid,desc_id_y,file,desc_start_offset,desc_end_offset
0,John Baillie: posthumous,Title,68,{'Unknown'},BAI,70381,BAI_01000.txt,1290,1315
1,"Letters received from Henry Sloane Coffin, wit...",Scope and Contents,143,"{'Masculine', 'Unknown'}",BAI,47675,BAI_01300.txt,5853,5983
2,Family photographs consist of:photographs of f...,Scope and Contents,221,"{'Masculine', 'Unknown', 'Feminine'}",BAI,81505,BAI_01600.txt,5967,6202
3,"Correspondence and related items, including le...",Scope and Contents,292,{'Unknown'},BAI,33009,BAI_01900.txt,5297,5506
4,From 1927-1930 John Baillie was Professor of S...,Biographical / Historical,361,"{'Gendered-Pronoun', 'Unknown', 'Masculine', '...",BAI,43372,BAI_02200.txt,15180,15419


Great!  Now we want to keep the *right* DataFrame's description IDs, so we'll drop `desc_id_x` and remove the `_y` from `desc_id_y`:

In [9]:
df_merged = df_merged.drop(columns=["desc_id_x"])
df_merged = df_merged.rename(columns={"desc_id_y":"desc_id"})
df_merged.head()

Unnamed: 0,description,field,label,eadid,desc_id,file,desc_start_offset,desc_end_offset
0,John Baillie: posthumous,Title,{'Unknown'},BAI,70381,BAI_01000.txt,1290,1315
1,"Letters received from Henry Sloane Coffin, wit...",Scope and Contents,"{'Masculine', 'Unknown'}",BAI,47675,BAI_01300.txt,5853,5983
2,Family photographs consist of:photographs of f...,Scope and Contents,"{'Masculine', 'Unknown', 'Feminine'}",BAI,81505,BAI_01600.txt,5967,6202
3,"Correspondence and related items, including le...",Scope and Contents,{'Unknown'},BAI,33009,BAI_01900.txt,5297,5506
4,From 1927-1930 John Baillie was Professor of S...,Biographical / Historical,"{'Gendered-Pronoun', 'Unknown', 'Masculine', '...",BAI,43372,BAI_02200.txt,15180,15419
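
A merge on text columns can silently multiply rows when the same description text occurs in more than one file. The sketch below, built on hypothetical miniature DataFrames, shows where the `_x` and `_y` suffixes come from and how the `validate` argument of `DataFrame.merge` can flag duplicate keys. It illustrates a possible check, not a step the notebook performs.

```python
import pandas as pd

# Hypothetical miniature versions of the two DataFrames
left = pd.DataFrame({
    "description": ["Letter", "Diary"],
    "field": ["Title", "Title"],
    "eadid": ["X1", "X1"],
    "desc_id": [10, 11],
})
right = pd.DataFrame({
    "description": ["Letter", "Letter", "Diary"],  # "Letter" appears in two files
    "field": ["Title", "Title", "Title"],
    "eadid": ["X1", "X1", "X1"],
    "desc_id": [100, 101, 102],
})

keys = ["description", "field", "eadid"]
merged = left.merge(right, on=keys)  # the shared non-key column gets _x / _y suffixes
print(list(merged.columns), len(merged))

# validate raises MergeError if a key combination repeats in the right DataFrame
try:
    left.merge(right, on=keys, validate="many_to_one")
    duplicates_found = False
except pd.errors.MergeError:
    duplicates_found = True
print("Duplicate keys on the right:", duplicates_found)
```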


Update the file:

In [10]:
df_merged.to_csv(config.agg_path+"desc_field_descid_label_eadid.csv")

<a id="3"></a>

## 3. Offsets of Tokens

In [54]:
# df = pd.read_csv(config.crc_meta_path+"all_descriptions.csv", index_col=0)
# df.loc[df.desc_id == 57361]
files = os.listdir("../data/crc_metadata/descriptions_brat/")
assert "EUA_IN1_56700.txt" in files
with open("../data/crc_metadata/descriptions_brat/EUA_IN1_56700.txt", "r") as f:
    f_string = f.read()
    print(f_string)


Scope and Contents:
Letters of congratulation to staff members on their work and external appointments.

Scope and Contents:
Correspondence about planning and budgeting issues for the Student Advisory and Counselling Service (SACS), about an individual student complaint about how SACS had dealt with her case, about the management structure and staffing of SACS. Includes a copy of the 1992/93 SACS annual report.

Scope and Contents:
Copy of the University's submissions to the 2001 Research Assessment Exercise for Classics, Ancient History, Byzantine and Modern Greek Studies, Archaeology, History, History of Art, Architecture and Design, and Philosophy.

Scope and Contents:
Certificate of Social Study 1948. File contains: Enrolment form, correspondence, job adverts for Supervisor of boarded out children for Fife County Council, Woman Assistant for County Public Assistance Officer for Gloucestershire County Council. Details of practical placements are located at the back of the file and 

In [63]:
df_desc = pd.read_csv(config.crc_meta_path+"descs_with_offsets.csv", index_col=0)
df_desc.head()

Unnamed: 0,desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
0,0,Coll-227,Title,Coll-227_00100.txt,Records of the Phrenological Society of Edinburgh,29,79
1,1,Coll-227,Scope and Contents,Coll-227_00100.txt,The records of the Phrenological Society inclu...,100,610
2,2,Coll-227,Biographical / Historical,Coll-227_00100.txt,The Phrenological Society of Edinburgh was for...,638,2277
3,3,La,Title,La_03600.txt,"Letter: 1825 Jan. 10, 27 Lower Belgrave Place ...",7,117
4,4,La,Title,La_03600.txt,"Letter: 1825 Mar. 1, 27 Lower Belgrave Place [...",125,223


In [64]:
df_desc.loc[df_desc.description.isna()]
# df_desc.loc[df_desc.description == "N/A"]
# df_desc.loc[df_desc.file == "EUA_IN1_56700.txt"]

Unnamed: 0,desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
57361,57361,EUA_IN1,Scope and Contents,EUA_IN1_56700.txt,,6137,6141


In [65]:
df_desc.description = df_desc.description.fillna("N/A")
df_desc.loc[df_desc.desc_id == 57361]

Unnamed: 0,desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
57361,57361,EUA_IN1,Scope and Contents,EUA_IN1_56700.txt,,6137,6141


Write the corrected description to the `descs_with_offsets.csv` file:

In [66]:
df_desc.to_csv(config.crc_meta_path+"descs_with_offsets.csv")
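
One caveat when a description is the literal text "N/A": by default `pandas.read_csv` treats that string as a missing value, so the filled-in value can turn back into `NaN` the next time the file is loaded. The sketch below demonstrates this with an in-memory CSV and a hypothetical single-row DataFrame. Passing `keep_default_na=False` is one possible way to keep the literal text.

```python
import io
import pandas as pd

df = pd.DataFrame({"desc_id": [57361], "description": ["N/A"]})
csv_text = df.to_csv(index=False)

# Default parsing treats the string "N/A" as a missing value
default_read = pd.read_csv(io.StringIO(csv_text))
# Disabling the default NA strings keeps the literal text
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read.description.isna().tolist())
print(literal_read.description.tolist())
```

With `keep_default_na=False`, empty cells are read as empty strings rather than `NaN`, so any later `isna()` checks would need to be adapted.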

Get the offsets of the tokens in every description:

In [None]:
descs = list(df_desc.description)
desc_ids = list(df_desc.desc_id)
desc_start_offsets = list(df_desc.desc_start_offset)
desc_end_offsets = list(df_desc.desc_end_offset)

In [69]:
tokens_dict, offsets_dict = getTokensAndOffsetsFromStrings(descs, desc_ids, desc_start_offsets, desc_end_offsets)
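
The definition of `getTokensAndOffsetsFromStrings` is not shown in this notebook. As a minimal sketch of the idea, assuming a simple regular-expression tokenizer rather than NLTK's, token offsets can be shifted by the description's start offset so that they index into the whole document. The function name `tokens_with_offsets` is hypothetical.

```python
import re

def tokens_with_offsets(description, desc_start_offset=0):
    """Return tokens and their (start, end) offsets within the whole document."""
    tokens, offsets = [], []
    # A token is a run of word characters or a single punctuation mark
    for match in re.finditer(r"\w+|[^\w\s]", description):
        tokens.append(match.group())
        offsets.append((desc_start_offset + match.start(),
                        desc_start_offset + match.end()))
    return tokens, offsets

toks, offs = tokens_with_offsets("Records of the Phrenological Society", 29)
print(toks)
print(offs)
```

This tokenizer splits every punctuation mark into its own token, so abbreviations and contractions can be segmented differently than by NLTK's `word_tokenize`.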

In [72]:
tokens_col, offsets_col, desc_ids_col = [], [], []
for desc_id,token_list in tokens_dict.items():
    tokens_col += token_list
    offsets_list = offsets_dict[desc_id]
    offsets_col += offsets_list
    assert len(token_list) == len(offsets_list)
    desc_ids_col += [desc_id]*len(token_list)

assert len(tokens_col) == len(offsets_col)
assert len(tokens_col) == len(desc_ids_col)

In [73]:
for col_list in [tokens_col, offsets_col, desc_ids_col]:
    print(col_list[0:5])

['Records', 'of', 'the', 'Phrenological', 'Society']
[(29, 36), (37, 39), (40, 43), (44, 57), (58, 65)]
[0, 0, 0, 0, 0]


Looks good!  Now create a DataFrame with these lists as columns:

In [74]:
df_tokens = pd.DataFrame({"desc_id":desc_ids_col, "token":tokens_col, "offsets":offsets_col})
df_tokens.head()

Unnamed: 0,desc_id,token,offsets
0,0,Records,"(29, 36)"
1,0,of,"(37, 39)"
2,0,the,"(40, 43)"
3,0,Phrenological,"(44, 57)"
4,0,Society,"(58, 65)"


In [75]:
df_tokens.tail()

Unnamed: 0,desc_id,token,offsets
2239703,88596,on,"(465, 467)"
2239704,88596,12,"(468, 470)"
2239705,88596,January,"(471, 478)"
2239706,88596,1937,"(479, 483)"
2239707,88596,.,"(483, 484)"


Great!  Now write the DataFrame to a file:

In [76]:
df_tokens.to_csv(config.agg_path+"descid_token_offsets.csv")

<a id="3.1"></a>
### 3.1 BIO Tags

Compare the descriptions' tokens' offsets to the annotated text spans' offsets to determine which tokens to mark as the beginning of an annotation (`B-[LABELNAME]`), inside an annotation (`I-[LABELNAME]`), or unannotated, meaning outside of an annotation (`O`).
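
The comparison can be sketched for a single annotated text span as follows. The function name `bio_tags` and the sample offsets are illustrative assumptions. The notebook's own cells handle multiple spans and labels per description.

```python
def bio_tags(token_offsets, span_offsets, label):
    """Tag each token as B-, I-, or O relative to one annotated text span."""
    span_start, span_end = span_offsets
    tags = []
    for token_start, token_end in token_offsets:
        if token_start == span_start:
            tags.append("B-" + label)        # token opens the span
        elif span_start < token_start and token_end <= span_end:
            tags.append("I-" + label)        # token lies inside the span
        else:
            tags.append("O")                 # token is outside the span
    return tags

# Illustrative offsets for the tokens "Lady", "Luck", ":" and a span over "Lady Luck"
token_offsets = [(2118, 2122), (2123, 2127), (2127, 2128)]
print(bio_tags(token_offsets, (2118, 2127), "Stereotype"))
```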

In [3]:
df_tokens = pd.read_csv(config.agg_path+"descid_token_offsets.csv", index_col=0)
token_desc_ids = list(df_tokens.desc_id)
tokens = list(df_tokens.token)
token_offsets = list(df_tokens.offsets)
token_offsets_clean = [offsets[1:-1].split(", ") for offsets in token_offsets]
token_offsets_tuples = [tuple((int(offsets[0]), int(offsets[1]))) for offsets in token_offsets_clean]
# print(token_offsets_tuples[:5])  # Looks good



In [120]:
# # cols_to_read = ['offsets_ann','text_ann','label','id', "desc_id"]
# df_merged = pd.read_csv(config.agg_path+"aggregated_with_eadid_descid_desc_cols.csv", index_col=0)#usecols=cols_to_read)
# # df_merged.sort_values(by=["desc_id","desc_start_offset"], ascending=True)
# df_merged = df_merged.drop_duplicates()
# df_merged.head()
df_ann = pd.read_csv(config.agg_path+"aggregated_with_eadid_descid_cols.csv", index_col=0) #aggregated_final.csv
df_ann.head()

Unnamed: 0,file,offsets,text,label,category,eadid,field,id,desc_id
9,AA5_00100.ann,"(1032, 1043)",James Whyte,Masculine,Person-Name,AA5,Biographical / Historical,0,0
16,AA5_00100.ann,"(1129, 1177)",chair of practical theology and Christian ethics,Occupation,Contextual,AA5,Biographical / Historical,1,0
4,AA5_00100.ann,"(1217, 1219)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical,2,0
5,AA5_00100.ann,"(1241, 1244)",His,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical,3,0
6,AA5_00100.ann,"(1315, 1317)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical,4,0


<div class="alert alert-block alert-danger">
NEED TO FIX THE MERGING OF ANNOTATIONS WITH THE DESCRIPTIONS THEY'RE IN, FOR ALL AGGREGATED DATA WITH A DESCID OR DESC COLUMN!
</div>

In [116]:
# df_tokens.loc[df_tokens.desc_id == 610]
df_merged.loc[df_merged.desc_id == 610]

Unnamed: 0,eadid,field,file_ann,offsets_ann,text_ann,label,category,id,description,file_desc,desc_id,file,desc_start_offset,desc_end_offset
3359,Coll-146,Title,Coll-146_00400.ann,"(1279, 1295)","Koestler, Arthur",Unknown,Person-Name,39824,"Letter :: Koestler, Arthur",Coll-146_11200.txt,610,Coll-146_11200.txt,127,154
5175,Coll-146,Title,Coll-146_00400.ann,"(1279, 1295)","Koestler, Arthur",Unknown,Person-Name,39824,"Letter :: Koestler, Arthur",Coll-146_07700.txt,610,Coll-146_11200.txt,127,154
6991,Coll-146,Title,Coll-146_00400.ann,"(1279, 1295)","Koestler, Arthur",Unknown,Person-Name,39824,"Letter :: Koestler, Arthur",Coll-146_03200.txt,610,Coll-146_11200.txt,127,154
14255,Coll-146,Title,Coll-146_00400.ann,"(1279, 1295)","Koestler, Arthur",Unknown,Person-Name,39824,"Letter :: Koestler, Arthur",Coll-146_05300.txt,610,Coll-146_11200.txt,127,154
17887,Coll-146,Title,Coll-146_00400.ann,"(1279, 1295)","Koestler, Arthur",Unknown,Person-Name,39824,"Letter :: Koestler, Arthur",Coll-146_01600.txt,610,Coll-146_11200.txt,127,154
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19768703,Coll-146,Title,Coll-146_14800.ann,"(10, 26)","Koestler, Arthur",Unknown,Person-Name,44967,"Letter :: Koestler, Arthur",Coll-146_14200.txt,610,Coll-146_11200.txt,127,154
19773243,Coll-146,Title,Coll-146_14800.ann,"(10, 26)","Koestler, Arthur",Unknown,Person-Name,44967,"Letter :: Koestler, Arthur",Coll-146_10700.txt,610,Coll-146_11200.txt,127,154
19774151,Coll-146,Title,Coll-146_14800.ann,"(10, 26)","Koestler, Arthur",Unknown,Person-Name,44967,"Letter :: Koestler, Arthur",Coll-146_12300.txt,610,Coll-146_11200.txt,127,154
19782323,Coll-146,Title,Coll-146_14800.ann,"(10, 26)","Koestler, Arthur",Unknown,Person-Name,44967,"Letter :: Koestler, Arthur",Coll-146_08600.txt,610,Coll-146_11200.txt,127,154


For efficiency, create dictionaries with the DataFrames' data, associating text and offsets to description IDs.

In [44]:
# Do the opposite of DataFrame.explode(), creating one row for each
# value in the cols_to_groupby (list of one or more items) and lists of 
# values in the other columns, and setting the cols_to_groupby as the Index
# or MultiIndex in the resulting DataFrame
def implodeDataFrame(df, cols_to_groupby):
    cols_to_agg = list(df.columns)
    for col in cols_to_groupby:
        cols_to_agg.remove(col)
    agg_dict = dict.fromkeys(cols_to_agg, lambda x: x.tolist())
    return df.groupby(cols_to_groupby).agg(agg_dict).reset_index().set_index(cols_to_groupby)
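
To show what this "implode" step produces, here is a small self-contained sketch on hypothetical token rows. It uses `groupby(...).agg(list)`, which for a single grouping column should give a comparable result to `implodeDataFrame`, and then converts the result to a dictionary keyed by description ID.

```python
import pandas as pd

# Hypothetical token rows for two descriptions
df = pd.DataFrame({
    "desc_id": [0, 0, 1],
    "token": ["Records", "of", "The"],
    "offsets": ["(29, 36)", "(37, 39)", "(100, 103)"],
})

# Collect every non-key column into one list per desc_id
imploded = df.groupby("desc_id").agg(list)
print(imploded.loc[0, "token"])

# A dictionary keyed by desc_id gives fast lookups of tokens and offsets
lookup = imploded.to_dict("index")
print(lookup[1])
```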

In [45]:
df_tokens_imploded = implodeDataFrame(df_tokens, ["desc_id"])
df_tokens_imploded.head()

Unnamed: 0_level_0,token,offsets
desc_id,Unnamed: 1_level_1,Unnamed: 2_level_1
0,"[Records, of, the, Phrenological, Society, of,...","[(29, 36), (37, 39), (40, 43), (44, 57), (58, ..."
1,"[The, records, of, the, Phrenological, Society...","[(100, 103), (104, 111), (112, 114), (115, 118..."
2,"[The, Phrenological, Society, of, Edinburgh, w...","[(638, 641), (642, 655), (656, 663), (664, 666..."
3,"[Letter, :, 1825, Jan., 10, ,, 27, Lower, Belg...","[(7, 13), (13, 14), (15, 19), (20, 24), (25, 2..."
4,"[Letter, :, 1825, Mar, ., 1, ,, 27, Lower, Bel...","[(125, 131), (131, 132), (133, 137), (138, 141..."


In [37]:
df_merged_imploded = implodeDataFrame(df_merged, ["desc_id"])
df_merged_imploded.head()

Unnamed: 0_level_0,offsets_ann,text_ann,label,id
desc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
167,"[(1436, 1444), (1436, 1444)]","[Beardman, Beardman]","[Omission, Unknown]","[31928, 31929]"
508,"[(3105, 3111)]",[editor],[Occupation],[28226]
610,"[(1279, 1295), (1279, 1295), (1279, 1295), (12...","[Koestler, Arthur, Koestler, Arthur, Koestler,...","[Unknown, Unknown, Unknown, Unknown, Unknown, ...","[39824, 39824, 39824, 39824, 39824, 39824, 398..."
611,"[(1279, 1295), (1279, 1295), (1279, 1295), (12...","[Koestler, Arthur, Koestler, Arthur, Koestler,...","[Unknown, Unknown, Unknown, Unknown, Unknown, ...","[39824, 39824, 39824, 39824, 39824, 39824, 398..."
640,"[(2118, 2122), (2118, 2127), (2118, 2127), (22...","[Lady, Lady Luck, Lady Luck, Weaver, Warren]","[Gendered-Role, Stereotype, Feminine, Unknown]","[43620, 43621, 43622, 43623]"


**Step 1: O tags**

Compare description IDs in the two DataFrames above to determine which descriptions (from `df_tokens_imploded`) do not have annotations (and thus are not in `df_merged_imploded`), and assign all those descriptions' tokens an `O` tag (for *outside* of an annotation).

In [40]:
all_desc_ids = list(df_tokens_imploded.index)
ann_desc_ids = set(df_merged_imploded.index)  # a set makes each membership test fast
unannotated = [desc_id for desc_id in all_desc_ids if desc_id not in ann_desc_ids]
print("Rows to assign tag 'O':", len(unannotated))

Rows to assign tag 'O': 86742


In [48]:
o_df = df_tokens_imploded.loc[df_tokens_imploded.index.isin(unannotated)]
assert o_df.shape[0] == len(unannotated)

In [50]:
tokens_list = list(o_df.token)
tags = [["O"]*len(tokens) for tokens in tokens_list]
assert len(tags) == len(tokens_list)
o_df.insert(len(o_df.columns), "ann_tag", tags)
o_df.head()

Unnamed: 0_level_0,token,offsets,ann_tag
desc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,"[Records, of, the, Phrenological, Society, of,...","[(29, 36), (37, 39), (40, 43), (44, 57), (58, ...","[O, O, O, O, O, O, O]"
1,"[The, records, of, the, Phrenological, Society...","[(100, 103), (104, 111), (112, 114), (115, 118...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
2,"[The, Phrenological, Society, of, Edinburgh, w...","[(638, 641), (642, 655), (656, 663), (664, 666...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
3,"[Letter, :, 1825, Jan., 10, ,, 27, Lower, Belg...","[(7, 13), (13, 14), (15, 19), (20, 24), (25, 2...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,"[Letter, :, 1825, Mar, ., 1, ,, 27, Lower, Bel...","[(125, 131), (131, 132), (133, 137), (138, 141...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."


In [53]:
assert len(o_df.token[100]) == len(o_df.ann_tag[100])
assert len(o_df.token[488]) == len(o_df.ann_tag[488])
assert len(o_df.token[0]) == len(o_df.ann_tag[0])

**Step 2: B- and I- tags**

For description IDs that do have annotations (and thus are in `df_merged_imploded`), assign their tokens tags of `B-[LABELNAME]` and `I-[LABELNAME]` for *beginning* and *inside* of an annotation, replacing `[LABELNAME]` with the name of the annotation's label.

In [41]:
annotated = [desc_id for desc_id in all_desc_ids if desc_id in ann_desc_ids]
print("Rows to assign 'B-' or 'I-':", len(annotated))

Rows to assign 'B-' or 'I-': 1855


In [54]:
bi_df = df_tokens_imploded.loc[df_tokens_imploded.index.isin(annotated)]
assert bi_df.shape[0] == len(annotated)

In [55]:
bi_df.head()

Unnamed: 0_level_0,token,offsets
desc_id,Unnamed: 1_level_1,Unnamed: 2_level_1
167,"[Brick, Burning, ,, Beardman, 's]","[(1421, 1426), (1427, 1434), (1434, 1435), (14..."
508,"[Interpreting, sequence, motifs, [, Letter, to...","[(3064, 3076), (3077, 3085), (3086, 3092), (30..."
610,"[Letter, :, :, Koestler, ,, Arthur]","[(127, 133), (134, 135), (135, 136), (137, 145..."
611,"[Letter, :, :, Koestler, ,, Arthur]","[(127, 133), (134, 135), (135, 136), (137, 145..."
640,"[Lady, Luck, :, the, theory, of, probability, ...","[(2118, 2122), (2123, 2127), (2127, 2128), (21..."


In [57]:
bi_dict = bi_df.to_dict('index')
print(bi_dict[167])

{'token': ['Brick', 'Burning', ',', 'Beardman', "'s"], 'offsets': ['(1421, 1426)', '(1427, 1434)', '(1434, 1435)', '(1436, 1444)', '(1444, 1446)']}


In [59]:
ann_dict = df_merged_imploded.to_dict('index')
print(ann_dict[167])

{'offsets_ann': ['(1436, 1444)', '(1436, 1444)'], 'text_ann': ['Beardman', 'Beardman'], 'label': ['Omission', 'Unknown'], 'id': [31928, 31929]}


In [78]:
# Turn a string of offsets into a tuple with each offset of type int
# "(1436, 1444)" --> (1436, 1444)
def offsetsStrToTuple(offsets_str):
    offsets_list = offsets_str[1:-1].split(", ")
    offsets_ints = [int(o) for o in offsets_list]
    return tuple(offsets_ints)

assert type(offsetsStrToTuple('(1436, 1444)')) == tuple
assert type(offsetsStrToTuple('(1436, 1444)')[0]) == int
assert type(offsetsStrToTuple('(1436, 1444)')[1]) == int
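
An alternative sketch, assuming the offsets were written to CSV as the string form of Python tuples, is to parse them with the standard library's `ast.literal_eval`, which also tolerates variations in spacing. The function name `offsets_str_to_tuple` is hypothetical and is not used elsewhere in this notebook.

```python
import ast

def offsets_str_to_tuple(offsets_str):
    """Parse a string such as "(1436, 1444)" into a tuple of ints."""
    parsed = ast.literal_eval(offsets_str)
    if not (isinstance(parsed, tuple) and all(isinstance(o, int) for o in parsed)):
        raise ValueError("Not a tuple of ints: {!r}".format(offsets_str))
    return parsed

print(offsets_str_to_tuple("(1436, 1444)"))
```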

In [101]:
desc_ids = list(bi_dict.keys())[:100]  # START WITH SAMPLE
assert len(set(desc_ids)) == len(desc_ids)  # Make sure every description ID is unique
log = 0
descid_to_tag = dict.fromkeys(desc_ids)
for desc_id in desc_ids:
    text_spans = ann_dict[desc_id]["text_ann"]
    span_labels = ann_dict[desc_id]["label"]
    # Parse each annotated text span's offsets once per description
    span_offsets = [offsetsStrToTuple(o) for o in ann_dict[desc_id]["offsets_ann"]]
    desc_tokens = bi_dict[desc_id]["token"]
    desc_tokens_offsets = bi_dict[desc_id]["offsets"]
    desc_tags = []
    for i, desc_token in enumerate(desc_tokens):
        token_start, token_end = offsetsStrToTuple(desc_tokens_offsets[i])
        tags = set()  # Note: one token may have multiple tags

        for j, text_span in enumerate(text_spans):
            # Compare against the offsets of this span (index j), not the last span seen
            span_start, span_end = span_offsets[j]
            # Be sure a matching token's offsets are within the annotated text span
            if (desc_token in text_span) and (token_start >= span_start) and (token_end <= span_end):
                # If the start offsets are the same, assign a 'B-' tag;
                # otherwise, assign an 'I-' tag
                prefix = "B-" if token_start == span_start else "I-"
                tags.add(prefix + span_labels[j])

        # If no annotated text span contains the token, assign it an 'O' tag
        if not tags:
            tags.add("O")
        desc_tags += [tags]

    assert len(desc_tokens) == len(desc_tags)
    descid_to_tag[desc_id] = desc_tags

    log += 1
    if log % 100 == 0:
        print("Assigned tags for {} descriptions".format(log))

Assigned tags for 100 descriptions


In [109]:
did = 610 #508 #167
# print(ann_dict[did])
print(bi_dict[did])
# print(descid_to_tag[did])

# spans = ['Beardman', 'Beardman']
# spans2 = ["Brick Burning"]
# tokens = ['Brick', 'Burning', ',', 'Beardman', "'s"]
# # print(spans.count('Beardman'))
# # # print(spans.index('Beardman'))
# # # print(tokens.index('Beardman'))
# # for k in range(0,3):
# #     print(k)
# indeces = [index for index in range(len(spans)) if spans[index] == 'Beardman']
# print(indeces)

{'token': ['Letter', ':', ':', 'Koestler', ',', 'Arthur'], 'offsets': ['(127, 133)', '(134, 135)', '(135, 136)', '(137, 145)', '(145, 146)', '(147, 153)']}


In [4]:
# is_annotated_col = []
# annotated_id = []
# i, maxI = 0, len(token_desc_ids)  #1188478, 1189478
# while i < maxI:
#     desc_id = token_desc_ids[i]
#     token = tokens[i]
#     token_start, token_end = token_offsets_tuples[i][0], token_offsets_tuples[i][1] 
    
#     ann_df = df_merged.loc[df_merged.desc_id == desc_id]
#     ann_id_list = list(ann_df.id)
#     ann_offsets_list = list(ann_df.offsets_ann)
#     ann_offsets_clean = [ann_offsets[1:-1].split(", ") for ann_offsets in ann_offsets_list]
#     ann_offsets_tuples = [tuple((int(ann_offsets[0]), int(ann_offsets[1]))) for ann_offsets in ann_offsets_clean]
    
#     for j,ann_offsets in enumerate(ann_offsets_tuples):
#         ann_start = ann_offsets[0]
#         ann_end = ann_offsets[1]
#         if token_start == ann_start:
#             is_annotated_col += ["B"]
#             annotated_id += [ann_id_list[j]]
#         elif (token_start > ann_start) and (token_start <= ann_end):
#             is_annotated_col += ["I"]
#             annotated_id += [ann_id_list[j]]
#         else:
#             is_annotated_col += ["O"]
#             annotated_id += ["None"]
    
#     i += 1

# assert len(is_annotated_col) == len(token_desc_ids)
# assert len(is_annotated_col) == len(annotated_id)

KeyboardInterrupt: 

In [None]:
df_tokens.insert(len(df_tokens.columns),"is_annotated",is_annotated_col)
df_tokens.insert(len(df_tokens.columns),"ann_id",annotated_id)
df_tokens.head()

In [5]:
print(len(is_annotated_col))

48022031


In [8]:
print(len(annotated_id))

48022031
