# Analysis: Descriptions' and Annotations' Lengths
## Post Annotation and Aggregation

Outputs the files:
  * `../data/analysis_data/descriptions_with_counts.csv`: adds columns to `descriptions.csv` for word counts and sentence counts, where words are alphanumeric tokens (punctuation excluded)
  * `../data/analysis_data/descs_stats.csv`: contains the count, minimum, maximum, average, and standard deviation of all descriptions and each type of description
  * `../data/crc_metadata/all_descs_with_offsets.csv`: contains one row for every description in the annotated datasets with columns for the descriptions' corresponding id, eadid, file, start offset, and end offset

***

**Table of Contents**

[0.](#0) Loading

[1.](#1) Lengths of Descriptions and Annotations

  * [Lengths of Descriptions](#1.1)
  
  * TO DO: [Lengths of Annotations](#1.2)
  
[2.](#2) Offsets of Descriptions

[3.](#3) Offsets of Tokens

***

<a id="0"></a>
## 0. Loading
Load the Python libraries and the dataset to be analyzed.

In [1]:
import utils  # import custom functions
import config # import directory path variables

from pathlib import Path

import pandas as pd
import numpy as np
import string, csv, re, os, sys #,json

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag
from nltk.text import Text
from nltk.probability import FreqDist
from collections import Counter

%matplotlib inline
import matplotlib.pyplot as plt

<a id="1"></a>
## 1. Lengths of Descriptions and Annotations
**Find the minimum, maximum, average, and standard deviation of word and sentence counts...**
* Per description (by `desc_id` - a.k.a. per "document" for document classifiers)
* Per metadata field (Title, Biographical / Historical, Scope and Contents, and Processing Information)
* Per collection (identifiable with the `eadid` column)
* Per annotation label (Omission, Stereotype, Generalization, etc.)
* Per annotation category (Person Name, Linguistic, Contextual)
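
As a rough illustration of the statistics listed above, the sketch below computes the count, minimum, maximum, mean, and standard deviation of word counts per metadata field with pandas. The toy DataFrame and its values are invented for illustration; the column names `field` and `word_count` mirror those used later in this notebook. This is not the implementation of `utils.makeDescribeDf`, which is not shown here.

```python
import pandas as pd

# Toy data (hypothetical values) with the same column names used in this notebook
toy_df = pd.DataFrame({
    "field": ["Title", "Title", "Scope and Contents", "Scope and Contents"],
    "word_count": [4, 6, 20, 30],
})

# One row of summary statistics per metadata field
field_stats = toy_df.groupby("field")["word_count"].agg(["count", "min", "max", "mean", "std"])
print(field_stats)
```

The same pattern should apply to the other groupings by swapping the `groupby` key, for example `eadid` for collections.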

<a id="1.1"></a>
### 1.1 Lengths of Descriptions

In [37]:
descs_path = config.crc_meta_path+"all_descriptions.csv"     # descriptions in column of CSV file

In [38]:
desc_df = pd.read_csv(descs_path, index_col=0)
desc_df.head()

Unnamed: 0,eadid,description,field,desc_id
0,AA5,Professor James Aitken White was a leading Sco...,Biographical / Historical,0
1,AA5,Papers of The Very Rev Prof James Whyte (1920-...,Title,1
2,AA6,Rev Thomas Allan was born on 16 August 1916 in...,Biographical / Historical,2
3,AA6,Papers of Rev Tom Allan (1916-1965)\n\n,Title,3
4,AA7,Alec Cheyne was born on 1 June 1924 in Errol i...,Biographical / Historical,4


In [39]:
# # Remove metadata field name from each description
# new_descs = []
# descs = list(desc_df.description)
# fields = list(desc_df.field)
# i = 0
# maxI = len(descs)
# while i < maxI:
#     d, f = descs[i], fields[i]
#     to_remove = f+":\n"
#     d = d.replace(to_remove,"")
#     new_descs += [d]
#     i += 1
# assert len(new_descs) == len(descs)
# # new_descs[:10]            # Looks good

In [40]:
# # Update the CSV file
# desc_df.description = new_descs
# desc_df.head()
# desc_df.to_csv(descs_path)

In [42]:
# Write each description to a txt file named with desc_id
ids = list(desc_df.desc_id)
descs = list(desc_df.description)
desc_txt_dir = config.crc_meta_path+"descriptions_brat/"
utils.strToTxt(ids, descs, "description", desc_txt_dir)

In [None]:
corpus = PlaintextCorpusReader(desc_txt_dir, r"description\d+\.txt", encoding="utf8")
# print(len(corpus.fileids()), desc_df.shape[0])  # Looks good
print(corpus.fileids()[-20:]) # Looks good

#### Length per Description

In [None]:
desc_words, desc_lower_words, desc_sents = utils.getWordsSents(corpus)
print(desc_words[0][:10])
print(desc_lower_words[0][:10])
print(desc_sents[0][:2])

In [None]:
# Add word and sentence counts to DataFrame/CSV of descriptions
word_count = [len(word_list) for word_list in desc_words]  # includes digits but not punctuation
sent_count = [len(sent_list) for sent_list in desc_sents]
print(word_count[:2], sent_count[:4])  # Looks good
# len(desc_sents[2]) # 14
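
The helper `utils.getWordsSents` is not shown in this notebook. As a minimal sketch of the counting rule stated at the top (words are alphanumeric tokens, punctuation excluded), the hypothetical function below uses regular expressions instead of NLTK. Its sentence splitting is a naive stand-in for `sent_tokenize` and will miscount around abbreviations such as "Rev." or "Prof.".

```python
import re

def count_words_and_sentences(text):
    """Hypothetical helper: returns (word_count, sentence_count) for a string."""
    # Words: runs of letters and digits, so punctuation is never counted
    words = re.findall(r"[^\W_]+", text)
    # Sentences: naive split after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return len(words), len(sentences)

print(count_words_and_sentences("Alec Cheyne was born in 1924. He studied in Edinburgh."))
```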

In [None]:
desc_df.insert(len(desc_df.columns), "word_count", word_count)
desc_df.insert(len(desc_df.columns), "sent_count", sent_count)
desc_df.head()

In [None]:
desc_df.to_csv("../data/analysis_data/descriptions_with_counts.csv")  # write a new CSV file with the word and sentence counts

In [None]:
desc_df = desc_df.reset_index()
desc_df.head(1)

In [None]:
desc_df_stats = utils.makeDescribeDf("All", desc_df)
desc_df_stats

#### Lengths per Metadata Field

In [None]:
field = "Biographical / Historical"
bh_stats = utils.makeDescribeDf(field, desc_df)
bh_stats

In [None]:
field = "Scope and Contents"
sc_stats = utils.makeDescribeDf(field, desc_df)
sc_stats

In [None]:
field = "Processing Information"
pi_stats = utils.makeDescribeDf(field, desc_df)
pi_stats

In [None]:
field = "Title"
t_stats = utils.makeDescribeDf(field, desc_df)
t_stats

#### Combine the Statistics

In [None]:
df_stats = pd.concat([desc_df_stats, t_stats, sc_stats, bh_stats, pi_stats], axis=0)
df_stats

In [None]:
df_stats.to_csv("../data/analysis_data/descs_stats.csv")

#### Prepare data for visualization in Observable

In [None]:
df_descs = pd.read_csv("../data/analysis_data/descriptions_with_counts.csv", index_col=0)
df_descs.head()

<a id="1.2"></a>
### 1.2 Lengths of Annotations

* Dataset: `annot-post/data/aggregated_final.csv`

<a id="2"></a>
## 2. Offsets of Descriptions

**Get the start and end offset of every description so that automated labels can be exported as .ann files for visualization with brat.**

The [standoff format](https://brat.nlplab.org/standoff.html) that the brat rapid annotation tool uses records the start offset and end offset of annotated text spans where:
* The **start offset** is the index of the *first character* in the annotated text span (which is also the number of characters in the document preceding the beginning of the annotated text span)
* The **end offset** is the index of the character *immediately after the annotated text span* (so the end offset minus the start offset equals the number of characters in the span)

This means that, in a document containing only descriptions, the start offset of the first description would be 0 and the end offset of the last description would equal the length (number of characters) of the document. In these files a field name precedes each description, so the first description starts after it. There are multiple descriptions in each document, so we need to determine the intermediate start and end offsets as well, which we'll add as columns to the file `../data/crc_metadata/all_descriptions.csv`.
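
To make the offset convention concrete, the sketch below builds a toy document from two field blocks and records each description's start and end offset. The document text is invented, and the `Field:\nDescription\n` layout is an assumption based on the pattern used later in this notebook. With this convention, slicing the document with `[start:end]` returns exactly the description.

```python
# Toy document (invented text) in an assumed "Field:\nDescription\n" layout
doc = "Title:\nPapers of A. Person\nScope and Contents:\nLetters and notes\n"

offsets = {}
for description in ["Papers of A. Person", "Letters and notes"]:
    start = doc.index(description)    # index of the description's first character
    end = start + len(description)    # index of the character after the description
    offsets[description] = (start, end)
    assert doc[start:end] == description

print(offsets)
```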

<div class="alert-block alert-class alert-warning">
    <p><b>NOTE:</b> Need to re-assign description IDs to files in annotation_data directory (already reassigned to crc_metadata and aggregated_data files)</p>
</div>

In [2]:
file_type = ".txt"  # Read in only the PlainText files

In [3]:
filenames = os.listdir(config.doc_path)
filenames = [f for f in filenames if f[-4:] == file_type] # the descriptions are in the txt files
print(filenames[:3])

['Coll-227_00100.txt', 'La_03600.txt', 'PJM_03000.txt']


In [4]:
metadata_field_names = ["Title", "Scope and Contents", "Biographical / Historical", "Processing Information"]
            
# INPUT: path to a directory of metadata description documents (str) and
#        a list of the file names in that directory to read (list of str)
# OUTPUT: a dictionary of metadata description ids and the associated 
#         description text, field name, file name and offsets
def getDescriptionsInFiles(dirpath, file_list, fieldnames=metadata_field_names):
    desc_dict = dict()
    did = 0
    for filename in file_list:

        # Get a string of the input file's text (metadata descriptions)
        with open(os.path.join(dirpath, filename), 'r') as f:
            f_string = f.read()
        
        for fieldname in fieldnames:
            # Match the line of text that follows "<field name>:\n"
            pattern = "(?<={}:\n).+".format(re.escape(fieldname))
            # finditer reports where each match sits in the file, so a description
            # that occurs more than once gets its own offsets each time
            for match in re.finditer(pattern, f_string):
                desc_dict[did] = {
                    "description": match.group(),
                    "field": fieldname,
                    "file": filename,
                    "start_offset": match.start(),
                    "end_offset": match.end() + 1,  # + 1 also counts the newline after the description
                }
                did += 1
                    
    return desc_dict

In [5]:
descs_details = getDescriptionsInFiles(config.doc_path, filenames)

Great!  Now create a DataFrame of the description data:

In [27]:
ids_col = list(descs_details.keys())
desc_col, field_col, file_col, eadid_col, start_offset_col, end_offset_col = [], [], [], [], [], []
for desc_id in ids_col:
    desc_dict = descs_details[desc_id]
    
    eadid = (re.findall(r"^.*(?=_\d+\.txt)", desc_dict["file"]))[0]
    eadid_col += [eadid]
    
    field_col += [desc_dict["field"]]
    
    file_col += [desc_dict["file"]]
    
    desc_col += [desc_dict["description"]]
    
    start_offset_col += [desc_dict["start_offset"]]
    end_offset_col += [desc_dict["end_offset"]]

new_descs_df = pd.DataFrame({
    "desc_id":ids_col, "eadid":eadid_col, "field":field_col, "file":file_col, 
    "description":desc_col, "desc_start_offset":start_offset_col, "desc_end_offset":end_offset_col
})

new_descs_df.head()

Unnamed: 0,desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
0,0,Coll-227,Title,Coll-227_00100.txt,Records of the Phrenological Society of Edinburgh,29,79
1,1,Coll-227,Scope and Contents,Coll-227_00100.txt,The records of the Phrenological Society inclu...,100,610
2,2,Coll-227,Biographical / Historical,Coll-227_00100.txt,The Phrenological Society of Edinburgh was for...,638,2277
3,3,La,Title,La_03600.txt,"Letter: 1825 Jan. 10, 27 Lower Belgrave Place ...",7,117
4,4,La,Title,La_03600.txt,"Letter: 1825 Mar. 1, 27 Lower Belgrave Place [...",125,223


Write the data to a CSV file (the write is commented out in this run, which reloads the saved file):

In [2]:
# new_descs_df.to_csv(config.crc_meta_path+"descs_with_offsets.csv")
new_descs_df = pd.read_csv(config.crc_meta_path+"descs_with_offsets.csv", index_col=0)
new_descs_df.head()

Unnamed: 0,desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
0,0,Coll-227,Title,Coll-227_00100.txt,Records of the Phrenological Society of Edinburgh,29,79
1,1,Coll-227,Scope and Contents,Coll-227_00100.txt,The records of the Phrenological Society inclu...,100,610
2,2,Coll-227,Biographical / Historical,Coll-227_00100.txt,The Phrenological Society of Edinburgh was for...,638,2277
3,3,La,Title,La_03600.txt,"Letter: 1825 Jan. 10, 27 Lower Belgrave Place ...",7,117
4,4,La,Title,La_03600.txt,"Letter: 1825 Mar. 1, 27 Lower Belgrave Place [...",125,223


Now assign description IDs from this DataFrame to the aggregated annotated datasets:

In [13]:
df_grouped = pd.read_csv(config.agg_path+"desc_field_descid_label_eadid.csv", index_col=0)
df_agg = pd.read_csv(config.agg_path+"aggregated_with_eadid_descid_desc_cols.csv", index_col=0)

In [6]:
df_merged = df_grouped.merge(new_descs_df, on=["description", "field", "eadid"])
df_merged.head()

Unnamed: 0,description,field,desc_id_x,label,eadid,desc_id_y,file,desc_start_offset,desc_end_offset
0,John Baillie: posthumous,Title,68,{'Unknown'},BAI,70381,BAI_01000.txt,1290,1315
1,"Letters received from Henry Sloane Coffin, wit...",Scope and Contents,143,"{'Masculine', 'Unknown'}",BAI,47675,BAI_01300.txt,5853,5983
2,Family photographs consist of:photographs of f...,Scope and Contents,221,"{'Masculine', 'Unknown', 'Feminine'}",BAI,81505,BAI_01600.txt,5967,6202
3,"Correspondence and related items, including le...",Scope and Contents,292,{'Unknown'},BAI,33009,BAI_01900.txt,5297,5506
4,From 1927-1930 John Baillie was Professor of S...,Biographical / Historical,361,"{'Gendered-Pronoun', 'Unknown', 'Masculine', '...",BAI,43372,BAI_02200.txt,15180,15419


Great!  Now we want to keep the *right* DataFrame's description IDs, so we'll drop `desc_id_x` and remove the `_y` from `desc_id_y`:

In [9]:
df_merged = df_merged.drop(columns=["desc_id_x"])
df_merged = df_merged.rename(columns={"desc_id_y":"desc_id"})
df_merged.head()

Unnamed: 0,description,field,label,eadid,desc_id,file,desc_start_offset,desc_end_offset
0,John Baillie: posthumous,Title,{'Unknown'},BAI,70381,BAI_01000.txt,1290,1315
1,"Letters received from Henry Sloane Coffin, wit...",Scope and Contents,"{'Masculine', 'Unknown'}",BAI,47675,BAI_01300.txt,5853,5983
2,Family photographs consist of:photographs of f...,Scope and Contents,"{'Masculine', 'Unknown', 'Feminine'}",BAI,81505,BAI_01600.txt,5967,6202
3,"Correspondence and related items, including le...",Scope and Contents,{'Unknown'},BAI,33009,BAI_01900.txt,5297,5506
4,From 1927-1930 John Baillie was Professor of S...,Biographical / Historical,"{'Gendered-Pronoun', 'Unknown', 'Masculine', '...",BAI,43372,BAI_02200.txt,15180,15419
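
The `_x` and `_y` suffixes come from pandas: when both DataFrames contain a column with the same name that is not a merge key, `merge` appends `_x` to the left column and `_y` to the right one by default. A toy sketch with invented values:

```python
import pandas as pd

# Two toy tables (invented values) that share a key and an overlapping desc_id column
left = pd.DataFrame({"description": ["Papers of A", "Papers of B"], "desc_id": [1, 2]})
right = pd.DataFrame({"description": ["Papers of A", "Papers of B"], "desc_id": [10, 20]})

merged = left.merge(right, on="description")  # overlapping desc_id becomes desc_id_x and desc_id_y
merged = merged.drop(columns=["desc_id_x"]).rename(columns={"desc_id_y": "desc_id"})
print(merged)
```

Passing `validate="one_to_one"` to `merge` raises a `MergeError` if a key is duplicated on either side, which can catch repeated description text before it multiplies rows.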


Update the file:

In [10]:
df_merged.to_csv(config.agg_path+"desc_field_descid_label_eadid.csv")

Perform the same operations on the aggregated dataset:

In [17]:
df_merged = df_agg.merge(new_descs_df, on=["description", "field", "eadid"])
df_merged = df_merged.drop(columns=["desc_id_x"])
df_merged = df_merged.rename(columns={"desc_id_y":"desc_id", "file_x":"file_ann", "offsets":"offsets_ann", "file_y":"file_desc", "text":"text_ann"})
df_merged.head()

Unnamed: 0,eadid,field,file_ann,offsets_ann,text_ann,label,category,id,description,file_desc,desc_id,file,desc_start_offset,desc_end_offset
0,BAI,Title,BAI_01000.ann,"(1290, 1302)",John Baillie,Unknown,Person-Name,211,John Baillie: posthumous,BAI_01000.txt,70381,BAI_01000.txt,1290,1315
1,BAI,Scope and Contents,BAI_01300.ann,"(5875, 5894)",Henry Sloane Coffin,Unknown,Person-Name,524,"Letters received from Henry Sloane Coffin, wit...",BAI_01300.txt,47675,BAI_01300.txt,5853,5983
2,BAI,Scope and Contents,BAI_01300.ann,"(5925, 5936)",Hugh Martin,Unknown,Person-Name,525,"Letters received from Henry Sloane Coffin, wit...",BAI_01300.txt,47675,BAI_01300.txt,5853,5983
3,BAI,Scope and Contents,BAI_01300.ann,"(5951, 5963)",John Baillie,Masculine,Person-Name,526,"Letters received from Henry Sloane Coffin, wit...",BAI_01300.txt,47675,BAI_01300.txt,5853,5983
4,BAI,Scope and Contents,BAI_01300.ann,"(5951, 5963)",John Baillie,Unknown,Person-Name,527,"Letters received from Henry Sloane Coffin, wit...",BAI_01300.txt,47675,BAI_01300.txt,5853,5983


Update the file:

In [18]:
df_merged.to_csv(config.agg_path+"aggregated_with_eadid_descid_desc_cols.csv")

<a id="3"></a>

## 3. Offsets of Tokens

In [2]:
df_merged = pd.read_csv(config.agg_path+"aggregated_with_eadid_descid_desc_cols.csv", index_col=0)
text_ann_list = list(df_merged.text_ann)
ann_id_list = list(df_merged.id)
desc_list = list(df_merged.description)

In [5]:
ann_offsets_list = list(df_merged.offsets_ann)
ann_offsets_clean = [ann_offsets[1:-1].split(", ") for ann_offsets in ann_offsets_list]
ann_offsets_tuples = [tuple((int(ann_offsets[0]), int(ann_offsets[1]))) for ann_offsets in ann_offsets_clean]
print(ann_offsets_tuples[0:5])

[(1290, 1302), (5875, 5894), (5925, 5936), (5951, 5963), (5951, 5963)]
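
Because the offsets are stored in the CSV as strings such as `"(1290, 1302)"`, they have to be parsed back into integer pairs. As an alternative sketch to the string slicing above, `ast.literal_eval` from the standard library parses the tuple literal directly. The sample strings are taken from the output above.

```python
import ast

offset_strings = ["(1290, 1302)", "(5875, 5894)"]
# literal_eval only evaluates Python literals, so it is safe for parsing stored tuples
offset_tuples = [ast.literal_eval(s) for s in offset_strings]
print(offset_tuples)
```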


In [8]:
desc_start_list = list(df_merged.desc_start_offset)
tokens, token_offsets = [], []
i, maxI = 0, 5  # START WITH A SAMPLE, THEN: len(text_ann_list)
while i < maxI:
    start_offset, end_offset = ann_offsets_tuples[i]
    ann_text = text_ann_list[i]
    # Annotation offsets count from the start of the file, not the description,
    # so shift them by the description's start offset before slicing
    rel_start = start_offset - desc_start_list[i]
    print(desc_list[i][rel_start:rel_start + (end_offset - start_offset)])

    search_from = 0  # position in ann_text where the search for the next token begins
    for token in word_tokenize(ann_text):
        token_index = ann_text.find(token, search_from)
        if token_index == -1:  # word_tokenize can rewrite characters (e.g. quotation marks)
            continue
        token_start_offset = start_offset + token_index
        token_end_offset = token_start_offset + len(token)
        search_from = token_index + len(token)
        # tokens += [token]
        # token_offsets += [(token_start_offset, token_end_offset)]
        print(token, (token_start_offset, token_end_offset))
    i += 1

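The annotation offsets count characters from the start of the file, not from the start of the description, so slicing the description text with them returns an empty string:

In [9]: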
desc_list[0][1290:1303]

''
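
One way to sidestep the string searching is to take token positions directly from a regular expression match. The sketch below is an alternative under stated assumptions, not the notebook's method: it treats tokens as runs of word characters (which differs from `word_tokenize`), and the helper name `token_offsets` is hypothetical. The sample annotation text and start offset come from the first row of the aggregated data shown earlier.

```python
import re

def token_offsets(ann_text, ann_start_offset):
    """Hypothetical helper: (token, start, end) triples with file-level offsets."""
    return [
        (m.group(), ann_start_offset + m.start(), ann_start_offset + m.end())
        for m in re.finditer(r"\w+", ann_text)
    ]

print(token_offsets("John Baillie", 1290))
```

Because each match carries its own position, repeated tokens within one annotation get distinct offsets without any bookkeeping.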