# Analysis: Descriptions' and Annotations' Lengths
## Post Annotation and Aggregation

Outputs the files:
  * `annot-post/data/descriptions_with_counts.csv`: adds columns to `descriptions.csv` for word counts and sentence counts, where words are alphanumeric tokens (punctuation excluded)
  * `annot-post/data/descs_stats.csv`: contains the count, minimum, maximum, average, and standard deviation of all descriptions and each type of description

***

**Table of Contents**

[0.](#0) Loading

[1.](#1) Lengths of Descriptions and Annotations

  * [Lengths of Descriptions](#1.1)
  
  * TO DO: [Lengths of Annotations](#1.2)
  
[2.](#2) Offsets of Descriptions

***

<a id="0"></a>
## 0. Loading
First, load the Python libraries and the dataset to be analyzed.

In [1]:
import utils  # import custom functions

import pandas as pd
import numpy as np
import string, csv, re, os, sys #,json

# import nltk
# from nltk.tokenize import word_tokenize
# from nltk.tokenize import sent_tokenize
# # nltk.download('punkt')
# from nltk.corpus import PlaintextCorpusReader
# # nltk.download('averaged_perceptron_tagger')
# from nltk.corpus import stopwords
# # nltk.download('stopwords')
# from nltk.tag import pos_tag
# from nltk.text import Text
# from nltk.probability import FreqDist
# from collections import Counter
# from wordcloud import WordCloud

# %matplotlib inline
# import matplotlib.pyplot as plt



In [6]:
# dir_path = "data/"
# data_files = ["aggregated_final.csv", "aggregated_with_annotator_eadid_note_cols.csv", 
#               "aggregated_with_eadid_descid_desc_cols.csv", "descriptions.csv"]

In [7]:
# df = pd.read_csv(dir_path+data_files[2], index_col=0)
# df.head()

In [8]:
# print("Rows:",df.shape[0], "\nColumns:",df.shape[1])  # Rows: 55260, Columns: 10

<a id="1"></a>
## 1. Lengths of Descriptions and Annotations
Minimum, maximum, average, and standard deviation of word and sentence counts...
* Per description (by `desc_id` - a.k.a. per "document" for document classifiers)
* Per metadata field (Title, Biographical / Historical, Scope and Contents, and Processing Information)
* Per collection (identifiable with the `eadid` column)
* Per annotation label (Omission, Stereotype, Generalization, etc.)
* Per annotation category (Person Name, Linguistic, Contextual)

<a id="1.1"></a>
### 1.1 Lengths of Descriptions

In [17]:
descs_path = "../data/crc_metadata/all_descriptions.csv"     # descriptions in column of CSV file

In [18]:
desc_df = pd.read_csv(descs_path, index_col=0)
desc_df.head()

Unnamed: 0,eadid,description,field,desc_id
0,AA5,Professor James Aitken White was a leading Sco...,Biographical / Historical,0
1,AA5,Papers of The Very Rev Prof James Whyte (1920-...,Title,1
2,AA6,Rev Thomas Allan was born on 16 August 1916 in...,Biographical / Historical,2
3,AA6,Papers of Rev Tom Allan (1916-1965)\n\n,Title,3
4,AA7,Alec Cheyne was born on 1 June 1924 in Errol i...,Biographical / Historical,4


In [83]:
# # Remove metadata field name from each description
# new_descs = []
# descs = list(desc_df.description)
# fields = list(desc_df.field)
# i = 0
# maxI = len(descs)
# while i < maxI:
#     d, f = descs[i], fields[i]
#     to_remove = f+":\n"
#     d = d.replace(to_remove,"")
#     new_descs += [d]
#     i += 1
# assert len(new_descs) == len(descs)
# # new_descs[:10]            # Looks good

In [84]:
# # Update the CSV file
# desc_df.description = new_descs
# desc_df.head()
# desc_df.to_csv(descs_path)

In [86]:
# Write each description to a txt file named with desc_id
ids = list(desc_df.desc_id)
descs = list(desc_df.description)    # descriptions as stored in the CSV file
zero_padding = len(str(max(ids)))    # number of digits in the largest desc_id
desc_txt_dir = "data/descriptions/"
for d_id, desc in zip(ids, descs):
    # pad with zeros so file order aligns with DataFrame order
    id_str = str(d_id).zfill(zero_padding)
    filename = "description"+id_str+".txt"
    # the with statement closes the file even if the write fails
    with open(desc_txt_dir+filename, "w", encoding="utf8") as f:
        f.write(desc)
print("Files written to "+desc_txt_dir)

Files written to data/descriptions/
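The zero padding matters because a corpus reader typically lists files in lexicographic order, not numeric order. The following minimal sketch illustrates the difference with a handful of made-up IDs (the IDs are hypothetical and chosen only to show the effect):

```python
# Hypothetical IDs chosen only to illustrate the sorting behavior
ids = [0, 2, 10, 100]

# Without padding, lexicographic order differs from numeric order
unpadded = ["description" + str(i) + ".txt" for i in ids]

# With padding to the width of the largest ID, the two orders agree
width = len(str(max(ids)))
padded = ["description" + str(i).zfill(width) + ".txt" for i in ids]

print(sorted(unpadded))
print(sorted(padded))
```

With padded names, the order of files in the corpus can be expected to line up with the row order of the DataFrame, which the later word and sentence counts rely on.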


In [2]:
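from nltk.corpus import PlaintextCorpusReader  # reads a directory of txt files as a corpus

# raw string so that \d and \. reach the regular expression engine unchanged
corpus = PlaintextCorpusReader("data/descriptions/", r"description\d+\.txt", encoding="utf8")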
# print(len(corpus.fileids()), desc_df.shape[0])  # Looks good
print(corpus.fileids()[-20:]) # Looks good

['description11868.txt', 'description11869.txt', 'description11870.txt', 'description11871.txt', 'description11872.txt', 'description11873.txt', 'description11874.txt', 'description11875.txt', 'description11876.txt', 'description11877.txt', 'description11878.txt', 'description11879.txt', 'description11880.txt', 'description11881.txt', 'description11882.txt', 'description11883.txt', 'description11884.txt', 'description11885.txt', 'description11886.txt', 'description11887.txt']


#### Length per Description

In [3]:
desc_words, desc_lower_words, desc_sents = utils.getWordsSents(corpus)
print(desc_words[0][:10])
print(desc_lower_words[0][:10])
print(desc_sents[0][:2])

['Professor', 'James', 'Aitken', 'White', 'was', 'a', 'leading', 'Scottish', 'Theologian', 'and']
['professor', 'james', 'aitken', 'white', 'was', 'a', 'leading', 'scottish', 'theologian', 'and']
['Professor James Aitken White was a leading Scottish Theologian and Moderator of the General Assembly of the Church of Scotland.', "He was educated at Daniel Stewart's College and the University of Edinburgh where he studied philosophy and divinity."]


In [33]:
# Add word and sentence counts to DataFrame/CSV of descriptions
word_count = [len(word_list) for word_list in desc_words]  # includes digits but not punctuation
sent_count = [len(sent_list) for sent_list in desc_sents]
print(word_count[:2], sent_count[:4])  # Looks good
# len(desc_sents[2]) # 14

[179, 9] [8, 1, 14, 1]
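The counts above come from the custom `utils.getWordsSents` function, whose implementation is not shown in this notebook. As a rough illustration only, the sketch below counts words and sentences with regular expressions. The patterns and the function names `count_words` and `count_sentences` are assumptions made for this example, so the results may differ from the notebook's counts on some descriptions.

```python
import re

# Hypothetical stand-ins for the notebook's tokenizers (not the utils implementation).
# A word is a run of letters or digits, optionally joined by hyphens or apostrophes.
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")
# A sentence boundary is whitespace that follows ., ! or ?
SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def count_words(text):
    """Count alphanumeric tokens, ignoring standalone punctuation."""
    return len(WORD_PATTERN.findall(text))

def count_sentences(text):
    """Count the non-empty pieces between sentence boundaries."""
    return len([part for part in SENT_BOUNDARY.split(text.strip()) if part])

title = "Papers of Rev Tom Allan (1916-1965)\n\n"
bio = "He was born in Errol. He studied divinity."
print(count_words(title), count_sentences(title))
print(count_words(bio), count_sentences(bio))
```

A simple boundary rule like this one splits after abbreviations such as "Rev." or "Prof.", which is one reason a trained sentence tokenizer is usually preferable for archival descriptions.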


In [36]:
desc_df.insert(len(desc_df.columns), "word_count", word_count)
desc_df.insert(len(desc_df.columns), "sent_count", sent_count)
desc_df.head()

Unnamed: 0_level_0,description,field,desc_id,word_count,sent_count
eadid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
AA5,Professor James Aitken White was a leading Sco...,Biographical / Historical,0,179,8
AA5,Papers of The Very Rev Prof James Whyte (1920-...,Title,1,9,1
AA6,Rev Thomas Allan was born on 16 August 1916 in...,Biographical / Historical,2,315,14
AA6,Papers of Rev Tom Allan (1916-1965)\n\n,Title,3,6,1
AA7,Alec Cheyne was born on 1 June 1924 in Errol i...,Biographical / Historical,4,333,14


In [100]:
desc_df.to_csv("descriptions_with_counts.csv")  # write a new CSV file with the word and sentence counts

In [38]:
desc_df = desc_df.reset_index()
desc_df.head(1)

Unnamed: 0,eadid,description,field,desc_id,word_count,sent_count
0,AA5,Professor James Aitken White was a leading Sco...,Biographical / Historical,0,179,8


In [101]:
desc_df_stats = utils.makeDescribeDf("All", desc_df)
desc_df_stats

Unnamed: 0_level_0,Unnamed: 1_level_0,total_descriptions,mean,std,min,max
metadata_field,by,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
All,word_count,11888.0,30.780535,183.985507,1.0,14147.0
All,sent_count,11888.0,2.047863,11.263009,1.0,854.0


#### Lengths per Metadata Field

In [102]:
field = "Biographical / Historical"
bh_stats = utils.makeDescribeDf(field, desc_df)
bh_stats

Unnamed: 0_level_0,Unnamed: 1_level_0,total_descriptions,mean,std,min,max
metadata_field,by,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Biographical / Historical,word_count,576.0,130.635417,139.046526,6.0,1110.0
Biographical / Historical,sent_count,576.0,6.647569,6.759143,1.0,45.0


In [95]:
field = "Scope and Contents"
sc_stats = utils.makeDescribeDf(field, desc_df)
sc_stats

Unnamed: 0_level_0,Unnamed: 1_level_0,total_descriptions,mean,std,min,max
metadata_field,by,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Scope and Contents,word_count,6198.0,39.262988,248.426508,2.0,14147.0
Scope and Contents,sent_count,6198.0,2.304453,15.363817,1.0,854.0


In [96]:
field = "Processing Information"
pi_stats = utils.makeDescribeDf(field, desc_df)
pi_stats

Unnamed: 0_level_0,Unnamed: 1_level_0,total_descriptions,mean,std,min,max
metadata_field,by,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Processing Information,word_count,280.0,9.417857,10.892454,4.0,177.0
Processing Information,sent_count,280.0,1.075,0.335611,1.0,4.0


In [97]:
field = "Title"
t_stats = utils.makeDescribeDf(field, desc_df)
t_stats

Unnamed: 0_level_0,Unnamed: 1_level_0,total_descriptions,mean,std,min,max
metadata_field,by,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Title,word_count,4834.0,9.243691,6.764719,1.0,51.0
Title,sent_count,4834.0,1.227141,0.751659,1.0,15.0


#### Combine the Statistics

In [103]:
df_stats = pd.concat([desc_df_stats, t_stats, sc_stats, bh_stats, pi_stats], axis=0)
df_stats

Unnamed: 0_level_0,Unnamed: 1_level_0,total_descriptions,mean,std,min,max
metadata_field,by,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
All,word_count,11888.0,30.780535,183.985507,1.0,14147.0
All,sent_count,11888.0,2.047863,11.263009,1.0,854.0
Title,word_count,4834.0,9.243691,6.764719,1.0,51.0
Title,sent_count,4834.0,1.227141,0.751659,1.0,15.0
Scope and Contents,word_count,6198.0,39.262988,248.426508,2.0,14147.0
Scope and Contents,sent_count,6198.0,2.304453,15.363817,1.0,854.0
Biographical / Historical,word_count,576.0,130.635417,139.046526,6.0,1110.0
Biographical / Historical,sent_count,576.0,6.647569,6.759143,1.0,45.0
Processing Information,word_count,280.0,9.417857,10.892454,4.0,177.0
Processing Information,sent_count,280.0,1.075,0.335611,1.0,4.0
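The per-field statistics could also be computed in a single pass with a pandas `groupby`, rather than one cell per metadata field. The sketch below is an alternative under stated assumptions, not the implementation of `utils.makeDescribeDf`: it assumes the columns `field`, `word_count`, and `sent_count` shown above, and it uses a small made-up DataFrame (`toy`) in place of `desc_df`.

```python
import pandas as pd

# Toy stand-in for desc_df, using a few counts like those displayed above
toy = pd.DataFrame({
    "field": ["Title", "Title", "Biographical / Historical", "Biographical / Historical"],
    "word_count": [9, 6, 179, 315],
    "sent_count": [1, 1, 8, 14],
})

# Reshape to one row per (description, measure) pair
long_form = toy.melt(id_vars="field", value_vars=["word_count", "sent_count"],
                     var_name="by", value_name="n")

# Aggregate each measure within each metadata field
per_field = (long_form.groupby(["field", "by"])["n"]
             .agg(["count", "mean", "std", "min", "max"])
             .rename(columns={"count": "total_descriptions"}))
per_field.index = per_field.index.set_names(["metadata_field", "by"])
print(per_field)
```

This yields the per-field rows only; the "All" rows would still need to be computed on the whole DataFrame and concatenated.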


In [104]:
df_stats.to_csv("../data/analysis_data/descs_stats.csv")

#### Prepare Data for Visualization in Observable

In [41]:
df_descs = pd.read_csv("../data/analysis_data/descriptions_with_counts.csv", index_col=0)
df_descs.head()

Unnamed: 0,eadid,description,field,desc_id,word_count,sent_count
9,AA5,Professor James Aitken White was a leading Sco...,Biographical / Historical,0,179,8
17,AA5,Papers of The Very Rev Prof James Whyte (1920-...,Title,1,9,1
39,AA6,Rev Thomas Allan was born on 16 August 1916 in...,Biographical / Historical,2,315,14
47,AA6,Papers of Rev Tom Allan (1916-1965)\n\n,Title,3,6,1
70,AA7,Alec Cheyne was born on 1 June 1924 in Errol i...,Biographical / Historical,4,333,14


<a id="1.2"></a>
### 1.2 Lengths of Annotations

* Dataset: `annot-post/data/aggregated_final.csv`

<a id="2"></a>
## 2. Offsets of Descriptions

Get the start and end offset of every description so that automated labels can be exported as .ann files for visualization with brat.

In [11]:
doc_path = "../data/crc_metadata/descriptions_brat/"
file_type = ".txt"  # Read in only the PlainText files

In [12]:
filenames = os.listdir(doc_path)
filenames = [f for f in filenames if f.endswith(file_type)] # the descriptions are in the txt files
print(filenames[:6])

['Coll-227_00100.txt', 'La_03600.txt', 'PJM_03000.txt', 'La_07300.txt', 'Coll-1434_07400.txt', 'Coll-1434_03100.txt']


The [standoff format](https://brat.nlplab.org/standoff.html) that the brat rapid annotation tool uses records the start offset and end offset of annotated text spans where:
* The **start offset** is the index of the *first character* in the annotated text span (which is also the number of characters in the document preceding the beginning of the annotated text span)
* The **end offset** is the index of the character *immediately after the annotated text span* (so the end offset minus the start offset equals the number of characters in the span)

This means that the start offset of the first description of each document will be 0 and the end offset of the last description of each document will equal the length (number of characters) of the document. There are multiple descriptions for each document, so we need to determine the intermediate start and end offsets as well, which we'll add as a column to the file `../data/crc_metadata/all_descriptions.csv`.
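Start and end offsets defined this way behave like Python slice indices. The minimal sketch below illustrates this with a made-up two-field document. The document text and the helper name `span_offsets` are assumptions for the example and are not taken from the dataset.

```python
# Hypothetical document: field names followed by their descriptions
document = "Title:\nPapers of Rev Tom Allan\n\nScope and Contents:\nSermons and letters."

def span_offsets(document, span):
    """Return (start, end) so that document[start:end] == span."""
    start = document.index(span)   # raises ValueError if the span is absent
    end = start + len(span)        # index of the character after the span
    return start, end

title_start, title_end = span_offsets(document, "Papers of Rev Tom Allan")
scope_start, scope_end = span_offsets(document, "Sermons and letters.")

print(document[title_start:title_end])
print(scope_end == len(document))
```

Because the last description here runs to the end of the text, its end offset equals the document's length, as described above.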

In [43]:
data_path = "../data/aggregated_data/aggregated_with_eadid_descid_desc_cols.csv"
df = pd.read_csv(data_path, index_col=0)
df = df.drop(columns=["offsets","text","label","category","id"])
df = df.drop_duplicates()
df.head()

Unnamed: 0,desc_id,eadid,field,file,description
0,0,AA5,Biographical / Historical,AA5_00100.ann,Professor James Aitken White was a leading Sco...
6,1,AA5,Title,AA5_00100.ann,Papers of The Very Rev Prof James Whyte (1920-...
19,2,AA6,Biographical / Historical,AA6_00100.ann,Rev Thomas Allan was born on 16 August 1916 in...
50,3,AA6,Title,AA6_00100.ann,Papers of Rev Tom Allan (1916-1965)\n\n
62,4,AA7,Biographical / Historical,AA7_00100.ann,Alec Cheyne was born on 1 June 1924 in Errol i...


In [32]:
descriptions = list(df.description)
ann_files = list(set(list(df.file)))
# Replace .ann with .txt in each file's name
txt_files = [f[:-4]+".txt" for f in ann_files]
file_dict = dict(zip(txt_files,ann_files))
assert file_dict["AA5_00100.txt"] == "AA5_00100.ann"

In [33]:
desc_start_offsets, desc_end_offsets = [], []
desc_id_order = []
for filename in txt_files:
    # Read the whole document; the with statement closes the file afterwards
    with open(doc_path+filename, "r", encoding="utf8") as f:
        f_string = f.read()
    subdf = df.loc[df.file == file_dict[filename]]
    descs = list(subdf.description)
    desc_id_order += list(subdf.desc_id)
    for d in descs:
        # If there is no description text (e.g. NaN), record None for both offsets
        if not isinstance(d, str):
            desc_start_offsets += [None]
            desc_end_offsets += [None]
        # If there is text for this description, use the index of the first
        # character of the text as the start offset
        else:
            start_offset = f_string.find(d)  # -1 if the text is not found in the file
            # NOTE: the +1 places this value one character past the brat
            # end offset (start offset + length of the text)
            end_offset = start_offset+len(d)+1
            desc_start_offsets += [start_offset]
            desc_end_offsets += [end_offset]
assert len(desc_start_offsets) == len(descriptions)
assert len(desc_end_offsets) == len(descriptions)
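Searching with `str.find` returns only the first match and returns -1 when there is no match, so repeated or missing description text can produce misleading offsets. The sketch below shows one way to surface both cases by listing every occurrence of a span. The helper name `locate` and the sample document are hypothetical, and end offsets here follow the brat convention (start offset plus length).

```python
def locate(document, span):
    """Return a list of (start, end) pairs for every occurrence of span."""
    if not isinstance(span, str) or not span:
        return []                      # no text to search for (e.g. NaN)
    matches = []
    start = document.find(span)
    while start != -1:
        matches.append((start, start + len(span)))
        start = document.find(span, start + 1)
    return matches

# Hypothetical document in which the same text appears under two fields
doc = "Title:\nLetters\n\nScope and Contents:\nLetters\n"

print(locate(doc, "Letters"))    # two occurrences: ambiguous
print(locate(doc, "Absent"))     # empty list: not found
print(locate(doc, float("nan"))) # empty list: no description text
```

A caller could accept the offsets only when exactly one match is returned and set the other descriptions aside for manual review.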

In [44]:
offset_df = pd.DataFrame({"desc_id":desc_id_order, "desc_start_offset":desc_start_offsets, "desc_end_offset":desc_end_offsets})
offset_df.head()

Unnamed: 0,desc_id,desc_start_offset,desc_end_offset
0,6492,66.0,93.0
1,6515,264.0,294.0
2,6506,1220.0,1282.0
3,6516,1348.0,1376.0
4,6517,1382.0,1440.0


In [45]:
joined = df.set_index("desc_id").join(offset_df.set_index("desc_id"))
joined.head()

Unnamed: 0_level_0,eadid,field,file,description,desc_start_offset,desc_end_offset
desc_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,AA5,Biographical / Historical,AA5_00100.ann,Professor James Aitken White was a leading Sco...,661.0,1724.0
1,AA5,Title,AA5_00100.ann,Papers of The Very Rev Prof James Whyte (1920-...,24.0,78.0
2,AA6,Biographical / Historical,AA6_00100.ann,Rev Thomas Allan was born on 16 August 1916 in...,588.0,2512.0
3,AA6,Title,AA6_00100.ann,Papers of Rev Tom Allan (1916-1965)\n\n,24.0,62.0
4,AA7,Biographical / Historical,AA7_00100.ann,Alec Cheyne was born on 1 June 1924 in Errol i...,445.0,2441.0


In [47]:
descs_path = "../data/crc_metadata/all_descriptions.csv"     # descriptions in column of CSV file
descs_df = pd.read_csv(descs_path, index_col=0)
descs_df.head()

Unnamed: 0,eadid,description,field,desc_id
0,AA5,Professor James Aitken White was a leading Sco...,Biographical / Historical,0
1,AA5,Papers of The Very Rev Prof James Whyte (1920-...,Title,1
2,AA6,Rev Thomas Allan was born on 16 August 1916 in...,Biographical / Historical,2
3,AA6,Papers of Rev Tom Allan (1916-1965)\n\n,Title,3
4,AA7,Alec Cheyne was born on 1 June 1924 in Errol i...,Biographical / Historical,4


In [58]:
all_desc_ids = list(descs_df.desc_id)
joined_desc_ids = set(joined.index)  # set membership checks are faster than list checks
missing = [d for d in all_desc_ids if d not in joined_desc_ids]
print(len(missing))

149
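The same check can be expressed as a left merge with pandas' `indicator` option, which labels each row by where it was found. The sketch below uses two small made-up DataFrames (`all_descs` and `with_offsets`) as stand-ins for `descs_df` and `joined`. Only the `desc_id` column name is taken from the notebook.

```python
import pandas as pd

# Toy stand-ins: four descriptions, three of which have offsets
all_descs = pd.DataFrame({"desc_id": [0, 1, 2, 3]})
with_offsets = pd.DataFrame({"desc_id": [0, 1, 3],
                             "desc_start_offset": [661, 24, 24]})

# indicator=True adds a _merge column: "both" or "left_only" for a left merge
merged = all_descs.merge(with_offsets, on="desc_id", how="left", indicator=True)
missing_ids = merged.loc[merged["_merge"] == "left_only", "desc_id"].tolist()
print(missing_ids)
```

Keeping the merged DataFrame also makes it easy to inspect the other columns of the descriptions that lack offsets.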


In [60]:
missing_df = descs_df.loc[descs_df.desc_id.isin(missing)]
missing_df.head()

Unnamed: 0,eadid,description,field,desc_id
469,Coll-1000,"Compiled by Graeme D. Eddie, Edinburgh Univers...",Processing Information,469
481,Coll-1010,"Compiled by Graeme D. Eddie, Edinburgh Univers...",Processing Information,481
483,Coll-1014,"Compiled by Graeme D. Eddie, Edinburgh Univers...",Processing Information,483
488,Coll-1018,"Compiled by Graeme D. Eddie, Edinburgh Univers...",Processing Information,488
531,Coll-1024,"Compiled by Graeme D. Eddie, Edinburgh Univers...",Processing Information,531


In [62]:
missing_df.to_csv("../data/crc_metadata/descs_missing_offsets.csv")
joined.to_csv("../data/crc_metadata/descs_with_offsets.csv")