# Analysis: Descriptions' and Annotations' Lengths
## Post Annotation and Aggregation

Outputs the files:
  * `annot-post/data/descriptions_with_counts.csv`: adds columns to `descriptions.csv` for word counts and sentence counts, where words are alphanumeric tokens (punctuation excluded)
  * `annot-post/data/descs_stats.csv`: contains the count, minimum, maximum, average, and standard deviation of all descriptions and each type of description

***

**Table of Contents**

[0.](#0) Loading

[1.](#1) Lengths of Descriptions and Annotations

  * [Lengths of Descriptions](#1.1)
  
  * TO DO: [Lengths of Annotations](#1.2)
  
[2.](#2) Offsets of Descriptions

***

<a id="0"></a>
## 0. Loading
Begin by loading the Python libraries and the dataset to be analyzed.

In [43]:
import utils  # import custom functions

import pandas as pd
import numpy as np
import string, csv, re, os, sys #,json

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag
from nltk.text import Text
from nltk.probability import FreqDist
from collections import Counter

%matplotlib inline
import matplotlib.pyplot as plt

In [34]:
# dir_path = "data/"
# data_files = ["aggregated_final.csv", "aggregated_with_annotator_eadid_note_cols.csv", 
#               "aggregated_with_eadid_descid_desc_cols.csv", "descriptions.csv"]

In [35]:
# df = pd.read_csv(dir_path+data_files[2], index_col=0)
# df.head()

In [36]:
# print("Rows:",df.shape[0], "\nColumns:",df.shape[1])  # Rows: 55260, Columns: 10

<a id="1"></a>
## 1. Lengths of Descriptions and Annotations
**Find the minimum, maximum, average, and standard deviation of word and sentence counts...**
* Per description (by `desc_id` - a.k.a. per "document" for document classifiers)
* Per metadata field (Title, Biographical / Historical, Scope and Contents, and Processing Information)
* Per collection (identifiable with the `eadid` column)
* Per annotation label (Omission, Stereotype, Generalization, etc.)
* Per annotation category (Person Name, Linguistic, Contextual)
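
The statistics listed above can be sketched with the standard library alone. This is a minimal illustration, not the implementation of `utils.makeDescribeDf` (which is not shown in this notebook); the rows and the `summarize` helper are hypothetical.

```python
import statistics
from collections import defaultdict

# Hypothetical rows of (field, word_count); the real values come from the descriptions CSV
rows = [
    ("Title", 8),
    ("Title", 6),
    ("Biographical / Historical", 120),
    ("Biographical / Historical", 95),
    ("Scope and Contents", 40),
]

def summarize(values):
    """Return count, min, max, mean, and sample standard deviation."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        # The sample standard deviation needs at least two values
        "std": statistics.stdev(values) if len(values) > 1 else None,
    }

# Group the word counts by metadata field
by_field = defaultdict(list)
for field, word_count in rows:
    by_field[field].append(word_count)

# Summarize all rows together, then each field separately
stats = {"All": summarize([wc for _, wc in rows])}
for field, counts in by_field.items():
    stats[field] = summarize(counts)

for name, s in stats.items():
    print(name, s)
```

The same grouping applies to any of the other units listed (collection, label, or category) by swapping the grouping key.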

<a id="1.1"></a>
### 1.1 Lengths of Descriptions

In [37]:
descs_path = "../data/crc_metadata/all_descriptions.csv"     # descriptions in column of CSV file

In [38]:
desc_df = pd.read_csv(descs_path, index_col=0)
desc_df.head()

Unnamed: 0,eadid,description,field,desc_id
0,AA5,Professor James Aitken White was a leading Sco...,Biographical / Historical,0
1,AA5,Papers of The Very Rev Prof James Whyte (1920-...,Title,1
2,AA6,Rev Thomas Allan was born on 16 August 1916 in...,Biographical / Historical,2
3,AA6,Papers of Rev Tom Allan (1916-1965)\n\n,Title,3
4,AA7,Alec Cheyne was born on 1 June 1924 in Errol i...,Biographical / Historical,4


In [39]:
# # Remove metadata field name from each description
# new_descs = []
# descs = list(desc_df.description)
# fields = list(desc_df.field)
# i = 0
# maxI = len(descs)
# while i < maxI:
#     d, f = descs[i], fields[i]
#     to_remove = f+":\n"
#     d = d.replace(to_remove,"")
#     new_descs += [d]
#     i += 1
# assert len(new_descs) == len(descs)
# # new_descs[:10]            # Looks good

In [40]:
# # Update the CSV file
# desc_df.description = new_descs
# desc_df.head()
# desc_df.to_csv(descs_path)

In [42]:
# Write each description to a txt file named with desc_id
ids = list(desc_df.desc_id)
descs = list(desc_df.description)  # new_descs only exists if the commented-out cell above is run
zero_padding = len(str(max(ids)))
desc_txt_dir = "data/descriptions/"
os.makedirs(desc_txt_dir, exist_ok=True)  # create the output directory if needed
for d_id, desc in zip(ids, descs):
    # pad with zeros so file order aligns with DataFrame order
    id_str = str(d_id).zfill(zero_padding)
    filename = "description"+id_str+".txt"
    with open(desc_txt_dir+filename, "w", encoding="utf8") as f:
        f.write(desc)
print("Files written to "+desc_txt_dir)
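
To see why the zero padding matters, here is a small sketch with hypothetical `desc_id` values: without padding, file names sort alphabetically rather than numerically, so they would no longer line up with the DataFrame's row order.

```python
# Hypothetical desc_id values, for illustration only
ids = [2, 10, 1, 100]
width = len(str(max(ids)))  # number of digits in the largest ID

unpadded = sorted("description" + str(i) + ".txt" for i in ids)
padded = sorted("description" + str(i).zfill(width) + ".txt" for i in ids)

print(unpadded)  # alphabetical order: 1, 10, 100, 2
print(padded)    # numeric order: 001, 002, 010, 100
```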

In [None]:
corpus = PlaintextCorpusReader("data/descriptions/", r"description\d+\.txt", encoding="utf8")
# print(len(corpus.fileids()), desc_df.shape[0])  # Looks good
print(corpus.fileids()[-20:]) # Looks good
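
The file pattern given to the corpus reader is a regular expression, so a literal dot has to be escaped. A quick check with `re.fullmatch` shows the difference; the second file name is a hypothetical counterexample.

```python
import re

unescaped = r"description\d+.txt"   # "." matches any single character
escaped = r"description\d+\.txt"    # "\." matches only a literal dot

for name in ["description00042.txt", "description00042_txt"]:
    matches_unescaped = bool(re.fullmatch(unescaped, name))
    matches_escaped = bool(re.fullmatch(escaped, name))
    print(name, matches_unescaped, matches_escaped)
```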

#### Length per Description

In [None]:
desc_words, desc_lower_words, desc_sents = utils.getWordsSents(corpus)
print(desc_words[0][:10])
print(desc_lower_words[0][:10])
print(desc_sents[0][:2])

In [None]:
# Add word and sentence counts to DataFrame/CSV of descriptions
word_count = [len(word_list) for word_list in desc_words]  # includes digits but not punctuation
sent_count = [len(sent_list) for sent_list in desc_sents]
print(word_count[:2], sent_count[:4])  # Looks good
# len(desc_sents[2]) # 14
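
`utils.getWordsSents` is not shown in this notebook. As a rough, dependency-free sketch of the counting idea (alphanumeric tokens as words, punctuation excluded), one could use regular expressions. The function name, the example text, and the naive sentence splitter below are assumptions for illustration; a trained tokenizer such as NLTK's `sent_tokenize` is better suited to real descriptions.

```python
import re

def count_words_and_sentences(text):
    """Count alphanumeric tokens and (naively split) sentences in a string."""
    # Words: runs of letters and digits, so punctuation is never counted
    words = re.findall(r"[^\W_]+", text)
    # Sentences: split after ., !, or ? followed by whitespace. This is a rough
    # heuristic that would wrongly split after abbreviations such as "Rev."
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return len(words), len(sentences)

# Hypothetical description text
text = "Papers of a minister (1916-1965). They include letters, sermons, and diaries!"
print(count_words_and_sentences(text))
```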

In [None]:
desc_df.insert(len(desc_df.columns), "word_count", word_count)
desc_df.insert(len(desc_df.columns), "sent_count", sent_count)
desc_df.head()

In [None]:
desc_df.to_csv("../data/analysis_data/descriptions_with_counts.csv")  # write a new CSV file with the word and sentence counts (read back in below)

In [None]:
desc_df = desc_df.reset_index()
desc_df.head(1)

In [None]:
desc_df_stats = utils.makeDescribeDf("All", desc_df)
desc_df_stats

#### Lengths per Metadata Field

In [None]:
field = "Biographical / Historical"
bh_stats = utils.makeDescribeDf(field, desc_df)
bh_stats

In [None]:
field = "Scope and Contents"
sc_stats = utils.makeDescribeDf(field, desc_df)
sc_stats

In [None]:
field = "Processing Information"
pi_stats = utils.makeDescribeDf(field, desc_df)
pi_stats

In [None]:
field = "Title"
t_stats = utils.makeDescribeDf(field, desc_df)
t_stats

#### Combine the Statistics

In [None]:
df_stats = pd.concat([desc_df_stats, t_stats, sc_stats, bh_stats, pi_stats], axis=0)
df_stats

In [None]:
df_stats.to_csv("../data/analysis_data/descs_stats.csv")

#### Prepare data for visualization in Observable

In [None]:
df_descs = pd.read_csv("../data/analysis_data/descriptions_with_counts.csv", index_col=0)
df_descs.head()

<a id="1.2"></a>
### 1.2 Lengths of Annotations

* Dataset: `annot-post/data/aggregated_final.csv`

<a id="2"></a>
## 2. Offsets of Descriptions

**Get the start and end offset of every description so that automated labels can be exported as .ann files for visualization with brat.**

The [standoff format](https://brat.nlplab.org/standoff.html) used by the brat rapid annotation tool records the start offset and end offset of each annotated text span, where:
* The **start offset** is the index of the *first character* in the annotated text span (equivalently, the number of characters in the document that precede the span)
* The **end offset** is the index of the character *immediately after* the annotated text span (equivalently, the start offset plus the length of the span)

A span that begins a document therefore has a start offset of 0, and a span that ends a document has an end offset equal to the document's length (number of characters).  Each document contains multiple descriptions, so we need to determine the start and end offsets of every description within its document. We'll add these offsets as columns to the descriptions from `../data/crc_metadata/all_descriptions.csv` and write the result to `../data/crc_metadata/all_descs_with_offsets.csv`.
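
A minimal worked example of these two definitions, using a hypothetical document string: `str.find` gives the start offset, adding the span's length gives the end offset, and slicing with the pair recovers the span.

```python
# Hypothetical document text, for illustration only
document = "Title:\nPapers of a minister\n\nScope and Contents:\nLetters and sermons."
description = "Papers of a minister"

start_offset = document.find(description)      # index of the first character (-1 if absent)
end_offset = start_offset + len(description)   # index of the character after the span

assert start_offset >= 0, "description not found in document"
# Slicing with the two offsets recovers the span
assert document[start_offset:end_offset] == description
print(start_offset, end_offset)  # 7 27
```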

In [67]:
data_path = "../data/aggregated_data/aggregated_with_eadid_descid_desc_cols.csv"
df = pd.read_csv(data_path, index_col=0)
df = df.drop(columns=["offsets","text","label","category","id"])
df = df.drop_duplicates()
df.head()

Unnamed: 0,desc_id,eadid,field,file,description
0,0,AA5,Biographical / Historical,AA5_00100.ann,Professor James Aitken White was a leading Sco...
6,1,AA5,Title,AA5_00100.ann,Papers of The Very Rev Prof James Whyte (1920-...
19,2,AA6,Biographical / Historical,AA6_00100.ann,Rev Thomas Allan was born on 16 August 1916 in...
50,3,AA6,Title,AA6_00100.ann,Papers of Rev Tom Allan (1916-1965)\n\n
62,4,AA7,Biographical / Historical,AA7_00100.ann,Alec Cheyne was born on 1 June 1924 in Errol i...


In [68]:
descriptions = list(df.description)
ann_files = list(set(list(df.file)))
# Replace .ann with .txt in each file's name
txt_files = [f[:-4]+".txt" for f in ann_files]
file_dict = dict(zip(txt_files,ann_files))
assert file_dict["AA5_00100.txt"] == "AA5_00100.ann"

In [70]:
doc_path = "../data/crc_metadata/descriptions_brat/"  # brat files (the same directory is set again below)
desc_start_offsets, desc_end_offsets = [], []
desc_id_order = []
for filename in txt_files:
    with open(doc_path+filename, "r") as f:
        f_string = f.read()
    subdf = df.loc[df.file == file_dict[filename]]
    descs = list(subdf.description)
    desc_id_order += list(subdf.desc_id)
    for d in descs:
        # If there is no description text, don't record
        # any offsets, instead record 'None'
        if not isinstance(d, str):
            desc_start_offsets += [None]
            desc_end_offsets += [None]
            continue
        # Use the index of the first character of the text as the start offset
        start_offset = f_string.find(d)
        # Make sure the description is found in the file
        # (if str.find(substr) == -1, the substring wasn't found)
        if start_offset >= 0:
            # NOTE: start_offset+len(d) is already the index of the character immediately
            # following the text; the extra +1 is kept because the offsets shown below include it
            end_offset = start_offset+len(d)+1
            desc_start_offsets += [start_offset]
            desc_end_offsets += [end_offset]
        else:
            desc_start_offsets += ["not_found"]
            desc_end_offsets += ["not_found"]
assert len(desc_start_offsets) == len(descriptions)
assert len(desc_end_offsets) == len(descriptions)

In [71]:
offset_df = pd.DataFrame({"desc_id":desc_id_order, "desc_start_offset":desc_start_offsets, "desc_end_offset":desc_end_offsets})
offset_df.head()

Unnamed: 0,desc_id,desc_start_offset,desc_end_offset
0,59,284,337
1,60,592,647
2,61,726,759
3,62,765,826
4,63,832,898


In [72]:
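# NOTE: a desc_id can occur in more than one file (e.g. 591 below), so joining
# on desc_id alone repeats those rows once per matching row of offset_df
joined = df.set_index("desc_id").join(offset_df.set_index("desc_id"))
joined.head()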

desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
0,AA5,Biographical / Historical,AA5_00100.ann,Professor James Aitken White was a leading Sco...,661,1724
1,AA5,Title,AA5_00100.ann,Papers of The Very Rev Prof James Whyte (1920-...,24,78
2,AA6,Biographical / Historical,AA6_00100.ann,Rev Thomas Allan was born on 16 August 1916 in...,588,2512
3,AA6,Title,AA6_00100.ann,Papers of Rev Tom Allan (1916-1965)\n\n,24,62
4,AA7,Biographical / Historical,AA7_00100.ann,Alec Cheyne was born on 1 June 1924 in Errol i...,445,2441


In [101]:
# Element-wise comparison with None (== None) is always False, so test for missing values with isna()
assert joined.desc_start_offset.isna().sum() == 0
assert joined.desc_end_offset.isna().sum() == 0

In [102]:
joined_found = joined.loc[(joined.desc_start_offset != "not_found")]
joined_notfound = joined.loc[(joined.desc_start_offset == "not_found")]

# Check that any "not_found" start offsets have corresponding "not_found" end offsets
joined_found_end = joined.loc[(joined.desc_end_offset != "not_found")]
assert joined_found_end.shape == joined_found.shape
joined_notfound_end = joined.loc[(joined.desc_end_offset == "not_found")]
assert joined_notfound_end.shape == joined_notfound.shape

print(joined_found.shape)
print(joined_notfound.shape)

(34118, 6)
(11, 6)


In [103]:
joined_notfound

desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
591,Coll-1036,Scope and Contents,Coll-1036_00500.ann,"Miscellaneous proofs of Songs of the Hebrides,...",not_found,not_found
591,Coll-1036,Scope and Contents,Coll-1036_00500.ann,"Miscellaneous proofs of Songs of the Hebrides,...",not_found,not_found
591,Coll-1036,Scope and Contents,Coll-1036_00700.ann,"Miscellaneous proofs of Songs of the Hebrides,...",not_found,not_found
591,Coll-1036,Scope and Contents,Coll-1036_00700.ann,"Miscellaneous proofs of Songs of the Hebrides,...",not_found,not_found
732,Coll-1057,Title,Coll-1057_00400.ann,Page mounted with photograph of the farm of F....,not_found,not_found
2261,Coll-13,Scope and Contents,Coll-13_00900.ann,Plan of Roofing of the Building at the South W...,not_found,not_found
2274,Coll-13,Scope and Contents,Coll-13_01300.ann,"Plan of the socket, section, plan elevation an...",not_found,not_found
3501,Coll-1320,Scope and Contents,Coll-1320_01600.ann,Contains:\nletters concerning the reproduction...,not_found,not_found
5197,Coll-1434,Biographical / Historical,Coll-1434_13300.ann,William White Anderson was born on 17 March 18...,not_found,not_found
7804,Coll-146,Biographical / Historical,Coll-146_12100.ann,Robert McCheyne was born in Edinburgh on 21 Ma...,not_found,not_found


In [104]:
descs_path = "../data/crc_metadata/all_descriptions.csv"
descs_df = pd.read_csv(descs_path, index_col=0)
descs_df.head()

Unnamed: 0,eadid,description,field,desc_id
0,AA5,Professor James Aitken White was a leading Sco...,Biographical / Historical,0
1,AA5,Papers of The Very Rev Prof James Whyte (1920-...,Title,1
2,AA6,Rev Thomas Allan was born on 16 August 1916 in...,Biographical / Historical,2
3,AA6,Papers of Rev Tom Allan (1916-1965)\n\n,Title,3
4,AA7,Alec Cheyne was born on 1 June 1924 in Errol i...,Biographical / Historical,4


In [105]:
all_desc_ids = list(descs_df.desc_id)
joinedfound_desc_ids = set(joined_found.index)  # a set makes the membership test below fast
missing = [d for d in all_desc_ids if d not in joinedfound_desc_ids]
print(len(missing))

157


In [106]:
missing_df = descs_df.loc[descs_df.desc_id.isin(missing)]
missing_df.head()

Unnamed: 0,eadid,description,field,desc_id
469,Coll-1000,"Compiled by Graeme D. Eddie, Edinburgh Univers...",Processing Information,469
481,Coll-1010,"Compiled by Graeme D. Eddie, Edinburgh Univers...",Processing Information,481
483,Coll-1014,"Compiled by Graeme D. Eddie, Edinburgh Univers...",Processing Information,483
488,Coll-1018,"Compiled by Graeme D. Eddie, Edinburgh Univers...",Processing Information,488
531,Coll-1024,"Compiled by Graeme D. Eddie, Edinburgh Univers...",Processing Information,531


In [107]:
missing_df.loc[missing_df.description.isna()]

Unnamed: 0,eadid,description,field,desc_id


None of the missing descriptions are null (`NaN`).

To retrieve the remaining file names and offsets, use the `eadid` column values to find the files each missing description could be in, then search those files for the description's text to obtain its offsets:

In [108]:
doc_path = "../data/crc_metadata/descriptions_brat/"
file_type = ".txt"  # Read in only the PlainText files

In [109]:
filenames = os.listdir(doc_path)
filenames = [f for f in filenames if f[-4:] == file_type] # the descriptions are in the txt files
print(filenames[:6])

['Coll-227_00100.txt', 'La_03600.txt', 'PJM_03000.txt', 'La_07300.txt', 'Coll-1434_07400.txt', 'Coll-1434_03100.txt']


In [110]:
eadids = list(missing_df.eadid)
descs = list(missing_df.description)
fields = list(missing_df.field)
desc_ids = list(missing_df.desc_id)
missing_eadids, missing_descs, missing_fields, missing_desc_ids = [], [], [], []
missing_filenames, desc_start_offsets, desc_end_offsets = [], [], []
for i,d in enumerate(descs):
    eadid = eadids[i]
    for filename in filenames:
        # Match the whole eadid rather than a substring (e.g. "Coll-13" is also in
        # "Coll-1320_01600.txt"); this assumes files are named <eadid>_<number>.txt
        if filename.startswith(eadid+"_"):
            # Use each file at most once
            if filename not in missing_filenames:
                with open(doc_path+filename, "r") as f:
                    f_string = f.read()
                # Use the index of the first character of the text as the start offset
                start_offset = f_string.find(d)
                # Make sure the description is found in the file
                # (if str.find(substr) == -1, the substring wasn't found)
                if start_offset >= 0:
                    # NOTE: start_offset+len(d) is already the index of the character
                    # immediately following the text; the extra +1 is kept because the
                    # offsets shown below include it
                    end_offset = start_offset+len(d)+1
                    desc_start_offsets += [start_offset]
                    desc_end_offsets += [end_offset]
                    missing_filenames += [filename]
                    missing_eadids += [eadid]
                    missing_descs += [d]
                    missing_desc_ids += [desc_ids[i]]
                    missing_fields += [fields[i]]
        

assert len(missing_filenames) == len(desc_start_offsets)
assert len(missing_filenames) == len(desc_end_offsets)
# NOTE: descriptions can be repeated within the same collection (identified with an eadid),
# so there can be more missing_filenames than there were rows in missing_df
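
When files are matched to a collection by `eadid`, note that one `eadid` can be a prefix of another (`Coll-13` and `Coll-1320` both appear above). The sketch below compares substring matching with matching on the full prefix up to the underscore. It assumes files are named `<eadid>_<number>.txt`, as the printed file names suggest; the short file list is illustrative.

```python
# File names follow the pattern seen above; the list itself is illustrative
filenames = ["Coll-13_00900.txt", "Coll-1320_01600.txt", "Coll-1310_00200.txt"]
eadid = "Coll-13"

# Substring test: also matches collections whose eadid merely starts with "Coll-13"
substring_matches = [f for f in filenames if eadid in f]
# Prefix test up to the underscore: matches only files of this collection
prefix_matches = [f for f in filenames if f.startswith(eadid + "_")]

print(substring_matches)  # all three files
print(prefix_matches)     # only the Coll-13 file
```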

In [111]:
missing_df_with_offsets = pd.DataFrame({"desc_id":missing_desc_ids, "eadid":missing_eadids, 
                                        "field":missing_fields, "file":missing_filenames, 
                                        "description":missing_descs, "desc_start_offset":desc_start_offsets, 
                                        "desc_end_offset":desc_end_offsets})
missing_df_with_offsets = missing_df_with_offsets.set_index("desc_id")
missing_df_with_offsets.tail()

desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
11892,Coll-1310,Title,Coll-1310_01900.txt,\n,302,304
11892,Coll-1310,Title,Coll-1310_02400.txt,\n,19,21
11892,Coll-1310,Title,Coll-1310_02600.txt,\n,0,2
11892,Coll-1310,Title,Coll-1310_00200.txt,\n,0,2
11893,Coll-146,Scope and Contents,Coll-146_28800.txt,"TS signed1p. At head of paper: E Gellhorn, MD ...",728,913


***

In [112]:
empty_descs_df = missing_df_with_offsets.loc[missing_df_with_offsets.description == "\n"]
print("Total empty descriptions:",empty_descs_df.shape[0])
print("Eadids of empty descriptions:",empty_descs_df.eadid.unique())
print("Files of empty descriptions:\n",empty_descs_df.file.unique())

Total empty descriptions: 30
Eadids of empty descriptions: ['Coll-1310']
Files of empty descriptions:
 ['Coll-1310_01700.txt' 'Coll-1310_02800.txt' 'Coll-1310_01500.txt'
 'Coll-1310_00800.txt' 'Coll-1310_01100.txt' 'Coll-1310_01300.txt'
 'Coll-1310_01400.txt' 'Coll-1310_02900.txt' 'Coll-1310_03000.txt'
 'Coll-1310_01600.txt' 'Coll-1310_01200.txt' 'Coll-1310_00900.txt'
 'Coll-1310_01000.txt' 'Coll-1310_02100.txt' 'Coll-1310_00500.txt'
 'Coll-1310_00700.txt' 'Coll-1310_02300.txt' 'Coll-1310_02700.txt'
 'Coll-1310_00300.txt' 'Coll-1310_00100.txt' 'Coll-1310_02500.txt'
 'Coll-1310_01800.txt' 'Coll-1310_00600.txt' 'Coll-1310_02200.txt'
 'Coll-1310_02000.txt' 'Coll-1310_00400.txt' 'Coll-1310_01900.txt'
 'Coll-1310_02400.txt' 'Coll-1310_02600.txt' 'Coll-1310_00200.txt']


**TO DO:** manually add the correct titles into these files!

***

In [113]:
all_desc_offsets_df = pd.concat([joined_found,missing_df_with_offsets])
all_desc_offsets_df.head()

desc_id,eadid,field,file,description,desc_start_offset,desc_end_offset
0,AA5,Biographical / Historical,AA5_00100.ann,Professor James Aitken White was a leading Sco...,661,1724
1,AA5,Title,AA5_00100.ann,Papers of The Very Rev Prof James Whyte (1920-...,24,78
2,AA6,Biographical / Historical,AA6_00100.ann,Rev Thomas Allan was born on 16 August 1916 in...,588,2512
3,AA6,Title,AA6_00100.ann,Papers of Rev Tom Allan (1916-1965)\n\n,24,62
4,AA7,Biographical / Historical,AA7_00100.ann,Alec Cheyne was born on 1 June 1924 in Errol i...,445,2441


Write the results to a file:

In [114]:
all_desc_offsets_df.to_csv("../data/crc_metadata/all_descs_with_offsets.csv")
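
With the description offsets saved, a label whose span is known relative to its description can be shifted into document offsets and written as a brat text-bound annotation line (`ID<TAB>Type Start End<TAB>Text`). The following is a minimal sketch: the `to_ann_line` helper, the label name, and all values are hypothetical, and it assumes `desc_start_offset` is the index of the description's first character in the document.

```python
def to_ann_line(index, label, desc_start_offset, span_start, span_end, desc_text):
    """Format one brat text-bound annotation ("T" line).

    span_start and span_end are offsets within the description;
    adding desc_start_offset converts them to document offsets.
    """
    start = desc_start_offset + span_start
    end = desc_start_offset + span_end
    text = desc_text[span_start:span_end]
    return f"T{index}\t{label} {start} {end}\t{text}"

# Hypothetical values, for illustration only
line = to_ann_line(1, "ExampleLabel", 24, 12, 20, "Papers of a minister")
print(repr(line))
```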