# Annotated Data Analysis 
## Post Annotation and Aggregation

In the aggregated dataset, examine:

[1.](#1) Most common text and metadata field annotated per label
    
  * [Omission](#Omission)
  * [Stereotype](#Stereotype)
  * [Gendered Role](#GR)
  * [Generalization](#Generalization)

[2.](#2) Language of Material of Annotations

[3.](#3) Dates of Material of Annotations

[4.](#4) Lengths of Descriptions and Annotations

[X.](#X) Correlation (if any) between type of gender biased language and type of descriptive metadata field

[X.](#X) Annotators' rationale for applying labels (as documented in the `note` column)

[X.](#X) People referred to with feminine vs. masculine terms in annotated text

***

### 0. Loading
Begin by loading the Python libraries and the dataset to be analyzed.

In [2]:
import pandas as pd
import numpy as np
import string
import csv
import re
# import json

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag
from nltk.text import Text
from nltk.probability import FreqDist
from collections import Counter
from wordcloud import WordCloud

%matplotlib inline
import matplotlib.pyplot as plt

In [106]:
dir_path = "data/"
data_files = ["aggregated_final.csv", "aggregated_with_annotator_eadid_note_cols.csv", 
              "aggregated_with_eadid_descid_desc_cols.csv", "descriptions.csv"]

In [4]:
df = pd.read_csv(dir_path+data_files[2], index_col=0)
df.head()

Unnamed: 0,file,offsets,text,label,category,eadid,description,field,id,desc_id
9,AA5_00100.ann,"(1032, 1043)",James Whyte,Masculine,Person-Name,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,0,0
16,AA5_00100.ann,"(1129, 1177)",chair of practical theology and Christian ethics,Occupation,Contextual,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,1,0
4,AA5_00100.ann,"(1217, 1219)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,2,0
5,AA5_00100.ann,"(1241, 1244)",His,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,3,0
6,AA5_00100.ann,"(1315, 1317)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,4,0


In [6]:
print("Rows:",df.shape[0], "\nColumns:",df.shape[1])

Rows: 55260 
Columns: 10


<a id="1"></a>
### 1. Most Common Text Annotated Per Label

For each label, determine the most frequently annotated text spans and the type of description (metadata field) in which the label was applied.

In [60]:
metadata_field_names = ["Scope and Contents", "Biographical / Historical", 
                        "Title", "Processing Information"]

def getFieldRatios(df_field_values):
    total = sum(df_field_values)
    ratios = []
    for v in df_field_values:
        ratios += [v/total]
    return ratios
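
As a cross-check on `getFieldRatios`, pandas can compute the same proportions directly. The following is a minimal sketch on a made-up series; the values are illustrative toy data, not counts from the dataset.

```python
import pandas as pd

# Toy stand-in for a label's `field` column (illustrative values only)
fields = pd.Series(["Title", "Scope and Contents", "Scope and Contents", "Title",
                    "Scope and Contents", "Biographical / Historical"])

counts = fields.value_counts().sort_index()
# Dividing each count by the total is what getFieldRatios does element by element
ratios_manual = counts / counts.sum()
# normalize=True returns the same proportions in a single call
ratios_builtin = fields.value_counts(normalize=True).sort_index()

print(ratios_builtin)
```

Both approaches return one ratio per field that is present, so a field with no annotations for a label still has to be added separately.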

<a id="Omission"></a>
#### 1.1 Omission

In [7]:
df_om = df.loc[df.label == "Omission"]
df_om.shape

(7586, 10)

In [8]:
df_om_text = df_om.text.value_counts(sort=True, ascending=False)
df_om_text.head(30)

Thomson               981
Ledermann             351
a man                 286
Beale                 220
Beatty                146
Lady Thomson           79
men                    77
two men                77
group of men           43
Thurstone              38
an Indian man          38
Auerbach               33
Bartlett               32
Wilmut                 28
Levison                25
a woman                25
Koestler               24
Burt                   23
Mrs Kennedy-Fraser     21
Mamaine                21
three men              21
Burns                  20
Hector                 20
colleagues             19
Janine                 19
another man            19
Spearman               19
his wife               18
Corson                 18
others                 17
Name: text, dtype: int64

In [81]:
df_om_field = df_om.field.value_counts()
df_om_field.sort_index(inplace=True)
df_om_field_values = df_om_field.values
df_om_field

Biographical / Historical     631
Processing Information          3
Scope and Contents           5480
Title                        1472
Name: field, dtype: int64

In [86]:
om_ratios = getFieldRatios(df_om_field_values)

<a id="Stereotype"></a>
#### 1.2 Stereotype

In [19]:
df_st = df.loc[df.label == "Stereotype"]
df_st.shape

(2648, 10)

In [20]:
df_st_text = df_st.text.value_counts(sort=True, ascending=False)
df_st_text.head(30)

man                                429
men                                342
a man                              223
a woman                            108
woman                               67
two men                             54
women                               49
a group of men                      32
female                              24
Man                                 22
boys                                21
an Indian man                       21
a man and a woman                   21
his                                 20
boy                                 17
his wife Florence Jewel Baillie     16
Men                                 15
cowboys                             15
three men                           15
Empress of Britain                  15
his wife                            13
Women                               12
two women                           12
another man                         12
a boy                               11
Woman                    

In [82]:
df_st_field = df_st.field.value_counts()
df_st_field.sort_index(inplace=True)
df_st_field_values = list(df_st_field.values)
df_st_field

Biographical / Historical     482
Scope and Contents           1745
Title                         421
Name: field, dtype: int64

In [84]:
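# No Stereotype annotations fall in "Processing Information", so insert a 0
# at index 1 to line up with the four alphabetically sorted metadata fields
df_st_field_values.insert(1,0)
df_st_field_values # Looks good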

[482, 0, 1745, 421]

In [85]:
st_ratios = getFieldRatios(df_st_field_values)
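
Inserting the zero by position works here, but it depends on knowing which field is missing. One alternative, sketched below under the assumption that the four field names are listed in the alphabetical order that `sort_index` produces, is to let `reindex` align the counts by name. The name `st_aligned` is introduced only for this illustration.

```python
import pandas as pd

# The four metadata fields, in the alphabetical order produced by sort_index
field_order = ["Biographical / Historical", "Processing Information",
               "Scope and Contents", "Title"]

# Stereotype counts as reported above; "Processing Information" is absent
st_counts = pd.Series({"Biographical / Historical": 482,
                       "Scope and Contents": 1745,
                       "Title": 421})

# reindex aligns on field name and fills any missing field with 0
st_aligned = st_counts.reindex(field_order, fill_value=0)
print(st_aligned.tolist())  # [482, 0, 1745, 421]
```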

<a id="Generalization"></a>
#### 1.3 Generalization

In [15]:
df_ge = df.loc[df.label == "Generalization"]
df_ge.shape

(2061, 10)

In [16]:
df_ge_text = df_ge.text.value_counts(sort=True, ascending=False)
df_ge_text.head(30)

man           566
woman         246
boy            41
he             36
Thomson        35
boys           34
his            34
Midwifery      31
MA             25
Empress        23
Chairman       21
M.A.           20
He             20
Beale          19
M.B.           19
Man            18
cowboys        17
Duchess        16
B.Sc.          15
Ledermann      15
girls          15
men            13
Englishman     12
Scotsman       12
Lords          12
cowboy         12
Princess       11
Ch.B.          10
B.A.           10
Sir             9
Name: text, dtype: int64

In [87]:
df_ge_field = df_ge.field.value_counts()
df_ge_field.sort_index(inplace=True)
df_ge_field_values = df_ge_field.values
df_ge_field

Biographical / Historical     402
Processing Information          4
Scope and Contents           1193
Title                         462
Name: field, dtype: int64

In [88]:
ge_ratios = getFieldRatios(df_ge_field_values)

#### Save the data to a CSV file:

In [91]:
df = pd.DataFrame({"Omission Count":df_om_field.values, "Omission Ratio":om_ratios, 
                   "Stereotype Count":df_st_field_values, "Stereotype Ratio":st_ratios,
                   "Generalization Count": df_ge_field.values, "Generalization Ratio":ge_ratios,}, 
                  index=["Biographical / Historical", "Processing Information", 
                         "Scope and Contents",  "Title"])
df.T

Unnamed: 0,Biographical / Historical,Processing Information,Scope and Contents,Title
Omission Count,631.0,3.0,5480.0,1472.0
Omission Ratio,0.08318,0.000395,0.722383,0.194042
Stereotype Count,482.0,0.0,1745.0,421.0
Stereotype Ratio,0.182024,0.0,0.658988,0.158988
Generalization Count,402.0,4.0,1193.0,462.0
Generalization Ratio,0.195051,0.001941,0.578845,0.224163


In [92]:
df = df.T
df.to_csv("analysis_data/labels_per_metadata_field.csv")
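
The counts and ratios assembled above by hand could also be produced in one step with `pd.crosstab`. This is a sketch on toy annotations, not the notebook's method; the rows in `toy` are invented for illustration and only the `label` and `field` column names come from the dataset.

```python
import pandas as pd

# Toy annotations (illustrative only) with the same two columns as the real data
toy = pd.DataFrame({
    "label": ["Omission", "Omission", "Omission",
              "Stereotype", "Generalization", "Generalization"],
    "field": ["Title", "Scope and Contents", "Scope and Contents",
              "Title", "Title", "Biographical / Historical"],
})

labels = ["Omission", "Stereotype", "Generalization"]
subset = toy[toy.label.isin(labels)]

# Rows are labels, columns are metadata fields; absent combinations become 0
field_counts = pd.crosstab(subset.label, subset.field)
# normalize="index" divides each row by its own total
field_ratios = pd.crosstab(subset.label, subset.field, normalize="index")

print(field_counts)
print(field_ratios.round(3))
```

Because `crosstab` fills missing label and field combinations with zero, no manual insertion is needed for labels that never occur in a given field.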

<a id="2"></a>
## 2. Language of Material of Annotations

First, count how many collections list each language across the entire dataset. Then, count the languages of material for annotations with the `Omission`, `Stereotype`, and `Generalization` labels.

In [98]:
meta_df = pd.read_csv("data/CRC_units-grouped-by-fonds_clean.csv", index_col=0)
meta_df.head(3)

Unnamed: 0,eadid,unit_title,unit_identifier,unique_language,unique_date,unique_geography
0,Coll-1064,"['Papers of Professor Walter Ledermann', '1 (3...","['Coll-1064', 'Coll-1064/1', 'Coll-1064/2', 'C...",['English'],"['1937-01-01 - 1954-12-31', '1937-02-02 - 1938...","['Edinburgh (Scotland)', 'St Andrews (Scotland..."
1,Coll-31,['Drawings from the Office of Sir Rowand Ander...,"['Coll-31', 'Coll-31/1', 'Coll-31/1/1', 'Coll-...",['English'],"['1814-01-01 - 1924-12-31', '1874-01-01 - 1905...",['']
2,Coll-51,['Papers of Sir Roderick Impey Murchison and h...,"['Coll-51', 'Coll-51/1', 'Coll-51/2', 'Coll-51...",['English'],"['1771-01-01 - 1935-12-31', '1723-01-01 - 1935...","['Calcutta (India)', 'Europe', 'Tarradale (Sco..."


In [99]:
languages = list(meta_df.unique_language)
lang_counts = dict()
for language in languages:
    # Skip missing values (NaN), which are not strings
    if isinstance(language, str):
        # Strip the quotes and brackets from the stringified list
        l = language.replace("'","").replace("[","").replace("]","")
        l_list = l.split(", ")
        for lang in l_list:
            # Count the variant "restEnglish" as English
            if lang == "restEnglish":
                lang = "English"
            # .get avoids a KeyError when a language is seen for the first time
            lang_counts[lang] = lang_counts.get(lang, 0) + 1

print(lang_counts)

{'English': 1018, 'Russian': 9, 'Latin': 49, 'French': 32, 'Mostly English': 4, '': 16, 'Gaelic or Scottish Gaelic': 1, 'Multiple languages': 15, 'German': 29, 'Kikuyu or Gikuyu': 2, 'Kikuyu': 2, 'Gikuyu': 2, 'Spanish or Castilian': 2, 'Spanish': 7, 'Castilian': 2, 'Fulah': 1, 'Scots': 9, 'Polish': 6, 'Italian': 13, 'Scots Dialect': 1, 'Scottish Gaelic': 2, 'Czech': 2, 'Irish Gaelic': 2, 'Scots dialect': 1, 'Russin': 1, 'English Gaelic': 1, 'Gaelic Scottish Gaelic': 1, 'Swahili': 1, 'Arabic': 6, 'Bemba': 1, 'Egyptian (Ancient)': 1, 'Burmese': 4, 'Greek': 2, 'Hebrew': 5, 'Norman-French/French': 1, 'Old': 1, 'Chinese': 3, 'Afrikaans': 1, 'Norwegian': 2, 'Dutch': 4, 'Flemish': 1, 'France': 2, 'Efik': 1, 'Hindi': 1, 'Persian': 2, 'English mostly': 2, 'Korean': 1, 'Swedish': 5, 'Mainly English': 1, 'Japanese': 2, 'Hungarian': 3, 'Batak languages': 1, 'Portugese': 1, 'Czechoslavakian': 1, 'Japanese English translation': 1, 'Slavic languages': 1, 'Icelandic': 1, 'Bulgarian': 1, 'Romanian': 1,
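
The `unique_language` cells are Python list literals stored as strings, so they can also be parsed with `ast.literal_eval` instead of stripping punctuation by hand. A minimal sketch, using two cell values in the format shown by `meta_df.head()` plus a missing value:

```python
import ast
from collections import Counter

# Cells in the format shown by meta_df.head(); NaN stands in for a missing value
cells = ["['English', 'Russian', 'Latin', 'French']", "['English']", float("nan")]

lang_counts = Counter()
for cell in cells:
    if isinstance(cell, str):  # skip NaN cells
        # literal_eval turns the string back into a real list of strings
        for lang in ast.literal_eval(cell):
            lang_counts[lang.strip()] += 1

print(lang_counts)
```

This assumes every cell is a valid list literal. Unlike splitting on `", "`, it keeps a language name that contains a comma in one piece, so the counts may differ slightly from the output above.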

In [100]:
lang_counts_df = pd.DataFrame.from_dict(lang_counts, orient="index").reset_index()
lang_counts_df.columns = ["language", "collection_count"]
lang_counts_df = lang_counts_df.sort_values(by="collection_count", ascending=False)
# lang_counts_df.head()
lang_counts_df.to_csv("analysis_data/language_material_collection_counts.csv")

In [111]:
def getSubDf(data_df, label):
    return data_df.loc[data_df.label == label]

In [109]:
data_df = pd.read_csv(dir_path+data_files[2], index_col=0)  # aggregated_with_eadid_descid_desc_cols.csv
data_df.head()

Unnamed: 0,file,offsets,text,label,category,eadid,description,field,id,desc_id
9,AA5_00100.ann,"(1032, 1043)",James Whyte,Masculine,Person-Name,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,0,0
16,AA5_00100.ann,"(1129, 1177)",chair of practical theology and Christian ethics,Occupation,Contextual,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,1,0
4,AA5_00100.ann,"(1217, 1219)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,2,0
5,AA5_00100.ann,"(1241, 1244)",His,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,3,0
6,AA5_00100.ann,"(1315, 1317)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,4,0


<a id="2.1"></a>
### 2.1 Omission

In [112]:
df = getSubDf(data_df, "Omission")
eadids = list(df.eadid)

In [113]:
languages = []
for eadid in eadids:
    cell = meta_df.loc[meta_df.eadid == eadid].unique_language.values
    for l in cell:
        l = l.replace("'","")
        l = l.replace("[","")
        l = l.replace("]","")
        l_list = l.split(", ")
        for language in l_list:
            languages += [language]

print(len(languages))
print(languages[:10])

19304
['English', 'Greek', 'Hebrew', 'German', 'English', 'Greek', 'Hebrew', 'German', 'English', 'Greek']


In [114]:
lang_counts = Counter(languages)

In [115]:
lang_counts_df = pd.DataFrame.from_dict(lang_counts, orient="index").reset_index()
lang_counts_df.columns = ["language", "collection_count"]
lang_counts_df = lang_counts_df.sort_values(by="collection_count", ascending=False)
lang_counts_df.head()

Unnamed: 0,language,collection_count
0,English,6409
4,French,2276
3,German,1892
6,Latin,1352
24,Polish,1100


In [116]:
lang_counts_df.to_csv("analysis_data/language_material_collection_counts_omission.csv")
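
The same lookup loop is repeated for each label in this section. One way to avoid the repetition is a small helper, sketched here with toy inputs. The function name `count_languages_for_label` is hypothetical, and the sketch assumes each `eadid` appears in only one row of `meta_df`.

```python
from collections import Counter
import pandas as pd

def count_languages_for_label(data_df, meta_df, label):
    """Count languages of material over all annotations with the given label."""
    # Build the eadid -> list-of-languages lookup once (assumes unique eadids)
    lang_lists = {}
    for eadid, cell in zip(meta_df.eadid, meta_df.unique_language):
        if isinstance(cell, str):  # skip missing values
            cleaned = cell.replace("'", "").replace("[", "").replace("]", "")
            lang_lists[eadid] = cleaned.split(", ")
    counts = Counter()
    # Each annotation contributes every language of its collection
    for eadid in data_df.loc[data_df.label == label, "eadid"]:
        counts.update(lang_lists.get(eadid, []))
    return counts

# Toy inputs (illustrative only)
meta_toy = pd.DataFrame({"eadid": ["A", "B"],
                         "unique_language": ["['English', 'Latin']", "['English']"]})
data_toy = pd.DataFrame({"eadid": ["A", "A", "B"],
                         "label": ["Omission", "Omission", "Stereotype"]})

print(count_languages_for_label(data_toy, meta_toy, "Omission"))
```

Note that the counts are per annotation rather than per collection: a collection with many annotations contributes its languages once for each of them.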

<a id="2.2"></a>
### 2.2 Stereotype

In [117]:
df = getSubDf(data_df, "Stereotype")
eadids = list(df.eadid)

In [118]:
languages = []
for eadid in eadids:
    cell = meta_df.loc[meta_df.eadid == eadid].unique_language.values
    for l in cell:
        l = l.replace("'","")
        l = l.replace("[","")
        l = l.replace("]","")
        l_list = l.split(", ")
        for language in l_list:
            languages += [language]

print(len(languages))
print(languages[:10])

3322
['English', 'Greek', 'Hebrew', 'German', 'English', 'Greek', 'Hebrew', 'German', 'English', 'Greek']


In [119]:
lang_counts = Counter(languages)

In [120]:
lang_counts_df = pd.DataFrame.from_dict(lang_counts, orient="index").reset_index()
lang_counts_df.columns = ["language", "collection_count"]
lang_counts_df = lang_counts_df.sort_values(by="collection_count", ascending=False)
lang_counts_df.head()

Unnamed: 0,language,collection_count
0,English,2261
3,German,178
13,French,172
2,Hebrew,83
1,Greek,79


In [121]:
lang_counts_df.to_csv("analysis_data/language_material_collection_counts_stereotype.csv")

<a id="2.3"></a>
### 2.3 Generalization

In [122]:
df = getSubDf(data_df, "Generalization")
eadids = list(df.eadid)

In [123]:
languages = []
for eadid in eadids:
    cell = meta_df.loc[meta_df.eadid == eadid].unique_language.values
    for l in cell:
        l = l.replace("'","")
        l = l.replace("[","")
        l = l.replace("]","")
        l_list = l.split(", ")
        for language in l_list:
            languages += [language]

print(len(languages))
print(languages[:10])

3476
['English', 'Greek', 'Hebrew', 'German', 'English', 'Greek', 'Hebrew', 'German', 'English', 'Greek']


In [124]:
lang_counts = Counter(languages)

In [125]:
lang_counts_df = pd.DataFrame.from_dict(lang_counts, orient="index").reset_index()
lang_counts_df.columns = ["language", "collection_count"]
lang_counts_df = lang_counts_df.sort_values(by="collection_count", ascending=False)
lang_counts_df.head()

Unnamed: 0,language,collection_count
0,English,1677
13,French,295
3,German,210
6,Italian,158
5,Latin,156


In [126]:
lang_counts_df.to_csv("analysis_data/language_material_collection_counts_generalization.csv")

In [41]:
meta_df.head()

Unnamed: 0,eadid,unit_title,unit_identifier,unique_language,unique_date,unique_geography
0,Coll-1064,"['Papers of Professor Walter Ledermann', '1 (3...","['Coll-1064', 'Coll-1064/1', 'Coll-1064/2', 'C...",['English'],"['1937-01-01 - 1954-12-31', '1937-02-02 - 1938...","['Edinburgh (Scotland)', 'St Andrews (Scotland..."
1,Coll-31,['Drawings from the Office of Sir Rowand Ander...,"['Coll-31', 'Coll-31/1', 'Coll-31/1/1', 'Coll-...",['English'],"['1814-01-01 - 1924-12-31', '1874-01-01 - 1905...",['']
2,Coll-51,['Papers of Sir Roderick Impey Murchison and h...,"['Coll-51', 'Coll-51/1', 'Coll-51/2', 'Coll-51...",['English'],"['1771-01-01 - 1935-12-31', '1723-01-01 - 1935...","['Calcutta (India)', 'Europe', 'Tarradale (Sco..."
3,Coll-204,"['Lecture Notes of John Robison', 'Introductio...","['Coll-204', 'Coll-204/1', 'Coll-204/2', 'Coll...","['English', 'Russian', 'Latin', 'French']","['1779-01-01 - 1801-12-31', '1779-01-01 - 1801...","['Edinburgh (Scotland)', 'Stirlingshire Scotla..."
4,Coll-206,['Records of the Wernerian Natural History Soc...,"['Coll-206', 'Coll-206/1', 'Coll-206/1/1', 'Co...",['English'],"['1808-01-01 - 1858-12-31', '1808-01-12 - 1858...","['Edinburgh (Scotland)', 'Freiburg im Breisgau..."


In [42]:
data_df.head()

Unnamed: 0,file,offsets,text,label,category,eadid,annot_id,desc_id
0,AA5_00100.ann,"(789, 791)",He,Gendered-Pronoun,Linguistic,AA5,0,5223
1,AA5_00100.ann,"(871, 873)",he,Gendered-Pronoun,Linguistic,AA5,1,5223
2,AA5_00100.ann,"(913, 916)",his,Gendered-Pronoun,Linguistic,AA5,2,5223
3,AA5_00100.ann,"(928, 930)",he,Gendered-Pronoun,Linguistic,AA5,3,5223
4,AA5_00100.ann,"(1217, 1219)",he,Gendered-Pronoun,Linguistic,AA5,4,5223


<a id="3"></a>
## 3. Dates of Material of Annotations

First, find how the dates of material are distributed across the entire dataset. Then, find the distribution of dates for the `Omission`, `Stereotype`, and `Generalization` labels.

In [129]:
# meta_df = pd.read_csv("data/CRC_units-grouped-by-fonds_clean.csv", index_col=0)

In [130]:
# dates = list(meta_df.unique_date)
# date_counts = dict()
# for d in dates:
#     if isinstance(d, str):
#         d = d.replace("'","")
#         d = d.replace("[","")
#         d = d.replace("]","")
#         d_list = d.split(", ")
#         for date_range in d_list:
#             if date_range in date_counts:
#                 date_counts[date_range] += 1
#             else:
#                 date_counts[date_range] = 1

# print(date_counts)

In [100]:
# lang_counts_df = pd.DataFrame.from_dict(lang_counts, orient="index").reset_index()
# lang_counts_df.columns = ["language", "collection_count"]
# lang_counts_df = lang_counts_df.sort_values(by="collection_count", ascending=False)
# # lang_counts_df.head()
# lang_counts_df.to_csv("analysis_data/language_material_collection_counts.csv")
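
The date analysis is not implemented in the cells above. As one possible starting point, the sketch below parses date ranges in the `unique_date` format and bins them by the decade of their start year. Binning by decade is an assumption made for illustration, and the toy cell combines two ranges copied from the `meta_df.head()` output.

```python
import ast
import re
from collections import Counter

# Toy cell in the unique_date format, built from two ranges shown in meta_df.head()
cell = "['1937-01-01 - 1954-12-31', '1808-01-01 - 1858-12-31']"

decade_counts = Counter()
for date_range in ast.literal_eval(cell):
    # Treat the first four-digit number in the range as its start year
    match = re.search(r"\d{4}", date_range)
    if match:
        decade = int(match.group()) // 10 * 10
        decade_counts[decade] += 1

print(sorted(decade_counts.items()))  # [(1800, 1), (1930, 1)]
```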

<a id="4"></a>
## 4. Lengths of Descriptions and Annotations
Compute the minimum, maximum, average, and standard deviation of word and sentence counts:
* Per description (identifiable with the `desc_id` column, i.e. per "document" for document classifiers)
* Per metadata field (Title, Biographical / Historical, Scope and Contents, and Processing Information)
* Per collection (identifiable with the `eadid` column)
* Per annotation label (Omission, Stereotype, Generalization, etc.)
* Per annotation category (Person-Name, Linguistic, Contextual)
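
The statistics listed above could be computed with a `groupby` over word and sentence counts. The sketch below uses toy descriptions and simple regular expressions as stand-ins for `word_tokenize` and `sent_tokenize`, so that it runs without downloading NLTK data. The column names `word_count` and `sent_count` are introduced here for illustration.

```python
import re
import pandas as pd

# Toy descriptions (illustrative only), one row per description
toy = pd.DataFrame({
    "desc_id": [0, 1, 2],
    "field": ["Title", "Scope and Contents", "Scope and Contents"],
    "description": ["Papers of a professor",
                    "Two letters. One photograph of a group.",
                    "A notebook."],
})

# Regex stand-ins for nltk's word_tokenize and sent_tokenize
toy["word_count"] = toy.description.apply(lambda t: len(re.findall(r"\w+", t)))
toy["sent_count"] = toy.description.apply(
    lambda t: len([s for s in re.split(r"(?<=[.!?])\s+", t) if s.strip()]))

# Summary statistics per metadata field
summary = toy.groupby("field")[["word_count", "sent_count"]].agg(
    ["min", "max", "mean", "std"])
print(summary)
```

Grouping by `eadid`, `label`, or `category` instead of `field` gives the other breakdowns. Because the annotation table repeats each description once per annotation, the rows would likely need deduplicating on `desc_id` before computing per-description statistics.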