# Annotated Data Analysis 
## Post Annotation and Aggregation

In the aggregated dataset, examine:

[1.](#1) Aggregated Data's Associated Genders

[2.](#2) Annotators' rationale for applying labels (as documented in the `note` column)

  * [2.1](#2.1) Associated Genders
  * [2.2](#2.2) Label: Stereotype

[X.](#3) Dates of Material of Annotations

[X.](#X) Correlation (if any) between type of gender biased language and type of descriptive metadata field

[X.](#X) People referred to with feminine vs. masculine terms in annotated text

***

The code in this Jupyter Notebook is part of a PhD project to create a gold standard dataset labeled for gender biased language, on which a classifier can be trained to identify gender bias in archival metadata descriptions.  

This project is focused on the English language and archival institutions in the United Kingdom.

* Author: Lucy Havens
* Date: August 2022
* Project: PhD Case Study 1
* Data Provider: [ArchivesSpace](https://archives.collections.ed.ac.uk/), Centre for Research Collections, University of Edinburgh

***

## 0. Loading
First, load the Python libraries and define the paths to the data files to be analyzed.

In [2]:
import utils  # custom functions

import pandas as pd
import numpy as np
import string, csv, re, os, sys #,json

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag
from nltk.text import Text
from nltk.probability import FreqDist
from collections import Counter
from wordcloud import WordCloud

%matplotlib inline
import matplotlib.pyplot as plt

In [3]:
dir_path = "data/"
data_files = ["aggregated_final.csv", "aggregated_with_annotator_eadid_note_cols.csv", 
              "aggregated_with_eadid_descid_desc_cols.csv", "descriptions.csv", "all_annotators.csv"]

<a id="1"></a>
## 1. Aggregated Data's Associated Genders

In [3]:
filepath = dir_path + data_files[0]
df_agg = pd.read_csv(filepath, index_col=0)
df_agg = utils.addAssociatedGenders(df_agg)
df_agg.head()

Unnamed: 0,file,offsets,text,label,category,associated_genders
0,Coll-1434_11900.ann,"(1954, 1957)",his,Generalization,Linguistic,Masculine
1,Coll-1397_00100.ann,"(2633, 2638)",Lords,Generalization,Linguistic,Unclear
2,Coll-1310_00800.ann,"(3703, 3706)",Man,Generalization,Linguistic,Masculine
3,Coll-1434_14500.ann,"(5782, 5788)",cowboy,Generalization,Linguistic,Masculine
4,BAI_02300.ann,"(1586, 1596)",shipmaster,Generalization,Linguistic,Unclear


Update the file with the `associated_genders` column:

In [4]:
df_agg.to_csv(filepath)

<a id="2"></a>
## 2. Annotators' Rationale

Analyze the rationales that annotators recorded in the `note` field for the labels they applied.

In [5]:
dir_path = "../annot/data/data_annot/"
ann_files = ["ann0_labels_notes.csv", "ann1_labels_notes.csv", 
             "ann2_labels_notes.csv", "ann3_labels_notes.csv", "ann4_labels_notes.csv",]
df0 = pd.read_csv(dir_path+ann_files[0])
df1 = pd.read_csv(dir_path+ann_files[1])
df2 = pd.read_csv(dir_path+ann_files[2])
df3 = pd.read_csv(dir_path+ann_files[3])
df4 = pd.read_csv(dir_path+ann_files[4])
df0.head()

Unnamed: 0,annotator,file,entity,label,start,end,text,category,note
0,Annotator 0,Coll-1444_00100.ann,T1,Unknown,52,66,Robert E. Bell,Person-Name,
1,Annotator 0,Coll-1444_00100.ann,T2,Generalization,219,228,Bachelors,Linguistic,A masculine term for a degree that can be awar...
2,Annotator 0,Coll-1444_00100.ann,T3,Generalization,301,310,Bachelors,Linguistic,A masculine term for a degree that can be awar...
3,Annotator 0,Coll-1444_00100.ann,T4,Generalization,368,372,Ed.B,Linguistic,"B for bachelor, a masculine term for a degree ..."
4,Annotator 0,Coll-1444_00100.ann,T5,Generalization,377,381,M.Ed,Linguistic,"M for master, a masculine term for a degree th..."


In [6]:
df0.shape[0] + df1.shape[0] + df2.shape[0] + df3.shape[0] + df4.shape[0]

77833

<a id="2.1"></a>
### 2.1 Associated Genders

In [6]:
df0 = utils.addAssociatedGenders(df0)
df1 = utils.addAssociatedGenders(df1)
df2 = utils.addAssociatedGenders(df2)
df3 = utils.addAssociatedGenders(df3)
df4 = utils.addAssociatedGenders(df4)

In [8]:
df0.head()

Unnamed: 0,annotator,file,entity,label,start,end,text,category,note,associated_genders
0,Annotator 0,Coll-1444_00100.ann,T1,Unknown,52,66,Robert E. Bell,Person-Name,,Unclear
1,Annotator 0,Coll-1444_00100.ann,T2,Generalization,219,228,Bachelors,Linguistic,A masculine term for a degree that can be awar...,Masculine
2,Annotator 0,Coll-1444_00100.ann,T3,Generalization,301,310,Bachelors,Linguistic,A masculine term for a degree that can be awar...,Masculine
3,Annotator 0,Coll-1444_00100.ann,T4,Generalization,368,372,Ed.B,Linguistic,"B for bachelor, a masculine term for a degree ...",Masculine
4,Annotator 0,Coll-1444_00100.ann,T5,Generalization,377,381,M.Ed,Linguistic,"M for master, a masculine term for a degree th...",Unclear


In [9]:
df_list = [df0, df1, df2, df3, df4]
for i in range(len(df_list)):
    df = df_list[i]
    df_m = df.loc[df.associated_genders == "Masculine"]
    df_f = df.loc[df.associated_genders == "Feminine"]
    df_fm = df.loc[df.associated_genders == "Multiple"]
    df_unclear = df.loc[df.associated_genders == "Unclear"]
    print("Annotator", i)
    print("- Masculine:", df_m.shape[0])
    print("- Feminine:", df_f.shape[0])
    print("- Feminine & Masculine:", df_fm.shape[0])
    print("- Unclear:", df_unclear.shape[0])

Annotator 0
- Masculine: 6955
- Feminine: 1260
- Feminine & Masculine: 1746
- Unclear: 21825
Annotator 1
- Masculine: 4104
- Feminine: 440
- Feminine & Masculine: 853
- Unclear: 11163
Annotator 2
- Masculine: 2239
- Feminine: 467
- Feminine & Masculine: 687
- Unclear: 16483
Annotator 3
- Masculine: 1615
- Feminine: 202
- Feminine & Masculine: 795
- Unclear: 2513
Annotator 4
- Masculine: 1342
- Feminine: 636
- Feminine & Masculine: 559
- Unclear: 1949
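The same kind of per-annotator breakdown can also be produced with pandas' `value_counts`. The sketch below is a minimal illustration on a small, made-up DataFrame: the `toy` data and the `gender_counts` helper are hypothetical (they are not part of `utils`), and the code assumes an `associated_genders` column like the one added above.

```python
import pandas as pd

# The four values that appear in the associated_genders column
CATEGORIES = ["Masculine", "Feminine", "Multiple", "Unclear"]

def gender_counts(df):
    """Count rows per associated gender, reporting 0 for absent categories."""
    counts = df["associated_genders"].value_counts()
    return counts.reindex(CATEGORIES, fill_value=0)

# Made-up rows, for illustration only
toy = pd.DataFrame(
    {"associated_genders": ["Masculine", "Unclear", "Masculine", "Feminine"]}
)
toy_counts = gender_counts(toy)
print(toy_counts)
```

Reindexing against a fixed category list keeps the output in the same order for every annotator, even when a category has no rows.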


Write the DataFrames to the `annot_post/data` directory:

In [10]:
for i in range(len(ann_files)):
    filepath = "data/"+ann_files[i]
    df = df_list[i]
    df.to_csv(filepath)

<a id="2.2"></a>
### 2.2 Label: Stereotype

Load the data file with all the annotators' labels and notes:

In [25]:
df = pd.read_csv(dir_path+data_files[-1], index_col=0)

In [26]:
label_name = "Stereotype"
df_ste = df.loc[df.label == label_name]
print(df_ste.shape)
df_ste.head()

(3153, 13)


Unnamed: 0,annotator,file,entity,label,start,end,text,category,note,eadid,field,id,field2
76483,Annotator 4,AA5_00100.ann,T7,Stereotype,34,63,The Very Rev Prof James Whyte,Contextual,form of address characteristic of male homosoc...,AA5,Title,0,Papers of The Very Rev Prof James Whyte (1920-...
76485,Annotator 4,AA5_00100.ann,T15,Stereotype,696,723,leading Scottish Theologian,Contextual,man associated with leadership role\n,AA5,Biographical / Historical,7,Professor James Aitken White was a leading Sco...
70361,Annotator 3,AA6_00100.ann,T14,Stereotype,655,675,to William and Agnes,Contextual,male family members prioritised\n,AA6,Biographical / Historical,49,Rev Thomas Allan was born on 16 August 1916 in...
70362,Annotator 3,AA6_00100.ann,T15,Stereotype,810,836,first class honours degree,Contextual,honour or achievement held by man\n,AA6,Biographical / Historical,61,Rev Thomas Allan was born on 16 August 1916 in...
70363,Annotator 3,AA6_00100.ann,T11,Stereotype,1039,1067,Tom Allan married Jane Moore,Contextual,marriage - man listed as active party\n,AA6,Biographical / Historical,71,Rev Thomas Allan was born on 16 August 1916 in...


Determine which genders are associated with `Stereotype` annotations based on the labeled text (`text` column) and any comments provided (`note` column):

In [31]:
# Replace NaN notes with "" (an empty string)
df_ste = df_ste.fillna("")

In [32]:
genders = []
notes = list(df_ste.note)
texts = list(df_ste.text)
fem = ["wom.n", "girl", "^gal", "female", "lady", "ladies", "wi[fv]e", "her", "she"] # should add lass, lassie
mas = ["^man", "^men", "boy", "male", "lad$", "lads", "laddie", "husband", "his", "him", "he"]
# NaN notes were replaced with "" above, so every row can be classified.
# Caution: the patterns match substrings and are case-sensitive,
# e.g. "he" also matches "the" and "male" also matches "female".
for n, t in zip(notes, texts):
    # A gender is associated if any of its patterns occurs in the note or the text
    feminine = any(re.search(f, n) or re.search(f, t) for f in fem)
    masculine = any(re.search(m, n) or re.search(m, t) for m in mas)

    if feminine and masculine:
        genders.append("Multiple")
    elif feminine:
        genders.append("Feminine")
    elif masculine:
        genders.append("Masculine")
    else:
        genders.append("Unclear")

print("Feminine stereotypes:", genders.count("Feminine"))
print("Masculine stereotypes:", genders.count("Masculine"))
print("F & M stereotypes:", genders.count("Multiple"))
print("Unclear stereotypes:", genders.count("Unclear"))

Feminine stereotypes: 461
Masculine stereotypes: 2149
F & M stereotypes: 359
Unclear stereotypes: 184
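The patterns in the cell above match substrings, so short patterns such as `he`, `her`, and `his` can also match inside unrelated words (for example `the`, `other`, `this`), and `male` matches inside `female`. The matching is also case-sensitive. The sketch below shows one possible alternative that uses word boundaries and case-insensitive matching. It is an illustration only, not the method that produced the counts above, and applying it would change those counts. The term lists and the `classify` helper are hypothetical.

```python
import re

# Hypothetical term lists; \b restricts each term to whole-word matches
FEM = re.compile(r"\b(wom[ae]n|girls?|female|lady|ladies|wife|wives|her|she)\b",
                 re.IGNORECASE)
MAS = re.compile(r"\b(man|men|boys?|male|lads?|husbands?|his|him|he)\b",
                 re.IGNORECASE)

def classify(note, text):
    """Return the gender category suggested by a note and its labeled text."""
    combined = f"{note} {text}"
    feminine = bool(FEM.search(combined))
    masculine = bool(MAS.search(combined))
    if feminine and masculine:
        return "Multiple"
    if feminine:
        return "Feminine"
    if masculine:
        return "Masculine"
    return "Unclear"

print(classify("man associated with leadership role", "leading Scottish Theologian"))
print(classify("", "the other thing"))
```

Whole-word matching trades recall for precision: it no longer catches compounds such as `shipmaster`, so the term lists would need manual review before use.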


In [33]:
f_ste_count = genders.count("Feminine")
m_ste_count = genders.count("Masculine")
fm_ste_count = genders.count("Multiple")
uncl_ste_count = genders.count("Unclear")
total_ste = len(genders)
counts = [f_ste_count, m_ste_count, fm_ste_count, uncl_ste_count]
print("Ratios:")
for c in counts:
    print(c/total_ste)

Ratios:
0.14620995876942594
0.6815731049793847
0.11385981604820805
0.05835712020298129


<div class="alert alert-block alert-warning">
    <b>To Do:</b> Consider how to normalize these ratios against how often people of each gender (or gender group) appear in the descriptions overall.
</div>
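One possible normalization is sketched below: divide the number of `Stereotype` annotations per gender category by the number of all annotations in that category. This is only an illustration of the idea, not a settled method. The choice of baseline is an assumption (here, the per-annotator counts printed in section 2.1, summed across annotators), the baseline includes the stereotype annotations themselves, and the two sets of gender categories may not have been assigned by the same procedure. The `normalized_rates` helper is hypothetical.

```python
# Stereotype annotations per associated gender (counts printed above)
stereotype_counts = {"Feminine": 461, "Masculine": 2149, "Multiple": 359, "Unclear": 184}

# Assumed baseline: all annotations per associated gender,
# summed over Annotators 0-4 from the counts printed in section 2.1
baseline_counts = {
    "Feminine": sum([1260, 440, 467, 202, 636]),
    "Masculine": sum([6955, 4104, 2239, 1615, 1342]),
    "Multiple": sum([1746, 853, 687, 795, 559]),
    "Unclear": sum([21825, 11163, 16483, 2513, 1949]),
}

def normalized_rates(counts, baseline):
    """Divide each category's count by its baseline, skipping empty baselines."""
    return {g: counts[g] / baseline[g] for g in counts if baseline.get(g, 0) > 0}

rates = normalized_rates(stereotype_counts, baseline_counts)
for gender, rate in rates.items():
    print(f"{gender}: {rate:.4f}")
```

A rate computed this way describes how often annotations associated with a gender are stereotypes, rather than how many stereotypes there are in absolute terms.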

In [34]:
# Add the genders associated with each stereotype to the DataFrame
df_ste.insert(len(df_ste.columns), "associated_genders", genders)
df_ste.head()

Unnamed: 0,annotator,file,entity,label,start,end,text,category,note,eadid,field,id,field2,associated_genders
76483,Annotator 4,AA5_00100.ann,T7,Stereotype,34,63,The Very Rev Prof James Whyte,Contextual,form of address characteristic of male homosoc...,AA5,Title,0,Papers of The Very Rev Prof James Whyte (1920-...,Masculine
76485,Annotator 4,AA5_00100.ann,T15,Stereotype,696,723,leading Scottish Theologian,Contextual,man associated with leadership role\n,AA5,Biographical / Historical,7,Professor James Aitken White was a leading Sco...,Masculine
70361,Annotator 3,AA6_00100.ann,T14,Stereotype,655,675,to William and Agnes,Contextual,male family members prioritised\n,AA6,Biographical / Historical,49,Rev Thomas Allan was born on 16 August 1916 in...,Masculine
70362,Annotator 3,AA6_00100.ann,T15,Stereotype,810,836,first class honours degree,Contextual,honour or achievement held by man\n,AA6,Biographical / Historical,61,Rev Thomas Allan was born on 16 August 1916 in...,Masculine
70363,Annotator 3,AA6_00100.ann,T11,Stereotype,1039,1067,Tom Allan married Jane Moore,Contextual,marriage - man listed as active party\n,AA6,Biographical / Historical,71,Rev Thomas Allan was born on 16 August 1916 in...,Unclear


<div class="alert alert-success">
    <b>To Review Later:</b>
</div>

In [35]:
other_ste = df_ste.loc[df_ste.associated_genders == "Unclear"]  # "Unclear" is the value assigned above
other_ste.to_csv("analysis_data/unknown_stereotypes.csv")

#### 2.2.1: Types of Stereotypes

Normalize the `note` and `text` values by lowercasing them and stripping surrounding whitespace:

In [36]:
notes = list(df_ste.note)
notes = [note.lower().strip() for note in notes]
texts = list(df_ste.text)
texts = [text.lower().strip() for text in texts]
print(notes[:3])
# # Get the note (annotator's rationale) for the stereotype labels
# notes0 = list(df_ste0.note)
# notes3 = list(df_ste3.note)
# notes4 = list(df_ste4.note)
# # Lowercase all the text
# notes0 = [note.lower() for note in notes0]
# notes3 = [note.lower() for note in notes3]
# notes4 = [note.lower() for note in notes4]

['form of address characteristic of male homosocial group bonding', 'man associated with leadership role', 'male family members prioritised']


Replace the `note` column with the normalized text:

In [40]:
df_ste["note"] = notes

Remove empty notes:

In [38]:
filled_notes = [n for n in notes if n != ""]
assert len(filled_notes) < len(notes)
print("Count of all Stereotype notes:", len(filled_notes))

Count of all Stereotype notes: 3146


Find the number of unique notes:

In [39]:
unique_notes = set(filled_notes)
print("Count of unique Stereotype notes:", len(unique_notes))

Count of unique Stereotype notes: 203


Interesting! Only 203 of the 3,146 non-empty notes are unique, so annotators repeated many of their rationales. Let's export the notes to review them manually and see whether they can be categorized into types of stereotypes.
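Before reviewing the notes manually, it may help to see which ones repeat most often. The sketch below counts repeated notes with `collections.Counter`. The `most_repeated` helper is hypothetical, and `sample_notes` is a made-up list that reuses a few notes shown earlier, so its frequencies do not reflect the real data.

```python
from collections import Counter

def most_repeated(notes, n=10):
    """Return the n most frequent non-empty notes with their counts."""
    return Counter(note for note in notes if note != "").most_common(n)

# Made-up sample for illustration; in the notebook, pass the normalized notes list
sample_notes = [
    "male family members prioritised",
    "",
    "man associated with leadership role",
    "male family members prioritised",
]
top = most_repeated(sample_notes, n=2)
for note, count in top:
    print(count, note)
```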

In [52]:
small_df = pd.DataFrame({
    "id": df_ste.id,
    "note": df_ste.note,
    "text": df_ste.text,
    "associated_genders": df_ste.associated_genders,
})
# Collect the labeled texts and description IDs for each (note, gender) pair
grouped = (
    small_df.groupby(["note", "associated_genders"])
    .agg({"text": lambda x: x.tolist(), "id": lambda x: x.tolist()})
    .reset_index()
)
grouped.head()

Unnamed: 0,note,associated_genders,text,id
0,,Masculine,[Also in 1896 he married Ellen Milne McCulloch...,"[4913, 52151, 52289, 53472, 59241]"
1,,Multiple,[Letters to Florence Jewel Baillie from her mo...,[1205]
2,,Unclear,[a man],[57896]
3,"""he had his sons"" rather than ""they had their ...",Multiple,"[married Annie Macpherson, by whom he had his ...",[3447]
4,"a woman ""acted as"" rather than was a clinical ...",Multiple,[acted as a],[13554]


In [53]:
print(grouped.shape)

(246, 4)
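To make the grouping step concrete, the sketch below applies the same `groupby` and `agg` pattern to a tiny, made-up DataFrame and adds a count of rows per group. The `toy` data and the `n` column are hypothetical additions for illustration.

```python
import pandas as pd

# Made-up rows: two share the same note and associated gender
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "note": ["note a", "note a", "note b"],
    "text": ["x", "y", "z"],
    "associated_genders": ["Masculine", "Masculine", "Unclear"],
})

# One row per (note, gender) pair, with texts and IDs collected into lists
toy_grouped = (
    toy.groupby(["note", "associated_genders"])
    .agg({"text": lambda x: x.tolist(), "id": lambda x: x.tolist()})
    .reset_index()
)
# Number of annotations that share each note and gender
toy_grouped["n"] = toy_grouped["id"].apply(len)
print(toy_grouped)
```

Sorting the grouped rows by `n` would surface the most frequently reused rationales first during manual review.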


In [54]:
grouped.to_csv("analysis_data/all_stereotype_note_groups.csv") # File of grouped stereotype notes across all annotators' data

In [None]:
# # Make one DataFrame per annotator (0, 3 and 4)
# anns = [0, 3, 4]
# df_ste0 = df_ste.loc[df_ste.annotator == anns[0]]
# df_ste3 = df_ste.loc[df_ste.annotator == anns[1]]
# df_ste4 = df_ste.loc[df_ste.annotator == anns[2]]
# # Print the rows (label counts) per annotator
# print(df_ste0.shape[0], df_ste3.shape[0], df_ste4.shape[0])