# Analysis: Gendered Pronouns and Roles
## Post Annotation and Aggregation

A comparison of Regular Expressions and part-of-speech tagging to manual annotation for labeling gendered pronouns and gendered roles.

***

**Table of Contents**

  [I. Loading](#load)

  [II. Gendered Pronouns](#gp)
  
   * RegEx


  [III. Gendered Roles](#gr)
  
   * RegEx?
   * POS?

***

<a id="load"></a>
## I. Loading

In [1]:
import pandas as pd
import re
import numpy as np
import utils



Load the data:

In [2]:
data_dir = "../data/"

In [3]:
file_path = data_dir+"aggregated_data/aggregated_with_eadid_descid_desc_cols.csv"

In [4]:
df = pd.read_csv(file_path, index_col=0)
df.head()

Unnamed: 0,file,offsets,text,label,category,eadid,description,field,id,desc_id
9,AA5_00100.ann,"(1032, 1043)",James Whyte,Masculine,Person-Name,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,0,0
16,AA5_00100.ann,"(1129, 1177)",chair of practical theology and Christian ethics,Occupation,Contextual,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,1,0
4,AA5_00100.ann,"(1217, 1219)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,2,0
5,AA5_00100.ann,"(1241, 1244)",His,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,3,0
6,AA5_00100.ann,"(1315, 1317)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,4,0


In [5]:
print("Rows:",df.shape[0])
print("Columns:",df.shape[1])

Rows: 55260
Columns: 10


In [6]:
desc_path = data_dir+"crc_metadata/all_descriptions.csv"
df_desc = pd.read_csv(desc_path, index_col=0)
df_desc.tail()

Unnamed: 0,eadid,description,field,desc_id
72618,Coll-1266,"Honorary degree scroll from Leiden University,...",Title,11889
72620,Coll-1266,Programmes for honorary degree ceremony at the...,Title,11890
68537,Coll-1266,Contains honorary degree certificate and citat...,Scope and Contents,11891
30630,Coll-1310,\n,Title,11892
73912,Coll-146,"TS signed1p. At head of paper: E Gellhorn, MD ...",Scope and Contents,11893


In [7]:
print("Rows:",df_desc.shape[0])
print("Columns:",df_desc.shape[1])

Rows: 11894
Columns: 4


Replace the first DataFrame's `description` column, which prefixes each description with its metadata field name, with the second DataFrame's `description` column, which omits that prefix:

In [8]:
index_cols = ["desc_id", "eadid", "field"]
df = df.set_index(index_cols)
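# Index a copy of df_desc for the join so that df_desc keeps desc_id, eadid,
# and field as regular columns, which later cells access (e.g. df_desc.desc_id)
df_desc_indexed = df_desc.set_index(index_cols)

In [9]:
df = df.join(df_desc_indexed, how="left", on=index_cols, lsuffix="_remove", rsuffix="")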
df = df.drop(columns=["description_remove"])
df = df.reset_index()
df.head()

Unnamed: 0,desc_id,eadid,field,file,offsets,text,label,category,id,description
0,0,AA5,Biographical / Historical,AA5_00100.ann,"(1032, 1043)",James Whyte,Masculine,Person-Name,0,Professor James Aitken White was a leading Sco...
1,0,AA5,Biographical / Historical,AA5_00100.ann,"(1129, 1177)",chair of practical theology and Christian ethics,Occupation,Contextual,1,Professor James Aitken White was a leading Sco...
2,0,AA5,Biographical / Historical,AA5_00100.ann,"(1217, 1219)",he,Gendered-Pronoun,Linguistic,2,Professor James Aitken White was a leading Sco...
3,0,AA5,Biographical / Historical,AA5_00100.ann,"(1241, 1244)",His,Gendered-Pronoun,Linguistic,3,Professor James Aitken White was a leading Sco...
4,0,AA5,Biographical / Historical,AA5_00100.ann,"(1315, 1317)",he,Gendered-Pronoun,Linguistic,4,Professor James Aitken White was a leading Sco...
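
As a small illustration of the replacement above, the following sketch applies the same index-then-join pattern to two toy DataFrames. The rows and values are invented for illustration only and are not taken from the dataset; the names `annotations`, `descriptions`, and `key_cols` are likewise hypothetical.

```python
import pandas as pd

# Toy stand-ins for the annotation table and the description table (invented values)
annotations = pd.DataFrame({
    "desc_id": [0, 0, 1],
    "eadid": ["A1", "A1", "B2"],
    "field": ["Title", "Title", "Scope and Contents"],
    "text": ["he", "his", "she"],
    "description": ["Title:\nHe and his dog", "Title:\nHe and his dog",
                    "Scope and Contents:\nShe wrote letters"],
})
descriptions = pd.DataFrame({
    "desc_id": [0, 1],
    "eadid": ["A1", "B2"],
    "field": ["Title", "Scope and Contents"],
    "description": ["He and his dog", "She wrote letters"],
})

key_cols = ["desc_id", "eadid", "field"]
joined = (
    annotations.set_index(key_cols)
    # Overlapping "description" columns: the left one gets the "_remove" suffix
    .join(descriptions.set_index(key_cols), how="left", lsuffix="_remove")
    .drop(columns=["description_remove"])
    .reset_index()
)
print(joined[["desc_id", "text", "description"]])
```

Each annotation row keeps its own `text` but now carries the description without the field-name prefix.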


Update the aggregated data file:

In [11]:
df.to_csv(file_path)

***

**TO FIX:** the description with `desc_id` 11892 appears to contain only a newline:

In [13]:
df_desc.loc[df_desc.desc_id == 11892].description

30630    \n
Name: description, dtype: object

In [23]:
df_desc.loc[df_desc.description == "\n"]  #"\t" " " "" --> the newline at 30630 seems to be the only one not recorded properly

Unnamed: 0,eadid,description,field,desc_id


In [19]:
df_desc.loc[(df_desc.eadid == "Coll-1310") & (df_desc.field == "Title")]

Unnamed: 0,eadid,description,field,desc_id
2315,Coll-1310,"Letter from staff at Teachers College, Columbi...",Title,2315
2316,Coll-1310,"Letter from the Colonial Office, Downing Stree...",Title,2316
2317,Coll-1310,"Letters and telegrams between Thomson, the War...",Title,2317
2318,Coll-1310,Letters to Thomson compiled by Hector and titl...,Title,2318
2319,Coll-1310,Letter from Sir Edmund Taylor Whittaker to Tho...,Title,2319
...,...,...,...,...
2769,Coll-1310,"Offprint of Thurstone's article, The Mental Ag...",Title,2769
2770,Coll-1310,"Psychometric Monographs no. 1, Primary Mental ...",Title,2770
2771,Coll-1310,"Sukhatme, P V\n\n",Title,2771
2772,Coll-1310,"Offprint of Thurstone's article, Current Issue...",Title,2772


***

<a id="gp"></a>
## II. Gendered Pronouns

In [46]:
label = "Gendered-Pronoun"

#### Find the total number of `Gendered-Pronoun` labels and the count of unique text spans annotated with the label

In [47]:
df_gp = df.loc[df.label == label]
print("Total gendered pronouns manually labeled:",df_gp.shape[0])

Total gendered pronouns manually labeled: 4171


In [48]:
gp_counts = df_gp.text.value_counts()
# Map each annotated text span to the number of times it was labeled
gp_counts_dict = gp_counts.to_dict()
print(gp_counts_dict)

{'his': 1261, 'he': 1245, 'He': 772, 'him': 246, 'her': 229, 'His': 146, 'she': 139, 'She': 82, 'Her': 26, 'himself': 20, 'Him': 2, "He's": 1, 'Himself': 1, 'H': 1}


In [49]:
gp_values = list(set(list(df_gp.text)))
gp_values_normalized = [gp.lower() for gp in gp_values]
gp_values_unique = list(set(list(gp_values_normalized)))
gp_dict = dict.fromkeys(gp_values_unique)
for key in gp_dict.keys():
    total_count = 0
    if key in gp_counts_dict:
        total_count += gp_counts_dict[key]
    if key.capitalize() in gp_counts_dict:
        total_count += gp_counts_dict[key.capitalize()]
    gp_dict[key] = total_count
print(gp_dict)

{'h': 1, 'himself': 21, 'her': 255, 'he': 2017, "he's": 1, 'she': 221, 'his': 1407, 'him': 248}
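
The case-insensitive totals can also be sketched by lowercasing every span before counting. The example below is a minimal illustration on an invented list of spans, not the notebook's data. Unlike the `capitalize()` lookup above, it would also fold fully uppercase spans such as `HE` into the same key.

```python
from collections import Counter

# Invented sample of annotated spans, mixing upper and lower case
spans = ["he", "He", "his", "His", "him", "Him", "she", "He's"]

# Lowercase each span before counting so that case variants share one key
normalized_counts = Counter(span.lower() for span in spans)
print(dict(normalized_counts))
```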


#### Automatically Identify Gendered Pronouns

First, split the descriptions into words. Then, try two approaches to searching for matching strings.

In [50]:
descs = list(df_desc.description)
descs_split = [d.split(" ") for d in descs]
# print(descs_split[:2])
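
Splitting on a single space leaves punctuation attached to words, so a token such as `her,` would not equal `her`. The sketch below contrasts that split with one assumed alternative, a regular expression that keeps runs of letters and apostrophes. The sentence is invented, and this tokenizer is only an illustration, not the method used in this notebook.

```python
import re

# Invented sentence; punctuation follows two of the pronouns
sentence = "He wrote to her, and she replied to him."

space_tokens = sentence.split(" ")
# Assumed alternative: keep runs of letters and apostrophes as tokens
word_tokens = re.findall(r"[A-Za-z']+", sentence)

print(space_tokens)
print(word_tokens)
```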

**Approach 1:** Define RegEx patterns:

In [51]:
pattern1 = re.compile("^[hx][ei][rsm](self)*$")
pattern2 = re.compile("^s*[xh]e('s)*$")
pattern3 = re.compile("^x*e[iy]r*$")
pattern_list = [pattern1, pattern2, pattern3]

Count matches with the patterns:

In [52]:
matches = utils.countMatches(descs_split,pattern_list)
print(matches)
print("Total:",np.asarray(list(matches.values())).sum())

{'he': 2193, 'his': 1520, 'himself': 32, 'her': 298, 'she': 274, 'him': 223, 'herself': 2, "he's": 2}
Total: 4544
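
The code of `utils.countMatches` is not shown in this notebook. As a rough guide to what such a function might do, the sketch below lowercases each token and counts those that match any compiled pattern. The function name `count_matches_sketch` and its behavior are assumptions, and the sample sentence is invented.

```python
import re

def count_matches_sketch(tokens, patterns):
    """Count lowercased tokens that match any of the compiled patterns.

    Hypothetical stand-in for utils.countMatches, whose real code is not shown.
    """
    counts = {}
    for token in tokens:
        word = token.lower()
        if any(p.match(word) for p in patterns):
            counts[word] = counts.get(word, 0) + 1
    return counts

# The three patterns defined in the notebook
patterns = [
    re.compile(r"^[hx][ei][rsm](self)*$"),
    re.compile(r"^s*[xh]e('s)*$"),
    re.compile(r"^x*e[iy]r*$"),
]
tokens = "He gave his book to her and she thanked him herself".split(" ")
result = count_matches_sketch(tokens, patterns)
print(result)
```

Note that the first pattern accepts any combination of its character classes, so a word such as `hem` also matches; `count_matches_sketch(["hem"], patterns)` returns a count for it.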


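**Approach 2:** Try a word list. Note that this loop counts each pronoun at most once per description, so its totals measure descriptions containing a pronoun rather than pronoun occurrences: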

In [53]:
gendered_pronouns = ["his","hers","himself","herself","hm","he's","she's","she","he","him","her", "xe", "xeir", "xey", "ey", "eir"] # HM for Her/His Majesty
list_matches = dict.fromkeys(gendered_pronouns, 0)
for desc in descs_split:
    desc_lowercased = [word.lower() for word in desc]
    for gp in gendered_pronouns:
        if gp in desc_lowercased:
            list_matches[gp] += 1
print(list_matches)
print("Total:",np.asarray(list(list_matches.values())).sum())

{'his': 821, 'hers': 0, 'himself': 31, 'herself': 2, 'hm': 3, "he's": 2, "she's": 0, 'she': 92, 'he': 560, 'him': 171, 'her': 193, 'xe': 0, 'xeir': 0, 'xey': 0, 'ey': 0, 'eir': 0}
Total: 1875
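
The difference between counting descriptions and counting occurrences can be seen on a toy example. The tokenized descriptions below are invented, and `desc_counts` and `token_counts` are hypothetical names used only for this sketch.

```python
# Invented, already tokenized descriptions
toy_descs = [["he", "said", "he", "knew", "her"], ["her", "letters"]]
word_list = ["he", "her"]

# Number of descriptions that contain each word at least once
desc_counts = {w: sum(w in desc for desc in toy_descs) for w in word_list}
# Number of times each word occurs across all descriptions
token_counts = {w: sum(desc.count(w) for desc in toy_descs) for w in word_list}

print(desc_counts)
print(token_counts)
```

Here `he` appears in one description but occurs twice, so the two counts differ.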


**From the manual annotation:**

In [54]:
print(gp_dict)
print("Total:",np.asarray(list(gp_dict.values())).sum())

{'h': 1, 'himself': 21, 'her': 255, 'he': 2017, "he's": 1, 'she': 221, 'his': 1407, 'him': 248}
Total: 4171


#### Compare Automated and Manual Methods

Compare the number of gendered pronouns found per description by the RegEx approach with the number from manual annotation. RegEx is used because its total (4544) is closer to the manual total (4171) than the word list's total (1875):

In [67]:
# Start with a sample - the first description
df.loc[df.desc_id == 0]

Unnamed: 0,desc_id,eadid,field,file,offsets,text,label,category,id,description
0,0,AA5,Biographical / Historical,AA5_00100.ann,"(1032, 1043)",James Whyte,Masculine,Person-Name,0,Professor James Aitken White was a leading Sco...
1,0,AA5,Biographical / Historical,AA5_00100.ann,"(1129, 1177)",chair of practical theology and Christian ethics,Occupation,Contextual,1,Professor James Aitken White was a leading Sco...
2,0,AA5,Biographical / Historical,AA5_00100.ann,"(1217, 1219)",he,Gendered-Pronoun,Linguistic,2,Professor James Aitken White was a leading Sco...
3,0,AA5,Biographical / Historical,AA5_00100.ann,"(1241, 1244)",His,Gendered-Pronoun,Linguistic,3,Professor James Aitken White was a leading Sco...
4,0,AA5,Biographical / Historical,AA5_00100.ann,"(1315, 1317)",he,Gendered-Pronoun,Linguistic,4,Professor James Aitken White was a leading Sco...
5,0,AA5,Biographical / Historical,AA5_00100.ann,"(1350, 1361)",James Whyte,Masculine,Person-Name,5,Professor James Aitken White was a leading Sco...
10,0,AA5,Biographical / Historical,AA5_00100.ann,"(661, 689)",Professor James Aitken White,Masculine,Person-Name,10,Professor James Aitken White was a leading Sco...
11,0,AA5,Biographical / Historical,AA5_00100.ann,"(696, 723)",leading Scottish Theologian,Stereotype,Contextual,11,Professor James Aitken White was a leading Sco...
12,0,AA5,Biographical / Historical,AA5_00100.ann,"(704, 723)",Scottish Theologian,Occupation,Contextual,12,Professor James Aitken White was a leading Sco...
13,0,AA5,Biographical / Historical,AA5_00100.ann,"(728, 787)",Moderator of the General Assembly of the Churc...,Occupation,Contextual,13,Professor James Aitken White was a leading Sco...


In [66]:
df_gp.loc[df_gp.eadid == "AA5"]

Unnamed: 0,desc_id,eadid,field,file,offsets,text,label,category,id,description
2,0,AA5,Biographical / Historical,AA5_00100.ann,"(1217, 1219)",he,Gendered-Pronoun,Linguistic,2,Professor James Aitken White was a leading Sco...
3,0,AA5,Biographical / Historical,AA5_00100.ann,"(1241, 1244)",His,Gendered-Pronoun,Linguistic,3,Professor James Aitken White was a leading Sco...
4,0,AA5,Biographical / Historical,AA5_00100.ann,"(1315, 1317)",he,Gendered-Pronoun,Linguistic,4,Professor James Aitken White was a leading Sco...
14,0,AA5,Biographical / Historical,AA5_00100.ann,"(789, 791)",He,Gendered-Pronoun,Linguistic,14,Professor James Aitken White was a leading Sco...
15,0,AA5,Biographical / Historical,AA5_00100.ann,"(871, 873)",he,Gendered-Pronoun,Linguistic,15,Professor James Aitken White was a leading Sco...
16,0,AA5,Biographical / Historical,AA5_00100.ann,"(913, 916)",his,Gendered-Pronoun,Linguistic,16,Professor James Aitken White was a leading Sco...
17,0,AA5,Biographical / Historical,AA5_00100.ann,"(928, 930)",he,Gendered-Pronoun,Linguistic,17,Professor James Aitken White was a leading Sco...


In [59]:
d = df_gp.loc[df_gp.desc_id == 0].description.unique()[0]
print(d)

Professor James Aitken White was a leading Scottish Theologian and Moderator of the General Assembly of the Church of Scotland. He was educated at Daniel Stewart's College and the University of Edinburgh where he studied philosophy and divinity. After his ordination he spent three years as an army Chaplain and then in 1948 was inducted to Dunollie Road Church in Oban. James Whyte moved to Mayfield North Church in Edinburgh in 1954 and in 1958 was appointed to the chair of practical theology and Christian ethics at the University of St Andrew's where he remained until 1987. His primary interests were in liturgy and ecclesiastical architecture and he also lectured on pastoral care.
James Whyte was called upon to preach at the memorial service for the victims of the Lockerbie disaster on 4th January 1989. The service was relayed around the world and was widely cited in the press having had a great impact. The full text of this sermon was published in Laughter and Tears: Thoughts on Faith 

In [62]:
d.index("H")

128

In [48]:
# Count the manually applied Gendered-Pronoun labels in each description
gp_counts_per_desc = df_gp.groupby("desc_id")["label"].count().reset_index()
gp_counts_per_desc.head()

Unnamed: 0,desc_id,label
0,0,7
1,2,10
2,4,10
3,29,1
4,87,1


In [49]:
desc_ids = list(df_desc.desc_id)
desc_dict = dict(zip(desc_ids,descs_split))
automatic_labels = dict.fromkeys(desc_ids,0)
for desc_id,desc in desc_dict.items():
    matches = utils.countMatches(desc, pattern_list)
    total_matches = int(np.asarray(list(matches.values())).sum())
    automatic_labels[desc_id] = total_matches

In [50]:
manual_desc_ids = set(gp_counts_per_desc.desc_id)  # set for fast membership tests
ids_to_add = [desc_id for desc_id in desc_ids if desc_id not in manual_desc_ids]
label_counts_to_add = [0]*len(ids_to_add)
to_add = pd.DataFrame({"desc_id":ids_to_add, "label":label_counts_to_add})
gp_counts_per_desc = pd.concat([gp_counts_per_desc,to_add])
gp_counts_per_desc.tail()

Unnamed: 0,desc_id,label
10705,11889,0
10706,11890,0
10707,11891,0
10708,11892,0
10709,11893,0


In [51]:
gp_counts_per_desc = gp_counts_per_desc.rename(columns={"label":"manual_label_count"})
gp_counts_per_desc.insert(len(gp_counts_per_desc.columns), "regex_label_count", list(automatic_labels.values()))
gp_counts_per_desc.head()

Unnamed: 0,desc_id,manual_label_count,regex_label_count
0,0,7,7
1,2,10,0
2,4,10,10
3,29,1,0
4,87,1,10
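
Because `gp_counts_per_desc` lists the manually labeled descriptions first while `automatic_labels` follows the order of `df_desc`, the two may not share a row order. The sketch below, with invented values and hypothetical names, shows how a positional assignment can differ from a lookup by `desc_id` using `Series.map`.

```python
import pandas as pd

# Invented counts: rows are not in the same order as the dictionary keys
manual_counts = pd.DataFrame({"desc_id": [0, 2, 1], "manual_label_count": [7, 10, 0]})
regex_counts_by_id = {0: 7, 1: 0, 2: 10}

# Positional assignment pairs each row with the dictionary's insertion order
by_position = list(regex_counts_by_id.values())
# Key-based assignment looks up each row's own desc_id
by_key = manual_counts.desc_id.map(regex_counts_by_id)

manual_counts["regex_by_position"] = by_position
manual_counts["regex_by_key"] = by_key
print(manual_counts)
```

In this toy table the positional column gives `desc_id` 2 a count of 0, while the key-based column gives it 10.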


In [52]:
manual = list(gp_counts_per_desc.manual_label_count)
regex = list(gp_counts_per_desc.regex_label_count)
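# Each value describes the manual count relative to the RegEx count:
# "more" = manual annotation found more pronouns than RegEx, "less" = fewer
comparison_to_manual = []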
for i,manual_count in enumerate(manual):
    if manual_count == regex[i]:
        comparison_to_manual += ["same"]
    elif manual_count > regex[i]:
        comparison_to_manual += ["more"]
    else:
        comparison_to_manual += ["less"]
gp_counts_per_desc.insert(len(gp_counts_per_desc.columns),"regex_compared_to_manual",comparison_to_manual)
gp_counts_per_desc.head()

Unnamed: 0,desc_id,manual_label_count,regex_label_count,regex_compared_to_manual
0,0,7,7,same
1,2,10,0,more
2,4,10,10,same
3,29,1,0,more
4,87,1,10,less


In [53]:
gp_counts_per_desc.regex_compared_to_manual.value_counts()

same    9692
less    1125
more    1077
Name: regex_compared_to_manual, dtype: int64

In [57]:
total = gp_counts_per_desc.regex_compared_to_manual.value_counts().sum()

In [59]:
# Look up counts by label rather than by position, which depends on sort order
comparison_counts = gp_counts_per_desc.regex_compared_to_manual.value_counts()
same_count = comparison_counts["same"]
less_count = comparison_counts["less"]
more_count = comparison_counts["more"]
print("Ratios:")
print(" - same:",str(round((same_count/total)*100))+"%")
print(" - less:",str(round((less_count/total)*100))+"%")
print(" - more:",str(round((more_count/total)*100))+"%")

Ratios:
 - same: 81%
 - less: 9%
 - more: 9%
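
The ratios can also be obtained directly with `value_counts(normalize=True)`, which returns proportions indexed by label. The sketch below uses an invented Series of outcomes, not the notebook's results.

```python
import pandas as pd

# Invented outcomes: eight matches, one "less", one "more"
outcomes = pd.Series(["same"] * 8 + ["less", "more"])

# Proportion of each outcome, indexed by its label
ratios = outcomes.value_counts(normalize=True)
percent_same = round(ratios["same"] * 100)
print(str(percent_same) + "%")
```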


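Based on the counts of gendered pronouns per description, RegEx and manual annotation agree on 81% of descriptions, provided the RegEx counts inserted above line up row by row with the `desc_id` values.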

<a id="gr"></a>
## III. Gendered Roles

First, investigate the text labeled with `Gendered-Role` to decide whether RegEx or part-of-speech (POS) tagging could be suitable for automatically annotating with this label.

In [68]:
filepath = data_dir+"analysis_data/labeled_text_occurrences.csv"

In [73]:
label_name = "Gendered Role"

In [74]:
df_label_occur = pd.read_csv(filepath)
df_gr_occur = df_label_occur.loc[df_label_occur.label == label_name]
df_gr_occur.head()

Unnamed: 0,text,occurrence,label
4463,men,415,Gendered Role
4464,Sir,330,Gendered Role
4465,Mr,269,Gendered Role
4466,Mrs,256,Gendered Role
4467,Lady,155,Gendered Role


In [75]:
text_list = list(df_gr_occur.text)
text_list_lower = [t.lower() for t in text_list]
text_unique = set(text_list_lower)
print(len(text_unique))

231
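
One way a RegEx approach could be tried for this label is to build a single alternation pattern from a list of role terms. The sketch below uses five terms from the occurrence table above and an invented sample sentence; matching whole words case-insensitively is an assumption, not a decision made in this notebook.

```python
import re

# A few terms taken from the occurrence table above
role_terms = ["men", "Sir", "Mr", "Mrs", "Lady"]

# Escape each term and sort longest first so that "Mrs" is tried before "Mr"
alternation = "|".join(re.escape(t) for t in sorted(role_terms, key=len, reverse=True))
# Word boundaries keep "men" from matching inside "women"
role_pattern = re.compile(r"\b(?:" + alternation + r")\b", flags=re.IGNORECASE)

sample = "Letter from Mrs Smith to Sir John about the women and men of the parish."
found = role_pattern.findall(sample)
print(found)
```

Whether such a pattern is adequate would depend on how many of the unique terms are ambiguous out of context, which is where POS tagging might help.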
