# Analysis: Gendered Pronouns and Roles
## Post Annotation and Aggregation

This notebook compares regular expressions (RegEx) and part-of-speech (POS) tagging against manual annotation for labeling gendered pronouns and gendered roles.

***

**Table of Contents**

  [I. Loading](#load)

  [II. Gendered Pronouns](#gp)
  
   * RegEx
   * POS


  [III. Gendered Roles](#gr)
  
   * RegEx?
   * POS?

***

<a id="load"></a>
### I. Loading

In [1]:
import pandas as pd
import re
import numpy as np

Load the data:

In [51]:
file_path = "data/aggregated_with_eadid_descid_cols.csv"

In [52]:
df = pd.read_csv(file_path, index_col=0)
df.head()

| Unnamed: 0 | file | offsets | text | label | category | eadid | field | id | desc_id |
|---|---|---|---|---|---|---|---|---|---|
| 9 | AA5_00100.ann | (1032, 1043) | James Whyte | Masculine | Person-Name | AA5 | Biographical / Historical | 0 | 0 |
| 16 | AA5_00100.ann | (1129, 1177) | chair of practical theology and Christian ethics | Occupation | Contextual | AA5 | Biographical / Historical | 1 | 0 |
| 4 | AA5_00100.ann | (1217, 1219) | he | Gendered-Pronoun | Linguistic | AA5 | Biographical / Historical | 2 | 0 |
| 5 | AA5_00100.ann | (1241, 1244) | His | Gendered-Pronoun | Linguistic | AA5 | Biographical / Historical | 3 | 0 |
| 6 | AA5_00100.ann | (1315, 1317) | he | Gendered-Pronoun | Linguistic | AA5 | Biographical / Historical | 4 | 0 |


In [53]:
print("Rows:",df.shape[0])
print("Columns:",df.shape[1])

Rows: 55260
Columns: 9


In [42]:
desc_path = "data/all_descriptions.csv"
df_desc = pd.read_csv(desc_path, index_col=0)
df_desc.tail()

| Unnamed: 0 | eadid | description | field | desc_id |
|---|---|---|---|---|
| 72618 | Coll-1266 | Honorary degree scroll from Leiden University,... | Title | 11889 |
| 72620 | Coll-1266 | Programmes for honorary degree ceremony at the... | Title | 11890 |
| 68537 | Coll-1266 | Contains honorary degree certificate and citat... | Scope and Contents | 11891 |
| 30630 | Coll-1310 | \n | Title | 11892 |
| 73912 | Coll-146 | TS signed1p. At head of paper: E Gellhorn, MD ... | Scope and Contents | 11893 |


***

**TO FIX:** The description with `desc_id` 11892 contains only a newline character:

In [43]:
df_desc.loc[df_desc.desc_id == 11892].description

30630    \n
Name: description, dtype: object

***

<a id="gp"></a>
### II. Gendered Pronouns

In [66]:
label = "Gendered-Pronoun"

#### Find the total number of `Gendered-Pronoun` labels and the count of unique text spans annotated with the label

In [67]:
df_gp = df.loc[df.label == label]
print("Total gendered pronouns manually labeled:",df_gp.shape[0])

Total gendered pronouns manually labeled: 4171


In [68]:
gp_counts = df_gp.text.value_counts()
text = list(gp_counts.index)
counts = list(gp_counts.values)
gp_counts_dict = dict(zip(text,counts))
print(gp_counts_dict)

{'his': 1261, 'he': 1245, 'He': 772, 'him': 246, 'her': 229, 'His': 146, 'she': 139, 'She': 82, 'Her': 26, 'himself': 20, 'Him': 2, "He's": 1, 'H': 1, 'Himself': 1}


In [69]:
# Collapse case variants so that, e.g., 'He' and 'he' are counted together
gp_values = list(set(df_gp.text))
gp_values_normalized = [gp.lower() for gp in gp_values]
gp_values_unique = list(set(gp_values_normalized))
gp_dict = dict.fromkeys(gp_values_unique)
for key in gp_dict.keys():
    total_count = 0
    # gp_counts_dict is case-sensitive: add the lowercase count...
    if key in gp_counts_dict:
        total_count += gp_counts_dict[key]
    # ...and the count for the capitalized spelling
    if key.capitalize() in gp_counts_dict:
        total_count += gp_counts_dict[key.capitalize()]
    gp_dict[key] = total_count
print(gp_dict)

{'him': 248, 'he': 2017, 'her': 255, "he's": 1, 'she': 221, 'himself': 21, 'his': 1407, 'h': 1}
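The same case-insensitive totals can also be reached by lowercasing each span before counting. The following is a minimal sketch on a made-up list of spans (`sample_spans` is hypothetical, not the annotation data). Unlike the loop above, it would also merge all-caps spellings such as `HE`, which may or may not be wanted.

```python
from collections import Counter

# Hypothetical sample of annotated spans (not the real annotation data).
sample_spans = ["he", "He", "his", "His", "her", "He", "himself"]

# Lowercase every span first, then count, so "He" and "he" share one key.
normalized_counts = Counter(span.lower() for span in sample_spans)
print(dict(normalized_counts))
```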


#### Automatically Identify Gendered Pronouns

First, split each description into words on spaces. Then try three approaches to finding strings that match gendered pronouns.

In [70]:
descs = list(df_desc.description)
descs_split = [d.split(" ") for d in descs]
print(descs_split[:2])

[['Professor', 'James', 'Aitken', 'White', 'was', 'a', 'leading', 'Scottish', 'Theologian', 'and', 'Moderator', 'of', 'the', 'General', 'Assembly', 'of', 'the', 'Church', 'of', 'Scotland.', 'He', 'was', 'educated', 'at', 'Daniel', "Stewart's", 'College', 'and', 'the', 'University', 'of', 'Edinburgh', 'where', 'he', 'studied', 'philosophy', 'and', 'divinity.', 'After', 'his', 'ordination', 'he', 'spent', 'three', 'years', 'as', 'an', 'army', 'Chaplain', 'and', 'then', 'in', '1948', 'was', 'inducted', 'to', 'Dunollie', 'Road', 'Church', 'in', 'Oban.', 'James', 'Whyte', 'moved', 'to', 'Mayfield', 'North', 'Church', 'in', 'Edinburgh', 'in', '1954', 'and', 'in', '1958', 'was', 'appointed', 'to', 'the', 'chair', 'of', 'practical', 'theology', 'and', 'Christian', 'ethics', 'at', 'the', 'University', 'of', 'St', "Andrew's", 'where', 'he', 'remained', 'until', '1987.', 'His', 'primary', 'interests', 'were', 'in', 'liturgy', 'and', 'ecclesiastical', 'architecture', 'and', 'he', 'also', 'lectured
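Splitting on a single space leaves punctuation attached to words (for example `'Scotland.'` in the output above) and does not break on newlines. As a point of comparison, here is a small sketch of a RegEx-based tokenizer. The sample text and the `token_pattern` name are assumptions for illustration; this is not the tokenization used for the counts in this notebook.

```python
import re

# Hypothetical description text, not taken from the dataset.
sample_desc = "He gave it to him.\nShe's here, with her notes"

# A plain split on spaces keeps punctuation and newlines attached to words.
space_split = sample_desc.split(" ")
print(space_split)

# Assumed alternative: keep runs of letters, optionally followed by an
# apostrophe suffix such as 's.
token_pattern = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
tokens = token_pattern.findall(sample_desc)
print(tokens)
```

With this kind of tokenizer, a pronoun followed by a full stop or comma becomes a clean token that an anchored pattern can match.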

**Approach 1:** Define RegEx patterns:

In [71]:
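# Matches her, his, him, himself, herself (the [hx] class also admits x-forms such as xir)
pattern1 = re.compile(r"^[hx][ei][rsm](self)*$")
# Matches he, she, xe, optionally followed by 's
pattern2 = re.compile(r"^s*[xh]e('s)*$")
pattern_list = [pattern1, pattern2]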
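A quick check of what these two patterns accept can make their behavior easier to reason about. The candidate words below are illustrative choices, not drawn from the descriptions. Because both patterns are anchored with `^` and `$`, a token with trailing punctuation such as `him.` is rejected, while the character classes also accept a few non-pronoun strings such as `hem`.

```python
import re

pattern1 = re.compile(r"^[hx][ei][rsm](self)*$")
pattern2 = re.compile(r"^s*[xh]e('s)*$")

# Illustrative candidate words (assumed for this check only).
candidates = ["he", "she", "his", "himself", "he's", "him.", "hem", "the"]

# Record whether each candidate is accepted by either pattern.
accepted = {
    word: bool(pattern1.match(word) or pattern2.match(word))
    for word in candidates
}
for word, is_match in accepted.items():
    print(f"{word!r}: {is_match}")
```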

Count matches with the patterns:

In [72]:
def addMatch(ismatch, matches):
    # Increment the count for the matched string, if there was a match
    if ismatch is not None:
        matched_word = ismatch[0]
        matches[matched_word] = matches.get(matched_word, 0) + 1
    return matches

def checkForMatch(text, matches, patterns):
    # Test the text against every pattern, matching from the start of the text
    for pattern in patterns:
        ismatch = pattern.match(text)
        matches = addMatch(ismatch, matches)
    return matches

def countMatches(descs, patterns):
    matches = dict()
    for desc in descs:
        if isinstance(desc, list):
            for word in desc:
                # Lowercase only the first character, so 'He' becomes 'he'
                word_uncapitalized = word[:1].lower() + word[1:]
                matches = checkForMatch(word_uncapitalized, matches, patterns)
        elif isinstance(desc, str):
            desc = desc.lower()
            matches = checkForMatch(desc, matches, patterns)
        else:
            raise ValueError("Expected a list of words or a string")
    return matches


In [73]:
matches = countMatches(descs_split,pattern_list)
print(matches)
print("Total:",np.asarray(list(matches.values())).sum())

{'he': 2193, 'his': 1520, 'himself': 32, 'her': 298, 'she': 274, 'him': 223, 'herself': 2, "he's": 2}
Total: 4544


**Approach 2:** Match specific words with a single RegEx alternation:

In [74]:
words_pattern = re.compile("his|hers|himself|herself|hm|he's|she's|she|he|him|her")

In [75]:
word_matches = countMatches(descs_split,[words_pattern])
print(word_matches)
print("Total:",np.asarray(list(word_matches.values())).sum())

{'he': 4752, 'his': 1815, 'himself': 38, 'she': 969, 'him': 270, 'hers': 14, "he's": 2}
Total: 7860
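The inflated total here is probably a side effect of how the pattern is applied. `pattern.match` only requires a match at the start of the word, and an alternation tries its branches from left to right, so `her` is reported as `he`, `herself` as `hers`, and unrelated words such as `help` also count as `he`. The sketch below contrasts `match` with `fullmatch` on a few illustrative words (the word list is an assumption for demonstration).

```python
import re

words_pattern = re.compile("his|hers|himself|herself|hm|he's|she's|she|he|him|her")

# Compare prefix matching (match) with whole-word matching (fullmatch).
results = {}
for word in ["her", "help", "sheep", "herself"]:
    prefix = words_pattern.match(word)
    whole = words_pattern.fullmatch(word)
    results[word] = (
        prefix[0] if prefix else None,
        whole[0] if whole else None,
    )
    print(word, "-> match:", results[word][0], "| fullmatch:", results[word][1])
```

Using `fullmatch`, or anchoring the alternation with `^(...)$`, would restrict the count to whole words.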


**Approach 3:** Try a word list:

In [76]:
gendered_pronouns = ["his","hers","himself","herself","hm","he's","she's","she","he","him","her"]
list_matches = dict.fromkeys(gendered_pronouns, 0)
for desc in descs_split:
    desc_lowercased = [word.lower() for word in desc]
    for gp in gendered_pronouns:
        if gp in desc_lowercased:
            list_matches[gp] += 1
print(list_matches)
print("Total:",np.asarray(list(list_matches.values())).sum())

{'his': 821, 'hers': 0, 'himself': 31, 'herself': 2, 'hm': 3, "he's": 2, "she's": 0, 'she': 92, 'he': 560, 'him': 171, 'her': 193}
Total: 1875
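Note that the loop above adds one per description that contains a pronoun, so its totals count descriptions rather than occurrences, which would explain why they fall well below the other approaches. A minimal sketch of the difference, using two made-up tokenized descriptions (`sample_descs` is hypothetical):

```python
from collections import Counter

gendered_pronouns = {"his", "hers", "himself", "herself", "hm", "he's",
                     "she's", "she", "he", "him", "her"}

# Hypothetical tokenized descriptions, not taken from the dataset.
sample_descs = [
    ["He", "sold", "his", "farm", "and", "his", "mill"],
    ["She", "kept", "her", "notes"],
]

presence = Counter()     # number of descriptions containing each pronoun
occurrences = Counter()  # number of times each pronoun appears overall
for desc in sample_descs:
    lowered = [word.lower() for word in desc]
    presence.update(set(lowered) & gendered_pronouns)
    occurrences.update(word for word in lowered if word in gendered_pronouns)

print("Presence total:", sum(presence.values()))
print("Occurrence total:", sum(occurrences.values()))
```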


**The correct label counts according to the manual annotation:**

In [77]:
print(gp_dict)
print("Total:",np.asarray(list(gp_dict.values())).sum())

{'him': 248, 'he': 2017, 'her': 255, "he's": 1, 'she': 221, 'himself': 21, 'his': 1407, 'h': 1}
Total: 4171


#### Compare Automated and Manual Methods

For each description, compare the number of gendered pronouns found by the first RegEx approach with the number found by manual annotation. The first approach appears to be the best automated one: its total of 4,544 is the closest to the manual total of 4,171.

In [31]:
df_gp.head()

| Unnamed: 0 | file | offsets | text | label | category | eadid | associated_genders |
|---|---|---|---|---|---|---|---|
| 39447 | AA5_00100.ann | (789, 791) | He | Gendered-Pronoun | Linguistic | AA5 | Masculine |
| 39448 | AA5_00100.ann | (871, 873) | he | Gendered-Pronoun | Linguistic | AA5 | Masculine |
| 39449 | AA5_00100.ann | (913, 916) | his | Gendered-Pronoun | Linguistic | AA5 | Masculine |
| 39450 | AA5_00100.ann | (928, 930) | he | Gendered-Pronoun | Linguistic | AA5 | Masculine |
| 39451 | AA5_00100.ann | (1217, 1219) | he | Gendered-Pronoun | Linguistic | AA5 | Masculine |
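One possible shape for the per-description comparison is sketched below in plain Python. The inputs (`descriptions`, `manual_label_desc_ids`) are made-up stand-ins keyed by a description ID, and the helper `count_regex_matches` is a hypothetical name. It lowercases whole tokens, a simplification of the first-character lowercasing used earlier.

```python
import re
from collections import Counter

patterns = [
    re.compile(r"^[hx][ei][rsm](self)*$"),
    re.compile(r"^s*[xh]e('s)*$"),
]

# Hypothetical inputs: description ID -> text, and one ID per manual label.
descriptions = {0: "He sold his farm", 1: "She wrote to him."}
manual_label_desc_ids = [0, 0, 1, 1]

def count_regex_matches(text):
    """Count space-separated tokens accepted by any of the patterns."""
    tokens = [word.lower() for word in text.split(" ")]
    return sum(1 for tok in tokens if any(p.match(tok) for p in patterns))

manual_counts = Counter(manual_label_desc_ids)
comparison = {
    desc_id: {
        "regex": count_regex_matches(text),
        "manual": manual_counts[desc_id],
    }
    for desc_id, text in descriptions.items()
}
for desc_id, row in comparison.items():
    print(desc_id, row, "difference:", row["regex"] - row["manual"])
```

In this toy input the second description shows a negative difference because `him.` carries a trailing full stop and is rejected by the anchored patterns.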
