# Analysis: Commonly Annotated Text 
## Post Annotation and Aggregation

***

**Table of Contents**

[0.](#0) Loading

[1.](#1) Common text and metadata field annotated per label
    
  * [1.1](#Omission) Omission
  * [1.2](#Stereotype) Stereotype
  * [1.3](#Generalization) Generalization
  * [1.4](#GR) Gendered Role
  
[2.](#2) Save the data to CSVs

***

<a id="0"></a>
### 0. Loading
Begin by loading the Python libraries and the dataset to be analyzed.

In [89]:
import pandas as pd

In [90]:
dir_path = "data/"
data_files = ["aggregated_final.csv", "aggregated_with_annotator_eadid_note_cols.csv", 
              "aggregated_with_eadid_descid_cols.csv", "all_escriptions.csv"]
analysis_path = "analysis_data/"

In [91]:
df = pd.read_csv(dir_path+data_files[2], index_col=0)
df.head()

Unnamed: 0,file,offsets,text,label,category,eadid,field,id,desc_id
9,AA5_00100.ann,"(1032, 1043)",James Whyte,Masculine,Person-Name,AA5,Biographical / Historical,0,0
16,AA5_00100.ann,"(1129, 1177)",chair of practical theology and Christian ethics,Occupation,Contextual,AA5,Biographical / Historical,1,0
4,AA5_00100.ann,"(1217, 1219)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical,2,0
5,AA5_00100.ann,"(1241, 1244)",His,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical,3,0
6,AA5_00100.ann,"(1315, 1317)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical,4,0


In [92]:
print("Rows:",df.shape[0], "\nColumns:",df.shape[1])

Rows: 55260 
Columns: 9


<a id="1"></a>
### 1. Common Text Annotated Per Label

Identify the most frequently annotated text spans for each label, along with the type of description (metadata field) in which each label was applied.

In [93]:
metadata_field_names = ["Scope and Contents", "Biographical / Historical", 
                        "Title", "Processing Information"]

def getFieldRatios(df_field_values):
    # Divide each field's count by the total so that the ratios sum to 1
    total = sum(df_field_values)
    return [v / total for v in df_field_values]

def getValueCountsDataFrame(label_series, label_name):
    # Convert value_counts() output to a table with text, occurrence and label columns;
    # rename_axis/reset_index(name=...) avoids relying on pandas' default column names
    df = label_series.rename_axis("text").reset_index(name="occurrence")
    df["label"] = label_name
    return df
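
As a quick illustration of the table that `getValueCountsDataFrame` is meant to produce, the sketch below runs the same count-and-tidy step on a small made-up DataFrame. The toy rows and the names `toy`, `counts` and `tidy` are hypothetical, and the column naming is done with `rename_axis` and `reset_index(name=...)`.

```python
import pandas as pd

# Hypothetical toy annotations, not taken from the dataset
toy = pd.DataFrame({"text": ["man", "men", "man", "a woman", "man"]})

# Count how often each annotated text span occurs (most frequent first)
counts = toy["text"].value_counts()

# Tidy the counts into text / occurrence / label columns
tidy = counts.rename_axis("text").reset_index(name="occurrence")
tidy["label"] = "Stereotype"

print(tidy)
```

Each row of `tidy` then holds one distinct text span, how many times it was annotated, and the label it was annotated with.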

<a id="Omission"></a>
#### 1.1 Omission

In [94]:
df_om = df.loc[df.label == "Omission"]
df_om.shape

(7586, 9)

In [95]:
om_text = df_om.text.value_counts(sort=True, ascending=False)
df_om_text = getValueCountsDataFrame(om_text, "Omission")
df_om_text.head()

Unnamed: 0,text,occurrence,label
0,Thomson,981,Omission
1,Ledermann,351,Omission
2,a man,286,Omission
3,Beale,220,Omission
4,Beatty,146,Omission


In [96]:
df_om_field = df_om.field.value_counts()
df_om_field.sort_index(inplace=True)
df_om_field_values = df_om_field.values
df_om_field

Biographical / Historical     631
Processing Information          3
Scope and Contents           5480
Title                        1472
Name: field, dtype: int64

In [97]:
om_ratios = getFieldRatios(df_om_field_values)
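
To make the ratio step concrete, the sketch below recomputes the Omission ratios from the four field counts printed above by dividing each count by their total. The names `om_counts` and `om_share` are illustrative, and the vectorized division is an alternative to the loop in `getFieldRatios`, not the notebook's own code.

```python
import pandas as pd

# Field counts copied from the Omission output above
om_counts = pd.Series(
    [631, 3, 5480, 1472],
    index=["Biographical / Historical", "Processing Information",
           "Scope and Contents", "Title"],
)

# Each ratio is a field's count divided by the total number of Omission annotations
om_share = om_counts / om_counts.sum()

print(om_counts.sum())  # 7586, the number of rows in df_om
print(om_share.round(6))
```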

<a id="Stereotype"></a>
#### 1.2 Stereotype

In [98]:
df_st = df.loc[df.label == "Stereotype"]
df_st.shape

(2648, 9)

In [99]:
st_text = df_st.text.value_counts(sort=True, ascending=False)
df_st_text = getValueCountsDataFrame(st_text, "Stereotype")
df_st_text.head()

Unnamed: 0,text,occurrence,label
0,man,429,Stereotype
1,men,342,Stereotype
2,a man,223,Stereotype
3,a woman,108,Stereotype
4,woman,67,Stereotype


In [100]:
df_counts = pd.concat([df_om_text,df_st_text])

In [101]:
df_st_field = df_st.field.value_counts()
df_st_field.sort_index(inplace=True)
df_st_field_values = list(df_st_field.values)
df_st_field

Biographical / Historical     482
Scope and Contents           1745
Title                         421
Name: field, dtype: int64

In [102]:
df_st_field_values.insert(1,0)
df_st_field_values # Looks good

[482, 0, 1745, 421]
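
Inserting the zero by position works here because "Processing Information" is the only missing field and the fields are sorted alphabetically. A less position-dependent option, sketched below under the assumption that the four field names listed are the complete set, is to reindex the counts against the full list of fields. The names `all_fields`, `st_counts` and `st_counts_full` are illustrative.

```python
import pandas as pd

# The four metadata fields in the alphabetical order used by the other labels
all_fields = ["Biographical / Historical", "Processing Information",
              "Scope and Contents", "Title"]

# Stereotype counts as printed above; "Processing Information" is absent
st_counts = pd.Series(
    [482, 1745, 421],
    index=["Biographical / Historical", "Scope and Contents", "Title"],
)

# reindex adds any missing field and fills its count with 0
st_counts_full = st_counts.reindex(all_fields, fill_value=0)

print(st_counts_full.tolist())  # [482, 0, 1745, 421]
```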

In [103]:
st_ratios = getFieldRatios(df_st_field_values)

<a id="Generalization"></a>
#### 1.3 Generalization

In [104]:
df_ge = df.loc[df.label == "Generalization"]
df_ge.shape

(2061, 9)

In [105]:
ge_text = df_ge.text.value_counts(sort=True, ascending=False)
df_ge_text = getValueCountsDataFrame(ge_text, "Generalization")
df_ge_text.tail()

Unnamed: 0,text,occurrence,label
475,B.C.L.,1,Generalization
476,statesmen,1,Generalization
477,Eric Reeve,1,Generalization
478,Isaac Forsyth,1,Generalization
479,Sealwoman,1,Generalization


In [106]:
df_counts = pd.concat([df_counts,df_ge_text])

In [107]:
df_ge_field = df_ge.field.value_counts()
df_ge_field.sort_index(inplace=True)
df_ge_field_values = df_ge_field.values
df_ge_field

Biographical / Historical     402
Processing Information          4
Scope and Contents           1193
Title                         462
Name: field, dtype: int64

In [108]:
ge_ratios = getFieldRatios(df_ge_field_values)

<a id="GR"></a>
#### 1.4 Gendered Role

In [109]:
df_gr = df.loc[df.label == "Gendered-Role"]
df_gr.shape

(3590, 9)

In [110]:
gr_text = df_gr.text.value_counts(sort=True, ascending=False)
df_gr_text = getValueCountsDataFrame(gr_text, "Gendered Role")
df_gr_text.tail()

Unnamed: 0,text,occurrence,label
286,boy,1,Gendered Role
287,master,1,Gendered Role
288,Femme,1,Gendered Role
289,Bros,1,Gendered Role
290,granddaughter,1,Gendered Role


In [111]:
df_counts = pd.concat([df_counts,df_gr_text])
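
The running `df_counts` table is built up by successive `pd.concat` calls, one per label. A possible alternative, sketched below with two small stand-in frames that reuse the top Omission and Stereotype rows shown earlier, is to collect the per-label tables in a list and concatenate once. The names `parts` and `combined` are hypothetical.

```python
import pandas as pd

# Stand-ins for the per-label tables (df_om_text, df_st_text, ...)
parts = [
    pd.DataFrame({"text": ["Thomson"], "occurrence": [981], "label": ["Omission"]}),
    pd.DataFrame({"text": ["man"], "occurrence": [429], "label": ["Stereotype"]}),
]

# One concat call; ignore_index gives the combined table a fresh 0..n-1 index
combined = pd.concat(parts, ignore_index=True)

print(combined)
```

Without `ignore_index=True` each part keeps its own index starting at 0, which is harmless when the table is later saved with `index=False` but can be confusing when selecting rows by index.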

In [112]:
df_gr_field = df_gr.field.value_counts()
df_gr_field.sort_index(inplace=True)
df_gr_field_values = df_gr_field.values
df_gr_field

Biographical / Historical     774
Processing Information          1
Scope and Contents           2298
Title                         517
Name: field, dtype: int64

In [113]:
gr_ratios = getFieldRatios(df_gr_field_values)
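
The four subsections repeat the same filter, count and divide steps. As a sketch of how the per-field counts and ratios could be produced for every label in one pass, the example below applies `pd.crosstab` to a small made-up frame with the same `label` and `field` columns as `df`. The toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy annotations with the same two columns as df
toy = pd.DataFrame({
    "label": ["Omission", "Omission", "Omission", "Stereotype", "Stereotype"],
    "field": ["Title", "Scope and Contents", "Scope and Contents",
              "Title", "Biographical / Historical"],
})

# Rows are labels, columns are metadata fields; missing combinations become 0
counts = pd.crosstab(toy["label"], toy["field"])

# normalize="index" divides each row by its total, giving per-label ratios
ratios = pd.crosstab(toy["label"], toy["field"], normalize="index")

print(counts)
print(ratios.round(3))
```

Because `crosstab` fills absent label and field combinations with 0, no manual zero needs to be inserted for a label that never occurs in a given field.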

<a id="2"></a>
### 2. Save the data to CSVs

In [114]:
df_counts.to_csv(analysis_path+"labeled_text_occurrences.csv", index=False)

In [115]:
df = pd.DataFrame({"Omission Count": df_om_field.values, "Omission Ratio": om_ratios,
                   "Stereotype Count": df_st_field_values, "Stereotype Ratio": st_ratios,
                   "Generalization Count": df_ge_field.values, "Generalization Ratio": ge_ratios,
                   "Gendered Role Count": df_gr_field.values, "Gendered Role Ratio": gr_ratios},
                  index=["Biographical / Historical", "Processing Information",
                         "Scope and Contents", "Title"])
df.T

Unnamed: 0,Biographical / Historical,Processing Information,Scope and Contents,Title
Omission Count,631.0,3.0,5480.0,1472.0
Omission Ratio,0.08318,0.000395,0.722383,0.194042
Stereotype Count,482.0,0.0,1745.0,421.0
Stereotype Ratio,0.182024,0.0,0.658988,0.158988
Generalization Count,402.0,4.0,1193.0,462.0
Generalization Ratio,0.195051,0.001941,0.578845,0.224163
Gendered Role Count,774.0,1.0,2298.0,517.0
Gendered Role Ratio,0.215599,0.000279,0.640111,0.144011


In [116]:
df = df.T
df.to_csv(analysis_path+"labels_per_metadata_field.csv")
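
Since this table is saved with its index, the row labels end up as the first CSV column. The sketch below, which writes a small stand-in table to a temporary directory rather than to `analysis_data/`, shows one way to read such a file back with the row labels restored as the index. The names `summary`, `out_path` and `restored` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the transposed summary table (values from the Title column above)
summary = pd.DataFrame(
    {"Title": [1472.0, 0.194042]},
    index=["Omission Count", "Omission Ratio"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "labels_per_metadata_field.csv")
    summary.to_csv(out_path)                       # row labels are written as the first column
    restored = pd.read_csv(out_path, index_col=0)  # read that column back as the index

print(restored)
```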