# Analysis: Language of Material of Annotations  
## Post Annotation and Aggregation

***

**Table of Contents**

[1. Loading](#1)

[2. Language of Material of Annotations](#2)

  * [2.1 Omission](#2.1)
  * [2.2 Stereotype](#2.2)
  * [2.3 Generalization](#2.3)

***

<a id="1"></a>
## 1. Loading
First, load the Python libraries and the dataset to be analyzed.

In [1]:
import pandas as pd
import numpy as np
import string, csv, re, os, sys #,json

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag
from nltk.text import Text
from nltk.probability import FreqDist
from collections import Counter
from wordcloud import WordCloud

%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
dir_path = "data/"
data_files = ["aggregated_final.csv", "aggregated_with_annotator_eadid_note_cols.csv", 
              "aggregated_with_eadid_descid_desc_cols.csv", "descriptions.csv"]

In [4]:
df = pd.read_csv(dir_path+data_files[2], index_col=0)
df.head()

Unnamed: 0,file,offsets,text,label,category,eadid,description,field,id,desc_id
9,AA5_00100.ann,"(1032, 1043)",James Whyte,Masculine,Person-Name,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,0,0
16,AA5_00100.ann,"(1129, 1177)",chair of practical theology and Christian ethics,Occupation,Contextual,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,1,0
4,AA5_00100.ann,"(1217, 1219)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,2,0
5,AA5_00100.ann,"(1241, 1244)",His,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,3,0
6,AA5_00100.ann,"(1315, 1317)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,4,0


In [6]:
print("Rows:",df.shape[0], "\nColumns:",df.shape[1])

Rows: 55260 
Columns: 10


<a id="2"></a>
## 2. Language of Material of Annotations

First, count how often each language of material occurs across the collections in the entire dataset. Then count how often each language occurs for annotations with the `Omission`, `Stereotype`, and `Generalization` labels.

In [98]:
meta_df = pd.read_csv("data/CRC_units-grouped-by-fonds_clean.csv", index_col=0)
meta_df.head(3)

Unnamed: 0,eadid,unit_title,unit_identifier,unique_language,unique_date,unique_geography
0,Coll-1064,"['Papers of Professor Walter Ledermann', '1 (3...","['Coll-1064', 'Coll-1064/1', 'Coll-1064/2', 'C...",['English'],"['1937-01-01 - 1954-12-31', '1937-02-02 - 1938...","['Edinburgh (Scotland)', 'St Andrews (Scotland..."
1,Coll-31,['Drawings from the Office of Sir Rowand Ander...,"['Coll-31', 'Coll-31/1', 'Coll-31/1/1', 'Coll-...",['English'],"['1814-01-01 - 1924-12-31', '1874-01-01 - 1905...",['']
2,Coll-51,['Papers of Sir Roderick Impey Murchison and h...,"['Coll-51', 'Coll-51/1', 'Coll-51/2', 'Coll-51...",['English'],"['1771-01-01 - 1935-12-31', '1723-01-01 - 1935...","['Calcutta (India)', 'Europe', 'Tarradale (Sco..."


In [99]:
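languages = list(meta_df.unique_language)
lang_counts = dict()
for language in languages:
    # Skip missing values (NaN); each remaining cell is a list stored as a string
    if isinstance(language, str):
        # Strip the list punctuation, e.g. "['English', 'Latin']" -> "English, Latin"
        l = language.replace("'","")
        l = l.replace("[","")
        l = l.replace("]","")
        l_list = l.split(", ")
        for lang in l_list:
            # Count the variant "restEnglish" as "English"
            if lang == "restEnglish":
                lang = "English"
            # .get() avoids a KeyError when a language is seen for the first time
            lang_counts[lang] = lang_counts.get(lang, 0) + 1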

print(lang_counts)

{'English': 1018, 'Russian': 9, 'Latin': 49, 'French': 32, 'Mostly English': 4, '': 16, 'Gaelic or Scottish Gaelic': 1, 'Multiple languages': 15, 'German': 29, 'Kikuyu or Gikuyu': 2, 'Kikuyu': 2, 'Gikuyu': 2, 'Spanish or Castilian': 2, 'Spanish': 7, 'Castilian': 2, 'Fulah': 1, 'Scots': 9, 'Polish': 6, 'Italian': 13, 'Scots Dialect': 1, 'Scottish Gaelic': 2, 'Czech': 2, 'Irish Gaelic': 2, 'Scots dialect': 1, 'Russin': 1, 'English Gaelic': 1, 'Gaelic Scottish Gaelic': 1, 'Swahili': 1, 'Arabic': 6, 'Bemba': 1, 'Egyptian (Ancient)': 1, 'Burmese': 4, 'Greek': 2, 'Hebrew': 5, 'Norman-French/French': 1, 'Old': 1, 'Chinese': 3, 'Afrikaans': 1, 'Norwegian': 2, 'Dutch': 4, 'Flemish': 1, 'France': 2, 'Efik': 1, 'Hindi': 1, 'Persian': 2, 'English mostly': 2, 'Korean': 1, 'Swedish': 5, 'Mainly English': 1, 'Japanese': 2, 'Hungarian': 3, 'Batak languages': 1, 'Portugese': 1, 'Czechoslavakian': 1, 'Japanese English translation': 1, 'Slavic languages': 1, 'Icelandic': 1, 'Bulgarian': 1, 'Romanian': 1,
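
The string stripping above works because each `unique_language` cell is a Python list stored as text. As a minimal sketch of an alternative, assuming every non-missing cell is a valid list literal, `ast.literal_eval` from the standard library can parse the cell directly. The sample cells and the names `cells` and `parsed_counts` are illustrative and not part of the notebook.

```python
import ast
from collections import Counter

# Illustrative cells in the same format as meta_df.unique_language
cells = ["['English']", "['English', 'Russian', 'Latin', 'French']", float("nan")]

parsed_counts = Counter()
for cell in cells:
    # Missing values are floats (NaN), so only strings are parsed
    if isinstance(cell, str):
        # literal_eval turns the text "['English']" into the list ['English']
        parsed_counts.update(ast.literal_eval(cell))

print(parsed_counts)
```

Unlike splitting on `", "`, this keeps a language name that itself contains a comma in one piece.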

In [100]:
lang_counts_df = pd.DataFrame.from_dict(lang_counts, orient="index").reset_index()
lang_counts_df.columns = ["language", "collection_count"]
lang_counts_df = lang_counts_df.sort_values(by="collection_count", ascending=False)
# lang_counts_df.head()
lang_counts_df.to_csv("analysis_data/language_material_collection_counts.csv")

In [111]:
def getSubDf(data_df, label):
    return data_df.loc[data_df.label == label]

In [109]:
data_df = pd.read_csv(dir_path+data_files[2], index_col=0)  # aggregated_with_eadid_descid_desc_cols.csv
data_df.head()

Unnamed: 0,file,offsets,text,label,category,eadid,description,field,id,desc_id
9,AA5_00100.ann,"(1032, 1043)",James Whyte,Masculine,Person-Name,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,0,0
16,AA5_00100.ann,"(1129, 1177)",chair of practical theology and Christian ethics,Occupation,Contextual,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,1,0
4,AA5_00100.ann,"(1217, 1219)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,2,0
5,AA5_00100.ann,"(1241, 1244)",His,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,3,0
6,AA5_00100.ann,"(1315, 1317)",he,Gendered-Pronoun,Linguistic,AA5,Biographical / Historical:\nProfessor James Ai...,Biographical / Historical,4,0


<a id="2.1"></a>
### 2.1 Omission

In [112]:
df = getSubDf(data_df, "Omission")
eadids = list(df.eadid)

In [113]:
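languages = []
for eadid in eadids:
    # Look up the languages of material for this annotation's collection
    cell = meta_df.loc[meta_df.eadid == eadid].unique_language.values
    for l in cell:
        # Strip the list punctuation from the stored string
        l = l.replace("'","")
        l = l.replace("[","")
        l = l.replace("]","")
        l_list = l.split(", ")
        for language in l_list:
            # One entry per annotation, so languages repeat across annotations
            languages.append(language)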

print(len(languages))
print(languages[:10])

19304
['English', 'Greek', 'Hebrew', 'German', 'English', 'Greek', 'Hebrew', 'German', 'English', 'Greek']
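
The per-annotation lookup above is repeated for each label in the following sections. A minimal sketch of one way to factor it into a function, run here on small made-up frames. The function name `label_language_counts` and the toy rows are hypothetical, and the sketch assumes each `eadid` appears exactly once in the metadata with a non-missing `unique_language`.

```python
import ast
from collections import Counter
import pandas as pd

def label_language_counts(data_df, meta_df, label):
    """Count languages of material over all annotations with the given label."""
    counts = Counter()
    # Map each collection identifier to its parsed list of languages
    lang_by_eadid = {
        eadid: ast.literal_eval(cell)
        for eadid, cell in zip(meta_df.eadid, meta_df.unique_language)
    }
    # One update per annotation, so a collection's languages are counted
    # once for every annotation it contains
    for eadid in data_df.loc[data_df.label == label, "eadid"]:
        counts.update(lang_by_eadid[eadid])
    return counts

# Toy data (hypothetical), shaped like the notebook's data_df and meta_df
toy_meta = pd.DataFrame({
    "eadid": ["A", "B"],
    "unique_language": ["['English', 'Greek']", "['English']"],
})
toy_data = pd.DataFrame({
    "eadid": ["A", "A", "B", "B"],
    "label": ["Omission", "Omission", "Omission", "Stereotype"],
})

omission_counts = label_language_counts(toy_data, toy_meta, "Omission")
print(omission_counts)
```

Building the lookup dictionary once avoids filtering `meta_df` again for every annotation.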


In [114]:
lang_counts = Counter(languages)

In [115]:
lang_counts_df = pd.DataFrame.from_dict(lang_counts, orient="index").reset_index()
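# Counts here are per annotation: a collection's languages are counted once for each of its Omission annotations
lang_counts_df.columns = ["language", "collection_count"]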
lang_counts_df = lang_counts_df.sort_values(by="collection_count", ascending=False)
lang_counts_df.head()

Unnamed: 0,language,collection_count
0,English,6409
4,French,2276
3,German,1892
6,Latin,1352
24,Polish,1100


In [116]:
lang_counts_df.to_csv("analysis_data/language_material_collection_counts_omission.csv")

<a id="2.2"></a>
### 2.2 Stereotype

In [117]:
df = getSubDf(data_df, "Stereotype")
eadids = list(df.eadid)

In [118]:
# Collect the languages of material for every Stereotype annotation, as in 2.1
languages = []
for eadid in eadids:
    cell = meta_df.loc[meta_df.eadid == eadid].unique_language.values
    for l in cell:
        l = l.replace("'","")
        l = l.replace("[","")
        l = l.replace("]","")
        l_list = l.split(", ")
        for language in l_list:
            languages += [language]

print(len(languages))
print(languages[:10])

3322
['English', 'Greek', 'Hebrew', 'German', 'English', 'Greek', 'Hebrew', 'German', 'English', 'Greek']


In [119]:
lang_counts = Counter(languages)

In [120]:
lang_counts_df = pd.DataFrame.from_dict(lang_counts, orient="index").reset_index()
lang_counts_df.columns = ["language", "collection_count"]
lang_counts_df = lang_counts_df.sort_values(by="collection_count", ascending=False)
lang_counts_df.head()

Unnamed: 0,language,collection_count
0,English,2261
3,German,178
13,French,172
2,Hebrew,83
1,Greek,79


In [121]:
lang_counts_df.to_csv("analysis_data/language_material_collection_counts_stereotype.csv")

<a id="2.3"></a>
### 2.3 Generalization

In [122]:
df = getSubDf(data_df, "Generalization")
eadids = list(df.eadid)

In [123]:
# Collect the languages of material for every Generalization annotation, as in 2.1
languages = []
for eadid in eadids:
    cell = meta_df.loc[meta_df.eadid == eadid].unique_language.values
    for l in cell:
        l = l.replace("'","")
        l = l.replace("[","")
        l = l.replace("]","")
        l_list = l.split(", ")
        for language in l_list:
            languages += [language]

print(len(languages))
print(languages[:10])

3476
['English', 'Greek', 'Hebrew', 'German', 'English', 'Greek', 'Hebrew', 'German', 'English', 'Greek']


In [124]:
lang_counts = Counter(languages)

In [125]:
lang_counts_df = pd.DataFrame.from_dict(lang_counts, orient="index").reset_index()
lang_counts_df.columns = ["language", "collection_count"]
lang_counts_df = lang_counts_df.sort_values(by="collection_count", ascending=False)
lang_counts_df.head()

Unnamed: 0,language,collection_count
0,English,1677
13,French,295
3,German,210
6,Italian,158
5,Latin,156


In [126]:
lang_counts_df.to_csv("analysis_data/language_material_collection_counts_generalization.csv")
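
Because the three labels have very different totals (19304, 3322, and 3476 language entries), raw counts are hard to compare across labels. A minimal sketch of converting the English counts reported above into a share of each label's total. The dictionary names are illustrative, and the figures are copied from the outputs above.

```python
# Total language entries per label, from the len(languages) outputs above
totals = {"Omission": 19304, "Stereotype": 3322, "Generalization": 3476}
# English counts per label, from the head() outputs above
english = {"Omission": 6409, "Stereotype": 2261, "Generalization": 1677}

# Share of each label's language entries that are English
english_share = {label: english[label] / totals[label] for label in totals}

for label, share in english_share.items():
    print(f"{label}: {share:.1%}")
```

The same division applies to any other language in the per-label tables.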

In [41]:
meta_df.head()

Unnamed: 0,eadid,unit_title,unit_identifier,unique_language,unique_date,unique_geography
0,Coll-1064,"['Papers of Professor Walter Ledermann', '1 (3...","['Coll-1064', 'Coll-1064/1', 'Coll-1064/2', 'C...",['English'],"['1937-01-01 - 1954-12-31', '1937-02-02 - 1938...","['Edinburgh (Scotland)', 'St Andrews (Scotland..."
1,Coll-31,['Drawings from the Office of Sir Rowand Ander...,"['Coll-31', 'Coll-31/1', 'Coll-31/1/1', 'Coll-...",['English'],"['1814-01-01 - 1924-12-31', '1874-01-01 - 1905...",['']
2,Coll-51,['Papers of Sir Roderick Impey Murchison and h...,"['Coll-51', 'Coll-51/1', 'Coll-51/2', 'Coll-51...",['English'],"['1771-01-01 - 1935-12-31', '1723-01-01 - 1935...","['Calcutta (India)', 'Europe', 'Tarradale (Sco..."
3,Coll-204,"['Lecture Notes of John Robison', 'Introductio...","['Coll-204', 'Coll-204/1', 'Coll-204/2', 'Coll...","['English', 'Russian', 'Latin', 'French']","['1779-01-01 - 1801-12-31', '1779-01-01 - 1801...","['Edinburgh (Scotland)', 'Stirlingshire Scotla..."
4,Coll-206,['Records of the Wernerian Natural History Soc...,"['Coll-206', 'Coll-206/1', 'Coll-206/1/1', 'Co...",['English'],"['1808-01-01 - 1858-12-31', '1808-01-12 - 1858...","['Edinburgh (Scotland)', 'Freiburg im Breisgau..."


In [42]:
data_df.head()

Unnamed: 0,file,offsets,text,label,category,eadid,annot_id,desc_id
0,AA5_00100.ann,"(789, 791)",He,Gendered-Pronoun,Linguistic,AA5,0,5223
1,AA5_00100.ann,"(871, 873)",he,Gendered-Pronoun,Linguistic,AA5,1,5223
2,AA5_00100.ann,"(913, 916)",his,Gendered-Pronoun,Linguistic,AA5,2,5223
3,AA5_00100.ann,"(928, 930)",he,Gendered-Pronoun,Linguistic,AA5,3,5223
4,AA5_00100.ann,"(1217, 1219)",he,Gendered-Pronoun,Linguistic,AA5,4,5223
