# Identifying People in CDCS Metadata Descriptions
Run named entity recognition (NER) on the metadata descriptions extracted from the CDCS' online archival catalog.

The code in this Jupyter Notebook is part of a PhD project to create a gold standard dataset labeled for gender biased language, on which a classifier can be trained to identify gender bias in archival metadata descriptions.  

This project is focused on the English language and archival institutions in the United Kingdom.

* Author: Lucy Havens
* Date: November 2020 - February 2021
* Project: PhD Case Study 1
* Data Provider: [ArchivesSpace](https://archives.collections.ed.ac.uk/), Centre for Research Collections, University of Edinburgh

***

**Table of Contents**

  [I. Corpus Statistics](#corpus-stats)

  [II. Named Entity Recognition with SpaCy](#ner)
  
  [III. Checking for Duplicate Descritions](#check-dups)
  
***

In [1]:
import pandas as pd

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag
import string
import csv
import re

import spacy
from spacy import displacy
from collections import Counter
try:
    import en_core_web_sm
except ImportError:
    print("Downlading en_core_web_sm model")
    import sys
    !{sys.executable} -m spacy download en_core_web_sm
else:
    print("Already have en_core_web_sm")
import en_core_web_sm
nlp = en_core_web_sm.load()

Already have en_core_web_sm


<a id="corpus-stats"></a>
## I. Corpus Statisctics

In [2]:
descs = PlaintextCorpusReader("../AnnotationData/descriptions_by_fonds_split_with_ann/descriptions_by_fonds_split_with_ann/", ".+\.txt")

In [3]:
tokens = descs.words()
print(tokens[0:20])

['Identifier', ':', 'AA4', 'Title', ':', 'Papers', 'of', 'Rev', 'Prof', 'John', 'McIntyre', '(', '1916', '-', '2005', ')', 'Scope', 'and', 'Contents', ':']


In [4]:
sentences = descs.sents()
print(sentences[0:5])

[['Identifier', ':', 'AA4'], ['Title', ':', 'Papers', 'of', 'Rev', 'Prof', 'John', 'McIntyre', '(', '1916', '-', '2005', ')'], ['Scope', 'and', 'Contents', ':', 'Sermons', 'and', 'addresses', ',', '1940', '-', '1997', ';', 'personal', 'papers', ',', '1932', '-', '2006', ';', 'lectures', ',', '1946', '-', '1990', ';', 'class', 'notes', 'and', 'lecture', 'notes', ',', '1932', '-', '1941', ';', 'correspondence', ',', '1979', '-', '1983', ';', 'papers', 'relating', 'to', 'the', 'Dean', 'of', 'the', 'Thistle', ',', '1974', '-', '1980', ';', 'newspaper', 'cuttings', ',', '1945', '-', '1982', ';', 'publications', 'and', 'articles', ',', '1929', '-', '1987', ';', 'essays', ',', '1935', '-', '1979', ',', 'audio', 'cassette', 'tapes', ',', '1987', 'and', 'papers', 'relating', 'to', 'television', 'broadcasts', ',', '1960', '-', '1967', '.', 'See', 'also', 'External', 'Documents', '(', 'below', ').'], ['Biographical', '/', 'Historical', ':', 'John', 'McIntyre', '(', '1916', '-', '2005', ')', 'was'

In [5]:
def corpusStatistics(plaintext_corpus_read_lists):
    total_chars = 0
    total_tokens = 0
    total_sents = 0
    total_files = 0
    
    # fileids are the TXT file names in the nls-text-ladiesDebating folder:
    for fileid in plaintext_corpus_read_lists.fileids():
        total_chars += len(plaintext_corpus_read_lists.raw(fileid))
        total_tokens += len(plaintext_corpus_read_lists.words(fileid))
        total_sents += len(plaintext_corpus_read_lists.sents(fileid))
        total_files += 1
    
    print("Total Estimated...")
    print("  Characters:", total_chars)
    print("  Tokens:", total_tokens)
    print("  Sentences:", total_sents)
    print("  Files:", total_files)

corpusStatistics(descs)

Total Estimated...
  Characters: 13739019
  Tokens: 2754044
  Sentences: 156124
  Files: 3649


In [6]:
words = [t for t in tokens if t.isalpha()]
print("Total Estimated Words:",len(words))

Total Estimated Words: 2006380


In [7]:
to_exclude = set((stopwords.words("english")) + ["Title", "Identifier", "Scope", 
                "Contents", "Biograhical", "Historical", "Processing", "Information"]
                )
words = [w for w in words if w not in to_exclude]

In [8]:
print("Total Words Excluding Metadata Field Names:",len(words))

Total Words Excluding Metadata Field Names: 1273237


<a id="ner"></a>
## II. Name Entity Recognition with spaCy
Run named entity recognition (NER) to estimate the names in the dataset and get a sense for the value in manually labeling names during the annotation process. 

In [9]:
fileids = descs.fileids()

In [10]:
sentences = []
for fileid in fileids:
    file = descs.raw(fileid)
    sentences += nltk.sent_tokenize(file)

In [11]:
person_list = []
for s in sentences:
    s_ne = nlp(s)
    for entity in s_ne.ents:
        if entity.label_ == 'PERSON':
            person_list += [entity.text] 

In [12]:
unique_persons = list(set(person_list))
print(len(unique_persons))

28625


In [13]:
unique_persons

['Twynholm Manse',
 'Owen',
 'Christopher Dawson',
 'Archibald Lindesay',
 'William Fordaill',
 'Christmas Card',
 'Headteacher',
 'Goulstonian Lecturer',
 'Nicholas Thoums',
 'Melvin J.',
 'Lila Black',
 'H. Tregellas',
 'Mary Tovey',
 'Children',
 'Ein Beitrag',
 'A. McLaren',
 'Tayne',
 'William S. Morrison',
 'John Brownlee',
 'J.H Sang',
 'Group Almoner',
 'Mme Laure Mitchell',
 'Daemonologie',
 'S. Karger',
 'Steve Byrne',
 'Amal Chandra Chaudhuri',
 'Fallowis',
 'Chekhov',
 'Henry Tristram',
 'Andrew More',
 'Mrs Adele Koestler',
 'Sewaltoun',
 'Walter Harris',
 'Willie',
 'Steen Willasden',
 'Donald Watterson',
 'George Low',
 'MD PhD',
 'Patrick Hepburn',
 'Bindan Blood',
 'Hugh Macfarlane',
 'Simon',
 'F. Duthie',
 'D5 Donovan',
 'William Rerik',
 'Jeremy Bentham',
 'Raphael Falk',
 'Curly Lop-Eared Lincoln Sow',
 'Thomas Dikkesone',
 'P. Brown',
 'David Weir',
 'Mula Rosoff',
 'Belech De',
 'Richard Fitz John',
 'Gilbert Lawdir',
 'Chesney',
 'Begri',
 'Alexander Petrovich K

Not perfect...some non-person entities labeled such as `Librarian` and `Diploma`.  I'll add labeling person names to the annotation instructions!

<a id="check-dups"></a>
## III. Checking for Duplicate Descriptions

In [15]:
# descs = PlaintextCorpusReader("../AnnotationData/descriptions_by_fonds_split_with_ann/descriptions_by_fonds_split_with_ann/", ".+\.txt")
# fileids = descs.fileids()
# descs.raw(fileids[0])

In [14]:
x = descs.raw(fileids[0])
x.split("\n")

['Identifier: AA4',
 '',
 'Title:',
 'Papers of Rev Prof John McIntyre (1916-2005)',
 '',
 'Scope and Contents:',
 'Sermons and addresses, 1940-1997; personal papers, 1932-2006; lectures, 1946-1990; class notes and lecture notes, 1932-1941; correspondence, 1979-1983; papers relating to the Dean of the Thistle, 1974-1980; newspaper cuttings, 1945-1982; publications and articles, 1929-1987; essays, 1935-1979, audio cassette tapes, 1987 and papers relating to television broadcasts, 1960-1967.See also External Documents (below).',
 '',
 'Biographical / Historical:',
 "John McIntyre (1916-2005) was born in Glasgow, educated at Bathgate Academy and then Edinburgh University where he graduated in philosophy and then in divinity. In 1943 he became the minister of the Ayrshire village of Fenwick before leaving Scotland in 1945 to teach theology at St Andrew's College in the University of Sydney. Rev McIntyre remained in Australia until 1956 when he returned to Scotland to become professor of di

In [16]:
descs_split = []
for fileid in fileids:
    d = descs.raw(fileid).split("\n")
    for section in d:
        if len(section) > 0:
            if "Identifier" not in section and "Title" not in section and "Scope and Contents" not in section and "Biographical / Historical" not in section and "Processing Information" not in section and "No information provided" not in section:
                descs_split += [section]
print(descs_split[0:4])

['Papers of Rev Prof John McIntyre (1916-2005)', 'Sermons and addresses, 1940-1997; personal papers, 1932-2006; lectures, 1946-1990; class notes and lecture notes, 1932-1941; correspondence, 1979-1983; papers relating to the Dean of the Thistle, 1974-1980; newspaper cuttings, 1945-1982; publications and articles, 1929-1987; essays, 1935-1979, audio cassette tapes, 1987 and papers relating to television broadcasts, 1960-1967.See also External Documents (below).', "John McIntyre (1916-2005) was born in Glasgow, educated at Bathgate Academy and then Edinburgh University where he graduated in philosophy and then in divinity. In 1943 he became the minister of the Ayrshire village of Fenwick before leaving Scotland in 1945 to teach theology at St Andrew's College in the University of Sydney. Rev McIntyre remained in Australia until 1956 when he returned to Scotland to become professor of divinity at New College.", 'In 1972, he was asked to be interim Moderator in St Giles Cathedral and in 19

In [17]:
len(descs_split)

93636

In [18]:
len(set(descs_split))

82258

In [19]:
print(len(descs_split)-len(set(descs_split)))

11378


It looks like there are 11,378 descriptions that are repeated in the corpus.