# Identifying People in CDCS Metadata Descriptions
Run named entity recognition (NER) on the metadata descriptions extracted from the CDCS' online archival catalog.

The code in this Jupyter Notebook is part of a PhD project to create a gold standard dataset labeled for gender biased language, on which a classifier can be trained to identify gender bias in archival metadata descriptions.  

This project is focused on the English language and archival institutions in the United Kingdom.

* Author: Lucy Havens
* Date: November 2020 - February 2021
* Project: PhD Case Study 1
* Data Provider: [ArchivesSpace](https://archives.collections.ed.ac.uk/), Centre for Research Collections, University of Edinburgh

***

**Table of Contents**

  [I. Corpus Statistics](#corpus-stats)

  [II. Named Entity Recognition with SpaCy](#ner)
  
  [III. Checking for Duplicate Descritions](#check-dups)
  
***

In [2]:
import pandas as pd

import nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
# nltk.download('punkt')
from nltk.corpus import PlaintextCorpusReader
# nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
# nltk.download('stopwords')
from nltk.tag import pos_tag
import string
import csv
import re

import spacy
from spacy import displacy
from collections import Counter
try:
    import en_core_web_sm
except ImportError:
    print("Downlading en_core_web_sm model")
    import sys
    !{sys.executable} -m spacy download en_core_web_sm
else:
    print("Already have en_core_web_sm")
import en_core_web_sm
nlp = en_core_web_sm.load()

Already have en_core_web_sm


<a id="corpus-stats"></a>
## I. Corpus Statisctics

In [3]:
descs = PlaintextCorpusReader("lucy_final/", ".+\.txt")

In [4]:
tokens = descs.words()
print(tokens[0:20])

['Identifier', ':', 'AA5', 'Title', ':', 'Papers', 'of', 'The', 'Very', 'Rev', 'Prof', 'James', 'Whyte', '(', '1920', '-', '2005', ')', 'Scope', 'and']


In [5]:
sentences = descs.sents()
print(sentences[0:5])

[['Identifier', ':', 'AA5'], ['Title', ':', 'Papers', 'of', 'The', 'Very', 'Rev', 'Prof', 'James', 'Whyte', '(', '1920', '-', '2005', ')'], ['Scope', 'and', 'Contents', ':', 'Sermons', 'and', 'addresses', ',', '1948', '-', '1996', ';', 'lectures', ',', '1949', '-', '1982', ';', 'class', 'notes', 'and', 'lecture', 'notes', ',', '1949', '-', '1982', ';', 'correspondence', ',', '1988', '-', '1989', 'and', '1964', '-', '1970', ';', 'newspaper', 'cuttings', ',', '1988', '-', '1989', 'and', '1964', '-', '1969', ';', 'publications', 'and', 'articles', ',', '1902', '-', '1970', ';', 'church', 'magazines', ',', '1929', '-', '1993', ';', 'conference', 'papers', ',', '1978', ';', 'moderatorial', 'papers', ',', '1988', '-', '1989', ';', 'University', 'Christian', 'Consultative', 'Group', 'papers', ',', '1970', '-', '1972', ';', 'Church', 'of', 'Scotland', 'and', 'the', 'Congregational', 'Union', 'of', 'Scotland', 'papers', ',', '1959', '-', '1967', ';', 'personal', 'papers', ',', '1848', '-', '198

In [14]:
def corpusStatistics(plaintext_corpus_read_lists):
    total_chars = 0
    total_tokens = 0
    total_sents = 0
    total_files = 0
    
    # fileids are the TXT file names in the nls-text-ladiesDebating folder:
    for fileid in plaintext_corpus_read_lists.fileids():
        total_chars += len(plaintext_corpus_read_lists.raw(fileid))
        total_tokens += len(plaintext_corpus_read_lists.words(fileid))
        total_sents += len(plaintext_corpus_read_lists.sents(fileid))
        total_files += 1
    
    print("Total...")
    print("  Characters:", total_chars)
    print("  Tokens:", total_tokens)
    print("  Sentences:", total_sents)
    print("  Files:", total_files)

corpusStatistics(descs)

Total...
  Characters: 14915776
  Tokens: 2965456
  Sentences: 169637
  Files: 3961


In [15]:
words = [t for t in tokens if t.isalpha()]
print("Total Words:",len(words))

Total Words: 2150078


In [22]:
to_exclude = set((stopwords.words("english")) + ["Title", "Identifier", "Scope", 
                "Contents", "Biograhical", "Historical", "Processing", "Information"]
                )
words = [w for w in words if w not in to_exclude]

In [23]:
print("Total Words Excluding Metadata Field Names:",len(words))

Total Words Excluding Metadata Field Names: 1355667


<a id="ner"></a>
## II. Name Entity Recognition with spaCy
Run named entity recognition (NER) to estimate the names in the dataset and get a sense for the value in manually labeling names during the annotation process. 

In [6]:
fileids = descs.fileids()

In [8]:
sentences = []
for fileid in fileids:
    file = descs.raw(fileid)
    sentences += nltk.sent_tokenize(file)

In [9]:
person_list = []
for s in sentences:
    s_ne = nlp(s)
    for entity in s_ne.ents:
        if entity.label_ == 'PERSON':
            person_list += [entity.text] 

In [10]:
unique_persons = list(set(person_list))
print(len(unique_persons))

28565


In [11]:
unique_persons

['W. Warre Cornish',
 'George Kennedy',
 'Triticum',
 'Honorary D.D.',
 'Dave',
 'E. W. Dallas',
 'John Forestar',
 'Alexander Smith / David Laing',
 'David Prowdie',
 'Lewis R\n\nScope',
 'Robert Abercrombie',
 'Völklinger Hütte',
 'John Kemlok',
 'B J Bedell, Psychometrika',
 'William Laidlaw',
 'Neil Morrison',
 'Routledge',
 'Andhra Pradesh',
 'James Lindsay',
 'Jake Kugel',
 'Alexander Honiman',
 'Fletcher Hayes',
 'Thomas Forrest',
 'William Edmondstoune Aytoun',
 'Alex Forbes Epis',
 'MS_BOX_17',
 'Margaret May',
 'Leeds Co-Op',
 'George Grove',
 'Basil Zaharoff',
 'Nicholson',
 'Maurice Wilkins',
 'Michael Ramsay',
 'Hine',
 'Walter Brown',
 'Fitzroy St [',
 'Nathaniel Wallis',
 'Removing Ant Heap\n\n',
 'Rue Saint-Jean',
 'Housemaid',
 'Official Transumpt',
 'Collins',
 'Charterisio',
 'Saverio Mercadante\n\n',
 'MacKinnon',
 'Louis Fleury\n\n',
 'John Forous',
 'Armstrong',
 'Prix',
 'Francesco Caccia',
 'Patrick Ogyll',
 'Hodgson',
 'Mayville',
 'George MacLeod',
 'W. Heffer

Not perfect...some non-person entities labeled such as `Londres` and `Snuff Box`.  I'll add labeling person names to the annotation instructions!

<a id="check-dups"></a>
## III. Checking for Duplicate Descriptions

In [6]:
descs = PlaintextCorpusReader("lucy_final (copy)/", ".+\.txt")
fileids = descs.fileids()
descs.raw(fileids[0])

"Identifier: AA5\n\nTitle:\nPapers of The Very Rev Prof James Whyte (1920-2005)\n\nScope and Contents:\nSermons and addresses, 1948-1996; lectures, 1949-1982; class notes and lecture notes, 1949-1982; correspondence, 1988-1989 and 1964-1970; newspaper cuttings, 1988-1989 and 1964-1969; publications and articles, 1902-1970; church magazines, 1929-1993; conference papers, 1978; moderatorial papers, 1988-1989; University Christian Consultative Group papers, 1970-1972; Church of Scotland and the Congregational Union of Scotland papers, 1959-1967; personal papers, 1848-1983; photographs 1911 and 1960.See also External Documents (below).\n\nBiographical / Historical:\nProfessor James Aitken White was a leading Scottish Theologian and Moderator of the General Assembly of the Church of Scotland. He was educated at Daniel Stewart's College and the University of Edinburgh where he studied philosophy and divinity. After his ordination he spent three years as an army Chaplain and then in 1948 was 

In [32]:
x = descs.raw(fileids[0])
x.split("\n")

['Identifier: AA5',
 '',
 'Title:',
 'Papers of The Very Rev Prof James Whyte (1920-2005)',
 '',
 'Scope and Contents:',
 'Sermons and addresses, 1948-1996; lectures, 1949-1982; class notes and lecture notes, 1949-1982; correspondence, 1988-1989 and 1964-1970; newspaper cuttings, 1988-1989 and 1964-1969; publications and articles, 1902-1970; church magazines, 1929-1993; conference papers, 1978; moderatorial papers, 1988-1989; University Christian Consultative Group papers, 1970-1972; Church of Scotland and the Congregational Union of Scotland papers, 1959-1967; personal papers, 1848-1983; photographs 1911 and 1960.See also External Documents (below).',
 '',
 'Biographical / Historical:',
 "Professor James Aitken White was a leading Scottish Theologian and Moderator of the General Assembly of the Church of Scotland. He was educated at Daniel Stewart's College and the University of Edinburgh where he studied philosophy and divinity. After his ordination he spent three years as an army Ch

In [46]:
descs_split = []
for fileid in fileids:
    d = descs.raw(fileid).split("\n")
    for section in d:
        if len(section) > 0:
            if "Identifier" not in section and "Title" not in section and "Scope and Contents" not in section and "Biographical / Historical" not in section and "Processing Information" not in section and "No information provided" not in section:
                descs_split += [section]
print(descs_split[0:4])

['Papers of The Very Rev Prof James Whyte (1920-2005)', 'Sermons and addresses, 1948-1996; lectures, 1949-1982; class notes and lecture notes, 1949-1982; correspondence, 1988-1989 and 1964-1970; newspaper cuttings, 1988-1989 and 1964-1969; publications and articles, 1902-1970; church magazines, 1929-1993; conference papers, 1978; moderatorial papers, 1988-1989; University Christian Consultative Group papers, 1970-1972; Church of Scotland and the Congregational Union of Scotland papers, 1959-1967; personal papers, 1848-1983; photographs 1911 and 1960.See also External Documents (below).', "Professor James Aitken White was a leading Scottish Theologian and Moderator of the General Assembly of the Church of Scotland. He was educated at Daniel Stewart's College and the University of Edinburgh where he studied philosophy and divinity. After his ordination he spent three years as an army Chaplain and then in 1948 was inducted to Dunollie Road Church in Oban. James Whyte moved to Mayfield Nor

In [47]:
len(descs_split)

105111

In [48]:
len(set(descs_split))

81622