## Header 
Author : Amina Matt and Yichen Wang  
Date created : 14.10.2021  
Date last modified : 21.11.2021  
Python version : 3.8  
Description : Text processing of the CARICOM Compilation Archive (CCA) https://louverture.ch/cca/ 



# To Do List
- [X] check number items
- [X] to JSON 
- [X] JSON fix None answer
- [ ] Add colonial location
- [ ] JSON cleaning of parenthesis in names?
- [ ] save NER 

# Initialization

In [1]:
# -*- coding: utf-8 -*-

import nltk #Natural Language Toolkit is a natural language programming library
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
import pandas as pd
from nltk import pos_tag
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
from nltk.chunk import conlltags2tree
from nltk.tree import Tree
import random
import pickle


#PATHS
DATA_FOLDER = './data/'
caricom_sample = DATA_FOLDER +'Caricom_Archive_Sample_Schema1.txt'
caricom = DATA_FOLDER +'Caricom_Archive.txt'


# Text separation into items 
In the primary text source, each item starts with the '=>' string and ends at the next blank line. Each item references a different actor of colonial enterprise. Separating the text into items lets us adapt the extraction to the schema each item follows.
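As a minimal sketch of this separation, the snippet below splits an in-memory string rather than the archive file. The sample layout (an `n.n` heading followed by `=>` entries separated by blank lines) and the name `split_items` are assumptions made for illustration; the two entries are taken from items shown further down in this notebook.

```python
import re

SAMPLE = """1.2 Bahamas

=> Marx Rütimeyer (b. 1647) from Vinelz (Canton of Berne)
worked as a goldminer in the Bahamas and died there.

1.6 Guyana (Guiana)

=> In 1767, one H. Werndli from Zurich, employed as a surgeon in Berbice, made a gift of plants and seeds to the Zurich Botanical Gardens.
"""

def split_items(text):
    """Return (section_index, item_text) pairs for every '=>' entry."""
    items = []
    section = None
    # Blocks are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        heading = re.match(r"(\d+\.\d+)\s", block)
        if heading:
            section = heading.group(1)  # remember the current n.n heading
        elif block.startswith("=>"):
            # Join wrapped lines so that each item is a single string.
            items.append((section, " ".join(block.split())))
    return items

for section, item in split_items(SAMPLE):
    print(section, item[:40])
```

Unlike a fixed three-character slice, the `\d+\.\d+` pattern keeps two-digit sub-indices such as `1.10` intact.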

In [2]:
#Input: path to the .txt file
#Output: list of strings, where each element is an item, i.e. a separate entry in the source document
#Requirements: -
#Description: separate the items based on the '=>' string that characterizes a new entry
def divide_items(textFilePath):
    item = []
    flagTOC = False
    colonialIndex = '' #default, in case an item appears before any n.n heading
    with open(textFilePath,"r") as f:
        for line in f:

            if (line == '1 CARICOM MEMBER STATES\n') :
                flagTOC = True #the TOC has been fully read

            if flagTOC : #check if the line is a section heading
                #sub-TOC entry, level n.n; only 3 characters are kept, so '1.10' is stored as '1.1'
                if len(line) > 2 and (line[0].isdigit()) and (line[1] == '.') and (line[2].isdigit()) :
                    colonialIndex = line[0:3]

            if line.startswith('=>'):
                item_text = ''
                #read up to the blank line that ends the item; '' means end of file
                while line not in ('\n', ''):
                    item_text = item_text + line
                    line = f.readline()
                #Once the item is read we add its colonial index that corresponds to a TOC entry
                #We add the index at the end to retrieve it easily
                item_text = item_text.replace('\n','')
                item_text = item_text + (' '+colonialIndex)
                item.append(item_text)
    return item

In [3]:
text_items = divide_items(caricom)
items_total = len(text_items)
print(f'There are {len(text_items)} items in total.')

There are 464 items in total.


In [4]:
print(f'This is one text item:\n{text_items[random.randrange(len(text_items))]}.\n')

This is one text item:
=> The second protagonist set temporarily in a colonial context by Gottfried Keller (1819–1890), renowned Swiss novelist from Zurich, is Martin Salander («Martin Salander», 1886). He comes from the provincial town of Münsterburg in Switzerland, becomes very rich in Brazil with the cultivation and trade of coffee and tobacco, loses everything to a fraudulent financial scheme, returns to Brazil and regains his wealth. His son also travels to Brazil to continue his father’s business there: Arnold Salander expands his father’s estate and finds a capable Swiss «for operation and supervision», who will soon be involved in the business transactions. Although the only way to get rich twice in Brazil through coffee and tobacco is by being involved in chattel slavery, a professor of German literature at Zurich University in 2020 speculated vaguely if Salander might have become rich through emigration or perhaps as an engineer. Slavery as a possibility was not even mentione

## Retrieving the Table of Contents

In [5]:
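#Input: path to the .txt file
#Output: list of (index, title) tuples, one per sub-TOC entry of level n.n
def tocList_func(textFilePath):
    tocList = []
    with open(textFilePath,"r") as f:
        for line in f:
            if (line == '1 CARICOM MEMBER STATES\n') :
                break #end of the TOC, the main text starts here
            #sub-TOC entry, level n.n; only 3 characters are kept, so '1.10' is stored as '1.1'
            if len(line) > 2 and (line[0].isdigit()) and (line[1] == '.') and (line[2].isdigit()) :
                toc = (line[0:3],line[4:-1])
                tocList.append(toc)
    return tocList
tocList = tocList_func(caricom)
tocList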

[('1.1', 'Antigua and Barbuda'),
 ('1.2', 'Bahamas'),
 ('1.3', 'Barbados'),
 ('1.4', 'Dominica'),
 ('1.5', 'Grenada'),
 ('1.6',
  'Guyana (Guiana): Dutch/English colonies «ara», «Essequibo», and «Berbice»'),
 ('1.6', '1 Berbice'),
 ('1.6', '2 Demerara (Demerrara, Demerary)'),
 ('1.6', '3 Essequibo'),
 ('1.7', 'Haiti (colony «Saint-Domingue»)'),
 ('1.7', '1 Economic'),
 ('1.7', '2 Military'),
 ('1.7', '3 Ideological'),
 ('1.8', 'Jamaica'),
 ('1.9', 'Montserrat'),
 ('1.1', ' St. Vincent & The Grenadines'),
 ('1.1', ' Suriname'),
 ('1.1', ' Trinidad and Tobago'),
 ('2.1', 'Cuba'),
 ('2.2',
  'Netherlands Antilles (colonies «Aruba», «Bonaire», «Curaçao», «St. Eustacius»)'),
 ('2.3', 'French West Indies (colonies «Guiana», «Guadeloupe», «Martinique»)'),
 ('2.4',
  'Danish West Indies (colonies «St. John», «St. Croix», and «St. Thomas»)'),
 ('2.5', 'Venezuela'),
 ('2.6', 'Bermudas'),
 ('3.1', ' North America (the Thirteen Colonies and the United States)'),
 ('3.1', '1 Alabama'),
 ('3.1', '2 

## Named Entity Recognition with Stanford NER
The first objective is to extract information of interest from the text. In this case we are interested in persons' names, locations and activities. The first step towards this goal is to use Named Entity Recognition (NER) to find the words that carry this information.

In [6]:
#Stanford NER 
NER_FOLDER = './NER-Standford/stanford-ner-2020-11-17'
CLASSIFIER_PATH = NER_FOLDER+'/classifiers/'
JAR_PATH = NER_FOLDER+'/stanford-ner.jar'

#classifiers
classifier_3 = 'english.all.3class.distsim.crf.ser.gz'#3 class model for recognizing locations, persons, and organizations
classifier_4 = 'english.conll.4class.distsim.crf.ser.gz'#4 class model for recognizing locations, persons, organizations, and miscellaneous entities
classifier_7 = 'english.muc.7class.distsim.crf.ser.gz' #7 class model for recognizing locations, persons, organizations, times, money, percents, and dates

st = StanfordNERTagger(CLASSIFIER_PATH+classifier_7, JAR_PATH, encoding='utf-8')



#Extracting named-entities
with open(caricom_sample, 'r') as f: #the file is closed once it has been read
    text = f.read()
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)

print(classified_text)

[('=', 'O'), ('>', 'O'), ('François', 'PERSON'), ('Aimé', 'PERSON'), ('Louis', 'PERSON'), ('Dumoulin', 'PERSON'), ('(', 'O'), ('1753-1834', 'O'), (')', 'O'), ('from', 'O'), ('Vevey', 'LOCATION'), ('(', 'O'), ('Canton', 'LOCATION'), ('of', 'O'), ('BerneVaud', 'O'), (')', 'O'), ('left', 'O'), ('Switzerland', 'LOCATION'), ('at', 'O'), ('the', 'O'), ('age', 'O'), ('of', 'O'), ('20', 'O'), ('for', 'O'), ('the', 'O'), ('Caribbean', 'LOCATION'), ('and', 'O'), ('lived', 'O'), ('on', 'O'), ('Grenada', 'LOCATION'), ('1773–1783', 'O'), ('.', 'O'), ('He', 'O'), ('worked', 'O'), ('as', 'O'), ('a', 'O'), ('painter', 'O'), (',', 'O'), ('secretary', 'O'), ('to', 'O'), ('the', 'O'), ('governor', 'O'), ('of', 'O'), ('the', 'O'), ('island', 'O'), (',', 'O'), ('and', 'O'), ('merchant', 'O'), ('.', 'O'), ('In', 'O'), ('1778', 'DATE'), (',', 'O'), ('he', 'O'), ('was', 'O'), ('pressed', 'O'), ('into', 'O'), ('the', 'O'), ('English', 'O'), ('army', 'O'), ('of', 'O'), ('Governor', 'O'), ('MacCartney', 'O'), ('

At this point the whole text is tagged. However, the entities aren't grouped together. For example, a person's full name is split across several tuples, one per token.

## BIO tagging for readable Named Entities (i.e. regrouped NE)

[BIO](https://en.wikipedia.org/wiki/Inside–outside–beginning_(tagging)) tags are a way to regroup tokens and make the output more readable.
A multi-token entity, such as a person's first and last name, is regrouped by assigning:
- `B-` to the token that begins a named entity
- `I-` to the tokens inside a named entity
- `O` to all other tokens

This is done by comparing each token's tag with the tag of the token just before it.
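A compact sketch of this rule on a hand-written token list may help before reading the full function. The name `to_bio` and the short sample are illustrative assumptions; the tags mirror those produced by the tagger above.

```python
def to_bio(tagged_tokens):
    """Prefix entity tags with B- or I- based on the previous token's tag."""
    bio = []
    previous = "O"
    for token, tag in tagged_tokens:
        if tag == "O":
            bio.append((token, tag))
        elif tag == previous:
            bio.append((token, "I-" + tag))  # the same entity continues
        else:
            bio.append((token, "B-" + tag))  # a new entity starts
        previous = tag
    return bio

tagged = [("François", "PERSON"), ("Dumoulin", "PERSON"),
          ("from", "O"), ("Vevey", "LOCATION")]
for pair in to_bio(tagged):
    print(pair)
```

Because only the previous tag is inspected, two distinct entities of the same type that directly follow each other are treated as one entity.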

In [7]:
# Function imported from 
# https://pythonprogramming.net/using-bio-tags-create-named-entity-lists/?completed=/testing-stanford-ner-taggers-for-speed/

# Tag tokens with standard NLP BIO tags
def bio_tagger(ne_tagged):
		bio_tagged = [] #empty list
		prev_tag = "O" #starting with a O tag
		for token, tag in ne_tagged:
			if tag == "O": #O
				bio_tagged.append((token, tag))
				prev_tag = tag
				continue
			if tag != "O" and prev_tag == "O": # Begin NE
				bio_tagged.append((token, "B-"+tag))
				prev_tag = tag
			elif prev_tag != "O" and prev_tag == tag: # Inside NE
				bio_tagged.append((token, "I-"+tag))
				prev_tag = tag
			elif prev_tag != "O" and prev_tag != tag: # Adjacent NE
				bio_tagged.append((token, "B-"+tag))
				prev_tag = tag
		return bio_tagged

In [8]:
bio_text = bio_tagger(classified_text)
bio_text

[('=', 'O'),
 ('>', 'O'),
 ('François', 'B-PERSON'),
 ('Aimé', 'I-PERSON'),
 ('Louis', 'I-PERSON'),
 ('Dumoulin', 'I-PERSON'),
 ('(', 'O'),
 ('1753-1834', 'O'),
 (')', 'O'),
 ('from', 'O'),
 ('Vevey', 'B-LOCATION'),
 ('(', 'O'),
 ('Canton', 'B-LOCATION'),
 ('of', 'O'),
 ('BerneVaud', 'O'),
 (')', 'O'),
 ('left', 'O'),
 ('Switzerland', 'B-LOCATION'),
 ('at', 'O'),
 ('the', 'O'),
 ('age', 'O'),
 ('of', 'O'),
 ('20', 'O'),
 ('for', 'O'),
 ('the', 'O'),
 ('Caribbean', 'B-LOCATION'),
 ('and', 'O'),
 ('lived', 'O'),
 ('on', 'O'),
 ('Grenada', 'B-LOCATION'),
 ('1773–1783', 'O'),
 ('.', 'O'),
 ('He', 'O'),
 ('worked', 'O'),
 ('as', 'O'),
 ('a', 'O'),
 ('painter', 'O'),
 (',', 'O'),
 ('secretary', 'O'),
 ('to', 'O'),
 ('the', 'O'),
 ('governor', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('island', 'O'),
 (',', 'O'),
 ('and', 'O'),
 ('merchant', 'O'),
 ('.', 'O'),
 ('In', 'O'),
 ('1778', 'B-DATE'),
 (',', 'O'),
 ('he', 'O'),
 ('was', 'O'),
 ('pressed', 'O'),
 ('into', 'O'),
 ('the', 'O'),
 ('English',

Using the BIO tags, we can rebuild the token list so that each named entity is regrouped into a single, readable element.
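The regrouping itself can be sketched in plain Python, without going through an NLTK tree. This is an alternative illustration rather than the method used in this notebook, and `group_entities` is a hypothetical helper name.

```python
def group_entities(bio_tagged):
    """Merge B-/I- runs into single (text, label) pairs; keep O tokens as-is."""
    grouped = []
    for token, tag in bio_tagged:
        if tag.startswith("I-") and grouped and grouped[-1][1] == tag[2:]:
            # Continue the entity opened by the preceding B- token.
            text, label = grouped[-1]
            grouped[-1] = (text + " " + token, label)
        elif tag == "O":
            grouped.append((token, "O"))
        else:
            # B- tag, or a stray I- tag with nothing to attach to.
            grouped.append((token, tag[2:]))
    return grouped

bio_sample = [("François", "B-PERSON"), ("Aimé", "I-PERSON"),
              ("Louis", "I-PERSON"), ("Dumoulin", "I-PERSON"),
              ("(", "O"), ("1753-1834", "O"), (")", "O"),
              ("from", "O"), ("Vevey", "B-LOCATION")]
print(group_entities(bio_sample))
```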

In [9]:
# Function imported from 
# https://pythonprogramming.net/using-bio-tags-create-named-entity-lists/?completed=/testing-stanford-ner-taggers-for-speed/

# Create tree       
def stanford_tree(bio_tagged):
	tokens_raw, ne_tags = zip(*bio_tagged)
	tokens = [word for word in tokens_raw if word]
	pos_tags = [pos for token, pos in pos_tag(tokens)]

	conlltags = [(token, pos, ne) for token, pos, ne in zip(tokens, pos_tags, ne_tags)]
	ne_tree = conlltags2tree(conlltags) #from BIO to tree format
	return ne_tree

In [None]:
tree_text = stanford_tree(bio_text)
tree_text

In [11]:
# Function imported from 
# https://pythonprogramming.net/using-bio-tags-create-named-entity-lists/?completed=/testing-stanford-ner-taggers-for-speed/

# Parse named entities from tree
def structure_ne(ne_tree):
	ne = []
	for subtree in ne_tree:
		if type(subtree) == Tree: # If subtree is a noun chunk, i.e. NE != "O"
			ne_label = subtree.label()
			ne_string = " ".join([token for token, pos in subtree.leaves()])
			ne.append((ne_string, ne_label))
		else:
			ne_label = 'O'
			ne_string = subtree[0]
			ne.append((ne_string, ne_label))           
	return ne

In [12]:
clean_ne = structure_ne(tree_text)
clean_ne

[('=', 'O'),
 ('>', 'O'),
 ('François Aimé Louis Dumoulin', 'PERSON'),
 ('(', 'O'),
 ('1753-1834', 'O'),
 (')', 'O'),
 ('from', 'O'),
 ('Vevey', 'LOCATION'),
 ('(', 'O'),
 ('Canton', 'LOCATION'),
 ('of', 'O'),
 ('BerneVaud', 'O'),
 (')', 'O'),
 ('left', 'O'),
 ('Switzerland', 'LOCATION'),
 ('at', 'O'),
 ('the', 'O'),
 ('age', 'O'),
 ('of', 'O'),
 ('20', 'O'),
 ('for', 'O'),
 ('the', 'O'),
 ('Caribbean', 'LOCATION'),
 ('and', 'O'),
 ('lived', 'O'),
 ('on', 'O'),
 ('Grenada', 'LOCATION'),
 ('1773–1783', 'O'),
 ('.', 'O'),
 ('He', 'O'),
 ('worked', 'O'),
 ('as', 'O'),
 ('a', 'O'),
 ('painter', 'O'),
 (',', 'O'),
 ('secretary', 'O'),
 ('to', 'O'),
 ('the', 'O'),
 ('governor', 'O'),
 ('of', 'O'),
 ('the', 'O'),
 ('island', 'O'),
 (',', 'O'),
 ('and', 'O'),
 ('merchant', 'O'),
 ('.', 'O'),
 ('In', 'O'),
 ('1778', 'DATE'),
 (',', 'O'),
 ('he', 'O'),
 ('was', 'O'),
 ('pressed', 'O'),
 ('into', 'O'),
 ('the', 'O'),
 ('English', 'O'),
 ('army', 'O'),
 ('of', 'O'),
 ('Governor', 'O'),
 ('MacCartn

In [13]:
def ner_text(text):
    tokenized_text = word_tokenize(text)
    classified_text = st.tag(tokenized_text)
    bio_text = bio_tagger(classified_text)
    tree_text = stanford_tree(bio_text)
    ner_item = structure_ne(tree_text)
    return ner_item

# From NE tree to JSON

The structured NE list for each text is transformed into an entry in a dataframe. The goal is to have, for each text item, an entry with the *relevant* information.  
The difficult part is to sort out the relevant information. Which of the persons is the one of interest? Which location is the one where the organization or the person was involved? Which dates are the dates of interest? 
Here we deal only with the transformation.

## Use schema 1 **(*name* (date) from *origin*)** to retrieve the name, origin and date attributes of a text item
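For comparison, schema 1 can also be matched directly on the raw item text with a regular expression. This is a sketch of an alternative technique, not the NER-based approach used in this notebook; the pattern, the name `parse_schema1` and the assumption that an origin is a run of capitalised words are all illustrative.

```python
import re

# name, then "(dates)", then "from", then a run of capitalised words
SCHEMA1 = re.compile(
    r"^=>\s*(?P<person>[^()]+?)\s*"
    r"\((?P<date>[^()]*\d[^()]*)\)\s+from\s+"
    r"(?P<origin>[A-ZÀ-Ý][\w.'-]*(?:\s+[A-ZÀ-Ý][\w.'-]*)*)"
)

def parse_schema1(item_text):
    """Return person, date and origin if the item starts with schema 1, else None."""
    match = SCHEMA1.match(item_text)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}

item = ("=> Jean Huguenin (1685–1740) from Le Locle (Canton of Neuchâtel) "
        "moved to Holland with Swiss troops.")
print(parse_schema1(item))
```

Origins introduced by a lowercase word, such as "from the City of Geneva", are not matched by this pattern and would need a dedicated case.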

In [14]:
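#MIGHT BE USELESS
#Input: dateString, a string such as '1731-1820'
#Output: True if the string contains at least one digit, False otherwise
#Requirements: -
#Description: loose check that a string could be a date
def is_date(dateString):
    return any(s.isdigit() for s in dateString)
#Works for (1731-1820)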

In [15]:
#Input: item is a single entry from text source 1 with NER tags (characterized by the '=>' starting string)
#Output: True if the text is structured as schema 1, False otherwise
#Requirements: -
#Description: Test whether the text matches schema 1, i.e. the **Name** (*date*) from *city* pattern.
#The test is loose: it only checks that the item contains a purely numeric token and both parentheses.
def schema1_test(item):
    tags = [x[1] for x in item]
    text_middle = [x[0] for x in item]
    schema1 = False
    #report missing tags; the indices themselves are not needed for the test
    if 'PERSON' not in tags:
        print("Item does not contain a PERSON value")
    if 'LOCATION' not in tags:
        print("Item does not contain a LOCATION value")
    #digit test: at least one token made only of digits
    digit_test = any(x.isdigit() for x in text_middle)
    #parenthesis test: both an opening and a closing parenthesis must be present
    if digit_test :
        schema1 = ('(' in text_middle) and (')' in text_middle)

    return schema1

In [16]:
#Function test
schema1_test(ner_text(text_items[80]))

True

## From NER to JSON

In [17]:
#Input: item is a single entry from text source 1 with NER tags (characterized by the '=>' starting string)
#Output: A dictionary with person, date, origin and o_confidence keys if the text is structured as schema 1, None otherwise
#Requirements: -
#Description: Test if the first elements of a text match schema 1.
#Namely, do the first words match the **Name** (*date*) from *city* pattern?
#If the item matches schema 1, a dictionary is returned
def schema1_JSON(item):
    #Default
    schema1 = False
    s1item_JSON = None
    #Separate text and tags
    text = [x[0] for x in item]
    tags = [x[1] for x in item]
    
    ##--Start and end of piece of interest, i.e. ...'PERSON'.....'LOCATION'--##
    try:
        person_Index = tags.index('PERSON')
        person = text[person_Index]
    except ValueError:
        person_Index = -1 #default
        print("Item does not contain a PERSON value")
        
   
    #Location can be found with LOCATION tags. But it should, according to schema 1 also be just after the 'from'.. confidence will tell us if it is both sources or not
    #Origin Method 1
    try:
        #Case 1: 1 word location
        origin_Index_method1 = text.index('from')
        origin_1 = text[origin_Index_method1+1]
        #Case 2: 2 words location, e.g. Le Locle 
        #print('First letter' + text[origin_Index_method1+2][0])
        if (text[origin_Index_method1+2][0]).isupper() : 
            origin_1 = origin_1 +' '+ text[origin_Index_method1+2]
        #Case 3: the City of Location, e.g. the City of Geneva
        if (text[origin_Index_method1+2]) == 'City' : 
            origin_1 = text[origin_Index_method1+4]
            
        #print('The origin index using from gives origin as :' + origin_1)
    except ValueError:
        print("Item does not contain any 'from' string")
        origin_1 = -1 #default
        origin_Index_method1 = -1 #default, needed for the comparisons below
    
    
    #Origin Method2
    try: 
        origin_Index_method2 = tags.index('LOCATION')
    except ValueError:
        print("Item does not contain a LOCATION value")
        origin_Index_method2 = -1 #default
   

    #Check if both methods give the same answers
    o_confidence = (origin_Index_method1 == origin_Index_method2)
        
    #If there are PERSON and LOCATION values, with PERSON first we continue the schema1 test
    if person_Index < origin_Index_method1 and person_Index > 0 and origin_Index_method1 > 0 :
        #Define part in between PER and LOC tags
        ner_middle = item[person_Index+1:origin_Index_method1]
        #print('This is the person value'+str(item[person_Index]))
        #print('This is the location value'+str(item[origin_Index_method1+1]))
        #print('This is the in NER between'+str(ner_middle))
        text_middle = [x[0] for x in ner_middle]
        #print('This is the in text between'+str(ner_middle))
        
        #Parenthesis test
        try:
            par1_Index = text_middle.index('(')
        except ValueError:
            par1_Index = -1 #default
        #print("par 1 index " + str(par1_Index))
              
        try:
            par2_Index = text_middle.index(')')
        except ValueError:
            par2_Index = -1 #default
        #print("par 2 index " + str(par2_Index))
        
        #If there are parenthesis
        if par1_Index < par2_Index and par2_Index >= 0 and par1_Index >= 0 :
            date_par = text_middle[par1_Index+1:par2_Index]
            #print('This is the text in between parenthesis ' +str(date_par))
            #SKIPPING DIGIT TEST
            #digit test
            #digit_test = any(x.isdigit() for x in str(date_par))
            #print('The digit test results : '+str(digit_test))
            #Save informations from schema 1
            #if digit_test :
                
                #retrieve date
             #   date = ''
              #  date_split = str(date_par).split('–')
              #  for x in str(date_split):
              #      if x.isdigit():
             #           date = date +' '+ x
            date = str(date_par[0])
            #print('The retrieved date is ' + date)
           
        
            #Create a JSON dictionary
            s1item_JSON = {
                'person' : person,
                'date': date,
                'origin': origin_1,
                'o_confidence':o_confidence
                #'field':NA
            }
    return s1item_JSON

In [18]:
#Function test
print('This is an example where the function fails due to bad NER')
n = 4
print(text_items[n])
schema1_JSON(ner_text(text_items[n]))

print('\n\nThis is an example where the function works')
n = 30
print(text_items[n])
print(schema1_JSON(ner_text(text_items[n])))

This is an example where the function fails due to bad NER
=> Marx Rütimeyer‏‎ (b. 1647) from Vinelz (Canton of Berne) worked as a goldminer in the Bahamas and died there. 1.2
Item does not contain a PERSON value


This is an example where the function works
=> Jean Huguenin (1685–1740) from Le Locle (Canton of Neuchâtel) moved to Holland with Swiss troops. His son Jean Roulof Huguenin (1731-1764) became ensign in the regiment Douglas, a military unit which had been sent to Berbice to suppress the slave rising of 1763. Lieutenant Colonel Robert Douglas was a Scotsman at the service of the Dutch army and the second in command in the expedition against the rebellious slaves. Huguenin died in Berbice and is buried in Fort Nassau. 1.6
{'person': 'Jean Huguenin', 'date': '1685–1740', 'origin': 'Le Locle ( Canton of Neuchâtel', 'o_confidence': False}


In [19]:
#Input: item is a single entry from text source 1 with NER tags (characterized by the '=>' starting string)
#Output: A dictionary with person, date, origin and o_confidence keys if the text is structured as schema 2, None otherwise
#Requirements: -
#Description: Test if the first elements of a text match schema 2.
#Namely, do the first words match the => In date, **Name** from *city* pattern?
#If the item matches schema 2, a dictionary is returned
def schema2_JSON(item):
    #Default
    schema2 = False
    s2item_JSON = None
    #Separate text and tags
    text = [x[0] for x in item]
    tags = [x[1] for x in item]
    
    #'In YEAR' test
    inYear = (len(text) > 3) and (text[2] == 'In') and (len(text[3]) == 4) and (text[3].isdigit())
    #print('This item looks like an schema 2 item '+str(inYear) + str(text[2:4]))
    if not inYear :
        return s2item_JSON #None, the item does not start with 'In YEAR'
    date = text[3]
    
    ##--Start and end of piece of interest, i.e. ...'PERSON'.....'LOCATION'--##
    try:
        person_Index = tags.index('PERSON')
        person = text[person_Index]
        print("Item does contain a PERSON value")
    except ValueError:
        print("Item does not contain a PERSON value")
        return s2item_JSON #None, schema 2 needs a person
    
    #Location can be found with LOCATION tags. But it should, according to schema 2 also be just after the 'from'.. confidence will tell us if it is both sources or not
    #Origin Method 1
    try:
        #Case 1: 1 word location
        origin_Index_method1 = text.index('from')
        origin_1 = text[origin_Index_method1+1]
        #Case 2: 2 words location, e.g. Le Locle 
        print('First letter' + text[origin_Index_method1+2][0])
        if (text[origin_Index_method1+2][0]).isupper() : 
            origin_1 = origin_1 +' '+ text[origin_Index_method1+2]
        #Case 3: the City of Location, e.g. the City of Geneva
        if (text[origin_Index_method1+2]) == 'City' : 
            origin_1 = text[origin_Index_method1+4]
            
        print('The origin index using from gives origin as :' + origin_1)
    except ValueError:
        print("Item does not contain any 'from' string")
        origin_1 = -1 #default
        origin_Index_method1 = -1 #default, needed for the comparison below
    
    
    #Origin Method2
    try: 
        origin_Index_method2 = tags.index('LOCATION')
    except ValueError:
        print("Item does not contain a LOCATION value")
        origin_Index_method2 = -1 #default
        
    #Check if both methods give the same answers
    o_confidence = (origin_Index_method1 == origin_Index_method2)


    #Create a JSON dictionary
    s2item_JSON = {
        'person' : person,
        'date': date,
        'origin': origin_1,
        'o_confidence':o_confidence
        #'field':NA
    }

    return s2item_JSON

In [20]:
print('Let\'s try schema 2\n')
n = 45
print(text_items[n])
print(schema2_JSON(ner_text(text_items[n])))

Let's try schema 2

=> In 1767, one H. Werndli from Zurich, employed as a surgeon in Berbice, made a gift of plants and seeds to the Zurich Botanical Gardens. In 1773, he sent the Zurich Naturalist Society a collection of reptiles (e.g. the embryo of an armadillo preserved in alcohol) and of «American snakes». 1.6
Item does contain a PERSON value
First letter,
The origin index using from gives origin as :Zurich
{'person': 'H. Werndli', 'date': '1767', 'origin': 'Zurich', 'o_confidence': False}


In [21]:
any(x.isdigit() for x in '1685–1740')
sep = '1685–1740'.split('–')
any(x.isdigit() for x in sep[1])

True

# All functions

In [22]:
def text_and_tags(item):
    #Separate text and tags
    text = [x[0] for x in item]
    tags = [x[1] for x in item]
    return text,tags


def person_index(text,tags):
    ##--Start and end of piece of interest, i.e. ...'PERSON'.....'LOCATION'--##
    try:
        person_Index = tags.index('PERSON')
        #print("Item does contain a PERSON value"+str(text[person_Index]))
    except ValueError:
        person_Index = -1 #default
        print("Item does not contain a PERSON value")
    return person_Index

def origin_location_index(text,tags):
    #The origin location can be found with LOCATION tags. But it should, according to schema 2 also be just after the 'from'.. confidence will tell us if it is both sources or not
    #Origin Method 1
    try:
        #Case 1: 1 word origin location
        origin_Index_method1 = text.index('from')
        origin_1 = text[origin_Index_method1+1]
        #Case 2: 2 words origin location, e.g. Le Locle 
        #print('First letter' + text[origin_Index_method1+2][0])
        if (text[origin_Index_method1+2][0]).isupper() : 
            origin_1 = origin_1 +' '+ text[origin_Index_method1+2]
        #Case 3: the City of Location, e.g. the City of Geneva
        if ((text[origin_Index_method1+2]) == 'City') or  ((text[origin_Index_method1+2]) == 'Canton'): 
            origin_1 = text[origin_Index_method1+4]
        #print('The origin index using from gives origin as :' + origin_1)
    except ValueError:
        print("Item does not contain any 'from' string")
        origin_1 = -1 #default
        origin_Index_method1 = -1
    
    
    #Origin Method2
    try: 
        origin_Index_method2 = tags.index('LOCATION')
    except ValueError:
        print("Item does not contain a LOCATION value")
        origin_Index_method2 = -1 #default
        
    #Check if both methods give the same answers
    #Note: origin_Index_method1 is the position of 'from', the origin itself starts one token later
    o_confidence = (origin_Index_method1 == origin_Index_method2)
    return origin_Index_method1,origin_1,o_confidence
  
def person_location(person_Index,origin_Index_method1) :
    #If there are PERSON and LOCATION values, with PERSON first we continue the schema1 test
    flag = (person_Index < origin_Index_method1) and (person_Index > 0) and (origin_Index_method1 > 0)
    return flag    
    

def date(text,tags,person_Index,origin_Index_method1):
    #SCHEMA 2
    #'In YEAR' test
    inYear = (len(text) > 3) and (text[2] == 'In') and (len(text[3]) == 4) and (text[3].isdigit())
    #print('This item looks like an schema 2 item '+str(inYear) + str(text[2:4]))
    if inYear : 
        date = text[3]
        return date
    
    else :
    #SCHEMA 1
    #(date)
        #Text in between the PERSON token and 'from'
        text_middle = text[person_Index+1:origin_Index_method1]

        #Parenthesis test
        try:
            par1_Index = text_middle.index('(')
        except ValueError:
            par1_Index = -1 #default

        try:
            par2_Index = text_middle.index(')')
        except ValueError:
            par2_Index = -1 #default

        #If there are parenthesis
        if par1_Index < par2_Index and par2_Index >= 0 and par1_Index >= 0 :
            date_par = text_middle[par1_Index+1:par2_Index]
            date = str(date_par[0])
            return date
        else :
            return None

def colonial_location(text,tocList) :
    tocFromText = text[len(text)-1]#the last item is the TOC entry 
    #print(tocFromText)
    tocListIndex = [x[0] for x in tocList]
    colonialLoc = tocList[tocListIndex.index(tocFromText)][1]
    
    return colonialLoc 

In [91]:
def colonial_activites(text):
    # Colonial activities
    trading = ['cotton', 'indigo', 'sugar', 'tobacco', 'textile', 'merchant']
    military = ['captain','lieutenant','commander','regiment', 'rebellion', 'troops']
    plantation = ['plantation', 'plantations']
    slave_trade = ['slave ship', 'slave-ship']
    result = []
    for word in text:
        if word in trading:
            result.append('trading')
        if word in military:
            result.append('military')
        if word in plantation:
            result.append('plantation owner')
        if word in slave_trade:
            result.append('slave trade')

    return None if len(result) == 0 else result

In [92]:
text = text_and_tags(ner_items[12])[0]

In [93]:
colonial_activites(text)

['plantation owner', 'plantation owner']

In [98]:
' '.join(text)

'= > The Peschiers were Huguenots from the south of France who settled in Geneva . Pierre Peschier ( 1688–1766 ) was a pharmacist with links to England . His son Jean ( b . 1735 ) settled in Grenada , possibly as a member of the British military , where he married Rose de Belgens from a family rich plantation owners . His younger brother Henri ( b . 1741 ) joined him later , and , financed by their brother Jean Antoine , who still lived in Geneva , the two Peschier brothers acquired a plantation of 192 acres called Bonne Chance with at least 80 slaves . They paid 12,600 livres for it . The brothers also became merchants in the capital and chief port of St.George ’ s . Henri ( Henry ) then decided to emigrate to Trinidad , where he arrived in 1781 with some slaves . 1.5'

## Main function

In [26]:
# ner_items = []
# for item in text_items:
#     ner_item = ner_text(item)
#     ner_items.append(ner_item)
# len(ner_items)

In [27]:
# # save and load ner_items in pickle
# pickle.dump(ner_items, open( "ner_items.p", "wb" ) )


In [28]:
ner_items = pickle.load( open( "ner_items.p", "rb" ) )
len(ner_items)

464

In [None]:
jsonList= []
i = 0
s1 = 0
for item in ner_items:
        #nerItem = ner_text(item)
        text_tags = text_and_tags(item)
        personIndex = person_index(text_tags[0],text_tags[1])
        origin_info = origin_location_index(text_tags[0],text_tags[1]) #origin_Index_method1,origin,o_confidence

        #Test if it will be one of the two schemas
        if person_location(personIndex,origin_info[0]) :
            person = text_tags[0][personIndex]
            origin = origin_info[1]
            o_confidence = origin_info[2]
            
            #Retrieve date according to schema1 or schema2 if no date then None
            dateValue = date(text_tags[0],text_tags[1],personIndex,origin_info[0])
            
            #Retrieve colonial location
            colonialLoc = colonial_location(text_tags[0],tocList)
            
            activites = colonial_activites(text_tags[0])
            
            #Create a JSON dictionary
            item_JSON = {
                'person' : person,
                'date': dateValue,
                'origin': origin,
                'o_confidence':o_confidence,
                'colonial_Location': colonialLoc,
                'activities': activites
            }
            jsonList.append(item_JSON)
        #print(item_JSON)

In [104]:
len(jsonList)

295

=> Paul Coulon (1731 – 1820) from Neuchâtel (NW Switzerland), together with Jacques Louis Pourtalès (1722–1814) from Neuchâtel and Johann Jakob Thurneysen (1729–1784) from Bâle, owned the plantations Bellair (coffee and cocoa), Mont Saint–Jean (coffee), La Conférence (sugar), Clavier, and Larcher. Until 1797, they produced sugar, coffee, cocoa, and cotton with about 100 to 200 slaves on each plantation. The plantations were administered by François und Pierre de Meuron from Neuchâtel. One of them married a woman qualified in the racist terminology of the island a «quarteronne», daughter of white father and a mulatto mother and took her home with him to Neuchâtel.

In [107]:
caricomDataRaw.loc[caricomDataRaw['person']=='Paul Coulon']

Unnamed: 0,person,date,origin,o_confidence,colonial_Location,activities
8,Paul Coulon,1731,Neuchâtel ( NW Switzerland ),False,Grenada,"[plantation owner, trading, trading, trading, ..."


In [108]:
jsonList[8]

{'person': 'Paul Coulon',
 'date': '1731',
 'origin': 'Neuchâtel ( NW Switzerland )',
 'o_confidence': False,
 'colonial_Location': 'Grenada',
 'activities': ['plantation owner',
  'trading',
  'trading',
  'trading',
  'plantation owner',
  'plantation owner']}

In [133]:
caricomDataRaw.loc[caricomDataRaw['person']=='Henry Peschier']['activities'].tolist()

[['trading', 'plantation owner', 'plantation owner', 'trading']]

## From JSONs to Dataframe

In [106]:
#transform JSON list into a dataframe 
caricomDataRaw = pd.json_normalize(jsonList)

## Cleaning
- Remove all the duplicates
- If several entries refer to the same person, we need to merge them or remove one of the entries...
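
A minimal sketch of the merge option, assuming the column names of `caricomDataRaw`. The helper `merge_same_person` and the toy rows (`Person A`, `Person B`) are hypothetical and only illustrate the idea: keep the first non-missing scalar value per person and take the union of the activity lists.

```python
import pandas as pd

def merge_same_person(df):
    """Collapse rows that share a 'person' value into a single row (hypothetical helper)."""
    rows = []
    for person, group in df.groupby('person', sort=False):
        merged = {'person': person}
        # Scalar columns: keep the first non-missing value, else None
        for col in ['date', 'origin', 'colonial_Location']:
            values = group[col].dropna()
            merged[col] = values.iloc[0] if len(values) else None
        # Confident if any of the merged rows was confident
        merged['o_confidence'] = bool(group['o_confidence'].any())
        # Activities: union of all lists, first-seen order, repeats dropped
        activities = []
        for acts in group['activities']:
            if isinstance(acts, list):
                for act in acts:
                    if act not in activities:
                        activities.append(act)
        merged['activities'] = activities
        rows.append(merged)
    return pd.DataFrame(rows)

# Toy rows, invented for illustration only
toy = pd.DataFrame([
    {'person': 'Person A', 'date': None, 'origin': 'Zurich', 'o_confidence': False,
     'colonial_Location': 'Cuba', 'activities': ['trading', 'trading']},
    {'person': 'Person A', 'date': '1815', 'origin': 'Zurich', 'o_confidence': False,
     'colonial_Location': 'Cuba', 'activities': ['plantation owner', 'trading']},
    {'person': 'Person B', 'date': None, 'origin': 'Couvet', 'o_confidence': False,
     'colonial_Location': 'Cuba', 'activities': ['slave trade']},
])
merged = merge_same_person(toy)
```

Merging keeps information that a plain removal would lose (here the date found only in the second row), at the cost of hiding conflicting values when two rows disagree.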

In [109]:
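def clean(raw_Data):
    #drop_duplicates(inplace=True) returns None, so build a de-duplicated copy instead
    #'activities' holds lists (unhashable), so rows are compared on their string representation
    duplicated = raw_Data.astype(str).duplicated()
    clean_Data = raw_Data.loc[~duplicated]
    return clean_Data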
   
    
i = 120
caricomDataRaw.iloc[i:i+10]

Unnamed: 0,person,date,origin,o_confidence,colonial_Location,activities
120,Heinrich Escher,1815,Zurich,False,Cuba,"[plantation owner, plantation owner]"
121,Heinrich Studer,1779-1831,Winterthur,False,Cuba,[plantation owner]
122,Johannes Köhli‏‎,1773–1814,Biel,False,Cuba,[trading]
123,Karl Wilhelm Scherb,1780–1827,Bischofszell,False,Cuba,[trading]
124,Johann Ulrich Zellweger,1804–1871,the,False,Cuba,"[plantation owner, trading, trading, plantatio..."
125,Jacob Jakob,1850,the,False,Cuba,[plantation owner]
126,Philippe Robert-Tissot,,Neuchâtel,False,Cuba,"[plantation owner, plantation owner]"
127,Eine Selbstschau »,1771-1848,Unterseen BE,False,Cuba,
128,Favre,,Couvet,False,Cuba,[slave trade]
129,Charles Rossel,1822,Nantes,False,Cuba,"[slave trade, slave trade]"


In [111]:
caricomDataRaw['origin']

0               Zurich
1          Saint-Aubin
2               Zurich
3         Schaffhausen
4               Africa
            ...       
290        a St.Gallen
291              Berne
292                  a
293           a Geneva
294    TumeglDomleschg
Name: origin, Length: 295, dtype: object

In [125]:
caricomDataRaw.loc[caricomDataRaw['date'].isna()]

Unnamed: 0,person,date,origin,o_confidence,colonial_Location,activities
3,Grafen Karl von Zinzendorf,,Schaffhausen,False,Barbados,"[trading, trading, trading, trading, trading, ..."
4,Samuel Müller,,Africa,False,Barbados,
6,Anton Schulthess,,a Zurich,False,Barbados,"[trading, military]"
7,Jean-Antoine Bertrand (,,Geneva,False,Dominica,"[trading, plantation owner]"
11,Jean Henri (,,where,False,Grenada,[trading]
...,...,...,...,...,...,...
276,Carl Vogt ( 1817–1895 ),,Germany,False,Anti-Black Racism and Ideologies Relevant to C...,
279,Martin Salander,,the Dutch,False,Anti-Black Racism and Ideologies Relevant to C...,[military]
290,Jan Willem ( Baron,,a St.Gallen,False,African and European Logistics,[plantation owner]
291,Saint Domingue,,Berne,False,African and European Logistics,"[trading, trading, trading, trading, trading, ..."


In [115]:
caricomDataRaw['origin'].unique()

array(['Zurich', 'Saint-Aubin', 'Schaffhausen', 'Africa', 'Geneva',
       'a Zurich', 'Neuchâtel ( NW Switzerland )', 'Vevey', 'St.Gallen',
       'where', 'Brazil', 'Bâle', 'Lausanne',
       'Le Locle ( Canton of Neuchâtel', 'a Geneva', 'the', 'church',
       'Lelienburg', 'Bürglen', 'Burgdorf ( Canton of Berne )', 'Basel',
       'an Yverdon', 'Thurgau', 'Treytorrens ( Payerne', 'his',
       'Speicher', 'Walenstadt', 'Aarau', 'a', 'Bournens',
       'La Tour-de-Peilz', 'Lutry ( Canton of Vaud', 'a St.Gallen',
       'Neuchâtel', 'Murten', 'Switzerland',
       'St. Gallen ( E Switzerland )', 'La Rochelle', 'Versoix',
       'Sonvillier', 'Schöftland ( Canton', 'Saint-Domingue', 'trade',
       'Berne', 'Le Locle', 'a Neuchâtel', 'Hunziker', 'Solothurn',
       'Aargau', 'Dornach', '1824', 'Lucerne',
       'Graubünden ( E Switzerland )', 'Jamaica', 'Rougement', 'Yverdon',
       'Morges', 'Môtier', 'Bourmens', 'Echallens', 'Obersimmental',
       '1738–1744', 'Noréaz', '1796',
  

In [117]:
caricomDataRaw['colonial_Location'].unique()

array(['Antigua and Barbuda', 'Barbados', 'Dominica', 'Grenada',
       'Guyana (Guiana): Dutch/English colonies «ara», «Essequibo», and «Berbice»',
       'Haiti (colony «Saint-Domingue»)', 'Jamaica', 'Montserrat', 'Cuba',
       'Netherlands Antilles (colonies «Aruba», «Bonaire», «Curaçao», «St. Eustacius»)',
       'French West Indies (colonies «Guiana», «Guadeloupe», «Martinique»)',
       'Danish West Indies (colonies «St. John», «St. Croix», and «St. Thomas»)',
       'Venezuela', 'Bermudas',
       ' North America (the Thirteen Colonies and the United States)',
       'Brazil (Colonial Brazil, United Kingdom with Portugal, independent empire)',
       'Southern Africa', 'East Indies',
       'Anti-Black Racism and Ideologies Relevant to Caribbean Economic Space',
       'Marine Navigation', 'African and European Logistics'],
      dtype=object)

In [112]:
caricomDataRaw.iloc[4:5]

Unnamed: 0,person,date,origin,o_confidence,colonial_Location,activities
4,Samuel Müller,,Africa,False,Barbados,



## Get the location when it is mentioned later in the text, e.g. "from the city of..."
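
The `origin` values listed earlier include tokenization noise such as 'a Zurich', 'the', 'where' or '1824'. The sketch below shows one possible way to handle this, under the assumption that leading articles, a small stop-word set and bare years are never valid origins. The names `clean_origin`, `origin_from_phrase`, `NOT_A_PLACE` and the regular expressions are hypothetical and not part of the pipeline above.

```python
import re

# Assumption: a leading article is tagging noise, not part of the place name
LEADING_ARTICLE = re.compile(r'^(?:a|an|the)\s+', flags=re.IGNORECASE)
# Assumption: these tokens, seen in the 'origin' column, are never places
NOT_A_PLACE = {'a', 'an', 'the', 'where', 'his', 'church', 'trade'}
# Matches "from the city of <Capitalized Name>" and similar phrasings
CITY_OF = re.compile(
    r'from the (?:city|town|canton) of\s+'
    r'([A-ZÀ-Ü][\w.-]+(?:[ -][A-ZÀ-Ü][\w.-]+)*)'
)

def clean_origin(origin):
    """Return a tidied origin string, or None when the value is not a place."""
    if origin is None:
        return None
    value = LEADING_ARTICLE.sub('', origin.strip())
    if not value or value.lower() in NOT_A_PLACE:
        return None
    if re.fullmatch(r'[\d\s–-]+', value):  # bare years or year ranges
        return None
    return value

def origin_from_phrase(text):
    """Look for a 'from the city of X' phrase anywhere in the item text."""
    match = CITY_OF.search(text)
    return match.group(1).rstrip('.,;') if match else None
```

When `clean_origin` returns None, `origin_from_phrase` could be tried on the full item text as a fallback before leaving the origin empty.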

## Use section name to retrieve JSON colonial location attribute

To do 
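
The section titles stored in `colonial_Location` are long (for example 'Haiti (colony «Saint-Domingue»)'). A possible follow-up step, sketched here under the assumption that everything before the first '(' or ':' is a sufficient label, is to shorten them. `short_location` is a hypothetical helper.

```python
import re

def short_location(section_title):
    """Keep only the part of a section title before the first '(' or ':'."""
    return re.split(r'\s*[(:]', section_title.strip(), maxsplit=1)[0]

examples = [
    'Haiti (colony «Saint-Domingue»)',
    ' North America (the Thirteen Colonies and the United States)',
    'Cuba',
]
short_names = [short_location(title) for title in examples]
```

Thematic sections such as 'Marine Navigation' pass through unchanged. Whether they should count as a colonial location at all is left open here.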

## Use predefined categories to retrieve the JSON type attribute 

# Scratch

### Old version of schema 1 test
This version is outdated. It is too restrictive: it matches only 18 items.

In [None]:
#Input: item is a single entry from text source 1 with NER tags (characterized by the '=>' starting string)
#Output: True if the text is structured as schema 1, False otherwise
#Requirements: is_date() function
#Description: Test whether the first elements of a text match schema 1, i.e. whether the first words follow the **Name** (*date*) from *city* pattern.
def schema1_test(item): 
    #note: ('PERSON' or 'ORGANIZATION') evaluates to 'PERSON' only, so membership is tested instead
    testValue = (item[2][1] in ('PERSON', 'ORGANIZATION')) and (item[3][0] == '(') and (is_date(item[4][0]) == True) and (item[5][0] == ')') and (item[6][0] == 'from') and (item[7][1] == 'LOCATION')
    return testValue

schema1_test(clean_ne)

What about multiple persons in a paragraph?
    -> one ID per person with same organization groups etc...
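
A rough sketch of the "one ID per person" idea, assuming each item has already been reduced to its index and the list of person names found in it. The function `explode_persons` and the `group_id` column are hypothetical. The names come from the Paul Coulon item shown earlier.

```python
import pandas as pd

def explode_persons(items):
    """items: list of (item_index, [person names]); returns one row per person."""
    rows = []
    for group_id, persons in items:
        for person in persons:
            # 'id' is unique per person, 'group_id' links persons of the same item
            rows.append({'id': len(rows), 'group_id': group_id, 'person': person})
    return pd.DataFrame(rows, columns=['id', 'group_id', 'person'])

persons_df = explode_persons([
    (8, ['Paul Coulon', 'Jacques Louis Pourtalès', 'Johann Jakob Thurneysen']),
])
```

Keeping a shared `group_id` makes it possible to copy item-level attributes (colonial location, activities) to every person of the same paragraph later on.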

In [None]:
dataSet = pd.DataFrame({
                     'id':[],
                     'person':[],
                     'location':[],
                     'period':[],})
dataSet

In [None]:
from collections import Counter #needed for the person counts below

person_list = []

for ent in tokens.ents:
    if ent.label_ == 'PERSON':
        person_list.append(ent.text)
        
person_counts = Counter(person_list).most_common(20)
df_person = pd.DataFrame(person_counts, columns =['text', 'count'])

In [None]:
len(classified_text)

In [None]:
json