### **Digital SongShu Project**
#### Last updated 2018-03-05 by Ruben G. Tsui

#### This script extracts the personal names relevant to the study of SongShu from Fagushan's <code>Buddhist_Studies_Person_Authority.xml</code> data file

In [76]:
import json, re
import pandas as pd
from bs4 import BeautifulSoup
from bs4.element import NavigableString

In [5]:
%%time
fin = r"C:\NLP\Raft\Song shu-20181231T032348Z-001\Fagushan\authority_person\Buddhist_Studies_Person_Authority.xml"

with open(fin, 'r', encoding='utf-8') as fi:
    data = fi.read()
    soup = BeautifulSoup(data, 'lxml')

Wall time: 21.4 s


#### Retrieve all persons whose <code> < note type="dynasty" >XX< /note ></code> value XX matches one of the target dynasties (e.g. '東晉'; the full set is given by the regex defined further down)

In [11]:
%%time
## How many different dynasties are represented in the data file?
notes = soup.find_all('note', {'type': 'dynasty'})
print('No. of <note type="dynasty"> tags', len(notes))

No. of <note type="dynasty"> tags 40352
Wall time: 5.8 s


In [20]:
%%time
dyn = []
for note in notes:
    dyn.append(note.text.strip())
dyn = list(set(dyn)) # convert to set to remove duplicates
#print(dyn)
## load the unique labels into a DataFrame for inspection
dyn_df = pd.DataFrame(dyn, columns=['Dynasty'])
dyn_df.head(30)

Wall time: 132 ms
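
Deduplicating with `set` shows which dynasty labels exist, but not how often each one occurs. A frequency count could help decide which labels are worth filtering on. The following is a minimal sketch using `collections.Counter`; the `labels` list is made up for illustration and stands in for the stripped text of the `<note type="dynasty">` tags.

```python
from collections import Counter

# Made-up labels standing in for [note.text.strip() for note in notes].
labels = ['劉宋', '東晉', '劉宋', '唐', '東晉', '劉宋']

counts = Counter(labels)

# Same members as list(set(labels)), but in a reproducible order.
unique_labels = sorted(counts)

print(len(unique_labels))       # number of distinct labels
print(counts.most_common(2))    # the two most frequent labels with their counts
```

Sorting the unique labels also makes the resulting DataFrame easier to compare between runs, since the iteration order of a `set` of strings is not guaranteed to be the same from one session to the next.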


In [53]:
filter_dyn = dyn_df['Dynasty'].str.contains(r'晉|劉宋|北魏|南梁|南齊|南北朝|南朝') # the pattern is interpreted as a regex

In [None]:
## These are the dynasty labels we use to filter the persons
dyn_df[filter_dyn]

In [59]:
regex_dynasty = re.compile(r'(晉|劉宋|北魏|南梁|南齊|南北朝|南朝)')

In [65]:
if regex_dynasty.search('  劉宋 '):
    print('success!')
else:
    print('failure!')

success!
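
Because `search` matches anywhere in the label, the single-character alternative `晉` accepts every label that contains it. The sketch below runs the same pattern against a few illustrative labels (assumed for demonstration, not read from the data file) to show which ones would pass the filter.

```python
import re

regex_dynasty = re.compile(r'(晉|劉宋|北魏|南梁|南齊|南北朝|南朝)')

# Illustrative labels only; the real values come from <note type="dynasty">.
sample_labels = ['東晉', '西晉', '後晉', '劉宋', '唐', ' 南朝宋 ']

# Keep every label in which the pattern occurs at any position.
matched = [label.strip() for label in sample_labels if regex_dynasty.search(label)]
print(matched)
```

Here `後晉` passes as well, since it contains `晉`. If such labels occur in the file and are not wanted, the alternative could be narrowed (for example to `東晉|西晉`); whether that is appropriate depends on the labels actually present in the data.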


In [69]:
%%time
# Keep only persons whose first dynasty note matches one of the target dynasties
regex_dynasty = re.compile(r'(晉|劉宋|北魏|南梁|南齊|南北朝|南朝)')

persons = soup.find_all('person')
target_persons = []
for person in persons:
    # first <note type="dynasty"> under this person, or None if there is none
    dyn_note = person.find('note', {'type': 'dynasty'})
    if dyn_note is not None and regex_dynasty.search(dyn_note.text.strip()):
        target_persons.append(person)

print(len(target_persons))

2263
Wall time: 6.59 s
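
The filtering logic can be tried on a small in-memory sample without re-parsing the full file. The sketch below is a stand-in under stated assumptions: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, and the XML fragment, the plain `id` attribute, and the names are invented for illustration rather than copied from the authority file.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment; the real element layout in the authority file may differ.
SAMPLE = """
<listPerson>
  <person id="P1"><persName>甲</persName><note type="dynasty">劉宋</note></person>
  <person id="P2"><persName>乙</persName><note type="dynasty">唐</note></person>
  <person id="P3"><persName>丙</persName></person>
</listPerson>
""".strip()

regex_dynasty = re.compile(r'(晉|劉宋|北魏|南梁|南齊|南北朝|南朝)')

def filter_by_dynasty(root, pattern):
    """Return <person> elements whose first dynasty note matches pattern."""
    kept = []
    for person in root.iter('person'):
        note = person.find(".//note[@type='dynasty']")
        # Persons without a dynasty note are skipped, as in the loop above.
        if note is not None and pattern.search((note.text or '').strip()):
            kept.append(person)
    return kept

root = ET.fromstring(SAMPLE)
kept_ids = [p.get('id') for p in filter_by_dynasty(root, regex_dynasty)]
print(kept_ids)
```

Only `P1` is kept: `P2` has a non-matching dynasty and `P3` has no dynasty note at all.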


In [None]:
target_persons[0:5]

In [95]:
## Extract useful info from this subset for export to an Excel file
PERSONS = []
for person in target_persons:
    PERSON = {}
    #print("No. of children: ", len(list(person.children)))
    PERSON['id'] = person['xml:id']
    canonical_name = person.find_all('persname', {'xml:lang':'zho-Hant'})
    if (len(canonical_name) > 0):
        PERSON['name'] = canonical_name[0].text
    alternative_name = person.find_all('persname', {'type':'alternative'})
    if (len(alternative_name) > 0):
        PERSON['alt_name1'] = alternative_name[0].text
    if (len(alternative_name) > 1):
        PERSON['alt_name2'] = alternative_name[1].text
    
    #for c in person.children:
    #    if isinstance(c, NavigableString):
    #        print(c)
    #    else:
    #        print(c.name, ":", c.text)
    PERSONS.append(PERSON)
person_df = pd.DataFrame(PERSONS)

columns = ['id','name','alt_name1','alt_name2']
person_df = person_df.reindex(columns, axis=1)


In [98]:
def sort_by_length_of_indicated_column(df, col, ascending=False):
    s = df[col].str.len().sort_values(ascending=ascending).index
    df_out = df.reindex(s)
    return df_out
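
To make the ordering explicit: pandas' `sort_values` places missing values last by default, so rows without a name end up at the bottom whether the sort is ascending or descending. The following plain-Python sketch mirrors that behaviour as an illustration; it is not the pandas implementation, and the row IDs are hypothetical.

```python
# Rows shaped like person_df records; None marks a missing canonical name.
rows = [
    {'id': 'r1', 'name': '慧集'},
    {'id': 'r2', 'name': None},
    {'id': 'r3', 'name': '真行大師'},
    {'id': 'r4', 'name': '端'},
]

def sort_by_name_length(rows, ascending=False):
    """Order rows by len(name); rows without a name always go last."""
    sign = 1 if ascending else -1
    return sorted(
        rows,
        key=lambda r: (r['name'] is None, sign * len(r['name'] or '')),
    )

ordered = [r['id'] for r in sort_by_name_length(rows)]
print(ordered)
```

With the default descending order, the longest names come first, so looking at the tail of the result surfaces the shortest names followed by any rows that have no name.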

In [102]:
person_df = sort_by_length_of_indicated_column(person_df, 'name')
person_df.tail(100)

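| | id | name | alt_name1 | alt_name2 |
|---|---|---|---|---|
| 1658 | A013789 | 師會 | 釋師會 | 法相 |
| 1657 | A013786 | 江泌 | 士清 | |
| 1656 | A013782 | 遵誨 | 釋遵誨 | 真行大師 |
| 1639 | A012800 | 趙珍 | | |
| 1621 | A011650 | 慧集 | 王虵之 | 王蛇之 |
| 1622 | A011716 | 僧端 | 端 | 尼師端 |
| 1623 | A011719 | 法洪 | | |
| 1625 | A011825 | 王薈 | 敬文 | 小奴 |
| 1626 | A011856 | 蔡謨 | 道明 | 文穆 |
| 1627 | A011872 | 曇恒 | | |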


### **IMPORTANT**: After running the export below, check the Excel file and delete common single-character names

In [None]:
%%time
person_df.to_excel('fagushan.persons.liusong.xlsx')
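
The manual check for single-character names could be narrowed down by flagging candidates first. The sketch below works on plain dictionaries shaped like the rows of `person_df`; the three records are copied from the output shown earlier, while the helper `single_char_names` and the `NAME_FIELDS` tuple are hypothetical names introduced here.

```python
# Records copied from the displayed output; missing names are None.
records = [
    {'id': 'A011716', 'name': '僧端', 'alt_name1': '端', 'alt_name2': '尼師端'},
    {'id': 'A012800', 'name': '趙珍', 'alt_name1': None, 'alt_name2': None},
    {'id': 'A011719', 'name': '法洪', 'alt_name1': None, 'alt_name2': None},
]

NAME_FIELDS = ('name', 'alt_name1', 'alt_name2')

def single_char_names(record):
    """Return the name fields whose value is exactly one character long."""
    return [
        field for field in NAME_FIELDS
        if isinstance(record.get(field), str) and len(record[field].strip()) == 1
    ]

# Map each affected id to the fields that hold a single-character name.
flagged = {}
for record in records:
    fields = single_char_names(record)
    if fields:
        flagged[record['id']] = fields

print(flagged)
```

This only lists candidates; deciding which single-character names are common enough to delete remains a manual judgment.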