# Hacking Buddhist Texts with NLP

**Improving Part-of-Speech and Dependency Tagging of Pre-Modern Literary Chinese Texts**

**Contributors:**  
>Tyler Seymour | Python & NLP | <tylerseymour@protonmail.com> | <https://tylerseymour.pw>  
>
>Kelsey Seymour | MPhil Linguistics & PhD East Asian Languages | <kelseymour@protonmail.com>


## Introduction

This proof-of-concept notebook uses a Modern Chinese spaCy model as a base to tokenize a short Buddhist verse and tag its parts of speech. Mistakes in the tags were corrected by hand to train an improved model for pre-modern literary Chinese texts. The notebook was built with pandas, spaCy, and Xiaoquan Kong's Chinese spaCy models, and it implements a part-of-speech tagger with visualizations rendered by displaCy.

The Chinese spaCy models generate their tags on the basis of Standard Mandarin. However, the texts that Chinese readers use are not written only in modern Chinese. Chinese characters can represent several different languages and registers, including specialized language like Legal Chinese; Chinese “dialects” like Cantonese, Wu, Min, and Hakka; and pre-modern literary forms like Classical Chinese, written vernacular Chinese, and poetic language. Chinese characters also historically served as the writing system for Korean, Japanese, and Vietnamese. All of these historical and current written languages differ from Standard Mandarin, some slightly and some greatly, in language features such as lexicon, morphology, syntax, and semantics.

Chinese religious texts, like the Buddhist one analyzed in this project, are generally written in a form of Classical Chinese, but they are still read by individuals and groups throughout the modern Sinophone world. This project demonstrates that the existing Chinese POS libraries are not yet sufficient to deal with these texts, or with other texts written in Chinese characters that diverge from Standard Mandarin. This project is a first step towards closing this gap.
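One way to put a number on this gap is to compare the tags a model predicts with hand-corrected tags for the same tokens. The sketch below is a hypothetical illustration only: the function name `tag_accuracy` and both tag lists are assumptions made for demonstration, not the project's evaluation code or real annotations. It also assumes that both lists share the same tokenization.

```python
def tag_accuracy(predicted, gold):
    """Return the share of positions where predicted and gold tags agree."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# Illustrative tags only (hypothetical, not real annotations).
predicted_tags = ["NN", "NN", "NNP", "VV"]
corrected_tags = ["VV", "NN", "NN", "VV"]

accuracy = tag_accuracy(predicted_tags, corrected_tags)
print(f"Tag accuracy: {accuracy:.2f}")
```

A per-tag breakdown would say more than a single score, but even this coarse measure makes it possible to compare a base model with a retrained one on the same passage.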

## Imports & Setup

In [1]:
import spacy
from spacy import displacy
from tabulate import tabulate
import pandas as pd
import zh_core_web_sm
import glob
import tarfile

print()
print("Complete!")


Complete!


## Hide Warnings

In [2]:
import warnings
warnings.simplefilter('ignore')
warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", category=DeprecationWarning)

print()
print("No more ugly warnings!")
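The cell above silences warnings for the whole session. Where that is too broad, warnings can be suppressed around a single call with the standard library's `warnings.catch_warnings()` context manager. This is a minimal sketch of that alternative; `run_quietly` and `noisy` are hypothetical helper names, not part of this notebook.

```python
import warnings


def run_quietly(func, *args, **kwargs):
    """Call func with warnings suppressed, then restore the previous filters."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return func(*args, **kwargs)


def noisy():
    # Stand-in for a library call that emits a deprecation warning.
    warnings.warn("this call is deprecated", DeprecationWarning)
    return "done"


result = run_quietly(noisy)
print(result)
```

Because the filters are restored when the `with` block exits, warnings raised elsewhere in the notebook remain visible.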




# POS Tagger

## Corpus Quick Loader

In [3]:
inputTitle = input("Title of Text [Enter for Default]: ")

if inputTitle == '':
    # No title entered: fall back to the built-in sample text.
    inputTitle = '匹夫之勇的故事'
    inputCharacters = "出处国语越语上勾践既许之乃致其众而誓之曰吾不欲匹夫之勇也欲其旅进旅退也.释义打仗不能光凭个人的勇敢要用智谋要靠集体的力量.故事春秋时越王勾践被吴王夫差打败在吴国囚禁三年受尽了耻辱回国后他决心自励图强立志复国.十年过去了越国国富民强兵马强壮将士们又次向勾践来请战君王越国的四方民众敬爱您就象敬爱自己的父母样.现在儿子要替父母报仇臣子要替君主报仇.请您再下命令与吴国决死战.勾践答应了将士们的请战要求把军土们召集在起向他们表示决心说我听说古代的贤君不为士兵少而忧愁只是忧愁士兵们缺乏自强的精神.我不希望你们不用智谋单凭个人的勇敢而希望你们步调致同进同退前进的时候要想到会得到奖赏后退的时候要想到会受到处罚这样就会得到应有的赏赐进不听令退不知耻会受到应有的惩罚.到了出征的时候越国的人都互相勉励.大家都说这样的国君谁能不为他效死呢由于全体将士斗志十分高涨终于打败了吴王夫差灭掉了吴国"
else:
    # A title was entered: ask for the text that goes with it.
    inputCharacters = input("Enter Characters: ")

print()
print("Complete.")
print()
print(inputTitle)
print()
print(inputCharacters)

Title of Text [Enter for Default]: Lotus Sutra
Enter Characters: 如是我聞：一時，佛住王舍城耆闍崛山中，與大比丘眾萬二千人俱，皆是阿羅漢，諸漏已盡，無復煩惱，逮得己利，盡諸有結，心得自在。

Complete.

Lotus Sutra

如是我聞：一時，佛住王舍城耆闍崛山中，與大比丘眾萬二千人俱，皆是阿羅漢，諸漏已盡，無復煩惱，逮得己利，盡諸有結，心得自在。


In [4]:
nlp = spacy.load('zh')
doc = nlp(inputCharacters)
doc.user_data["title"] = inputTitle
title = str(doc.user_data["title"])

print()
print("Title: " + title)
print()
print(doc)

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.898 seconds.
Prefix dict has been built succesfully.



Title: Lotus Sutra

如是我聞：一時，佛住王舍城耆闍崛山中，與大比丘眾萬二千人俱，皆是阿羅漢，諸漏已盡，無復煩惱，逮得己利，盡諸有結，心得自在。


## Tokenize

In [5]:
headers = [
    'text', 'lemma_', 'pos_', 'tag_', 'dep_', 'shape_', 'is_alpha', 'is_stop',
    'has_vector', 'ent_iob_', 'ent_type_', 'vector_norm', 'is_oov'
]

doc_data = []

for token in doc:
    token_data = [
        token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
        token.shape_, token.is_alpha, token.is_stop, token.has_vector,
        token.ent_iob_, token.ent_type_, token.vector_norm, token.is_oov
    ]
    doc_data.append(token_data)

display(pd.DataFrame(doc_data, columns=headers))

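|   | text | lemma_ | pos_ | tag_ | dep_ | shape_ | is_alpha | is_stop | has_vector | ent_iob_ | ent_type_ | vector_norm | is_oov |
|---|------|--------|------|------|------|--------|----------|---------|------------|----------|-----------|-------------|--------|
| 0 | 如是 | 如是 | X | NN | nmod | xx | True | False | True | O |  | 5.184927 | False |
| 1 | 我聞 | 我聞 | X | NN | nmod | xx | True | False | False | O |  | 0.0 | True |
| 2 | ： | ： | X | , | nsubj | ： | False | False | True | O |  | 8.478117 | False |
| 3 | 一時 | 一時 | X | NNP | acl | xx | True | False | False | B | DATE | 0.0 | True |
| 4 | ， | ， | X | NNP | advmod | ， | False | False | True | O |  | 4.587259 | False |
| 5 | 佛住 | 佛住 | X | NNP | ROOT | xx | True | False | False | B | PERSON | 0.0 | True |
| 6 | 王舍城 | 王舍城 | X | NN | nmod | xxx | True | False | True | B | PERSON | 2.736184 | False |
| 7 | 耆 | 耆 | X | VV | nummod | x | True | False | True | O |  | 1.696108 | False |
| 8 | 闍 | 闍 | X | SFN | nmod | x | True | False | False | O |  | 0.0 | True |
| 9 | 崛 | 崛 | X | NNP | nmod | x | True | False | True | O |  | 0.715596 | True |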
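The `is_oov` column gives a quick signal of how poorly the base model's vocabulary covers this passage. The sketch below computes the out-of-vocabulary share for the ten rows shown above. The `(text, is_oov)` pairs are copied from that output, while the helper name `oov_share` is an assumption introduced for illustration.

```python
# (text, is_oov) pairs copied from the ten rows of output shown above.
rows = [
    ("如是", False), ("我聞", True), ("：", False), ("一時", True),
    ("，", False), ("佛住", True), ("王舍城", False), ("耆", False),
    ("闍", True), ("崛", True),
]


def oov_share(pairs):
    """Return the fraction of tokens flagged as out of vocabulary."""
    if not pairs:
        return 0.0
    return sum(1 for _, is_oov in pairs if is_oov) / len(pairs)


share = oov_share(rows)
print(f"{share:.0%} of the displayed tokens are out of vocabulary")
```

In a live session the same figure could be computed from the parsed document with `sum(token.is_oov for token in doc) / len(doc)`.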


## Visualize Each Sentence

In [6]:
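import os

# Note: splitting on '。' leaves a trailing empty string when the text ends
# with '。', which is why an extra, empty diagram is exported below.
sentences = inputCharacters.split('。')

headers = [
    'text', 'lemma_', 'pos_', 'tag_', 'dep_', 'shape_', 'is_alpha',
    'is_stop', 'has_vector', 'ent_iob_', 'ent_type_', 'vector_norm',
    'is_oov'
]

# Make sure the export directory exists before writing any SVG files.
os.makedirs("./zh-exports", exist_ok=True)

for count, sentence in enumerate(sentences, start=1):
    doc = nlp(sentence)
    doc.user_data["title"] = str(count)

    doc_data = []

    for token in doc:
        token_data = [
            token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop, token.has_vector,
            token.ent_iob_, token.ent_type_, token.vector_norm, token.is_oov
        ]
        # Append inside the token loop so every token is kept, not only the last.
        doc_data.append(token_data)

    # Per-sentence table, kept for inspection (not displayed).
    sentence_df = pd.DataFrame(doc_data, columns=headers)

    # Render inline in the notebook, then render again as an SVG string to save.
    displacy.render(doc, style="dep", jupyter=True, options={'distance': 125})
    svg = displacy.render(doc, style="dep", jupyter=False)
    path = ("./zh-exports/_" + str(count) + "_" + title + ".svg")
    with open(path, "w+", encoding="utf-8") as f:
        f.write(svg)
        print("Exported " + path)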


Exported ./zh-exports/_1_Lotus Sutra.svg



Exported ./zh-exports/_2_Lotus Sutra.svg
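The second export above comes from the empty string that `str.split` returns after the final `。`. A small helper can drop such empty segments before they reach the parser. This is a sketch under that assumption; `split_sentences` is a hypothetical name, not a spaCy function.

```python
def split_sentences(text, terminator="。"):
    """Split text on the terminator and drop empty or whitespace-only parts."""
    segments = (segment.strip() for segment in text.split(terminator))
    return [segment for segment in segments if segment]


# The Lotus Sutra passage entered in the loader cell.
text = ("如是我聞：一時，佛住王舍城耆闍崛山中，與大比丘眾萬二千人俱，"
        "皆是阿羅漢，諸漏已盡，無復煩惱，逮得己利，盡諸有結，心得自在。")

print(len(text.split("。")))        # plain split keeps the empty tail
print(len(split_sentences(text)))   # helper keeps only real sentences
```

As in the original cell, the terminator itself is removed from each sentence.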


# Utilities

In [7]:
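# Optional: uncomment the next two lines to extract the model archive.
# tf = tarfile.open("./zh-exports/zh_core_web_sm-2.0.4.tar.gz")
# tf.extractall()

# Note: this message prints even while the extraction lines are commented out.
print()
print("Extraction Complete. ")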


Extraction Complete. 
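A slightly more defensive version of the extraction step checks that the archive exists and closes it automatically. The sketch below assumes the archive path from the cell above. `extract_archive` is a hypothetical helper name, and it should only be used on archives from a trusted source.

```python
import os
import tarfile


def extract_archive(archive_path, target_dir="."):
    """Extract a tar archive if it exists; return True when extraction ran."""
    if not os.path.isfile(archive_path):
        print("Archive not found: " + archive_path)
        return False
    # The context manager closes the archive even if extraction fails.
    with tarfile.open(archive_path) as archive:
        archive.extractall(target_dir)
    print("Extraction complete.")
    return True


# Path taken from the commented-out cell above.
extract_archive("./zh-exports/zh_core_web_sm-2.0.4.tar.gz")
```

Returning a boolean lets the completion message reflect what actually happened, rather than printing unconditionally.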


# References

## Chinese NLP Library:
1. https://github.com/howl-anderson/Chinese_models_for_SpaCy/blob/master/README.en-US.md
2. https://github.com/howl-anderson/Chinese_models_for_SpaCy



## I should probably look at these:
1. https://github.com/howl-anderson/Chinese_models_for_SpaCy/blob/master/workflow.md
2. https://github.com/thunlp/Chinese_Rumor_Dataset