# TF-IDF analysis for Chinese Buddhist Texts

In this tutorial, we will use a technique called *TF-IDF* to analyze a work from the Chinese Buddhist Canon using Python.


TF-IDF stands for *term frequency-inverse document frequency*. It is a common method for estimating the relative importance of a term within a document, based on how frequent the term is in that document compared with the other documents in a larger corpus. It is used, for instance, as a simple way to rank the relevance of web pages to a given search term.
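To make the definition concrete, here is a minimal sketch of the calculation on a toy corpus of three already-tokenized "documents". The corpus and the function names (`term_frequency`, `inverse_document_frequency`, `tf_idf`) are made up for illustration. The formula used, tf × log(N / df), is only one of several common variants; libraries such as scikit-learn add smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

# A toy corpus: each "document" is already split into words (made-up data).
corpus = [
    ["師", "弟子", "依止", "師"],
    ["戒", "弟子", "戒"],
    ["衣", "鉢", "戒"],
]

def term_frequency(term, document):
    """Share of the document's tokens that are `term`."""
    return Counter(document)[term] / len(document)

def inverse_document_frequency(term, documents):
    """log(N / df). Assumes `term` occurs in at least one document."""
    document_count = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / document_count)

def tf_idf(term, document, documents):
    """Term frequency weighted by how rare the term is across documents."""
    return term_frequency(term, document) * inverse_document_frequency(term, documents)

first_doc = corpus[0]
for term in ["師", "弟子"]:
    print(term, round(tf_idf(term, first_doc, corpus), 3))
```

In this toy example 師 scores higher than 弟子 for the first document, because it is both more frequent there and absent from the other two documents.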

Research question: What are we trying to find out: Something we already know?


For this tutorial, we will analyze a text called the _Xingshichao_ 行事鈔. This is a work about monastic rules and regulations, written in the 7th century by a Chinese monk called Daoxuan 道宣. In principle, however, the same technique can be applied to any text or collection of texts from the Chinese Buddhist canon, or indeed to any text in Chinese, Buddhist or otherwise. With a few changes it could also be used on non-Chinese texts (indeed, this technique was originally designed for non-Chinese texts, so it is perhaps more easily used with such works).

## Importing the necessary libraries
First, we will import the libraries we will use. We don't necessarily have to do this first, but it's good to have them all in one place in our script.

- `sys` provides access to system-specific parameters and functions
    - but I'm not sure yet where this is needed
- `os` provides functions for interacting with the operating system
    - in this case it is used for retrieving file names
- `jieba` is a Chinese-language text segmentation library
    - a.k.a. a tokenizer
    - Technically, I think it is possible to implement TF-IDF without tokenizing, but that is beyond the scope of this tutorial
- `re` is Python's regular expression (regex) module
    - This is used to make sure we get the files we want, and not the ones we don't want

In [2]:
import sys
import os 
import jieba
import re

## Getting the names of the files we will work with

The next bit of code uses the `os` library to get a list of all the files within a particular directory: the directory that holds the files we will analyze.

In [3]:
list_of_files_in_dir = os.listdir('../T1804/By_Chapter')  # or '/Users/thomasnewhall/Desktop/GitHub/SP-2020-DH-projects-New/T1804/By_Chapter'

list_of_files_in_dir

['T1804_ch27_諸雜要行篇.txt',
 'T1804_ch08_受戒緣集篇.txt',
 'T1804_ch24_導俗化方篇.txt',
 'T1804_ch19_鉢器制聽篇.txt',
 'T1804_ch18_四藥受淨篇.txt',
 'T1804_ch13_篇聚名報篇.txt',
 'T1804_ch06_結界方法篇.txt',
 'T1804_ch03_足數眾相篇.txt',
 'T1804_ch17_二衣總別篇.txt',
 'T1804_ch11_安居策修篇.txt',
 'T1804_ch16_懺六聚法篇.txt',
 'T1804_ch05_通辨羯磨篇.txt',
 'T1804_ch29_尼眾別行篇(old).txt',
 'T1804_ch30_諸部別行篇.txt',
 'T1804_ch21_頭陀行儀篇.txt',
 'T1804_ch14_隨戒釋相篇.txt',
 'T1804_ch10_說戒正儀篇.txt',
 'T1804_ch12_自恣宗要篇.txt',
 'T1804_ch23_計請設則篇.txt',
 'T1804_ch02_集僧通局篇.txt',
 'T1804_ch26_瞻病送終篇.txt',
 'T1804_ch25_主客相待篇.txt',
 'T1804_ch09_師資相攝篇.txt',
 'T1804_ch04_受欲是非篇.txt',
 'T1804_ch07_僧網大綱篇.txt',
 'readme.txt',
 'T1804_ch28_沙彌別行篇.txt',
 'T1804_ch00_序.txt',
 'T1804_ch22_僧像致敬篇.txt',
 'T1804_ch29_尼眾別行篇.txt',
 'T1804_ch20_對施興治篇.txt',
 'T1804_ch15_持犯方軌篇.txt',
 'T1804_ch01_標宗顯德篇.txt']

## About the files

The files for this tutorial, found within the directory `T1804/By_Chapter`, correspond to the thirty chapters, plus the preface, of the aforementioned text, the _Xingshichao_. For the time being, the files within the main directory `T1804` contain the full text divided into twelve "scrolls" (_juan_ 卷) rather than chapters. This is a traditional way to divide Chinese texts, and it also corresponds to the way that CBETA saves its files.

The indicator T1804 can be understood as "Taishō text number 1804", which is the text number in the original "Taishō" printed edition, from which the electronic (CBETA) edition was created. This "Taishō number" is the standard way to refer to such texts academically. The number after `ch` refers to the chapter number, with the preface labeled `ch00`, while the Chinese text after the chapter number (e.g. `標宗顯德篇`) is the title of the chapter in Chinese.

Although it should be possible to retrieve these using an API, they were simply copied and pasted from the online CBETA viewer, https://cbetaonline.dila.edu.tw/zh/T1804_001. Thus, as indicated in the `readme.txt` file, there may be some slight problems with missing lines, etc., when compared with the full text file, but every effort was made to copy them as completely as possible.

## Getting just the files we need

Notice that the names of the files are now saved in a variable called `list_of_files_in_dir`. Perhaps not the best variable name, but whatever.

Notice also that there are several files, such as `readme.txt`, which are good to keep in the folder but are not necessary for our analysis.

In order to get only the files we need, we can use a regular expression to find all the files that match a specific file naming pattern.

In [6]:
# This chunk of python gets files according to a particular filename pattern in regex (and ignores other files)

#establishes a list variable called "files"
files = [] 

# uses a for loop to iterate through each file_name in the list_of_files_in_dir, created above
for file_name in list_of_files_in_dir:

    #uses the re.finditer function to find file names that match the particular pattern we are looking for
    for match in re.finditer(r"T1804\_ch[0-9]{2}\_.{1,5}\.txt", file_name):
        
        # appends the regex match to the "files" list 
        files.append(match.group(0))

print(files) #prints out the new list of files
len(files) #prints the number of items in that list; should be equal to 31 because we have 30 chapters + a preface 

['T1804_ch27_諸雜要行篇.txt', 'T1804_ch08_受戒緣集篇.txt', 'T1804_ch24_導俗化方篇.txt', 'T1804_ch19_鉢器制聽篇.txt', 'T1804_ch18_四藥受淨篇.txt', 'T1804_ch13_篇聚名報篇.txt', 'T1804_ch06_結界方法篇.txt', 'T1804_ch03_足數眾相篇.txt', 'T1804_ch17_二衣總別篇.txt', 'T1804_ch11_安居策修篇.txt', 'T1804_ch16_懺六聚法篇.txt', 'T1804_ch05_通辨羯磨篇.txt', 'T1804_ch30_諸部別行篇.txt', 'T1804_ch21_頭陀行儀篇.txt', 'T1804_ch14_隨戒釋相篇.txt', 'T1804_ch10_說戒正儀篇.txt', 'T1804_ch12_自恣宗要篇.txt', 'T1804_ch23_計請設則篇.txt', 'T1804_ch02_集僧通局篇.txt', 'T1804_ch26_瞻病送終篇.txt', 'T1804_ch25_主客相待篇.txt', 'T1804_ch09_師資相攝篇.txt', 'T1804_ch04_受欲是非篇.txt', 'T1804_ch07_僧網大綱篇.txt', 'T1804_ch28_沙彌別行篇.txt', 'T1804_ch00_序.txt', 'T1804_ch22_僧像致敬篇.txt', 'T1804_ch29_尼眾別行篇.txt', 'T1804_ch20_對施興治篇.txt', 'T1804_ch15_持犯方軌篇.txt', 'T1804_ch01_標宗顯德篇.txt']


31

## About the regex

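So you may be wondering what the bit `r"T1804\_ch[0-9]{2}\_.{1,5}\.txt"` means. Reading from left to right:

- `r"..."` marks a raw string, so Python passes the backslashes through to the regex engine unchanged
- `T1804\_ch` matches the literal text `T1804_ch` (the underscore does not actually need the backslash, but escaping it is harmless)
- `[0-9]{2}` matches exactly two digits, i.e. the chapter number
- `\_` matches the underscore after the chapter number
- `.{1,5}` matches between one and five characters of any kind, i.e. the chapter title, which is one character long for the preface (`序`) and five characters long for the chapters
- `\.txt` matches the literal file extension `.txt`; the backslash is needed because an unescaped `.` means "any character"

This is why `readme.txt` and `T1804_ch29_尼眾別行篇(old).txt` are left out: the first does not begin with `T1804_ch`, and the second has more than five characters between the second underscore and `.txt`.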
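The pattern can be tried out on a few of the file names from the directory listing. This is a small sketch rather than the code used in the cell above: it uses `re.fullmatch` instead of `re.finditer`, which asks whether the whole name fits the pattern, and the variable names `sample_names` and `matched` are made up here.

```python
import re

# The same pattern as in the cell above.
pattern = r"T1804\_ch[0-9]{2}\_.{1,5}\.txt"

# Three names copied from the directory listing earlier in this tutorial.
sample_names = [
    "T1804_ch09_師資相攝篇.txt",       # a chapter file
    "T1804_ch29_尼眾別行篇(old).txt",  # ten characters after "ch29_", too many for .{1,5}
    "readme.txt",                      # does not start with "T1804_ch"
]

# Keep only the names that fit the pattern from start to end.
matched = [name for name in sample_names if re.fullmatch(pattern, name)]
print(matched)
```

Only the first name is kept; the other two fail for the reasons given in the comments.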

## Getting TF-IDF for one file first

So we have 31 files we want to work with, and eventually we'd like to be able to do this all in one go, but for now, let's just make sure this is working with one file. The file we will work with is Chapter 9, `T1804_ch09_師資相攝篇.txt`.

I happen to be somewhat familiar with this chapter; I've translated part of it and generally know its content. As you may be able to tell from the title, _shizi xiangshe pian_ 師資相攝篇, this is a chapter about _shizi_ 師資, or "masters and disciples," and the title can be translated as "Chapter on the Relationship Between Masters and Disciples." Thus, we would expect words related to "masters" and "disciples" to be relatively more common in this chapter than in the text as a whole. Let's see if it works!

The first thing we need to do is get the file into Python. There are many ways to do that, but one common way is to use a `with` statement together with `open`, followed by the file's `read` method.

In [7]:
# This reads in the file and saves its contents as a single string.
# For very large files it may be better not to hold everything in memory at once, but this works for now.

# The `with` statement opens the file and closes it again automatically when the block ends, even if an error occurs along the way.
with open("../T1804/By_Chapter/T1804_ch09_師資相攝篇.txt", "r", encoding="utf-8") as ch9:
    file_as_string = ch9.read()  # the `read` method "reads" the whole file and returns it as a string

# print it out, just to be sure everything is ok
print(file_as_string) 

師資相攝篇第九

佛法增益廣大寔由師徒相攝。互相敦遇財法兩濟。日積業深行久德固者皆賴斯矣。比玄教陵遲慧風揜扇。俗懷悔慢道出非法。並由師無率誘之心。資闕奉行之志。二彼相捨妄流鄙境。欲令光道焉可得乎。故拯倒懸之急。授以安危之方。幸敬而行之。則永無法滅。就中初明弟子依止。後明二師攝受。初中分二。初明師弟名相。後明依止法。問云何名師和尚闍梨。答此無正翻。善見云。無罪見罪訶責。是名我師。共於善法中教授令知故。是我闍梨。論傳云。和尚者外國語。此云知有罪知無罪。是名和尚。四分律弟子訶責和尚中亦同。明了論正本云優波陀訶。翻為依學。依此人學戒定慧故。即和尚是也。方土音異耳。相傳云。和尚為力生(道力由成)。闍梨為正行(能糾正弟子行)。未見經論。雜含中外道亦號師為和尚。弟子者。學在我後。名之為弟。解從我生。名之為子。次總相攝。尸迦羅越六方禮經弟子事師有五事。一當敬難之。二當念其恩。三所有言教隨之。四思念不厭。五從後稱譽之。師教弟子亦有五事。一當令疾知。二令勝他人弟子。三令知己不忘。四有疑悉解。五欲令智慧勝師。僧祇師度弟子者不得為供給自己故。度人出家者得罪。當使彼人因我度故修諸善法得成道果。四分云。和尚看弟子當如兒意。弟子看和尚當如父想。準此兒想應具四心。一匠成訓誨。二慈念。三矝愛。四攝以衣食。如父想者。亦具四心。一親愛。二敬順。三畏難。四尊重。敬養侍接如臣子之事君父。故律云。如是展轉相敬重相瞻視。能令正法便得久住增益廣大。二明依止法。先明應法。二明正行。初中言得不依止者。八人。四分六種。一樂靜。二守護住處。三有病。四看病。五滿五歲已上行德成就。六自有智行住處無勝己者。七飢儉世無食。十誦云。若恐餓死。當於日日見和尚處住。恐不得者。若五日十五日若二由旬半若至自恣時。一一隨緣如上來見和尚。八行道稱意所。五分。諸比丘各勤修道無人與依止。當於眾中上座大德心生依止敬如師法而住。二須依止人十種。四分云。一和尚命終。二和尚休道。三和尚決意出界。四和尚捨畜眾。五弟子緣離他方。六弟子不樂住處更求勝緣。七未滿五夏。八不諳教網。文云。若愚癡無智者盡壽依止。此約行教明之。十誦。受戒多歲不知五法盡形依止。一不知犯。二不知不犯。三不知輕。四不知重。五不誦廣戒通利。毘尼母。若百臘不知法者。應從十臘者依止。僧祇中四法。不善知毘尼。不能自立。不能立他。盡形依止。九或愚或智。愚謂性戾癡慢數犯眾罪。智謂犯已即知依法懺洗。志非貞

## Text segmentation (tokenization) of the text

"Tokenizing" basically means splitting up the text into individual words. Although this seems like it should be straightforward, it is actually pretty complex, and the way to do it varies depending on the language.

### Tokenizing in English

In English, for example, we can do a very simple tokenization by using the spaces between words to indicate where the breaks are.

One simple way to tokenize English text is with the `.split` method, which splits up a string wherever a given separator, here a space, occurs.

For example:

```
"Penny bought bright blue fishes".split(" ")
```
will output:

```
['Penny', 'bought', 'bright', 'blue', 'fishes']
```

There are of course more sophisticated ways to do this, for example, if you wanted an expression like 'The White House' to be understood as a single 'word' or expression. For that, you can use a proper tokenizer from one of the NLP packages (though you will probably have to customize it).

(the above example comes from "Using TF-IDF with Chinese" Tutorial)
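One limitation of splitting on spaces is that punctuation stays attached to the neighbouring word. The sketch below contrasts `.split` with a simple regular-expression tokenizer. The sentence is made up, and the pattern `\w+` is just one simple choice, not what a full NLP tokenizer would do.

```python
import re

sentence = "Penny bought bright blue fishes, then went home."

# Splitting on spaces keeps punctuation glued to the words ("fishes," and "home.").
print(sentence.split(" "))

# \w+ matches runs of letters, digits, and underscores, so punctuation is dropped.
print(re.findall(r"\w+", sentence))
```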

### Tokenizing in Chinese

We can do basically the same thing in Chinese:

For example, consider the first sentence in this chapter `"佛法增益廣大寔由師徒相攝。"`

Using the `list` function, we can split the string into a list of its individual characters:

```
list("佛法增益廣大寔由師徒相攝。")
```
Which returns the list:

```
['佛', '法', '增', '益', '廣', '大', '寔', '由', '師', '徒', '相', '攝', '。']

```

The problem is that, though this will work in some cases, for the most part, Chinese words are not just single characters, but combinations of several characters.

##  Tokenizing with `jieba`

So, to do something more sophisticated with Chinese text, we can use the `jieba` library.

- n.b. there are surely other Python libraries that do similar things
- similarly, I know `jieba` is included as part of some other NLP libraries
- alternatively, it might make sense to make your own tokenizer
    - for example, there is Paul Vierthaler's
    - and I may do my own tutorial on making a tokenizer
- but `jieba` is pre-made, so that's nice
- also, for Japanese, you can use the `mecab` library for various NLP tasks, but that is something different and I don't really know about it at this point

There are several functions within `jieba` that can segment/tokenize text. To learn more about these, visit the documentation in the [`jieba` GitHub repository](https://github.com/fxsjy/jieba).

We will use `jieba.lcut` for this tutorial, though there are also `.cut`, `.cut_for_search`, and `.lcut_for_search`. `.lcut` and `.lcut_for_search` tokenize Chinese text into a list (hence the "`l`" in the name), while `.cut` and `.cut_for_search` return a generator instead.

In [13]:
# we will now tokenize the same text as above using jieba
seg_list = jieba.lcut("佛法增益廣大寔由師徒相攝。互相敦遇財法兩濟。日積業深行久德固者皆賴斯矣")
seg_list

['佛法',
 '增益',
 '廣大',
 '寔',
 '由',
 '師徒',
 '相攝',
 '。',
 '互相',
 '敦遇',
 '財法',
 '兩濟',
 '。',
 '日積業',
 '深行久德',
 '固者',
 '皆',
 '賴斯',
 '矣']

Notice now that several of these words are split into two-character terms such as _fofa_ 佛法 ("the teaching of the Buddhas"), _zengyi_ 增益 ("to increase and flourish") and _xiangshe_ 相攝 ("mutual dependence"). This corresponds more closely to how we, as human readers, would understand the text.

## Using a specialized dictionary for Jieba

Now, it's not obvious from the above examples, but `jieba` is designed for *Modern* Chinese, whereas the text we are working with is written not only in *Classical* Chinese, whose grammar differs from that of Modern Chinese, but also in what is sometimes called *Buddhist* Chinese, which uses a very distinct vocabulary and occasionally unique or distinctive grammar.

To account for this distinct vocabulary, `jieba` allows us to supply our own dictionary of words on which to base its tokenization.

The dictionary we will use is also taken from CBETA: a digitized version of Ding Fubao's 丁福保 _Foxue dacidian_ 佛學大辭典 (Ding Fubao's Dictionary of Buddhist Studies) (found at XXX website).

This is a modern dictionary, and it may be incomplete or otherwise weird for our purposes. There may be better solutions, but it is fine for now.

As for the grammar: TF-IDF is basically a "grammar-agnostic" method (my term), or a "bag of words" model. That is, it only looks at which words occur and how often, so it doesn't really require that we analyze the grammar, though for other methods this would be important.
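The "bag of words" idea can be illustrated in a few lines: once a text is reduced to counts, word order no longer matters. The token list is the first sentence of the chapter as segmented by `jieba` earlier in this tutorial; reversing it is just an artificial way of scrambling the order, and the variable names are made up for this sketch.

```python
from collections import Counter

# Tokens of the chapter's first sentence, as segmented earlier in this tutorial.
tokens = ["佛法", "增益", "廣大", "寔", "由", "師徒", "相攝", "。"]

# The same tokens in reverse order: no longer a sentence, but the same "bag".
scrambled = list(reversed(tokens))

# The word order differs, yet the word counts are identical.
print(tokens == scrambled)
print(Counter(tokens) == Counter(scrambled))
```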

In [17]:
dictionary_file_path = "../dfb_headwords_from_beautifulsoup.txt" # or '/Users/thomasnewhall/Desktop/GitHub/SP-2020-DH-projects-New/dfb_headwords_from_beautifulsoup.txt'

jieba.load_userdict(dictionary_file_path)

The file is of the form xxx

The above code loads the file into `jieba` for use.

In [19]:
seg_list = jieba.lcut("佛法增益廣大寔由師徒相攝。")
seg_list

['佛法', '增益', '廣大', '寔', '由', '師徒', '相攝', '。']

We run the code again and notice that there is not much of a difference, so perhaps that was not completely necessary :(

What if we run it on the whole file?

In [26]:
tokenized_text = jieba.lcut(file_as_string)
print(tokenized_text, len(tokenized_text))

['師資', '相攝', '篇', '第九', '\n', '\n', '佛法', '增益', '廣大', '寔', '由', '師徒', '相攝', '。', '互相', '敦遇', '財法', '兩濟', '。', '日積業', '深行', '久德', '固者', '皆', '賴斯', '矣', '。', '比', '玄教', '陵', '遲慧', '風', '揜', '扇', '。', '俗懷悔', '慢道', '出', '非法', '。', '並由師', '無率', '誘之心', '。', '資闕', '奉行', '之志', '。', '二彼相', '捨', '妄流', '鄙境', '。', '欲令', '光道', '焉', '可得乎', '。', '故拯', '倒懸', '之急', '。', '授', '以', '安危', '之方', '。', '幸敬而行', '之', '。', '則永', '無法', '滅', '。', '就', '中', '初明', '弟子', '依止', '。', '後', '明', '二師', '攝受', '。', '初中', '分二', '。', '初', '明師', '弟', '名相', '。', '後', '明', '依止', '法', '。', '問', '云何', '名師', '和', '尚闍梨', '。', '答此', '無正', '翻', '。', '善見', '云', '。', '無罪', '見', '罪訶責', '。', '是', '名', '我師', '。', '共', '於', '善法', '中', '教授', '令知', '故', '。', '是', '我', '闍梨', '。', '論傳云', '。', '和尚', '者', '外國語', '。', '此云', '知有罪知無罪', '。', '是', '名', '和尚', '。', '四分律', '弟子', '訶責', '和尚', '中亦同', '。', '明了論', '正本', '云', '優波陀', '訶', '。', '翻為', '依學', '。', '依', '此人', '學', '戒定慧', '故', '。', '即', '和尚', '是', '也', '。', '方', '土音', '異耳', '。', '相傳', '云', '。', '和尚'

so many words

We notice that, although in general it looks OK, a number of the tokens here should perhaps be treated as *stopwords*, such as the *newline* characters (`'\n'`), punctuation marks (`'。'`), and parentheses (`'('` and `')'`), which are not words per se and are just "noise" for our analysis.
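Removing such tokens by hand is straightforward. The sketch below filters the first few tokens of the output above against a small stop list; the stop list and the variable names are assumptions made for illustration.

```python
# The first few tokens from the output above.
tokens = ["師資", "相攝", "篇", "第九", "\n", "\n", "佛法", "增益", "廣大",
          "寔", "由", "師徒", "相攝", "。"]

# A small, hand-made stop list (an assumption for illustration).
stop_words = {"\n", "。", "(", ")"}

# Keep only the tokens that are not in the stop list.
content_tokens = [token for token in tokens if token not in stop_words]
print(content_tokens)
print(len(tokens), len(content_tokens))
```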

## OK, let's "vectorize" 'em

So, to get the TF-IDF score for the words, we first need to get the TF, or term frequency, score. To do that, we first have to count the number of times each word appears in the document, and for that we can use a tool called a vectorizer.

"Vectorizing" a text basically means converting the text into a numerical chart, where each term has an index number and the count of each term is the value stored at that index. In mathematical terms, this "chart" is considered a "vector". Basically, the idea is to take text and make it purely numerical, so that you can count it and do various mathematical manipulations on it.

The simplest kind of vectorizing for our purposes is "count vectorizing," whereby each word is simply counted. That is, if a word appears once, it is assigned a `1`; if it appears twice, its value is `2`; and so on. Fancy words, but not rocket science.

To implement this, we first need to import the `CountVectorizer` class from `scikit-learn`.

- n.b. there are probably other libraries that do this, and indeed, there are easy ways to do this by hand, but this is one standard way to do it, so we're gonna run with it
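As a rough idea of what doing this "by hand" might look like, the sketch below builds a count vector with the standard library only. The token sequence is made up, and this mimics the idea of count vectorizing rather than the internals of `CountVectorizer`.

```python
from collections import Counter

# A short, made-up sequence of tokens.
tokens = ["弟子", "依止", "和尚", "弟子", "和尚", "弟子"]

counts = Counter(tokens)

# Sorting the vocabulary gives each term a fixed position (its index)...
vocabulary = sorted(counts)
# ...and the vector holds each term's count at that position.
vector = [counts[term] for term in vocabulary]

print(vocabulary)
print(vector)
```

Here the terms are sorted by Unicode code point, so the vector `[1, 2, 3]` lines up with `['依止', '和尚', '弟子']`.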

In [27]:
# import the `CountVectorizer` class from `scikit-learn`

from sklearn.feature_extraction.text import CountVectorizer  # the `from ... import ...` statement imports only one name from a larger library. We could import all of scikit-learn, but it is rather big and we don't need it all, so we import just what we need for now

## Setting up the vectorizer program

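Here we define a short list of "stop words" (tokens to be left out of the counts) and create a `CountVectorizer` that uses `jieba.lcut` as its tokenizer, so that the text is split into words by `jieba` rather than by scikit-learn's default rule.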

In [53]:
chinese_stop_words = ['。','(',')','[',']','1','\n'] # jieba.analyse.set_stop_words(file_name) might also work here, but it requires a file name, so I'm not sure

count_vectorizer = CountVectorizer(tokenizer=jieba.lcut,  stop_words=chinese_stop_words)

Next, we put our text into a list of documents and run the vectorizer on it:

In [54]:
documents = [] 

documents.append(file_as_string)

count_vectorized = count_vectorizer.fit_transform(documents)

In [55]:
count_vectorized.toarray()

array([[10,  3,  1, ...,  1,  1,  1]])

In [56]:
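# note: `get_feature_names()` was removed in scikit-learn 1.2; on newer versions, use `get_feature_names_out()` instead
count_vectorizer.get_feature_names()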

['一',
 '一一',
 '一七',
 '一作',
 '一切',
 '一則',
 '一匠',
 '一坐',
 '一夜',
 '一子',
 '一宿',
 '一宿還',
 '一席',
 '一月',
 '一條線',
 '一樂靜',
 '一死',
 '一當令',
 '一當敬',
 '一白事離',
 '一種',
 '一衣',
 '一請',
 '七',
 '七尺',
 '七明',
 '七未',
 '七滿',
 '七飢儉',
 '三',
 '三不應',
 '三世',
 '三令',
 '三休',
 '三則',
 '三十五',
 '三千威儀',
 '三子',
 '三惡道',
 '三慚',
 '三明',
 '三時',
 '三時教',
 '三有',
 '三毒',
 '三畏難',
 '三百',
 '三種',
 '三能',
 '三自',
 '三莫為',
 '三藏',
 '三說',
 '三請',
 '上',
 '上座',
 '上行',
 '下',
 '下座',
 '下至知',
 '不',
 '不下',
 '不修',
 '不具',
 '不出',
 '不厭',
 '不受',
 '不可',
 '不合',
 '不同',
 '不問',
 '不善',
 '不喚',
 '不在',
 '不壞',
 '不失',
 '不好',
 '不如',
 '不如法',
 '不定',
 '不得',
 '不必',
 '不思',
 '不恥',
 '不應',
 '不應蹋',
 '不成',
 '不教',
 '不樂',
 '不爾者',
 '不犯',
 '不白',
 '不白師',
 '不知',
 '不答',
 '不者',
 '不能',
 '不行',
 '不解',
 '不誦',
 '不識',
 '不識罪',
 '不起',
 '不退',
 '不重',
 '不長',
 '不關',
 '不須',
 '不驅',
 '世無食',
 '並',
 '並令',
 '並成',
 '並由師',
 '並須',
 '中',
 '中亦同',
 '中制',
 '中外',
 '中多',
 '中日',
 '中犯',
 '中當',
 '中白師',
 '乃',
 '乃白',
 '乃至',
 '久',
 '久德',
 '之',
 '之三',
 '之事',
 '之儀',
 '之喻',
 '之失',
 '之志',
 '之急',
 '之方',
 '之深',
 '之義',
 '之間'

Next, we put the counts into a pandas DataFrame:

In [57]:
import pandas as pd

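# `get_feature_names_out()` requires scikit-learn 1.0 or later; on older versions, use `get_feature_names()`
count_vectorized_df = pd.DataFrame(count_vectorized.toarray(), columns=count_vectorizer.get_feature_names_out())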

count_vectorized_df.sum(axis=1) 

0    3016
dtype: int64

In [59]:
# transpose the dataframe so that each term becomes a row
count_vectorized_df.T

Unnamed: 0,0
一,10
一一,3
一七,1
一作,1
一切,3
...,...
麁,1
黃門,1
黜,1
鼈,1


In [60]:
count_vectorized_df.T.sort_values(by=0, ascending=False)

Unnamed: 0,0
弟子,59
依止,58
和尚,55
若,44
不,40
...,...
學者,1
學在,1
學,1
字,1


We can see from the top three words here that our results are promising: the word _dizi_ 弟子 ("disciple") is the most common word in this text, followed by _yizhi_ 依止 ("to rely upon"), which is a little unusual but not totally strange, and then by _heshang_ 和尚, a word meaning either "monk" or "master". Following these are _ruo_ 若, usually meaning "if", and _bu_ 不, meaning "not": stopword-like grammatical particles that are expectedly common.

Although it is promising that words like "disciple" and "master" appear high on the list, we don't know whether this is because they are particularly distinctive of this chapter, or because they are common throughout this text, or in texts like this on the whole.

At this point it's also a good idea to stop and check that our results are what we expect them to be. If we open the file we are working on, `T1804_ch09_師資相攝篇.txt`, in a text editor, we can use the search function to check whether there are as many instances of each word as the `CountVectorizer` says.

Indeed, we find that 弟子 appears 59 times, but a search for the third term, 和尚, returns 60 hits in our text editor, against only 55 from the `CountVectorizer`. Why is this?

Although the truncated display above cannot show all the values, the full table printed below contains part of the explanation (look for the row for 和 on its own):

In [62]:
print(count_vectorized_df.T.sort_values(by=0, ascending=False).to_string())

           0
弟子        59
依止        58
和尚        55
若         44
不         40
云         31
者         31
我         23
比丘        19
為         19
與         17
去         17
僧祇        16
有         16
不得        16
等         15
四分        15
得         15
之         14
法         13
是         13
或         12
闍梨        12
不知        12
住         12
於         12
三         11
二         11
而         11
四         11
非法        10
捨         10
中         10
亦         10
一         10
受         10
應          9
故          9
後          9
二師         9
他          9
從          8
如          8
也          8
十誦         8
教誡         7
謂          7
此          7
大德         7
不能         7
汝          7
教授         7
人          7
乃至         7
令          6
看          6
見          6
威儀         6
折伏         6
犯          6
知          6
和          6
無          6
又          6
訶責         6
得罪         6
五分         6
復          6
行法         6
雖          6
上          5
三種         5
出界         5
請          5
所          5
善見         5
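The row for 和 (6) hints at what may have happened. In the tokenized output shown earlier, the phrase 問云何名師和尚闍梨 was segmented as `'問', '云何', '名師', '和', '尚闍梨'`, so 和尚 never appears there as a token of its own. The sketch below compares a plain substring search, which is roughly what a text editor does, with a count over tokens. Whether this accounts for every missing instance in the chapter is not something this snippet can show.

```python
phrase = "問云何名師和尚闍梨"

# The segmentation jieba produced for this phrase in the output earlier.
tokens = ["問", "云何", "名師", "和", "尚闍梨"]

# A substring search finds 和尚 inside the raw text...
print(phrase.count("和尚"))

# ...but a token count does not, because the word was split across two tokens.
print(tokens.count("和尚"))
```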

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words=chinese_stop_words, tokenizer=jieba.lcut, use_idf=False, norm='l1')

In [51]:
tf_vectorized = tfidf_vectorizer.fit_transform(documents)
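# `get_feature_names_out()` requires scikit-learn 1.0 or later; on older versions, use `get_feature_names()`
tf_vectorized_df = pd.DataFrame(tf_vectorized.toarray(), columns=tfidf_vectorizer.get_feature_names_out())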
tf_vectorized_df

Unnamed: 0,一,一一,一七,一作,一切,一則,一匠,一坐,一夜,一子,...,餘文,餘未,餘盡,餘者,髀,麁,黃門,黜,鼈,龜
0,0.003316,0.000995,0.000332,0.000332,0.000995,0.000332,0.000332,0.000332,0.000332,0.000332,...,0.000332,0.000332,0.000332,0.000332,0.000332,0.000332,0.000332,0.000332,0.000332,0.000332


In [63]:
count_vectorized_df.T.sort_values(by=0, ascending=False)

Unnamed: 0,0
弟子,59
依止,58
和尚,55
若,44
不,40
...,...
學者,1
學在,1
學,1
字,1


In [64]:
tf_vectorized_df.T.sort_values(by=0, ascending=False)

Unnamed: 0,0
弟子,0.019562
依止,0.019231
和尚,0.018236
若,0.014589
不,0.013263
...,...
學者,0.000332
學在,0.000332
學,0.000332
字,0.000332


In [69]:
55/3847

0.014296854691967767

In [67]:
len(tokenized_text)

3847

一     0.003316
一一    0.000995
一七    0.000332
一作    0.000332
一切    0.000995
        ...   
麁     0.000332
黃門    0.000332
黜     0.000332
鼈     0.000332
龜     0.000332
Length: 1641, dtype: float64
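Dividing the count for 和尚 by `len(tokenized_text)` (55/3847 ≈ 0.0143) does not reproduce the 0.018236 in the term-frequency table. A little arithmetic suggests why: with `use_idf=False` and `norm='l1'`, each value appears to be the term's count divided by the total count of terms the vectorizer kept (3016, the row sum shown earlier), not by the 3847 tokens in `tokenized_text`, which still include punctuation and newline tokens. The sketch below only rechecks numbers already printed in this notebook; the variable names are made up.

```python
# Counts reported by CountVectorizer for the top three terms.
counts = {"弟子": 59, "依止": 58, "和尚": 55}

total_kept_terms = 3016   # from count_vectorized_df.sum(axis=1)
total_raw_tokens = 3847   # from len(tokenized_text), stop words included

for term, count in counts.items():
    # First ratio: count over kept terms; second ratio: count over all raw tokens.
    print(term, round(count / total_kept_terms, 6), round(count / total_raw_tokens, 6))
```

The first ratio in each row agrees with the values in the table (0.019562, 0.019231, 0.018236), while the second does not.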