# Lu Yin Text Analysis Project
Stephanie Shu

# Framing the Project

## 1. Research interests

My primary research involves early twentieth century Chinese literature, specifically focusing representations of female-female homoeroticism. This project aims to extend a year-long thesis project I worked on in 2022, which used traditional close-reading strategies to analyze a set of five woman-authored short stories that depicted female-female homoerotic relationships. As a new programmer, the goal of this project is to allow me to practice using different basic text analysis strategies and evaluate their application to Chinese texts. I approach this project as more of an exercise in exploration and discovery, rather than something with a set path of methods and goals. 

## 2. Who was Lu Yin 廬隱?

Lu Yin 廬隱 (1898-1934) was a Chinese author whose stories focused on issues of women's societal entrapment and frustrations with heteronormative marriage culture. In her personal life, Lu Yin was well-known for her close relationship with another female writer, which was certainly homosocial, if not homoerotic in nature. Thus, her stories also explore the intimate and close relationships formed between young women, and the issues of family and national progress that they feel obligated to prioritize over their own happiness, relationships, and even love. 

Because of these features, I find Lu Yin to be the most compelling of the authors that I studied in my thesis, and therefore the author that I wanted to explore more through computational analysis. 

## 3. Research question

As mentioned about, this project is one of curiosity and exploration. My goal is to familiarize myself with the most basic of Chinese text analysis tools, and see what is possible. Some of the questions that guiding this project included:
* What information can be gleaned about Lu Yin's works by applying different forms of text analysis? For example, does word frequency analysis provide a good picture of the texts? 
* What is the best way to deal with Chinese text, which is a character-based language?
* Does Latent Dirichlet Allocation-based topic modeling give us better results than character-based or word-based frequency analyses?

## 4. The Data

I used three of Lu Yin's stories in this project:
* "Lishi's Diary" 麗石的日記 (short story)
* <i> Old Friends by the Seashore</i> 海濱故人 (novella)
* "Drifting Women" 飄泊的女兒 (short story)

The texts are sourced from the final copy of my thesis. These were adapted from online versions, which I manually corrected using published print copies. I made a separate text file for each story and uploaded these to github.

# Project Set-Up

## 1. Download files
1. Go to https://github.com/sshu99/luyin.git 
2. Open terminal and type 'git clone https://github.com/sshu99/luyin.git'
OR
2. Download ZIP

## 2. Install packages

In [1]:
!pip install zhon
!pip install jieba
!pip install seaborn
!pip install scikit-learn



## 3. Import Modules

In [2]:
import os
import platform 
import pandas as pd

# Python library 'zhon' with constants for Chinese text processing.
# https://pypi.org/project/zhon/ 
import zhon.hanzi

# Python module 'string' for removing English punctuation.
import string
from string import punctuation

# Python module 'Counter' for counting most frequent words. 
from collections import Counter 

#Python module 'jieba' for segmenting Chinese texts
import jieba

import spacy
nlp = spacy.load("en_core_web_sm", disable=["ner", "textcat"])

import nltk
nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger")
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/stephanie/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/stephanie/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## 4. Open the text files

In [3]:
# Assigning variables to read each file
lishi = open('lishi.txt', encoding = 'utf-8').read()
haibin = open('haibin.txt', encoding = 'utf-8').read()
piaobo = open('piaobo.txt', encoding = 'utf-8').read()

## 5. Read a file

In [4]:
# Checking that we can read one of the files
print(lishi[:300])

今日春雨不住響地滴著，窗外天容愔淡，耳邊風聲淒厲，我靜坐幽齋，思潮起伏，只覺悵然惘然！
　　去年的今天，正是我的朋友麗石超脫的日子，現在春天已經回來了，並且一樣的風淒雨冷，但麗石那慘白梨花般的兩靨，誰知變成什麼樣了！
　　麗石的死，醫生說是心臟病，但我相信麗石確是死於心病，不是死於身病，她留下的日記，可以證實，現在我將她的日記發表了吧！

十二月二十一日
　　不記日記已經半年了。只感覺著學校的生活單調，吃飯，睡覺，板滯的上課，教員戴上道德的假面具，像俳優般舞著唱著，我們便像傻子般看著聽著，真是無聊極了。
　　圖書館裡，擺滿了古人的陳跡，我掀開了屈原的《離騷》念了幾頁，心竊怪其愚——懷王也值得深


# Processing texts for word frequency analysis

## 1. Removing punctuation

Chinese text uses different punctuation from English, which is why I needed to import a special library with Chinese punctuation earlier. Since people now use English input alongside Chinese input, the text may also contain English punctuation, so I have to remove both.

In [5]:
# Printing Chinese punctuation
print(zhon.hanzi.punctuation)

＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､　、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·．！？｡。


In [6]:
# Printing English punctuation
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [7]:
# Using a For loop to remove Chinese punctuation
for char in zhon.hanzi.punctuation:
    lishi = lishi.replace(char, '')
    haibin = haibin.replace(char, '')
    piaobo = piaobo.replace(char, '')

# Using a For loop to remove English punctuation
for char in string.punctuation:
    lishi = lishi.replace(char, '')
    haibin = haibin.replace(char, '')
    piaobo = piaobo.replace(char, '')

In [8]:
# Testing to make sure the punctuation has been removed.
print(lishi[:300])

今日春雨不住響地滴著窗外天容愔淡耳邊風聲淒厲我靜坐幽齋思潮起伏只覺悵然惘然
去年的今天正是我的朋友麗石超脫的日子現在春天已經回來了並且一樣的風淒雨冷但麗石那慘白梨花般的兩靨誰知變成什麼樣了
麗石的死醫生說是心臟病但我相信麗石確是死於心病不是死於身病她留下的日記可以證實現在我將她的日記發表了吧

十二月二十一日
不記日記已經半年了只感覺著學校的生活單調吃飯睡覺板滯的上課教員戴上道德的假面具像俳優般舞著唱著我們便像傻子般看著聽著真是無聊極了
圖書館裡擺滿了古人的陳跡我掀開了屈原的離騷念了幾頁心竊怪其愚懷王也值得深戀嗎
下午回家寂悶更甚這時的心緒真微玄至不可捉摸日來絕要自制不讓消極的思想入據靈台所以


## 2. Removing spaces
Because I want to count word frequency, I also need to remove any spaces between characters so that these aren't in the way of the text processing.

In [9]:
# Replacing '\n' and '\r' to get rid of line breaks.
# Replacing ' ' spaces to get rid of any single spaces.
lishi = lishi.replace('\n', '').replace('\r', '').replace(' ', '')
haibin = haibin.replace('\n', '').replace('\r', '').replace(' ', '')
piaobo = piaobo.replace('\n', '').replace('\r', '').replace(' ', '')

In [10]:
# Testing to make sure all spaces have been removed.
print(lishi[:300])

今日春雨不住響地滴著窗外天容愔淡耳邊風聲淒厲我靜坐幽齋思潮起伏只覺悵然惘然去年的今天正是我的朋友麗石超脫的日子現在春天已經回來了並且一樣的風淒雨冷但麗石那慘白梨花般的兩靨誰知變成什麼樣了麗石的死醫生說是心臟病但我相信麗石確是死於心病不是死於身病她留下的日記可以證實現在我將她的日記發表了吧十二月二十一日不記日記已經半年了只感覺著學校的生活單調吃飯睡覺板滯的上課教員戴上道德的假面具像俳優般舞著唱著我們便像傻子般看著聽著真是無聊極了圖書館裡擺滿了古人的陳跡我掀開了屈原的離騷念了幾頁心竊怪其愚懷王也值得深戀嗎下午回家寂悶更甚這時的心緒真微玄至不可捉摸日來絕要自制不讓消極的思想入據靈台所以又忙把案頭的奮


## 3. Create a stopword list

To do word frequency analysis, I need to remove common, but relatively irrelevant words and phrases from the text. I searched for a suitable list of stopwords online and came upon this library: https://github.com/bryanchw/Traditional-Chinese-Stopwords-and-Punctuations-Library. However, the contents were not exactly what I wanted, so I copied it into a new text file and made some edits to it. Now I will open that file in this notebook so I can use this list to remove the stopwords from the texts.

In [11]:
stops = open('stopwords.txt', encoding = 'utf-8').read()

In [12]:
# Formatting the text into a list
stoplist = stops.replace('\n','').replace("'", '').replace(" ", '')
stps = stoplist.split(',')

# Adding more words to the stopword list as needed
stps.append('麼')
stps.append('麽') # for some reason there are two unicodes for this character, so we need to add both
stps.append('事')
stps.append('便')
stps.append('見')
stps.append('頭')
stps.append('子')
stps.append('然')
stps.append('樣')
stps.append('裏')
print(stps[:100])

['的', '得', '不', '沒', '了', '是', '到', '至', '有', '此', '些', '這', '那', '在', '裡', '外', '上', '下', '著', '者', '個', '也', '來', '去', '和', '只', '都', '就', '之', '過', '地', '無', '對', '又', '起', '前', '能', '以', '但', '而', '並', '可', '麼', '什麼', '因', '因為', '多', '少', '出', '很', '太', '走', '從', '如', '才', '呢', '吧', '啊', '呵', '嗎', '還', '把', '欸', '所以', '何', '結果', '如果', '已經', '已', '向', '最', '作', '做', '一', '二', '三', '四', '五', '六', '七', '八', '九', '十', '每', '便一個', '一些', '一何', '一切', '一則', '一方面', '一旦', '一來', '一樣', '一種', '一般', '一轉眼', '萬一', '上', '上下', '下']


# >>> Side quest: character name frequencies
## 1. Make character lists
While creating my stopwords list, I also realized there were character names that I wanted to remove, but I want to remove them separately so I can also count their frequencies. I created lists of characters for each text.

In [13]:
lishi_names = ['麗石', '沅青', '雯薇', '酈文']
haibin_names = ['露沙', '雲青', '宗瑩', '玲玉', '蓮裳', '梓青', '劍卿', '師旭']
piaobo_names = ['畏如', '星若', '星呵']

## 2. Add character lists to jieba dictionary
I will be using the jieba Python module to split my texts and search for character names. These character names aren't in the default jieba dictionary, however, so I need to modify the dictionary to recognize the names.

In [14]:
# Writing a function to add the names from the list to jieba
def add_words_from(namelist):
    for word in namelist:
        jieba.add_word(word)

# Using the function to add each list of names to jieba for all three texts
add_words_from(lishi_names)
add_words_from(haibin_names)
add_words_from(piaobo_names)

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/jt/fyzwsnw901g8rw4fc1nwbh_c0000gn/T/jieba.cache
Loading model cost 0.564 seconds.
Prefix dict has been built successfully.


## 3. Write a function to return character counts

Now the set-up is ready, and I can write a function to process the text and output a dataframe with the frequencies of each character.

In [15]:
# Writing a function to produce the correct names list 
# For use in the following function
def producelist(text):
    if text == lishi:
        return lishi_names
    if text == haibin:
        return haibin_names
    if text == piaobo:
        return piaobo_names

# Writing a function to list the frequency of each character's name
def name_counter(text):
    seg_text = jieba.cut(text, cut_all=False) # Using jieba to segment the text
    c = Counter(seg_text)
    new_list = {}
    names_list = producelist(text)
    for name in names_list: # Using a For loop to check each of the names in the list
        new_list[name] = c[name]
    name_freq = sorted(new_list.items(), key=lambda x:x[1], reverse = True) 
        # Sorting the names by frequency
    return pd.DataFrame(name_freq)

## 4. Counting name frequencies in each text

In [16]:
name_counter(lishi)

Unnamed: 0,0,1
0,沅青,38
1,麗石,14
2,雯薇,9
3,酈文,3


In [17]:
name_counter(haibin)

Unnamed: 0,0,1
0,露沙,285
1,雲青,117
2,宗瑩,109
3,玲玉,84
4,梓青,69
5,蓮裳,25
6,劍卿,17
7,師旭,5


In [18]:
name_counter(piaobo)

Unnamed: 0,0,1
0,畏如,22
1,星若,19
2,星呵,4


This side quest is not particularly relevant to my research question, but it is interesting to see which characters' names are mentioned the most! A different research question that asks about characters and their relationships might benefit from something like this.

## 5. Removing names

And now that we have the name frequencies, we can take the names out of the text so they won't be counted in the word frequencies later.

In [19]:
# Writing a function to remove the characters' names from each text
# nc stands for 'no characters'

def nc(text):
    namelist = producelist(text)
    for name in namelist:
        if name in text:
            text = text.replace(name,'')
    return text

nc_lishi = nc(lishi)
nc_haibin = nc(haibin)
nc_piaobo = nc(piaobo)

### Now back to the main quest...
# Exploring basic word frequencies
Now that I have finished removing punctuation, space, and character names, I will get back to the main goal of exploring word frequencies. First, I need to remove stopwords so I can look at the word frequencies for just the important words in each text. Then, I can write a function to calculate those frequencies.

## 1. Removing stops

In [7]:
# Writing a function to remove stopwords from each text
# ns stands for 'no stops'

def ns(text):
    for stop in stps:
        if stop in text:
            text = text.replace(stop,'')
    return text

ns_lishi = ns(nc_lishi)
ns_haibin = ns(nc_haibin)
ns_piaobo = ns(nc_piaobo)

NameError: name 'nc_lishi' is not defined

## 2. Calculating word frequencies 

Now we can finally look at word frequencies and see which words are the most frequent in each text. We can look at these words and see if they are helpful in either reflecting or evaluating the content of the text.

In [21]:
# Writing a function to produce the correct text with no characters and no stops
# For use in the following function
def produce_ns(text):
    if text == lishi:
        return ns_lishi
    if text == haibin:
        return ns_haibin
    if text == piaobo:
        return ns_piaobo

# Writing a function to return the 10 most common words in a text
def word_frequencies(text):
    freq = Counter(produce_ns(text)).most_common(10) # Using the Counter module to get the most frequently used words
    return freq

## 3. Reviewing the common words* in these three texts:
*Technically, the function I wrote above calculates character frequencies, not word frequencies. In Chinese, characters have meanings of their own, but they may also appear in other words that are comprised of multiple characters.

I've added very rough translations to the dataframe to give non-Chinese speakers a sense of the results.

In [22]:
print (word_frequencies(lishi))

[('天', 45), ('心', 34), ('生', 33), ('覺', 27), ('真', 22), ('情', 20), ('信', 19), ('病', 18), ('想', 18), ('回', 17)]


In [23]:
# Assigning a variable to the dataframe
lishi_df = pd.DataFrame(word_frequencies(lishi))

# Naming columns and adding a column with definitions
lishi_df.columns=['character', 'frequency']
lishi_df.insert(1, 'definition',['day, heaven, sky', 'heart, mind', 'life, birth', 'feel', 'real, true', 'emotion', 'believe, letter','illness, sick','think, feel, want', 'return'], True) 

# Printing the dataframe
lishi_df

Unnamed: 0,character,definition,frequency
0,天,"day, heaven, sky",45
1,心,"heart, mind",34
2,生,"life, birth",33
3,覺,feel,27
4,真,"real, true",22
5,情,emotion,20
6,信,"believe, letter",19
7,病,"illness, sick",18
8,想,"think, feel, want",18
9,回,return,17


### 3a. Understanding "Lishi's Diary" character frequencies

"Lishi's diary" is a fictional diary that is very emotional. The main character and fictional author, Lishi, writes about her feelings in a lot of depth. She also commonly associates sadness and discontentment with illness, which we can also see reflected in this character count.

In [24]:
print(word_frequencies(haibin))

[('天', 178), ('道', 160), ('生', 135), ('心', 131), ('情', 112), ('信', 107), ('想', 105), ('家', 99), ('覺', 98), ('回', 90)]


In [25]:
# Assigning a variable to the dataframe
haibin_df = pd.DataFrame(word_frequencies(haibin))

# Naming columns and adding a column with definitions
haibin_df.columns=['character', 'frequency']
haibin_df.insert(1, 'definition',['day, heaven, sky', 'way, channel', 'life, birth', 'heart, mind', 'emotion', 'believe, letter', 'think, feel, want', 'family, home', 'feel', 'return'],True)

# Printing the dataframe
haibin_df

Unnamed: 0,character,definition,frequency
0,天,"day, heaven, sky",178
1,道,"way, channel",160
2,生,"life, birth",135
3,心,"heart, mind",131
4,情,emotion,112
5,信,"believe, letter",107
6,想,"think, feel, want",105
7,家,"family, home",99
8,覺,feel,98
9,回,return,90


### 3b. Understanding "Friends by the Seashore" character frequencies

"Friends by the Seashore" (Habin guren) is a novella, so we can see that the character counts are very high. The novella is filled with emotional letters and poems written between characters. The characters also often discuss family obligations and whether or not they will return home. 

In [26]:
print(word_frequencies(piaobo))

[('心', 18), ('家', 18), ('女', 16), ('男', 16), ('想', 15), ('愛', 12), ('住', 11), ('回', 11), ('海', 9), ('生', 9)]


In [27]:
# Assigning a variable to the dataframe
piaobo_df = pd.DataFrame(word_frequencies(piaobo))

# Naming columns and adding a column with definitions
piaobo_df.columns=['character', 'frequency']
piaobo_df.insert(1, 'definition',['heart, mind', 'family, home', 'woman, female', 'man, male', 'think, feel, want', 'love', 'live, reside', 'return', 'sea, ocean', 'life, birth'],True)

# Printing the dataframe
piaobo_df

Unnamed: 0,character,definition,frequency
0,心,"heart, mind",18
1,家,"family, home",18
2,女,"woman, female",16
3,男,"man, male",16
4,想,"think, feel, want",15
5,愛,love,12
6,住,"live, reside",11
7,回,return,11
8,海,"sea, ocean",9
9,生,"life, birth",9


###  3c. Understanding "Drifting Women" character frequencies

"Drifting Women" (Piaobo de nü'er) is a short story about two women who are not sure if they should follow their familial duty and get married to men, or if they should follow their own desires and love and stay together.

## 4. Evaluating character frequencies

Now that I have found the 10 most common characters across the three texts, I am able to see that the most common words/characters do seem to reflect central concerns from the texts. However, this is more circumstantial evidence, because we already know what the texts are about. Knowing the content of the text helped me interpret why certain characters showed up the most often. 

But, if I wanted to process an unknown text the same way, these character frequencies probably wouldn't be enough to predict the content of a text. For predictive power, I'm going to need something more advanced than simple counting. 


# Exploring word frequencies through text segmentation
Now, I will try looking at word frequencies through a different lens: text segmentation. Since words can be multiple characters long, I am hoping that using text segmentation will produce informative results.

## 1. Writing a function to segment the text and count frequencies

I am using the Python module jieba, which is built for Chinese text segmentation. Given the nature of the Chinese language, which does not have prefizes, suffixes, and conjugations, words already hold their original semantic meaning more or less in their form. Thus, we can think of text segmentation as performing a very similar function to that of lemmatization for English texts.

In [28]:
# Writing a function that will simultaneously segment the text and count frequencies
def seg_frequency(text):
    ns_text = (''.join(produce_ns(text)))
    seg_text = jieba.cut(ns_text, cut_all=False) # Using jieba to segment the text
    freq = Counter(seg_text).most_common(10) # Using the Counter module to get the most frequently used words
    return freq

## 2. Reviewing the common words in these three texts:

In [29]:
print(seg_frequency(lishi))

[('安慰', 9), ('生活', 8), ('聽', 8), ('結婚', 7), ('睡', 6), ('寂寞', 6), ('昨天', 5), ('昨夜', 5), ('勉強', 5), ('天津', 5)]


In [30]:
# Assigning a variable to the dataframe
lishi_seg = pd.DataFrame(seg_frequency(lishi))

# Naming columns and adding a column with definitions
lishi_seg.columns=['character', 'frequency']
lishi_seg.insert(1, 'definition',['comfort, soothe', 'life', 'listen, hear', 'marry', 'sleep', 'lonely, loneliness', 'yesterday', 'last night', 'do reluctantly, force', 'Tianjin (city name)'],True)

# Printing the dataframe
lishi_seg

Unnamed: 0,character,definition,frequency
0,安慰,"comfort, soothe",9
1,生活,life,8
2,聽,"listen, hear",8
3,結婚,marry,7
4,睡,sleep,6
5,寂寞,"lonely, loneliness",6
6,昨天,yesterday,5
7,昨夜,last night,5
8,勉強,"do reluctantly, force",5
9,天津,Tianjin (city name),5


In [31]:
print(seg_frequency(haibin))

[('道', 60), ('聽', 36), ('話', 35), ('知道', 34), ('吃', 25), ('想', 25), ('生活', 25), ('寫', 24), ('封信', 22), ('朋友', 21)]


In [32]:
# Assigning a variable to the dataframe
haibin_seg = pd.DataFrame(seg_frequency(haibin))

# Naming columns and adding a column with definitions
haibin_seg.columns=['character', 'frequency']
haibin_seg.insert(1, 'definition',['way, path', 'listen, hear', 'speech, language', 'know, understand', 'eat', 'think, feel, want', 'life', 'write', 'letter', 'friend'],True)

# Printing the dataframe
haibin_seg

Unnamed: 0,character,definition,frequency
0,道,"way, path",60
1,聽,"listen, hear",36
2,話,"speech, language",35
3,知道,"know, understand",34
4,吃,eat,25
5,想,"think, feel, want",25
6,生活,life,25
7,寫,write,24
8,封信,letter,22
9,朋友,friend,21


In [33]:
print(seg_frequency(piaobo))

[('想', 8), ('男', 6), ('青春', 5), ('找', 5), ('朋友', 4), ('兩', 4), ('道', 4), ('回家', 4), ('嫁', 4), ('住', 3)]


In [34]:
# Assigning a variable to the dataframe
piaobo_seg = pd.DataFrame(seg_frequency(piaobo))

# Naming columns and adding a column with definitions
piaobo_seg.columns=['character', 'frequency']
piaobo_seg.insert(1, 'definition',['think, feel, want', 'man, male', 'youth', 'find', 'friend', 'two, pair', 'way, path,', 'return home', 'marry', 'live, reside'],True)

# Printing the dataframe
piaobo_seg

Unnamed: 0,character,definition,frequency
0,想,"think, feel, want",8
1,男,"man, male",6
2,青春,youth,5
3,找,find,5
4,朋友,friend,4
5,兩,"two, pair",4
6,道,"way, path,",4
7,回家,return home,4
8,嫁,marry,4
9,住,"live, reside",3


## 3. Evaluating character frequencies

The results of the word frequency analysis on the segmented texts produces quite different results from the word frequency analysis that only counted character frequencies. As this experiment shows, there are quite a few seemingly unhelpful words that end up at the top of the lists, such as 吃 "eat" and 天津 "Tianjin" (city name). Aside from this, however, there are also helpful words that didn't show up in the other character-based frequency analysis. For example, 嫁 "marry", 青春 "youth", 朋友 "friend", 安慰 "comfort" are all important aspects to these stories. 

This does show that the frequencies shift dramatically between the two methods. If you look at how jieba segments a text, it becomes clear why. As seen below, the text is chopped into chunks, with some chunks being two or three characters long. This means that frequent characters aren't considered alone, but rather as part of the word they are found in. Thus, the frequencies will shift when compared to character-based frequency analysis. 

In [35]:
TEST = jieba.cut(ns_lishi[:300], cut_all=True)
'/'.join(list(TEST))

'春雨/住/響/滴/窗/天容/愔/淡/耳/風/聲/淒/厲/靜/坐/幽/齋/思潮/伏/覺/悵/惘/天正/朋友/超/脫/現/春天/回/風/淒/雨/冷/慘/白梨/梨花/般/兩/靨/知/變/成/死/醫/生/心/臟/病/相信/確/死心/心病/病死/身/病/留/記/證/實/現/記/發/表/記/記/半/感/覺/學/校/生活/單/調/吃/飯/睡/覺/板/滯/課/教/員/戴/道德/假面/假面具/面具/俳/優/般/舞/唱/傻/般/聽/真/聊/極/圖/書/館/擺/滿/古/陳/跡/掀/開/屈原/騷/念/頁/心/竊/怪/愚/懷/王/值/深/戀/午/回家/寂/悶/甚/心/緒/真/微/玄/捉摸/絕/制/消/極/思想/想入/靈/台/忙/案/奮/鬥/雜/誌/讀/晚/飯/生/海信/寥寥/行/系心/心坎/流/近/宿/常常/戕/身/白/蘭/酒/兩/天/喝完/瓶/沉醉/忘/憂/候/憐/感情/情海/豈/容/輕/陷/指路/紅/燈/盞/萬/矢/紅/燈/料定/勝/實/海/蘭/女/世界/絕/僅/生/永/遠/瞭/解/層/罷/夜/復/生/信/竟/受困/確/搜/枯/腸/找/句/恰/話/足/安慰/實/真正/苦/悶/候/絕/話/安慰/天/俗例/冬/節/學/堂/放/天/假/早晨/姑母/忙/預/備/祭祖/免/想/家情/緒/憶/獨/異/鄉/異/客/逢/佳/節/倍/思/親/愴/淚/姑丈/老病/兩/天'

## 4. Comparing the two frequency analysis methods
When comparing these two ways of doing frequency analysis, it's hard to say which method does a better job at showing trends in the text. We can try to see if we can compare them by writing a function that matches the texts.

### 4a. Get summaries of each text
I'll be using the summaries I wrote for each of these texts as the metric for comparison. Obviously, this is a very subjective measure, since I was the one who wrote these summary texts. But hopefully it can give us an idea of how well these frequency counts match the content.

In [36]:
lishi_summary = 'This lesser-known work by Lu Yin is formatted and presented as the diary of a deceased young woman named Lishi 麗石, ostensibly published by a friend who believes she died not of heart disease, but rather of heartsickness. Through the many entries, Lishi’s story of love and loss unfolds. After witnessing the toll marriage has taken on her childhood friend Wenwei 雯薇, Lishi confesses her feelings of disillusionment with marriage to her close friend Yuanqing 沅青. Finding sympathy in one another, the two grow closer, and their relationship evolves into one that Lishi explicitly defines as “same- gender love” 同性的愛戀. The two happily imagine a future together, and that evening, Lishi dreams of traveling down a stream with Yuanqing on a moonlit night. But this happiness does not last long; Yuanqing is called home by her mother, and upon her return, seems more sullen, though she refuses to explain why. The following day, Lishi receives a letter from Yuanqing, saying that her mother has arranged for her to move to Tianjin to spend time with her cousin, who she is expected to marry. Though the initial letter is filled with resentment towards a world that will not allow their love, within a few weeks she sends another letter to Lishi, encouraging her to awaken to reality and the “childishness” of their previous dreams. To add insult to injury, Yuanqing sends a young man to court Lishi, which causes her to sink deeper into melancholia. In the final entry, Lishi wishes that death will come quickly, and with this, her diary ends abruptly.'
haibin_summary = 'The opening of Lu Yin’s Old Friends by the Seashore starts with the picturesque scene of five young female friends, Lusha 露莎, Lingyu 玲玉, Yunqing 雲青, Zongying 宗瑩, and Lianshang 蓮裳, spending a relaxing summer along the seashore during their break between academic terms. Having isolated themselves from their peers, their relationship is very intimate and affectionate, and they spend their free time debating various intellectual inquiries, such as boundaries of emotion and logic or the meaning of “love.” Soon after, the girls return to school, and as they approach their graduation, they grow increasingly apprehensive of the oppressive life that awaits them once they enter society and are expected to marry, bear children, and ultimately, sacrifice their individual ambitions for their families. They worry, too, that their education, intellectual enlightenment, and modern values may only worsen their feelings of entrapment once they become captives to society, and Lusha in particular has a mounting fear that life and love may be fundamentally meaningless. Indeed, as time wears on, each woman experiences her own personal tragedies. At Lianshang’s wedding, the other four girls stand miserably to the side as they helplessly witness the festivities which are the first indication of their diverging paths. Lusha’s tragedy is two-fold; the first is her mother’s death, and the second, her continued agony over the impossibility of a relationship with the man she loves, Ziqing, who is still married to a woman his parents chose for him. Yunqing, too, struggles because herfamily disapproves of her suitor. Meanwhile, Zongying and Lingyu each find potential marriage matches, gradually spending less time with the group to accompany their boyfriends. Seeing their increasingly apathetic attitude towards the friendship, Lusha and Yunqing bemoan the impossibility of a dream all four once shared—a home by the sea where they could teach or write and leisurely spend time together, just as they did in days gone by.Their misfortune continues as Zongying, who despite overcoming her family’s disapproval and marrying the man of her choice, falls severely ill less than a month after her wedding. Lingyu has plans to marry, while Yunqing, unwilling to rebel against her family’s wishes, gives up on her love to return home and care for her mother and siblings. With all her friends gone in scattered in directions, Lusha writes to Yunqing, explaining that she and Ziqing, though unable to concretize their love through any legal or socially accepted channels, have decided to build a life together. She writes that they have purchased land to build a small hut along the same seashore where the four women vacationed in their youth, and plan to settle there together—if they can succeed in accomplishing their shared revolutionary goals. If they fail, she writes, they will commit double suicide by treading into the waves. After receiving this letter, Yunqing does not hear from Lusha again.One year later, Lingyu and Yunqing make their way back to the seaside. There they find an empty house, just as Lusha described, with the words “Old Friends by the Seashore” inscribed above the door frame.'
piaobo_summary = 'Set in 1932, the story unfolds against the backdrop of Japanese military aggression and general political chaos in Shanghai. The first scene of the story opens with the sound of explosions waking the two protagonists, lovers Weiru 畏如 and Xingruo 星若. The two women have been romantically involved for several years and are currently staying in a friend’s home in Shanghai, searching for jobs. The state of the outside world is reflected in the two protagonists’ individual states. Both are unemployed and struggling to support themselves financially. Moreover, they find themselves having to reckon with the reality of independent life. Weiru, the older of the two, bemoans men’s unwillingness to come to her aid as they had in the past. It is only through Xingruo’s explanation that she realizes their previous attentiveness towards her stemmed not from interest in her as a person, but rather as a sexual object. Now that she is perceived to have passed her so-called prime, men no longer pay her any heed. Comforting Weiru, Xingruo reassures her that their love is more than sufficient. The two decide to make a pact to stay unmarried, and if this proves impossible, they promise to get married at the same time, inthe same place. When the two women find themselves still unemployed, despite their continuous search, they decide to return home to their families. First, they visit Weiru’s elderly parents together, then Xingruo returns to her own hometown by herself. At home, amidst pressure from her family and friends, Xingruo begins to consider marriage and returns to Shanghai to look for potential marriage prospects. She receives a letter from Weiru, in which Weiru swears off love, saying that the age of love has passed with her youth. She writes saying that she must be practical now and put her family’s survival before her own ambitions and dreams. Aimlessly wandering in Shanghai alone, Xingruo thinks of Weiru, drifting alone somewhere far away.'

### 4b. Write functions to lemmatize and match texts

In [37]:
def process(text):
    text = text.lower()
    nlp = spacy.load("en_core_web_sm", disable = ["ner", "textcat"])
    stp = nlp.Defaults.stop_words
    lemma_no_stops = []
    for char in punctuation:
        if char != '\'':
            text = text.replace(char,"")
    doc = nlp(text)
    for word in doc:
        lemma = word.lemma_.lower()
        if lemma not in stp:
            lemma_no_stops.append(lemma)
    return lemma_no_stops

def match(summary, analysis):
    x = process(summary)
    y = ', '.join(list(analysis[0:10]['definition']))
    z = process(y)
    new = set() #intialize an empty set
    for word in x:
        if word in z:
            new.add(word)
    print('words in common:', len(new))
    print('words:', new)

### 4c. Print the results

In [38]:
print('\033[1m' + '"Lishi\'s Diary"' + '\033[0m')

print ("\ncharacter-based frequency analysis:")
match(lishi_summary, lishi_df)

print ("\nsegement-based frequency analysis:")
match(lishi_summary, lishi_seg)

print ("\n---------")

print('\033[1m' + '"Old Friends by the Seashore"'+ '\033[0m')
print ("character-based frequency analysis:")
match(lishi_summary, haibin_df)

print ("\nsegement-based frequency analysis:")
match(lishi_summary, haibin_seg)

print ("\n---------")

print('\033[1m' + '"Drifting Women"'+ '\033[0m')

print ("\ncharacter-based frequency analysis:")
match(lishi_summary, piaobo_df)

print ("\nsegement-based frequency analysis:")
match(lishi_summary, piaobo_seg)

[1m"Lishi's Diary"[0m

character-based frequency analysis:
words in common: 5
words: {'heart', 'letter', 'believe', 'return', 'day'}

segement-based frequency analysis:
words in common: 3
words: {'tianjin', 'marry', 'night'}

---------
[1m"Old Friends by the Seashore"[0m
character-based frequency analysis:
words in common: 6
words: {'heart', 'letter', 'believe', 'home', 'return', 'day'}

segement-based frequency analysis:
words in common: 2
words: {'letter', 'friend'}

---------
[1m"Drifting Women"[0m

character-based frequency analysis:
words in common: 6
words: {'woman', 'heart', 'man', 'love', 'home', 'return'}

segement-based frequency analysis:
words in common: 6
words: {'marry', 'friend', 'man', 'home', 'return', 'find'}


### 4d. Compare methods

From this comparison of the analysis, we can see that the character-based frequency analysis actually seems to bring up more of the same words that are in the summaries for the first two stories. For "Drifting Women," both seem equally helpful in terms of number of matched terms. We can see that the character-based frequency analysis tends to provide more conceptual words like "believe," "love," "home," and "heart," while the word-based gives more concrete and tangible words like "friend" and "Tianjin." This tracks with how the Chinese language often combines characters that convey concepts into word pairs that hold a more specific semantic meaning. It's helpful to note that the matched terms are relatively low in the first place, so the summaries may note be the best way to evaluate the success of a word frequency analysis. 

# Exploring topic models with LDA

Now I will try to use Latent Dirchlet Allocation to produce topic models to compare to the word frequency analysis.

In [39]:
!pip install HanziNLP



## 1. Using the LDA model to do topic modeling for a text

In [6]:
# assigning variable to open the file
lishi = open('lishi.txt', encoding = 'utf-8').read() 

Due to time constraints, I can only look at one of the texts with this method. I decided to look at "Lishi's Diary." I'm using HanziNLP library's LDA function. https://github.com/samzshi0529/HanziNLP 

In [4]:
from HanziNLP import sentence_segment, word_tokenize, lda_model, print_topics

text = ns(lishi)
sentences = sentence_segment(text)
tokenized_texts = [word_tokenize(sentence) for sentence in sentences]
lda_model, corpus, dictionary = lda_model(tokenized_texts, num_topics=5, passes=100)
print_topics(lda_model)

NameError: name 'ns' is not defined

### Here's a better view of those topics with translations:

#### Topic 0
"沅青" Yuanqing (character), "天津" Tianjin (city name), "憐" pity, "寂寞" loneliness, lonely, "麗石" Lishi (character), "約" agree to meet, "意思" intention, "路" street, "母親" mother, "同性" same-gender

#### Topic 1
"安慰" comfort, "生活" life, "早" early, "沅青" Yuanqing (character), "—" em dash (punctuation), "回想" memory, "奪" seize, take away, "類" type, "戴" to wear, support, "青" youth

#### Topic 2
"—" em dash (punctuation), "麗石" Lishi (character), "生" life, birth, "沅青" Yuanqing (character), "真" real, true, "床" bed, "病" sickness, "午" noon, "親愛" beloved, darling, "痕跡" scars

#### Topic 3
"沅青" Yuanqing (character), "天" day, sky, heaven, "話" word, speech, "記" remember, record "雯薇" Wenwei (character), "死" die, death, "正" current, straight, correct, "抑鬱" depression, "天覺" sky + feel (not an expression) "竟" actually, unexpected

#### Topic 4
"—" em dash (punctuation), "沅青" Yuanqing (character) "結婚" marry, marriage, "表兄" cousin, "酈文" Liwen (character), "聽" hear, listen, "覺" feel, "信" believe, letter, "恨" hate, "苦痛" pain, suffering

## 2. Evaluating Topic Modeling

The topic modeling was a little disappointing. These topics still feel like somewhat random collections of words, but when compared to the word frequency analysis, they are definitely better at showing us what the text is about. 

Topic 0 is probably the most random of all five. If I had to name it, I would call it "Yuanqing and Lishi's sadness about location and family." 

Topic 1 is pretty good, and I might call this one "concerns about youth and life." 


Topic 2 could be "illness and love between Lishi and Yuanqing"


Topic 3 could be "death and sadness connected to Yuanqing and Wenwei"


Topic 4 could be "marriage, hate, and suffering connected to multiple characters"

I would say that all together, these three topic sum up the story reasonably well. But I already know the story and can see the connections, so I am biased to see the connections here. Once again we see that this text analysis strategy fails to really reliably predict the content of the text. 

#### The failure of LDA in this context is probably due to a few different factors:

1. I did not preprocess the text, which is why there are weird things in the topics, such as em dashes. These should have been removed in the preprocessing stage, but when I tried to preprocess my texts, the topics did not come out well. This is something I don't understand and would want to troubleshoot in the future. 

2. This text is a short story, which means it's probably not ideal for using LDA. A much longer text would be better for topic modeling. 

# Project Conclusion

From this project, I was able to familiarize myself with some basic tools of Chinese text analysis, such as the Python module jieba and LDA. I also learned how different languages require different computation approaches, such as segmentation vs. lemmatization, and character-based versus word-based analysis. I conducted different analyses and found that word frequency analysis can help get a rough picture of the important concepts in a text, but that overall it is not good enough to predict the content of a text. I conducted a second analysis with Latent Dirichlet Allocation, which produced five topics. When compared to the word frequency analysis, the LDA analysis did produce a better product, but it was still flawed. Fundamentally, all of these methods had flaws, in part due to user error, and in part due to the length of the texts I used. Still, I would say that these methods do provide an interesting look into the texts, which might not be as useful on it's own, but would be a good supplement to a close-reading analysis. 