### Dimensionality Reduction/Embeddings of Classical Chinese poetry

    - This notebook creates a process to produce embeddings from a cross-dynastic corpus of Chinese poems:
     秦 (Qin), 汉(Han), 唐 (Tang), 南北朝 (Northern/Southern Dynasties), 宋 (Song), 元 (Yuan), 明 (Ming), 清 (Qing) 
     
    - Embeddings create vectorized (numeric) representations for text, which exist in distributional
      semantic embedding space. Since text data is high dimensional in nature (similar to genetic data, audio data, etc -- compression of the multi-dimensional corpus and distributional semantic models are needed in order to ascertain general (and latent) structures within the data.
      
      - [ This is an exercise in informational retrieval (IR) using machine learning and Chinese poems ]
      
      
    - Embedded language has shown useful not only in information retrieval tasks (topic modeling, etc.) -- but
      also as inputs to document classifiers and in natura language understanding for machines. As an extension
      of the embeddings created here, a general adversarial network (GAN) is trained to generate Chinese poetry
      non-deterministically and autonomously with seed words as inputs, capitalizing on latent structure within 
      the latent space geometry of the embeddings. 

#### Module imports

In [3]:
#pinyin and translation/transliteration support 

import pinyin 
print(pinyin.get('你好'))

#Chinese character segmentation, tokenization, and dictionary features
from chinese import ChineseAnalyzer

#Efficient multi-core processing and progress bar utility 
import multiprocessing
from time import sleep
from tqdm import tqdm

#General python processing
import pandas as pd 
import string
import re

##Baidu stopwords: Stopwords are very frequent in language and can interfere with information retrieval
## this is a baidu library of stopwords so that they can be scrubbed from the poems -- the list is modern,
## so there may be a mismatch*

baidu_stopwords = pd.read_csv("/home/spenser/Poetry/stopwords/baidu_stopwords.txt", header=None, encoding = 'utf-8', sep =",")
baidu_stopwords.columns = ['baidu_sw']
 #https://github.com/goto456/stopwords.git

nǐhǎo


### Import Cross-Dynastic Poetry Corpus
    -Each .csv file containing poems organized by dynastic era will be combined into a single
     dataframe. This dataframe will be the input to NLP text preprocessing.

In [4]:
import pandas as pd 
import os
os.chdir("/home/spenser/Poetry/")


qin_all = pd.read_csv("秦.csv")

han_all = pd.read_csv("汉.csv")

tang_all = pd.read_csv("唐.csv")

northern_southern_dyn_all = pd.read_csv("南北朝.csv")

song1 = pd.read_csv("宋_1.csv")
song2 = pd.read_csv("宋_2.csv")
song3 = pd.read_csv("宋_3.csv")
song4 = pd.read_csv("宋_4.csv")
song_all = pd.concat([song1, song2, song3, song4])

yuan_all = pd.read_csv("元.csv")

ming1 = pd.read_csv("明_1.csv")
ming2 = pd.read_csv("明_1.csv")
ming3 = pd.read_csv("明_1.csv")
ming4 = pd.read_csv("明_1.csv")
ming_all = pd.concat([ming1, ming2, ming3, ming4])

qing1 = pd.read_csv("清_1.csv")
qing2 = pd.read_csv("清_2.csv")
qing_all = pd.concat([qing1, qing2])

###unify

all_poems = pd.concat([qin_all, han_all, tang_all, northern_southern_dyn_all,
                      song_all, yuan_all, ming_all, qing_all]).reset_index(drop=True)

print(len(all_poems), "Poems in the cross-Dynastic corpus")


all_poems.head()

706638 Poems in the cross-Dynastic corpus


Unnamed: 0,题目,朝代,作者,内容
0,三秦民谣,秦,无名氏,武功太白，去天三百。孤云两角，去天一握。山水险阻，黄金子午。蛇盘鸟栊，势与天通。
1,巴谣歌,秦,阙名,神仙得者茅初成，驾龙上升入太清。时下玄洲戏赤城，继世而往在我盈，帝若学之腊嘉平。
2,大招,汉,作者未详,青春受谢，白日昭只。春气奋发，万物遽只。冥淩浃行，魂无逃只。魂魄归徕！无远遥只。魂乎归徕！无...
3,上邪,汉,两汉乐府,上邪。我欲与君相知。长命无绝衰。山无陵。江水为竭。冬雷震震夏雨雪。天地合。乃敢与君绝。
4,孔雀东南飞 古诗为焦仲卿妻作,汉,两汉乐府,孔雀东南飞。五里一徘徊。十三能织素。十四学裁衣。十五弹箜篌。十六诵诗书。十七为君妇。心中常苦...


In [5]:
#Counts of Poems by Dynastic Era 
#So few from earlier history in the corpus!
#This will mostly be a comparison of song - Qing.

all_poems["朝代"].value_counts()

宋      287114
明      237912
清       90088
唐       49195
元       37375
南北朝      4586
汉         363
秦           2
许梦青         1
Name: 朝代, dtype: int64

### Poem Text Preprocessing

    - This utilizes the pre-built segmenter and tokenizer for Chinese language available
      from pip install chinese. 
      
    -Segmentation is necessary for Chinese language inputs without spaces -- segmentation is done
     algorithmically within the ChineseAnalyzer module so is subject to error. 
     
    -Spaces are necessary in NLP for token (word unit) vectorization (creating numerical representations of unique words for algorithmic processing and pattern organization. 

In [6]:
print(all_poems["内容"].iloc[0].split('。')[0] , "Example of First Line")

print('Full Poem, Example First Line ')
all_poems["内容"].iloc[0].split('。')

武功太白，去天三百 Example of First Line
Full Poem, Example First Line 


['武功太白，去天三百', '孤云两角，去天一握', '山水险阻，黄金子午', '蛇盘鸟栊，势与天通', '']

In [7]:
from chinese import ChineseAnalyzer
analyzer = ChineseAnalyzer()
result = analyzer.parse(all_poems["内容"].iloc[0].split('。')[0]
)
result.tokens()
result.pprint()

{'original': '武功太白，去天三百',
 'parsed': [{'dict_data': [{'definitions': ['Wugong County in Xianyang '
                                            '咸陽|咸阳[Xian2 yang2], Shaanxi'],
                            'kind': 'Simplified',
                            'match': '武功',
                            'pinyin': ['Wu3', 'gong1']},
                           {'definitions': ['martial art',
                                            'military accomplishments',
                                            '(Peking opera) martial arts '
                                            'feats'],
                            'kind': 'Simplified',
                            'match': '武功',
                            'pinyin': ['wu3', 'gong1']}],
             'token': ('武功', 0, 2)},
            {'dict_data': [{'definitions': ['Taibai County in Baoji 寶雞|宝鸡[Bao3 '
                                            'ji1], Shaanxi',
                                            'Venus'],
                            'ki

#### Create Multiprocessing process to distribute ChineseAnalyzer across CPUs at scale

In [8]:

analyzer = ChineseAnalyzer()

#Distribute ChineseAnalyzer function to segment and tokenize each poem across multiple CPUs. 

def segment_chinese(input_text_df):
   
    #Instantiate progress bar so this doesn't get stuck in an infinite loop!
    
    result = analyzer.parse(input_text_df)
    tokens = result.tokens()
    #scruntch segmented tokens back into a string for input to vectorizers
    re_string = ' '.join(tokens).replace(' , ',',').replace(' .','.').replace(' !','!').replace(' 。', '。').replace(' ?', '?')
    #dictionary_items = result.pformat()
    return re_string

from multiprocessing import Pool

p = Pool(8)


all_poems["segmented"] = p.map(segment_chinese, all_poems["内容"].astype('unicode'))

In [9]:
print('print examples of segmentation output')
print(all_poems["segmented"][0])
print(' ')
print('------------------')
print(all_poems["segmented"][1])
print(' ')
print('------------------')
print(all_poems["segmented"][2])

print examples of segmentation output
武功 太白 ， 去 天 三百。 孤云 两角 ， 去 天一 握。 山水 险阻 ， 黄金 子午。 蛇盘 鸟 栊 ， 势 与 天通。
 
------------------
神仙 得者 茅 初成 ， 驾龙 上升 入太清。 时下 玄洲 戏 赤城 ， 继世而往 在 我盈 ， 帝若学 之腊嘉平。
 
------------------
青春 受谢 ， 白日 昭 只。 春气 奋发 ， 万物 遽 只。 冥 淩 浃 行 ， 魂 无 逃 只。 魂魄 归徕 ！ 无远遥 只。 魂乎 归徕 ！ 无东 无西无南 无北 只。 东 有 大海 ， 溺水 浟 浟 只。 螭 龙 并 流 ， 上下 悠悠 只。 雾雨淫 淫 ， 白皓胶 只。 魂乎 无东 ！ 汤谷? 只。 魂乎 无南 ！ 南有 炎火 千里 ， 蝮蛇 蜒 只。 山林 险隘 ， 虎豹 蜿只。 鰅 鳙 短 狐 ， 王 虺 骞 只。 魂乎 无南 ！ 蜮 伤 躬 只。 魂乎 无西 ！ 西方 流沙 ， 漭 洋洋 只。 豕 首 纵目 ， 被 发 鬤 只。 长爪 踞 牙 ， 诶 笑 狂 只。 魂乎 无西 ！ 多害 伤 只。 魂乎 无北 ！ 北有 寒山 ， 逴 龙 赩 只。 代水 不可 涉 ， 深不可测 只。 天白颢颢 ， 寒凝凝只。 魂乎 无往 ！ 盈 北极 只。 魂魄 归徕 ！ 閒以静 只。 自 恣 荆楚 ， 安以定 只。 逞志究 欲 ， 心意 安只。 穷身 永乐 ， 年寿延 只。 魂乎 归徕 ！ 乐不可言 只。 五谷 六仞 ， 设 菰 粱 只。 鼎 臑 盈望 ， 和 致芳 只。 内 鸧 鸽 鹄 ， 味 豺 羹 只。 魂乎 归徕 ！ 恣所尝 只。 鲜 蠵 甘鸡 ， 和 楚酪 只。 醢 豚 苦 狗 ， 脍 苴 莼 只。 吴酸蒿 蒌 ， 不 沾 薄 只。 魂 兮 归徕 ！ 恣所择 只。 炙 鸹 烝 凫 ， 煔 鹑 陈 只。 煎 鰿? 雀 ， 遽 爽存 只。 魂乎 归徕 ！ 丽以 先 只。 四 酎 并 孰 ， 不 歰 嗌 只。 清馨 冻 饮 ， 不 歠 役 只。 吴 醴白 糵 ， 和 楚 沥 只。 魂乎 归徕 ！ 不遽 惕 只。 代秦 郑卫 ， 鸣竽张 只。 伏戏 《 驾辩 》 ， 楚 《 劳商 》 只。 讴 和 《 扬 阿 》 ， 赵箫倡 只。 魂乎 归徕 ！ 定 空桑 只。 二八

In [10]:

def remove_punctuation(text_input):
    text_input = text_input
    for punctuation in "！。，《》":
    
        text_input = re.sub(punctuation, "", text_input)
        
    return text_input

p = Pool(8)
all_poems["segmented_punctuation_removed"] = p.map(remove_punctuation, all_poems["segmented"] )

p.terminate()
p.close()

In [11]:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

#Without setting parameters to eliminate the most common, and less frequent (max_df, min_df)
#terms - the count vectorizer produces too large of a vocabulary

#It could be that the segmenter is finding many unique terms that are highly intrinstic to
#chinese language -- e.g., chengyu or idiomatic type 4 character sequences.
    ### Note: This is an assumption that will have to be explored , but is not in the scope of
    ###       the present notebook. 

vectorizer = TfidfVectorizer( max_df=.95, min_df=5, norm='l2', use_idf=True , stop_words=list(baidu_stopwords['baidu_sw']))
poems_vectorized = vectorizer.fit_transform(all_poems["segmented_punctuation_removed"])


  'stop_words.' % sorted(inconsistent))


In [12]:
tfidf_features = pd.DataFrame.from_dict(vectorizer.vocabulary_, orient='index', columns = ['counts'])
tfidf_features["words"] = tfidf_features.index
tfidf_features = tfidf_features.reset_index(drop=True)

In [13]:
print(poems_vectorized.shape[0], "poems in the vectorized corpus")
print(poems_vectorized.shape[1], "features/words in the vectorized corpus")

tfidf_features.sort_values(by='counts', ascending=False)

###龟龄鹤算 is so common! Something about an old wise tortise. Why is this so pervasive?
### Check for all influential chengyu

706638 poems in the vectorized corpus
282475 features/words in the vectorized corpus


Unnamed: 0,counts,words
105264,282474,龟龙出
34404,282473,龟龙
211575,282472,龟龄鹤算
126110,282471,龟龄
278452,282470,龟鼎
...,...,...
167727,4,一丁不识
58404,3,一丁
158541,2,一一记
266606,1,一一分


In [14]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=500, random_state=42)

svd_components = svd.fit_transform(poems_vectorized)

print("SVD (Compressed) Components Cumulatively Explain", " ", sum(svd.explained_variance_ratio_)*100, "% Variance")

MemoryError: 

In [None]:
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# 3D Plot
fig = plt.figure(figsize=(7,7))
ax3D = fig.add_subplot(111, projection='3d')
ax3D.scatter(svd_components[:,0], svd_components[:,1], svd_components[:,2], s=3, c=pd.Categorical(all_poems["朝代"]).codes, marker='o')  

#plt.scatter(svd_components[:,0], svd_components[:,1], c=pd.Categorical(all_poems["朝代"]).codes)

#plt.xlim(0.0, 1.0)
#plt.ylim(0.0, 2)
plt.show()

#### Cross Dynastic Corpus - Uniform Manifold Approximation (UMAP) to Vector Space
    - Several vectorization methods will be explored 

In [None]:
import umap


embedding = umap.UMAP(n_components=2, metric='cosine').fit(poems_vectorized[0:100000])

In [None]:
# For interactive plotting use
# f = umap.plot.interactive(embedding, labels=dataset.target, hover_data=hover_df, point_size=1)
# show(f)
f = umap.plot.points(embedding, labels=poems_all['朝代'])