
In this notebook, we learn how to generate poems in the style of Truyện Kiều using the [Markovify](https://github.com/jsvine/markovify) module. The point is to walk through a small data science pipeline end to end: data processing, modelling, testing, and implementation.


# Truyện Kiều

[Truyện Kiều](https://en.wikipedia.org/wiki/The_Tale_of_Kieu) is a famous epic poem written by Nguyen Du. Every student in Vietnam studies this poem.

The poem is written in the 6-8 style: a 6-word sentence is followed by an 8-word sentence, and the two sentences are linked by rhyme.

We will use [Markovify](https://github.com/jsvine/markovify). This module takes a whole text as one string (of many sentences) and then generates individual sentences. Each generated sentence is independent of the one before it.

Therefore, we pre-process Truyện Kiều by joining each 6-word sentence with the 8-word sentence that follows it, so that the model treats a full 6-8 couplet as one sentence.
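
The pairing step can be sketched in plain Python. This is a minimal illustration and not the notebook's actual preprocessing code: the helper name `pair_luc_bat_lines` is hypothetical, and it assumes the input lines strictly alternate between a 6-word line and an 8-word line.

```python
def pair_luc_bat_lines(lines):
    """Join each 6-word line with the 8-word line that follows it.

    Hypothetical helper: assumes `lines` strictly alternates
    6-word, 8-word, 6-word, 8-word, ...
    """
    # Drop empty lines and surrounding whitespace first.
    cleaned = [line.strip() for line in lines if line.strip()]
    couplets = []
    # Walk through the lines two at a time: (six-word line, eight-word line).
    for six, eight in zip(cleaned[0::2], cleaned[1::2]):
        couplets.append(f"{six} {eight}")
    return couplets


poem = [
    "Trăm năm trong cõi người ta",
    "Chữ tài chữ mệnh khéo là ghét nhau",
    "Trải qua một cuộc bể dâu",
    "Những điều trông thấy mà đau đớn lòng",
]
couplets = pair_luc_bat_lines(poem)
for couplet in couplets:
    print(len(couplet.split()), couplet)
```

Each printed couplet has 14 words, which is the unit the model is later trained on.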

References:
- [Word2vec on Nguyen Du](https://blog.duyet.net/2017/04/nlp-truyen-kieu-word2vec.html#.XJOZAChKiqY)
- [Other blog](https://github.com/kavgan/nlp-in-practice/blob/master/word2vec/Word2Vec.ipynb)
- [Poetry generator with RNN and Markov chains (Kaggle)](https://www.kaggle.com/paultimothymooney/poetry-generator-rnn-markov/notebook)
- [Markovify](https://github.com/jsvine/markovify)
- [Thơ máy](http://www.thomay.vn/thomay/)

# Data preprocessing

If you already have a 'clean' data file like this [one](https://github.com/tranvohuy/sentiment_sentence/blob/master/data/Truyen_Kieu.txt), you can skip this part.

- First, we copy the content [here](https://sites.google.com/site/khonggianketnoidqt/truyen-kieu-tron-bo), and save it in a text file `Truyen_Kieu_from_web.txt`.
  
  - The text there is not well punctuated.
  - There is a mistake at line 120, and another above line 130.
  - To get a 'cleaner' version, we would have to check the poem carefully or find a better source. For a basic model we can ignore these problems and keep things as simple as possible. Later we can build a more complicated model (cleaner data, better model structure, etc.).
- Then upload it to Colab.
- Import it into a dataframe.
- Use regex to delete redundant details.
  - As mentioned above, for a basic model it is better to delete all punctuation marks and to lowercase all characters.
- Save the poem to a txt file.

- Some of the code is adapted from [here](https://blog.duyet.net/2017/04/nlp-truyen-kieu-word2vec.html#.XJU7BShKiqY).

In [0]:
from google.colab import files
files.upload()
import pandas as pd
file_name = 'Truyen_Kieu_from_web.txt'
df = pd.read_csv(file_name, sep="/", names=["row"]).dropna()

#to see what df is about
#df.sample(4)
#df.head(10)
#etc

import re

def transform_row(row):
    # Delete numbers, dots, commas at the beginning of sentences
    row = re.sub(r"^[0-9\.,]+", "", row)
    
    # Delete dots, commas, question marks at the end
    row = re.sub(r"[\.,\?!]+$", "", row)
    
    #remove white spaces at the beginning and end of sentences
    row = row.strip()
    return row 

df["row"] = df.row.apply(transform_row)

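#save df to txt file: each 6-word line is written on the same
#output line as the 8-word line that follows it

with open('Truyen_Kieu_simple_version.txt', 'w', encoding='utf-8') as f:
    for index, row in df.iterrows():
        f.write(row['row'] + ' ')
        #rows with an odd index are the 8-word lines, so the couplet ends here
        if index % 2 != 0:
            f.write('\n')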

Saving Truyen_Kieu_from_web.txt to Truyen_Kieu_from_web.txt
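
The cell above strips only leading numbers and trailing punctuation. The fuller normalization mentioned in the list (remove all punctuation, lowercase everything) could look like the sketch below. The function name `normalize_line` and the exact set of punctuation characters are assumptions made for illustration; adjust them to the source text.

```python
import re

# Assumed punctuation set; extend it if the source text uses other marks.
PUNCTUATION = re.compile("[.,;:?!\"'“”‘’()…-]")


def normalize_line(line):
    """Remove leading line numbers and all punctuation, then lowercase."""
    # Leading line numbers such as "15." or "120,".
    line = re.sub(r"^\s*[0-9.,]+", "", line)
    # Replace punctuation with spaces so that words never get glued together.
    line = PUNCTUATION.sub(" ", line)
    # Collapse runs of whitespace left behind by the substitutions.
    line = re.sub(r"\s+", " ", line).strip()
    return line.lower()


print(normalize_line("15. Đầu lòng hai ả tố nga,"))
print(normalize_line("Thúy Kiều là chị, em là Thúy Vân."))
```

Normalizing this way matters for the model: otherwise tokens such as `Rằng` and `Rằng:` are counted as two different words.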


# Import Truyen Kieu
- Now suppose we have a clean version. 
- Import it now.

In [0]:
from google.colab import files
files.upload()
import pandas as pd

Saving Truyen_Kieu_simple_v1.txt to Truyen_Kieu_simple_v1.txt


In [0]:
!pip install markovify

Collecting markovify
  Downloading https://files.pythonhosted.org/packages/94/b2/b4ce1e461bb3482b1fd63328a2097aed5917cdfa0dbfe9492a84ea46e2ab/markovify-0.7.1.tar.gz
Collecting unidecode (from markovify)
  Downloading https://files.pythonhosted.org/packages/31/39/53096f9217b057cb049fe872b7fc7ce799a1a89b76cf917d9639e7a558b5/Unidecode-1.0.23-py2.py3-none-any.whl (237kB)
    100% |████████████████████████████████| 245kB 7.1MB/s 
Building wheels for collected packages: markovify
  Building wheel for markovify (setup.py) ... done
  Stored in directory: /root/.cache/pip/wheels/66/fe/5b/07257dd2401d9835447a0f0223f967c998c153404d32612253
Successfully built markovify
Installing collected packages: unidecode, markovify
Successfully installed markovify-0.7.1 unidecode-1.0.23


In [0]:
import markovify
with open("Truyen_Kieu_simple_version.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.NewlineText(text, state_size = 2)

# Print five randomly-generated sentences
for i in range(5):
    print(text_model.make_sentence())

Chữ trinh đáng giá nghìn vàng chẳng ngoa
Cùng nhau trông mặt càng ngẩn ngơ Ruột tằm ngày một vắng tin Mặn tình cát lũy lạt tình tào khang
Họ Chung có kẻ lại người qua Xót nàng còn chút xa xôi Mà ta bất động nữa người sinh nghi
Đường xa nghĩ nỗi sau này đã bỏ những ngày
Vì ta khăng khít, cho người thác oan thế này


## Observation:

- Most generated sentences are not 6 + 8 = 14 words long.
- Recall that the 6-8 style is a 6-word sentence followed by an 8-word one.
- Most of the time, the 6th word and the 12th word (the 6th word of the 8-word sentence) do not rhyme.
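
These two observations can be turned into a rough automatic check. The sketch below is a crude heuristic, not a proper model of Vietnamese rhyme: the helper names `rime` and `looks_like_luc_bat` are hypothetical, the rime is approximated by dropping tone marks and leading consonant letters, and only exact matches count. Near rhymes that the poem accepts (for example "dâu" and "đau") are rejected, and onsets such as "gi" and "qu" are not handled specially.

```python
import unicodedata

# The five Vietnamese tone marks as Unicode combining characters:
# grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}
# Letters that can appear in an initial consonant cluster (assumed set).
CONSONANTS = set("bcdđghklmnpqrstvx")


def rime(word):
    """Crude rime: the word without tone marks and leading consonants."""
    word = word.lower().strip(".,;:?!\"'")
    decomposed = unicodedata.normalize("NFD", word)
    toneless = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    toneless = unicodedata.normalize("NFC", toneless)
    start = 0
    while start < len(toneless) and toneless[start] in CONSONANTS:
        start += 1
    return toneless[start:]


def looks_like_luc_bat(sentence):
    """True if the sentence has 14 words and words 6 and 12 share a rime."""
    words = sentence.split()
    if len(words) != 14:
        return False
    return rime(words[5]) == rime(words[11])


print(looks_like_luc_bat(
    "Trăm năm trong cõi người ta Chữ tài chữ mệnh khéo là ghét nhau"))
print(looks_like_luc_bat("Chữ trinh đáng giá nghìn vàng chẳng ngoa"))
```

The first call checks an original couplet ("ta" and "là" share the rime "a"); the second checks one of the generated sentences above, which fails on length alone.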

# Understand the module Markovify



What happens when we run `text_model = markovify.NewlineText(text, state_size = 2)` or `text_model = markovify.Text(text, state_size = 2)`?


# Class Text
- [init](https://github.com/jsvine/markovify/blob/master/markovify/text.py#L17) (self, input_text, state_size=2, chain=None, parsed_sentences=None, retain_original=True)

Basically,
- `input_text` is a string of sentences. Each sentence ends with a dot `.`, not with a newline `\n`.
- If the sentences in our `input_text` end with a newline `\n`, we should use `markovify.NewlineText(input_text)` instead of `markovify.Text(input_text)`.

- In the following example, sentences end with `\n`.


In [0]:
import markovify
with open("Truyen_Kieu_clean_v1.txt") as f:
    text = f.read()

In [0]:
text[0:1000]

'Trăm năm trong cõi người ta Chữ tài chữ mệnh khéo là ghét nhau \nTrải qua một cuộc bể dâu Những điều trông thấy mà đau đớn lòng \nLạ gì bỉ sắc tư phong Trời xanh quen thói má hồng đánh ghen \nCảo thơm lần giở trước đèn Phong tình có lục còn truyền sử xanh \nRằng năm Gia Tĩnh triều Minh Bốn phương phẳng lặng, hai kinh vững vàng \nCó nhà viên ngoại họ Vương Gia tư nghĩ cũng thường thường bực trung \nMột trai con thứ rốt lòng Vương Quan là chữ, nối dòng nho gia \nĐầu lòng hai ả tố nga Thúy Kiều là chị, em là Thúy Vân \nMai cốt cách, tuyết tinh thần Một người một vẻ, mười phân vẹn mười \nVân xem trang trọng khác vời Khuôn trăng đầy đặn, nét ngài nở nang \nHoa cười ngọc thốt đoan trang Mây thua nước tóc, tuyết nhường màu da \nKiều càng sắc sảo, mặn mà So bề tài, sắc, lại là phần hơn \nLàn thu thủy, nét xuân sơn Hoa ghen thua thắm, liễu hờn kém xanh \nMột, hai nghiêng nước nghiêng thành Sắc đành đòi một, tài đành họa hai \nThông minh vốn sẵn tư trời Pha nghề thi họa, đủ mùi ca ngâm \nCung t

- `input_text` is broken into a list of sentences through [`self.generate_corpus(input_text)`](https://github.com/jsvine/markovify/blob/master/markovify/text.py#L33).
- Each sentence is then split into words.

- So `text_model.parsed_sentences` is a list of lists: each element is one sentence, stored as a list of its words.

In [0]:
text_model = markovify.NewlineText(text)

In [0]:
print(text_model.parsed_sentences[0:3])

[['Trăm', 'năm', 'trong', 'cõi', 'người', 'ta', 'Chữ', 'tài', 'chữ', 'mệnh', 'khéo', 'là', 'ghét', 'nhau'], ['Trải', 'qua', 'một', 'cuộc', 'bể', 'dâu', 'Những', 'điều', 'trông', 'thấy', 'mà', 'đau', 'đớn', 'lòng'], ['Lạ', 'gì', 'bỉ', 'sắc', 'tư', 'phong', 'Trời', 'xanh', 'quen', 'thói', 'má', 'hồng', 'đánh', 'ghen']]


- This list `text_model.parsed_sentences` is [put](https://github.com/jsvine/markovify/blob/master/markovify/text.py#L37) into a Markov chain to analyze.

We now go to the [`Chain`](https://github.com/jsvine/markovify/blob/master/markovify/chain.py) class. 
- [init](https://github.com/jsvine/markovify/blob/master/markovify/chain.py#L32) (self, corpus, state_size, model=None)
- The `init` will run [`build(self, corpus, state_size)`](https://github.com/jsvine/markovify/blob/master/markovify/chain.py#L47) 
  -  Count the number of times $w_3$ appears immediately after the words $w_1w_2$.
- And run [precompute_begin_state](https://github.com/jsvine/markovify/blob/master/markovify/chain.py#L45)

So
`text_model.chain = Chain(text_model.parsed_sentences, state_size)`

We now analyze lines [58](https://github.com/jsvine/markovify/blob/master/markovify/chain.py#L58) to 72 of `chain.py`.

 

```
model = {}

for run in corpus:
    items = ([ BEGIN ] * state_size) + run + [ END ]
    for i in range(len(run) + 1):
        state = tuple(items[i:i+state_size])
        follow = items[i+state_size]
        if state not in model:
            model[state] = {}

        if follow not in model[state]:
            model[state][follow] = 0

        model[state][follow] += 1
return model
```

where corpus = $[ [w_{11}, w_{12}, w_{13}, w_{14}], [w_{21}, w_{22}, w_{23}, w_{24}, w_{25}], \cdots]$

- First, `run` = $[w_{11}, w_{12}, w_{13}, w_{14}]$.
- `items` = [BEGIN, BEGIN, $w_{11}, w_{12}, w_{13}, w_{14}$, END]. Note that `state_size=2` by default.
- `i` runs over `(0,1,2,3,4)`:
  - `i=0` gives `state` = (BEGIN, BEGIN) and `follow` = $w_{11}$
    - model[(BEGIN, BEGIN)][$w_{11}$] = 1
  - `i=1` gives `state` = (BEGIN, $w_{11}$) and `follow` = $w_{12}$
    - model[(BEGIN, $w_{11}$)][$w_{12}$] = 1
  - `i=4` gives `state` = $(w_{13}, w_{14})$ and `follow` = END

- So model[($w_1,w_2$)][$w_3$] counts the number of times the word $w_3$ comes immediately after $w_1w_2$.

- `model` is a `dict`.
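
To make the walk-through concrete, here is a self-contained sketch of the same counting loop applied to a toy corpus. It follows the excerpt above but is written as a standalone function for illustration; the two toy sentences are lowercase fragments chosen for the example, and the marker strings match the keys visible in the model output of this notebook.

```python
BEGIN = "___BEGIN__"
END = "___END__"


def build_model(corpus, state_size=2):
    """Count, for every state (tuple of words), which word follows it."""
    model = {}
    for run in corpus:
        # Pad the sentence so the first words and the end are states too.
        items = [BEGIN] * state_size + run + [END]
        for i in range(len(run) + 1):
            state = tuple(items[i:i + state_size])
            follow = items[i + state_size]
            followers = model.setdefault(state, {})
            followers[follow] = followers.get(follow, 0) + 1
    return model


corpus = [
    "trăm năm trong cõi người ta".split(),
    "trăm năm biết có duyên gì".split(),
]
model = build_model(corpus)
print(model[(BEGIN, BEGIN)])   # {'trăm': 2}
print(model[("trăm", "năm")])  # {'trong': 1, 'biết': 1}
print(model[("người", "ta")])  # {'___END__': 1}
```

Both toy sentences start with "trăm", so the begin state has a single follower with count 2, while the state ("trăm", "năm") has two possible continuations.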

In [0]:
print(type(text_model.chain.model))
print(text_model.chain.model[("___BEGIN__", "___BEGIN__")])
#there are a lot of ways to begin a sentence.

<class 'dict'>
{'Trăm': 4, 'Trải': 2, 'Lạ': 5, 'Cảo': 1, 'Rằng': 1, 'Có': 14, 'Một': 38, 'Đầu': 2, 'Mai': 4, 'Vân': 2, 'Hoa': 8, 'Kiều': 7, 'Làn': 1, 'Một,': 1, 'Thông': 1, 'Cung': 3, 'Khúc': 5, 'Phong': 8, 'Êm': 1, 'Ngày': 4, 'Cỏ': 1, 'Thanh': 1, 'Gần': 3, 'Dập': 2, 'Ngổn': 2, 'Tà': 1, 'Bước': 3, 'Nao': 1, 'Sè': 1, 'Rằng:': 28, 'Vương': 2, 'Nổi': 1, 'Kiếp': 5, 'Thuyền': 3, 'Buồng': 5, 'Khóc': 4, 'Đã': 18, 'Sắm': 3, 'Lòng': 7, 'Đau': 4, 'Phũ': 1, 'Sống': 1, 'Nào': 4, 'đã': 3, 'Gọi': 2, 'Lầm': 1, 'Rút': 1, 'Lại': 9, 'Nỗi': 13, 'Quan': 3, 'ở': 6, 'Dễ': 3, 'Thoắt': 6, 'Dấu': 1, 'Nàng': 57, 'Chớ': 1, 'Dùng': 3, 'Trông': 10, 'đề': 1, 'Tuyết': 1, 'Nẻo': 1, 'Hài': 1, 'Chàng': 9, 'Nguyên': 1, 'Nền': 1, 'Chung': 4, 'Vẫn': 1, 'Nước': 3, 'May': 1, 'Bóng': 5, 'Người': 15, 'Chập': 1, 'Dưới': 6, 'Gương': 2, 'Hải': 2, 'Chênh': 1, 'Sương': 1, 'Chào': 1, 'Thưa': 8, 'Hàn': 2, 'Mấy': 8, 'Vâng': 3, 'Âu': 1, 'Này': 6, 'Xem': 4, 'Ví': 5, 'Thềm': 1, 'Gió': 4, 'Giọng': 1, 'Cớ': 3, 'Buổi': 1, 'đoạn': 1, 'Cứ': 

In [0]:
print(text_model.chain.model[("Trăm", "năm")])
print(text_model.chain.model[("___BEGIN__", "Trăm")])
print(text_model.chain.model["Thúy", "Kiều"])
print(text_model.chain.model["Thúy", "Vân"])
#There are not many ways to continue the sentence given "Thúy Vân"

{'trong': 1, 'biết': 1, 'tạc': 1, 'thề': 1, 'để': 1, 'tính': 1, 'danh': 1}
{'năm': 2, 'nghìn': 1, 'điều': 1}
{'là': 1, 'sắc': 1, 'Mắc': 1, 'tài': 1}
{'___END__': 2, 'chợt': 1, 'thay': 1}


We now analyze [`precompute_begin_state()`](https://github.com/jsvine/markovify/blob/master/markovify/chain.py#L45)



```
        begin_state = tuple([ BEGIN ] * self.state_size)
        choices, weights = zip(*self.model[begin_state].items())
        cumdist = list(accumulate(weights))
        self.begin_cumdist = cumdist
        self.begin_choices = choices
```

- `self.begin_choices` is the tuple of words that appear at the beginning of sentences in `input_text`. `self.begin_cumdist` holds the running (cumulative) totals of their counts, which serve as weights when the first word of a new sentence is drawn.
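
The sketch below shows one standard way to draw a word from such cumulative totals, using `bisect` from the standard library. It is an illustration of the technique, not a copy of markovify's code; the function name `pick` is hypothetical, and the three counts are taken from the begin-state output printed earlier.

```python
import bisect
import random
from itertools import accumulate

# Three of the begin-word counts shown in the model output above.
begin_counts = {"Nàng": 57, "Một": 38, "Đã": 18}
choices, weights = zip(*begin_counts.items())
cumdist = list(accumulate(weights))  # [57, 95, 113]


def pick(choices, cumdist, rng=random):
    """Draw one choice with probability proportional to its count."""
    # r is uniform in [0, total); bisect finds the first cumulative
    # total that is strictly greater than r.
    r = rng.random() * cumdist[-1]
    return choices[bisect.bisect(cumdist, r)]


rng = random.Random(0)
samples = [pick(choices, cumdist, rng) for _ in range(2000)]
print({word: samples.count(word) for word in choices})
```

Words with larger counts cover a wider slice of the interval `[0, total)`, so they are drawn more often; "Nàng" should appear roughly three times as often as "Đã".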

In [0]:
print(type(text_model.chain.begin_choices))
print(text_model.chain.begin_choices)
print(text_model.chain.begin_cumdist)

<class 'tuple'>
('Trăm', 'Trải', 'Lạ', 'Cảo', 'Rằng', 'Có', 'Một', 'Đầu', 'Mai', 'Vân', 'Hoa', 'Kiều', 'Làn', 'Một,', 'Thông', 'Cung', 'Khúc', 'Phong', 'Êm', 'Ngày', 'Cỏ', 'Thanh', 'Gần', 'Dập', 'Ngổn', 'Tà', 'Bước', 'Nao', 'Sè', 'Rằng:', 'Vương', 'Nổi', 'Kiếp', 'Thuyền', 'Buồng', 'Khóc', 'Đã', 'Sắm', 'Lòng', 'Đau', 'Phũ', 'Sống', 'Nào', 'đã', 'Gọi', 'Lầm', 'Rút', 'Lại', 'Nỗi', 'Quan', 'ở', 'Dễ', 'Thoắt', 'Dấu', 'Nàng', 'Chớ', 'Dùng', 'Trông', 'đề', 'Tuyết', 'Nẻo', 'Hài', 'Chàng', 'Nguyên', 'Nền', 'Chung', 'Vẫn', 'Nước', 'May', 'Bóng', 'Người', 'Chập', 'Dưới', 'Gương', 'Hải', 'Chênh', 'Sương', 'Chào', 'Thưa', 'Hàn', 'Mấy', 'Vâng', 'Âu', 'Này', 'Xem', 'Ví', 'Thềm', 'Gió', 'Giọng', 'Cớ', 'Buổi', 'đoạn', 'Cứ', 'Dạy', 'Ngoài', 'Hiên', 'Cho', 'Sầu', 'Mây', 'Tuần', 'Mành', 'Vì', 'Bâng', 'Nghề', 'Thâm', 'Lơ', 'Tần', 'Là', 'Lấy', 'Mừng', 'Song', 'Tấc', 'Nhẫn', 'Cách', 'Buông', 'Lần', 'Giơ', 'Ngẫm', 'Liền', 'Tan', 'Sinh', 'Thoa', 'Tiếng', 'Chiếc', 'Rày', 'Bấy', 'Vội', 'Thang', 'Sượng', 'Xương',

In [0]:
#Generate a new sentence

print(text_model.chain.walk())

['Già', 'giang', 'một', 'lão', 'một', 'trai', 'Một', 'dây', 'một', 'buộc', 'ai', 'làm', '?', 'Này', 'ai', 'đan', 'dậm,', 'giật', 'giàm', 'bỗng', 'dưng']


In [0]:
print(text_model.chain.walk(init_state = ("Trăm", "năm")))
print(text_model.chain.walk(init_state = ("Thúy", "Kiều")))

['danh', 'tiết', 'cũng', 'vì', 'đêm', 'nay']
['Mắc', 'điều', 'tình', 'ái', 'khỏi', 'điều', 'tà', 'dâm']
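
What a walk does can be sketched in a few lines of plain Python. This is a simplified stand-in written for illustration, not markovify's implementation; the function `walk` and the hand-written toy model below are assumptions. As in the outputs above, the words of `init_state` are not included in the result.

```python
import random

BEGIN = "___BEGIN__"
END = "___END__"


def walk(model, state_size=2, init_state=None, rng=random):
    """Generate words from the chain until the END marker is drawn."""
    state = init_state or (BEGIN,) * state_size
    words = []
    while True:
        followers = model[state]
        # Draw the next word with probability proportional to its count.
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
        if word == END:
            return words
        words.append(word)
        # Slide the window: drop the oldest word, append the new one.
        state = state[1:] + (word,)


# Hand-written toy model (counts of followers for each two-word state).
model = {
    (BEGIN, BEGIN): {"trăm": 2},
    (BEGIN, "trăm"): {"năm": 2},
    ("trăm", "năm"): {"trong": 1, "biết": 1},
    ("năm", "trong"): {"cõi": 1},
    ("trong", "cõi"): {END: 1},
    ("năm", "biết"): {"có": 1},
    ("biết", "có"): {END: 1},
}
print(walk(model))
print(walk(model, init_state=("trăm", "năm")))
```

In this toy model every path ends in the END marker, so the loop always terminates; on a real corpus this holds because every training sentence contributes a transition to END.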


# Conclusion:

- Markovify is a very basic model.

- It is a Markov chain with `state_size = 2`, which means it produces words $w_1w_2w_3w_4\cdots$ in such a way that the conditional law of $(w_4w_5\cdots| w_1w_2w_3)$ is the same as that of $(w_4w_5\cdots|w_2w_3)$.

- It chooses the word $w_4$ given $w_1w_2w_3$ by learning statistically from the original text what appears after $w_2w_3$.

- In other words, given $w_2w_3$, the future $w_4w_5\cdots$ is conditionally independent of $w_1$.

- The chain stops when the new word is ["END"](https://github.com/jsvine/markovify/blob/master/markovify/chain.py#L108).
  - Will it ever hit "END"?  interesting..;-)
  - The whole sentence is then [compared](https://github.com/jsvine/markovify/blob/master/markovify/text.py#L189) to the original text to see whether it overlaps ("copies") too much of it.
  - If there is too much overlap, a new sentence is generated.
  - After `tries = 10` unsuccessful attempts, `None` is returned.

- The Markov chain has no "new" creativity, in the sense that if the current state is "Trăm năm", the next word is always one of `['trong', 'biết', 'tạc', 'thề', 'để', 'tính', 'danh']`, the words that follow "Trăm năm" in the original text.

- The model/Markov chain with `state_size=2` does not obey the 6-8 rule of the poem. In a couplet $w_{11}\cdots w_{16}, w_{21}\cdots w_{28}$, the words $w_{16}$ and $w_{26}$ have to rhyme.

- Therefore, this model works better for free "text" styles than for poem styles, since the latter have a rigid structure.
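
The generate, compare, retry loop described in this list can be sketched as follows. It is a simplified illustration under stated assumptions and not markovify's actual overlap test: the helper names are hypothetical, and the threshold here is a fixed number of consecutive words (`max_run`) rather than the ratio-based limits the library uses.

```python
def shares_long_run(words, original_text, max_run):
    """True if `max_run` consecutive words occur verbatim in the text."""
    padded = f" {original_text} "
    for i in range(len(words) - max_run + 1):
        gram = " ".join(words[i:i + max_run])
        # Pad with spaces so that only whole words can match.
        if f" {gram} " in padded:
            return True
    return False


def make_sentence(generate, original_text, max_run=4, tries=10):
    """Call `generate` until a sentence passes the overlap check."""
    for _ in range(tries):
        words = generate()
        if not shares_long_run(words, original_text, max_run):
            return " ".join(words)
    # Every attempt copied too much of the original text.
    return None


original = "trăm năm trong cõi người ta"
print(make_sentence(lambda: "trăm năm trong cõi".split(), original))
print(make_sentence(lambda: "người ta trăm năm".split(), original))
```

The first generator always copies four consecutive words from the original, so all ten tries fail and the result is `None`; the second produces a word order that does not occur in the original and is accepted on the first try.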


## How can we improve the performance?
- We can increase `state_size`, say up to `6`, so that the model learns the 6-8 rule. But then the model has a very limited set of continuations (it is rigid) for producing new sentences.
  - It is likely to reproduce a sentence from the original text. Because of the [`max_overlap`](https://github.com/jsvine/markovify/blob/master/markovify/text.py#L8) rules, it is then likely to return `None`.
    - For example, if `state_size = 3` and we are looking for the word after `Trăm năm trong`, the next word has to be `cõi`. There is not much flexibility with `state_size=3`, unless the original text is huge or something else happens.
  
  - We could let the program choose a new word "out of the history" based on a different text.

- Experiment with different ways to model the text: for example, try word2vec or an RNN.
- Add a rule: for each sentence, check whether it satisfies the "rhyme" rule. If it doesn't, generate a new sentence.
  - One suggestion is to use [nltk.pos_tag](https://medium.com/@gianpaul.r/tokenization-and-parts-of-speech-pos-tagging-in-pythons-nltk-library-2d30f70af13b). This function analyzes the words in each sentence and assigns each word its role (noun, verb, pronoun, adjective, etc.).
  - Or use [spacy](https://github.com/jsvine/markovify).
  - We have to use a [Vietnamese](https://spacy.io/usage/models#languages) [version](https://github.com/undertheseanlp/underthesea).
  
- We could take a large dataset of 6-8 poems and let a `general_68_rule` model learn the rhyme rule from that dataset.
  - Then combine it with the model from `Truyen_Kieu`. Set a small weight (i.e. transition probability) for `general_68_rule` and a big one for `Truyen_Kieu`.
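
The weighting part of the last idea can be sketched with plain dictionaries of follower counts. The function `combine_counts` and both toy models are hypothetical and only illustrate the arithmetic; markovify also provides a `markovify.combine(models, weights)` function, which is probably the more practical route for real models.

```python
def combine_counts(models, weights):
    """Merge follower-count models, scaling each model's counts by its weight."""
    combined = {}
    for model, weight in zip(models, weights):
        for state, followers in model.items():
            target = combined.setdefault(state, {})
            for word, count in followers.items():
                target[word] = target.get(word, 0) + count * weight
    return combined


# Toy follower counts for one state in each (hypothetical) model.
kieu = {("trăm", "năm"): {"trong": 1, "biết": 1}}
general_68 = {("trăm", "năm"): {"trong": 1, "qua": 2}}

# Big weight for Truyen_Kieu, small weight for the general 6-8 model.
combined = combine_counts([kieu, general_68], [1.0, 0.1])
print(combined[("trăm", "năm")])
```

With these weights the word "qua", which only the general model knows, becomes possible after "trăm năm" but stays much less likely than the continuations found in Truyện Kiều.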