<a href="https://colab.research.google.com/github/rainermesi/estonian_literature_markov_chain/blob/main/markov_word_gen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Generating new words using a Markov Chain and public domain Estonian literature

In Google Colab, this step is needed to mount Google Drive to the notebook for persistent storage.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Modules

In [None]:
import numpy as np
import pandas as pd
from pandas.core.common import flatten
import random
import re
from collections import defaultdict
from collections import Counter
import unicodedata

! pip install BeautifulSoup4
from bs4 import BeautifulSoup
import requests
import time

! pip install ebooklib
import ebooklib
from ebooklib import epub
import os
import pickle

Get the list of public domain books. Tartu Public Library publishes and distributes Estonian classics: https://www.luts.ee/index.php/111-e-raamatud

In [13]:
url = "https://www.luts.ee/index.php/111-e-raamatud"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')

In [14]:
list_of_urls = []
for i in soup.find_all('a'):
  if i.get_text() == 'e-lugerisse (epub)':
    list_of_urls.append('https://www.luts.ee'+i.get('href'))

For some reason ebooklib does not like getting the epub file straight from requests, so I'm downloading a local copy of each book.

In [15]:
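for i in list_of_urls:
  r = requests.get(i, allow_redirects=True)
  # use a context manager so each file is closed after it is written
  with open(i.split('/')[-1], 'wb') as f:
    f.write(r.content)
  time.sleep(3)  # pause between requests to go easy on the server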

Creating a function to parse the epubs into a list of words.

In [25]:
def epub_to_text(corpus):
  # tags whose text is markup or metadata rather than book content
  blacklist = ['[document]','noscript','header','html','meta','head','input','script']
  output_list = []
  for i in corpus:
    soup = BeautifulSoup(i,'html.parser')
    text = soup.find_all(string=True)
    for t in text:
      if t.parent.name not in blacklist:
        # note: NFKD splits letters such as õ, ä, ö, ü into a base letter plus a combining mark
        output_list.append(unicodedata.normalize("NFKD",t).strip())
  # split every text node on whitespace and flatten the result into one list of words
  output_list = [w for s in output_list for w in s.split()]
  # drop tokens of one or two characters
  output_list = [i for i in output_list if len(i) > 2]
  # strip punctuation, then lowercase
  for ch in [',', "'", '«', '»', '(', ')']:
    output_list = [i.replace(ch,'') for i in output_list]
  output_list = [i.lower() for i in output_list]
  return output_list

A final loop to iterate over all the books and join them into one long list of words.


In [None]:
corpus_word_list = []
for filename in os.listdir(os.curdir):
  if filename.endswith(".epub"):
    try:
      book = epub.read_epub(filename)
      list_of_elements = []
      for i in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        list_of_elements.append(i.get_content())
      list_of_words = epub_to_text(list_of_elements)
      corpus_word_list.extend(list_of_words)
    except Exception as e:
      # report the failing file and the reason, then carry on with the next book
      print('Error at element:',filename,e)

Back up the list to disk.

In [52]:
with open('corpus_word_list.txt', 'wb') as fp:
    pickle.dump(corpus_word_list, fp)

# load the list from disk
# with open ('corpus_word_list.txt', 'rb') as fp:
#    corpus_word_list = pickle.load(fp)

### Analysis of the corpus

Convert the list of words into a pandas Series object.

In [35]:
pd_word_series = pd.Series(corpus_word_list)

What are the most common words in the corpus?

In [37]:
pd_word_series.value_counts().head(10)

kui     11852
oli      9907
aga      9004
siis     7645
see      6533
oma      6134
mis      5710
nii      4465
tema     4199
nagu     4179
dtype: int64

How long are the words in the corpus?

In [38]:
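def word_len_df_gen(in_list):
    # column 0 holds the word length, column 1 the number of words with that length
    count_list = [len(item) for item in in_list]
    count_df = pd.DataFrame(list(Counter(count_list).items()))
    # sort_values returns a new frame, so the result has to be assigned
    count_df = count_df.sort_values(by=1,ascending=False)
    return count_df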

In [50]:
word_len_df = word_len_df_gen(pd_word_series)
word_len_df.sort_values(by=[1],ascending=False).head(10)

     0       1
11   4  151282
0    5  144065
10   6  136811
2    7  100647
15   3   97769
5    8   67974
1    9   47942
12  10   29018
7   11   17867
6   12   10691


### Creating the Markov Chain to generate new words

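Create a dictionary for the graph and parse the corpus. Each key is a letter, and each value maps the letters that follow it to the number of times they do so.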

In [41]:
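def create_graph_dict(corpus):
  # graphdict[a][b] counts how often letter b directly follows letter a
  graphdict = defaultdict(lambda:defaultdict(int))
  for word in corpus:
    if not word:
      continue  # skip empty strings, which have no first letter
    prev_letter = word[0]
    for letter in word[1:]:
      graphdict[prev_letter][letter] += 1
      prev_letter = letter
  return graphdict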

In [42]:
graph_dict = create_graph_dict(pd_word_series)
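
To make the structure of the graph concrete, here is a minimal sketch that applies the same counting idea to a tiny corpus. The three words and the names `count_transitions`, `toy_corpus` and `toy_graph` are made up for illustration and are not part of the notebook.

```python
from collections import defaultdict

def count_transitions(corpus):
    # Same idea as create_graph_dict: count how often each letter
    # is directly followed by each other letter.
    graph = defaultdict(lambda: defaultdict(int))
    for word in corpus:
        for prev_letter, letter in zip(word, word[1:]):
            graph[prev_letter][letter] += 1
    return graph

toy_corpus = ["kala", "kana", "maja"]  # hypothetical mini corpus
toy_graph = count_transitions(toy_corpus)

followers_of_a = dict(toy_graph["a"])
followers_of_k = dict(toy_graph["k"])
print(followers_of_a)
print(followers_of_k)
```

In this toy corpus `k` is always followed by `a`, while `a` is followed equally often by `l`, `n` and `j`, so a walk that reaches `a` would pick each of those with probability 1/3.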

Clean up the graph dictionary so that only letters of the Estonian alphabet remain, both as keys and as followers.

In [43]:
def graph_cleanup(graph):
  abc = ['A', 'a', 'B', 'b', 'D', 'd', 'E', 'e', 'F', 'f', 'G', 'g', 'H', 'h', 'I', 'i', 'J', 'j', 'K', 'k', 'L', 'l', 'M', 'm', 'N', 'n', 'O', 'o', 'P', 'p', 'R', 'r', 'S', 's', 'Š', 'š', 'Z', 'z', 'Ž', 'ž', 'T', 't', 'U', 'u', 'V', 'v', 'Õ', 'õ', 'Ä', 'ä', 'Ö', 'ö', 'Ü', 'ü']
  abc = list(dict.fromkeys(i.lower() for i in abc))
  # clean primary keys
  tempgraph = dict((k, graph[k]) for k in abc if k in graph) 
  # clean nested key value pairs
  for letter in abc:
    try:
      for item in tempgraph[letter].copy():
        if item not in abc:
          del tempgraph[letter][item]
    except KeyError:
      print(letter,'not in corpus as key, skipping letter')
  return tempgraph

In [44]:
graph_dict = graph_cleanup(graph_dict)

š not in corpus as key, skipping letter
ž not in corpus as key, skipping letter
õ not in corpus as key, skipping letter
ä not in corpus as key, skipping letter
ö not in corpus as key, skipping letter
ü not in corpus as key, skipping letter
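
The messages above report that none of the accented letters appear as keys. One plausible explanation, assuming the alphabet list in `graph_cleanup` uses precomposed characters, is the `NFKD` normalization applied in `epub_to_text`: it decomposes a letter such as õ into a plain `o` followed by a combining tilde, so the precomposed õ never occurs in the corpus. The short sketch below only demonstrates this behaviour of `unicodedata.normalize`; the sample word is chosen for illustration.

```python
import unicodedata

word = "\u00f5htu"  # "õhtu", written with the precomposed character U+00F5
decomposed = unicodedata.normalize("NFKD", word)
composed = unicodedata.normalize("NFC", decomposed)

# NFKD turns õ into 'o' + U+0303 (combining tilde), so the string gets longer
print(len(word), len(decomposed), len(composed))
# the precomposed õ is gone after NFKD and restored by NFC
print("\u00f5" in decomposed, "\u00f5" in composed)
```

If that is indeed the cause, normalizing with `"NFC"` instead would keep these letters intact. The word-length counts and the generated words would then differ from the outputs recorded in this notebook.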


Create a function to traverse the graph dictionary and create new words

In [45]:
def traverse_graph(graph, word_len=3, start_node=None):
  """Returns a list of words from a randomly weighted walk."""
  if word_len <= 0:
    return []
  
  # If not given, pick a start node at random.
  if not start_node:
    start_node = random.choice(list(graph.keys()))
  
  
  weights = np.array(
      list(graph[start_node].values()),
      dtype=np.float64)
  # Normalize letter counts to sum to 1. Create % weights for each letter.
  weights /= weights.sum()

  # Pick next letter using weighted distribution.
  choices = list(graph[start_node].keys())
  chosen_letter = np.random.choice(choices, None, p=weights)
  
  # recursively build a word until word_len = 0
  return [chosen_letter] + traverse_graph(
      graph, word_len=word_len-1,
      start_node=chosen_letter)
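
The same weighted walk can also be written as a loop, which avoids recursion and makes it easy to stop at a letter that has no recorded followers. This is a sketch under stated assumptions rather than a replacement for the function above: `walk_graph` and the toy graph are hypothetical names and data, and `random.choices` from the standard library stands in for `np.random.choice`.

```python
import random

def walk_graph(graph, word_len=3, start_node=None):
    """Return up to word_len letters from a weighted random walk (iterative sketch)."""
    current = start_node or random.choice(list(graph.keys()))
    letters = []
    for _ in range(word_len):
        followers = graph.get(current)
        if not followers:
            break  # dead end: this letter has no recorded followers
        choices = list(followers.keys())
        counts = list(followers.values())
        # random.choices accepts raw counts as weights, no normalization needed
        current = random.choices(choices, weights=counts, k=1)[0]
        letters.append(current)
    return letters

# hypothetical toy graph: letter -> {following letter: count}
toy_graph = {"k": {"a": 2}, "a": {"l": 1, "n": 1}, "l": {"a": 1}, "n": {"a": 1}}
sample = "".join(walk_graph(toy_graph, word_len=5, start_node="k"))
print(sample)
```

With this toy graph every walk that starts at `k` alternates between `a` and either `l` or `n`, so the printed word always has five letters and begins with `a`.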

Generate new words with a random word length, picked from the word lengths observed in the corpus.

In [47]:
for i in range(10): 
  print(''.join(traverse_graph(graph_dict,word_len=random.choice(word_len_df[0]))))

ahkulenoinanguo
astlerastaenelikadidik
akubedalui
nolerakemandevagmala
ikatopudadusetasisobuasaelikolisenagealudre
emanoolilalmisuagstarituist
eeinaigemutedm
ilinasoolagiroodaar
adeseagagautolinesisimmadahesulgarodamaseel
t
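
Note that `random.choice(word_len_df[0])` picks uniformly among the distinct lengths in the corpus, so a rare length is as likely as a common one, which may explain the very long words above. A possible alternative, sketched here with a made-up word list and hypothetical variable names, is to weight each length by how often it occurs.

```python
import random
from collections import Counter

# hypothetical mini corpus; in the notebook this would be the full word list
words = ["kui", "oli", "aga", "siis", "tema", "raamat"]

length_counts = Counter(len(w) for w in words)
lengths = list(length_counts.keys())
weights = list(length_counts.values())

# draw lengths in proportion to their frequency in the corpus
random.seed(1)
sampled = random.choices(lengths, weights=weights, k=1000)
share_of_threes = sampled.count(3) / len(sampled)
print(length_counts)
print(share_of_threes)
```

Half of the toy words have three letters, so roughly half of the sampled lengths should be 3. Each sampled length could then be passed as `word_len` to `traverse_graph`.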


Generate new words with a word length of 5 characters (the 2nd most common length in the corpus).

In [49]:
for i in range(10): 
  print(''.join(traverse_graph(graph_dict,word_len=5)))

udiko
sivae
udark
oimas
otase
dsost
abedu
isste
astei
hadab
