<a href="https://colab.research.google.com/github/rainermesi/wiki_Parse/blob/master/wiki_word_gen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Prep

Import modules

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [68]:
import numpy as np
import pandas as pd
import random
import re
from collections import defaultdict
from collections import Counter

Read in wikipedia corpus

In [4]:
wiki_full_corpus = open(r'/content/drive/My Drive/DATA/Wikipedia/etwiki_latest/wiki_et.txt','r',encoding='utf-8').read()
wiki_word_list = full_corpus.lower().split()

Read in Tammsaare corpus

In [69]:
! pip install ebooklib
import ebooklib
from ebooklib import epub

Collecting ebooklib
[?25l  Downloading https://files.pythonhosted.org/packages/00/38/7d6ab2e569a9165249619d73b7bc6be0e713a899a3bc2513814b6598a84c/EbookLib-0.17.1.tar.gz (111kB)
[K     |███                             | 10kB 11.8MB/s eta 0:00:01[K     |█████▉                          | 20kB 2.1MB/s eta 0:00:01[K     |████████▉                       | 30kB 2.4MB/s eta 0:00:01[K     |███████████▊                    | 40kB 2.7MB/s eta 0:00:01[K     |██████████████▊                 | 51kB 2.5MB/s eta 0:00:01[K     |█████████████████▋              | 61kB 2.7MB/s eta 0:00:01[K     |████████████████████▋           | 71kB 3.0MB/s eta 0:00:01[K     |███████████████████████▌        | 81kB 3.2MB/s eta 0:00:01[K     |██████████████████████████▍     | 92kB 3.3MB/s eta 0:00:01[K     |█████████████████████████████▍  | 102kB 3.3MB/s eta 0:00:01[K     |████████████████████████████████| 112kB 3.3MB/s 
Building wheels for collected packages: ebooklib
  Building wheel for ebooklib (s

In [72]:
book = epub.read_epub(r'/content/drive/My Drive/DATA/Wikipedia/etwiki_latest/Anton_Hansen_Tammsaare_Tode_ja_oigus_I.epub')

In [89]:
book_corpus = []

for i in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
  book_corpus.append(i.get_content())

In [91]:
# bsoup to parse the html?
# https://medium.com/@zazazakaria18/turn-your-ebook-to-text-with-python-in-seconds-2a1e42804913

## Analysis

What are most popular words in corpus?

In [5]:
pd_word_series = pd.Series(word_list)

In [6]:
pd_word_series.value_counts().head(50)

ja             1148377
on              800455
kategooria      333385
eesti           307631
oli             291754
ta              271610
aastal          249307
ka              231168
ning            201392
et              171516
mis             165233
kui             156204
ei              136771
aasta           124229
või             112717
see             100403
oma             100297
tema             84331
selle            76414
viited           74003
sai              70914
sündinud         66513
kuid             65650
tartu            64389
välislingid      62625
the              62323
kes              61928
pärast           58102
seda             57875
mille            57853
jpg              55724
tallinna         55129
kus              55090
keeles           53840
kuni             51982
aga              50693
of               50308
välja            49863
aastatel         49489
nii              48971
olid             45566
võib             44797
üle              44738
ehk        

How long are words in corpus?

In [29]:
def word_len_df_gen(in_list):
    count_list = [len(item) for item in in_list]
    count_df = pd.DataFrame.from_dict(Counter(count_list).items())
    count_df.sort_values(by=1,ascending=False)
    return count_df

In [30]:
word_len_df = word_len_df_gen(pd_word_series)
word_len_df.head(50)

Unnamed: 0,0,1
0,8,4124037
1,7,4261348
2,12,1080924
3,3,2905578
4,9,3274568
5,5,5127056
6,10,2517816
7,11,1649642
8,6,5027484
9,13,804707


Create a dictionay for the graph and parse the wiki corpus

In [40]:
def create_graph_dict(corpus):
  graphdict = defaultdict(lambda:defaultdict(int))
  for word in corpus:
    prev_letter = word[0]
    for letter in word[1:]:
      graphdict[prev_letter][letter] += 1
      prev_letter = letter
  return graphdict

In [41]:
graph_dict = create_graph_dict(pd_word_series)

Clean up the graph dictionary

In [43]:
def graph_cleanup(graph):
  abc = ['A', 'a', 'B', 'b', 'D', 'd', 'E', 'e', 'F', 'f', 'G', 'g', 'H', 'h', 'I', 'i', 'J', 'j', 'K', 'k', 'L', 'l', 'M', 'm', 'N', 'n', 'O', 'o', 'P', 'p', 'R', 'r', 'S', 's', 'Š', 'š', 'Z', 'z', 'Ž', 'ž', 'T', 't', 'U', 'u', 'V', 'v', 'Õ', 'õ', 'Ä', 'ä', 'Ö', 'ö', 'Ü', 'ü']
  abc = list(dict.fromkeys(i.lower() for i in abc))
  # clean primary keys
  tempgraph = dict((k, graph[k]) for k in abc if k in graph) 
  # clean nested key value pairs
  for letter in abc:
    for item in tempgraph[letter].copy():
      if item not in abc:
        del tempgraph[letter][item]
  return tempgraph

In [44]:
graph_dict = graph_cleanup(graph_dict)

Traverse the graph_dict and create new words

In [65]:
def traverse_graph(graph, word_len=3, start_node=None):
  """Returns a list of words from a randomly weighted walk."""
  if word_len <= 0:
    return []
  
  # If not given, pick a start node at random.
  if not start_node:
    start_node = random.choice(list(graph.keys()))
  
  
  weights = np.array(
      list(graph[start_node].values()),
      dtype=np.float64)
  # Normalize letter counts to sum to 1. Create % weights for each letter.
  weights /= weights.sum()

  # Pick next letter using weighted distribution.
  choices = list(graph[start_node].keys())
  chosen_letter = np.random.choice(choices, None, p=weights)
  
  # recursively build a word until word_len = 0
  return [chosen_letter] + traverse_graph(
      graph, word_len=word_len-1,
      start_node=chosen_letter)

In [93]:
for i in range(10): 
  print(''.join(traverse_graph(graph_dict,word_len=random.choice(word_len_df[0]))))

ovakulõl
omadiinõ
eiinoomo
slisiks
insinis
alobestnemä
iemuleigluande
ridatuvadehere
asudmäesahemär
tanisks
