# Making a Frequency Distribution
### Often, when working with a corpus, or with a body of words drawn from one, it's helpful to compute a frequency distribution. A frequency distribution is usually a count of each distinct word form; here the counts are then normalized so that all frequency values fall between 0.0 and 0.99999.
#### why 0.99999? see: https://en.wikipedia.org/wiki/Cromwell%27s_rule


In [1]:
%load_ext autoreload
%autoreload 2


### Standard library and NumPy imports

In [2]:
from collections import Counter
import numpy as np
import pickle

### Third-party and custom library imports

In [3]:
from tqdm import tqdm
from cltk.corpus.readers import get_corpus_reader
from cltk.prosody.latin.string_utils import remove_punctuation_dict
from cltk.stem.latin.j_v import JVReplacer
from building_language_model.aeoe_replacer import AEOEReplacer
from sklearn.preprocessing import MinMaxScaler

In [4]:
latin_reader = get_corpus_reader(corpus_name='latin_text_latin_library', language='latin')

In [5]:
word_counter = Counter()
jv_replacer = JVReplacer()
aeoe_replacer = AEOEReplacer()

for word in tqdm(latin_reader.words()):
    if word.isalpha():
        word = aeoe_replacer.replace(jv_replacer.replace(word))
        word_counter[word] += 1

16455728it [18:03, 15186.65it/s]


In [9]:
word_counter.most_common(10)

[('et', 426296),
 ('in', 264109),
 ('est', 170724),
 ('non', 155822),
 ('ad', 127206),
 ('ut', 115717),
 ('cum', 100820),
 ('quod', 95403),
 ('qui', 86333),
 ('si', 79412)]

In [10]:
total_words = sum(word_counter.values())
word_counter['et']

426296

In [11]:
word_counter['et']/float(total_words)

0.032449394302737196
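
The same division can be applied to every word at once. The following is a minimal sketch on a toy token list; the function name `relative_frequencies` and the tokens are illustrative assumptions, not part of the notebook's pipeline.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each distinct token to its share of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        # An empty token list has no frequencies to report.
        return {}
    return {word: count / total for word, count in counts.items()}


# Toy data for illustration only (not drawn from the Latin Library corpus).
toy_tokens = ["et", "in", "et", "est", "et", "in", "non", "et"]
toy_freqs = relative_frequencies(toy_tokens)
```

Because every count is divided by the same total, the resulting values sum to 1.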

### kai is the Greek word for 'and', transliterated into Latin. It is one of the most common words in Greek, so it is the Greek word most likely to appear as a loanword. We could therefore use its frequency as a threshold for deciding whether a given word is a candidate for being a transliterated Greek loanword; we'll try this in another notebook.

In [12]:
word_counter['kai']

476

In [13]:
word_counter['kai'] / float(total_words)

3.6232832792479645e-05
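
One possible reading of the threshold idea is sketched below: a word is flagged as a candidate if it occurs in the corpus, but no more often than 'kai'. The function name `is_loanword_candidate` and the count for 'logos' are hypothetical; only the 'et' and 'kai' counts come from the outputs above.

```python
from collections import Counter


def is_loanword_candidate(word, counts, threshold_word="kai"):
    """Return True if `word` occurs, but no more often than `threshold_word`."""
    # Counter returns 0 for unseen words, so unseen words are never flagged.
    return 0 < counts[word] <= counts[threshold_word]


# 'et' and 'kai' counts are from the corpus; the 'logos' count is made up.
toy_counts = Counter({"et": 426296, "kai": 476, "logos": 12})
```

A frequency cutoff alone would also flag rare native Latin words, so a check like this could only serve as a first filter.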

### The raw relative frequency isn't very readable, so we'll rescale it with min-max normalization.

In [20]:
words = list(word_counter.keys())
counts = [ tmp/float(total_words) for tmp in word_counter.values()]
counts = np.array(counts)
min_max_scaler = MinMaxScaler(feature_range=(0, 0.99999))
# why 0.99999? see: https://en.wikipedia.org/wiki/Cromwell%27s_rule
scaled_data = min_max_scaler.fit_transform(counts.reshape(-1, 1))

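# scaled_data has shape (n, 1), so tolist() yields one-element lists:
# each word maps to [scaled_value] rather than a bare float.
word_probabilities = Counter(dict(zip(words, scaled_data.tolist())))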
word_probabilities['et']

[0.9999900000000002]

In [21]:
word_probabilities['kai']

[0.0011142407253193212]
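
To see where these numbers come from, the min-max formula can be written out by hand: `(x - min) / (max - min) * (high - low) + low`. The sketch below is not the notebook's code; `min_max_scale` is a hypothetical helper, and the minimum count of 1 is an assumption (it is consistent with the value printed for 'kai' above). Raw counts are used instead of relative frequencies because dividing every value by the same total does not change the min-max result.

```python
import numpy as np


def min_max_scale(values, low=0.0, high=0.99999):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        # All values are equal; map everything to the lower bound.
        return np.full_like(values, low)
    return (values - values.min()) / span * (high - low) + low


# Counts for 'et' and 'kai' are from the corpus; the rarest count of 1 is assumed.
toy_scaled = min_max_scale([426296, 476, 1])
```

Here `toy_scaled[0]` corresponds to 'et' and `toy_scaled[1]` to 'kai'.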

### Now the normalized number looks more manageable. Let's save the counter for reuse.

In [22]:
with open('freq_dist.latin.pkl', 'wb') as writer:
    pickle.dump(word_probabilities, writer)

### Let's verify that we can load and use what we just saved.

In [23]:
latin_frequency_dist = None
with open('freq_dist.latin.pkl', 'rb') as reader:
    latin_frequency_dist = pickle.load(reader)

In [24]:
latin_frequency_dist['rex']

[0.015599370154470498]