# Making a Frequency Distribution
### Often, when working with a corpus, or with a set of words drawn from one, it is helpful to compute a frequency distribution. Usually a frequency distribution is a count of each distinct word form, with the total occurrences then normalized so that all frequency values fall between 0.0 and 0.99999.
#### Why 0.99999? See: https://en.wikipedia.org/wiki/Cromwell%27s_rule
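
To make the idea concrete, here is a minimal sketch of one way to turn raw counts into values that never reach 1.0. The helper name `scale_counts` and its `ceiling` parameter are hypothetical, chosen for illustration; this is not the implementation of `create_probability_dist`, which is used later in this notebook and may scale values differently.

```python
from collections import Counter


def scale_counts(counter, ceiling=0.99999):
    """Hypothetical helper: relative frequency per word, capped at `ceiling`.

    Capping keeps every value strictly below 1.0, in the spirit of
    Cromwell's rule (no event is treated as absolutely certain).
    """
    total = sum(counter.values())
    if total == 0:
        return {}
    return {word: min(count / total, ceiling) for word, count in counter.items()}


# A toy corpus of seven tokens, three of which are 'et'
toy_counts = Counter("et in et ad et in rex".split())
toy_dist = scale_counts(toy_counts)
print(toy_dist["et"])

# A one-word corpus would give a relative frequency of 1.0, so the cap applies
single_word_dist = scale_counts(Counter({"et": 5}))
print(single_word_dist["et"])
```

The cap only matters in degenerate cases such as the one-word corpus above; for a large corpus every relative frequency is already far below 1.0.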


In [7]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Standard library and third-party imports

In [8]:
from collections import Counter
import numpy as np
import pickle

from tqdm import tqdm
from cltk.corpus.readers import get_corpus_reader
from cltk.prosody.latin.string_utils import remove_punctuation_dict
from cltk.stem.latin.j_v import JVReplacer

### Add parent directory to path so we can access our common code

In [9]:
import inspect
import os
import sys

# Resolve this notebook's directory, then put its parent on the import path
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(currentdir)
sys.path.insert(0, parentdir)

### Custom library imports

In [10]:
from mlyoucanuse.aeoe_replacer import AEOEReplacer
from mlyoucanuse.corpus_analysis_fun import create_probability_dist

In [11]:
latin_reader = get_corpus_reader(corpus_name='latin_text_latin_library', language='latin')

In [6]:
word_counter = Counter()
jv_replacer = JVReplacer()
aeoe_replacer = AEOEReplacer()

latin_texts = latin_reader.fileids()

for file in tqdm(latin_texts, total=len(latin_texts), unit='files'):
    for word in latin_reader.words(file):
        # Skip punctuation, numbers, and mixed tokens
        if word.isalpha():
            # Normalize spelling variants before counting
            word = aeoe_replacer.replace(jv_replacer.replace(word))
            word_counter[word] += 1

100%|██████████| 2141/2141 [16:10<00:00,  1.63files/s]  
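
The counting loop above takes a while on the full corpus, so here is a small self-contained sketch of the same pattern on a handful of tokens. The replacement rules are assumptions standing in for `JVReplacer` and `AEOEReplacer`: this sketch assumes j/v are rewritten as i/u and that the æ/œ ligatures are expanded to ae/oe. The real classes may behave differently, and the names `normalize_token` and `count_words` are hypothetical.

```python
from collections import Counter

# Assumed stand-in for JVReplacer: j -> i, v -> u (both cases)
JV_TABLE = str.maketrans({"j": "i", "J": "I", "v": "u", "V": "U"})
# Assumed stand-in for AEOEReplacer: expand ligatures to digraphs
LIGATURES = {"æ": "ae", "Æ": "Ae", "œ": "oe", "Œ": "Oe"}


def normalize_token(token):
    """Hypothetical normalizer: collapse spelling variants into one word form."""
    token = token.translate(JV_TABLE)
    for ligature, expansion in LIGATURES.items():
        token = token.replace(ligature, expansion)
    return token


def count_words(tokens):
    """Count alphabetic tokens after normalization; skip everything else."""
    counts = Counter()
    for token in tokens:
        if token.isalpha():
            counts[normalize_token(token)] += 1
    return counts


sample = ["vir", "uir", "et", ",", "cælum", "caelum", "42", "et"]
toy_counter = count_words(sample)
print(toy_counter.most_common())
```

Normalizing before counting matters because spelling variants such as `vir` and `uir` would otherwise split the count for what is really one word form.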


In [None]:
word_counter.most_common(10)

In [None]:
total_words = sum(word_counter.values())
word_counter['et']

In [None]:
word_counter['et']/float(total_words)

### `kai` is the Greek word for 'and', transliterated into the Latin alphabet. It is one of the most common words in Greek, so it is the Greek word most likely to turn up in Latin texts as a loanword. We could therefore use its frequency as a threshold for deciding whether a given word is a candidate transliterated Greek loanword; we'll try this in another notebook.

In [None]:
word_counter['kai']

In [None]:
word_counter['kai'] / float(total_words)

### The raw relative frequency isn't very readable, so we'll normalize

In [None]:
word_probabilities = create_probability_dist(word_counter)
word_probabilities['et']

### Now that normalized number looks more manageable. Let's save the distribution for reuse.

In [None]:
with open('freq_dist.latin.pkl', 'wb') as writer:
    pickle.dump(word_probabilities, writer)

### Let's prove that we can load and use what we just saved

In [None]:
latin_frequency_dist = None
with open('freq_dist.latin.pkl', 'rb') as reader:
    latin_frequency_dist = pickle.load(reader)

In [None]:
latin_frequency_dist['rex']

## That's all for now, folks