# Context-free grammar text generator

**Pack everything up neatly and get the generator working as well as it can**

+ Why are there weird things happening with the periods?
+ DEAL WITH EMPTY PRODUCTIONS IN GRAMMAR
+ Fix the weird double period/comma problem, figure out where that's coming from.
+ When stripping corpus, delete lines that are shorter than most (likely chapter headings)
+ Feed a slightly modified corpus into the k-gram dictionary generator to account for periods, commas, and other punctuation
+ Need to identify words that are always capitalized (first names), not only when they start a sentence. Maybe make a list of all words that appear in both capitalized and uncapitalized form?

+ If pyStatParser fails, select sentence rules from a library of already-processed sentences

+ delete parentheses, get punctuation to go into the original sentence structure in parse_sent()
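
One way to approach the capitalization item above is to record, for each word, whether it was ever seen capitalized and whether it was ever seen in lowercase. This is only a sketch: `split_by_capitalization` is a hypothetical helper that is not part of cfgen, and the regex tokenizer is an assumption rather than the tokenization cfgen uses.

```python
import re
from collections import defaultdict


def split_by_capitalization(text):
    """Return (always_capitalized, mixed_case) sets of lowercased words."""
    # Map each lowercased word to the set of capitalization states observed:
    # True = seen with an uppercase first letter, False = seen without one.
    forms = defaultdict(set)
    for word in re.findall(r"[A-Za-z']+", text):
        forms[word.lower()].add(word[0].isupper())

    # Words only ever seen capitalized are candidates for proper nouns.
    always_cap = {w for w, seen in forms.items() if seen == {True}}
    # Words seen in both forms are probably ordinary words that sometimes
    # start a sentence.
    mixed = {w for w, seen in forms.items() if seen == {True, False}}
    return always_cap, mixed


sample = "The monster spoke. Victor fled. Then the monster followed Victor."
always_cap, mixed = split_by_capitalization(sample)
print(sorted(always_cap))
print(sorted(mixed))
```

On a short sample, rare sentence-initial words (such as "Then" here) also land in the always-capitalized set, so the heuristic is likely to need a full-length corpus or a minimum-count cutoff.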

**Research ideas**
+ Generate text using words from one author and syntax from a different author (style swapper)
+ Can I (1) observe Zipf's law in Frankenstein, and (2) observe it in text generated from the Markov and CFG-Markov models?
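
For the Zipf's law question, a rough first check is to fit a straight line to log(frequency) against log(rank); Zipf's law predicts a slope close to -1. The sketch below uses only the standard library. `zipf_slope` is a hypothetical helper, the regex tokenizer is an assumption, and a plain least-squares fit over all ranks is a crude estimate, not a rigorous test.

```python
import math
import re
from collections import Counter


def zipf_slope(text):
    """Least-squares slope of log(frequency) versus log(rank)."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words to fit a slope")

    # Rank 1 is the most frequent word.
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]

    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Toy text whose word counts are exactly proportional to 1/rank: 12, 6, 4, 3
toy_text = "a " * 12 + "b " * 6 + "c " * 4 + "d " * 3
print(zipf_slope(toy_text))
```

The same function could be applied to the original corpus and to text from each generator, and the fitted slopes compared.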

In [1]:
from cfgen import *

In [3]:
# initialize all of the relevant variables
mycorp = clean_corpus('/Users/william/python_files/cfgen/full_books/frankenstein_short.txt')
tagged_corpus = tag_corpus(mycorp)
termrules_mycorp = make_terminal_rules(tagged_corpus)
my_kgram = make_kgram(mycorp, k=2)

# testing various use cases of make_sentence()

simple_sentence = 'The dog ran up the stairs slowly.'

# Try parsing a fixed grammar with random words inserted for terminal symbols
for ii in range(3):
    some_txt = make_sentence(mycorp, termrules_mycorp, fixed_grammar=True, sample_sentence=simple_sentence)
    some_txt = clean_output_text(some_txt)
    print(some_txt)
    print('\n')
    
print("----------\n")

# Try parsing a fixed grammar with Markov-biased selection of terminal words
for ii in range(3):
    some_txt = make_sentence(mycorp, termrules_mycorp, my_kgram, fixed_grammar=True, sample_sentence=simple_sentence)
    some_txt = clean_output_text(some_txt)
    print(some_txt)
    print('\n')

print("----------\n")
    

# Pick sentence structure randomly as well
for ii in range(3):
    some_txt = make_sentence(mycorp, termrules_mycorp)
    some_txt = clean_output_text(some_txt)
    print(some_txt)
    print('\n')
    
print("----------\n")
    
# Pick sentence structure randomly with Markov-biased selection
for ii in range(3):
    some_txt = make_sentence(mycorp, termrules_mycorp, my_kgram)
    some_txt = clean_output_text(some_txt)
    print(some_txt)
    print('\n')
    

the heart was away the country away .


the sprung spoke off the window then .


another world sought off the suffice back .


----------

the bridge had up all murder burst .


HIT!
the design was away the fiend not .


HIT!
the falsehood was away the fiend not .


----------

and which will is done to knows its wasting nor qualities, in that to is barred to affords my diabolical on my falling I evils, at all abhorrence is which to find my transcendent ,. my returning thou events but my I spurred


his remorse had filled upon your victim


farewell fairly, which countenance I to know that vessel england be would hoarser penetrate found of vice


----------

HIT!
my reason: my I for a I at craving of his destruction I is pierced into release .




IndexError: list index out of range

In [46]:
all_gscores

[0.00017946877243359656, 0.00018001800180018, 0.00036744442403086535, 0.0]

In [44]:
all_simscores

[0.038787325670935173,
 0.040584091492456432,
 0.04047001112283604,
 0.037304280865870003]

In [36]:
all_simscores

[0.00051336166329178905, 0.00039928129367139149]

In [27]:
from numpy import mean,std
mean(all_simscores)
std(all_simscores)

0.03675579322638145

In [65]:
# Estimate the quality of the text statistically

# make 1000 sentences

# initialize all of the relevant variables
# mycorp = clean_corpus('/Users/william/python_files/cfgen/full_books/frankenstein.txt')
# tagged_corpus = tag_corpus(mycorp)
# termrules_mycorp = make_terminal_rules(tagged_corpus)
# my_kgram = make_kgram(mycorp, k=3)

all_gscores = list()
all_simscores = list()
with open("3markov_score.txt", 'w') as myfile:  # text mode, since str is written below
    for ii in range(100):

        text_sample = make_sentence_markov(my_kgram, 1000)

        simscore = similarity_score(text_sample, mycorp)
        gscore = grammar_score(text_sample)
        all_simscores.append(simscore)
        all_gscores.append(gscore)

        myfile.write(str(simscore)+'\t'+str(gscore)+"\n")
        print(ii)


0
1
2
3
4
5
6
7
8
9
10
11
12
13


KeyboardInterrupt: 

In [280]:
similarity_score(out, mycorp)

0.10149732620320856

In [247]:
similarity_score(out, mycorp)

0.05144385026737968

To check the generated text for repetition of the corpus, the number of common substrings between the generated text and the corpus was computed and divided by the maximum possible value.
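
A simplified variant of this idea is sketched below. It is not the notebook's `similarity_score`: the hypothetical `shared_substring_score` counts every fixed-length window of the generated text that also occurs in the corpus and divides by the number of windows, which is a different normalization from the one in the appendix code. The default window length of 15 is taken from the appendix.

```python
def shared_substring_score(generated, corpus, length=15):
    """Fraction of length-`length` windows of `generated` found in `corpus`."""
    n_windows = len(generated) - length + 1
    if n_windows <= 0:
        # The generated text is shorter than one window.
        return 0.0

    # Collect every window of the corpus once so each lookup is O(1).
    corpus_windows = {corpus[i:i + length]
                      for i in range(len(corpus) - length + 1)}

    hits = sum(1 for i in range(n_windows)
               if generated[i:i + length] in corpus_windows)
    return hits / float(n_windows)


print(shared_substring_score("abcdef", "xxabcdefxx", length=3))
print(shared_substring_score("zzzzzz", "xxabcdefxx", length=3))
```

A score near 1.0 means the generated text is mostly copied verbatim from the corpus, and a score near 0.0 means few long runs are shared.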

To check the grammar of the generated text, the open-source library LanguageTool and its accompanying Python API, language-check, were used to count the number of unique grammatical errors in each generated sentence. For the output of cfgen, the original sentence from which the grammatical structure was parsed was also checked, and the number of "original" errors was subtracted from the number detected in the generated text.
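
The baseline subtraction described above can be sketched independently of the grammar checker. In the snippet below, `net_grammar_errors` and `toy_checker` are hypothetical names: `toy_checker` is a deliberately tiny stand-in for LanguageTool that returns rule identifiers, so the example runs without any external dependency. In the notebook, the `find_errors` argument would presumably wrap the `tool.check(...)` call shown in the appendix.

```python
def net_grammar_errors(generated, template, find_errors):
    """Unique errors in `generated` minus unique errors in `template`.

    `find_errors` is any callable that maps a sentence to an iterable of
    error identifiers.
    """
    generated_errors = set(find_errors(generated))
    template_errors = set(find_errors(template))
    # The result can be negative if the template has more errors.
    return len(generated_errors) - len(template_errors)


def toy_checker(sentence):
    """Stand-in checker that flags repeated words and 'a' before a vowel."""
    words = sentence.lower().rstrip(".").split()
    errors = []
    for first, second in zip(words, words[1:]):
        if first == second:
            errors.append("WORD_REPEAT")
        if first == "a" and second[0] in "aeiou":
            errors.append("A_VS_AN")
    return errors


template_sentence = "He ate a apple."
generated_sentence = "She saw saw a owl."
print(net_grammar_errors(generated_sentence, template_sentence, toy_checker))
```

Subtracting the template's errors keeps the score from penalizing the generator for problems that the checker already reports in the source sentence.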

# Appendix (old code)

In [None]:
import jellyfish


def originality_score(sentence, corpus):
    '''
    Return the "originality" of the sentence, normalized by its length
    
    sentence : str
        A generated sentence
        
    corpus : str
        A large body of text to compare against
    '''
    
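    # Normalize by the length of the larger of the two strings
    mylen = max(len(sentence), len(corpus))
    str_dist = jellyfish.damerau_levenshtein_distance(unicode(sentence), unicode(corpus))
    
    # edit distance / length of the larger of the two strings
    norm_str_dist = float(str_dist)/mylen
    
    # return the normalized distance, as promised by the docstring
    return norm_str_dist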

In [None]:
%load_ext Cython

In [None]:
%%cython

cdef double fr(double x) except? -2:
    return x**2-x


# plain def: this function returns a Python list, which a cdef double return type cannot hold
def all_common_substring2(s1, s2, threshold_length=15):
    
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    all_sub = list()
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] == threshold_length:
                    longest = m[x][y]
                    x_longest = x
                    sout = s1[x_longest - longest: x_longest]
                    all_sub.append(sout)
            else:
                m[x][y] = 0
                
    return all_sub

print(fr(6))

In [None]:
from difflib import SequenceMatcher
with open('file1.txt') as file_1, open('file2.txt') as file_2:
    file1_data = file_1.read()
    file2_data = file_2.read()
    # ratio() lies in [0, 1]; values near 1 mean the two files are nearly identical
    similarity_ratio = SequenceMatcher(None, file1_data, file2_data).ratio()
    print(similarity_ratio)

In [None]:
# # functions for processing text output

# import language_check
# from numpy import median, floor


# def all_common_substring(s1, s2,threshold_length=15):
#     '''
#     Return a list of all substrings of a given length that two
#     strings have in common
    
#     Based on standard code for solving the "longest common substring" problem
    
#     '''
    
#     m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
#     all_sub = list()
#     longest, x_longest = 0, 0
#     for x in range(1, 1 + len(s1)):
#         for y in range(1, 1 + len(s2)):
#             if s1[x - 1] == s2[y - 1]:
#                 m[x][y] = m[x - 1][y - 1] + 1
#                 if m[x][y] == threshold_length:
#                     longest = m[x][y]
#                     x_longest = x
#                     sout = s1[x_longest - longest: x_longest]
#                     all_sub.append(sout)
#             else:
#                 m[x][y] = 0
#     return all_sub


# def similarity_score(s1, s2, threshold_length='auto'):
#     '''
#     Compute the similarity between two strings based on the
#     number of identical substrings of at least a given length
    
#     Parameters
#     ----------
    
#     s1 : str
#     s2 : str
#         The two strings to compare
        
#     threshold_length : int
#         The length for overlapping substrings to be significant
#         If this is not specified, it is set to thrice the median
#         length of words in the two strings
        
#     Returns
#     -------
    
#     score : float
#         The similarity score, a number between 0.0 and 1.0
    
#     '''
#     if threshold_length=='auto':
#         ave_word_len = median([len(item) for item in (s1 + ' ' + s2).split(' ')])
#         threshold_length = int(3*ave_word_len)
    
#     max_len = max([len(s1), len(s2)])
#     max_sim = floor(max_len/float(threshold_length))
    
#     all_comm = all_common_substring(s1, s2, threshold_length=threshold_length)
    
#     score = float(len(all_comm))/max_sim
    
#     return score

# def grammar_score(some_text):
#     '''
#     Count the total number of errors in a text
    
#     Excludes cosmetic errors, like misuse of capitals, 
#     and instead focus on structural issues
    
#     Parameters
#     ----------
#     some_text : str
#     '''
#     tool = language_check.LanguageTool('en-US')
#     matches = tool.check(some_text)

#     structural_errors = list()
#     for item in matches:
#         if item.ruleId.find('WHITESPACE') != -1:
#             continue
#         elif item.ruleId.find('UPPERCASE') != -1:
#             continue
#         elif item.ruleId.find('LOWERCASE') != -1:
#             continue
#         elif item.ruleId.find('MORFOLOGIK_RULE_EN_US') != -1:
#             continue
#         elif item.ruleId.find('ENGLISH_WORD_REPEAT_BEGINNING_RULE') != -1:
#             continue
#         else:
#             structural_errors.append(item)
    
#     error_score = float(len(structural_errors))/len(some_text)
    
#     return error_score

### Class CFGen 

instance variables:
    bad tags to substitute out then back in

In [None]:
class CFGen:
    '''
    Context-free grammar text generator built from a single corpus
    
    corpus
        The cleaned corpus, as returned by clean_corpus
    
    k : int
        The order of the Markov model
    '''
    exclusions = [] # global to the class but not user-facing
    
    def __init__(self, corpus, k):
        self.corpus = corpus    # instance variable unique to each instance
        self.k = k
    
        self.kgram = make_kgram(self.corpus, k=self.k)
        self.tagged_corpus = tag_corpus(self.corpus)
        self.term_rules = make_terminal_rules(self.tagged_corpus)
    
    def generate(self):
        # placeholder: sentence generation is not implemented yet
        return None