# Project: PlotGEN

### My Artistic Goals
My vision for this project has certainly evolved throughout its lifespan. I began with the naive assumption that I could generate an entirely new plotline for a story after just 10 weeks of programming, to the point where a human writer would only be needed to fill in the precise details here and there. I started by building a series of functions that work together to generate characters and setting details, and to provide the interactions between those objects. However, I quickly found that to be too restrictive, since it only encompassed one type of plotline.

From there, I began this plotline analysis, where I analyzed the frequency of positive and negative words in a text to find the plotline structure. The result can be plotted on a graph, which helps a lot with visualizing what information the computer is finding in the text. From there, I wasn't entirely sure where to go. Should I continue with the text analysis, or should I try to generate new plotlines from these?

I decided to go with more text analysis, because I think the plotline generation is a project that will take another few terms before it makes anything useful. I also wanted to see what additional info I could pull from the text, which will definitely help with the plot generation in the future. From my previous plotline analysis, I was able to pull the minima and maxima of the plotline, and I used that to get the "most important" paragraphs from the text. I also used the spaCy library to get characters and locations from the text.

Essentially, this project went from plot generation to text analysis, and my final result is one that pulls the "most important" paragraphs from a text. I put quotes around "most important" because the computer doesn't actually know if they're important to the overall story or not. It just pulls the paragraphs for the top five minima/maxima and stores them in a summary string. Sometimes they read like a decent summary, and other times they don't make any sense. Either way, they definitely provide some very interesting insights into how a program can determine the important plot points.

I had a lot of fun with this project, and I will definitely be continuing to work on this for a long time! 

---
### Imports

In [26]:
from nltk.corpus import gutenberg as gb
import nltk
from nltk import tokenize
import matplotlib.pyplot as plt
import numpy as np
import random
import spacy

---
### Text Retrieval
- Get the text(s)

Note for forkers: the file paths below use Windows-style double backslashes. If you're on Linux or another Unix-based system (like macOS), you'll have to change the double backslashes to forward slashes to access the Lovecraft files.
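
For anyone who would rather not edit the separators by hand, here is a minimal sketch of a portable alternative using `pathlib` from the standard library. `DATA_DIR`, `story_path`, and `read_story` are hypothetical names that are not part of this notebook, and the `utf8` default is an assumption about how the files are encoded.

```python
from pathlib import Path

# Assumed location of the text files, relative to the notebook
DATA_DIR = Path("data")

def story_path(filename):
    # Path's "/" operator joins with the right separator for the current OS
    return DATA_DIR / filename

def read_story(filename, encoding="utf8"):
    # Hypothetical helper: open, read, and close the file in one call
    return story_path(filename).read_text(encoding=encoding)
```

With this, `read_story("color_out_of_space.txt")` should behave the same on Windows, macOS, and Linux, as long as the file exists under `data`.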

In [3]:
# Read in Call of Cthulhu
with open('data\\call_of_cthulhu', 'r', encoding="utf8") as f:
    cthulhu_str = f.read()

In [21]:
# Read in Color Out of Space
with open('data\\color_out_of_space.txt', 'r') as f:
    color_str = f.read()

In [5]:
# Read in Shadow Over Innsmouth
with open('data\\shadow_over_innsmouth.txt', 'r') as f:
    innsmouth_str = f.read()

In [6]:
# Read in Shadow Out of Time
with open('data\\shadow_out_of_time.txt', 'r') as f:
    time_str = f.read()

In [7]:
# Read in A New Hope
with open('data\\EpisodeIV_dialogues.txt', 'r') as f:
    starwarsIV = f.read()

---
### Text Analysis (resurrected)
Some resurrected functions
- Analyze the text(s)
- Get character list, main character, locations, etc
- Try to get general plotline structure

In [8]:
# Get named entities with a given spaCy label (e.g. "PERSON", "GPE", "LOC") from text
def get_things(input_text_str, tag):
    # some of this code is taken from the spaCy tutorial
    # Load English tokenizer, tagger, parser, NER and word vectors
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(input_text_str)
    
    # Keep only the entities whose label matches the requested tag
    entity_good = [entity.text for entity in doc.ents if entity.label_ == tag]
    # Count how many times each entity text appears
    char_dict = {}
    for ent_text in entity_good:
        if ent_text in char_dict:
            char_dict[ent_text] += 1
        else:
            char_dict[ent_text] = 1
    # Drop entities seen only once, then sort by count, most frequent first
    char_list = [(key, char_dict[key]) for key in list(char_dict.keys()) if char_dict[key] > 1]
    char_list.sort(key=lambda x: x[1], reverse=True)
    return char_list
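
The counting and filtering at the end of `get_things` can also be written with `collections.Counter`. This is only a sketch of that one step, run on a plain list of strings so it does not need spaCy; `count_repeated` and `min_count` are hypothetical names, not part of the notebook.

```python
from collections import Counter

def count_repeated(names, min_count=2):
    # Tally every name, then keep those seen at least min_count times.
    # most_common() already returns pairs sorted by count, highest first.
    counts = Counter(names)
    return [(name, n) for name, n in counts.most_common() if n >= min_count]

mentions = ["Angell", "Wilcox", "Angell", "Legrasse", "Wilcox", "Angell"]
print(count_repeated(mentions))  # [('Angell', 3), ('Wilcox', 2)]
```

"Legrasse" is dropped here because it appears only once, which mirrors the `> 1` filter in `get_things`.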

---
### Build plot structure
- Using positive / negative word frequency to form a "plot-line"
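
To make the scoring idea concrete before the full function, here is a toy sketch that scores a single paragraph. It follows the same rule as `get_plot_struct` (each sentence contributes the count of whichever polarity wins, and the total is divided by the number of sentences that contain any lexicon word), but `score_paragraph` and the tiny word sets are hypothetical stand-ins for the real opinion lexicon.

```python
def score_paragraph(sentences, pos_words, neg_words):
    # sentences: list of token lists; returns a signed score or None
    total_pos = total_neg = scored = 0
    for words in sentences:
        pos = sum(w in pos_words for w in words)
        neg = sum(w in neg_words for w in words)
        if pos == 0 and neg == 0:
            continue  # sentence has no lexicon words, so it is ignored
        scored += 1
        if pos > neg:
            total_pos += pos
        elif neg > pos:
            total_neg += neg
    if scored == 0:
        return None  # nothing to score in this paragraph
    if total_pos > total_neg:
        return total_pos / scored
    if total_neg > total_pos:
        return -total_neg / scored
    return 0.0

# Made-up mini lexicon, for illustration only
pos = {"good", "bright"}
neg = {"bad", "dark", "terrible"}
para = [["the", "night", "was", "dark", "and", "terrible"],
        ["a", "good", "sign"],
        ["nothing", "happened"]]
print(score_paragraph(para, pos, neg))  # -1.0
```

The first sentence adds 2 negative points, the second adds 1 positive point, and the third is skipped, so the paragraph scores -2 / 2 = -1.0.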

In [9]:
def tokenize_paragraphs(input_paragraphs):
    tokenized_paras = []
    for paragraph in input_paragraphs:
        sentences = []
        if len(paragraph) < 2:
            continue
        token_para = tokenize.sent_tokenize(paragraph)
        for sentence in token_para:
            if len(sentence) < 5:
                continue
            token_sentence = tokenize.word_tokenize(sentence)
            sentences.append(token_sentence)
        if len(sentences) < 2:
            continue
        tokenized_paras.append(sentences)
    return tokenized_paras

In [45]:
# Get structure of plot via sentiment analysis
def get_plot_struct(input_text_str, para_delimiter="\n"):
    paragraphs = input_text_str.split(para_delimiter)
    tokenized_paras = tokenize_paragraphs(paragraphs)
    
    with open("data\\opinion-lexicon-English\\positive-words.txt", 'r') as posf:
        pos_words = posf.read()
    with open("data\\opinion-lexicon-English\\negative-words.txt", 'r') as negf:
        neg_words = negf.read()
    
    pos_set = set(pos_words.split("\n"))
    neg_set = set(neg_words.split("\n"))
    
    storyline = [[],[]]
    len_tokenized_paras = len(tokenized_paras)
    for i in range(len_tokenized_paras):
        para = tokenized_paras[i]
        total_pos = 0
        total_neg = 0
        total_sents = 0
        
        valid_para = False
        
        for sent in para:
            valid_sent = False
            points_pos = 0
            points_neg = 0
            
            for word in sent:
                if word in pos_set:
                    points_pos += 1
                    valid_sent = True
                    continue
                elif word in neg_set:
                    points_neg += 1
                    valid_sent = True
                    continue                
            
            if not valid_sent:
                continue
            
            total_sents += 1
            
            if points_pos > points_neg:
                total_pos += (points_pos)
            elif points_neg > points_pos:
                total_neg += (points_neg)
            
            if not valid_para and valid_sent:
                valid_para = True
        
        if not valid_para:
            continue
        
        #storyline[0].append(float(i) * 100.0 / float(len_tokenized_paras))
        storyline[0].append(i)
        
        if total_pos > total_neg:
            storyline[1].append(total_pos/total_sents)
        elif total_neg > total_pos:
            storyline[1].append(-total_neg/total_sents)
        else:
            storyline[1].append(0.0)
    
    # Smooth interior points in place using the mean of their two neighbours
    for i in range(len(storyline[1])):
        if i == 0 or i == len(storyline[1])-1:
            continue
                
        # Neutral paragraphs take the mean of their neighbours
        if storyline[1][i] == 0.0:
            storyline[1][i] = (storyline[1][i-1] + storyline[1][i+1]) / 2
            continue

        # A lone spike whose sign disagrees with both neighbours is flattened the same way
        if storyline[1][i] > 0.0 and (storyline[1][i-1] < 0 and  storyline[1][i+1] < 0):
            storyline[1][i] = (storyline[1][i-1] + storyline[1][i+1]) / 2
        elif storyline[1][i] < 0.0 and (storyline[1][i-1] > 0 and  storyline[1][i+1] > 0):
            storyline[1][i] = (storyline[1][i-1] + storyline[1][i+1]) / 2
    
    # Normalize to [-1, 1]; skip the division when every score is 0
    max_value = max([abs(min(storyline[1])), abs(max(storyline[1]))])
    
    if max_value > 0:
        for i in range(len(storyline[1])):
            storyline[1][i] = storyline[1][i] / max_value
        
    return storyline

In [46]:
# Getting Average Values
def average_values(x_and_y):
    x_array = []
    y_array = []
    
    for i in range(len(x_and_y[1])):
        if i == 0:
            x_array.append(x_and_y[0][i])
            y_array.append((x_and_y[1][i]+x_and_y[1][i+1])/2)
        elif i == len(x_and_y[1])-1:
            x_array.append(x_and_y[0][i])
            y_array.append((x_and_y[1][i]+x_and_y[1][i-1])/2)
        else:
            x_array.append(x_and_y[0][i])
            y_array.append((x_and_y[1][i-1]+x_and_y[1][i]+x_and_y[1][i+1])/3)
    
    # Rescale to [-1, 1]; skip the division when every value is 0
    max_value = max([abs(min(y_array)),abs(max(y_array))])
    if max_value > 0:
        for i in range(len(y_array)):
            y_array[i] = y_array[i] / max_value
    
    
    return [x_array, y_array]
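
For comparison, one smoothing pass can be sketched with NumPy array slicing instead of an index loop. This is a rough equivalent of the y-value half of `average_values`, assuming a non-empty sequence; `smooth_once` is a hypothetical name and is not used elsewhere in the notebook.

```python
import numpy as np

def smooth_once(ys):
    # One pass of a 3-point moving average; endpoints average over 2 values
    y = np.asarray(ys, dtype=float)
    padded = np.concatenate(([0.0], y, [0.0]))
    sums = padded[:-2] + padded[1:-1] + padded[2:]
    counts = np.full(len(y), 3.0)
    counts[0] -= 1   # first point has no left neighbour
    counts[-1] -= 1  # last point has no right neighbour
    smoothed = sums / counts
    peak = np.max(np.abs(smoothed))
    # Rescale to [-1, 1]; leave an all-zero curve untouched
    return smoothed / peak if peak > 0 else smoothed
```

Calling it repeatedly, as `get_story_summary` does with `moderate_num`, flattens the curve a little more each time.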

In [47]:
# Get mean shape of multiple texts
def mean_plot(plotlines):
    total_x = []
    total_y = {}
    
    for plot in plotlines:
        for i in range(len(plot[0])):
            if plot[0][i] in total_x:
                total_y[plot[0][i]].append(plot[1][i])
            else:
                total_y[plot[0][i]] = [plot[1][i]]
                total_x.append(plot[0][i])
                       
    avg_x = sorted(total_x)
    temp_y = [(arr, np.average(total_y[arr])) for arr in list(total_y.keys())]
    temp_y2 = sorted(temp_y, key=lambda tup: tup[0])
    avg_y = [tup[1] for tup in temp_y2]
    
    return [avg_x, avg_y]
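
The same averaging can be sketched with a `defaultdict` that groups the y-values by their x position. This is an illustration on made-up numbers, not a replacement for `mean_plot`; `mean_plot_sketch` is a hypothetical name.

```python
from collections import defaultdict

def mean_plot_sketch(plotlines):
    # Group every y-value under its x position across all plotlines
    buckets = defaultdict(list)
    for xs, ys in plotlines:
        for x, y in zip(xs, ys):
            buckets[x].append(y)
    # Average each group, in increasing x order
    avg_x = sorted(buckets)
    avg_y = [sum(buckets[x]) / len(buckets[x]) for x in avg_x]
    return [avg_x, avg_y]

plot_a = [[0, 1, 2], [1, 0, -1]]
plot_b = [[1, 2, 3], [0.5, 1, 0]]
print(mean_plot_sketch([plot_a, plot_b]))
# [[0, 1, 2, 3], [1.0, 0.25, 0.0, 0.0]]
```

Positions that appear in only one plotline (0 and 3 here) keep that plotline's value, while shared positions are averaged.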

---
### Plotlines in words
- some text generation based off the plotlines

In [67]:
# Get paragraph qualities
def para_qualities(input_plot):
    para_vals = []  
    xs = input_plot[0]
    ys = input_plot[1]
    len_ys = len(ys)
    biggest = max(abs(min(ys)),abs(max(ys)))    
    
    for i in range(len_ys):
        if i > 0 and i < len_ys-1:
            if abs(ys[i-1]) <= abs(ys[i]) <= abs(ys[i+1]):
                continue
            if abs(ys[i-1]) >= abs(ys[i]) >= abs(ys[i+1]):
                continue
        
        para_val = ""
        y_val = abs(ys[i])
        
        if y_val == 0.0:
            # keep the same (index, magnitude, label) shape as every other entry
            para_vals.append((xs[i], y_val, "neutral"))
            continue
        
        if y_val < 0.25:
            para_val += "a little "
        elif y_val < 0.5:
            para_val += "kinda "
        elif y_val < 0.75:
            para_val += ""
        elif y_val < 1.0:
            para_val += "very "
        else:
            para_val += "most "
        
        if ys[i] < 0.0:
            para_val += "bad"
        elif ys[i] > 0.0:
            para_val += "good"
        
        para_vals.append((xs[i], y_val, para_val))
    
    return para_vals
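
The filter at the top of the loop is what turns the curve into a list of turning points: an interior point is skipped when its magnitude sits in the middle of a rising or falling run. Here is a small sketch of just that test, with a hypothetical helper name and made-up values.

```python
def turning_point_indices(ys):
    # Keep the endpoints, plus interior points where |y| changes direction
    kept = []
    for i in range(len(ys)):
        if 0 < i < len(ys) - 1:
            prev, cur, nxt = abs(ys[i - 1]), abs(ys[i]), abs(ys[i + 1])
            if prev <= cur <= nxt or prev >= cur >= nxt:
                continue  # part of a monotone run, not a turning point
        kept.append(i)
    return kept

ys = [0.0, 0.2, 0.6, 0.3, -0.1, -0.9, -0.4]
print(turning_point_indices(ys))  # [0, 2, 4, 5, 6]
```

Indices 1 and 3 are dropped because their magnitudes lie on the way up to 0.6 and on the way down from it.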
            

---
### Testing

In [111]:
def get_story_summary(input_txt, title, para_delimiter="\n", moderate_num=20, limit=0):
    storyline = get_plot_struct(input_txt, para_delimiter=para_delimiter)

    moderated = storyline
    for i in range(0,moderate_num):
        moderated = average_values(moderated)
        
    no_space_before = ".,”’;!?'-:"
    no_space_after = "“’`-"

    init_plotline = para_qualities(moderated)
    if limit == 0:
        plotline = init_plotline
    else:
        temp_plotline = sorted(init_plotline, key = lambda tup: tup[1])
        plotline = sorted(temp_plotline[:limit], key = lambda tup: tup[0])

    para_list = tokenize_paragraphs(input_txt.split(para_delimiter))
    important_paras = [para_list[spot[0]] for spot in plotline]
    for para in important_paras:
        for i in range(len(para)):
            sent = para[i]
            new_sent = []
            for j in range(len(sent)):
                if sent[j] in no_space_before and len(new_sent) > 0:
                    new_sent[-1] += sent[j]
                elif j > 0 and sent[j-1] in no_space_after:
                    new_sent[-1] += sent[j]
                else:
                    new_sent.append(sent[j])
            para[i] = new_sent    
    
    summary_paras = [" ".join([" ".join(sent) for sent in para]) for para in important_paras]
    summary = "Summary: \n\n" + "\n\n".join(summary_paras)
    
    chars = get_things(input_txt, "PERSON")
    char_list = [" - " + char[0] for char in chars]
    if len(char_list) <= (10 if limit == 0 else limit):
        characters = "Main Characters: \n" + "\n".join(char_list)
    else:
        characters = "Main Characters: \n" + "\n".join(char_list[:(10 if limit == 0 else limit)])
    
    locs = get_things(input_txt, "GPE")
    locs += get_things(input_txt, "LOC")
    filtered_locs = [" - " + loc[0][0].upper() + loc[0][1:] 
                         for loc in locs if loc[0] not in characters]
    if len(filtered_locs) <= (10 if limit == 0 else limit):
        locations = "Locations: \n" + "\n".join(filtered_locs)
    else:
        locations = "Locations: \n" + "\n".join(filtered_locs[:(10 if limit == 0 else limit)])
    
    return_string = title + " \n\n"
    return_string += characters + "\n\n"
    return_string += locations + "\n\n"
    return_string += summary
    
    return return_string
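
The nested loop in the middle of `get_story_summary` glues punctuation back onto the tokens that `word_tokenize` split apart. Pulled out on its own, the idea looks roughly like the sketch below. `join_tokens` is a hypothetical helper, and it uses sets of single characters where the function above uses strings.

```python
NO_SPACE_BEFORE = set(".,”’;!?'-:")
NO_SPACE_AFTER = set("“’`-")

def join_tokens(tokens):
    pieces = []
    for j, tok in enumerate(tokens):
        if pieces and tok in NO_SPACE_BEFORE:
            pieces[-1] += tok   # attach to the previous piece
        elif pieces and tokens[j - 1] in NO_SPACE_AFTER:
            pieces[-1] += tok   # previous token wants no space after it
        else:
            pieces.append(tok)
    return " ".join(pieces)

print(join_tokens(["Hello", ",", "world", "!"]))                  # Hello, world!
print(join_tokens(["“", "Yes", ",", "”", "he", "said", "."]))     # “Yes,” he said.
```

Tokens that are in neither set, such as the `` and '' quote tokens visible in the Gutenberg output below, are left as separate words.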

In [112]:
cthulhu_summary = get_story_summary(cthulhu_str, "The Call of Cthulhu", limit = 5)
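
A note on `limit`: `sorted` is ascending by default, so taking the first `limit` items after sorting by magnitude keeps the weakest turning points. If the goal is the strongest ones, one possible sketch uses `heapq.nlargest` and then restores story order. `strongest_points` is a hypothetical helper, and the tuples below are made-up values in the `(index, magnitude, label)` shape that `para_qualities` produces.

```python
import heapq

def strongest_points(points, limit):
    # Pick the `limit` entries with the largest magnitude...
    top = heapq.nlargest(limit, points, key=lambda p: p[1])
    # ...then put them back in the order they occur in the text
    return sorted(top, key=lambda p: p[0])

points = [(3, 0.2, "a little good"), (10, 1.0, "most bad"),
          (17, 0.6, "good"), (25, 0.9, "very bad")]
print(strongest_points(points, 2))
# [(10, 1.0, 'most bad'), (25, 0.9, 'very bad')]
```

Swapping this in would change which paragraphs end up in the summaries, so the printed outputs in this notebook would no longer match.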

In [113]:
print(cthulhu_summary)

The Call of Cthulhu 

Main Characters: 
 - Angell
 - Legrasse
 - Wilcox
 - Castro
 - Alert

Locations: 
 - Johansen
 - New Orleans
 - London
 - Greenland
 - S. Latitude

Summary: 

The writing accompanying this oddity was, aside from a stack of press cuttings, in Professor Angell’s most recent hand; and made no pretence to literary style. What seemed to be the main document was headed “CTHULHU CULT” in characters painstakingly printed to avoid the erroneous reading of a word so unheard-of. The manuscript was divided into two sections, the first of which was headed “1925—Dream and Dream Work of H. A. Wilcox, 7 Thomas St., Providence, R.I.”, and the second, “Narrative of Inspector John R. Legrasse, 121 Bienville St., New Orleans, La., at 1908 A. A. S. Mtg.—Notes on Same, & Prof. Webb’s Acct.” The other manuscript papers were all brief notes, some of them accounts of the queer dreams of different persons, some of them citations from theosophical books and magazines ( notably W. Scott-Elli

In [114]:
color_summary = get_story_summary(color_str, "The Color Out of Space", limit = 5)

In [115]:
print(color_summary)

The Color Out of Space 

Main Characters: 
 - Nahum
 - Ammi
 - Gardner
 - Merwin
 - ye

Locations: 
 - Arkham
 - Zenas
 - Thaddeus
 - Boston
 - Valley

Summary: 

West of Arkham the hills rise wild, and there are valleys with deep woods that no axe has ever cut. There are dark narrow glens where the trees slope fantastically, and where thin brooklets trickle without ever having caught the glint of sunlight. On the gentler slopes there are farms, ancient and rocky, with squat, moss-coated cottages brooding eternally over old New England secrets in the lee of great ledges; but these are all vacant now, the wide chimneys crumbling and the shingled sides bulging perilously beneath low gambrel roofs.

It all began, old Ammi said, with the meteorite. Before that time there had been no wild legends at all since the witch trials, and even then these western woods were not feared half so much as the small island in the Miskatonic where the devil held court beside a curious stone altar older tha

In [116]:
gb.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [117]:
emma_str = gb.raw('austen-emma.txt')

In [118]:
emma_summary = get_story_summary(emma_str, "Emma by Jane Austen", para_delimiter="\n\n", limit = 5)

In [119]:
print(emma_summary)

Emma by Jane Austen 

Main Characters: 
 - Emma
 - Weston
 - Elton
 - Harriet
 - Knightley

Locations: 
 - London
 - Fairfax
 - Weymouth
 - Enscombe
 - Ireland

Summary: 

`` But you must have found it very damp and dirty. I wish you may not catch cold. ''

`` You saw her answer! -- you wrote her answer too. Emma, this is your doing. You persuaded her to refuse him. ''

Before the middle of the next day, he was at Hartfield; and he entered the room with such an agreeable smile as certified the continuance of the scheme. It soon appeared that he came to announce an improvement.

She spoke with great agitation; and Emma very feelingly replied, '' That can be no reason for your being exposed to danger now. I must order the carriage. The heat even would be danger. -- You are fatigued already. ''

`` True, true, '' he answered, warmly. `` No, not true on your side. You can have no superior, but most true on mine. -- She is a complete angel. Look at her. Is not she an angel in every gesture?

In [120]:
whitman_str = gb.raw("whitman-leaves.txt")

In [121]:
whitman_summary = get_story_summary(whitman_str, "Leaves of Grass by Walt Whitman", para_delimiter="\n\n", limit = 5)

In [122]:
print(whitman_summary)

Leaves of Grass by Walt Whitman 

Main Characters: 
 - Nature
 - Mannahatta
 - ye
 - lo
 - Walt Whitman

Locations: 
 - America
 - States
 - Thou
 - Manhattan
 - Thee

Summary: 

I tramp a perpetual journey, ( come listen all! ) My signs are a rain-proof coat, good shoes, and a staff cut from the woods, No friend of mine takes his ease in my chair, I have no chair, no church, no philosophy, I lead no man to a dinner-table, library, exchange, But each man and each woman of you I lead upon a knoll, My left hand hooking you round the waist, My right hand pointing to landscapes of continents and the public road.

This day before dawn I ascended a hill and look 'd at the crowded heaven, And I said to my spirit When we become the enfolders of those orbs, and the pleasure and knowledge of every thing in them, shall we be fill 'd and satisfied then? And my spirit said No, we but level that lift to pass and continue beyond.

I teach straying from me, yet who can stray from me? I follow you whoe

In [123]:
hamlet_str = gb.raw("shakespeare-hamlet.txt")

In [124]:
hamlet_summary = get_story_summary(hamlet_str, "Hamlet by Shakespeare", para_delimiter="\n\n", limit = 5)

In [125]:
print(hamlet_summary)

Hamlet by Shakespeare 

Main Characters: 
 - Hor
 - Laer
 - Qu
 - Ophe
 - Pol

Locations: 
 - Heauen
 - England
 - Horatio
 - Soule
 - Ophelia

Summary: 

Mar. Is it not like the King? As thou art to thy selfe, Such was the very Armour he had on, When th' Ambitious Norwey combatted: So frown 'd he once, when in an angry parle He smot the sledded Pollax on the Ice. 'T is strange

King. Though yet of Hamlet our deere Brothers death The memory be greene: and that it vs befitted To beare our hearts in greefe, and our whole Kingdome To be contracted in one brow of woe: Yet so farre hath Discretion fought with Nature, That we with wisest sorrow thinke on him, Together with remembrance of our selues. Therefore our sometimes Sister, now our Queene, Th' imperiall Ioyntresse of this warlike State, Haue we, as 'twere, with a defeated ioy, With one Auspicious, and one Dropping eye, With mirth in Funerall, and with Dirge in Marriage, In equall Scale weighing Delight and Dole Taken to Wife; nor haue