# Project: PlotGEN

**Concept**

Why hard-code a plotline structure? My idea is to have the computer read a text, extract the important plot information from it, and construct its own story from that.

**Outline**
1. Compile texts from sources
    - Gutenberg
    - webscraper
2. Analyze the texts
    - Characters
        - build frequency table of character appearances
        - account for first / second person narration
        - the most frequently mentioned character is most likely the main character
    - Setting
        - use WordNet to find locations
        - find descriptors of locations to apply as modifiers
3. Build plot structure for each text
    - Summarize each text?
    - Distill each sentence down to basic concepts
        - Example: "Michael walked along the sandy, desolate beach" -> "Michael walked beach" -> `[Michael, ACTION, SETTING]` 
4. Train unsupervised learning model on plot structures
    - Clustering in n-dimensional space
5. Generate a new plot structure from the learned clusters
    - Assign the candidate structure to its closest cluster
    - Change one aspect of the candidate
    - Repeat until the candidate sits firmly within the cluster
6. Profit?
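
The "distill each sentence" idea in step 3 can be sketched as follows. This is only a minimal illustration, not the project's implementation: the function name `distill`, the `characters` and `settings` sets, and the `OBJECT` fallback slot are all assumptions, and the input is hand-tagged in the Penn Treebank style that `nltk.pos_tag` produces so the example runs without any NLTK data.

```python
def distill(tagged_sentence, characters, settings):
    """Reduce a POS-tagged sentence to its core words and a slot pattern.

    tagged_sentence: list of (word, tag) pairs in Penn Treebank style.
    characters, settings: hypothetical sets of known names and locations.
    """
    # Keep only nouns (NN*) and verbs (VB*); drop modifiers and function words.
    core = [(w, t) for w, t in tagged_sentence if t.startswith(("NN", "VB"))]

    pattern = []
    for word, tag in core:
        if word in characters:
            pattern.append(word)        # keep the character's name
        elif tag.startswith("VB"):
            pattern.append("ACTION")
        elif word.lower() in settings:
            pattern.append("SETTING")
        else:
            pattern.append("OBJECT")    # assumed fallback slot, not in the outline
    return [w for w, _ in core], pattern


# Hand-tagged version of the example sentence from the outline
tagged = [("Michael", "NNP"), ("walked", "VBD"), ("along", "IN"), ("the", "DT"),
          ("sandy", "JJ"), (",", ","), ("desolate", "JJ"), ("beach", "NN")]
core, pattern = distill(tagged, characters={"Michael"}, settings={"beach"})
print(core)     # ['Michael', 'walked', 'beach']
print(pattern)  # ['Michael', 'ACTION', 'SETTING']
```

In practice the `characters` set could come from the character frequency table and the `settings` set from the WordNet lookup described in step 2.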

In [15]:
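# Note: gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0
from gensim.summarization.summarizer import summarize
from nltk.corpus import gutenberg as gb
from nltk import tokenize, pos_tag
from nltk.corpus import wordnet as wn  # the WordNet corpus reader lives in nltk.corpus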

---
### Step 1
- Get the text(s)

In [7]:
# Read in Call of Cthulhu
# (a forward slash works on Windows as well as on POSIX systems)
with open('data/call_of_cthulhu', 'r', encoding="utf8") as f:
    cthulhu_str = f.read()

---
### Step 2
- Analyze the text(s)
- Get character list, main character, locations, etc.
- Try to get general plotline structure

In [51]:
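# Get characters from text
def get_chars(input_text_str):
    """Return (name, count) pairs for likely characters, most frequent first."""
    tokenized_input = tokenize.word_tokenize(input_text_str)
    pos_input = pos_tag(tokenized_input)
    punct = "’.,;'”“"
    # Keep proper nouns (NNP) and personal pronouns (PRP); skip punctuation tokens
    pos_good = [(word, tag) for word, tag in pos_input
                if word not in punct and tag in ("NNP", "PRP")]
    char_dict = {}
    for i, (word, tag) in enumerate(pos_good):
        if tag == "PRP" and word != "I":
            # Credit any pronoun other than "I" to the proper noun just before it
            # in the filtered list (a rough stand-in for coreference resolution)
            if i > 0 and pos_good[i-1][1] == "NNP":
                prev_name = pos_good[i-1][0]
                char_dict[prev_name] = char_dict.get(prev_name, 0) + 1
        else:
            # Proper nouns and the first-person narrator "I" are counted directly
            char_dict[word] = char_dict.get(word, 0) + 1
    # Keep names seen more than 5 times, sorted by descending count
    char_list = [(name, count) for name, count in char_dict.items() if count > 5]
    char_list.sort(key=lambda x: x[1], reverse=True)
    return char_list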

In [96]:
# Build graph of sentence to sentence flow
def build_sentence_graph(input_text_str, char_list):
    """Return the POS-tagged sentences that mention at least one character.

    No graph edges are built yet; this is the filtering step only.
    """
    # Only the names are needed for membership tests
    char_names = {name for name, _count in char_list}
    tokenized_input = tokenize.word_tokenize(input_text_str)
    pos_input = pos_tag(tokenized_input)

    # Split the tagged tokens into sentences at '.', '!' and '?'
    # (the terminator itself is dropped)
    pos_sents = []
    for pos in pos_input:
        if not pos_sents:
            pos_sents.append([pos])
        elif pos[0] in ('.', '!', '?'):
            pos_sents.append([])
        else:
            pos_sents[-1].append(pos)

    # Keep only sentences containing a known character name
    char_sents = [sent for sent in pos_sents
                  if any(word in char_names for word, _tag in sent)]
    return char_sents
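
`build_sentence_graph` returns a list of tagged sentences rather than graph edges. One possible way to turn that list into a "flow" graph is to count which characters are mentioned in one kept sentence and then in the next. The sketch below is an assumption about what the graph could look like, not the project's design; the name `sentence_flow_edges` and the demo data are made up for illustration.

```python
from collections import Counter


def sentence_flow_edges(char_sents, char_names):
    """Count character-to-character transitions between consecutive sentences."""
    # Characters mentioned in each sentence, in order of first mention
    mentions = []
    for sent in char_sents:
        seen = []
        for word, _tag in sent:
            if word in char_names and word not in seen:
                seen.append(word)
        mentions.append(seen)

    # An edge (src, dst) means src appears in a sentence and dst in the next kept one
    edges = Counter()
    for current, following in zip(mentions, mentions[1:]):
        for src in current:
            for dst in following:
                edges[(src, dst)] += 1
    return edges


# Tiny hand-made input in the same shape build_sentence_graph returns
demo_sents = [
    [("Wilcox", "NNP"), ("dreamed", "VBD")],
    [("Angell", "NNP"), ("questioned", "VBD"), ("Wilcox", "NNP")],
    [("Angell", "NNP"), ("died", "VBD")],
]
demo_edges = sentence_flow_edges(demo_sents, {"Wilcox", "Angell"})
print(demo_edges[("Wilcox", "Angell")])  # 2
```

Because the input only contains sentences that mention a character, "consecutive" here means consecutive among the kept sentences, not necessarily adjacent in the original text.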

---
### Testing

In [87]:
# CALL OF CTHULHU

# test step 2
cthulhu_chars = get_chars(cthulhu_str)
print(cthulhu_chars)

[('I', 113), ('Johansen', 37), ('Wilcox', 34), ('Legrasse', 34), ('Cthulhu', 25), ('Professor', 21), ('Angell', 20), ('April', 16), ('Old', 16), ('Alert', 16), ('March', 14), ('R', 13), ('Castro', 12), ('Emma', 11), ('Inspector', 9), ('New', 9), ('Webb', 9), ('Great', 9), ('Ones', 9), ('Them', 9), ('Street', 8), ('God', 7), ('Orleans', 6), ('Sydney', 6), ('Dunedin', 6), ('Briden', 6)]
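
The list above mixes real characters with tokens that are clearly not characters, such as months (`April`, `March`) and titles (`Professor`, `Inspector`), plus what look like fragments of multi-word names. A minimal sketch of a post-filter follows, assuming a hand-written exclusion set; `NOT_CHARACTERS` and `filter_chars` are hypothetical names, and the set is hand-picked from this one output rather than derived automatically.

```python
# Hypothetical exclusion set, hand-picked from the output above
NOT_CHARACTERS = {
    "April", "March",                  # months
    "Professor", "Inspector",          # titles
    "Old", "New", "Great", "Street",   # likely fragments of multi-word names
}


def filter_chars(char_list, exclude=NOT_CHARACTERS):
    """Drop (name, count) pairs whose name is in the exclusion set."""
    return [(name, count) for name, count in char_list if name not in exclude]


# A few entries copied from the printed result
sample = [('I', 113), ('Johansen', 37), ('Professor', 21),
          ('Angell', 20), ('April', 16), ('Old', 16)]
print(filter_chars(sample))
# [('I', 113), ('Johansen', 37), ('Angell', 20)]
```

A fixed list like this will not generalize across texts; it only shows where a cleanup step could sit between `get_chars` and the later stages.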


In [100]:
cthulhu_sentence_graph = build_sentence_graph(cthulhu_str, cthulhu_chars)
print(cthulhu_sentence_graph)

[[('The', 'DT'), ('Call', 'NNP'), ('of', 'IN'), ('Cthulhu', 'NNP'), ('By', 'IN'), ('H.', 'NNP'), ('P.', 'NNP'), ('Lovecraft', 'NNP'), ('(', '('), ('Found', 'NNP'), ('Among', 'IN'), ('the', 'DT'), ('Papers', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Late', 'NNP'), ('Francis', 'NNP'), ('Wayland', 'NNP'), ('Thurston', 'NNP'), (',', ','), ('of', 'IN'), ('Boston', 'NNP'), (')', ')'), ('“', 'NN'), ('Of', 'IN'), ('such', 'JJ'), ('great', 'JJ'), ('powers', 'NNS'), ('or', 'CC'), ('beings', 'NNS'), ('there', 'EX'), ('may', 'MD'), ('be', 'VB'), ('conceivably', 'RB'), ('a', 'DT'), ('survival', 'NN')], [('I', 'PRP')], [('The', 'DT'), ('most', 'RBS'), ('merciful', 'JJ'), ('thing', 'NN'), ('in', 'IN'), ('the', 'DT'), ('world', 'NN'), (',', ','), ('I', 'PRP'), ('think', 'VBP'), (',', ','), ('is', 'VBZ'), ('the', 'DT'), ('inability', 'NN'), ('of', 'IN'), ('the', 'DT'), ('human', 'JJ'), ('mind', 'NN'), ('to', 'TO'), ('correlate', 'VB'), ('all', 'DT'), ('its', 'PRP$'), ('contents', 'NNS')], [('But', 'CC'), 