# Data Loading and Naive Bayes
This notebook has two aims. The first is to establish a pipeline for loading the data used in our experiments.

The second is to build a Naive Bayes generator that produces a few lines of dialogue given a context of words.

In [97]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import re

import tensorflow as tf

In [125]:
datapath = "Shakespeare_data.csv"

data = pd.read_csv(datapath)
data.keys()

# data cleanup
# lowercase every word and remove all non-alphanumeric characters
lines_data = []
for sentence in data['PlayerLine']:
    new_sentence = []
    for word in sentence.split(" "):
        new_sentence.append(re.sub('[^A-Za-z0-9]+','', word.lower()))
    lines_data.append(" ".join(new_sentence))
print(len(lines_data))

111396
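The cleanup loop above can be checked on a couple of hand-written lines. The sketch below wraps the same regular expression in a helper called `clean_line` (a hypothetical name, not from the notebook) and runs it on made-up strings rather than on the real CSV.

```python
import re

def clean_line(sentence):
    """Lowercase each space-separated word and strip non-alphanumeric characters."""
    return " ".join(re.sub('[^A-Za-z0-9]+', '', word.lower())
                    for word in sentence.split(" "))

# Made-up sample lines, used only for illustration.
samples = ["So shaken as we are, so wan with care,", "O, what a war!"]
cleaned = [clean_line(s) for s in samples]
print(cleaned)

# A token made up only of punctuation becomes an empty string,
# which leaves two consecutive spaces in the cleaned line.
print(repr(clean_line("a - b")))
```

Because punctuation-only tokens turn into empty strings, splitting the cleaned lines on single spaces can produce `''` as a vocabulary entry and as a context word.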


## Algorithm Description
Let $\mathcal{V}$ be the vocabulary of words in the dataset, and let $\mathcal{S} = \{\vec{s_i}\}$ be the sequence of sentences in the order in which they appear in the dataset. The context $\vec{x}$ is a collection of words from $\mathcal{V}$. Let $d$ be a positive integer called the dialogue depth.

### Probability Model
Given a context $\vec{x}$, the task is to find the most probable window of $d$ consecutive sentences $\{\vec{s_i}\}_{i=j-d}^{j-1}$, that is, the window that contains the most words from $\vec{x}$.

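As a starting point, the following unnormalized score is used as the probability:

$$P(\vec{x},j \mid d) \propto \sum_{k=1}^{|\vec{x}|} \mathcal{I}\left(x_k \in \bigcup_{i=j-d}^{j-1}\vec{s_i}\right)$$

where $\mathcal{I}(\cdot)$ is the indicator function. The sum counts how many words of $\vec{x}$ appear in the sentences $\{\vec{s_i}\}_{i=j-d}^{j-1}$. The `prob` function defined later in this notebook counts every occurrence of a context word rather than just its presence, so repeated words raise the score further.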
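To make the score concrete, the sketch below evaluates it on a made-up four-sentence toy corpus. It computes both the indicator sum from the formula and a raw occurrence count, since the two differ when a context word repeats inside the window. The helper names `window_words`, `indicator_score` and `count_score` are hypothetical, and clamping the window at the start of the list is an assumption of this sketch.

```python
def window_words(sentences, j, d):
    """Words in sentences j-d .. j-1 (window clamped at the start of the list)."""
    return " ".join(sentences[max(0, j - d):j]).split(" ")

def indicator_score(x, j, d, sentences):
    """Number of context words that occur at least once in the window."""
    words = set(window_words(sentences, j, d))
    return sum(1 for w in x if w in words)

def count_score(x, j, d, sentences):
    """Total number of occurrences of context words in the window."""
    words = window_words(sentences, j, d)
    return sum(words.count(w) for w in x)

# Made-up toy corpus, used only for illustration.
toy = ["the king is dead", "long live the king",
       "a subject speaks", "the moon is bright"]
x = ["king", "subject"]

# Window for j=2, d=2 covers sentences 0 and 1: "king" appears twice.
print(indicator_score(x, 2, 2, toy), count_score(x, 2, 2, toy))
# Window for j=3, d=2 covers sentences 1 and 2: each word appears once.
print(indicator_score(x, 3, 2, toy), count_score(x, 3, 2, toy))
```

With the indicator score the second window wins, because it contains both context words; with the occurrence count the two windows tie.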

### Dialogue Prediction
The final output dialogue $\mathcal{D}$ is defined as:

\begin{align*}
j^* &= \underset{j}{\arg\max} P(\vec{x},j | d) \\
\mathcal{D} &= s_{j^*} 
\end{align*}
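The selection step can be sketched on a small made-up corpus. The example below scores every index $j$ with an occurrence count, collects all maximisers, and picks the first one so that the result is deterministic; the notebook's own implementation may break ties differently. The `score` helper and the toy sentences are assumptions made for illustration.

```python
import numpy as np

def score(x, j, d, sentences):
    """Occurrences of context words in sentences j-d .. j-1 (clamped at 0)."""
    words = " ".join(sentences[max(0, j - d):j]).split(" ")
    return sum(words.count(w) for w in x)

# Made-up toy corpus, used only for illustration.
toy = ["the king is dead", "long live the king",
       "a subject speaks", "the moon is bright"]
x = ["king"]
d = 1

scores = np.array([score(x, j, d, toy) for j in range(len(toy))])
best = np.flatnonzero(scores == scores.max())  # every index that attains the maximum
j_star = int(best[0])                          # deterministic choice: the first one
print(scores, best, toy[j_star])
```

Note that the returned line $s_{j^*}$ is the sentence that follows the matched window, not a sentence inside it, which is what makes the output read as a reply to the context.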

In [130]:
# vocabulary
V = list(set((" ".join(lines_data).split(" "))))
print(len(V))

27461


In [254]:
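def prob(x, j, d, lines_data):
    '''
    Calculates the unnormalized score of context x for the window of
    sentences j-d .. j-1
        x : context (list of words)
        j : index of the sentence that follows the window
        d : dialogue depth (number of sentences in the window)
        lines_data : list of sentences
    '''
    # clamp the start so that j < d gives a shorter window instead of
    # a negative slice index (which would produce an empty window)
    start = max(0, j - d)
    # get all words in the window using join and split
    wordlist = " ".join(lines_data[start:j]).split(" ")
    score = 0
    for word in x:
        # count every occurrence of the context word in the window
        score += wordlist.count(word)
    return score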

In [255]:
def calc_dialogue(x, d, lines_data):
    '''
    Returns the score vector, the index and score of the best match,
    and the dialogue line with the highest score given context x
    '''
    prob_vec = np.array([prob(x, j, d, lines_data) for j in range(len(lines_data))])
    max_score = np.max(prob_vec)
    # break ties arbitrarily
    max_index = np.random.choice(np.argwhere(prob_vec == max_score).flatten())
    dialogue = lines_data[max_index]

    return prob_vec, max_index, max_score, dialogue
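Calling `prob` for every `j` rebuilds and rescans a window of `d` sentences each time, so one full pass costs on the order of `len(lines_data) * d` sentence scans. One possible alternative, which is not part of the original notebook, is to count the context-word occurrences of each sentence once and then obtain every window total from a cumulative sum. The function name `window_scores` is hypothetical, and the sketch assumes that windows are clamped at the start of the list, so indices with `j < d` are scored on the sentences that do exist.

```python
import numpy as np

def window_scores(x, d, sentences):
    """Score for every j: occurrences of context words in sentences j-d .. j-1."""
    # Occurrences of context words in each individual sentence.
    per_sentence = np.array(
        [sum(s.split(" ").count(w) for w in x) for s in sentences],
        dtype=np.int64,
    )
    # prefix[j] = total occurrences in sentences 0 .. j-1.
    prefix = np.concatenate(([0], np.cumsum(per_sentence)))
    j = np.arange(len(sentences))
    start = np.maximum(j - d, 0)  # clamp windows at the start of the list
    return prefix[j] - prefix[start]

# Made-up toy corpus, used only for illustration.
toy = ["the king is dead", "long live the king",
       "a subject speaks", "the moon is bright"]
scores = window_scores(["king", "the"], 2, toy)
print(scores)
```

The returned array could stand in for `prob_vec`, with the tie-breaking and lookup steps left unchanged. The two approaches can disagree when the context contains an empty string and the window is empty, because joining and splitting an empty window yields a single `''` token.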

## Experiment I
The context is either sampled randomly from the vocabulary or set manually, and a single response is predicted.

In [256]:
# dialogue depth
d = 20
# context_size
context_size = 10

# choose a random context from V
#x = np.random.choice(list(V),context_size)

x = ["king","subject"]
prob_vec,max_index,max_score,dialogue = calc_dialogue(x,d,lines_data)

print("Context words\n",x)

print("\nScore")
print(max_score)

print("\nmax index")
print(max_index,"\n")
print(dialogue)

Context words
 ['king', 'subject']

Score
8

max index
80520 

the blood of english shall manure the ground


## Experiment II
A conversation loop is formed by using the output of each prediction as the context for the next prediction.

In [None]:
conversation_length = 15
d = 10
x = ["king","slave"]
for index in range(conversation_length):
    prob_vec,max_index,max_score,dialogue = calc_dialogue(x,d,lines_data)    
    result = dialogue
    #print(np.argwhere(prob_vec == max_score).flatten())
    #print(max_score,max_index)
    print(data['Player'][max_index],result)
    
    # new context
    x = result.split(" ")

OCTAVIA that have my heart parted betwixt two friends
Clown for young charbon the puritan and old poysam the
TOUCHSTONE justices could not take up a quarrel but when the
TOUCHSTONE of an if as if you said so then i said so and
ROSALIND of irish wolves against the moon
FLUELLEN if the enemy is an ass and a fool and a prating
CARDINAL WOLSEY let silence be commanded
PARIS come you to make confession to this father
ROSALIND i would love you if i could tomorrow meet me all together
HELENA what worser place can i beg in your love
HELENA than to be used as you use your dog
VIOLA longer some mollification for your giant sweet
Sixth Citizen mans voice
