<a href="https://colab.research.google.com/github/albisbub/dighum/blob/master/_notebooks/2020-05-14-Final-Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Final Project DH 2020
> Laura-Bot V2

- toc: true
- branch: master
- badges: true
- comments: true
- author: Blaise
- permalink: /final/
- categories: [fastpages, jupyter, nltk, textual_analysis, Jewish, History]

# Introduction, Goals, and Considerations

This platform aims to provide a framework and tools to conduct detailed analysis in digital humanities. The platform is a simple blogging website built on GitHub that is open source and can be contributed to by anyone. The website can be easily duplicated by other GitHub users making the setup convenient for a classroom or casual tinkering.


This blog can host Python notebooks, Markdown, and HTML pages. There were two options when designing the backend of the project. One was to make it a web app in which users produce an analysis, download it, and use it where they see fit. The other was to build an open-ended platform that hosts the analysis itself. The latter was chosen for the following reasons.


*Technical Limitations*: After extensive research, it became clear that building a standalone web app that works at scale is a significant task that would require a higher degree of work segmentation and specialization. It would not be an impossible grind, but one should be wary about future upkeep and deprecation, since a web app would require more rigorous maintenance. Both options need the same code toolbox for the interactivities, so choosing the platform does not constrain the interactivities themselves.

*Accessibility*: All code from the project will be open source and uses many popular open-source libraries. The website is reproducible via a simple GitHub clone, and the module of tools will be uploaded to the Python Package Index (PyPI) so that others can install the functions directly into their local Python environments.


*Modular Tools*: The notebook/blog format preserves the modularity of the tools, so the functions created do not have to be cookie-cutter activities; they can be used in Python code wherever a user might want.


*Extendability*: Python is arguably the most popular scripting language and has one of the most active communities, which makes debugging, researching, and executing processes much more straightforward. It is also a very readable language, so it will likely be more pleasing and understandable than other syntaxes, even to someone with no coding experience.


*Multimodality*: IPython notebooks are perfect examples of multimodal learning environments. In a Python notebook, one writes content in cells, much like in the WordPress editor. One can change the content type of a cell to be code, markdown, images, or interactive buttons/forms. These inputs themselves possess multimodality; however, code can also conjure a plethora of analyses and outputs that harness multiple modes familiar to the contemporary computer user, like video, sound, and interactive HTML widgets.


*Interactivity*: After completing an analysis, the notebooks automatically render to HTML. Rendering is essential to creating interactivities since we need the user to click on things, change stuff in the output, and explore their research questions by messing around with different settings. Aside from the code being 100% malleable, dynamic HTML outputs will allow for the type of interactive charts and displays commonplace on the web.


*Transparency*: These notebooks differ from most of the tools we have studied in digital humanities this semester because the code is visible if one wants it to be. One of the most critical aspects of this approach is that nothing is hidden and that the user can wholly view and edit the source material for the tools. Why is this important? Many religious studies and classics scholars are critical of digital humanities because of its power to reshape knowledge through digital listing and cataloging, because of epistemological concerns, and because of the challenge of striking a balance between the conflicting forces of coherence and asymmetry (Clivaz). This format does not solve all of these problems, but it addresses the last one pretty thoroughly. An earlier analysis does not constrain the methods of the current one, although the current one does build on it, allowing the budding digital humanities scholar to tailor the levels of these forces to their liking.


# Data Cleaning

## Load and Parse PDFs from a Folder in Google Drive

In [3]:
#hide
!pip install pdfminer
# io is part of the Python standard library, so it does not need to be installed
!pip install PyPDF2

Collecting pdfminer
  Downloading https://files.pythonhosted.org/packages/71/a3/155c5cde5f9c0b1069043b2946a93f54a41fd72cc19c6c100f6f2f5bdc15/pdfminer-20191125.tar.gz (4.2MB)
     |████████████████████████████████| 4.2MB 8.0MB/s 
Collecting pycryptodome
  Downloading https://files.pythonhosted.org/packages/af/16/da16a22d47bac9bf9db39f3b9af74e8eeed8855c0df96be20b580ef92fff/pycryptodome-3.9.7-cp36-cp36m-manylinux1_x86_64.whl (13.7MB)
     |████████████████████████████████| 13.7MB 286kB/s 
Building wheels for collected packages: pdfminer
  Building wheel for pdfminer (setup.py) ... done
  Created wheel for pdfminer: filename=pdfminer-20191125-cp36-none-any.whl size=6140074 sha256=31d32313896184c17396d78326cdb56c4c03f78720d524a4f3f9a7cf9a29e882
  Stored in directory: /root/.cache/pip/wheels/e1/00/af/720a55d74ba3615bb4709a3ded6dd71dc5370a586a0ff6f326
Successfully built pdfminer
Installing collected packages: pycryptodome, pdfminer
Successfully instal

In [4]:
#hide
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


In [12]:
#hide
# %cd /content/drive
!pwd

/content/drive/My Drive/pdfs


In [6]:
#hide
%cd drive
%pwd
%cd My Drive
%cd pdfs
# convert_pdf_to_txt('drive/')


/content/drive
/content/drive/My Drive
/content/drive/My Drive/pdfs


In [7]:
#hide
%pwd

'/content/drive/My Drive/pdfs'

In [13]:
#hide
import os
from os import listdir
from os.path import isfile, join
mypath = os.curdir
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
print(mypath)


.


In [14]:
#collapse-hide
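import PyPDF2
from os import listdir
from os.path import isfile, join

# NOTE: PdfFileReader, getNumPages, getPage and extractText are the PyPDF2 1.x names.
# PyPDF2 3.x renames them to PdfReader, len(reader.pages), reader.pages[i] and extract_text().

onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

listofArticles = []  # text of every page, across all files
art_records = []     # one (filename, full text) record per file
stringV = ''         # all text from all files, concatenated
for file in onlyfiles:
  try:
    # Open in binary mode and close the file handle when done
    with open(join(mypath, file), 'rb') as pdf_file:
      fileReader = PyPDF2.PdfFileReader(pdf_file)
      pages = fileReader.getNumPages()
      print(pages, "pages")
      filetext = ''

      for count in range(pages):
        text = fileReader.getPage(count).extractText()
        filetext += text
        stringV += text
        listofArticles.append(text)

    art_records.append((str(file), filetext))
  except Exception:
    # Skip anything that cannot be read as a PDF
    pass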



6 pages
19 pages
4 pages
26 pages
28 pages


In [0]:
#hide
import pandas as pd
fileDF = pd.DataFrame.from_records(art_records, columns = ['Article', 'Text']) 

In [0]:
#hide
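# Strip all digits (page numbers, years, footnote markers); regex=True is required in newer pandas
fileDF['Text'] = fileDF['Text'].str.replace(r'\d+', '', regex=True)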

In [19]:
fileDF

| | Article | Text |
|---|---|---|
| 0 | Introduction_Jewish_American_Material_Cu.pdf | Introduction: Jewish American Material Culture... |
| 1 | Messianism_Secrecy_and_Mysticism_A_New_I.pdf | Messianism,SecrecyandMysticismANewInterpretati... |
| 2 | Of_Dogs_Vacations_and_Jews.pdf | Of Dogs, Vacations, and Jews\nLAURA LEIBMAN\nI... |
| 3 | Making_Jews_Race_Gender_and_Identity_in.pdf | Making Jews: Race, Gender and Identity in Barb... |
| 4 | Poetics_of_the_Apocalypse_Messianism_in.pdf | Studies in American Jewish Literature, Vol. , ... |


Import stop words and the English corpus

In [20]:
#collapse-hide
# !pip install -q wordcloud
import wordcloud

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger') 

import pandas as pd
import matplotlib.pyplot as plt
import io
import unicodedata
import numpy as np
import re
import string

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [0]:
#collapse-hide

# Constants
# POS (Parts Of Speech) for: nouns, adjectives, verbs and adverbs
DI_POS_TYPES = {'NN':'n', 'JJ':'a', 'VB':'v', 'RB':'r'} 
POS_TYPES = list(DI_POS_TYPES.keys())

# Constraints on tokens
MIN_STR_LEN = 3
RE_VALID = '[a-zA-Z]'

## Load the article text from our parsed data

In [22]:
#collapse-hide

# Upload from google drive
from google.colab import files
# uploaded = fileDF
# print("len(uploaded.keys():", len(uploaded.keys()))

# for fn in uploaded.keys():
#   print('User uploaded file "{name}" with length {length} bytes'.format(name=fn, length=len(uploaded[fn])))

# # Get list of quotes
df_quotes = fileDF
  
# Display
# print("df_quotes:")
# print(df_quotes.head().to_string())
# print(df_quotes.describe())

# Convert quotes to list
li_quotes = df_quotes['Text'].tolist()
stringV = li_quotes
print("Number of Articles", len(li_quotes))

a = ' '.join(stringV)
!pip install wordninja
import wordninja
b = wordninja.split(a)
print(len(b))

Number of Articles 5


## Tokenize sentences and words, remove stopwords, use stemmer & lemmatizer

First, a note on the difference between Stemming vs Lemmatization:

* Stemming: Trying to shorten a word with simple regex rules

* Lemmatization: Trying to find the root word with linguistics rules (with the use of regex rules)
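
To make the distinction concrete, here is a toy sketch that does not use NLTK. The suffix rules in `SUFFIX_RULES` and the small `LEMMA_TABLE` dictionary are made up for illustration; the Porter stemmer and WordNet lemmatizer used in the next cell are far more thorough.

```python
import re

# Hypothetical suffix rules for illustration only (not the Porter algorithm)
SUFFIX_RULES = [(r'ies$', 'i'), (r'ing$', ''), (r'ed$', ''), (r's$', '')]

# Hypothetical lookup table standing in for a linguistic dictionary such as WordNet
LEMMA_TABLE = {'made': 'make', 'jews': 'jew', 'studies': 'study', 'wrote': 'write'}

def toy_stem(word):
    """Chop off the first matching suffix, whether or not the result is a real word."""
    for pattern, replacement in SUFFIX_RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary of known forms; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)

for w in ['studies', 'made', 'jews']:
    print(w, '-> stem:', toy_stem(w), '| lemma:', toy_lemmatize(w))
```

The stemmer turns "studies" into the non-word "studi" and leaves "made" untouched, while the lookup maps them to "study" and "make". The same pattern shows up in the `stem` and `lem` columns of the results further down.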

In [25]:
# Get stopwords, stemmer and lemmatizer
stopwords = nltk.corpus.stopwords.words('english')
stemmer = nltk.stem.PorterStemmer()
lemmatizer = nltk.stem.WordNetLemmatizer()

# Remove accents function
def remove_accents(data):
    return ''.join(x for x in unicodedata.normalize('NFKD', data) if x in string.ascii_letters or x == " ")

# Process all quotes
li_tokens = []
li_token_lists = []
li_lem_strings = []

for i,text in enumerate(li_quotes):
    # Tokenize by sentence, then by lowercase word
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]

    # Process all tokens per quote
    li_tokens_quote = []
    li_tokens_quote_lem = []
    for token in tokens:
        # Remove accents
        t = remove_accents(token)

        # Remove punctuation
        t = str(t).translate(str.maketrans('', '', string.punctuation))
        li_tokens_quote.append(t)
        
        # Add token that represents "no lemmatization match"
        li_tokens_quote_lem.append("-") # this token will be removed if a lemmatization match is found below

        # Process each token
        if t not in stopwords:
            if re.search(RE_VALID, t):
                if len(t) >= MIN_STR_LEN:
                    # Note that the POS (Part Of Speech) is necessary as input to the lemmatizer 
                    # (otherwise it assumes the word is a noun)
                    pos = nltk.pos_tag([t])[0][1][:2]
                    pos2 = 'n'  # set default to noun
                    if pos in DI_POS_TYPES:
                      pos2 = DI_POS_TYPES[pos]
                    
                    stem = stemmer.stem(t)
                    lem = lemmatizer.lemmatize(t, pos=pos2)  # lemmatize with the correct POS
                    
                    if pos in POS_TYPES:
                        li_tokens.append((t, stem, lem, pos))

                        # Remove the "-" token and append the lemmatization match
                        li_tokens_quote_lem = li_tokens_quote_lem[:-1] 
                        li_tokens_quote_lem.append(lem)

    # Build list of token lists from the cleaned (not lemmatized) tokens
    li_token_lists.append(li_tokens_quote)
    
    # Build list of strings from lemmatized tokens
    str_li_tokens_quote_lem = ' '.join(li_tokens_quote_lem)
    li_lem_strings.append(str_li_tokens_quote_lem)
    
# Build resulting dataframes from lists
df_token_lists = pd.DataFrame(li_token_lists)

print("df_token_lists.head(5):")
print(df_token_lists.head(5).to_string())

# Replace None with empty string
for c in df_token_lists:
    if str(df_token_lists[c].dtype) in ('object', 'string_', 'unicode_'):
        df_token_lists[c] = df_token_lists[c].fillna('')

df_lem_strings = pd.DataFrame(li_lem_strings, columns=['lem quote'])

print()
print("")
print("df_lem_strings.head():")
print(df_lem_strings.head().to_string())


# Simple Textual Analysis

## Process results, find the most popular lemmatized words and group results by Part of Speech (POS)

In [26]:
#hide-input
print("Group by lemmatized words, add count and sort:")
df_all_words = pd.DataFrame(li_tokens, columns=['token', 'stem', 'lem', 'pos'])
df_all_words['counts'] = df_all_words.groupby(['lem'])['lem'].transform('count')
df_all_words = df_all_words.sort_values(by=['counts', 'lem'], ascending=[False, True]).reset_index()

print("Get just the first row in each lemmatized group")
df_words = df_all_words.groupby('lem').first().sort_values(by='counts', ascending=False).reset_index()
print("df_words.head(10):")
print(df_words.head(10))

Group by lemmatized words, add count and sort:
Get just the first row in each lemmatized group
df_words.head(10):
        lem  index     token      stem pos  counts
0    jewish      1    jewish    jewish  NN     400
1       jew    182      jews       jew  NN     233
2  american      2  american  american  JJ     217
3    vestry   5247    vestry    vestri  NN      83
4  barbados   1334  barbados   barbado  NN      83
5       new    451       new       new  JJ      80
6      make     92      made      made  VB      76
7     early   1311     early     earli  RB      75
8      bill   5248      bill      bill  NN      73
9     isaac   1849     isaac     isaac  NN      68
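
The grouping step above can also be pictured without pandas. The sketch below uses a handful of made-up `(lemma, pos)` pairs, not the real corpus, to count how often each lemma occurs and to list the most frequent lemmas for each part of speech.

```python
from collections import Counter, defaultdict

# Made-up (lemma, POS) pairs for illustration; the real pairs come from li_tokens
pairs = [('jewish', 'NN'), ('jew', 'NN'), ('jewish', 'NN'),
         ('american', 'JJ'), ('make', 'VB'), ('american', 'JJ'),
         ('jewish', 'NN'), ('new', 'JJ')]

# Count every lemma, remembering the first POS tag seen for it
counts = Counter(lemma for lemma, _ in pairs)
first_pos = {}
for lemma, pos in pairs:
    first_pos.setdefault(lemma, pos)

# Group the lemmas by POS, most frequent first
top_by_pos = defaultdict(list)
for lemma, n in counts.most_common():
    top_by_pos[first_pos[lemma]].append((lemma, n))

for pos, rows in top_by_pos.items():
    print(pos, rows[:10])
```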


## Frequency of Lemmatized Words Grouped by Parts of Speech.

In [27]:
#hide-input
df_words.head(50)

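| | lem | index | token | stem | pos | counts |
|---|---|---|---|---|---|---|
| 0 | jewish | 1 | jewish | jewish | NN | 400 |
| 1 | jew | 182 | jews | jew | NN | 233 |
| 2 | american | 2 | american | american | JJ | 217 |
| 3 | vestry | 5247 | vestry | vestri | NN | 83 |
| 4 | barbados | 1334 | barbados | barbado | NN | 83 |
| 5 | new | 451 | new | new | JJ | 80 |
| 6 | make | 92 | made | made | VB | 76 |
| 7 | early | 1311 | early | earli | RB | 75 |
| 8 | bill | 5248 | bill | bill | NN | 73 |
| 9 | isaac | 1849 | isaac | isaac | NN | 68 |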


## Top 10 words per Part Of Speech (POS)

In [0]:
#collapse-hide
df_words = df_words[['lem', 'pos', 'counts']].head(200)
dfList_pos = []
for v in POS_TYPES:
    df_pos = df_words[df_words['pos'] == v]
    # print()
    # print("POS_TYPE:", v)
    # print(df_pos.head(10).to_string())
    df = df_pos.reset_index(inplace=False)
    dfList_pos.append(df.head(10))


### Nouns

In [29]:
#hide-input
dfList_pos[0]

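| | index | lem | pos | counts |
|---|---|---|---|---|
| 0 | 0 | jewish | NN | 400 |
| 1 | 1 | jew | NN | 233 |
| 2 | 3 | vestry | NN | 83 |
| 3 | 4 | barbados | NN | 83 |
| 4 | 8 | bill | NN | 73 |
| 5 | 9 | isaac | NN | 68 |
| 6 | 10 | history | NN | 66 |
| 7 | 11 | emancipation | NN | 64 |
| 8 | 12 | israel | NN | 63 |
| 9 | 13 | poem | NN | 63 |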


### Adjectives

In [30]:
#hide-input
dfList_pos[1]

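| | index | lem | pos | counts |
|---|---|---|---|---|
| 0 | 2 | american | JJ | 217 |
| 1 | 5 | new | JJ | 80 |
| 2 | 27 | spanish | JJ | 41 |
| 3 | 38 | social | JJ | 36 |
| 4 | 46 | literary | JJ | 33 |
| 5 | 60 | poor | JJ | 29 |
| 6 | 66 | white | JJ | 27 |
| 7 | 91 | large | JJ | 24 |
| 8 | 95 | racial | JJ | 23 |
| 9 | 105 | many | JJ | 22 |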


### Verbs

In [31]:
#hide-input
dfList_pos[2]

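| | index | lem | pos | counts |
|---|---|---|---|---|
| 0 | 6 | make | VB | 76 |
| 1 | 29 | debate | VB | 41 |
| 2 | 34 | form | VB | 40 |
| 3 | 49 | take | VB | 31 |
| 4 | 94 | understand | VB | 23 |
| 5 | 99 | see | VB | 23 |
| 6 | 124 | write | VB | 20 |
| 7 | 135 | present | VB | 19 |
| 8 | 155 | begin | VB | 18 |
| 9 | 159 | come | VB | 17 |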


### Adverbs

In [32]:
#hide-input
dfList_pos[3]

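| | index | lem | pos | counts |
|---|---|---|---|---|
| 0 | 7 | early | RB | 75 |
| 1 | 18 | also | RB | 52 |
| 2 | 31 | even | RB | 40 |
| 3 | 42 | rather | RB | 34 |
| 4 | 45 | yet | RB | 33 |
| 5 | 47 | later | RB | 32 |
| 6 | 78 | indeed | RB | 26 |
| 7 | 83 | often | RB | 25 |
| 8 | 154 | thus | RB | 18 |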


### Frequency plot grouped by POS type

In [33]:
#hide-input
import pandas as pd
import altair as alt

# Sort bars by count, largest first, and color them by part of speech
source = df_words.sort_values(by=['counts'], ascending=False)
alt.Chart(source).mark_bar(opacity=0.7).encode(
    y=alt.Y('lem:N', sort='-x'),
    x=alt.X('counts:Q', stack=None),
    color="pos:N",
)

In [47]:
#hide
nltk.download("popular")

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
[nlt

True

# Machine Learning Text Generation Model

While parsing PDFs, the most common artifacts are page numbers and words that are stucktogetherlikethis. To handle this and to make our training data more robust, we use a package called wordninja, which uses an English corpus and some fancy math to split them up correctly. We also remove all numbers that are not spelled out in the text.
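
The "fancy math" can be pictured as a dynamic program over word frequencies: every possible split gets a cost, frequent words are cheap, and the cheapest split wins. The following is a toy sketch of that general idea, not wordninja's own code; the frequency table `FREQ` and the penalty for unknown words are made up for illustration.

```python
import math

# Hypothetical word frequencies for illustration; a real splitter uses a large ranked word list
FREQ = {'stuck': 50, 'together': 200, 'like': 900, 'this': 1000,
        'to': 1200, 'get': 400, 'her': 600, 'is': 1100, 'the': 2000}
TOTAL = sum(FREQ.values())
MAX_LEN = max(len(w) for w in FREQ)

def word_cost(word):
    """Negative log probability: frequent words are cheap, unknown words are very expensive."""
    if word in FREQ:
        return -math.log(FREQ[word] / TOTAL)
    return 25.0 * len(word)

def split_words(text):
    """Dynamic programming: best[i] holds the cheapest way to split text[:i]."""
    best = [(0.0, 0)]  # (total cost, start index of the last word)
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - MAX_LEN), i):
            candidates.append((best[j][0] + word_cost(text[j:i]), j))
        best.append(min(candidates))
    # Walk backwards through the stored start indices to recover the words
    words, i = [], len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

print(split_words('stucktogetherlikethis'))
```

Note that "together" could also be read as "to get her"; the single word wins because one moderately common word costs less than three common ones.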

In [49]:
#collapse-hide
import wordninja

# Split run-together words in each article (stringV holds one string per article)
cleanedText = []
for article in stringV:
  cleanedText.append(wordninja.split(article))

print(len(cleanedText))

# Screen out tokens that are not in the English word list (non-alphabetic tokens are kept)
out = []
words = set(nltk.corpus.words.words())

for article_words in cleanedText:
  screened = " ".join(w for w in nltk.wordpunct_tokenize(" ".join(article_words))
                      if w.lower() in words or not w.isalpha())
  out.append(screened)



5



In [0]:
#collapse-hide
import itertools
cleanedText_S = (list(itertools.chain.from_iterable(cleanedText)))

In [55]:
#collapse-hide
import numpy
import sys
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint

file = " ".join(cleanedText_S)

def tokenize_words(text):
    # lowercase everything to standardize it
    text = text.lower()

    # instantiate the tokenizer
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)

    # build the stop word set once, then keep only tokens that are not stop words
    stop_words = set(stopwords.words('english'))
    filtered = (token for token in tokens if token not in stop_words)
    return " ".join(filtered)

# preprocess the input data, make tokens
processed_inputs = tokenize_words(file)


chars = sorted(list(set(processed_inputs)))
char_to_num = dict((c, i) for i, c in enumerate(chars))

input_len = len(processed_inputs)
vocab_len = len(chars)
print ("Total number of characters:", input_len)
print ("Total vocab:", vocab_len)

Total number of characters: 168214
Total vocab: 27


In [0]:
#collapse-hide
seq_length = 100
x_data = []
y_data = []

In [0]:
#collapse-hide
# Loop through the text, starting at the beginning and stopping at the last
# position where a full sequence plus its next character still fits
for i in range(0, input_len - seq_length, 1):
    # Input sequence: seq_length characters starting at position i
    in_seq = processed_inputs[i:i + seq_length]

    # Output: the single character that follows the input sequence
    out_seq = processed_inputs[i + seq_length]

    # Convert the characters to integers using char_to_num
    # and add the values to our lists
    x_data.append([char_to_num[char] for char in in_seq])
    y_data.append(char_to_num[out_seq])
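
The same sliding window is easier to follow on a tiny example. This sketch uses a made-up eleven-character text and a window of 4 instead of 100; everything else mirrors the loop above.

```python
# Toy text and a short window, chosen for illustration (the notebook uses seq_length = 100)
text = "hello world"
seq_length = 4

chars = sorted(set(text))
char_to_num = {c: i for i, c in enumerate(chars)}

x_data, y_data = [], []
for i in range(len(text) - seq_length):
    in_seq = text[i:i + seq_length]   # window of seq_length characters
    out_char = text[i + seq_length]   # the single character that follows the window
    x_data.append([char_to_num[c] for c in in_seq])
    y_data.append(char_to_num[out_char])

print(len(x_data), "patterns")
print(repr(text[0:seq_length]), "->", repr(text[seq_length]))
```

A text of length n yields n minus `seq_length` patterns, which is why 168214 characters produce 168114 patterns in the next cell.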

In [58]:
#collapse-hide
n_patterns = len(x_data)
print ("Total Patterns:", n_patterns)
X = numpy.reshape(x_data, (n_patterns, seq_length, 1))
X = X/float(vocab_len)
y = np_utils.to_categorical(y_data)


Total Patterns: 168114
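
The cell above does two things: it scales the integer codes into the range 0 to 1 by dividing by the vocabulary size, and it turns each target character into a one-hot vector. Here is a pure-Python sketch of both steps with a made-up four-character vocabulary; the helper `one_hot` is hypothetical and only imitates what `to_categorical` produces.

```python
# Made-up vocabulary size and data for illustration
vocab_len = 4
x_data = [[0, 1, 2], [1, 2, 3]]   # two input windows of integer-coded characters
y_data = [3, 0]                   # the integer code of the character after each window

# Scale inputs and wrap each value in a list, mirroring the (patterns, seq_length, 1) shape
X = [[[code / float(vocab_len)] for code in window] for window in x_data]

def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at the given index."""
    row = [0.0] * size
    row[index] = 1.0
    return row

y = [one_hot(code, vocab_len) for code in y_data]

print(X[0])
print(y)
```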


## Parameters

In [0]:
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))

In [0]:
model.compile(loss='categorical_crossentropy', optimizer='adam')
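
The `categorical_crossentropy` loss compiled above is the average negative natural log of the probability the model assigns to the correct next character. One way to read the loss values is to compare them with a model that guesses uniformly over the 27-character vocabulary. The sketch below copies the epoch losses from the training output further down in this post.

```python
import math

vocab_len = 27  # "Total vocab" reported earlier
epoch_losses = [2.89978, 2.68133, 2.48392, 2.31998, 2.19382]  # copied from the training log

# Cross-entropy of guessing every character with equal probability
uniform_loss = math.log(vocab_len)
print("uniform baseline: %.3f" % uniform_loss)

for epoch, loss in enumerate(epoch_losses, start=1):
    # exp(loss) is the perplexity: the effective number of equally likely choices per character
    print("epoch %d: loss %.3f, perplexity %.1f" % (epoch, loss, math.exp(loss)))
```

The uniform baseline is ln(27), about 3.30, so every epoch in the log already beats random guessing, and the loss is still falling after five epochs.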

In [0]:
filepath = "model_weights_saved.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
desired_callbacks = [checkpoint]

## Fitting Model (This takes a long time)

In [63]:
model.fit(X, y, epochs=5, batch_size=200, callbacks=desired_callbacks)


Epoch 1/5

Epoch 00001: loss improved from inf to 2.89978, saving model to model_weights_saved.hdf5
Epoch 2/5

Epoch 00002: loss improved from 2.89978 to 2.68133, saving model to model_weights_saved.hdf5
Epoch 3/5

Epoch 00003: loss improved from 2.68133 to 2.48392, saving model to model_weights_saved.hdf5
Epoch 4/5

Epoch 00004: loss improved from 2.48392 to 2.31998, saving model to model_weights_saved.hdf5
Epoch 5/5

Epoch 00005: loss improved from 2.31998 to 2.19382, saving model to model_weights_saved.hdf5


<keras.callbacks.callbacks.History at 0x7f42ce73bdd8>

In [115]:
#collapse-hide

filename = "model_weights_saved.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy', optimizer='adam')
num_to_char = dict((i, c) for i, c in enumerate(chars))
import re

numberOfSamples = 12


# sample = ("\"", ''.join([num_to_char[value] for value in pattern]), "\"")

essay = []
essays = ''

for i in range(numberOfSamples):
  start = numpy.random.randint(0, len(x_data) - 1)

  pattern = x_data[start]
  o = ( ''.join([num_to_char[value] for value in pattern]))

  output = re.sub(r'\d+', '',o)
  print(output)
  essays+=output

  # essay.append(sample)
  # essay.append(".")
tokens = nltk.word_tokenize(essays)

ential threats providential messianic vision time space reworks underscore ultimate jewish tri u mph
t cemeteries second tl ayy im ashkenazi spanish town jamaica xxvii second jewish cemetery new york x
i dent al plan actions consequences plan russian formalist boris roma v sky similarly suggested trav
see cary carson architecture social history chesapeake house architectural investigation colonial wi
ish reclaims jewish language positive part sephardic tory identity like boa b de barrios combines ol
ancy reinforce ment cohesion correlation causation narrative thread new literary history e second us
pompous elevated prose marked speakers attempts gain authority position decorous gentlemen e r macki
raph author xii tables timeline stages death rit u adapted robert v wells facing king terrors living
y information jews saratoga springs vacations hilton seligman jews like represented vulgar osten ati
nglish women male patriarchal privilege rather responding attacks privilege directly jun fr
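
The loop above prints randomly chosen seed windows from `x_data`. To extend a seed with new characters, a generation loop would repeatedly ask the model for next-character probabilities, pick one, and slide the window forward. The sketch below shows that loop with a stand-in `predict_probs` function (hypothetical, used here in place of `model.predict`) and a made-up four-character vocabulary, so that it runs without a trained model.

```python
# Made-up vocabulary for illustration
chars = [' ', 'a', 'b', 'c']
num_to_char = dict(enumerate(chars))

def predict_probs(pattern):
    """Stand-in for model.predict: favors the character after the last one, cyclically."""
    probs = [0.1] * len(chars)
    probs[(pattern[-1] + 1) % len(chars)] = 0.7
    return probs

def generate(seed, n_chars):
    """Extend the seed by n_chars characters, one greedy prediction at a time."""
    pattern = list(seed)
    generated = []
    for _ in range(n_chars):
        probs = predict_probs(pattern)
        # Greedy choice: take the most probable next character
        index = max(range(len(probs)), key=probs.__getitem__)
        generated.append(num_to_char[index])
        # Slide the window: drop the oldest code, append the new one
        pattern = pattern[1:] + [index]
    return ''.join(generated)

print(repr(generate([1, 2, 3], 5)))
```

With the real model, the stand-in would instead reshape and normalize the pattern the same way as the training data before calling `model.predict`.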

## Getting Some Output

In [116]:
words = set(nltk.corpus.words.words())
# Screen for broken words with the English corpus
string_tokens = " ".join(tokens)
print(string_tokens)

remove_broken_words = " ".join(w for w in nltk.wordpunct_tokenize(string_tokens)
                               if w.lower() in words or not w.isalpha())

ential threats providential messianic vision time space reworks underscore ultimate jewish tri u mpht cemeteries second tl ayy im ashkenazi spanish town jamaica xxvii second jewish cemetery new york xi dent al plan actions consequences plan russian formalist boris roma v sky similarly suggested travsee cary carson architecture social history chesapeake house architectural investigation colonial wiish reclaims jewish language positive part sephardic tory identity like boa b de barrios combines olancy reinforce ment cohesion correlation causation narrative thread new literary history e second uspompous elevated prose marked speakers attempts gain authority position decorous gentlemen e r mackiraph author xii tables timeline stages death rit u adapted robert v wells facing king terrors livingy information jews saratoga springs vacations hilton seligman jews like represented vulgar osten atinglish women male patriarchal privilege rather responding attacks privilege directly jun framed disc

In [117]:
print(remove_broken_words)

providential vision time space underscore ultimate tri u second town second cemetery new york xi dent al plan plan formalist v sky similarly architecture social history house architectural investigation colonial language positive part tory identity like boa b de reinforce cohesion correlation causation narrative thread new literary history e second elevated prose marked gain authority position decorous e r author tables death rit u v facing king information like vulgar male patriarchal privilege rather privilege directly framed de weave people tradition baroque whereas god yet comparison also painful unlike ram substituted poetically however t


# Some Cherry Picked Outputs:

1)
es yad ham boa b race may unsettle us provide verse tradition new yale university encyclopedia poetry poetics material culture social history despite social h school boa b strong company mae era sh ber pi ha also se e f lay emancipation since much elite socially yet us importance thinking past national work remains sign atlantic world e inquisition con verso even promise world c ted colonization indeed black reappear later men sense gender question one already seen yet even patriarchy one older community example inferior men hence deserved equal however given period vestry female i alias aba mat j lay para zo van renegade resurrection dead revolutionary ser de public de la de c rennet h r la poe de ness whiteness likely lose since material augment racial status oh union college wiz er colonial brazil henry al incentive plenty emotional three lar col tea see also added illness pell pell dutch naming ked congregation elite call would also ally bill first least making vestry bill whether people descent count fully i ha heavily u en twentieth century poetics boa b ha poem indebted book prayer congregation un participate island politics fully without convert emancipation ted emancipation people descent col include identity rip community apart class ran beneath race gender long practiced separate distinct utterly alien see question whether grant faculty student collaborative research generous access nid collection first pur ing pleasure island un

2) poets provide temporal messianic poetry implicitly argues time contains pattern pur
odrigues pereira mendes abraham ca philadelphia xvi jewish population see also cemeteries mik veh is
nues impacting island politics unable vote racial changes also impacted jewish emancipation barbados
samuel th century kinship see also family jewish family

3) ential threats providential messianic vision time space reworks underscore ultimate jewish tri u mpht cemeteries second tl ayy im ashkenazi spanish town jamaica xxvii second jewish cemetery new york xi dent al plan actions consequences plan russian formalist boris roma v sky similarly suggested travsee cary carson architecture social history chesapeake house architectural investigation colonial wiish reclaims jewish language positive part sephardic tory identity like boa b de barrios combines olancy reinforce ment cohesion correlation causation narrative thread new literary history e second uspompous elevated prose marked speakers attempts gain authority position decorous gentlemen e r mackiraph author xii tables timeline stages death rit u adapted robert v wells facing king terrors livingy information jews saratoga springs vacations hilton seligman jews like represented vulgar osten atinglish women male patriarchal privilege rather responding attacks privilege directly jun framed discowever de barrios uses sonnets weave jewish people tradition spanish baroque whereas lazarus writes venant god yet comparison also feels painful unlike kedah ram substituted trevi poetically however t

4)
providential vision time space underscore ultimate tri u second town second cemetery new york xi dent al plan plan formalist v sky similarly architecture social history house architectural investigation colonial language positive part tory identity like boa b de reinforce cohesion correlation causation narrative thread new literary history e second elevated prose marked gain authority position decorous e r author tables death rit u v facing king information like vulgar male patriarchal privilege rather privilege directly framed de weave people tradition baroque whereas god yet comparison also painful unlike ram substituted poetically however t

# Conclusion

This version of Laura-Bot performed slightly better than before. This is because I trained it on more robust speech data and gave it more information about words in relation to other words in the article. Although there is a lot of gibberish, there are more instances of sentences or fragments that look like assertions. I only trained it for 4 hours; I would like to train it for 10+ hours, and I think it would start to become more conversational. Generally, I am happy with its performance compared to the first version. The word structure is more cohesive, but I still haven't figured out a way to incorporate grammar patterns in the same type of quantified vectors that I used for word placement.

I hope this framework will become a robust launchpad for digital humanities projects, and less of a central location to conduct analysis. One interesting term I have come across a few times when exploring literature on frameworks and methods in digital humanities is buttonology. Buttonology is simply a lesson or interactivity that surveys a software feature and interface in an introductory manner. 

>"Knowing how to upload texts into a tool like Voyant does not help researchers think about what texts should be uploaded, how selecting data relates to a research question, or even what constitutes an effective research question. This type of teaching does not encourage critical thinking, yet digital humanities instruction, in our experience, is frequently focused on showing how to use software rather than reflect on the broader context." (Russell et al., 2017)

I feel like we did an excellent job this semester of not reverting to buttonology. Although we did learn how to use software at an introductory level, it was always in the context of analysis with sources we were actively thinking about. Furthermore, it is not that buttonology is useless; rather, it is the absence of critical pedagogy in DH that makes buttonology dangerous. I think that this format's ability to blend code with narrative is what makes it a useful analysis approach and a compelling presentation or publication tool.

---
Google Doc for Final Write Up: https://docs.google.com/document/d/1nzjYj3jsUz78JF9wub55kwIw8ZtenOwRsy5EjsAFaxo/edit?usp=sharing

---

# Works Cited 
>Correct Indentation in Google Doc


Anonymous, gal-a. “Google Colaboratory: Example Using Nltk for Preprocessing Text.” Google, Google, 2018, colab.research.google.com/github/gal-a/blog/blob/master/docs/notebooks/nlp/nltk_preprocess.ipynb#scrollTo=lKYd8I_LhyMo.

Clivaz, Claire. “Digital Humanities in Ancient Jewish, Christian and Arabic Traditions.” Journal of Religion, Media and Digital Culture, vol. 5, no. 1, 2016, pp. 1–20, doi:10.1163/21659214-90000068.

Nelson, Dan. “Text Generation with Python and TensorFlow/Keras.” Stack Abuse, Stack Abuse, 2019, stackabuse.com/text-generation-with-python-and-tensorflow-keras/.

Russell, John E., and Merinda Kaye Hensley. “Beyond Buttonology: Digital Humanities, Digital Pedagogy, and the ACRL Framework.” College & Research Libraries News, vol. 78, no. 11, 2017, p. 588, doi:10.5860/crln.78.11.588.

Worsley, Marcelo. “Multimodal Learning Analytics: Enabling the Future of Learning through Multimodal Data Analysis and Interfaces.” Proceedings of the 14th ACM International Conference on Multimodal Interaction - ICMI ’12, ACM Press, 2012, p. 353. DOI.org (Crossref), doi:10.1145/2388676.2388755.
