# Analysis of Differences in Text

The plan here is to

* analyze differences among types of text
    * word choice/usage via unique vocab, freqdist, lexical diversity, most common w and w/o stopwords, plotting
    * context via concordance, collocations, similar, common_contexts
* from these concepts, build a corpus of docs and explore differences in scholarly lit across disciplines
    * requires above plus pre-processing, tokenizing, etc.
    
### Running outline

1. Import gutenberg corpus - overview of other available corpora and why we might use them (lots of preprocessing!)
2. Corpus methods - sents, paras, etc.
3. Init a gutenberg text as Text
4. text module methods
5. do comparison of texts per above

In [1]:
import nltk
import numpy as np
import matplotlib

In [2]:
# NLTK data need to be downloaded separately. Uncomment and run the next line the first time this notebook is run
# on a new system.

# nltk.download() # use id 'all' to download all

In [3]:
from nltk.tokenize.stanford import StanfordTokenizer
from nltk import word_tokenize
from nltk.corpus import gutenberg, stopwords
from nltk import draw
from nltk.probability import FreqDist  # only FreqDist is used below

Using examples from https://www.nltk.org/book/

```
Bird, Steven, Ewan Klein, and Edward Loper (2015). _Natural Language Processing with Python_. Accessed 2018-10-16 from https://www.nltk.org/book/.
```

In [4]:
# These are the texts in Gutenberg

print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


In [5]:
# some corpus methods
# https://www.nltk.org/api/nltk.corpus.reader
# gutenberg is a plaintext corpus, so the available methods are defined at
# https://www.nltk.org/api/nltk.corpus.reader.html#module-nltk.corpus.reader.plaintext

gutenberg.sents('melville-moby_dick.txt') # a list of lists - do raw, then words first

[['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']'], ['ETYMOLOGY', '.'], ...]

In [6]:
# with plaintext, options are words, sents, paras, raw

gutenberg.words('melville-moby_dick.txt') # list of words and punctuation

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', ...]

In [7]:
gutenberg.paras('melville-moby_dick.txt') # lists of lists of lists

[[['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']']], [['ETYMOLOGY', '.']], ...]

In [8]:
# gutenberg.raw('melville-moby_dick.txt')
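
The three views above nest inside each other: paragraphs hold sentences, sentences hold tokens. A minimal sketch with made-up data (plain Python, no NLTK; the token lists are hypothetical and not taken from the corpus) showing how one level flattens into the next:

```python
# Toy stand-in for a corpus reader's output (hypothetical data):
# paragraphs -> sentences -> tokens.
paras = [
    [["Call", "me", "Ishmael", "."]],
    [["It", "was", "cold", "."], ["We", "sailed", "."]],
]

# Flatten one level to get sentences, two levels to get tokens.
sents = [sent for para in paras for sent in para]
words = [tok for sent in sents for tok in sent]

print(len(paras), len(sents), len(words))
```

This is why `len()` of the same file differs so much between `paras`, `sents`, `words` and `raw` (the last one counts characters).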

In [None]:
# more info on the corpus package at https://www.nltk.org/api/nltk.corpus.html
# Text module provides some analysis methods - need to make the doc text an instance of nltk.Text

moby = nltk.Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

In [None]:
# OR
# creating an instance of the Text object gives us access to different methods, in our case to explore context
# see text Module at https://www.nltk.org/api/nltk.html

from nltk.text import Text

moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
moby.collocations()

In [None]:
moby.dispersion_plot(['man', 'whale', 'God'])

In [None]:
moby.concordance('chowder')

In [None]:
moby.common_contexts(['man', 'whale'])

In [None]:
from nltk.text import ContextIndex, ConcordanceIndex  # the two index classes used below

In [None]:
ci = ContextIndex(moby)

In [None]:
ci.common_contexts(['man', 'whale', 'God'])
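
To make the idea of a "context" concrete: a rough sketch, assuming a context is just the (previous word, next word) pair around a token. This is a simplification written in plain Python for illustration, not NLTK's implementation; `contexts_by_word` is a hypothetical helper and the sentence is made up.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) word pairs around it."""
    ctx = defaultdict(set)
    # Skip the first and last token: they lack a neighbour on one side.
    for i in range(1, len(tokens) - 1):
        ctx[tokens[i].lower()].add((tokens[i - 1].lower(), tokens[i + 1].lower()))
    return ctx

tokens = "the old man sat while the old whale sat near the young man".split()
ctx = contexts_by_word(tokens)

# Contexts in which both words occur
shared = ctx["man"] & ctx["whale"]
print(shared)
```

Words that share many contexts are the ones a "similar words" lookup would surface.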

In [None]:
ci.similar_words('man')

In [None]:
ci.word_similarity_dict('whale')

In [None]:
cci = ConcordanceIndex(moby)

In [None]:
cci.offsets('whale')

In [None]:
cci.print_concordance('whale')
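
A concordance is essentially a list of offsets plus a window of tokens on each side. A minimal plain-Python sketch of that idea (the helper name `concordance_lines`, the `width` parameter and the sample sentence are assumptions for illustration; matching is case-insensitive by choice here):

```python
def concordance_lines(tokens, word, width=3):
    """Return (left context, match, right context) for each hit of `word`."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

tokens = "the whale swam and the men saw the Whale dive".split()
hits = concordance_lines(tokens, "whale", width=2)
for left, match, right in hits:
    print(f"{left:>10} [{match}] {right}")
```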

In [None]:
moby.count('whale')

In [None]:
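# vocab() returns a FreqDist of the text's tokens, so FreqDist methods such as most_common() work on it directly
# (compare with a FreqDist built from the plaintext words above)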

moby.vocab().most_common()

In [None]:
len(moby.vocab())

In [None]:
len(moby)

In [None]:
len(set(moby)) # so vocab is uniques
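
The two lengths above are the ingredients of lexical diversity (distinct tokens divided by total tokens), which the plan lists but the cells do not compute. A small sketch with a toy token list; `lexical_diversity` is a helper name introduced here, not an NLTK function:

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens (between 0 and 1)."""
    return len(set(tokens)) / len(tokens) if len(tokens) else 0.0

sample = ["the", "whale", "and", "the", "sea", "and", "the", "ship"]
print(lexical_diversity(sample))
```

With the notebook's objects, `lexical_diversity(moby)` should give the same ratio as `len(set(moby)) / len(moby)`. Note that the ratio is sensitive to text length, so it is safest to compare texts of similar size.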

In [None]:
# maybe a roundabout way to remove stopwords for the FD

from nltk.corpus import stopwords

In [None]:
print(stopwords.fileids())

In [None]:
esw = stopwords.words('english')

In [None]:
esw

In [None]:
# nltk.Text keeps its tokens in moby.tokens, and the index objects expose them too;
# here we take them from the ContextIndex built above

# lowercase everything and drop punctuation/numbers and stopwords
# (a set makes the stopword lookup fast)

esw_set = set(esw)
nsw = [w.lower() for w in ci.tokens() if w.isalpha() and w.lower() not in esw_set]

In [None]:
len(gutenberg.words('melville-moby_dick.txt'))

In [None]:
len(nsw)

In [None]:
fd1 = FreqDist(gutenberg.words('melville-moby_dick.txt'))

In [None]:
fd2 = FreqDist(nsw)

In [None]:
fd1.most_common(20)

In [None]:
fd2.most_common(20)
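
Why the two lists differ so much: the first counts every token as-is (punctuation, case variants, function words), the second counts only lowercased alphabetic content words. A toy sketch of the same contrast using `collections.Counter`; the tiny `STOPWORDS` set and the token list are hypothetical stand-ins for `stopwords.words('english')` and the corpus:

```python
from collections import Counter

# Hypothetical mini stopword list; the notebook uses stopwords.words('english').
STOPWORDS = {"the", "and", "of", "a"}

tokens = ["The", "whale", ",", "the", "sea", "and", "the", "Whale", "."]

# Everything, exactly as tokenized
all_counts = Counter(tokens)

# Lowercased alphabetic tokens that are not stopwords
content_counts = Counter(
    t.lower() for t in tokens if t.isalpha() and t.lower() not in STOPWORDS
)

print(all_counts.most_common(3))
print(content_counts.most_common(3))
```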

In [None]:
fd1.plot(20, cumulative = True)

In [None]:
fd2.plot(20, cumulative = True)
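
The cumulative plots show running totals over the most frequent words. Dividing those totals by the number of tokens turns the curve into "share of the text covered by the top n words", which is easier to compare across texts. A sketch in plain Python (`cumulative_share` is a hypothetical helper, and the data are made up):

```python
from collections import Counter
from itertools import accumulate

def cumulative_share(tokens, n):
    """Fraction of all tokens covered by the 1..n most frequent types."""
    counts = [c for _, c in Counter(tokens).most_common(n)]
    total = len(tokens)
    return [running / total for running in accumulate(counts)]

tokens = ["a"] * 5 + ["b"] * 3 + ["c"] * 2
print(cumulative_share(tokens, 3))
```

A curve that climbs quickly means a handful of words account for most of the text, which is typical when stopwords are left in.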

In [None]:
# So what can we do with this?

# Hard to compare texts in the provided PG corpus - different genres, time periods, etc.

# Use requests to access the (more or less) complete works of Emily Dickinson and Walt Whitman from Project Gutenberg online

import requests

In [None]:
r = requests.get('https://www.gutenberg.org/files/12242/12242.txt')
emily = r.text
# the poetry/ directory must already exist; the with block closes the file for us
with open('poetry/emily_dickinson_raw.txt', 'w', encoding='latin-1') as e:
    e.write(emily)

In [None]:
r = requests.get('https://www.gutenberg.org/files/1322/1322.txt')
walt = r.text
# the poetry/ directory must already exist; the with block closes the file for us
with open('poetry/walt_whitman_raw.txt', 'w', encoding='latin-1') as w:
    w.write(walt)

In [None]:
len(emily)

In [None]:
len(walt)

In [None]:
# these raw texts are not very nice - right now each one is one long string

emily
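
The notebook does not show how the `_edit.txt` files used below were produced from the raw downloads. One plausible first step, sketched here as an assumption, is trimming the Project Gutenberg license header and footer. The marker wording varies between files, so the patterns, the helper name `strip_boilerplate` and the sample text are all illustrative:

```python
import re

# Marker lines usually start with "***" followed by START OF / END OF (assumed wording).
START = re.compile(r"^\*\*\*\s*START OF.*$", re.MULTILINE)
END = re.compile(r"^\*\*\*\s*END OF.*$", re.MULTILINE)

def strip_boilerplate(raw):
    """Keep only the text between the START and END marker lines, if both are present."""
    start = START.search(raw)
    end = END.search(raw)
    if not start or not end or end.start() < start.end():
        return raw  # markers missing or out of order: leave the text untouched
    return raw[start.end():end.start()].strip()

sample = (
    "License header\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "\n"
    "A poem line.\n"
    "Another line.\n"
    "\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "License footer\n"
)
body = strip_boilerplate(sample)
print(body)
```

Front matter such as title pages, prefaces and tables of contents would still need manual editing, which is presumably why the slices further down start at a non-zero index.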

In [None]:
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

In [None]:
am_poets = PlaintextCorpusReader('poetry/', r'.*_edit\.txt') # match only the edited files, so the raw downloads are excluded

In [None]:
# we now have a corpus and can use the plaintext corpus methods above

am_poets.fileids()

In [None]:
wf = 'walt_whitman_edit.txt'
walt = am_poets.words(wf)

In [None]:
ef = 'emily_dickinson_edit.txt'
emily = am_poets.words(ef)

In [None]:
len(am_poets.raw(ef))

In [None]:
len(am_poets.words(ef))

In [None]:
len(am_poets.sents(ef))

In [None]:
len(am_poets.paras(ef))

In [None]:
len(am_poets.raw(wf))

In [None]:
len(am_poets.words(wf))

In [None]:
len(am_poets.sents(wf))

In [None]:
len(am_poets.paras(wf))

In [None]:
am_poets.raw(ef)

In [None]:
am_poets.raw(wf)

In [None]:
am_poets.paras(wf)[10] # paras are stanzas - 10 is first poem

In [None]:
am_poets.paras(ef)[6] # paras are stanzas - 6 is first poem

In [None]:
am_poets.sents(wf)[10:15] # first sentence is 10

In [None]:
am_poets.sents(ef)[8] # first sentence is 8

In [None]:
# keeping in mind that line breaks in poetry may be more important than sentence length
# we can analyze sentence and para/stanza lengths

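# average sentence length
# first get the index of the last sentence in each file; it is used as the (exclusive)
# end of the slices below, so the final sentence of each file is left out
wlen = len(am_poets.sents(wf)) - 1
elen = len(am_poets.sents(ef)) - 1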

In [None]:
# slice sentence lists to only include poetry

w_sents = am_poets.sents(wf)[10:wlen]
e_sents = am_poets.sents(ef)[8:elen]

In [None]:
w_wc = 0
for s in w_sents:
    w_wc += len(s)
print(w_wc / len(w_sents))

In [None]:
e_wc = 0
for s in e_sents:
    e_wc += len(s)
print(e_wc / len(e_sents))
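
The two loops above do the same thing for each poet, so the calculation can be wrapped in one small function. A sketch using `statistics.mean` (the helper name `mean_length` and the toy sentences are assumptions; lengths count tokens, so punctuation is included, as in the cells above):

```python
from statistics import mean

def mean_length(items):
    """Average len() over a sequence of sequences (sentences, stanzas, ...)."""
    return mean(len(item) for item in items)

sents = [["I", "sing", "."], ["A", "bird", "came", "down", "."]]
print(mean_length(sents))
```

With the notebook's slices this would be `mean_length(w_sents)` and `mean_length(e_sents)`. `statistics.mean` raises an error on empty input, which is a useful signal that a slice went wrong.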

In [None]:
# same calculation for para/stanza length

wplen = len(am_poets.paras(wf)) - 1
eplen = len(am_poets.paras(ef)) - 1

w_paras = am_poets.paras(wf)[10:wplen]
e_paras = am_poets.paras(ef)[6:eplen]

In [None]:
# reuse the token counts from the sentence length analysis
# (approximate: the sentence and stanza slices may not cover exactly the same text)

print(w_wc / len(w_paras))

In [None]:
print(e_wc / len(e_paras))

In [None]:
# fdist = FreqDist(len(w) for w in text1)

# fdist of sentence lengths

ws_fdist = FreqDist(len(s) for s in w_sents)

In [None]:
es_fdist = FreqDist(len(s) for s in e_sents)

In [None]:
print(ws_fdist)

In [None]:
len(w_sents)

In [None]:
len(e_sents)

In [None]:
print(es_fdist)

In [None]:
ws_fdist.most_common(20)

In [None]:
ws_fdist.plot(20)

In [None]:
es_fdist.most_common(20)

In [None]:
es_fdist.plot(20)
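
The two collections differ in size, so raw counts are hard to compare directly. Converting counts to proportions puts both distributions on the same scale. A minimal sketch with made-up sentence lengths (`relative_freq` is a hypothetical helper):

```python
from collections import Counter

def relative_freq(lengths):
    """Turn raw length counts into proportions so corpora of different size compare."""
    counts = Counter(lengths)
    total = sum(counts.values())
    return {length: count / total for length, count in sorted(counts.items())}

a = relative_freq([4, 4, 6, 8])
b = relative_freq([4, 6, 6, 6, 6, 8, 8, 8])
print(a)
print(b)
```

For a single value, `FreqDist.freq()` returns the same kind of proportion, e.g. `ws_fdist.freq(10)`.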

In [None]:
# fdist of stanza lengths (in sentences not words)
# remember paras are lists of lists

# fdist = FreqDist(len(w) for w in text1)

wp_fdist = FreqDist(len(p) for p in w_paras)
ep_fdist = FreqDist(len(p) for p in e_paras)

In [None]:
print(wp_fdist)

In [None]:
len(w_paras)

In [None]:
print(ep_fdist)

In [None]:
print(len(e_paras))

In [None]:
wp_fdist.most_common(20)

In [None]:
ep_fdist.most_common(20)

In [None]:
wp_fdist.plot(20)

In [None]:
ep_fdist.plot()

In [None]:
# fdist of stanzas by number of words?
# fdist = FreqDist(len(w) for w in text1)
# tokens per stanza: sum the sentence lengths within each stanza
wpb = [sum(len(s) for s in p) for p in w_paras]
wwps = FreqDist(wpb)

In [None]:
print(wwps)

In [None]:
wwps.most_common(20)

In [None]:
wwps.plot(20)

In [None]:
# tokens per stanza: sum the sentence lengths within each stanza
epb = [sum(len(s) for s in p) for p in e_paras]
ewps = FreqDist(epb)

In [None]:
print(ewps)

In [None]:
ewps.most_common(20)

In [None]:
ewps.plot(20)
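
Besides the full distributions, a couple of summary numbers per poet make the stanza comparison easier to state. A sketch assuming the same nested structure as `paras()` (stanza -> sentences -> tokens); the helper name `tokens_per_stanza` and the toy data are illustrative only:

```python
from statistics import mean, median

def tokens_per_stanza(paras):
    """Token count of each stanza, where a stanza is a list of tokenized sentences."""
    return [sum(len(sent) for sent in para) for para in paras]

# Two made-up stanzas: one with two sentences, one with three
paras = [
    [["one", "two", "three", "four"], ["five", "six"]],
    [["a", "b", "c", ","], ["d", "e", "f", ","], ["g", "h"]],
]
sizes = tokens_per_stanza(paras)
print(sizes, mean(sizes), median(sizes))
```

Applied to `w_paras` and `e_paras`, the median is worth reporting alongside the mean, since a few very long stanzas can pull the mean up considerably.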