## Getting Voyant Summary Information ##
Here are some steps to get the information from the [Voyant Summary](https://voyant-tools.org/) tool -- total words, unique words, vocabulary density, average words per sentence, and most frequent words -- using Python. For four of these five measures, we can use techniques we looked at in the earlier notebook.

First, download the two text files we used for the Voyant exercise, Treatise on Tolerance and the Nouvelle Heloise, and save them in the same folder/directory from which you'll be running this Notebook (the Desktop will work).

Step one is to open the file, read its contents into a variable, and close it. The key is to make sure you have the right filename: "volt_tolerance_no_notes.txt" is the name of the file in Canvas. If you change this name, make sure you also change it in the open instruction below.

In [1]:
f=open("volt_tolerance_no_notes.txt")      #this opens the file and stores the file object in a variable, f.
tolerance=f.read()      #this uses the "read" method to read the contents of the file into a variable, tolerance.
                        #the contents are stored in this variable as a long string.
f.close()     #it's always advisable to close the file once you've extracted its contents.

Run the cell above. Once you've done so, we can print out the first 1000 characters of the string (you could print the whole thing out if you wanted, but remember, this is a long string -- though much shorter than the string you'll get by reading the Nouvelle Heloise).

In [2]:
print(tolerance[:1000])      #if you don't specify an opening parameter in the ":1000" expression, it's assumed
                            #to be 0.

Whether it is Useful to Maintain People in their Superstition

Such is the feebleness of humanity, such is its perversity, that doubtless it is better for it to be subject to all possible superstitions, as long as they are not murderous, than to live without religion. Man always needs a rein, and even if it might be ridiculous to sacrifice to fauns, or sylvans, or naiads, it is much more reasonable and more useful to venerate these fantastic images of the Divine than to sink into atheism. An atheist who is rational, violent, and powerful, would be as great a pestilence as a blood-mad, superstitious man.

When men do not have healthy notions of the Divinity, false ideas supplant them, just as in bad times one uses counterfeit money when there is no good money. The pagan feared to commit any crime, out of fear of punishment by his false gods; the Malabarian fears to be punished by his pagoda. Wherever there is a settled society, religion is necessary; the laws cover manifest crimes, and 
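
As an alternative to calling `close()` by hand, Python's `with` statement closes the file automatically once the indented block ends. The sketch below is only an illustration: it writes a tiny made-up file called `sample_text.txt` (a hypothetical name, not one of the course files) so that it can run on its own, then reads it back and slices the resulting string.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# "sample_text.txt" is a made-up name, not one of the course files.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "sample_text.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Such is the feebleness of humanity, such is its perversity.")

# Reading: the file is closed automatically when the "with" block ends.
with open(path, encoding="utf-8") as f:
    sample = f.read()

print(f.closed)       # True: no explicit f.close() was needed
print(sample[:4])     # characters 0 to 3
print(sample[5:7])    # characters 5 and 6
```

With the course file, the same pattern would use "volt_tolerance_no_notes.txt" in place of the temporary path.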

Since we're going to "clean" this text -- remove punctuation, put into lower-case, etc... -- we'll store this initial text into a new variable, which we'll come back to.

In [3]:
tolerance_full=tolerance

Now we'll remove punctuation... And numbers... For punctuation, we'll do this in two steps.
1. We'll remove apostrophes and dashes and replace each with a blank space, since these often link two words in French.
2. We'll then remove all other punctuation, as we did in the previous notebook, replacing the punctuation with an empty string.

First, step 1:

In [4]:
apos_dash="'-"
for symbol in apos_dash:
    tolerance=tolerance.replace(symbol," ")
print(tolerance[:500])

Whether it is Useful to Maintain People in their Superstition

Such is the feebleness of humanity, such is its perversity, that doubtless it is better for it to be subject to all possible superstitions, as long as they are not murderous, than to live without religion. Man always needs a rein, and even if it might be ridiculous to sacrifice to fauns, or sylvans, or naiads, it is much more reasonable and more useful to venerate these fantastic images of the Divine than to sink into atheism. An ath


Now we'll remove the rest of the punctuation. This time, we'll use a Python module called "string." Modules offer ready-made functions for many things. Python has many of them, and Anaconda comes with many pre-installed. To access these modules, you import them...

In [5]:
import string
punct=string.punctuation
print("Here is the ready-made punctuation string: " + punct)

Here is the ready-made punctuation string: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Now we'll remove these punctuation marks and this time, replace with an empty string. We'll do the same with numbers right after.

In [6]:
for mark in punct:
    tolerance=tolerance.replace(mark,"")
print(tolerance[:250])

Whether it is Useful to Maintain People in their Superstition

Such is the feebleness of humanity such is its perversity that doubtless it is better for it to be subject to all possible superstitions as long as they are not murderous than to live wit
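
The two punctuation steps can also be done in a single pass. The sketch below is one possible way, not the method used in this Notebook: it uses the built-in `str.maketrans` and `str.translate` on a made-up sample sentence, turning apostrophes and dashes into spaces and deleting every other mark in `string.punctuation`.

```python
import string

# A made-up sample sentence, used only for illustration.
sample = "It's a blood-mad, superstitious man; isn't it?"

# Step 1: map apostrophes and dashes to a space.
# Step 2: delete all remaining punctuation marks.
remaining = "".join(ch for ch in string.punctuation if ch not in "'-")
table = str.maketrans("'-", "  ", remaining)

cleaned = sample.translate(table)
print(cleaned)
```

Keep in mind that `string.punctuation` only contains ASCII marks, so curly apostrophes or French guillemets would have to be added by hand if a text contains them.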


In [None]:
numbers="0123456789"
for number in numbers:
    tolerance=tolerance.replace(number,"")
print(tolerance[:250])
#when you print, see what happens to the date...
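
To see the effect on a date without scrolling through the whole text, here is a small sketch on a made-up sentence (the sentence is an assumption for illustration, not a quotation from the file). It uses `string.digits`, which is the same string as the `numbers` variable above.

```python
import string

# A made-up sentence containing a year.
sample = "the calas affair of 1762 shocked france"

for digit in string.digits:      # string.digits is "0123456789"
    sample = sample.replace(digit, "")

print(sample)            # the year vanishes, leaving two spaces in a row
print(sample.split())    # split() treats the run of spaces as one separator
```

The leftover double space is harmless for counting, because `split()` with no argument splits on any run of whitespace.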

Now convert everything to lowercase, so that uppercase and lowercase forms of the same word are counted together.

In [7]:
tolerance=tolerance.lower()
print(tolerance[:200])

whether it is useful to maintain people in their superstition

such is the feebleness of humanity such is its perversity that doubtless it is better for it to be subject to all possible superstitions 


Next step: tokenize into a list of words...

In [8]:
tol_words=tolerance.split()
print(tol_words[:200])

['whether', 'it', 'is', 'useful', 'to', 'maintain', 'people', 'in', 'their', 'superstition', 'such', 'is', 'the', 'feebleness', 'of', 'humanity', 'such', 'is', 'its', 'perversity', 'that', 'doubtless', 'it', 'is', 'better', 'for', 'it', 'to', 'be', 'subject', 'to', 'all', 'possible', 'superstitions', 'as', 'long', 'as', 'they', 'are', 'not', 'murderous', 'than', 'to', 'live', 'without', 'religion', 'man', 'always', 'needs', 'a', 'rein', 'and', 'even', 'if', 'it', 'might', 'be', 'ridiculous', 'to', 'sacrifice', 'to', 'fauns', 'or', 'sylvans', 'or', 'naiads', 'it', 'is', 'much', 'more', 'reasonable', 'and', 'more', 'useful', 'to', 'venerate', 'these', 'fantastic', 'images', 'of', 'the', 'divine', 'than', 'to', 'sink', 'into', 'atheism', 'an', 'atheist', 'who', 'is', 'rational', 'violent', 'and', 'powerful', 'would', 'be', 'as', 'great', 'a', 'pestilence', 'as', 'a', 'blood', 'mad', 'superstitious', 'man', 'when', 'men', 'do', 'not', 'have', 'healthy', 'notions', 'of', 'the', 'divinity', 

Now let's start counting...
The total number of words:

In [9]:
tol_total_words=len(tol_words)
print(tol_total_words)

1655


It's not quite the same as the Voyant Summary... But it's more or less in the ballpark.
Unique words?

In [10]:
tol_unique_words=set(tol_words)
tol_total_unique_words=len(tol_unique_words)
print(tol_total_unique_words)

672


Vocabulary density? That's # of unique words/# of total words...

In [11]:
vocab_density=tol_total_unique_words/tol_total_words
print(vocab_density)

0.4060422960725076
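
To make the three counts concrete, here is a toy example small enough to check by hand. The sentence and the `toy_` variable names are made up for illustration.

```python
toy_text = "the cat saw the dog and the dog saw the cat"
toy_words = toy_text.split()

toy_total = len(toy_words)            # every token counts
toy_unique = len(set(toy_words))      # a set keeps one copy of each word
toy_density = toy_unique / toy_total

print(toy_total, toy_unique, round(toy_density, 3))
```

Eleven tokens but only five distinct words gives a density of 5/11, or about 0.455. A text that repeats itself a lot has a low density; a text where almost every word is new has a density close to 1.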


Most frequent words? Here we'll create a dictionary. But first, let's do something we didn't do in the first notebook, which is to get rid of stopwords. These are small, common words -- "the," "and," or in French, "et" or "la" -- the frequency of which is not very interesting. 

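The list of stopwords you use will have a big effect on the results you get. The one below is pretty complete, including almost 700 terms. Note that it is a French list: applied to an English text like the excerpt above, it will only remove words that happen to be spelled the same way in both languages (such as "a" or "possible"), so English words like "the" and "of" will stay in. We'll read it into a variable, then tokenize into a list...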

In [12]:
stopwords = "a abord absolument afin ah ai aie aient aies ailleurs ainsi ait allaient allo allons allô alors anterieur anterieure anterieures apres après as assez attendu au aucun aucune aucuns aujourd aujourd'hui aupres auquel aura aurai auraient aurais aurait auras aurez auriez aurions aurons auront aussi autre autrefois autrement autres autrui aux auxquelles auxquels avaient avais avait avant avec avez aviez avions avoir avons ayant ayez ayons b bah bas basee bat beau beaucoup bien bigre bon boum bravo brrr c car ce ceci cela celle celle-ci celle-là celles celles-ci celles-là celui celui-ci celui-là celà cent cependant certain certaine certaines certains certes ces cet cette ceux ceux-ci ceux-là chacun chacune chaque cher chers chez chiche chut chère chères ci cinq cinquantaine cinquante cinquantième cinquième clac clic combien comme comment comparable comparables compris concernant contre couic crac d da dans de debout dedans dehors deja delà depuis dernier derniere derriere derrière des desormais desquelles desquels dessous dessus deux deuxième deuxièmement devant devers devra devrait different differentes differents différent différente différentes différents dire directe directement divers diverse diverses dix dix-huit dix-neuf dix-sept dixième doit doivent donc dont dos douze douzième dring droite du duquel durant dès début désormais e effet egale egalement egales eh elle elle-même elles elles-mêmes en encore enfin entre envers environ es essai est et etant etc etre eu eue eues euh eurent eus eusse eussent eusses eussiez eussions eut eux eux-mêmes exactement excepté extenso exterieur eûmes eût eûtes f fais faisaient faisant fait faites façon feront fi flac floc fois font force furent fus fusse fussent fusses fussiez fussions fut fûmes fût fûtes g gens h ha haut hein hem hep hi ho holà hop hormis hors hou houp hue hui huit huitième hum hurrah hé hélas i ici il ils importe j je jusqu jusque juste k l la laisser laquelle las le lequel les lesquelles lesquels 
leur leurs longtemps lors lorsque lui lui-meme lui-même là lès m ma maint maintenant mais malgre malgré maximale me meme memes merci mes mien mienne miennes miens mille mince mine minimale moi moi-meme moi-même moindres moins mon mot moyennant multiple multiples même mêmes n na naturel naturelle naturelles ne neanmoins necessaire necessairement neuf neuvième ni nombreuses nombreux nommés non nos notamment notre nous nous-mêmes nouveau nouveaux nul néanmoins nôtre nôtres o oh ohé ollé olé on ont onze onzième ore ou ouf ouias oust ouste outre ouvert ouverte ouverts o| où p paf pan par parce parfois parle parlent parler parmi parole parseme partant particulier particulière particulièrement pas passé pendant pense permet personne personnes peu peut peuvent peux pff pfft pfut pif pire pièce plein plouf plupart plus plusieurs plutôt possessif possessifs possible possibles pouah pour pourquoi pourrais pourrait pouvait prealable precisement premier première premièrement pres probable probante procedant proche près psitt pu puis puisque pur pure q qu quand quant quant-à-soi quanta quarante quatorze quatre quatre-vingt quatrième quatrièmement que quel quelconque quelle quelles quelqu'un quelque quelques quels qui quiconque quinze quoi quoique r rare rarement rares relative relativement remarquable rend rendre restant reste restent restrictif retour revoici revoilà rien s sa sacrebleu sait sans sapristi sauf se sein seize selon semblable semblaient semble semblent sent sept septième sera serai seraient serais serait seras serez seriez serions serons seront ses seul seule seulement si sien sienne siennes siens sinon six sixième soi soi-même soient sois soit soixante sommes son sont sous souvent soyez soyons specifique specifiques speculatif stop strictement subtiles suffisant suffisante suffit suis suit suivant suivante suivantes suivants suivre sujet superpose sur surtout t ta tac tandis tant tardive te tel telle tellement telles tels tenant tend tenir tente tes tic tien 
tienne tiennes tiens toc toi toi-même ton touchant toujours tous tout toute toutefois toutes treize trente tres trois troisième troisièmement trop très tsoin tsouin tu té u un une unes uniformement unique uniques uns v va vais valeur vas vers via vif vifs vingt vivat vive vives vlan voici voie voient voilà vont vos votre vous vous-mêmes vu vé vôtre vôtres w x y z zut à â ça ès étaient étais était étant état étiez étions été étée étées étés êtes être ô"
stopwords_list=stopwords.split()
print(stopwords_list[:100])

['a', 'abord', 'absolument', 'afin', 'ah', 'ai', 'aie', 'aient', 'aies', 'ailleurs', 'ainsi', 'ait', 'allaient', 'allo', 'allons', 'allô', 'alors', 'anterieur', 'anterieure', 'anterieures', 'apres', 'après', 'as', 'assez', 'attendu', 'au', 'aucun', 'aucune', 'aucuns', 'aujourd', "aujourd'hui", 'aupres', 'auquel', 'aura', 'aurai', 'auraient', 'aurais', 'aurait', 'auras', 'aurez', 'auriez', 'aurions', 'aurons', 'auront', 'aussi', 'autre', 'autrefois', 'autrement', 'autres', 'autrui', 'aux', 'auxquelles', 'auxquels', 'avaient', 'avais', 'avait', 'avant', 'avec', 'avez', 'aviez', 'avions', 'avoir', 'avons', 'ayant', 'ayez', 'ayons', 'b', 'bah', 'bas', 'basee', 'bat', 'beau', 'beaucoup', 'bien', 'bigre', 'bon', 'boum', 'bravo', 'brrr', 'c', 'car', 'ce', 'ceci', 'cela', 'celle', 'celle-ci', 'celle-là', 'celles', 'celles-ci', 'celles-là', 'celui', 'celui-ci', 'celui-là', 'celà', 'cent', 'cependant', 'certain', 'certaine', 'certaines', 'certains']


To remove these words from our text, we'll build a new list (so that we can still come back to the first list, tol_words, if we want to). We'll cycle through each word in our text and check whether it is in the stopword list; if it isn't, we'll add it to the new list...

(Don't remove words with .remove() inside a for-loop over that same list: each removal shifts the remaining items one place to the left, so the loop skips the word right after it and some stopwords get left behind. Note too that writing tol_words_stopfree = tol_words would not make a copy; both names would point to the same list.)

In [13]:
tol_words_stopfree = []      #start with an empty list; tol_words itself is left unchanged
for word in tol_words:
    if word not in stopwords_list:
        tol_words_stopfree.append(word)      #keep only the words that are not stopwords
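
A small sketch of why removing items from a list while looping over it can leave stopwords behind, and of a filter that checks every word exactly once. The word lists here are made up for illustration.

```python
stop = ["le", "la", "de"]

# Removing while looping: after each removal the remaining items shift left,
# so the item that slides into the freed slot is never checked.
mutated = ["le", "la", "roi", "de", "de", "france"]
for word in mutated:
    if word in stop:
        mutated.remove(word)
print(mutated)     # "la" and one "de" survive

# Building a new list checks every word exactly once.
original = ["le", "la", "roi", "de", "de", "france"]
filtered = [word for word in original if word not in stop]
print(filtered)
```

The last two lines use a list comprehension, which is a compact way of writing a loop that appends to a new list.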


Now, let's make the dictionary of word frequencies, as in the previous Notebook... (Get ready to scroll down. The dictionary will be long, and since it's unordered, it's hard to limit how much of it to print. Alternatively, you can clear the output with cell->current outputs->clear. Or else simply remove the "print" line -- or put a # in front of it -- and rerun the cell).

In [14]:
word_frequencies = {}

for word in tol_words_stopfree:
    if word in word_frequencies:
        word_frequencies[word] += 1
    else:
        word_frequencies[word] = 1

print(word_frequencies)

{'whether': 1, 'it': 25, 'is': 27, 'useful': 2, 'to': 60, 'maintain': 1, 'people': 5, 'in': 21, 'their': 10, 'superstition': 4, 'such': 3, 'the': 125, 'feebleness': 1, 'of': 57, 'humanity': 1, 'its': 4, 'perversity': 1, 'that': 45, 'doubtless': 1, 'better': 1, 'for': 11, 'be': 18, 'subject': 1, 'all': 11, 'superstitions': 5, 'long': 3, 'they': 17, 'are': 14, 'not': 22, 'murderous': 1, 'than': 4, 'live': 1, 'without': 4, 'religion': 7, 'man': 3, 'always': 1, 'needs': 1, 'rein': 1, 'and': 43, 'even': 3, 'if': 7, 'might': 4, 'ridiculous': 1, 'sacrifice': 1, 'fauns': 1, 'or': 14, 'sylvans': 1, 'naiads': 1, 'much': 4, 'more': 5, 'reasonable': 3, 'venerate': 1, 'these': 12, 'fantastic': 1, 'images': 1, 'divine': 1, 'sink': 1, 'into': 5, 'atheism': 1, 'an': 1, 'atheist': 1, 'who': 5, 'rational': 1, 'violent': 1, 'powerful': 1, 'would': 18, 'great': 3, 'pestilence': 1, 'blood': 1, 'mad': 1, 'superstitious': 1, 'when': 6, 'men': 4, 'do': 12, 'have': 3, 'healthy': 1, 'notions': 1, 'divinity': 1,
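
Python's standard library also has a ready-made tool for this kind of tally, `collections.Counter`. The sketch below, on a made-up list of words, suggests that it produces the same counts as the loop; it is offered as an alternative, not as a replacement for the cell above.

```python
from collections import Counter

toy_words = ["religion", "superstition", "religion", "people", "religion", "people"]

# The loop-based tally, as in the cell above
by_hand = {}
for word in toy_words:
    if word in by_hand:
        by_hand[word] += 1
    else:
        by_hand[word] = 1

# The same tally in one line
counted = Counter(toy_words)

print(counted)
print(dict(counted) == by_hand)
```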

Next, we'll sort by frequency values, as we did in the previous Notebook. Interestingly, though the word counts differed from Voyant, the first five items in this list are an exact match.

In [22]:
sorted_word_frequencies = sorted(word_frequencies, key=lambda x: word_frequencies[x], reverse=True)

for word in sorted_word_frequencies:
    print(word + ": ", word_frequencies[word])

the:  125
to:  60
of:  57
that:  45
and:  43
is:  27
it:  25
not:  22
in:  21
be:  18
would:  18
they:  17
are:  14
or:  14
you:  13
these:  12
do:  12
but:  12
for:  11
all:  11
their:  10
his:  10
we:  9
by:  8
was:  8
this:  8
religion:  7
if:  7
one:  7
us:  7
saint:  7
will:  7
when:  6
them:  6
only:  6
could:  6
which:  6
your:  6
people:  5
superstitions:  5
more:  5
into:  5
who:  5
there:  5
out:  5
holy:  5
god:  5
believe:  5
at:  5
then:  5
he:  5
same:  5
brother:  5
my:  5
superstition:  4
its:  4
than:  4
without:  4
might:  4
much:  4
men:  4
false:  4
good:  4
very:  4
our:  4
were:  4
from:  4
other:  4
has:  4
such:  3
long:  3
man:  3
even:  3
reasonable:  3
great:  3
have:  3
no:  3
any:  3
should:  3
what:  3
two:  3
world:  3
time:  3
lords:  3
jesus:  3
know:  3
navel:  3
him:  3
however:  3
so:  3
with:  3
king:  3
day:  3
france:  3
each:  3
masters:  3
does:  3
too:  3
speak:  3
inquisitor:  3
useful:  2
ideas:  2
times:  2
money:  2
punishment:  2
crimes:  
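
To avoid printing the whole list, you can slice the sorted result before looping. The sketch below uses a made-up dictionary and a hypothetical cut-off called `top_n`; it sorts the (word, count) pairs rather than the words alone, which is a slight variation on the cell above.

```python
toy_frequencies = {"whether": 1, "the": 5, "religion": 3, "to": 4, "saint": 3}

# Sort the (word, count) pairs by count, largest first.
ranked = sorted(toy_frequencies.items(), key=lambda pair: pair[1], reverse=True)

top_n = 3    # hypothetical cut-off: how many words to show
for word, count in ranked[:top_n]:
    print(word + ":", count)
```

Words with the same count ("religion" and "saint" here) stay in the order in which they were added to the dictionary, because Python's sort is stable.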

Finally, to get average sentence length, we need to import a function from the Natural Language Toolkit module, also known simply as nltk. This is a very popular module, with a lot of techniques for Natural Language Processing. Importing it can be a little slow. And if you get an error below, it may be because you need to download some data, in which case you might need to run this command:

nltk.download('punkt')

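If so, add it to the top of the cell and rerun. (Newer versions of nltk may ask for a resource called 'punkt_tab' instead; the error message names the resource to download.)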

In [16]:
import nltk
from nltk.tokenize import sent_tokenize

The function *sent_tokenize*, as you might guess, will tokenize a string at the sentence level... To do this with our text, we need to go back to our original text, which still includes the punctuation. We saved this in the variable tolerance_full.

Run the cell below, and you'll see a list, but now of sentences rather than words.

In [18]:
nltk.download('punkt')
tol_sentences = sent_tokenize(tolerance_full)
print(tol_sentences[:100])

[nltk_data] Downloading package punkt to /home/nacl/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['Whether it is Useful to Maintain People in their Superstition\n\nSuch is the feebleness of humanity, such is its perversity, that doubtless it is better for it to be subject to all possible superstitions, as long as they are not murderous, than to live without religion.', 'Man always needs a rein, and even if it might be ridiculous to sacrifice to fauns, or sylvans, or naiads, it is much more reasonable and more useful to venerate these fantastic images of the Divine than to sink into atheism.', 'An atheist who is rational, violent, and powerful, would be as great a pestilence as a blood-mad, superstitious man.', 'When men do not have healthy notions of the Divinity, false ideas supplant them, just as in bad times one uses counterfeit money when there is no good money.', 'The pagan feared to commit any crime, out of fear of punishment by his false gods; the Malabarian fears to be punished by his pagoda.', 'Wherever there is a settled society, religion is necessary; the laws cover man

We can get the total number of sentences easily enough:

In [19]:
tol_total_sentences=len(tol_sentences)
print(tol_total_sentences)

60


And from here, going back to our total words variable, we can get the final piece of the puzzle: average sentence length...

In [20]:
ave_sentence_length = tol_total_words/tol_total_sentences
print(ave_sentence_length)

27.583333333333332
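
The arithmetic can be checked on a toy text. The sketch below does not use nltk: it stands in a deliberately naive splitter (a regular expression that breaks after ".", "?" or "!" followed by whitespace), which is an assumption made so the example runs on its own and is not how *sent_tokenize* works.

```python
import re

toy_text = "Man always needs a rein. Is it useful? Religion is necessary!"

# Naive split: break after ., ? or ! when followed by whitespace.
toy_sentences = re.split(r"(?<=[.?!])\s+", toy_text.strip())

toy_word_count = len(toy_text.split())
toy_average = toy_word_count / len(toy_sentences)

print(toy_sentences)
print(round(toy_average, 1))
```

Eleven words over three sentences gives about 3.7 words per sentence. A splitter this simple would miscount on abbreviations such as "etc." in the middle of a sentence, which is one reason to prefer a trained tokenizer for real texts.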


Putting all the pieces together:

In [23]:
print("Total words:", tol_total_words)
print("Total unique words:", tol_total_unique_words)
print("Vocabulary density:", vocab_density)
print("Average words per sentence:", ave_sentence_length)
print("Most frequent words: ")
for word in sorted_word_frequencies[:10]:
    print(word + ": ", word_frequencies[word])

Total words: 1655
Total unique words: 672
Vocabulary density: 0.4060422960725076
Average words per sentence: 27.583333333333332
Most frequent words: 
the:  125
to:  60
of:  57
that:  45
and:  43
is:  27
it:  25
not:  22
in:  21
be:  18
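
Since the same steps will be needed for the Nouvelle Heloise, it may help to wrap them in a function. The sketch below is one possible arrangement, not a definitive version: the function name `summarize`, its parameters, and the naive regular-expression sentence splitter (used here instead of nltk so the example is self-contained) are all assumptions.

```python
import re
import string
from collections import Counter


def summarize(text, stopwords=(), top_n=5):
    """Return Voyant-style summary figures for a string (illustrative only)."""
    # Sentences are counted on the raw text, before punctuation is stripped.
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

    # Apostrophes and dashes become spaces; other punctuation and digits are deleted.
    to_delete = "".join(c for c in string.punctuation if c not in "'-") + string.digits
    cleaned = text.translate(str.maketrans("'-", "  ", to_delete)).lower()
    words = cleaned.split()

    # Stopwords are left out of the frequency list only, not the word totals.
    stop = set(stopwords)
    content_words = [w for w in words if w not in stop]

    return {
        "total_words": len(words),
        "unique_words": len(set(words)),
        "vocab_density": len(set(words)) / len(words) if words else 0.0,
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "most_frequent": Counter(content_words).most_common(top_n),
    }


# A made-up text, small enough to verify by hand.
summary = summarize("The cat sat. The dog sat, too! The end.", stopwords=["the"], top_n=2)
print(summary)
```

To use it on one of the course files, you would pass in the string read from the file (for example `tolerance_full`) and the stopword list.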
