Hopefully by this week you're comfortable with lists (e.g. `[a, b, c, ...]`) and are starting to understand list comprehensions.

There are two fundamental Python tools to be aware of. First, there's a general-purpose built-in function called `len()` that returns the length of an object, such as how many items are in a list or how many characters are in a string. For example:

In [1]:
l = ['hello', 'text', 'mining']
len(l)

3

In [2]:
len("Text Mining")

11

Second, there is a data type called a `set`, which is like a list but has no ordering and allows only unique elements. This is useful for us because it gives a quick way to see just the unique words in a list: the vocabulary.

In [3]:
l = ['Buffalo', 'buffalo', 'Buffalo', 'buffalo', 'buffalo', 'Buffalo', 'buffalo', 'buffalo']
s = set(l)
print("List:", l)
print("Set:", s)

List: ['Buffalo', 'buffalo', 'Buffalo', 'buffalo', 'buffalo', 'Buffalo', 'buffalo', 'buffalo']
Set: {'buffalo', 'Buffalo'}
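
Combining the two ideas gives a vocabulary size in one step. Here is a small sketch on the toy list above (the names `vocabulary` and `folded_vocabulary` are only for illustration), which also shows why upper and lower case matter when counting unique words:

```python
# Toy list reused from the cell above
l = ['Buffalo', 'buffalo', 'Buffalo', 'buffalo', 'buffalo', 'Buffalo', 'buffalo', 'buffalo']

# Vocabulary: the unique items in the list
vocabulary = set(l)
print(len(l), "tokens,", len(vocabulary), "unique")

# Lowercasing first merges 'Buffalo' and 'buffalo' into a single entry
folded_vocabulary = set(word.lower() for word in l)
print(len(folded_vocabulary), "unique after lowercasing")
```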


# Normalization

This week, follow along with [Searching for Meaning](https://github.com/sgsinclair/alta/blob/41f389f3d9708573c44c883bcd95fd16bad54a24/ipynb/SearchingMeaning.ipynb) from the Art of Literary Text Analysis.

Use the trimmed version of Frankenstein from last week to try some of the concepts in the chapter. This should get you up to speed.

In [24]:
import nltk
with open('../data/frankenstein.txt') as f:
    frankensteinString = f.read()
frankensteinTokens = nltk.word_tokenize(frankensteinString)
cleanedTokens = [word.lower() for word in frankensteinTokens if word[0].isalpha()]
print(len(frankensteinTokens))

85440


Note that there are 85440 tokens in the text. If we count just the _unique_ words (the _vocabulary size_) among the cleaned tokens, we find 7038:

In [5]:
len(set(cleanedTokens))

7038
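
To see what the cleaning comprehension above keeps and what it drops, here is a small sketch on a hand-written token list. The tokens are made up to resemble tokenizer output (no NLTK call is made), so treat this as an illustration rather than the actual Frankenstein data:

```python
# Hypothetical tokens, roughly what a word tokenizer might return
tokens = ['It', 'was', 'on', 'a', 'dreary', 'night', 'of', 'November', ',',
          'that', 'I', 'beheld', 'it', '.', '1818', "'s"]

# Keep tokens whose first character is a letter, and lowercase them
cleaned = [word.lower() for word in tokens if word[0].isalpha()]

print(cleaned)
print(len(tokens), "tokens ->", len(cleaned), "cleaned ->", len(set(cleaned)), "unique")
```

Punctuation, numbers, and fragments that start with an apostrophe are filtered out, and `It` and `it` collapse into one vocabulary entry after lowercasing.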

**Questions**

- 1) Create a lemmatized version of cleanedTokens and count the unique lemmas. Share the code to do this: the answer that it gives you should be 6417.

In [6]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

frankensteinLemmas = [wnl.lemmatize(word) for word in cleanedTokens]
len(set(frankensteinLemmas))

6416

- 2) Re-do the lemmatization after stopping words against the default NLTK stoplist, and tabulate the top ten words. Paste the code and output.

In [7]:
stopwords = nltk.corpus.stopwords.words('english')
cleanedTokensNoStop = [word for word in cleanedTokens if word not in stopwords]

frankensteinLemmasNoStop = [wnl.lemmatize(word) for word in cleanedTokensNoStop]
frankenLemmasNoStopFreq = nltk.FreqDist(frankensteinLemmasNoStop)
frankenLemmasNoStopFreq.tabulate(15)

    one   could   would     yet     man  father     day    upon  friend    life     eye thought    time   every   might 
    207     197     183     152     136     131     129     126     125     124     122     112     109     109     108 


- 3) How does the tabulation of lemmas differ from the tabulation of the non-lemmatized (but still stopped and case-folded) tokens?

In [8]:
cleanedTokensNoStopFreq = nltk.FreqDist(cleanedTokensNoStop)
cleanedTokensNoStopFreq.tabulate(15)

   one  could  would    yet    man father   upon   life  every  might  first  shall   eyes   said    may 
   206    197    183    152    136    131    126    115    109    108    108    105    104    102     98 
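
One way to think about Q3: lemmatizing merges the counts of inflected forms, so a word such as `eye` can climb the table once `eyes` is folded into it. Below is a toy sketch with `collections.Counter`. The token list and the `toy_lemmas` lookup are invented for illustration; they stand in for the real tokens and for `WordNetLemmatizer`:

```python
from collections import Counter

# Made-up tokens, already lowercased and stopped
tokens = ['eyes', 'eye', 'eyes', 'day', 'days', 'father', 'father', 'father']

# Hypothetical stand-in for a lemmatizer: plural -> singular
toy_lemmas = {'eyes': 'eye', 'days': 'day'}
lemmas = [toy_lemmas.get(word, word) for word in tokens]

# Count the raw tokens and the lemmas separately
token_freq = Counter(tokens)
lemma_freq = Counter(lemmas)

print(token_freq.most_common(3))
print(lemma_freq.most_common(3))
```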


- 4) What are the WordNet synsets for 'monster'?

In [9]:
from nltk.corpus import wordnet as wn
wn.synsets('monster')

[Synset('monster.n.01'),
 Synset('giant.n.05'),
 Synset('freak.n.01'),
 Synset('monster.n.04'),
 Synset('monster.n.05')]

- 5) A synset has a method called `definition()`. Noting that the code for Q4 resulted in a list, write a list comprehension to extract the definition of each synset. Share the code and output.

In [10]:
# Q4 returned a list of synsets, so loop over it and call definition() on each one
monsterDefs = [synset.definition() for synset in wn.synsets('monster')]
print(monsterDefs)

['an imaginary creature usually having various human and animal parts', 'someone or something that is abnormally large and powerful', 'a person or animal that is markedly unusual or deformed', 'a cruel wicked and inhuman person', '(medicine) a grossly malformed and usually nonviable fetus']


- 6) Each synset is a child of a more general synset. For example, `crab` is an example of a `decapod_crustacean`, which is more generally a `crustacean`, and so on. You can get at the paths to the root of this tree with `hypernym_paths()`. Paste the code and hypernym path for `freak.n.01`.

In [11]:
freakSynset = wn.synset('freak.n.01')
freakSynset.hypernym_paths()

[[Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('living_thing.n.01'),
  Synset('organism.n.01'),
  Synset('mutant.n.01'),
  Synset('freak.n.01')]]
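
`hypernym_paths()` returns a list of lists because a synset may have more than one parent, and each route to the root is reported separately. Here is a rough sketch of that idea on a hand-built hierarchy. The `parents` dictionary and the `paths_to_root` helper are hypothetical and are not how NLTK stores WordNet:

```python
# Hypothetical child -> parents mapping (a node may have several parents)
parents = {
    'entity': [],
    'organism': ['entity'],
    'causal_agent': ['entity'],
    'person': ['organism', 'causal_agent'],
}

def paths_to_root(node):
    """Return every path from a root down to `node`, root first."""
    if not parents[node]:
        return [[node]]
    paths = []
    for parent in parents[node]:
        for path in paths_to_root(parent):
            paths.append(path + [node])
    return paths

for path in paths_to_root('person'):
    print(' > '.join(path))
```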

- 7) (for 2 points): We've already seen some corpora that NLTK can pull in, from the complex WordNet information to a basic stoplist. Using the NLTK information on male/female names, determine and paste in the unique female names in Frankenstein. This isn't in the ALTA book, but searching Google sometimes helps ;)

In [12]:
cleanedTokens2 = [word for word in frankensteinTokens if word[0].isalpha()]
names = nltk.corpus.names
male_names = names.words('male.txt')
female_names = names.words('female.txt')
namelist = male_names + female_names
print('There are ', len(namelist), ' total names in this corpus')
print('There are ', len(male_names), ' male names in this corpus')
print('There are ', len(female_names), ' female names in this corpus')

# frankenSteinNames = [word for word in cleanedTokens2 if word in namelist]
# print('There are ', len(frankenSteinNames), ' many names in Frankenstein')

frankenSteinFemaleNames = [word for word in cleanedTokens2 if word in female_names]
print(set(frankenSteinFemaleNames))

There are  7944  total names in this corpus
There are  2943  male names in this corpus
There are  5001  female names in this corpus
{'France', 'Angelica', 'Asia', 'Geneva', 'Justine', 'Lacey', 'Christian', 'Louisa', 'Angel', 'La', 'Abbey', 'Manon', 'Daniel', 'Isis', 'Agatha', 'Margaret', 'Happy', 'Harmony', 'Caroline', 'May', 'June', 'Eve', 'Elizabeth'}
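
Membership tests against a long list are slow, so a set intersection is a possible alternative for finding the unique names that occur in both collections. The sketch below uses short made-up lists (`tokens` and `female_names` here are placeholders, not the NLTK corpus) so that it runs without any downloads:

```python
# Placeholder data standing in for the Frankenstein tokens and the NLTK name list
tokens = ['Elizabeth', 'and', 'Justine', 'left', 'Geneva', 'with', 'Elizabeth']
female_names = ['Elizabeth', 'Geneva', 'Justine', 'Margaret']

# The intersection keeps only the items present in both sets
found = set(tokens) & set(female_names)
print(sorted(found))
```

Either approach only matches strings, so place names and ordinary words that also appear in the name list (such as 'France' or 'May' in the output above) are still included.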


Our copy of Frankenstein is from Project Gutenberg, a collection of transcriptions of public domain (i.e. legally shareable) books. NLTK offers a small selection of those books through `nltk.corpus.gutenberg`.

Load the Gutenberg corpus and convert it to what NLTK calls a `TextCollection`:

In [35]:
from nltk.text import TextCollection
gutenberg_docs = nltk.corpus.gutenberg
gutenberg_collection = TextCollection(gutenberg_docs)

print("TF-IDF for 'monster': ", gutenberg_collection.tf_idf('monster', frankensteinString))
print("TF-IDF for 'miserable': ", gutenberg_collection.tf_idf('miserable', frankensteinString))
print("TF-IDF for 'horror': ", gutenberg_collection.tf_idf('horror', frankensteinString))

TF-IDF for 'monster':  8.60294946704625e-05
TF-IDF for 'miserable':  6.06154057055602e-05
TF-IDF for 'horror':  3.938280089547152e-05
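
To make these numbers less opaque, here is a minimal sketch of one common TF-IDF definition: term frequency (count divided by document length) times the natural log of (number of documents / number of documents containing the term). My understanding is that `TextCollection.tf_idf` uses this form, but that is an assumption worth checking against the NLTK source. The `toy_docs` data and the `tf_idf` helper are invented for illustration:

```python
import math

# Invented three-document collection, each document a list of tokens
toy_docs = [
    ['the', 'monster', 'fled', 'the', 'monster', 'wept'],
    ['the', 'whale', 'swam'],
    ['the', 'horror', 'of', 'the', 'whale'],
]

def tf_idf(term, doc, docs):
    """Term frequency in `doc` times log inverse document frequency in `docs`."""
    tf = doc.count(term) / len(doc)
    matches = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / matches) if matches else 0.0
    return tf * idf

# 'monster' appears in 1 of 3 documents, so it gets a positive score
print(tf_idf('monster', toy_docs[0], toy_docs))
# 'the' appears in every document, so its idf is log(1) = 0
print(tf_idf('the', toy_docs[0], toy_docs))
```

Keep in mind that on a Python string, `count` counts substring occurrences and `len` counts characters, so a score computed from a raw string is not directly comparable to one computed from a list of tokens.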


Using a method of `gutenberg_collection` (remember auto-complete!), answer the final two-part question.

**Questions**

- 8) (for 2 points):
  - What is the TF-IDF for 'monster' in Frankenstein? You'll need the original string.
  - Which of the following words has the highest TF-IDF: 'miserable', 'horror', 'monster'? If you need to, you can compare numbers in Python with `>` (greater than) or `<` (less than).