# Analyzing Text from Moby Dick

Data Source: "Moby Dick" by Herman Melville | 'moby.txt' file in project files

This project uses text mining techniques to draw insights from the text of "Moby Dick". Its purpose is to demonstrate basic text analysis that can be applied to other texts and text formats.

In [1]:
import nltk
#nltk.download()
import pandas as pd
import numpy as np
from nltk.probability import FreqDist

In [2]:
with open('moby.txt', 'r') as f:
    moby_raw = f.read()

In [5]:
moby_raw[0:1000] # raw text imported 

'[Moby Dick by Herman Melville 1851]\n\n\nETYMOLOGY.\n\n(Supplied by a Late Consumptive Usher to a Grammar School)\n\nThe pale Usher--threadbare in coat, heart, body, and brain; I see him\nnow.  He was ever dusting his old lexicons and grammars, with a queer\nhandkerchief, mockingly embellished with all the gay flags of all the\nknown nations of the world.  He loved to dust his old grammars; it\nsomehow mildly reminded him of his mortality.\n\n"While you take in hand to school others, and to teach them by what\nname a whale-fish is to be called in our tongue leaving out, through\nignorance, the letter H, which almost alone maketh the signification\nof the word, you deliver that which is not true." --HACKLUYT\n\n"WHALE. ... Sw. and Dan. HVAL.  This animal is named from roundness\nor rolling; for in Dan. HVALT is arched or vaulted." --WEBSTER\'S\nDICTIONARY\n\n"WHALE. ... It is more immediately from the Dut. and Ger. WALLEN;\nA.S. WALW-IAN, to roll, to wallow." --RICHARDSON\'S DICTIONARY

In [5]:
moby_tokens = nltk.word_tokenize(moby_raw) # tokenize words to clean and prepare text for analysis
text1 = nltk.Text(moby_tokens)
text1

<Text: Moby Dick by Herman Melville 1851>

How many tokens (words and punctuation symbols) are in the text?

In [6]:
len(text1)

255018

How many unique tokens (unique words and punctuation) does the text have?

In [7]:
from nltk.stem import WordNetLemmatizer
print('{} unique tokens'.format(len(set(text1))))
def lem():

    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w,'v') for w in text1]

    return len(set(lemmatized))

print('{} unique tokens after lemmatizing verbs'.format(lem())) # grouping verb forms that share a root (e.g. "running" & "runs")

20754 unique tokens
16899 unique tokens after lemmatizing verbs


What is the lexical diversity of the given text input? (i.e. ratio of unique tokens to the total number of tokens)

In [8]:
len(set(text1)) / len(text1) # text1 already holds the tokens, so there is no need to tokenize again

0.08138249064771899

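This tells us that unique tokens make up only about 8% of all tokens: on average each unique token appears about 12 times, so roughly 92% of token occurrences repeat a token that has already appeared earlier in the text.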
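To make the ratio concrete, here is a minimal sketch on a short made-up sentence. It assumes a plain whitespace split in place of `nltk.word_tokenize` so that it runs without NLTK; `lexical_diversity` and `sample` are illustrative names, not part of the notebook.

```python
def lexical_diversity(tokens):
    """Return the ratio of unique tokens to total tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Made-up sample; whitespace splitting stands in for a real tokenizer
sample = "the whale and the ship and the sea".split()

ratio = lexical_diversity(sample)   # 5 unique tokens out of 8
print(ratio)                        # share of tokens that are unique
print(1 / ratio)                    # average occurrences per unique token
```

The reciprocal of the ratio is the average number of times each unique token occurs, which is often easier to interpret than the ratio itself.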

What percentage of tokens is 'whale', in any capitalization (e.g. 'Whale')?

In [9]:
def whale_count():
    # lowercase every token so 'whale', 'Whale' and 'WHALE' are counted together
    lowered = [x.lower() for x in text1]
    whale_c = lowered.count('whale')
    return whale_c / len(text1) * 100  # text1 already holds the tokenized text
print(str(round(whale_count(), 2)) + '%')

0.43%
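The effect of lowercasing before counting can be seen on a tiny hand-made token list. This is only a sketch: `token_share` and `sample` are hypothetical names introduced here, and the tokens are invented rather than taken from the book.

```python
def token_share(tokens, target):
    """Percentage of tokens equal to `target`, ignoring case."""
    if not tokens:
        return 0.0
    target = target.lower()
    hits = sum(1 for t in tokens if t.lower() == target)
    return 100 * hits / len(tokens)

# Invented tokens, including punctuation, as a tokenizer would produce them
sample = ["The", "Whale", ",", "the", "whale", "!", "A", "WHALE", "ship", "."]

print(token_share(sample, "whale"))      # counts 'Whale', 'whale' and 'WHALE'
print(100 * sample.count("whale") / len(sample))  # case-sensitive count for comparison
```

Because punctuation marks are tokens too, they are part of the denominator, which lowers the percentage compared with a count over words only.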


What are the 20 most frequently occurring (unique) tokens in the text? 

In [10]:
def top_20():
    # FreqDist counts every token in a single pass; most_common returns (token, count) pairs
    return FreqDist(text1).most_common(20)
top_20()

[(',', 19204),
 ('the', 13715),
 ('.', 7308),
 ('of', 6513),
 ('and', 6010),
 ('a', 4545),
 ('to', 4515),
 (';', 4173),
 ('in', 3908),
 ('that', 2978),
 ('his', 2459),
 ('it', 2196),
 ('I', 2111),
 ('!', 1767),
 ('is', 1722),
 ('--', 1713),
 ('with', 1659),
 ('he', 1658),
 ('was', 1639),
 ('as', 1620)]

Which tokens are longer than 5 characters and occur more than 150 times?

In [11]:
def filter_5_150():
    t = FreqDist(text1)
    f = []
    for x in t:
        if t[x] > 150 and len(x) > 5:
            f.append(x)
    return sorted(f)
filter_5_150()

['Captain',
 'Pequod',
 'Queequeg',
 'Starbuck',
 'almost',
 'before',
 'himself',
 'little',
 'seemed',
 'should',
 'though',
 'through',
 'whales',
 'without']

What is the longest word in the text, and how many characters does it have?

In [14]:
def longest():
    t = list(FreqDist(text1))
    sw = sorted(t, key=len, reverse=True)
    return sw[0] , len(sw[0])
word, length = longest()
print('The longest word is {} and has {} characters'.format(word, length))

The longest word is twelve-o'clock-at-night and has 23 characters


Which words (excluding numbers and punctuation) occur more than 2000 times, and how often does each occur?

In [15]:
def top_words():
    t = FreqDist(text1)
    f = []
    for x in t:
        if t[x] > 2000 and x.isalpha():
            f.append((t[x], x))
    f.sort(key=lambda x:x[0], reverse=True)
    return f
top_words()

[(13715, 'the'),
 (6513, 'of'),
 (6010, 'and'),
 (4545, 'a'),
 (4515, 'to'),
 (3908, 'in'),
 (2978, 'that'),
 (2459, 'his'),
 (2196, 'it'),
 (2111, 'I')]

What is the average number of tokens per sentence?

In [16]:
def avg_tokens():
    sentences = nltk.sent_tokenize(moby_raw)
    l = []
    for s in sentences:
        c = len(nltk.word_tokenize(s))
        l.append(c)
    return np.array(l).mean()
round(avg_tokens())

26.0
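The same calculation can be sketched without NLTK to show what is being averaged. The regular expressions below are naive stand-ins for `nltk.sent_tokenize` and `nltk.word_tokenize` (an assumption made to keep the example self-contained), and the sample text is invented.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Naive tokenizer: words (allowing inner hyphens/apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

sample = "The whale surfaced. The crew shouted, and the boats were lowered! Who saw it first?"

counts = [len(tokenize(s)) for s in split_sentences(sample)]
average = sum(counts) / len(counts)   # total tokens divided by number of sentences
print(counts, round(average, 2))
```

The mean of the per-sentence counts equals the total token count divided by the number of sentences, so the result depends as much on how sentences are split as on how words are tokenized.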

What are the 5 most frequent parts of speech in this text? What is their frequency?

A legend for the part-of-speech tags can be found [here](https://www.guru99.com/pos-tagging-chunking-nltk.html).

In [17]:
def pos():
    tagged = nltk.pos_tag(text1)  # list of (token, tag) pairs
    tags = [tag for _, tag in tagged]
    # most_common returns (tag, count) pairs sorted by descending count
    return FreqDist(tags).most_common(5)
pos()

[('NN', 32730), ('IN', 28658), ('DT', 25870), (',', 19204), ('JJ', 17619)]
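For reading the result: in the Penn Treebank tag set used by `nltk.pos_tag`, `NN` is a singular noun, `IN` a preposition or subordinating conjunction, `DT` a determiner and `JJ` an adjective, while punctuation such as `,` is tagged as itself. The tallying step can be sketched on a small hand-tagged list; the pairs below are written by hand for illustration and are not tagger output.

```python
from collections import Counter

# Hand-written (token, tag) pairs in the same shape nltk.pos_tag returns
tagged = [
    ("The", "DT"), ("white", "JJ"), ("whale", "NN"), ("struck", "VBD"),
    ("the", "DT"), ("ship", "NN"), ("at", "IN"), ("night", "NN"), (".", "."),
]

# Keep only the tag from each pair, then count how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged)
top_two = tag_counts.most_common(2)
print(top_two)
```

`Counter` from the standard library is used here in place of `FreqDist`; both expose a `most_common` method.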