## Homework 2

In [63]:
from nltk.book import *
from nltk.corpus import words
import re
import time
import requests

In [102]:
def lexical_diversity(text):
    if text is str:
        text = split_text(remove_meta(text))
    return len(set(text)) / len(text)

def vocab_count(text):
    if text is str:
        text = split_text(remove_meta(text))
    return len(set(text))

def split_text(text):
    return re.split("[^a-z']+")

def remove_meta(text):
    return re.split(end_line, re.split(start_line, text)[1])[0]

### 1.) In Python, create a method for scoring the vocabulary size of a text, and normalize the score from 0 to 1. It does not matter what method you use for normalization as long as you explain it in a short paragraph. (Various methods will be discussed in the live session.)

In [41]:
max_words = max([
    vocab_score(text1),
    vocab_score(text2),
    vocab_score(text3),
    vocab_score(text4),
    vocab_score(text5),
    vocab_score(text6),
    vocab_score(text7),
    vocab_score(text8),
    vocab_score(text9)
    ])

In [42]:
vocab_score(text1)

19317

In [43]:
text1.name

'Moby Dick by Herman Melville 1851'

* The method listed above takes all the texts from the nltk text database and uses the max word count in any of those documents as the value of 1.  This book had a total number of unique words of 19317.  This book was Moby Dick

In [6]:
from nltk.tag import BigramTagger

In [7]:
tagger = BigramTagger()

ValueError: Must specify either training data or trained model.

In [13]:
all_words = words.words()

In [19]:
len(all_words)

236736

In [72]:
def vocab_score(text):
    return vocab_count(text) / len(all_words)

In [73]:
vocab_score(text1)

0.08159722222222222

* A more useful method would be to compare the number of total words in a document to the 236,376 words found in the nltk.corpus.words.words() object.  This is the method I will be using for the remaider of the Document, where a book which contains 236,376 unique words will have a vocabulary score of 1, and a book which contains only 5 unique words will have a value of 5 / 236,376

### After consulting section 3.2 in chapter 1 of Bird-Klein, create a method for scoring the long-word vocabulary size of a text, and likewise normalize (and explain) the scoring as in step 1 above.

In [129]:
def longword_count(text, min_length=15):
    '''
    Counts the number of unique long words in a body of text
    --------
    INPUTS
    text: {str | list}
        -   The Body of text to check for long words
    min_length: int (default 15)
        -   Minimum word length to be considered a 'long word'
    --------
    RETURNS
    vocab_score: int
        -   number of total words in text which are at least min_length
    '''
    if text is str:
        text = split_text(text)
    longwords = []
    for word in text:
        try:
            word[min_length - 1]
            longwords.append(word)
        except IndexError:
            continue
    return vocab_count(longwords)

def longword_score(text, min_length=15):
    '''
    Scores the number of unique long words in a body of text when
    compared to the total number of words of equal lenght in nltk.corpus.words.words()
    --------
    INPUTS
    text: {str | list}
        -   The Body of text to check for long words
    min_length: int (default 15)
        -   Minimum word length to be considered a 'long word'
    --------
    RETURNS
    longwords_score: float
        -   count of total words > minwords / number of words > min_length in nltk.corpus.words.words()
    '''
    longwords_count = 0
    for word in all_words:
        try: 
            word[min_length - 1]
            longwords_count += 1
        except IndexError:
            continue
    return float(longword_count(text, min_length)) / longwords_count

In [84]:
longword_count(text5)

94

In [85]:
longword_score(text5)

0.0073852922690132

* The function 'longword_count' checks all words in a corpus of text to see if the words are at least 15 characters.  If the word is at least 15 characters, it adds them to a list and counts the total number of uniuqe words.  The function longword_score performs the function in longword_count, but it also counts the total number of long words in nltk.corpus.words.words() and compares the number of long words in the text to that value (with the number of total long words in nltk.corpus being a value of 1.

### Now create a “text difficulty score” by combining the lexical diversity score from homework 1, and your normalized score of vocabulary size and long-word vocabulary size, in equal weighting. Explain what you see when this score is applied to same graded texts you used in homework 1.

In [59]:
## The Bacon Second Reader
book1 = requests.get('http://www.gutenberg.org/cache/epub/15659/pg15659.txt').content
time.sleep(6)
## Mcguffy’s Forth Eclectic Reader
book2 = requests.get('http://www.gutenberg.org/cache/epub/14880/pg14880.txt').content
time.sleep(4)
## Mcguffy’s Fifth Eclectic Reader
book3 = requests.get('http://www.gutenberg.org/cache/epub/15040/pg15040.txt').content
time.sleep(5)
## The Ontario High School Reader
book4 = requests.get('http://www.gutenberg.org/cache/epub/19923/pg19923.txt').content

In [115]:
def total_score(text):
    return (lexical_diversity(text) + vocab_score(text) + longword_score(text)) * 100

In [116]:
print("The Bacon Second Reader: ", total_score(str(book1)))
print("Mcguffy’s Forth Eclectic Reader: ", total_score(str(book2)))
print("Mcguffy’s Fifth Eclectic Reader: ", total_score(str(book3)))
print("The Ontario High School Reader: ", total_score(str(book4)))

The Bacon Second Reader:  0.09076099505319624
Mcguffy’s Forth Eclectic Reader:  0.057444635626472146
Mcguffy’s Fifth Eclectic Reader:  0.04989201175903839
The Ontario High School Reader:  0.0478885863745599


* The data here still shows that the second grade reader will have the highest difficulty score of all of the listed books.  The reason for this is that the normalization parameters for vocab score and longword score are dominated by lexical diversity (because this value is much larger) A good way of fixing this would be to either adjust the normalization parameter (the majority of all english language may be a bit extreme for a single book to encompass) or to utilize a non-linear scale for these values, as the more complex a vocabulary gets, the less significent adding a couple of extra words will be to its overall complexity.  This could be a strong application of utilizing a non-linear scale

In [132]:
def total_score_sqrt(text):
    return (lexical_diversity(text) \
            + np.sqrt(vocab_score(text)) \
            + np.sqrt(longword_score(text))) * 100

In [133]:
print("The Bacon Second Reader: ", total_score_sqrt(str(book1)))
print("Mcguffy’s Forth Eclectic Reader: ", total_score_sqrt(str(book2)))
print("Mcguffy’s Fifth Eclectic Reader: ", total_score_sqrt(str(book3)))
print("The Ontario High School Reader: ", total_score_sqrt(str(book4)))

The Bacon Second Reader:  1.93896000972
Mcguffy’s Forth Eclectic Reader:  1.92709169301
Mcguffy’s Fifth Eclectic Reader:  1.91953906914
The Ontario High School Reader:  1.9387259395


* Using a square root scale, all the values end up looking about the same.  It would be possible to tune these values to match the expected outcome, but there is a large risk of overfitting the data; so untill we are planning on developing a difficulty score for a much larger corpus of labeled text, it would probably be a better idea to not assume difficulty is going to be a strictly evenly weighted linear combination of the three scores we have so far obtained.