## Homework 2 Prime

In [159]:
from nltk.book import *
from nltk.corpus import words
import numpy as np
import re
import time
import requests
import os
from bs4 import BeautifulSoup

In [141]:
all_words = words.words()

In [211]:
def lexical_diversity(text):
    if text is str:
        text = split_text(remove_meta(text))
    return len(set(text)) / len(text)

def vocab_count(text):
    if text is str:
        text = split_text(remove_meta(text))
    return len(set(text))

def split_text(text):
    return re.split("[^a-z']+")

def remove_meta(text):
    start_line = "*** START OF THIS PROJECT GUTENBERG EBOOK [A-Z ]+***"
    end_line = "*** END OF THIS PROJECT GUTENBERG EBOOK [A-Z]+***"
    return re.split(end_line, re.split(start_line, text)[1])[0]

def vocab_score(text):
    return vocab_count(text) / len(all_words)

### 1.) In Python, create a method for scoring the vocabulary size of a text, and normalize the score from 0 to 1. It does not matter what method you use for normalization as long as you explain it in a short paragraph. (Various methods will be discussed in the live session.)

In [143]:
max_words = max([
    vocab_score(text1),
    vocab_score(text2),
    vocab_score(text3),
    vocab_score(text4),
    vocab_score(text5),
    vocab_score(text6),
    vocab_score(text7),
    vocab_score(text8),
    vocab_score(text9)
    ])

In [144]:
vocab_score(text1)

0.08159722222222222

In [145]:
text1.name

'Moby Dick by Herman Melville 1851'

* The method listed above takes all the texts from the nltk text database and uses the max word count in any of those documents as the value of 1.  This book had a total number of unique words of 19317.  This book was Moby Dick

In [146]:
len(all_words)

236736

In [147]:
vocab_score(text1)

0.08159722222222222

* A more useful method would be to compare the number of total words in a document to the 236,376 words found in the nltk.corpus.words.words() object.  This is the method I will be using for the remaider of the Document, where a book which contains 236,376 unique words will have a vocabulary score of 1, and a book which contains only 5 unique words will have a value of 5 / 236,376

### After consulting section 3.2 in chapter 1 of Bird-Klein, create a method for scoring the long-word vocabulary size of a text, and likewise normalize (and explain) the scoring as in step 1 above.

In [148]:
def longword_count(text, min_length=15):
    '''
    Counts the number of unique long words in a body of text
    --------
    INPUTS
    text: {str | list}
        -   The Body of text to check for long words
    min_length: int (default 15)
        -   Minimum word length to be considered a 'long word'
    --------
    RETURNS
    vocab_score: int
        -   number of total words in text which are at least min_length
    '''
    if text is str:
        text = split_text(text)
    longwords = []
    for word in text:
        try:
            word[min_length - 1]
            longwords.append(word)
        except IndexError:
            continue
    return vocab_count(longwords)

def longword_score(text, min_length=15):
    '''
    Scores the number of unique long words in a body of text when
    compared to the total number of words of equal lenght in nltk.corpus.words.words()
    --------
    INPUTS
    text: {str | list}
        -   The Body of text to check for long words
    min_length: int (default 15)
        -   Minimum word length to be considered a 'long word'
    --------
    RETURNS
    longwords_score: float
        -   count of total words > minwords / number of words > min_length in nltk.corpus.words.words()
    '''
    longwords_count = 0
    for word in all_words:
        try: 
            word[min_length - 1]
            longwords_count += 1
        except IndexError:
            continue
    return float(longword_count(text, min_length)) / longwords_count

In [149]:
longword_count(text5)

94

In [150]:
longword_score(text5)

0.0073852922690132

* The function 'longword_count' checks all words in a corpus of text to see if the words are at least 15 characters.  If the word is at least 15 characters, it adds them to a list and counts the total number of uniuqe words.  The function longword_score performs the function in longword_count, but it also counts the total number of long words in nltk.corpus.words.words() and compares the number of long words in the text to that value (with the number of total long words in nltk.corpus being a value of 1.

### Now create a “text difficulty score” by combining the lexical diversity score from homework 1, and your normalized score of vocabulary size and long-word vocabulary size, in equal weighting. Explain what you see when this score is applied to same graded texts you used in homework 1.

In [151]:
class GutenScraper:
    def __init__(self, url='http://www.gutenberg.org/wiki/Children%27s_Instructional_Books_(Bookshelf)'):
        self.urlbase = url
        self.download_template = "http://www.gutenberg.org/ebooks/{}.txt.utf-8"
        self.books_folder = "local_pages"
        if self.books_folder not in os.listdir():
            os.mkdir(self.books_folder)
        self.books = {}
        
    def get_book_names(self):
        if "base_page" not in os.listdir(self.books_folder):
            main_page = str(requests.get(self.urlbase).content)
        else:
            main_page = open(self.books_folder + "/base_page")
        page_soup = BeautifulSoup(main_page, 'html.parser')
        for i in page_soup.find_all('a', attrs={"class": "extiw"}):
            self.books[re.sub("[^a-z0-9]", '', i.contents[0].lower())] = \
                self.download_template.format(i.attrs['title'].split(":")[1])
                
    def save_book(self, contents, bookname):
        open(self.books_folder + "/" + bookname, "w").write(contents)
    
    def get_all_books(self, sleep_time=3):
        for bookname, bookurl in self.books.items():
            if bookname not in os.listdir(self.books_folder):
                contents = str(requests.get(bookurl).content)
                self.save_book(contents, bookname)
                time.sleep(sleep_time)

In [202]:
Scraper = GutenScraper()
Scraper.get_book_names()
Scraper.get_all_books()

In [203]:
len(os.listdir('local_pages'))

104

* There are 104 total books in the Guttenberg children's instructional books library

In [155]:
def total_score(text):
    return (lexical_diversity(text) + vocab_score(text) + longword_score(text)) * 100

In [212]:
for book in os.listdir('local_pages'):
    print(book + ": ", total_score(book))

thestoryofmanhattan:  57.899383368193405
anecdotesofthehabitsandinstinctofanimals:  35.005913760475806
campingforboys:  92.8626342061561
walterandthewireless:  55.00464652608814
harrysladdertolearning:  59.096400439922334
thestoryofwool:  71.432795543197
woodlandtales:  69.23457093393225
theontarioreadersthirdbook:  46.15891509139685
boyblueandhisfriendsschooled:  53.57776474336693
howtowriteclearlynrulesandexercisesonenglishcomposition:  34.55348036324314
firstitalianreadings:  55.00464652608814
thewondersofthejunglenbookone:  55.17917237650437
littlebusybodiesnthelifeofcricketsantsbeesbeetlesandotherbusybodies:  25.380315323221687
delasallefifthreader:  50.00422411462557
theontarioreadersfourthbook:  51.857765612327654
mcguffeysfirsteclecticreaderrevisededition:  40.48337147105395
theontarioreadersthehighschoolreader1886:  40.00675858340092
countrywalksofanaturalistwithhischildren:  42.507180994863475
aprimaryreadernoldtimestoriesfairytalesandmythsretoldbychildren:  26.99130797899046

In [213]:
def total_score_sqrt(text):
    return (lexical_diversity(text) \
            + np.sqrt(vocab_score(text)) \
            + np.sqrt(longword_score(text))) * 100

In [214]:
for book in os.listdir('local_pages'):
    print(book + ": ", total_score_sqrt(book))

thestoryofmanhattan:  58.5763911598
anecdotesofthehabitsandinstinctofanimals:  35.7690097838
campingforboys:  93.5981792269
walterandthewireless:  55.6816543177
harrysladdertolearning:  59.8319454607
thestoryofwool:  72.0785033193
woodlandtales:  69.8473487604
theontarioreadersthirdbook:  46.8658108686
boyblueandhisfriendsschooled:  54.3674293213
howtowriteclearlynrulesandexercisesonenglishcomposition:  35.4413238322
firstitalianreadings:  55.6816543177
thewondersofthejunglenbookone:  55.9945198325
littlebusybodiesnthelifeofcricketsantsbeesbeetlesandotherbusybodies:  26.2205418374
delasallefifthreader:  50.6499318907
theontarioreadersfourthbook:  52.6208616357
mcguffeysfirsteclecticreaderrevisededition:  41.3235979852
theontarioreadersthehighschoolreader1886:  40.8221060394
countrywalksofanaturalistwithhischildren:  43.347407509
aprimaryreadernoldtimestoriesfairytalesandmythsretoldbychildren:  27.8315344931
abookofnaturalhistorynyoungfolkslibraryvolumexiv:  40.47920262
alittlebookforal

The Scores used in the original HWK2 have now been applied to all 104 books in the Guttenberg instructional childrens books