# Introduction

In this exercise, we will start with the same dataset as in the previous assignment. As an extension to our previous assignment, we will: 

- Create a method for normalizing (scaling) the vocabulary size of the three texts
- Create a method for scoring the long-word vocabulary size of a text
    - We will build an array of all words longer than 10 characters
    - Normalize / scale the score as in the previous step
    - Store it in the results dataframe
- Create a Text Difficulty Score
    - Combine the Lexical Diversity Score, Normalized Vocabulary Size Score and the Normalized long-word vocabulary score
    - Calculate the mean
- Compare the new score between the 3 texts and explain the observation in the conclusion section

#### Additional text processing compared to Assignment 1:

- We have identified the start and end of the actual text to filter out the table of contents, preface, table of figures, etc.
- We have converted all words to lowercase so that we don't count the same word twice.

### Import necessary packages

In [1]:
import nltk, re, pprint
import pandas as pd
import numpy as np
from urllib import request
from nltk import word_tokenize
from nltk.corpus import stopwords
from sklearn.preprocessing import minmax_scale

### Download the text files
- Download the files using the .txt URL.
- Decode (UTF-8 text) 
- Identify the starting and ending position of the text and strip out the unnecessary text
- Tokenize the words
- Convert text to lower case
- Remove stop words
- Remove punctuation marks

In [2]:
stop_words = set(stopwords.words('english'))

In [3]:
def clean_text(text, start, end):
    text_sub = text[start:end] # Substring from Start to End of actual Text
    text_tokens = word_tokenize(text_sub) # Tokenize the words
    text_lower = [w.lower() for w in text_tokens] # Convert to lower case
    text_rem_stop = [word for word in text_lower if word not in stop_words] # remove stop words
    text_rem_punc = [word for word in text_rem_stop if word.isalpha()] # remove punctuations
    return(text_rem_punc)
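
The cleaning steps can be tried on a small string without downloading any NLTK data. The following is only a sketch: the regex tokenizer and the tiny `TOY_STOP_WORDS` set are hypothetical stand-ins for `word_tokenize` and NLTK's English stop-word list, so results on real text will differ from `clean_text`.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the notebook uses
# NLTK's full English list via stopwords.words('english').
TOY_STOP_WORDS = {"the", "a", "and", "is", "on", "in"}

def clean_text_sketch(text, start, end, stop_words=TOY_STOP_WORDS):
    """Slice, tokenize, lowercase, then drop stop words and non-alphabetic tokens."""
    body = text[start:end]
    # Assumption: a simple regex split stands in for nltk.word_tokenize.
    tokens = re.findall(r"\w+|[^\w\s]", body)
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in stop_words and t.isalpha()]

sample = "PREFACE. The Cat sat on the mat, and the cat slept! END"
start = sample.find("The Cat")   # first occurrence of the start marker
end = sample.rfind("END")        # last occurrence of the end marker
cleaned = clean_text_sketch(sample, start, end)
print(cleaned)
```

Lowercasing before counting is what makes "Cat" and "cat" collapse into a single vocabulary entry.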

In [4]:
url_grade3_text = "http://www.gutenberg.org/cache/epub/14766/pg14766.txt"
grade3_text_url_open = request.urlopen(url_grade3_text)
text_grade3_raw = grade3_text_url_open.read().decode('utf-8-sig')
text_grade3_raw_start = text_grade3_raw.find("MCGUFFEY\'S\r\n\r\nTHIRD READER")
text_grade3_raw_end = text_grade3_raw.rfind("End of the Project Gutenberg EBook of McGuffey\'s Third Eclectic Reader")
text_grade3_tokenize = clean_text(text_grade3_raw, text_grade3_raw_start, text_grade3_raw_end)

In [5]:
url_grade4_text = "http://www.gutenberg.org/cache/epub/14880/pg14880.txt"
grade4_text_url_open = request.urlopen(url_grade4_text)
text_grade4_raw = grade4_text_url_open.read().decode('utf-8-sig')
text_grade4_raw_start = text_grade4_raw.find("MCGUFFEY\'S FOURTH READER")
text_grade4_raw_end = text_grade4_raw.rfind("End of the Project Gutenberg EBook of McGuffey\'s Fourth Eclectic Reader")
text_grade4_tokenize = clean_text(text_grade4_raw, text_grade4_raw_start, text_grade4_raw_end)

In [6]:
url_grade5_text = "http://www.gutenberg.org/cache/epub/15040/pg15040.txt"
grade5_text_url_open = request.urlopen(url_grade5_text)
text_grade5_raw = grade5_text_url_open.read().decode('utf-8-sig')
text_grade5_raw_start = text_grade5_raw.find("McGuffey\'s Fifth Reader")
text_grade5_raw_end = text_grade5_raw.rfind("End of the Project Gutenberg EBook of McGuffey\'s Fifth Eclectic Reader")
text_grade5_tokenize = clean_text(text_grade5_raw, text_grade5_raw_start, text_grade5_raw_end)
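
One caveat with the marker approach: `str.find` and `str.rfind` return -1 when the marker is missing, and slicing with -1 silently produces the wrong substring instead of raising an error. A defensive variant could look like the sketch below; `slice_between` is a hypothetical helper, not part of the notebook, and the sample string is made up.

```python
def slice_between(text, start_marker, end_marker):
    """Return the text from the first start_marker up to the last end_marker."""
    start = text.find(start_marker)
    end = text.rfind(end_marker)
    if start == -1:
        raise ValueError(f"start marker not found: {start_marker!r}")
    if end == -1 or end <= start:
        raise ValueError(f"end marker not found after start: {end_marker!r}")
    return text[start:end]

raw = "License header\nTITLE PAGE\nLesson one.\nEnd of the book\nLicense footer"
body = slice_between(raw, "TITLE PAGE", "End of the book")
print(repr(body))
```

Failing loudly is useful here because the Project Gutenberg header and footer wording can differ between files.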

In [7]:
print(len(text_grade4_tokenize))
print(len(set(text_grade4_tokenize)))

27495
6316


### Calculate and store the Lexical Diversity score in a dataframe

- Define a blank dataframe
- Define a function to calculate the lexical diversity score
- Call the function for the text from different grades and store the corresponding values in the dataframe

In [8]:
results_df = pd.DataFrame(columns=['Grade_Level'
                                   , 'Lexical_Diversity_Score'
                                   , 'Vocabulary_Count'
                                  ])
results_df

Unnamed: 0,Grade_Level,Lexical_Diversity_Score,Vocabulary_Count


In [9]:
def lexical_diversity(text):
    vocab_cnt = len(set(text))
    lex_div = vocab_cnt / len(text)
    return (lex_div, vocab_cnt)
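
As a quick sanity check, the same calculation can be run on a toy token list (made up for illustration) and on the Grade 4 counts printed in cell 7:

```python
def lexical_diversity(tokens):
    """Return (distinct words / total words, distinct word count)."""
    vocab_cnt = len(set(tokens))
    return vocab_cnt / len(tokens), vocab_cnt

toy_tokens = ["cat", "sat", "mat", "cat", "slept", "cat"]
score, vocab = lexical_diversity(toy_tokens)
print(score, vocab)  # 4 distinct words over 6 tokens

# Grade 4 counts from cell 7: 6316 distinct words out of 27495 tokens.
grade4_score = 6316 / 27495
print(round(grade4_score, 6))
```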

In [10]:
(lex_div, vocab_cnt) = lexical_diversity(text_grade3_tokenize)

# DataFrame.append was removed in pandas 2.0; add the row by index label instead
results_df.loc[len(results_df)] = ['Grade 3', lex_div, vocab_cnt]

In [11]:
(lex_div, vocab_cnt) = lexical_diversity(text_grade4_tokenize)

# DataFrame.append was removed in pandas 2.0; add the row by index label instead
results_df.loc[len(results_df)] = ['Grade 4', lex_div, vocab_cnt]

In [12]:
(lex_div, vocab_cnt) = lexical_diversity(text_grade5_tokenize)

# DataFrame.append was removed in pandas 2.0; add the row by index label instead
results_df.loc[len(results_df)] = ['Grade 5', lex_div, vocab_cnt]

In [13]:
results_df.iloc[:, [0, 1]]

Unnamed: 0,Grade_Level,Lexical_Diversity_Score
0,Grade 3,0.246825
1,Grade 4,0.229714
2,Grade 5,0.229137


In [14]:
results_df.iloc[:, [0, 2]]

Unnamed: 0,Grade_Level,Vocabulary_Count
0,Grade 3,2974
1,Grade 4,6316
2,Grade 5,9986


In [15]:
results_df

Unnamed: 0,Grade_Level,Lexical_Diversity_Score,Vocabulary_Count
0,Grade 3,0.246825,2974
1,Grade 4,0.229714,6316
2,Grade 5,0.229137,9986


### Normalizing the Scores

We need to scale and translate the feature so that it lies between zero and one.

- Use `minmax_scale` from `sklearn.preprocessing`
- Scale the Vocabulary Size of the text from each grade between 0 and 1
- Store the results in the same dataframe

X_normalized = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
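
To make the formula concrete, here is a minimal pure-Python sketch applied to the three vocabulary counts reported earlier. The helper name `min_max_scale` and the choice to raise an error when all values are equal are assumptions of this sketch, not necessarily how sklearn behaves.

```python
def min_max_scale(values):
    """Scale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Sketch-level choice: the formula divides by zero in this case.
        raise ValueError("all values are equal; min-max scaling is undefined")
    return [(v - lo) / (hi - lo) for v in values]

vocab_counts = [2974, 6316, 9986]  # Grade 3, Grade 4, Grade 5
scaled = min_max_scale(vocab_counts)
print([round(s, 6) for s in scaled])
```

For Grade 4 this is (6316 - 2974) / (9986 - 2974) = 3342 / 7012. With only three texts, the smallest always maps to 0 and the largest to 1, so only the middle value carries information about relative spacing.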

In [16]:
def n_vocab_size(*arg):
    vocab_size = np.array([])
    vocab_size_norm = np.array([])
    
    #### Getting the Vocab Size
    for text in arg:
        vocab_size = np.append(vocab_size,len(set(text)))
    
    #### Normalizing using the formula 
    for vsize in vocab_size:
        vocab_size_norm = np.append(vocab_size_norm,(vsize - vocab_size.min()) /
                                                    (vocab_size.max() - vocab_size.min()))
    
    #### Normalizing using sklearn preprocessing 
    vocab_size_norm_sklearn = minmax_scale(vocab_size, feature_range=(0,1), axis=0)
    
    return(vocab_size,vocab_size_norm,vocab_size_norm_sklearn)

In [17]:
vocab_size = n_vocab_size(text_grade3_tokenize,
                          text_grade4_tokenize,
                          text_grade5_tokenize)

In [18]:
results_df['Vocab_Cnt_Scaled'] = vocab_size[2]
results_df

Unnamed: 0,Grade_Level,Lexical_Diversity_Score,Vocabulary_Count,Vocab_Cnt_Scaled
0,Grade 3,0.246825,2974,0.0
1,Grade 4,0.229714,6316,0.476612
2,Grade 5,0.229137,9986,1.0



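As expected, the scaled vocabulary size increases with grade level. The lexical diversity score, by contrast, falls slightly from Grade 3 (0.246825) to Grade 5 (0.229137), so the scaled vocabulary size is the better indicator of difficulty for these texts.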
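
One likely reason the lexical diversity score behaves this way is that it divides by the total number of tokens, so it depends on text length as well as vocabulary. The toy example below (made-up tokens, not the McGuffey texts) keeps the vocabulary fixed and only lengthens the text:

```python
def lexical_diversity(tokens):
    """Distinct words divided by total words."""
    return len(set(tokens)) / len(tokens)

short_text = ["the", "dog", "ran", "home"]
long_text = short_text * 5  # same vocabulary, five times the tokens

print(lexical_diversity(short_text))  # 4 / 4
print(lexical_diversity(long_text))   # 4 / 20
print(len(set(short_text)) == len(set(long_text)))  # vocabulary size unchanged
```

The vocabulary size is identical in both cases, while the diversity ratio drops as the text grows.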

### List of Long Words

- Use nltk.FreqDist to calculate the occurrence of each word within the text.
- Make a list of all words longer than 10 characters in each of the texts. We refer to these as complex words.
- Store the count of complex words in the results dataframe.
- Scale the complex word count between 0 and 1.
- Store the results in the same dataframe
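
The first two steps above can be sketched on a toy token list as follows. This sketch uses `collections.Counter` from the standard library in place of `nltk.FreqDist`, and `long_words` is a hypothetical helper name; the tokens are made up.

```python
from collections import Counter

def long_words(tokens, min_len=10):
    """Return {word: count} for distinct words longer than min_len characters."""
    counts = Counter(tokens)
    return {w: counts[w] for w in sorted(counts) if len(w) > min_len}

toy_tokens = ["grandmother", "strawberries", "cat",
              "grandmother", "immediately", "blackberry"]
result = long_words(toy_tokens)
for word, count in result.items():
    print(word, "->", count)
```

Note that "blackberry" has exactly 10 characters, so the strict `>` comparison excludes it.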

In [20]:
fdist_grade3 = nltk.FreqDist(text_grade3_tokenize)

grade3_complex_words = []
for word in sorted(fdist_grade3):
    if(len(word) > 10):
        print(word, '->', fdist_grade3[word], end='\n')
        grade3_complex_words.append(word)
        
results_df.loc[0, 'Complex_Wrd_Cnt'] = len(grade3_complex_words)

blackberries -> 1
commandment -> 2
commandments -> 2
constructed -> 1
deliverance -> 1
discouraged -> 1
disobedient -> 1
encountered -> 1
experienced -> 1
forgiveness -> 1
gingerbread -> 2
grandfather -> 1
grandmother -> 12
immediately -> 2
neighborhood -> 1
newfoundland -> 1
overflowing -> 1
pocketknife -> 1
satisfaction -> 2
schoolhouse -> 2
schoolmates -> 2
sorrowfully -> 1
strawberries -> 4
transgressions -> 1


In [21]:
fdist_grade4 = nltk.FreqDist(text_grade4_tokenize)

grade4_complex_words = []
for word in sorted(fdist_grade4):
    if(len(word) > 10):
        print(word, '->', fdist_grade4[word], end='\n')
        grade4_complex_words.append(word)
        
results_df.loc[1, 'Complex_Wrd_Cnt'] = len(grade4_complex_words)

abbreviation -> 1
aberbrothok -> 4
accidentally -> 2
accompanied -> 2
accomplishing -> 1
accordingly -> 2
acquaintance -> 1
advancement -> 1
affectionate -> 1
application -> 1
appreciated -> 1
approaching -> 2
approbation -> 1
appropriate -> 1
arrangements -> 1
ascertained -> 1
astonishment -> 2
beautifully -> 2
capabilities -> 1
chanticleer -> 1
cheerfulness -> 1
christopher -> 1
churchgoing -> 1
circumstance -> 2
circumstances -> 1
comfortable -> 2
comfortably -> 2
comfortless -> 1
commandment -> 2
commandments -> 2
commencement -> 1
communication -> 2
comparatively -> 2
competition -> 2
complaining -> 2
composition -> 19
compositions -> 1
concentration -> 1
confidingly -> 1
confinement -> 1
conjectured -> 1
conscientious -> 1
consciousness -> 1
consequence -> 3
consequences -> 2
consideration -> 1
considerest -> 1
considering -> 1
consolation -> 1
consultation -> 1
contemptible -> 1
contentedly -> 1
continually -> 3
continuance -> 1
contrivance -> 1
conversation -> 2
countenance -> 

In [22]:
fdist_grade5 = nltk.FreqDist(text_grade5_tokenize)

grade5_complex_words = []
for word in sorted(fdist_grade5):
    if(len(word) > 10):
        print(word, '->', fdist_grade5[word], end='\n')
        grade5_complex_words.append(word)
        
results_df.loc[2, 'Complex_Wrd_Cnt'] = len(grade5_complex_words)

abandonment -> 1
abbreviation -> 1
abstractedly -> 1
accidentally -> 1
accountableness -> 1
accumulated -> 1
achievement -> 1
achievements -> 1
acknowledge -> 2
acknowledged -> 2
acknowledgment -> 1
acquaintance -> 3
administered -> 1
administering -> 1
advancement -> 1
adventurous -> 1
affectionate -> 2
aggravating -> 2
agricultural -> 1
agriculture -> 1
alleghanies -> 1
ambrosianae -> 1
annihilates -> 1
anonymously -> 1
antediluvian -> 1
anticipation -> 1
antislavery -> 1
appealingly -> 1
application -> 3
applications -> 1
appointment -> 2
appreciated -> 1
apprehension -> 2
apprenticed -> 3
apprenticeship -> 1
approaching -> 3
architecture -> 2
arrangements -> 1
articulation -> 2
artlessness -> 1
ascertained -> 3
assiduities -> 1
astonishing -> 2
astonishingly -> 1
astonishment -> 1
attainments -> 1
attentively -> 1
attractions -> 1
authenticity -> 1
authorizing -> 1
backwardness -> 1
backwoodsman -> 1
ballyshannon -> 1
barbarously -> 1
battlefield -> 1
beautifully -> 1
bedchambers -

proceedings -> 1
productions -> 9
professorship -> 2
promiscuous -> 1
promulgating -> 1
pronunciation -> 1
propagation -> 2
propensities -> 1
prophesying -> 1
proprieties -> 1
prosecution -> 1
prospective -> 1
provocation -> 1
publication -> 7
rattlesnake -> 2
reappointed -> 1
recognition -> 6
recollection -> 1
recollections -> 2
reconciliation -> 1
reenforcements -> 1
reformation -> 2
refreshment -> 1
regulations -> 2
relinquished -> 1
reluctantly -> 1
remembering -> 1
remembrance -> 3
remittances -> 1
remonstrate -> 1
representatives -> 1
represented -> 4
reprimanded -> 1
republished -> 2
resemblance -> 1
resignation -> 2
respectfully -> 1
responsibilitie -> 1
restoration -> 1
retaliating -> 1
retributory -> 1
reverberated -> 1
reverberating -> 2
reverential -> 1
revolutionary -> 1
righteousness -> 2
roundabouts -> 1
satisfaction -> 5
saunterings -> 1
scholarship -> 1
schoolfellow -> 1
schoolfellows -> 1
schoolhouse -> 2
schoolmaster -> 3
scrupulously -> 2
secretaries -> 2
selfishnes

In [23]:
vocab_size_complex = n_vocab_size(grade3_complex_words,
                                  grade4_complex_words,
                                  grade5_complex_words)

In [24]:
results_df['Complex_Wrd_Cnt_Scaled'] = vocab_size_complex[2]

In [25]:
results_df

Unnamed: 0,Grade_Level,Lexical_Diversity_Score,Vocabulary_Count,Vocab_Cnt_Scaled,Complex_Wrd_Cnt,Complex_Wrd_Cnt_Scaled
0,Grade 3,0.246825,2974,0.0,24.0,0.0
1,Grade 4,0.229714,6316,0.476612,246.0,0.363934
2,Grade 5,0.229137,9986,1.0,634.0,1.0


Like the scaled vocabulary size, the scaled long-word vocabulary size increases with grade level. Both scaled scores separate the three texts more clearly than the lexical diversity score alone.

## New Complexity Score

- Create a new text difficulty score
- Simple mean of the lexical diversity score, the scaled vocabulary size, and the scaled long-word vocabulary size

In [26]:
results_df['New_Complexity_Score'] = (results_df.Lexical_Diversity_Score + 
                                      results_df.Vocab_Cnt_Scaled + 
                                      results_df.Complex_Wrd_Cnt_Scaled) / 3

In [27]:
results_df.iloc[:, [0,1,3,5,6]]

Unnamed: 0,Grade_Level,Lexical_Diversity_Score,Vocab_Cnt_Scaled,Complex_Wrd_Cnt_Scaled,New_Complexity_Score
0,Grade 3,0.246825,0.0,0.0,0.082275
1,Grade 4,0.229714,0.476612,0.363934,0.356753
2,Grade 5,0.229137,1.0,1.0,0.743046
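
The scores in the last column can be checked by hand from the three component columns. The snippet below is a small verification sketch in plain Python, with the values copied from the table above:

```python
# (lexical diversity, scaled vocabulary size, scaled long-word count) per grade
rows = {
    "Grade 3": (0.246825, 0.0, 0.0),
    "Grade 4": (0.229714, 0.476612, 0.363934),
    "Grade 5": (0.229137, 1.0, 1.0),
}

# Equal-weight mean of the three components
scores = {grade: sum(parts) / len(parts) for grade, parts in rows.items()}
for grade, score in scores.items():
    print(grade, round(score, 6))
```

Because two of the three components are min-max scaled across the same three texts, the Grade 3 score is driven entirely by its lexical diversity, and the Grade 5 score by its lexical diversity plus two maximal components.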


# Conclusion

In the Unit 1 homework, we looked at lexical diversity and vocabulary size. We learned that the lexical diversity score alone can give a misleading picture, while the vocabulary size adds information about the complexity of a text.

In this homework, we scaled the vocabulary size to a value between 0 and 1. We also built a list of the long words (more than 10 characters) in each text and scaled its size in the same way. Both the scaled vocabulary size and the scaled long-word vocabulary size followed the expected trend, increasing with grade level.

Finally, we created a new complexity score that gives equal weight to the lexical diversity score, the scaled vocabulary size, and the scaled long-word vocabulary size. As expected, the new complexity score increases with grade level.

# References

- https://www.nltk.org/book/ch02.html
- https://www.nltk.org/book/ch03.html
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
- https://machinelearningmastery.com/clean-text-machine-learning-python/