# Introduction

In this exercise, we load the NLTK package and follow Chapter 1 of Bird-Klein to implement a lexical diversity scoring routine.
- We import 3 texts of different grade levels from the Graded Readers section of http://www.gutenberg.org/wiki/Children%27s_Instructional_Books_(Bookshelf), compute the lexical diversity score of each, and explore the results.
- We then compute the vocabulary size of the same 3 texts and explore the results.
- Finally, we discuss whether the 2 measures should be used together to measure text difficulty (or reading level).

### Import necessary packages

In [1]:
import nltk
import pandas as pd

In [2]:
from urllib import request
from nltk import word_tokenize

### Download and tokenize the texts

Each text is downloaded in .txt format. Since it is encoded in UTF-8, we decode the data and then tokenize the words.

In [3]:
url_grade3_text = "http://www.gutenberg.org/cache/epub/14766/pg14766.txt"
grade3_text_url_open = request.urlopen(url_grade3_text)
text_grade3_raw = grade3_text_url_open.read().decode('utf-8-sig')
text_grade3_tokenize = word_tokenize(text_grade3_raw)
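
The cell above decodes with `'utf-8-sig'` rather than plain `'utf-8'`. A minimal sketch of the difference, using a made-up byte string (not the downloaded file) that starts with a UTF-8 byte-order mark (BOM): the `utf-8-sig` codec drops a leading BOM if one is present, so it does not end up attached to the first token.

```python
# Hypothetical sample bytes: a UTF-8 BOM followed by ordinary text.
raw_bytes = b"\xef\xbb\xbfThe Project Gutenberg EBook"

plain = raw_bytes.decode("utf-8")        # BOM survives as the character U+FEFF
cleaned = raw_bytes.decode("utf-8-sig")  # leading BOM is stripped

print(plain.startswith("\ufeff"))    # True
print(cleaned.startswith("\ufeff"))  # False
print(cleaned.split()[0])            # The
```

If the input has no BOM, both codecs return the same string, so `'utf-8-sig'` is a safe default here.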

In [4]:
text_grade3_tokenize

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'McGuffey',
 "'s",
 'Third',
 'Eclectic',
 'Reader',
 'by',
 'William',
 'Holmes',
 'McGuffey',
 'This',
 'eBook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever',
 '.',
 'You',
 'may',
 'copy',
 'it',
 ',',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'Project',
 'Gutenberg',
 'License',
 'included',
 'with',
 'this',
 'eBook',
 'or',
 'online',
 'at',
 'www.gutenberg.net',
 'Title',
 ':',
 'McGuffey',
 "'s",
 'Third',
 'Eclectic',
 'Reader',
 'Author',
 ':',
 'William',
 'Holmes',
 'McGuffey',
 'Release',
 'Date',
 ':',
 'January',
 '23',
 ',',
 '2005',
 '[',
 'EBook',
 '#',
 '14766',
 ']',
 'Language',
 ':',
 'English',
 '*',
 '*',
 '*',
 'START',
 'OF',
 'THIS',
 'PROJECT',
 'GUTENBERG',
 'EBOOK',
 'MCGUFFEY',
 "'S",
 'THIRD',
 'ECLECTIC',
 'READER',
 '*',
 '*',
 '*',
 'Produced',


In [5]:
url_grade4_text = "http://www.gutenberg.org/cache/epub/14880/pg14880.txt"
grade4_text_url_open = request.urlopen(url_grade4_text)
text_grade4_raw = grade4_text_url_open.read().decode('utf-8-sig')
text_grade4_tokenize = word_tokenize(text_grade4_raw)

In [6]:
text_grade4_tokenize

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'McGuffey',
 "'s",
 'Fourth',
 'Eclectic',
 'Reader',
 'by',
 'William',
 'Holmes',
 'McGuffey',
 'This',
 'eBook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever',
 '.',
 'You',
 'may',
 'copy',
 'it',
 ',',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'Project',
 'Gutenberg',
 'License',
 'included',
 'with',
 'this',
 'eBook',
 'or',
 'online',
 'at',
 'www.gutenberg.net',
 'Title',
 ':',
 'McGuffey',
 "'s",
 'Fourth',
 'Eclectic',
 'Reader',
 'Author',
 ':',
 'William',
 'Holmes',
 'McGuffey',
 'Release',
 'Date',
 ':',
 'February',
 '2',
 ',',
 '2005',
 '[',
 'EBook',
 '#',
 '14880',
 ']',
 'Language',
 ':',
 'English',
 '*',
 '*',
 '*',
 'START',
 'OF',
 'THIS',
 'PROJECT',
 'GUTENBERG',
 'EBOOK',
 'MCGUFFEY',
 "'S",
 'FOURTH',
 'ECLECTIC',
 'READER',
 '*',
 '*',
 '*',
 'Produced',

In [7]:
url_grade5_text = "http://www.gutenberg.org/cache/epub/15040/pg15040.txt"
grade5_text_url_open = request.urlopen(url_grade5_text)
text_grade5_raw = grade5_text_url_open.read().decode('utf-8-sig')
text_grade5_tokenize = word_tokenize(text_grade5_raw)

In [8]:
text_grade5_tokenize

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'McGuffey',
 "'s",
 'Fifth',
 'Eclectic',
 'Reader',
 'by',
 'William',
 'Holmes',
 'McGuffey',
 'This',
 'eBook',
 'is',
 'for',
 'the',
 'use',
 'of',
 'anyone',
 'anywhere',
 'at',
 'no',
 'cost',
 'and',
 'with',
 'almost',
 'no',
 'restrictions',
 'whatsoever',
 '.',
 'You',
 'may',
 'copy',
 'it',
 ',',
 'give',
 'it',
 'away',
 'or',
 're-use',
 'it',
 'under',
 'the',
 'terms',
 'of',
 'the',
 'Project',
 'Gutenberg',
 'License',
 'included',
 'with',
 'this',
 'eBook',
 'or',
 'online',
 'at',
 'www.gutenberg.net',
 'Title',
 ':',
 'McGuffey',
 "'s",
 'Fifth',
 'Eclectic',
 'Reader',
 'Author',
 ':',
 'William',
 'Holmes',
 'McGuffey',
 'Release',
 'Date',
 ':',
 'February',
 '14',
 ',',
 '2005',
 '[',
 'EBook',
 '#',
 '15040',
 ']',
 'Language',
 ':',
 'English',
 '*',
 '*',
 '*',
 'START',
 'OF',
 'THIS',
 'PROJECT',
 'GUTENBERG',
 'EBOOK',
 'MCGUFFEY',
 "'S",
 'FIFTH',
 'ECLECTIC',
 'READER',
 '*',
 '*',
 '*',
 'Produced',
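
All three token lists begin with the Project Gutenberg header, so license and header tokens are counted together with the readers themselves. The sketch below shows one possible way to trim that material before tokenizing. It is not part of the analysis in this notebook, and the results reported here were computed on the untrimmed text. The helper name `strip_gutenberg_boilerplate` is hypothetical, and the `*** END OF` marker is an assumption based on the `*** START OF` line visible in the output above.

```python
import re

def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the '*** START OF' and '*** END OF' marker lines.

    Hypothetical helper: if a marker is missing, that side is left untrimmed.
    """
    start = re.search(r"^\*\*\*\s*START OF.*$", raw_text, flags=re.MULTILINE)
    end = re.search(r"^\*\*\*\s*END OF.*$", raw_text, flags=re.MULTILINE)
    begin_idx = start.end() if start else 0
    end_idx = end.start() if end else len(raw_text)
    return raw_text[begin_idx:end_idx].strip()

# Made-up sample text that mimics the layout of a Gutenberg file.
sample = (
    "The Project Gutenberg EBook of Some Reader\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK SOME READER ***\n"
    "The cat sat on the mat.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK SOME READER ***\n"
    "License text follows.\n"
)
print(strip_gutenberg_boilerplate(sample))  # The cat sat on the mat.
```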

### Lexical Diversity Score

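The lexical diversity score used here is the ratio of the total number of words (tokens) to the number of distinct words (types) in the text, that is, the average number of times each distinct word is used. No stemming is applied: the types are the distinct tokens exactly as produced by `word_tokenize`, so case variants and punctuation marks count as separate types. Scores of this kind are often used as one indicator of text difficulty.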
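
As a small worked example of the ratio, the sketch below applies it to a made-up token list. The function name `tokens_per_type` is chosen for illustration only.

```python
def tokens_per_type(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))

# Toy input: 9 tokens, 6 types (the, cat, saw, dog, and, ran).
toy = ["the", "cat", "saw", "the", "dog", "and", "the", "dog", "ran"]

print(len(toy), len(set(toy)), tokens_per_type(toy))  # 9 6 1.5
```

The inverse quantity, types divided by tokens, is commonly called the type-token ratio. The two carry the same information, but a higher tokens-per-type value means more repetition, while a higher type-token ratio means less.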

In [9]:
results_df = pd.DataFrame(columns=['Grade_Level'
                                   , 'Lexical_Diversity_Score'
                                   , 'Vocabulary_Count'
                                  ])
results_df

Unnamed: 0,Grade_Level,Lexical_Diversity_Score,Vocabulary_Count


In [10]:
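def lexical_diversity(text):
    # Number of distinct tokens (types); case-sensitive, punctuation included
    vocab_cnt = len(set(text))
    # Tokens per type: average number of times each distinct token occurs
    lex_div = len(text) / vocab_cnt
    return (lex_div, vocab_cnt)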

In [11]:
(lex_div, vocab_cnt) = lexical_diversity(text_grade3_tokenize)

# DataFrame.append was removed in pandas 2.0; add the row by label instead.
# Values are listed in column order.
results_df.loc[len(results_df)] = ['Grade 3', lex_div, vocab_cnt]

In [12]:
(lex_div, vocab_cnt) = lexical_diversity(text_grade4_tokenize)

# DataFrame.append was removed in pandas 2.0; add the row by label instead.
# Values are listed in column order.
results_df.loc[len(results_df)] = ['Grade 4', lex_div, vocab_cnt]

In [13]:
(lex_div, vocab_cnt) = lexical_diversity(text_grade5_tokenize)

# DataFrame.append was removed in pandas 2.0; add the row by label instead.
# Values are listed in column order.
results_df.loc[len(results_df)] = ['Grade 5', lex_div, vocab_cnt]

In [14]:
results_df.iloc[:, [0, 1]]

Unnamed: 0,Grade_Level,Lexical_Diversity_Score
0,Grade 3,8.063455
1,Grade 4,8.102245
2,Grade 5,8.861712


#### Lexical Diversity Score Analysis

In the above analysis, the lexical diversity score changes little across grades: it rises from 8.06 for Grade 3 to 8.10 for Grade 4 and 8.86 for Grade 5. Judged by this score alone, the books selected for the 3 grades appear to have similar difficulty levels.

### Counting Vocabulary

This measure gives the number of distinct words (types) in the text.

In [15]:
results_df.iloc[:, [0, 2]]

Unnamed: 0,Grade_Level,Vocabulary_Count
0,Grade 3,4712
1,Grade 4,10377
2,Grade 5,14289


In [16]:
sorted(set(text_grade3_tokenize))

['!',
 '#',
 '$',
 '%',
 '&',
 "'",
 "''",
 "'AS-IS",
 "'All",
 "'Beware",
 "'S",
 "'See",
 "'T",
 "'Therefore",
 "'What",
 "'beware",
 "'d",
 "'held",
 "'holding",
 "'ll",
 "'m",
 "'re",
 "'round",
 "'s",
 "'t",
 "'ve",
 '(',
 ')',
 '*',
 ',',
 '-',
 '--',
 '.',
 '..',
 '//gutenberg.net/license',
 '//pglaf.org',
 '//pglaf.org/donate',
 '//pglaf.org/fundraising',
 '//www.gutenberg.net',
 '//www.gutenberg.net/1/4/7/6/14766/',
 '//www.pglaf.org',
 '0',
 '1',
 '1.A',
 '1.B',
 '1.C',
 '1.D',
 '1.E',
 '1.E.1',
 '1.E.2',
 '1.E.3',
 '1.E.4',
 '1.E.5',
 '1.E.6',
 '1.E.7',
 '1.E.8',
 '1.E.9',
 '1.F',
 '1.F.1',
 '1.F.2',
 '1.F.3',
 '1.F.4',
 '1.F.5',
 '1.F.6',
 '10',
 '100',
 '101',
 '102',
 '103',
 '104',
 '105',
 '106',
 '107',
 '108',
 '109',
 '11',
 '110',
 '111',
 '112',
 '113',
 '114',
 '115',
 '116',
 '117',
 '118',
 '119',
 '12',
 '120',
 '121',
 '122',
 '123',
 '124',
 '125',
 '126',
 '127',
 '128',
 '129',
 '13',
 '130',
 '131',
 '132',
 '133',
 '134',
 '135',
 '136',
 '137',
 '138',
 

In [17]:
sorted(set(text_grade4_tokenize))

['!',
 '#',
 '$',
 '%',
 '&',
 "'",
 "''",
 "'AS-IS",
 "'Do",
 "'Give",
 "'Have",
 "'Honor",
 "'No",
 "'Pray",
 "'S",
 "'T",
 "'There",
 "'This",
 "'To",
 "'Turn",
 "'What",
 "'Who",
 "'Why",
 "'You",
 "'d",
 "'em",
 "'fled",
 "'gainst",
 "'grand",
 "'ll",
 "'m",
 "'mill",
 "'old",
 "'re",
 "'s",
 "'t",
 "'tchick",
 "'treasures",
 "'ve",
 '(',
 ')',
 '*',
 ',',
 '-',
 '--',
 '-s',
 '.',
 '..',
 '.....',
 '//gutenberg.net/license',
 '//pglaf.org',
 '//pglaf.org/donate',
 '//pglaf.org/fundraising',
 '//www.gutenberg.net',
 '//www.gutenberg.net/1/4/8/8/14880/',
 '//www.pglaf.org',
 '1',
 '1.20',
 '1.A',
 '1.B',
 '1.C',
 '1.D',
 '1.E',
 '1.E.1',
 '1.E.2',
 '1.E.3',
 '1.E.4',
 '1.E.5',
 '1.E.6',
 '1.E.7',
 '1.E.8',
 '1.E.9',
 '1.F',
 '1.F.1',
 '1.F.2',
 '1.F.3',
 '1.F.4',
 '1.F.5',
 '1.F.6',
 '10',
 '100',
 '103',
 '104',
 '107',
 '109',
 '10th',
 '11',
 '110',
 '113',
 '116',
 '117',
 '12',
 '120',
 '121',
 '125',
 '126',
 '128',
 '13',
 '132',
 '134',
 '135',
 '136',
 '139',
 '14',
 '143'

In [18]:
sorted(set(text_grade5_tokenize))

['!',
 '#',
 '$',
 '%',
 '&',
 "'",
 "''",
 "'AS-IS",
 "'Boidered",
 "'Lar'ums",
 "'Mid",
 "'Neath",
 "'S",
 "'T",
 "'We",
 "'Why",
 "'bove",
 "'d",
 "'em",
 "'er-ence",
 "'er-ous",
 "'father",
 "'gainst",
 "'governor",
 "'larums",
 "'ll",
 "'m",
 "'mid",
 "'neath",
 "'pear",
 "'pin-ion",
 "'re",
 "'rt",
 "'s",
 "'scape",
 "'spe-cial",
 "'t",
 "'tree",
 "'twixt",
 "'uge",
 "'ve",
 "'whelms",
 '(',
 ')',
 '*',
 ',',
 '-',
 '--',
 '-he',
 "-i-ga'tion",
 '.',
 '..',
 '...',
 '//gutenberg.net/license',
 '//pglaf.org',
 '//pglaf.org/donate',
 '//pglaf.org/fundraising',
 '//www.gutenberg.net',
 '//www.gutenberg.net/1/5/0/4/15040/',
 '//www.pglaf.org',
 '0',
 '1',
 '1,300',
 '1.',
 '1.A',
 '1.B',
 '1.C',
 '1.D',
 '1.E',
 '1.E.1',
 '1.E.2',
 '1.E.3',
 '1.E.4',
 '1.E.5',
 '1.E.6',
 '1.E.7',
 '1.E.8',
 '1.E.9',
 '1.F',
 '1.F.1',
 '1.F.2',
 '1.F.3',
 '1.F.4',
 '1.F.5',
 '1.F.6',
 '10',
 '10,000',
 '100',
 '101',
 '102',
 '103',
 '104',
 '105',
 '106',
 '107',
 '108',
 '109',
 '11',
 '110',
 '111'

#### Vocabulary Count Analysis

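As we can see above, the vocabulary count increases from 4712 in 3rd grade to 10377 in 4th grade and up to 14289 in 5th grade. This is in line with our expectation that the number of vocabulary words increases with grade. Note that these counts include punctuation marks, numbers, and tokens from the Project Gutenberg license text, as the sorted lists above show.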
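
The sorted lists above contain many entries that are not words, such as `'1.E.1'`, `'//pglaf.org'`, and `'138'`, and they treat `"'S"` and `"'s"` as different types. The sketch below shows one possible stricter count that keeps only alphabetic tokens and ignores case. The helper `word_vocabulary` and the sample tokens are illustrative assumptions, and the counts reported in this notebook were not computed this way.

```python
def word_vocabulary(tokens):
    """Distinct alphabetic tokens, lower-cased (hypothetical helper)."""
    return {tok.lower() for tok in tokens if tok.isalpha()}

# Made-up sample mixing words, case variants, punctuation, and numbers.
sample_tokens = ["The", "the", "'s", "1.E.1", "//pglaf.org",
                 "Reader", "reader", ",", "138"]

print(len(set(sample_tokens)))                 # 9
print(sorted(word_vocabulary(sample_tokens)))  # ['reader', 'the']
```

The `str.isalpha` filter is deliberately crude: it also drops hyphenated words such as `'re-use'` and clitics such as `"'ll"`, so it is a lower bound rather than an exact word count.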

# Conclusion

In our analysis, we noticed that the lexical diversity scores for all 3 grades were close. While the lexical diversity score is an indicator of the complexity of a text, it is a ratio, so it hides both the length of the text and the number of vocabulary words in it.
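
To illustrate how a tokens-per-type ratio depends on text length, the sketch below computes it on growing prefixes of a deliberately repetitive toy text. The text and the helper `ratio_by_prefix` are assumptions made for illustration. Real texts keep introducing new words as they grow, so the effect is much weaker than in this toy.

```python
def tokens_per_type(tokens):
    """Average number of times each distinct token is used."""
    return len(tokens) / len(set(tokens))

def ratio_by_prefix(tokens, sizes):
    """Tokens-per-type ratio of the first n tokens, for each n in sizes."""
    return {n: tokens_per_type(tokens[:n]) for n in sizes if 0 < n <= len(tokens)}

# Toy text: one 13-token sentence (8 distinct words) repeated 50 times.
toy = ("the cat sat on the mat and the dog sat on the rug " * 50).split()

for n, ratio in ratio_by_prefix(toy, [13, 130, 650]).items():
    print(n, ratio)
# 13 1.625
# 130 16.25
# 650 81.25
```

The vocabulary stays at 8 words while the ratio grows with length, which is why the ratio alone says little when the texts being compared differ in size.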

On the other hand, considering only the number of vocabulary words could also be misleading. For example, a writer may use synonyms to avoid repeating words in a passage or essay, which raises the vocabulary count without necessarily making the text harder.

To avoid these biases, it is important to look at lexical diversity in combination with vocabulary size.

In [19]:
results_df

Unnamed: 0,Grade_Level,Lexical_Diversity_Score,Vocabulary_Count
0,Grade 3,8.063455,4712
1,Grade 4,8.102245,10377
2,Grade 5,8.861712,14289


## References

- https://en.wikipedia.org/wiki/Lexical_diversity
- https://textinspector.com/help/lexical-diversity/
- Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python, Chapter 2 (importing books from a URL).