<font color = green >

# Natural Language Toolkit (NLTK)

## Home Task 
</font>

In [1]:
import nltk
from nltk.corpus import gutenberg 

#### Load the necessary corpus

In [2]:
moby_raw = gutenberg.raw('melville-moby_dick.txt') 

<font color = green >

### Example 1

</font>

How many tokens (words and punctuation symbols) are in `moby_raw`?
<br>*This function should return an integer.*

In [3]:
def example_one():
    from nltk.tokenize import word_tokenize
    return len(word_tokenize(moby_raw)) 

print('{:,}'.format(example_one()))

255,028


<font color = green >

### Example 2

</font>

How many unique tokens (unique words and punctuation) does `moby_raw` have?
<br>*This function should return an integer.*

In [4]:
def example_two():    
    return len(set(nltk.word_tokenize(moby_raw)))

print('{:,}'.format(example_two()))

20,742


<font color = green >

### Example 3

</font>

After lemmatizing the verbs, how many unique tokens does `moby_raw` have?
<br>*This function should return an integer.*


In [5]:
from nltk.stem import WordNetLemmatizer

def example_three():
    lemmatizer = WordNetLemmatizer()
    lemmatized = [lemmatizer.lemmatize(w,'v') for w in nltk.word_tokenize(moby_raw)]
    return len(set(lemmatized))

In [6]:
print('{:,}'.format(example_three()))

16,887


<font color = green >

### Question 1

</font>


What is the lexical diversity of the given text input, i.e., the ratio of unique tokens to the total number of tokens?
<br>*This function should return a float.*


In [7]:
def answer_one():
    tokens = nltk.word_tokenize(moby_raw)
    unique_tokens = set(tokens)
    return len(unique_tokens) / len(tokens)

answer_one()

0.08133224587104161
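
The ratio is easier to check by hand on a small input. Below is a minimal sketch, assuming a hypothetical helper `lexical_diversity` (not part of NLTK) and a hand-made token list instead of the tokenized novel:

```python
def lexical_diversity(tokens):
    """Return the ratio of unique tokens to total tokens (hypothetical helper)."""
    if not tokens:
        return 0.0  # assumption: treat the diversity of an empty text as 0
    return len(set(tokens)) / len(tokens)


# Toy token list: 6 tokens, 4 of them distinct ('the' and 'whale' repeat).
toy_tokens = ['the', 'whale', 'and', 'the', 'whale', '.']
toy_diversity = lexical_diversity(toy_tokens)
print(toy_diversity)
```

A value close to 1 means almost every token is new; a value close to 0 means the text reuses a small vocabulary heavily.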

<font color = green >

### Question 2

</font>

What percentage of tokens is 'whale' or 'Whale'?
<br>*This function should return a float.*

In [8]:
def answer_two():    
    tokens = nltk.word_tokenize(moby_raw)
    whale_count = tokens.count('whale') + tokens.count('Whale')
    whale_percentage = (whale_count / len(tokens)) * 100
    return whale_percentage

answer_two()

0.4125037250811676
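
The calculation can be checked on a small hand-made token list. In this sketch a single pass with a set of accepted spellings yields the same count as two `list.count` calls; the token list and variable names are illustrative only:

```python
# Made-up tokens, not taken from the novel.
toy_tokens = ['The', 'whale', ',', 'the', 'Whale', '!', 'a', 'ship']
accepted = {'whale', 'Whale'}

# Count every token that is one of the accepted spellings.
matches = sum(1 for token in toy_tokens if token in accepted)
percentage = 100 * matches / len(toy_tokens)
print(matches, percentage)  # 2 25.0
```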

<font color = green >

### Question 3

</font>

What are the 20 most frequently occurring (unique) tokens in the text? What is their frequency?
<br>*This function should return a list of 20 tuples where each tuple is of the form `(token, frequency)`. The list should be sorted in descending order of frequency.*

In [9]:
from nltk import FreqDist

def answer_three():
    tokens = nltk.word_tokenize(moby_raw)
    frequency_distribution = FreqDist(tokens)
    most_common_tokens = frequency_distribution.most_common(20)
    return most_common_tokens

answer_three()

[(',', 19204),
 ('the', 13715),
 ('.', 7306),
 ('of', 6513),
 ('and', 6010),
 ('a', 4545),
 ('to', 4515),
 (';', 4173),
 ('in', 3908),
 ('that', 2978),
 ('his', 2459),
 ('it', 2196),
 ('I', 2113),
 ('!', 1767),
 ('is', 1722),
 ('--', 1713),
 ('with', 1659),
 ('he', 1658),
 ('was', 1639),
 ('as', 1620)]
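
`FreqDist` is a subclass of `collections.Counter`, so `most_common` behaves the same way on a plain `Counter`. A small sketch on a made-up token list (the tokens are illustrative, not taken from the novel):

```python
from collections import Counter

toy_tokens = ['the', 'sea', ',', 'the', 'ship', ',', 'the', 'whale']
toy_counts = Counter(toy_tokens)

# most_common(n) returns (token, count) pairs sorted by descending count.
top_two = toy_counts.most_common(2)
print(top_two)  # [('the', 3), (',', 2)]
```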

<font color = green >

### Question 4

</font>

Which tokens have a length greater than 5 and a frequency of more than 150?
<br>*This function should return a sorted list of the tokens that match the above constraints. To sort your list, use `sorted()`*

In [10]:
def answer_four():
    tokens = nltk.word_tokenize(moby_raw)
    frequency_distribution = FreqDist(tokens)
    filtered_tokens = [token for token, freq in frequency_distribution.items() if len(token) > 5 and freq > 150]
    sorted_tokens = sorted(filtered_tokens)
    return sorted_tokens

answer_four()

['Captain',
 'Pequod',
 'Queequeg',
 'Starbuck',
 'almost',
 'before',
 'himself',
 'little',
 'seemed',
 'should',
 'though',
 'through',
 'whales',
 'without']

<font color = green >

### Question 5

</font>

Find the longest word in the text (`moby_raw`) and that word's length.
<br>
*This function should return a tuple `(longest_word, length)`.*


In [11]:
from nltk.corpus import words
nltk.download('words')

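def answer_five():
    # Searches the NLTK `words` word list, not the tokens of moby_raw.
    word_list = words.words()
    # max() with key=len returns the first entry of maximal length.
    longest_word = max(word_list, key=len)
    return (longest_word, len(longest_word))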

answer_five()

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\sviat\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


('formaldehydesulphoxylate', 24)
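
For comparison, the same `max(..., key=len)` idiom can be applied to a list of tokens rather than to a word list. This is a sketch on a made-up token list; the helper name `longest_token` is hypothetical:

```python
def longest_token(tokens):
    """Return (token, length) for the longest token; the first one wins on ties."""
    longest = max(tokens, key=len)
    return (longest, len(longest))


toy_tokens = ['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago']
print(longest_token(toy_tokens))  # ('Ishmael', 7)
```

Passing the output of `nltk.word_tokenize(moby_raw)` instead of `toy_tokens` would restrict the search to tokens that occur in the novel.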

<font color = green >

### Question 6

</font>

What unique words have a frequency of more than 2000? What is their frequency?
<br>*This function should return a list of tuples of the form `(frequency, word)` sorted in descending order of frequency.*


In [12]:
def answer_six():
    tokens = nltk.word_tokenize(moby_raw)
    frequency_distribution = FreqDist(tokens)
    result = [(freq, word) for word, freq in frequency_distribution.items() if freq > 2000 and word.isalpha()]
    result.sort(reverse=True)
    return result

answer_six()

[(13715, 'the'),
 (6513, 'of'),
 (6010, 'and'),
 (4545, 'a'),
 (4515, 'to'),
 (3908, 'in'),
 (2978, 'that'),
 (2459, 'his'),
 (2196, 'it'),
 (2113, 'I')]

<font color = green >

### Question 7

</font>

What is the average number of tokens per sentence?
<br>*This function should return a float.*

In [13]:
from nltk.tokenize import sent_tokenize, word_tokenize
import numpy as np

def answer_seven():
    sentences = sent_tokenize(moby_raw)
    tokens_per_sentence = [len(word_tokenize(sentence)) for sentence in sentences]
    average_tokens_per_sentence = np.mean(tokens_per_sentence)
    return average_tokens_per_sentence

answer_seven()

25.88591149005278
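
The mean of the per-sentence token counts equals the total token count divided by the number of sentences. A sketch with pre-split toy sentences, assuming whitespace splitting as a stand-in for `word_tokenize`:

```python
from statistics import mean

# Toy sentences, already separated; punctuation is spaced out so that
# str.split() can stand in for a real tokenizer (an assumption of this sketch).
toy_sentences = [
    'Call me Ishmael .',
    'It was a damp , drizzly November .',
]
tokens_per_sentence = [len(sentence.split()) for sentence in toy_sentences]

average = mean(tokens_per_sentence)
total_over_count = sum(tokens_per_sentence) / len(toy_sentences)
print(tokens_per_sentence, average, total_over_count)
```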

<font color = green >

### Question 8

</font>

What are the 5 most frequent parts of speech in this text? What is their frequency?
<br>*This function should return a list of tuples of the form `(part_of_speech, frequency)` sorted in descending order of frequency.*

In [14]:
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter

def answer_eight():
    tokens = word_tokenize(moby_raw)
    pos_tags = pos_tag(tokens)
    pos_frequencies = Counter(tag for (word, tag) in pos_tags)
    top_pos = pos_frequencies.most_common(5)
    return top_pos

result = answer_eight()
print(result)

[('NN', 32727), ('IN', 28662), ('DT', 25879), (',', 19204), ('JJ', 17613)]


<font color = green >

### Question 9

</font>

Create a spelling recommender that takes a list of misspelled words and recommends a correctly spelled word for every word in the list.

For every misspelled word, the recommender should find the word in `correct_spellings` that has the shortest edit distance (you may need to use `nltk.edit_distance(word_1, word_2, transpositions=True)`) and starts with the same letter as the misspelled word, and return that word as a recommendation.

The recommender should provide recommendations for the three words: `['cormulent', 'incendenece', 'validrate']`.
<br>*This function should return a list of length three:
`['cormulent_recommendation', 'incendenece_recommendation', 'validrate_recommendation']`.*

In [15]:
from nltk.corpus import words
from nltk import edit_distance

correct_spellings = set(words.words())

def answer_nine(default_words=('cormulent', 'incendenece', 'validrate')):
    recommendations = []
    for misspelled_word in default_words:
        possible_spellings = [word for word in correct_spellings if word.startswith(misspelled_word[0])]
        if possible_spellings:
            best_spellings = min(possible_spellings, key=lambda word: edit_distance(misspelled_word, word, transpositions=True))
            recommendations.append(best_spellings)
        else:
            recommendations.append('No recommendations found')
    return recommendations

answer_nine()

['corpulent', 'intendence', 'validate']
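
To make the distance measure concrete, here is a pure-Python sketch of an edit distance that counts an adjacent transposition as a single edit (the optimal string alignment variant). It is an illustration only and is not claimed to match `nltk.edit_distance` on every input; the function names `osa_distance` and `recommend` and the small word list are hypothetical.

```python
def osa_distance(a, b):
    """Edit distance where insert, delete, substitute and adjacent swap each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] holds the distance between a[:i] and b[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. 'ab' -> 'ba'.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


def recommend(misspelled, candidates):
    """Return the candidate with the same first letter and the smallest distance."""
    same_initial = [w for w in candidates if w.startswith(misspelled[0])]
    if not same_initial:
        return None
    return min(same_initial, key=lambda w: osa_distance(misspelled, w))


toy_dictionary = ['validate', 'valid', 'violate', 'corpulent']
print(osa_distance('ab', 'ba'))  # 1
print(recommend('validrate', toy_dictionary))  # validate
```

Filtering by first letter before computing distances keeps the number of comparisons small, which matters when the candidate list is a full dictionary.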