## Task 5

1. Download *Alice in Wonderland* by Lewis Carroll from Project Gutenberg:
   http://www.gutenberg.org/files/11/11-0.txt
2. Preprocess the text as needed: convert it to lower case, remove stop words and
   numbers / non-alphabetic characters, and lemmatize.
3. Find the top 10 most important words (for example, by TF-IDF) in each chapter,
   excluding "Alice". How would you name each chapter based on these tokens?
4. Find the top 10 most used verbs in sentences that mention Alice. What does Alice do most often?
5. (Optional) Find the top 100 most used verbs in sentences that mention Alice.
   Get word vectors from a pre-trained word2vec model, visualize them,
   and compare the words using their embeddings.

In [1]:
import math
import string
from collections import Counter

import nltk
import pandas as pd
from nltk import sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [2]:
nltk.download("wordnet")
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Sultan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sultan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Sultan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Sultan\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [3]:
# Load text into memory
with open("./11-0.txt", "r", encoding="utf8") as f_src:
    text = f_src.read()
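
Project Gutenberg files wrap the book in a license header and footer, and those tokens would otherwise take part in the analysis. A minimal sketch of trimming them follows. The `*** START OF` / `*** END OF` marker wording and the helper name `strip_gutenberg_boilerplate` are assumptions for illustration; check the markers against the downloaded file before relying on them.

```python
import re

# Assumed marker format; verify it against the actual file.
START_RE = re.compile(r"^\*\*\* ?START OF.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF.*$", re.MULTILINE)


def strip_gutenberg_boilerplate(raw):
    """Return the text between the START and END marker lines, if both exist."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    if start is None or end is None or end.start() < start.end():
        return raw  # markers not found: leave the text untouched
    return raw[start.end():end.start()].strip()


sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "CHAPTER I.\nBody text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

If the markers are missing, the function returns the input unchanged, so it is safe to call on any text.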

In [4]:
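# Clean text
def clean_text(txt):
    # Split on whitespace, then strip ASCII punctuation from every token
    tokens = txt.split()
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # Keep purely alphabetic tokens (drops numbers and mixed strings)
    tokens = [word for word in tokens if word.isalpha()]
    # Remove English stop words; the match is case-sensitive at this point
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    # Drop short tokens (three characters or fewer)
    tokens = [word for word in tokens if len(word) > 3]
    return tokens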

In [5]:
# Turn a text into clean tokens
words = clean_text(text)

In [6]:
# Lemmatization
wordnet_lemmatizer = WordNetLemmatizer()
lem_words = [wordnet_lemmatizer.lemmatize(word, pos='v') for word in words]

In [7]:
# Lowercase
lem_words = [word.lower() for word in lem_words] 
print(lem_words[:10])

['project', 'adventures', 'wonderland', 'lewis', 'carroll', 'this', 'ebook', 'anyone', 'anywhere', 'cost']
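
The token `this` survives in the output above because stop words are removed before the text is lowercased, and the NLTK stop-word list is lowercase, so `This` does not match. A minimal sketch of the alternative order (lowercase first, then filter) is shown here. The tiny `STOP_WORDS` set and the function name `clean_tokens` are stand-ins for illustration; the notebook itself uses `stopwords.words('english')`.

```python
import string

# Tiny stand-in stop-word list (an assumption, for illustration only).
STOP_WORDS = {"this", "is", "the", "for", "of", "at", "no"}


def clean_tokens(txt, stop_words=STOP_WORDS, min_len=4):
    """Lowercase first, then drop non-alphabetic, stop-word and short tokens."""
    table = str.maketrans("", "", string.punctuation)
    tokens = [w.translate(table).lower() for w in txt.split()]
    return [
        w for w in tokens
        if w.isalpha() and w not in stop_words and len(w) >= min_len
    ]


demo_tokens = clean_tokens("This eBook is for the use of anyone anywhere at no cost.")
print(demo_tokens)  # ['ebook', 'anyone', 'anywhere', 'cost']
```

Switching the notebook to this order would change the token lists and therefore the scores reported further down.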


In [8]:
# Chapters
# Start a new token list at every "chapter" token; tokens seen before the
# first one (the preamble) land in a throwaway list and are discarded.
chapters, tmp_chap = list(), list()
for word in lem_words:
    if word == "chapter":
        tmp_chap = list()
        chapters.append(tmp_chap)
    else:
        tmp_chap.append(word)

print(f"Total count of chapters: {len(chapters)}")

print("First 3 words per chapter:")
for chapter in chapters:
    print(chapter[:3])

Total count of chapters: 12
First 3 words per chapter:
['down', 'rabbithole', 'alice']
['pool', 'tears', 'cry']
['caucusrace', 'long', 'tale']
['rabbit', 'sends', 'little']
['advice', 'caterpillar', 'caterpillar']
['pepper', 'minute', 'stand']
['teaparty', 'there', 'table']
['viii', 'croquetground', 'large']
['mock', 'story', 'think']
['lobster', 'quadrille', 'mock']
['stole', 'tarts', 'king']
['evidence', 'cry', 'alice']
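
Splitting on the token `chapter` leaves heading residue such as `viii` in the token stream, and it would also split on any use of the word "chapter" in the body text. One possible alternative is to split the raw text on the heading lines before cleaning. The sketch below assumes headings of the form `CHAPTER I.` at the start of a line; `CHAPTER_RE` and `split_chapters` are illustrative names, and the heading layout should be checked against the actual file.

```python
import re

# Assumed heading format: "CHAPTER" plus a Roman numeral at the start of a line.
CHAPTER_RE = re.compile(r"^CHAPTER [IVXLC]+\.?.*$", re.MULTILINE)


def split_chapters(raw):
    """Split raw text into chapter bodies, dropping heading lines and the preamble."""
    parts = CHAPTER_RE.split(raw)
    return [part.strip() for part in parts[1:]]


sample = (
    "Preamble\n"
    "CHAPTER I.\nDown the Rabbit-Hole\nAlice was tired.\n"
    "CHAPTER II.\nThe Pool of Tears\nCurious!\n"
)
demo_chapters = split_chapters(sample)
print(len(demo_chapters))
print(demo_chapters[0])
```

Each chapter body can then be passed through `clean_text` separately, so the Roman numeral never reaches the token list.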


In [9]:
# TF
def get_tf(word, f_text):
    """Term frequency: occurrences of `word` divided by the number of tokens in `f_text`."""
    return Counter(f_text)[word] / len(f_text)

In [10]:
# IDF
def get_idf(word, chaps):
    """Inverse document frequency: log(N / number of chapters containing `word`)."""
    count = sum(1 for chap in chaps if word in chap)
    if count == 0:
        return 0
    return math.log(len(chaps) / count)
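
The two functions above compute tf(w, c) = count(w, c) / |c| and idf(w) = ln(N / df(w)), where N is the number of chapters and df(w) is the number of chapters containing w. Because `get_tf` rebuilds a `Counter` on every call, scoring a whole chapter repeats the same counting many times. The sketch below applies the same formulas to toy data while counting each document once; the function name `tfidf_per_doc` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_per_doc(docs):
    """docs: list of token lists. Returns one {word: tf-idf} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)  # counted once per document
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


toy = [
    ["rabbit", "hole", "rabbit"],
    ["rabbit", "tea"],
    ["queen", "tea", "queen", "queen"],
]
toy_scores = tfidf_per_doc(toy)
for doc_scores in toy_scores:
    print(sorted(doc_scores, key=doc_scores.get, reverse=True))
```

In the first toy document `hole` outranks `rabbit` even though `rabbit` occurs twice, because `rabbit` also appears in another document and is therefore less distinctive.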
    

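### Note

I would name each chapter after the first three tokens found in it: these largely come from the chapter's own title (for example, `pool`, `tears` in Chapter 2 or `lobster`, `quadrille` in Chapter 10).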

In [11]:
# Top-10 TF-IDF in chapters
chapters_list = list()
for chapter in chapters:
    vocab_chapters = Counter()
    for word in chapter:
        # Skip the protagonist's name, as the task requires
        if word != "alice":
            score = get_tf(word, chapter) * get_idf(word, chapters)
            vocab_chapters[word] = round(score, 5)
    chapters_list.append(vocab_chapters)

# enumerate avoids list.index(), which returns the first equal Counter
for number, chapter in enumerate(chapters_list, start=1):
    print(f"Chapter {number}")
    print(pd.DataFrame(chapter.most_common(10)))
    print()
    

Chapter 1
            0        1
0      either  0.01011
1        down  0.00906
2  rabbithole  0.00906
3        dark  0.00906
4         bat  0.00906
5      candle  0.00906
6      bottle  0.00871
7        fall  0.00745
8        fell  0.00674
9        burn  0.00653

Chapter 2
          0        1
0     mouse  0.01423
1      pool  0.01386
2     mabel  0.01242
3      swim  0.01213
4    gloves  0.00896
5       cat  0.00866
6      four  0.00693
7   capital  0.00672
8  ringlets  0.00621
9     andoh  0.00621

Chapter 3
            0        1
0        dodo  0.03248
1       mouse  0.02513
2        lory  0.01624
3       cause  0.01083
4        bird  0.00996
5        ring  0.00751
6      morcar  0.00751
7       earls  0.00751
8      mercia  0.00751
9  archbishop  0.00751

Chapter 4
         0        1
0    puppy  0.01501
1   window  0.01251
2   gloves  0.01083
3   bottle  0.01083
4  chimney  0.01001
5     mary  0.00751
6     rush  0.00751
7     bark  0.00751
8     pair  0.00722
9    stick  0.00722


### Note

Alice most often says something, goes somewhere, and thinks about something: `say` (59), `go` (32) and `think` (25) lead the verb counts.

In [12]:
# Verbs with 'Alice'
vocab_verbs = Counter()
sentences = sent_tokenize(text)

for sentence in sentences:
    # Apply the same cleaning, lemmatization and lowercasing as for the chapters
    sentence_words = clean_text(sentence)
    sentence_words = [wordnet_lemmatizer.lemmatize(w, pos='v') for w in sentence_words]
    sentence_words = [w.lower() for w in sentence_words]
    if "alice" in sentence_words:
        # Tag the cleaned tokens and count those tagged "VB" (verb, base form)
        for word, tag in nltk.pos_tag(sentence_words):
            if tag == "VB":
                vocab_verbs[word] += 1
pd.DataFrame(data=vocab_verbs.most_common(10), columns=["Verb", "Count"])

    Verb  Count
0    say     59
1     go     32
2  think     25
3   take     18
4   keep     12
5   find     11
6   tell     10
7  begin      9
8   make      8
9    get      8
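
The counts above include only tokens tagged exactly `VB`, the Penn Treebank tag for a base-form verb. The tag set has five more verb tags (`VBD`, `VBG`, `VBN`, `VBP`, `VBZ`), so a tagger run on unmodified sentences would label forms such as "said" differently. The sketch below contrasts an exact match with a prefix match. The `(word, tag)` pairs are written by hand for illustration and are not output from a real tagger run, and `count_verbs` is a hypothetical helper.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output on an
# unmodified sentence; illustrative only, not produced by a tagger.
tagged = [
    ("Alice", "NNP"), ("said", "VBD"), ("she", "PRP"), ("would", "MD"),
    ("go", "VB"), ("and", "CC"), ("was", "VBD"), ("thinking", "VBG"),
]


def count_verbs(tagged_tokens, exact_vb_only=False):
    """Count verbs by exact "VB" match or by any tag starting with "VB"."""
    counts = Counter()
    for word, tag in tagged_tokens:
        is_verb = tag == "VB" if exact_vb_only else tag.startswith("VB")
        if is_verb:
            counts[word.lower()] += 1
    return counts


print(sorted(count_verbs(tagged, exact_vb_only=True)))
print(sorted(count_verbs(tagged)))
```

With the prefix match, inflected forms are counted too, so the matched words would still need lemmatizing before they are aggregated into one count per verb.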
