## <ins>**Machine Learning Technologies**</ins> **- Task 2**: Natural Language Processing
__ITMO University__, St. Petersburg, Russia
- Name    : Rahman, Rasyad Rifatan <br>
- ID      : 458029

---

1. Download Alice in Wonderland by Lewis Carroll from Project Gutenberg's website http://www.gutenberg.org/files/11/11-0.txt
2. Perform any necessary preprocessing on the text, including converting to lower case, removing stop words, numbers / non-alphabetic characters, lemmatization.
3. Find Top 10 most important (for example, in terms of TF-IDF metric) words from each chapter in the text (not "Alice"); how would you name each chapter according to the identified tokens?
4. Find the Top 10 most used verbs in sentences with Alice. What does Alice do most often?

---


In [170]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import pos_tag
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter
import pandas as pd
import requests

In [171]:
import nltk

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Rasyad\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Rasyad\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Rasyad\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\Rasyad\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\Rasyad\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

---

*Access the text file through the internet*

In [172]:
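url = "https://www.gutenberg.org/files/11/11-0.txt"
response = requests.get(url, timeout=30)
response.raise_for_status()      # stop early on an HTTP error
response.encoding = "utf-8"      # decode as UTF-8, matching the local read below

alice = response.text.lower()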

---

*or locally*

In [173]:
with open("alice.txt",'r',encoding='utf-8') as file:
    alice = file.read().lower()
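
Text fetched straight from the URL also carries Project Gutenberg's header and license text, and the license would otherwise be counted as part of the last chapter. The following is a minimal sketch of trimming it, assuming the file marks the book with `*** START OF ...` and `*** END OF ...` lines. The function name `strip_gutenberg` and the sample string are hypothetical and not part of the notebook.

```python
import re

# Assumed marker lines; matched case-insensitively because the notebook lower-cases the text
START = re.compile(r"\*\*\* start of[^\n]*\n?", re.IGNORECASE)
END = re.compile(r"\*\*\* end of", re.IGNORECASE)


def strip_gutenberg(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    text = text.lstrip("\ufeff")               # drop a leading byte-order mark, if any
    start = START.search(text)
    begin = start.end() if start else 0        # the book begins after the marker line
    end = END.search(text, begin)
    stop = end.start() if end else len(text)   # the book ends before the END marker
    return text[begin:stop].strip()


# Made-up sample that mimics the layout of a Gutenberg file
sample = (
    "\ufeffthe project gutenberg ebook of a sample\n"
    "*** start of the project gutenberg ebook a sample ***\n"
    "chapter i.\nsome text\n"
    "*** end of the project gutenberg ebook a sample ***\n"
    "license text\n"
)
print(strip_gutenberg(sample))
```

Applied before the chapter split, this would keep the license text out of the last chapter's TF-IDF scores.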

---

Perform any necessary preprocessing on the text, including converting to lower case, removing stop words, numbers / non-alphabetic characters, lemmatization.

---

### **Data Wrangling**

In [174]:
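# Split on chapter headings. The capturing group keeps each heading in the output,
# so the list alternates: [preamble, title, content, title, content, ...]
alice = re.split(r'(chapter\s+[ivxlcdm]+\.\s+[^\n]*)\s*\n', alice)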
title = alice[1::2]
content = alice[2::2]

temp_df = pd.DataFrame(alice)
temp_df.head()

Unnamed: 0,0
0,﻿*** start of the project gutenberg ebook alic...
1,chapter i. down the rabbit-hole
2,
3,chapter ii. the pool of tears
4,


In [175]:
del temp_df

In [176]:
df = pd.DataFrame({'title':title,'content':content})
df.head()

Unnamed: 0,title,content
0,chapter i. down the rabbit-hole,
1,chapter ii. the pool of tears,
2,chapter iii. a caucus-race and a long tale,
3,chapter iv. the rabbit sends in a little bill,
4,chapter v. advice from a caterpillar,


In [177]:
# The first 12 rows are the table-of-contents entries, which have empty content; drop them
df = df.drop(df.index[0:12])
df = df.reset_index(drop=True)
df.head()

Unnamed: 0,title,content
0,chapter i.\ndown the rabbit-hole,alice was beginning to get very tired of sitti...
1,chapter ii.\nthe pool of tears,“curiouser and curiouser!” cried alice (she wa...
2,chapter iii.\na caucus-race and a long tale,they were indeed a queer-looking party that as...
3,chapter iv.\nthe rabbit sends in a little bill,"it was the white rabbit, trotting slowly back ..."
4,chapter v.\nadvice from a caterpillar,the caterpillar and alice looked at each other...
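
The split above relies on two details: `re.split` with a capturing group returns the headings along with the text between them, and the table-of-contents headings are followed by empty content. The sketch below replays this on a made-up miniature text and, as one possible alternative to dropping a fixed number of rows, keeps only the headings whose content is not empty. The sample string and the name `chapters` are assumptions for illustration.

```python
import re

# Made-up miniature of the book: a table of contents followed by two short chapters
sample = (
    "contents\n"
    "chapter i. down the rabbit-hole\n"
    "chapter ii. the pool of tears\n"
    "\n"
    "chapter i.\ndown the rabbit-hole\n\nalice was beginning to get very tired.\n\n"
    "chapter ii.\nthe pool of tears\n\ncuriouser and curiouser!\n"
)

# Same pattern as in the notebook
pattern = r'(chapter\s+[ivxlcdm]+\.\s+[^\n]*)\s*\n'
parts = re.split(pattern, sample)

# parts[0] is the preamble; after that, headings and contents alternate
titles, bodies = parts[1::2], parts[2::2]

# Table-of-contents headings have empty content, so filter on the content itself
chapters = [(t, b.strip()) for t, b in zip(titles, bodies) if b.strip()]

for title, body in chapters:
    print(repr(title), "->", body)
```

Filtering on empty content avoids hard-coding the number of chapters, at the cost of also discarding any real chapter that happened to be empty.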


### **Data Preprocessing**

In [178]:
def preprocess(text):
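    # Remove numbers and non-alphabetic characters.
    # Note: they are deleted, not replaced by a space, so words joined by a
    # hyphen or dash are merged (e.g. "queer-looking" becomes "queerlooking").
    text = re.sub(r'[^a-z\s]', '', text)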

    # Tokenize the text into words
    words = text.split()

    # Remove stop words
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

    # Lemmatize the words
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]

    # Join words back into a single string
    return ' '.join(words)

df['processed'] = df['content'].apply(preprocess)
print(df[['content', 'processed']].head())

                                             content  \
0  alice was beginning to get very tired of sitti...   
1  “curiouser and curiouser!” cried alice (she wa...   
2  they were indeed a queer-looking party that as...   
3  it was the white rabbit, trotting slowly back ...   
4  the caterpillar and alice looked at each other...   

                                           processed  
0  alice beginning get tired sitting sister bank ...  
1  curiouser curiouser cried alice much surprised...  
2  indeed queerlooking party assembled bankthe bi...  
3  white rabbit trotting slowly back looking anxi...  
4  caterpillar alice looked time silence last cat...  
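
The processed text contains merged tokens such as `queerlooking` and `bankthe`, because the regular expression deletes punctuation instead of replacing it with a space. A small sketch of the difference, using an illustrative sentence rather than the notebook's data:

```python
import re

# Illustrative sentence (the dash is written as \u2014), not read from the notebook's data
text = "a queer-looking party that assembled on the bank\u2014the birds, i'm sure"

# Deleting non-letters, as in the notebook's preprocess(), glues neighbours together
deleted = re.sub(r'[^a-z\s]', '', text).split()

# Replacing them with a space keeps the neighbours apart
spaced = re.sub(r'[^a-z\s]', ' ', text).split()

print(deleted)
print(spaced)
```

The space-based variant is only one option: it splits contractions such as "i'm" into "i" and "m", so short leftovers would still need to be filtered, for instance by the stop-word list or a minimum token length. Switching to it would also change the outputs shown in this notebook.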


---

Find Top 10 most important (for example, in terms of TF-IDF metric) words from each chapter in the text (not "Alice"); how would you name each chapter according to the identified tokens?

---

In [179]:
df["processed"] = df["processed"].apply(lambda x: re.sub(r'\balice\b', '', x))

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df["processed"])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())

In [180]:
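# Look up the vocabulary once instead of once per word
feature_names = vectorizer.get_feature_names_out()
top10 = []

for i in range(tfidf_matrix.shape[0]):
    # TF-IDF scores of chapter i as a flat array
    tfidf_scores = tfidf_matrix[i].toarray().flatten()
    # Indices of the 10 highest scores, highest first
    top_indices = tfidf_scores.argsort()[-10:][::-1]
    top10.append([feature_names[j] for j in top_indices])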

df['top10'] = top10
df['top10'].head(12)

0     [little, bat, door, key, eat, way, like, think...
1     [mouse, little, pool, im, swam, cat, dear, sai...
2     [mouse, said, dodo, prize, lory, dry, thimble,...
3     [window, little, bill, puppy, rabbit, glove, f...
4     [caterpillar, said, serpent, pigeon, im, youth...
5     [said, cat, footman, baby, mad, duchess, pig, ...
6     [hatter, dormouse, said, hare, march, twinkle,...
7     [queen, said, hedgehog, king, gardener, soldie...
8     [said, turtle, mock, gryphon, duchess, moral, ...
9     [turtle, mock, gryphon, said, dance, lobster, ...
10    [king, hatter, said, court, dormouse, witness,...
11    [said, king, jury, queen, sister, dream, would...
Name: top10, dtype: object
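
The word "said" shows up in most of these lists even though it occurs in many chapters. With scikit-learn's default settings the IDF is ln((1 + n) / (1 + df)) + 1, which never falls below 1, so a very frequent word can still outrank rarer ones. The pure-Python sketch below reproduces that default weighting (raw counts, smoothed IDF, L2 normalization) on two made-up documents; `tfidf_top` is a hypothetical helper, not the notebook's code, and ties are broken alphabetically rather than by `argsort`.

```python
import math
from collections import Counter


def tfidf_top(docs, k=3):
    """Rank each document's terms with scikit-learn's default TF-IDF weighting."""
    counts = [Counter(doc.split()) for doc in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF; equals 1 (not 0) for a term present in every document
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}

    ranked = []
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))  # L2 norm of the row
        scores = {t: w / norm for t, w in weights.items()}
        ranked.append(sorted(scores, key=lambda t: (-scores[t], t))[:k])
    return ranked


# Made-up documents: "said" occurs in both, the other words in only one
docs = [
    "said said said rabbit rabbit hole",
    "said said said mouse mouse pool",
]
top = tfidf_top(docs, k=2)
print(top)
```

If such common words are unwanted in the chapter names, one option would be to add them to the stop-word list or to pass `max_df` to `TfidfVectorizer`.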

Opening the DataFrame in Data Wrangler shows the full list of the top 10 words for each chapter.

![top10](top10.png)

Based on those top 10 words, here is what I would name each chapter:
- **Chapter 1**     : Either Way, I Think You’ll Like It
- **Chapter 2**     : Dear Little Mouse
- **Chapter 3**     : The Bird Knows the Prize
- **Chapter 4**     : The Little One by the Window
- **Chapter 5**     : The Father of Youth
- **Chapter 6**     : The Mad Baby
- **Chapter 7**     : The Marching of Time
- **Chapter 8**     : The King's Executioner
- **Chapter 9**     : Mockery of the Duchess
- **Chapter 10**    : Dance, o' Beautiful
- **Chapter 11**    : The Witness' Court
- **Chapter 12**    : Dreams of a Rabbit

---

Find the Top 10 most used verbs in sentences with Alice. What does Alice do most often?

---

In [181]:
with open("alice.txt",'r',encoding='utf-8') as file:
    alice = file.read().lower()

In [182]:
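# Split the text into sentences after ., ! or ? and keep those that mention Alice
sentences_with_alice = [sentence for sentence in re.split(r'(?<=[.!?])\s+', alice) if 'alice' in sentence]
# Tokenize each kept sentence into words
sentences_with_alice = [word_tokenize(sentence) for sentence in sentences_with_alice]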

def remove_stopwords(words):
    stop_words = set(stopwords.words('english'))
    return [word for word in words if word.lower() not in stop_words]

sentences_with_alice = [remove_stopwords(sentence) for sentence in sentences_with_alice]

def lemmatize_words(words):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos='v') for word in words]

sentences_with_alice = [lemmatize_words(sentence) for sentence in sentences_with_alice]

sentences_with_alice = [' '.join(words) for words in sentences_with_alice]

In [183]:
all_verbs = []
for sentence in sentences_with_alice:
    words = word_tokenize(sentence)
    pos_tags = pos_tag(words)
    
    # Keep alphabetic tokens whose Penn Treebank tag starts with 'V' (VB, VBD, VBG, ...)
    verbs = [word.lower() for word, tag in pos_tags if tag.startswith('V') and word.isalpha()]
    all_verbs.extend(verbs)
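
One caveat with the cells above: the tagger sees sentences that have already lost their stop words and been lemmatized, so it works with less context than a full sentence provides. A possible alternative, sketched here under assumptions, tags the untouched tokens first and lemmatizes only the tokens tagged as verbs. The helper `count_verbs` and its `tag` / `lemmatize` parameters are hypothetical; with NLTK they could be `pos_tag` and `lambda w: WordNetLemmatizer().lemmatize(w, pos='v')`. The toy tagger below only stands in for NLTK so that the example runs without downloads, and its counts are not results from the book.

```python
from collections import Counter


def count_verbs(sentences, tag, lemmatize, keyword="alice"):
    """Count verb lemmas in the tokenized sentences that contain `keyword`.

    `tag` maps a token list to (token, tag) pairs; `lemmatize` maps a verb to its lemma.
    """
    counts = Counter()
    for tokens in sentences:
        if keyword not in (t.lower() for t in tokens):
            continue
        # Tag the full, unmodified sentence so the tagger keeps its context
        for word, pos in tag(tokens):
            if pos.startswith("VB") and word.isalpha():
                counts[lemmatize(word.lower())] += 1
    return counts


# Toy stand-ins for a real tagger and lemmatizer (assumptions for illustration)
toy_tags = {"said": "VBD", "went": "VBD", "thought": "VBD"}
toy_lemmas = {"said": "say", "went": "go", "thought": "think"}


def toy_tag(tokens):
    return [(t, toy_tags.get(t, "NN")) for t in tokens]


def toy_lemmatize(word):
    return toy_lemmas.get(word, word)


sentences = [
    ["alice", "said", "nothing"],
    ["the", "rabbit", "went", "away"],          # no "alice": skipped
    ["alice", "thought", "and", "said", "so"],
]
verb_counts = count_verbs(sentences, toy_tag, toy_lemmatize)
print(verb_counts.most_common())
```

Run on the book with NLTK's tagger, this ordering would likely give counts that differ from the ones reported below.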

In [184]:
verb_counts = Counter(all_verbs).most_common(10)

print("10 Most Mentioned Verbs with Alice")
for verb, freq in verb_counts:
    print(f"{verb}: {freq}")

10 Most Mentioned Verbs with Alice
say: 294
go: 85
think: 58
get: 47
know: 46
make: 36
see: 31
take: 31
find: 28
come: 25

By these counts, "say" is by far the most frequent verb in sentences that mention Alice (294 occurrences), followed by "go" (85) and "think" (58). This suggests that what Alice does most often is talk.
