
# Text preprocessing

**Data URL**

In [1]:
#DATA_URL = "https://www.gutenberg.org/files/913/913-0.txt"
DATA_URL = "https://www.gutenberg.org/files/44117/44117-h/44117-h.htm"

In [2]:
! pip install -q nltk==3.2.5


Loading *NLTK*'s 'wordnet'

In [3]:
import nltk
from nltk.corpus import wordnet
nltk.download("wordnet")

lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()
lemmatizer

[nltk_data] Downloading package wordnet to /root/nltk_data...


<WordNetLemmatizer>

Downloading the text for the task via `urllib`

In [4]:
import urllib.request

# urlopen replaces the long-deprecated URLopener class
with urllib.request.urlopen(DATA_URL) as resource:
  charset = resource.headers.get_content_charset()
  raw_text = resource.read()
print("Charset", charset)

# Fall back to UTF-8 when the server does not declare a charset
raw_text = raw_text.decode(charset or "utf-8")

raw_text[:100]

Charset None


'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\r\n"http://www.w3.org/TR/xhtml1/DTD/xhtml1-st'
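
The decoding step above can be isolated into a small helper. This is only a sketch: `decode_body` and its parameters are hypothetical names that are not part of the assignment, and the helper assumes UTF-8 is an acceptable fallback when no charset is declared (as happened here, where the charset was `None`).

```python
def decode_body(body, declared_charset=None, fallback="utf-8"):
    """Decode raw response bytes, preferring the charset declared by the server."""
    encoding = declared_charset or fallback
    try:
        return body.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or undecodable bytes: use the fallback,
        # replacing any bytes that still cannot be decoded.
        return body.decode(fallback, errors="replace")


print(decode_body("naïve".encode("utf-8")))               # no charset declared
print(decode_body("naïve".encode("latin-1"), "latin-1"))  # charset declared
```

Keeping the fallback in one place makes it easier to spot when a page was decoded with a guessed encoding rather than a declared one.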

Removing the book ending (Gutenberg legal information)

In [5]:
import re

# "." does not match newlines, so flatten the text before cutting everything after the marker
clean_pattern = re.compile("End of the Project Gutenberg EBook.*")
cleaner_text = clean_pattern.sub("", raw_text.replace("\n", " ").replace("\r", " "))
cleaner_text[-100:]

'on-commissioned {pg 362}</td></tr>  </table>    <hr class="full" />                <pre>            '
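
Why the newlines are replaced before applying the pattern can be seen on a toy string. The sketch below uses an invented three-line `sample`, not the actual book text: by default `.` in a regular expression does not match `\n`, so `.*` stops at the end of the marker line unless the text is flattened first.

```python
import re

END_MARKER = re.compile("End of the Project Gutenberg EBook.*")

# Invented sample text, used only to illustrate the behaviour
sample = "Last line of the novel.\nEnd of the Project Gutenberg EBook\nLicense text follows."

# Without flattening, '.*' stops at the first newline, so the license text survives
print(repr(END_MARKER.sub("", sample)))

# After flattening, '.*' consumes everything that follows the marker
flattened = sample.replace("\n", " ").replace("\r", " ")
print(repr(END_MARKER.sub("", flattened)))
```

Compiling the pattern with `re.DOTALL` would be another way to let `.` match newlines without altering the text.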

Splitting the text into tokens with a little help from [NLTK](https://nltk.org/).

In [6]:
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

tokens = word_tokenize(cleaner_text)
"A total of %d 'tokens'" % len(tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


"A total of 139343 'tokens'"

Now we **lemmatize the tokens**. For better results, the text should first be PoS-tagged (e.g. with NLTK; see [chapter 5 of the NLTK book](https://www.nltk.org/book/ch05.html)), because `WordNetLemmatizer` works best when PoS tags are provided. To keep things short and simple, we skip that step here.
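
For reference, PoS-aware lemmatization usually needs a small mapping from tagger output to the part-of-speech letters that WordNet expects. The helper below is a sketch under assumptions: `penn_to_wordnet` is a hypothetical name, and it assumes the tagger emits Penn Treebank tags (as `nltk.pos_tag` does for English). Since `lemmatize` treats every token as a noun unless told otherwise, this default is one plausible reason why forms such as `wa` and `ha` can appear among the lemmas.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


# Hypothetical usage with NLTK (needs the tagger data to be downloaded first):
#   tagged = nltk.pos_tag(tokens)
#   lemmas = [lemmatizer.lemmatize(tok, pos=penn_to_wordnet(tag)) for tok, tag in tagged]
for tag in ("NN", "VBD", "JJR", "RB", "CD"):
    print(tag, "->", penn_to_wordnet(tag))
```

This step is not used in the tasks below, so the expected counts stay tied to the untagged lemmatization.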



---


## Task \#1
Using `str.isalpha` from Python's standard library, modify the code below to remove all non-letter tokens.

---



In [7]:
!python3 -m nltk.downloader wordnet
!unzip /root/nltk_data/corpora/wordnet.zip -d /root/nltk_data/corpora/

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
Archive:  /root/nltk_data/corpora/wordnet.zip
   creating: /root/nltk_data/corpora/wordnet/
  inflating: /root/nltk_data/corpora/wordnet/lexnames  
  inflating: /root/nltk_data/corpora/wordnet/data.verb  
  inflating: /root/nltk_data/corpora/wordnet/index.adv  
  inflating: /root/nltk_data/corpora/wordnet/adv.exc  
  inflating: /root/nltk_data/corpora/wordnet/index.verb  
  inflating: /root/nltk_data/corpora/wordnet/cntlist.rev  
  inflating: /root/nltk_data/corpora/wordnet/data.adj  
  inflating: /root/nltk_data/corpora/wordnet/index.adj  
  inflating: /root/nltk_data/corpora/wordnet/LICENSE  
  inflating: /root/nltk_data/corpora/wordnet/citation.bib  
  inflating: /root/nltk_data/corpora/wordnet/noun.exc  
  inflating: /root/nltk_data/corpora/wordnet/verb.exc  
  inflating: /root/nltk_data/corpora/wordnet/README  
  inflating: /root/nltk_data/corpora/wordnet/index.sense 

In [8]:
from tqdm.notebook import tqdm

# Keep only purely alphabetic tokens, then lemmatize each of them
lemmas = [lemmatizer.lemmatize(token) for token in tqdm(tokens) if token.isalpha()]
lemmas[-10:]


['mysterious',
 'benefactor',
 'pg',
 'tr',
 'td',
 'center',
 'pg',
 'hr',
 'full',
 'pre']

In [9]:
len(lemmas)

96220

Before continuing with the assignment, make sure your counts match the ones above: 139343 tokens and 96220 lemmas.
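
The effect of the `str.isalpha` filter from Task #1 can be checked on a handful of tokens. The list below is invented for illustration. Note that `isalpha` accepts any string made up only of letters, including non-ASCII ones, which is also why leftover HTML tag names such as `td` or `tr` pass the filter.

```python
# Invented sample tokens, not taken from the tokenizer output
sample_tokens = ["Romashov", ",", "1905", "non-commissioned", "td", "don't", "été", "<"]

# str.isalpha() is True only for non-empty strings in which every character is a letter
kept = [tok for tok in sample_tokens if tok.isalpha()]
dropped = [tok for tok in sample_tokens if not tok.isalpha()]

print(kept)
print(dropped)
```

Hyphenated words and contractions are dropped whole rather than split, which accounts for part of the gap between the token count and the lemma count.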


---

## Task #2

Using `lemmas`, the English NLTK stopwords (`nltk.corpus.stopwords`) and `nltk.FreqDist`, compute **the fraction of stopwords** among the 100 most frequent words in the text.


**Googling how to work with NLTK is encouraged.**

---



In [12]:
import nltk
from nltk import FreqDist
from nltk.corpus import stopwords

nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))

# Build the frequency distribution over lowercased lemmas
fdist = FreqDist(word.lower() for word in lemmas)

# Sort (word, count) pairs by count, most frequent first
sortedDict = sorted(fdist.items(), key=lambda x: x[1], reverse=True)
top_100 = [word for word, _ in sortedDict[:100]]
print(top_100)

# Count how many of the top-100 words are stopwords
cnt = sum(1 for word in top_100 if word in STOPWORDS)

print(cnt / len(top_100))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
['the', 'a', 'and', 'of', 'to', 'in', 'p', 'his', 'he', 'i', 'with', 'you', 'mdash', 'that', 'wa', 'it', 'at', 'romashov', 'but', 's', 'on', 'is', 'for', 'not', 'this', 'all', 'by', 'had', 'from', 'him', 'my', 'an', 'one', 'what', 'were', 'be', 'which', 'me', 'your', 'have', 'who', 'are', 'her', 'now', 'no', 'their', 'there', 'up', 'do', 'so', 'or', 'time', 'if', 'out', 'they', 'officer', 'when', 'himself', 'eye', 'she', 'then', 'will', 'only', 'into', 'like', 'how', 't', 'know', 'more', 'we', 'about', 'voice', 'little', 'hand', 'thought', 'after', 'other', 'even', 'well', 'can', 'over', 'ha', 'company', 'very', 'come', 'soldier', 'down', 'them', 'am', 'life', 'could', 'same', 'word', 'room', 'once', 'been', 'last', 'here', 'said', 'long']
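
The same computation can be sketched with the standard library alone, which makes the logic easy to verify on a tiny input. Everything here is illustrative: `stopword_fraction`, the toy sentence and the four-word stopword set are assumptions, and `collections.Counter` stands in for `nltk.FreqDist`, which behaves much like it.

```python
from collections import Counter


def stopword_fraction(words, stopword_set, top_n):
    """Fraction of the top_n most frequent words that are stopwords."""
    counts = Counter(w.lower() for w in words)
    top_words = [w for w, _ in counts.most_common(top_n)]
    hits = sum(1 for w in top_words if w in stopword_set)
    # Guard against an empty input to avoid dividing by zero
    return hits / len(top_words) if top_words else 0.0


# Toy data: 'the' occurs 4 times, 'officer' and 'and' twice each
toy_words = "the officer and the soldier saw the regiment and the officer".split()
toy_stopwords = {"the", "and", "a", "of"}
print(stopword_fraction(toy_words, toy_stopwords, top_n=3))
```

Dividing by the number of top words actually found, rather than by a hard-coded 100, keeps the fraction correct even for texts with fewer distinct words than `top_n`.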



---


## Task #3
Compute how many distinct tokens occur in the text **strictly more than** 50 times.

---



In [11]:
more_than_50 = [word for word, count in sortedDict if count > 50]
len(more_than_50)

236
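
The strictness of the comparison matters at the boundary. The sketch below uses an invented `Counter` and a hypothetical helper `count_above` to show that a word occurring exactly 50 times is not counted.

```python
from collections import Counter


def count_above(counts, threshold):
    """Number of distinct words whose frequency is strictly greater than threshold."""
    return sum(1 for freq in counts.values() if freq > threshold)


# Invented frequencies chosen to sit on both sides of the threshold
toy_counts = Counter({"the": 51, "officer": 50, "eye": 49, "and": 120})
print(count_above(toy_counts, 50))  # 'officer', at exactly 50, is excluded
```

Using `>=` instead of `>` would include `officer` as well and change the result, so an off-by-one answer in this task is worth checking against the comparison operator first.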