<a href="https://colab.research.google.com/github/zhxkpo/NLP_2024/blob/main/10_Tokenization_VariousWays.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#@markdown # 📚 **Python packages & 🔪Preprocessing**
from IPython.display import display
import ipywidgets as widgets
import requests

def on_button_click(button):
    sn = int(button.description) - 1
    image.value = requests.get(urls[sn]).content

urls = [ "https://raw.githubusercontent.com/junkyuhufs/workshop/main/slide.07.png",
         "https://raw.githubusercontent.com/junkyuhufs/workshop/main/slide.08.png",
         "https://raw.githubusercontent.com/junkyuhufs/workshop/main/slide.11.png"
]

button_layout = widgets.Layout(width='50px', height='30px')

buttons = [widgets.Button(description=str(i), layout=button_layout) for i in range(1, 4)]
for button in buttons:
    button.on_click(on_button_click)

image = widgets.Image(value=requests.get(urls[0]).content, width="700", height="600")

display(widgets.HBox([image, widgets.VBox(buttons)]))


# 📖  Tokenization
### 1️⃣ Sentence tokenization or sentence segmentation ⤵️

* Within a corpus, sentence boundaries can mostly be predicted from the sentence delimiters ("!", "?", ".").
* "." has many exceptions in which it does not end a sentence (e.g., IP 192.168.56.31, email account python@gmail.com, Ph.D, etc.).
* Language-specific conventions, special characters, and typos make reliable rules hard to find.
* The NLTK package provides the sent_tokenize() function for this task.
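
To make the "." exceptions concrete, here is a minimal sketch that uses only Python's `re` module instead of NLTK. The rule it applies (a sentence ends at ".", "!" or "?" followed by whitespace) is an assumption chosen for illustration, not how `sent_tokenize()` works.

```python
import re

message = (
    "I'm actively looking for Ph.D. students, and you are a Ph.D student. "
    "Visit IP 192.168.56.31 and send the results to my email account. "
    "It's python@gmail.com."
)

# Naive rule: a sentence ends at ".", "!" or "?" followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", message)

for sentence in naive_sentences:
    print(repr(sentence))
```

The IP address and the email address survive because their periods are not followed by whitespace, but the rule wrongly cuts the first sentence after "Ph.D.". Handling that case needs knowledge about abbreviations, which is what a trained sentence tokenizer adds.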

>
### 2️⃣ Small-unit tokenization ⤵️

* **Corpus data (e.g., crawled text) should be <font color = 'red'> preprocessed </font> before further analysis by means of <font color = 'red'> Cleaning, Tokenization, & Normalization</font>**.
>
* Simplest tokenization: remove punctuation, then split on whitespace
>
* [**English Tokenization**](https://wikidocs.net/21698)

  - <font color = 'red'> **Cleaning**</font>
    * Remove punctuation (e.g., ".", ",", "?", "!", ";", ":")
    * Remove special characters
    * Line-break markers and similar residue can also be cleaned

  - <font color = 'red'> **Tokenization**</font>

    * Tokenization: splitting a given corpus into units called tokens (e.g., words, meaningful strings)
    * The unit of a token depends on the situation, but a token is usually defined as a meaningful unit.
    * Depending on the tokenizing function used, apostrophes, hyphens, and the like are either kept as part of a token or removed.

  - <font color = 'red'> **Normalization**</font>
    * Stemming: am → 'am', having → 'hav'
                operation, operative, operating, operational → oper
    * Lemmatization: am → 'be', having → 'have'
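
The three preprocessing steps can be sketched with the standard library alone. The sample sentence and the variable names are assumptions made for this illustration, and the normalization step is limited to case folding (no stemming or lemmatization).

```python
import string

raw_text = "The boy's cars are different colors;\naren't they?"

# Cleaning: replace line breaks with spaces, then delete ASCII punctuation.
delete_punctuation = str.maketrans("", "", string.punctuation)
cleaned = raw_text.replace("\n", " ").translate(delete_punctuation)

# Tokenization: split on whitespace.
tokens = cleaned.split()

# Normalization: case folding only.
normalized = [token.lower() for token in tokens]

print(tokens)
print(normalized)
```

Deleting every punctuation mark turns "boy's" into "boys" and "aren't" into "arent", which is one reason the tokenizers introduced later treat apostrophes and hyphens more carefully.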


### More about <font color = 'red'> **Normalization**
* [Stemming and lemmatization](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

The goal of both stemming and lemmatization is to reduce **inflectional** forms and sometimes **derivationally** related forms of a word to a **common base form**.

[Table 1] organize (vt)

|Inflection|base|grammatical functions|inflected forms|
|:--|--|--|--|
|1.|organize (vt.)|3rd per. sg.|organizes|
|2.||progressive|organizing|
|3.||past|organized|
|4.||past participle|organized|

[Table 2] car (noun)

|Inflection|base|grammatical functions|inflected forms|
|:--|--|--|--|
|1.|car (noun)|plural|cars|
|2.||possessive|car's|

[Table 3] big (adjective)

|Inflection|base|grammatical functions|inflected forms|
|:--|--|--|--|
|1.|big (adjective)|comparative|bigger|
|2.||superlative|biggest|

[Table 4] combine (vt)

|Derivation|base|grammatical category|derived forms|
|:--|--|--|--|
|1.|combine (verb)|verb|recombine|
|2.||noun|combination|
|3.||adjective|combinational|

[Table 5] be (vi)

|copular be verb|base|subject-verb agreement|forms|
|:--|--|--|--|
|1.|be (verb)|1st sg., present/past|am/was|
|2.||2nd sg. & all pl., present/past|are/were|
|3.||3rd sg., present/past|is/was|
<p> </p>

### Mapping of text

>* The boy's cars are different colors $\Rightarrow$
the boy car be differ color


- **Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the **<em>removal of derivational affixes</em>**.

- **Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to **remove inflectional endings** only and to return the **base or dictionary form** of a word, which is known as the lemma.

If confronted with the token 'saw', stemming might return just 's', whereas lemmatization would attempt to return either 'see'(verb) or 'saw'(noun) depending on whether the use of the token was as a verb or a noun. Linguistic processing for stemming or lemmatization is often done by an additional plug-in component to the indexing process, and a number of such components exist, both commercial and open-source. The most common algorithm for **stemming English**, and one that has repeatedly been shown to be empirically very effective, is **Porter's algorithm (Porter, 1980)**.

You can use a **lemmatizer**, a tool from Natural Language Processing which does full morphological analysis to <em>accurately identify the lemma for each word</em>. <font color = 'blue'> Doing full morphological analysis produces at most very modest benefits for retrieval</font>. It is hard to say more, because either form of normalization tends not to improve English information retrieval performance in aggregate - at least not by very much. While it helps a lot for some queries, it equally hurts performance a lot for others. <font color = 'red'> Stemming increases recall while harming precision.</font>

>The **Porter stemmer** stems all of the following words:
>
>>'operate', 'operating', 'operates', 'operation', 'operative', 'operatives', 'operational'
$\Rightarrow$ 'oper'.
>
> Defect $\Rightarrow$ We lose considerable precision on queries such as the following with Porter stemming:
>> "operational and research", "operating and system", "operative and dentistry"

For a case like this, moving to using a lemmatizer would not completely fix the problem because **particular inflectional forms are used in particular <font color = 'blue'>collocations</font>**: a sentence with the words operate and system is not a good match for the query operating and system. Getting better value from term normalization depends more on pragmatic issues of word use than on formal issues of linguistic morphology.
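
The recall and precision trade-off can be illustrated with a small sketch. The lookup table below simply copies the stems quoted above and stands in for a real stemmer. The document string and the matching rule (every query term except "and" must occur in the document) are assumptions made for this example.

```python
# Stems copied from the Porter example above; this lookup table is a stand-in
# for a real stemmer, not an implementation of Porter's algorithm.
OPER_FAMILY = ["operate", "operating", "operates", "operation",
               "operative", "operatives", "operational"]
STEMS = {word: "oper" for word in OPER_FAMILY}

def stem(token):
    return STEMS.get(token, token)

def terms(text, use_stemming):
    tokens = text.lower().split()
    return {stem(t) if use_stemming else t for t in tokens}

def matches(query, document, use_stemming):
    # A document matches when it contains every query term except "and".
    query_terms = terms(query, use_stemming) - {"and"}
    return query_terms <= terms(document, use_stemming)

query = "operating and system"
document = "we operate the system every day"  # hypothetical document

exact = matches(query, document, use_stemming=False)
stemmed = matches(query, document, use_stemming=True)
print(exact, stemmed)
```

Without stemming the document is not retrieved. With stemming it is, although it has little to do with operating systems: more documents are found (recall), but some of them are poor matches (precision).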

# 📖  Tokenization

### 💡 **For your information**

The **'punkt'** resource holds the pretrained models behind NLTK's sentence tokenizer, which **splits text into individual sentences**. `word_tokenize()` depends on it as well, because it splits the text into sentences before splitting each sentence into words. Once the 'punkt' resource is downloaded, you can use the tokenizers of the **nltk.tokenize module** in your code (newer NLTK releases may ask for the 'punkt_tab' resource instead, as in Activity 1 below).

### 🆘 What is PunktSentenceTokenizer?
* [For more information read the original article](https://www.askpython.com/python-modules/nltk-punkt)

In NLTK, Punkt is an <font color = 'brown'> **unsupervised trainable model**</font>, which means it can be **trained on unlabeled data** (data that has not been tagged with information identifying its characteristics, properties, or categories).

It generates a list of sentences from a text by building a model for **abbreviations, collocations, and words that start sentences** using an unsupervised algorithm. Before it can be used, it has to be trained on a sizable amount of plain text in the intended language.

🚯 Caution should be taken
* `nltk.sent_tokenize` performs sentence segmentation/tokenization with the punkt model. <font color = 'blue'> punkt is a model trained on sentence structure: it has learned which "." belongs to an abbreviation (e.g., Ph.D.) and which "." is a full stop.</font> <font color = 'brown'> It basically splits sentences at periods, but abbreviations such as Ph.D., Saint., and Professor. are learned as known abbreviations and treated as a single word.</font> However, the punkt model has not learned every abbreviation, so expressions such as Vol. 13 and Apr. 13, and complex abbreviations such as U.S. Pat. No. 134, are not known abbreviations and are all split apart.
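
The known-abbreviation idea can be sketched without NLTK. The function `split_sentences` and the abbreviation set below are hypothetical names introduced for this illustration. This is not the Punkt algorithm, which learns its abbreviation list from data rather than receiving it by hand.

```python
import re

def split_sentences(text, known_abbreviations):
    """Split at ., ! or ? unless the token ending there is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?](?=\s|$)", text):
        candidate = text[start:match.end()]
        last_token = candidate.split()[-1].lower()
        if last_token in known_abbreviations:
            continue  # the period belongs to an abbreviation, keep scanning
        sentences.append(candidate.strip())
        start = match.end()
    remainder = text[start:].strip()
    if remainder:
        sentences.append(remainder)
    return sentences

text = "See Vol. 13 for details. It appeared on Apr. 13 last year."
without_list = split_sentences(text, known_abbreviations=set())
with_list = split_sentences(text, known_abbreviations={"vol.", "apr."})

print(without_list)
print(with_list)
```

With an empty abbreviation set the text falls apart at "Vol." and "Apr."; once both are listed, only the two real sentence boundaries remain.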

In [2]:
#@markdown #### 🐹 **Student's Activity 0** ⤵️
#@markdown 👀🐾 **Install <font color = 'red'> NLTK</font> package and download  <font color = 'red'> punkt </font> package.**
!pip install nltk
import nltk
nltk.download('punkt')



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
#@markdown #### 🐹 **Student's Activity 1** ⤵️
#@markdown 👀🐾 **Call <font color = 'red'> sent_tokenize() </font> function**

#@markdown 🔎 **Exercise for Various Periods**

import nltk
nltk.download('punkt_tab')

message = "I'm actively looking for Ph.D. students, \
and you are a Ph.D student. \
Visit IP 192.168.56.31 \
and send the results to my email account. \
It's python@gmail.com."

from nltk.tokenize import sent_tokenize
sentence = sent_tokenize(message)
print('Sentence tokenization: %s' %sentence)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Sentence tokenization: ["I'm actively looking for Ph.D. students, and you are a Ph.D student.", 'Visit IP 192.168.56.31 and send the results to my email account.', "It's python@gmail.com."]


In [None]:
#@markdown #### 🐹 **Student's Activity 2** ⤵️
#@markdown 🔎 **Additional exercise for sentence tokenization**

text = 'Here’s to the crazy ones, the misfits, the rebels, the troublemakers, the round pegs in the square holes. \
The ones who see things differently — they’re not fond of rules. \
I wanted to pay with a twenty-dolloar bill; however, she couldn’t get cash. \
I like Brown’s East back pack,\
I’ve got a big trouble, but other people were having lots of fun. \
Anna likes Brown’s East back pack,\
but her brother doesn’t.'

from nltk.tokenize import sent_tokenize
sentence = sent_tokenize(text)
print('Sentence tokenization: %s' %sentence)

Sentence tokenization: ['Here’s to the crazy ones, the misfits, the rebels, the troublemakers, the round pegs in the square holes.', 'The ones who see things differently — they’re not fond of rules.', 'I wanted to pay with a twenty-dolloar bill; however, she couldn’t get cash.', 'I like Brown’s East back pack,I’ve got a big trouble, but other people were having lots of fun.', 'Anna likes Brown’s East back pack,but her brother doesn’t.']


##⏰ <font color = 'purple'> **Prerequisite Step!**

1. Download the text file from the [data_misc repository](https://github.com/ms624atyale/Data_Misc/blob/main/text_symbol_sample.txt).
2. Create a folder under [sample_data].
3. Upload the downloaded file, "text_symbol_sample", to that folder.

### **Tokenization**
#### 🐹 1. Simplest way to tokenize a text into small units

In [5]:
#@markdown 🐹 **Student's Activity 1** ⤵️

#@markdown 👀🐾 **Open the file with the open( ) function and assign the file object to `text`.**
text = open('/content/sample_data/Crime_Punishment_Sample.txt','rt')

In [6]:
#@markdown 👀🐾 **Read the file contents into the `temp` variable.**

temp = text.read()
print(temp)

He had successfully avoided meeting his landlady on the stairs. His closet of a room was under the roof of a high, five-floor house and was more like a cupboard than a place in which to live. The landlady who provided him with the room and with dinner and service lived on the floor below, and every time he went out he was obliged to pass her kitchen, the door of which was always open. And each time he passed, the young man had a sick, frightened feeling, which made him grimace and feel ashamed. He was hopelessly in debt to his landlady and was afraid of meeting her.

This was not because he was cowardly and browbeaten, quite the contrary; but for some time past he had been in an overstrained irritable condition, verging on hypochondria. He had become so completely absorbed in himself and isolated from everyone else that he dreaded meeting not only his landlady, but anyone at all. He was crushed by poverty, but even the anxieties of his position had recently ceased to weigh upon him. He

In [7]:
#@markdown 👀🐾 **Assign the cleaned text (line breaks replaced with spaces) to the `obj` variable.**

obj = temp.replace("\n", " ")

In [8]:
print(obj)

He had successfully avoided meeting his landlady on the stairs. His closet of a room was under the roof of a high, five-floor house and was more like a cupboard than a place in which to live. The landlady who provided him with the room and with dinner and service lived on the floor below, and every time he went out he was obliged to pass her kitchen, the door of which was always open. And each time he passed, the young man had a sick, frightened feeling, which made him grimace and feel ashamed. He was hopelessly in debt to his landlady and was afraid of meeting her.  This was not because he was cowardly and browbeaten, quite the contrary; but for some time past he had been in an overstrained irritable condition, verging on hypochondria. He had become so completely absorbed in himself and isolated from everyone else that he dreaded meeting not only his landlady, but anyone at all. He was crushed by poverty, but even the anxieties of his position had recently ceased to weigh upon him. He

In [9]:
#@markdown <font color = 'red'> 👀🐾 **Simplest tokenization: split the text into units separated by whitespace**
obj.split()

['He',
 'had',
 'successfully',
 'avoided',
 'meeting',
 'his',
 'landlady',
 'on',
 'the',
 'stairs.',
 'His',
 'closet',
 'of',
 'a',
 'room',
 'was',
 'under',
 'the',
 'roof',
 'of',
 'a',
 'high,',
 'five-floor',
 'house',
 'and',
 'was',
 'more',
 'like',
 'a',
 'cupboard',
 'than',
 'a',
 'place',
 'in',
 'which',
 'to',
 'live.',
 'The',
 'landlady',
 'who',
 'provided',
 'him',
 'with',
 'the',
 'room',
 'and',
 'with',
 'dinner',
 'and',
 'service',
 'lived',
 'on',
 'the',
 'floor',
 'below,',
 'and',
 'every',
 'time',
 'he',
 'went',
 'out',
 'he',
 'was',
 'obliged',
 'to',
 'pass',
 'her',
 'kitchen,',
 'the',
 'door',
 'of',
 'which',
 'was',
 'always',
 'open.',
 'And',
 'each',
 'time',
 'he',
 'passed,',
 'the',
 'young',
 'man',
 'had',
 'a',
 'sick,',
 'frightened',
 'feeling,',
 'which',
 'made',
 'him',
 'grimace',
 'and',
 'feel',
 'ashamed.',
 'He',
 'was',
 'hopelessly',
 'in',
 'debt',
 'to',
 'his',
 'landlady',
 'and',
 'was',
 'afraid',
 'of',
 'meeting',


In [11]:
#@markdown 🐹 **Student's Activity 1-1** ⤵️

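#@markdown 👀🐾 **Compact variation: read and clean in one step**
with open('/content/sample_data/Crime_Punishment_Sample.txt', 'rt') as text:  # the file is closed automatically
    obj = text.read().replace("\n", " ")  # replace line breaks with spaces
obj.split()  # split on whitespace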

['He',
 'had',
 'successfully',
 'avoided',
 'meeting',
 'his',
 'landlady',
 'on',
 'the',
 'stairs.',
 'His',
 'closet',
 'of',
 'a',
 'room',
 'was',
 'under',
 'the',
 'roof',
 'of',
 'a',
 'high,',
 'five-floor',
 'house',
 'and',
 'was',
 'more',
 'like',
 'a',
 'cupboard',
 'than',
 'a',
 'place',
 'in',
 'which',
 'to',
 'live.',
 'The',
 'landlady',
 'who',
 'provided',
 'him',
 'with',
 'the',
 'room',
 'and',
 'with',
 'dinner',
 'and',
 'service',
 'lived',
 'on',
 'the',
 'floor',
 'below,',
 'and',
 'every',
 'time',
 'he',
 'went',
 'out',
 'he',
 'was',
 'obliged',
 'to',
 'pass',
 'her',
 'kitchen,',
 'the',
 'door',
 'of',
 'which',
 'was',
 'always',
 'open.',
 'And',
 'each',
 'time',
 'he',
 'passed,',
 'the',
 'young',
 'man',
 'had',
 'a',
 'sick,',
 'frightened',
 'feeling,',
 'which',
 'made',
 'him',
 'grimace',
 'and',
 'feel',
 'ashamed.',
 'He',
 'was',
 'hopelessly',
 'in',
 'debt',
 'to',
 'his',
 'landlady',
 'and',
 'was',
 'afraid',
 'of',
 'meeting',
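
The output above shows that whitespace splitting leaves punctuation attached to words ('stairs.', 'high,'). A minimal sketch of how to spot and strip such marks with the standard library follows. The `sample` string is an excerpt of the text above, and the variable names are chosen only for this illustration.

```python
import string

sample = ("He had successfully avoided meeting his landlady on the stairs. "
          "His closet of a room was under the roof of a high, five-floor house")

tokens = sample.split()

# Tokens whose last character is a punctuation mark.
attached = [t for t in tokens if t[-1] in string.punctuation]

# Strip punctuation from both ends of each token; inner hyphens are kept.
stripped = [t.strip(string.punctuation) for t in tokens]

print(attached)
print(stripped)
```

Because `str.strip()` only touches the edges of a token, 'five-floor' stays intact while 'stairs.' becomes 'stairs'.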


### 🐹 2. nltk.tokenize.**word_tokenize()**

| Function | Type | symbols | result | example |
|:--------------|:-----|:--------|:-----------|--|
|word_tokenize()|punctuation|".", ",", "?", "!", ";", ":"|tokenize| '.' |
|               |apostrophe        |"’" |tokenize| 'Here', '’', 's' |
|               |word-linking hyphen|"-" |part of a token|'ten-dollar', 'bill' |
|               |phrase-linking hyphen|"--" |tokenize|'--' |
|               |Capital vs. small|The ones |tokenize|'The', 'ones' |

In [None]:
#@markdown ###🐹 **Students' Activity 2** ⤵️

#@markdown <font color = 'red'> 👀🐾 nltk.tokenize.**word_tokenize()**

!pip install nltk
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

print('Tokenizing Word and Punctuation:',word_tokenize(obj))

Tokenizing Word and Punctuation: ['Here', '’', 's', 'to', 'the', 'crazy', 'ones', ',', 'the', 'misfits', ',', 'the', 'rebels', ',', 'the', 'troublemakers', ',', 'the', 'round', 'pegs', 'in', 'the', 'square', 'holes', '.', 'The', 'ones', 'who', 'see', 'things', 'differently', '—', 'they', '’', 're', 'not', 'fond', 'of', 'rules', '.', 'I', 'wanted', 'to', 'pay', 'with', 'a', 'twenti-dolloar', 'bill', ';', 'however', ',', 'she', 'couldn', '’', 't', 'get', 'cash', '.', 'I', '’', 've', 'got', 'a', 'big', 'trouble', ',', 'but', 'other', 'people', 'were', 'having', 'lots', 'of', 'fun', '.', 'Anna', 'likes', 'Brown', '’', 's', 'East', 'back', 'pack', ',', 'but', 'her', 'brother', 'doesn', '’', 't', '.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 🐹 2-1. Advanced: formatting the print statement with %s

In [None]:
#@markdown ###🐹 **Students' Activity 2-1** ⤵️

#@markdown <font color = 'red'> **🍎 Advanced: formatting the print statement with %s**
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
result2 = word_tokenize(obj)

print('Tokenizing Word and Punctuation: %s' %result2)


Tokenizing Word and Punctuation: ['Here', '’', 's', 'to', 'the', 'crazy', 'ones', ',', 'the', 'misfits', ',', 'the', 'rebels', ',', 'the', 'troublemakers', ',', 'the', 'round', 'pegs', 'in', 'the', 'square', 'holes', '.', 'The', 'ones', 'who', 'see', 'things', 'differently', '—', 'they', '’', 're', 'not', 'fond', 'of', 'rules', '.', 'I', 'wanted', 'to', 'pay', 'with', 'a', 'twenti-dolloar', 'bill', ';', 'however', ',', 'she', 'couldn', '’', 't', 'get', 'cash', '.', 'I', '’', 've', 'got', 'a', 'big', 'trouble', ',', 'but', 'other', 'people', 'were', 'having', 'lots', 'of', 'fun', '.', 'Anna', 'likes', 'Brown', '’', 's', 'East', 'back', 'pack', ',', 'but', 'her', 'brother', 'doesn', '’', 't', '.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 🐹 3. nltk.tokenize.**WordPunctTokenizer()**

| Function     | Type | symbols or words |result |example|
|:--------------|:-----|:--------|:-----------|--|
|WordPunctTokenizer()|punctuation|".", ",", "?", "!", ";", ":"|tokenize| '.' |
|               |apostrophe        |"'" |tokenize| "'" |
|               |word-linking hyphen|"-" |tokenize|'-' |
|               |phrase-linking hyphen|"--" |tokenize|'--' |
|               |Capital vs. small|The ones |tokenize|'The', 'ones' |

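  * <font color = 'indigo'> <font size = '2.2'> Wrong commands: `WordPunctTokenizer(txt)` raises `TypeError: WordPunctTokenizer.__init__() takes 1 positional argument but 2 were given`, and `WordPunctTokenizer.tokenize(txt)` also fails because `tokenize()` is called on the class rather than on an instance. Create an instance first: `WordPunctTokenizer().tokenize(txt)`.</font></font>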
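
The behavior in the table can be reproduced with the standard library. NLTK documents `WordPunctTokenizer` as a regular-expression tokenizer built on the pattern `\w+|[^\w\s]+`. The sketch below applies that pattern with `re.findall` to a made-up sentence, so it runs without NLTK.

```python
import re

# Runs of word characters, or runs of characters that are neither word
# characters nor whitespace (i.e., punctuation).
WORD_PUNCT_PATTERN = r"\w+|[^\w\s]+"

text = "I paid with a ten-dollar bill -- she couldn't."
tokens = re.findall(WORD_PUNCT_PATTERN, text)
print(tokens)
```

Every punctuation run becomes its own token, so the hyphen in "ten-dollar" and the apostrophe in "couldn't" are split off, while the double hyphen stays together as '--'.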

In [None]:
#@markdown ###🐹 **Students' Activity 3** ⤵️

#@markdown <font color = 'red'> 👀🐾 **nltk.tokenize.**WordPunctTokenizer()

from nltk.tokenize import WordPunctTokenizer

print('Tokenizing Words and Punctuations:', WordPunctTokenizer().tokenize(obj))

Tokenizing Words and Punctuations: ['Here', '’', 's', 'to', 'the', 'crazy', 'ones', ',', 'the', 'misfits', ',', 'the', 'rebels', ',', 'the', 'troublemakers', ',', 'the', 'round', 'pegs', 'in', 'the', 'square', 'holes', '.', 'The', 'ones', 'who', 'see', 'things', 'differently', '—', 'they', '’', 're', 'not', 'fond', 'of', 'rules', '.', 'I', 'wanted', 'to', 'pay', 'with', 'a', 'twenti', '-', 'dolloar', 'bill', ';', 'however', ',', 'she', 'couldn', '’', 't', 'get', 'cash', '.', 'I', '’', 've', 'got', 'a', 'big', 'trouble', ',', 'but', 'other', 'people', 'were', 'having', 'lots', 'of', 'fun', '.', 'Anna', 'likes', 'Brown', '’', 's', 'East', 'back', 'pack', ',', 'but', 'her', 'brother', 'doesn', '’', 't', '.']


In [None]:
#@markdown ###🐹 **Students' Activity 3-1** ⤵️

#@markdown <font color = 'red'> **🍎 Advanced: formatting the print statement with %s**

from nltk.tokenize import WordPunctTokenizer
result3 = WordPunctTokenizer().tokenize(obj)

print('Tokenizing Words and Punctuations: %s' %result3)

Tokenizing Words and Punctuations: ['Here', '’', 's', 'to', 'the', 'crazy', 'ones', ',', 'the', 'misfits', ',', 'the', 'rebels', ',', 'the', 'troublemakers', ',', 'the', 'round', 'pegs', 'in', 'the', 'square', 'holes', '.', 'The', 'ones', 'who', 'see', 'things', 'differently', '—', 'they', '’', 're', 'not', 'fond', 'of', 'rules', '.', 'I', 'wanted', 'to', 'pay', 'with', 'a', 'twenti', '-', 'dolloar', 'bill', ';', 'however', ',', 'she', 'couldn', '’', 't', 'get', 'cash', '.', 'I', '’', 've', 'got', 'a', 'big', 'trouble', ',', 'but', 'other', 'people', 'were', 'having', 'lots', 'of', 'fun', '.', 'Anna', 'likes', 'Brown', '’', 's', 'East', 'back', 'pack', ',', 'but', 'her', 'brother', 'doesn', '’', 't', '.']


### 🐹 4. tensorflow.keras.preprocessing.text.**text_to_word_sequence()**

| Function | Type | symbols or words |result |example|
|:--------------|:-----|:--------|:-----------|--|
|text_to_word_sequence()|punctuation|".", ",", "?", "!", ";", ":"|deleted|  |
|               |apostrophe        |"'" |part of a token| 'here’s' |
|               |word-linking hyphen|"-" |deleted (the word is split)| 'twenti', 'dolloar' |
|               |phrase-linking dash|"—" |tokenize|'—' |
|               |Capital ➡️  small|The ones |tokenize|'the', 'ones' |


* <font color = 'sky blue'> All letters are lowercased.
* An apostrophe stays part of its token, whether it marks the perfect tense (e.g., "I've"), a possessive (e.g., "John's"), or a negation (e.g., "doesn't").
* A phrase-linking em dash (—) is kept as a token because it is not among the default filter characters; an ASCII double hyphen (--) would be removed like any other hyphen.
* **All punctuation marks and word-linking hyphens are deleted**.</font>
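
These rules follow from how the function works: it lowercases the text, replaces every character in its `filters` argument with the split character, and splits. The sketch below re-implements that idea with the standard library so it can run without TensorFlow. The function name `text_to_word_sequence_sketch` is hypothetical, and the filter string is the default documented for Keras at the time of writing, so treat it as an assumption.

```python
# Default filter characters documented for Keras' text_to_word_sequence
# (assumed here; check the installed version).
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def text_to_word_sequence_sketch(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    if lower:
        text = text.lower()
    # Replace every filter character with the split character.
    table = str.maketrans({ch: split for ch in filters})
    return [tok for tok in text.translate(table).split(split) if tok]

text = "Here’s a ten-dollar bill — they’re NOT fond of rules."
tokens = text_to_word_sequence_sketch(text)
print(tokens)
```

The apostrophe and the em dash are not in the filter string, so "here’s", "they’re", and "—" survive, while the hyphen and the period are replaced by spaces.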


In [None]:
#@markdown ###🐹 **Students' Activity 4** ⤵️

#@markdown <font color = 'red'> 👀🐾 **tensorflow.keras.preprocessing.text.text_to_word_sequence()**

from tensorflow.keras.preprocessing.text import text_to_word_sequence

print('Tokenizing Words after Cleaning Punctuations:', text_to_word_sequence(obj))


Tokenizing Words after Cleaning Punctuations: ['here’s', 'to', 'the', 'crazy', 'ones', 'the', 'misfits', 'the', 'rebels', 'the', 'troublemakers', 'the', 'round', 'pegs', 'in', 'the', 'square', 'holes', 'the', 'ones', 'who', 'see', 'things', 'differently', '—', 'they’re', 'not', 'fond', 'of', 'rules', 'i', 'wanted', 'to', 'pay', 'with', 'a', 'twenti', 'dolloar', 'bill', 'however', 'she', 'couldn’t', 'get', 'cash', 'i’ve', 'got', 'a', 'big', 'trouble', 'but', 'other', 'people', 'were', 'having', 'lots', 'of', 'fun', 'anna', 'likes', 'brown’s', 'east', 'back', 'pack', 'but', 'her', 'brother', 'doesn’t']


In [None]:
#@markdown ###🐹 **Students' Activity 4-1** ⤵️

#@markdown <font color = 'red'> **🍎 Advanced: formatting the print statement with %s**

from tensorflow.keras.preprocessing.text import text_to_word_sequence
result4 = text_to_word_sequence(obj)

print('Tokenizing Words after Cleaning Punctuations: %s' %result4)



Tokenizing Words after Cleaning Punctuations: ['here’s', 'to', 'the', 'crazy', 'ones', 'the', 'misfits', 'the', 'rebels', 'the', 'troublemakers', 'the', 'round', 'pegs', 'in', 'the', 'square', 'holes', 'the', 'ones', 'who', 'see', 'things', 'differently', '—', 'they’re', 'not', 'fond', 'of', 'rules', 'i', 'wanted', 'to', 'pay', 'with', 'a', 'twenti', 'dolloar', 'bill', 'however', 'she', 'couldn’t', 'get', 'cash', 'i’ve', 'got', 'a', 'big', 'trouble', 'but', 'other', 'people', 'were', 'having', 'lots', 'of', 'fun', 'anna', 'likes', 'brown’s', 'east', 'back', 'pack', 'but', 'her', 'brother', 'doesn’t']


### **Tokenization & Normalization**

### 🐹 5. **PorterStemmer()** for Stemming

| Function | Type | symbols or words |result |example|
|:--------------|:-----|:--------|:-----------|--|
|PorterStemmer() (after word_tokenize())|punctuation|".", ",", "?", "!", ";", ":"|tokenize|'.', etc. |
|               |apostrophe        |"’" |tokenize| '’' |
|               |word-linking hyphen|"-" |part of a token| 'twenti-dolloar' |
|               |phrase-linking hyphen|"--" |tokenize|'--'|
|               |Capital ➡️  small|The ones |tokenize, stem|'the', 'one' |
|               |contraction, progressive|they're, were having|tokenize, stem|'they', '’', 're', 'were', 'have'|

In [None]:
#@markdown ###🐹 **Students' Activity 5** ⤵️

#@markdown <font color = 'red'> 👀🐾 **PorterStemmer() for Stemming**

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
tokenized_sentence = word_tokenize(obj)

[stemmer.stem(w) for w in tokenized_sentence] # List comprehension: loop over the tokens one by one, bind each token to w, and apply stemmer.stem(w). Note that each function has its own strengths and weaknesses.

['here',
 '’',
 's',
 'to',
 'the',
 'crazi',
 'one',
 ',',
 'the',
 'misfit',
 ',',
 'the',
 'rebel',
 ',',
 'the',
 'troublemak',
 ',',
 'the',
 'round',
 'peg',
 'in',
 'the',
 'squar',
 'hole',
 '.',
 'the',
 'one',
 'who',
 'see',
 'thing',
 'differ',
 '—',
 'they',
 '’',
 're',
 'not',
 'fond',
 'of',
 'rule',
 '.',
 'i',
 'want',
 'to',
 'pay',
 'with',
 'a',
 'twenti-dolloar',
 'bill',
 ';',
 'howev',
 ',',
 'she',
 'couldn',
 '’',
 't',
 'get',
 'cash',
 '.',
 'i',
 '’',
 've',
 'got',
 'a',
 'big',
 'troubl',
 ',',
 'but',
 'other',
 'peopl',
 'were',
 'have',
 'lot',
 'of',
 'fun',
 '.',
 'anna',
 'like',
 'brown',
 '’',
 's',
 'east',
 'back',
 'pack',
 ',',
 'but',
 'her',
 'brother',
 'doesn',
 '’',
 't',
 '.']

### 🐹 6. RegexpTokenizer() with a regular expression as an argument

| Function.     | Type | symbols or words |result |example|
|:--------------|:-----|:--------|:-----------|--|
|RegexpTokenizer()|punctuation|".", ",", "?", "!", ";", ":"|deleted|  |
|               |apostrophe        |"'" |deleted|  |
|               |word-linking hyphen|"-" |deleted| |
|               |phrase-linking hyphen|"--" |deleted||
||Capital vs. small|The ones |tokenize|'The', 'ones' |
|              ||they're, were having||'they', 're', 'were', 'having'|


In [None]:
#@markdown ###🐹 **Students' Activity 6** ⤵️

#@markdown <font color = 'red'> 👀🐾 **RegexpTokenizer() with a regular expression as an argument**

from nltk.tokenize import RegexpTokenizer
retokenize = RegexpTokenizer(r"[\w]+")  # raw string, so that \w reaches the regex engine unchanged
result6 = retokenize.tokenize(obj)

print('Tokenizing Words with RegexpTokenizer: %s' %result6)

Tokenizing Words with RegexpTokenizer: ['Here', 's', 'to', 'the', 'crazy', 'ones', 'the', 'misfits', 'the', 'rebels', 'the', 'troublemakers', 'the', 'round', 'pegs', 'in', 'the', 'square', 'holes', 'The', 'ones', 'who', 'see', 'things', 'differently', 'they', 're', 'not', 'fond', 'of', 'rules', 'I', 'wanted', 'to', 'pay', 'with', 'a', 'twenti', 'dolloar', 'bill', 'however', 'she', 'couldn', 't', 'get', 'cash', 'I', 've', 'got', 'a', 'big', 'trouble', 'but', 'other', 'people', 'were', 'having', 'lots', 'of', 'fun', 'Anna', 'likes', 'Brown', 's', 'East', 'back', 'pack', 'but', 'her', 'brother', 'doesn', 't']
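
How much is deleted depends entirely on the regular expression. The sketch below uses `re.findall`, which returns the matches of a pattern much like `RegexpTokenizer` does in its default mode, and compares the pattern used above with a hypothetical alternative that keeps apostrophes and hyphens inside words. The sample sentence and the second pattern are assumptions made for this illustration.

```python
import re

text = "They’re not fond of rules. I paid with a ten-dollar bill."

# Pattern from the activity above: runs of word characters only.
word_chars_only = re.findall(r"[\w]+", text)

# Hypothetical alternative: allow -, ' or ’ between runs of word characters.
keep_joined = re.findall(r"\w+(?:[-'’]\w+)*", text)

print(word_chars_only)
print(keep_joined)
```

The first pattern splits "They’re" and "ten-dollar" into two pieces each; the second keeps them whole while still dropping sentence punctuation.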


### 🐹 7. LancasterStemmer() for Stemming (applied after RegexpTokenizer())

| Function.     | Type | symbols or words |result |example|
|:--------------|:-----|:--------|:-----------|--|
|LancasterStemmer()|punctuation|".", ",", "?", "!", ";", ":"|deleted|  |
|               |apostrophe        |"'" |deleted|  |
|               |word-linking hyphen|"-" |deleted| |
|               |phrase-linking hyphen|"--" |deleted||
||Capital ➡️ small|The ones |tokenize|'the', 'ones' |
|              ||they're, were having||'they', 're', 'wer', 'hav'|


In [None]:
#@markdown ###🐹 **Students' Activity 7** ⤵️

#@markdown <font color = 'red'> 👀🐾 **LancasterStemmer() for Stemming**

from nltk.tokenize import RegexpTokenizer
retokenize = RegexpTokenizer(r"[\w]+")  # raw string, so that \w reaches the regex engine unchanged
result7 = retokenize.tokenize(obj)

from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
[stemmer.stem(w) for w in result7]

['her',
 's',
 'to',
 'the',
 'crazy',
 'on',
 'the',
 'misfit',
 'the',
 'rebel',
 'the',
 'troublemak',
 'the',
 'round',
 'peg',
 'in',
 'the',
 'squ',
 'hol',
 'the',
 'on',
 'who',
 'see',
 'thing',
 'diff',
 'they',
 're',
 'not',
 'fond',
 'of',
 'rul',
 'i',
 'want',
 'to',
 'pay',
 'with',
 'a',
 'twent',
 'dollo',
 'bil',
 'howev',
 'she',
 'couldn',
 't',
 'get',
 'cash',
 'i',
 've',
 'got',
 'a',
 'big',
 'troubl',
 'but',
 'oth',
 'peopl',
 'wer',
 'hav',
 'lot',
 'of',
 'fun',
 'ann',
 'lik',
 'brown',
 's',
 'east',
 'back',
 'pack',
 'but',
 'her',
 'broth',
 'doesn',
 't']

### 🐹 8. WordNetLemmatizer() for Lemmatization

| Function.     | Type | symbols or words |result |example|
|:--------------|:-----|:--------|:-----------|--|
|WordNetLemmatizer()|punctuation|".", ",", "?", "!", ";", ":"|deleted|  |
|               |apostrophe        |"'" |deleted|  |
|               |word-linking hyphen|"-" |deleted| |
|               |phrase-linking hyphen|"--" |deleted||
||| || |
|inflection|pl, 3rd per. sg.|ones, rebels, likes||'one', 'rebel', 'like'|
|inflection|progressive|were having ||'were', 'having'|

In [None]:
#@markdown ###🐹 **Students' Activity 8** ⤵️

#@markdown <font color = 'red'> 👀🐾 **WordNetLemmatizer()**

import nltk
from nltk.tokenize import RegexpTokenizer
retokenize = RegexpTokenizer(r"[\w]+")  # raw string, so that \w reaches the regex engine unchanged
result8 = retokenize.tokenize(obj)

from nltk.stem import WordNetLemmatizer # Only some inflections are restored (e.g., the inflectional morpheme '-s' for 3rd per. sg. & plural; cf. no recovery from the progressive), because lemmatize() treats every word as a noun (pos='n') unless told otherwise
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
[lemmatizer.lemmatize(w) for w in result8]

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['Here',
 's',
 'to',
 'the',
 'crazy',
 'one',
 'the',
 'misfit',
 'the',
 'rebel',
 'the',
 'troublemaker',
 'the',
 'round',
 'peg',
 'in',
 'the',
 'square',
 'hole',
 'The',
 'one',
 'who',
 'see',
 'thing',
 'differently',
 'they',
 're',
 'not',
 'fond',
 'of',
 'rule',
 'I',
 'wanted',
 'to',
 'pay',
 'with',
 'a',
 'twenti',
 'dolloar',
 'bill',
 'however',
 'she',
 'couldn',
 't',
 'get',
 'cash',
 'I',
 've',
 'got',
 'a',
 'big',
 'trouble',
 'but',
 'other',
 'people',
 'were',
 'having',
 'lot',
 'of',
 'fun',
 'Anna',
 'like',
 'Brown',
 's',
 'East',
 'back',
 'pack',
 'but',
 'her',
 'brother',
 'doesn',
 't']

### Stopwords
For more information, read the original article.

Stopwords are English words that do not add much meaning to a sentence, so they can usually be ignored without sacrificing the meaning of the sentence. Examples are "the", "he", and "have". NLTK already collects such words in a corpus (`nltk.corpus.stopwords`), which we first download to our Python environment.

Stop words such as "the", "and", and "I" are very frequent in text and therefore convey little about the specific topic of a document. Removing them from a corpus cleans up the data and helps identify words that are rarer and potentially more relevant to what we are interested in.

There is no universal list of stop words in NLP research; the NLTK module, however, contains its own list.
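
Filtering stop words amounts to a membership test against a list. The sketch below uses a small hand-written set, a hypothetical stand-in chosen for this illustration, so that it runs without downloading anything. In practice the set would come from NLTK's stop word corpus.

```python
# Hypothetical mini stop list for illustration; NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "are", "and", "i", "he", "have", "a", "of", "to"}

tokens = ["The", "ones", "who", "see", "things", "differently",
          "are", "not", "fond", "of", "rules"]

# Compare in lowercase so that "The" is recognized as a stop word.
content_tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
print(content_tokens)
```

Which words count as stop words is a design choice: "not" is kept here because it is not in the hand-written set, and removing it would flip the meaning of the sentence.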

# Conclusion:

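Tokenization cannot be regarded as simply stripping punctuation from a corpus and splitting it on whitespace.

1) Punctuation and special characters should not simply be dropped.
When extracting words from a corpus, it is not right to simply discard punctuation and special characters. During corpus cleaning, even a punctuation mark is sometimes classified as a token in its own right. As the most basic example, the period (.) helps identify sentence boundaries, so you may choose not to drop it when extracting words.

As another example, some words contain punctuation themselves, such as m.p.h, Ph.D, and AT&T. Special characters such as the dollar sign and the slash (/) matter too: $45.55 denotes a price and 01/02/06 denotes a date. In such cases you usually want to treat 45.55 as one unit rather than splitting it into 45 and 55.

Commas also appear inside numbers. Figures are usually written with a comma every three digits, as in 123,456,789.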
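
A regular expression can keep such items together. The pattern below is a minimal sketch written for this illustration (it is not taken from NLTK or from the source article) and covers only the cases named above: prices, dates, grouped numbers, dotted abbreviations, and words joined by "&", "'" or "-".

```python
import re

TOKEN_PATTERN = re.compile(
    r"""
    \$?\d+(?:[.,/]\d+)*     # numbers, prices, dates: 45, $45.55, 01/02/06, 123,456,789
  | (?:[A-Za-z]\.){2,}      # dotted abbreviations such as m.p.h.
  | \w+(?:[&'-]\w+)*        # words, including AT&T, don't, ten-dollar
  | [^\w\s]                 # any remaining punctuation mark, one at a time
    """,
    re.VERBOSE,
)

text = "AT&T charged $45.55 on 01/02/06 for 123,456,789 units at 60 m.p.h."
tokens = TOKEN_PATTERN.findall(text)
print(tokens)
```

The final period is absorbed into 'm.p.h.', which shows the remaining ambiguity: a rule alone cannot tell whether that period also ends the sentence.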

2) Contractions and words that contain spaces.
In tokenization, the English apostrophe (') often marks a contraction that can be expanded again. For example, what're is short for what are, and we're is short for we are. The re in these examples is called a clitic, that is, the reduced form a word takes in a contraction. Likewise, in I'm, the contraction of I am, the m is a clitic.

Consider New York or rock 'n' roll. Each is a single word even though it contains spaces. Depending on the use case, such a word may have to be treated as one token, so a tokenizer should also be able to recognize it as a single unit.



Below, we introduce the rules of Penn Treebank tokenization and check the tokenization result.

  - Rule 1. Words joined by a hyphen are kept as one token.
  - Rule 2. Words that carry a clitic after an apostrophe, such as doesn't, are split.
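
The two rules can be sketched with the standard library. The pattern and the helper names `split_clitic` and `tokenize` are hypothetical and written only for this illustration; this is not NLTK's `TreebankWordTokenizer`. The sketch also converts typographic apostrophes (’) to ASCII apostrophes first, an extra step assumed here so that both kinds are handled alike.

```python
import re

# A clitic at the end of a token: n't, or an apostrophe plus s, re, ve, ll, d, m.
CLITIC = re.compile(r"^(.+?)(n't|'(?:s|re|ve|ll|d|m))$", re.IGNORECASE)

def split_clitic(token):
    token = token.replace("’", "'")  # normalize typographic apostrophes
    match = CLITIC.match(token)
    return list(match.groups()) if match else [token]

def tokenize(text):
    tokens = []
    for chunk in text.split():
        tokens.extend(split_clitic(chunk))  # hyphenated words pass through whole
    return tokens

result = tokenize("She doesn’t like the ten-dollar bill but we're fine")
print(result)
```

Rule 1 shows in 'ten-dollar' staying whole, and Rule 2 in "doesn’t" becoming 'does' and "n't".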

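Alert: `TreebankWordTokenizer(txt)` raises `TypeError: TreebankWordTokenizer() takes no arguments`. Create the instance without arguments and pass the text to its `tokenize()` method.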

# 💣💊  Everything below is excluded from the exercises (as of 07JUNE23) ↘️

In [None]:
from nltk.tokenize import TreebankWordTokenizer # Wikidocs says the clitic n't in doesn't is tokenized separately, but that does not happen in this example. A likely reason is that the sample text uses typographic apostrophes (’) rather than ASCII apostrophes (').

tokenizer = TreebankWordTokenizer()
Result = tokenizer.tokenize(obj)

print('Treebank word tokenizer: %s' %Result)

Treebank word tokenizer: ['Here’s', 'to', 'the', 'crazy', 'ones', ',', 'the', 'misfits', ',', 'the', 'rebels', ',', 'the', 'troublemakers', ',', 'the', 'round', 'pegs', 'in', 'the', 'square', 'holes.', 'The', 'ones', 'who', 'see', 'things', 'differently', '—', 'they’re', 'not', 'fond', 'of', 'rules.', 'I', 'wanted', 'to', 'pay', 'with', 'a', 'twenti-dolloar', 'bill', ';', 'however', ',', 'she', 'couldn’t', 'get', 'cash.', 'I’ve', 'got', 'a', 'big', 'trouble', ',', 'but', 'other', 'people', 'were', 'having', 'lots', 'of', 'fun.', 'Anna', 'likes', 'Brown’s', 'East', 'back', 'pack', ',', 'but', 'her', 'brother', 'doesn’t', '.']


## [For Tokenization and POS in Korean, visit Wikidocs for further information](https://wikidocs.net/21698)
* Sentence tokenizer for Korean: kss package

In [None]:
!pip install kss
import kss
