<a href="https://colab.research.google.com/github/tanongsakintean/google_colab/blob/main/NLTK_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction to Natural Language Processing

In this workbook, we take a high-level look at text tokenization; text normalization such as lowercasing and stemming; part-of-speech tagging; named-entity recognition; sentiment analysis; topic modeling; and word embeddings.





# Text-PreProcessing
The Basics of NLP for Text

In this article, we'll cover the following text-preprocessing topics:

1. Sentence Tokenization
2. Word Tokenization
3. Regular expressions
4. Text Lemmatization and Stemming
5. N-grams
6. Stop Words


-----


**punkt** This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

**averaged_perceptron_tagger** contains the pre-trained English Part-of-Speech (POS) tagger.

The **nltk (Natural Language Toolkit)** module is a Python module for natural language processing and is popular among Python developers. It is released under the Apache License, Version 2.0. Older releases supported both Python 2 and Python 3; current releases support Python 3 only.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger') # POS tagger model, e.g. it tags "is" as a verb
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
#Tokenization -- Paragraphs into sentences;
from nltk.tokenize import sent_tokenize

text = "Hello All. Welcome to medium. This article is about NLP using NLTK."
print("SENTENCE AS TOKENS:")
print(sent_tokenize(text))
print("No of Sentence Tokens:",len(sent_tokenize(text)))

SENTENCE AS TOKENS:
['Hello All.', 'Welcome to medium.', 'This article is about NLP using NLTK.']
No of Sentence Tokens: 3


In [None]:
import nltk.data
etext = 'Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries.  And sometimes sentences can start with non-capitalized words.  i is a good variable name.'
english_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
english_tokenizer.tokenize(etext)



['Punkt knows that the periods in Mr. Smith and Johann S. Bach do not mark sentence boundaries.',
 'And sometimes sentences can start with non-capitalized words.',
 'i is a good variable name.']
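To see why a trained model such as Punkt is useful, here is a minimal sketch of a naive splitter that breaks after every period, exclamation mark, or question mark followed by whitespace. The helper name `naive_sent_split` is hypothetical and not part of NLTK; it only illustrates the failure mode on the same example text.

```python
import re

etext = ('Punkt knows that the periods in Mr. Smith and Johann S. Bach do not '
         'mark sentence boundaries.  And sometimes sentences can start with '
         'non-capitalized words.  i is a good variable name.')

def naive_sent_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

naive_sentences = naive_sent_split(etext)
for sentence in naive_sentences:
    print(repr(sentence))
print("Number of pieces:", len(naive_sentences))
```

The naive rule cuts after "Mr." and "S.", so it produces more pieces than the three sentences Punkt returns above.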

# Languages available in punkt

czech.pickle     finnish.pickle  norwegian.pickle   slovene.pickle
danish.pickle    french.pickle   polish.pickle      spanish.pickle
dutch.pickle     german.pickle   portuguese.pickle  swedish.pickle
english.pickle   greek.pickle                turkish.pickle
estonian.pickle  italian.pickle

In [None]:
# German
german_tokenizer = nltk.data.load('tokenizers/punkt/PY3/german.pickle')
gtext = 'Wie geht es Ihnen? Mir geht es gut.'
german_tokenizer.tokenize(gtext)

['Wie geht es Ihnen?', 'Mir geht es gut.']

# Word tokenization (word_tokenize)

In [None]:
#Tokenization -- Text into word tokens;
from nltk.tokenize import word_tokenize

text = "Hello All. Welcome to medium. This article is about NLP using NLTK. Subscribe with $4.00. "
print("WORD TOKENS:")
print(word_tokenize(text))
print("No of Word Tokens:",len(word_tokenize(text)))


WORD TOKENS:
['Hello', 'All', '.', 'Welcome', 'to', 'medium', '.', 'This', 'article', 'is', 'about', 'NLP', 'using', 'NLTK', '.', 'Subscribe', 'with', '$', '4.00', '.']
No of Word Tokens: 20


# Regular expression (re)
In Python, a regular expression (regex) is a widely used notation for describing patterns in text. With a regex you can search a string or a body of text for the groups of characters that match the pattern you want.

See more at https://www.bualabs.com/archives/3070/what-is-regular-expression-regex-regexp-teach-how-to-regex-python-nlp-ep-7/
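As a small illustration of pattern matching with Python's built-in `re` module, the sketch below finds dollar amounts and strips punctuation. The sample sentence and the price pattern are assumptions made for this example only.

```python
import re

text = "Subscribe with $4.00 today, or pay $12.50 for 3 months."

# Find every dollar amount: a literal '$', one or more digits, optional cents
prices = re.findall(r'\$\d+(?:\.\d{2})?', text)
print(prices)

# Replace everything except letters, digits and whitespace with a space
cleaned = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)

# split() with no argument collapses runs of whitespace
tokens = cleaned.split()
print(tokens)
```

Note that stripping punctuation also splits "4.00" into "4" and "00", which is one reason to extract structured values such as prices before cleaning.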

In [None]:
# N-grams with nltk.util.ngrams

import re
from nltk.util import ngrams
text = "Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP."
text = text.lower()
# Replace every character that is not a letter, digit, or whitespace with a space
text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
# Break the sentence into tokens and remove empty tokens
tokens = [token for token in text.split(" ") if token != ""]
# Build all trigrams (n = 3)
output = list(ngrams(tokens, 3))
print(output)

[('hello', 'everyone', 'welcome'), ('everyone', 'welcome', 'to'), ('welcome', 'to', 'intro'), ('to', 'intro', 'to'), ('intro', 'to', 'machine'), ('to', 'machine', 'learning'), ('machine', 'learning', 'applications'), ('learning', 'applications', 'we'), ('applications', 'we', 'are'), ('we', 'are', 'now'), ('are', 'now', 'learning'), ('now', 'learning', 'important'), ('learning', 'important', 'basics'), ('important', 'basics', 'of'), ('basics', 'of', 'nlp')]
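An n-gram is simply a run of n consecutive tokens. The following sketch builds them without NLTK to show the idea; `make_ngrams` is a hypothetical helper name, not an NLTK function.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip over n copies of the list, each shifted one position further
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "we are now learning important basics of nlp".split()
bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A list of k tokens yields k - n + 1 n-grams, so the 8 tokens here give 7 bigrams and 6 trigrams.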


In [None]:
#Text Normalization

#Case Conversion
text = "Hello All. Welcome to medium. This article is about NLP using NLTK. Subscribe with $4.00."
lowert = text.lower()
uppert = text.upper()

print("To Lower Case:",lowert)
print("To Upper Case:",uppert)


To Lower Case: hello all. welcome to medium. this article is about nlp using nltk. subscribe with $4.00.
To Upper Case: HELLO ALL. WELCOME TO MEDIUM. THIS ARTICLE IS ABOUT NLP USING NLTK. SUBSCRIBE WITH $4.00.


# Stemming
See details at https://www.bualabs.com/archives/2952/what-is-stemming-what-is-lemmatization-different-stemming-lemmatization-nlp-ep-3/


In [None]:

# The Porter stemmer is a widely used stemming algorithm
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize


ps = PorterStemmer()
sentence = "It would be unfair to demand that people cease pirating files when those same people aren't paid for their participation in very lucrative network schemes. Ordinary people are relentlessly spied on, and not compensated for information taken from them. While I'd like to see everyone eventually pay for music and the like, I'd not ask for it until there's reciprocity."

sent = word_tokenize(sentence)
print("After Word Tokenization:\n",sent)
print("Total No of Word Tokens: ",len(sent))

ps_sent = [ps.stem(words_sent) for words_sent in sent]
print(ps_sent)
print(len(ps_sent))



After Word Tokenization:
 ['It', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'people', 'cease', 'pirating', 'files', 'when', 'those', 'same', 'people', 'are', "n't", 'paid', 'for', 'their', 'participation', 'in', 'very', 'lucrative', 'network', 'schemes', '.', 'Ordinary', 'people', 'are', 'relentlessly', 'spied', 'on', ',', 'and', 'not', 'compensated', 'for', 'information', 'taken', 'from', 'them', '.', 'While', 'I', "'d", 'like', 'to', 'see', 'everyone', 'eventually', 'pay', 'for', 'music', 'and', 'the', 'like', ',', 'I', "'d", 'not', 'ask', 'for', 'it', 'until', 'there', "'s", 'reciprocity', '.']
Total No of Word Tokens:  69
['it', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'peopl', 'ceas', 'pirat', 'file', 'when', 'those', 'same', 'peopl', 'are', "n't", 'paid', 'for', 'their', 'particip', 'in', 'veri', 'lucr', 'network', 'scheme', '.', 'ordinari', 'peopl', 'are', 'relentlessli', 'spi', 'on', ',', 'and', 'not', 'compens', 'for', 'inform', 'taken', 'from', 'them', '.', 'whi

In [None]:
import pandas as pd
stemdf= pd.DataFrame({'original_word': sent,'stemmed_word': ps_sent})
stemdf

| | original_word | stemmed_word |
|---|---|---|
| 0 | It | it |
| 1 | would | would |
| 2 | be | be |
| 3 | unfair | unfair |
| 4 | to | to |
| ... | ... | ... |
| 64 | until | until |
| 65 | there | there |
| 66 | 's | 's |
| 67 | reciprocity | reciproc |


In [None]:
#Porter stemmer is a famous stemming approach

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
ps = PorterStemmer()

words = ["hike", "hikes", "hiked", "hiking", "hikers", "hiker", "universal", "universe", "university","alumnus", "alumni", "alumnae"]

for w in words:
    print(w, " : ", ps.stem(w))

hike  :  hike
hikes  :  hike
hiked  :  hike
hiking  :  hike
hikers  :  hiker
hiker  :  hiker
universal  :  univers
universe  :  univers
university  :  univers
alumnus  :  alumnu
alumni  :  alumni
alumnae  :  alumna
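Grouping words by their stem makes the trade-offs easier to see. The sketch below uses the (word, stem) pairs copied from the Porter output above, so it runs without NLTK; the variable names are illustrative.

```python
from collections import defaultdict

# (word, Porter stem) pairs copied from the output above
stemmed = [
    ("hike", "hike"), ("hikes", "hike"), ("hiked", "hike"), ("hiking", "hike"),
    ("hikers", "hiker"), ("hiker", "hiker"),
    ("universal", "univers"), ("universe", "univers"), ("university", "univers"),
    ("alumnus", "alumnu"), ("alumni", "alumni"), ("alumnae", "alumna"),
]

# Collect all words that share the same stem
groups = defaultdict(list)
for word, stem in stemmed:
    groups[stem].append(word)

for stem, members in groups.items():
    print(stem, "<-", members)
```

"universal", "universe" and "university" collapse to one stem although their meanings differ (over-stemming), while "alumnus", "alumni" and "alumnae" stay apart although they are forms of the same word (under-stemming).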


In [None]:
# Another stemmer: Snowball
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

sb = SnowballStemmer("english")
sentence = "It would be unfair to demand that people cease pirating files when those same people aren't paid for their participation in very lucrative network schemes. Ordinary people are relentlessly spied on, and not compensated for information taken from them. While I'd like to see everyone eventually pay for music and the like, I'd not ask for it until there's reciprocity."

sent = word_tokenize(sentence)
print("After Word Tokenization:\n",sent)
print("Total No of Word Tokens: ",len(sent))

sb_sent = [sb.stem(words_sent) for words_sent in sent]
print(sb_sent)
print(len(sb_sent))

After Word Tokenization:
 ['It', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'people', 'cease', 'pirating', 'files', 'when', 'those', 'same', 'people', 'are', "n't", 'paid', 'for', 'their', 'participation', 'in', 'very', 'lucrative', 'network', 'schemes', '.', 'Ordinary', 'people', 'are', 'relentlessly', 'spied', 'on', ',', 'and', 'not', 'compensated', 'for', 'information', 'taken', 'from', 'them', '.', 'While', 'I', "'d", 'like', 'to', 'see', 'everyone', 'eventually', 'pay', 'for', 'music', 'and', 'the', 'like', ',', 'I', "'d", 'not', 'ask', 'for', 'it', 'until', 'there', "'s", 'reciprocity', '.']
Total No of Word Tokens:  69
['it', 'would', 'be', 'unfair', 'to', 'demand', 'that', 'peopl', 'ceas', 'pirat', 'file', 'when', 'those', 'same', 'peopl', 'are', "n't", 'paid', 'for', 'their', 'particip', 'in', 'veri', 'lucrat', 'network', 'scheme', '.', 'ordinari', 'peopl', 'are', 'relentless', 'spi', 'on', ',', 'and', 'not', 'compens', 'for', 'inform', 'taken', 'from', 'them', '.', 'whi

In [None]:
stemdf= pd.DataFrame({'original_word': sent,'stemmed_word':sb_sent})
stemdf

| | original_word | stemmed_word |
|---|---|---|
| 0 | It | it |
| 1 | would | would |
| 2 | be | be |
| 3 | unfair | unfair |
| 4 | to | to |
| ... | ... | ... |
| 64 | until | until |
| 65 | there | there |
| 66 | 's | 's |
| 67 | reciprocity | reciproc |


# Comparing the Porter and Snowball stemmers

In [None]:
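import pandas as pd
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

input_str="There are several types of stemming algorithms. Programmers program with programming languages"
words=word_tokenize(input_str)
ps = PorterStemmer()
sno =  SnowballStemmer('english')

# Stem every token with each stemmer
ps_sent = [ps.stem(words_sent) for words_sent in words]
sno_sent = [sno.stem(words_sent) for words_sent in words]

# Put the original tokens next to both stemmed versions
stem3df= pd.DataFrame({'original_word': words,'porter_word':ps_sent,'snowball_word':sno_sent})
stem3df

| | original_word | porter_word | snowball_word |
|---|---|---|---|
| 0 | There | there | there |
| 1 | are | are | are |
| 2 | several | sever | sever |
| 3 | types | type | type |
| 4 | of | of | of |
| 5 | stemming | stem | stem |
| 6 | algorithms | algorithm | algorithm |
| 7 | . | . | . |
| 8 | Programmers | programm | programm |
| 9 | program | program | program |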


# WordNet Lemmatization
**The command nltk.download('omw-1.4')** is an NLTK (Natural Language Toolkit) command that downloads the multilingual data compiled and prepared in WordNet format by the Open Multilingual WordNet (OMW), version 1.4.

**WordNet** is a lexical database used to look up the meanings of words and the relations between words in English and other languages. WordNet stores words and their meanings as sets of synonyms (synsets), together with relations between words and other semantic relations, which makes it a useful resource for artificial intelligence and natural language processing work.

Downloading the OMW 1.4 data in NLTK gives you access to WordNet-style data for other languages that was built on top of WordNet. If you work on artificial intelligence or natural language processing tasks that involve several languages, OMW 1.4 may be useful for exploring and analyzing word meanings in languages other than English.

In [None]:
import nltk
import pandas as pd
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

# Data needed by the WordNet lemmatizer
nltk.download('omw-1.4')
nltk.download('wordnet')

ps = PorterStemmer()
sno =  SnowballStemmer('english')

#without POS tagging
text = "She jumped into the river and breathed heavily"
tokenizer = word_tokenize(text)
lemma = WordNetLemmatizer()

# Without a POS argument, lemmatize() treats every token as a noun
lem_sent = [lemma.lemmatize(words_sent) for words_sent in tokenizer]
ps_sent = [ps.stem(words_sent) for words_sent in tokenizer ]
sno_sent = [sno.stem(words_sent) for words_sent in tokenizer ]

stem3df= pd.DataFrame({'original_word': tokenizer,'porter_word':ps_sent,'snowball_word':sno_sent,'lemma_word':lem_sent})
stem3df

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package wordnet to /root/nltk_data...


| | original_word | porter_word | snowball_word | lemma_word |
|---|---|---|---|---|
| 0 | She | she | she | She |
| 1 | jumped | jump | jump | jumped |
| 2 | into | into | into | into |
| 3 | the | the | the | the |
| 4 | river | river | river | river |
| 5 | and | and | and | and |
| 6 | breathed | breath | breath | breathed |
| 7 | heavily | heavili | heavili | heavily |


In [None]:
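#Lemmatizer with POS tag
from nltk import word_tokenize,pos_tag

for token,tag in pos_tag(word_tokenize(text)):
    pos=tag[0].lower() # shorten the full tag to its first letter, lowercased
    # Note: lemmatize() is still called without a pos argument here, so every token is treated as a noun
    print(token,"--->",lemma.lemmatize(token)," with POS =",tag, " or ",pos)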

She ---> She  with POS = PRP  or  p
jumped ---> jumped  with POS = VBD  or  v
into ---> into  with POS = IN  or  i
the ---> the  with POS = DT  or  d
river ---> river  with POS = NN  or  n
and ---> and  with POS = CC  or  c
breathed ---> breathed  with POS = VBD  or  v
heavily ---> heavily  with POS = RB  or  r
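The loop above prints the tag but does not pass it to the lemmatizer, which is why "jumped" and "breathed" come back unchanged. `WordNetLemmatizer.lemmatize` accepts a `pos` argument using WordNet's own letters ('n', 'v', 'a', 'r'), so the first letter of a Penn Treebank tag cannot be used directly. Below is a minimal sketch of one common mapping; `penn_to_wordnet` is a hypothetical helper, and falling back to noun for all other tags is an assumption.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith('J'):
        return 'a'  # adjective (JJ, JJR, JJS)
    if tag.startswith('V'):
        return 'v'  # verb (VB, VBD, VBG, ...)
    if tag.startswith('R'):
        return 'r'  # adverb (RB, RBR, RBS)
    return 'n'      # everything else is treated as a noun

# (token, tag) pairs copied from the output above
tagged = [('She', 'PRP'), ('jumped', 'VBD'), ('into', 'IN'), ('the', 'DT'),
          ('river', 'NN'), ('and', 'CC'), ('breathed', 'VBD'), ('heavily', 'RB')]

wordnet_tags = [(token, penn_to_wordnet(tag)) for token, tag in tagged]
print(wordnet_tags)
```

In the notebook, the mapped letter would then be passed as `lemma.lemmatize(token, pos=penn_to_wordnet(tag))`, which should let WordNet reduce verb forms such as "jumped" to their base form.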


# Merge all the tokens to form a long text sequence

In [None]:
import re
from nltk.stem import PorterStemmer

ps = PorterStemmer()
text = "Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP."
print(text)

# Tokenize: replace punctuation with spaces, split, and drop empty tokens
text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
tokens = [token for token in text.split(" ") if token != ""]

# Stem every token
tokens = [ps.stem(token) for token in tokens]

# Merge all the tokens to form a long text sequence
text2 = ' '.join(tokens)

print(text2)

Hello everyone. Welcome to Intro to Machine Learning Applications. We are now learning important basics of NLP.
hello everyon welcom to intro to machin learn applic we are now learn import basic of nlp


# Stop words

Stop words are common words that appear frequently in sentences or documents but contribute little to the meaning, so we can remove them from the vocabulary and filter them out of the documents. Examples: a, an, the, also, just, quite, unless, etc.

See more at https://colab.research.google.com/github/gnoparus/bualabs/blob/master/nbs/26a_stop_words.ipynb

In [None]:
#Stopwords removal
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "Hello All. Welcome to medium. This article is about NLP using NLTK."

stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)

filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

['Hello', 'All', '.', 'Welcome', 'to', 'medium', '.', 'This', 'article', 'is', 'about', 'NLP', 'using', 'NLTK', '.']
['Hello', 'All', '.', 'Welcome', 'medium', '.', 'This', 'article', 'NLP', 'using', 'NLTK', '.']
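In the output above, "All" and "This" survive because the comparison is case-sensitive and the stop-word list is lowercase. The sketch below compares lowercased tokens instead and also drops punctuation. The tiny `stop_words` set is a stand-in chosen for this example; in the notebook it would be `set(stopwords.words('english'))`.

```python
# Small illustrative stop-word set (assumption); NLTK's English list is much longer
stop_words = {"all", "to", "this", "is", "about"}

word_tokens = ['Hello', 'All', '.', 'Welcome', 'to', 'medium', '.', 'This',
               'article', 'is', 'about', 'NLP', 'using', 'NLTK', '.']

# Compare the lowercased token so capitalized stop words are removed too
no_stop = [w for w in word_tokens if w.lower() not in stop_words]

# Optionally drop tokens that are pure punctuation
filtered_sentence = [w for w in no_stop if w.isalnum()]

print(no_stop)
print(filtered_sentence)
```

Whether punctuation should be removed at this stage depends on the downstream task, so it is kept as a separate step.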


In [None]:
#Part-of-Speech tagging

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

text = "Medium welcomes you and this article is about NLP using NLTK."

sent = nltk.word_tokenize(text)
print(sent)
postag = nltk.pos_tag(sent)
print(postag)

['Medium', 'welcomes', 'you', 'and', 'this', 'article', 'is', 'about', 'NLP', 'using', 'NLTK', '.']
[('Medium', 'NNP'), ('welcomes', 'VBZ'), ('you', 'PRP'), ('and', 'CC'), ('this', 'DT'), ('article', 'NN'), ('is', 'VBZ'), ('about', 'IN'), ('NLP', 'NNP'), ('using', 'VBG'), ('NLTK', 'NNP'), ('.', '.')]


# Named-Entity Recognition
Once POS tagging has given us the nouns, we move on to Named-Entity Recognition: tagging whether a word is the name of something in the real world and, if so, what type it is, such as a person, a place, or an organization.

See more examples at https://www.bualabs.com/archives/4112/what-is-part-of-speech-tagging-what-is-named-entity-recognition-tagging-tutorial-pos-tagging-ner-thai-language-pythainlp-ep-4/

In [None]:
#Named entity recognition

# spaCy is an NLP framework that is easy to use and ships with neural-network models

import en_core_web_sm
nlp = en_core_web_sm.load()

text = 'GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside 40 million developers.'

doc = nlp(text)
print(doc.ents)
print([(X.text, X.label_) for X in doc.ents])

(GitHub, 40 million)
[('GitHub', 'ORG'), ('40 million', 'CARDINAL')]


# Bag of words (BoW)

In [None]:
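from collections import Counter
from nltk.tokenize import word_tokenize

# Read the whole article; the with-block closes the file automatically
with open("/content/myfile.txt", "r") as f:
    article = f.read()

tokens = word_tokenize(article)
lower_tokens = [t.lower() for t in tokens]
# Count how often each lowercased token occurs
bow_simple = Counter(lower_tokens)
print(bow_simple.most_common(10))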

TypeError: ignored
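The same idea can be tried without an input file. This sketch counts words in an inline string, using a simple regex tokenizer in place of `word_tokenize`; the sample text is made up for illustration.

```python
import re
from collections import Counter

article = "The cat sat on the mat. The dog sat on the log."

# Lowercase, then keep runs of letters and digits as tokens
tokens = re.findall(r"[a-z0-9]+", article.lower())

# A bag of words is just a mapping from each token to its count
bow_simple = Counter(tokens)
top3 = bow_simple.most_common(3)
print(top3)
```

Word order is discarded: only the counts remain, which is what "bag" refers to.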

How to create a bag of words from a DataFrame using CountVectorizer

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import re
from io import StringIO

# This is our string data in CSV form: a header row, then one text,class pair per line
StringData = """text,class
     GitHub is a development platform inspired by the way you work. From open source to business you can host and review code manage projects and build software alongside 40 million developers, 1.
     Now your DataFrame df contains the original text data as well as the BoW representation of that data in the form of a set of columns where each column represents a unique word in your text data,0.
     Now we will use Pandas pd.read_clipboard() function to read the data into a DataFrame,1.
    """

# Wrap the string in StringIO so that read_csv can read it like a file
df  =  pd.read_csv(StringIO(StringData.strip()))

# Print the DataFrame
df



| | text | class |
|---|---|---|
| 0 | GitHub is a development platform inspired... | 1.0 |
| 1 | Now your DataFrame df contains the origin... | 0.0 |
| 2 | Now we will use Pandas pd.read_clipboard(... | 1.0 |


In [None]:
# Assuming you have a DataFrame 'df' with a column 'text'
df['text'] = df['text'].str.lower()  # Convert to lowercase
df['text'] = df['text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', x))  # Remove punctuation
# Initialize the CountVectorizer, which builds the bag-of-words representation
vectorizer = CountVectorizer()

# Fit and transform the text data to create the BoW representation
X = vectorizer.fit_transform(df['text'])

# Convert the BoW representation to a DataFrame (optional)
# get_feature_names_out() returns the vocabulary words, used here as column names
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df_bow = pd.concat([df['text'], bow_df], axis=1)
df_bow

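| | text | 40 | alongside | and | as | bow | build | business | by | can | ... | use | way | we | well | where | will | word | work | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | github is a development platform inspired... | 1 | 1 | 2 | 0 | 0 | 1 | 1 | 1 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
| 1 | now your dataframe df contains the origin... | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 2 |
| 2 | now we will use pandas pdreadclipboard fu... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |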
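To make the document-term matrix less of a black box, here is a rough pure-Python sketch of the same idea: build a sorted vocabulary, then count each word per document. It is not scikit-learn's implementation; the sample documents are made up, and the two-or-more-character token rule is meant to mirror CountVectorizer's default tokenization, which is why a word like "a" gets no column.

```python
import re

docs = ["the cat sat on the mat", "the dog sat", "a cat and a dog"]

def tokenize(doc):
    """Lowercase and keep tokens made of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token, sorted alphabetically, with a column index
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})
index = {tok: i for i, tok in enumerate(vocabulary)}

# One row per document, one column per vocabulary word
matrix = []
for doc in docs:
    row = [0] * len(vocabulary)
    for tok in tokenize(doc):
        row[index[tok]] += 1
    matrix.append(row)

print(vocabulary)
for row in matrix:
    print(row)
```

Real corpora produce mostly zeros, which is why `fit_transform` returns a sparse matrix and the notebook calls `toarray()` only for display.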


# Sentiment analysis
Sentiment analysis is the analysis of emotion and feeling expressed in text, used to indicate how people feel about something. The result falls into one of three classes:
Positive = favorable
Negative = unfavorable
Neutral = neither favorable nor unfavorable

Example: get data from Twitter
https://pypi.org/project/twython/

In [None]:
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

s2 = 'This was the best, most awesome movie EVER MADE!!!'
print("polarity score for s2:")
sia.polarity_scores(s2)

polarity score for s2:


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


{'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}
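The `compound` value is a single normalized score between -1 and 1, so it can be turned into one of the three classes with a threshold. The sketch below uses cutoffs of +0.05 and -0.05, a convention commonly suggested for VADER; treat the exact values, and the helper name `label_from_compound`, as assumptions to tune for your data.

```python
def label_from_compound(compound, threshold=0.05):
    """Turn a VADER compound score into a positive/negative/neutral label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the output above
scores = {'neg': 0.0, 'neu': 0.425, 'pos': 0.575, 'compound': 0.8877}
label = label_from_compound(scores['compound'])
print(label)
```

In the notebook, the dictionary would come directly from `sia.polarity_scores(s2)`.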

# Exercise
Students should complete the data preprocessing exercise in the file Moview_review.ipynb