# Text Mining
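Text mining is the practice of extracting and analyzing valuable information, patterns, and insights from large volumes of unstructured text data. Also called text data mining or text analytics, it draws on artificial intelligence (AI), natural language processing (NLP), statistics, and machine learning so that computers can analyze the meaning of large text collections that would be hard for people to read and digest by hand. Put simply, text mining is like digging for gold in a mountain of documents, emails, social media posts, and news articles. Through this process you can uncover trends hidden in text, understand how people feel, summarize key information, and more.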

Text mining generally proceeds through the following steps.

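1. Data Collection
2. Data Pre-processing
    * Tokenization: splitting sentences into the smallest meaningful units, such as words or morphemes.
    * Cleaning: removing unnecessary punctuation, special characters, HTML tags, and the like.
    * Stop-word Removal: removing words that appear often but carry little meaning for analysis (stop words), such as the Korean particles '은', '는', '이', '가'.
    * Stemming & Lemmatization: unifying the variant forms of a word (e.g. the Korean forms '달리다', '달리고', '달려서', all inflections of "to run") into a single base form ('달리다').
3. Text Transformation & Feature Extraction: converting the pre-processed text into numeric data (vectors) that a machine learning model can process.
4. Text Analysis & Mining: applying analysis techniques to the now-structured data to derive patterns and insights.
5. Interpretation & Visualization: interpreting the results and presenting them in easy-to-understand forms such as word clouds, topic-model visualizations, and sentiment analysis charts.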
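
Steps 2 and 3 of the list above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the NLTK-based workflow used in the rest of this notebook: the sample documents, the tiny `STOP_WORDS` set, and the helper names `preprocess` and `vectorize` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy inputs; a real project would load a corpus and a fuller stop-word list.
DOCS = ["The movie was great, really great!", "The plot was bad."]
STOP_WORDS = {"the", "was"}

def preprocess(text):
    """Lowercase, keep alphabetic runs as tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def vectorize(token_lists):
    """Build a sorted vocabulary and one count vector per document."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

token_lists = [preprocess(d) for d in DOCS]
vocab, vectors = vectorize(token_lists)
print(vocab)    # ['bad', 'great', 'movie', 'plot', 'really']
print(vectors)  # [[0, 2, 1, 0, 1], [1, 0, 0, 1, 0]]
```

Each document becomes a vector of word counts over a shared vocabulary, which is one simple form of the "numeric data" mentioned in step 3.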

In [1]:
import nltk

# Pre-trained model for part-of-speech (POS) tagging.
nltk.download('averaged_perceptron_tagger_eng')

# punkt is a pre-trained tokenizer model that accounts for language-specific
# features such as periods and abbreviations (Mr., Dr.).
nltk.download('punkt')

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\campus4D004\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\campus4D004\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [8]:
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')

TEXT = """After reading the comments for this movie, I am not sure whether I should be angry, sad or sickened. 
Seeing comments typical of people who a)know absolutely nothing about the military or b)who base everything they think they know on movies like this or on CNN reports about Abu-Gharib makes me wonder about the state of intellectual stimulation in the world. 
At the time I type this the number of people in the US military: 1.4 million on Active Duty with another almost 900,000 in the Guard and Reserves for a total of roughly 2.3 million. 
The number of people indicted for abuses at at Abu-Gharib: Currently less than 20 That makes the total of people indicted. 
00083% of the total military. Even if you indict every single military member that ever stepped in to Abu-Gharib, you would not come close to making that a whole number.  The flaws in this movie would take YEARS to cover. 
I understand that it's supposed to be sarcastic, but in reality, the writer and director are trying to make commentary about the state of the military without an enemy to fight. 
In reality, the US military has been at its busiest when there are not conflicts going on. The military is the first called for disaster relief and humanitarian aid missions. 
When the tsunami hit Indonesia, devestating the region, the US military was the first on the scene. 
When the chaos of the situation overwhelmed the local governments, it was military leadership who looked at their people, the same people this movie mocks, and said make it happen. 
Within hours, food aid was reaching isolated villages. Within days, airfields were built, cargo aircraft started landing and a food distribution system was up and running. Hours and days, not weeks and months. Yes there are unscrupulous people in the US military. But then, there are in every walk of life, every occupation. But to see people on this website decide that 2.3 million men and women are all criminal, with nothing on their minds but thoughts of destruction or mayhem is an absolute disservice to the things that they do every day. One person on this website even went so far as to say that military members are in it for personal gain. Wow! Entry level personnel make just under $8.00 an hour assuming a 40 hour work week. Of course, many work much more than 40 hours a week and those in harm's way typically put in 16-18 hour days for months on end. That makes the pay well under minimum wage. So much for personal gain. I beg you, please make yourself familiar with the world around you. Go to a nearby base, get a visitor pass and meet some of the men and women you are so quick to disparage. You would be surprised. The military no longer accepts people in lieu of prison time. They require a minimum of a GED and prefer a high school diploma. The middle ranks are expected to get a minimum of undergraduate degrees and the upper ranks are encouraged to get advanced degrees."""

# Word tokenization
tokenized_words = word_tokenize(TEXT)
print(tokenized_words)

['After', 'reading', 'the', 'comments', 'for', 'this', 'movie', ',', 'I', 'am', 'not', 'sure', 'whether', 'I', 'should', 'be', 'angry', ',', 'sad', 'or', 'sickened', '.', 'Seeing', 'comments', 'typical', 'of', 'people', 'who', 'a', ')', 'know', 'absolutely', 'nothing', 'about', 'the', 'military', 'or', 'b', ')', 'who', 'base', 'everything', 'they', 'think', 'they', 'know', 'on', 'movies', 'like', 'this', 'or', 'on', 'CNN', 'reports', 'about', 'Abu-Gharib', 'makes', 'me', 'wonder', 'about', 'the', 'state', 'of', 'intellectual', 'stimulation', 'in', 'the', 'world', '.', 'At', 'the', 'time', 'I', 'type', 'this', 'the', 'number', 'of', 'people', 'in', 'the', 'US', 'military', ':', '1.4', 'million', 'on', 'Active', 'Duty', 'with', 'another', 'almost', '900,000', 'in', 'the', 'Guard', 'and', 'Reserves', 'for', 'a', 'total', 'of', 'roughly', '2.3', 'million', '.', 'The', 'number', 'of', 'people', 'indicted', 'for', 'abuses', 'at', 'at', 'Abu-Gharib', ':', 'Currently', 'less', 'than', '20', 'T

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\campus4D004\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [9]:
from collections import Counter
vocab = Counter(tokenized_words)
print(vocab)

Counter({'the': 30, '.': 29, ',': 21, 'of': 15, 'and': 14, 'to': 13, 'a': 12, 'military': 12, 'in': 12, 'people': 9, 'on': 9, 'are': 9, 'for': 7, 'this': 7, 'that': 6, 'I': 5, 'The': 5, 'you': 5, 'not': 4, 'or': 4, 'about': 4, 'US': 4, 'at': 4, 'every': 4, 'it': 4, 'make': 4, 'was': 4, 'movie': 3, 'be': 3, 'who': 3, 'they': 3, 'Abu-Gharib': 3, 'makes': 3, 'number': 3, 'million': 3, 'with': 3, 'total': 3, 'would': 3, 'an': 3, 'there': 3, 'days': 3, 'hour': 3, 'minimum': 3, 'get': 3, 'comments': 2, ')': 2, 'know': 2, 'nothing': 2, 'base': 2, 'state': 2, 'world': 2, 'time': 2, ':': 2, '2.3': 2, 'indicted': 2, 'than': 2, 'That': 2, "'s": 2, 'but': 2, 'reality': 2, 'is': 2, 'first': 2, 'aid': 2, 'When': 2, 'their': 2, 'Within': 2, 'hours': 2, 'food': 2, 'months': 2, 'But': 2, 'website': 2, 'men': 2, 'women': 2, 'so': 2, 'personal': 2, 'gain': 2, 'under': 2, '40': 2, 'work': 2, 'week': 2, 'much': 2, 'ranks': 2, 'degrees': 2, 'After': 1, 'reading': 1, 'am': 1, 'sure': 1, 'whether': 1, 'should

In [10]:
print(vocab.most_common(10))

[('the', 30), ('.', 29), (',', 21), ('of', 15), ('and', 14), ('to', 13), ('a', 12), ('military', 12), ('in', 12), ('people', 9)]


In [13]:
uncommon_words = []
for key, value in vocab.items():
    if value <= 2:
        uncommon_words.append(key)
print(uncommon_words)

['After', 'reading', 'comments', 'am', 'sure', 'whether', 'should', 'angry', 'sad', 'sickened', 'Seeing', 'typical', ')', 'know', 'absolutely', 'nothing', 'b', 'base', 'everything', 'think', 'movies', 'like', 'CNN', 'reports', 'me', 'wonder', 'state', 'intellectual', 'stimulation', 'world', 'At', 'time', 'type', ':', '1.4', 'Active', 'Duty', 'another', 'almost', '900,000', 'Guard', 'Reserves', 'roughly', '2.3', 'indicted', 'abuses', 'Currently', 'less', 'than', '20', 'That', '00083', '%', 'Even', 'if', 'indict', 'single', 'member', 'ever', 'stepped', 'come', 'close', 'making', 'whole', 'flaws', 'take', 'YEARS', 'cover', 'understand', "'s", 'supposed', 'sarcastic', 'but', 'reality', 'writer', 'director', 'trying', 'commentary', 'without', 'enemy', 'fight', 'In', 'has', 'been', 'its', 'busiest', 'when', 'conflicts', 'going', 'is', 'first', 'called', 'disaster', 'relief', 'humanitarian', 'aid', 'missions', 'When', 'tsunami', 'hit', 'Indonesia', 'devestating', 'region', 'scene', 'chaos', '

In [15]:
cleaned_by_freq = []
for i in tokenized_words:
    if i not in uncommon_words:
        cleaned_by_freq.append(i)
print(cleaned_by_freq)

['the', 'for', 'this', 'movie', ',', 'I', 'not', 'I', 'be', ',', 'or', '.', 'of', 'people', 'who', 'a', 'about', 'the', 'military', 'or', 'who', 'they', 'they', 'on', 'this', 'or', 'on', 'about', 'Abu-Gharib', 'makes', 'about', 'the', 'of', 'in', 'the', '.', 'the', 'I', 'this', 'the', 'number', 'of', 'people', 'in', 'the', 'US', 'military', 'million', 'on', 'with', 'in', 'the', 'and', 'for', 'a', 'total', 'of', 'million', '.', 'The', 'number', 'of', 'people', 'for', 'at', 'at', 'Abu-Gharib', 'makes', 'the', 'total', 'of', 'people', '.', 'of', 'the', 'total', 'military', '.', 'you', 'every', 'military', 'that', 'in', 'to', 'Abu-Gharib', ',', 'you', 'would', 'not', 'to', 'that', 'a', 'number', '.', 'The', 'in', 'this', 'movie', 'would', 'to', '.', 'I', 'that', 'it', 'to', 'be', ',', 'in', ',', 'the', 'and', 'are', 'to', 'make', 'about', 'the', 'of', 'the', 'military', 'an', 'to', '.', ',', 'the', 'US', 'military', 'at', 'there', 'are', 'not', 'on', '.', 'The', 'military', 'the', 'for', '

In [16]:
len(cleaned_by_freq)

307

In [17]:
len(uncommon_words)

234

In [19]:
# Keep only tokens that are at least 3 characters long.
cleaned_by_freq_len = []
for i in cleaned_by_freq:
    if len(i) >= 3:
        cleaned_by_freq_len.append(i)
print(cleaned_by_freq_len)

['the', 'for', 'this', 'movie', 'not', 'people', 'who', 'about', 'the', 'military', 'who', 'they', 'they', 'this', 'about', 'Abu-Gharib', 'makes', 'about', 'the', 'the', 'the', 'this', 'the', 'number', 'people', 'the', 'military', 'million', 'with', 'the', 'and', 'for', 'total', 'million', 'The', 'number', 'people', 'for', 'Abu-Gharib', 'makes', 'the', 'total', 'people', 'the', 'total', 'military', 'you', 'every', 'military', 'that', 'Abu-Gharib', 'you', 'would', 'not', 'that', 'number', 'The', 'this', 'movie', 'would', 'that', 'the', 'and', 'are', 'make', 'about', 'the', 'the', 'military', 'the', 'military', 'there', 'are', 'not', 'The', 'military', 'the', 'for', 'and', 'the', 'the', 'the', 'military', 'was', 'the', 'the', 'the', 'the', 'the', 'was', 'military', 'who', 'people', 'the', 'people', 'this', 'movie', 'and', 'make', 'was', 'days', 'and', 'was', 'and', 'and', 'days', 'not', 'and', 'there', 'are', 'people', 'the', 'military', 'there', 'are', 'every', 'every', 'people', 'this'

In [20]:
# Frequency-based cleaning function
def clean_by_freq(tokenized_words, cut_off_count):
    # Count how often each token occurs (Counter comes from the collections module)
    vocab = Counter(tokenized_words)

    # Collect the set of tokens whose frequency is cut_off_count or lower
    uncommon_words = {key for key, value in vocab.items() if value <= cut_off_count}

    # Keep only the tokens that are not in uncommon_words
    cleaned_words = [word for word in tokenized_words if word not in uncommon_words]

    return cleaned_words

# Length-based cleaning function
def clean_by_len(tokenized_words, cut_off_length):
    # Drop tokens whose length is cut_off_length or shorter
    cleaned_by_freq_len = []

    for word in tokenized_words:
        if len(word) > cut_off_length:
            cleaned_by_freq_len.append(word)

    return cleaned_by_freq_len

In [23]:
from nltk.corpus import stopwords
nltk.download("stopwords")

stopwords_set = set(stopwords.words("english"))
print(stopwords_set)

{'themselves', 'theirs', 'as', 'if', 'after', 'him', 'having', 'out', "we'll", "haven't", 'ourselves', "we're", 'with', 'more', 'hadn', 'of', 'all', 'these', 'was', 'below', 'where', 'haven', 'aren', 'about', 'itself', 'myself', 'into', 'll', 'they', 'you', "wasn't", 'once', 't', 'other', 'ain', 'off', "you're", 'he', 'me', 'same', 'while', 'over', "won't", 'shouldn', 'few', 'their', 'does', 'wouldn', "they've", 'during', 'this', "you'll", 'has', 'is', 'which', "needn't", 'them', 'or', "she'd", 'own', 'so', 'there', 'because', 'when', 'both', 'd', 'doesn', 'had', 'she', 'herself', 'until', "you've", "it's", "hasn't", "he's", "i'd", "she'll", 'didn', 'most', 'whom', 'mightn', 'yourself', 'y', 'at', 'and', 'do', "we'd", "they'll", 'himself', "she's", 'our', 'ours', 've', 'm', 'no', "don't", 'it', 'how', 'my', 'not', "wouldn't", 'here', 'have', "didn't", 'being', 'its', 'yours', 'isn', "you'd", 'nor', "they're", 'than', 'before', 'only', 'in', 'such', 'those', "it'll", 'too', 'your', 'sho

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\campus4D004\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [24]:
box = ["oh", 'the', 'i']
stopwords_set.update(box)
print(stopwords_set)

{'themselves', 'theirs', 'as', 'if', 'after', 'him', 'having', 'out', "we'll", "haven't", 'ourselves', "we're", 'with', 'more', 'hadn', 'of', 'all', 'these', 'was', 'below', 'where', 'haven', 'aren', 'about', 'itself', 'myself', 'into', 'll', 'they', 'oh', 'you', "wasn't", 'once', 't', 'other', 'ain', 'off', "you're", 'he', 'me', 'same', 'while', 'over', "won't", 'shouldn', 'few', 'their', 'does', 'wouldn', "they've", 'during', 'this', "you'll", 'has', 'is', 'which', "needn't", 'them', 'or', "she'd", 'own', 'so', 'there', 'because', 'when', 'both', 'd', 'doesn', 'had', 'she', 'herself', 'until', "you've", "it's", "hasn't", "he's", "i'd", "she'll", 'didn', 'most', 'whom', 'mightn', 'yourself', 'y', 'at', 'and', 'do', "we'd", "they'll", 'himself', "she's", 'our', 'ours', 've', 'm', 'no', "don't", 'it', 'how', 'my', 'not', "wouldn't", 'here', 'have', "didn't", 'being', 'its', 'yours', 'isn', "you'd", 'nor', "they're", 'than', 'before', 'only', 'in', 'such', 'those', "it'll", 'too', 'your'

In [25]:
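# add() inserts a single stop word; remove() deletes one.
stopwords_set.add("hello")
# Caution: 'the' is removed from the set here, so it survives the stop-word filtering in later cells.
stopwords_set.remove("the")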

In [26]:
cleaned_words = []
for i in cleaned_by_freq_len:
    if i not in stopwords_set:
        cleaned_words.append(i)
print(cleaned_words)

['the', 'movie', 'people', 'the', 'military', 'Abu-Gharib', 'makes', 'the', 'the', 'the', 'the', 'number', 'people', 'the', 'military', 'million', 'the', 'total', 'million', 'The', 'number', 'people', 'Abu-Gharib', 'makes', 'the', 'total', 'people', 'the', 'total', 'military', 'every', 'military', 'Abu-Gharib', 'would', 'number', 'The', 'movie', 'would', 'the', 'make', 'the', 'the', 'military', 'the', 'military', 'The', 'military', 'the', 'the', 'the', 'the', 'military', 'the', 'the', 'the', 'the', 'the', 'military', 'people', 'the', 'people', 'movie', 'make', 'days', 'days', 'people', 'the', 'military', 'every', 'every', 'people', 'million', 'the', 'every', 'military', 'make', 'hour', 'hour', 'hour', 'days', 'makes', 'the', 'minimum', 'make', 'the', 'get', 'the', 'would', 'The', 'military', 'people', 'minimum', 'The', 'get', 'minimum', 'the', 'get']


In [27]:
# Stop-word removal function
def clean_by_stopwords(tokenized_words, stop_words_set):
    cleaned_words = []
    for word in tokenized_words:
        if word not in stop_words_set:
            cleaned_words.append(word)
    return cleaned_words

In [30]:
dic = {"US" : "USA", "U.S" : "USA", "Ummm" : "Umm", "Ummmm" : "Umm"}

text2 = "She became a US citizen. Ummmm"

tokenized_words = word_tokenize(text2)

normalized_words = []
for i in tokenized_words:
    if i in dic:
        i = dic[i]
    normalized_words.append(i)
    
print(normalized_words)

['She', 'became', 'a', 'USA', 'citizen', '.', 'Umm']


## Stemming
Stemming is one of the core techniques of text preprocessing: it uses rule-based processing to reduce the variant forms of a word (plural, progressive, past tense, and so on) to a simplified form, the stem. For example, studies, studying, and studied all share the core meaning "to study". Stemming maps all of them to the common form studi, which shrinks the vocabulary and makes analysis more efficient. An important caveat is that stemming works by stripping suffixes according to fixed rules, so the result may not be a real dictionary word (e.g. studies -> studi).
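
To make the idea of rule-based suffix stripping concrete, here is a deliberately tiny stemmer. It is a sketch for illustration only and is not the Porter algorithm: the rule table `SUFFIX_RULES` and the function name `toy_stem` are hypothetical, and real stemmers apply many more rules and conditions.

```python
# Toy rule-based stemmer for illustration only; this is NOT the Porter algorithm.
# Each rule is (suffix to strip, replacement); the first matching rule wins.
SUFFIX_RULES = [("ies", "i"), ("ied", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    """Strip the first matching suffix, then turn a final consonant + y into i."""
    for suffix, replacement in SUFFIX_RULES:
        # Require at least 3 characters to remain so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)] + replacement
            break
    if len(word) > 2 and word.endswith("y") and word[-2] not in "aeiou":
        word = word[:-1] + "i"
    return word

for w in ["studies", "studying", "studied"]:
    print(f"{w} -> {toy_stem(w)}")  # each prints "... -> studi"
```

All three forms collapse to the same non-dictionary string, which is the trade-off described above: a smaller vocabulary at the cost of stems that are not real words.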

In [33]:
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()

# Words to test
words_to_stem = ['program', 'programs', 'programmer', 'programming', 'programmers']
# Add a few more words for comparison
words_to_stem.extend(['history', 'historical', 'computation', 'computer', 'compute'])

for i in words_to_stem:
    stem = porter_stemmer.stem(i)
    print(f"{i} -> {stem}")

program -> program
programs -> program
programmer -> programm
programming -> program
programmers -> programm
history -> histori
historical -> histor
computation -> comput
computer -> comput
compute -> comput


In [34]:
# Porter stemmer function
def stemming_by_porter(tokenized_words):
    porter_stemmer = PorterStemmer()
    porter_stemmed_words = []

    for word in tokenized_words:
        stem = porter_stemmer.stem(word)
        porter_stemmed_words.append(stem)
    return porter_stemmed_words

In [40]:
text = "You are so lovely. I am loving you now."
text = text.lower()

porter_stemmer = PorterStemmer()

box = []
tokenized_words = word_tokenize(text)

for i in tokenized_words:
    stem = porter_stemmer.stem(i)
    box.append(stem)
    
print(box)

['you', 'are', 'so', 'love', '.', 'i', 'am', 'love', 'you', 'now', '.']


In [42]:
import pandas as pd

df = pd.read_csv('imdb.tsv', sep="\t")
del df['Unnamed: 0']
df

Unnamed: 0,review
0,"Watching Time Chasers, it obvious that it was ..."
1,I saw this film about 20 years ago and remembe...
2,"Minor Spoilers In New York, Joan Barnard (Elvi..."
3,I went to see this film with a great deal of e...
4,"Yes, I agree with everyone on this site this m..."
5,"Jennifer Ehle was sparkling in \""Pride and Pre..."
6,Amy Poehler is a terrific comedian on Saturday...
7,A plane carrying employees of a large biotech ...
8,"A well made, gritty science fiction movie, it ..."
9,Incredibly dumb and utterly predictable story ...


In [45]:
df['review'] = df['review'].str.lower()
df

Unnamed: 0,review
0,"watching time chasers, it obvious that it was ..."
1,i saw this film about 20 years ago and remembe...
2,"minor spoilers in new york, joan barnard (elvi..."
3,i went to see this film with a great deal of e...
4,"yes, i agree with everyone on this site this m..."
5,"jennifer ehle was sparkling in \""pride and pre..."
6,amy poehler is a terrific comedian on saturday...
7,a plane carrying employees of a large biotech ...
8,"a well made, gritty science fiction movie, it ..."
9,incredibly dumb and utterly predictable story ...


In [63]:
df["word_token"] = df['review'].apply(word_tokenize)
df["cleaned_token"] = df['word_token'].apply(lambda x : clean_by_freq(x, 1))
df['cleaned_token'] = df['cleaned_token'].apply(lambda x : clean_by_len(x, 2))
df['cleaned_token'] = df['cleaned_token'].apply(lambda x : clean_by_stopwords(x, stopwords_set))
df

Unnamed: 0,review,word_token,cleaned_token
0,"watching time chasers, it obvious that it was ...","[watching, time, chasers, ,, it, obvious, that...","[one, film, said, really, bad, movie, like, sa..."
1,i saw this film about 20 years ago and remembe...,"[i, saw, this, film, about, 20, years, ago, an...","[film, the, the, the, film]"
2,"minor spoilers in new york, joan barnard (elvi...","[minor, spoilers, in, new, york, ,, joan, barn...","[new, york, joan, barnard, elvire, audrey, the..."
3,i went to see this film with a great deal of e...,"[i, went, to, see, this, film, with, a, great,...","[went, film, the, film, the, went, the, jump, ..."
4,"yes, i agree with everyone on this site this m...","[yes, ,, i, agree, with, everyone, on, this, s...","[site, movie, bad, even, movie, made, movie, s..."
5,"jennifer ehle was sparkling in \""pride and pre...","[jennifer, ehle, was, sparkling, in, \, '', pr...","[ehle, northam, wonderful, the, the, the, wond..."
6,amy poehler is a terrific comedian on saturday...,"[amy, poehler, is, a, terrific, comedian, on, ...","[role, movie, n't, author, book, the, author, ..."
7,a plane carrying employees of a large biotech ...,"[a, plane, carrying, employees, of, a, large, ...","[plane, the, ceo, the, the, search, rescue, mi..."
8,"a well made, gritty science fiction movie, it ...","[a, well, made, ,, gritty, science, fiction, m...","[gritty, movie, the, the, the, sci-fi, good, s..."
9,incredibly dumb and utterly predictable story ...,"[incredibly, dumb, and, utterly, predictable, ...","[girl, girl, the, the, the, the]"


In [66]:
df["stemmed_tokens"] = df['cleaned_token'].apply(stemming_by_porter)
df

Unnamed: 0,review,word_token,cleaned_token,stemmed_tokens
0,"watching time chasers, it obvious that it was ...","[watching, time, chasers, ,, it, obvious, that...","[one, film, said, really, bad, movie, like, sa...","[one, film, said, realli, bad, movi, like, sai..."
1,i saw this film about 20 years ago and remembe...,"[i, saw, this, film, about, 20, years, ago, an...","[film, the, the, the, film]","[film, the, the, the, film]"
2,"minor spoilers in new york, joan barnard (elvi...","[minor, spoilers, in, new, york, ,, joan, barn...","[new, york, joan, barnard, elvire, audrey, the...","[new, york, joan, barnard, elvir, audrey, the,..."
3,i went to see this film with a great deal of e...,"[i, went, to, see, this, film, with, a, great,...","[went, film, the, film, the, went, the, jump, ...","[went, film, the, film, the, went, the, jump, ..."
4,"yes, i agree with everyone on this site this m...","[yes, ,, i, agree, with, everyone, on, this, s...","[site, movie, bad, even, movie, made, movie, s...","[site, movi, bad, even, movi, made, movi, spec..."
5,"jennifer ehle was sparkling in \""pride and pre...","[jennifer, ehle, was, sparkling, in, \, '', pr...","[ehle, northam, wonderful, the, the, the, wond...","[ehl, northam, wonder, the, the, the, wonder, ..."
6,amy poehler is a terrific comedian on saturday...,"[amy, poehler, is, a, terrific, comedian, on, ...","[role, movie, n't, author, book, the, author, ...","[role, movi, n't, author, book, the, author, t..."
7,a plane carrying employees of a large biotech ...,"[a, plane, carrying, employees, of, a, large, ...","[plane, the, ceo, the, the, search, rescue, mi...","[plane, the, ceo, the, the, search, rescu, mis..."
8,"a well made, gritty science fiction movie, it ...","[a, well, made, ,, gritty, science, fiction, m...","[gritty, movie, the, the, the, sci-fi, good, s...","[gritti, movi, the, the, the, sci-fi, good, su..."
9,incredibly dumb and utterly predictable story ...,"[incredibly, dumb, and, utterly, predictable, ...","[girl, girl, the, the, the, the]","[girl, girl, the, the, the, the]"


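## Sentence Tokenization
Sentence tokenization splits a longer text (a document, a paragraph, and so on) into individual sentences. In other words, it turns the whole text into a list of sentences, using marks such as the period (.), question mark (?), and exclamation mark (!) as boundaries.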
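
The boundary rule described above can be approximated with a single regular expression. The sketch below is a naive baseline, not how NLTK's `sent_tokenize` works; the helper name `naive_sent_split` and the second sample sentence are hypothetical. It also shows where the naive rule fails, which is the kind of case a trained model such as punkt is meant to handle.

```python
import re

def naive_sent_split(text):
    """Split after '.', '?' or '!' whenever whitespace follows."""
    return re.split(r"(?<=[.?!])\s+", text.strip())

# Works when the only sentence-ending marks are real boundaries.
print(naive_sent_split("Can you forward my email to Mr.Kim? Thank you!"))
# ['Can you forward my email to Mr.Kim?', 'Thank you!']

# Breaks on an abbreviation followed by a space.
print(naive_sent_split("Send it to Mr. Kim today."))
# ['Send it to Mr.', 'Kim today.']
```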

In [68]:
text = "My email address is 'abcde@codeit.com'. Send it to Mr.Kim."

In [69]:
from nltk.tokenize import sent_tokenize
sent_tokenize(text)

["My email address is 'abcde@codeit.com'.", 'Send it to Mr.Kim.']

In [70]:
text = "Can you forward my email to Mr.Kim? Thank you!"
sent_tokenize(text)

['Can you forward my email to Mr.Kim?', 'Thank you!']

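## POS Tagging
POS tagging identifies the part of speech (noun, verb, adjective, adverb, and so on) of each word in a sentence and attaches the corresponding tag. It is a key step in helping a computer understand the grammatical structure of a sentence. For example, the sentence "The cat sat on the mat." would be tagged as (The, article), (cat, noun), (sat, verb), (on, preposition), (the, article), (mat, noun). Some of the tags that appear in the output are:

* NNP: proper noun, singular
* VBZ: verb, 3rd person singular present
* DT: determiner (articles, etc.)
* JJ: adjective
* NN: noun, singular
* IN: preposition or subordinating conjunction
* .: sentence-final punctuation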
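
Once tokens carry tags like these, selecting words by part of speech is a simple filter on the tag prefix. The sketch below hard-codes a few `(word, tag)` pairs in the format returned by `pos_tag`, so it runs without any NLTK data; the helper name `filter_by_tag_prefix` is hypothetical.

```python
# (word, tag) pairs in the same format that nltk's pos_tag returns,
# hard-coded here so the sketch runs without downloading tagger data.
tagged = [("make", "VB"), ("a", "DT"), ("really", "RB"), ("bad", "JJ"), ("movie", "NN")]

def filter_by_tag_prefix(tagged_words, prefixes):
    """Keep words whose tag starts with any of the given prefixes."""
    # str.startswith accepts a tuple, so "NN" also matches NNS, NNP, NNPS.
    return [word for word, tag in tagged_words if tag.startswith(prefixes)]

content_words = filter_by_tag_prefix(tagged, ("NN", "JJ"))
print(content_words)  # ['bad', 'movie']
```

Matching on the prefix rather than the full tag keeps the filter short, since the noun, verb, and adjective tags each share a common leading letter pair.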

In [74]:
from nltk.tag import pos_tag

text = "Watching Time Chasers, it obvious that it was made by a bunch of friends. Maybe they were sitting around one day in film school and said, \"Hey, let\'s pool our money together and make a really bad movie!\" Or something like that."

pos_tagged_words = []
for sentence in sent_tokenize(text):
    tokenized_words = word_tokenize(sentence)
    pos_tagged = pos_tag(tokenized_words)
    pos_tagged_words += pos_tagged

pos_tagged_words

[('Watching', 'VBG'),
 ('Time', 'NNP'),
 ('Chasers', 'NNPS'),
 (',', ','),
 ('it', 'PRP'),
 ('obvious', 'VBZ'),
 ('that', 'IN'),
 ('it', 'PRP'),
 ('was', 'VBD'),
 ('made', 'VBN'),
 ('by', 'IN'),
 ('a', 'DT'),
 ('bunch', 'NN'),
 ('of', 'IN'),
 ('friends', 'NNS'),
 ('.', '.'),
 ('Maybe', 'RB'),
 ('they', 'PRP'),
 ('were', 'VBD'),
 ('sitting', 'VBG'),
 ('around', 'IN'),
 ('one', 'CD'),
 ('day', 'NN'),
 ('in', 'IN'),
 ('film', 'NN'),
 ('school', 'NN'),
 ('and', 'CC'),
 ('said', 'VBD'),
 (',', ','),
 ('``', '``'),
 ('Hey', 'NNP'),
 (',', ','),
 ('let', 'VB'),
 ("'s", 'POS'),
 ('pool', 'VB'),
 ('our', 'PRP$'),
 ('money', 'NN'),
 ('together', 'RB'),
 ('and', 'CC'),
 ('make', 'VB'),
 ('a', 'DT'),
 ('really', 'RB'),
 ('bad', 'JJ'),
 ('movie', 'NN'),
 ('!', '.'),
 ("''", "''"),
 ('Or', 'CC'),
 ('something', 'NN'),
 ('like', 'IN'),
 ('that', 'DT'),
 ('.', '.')]

In [76]:
# POS tagging function
def pos_tagger(tokenized_sents):
    pos_tagged_words = []

    for sentence in tokenized_sents:
        # Word tokenization
        tokenized_words = word_tokenize(sentence)

        # POS tagging
        pos_tagged = pos_tag(tokenized_words)
        pos_tagged_words.extend(pos_tagged)

    return pos_tagged_words

## Lemmatization
Lemmatization uses grammatical information and a `dictionary` to find the base form of a word, its `lemma`, from the word's various inflected forms. You can think of the lemma as the headword you would look up in a dictionary.
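
The role of the dictionary and of grammatical information can be illustrated with a miniature lookup table. This is a sketch under stated assumptions, not how WordNet-based lemmatization is implemented: the table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical, and a real lemmatizer relies on a full lexical database plus morphological rules.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech).
# "v" = verb, "a" = adjective, "n" = noun.
TOY_LEMMAS = {
    ("was", "v"): "be",
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("cars", "n"): "car",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary lemma for (word, pos), or the word unchanged."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("better", "a"))  # good
print(toy_lemmatize("better"))       # better (no noun entry, so unchanged)
print(toy_lemmatize("was", "v"))     # be
```

The same surface form maps to different results depending on the part of speech supplied, which is why lemmatization is usually paired with POS tagging.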

In [81]:
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\campus4D004\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [82]:
words_to_lemmatize = ['is', 'was', 'are', 'cars', 'dies', 'flies', 'watched']

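# lemmatize() treats every word as a noun unless pos is given,
# so verb forms such as 'was' or 'watched' are not reduced correctly here.
for i in words_to_lemmatize:
    lemma = lemmatizer.lemmatize(i)
    print(f"{i} -> {lemma}")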

is -> is
was -> wa
are -> are
cars -> car
dies -> dy
flies -> fly
watched -> watched


In [89]:
lemmatizer.lemmatize("better", pos = 'a')

'good'

In [93]:
from nltk.corpus import wordnet

def get_wordnet_pos(X):
    if X.startswith("J"):
        return wordnet.ADJ
    elif X.startswith("V"):
        return wordnet.VERB
    elif X.startswith("N"):
        return wordnet.NOUN
    elif X.startswith("R"):
        return wordnet.ADV
    else:
        return wordnet.NOUN

text = "The dogs are running and chasing the flying birds."
text = text.lower()

word_tokens = word_tokenize(text)

tagged_tokens = pos_tag(word_tokens)

lemmatized_words = []
for word, tag in tagged_tokens:
    pos = get_wordnet_pos(tag)
    stem = lemmatizer.lemmatize(word, pos = pos)
    lemmatized_words.append(stem)
print(lemmatized_words)

['the', 'dog', 'be', 'run', 'and', 'chase', 'the', 'fly', 'bird', '.']


In [94]:
def words_lemmatizer(pos_tagged_words):
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = []

    for word, tag in pos_tagged_words:
        wn_tag = get_wordnet_pos(tag)

        if wn_tag in (wordnet.NOUN, wordnet.ADJ, wordnet.ADV, wordnet.VERB):
            lemmatized_words.append(lemmatizer.lemmatize(word, wn_tag))
        else:
            lemmatized_words.append(word)

    return lemmatized_words

In [98]:
# 1. Sentence tokenization
# 2. POS tagging
# 3. Lemmatization

df = pd.read_csv('imdb.tsv', sep="\t")
del df['Unnamed: 0']

df['review'] = df['review'].str.lower()

In [101]:
df["sent_tokens"] = df['review'].apply(sent_tokenize)
df

Unnamed: 0,review,sent_tokens
0,"watching time chasers, it obvious that it was ...","[watching time chasers, it obvious that it was..."
1,i saw this film about 20 years ago and remembe...,[i saw this film about 20 years ago and rememb...
2,"minor spoilers in new york, joan barnard (elvi...","[minor spoilers in new york, joan barnard (elv..."
3,i went to see this film with a great deal of e...,[i went to see this film with a great deal of ...
4,"yes, i agree with everyone on this site this m...","[yes, i agree with everyone on this site this ..."
5,"jennifer ehle was sparkling in \""pride and pre...","[jennifer ehle was sparkling in \""pride and pr..."
6,amy poehler is a terrific comedian on saturday...,[amy poehler is a terrific comedian on saturda...
7,a plane carrying employees of a large biotech ...,[a plane carrying employees of a large biotech...
8,"a well made, gritty science fiction movie, it ...","[a well made, gritty science fiction movie, it..."
9,incredibly dumb and utterly predictable story ...,[incredibly dumb and utterly predictable story...


In [103]:
df["pos_tagged_tokens"] = df['sent_tokens'].apply(pos_tagger)
df

Unnamed: 0,review,sent_tokens,pos_tagged_tokens
0,"watching time chasers, it obvious that it was ...","[watching time chasers, it obvious that it was...","[(watching, VBG), (time, NN), (chasers, NNS), ..."
1,i saw this film about 20 years ago and remembe...,[i saw this film about 20 years ago and rememb...,"[(i, NN), (saw, VBD), (this, DT), (film, NN), ..."
2,"minor spoilers in new york, joan barnard (elvi...","[minor spoilers in new york, joan barnard (elv...","[(minor, JJ), (spoilers, NNS), (in, IN), (new,..."
3,i went to see this film with a great deal of e...,[i went to see this film with a great deal of ...,"[(i, JJ), (went, VBD), (to, TO), (see, VB), (t..."
4,"yes, i agree with everyone on this site this m...","[yes, i agree with everyone on this site this ...","[(yes, UH), (,, ,), (i, JJ), (agree, VBP), (wi..."
5,"jennifer ehle was sparkling in \""pride and pre...","[jennifer ehle was sparkling in \""pride and pr...","[(jennifer, NN), (ehle, NN), (was, VBD), (spar..."
6,amy poehler is a terrific comedian on saturday...,[amy poehler is a terrific comedian on saturda...,"[(amy, JJ), (poehler, NN), (is, VBZ), (a, DT),..."
7,a plane carrying employees of a large biotech ...,[a plane carrying employees of a large biotech...,"[(a, DT), (plane, NN), (carrying, VBG), (emplo..."
8,"a well made, gritty science fiction movie, it ...","[a well made, gritty science fiction movie, it...","[(a, DT), (well, NN), (made, VBN), (,, ,), (gr..."
9,incredibly dumb and utterly predictable story ...,[incredibly dumb and utterly predictable story...,"[(incredibly, RB), (dumb, JJ), (and, CC), (utt..."


In [110]:
df["lemmatized_tokens"] = df['pos_tagged_tokens'].apply(words_lemmatizer)

df["cleaned_tokens"] = df['lemmatized_tokens'].apply(lambda x : clean_by_freq(x, 1))
df["cleaned_tokens"] = df["cleaned_tokens"].apply(lambda x : clean_by_len(x, 2))
df["cleaned_tokens"] = df["cleaned_tokens"].apply(lambda x : clean_by_stopwords(x, stopwords_set))
df

Unnamed: 0,review,sent_tokens,pos_tagged_tokens,lemmatized_tokens,cleaned_tokens
0,"watching time chasers, it obvious that it was ...","[watching time chasers, it obvious that it was...","[(watching, VBG), (time, NN), (chasers, NNS), ...","[watch, time, chaser, ,, it, obvious, that, it...","[make, one, film, say, make, really, bad, movi..."
1,i saw this film about 20 years ago and remembe...,[i saw this film about 20 years ago and rememb...,"[(i, NN), (saw, VBD), (this, DT), (film, NN), ...","[i, saw, this, film, about, 20, year, ago, and...","[film, the, the, the, film]"
2,"minor spoilers in new york, joan barnard (elvi...","[minor spoilers in new york, joan barnard (elv...","[(minor, JJ), (spoilers, NNS), (in, IN), (new,...","[minor, spoiler, in, new, york, ,, joan, barna...","[new, york, joan, barnard, elvire, audrey, the..."
3,i went to see this film with a great deal of e...,[i went to see this film with a great deal of ...,"[(i, JJ), (went, VBD), (to, TO), (see, VB), (t...","[i, go, to, see, this, film, with, a, great, d...","[film, the, film, the, the, jump, send, n't, j..."
4,"yes, i agree with everyone on this site this m...","[yes, i agree with everyone on this site this ...","[(yes, UH), (,, ,), (i, JJ), (agree, VBP), (wi...","[yes, ,, i, agree, with, everyone, on, this, s...","[site, movie, bad, even, movie, movie, make, m..."
5,"jennifer ehle was sparkling in \""pride and pre...","[jennifer ehle was sparkling in \""pride and pr...","[(jennifer, NN), (ehle, NN), (was, VBD), (spar...","[jennifer, ehle, be, sparkle, in, \, '', pride...","[ehle, northam, wonderful, the, the, the, wond..."
6,amy poehler is a terrific comedian on saturday...,[amy poehler is a terrific comedian on saturda...,"[(amy, JJ), (poehler, NN), (is, VBZ), (a, DT),...","[amy, poehler, be, a, terrific, comedian, on, ...","[role, movie, n't, author, book, funny, the, a..."
7,a plane carrying employees of a large biotech ...,[a plane carrying employees of a large biotech...,"[(a, DT), (plane, NN), (carrying, VBG), (emplo...","[a, plane, carry, employee, of, a, large, biot...","[plane, the, ceo, the, the, search, rescue, mi..."
8,"a well made, gritty science fiction movie, it ...","[a well made, gritty science fiction movie, it...","[(a, DT), (well, NN), (made, VBN), (,, ,), (gr...","[a, well, make, ,, gritty, science, fiction, m...","[gritty, movie, movie, keep, the, the, the, sc..."
9,incredibly dumb and utterly predictable story ...,[incredibly dumb and utterly predictable story...,"[(incredibly, RB), (dumb, JJ), (and, CC), (utt...","[incredibly, dumb, and, utterly, predictable, ...","[girl, girl, the, the, the, the]"


In [112]:
def combine(sentence):
    return " ".join(sentence)

df["combined_corpus"] = df['cleaned_tokens'].apply(combine)
df

Unnamed: 0,review,sent_tokens,pos_tagged_tokens,lemmatized_tokens,cleaned_tokens,combined_corpus
0,"watching time chasers, it obvious that it was ...","[watching time chasers, it obvious that it was...","[(watching, VBG), (time, NN), (chasers, NNS), ...","[watch, time, chaser, ,, it, obvious, that, it...","[make, one, film, say, make, really, bad, movi...",make one film say make really bad movie like s...
1,i saw this film about 20 years ago and remembe...,[i saw this film about 20 years ago and rememb...,"[(i, NN), (saw, VBD), (this, DT), (film, NN), ...","[i, saw, this, film, about, 20, year, ago, and...","[film, the, the, the, film]",film the the the film
2,"minor spoilers in new york, joan barnard (elvi...","[minor spoilers in new york, joan barnard (elv...","[(minor, JJ), (spoilers, NNS), (in, IN), (new,...","[minor, spoiler, in, new, york, ,, joan, barna...","[new, york, joan, barnard, elvire, audrey, the...",new york joan barnard elvire audrey the barnar...
3,i went to see this film with a great deal of e...,[i went to see this film with a great deal of ...,"[(i, JJ), (went, VBD), (to, TO), (see, VB), (t...","[i, go, to, see, this, film, with, a, great, d...","[film, the, film, the, the, jump, send, n't, j...",film the film the the jump send n't jump radio...
4,"yes, i agree with everyone on this site this m...","[yes, i agree with everyone on this site this ...","[(yes, UH), (,, ,), (i, JJ), (agree, VBP), (wi...","[yes, ,, i, agree, with, everyone, on, this, s...","[site, movie, bad, even, movie, movie, make, m...",site movie bad even movie movie make movie spe...
5,"jennifer ehle was sparkling in \""pride and pre...","[jennifer ehle was sparkling in \""pride and pr...","[(jennifer, NN), (ehle, NN), (was, VBD), (spar...","[jennifer, ehle, be, sparkle, in, \, '', pride...","[ehle, northam, wonderful, the, the, the, wond...",ehle northam wonderful the the the wonderful t...
6,amy poehler is a terrific comedian on saturday...,[amy poehler is a terrific comedian on saturda...,"[(amy, JJ), (poehler, NN), (is, VBZ), (a, DT),...","[amy, poehler, be, a, terrific, comedian, on, ...","[role, movie, n't, author, book, funny, the, a...",role movie n't author book funny the author th...
7,a plane carrying employees of a large biotech ...,[a plane carrying employees of a large biotech...,"[(a, DT), (plane, NN), (carrying, VBG), (emplo...","[a, plane, carry, employee, of, a, large, biot...","[plane, the, ceo, the, the, search, rescue, mi...",plane the ceo the the search rescue mission ca...
8,"a well made, gritty science fiction movie, it ...","[a well made, gritty science fiction movie, it...","[(a, DT), (well, NN), (made, VBN), (,, ,), (gr...","[a, well, make, ,, gritty, science, fiction, m...","[gritty, movie, movie, keep, the, the, the, sc...",gritty movie movie keep the the the sci-fi goo...
9,incredibly dumb and utterly predictable story ...,[incredibly dumb and utterly predictable story...,"[(incredibly, RB), (dumb, JJ), (and, CC), (utt...","[incredibly, dumb, and, utterly, predictable, ...","[girl, girl, the, the, the, the]",girl girl the the the the


## Exercises
1. Split the given English sentence into words and print the result as a list. Tokenize the sentence stored in the text variable below into words.

In [114]:
text = "Text mining is the process of exploring and analyzing large amounts of unstructured text data."
print(word_tokenize(text))

['Text', 'mining', 'is', 'the', 'process', 'of', 'exploring', 'and', 'analyzing', 'large', 'amounts', 'of', 'unstructured', 'text', 'data', '.']


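2. Split the given English text into sentences and print the result as a list. Sentences must be separated correctly at sentence-ending punctuation such as ., ?, and !. Tokenize the paragraph stored in the text variable below into sentences.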

In [115]:
text = "What is text mining? It is a fascinating field! We can discover hidden patterns. contact. masterkyungil@gmail.com"
sent_tokenize(text)

['What is text mining?',
 'It is a fascinating field!',
 'We can discover hidden patterns.',
 'contact.',
 'masterkyungil@gmail.com']

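3. Stop-word removal drops words that carry little meaning for the analysis. Extend the default stop-word list with new words and remove those as well. Remove NLTK's English stop words from the given word list, and additionally treat 'movie' and 'film' as stop words.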

In [116]:
words = ['This', 'is', 'an', 'amazing', 'movie', 'about', 'a', 'heroic', 'film', 'director', '.']

stopwords_set.add("movie")
stopwords_set.add("film")

clean_by_stopwords(words, stopwords_set)

['This', 'amazing', 'heroic', 'director', '.']
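
In the output above, 'This' survives even though 'this' is an NLTK stop word, which is consistent with a case-sensitive membership test. A minimal sketch of a case-insensitive variant follows. Both `remove_stopwords_ci` and the small `demo_stopwords` set are hypothetical stand-ins for the notebook's clean_by_stopwords helper and NLTK's full English list, neither of which is shown in this section.

```python
# Hypothetical stand-in for the notebook's stop-word set: a few lowercase
# entries of the kind found in NLTK's list, plus the two custom words.
demo_stopwords = {"this", "is", "an", "about", "a", "movie", "film"}

def remove_stopwords_ci(tokens, stopwords):
    """Keep tokens whose lowercase form is not in the stop-word set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]

words = ['This', 'is', 'an', 'amazing', 'movie', 'about', 'a',
         'heroic', 'film', 'director', '.']
filtered = remove_stopwords_ci(words, demo_stopwords)
print(filtered)  # ['amazing', 'heroic', 'director', '.']
```

Lowercasing only for the comparison keeps the original casing of the tokens that remain.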

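4. Words that appear too rarely in a text may not be useful for analysis. Remove the words that appear no more than a given number of times. Print the result of removing every word that appears only once in the given text.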

In [119]:
text = "The cat sat on the mat. The dog sat on the log. The cat and dog are friends."
A = word_tokenize(text)
clean_by_freq(A, 1)

['The',
 'cat',
 'sat',
 'on',
 'the',
 '.',
 'The',
 'dog',
 'sat',
 'on',
 'the',
 '.',
 'The',
 'cat',
 'dog',
 '.']
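
The result above keeps 'The' and 'the' as separate entries, because the counting is case-sensitive. One way such a frequency filter could be written is sketched below with collections.Counter. The name `drop_rare_tokens` is hypothetical (it is not the notebook's clean_by_freq), and the regular expression is only a rough stand-in for word_tokenize so that the example runs without NLTK data.

```python
import re
from collections import Counter

def drop_rare_tokens(tokens, cut_off_count):
    """Remove tokens that occur cut_off_count times or fewer."""
    counts = Counter(tokens)
    return [tok for tok in tokens if counts[tok] > cut_off_count]

text = "The cat sat on the mat. The dog sat on the log. The cat and dog are friends."
# Rough tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)
kept = drop_rare_tokens(tokens, 1)
print(kept)
```

Lowercasing the tokens before counting would merge 'The' and 'the' into a single entry, if that is the intended behavior.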

5. Stemming trims word suffixes according to rules so that each word ends up close to its base form. Use the Porter Stemmer: apply it to the given word list and extract the stem of each word.

In [120]:
words = ['studies', 'studying', 'beautiful', 'beauty', 'connection', 'connects']
stemming_by_porter(words)

['studi', 'studi', 'beauti', 'beauti', 'connect', 'connect']
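
stemming_by_porter is defined earlier in the notebook and is not shown in this section. A minimal sketch of what such a helper might look like, assuming it simply wraps NLTK's PorterStemmer (the name `stem_tokens` is hypothetical):

```python
from nltk.stem import PorterStemmer

def stem_tokens(tokens):
    """Return the Porter stem of every token, preserving order."""
    stemmer = PorterStemmer()
    return [stemmer.stem(tok) for tok in tokens]

words = ['studies', 'studying', 'beautiful', 'beauty', 'connection', 'connects']
stems = stem_tokens(words)
print(stems)
```

Related forms collapse to the same stem, even though stems such as 'studi' and 'beauti' are not dictionary words.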

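6. Lemmatization finds the base form (lemma) of a word by taking its dictionary and grammatical meaning into account. Compare how it differs from stemming. Apply lemmatization to the same word list as in Exercise 5. Do not pass the pos argument; lemmatize with the default part of speech (noun).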

In [129]:
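# Note: A is the token list left over from the Exercise 7 cell (In [126]),
# not the word list from Exercise 5, so the output below shows that sentence.
B = pos_tag(A)
for word, tag in B:
    # tag is unused: lemmatize() is called without pos, so every word is treated as a noun.
    print(lemmatizer.lemmatize(word))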

The
quick
brown
fox
jump
over
the
lazy
dog
.


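7. Part-of-speech tagging identifies the part of speech of each word in a sentence and attaches a tag to it. This is the basis for more accurate lemmatization. Tokenize the given sentence into words, then tag each word and print the result as a list of (word, tag) tuples.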

In [126]:
text = "The quick brown fox jumps over the lazy dog."
A = word_tokenize(text)
pos_tag(A)

[('The', 'DT'),
 ('quick', 'JJ'),
 ('brown', 'NN'),
 ('fox', 'NN'),
 ('jumps', 'VBZ'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('dog', 'NN'),
 ('.', '.')]
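
The tags above follow the Penn Treebank scheme, whereas WordNet-based lemmatizers expect a coarse part-of-speech code. A minimal sketch of one common mapping is shown below. `penn_to_wordnet` is a hypothetical helper, and the single-letter codes are the values of NLTK's wordnet.ADJ, wordnet.VERB, wordnet.ADV, and wordnet.NOUN constants.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, also used as the fallback

# Tagged tokens copied from the output above.
tagged = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN'), ('.', '.')]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

Each code could then be passed as the pos argument of the lemmatizer, for example lemmatizer.lemmatize(word, pos=code).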

8. Using part-of-speech information can greatly improve the accuracy of lemmatization. Convert is to be and running to run correctly. Run POS tagging on the given sentence and use the resulting tags to extract the lemmas.

In [131]:
text = "She is running faster than I thought."
A = word_tokenize(text)
tag = pos_tag(A)
words_lemmatizer(tag)

['She', 'be', 'run', 'faster', 'than', 'I', 'think', '.']

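9. Write a function that applies the preprocessing techniques covered so far in order. Complete the preprocess_text function so that it applies the following steps, in order, to the text below.
* Convert all letters to lowercase
* Tokenize into words
* Remove words of length 2 or less
* Remove English stop words
* Extract stems with the Porter Stemmer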

In [135]:
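def preprocess_text(text):
    # Apply the preprocessing steps from the exercise, in order.
    text = text.lower()                                  # 1. lowercase
    tokens = word_tokenize(text)                         # 2. word tokenization
    tokens = clean_by_len(tokens, 2)                     # 3. drop tokens of length 2 or less
    tokens = clean_by_stopwords(tokens, stopwords_set)   # 4. drop stop words
    return stemming_by_porter(tokens)                    # 5. Porter stemming

text = "Data science has become one of the most popular fields in the 21st century."
preprocess_text(text)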

['data',
 'scienc',
 'becom',
 'one',
 'the',
 'popular',
 'field',
 'the',
 '21st',
 'centuri']

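10. In real data analysis, text data is often handled as a Pandas DataFrame. Apply a preprocessing function to every row of a DataFrame at once. Create a Pandas DataFrame from the given data, then apply the preprocess_text function from Exercise 9 to each text in the review column and add the result as a new column named processed_review.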

In [140]:
import pandas as pd

data = {'review': [
    "This movie was absolutely fantastic!",
    "I've never seen such a boring film in my life.",
    "A truly inspiring and emotional journey."
]}

df = pd.DataFrame(data)
df['review'] = df['review'].str.lower()
df["tokens"] = df['review'].apply(word_tokenize)
df

Unnamed: 0,review,tokens
0,this movie was absolutely fantastic!,"[this, movie, was, absolutely, fantastic, !]"
1,i've never seen such a boring film in my life.,"[i, 've, never, seen, such, a, boring, film, i..."
2,a truly inspiring and emotional journey.,"[a, truly, inspiring, and, emotional, journey, .]"
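
The cell above lowercases and tokenizes the reviews but stops short of adding the processed_review column that the exercise asks for. A minimal sketch of that last step follows. To keep it runnable without NLTK data, `simple_preprocess` is a hypothetical, simplified stand-in for preprocess_text: it uses a regular expression instead of word_tokenize, a tiny hand-written stop-word set instead of NLTK's list, and skips stemming.

```python
import re
import pandas as pd

# Tiny illustrative stop-word set (an assumption, not NLTK's full list).
DEMO_STOPWORDS = {"this", "was", "such", "and", "the"}

def simple_preprocess(text):
    """Lowercase, split into word tokens, drop short tokens and stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [tok for tok in tokens if len(tok) > 2]
    return [tok for tok in tokens if tok not in DEMO_STOPWORDS]

data = {'review': [
    "This movie was absolutely fantastic!",
    "I've never seen such a boring film in my life.",
    "A truly inspiring and emotional journey."
]}

df = pd.DataFrame(data)
# Series.apply calls the function once per row and collects the returned lists.
df['processed_review'] = df['review'].apply(simple_preprocess)
print(df['processed_review'].tolist())
```

With the notebook's own helpers in scope, the same pattern would read df['review'].apply(preprocess_text).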
