### Natural Language (English) Processing Pipeline

- English data: use the nltk library
- nltk.download('punkt') # punkt accounts for periods, abbreviations, and other language-specific features

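1. Tokenization: split natural language data into small units (tokens) for analysis
2. Cleaning: remove data that carries little meaning for the analysis
3. Normalization: merge words that are written differently but mean the same thing
4. Integer encoding: assign an integer index to each word so the data is easier for a computer to process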
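
The four steps above can be sketched end to end with only the standard library. This is a minimal illustration, not the NLTK-based flow used in the rest of this notebook: the regex tokenizer, the tiny `STOPWORDS` set, and the sample sentence are all assumptions made for the example. Normalization is applied before cleaning here so that the lowercase stopword list matches.

```python
import re
from collections import Counter

# Hypothetical sample text and stopword list, chosen only for illustration.
sample = "The movie was good. The actors were good, too!"
STOPWORDS = {"the", "was", "were", "too"}

# Step 1, tokenization: words and punctuation marks become separate tokens.
tokens = re.findall(r"\w+|[^\w\s]", sample)

# Step 3, normalization: lowercase every token.
normalized = [t.lower() for t in tokens]

# Step 2, cleaning: drop punctuation and stopwords.
cleaned = [t for t in normalized if t.isalpha() and t not in STOPWORDS]

# Step 4, integer encoding: more frequent words get smaller indices (starting at 1).
vocab = Counter(cleaned)
word_to_index = {w: i for i, (w, _) in enumerate(vocab.most_common(), start=1)}
encoded = [word_to_index[t] for t in cleaned]

print(cleaned)
print(encoded)
```

Numbering from 1 leaves index 0 free, which is often reserved for padding or unknown words in later modeling steps.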

#### Tokenization

In [4]:
import nltk
nltk.download('punkt') # punkt accounts for periods, abbreviations, and other language-specific features

TEXT = """After reading the comments for this movie, I am not sure whether I should be angry, sad or sickened. Seeing comments typical of people who a)know absolutely nothing about the military or b)who base everything they think they know on movies like this or on CNN reports about Abu-Gharib makes me wonder about the state of intellectual stimulation in the world. At the time I type this the number of people in the US military: 1.4 million on Active Duty with another almost 900,000 in the Guard and Reserves for a total of roughly 2.3 million. The number of people indicted for abuses at at Abu-Gharib: Currently less than 20 That makes the total of people indicted .00083% of the total military. Even if you indict every single military member that ever stepped in to Abu-Gharib, you would not come close to making that a whole number.  The flaws in this movie would take YEARS to cover. I understand that it's supposed to be sarcastic, but in reality, the writer and director are trying to make commentary about the state of the military without an enemy to fight. In reality, the US military has been at its busiest when there are not conflicts going on. The military is the first called for disaster relief and humanitarian aid missions. When the tsunami hit Indonesia, devestating the region, the US military was the first on the scene. When the chaos of the situation overwhelmed the local governments, it was military leadership who looked at their people, the same people this movie mocks, and said make it happen. Within hours, food aid was reaching isolated villages. Within days, airfields were built, cargo aircraft started landing and a food distribution system was up and running. Hours and days, not weeks and months. Yes there are unscrupulous people in the US military. But then, there are in every walk of life, every occupation. 
But to see people on this website decide that 2.3 million men and women are all criminal, with nothing on their minds but thoughts of destruction or mayhem is an absolute disservice to the things that they do every day. One person on this website even went so far as to say that military members are in it for personal gain. Wow! Entry level personnel make just under $8.00 an hour assuming a 40 hour work week. Of course, many work much more than 40 hours a week and those in harm's way typically put in 16-18 hour days for months on end. That makes the pay well under minimum wage. So much for personal gain. I beg you, please make yourself familiar with the world around you. Go to a nearby base, get a visitor pass and meet some of the men and women you are so quick to disparage. You would be surprised. The military no longer accepts people in lieu of prison time. They require a minimum of a GED and prefer a high school diploma. The middle ranks are expected to get a minimum of undergraduate degrees and the upper ranks are encouraged to get advanced degrees."""

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [3]:
TEXT.split() # This does tokenize the text, but punctuation stays attached: for example, 'movie,' becomes a single token.

['After',
 'reading',
 'the',
 'comments',
 'for',
 'this',
 'movie,',
 'I',
 'am',
 'not',
 'sure',
 'whether',
 'I',
 'should',
 'be',
 'angry,',
 'sad',
 'or',
 'sickened.',
 'Seeing',
 'comments',
 'typical',
 'of',
 'people',
 'who',
 'a)know',
 'absolutely',
 'nothing',
 'about',
 'the',
 'military',
 'or',
 'b)who',
 'base',
 'everything',
 'they',
 'think',
 'they',
 'know',
 'on',
 'movies',
 'like',
 'this',
 'or',
 'on',
 'CNN',
 'reports',
 'about',
 'Abu-Gharib',
 'makes',
 'me',
 'wonder',
 'about',
 'the',
 'state',
 'of',
 'intellectual',
 'stimulation',
 'in',
 'the',
 'world.',
 'At',
 'the',
 'time',
 'I',
 'type',
 'this',
 'the',
 'number',
 'of',
 'people',
 'in',
 'the',
 'US',
 'military:',
 '1.4',
 'million',
 'on',
 'Active',
 'Duty',
 'with',
 'another',
 'almost',
 '900,000',
 'in',
 'the',
 'Guard',
 'and',
 'Reserves',
 'for',
 'a',
 'total',
 'of',
 'roughly',
 '2.3',
 'million.',
 'The',
 'number',
 'of',
 'people',
 'indicted',
 'for',
 'abuses',
 'at',

In [6]:
from nltk.tokenize import word_tokenize
tokenized_words = word_tokenize(TEXT) # word_tokenize separates punctuation from words, so 'movie' and ',' become two tokens.
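
The difference between the two approaches can be seen on a single sentence. The sketch below uses a simple regular expression as a rough stand-in for `word_tokenize`; it is an assumption for illustration only and does not reproduce NLTK's handling of contractions or abbreviations.

```python
import re

sentence = "After reading the comments for this movie, I am not sure."

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = sentence.split()
print(whitespace_tokens)

# A simple regex separates runs of word characters from punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(regex_tokens)
```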

##### Frequency Analysis

In [9]:
## Frequency analysis

import pandas as pd
dic = {}
for i in tokenized_words:
    if i not in dic:
        dic[i] = 1
    else:
        dic[i] += 1

a= pd.DataFrame({"WORD" : dic.keys(), "FREQ" : dic.values()})
a.sort_values(by='FREQ', ascending=False)


Unnamed: 0,WORD,FREQ
2,the,30
19,.,28
7,",",21
22,of,15
66,and,14
...,...,...
131,disaster,1
132,relief,1
133,humanitarian,1
135,missions,1


In [11]:
# Frequency analysis again: collections.Counter makes it much simpler
from collections import Counter
vocab=Counter(tokenized_words)
vocab.most_common(10)

[('the', 30),
 ('.', 28),
 (',', 21),
 ('of', 15),
 ('and', 14),
 ('to', 13),
 ('a', 12),
 ('military', 12),
 ('in', 12),
 ('people', 9)]

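Text mining: build the full pipeline first, then review the results and refine the steps.


Topic modeling: words with very low frequency and words with very high frequency are both a problem.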
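
One way to act on that observation is to filter on both ends of the frequency range. The helper below is a hypothetical sketch (`clean_by_freq_range`, `min_count`, and `max_count` are names introduced here, and the toy token list is made up), not a function used elsewhere in this notebook.

```python
from collections import Counter

def clean_by_freq_range(tokens, min_count, max_count):
    """Keep tokens whose frequency lies between min_count and max_count (inclusive)."""
    counts = Counter(tokens)
    return [t for t in tokens if min_count <= counts[t] <= max_count]

# Toy data: 'the' is too frequent, 'tsunami' too rare.
toy_tokens = ["the"] * 5 + ["military"] * 3 + ["movie"] * 2 + ["tsunami"]
filtered = clean_by_freq_range(toy_tokens, min_count=2, max_count=4)
print(Counter(filtered))
```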



In [12]:
vocab.keys()
# vocab.values()


dict_keys(['After', 'reading', 'the', 'comments', 'for', 'this', 'movie', ',', 'I', 'am', 'not', 'sure', 'whether', 'should', 'be', 'angry', 'sad', 'or', 'sickened', '.', 'Seeing', 'typical', 'of', 'people', 'who', 'a', ')', 'know', 'absolutely', 'nothing', 'about', 'military', 'b', 'base', 'everything', 'they', 'think', 'on', 'movies', 'like', 'CNN', 'reports', 'Abu-Gharib', 'makes', 'me', 'wonder', 'state', 'intellectual', 'stimulation', 'in', 'world', 'At', 'time', 'type', 'number', 'US', ':', '1.4', 'million', 'Active', 'Duty', 'with', 'another', 'almost', '900,000', 'Guard', 'and', 'Reserves', 'total', 'roughly', '2.3', 'The', 'indicted', 'abuses', 'at', 'Currently', 'less', 'than', '20', 'That', '.00083', '%', 'Even', 'if', 'you', 'indict', 'every', 'single', 'member', 'that', 'ever', 'stepped', 'to', 'would', 'come', 'close', 'making', 'whole', 'flaws', 'take', 'YEARS', 'cover', 'understand', 'it', "'s", 'supposed', 'sarcastic', 'but', 'reality', 'writer', 'director', 'are', 'tr

##### Removing words with frequency 2 or less

In [14]:
uncommom_words = []
for key, value in vocab.items():
    if value <= 2:
        uncommom_words.append(key)

uncommom_words

['After',
 'reading',
 'comments',
 'am',
 'sure',
 'whether',
 'should',
 'angry',
 'sad',
 'sickened',
 'Seeing',
 'typical',
 ')',
 'know',
 'absolutely',
 'nothing',
 'b',
 'base',
 'everything',
 'think',
 'movies',
 'like',
 'CNN',
 'reports',
 'me',
 'wonder',
 'state',
 'intellectual',
 'stimulation',
 'world',
 'At',
 'time',
 'type',
 ':',
 '1.4',
 'Active',
 'Duty',
 'another',
 'almost',
 '900,000',
 'Guard',
 'Reserves',
 'roughly',
 '2.3',
 'indicted',
 'abuses',
 'Currently',
 'less',
 'than',
 '20',
 'That',
 '.00083',
 '%',
 'Even',
 'if',
 'indict',
 'single',
 'member',
 'ever',
 'stepped',
 'come',
 'close',
 'making',
 'whole',
 'flaws',
 'take',
 'YEARS',
 'cover',
 'understand',
 "'s",
 'supposed',
 'sarcastic',
 'but',
 'reality',
 'writer',
 'director',
 'trying',
 'commentary',
 'without',
 'enemy',
 'fight',
 'In',
 'has',
 'been',
 'its',
 'busiest',
 'when',
 'conflicts',
 'going',
 'is',
 'first',
 'called',
 'disaster',
 'relief',
 'humanitarian',
 'aid',

In [15]:
cleaned_by_freq = []
for word in tokenized_words:
    if word not in uncommom_words:
        cleaned_by_freq.append(word)

cleaned_by_freq

['the',
 'for',
 'this',
 'movie',
 ',',
 'I',
 'not',
 'I',
 'be',
 ',',
 'or',
 '.',
 'of',
 'people',
 'who',
 'a',
 'about',
 'the',
 'military',
 'or',
 'who',
 'they',
 'they',
 'on',
 'this',
 'or',
 'on',
 'about',
 'Abu-Gharib',
 'makes',
 'about',
 'the',
 'of',
 'in',
 'the',
 '.',
 'the',
 'I',
 'this',
 'the',
 'number',
 'of',
 'people',
 'in',
 'the',
 'US',
 'military',
 'million',
 'on',
 'with',
 'in',
 'the',
 'and',
 'for',
 'a',
 'total',
 'of',
 'million',
 '.',
 'The',
 'number',
 'of',
 'people',
 'for',
 'at',
 'at',
 'Abu-Gharib',
 'makes',
 'the',
 'total',
 'of',
 'people',
 'of',
 'the',
 'total',
 'military',
 '.',
 'you',
 'every',
 'military',
 'that',
 'in',
 'to',
 'Abu-Gharib',
 ',',
 'you',
 'would',
 'not',
 'to',
 'that',
 'a',
 'number',
 '.',
 'The',
 'in',
 'this',
 'movie',
 'would',
 'to',
 '.',
 'I',
 'that',
 'it',
 'to',
 'be',
 ',',
 'in',
 ',',
 'the',
 'and',
 'are',
 'to',
 'make',
 'about',
 'the',
 'of',
 'the',
 'military',
 'an',
 '
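
The loop above tests membership against a list, which scans the list on every lookup. A set gives the same result with average constant-time lookups. The sketch below uses a made-up token list and hypothetical variable names to show that both filters agree.

```python
from collections import Counter

toy_tokens = ["the", "movie", ",", "the", "military", "the", "movie", "movie"]
vocab = Counter(toy_tokens)

# Words that appear 2 times or fewer, stored once as a list and once as a set.
uncommon_list = [w for w, c in vocab.items() if c <= 2]
uncommon_set = set(uncommon_list)

# Same filter, two container types for the membership test.
kept_with_list = [w for w in toy_tokens if w not in uncommon_list]
kept_with_set = [w for w in toy_tokens if w not in uncommon_set]

print(kept_with_set)
```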

##### Removing words with length 2 or less

In [16]:
# 1. Words with frequency 2 or less have already been removed
# 2. Now also remove words with length 2 or less

cleaned_by_freq_len = []
for word in cleaned_by_freq:
    if len(word) > 2:
        cleaned_by_freq_len.append(word)

cleaned_by_freq_len

['the',
 'for',
 'this',
 'movie',
 'not',
 'people',
 'who',
 'about',
 'the',
 'military',
 'who',
 'they',
 'they',
 'this',
 'about',
 'Abu-Gharib',
 'makes',
 'about',
 'the',
 'the',
 'the',
 'this',
 'the',
 'number',
 'people',
 'the',
 'military',
 'million',
 'with',
 'the',
 'and',
 'for',
 'total',
 'million',
 'The',
 'number',
 'people',
 'for',
 'Abu-Gharib',
 'makes',
 'the',
 'total',
 'people',
 'the',
 'total',
 'military',
 'you',
 'every',
 'military',
 'that',
 'Abu-Gharib',
 'you',
 'would',
 'not',
 'that',
 'number',
 'The',
 'this',
 'movie',
 'would',
 'that',
 'the',
 'and',
 'are',
 'make',
 'about',
 'the',
 'the',
 'military',
 'the',
 'military',
 'there',
 'are',
 'not',
 'The',
 'military',
 'the',
 'for',
 'and',
 'the',
 'the',
 'the',
 'military',
 'was',
 'the',
 'the',
 'the',
 'the',
 'the',
 'was',
 'military',
 'who',
 'people',
 'the',
 'people',
 'this',
 'movie',
 'and',
 'make',
 'was',
 'days',
 'and',
 'was',
 'and',
 'and',
 'days',
 'no

##### Result

- 1. Remove words with frequency 2 or less
- 2. Remove words with length 2 or less

In [17]:
print("Before cleaning : ",cleaned_by_freq[:10])
print("After cleaning : ",cleaned_by_freq_len[:10])

Before cleaning :  ['the', 'for', 'this', 'movie', ',', 'I', 'not', 'I', 'be', ',']
After cleaning :  ['the', 'for', 'this', 'movie', 'not', 'people', 'who', 'about', 'the', 'military']


##### User-defined functions for frequency and length filtering

In [20]:
def clean_by_freq(tokenized_words, cut_off_count):
    vocab = Counter(tokenized_words)

    # Collect the words whose frequency is cut_off_count or less (set comprehension)
    uncommom_words = {key for key, value in vocab.items() if value <= cut_off_count}
    # Equivalent loop (with uncommom_words = [] initialized first):
    # for key, value in vocab.items():
    #     if value <= cut_off_count:
    #         uncommom_words.append(key)

    # Keep only the words that are not in uncommom_words
    cleaned_words = [word for word in tokenized_words if word not in uncommom_words]
    # Equivalent loop (with cleaned_words = [] initialized first):
    # for word in tokenized_words:
    #     if word not in uncommom_words:
    #         cleaned_words.append(word)

    return cleaned_words


def clean_by_len(tokenized_words, cut_off_length):
    cleaned_words = []

    for word in tokenized_words:
        # Keep only the words longer than cut_off_length
        if len(word) > cut_off_length:
            cleaned_words.append(word)

    return cleaned_words


# Store the result under a new name so the clean_by_freq function is not overwritten
cleaned_by_freq = clean_by_freq(tokenized_words, 2) # keeps words that occur more than 2 times
cleaned_words = clean_by_len(cleaned_by_freq, 2) # keeps words longer than 2 characters

cleaned_words

['the',
 'for',
 'this',
 'movie',
 'not',
 'people',
 'who',
 'about',
 'the',
 'military',
 'who',
 'they',
 'they',
 'this',
 'about',
 'Abu-Gharib',
 'makes',
 'about',
 'the',
 'the',
 'the',
 'this',
 'the',
 'number',
 'people',
 'the',
 'military',
 'million',
 'with',
 'the',
 'and',
 'for',
 'total',
 'million',
 'The',
 'number',
 'people',
 'for',
 'Abu-Gharib',
 'makes',
 'the',
 'total',
 'people',
 'the',
 'total',
 'military',
 'you',
 'every',
 'military',
 'that',
 'Abu-Gharib',
 'you',
 'would',
 'not',
 'that',
 'number',
 'The',
 'this',
 'movie',
 'would',
 'that',
 'the',
 'and',
 'are',
 'make',
 'about',
 'the',
 'the',
 'military',
 'the',
 'military',
 'there',
 'are',
 'not',
 'The',
 'military',
 'the',
 'for',
 'and',
 'the',
 'the',
 'the',
 'military',
 'was',
 'the',
 'the',
 'the',
 'the',
 'the',
 'was',
 'military',
 'who',
 'people',
 'the',
 'people',
 'this',
 'movie',
 'and',
 'make',
 'was',
 'days',
 'and',
 'was',
 'and',
 'and',
 'days',
 'no

##### Stopword Removal

In [21]:
# Stopword removal
from nltk.corpus import stopwords
nltk.download('stopwords')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [26]:
added_stopwords = ['oh','the','i'] # extra stopwords to add

In [27]:
stopwords_set = set(stopwords.words('english'))
stopwords_set

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'own',
 'r

In [29]:
stopwords_set.update(added_stopwords) # add the extra stopwords
stopwords_set

{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'oh',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 'ourselves',
 'out',
 'over',
 'ow

In [34]:
stopwords_set.add('hello') # add 'hello' as a stopword
stopwords_set.add('the') # 'the' is already in the set, so this has no effect
#stopwords_set.remove('the') # remove 'the' from the stopwords

In [35]:
# Remove low-frequency words, short words, and stopwords
cleaned_words = []

for word in cleaned_by_freq_len:
    if word not in stopwords_set:
        cleaned_words.append(word)

print(f'Before stopword removal : {len(cleaned_by_freq_len)}')
print(f'After stopword removal : {len(cleaned_words)}')

Before stopword removal : 169
After stopword removal : 67


In [37]:
# Stopword removal function
def clean_by_stopwords(tokenized_words, stop_words_set):
    cleaned_words = []

    for word in tokenized_words:
        if word not in stop_words_set:
            cleaned_words.append(word)

    return cleaned_words

# cleaned_by_stopwords = clean_by_stopwords(cleaned_by_freq_len, stopwords_set)
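
One detail worth checking: the stopword set printed above contains only lowercase entries, while tokens taken from the raw text can be capitalised (for example 'The' in the cleaned list). The sketch below uses a made-up miniature stopword set and token list to show the effect of comparing the lowercased form instead; it is an illustration, not part of the original notebook.

```python
# Hypothetical miniature stopword set, all lowercase.
toy_stopwords = {"the", "and", "for"}
toy_tokens = ["The", "number", "and", "the", "total"]

# Exact matching misses the capitalised 'The'.
exact = [w for w in toy_tokens if w not in toy_stopwords]

# Comparing the lowercased form removes it as well.
case_insensitive = [w for w in toy_tokens if w.lower() not in toy_stopwords]

print(exact)
print(case_insensitive)
```

Lowercasing the text before tokenization, as done later for the IMDb reviews, avoids the problem altogether.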

##### Lowercasing - Normalization

In [38]:
# Normalization: convert to lowercase
TEXT.lower()

"after reading the comments for this movie, i am not sure whether i should be angry, sad or sickened. seeing comments typical of people who a)know absolutely nothing about the military or b)who base everything they think they know on movies like this or on cnn reports about abu-gharib makes me wonder about the state of intellectual stimulation in the world. at the time i type this the number of people in the us military: 1.4 million on active duty with another almost 900,000 in the guard and reserves for a total of roughly 2.3 million. the number of people indicted for abuses at at abu-gharib: currently less than 20 that makes the total of people indicted .00083% of the total military. even if you indict every single military member that ever stepped in to abu-gharib, you would not come close to making that a whole number.  the flaws in this movie would take years to cover. i understand that it's supposed to be sarcastic, but in reality, the writer and director are trying to make comme

##### Rule-Based Normalization

In [41]:
# Rule-based normalization: map variants such as US, U.S, USA or Um, Umm, Ummm to a single form

dic = {'US':'USA', "U.S":"USA", "Ummmm":"UMM"}
text2 = 'she became a US citizen, Ummmm'

nomalized_words = []

tokenized_words = word_tokenize(text2) # tokenization
tokenized_words

['she', 'became', 'a', 'US', 'citizen', ',', 'Ummmm']

In [42]:
for word in tokenized_words: # normalization
    if word in dic.keys():
        word = dic[word]
    nomalized_words.append(word)

nomalized_words

['she', 'became', 'a', 'USA', 'citizen', ',', 'UMM']
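
The same lookup can be written more compactly with `dict.get`, which returns the token itself when no rule matches. This is a sketch using the rule dictionary and tokens from the cell above; the names `normalization_rules` and `normalized` are introduced here for readability.

```python
normalization_rules = {"US": "USA", "U.S": "USA", "Ummmm": "UMM"}
tokens = ["she", "became", "a", "US", "citizen", ",", "Ummmm"]

# dict.get(key, default) falls back to the original token when there is no rule.
normalized = [normalization_rules.get(tok, tok) for tok in tokens]
print(normalized)
```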

##### Stemming - Normalization

- PorterStemmer: simply strips word endings using fixed rules

In [46]:
# Stemming
from nltk.stem import PorterStemmer # stemmer
from nltk.tokenize import word_tokenize # tokenizer


text3 = "you are so lovely, i am loving you now."
tokenized_words = word_tokenize(text3)

stemmer_words = []

porter = PorterStemmer()

for word in tokenized_words:
    stem = porter.stem(word)
    # print(stem)
    stemmer_words.append(stem)

stemmer_words



['you', 'are', 'so', 'love', ',', 'i', 'am', 'love', 'you', 'now', '.']

###### User-defined stemming function

In [62]:
from nltk.stem import PorterStemmer

# Stemming function based on the Porter stemmer
def stemming_by_porter(tokenized_words):
    porter_stemmer = PorterStemmer()
    porter_stemmed_words = []

    for word in tokenized_words:
        stem = porter_stemmer.stem(word)
        porter_stemmed_words.append(stem)

    return porter_stemmed_words
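
To get a feel for why rule-based stemming can return strings that are not dictionary words, here is a deliberately naive suffix stripper. It is not the Porter algorithm: the suffix list, the minimum stem length, and the function name `toy_stem` are all assumptions chosen for illustration.

```python
# Suffixes are tried in this order; the first match is removed.
SUFFIXES = ("ing", "ly", "es", "ed", "s", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["movie", "movies", "loving", "lovely", "now"]]
print(stems)
```

Because the rules only look at the surface form of the word, related forms collapse to the same stem even when that stem is not itself a word.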

### Natural Language Processing

In [60]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/IMbank_텍스트마이닝/imdb.tsv',sep='\t')

del df['Unnamed: 0']
df['review'] = df['review'].str.lower() # lowercase - normalization
df['word_tokens']=df['review'].apply(word_tokenize) # tokenize the lowercased reviews
df

Unnamed: 0,review,word_tokens
0,"watching time chasers, it obvious that it was ...","[watching, time, chasers, ,, it, obvious, that..."
1,i saw this film about 20 years ago and remembe...,"[i, saw, this, film, about, 20, years, ago, an..."
2,"minor spoilers in new york, joan barnard (elvi...","[minor, spoilers, in, new, york, ,, joan, barn..."
3,i went to see this film with a great deal of e...,"[i, went, to, see, this, film, with, a, great,..."
4,"yes, i agree with everyone on this site this m...","[yes, ,, i, agree, with, everyone, on, this, s..."
5,"jennifer ehle was sparkling in \""pride and pre...","[jennifer, ehle, was, sparkling, in, \, '', pr..."
6,amy poehler is a terrific comedian on saturday...,"[amy, poehler, is, a, terrific, comedian, on, ..."
7,a plane carrying employees of a large biotech ...,"[a, plane, carrying, employees, of, a, large, ..."
8,"a well made, gritty science fiction movie, it ...","[a, well, made, ,, gritty, science, fiction, m..."
9,incredibly dumb and utterly predictable story ...,"[incredibly, dumb, and, utterly, predictable, ..."


In [64]:
def clean_by_freq(tokenized_words, cut_off_count):
    vocab = Counter(tokenized_words)

    # Remove the words whose frequency is cut_off_count or less
    uncommon_words = {key for key, value in vocab.items() if value <= cut_off_count}
    cleaned_words = [word for word in tokenized_words if word not in uncommon_words]

    return cleaned_words


def clean_by_len(tokenized_words, cut_off_length):
    cleaned_words = []

    for word in tokenized_words:
        # Keep only the words longer than cut_off_length
        if len(word) > cut_off_length:
            cleaned_words.append(word)

    return cleaned_words


# Quick check of both functions.
# Note: tokenized_words was reassigned to a short example sentence in an earlier cell,
# so no word occurs more than 2 times and the result is empty.
cleaned_by_freq = clean_by_freq(tokenized_words, 2)
cleaned_words = clean_by_len(cleaned_by_freq, 2)

cleaned_words

[]

In [66]:
stopwords_set = set(stopwords.words('english')) # build the stopword set

df['cleaned_tokens']=df['word_tokens'].apply(lambda x : clean_by_freq(x,1)) # drop words with frequency 1 or less
df['cleaned_tokens']=df['cleaned_tokens'].apply(lambda x : clean_by_len(x,2)) # drop words with length 2 or less
df['cleaned_tokens']=df['cleaned_tokens'].apply(lambda x : clean_by_stopwords(x,stopwords_set)) # remove stopwords

df['stemmed_tokens']=df['cleaned_tokens'].apply(stemming_by_porter) # stemming -> issue: 'movie' becomes 'movi' (the final 'e' is dropped)
df

Unnamed: 0,review,word_tokens,cleaned_tokens,stemmed_tokens
0,"watching time chasers, it obvious that it was ...","[watching, time, chasers, ,, it, obvious, that...","[one, film, said, really, bad, movie, like, sa...","[one, film, said, realli, bad, movi, like, sai..."
1,i saw this film about 20 years ago and remembe...,"[i, saw, this, film, about, 20, years, ago, an...","[film, film]","[film, film]"
2,"minor spoilers in new york, joan barnard (elvi...","[minor, spoilers, in, new, york, ,, joan, barn...","[new, york, joan, barnard, elvire, audrey, bar...","[new, york, joan, barnard, elvir, audrey, barn..."
3,i went to see this film with a great deal of e...,"[i, went, to, see, this, film, with, a, great,...","[went, film, film, went, jump, send, n't, jump...","[went, film, film, went, jump, send, n't, jump..."
4,"yes, i agree with everyone on this site this m...","[yes, ,, i, agree, with, everyone, on, this, s...","[site, movie, bad, even, movie, made, movie, s...","[site, movi, bad, even, movi, made, movi, spec..."
5,"jennifer ehle was sparkling in \""pride and pre...","[jennifer, ehle, was, sparkling, in, \, '', pr...","[ehle, northam, wonderful, wonderful, ehle, no...","[ehl, northam, wonder, wonder, ehl, northam, l..."
6,amy poehler is a terrific comedian on saturday...,"[amy, poehler, is, a, terrific, comedian, on, ...","[role, movie, n't, author, book, author, autho...","[role, movi, n't, author, book, author, author..."
7,a plane carrying employees of a large biotech ...,"[a, plane, carrying, employees, of, a, large, ...","[plane, ceo, search, rescue, mission, ceo, har...","[plane, ceo, search, rescu, mission, ceo, harl..."
8,"a well made, gritty science fiction movie, it ...","[a, well, made, ,, gritty, science, fiction, m...","[gritty, movie, sci-fi, good, suspense, movie,...","[gritti, movi, sci-fi, good, suspens, movi, sc..."
9,incredibly dumb and utterly predictable story ...,"[incredibly, dumb, and, utterly, predictable, ...","[girl, girl]","[girl, girl]"


In [74]:
# Split the text into sentences before word tokenization
from nltk.tokenize import sent_tokenize

text4 = "My email address is 'abcde@codeit.com'. Send it to Mr.Kim"
sent_tokenize(text4)

["My email address is 'abcde@codeit.com'.", 'Send it to Mr.Kim']

In [75]:
text5 = "Can you forward my email to Mr.Kim? Thank you!"
sent_tokenize(text5)

['Can you forward my email to Mr.Kim?', 'Thank you!']
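
For comparison, a naive splitter that breaks after every '.', '?' or '!' followed by whitespace handles the example above but trips over abbreviations written with a space. The function `naive_sent_split` and the second test sentence are assumptions introduced for this sketch.

```python
import re

def naive_sent_split(text):
    """Split after '.', '?' or '!' whenever whitespace follows."""
    return re.split(r"(?<=[.?!])\s+", text.strip())

same_as_above = naive_sent_split("Can you forward my email to Mr.Kim? Thank you!")
broken = naive_sent_split("Send it to Mr. Kim today. Thanks!")

print(same_as_above)
print(broken)
```

The second call cuts the sentence right after "Mr.", which is the kind of case a trained sentence tokenizer such as punkt is meant to handle.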

In [82]:
# Frequency analysis by part of speech (nouns, verbs, adjectives)
from nltk.tag import pos_tag
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')  # data for sentence and word tokenization


pos_tag_words = []

tokenized_sents = sent_tokenize(TEXT) # sentence tokenization (sent_tokenize): split the text into sentences
# tokenized_sents


for sentence in tokenized_sents:
    tokenized_words = word_tokenize(sentence) # word tokenization (word_tokenize): split each sentence into words
    pos_tags = pos_tag(tokenized_words) # POS tagging (pos_tag): label each word as a noun (NN), verb (VB), adjective (JJ), and so on
    pos_tag_words += pos_tags

print(pos_tag_words)

[('After', 'IN'), ('reading', 'VBG'), ('the', 'DT'), ('comments', 'NNS'), ('for', 'IN'), ('this', 'DT'), ('movie', 'NN'), (',', ','), ('I', 'PRP'), ('am', 'VBP'), ('not', 'RB'), ('sure', 'JJ'), ('whether', 'IN'), ('I', 'PRP'), ('should', 'MD'), ('be', 'VB'), ('angry', 'JJ'), (',', ','), ('sad', 'JJ'), ('or', 'CC'), ('sickened', 'VBN'), ('.', '.'), ('Seeing', 'VBG'), ('comments', 'NNS'), ('typical', 'JJ'), ('of', 'IN'), ('people', 'NNS'), ('who', 'WP'), ('a', 'DT'), (')', ')'), ('know', 'VBP'), ('absolutely', 'RB'), ('nothing', 'NN'), ('about', 'IN'), ('the', 'DT'), ('military', 'NN'), ('or', 'CC'), ('b', 'NN'), (')', ')'), ('who', 'WP'), ('base', 'VBP'), ('everything', 'NN'), ('they', 'PRP'), ('think', 'VBP'), ('they', 'PRP'), ('know', 'VBP'), ('on', 'IN'), ('movies', 'NNS'), ('like', 'IN'), ('this', 'DT'), ('or', 'CC'), ('on', 'IN'), ('CNN', 'NNP'), ('reports', 'NNS'), ('about', 'IN'), ('Abu-Gharib', 'NNP'), ('makes', 'VBZ'), ('me', 'PRP'), ('wonder', 'VB'), ('about', 'IN'), ('the', '

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [77]:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# POS tagging function
def pos_tagger(tokenized_sents):
    pos_tagged_words = []

    for sentence in tokenized_sents:
        # Word tokenization
        tokenized_words = word_tokenize(sentence)

        # POS tagging
        pos_tagged = pos_tag(tokenized_words)
        pos_tagged_words.extend(pos_tagged)

    return pos_tagged_words

In [85]:
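# Base form when the word is a verb
# Base form when the word is an adverb
t = 'hello world!'
tokenized_words = word_tokenize(t)
tagged_words=pos_tagger(tokenized_words) # note: pos_tagger expects a list of sentences, so each word here is tagged on its own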
tagged_words

[('hello', 'NN'), ('world', 'NN'), ('!', '.')]

In [88]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer() # WordNetLemmatizer returns the base form (lemma) of a word; for example, 'running' becomes 'run' when lemmatized as a verb.

for word,tag in tagged_words:
    a = lemmatizer.lemmatize(word,wn.NOUN)
    print(a)

[nltk_data] Downloading package wordnet to /root/nltk_data...


hello
world
!


In [90]:
# Map a Penn Treebank tag to a WordNet POS (adjective, noun, adverb, verb)
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ # adjective

    elif tag.startswith('N'):
        return wn.NOUN # noun

    elif tag.startswith('R'):
        return wn.ADV # adverb

    elif tag.startswith('V'):
        return wn.VERB # verb

    else:
        return tag


penn_to_wn("NNG")  # check

'n'
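
The same mapping can be expressed as a dictionary lookup on the first letter of the tag. The sketch below is a variation, not the function used in this notebook: it hard-codes the single-letter values of WordNet's POS constants ('a', 'n', 'r', 'v') so it runs without NLTK, and it returns `None` for tags that have no WordNet counterpart. The names `PENN_PREFIX_TO_WN` and `penn_to_wn_or_none` are hypothetical.

```python
# First letter of the Penn Treebank tag -> WordNet POS letter.
PENN_PREFIX_TO_WN = {"J": "a", "N": "n", "R": "r", "V": "v"}

def penn_to_wn_or_none(tag):
    """Return the WordNet POS letter for a Penn Treebank tag, or None."""
    return PENN_PREFIX_TO_WN.get(tag[:1])

for tag in ["JJ", "NNS", "RB", "VBD", "DT"]:
    print(tag, penn_to_wn_or_none(tag))
```

Returning `None` makes the "no mapping" case explicit, so the caller can test `if wn_tag is not None` instead of comparing against the four constants.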

In [92]:
def words_lematier(pos_tagger_words):
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = []

    for word, tag in pos_tagger_words:
        wn_tag = penn_to_wn(tag)
        if wn_tag in (wn.NOUN, wn.ADJ, wn.ADV, wn.VERB):
            stem = lemmatizer.lemmatize(word, wn_tag)
            lemmatized_words.append((stem))
        else:
            lemmatized_words.append((word))

    return lemmatized_words

In [95]:
df = pd.read_csv('/content/drive/MyDrive/IMbank_텍스트마이닝/imdb.tsv',sep='\t')
del df['Unnamed: 0']

df['review'] = df['review'].str.lower()
df['sent_tokens']=df['review'].apply(sent_tokenize) # split each review into sentences
df['pos_tagged_tokens']=df['sent_tokens'].apply(pos_tagger)


df['lemmatized_words']=df['pos_tagged_tokens'].apply(words_lematier)


df

Unnamed: 0,review,sent_tokens,pos_tagged_tokens,lemmatized_words
0,"watching time chasers, it obvious that it was ...","[watching time chasers, it obvious that it was...","[(watching, VBG), (time, NN), (chasers, NNS), ...","[watch, time, chaser, ,, it, obvious, that, it..."
1,i saw this film about 20 years ago and remembe...,[i saw this film about 20 years ago and rememb...,"[(i, NN), (saw, VBD), (this, DT), (film, NN), ...","[i, saw, this, film, about, 20, year, ago, and..."
2,"minor spoilers in new york, joan barnard (elvi...","[minor spoilers in new york, joan barnard (elv...","[(minor, JJ), (spoilers, NNS), (in, IN), (new,...","[minor, spoiler, in, new, york, ,, joan, barna..."
3,i went to see this film with a great deal of e...,[i went to see this film with a great deal of ...,"[(i, JJ), (went, VBD), (to, TO), (see, VB), (t...","[i, go, to, see, this, film, with, a, great, d..."
4,"yes, i agree with everyone on this site this m...","[yes, i agree with everyone on this site this ...","[(yes, UH), (,, ,), (i, JJ), (agree, VBP), (wi...","[yes, ,, i, agree, with, everyone, on, this, s..."
5,"jennifer ehle was sparkling in \""pride and pre...","[jennifer ehle was sparkling in \""pride and pr...","[(jennifer, NN), (ehle, NN), (was, VBD), (spar...","[jennifer, ehle, be, sparkle, in, \, '', pride..."
6,amy poehler is a terrific comedian on saturday...,[amy poehler is a terrific comedian on saturda...,"[(amy, JJ), (poehler, NN), (is, VBZ), (a, DT),...","[amy, poehler, be, a, terrific, comedian, on, ..."
7,a plane carrying employees of a large biotech ...,[a plane carrying employees of a large biotech...,"[(a, DT), (plane, NN), (carrying, VBG), (emplo...","[a, plane, carry, employee, of, a, large, biot..."
8,"a well made, gritty science fiction movie, it ...","[a well made, gritty science fiction movie, it...","[(a, DT), (well, NN), (made, VBN), (,, ,), (gr...","[a, well, make, ,, gritty, science, fiction, m..."
9,incredibly dumb and utterly predictable story ...,[incredibly dumb and utterly predictable story...,"[(incredibly, RB), (dumb, JJ), (and, CC), (utt...","[incredibly, dumb, and, utterly, predictable, ..."


In [98]:
# Drop words with frequency 1 or less, drop words with length 2 or less, remove stopwords, then check the result

df['cleaned_tokens']=df['lemmatized_words'].apply(lambda x : clean_by_freq(x,1)) # drop words with frequency 1 or less
df['cleaned_tokens']=df['cleaned_tokens'].apply(lambda x : clean_by_len(x,2)) # drop words with length 2 or less
df['cleaned_tokens']=df['cleaned_tokens'].apply(lambda x : clean_by_stopwords(x,stopwords_set)) # remove stopwords

df['combined_corpus']=df['cleaned_tokens'].apply(lambda x :" ".join(x))
df

Unnamed: 0,review,sent_tokens,pos_tagged_tokens,lemmatized_words,cleaned_tokens,combined_corpus
0,"watching time chasers, it obvious that it was ...","[watching time chasers, it obvious that it was...","[(watching, VBG), (time, NN), (chasers, NNS), ...","[watch, time, chaser, ,, it, obvious, that, it...","[make, one, film, say, make, really, bad, movi...",make one film say make really bad movie like s...
1,i saw this film about 20 years ago and remembe...,[i saw this film about 20 years ago and rememb...,"[(i, NN), (saw, VBD), (this, DT), (film, NN), ...","[i, saw, this, film, about, 20, year, ago, and...","[film, film]",film film
2,"minor spoilers in new york, joan barnard (elvi...","[minor spoilers in new york, joan barnard (elv...","[(minor, JJ), (spoilers, NNS), (in, IN), (new,...","[minor, spoiler, in, new, york, ,, joan, barna...","[new, york, joan, barnard, elvire, audrey, bar...",new york joan barnard elvire audrey barnard jo...
3,i went to see this film with a great deal of e...,[i went to see this film with a great deal of ...,"[(i, JJ), (went, VBD), (to, TO), (see, VB), (t...","[i, go, to, see, this, film, with, a, great, d...","[film, film, jump, send, n't, jump, radio, n't...",film film jump send n't jump radio n't send re...
4,"yes, i agree with everyone on this site this m...","[yes, i agree with everyone on this site this ...","[(yes, UH), (,, ,), (i, JJ), (agree, VBP), (wi...","[yes, ,, i, agree, with, everyone, on, this, s...","[site, movie, bad, even, movie, movie, make, m...",site movie bad even movie movie make movie spe...
5,"jennifer ehle was sparkling in \""pride and pre...","[jennifer ehle was sparkling in \""pride and pr...","[(jennifer, NN), (ehle, NN), (was, VBD), (spar...","[jennifer, ehle, be, sparkle, in, \, '', pride...","[ehle, northam, wonderful, wonderful, ehle, no...",ehle northam wonderful wonderful ehle northam ...
6,amy poehler is a terrific comedian on saturday...,[amy poehler is a terrific comedian on saturda...,"[(amy, JJ), (poehler, NN), (is, VBZ), (a, DT),...","[amy, poehler, be, a, terrific, comedian, on, ...","[role, movie, n't, author, book, funny, author...",role movie n't author book funny author author...
7,a plane carrying employees of a large biotech ...,[a plane carrying employees of a large biotech...,"[(a, DT), (plane, NN), (carrying, VBG), (emplo...","[a, plane, carry, employee, of, a, large, biot...","[plane, ceo, search, rescue, mission, call, ce...",plane ceo search rescue mission call ceo harla...
8,"a well made, gritty science fiction movie, it ...","[a well made, gritty science fiction movie, it...","[(a, DT), (well, NN), (made, VBN), (,, ,), (gr...","[a, well, make, ,, gritty, science, fiction, m...","[gritty, movie, movie, keep, sci-fi, good, kee...",gritty movie movie keep sci-fi good keep suspe...
9,incredibly dumb and utterly predictable story ...,[incredibly dumb and utterly predictable story...,"[(incredibly, RB), (dumb, JJ), (and, CC), (utt...","[incredibly, dumb, and, utterly, predictable, ...","[girl, girl]",girl girl


In [99]:
box = []
for words in df['cleaned_tokens']:
    box +=words

Counter(box).most_common(10)

[('movie', 18),
 ('film', 12),
 ("n't", 11),
 ('scene', 10),
 ('bad', 8),
 ('time', 8),
 ('reason', 8),
 ('make', 7),
 ('jim', 7),
 ('good', 7)]
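
The corpus-wide count can also be built without the intermediate `box` list by chaining the per-review token lists straight into a `Counter`. The token lists below are made-up stand-ins for the `cleaned_tokens` column.

```python
from collections import Counter
from itertools import chain

# Hypothetical per-document token lists.
token_lists = [["film", "film"], ["movie", "bad", "movie"], ["girl", "girl"]]

# chain.from_iterable flattens the lists lazily; Counter consumes the stream.
corpus_counts = Counter(chain.from_iterable(token_lists))
print(corpus_counts.most_common(2))
```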