<a href="https://colab.research.google.com/github/rtajeong/AI_Cluster/blob/main/lab53_sentiment_analysis_naver_movie_rev100.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Naver Movie Ratings
==
- Sentiment analysis
- Data: Naver sentiment movie corpus v1.0 (https://github.com/e9t/nsmc)
- The corpus contains 200,000 movie reviews. Each review is labeled 0 (negative) or 1 (positive).
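
To make the file layout concrete, here is a minimal sketch that parses rows in the tab-separated `id`, `document`, `label` layout using only the standard library. The two rows are copied from the training-file preview shown later in this notebook; the in-memory string stands in for the real file, and the helper name `parse_reviews` is hypothetical.

```python
import csv
import io

# Two rows in the id / document / label layout (tab-separated).
# The in-memory string stands in for ratings_train.txt.
SAMPLE = (
    "id\tdocument\tlabel\n"
    "9976970\t아 더빙.. 진짜 짜증나네요 목소리\t0\n"
    "3819312\t흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나\t1\n"
)

def parse_reviews(handle):
    """Return (documents, labels) from a tab-separated review file."""
    # QUOTE_NONE: treat quote characters inside reviews as ordinary text
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    documents, labels = [], []
    for row in reader:
        documents.append(row["document"])
        labels.append(int(row["label"]))  # 0 = negative, 1 = positive
    return documents, labels

documents, labels = parse_reviews(io.StringIO(SAMPLE))
print(len(documents), labels)
```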

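### Korean natural language processing
- KoNLPy (pronounced “ko en el PIE”) is a Python package for Korean natural language processing.
- This lab uses the Twitter morphological analyzer provided by the konlpy package (renamed Okt since KoNLPy v0.4.5). Despite its name, it handles general Korean text, not only tweets.
- Running on Colab is recommended.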

# Sentiment analysis with logistic regression

In [95]:
!pip install konlpy



In [96]:
# Download the Naver movie rating data
!curl -L https://bit.ly/2X9Owwr -o ratings_train.txt
!curl -L https://bit.ly/2WuLd5I -o ratings_test.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   152  100   152    0     0    899      0 --:--:-- --:--:-- --:--:--   899
100   148    0   148    0     0    397      0 --:--:-- --:--:-- --:--:--  1129
100   318  100   318    0     0    536      0 --:--:-- --:--:-- --:--:--   536
100 14.0M  100 14.0M    0     0  11.6M      0  0:00:01  0:00:01 --:--:-- 11.6M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   151  100   151    0     0   3282      0 --:--:-- --:--:-- --:--:--  3282
100   147    0   147    0     0    781      0 --:--:-- --:--:-- --:--:--   781
100   318  100   318    0     0    809      0 --:--:-- --:--:-- --:--:--   809
100 4827k  100 4827k    0     0  5485k      0 --:--:-- --:--:-- --:--:--  122M


In [154]:
import konlpy
import pandas as pd
from konlpy.tag import Twitter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# from sklearn.pipeline import make_pipeline
# import pickle
# import os.path

# Load the data
# keep_default_na: Whether or not to include the default NaN values when parsing the data
# -> False: no strings will be parsed as NaN.

df_train = pd.read_csv('ratings_train.txt', delimiter='\t', keep_default_na=False)
df_test = pd.read_csv('ratings_test.txt', delimiter='\t', keep_default_na=False)

In [155]:
df_train.head(2)

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1


In [156]:
df_test.head(2)

Unnamed: 0,id,document,label
0,6270596,굳 ㅋ,1
1,9274899,GDNTOPCLASSINTHECLUB,0


In [157]:
text_train, y_train = df_train['document'].values, df_train['label'].values
text_test, y_test = df_test['document'].values, df_test['label'].values

In [158]:
text_train.shape, text_test.shape   # too big

((150000,), (50000,))

In [159]:
text_train, y_train = text_train[:2000], y_train[:2000]
text_test, y_test = text_test[:1000], y_test[:1000]

In [160]:
text_train.shape, text_test.shape

((2000,), (1000,))

In [161]:
from konlpy.tag import Okt

okt = Okt()  # create the tagger once instead of on every call

def twitter_tokenizer(text):
    # split the text into morphemes
    return okt.morphs(text)

cv = TfidfVectorizer(tokenizer=twitter_tokenizer, max_features=1000, min_df=5)

In [162]:
x_train = cv.fit_transform(text_train)
x_test = cv.transform(text_test)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

(2000, 794) (1000, 794) (2000,) (1000,)
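
The feature count above (794) is below `max_features=1000`, which is consistent with `min_df=5` having discarded rarer tokens before the cap could apply. The sketch below imitates that vocabulary selection in plain Python on a toy corpus. It illustrates the idea and is not scikit-learn's implementation. `select_vocabulary` is a hypothetical helper, the three documents are made up, and whitespace splitting stands in for the Okt tokenizer.

```python
from collections import Counter

def select_vocabulary(tokenized_docs, min_df=1, max_features=None):
    """Pick terms by document frequency, then cap by total term frequency."""
    doc_freq = Counter()   # number of documents containing each term
    term_freq = Counter()  # total occurrences of each term
    for tokens in tokenized_docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    # min_df: drop terms that appear in fewer than min_df documents
    kept = [t for t in doc_freq if doc_freq[t] >= min_df]

    # max_features: keep only the most frequent of the remaining terms
    if max_features is not None:
        kept = sorted(kept, key=lambda t: (-term_freq[t], t))[:max_features]

    # assign column indices in sorted term order
    return {term: idx for idx, term in enumerate(sorted(kept))}

# Toy corpus (made up), already split into tokens
docs = [
    "영화 정말 재미 있다".split(),
    "영화 너무 지루 하다".split(),
    "정말 좋은 영화".split(),
]
vocab = select_vocabulary(docs, min_df=2, max_features=1000)
print(vocab)  # only terms found in at least two documents remain
```

Here the cap of 1000 never takes effect because only two terms survive the `min_df` filter, which mirrors how the real vocabulary can end up smaller than `max_features`.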


In [163]:
print(cv.vocabulary_)
print(cv.get_feature_names_out()[:10])  # get_feature_names() was removed in newer scikit-learn releases

{'아': 432, '더빙': 211, '..': 13, '진짜': 693, '흠': 791, '...': 14, '포스터': 742, '보고': 326, '영화': 520, '줄': 680, '....': 15, '연기': 512, '너': 174, '추천': 716, '한': 760, '다': 194, '이야기': 579, '솔직히': 402, '재미': 622, '는': 190, '없다': 484, '평점': 741, '그': 119, '의': 555, '가': 60, '!': 0, '에서': 496, '했던': 777, '너무나도': 177, '막': 261, '3': 24, '세': 392, '부터': 360, '초등학교': 708, '1': 20, '8': 28, '.': 12, 'ㅋㅋㅋ': 50, '별': 321, '반개': 314, '도': 213, '아까': 433, '움': 542, '원작': 546, '긴장감': 145, '을': 553, '제대로': 656, '했다': 776, '아깝다': 436, '욕': 537, '나온다': 160, '연': 511, '기': 138, '이': 557, '몇': 289, '년': 181, '인지': 596, '정말': 653, '로': 253, '해도': 772, '그것': 120, '만': 263, '드라마': 229, '가족': 68, '못': 297, '하는': 751, '사람': 370, '네': 179, '액션': 465, '있는': 602, '안되는': 453, '?': 33, '꽤': 151, '볼': 344, '데': 212, '식': 422, '너무': 175, '걍': 89, '짱': 699, '이다': 563, '♥': 45, '마다': 257, '90년': 30, '자극': 608, '!!': 1, '감성': 76, '멜로': 285, '~': 43, '서': 387, '손': 401, '들': 231, '고': 105, '때': 238, '뻔': 366, '해': 771, '좋다

In [164]:
lr = LogisticRegression()
lr.fit(x_train, y_train)
print("train score: ", lr.score(x_train, y_train))
print("test score: ", lr.score(x_test, y_test))

train score:  0.8555
test score:  0.746
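
For a classifier, `score` reports mean accuracy: the fraction of samples whose predicted label matches the true label. A minimal sketch of that calculation follows; the `accuracy` helper and the toy labels are assumptions made for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy labels (made up): 0 = negative, 1 = positive
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 3 of the 5 predictions are correct
```

Read this way, the gap between the train score (0.8555) and the test score (0.746) suggests some overfitting to the 2,000 training reviews.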


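# Stop word handling
- Korean stop words can be identified with KoNLPy, a morphological analysis library.
- Example: extracting the particles (Josa) from among the Korean parts of speech.
- pos (part-of-speech): the grammatical category of a word (noun, verb, ...)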

In [108]:
from konlpy.tag import Okt   # "Twitter" has been renamed "Okt" since KoNLPy v0.4.5

In [109]:
Okt().morphs("텍스트 데이터를 이용해서 불용어 사전을 구축하기 위한 간단 예제")


['텍스트',
 '데이터',
 '를',
 '이용',
 '해서',
 '불',
 '용어',
 '사전',
 '을',
 '구축',
 '하기',
 '위',
 '한',
 '간단',
 '예제']

In [110]:
# The input contains the misspelling "사잔" (for "사전"), which the tagger leaves as-is
Okt().pos("텍스트 데이터를 이용해서 불용어 사잔을 구축하기 위한 간단 예제")


[('텍스트', 'Noun'),
 ('데이터', 'Noun'),
 ('를', 'Josa'),
 ('이용', 'Noun'),
 ('해서', 'Verb'),
 ('불', 'Noun'),
 ('용어', 'Noun'),
 ('사잔', 'Noun'),
 ('을', 'Josa'),
 ('구축', 'Noun'),
 ('하기', 'Verb'),
 ('위', 'Noun'),
 ('한', 'Josa'),
 ('간단', 'Noun'),
 ('예제', 'Noun')]

In [111]:
# norm=True normalizes the text before tagging (noted by the author as typo correction, 사잔 -> 사전).
# This input already spells 사전 correctly, unlike the previous cell.
Okt().pos("텍스트 데이터를 이용해서 불용어 사전을 구축하기 위한 간단 예제", norm=True)


[('텍스트', 'Noun'),
 ('데이터', 'Noun'),
 ('를', 'Josa'),
 ('이용', 'Noun'),
 ('해서', 'Verb'),
 ('불', 'Noun'),
 ('용어', 'Noun'),
 ('사전', 'Noun'),
 ('을', 'Josa'),
 ('구축', 'Noun'),
 ('하기', 'Verb'),
 ('위', 'Noun'),
 ('한', 'Josa'),
 ('간단', 'Noun'),
 ('예제', 'Noun')]

In [112]:
Okt().nouns("텍스트 데이터를 이용해서 불용어 사전을 구축하기 위한 간단 예제")


['텍스트', '데이터', '이용', '불', '용어', '사전', '구축', '위', '간단', '예제']

- norm: normalize the text (e.g., correct typos); stem: reduce each word to its stem

In [113]:
from konlpy.tag import Okt

word_tags = Okt().pos("텍스트 데이터를 이용해서 불용어 사전을 구축하기 위한 간단 예제", norm=True, stem=True)
print(word_tags)
# Treat every token tagged as a particle (Josa) as a stop word
stop_words = [word for word, tag in word_tags if tag == "Josa"]
print(stop_words)

[('텍스트', 'Noun'), ('데이터', 'Noun'), ('를', 'Josa'), ('이용', 'Noun'), ('하다', 'Verb'), ('불', 'Noun'), ('용어', 'Noun'), ('사전', 'Noun'), ('을', 'Josa'), ('구축', 'Noun'), ('하다', 'Verb'), ('위', 'Noun'), ('한', 'Josa'), ('간단', 'Noun'), ('예제', 'Noun')]
['를', '을', '한']
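
Once particles have been identified, one way to use that information is to drop them from the token stream before vectorizing. The sketch below applies such a filter to the tagged pairs printed above, copied in as literals so the block runs without KoNLPy. `remove_by_tag` is a hypothetical helper, not part of KoNLPy.

```python
# (word, tag) pairs copied from the tagger output above
word_tags = [
    ('텍스트', 'Noun'), ('데이터', 'Noun'), ('를', 'Josa'), ('이용', 'Noun'),
    ('하다', 'Verb'), ('불', 'Noun'), ('용어', 'Noun'), ('사전', 'Noun'),
    ('을', 'Josa'), ('구축', 'Noun'), ('하다', 'Verb'), ('위', 'Noun'),
    ('한', 'Josa'), ('간단', 'Noun'), ('예제', 'Noun'),
]

def remove_by_tag(tagged, excluded_tags=("Josa",)):
    """Keep only the words whose POS tag is not in excluded_tags."""
    return [word for word, tag in tagged if tag not in excluded_tags]

content_words = remove_by_tag(word_tags)
print(content_words)
```

In the logistic-regression pipeline, the same filter could plausibly be applied inside the tokenizer function passed to `TfidfVectorizer`, so that particles never enter the vocabulary.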


# Analysis with an LSTM (additional)

In [166]:
from konlpy.tag import Okt
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
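# Note: tensorflow.keras.wrappers.scikit_learn was removed in newer TensorFlow releases;
# the separate scikeras package offers a similar KerasClassifier wrapper.
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier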
from tensorflow.keras import utils
from sklearn.model_selection import train_test_split
from sklearn import model_selection, metrics
import numpy as np
import pickle
import os.path
import tensorflow.keras.backend as K

# Tokenizer
def twitter_tokenizer(text):
    return Okt().morphs(text)

In [167]:
df_train = pd.read_csv('ratings_train.txt', delimiter='\t', keep_default_na=False)
df_test = pd.read_csv('ratings_test.txt', delimiter='\t', keep_default_na=False)

In [168]:
print(df_train.shape, df_test.shape)
df_train.columns, df_test.columns

(150000, 3) (50000, 3)


(Index(['id', 'document', 'label'], dtype='object'),
 Index(['id', 'document', 'label'], dtype='object'))

In [169]:
df_train, df_test = df_train[:2000], df_test[:1000]  # the full dataset is too big; keep a small subset

In [170]:
df_data= pd.concat([df_train, df_test])
df_data.shape, df_data.columns

((3000, 3), Index(['id', 'document', 'label'], dtype='object'))

In [171]:
text_data, y_data = df_data['document'].values, df_data['label'].values

In [172]:
text_data.shape, y_data.shape

((3000,), (3000,))

In [173]:
cv = TfidfVectorizer(tokenizer=twitter_tokenizer, max_features = 1000, min_df=5)

In [174]:
if not os.path.isfile("X_data.pickle"):
    print('file does not exist')
    X_data = cv.fit_transform(text_data)
    # use a context manager so the file is closed after writing
    with open("X_data.pickle", "wb") as f:
        pickle.dump(X_data, f)

file does not exist


In [175]:
# Load the saved TF-IDF vectors
with open('X_data.pickle', 'rb') as f:
    X_data = pickle.load(f)
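
The two cells above follow a compute-once, load-afterwards caching pattern. A minimal sketch of the same pattern as a single function follows. `load_or_compute` and the stand-in `expensive` function are hypothetical, and a temporary directory is used so the example leaves no files behind.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at path, creating it with compute() if missing."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

# Demonstration with a stand-in computation in a temporary directory
calls = []

def expensive():
    calls.append(1)  # record each time the computation actually runs
    return {"rows": 3000, "features": 1000}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "X_data.pickle")
    first = load_or_compute(cache_path, expensive)   # computes and saves
    second = load_or_compute(cache_path, expensive)  # loads from disk

print(first == second, len(calls))
```

Two caveats apply to this kind of cache. It goes stale if the input text or the vectorizer settings change. And when the file already exists, the vectorizer `cv` is never fitted in that session, so it cannot transform new text unless it is also saved.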

In [176]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3)

In [177]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((2100, 1000), (900, 1000), (2100,), (900,))

In [179]:
max_words = X_train.shape[1]   
batch_size = 64
nb_epoch = 5

In [180]:
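# Reshape the data for LSTM training: (samples, time steps, features), with a single time step
# .toarray() converts the sparse TF-IDF matrix to a dense array (the .A shorthand is deprecated in newer SciPy)
X_train_rnn = X_train.toarray().reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test_rnn = X_test.toarray().reshape((X_test.shape[0], 1, X_test.shape[1]))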

print(X_train_rnn.shape)
print(X_test_rnn.shape)

(2100, 1, 1000)
(900, 1, 1000)


In [181]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM

def build_LSTM_model():
    model = Sequential()
    model.add(LSTM(128, input_shape=(X_train_rnn.shape[1], X_train_rnn.shape[2]), return_sequences=True))
    model.add(Activation('relu'))
    model.add(Dropout(0.2))
    model.add(LSTM(128))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [182]:
model_lstm = KerasClassifier(build_fn=build_LSTM_model, 
                             epochs=nb_epoch, 
                             batch_size=batch_size)
model_lstm.fit(X_train_rnn, y_train)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f5f35f71bd0>

In [183]:
y_pred = model_lstm.predict(X_train_rnn)
metrics.accuracy_score(y_train, y_pred)

0.8938095238095238

In [184]:
y_pred = model_lstm.predict(X_test_rnn)
metrics.accuracy_score(y_test, y_pred)

0.7722222222222223

# Notes

### Pickling
- “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
- Pickling (and unpickling) is alternatively known as “serialization”, “marshalling”, or “flattening”; however, to avoid confusion, the terms “pickling” and “unpickling” are used here.
- Comparison with JSON: there are fundamental differences between the pickle protocols and JSON (JavaScript Object Notation):
  - JSON is a text serialization format (it outputs unicode text, although most of the time it is then encoded to utf-8), while pickle is a binary serialization format;
  - JSON is human-readable, while pickle is not;
  - JSON is interoperable and widely used outside of the Python ecosystem, while pickle is Python-specific;
  - JSON, by default, can only represent a subset of the Python built-in types, and no custom classes; pickle can represent an extremely large number of Python types (many of them automatically, by clever usage of Python’s introspection facilities; complex cases can be tackled by implementing specific object APIs);
  - Unlike pickle, deserializing untrusted JSON does not in itself create an arbitrary code execution vulnerability.
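
A short round-trip comparison makes some of these differences concrete. This is a minimal sketch using only the standard library; the `record` dictionary is made up for illustration.

```python
import json
import pickle

# Made-up record containing a tuple, which JSON cannot represent directly
record = {"id": 9976970, "label": 0, "span": (0, 5)}

# JSON round trip: text output, and the tuple comes back as a list
as_json = json.dumps(record)
from_json = json.loads(as_json)

# pickle round trip: binary output, and Python types are preserved
as_pickle = pickle.dumps(record)
from_pickle = pickle.loads(as_pickle)

print(type(as_json).__name__, type(from_json["span"]).__name__)
print(type(as_pickle).__name__, type(from_pickle["span"]).__name__)

# JSON cannot represent a set without a custom encoder
try:
    json.dumps({"tags": {"a", "b"}})
except TypeError as exc:
    print("json failed:", type(exc).__name__)
```

Because unpickling can execute arbitrary code, `pickle.loads` should only be used on data from a trusted source, such as a cache file written by the same notebook.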