<a href="https://colab.research.google.com/github/rtajeong/M3_new_2025/blob/main/lab53_sentiment_analysis_naver_movie_rev100.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Naver Movie Ratings
==
- Sentiment analysis
- Naver sentiment movie corpus v1.0 (NSMC) data (https://github.com/e9t/nsmc)
- Contains 200,000 movie reviews. Each review is labeled 0 (negative) or 1 (positive).

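### Korean Natural Language Processing
- KoNLPy (pronounced "ko en el PIE") is a Python package for Korean natural language processing.
- This notebook uses the Okt morphological analyzer from the konlpy package (formerly named Twitter; it handles
  general Korean text, not only tweets).
- Running on Colab is recommended.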

# Sentiment Analysis with Logistic Regression

In [1]:
!pip install konlpy

Collecting konlpy
  Downloading konlpy-0.6.0-py2.py3-none-any.whl.metadata (1.9 kB)
Collecting JPype1>=0.7.0 (from konlpy)
  Downloading jpype1-1.6.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (5.0 kB)
Downloading konlpy-0.6.0-py2.py3-none-any.whl (19.4 MB)
Downloading jpype1-1.6.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (495 kB)
Installing collected packages: JPype1, konlpy
Successfully installed JPype1-1.6.0 konlpy-0.6.0


In [2]:
# Download the Naver movie rating data
!curl -L https://bit.ly/2X9Owwr -o ratings_train.txt
!curl -L https://bit.ly/2WuLd5I -o ratings_test.txt



In [3]:
import konlpy
import pandas as pd
from konlpy.tag import Okt  # the Twitter class was renamed to Okt in KoNLPy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
# from sklearn.pipeline import make_pipeline
# import pickle
# import os.path

# Load the data
# keep_default_na=False: do not parse strings such as "NA" or the empty string as NaN,
# so every review stays a plain string.

df_train = pd.read_csv('ratings_train.txt', delimiter='\t', keep_default_na=False)
df_test = pd.read_csv('ratings_test.txt', delimiter='\t', keep_default_na=False)

In [4]:
df_train.head(2)

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1


In [5]:
df_test.head(2)

Unnamed: 0,id,document,label
0,6270596,굳 ㅋ,1
1,9274899,GDNTOPCLASSINTHECLUB,0


In [6]:
text_train, y_train = df_train['document'].values, df_train['label'].values
text_test, y_test = df_test['document'].values, df_test['label'].values

In [7]:
text_train.shape, text_test.shape   # too big

((150000,), (50000,))

In [8]:
# Use a small subset so that morphological analysis finishes quickly
text_train, y_train = text_train[:2000], y_train[:2000]
text_test, y_test = text_test[:1000], y_test[:1000]

In [9]:
text_train.shape, text_test.shape

((2000,), (1000,))

In [10]:
from konlpy.tag import Okt

_okt = Okt()  # create the analyzer once instead of on every call

def okt_tokenizer(text):
    # Split a review into morphemes
    return _okt.morphs(text)


In [11]:
cv = TfidfVectorizer(tokenizer=okt_tokenizer, max_features = 1000, min_df=5)
x_train = cv.fit_transform(text_train)
x_test = cv.transform(text_test)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)



(2000, 794) (1000, 794) (2000,) (1000,)
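
The matrix has 794 columns rather than 1000, presumably because `min_df=5` discards tokens that occur in fewer than five reviews before the `max_features` cap applies. As a rough illustration of what each cell of the matrix holds, the sketch below recomputes TF-IDF weights by hand on toy, already tokenized documents. It assumes scikit-learn's default settings (smoothed IDF `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The helper `tfidf_rows` and the toy documents are hypothetical and not part of the notebook.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: number of documents that contain each term
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (assumed to match scikit-learn's defaults)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])  # L2-normalize each row
    return vocab, rows

# Toy documents standing in for tokenizer output
docs = [["영화", "재미", "없다"], ["영화", "진짜", "재미"], ["영화", "추천"]]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(term, round(weight, 3))
```

A token that appears in every document ("영화") ends up with a smaller weight than one that appears in a single document ("없다"), which is the effect TF-IDF is meant to produce.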


In [12]:
print(cv.vocabulary_)
print(cv.get_feature_names_out()[-10:])

{'아': np.int64(432), '더빙': np.int64(211), '..': np.int64(13), '진짜': np.int64(693), '흠': np.int64(791), '...': np.int64(14), '포스터': np.int64(742), '보고': np.int64(326), '영화': np.int64(520), '줄': np.int64(680), '....': np.int64(15), '연기': np.int64(512), '너': np.int64(174), '추천': np.int64(716), '한': np.int64(760), '다': np.int64(194), '이야기': np.int64(579), '솔직히': np.int64(402), '재미': np.int64(622), '는': np.int64(190), '없다': np.int64(484), '평점': np.int64(741), '그': np.int64(119), '의': np.int64(555), '가': np.int64(60), '!': np.int64(0), '에서': np.int64(496), '했던': np.int64(777), '너무나도': np.int64(177), '막': np.int64(261), '3': np.int64(24), '세': np.int64(392), '부터': np.int64(360), '초등학교': np.int64(708), '1': np.int64(20), '8': np.int64(28), '.': np.int64(12), 'ㅋㅋㅋ': np.int64(50), '별': np.int64(321), '반개': np.int64(314), '도': np.int64(213), '아까': np.int64(433), '움': np.int64(542), '원작': np.int64(546), '긴장감': np.int64(145), '을': np.int64(553), '제대로': np.int64(656), '했다': np.int64(776), '아깝다': np.

In [13]:
lr = LogisticRegression()
lr.fit(x_train, y_train)
print("train score: ", lr.score(x_train, y_train))
print("test score: ", lr.score(x_test, y_test))

train score:  0.855
test score:  0.745
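
`lr.score` reports mean accuracy: the fraction of reviews whose predicted label matches the true label. The sketch below spells this out in plain Python for a binary logistic model, which predicts 1 when `sigmoid(w·x + b)` exceeds 0.5. The weights, bias, and feature vectors are made-up values for illustration, not parameters learned in this notebook.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    # Linear score followed by the logistic function, thresholded at 0.5
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1 if sigmoid(z) > 0.5 else 0

def accuracy(y_true, y_pred):
    # Fraction of predictions that equal the true labels
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

weights, bias = [2.0, -3.0], 0.1            # hypothetical parameters
X = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.7, 0.0]]
y = [1, 0, 1, 1]

y_pred = [predict(weights, bias, x) for x in X]
print(y_pred, accuracy(y, y_pred))          # [1, 0, 0, 1] 0.75
```

The gap between the train score (0.855) and the test score (0.745) above is the same quantity measured on seen versus unseen reviews.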


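# Stop Word Handling
- Korean stop words can be identified with the morphological analysis library KoNLPy.
- Example: extracting particles (Josa) from the part-of-speech tags
- pos (part-of-speech): word class (noun, verb, ...)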

In [14]:
Okt().morphs("텍스트 데이터를 이용해서 불용어 사전을 구축")

['텍스트', '데이터', '를', '이용', '해서', '불', '용어', '사전', '을', '구축']

In [15]:
Okt().pos("텍스트 데이터를 이용해서 불용어 사잔을 구축")

[('텍스트', 'Noun'),
 ('데이터', 'Noun'),
 ('를', 'Josa'),
 ('이용', 'Noun'),
 ('해서', 'Verb'),
 ('불', 'Noun'),
 ('용어', 'Noun'),
 ('사잔', 'Noun'),
 ('을', 'Josa'),
 ('구축', 'Noun')]

In [16]:
Okt().pos("텍스트 데이터를 이용해서 불용어 사전을 구축", norm=True)   # norm=True normalizes tokens before tagging (intended to correct misspellings such as 사잔 -> 사전)

[('텍스트', 'Noun'),
 ('데이터', 'Noun'),
 ('를', 'Josa'),
 ('이용', 'Noun'),
 ('해서', 'Verb'),
 ('불', 'Noun'),
 ('용어', 'Noun'),
 ('사전', 'Noun'),
 ('을', 'Josa'),
 ('구축', 'Noun')]

In [17]:
Okt().nouns("텍스트 데이터를 이용해서 불용어 사전을 구축")

['텍스트', '데이터', '이용', '불', '용어', '사전', '구축']

- norm: normalize tokens (e.g., correct typos), stem: reduce tokens to their stems

In [18]:
from konlpy.tag import Okt

okt = Okt()

word_tags = okt.pos("텍스트 데이터를 이용해서 불용어 사전을 구축하기 위한 간단 예제", norm=True, stem=True)
print(word_tags)
# Collect the tokens tagged as particles (Josa) to use as stop words
stop_words = [word for word, tag in word_tags if tag == "Josa"]
print(stop_words)

[('텍스트', 'Noun'), ('데이터', 'Noun'), ('를', 'Josa'), ('이용', 'Noun'), ('하다', 'Verb'), ('불', 'Noun'), ('용어', 'Noun'), ('사전', 'Noun'), ('을', 'Josa'), ('구축', 'Noun'), ('하다', 'Verb'), ('위', 'Noun'), ('한', 'Josa'), ('간단', 'Noun'), ('예제', 'Noun')]
['를', '을', '한']
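
The same tag-based filtering can be tried without starting the analyzer. The sketch below reuses the `(token, tag)` pairs printed above and drops every token whose tag is in an exclusion set. The name `drop_tags` and the choice of excluded tags are assumptions made for illustration.

```python
# (token, tag) pairs copied from the Okt output above
word_tags = [('텍스트', 'Noun'), ('데이터', 'Noun'), ('를', 'Josa'), ('이용', 'Noun'),
             ('하다', 'Verb'), ('불', 'Noun'), ('용어', 'Noun'), ('사전', 'Noun'),
             ('을', 'Josa'), ('구축', 'Noun'), ('하다', 'Verb'), ('위', 'Noun'),
             ('한', 'Josa'), ('간단', 'Noun'), ('예제', 'Noun')]

EXCLUDED_TAGS = {"Josa", "Punctuation"}  # assumption: tags treated as stop words

def drop_tags(pairs, excluded=EXCLUDED_TAGS):
    # Keep only the tokens whose tag is not excluded
    return [word for word, tag in pairs if tag not in excluded]

print(drop_tags(word_tags))
```

Using a set of tags makes it easy to experiment, for example by also excluding other tags the tagger emits, without changing the filtering logic.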


In [19]:
def okt_tokenizer2(text):
    # Normalize and stem each token, then drop particles (Josa) and punctuation
    word_tags = okt.pos(text, norm=True, stem=True)
    return [word for word, tag in word_tags if tag not in ("Josa", "Punctuation")]

cv = TfidfVectorizer(tokenizer=okt_tokenizer2, max_features = 500, min_df=5)
x_train = cv.fit_transform(text_train)
x_test = cv.transform(text_test)

lr = LogisticRegression()
lr.fit(x_train,y_train)
print("train score: ", lr.score(x_train, y_train))
print("test score: ", lr.score(x_test, y_test))

train score:  0.835
test score:  0.76


# Analysis with an LSTM (additional) - skip

In [20]:
from konlpy.tag import Okt
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
# Note: tensorflow.keras.wrappers.scikit_learn (KerasClassifier) raises
# ModuleNotFoundError on recent TensorFlow releases, so it is not imported here.
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
import pandas as pd
import pickle
import os.path

okt = Okt()  # create the analyzer once and reuse it

# Token parser
def twitter_tokenizer(text):
    return okt.morphs(text)

In [None]:
df_train = pd.read_csv('ratings_train.txt', delimiter='\t', keep_default_na=False)
df_test = pd.read_csv('ratings_test.txt', delimiter='\t', keep_default_na=False)

In [None]:
print(df_train.shape, df_test.shape)
df_train.columns, df_test.columns

In [None]:
df_train, df_test = df_train[:2000], df_test[:1000]  # too big

In [None]:
df_data= pd.concat([df_train, df_test])
df_data.shape, df_data.columns

In [None]:
text_data, y_data = df_data['document'].values, df_data['label'].values

In [None]:
text_data.shape, y_data.shape

In [None]:
cv = TfidfVectorizer(tokenizer=twitter_tokenizer, max_features = 1000, min_df=5)

In [None]:
# Compute the TF-IDF matrix once and cache it; delete X_data.pickle to force a rebuild
if not os.path.isfile("X_data.pickle"):
    print('file does not exist')
    X_data = cv.fit_transform(text_data)
    with open("X_data.pickle", "wb") as f:
        pickle.dump(X_data, f)

In [None]:
# Load the saved TF-IDF vector data
with open('X_data.pickle', 'rb') as f:
    X_data = pickle.load(f)
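
The two cells above follow a "compute once, then reload from disk" pattern. A minimal sketch of that pattern as a single helper is shown below. `load_or_compute` and `expensive` are hypothetical names, and a temporary directory stands in for the notebook's working directory.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the object pickled at `path`; if absent, compute it, pickle it, and return it."""
    if os.path.isfile(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = compute()
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj

calls = []

def expensive():
    # Stand-in for a slow step such as cv.fit_transform(text_data)
    calls.append(1)
    return {"shape": (3000, 1000)}

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "X_data.pickle")
    first = load_or_compute(cache, expensive)   # computes and writes the cache
    second = load_or_compute(cache, expensive)  # reads the cache

print(first == second, len(calls))  # True 1
```

Note that a cache like this is never invalidated automatically: if the subset size or the tokenizer changes, the stale file has to be deleted by hand.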

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
max_words = X_train.shape[1]
batch_size = 64
nb_epoch = 5

In [None]:
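# Reshape the data for LSTM training: (samples, time steps = 1, features)
# .toarray() replaces the sparse-matrix .A shortcut, which newer SciPy releases no longer provide
X_train_rnn = X_train.toarray().reshape((X_train.shape[0], 1, X_train.shape[1]))
X_test_rnn = X_test.toarray().reshape((X_test.shape[0], 1, X_test.shape[1]))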

print(X_train_rnn.shape)
print(X_test_rnn.shape)

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense, Dropout, LSTM

def build_LSTM_model():
    model = Sequential()
    model.add(LSTM(128, input_shape=(X_train_rnn.shape[1], X_train_rnn.shape[2]), return_sequences=True))
    model.add(Activation('relu'))
    model.add(Dropout(0.2))
    model.add(LSTM(128))
    model.add(Dropout(0.2))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [None]:
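# Train the Keras model directly; the scikit-learn wrapper (KerasClassifier)
# is not available in recent TensorFlow releases.
model_lstm = build_LSTM_model()
model_lstm.fit(X_train_rnn, y_train, epochs=nb_epoch, batch_size=batch_size)

In [None]:
# predict() returns sigmoid probabilities; threshold at 0.5 to obtain class labels
y_pred = (model_lstm.predict(X_train_rnn) > 0.5).astype(int).ravel()
metrics.accuracy_score(y_train, y_pred)

In [None]:
y_pred = (model_lstm.predict(X_test_rnn) > 0.5).astype(int).ravel()
metrics.accuracy_score(y_test, y_pred)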

# Notes

### Pickling:
- “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream, and “unpickling” is the inverse operation, whereby a byte stream (from a binary file or bytes-like object) is converted back into an object hierarchy.
- Pickling (and unpickling) is alternatively known as “serialization”, “marshalling”, or “flattening”; to avoid confusion, the terms “pickling” and “unpickling” are used here.

- Comparison with JSON: there are fundamental differences between the pickle protocols and JSON (JavaScript Object Notation):
  - JSON is a text serialization format (it outputs unicode text, although most of the time it is then encoded to utf-8), while pickle is a binary serialization format;
  - JSON is human-readable, while pickle is not;
  - JSON is interoperable and widely used outside of the Python ecosystem, while pickle is Python-specific;
  - JSON, by default, can only represent a subset of the Python built-in types, and no custom classes; pickle can represent an extremely large number of Python types (many of them automatically, by clever usage of Python’s introspection facilities; complex cases can be tackled by implementing specific object APIs);
  - Unlike pickle, deserializing untrusted JSON does not in itself create an arbitrary code execution vulnerability.
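
The differences listed above can be seen with a small round trip using only the standard library. This is a sketch: the `record` dictionary is a made-up example, chosen because it contains a tuple and a set, two built-in types that JSON does not represent.

```python
import json
import pickle

record = {"shape": (2000, 794), "labels": {0, 1}}

# pickle: binary format, restores the exact Python types
blob = pickle.dumps(record)
restored = pickle.loads(blob)
print(type(blob).__name__, restored == record)  # bytes True

# json: a set is not serializable by default
try:
    json.dumps(record)
    json_error = None
except TypeError as exc:
    json_error = str(exc)
print(json_error)

# After converting to JSON-friendly types, the result is human-readable text,
# but the tuple comes back as a list
json_safe = {"shape": list(record["shape"]), "labels": sorted(record["labels"])}
text = json.dumps(json_safe)
print(text)                       # {"shape": [2000, 794], "labels": [0, 1]}
print(json.loads(text)["shape"])  # [2000, 794]
```

Because unpickling can execute arbitrary code, `pickle.loads` should only be applied to data from a trusted source, such as a cache file written by the same notebook.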