<h1>1. Sentiment Analysis with Logistic Regression using scikit-learn</h1>

### Text preprocessing with KoNLPy + sentiment analysis of Naver movie reviews using a logistic regression model

<h2>1-1. Without preprocessing</h2>

In [1]:
import numpy as np
import pandas as pd

In [2]:
train = pd.read_csv('ratings_train.txt', delimiter='\t')
test = pd.read_csv('ratings_test.txt', delimiter='\t')

In [3]:
train.loc[0:4,:]

| | id | document | label |
|---|---|---|---|
| 0 | 9976970 | 아 더빙.. 진짜 짜증나네요 목소리 | 0 |
| 1 | 3819312 | 흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나 | 1 |
| 2 | 10265843 | 너무재밓었다그래서보는것을추천한다 | 1 |
| 3 | 9045019 | 교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정 | 0 |
| 4 | 6483659 | 사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ... | 1 |


In [20]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=2, max_df=0.8).fit(train.document.values.astype('U'))
X_train = vect.transform(train.document.values.astype('U'))
X_test = vect.transform(test.document.values.astype('U'))

In [21]:
X_train[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

In [22]:
temp = vect.transform(['영화 너무 재미없다.'])

In [23]:
print(temp)

  (0, 13007)	1
  (0, 44583)	1
  (0, 53782)	1


In [24]:
print(len(vect.vocabulary_))
print(X_train.shape)

70407
(150000, 70407)
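
The vocabulary size above depends on the `min_df=2, max_df=0.8` filter. As a rough illustration of what that filter does, here is a minimal standard-library sketch of bag-of-words counting. The toy reviews and the helper names `tokenize` and `to_counts` are made up for this example, and the token pattern only approximates `CountVectorizer`'s default (runs of two or more word characters).

```python
import re
from collections import Counter

# Toy corpus (hypothetical reviews, not taken from the dataset).
docs = [
    "영화 너무 재미없다.",
    "영화 진짜 재미있다",
    "너무 너무 좋다",
    "영화 별로",
]

# Approximation of the default token pattern: 2+ consecutive word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

# Document frequency: the number of documents each token appears in.
df = Counter(tok for doc in docs for tok in set(tokenize(doc)))

min_df = 2    # integer: keep tokens found in at least 2 documents
max_df = 0.8  # float: drop tokens found in more than 80% of documents
kept = sorted(t for t, n in df.items() if min_df <= n <= max_df * len(docs))
vocabulary = {tok: idx for idx, tok in enumerate(kept)}

def to_counts(text):
    """Return one count per vocabulary entry; unknown tokens are ignored."""
    counts = Counter(t for t in tokenize(text) if t in vocabulary)
    row = [0] * len(vocabulary)
    for tok, n in counts.items():
        row[vocabulary[tok]] = n
    return row

print(vocabulary)
print(to_counts("영화 너무 재미없다."))
```

In this toy corpus only the two tokens that occur in at least two documents survive, so every review is reduced to a two-column count vector. Rare tokens such as `재미없다` are dropped, which is one reason morphological preprocessing can help later on.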


In [25]:
y_train = train.label.values
y_test = test.label.values

In [26]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(150000, 70407)
(150000,)
(50000, 70407)
(50000,)


In [27]:
from sklearn.linear_model import LogisticRegression
logReg = LogisticRegression()
logReg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [28]:
print(logReg.score(X_train, y_train))
print(logReg.score(X_test, y_test))

0.896133333333
0.81126
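
The two numbers above are mean accuracies on the training and test sets. As a sketch of what happens per review, the model takes a weighted sum of the word counts, passes it through a sigmoid, and predicts class 1 when the probability exceeds 0.5. The weights, bias, and toy rows below are hypothetical values chosen for illustration, not coefficients from the fitted model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for a 3-word vocabulary plus a bias term.
weights = [1.5, -2.0, 0.5]
bias = -0.25

def predict(counts):
    """Return 1 (positive) if the predicted probability exceeds 0.5, else 0."""
    z = sum(w * x for w, x in zip(weights, counts)) + bias
    return 1 if sigmoid(z) > 0.5 else 0

# Toy count vectors and labels (made up for this example).
X_toy = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_toy = [1, 0, 1, 1]

preds = [predict(row) for row in X_toy]
accuracy = sum(p == y for p, y in zip(preds, y_toy)) / len(y_toy)
print(preds, accuracy)
```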


In [29]:
logReg.predict(X_test)

array([0, 0, 0, ..., 1, 0, 0])

In [34]:
print(len(vect.vocabulary_))

70407


<img src="imgs/counter_2.png" style="height:150px">

<img src="imgs/counter.png" style="height:400px">

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
#vect = TfidfVectorizer(ngram_range=(1,2)).fit(train.document.values.astype('U'))
vect = CountVectorizer(min_df=2, max_df=0.8).fit(train.document.values.astype('U'))
X_train_v2 = vect.transform(train.document.values.astype('U'))
X_test_v2 = vect.transform(test.document.values.astype('U'))

In [32]:
logReg_v2 = LogisticRegression()
logReg_v2.fit(X_train_v2, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [33]:
print(logReg_v2.score(X_train_v2, y_train))
print(logReg_v2.score(X_test_v2, y_test))

0.896133333333
0.81126


<h2>1-2. With Preprocessing using KoNLPy Twitter</h2>

In [1]:
# Note: newer KoNLPy releases expose this tagger as Okt; Twitter remains as a deprecated name.
from konlpy.tag import Twitter
twitter = Twitter()

In [5]:
print(train.loc[0, 'document'])
print(twitter.morphs(train.loc[0, 'document'], stem=True, norm=True))
print(twitter.pos(train.loc[0, 'document']))

아 더빙.. 진짜 짜증나네요 목소리
['아', '더빙', '..', '진짜', '짜증', '나네', '요', '목소리']
[('아', 'Exclamation'), ('더빙', 'Noun'), ('..', 'Punctuation'), ('진짜', 'Noun'), ('짜증', 'Noun'), ('나네', 'Verb'), ('요', 'Eomi'), ('목소리', 'Noun')]


In [None]:
def tokenize(docs):
    """Tokenize each review with Twitter; missing reviews become empty token lists."""
    segs = []
    for doc in docs:
        # pandas reads missing reviews as float NaN. Keep them as empty lists
        # (instead of skipping them) so rows stay aligned with y_train / y_test.
        if isinstance(doc, float):
            segs.append([])
        else:
            segs.append(twitter.morphs(doc, norm=True, stem=True))
    return segs

train_segs = tokenize(train.document)

In [None]:
test_segs = tokenize(test.document)

In [None]:
# Join the tokens of each review into a single space-separated string.
train_sents = [' '.join(tokens) for tokens in train_segs]

In [None]:
test_sents = [' '.join(tokens) for tokens in test_segs]

In [None]:
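# Token lists differ in length, so store them as object arrays;
# read them back with np.load(..., allow_pickle=True).
np.save('train_segs', np.array(train_segs, dtype=object))
np.save('test_segs', np.array(test_segs, dtype=object))
np.save('train_sents', np.array(train_sents))
np.save('test_sents', np.array(test_sents))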
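
Tokenizing 200,000 reviews is slow, so the saved arrays are worth reloading in later sessions. The sketch below shows one way the round trip might look on toy data. The file names and the temporary directory are assumptions for the example. Ragged token lists are wrapped in an object array before saving, and `np.load` needs `allow_pickle=True` to read such arrays back.

```python
import os
import tempfile

import numpy as np

# Toy stand-ins for train_segs / train_sents (not real tokenizer output).
segs = [['아', '더빙', '진짜'], ['재미', '없다']]
sents = [' '.join(tokens) for tokens in segs]

with tempfile.TemporaryDirectory() as tmp:
    segs_path = os.path.join(tmp, 'toy_segs.npy')    # hypothetical file name
    sents_path = os.path.join(tmp, 'toy_sents.npy')  # hypothetical file name

    # Lists of unequal length must be stored as an object array.
    np.save(segs_path, np.array(segs, dtype=object))
    np.save(sents_path, np.array(sents))

    # Object arrays are pickled on disk, so loading has to opt in.
    loaded_segs = np.load(segs_path, allow_pickle=True).tolist()
    loaded_sents = np.load(sents_path).tolist()

print(loaded_segs)
print(loaded_sents)
```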

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
vect = TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.8).fit(np.array(train_sents).astype('U'))
X_train_v3 = vect.transform(np.array(train_sents).astype('U'))
X_test_v3 = vect.transform(np.array(test_sents).astype('U'))
logReg_v3 = LogisticRegression()
logReg_v3.fit(X_train_v3, y_train)
logReg_v3.score(X_test_v3, y_test)
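
To make the `TfidfVectorizer(ngram_range=(1,2))` step more concrete, here is a small standard-library sketch of unigram and bigram features with TF-IDF weights. It assumes scikit-learn's default settings as I understand them: smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The toy token lists and the helper names `ngrams`, `idf`, and `tfidf` are made up for the example.

```python
import math
from collections import Counter

# Toy tokenized reviews (hypothetical, not real tokenizer output).
docs = [
    ['영화', '재미', '없다'],
    ['영화', '재미', '있다'],
    ['연기', '좋다'],
]

def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams from n_min to n_max words, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

features = [ngrams(d) for d in docs]
n_docs = len(features)
df = Counter(term for feats in features for term in set(feats))

def idf(term):
    # Smoothed inverse document frequency (assumed default behaviour).
    return math.log((1 + n_docs) / (1 + df[term])) + 1.0

def tfidf(feats):
    """Return an L2-normalized {term: weight} mapping for one document."""
    raw = {t: c * idf(t) for t, c in Counter(feats).items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

vec = tfidf(features[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(term, round(weight, 3))
```

Terms that occur in fewer documents, such as the bigram `재미 없다`, receive a larger weight than terms shared across reviews. That is the main difference from the raw counts used in section 1-1.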

In [None]:
# Same pipeline as the previous cell, but with raw counts instead of TF-IDF weights.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
vect = CountVectorizer(min_df=2, max_df=0.8).fit(np.array(train_sents).astype('U'))
X_train_v3 = vect.transform(np.array(train_sents).astype('U'))
X_test_v3 = vect.transform(np.array(test_sents).astype('U'))
logReg_v3 = LogisticRegression()
logReg_v3.fit(X_train_v3, y_train)
logReg_v3.score(X_test_v3, y_test)

In [None]:
from sklearn.neural_network import MLPClassifier
neural_classifier = MLPClassifier()
neural_classifier.fit(X_train_v3, y_train)
neural_classifier.score(X_test_v3, y_test)

In [None]:
from sklearn.svm import SVC
svm_classifier = SVC()
svm_classifier.fit(X_train_v3, y_train)
svm_classifier.score(X_test_v3, y_test)  # score on the test features, which match y_test

<h1>2. Sentiment Analysis with feed-forward neural network using tensorflow</h1>

### Sentiment analysis of movie reviews using a feed-forward neural network

In [None]:
# The cells below use the TensorFlow 1.x graph API (placeholders and sessions).
import tensorflow as tf

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
vect = CountVectorizer(min_df=2, max_df=0.8).fit(np.array(train_sents).astype('U'))
X_train_v4 = vect.transform(np.array(train_sents).astype('U'))
X_test_v4 = vect.transform(np.array(test_sents).astype('U'))

In [None]:
n_input = X_train_v4.shape[1]
n_output = 2
n_hidden = 128
learning_rate = 1e-2
n_epoch = 5
batch_size = 64

In [None]:
y_train = np.reshape(y_train, [y_train.shape[0], 1])
y_test = np.reshape(y_test, [y_test.shape[0], 1])

In [None]:
y_train.shape

In [None]:
from sklearn.utils import shuffle
shuffled_X_train, shuffled_y_train = shuffle(X_train_v4, y_train)

In [None]:
tf.reset_default_graph()
X = tf.placeholder(tf.float32, shape=[None, n_input])
Y = tf.placeholder(tf.int32, shape=[None, 1])

In [None]:
Y_one_hot = tf.one_hot(Y, n_output)
Y_one_hot = tf.reshape(Y_one_hot, [-1, n_output])
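
The reshape is needed because `Y` has shape `[None, 1]`, so one-hot encoding adds a third axis. The shapes can be checked with the NumPy sketch below, which uses `np.eye` indexing as a stand-in for `tf.one_hot`. The three toy labels are made up for the example.

```python
import numpy as np

n_output = 2
y = np.array([[0], [1], [1]])  # labels with shape (3, 1), like the Y placeholder

# Indexing an identity matrix with the labels mimics one-hot encoding.
one_hot = np.eye(n_output, dtype=np.float32)[y]  # shape (3, 1, 2)

# Collapse the extra axis so each row is one label vector.
flat = one_hot.reshape(-1, n_output)             # shape (3, 2)

print(one_hot.shape, flat.shape)
print(flat)
```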

In [None]:
print(shuffled_X_train.shape)
print(shuffled_y_train.shape)

In [None]:
W1 = tf.Variable(tf.random_normal([n_input, n_hidden]))
b1 = tf.Variable(tf.random_normal([n_hidden]))
W2 = tf.Variable(tf.random_normal([n_hidden, n_output]))
b2 = tf.Variable(tf.random_normal([n_output]))

In [None]:
h = tf.nn.relu(tf.matmul(X, W1) + b1)
logits = tf.matmul(h, W2) + b2
hypothesis = tf.nn.softmax(logits)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=Y_one_hot))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
prediction = tf.argmax(hypothesis, 1)
correct = tf.equal(prediction, tf.argmax(Y_one_hot, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
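
The cost and accuracy nodes above can be reproduced by hand on a tiny batch. The following NumPy sketch computes softmax cross-entropy and accuracy for three made-up logit rows. The helper names `softmax` and `cross_entropy` and the toy values are assumptions for illustration, not part of the graph.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum first for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, one_hot):
    """Mean over the batch of -sum(labels * log(softmax(logits)))."""
    probs = softmax(logits)
    return float(np.mean(-np.sum(one_hot * np.log(probs), axis=1)))

# Toy batch: three examples, two classes (values chosen for illustration).
logits = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
labels = np.array([0, 0, 0])
one_hot = np.eye(2)[labels]

loss = cross_entropy(logits, one_hot)
pred = softmax(logits).argmax(axis=1)
acc = float(np.mean(pred == labels))
print(loss, pred, acc)
```

The third example is confidently wrong, so it contributes most of the loss. The first example has equal logits, so its loss is exactly `ln 2`.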

In [None]:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
sess.run(tf.global_variables_initializer())

for epoch in range(n_epoch):
    total_batch = int(X_train_v4.shape[0] / batch_size)
    cost_avg = 0
    print('< epoch :', (epoch+1), '>')
    for i in range(total_batch):
        if i == (total_batch-1):
            batch_xs = shuffled_X_train[(i*batch_size):shuffled_X_train.shape[0]].todense()
            batch_ys = shuffled_y_train[(i*batch_size):shuffled_y_train.shape[0]]
        else:
            batch_xs = shuffled_X_train[i*batch_size:(i+1)*batch_size].todense()
            batch_ys = shuffled_y_train[i*batch_size:(i+1)*batch_size]       
        cost_val, _ = sess.run([cost, optimizer], feed_dict={X: batch_xs, Y: batch_ys})
        cost_avg += cost_val
        if i % 500 == 499:
            print('%04d' % (i+1), 'Cost: ', '{:.3f}'.format(cost_avg/500))
            cost_avg = 0
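
The batching rule in the loop above uses `int(n / batch_size)` batches, with the last one absorbing the leftover rows. It can be isolated into a small generator, sketched below. `batch_bounds` is a hypothetical helper name, and the sizes mirror the training set shape printed earlier.

```python
def batch_bounds(n_samples, batch_size):
    """Yield (start, stop) index pairs; the final batch takes the remainder."""
    total_batch = n_samples // batch_size
    for i in range(total_batch):
        start = i * batch_size
        stop = n_samples if i == total_batch - 1 else start + batch_size
        yield start, stop

bounds = list(batch_bounds(150000, 64))
print(len(bounds), bounds[0], bounds[-1])
```

With 150,000 rows and a batch size of 64, the final batch is larger than the others because it includes the remainder. Note that this scheme yields no batches at all when `n_samples` is smaller than `batch_size`.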

In [None]:
# Evaluate in batches. The last batch also takes the remainder, so count
# correct predictions and divide by the number of samples instead of
# averaging per-batch accuracies.
test_batch = int(X_test_v4.shape[0] / batch_size)
n_correct = 0
for i in range(test_batch):
    start = i * batch_size
    stop = X_test_v4.shape[0] if i == (test_batch-1) else (i+1) * batch_size
    batch_xs = X_test_v4[start:stop].todense()
    batch_ys = y_test[start:stop]
    acc = sess.run(accuracy, feed_dict={X: batch_xs, Y: batch_ys})
    n_correct += acc * batch_xs.shape[0]
print('Accuracy: ', '{:.3f}'.format(n_correct / X_test_v4.shape[0]))