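# Sentiment analysis
1. Cleaning and preparing text data
2. Building feature vectors
3. Building a model that classifies reviews as positive or negative
4. Handling large datasets with out-of-core learning
5. Inferring document topics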

In [22]:
from tqdm import tqdm
import pandas as pd
import os

basepath = 'E://downloads/chrome/aclImdb_v1.tar/aclImdb_v1/aclImdb'

labels = {'pos':1, 'neg':0}
rows = []
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file), 'r', encoding = 'utf-8') as infile:
                txt = infile.read()
            rows.append([txt, labels[l]])
# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
df = pd.DataFrame(rows, columns = ['review', 'sentiment'])

In [24]:
import numpy as np

np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index = False, encoding = 'utf-8')

In [25]:
df = pd.read_csv("movie_data.csv", encoding = 'utf-8')
df.head(3)

|   | review | sentiment |
|---|--------|-----------|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | ***SPOILER*** Do not read this, if you think a... | 0 |


In [26]:
df.shape

(50000, 2)

# Transforming words into feature vectors

In [30]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# The default is unigrams; pass ngram_range = (2, 2) to use 2-grams instead
count = CountVectorizer()
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [31]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [32]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]


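- Each index position in the feature vectors above corresponds to the integer value stored for a word in the vocabulary dictionary above.
- These counts are called raw term frequencies; tf(t, d) denotes the number of times the term t occurs in document d.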
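The mapping from vocabulary indices to counts can be reproduced by hand. The following is a minimal sketch using only the standard library; the `tokenize` helper and its regular expression are assumptions that mimic `CountVectorizer`'s default behavior (lowercasing, tokens of two or more word characters), not the library's actual implementation.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def tokenize(doc):
    # Hypothetical helper: lowercase, then keep runs of two or more word characters.
    return re.findall(r'\b\w\w+\b', doc.lower())

# Assign each distinct token a column index in alphabetical order.
all_terms = sorted({term for doc in docs for term in tokenize(doc)})
vocabulary = {term: idx for idx, term in enumerate(all_terms)}

def term_frequencies(doc):
    # tf(t, d) for every term t in the vocabulary, in column order.
    counts = Counter(tokenize(doc))
    return [counts[term] for term in all_terms]

bag = [term_frequencies(doc) for doc in docs]
print(vocabulary)
print(bag)
```

Each row of `bag` is one document, and column `vocabulary[t]` holds tf(t, d) for that document.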


- The item sequence in the bag-of-words (BoW) model above is called a 1-gram or unigram model.
- Each token represents a single word.
- The best value of n for an n-gram model depends on the application.
- 1-gram: 'the', 'sun', 'is', 'shining'
- 2-gram: 'the sun', 'sun is', 'is shining'
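The 1-gram and 2-gram lists above can be generated with a sliding window over the tokens. This is a small illustrative sketch; `ngrams` is a hypothetical helper, not a library function.

```python
def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, each joined by a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'the sun is shining'.split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)
print(bigrams)
```

A sequence of k tokens yields k - n + 1 n-grams, so larger n produces fewer but more specific features.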

### tf-idf
- Words that occur frequently in the documents of every class carry little information for telling the classes apart.
- The tf-idf (term frequency-inverse document frequency) technique lowers the weight of such words.


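- Definition: tf-idf is the product of the term frequency (tf) and the inverse document frequency (idf).
- $\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t, d)$
- $\text{idf}(t, d) = \log \frac{n_{d}}{1 + \text{df}(d, t)}$
- $n_{d}$ = total number of documents
- $\text{df}(d, t)$ = number of documents $d$ that contain the term $t$
- Adding the constant 1 to the denominator keeps it from becoming zero for a term that appears in none of the training samples.
- The log keeps the inverse document frequency from growing too large when $\text{df}(d, t)$ is low.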
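As a worked example, the sketch below computes tf-idf by hand for the third example sentence, using only the standard library. It assumes the smoothed idf that scikit-learn documents for `TfidfTransformer(smooth_idf=True)`, namely $\ln\frac{1 + n_d}{1 + \text{df}(d, t)} + 1$, followed by L2 normalization; this differs from the textbook formula above, which is included as `idf_textbook` for comparison. The dictionary names are illustrative.

```python
import math

n_docs = 3  # the three example sentences

# Raw term counts in the third sentence, and the number of sentences containing each term.
tf = {'and': 2, 'is': 3, 'one': 2, 'shining': 1, 'sun': 1,
      'sweet': 1, 'the': 2, 'two': 1, 'weather': 1}
df_counts = {'and': 1, 'is': 3, 'one': 1, 'shining': 2, 'sun': 2,
             'sweet': 2, 'the': 3, 'two': 1, 'weather': 2}

def idf_textbook(term):
    # idf as defined in the list above: log(n_d / (1 + df))
    return math.log(n_docs / (1 + df_counts[term]))

def idf_smooth(term):
    # Smoothed variant (assumption: matches scikit-learn's smooth_idf=True)
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

# Unnormalized tf-idf, then L2 normalization of the document vector.
raw = {term: count * idf_smooth(term) for term, count in tf.items()}
l2_norm = math.sqrt(sum(value ** 2 for value in raw.values()))
tfidf = {term: value / l2_norm for term, value in raw.items()}

for term, value in tfidf.items():
    print(f'{term:8s} {value:.2f}')
```

Although 'is' has the highest raw count in this sentence, it appears in every document, so its normalized weight ends up below that of 'and' and 'one'. Under the textbook formula, a term that occurs in every document even receives a slightly negative idf, which is one motivation for the smoothed variant.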

In [33]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf = True,
                         norm = 'l2',
                         smooth_idf = True)
np.set_printoptions(precision = 2)
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]


# Cleaning text data

In [34]:
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [35]:
import re

def preprocessor(text):
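    # Strip HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # Lowercase, replace non-word characters with spaces, then append the emoticons without their noses
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))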
    return text

In [36]:
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

In [37]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

In [38]:
df['review'] = df['review'].apply(preprocessor)

# Splitting documents into tokens

In [39]:
def tokenizer(text):
    return text.split()

tokenizer('runners like running')

['runners', 'like', 'running']

### Stemming algorithm

In [43]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

### Stop-word removal

In [44]:
import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\xnoti\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


['runner', 'like', 'run', 'run', 'lot']

# Logistic regression for document classification

In [45]:
# .loc slices include the end label, so slice by position to keep row 25000 out of the training set
X_train = df['review'].values[:25000]
y_train = df['sentiment'].values[:25000]
X_test = df['review'].values[25000:]
y_test = df['sentiment'].values[25000:]

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents = None,
                        lowercase = False,
                        preprocessor = None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(solver ='liblinear', random_state = 0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring = 'accuracy',
                           cv = 5, verbose = 1, n_jobs = -1)
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


# Processing large datasets
- Out-of-core learning: split the dataset into small batches and fit the classifier incrementally, one batch at a time.
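The batching idea can be sketched independently of any classifier. The example below uses only the standard library; `minibatches` and the toy stream are hypothetical stand-ins, and the model update that would consume each batch is left out.

```python
from itertools import islice

def minibatches(stream, size):
    """Yield successive lists of at most `size` items from an iterator."""
    stream = iter(stream)
    while True:
        batch = list(islice(stream, size))
        if not batch:
            return  # the stream is exhausted
        yield batch

# Toy stand-in for a stream of (text, label) pairs read lazily from disk.
toy_stream = ((f'review {i}', i % 2) for i in range(10))

batch_sizes = []
for batch in minibatches(toy_stream, size=4):
    # An incremental learner would be updated with this batch here.
    batch_sizes.append(len(batch))

print(batch_sizes)
```

Only one batch is held in memory at a time, which is what makes the approach usable when the full dataset does not fit in RAM.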

In [1]:
import numpy as np
import re
from nltk.corpus import stopwords
stop = stopwords.words('english')

def tokenizer(text):
    # Strip HTML markup, collect emoticons, then lowercase and drop non-word characters
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text.lower())
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    tokenized = [w for w in text.split() if w not in stop]
    
    return tokenized

def stream_docs(path):
    with open(path, 'r', encoding = 'utf-8') as csv:
        next(csv)
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text, label

In [2]:
next(stream_docs(path = 'movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

In [5]:
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        pass
    return docs, y

In [6]:
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error = 'ignore', n_features = 2**21,
                         preprocessor = None, tokenizer = tokenizer)
# loss = 'log' was renamed to 'log_loss' in scikit-learn 1.1 and removed in 1.3
clf = SGDClassifier(loss = 'log_loss', random_state = 1, max_iter = 1)
doc_stream = stream_docs(path = 'movie_data.csv')

In [7]:
classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes = classes)

In [8]:
X_test, y_test = get_minibatch(doc_stream, size = 5000)
X_test = vect.transform(X_test)
print(clf.score(X_test, y_test))

0.8682


# LDA

In [9]:
import pandas as pd
df = pd.read_csv('movie_data.csv', encoding = 'utf-8')

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words = 'english', max_df = .1, max_features = 5000)
X = count.fit_transform(df['review'].values)

In [12]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components = 10, random_state = 123, learning_method = 'batch')
X_topics = lda.fit_transform(X)

In [13]:
lda.components_.shape

(10, 5000)

In [14]:
n_top_words = 5

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic {topic_idx + 1}")
    # argsort is ascending, so walk backwards to get the n_top_words largest weights
    print(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))

Topic 1
worst minutes awful script stupid
Topic 2
family mother father children girl
Topic 3
american war dvd music tv
Topic 4
human audience cinema art sense
Topic 5
police guy car dead murder
Topic 6
horror house sex girl woman
Topic 7
role performance comedy actor performances
Topic 8
series episode war episodes tv
Topic 9
book version original read novel
Topic 10
action fight guy guys cool


In [17]:
# Column 5 is topic 6 (horror); sort reviews by that topic's weight, highest first
horror = X_topics[:, 5].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror[:3]):
    print('\nHorror movie #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx][:300], '...')


Horror movie #1:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...

Horror movie #2:
Okay, what the hell kind of TRASH have I been watching now? "The Witches' Mountain" has got to be one of the most incoherent and insane Spanish exploitation flicks ever and yet, at the same time, it's also strangely compelling. There's absolutely nothing that makes sense here and I even doubt there  ...

Horror movie #3:
<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a total freakfest from start to finish. A fun freakfest at that, but at times it was a tad too reliant on kitsch rather than the horror. The story is difficult to summarize succinctly: a carefree, normal teenage girl starts coming fac ...


In [20]:
import pickle
import os

dest = os.path.join('movieclassifier', 'pkl_objects')
if not os.path.exists(dest):
    os.makedirs(dest)

In [22]:
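# Use context managers so both files are flushed and closed after writing
with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
    pickle.dump(stop, f, protocol = 4)
with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol = 4)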