# Sentiment Analysis

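## Introduction to Sentiment Analysis
- Sentiment analysis determines whether a text expresses positive or negative sentiment
- Supervised learning: train on texts with known sentiment labels, then predict the sentiment of other texts
- Unsupervised learning: use a lexicon (a sentiment vocabulary dictionary) to judge whether a document is positive or negative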
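
To make the lexicon-based (unsupervised) idea concrete, here is a minimal sketch. The tiny `TOY_LEXICON` and both helper functions are hypothetical and exist only for illustration; real lexicons such as SentiWordNet or VADER are far larger and use more elaborate rules.

```python
# Hypothetical mini-lexicon mapping words to polarity scores (illustration only).
TOY_LEXICON = {
    "good": 1.0, "great": 1.5, "nice": 0.5,
    "bad": -1.0, "boring": -1.0, "awful": -1.5,
}

def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Sum the polarity of every known word; unknown words contribute 0."""
    return sum(lexicon.get(word, 0.0) for word in text.lower().split())

def lexicon_label(text, threshold=0.0):
    """Return 1 (positive) if the summed polarity reaches the threshold, else 0."""
    return 1 if lexicon_polarity(text) >= threshold else 0

print(lexicon_label("a great film with a nice cast"))  # 1
print(lexicon_label("boring plot and awful acting"))   # 0
```

No labels are needed to produce a prediction, which is why this approach counts as unsupervised.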

## Supervised Learning - IMDB Movie Reviews
- [TSV file](https://www.kaggle.com/c/word2vec-nlp-tutorial/data): loaded with pandas `read_csv` by passing `sep='\t'`
    - sentiment: 1 = positive review, 0 = negative review
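
The loading cell for this file also passes `quoting=3`. The sketch below uses only the standard-library `csv` module to show what that value means: `3` is `csv.QUOTE_NONE`, which keeps double quotes as ordinary characters. The one-row `sample` string is made up for illustration and is not taken from the dataset.

```python
import csv
import io

# quoting=3 corresponds to the csv.QUOTE_NONE constant.
print(csv.QUOTE_NONE)  # 3

# Made-up one-row TSV whose fields are wrapped in double quotes.
sample = 'id\tsentiment\treview\n"1_8"\t1\t"He said "wow" twice"\n'

# Default quoting: the surrounding quotes are treated as field delimiters.
default_rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))
# QUOTE_NONE: quotes are kept as part of the field text.
raw_rows = list(csv.reader(io.StringIO(sample), delimiter='\t', quoting=csv.QUOTE_NONE))

print(default_rows[1][0])  # 1_8
print(raw_rows[1][0])      # "1_8"
print(raw_rows[1][2])      # "He said "wow" twice"
```

This is why the `id` and `review` values in the loaded DataFrame still carry literal double quotes.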

In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

# header=0 : use the first row as the column names
# quoting=3 : csv.QUOTE_NONE, i.e. treat double quotes as ordinary characters
review_df = pd.read_csv('../data/word2vec-nlp-tutorial/labeledTrainData.tsv', header=0, sep='\t', quoting=3)
review_df.head(3)

|   | id | sentiment | review |
|---|----|-----------|--------|
| 0 | `"5814_8"` | 1 | `"With all this stuff going down at the moment ...` |
| 1 | `"2381_9"` | 1 | `"\"The Classic War of the Worlds\" by Timothy ...` |
| 2 | `"7759_3"` | 0 | `"The film starts with a manager (Nicholas Bell...` |


In [2]:
print(review_df['review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

### Preprocessing
- Remove the HTML tag `<br />`
- Remove digits and special characters with a regular expression

In [3]:
import re

# Replace <br /> with a space
review_df['review'] = review_df['review'].str.replace('<br />', ' ')

# Remove digits and special characters
# re.sub(pattern, replacement, string)
# [^a-zA-Z] : any character that is not an upper- or lower-case English letter
review_df['review'] = review_df['review'].apply(lambda x: re.sub("[^a-zA-Z]", " ", x))

In [7]:
print(review_df['review'][0])

 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay   Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him   The actual feature film bit when it finally starts is only on for  
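
To see the two preprocessing steps in isolation, the sketch below applies them to one short string. The `clean_review` helper and the `sample` text are hypothetical names introduced for illustration.

```python
import re

# Precompiled pattern: any character that is not an English letter.
NON_LETTERS = re.compile(r"[^a-zA-Z]")

def clean_review(text):
    """Replace <br /> tags, then every non-letter character, with a space."""
    text = text.replace("<br />", " ")
    return NON_LETTERS.sub(" ", text)

sample = "It's 10/10!<br /><br />Loved it."
cleaned = clean_review(sample)
print(repr(cleaned))
print(cleaned.split())  # ['It', 's', 'Loved', 'it']
```

Each removed character becomes a space rather than being dropped, so apostrophes split contractions ("It's" becomes "It s"), as with "i ve" in the review output.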

### Separating the Label Column and Splitting into Train/Test Sets

In [4]:
from sklearn.model_selection import train_test_split

class_df = review_df['sentiment']
feature_df = review_df.drop(columns=['id', 'sentiment'], inplace=False)

X_train, X_test, y_train, y_test = train_test_split(feature_df, class_df, 
                                                   test_size=0.3, random_state=156)

X_train.shape, X_test.shape

((17500, 1), (7500, 1))

### Training and Evaluating the ML Algorithm
- Uses Pipeline, CountVectorizer, TF-IDF, and Logistic Regression

In [5]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# CountVectorizer with stop_words='english', ngram_range=(1, 2)
pipeline = Pipeline([
    ('cnt_vect', CountVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('lr_clf', LogisticRegression(C=10))
])

# Train and evaluate
pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:, 1]

acc = accuracy_score(y_test, pred)
roc_auc = roc_auc_score(y_test, pred_probs)

print(f'Accuracy : {acc:.4f}, ROC-AUC : {roc_auc:.4f}')

Accuracy : 0.8860, ROC-AUC : 0.9503


In [6]:
# TfidfVectorizer with stop_words='english', ngram_range=(1, 2)
pipeline = Pipeline([
    ('tfidf_vect', TfidfVectorizer(stop_words='english', ngram_range=(1, 2))),
    ('lr_clf', LogisticRegression(C=10))
])

# Train and evaluate
pipeline.fit(X_train['review'], y_train)
pred = pipeline.predict(X_test['review'])
pred_probs = pipeline.predict_proba(X_test['review'])[:, 1]

acc = accuracy_score(y_test, pred)
roc_auc = roc_auc_score(y_test, pred_probs)

print(f'Accuracy : {acc:.4f}, ROC-AUC : {roc_auc:.4f}')

Accuracy : 0.8936, ROC-AUC : 0.9598
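
The TF-IDF weighting can be worked through by hand on a toy corpus. This pure-Python sketch assumes the formula used by scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization). The three `docs` and the `tfidf` helper are made up for illustration.

```python
import math
from collections import Counter

# Toy tokenized corpus (hypothetical).
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "plot"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return an L2-normalized {term: weight} map for one tokenized document."""
    tf = Counter(doc)
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for doc in docs:
    print({t: round(w, 3) for t, w in tfidf(doc).items()})
```

In the second document, "bad" appears in fewer documents than "movie" and so receives the larger weight, which is the effect that distinguishes TF-IDF from plain counts.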


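## Unsupervised Learning
- Uses a lexicon (no Korean support)
- Polarity score: a sentiment score determined from a word's position, context, surrounding words, POS (part of speech), and so on
- NLTK includes lexicons
    - However, their predictive performance is not very good

WordNet
- A lexical database that provides semantic analysis
- Semantics: meaning in context (for example, whether "present" means a gift or the current time has to be determined from context)
- Represents each word, organized by part of speech (noun, verb, adjective, adverb, and so on), using the concept of a Synset (set of cognitive synonyms)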

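### Types of Sentiment Lexicons
- SentiWordNet: applies WordNet's Synset concept to sentiment analysis
    - Each synset has a positive, a negative, and an objectivity score; the positive and negative scores are combined into the final polarity, and the objectivity score indicates how objective (sentiment-free) the word is
- VADER: sentiment analysis aimed at social media text
    - Good performance and relatively fast execution
- Pattern: the best predictive performance of the three in the experiments below
    - Formerly compatible only with Python 2.x, but can now be used on Python 3.x after `pip install pattern`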

## Sentiment Analysis with SentiWordNet
### Understanding the WordNet Synset and SentiWordNet SentiSynset Classes

In [9]:
import warnings
warnings.filterwarnings('ignore')

In [10]:
import nltk
nltk.download('all')

True

In [11]:
from nltk.corpus import wordnet as wn

term = 'present'

# Look up the WordNet synsets for the word 'present'
synsets = wn.synsets(term)
print('synsets() return type :', type(synsets))
print('number of synsets returned :', len(synsets))
print('synsets() return value :', synsets)

synsets() return type : <class 'list'>
number of synsets returned : 18
synsets() return value : [Synset('present.n.01'), Synset('present.n.02'), Synset('present.n.03'), Synset('show.v.01'), Synset('present.v.02'), Synset('stage.v.01'), Synset('present.v.04'), Synset('present.v.05'), Synset('award.v.01'), Synset('give.v.08'), Synset('deliver.v.01'), Synset('introduce.v.01'), Synset('portray.v.04'), Synset('confront.v.03'), Synset('present.v.12'), Synset('salute.v.06'), Synset('present.a.01'), Synset('present.a.02')]


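### Synset
- Name format: Synset('word.pos.index'), for example `present.n.01`
- POS: part of speech; the code below prints `lexname()`, the lexicographer file name (such as `noun.time`), which begins with the part of speech
- Definition: the dictionary definition of the synset
- Lemma: the base (dictionary) form of a word; `lemma_names()` lists the words that share the synset's meaning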
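
The synset name format can be taken apart with plain string handling. `parse_synset_name` is a hypothetical helper written for this note, not an NLTK function.

```python
def parse_synset_name(name):
    """Split 'word.pos.index' from the right so that words containing dots stay intact."""
    word, pos, index = name.rsplit(".", 2)
    return word, pos, int(index)

print(parse_synset_name("present.n.01"))  # ('present', 'n', 1)
print(parse_synset_name("show.v.01"))     # ('show', 'v', 1)
print(parse_synset_name("present.a.02"))  # ('present', 'a', 2)
```

The POS character is `n` (noun), `v` (verb), `a` (adjective), `s` (satellite adjective), or `r` (adverb).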

In [12]:
for synset in synsets:
    print('##### Synset name :', synset.name(), '#####')
    print('POS :', synset.lexname())
    print('Definition :', synset.definition())
    print('Lemmas :', synset.lemma_names())

##### Synset name : present.n.01 #####
POS : noun.time
Definition : the period of time that is happening now; any continuous stretch of time including the moment of speech
Lemmas : ['present', 'nowadays']
##### Synset name : present.n.02 #####
POS : noun.possession
Definition : something presented as a gift
Lemmas : ['present']
##### Synset name : present.n.03 #####
POS : noun.communication
Definition : a verb tense that expresses actions or states at the time of speaking
Lemmas : ['present', 'present_tense']
##### Synset name : show.v.01 #####
POS : verb.perception
Definition : give an exhibition of to an interested audience
Lemmas : ['show', 'demo', 'exhibit', 'present', 'demonstrate']
##### Synset name : present.v.02 #####
POS : verb.communication
Definition : bring forward and present to the mind
Lemmas : ['present', 'represent', 'lay_out']
##### Synset name : stage.v.01 #####
POS : verb.creation
Definition : perform (a play), especially on a stage
Lemmas : ['stage', 'present', 're

#### Similarity

In [16]:
tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

entities = [tree, lion, tiger, cat, dog]
similarities = []
entity_names = [entity.name().split('.')[0] for entity in entities]

entity_names

['tree', 'lion', 'tiger', 'cat', 'dog']

In [17]:
# Compute the path similarity between every pair of synsets
for entity in entities:
    similarity = [round(entity.path_similarity(compared_entity), 2) for compared_entity in entities]
    similarities.append(similarity)

# Store the result in a DataFrame
similarity_df = pd.DataFrame(similarities, columns=entity_names, index=entity_names)
similarity_df

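|       | tree | lion | tiger | cat  | dog  |
|-------|------|------|-------|------|------|
| tree  | 1.0  | 0.07 | 0.07  | 0.08 | 0.12 |
| lion  | 0.07 | 1.0  | 0.33  | 0.25 | 0.17 |
| tiger | 0.07 | 0.33 | 1.0   | 0.25 | 0.17 |
| cat   | 0.08 | 0.25 | 0.25  | 1.0  | 0.2  |
| dog   | 0.12 | 0.17 | 0.17  | 0.2  | 1.0  |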
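
`path_similarity` scores two synsets as `1 / (d + 1)`, where `d` is the length of the shortest path between them in the WordNet hypernym hierarchy. The sketch below only evaluates that formula for a few distances; the helper name is hypothetical and no WordNet lookup is performed.

```python
def path_similarity_from_distance(distance):
    """Path similarity for a given shortest-path length between two synsets."""
    return 1.0 / (distance + 1)

for d in (0, 2, 3, 4):
    print(d, round(path_similarity_from_distance(d), 2))
# 0 1.0
# 2 0.33
# 3 0.25
# 4 0.2
```

Read against the table, lion and tiger (0.33) are two steps apart, lion and cat (0.25) three, and cat and dog (0.2) four.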


### SentiSynset

In [18]:
import nltk
from nltk.corpus import sentiwordnet as swn

senti_synsets = list(swn.senti_synsets('slow'))
print('senti_synsets() return type :', type(senti_synsets))
print('number of senti_synsets returned :', len(senti_synsets))
print('senti_synsets() return value :', senti_synsets)

senti_synsets() return type : <class 'list'>
number of senti_synsets returned : 11
senti_synsets() return value : [SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), SentiSynset('dense.s.04'), SentiSynset('slow.a.04'), SentiSynset('boring.s.01'), SentiSynset('dull.s.08'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]


#### Sentiment Scores and Objectivity Score

In [20]:
father = swn.senti_synset('father.n.01')
print('father positive score :', father.pos_score())
print('father negative score :', father.neg_score())
print('father objectivity score :', father.obj_score())
print('\n')

fabulous = swn.senti_synset('fabulous.a.01')
print('fabulous positive score :', fabulous.pos_score())
print('fabulous negative score :', fabulous.neg_score())
print('fabulous objectivity score :', fabulous.obj_score())

father positive score : 0.0
father negative score : 0.0
father objectivity score : 1.0


fabulous positive score : 0.875
fabulous negative score : 0.125
fabulous objectivity score : 0.0
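
The three scores are tied together: the objectivity score is whatever remains after the positive and negative scores, so the three always sum to 1. The sketch below reproduces the printed values with plain arithmetic; both helper functions are hypothetical names for illustration.

```python
def objectivity(pos_score, neg_score):
    """Objectivity is the remainder after the positive and negative scores."""
    return 1.0 - (pos_score + neg_score)

def net_polarity(pos_score, neg_score):
    """Net sentiment of one synset: positive score minus negative score."""
    return pos_score - neg_score

print(objectivity(0.0, 0.0))        # 1.0  (father)
print(objectivity(0.875, 0.125))    # 0.0  (fabulous)
print(net_polarity(0.875, 0.125))   # 0.75 (fabulous)
```

The net polarity (positive minus negative) is the per-word quantity that the review-level analysis accumulates.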


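### Sentiment Analysis of Movie Reviews with SentiWordNet

Steps of the analysis
1. Split the document into sentences
2. Tokenize each sentence into words and tag their parts of speech
3. Create synset and senti_synset objects from the POS-tagged words
4. Take the positive/negative scores from each senti_synset and sum them; classify as positive if the total is at or above a threshold, negative otherwise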

#### POS Tag Conversion Function

In [21]:
from nltk.corpus import wordnet as wn

# Convert an NLTK Penn Treebank tag to the corresponding WordNet POS tag
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    # Any other tag (determiners, prepositions, ...) has no WordNet equivalent
    return None
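
A quick check of the mapping on a few Penn Treebank tags. So that it runs without NLTK, this sketch hard-codes the characters `'a'`, `'n'`, `'r'`, `'v'`, which are the values of `wn.ADJ`, `wn.NOUN`, `wn.ADV`, and `wn.VERB`. The function name `penn_to_wn_sketch` is hypothetical.

```python
# Literal values of wn.ADJ, wn.NOUN, wn.ADV and wn.VERB, hard-coded for a standalone sketch.
ADJ, NOUN, ADV, VERB = "a", "n", "r", "v"

def penn_to_wn_sketch(tag):
    """Map a Penn Treebank tag to a WordNet POS character, or None if unmapped."""
    prefix_map = {"J": ADJ, "N": NOUN, "R": ADV, "V": VERB}
    return prefix_map.get(tag[:1])

for tag in ("NN", "VBZ", "JJ", "RB", "DT"):
    print(tag, "->", penn_to_wn_sketch(tag))
# NN -> n
# VBZ -> v
# JJ -> a
# RB -> r
# DT -> None
```

Tags with no WordNet counterpart, such as the determiner tag `DT`, map to `None` and can be skipped by the caller.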

#### Predicting the Sentiment

In [29]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag

def swn_polarity(text):
    # Initialize the running sentiment score and the count of scored tokens
    sentiment = 0.0
    tokens_count = 0

    # Lemmatizer used to reduce each word to its base form
    lemmatizer = WordNetLemmatizer()

    # Split the document into sentences
    raw_sentences = sent_tokenize(text)

    # For each sentence: tokenize -> POS-tag -> look up the SentiSynset -> accumulate the score
    for raw_sentence in raw_sentences:
        # POS-tag the tokenized sentence with NLTK
        tagged_sentence = pos_tag(word_tokenize(raw_sentence))

        for word, tag in tagged_sentence:

            # Convert to a WordNet POS tag; only nouns, adjectives, and adverbs are scored
            wn_tag = penn_to_wn(tag)
            if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV):
                continue
            lemma = lemmatizer.lemmatize(word, pos=wn_tag)
            if not lemma:
                continue

            # Look up the synsets for the lemma and its WordNet POS tag
            synsets = wn.synsets(lemma, pos=wn_tag)
            if not synsets:
                continue
            # Take the first synset and fetch its SentiSynset from SentiWordNet
            # Add the positive score and subtract the negative score for every word
            synset = synsets[0]
            swn_synset = swn.senti_synset(synset.name())
            sentiment += (swn_synset.pos_score() - swn_synset.neg_score())
            tokens_count += 1

    # No scorable tokens: default to negative (0)
    if not tokens_count:
        return 0

    # Positive (1) if the total score is 0 or higher, otherwise negative (0)
    if sentiment >= 0:
        return 1

    return 0

In [31]:
%%time

review_df['preds'] = review_df['review'].apply(lambda x:swn_polarity(x))
y_target = review_df['sentiment'].values
preds = review_df['preds'].values

Wall time: 4min 56s


#### Performance Evaluation

In [33]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.metrics import recall_score, f1_score, roc_auc_score
import numpy as np

print(confusion_matrix(y_target, preds))
print('Accuracy :', np.round(accuracy_score(y_target, preds), 4))
print('Precision :', np.round(precision_score(y_target, preds), 4))
print('Recall :', np.round(recall_score(y_target, preds), 4))

[[7668 4832]
 [3636 8864]]
Accuracy : 0.6613
Precision : 0.6472
Recall : 0.7091
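
The three metrics follow directly from the confusion matrix, whose rows are the true labels and columns the predicted labels in the order `[0, 1]`. The sketch below recomputes them from the printed matrix with plain arithmetic.

```python
# Confusion matrix printed above: rows = true label, columns = predicted label.
cm = [[7668, 4832],
      [3636, 8864]]
(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all reviews classified correctly
precision = tp / (tp + fp)                   # share of predicted positives that are truly positive
recall = tp / (tp + fn)                      # share of true positives that were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))  # 0.6613 0.6472 0.7091
```

The 4,832 false positives show that the `sentiment >= 0` rule labels many negative reviews as positive, which is what holds precision down.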


#### For Reference

In [23]:
pos_tag(['cat', 'is', 'cute'])

[('cat', 'NN'), ('is', 'VBZ'), ('cute', 'JJ')]

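## Sentiment Analysis with VADER
- Available through NLTK
    - Also usable as a standalone package after `pip install vaderSentiment`, imported with
    ```python
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    ```
- neg: negative sentiment score
- neu: neutral sentiment score
- pos: positive sentiment score
- compound: a single overall score normalized to the range -1 to 1; predictive performance is tuned by choosing the classification threshold, for example treating 0.1 or higher as positive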
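
The compound score stays within -1 to 1 because the summed word valences are squashed by a normalization step. The sketch below assumes the form `x / sqrt(x^2 + alpha)` with `alpha = 15`, which is my understanding of the VADER reference implementation; treat both the formula and the constant as assumptions to verify against the library source. The raw sums fed in are made-up values.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1).

    The form and alpha=15 are assumed from the VADER reference implementation.
    """
    return score / math.sqrt(score * score + alpha)

# Made-up raw valence sums, for illustration only.
for raw in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(raw, round(normalize(raw), 4))
```

Under this form the score is symmetric around 0 and approaches but never reaches -1 or 1, so a fixed cutoff such as 0.1 behaves the same for short and long reviews.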

In [35]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df['review'][0])
print(senti_scores)

{'neg': 0.13, 'neu': 0.743, 'pos': 0.127, 'compound': -0.7943}


In [37]:
def vader_polarity(review, threshold=0.1):
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    
    agg_score = scores['compound']
    final_sentiment = 1 if agg_score >= threshold else 0
    return final_sentiment

In [38]:
%%time

review_df['vader_preds'] = review_df['review'].apply(lambda x:vader_polarity(x, 0.1))
y_target = review_df['sentiment'].values
vader_preds = review_df['vader_preds'].values

print(confusion_matrix(y_target, vader_preds))
print('Accuracy :', np.round(accuracy_score(y_target, vader_preds), 4))
print('Precision :', np.round(precision_score(y_target, vader_preds), 4))
print('Recall :', np.round(recall_score(y_target, vader_preds), 4))

[[ 6747  5753]
 [ 1858 10642]]
Accuracy : 0.6956
Precision : 0.6491
Recall : 0.8514
Wall time: 3min 47s


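## Sentiment Analysis with Pattern
- `sentiment()` returns a (polarity, subjectivity) tuple: polarity ranges from -1.0 (negative) to 1.0 (positive), and subjectivity from 0.0 (objective) to 1.0 (subjective)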

In [41]:
from pattern.en import sentiment

print(sentiment(review_df['review'][0]))

(-0.004956095913542712, 0.6009287402904423)


In [42]:
def pattern_polarity(review, threshold=0.1):
    agg_score = sentiment(review)[0]
    final_sentiment = 1 if agg_score >= threshold else 0
    return final_sentiment

In [43]:
%%time

review_df['pattern_preds'] = review_df['review'].apply(lambda x:pattern_polarity(x, 0.1))
y_target = review_df['sentiment'].values
pattern_preds = review_df['pattern_preds'].values

print(confusion_matrix(y_target, pattern_preds))
print('Accuracy :', np.round(accuracy_score(y_target, pattern_preds), 4))
print('Precision :', np.round(precision_score(y_target, pattern_preds), 4))
print('Recall :', np.round(recall_score(y_target, pattern_preds), 4))

[[9547 2953]
 [2823 9677]]
Accuracy : 0.769
Precision : 0.7662
Recall : 0.7742
Wall time: 58.8 s


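## Results by Package
|              | Accuracy | Precision | Recall | Wall time |
|--------------|----------|-----------|--------|-----------|
| **SentiWordNet** | 0.6613 | 0.6472 | 0.7091 | 4min 56s |
| **VADER**        | 0.6956 | 0.6491 | 0.8514 | 3min 47s |
| **pattern**      | 0.7690 | 0.7662 | 0.7742 | 58.8 s |

Pattern has the highest accuracy and precision and is also the fastest, so it performs best overall; VADER has the highest recall.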
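
Because the packages trade precision against recall differently, a single combined number can help when comparing them. The sketch below computes the F1 score (the harmonic mean of precision and recall) from the table values. The notebook imports `f1_score` but never prints it, so these figures are derived here rather than taken from the original run.

```python
# (precision, recall) pairs copied from the results table.
results = {
    "SentiWordNet": (0.6472, 0.7091),
    "VADER": (0.6491, 0.8514),
    "pattern": (0.7662, 0.7742),
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for name, (p, r) in results.items():
    print(f"{name}: F1 = {f1(p, r):.4f}")

best = max(results, key=lambda name: f1(*results[name]))
print("Highest F1:", best)  # pattern
```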