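### Unsupervised approach
* Uses a lexicon of sentiment words instead of labels (the lexicons covered here do not support Korean); supervised learning, by contrast, requires labels
* A sentiment lexicon assigns each word a number expressing how positive or negative it is -> the polarity score
* The polarity score is determined by the word's position, surrounding words, context, POS (part of speech), and so on
* The NLTK package implements sentiment lexicons; it ships with lexicon modules
#### WordNet in the NLTK package
* An English lexical database that provides semantic information (meaning in context) for words that are spelled the same but used differently
* Represents individual words, organized by part of speech (noun, verb, adjective, adverb, etc.), using the concept of a Synset (set of cognitive synonyms)
#### NLTK drawback: predictive performance is poor, so it is common to apply a different sentiment lexicon
#### Sentiment lexicons
* SentiWordNet: applies WordNet's Synset concept to sentiment analysis (positive, negative, and objectivity scores)
* VADER: a package for sentiment analysis of social media text. It performs well and runs fast, so it suits large text datasets
* Pattern: often regarded as the strongest in predictive performance. Noted here as compatible with Python 2.x but not 3.x (check the current release, since later versions added Python 3 support)
    - https://www.clips.uantwerpen.be/pattern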
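
To make the idea of a polarity score concrete, here is a minimal sketch of lexicon-based scoring. The `TOY_LEXICON` dictionary and its values are invented for illustration; they are not taken from SentiWordNet, VADER, or Pattern, and the helper names are hypothetical.

```python
# Toy lexicon: word -> polarity score in [-1, 1]. Values are made up for illustration.
TOY_LEXICON = {"good": 0.7, "great": 0.9, "bad": -0.7, "boring": -0.5}

def lexicon_polarity(text):
    """Sum the polarity scores of all words found in the toy lexicon."""
    words = text.lower().split()
    return sum(TOY_LEXICON.get(word, 0.0) for word in words)

def classify(text, threshold=0.0):
    """Return 1 (positive) if the summed score reaches the threshold, else 0."""
    return 1 if lexicon_polarity(text) >= threshold else 0

print(classify("a great movie with a good cast"))   # contains only positive lexicon words
print(classify("bad plot and boring dialogue"))     # contains only negative lexicon words
```

No labels are needed at any point: the decision comes entirely from the lexicon and the threshold, which is what makes the approach unsupervised.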

In [None]:
### Sentiment analysis with SentiWordNet
* Understanding the WordNet Synset class and the SentiWordNet SentiSynset class

In [2]:
import pandas as pd
import nltk
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/biocreative_ppi.zip.
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/brown.zip.
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/cess_cat.zip.
[nltk_data]    | Downloading package cess_esp to
[nltk_data]    |     /home/jovyan/n

[nltk_data]    |   Unzipping corpora/shakespeare.zip.
[nltk_data]    | Downloading package sinica_treebank to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/sinica_treebank.zip.
[nltk_data]    | Downloading package smultron to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/smultron.zip.
[nltk_data]    | Downloading package state_union to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/state_union.zip.
[nltk_data]    | Downloading package stopwords to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/stopwords.zip.
[nltk_data]    | Downloading package subjectivity to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/subjectivity.zip.
[nltk_data]    | Downloading package swadesh to
[nltk_data]    |     /home/jovyan/nltk_data...
[nltk_data]    |   Unzipping corpora/swadesh.zip.
[nltk_data]    | Downloading package

True

In [3]:
from nltk.corpus import wordnet as wn

term = 'present'

# Look up the WordNet synsets for the word 'present'.
synsets = wn.synsets(term)
print('synsets() return type :', type(synsets))
print('synsets() number of returned values:', len(synsets))
print('synsets() returned values :', synsets)

synsets() return type : <class 'list'>
synsets() number of returned values: 18
synsets() returned values : [Synset('present.n.01'), Synset('present.n.02'), Synset('present.n.03'), Synset('show.v.01'), Synset('present.v.02'), Synset('stage.v.01'), Synset('present.v.04'), Synset('present.v.05'), Synset('award.v.01'), Synset('give.v.08'), Synset('deliver.v.01'), Synset('introduce.v.01'), Synset('portray.v.04'), Synset('confront.v.03'), Synset('present.v.12'), Synset('salute.v.06'), Synset('present.a.01'), Synset('present.a.02')]


### Synset attributes: the semantic elements
* n: noun; n.01: index 01 among the senses the word has as a noun
* POS (part of speech), definition, and lemmas (the words that share this sense)
* v: verb
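
The naming scheme can be unpacked with plain string handling. The sketch below assumes names of the form `lemma.pos.index`; `parse_synset_name` and `POS_LABELS` are hypothetical helpers written for this note, not part of NLTK.

```python
# Single-letter POS codes that appear in synset names.
POS_LABELS = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective satellite",
    "r": "adverb",
}

def parse_synset_name(name):
    """Split a name such as 'present.n.01' into (lemma, pos, sense_index)."""
    # rsplit keeps any dots inside the lemma itself intact.
    lemma, pos, index = name.rsplit(".", 2)
    return lemma, pos, int(index)

for name in ["present.n.01", "show.v.01", "present.a.02"]:
    lemma, pos, index = parse_synset_name(name)
    print(lemma, POS_LABELS[pos], index)
```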

In [4]:
for synset in synsets :
    print('##### Synset name : ', synset.name(),'#####')
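    # lexname() returns the lexicographer file name (e.g. noun.time); synset.pos() gives the bare POS letter
    print('POS :',synset.lexname())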
    print('Definition:',synset.definition())
    print('Lemmas:',synset.lemma_names())


##### Synset name :  present.n.01 #####
POS : noun.time
Definition: the period of time that is happening now; any continuous stretch of time including the moment of speech
Lemmas: ['present', 'nowadays']
##### Synset name :  present.n.02 #####
POS : noun.possession
Definition: something presented as a gift
Lemmas: ['present']
##### Synset name :  present.n.03 #####
POS : noun.communication
Definition: a verb tense that expresses actions or states at the time of speaking
Lemmas: ['present', 'present_tense']
##### Synset name :  show.v.01 #####
POS : verb.perception
Definition: give an exhibition of to an interested audience
Lemmas: ['show', 'demo', 'exhibit', 'present', 'demonstrate']
##### Synset name :  present.v.02 #####
POS : verb.communication
Definition: bring forward and present to the mind
Lemmas: ['present', 'represent', 'lay_out']
##### Synset name :  stage.v.01 #####
POS : verb.creation
Definition: perform (a play), especially on a stage
Lemmas: ['stage', 'present', 'represen

In [None]:
### wn.synset('tree.n.01').path_similarity()
* Method that measures the similarity between two synsets (word senses)
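
Path similarity is based on the shortest path between two senses in the hypernym hierarchy: the score is 1 / (path length + 1), so identical senses score 1.0 and distant ones approach 0. The sketch below illustrates this on a small hand-written hierarchy. The `PARENT` table is a simplified assumption for illustration, not the real WordNet graph, and all function names are hypothetical.

```python
from collections import deque

# Simplified, hand-written slice of a hypernym hierarchy (child -> parent).
# This is an assumption for illustration, not the full WordNet graph.
PARENT = {
    "dog": "canine", "canine": "carnivore",
    "cat": "feline", "big_cat": "feline", "feline": "carnivore",
    "lion": "big_cat", "tiger": "big_cat",
}

def neighbors(node):
    """Nodes one edge away: all children plus the parent, if any."""
    linked = [child for child, parent in PARENT.items() if parent == node]
    if node in PARENT:
        linked.append(PARENT[node])
    return linked

def shortest_path_length(start, goal):
    """Breadth-first search over the hierarchy treated as an undirected graph."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connection between the two nodes

def toy_path_similarity(a, b):
    """1 / (shortest path length + 1), rounded to two decimals."""
    dist = shortest_path_length(a, b)
    return None if dist is None else round(1 / (dist + 1), 2)

print(toy_path_similarity("lion", "tiger"))  # siblings under big_cat
print(toy_path_similarity("cat", "dog"))     # meet only at carnivore
```

Siblings such as lion and tiger are two edges apart, while cat and dog are four edges apart, which is why the first pair scores higher.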

In [5]:
# Create one synset object per word.
tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')
cat = wn.synset('cat.n.01')
dog = wn.synset('dog.n.01')

entities = [tree , lion , tiger , cat , dog]
similarities = []
entity_names = [ entity.name().split('.')[0] for entity in entities]

# Iterate over the synsets and measure each one's similarity to every other synset.
for entity in entities:
    similarity = [ round(entity.path_similarity(compared_entity), 2)  for compared_entity in entities ]
    similarities.append(similarity)
    
# Store the pairwise similarities between the synsets as a DataFrame.
similarity_df = pd.DataFrame(similarities , columns=entity_names,index=entity_names)
similarity_df

       tree  lion  tiger   cat   dog
tree   1.00  0.07   0.07  0.08  0.12
lion   0.07  1.00   0.33  0.25  0.17
tiger  0.07  0.33   1.00  0.25  0.17
cat    0.08  0.25   0.25  1.00  0.20
dog    0.12  0.17   0.17  0.20  1.00


In [14]:
similarity_df.style.background_gradient(cmap='YlGn')
# Styler.background_gradient(cmap='PuBu', low=0, high=0, axis=0, subset=None)
# cmap: str or colormap : matplotlib colormap
# low, high: float : compress the range by these values.
# axis: int or str : 1 or 'columns' for column-wise, 0 or 'index' for row-wise
# subset: IndexSlice : a valid slice for data to limit the style application to

       tree  lion  tiger   cat   dog
tree   1.00  0.07   0.07  0.08  0.12
lion   0.07  1.00   0.33  0.25  0.17
tiger  0.07  0.33   1.00  0.25  0.17
cat    0.08  0.25   0.25  1.00  0.20
dog    0.12  0.17   0.17  0.20  1.00


In [None]:
#### SentiWordNet

In [17]:
import nltk
from nltk.corpus import sentiwordnet as swn

senti_synsets = list(swn.senti_synsets('slow'))
print('senti_synsets() return type :', type(senti_synsets))
print('senti_synsets() number of returned values:', len(senti_synsets))
print('senti_synsets() returned values :', senti_synsets)


senti_synsets() return type : <class 'list'>
senti_synsets() number of returned values: 11
senti_synsets() returned values : [SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), SentiSynset('dense.s.04'), SentiSynset('slow.a.04'), SentiSynset('boring.s.01'), SentiSynset('dull.s.08'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]


In [None]:
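#### Objectivity score and sentiment scores
* Objectivity score: a value of 1 means the word carries no sentiment at all, in which case both sentiment scores (positive, negative) are 0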
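
The three scores of a sense sum to 1, so the objectivity score is simply what remains after the positive and negative scores. A minimal sketch of that relationship, using a hypothetical `objectivity` helper and the scores SentiWordNet reports for father.n.01 and fabulous.a.01:

```python
def objectivity(pos_score, neg_score):
    """Objectivity is whatever is left after the positive and negative scores."""
    return 1.0 - (pos_score + neg_score)

print(objectivity(0.0, 0.0))      # father.n.01: no sentiment, fully objective
print(objectivity(0.875, 0.125))  # fabulous.a.01: all of the mass is sentiment
```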

In [15]:
import nltk
from nltk.corpus import sentiwordnet as swn

father = swn.senti_synset('father.n.01')
print('father positive score: ', father.pos_score())
print('father negative score: ', father.neg_score())
print('father objectivity score: ', father.obj_score())
print('\n')
fabulous = swn.senti_synset('fabulous.a.01')
print('fabulous positive score: ', fabulous.pos_score())
print('fabulous negative score: ', fabulous.neg_score())

father positive score:  0.0
father negative score:  0.0
father objectivity score:  1.0


fabulous positive score:  0.875
fabulous negative score:  0.125


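#### Sentiment analysis of IMDB movie reviews with the SentiWordNet lexicon
1. Split each document into sentences
2. Tokenize each sentence into words, then apply lemmatization and POS tagging
3. Create Synset and SentiSynset objects from the POS-tagged words
4. Take the positive and negative scores from each SentiSynset, sum them over the document,
   and classify it as positive when the total is at or above a threshold, negative otherwise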


In [18]:
# POS tagging

from nltk.corpus import wordnet as wn

# Convert an NLTK Penn Treebank tag into the corresponding WordNet POS tag
def penn_to_wn(tag):
    if tag.startswith('J'):
        return wn.ADJ         # adjective
    elif tag.startswith('N'):
        return wn.NOUN        # noun
    elif tag.startswith('R'):
        return wn.ADV         # adverb
    elif tag.startswith('V'):
        return wn.VERB        # verb
    return None               # no WordNet equivalent for this tag
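
To see what the mapping does without loading WordNet, here is a dictionary-based sketch of the same idea. It relies on the fact that the WordNet POS constants are plain one-letter strings (`wn.ADJ == 'a'`, `wn.NOUN == 'n'`, `wn.ADV == 'r'`, `wn.VERB == 'v'`); `penn_to_wn_tag` is a hypothetical stand-in for the function above.

```python
# First letter of the Penn Treebank tag -> WordNet POS letter.
PREFIX_TO_WN = {"J": "a", "N": "n", "R": "r", "V": "v"}

def penn_to_wn_tag(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unsupported."""
    return PREFIX_TO_WN.get(tag[:1])

# JJ = adjective, NNS = plural noun, RB = adverb, VBD = past-tense verb,
# IN = preposition, DT = determiner (the last two have no WordNet POS).
for tag in ["JJ", "NNS", "RB", "VBD", "IN", "DT"]:
    print(tag, "->", penn_to_wn_tag(tag))
```

Tags that return None, such as prepositions and determiners, are the ones the polarity function skips.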


In [35]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag

def swn_polarity(text):
    # Initialize the sentiment score
    sentiment = 0.0
    tokens_count = 0
    
    lemmatizer = WordNetLemmatizer()
    raw_sentences = sent_tokenize(text)
    # For each sentence: tokenize words -> POS-tag -> build SentiSynsets -> accumulate the score
    for raw_sentence in raw_sentences:
        
        # POS-tag the sentence with NLTK
        tagged_sentence = pos_tag(word_tokenize(raw_sentence))
        # print(f'tagged_sentence\n{tagged_sentence}\n')
        
        for word , tag in tagged_sentence:
            
            # Convert to a WordNet POS tag before lemmatizing
            wn_tag = penn_to_wn(tag)
            # print(f'wn_tag:\n{wn_tag}\n')  => None(<- With) None(<- all)  None(<- this) n(<- stuff) 
            
            # Only nouns, adjectives, and adverbs are scored; verbs and unmapped tags are skipped
            if wn_tag not in (wn.NOUN , wn.ADJ, wn.ADV):
                continue  
                
            # Lemmatize
            lemma = lemmatizer.lemmatize(word, pos=wn_tag)
            #print(f'lemma\n{lemma}\n')  # stuff,   moment, MJ, i ,,,
            if not lemma:
                continue 
                
            # Look up Synset objects for the lemma with its WordNet POS tag
            synsets = wn.synsets(lemma , pos=wn_tag)
            # print(f'synsets\n{synsets}\n') # [Synset('material.n.01'), Synset('stuff.n.02'), Synset('stuff.n.03'),,,,]
            if not synsets:
                continue
                
            # Take the first sense and look up its SentiSynset in SentiWordNet.
            # Positive scores are added and negative scores subtracted across all words.
            synset = synsets[0]
            
            # print(f'synset.name()\n{synset.name()}\n') # material.n.01
            swn_synset = swn.senti_synset(synset.name())
            # print(f'swn_synset\n{swn_synset}\n') # <material.n.01: PosScore=0.0 NegScore=0.0>
            
            sentiment += (swn_synset.pos_score() - swn_synset.neg_score())           
            tokens_count += 1
    
    if not tokens_count:
        return 0
    
    # Return 1 (positive) if the total score is 0 or higher, otherwise 0 (negative)
    if sentiment >= 0 :
#         print(f'sentiment:{sentiment}\n')
#         print(f'tokens_count:{tokens_count}\n')
        return 1
    
#     print(f'sentiment:{sentiment}\n')
#     print(f'tokens_count:{tokens_count}\n')
    return 0
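
The function above counts the scored tokens but uses the count only to detect an empty result. One possible variation, which is not part of the original notebook, is to divide the total by the token count so that the threshold means the same thing for short and long reviews. A minimal sketch, with a hypothetical `decide` helper that takes the per-word (positive, negative) score pairs directly:

```python
def decide(scored_tokens, threshold=0.0, normalize=False):
    """Classify from a list of (pos_score, neg_score) pairs, one per scored word."""
    if not scored_tokens:
        return 0  # nothing could be scored; mirror the original fallback
    total = sum(pos - neg for pos, neg in scored_tokens)
    # Optionally turn the sum into an average score per token.
    score = total / len(scored_tokens) if normalize else total
    return 1 if score >= threshold else 0

# Three scored words: one strongly positive, one neutral, one mildly negative.
tokens = [(0.875, 0.125), (0.0, 0.0), (0.0, 0.25)]
print(decide(tokens))                                  # raw sum against threshold 0
print(decide(tokens, threshold=0.2, normalize=True))   # per-token average against 0.2
```

Whether normalization helps on the IMDB data is an open question here; it would need to be measured.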


In [29]:
import pandas as pd
review_df = pd.read_csv('./labeledTrainData.tsv', header=0, sep="\t", quoting=3)
print(review_df.head(3))
print(review_df['review'][0])

import re
review_df['review'] = review_df['review'].str.replace('<br />',' ')
review_df['review'] = review_df['review'].apply( lambda x : re.sub("[^a-zA-Z]", " ", x) )
print()
print(review_df['review'][0])
# from sklearn.model_selection import train_test_split
# class_df = review_df['sentiment']
# feature_df = review_df.drop(['id','sentiment'], axis=1, inplace=False)
# X_train, X_test, y_train, y_test= train_test_split(feature_df, class_df, test_size=0.3, random_state=156)
# X_train.shape, X_test.shape

         id  sentiment                                             review
0  "5814_8"          1  "With all this stuff going down at the moment ...
1  "2381_9"          1  "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"          0  "The film starts with a manager (Nicholas Bell...
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ

In [39]:
swn_polarity(review_df['review'][1]) # try changing the index to 0, 1, 2

sentiment:2.375

tokens_count:63



1

In [24]:
# Reportedly takes about 10 minutes to run
review_df['preds'] = review_df['review'].apply( lambda x : swn_polarity(x) )
y_target = review_df['sentiment'].values
preds = review_df['preds'].values

In [27]:
# Evaluation reuses the helper from Chapter 3
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score 
from sklearn.metrics import recall_score, f1_score, roc_auc_score

def get_clf_eval(y_test=None, pred=None):
    confusion = confusion_matrix( y_test, pred)
    accuracy = accuracy_score(y_test , pred)
    precision = precision_score(y_test , pred)
    recall = recall_score(y_test , pred)
    f1 = f1_score(y_test,pred)
    # Add ROC-AUC
    roc_auc = roc_auc_score(y_test, pred)
    print('Confusion matrix')
    print(confusion)
    # Include ROC-AUC in the printed summary
    print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f},\
    F1: {3:.4f}, AUC:{4:.4f}'.format(accuracy, precision, recall, f1, roc_auc))

In [28]:
print('#### SentiWordNet prediction performance ####')
get_clf_eval(y_target, preds) # performance is not good

#### SentiWordNet prediction performance ####
Confusion matrix
[[7668 4832]
 [3636 8864]]
Accuracy: 0.6613, Precision: 0.6472, Recall: 0.7091,    F1: 0.6767, AUC:0.6613
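
The reported metrics can be checked by hand from the confusion matrix. The sketch below recomputes them with plain arithmetic, assuming scikit-learn's layout of [[TN, FP], [FN, TP]]. It also shows why AUC equals accuracy here: `roc_auc_score` was given hard 0/1 predictions, so the ROC curve has a single operating point and the area reduces to the mean of recall and specificity, which coincides with accuracy when the two classes are the same size.

```python
# SentiWordNet confusion matrix, laid out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 7668, 4832, 3636, 8864

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)              # true positive rate
specificity = tn / (tn + fp)         # true negative rate
f1 = 2 * precision * recall / (precision + recall)
# Area under a ROC curve with one operating point (hard predictions).
auc_hard = (recall + specificity) / 2

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} auc={auc_hard:.4f}")
```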


In [40]:
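### Sentiment analysis with the VADER lexicon
* A lexicon intended for sentiment analysis of social media text
* Available as an NLTK module or as a standalone package
    - pip install vaderSentiment
    - from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
* polarity_scores() computes the sentiment scores; a text is positive if the score is at or above a threshold, negative otherwise
  -> returns a dictionary of scores (neg, neu, pos, compound (-1 to +1))
* Classification uses compound (positive if 0.1 or higher, negative otherwise; the threshold is adjustable)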
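
The compound value stays within -1 to +1 because VADER squashes the summed word valences with a normalization step. The sketch below shows that step as it appears in the reference implementation, to the best of my knowledge; treat the constant `alpha=15` as an assumption to verify against the version you have installed.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Larger absolute sums move toward -1 or +1 but never reach them.
for raw in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(raw, round(normalize(raw), 4))
```

Because the curve flattens for large sums, long texts with many sentiment words tend to produce compound values near the extremes.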

In [41]:
# Using the NLTK module
from nltk.sentiment.vader import SentimentIntensityAnalyzer

senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df['review'][0])
print(senti_scores)

{'neg': 0.13, 'neu': 0.743, 'pos': 0.127, 'compound': -0.7943}


In [42]:
def vader_polarity(review,threshold=0.1):
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)
    
    # Return 1 if the compound score is greater than or equal to the threshold, otherwise 0
    agg_score = scores['compound']
    final_sentiment = 1 if agg_score >= threshold else 0
    return final_sentiment

# Apply vader_polarity() to each record with apply/lambda and store the result in 'vader_preds'
review_df['vader_preds'] = review_df['review'].apply( lambda x : vader_polarity(x, 0.1) )
y_target = review_df['sentiment'].values
vader_preds = review_df['vader_preds'].values

print('#### VADER prediction performance ####')
get_clf_eval(y_target, vader_preds)


#### VADER prediction performance ####
Confusion matrix
[[ 6736  5764]
 [ 1867 10633]]
Accuracy: 0.6948, Precision: 0.6485, Recall: 0.8506,    F1: 0.7359, AUC:0.6948
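
Since the threshold on the compound score is adjustable, one way to choose it is to sweep a few candidate values and compare accuracy. The sketch below does this on a handful of compound scores and labels that are invented for illustration; `accuracy_at` is a hypothetical helper, and in practice the scores would come from `polarity_scores()` on a held-out sample.

```python
def accuracy_at(threshold, compounds, labels):
    """Share of samples whose thresholded compound score matches the label."""
    preds = [1 if c >= threshold else 0 for c in compounds]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Hypothetical compound scores and gold labels, invented for illustration.
compounds = [0.95, 0.40, 0.15, 0.05, -0.30, -0.80]
labels = [1, 1, 0, 1, 0, 0]

for threshold in (-0.1, 0.0, 0.1, 0.3):
    print(threshold, round(accuracy_at(threshold, compounds, labels), 2))
```

Tuning the threshold on labeled data does introduce a small amount of supervision, so the chosen value should be validated on reviews that were not used to pick it.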
