## 감성분석(sentiment analysis)
- 자연어 처리(Natural Language Processing, NLP)의 하위 분야
- 문서 성향을 분석(opinion mining)

In [1]:
import tarfile

In [None]:
#tarball 파일 압축 풀기
import tarfile
with tarfile.open('aclImdb_v1.tar.gz','r:gz')as tar:
    tar.extractall()

In [None]:
!pip install pyprind

In [None]:
import pyprind#예상 시간 추측
import pandas as pd
import os

In [None]:
basepath = 'aclImdb'#'base path'를 압축 해제된 영화 리뷰 데이터셋에 있는 디렉터리로 바꿈
labels = {'pos':1, 'neg':0}
pbar = pyprind.ProgBar(50000)#진행 막대(읽어들일 문서 개수)
df = pd.DataFrame()
for s in ('test','train'):
    for l in ('pos','neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file),
                     'r', encoding ='utf-8') as infile:
                txt = infile.read()
            df = df.append([[[txt, labels[l]]]], #1:긍정 0:부정
                           ignore_index=True)
            pbar.update()

In [12]:
df.head()

Unnamed: 0,0
0,[I went and saw this movie last night after be...
1,[Actor turned director Bill Paxton follows up ...
2,[As a recreational golfer with some knowledge ...
3,"[I saw this film in a sneak preview, and it is..."
4,[Bill Paxton has taken the true story of the 1...


In [None]:
df.columns = ['review','sentiment']

In [None]:
import numpy as np
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))#데이터프레임 섞기
df.to_csv('movie_data.csv',index=False,encoding='utf-8')

In [4]:
import pandas as pd
#그냥 깃헙에서 가져옴
df = pd.read_csv('movie_data.csv',encoding='utf-8')
df.head()

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0
3,hi for all the people who have seen this wonde...,1
4,"I recently bought the DVD, forgetting just how...",0


In [5]:
df.shape

(50000, 2)

### BoW (Bag-of-Word)
: 텍스트를 수치 특성 벡터로 표현<br>
1. 전체 문서에 대해 고유한 토큰(token), 예를 들어 단어로 이루어진 어휘 사전(vocabulary)을 만듦
2. 특정 문서에 각 단어가 얼마나 자주 등장하는지 헤아려 특성 벡터를 만듦

#### 단어를 특성 벡터로 변환
CountVectorizer 클래스를 사용하여 각각의 문서에 있는 단어 카운트를 기반으로 BoW 모델을 만듦

In [6]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()

In [7]:
docs = np.array([
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the wheather is sweet, and one and one is two'
])
bag = count.fit_transform(docs)

fit_transform: BoW 모델의 어휘 사전을 구축하고 문장들을 희소한 특성 벡터로 변환

In [8]:
print(count.vocabulary_)#숫자는 해당 단어가 저장된 열(column)의 인덱스

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'wheather': 9, 'and': 0, 'one': 2, 'two': 7}


In [9]:
print(bag.toarray())#column별로 앞에서부터 and, is, one ,,,순서

[[0 1 0 1 1 0 1 0 0 0]
 [0 1 0 0 0 1 1 0 1 0]
 [2 3 2 1 1 1 2 1 0 1]]


#### 단어 빈도(term frequency)
$tf(t,d)$ : 문서 d에 등장한 단어 t의 횟수

### tf-idf (term frequency-inverse document frquency)
: 특성 벡터에서 자주 등장하는 단어의 가중치를 낮추는 기법
<br>
단어 빈도와 역문서빈도(inverse document frequency)의 곱으로 정의됨

<br>
<center>tf-idf기법</center>
$$tf-idf(t,d) = tf(t,d) \times idf(t,d)$$
<br><br>

<center>역문서 빈도</center>
$$idf(t,d) = log\frac{n_d}{1+df(d,t)}$$

In [10]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True, norm='l2',#L2정규화
                        smooth_idf=True)#단어 빈도를 입력 받아 tf-idf로 변환
np.set_printoptions(precision=2)

In [11]:
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.   0.  ]
 [0.   0.39 0.   0.   0.   0.5  0.39 0.   0.66 0.  ]
 [0.5  0.44 0.5  0.19 0.19 0.19 0.29 0.25 0.   0.25]]


<center>사이킷런에서 계산하는 tf-idf</center>
$$idf(t,d) = log\frac{1+n_d}{1+df(d,t)}$$

### 텍스트 데이터 정제

In [12]:
#첫 번째 문서에서 마지막 50개의 문자 출력
df.loc[0,'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [13]:
#이모티콘 문자를 제외하고 모든 구두점 기호 삭제
import re#정규표현식(regular expression)
def preprocessor(text):#텍스트 데이터 정제
    text = re.sub('<[^>]*>','',text)#정규표현식 사용
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',text)
    text = (re.sub('[\W]+',' ',text.lower()) +#단어가 아닌 문자 모두 제거
            ''.join(emoticons).replace('-',''))#소문자로 바꿈, 이모티콘 변수를 처리 완료된 문자열 끝에 추가
    return text

In [14]:
preprocessor(df.loc[0,'review'][-50:])

'is seven title brazil not available'

In [15]:
preprocessor("<//a>This :) is :( a test :-)!")

'this is a test :):(:)'

In [16]:
df['review'] = df['review'].apply(preprocessor)

### 문서를 토큰으로 나누기

In [17]:
def tokenizer(text):#문서 토큰화
    return text.split()#공백 문자를 기준으로 개별 단어로 나눔

In [18]:
tokenizer(('runners like running thus they run'))

['runners', 'like', 'running', 'thus', 'they', 'run']

어간추출(stemming): 단어를 변하지 않는 기본 형태인 어간으로 바꿔줌

In [19]:
!pip install nltk



In [20]:
from nltk.stem.porter import PorterStemmer#포터 어간 추출
porter = PorterStemmer()

In [21]:
def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [22]:
tokenizer_porter('Hello my name is Chaeeun hahaha I have to write a verb, I am a runner thus I run. ring - rang - rung - singing')
#실제 사용하지 않는 단어 thu가 포함됨

['hello',
 'my',
 'name',
 'is',
 'chaeeun',
 'hahaha',
 'i',
 'have',
 'to',
 'write',
 'a',
 'verb,',
 'i',
 'am',
 'a',
 'runner',
 'thu',
 'i',
 'run.',
 'ring',
 '-',
 'rang',
 '-',
 'rung',
 '-',
 'sing']

불용어(stop-word): 아주 흔하게 등장하는 단어(is, and, has, like)

In [23]:
import nltk
nltk.download('stopwords')#179개 불용어

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LG\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [24]:
from nltk.corpus import stopwords#불용어
stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:] if w not in stop]#불용어 제거

['runner', 'like', 'run', 'run', 'lot']

In [25]:
[w for w in tokenizer_porter('Hello my name is Chaeeun hahaha I have to write a verb, I am a runner thus I run. ring - rang - rung - singing') if w not in stop]

['hello',
 'name',
 'chaeeun',
 'hahaha',
 'write',
 'verb,',
 'runner',
 'thu',
 'run.',
 'ring',
 '-',
 'rang',
 '-',
 'rung',
 '-',
 'sing']

### 문서 분류를 위한 로지스틱 회귀 모델 훈련

In [26]:
X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:,'review'].values
y_test=df.loc[25000:,'sentiment'].values

In [27]:
from sklearn.model_selection import GridSearchCV#그리드 서치
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression#로지스틱 회귀 모델
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
tfidf = TfidfVectorizer(strip_accents=None,#CountVectorizer과 TfidTransformer 두 기능 하나로 합침
                       lowercase=False,
                       preprocessor=None)#tf-idf

In [29]:
param_grid = [{'vect__ngram_range':[(1,1)],#매개변수 딕셔너리 1: 기본 매개변수 세팅
              'vect__stop_words':[stop, None],#use_idf = True, smooth_idf = True, norm = l2
               'vect__tokenizer':[tokenizer,tokenizer_porter],
               'clf__penalty':['l1','l2'],#로지스틱 회귀분류기 규제
               'clf__C':[1.0,10.0,100.0]#규제 매개변수
              },
              
             {'vect__ngram_range':[(1,1)],#매개변수 딕셔너리 2: use_idf = False, smooth_idf = False, norm=None
              'vect__stop_words':[stop,None],
              'vect__tokenizer':[tokenizer, tokenizer_porter],
              'vect__use_idf':[False],
              'vect__norm':[False],
              'clf__penalty':['l1','l2'],#로지스틱 회귀 분류기 규제
              'clf__C':[1.0,10.0,100.0]#규제 매개변수
             }
             ]

In [30]:
lr_tfidf = Pipeline([('vect',tfidf),
                    ('clf',LogisticRegression(solver = 'liblinear',random_state=0))])

In [31]:
gs_lr_tfidf = GridSearchCV(lr_tfidf,param_grid,
                          scoring='accuracy',
                          cv=4,#5겹 계층별 교차 검증
                          verbose=1,
                          n_jobs=1)#속도 높이려면 n_jobs=-1로 지정

In [32]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 4 folds for each of 48 candidates, totalling 192 fits








GridSearchCV(cv=4,
             estimator=Pipeline(steps=[('vect',
                                        TfidfVectorizer(lowercase=False)),
                                       ('clf',
                                        LogisticRegression(random_state=0,
                                                           solver='liblinear'))]),
             n_jobs=1,
             param_grid=[{'clf__C': [1.0, 10.0, 100.0],
                          'clf__penalty': ['l1', 'l2'],
                          'vect__ngram_range': [(1, 1)],
                          'vect__stop_words': [['i', 'me', 'my', 'myself', 'we',
                                                'our', 'ours', 'ourselves',
                                                'you', "you're", "you've"...
                                                'our', 'ours', 'ourselves',
                                                'you', "you're", "you've",
                                                "you'll", "you'd", 'your',
 

In [33]:
print("최적의 매개변수 조합: %s"%gs_lr_tfidf.best_params_)

최적의 매개변수 조합: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x0000017E505C8820>}


In [34]:
print("CV 정확도: %.3f"%gs_lr_tfidf.best_score_)

CV 정확도: 0.897


In [36]:
print("테스트 정확도: %.3f"%gs_lr_tfidf.score(X_test,y_test))

테스트 정확도: 0.898


### 대용량 데이터 처리: 온라인 알고리즘과 외보 메모리 학습
<br>외부 메모리 학습(out-of-core learning)<br>
: 작은 배치(batch)로 나누어 분류기를 점진적으로 학습시킴

In [37]:
import numpy as np
import re
from nltk.corpus import stopwords
stop = stopwords.words('english')

def tokenizer(text):
    #텍스트 정제->불용어 제거-> 토큰으로 분리
    text = re.sub('<[^>]*>','',text)#정규표현식 사용
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',text)
    text = (re.sub('[\W]+',' ',text.lower()) +#단어가 아닌 문자 모두 제거
            ''.join(emoticons).replace('-',''))#소문자로 바꿈, 이모티콘 변수를 처리 완료된 문자열 끝에 추가
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

In [38]:
def stream_docs(path):#문서 하나씩 읽어서 반환되는 제너레이터 함수
    with open(path,'r',encoding = 'utf-8') as csv:
        next(csv)#헤더 넘기기
        for line in csv:
            text, label = line[:-3], int(line[-2])
            yield text,label

In [39]:
next(stream_docs(path = 'movie_data.csv'))#리뷰텍스트와 이에 상응하는 클래스 레이블이 하나의 튜플로 반환됨

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

In [40]:
def get_minibatch(doc_stream,size):#문서를 읽어 지정한 만큼 문서를 반환
    docs, y = [],[]
    try:
        for _ in range(size):#지정한 size만큼
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        pass
    return docs, y

In [41]:
from sklearn.feature_extraction.text import HashingVectorizer#데이터 종류에 상관없이 사용 가능
from sklearn.linear_model import SGDClassifier
vect = HashingVectorizer(decode_error = 'ignore',
                        n_features = 2**21,
                        preprocessor = None,
                        tokenizer=tokenizer)#해싱트릭사용
clf = SGDClassifier(loss = 'log',random_state = 1, max_iter =1)#로지스틱 회귀 모델(loss = 'log')
doc_stream = stream_docs(path = 'movie_data.csv')

In [42]:
import pyprind
pbar = pyprind.ProgBar(45)#진행 막대
classes = np.array([0,1])
for _ in range(45):#45개 미니 배치
    X_train, y_train = get_minibatch(doc_stream, size = 1000)#1000개 문서로 구성된 각 미니 배치
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train,y_train,classes = classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:30


In [43]:
X_test,y_test = get_minibatch(doc_stream,size=5000)#마지막 5000개 문서->평가
X_test = vect.transform(X_test)
print("정확도: %.3f"%clf.score(X_test,y_test))

정확도: 0.868


In [44]:
clf = clf.partial_fit(X_test,y_test)

### 잠재 디리클레 할당을 사용한 토픽 모델링
<b>잠재 디리클래 할당(Latent Dirichlet Allocation)</b>
- 문서-토픽 행렬
- 단어-토픽 행렬

<br>
미리 토픽 개수를 정해야 함

In [47]:
import pandas as pd 
df = pd.read_csv('movie_data.csv',encoding='utf-8')

In [48]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words = 'english',#불용어
                       max_df =.1,#최대문서 빈도 10%(하이퍼파라미터)
                        max_features = 5000#자주 등장하는 단어 5000개로 계싼(하이퍼파라미터)
                       )
X = count.fit_transform(df['review'].values)

In [49]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=10,#토픽 10개
                               random_state = 123,
                               learning_method = 'batch')
X_topics= lda.fit_transform(X)

In [50]:
lda.components_.shape

(10, 5000)

In [52]:
n_top_words = 5
feature_names = count.get_feature_names()
for topic_idx, topic in enumerate(lda.components_):#10개의 토팍에서 가장 중요한 단어 5개씩 출력
    print("토픽 %d:"%(topic_idx + 1))
    print(" ".join(feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]))#역순으로 정렬해야 최상위부터 출력 가능

토픽 1:
worst minutes awful script stupid
토픽 2:
family mother father children girl
토픽 3:
american war dvd music tv
토픽 4:
human audience cinema art sense
토픽 5:
police guy car dead murder
토픽 6:
horror house sex girl woman
토픽 7:
role performance comedy actor performances
토픽 8:
series episode war episodes tv
토픽 9:
book version original read novel
토픽 10:
action fight guy guys cool


In [53]:
horror = X_topics[:,4].argsort()[::-1]#공포영화카테고리
for iter_idx, movie_idx in enumerate(horror[:3]):
    print("\n공포영화 #%d:"%(iter_idx +1))
    print(df['review'][movie_idx][:300],'...')


공포영화 #1:
**SPOILERS** Extremely brutal police drama set in San Francisco involving a sting operation that goes terribly wrong. A cop Det. Falon, Sam Elliott,mistakenly and savagely beats to death an undercover policeman Winch, Mike Watson,thinking that he murdered his partner Det. Sam Levinson, Mike Burstyn. ...

공포영화 #2:
Two stars <br /><br />Amanda Plummer looking like a young version of her father, Christopher Plummer in drag, stars in this film along with Robert Forster--who really should have put a little shoe black on top of that bald spot.<br /><br />I've never seen Amanda Plummer in a good film. She always pl ...

공포영화 #3:
A film without conscience. Drifter agrees to kill a man for a mobster for money. Then they double cross him. Meanwhile he falls in love with the dead man's wife, and, without her knowing he's the killer, moves in with her. Then he "accidentally" kills her when she finds out. Then, in a WALKING TALL  ...
