# Text Mining
## Key Terms
- Document
- Corpus: a set of documents
- Token: a meaningful element of a text, such as a word, phrase, or symbol
- Morphological analysis: a morpheme is the smallest meaningful unit in a language
 - In NLP, morphemes are commonly used as tokens; the analysis identifies linguistic attributes such as roots, prefixes, suffixes, and parts of speech
 - Stemming
 - Lemmatization
 - Part-of-speech (POS) tagging
- Topic modeling
- Sentiment analysis
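
To make the terms above concrete, here is a minimal plain-Python sketch. It treats a corpus as a list of documents, splits a document into tokens, and contrasts stemming with lemmatization. The helpers `tokenize`, `naive_stem`, `naive_lemma` and the `LEMMAS` lookup are hypothetical toys written for this illustration; real stemmers and lemmatizers (and Korean morphological analyzers) are far more elaborate.

```python
import re

# A corpus is simply a collection of documents.
corpus = [
    "The fool doth think he is wise,",
    "but the wise man knows himself to be a fool",
]

def tokenize(document):
    """Lowercase a document and split it into word tokens."""
    return re.findall(r"\w+", document.lower())

def naive_stem(token):
    """Toy stemmer: chop a few English suffixes (this is NOT the Porter algorithm)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Lemmatizing needs dictionary knowledge; a tiny hand-written lookup stands in here.
LEMMAS = {"doth": "do", "is": "be", "knows": "know"}

def naive_lemma(token):
    """Toy lemmatizer: map a token to its dictionary form when it is known."""
    return LEMMAS.get(token, token)

for document in corpus:
    for token in tokenize(document):
        print(token, naive_stem(token), naive_lemma(token))
```

The contrast shows up on `is`: the suffix rule leaves it unchanged, while the dictionary lookup maps it to `be`. Stemming is rule-based truncation, whereas lemmatization returns a proper dictionary form.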


## Process
- Data preprocessing
- Data structuring
- Feature extraction
- Modeling

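## Preprocessing in Scikit-learn
- BOW (Bag of Words)
 - Tokenize: split each document into words (tokens)
 - Build the vocabulary: collect every word and assign each an index (in alphabetical order)
 - Encode: count how often each vocabulary word occurs in each document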
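
The three steps above can be sketched by hand in plain Python before handing the job to scikit-learn. The two toy documents and the names `tokenized`, `vocabulary`, and `encode` are assumptions made for this illustration only; this is not how `CountVectorizer` is implemented internally.

```python
import re
from collections import Counter

docs = ["the cat sat", "the cat ate the fish"]

# 1. Tokenize: lowercase each document and split it into words.
tokenized = [re.findall(r"\w+", doc.lower()) for doc in docs]

# 2. Build the vocabulary: collect every distinct word and number it alphabetically.
all_words = sorted({token for tokens in tokenized for token in tokens})
vocabulary = {word: index for index, word in enumerate(all_words)}

# 3. Encode: one count per vocabulary word, in index order.
def encode(tokens):
    counts = Counter(tokens)
    return [counts[word] for word in all_words]

bow = [encode(tokens) for tokens in tokenized]

print(vocabulary)
print(bow)
```

Each row of `bow` has one column per vocabulary word, so word order within a document is discarded and only the counts remain.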


In [1]:
bards_words = ["The fool doth think he is wise,", 
              "but the wise man knows himself to be a fool"]

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()

vect.fit(bards_words)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [3]:
# Size of the vocabulary
print(len(vect.vocabulary_))

13


In [4]:
# Contents of the vocabulary (word -> column index)
print(vect.vocabulary_)

{'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, 'is': 6, 'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5, 'to': 11, 'be': 0}
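
The second sentence contains the word "a", yet it is missing from the 13-entry vocabulary. The likely reason is the default `token_pattern` shown in the `CountVectorizer` repr above, `(?u)\b\w\w+\b`, which only matches runs of two or more word characters. The sketch below applies that same regular expression with the standard `re` module; it mimics the tokenization step only and is not the vectorizer itself.

```python
import re

# Default token pattern, copied from the CountVectorizer repr above.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

bards_words = [
    "The fool doth think he is wise,",
    "but the wise man knows himself to be a fool",
]

# Lowercase first (lowercase=True in the repr), then extract tokens.
tokens_per_doc = [re.findall(TOKEN_PATTERN, doc.lower()) for doc in bards_words]
unique_tokens = sorted({tok for toks in tokens_per_doc for tok in toks})

print(tokens_per_doc[1])
print(len(unique_tokens))
```

Single-character tokens are dropped, which matters even more for Korean text, where many meaningful words are one syllable long.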


In [5]:
bag_of_word = vect.transform(bards_words)

In [6]:
print(repr(bag_of_word))

<2x13 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>
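
The repr reports 16 stored elements in Compressed Sparse Row (CSR) format. As a rough illustration of what that layout holds, the sketch below builds the three CSR arrays by hand for the two count rows of this example. The function `to_csr` is a hypothetical helper; SciPy's `csr_matrix` exposes attributes named `data`, `indices`, and `indptr`, but this is not its implementation.

```python
def to_csr(dense_rows):
    """Convert a list of dense rows into CSR-style (data, indices, indptr) lists."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value itself
                indices.append(col)   # the column it lives in
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr

# Dense counts for the two Shakespeare sentences (2 rows x 13 vocabulary words).
dense = [
    [0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1],
]

data, indices, indptr = to_csr(dense)
print(len(data))   # number of stored (non-zero) elements
print(indptr)      # row boundaries inside data/indices
```

Only the non-zero entries are stored, which is why a sparse matrix stays small even when the vocabulary has thousands of columns.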


In [7]:
print(bag_of_word.toarray())

[[0 0 1 1 1 0 1 0 0 1 1 0 1]
 [1 1 0 1 0 1 0 1 1 1 0 1 1]]
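
Each column of the array corresponds to one vocabulary index, so a row can be read back into words by inverting the `vocabulary_` mapping. The sketch below does this in plain Python with the vocabulary and rows copied from the outputs above; `index_to_word` and `decode` are names introduced here for illustration.

```python
# Copied from the outputs of In [4] and In [7].
vocabulary = {'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, 'is': 6,
              'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5,
              'to': 11, 'be': 0}
rows = [
    [0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1],
    [1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1],
]

# Invert word -> index into index -> word.
index_to_word = {index: word for word, index in vocabulary.items()}

def decode(row):
    """Return {word: count} for every non-zero column of a count row."""
    return {index_to_word[i]: count for i, count in enumerate(row) if count > 0}

decoded = [decode(row) for row in rows]
for words in decoded:
    print(words)
```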


## Text Mining Movie Reviews

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [10]:
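# Tab-separated file; keep_default_na=False keeps review text such as "NA" or an
# empty string as a string instead of NaN. Only the first 1,000 rows are used.
df_train = pd.read_csv("ratings_train.txt", delimiter='\t', keep_default_na=False).head(1000)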

df_train.head()

Unnamed: 0,id,document,label
0,9976970,아 더빙.. 진짜 짜증나네요 목소리,0
1,3819312,흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나,1
2,10265843,너무재밓었다그래서보는것을추천한다,0
3,9045019,교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정,0
4,6483659,사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 ...,1
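
The `keep_default_na=False` argument is easy to overlook. As a small sketch of its effect, the example below parses a hypothetical two-row, tab-separated sample (not taken from the ratings file) with and without the flag.

```python
import io
import pandas as pd

# Hypothetical sample: one review is literally "NA", the other is empty.
sample = "id\tdocument\tlabel\n1\tNA\t0\n2\t\t1\n"

with_default = pd.read_csv(io.StringIO(sample), delimiter="\t")
kept_as_text = pd.read_csv(io.StringIO(sample), delimiter="\t", keep_default_na=False)

print(with_default["document"].isna().sum())   # how many reviews became NaN
print(kept_as_text["document"].tolist())       # reviews kept as plain strings
```

Keeping every review as a string avoids NaN values in the text column, which a vectorizer that expects strings cannot process.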


In [12]:
text_train = df_train['document'].values
y_train = df_train['label'].values

In [13]:
# Same settings as the training file; only the first 200 rows are used.
df_test = pd.read_csv("ratings_test.txt", delimiter='\t', keep_default_na=False).head(200)

df_test.head()

Unnamed: 0,id,document,label
0,6270596,굳 ㅋ,1
1,9274899,GDNTOPCLASSINTHECLUB,0
2,8544678,뭐야 이 평점들은.... 나쁘진 않지만 10점 짜리는 더더욱 아니잖아,0
3,6825595,지루하지는 않은데 완전 막장임... 돈주고 보기에는....,0
4,6723715,3D만 아니었어도 별 다섯 개 줬을텐데.. 왜 3D로 나와서 제 심기를 불편하게 하죠??,0


In [15]:
text_test = df_test['document'].values
y_test = df_test['label'].values

In [16]:
print(text_train.shape, np.bincount(y_train))
print(text_test.shape, np.bincount(y_test))

(1000,) [508 492]
(200,) [ 94 106]
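
`np.bincount` counts how often each non-negative integer occurs, so for 0/1 labels the output reads as [count of label 0, count of label 1]. The sketch below shows the same check on a small made-up label array and turns the counts into class shares.

```python
import numpy as np

# Made-up labels for illustration only.
labels = np.array([0, 1, 1, 0, 1])

counts = np.bincount(labels)     # occurrences of 0, then of 1
shares = counts / counts.sum()   # fraction of each class

print(counts)
print(shares)
```

Applied to the outputs above, 508 versus 492 and 94 versus 106 indicate that both subsets are roughly balanced, so plain accuracy is a reasonable first metric.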


In [27]:
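# KoNLPy calls a Java backend through JPype. If this import fails with
# "ModuleNotFoundError: No module named 'jpype'", install the JPype1 package first.
# Note: newer KoNLPy releases rename this tagger to Okt.
from konlpy.tag import Twitter

twitter_tag = Twitter()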

In [25]:
def twitter_tokenizer(text):
    # Return only the nouns that the KoNLPy tagger finds in the review
    return twitter_tag.nouns(text)
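
Any callable that maps a string to a list of tokens can be passed as `tokenizer`. Because KoNLPy needs a Java runtime, the sketch below uses a hypothetical whitespace tokenizer as a stand-in to show the mechanism; the two Korean sentences are made up. Passing `token_pattern=None` is an assumption that suits recent scikit-learn versions, where it silences the warning about the unused pattern.

```python
from sklearn.feature_extraction.text import CountVectorizer

def whitespace_tokenizer(text):
    """Hypothetical stand-in for twitter_tokenizer: split on whitespace."""
    return text.split()

docs = ["이 영화 정말 재미 있다", "이 영화 별로 재미 없다"]

# Default behaviour: the regex token pattern drops one-character tokens.
default_vect = CountVectorizer().fit(docs)

# Custom tokenizer: every token the callable returns is kept.
custom_vect = CountVectorizer(tokenizer=whitespace_tokenizer,
                              token_pattern=None).fit(docs)

print(len(default_vect.vocabulary_), len(custom_vect.vocabulary_))
print("이" in default_vect.vocabulary_, "이" in custom_vect.vocabulary_)
```

With a custom tokenizer the vectorizer keeps single-syllable tokens that the default pattern would discard, which is one reason to plug in a morphological analyzer for Korean.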

In [133]:
vect = CountVectorizer(tokenizer=twitter_tokenizer).fit(text_train)

In [134]:
print(len(vect.vocabulary_))

2472


In [135]:
bag_of_word = vect.transform(text_train)

In [136]:
print(repr(bag_of_word))

<1000x2472 sparse matrix of type '<type 'numpy.int64'>'
	with 6078 stored elements in Compressed Sparse Row format>


In [137]:
print(bag_of_word.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
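
`TfidfVectorizer` replaces raw counts with TF-IDF weights. As a rough sketch of what its documented defaults (`smooth_idf=True`, `norm='l2'`) compute, the plain-Python example below uses idf(t) = ln((1 + n) / (1 + df(t))) + 1 and then scales each document vector to unit length. The toy documents and the helper names `idf` and `tfidf` are assumptions for illustration; this is not the library's code.

```python
import math
from collections import Counter

# Toy documents, already tokenized.
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "film"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc):
    """Term count times idf, followed by L2 normalization."""
    counts = Counter(doc)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

weights = [tfidf(doc) for doc in docs]
for doc_weights in weights:
    print(doc_weights)
```

In the second document the rarer term "bad" ends up with a larger weight than "movie", which appears in two of the three documents.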

In [36]:
twit_param_grid = {'tfidfvectorizer__min_df': [5, 7],
#                   'tfidfvectorizer__ngram_range': [(1,1), (1,2), (1,3)],
                   'logisticregression__C': [0.1, 1, 10]}

In [37]:
twit_pipe = make_pipeline(TfidfVectorizer(tokenizer = twitter_tokenizer),
                          LogisticRegression())
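
The keys in `twit_param_grid` follow the `<step name>__<parameter>` convention, and `make_pipeline` names each step after its lowercased class name. The sketch below checks this on a pipeline built with default estimators (no KoNLPy tokenizer, so it runs without Java).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Step names generated by make_pipeline.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Every grid key must be a valid parameter name of the pipeline.
grid_keys = ["tfidfvectorizer__min_df", "logisticregression__C"]
valid_params = pipe.get_params()
print(all(key in valid_params for key in grid_keys))
```

Here `min_df` drops terms that appear in fewer than that many documents, and `C` is the inverse regularization strength of the logistic regression.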

In [38]:
twit_grid = GridSearchCV(twit_pipe, twit_param_grid, cv=5)

In [40]:
# Run the grid search on the first 500 training reviews only
twit_grid.fit(text_train[0:500], y_train[0:500])

GridSearchCV(cv=5, error_score='raise',
       estimator=Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm=u'l2', preprocessor=None, smoo...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'tfidfvectorizer__min_df': [5, 7], 'logisticregression__C': [0.1, 1, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [43]:
print(twit_grid.best_score_)
print(twit_grid.best_params_)

0.654
{'tfidfvectorizer__min_df': 5, 'logisticregression__C': 0.1}


In [53]:
X_test_konlpy = twit_grid.best_estimator_.named_steps["tfidfvectorizer"].transform(text_test)

In [55]:
score = twit_grid.best_estimator_.named_steps['logisticregression'].score(X_test_konlpy, y_test)

In [56]:
print(score)

0.64272
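
Cells In [53] to In [56] transform the test text and score the classifier in two separate steps. Since the fitted grid search wraps the whole pipeline, calling `score` directly on the raw text should give the same number. The sketch below checks this on a small made-up English dataset with the default tokenizer; the documents, labels, and `cv=2` are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Made-up toy data: 1 = positive review, 0 = negative review.
train_docs = ["good fun movie", "great fun film", "good great story", "fun great movie",
              "bad boring movie", "dull boring film", "bad dull story", "boring bad film"]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_docs = ["good fun story", "bad boring story"]
test_labels = [1, 0]

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
grid = GridSearchCV(pipe, {"logisticregression__C": [0.1, 1, 10]}, cv=2)
grid.fit(train_docs, train_labels)

# Two-step scoring, as in the cells above.
best = grid.best_estimator_
X_test = best.named_steps["tfidfvectorizer"].transform(test_docs)
two_step = best.named_steps["logisticregression"].score(X_test, test_labels)

# One-step scoring: the refitted pipeline vectorizes and classifies in one call.
one_step = grid.score(test_docs, test_labels)

print(two_step, one_step)
```

The one-step form is harder to get wrong, because there is no chance of accidentally scoring with a vectorizer that differs from the one the classifier was trained with.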
