## Final Project: Sentiment Classification via Text Mining

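### 1. Project Overview
* Emotion is a mental and physiological state associated with a variety of feelings, thoughts, and behaviors, and it relates to subjective experiences such as mood, temperament, and personality. <span style="color:blue;">[2]</span>
* This project collects opinions on a specific topic from the web and classifies each one as positive or negative.
* The submission consists of the data, a report (hypothesis, research method, conclusion), and the executable program.
* The data consists of 300 sentences collected from the web, with the training and test data prepared separately.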

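### 2. Research Topic and Hypothesis
* Research topic: foreigners' sentiment toward the Seoul subway.
* Research hypothesis: foreigners' opinions of the Seoul subway differ in sentiment.
* A domestic study of the living environment for foreigners in Seoul reports that, although services that help foreigners use transportation facilities have improved considerably, it is still difficult for foreigners to use Seoul's transportation conveniently, for example because foreign-language labeling on guide signs is insufficient. <span style="color:blue;">[1]</span>
* This study hypothesizes that foreigners, in line with the domestic researchers' prediction, show differences in sentiment in their opinions of Seoul's transportation, and of the subway in particular.
* To test the hypothesis, this project builds a vocabulary-based sentiment classification module.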

### 3. Research Method
* The method consists of two steps: data mining, then building the sentiment classification modules (naive Bayes, SVM).

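### 3-1. Data Mining
* To obtain the training data, TripAdvisor reviews <span style="color:blue;">[3]</span> were mined manually (the researcher copied the text by hand).
* From reviews rated 'Very good' (아주좋음) or 'Good' (좋음), 150 sentences expressing positive sentiment were saved to the MetroPositiveTraining text file.
* From reviews rated 'Average' (보통), 'Poor' (별로), or 'Terrible' (최악), 150 sentences expressing negative sentiment were saved to the MetroNegativeTraining text file.
* For the test data, Twitter was searched for opinions about the Seoul subway, and 10 positive and 10 negative sentences (20 in total) were saved to the MetroTestSet text file.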

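### 3-2. Building the Sentiment Classification Module
* To find the best sentiment classification module, one classifier was built with the naive Bayes algorithm and one with SVM, and the one with the higher accuracy was adopted.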

#### Directory setup for the program

In [50]:
# directory setup
import os
myhome=os.path.expanduser('~')
mywd=os.path.join(myhome,'Desktop/S_ParkMinJi/src/')
mytxt=os.path.join(myhome,'Desktop/S_ParkMinJi/doc/')
print myhome, mywd, mytxt

C:\Users\MinJi C:\Users\MinJi\Desktop/S_ParkMinJi/src/ C:\Users\MinJi\Desktop/S_ParkMinJi/doc/


In [51]:
%cd {mywd}

C:\Users\MinJi\Desktop\S_ParkMinJi\src


#### Building word vectors from tourists' positive and negative opinions of the Seoul subway

In [52]:
from numpy import *

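"""
textParse
input: bigString
output: word list
Splits the text into a list of tokens, drops tokens of two characters or fewer, and lowercases the rest.
"""
def textParse(bigString):
    import re
    # Split on runs of non-word characters; r'\W+' avoids the empty-match splits that r'\W*' produces on Python 3.7+
    listOfTokens = re.split(r'\W+', bigString)
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]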

"""
loadDataSet
output: postingList, classVec
"""
def loadDataSet():
    postingList = [];
    # Read MetroPositiveTraining line by line and store the textParse result of each line in postingList.
    with open('data/metro/MetroPositiveTraining.txt') as mPosTrain:
        for i, line in enumerate(mPosTrain):
            postingList.append(textParse(line))
    classVec = list(zeros(150))    #0: Positive     
    # Read MetroNegativeTraining line by line and append the textParse result of each line to postingList.
    with open('data/metro/MetroNegativeTraining.txt') as mNegTrain:
        for i, line in enumerate(mNegTrain):
            postingList.append(textParse(line))           
    classVec.extend(ones(150))    #1: Negative
    return postingList, classVec

"""
createVocabList
input: dataSet
output: list(vocabSet)
Builds the list of unique words that occur across all documents
"""
def createVocabList(dataSet):
    vocabSet = set([])  #create empty set
    for document in dataSet:
        vocabSet = vocabSet | set(document) #union of the two sets - or
    return list(vocabSet)

"""
setOfWords2Vec
input: vocabList, inputSet
output: returnVec
Marks whether each vocabulary word occurs in the given document: takes the vocabulary list and a document, returns a vector of 1s and 0s
"""
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList) # create a zero vector the same length as the vocabulary
    for word in inputSet: # check each word in the document
        if word in vocabList: # if the word is in the vocabulary
            returnVec[vocabList.index(word)] = 1 # set the matching entry of the output vector to 1
#         else: print "the word: %s is not in my Vocabulary!" % word
    return returnVec
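
As a small illustration of the vocabulary and set-of-words steps above, the following self-contained sketch vectorizes two made-up sentences. The sentences, the snake_case function names, and the sorted vocabulary are assumptions for illustration only; the notebook's `createVocabList` returns the vocabulary in arbitrary set order.

```python
import re

def text_parse(big_string):
    # Split on runs of non-word characters, drop tokens of length <= 2, lowercase the rest.
    return [tok.lower() for tok in re.split(r'\W+', big_string) if len(tok) > 2]

def create_vocab_list(documents):
    # Union of all tokens; sorted here only so the example is reproducible.
    vocab = set()
    for doc in documents:
        vocab |= set(doc)
    return sorted(vocab)

def set_of_words_to_vec(vocab_list, words):
    # 1 if the vocabulary word occurs in the document, else 0.
    present = set(words)
    return [1 if w in present else 0 for w in vocab_list]

docs = [text_parse("The metro is clean and fast."),
        text_parse("The signs are confusing!")]
vocab = create_vocab_list(docs)
vectors = [set_of_words_to_vec(vocab, d) for d in docs]
print(vocab)
print(vectors)
```

Each vector has one entry per vocabulary word, so every document maps to a vector of the same length regardless of how many words it contains.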

In [4]:
listPosts, listClasses = loadDataSet()
print listPosts, listClasses

[['best', 'subway', 'system', 'around'], ['excelent', 'way', 'move', 'round', 'the', 'city'], ['very', 'nice', 'network', 'and', 'working', 'stations'], ['information', 'personnel', 'are', 'very', 'helpfully', 'and', 'helps', 'you', 'get', 'right', 'location', 'and', 'right', 'ticket'], ['had', 'card', 'which', 'works', 'for', 'many', 'different', 'public', 'transportation', 'vehicles', 'even', 'bicycles'], ['you', 'buy', 'money', 'card', 'really', 'easy', 'use', 'you', 'add', 'much', 'you', 'want'], ['the', 'metro', 'clean', 'frequent', 'and', 'spacious'], ['easy', 'accessible', 'great', 'running', 'times', 'regular', 'clean', 'english', 'signage', 'safe', 'modern'], ['seoul', 'has', 'the', 'best', 'subway', 'system', 'the', 'world'], ['traveling', 'the', 'airport', 'simple'], ['hope', 'you', 'all', 'enjoy', 'yourself', 'much', 'did', 'using', 'the', 'public', 'transport', 'system'], ['was', 'easy', 'get', 'around', 'but', 'lot', 'lines'], ['highly', 'recommend', 'the', 'best', 'way',

In [5]:
myVocabList = createVocabList(listPosts)
print myVocabList

['limited', 'all', 'beware', 'lack', 'leads', 'commute', 'freakin', 'follow', 'chair', 'children', 'certainly', 'paris', 'young', 'languages', 'bicycles', 'asking', 'helps', 'cluttered', 'suffered', 'friendly', 'paying', 'straightforward', 'very', 'translations', 'connects', 'every', 'fall', 'malfunctioned', 'ticket', 'monkeys', 'minute', 'cool', 'tickets', 'exits', 'level', 'did', 'having', 'kiosks', 'try', 'convenience', 'mile', 'subway', 'guiding', 'small', 'study', 'settings', 'round', 'enjoy', 'tired', 'guidelines', 'japanese', 'sign', 'jump', 'street', 'design', 'traveling', 'machines', 'even', 'headset', 'what', 'stood', 'asia', 'magnifying', 'access', 'above', 'toll', 'new', 'learned', 'ever', 'public', 'filled', 'handicaped', 'transportation', 'stuffier', 'never', 'understanding', 'hours', 'tracks', 'slow', 'let', 'cab', 'directions', 'standard', 'personable', 'change', 'wait', 'passengers', 'great', 'northern', 'metro', 'luggage', 'involved', 'dispensers', 'experience', 'exit

In [6]:
print listPosts[0]
print setOfWords2Vec(myVocabList, listPosts[0])

['best', 'subway', 'system', 'around']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

In [7]:
print listPosts[150]
print setOfWords2Vec(myVocabList, listPosts[150])

['koreans', 'commute', 'lot', 'their', 'public', 'transportation', 'heavily', 'used']
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

In [8]:
trainMat = []
for postinDoc in listPosts:
    trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
print trainMat

[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

### 3-2-1. Naive Bayes (NaiveBayesian)
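
The classifier in this subsection can be summarized by the rule below, written to match the `trainNB0` and `classifyNB` code. The symbols are introduced here for illustration and do not appear in the notebook: $x_i \in \{0, 1\}$ is the set-of-words entry for vocabulary word $i$, $n_{i,c}$ is the number of class-$c$ training documents that contain word $i$, and $N_c = \sum_i n_{i,c}$.

$$
\hat{P}(w_i \mid c) = \frac{1 + n_{i,c}}{2 + N_c}, \qquad
s_c(x) = \log P(c) + \sum_i x_i \log \hat{P}(w_i \mid c)
$$

A document is labeled negative (1) when $s_1(x) > s_0(x)$ and positive (0) otherwise.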

In [44]:
"""
trainNB0
input: trainMatrix,trainCategory
output: p0Vect,p1Vect,pAbusive
"""
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory)/float(numTrainDocs)
    p0Num = ones(numWords); p1Num = ones(numWords)      # start counts at 1 so no word gets zero probability
    p0Denom = 2.0; p1Denom = 2.0                        # start denominators at 2.0 to go with the smoothed counts
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = log(p1Num/p1Denom)          # log-probabilities avoid underflow when many small values are combined
    p0Vect = log(p0Num/p0Denom)          # same for the positive class
    return p0Vect,p1Vect,pAbusive

# listPosts, listClasses = loadDataSet()
# myVocabList = createVocabList(listPosts)

# trainMat = []
# for postinDoc in listPosts:
#     trainMat.append(setOfWords2Vec(myVocabList, postinDoc))

# p0V,p1V,pAb=trainNB0(trainMat, listClasses)
# print pAb
# print p0V
# print p1V
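
To make the smoothing in `trainNB0` concrete, here is a minimal pure-Python sketch on a made-up three-word vocabulary. The function name `train_nb` and the toy data are assumptions for illustration; the arithmetic (counts starting at 1, denominators starting at 2) follows the code above.

```python
import math

def train_nb(train_matrix, train_labels):
    """Return (log P(w|0), log P(w|1), P(class 1)), smoothed as in trainNB0."""
    num_words = len(train_matrix[0])
    p_class1 = sum(train_labels) / float(len(train_labels))
    counts = {0: [1.0] * num_words, 1: [1.0] * num_words}  # counts start at 1
    totals = {0: 2.0, 1: 2.0}                              # denominators start at 2
    for row, label in zip(train_matrix, train_labels):
        for i, value in enumerate(row):
            counts[label][i] += value
        totals[label] += sum(row)
    log_p0 = [math.log(c / totals[0]) for c in counts[0]]
    log_p1 = [math.log(c / totals[1]) for c in counts[1]]
    return log_p0, log_p1, p_class1

# Toy data (made up): columns stand for the words ['clean', 'crowded', 'subway'].
toy_matrix = [[1, 0, 1],   # positive
              [1, 0, 0],   # positive
              [0, 1, 1]]   # negative
toy_labels = [0, 0, 1]

log_p0, log_p1, p_neg = train_nb(toy_matrix, toy_labels)
print([round(math.exp(v), 3) for v in log_p0])
print([round(math.exp(v), 3) for v in log_p1])
print(round(p_neg, 3))
```

For the positive class the word counts are 2, 0, and 1, so the smoothed probabilities are 3/5, 1/5, and 2/5. The word that never appears in a positive sentence still receives a non-zero probability.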

In [49]:
"""
classifyNB
input: vec2Classify, p0Vec, p1Vec, pClass1
output: 0(positive) or 1(negative)
"""
# Classify as 0 (positive) or 1 (negative)
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2Classify * p1Vec) + log(pClass1)    #element-wise mult
    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
    if p1 > p0:
        return 1
    else: 
        return 0

"""
testingNB
input: text
output: classifyNB(thisDoc,p0V,p1V,pAb)
"""    
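# Parse a text string, classify it with naive Bayes, then print and return the result.
# Note: the training files are reloaded and the model is retrained on every call.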
def testingNB(text):
    listPosts,listClasses = loadDataSet()
    myVocabList = createVocabList(listPosts)
    trainMat=[]
    for postinDoc in listPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))

    testEntry = textParse(text)
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print testEntry,'classified as: ', classifyNB(thisDoc,p0V,p1V,pAb)
    return classifyNB(thisDoc,p0V,p1V,pAb)


def testloadDataSet():
    postingList = []; testCount=0;
    # Read MetroPositiveTraining line by line and store each stripped line in postingList (testingNB parses it later).
    with open('data/metro/MetroPositiveTraining.txt') as mPosTrain:
        for i, line in enumerate(mPosTrain):
            currLine = line.strip()
            postingList.append(currLine)
            testCount += 1
    classVec = list(zeros(150))    #0: Positive     
    # Read MetroNegativeTraining line by line and append each stripped line to postingList.
    with open('data/metro/MetroNegativeTraining.txt') as mNegTrain:
        for i, line in enumerate(mNegTrain):
            currLine = line.strip()
            postingList.append(currLine)
            testCount += 1
    classVec.extend(ones(150))    #1: Negative
    errorCount = 0.0
    for i in range(0, testCount):
        if testingNB(postingList[i]) != int(classVec[i]):
            errorCount += 1
        print 'origin class: ', int(classVec[i]), '  classified as: ', testingNB(postingList[i])
    print 'the error rate is:', errorCount/testCount    # divide by the sample count to report a rate, not a raw count

testloadDataSet()

['best', 'subway', 'system', 'around'] classified as:  0
origin class:  0   classified as:  ['best', 'subway', 'system', 'around'] classified as:  0
0
['excelent', 'way', 'move', 'round', 'the', 'city'] classified as:  0
origin class:  0   classified as:  ['excelent', 'way', 'move', 'round', 'the', 'city'] classified as:  0
0
['very', 'nice', 'network', 'and', 'working', 'stations'] classified as:  0
origin class:  0   classified as:  ['very', 'nice', 'network', 'and', 'working', 'stations'] classified as:  0
0
['information', 'personnel', 'are', 'very', 'helpfully', 'and', 'helps', 'you', 'get', 'right', 'location', 'and', 'right', 'ticket'] classified as:  0
origin class:  0   classified as:  ['information', 'personnel', 'are', 'very', 'helpfully', 'and', 'helps', 'you', 'get', 'right', 'location', 'and', 'right', 'ticket'] classified as:  0
0
['had', 'card', 'which', 'works', 'for', 'many', 'different', 'public', 'transportation', 'vehicles', 'even', 'bicycles'] classified as:  0
or

In [24]:
def testData():
    testList = []; testClass = []; testCount = 0
    with open('data/metro/MetroTestSet.txt') as mTest:
        for i, line in enumerate(mTest):
            currLine = line.strip().split('\t')
            testClass.append(currLine[0])    # append the label string; extend() would split a multi-character label
            testList.append(currLine[1])
            testCount += 1         
    errorCount = 0.0
    for i in range(0, testCount):
        if testingNB(testList[i]) != int(testClass[i]):
            errorCount += 1
        print 'origin class: ', int(testClass[i]), '  classified as: ', testingNB(testList[i])
    print 'the error rate is:', errorCount/testCount
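
The loop above calls `testingNB` twice per sentence, and `testingNB` reloads the training files and retrains on every call. A minimal sketch of an alternative is shown below: the evaluation function takes any classifier as a callable and classifies each sample exactly once. The names `error_rate` and `toy_classify`, and the sample sentences, are hypothetical; in the notebook the callable would wrap `classifyNB` with probability vectors trained once, outside the loop.

```python
def error_rate(classify, samples, labels):
    """Fraction of samples whose predicted label differs from the true label."""
    errors = 0
    for sample, label in zip(samples, labels):
        if classify(sample) != label:   # each sample is classified exactly once
            errors += 1
    return errors / float(len(samples))

# Hypothetical stand-in classifier: labels a sentence negative (1) if it
# contains the word "crowded", positive (0) otherwise.
def toy_classify(sentence):
    return 1 if "crowded" in sentence.lower() else 0

samples = ["Clean and fast", "Too crowded at rush hour", "Confusing transfers"]
labels = [0, 1, 1]
rate = error_rate(toy_classify, samples, labels)
print(rate)
```

Separating training from evaluation this way keeps the cost of a test run proportional to the number of test sentences rather than to the number of test sentences times the training-set size.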

In [21]:
import sentiment
sentiment.testData()

origin class:  0   classified as:  0
origin class:  0   classified as:  0
origin class:  0   classified as:  0
origin class:  0   classified as:  0
origin class:  0   classified as:  0
origin class:  0   classified as:  0
origin class:  0   classified as:  0
origin class:  0   classified as:  0
origin class:  0   classified as:  0
origin class:  0   classified as:  0
origin class:  0   classified as:  0
origin class:  0   classified as:  0
origin class:  0   classified as:  0
origin class:  0   classified as:  1
origin class:  0   classified as:  1
origin class:  1   classified as:  1
origin class:  1   classified as:  1
origin class:  1   classified as:  1
origin class:  1   classified as:  1
origin class:  1   classified as:  0
origin class:  1   classified as:  1
origin class:  1   classified as:  1
origin class:  1   classified as:  1
origin class:  1   classified as:  0
origin class:  1   classified as:  1
origin class:  1   classified as:  0
origin class:  1   classified as:  0
o

### 3-2-2. SVM (LIBSVM)
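
LIBSVM's Python interface accepts each instance either as a list or as a dictionary that maps a feature index to its value. Because the set-of-words vectors are mostly zeros, a dictionary form may be more compact. The sketch below shows one way to convert a vector. The helper name `to_sparse` and the 1-based indexing are assumptions for illustration, and the notebook itself passes dense lists.

```python
def to_sparse(dense_vector):
    """Convert a dense 0/1 list into {index: value}, keeping non-zero entries only.

    Indices start at 1 here (an assumption, following the common LIBSVM
    data-file convention).
    """
    return {i + 1: v for i, v in enumerate(dense_vector) if v != 0}

dense = [0, 0, 1, 0, 1]
sparse = to_sparse(dense)
print(sparse)
```

If this form is used, the training and test vectors should be converted the same way so that feature indices line up between the two sets.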

In [9]:
"""
testData
output: testList, testClass
"""
def testData():
    testList = []; testClass = []; testCount = 0
    with open('data/metro/MetroTestSet.txt') as mTest:
        for i, line in enumerate(mTest):
            currLine = line.strip().split('\t')
            testClass.append(currLine[0])    # append the label string; extend() would split a multi-character label
            testList.append(textParse(currLine[1]))
    testClass = map(int,testClass)
            
    return testList, testClass

In [10]:
testPosts, testClasses = testData()
print testPosts, testClasses

[['seoul', 'subway', 'station', 'always', 'clean', 'tidy', 'and', 'safe'], ['think', 'favorite', 'thing', 'about', 'the', 'seoul', 'subway', 'system', 'that', 'every', 'stop', 'has', 'clean'], ['the', 'seoul', 'subway', 'user', 'friendly', 'that', 'travel', 'across', 'the', 'city', 'easy'], ['the', 'seoul', 'south', 'korean', 'subway', 'transit', 'system', 'elaborate', 'and', 'amazing'], ['the', 'metro', 'clean', 'frequent', 'and', 'spacious'], ['trains', 'run', 'frequently', 'and', 'are', 'very', 'clean', 'well', 'thought', 'out', 'navigation', 'aids', 'such', 'maps', 'latin', 'well', 'korean', 'alphabet', 'clearly', 'marked', 'neighborhood', 'maps', 'guide', 'travelers', 'the', 'correct', 'exit'], ['clean', 'modern', 'signs', 'english', 'lots', 'shops', 'easy', 'negotiate'], ['take', 'take', 'the', 'subway', 'matter', 'where', 'you', 'are', 'going', 'friendly', 'easy', 'safe', 'understandable', 'everything', 'korean', 'and', 'english', 'and', 'takes', 'you', 'almost', 'anywhere'], ['

In [11]:
print testPosts[0]
print setOfWords2Vec(myVocabList, testPosts[0])

['seoul', 'subway', 'station', 'always', 'clean', 'tidy', 'and', 'safe']
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [12]:
testMat = []
for postinTest in testPosts:
    testMat.append(setOfWords2Vec(myVocabList, postinTest))
print testMat

[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

In [23]:
from svmutil import *
svm_model.predict = lambda self, x: svm_predict([0], [x], self)[0][0]

y, x = listClasses, trainMat
prob = svm_problem(y,x)

param = svm_parameter()
param.kernel_type = LINEAR
param.C = 10

m=svm_train(prob, param)
b, a = listClasses, trainMat
svm_predict(b, a, m)

Accuracy = 99.6667% (299/300) (classification)


([0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  1.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,

In [13]:
from svmutil import *
svm_model.predict = lambda self, x: svm_predict([0], [x], self)[0][0]

y, x = listClasses, trainMat
prob = svm_problem(y,x)

param = svm_parameter()
param.kernel_type = LINEAR
param.C = 10

m=svm_train(prob, param)
b, a = testClasses, testMat
svm_predict(b, a, m)

Accuracy = 80% (24/30) (classification)


([0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  1.0,
  1.0,
  1.0,
  1.0,
  1.0,
  0.0,
  1.0,
  1.0,
  1.0,
  1.0,
  0.0,
  1.0,
  1.0,
  0.0,
  1.0,
  0.0,
  1.0],
 (80.0, 0.2, 0.3665158371040724),
 [[1.310556844017037],
  [0.9049970563069931],
  [1.1608765244912975],
  [1.2524875066997363],
  [1.000103600831546],
  [0.6728952701571039],
  [1.8660494605933584],
  [2.2580585686391585],
  [2.0980886481847816],
  [0.6994429353857271],
  [0.9197377947879128],
  [0.6652027578887305],
  [2.1055947255716054],
  [-0.5120033900457454],
  [-0.32643545767192406],
  [-0.017223763476428344],
  [-1.864921515905189],
  [-0.3255372499785081],
  [0.061776825180906825],
  [-0.637571089762671],
  [-2.4802959437413583],
  [-1.0594252669291542],
  [-1.4400728783034145],
  [0.36055165746799356],
  [-0.47363401500462465],
  [-0.3844877473389071],
  [0.308852513364704],
  [-1.2651227880775664],
  [0.2369424906396983],
  [-1.0000150712752116]])
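
To see where the test errors fall, the predicted labels printed above can be tabulated against the true labels. This sketch assumes the test file lists 15 positive sentences followed by 15 negative ones. That ordering is inferred from the `origin class` lines printed in the naive Bayes test and is not stated explicitly in the notebook.

```python
# Predicted labels copied from the svm_predict output above (0: positive, 1: negative).
predicted = [0]*13 + [1]*5 + [0] + [1]*4 + [0] + [1, 1] + [0] + [1] + [0] + [1]
# Assumed true labels: 15 positive, then 15 negative (inferred, see the text above).
actual = [0]*15 + [1]*15

def confusion_matrix(actual, predicted):
    """Return counts[(true_label, predicted_label)] for binary labels 0/1."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts

counts = confusion_matrix(actual, predicted)
accuracy = (counts[(0, 0)] + counts[(1, 1)]) / float(len(actual))
print(counts)
print(accuracy)
```

Under that assumption the sketch reproduces the reported 24/30: 2 positive sentences are predicted as negative and 4 negative sentences are predicted as positive.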

### REFERENCES
* [1] 홍석기. (2009). 서울, 과연 외국인 친화 도시인가? [Seoul: Is it really a foreigner-friendly city?]. 정책리포트 [Policy Report], (32), 1-21.
* [2] Lee, H. K., & Chun, B. J. (2011). The Relationship Between Leisure Experience and Job-attitude and Organizational Citizenship Behavior; The Mediating Effect of Positive Emotion. Journal of the Korea Academia-Industrial cooperation Society, 12(3), 1188-1196.
* [3] Seoul Metro (South Korea): Top Tips Before You Go - TripAdvisor. Retrieved from https://www.tripadvisor.co.kr/Attraction_Review-g294197-d2194168-Reviews-Seoul_Metro-Seoul.html