# Sentiment Analysis Example: Twitter Sentiment Analysis
## 1. Baseline: TextBlob
(No preprocessing is applied to the data.)

In [1]:
import pandas as pd
from textblob import TextBlob
test=pd.read_csv('files/data/python46-data/test_tweets_anuFYb8.csv')
test['label']=test['tweet'].apply(lambda x:1 if TextBlob(x).sentiment[0]<0 else 0)

In [2]:
test['label'].head()

0    0
1    0
2    0
3    0
4    0
Name: label, dtype: int64

In [3]:
test[['id','label']].to_csv("files/data/python46-data/test_predictions.csv",index=False)

Score after submission: 0.1693121693

![Result 1](https://i.loli.net/2018/03/10/5aa36a4761f80.png)
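
The labelling rule used in cell In [1] can be looked at separately from TextBlob. The sketch below uses a tiny hand-made lexicon (`TOY_LEXICON`, a hypothetical stand-in with made-up scores) in place of a real polarity scorer. It is not TextBlob's algorithm; it only illustrates that a polarity below 0 maps to label 1 and everything else, including a neutral 0.0, maps to label 0.

```python
# Toy stand-in for a polarity scorer; the lexicon and its scores are made up.
TOY_LEXICON = {"love": 0.5, "great": 0.8, "hate": -0.8, "awful": -1.0}

def toy_polarity(text):
    """Mean lexicon score of the known words; 0.0 if no word is known."""
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

def label_from_polarity(polarity):
    """Same rule as the notebook: 1 when polarity is negative, else 0."""
    return 1 if polarity < 0 else 0

tweets = ["I love this great day", "I hate this awful traffic", "see you at noon"]
labels = [label_from_polarity(toy_polarity(t)) for t in tweets]
print(labels)  # [0, 1, 0]
```

A tweet with no recognised word gets polarity 0.0 and therefore label 0, so a scorer with limited coverage leans toward label 0 under this rule.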

## 2. Preprocessing + TextBlob

In [6]:
import pandas as pd
from textblob import TextBlob,Word
from nltk.corpus import stopwords
test=pd.read_csv('files/data/python46-data/test_tweets_anuFYb8.csv')
# 1. Lowercase
test['tweet']=test['tweet'].apply(lambda x:" ".join([word.lower() for word in x.split()]))
# 2. Remove punctuation (regex=True is needed on pandas >= 2.0, where the default is a literal match)
test['tweet']=test['tweet'].str.replace(r'[^\w\s]','',regex=True)
# 3. Remove stop words
stop=set(stopwords.words('english'))
test['tweet']=test['tweet'].apply(lambda x:" ".join(word for word in x.split() if word not in stop))
# 4. Remove the 10 most frequent words (the words are the index of the value_counts result)
freq=pd.Series(' '.join(test['tweet']).split()).value_counts()[:10]
test['tweet']=test['tweet'].apply(lambda x:" ".join(word for word in x.split() if word not in freq.index))
# 5. Remove the 10 rarest words
rare=pd.Series(' '.join(test['tweet']).split()).value_counts()[-10:]
test['tweet']=test['tweet'].apply(lambda x:" ".join(word for word in x.split() if word not in rare.index))
# 6. Lemmatization
test['tweet']=test['tweet'].apply(lambda x:" ".join([Word(word).lemmatize() for word in x.split()]))

In [7]:
test['tweet'].head()

0    studiolife aislife requires passion dedication...
1    white supremacist want everyone see new birdsâ...
2     safe way heal acne altwaystoheal healthy healing
3    hp cursed child book reservation already yes ð...
4    3rd bihday amazing hilarious nephew eli ahmir ...
Name: tweet, dtype: object
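
Steps 4 and 5 of the preprocessing cell drop words by corpus frequency. The following stdlib-only sketch shows the same idea on a toy corpus; `drop_extreme_words` and its parameters are hypothetical names, and unlike the notebook it computes both word sets from a single count instead of recounting after the frequent words are removed.

```python
from collections import Counter

def drop_extreme_words(docs, n_top=1, n_rare=1):
    """Remove the n_top most frequent and n_rare least frequent words."""
    counts = Counter(word for doc in docs for word in doc.split())
    ranked = [word for word, _ in counts.most_common()]  # most frequent first
    to_drop = set(ranked[:n_top]) | set(ranked[-n_rare:])
    return [" ".join(w for w in doc.split() if w not in to_drop) for doc in docs]

docs = ["good good day", "good day out", "good morning day out"]
print(drop_extreme_words(docs))  # ['day', 'day out', 'day out']
```

On real data many words occur exactly once, so which ones land in the "rarest 10" depends on tie ordering; a minimum-count threshold may be a more predictable choice.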

In [8]:
test['label']=test['tweet'].apply(lambda x:1 if TextBlob(x).sentiment[0]<0 else 0)
test.head(10)

Unnamed: 0,id,tweet,label
0,31963,studiolife aislife requires passion dedication...,0
1,31964,white supremacist want everyone see new birdsâ...,0
2,31965,safe way heal acne altwaystoheal healthy healing,0
3,31966,hp cursed child book reservation already yes ð...,0
4,31967,3rd bihday amazing hilarious nephew eli ahmir ...,0
5,31968,choose momtips,0
6,31969,something inside dy ððâ eye ness smokeyeyes ti...,1
7,31970,finishedtattooinkedinkloveitâï âïâïâïâï thanks...,0
8,31971,never understand dad left young deep inthefeels,0
9,31972,delicious food lovelife capetown mannaepicure ...,0


In [9]:
test[['id','label']].to_csv("files/data/python46-data/test_predictions.csv",index=False)

Score after submission: 0.1739130435

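## 3. Sentiment Classification with SVM and Word2Vec
We randomly sample positive and negative examples from the text data and split them into training and test sets at a ratio of 8:2. We then build a Word2Vec model on the training set; the classifier's input for each tweet is the average of the vectors of all words in that tweet (the code below takes a plain, unweighted mean). Word2Vec comes from the Python library gensim, and the SVM classifier from sklearn.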
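
As a minimal illustration of turning a tweet into one fixed-length input, the sketch below averages per-word vectors in plain Python. The embeddings dictionary is a hypothetical 2-dimensional toy, not a trained Word2Vec model, and `mean_vector` is an illustrative name.

```python
def mean_vector(tokens, embeddings, dim):
    """Average the vectors of the tokens found in `embeddings`.

    Unknown tokens are skipped; a tweet with no known token maps to zeros.
    """
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            continue
        total = [t + v for t, v in zip(total, vec)]
        known += 1
    return [t / known for t in total] if known else total

# Hypothetical 2-dimensional embeddings, for illustration only.
toy_embeddings = {"happy": [1.0, 0.0], "day": [0.0, 1.0]}
print(mean_vector(["happy", "day", "unknownword"], toy_embeddings, dim=2))  # [0.5, 0.5]
```

Averaging discards word order, so two tweets containing the same words in a different order receive the same vector.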
### 3.1 Load the file, preprocess the data, and tokenize

In [39]:
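import nltk
# sklearn.cross_validation is gone from recent scikit-learn releases; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
from gensim.models.word2vec import Word2Vec
import numpy as np
import pandas as pd
# sklearn.externals.joblib is gone from recent scikit-learn releases; use the standalone joblib package
import joblib
from sklearn.svm import SVC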

In [92]:
data=pd.read_csv('files/data/python46-data/train_E6oV3lV.csv')
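# 1. Lowercase
data['tweet']=data['tweet'].apply(lambda x:" ".join([word.lower() for word in x.split()]))
# 2. Strip the '#' and '@' symbols (chain .str.replace; a plain Series.replace only matches whole cell values)
data['tweet']=data['tweet'].str.replace('#','',regex=False).str.replace('@','',regex=False)

# 3. Remove the 10 most frequent words (the words are the index of the value_counts result)
freq=pd.Series(' '.join(data['tweet']).split()).value_counts()[:10]
data['tweet']=data['tweet'].apply(lambda x:" ".join(word for word in x.split() if word not in freq.index))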


In [93]:
neg=data[data['label']==0]
neg.head()

Unnamed: 0,id,label,tweet
0,1,0,when father is dysfunctional is so selfish he ...
1,2,0,thanks lyft credit can't use cause they don't ...
2,3,0,bihday your majesty
3,4,0,model love u take with u all time urð±!!! ð...
4,5,0,factsguide: society now motivation


In [94]:
pos=data[data['label']==1]
pos.head()

Unnamed: 0,id,label,tweet
13,14,1,cnn calls michigan middle school 'build wall' ...
14,15,1,no comment! australia opkillingbay seashepherd...
17,18,1,retweet if agree!
23,24,1,lumpy says am . prove it lumpy.
34,35,1,it's unbelievable that 21st century we'd need ...


In [96]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the filtered slices
pos=pos.assign(words=pos['tweet'].apply(nltk.word_tokenize))
neg=neg.assign(words=neg['tweet'].apply(nltk.word_tokenize))


In [97]:
y = np.concatenate((np.ones(len(pos)), np.zeros(len(neg))))
x_train, x_test, y_train, y_test = train_test_split(np.concatenate((pos['words'], neg['words'])), y, test_size=0.2)
np.save('files/data/python46-data/svm_data/y_train.npy',y_train)
np.save('files/data/python46-data/svm_data/y_test.npy',y_test)
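
`train_test_split` is called above without its `stratify` argument, so the class proportions in the two parts are left to chance. If one class is much rarer than the other, a stratified split keeps the ratio the same on both sides. The stdlib-only sketch below illustrates the idea on made-up labels; `stratified_split` is a hypothetical helper, not the scikit-learn implementation.

```python
import random

def stratified_split(items, labels, test_size=0.2, seed=0):
    """Split items so each class contributes the same fraction to the test set."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idxs):
        return [items[i] for i in idxs], [labels[i] for i in idxs]

    return pick(train_idx), pick(test_idx)

# Made-up, imbalanced toy data: 10 tweets with label 1 and 40 with label 0.
labels = [1] * 10 + [0] * 40
items = [f"tweet{i}" for i in range(50)]
(x_tr, y_tr), (x_te, y_te) = stratified_split(items, labels)
print(len(x_tr), len(x_te), sum(y_tr), sum(y_te))  # 40 10 8 2
```

With scikit-learn itself, passing `stratify=y` to `train_test_split` should give the same kind of split.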

### 3.2 Compute word vectors and average them per tweet to form the classifier input

In [98]:
# Average all word vectors in a sentence to produce one sentence vector
def build_sentence_vector(text, size,imdb_w2v):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in text:
        try:
            # Look vectors up through .wv; indexing the model object directly fails on gensim >= 4.0
            vec += imdb_w2v.wv[word].reshape((1, size))
            count += 1.
        except KeyError:
            # Skip words that are not in the Word2Vec vocabulary
            continue
    if count != 0:
        vec /= count
    return vec

# Compute the word vectors
def get_train_vecs(x_train, x_test):
    n_dim = 300
    # Build the vocabulary and train the model on the training set
    # (parameter names are for gensim >= 4.0; on gensim 3.x use size= and .iter instead)
    imdb_w2v = Word2Vec(list(x_train), vector_size=n_dim, min_count=10)

    train_vecs = np.concatenate([build_sentence_vector(z, n_dim, imdb_w2v) for z in x_train])
    # train_vecs = scale(train_vecs)

    np.save('files/data/python46-data/svm_data/train_vecs.npy', train_vecs)
    print(train_vecs.shape)
    # Continue training on the test tweets (the vocabulary is not extended);
    # total_examples must be the number of sentences passed to this call
    imdb_w2v.train(list(x_test),total_examples=len(x_test),epochs=imdb_w2v.epochs)

    imdb_w2v.save('files/data/python46-data/svm_data/w2v_model/w2v_model.pkl')
    # Build the test tweet vectors
    test_vecs = np.concatenate([build_sentence_vector(z, n_dim, imdb_w2v) for z in x_test])
    # test_vecs = scale(test_vecs)
    np.save('files/data/python46-data/svm_data/test_vecs.npy', test_vecs)
    print(test_vecs.shape)
    
def get_data():
    train_vecs=np.load('files/data/python46-data/svm_data/train_vecs.npy')
    y_train=np.load('files/data/python46-data/svm_data/y_train.npy')
    test_vecs=np.load('files/data/python46-data/svm_data/test_vecs.npy')
    y_test=np.load('files/data/python46-data/svm_data/y_test.npy')
    return train_vecs,y_train,test_vecs,y_test

# Train the SVM model
def svm_train(train_vecs,y_train,test_vecs,y_test):
    clf=SVC(kernel='rbf',verbose=True)
    clf.fit(train_vecs,y_train)
    joblib.dump(clf, 'files/data/python46-data/svm_data/svm_model/model.pkl')
    print(clf.score(test_vecs,y_test))


# Build the vector for a sentence to be predicted

def get_predict_vecs(words):
    n_dim = 300
    imdb_w2v = Word2Vec.load('files/data/python46-data/svm_data/w2v_model/w2v_model.pkl')
    return build_sentence_vector(words, n_dim,imdb_w2v)

# Predict the sentiment of a single sentence

def svm_predict(string):
    words=nltk.word_tokenize(string)
    words_vecs=get_predict_vecs(words)
    clf=joblib.load('files/data/python46-data/svm_data/svm_model/model.pkl')
    result=int(clf.predict(words_vecs)[0])
    # Print the verdict and also return the numeric label so callers can use it
    print(string,'pos' if result==1 else 'neg')
    return result

In [90]:
# x_train,x_test = load_file_and_preprocessing()
get_train_vecs(x_train,x_test)
train_vecs,y_train,test_vecs,y_test = get_data()
svm_train(train_vecs,y_train,test_vecs,y_test)


string='use the power of your mind to #heal your body!! '
svm_predict(string)

(25569, 300)
(6393, 300)
[LibSVM]0.929923353668
use the power of your mind to #heal your body!!  pos
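
`clf.score` reports plain accuracy. When the labels are imbalanced, accuracy alone can look strong even for a classifier that never predicts the rare class. The sketch below uses made-up labels (not the notebook's data) to compare accuracy with F1 for such a classifier; both helper functions are illustrative, stdlib-only implementations.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score_binary(y_true, y_pred, positive=1):
    """F1 of the `positive` class; 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up example: 7 tweets with label 1, 93 with label 0.
y_true = [1] * 7 + [0] * 93
always_zero = [0] * 100  # a classifier that ignores its input

print(accuracy(y_true, always_zero))         # 0.93
print(f1_score_binary(y_true, always_zero))  # 0.0
```

Checking the label distribution of the training data and reporting a per-class metric alongside accuracy helps to tell a useful model from a majority-class guesser.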


In [91]:
test=pd.read_csv('files/data/python46-data/test_tweets_anuFYb8.csv')
# Load both models once and predict all tweets in one batch instead of reloading them per tweet
w2v=Word2Vec.load('files/data/python46-data/svm_data/w2v_model/w2v_model.pkl')
clf=joblib.load('files/data/python46-data/svm_data/svm_model/model.pkl')
test_vecs=np.concatenate([build_sentence_vector(nltk.word_tokenize(t),300,w2v) for t in test['tweet']])
test['label']=clf.predict(test_vecs).astype(int)
test.head()
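
`svm_predict` reloads the Word2Vec model and the SVM from disk on every call, which becomes slow when it is applied to every tweet in a file. One possible remedy is to cache the loader so the expensive step runs only once. The sketch below shows the pattern with `functools.lru_cache`; `load_model`, `predict_one`, the path, and the dummy prediction are all hypothetical placeholders standing in for the real loading and prediction code.

```python
from functools import lru_cache

load_calls = 0  # counts how often the expensive loader actually runs

@lru_cache(maxsize=None)
def load_model(path):
    """Stand-in for an expensive load such as joblib.load(path)."""
    global load_calls
    load_calls += 1
    return {"path": path}  # placeholder object, not a real model

def predict_one(text):
    model = load_model("model.pkl")  # hypothetical path; cached after the first call
    return len(text) % 2  # dummy prediction so the sketch runs without a model

predictions = [predict_one(t) for t in ["a", "bb", "ccc"]]
print(predictions, load_calls)  # [1, 0, 1] 1
```

The loader body runs once no matter how many tweets are scored; loading the models once outside the loop and passing them in as arguments achieves the same effect.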