# Text Feature Construction

Features are built from the individual words in a text: each distinct word becomes one feature column.

## Bag of Words

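The collection of all documents is the corpus, and the distinct words found in it form the vocabulary. Each document is then converted into a numeric vector that counts how often each vocabulary word occurs in it.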
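To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words encoding. The toy corpus and the helper `to_count_vector` are made up for illustration; scikit-learn's `CountVectorizer` tokenizes differently and returns a sparse matrix.

```python
from collections import Counter

# Toy corpus (illustrative only, not taken from the tweets dataset)
corpus = ["the cat sat", "the dog sat down", "the cat ran"]

# Vocabulary: the sorted distinct words across the whole corpus
vocabulary = sorted({word for doc in corpus for word in doc.split()})

def to_count_vector(doc, vocabulary):
    """Return the word counts of doc, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# One row per document, one column per vocabulary word
matrix = [to_count_vector(doc, vocabulary) for doc in corpus]

print(vocabulary)  # ['cat', 'dog', 'down', 'ran', 'sat', 'the']
for row in matrix:
    print(row)
```

Every row has the same length as the vocabulary, which is why the feature count grows with the number of distinct words in the corpus.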

In [1]:
import pandas as pd

In [2]:
# Load the data
tweets = pd.read_csv('./data/twitter_sentiment.csv', encoding='latin1')
tweets.head()

Unnamed: 0,ItemID,Sentiment,SentimentText
0,1,0,is so sad for my APL frie...
1,2,0,I missed the New Moon trail...
2,3,1,omg its already 7:30 :O
3,4,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,5,0,i think mi bf is cheating on me!!! ...


In [3]:
# Drop the unneeded ItemID column
del tweets['ItemID']
tweets.head()

Unnamed: 0,Sentiment,SentimentText
0,0,is so sad for my APL frie...
1,0,I missed the New Moon trail...
2,1,omg its already 7:30 :O
3,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,0,i think mi bf is cheating on me!!! ...


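### CountVectorizer
- stop_words: removes stop words
- min_df: ignores terms whose document frequency is below the threshold
- max_df: ignores terms whose document frequency is above the threshold. For example, 0.8 drops terms that appear in more than 80% of the documents
- ngram_range: the (min_n, max_n) range of n-gram lengths used as features
- analyzer: whether features are built from words or from characters; a custom callable can also be passed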
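The `min_df` and `max_df` thresholds are easiest to understand through document frequency, the share of documents a word appears in. The sketch below is a simplified pure-Python illustration that handles proportions only; the function name `filter_by_document_frequency` and the toy documents are hypothetical, and this is not scikit-learn's implementation.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df=0.0, max_df=1.0):
    """Keep words whose document-frequency proportion lies in [min_df, max_df]."""
    n_docs = len(docs)
    # Count each word once per document it appears in
    doc_freq = Counter(word for doc in docs for word in set(doc.split()))
    return sorted(
        word for word, df in doc_freq.items()
        if min_df <= df / n_docs <= max_df
    )

docs = ["good movie", "good plot", "good cast", "good bad ending", "rare gem"]

# "good" appears in 4 of 5 documents, every other word in 1 of 5
print(filter_by_document_frequency(docs, max_df=0.5))  # everything except 'good'
print(filter_by_document_frequency(docs, min_df=0.5))  # ['good']
```

A high `min_df` keeps only very common words, which is why it shrinks the vocabulary so sharply, while `max_df` only removes words that are nearly everywhere.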

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

In [5]:
X = tweets['SentimentText']
y = tweets['Sentiment']

In [6]:
vect = CountVectorizer()
_ = vect.fit_transform(X)
print(_.shape)

(99989, 105849)


In [7]:
# Remove English stop words
vect = CountVectorizer(stop_words='english')
_ = vect.fit_transform(X)
print(_.shape)

(99989, 105545)


In [8]:
# min_df: keep only terms that appear in at least 5% of the documents
vect = CountVectorizer(min_df=.05)
_ = vect.fit_transform(X)
print(_.shape)

(99989, 31)


In [9]:
# max_df: drop terms that appear in more than 80% of the documents
vect = CountVectorizer(max_df=.8)
_ = vect.fit_transform(X)
print(_.shape)

(99989, 105849)


In [10]:
# n-grams: use phrases of 1 to 5 words as features
vect = CountVectorizer(ngram_range=(1, 5))
_ = vect.fit_transform(X)
print(_.shape)

(99989, 3219557)
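The feature count explodes because every contiguous run of 1 to 5 words becomes its own feature. A minimal sketch of word n-gram generation follows; the helper `word_ngrams` is hypothetical and uses a plain whitespace split rather than scikit-learn's tokenizer.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sequence
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "i love this movie".split()

print(word_ngrams(tokens, (1, 2)))
# ['i', 'love', 'this', 'movie', 'i love', 'love this', 'this movie']
print(len(word_ngrams(tokens, (1, 5))))  # 10
```

A four-word text already yields 10 n-grams for the range (1, 5), and most longer phrases are unique to a single document, so the vocabulary grows quickly.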


In [11]:
# analyzer='word' builds features from words; the shape matches the default vectorizer
vect = CountVectorizer(analyzer='word')
_ = vect.fit_transform(X)
print(_.shape)

(99989, 105849)


### Custom stemming analyzer

In [12]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')

# Check how a word is reduced to its stem
stemmer.stem('interesting')

'interest'

In [13]:
# Define a tokenizer that splits on spaces and stems every word
def word_tokenize(text):
    words = text.split(' ')
    return [stemmer.stem(word) for word in words]

In [14]:
# Try it on a sample sentence
word_tokenize("hello you are very interesting")

['hello', 'you', 'are', 'veri', 'interest']

In [15]:
# Pass the tokenizer as the analyzer argument
vect = CountVectorizer(analyzer=word_tokenize)
_ = vect.fit_transform(X)
print(_.shape)

(99989, 154397)


## TF-IDF vectorizer

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [17]:
# Plain CountVectorizer: print the shape and the mean of the first row
vect = CountVectorizer()
_ = vect.fit_transform(X)
print(_.shape, _[0,:].mean())

(99989, 105849) 6.613194267305311e-05


In [18]:
# TF-IDF vectors: the shape is the same as above, but the values differ
vect = TfidfVectorizer()
_ = vect.fit_transform(X)
print(_.shape, _[0,:].mean())

(99989, 105849) 2.1863060975751192e-05
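The values differ because TF-IDF rescales each count by how rare the word is across documents. The sketch below is a pure-Python illustration that assumes what I understand to be `TfidfVectorizer`'s default behaviour: smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, rows) with smoothed idf and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    # Smoothed inverse document frequency (assumed default formula)
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in weights))
        rows.append([x / norm if norm else 0.0 for x in weights])
    return vocabulary, rows

docs = ["the cat sat", "the dog sat", "the bird flew"]
vocabulary, rows = tfidf_vectors(docs)

first = dict(zip(vocabulary, rows[0]))
# In the first document, the rarer the word, the larger its weight
print(first["cat"] > first["sat"] > first["the"])  # True
```

Because every row is scaled to unit length, individual entries are at most 1, which is consistent with the smaller row mean printed for the TF-IDF matrix.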


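## Using text features in a machine learning pipeline

When there are many features, use an efficient classifier such as Naive Bayes.

Steps:
1. Convert the text into a feature representation
2. Use Naive Bayes to classify the sentiment as positive or negative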
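The two steps above can be sketched without any library. This is a simplified multinomial Naive Bayes with Laplace smoothing, written for illustration only; `train_nb`, `predict_nb`, and the toy sentences are hypothetical and do not reproduce `MultinomialNB` exactly.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocabulary = {w for doc in docs for w in doc.split()}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_counts[label] / len(labels))
        # Laplace smoothing keeps unseen class/word pairs from getting zero probability
        log_likelihood = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocabulary)))
            for w in vocabulary
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        # Words never seen during training are skipped
        scores[label] = log_prior + sum(
            log_likelihood[w] for w in doc.split() if w in log_likelihood
        )
    return max(scores, key=scores.get)

docs = ["love this movie", "great fun", "hate this movie", "boring and sad"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative
model = train_nb(docs, labels)

print(predict_nb(model, "great movie"))   # 1
print(predict_nb(model, "boring movie"))  # 0
```

Training is a single counting pass over the data and prediction is a sum of looked-up log probabilities, which is why this family of models stays fast even with very many features.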

In [19]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [20]:
# Null accuracy: the accuracy of always predicting the majority class
y.value_counts(normalize=True)

1    0.564632
0    0.435368
Name: Sentiment, dtype: float64
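In the output above, always predicting class 1 would be right about 56.5% of the time, so that is the baseline a model has to beat. A minimal sketch of the same calculation on a toy label list (the helper `null_accuracy` is hypothetical):

```python
from collections import Counter

def null_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

labels = [1, 1, 1, 0, 0]
print(null_accuracy(labels))  # 0.6
```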

In [21]:
# Define the parameter grid for the pipeline
pipe_params = {'vect__ngram_range':[(1,1), (1,2)], 'vect__max_features':[1000, 10000], 'vect__stop_words':[None, 'english']}

In [22]:
# Instantiate the pipeline: build the features first, then classify
pipe = Pipeline([('vect', CountVectorizer()), ('classify', MultinomialNB())])

In [23]:
# Grid search
grid = GridSearchCV(pipe, pipe_params)
grid.fit(X, y)

print(grid.best_score_, grid.best_params_)

0.7558931564507154 {'vect__max_features': 10000, 'vect__ngram_range': (1, 2), 'vect__stop_words': None}
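A grid search tries every combination of the listed values, so its cost grows multiplicatively with each added parameter. The sketch below enumerates the combinations of the grid used above with `itertools.product`; the helper `expand_grid` is hypothetical and only mimics the enumeration, not the cross-validated fitting.

```python
from itertools import product

# Same grid as in the notebook; 'vect__' routes each setting to the 'vect' step
pipe_params = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__max_features': [1000, 10000],
    'vect__stop_words': [None, 'english'],
}

def expand_grid(param_grid):
    """Return one dict per combination of parameter values."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

candidates = expand_grid(pipe_params)
print(len(candidates))  # 8
print(candidates[0])
```

Three parameters with two values each give 2 × 2 × 2 = 8 candidate settings, and each candidate is fitted once per cross-validation fold.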


### FeatureUnion

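FeatureUnion represents a text with several different feature extractors at once, for example by combining the CountVectorizer and TfidfVectorizer described above. The outputs are concatenated, so the total feature count is the sum of the individual feature counts.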
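The concatenation can be illustrated with two toy extractors. This is a pure-Python sketch under made-up names (`word_counts`, `word_presence`, `union_features`) and made-up vocabularies; it only shows how the feature counts add up, not how scikit-learn's `FeatureUnion` is implemented.

```python
def word_counts(doc, vocabulary):
    """Count how often each vocabulary word occurs in doc."""
    tokens = doc.split()
    return [tokens.count(word) for word in vocabulary]

def word_presence(doc, vocabulary):
    """Mark each vocabulary word as present (1) or absent (0) in doc."""
    tokens = set(doc.split())
    return [1 if word in tokens else 0 for word in vocabulary]

def union_features(doc, extractors):
    """Concatenate the outputs of all extractors, in order."""
    features = []
    for extract in extractors:
        features.extend(extract(doc))
    return features

count_vocab = ["good", "bad", "movie"]  # 3 features
presence_vocab = ["good", "plot"]       # 2 features

extractors = [
    lambda doc: word_counts(doc, count_vocab),
    lambda doc: word_presence(doc, presence_vocab),
]

row = union_features("good good movie", extractors)
print(row)       # [2, 0, 1, 1, 0]
print(len(row))  # 5
```

The combined row has 3 + 2 = 5 entries, in the same way that capping one extractor at 100 features and another at 300 yields 400 columns.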

In [24]:
from sklearn.pipeline import FeatureUnion

In [25]:
featurizer = FeatureUnion([('tfidf_vect', TfidfVectorizer()), ('count_vect', CountVectorizer())])

In [26]:
_ = featurizer.fit_transform(X)
print(_.shape)
# The feature count doubles because the two representations are concatenated

(99989, 211698)


In [27]:
# Limit the feature count through set_params; in this example the output has 400 features (100 + 300)
featurizer.set_params(tfidf_vect__max_features=100, count_vect__ngram_range=(1,2), count_vect__max_features=300)

_= featurizer.fit_transform(X)
print(_.shape)

(99989, 400)


In [28]:
# Define a more complete parameter grid
pipe_params = {'featurizer__count_vect__ngram_range':[(1, 1), (1, 2)],
               'featurizer__count_vect__max_features':[1000, 10000],
               'featurizer__count_vect__stop_words':[None, 'english'],
               'featurizer__tfidf_vect__ngram_range':[(1, 1), (1, 2)],
               'featurizer__tfidf_vect__max_features':[1000, 10000],
               'featurizer__tfidf_vect__stop_words':[None, 'english']}

pipe = Pipeline([('featurizer', featurizer), ('classify', MultinomialNB())])

In [29]:
grid = GridSearchCV(pipe, pipe_params)
grid.fit(X, y)

print(grid.best_score_, grid.best_params_)

0.7580933924767184 {'featurizer__count_vect__max_features': 10000, 'featurizer__count_vect__ngram_range': (1, 2), 'featurizer__count_vect__stop_words': None, 'featurizer__tfidf_vect__max_features': 10000, 'featurizer__tfidf_vect__ngram_range': (1, 1), 'featurizer__tfidf_vect__stop_words': 'english'}
