# Lab1 Sentiment Analysis using Naive Bayes
## Baseline Algorithm
1. Tokenization (word segmentation)
2. Feature Extraction (TF-IDF)
3. Naive Bayes

### Data
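- Input: a document, modeled as a word sequence: $P(W) = P(w_1, w_2, \ldots, w_n)$
- Sentiment label ($y$), polarity: +/- (1/0)
    + Decision: compute $P(y|W)$ for $y = 0, 1$ and take the class with the larger value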
- Unigram, Bigram, Trigram, ...N-gram
    + A plain Naive Bayes classifier with Laplace smoothing is a unigram model
    + For text, higher-order N-grams are expected to work better
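
To make the N-gram idea above concrete, here is a minimal sketch of extracting N-grams from a token list. The helper name `extract_ngrams` and the example sentence are illustrative assumptions, not part of the lab code.

```python
def extract_ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie is not good".split()
unigrams = extract_ngrams(tokens, 1)
bigrams = extract_ngrams(tokens, 2)

print(unigrams)
print(bigrams)  # contains ('not', 'good'), which a unigram model never sees as one unit
```

A bigram such as `('not', 'good')` keeps the negation attached to the word it modifies, which is one reason N-grams may help for sentiment.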

## File Read

In [1]:
import os
from random import shuffle

def loadData(flagTrain = True):
    path = './aclImdb/'
    if flagTrain:
        path += "train/"
    else:
        path += "test/"
    
    pos_path = path + 'pos/'
    neg_path = path + 'neg/'
    # Collect the review files (*.txt) for each polarity.
    pos_files = [pos_path + x for x in os.listdir(pos_path) if x.endswith('.txt')]
    neg_files = [neg_path + x for x in os.listdir(neg_path) if x.endswith('.txt')]
    def read_lower(file_path):
        # Read one review in lowercase; the with-block closes the file handle.
        with open(file_path, 'r', encoding='utf-8') as f:
            return f.read().lower()
    pos_list = [read_lower(x) for x in pos_files]
    neg_list = [read_lower(x) for x in neg_files]
    data_list = pos_list + neg_list
    label_list = [1] * len(pos_list) + [0] * len(neg_list)
    
    # shuffle if you'd like ===========================
    if flagTrain:
        merged_data = list(zip(data_list, label_list))
        shuffle(merged_data)
        data_list, label_list = list(zip(*merged_data))
    return list(data_list), list(label_list)

In [2]:
data_list, label_list = loadData()
print(type(data_list), len(data_list))
print(type(label_list), len(label_list))

<class 'list'> 25000
<class 'list'> 25000


## Tokenization

In [3]:
from keras.preprocessing.text import Tokenizer
max_vocab_size = 50000
tokenizer = Tokenizer(num_words=max_vocab_size, oov_token='<UNK>')
tokenizer.fit_on_texts(data_list)
tf_idf_data = tokenizer.texts_to_matrix(data_list, mode='tfidf')
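
As a rough illustration of what a TF-IDF weight expresses, the sketch below computes one common variant by hand on a toy corpus: `tf = 1 + log(count)` and `idf = log(1 + N / (1 + df))`. Whether this matches the exact formula used by the tokenizer above is an assumption; the corpus and the helper `tfidf` are made up for the example.

```python
import math
from collections import Counter

# Toy corpus, already tokenized (made-up data).
docs = [
    "good movie good plot".split(),
    "bad movie".split(),
    "good acting".split(),
]

n_docs = len(docs)
# Document frequency: the number of documents each word appears in.
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Weight each word of one document (assumed variant, see the text above)."""
    counts = Counter(doc)
    return {
        word: (1 + math.log(c)) * math.log(1 + n_docs / (1 + doc_freq[word]))
        for word, c in counts.items()
    }

weights = tfidf(docs[0])
print(weights)
```

In the first document, the rare word `plot` gets a larger weight than the common word `movie`, and the repeated word `good` gets a larger weight than `movie` even though both appear in two documents.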

In [4]:
import numpy as np 
label_list = np.array(label_list)

print(label_list)
print("data shape: ", tf_idf_data.shape)
print("label shape: ", label_list.shape)

[1 0 1 ... 0 1 0]
data shape:  (25000, 50000)
label shape:  (25000,)


## Naive Bayes
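In this example each input is a text document. After TF-IDF preprocessing (see appendix), each value acts as a weight that takes the place of the count in the original formula.
1. Each word is one feature: $x_j^{(i)}$ corresponds to one word and takes the value 0 or 1, i.e. absent or present.
2. Counting therefore only ever sees 0 or 1, and counting is simply summing these values.
3. After TF-IDF processing, each $x_j^{(i)}$ becomes a floating-point number that can be viewed as a weight.
4. Wherever a count appeared, we now sum the TF-IDF values instead.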

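> Equivalently:
> - count(y) is now the sum of all TF-IDF values over the documents of class y
> - count(x, y) is now the sum of the TF-IDF values of feature x over the documents of class y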
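
The weighted counts described above can be sketched on a toy matrix. The numbers and the variable names (`weight_xy`, `weight_y`, `prob_x_given_y`) are made up for illustration; the add-one term is Laplace smoothing.

```python
import numpy as np

# Toy TF-IDF matrix: 4 documents x 3 vocabulary words (made-up values).
X = np.array([
    [1.2, 0.0, 0.5],
    [0.8, 0.3, 0.0],
    [0.0, 1.5, 0.4],
    [0.1, 0.9, 0.0],
])
y = np.array([1, 1, 0, 0])  # 1 = positive, 0 = negative

n_classes, n_features = 2, X.shape[1]
prob_x_given_y = np.zeros((n_classes, n_features))
for c in range(n_classes):
    weight_xy = X[y == c].sum(axis=0)  # "count(x, y)": per-word weight in class c
    weight_y = weight_xy.sum()         # "count(y)": total weight in class c
    prob_x_given_y[c] = (weight_xy + 1) / (weight_y + n_features)

print(prob_x_given_y)
print(prob_x_given_y.sum(axis=1))  # each row sums to 1 up to rounding
```

Because the denominator is the total class weight plus the vocabulary size, each row of `prob_x_given_y` is a proper distribution over words.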

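At prediction time TF-IDF is not required: a binary present/absent word vector is enough, scored with the trained $p(y)$ and $p(X|y)$. (The test cells below nonetheless pass TF-IDF vectors to `predict`.)
> Work in log space: the product becomes a sum, which is cheaper and avoids floating-point underflow:
> $$p(y|X) \propto p(y)\,p(X|y) \;\Rightarrow\; \log p(y) + \log p(X|y)$$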
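
A small sketch of why the log form matters in practice, assuming a document with 2000 word likelihoods of `1e-3` each (made-up numbers): the direct product underflows to zero, while the sum of logs stays finite and comparable across classes.

```python
import math

# 2000 hypothetical per-word likelihoods, each 1e-3 (made-up numbers).
likelihoods = [1e-3] * 2000
prior = 0.5

# Direct product: the true value 0.5 * 1e-6000 is far below the smallest
# positive float64, so the result underflows to exactly 0.0.
direct = prior * math.prod(likelihoods)

# Log space: the same quantity expressed as a sum of logs stays finite.
log_score = math.log(prior) + sum(math.log(p) for p in likelihoods)

print(direct)
print(log_score)
```

With the direct product every class would score 0.0 and the argmax would be meaningless; the log scores can still be ranked.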

In [5]:
class NaiveBayes():
    def fit(self, X, y):
        self.num_classes = 2  # neg/pos = 0/1
        self.m_examples = y.shape[0] 
        ## p(X|y): per-class sums of feature weights, with Laplace (add-one) smoothing
        self.prob_Xy_arr = np.zeros((self.num_classes, X.shape[1]), dtype=np.float64)
        count_y = np.zeros((self.num_classes, 1))
        for i in range(self.m_examples):
            ith_lbl = y[i] 
            self.prob_Xy_arr[ith_lbl] += X[i] 
            count_y[ith_lbl] += np.sum(X[i])
        self.prob_Xy_arr =  (self.prob_Xy_arr + 1) / (count_y + X.shape[1])
        
        ## p(y)
        self.prob_y_arr = np.zeros(self.num_classes, dtype=np.float64)
        for i in range(self.num_classes): 
            self.prob_y_arr[i] = np.sum(y == i) / self.m_examples
    
    def predict(self, X):
        m_test = X.shape[0] 
        labels = np.zeros(m_test)
        for i in range(m_test): 
            y, prob = None, float('-inf') 
            for lbl in range(self.num_classes):
                # Add the log prior once, outside the sum over features.
                sc = np.sum(X[i] * np.log(self.prob_Xy_arr[lbl])) + np.log(self.prob_y_arr[lbl])
                if sc > prob:
                    prob = sc 
                    y = lbl 
            labels[i] = y
        return labels 
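
The per-document, per-class loops in `predict` can also be written as one matrix product. The sketch below is an optional alternative, not the lab's implementation; it uses small random stand-ins (`X`, `prob_xy`, `prob_y` are hypothetical) and checks that the vectorized scores match a loop that scores each document as the weighted sum of log-likelihoods plus the log prior.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_features, n_classes = 6, 5, 2

# Hypothetical stand-ins for a test matrix and the fitted parameters.
X = rng.random((n_docs, n_features))              # non-negative feature weights
prob_xy = rng.random((n_classes, n_features)) + 0.1
prob_xy /= prob_xy.sum(axis=1, keepdims=True)     # rows sum to 1, like p(x|y)
prob_y = np.array([0.5, 0.5])

# Loop version: one score per (document, class) pair.
loop_scores = np.zeros((n_docs, n_classes))
for i in range(n_docs):
    for c in range(n_classes):
        loop_scores[i, c] = np.sum(X[i] * np.log(prob_xy[c])) + np.log(prob_y[c])

# Vectorized version: a single matrix product plus a broadcast log prior.
vec_scores = X @ np.log(prob_xy).T + np.log(prob_y)

pred = vec_scores.argmax(axis=1)  # the first maximum wins, as in the loop
print(pred)
```

On a 25000 x 50000 matrix the matrix product avoids 50000 Python-level iterations, which is where most of the loop version's time is likely spent.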

In [6]:
nb = NaiveBayes()
nb.fit(tf_idf_data, label_list)

In [7]:
print(nb.prob_Xy_arr)
print(nb.prob_y_arr)

print(np.sum(nb.prob_Xy_arr[1]))

[[1.51975928e-07 1.93160878e-02 4.32211781e-03 ... 1.51975928e-07
  1.52403570e-06 2.57939809e-06]
 [1.44923561e-07 1.86923348e-02 4.14379472e-03 ... 2.76170361e-06
  1.45331359e-06 1.44923561e-07]]
[0.5 0.5]
0.9999999999999988


### Test

In [8]:
testdata, testlabel = loadData(False) 
print(type(testdata), len(testdata))
print(type(testlabel), len(testlabel))

<class 'list'> 25000
<class 'list'> 25000


In [9]:
testdata_tfidf = tokenizer.texts_to_matrix(testdata, mode='tfidf') 
testlabel = np.array(testlabel)
print("test label shape: ", testlabel.shape)
print(testlabel)
print("test data shape: ", testdata_tfidf.shape)
print(testdata_tfidf)

test label shape:  (25000,)
[1 1 1 ... 0 0 0]
test data shape:  (25000, 50000)
[[ 0.          0.          2.05416196 ...  0.          0.
   0.        ]
 [ 0.         21.25195642  2.88365037 ...  0.          0.
   0.        ]
 [ 0.         21.25195642  2.85265447 ...  0.          0.
   0.        ]
 ...
 [ 0.         26.4249195   2.8202164  ...  0.          0.
   0.        ]
 [ 0.         24.16521815  2.05416196 ...  0.          0.
   0.        ]
 [ 0.         33.44419303  2.63059898 ...  0.          0.
   0.        ]]


In [10]:
predlabels = nb.predict(testdata_tfidf)
print(predlabels)
acc = predlabels==testlabel
print("accuracy: ", np.sum(acc) / testlabel.shape[0])

[1. 1. 1. ... 1. 0. 1.]
accuracy:  0.785


## scikit-learn

In [12]:
from time import time
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [14]:
## sklearn preprocessing + Naive Bayes
tf_vectorizer = TfidfVectorizer()  # accuracy: CountVectorizer=0.814; TfidfVectorizer=0.830
X_train_tf = tf_vectorizer.fit_transform(data_list)
X_test_tf = tf_vectorizer.transform(testdata)

naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_tf, label_list)
y_pred = naive_bayes_classifier.predict(X_test_tf)
score1 = metrics.accuracy_score(testlabel, y_pred)
print("[sklearn] TfidfVectorizer accuracy:   %0.3f" % score1)

[sklearn] TfidfVectorizer accuracy:   0.830


In [15]:
## Same data as above (tensorflow preprocessing) + sklearn Naive Bayes = 0.785
naive_bayes_classifier.fit(tf_idf_data, label_list)
y_pred = naive_bayes_classifier.predict(testdata_tfidf)
score2 = metrics.accuracy_score(testlabel, y_pred)
print("[tensorflow] accuracy:   %0.3f" % score2)

[tensorflow] accuracy:   0.785
