#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2017


# Homework 3:  Classification Cook-off: Naive Bayes vs Rocchio (plus a little bit of recommenders)

### 100 points [5% of your final grade]

### Due: Wednesday, March 29 by 11:59pm

*Goals of this homework:* Hands-on practice building and evaluating classifiers.

*Submission Instructions:* To submit your homework, rename this notebook as YOUR_UIN_hw3.ipynb. Submit this notebook via eCampus. Your notebook should be completely self-contained, with the results visible in the notebook. 

*Late submission policy:* For this homework, you may use up to three of your late days, meaning that no submissions will be accepted after Saturday April 1 at 11:59pm.

# Part 0: Yelp review data

In this assignment, given a Yelp review, your task is to implement two classifiers to predict if the business category of this review is "food-relevant" or not, **only based on the review text**. The data is from the [Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge).

## Build the training data

First, you will need to download this data file as your training data: [training_data.json](https://drive.google.com/open?id=0B_13wIEAmbQMdzBVTndwenoxQlk) 

The training data file includes 40,000 Yelp reviews. Each line is a json-encoded review, and **you should only focus on the "text" field**. As in homework 1, you should tokenize the review text by using the regular expression "\W+" (we discussed it in [this Piazza post](https://piazza.com/class/ixkk1fy863r1vs?cid=29)). Do NOT remove stop words. **Do casefolding but no stemming**.

The label (class) information of each review is in the "label" field. It is **either "Food-relevant" or "Food-irrelevant"**.

## Testing data

We provide 100 Yelp reviews here: [testing_data.json](https://drive.google.com/open?id=0B_13wIEAmbQMbXdyTkhrZDN4Wms). The testing data file has the same format as the training data file. Again, you can get the label information in the "label" field. Only use it when you evaluate your classifiers.

In [1]:
import json

# read the training data
data = []
data2 = []
with open('training_data.json') as data_file:
    for line in data_file: 
        data.append(json.loads(line))

with open('testing_data.json') as data_file:
    for line in data_file: 
        data2.append(json.loads(line))


# print (data[1])

In [2]:
import re
# tokenization 
def tokenizer(document):
    document = document.lower()
    # split on runs of non-word characters and drop empty strings
    wl = [w for w in re.split(r'\W+', document) if w]
    return wl

In [3]:
# pair the tokenlist and label forming the training data
label = []
tokenList = []
for review in data:
    token = tokenizer(review["text"])
    tokenList.append(token)
    label.append(review["label"])

t_data = zip(tokenList,label)

# print len(tokenList[2])
# print (label[2])



In [4]:
# testing data
label2 = []
tokenList2 = []
for review in data2:
    token = tokenizer(review["text"])
    tokenList2.append(token)
    label2.append(review["label"])

test_data = zip(tokenList2,label2)

# Part 1: Naive Bayes classifier [35 points]

In this part, you will implement a Naive Bayes classifier, which outputs the probabilities that a given review belongs to each class.

Use a mixture model that mixes the probability from the document with the general collection frequency of the word. **You should use lambda = 0.7**. Be careful about decimal rounding, since multiplying many probabilities can generate a tiny value. We will not grade on the exact probability value, so feel free to switch to a sum of logarithms (it's not required, though). If a tie occurs, **always go to the "Food-irrelevant" side**.
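The mixture model can be read as a per-term interpolation between the class estimate and the collection estimate, scored in log space to avoid underflow. The following is a minimal sketch on toy data, not the reference solution: the names `train`, `log_score`, and `classify` and the three toy reviews are made up for illustration, and skipping terms that never occur in training is an assumption the assignment does not spell out. It is written to run under both Python 2 and Python 3.

```python
import math
import re
from collections import Counter

LAMBDA = 0.7  # mixing weight required by the assignment


def tokenize(text):
    """Casefold and split on non-word characters, dropping empty tokens."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def train(labeled_docs):
    """Collect per-class and collection-wide term counts plus class priors."""
    class_counts = {}        # label -> Counter of term frequencies
    doc_counts = Counter()   # label -> number of documents
    collection = Counter()   # term -> frequency over all documents
    for text, label in labeled_docs:
        tokens = tokenize(text)
        class_counts.setdefault(label, Counter()).update(tokens)
        doc_counts[label] += 1
        collection.update(tokens)
    total_docs = float(sum(doc_counts.values()))
    priors = {c: doc_counts[c] / total_docs for c in doc_counts}
    return class_counts, priors, collection


def log_score(tokens, label, model):
    """log P(c) + sum of log(lambda * P(t|c) + (1 - lambda) * P(t|collection))."""
    class_counts, priors, collection = model
    class_total = float(sum(class_counts[label].values()))
    coll_total = float(sum(collection.values()))
    score = math.log(priors[label])
    for t in tokens:
        if t not in collection:  # unseen in training: skipped (an assumption)
            continue
        p_class = class_counts[label][t] / class_total
        p_coll = collection[t] / coll_total
        score += math.log(LAMBDA * p_class + (1 - LAMBDA) * p_coll)
    return score


def classify(text, model):
    tokens = tokenize(text)
    relevant = log_score(tokens, "Food-relevant", model)
    irrelevant = log_score(tokens, "Food-irrelevant", model)
    # Strict '>' sends ties to "Food-irrelevant", as the assignment asks.
    return "Food-relevant" if relevant > irrelevant else "Food-irrelevant"


# Hypothetical toy reviews, only to exercise the functions.
toy_docs = [
    ("great pizza and pasta", "Food-relevant"),
    ("tasty burger great fries", "Food-relevant"),
    ("slow oil change and rude mechanic", "Food-irrelevant"),
]
model = train(toy_docs)
print(classify("great pasta", model))
print(classify("rude mechanic", model))
```

Because the collection term is nonzero for every training term, a word that never appears in one class still gets a small positive probability there, so the logarithm is always defined.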

### What to report

* For the entire testing dataset, report the overall accuracy.
* For the class "Food-relevant", report the precision and recall.
* For the class "Food-irrelevant", report the precision and recall.

We will also grade on the quality of your code. So make sure that your code is clear and readable.
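The three reported quantities all come from the same confusion counts, computed once per class of interest. As a rough sketch (the helper name `evaluate` and the four toy labels are illustrative assumptions, not part of the assignment), this might look like:

```python
def evaluate(actual, predicted, positive):
    """Return (accuracy, precision, recall), treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    accuracy = float(tp + tn) / len(actual)
    # Guard against empty denominators when a class is never predicted or never occurs.
    precision = float(tp) / (tp + fp) if tp + fp else 0.0
    recall = float(tp) / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical labels, only to exercise the function.
actual = ["Food-relevant", "Food-relevant", "Food-irrelevant", "Food-irrelevant"]
predicted = ["Food-relevant", "Food-irrelevant", "Food-irrelevant", "Food-relevant"]
for label in ("Food-relevant", "Food-irrelevant"):
    print("%s: %s" % (label, evaluate(actual, predicted, label)))
```

Calling it once with each label as `positive` gives the per-class precision and recall; the accuracy is the same in both calls.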

In [5]:
# step 1: get the probability of each term in the vocabulary over the whole training data
# step 2: get the probability of each term in the vocabulary within each class
# step 3: smooth with the formula: pro_term_class = lambda * pro_term_class + (1 - lambda) * prob_overall_term
# step 4: get the prior of each class
# step 5: use log10(prior(class)) + sum(log10(pro(term|class))) to get the condition_prob(class|review)
# step 6: assign the class with the higher condition_prob

In [6]:
# Build the naive bayes classifier
# Insert as many cells as you want
import collections
bag_of_word = []
r_bag_of_word = []
i_bag_of_word = []
i_number = 0
r_number = 0
# generate the bag of word and its vocabulary
for entry in t_data:
    bag_of_word += entry[0]
voca = list(set(bag_of_word))

# get the number of entries and the text corpus for each class, plus the relevant and irrelevant vocabularies

for text,c in t_data:
    if(c == "Food-relevant"):
        r_bag_of_word += text
        r_number += 1
    else:
        i_bag_of_word += text
        i_number += 1

r_voca = list(set(r_bag_of_word))
i_voca = list(set(i_bag_of_word))


# print len(r_bag_of_word)
# print len(i_bag_of_word)
# print len(r_voca)
# print len(i_voca)
# print len(r_bag_of_word) + len(i_bag_of_word)

In [15]:
# step 1: get the probability of each term in the vocabulary over the whole training data

pro_term_whole = collections.defaultdict(float)
w_term_fre = collections.Counter(bag_of_word)
for term in voca:
    pro_term_whole[term] = float(w_term_fre[term]) / len(bag_of_word)


In [16]:
# step 2: get the probability of each term in the vocabulary within each class,
# step 3: get the prior of each class, 

prob_i_term = collections.defaultdict(float)
prob_r_term = collections.defaultdict(float)

# step 2
i_term_fre = collections.Counter(i_bag_of_word)
r_term_fre = collections.Counter(r_bag_of_word)

for term in voca:
    prob_i_term[term] = float(i_term_fre[term]) / len(i_bag_of_word)
    prob_r_term[term] = float(r_term_fre[term]) / len(r_bag_of_word)
        
# step 3    
i_prior = float(i_number) / (i_number + r_number)
r_prior = float(r_number) / (i_number + r_number)



In [19]:
# Apply your classifier on the test data. Report the results.
# Insert as many cells as you want
import math

# record the predicted class for each review in the testing data
predict_c = []
compare_list = []
r_prob_dic = {}
i_prob_list = {}

for entry in test_data:
    r_result = math.log10(r_prior)
    i_result = math.log10(i_prior)
#     entry_voca = lset(entry[0])
#     actual_voca = entry_voca.difference(set(voca))
# #   not in r_voca but in i_voca
#     voca1 = actual_voca.difference
    for term in entry[0]:
        if term in pro_term_whole:  # dict lookup instead of a list scan; skips terms unseen in training
            r_result += math.log10(0.7 * prob_r_term[term] + 0.3 * pro_term_whole[term])
            i_result += math.log10(0.7 * prob_i_term[term] + 0.3 * pro_term_whole[term])
    compare_list.append((r_result, i_result)) 
    
    
for i in range(len(test_data)):
#   compare the two results in compare_list (for each entry); a tie goes to "Food-irrelevant"
    if (compare_list[i][0] > compare_list[i][1] ):
        predict_c.append("Food-relevant")
    else: 
        predict_c.append("Food-irrelevant")


In [21]:
# the actual class list
actual_c = []
for entry in test_data:
    actual_c.append(entry[1])

In [22]:
pair = zip(actual_c, predict_c)
TP = 0
TN = 0
FP = 0
FN = 0

for i in range(len(actual_c)):
    if (actual_c[i] == 'Food-relevant' and predict_c[i] == 'Food-relevant'):
        TP += 1
    elif (actual_c[i] == 'Food-irrelevant' and predict_c[i] == 'Food-relevant'):
#         print i
        FP += 1
    elif (actual_c[i] == 'Food-relevant' and predict_c[i] == 'Food-irrelevant'):
#         print i
        FN += 1
    else:
        TN += 1
             
# print TP
# print TN
# print FP
# print FN

In [23]:
accuracy = float(TP + TN) / len(test_data)
Food_relevant_precision = float(TP) / (TP + FP)
Food_relevant_recall = float(TP) / (TP + FN)
Food_irrelevant_precision = float(TN) / (TN + FN)
Food_irrelevant_recall = float(TN) / (TN + FP)

print 'Accuracy is ',accuracy
print 'Food-relevant-precision is ', Food_relevant_precision
print 'Food-relevant-recall is ', Food_relevant_recall
print 'Food-irrelevant-precision is ', Food_irrelevant_precision
print 'Food-irrelevant-recall is ', Food_irrelevant_recall


Accuracy is  0.89
Food-relevant-precision is  0.901408450704
Food-relevant-recall is  0.941176470588
Food-irrelevant-precision is  0.862068965517
Food-irrelevant-recall is  0.78125


In [24]:
# F1 for the "Food-relevant" class: 2 * P * R / (P + R)
f = 2 * 0.9014*0.9412/(0.9014+0.9412)
print f

0.920870161728


# Part 2: Rocchio classifier [35 points]

In this part, your job is to implement a Rocchio classifier for "food-relevant vs. food-irrelevant". You need to aggregate all the reviews of each class, and find the center. **Use the normalized raw term frequency**.
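The description above amounts to three small pieces: a length-normalized term-frequency vector per review, a per-class mean of those vectors, and a nearest-centroid decision. Here is a minimal sketch on toy data under stated assumptions: the helper names `unit_tf`, `centroid`, `distance`, and `classify` and the toy token lists are invented for illustration, Euclidean distance is one reasonable choice of closeness, and sending ties to "Food-irrelevant" is a choice the assignment only specifies for Naive Bayes.

```python
import math
from collections import Counter


def unit_tf(tokens):
    """Raw term-frequency vector scaled to unit Euclidean length."""
    tf = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in tf.values()))
    return {t: v / norm for t, v in tf.items()} if norm else {}


def centroid(vectors):
    """Component-wise mean of sparse vectors stored as dicts."""
    total = Counter()
    for vec in vectors:
        for t, v in vec.items():
            total[t] += v
    n = float(len(vectors))
    return {t: v / n for t, v in total.items()}


def distance(u, v):
    """Euclidean distance between two sparse vectors; missing terms count as 0."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


# Hypothetical, already tokenized training reviews.
train_docs = {
    "Food-relevant": [["great", "pizza"], ["tasty", "pizza", "pizza"]],
    "Food-irrelevant": [["oil", "change"], ["rude", "mechanic"]],
}
centroids = {label: centroid([unit_tf(d) for d in docs])
             for label, docs in train_docs.items()}


def classify(tokens):
    query = unit_tf(tokens)
    d_rel = distance(query, centroids["Food-relevant"])
    d_irr = distance(query, centroids["Food-irrelevant"])
    # Strict '<' sends ties to "Food-irrelevant" (an assumption for Rocchio).
    return "Food-relevant" if d_rel < d_irr else "Food-irrelevant"


print(classify(["pizza", "great"]))
print(classify(["mechanic"]))
```

Storing vectors as dicts keeps only the nonzero terms, which matters when the vocabulary is large and each review uses a small part of it.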


### What to report

* For the entire testing dataset, report the overall accuracy.
* For the class "Food-relevant", report the precision and recall.
* For the class "Food-irrelevant", report the precision and recall.

We will also grade on the quality of your code. So make sure that your code is clear and readable.

In [16]:
# step1: divide the t_data by class
# step2: calculate the term frequency dictionary for each entry in the t_data, then normalize:
#         1. make each entry a dictionary, with the term as key and tf as value
#         2. divide it by the length of the vector
# step3: get the centroid of each class:
#         1. combine the dictionaries for each class
#         2. divide by the number of entries in each class
# step4: for test_data, get the tf-dic for each entry and normalize
# step5: measure the distance between each class's centroid and each entry's tf-dic
#        1. for each class's centroid, get the union of the test entry's voca and the centroid's voca
#        2. for each term in this union voca, accumulate the squared difference (euclidean distance)


In [10]:
# Build the Rocchio classifier
# Insert as many cells as you want

# step1: divide the training data by class
r_entryList = []
i_entryList = []

for entry in t_data:
        if (entry[1] == "Food-relevant"):
            r_entryList.append(entry[0])
        else:
            i_entryList.append(entry[0])
        


In [11]:
# step2: calculate the term frequency dictionary for each entry in the t_data, then normalize:
#         1. make each entry in each class a dictionary, with the term as key and tf as value
#         2. divide it by the length of the vector

# helper method to get the length of the vector represented by a dictionary
def vector_len(dic):
    value_list = dic.values()
    length = math.sqrt(sum(a * a for a in value_list))
    return length

def norm_tf_list (entryList):
    tf_dic_list = []
    for entry in entryList:
        tf_dic= collections.Counter(entry)
        length = vector_len(tf_dic)
        tf_norm = {k: (v / length) for k, v in tf_dic.iteritems()}
#       append the dictionary to the tf_dic_list
        tf_dic_list.append(tf_norm)
    return tf_dic_list

r_tf_list = norm_tf_list (r_entryList)
i_tf_list = norm_tf_list (i_entryList)
    



In [12]:
# step3: get the centroid of each class:
#         1. go through each entry dictionary and add each term's value to the running total for that term
#         2. divide by the number of entries in each class

def centroid(tf_list,voca):
    dic ={}
#   initialize the dic
    for term in voca:
        dic[term] = 0
#   get the number of entries in the class
    N = len(tf_list)
#   go through each entry dictionary, dic[key] += v
    for i in range(N):
        for k,v in tf_list[i].items():
            dic[k] += v
#   divide by the number of entries in the class
    dic = {k: v / N for k,v in dic.items()}
    return dic

r_centroid = centroid(r_tf_list, r_voca)
i_centroid = centroid(i_tf_list, i_voca)


In [13]:
# step4: for test_data, get tf-dic for each entry and normalized like before

test_entrylist  = []
for item in test_data:
    test_entrylist.append(item[0])

test_dic_list = norm_tf_list(test_entrylist)


In [43]:
# step5: measure the distance between each class's centroid and each entry's tf-dic
#        1. for each class's centroid, get the union of the test entry's voca and the centroid's voca
#        2. for each term in this union voca, accumulate the squared difference (euclidean distance)

# euclidean distance function
def eu_dis(centroid, query):
    c_voca = centroid.keys()
    q_voca = query.keys()

#   in c not in q
    voca1 = set(c_voca).difference(set(q_voca))
#   in q not in c
    voca2 = set(q_voca).difference(set(c_voca))
#   in c and in q
    voca3 = set(c_voca).intersection(set(q_voca))
    
    s = 0
    for term in voca3:
        s += (centroid[term] - query[term])**2
    for term in voca2:
        s +=  query[term] ** 2
    for term in voca1:
        s +=  centroid[term] ** 2
    
    dis = math.sqrt(s)       
    return dis




In [50]:
# Apply your classifier on the test data. Report the results.
# Insert as many cells as you want
r_predict_c = []
for query in test_dic_list:
    r_dis = eu_dis(r_centroid,query)
    i_dis = eu_dis(i_centroid,query)
    if (r_dis < i_dis):
        r_predict_c.append("Food-relevant")
    else:
        r_predict_c.append("Food-irrelevant")


In [53]:
pair = zip(actual_c, r_predict_c)
TP = 0
TN = 0
FP = 0
FN = 0

for i in range(len(actual_c)):
    if (actual_c[i] == 'Food-relevant' and r_predict_c[i] == 'Food-relevant'):
        TP += 1
    elif (actual_c[i] == 'Food-irrelevant' and r_predict_c[i] == 'Food-relevant'):
#         print i
        FP += 1
    elif (actual_c[i] == 'Food-relevant' and r_predict_c[i] == 'Food-irrelevant'):
#         print i
        FN += 1
    else:
        TN += 1

In [55]:
accuracy = float(TP + TN) / len(test_data)
Food_relevant_precision = float(TP) / (TP + FP)
Food_relevant_recall = float(TP) / (TP + FN)
Food_irrelevant_precision = float(TN) / (TN + FN)
Food_irrelevant_recall = float(TN) / (TN + FP)

print 'Accuracy is ',accuracy
print 'Food-relevant-precision is ', Food_relevant_precision
print 'Food-relevant-recall is ', Food_relevant_recall
print 'Food-irrelevant-precision is ', Food_irrelevant_precision
print 'Food-irrelevant-recall is ', Food_irrelevant_recall


Accuracy is  0.65
Food-relevant-precision is  0.8
Food-relevant-recall is  0.647058823529
Food-irrelevant-precision is  0.466666666667
Food-irrelevant-recall is  0.65625


In [60]:
# F1 for the "Food-relevant" class: 2 * P * R / (P + R)
f = 2 * 0.8 * 0.6470588/(0.8 + 0.647)
print f

0.715476212854


# Part 3: Naive Bayes vs. Rocchio [20 points]

Which method gives the better results? In terms of what? How did you compare them? Can you explain why you observe what you do? Write 1-3 paragraphs below.

**Add your answer here:**
1. Naive Bayes gives the better results. On the 100 test reviews its accuracy is 0.89 versus 0.65 for Rocchio, and its precision and recall are higher for both classes (for example, Food-relevant recall is 0.94 versus 0.65). The F score for the Food-relevant class is therefore also higher for Naive Bayes.
2. I think the reason is that Naive Bayes models the probability of each word in the document, while Rocchio represents each document as a vector and each class as a single centroid. Averaging all the documents of a class into one centroid may lose the characteristics of individual documents. Meanwhile, Naive Bayes captures every term's class-conditional probability along with a global term probability. That is why Naive Bayes outperforms Rocchio here.

# Part 4: Recommenders [10 points]

Finally, since we've begun our discussion of recommenders, let's do a quick problem too:

The table below is a utility matrix, representing the ratings, on a 1–5 star scale, of eight items, *a* through *h*, by three users *A*, *B*, and *C*. 
<pre>

  | a  b  c  d  e  f  g  h
--|-----------------------
A | 4  5     5  1     3  2
B |    3  4  3  1  2  1
C | 2     1  3     4  5  3

</pre>

Compute the following from the data of this matrix.

(a) Treating the utility matrix as boolean, compute the Jaccard distance between each pair of users.

(b) Repeat Part (a), but use the cosine distance.

(c) Treat ratings of 3, 4, and 5 as 1 and 1, 2, and blank as 0. Compute the Jaccard distance between each pair of users.

(d) Repeat Part (c), but use the cosine distance.

(e) Normalize the matrix by subtracting from each nonblank entry the average
value for its user.

(f) Using the normalized matrix from Part (e), compute the cosine distance
between each pair of users.

(g) Which of the approaches above seems most reasonable to you? Give a one or two sentence argument supporting your choice.

**Add your answer here:**

(a) jac_dis(A,B) = 0.5, jac_dis(A,C) = 0.5, jac_dis(B,C) = 0.5

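(b) cos(A,B) = 0.67, cos(A,C) = 0.67, cos(B,C) = 0.67 (each value is the cosine of the angle between the two users' vectors, so a larger value means the users are closer)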

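(c) The Jaccard similarities are jac_sim(A,B) = 0.4, jac_sim(A,C) = 0.33, jac_sim(B,C) = 0.167, so the Jaccard distances (1 - similarity) are jac_dis(A,B) = 0.6, jac_dis(A,C) = 0.67, jac_dis(B,C) = 0.83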
<pre>

  | a  b  c  d  e  f  g  h
--|-----------------------
A | 1  1  0  1  0  0  1  0
B | 0  1  1  1  0  0  0  0
C | 0  0  0  1  0  1  1  1

</pre>


(d) cos(A,B) = 0.58, cos(A,C) = 0.5, cos(B,C) = 0.29

(e)
<pre>
  |  a     b     c     d     e     f     g     h
--|-----------------------------------------------
A |  2/3   5/3         5/3  -7/3        -1/3  -4/3
B |        2/3   5/3   2/3  -4/3  -1/3  -4/3
C | -1          -2     0           1     2     0
</pre>

(f) cos(A,B) = 0.58, cos(A,C) = -0.12, cos(B,C) = -0.74

(g) Cosine distance on the normalized matrix seems most reasonable to me. Subtracting each user's average removes that user's rating bias, so the cosine reflects whether two users agree about which items are above or below their own average. In contrast, Jaccard distance ignores the rating values themselves, and plain cosine on the raw ratings ignores the fact that some users tend to rate higher or lower than others.
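The quantities in parts (a) through (f) can be re-derived with a short script. The sketch below is one way to do it, with blank entries simply left out of each user's dictionary; the helper names (`jaccard_distance`, `cosine`, `boolean`, `thresholded`, `centered`) are made up for illustration. Note that `cosine` returns the cosine of the angle between two users, and turning that into a distance (for example the angle itself, or 1 minus the cosine) depends on the convention in use.

```python
import math

# Utility matrix from the problem statement; blank entries are omitted.
RATINGS = {
    "A": {"a": 4, "b": 5, "d": 5, "e": 1, "g": 3, "h": 2},
    "B": {"b": 3, "c": 4, "d": 3, "e": 1, "f": 2, "g": 1},
    "C": {"a": 2, "c": 1, "d": 3, "f": 4, "g": 5, "h": 3},
}
PAIRS = [("A", "B"), ("A", "C"), ("B", "C")]


def jaccard_distance(u, v):
    """1 - |intersection| / |union|, where each row holds only the items counted as present."""
    su, sv = set(u), set(v)
    return 1.0 - float(len(su & sv)) / len(su | sv)


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (missing entries are 0)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def boolean(row):
    """Parts (a), (b): every rated item becomes 1."""
    return {k: 1 for k in row}


def thresholded(row):
    """Parts (c), (d): ratings of 3, 4, 5 become 1; ratings of 1, 2 are dropped."""
    return {k: 1 for k, x in row.items() if x >= 3}


def centered(row):
    """Part (e): subtract the user's average from each nonblank entry."""
    mean = float(sum(row.values())) / len(row)
    return {k: x - mean for k, x in row.items()}


for name, transform in [("boolean", boolean), ("thresholded", thresholded), ("centered", centered)]:
    rows = {user: transform(r) for user, r in RATINGS.items()}
    for p, q in PAIRS:
        line = "%-11s %s,%s  cosine=%6.3f" % (name, p, q, cosine(rows[p], rows[q]))
        if name != "centered":  # Jaccard only applies to the 0/1 versions
            line += "  jaccard_distance=%.3f" % jaccard_distance(rows[p], rows[q])
        print(line)
```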