#### CSCE 670 :: Information Storage and Retrieval :: Texas A&M University :: Spring 2017


# Homework 3:  Classification Cook-off: Naive Bayes vs Rocchio (plus a little bit of recommenders)

### 100 points [5% of your final grade]

### Due: Wednesday, March 29 by 11:59pm

*Goals of this homework:* Hands-on practice building and evaluating classifiers.

*Submission Instructions:* To submit your homework, rename this notebook as YOUR_UIN_hw3.ipynb. Submit this notebook via eCampus. Your notebook should be completely self-contained, with the results visible in the notebook. 

*Late submission policy:* For this homework, you may use up to three of your late days, meaning that no submissions will be accepted after Saturday April 1 at 11:59pm.

# Part 0: Yelp review data

In this assignment, given a Yelp review, your task is to implement two classifiers to predict if the business category of this review is "food-relevant" or not, **only based on the review text**. The data is from the [Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge).

## Build the training data

First, you will need to download this data file as your training data: [training_data.json](https://drive.google.com/open?id=0B_13wIEAmbQMdzBVTndwenoxQlk) 

The training data file includes 40,000 Yelp reviews. Each line is a JSON-encoded review, and **you should only focus on the "text" field**. As in homework 1, tokenize the review text using the regular expression "\W+" (discussed in [this Piazza post](https://piazza.com/class/ixkk1fy863r1vs?cid=29)). Do NOT remove stop words. **Do casefolding but no stemming**.
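
As a rough illustration of the tokenization rule above, the sketch below splits on `\W+` and lowercases each token. The `tokenize` helper and the sample line are made up for illustration (the line is not taken from the dataset), and dropping the empty strings that `re.split` can return at the edges of the text is an assumption the assignment does not spell out.

```python
import json
import re

def tokenize(text):
    """Split on runs of non-word characters, casefold, and drop empty tokens."""
    return [token.lower() for token in re.split(r'\W+', text) if token]

# Hypothetical line in the same shape as the data files.
sample_line = '{"type": "review", "label": "Food-relevant", "text": "Great tacos, GREAT service!"}'
review = json.loads(sample_line)
tokens = tokenize(review["text"])
print(tokens)
```

Stop words are kept and no stemming is applied, so "great" appears twice in the result.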

The label (class) information of each review is in the "label" field. It is **either "Food-relevant" or "Food-irrelevant"**.

## Testing data

We provide 100 Yelp reviews here: [testing_data.json](https://drive.google.com/open?id=0B_13wIEAmbQMbXdyTkhrZDN4Wms). The testing data file has the same format as the training data file. Again, the label information is in the "label" field. Use it only when you evaluate your classifiers.

# Part 1: Naive Bayes classifier [35 points]

In this part, you will implement a Naive Bayes classifier, which outputs the probabilities that a given review belongs to each class.

Use a mixture model that mixes the probability from the document with the general collection frequency of the word. **You should use lambda = 0.7**. Be careful about decimal rounding, since multiplying many probabilities can produce a tiny value. We will not grade on the exact probability value, so feel free to switch to a sum of logarithms (it's not required, though). If there is a tie, **always go to the "Food-irrelevant" side**.
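
One possible reading of the mixture model, sketched under assumptions: the probability of a word given a class is taken as `lambda * tf(w, class) / |class| + (1 - lambda) * cf(w) / |collection|`, and a review is scored by summing logarithms. The function name, the toy counts, and the choice to skip words unseen in the whole collection are illustrative only, and class priors are left out for brevity.

```python
import math
from collections import Counter

LAMBDA = 0.7  # mixing weight given in the assignment

def mixture_log_score(tokens, class_counts, collection_counts):
    """Sum of log mixture probabilities of the tokens under one class model."""
    class_total = sum(class_counts.values())
    collection_total = sum(collection_counts.values())
    score = 0.0
    for token in tokens:
        p = (LAMBDA * class_counts[token] / class_total
             + (1 - LAMBDA) * collection_counts[token] / collection_total)
        if p > 0:  # assumption: ignore words never seen in training
            score += math.log(p)
    return score

# Toy term counts, not real data.
rel = Counter("pizza pizza tasty".split())
irrel = Counter("haircut tasty salon".split())
collection = rel + irrel

doc = ["tasty", "pizza"]
score_rel = mixture_log_score(doc, rel, collection)
score_irrel = mixture_log_score(doc, irrel, collection)
# A strict comparison sends ties to "Food-irrelevant", as the assignment asks.
label = "Food-relevant" if score_rel > score_irrel else "Food-irrelevant"
print(label)
```

Working in log space avoids the underflow that a long product of small probabilities would cause.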

### What to report

* For the entire testing dataset, report the overall accuracy.
* For the class "Food-relevant", report the precision and recall.
* For the class "Food-irrelevant", report the precision and recall.

We will also grade on the quality of your code. So make sure that your code is clear and readable.
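
The accuracy, precision, and recall requested above could be computed along these lines. This is only a sketch: the helper names and the four toy labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something the assignment specifies.

```python
def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(1 for g, p in zip(gold, predicted) if g == p) / float(len(gold))

def precision_recall(gold, predicted, positive):
    """Precision and recall for one class treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels for illustration only.
gold = ["Food-relevant", "Food-relevant", "Food-irrelevant", "Food-irrelevant"]
pred = ["Food-relevant", "Food-irrelevant", "Food-irrelevant", "Food-irrelevant"]

print(accuracy(gold, pred))
print(precision_recall(gold, pred, "Food-relevant"))
print(precision_recall(gold, pred, "Food-irrelevant"))
```

Calling `precision_recall` once per class keeps the bookkeeping to three counters instead of a separate set of variables for each class.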

In [97]:
# Build the naive bayes classifier
# Insert as many cells as you want
import json
import collections
import math
import re
food_rel = collections.defaultdict(int)
food_irrel = collections.defaultdict(int)
count_food_rel = 0
count_food_irrel = 0
const_lambda = 0.7
# Count term frequencies per class over the training reviews.
with open('training_data.json') as f:
    for line in f:
        j_content = json.loads(line)
        if j_content['type'] == 'review':
            if j_content['label'] == 'Food-relevant':
                # Tokenize on non-word characters and casefold; stop words are kept.
                for x in re.split(r'\W+', j_content['text']):
                    word = x.lower()
                    food_rel[word] += 1
                    count_food_rel += 1
            elif j_content['label'] == 'Food-irrelevant':
                for x in re.split(r'\W+', j_content['text']):
                    word = x.lower()
                    food_irrel[word] += 1
                    count_food_irrel += 1
accuracy = 0.0
correct = 0
total = 0
tp_relevant = 0
tn_relevant = 0
fp_relevant = 0
fn_relevant = 0
tp_irrelevant = 0
tn_irrelevant = 0
fp_irrelevant = 0
fn_irrelevant = 0
with open('testing_data.json') as f:
    for line in f:
        prob_food_rel = 0.0
        prob_food_irrel = 0.0
        j_test_content = json.loads(line)
        if j_test_content['type'] == 'review':
            for x in re.split(r'\W+', j_test_content['text']):
                word = x.lower()
                # Raw count of the word across both classes (the collection model).
                collection_count = food_irrel[word] + food_rel[word]
                # A word never seen in training has probability 0 under both models;
                # skip it instead of calling math.log(0).
                if collection_count == 0:
                    continue
                prob_food_rel += math.log((const_lambda * food_rel[word]/count_food_rel + (1-const_lambda)*collection_count/(count_food_irrel + count_food_rel)))
                prob_food_irrel += math.log((const_lambda * food_irrel[word]/count_food_irrel + (1-const_lambda)*collection_count/(count_food_irrel + count_food_rel)))
        if prob_food_rel > prob_food_irrel:
            if  j_test_content['label'] == 'Food-relevant':
                correct += 1
                tp_relevant += 1
                tn_irrelevant += 1
            if j_test_content['label'] == 'Food-irrelevant':
                fp_relevant += 1
                fn_irrelevant += 1
        else:
            if j_test_content['label'] == 'Food-irrelevant':
                correct += 1
                tn_relevant += 1
                tp_irrelevant += 1
            if j_test_content['label'] == 'Food-relevant':
                fn_relevant += 1
                fp_irrelevant += 1
        total += 1
        
print "###RESULTS FOR NAIVE BAYES###"
print "\n"
print "Accuracy:", float(correct)/float(total)
print "Precision - relevant: ", float(tp_relevant)/float(tp_relevant + fp_relevant)
print "Recall - relevant: ", float(tp_relevant)/float(tp_relevant + fn_relevant)
print "Precision - irrelevant: ", float(tp_irrelevant)/float(tp_irrelevant + fp_irrelevant)
print "Recall - irrelevant: ", float(tp_irrelevant)/float(tp_irrelevant + fn_irrelevant)

###RESULTS FOR NAIVE BAYES###


Accuracy: 0.89
Precision - relevant:  0.901408450704
Recall - relevant:  0.941176470588
Precision - irrelevant:  0.862068965517
Recall - irrelevant:  0.78125


In [39]:
# Apply your classifier on the test data. Report the results.
# Insert as many cells as you want



# Part 2: Rocchio classifier [35 points]

In this part, your job is to implement a Rocchio classifier for "food-relevant vs. food-irrelevant". You need to aggregate all the reviews of each class, and find the center. **Use the normalized raw term frequency**.
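
A minimal sketch of the centroid idea, assuming "normalized" means scaling each raw term-frequency vector to unit Euclidean length (dividing by document length would be another reading). The helper names and the toy token lists are hypothetical.

```python
import math
from collections import Counter, defaultdict

def unit_tf_vector(tokens):
    """Raw term frequencies scaled to unit Euclidean (L2) length."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return dict((term, c / norm) for term, c in counts.items())

def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return dict((term, weight / len(vectors)) for term, weight in total.items())

def euclidean(u, v):
    """Euclidean distance between two sparse vectors."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

# Toy training documents, already tokenized.
rel_centroid = centroid([unit_tf_vector(["pizza", "tasty"]),
                         unit_tf_vector(["pizza", "pizza"])])
irrel_centroid = centroid([unit_tf_vector(["salon", "haircut"])])

doc_vec = unit_tf_vector(["pizza"])
dist_rel = euclidean(doc_vec, rel_centroid)
dist_irrel = euclidean(doc_vec, irrel_centroid)
label = "Food-relevant" if dist_rel < dist_irrel else "Food-irrelevant"
print(label)
```

Iterating over the union of the two key sets keeps the distance computation proportional to the number of non-zero terms rather than the full vocabulary.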


### What to report

* For the entire testing dataset, report the overall accuracy.
* For the class "Food-relevant", report the precision and recall.
* For the class "Food-irrelevant", report the precision and recall.

We will also grade on the quality of your code. So make sure that your code is clear and readable.

In [96]:
# Build the Rocchio classifier
# Insert as many cells as you want
centroid_food_rel = collections.defaultdict(float)
centroid_food_irrel = collections.defaultdict(float)
count_food_rel_reviews = 0
count_food_irrel_reviews = 0
arr_rel = []
arr_irrel = []
index_rel = 0
index_irrel = 0
word_set=set()

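def normalize(dictionary):
    """Scale a term-frequency dictionary in place to unit Euclidean (L2) length."""
    # Vector magnitude: square root of the sum of squared weights.
    result = 0.0
    for key in dictionary:
        result += dictionary[key]**2
    result = math.sqrt(result)
    for key in dictionary:
        dictionary[key] = float(dictionary[key])/float(result)
    return dictionary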

with open('training_data.json') as f:
    for line in f:
        j_content = json.loads(line)
        if j_content['type'] == 'review':
            if j_content['label'] == 'Food-relevant':
                arr_rel.append(collections.defaultdict(float))
                for x in re.split('\W+', j_content['text']):
                    word = x.lower()
                    if word != '':
                        word_set.add(word)
                    arr_rel[index_rel][word] += 1.0
                arr_rel[index_rel] = normalize(arr_rel[index_rel])
                index_rel += 1
                
                count_food_rel_reviews += 1
            if j_content['label'] == 'Food-irrelevant':
                arr_irrel.append(collections.defaultdict(float))
                for x in re.split('\W+', j_content['text']):
                    word = x.lower()
                    if word != '':
                        word_set.add(word)
                    arr_irrel[index_irrel][word] += 1.0
                arr_irrel[index_irrel] = normalize(arr_irrel[index_irrel])
                index_irrel += 1
                count_food_irrel_reviews += 1
for index in xrange(len(arr_rel)):
    for key in arr_rel[index]:
        centroid_food_rel[key] += arr_rel[index][key]
for index in xrange(len(arr_irrel)):
    for key in arr_irrel[index]:
        centroid_food_irrel[key] += arr_irrel[index][key]
for key in centroid_food_rel:
    centroid_food_rel[key] = float(centroid_food_rel[key])/float(count_food_rel_reviews)
for key in centroid_food_irrel:
    centroid_food_irrel[key] = float(centroid_food_irrel[key])/float(count_food_irrel_reviews)

correct = 0.0
correct_euc = 0.0
total = 0.0
tp_relevant_rocchio_manhattan = 0
tn_relevant_rocchio_manhattan = 0
fp_relevant_rocchio_manhattan = 0
fn_relevant_rocchio_manhattan = 0
tp_irrelevant_rocchio_manhattan = 0
tn_irrelevant_rocchio_manhattan = 0
fp_irrelevant_rocchio_manhattan = 0
fn_irrelevant_rocchio_manhattan = 0
tp_relevant_rocchio_euclidean = 0
tn_relevant_rocchio_euclidean = 0
fp_relevant_rocchio_euclidean = 0
fn_relevant_rocchio_euclidean = 0
tp_irrelevant_rocchio_euclidean = 0
tn_irrelevant_rocchio_euclidean = 0
fp_irrelevant_rocchio_euclidean = 0
fn_irrelevant_rocchio_euclidean = 0
with open('testing_data.json') as f:
    for line in f:
        manhattan_distance_relevant = 0.0
        manhattan_distance_irrelevant = 0.0
        euclidean_distance_relevant = 0.0
        euclidean_distance_irrelevant = 0.0
        j_test_content = json.loads(line)
        document_vector = collections.defaultdict(float)
        if j_test_content['type'] == 'review':
            for x in re.split('\W+', j_test_content['text']):
                word = x.lower()
                document_vector[word] += 1
        magnitude_document_vector = 0.0
        for key in document_vector:
            magnitude_document_vector += document_vector[key] ** 2
        magnitude_document_vector = math.sqrt(magnitude_document_vector)
        for key in document_vector:
            document_vector[key] = float(document_vector[key])/float(magnitude_document_vector)
        
        for word in word_set:
            d1 = 0.0
            d2 = 0.0
            if word in centroid_food_rel:
                d1 = centroid_food_rel[word]
            if word in document_vector:
                d2 = document_vector[word]
            euclidean_distance_relevant += (d1-d2) ** 2
            manhattan_distance_relevant += abs(d1 - d2)
        
        for word in word_set:
            d1 = 0.0
            d2 = 0.0
            if word in centroid_food_irrel:
                d1 = centroid_food_irrel[word]
            if word in document_vector:
                d2 = document_vector[word]
            euclidean_distance_irrelevant += (d1-d2) ** 2
            manhattan_distance_irrelevant += abs(d1 - d2)
        if manhattan_distance_relevant < manhattan_distance_irrelevant:
            if j_test_content['label'] == 'Food-relevant':
                correct += 1.0
                tp_relevant_rocchio_manhattan += 1
                tn_irrelevant_rocchio_manhattan += 1
            if j_test_content['label'] == 'Food-irrelevant':
                fp_relevant_rocchio_manhattan += 1
                fn_irrelevant_rocchio_manhattan += 1
        else:
            if j_test_content['label'] == 'Food-irrelevant':
                correct += 1.0
                tn_relevant_rocchio_manhattan += 1
                tp_irrelevant_rocchio_manhattan += 1
            if j_test_content['label'] == 'Food-relevant':
                fn_relevant_rocchio_manhattan += 1
                fp_irrelevant_rocchio_manhattan += 1
        if euclidean_distance_relevant < euclidean_distance_irrelevant:
            if j_test_content['label'] == 'Food-relevant':
                correct_euc += 1.0
                tp_relevant_rocchio_euclidean += 1
                tn_irrelevant_rocchio_euclidean += 1
            if j_test_content['label'] == 'Food-irrelevant':
                fp_relevant_rocchio_euclidean += 1
                fn_irrelevant_rocchio_euclidean += 1
        else:
            if j_test_content['label'] == 'Food-irrelevant':
                correct_euc += 1.0
                tn_relevant_rocchio_euclidean += 1
                tp_irrelevant_rocchio_euclidean += 1
            if j_test_content['label'] == 'Food-relevant':
                fn_relevant_rocchio_euclidean += 1
                fp_irrelevant_rocchio_euclidean += 1
        total += 1.0

print "##### RESULTS FOR ROCCHIO #####"
print "\n"
print "Using  Manhattan Distance: "
print "Accuracy(Manhattan): ", float(correct)/float(total)
print "Precision(Manhattan) - relevant: ", float(tp_relevant_rocchio_manhattan)/float(tp_relevant_rocchio_manhattan + fp_relevant_rocchio_manhattan)
print "Recall(Manhattan) - relevant: ", float(tp_relevant_rocchio_manhattan)/float(tp_relevant_rocchio_manhattan + fn_relevant_rocchio_manhattan)
print "Precision(Manhattan) - irrelevant: ", float(tp_irrelevant_rocchio_manhattan)/float(tp_irrelevant_rocchio_manhattan + fp_irrelevant_rocchio_manhattan)
print "Recall(Manhattan) - irrelevant: ", float(tp_irrelevant_rocchio_manhattan)/float(tp_irrelevant_rocchio_manhattan + fn_irrelevant_rocchio_manhattan)
print "\n"
print "Using Euclidean Distance: "
print "Accuracy(Euclidean): ", float(correct_euc)/float(total)
print "Precision(Euclidean) - relevant: ", float(tp_relevant_rocchio_euclidean)/float(tp_relevant_rocchio_euclidean + fp_relevant_rocchio_euclidean)
print "Recall(Euclidean) - relevant: ", float(tp_relevant_rocchio_euclidean)/float(tp_relevant_rocchio_euclidean + fn_relevant_rocchio_euclidean)
print "Precision(Euclidean) - irrelevant: ", float(tp_irrelevant_rocchio_euclidean)/float(tp_irrelevant_rocchio_euclidean + fp_irrelevant_rocchio_euclidean)
print "Recall(Euclidean) - irrelevant: ", float(tp_irrelevant_rocchio_euclidean)/float(tp_irrelevant_rocchio_euclidean + fn_irrelevant_rocchio_euclidean)

##### RESULTS FOR ROCCHIO #####


Using  Manhattan Distance: 
Accuracy(Manhattan):  0.72
Precision(Manhattan) - relevant:  0.722222222222
Recall(Manhattan) - relevant:  0.955882352941
Precision(Manhattan) - irrelevant:  0.7
Recall(Manhattan) - irrelevant:  0.21875


Using Euclidean Distance: 
Accuracy(Euclidean):  0.65
Precision(Euclidean) - relevant:  0.8
Recall(Euclidean) - relevant:  0.647058823529
Precision(Euclidean) - irrelevant:  0.466666666667
Recall(Euclidean) - irrelevant:  0.65625


In [None]:
# Apply your classifier on the test data. Report the results.
# Insert as many cells as you want


# Part 3: Naive Bayes vs. Rocchio [20 points]

Which method gives the better results? In terms of what? How did you compare them? Can you explain why you observe what you do? Write 1-3 paragraphs below.

**Results for Naive Bayes:**

Accuracy: 0.89, Precision - relevant:  0.901408450704, Recall - relevant:  0.941176470588, Precision - irrelevant:  0.862068965517, Recall - irrelevant:  0.78125

**Results for Rocchio using Euclidean Distance:**

Accuracy(Euclidean):  0.65, Precision(Euclidean) - relevant:  0.8, Recall(Euclidean) - relevant:  0.647058823529, Precision(Euclidean) - irrelevant:  0.466666666667, Recall(Euclidean) - irrelevant:  0.65625

Comparing the two classifiers on the same 100 test reviews, Naive Bayes beats Rocchio (Euclidean) on every reported measure: accuracy, and precision and recall for both classes. A likely reason is that Rocchio cannot handle nonconvex, multimodal classes, because it represents each class by a single centroid.

Suppose we have two classes, A and B. The points of class A form two groups that lie far apart, one far to the left and one far to the right of the Y-axis, while the points of class B are clustered close to the negative Y-axis. The centroid of A then falls between its two groups, near the Y-axis. Now take a point O that lies much closer to the points of B than to any point of A. O can still be closer to the centroid of A than to the centroid of B, so Rocchio misclassifies it. Naive Bayes avoids this kind of error because it does not reduce a class to one prototype; it picks the maximum a posteriori class.

Hence, Naive Bayes gives a much better accuracy than Rocchio.
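
The two-class scenario described above can be made concrete with a small 2D sketch. The coordinates are invented purely for illustration; they are not derived from the review data.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2D points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def mean_point(points):
    """Centroid of a list of 2D points."""
    n = float(len(points))
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

# Class A is split into two far-apart groups; class B is one tight cluster.
class_a = [(-10.0, 1.0), (10.0, 1.0)]
class_b = [(-1.0, -1.0), (0.0, -1.0), (1.0, -1.0)]
o = (0.0, 0.2)  # the point to classify

centroid_a = mean_point(class_a)  # lands between A's two groups
centroid_b = mean_point(class_b)

rocchio_label = "A" if dist(o, centroid_a) < dist(o, centroid_b) else "B"
nearest_a = min(dist(o, p) for p in class_a)
nearest_b = min(dist(o, p) for p in class_b)

print(rocchio_label)
print(nearest_a > nearest_b)
```

Here O is assigned to A by the centroid rule even though every point of A is far from O and the nearest point of B is close by.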

# Part 4: Recommenders [10 points]

Finally, since we've begun our discussion of recommenders, let's do a quick problem too:

The table below is a utility matrix, representing the ratings, on a 1–5 star scale, of eight items, *a* through *h*, by three users *A*, *B*, and *C*. 
<pre>

  | a  b  c  d  e  f  g  h
--|-----------------------
A | 4  5     5  1     3  2
B |    3  4  3  1  2  1
C | 2     1  3     4  5  3

</pre>

Compute the following from the data of this matrix.

(a) Treating the utility matrix as boolean, compute the Jaccard distance between each pair of users.

(b) Repeat Part (a), but use the cosine distance.

(c) Treat ratings of 3, 4, and 5 as 1 and 1, 2, and blank as 0. Compute the Jaccard distance between each pair of users.

(d) Repeat Part (c), but use the cosine distance.

(e) Normalize the matrix by subtracting from each nonblank entry the average
value for its user.

(f) Using the normalized matrix from Part (e), compute the cosine distance
between each pair of users.

(g) Which of the approaches above seems most reasonable to you? Give a one or two sentence argument supporting your choice.

**Add your answer here:**

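(a) Treating every rated item as 1, each sum below counts the items both users rated, and 8 is the number of items rated by at least one of them.

Jaccard similarity (A,B) = (0+1+0+1+1+0+1+0)/8 = 4/8 = 1/2, so Jaccard distance (A,B) = 1 - 1/2 = 1/2

Jaccard similarity (A,C) = (1+0+0+1+0+0+1+1)/8 = 4/8 = 1/2, so Jaccard distance (A,C) = 1/2

Jaccard similarity (B,C) = (0+0+1+1+0+1+1+0)/8 = 4/8 = 1/2, so Jaccard distance (B,C) = 1/2

(b) Taking cosine distance as 1 - cosine similarity:

Cosine similarity (A,B) = 4/(sqrt(6) * sqrt(6)) = 4/6 = 2/3, so cosine distance (A,B) = 1/3

Cosine similarity (A,C) = 4/(sqrt(6) * sqrt(6)) = 2/3, so cosine distance (A,C) = 1/3

Cosine similarity (B,C) = 4/(sqrt(6) * sqrt(6)) = 2/3, so cosine distance (B,C) = 1/3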


(c) 

This is the modified matrix which we use for part(c) and part(d)

    a   b   c   d   e   f   g   h
    
    1   1   0   1   0   0   1   0    A 
    
    0   1   1   1   0   0   0   0    B 
        
    0   0   0   1   0   1   1   1    C 
   
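Jaccard similarity (A,B) = (0+1+0+1+0+0+0+0)/5 = 2/5, so Jaccard distance (A,B) = 1 - 2/5 = 3/5

Jaccard similarity (A,C) = (0+0+0+1+0+0+1+0)/6 = 2/6 = 1/3, so Jaccard distance (A,C) = 1 - 1/3 = 2/3

Jaccard similarity (B,C) = (0+0+0+1+0+0+0+0)/6 = 1/6, so Jaccard distance (B,C) = 1 - 1/6 = 5/6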



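(d) Taking cosine distance as 1 - cosine similarity:

Cosine similarity (A,B) = 2/(sqrt(4) * sqrt(3)) = 1/sqrt(3) ≈ 0.577, so cosine distance (A,B) ≈ 0.423

Cosine similarity (A,C) = 2/(sqrt(4) * sqrt(4)) = 1/2, so cosine distance (A,C) = 1/2

Cosine similarity (B,C) = 1/(sqrt(3) * sqrt(4)) = 1/(2 * sqrt(3)) ≈ 0.289, so cosine distance (B,C) ≈ 0.711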
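
The hand calculations for parts (a) through (d) can be cross-checked with a short script. This is a sketch under two assumptions: each user is represented by the set of items that map to 1, and distance is taken as 1 minus the corresponding similarity (cosine distance is sometimes defined as the angle instead). The function and variable names are illustrative.

```python
import math

# Utility matrix from the problem statement; blank cells are simply absent.
ratings = {
    "A": {"a": 4, "b": 5, "d": 5, "e": 1, "g": 3, "h": 2},
    "B": {"b": 3, "c": 4, "d": 3, "e": 1, "f": 2, "g": 1},
    "C": {"a": 2, "c": 1, "d": 3, "f": 4, "g": 5, "h": 3},
}
PAIRS = [("A", "B"), ("A", "C"), ("B", "C")]

def jaccard_distance(s, t):
    """1 - |intersection| / |union| for two item sets."""
    return 1.0 - len(s & t) / float(len(s | t))

def cosine_distance(s, t):
    """1 - cosine similarity of the 0/1 vectors that the two sets represent."""
    return 1.0 - len(s & t) / math.sqrt(len(s) * len(t))

# Parts (a)/(b): any rating counts as 1. Parts (c)/(d): only ratings of 3 or more.
rated = dict((user, set(row)) for user, row in ratings.items())
liked = dict((user, set(i for i, r in row.items() if r >= 3))
             for user, row in ratings.items())

for name, sets in (("rated", rated), ("rating >= 3", liked)):
    for u, v in PAIRS:
        print("%-12s %s,%s  Jaccard %.3f  cosine %.3f" % (
            name, u, v,
            jaccard_distance(sets[u], sets[v]),
            cosine_distance(sets[u], sets[v])))
```

For 0/1 vectors the dot product is the size of the intersection and each norm is the square root of the set size, which is why sets are enough here.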

(e)

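Original ratings with each user's average:

    a       b       c       d       e       f       g       h
    4       5               5       1               3       2       A (avg = 3.33)
            3       4       3       1       2       1               B (avg = 2.33)
    2               1       3               4       5       3       C (avg = 3)

Normalized ratings (each nonblank entry minus the user's average):

    a       b       c       d       e       f       g       h
    0.67    1.67            1.67    -2.33           -0.33   -1.33   A
            0.67    1.67    0.67    -1.33   -0.33   -1.33           B
    -1              -2      0               1       2       0       C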



(f) Only items rated by both users contribute to the numerator. Taking cosine distance as 1 - cosine similarity:

Cosine similarity (A,B) = (1.67 x 0.67 + 1.67 x 0.67 + 2.33 x 1.33 + 0.33 x 1.33) / (sqrt(0.67^2 + 1.67^2 + 1.67^2 + 2.33^2 + 0.33^2 + 1.33^2) x sqrt(0.67^2 + 1.67^2 + 0.67^2 + 1.33^2 + 0.33^2 + 1.33^2)) ≈ 0.584, so cosine distance (A,B) ≈ 0.416

Cosine similarity (A,C) = (0.67 x (-1) - 0.33 x 2) / (sqrt(0.67^2 + 1.67^2 + 1.67^2 + 2.33^2 + 0.33^2 + 1.33^2) x sqrt(1^2 + 2^2 + 1^2 + 2^2)) ≈ -0.115, so cosine distance (A,C) ≈ 1.115

Cosine similarity (B,C) = (1.67 x (-2) - 0.33 x 1 - 1.33 x 2) / (sqrt(0.67^2 + 1.67^2 + 0.67^2 + 1.33^2 + 0.33^2 + 1.33^2) x sqrt(1^2 + 2^2 + 1^2 + 2^2)) ≈ -0.740, so cosine distance (B,C) ≈ 1.740
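
The normalized-matrix cosines in part (f) can be checked numerically. The sketch below subtracts each user's mean from that user's nonblank ratings and treats blanks as 0 afterwards; the function names are illustrative, and cosine distance is assumed to be 1 minus the similarity.

```python
import math

# Utility matrix from the problem statement; blank cells are simply absent.
ratings = {
    "A": {"a": 4, "b": 5, "d": 5, "e": 1, "g": 3, "h": 2},
    "B": {"b": 3, "c": 4, "d": 3, "e": 1, "f": 2, "g": 1},
    "C": {"a": 2, "c": 1, "d": 3, "f": 4, "g": 5, "h": 3},
}

def center(user_ratings):
    """Subtract the user's average rating from each nonblank entry."""
    mean = sum(user_ratings.values()) / float(len(user_ratings))
    return dict((item, value - mean) for item, value in user_ratings.items())

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (missing entries are 0)."""
    dot = sum(u[item] * v[item] for item in u if item in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

centered = dict((user, center(row)) for user, row in ratings.items())
sim_ab = cosine_similarity(centered["A"], centered["B"])
sim_ac = cosine_similarity(centered["A"], centered["C"])
sim_bc = cosine_similarity(centered["B"], centered["C"])

for pair, sim in (("A,B", sim_ab), ("A,C", sim_ac), ("B,C", sim_bc)):
    print("%s  similarity %.3f  distance %.3f" % (pair, sim, 1.0 - sim))
```

After centering, a negative similarity signals users whose above-average and below-average items tend to be opposite.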



(g) Cosine similarity on the normalized ratings seems most reasonable. Jaccard treats every rated item as 1 and every unrated item as 0, so it ignores the value of the rating, whereas cosine takes the rating values into account, and normalizing them by each user's average gives a better result. Cosine similarity also works for arbitrary vectors, unlike Jaccard.