# Part 1: Term embeddings + SVM

### Dataset


For this homework, we will still play with Yelp reviews from the [Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge). As in Homework 1, you'll see that each line corresponds to a review on a particular business. Each review has a unique "ID" and the text content is in the "review" field. Additionally, this time, we also offer you the "label". If `label=1`, it means that this review is `Food-relevant`. If `label=0`, it means that this review is `Food-irrelevant`. Similarly, we have already done some basic preprocessing on the reviews, so you can just tokenize each review using whitespace.

There are about 40,000 reviews in total, of which about 20,000 are "Food-irrelevant". We split the review data into two sets: *review_train.json* is the training set and *review_test.json* is the testing set.
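
As a rough illustration of the file layout described above, the sketch below parses two made-up lines in the same one-JSON-object-per-line shape and tokenizes the review text on whitespace. The sample records and the helper name `parse_reviews` are hypothetical; only the field names `id`, `review`, and `label` come from the description.

```python
import io
import json

# Two made-up records in the one-JSON-object-per-line layout (not real Yelp data).
sample = io.StringIO(
    '{"id": "r1", "review": "great tacos and salsa", "label": 1}\n'
    '{"id": "r2", "review": "the dentist was on time", "label": 0}\n'
)

def parse_reviews(lines):
    """Yield (id, tokens, label) for each non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # The reviews are already preprocessed, so whitespace splitting is enough.
        yield record["id"], record["review"].split(), record["label"]

parsed = list(parse_reviews(sample))
print(parsed[0])
```

The notebook itself loads the files with `pd.read_json(..., lines=True)`; the standard-library version here is only meant to show what one line contains.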

In [3]:
# Please load the dataset
# Your code below
import numpy as np
import pandas as pd
import json
import math
data_train = pd.read_json('review_train.json', lines=True)#Importing dataset for training
[ids_train,reviews_train,labels_train]=data_train['id'],data_train['review'],data_train['label']
data_test = pd.read_json('review_test.json', lines=True)#Importing dataset for testing
[ids_test,reviews_test,labels_test]=data_test['id'],data_test['review'],data_test['label']

###  Pre-trained term embeddings

To save time, you can make use of pre-trained term embeddings. In this homework, we are using one of the pre-trained models from [GloVe](https://nlp.stanford.edu/projects/glove/): the 50-dimensional *glove.6B* vectors, which were trained on a 6-billion-token corpus (Wikipedia 2014 and Gigaword 5). GloVe is quite similar to word2vec. Unzip the *glove.6B.50d.txt.zip* file and run the code below. You will then be able to load the term embeddings model, in which each word is represented by a 50-dimensional vector.

In [180]:
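# reload the pre-trained term embeddings into a dict: word -> 50-d NumPy vector
import numpy as np

model = {}
with open("glove.6B.50d.txt", "rb") as lines:
    for line in lines:
        parts = line.split()  # first field is the word, the rest are its vector components
        model[str(parts[0], 'utf-8')] = np.array(list(map(float, parts[1:])))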

Now, you have a vector representation for each word. First, we use the simple (arithmetic) **mean** of these vectors of words in a review to represent the review. *Note: Just ignore those words which are not in the corpus of this pre-trained model.*
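
To make the averaging step concrete, here is a minimal sketch on a toy vocabulary. The 3-dimensional vectors, the name `toy_model`, and the helper `mean_embedding` are hypothetical, and returning a zero vector when no word is known is an assumption of this sketch rather than something the assignment specifies.

```python
from statistics import fmean

# Hypothetical 3-dimensional embeddings; the real GloVe vectors have 50 dimensions.
toy_model = {
    "good": [1.0, 0.0, 2.0],
    "pizza": [3.0, 4.0, 0.0],
}

def mean_embedding(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # Assumption: an all-unknown review is represented by the zero vector.
        return [0.0] * dim
    # zip(*known) groups the values dimension by dimension.
    return [fmean(column) for column in zip(*known)]

# "zzzz" is not in the toy vocabulary, so it is ignored.
print(mean_embedding("good pizza zzzz".split(), toy_model, dim=3))  # [2.0, 2.0, 1.0]
```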

In [183]:
# Please figure out the vector representation for each review in the training data and testing data.
# Your code below
def vector(reviews):
    """Return one vector per review: the mean of its in-vocabulary word vectors."""
    dim = len(next(iter(model.values())))#Embedding size (50 for this GloVe file)
    vecrep=[]#One row per review
    for review in reviews:
        #Keep only the words that have a vec-rep in the GloVe model
        b=[model[word] for word in review.split() if word in model]
        #Column-wise mean; a review with no known words falls back to a zero vector
        vecrep.append(np.mean(b,axis=0) if b else np.zeros(dim))
    return vecrep
vecrep_train=vector(reviews_train)
vecrep_test=vector(reviews_test)

### SVM

With the vector representations you get for each review, please train an SVM model to predict whether a given review is food-relevant or not. **You do not need to implement any classifier from scratch. You may use scikit-learn's built-in capabilities.** You can only train your model with reviews in *review_train.json*.

In [184]:
# SVM model training
# Your code here
from sklearn import svm
Y=labels_train.tolist()#Training labels as a plain list
clf = svm.SVC(C=5,kernel='linear')#Larger penalty than the default C=1, with a linear kernel
clf.fit(vecrep_train, Y)#Training the SVM model with the training dataset vecrep and the training labels

SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

Your goal is to predict whether a given review is food-relevant or not. Please report the overall accuracy, precision and recall of your model on the **testing data**. You should **implement the functions for accuracy, precision, and recall**.
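
For reference, the three metrics can be written directly in terms of confusion-matrix counts. The sketch below is one possible way to do that on made-up labels; the helper names `confusion_counts` and `metrics` are hypothetical, and label 1 (food-relevant) is assumed to be the positive class.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn), treating label 1 as the positive class."""
    counts = Counter(zip(actual, predicted))  # keys are (actual, predicted)
    return counts[1, 1], counts[0, 1], counts[1, 0], counts[0, 0]

def metrics(actual, predicted):
    """Return (accuracy, precision, recall) for binary labels."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual positives, how many were found
    return accuracy, precision, recall

# Made-up labels: 2 true positives, 1 false positive, 2 false negatives, 3 true negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
acc, prec, rec = metrics(actual, predicted)
print("Accuracy: %f Precision: %f Recall: %f" % (acc, prec, rec))
```

Keeping the `(actual, predicted)` key order explicit makes it harder to mix up false positives and false negatives.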

In [185]:
# Your code here
from collections import Counter
pred=clf.predict(vecrep_test)
def accuracy(predicted_labels, actual_labels):
    diff = np.array(predicted_labels) - np.array(actual_labels)
    return 1.0 - (float(np.count_nonzero(diff)) / len(diff))
def evaluation(predicted_labels, actual_labels):
    counts = Counter(zip(actual_labels, predicted_labels))#Keys are (actual, predicted)
    true_pos=counts[1,1]
    false_neg=counts[1,0]#Actual 1, predicted 0
    false_pos=counts[0,1]#Actual 0, predicted 1
    true_neg=counts[0,0]
    recall = true_pos / float(true_pos + false_neg)
    precision = true_pos / float(true_pos + false_pos)
    print("Recall: %f"%recall,"Precision: %f"%precision)
Y_test=[labels_test[elem] for elem in range(len(labels_test))]
print("Accuracy: %f"%accuracy(pred,Y_test))
evaluation(pred,Y_test)

Accuracy: 0.907634
Recall: 0.923129 Precision: 0.894977


### Document-based embeddings

Instead of taking the mean of term embeddings, you can directly train a **doc2vec** model for paragraph or document embeddings. You can refer to the paper [Distributed Representations of Sentences and Documents](https://arxiv.org/pdf/1405.4053v2.pdf) for more details. And in this homework, you can make use of the implementation in [gensim](https://radimrehurek.com/gensim/models/doc2vec.html).

Now, you need to:
* Train a doc2vec model based on all reviews you have (training + testing sets).
* Use the embeddings from your doc2vec model to represent each review and train a new SVM model.
* Report the overall accuracy, precision and recall of your model on the testing data.

In [273]:
!pip install -U gensim


In [186]:
# Train a doc2vec
# Your code here
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
tagged_data = [TaggedDocument(words=_d.split(), tags=[ids_train[i]])
               for i, _d in enumerate(reviews_train)]#Splitting the review into a BOW and Using the IDs as tags
tagged_test = [TaggedDocument(words=_d.split(), tags=[ids_test[i]]) 
               for i, _d in enumerate(reviews_test)]
tagged_data=tagged_data+tagged_test#Total tagged data=tagged training+testing data
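# vector_size and epochs replace the deprecated `size` and `iter` names.
# Note: passing the corpus to the constructor already builds the vocabulary and
# trains for `epochs` passes; the train() call below runs a further 50 passes.
model_doc2vec = Doc2Vec(tagged_data,vector_size=20,window=5,
                alpha=0.025, 
                min_alpha=0.0025,
                min_count=1,
                dm =1,epochs=50)#Setting the number of epochs to 50

print ('Start training process...')
model_doc2vec.train(tagged_data, total_examples=model_doc2vec.corpus_count, epochs=model_doc2vec.epochs)
print ('Model Trained')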



Start training process...
Model Trained


In [238]:
# Train a SVM
# Your code here
Y_doc2vec=[labels_train[elem] for elem in range(len(labels_train))]
#Building the training dataset by getting the vec-reps through the review ids
vecrep_train_doc2vec=[model_doc2vec.docvecs[ids_train[elem]] for elem in range(len(reviews_train))]
#Building a new SVM model for doc2vec
clf_doc2vec = svm.SVC(C=5,kernel='linear')
clf_doc2vec.fit(vecrep_train_doc2vec, Y_doc2vec)  
print("SVM Model Trained")

SVM Model Trained


In [189]:
# Report the performance
# Your code here
#Building the test dataset by getting the vec-reps through the review ids
vecrep_test_doc2vec=[model_doc2vec.docvecs[ids_test[elem]] for elem in range(len(reviews_test))]
#Predicting the result using the new SVM model created
pred_doc2vec=clf_doc2vec.predict(vecrep_test_doc2vec)
#Evaluating performance
Y_test=[labels_test[elem] for elem in range(len(labels_test))]
print("Accuracy: %f"%accuracy(pred_doc2vec,Y_test))
evaluation(pred_doc2vec,Y_test)

Accuracy: 0.937836
Recall: 0.938436 Precision: 0.937017


What do you observe? How different are your results for the term-based average approach vs. the doc2vec approach? Why do you think this is?

#### Answer:-
A marked difference in accuracy can be observed between the two approaches: accuracy increases from 90.76% to 93.78%, and recall and precision improve as well.
The doc2vec approach performs better because it trains a neural network directly on these reviews, so it is forced to learn one embedding per document that reflects how the words in that review relate to one another. In that sense, doc2vec is a predictive model whose embeddings are fitted to this corpus. The GloVe model, in contrast, is count-based: it constructs a co-occurrence matrix (words * contexts) that counts how frequently a word appears in a context, and then factorizes that matrix into low-dimensional vectors. Those vectors were pre-trained on a different corpus, and taking their mean gives every word in a review equal weight.
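
The co-occurrence counting mentioned in the answer can be sketched as follows. This is only a simplified illustration of the counting step with unweighted counts, not GloVe's actual training procedure; the toy sentences, the window size, and the helper name `cooccurrence_counts` are all hypothetical.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, context word) pair falls within `window` tokens."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not its own context
                    counts[word, tokens[j]] += 1
    return counts

toy_sentences = ["the pizza was great", "the pasta was great"]
counts = cooccurrence_counts(toy_sentences, window=1)
print(counts["was", "great"])    # 2: the pair is adjacent in both sentences
print(counts["pizza", "pasta"])  # 0: the two words never share a window
```

A count-based model starts from a table like this, whereas doc2vec never materializes it and instead updates its vectors one prediction at a time.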

### Can you do better?

Finally, see if you can do better than either the word- or doc- based embeddings approach for classification. You may explore new features, new classifiers, etc. Whatever you like. Just provide your code and a justification.

### Answer:-
#### Part 1:-
We reuse the doc2vec embeddings (as they performed better) and feed them to a logistic regression model. The accuracy increases only very slightly, from 93.78% to 93.85%.

In [275]:
# your code here
from sklearn import linear_model
clf_lr = linear_model.LogisticRegressionCV()
clf_lr.fit(vecrep_train_doc2vec, Y_doc2vec)
pred_lr=clf_lr.predict(vecrep_test_doc2vec)
Y_test_lr=[labels_test[elem] for elem in range(len(labels_test))]
print("Accuracy using Logistic Regression: %f"%accuracy(pred_lr,Y_test_lr))
evaluation(pred_lr,Y_test_lr)

Accuracy using Logistic Regression: 0.938507
Recall: 0.938099 Precision: 0.938573


#### Part 2:-
We then experiment with two more classifiers, Naive Bayes and Perceptron. Both give lower, though mutually similar, accuracy (91.73% and 91.34%). The two differ in how they balance recall and precision: for Naive Bayes the two values are about five points apart (0.94 vs. 0.89), and for the Perceptron the gap runs the other way (0.90 vs. 0.93).

In [262]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
#Gaussian Naive Bayes on the doc2vec vectors
gnb = GaussianNB()
gnb.fit(vecrep_train_doc2vec, Y_doc2vec)
pred_nb=gnb.predict(vecrep_test_doc2vec)
print("Accuracy for Naive Bayes: %f"%accuracy(pred_nb,Y_test_lr))
evaluation(pred_nb,Y_test_lr)
#Perceptron on the same vectors; alpha only takes effect when a penalty is set (default penalty=None)
clf_p = Perceptron(alpha=0.001,tol=1, random_state=0)
clf_p.fit(vecrep_train_doc2vec, Y_doc2vec)
pred_p=clf_p.predict(vecrep_test_doc2vec)
print("Accuracy for Perceptron: %f"%accuracy(pred_p,Y_test_lr))
evaluation(pred_p,Y_test_lr)

Accuracy for Naive Bayes: 0.917282
Recall: 0.890328 Precision: 0.940643
Accuracy for Perceptron: 0.913423
Recall: 0.930530 Precision: 0.899366


#### Conclusion:-
We can conclude that, on these embeddings, the SVM performs about as well as the other classifiers we tried. We therefore try to improve the doc2vec vectors themselves by training the model for 100 epochs, following the code at https://medium.com/@mishra.thedeepak/doc2vec-simple-implementation-example-df2afbbfbad5.

In [268]:
max_epochs = 100#Number of passes through the loop below (the 100 epochs mentioned above)
model_doc2vec_mod = Doc2Vec(tagged_data,vector_size=20,window=5,
                alpha=0.025, 
                min_alpha=0.0025,
                min_count=1,
                dm =1)

print ('Start training process...')
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model_doc2vec_mod.train(tagged_data,
                total_examples=model_doc2vec_mod.corpus_count,
                epochs=model_doc2vec_mod.epochs)
    # decrease the learning rate
    model_doc2vec_mod.alpha -= 0.0002
    # fix the learning rate, no decay
    model_doc2vec_mod.min_alpha = model_doc2vec_mod.alpha
print ('Model Trained')



Start training process...
iteration 0
iteration 1
[...]
iteration 77
[output truncated]

#### Observation:-
The computation time is drastically increased.
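
The loop above lowers `alpha` by 0.0002 after every pass. The short sketch below, which is only arithmetic and does not call gensim, lists the learning rates that schedule would visit. The value `max_epochs = 100` is an assumption taken from the "100 epochs" mentioned in the Conclusion.

```python
# Values taken from the training loop; max_epochs = 100 is an assumption based on
# the "100 epochs" mentioned in the Conclusion.
start_alpha = 0.025
step = 0.0002
max_epochs = 100

# Learning rate in effect during each pass (before that pass's decrement).
alphas = [start_alpha - step * i for i in range(max_epochs)]
# Learning rate left on the model after the last decrement.
final_alpha = start_alpha - step * max_epochs

print(round(alphas[0], 4), round(alphas[-1], 4), round(final_alpha, 4))
```

Under these assumptions the rate stays positive throughout and ends near 0.005, so the manual decay never drives the learning rate to zero or below.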

In [269]:
Y_doc2vec_mod=[labels_train[elem] for elem in range(len(labels_train))]
vecrep_train_doc2vec_mod=[model_doc2vec_mod.docvecs[ids_train[elem]] for elem in range(len(reviews_train))]
clf_doc2vec = svm.SVC(C=5,kernel='linear')
clf_doc2vec.fit(vecrep_train_doc2vec_mod, Y_doc2vec_mod)  
print("SVM Model Trained")
vecrep_test_doc2vec_mod=[model_doc2vec_mod.docvecs[ids_test[elem]] for elem in range(len(reviews_test))]
pred_doc2vec_mod=clf_doc2vec.predict(vecrep_test_doc2vec_mod)
print("Accuracy: %f"%accuracy(pred_doc2vec_mod,Y_test))
evaluation(pred_doc2vec_mod,Y_test)

SVM Model Trained
Accuracy: 0.938423
Recall: 0.938940 Precision: 0.937678


In [270]:
# your code here
from sklearn import linear_model
clf_lr = linear_model.LogisticRegressionCV()
clf_lr.fit(vecrep_train_doc2vec_mod, Y_doc2vec_mod) 
pred_lr=clf_lr.predict(vecrep_test_doc2vec_mod)
Y_test_lr=[labels_test[elem] for elem in range(len(labels_test))]
print("Accuracy: %f"%accuracy(pred_lr,Y_test_lr))
evaluation(pred_lr,Y_test_lr)

Accuracy: 0.939597
Recall: 0.936922 Precision: 0.941674


#### Final Conclusion:-
The retrained vector representations are passed through an SVM and a logistic regression model. Both slightly outperform the original doc2vec SVM accuracy of 93.78%, reaching 93.84% and 93.96% respectively.

# Part 2: NDCG

You calculated the recall and precision in Part 1 and now you get a chance to implement NDCG. 

Assume that Amy searches for "food-relevant" reviews in the **testing set** on two search engines `A` and `B`. Since the ground-truth labels for the reviews are unknown to A and B, they need to make a prediction for each review and then return a ranked list of results based on their probabilities. The results from A are in *search_result_A.json*, and the results from B are in *search_result_B.json*. Each line contains the id of a review and its corresponding ranking.

You can check their labels in *review_test.json* while calculating the NDCG scores. If a review is "food-relevant", the relevance score is 1. Otherwise, the relevance score is 0.
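
As a small worked example of the metric, the sketch below computes DCG and NDCG for a made-up ranked list with binary relevance scores. It assumes the common `log2(position + 1)` discount and normalizes by the best possible list of the same length drawn from the whole pool; the names `ndcg`, `ranked`, and `pool` are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with a log2(position + 1) discount."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))

def ndcg(ranked_relevances, all_relevances):
    """Normalize by the DCG of the best possible list of the same length."""
    ideal = sorted(all_relevances, reverse=True)[:len(ranked_relevances)]
    ideal_dcg = dcg(ideal)
    return dcg(ranked_relevances) / ideal_dcg if ideal_dcg else 0.0

# Made-up example: a 3-item result list drawn from a pool with two relevant reviews.
ranked = [1, 0, 1]
pool = [1, 0, 1, 0, 0]
print(dcg(ranked))         # 1/log2(2) + 0 + 1/log2(4) = 1.5
print(ndcg(ranked, pool))  # 1.5 divided by the ideal DCG of [1, 1, 0]
```

Placing the second relevant review at position 3 instead of position 2 is what keeps the score below 1.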

In [271]:
import math
label_by_id=dict(zip(ids_test,labels_test))#Map each test review id to its label for fast lookup
#Importing data for Search Result A (sorted by rank in case the file is not already in rank order)
data_resA = pd.read_json('search_result_A.json', lines=True).sort_values('rank')
[ids_resA,rank_resA]=data_resA['id'],data_resA['rank']
labels_resA=[label_by_id[elem] for elem in ids_resA]#Labels for search result A
#Importing data for Search Result B
data_resB = pd.read_json('search_result_B.json', lines=True).sort_values('rank')
[ids_resB,rank_resB]=data_resB['id'],data_resB['rank']
labels_resB=[label_by_id[elem] for elem in ids_resB]#Labels for search result B
def dcg(items):
    #Discounted cumulative gain: the relevance at position i is divided by log2(i+1)
    total = 0
    for i, item in enumerate(items, start=1):
        total += item / math.log(i + 1, 2)
    return total
# NDCG for search_result_A.json
# Your code here
totalrank_A=len(ids_resA)#length of the A review list
dcg_A=dcg(labels_resA)#Calculate dcg for A
labels_list=sorted(labels_test,reverse=True)#Ideal ordering: all relevant test reviews first
idcg_A=dcg(labels_list[:totalrank_A])#Ideal DCG for a list of the same length as A
print("Results for A:-\nDCG:%f IDCG:%f NDCG:%f"%(dcg_A,idcg_A,dcg_A/idcg_A))#Print the result

Results for A:-
DCG:505.397928 IDCG:550.371256 NDCG:0.918285


In [272]:
# NDCG for search_result_B.json
# Your code here
totalrank_B=len(ids_resB)#length of the B review list
dcg_B=dcg(labels_resB)#Calculate dcg for B
idcg_B=dcg(labels_list[:totalrank_B])#Take the first totalrank_B relevant labels
print("Results for B:-\nDCG:%f IDCG:%f NDCG:%f"%(dcg_B,idcg_B,dcg_B/idcg_B))#Print the result

Results for B:-
DCG:121.345774 IDCG:123.091533 NDCG:0.985817
