The goal is to build a classifier that determines whether a Yelp review is "food-relevant" or not.

## Dataset: Yelp review data

First, download the training_data.json file from the Resources tab on Piazza. It is a collection of 40,000 JSON-encoded Yelp reviews we sampled from the [Yelp Dataset Challenge](https://www.yelp.com/dataset_challenge).

You'll see that each line corresponds to a review of a particular business. The label (class) of each review is in the "label" field. It is **either "Food-relevant" or "Food-irrelevant"**.

## Part 1.1: Parsing Yelp

For this first part, we will build a parser that extracts tokens from the **review text** only. First, tokenize each review using **whitespace and punctuation as delimiters**. Do not remove stopwords. Apply casefolding (lowercase everything) and stem each token with the [nltk Porter stemmer](http://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter). You may need to install nltk if you don't have it already.
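As a rough illustration of the parsing steps described above, the sketch below casefolds the text, splits on anything that is not a word character, and optionally stems each token. The function name `tokenize` and its `stem` parameter are made up for this example. Note that `\w` also matches digits and underscores, so this is only one reading of "whitespace and punctuation as delimiters".

```python
import re


def tokenize(text, stem=None):
    """Casefold, split on non-word characters, and optionally stem each token."""
    tokens = re.findall(r"\w+", text.lower())
    if stem is not None:
        tokens = [stem(tok) for tok in tokens]
    return tokens


print(tokenize("The FOOD was great, really great!"))

# To add Porter stemming (requires nltk):
#   from nltk.stem import PorterStemmer
#   tokenize(text, stem=PorterStemmer().stem)
```

Keeping the stemmer as a parameter makes it easy to compare the vocabulary size with and without stemming.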

### Unique tokens?

Once you have your parser working, you should report here the size of your feature space. That is, how many unique tokens do you find?

In [3]:
import json
import re

import numpy as np
from nltk.stem import PorterStemmer
from sklearn import svm
from sklearn.cross_validation import StratifiedKFold, train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier



file = 'F:/SEM-2/IR/HW_1/train.json'
st = PorterStemmer()
words = []
lines = []
labels = []

print('started')
with open(file) as f:
    for line in f:
        # Lowercasing the raw line casefolds the review text and the label in one step.
        json_data = json.loads(line.lower())
        text = json_data['text']
        lines.append(text)
        labels.append(json_data['label'])
        # Split on anything that is not a word character, then stem each token.
        for token in re.findall(r'\w+', text):
            words.append(st.stem(token))
frequency = {}

for word in words:
    count = frequency.get(word,0)
    frequency[word] = count + 1
     
frequency_list = frequency.keys()
print len(frequency_list)

started
36555


### The Most Popular Words

Great, now we can tokenize the documents. Let's make a list of the most popular words in our reviews. For this step, you should maintain a count of how many times each word occurs. Then you should print out the top-20 words in your reviews.

Your output should look like this:

Rank Token Count

1 awesome 78

... ...
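A small sketch of how the ranked table above could be produced with `collections.Counter`. The helper name `top_tokens` and the toy token list are invented for illustration; in the notebook the input would be the stemmed `words` list.

```python
from collections import Counter


def top_tokens(tokens, k=20):
    """Return the k most frequent tokens as (rank, token, count) rows."""
    counts = Counter(tokens)
    return [(rank, tok, n)
            for rank, (tok, n) in enumerate(counts.most_common(k), start=1)]


sample = ["good", "food", "good", "service", "good", "food"]
print("Rank Token Count")
for rank, tok, n in top_tokens(sample, k=2):
    print(rank, tok, n)
```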

In [4]:
freq_sorted = sorted(frequency, key=frequency.get, reverse=True)
for r in freq_sorted[:20]:
    print r, frequency[r]

the 246309
i 168931
and 168589
a 134904
to 128139
it 78867
of 76237
wa 74020
is 63496
for 60867
in 60523
that 50804
my 50565
you 45881
they 43635
thi 39940
with 39340
have 39082
but 37967
on 35388


### Zipf's Law

Recall our in-class discussion of Zipf's law. Let's see if this law applies to our Yelp reviews. You should use matplotlib to plot the log-base-10 term counts on the y-axis against the log-base-10 rank on the x-axis. The aim is a figure like Figure 5.2 of the textbook.

In [5]:
# your code here
print 'sss'

sss
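A possible starting point for the plot, written for Python 3. The helper names `zipf_points` and `plot_zipf` are hypothetical, and the toy dictionary stands in for the `frequency` dict built earlier. If the counts follow Zipf's law (count roughly proportional to 1/rank), the points should fall close to a straight line with slope near -1.

```python
import math


def zipf_points(frequency):
    """Return (log10 rank, log10 count) pairs, most frequent term first."""
    counts = sorted(frequency.values(), reverse=True)
    return [(math.log10(rank), math.log10(c))
            for rank, c in enumerate(counts, start=1)]


def plot_zipf(frequency):
    """Scatter the Zipf points; matplotlib is imported lazily so the
    helper above works even where matplotlib is not installed."""
    import matplotlib.pyplot as plt
    xs, ys = zip(*zipf_points(frequency))
    plt.plot(xs, ys, marker=".", linestyle="none")
    plt.xlabel("log10 rank")
    plt.ylabel("log10 term count")
    plt.title("Term count vs. rank")
    plt.show()


toy = {"the": 1000, "food": 100, "zesty": 10}
print(zipf_points(toy))
# With the real data: plot_zipf(frequency)
```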


What do you observe? Is this consistent with Zipf's law?

*Your answer goes here*

## Part 1.2: Feature Representation

In this part you will build a feature vector for each review. These vectors will be the input to our ML classifiers. You should call your parser from earlier, using all the same assumptions (e.g., casefolding, stemming). Each feature value should be the term count for that review.

In [6]:
print 'start'
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(lines)
print 'done'

start
done
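A default `CountVectorizer()` does its own tokenizing: it lowercases, applies no stemming, and its default token pattern drops one-character tokens such as "i" and "a". To reuse the parser from Part 1.1, a custom tokenizer can be passed in. The sketch below assumes a recent scikit-learn under Python 3; `make_tokenizer` and the toy documents are invented for illustration, and `stem` could be `PorterStemmer().stem` from nltk.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def make_tokenizer(stem=None):
    """Build a tokenizer that casefolds, splits on non-word chars, and stems."""
    def tokenize(text):
        tokens = re.findall(r"\w+", text.lower())
        return [stem(t) for t in tokens] if stem else tokens
    return tokenize


# token_pattern=None because the custom tokenizer replaces the default pattern.
vectorizer = CountVectorizer(tokenizer=make_tokenizer(),
                             lowercase=False, token_pattern=None)
docs = ["Good food, good food", "I had bad service"]
X_demo = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))
print(X_demo.toarray())
```

With this setup the number of columns in the matrix should match the unique-token count reported by the parser, which is a handy consistency check.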


## Part 1.3: Classifiers

In this part you will evaluate a bunch of classifiers -- kNN, Decision tree, Naive Bayes, and SVM -- on the feature vectors generated in the previous task in two different settings. **You do not need to implement any classifier from scratch. You may use scikit-learn's built-in capabilities.**

### Setting 1: Splitting data into train-test 

In the first setting, you should treat the first 70% of your data as training. The remaining 30% should be for testing. 
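Taken literally, "the first 70%" is a positional split rather than a shuffled one. A minimal sketch of that reading is below; the helper name `first_fraction_split` and the toy data are hypothetical. Slicing like this also works on the sparse matrix returned by a vectorizer, and `train_test_split(..., shuffle=False)` in newer scikit-learn versions should give an equivalent split.

```python
def first_fraction_split(X, y, train_fraction=0.7):
    """Split by position: the first train_fraction of rows train, the rest test."""
    cut = int(round(len(y) * train_fraction))
    return X[:cut], X[cut:], y[:cut], y[cut:]


reviews = ["r%d" % i for i in range(10)]
labels_demo = ["food-relevant", "food-irrelevant"] * 5
X_tr, X_te, y_tr, y_te = first_fraction_split(reviews, labels_demo)
print(len(X_tr), len(X_te))
```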

### Setting 2: Using 5 fold cross-validation

In the second setting, use 5-fold cross-validation.
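In 5-fold cross-validation each classifier is fit and scored once per fold, and the five scores are then averaged. The sketch below shows that pattern with `cross_val_score`, assuming a recent scikit-learn (`sklearn.model_selection`) under Python 3. The helper name `five_fold_accuracy` and the random toy count matrix are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB


def five_fold_accuracy(clf, X, y, seed=42):
    """Fit and score clf on each of 5 stratified folds; return the 5 accuracies."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(clf, X, y, cv=skf, scoring="accuracy")


# Toy term-count matrix: 50 "reviews", 8 "terms", balanced labels.
rng = np.random.RandomState(0)
X_toy = rng.poisson(lam=1.0, size=(50, 8))
y_toy = np.array(["food-relevant", "food-irrelevant"] * 25)

scores = five_fold_accuracy(MultinomialNB(), X_toy, y_toy)
print(scores)
print("mean accuracy:", scores.mean())
```

Reporting the mean (and perhaps the spread) over all folds gives a steadier estimate than the score of any single fold.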



In [7]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import train_test_split


print 'split'

X_train, X_test, y_train, y_test = train_test_split(X, np.transpose(labels), test_size=0.3, random_state=42)

print 'svm'
svm_clf = svm.SVC(kernel ='linear', C = 1.0)
svm_clf.fit(X_train,y_train)
svm_pred = svm_clf.predict(X_test)

print 'knn'
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_pred = knn_clf.predict(X_test)

print 'tree'
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train)
tree_pred = tree_clf.predict(X_test)

print 'nb'
nb_clf = MultinomialNB()
nb_clf.fit(X_train, y_train)
nb_pred = nb_clf.predict(X_test)

print 'accuracy with splitting data------------------------------'
print 'svm',accuracy_score(y_test, svm_pred)
print 'knn',accuracy_score(y_test, knn_pred)
print 'decision tree',accuracy_score(y_test, tree_pred)
print 'naive bayes',accuracy_score(y_test, nb_pred)

print 'precision and recall is only meaningful if they are calculated on the test data'
print 'so they are calculated on test data and for svm classifier'

target_names = ['relevant', 'irrelevant']
print(classification_report(y_test, svm_pred, target_names=target_names))

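y = np.transpose(labels)
skf = StratifiedKFold(y, n_folds=5)
# Note: this loop only records the index split. Once it ends, XK_* and yK_*
# hold the last of the 5 folds, and the classifiers below use that fold alone.
for train_index, test_index in skf:
    print("TRAIN:", train_index, "TEST:", test_index)
    XK_train, XK_test = X[train_index], X[test_index]
    yK_train, yK_test = y[train_index], y[test_index]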

svmK_clf = svm.SVC(kernel ='linear', C = 1.0)
svmK_clf.fit(XK_train,yK_train)
svmK_pred = svmK_clf.predict(XK_test)

knnK_clf = KNeighborsClassifier(n_neighbors=3)
knnK_clf.fit(XK_train, yK_train)
knnK_pred = knnK_clf.predict(XK_test)

treeK_clf = DecisionTreeClassifier()
treeK_clf.fit(XK_train, yK_train)
treeK_pred = treeK_clf.predict(XK_test)

nbK_clf = MultinomialNB()
nbK_clf.fit(XK_train, yK_train)
nbK_pred = nbK_clf.predict(XK_test)

print 'accuracy with 5 fold cross validation------------------------------'
print 'svm',accuracy_score(yK_test, svmK_pred)
print 'knn',accuracy_score(yK_test, knnK_pred)
print 'tree',accuracy_score(yK_test, treeK_pred)
print 'naive bayes',accuracy_score(yK_test, nbK_pred)

print 'precision and recall is only meaningful if they are calculated on the test data'
print 'so they are calculated on test data and for svm classifier'

target_names = ['relevant', 'irrelevant']
print(classification_report(yK_test, svmK_pred, target_names=target_names))


split
svm
knn
tree
nb
accuracy with splitting data------------------------------
svm 0.933083333333
knn 0.706083333333
decision tree 0.88375
naive bayes 0.948083333333
precision and recall is only meaningful if they are calculated on the test data
so they are calculated on test data and for svm classifier
             precision    recall  f1-score   support

   relevant       0.94      0.93      0.93      5999
 irrelevant       0.93      0.94      0.93      6001

avg / total       0.93      0.93      0.93     12000

('TRAIN:', array([ 5451,  5452,  5453, ..., 39997, 39998, 39999]), 'TEST:', array([    0,     1,     2, ..., 13245, 13246, 13247]))
('TRAIN:', array([    0,     1,     2, ..., 39997, 39998, 39999]), 'TEST:', array([ 5451,  5452,  5453, ..., 27312, 27313, 27314]))
('TRAIN:', array([    0,     1,     2, ..., 39997, 39998, 39999]), 'TEST:', array([11513, 11514, 11515, ..., 31997, 31998, 31999]))
('TRAIN:', array([    0,     1,     2, ..., 39997, 39998, 39999]), 'TEST:', array(
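One thing to watch when naming the rows of a classification report: as far as I know, scikit-learn lists the classes in sorted label order unless `labels=` is passed, and `target_names` is applied positionally. With the lowercased labels used here, 'food-irrelevant' sorts before 'food-relevant', so a names list that starts with 'relevant' would end up on the other row. The sketch below pins the order explicitly. The toy `y_true` and `y_pred` lists are invented for illustration, and `output_dict=True` assumes a reasonably recent scikit-learn.

```python
from sklearn.metrics import classification_report

y_true = ["food-relevant", "food-irrelevant", "food-relevant", "food-relevant"]
y_pred = ["food-relevant", "food-irrelevant", "food-irrelevant", "food-relevant"]

# Pair each label with its display name so the two lists cannot drift apart.
label_order = ["food-relevant", "food-irrelevant"]
display_names = ["relevant", "irrelevant"]

report = classification_report(y_true, y_pred,
                               labels=label_order,
                               target_names=display_names,
                               output_dict=True)
print(report["relevant"])
print(report["irrelevant"])
```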

## Part 1.5: Improving the classifier

I think we can do better! In this part, your job is to create new features that you think can help improve your classifier. You may choose to use new weightings for your words, new derived features (e.g., count of 3-letter words), or whatever you like. You may also add in the extra features in the json: funny, useful, cool. You will need to experiment with different approaches. Once you settle on your best approach, include the features here with a description (that is, tell us what the feature means). Then give us your classifier results!
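One way to try the "new weightings" idea is tf-idf. A possible setup, sketched below for Python 3 and a recent scikit-learn, wraps the vectorizer and the classifier in a pipeline so that the IDF weights are learned from the training text only. The toy reviews and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

train_texts = [
    "great pizza and pasta",
    "tasty burger and fries",
    "oil change for my car",
    "the dentist was on time",
]
train_labels = ["food-relevant", "food-relevant",
                "food-irrelevant", "food-irrelevant"]

# The pipeline fits the tf-idf vocabulary and weights on train_texts only,
# then applies the same transformation to anything passed to predict().
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1.0))
model.fit(train_texts, train_labels)

prediction = model.predict(["pizza and fries"])
print(prediction)
```

Because the pipeline behaves like a single estimator, it can also be handed to a cross-validation helper, which refits the vectorizer inside every fold.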

In [8]:
# your code here ... add as many cells as you need for features, results, and discussion.
print "we can use TFIDF"

from sklearn.feature_extraction.text import TfidfVectorizer
print 'start'
Tvectorizer = TfidfVectorizer()
XT = Tvectorizer.fit_transform(lines)
print 'done'



from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import train_test_split


print 'split'

XT_train, XT_test, yT_train, yT_test = train_test_split(XT, np.transpose(labels), test_size=0.3, random_state=42)

print 'svm'
svm_clf = svm.SVC(kernel ='linear', C = 1.0)
svm_clf.fit(XT_train,yT_train)
svm_pred = svm_clf.predict(XT_test)

print 'knn'
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(XT_train, yT_train)
knn_pred = knn_clf.predict(XT_test)

print 'tree'
tree_clf = DecisionTreeClassifier()
tree_clf.fit(XT_train, yT_train)
tree_pred = tree_clf.predict(XT_test)

print 'NB'
nb_clf = MultinomialNB()
nb_clf.fit(XT_train, yT_train)
nb_pred = nb_clf.predict(XT_test)

print 'accuracy with splitting data------------------------------'
print 'svm',accuracy_score(yT_test, svm_pred)
print 'knn',accuracy_score(yT_test, knn_pred)
print 'decision tree',accuracy_score(yT_test, tree_pred)
print 'naive bayes',accuracy_score(yT_test, nb_pred)

print 'precision and recall is only meaningful if they are calculated on the test data'
print 'so they are calculated on test data and for svm classifier'

target_names = ['relevant', 'irrelevant']
print(classification_report(yT_test, svm_pred, target_names=target_names))

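y = np.transpose(labels)
skf = StratifiedKFold(y, n_folds=5)
# Note: this loop only records the index split. Once it ends, XK_* and yK_*
# hold the last of the 5 folds, and the classifiers below use that fold alone.
for train_index, test_index in skf:
    #print("TRAIN:", train_index, "TEST:", test_index)
    XK_train, XK_test = XT[train_index], XT[test_index]
    yK_train, yK_test = y[train_index], y[test_index]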

svmK_clf = svm.SVC(kernel ='linear', C = 1.0)
svmK_clf.fit(XK_train,yK_train)
svmK_pred = svmK_clf.predict(XK_test)

knnK_clf = KNeighborsClassifier(n_neighbors=3)
knnK_clf.fit(XK_train, yK_train)
knnK_pred = knnK_clf.predict(XK_test)

treeK_clf = DecisionTreeClassifier()
treeK_clf.fit(XK_train, yK_train)
treeK_pred = treeK_clf.predict(XK_test)

nbK_clf = MultinomialNB()
nbK_clf.fit(XK_train, yK_train)
nbK_pred = nbK_clf.predict(XK_test)

print 'accuracy with 5 fold cross validation------------------------------'
print 'svm',accuracy_score(yK_test, svmK_pred)
print 'knn',accuracy_score(yK_test, knnK_pred)
print 'tree',accuracy_score(yK_test, treeK_pred)
print 'naive bayes',accuracy_score(yK_test, nbK_pred)

print 'precision and recall is only meaningful if they are calculated on the test data'
print 'so they are calculated on test data and for svm classifier'

target_names = ['relevant', 'irrelevant']
print(classification_report(yK_test, svmK_pred, target_names=target_names))

we can use TFIDF
start
done
split
svm
knn
tree
NB
accuracy with splitting data------------------------------
svm 0.958416666667
knn 0.512166666667
decision tree 0.876416666667
naive bayes 0.950083333333
precision and recall is only meaningful if they are calculated on the test data
so they are calculated on test data and for svm classifier
             precision    recall  f1-score   support

   relevant       0.95      0.96      0.96      5999
 irrelevant       0.96      0.95      0.96      6001

avg / total       0.96      0.96      0.96     12000

accuracy with 5 fold cross validation------------------------------
svm 0.954625
knn 0.869
tree 0.882
naive bayes 0.958
precision and recall is only meaningful if they are calculated on the test data
so they are calculated on test data and for svm classifier
             precision    recall  f1-score   support

   relevant       0.94      0.97      0.96      4000
 irrelevant       0.97      0.94      0.95      4000

avg / total       0.96 