Opinion Spam detection
======================

We predict not only whether a review is fake or true, but also the sentiment class it belongs to (negative or positive).

The main idea is to predict whether a given review is fake or real using POS tagging and a BoW (Bag of Words) model. First, let us prepare the data to suit the needs of the experiment.

I have classified the reviews into four classes as follows:
- Highly positive (1): the review is marked as positive and it is deceptive.
- Positive (2): the review is marked as positive and it is truly positive.
- Negative (3): the review is marked as negative and it is truly negative.
- Highly negative (4): the review is marked as negative and it is deceptive.

Given a review, we will predict whether it is a true review or a fake one using the following strategy:
- If the review falls into the 2nd or 3rd class, it is a true review (review is NOT FAKE).
- If the review falls into the 1st or 4th class, it is a fake review (review is FAKE).
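
To make the strategy above concrete, here is a minimal sketch of the mapping from class code to polarity and fake/true status. The names `CLASS_INFO` and `describe_class` are my own illustrative choices and are not used elsewhere in this notebook.

```python
# Class codes follow the four-way scheme described above.
CLASS_INFO = {
    1: ("positive", True),   # deceptive positive review
    2: ("positive", False),  # truthful positive review
    3: ("negative", False),  # truthful negative review
    4: ("negative", True),   # deceptive negative review
}

def describe_class(target):
    """Return (polarity, is_fake) for a class code from 1 to 4."""
    polarity, is_fake = CLASS_INFO[int(target)]
    return polarity, is_fake

for code in (1, 2, 3, 4):
    print(code, describe_class(code))
```

The `int(target)` conversion is there because class codes may show up as floats (for example 2.0) once they sit in a data frame column.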

Following is a summary of the libraries used:
- os: as we are working with txt files, we'll use this library to get the file names and work with them.
- pandas: for creating and manipulating data frames.
- nltk: for removing stop words, tokenizing, part-of-speech tagging and lemmatization.
- gensim: for creating the dictionary and the bag-of-words corpus, and its matutils module for converting the corpus to sparse form.
- sklearn: for GridSearchCV, splitting the data, Random Forest and SVM.

In [1]:
import pandas as pd
import os
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag
from gensim import matutils,corpora, models
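try:  # newer scikit-learn releases provide these in model_selection
    from sklearn import model_selection as cross_validation
    from sklearn.model_selection import GridSearchCV
except ImportError:  # older releases, as originally used for this notebook
    from sklearn import cross_validation
    from sklearn.grid_search import GridSearchCV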
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import warnings



After downloading the data, go to the root directory **`op_spam_v1.4`** and execute the following shell commands to gather all the positive reviews into one folder and all the negative reviews into another.

This command moves all the txt files under the **`negative_polarity`** folder into that folder itself and deletes its subfolders, leaving only the root folder (**`negative_polarity`**).

In [2]:
#! cd negative_polarity && find . -type f -print0 | xargs -0 -I file mv --backup=numbered file . && rm -rf truthful_from_Web && rm -rf deceptive_from_MTurk

This command moves all the txt files under the **`positive_polarity`** folder into that folder itself and deletes its subfolders, leaving only the root folder (**`positive_polarity`**).

In [3]:
#! cd positive_polarity && find . -type f -print0 | xargs -0 -I file mv --backup=numbered file . && rm -rf truthful_from_TripAdvisor && rm -rf deceptive_from_MTurk
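
If you would rather not rely on shell commands, the same flattening can be sketched in Python. This is my own alternative, not what was run for this notebook: `flatten_directory` is a hypothetical helper, and its way of renaming clashing file names (a numeric suffix before the extension) is an assumption that only loosely imitates `mv --backup=numbered`.

```python
import os
import shutil

def flatten_directory(root_dir):
    """Move every file below root_dir into root_dir, then remove the emptied subfolders."""
    # walk bottom-up so that the deepest folders are emptied and removed first
    for current_dir, _, file_names in os.walk(root_dir, topdown=False):
        if current_dir == root_dir:
            continue
        for name in file_names:
            target = os.path.join(root_dir, name)
            # avoid overwriting a file with the same name (assumed numbering scheme)
            counter = 1
            while os.path.exists(target):
                stem, ext = os.path.splitext(name)
                target = os.path.join(root_dir, "{}.{}{}".format(stem, counter, ext))
                counter += 1
            shutil.move(os.path.join(current_dir, name), target)
        os.rmdir(current_dir)
```

It would be called once per folder, for example `flatten_directory("negative_polarity")` and `flatten_directory("positive_polarity")`.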

In [4]:
negative_list = os.listdir("negative_polarity") # names of all files in the negative_polarity dir into a list
positive_list = os.listdir("positive_polarity") # names of all files in the positive_polarity dir into a list

**A function to create a data frame with "review", "labeled_class" and "actual_class" (t or d) as columns.**

In [5]:
def preprocess(files_list,root_dir,polarity):
    labeled_class = []
    reviews = []
    actual_class =[]
    for file_name in files_list:
        labeled_class.append(polarity)
        # read the review text and close the file afterwards
        with open(os.path.join(root_dir, file_name)) as f:
            reviews.append(f.read())
        # the file name prefix is 't' (truthful) or 'd' (deceptive)
        actual_class.append(file_name.split('_')[0])
    data = pd.DataFrame({'labeled_class':labeled_class,'review':reviews,'actual_class':actual_class})
    return data

**Create separate data frames for positive and negative reviews**

In [6]:
negative_df = preprocess(negative_list,'negative_polarity','negative')
positive_df = preprocess(positive_list,'positive_polarity','positive')

In [7]:
negative_df.head()

|   | actual_class | labeled_class | review |
|---|--------------|---------------|--------|
| 0 | t | negative | Very disappointed in our stay in Chicago Monoc... |
| 1 | t | negative | I just had a conference there. They have bed b... |
| 2 | t | negative | Over-hyped and over-priced. The fact that they... |
| 3 | t | negative | My family of four went to a convention and sta... |
| 4 | t | negative | Beautiful historic hotel -- and since I'm in h... |


In [8]:
positive_df.head()

|   | actual_class | labeled_class | review |
|---|--------------|---------------|--------|
| 0 | t | positive | We had a king deluxe room for 2 nights. We res... |
| 1 | t | positive | The Swisshotel is awesome. Very high class. It... |
| 2 | t | positive | Having had a great stay at the Monaco last Fal... |
| 3 | t | positive | Simply a nice place to stay... I had a great d... |
| 4 | t | positive | Stayed here October 31 through November 5 for ... |


The following chunk of code adds a column named target, which is our variable of interest. First consider the positive reviews data frame.
- **(positive + d):** if the *actual_class* is marked as **d**, the review is highly positive, so we assign it to 1
- **(positive + t):** if the *actual_class* is marked as **t**, the review is truly positive, so we assign it to 2

In [9]:
target = []
for i in positive_df.index:
    if ((positive_df['labeled_class'][i] == 'positive') & (positive_df['actual_class'][i] == 't')):
        target.append(2)
    elif ((positive_df['labeled_class'][i] == 'positive') & (positive_df['actual_class'][i] == 'd')):
        target.append(1)
    else:
        print('Error!')
positive_df['target'] = target

Now let us consider the negative reviews data frame.
- **(negative + t):** if the *actual_class* is marked as **t**, the review is truly negative, so we assign it to 3
- **(negative + d):** if the *actual_class* is marked as **d**, the review is highly negative, so we assign it to 4

In [10]:
target = []
for i in negative_df.index:
    if ((negative_df['labeled_class'][i] == 'negative') & (negative_df['actual_class'][i] == 't')):
        target.append(3)
    elif ((negative_df['labeled_class'][i] == 'negative') & (negative_df['actual_class'][i] == 'd')):
        target.append(4)
    else:
        print('Error!')
negative_df['target'] = target
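
The two loops above encode one lookup from (labeled_class, actual_class) to a class code. As a compact sketch of that same rule, the table can be written once as a dictionary. `TARGET_CODES` and `assign_target` are illustrative names of mine, not part of the notebook's code.

```python
# (labeled_class, actual_class) -> target, following the scheme used in this notebook
TARGET_CODES = {
    ("positive", "d"): 1,
    ("positive", "t"): 2,
    ("negative", "t"): 3,
    ("negative", "d"): 4,
}

def assign_target(labeled_class, actual_class):
    """Look up the class code; raises KeyError for an unexpected combination."""
    return TARGET_CODES[(labeled_class, actual_class)]

rows = [("positive", "t"), ("positive", "d"), ("negative", "t"), ("negative", "d")]
print([assign_target(label, actual) for label, actual in rows])
```

Raising a `KeyError` on an unknown combination stops the run immediately, whereas printing 'Error!' would leave the target list shorter than the data frame.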

Merge the positive and negative data frames into one.

In [11]:
data = positive_df.merge(negative_df,how='outer')

As we are only interested in the review and target columns, subset the data and ignore the other columns.

In [12]:
data = data[['review','target']]

In [13]:
data.head()

|   | review | target |
|---|--------|--------|
| 0 | We had a king deluxe room for 2 nights. We res... | 2.0 |
| 1 | The Swisshotel is awesome. Very high class. It... | 2.0 |
| 2 | Having had a great stay at the Monaco last Fal... | 2.0 |
| 3 | Simply a nice place to stay... I had a great d... | 2.0 |
| 4 | Stayed here October 31 through November 5 for ... | 2.0 |


In [14]:
data.target.value_counts()

4.0    400
3.0    400
2.0    400
1.0    400
Name: target, dtype: int64

**Since all the classes are equally distributed, there is no need to use any sampling techniques.**

Let us discuss the functionality of **extract_tokens**:

- each review is converted to lowercase and word_tokenize splits it into a list of word tokens
- then we tag the part of speech of each word, drop stop words and tokens of one or two characters, and lemmatize the remaining words (reduce each word to its dictionary root form)
- each kept word is joined with its tag (for example room_NN) and these lists are appended to a column named review_tokenized

In [15]:
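def extract_tokens(df):
    review_tokenized = []
    lmt = WordNetLemmatizer()
    # build the stop-word set once instead of reloading the list for every token
    stop_words = set(stopwords.words("english"))
    for index, datapoint in df.iterrows():
        # lowercase the review and split it into word tokens
        tokenize_words = word_tokenize(datapoint["review"].lower(),language='english')
        # tag each token with its part of speech
        pos_word = pos_tag(tokenize_words)
        # drop stop words and very short tokens, then join lemma and tag
        tokenize_words = ["_".join([lmt.lemmatize(i[0]),i[1]]) for i in pos_word if (i[0] not in stop_words and len(i[0]) > 2)]
        review_tokenized.append(tokenize_words)
    df["review_tokenized"] = review_tokenized
    return df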

data = extract_tokens(data)

In [16]:
data.head()

|   | review | target | review_tokenized |
|---|--------|--------|------------------|
| 0 | We had a king deluxe room for 2 nights. We res... | 2.0 | [king_NN, deluxe_NN, room_NN, night_NNS, reser... |
| 1 | The Swisshotel is awesome. Very high class. It... | 2.0 | [swisshotel_NN, awesome_JJ, high_JJ, class_NN,... |
| 2 | Having had a great stay at the Monaco last Fal... | 2.0 | [great_JJ, stay_NN, monaco_NN, last_JJ, fall_N... |
| 3 | Simply a nice place to stay... I had a great d... | 2.0 | [simply_RB, nice_JJ, place_NN, stay_VB, ..._:,... |
| 4 | Stayed here October 31 through November 5 for ... | 2.0 | [stayed_VBN, october_JJ, november_JJ, cconfere... |
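
To show how the word_TAG tokens in the last column come about, here is a small sketch that works on already tagged (word, tag) pairs, so it runs without any NLTK data. The hand-tagged input, the tiny stop-word set and the identity function standing in for the lemmatizer are all assumptions made for illustration; `build_tokens` is a hypothetical helper.

```python
def build_tokens(tagged_words, stop_words, lemmatize=lambda word: word):
    """Join each kept word with its POS tag, e.g. ('room', 'NN') -> 'room_NN'."""
    return [
        "_".join([lemmatize(word), tag])
        for word, tag in tagged_words
        if word not in stop_words and len(word) > 2
    ]

# hand-tagged example input, a stand-in for the output of pos_tag
tagged = [("we", "PRP"), ("had", "VBD"), ("a", "DT"), ("king", "NN"),
          ("deluxe", "NN"), ("room", "NN"), ("for", "IN"), ("2", "CD")]
stop_words = {"we", "had", "a", "for"}  # tiny illustrative stop-word set
print(build_tokens(tagged, stop_words))  # ['king_NN', 'deluxe_NN', 'room_NN']
```

Keeping the tag attached to the word means the same word used as a noun and as a verb ends up as two different features in the bag of words.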


I have used gensim to vectorize the reviews:
- corpora.Dictionary builds a dictionary (a mapping from token to integer id) from the review_tokenized column
- then we filter out the tokens that occur in fewer than 2 documents or in more than 80% of the documents
- then we create a bag-of-words corpus from this dictionary
- and convert the corpus to a sparse matrix
- the shape of the corpus can be seen below

In [19]:
from gensim import matutils,corpora, models

def vectorize_comments(df):
    d = corpora.Dictionary(df["review_tokenized"])
    d.filter_extremes(no_below=2, no_above=0.8)
    d.compactify()
    corpus = [d.doc2bow(text) for text in df["review_tokenized"]]
    corpus = matutils.corpus2csc(corpus, num_terms=len(d.token2id))
    corpus = corpus.transpose()
    return d, corpus

dictionary,corpus = vectorize_comments(data)
print (corpus.shape)

(1600, 5911)
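
For intuition, the document-frequency filter and the bag-of-words conversion used above can be sketched in plain Python. This mirrors the idea only, not gensim's actual implementation: `build_vocabulary` and `to_bag_of_words` are hypothetical helpers, and assigning ids in alphabetical order is my simplification.

```python
from collections import Counter

def build_vocabulary(documents, no_below=2, no_above=0.8):
    """Map each kept token to an integer id, based on document frequency."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    max_docs = no_above * len(documents)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {token: idx for idx, token in enumerate(kept)}

def to_bag_of_words(tokens, vocabulary):
    """Return sorted (token_id, count) pairs for the tokens present in the vocabulary."""
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    return sorted(counts.items())

docs = [
    ["room_NN", "clean_JJ", "room_NN"],
    ["room_NN", "dirty_JJ", "staff_NN"],
    ["staff_NN", "clean_JJ", "room_NN"],
]
vocab = build_vocabulary(docs, no_below=2, no_above=0.8)
print(vocab)
print([to_bag_of_words(d, vocab) for d in docs])
```

In this toy input room_NN appears in every document and dirty_JJ in only one, so both are filtered out and only two tokens remain in the vocabulary.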


**Firstly**, we train a Random Forest classifier using GridSearchCV with the parameters shown below.

In [20]:
def train_rfc(X,y):
    n_estimators = [100]
    min_samples_split = [2]
    min_samples_leaf = [1]
    bootstrap = [True]
    # every list above holds a single value, so one configuration is evaluated
    parameters = {'n_estimators': n_estimators, 'min_samples_leaf': min_samples_leaf,
                  'min_samples_split': min_samples_split, 'bootstrap': bootstrap}
    clf = GridSearchCV(RandomForestClassifier(verbose=1,n_jobs=-1), cv=4, param_grid=parameters)
    clf.fit(X, y)
    return clf

We have used 70% of the data for training and 30% for testing, with 4-fold cross-validation on the training data. On assessing the accuracy:
- Accuracy of RF on cross-validation data is ~ 70% and
- Accuracy of RF on test data is ~ 69%
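
As a rough illustration of what 4-fold cross-validation means here, the sketch below splits sample indices into k contiguous folds and averages a per-fold score. It is a simplification under stated assumptions: scikit-learn uses stratified folds for classifiers, which this does not reproduce, and `k_fold_indices`, `cross_val_accuracy` and the `fit_and_score` callable are hypothetical names.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # the first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

def cross_val_accuracy(n_samples, k, fit_and_score):
    """Average the validation accuracy returned by fit_and_score over k folds."""
    scores = [fit_and_score(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# 70% of the 1600 reviews are used for training, i.e. 1120 samples
for train, validation in k_fold_indices(1120, 4):
    print(len(train), len(validation))
```

Each fold trains on 840 reviews and validates on the remaining 280, so every training review is used for validation exactly once.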

In [21]:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(corpus, data["target"], test_size=0.3, random_state=2016)
rfc_clf = train_rfc(X_train,y_train)
print ("Accuracy of RF on CV sets :{}".format(rfc_clf.best_score_))

Accuracy of RF on CV sets :0.7017857142857142



In [23]:
print("Accuracy of RF on test sets is : {}".format(rfc_clf.score(X_test,y_test)))

Accuracy of RF on test sets is : 0.69375




As the next option, we have used a Support Vector Machine classifier along with GridSearchCV, trying the values 10, 15, 20 and 25 for the penalty parameter C.
- Accuracy of SVM on cross-validation data is ~ 74% and
- Accuracy of SVM on test data is ~ 76%.

In [24]:
def train_svm(X,y):
    parameters = {'C': [10,15,20,25],'random_state':[2016]}
    clf = GridSearchCV(SVC(), cv=4, param_grid=parameters)
    clf.fit(X, y)
    return clf

In [25]:
svc_clf = train_svm(X_train,y_train)
print("Best accuracy of SVM on CV sets :{}".format(svc_clf.best_score_))
print("Accuracy of SVM on test sets is : {}".format(svc_clf.score(X_test,y_test)))

Best accuracy of SVM on CV sets :0.7392857142857143
Accuracy of SVM on test sets is : 0.7604166666666666
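
To put these fractions into perspective, the short calculation below converts them into review counts. It assumes the 70/30 split of the 1600 reviews gives 1120 training and 480 test reviews, and that the cross-validation score equals the share of correctly classified held-out training reviews (which holds when the folds have equal size).

```python
n_reviews = 1600
n_test = round(n_reviews * 0.3)   # reviews held out for testing
n_train = n_reviews - n_test      # reviews used for training and cross-validation

svm_cv_accuracy = 0.7392857142857143    # value printed above
svm_test_accuracy = 0.7604166666666666  # value printed above

print(n_train, n_test)
print(round(svm_cv_accuracy * n_train))   # correctly classified held-out training reviews
print(round(svm_test_accuracy * n_test))  # correctly classified test reviews
```

With only 480 test reviews, a single review changes the test accuracy by about 0.2 percentage points, so small differences between models should not be over-interpreted.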


If we make some tweaks we might succeed in increasing the accuracy a little, but remember that our dataset is **very small** (only 1600 reviews), so there is a good chance that the model overfits the data and does not do well on new data.

**In this case, chasing a higher accuracy can backfire: because the dataset is very small, the measured accuracy might increase while the model gets worse at classifying new reviews. So for now let us stop here and use the SVC for classifying a review.**

In [26]:
def model_test(review):
    # predict returns an array with one label for the single review passed in
    a = svc_clf.predict(review)[0]
    if a == 1.0 :
        return('Fake Review (Positive)')
    elif a == 2.0:
        return('True Review (Positive)')
    elif a == 3.0:
        return('True Review (Negative)')
    else :
        return('Fake Review (Negative)')

Now, just for fun, let us predict the class of some reviews in the test set.

In [27]:
for i in X_test[:10]:
    print(model_test(i))
    print('')

True Review (Negative)

True Review (Positive)

True Review (Negative)

True Review (Positive)

Fake Review (Positive)

True Review (Positive)

True Review (Negative)

Fake Review (Negative)

Fake Review (Negative)

True Review (Negative)
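
Since the stated goal is to tell fake reviews from true ones, one possible follow-up is to collapse the four classes into fake/true and measure accuracy on that binary question. The sketch below is my own suggestion, with made-up labels for illustration only; `is_fake` and `fake_detection_accuracy` are hypothetical helpers that were not run on this notebook's data.

```python
FAKE_CLASSES = {1, 4}  # deceptive positive and deceptive negative

def is_fake(target):
    """True when the four-way class code denotes a deceptive review."""
    return int(target) in FAKE_CLASSES

def fake_detection_accuracy(true_targets, predicted_targets):
    """Share of reviews whose fake/true status is predicted correctly."""
    matches = [is_fake(t) == is_fake(p) for t, p in zip(true_targets, predicted_targets)]
    return sum(matches) / len(matches)

# made-up labels for illustration only, not results from this notebook
true_targets = [1, 2, 3, 4, 1]
predicted_targets = [4, 2, 2, 3, 1]
print(fake_detection_accuracy(true_targets, predicted_targets))  # 0.8
```

Under the assumption that the objects from the cells above are available, it could be applied as `fake_detection_accuracy(y_test, svc_clf.predict(X_test))`. A prediction that gets the polarity wrong but the fake/true status right counts as correct here.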



Any questions or suggestions?

Ping me @ **chaitanyadeva96@gmail.com**