# NLP Assignment - 1


## Part A - Deception Detection

The main dataset for this assignment is a corpus of Amazon reviews. The aim is to classify each review as fake or real using a Support Vector Machine (SVM). Both the review text and the additional features contained in the dataset are used to build and train the classifier on part of the dataset. The accuracy of the classifier is then tested on an unseen portion of the corpus.


### 1. Parsing and preprocessing
The first task is to define functions to parse the data and preprocess the review texts. For this, the functions parseReview(reviewLine) and preProcess(text) have been defined. parseReview selects the Id, Text and Label fields from each line of the data. preProcess then normalises the review text (lowercasing it and replacing non-word characters with spaces) and tokenises it into individual words. Both functions are called while reading the data with the unicode CSV reader.

In [44]:
import csv  
import unicodecsv   # csv reader
from sklearn.svm import LinearSVC
from nltk.classify import SklearnClassifier
from random import shuffle
from sklearn.pipeline import Pipeline
import re
from sklearn.metrics import precision_recall_fscore_support # to report on precision and recall
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import RegexpTokenizer
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter
import numpy as np
# libraries for: regular expressions, file I/O
import sys
from sklearn.metrics import * # to report on precision and recall
from collections import defaultdict
from nltk.corpus import stopwords
from string import punctuation
from textstat.textstat import textstat

In [45]:
# load data from a file and append it to the rawData
def loadData(path, Text=None):
    with open(path, 'rb') as f:
        reader = unicodecsv.reader(f, encoding='utf-8', delimiter='\t')
        next(reader)
        for line in reader:
            (Id, Text, Label) = parseReview(line)
            rawData.append((Id, Text, Label))
            preprocessedData.append((Id, preProcess(Text), Label))

In [46]:
def splitData(percentage):
    dataSamples = len(rawData)
    halfOfData = int(len(rawData)/2)
    trainingSamples = int((percentage*dataSamples)/2)
    for (_, Text, Label) in rawData[:trainingSamples] + rawData[halfOfData:halfOfData+trainingSamples]:
        trainData.append((toFeatureVector(preProcess(Text)),Label))
    for (_, Text, Label) in rawData[trainingSamples:halfOfData] + rawData[halfOfData+trainingSamples:]:
        testData.append((toFeatureVector(preProcess(Text)),Label))

In [64]:
################
## QUESTION 1 ##
################

# Convert a line from the input file into an (Id, Text, Label) tuple
def parseReview(reviewLine):
    # Returns a triple: the review id, the review text, and the label, as read from the file
    Id = reviewLine[0]
    Text = reviewLine[8]
    Label = reviewLine[1]
    return (Id, Text, Label)

# TEXT PREPROCESSING

def preProcess(text):
    #normalisation and tokenising 
    no_symbols = re.sub(r'[^\w]', ' ', text.lower())
    tokens = no_symbols.split()
    return tokens
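
As a quick illustration of what this preprocessing does, the sketch below repeats the same two steps on a made-up sentence. The name pre_process and the sample text are for illustration only and are not part of the assignment code. Note that the regular expression splits contractions such as "don't" into two tokens.

```python
import re

def pre_process(text):
    # Same steps as preProcess: lowercase, replace every non-word
    # character with a space, then split on whitespace.
    no_symbols = re.sub(r'[^\w]', ' ', text.lower())
    return no_symbols.split()

sample = "Great product, DON'T hesitate!"  # invented example text
tokens = pre_process(sample)
print(tokens)  # ['great', 'product', 'don', 't', 'hesitate']
```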

### 2. toFeatureVector
The next step is to implement the toFeatureVector function. Given a preprocessed review (that is, a list of tokens), this function returns a dictionary whose keys are the tokens and whose values are the weights of those tokens in the review. Each weight is the number of occurrences of a token divided by the total number of tokens in the preprocessed review. The global featureDict, which keeps track of every token seen in the whole review dataset, is built up incrementally as reviews are processed.

In [48]:
################
## QUESTION 2 ##
################
featureDict = {} # A global dictionary of features
def toFeatureVector(words):
    v = {}
    for w in words:
        try:
            i = featureDict[w]
        except KeyError:
            i = len(featureDict) + 1
            featureDict[w] = i
        try:
            v[w] += (1.0/len(words))
        except KeyError:
            v[w] = (1.0/len(words))
    return v
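
The weighting described above amounts to a relative term frequency. A minimal sketch of the same calculation using collections.Counter is shown here, assuming the input is a list of tokens. The helper relative_frequencies is hypothetical; it is not the function used in this notebook and it does not update featureDict.

```python
from collections import Counter

def relative_frequencies(tokens):
    # Count each token, then divide by the review length so that
    # the weights of one review sum to 1.
    counts = Counter(tokens)
    total = len(tokens)
    return {tok: c / total for tok, c in counts.items()}

vec = relative_frequencies(["good", "good", "bad", "price"])
print(vec)  # {'good': 0.5, 'bad': 0.25, 'price': 0.25}
```

An empty token list simply yields an empty dictionary, since no division is performed when there are no counts.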



### 3.  crossValidate function
The crossValidate function has been completed to perform 10-fold cross-validation. The precision_recall_fscore_support function is used to compute the precision, recall and F1 score for each fold. The F1 score is the harmonic mean of precision and recall. Because average='weighted' is passed, the per-class scores are averaged with weights proportional to the number of true instances of each class, so each fold yields a single precision, recall and F1 value. Accuracy is computed separately with accuracy_score, and all four metrics are then averaged over the folds.
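
To make the harmonic-mean relationship concrete, here is a small worked example for a single class. The counts (true positives, false positives, false negatives) are invented for illustration and do not come from the review data, and precision_recall_f1 is a hypothetical helper rather than the scikit-learn function.

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: share of predicted positives that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
print(p, r, round(f1, 4))  # 0.75 0.6 0.6667
```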



In [49]:
################
## QUESTION 3 ##
################
# TRAINING AND VALIDATING OUR CLASSIFIER
def trainClassifier(trainData):
    print("Training Classifier...")
    pipeline =  Pipeline([('svc', LinearSVC())])
    return SklearnClassifier(pipeline).train(trainData)


myTestData = []
myTrainData = []

    
def crossValidate(dataset, folds):
    cv_results = []
    accuracy = []
    shuffle(dataset)
    foldSize = int(len(dataset)/folds)
    for i in range(0,len(dataset),foldSize):
        # hold out the fold starting at index i for testing and train on the rest
        print ("fold start %d foldSize %d" % (i, foldSize))
        myTestData = dataset[i:i+foldSize]
        myTrainData = dataset[0:i] + dataset[i+foldSize:]
        classifier = trainClassifier(myTrainData)
        y_pred = predictLabels(myTestData, classifier)
#         review,label = zip(*myTestData)
#         y_true = label
        y_true = [x[1] for x in myTestData]
#         y_true = classifier.classify(map(lambda x: x[1], myTestData))
        cv_results.append(precision_recall_fscore_support(y_true, y_pred, average='weighted'))
        accuracy.append(accuracy_score(y_true, y_pred))
        
    # Calculate the average of each metric over the 10 folds
    cv_results = np.asarray(cv_results)
    cv_results = [np.mean(cv_results[:,0]), np.mean(cv_results[:,1]), np.mean(cv_results[:,2])]
    
    accuracy = np.asarray(accuracy)
    accuracy = np.mean(accuracy)
    cv_results.append(accuracy)
    
    return cv_results
# PREDICTING LABELS GIVEN A CLASSIFIER


def predictLabels(reviewSamples, classifier):
    return classifier.classify_many(map(lambda t: t[0], reviewSamples))

def predictLabel(reviewSample, classifier):
    return classifier.classify(toFeatureVector(preProcess(reviewSample)))
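
The loop in crossValidate steps through the data in strides of foldSize, which gives exactly 10 folds here because 16800 is divisible by 10. For a dataset size that is not a multiple of the fold count, that stride would produce an extra, smaller fold. One possible way to always get exactly the requested number of folds is sketched below. fold_bounds is a hypothetical helper and is not used for the results reported in this notebook.

```python
def fold_bounds(n_samples, n_folds):
    """Yield (start, stop) index pairs that cover range(n_samples)
    in exactly n_folds contiguous folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for k in range(n_folds):
        # The first `extra` folds take one additional sample each.
        stop = start + base + (1 if k < extra else 0)
        yield start, stop
        start = stop

bounds = list(fold_bounds(16800, 10))
print(bounds[0], bounds[-1])  # (0, 1680) (15120, 16800)
```

Each pair could then be used as dataset[start:stop] for the held-out fold, with the remaining samples used for training.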

After implementing the above functions, loadData is called to load the data and splitData to split it into training and test sets. crossValidate is then run on the training data, and the averaged precision, recall, F1 score and accuracy are reported.

In [50]:
# MAIN

# loading reviews
rawData = []          # the filtered data from the dataset file (should be 21000 samples)
preprocessedData = [] # the preprocessed reviews (just to see how your preprocessing is doing)
trainData = []        # the training data as a percentage of the total dataset (currently 80%, or 16800 samples)
testData = []         # the test data as a percentage of the total dataset (currently 20%, or 4200 samples)

# the output classes
fakeLabel = 'fake'
realLabel = 'real'

# references to the data files
reviewPath = 'amazon_reviews.txt'

## Do the actual stuff
# We parse the dataset and put it in a raw data list
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing the dataset...",sep='\n')
loadData(reviewPath)
# We split the raw dataset into a set of training data and a set of test data (80/20)
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Preparing training and test data...",sep='\n')
splitData(0.8)
# We print the number of training samples and the number of features
print("Now %d rawData, %d trainData, %d testData" % (len(rawData), len(trainData), len(testData)),
      "Training Samples: ", len(trainData), "Features: ", len(featureDict), sep='\n')


Now 0 rawData, 0 trainData, 0 testData
Preparing the dataset...
Now 21000 rawData, 0 trainData, 0 testData
Preparing training and test data...
Now 21000 rawData, 16800 trainData, 4200 testData
Training Samples: 
16800
Features: 
34913


In [51]:
crossValidate(trainData, 10)

fold start 0 foldSize 1680
Training Classifier...
fold start 1680 foldSize 1680
Training Classifier...
fold start 3360 foldSize 1680
Training Classifier...
fold start 5040 foldSize 1680
Training Classifier...
fold start 6720 foldSize 1680
Training Classifier...
fold start 8400 foldSize 1680
Training Classifier...
fold start 10080 foldSize 1680
Training Classifier...
fold start 11760 foldSize 1680
Training Classifier...
fold start 13440 foldSize 1680
Training Classifier...
fold start 15120 foldSize 1680
Training Classifier...


[0.6556930063812416,
 0.6546428571428572,
 0.6542988629662425,
 0.6546428571428572]