In [1]:
# Import the pandas package, then use the "read_csv" function to read
# the labeled training data
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
%config InlineBackend.figure_format = 'png' #set 'png' here when working on notebook
warnings.filterwarnings('ignore') 
train = pd.read_csv(r"C:\Users\piush\Desktop\Dataset\HostelWorld\train_review_data.csv")

Let us take a small sample of 200 rows of the dataset

In [3]:
df = train.sample(n=200)

In [4]:
#Select only review_language english

df = df.loc[df['review_language'] == 'English']
df = df.set_index('customer_id')

In [5]:
df_text = df['review_text']

### Preprocessing
The samples in the dataset should be preprocessed before any other operation is performed on them. The
preprocessing includes:
1. Upper to lower case conversion: To make feature selection easier, all the data is converted to lower case.
2. Normalization: All words with apostrophes are replaced with their original form, e.g. don’t -> do not.
3. Non-ASCII removal: All non-ASCII characters are removed from the samples.
4. Remove new lines: The dataset contains some unwanted new lines, which are also removed before the feature
selection phase.
5. Remove unwanted punctuation: All punctuation is removed before feature selection.
6. Stop word removal: Stop words in the English language include “a”, “an”, “the” and “is”. We also remove all words whose length is less than 3, except no, not, none. To remove stop words we use the Natural
Language Toolkit (NLTK) for Python.
7. Stemming: Some of the words in the dataset share a root and differ only in their affixes. For example, computer, computation and computing have the same root, comput. The main purpose of this step is to reduce the feature set and improve classification performance. We use the Porter
stemmer from NLTK.
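
The cleaning steps that come before stop word removal and stemming can be sketched with the standard library alone. This is a minimal sketch, not the notebook's actual pipeline: the function name `preprocess`, the tiny `CONTRACTIONS` table and the `KEEP_SHORT` set are assumptions made for illustration.

```python
import re
import string

# Hypothetical, deliberately tiny contraction table; a real one would be much larger.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "isn't": "is not", "it's": "it is"}
# Words kept even when the short-word filter would otherwise drop them.
KEEP_SHORT = {"no", "not", "none"}
# Map every punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text):
    text = text.lower()                                  # 1. lower case
    text = text.replace("\u2019", "'")                   # unify curly apostrophes
    for short, full in CONTRACTIONS.items():             # 2. normalization
        text = text.replace(short, full)
    text = text.encode("ascii", errors="ignore").decode("ascii")  # 3. drop non-ASCII
    text = re.sub(r"[\r\n]+", " ", text)                 # 4. remove new lines
    text = text.translate(PUNCT_TO_SPACE)                # 5. remove punctuation
    # Drop words shorter than 3 characters, except the negations.
    return [t for t in text.split() if len(t) >= 3 or t in KEEP_SHORT]
```

Contractions are expanded before punctuation is stripped, because the apostrophe is itself a punctuation character and would otherwise be lost first.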

In [6]:
#Tokenisation
import nltk

df['df_text'] = df.apply(lambda row: nltk.word_tokenize(row['review_text']), axis=1)

In [7]:
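from nltk.corpus import stopwords
# Use a set for fast lookups and keep the negations, as described under Preprocessing
stop = set(stopwords.words('english')) - {'no', 'not', 'none'}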

In [8]:
df['df_text_stop'] = df['df_text'].apply(lambda x: [item for item in x if item not in stop])

In [9]:
#Stemming is the process of reducing a word to its base/root form, called stem
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

df['stemmed'] = df["df_text_stop"].apply(lambda x: [stemmer.stem(y) for y in x])

### Features

##### Unigram of words:
This feature is preprocessed with stemming and stop word removal, which gives different categories of unigram
features: unigrams with stemming and with stop words, unigrams with stemming and without stop words, and unigrams
without stemming and with stop words. We do not consider the category without stemming and without stop words,
because skipping stemming leaves the features high-dimensional (words are not reduced to their root form), while
removing stop words such as "not" can flip negative samples into positive ones.

##### Bigram of words:
Like unigrams, we consider two categories of bigrams: bigrams without stemming and with stop words,
and bigrams with stemming and with stop words. Because bigram features are high-dimensional, we keep only the
features that appear more than three times in our dataset.
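
As a rough illustration of the frequency cut-off, the sketch below counts n-grams over tokenised reviews and keeps those seen more than three times. The helper names `ngrams` and `frequent_ngrams` are hypothetical and not part of the notebook.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(docs, n, min_count=4):
    """Count n-grams over all documents and keep those seen at least min_count times.

    min_count=4 corresponds to "more than three times".
    """
    counts = Counter(gram for doc in docs for gram in ngrams(doc, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}
```

Calling `frequent_ngrams(docs, 1)` gives the unigram counterpart, so the same helper covers both feature types.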

#### Semantic Orientation
First, we have to classify the set of positive terms and negative terms present in each TripAdvisor review.
Second, a part-of-speech tagger is applied to the review. Two consecutive words are extracted from the review
if their tags conform to any of the patterns.

The JJ tags indicate adjectives, the NN tags are nouns, the RB tags are adverbs, and the third word (which is
not extracted) cannot be a noun. NNP and NNPS (singular and plural proper nouns) are avoided, so that the names
of the items in the review cannot influence the classification.
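
The extraction step might look like the sketch below, which works on tokens that are already POS-tagged. The pattern table is not given in the text, so `PATTERNS` here is an assumption chosen only to illustrate the mechanism, and `extract_phrases` is a hypothetical name.

```python
ADJ = {"JJ"}
NOUN = {"NN", "NNS"}        # NNP / NNPS are deliberately left out
ADV = {"RB", "RBR", "RBS"}

# Assumed (first tag, second tag) patterns; illustrative only, not taken from the text.
PATTERNS = [(ADJ, NOUN), (ADV, ADJ), (ADJ, ADJ), (NOUN, ADJ)]


def extract_phrases(tagged):
    """tagged: list of (word, tag) pairs. Returns the extracted two-word phrases."""
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        third = tagged[i + 2][1] if i + 2 < len(tagged) else None
        if third in NOUN:
            continue  # the third word (not extracted) cannot be a noun
        if any(t1 in first and t2 in second for first, second in PATTERNS):
            phrases.append((w1, w2))
    return phrases
```

Because proper-noun tags never appear in the tag sets, phrases containing names are skipped without any extra check.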

##### Parts of speech tags:
We consider 36 part-of-speech tags, such as VB, PRP, DT and NN. Each text in the reviews is tagged
using the POS tagger of NLTK.

###### The number of tags per review varies, since not all 36 tags are used in this experiment.
We have 26 tags in the positive class and 25 tags in the negative class.

##### Function Words:
Function words, or grammatical words, are words that have little lexical meaning or have ambiguous meanings.
We consider 375 function words. To normalize these counts we divide each count by N (the total number
of words).
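
A small sketch of the normalized counts follows. The five-word `FUNCTION_WORDS` list is a hypothetical stand-in for the full list of 375 function words, which is not reproduced here.

```python
from collections import Counter

# Hypothetical stand-in for the full list of 375 function words.
FUNCTION_WORDS = ["the", "of", "and", "not", "but"]


def function_word_features(tokens):
    """Return one normalized count (count / N) per function word."""
    n = len(tokens)
    counts = Counter(tokens)
    return [counts[w] / n if n else 0.0 for w in FUNCTION_WORDS]
```

Dividing by N keeps long and short reviews comparable, since a long review would otherwise score higher on every function word.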

###### Word based features:
We use eight statistical measures to explore whether the count of words in each sample helps to
extract the sentiment of a review.

### Feature Selection

##### Mutual Information (MI)
MI selects terms that are not uniformly distributed among the sentiment classes, because such terms are
informative of their classes. MI also tends to give more importance to rare terms.

https://en.wikipedia.org/wiki/Mutual_information
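
The preference for rare terms can be seen in a small sketch. It assumes the pointwise form of mutual information computed from document counts, which may differ from the variant used in the experiment; `pointwise_mi` and its parameter names are hypothetical.

```python
import math


def pointwise_mi(n_tc, n_t, n_c, n):
    """log2( P(t, c) / (P(t) * P(c)) ) estimated from document counts.

    n_tc: documents of class c containing term t
    n_t:  documents containing term t
    n_c:  documents of class c
    n:    all documents
    """
    if n_tc == 0:
        return float("-inf")
    return math.log2(n_tc * n / (n_t * n_c))


# A rare term seen only in the class vs. a common term seen mostly in the class.
rare = pointwise_mi(n_tc=2, n_t=2, n_c=50, n=100)
common = pointwise_mi(n_tc=30, n_t=40, n_c=50, n=100)
```

Here the rare term scores higher than the common one even though the common term is backed by far more evidence.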


##### Information gain(IG)
Information gain is one of the most commonly used feature selection methods in machine learning. It
calculates the relevance of a feature for predicting the sentiment of a review by analysing the presence or
absence of the feature in a document.
https://en.wikipedia.org/wiki/Information_gain_ratio
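
For a two-class problem, information gain can be computed from four document counts, as in the sketch below. It uses the standard entropy-based definition; the function and parameter names are assumptions.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(pos_with, neg_with, pos_without, neg_without):
    """Entropy of the class minus the entropy left once the feature's presence is known."""
    n = pos_with + neg_with + pos_without + neg_without
    n_with = pos_with + neg_with
    n_without = n - n_with
    h_class = entropy([pos_with + pos_without, neg_with + neg_without])
    h_given_feature = (n_with / n) * entropy([pos_with, neg_with]) \
        + (n_without / n) * entropy([pos_without, neg_without])
    return h_class - h_given_feature
```

A feature that appears only in positive reviews removes all uncertainty about the class, while one spread evenly over both classes removes none.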

##### Chi-square (χ²)
Chi-square measures how much the expected counts and the observed counts deviate from each other.

http://www.math.uah.edu/stat/special/ChiSquare.html
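
For a single term and two classes, the statistic can be computed from a 2x2 table of observed counts. A minimal sketch, with the hypothetical name `chi_square`:

```python
def chi_square(observed):
    """observed: 2x2 table, rows = term present / absent, columns = positive / negative."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if term and class were independent.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat
```

The larger the value, the further the observed counts are from what independence between term and class would predict.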

##### TF-idf (Term Frequency-Inverse Document Frequency)
TF-idf is a weighting scheme that measures how relevant a word is to a sample in the dataset. The relevance
increases with the number of times the word appears in the sample.

The simplest approach to term frequency is to assign a weight equal to the number of occurrences of the term
in the document. Inverse document frequency is a measure of whether the term is common or rare across all
documents.
https://en.wikipedia.org/wiki/Tf%E2%80%93idf
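
One common formulation, raw count multiplied by log(N / document frequency), can be sketched as follows. Other TF-idf variants exist, and the exact one used in the experiment is not stated, so treat this as an assumption.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq})
    return weights


weights = tf_idf([["clean", "room", "clean"], ["dirty", "room"]])
```

In this example "room" occurs in every document and gets weight 0, while "clean" is weighted up by appearing twice in its review.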


### Feature weighting
The features selected by the feature selection criteria are weighted using Feature Presence (FP): we
calculate each feature value from its presence or absence in a sample rather than from its count.
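
Feature Presence amounts to a binary vector over the selected features. A minimal sketch, with a hypothetical function name:

```python
def feature_presence(tokens, selected_features):
    """Return 1 for each selected feature present in the sample, else 0."""
    present = set(tokens)
    return [1 if feature in present else 0 for feature in selected_features]
```

For example, a review containing "clean" twice still gets a 1 for that feature, not a 2.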

### Classification

###### Support Vector Machine 
We use SVM to create classification models on feature sets of different lengths, extracted
from these sorted lists, in order to find the optimal feature length. SVM is capable of handling high-dimensional
data in a linear or non-linear manner. Although SVM takes time to create a classification model, it performs well
for two-class problems.

###### Naïve-Bayes Classifier

We use Naïve Bayes with a multinomial distribution and Laplace smoothing, and with a Bernoulli
distribution.
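
To make the role of Laplace smoothing concrete, here is a small from-scratch sketch of the multinomial variant. The class name `TinyMultinomialNB` is hypothetical and this is not the classifier used in the experiment; in practice a library implementation would normally be used.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing; illustration only."""

    def fit(self, docs, labels, alpha=1.0):
        self.alpha = alpha
        self.vocab = {t for doc in docs for t in doc}
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_label in self.class_counts.items():
            # Smoothed denominator: all term occurrences in the class plus alpha per vocabulary word.
            total = sum(self.term_counts[label].values()) + self.alpha * len(self.vocab)
            score = math.log(n_label / n_docs)
            for term in doc:
                if term in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.term_counts[label][term] + self.alpha) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = TinyMultinomialNB().fit(
    [["clean", "friendly"], ["great", "clean"], ["dirty", "noisy"], ["dirty", "rude"]],
    ["pos", "pos", "neg", "neg"],
)
```

Without smoothing, a word that never occurs in one class would give that class a probability of zero regardless of the other words in the review.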

### Future Work

Future work includes sentiment lexicons, non-word tokens that are indicative of sentiment
(such as emoticons), and capturing semantic similarities present in reviews.

Our aim is to automatically mine these types of topics
from the raw review text and to automatically assign sentiment
labels to the relevant topics and review elements.

We show how this
can be used as the basis of a review recommendation system
to automatically recommend high-quality reviews even in the
absence of any explicit helpfulness feedback.

Review timeliness was also found to be important
since review helpfulness declined as time went by.

Just as it is useful to automate the filtering of helpful reviews,
it is also important to weed out malicious or biased
reviews. These reviews can be well written and informative,
and so appear to be helpful. However, these reviews often
adopt a biased perspective that is designed to help or hinder
sales of the target product. A reviewer's identity might be useful here.

Network analysis techniques can be used
to identify recurring spam in user-generated comments associated
with YouTube videos, by identifying discriminating
comment motifs that are indicative of spambots.

###### Research

A classical review classification approach considers features relating to the ratings and to the
structural, syntactic, and semantic properties of reviews, and finds ratings and review length among the most discriminating.
Reviewer expertise was found to be a useful predictor of
review helpfulness: people interested in a certain kind of hostel are likely to pen high-quality reviews for
similar hostels.

Furthermore,
opinion sentiment has been mined from user reviews to
predict ratings and helpfulness in services.

Spam detection in reviews has also been approached with machine learning that
is enhanced by information about the spammer’s identity, as
part of a two-tier co-learning approach.

On a related topic, network analysis techniques have been used
to identify recurring spam in user-generated comments associated
with reviews, by identifying discriminating
comment motifs that are indicative of spambots.

###### Our model describes techniques for mining topical and sentiment features from user-generated reviews and demonstrates their ability to boost classification accuracy.

Our focus is on automatically mining topics
from user-generated product reviews and assigning sentiment
to these topics on a per-review basis.

###### Topic Extraction

We consider two basic types of topics: bi-grams and single
nouns.

To produce a set of bi-gram topics we extract all bi-grams from
the global review set which conform to one of two basic
part-of-speech co-location patterns: (1) an adjective followed by a
noun (AN), such as wide angle; and (2) a noun followed by a
noun (NN), such as video mode. These are candidate topics
that need to be filtered to avoid including ANs that are actually
opinionated single-noun topics; for example, excellent
lens is a single-noun topic (lens) and not a bi-gram topic. To
do this we exclude bi-grams whose adjective is found to be a
sentiment word (e.g. excellent, good, great, lovely, terrible,
horrible) using the sentiment lexicon.
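
The two patterns and the sentiment-word filter can be sketched over POS-tagged tokens as follows. The set `SENTIMENT_WORDS` contains only the examples from the text and stands in for the full sentiment lexicon; `bigram_topics` is a hypothetical name.

```python
NOUN_TAGS = {"NN", "NNS"}
# Stand-in for the sentiment lexicon: only the example words mentioned in the text.
SENTIMENT_WORDS = {"excellent", "good", "great", "lovely", "terrible", "horrible"}


def bigram_topics(tagged):
    """tagged: list of (word, tag) pairs. Returns candidate AN and NN bi-gram topics."""
    topics = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t2 not in NOUN_TAGS:
            continue
        if t1 == "JJ" and w1.lower() not in SENTIMENT_WORDS:
            topics.append((w1, w2))   # AN pattern with a non-opinionated adjective
        elif t1 in NOUN_TAGS:
            topics.append((w1, w2))   # NN pattern
    return topics
```

With this filter, "wide angle" and "video mode" are kept as bi-gram topics, while "excellent lens" is left for the single-noun step.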

To identify the single-noun topics we extract a candidate
set of (non stop-word) nouns from the global review set. Often
these single-noun candidates will not make for good topics;
for example, they might include words such as family or
day or vacation. One solution is to
validate such topics by eliminating those that are rarely associated
with opinionated words. The intuition is that nouns
that frequently occur in reviews and that are frequently associated
with sentiment-rich, opinion-laden words are likely to
be product topics that the reviewer is writing about, and therefore
good topics. Thus, for each candidate single noun, we
calculate how frequently it appears with nearby words from
a list of sentiment words, keeping the single noun only if this
frequency is greater than some threshold (in this case 70%).
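
A minimal sketch of this validation step is shown below. The text does not say how close a word must be to count as "nearby", so the `window` parameter (three tokens on either side) is an assumption, as are the function names.

```python
def opinion_rate(noun, sentences, sentiment_words, window=3):
    """Fraction of the noun's occurrences that have a sentiment word within `window` tokens."""
    hits = total = 0
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != noun:
                continue
            total += 1
            nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            if any(w in sentiment_words for w in nearby):
                hits += 1
    return hits / total if total else 0.0


def validate_nouns(candidates, sentences, sentiment_words, threshold=0.7, window=3):
    """Keep only the candidate nouns whose opinion rate exceeds the threshold."""
    return [n for n in candidates
            if opinion_rate(n, sentences, sentiment_words, window) > threshold]
```

A noun like "staff" that almost always sits next to an opinion word passes the filter, while "day" usually does not.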

###### Sentiment Analysis
To determine the sentiment of the topics in the product topic
set we use a method similar to the opinion pattern mining
technique for extracting
opinions from unstructured hostel reviews.

For a topic Ti that occurs in a sentence Sj, let wmin be the sentiment word in Sj that is closest to Ti.
Next we determine the part-of-speech tags for wmin, Ti
and any words that occur between wmin and Ti. The POS
sequence corresponds to an opinion pattern.

Once an entire pass of all topics has been completed we can compute the frequency of all opinion patterns that have been recorded. A pattern is deemed to be valid (from the perspective of our ability to assign sentiment) if it occurs more than the average number of occurrences over all patterns.

For valid patterns we assign sentiment based on the sentiment of wmin and subject to whether Sj contains any negation terms within a 4-word distance of wmin. If there are no such negation terms then the sentiment assigned to Ti in Sj is that of the sentiment word in the sentiment lexicon. If there is a negation word then this sentiment is reversed. If an opinion pattern is deemed not to be valid (based on its frequency) then we assign a neutral sentiment to each of its occurrences within the review set.
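
The validity test and the negation rule can be sketched as below. `LEXICON` and `NEGATIONS` are tiny hypothetical stand-ins for the real sentiment lexicon and negation list, scores are encoded as +1 / -1 / 0, and treating a sentence with no sentiment word as neutral is an assumption.

```python
# Hypothetical stand-ins for the sentiment lexicon and the list of negation terms.
LEXICON = {"good": 1, "great": 1, "clean": 1, "bad": -1, "dirty": -1}
NEGATIONS = {"not", "no", "never"}


def valid_patterns(pattern_counts):
    """Patterns that occur more often than the average count over all patterns."""
    mean = sum(pattern_counts.values()) / len(pattern_counts)
    return {p for p, c in pattern_counts.items() if c > mean}


def topic_sentiment(tokens, topic_index, pattern_is_valid=True):
    """Sentiment (+1, -1 or 0) of the topic at topic_index within one tokenised sentence."""
    positions = [i for i, t in enumerate(tokens) if t in LEXICON]
    if not positions or not pattern_is_valid:
        return 0  # neutral
    # w_min: the sentiment word closest to the topic.
    w_min = min(positions, key=lambda i: abs(i - topic_index))
    score = LEXICON[tokens[w_min]]
    # Reverse the sentiment if a negation occurs within 4 words of w_min.
    if any(t in NEGATIONS for t in tokens[max(0, w_min - 4):w_min + 5]):
        score = -score
    return score
```

For "the room was not clean" with the topic "room", the nearest sentiment word is "clean" and the nearby "not" flips the result to negative.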

###### To build a classifier for predicting review helpfulness we adopt a supervised machine learning approach. 
In the data
that is available to us each review has a helpfulness score that
reflects the percentage of positive votes that it has received, if
any. In this work we label a review as helpful if and only if
it has a helpfulness score in excess of 0.75. All other reviews
are labeled as unhelpful.

For each review Rk, we assign a collection of topics
(topics(Rk) = T1, T2, ..., Tm) and corresponding sentiment
scores (pos/neg/neutral) which can be considered in isolation
and/or in aggregate as the basis for classification features.

When it comes to sentiment we can formulate a variety of
classification features from the number of positive (NumPos
and NumUPos), negative (NumNeg and NumUNeg) and neutral
(NumNeutral and NumUNeutral) topics (total and unique)
in a review, to the rank-weighted number of positive (WPos),
negative (WNeg), and neutral (WNeutral) topics, to the relative
sentiment, positive (RelUPos), negative (RelUNeg), or
neutral (RelUNeutral), of a review’s topics.

We also include a measure of the relative density of opinionated
(non-neutral sentiment) topics in a review, and a relative measure of the difference between the
overall review sentiment and the user’s normalized product
rating, i.e. SignedRatingDiff(Rk) = RelUPos(Rk) −
NormUserRating(Rk).
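
A few of these features can be sketched from a review's list of (topic, sentiment) pairs. The exact definitions are not given in the text, so the ones below are assumptions: `RelUPos` is taken as the share of unique topics with a positive mention, `Density` as the share of unique topics with a non-neutral mention, and the rank-weighted features are left out.

```python
def sentiment_features(topic_sentiments, norm_user_rating):
    """topic_sentiments: list of (topic, score) with score in {+1, -1, 0}.
    norm_user_rating: the user's product rating normalized to [0, 1]."""
    pos = [t for t, s in topic_sentiments if s > 0]
    neg = [t for t, s in topic_sentiments if s < 0]
    neutral = [t for t, s in topic_sentiments if s == 0]
    n_unique = len({t for t, _ in topic_sentiments}) or 1  # avoid division by zero
    features = {
        "NumPos": len(pos), "NumNeg": len(neg), "NumNeutral": len(neutral),
        "NumUPos": len(set(pos)), "NumUNeg": len(set(neg)), "NumUNeutral": len(set(neutral)),
    }
    # Assumed definitions (not given in the text).
    features["RelUPos"] = features["NumUPos"] / n_unique
    features["RelUNeg"] = features["NumUNeg"] / n_unique
    features["Density"] = len(set(pos) | set(neg)) / n_unique
    features["SignedRatingDiff"] = features["RelUPos"] - norm_user_rating
    return features


example = sentiment_features(
    [("staff", 1), ("staff", 1), ("location", 1), ("wifi", -1), ("kitchen", 0)],
    norm_user_rating=0.5,
)
```

A positive `SignedRatingDiff` would mean the review text is more positive than the rating the user gave, and a negative one the opposite.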

To evaluate
the ability of our classifier to make review recommendations,
we can use the classification confidence as one simple
way to rank-order helpful reviews and select the top-ranked
review for recommendation to the user.

We can test the performance of these recommendation
techniques because we know the actual
helpfulness scores of all reviews (the ground truth): we can
compare the recommended review to the review which has
the actual highest helpfulness score for each hostel, and average
across all hostels.
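
One way to run this comparison is sketched below. The input layout (a dict from hostel to a list of `(confidence, helpfulness)` pairs) and the choice to report two averages are assumptions made for illustration.

```python
def evaluate_recommendations(reviews_by_hostel):
    """reviews_by_hostel: {hostel: [(classifier_confidence, actual_helpfulness), ...]}.

    Returns (mean helpfulness of the recommended reviews,
             mean helpfulness of the actual best reviews), averaged over hostels.
    """
    recommended, best = [], []
    for reviews in reviews_by_hostel.values():
        # Recommend the review the classifier is most confident about.
        top = max(reviews, key=lambda r: r[0])
        recommended.append(top[1])
        best.append(max(helpfulness for _, helpfulness in reviews))
    n = len(recommended)
    return sum(recommended) / n, sum(best) / n


rec_mean, best_mean = evaluate_recommendations({
    "A": [(0.9, 0.8), (0.4, 1.0)],
    "B": [(0.7, 0.6), (0.2, 0.1)],
})
```

The gap between the two averages shows how far the confidence-based ranking is from always picking the most helpful review.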