I. Introduction
1. Domain-specific area

	Sentiment analysis combines natural language processing, text analysis and computational linguistics to study subjective information. The data source chosen for this project is a corpus of hotel reviews, each of which is either truthful or deceptive. Since our aim is to classify the sentiment of genuine text accurately, part of the project involves removing the deceptive reviews from the data. Sentiment analysis has applications across many fields, e.g., marketing, customer service and clinical medicine, and across areas such as reviews/surveys, social media applications and health care. It comes in several forms, including subjectivity/objectivity identification, feature/aspect-based identification and intensity ranking of a given sentiment/emotion. In this project, we’ve chosen to perform polarity classification, one of the main subtasks of sentiment analysis/opinion mining. Perhaps one of the biggest industries to which this work can contribute is social media. Companies that wish to market their products place great importance on reviews, ratings and recommendations in order to navigate the rapidly shifting landscape of modern trends. Research into sentiment analysis (in this case polarity classification) could therefore be of considerable value in the modern world of social media. 

	In its raw form, the data source is a csv file containing five columns: deceptive (whether the review is truthful or deceptive), hotel, polarity (positive or negative opinion), source and text. The only relevant fields in this investigation are polarity, text and, to some degree, deceptive. As mentioned before, the deceptive reviews need to be removed from the data because this project tries to mine genuine opinions from text using sentiment analysis. This clearly has direct (although not exclusive) applications to the hotel industry. Another area in which this research has applications is recommender systems, which are important to many industries such as streaming services (e.g. Netflix, Hulu, HBO, etc.), social networking services and e-commerce websites (e.g. Amazon). This area of research is especially important to social networking services and e-commerce websites, since user-generated text can provide highly valuable data about users’ opinions. Sentiment analysis, and polarity classification in particular, seems to have great potential in the increasingly technocentric society of the modern age. 

2. Dataset

	The objective of this research project is to use data from a hotel review corpus to create a text classifier based on sentiment analysis. The data source is a csv file found on Kaggle named Deceptive Opinion Spam Corpus. The corpus is composed of reviews of 20 hotels in Chicago and includes 400 truthful positive reviews, 400 deceptive positive reviews, 400 truthful negative reviews and 400 deceptive negative reviews. For the purpose of this research we will be removing all the deceptive reviews, as they are not within the scope of this research. Therefore, the only significant fields in the data set are the text column, composed of strings containing the customer reviews; the polarity column, containing a string stating that the review is either positive or negative; and the deceptive column, containing a string stating that the review is either truthful or deceptive. The data source is roughly 1 MB in size. 
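	The composition described above can be checked with a cross-tabulation before any rows are dropped. The sketch below uses a tiny made-up dataframe with the same column names as the corpus (the rows are hypothetical, not taken from the data); on the real file, each cell of the table would be expected to show 400 if the description above holds.

```python
import pandas as pd

# Tiny stand-in with the same column names as the corpus (rows are made up)
toy = pd.DataFrame({
    "deceptive": ["truthful", "truthful", "deceptive", "deceptive"],
    "polarity": ["positive", "negative", "positive", "negative"],
    "text": ["Great stay.", "Room was dirty.", "Best hotel ever!", "Worst hotel ever!"],
})

# Count reviews for every (deceptive, polarity) combination
counts = pd.crosstab(toy["deceptive"], toy["polarity"])
print(counts)

# Keep only the truthful reviews and the two fields used for modelling
truthful = toy.loc[toy["deceptive"] == "truthful", ["text", "polarity"]]
print(truthful["polarity"].value_counts().to_dict())
```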
	
	A significant amount of preprocessing was necessary for this dataset. First, we eliminated the deceptive rows from the dataset. Next, we checked for null data elements using the isnull() function. We then reduced the dataframe to just two columns, text and polarity. The sentiment analysis we planned involved logistic regression, with a confusion matrix as the basis of our evaluation. Since a training set in which one category is overrepresented can bias the classifier toward that category, we then checked the count of positive and negative reviews. Luckily, there were equal numbers of positive and negative reviews at this point (400 each), meaning there was no need to selectively reduce the population size to balance the classes. The next step was to remove all punctuation from the text, as it was irrelevant to our investigation. Then we converted all the text to lowercase and used the NLTK stopwords corpus to exclude stopwords, which are essentially filler words in English that humans use but that add little information for a machine performing this task. The only exception was the word “not”, which we kept in the text. The last step was to convert the polarity column from strings to numbers, as numeric labels are easier to work with in our logistic regression and its metrics. Therefore, we converted all the positive sentiments in the polarity column to a 1 and the negative sentiments to a 0, temporarily appended this column to the dataframe and then redefined the dataframe to only include the columns containing the cleaned text and the polarity converted to binary. 
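	To make the effect of these steps concrete, here is a minimal sketch applied to a single made-up review. It is a simplification and not the notebook's code: the small STOP_WORDS set is a hypothetical stand-in for NLTK's English list, and str.split() stands in for nltk.word_tokenize. The helper names preprocess and encode_polarity are illustrative.

```python
import string

# Hypothetical mini stop-word list standing in for NLTK's English list;
# 'not' is deliberately absent so that it stays in the text.
STOP_WORDS = {"the", "was", "but", "were", "a", "and", "is", "it"}

def preprocess(text):
    # 1) strip punctuation, 2) lowercase, 3) drop stop words
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    tokens = no_punct.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def encode_polarity(label):
    # positive -> 1, anything else -> 0
    return 1 if label == "positive" else 0

review = "The room was NOT clean, but the staff were lovely!"
print(preprocess(review))
print(encode_polarity("negative"))
```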


3. Objectives

	The objective of this project is to run a logistic regression on the dataset in order to predict positive and negative sentiments. The reason is that the target variable is binary, which logistic regression models directly. We will then use scikit-learn to build a confusion matrix from the model’s predictions and analyze the results. The broader objective of this project is to create a text classifier that can be applied in various industries to improve customer-supplier relations/communication. 

	This research can contribute to many of the areas of industry stated previously. Business owners/employers in any of those areas who are looking for customer feedback in order to improve their business model and increase production and customer satisfaction could use this research. In the long term, large parts of the profession of marketing could essentially be automated using this kind of research (this paper doesn’t aim to discuss the potential ethical repercussions of its objectives). It’s become increasingly common for machines to take on roles previously held by humans, and there’s little doubt that if the field of sentiment analysis continues to grow then the role of marketers will diminish over time. Even as this paper is written, grammatical/spelling errors are automatically brought to the writer's attention and corrected. As this field continues to grow, marketing will likely become more specialized and niche. Only the marketers most capable of analyzing and acting on research data will remain in the long term, and only highly specialized fields and exceptional expertise are likely to remain unaffected by the increase in AI/ML in the years to come. 

	One other area where this research may contribute is where sentiments aren’t explicitly stated. Historically, the main subject of research in sentiment analysis has been text in which authors explicitly express their opinions. But with the increased sophistication of this research, areas in which sentiment is highly implicit will become more accessible to this branch of natural language processing. For example, one domain in which sentiments are usually expressed implicitly rather than explicitly is news articles. Journalists are expected to maintain a certain level of journalistic objectivity/integrity, which prevents them from expressing direct opinions. The implicit sentiments expressed in areas such as journalism have recently come within reach of the methods used in sentiment analysis. This work aims to aid and contribute to this area of research.

4. Evaluation Methodology

	For the purpose of this experiment, we will be performing logistic regression on a dataset in order to classify sentiment. A binary classifier of this kind lends itself naturally to evaluation with a confusion matrix. We chose logistic regression to analyze the data because of its ease of use, interpretation and implementation and its efficiency in training. It’s the view of the author that no single metric is sufficient to fully summarize a classifier’s performance and, since most of the major metrics used to evaluate machine learning classifiers can be derived from a confusion matrix, we will be using several of them: precision, recall, accuracy and F-measure. Our dataset isn’t particularly large, for two reasons. The first is that we didn’t want to get bogged down with data cleaning and checking, since the point of the assignment is to focus on Natural Language Processing and not on cleaning data. The other is computation time. The goal was to not have to wait 5-10 minutes every time we ran the code, as this would make debugging more difficult and potentially make grading the assignment more difficult as well. Therefore, since we’re dealing with a dataset which isn’t particularly large, we don’t have high expectations for the outcome of this experiment. If we get an F-score/accuracy of anything reasonably high, we will be happy with that result.
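
	For reference, the four metrics can be written in terms of the confusion-matrix counts. The symbols TP, FP, TN and FN (true positives, false positives, true negatives and false negatives) are introduced here for illustration, and the F-measure is assumed to be the balanced F1 form:

```latex
\begin{align*}
\text{accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{precision} &= \frac{TP}{TP + FP} \\
\text{recall}    &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{align*}
```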

II. Implementation

In [56]:
# install the Natural Language Toolkit; the % prefix runs pip as a notebook magic instead of as Python code
%pip install nltk

In [31]:
# import nltk, pandas and scikit-learn and download relevant libraries

import nltk
import pandas as pd
import sklearn

nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sam\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Sam\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [32]:
# import data
df = pd.read_csv('deceptive-opinion.csv')

In [33]:
# keep only the truthful reviews
df = df.loc[df["deceptive"] == 'truthful' ]

In [34]:
df.isnull().sum()

deceptive    0
hotel        0
polarity     0
source       0
text         0
dtype: int64

In [35]:
df = df[['text','polarity']]

In [36]:
df['polarity'].value_counts()

positive    400
negative    400
Name: polarity, dtype: int64

In [37]:
import string

def clean_text(text):
    # drop every ASCII punctuation character; letters, digits and whitespace are kept
    return ''.join(ch for ch in text if ch not in string.punctuation)

df['clean_text'] = df['text'].apply(clean_text)

In [38]:
df = df[['clean_text','polarity']]

In [39]:
df["clean_text"] = df["clean_text"].str.lower()

In [40]:
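# build the stop-word set once; 'not' is excluded from it so that it stays in the text
stop_words = set(nltk.corpus.stopwords.words('english')) - {'not'}

def delete_stopwords(text):
    # tokenize the review and keep only the tokens that are not stop words
    tokens = [w for w in nltk.word_tokenize(text) if w not in stop_words]
    return ' '.join(tokens)

df['final_text'] = df['clean_text'].apply(delete_stopwords)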

In [41]:
df = df[['final_text','polarity']]

In [42]:
# encode polarity as a binary label: 'positive' -> 1, anything else -> 0
temp = [1 if x == 'positive' else 0 for x in df['polarity']]

In [43]:
df['sentiment'] = temp
df = df[['final_text','sentiment']]

In [44]:
df

Unnamed: 0,final_text,sentiment
0,stayed one night getaway family thursday tripl...,1
1,triple rate upgrade view room less 200 also in...,1
2,comes little late im finally catching reviews ...,1
3,omni chicago really delivers fronts spaciousne...,1
4,asked high floor away elevator got room pleasa...,1
...,...,...
1195,booked directly intercontinentala special room...,0
1196,good location looks like good property not see...,0
1197,reading lukewarm reviews hotel went ahead got ...,0
1198,overview overrated hotel premium location grea...,0


In [45]:
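from sklearn.feature_extraction.text import TfidfVectorizer
# TF-IDF features over unigrams and bigrams; min_df=1 keeps every term that occurs at least once
vectr = TfidfVectorizer(ngram_range=(1,2),min_df=1)
# learn the vocabulary and IDF weights, then encode the reviews as a sparse matrix
vect_X = vectr.fit_transform(df['final_text'])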

In [46]:
# X = df['final_text']
y = df['sentiment']

In [47]:
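from sklearn.model_selection import train_test_split
# hold out 20% of the reviews for testing; random_state fixes the shuffle so the split is reproducible
X_train,X_test,y_train,y_test=train_test_split(vect_X,y,test_size=0.2,random_state=0)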

In [48]:
from sklearn.linear_model import LogisticRegression

In [49]:
logreg = LogisticRegression()

In [50]:
logreg.fit(X_train,y_train)

LogisticRegression()
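
Note that cell In [45] fits the vectorizer on all 800 reviews before the split in cell In [47], so the vocabulary and IDF weights also reflect the test reviews. One possible way to keep the test set fully unseen is to split the raw text first and wrap both steps in a scikit-learn pipeline. The sketch below illustrates that variant on made-up data; it is not what the notebook ran, and the toy texts and labels are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for the cleaned reviews (hypothetical data, not from the corpus)
texts = [
    "great room friendly staff", "lovely view great breakfast",
    "clean room great location", "friendly staff lovely stay",
    "dirty room rude staff", "noisy room terrible breakfast",
    "rude staff not clean", "terrible stay noisy location",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the vectorizer never sees the test reviews
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then fits the classifier
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(),
)
model.fit(X_train, y_train)

# At prediction time the test text is transformed with the training vocabulary
y_pred = model.predict(X_test)
print(len(y_pred))
```

With only eight toy reviews the predictions themselves are not meaningful; the point is the order of operations.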

In [51]:
y_pred=logreg.predict(X_test)

In [52]:
y_pred

array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1,
       1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1], dtype=int64)

In [53]:
from sklearn import metrics

In [54]:
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)

In [55]:
cnf_matrix

array([[78, 10],
       [ 1, 71]], dtype=int64)

III. Conclusions

9. Evaluation
    The model’s predictions on the 160-review test set contained 71 true positives, 1 false negative, 78 true negatives and 10 false positives. In terms of accuracy, our model did quite well: 0.93125, which is a strong result considering the dataset wasn’t particularly large. The recall came in at 0.98611, the precision at 0.87654 and the F-score at 0.92810. Since the maximum value for these four metrics is 1.0 and the minimum is 0.0, we generally consider this research to have been a success. The F-score is the harmonic mean of recall and precision and, like the accuracy, came in above 0.9. The gap between recall and precision reflects the 10 false positives against a single false negative: the model almost never misses a positive review but labels some negative reviews as positive. This experiment used only 800 samples in total (640 for training and 160 for testing), and the classifier could likely be made more accurate if a larger dataset were used. If a training set of several thousand, or perhaps even several million, reviews had been provided, we believe (based on the results of this experiment) that this approach could produce even more accurate results. 
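
    The four figures above can be reproduced from the confusion-matrix counts alone. The sketch below hard-codes the counts shown in cell In [55] and applies the standard definitions; the variable names are illustrative and not part of the original notebook.

```python
# Counts taken from the confusion matrix in cell In [55];
# rows are true labels and columns are predicted labels (scikit-learn's convention)
tn, fp = 78, 10
fn, tp = 1, 71

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("f1", f1)]:
    print(f"{name}: {value:.5f}")
```

    The same values should also be obtainable by applying scikit-learn's accuracy_score, precision_score, recall_score and f1_score to y_test and y_pred.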

10. Summary and conclusions

    Overall, we found this project to be highly enjoyable. It provided the right amount of challenge and the results ended up being surprisingly accurate. This work could contribute to many areas of industry, especially the hotel industry. The dataset consisted of raw hotel reviews (some of which were deceptive, but those were removed from the dataset) and therefore obviously has applications to that domain. The approach should transfer readily to other domains with labeled review text. On the other hand, the extent to which this work can be replicated on other data is limited. The dataset we found was unusually convenient, in the sense that there were originally 800 positive reviews and 800 negative reviews, and removing the deceptive positive and deceptive negative reviews took out exactly 50% of the dataset while leaving the classes balanced; other datasets are unlikely to be this clean. On the programming side, Python deserves a lot of credit. The libraries in Python significantly simplified this project. For example, writing the code to compute the confusion matrix from scratch would most likely have been difficult and time consuming. Programming this project in a language without comparable libraries would most likely have been a nightmare. The main benefit of Python is its enormous number of easily accessible libraries, which allow the programmer to focus on the problem at hand rather than on low-level implementation details. Thus, one potential drawback for someone using an alternative programming language to conduct this experiment is that there may not be as many libraries that are as easy to use as those available in Python.
	
	Also, considering the fact that the dataset we used was rather small and our expectations weren’t especially high for the efficacy of this experiment, the results have been a pleasant surprise. Overall, we consider this project to have been an astounding success and enjoyed it thoroughly. 

Sources:

https://joannatrojak.medium.com/sentiment-analysis-with-logistic-regression-in-python-with-nltk-library-d5030b1d84e3

https://www.analyticsvidhya.com/blog/2021/06/sentiment-analysis-using-nltk-a-practical-approach/

https://www.kaggle.com/rtatman/deceptive-opinion-spam-corpus

https://www.datacamp.com/community/tutorials/understanding-logistic-regression-python