# **The eighth in-class-exercise (20 points in total, 10/29/2020)**

The data for this exercise comes from the dataset you created in assignment three. Please answer the following questions based on your data:

## (1) (10 points) Write a Python program to extract the sentiment-related terms from the corpus. You may use a Python package such as polyglot or an external lexicon resource for this question. Rank the sentiment-related terms by frequency.

In [91]:
# Write your code here

import pandas as pd
import nltk
from nltk.corpus import stopwords
from textblob import TextBlob

# A set makes the stop-word membership test below a constant-time lookup
stop = set(stopwords.words('english'))

data = pd.read_csv(r"C:\Users\Raheyma Arshad\Desktop\Output_CSV.csv")
data = data[['review']]

# Text preprocessing: strip punctuation, lowercase, drop stop words, tokenize
# regex=True is needed in pandas >= 2.0, where str.replace treats the pattern as a literal string by default
data['review'] = data['review'].str.replace(r'[^\w\s]', '', regex=True)
data['review'] = data['review'].str.lower()
data['review'] = data['review'].apply(lambda text: " ".join(w for w in text.split() if w not in stop))
data['review'] = data['review'].apply(nltk.word_tokenize)

# Calculate the frequency of all the terms.
term_freq = data['review'].apply(lambda tokens: pd.Series(tokens).value_counts()).sum(axis=0).reset_index()
term_freq.columns = ['words', 'tf']

# Find the polarity of each term.
term_freq['polarity'] = term_freq['words'].apply(lambda x: TextBlob(x).sentiment.polarity)

# Remove terms with polarity = 0.0 because they do not have any sentiment associated with them.
sentiment_related_terms = term_freq.loc[term_freq['polarity'] != 0].sort_values(by='tf', ascending=False)
sentiment_related_terms = sentiment_related_terms.reset_index(drop=True)
print('The sentiment related terms ranked by term frequency are:')
sentiment_related_terms

The sentiment related terms ranked by term frequency are:


| | words | tf | polarity |
|---|---|---|---|
| 0 | good | 52.0 | 0.700000 |
| 1 | dark | 33.0 | -0.150000 |
| 2 | many | 33.0 | 0.500000 |
| 3 | best | 32.0 | 1.000000 |
| 4 | much | 31.0 | 0.200000 |
| ... | ... | ... | ... |
| 395 | hated | 1.0 | -0.900000 |
| 396 | wise | 1.0 | 0.700000 |
| 397 | fortunate | 1.0 | 0.400000 |
| 398 | magnificent | 1.0 | 1.000000 |
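
The same extract-then-rank idea can also be sketched with an external lexicon, which the question allows. This is a minimal illustration and not the method used in the cell above: the `LEXICON` dictionary, the `rank_sentiment_terms` helper, and the sample reviews are hypothetical stand-ins for a real lexicon resource and the review corpus.

```python
from collections import Counter

# Hypothetical toy lexicon (term -> polarity); a real run would load an external resource
LEXICON = {"good": 0.7, "best": 1.0, "dark": -0.15, "hated": -0.9}


def rank_sentiment_terms(documents, lexicon):
    """Return (term, frequency, polarity) tuples for lexicon terms, most frequent first."""
    counts = Counter(
        token
        for doc in documents
        for token in doc.lower().split()
        if token in lexicon
    )
    return [(term, freq, lexicon[term]) for term, freq in counts.most_common()]


# Hypothetical sample reviews, used only for illustration
reviews = ["Good plot and good cast", "The best film but dark", "I hated it"]
ranked = rank_sentiment_terms(reviews, LEXICON)
for term, freq, polarity in ranked:
    print(term, freq, polarity)
```

Filtering against the lexicon before counting keeps the frequency table small, because terms with no sentiment entry are never tallied.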


## (2) (10 points) Compare the performance of the following tools in sentiment identification: TextBlob (https://textblob.readthedocs.io/en/dev/), VADER (https://github.com/cjhutto/vaderSentiment), TFIDF-based Support Vector Machine (SVM) (Split your data into training and testing data). Take your own annotation as the standard answers. 

Reference code: https://towardsdatascience.com/fine-grained-sentiment-analysis-in-python-part-1-2697bb111ed4

In [157]:
# Write your code here

############################################################################################################################
# TEXTBLOB

data2 = pd.read_csv(r"C:\Users\Raheyma Arshad\Desktop\Annotated Data.csv")
data2['polarity'] = data2['clean_text'].apply(lambda x: TextBlob(x).sentiment.polarity)
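# Note: bins=5 splits the observed range of polarity scores into five equal-width bins, not the full [-1, 1] scale
data2['predicted sentiment'] = pd.cut(data2['polarity'], bins=5, labels=[1, 2, 3, 4, 5])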

def sentiment(x):
    if x in [1, 2]:
        return 'Negative'
    if x == 3:
        return 'Neutral'
    if x in [4, 5]:
        return 'Positive'

data2['predicted sentiment'] = data2['predicted sentiment'].apply(lambda x: sentiment(x))
print('\n', 'TEXTBLOB SENTIMENT IDENTIFICATION:', '\n')
print(data2[['document_id', 'sentiment', 'predicted sentiment']].head(5))

from sklearn.metrics import f1_score, accuracy_score
textblob_accuracy = accuracy_score(data2['sentiment'], data2['predicted sentiment'])*100
textblob_f1 = f1_score(data2['sentiment'], data2['predicted sentiment'], average='macro')

print('\n', 'The accuracy of the TextBlob sentiment identification is:', textblob_accuracy)
print('The f1-score of the TextBlob sentiment identification is:', textblob_f1)

############################################################################################################################
# VADER

from nltk.sentiment.vader import SentimentIntensityAnalyzer
vader = SentimentIntensityAnalyzer()

data3 = pd.read_csv(r"C:\Users\Raheyma Arshad\Desktop\Annotated Data.csv")
data3['polarity'] = data3['clean_text'].apply(lambda x: vader.polarity_scores(x)['compound'])
data3['predicted sentiment'] = pd.cut(data3['polarity'], bins=5, labels=[1, 2, 3, 4, 5])

data3['predicted sentiment'] = data3['predicted sentiment'].apply(lambda x: sentiment(x))
print('\n', 'VADER SENTIMENT IDENTIFICATION:', '\n')
print(data3[['document_id', 'sentiment', 'predicted sentiment']].head(5))

vader_accuracy = accuracy_score(data3['sentiment'], data3['predicted sentiment'])*100
vader_f1 = f1_score(data3['sentiment'], data3['predicted sentiment'], average='macro')

print('\n', 'The accuracy of the VADER sentiment identification is:', vader_accuracy)
print('The f1-score of the VADER sentiment identification is:', vader_f1)

############################################################################################################################
# TFIDF-BASED SUPPORT VECTOR MACHINE (SVM)

from sklearn.model_selection import train_test_split

data4 = pd.read_csv(r"C:\Users\Raheyma Arshad\Desktop\Annotated Data.csv")
train, test = train_test_split(data4, train_size=0.8, test_size=0.2)
# Work on independent copies so adding a column to `test` does not raise SettingWithCopyWarning
train, test = train.copy(), test.copy()

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# SGDClassifier with hinge loss trains a linear SVM by stochastic gradient descent
pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=100, 
                                           learning_rate='optimal', tol=None))])

svm = pipeline.fit(train['clean_text'], train['sentiment'])
test['predicted sentiment'] = svm.predict(test['clean_text'])

print('\n', 'TFIDF-BASED SVM SENTIMENT IDENTIFICATION:', '\n')
print(test[['document_id', 'sentiment', 'predicted sentiment']].head(5))

svm_accuracy = accuracy_score(test['sentiment'], test['predicted sentiment'])*100
svm_f1 = f1_score(test['sentiment'], test['predicted sentiment'], average='macro')

print('\n', 'The accuracy of the TFIDF-based SVM sentiment identification is:', svm_accuracy)
print('The f1-score of the TFIDF-based SVM sentiment identification is:', svm_f1)

############################################################################################################################
# Your analysis here
'''
We compare three sentiment-identification tools: TextBlob, VADER, and a TF-IDF-based SVM. Performance on the annotated
data is measured with accuracy and macro-averaged F1.

The TF-IDF-based SVM gives the best accuracy and F1 score, which makes it the strongest of the three on this dataset. It
correctly classified 70% of the reviews in the test set. Note that the SVM is scored on the 20% held-out split, while
TextBlob and VADER are scored on the full annotated dataset, so the comparison is not strictly like for like. VADER
reached 55% accuracy against 39% for TextBlob, but its macro F1 (0.349) is slightly lower than TextBlob's (0.378). One
possible reason, consistent with the five rows shown in the output, is that VADER labels most reviews Positive, which can
raise accuracy while lowering macro F1, since macro F1 weights every class equally.

TextBlob looks up the words and phrases it can assign a polarity to and averages them over the text. VADER sums the
valence scores of the lexicon words in a text, adjusts them with rules for negation and intensifiers, and normalizes the
result to a compound score between -1 and 1. Both are general-purpose lexicon methods that never see the annotations.
The SVM is trained on the annotated reviews, so it can learn vocabulary specific to this dataset, which likely explains
its better results.

'''
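
Accuracy and macro F1 can rank two classifiers differently, as the reported scores for VADER (higher accuracy) and TextBlob (higher macro F1) do. The sketch below shows how this happens, using hypothetical labels rather than the annotated data. The `macro_f1` and `accuracy` helpers are written out by hand for illustration; `macro_f1` is intended to match what `f1_score(..., average='macro')` computes, namely the unweighted mean of per-class F1.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never correct contributes an F1 of 0
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold labels: 6 Positive, 2 Neutral, 2 Negative
gold = ["Positive"] * 6 + ["Neutral"] * 2 + ["Negative"] * 2

# Classifier A always predicts the majority class
always_positive = ["Positive"] * 10

# Classifier B spreads its predictions across all three classes
mixed = ["Positive", "Positive", "Positive", "Neutral", "Negative",
         "Negative", "Neutral", "Positive", "Negative", "Neutral"]

for name, pred in [("always_positive", always_positive), ("mixed", mixed)]:
    print(name, accuracy(gold, pred), round(macro_f1(gold, pred), 3))
```

Here the always-Positive classifier has the higher accuracy but the lower macro F1, because it scores zero on the two classes it never predicts.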


 TEXTBLOB SENTIMENT IDENTIFICATION: 

  document_id sentiment predicted sentiment
0        rev1  Positive             Neutral
1        rev2  Positive            Positive
2        rev3  Positive             Neutral
3        rev4  Positive            Negative
4        rev5  Positive             Neutral

 The accuracy of the TextBlob sentiment identification is: 39.0
The f1-score of the TextBlob sentiment identification is: 0.3782789979973078

 VADER SENTIMENT IDENTIFICATION: 

  document_id sentiment predicted sentiment
0        rev1  Positive            Positive
1        rev2  Positive            Positive
2        rev3  Positive            Positive
3        rev4  Positive            Positive
4        rev5  Positive            Positive

 The accuracy of the VADER sentiment identification is: 55.00000000000001
The f1-score of the VADER sentiment identification is: 0.34931752873563227

 TFIDF-BASED SVM SENTIMENT IDENTIFICATION: 

   document_id sentiment predicted sentiment
18       rev19

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['predicted sentiment'] = svm.predict(test['clean_text'])
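
The TextBlob and VADER labels above come from `pd.cut(..., bins=5)`, which divides the observed range of scores into five equal-width bins, so the Neutral band shifts with the data. One alternative is to use fixed cut-offs on the [-1, 1] scale. The sketch below assumes the ±0.05 compound-score thresholds suggested in the VADER README. Applying the same cut-offs to TextBlob polarity is an assumption of this sketch, and `label_from_score` is a hypothetical helper, not part of either library.

```python
def label_from_score(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to a three-way sentiment label."""
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"


# Hypothetical scores, for illustration only
scores = [0.62, 0.01, -0.30, 0.05, -0.049]
labels = [label_from_score(s) for s in scores]
print(labels)
```

On a DataFrame, the helper could replace the `pd.cut` step with something like `data3['polarity'].apply(label_from_score)`. The scores reported above would then change, since the label boundaries no longer depend on the data.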
