# Intro to Sentiment Analysis
### By Suparna Kompalli

For reference, you might find it useful to read [Chapter 13 of the Data 100 textbook](https://learningds.org/ch/13/text_intro.html) on working with text.

#### Setup

Run the following cell to import the necessary libraries for this lesson.

In [8]:
import pandas as pd
import re

In this lesson we will be working with a simple dataset of movie reviews. This dataset has two columns. The first, `review`, is the raw text of each review, and the second, `sentiment`, classifies the review as positive or negative. Run the following cell to get a better idea of what a row looks like.

In [9]:
reviews = pd.read_csv('movie-reviews.csv')
reviews.head(5)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


#### Sentiment Analysis

Let's do some sentiment analysis! We are going to take a look at the sentiment of each review and see if we can find some patterns. This will help us classify the emotion of a body of text as positive or negative. The goal here is to build a classifier that can accurately predict the sentiment of a review. We can then compare its predictions against the `sentiment` column to test how well it does!

But what is sentiment analysis? In the sentence "I love pineapple" the word *love* has a positive sentiment. In a sentence like "I hate pineapple" the word *hate* has a negative sentiment. Thus, we are looking at the sentiment of the words in each review to get a general idea about the sentiment of the `entire body` of the text.



`Can you think of some words that might be used both positively and negatively?`

*answer here*

We are going to use the VADER (Valence Aware Dictionary and sEntiment Reasoner) lexicon to analyze the sentiment of these reviews. Its scores were derived from sentiment expressed on social media, which is similar in style to the informal reviews we want to measure today!

Here are some links for reference: [GitHub](https://github.com/cjhutto/vaderSentiment), [original paper](https://doi.org/10.1609/icwsm.v8i1.14550).

The VADER lexicon gives the sentiment of individual words. Run the following cell to see how the lexicon classifies the sentiment of some words:

In [10]:
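# The lexicon file is tab-separated; keep only each token and its polarity score.
characters = pd.read_csv("vader_lexicon.txt", sep="\t", names=["token", "polarity", "something", "list"]).drop(["something", "list"], axis=1)
# Index by token so that words can be matched against the lexicon later.
characters = characters.set_index("token")
characters[-100:-90]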

token,polarity
withdrawal,0.1
woe,-1.8
woebegone,-2.6
woebegoneness,-1.1
woeful,-1.9
woefully,-1.7
woefulness,-2.1
woes,-1.9
woesome,-1.2
won,2.7


Each row contains a word (a "token") and its polarity, which measures how positive or negative the word is on a scale of -4 (extremely negative) to +4 (extremely positive).

We won't actually be *reading* each review. Instead, by adding up the polarity of every word in a body of text, we can estimate the sentiment of the whole review!
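As a minimal sketch of this idea, the snippet below scores one short sentence with a toy lexicon. The three polarity values are copied from the lexicon rows shown above; the sentence and the names `toy_lexicon` and `total_polarity` are made up for illustration and are not part of the lesson's code.

```python
# Polarity values copied from the lexicon rows shown above.
toy_lexicon = {"won": 2.7, "woe": -1.8, "woeful": -1.9}


def total_polarity(text, lexicon):
    """Add up the polarity of every word; words not in the lexicon count as 0."""
    return sum(lexicon.get(word, 0) for word in text.lower().split())


# Hypothetical sentence, made up for illustration.
score = total_polarity("We won despite the woe", toy_lexicon)

# A total above 0 is read as positive, anything else as negative.
label = "positive" if score > 0 else "negative"
print(round(score, 1), label)
```

Only "won" and "woe" are found in the toy lexicon, so the other words contribute nothing to the total.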

`What are some potential flaws with this method of classification?`

*answer here*

There are many potential flaws. Let's take a look at a specific review to see one of them.

In [11]:
texts = reviews['review'].tolist()
print("sentiment: positive")
print("review:" + texts[1])

sentiment: positive
review:A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface

The last sentence of this review is `It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets ... are terribly well done.`

Words like *terribly* are typically used in negative contexts, yet this review uses it in a positive way! This is just one example of the many challenges in sentiment analysis.

`Do you notice a difference between the text in the review and the words in the VADER lexicon?`

*answer here*

Every word in the VADER lexicon is lowercase! To properly process our review, we want to make every word in it lowercase so we can match it correctly.

There are also some HTML tags hidden in the review. `<br /><br />` indicates a line break in the review. Let's go ahead and remove these from the review to clean up this dataset.
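As a quick check on a single string, this sketch applies both cleaning steps to the opening of the review printed above. The variable names `sample` and `cleaned` are illustrative only.

```python
import re

# Opening of the review printed above, including its HTML line-break tags.
sample = "A wonderful little production. <br /><br />The filming technique is very unassuming"

# Remove the paired break tags, then lowercase so words can match the lexicon.
cleaned = re.sub("<br /><br />", "", sample).lower()
print(cleaned)
```

Note that the pattern matches only the exact paired tag `<br /><br />`, so a single `<br />` or a differently spaced tag would be left in place.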

In [12]:
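# Strip the HTML line-break tags and lowercase each review so words match the lexicon.
clean = []
for i in texts:
    clean.append(re.sub('<br /><br />',"", i).lower())
reviews["clean_text"] = clean
# Work with only the first 5,000 reviews from here on.
reviews = reviews.iloc[:5000].reset_index()
reviews[["review", "clean_text"]].head()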

Unnamed: 0,review,clean_text
0,One of the other reviewers has mentioned that ...,one of the other reviewers has mentioned that ...
1,A wonderful little production. <br /><br />The...,a wonderful little production. the filming tec...
2,I thought this was a wonderful way to spend ti...,i thought this was a wonderful way to spend ti...
3,Basically there's a family where a little boy ...,basically there's a family where a little boy ...
4,"Petter Mattei's ""Love in the Time of Money"" is...","petter mattei's ""love in the time of money"" is..."


Now we can use our clean text to calculate the total polarity of each review!
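The next cell packs several pandas steps into two expressions. As a rough sketch of what those steps do, here they are one at a time on a toy example. The two reviews are made up, the three lexicon values are copied from the lexicon rows shown earlier, and the names `toy`, `lexicon`, `words`, `scored`, and `totals` are illustrative rather than part of the lesson's code.

```python
import pandas as pd

# Toy lexicon using three polarity values from the lexicon output shown earlier.
lexicon = pd.DataFrame(
    {"polarity": [2.7, -1.8, -1.9]},
    index=pd.Index(["won", "woe", "woeful"], name="token"),
)

# Two hypothetical, already-cleaned reviews.
toy = pd.DataFrame({"clean_text": ["we won despite the woe", "a woeful film"]})

# Step 1: one row per word; each word keeps the row label of its review.
words = (toy["clean_text"].str.split().explode().to_frame()
         .rename(columns={"clean_text": "word"}))

# Step 2: attach each word's polarity; words missing from the lexicon get NaN.
scored = words.merge(lexicon, how="left", left_on="word", right_index=True)

# Step 3: count unknown words as 0 and add up the polarity per review.
totals = scored["polarity"].fillna(0).groupby(level=0).sum()
print(totals)
```

This sketch groups on the index level directly, whereas the lesson's cell calls `reset_index()` and groups on the resulting `index` column; both approaches add up the scores per review.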

In [13]:
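# One row per word: split each cleaned review on whitespace, keeping the review's row label as the index.
tidy_reviews = (reviews["clean_text"].str.split().explode().to_frame().rename(columns={"clean_text": "word"}))
# Look up each word's polarity (0 if the word is not in the lexicon) and sum the scores per review.
reviews["polarity_score"] = (tidy_reviews.merge(characters, how='left', left_on='word', right_index=True)
        .reset_index().loc[:, ['index', 'polarity']].fillna(0).groupby('index').sum())
# Label a review positive if its total score is above 0, and negative otherwise.
overall = []
for i in reviews["polarity_score"]:
    if i > 0:
        overall.append("positive")
    else:
        overall.append("negative")
reviews["polarity"] = overall
reviews.head()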

Unnamed: 0,index,review,sentiment,clean_text,polarity_score,polarity
0,0,One of the other reviewers has mentioned that ...,positive,one of the other reviewers has mentioned that ...,-24.0,negative
1,1,A wonderful little production. <br /><br />The...,positive,a wonderful little production. the filming tec...,12.4,positive
2,2,I thought this was a wonderful way to spend ti...,positive,i thought this was a wonderful way to spend ti...,14.6,positive
3,3,Basically there's a family where a little boy ...,negative,basically there's a family where a little boy ...,-8.3,negative
4,4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"petter mattei's ""love in the time of money"" is...",19.5,positive


Finally, let's see how good this method is at classification!

We will calculate Precision, Accuracy, Recall, and F1-Score.
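To make these four numbers concrete, here is a small sketch that computes them by hand from counts of true and false positives and negatives, treating "positive" as the class of interest. The two label lists are made up for illustration, and this is a plain-Python illustration of the standard definitions rather than the scikit-learn code used in the next cell.

```python
# Hypothetical labels, made up for illustration: 1 = positive, 0 = negative.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 1]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # positive, predicted positive
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # negative, predicted positive
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # positive, predicted negative
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # negative, predicted negative

precision = tp / (tp + fp)          # of the predicted positives, how many were right
recall = tp / (tp + fn)             # of the actual positives, how many were found
accuracy = (tp + tn) / len(pairs)   # share of all predictions that were right
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(precision, recall, accuracy, round(f1, 3))
```

A classifier that labels almost everything as positive would have high recall but low precision, which is why it helps to look at several of these numbers together.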

In [20]:
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score

def calculate_metrics(df, actual_col, predicted_col):
    # Encode labels as 1 (positive) and 0 (negative) so that "positive" is the class being scored.
    actual = [1 if sentiment == "positive" else 0 for sentiment in df[actual_col].tolist()]
    predicted = [1 if sentiment == "positive" else 0 for sentiment in df[predicted_col].tolist()]

    metrics = {
        'Precision': precision_score(actual, predicted, average='binary'),
        'Accuracy': accuracy_score(actual, predicted),
        'Recall': recall_score(actual, predicted, average='binary'),
        'F1-Score': f1_score(actual, predicted, average='binary')
    }
    return metrics

metrics = calculate_metrics(reviews, 'sentiment', 'polarity')
for label in metrics:
    print(label + ": " + str(metrics[label]))

Precision: 0.6231082654249127
Accuracy: 0.6756
Recall: 0.8675040518638574
F1-Score: 0.7252710027100271


Not bad! What are some ways you think we can improve this classifier?

*answer here*

#### Conclusion

You've built your first classifier for sentiment analysis! Sentiment analysis is just one of the many applications of Natural Language Processing, and it allows researchers to gauge public opinion, customer sentiment, and more!

I hope you enjoyed!