# Lecture 6 - Sentiment Analysis

In this notebook we will learn how to measure sentiment in text. 

Below is the overview of this notebook.
<ol type = 1>
<li> Measure tweet sentiment</li>
<ol type = a>
<li> Load tweets from database</li>
<li> Load sentiment classifier from huggingface, which in this case is a BERT transformer</li>
<li>  Measure tweet sentiment with transformer</li>
<li> Calculate transformer embedding of tweet</li>

</ol>

<li> Analyze sentiment of tweets </li>
<ol type = a>

<li> Plot sentiment versus retweet counts</li>
<li> Visualize UMAP transformer tweet embeddings</li>
</ol>
</ol>

Below are some cool blogs you can read to learn more about the BERT transformer.

http://jalammar.github.io/illustrated-bert/

https://towardsdatascience.com/deconstructing-bert-distilling-6-patterns-from-100-million-parameters-b49113672f77

https://github.com/jessevig/bertviz

This notebook can be opened in Colab 
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/zlisto/social_media_analytics/blob/main/Lecture06_SentimentAnalysis.ipynb)

Before starting, select "Runtime->Factory reset runtime" to start with your directories and environment in the base state.

If you want to save changes to the notebook, select "File->Save a copy in Drive" from the top menu in Colab.  This will save the notebook in your Google Drive.

# Clones, installs, imports, and GPU

## Using a GPU

If we switch the runtime to a GPU (graphics processing unit), the neural network computations will run faster.  To do this, go to the top left menu and select **Runtime -> Change runtime type -> Hardware accelerator -> GPU**.  

The code below will tell you if your Colab runtime is using a GPU.

In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

## Clone GitHub Repository
This will clone the repository, which includes the code and data files, to your machine.  The code then changes into the repository's directory.

In [None]:
!git clone https://github.com/zlisto/social_media_analytics

import os
os.chdir("social_media_analytics")

## Install Requirements 


The main package we need today is `transformers` - this lets us use pre-trained transformer models. 

In [None]:
!pip install -r requirements.txt


## Import Packages

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import umap
import scripts.TextAnalysis as ta
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import codecs  #this lets us display tweets properly (emojis, etc.)
pd.set_option("display.max_colwidth", None)



# Sentiment Classification with BERT

We will pass the tweets through a pre-trained sentiment classifier with a BERT core.  Then we will plot the tweets with UMAP and color them by their sentiment.  Hopefully the positive and negative tweets fall in different regions of the plot.

### Download Pre-Trained Model and Tokenizer

We will download the model and tokenizer from https://huggingface.co/nlptown/bert-base-multilingual-uncased-sentiment.  This is a pre-trained model in the huggingface library that was trained on product reviews in multiple languages.  The output sentiment is between 1 and 5.

There are many other models on huggingface that you can find here: https://huggingface.co/models?pipeline_tag=text-classification.

We will create a tokenizer for the model called `tokenizer` and create the model itself, which we call `model`.  Every model needs its own tokenizer which tells it how to map text into the proper input vectors.
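
To make the tokenizer's role concrete, here is a toy sketch of mapping text to integer IDs.  The vocabulary `TOY_VOCAB`, its IDs, and the function `toy_encode` are hypothetical and far simpler than the WordPiece tokenizer that ships with the real model; only the general idea (a sequence of vocabulary indices wrapped in special start and end tokens) carries over.

```python
# Toy illustration only: a hypothetical word-level vocabulary, not the real BERT vocabulary
TOY_VOCAB = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "this": 3, "class": 4, "is": 5, "fun": 6}

def toy_encode(text, vocab=TOY_VOCAB):
    """Lowercase the text, split on whitespace, and map each word to an integer ID."""
    words = text.lower().split()
    # Words missing from the vocabulary map to the unknown token
    ids = [vocab.get(word, vocab["[UNK]"]) for word in words]
    # BERT-style models expect a start token and an end token around the text
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

print(toy_encode("This class is fun"))   # [0, 3, 4, 5, 6, 1]
print(toy_encode("This class is dope"))  # [0, 3, 4, 5, 2, 1]
```

The real tokenizer splits rare words into sub-word pieces instead of replacing them with an unknown token, which is one reason each model must be paired with its own tokenizer.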

In [None]:
%%time
tokenizer = AutoTokenizer.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("nlptown/bert-base-multilingual-uncased-sentiment")

### Define Sentiment Classifier and Transformer Embedding Function

When we pass text through our transformer model, we get many pieces of data in the output. First, we get the sentiment of the text.  Second, we get the text embedding in the final layer of the transformer.  Remember, inside the transformer we turn the input text into a high dimensional vector. This is the transformer embedding, and it is designed to separate text based on sentiment.


We will create a function called `sentiment_classifier` which takes as input a string `text`, a transformer model called `model`, and a tokenizer called `tokenizer`, and returns the `sentiment` and `embedding` of the text.  The raw output of the model is a score (logit) for each sentiment value from 1 to 5, which the function converts to a probability for each value.  The function returns the average sentiment under these probabilities.  
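
The averaging step can be sketched on its own, without the model.  The function name `expected_star_rating` and the logit values below are hypothetical, chosen only for illustration; the calculation is a softmax followed by a probability-weighted mean of the star values 1 through 5.

```python
import math

def expected_star_rating(logits):
    """Convert raw class scores (logits) to probabilities, then average the star values."""
    # Subtracting the max before exponentiating avoids overflow and leaves the result unchanged
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Class index k corresponds to a rating of k + 1 stars
    return sum((k + 1) * p for k, p in enumerate(probs))

# Hypothetical logits, for illustration only
uniform = expected_star_rating([0.0, 0.0, 0.0, 0.0, 0.0])
leaning_positive = expected_star_rating([-2.0, -1.0, 0.0, 1.0, 2.0])

print(f"{uniform:.2f}")        # 3.00 (all ratings equally likely)
print(leaning_positive > 3.0)  # True (more probability on high ratings)
```

Because the result is an average rather than the single most likely class, it can take any value between 1 and 5, not just whole numbers.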


In [None]:
def sentiment_classifier(text,model,tokenizer):
    #tokenize the text and return PyTorch tensors ('pt')
    inputs = tokenizer.encode_plus(text, return_tensors='pt', add_special_tokens=True)

    token_type_ids = inputs['token_type_ids']
    input_ids = inputs['input_ids']

    #run the model and ask for the hidden states of every layer
    output = model(input_ids, token_type_ids=token_type_ids,return_dict=True,output_hidden_states=True)
    logits = np.array(output.logits.tolist()[0])
    prob = np.exp(logits)/np.sum(np.exp(logits))  #softmax: convert logits to probabilities
    sentiment = np.sum([(x+1)*prob[x] for x in range(len(prob))])  #use this line if you want the mean score
    #final layer (index 12), first token ([CLS]) of the single input sequence
    embedding = output.hidden_states[12].detach().numpy().squeeze()[0]
    
    return sentiment,embedding


### Sentiment Classification Example
Now we can test the model on some text.  Feel free to try any text you want here.  Just put your text in the list `Text`.

In [None]:
Text = ["This class is kinda boring, but informative", 
        "This class is awesome",
        "this class is stupid",         
        "this class is dope", 
       "this class is crazy hard",
       "this class is fun",
       "this class is fun!",
       "this class is :(",
       "this class is :)"]

In [None]:
for text in Text:
    sentiment,_ = sentiment_classifier(text,model,tokenizer)
    print(f"Sentiment:{sentiment:.2f}\nText: {text}\n")

### Load Tweets 

Load the tweets from the file `"data/lec_06_tweets_sentiment_embedding.csv"` into a dataframe `df`.  You can use the `read_csv` function to do this.

In [None]:
df = pd.read_csv("data/lec_06_tweets_sentiment_embedding.csv")
ntweets = len(df)
print(f"dataframe has {ntweets} tweets")
df.sample(n=5)

### Calculate Sentiment and Transformer Embedding of Tweets

We use a `for` loop over the rows of `df` to calculate the sentiment and transformer embedding of each tweet.  We store the sentiments in the list `Sentiment` and the embeddings in the list `Embedding`.  We then add the sentiment to a column in `df`.  The speed of this code depends on whether you use a CPU or a GPU.  Approximate speeds are:

CPU = 9000 tweets/hour

GPU = 23000 tweets/hour
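
These throughput figures give a rough way to estimate the run time before starting the loop.  The helper `estimated_hours` and the dataset size below are hypothetical; substitute `len(df)` for your own data.

```python
def estimated_hours(n_tweets, tweets_per_hour):
    """Rough run time in hours at a given throughput."""
    return n_tweets / tweets_per_hour

n_tweets = 18000  # hypothetical dataset size, for illustration only
print(f"CPU: {estimated_hours(n_tweets, 9000):.1f} hours")   # CPU: 2.0 hours
print(f"GPU: {estimated_hours(n_tweets, 23000):.1f} hours")  # GPU: 0.8 hours
```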






In [None]:
%%time
c = 0
Sentiment = []
Embedding = []
for index,row in df.iterrows():  #iterate over rows of dataframe
    c+=1
    if c%1000==0:print(f"Tweet {c}/{len(df)}")  #print progress every 1000 rows

    sentiment,embedding = sentiment_classifier(row.text,model,tokenizer)  #calculate sentiment and embedding of tweet
    Sentiment.append(sentiment)  #append sentiment of tweet to Sentiment list
    Embedding.append(embedding) #append embedding of tweet to Embedding list

df['sentiment'] = Sentiment  #add sentiment column to dataframe of tweets
df.head()
    

#### Save sentiments


You can save the resulting dataframe to a file with the `to_csv` function.  We will set `fname_sentiment` equal to the path and filename in Google Drive where you want to save the data.  Follow the code below to mount your Google Drive into your Colab environment.

In [None]:
from google.colab import drive
drive.mount('/content/social_media_analytics/drive')



In [None]:
#uncomment the lines below to save tweets, sentiment, and embedding to a csv file
#fname_sentiment = "/content/social_media_analytics/drive/MyDrive/MGT 575/data/lec_06_tweets_sentiment_v1.csv"
#df.to_csv(fname_sentiment)


### UMAP Transformer Embedding of Tweets

The transformer turns the input text into a high-dimensional vector.  This is the transformer embedding, and it is designed to separate text based on sentiment.  However, we can't really visualize such a high-dimensional object.  But no worries, UMAP will let us embed this high-dimensional vector into 2 dimensions. 

We apply UMAP to the transformer embedding `Embedding` to create the UMAP transformer embedding `umap_bert_embedding`.  Before doing this we have to convert `Embedding` from a list to an array.  Then we save the UMAP embedding values in `df` as `"umap_transformer_x"` and `"umap_transformer_y"`.  The dataframe is saved to the file `"data/lec_06_tweets_sentiment_embedding.csv"`.
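
The UMAP call uses `metric='cosine'`, so two tweets count as close when their embedding vectors point in similar directions, regardless of vector length.  Below is a minimal sketch of that distance, assuming the usual definition of one minus the cosine similarity.  The function `cosine_distance` and the 2-dimensional vectors are hypothetical stand-ins for the much longer transformer embeddings.

```python
import math

def cosine_distance(u, v):
    """One minus the cosine similarity: 0 for the same direction, 1 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical 2-dimensional vectors, for illustration only
a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice the length
c = [4.0, -3.0]  # orthogonal to a (dot product is 0)

print(cosine_distance(a, b))  # 0.0
print(cosine_distance(a, c))  # 1.0
```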



In [None]:
Embedding = np.array(Embedding)
umap_bert_embedding = umap.UMAP(n_components=2, metric='cosine').fit_transform(Embedding)
df['umap_transformer_x'] = umap_bert_embedding[:,0]
df['umap_transformer_y'] = umap_bert_embedding[:,1]


#### Save UMAP Embeddings

You can save the resulting dataframe to a file with the `to_csv` function.  Choose your Google Drive as the saving location. 

In [None]:
#uncomment the lines below to save tweets, sentiment, and embedding to a csv file
#fname_sentiment = "/content/social_media_analytics/drive/MyDrive/MGT 575/data/lec_06_tweets_sentiment_embedding_v1.csv"
#df.to_csv(fname_sentiment)

# Analyze Tweet Sentiment

### Load Tweets and Sentiment

Once we save the tweet sentiments, the next time we run the notebook we can just load this data instead of recalculating the sentiment.  The sentiment is in the file `"data/lec_06_tweets_sentiment_embedding.csv"`.

In [None]:
df = pd.read_csv("data/lec_06_tweets_sentiment_embedding.csv")

### Average User Sentiment

We make a bar plot of the average sentiment of each user.  We do this using the `barplot` function.

In [None]:
fig = plt.figure(figsize = (12,6))
sns.barplot(data = df, x= 'sentiment', y = 'screen_name')
plt.xlim([2.5,4])
plt.xlabel("Mean sentiment",fontsize = 16)
plt.ylabel("Screen name",fontsize = 16)
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14)
plt.title("Sentiment vs Screen Name",fontsize = 20)
plt.grid()

### Sentiment Distribution per User

We can make histograms of the tweet sentiment for each user.  We use a `for` loop to iterate through each screen name.
We use the `histplot` function to make a histogram of the tweets of each user.  We also add a title to each plot so we know whose tweets they are.

In [None]:
for screen_name in df.screen_name.unique():
    df_plot = df[df.screen_name==screen_name]
    sns.histplot(data=df_plot, x = "sentiment")
    plt.title(f"Tweets of {screen_name}")
    plt.show()

### Look at Tweets with Extreme Sentiment

We can select tweets of each user with very high or very low sentiment and print them out.  We do this by keeping the rows of `df` with the corresponding screen name, and then using `sort_values` to sort the user's tweets by sentiment.  We set `ascending = False` inside the `for` loop to get the most positive tweets, and set `ascending = True` to get the most negative tweets.  We set `ndisplay` equal to the number of tweets we want to print per user.
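
The effect of the sort direction can be checked on a small example.  The sketch below uses plain Python lists instead of a dataframe, and the rows in `tweets` are hypothetical; `reverse=True` in `sorted` plays the role of `ascending = False` in `sort_values`.

```python
# Hypothetical rows standing in for the tweets of a single user
tweets = [
    {"text": "tweet a", "sentiment": 3.1},
    {"text": "tweet b", "sentiment": 4.7},
    {"text": "tweet c", "sentiment": 1.4},
    {"text": "tweet d", "sentiment": 2.2},
]
ndisplay = 2

# Descending order (like ascending=False) puts the most positive tweets first
most_positive = sorted(tweets, key=lambda t: t["sentiment"], reverse=True)[:ndisplay]
# Ascending order (like ascending=True) puts the most negative tweets first
most_negative = sorted(tweets, key=lambda t: t["sentiment"])[:ndisplay]

print([t["text"] for t in most_positive])  # ['tweet b', 'tweet a']
print([t["text"] for t in most_negative])  # ['tweet c', 'tweet d']
```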

To show all the funny Twitter characters, we need to use the `decode` function in the `codecs` module.
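
As a small illustration of what `codecs.decode` does with the `'unicode_escape'` codec, the sketch below decodes a string that contains literal backslash escapes.  The example string is hypothetical, and this assumes the text is plain ASCII plus escapes; applying the same call to text that already contains non-ASCII characters can garble them.

```python
import codecs

# A string holding literal backslash escapes rather than the characters themselves
raw = "caf\\u00e9 \\u2764"
print(raw)      # caf\u00e9 \u2764

# Interpret the escapes and turn them into the actual characters
decoded = codecs.decode(raw, "unicode_escape")
print(decoded)  # café ❤
```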



In [None]:
ndisplay = 3

print(f"Top {ndisplay} Most Positive Tweets per Screen Name")
for screen_name in df.screen_name.unique():
    df_display = df[(df.screen_name==screen_name)].sort_values(by = ['sentiment'], ascending = False)
    c=0
    print(f"\n{screen_name}")
    for index,row in df_display.iterrows():
        c+=1
        text = codecs.decode(row.text, 'unicode_escape')
        print(f"\tsentiment = {row.sentiment:.2f}: {text}")
        if c>=ndisplay:break



In [None]:
print(f"\nTop {ndisplay} Most Negative Tweets per Screen Name")
for screen_name in df.screen_name.unique():
    df_display = df[(df.screen_name==screen_name)].sort_values(by = ['sentiment'], ascending = True)
    c=0
    print(f"\n{screen_name}")
    for index,row in df_display.iterrows():
        c+=1
        text = codecs.decode(row.text, 'unicode_escape')
        print(f"\tsentiment = {row.sentiment:.2f}: {text}")
        if c>=ndisplay:break

### Plot of Retweet Count vs Sentiment

To see how retweet count relates to sentiment, and whether tweets with extreme sentiment get more retweets, we can use `barplot`.

We first create a column called `star` that rounds the sentiment to the nearest integer with the `round` function.  Then we can make a plot of the retweet count versus the tweet star sentiment.  
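
The sketch below shows in plain Python what this plot summarizes: each tweet is assigned to a star bucket by rounding, and the bar height is the mean retweet count in each bucket, which is the default estimate that `barplot` draws.  The `(sentiment, retweet_count)` pairs are hypothetical, for illustration only.

```python
from collections import defaultdict

# Hypothetical (sentiment, retweet_count) pairs, for illustration only
tweets = [(1.2, 40), (1.4, 60), (2.6, 10), (3.4, 20), (4.8, 90), (4.6, 110)]

retweets_by_star = defaultdict(list)
for sentiment, retweet_count in tweets:
    star = round(sentiment)  # nearest integer; exact .5 values go to the even neighbor
    retweets_by_star[star].append(retweet_count)

# Mean retweet count per star value
mean_by_star = {star: sum(c) / len(c) for star, c in sorted(retweets_by_star.items())}
print(mean_by_star)  # {1: 50.0, 3: 15.0, 5: 100.0}
```

Star values with no tweets, such as 2 and 4 in this toy data, simply do not appear as bars.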

In [None]:
df['star'] = df.sentiment.round()

fig = plt.figure(figsize = (12,8))
ax = sns.barplot(data=df, x="star", y="retweet_count")
plt.xlabel("Sentiment", fontsize = 16)
plt.ylabel("Retweet Count", fontsize = 16)
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14)

plt.grid()
plt.show()

### Plot Retweet Count vs Sentiment per User

We can also look at a plot for each individual user.  

In [None]:
for screen_name in df.screen_name.unique():
  df_plot = df[df.screen_name==screen_name]
  fig = plt.figure(figsize=(12,6))
  sns.barplot(data = df_plot, x = 'star', y = 'retweet_count')
  plt.xlabel("Sentiment", fontsize = 16)
  plt.ylabel("Retweet Count", fontsize = 16)
  plt.title(f"{screen_name}", fontsize = 20)
  plt.xticks(fontsize = 14)
  plt.yticks(fontsize = 14)
  plt.grid()
  plt.show()


# Visualize Transformer Embedding

Now we will visualize the transformer embeddings using UMAP to see how the sentiment is distributed.

### Scatter Plot of UMAP Transformer Tweet Embeddings

We can make a scatter plot of the UMAP transformer embeddings of the tweets.  We will color the data points by the user screen name.  We will also make another plot next to this plot where we color the data points by sentiment.  You set the column for the data point color with the `hue` parameter.  You can choose a color palette with the `palette` parameter.  There are many palettes you can choose from, but for discrete values like `"screen_name"` use the `"bright"` palette and for continuous values like `"sentiment"` use the `"vlag"` palette.  Of course, feel free to try other palettes.  A complete list can be found here: https://seaborn.pydata.org/tutorial/color_palettes.html


In [None]:
fig = plt.figure(figsize = (16,8))
ax1 = plt.subplot(1,2,1)
sns.scatterplot(data=df, x="umap_transformer_x", y="umap_transformer_y", hue="screen_name", palette="bright", s=5)
plt.title("UMAP Transformer Embedding")

ax2 = plt.subplot(1,2,2)
sns.scatterplot(data=df, x="umap_transformer_x", y="umap_transformer_y", hue="sentiment", 
               palette="vlag", s=15)
plt.title("UMAP Transformer Embedding")
plt.show()