# Sentiment Analysis of Twitter Text

In today’s world, Twitter provides people with a way to publicly express their thoughts on any given subject in a concise, condensed format. This allows us to use tweets as a way to predict users’ thoughts or feelings on a certain subject.

Since the 2016 U.S. election, the influence of social media on society has become a growing concern. Fake news, hate speech, polarization, and echo chambers have drawn increasing scholarly attention to online discussion. Understanding the sentiment of social media content is crucial to further analysis.

In this project, we compare two models on how well they classify the sentiment of a tweet.

## Load data and pre-processing

In [42]:
# import your libraries here
import pandas as pd
import nltk
import re
# from nltk.stem import SnowballStemmer
# from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/vikramc18/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [43]:
replacement_patterns = [
  (r'won\'t', 'will not'),
  (r'can\'t', 'cannot'),
  (r'i\'m', 'i am'),
  (r'ain\'t', 'is not'),
  (r'(\w+)\'ll', r'\g<1> will'),
  (r'(\w+)n\'t', r'\g<1> not'),
  (r'(\w+)\'ve', r'\g<1> have'),
  (r'(\w+)\'s', r'\g<1> is'),
  (r'(\w+)\'re', r'\g<1> are'),
  (r'(\w+)\'d', r'\g<1> would')
]

patterns = [(re.compile(regex), repl) for (regex, repl) in replacement_patterns]

def replace(text):
    s = text
    for (pattern, repl) in patterns:
        s = re.sub(pattern, repl, s)
    return s

TOKEN_RE = re.compile(r"\w.*?\b")
def process_text(text):
    """
    Split the paragraph into sentences, expand contractions, and tokenize.
    To keep the nuance of social media, word forms are kept as written
    (no stemming or lemmatization); tokens are lowercased with casefold().
    """
    sent_text = nltk.sent_tokenize(text) # this gives us a list of sentences
    
    # now loop over each sentence and tokenize it separately
    s = []
    for sentence in sent_text:
        # expand contractions with the regular expressions above
        sentence = replace(sentence)
        # tokenize the sentence (not the full text, which would repeat tokens)
        tokenized_text = [token.casefold() for token in TOKEN_RE.findall(sentence)]

        s = s + tokenized_text
    return s

def process_data(series):
    # returns text in this format:
    # data = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
    # 			['this', 'is', 'the', 'second', 'sentence'],
    # 			['yet', 'another', 'sentence'],
    # 			['one', 'more', 'sentence'],
    # 			['and', 'the', 'final', 'sentence']]
    tweets = []
    for _,row in series.items():
        tweets.append(process_text(str(row)))
    
    return tweets
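
The tweets shown in this dataset appear to use a backtick where an apostrophe would normally go (for example ``I`d`` and ``couldn`t``), so the contraction patterns above may never match them. The following is a minimal sketch, assuming it is acceptable to map backticks to apostrophes before expanding contractions. The helper names `normalize_apostrophes`, `expand_contractions`, and `tokenize` are hypothetical, only a subset of the patterns is repeated, and the sketch uses `re` alone so it runs without NLTK data.

```python
import re

# A subset of the contraction patterns from the cell above, as raw strings.
CONTRACTIONS = [
    (re.compile(r"won't"), "will not"),
    (re.compile(r"can't"), "cannot"),
    (re.compile(r"(\w+)n't"), r"\g<1> not"),
    (re.compile(r"(\w+)'d"), r"\g<1> would"),
]
TOKEN_RE = re.compile(r"\w.*?\b")

def normalize_apostrophes(text):
    # Assumption: a backtick or curly quote inside a tweet stands for an apostrophe.
    return text.replace("`", "'").replace("\u2019", "'")

def expand_contractions(text):
    # Apply each pattern in order; specific forms come before the generic ones.
    for pattern, repl in CONTRACTIONS:
        text = pattern.sub(repl, text)
    return text

def tokenize(text):
    text = expand_contractions(normalize_apostrophes(text))
    return [token.casefold() for token in TOKEN_RE.findall(text)]

tokens = tokenize("I`d have responded, but I couldn`t")
print(tokens)
```

Note that `TOKEN_RE` only starts a token at a word character, so punctuation such as the comma is dropped from the token list.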

In [44]:
df_tweet = pd.read_csv('data/Tweets.csv')

In [45]:
df_tweet

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative
...,...,...,...,...
27476,4eac33d1c0,wish we could come see u on Denver husband l...,d lost,negative
27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,", don`t force",negative
27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,Yay good for both of you.,positive
27479,ed167662a5,But it was worth it ****.,But it was worth it ****.,positive


In [5]:
tweets = process_data(df_tweet['text'])

In [6]:
len(tweets)

27481

In [7]:
# From notebook 11
def load_lexicon(filename):
    """
    Load a file from Bing Liu's sentiment lexicon
    (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), containing
    English words in Latin-1 encoding.
    
    One file contains a list of positive words, and the other contains
    a list of negative words. The files contain comment lines starting
    with ';' and blank lines, which should be skipped.
    """
    lexicon = []
    with open(filename, encoding='latin-1') as infile:
        for line in infile:
            line = line.rstrip()
            if line and not line.startswith(';'):
                lexicon.append(line)
    return lexicon

pos_words = load_lexicon('data/positive-words.txt')
neg_words = load_lexicon('data/negative-words.txt')
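
To check the loader's skipping behaviour without the real lexicon files, here is a small sketch that writes a throwaway Latin-1 file and loads it. The sample contents are made up for illustration and are not taken from Bing Liu's lexicon; `load_lexicon` is repeated so the block runs on its own.

```python
import os
import tempfile

def load_lexicon(filename):
    # Same logic as the loader above: skip blank lines and ';' comments.
    lexicon = []
    with open(filename, encoding='latin-1') as infile:
        for line in infile:
            line = line.rstrip()
            if line and not line.startswith(';'):
                lexicon.append(line)
    return lexicon

# Hypothetical file contents: two comment lines, a blank line, three words.
sample = "; comment header\n;\n\nabound\nna\xefve\nwell-made\n"
with tempfile.NamedTemporaryFile('w', encoding='latin-1', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    words = load_lexicon(path)
finally:
    os.remove(path)  # clean up the temporary file

print(words)
```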

## Train the embeddings

We currently use pre-trained GloVe vectors; this choice can be changed later.

In [8]:
import numpy as np

In [9]:
# From notebook 11
def load_embeddings(filename):
    """
    Load a DataFrame from the generalized text format used by word2vec, GloVe,
    fastText, and ConceptNet Numberbatch. The main point where they differ is
    whether there is an initial line with the dimensions of the matrix.
    """
    labels = []
    rows = []
    with open(filename, encoding='utf-8') as infile:
        for i, line in enumerate(infile):
            items = line.rstrip().split(' ')
            if len(items) == 2:
                # This is a header row giving the shape of the matrix
                continue
            labels.append(items[0])
            values = np.array([float(x) for x in items[1:]], 'f')
            rows.append(values)
    
    arr = np.vstack(rows)
    return pd.DataFrame(arr, index=labels, dtype='f')

# for better performance, use the 42B data https://nlp.stanford.edu/data/glove.42B.300d.zip
embeddings = load_embeddings('data/glove.6B.50d.txt')
embeddings.shape

FileNotFoundError: [Errno 2] No such file or directory: 'data/glove.6B.50d.txt'
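
The `FileNotFoundError` indicates that the GloVe file was not present at `data/glove.6B.50d.txt` when the cell ran. To illustrate the text format the loader expects independently of that download, the sketch below parses a tiny in-memory sample. `parse_embeddings` is a hypothetical variant of `load_embeddings` that takes an open file object, and the three-dimensional vectors are invented.

```python
import io
import numpy as np
import pandas as pd

def parse_embeddings(infile):
    # Same parsing logic as load_embeddings, but reading from a file object.
    labels, rows = [], []
    for line in infile:
        items = line.rstrip().split(' ')
        if len(items) == 2:
            # word2vec-style header row giving the matrix shape; skip it
            continue
        labels.append(items[0])
        rows.append(np.array([float(x) for x in items[1:]], dtype='f'))
    return pd.DataFrame(np.vstack(rows), index=labels, dtype='f')

# Invented sample: a header row followed by three 3-dimensional vectors.
sample = io.StringIO(
    "3 3\n"
    "good 0.1 0.2 0.3\n"
    "bad -0.1 -0.2 -0.3\n"
    "okay 0.0 0.1 0.0\n"
)
toy_embeddings = parse_embeddings(sample)
print(toy_embeddings.shape)
```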

In [None]:
pos_vectors = embeddings.loc[embeddings.index.isin(pos_words)].dropna()
neg_vectors = embeddings.loc[embeddings.index.isin(neg_words)].dropna()

In [None]:
vectors = pd.concat([pos_vectors, neg_vectors])
targets = np.array([1 for entry in pos_vectors.index] + [-1 for entry in neg_vectors.index])
labels = list(pos_vectors.index) + list(neg_vectors.index)

## BERTweet
https://huggingface.co/docs/transformers/model_doc/bertweet

In [10]:
from transformers import AutoModel, AutoTokenizer

In [11]:
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

Some weights of the model checkpoint at vinai/bertweet-base were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.decoder.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
emoji is not installed, thus not converting emoticons or emojis into text. Please install emoji: pip3 install emoji
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Data Cleaning
- remove tweets classified as 'neutral' so that we can perform binary classification
- remove non-string tweets
    - possibly just map these to strings?

In [46]:
# https://stackoverflow.com/questions/39275533/select-row-from-a-dataframe-based-on-the-type-of-the-objecti-e-str
# df[df['A'].apply(lambda x: isinstance(x, str))]
df_tweet_bert = df_tweet[df_tweet['text'].apply(lambda x: isinstance(x, str))].reset_index()
df_tweet_bert = df_tweet_bert[df_tweet_bert['sentiment'] != 'neutral'].reset_index()
#df_tweet_bert = df_tweet.loc[type(df_tweet['text']) == str]
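
Calling `reset_index()` twice without `drop=True` is what produces the extra `level_0` and `index` columns in the output that follows. As a hedged alternative, the sketch below applies both filters at once on a made-up four-row frame and resets the index a single time.

```python
import numpy as np
import pandas as pd

# Made-up rows: one missing text and two neutral labels.
toy = pd.DataFrame({
    'text': ['great day', np.nan, 'so tired', 'meh'],
    'sentiment': ['positive', 'neutral', 'negative', 'neutral'],
})

is_text = toy['text'].apply(lambda x: isinstance(x, str))
is_polar = toy['sentiment'] != 'neutral'

# drop=True discards the old index instead of saving it as a new column.
cleaned = toy[is_text & is_polar].reset_index(drop=True)
print(cleaned)
```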

In [48]:
df_tweet_bert

Unnamed: 0,level_0,index,textID,text,selected_text,sentiment
0,1,1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
1,2,2,088c60f138,my boss is bullying me...,bullying me,negative
2,3,3,9642c003ef,what interview! leave me alone,leave me alone,negative
3,4,4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative
4,6,6,6e0c6d75b1,2am feedings for the baby are fun when he is a...,fun,positive
...,...,...,...,...,...,...
16358,27474,27475,b78ec00df5,enjoy ur night,enjoy,positive
16359,27475,27476,4eac33d1c0,wish we could come see u on Denver husband l...,d lost,negative
16360,27476,27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,", don`t force",negative
16361,27477,27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,Yay good for both of you.,positive


In [50]:
# def normalize_encode_tweet(tweet):
#     norm = tokenizer.normalizeTweet(tweet)
#     encoded = tokenizer.encode(norm)
#     return encoded

from sentence_transformers import SentenceTransformer
roberta_model = SentenceTransformer('paraphrase-distilroberta-base-v1')
def normalize_encode_tweet(tweet):
    norm = tokenizer.normalizeTweet(tweet)
    encoded = roberta_model.encode(norm)
    return encoded

In [32]:
# https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/
df_tweet_bert['embedding'] =  df_tweet_bert.apply(lambda row: normalize_encode_tweet(row.text), axis=1)

KeyboardInterrupt: 

In [51]:
from tqdm import tqdm

# show progress
tqdm.pandas()

# https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/
df_tweet_bert['embedding'] =  df_tweet_bert.progress_apply(lambda row: normalize_encode_tweet(row.text), axis=1)

100%|██████████| 16363/16363 [12:04<00:00, 22.57it/s]
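
The row-by-row `apply` above took about 12 minutes for 16,363 tweets. `SentenceTransformer.encode` also accepts a list of sentences, so one possible alternative is to encode in batches. The sketch below takes the encoder as a callable so it can run without downloading a model; `embed_in_batches` and `fake_encode` are hypothetical names, and `fake_encode` is only a stand-in for the real model.

```python
import numpy as np

def embed_in_batches(texts, encode_fn, batch_size=64):
    # Encode consecutive slices of the input and stack them into one matrix.
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(encode_fn(batch)))
    return np.vstack(chunks)

def fake_encode(batch):
    # Stand-in encoder: [character count, word count] for each text.
    return np.array([[len(t), len(t.split())] for t in batch], dtype='f')

texts = ['enjoy ur night', 'my boss is bullying me...',
         'what interview! leave me alone']
matrix = embed_in_batches(texts, fake_encode, batch_size=2)
print(matrix.shape)
```

With the real model, passing `roberta_model.encode` as `encode_fn` and a list of normalized tweets should produce one fixed-length row per tweet. `encode` also has its own `batch_size` argument, so handing it the whole list directly is another option.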


In [52]:
df_tweet_bert.head()

Unnamed: 0,level_0,index,textID,text,selected_text,sentiment,embedding
0,1,1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,"[0.093087755, 0.44367677, 0.11050558, -0.34251..."
1,2,2,088c60f138,my boss is bullying me...,bullying me,negative,"[-0.220892, -0.028724447, 0.1460157, -0.145213..."
2,3,3,9642c003ef,what interview! leave me alone,leave me alone,negative,"[0.011180244, -0.42562425, 0.10249197, -0.4529..."
3,4,4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,"[0.17745222, 0.28441083, 0.059978485, 0.294703..."
4,6,6,6e0c6d75b1,2am feedings for the baby are fun when he is a...,fun,positive,"[-0.10432586, 0.26830515, -0.15316525, -0.0983..."


In [53]:
df_tweet_bert.to_csv('tweet_roberta_embeddings.csv')
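
One caveat with this step: `to_csv` writes a column of NumPy arrays as their string representation, so reading the file back gives strings rather than vectors. A possible workaround, sketched here on made-up three-dimensional vectors, is to expand the embedding into one numeric column per dimension before saving. The `emb_0`, `emb_1`, ... column names are hypothetical.

```python
import io
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'textID': ['a1', 'b2'],
    'sentiment': ['negative', 'positive'],
    'embedding': [np.array([0.5, -1.0, 0.25]), np.array([1.5, 0.0, -0.75])],
})

# Stack the per-row arrays into a 2-D matrix, then give each dimension a column.
matrix = np.vstack(list(toy['embedding']))
emb_cols = [f'emb_{i}' for i in range(matrix.shape[1])]
flat = pd.concat(
    [toy.drop(columns='embedding'), pd.DataFrame(matrix, columns=emb_cols)],
    axis=1,
)

# Round-trip through CSV (an in-memory buffer stands in for a file on disk).
buffer = io.StringIO()
flat.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
restored_matrix = restored[emb_cols].to_numpy()
print(restored_matrix.shape)
```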

In [17]:
temp = df_tweet_bert['embedding'].apply(len)

In [18]:
max(temp)

65

In [19]:
min(temp)

3

In [20]:
import statsmodels.formula.api

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [21]:
# test train split
X = df_tweet_bert['embedding']
y = df_tweet_bert['sentiment']

# train + test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1)
# re-split train to have training, validation, testing sets
# https://datascience.stackexchange.com/questions/15135/train-test-validation-set-splitting-in-sklearn
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.25, random_state=1) # 0.2/0.8 = 0.25

# train = 60%, val = 20%, test = 20% of original data
# TODO: need higher proportion of training data?
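
The second split uses `test_size=0.25` because a quarter of the remaining 80% equals 20% of the original data. A small sketch on 100 synthetic rows makes the resulting 60/20/20 proportions easy to verify; the data here is made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with alternating binary labels.
X = np.arange(100).reshape(-1, 1)
y = np.array([0, 1] * 50)

# First split: hold out 20% as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
# Second split: 25% of the remaining 80% becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))
```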

In [22]:
len(X_train)

9817

In [23]:
len(X_val)

3273

In [24]:
len(X_test)

3273

## Train logistic regression model

In [25]:
len(X_train[0]) == len(X_train[2])

False

In [26]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(random_state=1)
clf_log_reg = log_reg.fit(X_train, y_train)

ValueError: setting an array element with a sequence.
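
The `ValueError` most likely arises because `X_train` is a pandas Series whose elements are themselves arrays, while scikit-learn expects a 2-D numeric matrix. The length check above also shows that rows differed in length at that point, which would prevent stacking. The sketch below assumes fixed-length sentence embeddings instead and uses synthetic four-dimensional vectors to show the stacking step before fitting.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 200
labels = np.array(['positive', 'negative'] * (n // 2))
# Synthetic fixed-length "embeddings": positives centred at +2, negatives at -2.
vectors = [rng.normal(loc=2.0 if lab == 'positive' else -2.0, size=4)
           for lab in labels]
df = pd.DataFrame({'embedding': vectors, 'sentiment': labels})

# Turn the Series of arrays into an (n_samples, n_features) matrix.
X = np.vstack(list(df['embedding']))
y = df['sentiment'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
clf = LogisticRegression(random_state=1, max_iter=1000).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(X.shape, accuracy)
```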

In [None]:
# train_vectors, test_vectors, train_targets, test_targets, train_labels, test_labels = \
#     train_test_split(vectors, targets, labels, test_size=0.1, random_state=0)

In [None]:
# model = SGDClassifier(loss='log', random_state=0, max_iter=100)

In [None]:
# for i in range(100):
#     model.fit(train_vectors, train_targets)

In [None]:
# def vecs_to_sentiment(vecs):
#     # predict_log_proba gives the log probability for each class
#     predictions = model.predict_log_proba(vecs)

#     # To see an overall positive vs. negative classification in one number,
#     # we take the log probability of positive sentiment minus the log
#     # probability of negative sentiment and return it.

#     return [p[1]-p[0] for p in predictions]

# def words_to_sentiment(words):
#     vecs = embeddings.loc[embeddings.index.isin(words)].dropna()
#     # if word is in embedding list
#     if vecs.empty:
#         log_odds = 0
#     else:
#         log_odds = vecs_to_sentiment(vecs)
        
#     return pd.DataFrame({'sentiment': log_odds}, index=vecs.index)

# def text_to_sentiment(tweet_tok):
#     # Compute sentiments for all the tokens and return the mean value
    
#     sentiment = words_to_sentiment(tweet_tok)
#     return sentiment['sentiment'].mean()
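
The commented-out cells above outline a word-level classifier trained on lexicon word vectors, with a tweet scored by the mean log-odds of its known tokens. Below is a minimal sketch of that idea under stated assumptions: the two-dimensional embeddings and word lists are invented, and `LogisticRegression` is swapped in for `SGDClassifier` to avoid depending on the loss name, which is spelled `'log_loss'` rather than `'log'` in recent scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Invented 2-d "embeddings"; real GloVe vectors have 50 or more dimensions.
embeddings = pd.DataFrame(
    [[1.0, 0.2], [0.9, -0.1], [0.8, 0.1],
     [-1.0, 0.1], [-0.9, -0.2], [-0.8, 0.0]],
    index=['good', 'great', 'happy', 'bad', 'awful', 'sad'],
)
pos_words = ['good', 'great']
neg_words = ['bad', 'awful']

pos_vectors = embeddings.loc[embeddings.index.isin(pos_words)]
neg_vectors = embeddings.loc[embeddings.index.isin(neg_words)]
vectors = pd.concat([pos_vectors, neg_vectors])
targets = np.array([1] * len(pos_vectors) + [-1] * len(neg_vectors))

model = LogisticRegression().fit(vectors.to_numpy(), targets)

def text_to_sentiment(tokens):
    # Mean log-odds (positive minus negative) over tokens with a known vector.
    vecs = embeddings.loc[embeddings.index.isin(tokens)]
    if vecs.empty:
        return 0.0
    log_proba = model.predict_log_proba(vecs.to_numpy())
    # classes_ is sorted, so column 0 is -1 (negative) and column 1 is +1.
    return float(np.mean(log_proba[:, 1] - log_proba[:, 0]))

happy_score = text_to_sentiment(['so', 'happy'])
sad_score = text_to_sentiment(['so', 'sad'])
unknown_score = text_to_sentiment(['zzz'])
print(happy_score, sad_score, unknown_score)
```

Returning `0.0` when no token has a vector is a design choice of this sketch; it lets a downstream sign test fall through to a neutral label.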

### Predict label using logistic classification

In [None]:
def log_predict(tweets):
    """
    Label each tokenized tweet by the sign of its mean sentiment score.
    Relies on text_to_sentiment, which is currently commented out above.
    """
    log_labels = []
    for tweet in tweets:
        score = text_to_sentiment(tweet)
        if score > 0:
            log_labels.append('positive')
        elif score < 0:
            log_labels.append('negative')
        else:
            log_labels.append('neutral')
    return log_labels

In [None]:
tweet_sentiment = log_predict(tweets)

In [None]:
len(tweet_sentiment)

## Evaluation

In [None]:
def precision(gold_labels, predicted_labels, positive_label='positive'):
    """
    Calculates the precision for a set of predicted labels given the gold (ground truth) labels.
    Parameters:
          gold_labels (list): a list of labels assigned by hand ("truth")
          predicted_labels (list): a corresponding list of labels predicted by the system
          positive_label (str): the label treated as the positive class
    Returns: double precision (a number from 0 to 1)
    """
    TP, FP = 0, 0
    # iterate over values so this also works for a pandas Series with any index
    for gold, predicted in zip(gold_labels, predicted_labels):
        if predicted == positive_label:
            if gold == positive_label:
                TP += 1
            else:
                FP += 1
    return TP / (TP + FP) if (TP + FP) else 0


def recall(gold_labels, predicted_labels, positive_label='positive'):
    """
    Calculates the recall for a set of predicted labels given the gold (ground truth) labels.
    Parameters:
      gold_labels (list): a list of labels assigned by hand ("truth")
      predicted_labels (list): a corresponding list of labels predicted by the system
      positive_label (str): the label treated as the positive class
    Returns: double recall (a number from 0 to 1)
    """
    TP, FN = 0, 0
    # a false negative is a gold positive that the system labeled as anything else
    for gold, predicted in zip(gold_labels, predicted_labels):
        if gold == positive_label:
            if predicted == positive_label:
                TP += 1
            else:
                FN += 1
    return TP / (TP + FN) if (TP + FN) else 0


def f1(gold_labels, predicted_labels):
    """
    Calculates the f1 for a set of predicted labels given the gold (ground truth) labels.
    Parameters:
      gold_labels (list): a list of labels assigned by hand ("truth")
      predicted_labels (list): a corresponding list of labels predicted by the system
    Returns: double f1 (a number from 0 to 1)
    """
    P = precision(gold_labels, predicted_labels)
    R = recall(gold_labels, predicted_labels)
    return 2 * P * R / (P + R) if (P + R) else 0
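
To make the counting concrete for string labels, here is a worked example that treats `'positive'` as the target class on six made-up tweets and cross-checks the hand counts against scikit-learn. The gold and predicted lists are invented for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

gold = ['positive', 'negative', 'positive', 'neutral', 'positive', 'negative']
pred = ['positive', 'positive', 'negative', 'neutral', 'positive', 'neutral']

pairs = list(zip(gold, pred))
tp = sum(g == 'positive' and p == 'positive' for g, p in pairs)  # hits
fp = sum(g != 'positive' and p == 'positive' for g, p in pairs)  # false alarms
fn = sum(g == 'positive' and p != 'positive' for g, p in pairs)  # misses

manual_p = tp / (tp + fp)
manual_r = tp / (tp + fn)
manual_f1 = 2 * manual_p * manual_r / (manual_p + manual_r)

# Restricting `labels` to one class makes the macro average that class's score.
kwargs = dict(labels=['positive'], average='macro', zero_division=0)
sk_p = precision_score(gold, pred, **kwargs)
sk_r = recall_score(gold, pred, **kwargs)
sk_f1 = f1_score(gold, pred, **kwargs)

print(tp, fp, fn)
print(manual_p, manual_r, manual_f1)
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all come out to 2/3.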


In [None]:
len(tweet_sentiment)

In [None]:
len(df_tweet["sentiment"])

In [None]:
log_p = precision(df_tweet["sentiment"], tweet_sentiment)
log_r = recall(df_tweet["sentiment"], tweet_sentiment)
log_f1 = f1(df_tweet["sentiment"], tweet_sentiment)

In [None]:
log_p, log_r,log_f1

## Citation

Download the lexicon from http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar and extract it into `data/positive-words.txt` and `data/negative-words.txt`.

The pre-processing steps are inspired by https://towardsdatascience.com/text-normalization-for-natural-language-processing-nlp-70a314bfa646.

We also pre-processed data so that it begins with `<s>` tokens (and ends with `</s>` tokens). Inspired by this answer: https://stackoverflow.com/questions/37605710/tokenize-a-paragraph-into-sentence-and-then-into-words-in-nltk

The regular expressions that expand contractions are adapted from https://gist.github.com/yamanahlawat/4443c6e9e65e74829dbb6b47dd81764a