# Sentiment Analysis of Twitter Text

In today’s world, Twitter provides people with a way to publicly express their thoughts on any given subject in a concise, condensed format. This allows us to use tweets as a way to predict users’ thoughts or feelings on a certain subject.

Since the 2016 U.S. election, the influence of social media on society has drawn growing concern. Fake news, hate speech, polarization, and echo chambers have attracted increasing scholarly attention to online discussion. Understanding the sentiment of social media content is crucial to further analysis.

In this project, we compare two models on how well they classify the sentiment of a tweet.

## Load data and pre-processing

In [1]:
# import your libraries here
import pandas as pd
import nltk
import re
# from nltk.stem import SnowballStemmer
# from nltk.stem.wordnet import WordNetLemmatizer
# import nltk
# nltk.download('wordnet')

In [2]:
replacement_patterns = [
  (r'won\'t', 'will not'),
  (r'can\'t', 'cannot'),
  (r'i\'m', 'i am'),
  (r'ain\'t', 'is not'),
  # raw strings keep '\g<1>' from being read as an invalid string escape
  (r'(\w+)\'ll', r'\g<1> will'),
  (r'(\w+)n\'t', r'\g<1> not'),
  (r'(\w+)\'ve', r'\g<1> have'),
  (r'(\w+)\'s', r'\g<1> is'),
  (r'(\w+)\'re', r'\g<1> are'),
  (r'(\w+)\'d', r'\g<1> would')
]

patterns = [(re.compile(regex), repl) for (regex, repl) in replacement_patterns]

def replace(text):
    s = text
    for (pattern, repl) in patterns:
        s = re.sub(pattern, repl, s)
    return s

TOKEN_RE = re.compile(r"\w.*?\b")
def process_text(text):
    """
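    Split the paragraph into sentences, expand contractions, and return
    one flat list of case-folded tokens.
    To keep the nuance of social media, word forms are kept as-is (no stemming
    or lemmatization). Note that TOKEN_RE only matches runs of word characters,
    so standalone punctuation is not emitted as a token.
    """
    # needs NLTK's Punkt sentence tokenizer data to be downloaded beforehand
    sent_text = nltk.sent_tokenize(text) # this gives us a list of sentences
    
    # now loop over each sentence and tokenize it separately
    s = []
    for sentence in sent_text:
        # expand contractions with the regular expressions above
        sentence = replace(sentence)
        # tokenize the expanded sentence (not the original text)
        tokenized_text = [token.casefold() for token in TOKEN_RE.findall(sentence)]

        s = s + tokenized_text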
    return s

def process_data(series):
    # returns text in this format:
    # data = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
    # 			['this', 'is', 'the', 'second', 'sentence'],
    # 			['yet', 'another', 'sentence'],
    # 			['one', 'more', 'sentence'],
    # 			['and', 'the', 'final', 'sentence']]
    tweets = []
    for _,row in series.items():
        tweets.append(process_text(str(row)))
    
    return tweets
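
To see what `TOKEN_RE` actually captures, the sketch below runs it on one of the sample tweets. `PUNCT_TOKEN_RE` is a hypothetical alternative that is not part of the notebook; it is shown only as one possible way to keep punctuation runs as tokens, if that is the intent.

```python
import re

TOKEN_RE = re.compile(r"\w.*?\b")              # pattern used in the notebook
PUNCT_TOKEN_RE = re.compile(r"\w+|[^\w\s]+")   # hypothetical: also keeps punctuation runs

sample = "Sooo SAD I will miss you here in San Diego!!!"

# Each match starts at a word character and stops at the next word boundary,
# so the trailing "!!!" never starts a match.
word_tokens = [t.casefold() for t in TOKEN_RE.findall(sample)]

# The alternative pattern emits "!!!" as its own token.
punct_tokens = [t.casefold() for t in PUNCT_TOKEN_RE.findall(sample)]

print(word_tokens)
print(punct_tokens)
```

Whether repeated punctuation such as `!!!` helps the classifier is an open question; the sketch only shows how the two patterns differ.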

In [3]:
df_tweet = pd.read_csv('data/Tweets.csv')

In [4]:
df_tweet

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative
...,...,...,...,...
27476,4eac33d1c0,wish we could come see u on Denver husband l...,d lost,negative
27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,", don`t force",negative
27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,Yay good for both of you.,positive
27479,ed167662a5,But it was worth it ****.,But it was worth it ****.,positive
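
The sample rows above write contractions with a backtick (``I`d``, ``couldn`t``), while the replacement patterns look for a straight apostrophe. A minimal sketch of one way to handle this, assuming the backtick always stands in for an apostrophe in this dataset; `expand_contractions` and the shortened `CONTRACTIONS` list are hypothetical names used only for illustration.

```python
import re

# A shortened copy of the notebook's patterns; specific cases come before generic ones.
CONTRACTIONS = [
    (re.compile(r"won't"), "will not"),
    (re.compile(r"can't"), "cannot"),
    (re.compile(r"(\w+)n't"), r"\g<1> not"),
    (re.compile(r"(\w+)'d"), r"\g<1> would"),
    (re.compile(r"(\w+)'ve"), r"\g<1> have"),
]

def expand_contractions(text):
    # Assumption: backticks and curly apostrophes are used as apostrophes.
    text = text.replace("`", "'").replace("\u2019", "'")
    for pattern, repl in CONTRACTIONS:
        text = pattern.sub(repl, text)
    return text

print(expand_contractions("I`d have responded, if I were going"))
print(expand_contractions("why couldn`t they put them"))
```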


In [103]:
# From notebook 11
def load_lexicon(filename):
    """
    Load a file from Bing Liu's sentiment lexicon
    (https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html), containing
    English words in Latin-1 encoding.
    
    One file contains a list of positive words, and the other contains
    a list of negative words. The files contain comment lines starting
    with ';' and blank lines, which should be skipped.
    """
    lexicon = []
    with open(filename, encoding='latin-1') as infile:
        for line in infile:
            line = line.rstrip()
            if line and not line.startswith(';'):
                lexicon.append(line)
    return lexicon

pos_words = load_lexicon('data/positive-words.txt')
neg_words = load_lexicon('data/negative-words.txt')

## Load the embeddings

We currently use pretrained GloVe vectors; this can be changed later.

In [5]:
import numpy as np

In [99]:
# From notebook 11
def load_embeddings(filename):
    """
    Load a DataFrame from the generalized text format used by word2vec, GloVe,
    fastText, and ConceptNet Numberbatch. The main point where they differ is
    whether there is an initial line with the dimensions of the matrix.
    """
    labels = []
    rows = []
    with open(filename, encoding='utf-8') as infile:
        for i, line in enumerate(infile):
            items = line.rstrip().split(' ')
            if len(items) == 2:
                # This is a header row giving the shape of the matrix
                continue
            labels.append(items[0])
            values = np.array([float(x) for x in items[1:]], 'f')
            rows.append(values)
    
    arr = np.vstack(rows)
    return pd.DataFrame(arr, index=labels, dtype='f')

# for better performance, use the 42B data https://nlp.stanford.edu/data/glove.42B.300d.zip
embeddings = load_embeddings('data/glove.6B.50d.txt')
embeddings.shape

(400000, 50)

In [104]:
pos_vectors = embeddings.loc[embeddings.index.isin(pos_words)].dropna()
neg_vectors = embeddings.loc[embeddings.index.isin(neg_words)].dropna()

In [105]:
vectors = pd.concat([pos_vectors, neg_vectors])
targets = np.array([1 for entry in pos_vectors.index] + [-1 for entry in neg_vectors.index])
labels = list(pos_vectors.index) + list(neg_vectors.index)
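
The labeling step above can be illustrated on a toy vocabulary. This is a sketch with plain dictionaries instead of a DataFrame, and every word and vector in it is made up for illustration: lexicon words without an embedding are skipped, positive words get the target `1`, and negative words get `-1`.

```python
# Toy stand-ins (hypothetical values) for the GloVe table and the two lexicons.
toy_embeddings = {
    "good":  [0.9, 0.1],
    "table": [0.0, 0.5],
    "bad":   [-0.8, 0.2],
    "great": [0.7, 0.3],
}
toy_pos_words = {"good", "great", "superb"}  # "superb" has no vector and is skipped
toy_neg_words = {"bad", "awful"}             # "awful" has no vector and is skipped

# Keep only the lexicon words that have an embedding, in embedding-table order.
pos_labels = [w for w in toy_embeddings if w in toy_pos_words]
neg_labels = [w for w in toy_embeddings if w in toy_neg_words]

# Stack positive rows first, then negative rows, with matching targets.
vectors = [toy_embeddings[w] for w in pos_labels + neg_labels]
targets = [1] * len(pos_labels) + [-1] * len(neg_labels)
labels = pos_labels + neg_labels

for word, target in zip(labels, targets):
    print(word, target)
```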

## BERTweet
https://huggingface.co/docs/transformers/model_doc/bertweet

In [1]:
from transformers import AutoModel, AutoTokenizer


In [None]:
bertweet = AutoModel.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", use_fast=False)

### Data Cleaning
- remove tweets classified as 'neutral' so that we can perform binary classification
- remove non-string tweets
    - possibly just map these to strings?

In [None]:
# https://stackoverflow.com/questions/39275533/select-row-from-a-dataframe-based-on-the-type-of-the-objecti-e-str
# df[df['A'].apply(lambda x: isinstance(x, str))]
df_tweet_bert = df_tweet[df_tweet['text'].apply(lambda x: isinstance(x, str))].reset_index()
df_tweet_bert = df_tweet_bert[df_tweet_bert['sentiment'] != 'neutral'].reset_index()
#df_tweet_bert = df_tweet.loc[type(df_tweet['text']) == str]

In [None]:
df_tweet_bert

In [None]:
# def normalize_encode_tweet(tweet):
#     norm = tokenizer.normalizeTweet(tweet)
#     encoded = tokenizer.encode(norm)
#     return encoded

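from sentence_transformers import SentenceTransformer

# Note: the embeddings come from a distilled RoBERTa sentence-transformer;
# the BERTweet tokenizer is only used to normalize the tweet text.
roberta_model = SentenceTransformer('paraphrase-distilroberta-base-v1')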
def normalize_encode_tweet(tweet):
    norm = tokenizer.normalizeTweet(tweet)
    encoded = roberta_model.encode(norm)
    return encoded

In [51]:
from tqdm import tqdm

# show progress
tqdm.pandas()

# https://www.geeksforgeeks.org/create-a-new-column-in-pandas-dataframe-based-on-the-existing-columns/
df_tweet_bert['embedding'] =  df_tweet_bert.progress_apply(lambda row: normalize_encode_tweet(row.text), axis=1)

100%|██████████| 16363/16363 [12:04<00:00, 22.57it/s]


In [None]:
df_tweet_bert.to_csv("tweet_roberta_embeddings.csv", index=False)

In [2]:
df_tweet_bert = pd.read_csv("tweet_roberta_embeddings.csv")

In [3]:
df_tweet_bert.drop(df_tweet_bert.columns[[0, 1, 2]], axis=1,inplace=True)

In [4]:
df_tweet_bert.head()

Unnamed: 0,textID,text,selected_text,sentiment,embedding
0,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,[ 9.30877551e-02 4.43676770e-01 1.10505581e-...
1,088c60f138,my boss is bullying me...,bullying me,negative,[-2.20891997e-01 -2.87244469e-02 1.46015704e-...
2,9642c003ef,what interview! leave me alone,leave me alone,negative,[ 1.11802444e-02 -4.25624251e-01 1.02491967e-...
3,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,[ 1.77452222e-01 2.84410834e-01 5.99784851e-...
4,6e0c6d75b1,2am feedings for the baby are fun when he is a...,fun,positive,[-1.04325861e-01 2.68305153e-01 -1.53165251e-...


In [18]:
type(df_tweet_bert['embedding'][0])

str
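
Writing the DataFrame to CSV stores each embedding as the text NumPy prints for the array, which is why the column comes back as `str`. A minimal sketch of parsing such a cell back into floats; `parse_embedding` is a hypothetical helper, and the check for `...` is a precaution based on the assumption that a printed array could have been abbreviated (NumPy does this for long arrays; the 768-dimensional vectors here are below its default threshold of 1000 elements).

```python
def parse_embedding(cell):
    """Convert a NumPy-style printed vector (brackets, whitespace-separated) to floats."""
    if "..." in cell:
        # An abbreviated printout cannot be restored to the full vector.
        raise ValueError("embedding string was truncated when it was printed")
    # split() with no argument also handles the line breaks NumPy inserts.
    return [float(x) for x in cell.strip("[]").split()]

cell = "[ 9.30877551e-02  4.43676770e-01\n  1.10505581e-01]"
vec = parse_embedding(cell)
print(vec)
```

Storing the vectors in a binary format instead of CSV would avoid the string round trip altogether.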

In [13]:
temp = df_tweet_bert['embedding'].apply(lambda x: x.strip("[]").split())

In [14]:
temp[:5]

0    [9.30877551e-02, 4.43676770e-01, 1.10505581e-0...
1    [-2.20891997e-01, -2.87244469e-02, 1.46015704e...
2    [1.11802444e-02, -4.25624251e-01, 1.02491967e-...
3    [1.77452222e-01, 2.84410834e-01, 5.99784851e-0...
4    [-1.04325861e-01, 2.68305153e-01, -1.53165251e...
Name: embedding, dtype: object

In [83]:
temp2 = [len(x) for x in X]

In [84]:
max(temp2)

768

In [85]:
min(temp2)

768

In [18]:
import statsmodels.formula.api

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [58]:
import numpy as np

In [115]:
# test train split
# X = df_tweet_bert['embedding']
# when reading BERT from csv
X = df_tweet_bert['embedding'].apply(lambda s: ([float(x.strip(" \n")) for x in s.strip("[]").split()])).values.tolist()
y = df_tweet_bert['sentiment'].values.tolist()

# train + test split
X_train, X_test, y_train, y_test = train_test_split(np.array(X,dtype=object), np.array(y), test_size = 0.2, random_state=1)
# re-split train to have training, validation, testing sets
# https://datascience.stackexchange.com/questions/15135/train-test-validation-set-splitting-in-sklearn
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.25, random_state=1) # 0.2/0.8 = 0.25

# train = 60%, val = 20%, test = 20% of original data
# TODO: need higher proportion of training data?

In [118]:
len(X_train)

9817

In [119]:
len(X_val)

3273

In [120]:
len(X_test)

3273
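
The three sizes above follow from the two-step split. A small sketch of the arithmetic, assuming the held-out share of each split is rounded up (this assumption reproduces the reported sizes); `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 16363                                  # tweets left after filtering
n_trainval, n_test = split_sizes(n_total, 0.2)   # first split: hold out the test set
n_train, n_val = split_sizes(n_trainval, 0.25)   # second split: 0.25 of the remaining 80%

print(n_train, n_val, n_test)  # 9817 3273 3273
```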

In [121]:
type(X_train)

numpy.ndarray

In [116]:
df = pd.DataFrame(X)

In [117]:
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,758,759,760,761,762,763,764,765,766,767
0,0.093088,0.443677,0.110506,-0.342519,0.334282,0.134133,-0.079811,0.109643,-0.126565,0.068254,...,0.244574,0.369231,0.375811,0.209643,-0.253997,0.118141,-0.014578,0.338187,0.189731,-0.100272
1,-0.220892,-0.028724,0.146016,-0.145213,0.571359,-1.157924,0.182461,-0.330796,0.137217,0.456897,...,0.123236,-0.210848,0.203079,-0.147492,-0.024679,0.178868,0.277888,-0.279467,-0.325976,0.090958
2,0.011180,-0.425624,0.102492,-0.452972,-0.219800,-0.457667,0.314573,-0.382062,-0.002846,-0.448729,...,-0.367298,-0.181243,0.003006,-0.251850,-0.453673,0.200214,-0.296771,-0.072697,0.017986,-0.251559
3,0.177452,0.284411,0.059978,0.294704,-0.461114,-0.078591,0.027912,-0.055785,0.103038,-0.375385,...,-0.141730,0.011023,-0.028931,0.275780,0.392628,0.255509,-0.228360,-0.115353,-0.060927,-0.111012
4,-0.104326,0.268305,-0.153165,-0.098395,0.136013,-0.023756,0.088492,-0.139561,-0.056441,-0.164708,...,-0.252163,-0.168705,0.131596,0.134020,-0.135192,0.136525,0.146635,-0.221401,-0.097489,0.133560
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16358,0.270939,-0.060949,0.430409,0.063049,0.087054,0.140644,-0.005588,-0.014261,-0.410215,-0.261869,...,0.041894,-0.191259,0.005462,0.223951,-0.096526,0.615836,0.522237,0.275737,0.338247,-0.043370
16359,0.002887,0.125610,0.194917,-0.271436,-0.618391,-0.181236,-0.123064,0.412982,-0.064641,-0.461746,...,0.077754,-0.076400,-0.029041,0.303071,-0.178294,0.099713,0.020586,-0.142542,0.411557,-0.139364
16360,0.233436,-0.333714,0.249148,-0.065081,-0.009866,0.274120,-0.096856,0.000036,-0.101609,-0.074891,...,0.064156,0.152942,0.003013,0.009000,-0.168976,0.035944,-0.074668,-0.267087,-0.333665,0.200270
16361,0.060705,0.650956,0.364366,-0.137743,0.125858,0.377062,0.200909,-0.164970,0.192967,0.105727,...,-0.133314,-0.485834,-0.629914,0.162492,-0.074942,0.006419,0.090930,0.374194,0.108635,0.091865


## Train logistic regression model

In [122]:
X[1]

[-0.220891997,
 -0.0287244469,
 0.146015704,
 -0.145213276,
 0.571359217,
 ...]

In [123]:
type(X_train[0][1])

float

In [124]:
len(X_train[0]) == len(X_train[2])

True

In [125]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression() #random_state=1
clf_log_reg = log_reg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
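
The warning above suggests two remedies: scaling the features and raising `max_iter`. A minimal sketch of both on synthetic data (not the tweet embeddings), assuming scikit-learn is installed; the names `X_demo`, `y_demo`, and `clf_demo` are made up for this illustration, and the accuracy it prints says nothing about the tweet model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the embedding matrix and sentiment labels.
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Standardize the features, then fit with a larger iteration budget
# than the default of 100.
clf_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf_demo.fit(X_tr, y_tr)

demo_accuracy = clf_demo.score(X_te, y_te)
print(demo_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no statistics from the test split leak into training.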


In [127]:
# Use score method to get accuracy of model
score = clf_log_reg.score(X_test, y_test)
print(score)

0.8875649251451267


## Evaluation

In [None]:
tweet_sentiment = clf_log_reg.predict(X_test)
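
Accuracy alone can hide differences between the two classes. The sketch below computes per-class precision and recall from paired label lists using only the standard library. The labels are hypothetical stand-ins for `y_test` and `tweet_sentiment`, and `per_class_report` is an illustrative helper; `sklearn.metrics.classification_report` offers the same kind of summary.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall)} computed from paired label lists."""
    pairs = Counter(zip(y_true, y_pred))
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = pairs[(label, label)]
        predicted = sum(n for (t, p), n in pairs.items() if p == label)
        actual = sum(n for (t, p), n in pairs.items() if t == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        report[label] = (precision, recall)
    return report

# Hypothetical labels, standing in for y_test and tweet_sentiment.
y_true_demo = ["positive", "positive", "negative", "negative", "negative"]
y_pred_demo = ["positive", "negative", "negative", "negative", "positive"]

report = per_class_report(y_true_demo, y_pred_demo)
for label, (precision, recall) in report.items():
    print(label, round(precision, 2), round(recall, 2))
```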

## Citations

Download the lexicon from http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar and extract it into `data/positive-words.txt` and `data/negative-words.txt`.

The pre-processing steps are inspired by https://towardsdatascience.com/text-normalization-for-natural-language-processing-nlp-70a314bfa646.

We also pre-processed data so that it begins with `<s>` tokens (and ends with `</s>` tokens). Inspired by this answer: https://stackoverflow.com/questions/37605710/tokenize-a-paragraph-into-sentence-and-then-into-words-in-nltk

Text normalization with regular expressions: code from https://gist.github.com/yamanahlawat/4443c6e9e65e74829dbb6b47dd81764a