Disclaimer: The tweets used in this analysis are for illustrative purposes only and do not reflect the personal opinions or beliefs of the author. They are sourced from publicly available data and are used to demonstrate the capabilities of NLP tools.

# Preparation

In [76]:
!pip install empath vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [77]:
import pandas as pd #dataframe management
import numpy as np #numerical arrays and linear algebra
from google.colab import files #upload files to google colab
import re #regular expressions for finding and replacing text

In [3]:
# select the files to be uploaded
uploaded = files.upload()

Saving data.xlsx to data.xlsx


In [28]:
data = pd.read_excel('data.xlsx')

# Data Preprocessing
An essential initial step in Natural Language Processing (NLP) is data cleaning, particularly when dealing with online text like tweets, online surveys, or other participant-generated content. This process ensures your data is accurate, consistent, and ready for analysis, ultimately leading to more reliable insights.

In this document, some data cleaning was completed in Excel before the data were loaded into Python. This included keeping only English tweets, keeping only tweets with more than 20 likes, and repairing encoding errors (e.g., â€™ displayed in place of ').
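
The encoding errors mentioned above typically arise when UTF-8 text is read as Windows-1252. The repair was done in Excel for this dataset, but a minimal Python sketch of the same idea is shown here. The helper name `fix_mojibake` is hypothetical, and the approach assumes the text was mis-decoded exactly once in this particular way.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was mistakenly decoded as Windows-1252."""
    try:
        # Recover the original bytes, then decode them with the right codec.
        return text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # The text was not garbled in this specific way; leave it unchanged.
        return text

garbled = "don\u00e2\u20ac\u2122t"  # displays as: donâ€™t
repaired = fix_mojibake(garbled)    # the apostrophe is restored
untouched = fix_mojibake("caf\u00e9")  # correctly encoded text is returned as-is
```

Some characters cannot be round-tripped this way, so the function falls back to returning the input rather than raising an error.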

Depending on your dataset and NLP methods, additional cleaning may involve stop word removal, stemming, tokenization, spell checking etc.
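
As a rough illustration of two of these steps, the sketch below tokenizes a sentence and removes stop words using only the standard library. The stop-word list is a hypothetical, deliberately tiny one; real projects usually rely on a curated list from an NLP library.

```python
import re

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "that"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The governor is sending aid to the state")
content_tokens = remove_stop_words(tokens)
print(content_tokens)
```

Whether stop words should be removed depends on the method: lexicon counts are often unaffected, while tools that use word order may need the full sentence.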

In [16]:
# additional cleaning in python
def clean_election(text):

    new_text = re.sub(r'<[^>]+>', ' ', text)  # remove html tags such as <br> or <a href="...">
    new_text = re.sub(r'http\S+', ' ', new_text) # remove all URLs
    new_text = re.sub(r'[^\x00-\x7F]+', ' ', new_text) # remove non-ASCII characters (emoji, leftover encoding debris)

    new_text = new_text.lower() # convert all characters to lowercase

    new_text = new_text.replace("\n",' ') # replace real newline characters with a space
    new_text = new_text.replace('\\n',' ') # replace the literal two-character sequence \n

    return new_text

In [11]:
len(data)

2999

In [21]:
data.head()

Unnamed: 0,text,date
0,BREAKING: Florida Governor Ron DeSantis is sen...,2024-11-26
1,@NormOrnstein Norman Jay Ornstein â€œ playing ...,2024-11-26
2,General Flynn was so well respected by Preside...,2024-11-26
3,( @realDonaldTrump - Truth Social Post )\n( Do...,2024-11-26
4,"Walmart rolls back diversity, equity, and incl...",2024-11-26


In [29]:
data = data.dropna()
data['text'] = data['text'].apply(clean_election)

In [31]:
data.head()

Unnamed: 0,text,date
0,breaking: florida governor ron desantis is sen...,2024-11-26
1,@normornstein norman jay ornstein playing th...,2024-11-26
2,general flynn was so well respected by preside...,2024-11-26
3,( @realdonaldtrump - truth social post ) ( don...,2024-11-26
4,"walmart rolls back diversity, equity, and incl...",2024-11-26


# Linguistic Feature Analysis
One of the earliest forms of NLP involved extracting linguistic features from text or speech to understand the cognitive processes underlying narrations. A popular lexicon-based method for text analysis is Linguistic Inquiry and Word Count (LIWC). LIWC uses over 100 predefined lexicons to score a given document in terms of its usage of basic parts of speech (e.g., pronouns, nouns, verbs) or how it refers to more abstract psychological features (e.g., social referents, emotion, past focus), providing a useful way to identify characteristics of language use between scenarios and across groups. LIWC is not free to use, so here we will use a free alternative, Empath.

In [None]:
from empath import Empath #Empath package
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer #VADER package

In [70]:
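lexicon = Empath() # build the lexicon once so it can be reused for every tweet

def sentiment_empath(text):
  # normalize=True divides each category count by the number of words in the text
  result = lexicon.analyze(text, categories=["positive_emotion","negative_emotion"], normalize=True) #linguistic feature analysis
  positive = result["positive_emotion"]
  negative = result["negative_emotion"]
  return positive, negative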

In [82]:
# compute the proportion of positive and negative emotion words in each tweet
data['positive_empath'], data['negative_empath'] = zip(*data['text'].apply(sentiment_empath))

In [83]:
# tweets with the most positive words
data = data.sort_values(by=['positive_empath'], ascending=False).reset_index(drop=True)
data.head()

Unnamed: 0,text,date,positive,negative,positive_empath,negative_empath
0,"great former congressman, great wisconsinite!",2024-11-18,0.4,0.0,0.4,0.0
1,biden is so happy,2024-11-26,0.25,0.0,0.25,0.0
2,@angie_angieangi happy birthday angie,2024-11-18,0.25,0.0,0.25,0.0
3,try and keep up,2024-11-18,0.25,0.0,0.25,0.0
4,i love helping smaller maga accounts!,2024-11-26,0.166667,0.0,0.166667,0.0


In [84]:
# tweets with the most negative words
data = data.sort_values(by=['negative_empath'], ascending=False).reset_index(drop=True)
data.head()

Unnamed: 0,text,date,positive,negative,positive_empath,negative_empath
0,@danscavino #maga fight fight fight,2024-11-26,0.0,0.6,0.0,0.6
1,@jessicatarlov why lie hunny?,2024-11-26,0.0,0.25,0.0,0.25
2,"hit the road, jack.",2024-11-26,0.0,0.25,0.0,0.25
3,@mjfree cry more commie,2024-11-26,0.0,0.25,0.0,0.25
4,boom! let's hit them where it hurts the most!...,2024-11-26,0.0,0.222222,0.0,0.222222


Empath can tell apart positive tweets (e.g., "great former congressman, great wisconsinite!") from negative tweets (e.g., "fight fight fight"). However, a major drawback is that it cannot consider the relationships between words: each word is counted on its own, regardless of context. For instance, "I am not happy" is rated as more positive than negative.

In [85]:
lexicon.analyze('I am not happy', categories=["positive_emotion","negative_emotion"], normalize=True)

{'positive_emotion': 0.25, 'negative_emotion': 0.0}
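
To see why negation slips through, consider a minimal sketch of lexicon counting. The two word sets below are hypothetical stand-ins, far smaller than Empath's real categories, and the function only mimics the general idea of normalized counting, not Empath's actual implementation.

```python
# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"happy", "great", "love"}
NEGATIVE = {"sad", "hate", "cry"}

def lexicon_scores(text):
    """Return the share of words found in each lexicon."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0, 0.0
    pos = sum(t in POSITIVE for t in tokens) / len(tokens)
    neg = sum(t in NEGATIVE for t in tokens) / len(tokens)
    return pos, neg

scores_plain = lexicon_scores("I am happy")
scores_negated = lexicon_scores("I am not happy")
print(scores_plain, scores_negated)
```

Because "not" belongs to neither list, it only dilutes the positive share (one word in four instead of one in three); it never flips the sign.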

More recent models, like the Valence Aware Dictionary and sEntiment Reasoner (VADER), track word-order-sensitive relationships between terms to compute the emotional tone of a text (i.e., positive to negative valence). Instead of relying exclusively on a lexicon with a fixed mapping between a word and its valence, VADER also tracks punctuation (e.g., !) and intensifiers (e.g., extremely, somewhat, kind of), affording it additional sensitivity to the degree of the sentiment being expressed.

The positive, neutral, and negative scores are the proportions of the text that fall into each category, so they should sum to 1 (or very close to it, given floating-point rounding). The compound score, used below, is computed by summing the valence scores of the words in the text that appear in the lexicon, adjusting them according to VADER's rules, and then normalizing the sum to lie between -1 (most extreme negative) and +1 (most extreme positive).

In [86]:
analyzer = SentimentIntensityAnalyzer()
result = analyzer.polarity_scores('I am not happy')
result['compound'] # 1 is positive, -1 is negative

-0.4585
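
The value of -0.4585 can be reproduced by hand as a rough check on the description above. The sketch assumes three details taken from VADER's open-source reference implementation rather than from this notebook: a lexicon rating of 2.7 for "happy", a negation multiplier of -0.74, and a normalization constant of 15.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    norm = score / math.sqrt(score * score + alpha)
    return max(-1.0, min(1.0, norm))

HAPPY_VALENCE = 2.7      # assumed lexicon rating for "happy"
NEGATION_SCALAR = -0.74  # assumed multiplier applied after "not"

# "I am not happy": the only scored word is "happy", flipped by the negation.
compound = round(normalize(HAPPY_VALENCE * NEGATION_SCALAR), 4)
print(compound)  # -0.4585
```

Note that the negation does not simply mirror the score: "not happy" is treated as negative, but less strongly so than "happy" is positive.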

In [91]:
def sentiment_vader(text):
  result = analyzer.polarity_scores(text) #linguistic feature analysis
  return result['compound']

In [92]:
data['valence_vader'] = data['text'].apply(sentiment_vader)

In [93]:
data = data.sort_values(by=['valence_vader'], ascending=True).reset_index(drop=True)
data.head()

Unnamed: 0,text,date,positive,negative,positive_empath,negative_empath,valence_vader
0,just a daily reminder that donald trump is a ...,2024-11-18,0.0,0.020408,0.0,0.020408,-0.989
1,"@mjfree angry, jealous, hateful, racist bully....",2024-11-26,0.0,0.04,0.0,0.04,-0.9846
2,hey how's that wall that trump repeatedly sa...,2024-11-26,0.0,0.020408,0.0,0.020408,-0.9721
3,maga gonna feel the actions of their consequen...,2024-11-26,0.0,0.043478,0.0,0.043478,-0.9685
4,how trump plans to stop world war iii. in tr...,2024-11-26,0.0,0.06,0.0,0.06,-0.9648


In [94]:
# the most negative tweet
data.at[0, 'text']

'just  a daily reminder that donald trump is a convicted sex offender, liar,  felon, and financial fraudster and is a racist, sexist, hateful, fear  mongering, evil fascist and is the worst president in the history of the  u.s. and deserves to spend the rest of his life in prison,  '

In [95]:
data = data.sort_values(by=['valence_vader'], ascending=False).reset_index(drop=True)
data.head()

Unnamed: 0,text,date,positive,negative,positive_empath,negative_empath,valence_vader
0,@doglover_001 yes. i love conservative women b...,2024-11-26,0.068966,0.0,0.068966,0.0,0.9843
1,@mitchellvii best at what? best destructionis...,2024-11-18,0.0,0.0,0.0,0.0,0.9816
2,@nadyabyznezz may god bless all of you and you...,2024-11-26,0.0625,0.0,0.0625,0.0,0.9781
3,"thank you, thank you, thank you. i can't tell...",2024-11-26,0.0,0.0,0.0,0.0,0.975
4,@harryjoebanks34 yeah... joe biden trying to s...,2024-11-18,0.064516,0.0,0.064516,0.0,0.9724


In [96]:
# the most positive tweet
data.at[0, 'text']

'@doglover_001 yes. i love conservative women because they are intelligent, loyal, strong, funny, compassionate, family oriented most important of all faithful and way more beautiful  than liberal women. maga/maha  '

But also consider this example, particularly the phrase 'best destructionist: Gavin Newsom.' While the user's intent is clearly sarcastic and critical of Newsom, VADER and many other NLP tools struggle to detect such nuanced sentiment.

In [97]:
data.at[1, 'text']

'@mitchellvii best at what?  best destructionist : gavin newsom  best defender of peoples rights: ron desantis   best defender of border: greg abbott with honorable mention to multiple gop govs who sent help to tx  best overall 2024: ron desantis'

In conclusion, Linguistic Feature Analysis offers a straightforward and accessible method for automatically analyzing text data. Often achievable with minimal code, it can reveal valuable insights into the overall sentiment, prevalent topics, and word usage patterns within a text. However, it's crucial to acknowledge its limitations in discerning complex relationships between words, and its susceptibility to misinterpreting irony and sarcasm.

# Text Vectorization and Embedding
In brief, embedding models convert a given text (e.g., word, sentence, or document) into numerical vectors, thus ‘embedding’ the text in a high-dimensional semantic space. These ‘spaces’ are derived from large corpora of natural text (e.g., the entirety of Wikipedia) and represent semantics by inferring what a word means from how it was used in the training corpus. The specific ways in which embeddings are computed vary considerably, from directly modeling word co-occurrence statistics [e.g., Global Vectors for Word Representation (GloVe)] to training neural networks to complete a specific task, like predicting the text that is likely to appear before or after a specific target [e.g., word2vec and Universal Sentence Encoder (USE)]. Contemporary embedding models, however, leverage the powerful transformer architecture [e.g., Bidirectional Encoder Representations from Transformers (BERT)], which has given rise to the proliferation of LLMs that we see today [e.g., Generative Pretrained Transformer (GPT)].
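
The idea that meaning can be inferred from usage can be illustrated with raw co-occurrence counts. The sketch below is a toy version of the count-based approach only; the three-sentence corpus is made up for illustration, and real models such as GloVe apply considerably more machinery to far larger corpora.

```python
from collections import Counter, defaultdict
from itertools import permutations
import math

# Hypothetical toy corpus, for illustration only.
corpus = ["cats drink milk", "dogs drink water", "planets orbit stars"]

# Each word is represented by counts of the words it shares a sentence with.
cooc = defaultdict(Counter)
for sentence in corpus:
    for word, context in permutations(sentence.split(), 2):
        cooc[word][context] += 1

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

sim_cats_dogs = cosine(cooc["cats"], cooc["dogs"])
sim_cats_planets = cosine(cooc["cats"], cooc["planets"])
print(sim_cats_dogs, sim_cats_planets)
```

"cats" and "dogs" never appear together, yet they end up similar because both occur alongside "drink"; "cats" and "planets" share no context and score zero.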

Here, I will demonstrate a simple use of text embedding: similarity in meaning, often used to quantify narrative memory accuracy and consistency.

In [98]:
import tensorflow_hub as hub
USE = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

In [102]:
np.linalg.norm(np.array(USE(['SDSDG'])))

np.float32(1.0)

In [106]:
def USE_similarity(USE, sentence1, sentence2):
  #convert sentence1 and sentence2 into 512-dimensional vectors
  USE_output = np.array(USE([sentence1, sentence2]))
  # calculate cosine similarity between them (higher means closer in meaning)
  similarity = np.inner(USE_output[0], USE_output[1])
  return similarity

# Note: For those interested in the math, the function np.inner actually
# calculates the inner product between inputs. Cosine similarity is the inner
# product, normalized by the magnitudes of the vectors. Since the magnitudes of
# USE vectors are always 1, in this case, cosine similarity and the inner
# product are equivalent. Consequently, using the inner product directly
# provides a measure of semantic similarity between the input sentences.
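
The equivalence described in the note can be checked on small made-up vectors. The sketch below is written without NumPy so that each step of the arithmetic is visible; the two-dimensional vectors are arbitrary examples, not USE outputs.

```python
import math

def inner(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Inner product divided by the product of the vector lengths."""
    return inner(u, v) / (math.sqrt(inner(u, u)) * math.sqrt(inner(v, v)))

def unit(v):
    """Rescale a vector to length 1."""
    length = math.sqrt(inner(v, v))
    return [x / length for x in v]

raw_u, raw_v = [3.0, 4.0], [4.0, 3.0]
u, v = unit(raw_u), unit(raw_v)

print(inner(raw_u, raw_v))              # raw inner product depends on vector length
print(cosine_similarity(raw_u, raw_v))  # cosine similarity of the raw vectors
print(inner(u, v))                      # inner product after scaling to unit length
```

For the raw vectors the inner product and the cosine similarity differ, but once both vectors have length 1 the two quantities coincide, which is why `np.inner` suffices for USE outputs.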

In [107]:
sentence1 = 'I like apple'
sentence2 = 'I like apple pie'
USE_similarity(USE, sentence1, sentence2)

np.float32(0.7040262)

In [108]:
sentence1 = 'I like apple'
sentence2 = 'Gravity pulls objects downward'
USE_similarity(USE, sentence1, sentence2)

np.float32(0.074745)

In [109]:
# separate the data into those about republicans and those about democrats
data = pd.read_excel('data.xlsx')
# na=False treats missing text as "no match" so that astype(int) does not fail
data['republican'] = data['text'].str.contains('trump|republican|republicans', case=False, na=False).astype(int)
data['democrat'] = data['text'].str.contains('harris|biden|democrat|democrats', case=False, na=False).astype(int)
# keep only tweets that mention exactly one of the two parties
data = data[data['republican'] != data['democrat']].reset_index(drop=True)