# Hongfan Lu - PSet 2

#### Q1) Asking Questions:

What are some compelling questions that you can ask with the dataset you collected?
List at least two questions. Enter your responses and the rationale behind choosing those questions in markdown cells. Write your questions and explain your reasoning for each of them.

HINT: In the first assignment you had chosen certain keywords to filter data related to the election and had provided a rationale behind the choice of the keywords. This can help you think of potential questions.

**Proposed Question 1**:

    The trade war with China (2018-2020) was one of the signature policies of Trump's presidency. I would like to investigate how Trump formulated and communicated this policy, and how he rallied his supporters behind it.
    Did he use accusations of lying, deception, and so forth? I will use keywords to find such tweets.

**Proposed Question 2**:

    Are Trump's tweets angry? My assumption is yes, since he was the target of investigations and an impeachment. What other emotions do they contain?

#### Q2) Inspect & Data Cleaning

#### 2a) Inspect:  

Write code to inspect the data. What do you observe? Along with the code, write your observations in the markdown cell.

In [1]:
import pandas as pd
import re
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
from nltk.corpus import stopwords

In [2]:
trump = pd.read_csv('psets-trump-20200530.csv')
trump.shape

(18467, 7)

Each row contains one tweet (or reply) made by Trump. The columns are the source, the actual tweet text, the creation timestamp, the retweet and favorite counts, a retweet flag, and finally the id of that specific tweet.

created_at contains strings, which need to be converted to datetime below.

In [3]:
trump['source'].value_counts()

source
Twitter for iPhone      17843
Twitter Media Studio      174
Twitter for Android       174
Media Studio              153
Twitter Web Client         48
Twitter for iPad           38
Twitter Ads                33
Twitter Web App             4
Name: count, dtype: int64

Most tweets were sent from Trump's iPhone (17,843 of 18,467).

In [4]:
trump['is_retweet'].value_counts()

is_retweet
False    16509
True      1900
Name: count, dtype: int64

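Most of Trump's tweets are original: 16,509 are not retweets and only 1,900 are. The two counts sum to 18,409, so 58 of the 18,467 rows appear to lack an is_retweet value.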

- By visually checking the dataset, I found that the 'text' column of row 93 contains uncleaned information: several other rows are embedded in that one cell. I need to extract this information and put it in the right columns.

In [5]:
blub_string = trump['text'][93]
blub_string[:700]

'He is arguably the greatest president in our history.” Thank you @LouDobbs! https://t.co/6dfy0yxu9l,05-27-2020 22:58:18,18223,71416,false,1265779391646187520\nTwitter for iPhone,At my request the FBI and the Department of Justice are already well into an investigation as to the very sad and tragic death in Minnesota of George Floyd....,05-27-2020 22:39:56,50111,247307,false,1265774767493148672\nTwitter for iPhone,....I have asked for this investigation to be expedited and greatly appreciate all of the work done by local law enforcement. My heart goes out to George’s family and friends. Justice will be served!,05-27-2020 22:39:56,27476,141521,false,1265774770877902848\nTwitter for iPhone,If the '

#### 2b) Clean: 

Write code to clean the data. Along with the code, you need to write the rationale behind the cleaning process, i.e., what are you observing after the first level of cleaning, what is still messy and needs additional cleaning, how are you deciding to do it, etc. At this stage, your cleaning should at least comprise: (1) common data cleaning steps, and (2) dealing with messy Youtube/Bluesky data.

- 1. Tackle the blub_string found above

In [6]:
rows = blub_string.split('\n')
original_text = rows[0]
other_rows = rows[1:]
other_data = [row.split(',') for row in other_rows]
trump_cols = list(trump.columns)
df = pd.DataFrame(other_data, columns=trump_cols)

In [7]:
original_data = original_text.split(',')
original_data.insert(0,'') # Unknown posting source so insert empty string here. 
df_row = pd.DataFrame([original_data], columns=trump_cols)

In [8]:
trump = pd.concat([df_row, df, trump])
trump.loc[93, 'text'] = ''  # clear the cell whose contents were split into separate rows

In [9]:
trump.reset_index(drop=True, inplace=True)
trump.shape

(18501, 7)

These steps add the rows extracted from the 'text' cell of row 93 to the dataframe.
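Splitting each embedded row on `,` works only as long as no recovered tweet contains a comma in its text. A slightly more defensive sketch, assuming the embedded rows follow standard CSV quoting (an assumption; the sample string below is made up for illustration), parses the blob with Python's `csv` module instead:

```python
import csv
import io

# Hypothetical sample mimicking the embedded rows; the second tweet has a quoted comma.
sample_blob = (
    'Twitter for iPhone,Justice will be served!,05-27-2020 22:39:56,27476,141521,false,1\n'
    'Twitter for iPhone,"Thank you, Texas!",05-27-2020 21:28:24,21816,152454,false,2'
)

# csv.reader honors quotes, so a comma inside a quoted field stays in one column.
parsed_rows = list(csv.reader(io.StringIO(sample_blob)))

# A naive split breaks the second row into eight fields instead of seven.
naive_rows = [line.split(',') for line in sample_blob.split('\n')]

print([len(row) for row in parsed_rows])  # [7, 7]
print([len(row) for row in naive_rows])   # [7, 8]
```

The parsed rows could then be passed to `pd.DataFrame(parsed_rows, columns=trump_cols)` in the same way as `other_data`.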

- 2. Transform created_at column to datetime object

In [10]:
trump['created_at'] = pd.to_datetime(trump['created_at'])

Datetime objects will allow us to sort or aggregate the tweets by time for further comparison and analysis.
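The timestamps shown earlier (for example `05-27-2020 22:58:18`) are month-first, so spelling the format out avoids relying on inference. A small standard-library sketch of that format string, where the constant name `CREATED_AT_FORMAT` is only illustrative:

```python
from datetime import datetime

# Format observed in the data: month-day-year, then 24-hour time.
CREATED_AT_FORMAT = '%m-%d-%Y %H:%M:%S'

parsed = datetime.strptime('05-27-2020 22:58:18', CREATED_AT_FORMAT)
print(parsed.isoformat())  # 2020-05-27T22:58:18
```

The same string could be passed as the `format` argument of `pd.to_datetime`, optionally together with `errors='coerce'` so that malformed values become `NaT` instead of raising.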

- 3. Extract any URLs from the text into a separate column, which could be useful for analysis, then remove the extracted URLs from the text column

In [11]:
def extract_urls(text):
    urls = re.findall(r'(https?://\S+)', str(text))
    return ",".join(urls) if urls else None

trump['extracted_text_urls'] = trump['text'].apply(extract_urls)

In [12]:
# Remove URLs from text (same pattern as in extract_urls)
trump['text'] = trump['text'].str.replace(r'https?://\S+', '', regex=True)
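To make the effect of the two URL steps concrete, here is a minimal sketch on a single string taken from the first tweet shown earlier. It uses one shared compiled pattern (the name `URL_PATTERN` is an assumption for illustration) for both extraction and removal:

```python
import re

# One pattern for both steps keeps extraction and removal consistent.
URL_PATTERN = re.compile(r'https?://\S+')

sample = 'Thank you @LouDobbs! https://t.co/6dfy0yxu9l'

urls = URL_PATTERN.findall(sample)               # every URL in the text
text_without_urls = URL_PATTERN.sub('', sample)  # text with URLs removed

print(urls)                     # ['https://t.co/6dfy0yxu9l']
print(repr(text_without_urls))  # 'Thank you @LouDobbs! '
```

Removing the URL leaves a trailing space behind, which is why the text column is stripped in a later step.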

- 4. Extract the leading 'RT @XXX:' prefix into a separate column

In [13]:
def extract_RTs(text):
    RTs = re.findall(r'^RT @\w+:', str(text))
    return ",".join(RTs) if RTs else None

trump['extracted_RTs'] = trump['text'].apply(extract_RTs)

In [14]:
# Remove 'RT @ XXX:' from text
trump['text'] = trump['text'].str.replace(r'^RT @\w+:', '', regex=True)
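Because the pattern is anchored with `^`, only a prefix at the very start of a tweet is treated as a retweet marker. A short sketch of that behavior, using a hypothetical helper `split_rt` and made-up example tweets:

```python
import re

RT_PREFIX = re.compile(r'^RT @\w+:')

def split_rt(text):
    """Return (retweet prefix or None, remaining text) for one tweet."""
    match = RT_PREFIX.match(text)
    if match is None:
        return None, text
    return match.group(0), text[match.end():].strip()

# Made-up examples: a real retweet, and a tweet that merely mentions "RT @...".
retweet = 'RT @WhiteHouse: LIVE: Press briefing'
mention = 'He said RT @someone: hello'

print(split_rt(retweet))  # ('RT @WhiteHouse:', 'LIVE: Press briefing')
print(split_rt(mention))  # (None, 'He said RT @someone: hello')
```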

- 5. Remove any leading or trailing whitespace in the text column, such as the space left behind by a removed URL

In [15]:
trump['text'] = trump['text'].str.strip()

In [25]:
trump.head()

Unnamed: 0,source,text,created_at,retweet_count,favorite_count,is_retweet,id_str,extracted_text_urls,extracted_RTs
0,,He is arguably the greatest president in our history.” Thank you @LouDobbs!,2020-05-27 22:58:18,18223,71416,False,1265779391646187520,https://t.co/6dfy0yxu9l,
1,Twitter for iPhone,At my request the FBI and the Department of Justice are already well into an investigation as to the very sad and tragic death in Minnesota of George Floyd....,2020-05-27 22:39:56,50111,247307,False,1265774767493148672,,
2,Twitter for iPhone,....I have asked for this investigation to be expedited and greatly appreciate all of the work done by local law enforcement. My heart goes out to George’s family and friends. Justice will be served!,2020-05-27 22:39:56,27476,141521,False,1265774770877902848,,
3,Twitter for iPhone,If the FISA Bill is passed tonight on the House floor I will quickly VETO it. Our Country has just suffered through the greatest political crime in its history. The massive abuse of FISA was a big part of it!,2020-05-27 22:16:31,40598,164483,False,1265768877427851265,,
4,Twitter for iPhone,Thank you to @NASA and @SpaceX for their hard work and leadership. Look forward to being back with you on Saturday!,2020-05-27 21:28:24,21816,152454,False,1265756765389418496,,


In [26]:
trump.to_csv('trump_tweets_cleaned.csv', index=False)  # index=False avoids an extra 'Unnamed: 0' column on reload

#### 2c) Tokenize: 

Write code to tokenize your entire dataset. Use at least three different types of tokenizers. Display results from all the tokenizers in a pandas dataframe so that you can visually compare them.

In [16]:
import nltk
from nltk.tokenize import word_tokenize, casual_tokenize, TweetTokenizer

In [17]:
# create an empty dataframe for holding the tokenized data later
token_df = pd.DataFrame()

In [18]:
# Define your tokenizers
word_tokenizer = word_tokenize
casual_tokenizer = casual_tokenize
tweet_tokenizer = TweetTokenizer().tokenize

In [19]:
# Tokenize the 'text' column using different tokenizers
token_df['Word Tokenizer'] = trump['text'].apply(word_tokenizer)
token_df['Casual Tokenizer'] = trump['text'].apply(casual_tokenizer)
token_df['Tweet Tokenizer'] = trump['text'].apply(tweet_tokenizer)

In [20]:
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [word for word in tokens if word.lower() not in stop_words]

In [21]:
token_df['Word Tokenizer'] = token_df['Word Tokenizer'].apply(remove_stopwords)
token_df['Casual Tokenizer'] = token_df['Casual Tokenizer'].apply(remove_stopwords)
token_df['Tweet Tokenizer'] = token_df['Tweet Tokenizer'].apply(remove_stopwords)

In [22]:
# Show full cell contents instead of truncating long token lists
pd.set_option('display.max_colwidth', None)

In [23]:
token_df.head()

Unnamed: 0,Word Tokenizer,Casual Tokenizer,Tweet Tokenizer
0,"[arguably, greatest, president, history., ”, Thank, @, LouDobbs, !]","[arguably, greatest, president, history, ., ”, Thank, @LouDobbs, !]","[arguably, greatest, president, history, ., ”, Thank, @LouDobbs, !]"
1,"[request, FBI, Department, Justice, already, well, investigation, sad, tragic, death, Minnesota, George, Floyd, ....]","[request, FBI, Department, Justice, already, well, investigation, sad, tragic, death, Minnesota, George, Floyd, ...]","[request, FBI, Department, Justice, already, well, investigation, sad, tragic, death, Minnesota, George, Floyd, ...]"
2,"[...., asked, investigation, expedited, greatly, appreciate, work, done, local, law, enforcement, ., heart, goes, George, ’, family, friends, ., Justice, served, !]","[..., asked, investigation, expedited, greatly, appreciate, work, done, local, law, enforcement, ., heart, goes, George, ’, family, friends, ., Justice, served, !]","[..., asked, investigation, expedited, greatly, appreciate, work, done, local, law, enforcement, ., heart, goes, George, ’, family, friends, ., Justice, served, !]"
3,"[FISA, Bill, passed, tonight, House, floor, quickly, VETO, ., Country, suffered, greatest, political, crime, history, ., massive, abuse, FISA, big, part, !]","[FISA, Bill, passed, tonight, House, floor, quickly, VETO, ., Country, suffered, greatest, political, crime, history, ., massive, abuse, FISA, big, part, !]","[FISA, Bill, passed, tonight, House, floor, quickly, VETO, ., Country, suffered, greatest, political, crime, history, ., massive, abuse, FISA, big, part, !]"
4,"[Thank, @, NASA, @, SpaceX, hard, work, leadership, ., Look, forward, back, Saturday, !]","[Thank, @NASA, @SpaceX, hard, work, leadership, ., Look, forward, back, Saturday, !]","[Thank, @NASA, @SpaceX, hard, work, leadership, ., Look, forward, back, Saturday, !]"


#### 2d) Pick the best tokenizer. 

Which one do you think works best for your data and why?

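 Answer: I will pick the TweetTokenizer, since it handles emojis and "@XXX" mentions better than word_tokenize, which splits "@" from the handle. Note that casual_tokenize is a convenience wrapper around TweetTokenizer, so with default settings the two produce the same tokens, as the table above shows.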
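To illustrate why handle-aware tokenization matters for tweets, here is a simplified regex stand-in. It is not NLTK's actual implementation, and both patterns are assumptions chosen for illustration. It contrasts a tokenizer that splits off every punctuation character with one that keeps @handles and #hashtags intact:

```python
import re

# Splits words and every punctuation mark separately, like a generic tokenizer.
GENERIC = re.compile(r'\w+|[^\w\s]')
# Tries @handles and #hashtags first, then falls back to the generic rule.
TWEET_AWARE = re.compile(r'[@#]\w+|\w+|[^\w\s]')

text = 'Thank you to @NASA and @SpaceX for their hard work!'

generic_tokens = GENERIC.findall(text)
tweet_tokens = TWEET_AWARE.findall(text)

print(generic_tokens)  # '@' and 'NASA' come out as separate tokens
print(tweet_tokens)    # '@NASA' and '@SpaceX' stay whole
```

Keeping handles whole matters later on, because a bare `@` token carries no information about who is being addressed.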

#### Q3) Analyze Data for Sentiment

#### 3a) Sentiment analysis: 

Pick at least three different ways of conducting sentiment analysis. Write code to loop through your data and find sentiment for each comment, using each of these methods. Report your observations: What do you observe? Are there any similarities or differences across the methods? In the next question, we will quantify these comparisons.


In [24]:
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from nltk.corpus import opinion_lexicon

ModuleNotFoundError: No module named 'textblob'

In [None]:
sentiment_df = token_df.drop(['Word Tokenizer','Casual Tokenizer'], axis = 1)
sentiment_df['original_text'] = trump['text']
sentiment_df.shape

- TextBlob

In [None]:
# Function to calculate sentiment using TextBlob
def textblob_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment

In [None]:
sentiment_df['TextBlob_Sentiment'] = sentiment_df['original_text'].apply(textblob_sentiment)

- VADER

In [None]:
analyzer = SentimentIntensityAnalyzer()
sentiment_df['VADER_Sentiment'] = sentiment_df['original_text'].apply(analyzer.polarity_scores)

- Opinion_lexicon

In [None]:
def lexicon_sentiment(text):
    positive_lexicon = set(opinion_lexicon.positive())
    negative_lexicon = set(opinion_lexicon.negative())
    
    tokens = text.lower().split()
    positive_count = sum(word in positive_lexicon for word in tokens)
    negative_count = sum(word in negative_lexicon for word in tokens)
    
    return positive_count, negative_count

In [None]:
sentiment_df['Lexicon_Sentiment'] = sentiment_df['original_text'].apply(lexicon_sentiment)

In [None]:
sentiment_df.head(1)

- Although the absolute scores differ in value, all three methods agree on whether the sentiment is positive or negative.
- opinion_lexicon takes much longer to run than the other two methods.
- TextBlob polarity seems comparable to the VADER compound score.
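One likely reason for the slow lexicon run is that `lexicon_sentiment` rebuilds both word sets from the corpus on every call. A minimal sketch of the usual fix is to build the sets once and reuse them. The tiny word sets and the function name `lexicon_counts` are hypothetical stand-ins, not the real lexicon:

```python
# Hypothetical stand-ins; in the notebook these would be built once from
# set(opinion_lexicon.positive()) and set(opinion_lexicon.negative()).
POSITIVE_WORDS = {'great', 'greatest', 'thank', 'appreciate'}
NEGATIVE_WORDS = {'sad', 'tragic', 'crime', 'abuse'}

def lexicon_counts(text, positive=POSITIVE_WORDS, negative=NEGATIVE_WORDS):
    """Count positive and negative lexicon hits in one text."""
    tokens = text.lower().split()
    positive_count = sum(token in positive for token in tokens)
    negative_count = sum(token in negative for token in tokens)
    return positive_count, negative_count

print(lexicon_counts('The very sad and tragic death'))  # (0, 2)
print(lexicon_counts('Thank you for the great work'))   # (2, 0)
```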

#### 3b) Quantitatively comparing methods: 

Is there a way to do a pairwise comparison of the methods that you picked? You need to report at least one pairwise comparison between the methods. Better if you are able to report all pairwise comparisons across all methods. In your pairwise comparison, compute a quantitative measure to show what proportion of comments match or do not match between two measures. You can also include the rationale behind the choice of your measure.

- TextBlob
    - Create a new column for the TextBlob polarity
    - Extract only the polarity from the TextBlob sentiment
    - Round the polarity scores to 1 decimal place

In [None]:
# Pull the polarity out of each TextBlob Sentiment tuple, then round
sentiment_df['TextBlob_Sentiment_Polarity'] = (
    sentiment_df['TextBlob_Sentiment'].apply(lambda s: s.polarity).round(1)
)

- VADER
    - Create a new column for the VADER compound score
    - Extract the compound score from the VADER result
    - Round the compound scores to 1 decimal place

In [None]:
# Pull the 'compound' entry out of each VADER score dictionary, then round
sentiment_df['VADER_Sentiment_Compound'] = (
    sentiment_df['VADER_Sentiment'].apply(lambda scores: scores['compound']).round(1)
)

- Opinion_Lexicon
    - Transform Lexicon_Sentiment into a single score
    - Give equal weight to positive and negative counts: +0.5 per positive word and -0.5 per negative word
    - Scale the lexicon sentiment column to lie between -1 and 1 like the other two

In [None]:
# Each entry is a (positive_count, negative_count) tuple
sentiment_df['Lexicon_Sentiment_weighted'] = sentiment_df['Lexicon_Sentiment'].apply(
    lambda counts: 0.5 * counts[0] - 0.5 * counts[1]
)

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))
sentiment_df['Lexicon_weighted_normalized'] = scaler.fit_transform(sentiment_df[['Lexicon_Sentiment_weighted']])
sentiment_df['Lexicon_weighted_normalized'] = round(sentiment_df['Lexicon_weighted_normalized'], 1)
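One caveat with min-max scaling to [-1, 1]: it maps the smallest score to -1 and the largest to 1, so a neutral score of 0 stays at 0 only when the raw range is symmetric. The sketch below uses made-up weighted scores and two hypothetical helper functions. It contrasts min-max scaling with dividing by the largest absolute value (the idea behind scikit-learn's `MaxAbsScaler`), which keeps 0 at 0:

```python
def min_max_scale(values, low=-1.0, high=1.0):
    """Linearly map min(values) to low and max(values) to high.

    Assumes the values are not all equal.
    """
    smallest, largest = min(values), max(values)
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in values]

def max_abs_scale(values):
    """Divide by the largest absolute value so that 0 stays at 0."""
    biggest = max(abs(v) for v in values)
    return [v / biggest for v in values]

weighted = [-1.0, 0.0, 0.5, 3.0]  # hypothetical, positively skewed scores

print(min_max_scale(weighted))  # [-1.0, -0.5, -0.25, 1.0]
print(max_abs_scale(weighted))  # the neutral 0.0 stays 0.0
```

With the skewed sample, min-max scaling turns a neutral tweet into -0.5, which would count as a mismatch against TextBlob or VADER even though all three consider it neutral.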

- Pairwise comparison

In [None]:
TB_VADER_matches = (sentiment_df['TextBlob_Sentiment_Polarity'] == sentiment_df['VADER_Sentiment_Compound']).sum()
TB_VADER_percentage_matches = (TB_VADER_matches / len(sentiment_df)) * 100

print(f"Percentage of TextBlob and VADER matching values: {TB_VADER_percentage_matches:.2f}%")

In [None]:
TB_Lexicon_matches = (sentiment_df['TextBlob_Sentiment_Polarity'] == sentiment_df['Lexicon_weighted_normalized']).sum()
TB_Lexicon_percentage_matches = (TB_Lexicon_matches / len(sentiment_df)) * 100

print(f"Percentage of TextBlob and Lexicon weighted matching values: {TB_Lexicon_percentage_matches:.2f}%")
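Exact equality of rounded scores is a strict criterion, since two methods can agree that a tweet is positive yet differ by 0.1. A looser sketch compares only the sentiment class and covers every pair of methods. The ±0.05 cutoff is an assumption borrowed from the threshold commonly used with VADER's compound score, and the scores and helper names are made up for illustration:

```python
from itertools import combinations

def to_label(score, threshold=0.05):
    """Map a score in [-1, 1] to 'pos', 'neg', or 'neu'."""
    if score >= threshold:
        return 'pos'
    if score <= -threshold:
        return 'neg'
    return 'neu'

def agreement(scores_a, scores_b):
    """Share of items on which two methods assign the same class."""
    matches = sum(to_label(a) == to_label(b) for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)

# Hypothetical scores for four tweets from three methods.
scores = {
    'TextBlob': [0.5, -0.2, 0.0, 0.8],
    'VADER': [0.7, -0.6, 0.3, 0.6],
    'Lexicon': [0.2, 0.1, 0.0, 0.4],
}

pairwise = {
    (a, b): agreement(scores[a], scores[b])
    for a, b in combinations(scores, 2)
}
for pair, share in pairwise.items():
    print(pair, f'{share:.0%}')
```

On real data the lists would come from the three score columns of `sentiment_df`.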

#### 3c) Qualitative + Quantitative comparison: 

Write code to randomly pick 40 comments. By hand, mark each of their sentiments. Now pick two sentiment analysis methods of your choice to automatically find the sentiments of these 40 tweets. Considering your hand labels as the absolute ground truth, write code to determine which sentiment analysis method works better. It might be that both methods you picked do equally well. Provide a rationale for your response in a markdown cell.

HINT: The more the analysis method’s output matches with your labels, the better it is.

In [None]:
sampled_df = trump.sample(n=40, random_state=42).reset_index(drop=True)

In [None]:
sampled_df['hand_label'] = [-0.3, 0.7, 0.3, -0.2, -0.5, -0.7, -0.5, 0.4, -0.5, 0.8, 0.2, -0.2, -0.1,0.6,-0.3,0.8,0.3,-0.5,0.8,0.8,-0.1,-0.2,0.4,-0.3,0.4,-0.1,-0.3,0.7,0.1,0.1,0.3,-0.5,0.5,-0.5,0.7,-0.7,0,0.7,0.8,0.2]

- VADER

In [None]:
sampled_df['VADER'] = sampled_df['text'].apply(analyzer.polarity_scores)
# Extract the 'compound' score from each VADER result and round to 1 decimal place
sampled_df['VADER_Compound'] = sampled_df['VADER'].apply(lambda scores: scores['compound']).round(1)

- TextBlob

In [None]:
sampled_df['TextBlob'] = sampled_df['text'].apply(textblob_sentiment)
# Extract the polarity from each TextBlob result and round to 1 decimal place
sampled_df['TextBlob_Polarity'] = sampled_df['TextBlob'].apply(lambda s: s.polarity).round(1)

In [None]:
# Copy the selection so that adding columns does not write into a view of sampled_df
comparison = sampled_df[['hand_label', 'VADER_Compound', 'TextBlob_Polarity']].copy()
# Calculate absolute differences
comparison['VADER/Label'] = (comparison['hand_label'] - comparison['VADER_Compound']).abs()
comparison['TextBlob/Label'] = (comparison['hand_label'] - comparison['TextBlob_Polarity']).abs()

In [None]:
Vader_average = comparison['VADER/Label'].mean()
print("Average Difference between Vader compound and my hand label is: ", Vader_average)

In [None]:
TextBlob_average = comparison['TextBlob/Label'].mean()
print("Average Difference between TextBlob Polarity and my hand label is: ", TextBlob_average)

Since the average absolute difference is smaller for the VADER method, it performs better in this test against my hand labels. Nevertheless, my hand labels are not necessarily the best reference.

#### Q4) Analyze Data with LIWC
Pick at least three dimensions from LIWC that you would want to investigate on your data. You can find a version of the LIWC dictionary here. Write code to find what proportion of each of the two dimensions you picked are present in your data. Motivate your choice of dimension with a research question. For example, if you are curious to know the prevalence of angry comments, you can pick the “angry” dimension in LIWC. Another neat trick here would be to make your code modular via Python functions so that later you can reuse this function for computing across multiple dimensions.

In [None]:
import pandas as pd
import collections

In [None]:
liwc = pd.read_csv('LIWC2015 dictionary poster.xlsx - 2015-08-24-LIWC2015 - Poster.csv')

In [None]:
liwc_dict = collections.defaultdict(list)

for header_name in liwc:
    dict_key = "".join(filter(lambda x: not x.startswith('Unnamed:'), map(str, header_name))).strip('0123456789\n ')
    liwc_dict[dict_key] += list(filter(lambda x: not pd.isnull(x), liwc[header_name]))

In [None]:
liwc_dict['Anger'][:5]

In [None]:
liwc_dict.keys()

#### I would like to check for 'Anger', 'Negemo', 'Adj'

In [None]:
tokens = token_df['Tweet Tokenizer']

In [None]:
def detect_liwc(liwc_key, tweet_tokens):
    """Return one 0/1 flag per tweet: 1 if it contains any keyword of the LIWC category."""
    emotion_keywords = set(liwc_dict[liwc_key])  # set for fast membership tests
    flags = []
    for sentence in tweet_tokens:
        has_keyword = any(token in emotion_keywords for token in sentence)
        # Append 1 if an emotion keyword is detected, otherwise append 0
        flags.append(1 if has_keyword else 0)
    return flags

In [None]:
anger_flags = detect_liwc('Anger', tokens)
print(f"The percentage of tweets containing Anger words is: {sum(anger_flags)/len(anger_flags)*100:.2f}%")

In [None]:
negemo_flags = detect_liwc('Negemo', tokens)
print(f"The percentage of tweets containing Negative Emotion words is: {sum(negemo_flags)/len(negemo_flags)*100:.2f}%")

In [None]:
adj_flags = detect_liwc('Adj', tokens)
print(f"The percentage of tweets containing Adjective words is: {sum(adj_flags)/len(adj_flags)*100:.2f}%")
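Two details may cause `detect_liwc` to undercount. The tokens keep their original casing, and LIWC dictionaries typically mark word stems with a trailing asterisk (for example `hate*`), which an exact `in` test never matches. A hedged sketch of case-insensitive, stem-aware matching follows. The keyword list is a made-up stand-in for `liwc_dict['Anger']`, and the helper names are hypothetical:

```python
def split_entries(entries):
    """Separate exact words from stem prefixes (entries ending in '*')."""
    exact = {e.lower() for e in entries if not e.endswith('*')}
    stems = tuple(e[:-1].lower() for e in entries if e.endswith('*'))
    return exact, stems

def share_of_tweets_with(entries, tokenized_tweets):
    """Proportion of tweets containing at least one matching token."""
    exact, stems = split_entries(entries)
    hits = 0
    for tokens in tokenized_tweets:
        lowered = [t.lower() for t in tokens]
        # str.startswith accepts a tuple of prefixes (an empty tuple never matches).
        if any(t in exact or t.startswith(stems) for t in lowered):
            hits += 1
    return hits / len(tokenized_tweets)

anger_entries = ['hate*', 'kill*', 'annoyed']  # hypothetical dictionary entries
tweets = [
    ['They', 'HATED', 'it'],
    ['Thank', 'you', '!'],
    ['So', 'annoyed'],
    ['Great', 'job'],
]

print(share_of_tweets_with(anger_entries, tweets))  # 0.5
```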

#### Q5) Analyze Data over time
How does the polarity (sentiment) of your corpus change over time? Answer this question by showing plots. You need to plot polarity for at least two of your three sentiment analyzers chosen earlier.

In [None]:
import matplotlib.pyplot as plt

In [None]:
trump['Vader_compound'] = sentiment_df['VADER_Sentiment_Compound']
trump['TextBlob_Polarity'] = sentiment_df['TextBlob_Sentiment_Polarity']

In [None]:
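smooth_df = trump
# Average only the two sentiment columns per month; text columns cannot be averaged
df_grouped = smooth_df.groupby(smooth_df['created_at'].dt.to_period('M'))[['Vader_compound', 'TextBlob_Polarity']].mean()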

In [None]:
# Plot
plt.figure(figsize=(16, 6))
plt.plot(df_grouped.index.to_timestamp(), df_grouped['Vader_compound'], label='Vader Compound', marker='o')
plt.plot(df_grouped.index.to_timestamp(), df_grouped['TextBlob_Polarity'], label='TextBlob Polarity', marker='x')
plt.xlabel('Timestamp')
plt.ylabel('Sentiment')
plt.title('Sentiment Over Time')
plt.legend()
plt.xticks(rotation=45)  
plt.show()

#### Drawing conclusions from the plots: 
What do you observe from the plots? Can you draw conclusions from your plots based on how the election campaigns were unfolding in the real world? What else can you infer from the plot?

- Answer:

    An obvious trend in Trump's tweets is that the sentiment scores trend downward, especially in the 2019-2020 period. During that time he faced investigations and impeachment proceedings, so he spent a lot of time tweeting about those matters and adopted a negative tone towards many people.