<center>

# **SENTIMENT ANALYSIS**

</center>

We perform the analysis on the whole dataset, without separately classifying the messages related to mental health.

In this code, we use the **SentimentIntensityAnalyzer** class from the **VaderSentiment library**. The polarity_scores() method of the analyzer returns a dictionary of sentiment scores, including the compound score, a normalized value between -1 (most negative) and +1 (most positive) that summarizes the overall sentiment.

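Based on the compound score, we classify the sentiment as positive, negative, or neutral using thresholds of 0.05 and -0.05: scores at or above 0.05 are positive, scores at or below -0.05 are negative, and anything in between is neutral.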
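As a minimal sketch of this thresholding rule, the helper below maps a compound score to a label. The name `classify_compound` and its `threshold` parameter are illustrative assumptions and are not part of the VaderSentiment library; the example scores are made up by hand.

```python
def classify_compound(compound_score, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label (hypothetical helper)."""
    if compound_score >= threshold:
        return 'Positive'
    if compound_score <= -threshold:
        return 'Negative'
    return 'Neutral'


# Example scores chosen by hand for illustration, not taken from the data
for score in (0.62, 0.01, -0.30):
    print(score, classify_compound(score))
```

Keeping the threshold as a parameter makes it easy to check how sensitive the positive/neutral/negative split is to the cut-off.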

You'll need to install the VaderSentiment library before running this code:

In [None]:
pip install vaderSentiment

In [None]:
import pandas as pd
import os
from textblob import TextBlob
import re
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np


In [None]:
#Mount your Google Drive to Colab

from google.colab import drive
drive.mount('/content/drive')

Set the name of the fire you want to analyze, then run the cell

In [None]:
#file_name = "Tubbs_and_Thomas.csv"
wildfire_name = 'Tubbs'
file_name     = "2.TubbsFire_wildfire_words.csv"
file_name2    = '1.Tubbs_version1.csv'

In [None]:
#Set path for files
files_path        = '/content/drive/MyDrive/Mental_Health_Wildfire/Twitter_Data/Tubbs_Codes/Data/'
output_files_path = '/content/drive/MyDrive/Mental_Health_Wildfire/Twitter_Data/Tubbs_Codes/Data/'

In [None]:
# Define the data columns types
dtypes = {
    "tweet_id": "object",
    "tweet_text": "str",
    "tweet_possibly_sensitive": "bool",
    "tweet_source": "object",
    "tweet_lang": "str",
    "tweet_retweet_count": "object",
    "tweet_reply_count":"object",
    "tweet_like_count": "object",
    "tweet_quote_count": "object",
    "tweet_impression_count": "object",
    "user_id":"object",
    "user_username": "object",
    "user_verified":"object",
    "user_protected":"object",
    "user_description":"str",
    "user_profile_image_url":"object",  # URLs are text, not numbers
    "user_location":"object",           # free-text location, not a number
    "user_followers_count":"float",
    "user_friends_count":"float",
    "user_tweet_count":"float",
    "place_id":"object",
    "place_name": "object",
    "place_full_name":"object",
    "place_country":"object",
    "place_country_code":"object",
    "place_type":"object",
    "clean_text":"str"
}

In [None]:
# Load the CSV file into a Pandas DataFrame

df = pd.read_csv(os.path.join(files_path, file_name),dtype=dtypes)
df["tweet_created_at"] = pd.to_datetime(df["tweet_created_at"])

# Remove the hour, minute, and second information
df["tweet_created_at"] = df["tweet_created_at"].dt.date

os.listdir(files_path)

In [None]:
# Load the CSV file into a Pandas DataFrame

df2 = pd.read_csv(os.path.join(files_path, file_name2),dtype=dtypes)
df2["tweet_created_at"] = pd.to_datetime(df2["tweet_created_at"])

# Remove the hour, minute, and second information
df2["tweet_created_at"] = df2["tweet_created_at"].dt.date


In [None]:
print(df.shape)
print(df2.shape)

In [None]:
# Convert the 'tweet_created_at' columns back to datetime format
# (they hold plain date objects after the loading cells above)
df['tweet_created_at']  = pd.to_datetime(df['tweet_created_at'])
df2['tweet_created_at'] = pd.to_datetime(df2['tweet_created_at'])

# Filter data from October 8th to October 31st, 2017
start_date = pd.to_datetime('2017-10-08')
end_date   = pd.to_datetime('2017-10-31')

filtered1 = df[(df['tweet_created_at'] >= start_date)   & (df['tweet_created_at'] <= end_date)]
filtered2 = df2[(df2['tweet_created_at'] >= start_date) & (df2['tweet_created_at'] <= end_date)]

# Remove rows corresponding to October 13, 21, and 24, 2017
dates_to_remove = [
    pd.to_datetime('2017-10-13').date(),
    pd.to_datetime('2017-10-21').date(),
    pd.to_datetime('2017-10-24').date()
]

filtered_df  = filtered1[~filtered1['tweet_created_at'].dt.date.isin(dates_to_remove)]
filtered_df2 = filtered2[~filtered2['tweet_created_at'].dt.date.isin(dates_to_remove)]


print(filtered_df.shape)
print(filtered_df2.shape)
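
The date-window filter and day exclusion in the cell above can also be sketched on toy data. This is only an illustration under stated assumptions: the rows, and the names `toy`, `excluded_days`, and `toy_filtered`, are invented for the example and do not come from the Tubbs files.

```python
import pandas as pd

# Toy data invented for illustration; the notebook reads the real rows from CSV
toy = pd.DataFrame({
    'tweet_created_at': ['2017-10-07 23:10:00', '2017-10-08 09:30:00',
                         '2017-10-13 12:00:00', '2017-10-20 18:45:00',
                         '2017-11-01 00:05:00'],
    'clean_text': ['a', 'b', 'c', 'd', 'e'],
})

# Parse once and drop the time of day while keeping a datetime64 column
toy['tweet_created_at'] = pd.to_datetime(toy['tweet_created_at']).dt.normalize()

start_date = pd.Timestamp('2017-10-08')
end_date = pd.Timestamp('2017-10-31')
excluded_days = pd.to_datetime(['2017-10-13', '2017-10-21', '2017-10-24'])

# between() is inclusive on both ends by default
in_window = toy['tweet_created_at'].between(start_date, end_date)
not_excluded = ~toy['tweet_created_at'].isin(excluded_days)

# .copy() returns an independent frame, so later column assignments
# do not trigger SettingWithCopyWarning
toy_filtered = toy[in_window & not_excluded].copy()
print(toy_filtered['clean_text'].tolist())
```

Normalizing the timestamps instead of converting them to plain date objects keeps the column comparable with `pd.Timestamp` values, so no second conversion is needed before filtering.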



# Plot total tweets per day

In [None]:
df_grouped  = filtered_df.groupby(filtered_df['tweet_created_at']).size().reset_index(name='total_tweets')
df2_grouped = filtered_df2.groupby(filtered_df2['tweet_created_at']).size().reset_index(name='total_tweets')
#------------------------------
print(df_grouped.shape)
print(df2_grouped.shape)

# Compute the average of total tweets
average_tweets = np.mean(df_grouped['total_tweets'])

# Plot filled area charts (the earlier bar-chart version is kept commented out)
#plt.bar(df_grouped['tweet_created_at'], df_grouped['total_tweets'],  color='blue',   alpha= 1,  label='All Tweets')
#plt.bar(df2_grouped['tweet_created_at'], df2_grouped['total_tweets'],color='orange', alpha= 0.8,   label='Wildfire Related Tweets')
plt.fill_between(df2_grouped['tweet_created_at'], df2_grouped['total_tweets'], color='blue',   alpha= 0.7,   label='All Tweets')
plt.fill_between(df_grouped['tweet_created_at'], df_grouped['total_tweets'],   color='orange', alpha= 0.9,     label='Wildfire Related Tweets')
plt.ylim(0,df2_grouped['total_tweets'].max()+1000)
plt.xlim(start_date, end_date)
plt.xlabel('Date (days)')
plt.ylabel('Total Tweets')
plt.title('')
plt.xticks(rotation=0)

# Set the x-axis tick labels to be formatted dates
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%b-%d'))

# Show the legend to differentiate between df_grouped and df2_grouped
plt.legend()

# Eliminate the space between the axis and the figure
plt.tight_layout()

##save figures
#figure_path = '/content/drive/MyDrive/Mental_Health_Wildfire/Twitter_Data/Tubbs_Codes/Figures/'
#figure_name = wildfire_name+'_Number_of_tweets_per_day'


# Save the figure as a PDF
#output_file = os.path.join(figure_path,figure_name)
#plt.savefig(output_file,format='pdf')

total_sum = df2_grouped['total_tweets'].sum()
print(total_sum)

plt.show()

# Sentiment analysis


In this code, the clean text is passed to the get_sentiment() function, which calculates the sentiment scores using analyzer.polarity_scores(). It extracts the compound score, determines the sentiment category (positive, negative, or neutral) from it, and returns the compound score.

Passing the clean text directly to the analyzer is a valid approach here, because VADER splits the text into tokens internally and was designed for short social media text. This simplifies the workflow: no separate tokenization step is needed, so the focus stays on the sentiment analysis itself.

In [None]:
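# Create the analyzer once; building it inside the function would
# reload the lexicon for every tweet
analyzer = SentimentIntensityAnalyzer()

def get_sentiment(text):
    # Empty cells are read as NaN (a float), which the analyzer cannot score
    if not isinstance(text, str):
        return float('nan')

    sentiment_scores = analyzer.polarity_scores(text)

    # Extract the compound sentiment score
    compound_score = sentiment_scores['compound']

    # Classify with the 0.05 / -0.05 thresholds described in the introduction
    if compound_score >= 0.05:
        sentiment = 'Positive'
    elif compound_score <= -0.05:
        sentiment = 'Negative'
    else:
        sentiment = 'Neutral'

    # Only the numeric score is returned; no per-tweet printing
    return compound_score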

## Option 1

In [None]:
# Apply the sentiment analysis function to each row of the DataFrame;
# the 'sentiment' column stores the compound score returned by get_sentiment().
# Use this option only if the file is not too large

df['sentiment'] = df['clean_text'].apply(get_sentiment)

# Save the updated DataFrame to a new CSV file
output_name = 'SA_Tubbs_tweets.csv'
df.to_csv(os.path.join(output_files_path, output_name), index=False)
