# Sentiment Analysis and Twitter Post Popularity: A Data-Driven Investigation of the Correlation

The main goal of this research is to investigate the correlation between the sentiment of tweets posted by popular people and the popularity of those tweets (number of likes and shares).

Two datasets are used for this project: a labeled dataset (0 for negative and 1 for positive sentiment) and an unlabeled one. No single dataset provides both the labels and the features required for this investigation, which is why this approach is taken.

The main features required in a single dataframe to conduct this research are: the tweet text, the number of likes for that tweet, the number of shares for that tweet, and the sentiment score for that tweet, so that the correlation can be computed. Because the research could not be conducted with a single dataset, the following steps are followed:

1. After all the required preprocessing of the data, a classification model is trained on the labeled dataset.

2. The trained classification model is then used to predict the labels for the unlabeled dataset, along with a sentiment score, which in this case is computed using the classifier's `predict_proba` function.

3. The popularity score is then computed from two features: the number of likes and the number of shares.

4. Finally, the linear (Pearson) and rank-based (Spearman) correlations between popularity and sentiment score are computed.

The steps are explained in more detail, along with the code, in the notebook.

Datasets used: 

Training Dataset: Dataset from Sentiment140 (link: https://docs.google.com/file/d/0B04GJPshIjmPRnZManQwWEdTZjg/edit?resourcekey=0-betyQkEmWZgp8z0DFxWsHw)

Main Dataset: Available along with the notebook. 

Libraries: 
All the required libraries are listed in the requirements.txt file. Install them from a notebook cell with this command (drop the leading `!` when running it in a terminal):
!pip install -r requirements.txt

### Importing necessary libraries

In [None]:
#importing the libraries
import pandas as pd
import numpy as np
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from multiprocessing import Pool, cpu_count
import matplotlib.pyplot as plt
import seaborn as sns
from nltk import FreqDist
from wordcloud import WordCloud
from nltk.tokenize import RegexpTokenizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from scipy.stats import spearmanr
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

In [None]:
# Download required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

### Reading the data

This is the first dataset which is labeled (0 for negative sentiment tweets and 1 for positive sentiment tweets). We will use this dataset to train a classification model so that we can use it for predicting labels and sentiment score for our main dataset which is unlabeled. 

In [None]:
#reading the data
data = pd.read_csv('trainingdata.csv', encoding='ISO-8859-1', header=None)
# naming the columns
data = data.rename(columns={0: 'label', 1: 'tweet_id', 2: 'date', 3: 'query', 4: 'user', 5: 'tweet_text'})
# replacing 4 with 1 for the positive tweets
data['label'] = data['label'].replace(4, 1)

In [None]:
data

In [None]:
# dropping the query column as we don't need it
data = data.drop('query', axis=1)
data

In [None]:
# reindexing the columns
data = data[['tweet_id', 'user', 'date', 'tweet_text', 'label']]
data

In [None]:
# checking the class ratio in the data

num_pos = (data["label"] == 1).sum()
num_neg = (data["label"] == 0).sum()

print("Number of positive tweets:", num_pos)
print("Number of negative tweets:", num_neg)

In [None]:
data_01 = data.copy()

In [None]:
# removing @mentions from the tweet text
# we are doing this by applying a lambda function on the column 'tweet_text'
# the lambda function substitutes the '@...' with nothing
data_01['tweet_text'] = data_01['tweet_text'].apply(lambda x: re.sub(r'@[^\s]+', '', x))

In [None]:
# removing links(urls) from the tweet text
# we are doing this by applying a lambda function to the column 'tweet_text'.
# the lambda function substitutes the 'https...' with nothing
data_01["tweet_text"] = data_01["tweet_text"].apply(lambda x: re.sub(r"http\S+", "", x))

In [None]:
# removing #hashtags from the tweet text
# we are doing this by applying a lambda function to the column 'tweet_text'.
# the lambda function substitutes the '#...' with nothing
data_01["tweet_text"] = data_01["tweet_text"].apply(lambda x: re.sub(r"#\S+", "", x))
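
The three substitutions above (mentions, URLs, hashtags) can be checked on a single made-up tweet. The sketch below is only an illustration: the helper name `strip_twitter_markup` and the example string are assumptions, and the final whitespace collapse is an addition that the notebook cells do not perform.

```python
import re

# Same patterns as in the cells above, compiled once.
MENTION = re.compile(r"@\S+")
URL = re.compile(r"http\S+")
HASHTAG = re.compile(r"#\S+")


def strip_twitter_markup(text: str) -> str:
    """Hypothetical helper: remove @mentions, URLs and #hashtags in that order."""
    for pattern in (MENTION, URL, HASHTAG):
        text = pattern.sub("", text)
    # Collapse the gaps left behind by the removed tokens (not done in the notebook).
    return " ".join(text.split())


example = "@alice loving the new update https://t.co/abc123 #happy so much"
print(strip_twitter_markup(example))
```

Note that the hashtag pattern removes the whole tag, including the word after `#`, so any sentiment carried by a hashtag is discarded.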

In [None]:
#copying the data_01
data_02 = data_01.copy()
data_02

In [None]:
# lowercase the tweet text 
# here we are using a lambda function to lowercase the column 'tweet_text'
# and storing it in new column called 'tweet_text_cleaned' .
data_02['tweet_text_cleaned'] = data_02['tweet_text'].apply(lambda x: x.lower())

In [None]:
# function to remove all the emojis in the text
# we are defining a function that removes (substitutes with nothing) every character that falls in the
# Unicode emoji ranges listed in the pattern.
# afterwards, we are applying this function to the column 'tweet_text_cleaned'
def remove_emojis(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  
        u"\U0001F300-\U0001F5FF"  
        u"\U0001F680-\U0001F6FF"  
        u"\U0001F1E0-\U0001F1FF" 
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

data_02['tweet_text_cleaned'] = data_02['tweet_text_cleaned'].apply(remove_emojis)


In [None]:
# remove all the punctuations from the text
data_02['tweet_text_cleaned'] = data_02['tweet_text_cleaned'].str.translate(str.maketrans('', '', string.punctuation))

In [None]:
# tokenize each string in the text using word_tokenize() function from NLTK library
data_02['tweet_text_cleaned'] = data_02['tweet_text_cleaned'].apply(lambda x: word_tokenize(x))

In [None]:
# removing stop words from the text using stopwords module from NLTK library
# we are first storing in a set all the stopwords from language 'english'
# afterwards, we are applying a lambda function to remove all the stopwords stored in the set
# from the column 'tweet_text_cleaned'
stop_words = set(stopwords.words('english'))
data_02['tweet_text_cleaned'] = data_02['tweet_text_cleaned'].apply(lambda x: [token for token in x if token not in stop_words])


In [None]:
# lemmatize the words in the text using WordNetLemmatizer from NLTK library
# here we are also using a lambda function to apply WordNetLemmatizer to the words
lemmatizer = WordNetLemmatizer()
data_02['tweet_text_cleaned'] = data_02['tweet_text_cleaned'].apply(lambda x: [lemmatizer.lemmatize(token) for token in x])


In [None]:
# joining the tokens in each tweet into a single string
# we are using lambda function for this operation as well
data_02['tweet_text_cleaned'] = data_02['tweet_text_cleaned'].apply(lambda x: ' '.join(x))

In [None]:
# printing the dataframe 'data_02'
data_02

### Visualization of data

#### Horizontal Bar Plot for 20 most frequent words

In [None]:
# With the following code we are plotting 20 most frequent positive words from the column 'tweet_text_cleaned'

# For Positive(1) Label

sns.set(style = 'white')
# Subset the positive tweets
all_words_df = data_02[data_02['label'] == 1]

# Extracts words into list and count frequency
all_words = ' '.join([text for text in all_words_df ['tweet_text_cleaned']])
all_words = all_words.split()
words_df = FreqDist(all_words)

# Extracting words and frequency from words_df object
words_df = pd.DataFrame({'word':list(words_df.keys()), 'count':list(words_df.values())})

# Subsets top 20 words by frequency
words_df = words_df.nlargest(columns="count", n = 20) 

words_df.sort_values('count', inplace = True)

# Plotting 20 most frequent words
plt.figure(figsize=(20,5))
ax = plt.barh(words_df['word'], width = words_df['count'])
plt.show()
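
The counting step in the cell above can also be expressed with `collections.Counter`, which NLTK's `FreqDist` subclasses. The sketch below is a minimal stand-alone illustration on made-up strings; the function name `top_words` and the sample data are assumptions, not part of the notebook.

```python
from collections import Counter


def top_words(texts, n=20):
    """Hypothetical helper: return the n most frequent whitespace-separated tokens."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    # most_common returns (word, count) pairs sorted by descending count
    return counts.most_common(n)


sample = ["good day good mood", "good morning", "bad day"]
print(top_words(sample, n=2))
```

Updating the counter tweet by tweet avoids building one very large joined string, which may matter for a dataset of this size.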

In [None]:
# With the following code we are plotting 20 most frequent negative words from the column 'tweet_text_cleaned'

# For Negative(0) Label

sns.set(style = 'white')
# Subset the negative tweets
all_words_df = data_02[data_02['label'] == 0]

# Extracts words into list and count frequency
all_words = ' '.join([text for text in all_words_df ['tweet_text_cleaned']])
all_words = all_words.split()
words_df = FreqDist(all_words)

# Extracting words and frequency from words_df object
words_df = pd.DataFrame({'word':list(words_df.keys()), 'count':list(words_df.values())})

# Subsets top 20 words by frequency
words_df = words_df.nlargest(columns="count", n = 20) 

words_df.sort_values('count', inplace = True)

# Plotting 20 most frequent words
plt.figure(figsize=(20,5))
ax = plt.barh(words_df['word'], width = words_df['count'])
plt.show()

In [None]:
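# Removing the standalone token 'im' from all the texts as it had the most frequency for both labels
# (the word boundaries \b keep words that merely contain 'im', such as 'time', intact)
data_02['tweet_text_cleaned'] = data_02['tweet_text_cleaned'].str.replace(r'\bim\b', '', regex=True)
data_02['tweet_text_cleaned'] = data_02['tweet_text_cleaned'].str.strip()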
data_02
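
When a short token such as 'im' is removed, it is worth checking whether the replacement is restricted to whole words. The sketch below compares a plain substring replacement with a word-boundary regular expression on a made-up sentence; the variable names and the example text are assumptions for illustration only.

```python
import re

text = "im running out of time im sure"

# Plain substring replacement also hits the 'im' inside 'time'.
naive = text.replace("im", "")

# A word-boundary pattern only removes 'im' when it is a whole token.
whole_word = re.sub(r"\bim\b", "", text)
whole_word = " ".join(whole_word.split())  # tidy up leftover spaces

print(repr(naive))
print(repr(whole_word))
```

The same pattern works with pandas through `Series.str.replace(..., regex=True)`.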

#### Word Cloud Visualization

In [None]:
# Building a Word Cloud for positive (1) label
# getting the positive values

word_cloud_df_pos = data_02[data_02['label'] == 1]

# joining the positive words and storing them in the variable all_words_pos
all_words_pos = ' '.join([text for text in word_cloud_df_pos['tweet_text_cleaned']])
# building a word cloud with the positive words
wordcloud_pos = WordCloud(width = 800, height = 800, 
                      background_color ='white', 
                      min_font_size = 10).generate(all_words_pos)

#plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud_pos) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
plt.show()

In [None]:
# Building a Word Cloud for negative (0) label
# getting the negative values

word_cloud_df_neg = data_02[data_02['label'] == 0]

# joining the negative words and storing them in the variable all_words_neg
all_words_neg = ' '.join([text for text in word_cloud_df_neg['tweet_text_cleaned']])
# building a word cloud with the negative words
wordcloud_neg = WordCloud(width = 800, height = 800, 
                      background_color ='white', 
                      min_font_size = 10).generate(all_words_neg)

#plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud_neg) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
plt.show()

In [None]:
#storing the copy of data_02 in data_final 
data_final = data_02.copy()
data_final

In [None]:
# just choosing the feature that will be used to train the model and dropping the rest of the columns
data_final = data_final.drop(['tweet_id', 'user', 'date', 'tweet_text'], axis=1)
data_final

In [None]:
# creating a regular expression tokenizer 
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
# creating a count vectorizer object with English stop words and unigram tokens
cv = CountVectorizer(stop_words='english',ngram_range = (1,1),tokenizer = token.tokenize)
# fitting the count vectorizer to the tweet text and transforming the text into matrix of token counts
text_counts = cv.fit_transform(data_final['tweet_text_cleaned'])

In [None]:
# Splitting the data into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, data_final['label'], test_size=0.25, random_state=5, stratify=data_final['label'])

In [None]:
# hyperparameter tuning with GridSearchCV for the Multinomial Naive Bayes classifier
# the candidate alpha values are stored in 'parameters'; GridSearchCV then evaluates MNB
# for each of them by fitting on X_train and Y_train.
parameters = {'alpha': [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]}
MNB = MultinomialNB()
clf = GridSearchCV(MNB, parameters)
clf.fit(X_train, Y_train)

# Printing the best parameters and the best score
print(clf.best_params_)
print(clf.best_score_)

In [None]:
# Training the model with the best alpha found by the grid search
MNB = MultinomialNB(alpha=clf.best_params_['alpha'])
MNB.fit(X_train, Y_train)

In [None]:
# predicting on the testing dataset and printing the accuracy score
predicted = MNB.predict(X_test)
accuracy_score = metrics.accuracy_score(Y_test, predicted)
print("Accuracy Score: ",accuracy_score)

In [None]:
# Displaying confusion matrix and computing precision, recall, and F1 score
cm = confusion_matrix(Y_test, predicted)

precision = precision_score(Y_test, predicted)
recall = recall_score(Y_test, predicted)
f1 = f1_score(Y_test, predicted)
sns.heatmap(cm, annot=True, cmap="Blues", fmt="d")

plt.title("Confusion Matrix")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

print('Precision: ', precision)
print('Recall: ', recall)
print('F1 score: ', f1)
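
To make the relation between the confusion matrix and the printed scores explicit, the sketch below recomputes precision, recall, F1 and accuracy from the four cell counts. The counts are made up for illustration and are not results from this notebook; `binary_metrics` is a hypothetical helper. For binary labels 0 and 1, scikit-learn's `confusion_matrix` is laid out as `[[tn, fp], [fn, tp]]`.

```python
def binary_metrics(tn, fp, fn, tp):
    """Hypothetical helper: derive the usual scores from confusion-matrix counts."""
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of true positives that were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up counts, NOT taken from the notebook's output.
scores = binary_metrics(tn=70, fp=30, fn=20, tp=80)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

With the notebook's own matrix, the counts could be unpacked with `tn, fp, fn, tp = cm.ravel()`.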

## Main Dataset

This is our main dataset, which does not have labels but does have the number of likes and number of shares as features. Since we have already trained a classification model, we will use it to predict the label and the sentiment score for each tweet in this dataset.

In [None]:
# Reading the data
df = pd.read_csv('clean_tweets_final.csv')
df

In [None]:
# checking for null values
print(df.isnull().sum())

In [None]:
# removing null values
df.dropna(inplace=True)
df

##### Applying the same approaches for preprocessing as for the training dataset

In [None]:
# function to remove punctuation marks
# 'pattern' matches every character that is not a word character, whitespace, or underscore
# 'text_without_punct' holds the text after substituting those characters with nothing ('')
# afterwards, underscores are replaced with a space (' ') and 'text_without_punct' is returned
# finally, we apply this function to the 'content_without_stopwords' column
def remove_punctuation(text):
    pattern = r'[^\w\s_]'
    text_without_punct = re.sub(pattern, '', text)
    # Replace underscores with spaces
    text_without_punct = text_without_punct.replace('_', ' ')
    return text_without_punct

df['content_without_stopwords'] = df['content_without_stopwords'].apply(remove_punctuation)


In [None]:
# Removing the standalone token 'im' from all the texts as it had the most frequency for both labels in this dataset as well
# (the word boundaries \b keep words that merely contain 'im', such as 'time', intact)
df['content_without_stopwords'] = df['content_without_stopwords'].str.replace(r'\bim\b', '', regex=True)
df['content_without_stopwords'] = df['content_without_stopwords'].str.strip()
df

In [None]:
# tokenize each string in the text using word_tokenize() function from NLTK library
# this is done using a lambda function
df['content_without_stopwords'] = df['content_without_stopwords'].apply(lambda x: word_tokenize(x))

In [None]:
# removing stop words from the text using stopwords module from NLTK library
# we are storing the stopwords from english language inside 'stop_words'
# afterwards, we are using a lambda function to remove these stopwords from the text
stop_words = set(stopwords.words('english'))
df['content_without_stopwords'] = df['content_without_stopwords'].apply(lambda x: [token for token in x if token not in stop_words])

In [None]:
# lemmatize the words in the text using WordNetLemmatizer from NLTK library
# this is done by applying WordNetLemmatizer() on the column 'content_without_stopwords' with a lambda function
lemmatizer = WordNetLemmatizer()
df['content_without_stopwords'] = df['content_without_stopwords'].apply(lambda x: [lemmatizer.lemmatize(token) for token in x])

In [None]:
# joining the tokens in each tweet into a single string
# this is also done using a lambda function
df['content_without_stopwords'] = df['content_without_stopwords'].apply(lambda x: ' '.join(x))

In [None]:
# printing df
df

In [None]:
#making the copy of df as df_pretrained
df_pretrained = df.copy()

In [None]:
# extracting the feature column from the dataframe
feature_df = df['content_without_stopwords']
# preprocessing the feature data using the same CountVectorizer that we used during training
feature_counts = cv.transform(feature_df)
# getting the predicted probabilities for each class
label_probabilities = MNB.predict_proba(feature_counts)
sentiment_scores = label_probabilities[:, 1]
# adding sentiment_score column in the dataframe (1 being extremely positive and 0 being extremely negative)
df['sentiment_score'] = sentiment_scores
# adding a label column to the dataframe (0 or 1)
predicted_labels = MNB.predict(feature_counts)
df['label'] = predicted_labels
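
To illustrate where a sentiment score between 0 and 1 comes from, the sketch below implements a tiny multinomial Naive Bayes from scratch and returns the posterior probability of each class. It is a simplified illustration under stated assumptions, not scikit-learn's implementation: the function names, the toy documents and the smoothing value `alpha=1.0` are all made up for this example.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Toy trainer: class log-priors and additively smoothed word log-likelihoods."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(word for doc in class_docs for word in doc.split())
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik


def predict_proba(text, log_prior, log_lik):
    """Posterior probability per class; words outside the vocabulary are ignored."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][w] for w in text.split() if w in log_lik[c])
        for c in log_prior
    }
    # Normalise the log scores into probabilities (numerically stable softmax).
    top = max(scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}


# Made-up training data: 1 = positive, 0 = negative.
docs = ["love great day", "great happy love", "sad bad day", "hate bad sad"]
labels = [1, 1, 0, 0]
log_prior, log_lik = train_multinomial_nb(docs, labels)

positive = predict_proba("great love", log_prior, log_lik)
negative = predict_proba("bad sad", log_prior, log_lik)
print("sentiment score for 'great love':", positive[1])
print("sentiment score for 'bad sad':", negative[1])
```

The value for class 1 plays the role of the `sentiment_score` column: values near 1 mean the model considers the text strongly positive, values near 0 strongly negative.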

In [None]:
#printing df
df

In [None]:
# storing the copy of df inside df_final
df_final = df.copy()

### Visualization of data

In [None]:
# plotting the most common positive 20 words

# For Positive(1) Label

sns.set(style = 'white')
# Subset the tweets predicted as positive
all_words_df = df_final[df_final['label'] == 1]

# Extracts words into list and count frequency
all_words = ' '.join([text for text in all_words_df ['content_without_stopwords']])
all_words = all_words.split()
words_df = FreqDist(all_words)

# Extracting words and frequency from words_df object
words_df = pd.DataFrame({'word':list(words_df.keys()), 'count':list(words_df.values())})

# Subsets top 20 words by frequency
words_df = words_df.nlargest(columns="count", n = 20) 

words_df.sort_values('count', inplace = True)

# Plotting 20 most frequent words
plt.figure(figsize=(20,5))
ax = plt.barh(words_df['word'], width = words_df['count'])
plt.show()

In [None]:
# plotting the most common 20 negative words

# For Negative(0) Label

sns.set(style = 'white')
# Subset the tweets predicted as negative
all_words_df = df_final[df_final['label'] == 0]

# Extracts words into list and count frequency
all_words = ' '.join([text for text in all_words_df ['content_without_stopwords']])
all_words = all_words.split()
words_df = FreqDist(all_words)

# Extracting words and frequency from words_df object
words_df = pd.DataFrame({'word':list(words_df.keys()), 'count':list(words_df.values())})

# Subsets top 20 words by frequency
words_df = words_df.nlargest(columns="count", n = 20) 

words_df.sort_values('count', inplace = True)

# Plotting 20 most frequent words
plt.figure(figsize=(20,5))
ax = plt.barh(words_df['word'], width = words_df['count'])
plt.show()

#### Word Cloud Visualization

In [None]:
# Building a Word Cloud for positive (1) label
# identifying positive words

word_cloud_df_pos = df_final[df_final['label'] == 1]
# joining the identified positive words
all_words_pos = ' '.join([text for text in word_cloud_df_pos['content_without_stopwords']])
# building a word cloud with the positive words
wordcloud_pos = WordCloud(width = 800, height = 800, 
                      background_color ='white', 
                      min_font_size = 10).generate(all_words_pos)

#plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud_pos) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
plt.show()

In [None]:
# Building a Word Cloud for negative (0) label
# identifying the negative words

word_cloud_df_neg = df_final[df_final['label'] == 0]
# joining the negative words
all_words_neg = ' '.join([text for text in word_cloud_df_neg['content_without_stopwords']])
# creating a word cloud with the negative words
wordcloud_neg = WordCloud(width = 800, height = 800, 
                      background_color ='white', 
                      min_font_size = 10).generate(all_words_neg)

#plot the WordCloud image                        
plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud_neg) 
plt.axis("off") 
plt.tight_layout(pad = 0) 
plt.show()

In [None]:
#storing the copy of df_final in df_conclusion
df_conclusion = df_final.copy()
df_conclusion

In [None]:
#creating a new column in df_conclusion called 'popularity' based on columns number_of_likes and
#number_of_shares
df_conclusion['popularity'] = df_conclusion['number_of_likes'] + 3 * df_conclusion['number_of_shares']
df_conclusion
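
The popularity formula above weights each share three times as heavily as a like. The sketch below applies the same formula to two made-up tweets; the function name `popularity`, the `share_weight` parameter and the numbers are assumptions for illustration, and the weight of 3 is simply taken from the cell above.

```python
def popularity(likes, shares, share_weight=3):
    """Hypothetical helper mirroring the formula: likes + share_weight * shares."""
    return likes + share_weight * shares


# Made-up engagement numbers, not taken from the dataset.
tweets = [
    {"likes": 120, "shares": 10},
    {"likes": 40, "shares": 50},
]
popularity_scores = [popularity(t["likes"], t["shares"]) for t in tweets]
print(popularity_scores)
```

Under this weighting, the second tweet ends up more popular than the first even though it has far fewer likes, so the choice of weight directly affects the ranking that the correlation is computed on.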

In [None]:
# Scatter plot of sentiment_score vs popularity
plt.scatter(df_conclusion['sentiment_score'], df_conclusion['popularity'])
plt.xlabel('Sentiment Score')
plt.ylabel('Popularity')
plt.title('Sentiment Score vs Popularity')
plt.show()

In [None]:
# Calculating the Pearson correlation coefficient to check for a linear relationship
pearson_correlation = df_conclusion['sentiment_score'].corr(df_conclusion['popularity'])

print("Pearson Correlation Coefficient between sentiment score and popularity:", pearson_correlation)

In [None]:
# Calculating Spearman's rank correlation coefficient to check for a monotonic (possibly non-linear) relationship
spearmans_correlation, p_value = spearmanr(df_conclusion['sentiment_score'], df_conclusion['popularity'])

print("Spearman's Rank Correlation Coefficient between sentiment score and popularity:", spearmans_correlation)
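
The two coefficients answer slightly different questions: Pearson measures linear association, while Spearman applies the same formula to ranks and therefore measures monotonic association. The sketch below illustrates the difference with a pure-Python implementation on made-up numbers. The helper names and data are assumptions, and the notebook itself relies on pandas and SciPy rather than this code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Monotonic but strongly non-linear: Spearman is 1, Pearson is noticeably lower.
x_mono = [0.1, 0.2, 0.4, 0.6, 0.9]
y_mono = [math.exp(10 * v) for v in x_mono]
print(pearson(x_mono, y_mono), spearman(x_mono, y_mono))

# U-shaped (non-monotonic): both coefficients are close to zero.
x_u = [1, 2, 3, 4, 5]
y_u = [4, 1, 0, 1, 4]
print(pearson(x_u, y_u), spearman(x_u, y_u))
```

The second example shows that values near zero for both coefficients rule out linear and monotonic trends, but not every possible dependence between the two variables.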

Looking at the values obtained for both the Pearson and Spearman's correlation coefficients, we can conclude that there is neither a linear nor a monotonic (rank-based) correlation between the popularity and the sentiment score of tweets in the analyzed dataset.