### Introduction

In this notebook, we use a new dataset that includes three new columns:
1. sentiment_score and compound_sentiment are two sentiment scores generated from the text.
2. topic_list comes from the topic modeling result. Each text is assigned one topic, and we keep the top 5 words for that topic.

We will use these new features (sentiment and topic words) to see whether they improve our model.

### Import Packages

In [43]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import ast

from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MaxAbsScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score,confusion_matrix
from sklearn import feature_extraction

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

from scipy.sparse import hstack

from sklearn.feature_extraction.text import TfidfVectorizer

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

### Read Data

In [16]:
# Load data
train_df = pd.read_excel("topic_model_sentiment.xlsx")
train_df.head()

Unnamed: 0,id,keyword,location,text,target,target_relabelled,sentiment_score,compound_sentiment,topic_list
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,1,0.0,0.0,"[""releas"", ""trauma"", ""earthquak"", ""sever"", ""is..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,1,0.0,-0.34,"[""mh370"", ""get"", ""rain"", ""feel"", ""flood""]"
2,5,,,All residents asked to 'shelter in place' are ...,1,1,0.0,0.0,"[""siren"", ""tornado"", ""offic"", ""oil"", ""natur""]"
3,6,,,"13,000 people receive #wildfires evacuation or...",1,1,0.0,0.0,"[""disast"", ""wildfir"", ""fatal"", ""obama"", ""trap""]"
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,1,0.125,0.0,"[""wreck"", ""storm"", ""violent"", ""im"", ""like""]"


In [17]:
# Last time we stored a list in topic_list and wrote it to an Excel file.
# When the file is read back in, each list comes back as a string, so we first convert the topic_list column.
print(type(train_df.iloc[0].topic_list))
# Convert the text representation of the list to an actual list
train_df['topic_list'] = train_df['topic_list'].apply(ast.literal_eval)
print(type(train_df.iloc[0].topic_list))

<class 'str'>
<class 'list'>
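
As a minimal sketch of why this conversion is needed: a list written to a spreadsheet cell is stored as its text representation, and `ast.literal_eval` parses that text back into a Python list. The toy frame `df` and the variable `parsed` below are illustrative and not part of the notebook's data.

```python
import ast

import pandas as pd

# Toy column holding lists in their string form, as they appear after a
# round trip through an Excel file (values are illustrative only).
df = pd.DataFrame({"topic_list": ['["releas", "trauma"]', '["siren", "tornado"]']})

# literal_eval safely evaluates Python literals (lists, strings, numbers)
# without executing arbitrary code.
parsed = df["topic_list"].apply(ast.literal_eval)

print(type(df["topic_list"].iloc[0]))  # the raw cell is a string
print(type(parsed.iloc[0]))            # the parsed cell is a list
```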


In [20]:
# replace target by target_relabelled
train_df = train_df.drop('target', axis=1)
column_mapping = {'target_relabelled': 'target'}
train_df = train_df.rename(columns=column_mapping)
train_df.head(1)

Unnamed: 0,id,keyword,location,text,target,sentiment_score,compound_sentiment,topic_list
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,0.0,0.0,"[releas, trauma, earthquak, sever, issu]"


### Preprocess

In [21]:
import Preprocessing_for_Text_Processing_Comparison as pp
train = pp.process_text(train_df)
train.head(1)

Unnamed: 0,id,keyword,location,text,target,sentiment_score,compound_sentiment,topic_list,text_clean,text_clean_string
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,0.0,0.0,"[releas, trauma, earthquak, sever, issu]","[deed, reason, earthquak, may, allah, forgiv, us]",deed reason earthquak may allah forgiv us


In [32]:
# Append the words in topic_list to text_clean_string
# Define a helper that joins the cleaned text and the topic words with spaces
def concatenate_values(row):
    return row['text_clean_string'] + ' ' + ' '.join(row['topic_list'])

# Apply the helper row by row to create the combined_string column
train['combined_string'] = train.apply(concatenate_values, axis=1)
print(train.iloc[0].text_clean_string)
print(train.iloc[0].combined_string)

deed reason earthquak may allah forgiv us
deed reason earthquak may allah forgiv us releas trauma earthquak sever issu
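
The same concatenation can probably be written without a row-wise `apply`, using pandas string methods. This is only a sketch on a toy frame (`toy` and its values are made up for illustration); `Series.str.join` joins the elements of each list with the given separator.

```python
import pandas as pd

# Toy frame mirroring the two columns used above (values are illustrative).
toy = pd.DataFrame({
    "text_clean_string": ["deed reason earthquak", "forest fire near"],
    "topic_list": [["releas", "trauma"], ["rain", "flood"]],
})

# Join each list of topic words into one string, then append it to the text.
toy["combined_string"] = (
    toy["text_clean_string"] + " " + toy["topic_list"].str.join(" ")
)

print(toy["combined_string"].tolist())
```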


In [45]:
train.head(1)

Unnamed: 0,id,keyword,location,text,target,sentiment_score,compound_sentiment,topic_list,text_clean,text_clean_string,combined_string
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,0.0,0.0,"[releas, trauma, earthquak, sever, issu]","[deed, reason, earthquak, may, allah, forgiv, us]",deed reason earthquak may allah forgiv us,deed reason earthquak may allah forgiv us rele...


### Word Frequency

In [49]:
count_vectorizer = feature_extraction.text.CountVectorizer()
train_vectors = count_vectorizer.fit_transform(train["text_clean_string"])
X = train_vectors
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression(solver='newton-cg')
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_train, LR.predict(X_train)))
print("F1 Score:",f1_score(y_test, predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
print(pd.DataFrame(confusion_matrix(y_test, predicted)))

F1 Score: 0.9558550185873607
F1 Score: 0.8038834951456311
      0     1
0  3843    42
1   148  2057
     0    1
0  907   65
1  137  414


### Word Frequency + Sentiment

In [27]:
count_vectorizer = feature_extraction.text.CountVectorizer()
# Create your bag-of-words feature matrix
train_vectors = count_vectorizer.fit_transform(train["text_clean_string"])

# Create your sentiment score feature, reshaped to a single column (n_samples x 1)
sentiment_score_column = train['compound_sentiment'].values.reshape(-1, 1)

# Stack the bag-of-words and sentiment score horizontally
X = hstack((train_vectors, sentiment_score_column))

# Convert X to a dense array
X = X.toarray()

y = train['target'].to_list()

# Split your data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train your logistic regression model
LR = LogisticRegression()
LR.fit(X_train, y_train)

# Make predictions
predicted = LR.predict(X_test)
print("F1 Score:",f1_score(y_train, LR.predict(X_train)))
print("F1 Score:",f1_score(y_test, predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
print(pd.DataFrame(confusion_matrix(y_test, predicted)))

F1 Score: 0.9563602599814298
F1 Score: 0.8069835111542193
      0     1
0  3842    43
1   145  2060
     0    1
0  908   64
1  135  416


### Word Frequency + Topic List

In [35]:
count_vectorizer = feature_extraction.text.CountVectorizer()
train_vectors = count_vectorizer.fit_transform(train["combined_string"])
X = train_vectors
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression(solver='newton-cg')
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_train, LR.predict(X_train)))
print("F1 Score:",f1_score(y_test, predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
print(pd.DataFrame(confusion_matrix(y_test, predicted)))

F1 Score: 0.9544186046511628
F1 Score: 0.8096618357487922
      0     1
0  3842    43
1   153  2052
     0    1
0  907   65
1  132  419
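
As a sanity check, the test F1 score can be recomputed by hand from the test confusion matrix printed above (rows are true labels, columns are predicted labels, which is the layout `confusion_matrix` returns). The variable names below are only for illustration.

```python
import numpy as np

# Test confusion matrix from the output above: rows = true, columns = predicted.
cm = np.array([[907, 65],
               [132, 419]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of true positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
```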


### Word Frequency + Sentiment + Topic List

In [37]:
count_vectorizer = feature_extraction.text.CountVectorizer()
# Create your bag-of-words feature matrix
train_vectors = count_vectorizer.fit_transform(train["combined_string"])

# Create your sentiment score feature, reshaped to a single column (n_samples x 1)
sentiment_score_column = train['compound_sentiment'].values.reshape(-1, 1)

# Stack the bag-of-words and sentiment score horizontally
X = hstack((train_vectors, sentiment_score_column))

# Convert X to a dense array
X = X.toarray()

y = train['target'].to_list()

# Split your data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train your logistic regression model
LR = LogisticRegression(solver='newton-cg')
LR.fit(X_train, y_train)

# Make predictions
predicted = LR.predict(X_test)
print("F1 Score:",f1_score(y_train, LR.predict(X_train)))
print("F1 Score:",f1_score(y_test, predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
print(pd.DataFrame(confusion_matrix(y_test, predicted)))

F1 Score: 0.9523809523809523
F1 Score: 0.8088803088803088
      0     1
0  3835    50
1   155  2050
     0    1
0  906   66
1  132  419


### Conclusion 1

The sentiment score and the topic list do not help the BOW model much. All four combinations reach a similar test F1 score, between 0.804 and 0.810.
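
Differences this small are hard to judge from a single train/test split. One way to get a steadier comparison, sketched here under assumptions, is to wrap the vectorizer and the classifier in a pipeline and score it with stratified cross-validation, so that the vocabulary is fitted only on each training fold. The toy `texts` and `labels` below are made up; with the real data one would pass `train["text_clean_string"]` and `train["target"]` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Toy corpus (illustrative only): 1 = disaster, 0 = not a disaster.
texts = [
    "forest fire near town", "flood warn issu river", "earthquak hit citi",
    "tornado siren sound", "wildfir evacu order", "storm wreck hous",
    "love sunni day", "great game tonight", "new song fire",
    "coffe with friend", "watch movi home", "happi birthday friend",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The pipeline refits CountVectorizer inside every fold.
pipe = make_pipeline(CountVectorizer(), LogisticRegression(solver="newton-cg"))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipe, texts, labels, scoring="f1", cv=cv)

print(scores.mean(), scores.std())
```

Comparing the mean and spread of the fold scores for each feature set gives a rough idea of whether a gap of a few thousandths is larger than the fold-to-fold noise.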

### TF-IDF

In [50]:
vec_text = TfidfVectorizer(min_df = 10, ngram_range = (1,2), stop_words='english') 
text_vec = vec_text.fit_transform(train['text_clean_string'])
X_train_text = pd.DataFrame(text_vec.toarray(), columns=vec_text.get_feature_names_out())

X = X_train_text
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_train, LR.predict(X_train)))
print("F1 Score:",f1_score(y_test, predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
pd.DataFrame(confusion_matrix(y_test, predicted))

F1 Score: 0.8442703232125367
F1 Score: 0.7745197168857432
      0     1
0  3730   155
1   481  1724


Unnamed: 0,0,1
0,917,55
1,168,383


### TF-IDF + Sentiment

In [44]:
vec_text = TfidfVectorizer(min_df = 10, ngram_range = (1,2), stop_words='english') 
text_vec = vec_text.fit_transform(train['text_clean_string'])
X_train_text = pd.DataFrame(text_vec.toarray(), columns=vec_text.get_feature_names_out())
# Append sentiment score to TF-IDF (to_numpy() assigns by position, so a non-default index on train cannot misalign rows)
X_train_text['sentiment'] = train['compound_sentiment'].to_numpy()
X = X_train_text
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_train, LR.predict(X_train)))
print("F1 Score:",f1_score(y_test, predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
pd.DataFrame(confusion_matrix(y_test, predicted))

F1 Score: 0.8478048780487805
F1 Score: 0.7736040609137056
      0     1
0  3728   157
1   467  1738


Unnamed: 0,0,1
0,919,53
1,170,381


### TF-IDF + Topic List

In [46]:
vec_text = TfidfVectorizer(min_df = 10, ngram_range = (1,2), stop_words='english') 
text_vec = vec_text.fit_transform(train['combined_string'])
X_train_text = pd.DataFrame(text_vec.toarray(), columns=vec_text.get_feature_names_out())

X = X_train_text
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_train, LR.predict(X_train)))
print("F1 Score:",f1_score(y_test, predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
pd.DataFrame(confusion_matrix(y_test, predicted))

F1 Score: 0.8227848101265822
F1 Score: 0.7372881355932204
      0     1
0  3765   120
1   580  1625


Unnamed: 0,0,1
0,927,45
1,203,348


### TF-IDF + Sentiment + Topic List

In [47]:
vec_text = TfidfVectorizer(min_df = 10, ngram_range = (1,2), stop_words='english') 
text_vec = vec_text.fit_transform(train['combined_string'])
X_train_text = pd.DataFrame(text_vec.toarray(), columns=vec_text.get_feature_names_out())
# Append sentiment score to TF-IDF (to_numpy() assigns by position, so a non-default index on train cannot misalign rows)
X_train_text['sentiment'] = train['compound_sentiment'].to_numpy()
X = X_train_text
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_train, LR.predict(X_train)))
print("F1 Score:",f1_score(y_test, predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
pd.DataFrame(confusion_matrix(y_test, predicted))

F1 Score: 0.8234410217881293
F1 Score: 0.7533818938605619
      0     1
0  3741   144
1   561  1644


Unnamed: 0,0,1
0,924,48
1,189,362


### Conclusion 2

Sentiment and the topic list do not help the TF-IDF model either. Adding the topic list even lowers the test F1 score (from 0.775 to 0.737).
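
For reference, the TF-IDF and sentiment features can also be combined with a `ColumnTransformer`, which keeps the rows aligned and lets the numeric column be scaled alongside the text features. This is a sketch on a toy frame, not the setup used for the results above; `toy` and `features` are hypothetical names, and `MaxAbsScaler` is just one possible scaling choice.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MaxAbsScaler

# Toy frame with the two input columns (values are illustrative only).
toy = pd.DataFrame({
    "text_clean_string": ["forest fire near town", "love sunni day",
                          "flood warn issu", "great game tonight"],
    "compound_sentiment": [-0.34, 0.64, -0.10, 0.62],
})

features = ColumnTransformer([
    # A single column name (a string) passes 1-D text to the vectorizer.
    ("tfidf", TfidfVectorizer(), "text_clean_string"),
    # A list of column names passes a 2-D block to the scaler.
    ("sent", MaxAbsScaler(), ["compound_sentiment"]),
])
X_combined = features.fit_transform(toy)

n_terms = len(features.named_transformers_["tfidf"].vocabulary_)
print(X_combined.shape, n_terms)
```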

### Final Conclusion

1. BOW with the topic-list words gives us the best test F1 score: about 0.81.
2. In our experiments, sentiment and topic modelling do not help the prediction much. One possible reason is that the BOW and TF-IDF input matrices are high-dimensional, so the few extra sentiment and topic features have little effect.
3. Logically speaking, sentence embeddings should give us the best result. That this did not happen may be because some data are still mislabelled and because we passed the embeddings to a simple classification model (logistic regression) instead of a neural network.

Further improvements will focus on using LLMs and neural networks.