### Introduction

This notebook aims to explore the impact of various text processing techniques on prediction outcomes. Specifically, it will investigate the performance differences between methods such as word frequency analysis, TF-IDF (Term Frequency-Inverse Document Frequency), and sentence embedding in the context of making predictions.

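To ensure that differences in results come mainly from the text processing technique rather than from the classifier, we use only logistic regression and Naive Bayes as classification models. In Method3, a linear SVM stands in for Naive Bayes, because `MultinomialNB` cannot handle the negative values in sentence embeddings.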

### Import Basic Packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Import Packages for Logistic Regression

In [2]:
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MaxAbsScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score,confusion_matrix


### Import Packages for Naive Bayes

In [3]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

### Import Packages to Ignore Warnings

The `is_sparse` function is deprecated and raises a `FutureWarning`, but this should not affect the models or their results. We therefore suppress the warning for cleaner output.

In [4]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

### Read Files

The dataset is downloaded from https://www.kaggle.com/c/nlp-getting-started/overview

In [5]:
# Load data
train_df = pd.read_csv("./kaggle/input/train.csv")
test_df = pd.read_csv("./kaggle/input/test.csv")

In [6]:
train_df.head(5)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [7]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


### Clean Text Column

Text cleaning is done by a Python script (`Preprocessing_for_Text_Processing_Comparison.py`). It involves the following steps:

* For all methods, steps 1-5 are applied:
1. remove links
2. remove @account mentions
3. remove line breaks
4. remove # from hashtags
5. remove non-ASCII characters

* Steps 6-10 are applied for word frequency and TF-IDF only. They are skipped for sentence embedding, because the underlying language model is good at understanding sequential meaning (we validate this assumption in Method3):
6. tokenize text
7. convert words to lower case
8. keep only alphabetic words
9. remove stop words
10. stem words

After cleaning, two additional columns are created. The first, `text_clean`, holds a list of tokenized words. The second, `text_clean_string`, joins the tokens from `text_clean` with a single space between them. These columns are used by the three methods below.

For more details, refer to the Python script.
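
The preprocessing script itself is not shown in this notebook, so the following is only a minimal sketch of what steps 1-5 might look like with regular expressions. The function name `basic_clean` and the exact patterns are assumptions for illustration, not the script's actual implementation.

```python
import re

def basic_clean(text: str) -> str:
    """Hypothetical helper applying steps 1-5 to a single tweet."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # 1. remove links
    text = re.sub(r"@\w+", " ", text)                   # 2. remove @account mentions
    text = re.sub(r"[\r\n]+", " ", text)                # 3. remove line breaks
    text = text.replace("#", "")                        # 4. drop '#', keep the hashtag word
    text = text.encode("ascii", errors="ignore").decode("ascii")  # 5. drop non-ASCII
    return re.sub(r"\s+", " ", text).strip()            # collapse leftover whitespace

print(basic_clean("Forest fire near #Canada\n@news http://t.co/abc \u2764"))
```

The order matters here: links are removed before `#` and `@` handling so that fragments of a URL are not mistaken for hashtags or mentions.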

In [8]:
import Preprocessing_for_Text_Processing_Comparison as pp

In [9]:
train = pp.process_text(train_df)
test = pp.process_text(test_df)

In [10]:
train.head()

Unnamed: 0,id,keyword,location,text,target,text_clean,text_clean_string
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,"[deed, reason, earthquak, may, allah, forgiv, us]",deed reason earthquak may allah forgiv us
1,4,,,Forest fire near La Ronge Sask. Canada,1,"[forest, fire, near, la, rong, sask, canada]",forest fire near la rong sask canada
2,5,,,All residents asked to 'shelter in place' are ...,1,"[resid, ask, place, notifi, offic, evacu, shel...",resid ask place notifi offic evacu shelter pla...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"[peopl, receiv, wildfir, evacu, order, califor...",peopl receiv wildfir evacu order california
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,"[got, sent, photo, rubi, alaska, smoke, wildfi...",got sent photo rubi alaska smoke wildfir pour ...


In [11]:
test.head()

Unnamed: 0,id,keyword,location,text,text_clean,text_clean_string
0,0,,,Just happened a terrible car crash,"[happen, terribl, car, crash]",happen terribl car crash
1,2,,,"Heard about #earthquake is different cities, s...","[heard, earthquak, differ, citi, stay, safe, e...",heard earthquak differ citi stay safe everyon
2,3,,,"there is a forest fire at spot pond, geese are...","[forest, fire, spot, pond, gees, flee, across,...",forest fire spot pond gees flee across street ...
3,9,,,Apocalypse lighting. #Spokane #wildfires,"[apocalyps, light, spokan, wildfir]",apocalyps light spokan wildfir
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan,"[typhoon, soudelor, kill, china, taiwan]",typhoon soudelor kill china taiwan


### Method1: Word Frequency

As a baseline, we start with the count of each word in each tweet.
We use `CountVectorizer` below to build the word-count matrix.

In [12]:
from sklearn import feature_extraction

In [13]:
count_vectorizer = feature_extraction.text.CountVectorizer()

In [14]:
## let's take a look at expected output by using first 2 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train["text_clean_string"][0:2])
print(train["text_clean_string"][0:2].values)
print(count_vectorizer.get_feature_names_out())
print(count_vectorizer.vocabulary_)
print(example_train_vectors.toarray())
print(example_train_vectors.toarray().shape)

['deed reason earthquak may allah forgiv us'
 'forest fire near la rong sask canada']
['allah' 'canada' 'deed' 'earthquak' 'fire' 'forest' 'forgiv' 'la' 'may'
 'near' 'reason' 'rong' 'sask' 'us']
{'deed': 2, 'reason': 10, 'earthquak': 3, 'may': 8, 'allah': 0, 'forgiv': 6, 'us': 13, 'forest': 5, 'fire': 4, 'near': 9, 'la': 7, 'rong': 11, 'sask': 12, 'canada': 1}
[[1 0 1 1 0 0 1 0 1 0 1 0 0 1]
 [0 1 0 0 1 1 0 1 0 1 0 1 1 0]]
(2, 14)


The output above shows that there are 14 unique words (or "tokens") in the first two tweets, so each tweet is represented as a 14-dimensional count vector.
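
To make the count matrix less of a black box, here is a rough pure-Python sketch that rebuilds it for the same two tweets. It assumes simple whitespace tokenization, which happens to agree with `CountVectorizer` on this already-cleaned text but is not its general tokenization rule. The helper name `count_vector` is made up for this illustration.

```python
docs = [
    "deed reason earthquak may allah forgiv us",
    "forest fire near la rong sask canada",
]

# Vocabulary: every distinct token, sorted alphabetically and mapped to a column index
tokens = sorted({token for doc in docs for token in doc.split()})
vocabulary = {token: idx for idx, token in enumerate(tokens)}

def count_vector(doc, vocabulary):
    """Return one row of the count matrix for a whitespace-tokenized document."""
    row = [0] * len(vocabulary)
    for token in doc.split():
        if token in vocabulary:  # tokens unseen when the vocabulary was built are ignored
            row[vocabulary[token]] += 1
    return row

matrix = [count_vector(doc, vocabulary) for doc in docs]
for row in matrix:
    print(row)
```

Ignoring unseen tokens in `count_vector` mirrors why the test set is later passed through `.transform()` only: both splits must share the same columns.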

Now let's create vectors for all of our tweets.

In [15]:
train_vectors = count_vectorizer.fit_transform(train["text_clean_string"])

# note that we're NOT using .fit_transform() here. Using just .transform() makes sure that train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(test["text_clean_string"])

In [16]:
# print(count_vectorizer.get_feature_names_out())
# print(count_vectorizer.vocabulary_)
# print(example_train_vectors.toarray())
print(train_vectors.toarray().shape)

(7613, 10532)


1-1 Build model without standardizing variables

In [17]:
X = train_vectors
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_train, LR.predict(X_train)))
print("F1 Score:",f1_score(y_test, predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
print(pd.DataFrame(confusion_matrix(y_test, predicted)))

F1 Score: 0.9303433220877158
F1 Score: 0.7733118971061093
      0     1
0  3395    78
1   273  2344
     0    1
0  760  109
1  173  481
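
The F1 scores printed above can be recomputed by hand from the two confusion matrices, which is a useful sanity check. In scikit-learn's layout, rows are true labels and columns are predicted labels, so the cells read as TN, FP on the first row and FN, TP on the second. The function name below is just for illustration.

```python
def f1_from_confusion(tn, fp, fn, tp):
    """F1 for the positive class; note that true negatives do not enter the formula."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Cell values copied from the confusion matrices printed above
train_f1 = f1_from_confusion(tn=3395, fp=78, fn=273, tp=2344)
test_f1 = f1_from_confusion(tn=760, fp=109, fn=173, tp=481)
print(round(train_f1, 4), round(test_f1, 4))  # 0.9303 0.7733
```

The gap between the two values (about 0.93 on training data versus 0.77 on held-out data) suggests the model overfits the 10,532 word features to some degree.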


In [18]:
# Get the feature names (words) from the CountVectorizer
feature_names = count_vectorizer.get_feature_names_out()

# Get the coefficients from the trained logistic regression model
coefficients = LR.coef_[0]

# Create a dictionary that associates feature names with their coefficients
feature_coefficients = dict(zip(feature_names, coefficients))

# Sort the features by coefficient (descending) to find the strongest positive ones
sorted_features_positive = sorted(feature_coefficients.items(), key=lambda x: x[1], reverse=True)

# Sort the features by coefficient (ascending) to find the strongest negative ones
sorted_features_negative = sorted(feature_coefficients.items(), key=lambda x: x[1])

# Print the top N most important positive features
top_n = 10  # Change this value to see more or fewer features
for feature, coefficient in sorted_features_positive[:top_n]:
    print(f"{feature}: {coefficient}")

print('----------------------------------------')
    
# Print the top N most important negative features
for feature, coefficient in sorted_features_negative[:top_n]:
    print(f"{feature}: {coefficient}")


hiroshima: 2.645388489888987
wildfir: 2.3347055956738036
earthquak: 2.0389962276401725
typhoon: 1.910274615457201
storm: 1.902511518781684
debri: 1.795257627619148
tornado: 1.7497896583076402
spill: 1.7224985363487062
massacr: 1.6862271855273108
migrant: 1.5787320214120713
----------------------------------------
nowplay: -1.2654459304583408
love: -1.2380933407793044
ticket: -1.2177880729937431
never: -1.2001342147749514
upheav: -1.194503451145276
write: -1.1776942612336967
lmao: -1.1301736722566142
technolog: -1.0972574486053386
let: -1.0860564440658924
career: -1.0761327710359425


1-2 Build model with MaxAbsScaler
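
As a rough illustration of what `MaxAbsScaler` does, the sketch below divides each column by its largest absolute value, which keeps zeros at zero and therefore preserves sparsity. The helper names and toy numbers are assumptions, not part of the notebook. The sketch learns the scales from the training rows only, whereas the cell below fits the scaler on all of `train_vectors` before splitting, so the held-out rows also influence the scales there.

```python
def fit_max_abs(rows):
    """Learn the per-column maximum absolute value from the given rows."""
    n_cols = len(rows[0])
    scales = [max(abs(row[j]) for row in rows) for j in range(n_cols)]
    # An all-zero column keeps a scale of 1 to avoid dividing by zero
    return [s if s != 0 else 1.0 for s in scales]

def apply_max_abs(rows, scales):
    """Divide every value by the scale learned for its column."""
    return [[value / s for value, s in zip(row, scales)] for row in rows]

train_rows = [[2, 0, 1], [4, 0, 3]]  # toy word counts, not taken from the dataset
valid_rows = [[1, 5, 6]]

scales = fit_max_abs(train_rows)  # fitted on the training rows only
train_scaled = apply_max_abs(train_rows, scales)
valid_scaled = apply_max_abs(valid_rows, scales)
print(scales)
print(train_scaled)
print(valid_scaled)
```

Held-out values can exceed 1 after scaling, as in the toy example, because the scales never saw those rows.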

In [19]:
# Use a scaler; MaxAbsScaler keeps the count matrix sparse
scaler = MaxAbsScaler()
X_train_scaled = scaler.fit_transform(train_vectors)
X_test_scaled = scaler.transform(test_vectors)


X = X_train_scaled
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_train, LR.predict(X_train)))
print("F1 Score:",f1_score(y_test, predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
print(pd.DataFrame(confusion_matrix(y_test, predicted)))

F1 Score: 0.9252055343894124
F1 Score: 0.761437908496732
      0     1
0  3410    63
1   310  2307
     0    1
0  765  104
1  188  466


In [20]:
# Get the feature names (words) from the CountVectorizer
feature_names = count_vectorizer.get_feature_names_out()

# Get the coefficients from the trained logistic regression model
coefficients = LR.coef_[0]

# Create a dictionary that associates feature names with their coefficients
feature_coefficients = dict(zip(feature_names, coefficients))

# Sort the features by coefficient (descending) to find the strongest positive ones
sorted_features_positive = sorted(feature_coefficients.items(), key=lambda x: x[1], reverse=True)

# Sort the features by coefficient (ascending) to find the strongest negative ones
sorted_features_negative = sorted(feature_coefficients.items(), key=lambda x: x[1])

# Print the top N most important positive features
top_n = 10  # Change this value to see more or fewer features
for feature, coefficient in sorted_features_positive[:top_n]:
    print(f"{feature}: {coefficient}")

print('----------------------------------------')
    
# Print the top N most important negative features
for feature, coefficient in sorted_features_negative[:top_n]:
    print(f"{feature}: {coefficient}")

fire: 3.097592025002737
hiroshima: 3.030802588543393
storm: 2.564572486204606
wildfir: 2.5163620520627994
earthquak: 2.172251793362818
evacu: 2.1564213525966407
california: 2.147375307697853
murder: 2.075147318262127
kill: 2.071626929371796
forest: 2.0433387973774666
----------------------------------------
love: -1.9045865156101898
upheav: -1.4344258146514115
make: -1.4066990584691839
let: -1.3786792371575622
never: -1.3008434730429905
nowplay: -1.2518685391618602
fuck: -1.2097839463139448
poll: -1.2017224793245656
lmao: -1.178034252929234
career: -1.1620973133203012


Conclusion: without MaxAbsScaler, both the training F1 score (0.930 vs. 0.925) and the test F1 score (0.773 vs. 0.761) are slightly better.

In [21]:
# Create and train the Multinomial Naive Bayes model
# Note: X_train and X_test here are the MaxAbsScaler-scaled splits from In [19]
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train,y_train)
train_predicted = nb_classifier.predict(X_train)
test_predicted = nb_classifier.predict(X_test)
print("Train F1 Score:",f1_score(y_train, train_predicted))
print("Test F1 Score:",f1_score(y_test, test_predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, train_predicted)))
print(pd.DataFrame(confusion_matrix(y_test, test_predicted)))

Train F1 Score: 0.8802537668517051
Test F1 Score: 0.7637795275590552
      0     1
0  3266   207
1   397  2220
     0    1
0  738  131
1  169  485


Conclusion: on word-frequency features, Naive Bayes (0.76) performs similarly to Logistic Regression (0.77).

### Method2: TF-IDF

Now, let's try TF-IDF by using `TfidfVectorizer`
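
Before using the library class, it may help to see the weighting spelled out. The sketch below computes TF-IDF by hand with a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row, which to the best of my understanding matches the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`). It deliberately leaves out `min_df`, n-grams and stop-word handling, and the toy sentences are not from the dataset.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({token for doc in tokenized for token in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears
    df = Counter(token for doc in tokenized for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term counts; missing terms count as 0
        weights = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf(["forest fire near canada", "forest fire spread"])
for token, weight in zip(vocab, rows[0]):
    print(f"{token}: {weight:.3f}")
```

In the first toy document, "near" and "canada" receive more weight than "forest" and "fire" because they occur in only one of the two documents.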

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [23]:
# Keep only terms that appear in at least 10 tweets (min_df is a document frequency)
# Use unigrams and bigrams
vec_text = TfidfVectorizer(min_df = 10, ngram_range = (1,2), stop_words='english')

text_vec = vec_text.fit_transform(train['text_clean_string'])
text_vec_test = vec_text.transform(test['text_clean_string'])
X_train_text = pd.DataFrame(text_vec.toarray(), columns=vec_text.get_feature_names_out())
X_test_text = pd.DataFrame(text_vec_test.toarray(), columns=vec_text.get_feature_names_out())
print (X_train_text.shape)

(7613, 1520)


In [24]:
X_train_text

Unnamed: 0,aba,aba woman,abandon,abc,abc news,abl,ablaz,absolut,accid,accord,...,yesterday,yo,york,young,youth,youth save,youtub,yr,yyc,zone
0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7608,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7609,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7610,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7611,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [25]:
X = X_train_text
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_train, LR.predict(X_train)))
print("F1 Score:",f1_score(y_test, predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
pd.DataFrame(confusion_matrix(y_test, predicted))

F1 Score: 0.7975993377483444
F1 Score: 0.7660626029654036
      0     1
0  3185   288
1   690  1927


Unnamed: 0,0,1
0,774,95
1,189,465


In [26]:
# Get the feature names (unigrams and bigrams) from the TfidfVectorizer
feature_names = vec_text.get_feature_names_out()

# Get the coefficients from the trained logistic regression model
coefficients = LR.coef_[0]

# Create a dictionary that associates feature names with their coefficients
feature_coefficients = dict(zip(feature_names, coefficients))

# Sort the features by coefficient (descending) to find the strongest positive ones
sorted_features_positive = sorted(feature_coefficients.items(), key=lambda x: x[1], reverse=True)

# Sort the features by coefficient (ascending) to find the strongest negative ones
sorted_features_negative = sorted(feature_coefficients.items(), key=lambda x: x[1])

# Print the top N most important positive features
top_n = 10  # Change this value to see more or fewer features
for feature, coefficient in sorted_features_positive[:top_n]:
    print(f"{feature}: {coefficient}")

print('----------------------------------------')
    
# Print the top N most important negative features
for feature, coefficient in sorted_features_negative[:top_n]:
    print(f"{feature}: {coefficient}")

hiroshima: 3.6523858852884152
wildfir: 3.204129362061818
california: 2.757571778025467
kill: 2.7200358240055658
forest: 2.5683849832564016
storm: 2.519988696677238
earthquak: 2.360874176144914
evacu: 2.2848403258261873
debri: 2.2419491470055353
flood: 2.176170826767648
----------------------------------------
love: -2.2125034179401326
let: -1.8439255378635953
bag: -1.7823108277022437
want: -1.7600085185610215
fuck: -1.736474276515424
new: -1.6127636473486138
make: -1.560242948545216
blew: -1.52134944892504
harm: -1.5080448559621575
upheav: -1.4948783135997972


The TF-IDF test F1 score (0.766) is slightly lower than that of Word Frequency (0.773).

In [27]:
# Create and train the Multinomial Naive Bayes model
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train,y_train)
train_predicted = nb_classifier.predict(X_train)
test_predicted = nb_classifier.predict(X_test)
print("Train F1 Score:",f1_score(y_train, train_predicted))
print("Test F1 Score:",f1_score(y_test, test_predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, train_predicted)))
print(pd.DataFrame(confusion_matrix(y_test, test_predicted)))

Train F1 Score: 0.7648314127187367
Test F1 Score: 0.7508474576271186
      0     1
0  3196   277
1   825  1792
     0    1
0  786   83
1  211  443


Conclusion: on TF-IDF features, Naive Bayes (0.75) performs similarly to Logistic Regression (0.766).

### Method3: Sentence Embedding

Here, we use `SentenceTransformer` to convert each tweet into a sentence embedding.

Reference link: https://www.youtube.com/watch?v=c7AqnswslWo

An experiment is performed to determine whether it is more effective to create sentence embeddings from the original text or from text that has undergone lowercase conversion, removal of non-alphabetic characters, exclusion of stop words, and stemming.

In [28]:
# install package
# !pip install --user sentence-transformers -q

In [29]:
# Import libraries
from sentence_transformers import SentenceTransformer, util
import torch

1. Embedding from processed text (lowercased, alphabetic words only, stop words removed, stemmed)

In [30]:
# use SentenceTransformer to generate sentence embedding
# can choose diff models from: https://www.sbert.net/docs/pretrained_models.html
model = SentenceTransformer('all-MiniLM-L6-v2')
train['embeddings'] = train['text_clean_string'].apply(model.encode)

In [31]:
X = train['embeddings'].to_list()
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
train_predicted = LR.predict(X_train)
test_predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

print("Train F1 Score:",f1_score(y_train, train_predicted))
print("Test F1 Score:",f1_score(y_test, test_predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, train_predicted)))
print(pd.DataFrame(confusion_matrix(y_test, test_predicted)))

Train F1 Score: 0.7454909819639277
Test F1 Score: 0.7570977917981073
      0     1
0  2960   513
1   757  1860
     0    1
0  735  134
1  174  480


2. Embedding from original text

In [32]:
train_llm = pp.process_text(train_df,LLM = True)
test_llm = pp.process_text(test_df,LLM = True)

In [33]:
train_llm.head()

Unnamed: 0,id,keyword,location,text,target,text_clean
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,Our Deeds are the Reason of this earthquake Ma...
1,4,,,Forest fire near La Ronge Sask. Canada,1,Forest fire near La Ronge Sask. Canada
2,5,,,All residents asked to 'shelter in place' are ...,1,All residents asked to 'shelter in place' are ...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"13,000 people receive wildfires evacuation ord..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,Just got sent this photo from Ruby Alaska as s...


In [34]:
# use SentenceTransformer to generate sentence embedding
# can choose diff models from: https://www.sbert.net/docs/pretrained_models.html
train_llm['embeddings'] = train_llm['text_clean'].apply(model.encode)

In [35]:
train_llm.head()

Unnamed: 0,id,keyword,location,text,target,text_clean,embeddings
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,Our Deeds are the Reason of this earthquake Ma...,"[-0.0054457616, 0.064116865, 0.11215522, 0.033..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,Forest fire near La Ronge Sask. Canada,"[0.04022922, 0.03801415, -0.0064313184, 0.0246..."
2,5,,,All residents asked to 'shelter in place' are ...,1,All residents asked to 'shelter in place' are ...,"[0.13137569, 0.012401222, 0.06718611, 0.085462..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"13,000 people receive wildfires evacuation ord...","[0.099470675, -0.033589236, 0.0065593175, 0.01..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,Just got sent this photo from Ruby Alaska as s...,"[-0.036732815, 0.10448628, 0.0742257, 0.089533..."


In [36]:
X = train_llm['embeddings'].to_list()
y = train_llm['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
train_predicted = LR.predict(X_train)
test_predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

print("Train F1 Score:",f1_score(y_train, train_predicted))
print("Test F1 Score:",f1_score(y_test, test_predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, train_predicted)))
print(pd.DataFrame(confusion_matrix(y_test, test_predicted)))

Train F1 Score: 0.7826431543491182
Test F1 Score: 0.8009516256938937
      0     1
0  3018   455
1   642  1975
     0    1
0  767  102
1  149  505


Running the code below raises the error: Negative values in data passed to MultinomialNB (input X)

This is because the MultinomialNB (Multinomial Naive Bayes) algorithm is designed for discrete data, specifically data that represents counts or frequencies. In text classification, it is commonly used to model the frequency of words in a document or a set of documents.

Negative values are not meaningful in this context, because a document cannot contain a negative count of a word. Multinomial Naive Bayes expects non-negative values, typically the count of each term in a document (fractional weights such as TF-IDF also work in practice, as in Method2). Sentence embeddings contain negative components, so the cell is commented out and a linear SVM is used as the second classifier instead.
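
To see why negative inputs break the model, consider how multinomial Naive Bayes estimates the probability of a term within a class: it sums the term's values over the class, adds a smoothing constant `alpha`, and normalizes. The sketch below is a simplified stand-in for that estimate, not scikit-learn's code, and the function name and toy numbers are assumptions. With negative inputs the summed "counts" could become negative and the logarithm would be undefined, so the input is rejected up front.

```python
import math

def feature_log_probs(rows, alpha=1.0):
    """Smoothed log P(term | class), estimated from the rows of ONE class."""
    if any(value < 0 for row in rows for value in row):
        raise ValueError("multinomial Naive Bayes needs non-negative feature values")
    n_features = len(rows[0])
    totals = [sum(row[j] for row in rows) for j in range(n_features)]
    denom = sum(totals) + alpha * n_features
    return [math.log((total + alpha) / denom) for total in totals]

count_rows = [[2, 0, 1], [1, 1, 0]]  # toy word counts for one class
log_probs = feature_log_probs(count_rows)
print([round(math.exp(lp), 3) for lp in log_probs])

embedding_rows = [[0.13, -0.03, 0.07]]  # toy embedding with a negative component
try:
    feature_log_probs(embedding_rows)
except ValueError as err:
    print("Rejected:", err)
```

For the toy counts, the column totals are 3, 1 and 1, so with `alpha=1` the estimated probabilities are 4/8, 2/8 and 2/8.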

In [38]:
# # Create and train the Multinomial Naive Bayes model
# nb_classifier = MultinomialNB()
# nb_classifier.fit(X_train,y_train)
# train_predicted = nb_classifier.predict(X_train)
# test_predicted = nb_classifier.predict(X_test)
# print("Train F1 Score:",f1_score(y_train, train_predicted))
# print("Test F1 Score:",f1_score(y_test, test_predicted))
# # Confusion matrix
# print(pd.DataFrame(confusion_matrix(y_train, train_predicted)))
# print(pd.DataFrame(confusion_matrix(y_test, test_predicted)))

In [39]:
from sklearn.svm import SVC

# X_train and y_train are your sentence embeddings and labels
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
train_predicted = svm_classifier.predict(X_train)
test_predicted = svm_classifier.predict(X_test)
print("Train F1 Score:",f1_score(y_train, train_predicted))
print("Test F1 Score:",f1_score(y_test, test_predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, train_predicted)))
print(pd.DataFrame(confusion_matrix(y_test, test_predicted)))

Train F1 Score: 0.7850542386500602
Test F1 Score: 0.7896440129449839
      0     1
0  3066   407
1   663  1954
     0    1
0  775   94
1  166  488


Conclusion: sentence embedding of the original text with logistic regression (0.80) gives the best result among the three methods.

Sentence embeddings from `SentenceTransformer` also support semantic search, which returns the top-N most similar tweets ranked by cosine similarity.

Reference: https://medium.com/nlplanet/two-minutes-nlp-sentence-transformers-cheat-sheet-2e9865083e7a
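
Conceptually, semantic search scores every corpus embedding against the query embedding with cosine similarity and keeps the best matches. The sketch below shows that idea in plain Python on toy 2-dimensional vectors. It is an illustration under those assumptions, not how `util.semantic_search` is implemented, and the names `cosine` and `top_k` are made up here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, corpus, k=2):
    """Return (corpus_index, score) pairs for the k most similar vectors."""
    scored = [(cosine(query, vec), idx) for idx, vec in enumerate(corpus)]
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:k]]

corpus = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]  # toy embeddings
query = [1.0, 0.1]
hits = top_k(query, corpus, k=2)
for idx, score in hits:
    print(idx, round(score, 4))
```

Because cosine similarity depends only on direction, a long tweet and a short tweet with similar meaning can still score highly against each other.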

In [41]:
# Queries and their embeddings
queries = ["Help me, there is a serious disaster", "My life is so peaceful and happy"]
queries_embeddings = model.encode(queries)

# Ensure 'embeddings' column contains float32 or float64 arrays
train['embeddings2'] = train['embeddings'].apply(lambda x: x.astype(np.float32))

# Now you can convert the 'embeddings' column to a PyTorch tensor
embeddings_tensor = torch.from_numpy(np.stack(train['embeddings2'].values))

# Find the top-2 corpus documents matching each query
hits = util.semantic_search(queries_embeddings, embeddings_tensor, top_k=2)

# Print results of first query
print(f"Query: {queries[0]}")
for hit in hits[0]:
    print(train['text_clean_string'][hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))

# Print results of second query
print(f"Query: {queries[1]}")
for hit in hits[1]:
    print(train['text_clean_string'][hit['corpus_id']], "(Score: {:.4f})".format(hit['score']))

Query: Help me, there is a serious disaster
hurrican tornado tsunami someon pleas tell hell happen nopow (Score: 0.4548)
danger ok rest us danger (Score: 0.4378)
Query: My life is so peaceful and happy
live balanc life balanc fear allah hope merci love (Score: 0.4005)
greater tragedi becom comfort life (Score: 0.3832)


### Conclusion

Final test F1 scores (logistic regression):
* Word-Frequency: 0.773
* TF-IDF: 0.766
* Sentence-Embedding (original text): 0.801