### Introduction

In the previous `FP&FN problem` notebook, we found records that were incorrectly labelled, and we decided to relabel those false-negative (FN) records. In this notebook, we re-run the models on the relabelled dataset to assess whether relabelling improves performance.

### Import Basic Packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Import Packages for Logistic Regression

In [2]:
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MaxAbsScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score,confusion_matrix


### Import Packages for Naive Bayes

In [3]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

### Import Packages to Ignore Warnings

The `is_sparse` function is deprecated and triggers a `FutureWarning`, but this does not affect the models or their results, so we suppress the warning for cleaner output.

In [4]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

### Read Files

The dataset is downloaded from https://www.kaggle.com/c/nlp-getting-started/overview

In [5]:
# Load data
train_df = pd.read_csv("./kaggle/input/train.csv")
test_df = pd.read_csv("./kaggle/input/test.csv")

In [6]:
train_df.head(5)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [7]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [8]:
# read in relabelled data
relabelled_df = pd.read_excel("Relabel for FN records.xlsx")
relabelled_df

Unnamed: 0,id,keyword,location,text,target,relabel
0,9794,trapped,,Hollywood Movie About Trapped Miners Released ...,1,0.0
1,4172,drown,Portugal,I can't drown my demons they know how to swim,1,0.0
2,2817,cyclone,,@XHNews We need these plants in the pacific du...,1,0.0
3,7160,mudslide,Birmingham & Bristol,It looks like a mudslide' poor thing! ?? #grea...,1,0.0
4,8578,screams,Sheffield/Leeds,I agree with certain cultural appropriation th...,1,0.0
...,...,...,...,...,...,...
876,6702,lava,probably watching survivor,The sunset looked like an erupting volcano ......,1,0.0
877,8739,sinking,MA,that horrible sinking feeling when youÂ‰Ã›Âªve...,1,0.0
878,7147,mudslide,"Ealing, London",It looks like a mudslide!' And #GBBO is back w...,1,0.0
879,8129,rescued,,Heroes! A Springer Spaniel &amp; her dog dad r...,1,0.0


In [9]:
# Find the ids of the records that need to be relabelled
relabelled_ids = relabelled_df[relabelled_df.relabel == 0].id.tolist()
# 515 records, all currently labelled 1, will be relabelled
train_df[train_df['id'].isin(relabelled_ids)].target.value_counts()

target
1    515
Name: count, dtype: int64

In [10]:
# Create a new "target_relabelled" column: 0 for the relabelled ids, the original target otherwise
train_df['target_relabelled'] = train_df['target'].where(~train_df['id'].isin(relabelled_ids), 0)

In [11]:
print(train_df['target'].value_counts())
print(train_df['target_relabelled'].value_counts())

target
0    4342
1    3271
Name: count, dtype: int64
target_relabelled
0    4857
1    2756
Name: count, dtype: int64


In [12]:
def target_percentage(df, column, relabel = False):
    total_count = len(df)
    non_disaster_count = df[column].value_counts()[0]
    disaster_count = df[column].value_counts()[1]
    # int() truncates rather than rounds, so the two percentages can sum to 99
    percentage_non_disaster = int((non_disaster_count / total_count) * 100)
    percentage_disaster = int((disaster_count / total_count) * 100)
    
    if not relabel:
        print("Original Non-Disaster:", percentage_non_disaster, "%")
        print("Original Disaster:", percentage_disaster, "%")
    else:
        print("Relabelled Non-Disaster:", percentage_non_disaster, "%")
        print("Relabelled Disaster:", percentage_disaster, "%")

In [13]:
target_percentage(train_df, 'target')
target_percentage(train_df, 'target_relabelled', relabel = True)

Original Non-Disaster: 57 %
Original Disaster: 42 %
Relabelled Non-Disaster: 63 %
Relabelled Disaster: 36 %
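
Because `target_percentage` truncates with `int()`, each pair of percentages above sums to 99 rather than 100. The following is a minimal sketch of the same calculation with rounding instead, using the class counts printed earlier. The helper name `class_share` is hypothetical and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "original": {"non_disaster": 4342, "disaster": 3271},
    "relabelled": {"non_disaster": 4857, "disaster": 2756},
}

def class_share(count, total, digits=1):
    """Return the share of `count` in `total` as a rounded percentage."""
    return round(count / total * 100, digits)

for name, c in counts.items():
    total = c["non_disaster"] + c["disaster"]
    print(name,
          "non-disaster:", class_share(c["non_disaster"], total), "%",
          "disaster:", class_share(c["disaster"], total), "%")
```

With one decimal place, the original split is roughly 57.0 / 43.0 and the relabelled split roughly 63.8 / 36.2, so relabelling shifts about 6.8 percentage points from the disaster class to the non-disaster class.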


In [14]:
#save the modified train data
file_path = "./kaggle/input/train_relabelled.csv"
train_df.to_csv(file_path, index=False)

In [15]:
# Read the modified train data.
# Drop the original target and rename target_relabelled to target, so that the previous code can be reused to run the models.
train_df = pd.read_csv("./kaggle/input/train_relabelled.csv")
train_df = train_df.drop('target', axis=1)
column_mapping = {'target_relabelled': 'target'}
train_df = train_df.rename(columns=column_mapping)

### Clean Text Column

Text cleaning is handled by a Python script (`Preprocessing_for_Text_Processing_Comparison.py`), which applies the following steps:

* For all models, steps 1-5 are applied:
1. remove links
2. remove @account mentions
3. remove line breaks
4. remove # from hashtags
5. remove non-ASCII characters

* For sentence embeddings, steps 6-10 are skipped, because the underlying language model is good at understanding sequential meaning (we validate this assumption in Method3). For the other methods, they are applied:
6. tokenize text
7. convert words to lower case
8. keep only alphabetic words
9. remove stop words
10. stem words

Once cleaning is complete, two additional columns are created. The first, `text_clean`, holds the list of tokenized words. The second, `text_clean_string`, joins the tokens from `text_clean` with a single space between them. They are used by the three methods below.

For more details, refer to the Python script.
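
As an illustration of steps 1 to 5, here is a minimal sketch that uses only the standard `re` module. It is an assumption about what such cleaning could look like, not the contents of the actual script: the function name `basic_clean` and the regular expressions are hypothetical.

```python
import re

def basic_clean(text):
    """Hypothetical stand-in for steps 1-5; not the project's actual script."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)       # 1. remove links
    text = re.sub(r"@\w+", "", text)                         # 2. remove @account mentions
    text = re.sub(r"[\r\n]+", " ", text)                     # 3. remove line breaks
    text = text.replace("#", "")                             # 4. remove # from hashtags
    text = text.encode("ascii", "ignore").decode("ascii")    # 5. remove non-ASCII characters
    return re.sub(r"\s+", " ", text).strip()                 # tidy leftover whitespace

print(basic_clean("Forest fire near #LaRonge \u00fb @user\nhttp://t.co/abc"))
```

The order matters in this sketch: links are removed before the `#` characters so that URL fragments are dropped together with their link.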

In [16]:
import Preprocessing_for_Text_Processing_Comparison as pp

In [17]:
train = pp.process_text(train_df)
test = pp.process_text(test_df)

In [18]:
train.head()

Unnamed: 0,id,keyword,location,text,target,text_clean,text_clean_string
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,"[deed, reason, earthquak, may, allah, forgiv, us]",deed reason earthquak may allah forgiv us
1,4,,,Forest fire near La Ronge Sask. Canada,1,"[forest, fire, near, la, rong, sask, canada]",forest fire near la rong sask canada
2,5,,,All residents asked to 'shelter in place' are ...,1,"[resid, ask, place, notifi, offic, evacu, shel...",resid ask place notifi offic evacu shelter pla...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"[peopl, receiv, wildfir, evacu, order, califor...",peopl receiv wildfir evacu order california
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,"[got, sent, photo, rubi, alaska, smoke, wildfi...",got sent photo rubi alaska smoke wildfir pour ...


In [19]:
test.head()

Unnamed: 0,id,keyword,location,text,text_clean,text_clean_string
0,0,,,Just happened a terrible car crash,"[happen, terribl, car, crash]",happen terribl car crash
1,2,,,"Heard about #earthquake is different cities, s...","[heard, earthquak, differ, citi, stay, safe, e...",heard earthquak differ citi stay safe everyon
2,3,,,"there is a forest fire at spot pond, geese are...","[forest, fire, spot, pond, gees, flee, across,...",forest fire spot pond gees flee across street ...
3,9,,,Apocalypse lighting. #Spokane #wildfires,"[apocalyps, light, spokan, wildfir]",apocalyps light spokan wildfir
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan,"[typhoon, soudelor, kill, china, taiwan]",typhoon soudelor kill china taiwan


### Method1: Word Frequency

To establish a baseline, let's start with the count of words in each tweet.
Below, we use `CountVectorizer` to build the word-count matrix.

In [20]:
from sklearn import feature_extraction

In [21]:
count_vectorizer = feature_extraction.text.CountVectorizer()

In [22]:
## let's take a look at expected output by using first 2 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train["text_clean_string"][0:2])
print(train["text_clean_string"][0:2].values)
print(count_vectorizer.get_feature_names_out())
print(count_vectorizer.vocabulary_)
print(example_train_vectors.toarray())
print(example_train_vectors.toarray().shape)

['deed reason earthquak may allah forgiv us'
 'forest fire near la rong sask canada']
['allah' 'canada' 'deed' 'earthquak' 'fire' 'forest' 'forgiv' 'la' 'may'
 'near' 'reason' 'rong' 'sask' 'us']
{'deed': 2, 'reason': 10, 'earthquak': 3, 'may': 8, 'allah': 0, 'forgiv': 6, 'us': 13, 'forest': 5, 'fire': 4, 'near': 9, 'la': 7, 'rong': 11, 'sask': 12, 'canada': 1}
[[1 0 1 1 0 0 1 0 1 0 1 0 0 1]
 [0 1 0 0 1 1 0 1 0 1 0 1 1 0]]
(2, 14)


The output above tells us that there are 14 unique words (or "tokens") in the first two tweets.
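
To make the matrix less of a black box, the same 2 x 14 result can be rebuilt by hand. This is a simplified sketch: it splits on whitespace, whereas `CountVectorizer` uses its own tokenizer (which, by default, also lowercases and drops single-character tokens). For these two cleaned tweets the outcome is the same.

```python
from collections import Counter

docs = [
    "deed reason earthquak may allah forgiv us",
    "forest fire near la rong sask canada",
]

# Vocabulary: unique tokens in alphabetical order, as CountVectorizer reports them.
vocab = sorted({token for doc in docs for token in doc.split()})

# One row per tweet, one column per vocabulary entry, each cell a token count.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[token] for token in vocab])

print(vocab)
for row in matrix:
    print(row)
```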

Now let's create vectors for all of our tweets.

In [23]:
train_vectors = count_vectorizer.fit_transform(train["text_clean_string"])

# note that we're NOT using .fit_transform() here. Using just .transform() makes sure that train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(test["text_clean_string"])

In [24]:
# print(count_vectorizer.get_feature_names_out())
# print(count_vectorizer.vocabulary_)
# print(example_train_vectors.toarray())
print(train_vectors.toarray().shape)

(7613, 10532)


1-1 Build model without standardized variables

In [25]:
X = train_vectors
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_train, LR.predict(X_train)))  # training split
print("F1 Score:",f1_score(y_test, predicted))  # held-out 20% split
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
print(pd.DataFrame(confusion_matrix(y_test, predicted)))

F1 Score: 0.9558550185873607
F1 Score: 0.8038834951456311
      0     1
0  3843    42
1   148  2057
     0    1
0  907   65
1  137  414


In [26]:
# Get the feature names (words) from the CountVectorizer
feature_names = count_vectorizer.get_feature_names_out()

# Get the coefficients from the trained logistic regression model
coefficients = LR.coef_[0]

# Create a dictionary that associates feature names with their coefficients
feature_coefficients = dict(zip(feature_names, coefficients))

# Sort the features by their coefficients to find the ones with the largest positive coefficients
sorted_features_positive = sorted(feature_coefficients.items(), key=lambda x: x[1], reverse=True)

# Sort the features by their coefficients to find the ones with the most negative coefficients
sorted_features_negative = sorted(feature_coefficients.items(), key=lambda x: x[1])

# Print the top N most important positive features
top_n = 10  # Change this value to see more or fewer features
for feature, coefficient in sorted_features_positive[:top_n]:
    print(f"{feature}: {coefficient}")

print('----------------------------------------')
    
# Print the top N most important negative features
for feature, coefficient in sorted_features_negative[:top_n]:
    print(f"{feature}: {coefficient}")


hiroshima: 3.2235906811173214
earthquak: 2.433729953999329
wildfir: 2.35408079442359
typhoon: 2.222483661423877
storm: 2.0965145002339285
hailstorm: 2.0797444679573003
terror: 2.07040024737212
massacr: 1.937770617140202
flood: 1.8580075912108673
debri: 1.8265627496019627
----------------------------------------
love: -1.6729591328635658
obliter: -1.4964097901958806
im: -1.4532273580240052
play: -1.3728193747669386
best: -1.362140773368952
let: -1.2809787867933276
poll: -1.2556801293626336
write: -1.2239639356602108
phone: -1.2122294344012605
new: -1.2110189863305614


1-2 Build model with MaxAbsScaler
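
`MaxAbsScaler` divides each feature (column) by its largest absolute value, so zeros stay zero and a sparse count matrix stays sparse. The sketch below shows the idea on plain lists. The function name `max_abs_scale` and the toy numbers are made up for illustration.

```python
def max_abs_scale(rows):
    """Divide every column by its largest absolute value (all-zero columns are left as-is)."""
    n_cols = len(rows[0])
    scales = [max(abs(row[j]) for row in rows) or 1.0 for j in range(n_cols)]
    return [[row[j] / scales[j] for j in range(n_cols)] for row in rows]

# Toy count matrix: 3 tweets x 3 tokens (made-up numbers).
counts = [
    [1, 0, 4],
    [0, 2, 2],
    [3, 0, 0],
]
for row in max_abs_scale(counts):
    print(row)
```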

In [27]:
# Scale each feature by its maximum absolute value
scaler = MaxAbsScaler()
X_train_scaled = scaler.fit_transform(train_vectors)
X_test_scaled = scaler.transform(test_vectors)


X = X_train_scaled
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_train, LR.predict(X_train)))  # training split
print("F1 Score:",f1_score(y_test, predicted))  # held-out 20% split
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
print(pd.DataFrame(confusion_matrix(y_test, predicted)))

F1 Score: 0.9492719586660403
F1 Score: 0.8003972194637538
      0     1
0  3853    32
1   184  2021
     0    1
0  919   53
1  148  403


In [28]:
# Get the feature names (words) from the CountVectorizer
feature_names = count_vectorizer.get_feature_names_out()

# Get the coefficients from the trained logistic regression model
coefficients = LR.coef_[0]

# Create a dictionary that associates feature names with their coefficients
feature_coefficients = dict(zip(feature_names, coefficients))

# Sort the features by their coefficients to find the ones with the largest positive coefficients
sorted_features_positive = sorted(feature_coefficients.items(), key=lambda x: x[1], reverse=True)

# Sort the features by their coefficients to find the ones with the most negative coefficients
sorted_features_negative = sorted(feature_coefficients.items(), key=lambda x: x[1])

# Print the top N most important positive features
top_n = 10  # Change this value to see more or fewer features
for feature, coefficient in sorted_features_positive[:top_n]:
    print(f"{feature}: {coefficient}")

print('----------------------------------------')
    
# Print the top N most important negative features
for feature, coefficient in sorted_features_negative[:top_n]:
    print(f"{feature}: {coefficient}")

hiroshima: 3.649444017555073
fire: 3.0221014541643365
storm: 2.856181514370036
wildfir: 2.6064582762320248
flood: 2.58462355719907
earthquak: 2.5585862748905437
california: 2.50585799480792
evacu: 2.4007609046382927
train: 2.386053040154951
suicid: 2.3753110881717667
----------------------------------------
love: -2.0071768987091567
like: -1.8033211990826956
obliter: -1.7475210443364972
new: -1.610437434648507
im: -1.5424435352718318
let: -1.4223989483274389
crush: -1.4205795325621478
want: -1.4199524024136811
play: -1.3894126350324298
best: -1.3775459572256539


In [29]:
# Create and train the Multinomial Naive Bayes model on the MaxAbsScaler-scaled counts from In [27]
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train,y_train)
train_predicted = nb_classifier.predict(X_train)
test_predicted = nb_classifier.predict(X_test)
print("Train F1 Score:",f1_score(y_train, train_predicted))
print("Test F1 Score:",f1_score(y_test, test_predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, train_predicted)))
print(pd.DataFrame(confusion_matrix(y_test, test_predicted)))

Train F1 Score: 0.9104477611940298
Test F1 Score: 0.7902621722846441
      0     1
0  3754   131
1   253  1952
     0    1
0  877   95
1  129  422


### Method2: TF-IDF

Now, let's try TF-IDF using `TfidfVectorizer`.
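
As a rough guide to what TF-IDF computes, the sketch below reproduces the weighting on a toy corpus with plain Python. It assumes scikit-learn's default settings (raw term counts, smoothed IDF, L2 row normalization); the three toy documents are made up and the whitespace tokenization is a simplification.

```python
import math
from collections import Counter

docs = [
    "forest fire near canada",
    "fire evacu order california",
    "love new song",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw term counts times IDF, then scaled to unit Euclidean length."""
    tf = Counter(tokens)
    weights = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

rows = [tfidf_row(doc) for doc in tokenized]
print(dict(zip(vocab, (round(w, 3) for w in rows[0]))))
```

In the first toy document, "fire" receives a lower weight than "forest" because it also appears in the second document, which is the behaviour that distinguishes TF-IDF from plain word counts.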

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [31]:
# Keep only terms that appear in at least 10 tweets (min_df is a document frequency)
# Use unigrams and bigrams
vec_text = TfidfVectorizer(min_df = 10, ngram_range = (1,2), stop_words='english') 

text_vec = vec_text.fit_transform(train['text_clean_string'])
text_vec_test = vec_text.transform(test['text_clean_string'])
X_train_text = pd.DataFrame(text_vec.toarray(), columns=vec_text.get_feature_names_out())
X_test_text = pd.DataFrame(text_vec_test.toarray(), columns=vec_text.get_feature_names_out())
print (X_train_text.shape)

(7613, 1520)


In [32]:
X_train_text

Unnamed: 0,aba,aba woman,abandon,abc,abc news,abl,ablaz,absolut,accid,accord,...,yesterday,yo,york,young,youth,youth save,youtub,yr,yyc,zone
0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7608,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7609,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7610,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7611,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [33]:
X = X_train_text
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_train, LR.predict(X_train)))  # training split
print("F1 Score:",f1_score(y_test, predicted))  # held-out 20% split
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
pd.DataFrame(confusion_matrix(y_test, predicted))

F1 Score: 0.8442703232125367
F1 Score: 0.7745197168857432
      0     1
0  3730   155
1   481  1724


Unnamed: 0,0,1
0,917,55
1,168,383


In [34]:
# Get the feature names (unigrams and bigrams) from the TfidfVectorizer
feature_names = vec_text.get_feature_names_out()

# Get the coefficients from the trained logistic regression model
coefficients = LR.coef_[0]

# Create a dictionary that associates feature names with their coefficients
feature_coefficients = dict(zip(feature_names, coefficients))

# Sort the features by their coefficients to find the ones with the largest positive coefficients
sorted_features_positive = sorted(feature_coefficients.items(), key=lambda x: x[1], reverse=True)

# Sort the features by their coefficients to find the ones with the most negative coefficients
sorted_features_negative = sorted(feature_coefficients.items(), key=lambda x: x[1])

# Print the top N most important positive features
top_n = 10  # Change this value to see more or fewer features
for feature, coefficient in sorted_features_positive[:top_n]:
    print(f"{feature}: {coefficient}")

print('----------------------------------------')
    
# Print the top N most important negative features
for feature, coefficient in sorted_features_negative[:top_n]:
    print(f"{feature}: {coefficient}")

hiroshima: 4.275421238239577
wildfir: 3.3212316744158494
kill: 3.0955466022564986
california: 3.091339943710916
earthquak: 3.0053209072353746
flood: 2.6340086332177184
build: 2.630763386661721
storm: 2.5773261202259587
evacu: 2.513961610048836
terror: 2.5034294145736102
----------------------------------------
love: -2.5050454413254055
new: -2.285640936987951
obliter: -2.237746940088469
like: -2.104307020737453
want: -2.055299262699642
let: -1.96135707284439
play: -1.879522722322039
crush: -1.832865425900002
im: -1.8203578817790873
scream: -1.7861272914707753


The TF-IDF hold-out F1 score (0.775) is lower than the Word Frequency scores (0.804 without scaling, 0.800 with MaxAbsScaler).

In [35]:
# Create and train the Multinomial Naive Bayes model on the TF-IDF features from In [33]
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train,y_train)
train_predicted = nb_classifier.predict(X_train)
test_predicted = nb_classifier.predict(X_test)
print("Train F1 Score:",f1_score(y_train, train_predicted))
print("Test F1 Score:",f1_score(y_test, test_predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, train_predicted)))
print(pd.DataFrame(confusion_matrix(y_test, test_predicted)))

Train F1 Score: 0.8135508155583437
Test F1 Score: 0.7533818938605619
      0     1
0  3726   159
1   584  1621
     0    1
0  924   48
1  189  362


### Method3: Sentence Embedding

Here, we use `SentenceTransformer` to transform our text into sentence embeddings.

*Reference link: https://www.youtube.com/watch?v=c7AqnswslWo

An experiment is performed to determine whether it is more effective to create sentence embeddings from the original text or from text that has undergone lowercase conversion, removal of non-alphabetic characters, exclusion of stop words, and stemming.

In [36]:
# install package
# !pip install --user sentence-transformers -q

In [37]:
# Import libraries
from sentence_transformers import SentenceTransformer, util
import torch

1. Embedding from processed text (lowercased, alphabetic words only, stop words removed, stemmed)

In [38]:
# use SentenceTransformer to generate sentence embedding
# can choose diff models from: https://www.sbert.net/docs/pretrained_models.html
model = SentenceTransformer('all-MiniLM-L6-v2')
train['embeddings'] = train['text_clean_string'].apply(model.encode)

In [39]:
X = train['embeddings'].to_list()
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
train_predicted = LR.predict(X_train)
test_predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

print("Train F1 Score:",f1_score(y_train, train_predicted))
print("Test F1 Score:",f1_score(y_test, test_predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, train_predicted)))
print(pd.DataFrame(confusion_matrix(y_test, test_predicted)))

Train F1 Score: 0.771544521365481
Test F1 Score: 0.7568093385214009
      0     1
0  3517   368
1   589  1616
     0    1
0  884   88
1  162  389


2. Embedding from original text

In [40]:
train_llm = pp.process_text(train_df,LLM = True)
test_llm = pp.process_text(test_df,LLM = True)

In [41]:
train_llm.head()

Unnamed: 0,id,keyword,location,text,target,text_clean
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,Our Deeds are the Reason of this earthquake Ma...
1,4,,,Forest fire near La Ronge Sask. Canada,1,Forest fire near La Ronge Sask. Canada
2,5,,,All residents asked to 'shelter in place' are ...,1,All residents asked to 'shelter in place' are ...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"13,000 people receive wildfires evacuation ord..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,Just got sent this photo from Ruby Alaska as s...


In [42]:
# use SentenceTransformer to generate sentence embedding
# can choose diff models from: https://www.sbert.net/docs/pretrained_models.html
train_llm['embeddings'] = train_llm['text_clean'].apply(model.encode)

In [43]:
train_llm.head()

Unnamed: 0,id,keyword,location,text,target,text_clean,embeddings
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,Our Deeds are the Reason of this earthquake Ma...,"[-0.0054457616, 0.064116865, 0.11215522, 0.033..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,Forest fire near La Ronge Sask. Canada,"[0.04022922, 0.03801415, -0.0064313184, 0.0246..."
2,5,,,All residents asked to 'shelter in place' are ...,1,All residents asked to 'shelter in place' are ...,"[0.13137569, 0.012401222, 0.06718611, 0.085462..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"13,000 people receive wildfires evacuation ord...","[0.099470675, -0.033589236, 0.0065593175, 0.01..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,Just got sent this photo from Ruby Alaska as s...,"[-0.036732815, 0.10448628, 0.0742257, 0.089533..."


In [44]:
X = train_llm['embeddings'].to_list()
y = train_llm['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
train_predicted = LR.predict(X_train)
test_predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))

print("Train F1 Score:",f1_score(y_train, train_predicted))
print("Test F1 Score:",f1_score(y_test, test_predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, train_predicted)))
print(pd.DataFrame(confusion_matrix(y_test, test_predicted)))

Train F1 Score: 0.8250116658889407
Test F1 Score: 0.787878787878788
      0     1
0  3572   313
1   437  1768
     0    1
0  883   89
1  135  416


Running the commented-out code below raises the error: Negative values in data passed to MultinomialNB (input X)

This is because the Multinomial Naive Bayes algorithm (`MultinomialNB`) is designed for count-like data. In text classification, it is commonly used to model how often each word occurs in a document.

Negative values are not meaningful in this context, because a document cannot contain a negative count of a word. `MultinomialNB` therefore requires non-negative features, typically term counts (fractional weights such as TF-IDF also work in practice, as in Method2), whereas sentence embeddings contain negative components.
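
If a Naive Bayes baseline on embeddings is still wanted, one possible alternative (not used in this notebook) is `GaussianNB`, which models each feature as a real-valued Gaussian and therefore accepts negative inputs. The sketch below uses synthetic data as a stand-in for the embeddings, so its numbers say nothing about the tweet dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
# Synthetic stand-in for sentence embeddings: real-valued, with negative entries.
X = rng.normal(size=(40, 8))
y = np.array([0, 1] * 20)
X[y == 1] += 1.0  # shift one class so there is something to learn

try:
    MultinomialNB().fit(X, y)
    multinomial_ok = True
except ValueError as err:
    multinomial_ok = False
    print("MultinomialNB failed:", err)

# GaussianNB has no non-negativity requirement.
gnb = GaussianNB().fit(X, y)
print("GaussianNB train accuracy:", gnb.score(X, y))
```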

In [45]:
# # Create and train the Multinomial Naive Bayes model
# nb_classifier = MultinomialNB()
# nb_classifier.fit(X_train,y_train)
# train_predicted = nb_classifier.predict(X_train)
# test_predicted = nb_classifier.predict(X_test)
# print("Train F1 Score:",f1_score(y_train, train_predicted))
# print("Test F1 Score:",f1_score(y_test, test_predicted))
# # Confusion matrix
# print(pd.DataFrame(confusion_matrix(y_train, train_predicted)))
# print(pd.DataFrame(confusion_matrix(y_test, test_predicted)))

In [46]:
from sklearn.svm import SVC

# X_train and y_train are your sentence embeddings and labels
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)
train_predicted = svm_classifier.predict(X_train)
test_predicted = svm_classifier.predict(X_test)
print("Train F1 Score:",f1_score(y_train, train_predicted))
print("Test F1 Score:",f1_score(y_test, test_predicted))
# Confusion matrix
print(pd.DataFrame(confusion_matrix(y_train, train_predicted)))
print(pd.DataFrame(confusion_matrix(y_test, test_predicted)))

Train F1 Score: 0.8316600513658651
Test F1 Score: 0.7823585810162992
      0     1
0  3588   297
1   424  1781
     0    1
0  888   84
1  143  408
