### Introduction

In the Text Processing Comparison notebook, logistic regression already reached an F1 score of around 0.7. We can also use `get_feature_names_out` and `LogisticRegression.coef_` to find the words that most strongly indicate disaster or non-disaster tweets.

This notebook looks more closely at the false positive and false negative records to see whether better preprocessing of the dataset can improve the predictions.

### Import Basic Packages

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Import Packages for Logistic Regression

In [2]:
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MaxAbsScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score, confusion_matrix


### Read Files and Do Prediction

The dataset is downloaded from https://www.kaggle.com/c/nlp-getting-started/overview

In [3]:
# Load data
train_df = pd.read_csv("./kaggle/input/train.csv")
test_df = pd.read_csv("./kaggle/input/test.csv")

In [4]:
import Preprocessing_for_Text_Processing_Comparison as pp

In [5]:
train = pp.process_text(train_df)
test = pp.process_text(test_df)

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
# Keep only terms that appear in at least 10 tweets (min_df is a document frequency)
# Use both unigrams and bigrams
vec_text = TfidfVectorizer(min_df=10, ngram_range=(1, 2), stop_words='english')

text_vec = vec_text.fit_transform(train['text_clean_string'])
text_vec_test = vec_text.transform(test['text_clean_string'])
X_train_text = pd.DataFrame(text_vec.toarray(), columns=vec_text.get_feature_names_out())
X_test_text = pd.DataFrame(text_vec_test.toarray(), columns=vec_text.get_feature_names_out())
print(X_train_text.shape)

(7613, 1511)


In [11]:
import warnings

# Filter out the FutureWarning related to is_sparse
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn.utils.validation")

In [16]:
X = X_train_text
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
LR = LogisticRegression()
LR.fit(X_train, y_train)
predicted_test = LR.predict(X_test)
# Optional extra metrics (these require `from sklearn import metrics`):
# print("Logistic Regression Accuracy:", metrics.accuracy_score(y_test, predicted_test))
# print("Logistic Regression Precision:", metrics.precision_score(y_test, predicted_test))
# print("Logistic Regression Recall:", metrics.recall_score(y_test, predicted_test))
print("Train F1 Score:", f1_score(y_train, LR.predict(X_train)))
print("Test F1 Score:", f1_score(y_test, predicted_test))
# Confusion matrices (rows = true label, columns = predicted label)
print(pd.DataFrame(confusion_matrix(y_train, LR.predict(X_train))))
pd.DataFrame(confusion_matrix(y_test, predicted_test))

Train F1 Score: 0.7977644380045539
Test F1 Score: 0.763396537510305
      0     1
0  3186   287
1   690  1927

     0    1
0  773   96
1  191  463
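
As a sanity check, the reported test F1 score can be recomputed by hand from the test confusion matrix above. The sketch below assumes scikit-learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative.

```python
# Test-set confusion matrix from the output above
# (rows = true label, columns = predicted label)
tn, fp = 773, 96
fn, tp = 191, 463

precision = tp / (tp + fp)  # share of predicted disasters that are real
recall = tp / (tp + fn)     # share of real disasters that are caught
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.828 recall=0.708 f1=0.763
```

Recall is noticeably lower than precision (191 false negatives against 96 false positives), which is one reason to inspect the false negatives closely.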


### Find False Positive and False Negative Tweets

In [17]:
prediction_train = LR.predict(X_train)

In [25]:
# Find false positive (FP) records in the training data
prediction_train = LR.predict(X_train)
FP_indices_train = [X_train.iloc[i].name for i in range(len(X_train)) if y_train[i] == 0 and prediction_train[i] == 1]
FP_train = train.loc[FP_indices_train]

In [10]:
# Export to CSV: the full text is easier to read in Excel
FP_train.to_csv('FP_train.csv', index=True)

In [23]:
# Find false negative (FN) records in the test data
FN_indices_test = [X_test.iloc[i].name for i in range(len(X_test)) if y_test[i] == 1 and predicted_test[i] == 0]
FN_test = train.loc[FN_indices_test]
# Find false negative (FN) records in the training data
FN_indices_train = [X_train.iloc[i].name for i in range(len(X_train)) if y_train[i] == 1 and prediction_train[i] == 0]
FN_train = train.loc[FN_indices_train]
# Combine the FN records from both the train and test splits
full_FN = pd.concat([FN_test, FN_train])

In [24]:
# Export to CSV: the full text is easier to read in Excel
full_FN.to_csv('full_FN.csv', index=True)
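
The index lookups above can also be written with boolean masks. The following is a minimal sketch on a hypothetical toy frame standing in for `train`; `toy`, `y_true`, and `y_pred` are made-up names and values, not the notebook's variables.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `train`, with a non-trivial index
toy = pd.DataFrame(
    {"text": ["t0", "t1", "t2", "t3", "t4"], "target": [0, 1, 1, 0, 1]},
    index=[10, 11, 12, 13, 14],
)
y_true = toy["target"].to_numpy()
y_pred = np.array([1, 1, 0, 0, 0])  # made-up predictions

# False positive: true label 0, predicted 1
fp_mask = (y_true == 0) & (y_pred == 1)
# False negative: true label 1, predicted 0
fn_mask = (y_true == 1) & (y_pred == 0)

# Use the masks to recover the original row labels, then select the rows
toy_FP = toy.loc[toy.index[fp_mask]]
toy_FN = toy.loc[toy.index[fn_mask]]

print(toy_FP.index.tolist())  # [10]
print(toy_FN.index.tolist())  # [12, 14]
```

In this notebook the equivalent would presumably be something like `X_train.index[(np.asarray(y_train) == 0) & (prediction_train == 1)]`, which keeps the original row labels needed for `train.loc[...]`.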

### Demonstration of FP

The tweet below contains <b>burning buildings</b>, which may look like a disaster, but it is actually just a <b>metaphor</b>.
    
This is ChatGPT's explanation for this sentence:
1. "I'm mentally preparing myself for a bomb ass school year": This part suggests that the speaker is getting ready or mentally bracing themselves for an exceptionally good or exciting school year. "Bomb ass" is a slang term that means something is really excellent or outstanding.

2. "if it's not": This part introduces a condition or possibility. The speaker is implying that they are excited and prepared for a great school year, but there's also a chance it might not turn out that way.

3. "I'm burning buildings": This phrase can be metaphorical and hyperbolic. It doesn't literally mean setting fire to buildings. Instead, it may express the speaker's frustration or anger. In this context, it suggests that if the school year doesn't meet their expectations, they will feel upset or let down.


Using only TF-IDF, or other methods that do not take ambiguous meanings into account, may cause this kind of problem.





In [19]:
FP_train.loc[1204]['text']

"I'm mentally preparing myself for a bomb ass school year if it's not I'm burning buildings ??"

### Demonstration of FN

Some records are labelled as disasters in the dataset, but they do not look like disaster-related tweets.


In [34]:
full_FN.loc[4509]['text']

'My back is so sunburned :('

In [35]:
full_FN.loc[4444]['text']

'I went to pick up my lunch today and the bartender was holding my change hostage because he wanted my number. ??'

To decide whether to relabel the dataset, we looked through the dataset and found that about 12% of 150 reviewed records were mislabelled, most of them incorrectly labelled as 1.
With a data-centric mindset and the goal of making the model better serve our use case, which is to help the government detect emergency tweets caused by disasters, we decided to relabel the false negative (FN) data. (Relabelling the entire dataset would be ideal, but we did not do so because of time and resource constraints.)
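
One possible way to apply such manual corrections is sketched below. Everything in it is hypothetical: the toy frame, the row ids, and the `corrections` mapping are made up to show the mechanics and do not reflect the actual relabelling decisions.

```python
import pandas as pd

# Hypothetical miniature of `train`; ids, texts and labels are made up
toy_train = pd.DataFrame(
    {
        "text": ["sunburned back", "wildfire near town", "held my change hostage"],
        "target": [1, 1, 1],
    },
    index=[101, 102, 103],
)

# Hypothetical outcome of a manual review: row index -> corrected label
corrections = {101: 0, 103: 0}

# Work on a copy so the original labels stay available for comparison
relabelled = toy_train.copy()
relabelled.loc[list(corrections), "target"] = list(corrections.values())

print(relabelled["target"].tolist())  # [0, 1, 0]
```

Keeping the corrections in a separate mapping, rather than editing the CSV by hand, makes the relabelling step repeatable and easy to audit.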