### Introduction

This notebook aims to explore the impact of various text processing techniques on prediction outcomes. Specifically, it will investigate the performance differences between methods such as word frequency analysis, TF-IDF (Term Frequency-Inverse Document Frequency), and sentence embedding in the context of making predictions.

To ensure that variations primarily arise from different text processing techniques, we will exclusively employ logistic regression as our chosen classification model.

### Import Basic Packages

In [1]:
import numpy as np
import pandas as pd

### Import Packages for Logistic Regression

In [18]:
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score, confusion_matrix

### Read Files

The dataset is downloaded from https://www.kaggle.com/c/nlp-getting-started/overview

In [2]:
# Load data
train_df = pd.read_csv("./kaggle/input/train.csv")
test_df = pd.read_csv("./kaggle/input/test.csv")

In [3]:
train_df.head(5)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [4]:
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


### Clean Text Column

Text cleaning is done in a separate Python script (`Preprocessing_for_Text_Processing_Comparison.py`). It applies the following steps:

1. Remove links
2. Remove line breaks
3. Remove `#` from hashtags
4. Tokenize the text
5. Convert words to lower case
6. Keep only alphabetic words
7. Remove stop words
8. Stem words

The cleaning process adds two columns. The first, `text_clean`, holds the list of tokens for each tweet. The second, `text_clean_string`, joins the tokens from `text_clean` with single spaces. They are used by the three methods below.

For more details, refer to the Python script.
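The eight steps above can be sketched roughly as follows. This is not the code from the preprocessing script: `clean_text`, the tiny `STOP_WORDS` set, and the regex-based tokenizer are assumptions made for illustration, and stemming is left as an optional `stem` callable (for example, NLTK's `PorterStemmer().stem`).

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real script would use a full one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "this", "in", "and", "to", "at", "our"}

def clean_text(text, stop_words=STOP_WORDS, stem=None):
    """Return (tokens, joined_string) for one tweet, mirroring the steps listed above."""
    text = re.sub(r"https?://\S+", " ", text)      # 1. remove links
    text = re.sub(r"[\r\n]+", " ", text)           # 2. remove line breaks
    text = text.replace("#", "")                   # 3. drop '#' but keep the hashtag word
    tokens = re.findall(r"[a-z]+", text.lower())   # 4-6. tokenize, lowercase, letters only
    tokens = [t for t in tokens if t not in stop_words]  # 7. remove stop words
    if stem is not None:                           # 8. stem, if a stemmer is supplied
        tokens = [stem(t) for t in tokens]
    return tokens, " ".join(tokens)

tokens, joined = clean_text("Forest fire near La Ronge Sask. Canada\nhttp://t.co/abc #wildfires")
print(tokens)
print(joined)
```

Without a stemmer the tokens stay unstemmed (`ronge`, `wildfires`); passing a stemmer would shorten them to forms like those shown in the `text_clean` column.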

In [5]:
import Preprocessing_for_Text_Processing_Comparison as pp

In [6]:
train = pp.process_text(train_df)
test = pp.process_text(test_df)

In [7]:
train.head()

Unnamed: 0,id,keyword,location,text,target,text_clean,text_clean_string
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,"[deed, reason, earthquak, may, allah, forgiv, us]",deed reason earthquak may allah forgiv us
1,4,,,Forest fire near La Ronge Sask. Canada,1,"[forest, fire, near, la, rong, sask, canada]",forest fire near la rong sask canada
2,5,,,All residents asked to 'shelter in place' are ...,1,"[resid, ask, place, notifi, offic, evacu, shel...",resid ask place notifi offic evacu shelter pla...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"[peopl, receiv, wildfir, evacu, order, califor...",peopl receiv wildfir evacu order california
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,"[got, sent, photo, rubi, alaska, smoke, wildfi...",got sent photo rubi alaska smoke wildfir pour ...


In [8]:
test.head()

Unnamed: 0,id,keyword,location,text,text_clean,text_clean_string
0,0,,,Just happened a terrible car crash,"[happen, terribl, car, crash]",happen terribl car crash
1,2,,,"Heard about #earthquake is different cities, s...","[heard, earthquak, differ, citi, stay, safe, e...",heard earthquak differ citi stay safe everyon
2,3,,,"there is a forest fire at spot pond, geese are...","[forest, fire, spot, pond, gees, flee, across,...",forest fire spot pond gees flee across street ...
3,9,,,Apocalypse lighting. #Spokane #wildfires,"[apocalyps, light, spokan, wildfir]",apocalyps light spokan wildfir
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan,"[typhoon, soudelor, kill, china, taiwan]",typhoon soudelor kill china taiwan


### Method1: Word Frequency

As a baseline, let's start with the count of each word in each tweet.
We use `CountVectorizer` below to build the word-count matrix.

In [9]:
from sklearn import feature_extraction

In [10]:
count_vectorizer = feature_extraction.text.CountVectorizer()

In [12]:
## let's take a look at expected output by using first 2 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train["text_clean_string"][0:2])
print(train["text_clean_string"][0:2].values)
print(count_vectorizer.get_feature_names_out())
print(count_vectorizer.vocabulary_)
print(example_train_vectors.toarray())
print(example_train_vectors.toarray().shape)

['deed reason earthquak may allah forgiv us'
 'forest fire near la rong sask canada']
['allah' 'canada' 'deed' 'earthquak' 'fire' 'forest' 'forgiv' 'la' 'may'
 'near' 'reason' 'rong' 'sask' 'us']
{'deed': 2, 'reason': 10, 'earthquak': 3, 'may': 8, 'allah': 0, 'forgiv': 6, 'us': 13, 'forest': 5, 'fire': 4, 'near': 9, 'la': 7, 'rong': 11, 'sask': 12, 'canada': 1}
[[1 0 1 1 0 0 1 0 1 0 1 0 0 1]
 [0 1 0 0 1 1 0 1 0 1 0 1 1 0]]
(2, 14)


The output above shows that the first two tweets contain 14 unique words (or "tokens"). Each row of the matrix is one tweet, and each column counts one token.
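To make the mapping explicit, the same matrix can be rebuilt by hand. The sketch below simplifies what `CountVectorizer` does: it splits on whitespace rather than using the vectorizer's token pattern, an assumption that holds for these two cleaned tweets. `build_count_matrix` is a hypothetical helper introduced only for this illustration.

```python
from collections import Counter

def build_count_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [doc.split() for doc in docs]
    # Alphabetical order mirrors the column order CountVectorizer printed above.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # missing tokens count as 0
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

docs = [
    "deed reason earthquak may allah forgiv us",
    "forest fire near la rong sask canada",
]
vocabulary, rows = build_count_matrix(docs)
print(vocabulary)
for row in rows:
    print(row)
```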

Now let's create vectors for all of our tweets.

In [13]:
train_vectors = count_vectorizer.fit_transform(train["text_clean_string"])

# Note that we are NOT calling .fit_transform() here. Calling only .transform() ensures
# the test vectors use the same set of tokens (columns) as the train vectors.
test_vectors = count_vectorizer.transform(test["text_clean_string"])

In [16]:
# print(count_vectorizer.get_feature_names_out())
# print(count_vectorizer.vocabulary_)
# print(example_train_vectors.toarray())
print(train_vectors.shape)  # the sparse matrix exposes .shape directly, no dense copy needed

(7613, 11974)


In [23]:
X = train_vectors
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_test, predicted))
# Confusion matrix
pd.DataFrame(confusion_matrix(y_test, predicted))

F1 Score: 0.7745571658615136


Unnamed: 0,0,1
0,762,107
1,173,481
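The F1 score can be recovered from the confusion matrix above, which is a useful sanity check. The sketch assumes scikit-learn's layout (rows are true labels, columns are predicted labels), so the bottom-right cell is the true-positive count. `f1_from_confusion` is a helper name introduced here.

```python
def f1_from_confusion(matrix):
    """Compute precision, recall and F1 for the positive class (label 1).

    `matrix` follows scikit-learn's layout: matrix[true][predicted].
    """
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

precision, recall, f1 = f1_from_confusion([[762, 107], [173, 481]])
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Here precision (481 / 588) is higher than recall (481 / 654): the model misses more real disaster tweets than it raises false alarms.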


### Method2: TF-IDF

Now let's try TF-IDF using `TfidfVectorizer`.

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [25]:
# min_df=10: keep only terms that appear in at least 10 tweets (documents)
# ngram_range=(1, 2): use both unigrams and bigrams
vec_text = TfidfVectorizer(min_df=10, ngram_range=(1, 2), stop_words='english')

text_vec = vec_text.fit_transform(train['text_clean_string'])
text_vec_test = vec_text.transform(test['text_clean_string'])
X_train_text = pd.DataFrame(text_vec.toarray(), columns=vec_text.get_feature_names_out())
X_test_text = pd.DataFrame(text_vec_test.toarray(), columns=vec_text.get_feature_names_out())
print(X_train_text.shape)

(7613, 1520)
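For reference, the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF, then L2 normalisation of each row) can be sketched in plain Python. The helper `tfidf_rows` and the toy documents are assumptions for illustration, and the sketch ignores `min_df`, n-grams, and stop-word filtering.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (idf, rows): smoothed IDF per term and one L2-normalised dict per document."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, so a term found in every document gets 1.0.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for doc in tokenized_docs:
        raw = {term: count * idf[term] for term, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({term: w / norm for term, w in raw.items()})
    return idf, rows

idf, rows = tfidf_rows([["fire", "fire", "forest"], ["fire", "flood"]])
print(idf)
print(rows[0])
```

In the second toy document, `flood` ends up with a larger weight than `fire` even though both occur once, because `fire` appears in every document and so carries less information.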


In [26]:
X_train_text

Unnamed: 0,aba,aba woman,abandon,abc,abc news,abl,ablaz,absolut,accid,accord,...,york,young,youth,youth save,youtub,youtub playlist,youtub video,yr,yyc,zone
0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7608,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7609,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7610,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7611,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [28]:
X = X_train_text
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_test, predicted))
# Confusion matrix
pd.DataFrame(confusion_matrix(y_test, predicted))

F1 Score: 0.7600989282769992


Unnamed: 0,0,1
0,771,98
1,193,461


The TF-IDF F1 score (0.760) is slightly lower than the word-frequency score (0.775).

### Method3: Sentence Embedding

Here, we use `SentenceTransformer` to convert each tweet into a sentence embedding.

Reference: https://www.youtube.com/watch?v=c7AqnswslWo

In [None]:
# install package
# !pip install --user sentence-transformers -q

In [29]:
# Import libraries
from sentence_transformers import SentenceTransformer

In [31]:
# use SentenceTransformer to generate sentence embedding
model = SentenceTransformer('all-MiniLM-L6-v2')
train['embeddings'] = train['text_clean_string'].apply(model.encode)

In [32]:
train.head()

Unnamed: 0,id,keyword,location,text,target,text_clean,text_clean_string,embeddings
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,"[deed, reason, earthquak, may, allah, forgiv, us]",deed reason earthquak may allah forgiv us,"[-0.033004314, 0.1294759, 0.027240366, -0.0689..."
1,4,,,Forest fire near La Ronge Sask. Canada,1,"[forest, fire, near, la, rong, sask, canada]",forest fire near la rong sask canada,"[0.036669217, 0.02750025, 0.01003717, 0.015032..."
2,5,,,All residents asked to 'shelter in place' are ...,1,"[resid, ask, place, notifi, offic, evacu, shel...",resid ask place notifi offic evacu shelter pla...,"[-0.0040206, 0.11070472, 0.02508261, 0.0450453..."
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"[peopl, receiv, wildfir, evacu, order, califor...",peopl receiv wildfir evacu order california,"[0.036597367, 0.06827786, 0.023573978, 0.01135..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,"[got, sent, photo, rubi, alaska, smoke, wildfi...",got sent photo rubi alaska smoke wildfir pour ...,"[-0.081632115, 0.077032514, 0.03796497, 0.0621..."


In [33]:
X = train['embeddings'].to_list()
y = train['target'].to_list()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42,stratify=y)
LR = LogisticRegression()
LR.fit(X_train,y_train)
predicted = LR.predict(X_test)
# print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
# print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted))
# print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted))
print("F1 Score:",f1_score(y_test, predicted))
# Confusion matrix
pd.DataFrame(confusion_matrix(y_test, predicted))

F1 Score: 0.7488076311605724


Unnamed: 0,0,1
0,736,133
1,183,471


The sentence-embedding F1 score (0.749) is the lowest of the three methods.

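### Conclusion

Final F1 scores:
* Word-Frequency: 0.77
* TF-IDF: 0.76
* Sentence-Embedding: 0.75

The result seems counter-intuitive. It may be caused by the data preprocessing, the lack of fine-tuning, the choice of classification model, or the choice of embedding model. For example, the embedding model was given stemmed text with stop words removed, which differs from the natural sentences such models are usually trained on. The gaps are also small and come from a single train/test split, so they may partly reflect sampling noise.

Future work will investigate these potential causes with the aim of improving the results.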
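One way to check whether the small gaps between methods are stable is to score each vectorizer with cross-validation instead of a single split. The sketch below is an assumption about how that could look, not part of the original experiment: `cv_f1` is a hypothetical helper, `max_iter=1000` is an arbitrary choice, and the toy corpus only stands in for `train["text_clean_string"]` and `train["target"]`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

def cv_f1(vectorizer, texts, labels, n_splits=5, seed=42):
    """Cross-validated F1 for a vectorizer followed by logistic regression."""
    pipeline = Pipeline([
        ("vectorizer", vectorizer),  # refit on the training part of each fold only
        ("classifier", LogisticRegression(max_iter=1000)),
    ])
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(pipeline, texts, labels, scoring="f1", cv=folds)

# Made-up toy corpus; replace with the real cleaned tweets and targets.
disaster = ["forest fire evacu", "earthquak hit citi", "flood kill peopl", "wildfir smoke town"]
other = ["love new song", "great day park", "watch movi tonight", "happi birthday friend"]
texts = disaster * 5 + other * 5
labels = [1] * 20 + [0] * 20

count_scores = cv_f1(CountVectorizer(), texts, labels)
tfidf_scores = cv_f1(TfidfVectorizer(), texts, labels)
print("counts:", count_scores.mean(), "+/-", count_scores.std())
print("tf-idf:", tfidf_scores.mean(), "+/-", tfidf_scores.std())
```

Placing the vectorizer inside the `Pipeline` means the vocabulary and IDF weights are learned from the training folds only, which keeps the held-out fold from influencing the features.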