# 01. Preprocessing

## Raw Dataset
We download two training datasets from Kaggle:

- [NLP with Disaster Tweets](https://www.kaggle.com/competitions/nlp-getting-started/data) (NWDT)
- [Disasters on social media](https://www.kaggle.com/datasets/jannesklaas/disasters-on-social-media) (DOSM)

We use `pandas` to merge the two datasets and extract only the relevant features and labels.

In [1]:
import pandas as pd

NWDT_PATH = "data/nlp-with-disaster-tweets-train.csv"
DOSM_PATH = "data/disasters-on-social-media.csv"

nwdt = pd.read_csv(NWDT_PATH)
dosm = pd.read_csv(DOSM_PATH)

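Modify the DOSM dataset to match the NWDT dataset: map the `choose_one` label to a binary `target` column and keep only the columns both datasets share.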

In [2]:
dosm.loc[dosm["choose_one"] == "Relevant", "target"] = 1
dosm.loc[dosm["choose_one"] == "Not Relevant", "target"] = 0
dosm = dosm.dropna(subset=["target"])
dosm["target"] = dosm["target"].astype("int")
dosm = dosm[["keyword", "location", "text", "target"]]

dosm.head()

Unnamed: 0,keyword,location,text,target
0,,,Just happened a terrible car crash,1
1,,,Our Deeds are the Reason of this #earthquake M...,1
2,,,"Heard about #earthquake is different cities, s...",1
3,,,"there is a forest fire at spot pond, geese are...",1
4,,,Forest fire near La Ronge Sask. Canada,1
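
The relabelling step can also be written with `Series.map`. The following is a minimal sketch on a made-up frame, not the notebook's actual data: `toy`, `label_map`, the example rows, and the third label `"Unsure"` are all hypothetical stand-ins for any `choose_one` value other than the two handled above.

```python
import pandas as pd

# Toy stand-in for the DOSM data; rows and the "Unsure" label are made up.
toy = pd.DataFrame({
    "choose_one": ["Relevant", "Not Relevant", "Unsure"],
    "text": ["forest fire near town", "this song is fire", "unclear tweet"],
})

# Labels missing from the mapping become NaN, just like rows left
# unassigned by the two .loc statements in the cell above.
label_map = {"Relevant": 1, "Not Relevant": 0}
toy["target"] = toy["choose_one"].map(label_map)

# Drop unlabelled rows, then store the label as an integer.
toy = toy.dropna(subset=["target"]).copy()
toy["target"] = toy["target"].astype("int")

print(toy[["text", "target"]])
```

Either form should give the same result; the mapping version just keeps the label definitions in one place.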


Merge the two datasets

In [3]:
train_tweets = pd.concat([nwdt, dosm], ignore_index=True).drop("id", axis=1)
train_tweets

Unnamed: 0,keyword,location,text,target
0,,,Our Deeds are the Reason of this #earthquake M...,1
1,,,Forest fire near La Ronge Sask. Canada,1
2,,,All residents asked to 'shelter in place' are ...,1
3,,,"13,000 people receive #wildfires evacuation or...",1
4,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...
18468,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
18469,,,Police investigating after an e-bike collided ...,1
18470,,,The Latest: More Homes Razed by Northern Calif...,1
18471,,,MEG issues Hazardous Weather Outlook (HWO) htt...,1


Export the final dataset (`train_tweets`)

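Note that we use `|` as the separator to reduce the chance of parsing errors, since `|` is less likely than a comma to appear inside a tweet. We leave `quoting` as `None`, which keeps the pandas default (`csv.QUOTE_MINIMAL`): a field is quoted only when it contains the separator, a quote character, or a line break.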

In [4]:
train_tweets.to_csv("data/train-tweets.csv", sep="|", quoting=None)
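
A small round-trip check of these export settings, sketched on a made-up two-row frame held in memory (`toy`, `buffer`, and `restored` are illustrative names, not part of the notebook). To our understanding, pandas treats `quoting=None` as its default `csv.QUOTE_MINIMAL`, so a field that happens to contain `|` is wrapped in quotes and should survive the round trip.

```python
import io

import pandas as pd

# Made-up rows; the first text deliberately contains the separator.
toy = pd.DataFrame({
    "text": ["fire | smoke near school", "all clear"],
    "target": [1, 0],
})

# Write to an in-memory buffer with the same options as the cell above.
buffer = io.StringIO()
toy.to_csv(buffer, sep="|", quoting=None)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back; the first column holds the index written by to_csv.
restored = pd.read_csv(io.StringIO(csv_text), sep="|", index_col=0)
print(restored["text"].tolist() == toy["text"].tolist())
```

When reading `data/train-tweets.csv` back in later notebooks, the same `sep="|"` has to be passed to `pd.read_csv`.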

## Tokenization and Cleaning

To clean up the tweets, we use the `nltk` library and apply a series of cleaning steps:

1. We lowercase the entire tweet and split it into word tokens with the `word_tokenize` function from the `nltk.tokenize` package.
2. We keep only tokens made up entirely of alphabetic characters (`str.isalpha()`), which drops numbers, punctuation, and mixed tokens such as `13,000`.
3. We remove English stop words, using the `stopwords` list from `nltk.corpus` to decide which tokens to drop.
4. We lemmatize each token, reducing different forms of a word to a single base form, e.g. books -> book, children -> child. We use `WordNetLemmatizer` for this. Note that `lemmatize()` treats every token as a noun unless a part-of-speech tag is passed, so verb forms such as went / gone are reduced to go only when `pos="v"` is given.
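
The filtering in steps 2 and 3 can be illustrated without NLTK. This is a minimal sketch under stated assumptions: the token list is hand-written and only approximates what `word_tokenize` might return, and `stop_words` is a tiny made-up set rather than NLTK's English list.

```python
# Hand-written tokens for illustration; real word_tokenize output may differ.
tokens = ["13,000", "people", "receive", "#", "wildfires", "'shelter", "in", "place"]

# Tiny illustrative stop-word set (an assumption; NLTK's list is much longer).
stop_words = {"in", "the", "are"}

# Step 2 and step 3 combined: alphabetic tokens that are not stop words.
kept = [t for t in tokens if t.isalpha() and t not in stop_words]

# Tokens rejected by the alphabetic check alone.
dropped = [t for t in tokens if not t.isalpha()]

print(kept)     # ['people', 'receive', 'wildfires', 'place']
print(dropped)  # ['13,000', '#', "'shelter"]
```

Any token carrying a digit, a comma, or a leading quote fails `isalpha()`, so a word that is attached to punctuation is dropped together with that punctuation.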

Putting it all together, we create a text pre-processing function, `text_pre_processed()`, and use it to clean our dataset.

In [5]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

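# Build the stop-word set and the lemmatizer once, not once per token.
# Requires the NLTK data packages for the tokenizer, stopwords, and WordNet.
STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def text_pre_processed(text_row):
    """
    Clean and tokenize a single tweet.

    Input:
        text_row (str): raw tweet text
    Output:
        list of str: lowercased, alphabetic, lemmatized tokens
                     with English stop words removed
    """
    tokens = word_tokenize(text_row.lower())
    words = []
    for token in tokens:
        # Keep purely alphabetic tokens that are not English stop words
        if token.isalpha() and token not in STOP_WORDS:
            # lemmatize() defaults to pos="n", so tokens are treated as nouns
            words.append(LEMMATIZER.lemmatize(token))

    return words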

In [6]:
# Apply the cleaning function to every tweet in the "text" column
train_tweets["tokenized_clean_text"] = train_tweets["text"].apply(text_pre_processed)

Let's compare the old (uncleaned) text to the cleaned text:

In [7]:
for index in range(5):
    cleaned = " ".join(train_tweets["tokenized_clean_text"][index])
    print("old text: ", train_tweets["text"][index])
    print("cleaned text: ", cleaned, "\n")

old text:  Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
cleaned text:  deed reason earthquake may allah forgive u 

old text:  Forest fire near La Ronge Sask. Canada
cleaned text:  forest fire near la ronge sask canada 

old text:  All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
cleaned text:  resident asked place notified officer evacuation shelter place order expected 

old text:  13,000 people receive #wildfires evacuation orders in California 
cleaned text:  people receive wildfire evacuation order california 

old text:  Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
cleaned text:  got sent photo ruby alaska smoke wildfire pours school 



Now we export our cleaned data for later steps such as data exploration and deep learning.

In [8]:
train_tweets.to_csv("data/cleaned-tokenized-train-tweets.csv", sep="|", quoting=None)
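
One thing to keep in mind when loading this file later: a column of Python lists is written to CSV as its string representation, so `tokenized_clean_text` comes back as text rather than as lists. The sketch below shows one way to restore the lists with `ast.literal_eval`, using a made-up in-memory frame (`toy`, `buffer`, `restored`, and `as_read` are illustrative names, not part of the notebook).

```python
import ast
import io

import pandas as pd

# Made-up token lists standing in for the real column.
toy = pd.DataFrame({
    "tokenized_clean_text": [["forest", "fire"], ["people", "receive"]],
})

# Round trip through CSV using the same separator as the export above.
buffer = io.StringIO()
toy.to_csv(buffer, sep="|")
restored = pd.read_csv(io.StringIO(buffer.getvalue()), sep="|", index_col=0)

# Each cell is now a string that looks like a list.
as_read = restored["tokenized_clean_text"].iloc[0]
print(type(as_read), as_read)

# Safely parse the string representation back into a Python list.
restored["tokenized_clean_text"] = restored["tokenized_clean_text"].apply(ast.literal_eval)
print(restored["tokenized_clean_text"].iloc[0])
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.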