# Text Preprocessing using Regular Expressions and NLTK

In [82]:
import re
from nltk.corpus import stopwords
import string
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

## Dataset
The training split of the [Disaster Tweets Dataset from Kaggle](https://www.kaggle.com/competitions/nlp-getting-started/discussion/134890) is used here. It is a very noisy dataset, which makes it a good one for practicing text preprocessing.

In [25]:
tweets_df = pd.read_csv("disaster_tweets_kaggle.csv")
tweets_df.head() 

| Unnamed: 0 | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


The competition description on the Kaggle website suggests that the keywords are important for classifying disaster tweets, so a combined `tweet` column is created by joining `keyword` and `text`. Missing keywords are first replaced with an empty string.

In [73]:
tweets_df["keyword"] = tweets_df["keyword"].fillna("")
tweets_df["tweet"] = tweets_df["keyword"] + " " + tweets_df["text"]
tweets_df.head()

| Unnamed: 0 | id | keyword | location | text | target | tweet |
|---|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 | Our Deeds are the Reason of this #earthquake ... |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 | Forest fire near La Ronge Sask. Canada |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 | All residents asked to 'shelter in place' are... |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 | 13,000 people receive #wildfires evacuation o... |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 | Just got sent this photo from Ruby #Alaska as... |


## Lower case 

In [74]:
tweets_df["tweet"] = tweets_df["tweet"].str.lower()
tweets_df["tweet"]

0        our deeds are the reason of this #earthquake ...
1                  forest fire near la ronge sask. canada
2        all residents asked to 'shelter in place' are...
3        13,000 people receive #wildfires evacuation o...
4        just got sent this photo from ruby #alaska as...
                              ...                        
7608     two giant cranes holding a bridge collapse in...
7609     @aria_ahrary @thetawniest the out of control ...
7610     m1.94 [01:04 utc]?5km s of volcano hawaii. ht...
7611     police investigating after an e-bike collided...
7612     the latest: more homes razed by northern cali...
Name: tweet, Length: 7613, dtype: object

## Remove HTML

The text contains many HTML entities, such as "\&gt;" and "\&lt;". It might also contain HTML tags such as \<p>, \<a> or \<div>.

In [81]:
from bs4 import BeautifulSoup
text = r"&gt;&gt; $15 Aftershock : Protect Yourself and Profit in the Next Global Financial... ##book http://t.co/f6ntUc734Z esquireattire"
soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly so results do not depend on what is installed
soup.get_text()

'>> $15 Aftershock : Protect Yourself and Profit in the Next Global Financial... ##book http://t.co/f6ntUc734Z esquireattire'

In [77]:
def remove_html(text):
    soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly so results do not depend on what is installed
    text = soup.get_text()
    return text

In [80]:
tweets_df["tweet"] = tweets_df["tweet"].apply(remove_html)
tweets_df["tweet"]

0       our deeds are the reason of this #earthquake m...
1                  forest fire near la ronge sask. canada
2       all residents asked to 'shelter in place' are ...
3       13,000 people receive #wildfires evacuation or...
4       just got sent this photo from ruby #alaska as ...
                              ...                        
7608    two giant cranes holding a bridge collapse int...
7609    @aria_ahrary @thetawniest the out of control w...
7610    m1.94 [01:04 utc]?5km s of volcano hawaii. htt...
7611    police investigating after an e-bike collided ...
7612    the latest: more homes razed by northern calif...
Name: tweet, Length: 7613, dtype: object
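If installing BeautifulSoup is not an option, the same cleanup can be roughly approximated with the standard library. This is only a minimal sketch and not what this notebook uses: `remove_html_stdlib` and `TAG_PATTERN` are names I made up for illustration, and the regex is a crude tag matcher that a real HTML parser handles far more robustly.

```python
import html
import re

# Crude tag matcher (assumption: tags never contain a literal ">").
TAG_PATTERN = re.compile(r"<[^>]+>")

def remove_html_stdlib(text):
    """Strip tags first, then decode entities such as &gt; and &amp;."""
    without_tags = TAG_PATTERN.sub("", text)
    return html.unescape(without_tags)

sample = "<p>fire &amp; smoke &gt;&gt; <a href='http://t.co/x'>link</a></p>"
print(remove_html_stdlib(sample))  # fire & smoke >> link
```

Tags are stripped before entities are decoded so that a decoded ">" or "<" is never mistaken for part of a tag.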

## Expand Contractions

In [83]:
import contractions

tweets_df["tweet"] = tweets_df["tweet"].apply(contractions.fix)
tweets_df["tweet"]

0       our deeds are the reason of this #earthquake m...
1                  forest fire near la ronge sask. canada
2       all residents asked to 'shelter in place' are ...
3       13,000 people receive #wildfires evacuation or...
4       just got sent this photo from ruby #alaska as ...
                              ...                        
7608    two giant cranes holding a bridge collapse int...
7609    @aria_ahrary @thetawniest the out of control w...
7610    m1.94 [01:04 utc]?5km s of volcano hawaii. htt...
7611    police investigating after an e-bike collided ...
7612    the latest: more homes razed by northern calif...
Name: tweet, Length: 7613, dtype: object
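To make it clearer what this step does, here is a small hand-rolled sketch of contraction expansion. It is not how the `contractions` package is implemented: `CONTRACTION_MAP` is a deliberately tiny, hypothetical mapping and `expand_contractions` is a name I introduced for illustration.

```python
import re

# Hypothetical, deliberately tiny mapping; the real package covers far more forms.
CONTRACTION_MAP = {
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
    "it's": "it is",
    "don't": "do not",
}

# One alternation of all keys, anchored on word boundaries.
CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTION_MAP) + r")\b"
)

def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    return CONTRACTION_PATTERN.sub(lambda m: CONTRACTION_MAP[m.group(1)], text)

print(expand_contractions("i'm sure it's a drill, don't panic"))
# i am sure it is a drill, do not panic
```

Expanding contractions before punctuation is removed matters because stripping the apostrophe first would turn "don't" into "dont", which no longer matches a dictionary word.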

## Remove URLs

In [84]:
def remove_urls(text):
    pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+)(/\w*)?')
    text = re.sub(pattern, "", text)
    return text

In [85]:
text = "#stlouis #caraccidentlawyer Speeding Among Top Causes of Teen Accidents https://t.co/k4zoMOF319 https://t.co/S2kXVM0cBA Car Accident"
remove_urls(text)

'#stlouis #caraccidentlawyer Speeding Among Top Causes of Teen Accidents   Car Accident'

In [86]:
tweets_df["tweet"] = tweets_df["tweet"].apply(remove_urls)
tweets_df["tweet"]

0       our deeds are the reason of this #earthquake m...
1                  forest fire near la ronge sask. canada
2       all residents asked to 'shelter in place' are ...
3       13,000 people receive #wildfires evacuation or...
4       just got sent this photo from ruby #alaska as ...
                              ...                        
7608    two giant cranes holding a bridge collapse int...
7609    @aria_ahrary @thetawniest the out of control w...
7610          m1.94 [01:04 utc]?5km s of volcano hawaii. 
7611    police investigating after an e-bike collided ...
7612    the latest: more homes razed by northern calif...
Name: tweet, Length: 7613, dtype: object
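The pattern above fits the short `t.co` links in this dataset, but it stops after the first path segment. The sketch below compares it with a broader pattern on a made-up URL. `BROAD_URL` is my own assumption rather than part of the notebook, and it is greedy: it also swallows any punctuation glued to the end of a link.

```python
import re

# Pattern used in this notebook: scheme, optional www, host, one dot-suffix, one path segment.
NARROW_URL = re.compile(r'https?://(www\.)?(\w+)(\.\w+)(/\w*)?')
# Assumed alternative: everything up to the next whitespace character.
BROAD_URL = re.compile(r'https?://\S+|www\.\S+')

text = "details at https://example.com/news/story?id=42 now"

narrow_result = NARROW_URL.sub("", text)
broad_result = BROAD_URL.sub("", text)

print(repr(narrow_result))  # 'details at /story?id=42 now'
print(repr(broad_result))   # 'details at  now'
```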

## Remove email IDs

In [None]:
tweet = "please send your feedback to myemail@gmail.com "
pattern = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
x = re.findall(pattern, tweet)
print(x)
z = re.sub(pattern, "", tweet)
print(z)
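If this step were added to the pipeline, it would need to run before mentions are removed, because the mention pattern `@\w+` also matches the `@gmail` part of an address. A minimal sketch of the two orderings follows; `remove_emails` is a hypothetical wrapper around the pattern above, and `remove_mentions` mirrors the function defined in the mentions section.

```python
import re

EMAIL_PATTERN = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
MENTION_PATTERN = re.compile(r"@\w+")

def remove_emails(text):
    """Delete anything that looks like an email address."""
    return EMAIL_PATTERN.sub("", text)

def remove_mentions(text):
    """Delete @username mentions."""
    return MENTION_PATTERN.sub("", text)

tweet = "please send your feedback to myemail@gmail.com "

emails_first = remove_mentions(remove_emails(tweet))
mentions_first = remove_emails(remove_mentions(tweet))

print(repr(emails_first))    # 'please send your feedback to  '
print(repr(mentions_first))  # 'please send your feedback to myemail.com '
```

With mentions removed first, the address is cut in half and the leftover `myemail.com` no longer matches the email pattern.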

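## Remove Twitter Mentions

The text contains mentions, which are usernames prefixed with @. These need to be removed before the punctuation is removed, because stripping the @ first would leave the username behind as an ordinary word.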

In [87]:
def remove_mentions(text):
    pattern = re.compile(r"@\w+")
    text = re.sub(pattern, "", text)
    return text

In [88]:
text = "@aria_ahrary @TheTawniest The out of control"
remove_mentions(text)

'  The out of control'

In [89]:
tweets_df["tweet"] = tweets_df["tweet"].apply(remove_mentions)
tweets_df["tweet"]

0       our deeds are the reason of this #earthquake m...
1                  forest fire near la ronge sask. canada
2       all residents asked to 'shelter in place' are ...
3       13,000 people receive #wildfires evacuation or...
4       just got sent this photo from ruby #alaska as ...
                              ...                        
7608    two giant cranes holding a bridge collapse int...
7609      the out of control wild fires in california ...
7610          m1.94 [01:04 utc]?5km s of volcano hawaii. 
7611    police investigating after an e-bike collided ...
7612    the latest: more homes razed by northern calif...
Name: tweet, Length: 7613, dtype: object
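The ordering matters, and a quick check shows why. The two helper functions below repeat the mention and punctuation patterns used in this notebook, and the sample text is a shortened version of one of the tweets.

```python
import re
import string

def remove_mentions(text):
    return re.sub(r"@\w+", "", text)

def remove_punctuations(text):
    return re.sub("[%s]" % re.escape(string.punctuation), "", text)

text = "@aria_ahrary the out of control wild fires"

mentions_first = remove_punctuations(remove_mentions(text))
punctuation_first = remove_mentions(remove_punctuations(text))

print(repr(mentions_first))     # ' the out of control wild fires'
print(repr(punctuation_first))  # 'ariaahrary the out of control wild fires'
```

Once "@" and "_" are gone, the username survives as the token `ariaahrary` and there is nothing left for the mention pattern to match.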

## Remove Unicode characters

In [98]:
def remove_unicode_chars(text):
    text = text.encode("ascii", "ignore").decode()
    return text

In [99]:
# text = "\x89ÛÏWhen"
text = " lips @Â‰Ã"
remove_unicode_chars(text)

' lips @'

In [105]:
tweets_df["tweet"] = tweets_df["tweet"].apply(remove_unicode_chars)
tweets_df["tweet"]

0       our deeds are the reason of this #earthquake m...
1                  forest fire near la ronge sask. canada
2       all residents asked to 'shelter in place' are ...
3       13,000 people receive #wildfires evacuation or...
4       just got sent this photo from ruby #alaska as ...
                              ...                        
7608    two giant cranes holding a bridge collapse int...
7609      the out of control wild fires in california ...
7610          m1.94 [01:04 utc]?5km s of volcano hawaii. 
7611    police investigating after an e-bike collided ...
7612    the latest: more homes razed by northern calif...
Name: tweet, Length: 7613, dtype: object
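Dropping every non-ASCII character works well for the encoding debris in these tweets, but it also deletes legitimate accented letters. If that matters, one possible variation is to decompose characters with `unicodedata.normalize` first, so that the base letter survives. This is a sketch of an alternative, not what the notebook does; `to_ascii` is a name I made up.

```python
import unicodedata

def to_ascii(text):
    """Split accented letters into base letter + accent, then drop the non-ASCII part."""
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")

text = "caf\u00e9 in s\u00e3o paulo"  # "café in são paulo"

dropped = text.encode("ascii", "ignore").decode()  # approach used in this notebook
folded = to_ascii(text)

print(dropped)  # caf in so paulo
print(folded)   # cafe in sao paulo
```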

## Remove punctuation

In [101]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [102]:
def remove_punctuations(text):
    text = re.sub('[%s]' % re.escape(string.punctuation), "",text)
    return text

In [103]:
text = "'>> $15 Aftershock : Protect Yourself and Profit in the Next Global Financial... ##book  esquireattire'"
remove_punctuations(text)

' 15 Aftershock  Protect Yourself and Profit in the Next Global Financial book  esquireattire'

In [106]:
tweets_df["tweet"] = tweets_df["tweet"].apply(remove_punctuations)
tweets_df["tweet"]

0       our deeds are the reason of this earthquake ma...
1                   forest fire near la ronge sask canada
2       all residents asked to shelter in place are be...
3       13000 people receive wildfires evacuation orde...
4       just got sent this photo from ruby alaska as s...
                              ...                        
7608    two giant cranes holding a bridge collapse int...
7609      the out of control wild fires in california ...
7610                m194 0104 utc5km s of volcano hawaii 
7611    police investigating after an ebike collided w...
7612    the latest more homes razed by northern califo...
Name: tweet, Length: 7613, dtype: object
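Deleting punctuation outright can glue neighbouring tokens together, as in row 7610 where `utc]?5km` became `utc5km`. One possible variation is to replace punctuation with a space and then collapse the extra whitespace. This is only a sketch under that assumption; `punctuation_to_space` is a hypothetical helper, and the trade-off is that tokens such as `m1.94` get split in two.

```python
import re
import string

PUNCT_PATTERN = re.compile("[%s]" % re.escape(string.punctuation))

def punctuation_to_space(text):
    """Replace each punctuation character with a space, then collapse whitespace."""
    spaced = PUNCT_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", spaced).strip()

text = "m1.94 [01:04 utc]?5km s of volcano hawaii."

deleted = PUNCT_PATTERN.sub("", text)  # approach used in this notebook
spaced = punctuation_to_space(text)

print(deleted)  # m194 0104 utc5km s of volcano hawaii
print(spaced)   # m1 94 01 04 utc 5km s of volcano hawaii
```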

## Remove digits or words containing digits

This step is not always appropriate. For example, "MH370" in the tweets refers to Malaysia Airlines Flight 370, which went missing, so keeping that token might help the disaster tweet classification. Here, however, I remove all digits and all words containing digits for simplicity.

In [108]:
def remove_digits(text):
    pattern = re.compile(r"\w*\d+\w*")  # raw string so \w and \d are not treated as string escapes
    text = re.sub(pattern, "",text)
    return text

In [109]:
text = " m194 0104 utc5km s of volcano hawaii"
remove_digits(text)

'    s of volcano hawaii'

In [110]:
tweets_df["tweet"] = tweets_df["tweet"].apply(remove_digits)
tweets_df["tweet"]

0       our deeds are the reason of this earthquake ma...
1                   forest fire near la ronge sask canada
2       all residents asked to shelter in place are be...
3        people receive wildfires evacuation orders in...
4       just got sent this photo from ruby alaska as s...
                              ...                        
7608    two giant cranes holding a bridge collapse int...
7609      the out of control wild fires in california ...
7610                                 s of volcano hawaii 
7611    police investigating after an ebike collided w...
7612    the latest more homes razed by northern califo...
Name: tweet, Length: 7613, dtype: object
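If tokens like "mh370" should be kept, a gentler variation is to remove only standalone numbers. The sketch below is one way to do that under this assumption. `remove_standalone_numbers` is a hypothetical helper, and the sample sentence is made up.

```python
import re

ANY_DIGIT_WORD = re.compile(r"\w*\d+\w*")  # pattern used in this notebook
STANDALONE_NUMBER = re.compile(r"\b\d+\b")  # digits with a word boundary on both sides

def remove_standalone_numbers(text):
    """Drop tokens made only of digits; keep mixed tokens such as 'mh370'."""
    return STANDALONE_NUMBER.sub("", text)

text = "mh370 wreckage found 2015 near reunion"

all_removed = ANY_DIGIT_WORD.sub("", text)
numbers_only = remove_standalone_numbers(text)

print(repr(all_removed))   # ' wreckage found  near reunion'
print(repr(numbers_only))  # 'mh370 wreckage found  near reunion'
```

There is no word boundary between "h" and "3", so `\b\d+\b` cannot match inside "mh370".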

## Remove stopwords

In [111]:
from nltk.corpus import stopwords
# If the stop-word list is missing, run nltk.download('stopwords') once beforehand
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in stop_words])

In [112]:
text = "our deeds are the reason of this earthquake"
remove_stopwords(text)

'deeds reason earthquake'

In [113]:
tweets_df["tweet"] = tweets_df["tweet"].apply(remove_stopwords)
tweets_df["tweet"]

0            deeds reason earthquake may allah forgive us
1                   forest fire near la ronge sask canada
2       residents asked shelter place notified officer...
3       people receive wildfires evacuation orders cal...
4       got sent photo ruby alaska smoke wildfires pou...
                              ...                        
7608    two giant cranes holding bridge collapse nearb...
7609    control wild fires california even northern pa...
7610                                       volcano hawaii
7611    police investigating ebike collided car little...
7612    latest homes razed northern california wildfir...
Name: tweet, Length: 7613, dtype: object
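The NLTK English list includes negations such as "no" and "not", so removing every stop word can flip the meaning of a tweet. A minimal sketch of keeping them follows. To stay self-contained it uses a small hypothetical stand-in for the NLTK list, and `NEGATIONS` and `custom_stop_words` are names I introduced.

```python
# Hypothetical stand-in for set(stopwords.words('english')), for illustration only.
stop_words = {"the", "is", "a", "of", "this", "are", "our", "no", "not", "nor"}

# Words to keep even though they appear in the stop-word list.
NEGATIONS = {"no", "not", "nor"}
custom_stop_words = stop_words - NEGATIONS

def remove_stopwords(text, vocabulary):
    """Keep only the words that are not in the given stop-word set."""
    return " ".join(word for word in text.split() if word not in vocabulary)

text = "this is not a drill"

print(remove_stopwords(text, stop_words))         # drill
print(remove_stopwords(text, custom_stop_words))  # not drill
```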

## Removing Extra Spaces

In [114]:
def remove_extra_spaces(text):
    text = re.sub(' +', ' ', text).strip()
    return text

In [115]:
tweets_df["tweet"] = tweets_df["tweet"].apply(remove_extra_spaces)
tweets_df["tweet"]

0            deeds reason earthquake may allah forgive us
1                   forest fire near la ronge sask canada
2       residents asked shelter place notified officer...
3       people receive wildfires evacuation orders cal...
4       got sent photo ruby alaska smoke wildfires pou...
                              ...                        
7608    two giant cranes holding bridge collapse nearb...
7609    control wild fires california even northern pa...
7610                                       volcano hawaii
7611    police investigating ebike collided car little...
7612    latest homes razed northern california wildfir...
Name: tweet, Length: 7613, dtype: object
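The pattern `' +'` only collapses plain spaces. If the text may also contain tabs or newlines, a variation using `\s+` handles those too. This is a sketch of that alternative; `normalize_whitespace` is a hypothetical name and the sample string is made up.

```python
import re

def normalize_whitespace(text):
    """Collapse any run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()

text = "forest fire\tnear\n la  ronge "

spaces_only = re.sub(" +", " ", text).strip()  # approach used in this notebook
all_whitespace = normalize_whitespace(text)

print(repr(spaces_only))     # 'forest fire\tnear\n la ronge'
print(repr(all_whitespace))  # 'forest fire near la ronge'
```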

## Stemming or Lemmatization

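I generally prefer lemmatization over stemming because lemmatization returns dictionary words, whereas stemming can produce truncated non-words. Note that `WordNetLemmatizer` treats every word as a noun unless a part-of-speech tag is passed, which is why "us" becomes "u" in the output below.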

In [118]:
from nltk.stem import WordNetLemmatizer
# If the WordNet data is missing, run nltk.download('wordnet') once beforehand
lemmatizer = WordNetLemmatizer()

def lemmatize_text(text):
    words = [lemmatizer.lemmatize(word) for word in text.split()]
    text = ' '.join(words)
    return text

In [119]:
tweets_df["tweet"] = tweets_df["tweet"].apply(lemmatize_text)
tweets_df["tweet"]

0              deed reason earthquake may allah forgive u
1                   forest fire near la ronge sask canada
2       resident asked shelter place notified officer ...
3       people receive wildfire evacuation order calif...
4       got sent photo ruby alaska smoke wildfire pour...
                              ...                        
7608    two giant crane holding bridge collapse nearby...
7609    control wild fire california even northern par...
7610                                       volcano hawaii
7611    police investigating ebike collided car little...
7612    latest home razed northern california wildfire...
Name: tweet, Length: 7613, dtype: object
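The regex-based steps in this notebook can also be chained into a single function. The sketch below covers only the standard-library steps, in the same order as above. The BeautifulSoup, `contractions` and NLTK steps are left out so that it runs without extra installs. `preprocess` is a name I made up, and the sample tweet is invented for illustration.

```python
import re
import string

URL_PATTERN = re.compile(r'https?://(www\.)?(\w+)(\.\w+)(/\w*)?')
MENTION_PATTERN = re.compile(r"@\w+")
PUNCT_PATTERN = re.compile("[%s]" % re.escape(string.punctuation))
DIGIT_PATTERN = re.compile(r"\w*\d+\w*")
SPACE_PATTERN = re.compile(" +")

def preprocess(text):
    """Apply the regex-based cleaning steps in the order used in this notebook."""
    text = text.lower()
    text = URL_PATTERN.sub("", text)                  # URLs before punctuation
    text = MENTION_PATTERN.sub("", text)              # mentions before punctuation
    text = text.encode("ascii", "ignore").decode()    # drop non-ASCII characters
    text = PUNCT_PATTERN.sub("", text)
    text = DIGIT_PATTERN.sub("", text)                # words containing digits
    return SPACE_PATTERN.sub(" ", text).strip()

sample = "@aria_ahrary 13,000 people receive #wildfires evacuation orders http://t.co/abc123"
print(preprocess(sample))  # people receive wildfires evacuation orders
```

It can be applied in one pass with `tweets_df["tweet"].apply(preprocess)`.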