# Text data preprocessing
Text preprocessing is the first step and a crucial part of NLP. Although many deep-learning-based NLP packages nowadays come with built-in text preprocessing pipelines, it is still worthwhile to understand the process, since we often have to do it ourselves in classical NLP tasks such as sentiment analysis and topic modeling.

In this tutorial, we use the [Twitter Sentiment Analysis](https://www.kaggle.com/datasets/jp797498e/twitter-entity-sentiment-analysis) data as an example to illustrate the general process of text preprocessing with two NLP packages, NLTK and spaCy.

# Twitter Sentiment Analysis data 

In [11]:
import numpy as np
import pandas as pd

In [12]:
data_path = "datasets/twitter_sentiment_analysis/twitter_training.csv"
train_data = pd.read_csv(data_path,header=None)
train_data.columns = ["Tweet_ID","entity","sentiment","Tweet_content"]
train_data

Unnamed: 0,Tweet_ID,entity,sentiment,Tweet_content
0,2401,Borderlands,Positive,im getting on borderlands and i will murder yo...
1,2401,Borderlands,Positive,I am coming to the borders and I will kill you...
2,2401,Borderlands,Positive,im getting on borderlands and i will kill you ...
3,2401,Borderlands,Positive,im coming on borderlands and i will murder you...
4,2401,Borderlands,Positive,im getting on borderlands 2 and i will murder ...
...,...,...,...,...
74677,9200,Nvidia,Positive,Just realized that the Windows partition of my...
74678,9200,Nvidia,Positive,Just realized that my Mac window partition is ...
74679,9200,Nvidia,Positive,Just realized the windows partition of my Mac ...
74680,9200,Nvidia,Positive,Just realized between the windows partition of...


In [13]:
train_data.sentiment.value_counts()

Negative      22542
Positive      20832
Neutral       18318
Irrelevant    12990
Name: sentiment, dtype: int64

In [18]:
# the upper bound of randint is exclusive, so this can select any row, including the last one
rand_id = np.random.randint(0, train_data.shape[0])
train_data.iloc[rand_id,]["Tweet_content"]

'It is not the first time that the EU Commission has taken such a step.'

# General steps of preprocessing

Generally, we follow the steps below to preprocess text data:

1. Remove special strings or characters, e.g., URLs, emoji, Twitter marks, and styles. What to remove is decided case by case. For example, emoji can reveal emotions and may actually be important in sentiment analysis.
2. Tokenize the string. This is the process of splitting strings into chunks (words, punctuation marks).
3. Lowercase the tokens.
4. Remove stop words and punctuation. Stop words are words that don't add significant meaning to the text. Punctuation marks are special characters that help to organize the structure of sentences, e.g., ",", ".", "?". Sometimes, however, strings of punctuation carry meaning (they serve like emoji) and should be retained, e.g., ":)".
5. Stemming/Lemmatization
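
The five steps above can be sketched with the standard library alone. This is a minimal toy, not the NLTK or spaCy pipeline used later in this tutorial: the function name `toy_preprocess`, the tiny stop-word set, and the suffix list are all assumptions made up for illustration.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only;
# real pipelines use a full list such as the one shipped with NLTK.
TOY_STOPWORDS = {"i", "the", "a", "is", "it", "this", "and", "to"}

# Hypothetical suffix list for a crude stemmer (not the Porter algorithm).
TOY_SUFFIXES = ("ing", "ed", "s")


def toy_preprocess(text):
    # Step 1: remove URLs and the hash sign of hashtags.
    text = re.sub(r"https?://\S+", "", text)
    text = text.replace("#", "")
    # Steps 2 and 3: lowercase, then split into word and punctuation tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # Step 4: drop stop words and single punctuation characters.
    tokens = [t for t in tokens
              if t not in TOY_STOPWORDS and t not in string.punctuation]
    # Step 5: crude stemming by stripping the first matching suffix.
    stems = []
    for t in tokens:
        for suffix in TOY_SUFFIXES:
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems


print(toy_preprocess("I loved playing this #game! https://example.com"))
# ['lov', 'play', 'game']
```

Note how the crude stemmer produces "lov", which is not an English word. This is the typical trade-off of stemming discussed in the next section.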

## Stemming vs. Lemmatization

**Stemming** is the process of reducing inflected words to their stem. It removes the last few characters of a given word to obtain a shorter form, even if that form doesn't have any meaning.

For example, NLTK's Porter stemmer reduces "history" to "histori", which does not exist in English, and reduces both "finally" and "final" to "final".

The goal is to get the base form of similar words.

**Lemmatization** has the same goal as stemming but overcomes its drawbacks. For some words, stemming may not give a meaningful representation, such as "histori". This is where lemmatization comes into the picture, as it returns a meaningful word.

Lemmatization takes more time than stemming because it has to find a meaningful word or representation. Stemming only needs to strip a word down to a base form and therefore takes less time.

Stemming is often applied in tasks such as sentiment analysis, while lemmatization is often applied in chatbots and human-answering systems.

### Summary
	
**Stemming** is a process that stems or removes last few characters from a word, often leading to incorrect meanings and spelling.	

For instance, stemming the word ‘Caring‘ would return ‘Car‘.

Stemming is used in case of large dataset where performance is an issue.

**Lemmatization** considers the context and converts the word to its meaningful base form, which is called Lemma.

For instance, lemmatizing the word ‘Caring‘ would return ‘Care‘.

Lemmatization is computationally expensive since it involves look-up tables and what not.
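
The contrast can be made concrete with a toy example. The two functions below are hypothetical stand-ins written for this illustration: `naive_stem` is a bare suffix stripper (not the Porter algorithm), and `toy_lemmatize` uses a three-entry lookup table in place of a real lemma dictionary such as WordNet.

```python
# Toy illustration only: neither function is NLTK's or spaCy's algorithm.

def naive_stem(word):
    """Strip a few common suffixes without checking that the result is a word."""
    for suffix in ("ing", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a real lemma dictionary.
TOY_LEMMAS = {"caring": "care", "taken": "take", "studies": "study"}


def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)


for w in ["caring", "studies", "taken"]:
    print(w, "->", naive_stem(w), "|", toy_lemmatize(w))
# caring -> car | care
# studies -> studi | study
# taken -> taken | take
```

The stemmer is a handful of string operations per word, whereas the lemmatizer is only as good as its dictionary. That dependence is where the extra cost of real lemmatizers comes from.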


# Preprocessing with NLTK

In [40]:
import re
import string
import nltk
nltk.download("stopwords")
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

[nltk_data] Downloading package stopwords to /Users/wgw/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/wgw/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/wgw/nltk_data...


In [19]:
tweet = train_data.iloc[rand_id,]["Tweet_content"]
tweet

'It is not the first time that the EU Commission has taken such a step.'

## Remove hyperlinks, Twitter marks and styles

In [23]:
# remove old style retweet text "RT"
tweet2 = re.sub(r'^RT[\s]+','', tweet)
# remove hyperlinks (match up to the next whitespace so the text after a link is kept)
tweet2 = re.sub(r'https?://\S+', '', tweet2)

# remove hashtags
# only removing the hash # sign from the word
tweet2 = re.sub(r'#', '', tweet2)

print(tweet2)

It is not the first time that the EU Commission has taken such a step.
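
The sample tweet above contains no retweet marker, link, or hashtag, so the cleaning has no visible effect on it. The sketch below runs the same kind of cleaning on a made-up tweet (not taken from the dataset) that contains all three. It assumes a URL pattern of `https?://\S+`, which stops at the first whitespace. A pattern ending in `.*` would instead consume everything up to the end of the line, including any text that follows the link.

```python
import re

# Made-up tweet for illustration; it is not taken from the dataset.
sample = "RT @gamer: Loving #Borderlands so much! https://t.co/abc123 see you there"

cleaned = re.sub(r"^RT\s+", "", sample)         # drop the leading retweet marker
cleaned = re.sub(r"https?://\S+", "", cleaned)  # drop the URL only, keep the text after it
cleaned = cleaned.replace("#", "")              # keep the hashtag word, drop the sign
cleaned = re.sub(r"\s+", " ", cleaned).strip()  # collapse leftover whitespace

print(cleaned)
# @gamer: Loving Borderlands so much! see you there
```

The handle `@gamer` is still present at this point. In the NLTK pipeline it is removed during tokenization by `strip_handles=True`.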


## Tokenize the string

We do tokenizing and lowercasing in the same step.

In [26]:
# instantiate tokenizer class
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)

tweet_tokens = tokenizer.tokenize(tweet2)
print(tweet_tokens)

['it', 'is', 'not', 'the', 'first', 'time', 'that', 'the', 'eu', 'commission', 'has', 'taken', 'such', 'a', 'step', '.']
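
To see what the three `TweetTokenizer` options do, the following sketch approximates them with plain regular expressions. It is an approximation for illustration, not NLTK's implementation: the helper name `approx_normalize` is made up, the handle pattern is simplified, and splitting on whitespace is far cruder than the tokenizer's real rules (which, for example, keep emoticons intact).

```python
import re


def approx_normalize(text):
    # Roughly mimics three TweetTokenizer options with plain regexes;
    # this is an approximation, not NLTK's implementation.
    text = re.sub(r"@\w+", "", text)                # like strip_handles=True
    text = re.sub(r"(.)\1{2,}", r"\1\1\1", text)    # like reduce_len=True
    return text.lower().split()                     # like preserve_case=False


print(approx_normalize("@friend This is sooooo COOOOL"))
# ['this', 'is', 'sooo', 'coool']
```

Shortening runs of three or more repeated characters to exactly three means "sooooo" and "soooooooo" map to the same token, which keeps the vocabulary smaller.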


## Remove stop words and punctuation

In [29]:
# Import the English stop words list from NLTK
stopwords_english = stopwords.words('english')

print('Stop words\n')
print(stopwords_english)

print('\nPunctuation\n')
print(string.punctuation)

Stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so

In [30]:
tweets_clean=[]
for word in tweet_tokens:
    if (word not in stopwords_english) and (word not in string.punctuation):
        tweets_clean.append(word)

print(tweets_clean)

['first', 'time', 'eu', 'commission', 'taken', 'step']
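
One subtlety: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test. A multi-character token such as "!!" or "..." is not a substring of it and therefore passes the filter. The sketch below shows one possible alternative that removes any token made up entirely of punctuation characters while keeping selected emoticons. The helper `is_removable_punct` and the emoticon whitelist are assumptions for illustration, not part of NLTK.

```python
import string

PUNCT_CHARS = set(string.punctuation)
# Hypothetical whitelist of emoticons worth keeping for sentiment analysis.
KEEP_EMOTICONS = {":)", ":(", ":D", ";)"}


def is_removable_punct(token):
    """True if the token consists only of punctuation and is not a kept emoticon."""
    if token in KEEP_EMOTICONS:
        return False
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)


tokens = ["great", "game", "!!", ":)", "...", "."]
print([t for t in tokens if not is_removable_punct(t)])
# ['great', 'game', ':)']
```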


## Stemming

In [33]:
stemmer = PorterStemmer()

tweets_stem = [stemmer.stem(w) for w in tweets_clean]
print(tweets_stem)

['first', 'time', 'eu', 'commiss', 'taken', 'step']


## Lemmatization

In [41]:
lemmertizer = WordNetLemmatizer()

tweets_lem = [lemmertizer.lemmatize(w) for w in tweets_clean]
print(tweets_lem)

['first', 'time', 'eu', 'commission', 'taken', 'step']


## Put things together

In [61]:
def process_tweet(tweet, tokenizer=nltk.tokenize.TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True), stopwords = stopwords.words('english'), punctuation = string.punctuation, stemmer = nltk.stem.PorterStemmer(), lemmertizer = None):
    # cast to str in case an entry is not a string (e.g., a missing value read as NaN)
    tweet = str(tweet)
    # remove old style retweet text "RT"
    tweet2 = re.sub(r'^RT[\s]+','', tweet)
    # remove hyperlinks (match up to the next whitespace so the text after a link is kept)
    tweet2 = re.sub(r'https?://\S+', '', tweet2)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet2 = re.sub(r'#', '', tweet2)

    # Tokenize
    tweet_tokens = tokenizer.tokenize(tweet2)

    # remove stop words and punctuation
    tweets_clean=[]
    for word in tweet_tokens:
        if (word not in stopwords) and (word not in punctuation):
            tweets_clean.append(word)
    
    # stemming or lemmatization: the stemmer takes precedence if both are given,
    # and the tokens are returned unchanged if neither is given
    if stemmer:
        tweets_clean = [stemmer.stem(w) for w in tweets_clean]
    elif lemmertizer:
        tweets_clean = [lemmertizer.lemmatize(w) for w in tweets_clean]
    
    return tweets_clean

In [62]:
train_data.Tweet_content.apply(process_tweet)

0                            [im, get, borderland, murder]
1                                     [come, border, kill]
2                              [im, get, borderland, kill]
3                           [im, come, borderland, murder]
4                         [im, get, borderland, 2, murder]
                               ...                        
74677    [realiz, window, partit, mac, like, 6, year, b...
74678    [realiz, mac, window, partit, 6, year, behind,...
74679    [realiz, window, partit, mac, 6, year, behind,...
74680    [realiz, window, partit, mac, like, 6, year, b...
74681    [like, window, partit, mac, like, 6, year, beh...
Name: Tweet_content, Length: 74682, dtype: object

# Preprocessing with spaCy

In [65]:
import spacy

In [67]:
nlp = spacy.load("en_core_web_sm")

In [72]:
print(tweet)

It is not the first time that the EU Commission has taken such a step.


In [74]:
# remove old style retweet text "RT"
tweet2 = re.sub(r'^RT[\s]+','', tweet)
# remove hyperlinks (match up to the next whitespace so the text after a link is kept)
tweet2 = re.sub(r'https?://\S+', '', tweet2)

# remove hashtags
# only removing the hash # sign from the word
tweet2 = re.sub(r'#', '', tweet2)

print(tweet2)

It is not the first time that the EU Commission has taken such a step.


In [80]:
doc = nlp(tweet2)
## tokens
print([token.text.lower() for token in doc if (not token.is_stop) and (not token.is_punct) ])

['time', 'eu', 'commission', 'taken', 'step']


In [81]:
## lemmatization
print([token.lemma_.lower() for token in doc if (not token.is_stop) and (not token.is_punct) ])

['time', 'eu', 'commission', 'take', 'step']


## Put it all together

In [84]:
def process_tweet_spacy(tweet, lemmetizer=False):
    # cast to str in case an entry is not a string (e.g., a missing value read as NaN)
    tweet = str(tweet)
    # remove old style retweet text "RT"
    tweet2 = re.sub(r'^RT[\s]+','', tweet)
    # remove hyperlinks (match up to the next whitespace so the text after a link is kept)
    tweet2 = re.sub(r'https?://\S+', '', tweet2)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet2 = re.sub(r'#', '', tweet2)

    doc = nlp(tweet2)
    # remove stop words and punctuation, returning lemmas or lowercased token text
    if lemmetizer:
        return [token.lemma_.lower() for token in doc if (not token.is_stop) and (not token.is_punct) ]
    else:
        return [token.text.lower() for token in doc if (not token.is_stop) and (not token.is_punct) ]

In [87]:
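# Note: this cell applies the NLTK-based process_tweet, so the output below is stemmed.
# To run the spaCy pipeline instead, pass process_tweet_spacy to apply().
train_data.iloc[:5].Tweet_content.apply(process_tweet)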

0       [im, get, borderland, murder]
1                [come, border, kill]
2         [im, get, borderland, kill]
3      [im, come, borderland, murder]
4    [im, get, borderland, 2, murder]
Name: Tweet_content, dtype: object