In [1]:
import numpy as np
import pandas as pd

In [2]:
text = pd.read_csv("/content/drive/MyDrive/NLP/sample.csv")
text.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,119237,105834,True,Wed Oct 11 06:55:44 +0000 2017,@AppleSupport causing the reply to be disregar...,119236.0,
1,119238,ChaseSupport,False,Wed Oct 11 13:25:49 +0000 2017,@105835 Your business means a lot to us. Pleas...,,119239.0
2,119239,105835,True,Wed Oct 11 13:00:09 +0000 2017,@76328 I really hope you all change but I'm su...,119238.0,
3,119240,VirginTrains,False,Tue Oct 10 15:16:08 +0000 2017,@105836 LiveChat is online at the moment - htt...,119241.0,119242.0
4,119241,105836,True,Tue Oct 10 15:17:21 +0000 2017,@VirginTrains see attached error message. I've...,119243.0,119240.0


#### 1. Text Cleaning
In this step, we perform basic text cleaning: converting all text to lowercase, removing URLs, removing characters that are neither word characters nor whitespace, and removing numerical digits.

**I. Converting to lowercase**

String comparisons in Python are case-sensitive, so words that differ only in capitalization would otherwise be treated as distinct tokens. To keep processing consistent, we convert all the text to lowercase.

This way, “Free” and “free” are treated as the same word, and word counts are not split across capitalization variants.
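
As a small illustration of why this matters, the sketch below counts a few words before and after lowercasing. The word list is invented for the example and is not taken from the dataset.

```python
from collections import Counter

# Hypothetical tokens, invented for this illustration
words = ["Free", "free", "FREE", "offer"]

# Without case folding, each capitalization is counted separately
raw_counts = Counter(words)

# After lowercasing, all three variants collapse into one entry
folded_counts = Counter(w.lower() for w in words)

print(raw_counts)
print(folded_counts)
```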

In [5]:
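# lowercase every string cell; on pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap
text = text.applymap(lambda x: x.lower() if isinstance(x, str) else x)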
text.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,119237,105834,True,wed oct 11 06:55:44 +0000 2017,@applesupport causing the reply to be disregar...,119236.0,
1,119238,chasesupport,False,wed oct 11 13:25:49 +0000 2017,@105835 your business means a lot to us. pleas...,,119239.0
2,119239,105835,True,wed oct 11 13:00:09 +0000 2017,@76328 i really hope you all change but i'm su...,119238.0,
3,119240,virgintrains,False,tue oct 10 15:16:08 +0000 2017,@105836 livechat is online at the moment - htt...,119241.0,119242.0
4,119241,105836,True,tue oct 10 15:17:21 +0000 2017,@virgintrains see attached error message. i've...,119243.0,119240.0


In [6]:
text['text']

0     @applesupport causing the reply to be disregar...
1     @105835 your business means a lot to us. pleas...
2     @76328 i really hope you all change but i'm su...
3     @105836 livechat is online at the moment - htt...
4     @virgintrains see attached error message. i've...
                            ...                        
88    @105860 i wish amazon had an option of where i...
89    they reschedule my shit for tomorrow https://t...
90    @105861 hey sara, sorry to hear of the issues ...
91    @tesco bit of both - finding the layout cumber...
92    @105861 if that doesn't help please dm your fu...
Name: text, Length: 93, dtype: object

As we can see, the text also contains **URLs**, so we will remove them next.

**II. Removing URLs**

When building a model, URLs are typically not relevant and can be removed from the text data.

To remove URLs, we can use Python's built-in `re` (regular expression) module.

In [7]:
import re

#define a regex pattern to match urls
url_pattern = re.compile(r'https?://\S+')

#define a function to remove urls from text
def remove_urls(text):
  return url_pattern.sub('', text)

#apply the function to the 'text' column, overwriting it in place
text['text'] =  text['text'].apply(remove_urls)

In [8]:
text['text']

0     @applesupport causing the reply to be disregar...
1     @105835 your business means a lot to us. pleas...
2     @76328 i really hope you all change but i'm su...
3     @105836 livechat is online at the moment -  or...
4     @virgintrains see attached error message. i've...
                            ...                        
88    @105860 i wish amazon had an option of where i...
89                they reschedule my shit for tomorrow 
90    @105861 hey sara, sorry to hear of the issues ...
91    @tesco bit of both - finding the layout cumber...
92    @105861 if that doesn't help please dm your fu...
Name: text, Length: 93, dtype: object
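
Two details of this step are easy to miss: removing a URL from the middle of a sentence leaves a double space behind (see row 3 above), and calling the function on a missing value would raise a `TypeError`. A minimal sketch that handles both is shown below. The name `remove_urls_safe` and the sample URL are assumptions made for this example. Note also that the pattern only matches URLs that start with `http://` or `https://`.

```python
import re

# Same pattern as in the cell above: "http://" or "https://" followed by non-space characters
url_pattern = re.compile(r'https?://\S+')

def remove_urls_safe(value):
    # Leave missing values (e.g. None or NaN floats) untouched instead of raising a TypeError
    if not isinstance(value, str):
        return value
    without_urls = url_pattern.sub('', value)
    # Collapse the extra whitespace left behind where a URL used to be
    return re.sub(r'\s+', ' ', without_urls).strip()

# The URL below is a made-up placeholder
sample = "livechat is online at the moment - https://t.co/abc123 or contact us"
print(remove_urls_safe(sample))
```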

**III. Removing non-word and non-whitespace characters**

It is essential to remove any characters that are not considered as words or whitespace from the text dataset.

These non-word and non-whitespace characters can include punctuation marks, symbols, and other special characters that do not provide any meaningful information for our analysis.

In [9]:
# restrict the replacement to the 'text' column so timestamps in other columns are left intact
text['text'] = text['text'].str.replace(r'[^\w\s]', '', regex=True)

In [10]:
text['text']

0     applesupport causing the reply to be disregard...
1     105835 your business means a lot to us please ...
2     76328 i really hope you all change but im sure...
3     105836 livechat is online at the moment   or c...
4     virgintrains see attached error message ive tr...
                            ...                        
88    105860 i wish amazon had an option of where i ...
89                they reschedule my shit for tomorrow 
90    105861 hey sara sorry to hear of the issues yo...
91    tesco bit of both  finding the layout cumberso...
92    105861 if that doesnt help please dm your full...
Name: text, Length: 93, dtype: object
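
To make the effect of the pattern concrete, the sketch below applies it to a single made-up string. It also shows a variant that keeps apostrophes so that contractions survive. That variant is an assumption for illustration, not what the notebook does. Because `\w` matches letters, digits, and underscores, digits are still present after this step.

```python
import re

# Hypothetical sample string, invented for this illustration
sample = "@tesco i'm sure it doesn't help!"

# Pattern from the cell above: drop everything that is not a word character or whitespace
stripped = re.sub(r"[^\w\s]", "", sample)

# Variant (an assumption, not the notebook's choice): also keep apostrophes
kept_apostrophes = re.sub(r"[^\w\s']", "", sample)

print(stripped)
print(kept_apostrophes)
```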

**IV. Removing digits**

In many text-analysis tasks, numerical digits carry little meaning, so we remove them from the text. This is a judgment call: in domains where numbers are informative (prices, version numbers, order IDs), they should be kept.

Digits also inflate the vocabulary with many rare tokens, such as the user IDs left over from the @mentions in the output above.

In [11]:
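# restrict the replacement to the 'text' column so IDs and timestamps in other columns keep their digits
text['text'] = text['text'].str.replace(r'\d', '', regex=True)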

In [12]:
text['text']

0     applesupport causing the reply to be disregard...
1      your business means a lot to us please dm you...
2      i really hope you all change but im sure you ...
3      livechat is online at the moment   or contact...
4     virgintrains see attached error message ive tr...
                            ...                        
88     i wish amazon had an option of where i can ju...
89                they reschedule my shit for tomorrow 
90     hey sara sorry to hear of the issues you are ...
91    tesco bit of both  finding the layout cumberso...
92     if that doesnt help please dm your full name ...
Name: text, Length: 93, dtype: object
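
As the output above shows, deleting digits leaves stray whitespace behind (rows 1 to 3 now start with a space). A minimal sketch of one way to tidy this up is given below. The helper name `remove_digits` and the sample string are assumptions for the example.

```python
import re

def remove_digits(s):
    # Delete every digit, then normalise the whitespace that is left over
    no_digits = re.sub(r"\d", "", s)
    return " ".join(no_digits.split())

# Hypothetical sample modelled on row 1 of the output above
sample = "105835 your business means a lot to us"

print(repr(re.sub(r"\d", "", sample)))  # a leading space remains
print(repr(remove_digits(sample)))      # whitespace normalised
```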

#### 2. Tokenization
Tokenization is the process of breaking down large blocks of text such as paragraphs and sentences into smaller, more manageable units.

In this step, we apply word tokenization to split the data in the 'text' column into individual words.

Working with individual tokens is what makes the later steps possible: stopword removal, stemming, and lemmatization all operate on one word at a time.

In [None]:
import nltk
nltk.download('all')

In [15]:
import nltk
from nltk.tokenize import word_tokenize

text['text'] = text['text'].apply(word_tokenize)

In [18]:
text['text'][0]

['applesupport',
 'causing',
 'the',
 'reply',
 'to',
 'be',
 'disregarded',
 'and',
 'the',
 'tapped',
 'notification',
 'under',
 'the',
 'keyboard',
 'is',
 'opened']
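
To see what word tokenization does without any NLTK data, here is a rough standard-library approximation. It is not the algorithm behind `word_tokenize`, and `simple_tokenize` is a hypothetical helper. On text that has already been cleaned, as here, it yields a similar list of words; on raw text it is cruder, as the second call shows.

```python
import re

def simple_tokenize(s):
    # Rough stand-in for a word tokenizer: return maximal runs of word characters
    return re.findall(r"\w+", s)

cleaned = "applesupport causing the reply to be disregarded"
print(simple_tokenize(cleaned))

# On uncleaned text the simplification shows: the contraction is split at the apostrophe
print(simple_tokenize("i've tried, twice."))
```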

#### **3. Stopword Removal**
Stopwords are the most commonly occurring words in a language, such as ‘the’, ‘to’, and ‘is’.

For the purpose of analyzing text data and building NLP models, these stopwords might not add much value to the meaning of the document. Therefore, removing stopwords can help us to focus on the most important information in the text and improve the accuracy of our analysis.

One of the advantages of removing stopwords is that it can reduce the size of the dataset, which in turn reduces the training time required for natural language processing models.

Various libraries such as ‘Natural Language Toolkit’ (NLTK), ‘spaCy’, and ‘Scikit-Learn’ can be used to remove stopwords.

In this example, we will use the NLTK library to remove stopwords from the 'text' column of our dataset.

In [19]:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
text['text'] = text['text'].apply(lambda x: [word for word in x if word not in stop_words])

In [20]:
text['text'][0]

['applesupport',
 'causing',
 'reply',
 'disregarded',
 'tapped',
 'notification',
 'keyboard',
 'opened']
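
One caveat worth checking before removing stopwords: general-purpose lists often contain negations (NLTK's English list, for example, includes ‘not’), which can matter for tasks such as sentiment analysis. The sketch below uses a tiny hand-written stopword set, not NLTK's list, to show how negations can be kept. All names in it are assumptions for the example.

```python
# Tiny hand-written stopword set for illustration only; NLTK's English list is much longer
demo_stop_words = {"the", "to", "be", "and", "is", "not", "under"}

# Words we may want to keep even though they appear in the stopword set
negations = {"not", "no", "never"}

def remove_stopwords(tokens, stop_words):
    # Keep only the tokens that are not in the given stopword set
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "reply", "is", "not", "helpful"]

print(remove_stopwords(tokens, demo_stop_words))              # negation is lost
print(remove_stopwords(tokens, demo_stop_words - negations))  # negation is kept
```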

#### **4. Stemming/Lemmatization**
What’s the difference between stemming and lemmatization?

* **Stemming** strips the last few characters from a word using fixed rules, which often produces stems that are misspelled or are not real words.

  For instance, a crude rule that simply strips ‘-ing’ would turn ‘caring’ into ‘car’, and the Porter stemmer used below turns ‘business’ into ‘busi’.

  Stemming is typically chosen for large datasets where speed is a concern.


* **Lemmatization** considers the context (such as the part of speech) and converts the word to its meaningful dictionary base form, which is called the lemma.

  For instance, lemmatizing the word ‘caring’ would return ‘care’.

  Lemmatization is slower than stemming because it relies on part-of-speech tagging and dictionary look-ups.
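
The contrast can be sketched with two toy functions. Both are deliberately simplistic and hypothetical: `crude_stem` is not the Porter algorithm, and `LEMMA_TABLE` is a hand-written stand-in for a real dictionary such as WordNet.

```python
# Suffixes tried in order by the toy stemmer
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    # Strip the first matching suffix, as long as at least three characters remain
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hand-written look-up table standing in for a real lemma dictionary
LEMMA_TABLE = {"caring": "care", "issues": "issue", "tapped": "tap"}

def lookup_lemma(word):
    # Return the dictionary base form if known, otherwise the word unchanged
    return LEMMA_TABLE.get(word, word)

for w in ["caring", "issues", "tapped"]:
    print(w, "->", crude_stem(w), "|", lookup_lemma(w))
```

The rule-based function is fast and needs no data, but it returns fragments such as ‘car’ and ‘issu’; the look-up returns real words but only for entries the table knows about.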

Several algorithms can be used for stemming:

* Porter Stemmer algorithm
* Snowball Stemmer algorithm
* Lovins Stemmer algorithm

**Stemming**

Let’s take a look at how we can use the Porter Stemmer algorithm on our dataset.

In [21]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

#initialize the porter stemmer
stemmer = PorterStemmer()

#define a function to perform stemming on the 'text' column
def stem_words(words):
  return [stemmer.stem(word) for word in words]

#apply the function to the 'text' column and create a new column 'stemmed_text'
text['stemmed_text'] = text['text'].apply(stem_words)


In [23]:
text['stemmed_text']  # many stems are truncated forms rather than real words

0     [applesupport, caus, repli, disregard, tap, no...
1     [busi, mean, lot, us, pleas, dm, name, zip, co...
2           [realli, hope, chang, im, sure, wont, dont]
3     [livechat, onlin, moment, contact, option, lea...
4     [virgintrain, see, attach, error, messag, ive,...
                            ...                        
88    [wish, amazon, option, get, ship, up, store, a...
89                          [reschedul, shit, tomorrow]
90    [hey, sara, sorri, hear, issu, ask, lay, speed...
91    [tesco, bit, find, layout, cumbersom, remov, i...
92    [doesnt, help, pleas, dm, full, name, address,...
Name: stemmed_text, Length: 93, dtype: object

**Lemmatization**

Next, let’s take a look at how we can implement Lemmatization for the same dataset.

In [30]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk  # needed for nltk.pos_tag below

# initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# define function to lemmatize tokens
def lemmatize_tokens(tokens):
    # convert POS tag to WordNet format
    def get_wordnet_pos(word):
        tag = nltk.pos_tag([word])[0][1][0].upper()
        tag_dict = {"J": wordnet.ADJ,
                    "N": wordnet.NOUN,
                    "V": wordnet.VERB,
                    "R": wordnet.ADV}
        return tag_dict.get(tag, wordnet.NOUN)

    # lemmatize tokens
    lemmas = [lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token in tokens]

    # return lemmatized tokens as a list
    return lemmas

# apply lemmatization function to column of dataframe
text['lemmatized_messages'] = text['text'].apply(lemmatize_tokens)

In [32]:
text['lemmatized_messages']  # unlike the stems above, most lemmas are real dictionary words

0     [applesupport, cause, reply, disregard, tapped...
1     [business, mean, lot, u, please, dm, name, zip...
2          [really, hope, change, im, sure, wont, dont]
3     [livechat, online, moment, contact, option, le...
4     [virgintrains, see, attach, error, message, iv...
                            ...                        
88    [wish, amazon, option, get, ship, ups, store, ...
89                         [reschedule, shit, tomorrow]
90    [hey, sara, sorry, hear, issue, ask, lay, spee...
91    [tesco, bit, find, layout, cumbersome, remove,...
92    [doesnt, help, please, dm, full, name, address...
Name: lemmatized_messages, Length: 93, dtype: object

### **Conclusion**

In this Colab notebook, we covered the main preprocessing steps in building an NLP model: text cleaning, tokenization, stopword removal, and stemming/lemmatization. Applying these steps can help improve model accuracy by reducing noise in the text data and converting it into a structured form that a model can process more easily.
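
As a recap, the cleaning, tokenization, and stopword steps can be chained into a single function. The sketch below is a simplified standard-library version under stated assumptions: `preprocess`, `DEMO_STOP_WORDS`, and the sample tweet are made up for this example, whitespace splitting stands in for `word_tokenize`, and stemming/lemmatization is left out. The order matters: URLs are removed before punctuation, because stripping `://` first would stop the URL pattern from matching.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")

# Tiny hand-written stopword set; a real pipeline would use a fuller list such as NLTK's
DEMO_STOP_WORDS = {"the", "to", "be", "and", "is", "a", "for", "my"}

def preprocess(raw):
    s = raw.lower()                    # 1. lowercase
    s = URL_PATTERN.sub("", s)         # 2. remove URLs (before punctuation is stripped)
    s = re.sub(r"[^\w\s]", "", s)      # 3. remove punctuation and symbols
    s = re.sub(r"\d", "", s)           # 4. remove digits
    tokens = s.split()                 # 5. whitespace tokenization (simplified)
    # 6. stopword removal
    return [t for t in tokens if t not in DEMO_STOP_WORDS]

# Made-up tweet with a placeholder URL
print(preprocess("They reschedule my order 42 for tomorrow! https://t.co/abc123"))
```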

### **THANK YOU FOR VISITING**
