# Basic Preprocessing Technique for NLP

## Introduction

Lowercasing in NLP is the process of converting all uppercase letters in a text to lowercase. It's done to make the text consistent and easier to work with. By reducing the complexity of the text, lowercasing helps in analyzing and processing it more effectively.


## Importance of Lowercasing

- `Text Normalization : ` Converting text to lowercase maps differently cased forms of the same word to a single form. This normalization step simplifies analysis by reducing the number of distinct word forms.

  For example, "Cat" and "cat" refer to the same concept but are different strings, because string comparison is case-sensitive. Lowercasing both lets them be treated as identical.

- `Vocabulary Reduction : ` Lowercasing reduces the vocabulary size by treating the same word in different cases as a single term. This simplifies tasks such as text classification, sentiment analysis, or information retrieval, where treating different cases as distinct entries may not be desired.

- `Consistency : ` Lowercasing ensures a consistent representation of words, which makes the text easier to process and analyze.

  For example, "New York" and "new york" refer to the same entity; lowercasing gives both the same representation.

- `Case Insensitivity : ` Lowercasing makes the text case-insensitive, which helps when case information is not relevant to the task. It simplifies matching or comparing words regardless of their original case.




## Usage

The general steps for applying lowercasing in an NLP pipeline are:

- `Tokenization : ` Break the text down into tokens, or individual words.

- `Lowercase Conversion : ` Apply the lowercase transformation to each token, converting all uppercase letters to lowercase.

#### NOTE -> Code implementations are provided in the respective notebooks in both R and Python.
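The two steps above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization is good enough for the text at hand; the function name `lowercase_tokens` is hypothetical and not taken from the notebooks.

```python
def lowercase_tokens(text):
    """Tokenize on whitespace, then lowercase each token."""
    tokens = text.split()                       # step 1: tokenization
    return [token.lower() for token in tokens]  # step 2: lowercase conversion


print(lowercase_tokens("New Yorkers Encounter EMPTY shelves"))
```

Lowercasing the whole string before splitting gives the same tokens; doing it per token simply mirrors the order of the steps listed above.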



## Considerations

While lowercasing is a commonly used preprocessing step in NLP, it's important to consider whether it is appropriate for a specific task. Here are some key factors to keep in mind:

- `Context : ` In certain NLP tasks, such as named entity recognition or sentiment analysis, the case of a word can carry valuable information, and lowercasing may discard important nuances.

  For example, distinguishing between "Apple" (the company) and "apple" (the fruit) relies on the case.

- `Task-specific Requirements : ` Depending on your particular NLP task, you may need to preserve the case information. Consider the objectives and requirements of your project. 

For instance, if you're working on a task that involves detecting proper nouns or acronyms, lowercasing might not be suitable.

- `Language Considerations : ` Languages differ in how they use capitalization. English capitalizes mainly proper nouns and sentence-initial words, while German capitalizes all nouns, so case carries grammatical information there. Some scripts, such as Chinese or Arabic, have no letter case at all. Consider the linguistic characteristics of the language you're working with to decide whether lowercasing is appropriate.

- `Named Entities : ` Named entities, such as person names, organization names, or locations, often have specific capitalization patterns. Lowercasing these entities might lead to ambiguity or loss of important information. Determine if your task involves identifying or analyzing named entities.

- `Acronyms and Abbreviations : ` Acronyms and abbreviations often rely on capitalization for recognition and understanding. Lowercasing them could result in the loss of their intended meaning. Check if your task involves handling acronyms or abbreviations.

- `Sentiment Analysis : ` In sentiment analysis or opinion mining, the capitalization of words can sometimes convey emphasis or sentiment. Lowercasing might flatten or alter the sentiment expressed in the text. Consider if sentiment analysis is a part of your task.

- `Language-specific Rules : ` Case mappings themselves can be language-specific. In Turkish, for example, the lowercase of "I" is the dotless "ı" and the uppercase of "i" is the dotted "İ", so a locale-independent lowercase function can produce the wrong letter. Make sure you understand the case conventions of the language you're working with.

- `Domain-specific Considerations : ` Certain domains or industries might have their own conventions for capitalization. For example, legal documents or scientific literature may follow specific capitalization styles. Take into account any domain-specific requirements for your task.

- `Task-specific Objectives : ` Consider the objectives of your NLP task and whether lowercasing aligns with them. Determine if lowercasing improves or hinders the performance of subsequent analysis steps, such as classification, clustering, or information retrieval.

- `Preserving Original Formatting : ` In some cases, it might be important to preserve the original formatting of the text, including capitalization. This is relevant when maintaining the original appearance or style is necessary, such as in text generation or dialogue systems.
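When acronyms matter, one option is to lowercase selectively. The sketch below is only an illustration: the function name `lowercase_except_acronyms` is hypothetical, and the rule "an all-caps token of two or more characters is an acronym" is an assumption that will also keep shouted words such as "TRENDING" in uppercase.

```python
def lowercase_except_acronyms(text):
    """Lowercase each whitespace-separated token unless it looks like an acronym."""
    result = []
    for token in text.split():
        # Heuristic (an assumption): all-caps tokens of 2+ characters are acronyms.
        if len(token) > 1 and token.isupper():
            result.append(token)
        else:
            result.append(token.lower())
    return " ".join(result)


print(lowercase_except_acronyms("The CDC Warned New Yorkers"))
```

Whether such a rule helps depends on the task; for many classification problems, plain lowercasing remains the simpler choice.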

In [1]:
# Importing Libraries
import pandas as pd

In [29]:
df = pd.read_csv("Basic PreProcessing/Corona_NLP_test.csv")  # forward slash avoids the invalid "\C" escape sequence

In [30]:
df.head()

| | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 1 | 44953 | NYC | 02-03-2020 | TRENDING: New Yorkers encounter empty supermar... | Extremely Negative |
| 1 | 2 | 44954 | Seattle, WA | 02-03-2020 | When I couldn't find hand sanitizer at Fred Me... | Positive |
| 2 | 3 | 44955 | | 02-03-2020 | Find out how you can protect yourself and love... | Extremely Positive |
| 3 | 4 | 44956 | Chicagoland | 02-03-2020 | #Panic buying hits #NewYork City as anxious sh... | Negative |
| 4 | 5 | 44957 | Melbourne, Victoria | 03-03-2020 | #toiletpaper #dunnypaper #coronavirus #coronav... | Neutral |


In [31]:
# Extracting a piece of text to see how lowercasing works
 
sample_text = df["OriginalTweet"][3]

print(sample_text)

#Panic buying hits #NewYork City as anxious shoppers stock up on food&amp;medical supplies after #healthcare worker in her 30s becomes #BigApple 1st confirmed #coronavirus patient OR a #Bloomberg staged event?

https://t.co/IASiReGPC4

#QAnon #QAnon2018 #QAnon2020 
#Election2020 #CDC https://t.co/29isZOewxu


In [32]:
lower_text = sample_text.lower()

# Why lowercase the text?

# Python string comparison is case-sensitive, so the same word written with different casing is treated as two distinct tokens. That inflates the vocabulary and adds complexity to the model, so we lowercase to prevent it.

print(lower_text)


#panic buying hits #newyork city as anxious shoppers stock up on food&amp;medical supplies after #healthcare worker in her 30s becomes #bigapple 1st confirmed #coronavirus patient or a #bloomberg staged event?

https://t.co/iasiregpc4

#qanon #qanon2018 #qanon2020 
#election2020 #cdc https://t.co/29iszoewxu


In [35]:
# Lowercasing the complete column


updated_text= df["OriginalTweet"].str.lower()


# So it is easy ....
updated_text

0       trending: new yorkers encounter empty supermar...
1       when i couldn't find hand sanitizer at fred me...
2       find out how you can protect yourself and love...
3       #panic buying hits #newyork city as anxious sh...
4       #toiletpaper #dunnypaper #coronavirus #coronav...
                              ...                        
3793    meanwhile in a supermarket in israel -- people...
3794    did you panic buy a lot of non-perishable item...
3795    asst prof of economics @cconces was on @nbcphi...
3796    gov need to do somethings instead of biar je r...
3797    i and @forestandpaper members are committed to...
Name: OriginalTweet, Length: 3798, dtype: object

# Removing Redundancies

## Removing HTML Tags

In [36]:
# sample text with HTML tags
text = "<p>Lorem ipsum dolor sit, amet consectetur adipisicing elit. Veritatis quae cum totam! Adipisci fuga dolor inventore labore voluptate alias, consequuntur, corrupti ea non iste autem quo quam dignissimos, quasi repellat. Sit molestias aut temporibus voluptatum quae. Incidunt fugit nisi eum, quae similique provident atque accusamus. Est, aperiam ipsum, placeat corporis aliquam inventore sunt autem accusantium, nobis fugit id rem laboriosam quae asperiores nulla enim incidunt et dolorum vel quam cumque ad! Quisquam, natus odio? Distinctio, consectetur maiores. Ad, beatae qui dolorem itaque culpa odit vero accusamus quo quam voluptas, impedit soluta dolorum facilis laboriosam eum? Eum amet nemo, repellat qui deleniti obcaecati placeat totam molestias assumenda sit vitae reprehenderit perspiciatis porro, asperiores quaerat fugiat ratione. Blanditiis nobis laboriosam quae fugiat culpa et, odit quaerat, commodi consectetur expedita inventore eos, provident deserunt! Sunt doloremque, blanditiis veniam velit culpa pariatur eveniet error sit a molestias voluptates deleniti obcaecati hic corporis aut maxime. Voluptates laudantium obcaecati molestias quidem voluptatum! Illo deleniti beatae magnam, provident ullam eveniet reprehenderit et! Aliquid illum consequuntur incidunt quis saepe atque nulla, molestiae eveniet quia deleniti at excepturi nisi sunt natus perspiciatis tempore dignissimos similique tenetur nesciunt vero culpa! Quasi, repudiandae facere nemo obcaecati soluta unde similique suscipit, quae tempore veniam minima eum distinctio illum harum temporibus, debitis aspernatur sed mollitia. Nesciunt veniam itaque delectus praesentium iste! Saepe consequuntur, sapiente numquam nobis molestiae totam perspiciatis. Id sapiente ad cupiditate libero inventore totam blanditiis, necessitatibus quis rerum, dolore minus nam debitis cum accusamus dolorem sit ex vel aut consequuntur et enim ipsa exercitationem iure! 
Quidem expedita illum, nisi accusamus unde, repudiandae exercitationem accusantium et iste mollitia amet voluptatem dolore labore omnis nesciunt facilis autem facere voluptate tenetur quod cumque alias qui! Fugiat vero tempora velit ipsum dicta ipsa, ullam eius eum earum labore. Voluptate omnis optio autem ut neque minima labore alias quam excepturi officiis est laborum nam, officia aperiam rem consequuntur nemo cupiditate reprehenderit dolorum a facere modi quod odio! Omnis quidem a minus provident nam, ipsa reiciendis rerum tempora est, quasi ipsam maxime esse maiores sint perferendis tenetur eum ad sed, ipsum corrupti earum! Magnam, tenetur? Magni asperiores mollitia inventore incidunt recusandae suscipit odit a voluptatum cum. Aspernatur sapiente minus, aperiam doloribus possimus dolorem magni. Quam exercitationem voluptatum molestias sapiente provident odit excepturi at, omnis laborum natus laboriosam tempora cupiditate fugit quia voluptates quod aperiam pariatur eveniet esse dignissimos id enim ducimus animi debitis? Enim aliquam nihil id laboriosam ea autem earum ratione quo corrupti suscipit nesciunt vel omnis corporis quis, distinctio voluptatibus unde? Aperiam necessitatibus quae suscipit earum quo ad, nemo beatae. Facilis in adipisci ad ducimus temporibus possimus eum, exercitationem nam cupiditate autem architecto, accusantium nobis minus aperiam, voluptatem enim neque dolore! Dolorum non unde iusto laborum tenetur explicabo. At quidem laborum nulla? Architecto facilis molestiae accusamus cumque iure voluptates vitae recusandae repellat corrupti reiciendis, debitis a placeat doloremque ipsam quod repudiandae pariatur vel quisquam atque corporis autem est iusto adipisci dolore! Explicabo, voluptatem sed. Perferendis nostrum optio itaque ut. Mollitia non voluptate velit repudiandae nulla laudantium eligendi consectetur quia ratione cumque, totam tenetur ex dolore pariatur blanditiis laboriosam impedit officia modi culpa, necessitatibus nisi. 
Molestias omnis sunt numquam dicta natus maiores assumenda. Odit voluptatum, doloremque accusamus eaque labore cum! Omnis consequatur quo, maiores error praesentium quos necessitatibus natus odit distinctio harum architecto reprehenderit molestiae! Incidunt vero nulla fugit nisi itaque reprehenderit at natus animi! Alias provident sapiente hic laborum aliquam. Ea libero provident exercitationem possimus, autem ipsum aperiam velit reprehenderit, dolor tenetur eveniet optio quo quas atque magnam pariatur, accusamus beatae placeat! Saepe, nobis corrupti, nostrum delectus commodi itaque numquam ab ullam doloribus vel error eveniet, labore quia id! Impedit placeat cumque quod vero alias ipsam aspernatur hic vitae! Molestiae, repellendus deserunt repudiandae esse beatae impedit qui accusamus rerum et reprehenderit accusantium, repellat ducimus similique autem sequi reiciendis magnam expedita, veniam mollitia maiores in! Dolores quibusdam at iusto, ex inventore suscipit eligendi voluptates, ipsam ducimus saepe, odio commodi eos nostrum recusandae in ad eum quas ut amet adipisci molestias! Sequi eius voluptatum corporis quis nobis adipisci architecto aspernatur aliquam necessitatibus perferendis et, laborum explicabo unde error provident? Qui culpa dolorem, animi omnis saepe consectetur dolor deleniti natus numquam, sequi aspernatur delectus voluptatem, eum similique? Rem tempore repellat assumenda quam amet quidem molestias in autem doloribus error? Rerum ullam dolores omnis aliquam, culpa, quaerat perspiciatis obcaecati iure, accusantium corrupti recusandae. Quibusdam magnam voluptatibus numquam omnis? Ratione a quisquam blanditiis eligendi iste? Possimus repellat nesciunt pariatur accusantium placeat officiis soluta dolore. Laborum optio dolores ad quasi molestias deserunt odio, unde libero obcaecati numquam at sequi magni voluptatem dolorum, corporis repellat, reprehenderit architecto. Accusantium perspiciatis, beatae laboriosam aliquid molestias provident cum. 
Dolor, consectetur vel provident quam et dolores sequi ipsa suscipit adipisci sed enim incidunt porro sapiente, placeat odit corrupti possimus hic ipsam earum veritatis, illo harum! Similique veritatis sequi beatae, nam accusantium numquam minus vitae atque? Amet sapiente veritatis vero error accusantium alias rem, a esse id aliquid enim. Suscipit nesciunt quia aliquid ipsa officiis eum molestias, ad distinctio magni dolorum? Velit nulla laboriosam, voluptate vitae doloremque ipsam animi obcaecati libero temporibus unde qui voluptas dolores odio impedit. Numquam sunt doloribus nobis necessitatibus molestiae eveniet eligendi, hic temporibus laboriosam repellendus, libero maxime quos minima corporis. Eaque saepe, quibusdam laboriosam expedita voluptates reiciendis distinctio perspiciatis repellendus minus praesentium nobis magni est tenetur itaque provident dolorem cumque dicta tempore facere quam! Atque dignissimos nemo nobis consequatur, pariatur ipsa molestias eaque aliquid nihil voluptates culpa autem asperiores rerum aut. Dolore at voluptatum dolorum, consequatur cumque tempora libero neque, architecto molestias incidunt facere, eos minus ipsum hic tempore vel nihil quisquam! Quidem quaerat ea suscipit aliquid molestias in, quibusdam amet perspiciatis vel impedit quod similique earum recusandae pariatur dignissimos quos laboriosam hic, deserunt consequuntur? Quas veritatis recusandae animi ipsam consequatur aliquid a magnam fugiat. Harum pariatur et, rem non, quod accusantium accusamus aut aliquid, in voluptatibus eligendi! Fugiat quam harum cum, voluptates ex possimus explicabo quaerat officia corporis accusantium, ipsam suscipit? Vero dicta cum sed, nesciunt exercitationem illum voluptate ipsam ex!<p><h1>Welcome to tutorial</h1><p>Lorem ipsum, dolor sit amet consectetur adipisicing elit. 
Repellat voluptatem quam cum distinctio molestias quia quisquam, dignissimos eos consequuntur vero veniam animi reprehenderit architecto omnis iusto numquam perferendis ratione inventore.</p></p></p>"

In [38]:
# defining a function that uses a regular expression to remove HTML tags
import re 

def removeTags(text):
    # '<.*?>' matches '<', then as few characters as possible, up to the next '>'
    pattern = re.compile(r'<.*?>')
    return pattern.sub('', text)

In [39]:
# finalizing result

removeTags(text) # removed all html tags

'Lorem ipsum dolor sit, amet consectetur adipisicing elit. Veritatis quae cum totam! Adipisci fuga dolor inventore labore voluptate alias, consequuntur, corrupti ea non iste autem quo quam dignissimos, quasi repellat. Sit molestias aut temporibus voluptatum quae. Incidunt fugit nisi eum, quae similique provident atque accusamus. Est, aperiam ipsum, placeat corporis aliquam inventore sunt autem accusantium, nobis fugit id rem laboriosam quae asperiores nulla enim incidunt et dolorum vel quam cumque ad! Quisquam, natus odio? Distinctio, consectetur maiores. Ad, beatae qui dolorem itaque culpa odit vero accusamus quo quam voluptas, impedit soluta dolorum facilis laboriosam eum? Eum amet nemo, repellat qui deleniti obcaecati placeat totam molestias assumenda sit vitae reprehenderit perspiciatis porro, asperiores quaerat fugiat ratione. Blanditiis nobis laboriosam quae fugiat culpa et, odit quaerat, commodi consectetur expedita inventore eos, provident deserunt! Sunt doloremque, blanditiis 

In [45]:
# Applying tag removal to the complete column.

update_text = df["OriginalTweet"].apply(removeTags)

update_text

0       TRENDING: New Yorkers encounter empty supermar...
1       When I couldn't find hand sanitizer at Fred Me...
2       Find out how you can protect yourself and love...
3       #Panic buying hits #NewYork City as anxious sh...
4       #toiletpaper #dunnypaper #coronavirus #coronav...
                              ...                        
3793    Meanwhile In A Supermarket in Israel -- People...
3794    Did you panic buy a lot of non-perishable item...
3795    Asst Prof of Economics @cconces was on @NBCPhi...
3796    Gov need to do somethings instead of biar je r...
3797    I and @ForestandPaper members are committed to...
Name: OriginalTweet, Length: 3798, dtype: object
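The tweets in this dataset also contain HTML entities such as `&amp;`, which a tag-stripping regular expression leaves untouched. A minimal sketch, assuming the standard-library `html.unescape` is acceptable for decoding them; the name `strip_html` is hypothetical.

```python
import html
import re

TAG_PATTERN = re.compile(r"<.*?>")


def strip_html(text):
    """Remove HTML tags, then decode entities such as &amp; into plain characters."""
    without_tags = TAG_PATTERN.sub("", text)
    # Decode after stripping, so an escaped "&lt;b&gt;" is not mistaken for a real tag.
    return html.unescape(without_tags)


print(strip_html("<p>food&amp;medical supplies</p>"))
```

For heavily malformed markup, a real HTML parser is usually more robust than a regular expression.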

## Removing URLS

In [46]:
text = df["OriginalTweet"][1]
text  # contains a link, which is noise for the model, so it is better to remove it


"When I couldn't find hand sanitizer at Fred Meyer, I turned to #Amazon. But $114.97 for a 2 pack of Purell??!!Check out how  #coronavirus concerns are driving up prices. https://t.co/ygbipBflMY"

In [47]:
def remove_URLS(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'',text)

In [48]:
# Let's check the function on the sample text
remove_URLS(text) # it works

"When I couldn't find hand sanitizer at Fred Meyer, I turned to #Amazon. But $114.97 for a 2 pack of Purell??!!Check out how  #coronavirus concerns are driving up prices. "

In [50]:
# Code to apply it to the complete column

text_without_URL = df["OriginalTweet"].apply(remove_URLS)

text_without_URL

0       TRENDING: New Yorkers encounter empty supermar...
1       When I couldn't find hand sanitizer at Fred Me...
2       Find out how you can protect yourself and love...
3       #Panic buying hits #NewYork City as anxious sh...
4       #toiletpaper #dunnypaper #coronavirus #coronav...
                              ...                        
3793    Meanwhile In A Supermarket in Israel -- People...
3794    Did you panic buy a lot of non-perishable item...
3795    Asst Prof of Economics @cconces was on @NBCPhi...
3796    Gov need to do somethings instead of biar je r...
3797    I and @ForestandPaper members are committed to...
Name: OriginalTweet, Length: 3798, dtype: object
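Deleting a URL also deletes the fact that a link was present, which can itself be a useful signal. As an alternative sketch, the same pattern can substitute a placeholder token instead; the name `replace_urls` and the `<URL>` token are assumptions chosen for illustration.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def replace_urls(text, placeholder="<URL>"):
    """Replace every URL with a fixed placeholder token."""
    return URL_PATTERN.sub(placeholder, text)


print(replace_urls("driving up prices. https://t.co/ygbipBflMY"))
```

If punctuation is removed in a later step, pick a placeholder that survives it, for example a plain word such as `urltoken`.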

## Removing Punctuations

In [53]:
# Punctuation characters defined by Python's string module

import string
punctuations = string.punctuation
punctuations

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [54]:
# Defining function to remove punctuations

def removePunc(text):
    for mark in punctuations:
        text = text.replace(mark,'')
    return text

In [61]:
new_text = df["OriginalTweet"][19]
new_text = removePunc(new_text)
new_text

'Studies show the coronavirus like COVID19 can live up to nine days on hard surfaces like metal plastic and glass\r\r\n\r\r\nOur Deputy Commissioner of Consumer Affairs Mary Barzee Flores shows you how to keep clean at the gas pump\r\r\n\r\r\nWatch and share httpstcoAIqATWT5zz'

In [62]:
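# A more efficient function to remove punctuation

def removePunc2(text):
    # str.translate deletes every punctuation character in a single pass
    return text.translate(str.maketrans('','',punctuations))

In [63]:
removePunc2(new_text) # This one is more efficient: one pass instead of one replace() call per punctuation mark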

'Studies show the coronavirus like COVID19 can live up to nine days on hard surfaces like metal plastic and glass\r\r\n\r\r\nOur Deputy Commissioner of Consumer Affairs Mary Barzee Flores shows you how to keep clean at the gas pump\r\r\n\r\r\nWatch and share httpstcoAIqATWT5zz'

In [64]:
# Applying it to the dataset

df["OriginalTweet"].apply(removePunc2)

0       TRENDING New Yorkers encounter empty supermark...
1       When I couldnt find hand sanitizer at Fred Mey...
2       Find out how you can protect yourself and love...
3       Panic buying hits NewYork City as anxious shop...
4       toiletpaper dunnypaper coronavirus coronavirus...
                              ...                        
3793    Meanwhile In A Supermarket in Israel  People d...
3794    Did you panic buy a lot of nonperishable items...
3795    Asst Prof of Economics cconces was on NBCPhila...
3796    Gov need to do somethings instead of biar je r...
3797    I and ForestandPaper members are committed to ...
Name: OriginalTweet, Length: 3798, dtype: object
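Deleting punctuation outright glues neighbouring words together when no space separates them: `food&amp;medical` becomes `foodampmedical` and `non-perishable` becomes `nonperishable`. A hedged alternative is to map each punctuation mark to a space and then collapse repeated whitespace. The name `punctuation_to_space` is hypothetical.

```python
import re
import string

# Translation table that maps every punctuation character to a single space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def punctuation_to_space(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    spaced = text.translate(_PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", spaced).strip()


print(punctuation_to_space("food&medical non-perishable items?"))
```

The trade-off is that contractions are split as well ("couldn't" becomes "couldn t"), so which variant is preferable depends on the downstream tokenizer.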

## Chat Word Treatment

In [70]:
# Suppose we have a mapping for common chat abbreviations. This is just an example to illustrate how chat word treatment works.

# Chat slang mapped to its expanded form
chat_slang = {
    "GN": "Good Night",
    "BTW" : "By The Way",
    "GM" : "Good Morning",
    "U" : "You",
    "U2" : "You Too",
    "THY" : "Thank You",
    "TH" : "Thanks",
    "MUS" : "Meet You Soon"
}

In [77]:
def chat_conversion(text):
    new_text = []
    for word in text.split():
        if word.upper() in chat_slang:
            new_text.append(chat_slang[word.upper()])
        else:
            new_text.append(word)
    return " ".join(new_text)

In [71]:
message = "Hey Rohan GM , TH BTW for tommorow party , MUS "

In [78]:
chat_conversion(message) # the chat words are now expanded

'Hey Rohan Good Morning , Thanks By The Way for tommorow party , Meet You Soon'
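Splitting on whitespace only matches slang that stands alone, so a token such as `GM,` with attached punctuation would be missed. A possible variant uses word boundaries in a regular expression. This is a sketch with a small hypothetical dictionary `DEMO_SLANG`, not the notebook's `chat_slang`.

```python
import re

# Small illustrative mapping (hypothetical, for demonstration only).
DEMO_SLANG = {"GM": "Good Morning", "BTW": "By The Way", "U": "You"}

# Longest keys first, wrapped in word boundaries so "u" inside "you" is not matched.
_SLANG_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in sorted(DEMO_SLANG, key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)


def expand_chat_words(text):
    """Expand chat abbreviations even when punctuation is attached to them."""
    return _SLANG_PATTERN.sub(lambda match: DEMO_SLANG[match.group(0).upper()], text)


print(expand_chat_words("GM, btw see u soon!"))
```

Case-insensitive matching is convenient but can over-trigger on short keys, so a real mapping needs review for ambiguous abbreviations.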

## Spelling Correction

In [90]:
# example

text =  "Hello,  Have u seen a peice of paper on table this evening. I think I had placed the papr somewhere "

# The same word (paper) appears with two spellings (paper, papr). The model would treat them as different tokens, which adds needless complexity.

In [80]:
# Importing the library for spelling correction

from textblob import TextBlob 

In [91]:
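# Instantiate a TextBlob object with the input text
text_blob_object = TextBlob(text)

# correct() returns a new TextBlob; .string gives the corrected text.
# The correction is statistical, so check the result: here "papr" becomes "paper", but "peice" becomes "peace" rather than "piece".
text_blob_object.correct().string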


'Hello,  Have u seen a peace of paper on table this evening. I think I had placed the paper somewhere '
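To make the idea behind spelling correction concrete, here is a small standard-library sketch that picks the closest word from a known vocabulary using `difflib`. It is not how TextBlob works internally; the vocabulary `DEMO_VOCABULARY`, the function `correct_token`, and the `cutoff` value are all assumptions chosen for illustration.

```python
import difflib

# Tiny illustrative vocabulary (hypothetical); a real one would come from a corpus.
DEMO_VOCABULARY = ["hello", "paper", "piece", "table", "evening"]


def correct_token(token, vocabulary=DEMO_VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the token itself if nothing is close."""
    if token in vocabulary:
        return token
    matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


print(correct_token("papr"))
```

Similarity alone cannot tell which of several equally close words was intended, which is why corrections like "peice" to "peace" happen; context-aware correctors address that at higher cost.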

## Removing Stop Words

In [94]:
# A random text
text = df["OriginalTweet"][3]
text

'#Panic buying hits #NewYork City as anxious shoppers stock up on food&amp;medical supplies after #healthcare worker in her 30s becomes #BigApple 1st confirmed #coronavirus patient OR a #Bloomberg staged event?\r\r\n\r\r\nhttps://t.co/IASiReGPC4\r\r\n\r\r\n#QAnon #QAnon2018 #QAnon2020 \r\r\n#Election2020 #CDC https://t.co/29isZOewxu'

In [100]:
from nltk.corpus import stopwords

stopwords.words('english')

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [107]:
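# Defining a function
# The NLTK stop-word list is all lowercase and the comparison is case-sensitive,
# so capitalized words such as "When" or "I" are not removed.
stop_words = set(stopwords.words('english'))  # build the set once; membership tests are fast

def Remove_stpwrd(text):
    new_text = []

    for word in text.split():
        if word in stop_words:
            new_text.append('')  # a stop word becomes an empty string, which leaves extra spaces after joining
        else:
            new_text.append(word)
    return " ".join(new_text)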

In [108]:
# old
text

'#Panic buying hits #NewYork City as anxious shoppers stock up on food&amp;medical supplies after #healthcare worker in her 30s becomes #BigApple 1st confirmed #coronavirus patient OR a #Bloomberg staged event?\r\r\n\r\r\nhttps://t.co/IASiReGPC4\r\r\n\r\r\n#QAnon #QAnon2018 #QAnon2020 \r\r\n#Election2020 #CDC https://t.co/29isZOewxu'

In [109]:
# updated text
Remove_stpwrd(text)

'#Panic buying hits #NewYork City  anxious shoppers stock   food&amp;medical supplies  #healthcare worker   30s becomes #BigApple 1st confirmed #coronavirus patient OR  #Bloomberg staged event? https://t.co/IASiReGPC4 #QAnon #QAnon2018 #QAnon2020 #Election2020 #CDC https://t.co/29isZOewxu'

In [110]:
# Applying it on dataset

df["OriginalTweet"].apply(Remove_stpwrd) 

0       TRENDING: New Yorkers encounter empty supermar...
1       When I  find hand sanitizer  Fred Meyer, I tur...
2          Find     protect   loved ones  #coronavirus. ?
3       #Panic buying hits #NewYork City  anxious shop...
4       #toiletpaper #dunnypaper #coronavirus #coronav...
                              ...                        
3793    Meanwhile In A Supermarket  Israel -- People d...
3794    Did  panic buy  lot  non-perishable items? ECH...
3795    Asst Prof  Economics @cconces   @NBCPhiladelph...
3796    Gov need   somethings instead  biar je rakyat ...
3797    I  @ForestandPaper members  committed   safety...
Name: OriginalTweet, Length: 3798, dtype: object
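The output above still contains "When" and "I", because the comparison is case-sensitive, and it shows runs of spaces where stop words used to be. A sketch of a variant that compares in lowercase and drops the words entirely follows. `DEMO_STOP_WORDS` is a small hypothetical set used so the example runs without NLTK; in practice the NLTK list would be passed in instead.

```python
# Small illustrative stop-word set (hypothetical); swap in the NLTK list in practice.
DEMO_STOP_WORDS = {"i", "when", "the", "at", "a"}


def remove_stop_words(text, stop_words=DEMO_STOP_WORDS):
    """Drop stop words case-insensitively without leaving gaps behind."""
    kept = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(kept)


print(remove_stop_words("When I couldn't find the hand sanitizer at Fred Meyer"))
```

Only the comparison is lowercased, so the casing of the remaining words is preserved.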

## Handling Emojis

In [115]:
# We have two options for emojis: remove them or convert them to text.

# Removal:

import re

def emojisTransform(text):
    emoji_pattern = re.compile("["
                                "\U0001F600-\U0001F64F" # emoticons
                                "\U0001F300-\U0001F5FF" # symbols & pictographs
                                "\U0001F680-\U0001F6FF" # transport and map symbols
                                "\U0001F1E0-\U0001F1FF" # flags
                                "]+", flags=re.UNICODE)
    # Emoji outside these ranges, such as U+1FAE1, are left untouched.
    return emoji_pattern.sub('', text)

In [116]:
emojisTransform("loved the movie , It is fantastic 😁😁😁😁🫡🫡🫡")

'loved the movie , It is fantastic \U0001fae1\U0001fae1\U0001fae1'

In [118]:
# Conversion: replace each emoji with its text name

import emoji 

emoji.demojize("loved the movie , It is fantastic 😁😁😁😁🫡🫡🫡")

'loved the movie , It is fantastic :beaming_face_with_smiling_eyes::beaming_face_with_smiling_eyes::beaming_face_with_smiling_eyes::beaming_face_with_smiling_eyes::saluting_face::saluting_face::saluting_face:'
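Hand-written code point ranges miss newer emoji, as the removal example shows. One hedged alternative is to drop every character whose Unicode general category is `So` ("Symbol, other"), using the standard-library `unicodedata` module. The name `strip_symbol_characters` is hypothetical, and the approach has known side effects noted below.

```python
import unicodedata


def strip_symbol_characters(text):
    """Remove characters in Unicode category 'So', which includes most emoji."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")


print(strip_symbol_characters("It is fantastic \U0001F601\U0001F601"))
```

This also removes non-emoji symbols such as the copyright sign, and it only recognizes emoji known to the Unicode version bundled with the running Python, so very recent emoji may still slip through.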

## Tokenization (Important)

In [120]:
# Sample corpus
corpus = df["OriginalTweet"]
corpus

# Why is tokenization important?

# The quality of the tokens affects the whole model: feature engineering only works well if it receives the tokens the task actually needs.

# Let's go from the most basic approach to more advanced ones.

0       TRENDING: New Yorkers encounter empty supermar...
1       When I couldn't find hand sanitizer at Fred Me...
2       Find out how you can protect yourself and love...
3       #Panic buying hits #NewYork City as anxious sh...
4       #toiletpaper #dunnypaper #coronavirus #coronav...
                              ...                        
3793    Meanwhile In A Supermarket in Israel -- People...
3794    Did you panic buy a lot of non-perishable item...
3795    Asst Prof of Economics @cconces was on @NBCPhi...
3796    Gov need to do somethings instead of biar je r...
3797    I and @ForestandPaper members are committed to...
Name: OriginalTweet, Length: 3798, dtype: object

In [124]:
# Using str.split(), which splits on whitespace

sample_text = removePunc(sample_text)

sample_text.split()

['Panic',
 'buying',
 'hits',
 'NewYork',
 'City',
 'as',
 'anxious',
 'shoppers',
 'stock',
 'up',
 'on',
 'foodampmedical',
 'supplies',
 'after',
 'healthcare',
 'worker',
 'in',
 'her',
 '30s',
 'becomes',
 'BigApple',
 '1st',
 'confirmed',
 'coronavirus',
 'patient',
 'OR',
 'a',
 'Bloomberg',
 'staged',
 'event',
 'httpstcoIASiReGPC4',
 'QAnon',
 'QAnon2018',
 'QAnon2020',
 'Election2020',
 'CDC',
 'httpstco29isZOewxu']

In [125]:
# Let's look at a case where simple tokenization falls short

sample_case = "hey i am new to new delhi"

sample_case.split() # "new delhi" is split into two separate tokens (new, delhi), so the place name is lost and its "new" is conflated with the adjective, which can hurt model performance

['hey', 'i', 'am', 'new', 'to', 'new', 'delhi']
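One way to keep a multi-word name together is to merge known phrases after splitting. The sketch below handles two-word phrases only; the function `merge_phrases`, the underscore joiner, and the phrase set are assumptions for illustration.

```python
def merge_phrases(tokens, phrases):
    """Join adjacent token pairs that appear in `phrases` into a single token."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            merged.append("_".join(pair))  # e.g. ("new", "delhi") -> "new_delhi"
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


known_phrases = {("new", "delhi")}
print(merge_phrases("hey i am new to new delhi".split(), known_phrases))
```

The first "new" stays a separate token because it is followed by "to", not "delhi".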

In [131]:
# Using a regular expression

import re 

# raw string, so that \w is passed to the regex engine unchanged
tokens = re.findall(r'\w+', sample_case)
tokens

['hey', 'i', 'am', 'new', 'to', 'new', 'delhi']

In [134]:
# using NLTK 

from nltk.tokenize import word_tokenize,sent_tokenize

sample_case = "welcome! we are your new friend ? "
word_tokenize(sample_case)

['welcome', '!', 'we', 'are', 'your', 'new', 'friend', '?']

## Stemming

In [140]:
# importing the stemmer from NLTK

from nltk.stem.porter import PorterStemmer

# Instantiate the stemmer object
stemObj = PorterStemmer()

# Define a stemming function using a list comprehension
def stemWords(text):
    return " ".join([stemObj.stem(word) for word in text.split()])


# test sample

sample1 = "walk walks walking walked"

stemWords(sample1) # done....



'walk walk walk walk'
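To show what a stemmer does in principle, here is a deliberately naive suffix stripper. It is not the Porter algorithm, which applies many more rules and conditions; the suffix list, the minimum stem length, and the name `naive_stem` are assumptions for illustration.

```python
# Suffixes tried in order (an illustrative, incomplete list).
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)]
    return word


print(" ".join(naive_stem(word) for word in "walk walks walking walked".split()))
```

The minimum length guard keeps short words such as "sing" intact, but a rule this simple still mangles many words, which is why established stemmers are preferred in practice.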

## Lemmatization

In [180]:
# importing libraries

import string
import nltk
from nltk.stem import WordNetLemmatizer

punctuations = string.punctuation

# Instantiate the lemmatizer object
wordnet_lemmatizer = WordNetLemmatizer()

# Sample case
sentence = "Running ,Swimming and eating in coherent manner might harm your health mentally and pyhsically."

sent_words = nltk.word_tokenize(sentence)

# Build a new list rather than removing items from the list being iterated over, which can skip tokens
sent_words = [word for word in sent_words if word not in punctuations]

print("{0:20}{1:20}".format("Word","Lemma"))
for word in sent_words:
    # pos='v' treats every token as a verb. As the output shows, capitalized "Running" and "Swimming" come back unchanged, so lowercase first if that matters.
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word, pos='v')))

Word                Lemma               
Running             Running             
Swimming            Swimming            
and                 and                 
eating              eat                 
in                  in                  
coherent            coherent            
manner              manner              
might               might               
harm                harm                
your                your                
health              health              
mentally            mentally            
and                 and                 
pyhsically          pyhsically          
