## What is this Notebook about?

In this notebook I will try to share different aspects of an NLP problem and we'll see what difference each one makes. This is not a traditional approach to a problem but a comparative one.

# Table of Contents
1. [Preparing the data](#1)
2. [Tokenization](#2)
3. [Problem based Cleaning](#3)
4. [Where embedding might fail](#4)
5. [Traditional data prep and where to use it](#5)
6. [Augmentation](#6)
7. [Resolving StopWords](#7)
8. [A look at Collocation extraction](#8)
9. [Similarity Analysis Among Sentences and feature extraction](#9)

<font color="red" size=3>Please upvote this kernel if you like it. It motivates me to produce more quality content :)</font>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import StratifiedKFold
from tqdm.notebook import tqdm
from IPython.display import YouTubeVideo
tqdm.pandas()

----------------------------------------------------------------------------------------------------------------------------------------------
<a id ="1" > </a>
## 1.Preparing the data

In [None]:
df = pd.read_csv('../input/covid-19-nlp-text-classification/Corona_NLP_train.csv', encoding='latin8')
df.head()

Before doing anything else let's first create folds for our dataset.

In [None]:
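def create_folds(X, y):
    # Work on the DataFrame that was passed in, not on the global `df`
    X['kfold'] = -1

    splitter = StratifiedKFold(n_splits=5)

    # Label every row with the index of the fold in which it is a validation row
    for f, (t_, v_) in enumerate(splitter.split(X, y)):
        X.loc[v_, 'kfold'] = f

    return X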

In [None]:
df = create_folds(df, df['Sentiment'])
df.head()
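
To make it clearer what "stratified" buys us, here is a minimal stdlib-only sketch. It is not scikit-learn's actual algorithm, and `assign_stratified_folds` and the toy labels are made up for illustration. The idea is simply that each class is dealt out across the folds separately, so every fold ends up with roughly the same class mix.

```python
from collections import Counter, defaultdict

def assign_stratified_folds(labels, n_splits=5):
    """Return one fold id per label, dealing each class round-robin across folds."""
    folds = [-1] * len(labels)
    seen_per_class = defaultdict(int)  # how many rows of each class we have placed so far
    for idx, label in enumerate(labels):
        folds[idx] = seen_per_class[label] % n_splits
        seen_per_class[label] += 1
    return folds

# Toy labels (hypothetical, not the real dataset): 10 positive, 5 negative
labels = ["Positive"] * 10 + ["Negative"] * 5
folds = assign_stratified_folds(labels, n_splits=5)

# Every fold should show the same 2:1 class ratio as the full data
for f in range(5):
    in_fold = [lab for lab, k in zip(labels, folds) if k == f]
    print(f, Counter(in_fold))
```

With a plain, unstratified split of sorted data, some folds could end up with only one class, which makes validation scores hard to compare across folds.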

In [None]:
df = df[['OriginalTweet', 'Sentiment', 'kfold']]
df.head(2)

----------------------------------------------------------------------------------------------------------------------------------------------
<a id ="2" > </a>
## 2.Tokenization
While tokenizing the tweets we have many tokenizers to choose from. Here we have to be wise and see which one gives what.

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from nltk.tokenize import TweetTokenizer, word_tokenize

In [None]:
sentences = df['OriginalTweet'][:5]

In [None]:
for i in sentences[2:3]:
    print("Original:\n")
    print(i)
    print('\nTensorflow Tokenizer:\n')
    a = Tokenizer()
    a.fit_on_texts([i])
    print(a.word_index)
    print("\nTweet Tokenizer:\n")
    print(TweetTokenizer().tokenize(i))
    print('\nNLTK word_tokenizer:\n')
    print(word_tokenize(i))

As you can see, these all yield different results, and you have to see which one works best for your use case. For now we will use NLTK's TweetTokenizer.
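
To see why the outputs differ, it helps to write a tiny tokenizer by hand. The sketch below is not how any of the libraries above work internally. `TOKEN_PATTERN` and `regex_tokenize` are hypothetical names, and the pattern only covers a handful of cases. It just shows that a tokenizer is a set of decisions about what counts as one unit.

```python
import re

# Hypothetical pattern: alternatives are tried left to right, so the order matters
TOKEN_PATTERN = re.compile(
    r"https?://\S+"      # URLs stay in one piece
    r"|[@#]\w+"          # @mentions and #hashtags keep their prefix
    r"|[:;]-?[()DP]"     # a handful of western emoticons
    r"|\w+(?:'\w+)?"     # words, optionally with an apostrophe part
    r"|[^\w\s]"          # any other single punctuation mark
)

def regex_tokenize(text):
    """Split text into tokens according to TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)

sample = "Stay home :) #StayHome @WHO says don't panic! https://t.co/abc"
print(sample.split())          # naive whitespace split glues "!" to "panic"
print(regex_tokenize(sample))  # emoticon, hashtag, mention and URL survive as units
```

Change one line of the pattern (for example, drop the emoticon alternative) and the token list changes, which is exactly the kind of difference we saw between the three tokenizers.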

In [None]:
tweets = []
tokenizer = TweetTokenizer()  # create the tokenizer once and reuse it

for i in tqdm(df['OriginalTweet']):
    
    tweet = tokenizer.tokenize(i)
    tweet = ' '.join(tweet)
    tweets.append(tweet)

In [None]:
for i in tweets[:3]:
    print(i, '\n')

----------------------------------------------------------------------------------------------------------------------------------------------
<a id ="3" > </a>
## 3.Cleaning (Like thoughtful cleaning)

Ok. Now what? <br>
Well, now we can clean the data, but first we have to decide how to do it. For that I will take insights from a very good kernel: https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings. I will still write the code here, but for a better explanation and a deeper understanding I highly recommend that notebook, and a couple more from the same author.

In short, we will load the embeddings and see how much of the vocabulary is covered by them.

In [None]:
from gensim.models import KeyedVectors
from gensim import downloader

embedding_file = '../input/embeddings/GoogleNews-vectors-negative-300d.bin'

embedding_model =  KeyedVectors.load_word2vec_format(embedding_file, binary=True)

In [None]:
def build_vocab(sentences, verbose =  True):
    """
    :param sentences: list of list of words
    :return: dictionary of words and their count
    """
    vocab = {}
    for sentence in tqdm(sentences, disable = (not verbose)):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

In [None]:
vocab = build_vocab([tweet.split() for tweet in tweets])
print({k: vocab[k] for k in list(vocab)[:5]})

In [None]:
import operator 

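def check_coverage(vocab,embeddings_index):
    a = {}    # words that have an embedding
    oov = {}  # out-of-vocabulary words and their counts
    k = 0     # number of tokens covered by the embeddings
    i = 0     # number of tokens not covered
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except KeyError:
            # The lookup fails for words without an embedding
            oov[word] = vocab[word]
            i += vocab[word]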

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x

In [None]:
oov = check_coverage(vocab,embedding_model)

In [None]:
oov[:20]
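
The two percentages printed above measure different things, so here is a small self-contained sketch of the same calculation on toy data. The `coverage` helper, the tiny `known_words` set (standing in for the embedding vocabulary) and the toy tweets are all made up for illustration.

```python
from collections import Counter

def coverage(token_counts, known_words):
    """Return (vocab coverage, text coverage, OOV words sorted by frequency)."""
    covered = {w: c for w, c in token_counts.items() if w in known_words}
    oov = {w: c for w, c in token_counts.items() if w not in known_words}
    vocab_cov = len(covered) / len(token_counts)                      # share of distinct words
    text_cov = sum(covered.values()) / sum(token_counts.values())     # share of all tokens
    return vocab_cov, text_cov, sorted(oov.items(), key=lambda kv: kv[1], reverse=True)

# Toy stand-ins (hypothetical): a tiny "embedding" vocabulary and three short tweets
known_words = {"the", "store", "is", "empty", "panic"}
toy_tweets = ["the store is empty", "the panic is real", "COVID19 panic"]
counts = Counter(word for tweet in toy_tweets for word in tweet.split())

vocab_cov, text_cov, oov_words = coverage(counts, known_words)
print(f"vocab coverage: {vocab_cov:.2%}, text coverage: {text_cov:.2%}")
print(oov_words)
```

Vocab coverage counts each distinct word once, while text coverage weights words by how often they occur, so a few frequent missing tokens hurt the second number much more than many rare ones.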

We will remove the punctuation that is not in the embeddings.

In [None]:
def clean_text(x):

    x = str(x)
    for punct in "/-'":
        x = x.replace(punct, ' ')
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    return x

In [None]:
for index, tweet in enumerate(tweets):
    
    tweets[index] = clean_text(tweet)


vocab = build_vocab([tweet.split() for tweet in tweets])

In [None]:
oov = check_coverage(vocab,embedding_model)

In [None]:
oov[:20]

So, we did improve a lot, but as we can see a significant portion of the vocab still has no embeddings. <br>
The main reason is our use case: "Covid-19" is itself a new term, so previously trained word embeddings are of no use for it. What can we possibly do in this case? In my opinion we have one option: replace every "COVID" occurrence with "crisis" (just a word for which we do have an embedding). We can also replace "SocialDistancing" with "distancing".

In [None]:
"crisis" in embedding_model

In [None]:
"distancing" in embedding_model

Before we make these changes let's look at which oov words have a significant length.

In [None]:
len(oov)

In [None]:
count = 0
index = 0

while count < 30 and index < len(oov):
    
    if len(oov[index][0]) > 3:
        print(oov[index])
        count += 1
        
    index += 1

Oh man I hate cleaning. But well what can I say it's damn important.

In [None]:
to_replace = [('COVID', 'health crisis'),
            ('COVID19', 'health crisis'),
            ('Covid19', 'health crisis'),
            ('Covid', 'health crisis'),
            ('COVID2019', 'health crisis'),
            ('covid19', 'health crisis'),
            ('toiletpaper', 'toilet paper'),
            ('covid', 'health crisis'),
            ('CoronaCrisis', 'health crisis'),
            ('CoronaVirus', 'health crisis'),
            ('SocialDistancing', 'Social distancing'),
            ('2020', 'this year'),
            ('CoronavirusPandemic', 'health crisis'),
            ('CoronavirusOutbreak', 'health crisis'),
            ('StayHomeSaveLives', 'Stay Home Save Lives'),
            ('StayAtHome', 'Stay At Home'),
            ('StayHome', 'Stay Home'),
            ('panicbuying', 'Panic Buying'),
            ('socialdistancing', 'Social Distancing'),
            ('CoronaVirusUpdate', 'health crisis update'),
            ('StopHoarding', 'Stop Hoarding'),
            ('realDonaldTrump', 'real Donald Trump'),
            ('StopPanicBuying', 'Stop Panic Buying'),
            ('covid19UK', 'health crisis'),
            ('QuarantineLife', 'Quarantine life'),
            ('behaviour', 'behave')]
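
Many of the hashtag entries in this list were split by hand. As a rough sketch (my own assumption, not something the list above relies on), a small regex could handle the CamelCase ones automatically. `split_camel_case` is a hypothetical helper.

```python
import re

def split_camel_case(token):
    """Insert a space wherever a lowercase letter is directly followed by an uppercase one."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)

for tag in ["StayHomeSaveLives", "StopPanicBuying", "realDonaldTrump", "covid19UK"]:
    print(tag, "->", split_camel_case(tag))
```

This does nothing for all-lowercase tags such as `panicbuying`, and it cannot know that `COVID19` should become another word entirely, so a manual mapping is still needed for those.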

In [None]:
# Turn the list of (old, new) pairs into a lookup table
to_replace_dict = dict(to_replace)

In [None]:
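for index, tweet in tqdm(enumerate(tweets), total=len(tweets)):
    
    cleaned_tweet = []
    
    for word in tweet.split():
        
        # Note: tokens of one or two characters are dropped here
        if len(word) > 2:
            
            # Swap known out-of-vocabulary tokens for their replacements
            if word in to_replace_dict:              
                cleaned_tweet.append(to_replace_dict[word])
            else:
                cleaned_tweet.append(word)
                
    tweets[index] = ' '.join(cleaned_tweet)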

In [None]:
vocab = build_vocab([tweet.split() for tweet in tweets])

In [None]:
oov = check_coverage(vocab,embedding_model)

???? Why are we only at 88%? Let's look again at our oov words.

In [None]:
count = 0
index = 0

while count < 30 and index < len(oov):
    
    if len(oov[index][0]) > 3:
        print(oov[index])
        count += 1
        
    index += 1

OH MAN I HATE THIS. Why can't they just use normal English and put spaces between words? Also, as you can see, "can't" is written as "canÂ", and thanks to that I can't reuse someone else's cleaning code. Well, I just want to show you that there are things you need to take care of.
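
Garbled characters like these usually come from text being decoded with the wrong encoding somewhere along the way. I have not verified how the tweets in this dataset got garbled, so treat the following as a sketch of one common case only: UTF-8 bytes that were read as Latin-1. `fix_mojibake` is a hypothetical helper.

```python
def fix_mojibake(text):
    """Try to undo UTF-8 bytes that were decoded as Latin-1; return text unchanged on failure."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage (or not damaged at all), so leave it alone
        return text

# Build a garbled string on purpose: UTF-8 bytes of a curly apostrophe read as Latin-1
garbled = "can’t".encode("utf-8").decode("latin-1")
print(repr(garbled))
print(fix_mojibake(garbled))
print(fix_mojibake("plain ascii"))
```

If the damage happened in a different way (another codec, or two rounds of wrong decoding), this round trip will not repair it, and the function simply returns the input.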

----------------------------------------------------------------------------------------------------------------------------------------------
<a id ="4" > </a>
## 4.Some other examples where embeddings might fail

What we have seen is only one use case. Some time ago there was a competition, "Toxic comment classification", whose data contained many emoticons and some vulgar words, so let's have a look at our embeddings and see if we have anything related to that.

I apologize in advance for the mention of these words. I am using them simply to show why this matters.

In [None]:
to_check = ['fuck', 'motherfucker', ':)', ":{", 'bastard', ':(']

for i in to_check:
    # Show the token next to the answer so the output is readable
    print(i, 'yes' if i in embedding_model else 'no')

As you can see, we may or may not have embeddings for all the words, especially emoticons, and nowadays emoticons are really popular. I would recommend training your own embeddings, as they will have better coverage. Alternatively, you could take the pretrained embeddings and just fine-tune them. <br> Also, let me show you a good thing about NLTK's TweetTokenizer.

In [None]:
TweetTokenizer().tokenize('This word has a :) face')

You see, it recognizes the emoticons, and that's really helpful while making embeddings.

----------------------------------------------------------------------------------------------------------------------------------------------
<a id ="5" > </a>
## 5.Traditional Data prep and where to use it

By now you could be wondering where all the stemming, lemmatization, etc. is. <br>
Well, the thing is we generally don't need those when using pretrained embeddings. Why so? Let's have a look.

In [None]:
from nltk.stem import SnowballStemmer, WordNetLemmatizer

word = 'elegant'
stem_word = SnowballStemmer('english').stem(word)
lemma = WordNetLemmatizer().lemmatize(word)

print("Stem word: ", stem_word)
print("\nLemma: ", lemma)

print("\nIs stemmed word present in embedding :", stem_word in embedding_model)
print("\nIs lemma present in embedding :", lemma in embedding_model)

As you can see, "eleg" is not a word and isn't present in the embeddings, so stemming will only reduce our coverage. Let's look at a few more examples to understand this better.

In [None]:
word1 = 'feet'
word2 = 'foot'

print(WordNetLemmatizer().lemmatize(word1))

print(word1 in embedding_model)
print(word2 in embedding_model)

Well, as you can see, both "feet" and "foot" are present in our embeddings, so if we lemmatize the word we lose the available variance, and hence we should not do that. <br>
To be truthful, you could try lemmatization, see what results it yields, and then choose whether to use it or not.

#### Now the question is: where should we use stemming, then?

Well, the answer to that question is not so simple. I will give you 2 places where you could use stemming and lemmatization, more likely stemming.

* 1) We have little data and the pretrained embeddings don't have good vocab coverage (you can also search for domain-specific pretrained embeddings and you will likely find some), or you just don't want to use pretrained embeddings. In such a case stemming is very useful: after stemming, each word occurs significantly more often, which gives better vectors.
* 2) Instead of simple embeddings you could use Tf-idf weighted embeddings, and use stemming when creating the Tf-idf vectors.

#### Using tf-idf weighted embeddings gives you an extra edge in most cases. At least I find so.
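
Since "tf-idf weighted embeddings" can sound vague, here is a minimal stdlib-only sketch of one way to build them. Everything in it is an illustrative assumption: the helper names, the 2-d toy vectors, and the smoothed idf formula log((1 + N) / (1 + df)) + 1. The sentence vector is the average of its word vectors, with each word weighted by term frequency times idf.

```python
import math
from collections import Counter

def idf_weights(tokenized_docs):
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

def weighted_sentence_vector(tokens, vectors, idf):
    """Average the word vectors of a sentence, weighting each word by tf * idf."""
    tf = Counter(t for t in tokens if t in vectors and t in idf)
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in tf.items():
        weight = count * idf[word]
        weight_sum += weight
        for i, value in enumerate(vectors[word]):
            total[i] += weight * value
    # Fall back to the zero vector when no token has both a vector and an idf weight
    return [x / weight_sum for x in total] if weight_sum else total

# Toy data (hypothetical): 2-d "embeddings" and three tokenized tweets
vectors = {"the": [1.0, 0.0], "store": [0.0, 1.0], "panic": [0.0, 1.0]}
docs = [["the", "store"], ["the", "panic"], ["the", "the", "store"]]
idf = idf_weights(docs)
sentence_vec = weighted_sentence_vector(docs[0], vectors, idf)
print(idf)
print(sentence_vec)
```

In the toy output, "the" appears in every document and gets the lowest weight, so the sentence vector leans toward "store". That is the whole point: frequent filler words contribute less than rarer, more informative ones.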

----------------------------------------------------------------------------------------------------------------------------------------------
<a id ="6" > </a>
## 6.Augmentation

Well, this part is rarely discussed and I am also not too sure how to deal with it. Of all the videos and articles/blogs I have gone through on this topic, I found 2 approaches very useful.

* 1) Use Synonyms of words and make new sentences by replacing the word by its synonyms.
* 2) Convert text into another language and then convert it back again to the original language.
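
To make the first idea concrete, here is a minimal sketch of synonym replacement with a hand-written synonym table. The `SYNONYMS` dictionary and `synonym_augment` are hypothetical. A real setup would pull synonyms from WordNet or a similar resource instead.

```python
import random

# Hypothetical hand-written synonym table; a real one would come from WordNet or similar
SYNONYMS = {
    "crowding": ["packing"],
    "closing": ["shutting"],
    "everyone": ["everybody"],
}

def synonym_augment(sentence, synonyms, prob=0.5, seed=0):
    """Replace each word that has a synonym with probability `prob`."""
    rng = random.Random(seed)  # local generator so results are reproducible
    out = []
    for word in sentence.split():
        choices = synonyms.get(word.lower())
        if choices and rng.random() < prob:
            out.append(rng.choice(choices))
        else:
            out.append(word)
    return " ".join(out)

text = "everyone is crowding the supermarkets before closing"
print(synonym_augment(text, SYNONYMS, prob=1.0))
```

Keeping `prob` below 1 and varying `seed` gives several different variants of the same tweet, which is usually what you want from augmentation.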

In [None]:
!pip install nlpaug

In [None]:
import nlpaug.augmenter.word as naw

In [None]:
sent = 'All month there hasn been crowding the supermarkets restaurants however reducing all the hours and closing the malls means everyone now using the same entrance and dependent single supermarket manila lockdown covid2019 Philippines https tco HxWs9LAnF9'
print('original: ', sent)
print('\nAugmented: ', naw.SynonymAug(aug_src='wordnet').augment(sent))

Pretty good, right? <br>
It would be interesting to explore this library further. Here are a few more such libraries: <br>
* textattack
* textaugment

### I would recommend these 2 videos:

In [None]:
YouTubeVideo('BBR3J2HI5xI')

In [None]:
YouTubeVideo('VpLAjOQHaLU')

Now, for converting text into a different language and back, you should see this - https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/48038

----------------------------------------------------------------------------------------------------------------------------------------------
<a id ="7" > </a>
## 7.Resolving StopWords
Well, there's not much to say here except a small reminder: be careful, as 'not' is also a stopword.

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [None]:
'not' in stop_words
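
Here is a small sketch of why that matters for sentiment, and one possible way around it. `STOP_WORDS` is a tiny hand-written stand-in for NLTK's list, and `NEGATIONS` and `remove_stopwords` are hypothetical names.

```python
# Hypothetical mini stopword list standing in for NLTK's English list
STOP_WORDS = {"i", "am", "the", "is", "a", "not", "no", "with", "this"}
# Negations we want to keep even though they are listed as stopwords
NEGATIONS = {"not", "no", "nor", "never"}

def remove_stopwords(text, stop_words, keep=frozenset()):
    """Drop stopwords from `text`, except the ones listed in `keep`."""
    return " ".join(
        w for w in text.split()
        if w.lower() not in stop_words or w.lower() in keep
    )

review = "I am not happy with this store"
print(remove_stopwords(review, STOP_WORDS))             # negation is lost
print(remove_stopwords(review, STOP_WORDS, NEGATIONS))  # negation survives
```

Without the `keep` set, a negative sentence turns into "happy store", which is the opposite of what the author meant.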

----------------------------------------------------------------------------------------------------------------------------------------------
<a id ="8" > </a>
## 8.A look at Collocation Extraction

A good read - https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908

In [None]:
words = ['break the rules', 'free time', 'draw a conclusion', 'keep in mind', 'get ready']

for i in words:
    
    print(i in embedding_model)

You see, these are not individual words; they get their meaning from appearing together. You will find the solution in the read mentioned above. <br>
Also, the n-gram approach of BOW and Tf-idf could be very helpful in this context.
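
As a rough illustration of the n-gram idea, the sketch below counts adjacent word pairs and joins the frequent ones into a single token. The helper names and the toy corpus are made up, and raw counts are a crude criterion. Real collocation extraction would normally use an association score rather than plain frequency.

```python
from collections import Counter

def bigram_counts(tokenized_sentences):
    """Count adjacent word pairs across all sentences."""
    counts = Counter()
    for tokens in tokenized_sentences:
        counts.update(zip(tokens, tokens[1:]))
    return counts

def merge_frequent_bigrams(tokens, counts, min_count=2):
    """Join pairs seen at least `min_count` times into one token such as 'free_time'."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and counts[pair] >= min_count:
            merged.append("_".join(pair))
            i += 2  # skip the second word, it is part of the merged token
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy corpus (hypothetical)
corpus = [s.split() for s in ["i have free time", "free time is rare", "time is free"]]
counts = bigram_counts(corpus)
print(merge_frequent_bigrams(corpus[0], counts))
```

Note that on this toy corpus "time is" also reaches the threshold, which shows why plain frequency over-merges and why a better score is worth the effort.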

----------------------------------------------------------------------------------------------------------------------------------------------
<a id ="9" > </a>
## 9.Similarity Analysis among sentences

USE CASE - A couple of years back there was a competition, "Quora Question Pair Similarity..", in which we had to predict whether 2 given questions are similar or not, and the following features could be very useful.

A good look - https://github.com/seatgeek/fuzzywuzzy#usage

In [None]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

### Simple Ratio - How much is matching

In [None]:
fuzz.ratio("this is a test", "this is a test!")

### Partial Ratio - Does it have a partial match

In [None]:
fuzz.partial_ratio("this is a test", "this is a test!")

### Token Sort Ratio - How much matches after sorting the tokens

In [None]:
fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")

### Token Set Ratio - Matching Ratio after making a set of tokens

In [None]:
fuzz.token_sort_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")

In [None]:
fuzz.token_set_ratio("fuzzy was a bear", "fuzzy fuzzy was a bear")
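
If you want a feel for what these scores measure, here is a simplified sketch built on the standard library's `difflib`. It is not fuzzywuzzy's implementation (the library's token set ratio in particular is more involved), and the function names are my own.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def sorted_token_ratio(a, b):
    """Compare the strings after sorting their tokens, so word order is ignored."""
    return simple_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

def unique_token_ratio(a, b):
    """Simplified set version: compare the sorted unique tokens, so repeats are ignored too."""
    return simple_ratio(" ".join(sorted(set(a.split()))), " ".join(sorted(set(b.split()))))

q1, q2 = "fuzzy was a bear", "fuzzy fuzzy was a bear"
print(simple_ratio(q1, q2), sorted_token_ratio(q1, q2), unique_token_ratio(q1, q2))
print(sorted_token_ratio("was fuzzy a bear", q1))  # same words, different order
```

For a question-pair task, each of these scores can go in as one numeric feature per pair, and the differences between them tell the model whether two questions differ in wording, in word order, or only in repetition.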
