<a id="top_section"></a>
<div align='center'><font size="6" color="#000000"><b>NLP Preprocessing and Feature Extraction Methods A-Z</b></font></div>
<hr>
<div align='center'><font size="4" color="#000000">The Beginning to Intermediate Guide</font></div>
<hr>

<a id="Introduction"></a>
# Introduction

This notebook aims to gather NLP preprocessing and feature extraction techniques, together with their code, in one ready-made place. Because no single NLP problem needs all of these techniques at once, each task is designed as a separate module with a quick, straightforward explanation followed by an implementation that can be picked out and used independently, plug-and-play. We mainly use the [Real or Not? NLP with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started) dataset for illustration, along with other datasets such as [Jigsaw Multilingual Toxic Comment Classification](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification) and [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) for some specific tasks.

The NLP pipeline can be represented as in the image below. This notebook focuses on only three of its stages: Text Cleaning, Pre-Processing, and Feature Engineering/Extraction.

<img src="https://miro.medium.com/max/1750/1*rJQVqDjbhI3k22lHqa4dFw.png" align="center"/>

image source: [Natural Language Processing Pipeline](https://towardsdatascience.com/natural-language-processing-pipeline-93df02ecd03f)

The paper [Text Classification Algorithms: A Survey](https://arxiv.org/abs/1904.08067) inspired much of this notebook, and most of the definitions of the tasks can also be found in that paper.

# Table of Contents
* [Introduction](#Introduction)
* [Read and explore data](#Read_and_explore_data)
    - [Importing Main Packages](#Importing_Main_Packages)
    - [Read the Data](#Read_the_Data)
* [Text Cleaning](#Text_Cleaning)
    - [Capitalization/ Lower case](#Capitalization)
    - [Expand the Contractions](#Expand_the_Contractions)
    - [Noise Removal](#Noise_Removal)
        - [Remove URLs](#Remove_urls)
        - [Remove HTML tags](#Remove_HTML_tags)
        - [Remove Non-ASCII](#Remove_Non_ASCII)
        - [Remove special characters](#Remove_special_characters)
    - [Remove punctuations](#Remove_punctuations)
    - [Other Manual Text Cleaning Tasks](#Other_Manual_Text_Cleaning_Tasks)
        - [Replace the Typos, slang, acronyms or informal abbreviations](#Replace_Typos)
        - [Spelling correction](#Spelling_correction)    
* [Text Preprocessing](#Text_Preprocessing)
    - [Tokenization](#Tokenization)
    - [Remove Stop Words (or/and Frequent words/ Rare words)](#Remove_Stop_Words)
    - [Stemming](#Stemming)
        - [PorterStemmer](#PorterStemmer)
        - [SnowballStemmer](#SnowballStemmer)
        - [LancasterStemmer](#LancasterStemmer)
    - [Part of Speech Tagging (POS Tagging)](#POS_Tagging)    
    - [Lemmatization](#Lemmatization)
        - [Lemmatization without POS Tagging](#Lemmatization_wo_pos)
        - [Lemmatization with POS tagging](#Lemmatization_w_pos)
    - [Other (Optional) Text Preprocessing Techniques:](#Other_Text_Preprocessing)
        - [Language Detection](#Language_Detection)
* [Text Features Extraction](#Text_Features_Extraction)
    - [Weighted Words - Bag of Words (BoW)](#BoW)
        - [Frequency Vectors - CountVectorizer](#CountVectorizer)
        - [Term Frequency-Inverse Document Frequency](#TF_IDF)
    - [Word Embedding](#Word_Embedding)
        - [Basic Word Embedding Methods](#Basic_Word_Embedding)
            - [Word2Vec](#Word2Vec)
            - [Global Vectors for Word Representation](#GloVe)
            - [FastText](#FastText)
        - [Advanced Word Embedding Methods - Deep Contextualized Word Representations](#Advanced_methods)
            - [Bidirectional Encoder Representations from Transformers (BERT)](#BERT)
    - [Comparison of Feature Extraction Techniques](#Comparison)
* [References](#References)
    - [Paper](#Paper)
    - [Books](#Books)
    - [Blogs/ Notebooks](#Blogs_Notebooks)

<a id="Read_and_explore_data"></a>

# Read and explore data

<a id="Importing_Main_Packages"></a>
## Importing Main Packages

[Back To Table of Contents](#top_section)

In [None]:
%%time
import os
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
    
import numpy as np
import pandas as pd
import sklearn

# Libraries and packages for text (pre-)processing 
import string
import re
import nltk

print("Python version:", sys.version)
print("Version info.:", sys.version_info)
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
print("sklearn version:", sklearn.__version__)
print("re version:", re.__version__)
print("nltk version:", nltk.__version__)

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id="Read_the_Data"></a>
## Read the Data

In [None]:
%%time

# read the csv file
train_df = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
display(train_df.shape, train_df.head())

In [None]:
# some early explorations

display(train_df[~train_df["location"].isnull()].head())
display(train_df[train_df["target"] == 0]["text"].values[1])
display(train_df[train_df["target"] == 1]["text"].values[1])

<a id="Text_Cleaning"></a>

# Text Cleaning:

<a id="Capitalization"></a>
## Capitalization/ Lower case
The most common text cleaning step is to normalize capitalization, usually by lower-casing, because the same word can appear in differently capitalized forms depending on where it sits in a sentence. Lower-casing projects all words in the text into the same feature space. However, it also causes problems for exceptional cases such as the acronyms USA or UK, which can be handled with the technique for replacing typos, slang, acronyms or informal abbreviations.

[Back To Table of Contents](#top_section)

In [None]:
train_df["text_clean"] = train_df["text"].apply(lambda x: x.lower())
display(train_df.head())

<a id="Expand_the_Contractions"></a>
## Expand the Contractions
We use the [contractions package](https://github.com/kootenpv/contractions) to expand English contractions, such as we'll -> we will or we shouldn't've -> we should not have.

[Back To Table of Contents](#top_section)

In [None]:
# Install the contractions package - https://github.com/kootenpv/contractions
!pip install contractions

In [None]:
%%time
import contractions

# Test
test_text = """
            Y'all can't expand contractions I'd think. I'd like to know how I'd done that! 
            We're going to the zoo and I don't think I'll be home for dinner.
            Theyre going to the zoo and she'll be home for dinner.
            We should've do it in here but we shouldn't've eat it
            """
print("Test: ", contractions.fix(test_text))

train_df["text_clean"] = train_df["text_clean"].apply(lambda x: contractions.fix(x))

# double check
print(train_df["text"][67])
print(train_df["text_clean"][67])
print(train_df["text"][12])
print(train_df["text_clean"][12])
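
To make the idea concrete without the package, contraction expansion can be sketched as a dictionary lookup driven by a regular expression. The `SAMPLE_CONTRACTIONS` mapping and the `expand_contractions` helper below are hypothetical and cover only a handful of forms; the `contractions` package handles far more cases.

```python
import re

# Hypothetical, deliberately tiny mapping (illustration only).
SAMPLE_CONTRACTIONS = {
    "can't": "cannot",
    "don't": "do not",
    "we'll": "we will",
    "i'll": "i will",
    "we're": "we are",
    "shouldn't've": "should not have",
}

# Longest keys first, so a long form wins over any shorter overlapping key.
_keys = sorted(SAMPLE_CONTRACTIONS, key=len, reverse=True)
_pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in _keys) + r")\b")

def expand_contractions(text):
    """Expand the contractions listed in SAMPLE_CONTRACTIONS (lower-cased input assumed)."""
    return _pattern.sub(lambda m: SAMPLE_CONTRACTIONS[m.group(0)], text)

expanded = expand_contractions("we'll see, but we shouldn't've said we can't")
print(expanded)  # we will see, but we should not have said we cannot
```

Anything missing from the mapping is left unchanged, which is why a maintained package is usually the more practical choice.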

<a id="Noise_Removal"></a>

## Noise Removal 
Text data could include various unnecessary characters or punctuation such as URLs, HTML tags, non-ASCII characters, or other special characters (symbols, emojis, and other graphic characters). 

<a id="Remove_urls"></a>
### Remove URLs
[Back To Table of Contents](#top_section)

In [None]:
def remove_URL(text):
    """
        Remove URLs from a sample string
    """
    return re.sub(r"https?://\S+|www\.\S+", "", text)

In [None]:
# remove urls from the text
train_df["text_clean"] = train_df["text_clean"].apply(lambda x: remove_URL(x))

# double check
print(train_df["text"][31])
print(train_df["text_clean"][31])
print(train_df["text"][37])
print(train_df["text_clean"][37])
print(train_df["text"][62])
print(train_df["text_clean"][62])
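
As a quick standalone check, the sketch below applies the same pattern as `remove_URL` to a made-up sentence. The helper name `remove_url_demo` and the sample text are assumptions for illustration. It also shows that removing a URL leaves extra whitespace behind, which we may want to collapse afterwards.

```python
import re

def remove_url_demo(text):
    # Same pattern as remove_URL above: http(s) links, or anything starting with "www."
    return re.sub(r"https?://\S+|www\.\S+", "", text)

sample = "fire near la ronge http://t.co/abc123 see www.example.com now"
cleaned = remove_url_demo(sample)
print(repr(cleaned))     # 'fire near la ronge  see  now' (note the double spaces)

# Optional follow-up step: collapse the whitespace left behind by the removal.
normalized = " ".join(cleaned.split())
print(repr(normalized))  # 'fire near la ronge see now'
```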

<a id="Remove_HTML_tags"></a>

### Remove HTML tags
[Back To Table of Contents](#top_section)

In [None]:
def remove_html(text):
    """
        Remove the html in sample text
    """
    html = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    return re.sub(html, "", text)

In [None]:
# remove html from the text
train_df["text_clean"] = train_df["text_clean"].apply(lambda x: remove_html(x))

# double check
print(train_df["text"][62])
print(train_df["text_clean"][62])
print(train_df["text"][7385])
print(train_df["text_clean"][7385])

<a id="Remove_Non_ASCII"></a>

### Remove Non-ASCII:
[Back To Table of Contents](#top_section)

In [None]:
def remove_non_ascii(text):
    """
        Remove non-ASCII characters 
    """
    return re.sub(r'[^\x00-\x7f]',r'', text) # or ''.join([x for x in text if x in string.printable]) 

In [None]:
# remove non-ascii characters from the text
train_df["text_clean"] = train_df["text_clean"].apply(lambda x: remove_non_ascii(x))

# double check
print(train_df["text"][38])
print(train_df["text_clean"][38])
print(train_df["text"][7586])
print(train_df["text_clean"][7586])

<a id="Remove_special_characters"></a>

### Remove special characters: 
Special characters can be symbols, emojis, and other graphic characters.
We use the "Toxic Comment Classification Challenge" dataset here because the "Real or Not? NLP with Disaster Tweets" dataset does not contain any special characters in its text.

[Back To Table of Contents](#top_section)

In [None]:
train_df_jtcc = pd.read_csv("/kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv.zip")
print(train_df_jtcc.shape)
train_df_jtcc.head()

In [None]:
def remove_special_characters(text):
    """
        Remove special characters, including symbols, emojis, and other graphic characters
    """
    emoji_pattern = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  # emoticons
        u'\U0001F300-\U0001F5FF'  # symbols & pictographs
        u'\U0001F680-\U0001F6FF'  # transport & map symbols
        u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
        u'\U00002702-\U000027B0'
        u'\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [None]:
%%time
# remove special characters from the text
train_df_jtcc["text_clean"] = train_df_jtcc["comment_text"].apply(lambda x: remove_special_characters(x))
display(train_df_jtcc.head())

# double check
print(train_df_jtcc["comment_text"][143])
print(train_df_jtcc["text_clean"][143])
print(train_df_jtcc["comment_text"][189])
print(train_df_jtcc["text_clean"][189])
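
One caveat worth checking before reusing this pattern on multilingual text: the last range, `\U000024C2-\U0001F251`, is very wide and also covers scripts such as CJK ideographs. The sketch below compares the ranges used above with a narrower variant. The `narrow` pattern and the sample string are assumptions for illustration, not a complete emoji definition.

```python
import re

# Same character ranges as remove_special_characters above.
broad = re.compile(
    "["
    "\U0001F600-\U0001F64F"
    "\U0001F300-\U0001F5FF"
    "\U0001F680-\U0001F6FF"
    "\U0001F1E0-\U0001F1FF"
    "\U00002702-\U000027B0"
    "\U000024C2-\U0001F251"
    "]+"
)

# Hypothetical narrower variant: the wide last range is replaced by explicit symbol blocks.
narrow = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "]+"
)

# "hello", a grinning face emoji, two Chinese characters, and a sun symbol.
sample = "hello \U0001F600 \u4f60\u597d \u2600"

broad_out = broad.sub("", sample)
narrow_out = narrow.sub("", sample)
print(repr(broad_out))   # the Chinese characters are removed as well
print(repr(narrow_out))  # only the emoji and the sun symbol are removed
```

For English-only data the difference rarely matters, but for a multilingual dataset the broad range would silently delete real text.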

In [None]:
# Free memory: this dataframe is no longer needed
del train_df_jtcc

<a id="Remove_punctuations"></a>

## Remove punctuations:
[Back To Table of Contents](#top_section)

In [None]:
def remove_punct(text):
    """
        Remove the punctuation
    """
#     return re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', "", text)
    return text.translate(str.maketrans('', '', string.punctuation))

In [None]:
# remove punctuations from the text
train_df["text_clean"] = train_df["text_clean"].apply(lambda x: remove_punct(x))

# double check
print(train_df["text"][5])
print(train_df["text_clean"][5])
print(train_df["text"][7597])
print(train_df["text_clean"][7597])
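
The short sketch below shows what `str.translate` with `string.punctuation` does and does not remove, using a made-up sentence. Two things are worth noticing: hashtags and mentions lose their `#` and `@` markers, and only ASCII punctuation is covered, so typographic quotes survive unless non-ASCII characters were already removed in an earlier step.

```python
import string

# string.punctuation contains ASCII punctuation only.
table = str.maketrans("", "", string.punctuation)

# Made-up sample with a hashtag, a mention, and typographic (non-ASCII) quotes.
sample = "#wildfire near @cityhall... stay safe!!! \u201creally\u201d"
stripped = sample.translate(table)
print(stripped)
```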

<a id="Other_Manual_Text_Cleaning_Tasks"></a>

## Other Manual Text Cleaning Tasks: 

Other techniques can be considered and applied manually, case by case:
- Replace Unicode characters with their equivalent ASCII characters (instead of removing them)
- Replace entity references with their actual symbols instead of removing them like HTML tags
- Replace typos, slang, acronyms or informal abbreviations, depending on the situation or the main topic of the NLP task, such as finance or medicine
- List all the hashtags/usernames, then replace them with equivalent words
- Replace emoticons/emojis with words of equivalent meaning, such as ":)" with "smile"
- Spelling correction
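
A few of these replace-instead-of-remove ideas can be sketched with the standard library alone. The helper names and the `SAMPLE_EMOTICONS` mapping below are hypothetical, and the ASCII conversion only works for characters that decompose into an ASCII base letter plus accents.

```python
import html
import unicodedata

# Hypothetical emoticon mapping, for illustration only.
SAMPLE_EMOTICONS = {":)": "smile", ":(": "sad"}

def to_ascii(text):
    """Replace accented characters with their ASCII base letters instead of dropping them."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def unescape_entities(text):
    """Turn entity references such as &amp; into the characters they stand for."""
    return html.unescape(text)

def replace_emoticons(text):
    """Replace each known emoticon with a word, then tidy up the whitespace."""
    for emoticon, word in SAMPLE_EMOTICONS.items():
        text = text.replace(emoticon, f" {word} ")
    return " ".join(text.split())

print(to_ascii("caf\u00e9 d\u00e9j\u00e0 vu"))          # cafe deja vu
print(unescape_entities("fire &amp; smoke &gt; 5 km"))  # fire & smoke > 5 km
print(replace_emoticons("great news :)"))               # great news smile
```

If steps like these are used, they would need to run before the corresponding removal steps, since there is nothing left to replace afterwards.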

<a id="Replace_Typos"></a>
### Replace the Typos, slang, acronyms or informal abbreviations: 
[Back To Table of Contents](#top_section)

In [None]:
def other_clean(text):
        """
            Other manual text cleaning techniques
        """
        # Typos, slang and other
        sample_typos_slang = {
                                "w/e": "whatever",
                                "usagov": "usa government",
                                "recentlu": "recently",
                                "ph0tos": "photos",
                                "amirite": "am i right",
                                "exp0sed": "exposed",
                                "<3": "love",
                                "luv": "love",
                                "amageddon": "armageddon",
                                "trfc": "traffic",
                                "16yr": "16 year"
                                }

        # Acronyms
        sample_acronyms =  { 
                            "mh370": "malaysia airlines flight 370",
                            "okwx": "oklahoma city weather",
                            "arwx": "arkansas weather",    
                            "gawx": "georgia weather",  
                            "scwx": "south carolina weather",  
                            "cawx": "california weather",
                            "tnwx": "tennessee weather",
                            "azwx": "arizona weather",  
                            "alwx": "alabama weather",
                            "usnwsgov": "united states national weather service",
                            "2mw": "tomorrow"
                            }

        
        # Some common abbreviations 
        sample_abbr = {
                        "$" : " dollar ",
                        "€" : " euro ",
                        "4ao" : "for adults only",
                        "a.m" : "before midday",
                        "a3" : "anytime anywhere anyplace",
                        "aamof" : "as a matter of fact",
                        "acct" : "account",
                        "adih" : "another day in hell",
                        "afaic" : "as far as i am concerned",
                        "afaict" : "as far as i can tell",
                        "afaik" : "as far as i know",
                        "afair" : "as far as i remember",
                        "afk" : "away from keyboard",
                        "app" : "application",
                        "approx" : "approximately",
                        "apps" : "applications",
                        "asap" : "as soon as possible",
                        "asl" : "age, sex, location",
                        "atk" : "at the keyboard",
                        "ave." : "avenue",
                        "aymm" : "are you my mother",
                        "ayor" : "at your own risk", 
                        "b&b" : "bed and breakfast",
                        "b+b" : "bed and breakfast",
                        "b.c" : "before christ",
                        "b2b" : "business to business",
                        "b2c" : "business to customer",
                        "b4" : "before",
                        "b4n" : "bye for now",
                        "b@u" : "back at you",
                        "bae" : "before anyone else",
                        "bak" : "back at keyboard",
                        "bbbg" : "bye bye be good",
                        "bbc" : "british broadcasting corporation",
                        "bbias" : "be back in a second",
                        "bbl" : "be back later",
                        "bbs" : "be back soon",
                        "be4" : "before",
                        "bfn" : "bye for now",
                        "blvd" : "boulevard",
                        "bout" : "about",
                        "brb" : "be right back",
                        "bros" : "brothers",
                        "brt" : "be right there",
                        "bsaaw" : "big smile and a wink",
                        "btw" : "by the way",
                        "bwl" : "bursting with laughter",
                        "c/o" : "care of",
                        "cet" : "central european time",
                        "cf" : "compare",
                        "cia" : "central intelligence agency",
                        "csl" : "can not stop laughing",
                        "cu" : "see you",
                        "cul8r" : "see you later",
                        "cv" : "curriculum vitae",
                        "cwot" : "complete waste of time",
                        "cya" : "see you",
                        "cyt" : "see you tomorrow",
                        "dae" : "does anyone else",
                        "dbmib" : "do not bother me i am busy",
                        "diy" : "do it yourself",
                        "dm" : "direct message",
                        "dwh" : "during work hours",
                        "e123" : "easy as one two three",
                        "eet" : "eastern european time",
                        "eg" : "example",
                        "embm" : "early morning business meeting",
                        "encl" : "enclosed",
                        "encl." : "enclosed",
                        "etc" : "and so on",
                        "faq" : "frequently asked questions",
                        "fawc" : "for anyone who cares",
                        "fb" : "facebook",
                        "fc" : "fingers crossed",
                        "fig" : "figure",
                        "fimh" : "forever in my heart", 
                        "ft." : "feet",
                        "ft" : "featuring",
                        "ftl" : "for the loss",
                        "ftw" : "for the win",
                        "fwiw" : "for what it is worth",
                        "fyi" : "for your information",
                        "g9" : "genius",
                        "gahoy" : "get a hold of yourself",
                        "gal" : "get a life",
                        "gcse" : "general certificate of secondary education",
                        "gfn" : "gone for now",
                        "gg" : "good game",
                        "gl" : "good luck",
                        "glhf" : "good luck have fun",
                        "gmt" : "greenwich mean time",
                        "gmta" : "great minds think alike",
                        "gn" : "good night",
                        "g.o.a.t" : "greatest of all time",
                        "goat" : "greatest of all time",
                        "goi" : "get over it",
                        "gps" : "global positioning system",
                        "gr8" : "great",
                        "gratz" : "congratulations",
                        "gyal" : "girl",
                        "h&c" : "hot and cold",
                        "hp" : "horsepower",
                        "hr" : "hour",
                        "hrh" : "his royal highness",
                        "ht" : "height",
                        "ibrb" : "i will be right back",
                        "ic" : "i see",
                        "icq" : "i seek you",
                        "icymi" : "in case you missed it",
                        "idc" : "i do not care",
                        "idgadf" : "i do not give a damn fuck",
                        "idgaf" : "i do not give a fuck",
                        "idk" : "i do not know",
                        "ie" : "that is",
                        "i.e" : "that is",
                        "ifyp" : "i feel your pain",
                        "ig" : "instagram",
                        "iirc" : "if i remember correctly",
                        "ilu" : "i love you",
                        "ily" : "i love you",
                        "imho" : "in my humble opinion",
                        "imo" : "in my opinion",
                        "imu" : "i miss you",
                        "iow" : "in other words",
                        "irl" : "in real life",
                        "j4f" : "just for fun",
                        "jic" : "just in case",
                        "jk" : "just kidding",
                        "jsyk" : "just so you know",
                        "l8r" : "later",
                        "lb" : "pound",
                        "lbs" : "pounds",
                        "ldr" : "long distance relationship",
                        "lmao" : "laugh my ass off",
                        "lmfao" : "laugh my fucking ass off",
                        "lol" : "laughing out loud",
                        "ltd" : "limited",
                        "ltns" : "long time no see",
                        "m8" : "mate",
                        "mf" : "motherfucker",
                        "mfs" : "motherfuckers",
                        "mfw" : "my face when",
                        "mofo" : "motherfucker",
                        "mph" : "miles per hour",
                        "mr" : "mister",
                        "mrw" : "my reaction when",
                        "ms" : "miss",
                        "mte" : "my thoughts exactly",
                        "nagi" : "not a good idea",
                        "nbc" : "national broadcasting company",
                        "nbd" : "not big deal",
                        "nfs" : "not for sale",
                        "ngl" : "not going to lie",
                        "nhs" : "national health service",
                        "nrn" : "no reply necessary",
                        "nsfl" : "not safe for life",
                        "nsfw" : "not safe for work",
                        "nth" : "nice to have",
                        "nvr" : "never",
                        "nyc" : "new york city",
                        "oc" : "original content",
                        "og" : "original",
                        "ohp" : "overhead projector",
                        "oic" : "oh i see",
                        "omdb" : "over my dead body",
                        "omg" : "oh my god",
                        "omw" : "on my way",
                        "p.a" : "per annum",
                        "p.m" : "after midday",
                        "pm" : "prime minister",
                        "poc" : "people of color",
                        "pov" : "point of view",
                        "pp" : "pages",
                        "ppl" : "people",
                        "prw" : "parents are watching",
                        "ps" : "postscript",
                        "pt" : "point",
                        "ptb" : "please text back",
                        "pto" : "please turn over",
                        "qpsa" : "what happens", #"que pasa",
                        "ratchet" : "rude",
                        "rbtl" : "read between the lines",
                        "rlrt" : "real life retweet", 
                        "rofl" : "rolling on the floor laughing",
                        "roflol" : "rolling on the floor laughing out loud",
                        "rotflmao" : "rolling on the floor laughing my ass off",
                        "rt" : "retweet",
                        "ruok" : "are you ok",
                        "sfw" : "safe for work",
                        "sk8" : "skate",
                        "smh" : "shake my head",
                        "sq" : "square",
                        "srsly" : "seriously", 
                        "ssdd" : "same stuff different day",
                        "tbh" : "to be honest",
                        "tbs" : "tablespoonful",
                        "tbsp" : "tablespoonful",
                        "tfw" : "that feeling when",
                        "thks" : "thank you",
                        "tho" : "though",
                        "thx" : "thank you",
                        "tia" : "thanks in advance",
                        "til" : "today i learned",
                        "tl;dr" : "too long i did not read",
                        "tldr" : "too long i did not read",
                        "tmb" : "tweet me back",
                        "tntl" : "trying not to laugh",
                        "ttyl" : "talk to you later",
                        "u" : "you",
                        "u2" : "you too",
                        "u4e" : "yours for ever",
                        "utc" : "coordinated universal time",
                        "w/" : "with",
                        "w/o" : "without",
                        "w8" : "wait",
                        "wassup" : "what is up",
                        "wb" : "welcome back",
                        "wtf" : "what the fuck",
                        "wtg" : "way to go",
                        "wtpa" : "where the party at",
                        "wuf" : "where are you from",
                        "wuzup" : "what is up",
                        "wywh" : "wish you were here",
                        "yd" : "yard",
                        "ygtr" : "you got that right",
                        "ynk" : "you never know",
                        "zzz" : "sleeping bored and tired"
                        }
            
        sample_typos_slang_pattern = re.compile(r'(?<!\w)(' + '|'.join(re.escape(key) for key in sample_typos_slang.keys()) + r')(?!\w)')
        sample_acronyms_pattern = re.compile(r'(?<!\w)(' + '|'.join(re.escape(key) for key in sample_acronyms.keys()) + r')(?!\w)')
        sample_abbr_pattern = re.compile(r'(?<!\w)(' + '|'.join(re.escape(key) for key in sample_abbr.keys()) + r')(?!\w)')
        
        text = sample_typos_slang_pattern.sub(lambda x: sample_typos_slang[x.group()], text)
        text = sample_acronyms_pattern.sub(lambda x: sample_acronyms[x.group()], text)
        text = sample_abbr_pattern.sub(lambda x: sample_abbr[x.group()], text)
        
        return text

In [None]:
%%time

# Test
test_text = """
            brb with some sample ph0tos I lov u. I need some $ for 2mw.
            """
print("Test: ", other_clean(test_text))

# replace typos, slang, acronyms and abbreviations in the text
train_df["text_clean"] = train_df["text_clean"].apply(lambda x: other_clean(x))

# double check
print(train_df["text"][1844])
print(train_df["text_clean"][1844])
print(train_df["text"][4409])
print(train_df["text_clean"][4409])
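
The `(?<!\w)` and `(?!\w)` lookarounds in `other_clean` are what keep short keys from matching inside longer words. The sketch below isolates that behaviour with a hypothetical three-entry mapping. A plain `\b` would not work for keys such as `w/`, because there is no word boundary between `/` and a following space.

```python
import re

# Hypothetical three-entry mapping; other_clean above uses the same pattern shape.
sample_map = {"app": "application", "w/": "with", "u": "you"}

# A key matches only when it is not preceded or followed by a word character.
pattern = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(key) for key in sample_map) + r")(?!\w)"
)

def replace_tokens(text):
    return pattern.sub(lambda m: sample_map[m.group()], text)

result = replace_tokens("u need the app w/ an update, not an apple")
print(result)  # you need the application with an update, not an apple
```

Here "update" and "apple" are left untouched even though they start with the keys "u" and "app".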

<a id="Spelling_correction"></a>

### Spelling Correction
Spelling correction can also be considered an optional preprocessing task, because social media text often contains typos and misspellings. However, the corrected output should be carefully double-checked against the original input, as the correction itself can be wrong.

[Back To Table of Contents](#top_section)

In [None]:
from textblob import TextBlob
print("Test: ", TextBlob("sleapy and tehre is no plaxe I'm gioong to.").correct())

<a id="Text_Preprocessing"></a>

# Text Preprocessing:

<a id="Tokenization"></a>
## Tokenization
Tokenization is a common technique that splits a sentence into tokens, where a token can be a character, word, phrase, symbol, or other meaningful element. Breaking sentences into smaller chunks makes it easier to examine the words in a sentence and supports the subsequent steps in the NLP pipeline, such as stemming.

[Back To Table of Contents](#top_section)

In [None]:
# Tokenize the cleaned tweet text.
from nltk.tokenize import word_tokenize

train_df['tokenized'] = train_df['text_clean'].apply(word_tokenize)
train_df.head()
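
For intuition about what a word tokenizer does, here is a dependency-free sketch based on a single regular expression. The `simple_tokenize` helper is hypothetical and much cruder than NLTK's `word_tokenize`; it only separates runs of word characters from individual punctuation marks.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens (not NLTK's algorithm)."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("forest fire near la ronge, sask. canada")
print(tokens)
# ['forest', 'fire', 'near', 'la', 'ronge', ',', 'sask', '.', 'canada']
```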

<a id="Remove_Stop_Words"></a>

## Remove Stop Words (or/and Frequent words/ Rare words):
Stop words are common words in any language that occur with a high frequency but do not deliver meaningful information for the whole sentence. For example, {“a”, “about”, “above”, “across”, “after”, “afterward”, “again”, ...} can be considered as stop words. Traditionally, we could remove all of them in the text preprocessing stage. However, consider the following example from the [Natural Language Processing in Action](https://www.manning.com/books/natural-language-processing-in-action) book: 
> * Mark reported to the CEO
> * Suzanne reported as the CEO to the board 

> In your NLP pipeline, you might create 4-grams such as reported to the CEO and reported as the CEO. If you remove the stop words from the 4-grams, both examples would be reduced to "reported CEO", and you would lack the information about the professional hierarchy. In the first example, Mark could have been an assistant to the CEO, whereas in the second example Suzanne was the CEO reporting to the board. Unfortunately, retaining the stop words within your pipeline creates another problem: it increases the length of the n-grams required to make use of these connections formed by the otherwise meaningless stop words. This issue forces us to retain at least 4-grams if you want to avoid the ambiguity of the human resources example.
> Designing a filter for stop words depends on your particular application.

In short, removing stop words is a common method in NLP text preprocessing, but whether it helps depends on the application, so it should be tested carefully.
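
The ambiguity in the quoted example is easy to reproduce. The sketch below uses a hypothetical mini stop list, just large enough for the two 4-grams from the quote, and shows that both collapse to the same tokens once stop words are removed.

```python
# Hypothetical mini stop list, just large enough for the two 4-grams in the quote.
sample_stop_words = {"to", "the", "as"}

def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop list."""
    return [token for token in tokens if token not in stop_words]

first = "reported to the ceo".split()
second = "reported as the ceo".split()

first_filtered = remove_stop_words(first, sample_stop_words)
second_filtered = remove_stop_words(second, sample_stop_words)

print(first_filtered)   # ['reported', 'ceo']
print(second_filtered)  # ['reported', 'ceo']
```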

[Back To Table of Contents](#top_section)

In [None]:
# Removing stopwords.
nltk.download("stopwords")
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
train_df['stopwords_removed'] = train_df['tokenized'].apply(lambda x: [word for word in x if word not in stop])
train_df.head()

<a id="Stemming"></a>

## Stemming
Stemming is the process of reducing a word to its root by identifying a common stem among its various forms (e.g., singular and plural noun forms). For example, the words "gardening", "gardener" and "gardens" share the same stem, "garden". Stemming strips suffixes from words to merge words with similar meanings under their common stem.
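
The suffix-stripping idea can be sketched with a toy function. This is not the Porter, Snowball or Lancaster algorithm; the `SAMPLE_SUFFIXES` list and the minimum stem length are assumptions chosen so that the garden example works.

```python
# Hypothetical suffix list, checked in order; real stemmers apply many more rules.
SAMPLE_SUFFIXES = ("ing", "er", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, as long as enough of the word remains."""
    for suffix in SAMPLE_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["gardening", "gardener", "gardens", "sing"]]
print(stems)  # ['garden', 'garden', 'garden', 'sing']
```

The length check is what keeps "sing" from being reduced to "s". Real stemmers encode many such conditions, which is where the algorithms differ in how aggressive they are.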

There are three major stemming algorithms in use nowadays:
- **Porter** - PorterStemmer(): This stemming algorithm is an older one. It’s from the 1980s and its main concern is removing the common endings to words so that they can be resolved to a common form. It’s not too complex and development on it is frozen. Typically, it’s a nice starting basic stemmer, but it’s not really advised to use it for any production/complex application. Instead, it has its place in research as a nice, basic stemming algorithm that can guarantee reproducibility. It also is a very gentle stemming algorithm when compared to others.

- **Snowball** - SnowballStemmer(): This algorithm is also known as the Porter2 stemming algorithm. It is almost universally accepted as better than the Porter stemmer, even being acknowledged as such by the individual who created the Porter stemmer. That being said, it is also more aggressive than the Porter stemmer. A lot of the things added to the Snowball stemmer were because of issues noticed with the Porter stemmer. There is about a 5% difference in the way that Snowball stems versus Porter.

- **Lancaster** - LancasterStemmer(): Just for fun, the Lancaster stemming algorithm is another algorithm that you can use. This one is the most aggressive stemming algorithm of the bunch. However, if you use the stemmer in NLTK, you can add your own custom rules to this algorithm very easily. It’s a good choice for that. One complaint around this stemming algorithm though is that it sometimes is overly aggressive and can really transform words into strange stems. Just make sure it does what you want it to before you go with this option!

source: http://hunterheidenreich.com/blog/stemming-lemmatization-what/

<a id="PorterStemmer"></a>
### PorterStemmer
[Back To Table of Contents](#top_section)

In [None]:
from nltk.stem import PorterStemmer

def porter_stemmer(text):
    """
        Stem words in list of tokenized words with PorterStemmer
    """
    stemmer = PorterStemmer()
    stems = [stemmer.stem(i) for i in text]
    return stems

In [None]:
%%time

train_df['porter_stemmer'] = train_df['stopwords_removed'].apply(lambda x: porter_stemmer(x))
train_df.head()

<a id="SnowballStemmer"></a>
### SnowballStemmer
[Back To Table of Contents](#top_section)

In [None]:
from nltk.stem import SnowballStemmer

def snowball_stemmer(text):
    """
        Stem words in list of tokenized words with SnowballStemmer
    """
    stemmer = SnowballStemmer("english")
    stems = [stemmer.stem(i) for i in text]
    return stems

In [None]:
%%time

train_df['snowball_stemmer'] = train_df['stopwords_removed'].apply(lambda x: snowball_stemmer(x))
train_df.head()

<a id="LancasterStemmer"></a>
### LancasterStemmer
[Back To Table of Contents](#top_section)

In [None]:
from nltk.stem import LancasterStemmer

def lancaster_stemmer(text):
    """
        Stem words in list of tokenized words with LancasterStemmer
    """
    stemmer = LancasterStemmer()
    stems = [stemmer.stem(i) for i in text]
    return stems

In [None]:
%%time

train_df['lancaster_stemmer'] = train_df['stopwords_removed'].apply(lambda x: lancaster_stemmer(x))
train_df.head()

<a id="POS_Tagging"></a>

## Part of Speech Tagging (POS Tagging):
Part-of-speech (POS) tagging assigns a part of speech (noun, verb, adjective, etc.) to each word in the text. This is a critical stage for many NLP applications because knowing the POS of a word helps us infer its meaning in context. NLTK offers several POS tagging algorithms, and in this notebook we use a combination of them.

- pos_tag / DefaultTagger
- UnigramTagger
- BigramTagger
- A combination of the bigram tagger, unigram tagger, and default tagger, chained through backoff (source: https://www.nltk.org/book/ch05.html)
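
The backoff idea behind the combined tagger can be sketched in plain Python. This is only an illustration under stated assumptions: the tiny hand-tagged corpus and the function name `tag_with_backoff` are hypothetical, and the logic loosely mirrors (but does not reproduce) NLTK's tagger classes. Each word is first looked up with its previous tag as context, then on its own, and finally falls back to a default tag.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus (hypothetical; the notebook trains on the Brown news corpus).
tagged_sents = [
    [("the", "AT"), ("fire", "NN"), ("spread", "VBD")],
    [("they", "PPSS"), ("fire", "VB"), ("the", "AT"), ("gun", "NN")],
    [("the", "AT"), ("fire", "NN"), ("burned", "VBD")],
]

unigram_counts = defaultdict(Counter)  # word -> tag frequencies
bigram_counts = defaultdict(Counter)   # (previous tag, word) -> tag frequencies
for sent in tagged_sents:
    prev_tag = None
    for word, tag in sent:
        unigram_counts[word][tag] += 1
        bigram_counts[(prev_tag, word)][tag] += 1
        prev_tag = tag

def tag_with_backoff(tokens, default_tag="NN"):
    """Tag tokens: bigram context first, then unigram, then the default tag."""
    tagged, prev_tag = [], None
    for word in tokens:
        if (prev_tag, word) in bigram_counts:
            tag = bigram_counts[(prev_tag, word)].most_common(1)[0][0]
        elif word in unigram_counts:
            tag = unigram_counts[word].most_common(1)[0][0]
        else:
            tag = default_tag
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

print(tag_with_backoff(["they", "fire", "the", "cannon"]))
```

Here "fire" is tagged as a verb after "they" because the bigram context has been seen, while the unseen word "cannon" falls through to the default tag.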

[Back To Table of Contents](#top_section)

In [None]:
from nltk.corpus import wordnet
from nltk.corpus import brown

wordnet_map = {"N":wordnet.NOUN, 
               "V":wordnet.VERB, 
               "J":wordnet.ADJ, 
               "R":wordnet.ADV
              }
    
train_sents = brown.tagged_sents(categories='news')
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

def pos_tag_wordnet(text, pos_tag_type="pos_tag"):
    """
        Create pos_tag with wordnet format
    """
    pos_tagged_text = t2.tag(text)
    
    # map the first letter of each Brown tag to a WordNet POS, defaulting to noun
    pos_tagged_text = [(word, wordnet_map.get(pos_tag[0], wordnet.NOUN)) for (word, pos_tag) in pos_tagged_text]
    return pos_tagged_text

In [None]:
pos_tag_wordnet(train_df['stopwords_removed'][2])

In [None]:
%%time

train_df['combined_postag_wnet'] = train_df['stopwords_removed'].apply(lambda x: pos_tag_wordnet(x))

train_df.head()

<a id="Lemmatization"></a>

## Lemmatization:
According to the [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf) book:
> Lemmatization is the task of determining that two words have the same root, despite their surface differences. The words am, are, and is have the shared lemma be; the words dinner and dinners both have the lemma dinner. Lemmatizing each of these forms to the same lemma will let us find all mentions of words in Russian like Moscow. The lemmatized form of a sentence like He is reading detective stories would thus be He be read detective story.

and the book [Natural Language Processing in Action](https://www.manning.com/books/natural-language-processing-in-action):
> Some lemmatizers use the word’s part of speech (POS) tag in addition to its spelling to help improve accuracy. The POS tag for a word indicates its role in the grammar of a phrase or sentence. For example, the noun POS is for words that refer to “people, places, or things” within a phrase. An adjective POS is for a word that modifies or describes a noun. A verb refers to an action. The POS of a word in isolation cannot be determined. The context of a word must be known for its POS to be identified. So some advanced lemmatizers can’t be run on words in isolation.

For example, "good", "better", and "best" are all lemmatized to "good", and the verb "gardening" is lemmatized to "garden", whereas "garden" and "gardener" are different lemmas. In this notebook, we explore lemmatization both without and with POS tagging.

[Back To Table of Contents](#top_section)

In [None]:
from nltk.stem import WordNetLemmatizer

def lemmatize_word(text):
    """
        Lemmatize the tokenized words
    """

    lemmatizer = WordNetLemmatizer()
    lemma = [lemmatizer.lemmatize(word, tag) for word, tag in text]
    return lemma

<a id="Lemmatization_wo_pos"></a>

### Lemmatization without POS Tagging:
[Back To Table of Contents](#top_section)

In [None]:
%%time

# Test without POS Tagging
lemmatizer = WordNetLemmatizer()

train_df['lemmatize_word_wo_pos'] = train_df['stopwords_removed'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
train_df['lemmatize_word_wo_pos'] = train_df['lemmatize_word_wo_pos'].apply(lambda x: [word for word in x if word not in stop])
train_df.head()

In [None]:
print(train_df["combined_postag_wnet"][8])
print(train_df["lemmatize_word_wo_pos"][8])

<a id="Lemmatization_w_pos"></a>

### Lemmatization with POS Tagging:
[Back To Table of Contents](#top_section)

In [None]:
%%time

# Test with POS Tagging
lemmatizer = WordNetLemmatizer()

train_df['lemmatize_word_w_pos'] = train_df['combined_postag_wnet'].apply(lambda x: lemmatize_word(x))
train_df['lemmatize_word_w_pos'] = train_df['lemmatize_word_w_pos'].apply(lambda x: [word for word in x if word not in stop]) # double check to remove stop words
train_df['lemmatize_text'] = [' '.join(map(str, l)) for l in train_df['lemmatize_word_w_pos']] # join back to text

train_df.head()

Compare the output of lemmatization without and with POS tagging. In the original text, the word "happening" is a verb, and the POS-tagging stage correctly tagged it as a verb, so it is accurately lemmatized to "happen". Without POS tagging, the result stays "happening", which is incorrect.

In [None]:
print(train_df["text"][8])
print(train_df["combined_postag_wnet"][8])
print(train_df["lemmatize_word_wo_pos"][8])
print(train_df["lemmatize_word_w_pos"][8])

Comparison between the original text and the lemmatized text:

In [None]:
display(train_df["text"][0], train_df["lemmatize_text"][0])
display(train_df["text"][5], train_df["lemmatize_text"][5])
display(train_df["text"][10], train_df["lemmatize_text"][10])
display(train_df["text"][15], train_df["lemmatize_text"][15])
display(train_df["text"][20], train_df["lemmatize_text"][20])

<a id="Other_Text_Preprocessing"></a>

## Other (Optional) Text Preprocessing Techniques:
- Language detection
- Code mixing and transliteration

<a id="Language_Detection"></a>
### Language Detection:
We will use the [polyglot](https://github.com/aboSamoor/polyglot) package for language detection.

[Back To Table of Contents](#top_section)

In [None]:
# Install polyglot and the other necessary packages
!pip install pyicu
!pip install pycld2
!pip install polyglot

We will use the "Jigsaw Multilingual Toxic Comment Classification" dataset for this case because it is multilingual.

In [None]:
train_df_jmtc = pd.read_csv("../input/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train.csv")
print(train_df_jmtc.shape)
train_df_jmtc.head()

In [None]:
%%time
from polyglot.detect import Detector

def get_language(text):
    # return the language code (e.g. "de") so that the filter on "de" below can match
    return Detector("".join(x for x in text if x.isprintable()), quiet=True).languages[0].code

train_df_jmtc["lang"] = train_df_jmtc["comment_text"].apply(lambda x: get_language(x))

#Test
display(train_df_jmtc[train_df_jmtc["lang"] == "de"].head())
print(train_df_jmtc["comment_text"][823])
print(train_df_jmtc["comment_text"][8130])
print(train_df_jmtc["comment_text"][14511])

In [None]:
# free memory
del train_df_jmtc

### Code Mixing and Transliteration:
This situation should be considered for multilingual text, such as text that mixes English with other languages.

<a id="Text_Features_Extraction"></a>

# Text Features Extraction:

<a id="BoW"></a>
## Weighted Words - Bag of Words (BoW) - Bag of n-grams:
* An n-gram is a sequence of n elements (characters, words, etc.). A single word such as "apple" or "orange" is a uni-gram; "red apple" and "big orange" are bi-grams; and "red ripe apple" and "big orange bag" are tri-grams.
* Bags of words: Vectors of word counts or frequencies 
* Bags of n-grams: Counts of word pairs (bigrams), triplets (trigrams), and so on
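
The three definitions above can be made concrete with a few lines of standard-library Python. This is a minimal sketch: the helper `word_ngrams` and the example sentence are hypothetical and only serve as illustration.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Return the list of n-grams (as space-joined strings) over a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, tokenized on whitespace.
tokens = "big red apple and big red bag".split()

for n in (1, 2, 3):
    bag = Counter(word_ngrams(tokens, n))  # bag = n-gram -> count
    print(f"{n}-gram bag:", dict(bag))
```

Note that the bag only keeps counts: the bi-gram "big red" appears twice, but where it appeared in the sentence is lost.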

> The bag-of-words/ bag-of-n-gram model is a reduced and simplified representation of a text document from selected parts of the text, based on specific criteria, such as word frequency.
> 
> In a BoW, a body of text, such as a document or a sentence, is thought of like a bag of words. Lists of words are created in the BoW process. These words in a matrix are not sentences which structure sentences and grammar, and the semantic relationship between these words are ignored in their collection and construction. The words are often representative of the content of a sentence. While grammar and order of appearance are ignored, multiplicity is counted and may be used later to determine the focus points of the documents.
> 
> Example:
> Document
> 
> “As the home to UVA’s recognized undergraduate and graduate degree programs in systems engineering. In the UVA Department of Systems and Information Engineering, our students are exposed to a wide range of range”
> 
> Bag-of-Words (BoW):
> {“As”, “the”, “home”, “to”, “UVA’s”, “recognized”, “undergraduate”, “and”, “graduate”, “degree”, “program”, “in”, “systems”, “engineering”, “in”, “Department”, “Information”,“students”, “ ”,“are”, “exposed”, “wide”, “range” }
> 
> Bag-of-Feature (BoF)
> Feature = {1,1,1,3,2,1,2,1,2,3,1,1,1,2,1,1,1,1,1,1}

(source: [Text Classification Algorithms: A Survey](https://arxiv.org/abs/1904.08067))

<a id="CountVectorizer"></a>
### Frequency Vectors - CountVectorizer:
We will implement the Bag of Words / Bag of n-grams text representation with sklearn's CountVectorizer.
The code first runs on a sample corpus made of the first five sentences of the dataset and prints the uni-gram, bi-gram, and tri-gram output. Finally, we run it on the whole dataset.

[Back To Table of Contents](#top_section)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def cv(data, ngram = 1, MAX_NB_WORDS = 75000):
    count_vectorizer = CountVectorizer(ngram_range = (ngram, ngram), max_features = MAX_NB_WORDS)
    emb = count_vectorizer.fit_transform(data).toarray()
    print("count vectorize with", str(np.array(emb).shape[1]), "features")
    return emb, count_vectorizer

In [None]:
def print_out(emb, feat, ngram, compared_sentence=0):
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0
    feature_names = feat.get_feature_names_out()
    print(ngram,"bag-of-words: ")
    print(feature_names, "\n")
    print(ngram,"bag-of-feature: ")
    print(feat.vocabulary_, "\n")  # use the vectorizer passed in, not a global one
    print("BoW matrix:")
    print(pd.DataFrame(emb.transpose(), index = feature_names).head(), "\n")
    print(ngram,"vector example:")
    print(train_df["lemmatize_text"][compared_sentence])
    print(emb[compared_sentence], "\n")

In [None]:
test_corpus = train_df["lemmatize_text"][:5].tolist()
print("The test corpus: ", test_corpus, "\n")

test_cv_em_1gram, test_cv_1gram = cv(test_corpus, ngram=1)
print_out(test_cv_em_1gram, test_cv_1gram, ngram="Uni-gram")

In [None]:
test_cv_em_2gram, test_cv_2gram = cv(test_corpus, ngram=2)
print_out(test_cv_em_2gram, test_cv_2gram, ngram="Bi-gram")

In [None]:
test_cv_em_3gram, test_cv_3gram = cv(test_corpus, ngram=3)
print_out(test_cv_em_3gram, test_cv_3gram, ngram="Tri-gram")

In [None]:
%%time

# implement into the whole dataset
train_df_corpus = train_df["lemmatize_text"].tolist()
train_df_em_1gram, vc_1gram = cv(train_df_corpus, 1)
train_df_em_2gram, vc_2gram = cv(train_df_corpus, 2)
train_df_em_3gram, vc_3gram = cv(train_df_corpus, 3)

print(len(train_df_corpus))
print(train_df_em_1gram.shape)
print(train_df_em_2gram.shape)
print(train_df_em_3gram.shape)

In [None]:
del train_df_em_1gram, train_df_em_2gram, train_df_em_3gram

<a id="TF_IDF"></a>

### Term Frequency-Inverse Document Frequency (TF-IDF):
> The Inverse Document Frequency (IDF) as a method to be used in conjunction with term frequency in order to lessen the effect of implicitly common words in the corpus. IDF assigns a higher weight to words with either high or low frequencies term in the document. This combination of TF and IDF is well known as Term Frequency-Inverse document frequency (TF-IDF). The mathematical representation of the weight of a term in a document by TF-IDF is given in Equation: 
> $$ W(d,t) = TF(d,t) \cdot \log\left(\frac{N}{df(t)}\right) $$
> Here N is the number of documents and $df(t)$ is the number of documents containing the term t in the corpus. The first term in the equation improves the recall while the second term improves the precision of the word embedding. Although TF-IDF tries to overcome the problem of common terms in the document, it still suffers from some other descriptive limitations. Namely, TF-IDF cannot account for the similarity between the words in the document since each word is independently presented as an index. However, with the development of more complex models in recent years, new methods, such as word embedding, have been presented that can incorporate concepts such as similarity of words and part of speech tagging.

(source: [Text Classification Algorithms: A Survey](https://arxiv.org/abs/1904.08067))
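
To see the quoted formula at work, here is a minimal hand computation in plain Python. The toy corpus is hypothetical, TF is taken as the raw count, and the logarithm is the natural log (both are assumptions, since the quote does not fix them). The numbers will not match sklearn's `TfidfVectorizer`, which by default smooths the IDF term and L2-normalizes each document vector.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized.
docs = [
    ["fire", "forest", "fire"],
    ["forest", "evacuation"],
    ["flood", "evacuation", "order"],
]

N = len(docs)
# df(t): number of documents that contain term t at least once
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """W(d, t) = TF(d, t) * log(N / df(t)), with TF as the raw count."""
    tf = Counter(doc)
    return {term: count * math.log(N / df[term]) for term, count in tf.items()}

for doc in docs:
    print({term: round(weight, 3) for term, weight in tfidf(doc).items()})
```

In the first document, "fire" gets a high weight because it is frequent there and absent elsewhere, while "forest" is discounted because it also appears in another document.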

We also implement TF-IDF with sklearn's TfidfVectorizer; the experiments mirror those in the previous [Frequency Vectors - CountVectorizer](#CountVectorizer) section.

[Back To Table of Contents](#top_section)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

def TFIDF(data, ngram = 1, MAX_NB_WORDS = 75000):
    tfidf_x = TfidfVectorizer(ngram_range = (ngram, ngram), max_features = MAX_NB_WORDS)
    emb = tfidf_x.fit_transform(data).toarray()
    print("tf-idf with", str(np.array(emb).shape[1]), "features")
    return emb, tfidf_x

In [None]:
test_corpus = train_df["lemmatize_text"][:5].tolist()
print("The test corpus: ", test_corpus, "\n")

test_tfidf_em_1gram, test_tfidf_1gram = TFIDF(test_corpus, ngram=1)
print_out(test_tfidf_em_1gram, test_tfidf_1gram, ngram="Uni-gram")

In [None]:
test_tfidf_em_2gram, test_tfidf_2gram = TFIDF(test_corpus, ngram=2)
print_out(test_tfidf_em_2gram, test_tfidf_2gram, ngram="Bi-gram")

In [None]:
test_tfidf_em_3gram, test_tfidf_3gram = TFIDF(test_corpus, ngram=3)
print_out(test_tfidf_em_3gram, test_tfidf_3gram, ngram="Tri-gram")

In [None]:
%%time

# implement into the whole dataset
train_df_corpus = train_df["lemmatize_text"].tolist()
train_df_tfidf_1gram, tfidf_1gram = TFIDF(train_df_corpus, 1)
train_df_tfidf_2gram, tfidf_2gram = TFIDF(train_df_corpus, 2)
train_df_tfidf_3gram, tfidf_3gram = TFIDF(train_df_corpus, 3)

print(len(train_df_corpus))
print(train_df_tfidf_1gram.shape)
print(train_df_tfidf_2gram.shape)
print(train_df_tfidf_3gram.shape)

In [None]:
del train_df_tfidf_1gram, train_df_tfidf_2gram, train_df_tfidf_3gram

<a id="Word_Embedding"></a>

## Word Embedding:

> **Word vectors** are numerical vector representations of word semantics, or meaning, including literal and implied meaning. So word vectors can capture the connotation of words, like “peopleness,” “animalness,” “placeness,” “thingness,” and even “conceptness.” And they combine all that into a dense vector (no zeros) of floating point values. This dense vector enables queries and logical reasoning.

(source: [Natural Language Processing in Action](https://www.manning.com/books/natural-language-processing-in-action))

> Even though we have syntactic word representations, it does not mean that the model captures the semantics meaning of the words. On the other hand, bag-of-word models do not respect the semantics of the word. For example, words “airplane”, “aeroplane”, “plane”, and “aircraft” are often used in the same context. However, the vectors corresponding to these words are orthogonal in the bag-of-words model. This issue presents a serious problem to understanding sentences within the model. The other problem in the bag-of-word is that the order of words in the phrase is not respected. The n-gram does not solve this problem so a similarity needs to be found for each word in the sentence. Many researchers worked on word embedding to solve this problem. The Word2Vec propose a simple single-layer architecture based on the inner product between two word vectors.

> Word embedding is a feature learning technique in which each word or phrase from the vocabulary is mapped to a N dimension vector of real numbers. Various word embedding methods have been proposed to translate unigrams into understandable input for machine learning algorithms. This work focuses on Word2Vec, GloVe, and FastText, three of the most common methods that have been successfully used for deep learning techniques.

(source: [Text Classification Algorithms: A Survey](https://arxiv.org/abs/1904.08067))
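
The orthogonality point in the quote can be checked numerically with cosine similarity. In this sketch the dense vectors are made up by hand purely for illustration (real embeddings are learned and typically have hundreds of dimensions), and the three-word vocabulary is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["airplane", "aircraft", "banana"]

# Bag-of-words view: each word is its own axis (one-hot), so distinct words are orthogonal.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))] for i, w in enumerate(vocab)}

# Dense view: made-up 3-dimensional vectors, chosen by hand for illustration only.
dense = {
    "airplane": [0.9, 0.8, 0.1],
    "aircraft": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

print("one-hot airplane vs aircraft:", cosine(one_hot["airplane"], one_hot["aircraft"]))
print("dense   airplane vs aircraft:", round(cosine(dense["airplane"], dense["aircraft"]), 3))
print("dense   airplane vs banana:  ", round(cosine(dense["airplane"], dense["banana"]), 3))
```

With one-hot vectors every pair of distinct words has similarity 0, whereas dense vectors can place related words close together.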

<a id="Basic_Word_Embedding"></a>
### Basic Word Embedding Methods:

<a id="Word2Vec"></a>
#### Word2Vec:

[T. Mikolov et al.](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) presented Word2vec in 2013. It learns the meaning of words merely by processing a large corpus of unlabeled text. The Word2Vec approach uses a shallow neural network with a single hidden (projection) layer, trained with one of two architectures, continuous bag-of-words (CBOW) or Skip-gram, to create a high-dimensional vector for each word. This unsupervised nature of Word2vec is what makes it so powerful: the world is full of unlabeled, uncategorized, unstructured natural language text.
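
The difference between the two training setups is easiest to see in the training examples they generate from raw text. The sketch below is a simplification under stated assumptions: the function names and the sentence are hypothetical, the window is fixed, and details of the real training procedure (such as subsampling of frequent words and negative sampling) are left out.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: (center word, one context word) pairs within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: (list of context words, center word) examples within the window."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = ["forest", "fire", "near", "la", "ronge"]  # hypothetical tokenized sentence
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1)[1])
```

Skip-gram predicts each context word from the center word, while CBOW predicts the center word from its surrounding context; no labels are needed in either case.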

We will use Word2vec via the gensim library with word vectors pre-trained on the Google News corpus (source: https://code.google.com/archive/p/word2vec/) and look at the embedding output for a sample sentence from our dataset.

[Back To Table of Contents](#top_section)

In [None]:
%%time

import gensim
print("gensim version:", gensim.__version__)

word2vec_path = "../input/googlenewsvectorsnegative300/GoogleNews-vectors-negative300.bin"

# we only load 200k most common words from Google News corpus 
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True, limit=200000) 

Compare the similarity between "cat" vs. "kitten" and "cat" vs. "cats"

In [None]:
print(word2vec_model.similarity('cat', 'kitten'))
print(word2vec_model.similarity('cat', 'cats'))

In [None]:
def get_average_vec(tokens_list, vector, generate_missing=False, k=300):
    """
        Calculate average embedding value of sentence from each word vector
    """
    
    if len(tokens_list)<1:
        return np.zeros(k)
    
    if generate_missing:
        vectorized = [vector[word] if word in vector else np.random.rand(k) for word in tokens_list]
    else:
        vectorized = [vector[word] if word in vector else np.zeros(k) for word in tokens_list]
    
    length = len(vectorized)
    summed = np.sum(vectorized, axis=0)
    averaged = np.divide(summed, length)
    return averaged

def get_embeddings(vectors, text, generate_missing=False, k=300):
    """
        create the sentence embedding
    """
    embeddings = text.apply(lambda x: get_average_vec(x, vectors, generate_missing=generate_missing, k=k))
    return list(embeddings)

In [None]:
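%%time

# average over the token lists; iterating over the joined string would yield single characters
embeddings_word2vec = get_embeddings(word2vec_model, train_df["lemmatize_word_w_pos"], k=300)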

print("Embedding matrix size", len(embeddings_word2vec), len(embeddings_word2vec[0]))
print("The sentence: \"%s\" got embedding values: " % train_df["lemmatize_text"][0])
print(embeddings_word2vec[0])

In [None]:
del embeddings_word2vec

<a id="GloVe"></a>

#### Global Vectors for Word Representation (GloVe):
> Another powerful word embedding technique that has been used for text classiﬁcation is [Global Vectors (GloVe)](https://nlp.stanford.edu/pubs/glove.pdf). The approach is very similar to the Word2Vec method, where each word is presented by a high dimension vector and trained based on the surrounding words over a huge corpus. The pre-trained word embedding used in many works is based on 400,000 vocabularies trained over Wikipedia 2014 and Gigaword 5 as the corpus and 50 dimensions for word presentation. GloVe also provides other pre-trained word vectorizations with 100, 200, 300 dimensions which are trained over even bigger corpora, including Twitter content.

(source: [Text Classification Algorithms: A Survey](https://arxiv.org/abs/1904.08067))

We will create our GloVe sentence embeddings via the gensim library with word vectors pre-trained on Wikipedia 2014 + Gigaword 5 (source: https://github.com/stanfordnlp/GloVe) and look at the embedding output for a sample sentence from our dataset.


[Back To Table of Contents](#top_section)

In [None]:
%%time

from gensim.scripts.glove2word2vec import glove2word2vec

glove_input_file = "../input/glove6b/glove.6B.300d.txt"
# name the converted file after the 300-dimensional input it is built from
word2vec_output_file = "glove.6B.300d.txt.word2vec"
glove2word2vec(glove_input_file, word2vec_output_file)

# we only load the first 200k words of the GloVe vocabulary
glove_model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False, limit=200000) 

Compare the similarity between "cat" vs. "kitten" and "cat" vs. "cats" from GloVe

In [None]:
print(glove_model.similarity('cat', 'kitten'))
print(glove_model.similarity('cat', 'cats'))

In [None]:
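%%time

# average over the token lists; iterating over the joined string would yield single characters
embeddings_glove = get_embeddings(glove_model, train_df["lemmatize_word_w_pos"], k=300)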

print("Embedding matrix size", len(embeddings_glove), len(embeddings_glove[0]))
print("The sentence: \"%s\" got embedding values: " % train_df["lemmatize_text"][0])
print(embeddings_glove[0])

In [None]:
del embeddings_glove

<a id="FastText"></a>

#### FastText:
> Many other word embedding representations ignore the morphology of words by assigning a distinct vector to each word ([Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606)). Facebook AI Research lab released a novel technique to solve this issue by introducing a new word embedding method called FastText. Each word, w, is represented as a bag of character n-gram. For example, given the word “introduce” and n = 3, FastText will produce the following representation composed of character tri-grams: `<in`, `int`, `ntr`, `tro`, `rod`, `odu`, `duc`, `uce`, `ce>`
> Note that the sequence `<int>`, corresponding to the word here is different from the tri-gram “int” from the word introduce.

(source: [Text Classification Algorithms: A Survey](https://arxiv.org/abs/1904.08067))
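
The character n-gram decomposition from the quote can be reproduced in a few lines. This is a minimal sketch with a hypothetical helper name; it only extracts n-grams of a single length, whereas the FastText paper extracts all n-grams for n between 3 and 6 and additionally keeps the whole word as its own sequence.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("introduce"))
print(char_ngrams("int"))
```

Because the boundary markers are part of the string, the tri-gram `int` inside "introduce" is distinct from the whole-word sequence `<int>`, and an unseen word can still be represented through the n-grams it shares with known words.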

We will create our FastText sentence embeddings via the gensim library with word vectors pre-trained on Wikipedia 2017, the UMBC webbase corpus, and the statmt.org news dataset (source: https://fasttext.cc/docs/en/english-vectors.html) and look at the embedding output for a sample sentence from our dataset.


[Back To Table of Contents](#top_section)

In [None]:
%%time

from gensim.models.fasttext import FastText

fasttext_path = "../input/fasttext-wikinews/wiki-news-300d-1M.vec"
fasttext_model = gensim.models.KeyedVectors.load_word2vec_format(fasttext_path, binary=False, limit=200000)

Compare the similarity between "cat" vs. "kitten" and "cat" vs. "cats" from FastText

In [None]:
print(fasttext_model.similarity('cat', 'kitten'))
print(fasttext_model.similarity('cat', 'cats'))

In [None]:
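# average over the token lists; iterating over the joined string would yield single characters
embeddings_fasttext = get_embeddings(fasttext_model, train_df["lemmatize_word_w_pos"], k=300)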

print("Embedding matrix size", len(embeddings_fasttext), len(embeddings_fasttext[0]))
print("The sentence: \"%s\" got embedding values: " % train_df["lemmatize_text"][0])
print(embeddings_fasttext[0])

In [None]:
del embeddings_fasttext

<a id="Advanced_methods"></a>

### Advanced Word Embedding Methods - Deep Contextualized Word Representations: 

<a id="BERT"></a>
#### Bidirectional Encoder Representations from Transformers (BERT):
> BERT is a deep learning model that has given state-of-the-art results on a wide variety of natural language processing tasks. It stands for Bidirectional Encoder Representations from Transformers. It has been pre-trained on Wikipedia and BooksCorpus and requires task-specific fine-tuning.

> Let's understand BERT by breaking down the abbreviation:
> * **Bidirectional**: BERT takes whole text passage as input and reads passage in both direction to understand the meaning of each word.
> * **Transformers**: BERT is based on a Deep Transformer network. Transformer network is a type of network that can process efficiently long texts by using attention. An attention is a mechanism to learn contextual relations between words (or sub-words) in a text.
> * **Encoder Representation**: Originally Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task, since BERT’s goal is to generate a language model only the encoder mechanism is necessary hence 'encoder representation'

> BERT is a multi-layer bidirectional Transformer encoder. There are two models introduced in the paper.
> * BERT base – 12 layers (transformer blocks), 12 attention heads, and 110 million parameters.
> * BERT Large – 24 layers, 16 attention heads and, 340 million parameters.


> How BERT performs Bidirectional training?
> 
> BERT uses following two prediction models simultaneously with the goal of minimizing the combined loss function of the two strategies:
> 
> * **Masked Language Model**: Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model then attempts to predict the original value of the masked words, based on the context provided by the other, non-masked, words in the sequence.
> * **Next Sentence Prediction**: The model receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document. During training, 50% of the inputs are a pair in which the second sentence is the subsequent sentence in the original document, while in the other 50% a random sentence from the corpus is chosen as the second sentence. The assumption is that the random sentence will be disconnected from the first sentence.
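
The masking step of the Masked Language Model objective described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name, the seed, and the example sentence are hypothetical, and every selected token is replaced with [MASK], whereas the original BERT recipe replaces only 80% of the selected tokens with [MASK], swaps 10% for a random token, and leaves 10% unchanged.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random ~15% of tokens with [MASK]; return masked tokens and targets."""
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # the model is trained to recover this token
        masked[pos] = "[MASK]"
    return masked, targets

# Hypothetical example sentence, tokenized on whitespace.
tokens = "the forest fire spread quickly across the dry hills near the small town tonight".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

The `targets` dictionary holds the original tokens at the masked positions, which is what the model has to predict from the surrounding unmasked context.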

Resources and further reading on BERT can be found in these great Kaggle notebooks and blogs:
* https://www.kaggle.com/abhinand05/bert-for-humans-tutorial-baseline-version-2
* https://www.kaggle.com/ratan123/in-depth-guide-to-google-s-bert
* https://www.kaggle.com/kksienc/comprehensive-nlp-tutorial-3-bert
* https://yashuseth.blog/2019/06/12/bert-explained-faqs-understand-bert-working/

We will prepare our sentences for the pre-trained BERT model (uncased), loaded via TensorFlow Hub (source: https://github.com/google-research/bert), and look at the encoded output for a sample sentence from our dataset. Note that we will use BERT's own tokenizer.

[Back To Table of Contents](#top_section)
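
Before running the real tokenizer, it may help to see the three arrays that BERT expects for a single sentence: token ids, a padding mask, and segment ids. The sketch below uses a toy vocabulary with hypothetical ids and pre-split tokens; the real ids come from the BERT vocabulary file, and the real tokenizer also splits words into word pieces.

```python
def encode_for_bert(tokens, vocab, max_len=8):
    """Build (token ids, padding mask, segment ids) for a single sentence."""
    tokens = tokens[:max_len - 2]                 # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + tokens + ["[SEP]"]
    pad_len = max_len - len(sequence)
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in sequence] + [vocab["[PAD]"]] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len    # 1 = real token, 0 = padding
    segments = [0] * max_len                      # single sentence: all segment 0
    return ids, mask, segments

# Toy vocabulary (hypothetical ids; the real vocabulary comes from the BERT vocab file).
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "forest": 4, "fire": 5}

ids, mask, segments = encode_for_bert(["forest", "fire", "tonight"], toy_vocab)
print(ids)       # [2, 4, 5, 1, 3, 0, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

The out-of-vocabulary word "tonight" maps to the [UNK] id, and the padding positions are flagged with 0 in the mask so the model can ignore them.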

In [None]:
%%time

import tensorflow_hub as hub

# download the tokenizer 
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
import tokenization

In [None]:
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

In [None]:
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)

In [None]:
%%time

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

bert_input = bert_encode(train_df["text"].values, tokenizer, max_len=300)

In [None]:
# bert_input is a tuple of three arrays: token ids, padding masks, and segment ids
print("Encoded input size", len(bert_input), len(bert_input[0]), len(bert_input[0][0]))
print("The sentence: \"%s\" got token ids: " % train_df["text"][0])
print(bert_input[0][0])

<a id="Comparison"></a>
## Comparison of Feature Extraction Techniques
Please refer to the table below for a comparison of the feature extraction techniques, with thanks to the authors of the paper [Text Classification Algorithms: A Survey](https://arxiv.org/abs/1904.08067) for their excellent work.

[Back To Table of Contents](#top_section)

| Model                               	| Advantages                                                                                                                                                                                                                                                                             	| Limitation                                                                                                                                                                                                                                                                                                                                                                  	|
|-------------------------------------	|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------	|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------	|
| Weighted Words | * Easy to compute<br>* Easy to compute the similarity between 2 documents using it<br>* Basic metric to extract the most descriptive terms in a document<br>* Works with an unknown word (e.g., New words in languages) | * It does not capture the position in the text (syntactic)<br>* It does not capture meaning in the text (semantics)<br>* Common words affect the results (e.g., “am”, “is”, etc.) |
| TF-IDF                              	| * Easy to compute<br>* Easy to compute the similarity between 2 documents using it<br>* Basic metric to extract the most descriptive terms in a document<br>* Common words do not affect the results due to IDF (e.g., “am”, “is”, etc.)                                               	| * It does not capture the position in the text (syntactic)<br>* It does not capture meaning in the text (semantics)                                                                                                                                                                                                                                                         	|
| Word2Vec | * It captures the position of the words in the text (syntactic)<br>* It captures meaning in the words (semantics) | * It cannot capture the meaning of the word from the text (fails to capture polysemy)<br>* It cannot handle words that are out of the corpus vocabulary |
| GloVe (Pre-Trained) | * It captures the position of the words in the text (syntactic)<br>* It captures meaning in the words (semantics)<br>* Trained on a huge corpus | * It cannot capture the meaning of the word from the text (fails to capture polysemy)<br>* Memory consumption for storage<br>* It cannot handle words that are out of the corpus vocabulary |
| GloVe (Trained) | * It is very straightforward, e.g., enforcing the word vectors to capture sub-linear relationships in the vector space (performs better than Word2Vec)<br>* Highly frequent word pairs, such as those with stop words like “am” and “is”, get a lower weight, so they do not dominate training | * Memory consumption for storage<br>* Needs a huge corpus to learn<br>* It cannot handle words that are out of the corpus vocabulary<br>* It cannot capture the meaning of the word from the text (fails to capture polysemy) |
| FastText | * Works for rare words (their character n-grams are still shared with other words)<br>* Handles out-of-vocabulary words through character-level n-grams | * It cannot capture the meaning of the word from the text (fails to capture polysemy)<br>* Memory consumption for storage<br>* Computationally more expensive than GloVe and Word2Vec |
| Contextualized Word Representations | * It captures the meaning of the word from the text (incorporates context, handling polysemy)<br>* Improves performance notably on downstream tasks | * Memory consumption for storage<br>* Computationally more expensive than the other methods<br>* Needs another word embedding for all LSTM and feedforward layers<br>* It cannot handle words that are out of the corpus vocabulary<br>* Works only at the sentence and document level (it cannot work at the individual word level) |

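To make the first two rows of the table concrete, here is a minimal pure-Python sketch of why common words dominate raw counts but not TF-IDF. The three toy sentences, the helper names (`idf`, `bow`, `tfidf`), and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` are assumptions chosen for illustration; other TF-IDF variants weight terms slightly differently.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "fire is spreading near the forest",
    "flood is rising near the river",
    "storm is over the city",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def bow(tokens):
    """Raw term counts: every occurrence weighs the same."""
    return dict(Counter(tokens))


def tfidf(tokens):
    """Term count scaled by IDF, so terms found in every document are damped."""
    return {term: count * idf(term) for term, count in Counter(tokens).items()}


bow_first = bow(tokenized[0])
tfidf_first = tfidf(tokenized[0])

for term in ("the", "near", "fire"):
    print(f"{term:>5}  count={bow_first[term]}  tf-idf={tfidf_first[term]:.3f}")
```

In the first sentence, "the", "near" and "fire" all have a raw count of 1, so a count-based representation cannot tell them apart. With TF-IDF, "the" (present in every document) gets the lowest weight, "near" (two documents) sits in the middle, and "fire" (one document) gets the highest.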
<a id="References"></a>

# References:
<a id="Paper"></a>
## Paper:
* [Text Classification Algorithms: A Survey](https://arxiv.org/abs/1904.08067)

<a id="Books"></a>
## Books:
* [Natural Language Processing in Action](https://www.manning.com/books/natural-language-processing-in-action)
* [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf)

<a id="Blogs_Notebooks"></a>
## Blogs/ Notebooks:
* https://www.kaggle.com/abhinand05/bert-for-humans-tutorial-baseline-version-2
* https://www.kaggle.com/amar09/text-pre-processing-and-feature-extraction 
* https://www.kaggle.com/ashishpatel26/beginner-to-intermediate-nlp-tutorial 
* https://www.kaggle.com/ashutosh3060/nlp-basic-feature-creation-and-preprocessing 
* https://www.kaggle.com/datafan07/disaster-tweets-nlp-eda-bert-with-transformers 
* https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert 
* https://www.kaggle.com/kksienc/comprehensive-nlp-tutorial-3-bert
* https://www.kaggle.com/liananapalkova/simply-about-word2vec 
* https://www.kaggle.com/ratan123/in-depth-guide-to-google-s-bert 
* https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing
* https://www.kaggle.com/vbmokin/nlp-eda-bag-of-words-tf-idf-glove-bert 
* https://www.kdnuggets.com/2018/08/practitioners-guide-processing-understanding-text-2.html 
* https://yashuseth.blog/2019/06/12/bert-explained-faqs-understand-bert-working/
* https://gist.github.com/MrEliptik/b3f16179aa2f530781ef8ca9a16499af 
* https://github.com/hundredblocks/concrete_NLP_tutorial/blob/master/NLP_notebook.ipynb 
* https://machinelearningmastery.com/gentle-introduction-bag-words-model/ 
* https://towardsdatascience.com/natural-language-processing-pipeline-93df02ecd03f


### I really appreciate your feedback; there are surely areas that can be fixed and improved.
## If you liked my work, please upvote!
[Back To Table of Contents](#top_section)