# Natural Language Processing

This notebook contains a collection of useful functions and code snippets for Natural Language Processing.

### Table of contents

1. [Text preprocessing](#Text-preprocessing)<br>
    1.1 [Word tokenization](#Word-tokenization)<br>
    1.2 [Removal of stop words](#Removal-of-stop-words)<br>
    1.3 [Stemming](#Stemming)<br>
    1.4 [Lemmatization](#Lemmatization)<br>
    1.5 [Preprocessing function](#Preprocessing-function)<br>
2. [Tokenization](#Tokenization)<br>
3. [Word embeddings](#Word-embeddings)<br>
    3.1 [GloVe](#GloVe)<br>
    3.2 [BERT](#BERT)<br>
    3.3 [Custom word embeddings](#Custom-word-embeddings)<br>

## 1. Text preprocessing<a id="Text-preprocessing"/>
Data preprocessing, or text preprocessing in the context of NLP, is the practice of transforming data into a more digestible format for a machine learning model. Preprocessing is an essential step of the NLP workflow and can have a significant impact on the performance of a model.

### 1.1 Word tokenization<a id="Word-tokenization"/>
Word tokenization is the process of splitting a larger piece of text into smaller parts called tokens, for instance a sentence into words. The output of tokenization is used as the input for many other preprocessing tasks.

In [1]:
from nltk.tokenize import word_tokenize

sentence = "Splitting the sentence into words."
words = word_tokenize(sentence)
words

['Splitting', 'the', 'sentence', 'into', 'words', '.']

### 1.2 Removal of stop words<a id="Removal-of-stop-words"/>
Stop words are a set of commonly used words such as "a", "an", "the", "in", etc. By convention, these words are generally removed, since they usually provide little added value in the context of most NLP projects.

In [2]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

sentence = "Removal of stop words"
tokens = word_tokenize(sentence)
stop_words = set(stopwords.words("english"))  # build the set once for fast lookups
tokens_without_stopwords = [word for word in tokens if word not in stop_words]
sentence = " ".join(tokens_without_stopwords)
sentence

'Removal stop words'
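
One detail worth keeping in mind: NLTK's English stop word list is all lowercase, so the membership test above is case-sensitive. The sketch below illustrates the effect with a small hand-picked stop word set (`STOP_WORDS` is a stand-in for illustration, not NLTK's list) and plain `str.split` instead of `word_tokenize`.

```python
# Hypothetical, hand-picked stop word set; NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "in", "of"}

def remove_stop_words(text, lowercase=True):
    """Drop stop words from whitespace-separated text."""
    tokens = text.split()
    if lowercase:
        # Compare the lowercased token so that "The" matches "the"
        kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    else:
        kept = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(kept)

print(remove_stop_words("The removal of stop words", lowercase=False))  # The removal stop words
print(remove_stop_words("The removal of stop words", lowercase=True))   # removal stop words
```

Lowercasing the text before filtering, as the preprocessing function in section 1.5 does, avoids the problem altogether.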

### 1.3 Stemming<a id="Stemming"/>
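Stemming is the act of reducing a word to its stem by stripping suffixes with heuristic rules. The resulting stem is not necessarily a dictionary word: in the example below, "reducing" becomes "reduc".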

In [3]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

sentence = "Stemming is the act of reducing a word to its stem."

stemmer = PorterStemmer()
tokens = word_tokenize(sentence)
stemmed_tokens = [stemmer.stem(word) for word in tokens]
sentence = (" ").join(stemmed_tokens)
sentence

'stem is the act of reduc a word to it stem .'

### 1.4 Lemmatization<a id="Lemmatization"/>
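Lemmatizing is the act of reducing a word to its dictionary root form (its lemma). Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed via the `pos` argument, which is why "reducing" is left unchanged in the output below.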

In [4]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

sentence = "Lemmatizing is the act of reducing a word to its dictionary root form."

lemmatizer = WordNetLemmatizer()
tokens = word_tokenize(sentence)
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
sentence = (" ").join(lemmatized_tokens)
sentence

'Lemmatizing is the act of reducing a word to it dictionary root form .'

### 1.5 Preprocessing function<a id="Preprocessing-function"/>
The function below chains the previous steps with regex-based cleaning (mentions, URLs, HTML tags, punctuation and numbers) into a single helper that can be applied to a whole column of text.

In [5]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re

sentence = "A short meaningless sentencce for the testing of the below preprocess_text function."

def preprocess_text(text):
    
    # Convert to lowercase
    sentence = text.lower()
    
    # Remove @mentions and email domains ("@" plus everything up to the next whitespace)
    sentence = re.sub(r"@[^\s]+", " ", sentence)
    
    # Remove URLs
    sentence = re.sub(r"((www\.[^\s]+)|(https?://[^\s]+))", " ", sentence)
    
    # Remove tags
    sentence = re.sub(r"<[^>]*>", " ", sentence)
    
    # Remove punctuation and numbers
    sentence = re.sub("[^a-zA-Z]", " ", sentence)
    
    # Remove single characters 
    sentence = re.sub(r"\s+[a-zA-Z]\s+", " ", sentence)
    
    # Remove multiple spaces
    sentence = re.sub(r"\s+", " ", sentence)
    
    # Remove stopwords (build the set once instead of reloading the list for every token)
    stop_words = set(stopwords.words("english"))
    tokens = word_tokenize(sentence)
    tokens_without_stopwords = [word for word in tokens if word not in stop_words]
    
    # Lemmatize words
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens_without_stopwords]
    sentence = " ".join(lemmatized_tokens)
    
    return sentence

sentence = preprocess_text(sentence)
sentence
# X = df["Text"].apply(lambda x: preprocess_text(x))

'short meaningless sentencce testing preprocess text function'
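
To see what the regular-expression steps do on their own, the sketch below applies the same patterns, in the same order, to a noisy input. The `clean_text` helper and the sample strings are made up for illustration; stop word removal and lemmatization are left out so that only the standard library is needed, and a final `strip()` is added to trim the edges.

```python
import re

def clean_text(text):
    """Apply only the regex steps of preprocess_text, in the same order."""
    text = text.lower()
    text = re.sub(r"@[^\s]+", " ", text)                            # "@" up to the next whitespace
    text = re.sub(r"((www\.[^\s]+)|(https?://[^\s]+))", " ", text)  # URLs
    text = re.sub(r"<[^>]*>", " ", text)                            # HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)                          # punctuation and digits
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)                     # single letters between spaces
    text = re.sub(r"\s+", " ", text)                                # runs of whitespace
    return text.strip()

sample = "<b>Hello</b> @user, visit https://example.com or call 555 1234 x 7!"
print(clean_text(sample))                    # hello visit or call

# The "@" pattern only removes the part from "@" onwards,
# so the local part of an email address survives.
print(clean_text("Mail john@example.com"))   # mail john
```

The order matters: URLs and tags have to be removed before the punctuation step, otherwise fragments such as "https" or "b" would be left behind as ordinary words.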

## 2. Tokenization<a id="Tokenization"/>
In this section, tokenization refers to converting textual data (sentences) into a numeric representation: each word is mapped to an integer index, so that every sentence becomes a sequence of integers.

In [6]:
from keras.preprocessing.text import Tokenizer

sentences = [
    "An arbitrary sentence",
    "Another arbitrary sentence"  
]

# Create an instance of the Tokenizer object
tokenizer = Tokenizer(num_words=100) # num_words sets the maximum number of words to keep

# Fit tokenizer on your sentences
tokenizer.fit_on_texts(sentences)

# Turn sentences into sequences of numbers
X = tokenizer.texts_to_sequences(sentences)
X

[[3, 1, 2], [4, 1, 2]]
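
The mapping behind this output is a word index: every distinct word gets an integer, more frequent words receive lower indices, and index 0 is never assigned (it is conventionally reserved for padding). The sketch below reproduces that idea in plain Python. The helpers `build_word_index` and `texts_to_sequences` are illustrative and are not the Keras implementation; in particular, they split on whitespace only and do not filter punctuation.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # most_common() sorts by count; words with equal counts keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

sentences = ["An arbitrary sentence", "Another arbitrary sentence"]
word_index = build_word_index(sentences)
print(word_index)                                  # {'arbitrary': 1, 'sentence': 2, 'an': 3, 'another': 4}
print(texts_to_sequences(sentences, word_index))   # [[3, 1, 2], [4, 1, 2]]
```

"arbitrary" and "sentence" appear twice and therefore get the lowest indices, which explains the `1, 2` suffix shared by both sequences.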

## 3. Word embeddings<a id="Word-embeddings"/>

Word embedding is the act of transforming words, which are essentially lists of characters, into a computer-understandable format, namely vector representations of words. Word embeddings capture semantic relations between words and result in denser, lower-dimensional representations than methods such as one-hot encoding.
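
The contrast with one-hot encoding can be made concrete with a toy example. The three-dimensional vectors below are invented for illustration and are not taken from any trained model; the point is only that one-hot vectors of different words are always orthogonal, whereas dense vectors can place related words close together.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["king", "queen", "apple"]

# One-hot: one dimension per vocabulary word, a single 1 per vector
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Hypothetical dense embeddings (made-up values, for illustration only)
dense = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

print(cosine_similarity(one_hot["king"], one_hot["queen"]))  # 0.0: no notion of similarity
print(cosine_similarity(dense["king"], dense["queen"]))      # close to 1
print(cosine_similarity(dense["king"], dense["apple"]))      # much lower
```

With a realistic vocabulary, a one-hot vector needs one dimension per word, while an embedding keeps a fixed, much smaller size regardless of vocabulary size.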

There are different ways of obtaining word embeddings: using pre-trained models such as GloVe or BERT, or training custom word embeddings, to name a few.

### 3.1 GloVe<a id="GloVe"/>
<a href="https://nlp.stanford.edu/projects/glove/">GloVe: Global Vectors for Word Representation</a> offers several pre-trained word embeddings, including embeddings trained specifically on tweets:
* Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased, 25d, 50d, 100d, & 200d vectors, 1.42 GB download): glove.twitter.27B.zip

First, we create a word embeddings dictionary in which the keys are the words and the values are the corresponding embedding vectors.

In [None]:
import numpy as np
from keras.preprocessing.text import Tokenizer

# Create empty dictionary
embeddings_dict = dict()

GLOVE_DIM = 200 # Dimensions of the word embeddings

# Filepath + file
file = "data/" + "glove.twitter.27B." + str(GLOVE_DIM) + "d.txt"

# Populate with words and embeddings
with open(file, "r", encoding="utf-8") as f:  # explicit encoding avoids platform-dependent defaults
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype="float32")
        embeddings_dict[word] = vector
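
The loop above relies on the GloVe text format: one word per line, followed by its vector components, all separated by spaces. The sketch below runs the same parsing logic on a tiny in-memory stand-in for the file, so it can be tried without the 1.42 GB download. The words and 3-dimensional values are invented, and plain Python lists are used in place of NumPy arrays.

```python
import io

# Invented 3-dimensional vectors in GloVe's text layout: "<word> <v1> <v2> <v3>"
fake_glove_file = io.StringIO(
    "hello 0.1 0.2 0.3\n"
    "world 0.4 0.5 0.6\n"
)

toy_embeddings = {}
for line in fake_glove_file:
    values = line.split()
    # First field is the word, the remaining fields are the vector components
    toy_embeddings[values[0]] = [float(v) for v in values[1:]]

print(toy_embeddings["world"])  # [0.4, 0.5, 0.6]
print(len(toy_embeddings))      # 2
```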

Once the GloVe embeddings have been loaded into a dictionary, the embeddings matrix can be created.

In [None]:
# Create embeddings matrix
# len(tokenizer.word_index) is the vocabulary size; the extra row accounts for index 0, which the tokenizer never assigns
embeddings_matrix = np.zeros((len(tokenizer.word_index) + 1, GLOVE_DIM))
for word, index in tokenizer.word_index.items():
    embeddings_vector = embeddings_dict.get(word)
    if embeddings_vector is not None:
        embeddings_matrix[index] = embeddings_vector
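
A toy run makes the layout of the matrix easier to see. The word index and 3-dimensional vectors below are invented for illustration; in the notebook they come from `tokenizer.word_index` and the GloVe file. Row 0 stays empty because the tokenizer never assigns index 0, and words without a pre-trained vector keep an all-zero row.

```python
import numpy as np

TOY_DIM = 3  # hypothetical embedding size, standing in for GLOVE_DIM

# Hypothetical stand-ins for tokenizer.word_index and embeddings_dict
toy_word_index = {"arbitrary": 1, "sentence": 2, "zzyzx": 3}
toy_vectors = {
    "arbitrary": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "sentence": np.array([0.4, 0.5, 0.6], dtype="float32"),
}

toy_matrix = np.zeros((len(toy_word_index) + 1, TOY_DIM))
for word, index in toy_word_index.items():
    vector = toy_vectors.get(word)
    if vector is not None:  # "zzyzx" has no vector, so row 3 stays zero
        toy_matrix[index] = vector

print(toy_matrix.shape)  # (4, 3)
print(toy_matrix)
```

Row `i` of the matrix holds the vector of the word whose index is `i`, which is the layout an embedding lookup by token index expects.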

### 3.2 BERT<a id="BERT"/>
Unlike the GloVe model, which is context-independent, the BERT model takes the context of each word into account, so the same word can receive different embeddings in different sentences.

### 3.3 Custom word embeddings<a id="Custom-word-embeddings"/>
...