### What is NLP?

NLP is an interdisciplinary field concerned with the interactions between computers and human natural languages — speech or text. NLP-powered softwares help us in our daily lives in various ways, for example:

* Personal Assistants

* Auto-Complete

* Spell checking

* Machine Translation

There are generally 3 different types of NLP systems:

* Rule-based systems: domain-specific rules (e.g: regular expressions), can be used to solve simple problems such as extracting structured data (e.g: emails) from unstructured data (e.g: web-pages)

* Classical ML approaches: uses crafted features fed into a statistical ml model which learns patterns in the training set and applies them to previously unseen data. (e.g: Spam Detection)

* Deep Learning approaches: use feature extractors in an automatic way, allowing for models that provide end-to-end solutions, allowed hard nlp tasks to be completed (e.g: Machine Translation)

### Text Representation

one-hot encoding: where a sentence is represented as a matrix of shape (NxN) where N is the number of unique tokens in the sentence, for example in the above picture, each word is represented as a sparse vectors (mostly zeroes) except of one cell (could be one, or the number of occurrences of the word in the sentence)

![title](pics/1.png)

however, this approach has two major drawbacks: 
    
* memory capacity issues (sparse vectors)
* lack of meaning represenation (no similarities between words)

As a result a majority of modern nlp algorithms will use word2vec, a shallow deep learning approach which represents words as dense vectors, and allow capturing of semantic meaning.

Further research built upon word2vec to create models like GloVe and fastText.

![title](pics/2.png)

## Example: Classification of Text Messages

This example uses a text message spam dataset from Kaggle which can be found here: https://www.kaggle.com/uciml/sms-spam-collection-dataset/home

In [3]:
import pandas as pd
data = pd.read_csv("spam.csv", encoding = "latin-1")
data = data[['v1', 'v2']]
data = data.rename(columns = {'v1': 'label', 'v2': 'text'})

In [5]:
data.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [11]:
data['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

As you can see, this dataset is pretty unbalanced which we will have to deal with later...

### Data Cleaning

Compared to regular data cleaning there is a much heavier emphasis on text normalization because it makes it much easier to extract semantic meaning. Also is important because it reduces heavily the amount of computational power we need since the matrix is smaller.

Some methods include:

* Case normalization

* Removing stop words, NOTE: There is a lot of debate over when removing stop words is a good idea. This practice is used in many information retrieval tasks (such as search engine querying), but can be detrimental when syntactical understanding of language is required.

* Removing punctuations, special symbols

* Lemmatising/Stemming: reducing inflection forms to normalise words with the same lemma or base. (note that lemmatising differs from stemming in that it considers the context of the word)

![title](pics/3.png)

Other normalization techniques include error correction, converting words to their parts of speech or mapping to synonyms, many of the tools to do this are included in the nltk library

For this particular problem, only case normalization is used because stemmers are difficult to apply to colloquial english and removing stop words will reduce the size of already short texts even more

In [14]:
# normalize 
def review_messages(msg):
    # converting messages to lowercase
    msg = msg.lower()
    return msg

In [18]:
# for reference on how to do stemming and remove stopwords
from nltk import stem
from nltk.corpus import stopwords
# import nltk
# nltk.download('stopwords')
stemmer = stem.SnowballStemmer('english')
stopwords = set(stopwords.words('english'))

def alt_review_messages(msg):
    # converting messages to lowercase
    msg = msg.lower()
    # removing stopwords
    msg = [word for word in msg.split() if word not in stopwords]
    # using a stemmer
    msg = " ".join([stemmer.stem(word) for word in msg])
    return msg

In [19]:
data['text'] = data['text'].apply(review_messages)

### Vectorizing Text

The bag of words model discussed earlier, is too simple for this task so we will use the TF-IDF vectorizer

For review, this is how the TF-IDF vectorizer (Term Frequency — Inverse Document Frequency) works

TF-IDF vectorizes documents by calculating a TF-IDF statistic between the document and each term in the vocabulary. The document vector is constructed by using each statistic as an element in the vector.

![title](pics/5.png)

After settling with TF-IDF, we must decide the granularity of our vectorizer. A popular alternative to assigning each word as its own term is to use a tokenizer. A tokenizer splits documents into tokens (thus assigning each token to its own term) based on white space and special characters.

For example, the phrase what’s going on might be split into what, ‘s, going, on.

However since we are dealing with colloquial English with possibly URLs and emails, we are splitting by word.

### Train-test Split

In [24]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size = 0.1, random_state = 1)
# training the vectorizer 
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)

### Building Classifier

In [25]:
# train
from sklearn import svm
svm = svm.SVC(C=1000)
svm.fit(X_train, y_train)



SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [26]:
# test
from sklearn.metrics import confusion_matrix
X_test = vectorizer.transform(X_test)
y_pred = svm.predict(X_test)
print(confusion_matrix(y_test, y_pred))

[[490   0]
 [ 10  58]]


![title](pics/6.png)

In [31]:
from sklearn.metrics import roc_auc_score
y_score =svm.decision_function(X_test)
print(roc_auc_score(y_test, y_score))

0.9964285714285716


In [28]:
def pred(msg):
    msg = vectorizer.transform([msg])
    prediction = svm.predict(msg)
    return prediction[0]

data['pred'] = data['text'].apply(pred)

In [29]:
data.head()

Unnamed: 0,label,text,pred
0,ham,"go until jurong point, crazy.. available only ...",ham
1,ham,ok lar... joking wif u oni...,ham
2,spam,free entry in 2 a wkly comp to win fa cup fina...,spam
3,ham,u dun say so early hor... u c already then say...,ham
4,ham,"nah i don't think he goes to usf, he lives aro...",ham


## Word2vec

"Trained over large corpora, word2vec uses unsupervised learning to determine semantic and syntactic meaning from word co-occurrence, which is used to construct vector representations for every word in the vocabulary."

A more detailed look into how it works is included here: https://arxiv.org/pdf/1301.3781.pdf

The model uses a two layer shallow neural network to find the vector mappings for each word in the corpus. The neural network is used to predict known co-occurrences in the corpus and the weights of the hidden layer are used to create the word vectors.

![title](pics/7.png)

There are two model architectures used to train word2vec: Continuous Bag of Words and Skip Gram. These models determine how textual data is passed into the neural network.

Both of these architectures use a context window to determine contextually similar words. 

A context window with a fixed size n means that all words within n units from the target word belong to its context.
Consider the following example with a fixed window size of 2:

![title](pics/8.png)

Fox is our target word and quick, brown, jumped, over belong to the context of fox. The assumption is that with enough examples of contextual similarity, the network is able to learn the correct associations between words.
This assumption is in line with the distributional hypothesis which states that “words which are used and occur in the same contexts tend to purport similar meaning”

The implementation of context window in word2vec is dynamic.

A dynamic context window has a maximum window size. Context is sampled from the maximum window size with probability 1/d, where d is the distance between the word to the target.

Consider the target word fox using a dynamic context window with maximum window size of 2. (brown, jumped) have a 1/1 probability of being included in the context since they are one word away from fox. (quick, over) have a 1/2 probability of being included in the context since they are two words away from fox.

![title](pics/9.png)

### Continuous Bag of Words
We structure the data such that the context is used to predict the target word. For example, if our context is (quick, brown, jumped, over), we use that as features of the class fox.
### Skip Gram
We structure the data such that the target word is used to predict the context. For example, we use the feature (fox) to predict the context (quick, brown, jumped, over).

## Building the Neural Network
Word2vec trains a shallow neural network over data as structured using either Continuous Bag of Words or Skip Gram architecture. Instead of leveraging the model for predictive purposes, we use the hidden weights from the neural network to generate the word vectors.

Assuming a Continuous Bag of Words architecture with a fixed context window of 1 word, this is what the process would look like. First, the corpus.

##### very simple corpus


**I like math.**

**I like programming.**

**Today is Friday.**

**Today is a good day.**

To make things even easier, we can require our context window to only include words which proceeds the target. We can assume that the context of words at the end of a sentence is the first word of the next sentence. Under such rules:

* like is the context of target I

* math is the context of target like

* programming is also the context of target like

Even with such a simple corpus, we can begin to recognize some patterns. “Math” and “programming” are both context to “like”. While this might not be picked up by the model, both of these words can be understood as things that I like.

Steps:

### 1. One-hot encode
### 2. feed forward neural-network with one hidden layer and a output using a softmax activation function

The data set used to train the network uses the one hot encoded context vector to predict the one hot encoded target vector.
The number of neurons in the hidden layer corresponds to the number of dimensions in the final word vectors.

### 3. Obtain weights of hidden network
Each row in the weight matrix corresponds to the vector of each word in the vocabulary.

## Building your own word2vec

In [45]:
import wikipedia
import nltk 
from gensim.models import Word2Vec

chess = wikipedia.page("Chess").content

In [48]:
# split our document into sentences
sentences = nltk.sent_tokenize(chess) 

length = len(sentences)

stopwords = set(nltk.corpus.stopwords.words("english"))

for i in range(0, length): 
    
    # further tokenize our sentences
    temp = nltk.word_tokenize(sentences[i])
    
    # removing stop words, non-alpabetical tokens and converting to lower case 
    sentences[i] = [word.lower() for word in temp if word not in stopwords and word.isalpha()]    

print(len(sentences))
    
# size refers to the desired dimensionality of vectors 
# window is upper bound in dynamic context window
# sg: The training algorithm, either CBOW(0) or skip gram(1). The default training algorithm is CBOW.
model = Word2Vec(sentences, size=100, window=5, sg=0)

521


In [41]:
# Measures the similarity between words using cosine similarity 

model.wv.similarity("rook", "knight") 

0.48693413

In [42]:
# Finds the top n most similar words 

model.wv.similar_by_word("king",10)

[('chess', 0.7700669765472412),
 ('player', 0.7547798752784729),
 ('world', 0.753699541091919),
 ('players', 0.7380087375640869),
 ('example', 0.7331035733222961),
 ('rating', 0.7260173559188843),
 ('pawns', 0.716291069984436),
 ('moves', 0.7120612859725952),
 ('game', 0.7079309821128845),
 ('two', 0.6990262866020203)]

In [39]:
# Find word using vector addition. Does opponent + checkmate = lose? 

opponent_checkmate = model.wv["opponent"] + model.wv["checkmate"] # add the vectors for king and checkmate
model.wv.most_similar(positive = [opponent_checkmate], topn=10) # find most similar word to vector

[('opponent', 0.9239256978034973),
 ('checkmate', 0.8886814117431641),
 ('chess', 0.8274511098861694),
 ('pawns', 0.7916924357414246),
 ('player', 0.7825618982315063),
 ('game', 0.7560703158378601),
 ('pawn', 0.7550085186958313),
 ('world', 0.7383760809898376),
 ('rating', 0.7352045774459839),
 ('games', 0.7286425232887268)]

In [59]:
model.wv['king']

array([ 4.1064163e-04,  2.7849227e-03, -6.9686084e-04, -7.6698628e-04,
       -1.8941686e-03, -8.2237376e-03,  8.1017101e-04, -3.6570632e-03,
        5.4781986e-03, -4.7286502e-03, -6.1999485e-03, -2.3176360e-03,
        4.3832846e-03, -1.2438551e-04,  1.6759310e-03,  8.2134837e-03,
        7.0482595e-03, -4.8480411e-03, -1.2123109e-03, -1.0787294e-02,
        6.2058652e-03,  2.9616610e-03,  6.8436926e-03,  4.7568060e-03,
       -4.5343186e-03,  9.4778789e-04,  4.5285113e-03, -3.1439899e-03,
        8.4937172e-04, -2.2017551e-03, -6.4727659e-03,  2.2153649e-03,
       -4.6007079e-03,  4.4411896e-03, -9.0642720e-03,  7.5955037e-04,
       -7.1106087e-03,  4.8968929e-04, -4.3056291e-03, -7.3126878e-04,
       -1.7485697e-03, -3.7330389e-03, -5.8895075e-03, -4.4027744e-03,
        2.6723740e-03,  1.1313351e-02, -3.6093348e-03, -3.0519129e-04,
       -4.9488023e-03,  2.4059690e-03, -1.4644397e-03, -3.9526285e-03,
       -2.7676583e-03, -8.5032859e-04,  2.0749362e-03, -2.9554041e-03,
      

Try doing this using a larger corpus and see how it can work

After you have contructed a word2vec model you can use it in several ways to plug it into your standard ml models

Average of Word2Vec vectors : You can just take the average of all the word vectors in a sentence. This average vector will represent your sentence vector.

Average of Word2Vec vectors with TF-IDF : Just take the word vectors and multiply it with their TF-IDF scores. Just take the average and it will represent your sentence vector.