<a href="https://colab.research.google.com/github/wayneczw/ntuoss-nlp-workshop/blob/master/ntuoss_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Introduction to NLP

**What is [NLP](https://towardsdatascience.com/an-easy-introduction-to-natural-language-processing-b1e2801291c1)?**

*Natural Language Processing (NLP) is a sub-field of Artificial Intelligence focused on enabling computers to understand and process human languages, bringing them closer to a human-level understanding of language. Computers don't yet have the same intuitive understanding of natural language that humans do: they can't grasp what a piece of text is really trying to say. In a nutshell, a computer can't read between the lines.*

**NLP Tasks**

* [Information Retrieval](https://nlp.stanford.edu/IR-book/information-retrieval-book.html)
* [Information Extraction](https://www.ontotext.com/knowledgehub/fundamentals/information-extraction/)
* [Document Classification](https://cloud.google.com/blog/products/gcp/problem-solving-with-ml-automatic-document-classification)
* [Summarization (Extractive/Abstractive)](https://towardsdatascience.com/a-quick-introduction-to-text-summarization-in-machine-learning-3d27ccf18a9f)
* [Question Answering](https://medium.com/lingvo-masino/question-and-answering-in-natural-language-processing-part-i-168f00291856)
* [Machine Translation](https://www.andovar.com/machine-translation/)
* **Sentiment Analysis**
* [Parsing](http://stp.lingfil.uu.se/~nivre/master/NLP-Parsing.pdf)
* [Part of Speech (POS) Tagging](https://medium.freecodecamp.org/an-introduction-to-part-of-speech-tagging-and-the-hidden-markov-model-953d45338f24)
* [Word Sense Disambiguation (WSD)](https://web.stanford.edu/~jurafsky/slp3/slides/Chapter18.wsd.pdf)
* [Named Entity Recognition (NER)](https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da)
* [Topic Modeling](https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/)
* and more....

##Agenda


*   Basic understanding of train/validation/test datasets and model selection.
*   Basic techniques to preprocess/clean the text data.
*   Brief introduction to commonly used text embeddings.
*   Brief introduction to commonly used scoring metrics.
*   Simple sentiment analysis toy experiment.
*   Brief introduction to Universal Sentence Encoder.






#Task 1 - Prepare Data

##Task 1.1 - Load In Data

In [0]:
'''
These data are adopted from
https://github.com/Seh83/ML_Sentiment_Label_Model/tree/master/data
'''

from google.colab import files
uploaded = files.upload()

In [0]:
!ls

In [0]:
with open("imdb_labelled.txt", "r") as f:
    str_data = f.read().split("\n")

with open("amazon_cells_labelled.txt", "r") as f:
    str_data += f.read().split("\n")

with open("yelp_labelled.txt", "r") as f:
    str_data += f.read().split("\n")

Let's see what the raw text looks like:

In [0]:
print(str_data[0])
print(type(str_data[0]))

Split each line into a [sentence, label] pair, dropping lines that are malformed or have an empty label:

In [0]:
rows = (line.split("\t") for line in str_data)
data = [row for row in rows if len(row) == 2 and row[1]]
print(data[0])
print(type(data[0]))

Split the data into X and Y:

In [0]:
X = [line[0] for line in data]
# Cast the labels from str to int so they can be used directly as numeric targets.
Y = [int(line[1]) for line in data]
print("X: {}".format(X[0]))
print("Y: {}".format(Y[0]))

##Task 1.2 - Split Data

### Train-Test Split
**Train set**: 
Train the model.

**Test set**:
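Assess the performance of the final model on unseen data. (Model selection is best done on the validation set, so that the test score stays unbiased.)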

**Validation set (sometimes optional but usually strongly encouraged)**: 
Model selection + Early Stopping

The usual train/validation/test ratios are as follows:
- *raw data* -> 80% Train + 20% Test OR 70% Train + 30% Test
- *Train* -> 80% Train + 20% Validation OR 90% Train + 10% Validation
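As a rough worked example of the ratios above, the sketch below computes the resulting set sizes for a two-stage 80/20 split. The helper name `split_sizes` and the dataset size of 3,000 are illustrative assumptions, not part of the workshop code.

```python
def split_sizes(n_samples, test_ratio=0.2, validation_ratio=0.2):
    """Return (train, validation, test) sizes for a two-stage split."""
    # Stage 1: carve the test set out of the raw data.
    n_test = round(n_samples * test_ratio)
    n_remaining = n_samples - n_test
    # Stage 2: carve the validation set out of what is left.
    n_validation = round(n_remaining * validation_ratio)
    n_train = n_remaining - n_validation
    return n_train, n_validation, n_test

train, validation, test = split_sizes(3000)
print(train, validation, test)  # 1920 480 600
```

With two 80/20 splits, the model is therefore trained on 64% of the raw data, validated on 16% and tested on 20%.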

### Model Selection

**Overfitting**:
When a model is fitted too closely to the train set, it may lose its ability to generalize, so an overfitted model is unlikely to make accurate predictions on unseen data.

**Underfitting**:
When a model fails to capture the pattern in the train set, it is very unlikely to do well in predicting unseen data.

<!-- Models are trained on data, and data is supposedly a good representation of the *true population*. However, more than often, the data available for training are not good enough due to:
- insufficient data
- unbalance data mix
- etc

Due to the inevitable defect in the data itself, we need to apply some techniques to select the best model among different models: -->

Common methods:
- **K-fold Cross Validation**
- Leave-One-Out (LOO) Cross Validation
- K Data Splits Random SubSampling
- Three-Way Data Splits Method

Want to know more: https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
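To make K-fold cross validation more concrete, here is a minimal pure-Python sketch of how the folds can be formed. The function `k_fold_indices` is a hypothetical helper written for illustration; in practice scikit-learn's `KFold` does the same job and can also shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, 5)):
    print(fold, train_idx, val_idx)
```

Each sample lands in the validation fold exactly once, so the model is trained and scored k times and the k scores are averaged.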

In [0]:
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(7)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

print(len(X_train))
print(len(X_test))

##Task 1.3 - Preprocess Data

In [0]:
print(X_train[1])

Also, the fries are without a doubt the worst fries I've ever had.


##Preprocessing Techniques

**Regex**:  a tool to clean up unwanted text.

Eg1 re.sub(r'[^\w\s]', '', text), which removes every character that is neither a word character nor whitespace; 'This is a !@#!$ token.' -> 'This is a  token'
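A quick check of how two similar character classes behave can be useful here. The sketch below uses only Python's standard `re` module; the variable names are illustrative.

```python
import re

text = "This is a !@#!$ token."

# [^\w] matches every non-word character, including spaces.
no_spaces_left = re.sub(r"[^\w]", "", text)
print(no_spaces_left)   # Thisisatoken

# [^\w\s] keeps whitespace, so word boundaries survive.
no_punctuation = re.sub(r"[^\w\s]", "", text)
print(no_punctuation)   # This is a  token

# Collapse the leftover run of spaces into one.
cleaned = re.sub(r"\s+", " ", no_punctuation).strip()
print(cleaned)          # This is a token
```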

**Stopwords**: words that are filtered out before or after processing natural language data (text). They are usually very common words such as *I, we, you, and, this*.

**Stemming**: stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form.

Eg1 ideas -> idea

Eg2 policy -> polic, police -> polic (the exact output depends on the stemmer, but unrelated words can end up sharing a stem)

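**Lemmatization**: the algorithmic process of determining the lemma (dictionary form) of a given word. Lemmatization is very similar to stemming, but stemming is cruder: it strips affixes by rule and may produce non-words, whereas lemmatization uses a vocabulary and morphological analysis to return a real word, e.g. *better* -> *good*.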



In [0]:
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download('stopwords')
nltk.download('snowball_data')

stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def pre_process(text):
    if not isinstance(text, str): text = str(text)

    # Replace punctuation with spaces, then collapse whitespace and lowercase.
    z = re.sub(r'[^\w\s]', ' ', text)
    z = re.sub(r'\s+', ' ', z).strip().lower()
    # Drop stopwords and stem the remaining tokens.
    return ' '.join(stemmer.stem(token) for token in z.split() if token not in stop_words)
#end def

X_train_processed = [pre_process(x) for x in X_train]
X_test_processed = [pre_process(x) for x in X_test]

print(X_train_processed[1])

#Task 2 - Train



##Common Text Embeddings

Text is unstructured data, so a computer cannot work with it directly. We therefore need to transform the text into a numerical vector/matrix.


**CountVectorizer**: aka Bag of Words/N-gram. Each document becomes a vector of counts, with one entry per token (or n-gram) in the vocabulary.



**TfidfVectorizer**: Term Frequency-Inverse Document Frequency. This technique also takes document-level frequency into account, and down-weights tokens that appear in many different documents.
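The two embeddings above can be worked through by hand on a tiny corpus. The sketch below is a simplified illustration using the textbook formula `count * log(N / df)`; the three documents and the helper `tf_idf` are made up for this example, and scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and normalizes each row, so its numbers will differ.

```python
import math
from collections import Counter

docs = ["good food good service", "bad food", "good movie"]  # hypothetical toy corpus
tokenized = [doc.split() for doc in docs]

# Bag of words: one count per vocabulary term per document.
vocabulary = sorted({token for tokens in tokenized for token in tokens})
counts = [Counter(tokens) for tokens in tokenized]
bow = [[c[term] for term in vocabulary] for c in counts]

# Document frequency: in how many documents each term appears.
df = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

def tf_idf(term, doc_index):
    """Textbook TF-IDF: raw count times log(N / df)."""
    return counts[doc_index][term] * math.log(len(docs) / df[term])

print(vocabulary)  # ['bad', 'food', 'good', 'movie', 'service']
print(bow)         # [[0, 1, 2, 0, 1], [1, 1, 0, 0, 0], [0, 0, 1, 1, 0]]
print(tf_idf("good", 0), tf_idf("service", 0))
```

In the first document, "good" occurs twice but also appears in two of the three documents, so its weight (about 0.81) ends up lower than that of "service" (about 1.10), which occurs once but only in this document.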

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

count = CountVectorizer(ngram_range=(1, 2))
tfidf = TfidfVectorizer(ngram_range=(1, 2))

In [0]:
X_train_count = count.fit_transform(X_train_processed)
X_test_count = count.transform(X_test_processed)

X_train_tfidf = tfidf.fit_transform(X_train_processed)
X_test_tfidf = tfidf.transform(X_test_processed)

In [0]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import f1_score

# Using CountVectorizer
classifier = LogisticRegressionCV(cv=5, random_state=0, multi_class='ovr')
classifier.fit(X_train_count, Y_train)

count_predicts = classifier.predict(X_test_count)

# Using TfidfVectorizer
classifier.fit(X_train_tfidf, Y_train)
tfidf_predicts = classifier.predict(X_test_tfidf)


**Logistic regression** is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).

**LogisticRegressionCV** is scikit-learn's logistic regression classifier with built-in **Cross Validation**, which it uses to pick the regularization strength (`C`) automatically. Here `cv=5` means 5-fold cross validation.
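For intuition, logistic regression passes a weighted sum of the features through the sigmoid function to obtain the probability of the positive class. The sketch below is a minimal illustration; the weights, bias and the two "word count" features are hypothetical values, not ones learned by the classifier above.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights for two features, e.g. counts of "good" and "bad".
weights, bias = [1.5, -2.0], 0.0
p = predict_proba([2, 0], weights, bias)  # a review mentioning "good" twice
label = 1 if p >= 0.5 else 0
print(round(p, 3), label)  # 0.953 1
```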

#Task 3 - Evaluation

In [0]:
from sklearn.metrics import classification_report

print("Classification Report for CountVectorizer:\n{}".format(classification_report(Y_test, count_predicts)))

print("Classification Report for TfidfVectorizer:\n{}".format(classification_report(Y_test, tfidf_predicts)))

**Precision** is the number of True Positives divided by the number of True Positives and False Positives. Put another way, it is the number of correct positive predictions divided by the total number of positive predictions.

**Recall** is the number of True Positives divided by the number of True Positives and False Negatives. Put another way, it is the number of correct positive predictions divided by the number of positive class values in the test data.

**F1 Score** is **2 \* (precision \* recall) / (precision + recall)**, i.e. the harmonic mean of precision and recall. It is also called the F Score or the F Measure. Put another way, the F1 score conveys the balance between the precision and the recall.

Want to know more: https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/
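The three metrics can be computed by hand from the counts of true positives, false positives and false negatives. The sketch below assumes binary labels; the helper `precision_recall_f1` and the toy labels are illustrative only, and `classification_report` reports the same quantities per class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.5, 0.6)
```

Here 3 of the 4 positive predictions are correct (precision 0.75), but only 3 of the 6 actual positives are found (recall 0.5), and the F1 score of 0.6 sits between the two, closer to the lower one.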

# Task 4 - USE Embedding with Neural Network

In [0]:
!pip3 install --quiet tensorflow-hub
!pip3 install keras
import tensorflow as tf
import tensorflow_hub as hub
from keras.layers import Dense
from keras.layers import Input
from keras.layers import Lambda
from keras.models import Model
from keras import backend as K



##Task 4.1 - Introduction to USE

The [Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder/1) encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.

There are many other encoders:
- [ELMo](https://tfhub.dev/google/elmo/1)
- [GloVe](https://nlp.stanford.edu/projects/glove/)
- [Word2Vec](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)
- [Doc2Vec](https://medium.com/scaleabout/a-gentle-introduction-to-doc2vec-db3e8c0cce5e)
- etc
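Semantic similarity between two sentence embeddings is commonly measured with cosine similarity. The sketch below shows the computation on hypothetical 3-dimensional vectors chosen by hand for illustration; real USE embeddings have 512 dimensions and come from the encoder itself.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings" for three sentences.
great_food = [0.9, 0.1, 0.2]
tasty_meal = [0.8, 0.2, 0.1]
slow_phone = [0.1, 0.9, 0.3]

similar = cosine_similarity(great_food, tasty_meal)
dissimilar = cosine_similarity(great_food, slow_phone)
print(similar > dissimilar)  # True
```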

In [0]:
'''The code in this cell is adapted, with slight modifications, from 
https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb
'''

module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
embed = hub.Module(module_url)

# Reduce logging output.
tf.logging.set_verbosity(tf.logging.ERROR)

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    X_train_use = session.run(embed(X_train))

    for i, embedding in enumerate(np.array(X_train_use).tolist()):
        print("Original: {}".format(X_train[i]))
        print("Embedding size: {}".format(len(embedding)))
        embedding_snippet = ", ".join(
            (str(x) for x in embedding[:3]))    
        print("Embedding: [{}, ...]\n".format(embedding_snippet))
        
        if i == 5: break

## Task 4.2 - Build a NN Model with USE Embedding

In [0]:
X_train, X_validation, Y_train, Y_validation = train_test_split(X_train, Y_train, test_size=0.2)

In [0]:
def _batch_iter(X, Y, batch_size=32, **kwargs):
    data_size = len(Y)
    num_batches_per_epoch = int((data_size - 1) / batch_size) + 1

    def data_generator():
        while True:
            # Shuffle the data at each epoch
            # (np.int was removed from recent NumPy releases; permutation(n) permutes range(n).)
            shuffled_indices = np.random.permutation(data_size)

            for batch_num in range(num_batches_per_epoch):
                start_index = batch_num * batch_size
                end_index = min((batch_num + 1) * batch_size, data_size)
                X_batch = [X[i] for i in shuffled_indices[start_index:end_index]]
                Y_batch = [Y[i] for i in shuffled_indices[start_index:end_index]]

                yield ({'x_input': np.asarray(X_batch)}, {'output': np.asarray(Y_batch)})
            #end for
        #end while
    #end def

    return num_batches_per_epoch, data_generator()
#end def

train_steps, train_batches = _batch_iter(X_train, Y_train)
validate_steps, validate_batches = _batch_iter(X_validation, Y_validation)

In [0]:
USE_MODULE_URL = "https://tfhub.dev/google/universal-sentence-encoder/2"
USE_EMBED = hub.Module(USE_MODULE_URL, trainable=True)


def USE_Embedding(x):
    return USE_EMBED(tf.squeeze(tf.cast(x, tf.string)), signature="default", as_dict=True)["default"]
#end def


# Initialize session
with tf.Session() as session:
    K.set_session(session)
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    tf.logging.set_verbosity(tf.logging.ERROR)

    x_input = Input(shape=(1,), dtype=tf.string, name='x_input')
    x_embed = Lambda(USE_Embedding, output_shape=(512,))(x_input)
    x = Dense(256, activation='relu')(x_embed)
    output = Dense(1, activation='sigmoid', name='output')(x)

    model = Model(inputs=[x_input], outputs=output)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    model.fit_generator(
            epochs=10,
            generator=train_batches,
            steps_per_epoch=train_steps,
            validation_data=validate_batches,
            validation_steps=validate_steps)
    
    X_test = np.array(X_test, dtype=object)
    Y_test = np.array(Y_test, dtype=int)
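    # Threshold = share of positive labels in the test set. Note that deriving it
    # from test labels leaks test information; a fixed 0.5 or a value tuned on the
    # validation set is the safer choice.
    threshold = Y_test.mean()
    use_predicts = model.predict(X_test).ravel()  # flatten (n, 1) -> (n,)
    use_predicts = [1 if p > threshold else 0 for p in use_predicts]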
    print("Classification Report for USE:\n{}".format(classification_report(Y_test, use_predicts)))
