<a href="https://colab.research.google.com/github/wayneczw/ntuoss-nlp-workshop/blob/master/ntuoss_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Task 1 - Prepare Data

#### Task 1.1 - Load In Data

In [5]:
'''
This data is adapted from
https://github.com/Seh83/ML_Sentiment_Label_Model/tree/master/data
'''

from google.colab import files
uploaded = files.upload()

Saving amazon_cells_labelled.txt to amazon_cells_labelled.txt
Saving imdb_labelled.txt to imdb_labelled.txt
Saving yelp_labelled.txt to yelp_labelled.txt


In [6]:
!ls

amazon_cells_labelled.txt  imdb_labelled.txt  sample_data  yelp_labelled.txt


In [39]:
with open("imdb_labelled.txt", "r") as f:
    str_data = f.read().split("\n")

with open("amazon_cells_labelled.txt", "r") as f:
    str_data += f.read().split("\n")

with open("yelp_labelled.txt", "r") as f:
    str_data += f.read().split("\n")

print(str_data[0])
print(type(str_data[0]))

A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  	0
<class 'str'>


In [40]:
data = [parts for parts in (line.split("\t") for line in str_data) if len(parts) == 2 and parts[1]]
print(data[0])
print(type(data[0]))

['A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  ', '0']
<class 'list'>
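
The filtering step above can be sketched on a few hand-written lines. This is only an illustration: `sample_lines` and `parse_labelled_lines` are hypothetical names, not part of the notebook, and the sketch converts labels to `int`, whereas the notebook keeps them as strings at this point.

```python
# Hypothetical sample lines in the same "sentence<TAB>label" layout as the data files.
sample_lines = [
    "Good case, excellent value.\t1",
    "The food was terrible.\t0",
    "",                       # blank line, e.g. left over after the final newline
    "No label on this line",  # no tab separator, so it is skipped
]

def parse_labelled_lines(lines):
    """Return (sentence, label) pairs, skipping lines that are not 'text<TAB>label'."""
    pairs = []
    for line in lines:
        parts = line.split("\t")
        # Keep only lines with exactly two fields and a non-empty label.
        if len(parts) == 2 and parts[1]:
            pairs.append((parts[0], int(parts[1])))
    return pairs

pairs = parse_labelled_lines(sample_lines)
print(pairs)
```

Only the two well-formed lines survive; the blank line and the line without a tab are dropped by the same condition the notebook uses.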


In [41]:
X = [line[0] for line in data]
Y = [line[1] for line in data]
print(X[0])
print(Y[0])

A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  
0


In [42]:
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(7)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

print(len(X_train))
print(len(X_test))

2400
600


#### Task 1.2 - Preprocess Data

In [43]:
print(X_train[1])

Also, the fries are without a doubt the worst fries I've ever had.


In [44]:
import nltk
import re
nltk.download('stopwords')
nltk.download('snowball_data')
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
stemmer = SnowballStemmer("english")

def pre_process(text):
    if not isinstance(text, str):
        text = str(text)

    # Replace punctuation with spaces, collapse whitespace, then trim and lowercase.
    z = re.sub(r'[^\w\s]', ' ', text)
    z = re.sub(r'\s+', ' ', z)
    z = z.strip().lower()
    # Drop stop words, then stem the remaining tokens.
    return ' '.join(stemmer.stem(token) for token in z.split() if token not in stop_words)
#end def

X_train_processed = [pre_process(x) for x in X_train]
X_test_processed = [pre_process(x) for x in X_test]

print(X_train_processed[1])

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package snowball_data to /root/nltk_data...
[nltk_data]   Package snowball_data is already up-to-date!
also fri without doubt worst fri ever
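
To see what the cleaning steps do before stemming, here is a minimal sketch that uses only the standard library. It assumes a tiny hypothetical stop-word set (`TOY_STOP_WORDS`) instead of NLTK's full English list, and it leaves out the Snowball stemmer, so `fries` is not reduced to `fri`.

```python
import re

# Hypothetical, tiny stop-word set for illustration only;
# the notebook uses NLTK's full English stop-word list.
TOY_STOP_WORDS = {"the", "are", "a", "i", "ve", "had"}

def clean_text(text):
    # Replace every character that is not a word character or whitespace with a space.
    z = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace, trim both ends, and lowercase.
    z = re.sub(r"\s+", " ", z).strip().lower()
    # Keep only the tokens that are not stop words.
    return [token for token in z.split() if token not in TOY_STOP_WORDS]

tokens = clean_text("Also, the fries are without a doubt the worst fries I've ever had.")
print(tokens)
```

Note that the apostrophe in `I've` becomes a space, so the contraction splits into `i` and `ve`, both of which are then filtered out as stop words.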


### Task 2 - Train



In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

count = CountVectorizer()
tfidf = TfidfVectorizer(ngram_range=(1, 2))

In [0]:
X_train_count = count.fit_transform(X_train_processed)
X_test_count = count.transform(X_test_processed)

X_train_tfidf = tfidf.fit_transform(X_train_processed)
X_test_tfidf = tfidf.transform(X_test_processed)
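
The difference between the two feature sets can be illustrated by hand on a toy corpus. The sketch below is not scikit-learn's implementation: it assumes unigrams only (the notebook's `TfidfVectorizer` also uses bigrams), the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 row normalisation, which match scikit-learn's documented defaults. The three sentences in `docs` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical pre-processed sentences.
docs = ["good price", "worst fri ever", "good fri"]
tokenized = [d.split() for d in docs]
vocab = sorted({token for doc in tokenized for token in doc})

# Count features: one raw term count per vocabulary entry.
counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Smoothed inverse document frequency: rarer terms get larger weights.
n_docs = len(docs)
df = {term: sum(term in doc for doc in tokenized) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Weight raw counts by IDF, then scale the row to unit L2 length."""
    c = Counter(doc)
    weights = [c[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf_rows = [tfidf_row(doc) for doc in tokenized]

print(vocab)
print(counts[0])
print([round(w, 3) for w in tfidf_rows[0]])
```

In the first sentence, `good` and `price` both have a raw count of 1, but `price` appears in fewer documents and therefore receives the larger TF-IDF weight.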

In [47]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import f1_score

# Using CountVectorizer
classifier = LogisticRegressionCV(cv=5, random_state=0, multi_class='ovr')
classifier.fit(X_train_count, Y_train)

count_predicts = classifier.predict(X_test_count)

# Using TfidfVectorizer
classifier.fit(X_train_tfidf, Y_train)
tfidf_predicts = classifier.predict(X_test_tfidf)




### Task 3 - Evaluation

In [48]:
from sklearn.metrics import classification_report

print("Classification Report for CountVectorizer:\n{}".format(classification_report(Y_test, count_predicts)))

print("Classification Report for TfidfVectorizer:\n{}".format(classification_report(Y_test, tfidf_predicts)))

Classification Report for CountVectorizer:
              precision    recall  f1-score   support

           0       0.82      0.78      0.80       299
           1       0.79      0.83      0.81       301

   micro avg       0.81      0.81      0.81       600
   macro avg       0.81      0.81      0.81       600
weighted avg       0.81      0.81      0.81       600

Classification Report for TfidfVectorizer:
              precision    recall  f1-score   support

           0       0.85      0.79      0.82       299
           1       0.80      0.86      0.83       301

   micro avg       0.82      0.82      0.82       600
   macro avg       0.83      0.82      0.82       600
weighted avg       0.83      0.82      0.82       600
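
The per-class columns in these reports follow from simple counts. As a rough sketch (the helper `binary_metrics` and the six toy labels are hypothetical, not taken from the notebook's data), precision, recall, and F1 for one class can be computed like this:

```python
def binary_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels stored as strings, as in the notebook at this point.
y_true = ["1", "1", "1", "0", "0", "0"]
y_pred = ["1", "1", "0", "0", "0", "1"]

precision, recall, f1 = binary_metrics(y_true, y_pred, positive="1")
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The `support` column is simply the number of true examples of each class, and the `macro avg` row is the unweighted mean of the per-class values.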



### Task 4 - USE Embedding with Neural Network

In [49]:
!pip3 install --quiet tensorflow-hub
!pip3 install keras
import tensorflow as tf
import tensorflow_hub as hub
from keras.layers import Dense
from keras.layers import Input
from keras.layers import Lambda
from keras.models import Model
from keras import backend as K





#### Task 4.1 - Introduction to USE

In [18]:
'''The code in this cell is adapted, with slight modifications, from
https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder.ipynb
'''

module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
embed = hub.Module(module_url)

# Reduce logging output.
tf.logging.set_verbosity(tf.logging.ERROR)

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    X_train_use = session.run(embed(X_train))

    for i, embedding in enumerate(np.array(X_train_use).tolist()):
        print("Original: {}".format(X_train[i]))
        print("Embedding size: {}".format(len(embedding)))
        embedding_snippet = ", ".join(
            (str(x) for x in embedding[:3]))    
        print("Embedding: [{}, ...]\n".format(embedding_snippet))
        
        if i == 5: break

INFO:tensorflow:Using /tmp/tfhub_modules to cache modules.
INFO:tensorflow:Downloading TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder-large/3'.
INFO:tensorflow:Downloaded https://tfhub.dev/google/universal-sentence-encoder-large/3, Total size: 810.60MB
INFO:tensorflow:Downloaded TF-Hub Module 'https://tfhub.dev/google/universal-sentence-encoder-large/3'.
Original: want clip go top ear caus discomfort
Embedding size: 512
Embedding: [0.04672318324446678, -0.0001613523781998083, -0.020706530660390854, ...]

Original: also fri without doubt worst fri ever
Embedding size: 512
Embedding: [-0.08140411227941513, 0.01702791452407837, 0.014239952899515629, ...]

Original: good price
Embedding size: 512
Embedding: [0.02641877345740795, -0.043094903230667114, 0.002205158118158579, ...]

Original: updat procedur difficult cumbersom
Embedding size: 512
Embedding: [0.06264284998178482, 0.06796059012413025, 0.004704696591943502, ...]

Original: redeem qualiti restaur inexpens
Embed

#### Task 4.2 - Build a NN Model with USE Embedding

In [0]:
X_train, X_validation, Y_train, Y_validation = train_test_split(X_train, Y_train, test_size=0.2)

In [0]:
def _batch_iter(X, Y, batch_size=32, **kwargs):
    data_size = len(Y)
    num_batches_per_epoch = int((data_size - 1) / batch_size) + 1

    def data_generator():
        while True:
            # Shuffle the data at each epoch
            shuffled_indices = np.random.permutation(data_size)

            for batch_num in range(num_batches_per_epoch):
                start_index = batch_num * batch_size
                end_index = min((batch_num + 1) * batch_size, data_size)
                X_batch = [X[i] for i in shuffled_indices[start_index:end_index]]
                Y_batch = [Y[i] for i in shuffled_indices[start_index:end_index]]

                # Labels were read from the files as strings, so cast them to integers for the loss.
                yield ({'x_input': np.asarray(X_batch)}, {'output': np.asarray(Y_batch, dtype=int)})
            #end for
        #end while
    #end def

    return num_batches_per_epoch, data_generator()
#end def

train_steps, train_batches = _batch_iter(X_train, Y_train)
validate_steps, validate_batches = _batch_iter(X_validation, Y_validation)
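
The step count returned by `_batch_iter` is a ceiling division, so the last batch of an epoch may be smaller than `batch_size`. The sketch below reproduces only that arithmetic. The helper names are hypothetical, and the sizes 1,920 and 480 assume the 2,400 training sentences were split 80/20 as in the cell above.

```python
import math

def steps_per_epoch(data_size, batch_size=32):
    # Same formula as in _batch_iter: the number of batches needed to cover the data.
    return int((data_size - 1) / batch_size) + 1

def batch_bounds(data_size, batch_size=32):
    """Yield the (start, end) index pair of every batch in one epoch."""
    for batch_num in range(steps_per_epoch(data_size, batch_size)):
        start = batch_num * batch_size
        yield start, min(start + batch_size, data_size)

# The formula agrees with an explicit ceiling division for these sizes.
for n in (1920, 480, 33, 32):
    assert steps_per_epoch(n) == math.ceil(n / 32)

print(steps_per_epoch(1920), steps_per_epoch(480))  # 60 15
print(list(batch_bounds(70)))  # [(0, 32), (32, 64), (64, 70)]
```

Because the generator loops forever, Keras relies on these step counts (`steps_per_epoch` and `validation_steps`) to know where one epoch ends.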

In [52]:
USE_MODULE_URL = "https://tfhub.dev/google/universal-sentence-encoder/2"
USE_EMBED = hub.Module(USE_MODULE_URL, trainable=True)


def USE_Embedding(x):
    return USE_EMBED(tf.squeeze(tf.cast(x, tf.string)), signature="default", as_dict=True)["default"]
#end def


# Initialize session
with tf.Session() as session:
    K.set_session(session)
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    tf.logging.set_verbosity(tf.logging.ERROR)

    x_input = Input(shape=(1,), dtype=tf.string, name='x_input')
    x_embed = Lambda(USE_Embedding, output_shape=(512,))(x_input)
    x = Dense(256, activation='relu')(x_embed)
    output = Dense(1, activation='sigmoid', name='output')(x)

    model = Model(inputs=[x_input], outputs=output)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    model.fit_generator(
            epochs=10,
            generator=train_batches,
            steps_per_epoch=train_steps,
            validation_data=validate_batches,
            validation_steps=validate_steps)
    
    X_test = np.array(X_test, dtype=object)
    Y_test = np.array(Y_test, dtype=int)
    threshold = sum(Y_test)/Y_test.shape[0]
    use_predicts = model.predict(X_test)
    use_predicts = [1 if i > threshold else 0 for i in use_predicts]
    print("Classification Report for USE:\n{}".format(classification_report(Y_test, use_predicts)))


Exception ignored in: <bound method BaseSession._Callable.__del__ of <tensorflow.python.client.session.BaseSession._Callable object at 0x7f5b7303c5f8>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1455, in __del__
    self._session._session, self._handle, status)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.CancelledError: Session has been closed.
Exception ignored in: <bound method BaseSession._Callable.__del__ of <tensorflow.python.client.session.BaseSession._Callable object at 0x7f5b7304fdd8>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1455, in __del__
    self._session._session, self._handle, status)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
x_input (InputLayer)         (None, 1)                 0         
_________________________________________________________________
lambda_6 (Lambda)            (None, 512)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 256)               131328    
_________________________________________________________________
output (Dense)               (None, 1)                 257       
Total params: 131,585
Trainable params: 131,585
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Classification Report for USE:
              precision    recall  f1-score   support

           0       0.90      0.86      0.88       299
           1     