<a href="https://colab.research.google.com/github/sagunkayastha/ILab_Tutorials/blob/master/NLP_Deep_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Word Embedding

**Downloading and extracting the GloVe word embeddings**

In [0]:
import numpy as np

In [2]:
!wget http://nlp.stanford.edu/data/glove.42B.300d.zip
!unzip glove.42B.300d.zip


In [0]:
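def read_glove_vecs(glove_file):
    """Read a GloVe text file into a vocabulary set and a word -> vector dict."""
    words = set()
    word_to_vec_map = {}

    # GloVe files are UTF-8 text: one word per line, followed by its vector values
    with open(glove_file, 'r', encoding='utf8') as f:
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)

    return words, word_to_vec_map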

In [0]:
words, word_to_vec_map = read_glove_vecs('glove.42B.300d.txt')
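The 42B GloVe file holds roughly 1.9 million 300-dimensional vectors, so loading all of them as float64 takes several gigabytes of RAM. When only a known set of words is needed, the file can be filtered while reading. The following is a minimal sketch under that assumption: `read_glove_subset` and its `wanted` parameter are hypothetical names that do not appear in the notebook, and the tiny temporary file only stands in for the real GloVe file.

```python
import os
import tempfile

import numpy as np


def read_glove_subset(glove_file, wanted, dtype=np.float32):
    """Load vectors only for the words in `wanted` (hypothetical helper)."""
    wanted = set(wanted)
    word_to_vec = {}
    with open(glove_file, "r", encoding="utf8") as f:
        for line in f:
            # Split off the word first so unwanted lines are skipped cheaply.
            word, _, rest = line.rstrip().partition(" ")
            if word in wanted:
                word_to_vec[word] = np.array(rest.split(), dtype=dtype)
    return word_to_vec


# Toy stand-in for the GloVe text format: one word followed by its values.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf8") as tmp:
    tmp.write("king 0.1 0.2 0.3\nqueen 0.2 0.1 0.4\nburger 0.9 0.8 0.7\n")
    toy_path = tmp.name

subset = read_glove_subset(toy_path, {"king", "queen"})
os.remove(toy_path)

print(sorted(subset))        # ['king', 'queen']
print(subset["king"].dtype)  # float32
```

Storing the vectors as float32 halves the memory per vector compared with the float64 used in `read_glove_vecs`, at a precision that is usually sufficient for similarity computations.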

In [0]:
father = word_to_vec_map["father"]
mother = word_to_vec_map["mother"]
king = word_to_vec_map["king"]
queen = word_to_vec_map["queen"]
france = word_to_vec_map["france"]
italy = word_to_vec_map["italy"]
paris = word_to_vec_map["paris"]
rome = word_to_vec_map["rome"]


In [0]:
# Print the raw 300-dimensional vector for "father"
print(father)

**Cosine Similarity**
$$\text{CosineSimilarity}(u, v) = \frac{u \cdot v}{\|u\|_2 \, \|v\|_2} = \cos(\theta) \tag{1}$$

In [0]:
def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = np.dot(u, v)

    # L2 norms of both vectors
    norm_u = np.sqrt(np.sum(u**2))
    norm_v = np.sqrt(np.sum(v**2))

    return dot / (norm_u * norm_v)


In [0]:
print(cosine_similarity(father,mother))
print(cosine_similarity(king,queen))
print(cosine_similarity(king,rome))

0.8171303880464806
0.7596175811305904
0.39602592048853846
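As a quick sanity check of equation (1), cosine similarity should be 1 for vectors pointing the same way, 0 for orthogonal vectors and -1 for opposite vectors, regardless of their lengths. A minimal sketch on made-up toy vectors (the helper `cosine` is defined here only to keep the example self-contained):

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


u = np.array([1.0, 2.0, 3.0])

same_direction = cosine(u, 5.0 * u)  # scaling does not change the angle
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine(u, -u)

print(same_direction, orthogonal, opposite)
```

Because the measure ignores vector length, it compares only the direction of two embeddings, which is why it is preferred over the raw dot product here.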


**Word analogies: a is to b as c is to ?**

In [0]:
def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """Find the word d such that a is to b as c is to d."""
    # GloVe 42B is uncased, so look up lowercase words
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    # Embeddings of the three input words
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]

    max_cosine_sim = -100   # cosine similarity lies in [-1, 1], so any word beats this
    best_word = None        # best candidate found so far

    # Loop over the whole vocabulary
    for w in word_to_vec_map:
        # Skip the input words themselves
        if w in (word_a, word_b, word_c):
            continue

        # Compare the direction (e_b - e_a) with the direction (e_w - e_c)
        cosine_sim = cosine_similarity(e_b - e_a, word_to_vec_map[w] - e_c)

        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w

    return best_word

In [0]:
complete_analogy('France','Paris','Germany',word_to_vec_map)

'burger'
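A commonly used alternative scoring, often called 3CosAdd, ranks each candidate word by its cosine similarity to the single target vector e_b - e_a + e_c instead of comparing two difference vectors. The sketch below is an illustration of that alternative, not the method used above: `complete_analogy_3cosadd` is a hypothetical name and the 2-D vectors are invented so that the result can be checked by hand. It also replaces the Python loop with one matrix product.

```python
import numpy as np

# Made-up 2-D vectors for illustration only; real GloVe vectors have 300 dimensions.
toy_vectors = {
    "man": np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([-1.0, -2.0]),
}


def complete_analogy_3cosadd(word_a, word_b, word_c, word_to_vec):
    """Return the word whose vector is closest (by cosine) to e_b - e_a + e_c."""
    words = [w for w in word_to_vec if w not in (word_a, word_b, word_c)]
    matrix = np.stack([word_to_vec[w] for w in words])  # shape (n_words, dim)
    target = word_to_vec[word_b] - word_to_vec[word_a] + word_to_vec[word_c]

    # Cosine similarity of every candidate row against the target at once.
    sims = matrix @ target / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(target))
    return words[int(np.argmax(sims))]


print(complete_analogy_3cosadd("man", "woman", "king", toy_vectors))  # queen
```

Whether this scoring gives a more plausible answer than the loop above on the full GloVe vocabulary would have to be checked by running it; no such result is claimed here.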

# RNN

In [0]:
!wget https://raw.githubusercontent.com/sagunkayastha/ILab_Tutorials/master/ISEAR.csv

In [0]:
# imports
import string

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
import keras
from keras.preprocessing import sequence
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import np_utils
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# NLTK resources used by the word tokenizer and the lemmatizer
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# load the dataset
df = pd.read_csv('./ISEAR.csv', names=['Emotions', 'Sentence'], index_col=False)
df.describe()

# clean the dataset: drop rows with missing values
df.dropna(axis=0, inplace=True)
df.describe()

# fix a misspelled label ('guit' -> 'guilt')
df['Emotions'] = df['Emotions'].replace(to_replace='guit', value='guilt')

Using TensorFlow backend.


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
def clean_text(text):
    sw = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from','in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y']
    text = text.lower()
    text = nltk.word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    # keep alphanumeric tokens that are not stop words, reduced to their lemma
    text = [lemmatizer.lemmatize(w) for w in text if w not in sw and w.isalnum()]
    text = ' '.join(text)
    return text


df['Sentence'] = df['Sentence'].map(clean_text)
# drop rows whose sentence is empty after cleaning
df.drop(df[df['Sentence'].apply(len) == 0].index, inplace=True)

# check the dataset after cleaning
# print(df['Emotions'].value_counts())

# split the dataset
X, y = df['Sentence'], df['Emotions']

# transform y to one-hot-encoding
encoder = LabelEncoder()
encoder.fit(y)
encoded_y = encoder.transform(y)
onehot_y = np_utils.to_categorical(encoded_y)

# collect the unique words in `vocab`
# (total_vocab itself is just a list of None with one entry per token)
vocab = set()
total_vocab = [vocab.add(el) for s in X.values for el in s.split(' ')]
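The label encoding step above turns each emotion string into an integer class index and then into a one-hot row. The same idea can be sketched with NumPy alone; the five labels below are made-up examples, not rows from the dataset.

```python
import numpy as np

# A few label strings as they might appear in the 'Emotions' column.
labels = np.array(["joy", "fear", "anger", "joy", "guilt"])

# np.unique sorts the classes and returns each label's integer index,
# which mirrors the fit/transform step of a label encoder.
classes, encoded = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i.
onehot = np.eye(len(classes))[encoded]

print(classes)       # ['anger' 'fear' 'guilt' 'joy']
print(encoded)       # [3 1 0 3 2]
print(onehot.shape)  # (5, 4)
```

Each row of `onehot` contains a single 1 in the column of its class, which is the target format expected by a softmax output trained with categorical cross-entropy.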

In [0]:
# embeddings_index = {}
# f = open('glove.42B.300d.txt', encoding='utf8')
# for line in f:
#     values = line.split()
#     word = ''.join(values[:-300])
#     coefs = np.asarray(values[-300:], dtype='float32')
#     embeddings_index[word] = coefs
# f.close()

In [0]:
# Tokenize the sentences
# len(total_vocab) counts every token occurrence, so it is only an upper bound
# on the vocabulary size; len(vocab) is the number of unique words
VOCABULARY_SIZE = len(total_vocab)
PADDED_LENGTH = 90  # every sequence is padded or truncated to this length
tokenizer = Tokenizer(VOCABULARY_SIZE)
tokenizer.fit_on_texts(X)

# text to number
def preprocessing(X):
    seq = tokenizer.texts_to_sequences(X)
    data = sequence.pad_sequences(seq, maxlen = PADDED_LENGTH)
    return data
data = preprocessing(X)
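What the tokenizer and `pad_sequences` do can be mimicked in plain Python. This is a minimal sketch, assuming the Keras defaults of indexing words by descending frequency starting at 1 and of padding and truncating at the front of each sequence; the sentences and the helper `to_padded` are made up for illustration.

```python
from collections import Counter

sentences = ["felt happy friend came", "felt sad friend left", "felt happy"]

# Most frequent word gets index 1; index 0 is reserved for padding.
counts = Counter(word for s in sentences for word in s.split())
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_padded(sentence, maxlen):
    """Map words to indices, then left-pad with zeros (or keep the last maxlen ids)."""
    ids = [word_index[w] for w in sentence.split() if w in word_index]
    ids = ids[-maxlen:]                     # truncate from the front
    return [0] * (maxlen - len(ids)) + ids  # pad at the front


padded = [to_padded(s, maxlen=5) for s in sentences]
print(word_index)
print(padded)
```

Padding at the front keeps the real words next to the end of the sequence, which is where an LSTM produces the state that is passed on to the following layers.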


# split test set and train set
X_train, X_test, y_train, y_test = train_test_split(data, onehot_y, test_size = 0.1, random_state = 101, stratify = onehot_y)

# hyperparameters
LEARNING_RATE = 1e-3
EPOCH = 80
EMBEDDING_SIZE = 300
SEED = 2019
DROPOUTRATE = 0.3

In [0]:
embeddings_index = word_to_vec_map
embedding_matrix = np.zeros((VOCABULARY_SIZE, 300))
for word, i in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words without a GloVe vector keep an all-zero row
        embedding_matrix[i] = embedding_vector
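Words that have no pretrained vector silently keep an all-zero row, so it can be worth measuring how much of the vocabulary is actually covered. A minimal sketch on toy stand-ins: the small `word_index` and `embeddings_index` dictionaries below are assumptions made for the example, not the notebook's real objects.

```python
import numpy as np

# Toy stand-ins (assumptions): a tokenizer-style word index and a tiny embedding lookup.
word_index = {"happy": 1, "sad": 2, "xyzzy": 3}
embeddings_index = {
    "happy": np.array([0.1, 0.2]),
    "sad": np.array([0.3, 0.4]),
}

embedding_dim = 2
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))  # row 0 is padding
missing = []
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is None:
        missing.append(word)  # this row stays all zeros
    else:
        embedding_matrix[i] = vector

coverage = 1 - len(missing) / len(word_index)
print(f"coverage: {coverage:.2f}, missing: {missing}")
```

A low coverage would suggest that the text cleaning produces tokens the embedding file does not contain, for example lemmas or misspellings.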

In [0]:
from keras import optimizers, regularizers
from keras.layers import Dense, Dropout, Embedding, LSTM, SpatialDropout1D
from keras.models import Sequential

modelrnn = Sequential()
# Embedding layer initialised with the GloVe vectors; trainable=False freezes them
# (the string 'false' would be truthy and leave the layer trainable)
modelrnn.add(Embedding(VOCABULARY_SIZE, EMBEDDING_SIZE, input_length=PADDED_LENGTH,
                       weights=[embedding_matrix], trainable=False,
                       embeddings_regularizer=regularizers.l1(0.001)))
modelrnn.add(SpatialDropout1D(0.2))

# Two stacked LSTM layers; the first returns the full sequence for the second
modelrnn.add(LSTM(128, dropout=DROPOUTRATE, recurrent_dropout=DROPOUTRATE, return_sequences=True))
modelrnn.add(LSTM(64, dropout=DROPOUTRATE, recurrent_dropout=DROPOUTRATE))
modelrnn.add(Dense(64, activation='relu'))
modelrnn.add(Dropout(DROPOUTRATE, seed=SEED))
modelrnn.add(Dense(128, activation='relu'))
modelrnn.add(Dropout(DROPOUTRATE, seed=SEED))
# One output unit per emotion class
modelrnn.add(Dense(7, activation='softmax'))

optim = optimizers.Adam(lr=LEARNING_RATE,
                        beta_1=0.9,
                        beta_2=0.999,
                        epsilon=1e-08,
                        decay=LEARNING_RATE/EPOCH)

modelrnn.compile(loss='categorical_crossentropy', optimizer=optim, metrics=['accuracy'])
modelrnn.summary()






Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 90, 300)           22623600  
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 90, 300)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 90, 128)           219648    
_________________________________________________________________
lstm_2 (LSTM)                (None, 64)                49408     
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
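The parameter counts in the summary can be reproduced by hand. A minimal sketch: the vocabulary size 75412 is not stated anywhere in the notebook and is inferred here by dividing the embedding layer's 22,623,600 parameters by the 300 embedding dimensions; the three helper functions are hypothetical names.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units


print(embedding_params(75412, 300))  # 22623600
print(lstm_params(300, 128))         # 219648
print(lstm_params(128, 64))          # 49408
print(dense_params(64, 64))          # 4160
```

The embedding layer holds about 99% of the listed parameters, which is why the choice of vocabulary size dominates the size of this model.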

In [0]:
history = modelrnn.fit(X_train,y_train, batch_size=16, validation_split=0.2, epochs = EPOCH)

Train on 5410 samples, validate on 1353 samples
Epochs 1/80 through 72/80 (per-epoch metrics not preserved; log truncated)
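The optimizer above is given `decay=LEARNING_RATE/EPOCH`. Assuming the time-based rule of the legacy Keras optimizers, where the rate after a number of update steps is `lr / (1 + decay * iterations)`, the decay is applied per batch rather than per epoch. A small sketch of what that implies for this run, using the 5410 training samples reported in the log and the batch size of 16 (`lr_after` is a hypothetical helper):

```python
import math

LEARNING_RATE = 1e-3
EPOCH = 80
BATCH_SIZE = 16
TRAIN_SAMPLES = 5410  # from the "Train on 5410 samples" line

decay = LEARNING_RATE / EPOCH
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)


def lr_after(epochs):
    """Learning rate after the given number of full epochs."""
    iterations = epochs * steps_per_epoch
    return LEARNING_RATE / (1.0 + decay * iterations)


for e in (0, 1, 40, 80):
    print(e, lr_after(e))
```

Under these assumptions the rate only falls to roughly three quarters of its initial value by the last epoch, so the schedule is fairly gentle.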

# OpenAI GPT-2

In [0]:
!git clone https://github.com/openai/gpt-2.git
%cd gpt-2/

Cloning into 'gpt-2'...

In [0]:
!pip install -r requirements.txt


In [0]:
!python download_model.py 124M

Fetching checkpoint: 1.00kit [00:00, 589kit/s]
Fetching encoder.json: 1.04Mit [00:00, 27.5Mit/s]
Fetching hparams.json: 1.00kit [00:00, 609kit/s]
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:09, 50.7Mit/s]
Fetching model.ckpt.index: 6.00kit [00:00, 3.34Mit/s]
Fetching model.ckpt.meta: 472kit [00:00, 35.8Mit/s]
Fetching vocab.bpe: 457kit [00:00, 26.1Mit/s]


In [0]:
!python src/interactive_conditional_samples.py 124M

