#### Assignment 8: Recurrent Neural Networks (RNN)
NOTE : PLEASE DO NOT POST/SHARE THE CODE OR YOUR SOLUTIONS ON THE WEB/GIT except CANVAS FOR GRADING

References:
https://www.kaggle.com/c/nlp-getting-started/notebooks
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
https://paperswithcode.com/method/bilstm
https://towardsdatascience.com/sentiment-analysis-with-deep-learning-62d4d0166ef6
https://stackabuse.com/python-for-nlp-movie-sentiment-analysis-using-deep-learning-in-keras/ 

Management Problem
Executive management of a large retailer is considering using a language model to classify written customer reviews of call and complaint logs. If the most critical customer messages can be identified, customer support personnel can be assigned to contact those customers to gather feedback and enhance the customer experience.

How would you advise senior management? What kinds of systems and methods would be most relevant to the customer services function? Considering the results of this assignment in particular, what is needed to make an automated customer support system that is capable of identifying negative customer feelings? What can data scientists do to make language models more useful in a customer service function?


Description:
This assignment involves working with language models developed with pretrained word vectors. We use sentences (sequences of words) to train language models for predicting movie review sentiment (thumbs up versus thumbs down). We study effects of word vector size, vocabulary size, and neural network structure (hyperparameters) on classification performance. We build on resources for recurrent neural networks (RNNs) as implemented in TensorFlow. RNNs are well suited to the analysis of sequences, as needed for natural language processing (NLP). 

Initial background reading for this assignment comes from Chapters 15 and 16 (pp. 495–567) of the Géron textbook:
Géron, A. (2019). Hands-on machine learning with Scikit-Learn & TensorFlow: Concepts, tools, and techniques to build intelligent systems. Sebastopol, CA: O’Reilly. [ISBN-13 978-1-491-96229-9]. Source code available via Github (https://github.com/ageron/handson-ml2)

Specialized RNN models have been developed to accommodate the needs of many language processing tasks. Larger relevant vocabularies are usually associated with more accurate models, but training with larger vocabularies requires more memory and longer processing times. We can speed up the training process by using pretrained word vectors and subsets of pretrained word vectors.

Technologies such as word2vec, GloVe (global vectors), and fastText provide ways of representing words as numeric vectors. These numeric vectors or neural network embeddings capture the meaning of words as well as their common usage as parts of speech. Word embeddings have extensive applications in natural language processing.
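
One way to see what "capturing meaning" amounts to in practice is to compare word vectors by cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration (the numbers are not taken from any pretrained file, and real GloVe vectors have 50 or more dimensions). `cosine_similarity` and `toy_vectors` are names introduced for this example only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen here for all-zero (unknown-word) vectors
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors invented for illustration only
toy_vectors = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.1, 0.2],
}

sim_close = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_far = cosine_similarity(toy_vectors["good"], toy_vectors["awful"])
print("good vs great:", round(sim_close, 3))
print("good vs awful:", round(sim_far, 3))
```

With real pretrained vectors, the same function could be applied to rows of the embedding array to check whether words with related meanings end up close together.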

This assignment requires the use of two pretrained word embeddings selected from a list of supported vectors. That is, we replace each word in a sentence or sequence with a vector of numbers. Methods for downloading word embeddings are provided in the Python package chakin (https://github.com/chakki-works/chakin).

Early work on word2vec embeddings is cited in these references:
Mikolov, T., et al. (2013). Efficient Estimation of Word Representations in Vector Space (https://arxiv.org/pdf/1301.3781.pdf). 
Mikolov, T., et al. (2013). Linguistic Regularities in Continuous Space Word Representations (https://www.aclweb.org/anthology/N13-1090/). 

GloVe, a method for estimating pretrained word vectors, was developed at Stanford (https://nlp.stanford.edu/projects/glove/). A tutorial and code for using GloVe embeddings are available at https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer.

A third set of pretrained vectors is fastText, described here (https://arxiv.org/pdf/1607.04606.pdf).  Word embeddings are an active area of research, as shown in recent developments in probabilistic fastText (https://www.aclweb.org/anthology/P18-1001/).

We can also test vocabulary sizes associated with pretrained word vectors, defined as part of the language model. For example, we might compare a vocabulary of the top 10,000 words in English versus a vocabulary of the top 30,000 words. Additionally, we can test alternative RNN structures and settings of hyperparameters.


Requirements:
- Install the Python chakin package, obtain GloVe (and perhaps non-GloVe) embeddings.
- Run the starter code below for the assignment, which uses pretrained word vectors from GloVe.6B.50d, a vocabulary of 10,000 words, and movie review data.
- Revise the jump-start code to accommodate two pretrained word vectors and two vocabulary sizes. These represent the cells of a completely crossed 2-by-2 experimental design, defining four distinct language models.
- Build and evaluate at least four language models of the experimental design. For each cell in the design, compute classification accuracy in the test set.
- Evaluate the four language models and make recommendations to management.
- Test two or more alternative RNN structures or hyperparameter settings.
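
The 2-by-2 design in the requirements above can be laid out before any model is trained. The following is a minimal sketch, assuming GloVe.6B.50d and GloVe.6B.100d as the two embeddings and 10,000 and 30,000 as the two vocabulary sizes; these levels, and the helper names `build_design` and `accuracy`, are illustrative choices and not part of the starter code.

```python
from itertools import product

# Assumed factor levels; substitute the embeddings and sizes you actually use
EMBEDDING_NAMES = ["GloVe.6B.50d", "GloVe.6B.100d"]
VOCAB_SIZES = [10000, 30000]


def build_design(embedding_names, vocab_sizes):
    """Return one record per cell of the fully crossed design."""
    return [
        {"embedding": name, "vocab_size": size, "test_accuracy": None}
        for name, size in product(embedding_names, vocab_sizes)
    ]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


design = build_design(EMBEDDING_NAMES, VOCAB_SIZES)
for cell in design:
    print(cell["embedding"], cell["vocab_size"])
```

Each record's `test_accuracy` field would be filled in after the corresponding model is trained and scored on the test set, which keeps the four results in one structure for the management summary.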


Deliverables and File Formats
- Python notebook that addresses the problem and includes the writeup, as indicated towards the end of this notebook (Audience: Director, Data Science/Analytics)

Optional (Audience:Business/C-Suite) - Additional 20 points
1. Provide a double-spaced paper with a two-page maximum for the text. The paper in pdf format should include 
    (1) Summary and problem definition for management; 
    (2) Discussion of the methodology, data findings and traditional machine learning methods employed; 
    (3) List assumptions, programming work, issues along with model evaluation metrics; and 
    (4) Review of results/insights with recommendations for management.

Formatting Python Code
Refer to Google’s Python Style Guide (https://google.github.io/styleguide/pyguide.html) for ideas about formatting Python code.


NOTE : 
- The starter code is below; feel free to update, edit, or change it to present your thoughts and solutions to the problem.
- Comment often and in detail, highlighting major sections of code, describing the thinking behind the programming methods being employed.
- This code contains many errors, so please update all the cells based on best practices, along with your analysis and findings.


GRADING GUIDELINES (100 points)
--------------------------------
(1) Data preparation, exploration, visualization (20 points)
(2) Review research design and modeling methods (20 points)
(3) Review results, evaluate models (20 points)
(4) Implementation and programming (20 points)
(5) Exposition, problem description, and management recommendations (20 points) 


In [1]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
!pip install chakin

In [2]:
# S1 Install & Import Packages 
# __future__ imports are no-ops on Python 3; kept in one line for compatibility
from __future__ import absolute_import, division, print_function, unicode_literals

from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from matplotlib.backends.backend_pdf import PdfPages
import sklearn

from datetime import datetime
import cv2

import chakin  
import json
from collections import defaultdict
import nltk
from nltk.tokenize import TreebankWordTokenizer
from itertools import chain
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

import pandas as pd  # data frame operations  
import numpy as np  # arrays and math functions
import matplotlib.pyplot as plt  # static plotting
import re # regular expressions
import scipy
import os # Operation System
import os.path
import time # Record processing time

#%tensorflow_version 2.x
import tensorflow as tf
print(tf.__version__)


2.5.0


In [None]:
#S2 Mount Google Drive to Colab Environment
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
#S3 Establish working directory - COLAB
# Need to create this directory structure on Google Drive
os.getcwd()
%cd /content/gdrive/My Drive/wk8/
!pwd
!ls
print('Working Directory')
print(os.getcwd())
work_dir = "/content/gdrive/My Drive/wk8/"
data_dir = work_dir + "data/"  # work_dir already ends with a slash
chp_id = "rnn"

In [3]:
#S3 Establish working directory  - LOCAL
print('Working Directory')
print(os.getcwd())

work_dir = "./"
data_dir = work_dir +"data/"
chp_id = "rnn"
print(data_dir +"train/")

Working Directory
C:\Users\sbhar\Desktop\shree\teaching\08-MLPA\03-Assignments\8
./data/train/


In [4]:
# Import data and Word Embeddings for Sentiment analysis
chakin.search(lang='English')  # lists available indices in English

                   Name  Dimension                     Corpus VocabularySize  \
2          fastText(en)        300                  Wikipedia           2.5M   
11         GloVe.6B.50d         50  Wikipedia+Gigaword 5 (6B)           400K   
12        GloVe.6B.100d        100  Wikipedia+Gigaword 5 (6B)           400K   
13        GloVe.6B.200d        200  Wikipedia+Gigaword 5 (6B)           400K   
14        GloVe.6B.300d        300  Wikipedia+Gigaword 5 (6B)           400K   
15       GloVe.42B.300d        300          Common Crawl(42B)           1.9M   
16      GloVe.840B.300d        300         Common Crawl(840B)           2.2M   
17    GloVe.Twitter.25d         25               Twitter(27B)           1.2M   
18    GloVe.Twitter.50d         50               Twitter(27B)           1.2M   
19   GloVe.Twitter.100d        100               Twitter(27B)           1.2M   
20   GloVe.Twitter.200d        200               Twitter(27B)           1.2M   
21  word2vec.GoogleNews        300      

##### Data Collection
##### RNN - Movie Review Sentiment  PART 1 : run chakin to get embeddings

In [5]:
# Specify English embeddings file to download and install by index number, number of dimensions, and subfolder name
# Note that GloVe 50-, 100-, 200-, and 300-dimensional folders are downloaded with a single zip download
CHAKIN_INDEX = 11
NUMBER_OF_DIMENSIONS = 50
SUBFOLDER_NAME = "gloVe.6B"

In [6]:
DATA_FOLDER = "embeddings"
ZIP_FILE = os.path.join(DATA_FOLDER, "{}.zip".format(SUBFOLDER_NAME))
ZIP_FILE_ALT = os.path.join(DATA_FOLDER, "{}.zip".format(SUBFOLDER_NAME.lower()))  # sometimes it's lowercase only...
UNZIP_FOLDER = os.path.join(DATA_FOLDER, SUBFOLDER_NAME)
if SUBFOLDER_NAME[-1] == "d":
    GLOVE_FILENAME = os.path.join(
        UNZIP_FOLDER, "{}.txt".format(SUBFOLDER_NAME))
else:
    GLOVE_FILENAME = os.path.join(UNZIP_FOLDER, "{}.{}d.txt".format(
        SUBFOLDER_NAME, NUMBER_OF_DIMENSIONS))


if not os.path.exists(ZIP_FILE) and not os.path.exists(UNZIP_FOLDER):
    # GloVe by Stanford is licensed Apache 2.0:
    #     https://github.com/stanfordnlp/GloVe/blob/master/LICENSE
    #     http://nlp.stanford.edu/data/glove.twitter.27B.zip
    #     Copyright 2014 The Board of Trustees of The Leland Stanford Junior University
    print("Downloading embeddings to '{}'".format(ZIP_FILE))
    chakin.download(number=CHAKIN_INDEX, save_dir='./{}'.format(DATA_FOLDER))
else:
    print("Embeddings already downloaded.")

if not os.path.exists(UNZIP_FOLDER):
    import zipfile
    if not os.path.exists(ZIP_FILE) and os.path.exists(ZIP_FILE_ALT):
        ZIP_FILE = ZIP_FILE_ALT
    with zipfile.ZipFile(ZIP_FILE, "r") as zip_ref:
        print("Extracting embeddings to '{}'".format(UNZIP_FOLDER))
        zip_ref.extractall(UNZIP_FOLDER)
else:
    print("Embeddings already extracted.")

print('\nRun complete')

# After this step there should be:
# - an embeddings folder in the current working directory
# - a directory called gloVe.6B within the embeddings directory
# - 4 files within that directory

Downloading embeddings to 'embeddings\gloVe.6B.zip'


Test: 100% ||                                      | Time:  0:02:39   5.1 MiB/s


Extracting embeddings to 'embeddings\gloVe.6B'

Run complete
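
A quick way to confirm the download and extraction worked is to list the text files in the extracted folder. The sketch below is one possible check, demonstrated against a temporary folder with empty placeholder files so that it runs anywhere. The helper name `list_embedding_files` is hypothetical, and the file names follow the `glove.6B.<dim>d.txt` pattern, which is assumed from the 50-dimensional file name used later in this notebook.

```python
import os
import tempfile


def list_embedding_files(folder):
    """Return the sorted names of .txt files found directly in `folder`."""
    if not os.path.isdir(folder):
        return []
    return sorted(
        name for name in os.listdir(folder)
        if name.endswith(".txt") and os.path.isfile(os.path.join(folder, name))
    )


# Demonstration with empty placeholder files (not real embeddings)
with tempfile.TemporaryDirectory() as tmp:
    for dim in (50, 100, 200, 300):
        placeholder = os.path.join(tmp, "glove.6B.{}d.txt".format(dim))
        open(placeholder, "w").close()
    found = list_embedding_files(tmp)

print(len(found), "files:", found)
```

In the notebook itself, the call would be `list_embedding_files(UNZIP_FOLDER)`, and a count other than four would suggest the extraction step did not complete.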


##### Data Preparation & Modelling

In [7]:
# seed value for random number generators to obtain reproducible results
RANDOM_SEED = 9999

REMOVE_STOPWORDS = False  # no stopword removal 

EVOCABSIZE = 10000  # specify desired size of pre-defined embedding vocabulary 

# To make output stable across runs
def reset_graph(seed= RANDOM_SEED):
    # TensorFlow 2.x exposes these TF1-style calls under tf.compat.v1
    tf.compat.v1.reset_default_graph()
    tf.compat.v1.set_random_seed(seed)
    np.random.seed(seed)

In [8]:
# ------------------------------------------------------------- 
# Select the pre-defined embeddings source        
# Define vocabulary size for the language model    
# Create a word_to_embedding_dict for GloVe.6B.50d
embeddings_directory = 'embeddings/gloVe.6B'
filename = 'glove.6B.50d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)

In [9]:
# Utility function for loading embeddings follows methods described in
# https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer
# Creates the Python defaultdict dictionary word_to_embedding_dict
# for the requested pre-trained word embeddings

# Note the use of the defaultdict data structure from the collections module
# of the Python Standard Library, which lets the caller specify a default
# value up front. The default value is returned if the key is not a known
# dictionary key. For word embeddings, this default value is a vector of
# zeros, so unknown words are represented by a vector of zeros.
# Documentation for the Python standard library:
#   Hellmann, D. 2017. The Python 3 Standard Library by Example. Boston: 
#     Addison-Wesley. [ISBN-13: 978-0-13-429105-5]


# Load an embedding text file
def load_embedding_from_disks(embeddings_filename, with_indexes=True):
    """
    Read an embeddings txt file. If `with_indexes=True`, we return a tuple of two dictionaries 
    `(word_to_index_dict, index_to_embedding_array)`, otherwise we return only a direct 
    `word_to_embedding_dict` dictionary mapping from a string to a numpy array.
    """
    if with_indexes:
        word_to_index_dict = dict()
        index_to_embedding_array = []
  
    else:
        word_to_embedding_dict = dict()

    with open(embeddings_filename, 'r', encoding='utf-8') as embeddings_file:
        for (i, line) in enumerate(embeddings_file):
            split = line.split(' ')
            word = split[0]
            representation = split[1:]
            representation = np.array(
                [float(val) for val in representation]
            )

            if with_indexes:
                word_to_index_dict[word] = i
                index_to_embedding_array.append(representation)
            else:
                word_to_embedding_dict[word] = representation

    # Empty representation for unknown words.
    _WORD_NOT_FOUND = [0.0] * len(representation)
    if with_indexes:
        _LAST_INDEX = i + 1
        word_to_index_dict = defaultdict(
            lambda: _LAST_INDEX, word_to_index_dict)
        index_to_embedding_array = np.array(
            index_to_embedding_array + [_WORD_NOT_FOUND])
        return word_to_index_dict, index_to_embedding_array
    else:
        # Pass the loaded dictionary so known words keep their vectors
        word_to_embedding_dict = defaultdict(
            lambda: _WORD_NOT_FOUND, word_to_embedding_dict)
        return word_to_embedding_dict
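
The unknown-word fallback used by the loader can be seen in isolation with a tiny dictionary. This is a minimal sketch with toy data (the names `toy_index` and `toy_word_to_index` are invented for the example). It also shows a side effect worth knowing about: a lookup with square brackets stores the missing key in the `defaultdict`, whereas `.get()` does not.

```python
from collections import defaultdict

toy_index = {"the": 0, "quick": 1, "fox": 2}
UNKNOWN_INDEX = len(toy_index)  # one past the last known word

# Known words keep their index; any other key maps to UNKNOWN_INDEX
toy_word_to_index = defaultdict(lambda: UNKNOWN_INDEX, toy_index)

known = toy_word_to_index["fox"]
unknown = toy_word_to_index["worsdfkljsdf"]
print(known, unknown)

# Caveat: the [] lookup above stored the missing key as a side effect
size_after_lookup = len(toy_word_to_index)

# .get() reads stored keys only and never calls the default factory
safe = toy_word_to_index.get("another-missing-word", UNKNOWN_INDEX)
size_after_get = len(toy_word_to_index)
print(size_after_lookup, size_after_get)
```

If the dictionary size matters later (for example, when reporting vocabulary size), using `.get()` for lookups avoids growing it with every unseen word.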

In [10]:
# Check that the embeddings file glove.6B.50d.txt loads successfully. The file was obtained through "run-chakin-to-get-embeddings-v001.py"
print('\nLoading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = \
    load_embedding_from_disks(embeddings_filename, with_indexes=True)
print("Embedding loaded from disks.")

# Note: unknown words have representations with values [0, 0, ..., 0]


Loading embeddings from embeddings/gloVe.6B\glove.6B.50d.txt
Embedding loaded from disks.


In [11]:
# Additional background code from
# https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer
# shows the general structure of the data structures for word embeddings
# This code is modified for our purposes in language modeling 

# Check vocabulary size and embedding dimension
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
print("This means (number of words, number of dimensions per word)\n")
print("The first words are words that tend to occur more often.")

# Check embedding data
print("Note: for unknown words, the representation is an empty vector,\n"
      "and the index is the last one. The dictionary has a limit:")
print("    {} --> {} --> {}".format("A word", "Index in embedding", 
      "Representation"))
word = "worsdfkljsdf"  # a word obviously not in the vocabulary
idx = word_to_index[word] # index for word obviously not in the vocabulary
complete_vocabulary_size = idx 
embd = list(np.array(index_to_embedding[idx], dtype=int)) # "int" for compact print only
print("    {} --> {} --> {}".format(word, idx, embd))
word = "the"
idx = word_to_index[word]
embd = list(index_to_embedding[idx])  # full-precision values for a known word
print("    {} --> {} --> {}".format(word, idx, embd))

Embedding is of shape: (400001, 50)
This means (number of words, number of dimensions per word)

The first words are words that tend to occur more often.
Note: for unknown words, the representation is an empty vector,
and the index is the last one. The dictionary has a limit:
    A word --> Index in embedding --> Representation
    worsdfkljsdf --> 400000 --> [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    the --> 0 --> [0.418, 0.24968, -0.41242, 0.1217, 0.34527, -0.044457, -0.49688, -0.17862, -0.00066023, -0.6566, 0.27843, -0.14767, -0.55677, 0.14658, -0.0095095, 0.011658, 0.10204, -0.12792, -0.8443, -0.12181, -0.016801, -0.33279, -0.1552, -0.23131, -0.19181, -1.8823, -0.76746, 0.099051, -0.42125, -0.19526, 4.0071, -0.18594, -0.52287, -0.31681, 0.00059213, 0.0074449, 0.17778, -0.15897, 0.012041, -0.054223, -0.29871, -0.15749, -0.34758, -0.045637, -0.44251, 0.18785, 0.0027849, -0.18

In [12]:
# Show how to use embeddings dictionaries with a test sentence
# This is a famous typing exercise with all letters of the alphabet
# https://en.wikipedia.org/wiki/The_quick_brown_fox_jumps_over_the_lazy_dog
a_typing_test_sentence = 'The quick brown fox jumps over the lazy dog'
print('\nTest sentence: ', a_typing_test_sentence, '\n')
words_in_test_sentence = a_typing_test_sentence.split()

print('Test sentence embeddings from complete vocabulary of', 
      complete_vocabulary_size, 'words:\n')
for word in words_in_test_sentence:
    word_ = word.lower()
    embedding = index_to_embedding[word_to_index[word_]]
    print(word_ + ": ", embedding)


Test sentence:  The quick brown fox jumps over the lazy dog 

Test sentence embeddings from complete vocabulary of 400000 words:

the:  [ 4.1800e-01  2.4968e-01 -4.1242e-01  1.2170e-01  3.4527e-01 -4.4457e-02
 -4.9688e-01 -1.7862e-01 -6.6023e-04 -6.5660e-01  2.7843e-01 -1.4767e-01
 -5.5677e-01  1.4658e-01 -9.5095e-03  1.1658e-02  1.0204e-01 -1.2792e-01
 -8.4430e-01 -1.2181e-01 -1.6801e-02 -3.3279e-01 -1.5520e-01 -2.3131e-01
 -1.9181e-01 -1.8823e+00 -7.6746e-01  9.9051e-02 -4.2125e-01 -1.9526e-01
  4.0071e+00 -1.8594e-01 -5.2287e-01 -3.1681e-01  5.9213e-04  7.4449e-03
  1.7778e-01 -1.5897e-01  1.2041e-02 -5.4223e-02 -2.9871e-01 -1.5749e-01
 -3.4758e-01 -4.5637e-02 -4.4251e-01  1.8785e-01  2.7849e-03 -1.8411e-01
 -1.1514e-01 -7.8581e-01]
quick:  [ 0.13967   -0.53798   -0.18047   -0.25142    0.16203   -0.13868
 -0.24637    0.75111    0.27264    0.61035   -0.82548    0.038647
 -0.32361    0.30373   -0.14598   -0.23551    0.39267   -1.1287
 -0.23636   -1.0629     0.046277   0.29143   -0.25

In [13]:
# ------------------------------------------------------------- 
# Define vocabulary size for the language model    
# To reduce the size of the vocabulary to the n most frequently used words

def default_factory():
    return EVOCABSIZE  # last/unknown-word row in limited_index_to_embedding
# dictionary has the items() function, returns list of (key, value) tuples
limited_word_to_index = defaultdict(default_factory, \
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})


In [14]:
# Select the first EVOCABSIZE rows of index_to_embedding
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE,:]

# Set the unknown-word row to be all zeros as previously
limited_index_to_embedding = np.append(limited_index_to_embedding, 
    index_to_embedding[index_to_embedding.shape[0] - 1, :].\
        reshape(1,embedding_dim), 
    axis = 0)
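
The two vocabulary-limiting steps above (filtering the word index and slicing the embedding rows) can be wrapped in one function, which may help when repeating them for several vocabulary sizes. The sketch below uses plain Python lists and toy two-dimensional vectors so it runs without the GloVe file. `limit_vocabulary` is a hypothetical helper, not part of the starter code, and it assumes words are indexed in frequency order as in the GloVe file.

```python
from collections import defaultdict


def limit_vocabulary(word_to_index, index_to_embedding, vocab_size):
    """Keep the first `vocab_size` words and append a zero row for unknowns."""
    vocab_size = min(vocab_size, len(index_to_embedding))
    dim = len(index_to_embedding[0])
    limited_rows = [list(row) for row in index_to_embedding[:vocab_size]]
    limited_rows.append([0.0] * dim)  # unknown-word row at index vocab_size
    kept = {w: i for w, i in word_to_index.items() if i < vocab_size}
    limited_index = defaultdict(lambda: vocab_size, kept)
    return limited_index, limited_rows


# Toy data invented for illustration: three words, two dimensions
toy_word_to_index = {"the": 0, "of": 1, "fox": 2}
toy_rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

limited_index, limited_rows = limit_vocabulary(toy_word_to_index, toy_rows, 2)
print(limited_index["the"], limited_rows[limited_index["the"]])
print(limited_index["fox"], limited_rows[limited_index["fox"]])
```

Here "fox" falls outside the two-word vocabulary, so it maps to the appended zero row, which mirrors what happens to rare words when `EVOCABSIZE` is small.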

In [15]:
# Verify the new vocabulary: should get same embeddings for test sentence
# Note that a small EVOCABSIZE may yield some zero vectors for embeddings
print('\nTest sentence embeddings from vocabulary of', EVOCABSIZE, 'words:\n')
for word in words_in_test_sentence:
    word_ = word.lower()
    embedding = limited_index_to_embedding[limited_word_to_index[word_]]
    print(word_ + ": ", embedding)


Test sentence embeddings from vocabulary of 10000 words:

the:  [ 4.1800e-01  2.4968e-01 -4.1242e-01  1.2170e-01  3.4527e-01 -4.4457e-02
 -4.9688e-01 -1.7862e-01 -6.6023e-04 -6.5660e-01  2.7843e-01 -1.4767e-01
 -5.5677e-01  1.4658e-01 -9.5095e-03  1.1658e-02  1.0204e-01 -1.2792e-01
 -8.4430e-01 -1.2181e-01 -1.6801e-02 -3.3279e-01 -1.5520e-01 -2.3131e-01
 -1.9181e-01 -1.8823e+00 -7.6746e-01  9.9051e-02 -4.2125e-01 -1.9526e-01
  4.0071e+00 -1.8594e-01 -5.2287e-01 -3.1681e-01  5.9213e-04  7.4449e-03
  1.7778e-01 -1.5897e-01  1.2041e-02 -5.4223e-02 -2.9871e-01 -1.5749e-01
 -3.4758e-01 -4.5637e-02 -4.4251e-01  1.8785e-01  2.7849e-03 -1.8411e-01
 -1.1514e-01 -7.8581e-01]
quick:  [ 0.13967   -0.53798   -0.18047   -0.25142    0.16203   -0.13868
 -0.24637    0.75111    0.27264    0.61035   -0.82548    0.038647
 -0.32361    0.30373   -0.14598   -0.23551    0.39267   -1.1287
 -0.23636   -1.0629     0.046277   0.29143   -0.25819   -0.094902
  0.79478   -1.2095    -0.01039   -0.092086   0.84322   

In [16]:
# ------------------------------------------------------------
# code for working with movie reviews data 
# Source: Miller, T. W. (2016). Web and Network Data Science.
#    Upper Saddle River, N.J.: Pearson Education.
#    ISBN-13: 978-0-13-388644-3
# This original study used a simple bag-of-words approach
# to sentiment analysis, along with pre-defined lists of
# negative and positive words.        
# Code available at:  https://github.com/mtpa/wnds       
# ------------------------------------------------------------

# Utility function to get file names within a directory
def listdir_no_hidden(path):
    start_list = os.listdir(path)
    end_list = []
    for file in start_list:
        if (not file.startswith('.')):
            end_list.append(file)
    return(end_list)

In [17]:
# define list of codes to be dropped from document
# carriage-returns, line-feeds, tabs
codelist = ['\r', '\n', '\t']   

In [18]:
# We will not remove stopwords in this exercise because they are
# important to keeping sentences intact
if REMOVE_STOPWORDS:
    print(nltk.corpus.stopwords.words('english'))

# previous analysis of a list of top terms showed a number of words, along 
# with contractions and other word strings to drop from further analysis, add
# these to the usual English stopwords to be dropped from a document collection
    more_stop_words = ['cant','didnt','doesnt','dont','goes','isnt','hes',\
        'shes','thats','theres','theyre','wont','youll','youre','youve', 'br',\
        've', 're', 'vs'] 

    some_proper_nouns_to_remove = ['dick','ginger','hollywood','jack',\
        'jill','john','karloff','kudrow','orson','peter','tcm','tom',\
        'toni','welles','william','wolheim','nikita']

    # start with the initial list and add to it for movie text work 
    stoplist = nltk.corpus.stopwords.words('english') + more_stop_words +\
        some_proper_nouns_to_remove

In [19]:
# text parsing function for creating text documents 
# there is more we could do for data preparation 
# stemming... looking for contractions... possessives... 
# but we will work with what we have in this parsing function
# if we want to do stemming at a later time, we can use
#     porter = nltk.PorterStemmer()  
# in a construction like this
#     words_stemmed =  [porter.stem(word) for word in initial_words]  
def text_parse(string):
    # replace non-alphabetic characters (including digits) with space 
    temp_string = re.sub('[^a-zA-Z]', '  ', string)    
    # replace codes with space
    for i in range(len(codelist)):
        stopstring = ' ' + codelist[i] + '  '
        temp_string = re.sub(stopstring, '  ', temp_string)      
    # replace single-character words with space (raw string avoids escape warnings)
    temp_string = re.sub(r'\s.\s', ' ', temp_string)   
    # convert uppercase to lowercase
    temp_string = temp_string.lower()    
    if REMOVE_STOPWORDS:
        # replace selected character strings/stop-words with space
        for i in range(len(stoplist)):
            stopstring = ' ' + str(stoplist[i]) + ' '
            temp_string = re.sub(stopstring, ' ', temp_string)        
    # replace multiple blank characters with one blank character
    temp_string = re.sub(r'\s+', ' ', temp_string)    
    return(temp_string)   
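
To see what this kind of cleaning does to a raw review, it can help to run a compact version on one sentence. The sketch below is a simplified variant written for illustration, not a replacement for `text_parse`: it drops single-character tokens with a list filter instead of a regular expression and ignores stopwords. `simple_clean` is a name introduced here.

```python
import re


def simple_clean(text):
    """Lowercase, keep letters only, and drop single-character tokens."""
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # non-letters become spaces
    text = re.sub(r'\s+', ' ', text)        # collapse runs of whitespace
    text = text.lower().strip()
    return [token for token in text.split(' ') if len(token) > 1]


sample = "I've seen it!<br />Great."
tokens = simple_clean(sample)
print(tokens)
```

Note that contractions and HTML line breaks leave fragments such as `ve` and `br` behind, which is why those strings appear in the stopword list above.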

In [21]:
# -----------------------------------------------
# Download and install both negative and positive movie reviews from the URL below
# https://github.com/mtpa/wnds/tree/master/WNDS_Chapter_8
# -----------------------------------------------

# Load negative movie reviews into a list of lists
# Create the movie-reviews-negative directory and move the negative IMDB files into it
# os.mkdir(work_dir+'movie-reviews-negative')

# -----------------------------------------------
# gather data for 500 negative movie reviews
# -----------------------------------------------

# Set path to the negative reviews directory, "movie-reviews-negative"
dir_name = os.path.join(os.getcwd(), 'data', 'movie-reviews-negative')
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
    assert file_exists
print('\nDirectory:',dir_name)    
print('%d files found' % len(filenames))


Directory: C:\Users\sbhar\Desktop\shree\teaching\08-MLPA\03-Assignments\8./data/movie-reviews-negative
500 files found


In [23]:
# Read data for negative movie reviews
# Data will be stored in a list of lists, where each inner list represents
# a document and each document is a list of words.
# We then break the text into words.

def read_data(filename):

    with open(filename, encoding='utf-8') as f:
        data = tf.compat.as_str(f.read())
        data = data.lower()
        data = text_parse(data)
        data = TreebankWordTokenizer().tokenize(data)  # The Penn Treebank

    return data

negative_documents = []

print('\nProcessing document files under', dir_name)
for i in range(num_files):
    ## print(' ', filenames[i])

    words = read_data(os.path.join(dir_name, filenames[i]))

    negative_documents.append(words)
    
# len(words) counts word tokens in the last document read, not characters
print('Data size (words) (Document %d) %d' %(i,len(words)))
print('Sample string (Document %d) %s'%(i,words[:50]))



Processing document files under C:\Users\sbhar\Desktop\shree\teaching\08-MLPA\03-Assignments\8./data/movie-reviews-negative
Data size (words) (Document 499) 133
Sample string (Document 499) ['this', 'is', 'one', 'of', 'the', 'dumbest', 'films', 've', 'ever', 'seen', 'it', 'rips', 'off', 'nearly', 'ever', 'type', 'of', 'thriller', 'and', 'manages', 'to', 'make', 'mess', 'of', 'them', 'all', 'br', 'br', 'there', 'not', 'single', 'good', 'line', 'or', 'character', 'in', 'the', 'whole', 'mess', 'if', 'there', 'was', 'plot', 'it', 'was', 'an', 'afterthought', 'and', 'as', 'far']


In [24]:
# -----------------------------------------------
# gather data for 500 positive movie reviews
# -----------------------------------------------

# Set path to the positive reviews directory, "movie-reviews-positive"
dir_name = './data/movie-reviews-positive'  
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

for i in range(len(filenames)):
    file_exists = os.path.isfile(os.path.join(dir_name, filenames[i]))
    assert file_exists
print('\nDirectory:',dir_name)    
print('%d files found' % len(filenames))


Directory: ./data/movie-reviews-positive
500 files found


In [25]:
# Read data for positive movie reviews
# Data will be stored in a list of lists, where each inner list
# represents a document and each document is a list of words.
# We then break the text into words.

def read_data(filename):

    with open(filename, encoding='utf-8') as f:
        data = tf.compat.as_str(f.read())
        data = data.lower()
        data = text_parse(data)
        data = TreebankWordTokenizer().tokenize(data)  # The Penn Treebank

    return data

positive_documents = []

print('\nProcessing document files under', dir_name)
for i in range(num_files):
    ## print(' ', filenames[i])

    words = read_data(os.path.join(dir_name, filenames[i]))
    positive_documents.append(words)
    
# len(words) counts word tokens in the last document read, not characters
print('Data size (words) (Document %d) %d' %(i,len(words)))
print('Sample string (Document %d) %s'%(i,words[:50]))



Processing document files under ./data/movie-reviews-positive
Data size (words) (Document 499) 157
Sample string (Document 499) ['working', 'class', 'romantic', 'drama', 'from', 'director', 'martin', 'ritt', 'is', 'as', 'unbelievable', 'as', 'they', 'come', 'yet', 'there', 'are', 'moments', 'of', 'pleasure', 'due', 'mostly', 'to', 'the', 'charisma', 'of', 'stars', 'jane', 'fonda', 'and', 'robert', 'de', 'niro', 'both', 'terrific', 'she', 'widow', 'who', 'can', 'move', 'on', 'he', 'illiterate', 'and', 'closet', 'inventor', 'you', 'can', 'probably', 'guess']


In [26]:
# -----------------------------------------------------
# convert positive/negative documents into numpy array
# note that review lengths vary widely (up to 1052 words),
# so we use the first 20 and last 20 words of each review 
# as our word sequences for analysis
# -----------------------------------------------------

max_review_length = 0  # initialize
for doc in negative_documents:
    max_review_length = max(max_review_length, len(doc))    
for doc in positive_documents:
    max_review_length = max(max_review_length, len(doc)) 
print('max_review_length:', max_review_length) 

min_review_length = max_review_length  # initialize
for doc in negative_documents:
    min_review_length = min(min_review_length, len(doc))    
for doc in positive_documents:
    min_review_length = min(min_review_length, len(doc)) 
print('min_review_length:', min_review_length) 

max_review_length: 1052
min_review_length: 0
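
A reported minimum length of 0 suggests that at least one review is empty after parsing, which is worth checking before building fixed-length sequences. The following sketch summarizes document lengths and counts empty documents. `length_summary` and the toy documents are illustrative and not part of the starter code.

```python
def length_summary(documents):
    """Return count, min, max, and number of empty documents."""
    lengths = [len(doc) for doc in documents]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "empty": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "empty": sum(1 for n in lengths if n == 0),
    }


# Toy documents invented for illustration
toy_documents = [["a", "fine", "film"], [], ["dull"]]
summary = length_summary(toy_documents)
print(summary)
```

In the notebook, calling `length_summary(negative_documents + positive_documents)` would show how many reviews need to be dropped or padded.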


In [27]:
# construct a list of 1000 lists with up to 40 words in each list
# (a review shorter than 20 words yields fewer than 40 words)
from itertools import chain
documents = []
for doc in negative_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))    
for doc in positive_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))    
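
The slicing above returns fewer than 40 tokens for a review shorter than 20 words (an empty review yields none), which would produce ragged sequences. One possible way to guarantee a fixed length is to pad short reviews, as sketched below. The helper `head_tail` and the `"<pad>"` token are assumptions made for this example. Because `"<pad>"` is not in the GloVe vocabulary, the `defaultdict` lookup would map it to the all-zeros unknown-word row.

```python
def head_tail(tokens, n=20, pad_token="<pad>"):
    """Return exactly 2*n tokens: first n and last n, padded if too short."""
    if len(tokens) >= 2 * n:
        return tokens[:n] + tokens[-n:]
    # Shorter reviews are kept whole and padded at the end
    return list(tokens) + [pad_token] * (2 * n - len(tokens))


long_review = [str(i) for i in range(100)]
short_review = ["short", "review"]

long_out = head_tail(long_review)
short_out = head_tail(short_review)
empty_out = head_tail([])
print(len(long_out), len(short_out), len(empty_out))
```

Unlike the original slicing, this version does not repeat words for reviews between 20 and 39 words long. Whether padding or overlap is preferable is a modeling choice worth stating in the writeup.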

In [28]:
# create list of lists of lists for embeddings
embeddings = []    
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(limited_index_to_embedding[limited_word_to_index[word]]) 
    embeddings.append(embedding)
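
Before converting the nested embeddings list to an array, it can be useful to confirm that it is rectangular, that is, that every document has the same number of words and every word has the same number of dimensions. The sketch below shows such a check on toy data. `embed_documents`, `nested_shape`, and the toy lookup tables are hypothetical names for this example.

```python
from collections import defaultdict

# Toy lookup tables invented for illustration; the last row is "unknown"
toy_rows = [[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]
toy_lookup = defaultdict(lambda: 2, {"good": 0, "film": 1})


def embed_documents(documents, lookup, rows):
    """Replace each word with its embedding vector."""
    return [[rows[lookup[word]] for word in doc] for doc in documents]


def nested_shape(embedded):
    """Return (documents, words, dimensions) or raise if ragged."""
    if not embedded:
        return (0, 0, 0)
    word_counts = {len(doc) for doc in embedded}
    if len(word_counts) != 1:
        raise ValueError(
            "documents differ in length: {}".format(sorted(word_counts)))
    dims = {len(vec) for doc in embedded for vec in doc}
    if len(dims) > 1:
        raise ValueError("vectors differ in dimension: {}".format(sorted(dims)))
    n_dims = dims.pop() if dims else 0
    return (len(embedded), word_counts.pop(), n_dims)


toy_documents = [["good", "film"], ["film", "unseen"]]
embedded = embed_documents(toy_documents, toy_lookup, toy_rows)
shape = nested_shape(embedded)
print(shape)
```

For the notebook's data with 40-word sequences and 50-dimensional vectors, the expected result would be `(1000, 40, 50)`; a `ValueError` would point to short or empty reviews.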

In [29]:
# -----------------------------------------------------    
# Check on the embeddings list of list of lists 
# -----------------------------------------------------

# Show the first word in the first document
test_word = documents[0][0]    
print('First word in first document:', test_word)    
print('Embedding for this word:\n', 
      limited_index_to_embedding[limited_word_to_index[test_word]])
print('Corresponding embedding from embeddings list of list of lists\n',
      embeddings[0][0][:])

First word in first document: story
Embedding for this word:
 [ 0.48251    0.87746   -0.23455    0.0262     0.79691    0.43102
 -0.60902   -0.60764   -0.42812   -0.012523  -1.2894     0.52656
 -0.82763    0.30689    1.1972    -0.47674   -0.46885   -0.19524
 -0.28403    0.35237    0.45536    0.76853    0.0062157  0.55421
  1.0006    -1.3973    -1.6894     0.30003    0.60678   -0.46044
  2.5961    -1.2178     0.28747   -0.46175   -0.25943    0.38209
 -0.28312   -0.47642   -0.059444  -0.59202    0.25613    0.21306
 -0.016129  -0.29873   -0.19468    0.53611    0.75459   -0.4112
  0.23625    0.26451  ]
Corresponding embedding from embeddings list of list of lists
 [ 0.48251    0.87746   -0.23455    0.0262     0.79691    0.43102
 -0.60902   -0.60764   -0.42812   -0.012523  -1.2894     0.52656
 -0.82763    0.30689    1.1972    -0.47674   -0.46885   -0.19524
 -0.28403    0.35237    0.45536    0.76853    0.0062157  0.55421
  1.0006    -1.3973    -1.6894     0.30003    0.60678   -0.46044
  2.596

In [30]:
# Show the tenth word in the seventh document
test_word = documents[6][9]    
print('Tenth word in seventh document:', test_word)    
print('Embedding for this word:\n', 
      limited_index_to_embedding[limited_word_to_index[test_word]])
print('Corresponding embedding from embeddings list of list of lists\n',
      embeddings[6][9][:])

Tenth word in seventh document: but
Embedding for this word:
 [ 0.35934   -0.2657    -0.046477  -0.2496     0.54676    0.25924
 -0.64458    0.1736    -0.53056    0.13942    0.062324   0.18459
 -0.75495   -0.19569    0.70799    0.44759    0.27031   -0.32885
 -0.38891   -0.61606   -0.484      0.41703    0.34794   -0.19706
  0.40734   -2.1488    -0.24284    0.33809    0.43993   -0.21616
  3.7635     0.19002   -0.12503   -0.38228    0.12944   -0.18272
  0.076803   0.51579    0.0072516 -0.29192   -0.27523    0.40593
 -0.040394   0.28353   -0.024724   0.10563   -0.32879    0.10673
 -0.11503    0.074678 ]
Corresponding embedding from embeddings list of list of lists
 [ 0.35934   -0.2657    -0.046477  -0.2496     0.54676    0.25924
 -0.64458    0.1736    -0.53056    0.13942    0.062324   0.18459
 -0.75495   -0.19569    0.70799    0.44759    0.27031   -0.32885
 -0.38891   -0.61606   -0.484      0.41703    0.34794   -0.19706
  0.40734   -2.1488    -0.24284    0.33809    0.43993   -0.21616
  3.7635

In [31]:
# Show the last word in the last document
test_word = documents[999][39]    
print('Last word in last document:', test_word)    
print('Embedding for this word:\n', 
      limited_index_to_embedding[limited_word_to_index[test_word]])
print('Corresponding embedding from embeddings list of list of lists\n',
      embeddings[999][39][:])        

Last word in last document: from
Embedding for this word:
 [ 0.41037   0.11342   0.051524 -0.53833  -0.12913   0.22247  -0.9494
 -0.18963  -0.36623  -0.067011  0.19356  -0.33044   0.11615  -0.58585
  0.36106   0.12555  -0.3581   -0.023201 -1.2319    0.23383   0.71256
  0.14824   0.50874  -0.12313  -0.20353  -1.82      0.22291   0.020291
 -0.081743 -0.27481   3.7343   -0.01874  -0.084522 -0.30364   0.27959
  0.043328 -0.24621   0.015373  0.49751   0.15108  -0.01619   0.40132
  0.23067  -0.10743  -0.36625  -0.051135  0.041474 -0.36064  -0.19616
 -0.81066 ]
Corresponding embedding from embeddings list of list of lists
 [ 0.41037   0.11342   0.051524 -0.53833  -0.12913   0.22247  -0.9494
 -0.18963  -0.36623  -0.067011  0.19356  -0.33044   0.11615  -0.58585
  0.36106   0.12555  -0.3581   -0.023201 -1.2319    0.23383   0.71256
  0.14824   0.50874  -0.12313  -0.20353  -1.82      0.22291   0.020291
 -0.081743 -0.27481   3.7343   -0.01874  -0.084522 -0.30364   0.27959
  0.043328 -0.24621   0.

In [32]:
# -----------------------------------------------------    
# Make embeddings a numpy array for use in an RNN 
# Create training and test sets with Scikit Learn
# -----------------------------------------------------

# Apply embeddings to numpy
embeddings_array = np.array(embeddings)

In [33]:
# Define the labels to be used: 500 negative (0) and 500 positive (1)
thumbs_down_up = np.concatenate((np.zeros((500), dtype = np.int32), 
                      np.ones((500), dtype = np.int32)), axis = 0)

###### Modeling

In [34]:
# Scikit Learn for random splitting of the data  
from sklearn.model_selection import train_test_split

In [35]:
# Set training and test data
# Random splitting of the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = \
    train_test_split(embeddings_array, thumbs_down_up, test_size=0.20, 
                     random_state = RANDOM_SEED)

print("Shape of Training data: ", X_train.shape)
print("Shape of Test data: ", X_test.shape)

print("\nShape of Training labels: ", y_train.shape)
print("Shape of Test labels: ", y_test.shape)

Shape of Training data:  (800,)
Shape of Test data:  (200,)

Shape of Training labels:  (800,)
Shape of Test labels:  (200,)


##### Base Model: Simple RNN with BPTT
##### GloVe.6B.50d - Embedding Vocabulary Size 10,000

In [None]:
### Code for base case
n_steps = embeddings_array.shape[1]  # number of words per document 
n_inputs = embeddings_array.shape[2]  # dimension of pre-trained embeddings
n_neurons = 20  # analyst specified number of neurons
n_outputs = 2  # thumbs-down or thumbs-up

learning_rate = 0.001 # Learning rate = 0.001
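
To make the roles of `n_steps`, `n_inputs`, `n_neurons`, and `n_outputs` concrete, here is a minimal NumPy sketch of the forward pass of a basic recurrent layer. It is not the assignment's TensorFlow base model: the weights are random, the function and variable names are hypothetical, and the demo sizes (40 steps, 50 inputs, 20 neurons, 2 outputs) are assumed to mirror the settings above. Backpropagation through time would compute gradients by differentiating back through this same loop.

```python
import numpy as np


def simple_rnn_forward(X, Wx, Wh, b):
    """Run a basic tanh RNN over a batch and return the final hidden state.

    X  : (batch, n_steps, n_inputs) input sequences
    Wx : (n_inputs, n_neurons) input-to-hidden weights
    Wh : (n_neurons, n_neurons) hidden-to-hidden weights
    b  : (n_neurons,) bias
    """
    batch, steps, _ = X.shape
    h = np.zeros((batch, Wh.shape[0]))
    for t in range(steps):
        # The new state depends on the current word vector and the previous state
        h = np.tanh(X[:, t, :] @ Wx + h @ Wh + b)
    return h


rng = np.random.default_rng(0)
demo_steps, demo_inputs, demo_neurons, demo_outputs = 40, 50, 20, 2

X_demo = rng.normal(size=(4, demo_steps, demo_inputs))   # 4 fake documents
Wx = rng.normal(scale=0.1, size=(demo_inputs, demo_neurons))
Wh = rng.normal(scale=0.1, size=(demo_neurons, demo_neurons))
b = np.zeros(demo_neurons)
W_out = rng.normal(scale=0.1, size=(demo_neurons, demo_outputs))

h_last = simple_rnn_forward(X_demo, Wx, Wh, b)
logits = h_last @ W_out  # one score per class (thumbs-down, thumbs-up)
print(h_last.shape, logits.shape)  # (4, 20) (4, 2)
```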


In [None]:
### Execution Phase###

# Set number of epochs and batch size for training model.
n_epochs = 50
batch_size = 100

# Record start time for neural network training
start_time_base = time.perf_counter()

In [None]:
# Print prediction classes and actual classes.
print("-------- Base Model --------")
print("RNN backpropagation through time (BPTT)")
print("\nPredicted classes:", y_pred_base)
print("Actual classes:", y_test[:25])
print("Test Set Accuracy:", accuracy_base)
print("Train Set Accuracy:", accuracy_base_y)
# Record end time for neural network training
stop_time_base = time.perf_counter()

#Total processing time
runtime_base = stop_time_base - start_time_base 

print("\nStart time:", start_time_base)
print("Stop time:", stop_time_base)
print("processing time:", runtime_base)
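
The start/stop timing pattern used for each model could be wrapped in a small helper so that every run is measured the same way. The following is a sketch using `time.perf_counter`, the standard-library clock intended for measuring elapsed intervals. The `timed` context manager and the `runtimes` dictionary are hypothetical names, and the summation merely stands in for model training.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(results, label):
    """Store the seconds spent inside the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


runtimes = {}
with timed(runtimes, 'base'):
    total = sum(i * i for i in range(100_000))  # stand-in for model training

print('processing time:', runtimes['base'])
```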



### Model 1: RNN with LSTM cells and 3 Layers
#### GloVe.6B.50d - Embedding Vocabulary Size 10,000

In [36]:
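from tensorflow import keras
from tensorflow.keras.layers import Dense, Dropout, Flatten, Activation, Input, Add
from tensorflow.keras.layers import LSTM, GRU
from tensorflow.keras.models import Sequential, Model

def generateModel1():
    # Input: 40 words per document, each a 50-dimensional GloVe vector
    inputLayer = Input(shape=(40, 50))
    unitNum = 32
    # return_sequences=True passes the full sequence on to the next recurrent layer
    model = LSTM(units=unitNum, return_sequences=True, activation='elu')(inputLayer)
    model = LSTM(units=unitNum, return_sequences=True, activation='elu')(model)
    # The last LSTM layer returns only its final state
    model = LSTM(units=unitNum, activation='elu')(model)
    # Single linear output unit, trained against the 0/1 labels
    outputModel = Dense(1)(model)
    model = Model(inputLayer, outputModel)
    return model

start_time_base_1 = time.perf_counter()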

In [37]:
model = generateModel1()
print(model.summary())

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 40, 50)]          0         
_________________________________________________________________
lstm (LSTM)                  (None, 40, 32)            10624     
_________________________________________________________________
lstm_1 (LSTM)                (None, 40, 32)            8320      
_________________________________________________________________
lstm_2 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 27,297
Trainable params: 27,297
Non-trainable params: 0
_________________________________________________________________
None
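
The parameter counts in the summary above can be checked by hand. An LSTM layer holds four weight sets (the input, forget, and output gates plus the candidate cell state), each with input weights, recurrent weights, and a bias. The short sketch below reproduces the reported numbers; the function names are hypothetical.

```python
def lstm_params(n_inputs, n_units):
    # Four weight sets, each with input weights, recurrent weights, and a bias
    return 4 * (n_units * (n_inputs + n_units) + n_units)


def dense_params(n_inputs, n_units):
    # One weight per input-unit pair plus one bias per unit
    return n_inputs * n_units + n_units


counts = [
    lstm_params(50, 32),   # lstm:   50-d embeddings into 32 units
    lstm_params(32, 32),   # lstm_1: 32 units into 32 units
    lstm_params(32, 32),   # lstm_2: 32 units into 32 units
    dense_params(32, 1),   # dense:  32 units into 1 output
]
print(counts, sum(counts))  # [10624, 8320, 8320, 33] 27297
```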


In [None]:
# The 'adam' string selects the Adam optimizer with its default settings
model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])

lossThreshold = .1
valLossThreshold = .1
count = 0
hist = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50, batch_size=64, verbose=1)

In [None]:
# evaluate() returns [loss, accuracy]
accuracy_1 = model.evaluate(X_test, y_test, verbose = 0)          # test set
accuracy_base_y_1 = model.evaluate(X_train, y_train, verbose = 0)  # training set
stop_time_base_1 = time.perf_counter()
time_1 = stop_time_base_1 - start_time_base_1
print("Test Set Accuracy:", accuracy_1[1])
print("Train Set Accuracy:", accuracy_base_y_1[1])
print("processing time", time_1)

#### Model 2: RNN with LSTM cells and 3 Layers
#### GloVe.6B.50d - Embedding Vocabulary Size 30,000

#### Model 8: RNN with GRU cells and 5 Layers
#### GloVe.6B.100d - Embedding Vocabulary Size 30,000

In [None]:
start_time_base_8 = time.perf_counter()
def generateModel8():
    inputLayer=Input(shape=(40,50))
    unitNum = 32
    model=GRU(units=unitNum, return_sequences=True, activation='elu')(inputLayer)
    model=GRU(units=unitNum, return_sequences=True, activation='elu')(model)
    model=GRU(units=unitNum, return_sequences=True, activation='elu')(model)
    model=GRU(units=unitNum, return_sequences=True, activation='elu')(model)
    model=GRU(units=unitNum, activation='elu')(model)
    outputModel=Dense(1)(model)
    model=Model(inputLayer,outputModel)
    return model

In [None]:
model_8 = generateModel8()
print(model_8.summary())

In [None]:
model_8.compile(loss='mse', optimizer='adam',metrics=['accuracy'])

lossThreshold = .1
valLossThreshold = .1

hist_8 = model_8.fit(X1_train, y1_train, validation_data=(X1_test, y1_test), epochs=50, batch_size=64, verbose=1)

In [None]:
# evaluate() returns [loss, accuracy]
accuracy_8 = model_8.evaluate(X1_test, y1_test, verbose = 0)          # test set
accuracy_base_y_8 = model_8.evaluate(X1_train, y1_train, verbose = 0)  # training set
stop_time_base_8 = time.perf_counter()
time_8 = stop_time_base_8 - start_time_base_8

print("Test Set Accuracy:", accuracy_8[1])
print("Train Set Accuracy:", accuracy_base_y_8[1])
print("processing time", time_8)

In [None]:
#### Summary Table
from tabulate import tabulate

col_labels = ['# - Model', 'Number of Layers', 'Embedding Vocabulary Size','Processing Time', 'Test Set Accuracy', 'Training Set Accuracy']

table_vals = [['1 - Simple RNN with BPTT','3','10000',runtime_base, accuracy_base,accuracy_base_y],
              ['2 - RNN with LSTM cells','3','10000',time_1,accuracy_1[1],accuracy_base_y_1[1]],
              ['3 - RNN with LSTM cells','3','30000',time_2,accuracy_2[1],accuracy_base_y_2[1]],
              ['4 - RNN with LSTM cells','5','10000',time_3,accuracy_3[1],accuracy_base_y_3[1]],
              ['5 - RNN with LSTM cells','5','30000',time_4,accuracy_4[1],accuracy_base_y_4[1]],
              ['6 - RNN with GRU','3','10000',time_5,accuracy_5[1],accuracy_base_y_5[1]],
              ['7 - RNN with GRU','3','30000',time_6,accuracy_6[1],accuracy_base_y_6[1]],
              ['8 - RNN with GRU','5','10000',time_7,accuracy_7[1],accuracy_base_y_7[1]],
              ['9 - RNN with GRU','5','30000',time_8,accuracy_8[1],accuracy_base_y_8[1]],]

print('------------------------------- Summary Table -------------------------------')

table = tabulate(table_vals, headers=col_labels, tablefmt="simple",numalign="left")
print(table)



#### REPORT/FINDINGS: 
(1) A summary and problem definition for management; 

(2) Discussion of the research design, measurement and statistical methods, and traditional and machine learning methods employed;

(3) Overview of programming work; 

(4) Review of results with recommendations for management.