# Predicting E-Commerce Product Recommendations from Reviews 


![](https://github.com/dipanjanS/feature_engineering_session_dhs18/blob/master/ecommerce_product_ratings_prediction/clothing_banner.jpg?raw=1)

This is a classic NLP problem dealing with data from an e-commerce store focusing on women's clothing. Each record in the dataset is a customer review consisting of the review title, a text description and a recommendation (0 or 1) for a product, amongst other features.


__Main Objective:__ Leverage the review text attributes and other features if needed to predict the recommendation (classification)

---
---


- Experiment 1: Basic NLP Count based Features & Age, Feedback Count
  - Training a Logistic Regression Model
  - Model Evaluation Metrics - Quick Refresher
- Experiment 2: Features from Sentiment Analysis
  - Leveraging Text Sentiment
  - Model Training and Evaluation
- Text Pre-processing and Wrangling
- Experiment 3: Modeling based on Bag of Words based Features - 1-grams
  - Use the following default config for count vectorizer
  - Model Training and Evaluation
- Experiment 4: Modeling with Bag of Words based Features - 2-grams
  - Model Training and Evaluation
- Experiment 5: Adding Bag of Words based Features - 3-grams
  - Model Training and Evaluation
- Experiment 6: Adding Bag of Words based Features - 3-grams with Feature Selection
- Experiment 7: Combining Bag of Words based Features - 3-grams with Feature Selection and the Structured Features
  - Converting dense features into sparse format
  - Combine the features using `hstack`
  - Model Training and Evaluation
- Experiment 8: Modeling on FastText Averaged Document Embeddings
  - Generate document level embeddings
  

# Load up basic dependencies

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
import tensorflow as tf
import tensorflow_hub as hub
import nltk
import matplotlib.pyplot as plt

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

In [None]:
!pip install contractions
!pip install textsearch
!pip install tqdm

# Load and View the Dataset

The data is available at https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews from where you can download it.

You can also access it from my [__GitHub Repo__](https://github.com/dipanjanS/text-analytics-with-python/blob/master/media) if needed.

The following code loads it directly from the web.

In [None]:
df = pd.read_csv('https://github.com/dipanjanS/text-analytics-with-python/raw/master/media/Womens%20Clothing%20E-Commerce%20Reviews%20-%20NLP.csv', keep_default_na=False)
df.head()

# Basic Data Processing

- Merge all review text attributes (title, text description) into one attribute
- Subset out columns of interest

In [None]:
df['Review'] = (df['Title'].map(str) +' '+ df['Review Text']).apply(lambda row: row.strip())
df['Recommended'] = df['Recommended IND']
df = df[['Review', 'Age', 'Positive Feedback Count', 'Recommended']]
df.head()

# # Day 4
# df['Review'] = (df['Title'].map(str) +' '+ df['Review Text']).apply(lambda row: row.strip())
# df['Recommended'] = df['Recommended IND']
# df = df[['Review', 'Recommended']]
# df.head()

## Remove all records with no review text

In [None]:
df = df[df['Review'] != '']
df.info()

## There is some imbalance in the data based on product recommendations

In [None]:
df['Recommended'].value_counts()

# Build train and test datasets

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['Recommended']), df['Recommended'], test_size=0.3, random_state=42)
X_train.shape, X_test.shape

In [None]:
from collections import Counter
Counter(y_train), Counter(y_test)

In [None]:
X_train.head(3)

In [None]:
y_train[:3]

Looks like this should help us get features which can distinguish between good and bad products. Let's try it out on our dataset!

# Experiment 1: Basic NLP Count based Features & Age, Feedback Count

A number of basic text based features can also be created which sometimes are helpful for improving text classification models. 
Some examples are:

- __Word Count:__ total number of words in the documents
- __Character Count:__ total number of characters in the documents
- __Average Word Density:__ average length of the words used in the documents
- __Punctuation Count:__ total number of punctuation marks in the documents
- __Upper Case Count:__ total number of upper case words in the documents
- __Title Word Count:__ total number of proper case (title) words in the documents
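
To make these definitions concrete, here is a minimal sketch that computes the same features for a single review string. The helper name `basic_text_features` and the sample review are made up for illustration; the formulas mirror the pandas code used in this notebook, including the `+1` in the word density denominator.

```python
import string

def basic_text_features(review):
    """Compute the count-based features listed above for one review string."""
    words = review.split()
    char_count = len(review)
    word_count = len(words)
    return {
        'char_count': char_count,
        'word_count': word_count,
        # the +1 avoids a division by zero for empty reviews
        'word_density': char_count / (word_count + 1),
        'punctuation_count': sum(ch in string.punctuation for ch in review),
        'title_word_count': sum(w.istitle() for w in words),
        'upper_case_word_count': sum(w.isupper() for w in words),
    }

# hypothetical sample review
features = basic_text_features('LOVE this Dress, so comfy!')
print(features)
```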

In [None]:
import string

X_train['char_count'] = X_train['Review'].apply(len)
X_train['word_count'] = X_train['Review'].apply(lambda x: len(x.split()))
X_train['word_density'] = X_train['char_count'] / (X_train['word_count']+1)
X_train['punctuation_count'] = X_train['Review'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 
X_train['title_word_count'] = X_train['Review'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
X_train['upper_case_word_count'] = X_train['Review'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))


X_test['char_count'] = X_test['Review'].apply(len)
X_test['word_count'] = X_test['Review'].apply(lambda x: len(x.split()))
X_test['word_density'] = X_test['char_count'] / (X_test['word_count']+1)
X_test['punctuation_count'] = X_test['Review'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation))) 
X_test['title_word_count'] = X_test['Review'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
X_test['upper_case_word_count'] = X_test['Review'].apply(lambda x: len([wrd for wrd in x.split() if wrd.isupper()]))

In [None]:
X_train.head()

## Training a Logistic Regression Model 

A logistic regression model is easy to train and interpret, and it works well on a wide variety of NLP problems.

In [None]:
from sklearn.linear_model import LogisticRegression

# max_iter must be an integer; recent scikit-learn versions reject the float 1e4
lr = LogisticRegression(C=1, random_state=42, solver='lbfgs', max_iter=10000)

## Model Evaluation Metrics - Quick Refresher

Just accuracy is never enough in datasets with a rare class problem.

- __Precision:__ The positive predictive power of a model. Out of all the predictions made by a model for a class, how many are actually correct
- __Recall:__ The coverage or hit-rate of a model. Out of all the test data samples belonging to a class, how many was the model able to predict (hit or cover) correctly.
- __F1-score:__ The harmonic mean of the precision and recall
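
As a small worked example of why accuracy alone can mislead, the sketch below computes these metrics by hand on toy labels. The helper `precision_recall_f1` and the label lists are invented for illustration; in the notebook itself `classification_report` does this work.

```python
def precision_recall_f1(y_true, y_pred, positive_label):
    """Precision, recall and F1 for one class, computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive_label and p == positive_label)
    fp = sum(1 for t, p in pairs if t != positive_label and p == positive_label)
    fn = sum(1 for t, p in pairs if t == positive_label and p != positive_label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# toy data: 8 recommended (1) and 2 not recommended (0) reviews
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# a lazy model that predicts "recommended" for everything
y_pred = [1] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics_class_0 = precision_recall_f1(y_true, y_pred, positive_label=0)
metrics_class_1 = precision_recall_f1(y_true, y_pred, positive_label=1)

print('accuracy:', accuracy)
print('class 0 (precision, recall, f1):', metrics_class_0)
print('class 1 (precision, recall, f1):', metrics_class_1)
```

The lazy model reaches 80% accuracy while never finding a single not-recommended review, which is exactly what the per-class recall exposes.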

# Experiment 2: Features from Sentiment Analysis 

## Leveraging Text Sentiment

Reviews are pretty subjective and opinionated, and people often express strong emotions and feelings in them.
This makes it a classic case where the text documents here are a good candidate for extracting sentiment as a feature.

The general expectation is that highly rated and recommended products (__label 1__) should have a __positive__ sentiment and products which are not recommended (__label 0__) should have a __negative__ sentiment.

TextBlob is an excellent open-source library for performing NLP tasks with ease, including sentiment analysis. It also ships with a sentiment lexicon (in the form of an XML file) which it leverages to give both polarity and subjectivity scores. 

- The polarity score is a float within the range [-1.0, 1.0]. 
- The subjectivity is a float within the range [0.0, 1.0] where 0.0 is very objective and 1.0 is very subjective. 

Perhaps this could be used for getting some new features? Let's look at some basic examples.

Source: https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

In [None]:
import textblob

textblob.TextBlob('This is an AMAZING pair of Jeans!').sentiment

In [None]:
textblob.TextBlob('I really hated this UGLY T-shirt!!').sentiment

Remember this is unsupervised, lexicon-based sentiment analysis where we don't have any pre-labeled data saying which review might have a positive or negative sentiment. We use the lexicon to determine this.
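
To illustrate the general idea of lexicon-based scoring, here is a deliberately tiny sketch. It is not TextBlob's actual lexicon or algorithm: the dictionary `TOY_LEXICON`, its scores and the function `toy_polarity` are all made up, and the score is simply the average polarity of the words found in the lexicon.

```python
# Hypothetical mini-lexicon; words and scores are invented for illustration
TOY_LEXICON = {
    'amazing': 0.75,
    'love': 0.5,
    'hated': -1.0,
    'ugly': -0.5,
}

def toy_polarity(text):
    """Average the lexicon scores of the known words; 0.0 if none are known."""
    tokens = [tok.strip('.,!?').lower() for tok in text.split()]
    scores = [TOY_LEXICON[tok] for tok in tokens if tok in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

pos_score = toy_polarity('This is an AMAZING pair of Jeans!')
neg_score = toy_polarity('I really hated this UGLY T-shirt!!')
print(pos_score, neg_score)
```

No labeled training data is involved at any point, which is what makes the approach unsupervised.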

In [None]:
x_train_snt_obj = X_train['Review'].apply(lambda row: textblob.TextBlob(row).sentiment)
X_train['Polarity'] = [obj.polarity for obj in x_train_snt_obj.values]
X_train['Subjectivity'] = [obj.subjectivity for obj in x_train_snt_obj.values]

x_test_snt_obj = X_test['Review'].apply(lambda row: textblob.TextBlob(row).sentiment)
X_test['Polarity'] = [obj.polarity for obj in x_test_snt_obj.values]
X_test['Subjectivity'] = [obj.subjectivity for obj in x_test_snt_obj.values]

In [None]:
X_train.head()

## Model Training and Evaluation

In [None]:
lr.fit(X_train.drop(['Review'], axis=1), y_train)
predictions = lr.predict(X_test.drop(['Review'], axis=1))

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

You will probably get a better model than Experiment 1

Can we still improve on our model since the recall of bad reviews is still pretty low?

# Text Pre-processing and Wrangling

We want to extract some specific features based on standard NLP feature engineering models like the classic Bag of Words model.
For this we need to clean and pre-process our text data. We will build a simple text pre-processor here since the main intent is to look at feature engineering strategies.

We will focus on:
- Text Lowercasing
- Expanding contractions (e.g. "didn't" becomes "did not")
- Removing unnecessary characters, numbers and symbols
- Stopword removal

Source: https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72

In [None]:
import contractions

contractions.fix('I didn\'t like this t-shirt')

In [None]:
import nltk
import contractions
import re
import tqdm

# remove some stopwords to capture negation in n-grams if possible
stopwords = nltk.corpus.stopwords.words('english')
stopwords.remove('no')
stopwords.remove('not')
stopwords.remove('but')


def normalize_document(doc):
    # fix contractions
    doc = contractions.fix(doc)
    # remove special characters and digits
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)
    # lower case
    doc = doc.lower()
    # strip whitespaces
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    #filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stopwords]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

def normalize_corpus(docs):
    norm_docs = []
    for doc in tqdm.tqdm(docs):
        norm_doc = normalize_document(doc)
        norm_docs.append(norm_doc)

    return norm_docs

In [None]:
X_train['Clean Review'] = normalize_corpus(X_train['Review'].values)
X_test['Clean Review'] = normalize_corpus(X_test['Review'].values)

## Let's remove the review column now since we don't need it anymore and restructure our dataframes

In [None]:
X_train = X_train[['Clean Review', 'Age', 'Positive Feedback Count', 'Polarity', 'Subjectivity']]
X_test = X_test[['Clean Review', 'Age', 'Positive Feedback Count', 'Polarity', 'Subjectivity']]

X_train.head()

## Extracting out the structured features from previous experiments

__We will extract out the structured columns \ features so we can use them right at the end after doing a few experiments with bag of words__

`X_train_struct` and `X_test_struct` should contain only 4 columns i.e.

- Age
- Positive Feedback Count
- Polarity
- Subjectivity

In [None]:
X_train_struct = X_train.drop(['Clean Review'], axis=1).reset_index(drop=True)
X_test_struct = X_test.drop(['Clean Review'], axis=1).reset_index(drop=True)

X_train_struct.head()

# Experiment 3: Modeling based on Bag of Words based Features - 1-grams

This is perhaps the simplest vector space representational model for unstructured text. A vector space model is simply a mathematical model to represent unstructured text (or any other data) as numeric vectors, such that each dimension of the vector is a specific feature/attribute. 

The bag of words model represents each text document as a numeric vector where each dimension is a specific word from the corpus and the value could be its frequency in the document, occurrence (denoted by 1 or 0) or even weighted values. 

The model’s name is such because each document is represented literally as a ‘bag’ of its own words, disregarding word orders, sequences and grammar.

Source: https://towardsdatascience.com/understanding-feature-engineering-part-3-traditional-methods-for-text-data-f6f7d70acd41
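
The description above can be sketched in a few lines of plain Python. The function `bag_of_words` and the two toy documents are assumptions for illustration only; `CountVectorizer`, used in this notebook, does the same job with its own tokenization rules and a sparse output matrix.

```python
from collections import Counter

def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        # one dimension per vocabulary word, value = frequency in this document
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# toy, already-cleaned documents
docs = ['love dress love fit', 'dress not fit']
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Word order is lost: any shuffling of the words in a document yields the same vector.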

In [None]:
train_clean_text = X_train['Clean Review']
test_clean_text = X_test['Clean Review']

## Use the following default config for count vectorizer

- `min_df` as 0.0
- `max_df` as 1.0
- `ngram_range` as (1,1)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0.0, max_df=1.0, ngram_range=(1, 1))

X_traincv = cv.fit_transform(train_clean_text)
X_testcv = cv.transform(test_clean_text)

In [None]:
X_traincv

## Model Training and Evaluation

In [None]:
lr.fit(X_traincv, y_train)
predictions = lr.predict(X_testcv)

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

This should look promising and far better than the previous models if you did it correctly.

Can we still improve on our model? Let's look at n-grams!

# Experiment 4: Modeling with Bag of Words based Features - 2-grams

We use the same feature engineering technique here except we consider both 1 and 2-grams as our features. 
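
As a quick illustration of what "1 and 2-grams" means, the sketch below lists the n-grams of a short token sequence. The helper `extract_ngrams` is hypothetical; its `min_n` and `max_n` arguments play the role of the two values in `ngram_range`.

```python
def extract_ngrams(tokens, min_n, max_n):
    """List all contiguous n-grams with min_n <= n <= max_n as strings."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[start:start + n]))
    return ngrams

tokens = 'not worth price'.split()
grams = extract_ngrams(tokens, 1, 2)
print(grams)
```

The bigram `not worth` keeps the negation next to the word it modifies, which is why `no`, `not` and `but` were kept out of the stopword list earlier.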

In [None]:
cv = CountVectorizer(min_df=0.0, max_df=1.0, ngram_range=(1, 2))

X_traincv = cv.fit_transform(train_clean_text)
X_testcv = cv.transform(test_clean_text)

In [None]:
X_traincv

## Model Training and Evaluation

In [None]:
lr.fit(X_traincv, y_train)
predictions = lr.predict(X_testcv)

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

You should be able to see some minor improvements

# Experiment 5: Adding Bag of Words based Features - 3-grams 

We use the same feature engineering technique here except we consider 1, 2 and 3-grams as our features.

In [None]:
cv = CountVectorizer(min_df=0.0, max_df=1.0, ngram_range=(1, 3))

X_traincv = cv.fit_transform(train_clean_text)
X_testcv = cv.transform(test_clean_text)

In [None]:
X_traincv

## Model Training and Evaluation

In [None]:
lr.fit(X_traincv, y_train)
predictions = lr.predict(X_testcv)

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

# Experiment 6: Adding Bag of Words based Features - 3-grams with Feature Selection

Set `min_df` as 3 in CountVectorizer and keep other params same as the previous experiment and notice the drop in features.

This drops every word or n-gram that appears in fewer than 3 documents of the training corpus. Note that `min_df` counts documents, not total occurrences.
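
A minimal sketch of document-frequency filtering, assuming whitespace-tokenized toy documents and unigrams only. The function `filter_by_document_frequency` is made up for illustration; `CountVectorizer` applies the same kind of rule internally when `min_df` is an integer.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df):
    """Keep only terms that appear in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so a term repeated inside one document is counted once
        doc_freq.update(set(doc.split()))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

docs = ['soft soft soft fabric', 'fabric runs small', 'small fabric tear']
kept = filter_by_document_frequency(docs, min_df=2)
print(kept)
```

Here `soft` occurs three times overall but only in one document, so it is dropped along with the other rare terms.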

How will the model perform now?

In [None]:
cv = CountVectorizer(min_df=3, max_df=1., ngram_range=(1, 3))

X_traincv = cv.fit_transform(train_clean_text)
X_testcv = cv.transform(test_clean_text)

In [None]:
X_traincv

In [None]:
lr.fit(X_traincv, y_train)
predictions = lr.predict(X_testcv)

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

# Experiment 7: Combining Bag of Words based Features - 3-grams with Feature Selection and the Structured Features

Let's combine our sparse BOW feature matrices with our structured features from earlier.

We do need to convert those structured features into sparse format so we can concatenate them to the BOW features!

In [None]:
X_train_struct.values

## Converting dense features into sparse format

In [None]:
from scipy import sparse

In [None]:
sparse.csr_matrix(X_train_struct)

In [None]:
X_traincv

## Combine the features using `hstack`

Check documentation if needed, it should be straightforward

In [None]:
from scipy.sparse import hstack

X_train_combined = hstack([sparse.csr_matrix(X_train_struct), 
                           X_traincv])
X_test_combined = hstack([sparse.csr_matrix(X_test_struct), X_testcv])

In [None]:
X_train_combined

## Model Training and Evaluation

In [None]:
lr.fit(X_train_combined, y_train)
predictions = lr.predict(X_test_combined)

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

# Experiment 8: Modeling on FastText Averaged Document Embeddings

## Build the FastText embedding model here

Remember: more iterations usually give better embeddings, but training takes longer depending on your system CPU.

10 iterations might take 5 mins

In [None]:
%%time

from gensim.models import FastText

tokenized_docs_train = [doc.split() for doc in train_clean_text]
# sample config params size: 300, window: 30, min_count=2 or more, iter=10
# (argument names follow gensim 3.x; gensim 4+ renamed size -> vector_size and iter -> epochs)
ft_model = FastText(tokenized_docs_train, size=300, window=30, min_count=2, workers=4, sg=1, iter=10)

## Generate document level embeddings

Word embedding models give us an embedding for each word. How can we use them for downstream ML/DL tasks? One way is to flatten them or to use sequential models. A simpler approach is to average the embeddings of all words in a document, which yields a fixed-length document-level embedding.

In [None]:
def averaged_word2vec_vectorizer(corpus, model, num_features):
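    # gensim 4+ exposes the vocabulary as index_to_key, gensim 3.x as index2word
    vocabulary = set(getattr(model.wv, 'index_to_key', None) or model.wv.index2word)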
    
    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        for word in words:
            if word in vocabulary: 
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)

        return feature_vector

    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)

In [None]:
tokenized_docs_train = [doc.split() for doc in train_clean_text]
tokenized_docs_test = [doc.split() for doc in test_clean_text]

Xtrain_doc_vecs_ft = averaged_word2vec_vectorizer(tokenized_docs_train, ft_model, 300)
Xtest_doc_vecs_ft = averaged_word2vec_vectorizer(tokenized_docs_test, ft_model, 300)

Xtrain_doc_vecs_ft.shape

## Model Training and Evaluation

In [None]:
lr.fit(Xtrain_doc_vecs_ft, y_train)
predictions = lr.predict(Xtest_doc_vecs_ft)

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

# Experiment 9: Combine FastText Vectors + Structured Features and build a model

In [None]:
X_train_combined = np.concatenate((X_train_struct, Xtrain_doc_vecs_ft),axis=1)
X_test_combined = np.concatenate((X_test_struct, Xtest_doc_vecs_ft),axis=1)

## Model Training and Evaluation

In [None]:
lr.fit(X_train_combined, y_train)
predictions = lr.predict(X_test_combined)

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

# Experiment 10: Train Classifier with CNN + FastText Embeddings & Evaluate Performance on Test Data

__Note:__ Skip FastText Embeddings part if it takes too much time to download or load it since it does consume a good amount of memory to load the pretrained embeddings.

If you want to load pre-trained embeddings, use a slightly smaller file than the one we used in live-coding, which had over 2 million words. Here is the link to get embeddings from Facebook's pre-trained FastText model.

https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip

__Hint:__ Use the code from the live-coding session to download and load relevant embeddings from the above dataset

In [None]:
train_clean_text = X_train['Clean Review']
test_clean_text = X_test['Clean Review']

In [None]:
t = tf.keras.preprocessing.text.Tokenizer(oov_token='<UNK>')
# fit the tokenizer on the documents
t.fit_on_texts(train_clean_text)
t.word_index['<PAD>'] = 0

In [None]:
print(max([(k, v) for k, v in t.word_index.items()], key = lambda x:x[1]), 
      min([(k, v) for k, v in t.word_index.items()], key = lambda x:x[1]), 
      t.word_index['<UNK>'])

In [None]:
train_sequences = t.texts_to_sequences(train_clean_text)
test_sequences = t.texts_to_sequences(test_clean_text)

In [None]:
print("Vocabulary size={}".format(len(t.word_index)))
print("Number of Documents={}".format(t.document_count))

In [None]:
max(len(i) for i in train_sequences)

In [None]:
max(len(doc.split()) for doc in train_clean_text)

In [None]:
plt.hist([len(item) for item in train_sequences], bins=30);

In [None]:
MAX_SEQUENCE_LENGTH = 121

# pad dataset to a maximum review length in words
train_seqs = tf.keras.preprocessing.sequence.pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH)
test_seqs = tf.keras.preprocessing.sequence.pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
train_seqs.shape, test_seqs.shape
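
To show what the padding step does to individual reviews, here is a plain-Python sketch of the behaviour of `pad_sequences` with its default settings, which pad and truncate at the front of the sequence. The helper `pad_or_truncate` is hypothetical and only mirrors that default behaviour for a single list.

```python
def pad_or_truncate(sequence, maxlen, pad_value=0):
    """Left-pad with pad_value, or keep only the last maxlen tokens."""
    # keep the tail of long sequences (mirrors truncating='pre')
    trimmed = sequence[-maxlen:]
    # fill the front of short sequences (mirrors padding='pre')
    return [pad_value] * (maxlen - len(trimmed)) + trimmed

short_review = pad_or_truncate([5, 8, 2], maxlen=5)
long_review = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short_review)
print(long_review)
```

The pad value 0 matches the `<PAD>` index registered on the tokenizer, so every review ends up as a vector of exactly `MAX_SEQUENCE_LENGTH` token ids.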

In [None]:
VOCAB_SIZE = len(t.word_index)
EMBED_SIZE = 300
EPOCHS=100
BATCH_SIZE=32

In [None]:
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip

In [None]:
!unzip wiki-news-300d-1M.vec.zip

In [None]:
word2idx = t.word_index
FASTTEXT_INIT_EMBEDDINGS_FILE = './wiki-news-300d-1M.vec'


def load_pretrained_embeddings(word_to_index, max_features, embedding_size, embedding_file_path):    
    
    def get_coefs(word,*arr): 
        return word, np.asarray(arr, dtype='float32')
    
    embeddings_index = dict(get_coefs(*row.split(" ")) 
                                for row in open(embedding_file_path, encoding="utf8", errors='ignore') 
                                    if len(row)>100)

    # np.stack needs a real sequence, not a dict view
    all_embs = np.stack(list(embeddings_index.values()))
    emb_mean, emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]

    nb_words = min(max_features, len(word_to_index))
    embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embedding_size))
    
    for word, idx in word_to_index.items():
        if idx >= max_features: 
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: 
            embedding_matrix[idx] = embedding_vector

    return embedding_matrix

In [None]:
ft_embeddings = load_pretrained_embeddings(word_to_index=word2idx, 
                                           max_features=VOCAB_SIZE, 
                                           embedding_size=EMBED_SIZE, 
                                           embedding_file_path=FASTTEXT_INIT_EMBEDDINGS_FILE)
ft_embeddings.shape

In [None]:
# create the model
model = tf.keras.models.Sequential()

model.add(tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_SIZE,
                                    weights=[ft_embeddings],
                                    trainable=True,
                                    input_length=MAX_SEQUENCE_LENGTH))

model.add(tf.keras.layers.Conv1D(filters=256, kernel_size=4, padding='same', activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(pool_size=2))

model.add(tf.keras.layers.Conv1D(filters=128, kernel_size=4, padding='same', activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(pool_size=2))

model.add(tf.keras.layers.Conv1D(filters=64, kernel_size=4, padding='same', activation='relu'))
model.add(tf.keras.layers.MaxPooling1D(pool_size=2))

model.add(tf.keras.layers.Flatten())

model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

In [None]:
# Fit the model
es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', 
                                      patience=2,
                                      restore_best_weights=True,
                                      verbose=1)

model.fit(train_seqs, y_train, 
          validation_split=0.1,
          epochs=EPOCHS, 
          batch_size=BATCH_SIZE, 
          shuffle=True,
          callbacks=[es],
          verbose=1)

In [None]:
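# predict_classes was removed in newer TensorFlow releases; threshold the sigmoid output at 0.5 instead
predictions = (model.predict(test_seqs) > 0.5).astype('int32').ravel()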

print(classification_report(y_test, predictions))
pd.DataFrame(confusion_matrix(y_test, predictions))

# Experiment 11: Train Classifier with LSTM + FastText Embeddings & Evaluate Performance on Test Data

__Note:__ Skip FastText Embeddings part if it takes too much time to download or load it since it does consume a good amount of memory to load the pretrained embeddings.

In [None]:
LSTM_DIM = 256

# create the model
model = tf.keras.models.Sequential()

model.add(tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_SIZE,
                                    weights=[ft_embeddings],
                                    trainable=True,
                                    input_length=MAX_SEQUENCE_LENGTH))

#model.add(tf.keras.layers.LSTM(LSTM_DIM, return_sequences=True))
model.add(tf.keras.layers.LSTM(LSTM_DIM, return_sequences=False))

model.add(tf.keras.layers.Flatten())

model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

In [None]:
# Fit the model
es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', 
                                      patience=2,
                                      restore_best_weights=True,
                                      verbose=1)

model.fit(train_seqs, y_train, 
          validation_split=0.1,
          epochs=EPOCHS, 
          batch_size=BATCH_SIZE, 
          shuffle=True,
          callbacks=[es],
          verbose=1)

In [None]:
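# predict_classes was removed in newer TensorFlow releases; threshold the sigmoid output at 0.5 instead
predictions = (model.predict(test_seqs) > 0.5).astype('int32').ravel()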

print(classification_report(y_test, predictions))
pd.DataFrame(confusion_matrix(y_test, predictions))

# Experiment 12: Train Classifier with NNLM Universal Embedding Model

__Hint:__ This model should accept the pre-processed text directly (as shown in livecoding)


In [None]:
model = "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1"
hub_layer = hub.KerasLayer(model, output_shape=[128], input_shape=[], 
                           dtype=tf.string, trainable=True)

In [None]:
model = tf.keras.models.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Dense(256, activation='relu'))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

In [None]:
# Fit the model
es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', 
                                      patience=2,
                                      restore_best_weights=True,
                                      verbose=1)

model.fit(train_clean_text, y_train, 
          validation_split=0.1,
          epochs=EPOCHS, 
          batch_size=BATCH_SIZE, 
          shuffle=True,
          callbacks=[es],
          verbose=1)

In [None]:
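# predict_classes was removed in newer TensorFlow releases; threshold the sigmoid output at 0.5 instead
predictions = (model.predict(test_clean_text) > 0.5).astype('int32').ravel()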

print(classification_report(y_test, predictions))
pd.DataFrame(confusion_matrix(y_test, predictions))

# Experiment 13: Train Classifier with BERT

##### Note: You might need to restart the notebook environment on colab after installing the below library

In [None]:
!pip install transformers --ignore-installed

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
import tensorflow as tf
import tensorflow_hub as hub
import nltk
import matplotlib.pyplot as plt

df = pd.read_csv('https://github.com/dipanjanS/text-analytics-with-python/raw/master/media/Womens%20Clothing%20E-Commerce%20Reviews%20-%20NLP.csv', keep_default_na=False)
df['Review'] = (df['Title'].map(str) +' '+ df['Review Text']).apply(lambda row: row.strip())
df['Recommended'] = df['Recommended IND']
df = df[['Review', 'Recommended']]
df = df[df['Review'] != '']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['Recommended']), df['Recommended'], test_size=0.3, random_state=42)

import nltk
import contractions
import re
import tqdm


def normalize_document(doc):
    doc = doc.translate(doc.maketrans("\n\t\r", "   "))
    doc = doc.lower()
    doc = contractions.fix(doc)
    # replace special characters and digits with a space
    # (flags must be passed by keyword; the 4th positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z\s]', ' ', doc, flags=re.I|re.A)
    doc = re.sub(' +', ' ', doc)
    doc = doc.strip()  

    return doc

def normalize_corpus(docs):
    norm_docs = []
    for doc in tqdm.tqdm(docs):
        norm_doc = normalize_document(doc)
        norm_docs.append(norm_doc)

    return norm_docs

X_train['Clean Review'] = normalize_corpus(X_train['Review'].values)
X_test['Clean Review'] = normalize_corpus(X_test['Review'].values)

train_clean_text = X_train['Clean Review']
test_clean_text = X_test['Clean Review']

#### Train and Evaluate your BERT model using `transformers`

In [None]:
import transformers

In [None]:
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
def create_bert_input_features(tokenizer, docs, max_seq_length):
    
    all_ids, all_masks, all_segments= [], [], []
    for doc in tqdm.tqdm(docs, desc="Converting docs to features"):
        
        tokens = tokenizer.tokenize(doc)
        
        if len(tokens) > max_seq_length-2:
            tokens = tokens[0 : (max_seq_length-2)]
        tokens = ['[CLS]'] + tokens + ['[SEP]']
        ids = tokenizer.convert_tokens_to_ids(tokens)
        masks = [1] * len(ids)
        
        # Zero-pad up to the sequence length.
        while len(ids) < max_seq_length:
            ids.append(0)
            masks.append(0)
            
        segments = [0] * max_seq_length
        all_ids.append(ids)
        all_masks.append(masks)
        all_segments.append(segments)
        
    encoded = np.array([all_ids, all_masks, all_segments])
    
    return encoded

In [None]:
MAX_SEQ_LENGTH = 121

In [None]:
inp_id = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), dtype='int32', name="bert_input_ids")
inp_mask = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), dtype='int32', name="bert_input_masks")
inp_segment = tf.keras.layers.Input(shape=(MAX_SEQ_LENGTH,), dtype='int32', name="bert_segment_ids")
inputs = [inp_id, inp_mask, inp_segment]

hidden_state = transformers.TFBertModel.from_pretrained('bert-base-uncased')(inputs)
pooled_output = hidden_state[1]
dense1 = tf.keras.layers.Dense(256, activation='relu')(pooled_output)
drop1 = tf.keras.layers.Dropout(0.25)(dense1)
dense2 = tf.keras.layers.Dense(256, activation='relu')(drop1)
drop2 = tf.keras.layers.Dropout(0.25)(dense2)
output = tf.keras.layers.Dense(1, activation='sigmoid')(drop2)

model = tf.keras.Model(inputs=inputs, outputs=output)
model.compile(optimizer=tf.optimizers.Adam(learning_rate=2e-5, 
                                           epsilon=1e-08), 
              loss='binary_crossentropy', metrics=['accuracy'])

model.summary()

In [None]:
train_features_ids, train_features_masks, train_features_segments = create_bert_input_features(tokenizer, 
                                                                                               train_clean_text, 
                                                                                               max_seq_length=MAX_SEQ_LENGTH)

In [None]:
es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', 
                                      patience=1,
                                      restore_best_weights=True,
                                      verbose=1)
model.fit([train_features_ids, 
           train_features_masks, 
           train_features_segments], y_train, 
          validation_split=0.1,
          epochs=3, 
          batch_size=25, 
          callbacks=[es],
          shuffle=True,
          verbose=1)

In [None]:
test_features_ids, test_features_masks, test_features_segments = create_bert_input_features(tokenizer, 
                                                                                            test_clean_text, 
                                                                                            max_seq_length=MAX_SEQ_LENGTH)
print('Test Features:', test_features_ids.shape, test_features_masks.shape, test_features_segments.shape)

In [None]:
predictions = [1 if pr > 0.5 else 0 
                   for pr in model.predict([test_features_ids, 
                                            test_features_masks, 
                                            test_features_segments], verbose=0).ravel()]

print(classification_report(y_test, predictions))
pd.DataFrame(confusion_matrix(y_test, predictions))