Develop a Recurrent Neural Network Text Classification Model in Python.

In [7]:
#Code Block 1
import numpy as np
import pandas as pd
import tensorflow as tf

from sklearn.model_selection import train_test_split

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


We will use the women's clothing customer review dataset from Kaggle (https://www.kaggle.com/datasets/nicapotato/womens-ecommerce-clothing-reviews) to test whether a customer's review text can predict if that customer would recommend the product. We are specifically interested in how well we can predict that a customer would **not** recommend a product (Recommended IND = 0), and will build the model with this in mind.

In [8]:
#Code Block 2
!git clone https://github.com/tabishkhan72/nlp-customer-review-classification.git



fatal: destination path 'nlp-customer-review-classification' already exists and is not an empty directory.


In [9]:
#Code Block 3
import sys
sys.path.insert(0,'/content/nlp-customer-review-classification')

In [10]:
#Code Block 4
# import the clothing review dataset 
clothing_data = pd.read_csv(
    "/content/nlp-customer-review-classification/data/clothing_reviews.csv", 
    usecols=['Review Text', 'Recommended IND', 'Rating'], 
    dtype={'Review Text': str, 'Recommended IND': np.int64, 'Rating': np.int64}
)
clothing_data.rename(columns={"Review Text": "Text"}, inplace=True)
clothing_data.rename(columns={"Recommended IND": "Would Recommend"}, inplace=True)

len(clothing_data)

23486

In [11]:
#Code Block 5
# calculate total ratio of 'Would Recommend' to 'Would Not Recommend'
sum(clothing_data['Would Recommend'])/len(clothing_data['Would Recommend'])

0.8223622583666865

In [12]:
#Code Block 6
# view a snapshot of what our data looks like
clothing_data.head()

| | Text | Rating | Would Recommend |
|---|---|---|---|
| 0 | Absolutely wonderful - silky and sexy and comf... | 4 | 1 |
| 1 | Love this dress! it's sooo pretty. i happene... | 5 | 1 |
| 2 | I had such high hopes for this dress and reall... | 3 | 0 |
| 3 | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 |
| 4 | This shirt is very flattering to all due to th... | 5 | 1 |


We have imported the two variables we need (the review text and whether the customer would recommend the piece), along with the rating, which we may be able to use later to check whether another model would be more useful. We will now split the data into training, validation, and testing datasets: 20% of the original data is held out as test data, and 20% of the remaining training set is used as validation data (16% of the original). The import shows a total of 23,486 reviews, about 82.24% of which are 'Would Recommend' (0 = Would Not Recommend, 1 = Would Recommend).
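
The row counts produced by two successive 20% splits can be checked with a little arithmetic. The sketch below assumes that scikit-learn's `train_test_split` rounds a fractional `test_size` up to the next whole row and leaves the remainder on the training side; `split_sizes` is a hypothetical helper used only for illustration and is not part of the notebook.

```python
import math

def split_sizes(n_rows, test_fraction=0.2, val_fraction=0.2):
    """Return (train, val, test) row counts for two successive splits.

    Assumption: the held-out side of each split is rounded up.
    """
    n_test = math.ceil(n_rows * test_fraction)      # rows held out for testing
    n_remaining = n_rows - n_test                   # rows left after the first split
    n_val = math.ceil(n_remaining * val_fraction)   # validation rows taken from the remainder
    n_train = n_remaining - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(23486)
print("train, val, test rows:", n_train, n_val, n_test)
print("shares of the original data:",
      round(n_train / 23486, 2), round(n_val / 23486, 2), round(n_test / 23486, 2))
```

Under this assumption the final proportions are roughly 64% training, 16% validation, and 20% test.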

In [13]:
#Code Block 7
x = clothing_data['Text']
y = clothing_data['Would Recommend']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state = 123)  
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=0.2, random_state = 123) 

In [14]:

#Code Block 8
x_train.head()

13230    Super easy and cute. i received lots of compli...
4644     I really love this top. it looks great under a...
5992     I purchased the teal/blue version and the colo...
1791     I bought the grey and white plaid shirt in a l...
13652    Beautiful detail. sweater is warm, soft, and v...
Name: Text, dtype: object

In [15]:
#Code Block 9
sum(y_train)/len(y_train)

0.8216899534264803

In [16]:
#Code Block 10
sum(y_val)/len(y_val)

0.810803618946248

In [17]:
#Code Block 11
sum(y_test)/len(y_test)

0.8337590464027246

In [18]:
#Code Block 12
train_data = np.column_stack((x_train, y_train))
train_data = pd.DataFrame(train_data, columns = ['Text','Would Recommend'])

# make sure to reset data to necessary type
type_dict = {'Text': str,
             'Would Recommend': np.int64
                }
 
train_data = train_data.astype(type_dict)
train_data.head()

| | Text | Would Recommend |
|---|---|---|
| 0 | Super easy and cute. i received lots of compli... | 1 |
| 1 | I really love this top. it looks great under a... | 1 |
| 2 | I purchased the teal/blue version and the colo... | 1 |
| 3 | I bought the grey and white plaid shirt in a l... | 1 |
| 4 | Beautiful detail. sweater is warm, soft, and v... | 1 |


In [19]:
#Code Block 13
val_data = np.column_stack((x_val, y_val))
val_data = pd.DataFrame(val_data, columns = ['Text','Would Recommend'])
val_data = val_data.astype(type_dict)

In [20]:
#Code Block 14
test_data = np.column_stack((x_test, y_test))
test_data = pd.DataFrame(test_data, columns = ['Text','Would Recommend'])
test_data = test_data.astype(type_dict)

We will next clean our text data to remove unwanted noise such as URLs and emojis; punctuation is stripped later by the TensorFlow tokenizer's default filters. We will also remove stop words like 'a' or 'the' that add little meaning to our final model. Finally, we will reduce the words in our data to their stem, or base form, which may help improve our model further.
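
As a small illustration of the first two cleaning steps, the sketch below strips URLs and filters stop words from one made-up sentence using only the standard library. The stop-word set here is a tiny hand-written stand-in for the NLTK list, the sample sentence and helper names are hypothetical, and stemming is left out because it needs NLTK. Unlike a plain membership test, this version lowercases each word before the lookup, so capitalized stop words such as 'The' are also dropped.

```python
import re

# Same URL pattern as the one used in the notebook's remove_url function
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's English list)
SAMPLE_STOPWORDS = {"a", "the", "is", "and", "it", "at"}

def strip_urls(sentence):
    """Delete anything that looks like a URL."""
    return URL_PATTERN.sub('', sentence)

def drop_stopwords(sentence, stop_words):
    """Keep only words whose lowercase form is not a stop word."""
    return ' '.join(w for w in sentence.split() if w.lower() not in stop_words)

sample = "The dress is a dream, see it at https://example.com/dress and www.example.com"
cleaned = drop_stopwords(strip_urls(sample), SAMPLE_STOPWORDS)
print(cleaned)  # dress dream, see
```

The comma after "dream" survives on purpose: in this workflow punctuation is handled by the tokenizer rather than by the cleaning functions.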

In [21]:
#Code Block 15

# remove urls from text
def remove_url(sentence):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', sentence)

# remove any emojis from text
def remove_emoji(sentence):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    
    return emoji_pattern.sub(r'', sentence)

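# remove stopwords
# (the comparison is case-sensitive, so capitalized words such as 'The' are kept)
stop_words = set(stopwords.words('english'))  # build the lookup set once

def remove_stopwords(sentence):
    words = sentence.split()
    words = [word for word in words if word not in stop_words]
    
    return ' '.join(words)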

# reduce each word to its stem
stemmer = SnowballStemmer('english')

def stem_words(sentence):
    words = sentence.split()
    words = [stemmer.stem(word) for word in words ]
    
    return ' '.join(words)

def clean_text(data):
    data['Text'] = data['Text'].apply(lambda x : remove_url(x))
    data['Text'] = data['Text'].apply(lambda x : remove_emoji(x))
    data['Text'] = data['Text'].apply(lambda x : remove_stopwords(x))
    data['Text'] = data['Text'].apply(lambda x : stem_words(x))
    
    return data

In [22]:
#Code Block 16
# clean the text of our training, validation, and testing data
train_data = clean_text(train_data)
val_data = clean_text(val_data)
test_data = clean_text(test_data)

After cleaning our data, we will now tokenize each of the text values so that our model can process our input. To do this, we will use two functions: the first assigns an index number to each word in the dataset, and the second encodes each sentence as an array of those index numbers.
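
To make the two steps concrete, here is a minimal pure-Python sketch of the same idea: build a word index ordered by frequency, then encode and left-pad each sentence. It is a simplified stand-in for the Keras `Tokenizer` and `pad_sequences`, not their actual implementation; the helper names and sample sentences are hypothetical, and index 0 is reserved for padding.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode_and_pad(sentences, word_index):
    """Replace words by their index and left-pad with 0 to a common length."""
    encoded = [[word_index[w] for w in s.lower().split() if w in word_index]
               for s in sentences]
    max_len = max(len(seq) for seq in encoded)
    return [[0] * (max_len - len(seq)) + seq for seq in encoded]

reviews = ["love this dress", "great fit", "love the fit"]
word_index = build_word_index(reviews)
padded = encode_and_pad(reviews, word_index)

print(word_index)  # {'love': 1, 'fit': 2, 'this': 3, 'dress': 4, 'great': 5, 'the': 6}
print(padded)      # [[1, 3, 4], [0, 5, 2], [1, 6, 2]]
```

Padding on the left keeps the real words at the end of each sequence, which is the position a recurrent layer reads last.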

In [23]:
#Code Block 17
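# function used to assign index numbers to each word
# note: val_sentences is accepted but not used; the vocabulary is built
# from the training and test sentences only
def define_tokenizer(train_sentences, val_sentences, test_sentences):
    sentences = pd.concat([train_sentences, test_sentences])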
    
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(sentences)
    
    return tokenizer

# function used to encode each review into an array of index numbers
def encode(sentences, tokenizer):
    encoded_sentences = tokenizer.texts_to_sequences(sentences)
    encoded_sentences = tf.keras.preprocessing.sequence.pad_sequences(encoded_sentences)
    
    return encoded_sentences

In [24]:
#Code Block 18
# define and encode our text data
tokenizer = define_tokenizer(train_data['Text'], val_data['Text'], test_data['Text'])

encoded_train_reviews = encode(train_data['Text'], tokenizer)
encoded_val_reviews = encode(val_data['Text'], tokenizer)
encoded_test_reviews = encode(test_data['Text'], tokenizer)

In [25]:
#Code Block 19
# number of words in encoding dictionary
len(tokenizer.word_index)

12318

In [26]:
#Code Block 20

# view general information from tokenizer 
print('Lower: ', tokenizer.get_config()['lower'])
print('Split: ', tokenizer.get_config()['split'])
print('Filters: ', tokenizer.get_config()['filters'])

Lower:  True
Split:   
Filters:  !"#$%&()*+,-./:;<=>?@[\]^_`{|}~	



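So that our model starts from pretrained word representations rather than learning them from scratch, we will load the GloVe embeddings (the 100-dimensional `glove.6B.100d.txt` file) and align them with our own encoding: each word index from our tokenizer is mapped to its GloVe vector, and words without a GloVe entry keep an all-zero vector.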
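
The alignment step can be sketched on toy data. The example below parses a tiny in-memory file in the GloVe text layout (a word followed by its vector components) and fills one matrix row per tokenizer index. The vectors, the three-word vocabulary, and the 3-dimensional size are all made up for illustration, and plain lists are used in place of NumPy arrays to keep the sketch dependency-free.

```python
import io

EMBEDDING_DIM = 3  # toy size; the GloVe file used in this notebook has 100 dimensions

# Hypothetical stand-in for a GloVe text file: "word v1 v2 v3" per line
toy_glove = io.StringIO(
    "dress 0.1 0.2 0.3\n"
    "love 0.4 0.5 0.6\n"
)

# Hypothetical tokenizer vocabulary; 'sooo' has no pretrained vector
word_index = {"love": 1, "dress": 2, "sooo": 3}

# Parse the file into {word: vector}
pretrained = {}
for line in toy_glove:
    word, *values = line.split()
    pretrained[word] = [float(v) for v in values]

# Row i holds the vector of the word with index i; row 0 is reserved for padding
matrix = [[0.0] * EMBEDDING_DIM for _ in range(len(word_index) + 1)]
for word, i in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        matrix[i] = vector

# Share of the vocabulary that received a pretrained vector
coverage = sum(word in pretrained for word in word_index) / len(word_index)
print(matrix)
print(f"coverage: {coverage:.2f}")
```

Checking coverage on the real vocabulary may be worthwhile here, since stemmed tokens will not always match the surface forms stored in a pretrained embedding file.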

In [38]:
#Code Block 21

from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [39]:
#Code Block 22

embedding_dict = {}

# the with block closes the file automatically when it ends
with open("/content/drive/MyDrive/DATASET/glove.6B.100d.txt",'r', encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vectors = np.asarray(values[1:],'float32')
        embedding_dict[word] = vectors

In [40]:
#Code Block 23

num_words = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((num_words, 100))

for word, i in tokenizer.word_index.items():
    if i >= num_words:  # skip any index that would fall outside the matrix
        continue
    
    emb_vec = embedding_dict.get(word)
    
    if emb_vec is not None:
        embedding_matrix[i] = emb_vec

We will now make sure to convert our text data into the native TensorFlow data format, so that we can maximize TensorFlow functionality. 

In [41]:
#Code Block 24

# convert training data into native TensorFlow format
tf_data = tf.data.Dataset.from_tensor_slices((encoded_train_reviews, train_data['Would Recommend'].values))

We will now define the pipeline for our reformatted training data, along with reformatting and defining a pipeline for our validation data. 
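
One detail worth understanding in such a pipeline is the shuffle buffer: when the buffer is smaller than the dataset, the shuffle is only approximate. The pure-Python generator below is a simplified model of buffer-based shuffling as described in the `tf.data` documentation (fill a buffer, emit a random element, refill from the input); `buffered_shuffle` is a hypothetical name and this is not TensorFlow's implementation.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                # emit a random buffered element
        buffer[idx] = item               # replace it with the next input element
    rng.shuffle(buffer)                  # input exhausted: drain what is left
    yield from buffer

data = list(range(20))
small_buffer = list(buffered_shuffle(data, buffer_size=3))
no_shuffle = list(buffered_shuffle(data, buffer_size=1))

print(small_buffer)  # a permutation in which elements stay close to their original position
print(no_shuffle)    # buffer of size 1 leaves the order unchanged
```

With a buffer of 3, the element emitted at output position `j` can only come from the first `j + 3` inputs, which is why a buffer at least as large as the dataset is needed for a uniform shuffle.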

In [42]:
#Code Block 25

# define pipeline
def pipeline(tf_data, buffer_size=100, batch_size=32):
    tf_data = tf_data.shuffle(buffer_size)
    tf_data = tf_data.padded_batch(batch_size, padded_shapes=([None],[]))
    # prefetch last so that whole batches are prepared while the model trains
    tf_data = tf_data.prefetch(tf.data.experimental.AUTOTUNE)
    
    return tf_data

tf_data = pipeline(tf_data, buffer_size=1000, batch_size=32)

In [43]:
#Code Block 26

# convert validation data to native TensorFlow format
tf_val_data = tf.data.Dataset.from_tensor_slices((encoded_val_reviews, val_data['Would Recommend'].values))

# define pipeline
def val_pipeline(tf_data, batch_size=1):        
    tf_data = tf_data.padded_batch(batch_size, padded_shapes=([None],[]))
    # prefetch last so that whole batches are prepared ahead of time
    tf_data = tf_data.prefetch(tf.data.experimental.AUTOTUNE)
    
    return tf_data

tf_val_data = val_pipeline(tf_val_data, batch_size=len(val_data))

After reformatting our data, we can now define our model. We first define an embedding layer so that our model can gain an understanding of a word's meaning, then an RNN layer (a bidirectional LSTM) so that our model can begin to build relationships between words. Finally, a Dense layer outputs whether our text indicates a recommendation or not.

In [44]:
#Code Block 27

# define the embedding, RNN & Dense layers of the model, using LSTM as our RNN model of choice in this instance

embedding = tf.keras.layers.Embedding(
    len(tokenizer.word_index) + 1,
    100,
    embeddings_initializer = tf.keras.initializers.Constant(embedding_matrix),
    trainable = True
)
model = tf.keras.Sequential([
    embedding,
    tf.keras.layers.SpatialDropout1D(0.2),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
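
To get a feel for the size of this architecture, the trainable parameters of each layer can be estimated by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights, and one bias vector) and uses the vocabulary size of 12,318 words reported by the tokenizer; the helper functions are hypothetical, and `model.summary()` remains the authoritative check.

```python
def embedding_params(vocab_size, dim):
    """One vector of length `dim` per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units

vocab_size = 12318 + 1                 # word index size plus the padding index 0
embed = embedding_params(vocab_size, 100)
bilstm = 2 * lstm_params(100, 128)     # forward and backward LSTM
dense = dense_params(2 * 128, 1)       # the two directions are concatenated
total = embed + bilstm + dense

print(f"embedding: {embed:,}")   # embedding: 1,231,900
print(f"bi-LSTM:   {bilstm:,}")  # bi-LSTM:   234,496
print(f"dense:     {dense:,}")   # dense:     257
print(f"total:     {total:,}")   # total:     1,466,653
```

Most of the parameters sit in the embedding layer, which is one reason initializing it from pretrained vectors can matter on a dataset of this size.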

In [45]:
#Code Block 28
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(0.0001),
    metrics=['accuracy', 'Precision', 'Recall']
)

# lower the learning rate, and eventually stop training, once the training loss stops improving
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor='loss', patience=2, verbose=1),
    tf.keras.callbacks.EarlyStopping(monitor='loss', patience=4, verbose=1),
]
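
The patience rule behind early stopping can be sketched in a few lines. This is a simplified model of the idea (stop once the monitored value has failed to improve for `patience` consecutive epochs), not the Keras implementation, and the loss values are invented for illustration. Note that the callbacks in this notebook monitor the training loss; monitoring `val_loss` instead is a common alternative when the goal is to limit overfitting.

```python
def epochs_run(losses, patience):
    """Return the number of epochs completed before early stopping triggers."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, wait = loss, 0     # improvement: remember it and reset the counter
        else:
            wait += 1                # no improvement this epoch
            if wait >= patience:
                return epoch         # patience exhausted: stop here
    return len(losses)               # never triggered: all epochs ran

# Hypothetical loss curve that plateaus after epoch 3 and improves again at epoch 8
toy_losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.46, 0.48, 0.44]

print(epochs_run(toy_losses, patience=4))  # 7
print(epochs_run(toy_losses, patience=5))  # 8
```

With a patience of 4 the run stops at epoch 7 and misses the later improvement, which shows how the choice of patience trades training time against the chance of leaving a plateau.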

After defining and compiling our model, we will now fit our model over 10 epochs on our training dataset. 

In [46]:
#Code Block 29

model.fit(
    tf_data, 
    validation_data = tf_val_data,
    epochs = 10,
    callbacks = callbacks
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff67102aee0>

To evaluate our model, we will compute its F1 score on our validation data to check that it is performing adequately before testing.

In [47]:
#Code Block 30
metrics = model.evaluate(tf_val_data)

precision = metrics[2]
recall = metrics[3]
f1 = 2 * (precision * recall) / (precision + recall)

print('F1 score: ' + str(f1)) 

F1 score: 0.928683631301736


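From our F1 score of about 0.929 on the validation data, we can be fairly confident that our model will perform well on our test data. Note that Keras computes precision and recall for the positive class (Would Recommend = 1), so this score does not directly measure how well the model identifies customers who would not recommend a product. We will now create a new pipeline definition for our test data, then use our model to predict whether each review implies a recommendation or not.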
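
For context, it can help to compare this F1 score with a trivial baseline. The sketch below computes the positive-class F1 of a hypothetical classifier that always predicts 'Would Recommend': its recall is 1, and its precision equals the share of positive labels, taken here from the validation ratio printed earlier (about 0.811). The helper name is an assumption for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Share of 'Would Recommend' labels in the validation set, as printed above
val_positive_rate = 0.810803618946248

# An always-positive classifier: every positive is found (recall = 1),
# and the fraction of its predictions that are right is the positive rate
baseline_f1 = f1_score(precision=val_positive_rate, recall=1.0)
print(round(baseline_f1, 3))
```

On heavily skewed data, this baseline is already high, so the gap between the model's F1 and the baseline is a more informative figure than the F1 score on its own.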

In [48]:
#Code Block 31
tf_test_data = tf.data.Dataset.from_tensor_slices((encoded_test_reviews))

def test_pipeline(tf_data, batch_size=1):        
    tf_data = tf_data.padded_batch(batch_size, padded_shapes=([None]))
    # prefetch last so that whole batches are prepared ahead of time
    tf_data = tf_data.prefetch(tf.data.experimental.AUTOTUNE)
    
    return tf_data

tf_test_data = test_pipeline(tf_test_data)

In [49]:
#Code Block 32
predictions = model.predict(tf_test_data)



In [50]:
#Code Block 33
predictions = np.concatenate(predictions).round().astype(int)

In [51]:
#Code Block 34
compare = np.column_stack((y_test, predictions))
compare = pd.DataFrame(compare, columns = ['Test Recommendation','Predicted Recommendation'])

In [52]:
#Code Block 35
# determine percentage of correct predictions
len(compare.query('`Test Recommendation` == `Predicted Recommendation`'))/len(predictions)

0.8888888888888888

In [53]:
#Code Block 36
# determine ratio of false positives to total incorrect predictions
incorrect_prediction = compare.query('`Test Recommendation` != `Predicted Recommendation`')
false_recommendation = incorrect_prediction.query('`Test Recommendation` == 0')   
len(false_recommendation)/len(incorrect_prediction)

0.7701149425287356
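
The ratios printed so far are enough to reconstruct an approximate confusion matrix and, from it, metrics for the 'Would Not Recommend' class. The sketch below is a back-of-the-envelope reconstruction under two assumptions: that the test set has 4,698 rows (20% of 23,486, rounded up), and that the accuracy, the false-positive share, and the test-set positive rate shown above all come from the same run.

```python
# Values copied from the outputs above; n_test is an assumption (20% of 23,486, rounded up)
n_test = 4698
accuracy = 0.8888888888888888
false_positive_share = 0.7701149425287356   # share of errors that are false recommendations
positive_rate = 0.8337590464027246          # share of 'Would Recommend' in the test labels

n_wrong = n_test - round(n_test * accuracy)
fp = round(n_wrong * false_positive_share)  # actual 0, predicted 1
fn = n_wrong - fp                           # actual 1, predicted 0
n_positive = round(n_test * positive_rate)
n_negative = n_test - n_positive
tp = n_positive - fn
tn = n_negative - fp

# Metrics with 'Would Not Recommend' (label 0) treated as the class of interest
neg_recall = tn / n_negative
neg_precision = tn / (tn + fn)
neg_f1 = 2 * neg_precision * neg_recall / (neg_precision + neg_recall)

print("tp, fp, fn, tn:", tp, fp, fn, tn)
print("negative-class precision, recall, F1:",
      round(neg_precision, 3), round(neg_recall, 3), round(neg_f1, 3))
```

If these assumptions hold, the reconstruction suggests that the model recovers only about half of the 'Would Not Recommend' reviews, which is the class this project cares about most.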

From our predictions, we can see that we have correctly predicted about 88.89% of the test reviews, which looks high at first glance. However, of our incorrect predictions, about 77.01% are cases where the model predicted that a customer would recommend a piece of clothing when they would not. This means that, while the model is fairly accurate overall, much of that accuracy comes from the strong skew in our data towards positive recommendations. For comparison, always predicting 'Would Recommend' would already be correct for about 83.38% of the test set, since that is the share of positive labels in it. Therefore, while our model does beat this baseline, there is clear room for improvement. One potential improvement is to weight the negative reviews more heavily, without creating unwanted bias or attaching negative sentiment to words that do not necessarily call for it.
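
One common way to give the minority class more weight is to scale each class inversely to its frequency. The sketch below computes such "balanced" weights from a list of labels; the helper name and the toy labels are hypothetical, and this is only one possible weighting scheme, not one evaluated in this notebook. A dictionary of this form can be passed to Keras through the `class_weight` argument of `model.fit`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Toy labels with a 4:1 skew towards the positive class
toy_labels = [1, 1, 1, 1, 0]
weights = balanced_class_weights(toy_labels)
print(weights)  # {1: 0.625, 0: 2.5}
```

With these weights, each misclassified negative example contributes four times as much to the loss as a misclassified positive one, so both classes carry the same total weight.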