## Preparation

This section covers the initial steps of the project:

- Importing Packages
- Importing Data
- Basic Visualisation
- Tokenisation
- Adding the Tokenised Strings to the DataFrame

### Importing Packages

In [1]:

#Data management
import pandas as pd
import numpy as np
import re
#from pandas_profiling import ProfileReport

#TextBlob Features
from textblob import TextBlob

#Plotting
import matplotlib.pyplot as plt

#SciKit-Learn
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

#nltk
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

#Tensorflow / Keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

#Test
from collections import Counter
import bby
import bby.util as ut

np.random.seed(0)

### Importing Data

In [2]:

Drop_Columns = ['Location', 'Workforce', 'NPS® Breakdown', 'NPSCommentCleaned',
                'NPSCommentPolarity', 'NPSCommentSubjectivity', 'OverallCommentCleaned',
                'OverallCommentPolarity', 'OverallCommentSubjectivity']
Sentiment_Columns = ['NPSCommentPolarity', 'NPSCommentSubjectivity', 'OverallCommentPolarity', 'OverallCommentSubjectivity']

# National NPS extract processed by CleanNPS_National.py - this cleans and lemmatises the NPS and Overall
# comments.
# 

all_path = "../data/clean/NPS_NATL_subset.csv"
raw_df = pd.read_csv(all_path)


# Drop everything but what we're going to use
all_df = raw_df.drop(['Location', 'Workforce', 'NPS® Breakdown',
                      'NPSCommentPolarity', 'NPSCommentSubjectivity', 'OverallCommentCleaned',
                      'OverallCommentPolarity', 'OverallCommentSubjectivity'], axis=1)

print(all_df.shape)
# print(all_df.head(5))


(26703, 5)


### Basic Visualisation

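We can display basic statistics about the data using pandas, and also view a few entries of the dataset, to see example points with which we'll work. In the cell below, we also copy the numeric "NPS_Code" column into a new "Sentiment" column, which is the label we will predict.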

In [3]:
#Copy the numeric "NPS_Code" column into a "Sentiment" label column
all_df["Sentiment"] = all_df['NPS_Code'].copy()
all_df = all_df.drop(['NPS_Code'], axis=1)
all_df.head()
print(all_df.shape)

(26703, 5)


### Tokenisation

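We fit a Keras "Tokenizer" on the cleaned NPS comments to build a vocabulary, then encode each comment as a fixed-length binary vector that marks which vocabulary words it contains. Because no "num_words" limit is set here, every word in the comments is part of the vocabulary.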
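
The encoding step can be illustrated without Keras. The sketch below is a simplified stand-in, not the Keras implementation: the helper names `fit_vocab` and `to_binary_matrix` and the toy comments are assumptions made for illustration. It leaves index 0 unused, as the Keras "Tokenizer" does, which is why the matrix has one more column than there are words.

```python
from collections import Counter

def fit_vocab(texts):
    """Map each word to an integer index, most frequent first; index 0 is left unused."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_binary_matrix(texts, vocab):
    """One row per text, one column per vocabulary index; 1.0 marks a word that is present."""
    rows = []
    for text in texts:
        row = [0.0] * (len(vocab) + 1)
        for word in text.lower().split():
            if word in vocab:  # words unseen during fitting are ignored
                row[vocab[word]] = 1.0
        rows.append(row)
    return rows

# Toy data, for illustration only
toy_comments = ["quick and knowledgeable", "quick and helpful staff"]
toy_vocab = fit_vocab(toy_comments)
toy_matrix = to_binary_matrix(toy_comments, toy_vocab)
print(toy_vocab)
print(toy_matrix)
```

Each row has the same length regardless of how long the comment is, which is what lets the rows be fed to a network with a fixed input size.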

In [4]:
def str_it(_ls):
    #Coerce to str (this also handles NaN), tokenise with nltk, then re-join with single spaces
    return " ".join(word_tokenize(str(_ls)))
    

#Define the Tokeniser
tokeniser = Tokenizer()
nps_list = all_df['NPSCommentCleaned'].apply(str_it)
data_words = list(ut.sent_to_words(nps_list))


#Re-join each list of words into a single string per comment
data = [ut.detokenize(words) for words in data_words]
print(data[:5])

#Create the corpus by finding the most common words
tokeniser.fit_on_texts(data)

#Encode the cleaned NPS comments as a binary bag-of-words matrix
nps_tokens = tokeniser.texts_to_matrix(data)


['staff in store in person close by when need them', 'adieb anbari was beyond helpful he answered all my questions got me in and out of there as fast as possible and even entertained my year old by answering all her questions and making boat out of paper for her', 'quick and knowledgeable', 'he called back quickly within minutes and was very good at explaining the reason for our issue', 'had really good experience thanks to your tech named ricky']


In [5]:
from sklearn.feature_extraction.text import CountVectorizer
# initialize
#cv = CountVectorizer(stop_words='english') 
#cv_matrix = cv.fit_transform([Vocab_str])
# create document term matrix

#df_dtm = pd.DataFrame(cv_matrix.toarray(), index=all_df['NPSCommentCleaned'].values, columns=cv.get_feature_names())
#print(df_dtm.shape)
#df_dtm.head()

#print("The total wordcount dict: ", tokeniser.document_count)
#print(Vocab_str)



### Adding the Tokenised Strings to the DataFrame

Currently, the tokens are contained in a matrix named "nps_tokens". We want to combine these back into the DataFrame containing the rest of the data. This is done below, and we then check that it worked by comparing the number of columns with that of the original DataFrame and token matrix.

In [6]:
#
# New stuff
# 
# print(all_df.shape)
# print(nps_tokens.shape)

# use Counter to count the words in our NPS data
#

from bby.util import nps_freqs

vocab = Counter()
Topwords = dict()

nps_stringlist = raw_df['NPSCommentCleaned'].values.tolist()

Freqs = nps_freqs(nps_stringlist, 30)
Freqs.plot(30)
Freqs.tabulate(20)


ImportError: cannot import name 'nps_freqs' from 'bby.util' (/Users/suchanek/repos/npsML/bby/bby/util.py)

In [7]:
tokeniser.word_counts
#print(tokeniser.num_words)

OrderedDict([('staff', 780),
             ('in', 6886),
             ('store', 2520),
             ('person', 991),
             ('close', 65),
             ('by', 1019),
             ('when', 3244),
             ('need', 688),
             ('them', 1289),
             ('adieb', 5),
             ('anbari', 1),
             ('was', 18180),
             ('beyond', 159),
             ('helpful', 2459),
             ('he', 2831),
             ('answered', 254),
             ('all', 2080),
             ('my', 14810),
             ('questions', 576),
             ('got', 1492),
             ('me', 6839),
             ('and', 20004),
             ('out', 1658),
             ('of', 6073),
             ('there', 1830),
             ('as', 1780),
             ('fast', 525),
             ('possible', 90),
             ('even', 1047),
             ('entertained', 1),
             ('year', 223),
             ('old', 754),
             ('answering', 57),
             ('her', 271),
             ('mak

In [8]:
#Combining the dataframe with the tokens using pd.concat
all_df = pd.concat([all_df, pd.DataFrame(nps_tokens)], sort=False, axis=1)
all_df.shape

(26703, 13563)

## Final Data Preparation

The data is now almost ready for a model to be trained on it, but a few final preparations are needed. For example, we need to drop the columns that we don't plan to use, such as the "NPSCommentCleaned" column, whose useful information has already been extracted into the token columns.

We also split the data into a training and test set, such that we can evaluate our model's performance without touching the held-out data. We do this because if we continually test against this held-out data, it loses its usefulness as unseen "real-world" data.

Sections under this header include:
- Dropping Unused Data
- Test-Train Split

### Dropping Unused Data

We drop non-useful columns from the DataFrame here, such as "NPSCommentCleaned", whose useful information has already been extracted into the token columns. We also remove the "y" or dependent variable ("Sentiment") here, so we don't accidentally train on it.

In [9]:
# Remove dependent variable
# Columns: Location, Workforce, NPS® Breakdown, NPS_Code, NPSCommentCleaned, NPSCommentLemmatised,
# NPSCommentPolarity, NPSCommentSubjectivity, OverallCommentCleaned, OverallCommentLemmatised,
# OverallCommentPolarity, OverallCommentSubjectivity, Sentiment
y = all_df["Sentiment"]

all_df = all_df.drop(columns=['NPSCommentCleaned', 'Sentiment'])

all_df.head()
print(all_df.shape)
input_size = all_df.shape[1]


(26703, 13561)


### Test-Train Split

Here, we use SciKit-Learn's inbuilt function to split our data into a test set and a train set, with the appropriate labels. We use a constant random state to make this replicable.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(all_df, y, test_size=0.2, random_state=1)
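
What the split does can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: `simple_train_test_split` is a hypothetical helper and the toy rows are made up. It shows the two properties relied on above: rows and labels stay paired, and a fixed seed gives the same split every time.

```python
import random

def simple_train_test_split(rows, labels, test_size=0.2, seed=1):
    """Shuffle row indices with a fixed seed, then cut off the first test_size fraction as the test set."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(rows) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Same return order as train_test_split: X_train, X_test, y_train, y_test
    return pick(rows, train_idx), pick(rows, test_idx), pick(labels, train_idx), pick(labels, test_idx)

# Toy data: 10 rows whose label is 10 times the row value, so pairing is easy to check
toy_rows = list(range(10))
toy_labels = [r * 10 for r in toy_rows]
toy_split = simple_train_test_split(toy_rows, toy_labels, test_size=0.2, seed=1)
print([len(part) for part in toy_split])
```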

## Model Construction and Training

Finally, it is time to construct our model. In this case, we use a neural network constructed with Keras. We then train it with our data in the training dataset, and validate using the test datasets.

Sections under this header include:
- Model Construction
- Training

### Model Construction

Here, we define the neural network that we will train to predict the output. This model is constructed with the following layers:
1. Dense
2. Dropout

The Dense layers are fully-connected layers. This means that every neuron in one layer is connected to every neuron in the next layer, and each connection scales the value it passes on by an associated weight. These weights are learned during training.
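
As a rough illustration of a single fully-connected layer, the sketch below computes a weighted sum per output neuron followed by a ReLU activation. The helper names, the weight layout (one row of weights per output neuron) and all of the numbers are assumptions chosen for readability; Keras stores and computes these differently internally.

```python
def relu(values):
    """ReLU activation: negative values become 0."""
    return [max(0.0, v) for v in values]

def dense_forward(inputs, weights, biases):
    """weights[j][i] is the weight on the connection from input i to output neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

# Toy layer with 3 inputs and 2 output neurons
toy_inputs = [1.0, 0.0, 1.0]
toy_weights = [[0.5, -1.0, 0.25], [-1.0, 2.0, -0.5]]
toy_biases = [0.0, 0.5]
toy_outputs = relu(dense_forward(toy_inputs, toy_weights, toy_biases))
print(toy_outputs)
```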

The Dropout layer randomly sets a fraction of its inputs (here 10%) to zero on each training step. This stops the network from relying too heavily on any single neuron, which helps to reduce overfitting in larger neural networks.
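
A minimal sketch of how dropout behaves, assuming the common "inverted dropout" formulation in which the surviving values are scaled up by 1 / (1 - rate) during training and nothing is changed at inference time. The function and its arguments are hypothetical names for illustration, not the Keras API.

```python
import random

def dropout(activations, rate, training, rng):
    """Zero each activation with probability `rate` while training; pass values through otherwise."""
    if not training or rate == 0.0:
        return list(activations)
    keep_scale = 1.0 / (1.0 - rate)  # keeps the expected sum of activations unchanged
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]

rng = random.Random(0)
toy_activations = [2.0] * 8
print(dropout(toy_activations, rate=0.5, training=True, rng=rng))   # a mix of 0.0 and 4.0
print(dropout(toy_activations, rate=0.5, training=False, rng=rng))  # unchanged
```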


In [11]:
#Test model
import datetime
log_dir = "../logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(12, input_dim=input_size, activation='relu'),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(12, activation='relu'),
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])
model.compile(
     loss='sparse_categorical_crossentropy',
     optimizer='adam',
     metrics=['accuracy']
)

# [print(i.shape, i.dtype) for i in model.inputs]
# [print(o.shape, o.dtype) for o in model.outputs]
# [print(l.name, l.input_shape, l.dtype) for l in model.layers]

Metal device set to: Apple M1 Pro

systemMemory: 16.00 GB
maxCacheSize: 5.33 GB



2022-07-06 20:15:04.893091: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-07-06 20:15:04.893281: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


### Training

Next, we fit this model to our data using backpropagation, for up to 50 epochs; early stopping ends training sooner if the training accuracy has not improved for 5 consecutive epochs. We can view the change in accuracy across epochs on both the training and test datasets.
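
The patience rule behind early stopping can be sketched as a small loop. This is a simplified illustration of the idea, assuming a metric where higher is better and no minimum improvement threshold; `epochs_run` and the accuracy values are made up for the example.

```python
def epochs_run(metric_per_epoch, patience):
    """Return how many epochs run before training stops, for a metric where higher is better."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, value in enumerate(metric_per_epoch, start=1):
        if value > best:
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # stop: no improvement for `patience` epochs in a row
    return len(metric_per_epoch)

# Accuracy peaks at epoch 3 and never beats it in the next 5 epochs, so training stops at epoch 8
toy_accuracy = [0.5, 0.6, 0.7, 0.7, 0.69, 0.7, 0.68, 0.7, 0.9]
print(epochs_run(toy_accuracy, patience=5))
```

Note that the later value of 0.9 is never reached, which is the trade-off a patience setting makes.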

In [12]:
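from tensorflow.keras.callbacks import ModelCheckpoint
#Save the best full model seen so far, checking once per epoch rather than after every batch
checkpoint = ModelCheckpoint("best_model_bow.hdf5", monitor='accuracy', verbose=0, save_best_only=True, mode='auto', save_freq='epoch', save_weights_only=False)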

X_train = np.asarray(X_train).astype('float32')
y_train = np.asarray(y_train).astype('float32')
X_test = np.asarray(X_test).astype('float32')
y_test = np.asarray(y_test).astype('float32')

h = model.fit(
     X_train, y_train,
     validation_data=(X_test, y_test),
     epochs=50,
     callbacks=[tf.keras.callbacks.EarlyStopping(monitor='accuracy', patience=5), tensorboard_callback, checkpoint]
)

ValueError: could not convert string to float: 'Before even looking at my device or seeing what was wrong with it, the employee started listing off prices.'
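
The error above indicates that at least one text column is still present in the feature matrix when it is cast to float32. One possible way to guard against this, sketched here on a toy DataFrame, is to keep only the numeric columns before the cast. The column name "RawComment" and the values are hypothetical; which columns actually remain in "all_df" is not shown in the output above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the feature DataFrame: one leftover text column plus two token columns
toy_df = pd.DataFrame({
    "RawComment": ["Great service.", "Too slow!"],  # hypothetical leftover text column
    0: [0.0, 1.0],
    1: [1.0, 1.0],
})

# List the offending columns, then keep only the numeric ones
non_numeric = toy_df.select_dtypes(exclude="number").columns.tolist()
features = toy_df.select_dtypes(include="number")

# The cast now succeeds because no strings remain
X = np.asarray(features).astype("float32")
print(non_numeric, X.shape, X.dtype)
```

Printing the non-numeric column names first makes it clear what is being discarded, rather than silently dropping data.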

## Model Evaluation

Now that we've trained the model, we can view its accuracy with a confusion matrix. This shows how the predictions are distributed for comments of each true class. From this, we might see that the model predicts certain classes better than others.
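
To make the normalisation concrete, the sketch below builds a confusion matrix by hand and divides each row by its total, so that each row shows the fraction of a true class assigned to each predicted class. The helper name and the "neg" / "pos" labels are placeholders for illustration, not the labels used by this project.

```python
def normalised_confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes; each non-empty row sums to 1."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

# Toy predictions: 1 of 2 "neg" and 3 of 4 "pos" are classified correctly
toy_true = ["neg", "neg", "pos", "pos", "pos", "pos"]
toy_pred = ["neg", "pos", "pos", "pos", "pos", "neg"]
toy_cm = normalised_confusion_matrix(toy_true, toy_pred, labels=["neg", "pos"])
print(toy_cm)
```

The diagonal entries are the per-class recall, which is why a row-normalised matrix makes weaker classes easy to spot even when the classes are imbalanced.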

In [None]:
#Generate predictions
y_pred = np.argmax(model.predict(X_test), axis=1)

#Assign labels to predictions and test data
y_pred_labels = ut.ids_to_names(y_pred)
y_test_labels = ut.ids_to_names(y_test)

In [None]:
#Sort the labels so the matrix rows and columns appear in a stable order
y_unique = sorted(set(y_test_labels))
cm = confusion_matrix(y_test_labels, y_pred_labels, labels = y_unique, normalize='true')

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=y_unique)
disp.plot()

## Training with Full Dataset

Now that we are happy with our model, we can train using the full dataset, and predict the held-out test data. This involves performing all of our transformation steps on both this training dataset and the held-out test data. Luckily, we can reuse the code from above to achieve this, so little further explanation is required.

In [None]:
#Use the full dataset!
# df = train_df

# the test dataframe was loaded earlier and is named test_df

### Basic Data Enrichment

In [None]:
#One-hot encode using Pandas' get_dummies()

##Train
onehot = pd.get_dummies(all_df["Entity"], prefix="Entity")

#Join these new columns back into the DataFrame
df = all_df.join(onehot)


##Test
onehot = pd.get_dummies(test_df["Entity"], prefix="Entity")

test_df = test_df.join(onehot)

In [None]:
#Enrich using TextBlob's built in sentiment analysis
##Train
df["Polarity"], df["Subjectivity"] = ut.tb_enrich(list(df["Tweet_Content"]))


##Test
test_df["Polarity"], test_df["Subjectivity"] = ut.tb_enrich(list(test_df["Tweet_Content"]))

In [None]:
#Convert the "Sentiment" column into indexes

##Train
df["Sentiment"] = ut.names_to_ids(all_df["Sentiment"])
y = df["Sentiment"]

##Test
test_df["Sentiment"] = ut.names_to_ids(test_df["Sentiment"])
y_test = test_df["Sentiment"]

### NLP Data Enrichment

In [None]:
#Tokenisation

#Define the Tokeniser
tokeniser = Tokenizer(num_words=1000, lower=True)

#Create the corpus by finding the most common words
tokeniser.fit_on_texts(all_df["Tweet_Content_Split"])

##Train
#Tokenise our column of edited Tweet content
nps_tokens = tokeniser.texts_to_matrix(list(all_df["Tweet_Content_Split"]))

##Test
#Tokenise our column of edited Tweet content
nps_tokens_test = tokeniser.texts_to_matrix(list(test_df["Tweet_Content_Split"]))

In [None]:
#Combining the dataframe with the tokens using pd.concat

#Reset axes to avoid overlapping
df.reset_index(drop=True, inplace=True)
test_df.reset_index(drop=True, inplace=True)

##Train
full_df = pd.concat([df, pd.DataFrame(nps_tokens)], sort=False, axis=1)

##Test
full_test_df = pd.concat([test_df, pd.DataFrame(nps_tokens_test)], sort=False, axis=1)

In [None]:
#Final prep

##Train
#Drop all non-useful columns
full_df = full_df.drop(["Sentiment", "Tweet_ID", "Tweet_Content", "Tweet_Content_Split", "Entity"], axis=1)


##Test
full_test_df = full_test_df.drop(["Sentiment", "Tweet_ID", "Tweet_Content", "Tweet_Content_Split", "Entity"], axis=1)

### Model Definition and Training

This time, we train on all of the available training data.

In [None]:
model = tf.keras.models.Sequential([
    #Take the input width from the data rather than hard-coding it
    tf.keras.layers.Dense(12, input_dim=full_df.shape[1], activation='relu'),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(50, activation='relu'),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(12, activation='relu'),
    tf.keras.layers.Dense(50, activation='relu'),
    #Softmax gives one probability per class, as in the earlier model
    tf.keras.layers.Dense(4, activation='softmax')
])
model.compile(
     loss='sparse_categorical_crossentropy',
     optimizer='adam',
     metrics=['accuracy']
)

In [None]:
h = model.fit(
     full_df, y,
     epochs=30,
     callbacks=[tf.keras.callbacks.EarlyStopping(monitor='accuracy', patience=5)]
)

## Final Model Evaluation

In [None]:
#Generate predictions
y_pred = np.argmax(model.predict(full_test_df), axis=1)

#Assign labels to predictions and test data
y_pred_labels = ut.ids_to_names(y_pred)
y_test_labels = ut.ids_to_names(y_test)

In [None]:
#Sort the labels so the matrix rows and columns appear in a stable order
y_unique = sorted(set(y_test_labels))
cm = confusion_matrix(y_test_labels, y_pred_labels, labels = y_unique, normalize='true')

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=y_unique)
disp.plot()

In [None]:
#To see the final accuracy
accuracy_score(y_test, y_pred)

In [None]:
#Original
model2 = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(10000, 12, input_length=50),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(20)),
    tf.keras.layers.Dense(4, activation='softmax')
])

#Compile model2 (not the earlier "model") before fitting it
model2.compile(
     loss='sparse_categorical_crossentropy',
     optimizer='adam',
     metrics=['accuracy']
)

In [None]:
h = model2.fit(
     full_df, y,
     validation_data=(full_test_df, y_test),
     epochs=30,
     callbacks=[tf.keras.callbacks.EarlyStopping(monitor='accuracy', patience=5)]
)