# Big Data Content Analytics - AUEB

## Comment Analysis

**Importing libraries**

In [208]:

import pandas as pd
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image
from IPython.core.display import HTML 

#%matplotlib inline

from pandas import read_excel

In [209]:
df = pd.read_csv("/Users/sakis/Desktop/The_Dataset",sep="~")
df.head()

| | Comment | Category |
|---|---|---|
| 0 | The device came back in a worse condition when... | Repair Experience/Availability |
| 1 | reply | Support Efficiency |
| 2 | Your employees are always friendly. However, o... | Support Efficiency |
| 3 | Been told the correct information to start wit... | Information Provided |
| 4 | By having better service support someone who u... | Information Provided |



In short, the dataset consists of 27649 comments, and each one of them belongs to one category.
The dataset contains no null values, as the output of df.info() confirms.

In [210]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27649 entries, 0 to 27648
Data columns (total 2 columns):
Comment     27649 non-null object
Category    27649 non-null object
dtypes: object(2)
memory usage: 432.1+ KB


Now let's look at the frequency of each category.

In [211]:
category_counts = Counter(df['Category'])
category_counts

Counter({'Repair Experience/Availability': 1508,
         'Support Efficiency': 5179,
         'Information Provided': 3550,
         'Warranty Coverage': 1185,
         'Product/Spares Quality': 2171,
         'Issue Resolution': 4931,
         'Agent Contact Skills': 2251,
         'Positive Verbatim': 1801,
         'Price': 291,
         'Promotion Conditions': 371,
         'Website / Store Experience': 1933,
         '3rd Party Complaint': 163,
         'Product /Spare Availability': 2080,
         'Other': 217,
         'issue Resolution': 1,
         'other': 9,
         'price': 1,
         'positive verbatim': 4,
         'issue resolution': 1,
         '3rd party complaint': 1,
         'promotion conditions': 1})

The "Category" column mixes upper- and lower-case spellings of the same label (for example 'Other' and 'other'), so we convert all labels to lower case.

In [212]:
df["Category"] = df["Category"].str.lower()
category_counts = Counter(df['Category'])
category_counts

Counter({'repair experience/availability': 1508,
         'support efficiency': 5179,
         'information provided': 3550,
         'warranty coverage': 1185,
         'product/spares quality': 2171,
         'issue resolution': 4933,
         'agent contact skills': 2251,
         'positive verbatim': 1805,
         'price': 292,
         'promotion conditions': 372,
         'website / store experience': 1933,
         '3rd party complaint': 164,
         'product /spare availability': 2080,
         'other': 226})

Four of the 14 categories (price, promotion conditions, 3rd party complaint and other) contain far fewer comments than the other ten. As a first attempt we will keep all 14 categories and see how well the model responds.
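To put a number on the imbalance, the share of each category can be computed from the counts printed above. The sketch below copies those counts by hand; the variable names are our own.

```python
# Category counts copied from the Counter output after lower-casing.
category_counts = {
    'repair experience/availability': 1508,
    'support efficiency': 5179,
    'information provided': 3550,
    'warranty coverage': 1185,
    'product/spares quality': 2171,
    'issue resolution': 4933,
    'agent contact skills': 2251,
    'positive verbatim': 1805,
    'price': 292,
    'promotion conditions': 372,
    'website / store experience': 1933,
    '3rd party complaint': 164,
    'product /spare availability': 2080,
    'other': 226,
}

total = sum(category_counts.values())

# Share of each category in percent, smallest first.
shares = sorted(
    ((name, 100 * count / total) for name, count in category_counts.items()),
    key=lambda pair: pair[1],
)

smallest_four = [name for name, _ in shares[:4]]
smallest_four_share = sum(share for _, share in shares[:4])

for name, share in shares:
    print('{:<32}{:6.2f} %'.format(name, share))

print('Four smallest categories together: {:.2f} %'.format(smallest_four_share))
```

Together the four smallest categories cover less than 4 % of the comments, which is worth keeping in mind when reading the overall accuracy later on.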

In [213]:
top_categories = category_counts.most_common()[:14]


The comments will be the input for the model and the categories will be the output.

In [214]:
X = df['Comment']

In [215]:
y = df['Category']

In [216]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import StratifiedShuffleSplit


Now it is time to split the dataset into three parts: a training dataset, a validation dataset and a test dataset. StratifiedShuffleSplit generates n_splits independent random splits. For each of them it shuffles the data and divides it into a train part and a test part of size test_size, preserving the class proportions in both parts. Because every split is drawn independently, the test sets of different splits can overlap.
First, we will split the original dataset into two pieces: a train-validation dataset and a test dataset (15% of the data).
Second, we will split the train-validation dataset into another two pieces: a training dataset and a validation dataset (20% of the train-validation data).


In [217]:
test_sss = StratifiedShuffleSplit(n_splits=5,
                                  test_size=0.15,
                                  random_state=0)

In [218]:
val_sss = StratifiedShuffleSplit(n_splits=5, 
                                 test_size=0.2,
                                 random_state=0)
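The cell that applies test_sss to create X_train_val, y_train_val, X_test and y_test is not shown in this notebook. The sketch below runs the same two-stage split on toy data, assuming it follows the pattern of the loop used for the validation split. The toy arrays and the choice of n_splits=1 are our own assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data (hypothetical): 100 samples, two classes with a 70/30 ratio.
X = np.arange(100)
y = np.array([0] * 70 + [1] * 30)

test_sss = StratifiedShuffleSplit(n_splits=1, test_size=0.15, random_state=0)
val_sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

# First split: (train + validation) vs. test.
train_val_idx, test_idx = next(test_sss.split(X, y))
X_train_val, y_train_val = X[train_val_idx], y[train_val_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Second split: train vs. validation, inside the train + validation part.
train_idx, val_idx = next(val_sss.split(X_train_val, y_train_val))
X_train, y_train = X_train_val[train_idx], y_train_val[train_idx]
X_val, y_val = X_train_val[val_idx], y_train_val[val_idx]

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())  # class-1 share per split
```

With n_splits=5, as in the cells above, a loop over split() overwrites its variables on every iteration, so only the last of the five splits is kept.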

In [219]:
X_train_val.head()  

0    In this case, everything was fine, but I have ...
1        No aftercare or loyalty towards its customers
2    The very service from you has been excellent. ...
3                               Employ qualified staff
4                                Answered the question
Name: Comment, dtype: object

We reset the indexes of both X_train_val and y_train_val, so that the positional indices returned by the next split match the index labels when we break them into two subsets again.


In [220]:
X_train_val = X_train_val.reset_index(drop=True)
y_train_val = y_train_val.reset_index(drop=True)

We split the train-val dataset again, this time into training and validation datasets.

In [221]:

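X_train, X_val, y_train, y_val = None, None, None, None

# The loop visits all n_splits splits and overwrites the variables each time,
# so only the last split is kept. iloc selects rows by position.
for train_index, val_index in val_sss.split(X_train_val, y_train_val):
    X_train, X_val = X_train_val.iloc[train_index], X_train_val.iloc[val_index]
    y_train, y_val = y_train_val.iloc[train_index], y_train_val.iloc[val_index]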

The next step is to encode the labels (categories) using a one-hot encoder.
First we run fit_transform on the training data, and then we use the fitted encoder to transform the validation and test data.



In [222]:
y_enc = OneHotEncoder(sparse=False)

y_train_enc = y_enc.fit_transform(y_train.values.reshape(-1, 1))

y_val_enc = y_enc.transform(y_val.values.reshape(-1, 1))
y_test_enc = y_enc.transform(y_test.values.reshape(-1, 1))
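The fit-then-transform pattern can be seen on a handful of labels. This is a minimal sketch with made-up label arrays; it uses the encoder's default sparse output and converts it with toarray(), which is our own choice for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy labels (hypothetical), standing in for y_train and y_val.
toy_train = np.array(['price', 'other', 'price', 'warranty coverage'])
toy_val = np.array(['other', 'warranty coverage'])

encoder = OneHotEncoder()

# fit_transform learns the category list from the training labels only.
# The result is a sparse matrix, so we convert it to a dense array.
train_enc = encoder.fit_transform(toy_train.reshape(-1, 1)).toarray()

# transform reuses the learned column order for the other split.
val_enc = encoder.transform(toy_val.reshape(-1, 1)).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(val_enc)

# inverse_transform maps one-hot rows back to label strings.
print(encoder.inverse_transform(val_enc).ravel())
```

Because the columns follow the sorted category list learned during fit, the same column always stands for the same category in every split.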


After constructing the three datasets we can check their shapes. The training dataset consists of 18800 rows, while the validation and test datasets consist of 4701 and 4148 rows respectively.

In [223]:

print('y_train shape: {}'.format(y_train_enc.shape))
print('y_val shape: {}'.format(y_val_enc.shape))
print('y_test shape: {}'.format(y_test_enc.shape))


y_train shape: (18800, 14)
y_val shape: (4701, 14)
y_test shape: (4148, 14)


### Bag of Words Approach (BoW)


A bag-of-words (BoW) model is a way of extracting features from text for use in modeling. The approach is simple and flexible, and the features can be built in several ways (for example word counts, binary presence flags or TF-IDF weights). It represents a document by the occurrence of words within it and involves two things: a vocabulary of known words and a measure of the presence of those words. The model is only concerned with whether known words occur in the document, not where in the document they occur.
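The two ingredients can be illustrated in plain Python before moving to scikit-learn. This is a minimal sketch on two made-up sentences; the helper names bow_counts and bow_binary are hypothetical.

```python
from collections import Counter

# Two toy documents (hypothetical examples, not taken from the dataset).
docs = ['the repair was slow', 'the agent was friendly and the reply was fast']

# 1. Vocabulary of known words, built from all documents.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def bow_counts(doc):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


def bow_binary(doc):
    """Mark each vocabulary word as present (1) or absent (0)."""
    return [int(count > 0) for count in bow_counts(doc)]


# 2. Measure of presence: full counts or binary flags.
print(vocabulary)
print(bow_counts(docs[1]))
print(bow_binary(docs[1]))

# Word order is ignored: a shuffled sentence gives the same vector.
print(bow_counts('slow was the repair') == bow_counts(docs[0]))
```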

In [224]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [225]:
comments = df['Comment'].str.lower().str.replace('.', ' ', regex=False)  # literal '.', not a regex
comments.head()

0    the device came back in a worse condition when...
1                                                reply
2    your employees are always friendly  however, o...
3    been told the correct information to start wit...
4    by having better service support someone who u...
Name: Comment, dtype: object

We will concatenate all the comments into one text and split it into tokens. Then we will find the most frequent ones. Below we can see the 15 most frequent words across all the comments and how many times each one of them appears in the dataset.

In [226]:
all_words = " ".join(comments)
top_words = Counter(all_words.split()).most_common()
top_words[:15]

[('the', 41566),
 ('i', 21916),
 ('to', 21045),
 ('a', 18517),
 ('not', 15027),
 ('and', 14717),
 ('was', 11984),
 ('of', 9739),
 ('for', 9139),
 ('that', 8932),
 ('my', 8489),
 ('have', 8400),
 ('it', 8243),
 ('is', 7948),
 ('in', 7900)]

When we deal with text problems, stop word removal is one of the most important preprocessing steps for giving the models a cleaner input. Stop words are words that are very common in a language. They carry little information for most tasks, such as semantic analysis or classification. For that reason, we will print the 100 most common words in order to pick those that we want to include in our stop word list.

In [227]:
print(sorted([i[0].lower() for i in top_words[:100]]))

['-', '2', 'a', 'about', 'after', 'again', 'all', 'also', 'an', 'and', 'answer', 'are', 'as', 'at', 'back', 'be', 'because', 'been', 'better', 'but', 'buy', 'by', 'can', 'contact', 'could', 'customer', 'device', 'did', 'do', 'even', 'first', 'for', 'from', 'get', 'good', 'had', 'has', 'have', 'help', 'i', 'if', 'in', 'information', 'is', 'it', 'just', 'machine', 'me', 'more', 'my', 'new', 'no', 'not', 'nothing', 'now', 'of', 'on', 'one', 'only', 'or', 'order', 'out', 'part', 'parts', 'philips', 'problem', 'product', 'products', 'question', 'received', 'repair', 'replacement', 'send', 'sent', 'service', 'should', 'so', 'spare', 'still', 'that', 'the', 'then', 'there', 'they', 'this', 'time', 'to', 'very', 'warranty', 'was', 'we', 'were', 'what', 'when', 'which', 'will', 'with', 'would', 'you', 'your']


Our stop list will contain most of the words above. We leave out words that may be important for the model, such as "good" or "problem". The final list contains the following words:

In [228]:
el_stop = ['-', '2', 'a', 'about', 'after', 'again', 'all', 'also', 'an', 'and',
           'are', 'as', 'at', 'back', 'be', 'because', 'been',
           'buy', 'by', 'can', 'could', 'customer', 
           'did', 'do', 'even', 'first', 'for', 'from', 'get', 'had', 
           'has', 'have', 'i', 'if', 'in', 'is', 'it',
           'just', 'me', 'more', 'my', 'new', 'no', 'now', 'of', 
           'on', 'one', 'only', 'or', 'out', 'part', 'parts',
           'send', 'sent', 'service', 'should', 'so', 
           'still', 'than', 'that', 'the', 'then', 'there', 'they', 'this', 'time', 
           'to', 'very', 'was', 'we', 'were', 'what', 'when', 'which', 
           'will', 'with', 'would', 'you', 'your']
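The effect of a stop list on the word frequencies can be checked on a small scale. The sketch below uses three made-up comments and a shortened, hypothetical stop list rather than el_stop itself.

```python
from collections import Counter

# Toy comments and a shortened stop list (both hypothetical).
toy_comments = ['the repair was not good',
                'the problem is not solved',
                'i have a problem']
toy_stop = {'the', 'was', 'is', 'i', 'have', 'a'}

tokens = ' '.join(toy_comments).split()

# Word frequencies before and after dropping the stop words.
all_counts = Counter(tokens)
kept_counts = Counter(token for token in tokens if token not in toy_stop)

print(all_counts.most_common(3))
print(kept_counts.most_common(3))
```

Words such as "not" and "problem" stay in the counts because they are not on the stop list, which mirrors the decision to keep potentially informative words out of el_stop.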


We will limit the vocabulary used for vectorization to the 3000 most frequent words. The "comments_vect_counts" vectorizer returns the number of appearances of each word, while the "comments_vect_binary" vectorizer returns 1 if the word appears and 0 if it does not.
 


In [229]:
max_words = 3000

comments_vect_counts = CountVectorizer(encoding='utf-8',
                                     strip_accents='unicode',
                                     lowercase=True,
                                     stop_words=el_stop,
                                     ngram_range=(1, 1), # unigrams
                                     max_features=max_words,
                                     binary=False # binary output or full counts. 
                                     )

comments_vect_binary = CountVectorizer(encoding='utf-8',
                                     strip_accents='unicode',
                                     lowercase=True,
                                     stop_words=el_stop,
                                     ngram_range=(1, 1), # unigrams
                                     max_features=max_words,
                                     binary=True # binary output or full counts. 
                                     )
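The difference between the two vectorizers shows up as soon as a word is repeated. This is a minimal sketch on made-up documents; it keeps only the binary parameter and leaves all other CountVectorizer settings at their defaults.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (hypothetical).
train_docs = ['good good service', 'bad service']
new_docs = ['good reply']

vect_counts = CountVectorizer(binary=False)
vect_binary = CountVectorizer(binary=True)

counts = vect_counts.fit_transform(train_docs).toarray()
binary = vect_binary.fit_transform(train_docs).toarray()

# vocabulary_ maps each word to its column index.
print(sorted(vect_counts.vocabulary_.items(), key=lambda item: item[1]))
print(counts)
print(binary)

# Words that were not seen during fit ('reply') are ignored by transform.
new_counts = vect_counts.transform(new_docs).toarray()
print(new_counts)
```

The repeated word "good" is counted twice by the first vectorizer and flagged once by the second; everything else is identical.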

In this step we fit the CountVectorizer only on the training dataset and use it to transform the Validation and Test sets.


In [230]:
X_train_enc = comments_vect_counts.fit_transform(X_train.astype(str))
X_val_enc = comments_vect_counts.transform(X_val.astype(str))
X_test_enc = comments_vect_counts.transform(X_test.astype(str))

In [231]:
X_train_enc


<18800x3000 sparse matrix of type '<class 'numpy.int64'>'
	with 203290 stored elements in Compressed Sparse Row format>

### The model

In [232]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras import metrics
from tensorflow.keras.utils import plot_model
import pydot

First of all, we set some variables that will be used for the model: the number of classes for the y labels (14), the number of epochs (30), the batch size of the data fed to the model during training (32) and the dropout rate of the Dropout layers (40%).


In [233]:
nb_classes = len(y_enc.categories_[0])
nb_epoch = 30
batch_size = 32 
dropout_rate = 0.4

We will create a sequential model. In a sequential model each layer takes as input the output of the previous layer added to the model.


In [234]:
model = Sequential()

model.add(Dense(512, input_shape=(max_words,)))

model.add(Activation('relu'))

model.add(Dropout(dropout_rate))

model.add(Dense(512))

model.add(Activation('relu'))

model.add(Dropout(dropout_rate))

model.add(Dense(nb_classes))

model.add(Activation('softmax'))
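To make the chaining of layers concrete, the forward pass of this architecture can be written out in NumPy. This is only a sketch with random weights and a fake input vector, not the trained Keras model; the helper names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)


def dense(x, n_out):
    """Fully connected layer with random weights (for illustration only)."""
    weights = rng.normal(scale=0.01, size=(x.shape[1], n_out))
    bias = np.zeros(n_out)
    return x @ weights + bias


def relu(x):
    return np.maximum(x, 0.0)


def softmax(x):
    # Subtract the row maximum for numerical stability.
    exp = np.exp(x - x.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


# One fake bag-of-words vector with 3000 features.
x = rng.integers(0, 2, size=(1, 3000)).astype(float)

# Each step consumes the output of the previous one, as in Sequential.
# Dropout is left out because it is only active during training.
h1 = relu(dense(x, 512))
h2 = relu(dense(h1, 512))
probs = softmax(dense(h2, 14))

print(h1.shape, h2.shape, probs.shape)
print(probs.sum())
```

The softmax at the end turns the 14 output values into probabilities that sum to 1, one per category.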


We can see some information about the model's layers:

In [235]:
model.summary()  # prints the table itself; wrapping it in print() would also print None

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_9 (Dense)              (None, 512)               1536512   
_________________________________________________________________
activation_9 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 512)               262656    
_________________________________________________________________
activation_10 (Activation)   (None, 512)               0         
_________________________________________________________________
dropout_7 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_11 (Dense)             (None, 14)                7182      
__________
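The parameter counts in the summary can be verified by hand: a Dense layer with n inputs and m units has n * m weights plus m biases. A small sketch, with a helper name of our own:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


layer_sizes = [3000, 512, 512, 14]  # input, two hidden layers, output

params = [dense_params(n_in, n_out)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

print(params)
print(sum(params))
```

The three values match the Param # column of the Dense layers, while the Activation and Dropout layers add no parameters.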

In order to compile the model we will use categorical cross-entropy as the loss function, Adam as the optimizer and "accuracy" as a metric.

In [236]:
model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=[
        'accuracy'
    ],
)
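For a single sample with a one-hot label, categorical cross-entropy reduces to the negative logarithm of the probability assigned to the true class, while accuracy only checks whether the most probable class is the right one. The sketch below works through three made-up predictions; the function names are our own, and Keras additionally averages the loss over the batch.

```python
import math


def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


def is_correct(y_true, y_pred):
    """Accuracy counts a sample as correct when the argmax positions match."""
    return y_true.index(max(y_true)) == y_pred.index(max(y_pred))


y_true = [0, 1, 0]            # one-hot label: class 1
confident = [0.1, 0.8, 0.1]   # most of the mass on the true class
hesitant = [0.3, 0.4, 0.3]    # still correct, but less certain
wrong = [0.7, 0.2, 0.1]       # highest probability on the wrong class

for name, y_pred in [('confident', confident),
                     ('hesitant', hesitant),
                     ('wrong', wrong)]:
    loss = categorical_crossentropy(y_true, y_pred)
    print(name, round(loss, 3), is_correct(y_true, y_pred))
```

The first two predictions count equally towards accuracy but differ in loss, which is why the two quantities can move in different directions during training.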

The model is now ready to be trained

In [237]:
history = model.fit(
    X_train_enc,            # features (sparse count matrix)
    y_train_enc,            # labels
    epochs=nb_epoch,        # number of epochs
    batch_size=batch_size,  # define batch size
    verbose=2,              # print one line per epoch
    validation_data=(
        X_val_enc,          # the validation split that we did before
        y_val_enc
    ))

Train on 18800 samples, validate on 4701 samples
Epoch 1/30
 - 25s - loss: 1.5590 - acc: 0.5098 - val_loss: 1.1700 - val_acc: 0.6311
Epoch 2/30
 - 22s - loss: 0.8534 - acc: 0.7384 - val_loss: 1.0682 - val_acc: 0.6841
Epoch 3/30
 - 23s - loss: 0.5281 - acc: 0.8366 - val_loss: 1.1358 - val_acc: 0.7003
Epoch 4/30
 - 19s - loss: 0.3393 - acc: 0.8945 - val_loss: 1.2840 - val_acc: 0.7081
Epoch 5/30
 - 19s - loss: 0.2436 - acc: 0.9252 - val_loss: 1.3944 - val_acc: 0.7088
Epoch 6/30
 - 18s - loss: 0.1926 - acc: 0.9383 - val_loss: 1.5418 - val_acc: 0.7128
Epoch 7/30
 - 18s - loss: 0.1590 - acc: 0.9486 - val_loss: 1.6113 - val_acc: 0.7156
Epoch 8/30
 - 17s - loss: 0.1435 - acc: 0.9540 - val_loss: 1.7328 - val_acc: 0.7169
Epoch 9/30
 - 17s - loss: 0.1257 - acc: 0.9595 - val_loss: 1.7753 - val_acc: 0.7137
Epoch 10/30
 - 17s - loss: 0.1187 - acc: 0.9615 - val_loss: 1.9144 - val_acc: 0.7116
Epoch 11/30
 - 18s - loss: 0.1108 - acc: 0.9651 - val_loss: 1.9744 - val_acc: 0.7156
Epoch 12/30
 - 18s - loss

On the test set the model reaches an accuracy of 71.842 %, as the evaluation below shows.

In [238]:
score = model.evaluate(
    X_test_enc.todense(),    
    y_test_enc,              
    batch_size=batch_size,  
    verbose=2  
)

print('\nTest categorical_crossentropy: {}'.format(score[0]))
print('\nTest accuracy: {:.3f} %'.format(score[1]*100))


 - 1s - loss: 2.4343 - acc: 0.7184

Test categorical_crossentropy: 2.4342753984889045

Test accuracy: 71.842 %


### Predictions

Let's now use our model to make some predictions. We will try to classify four new comments.

In [239]:
input_comment = """
you should treat your customers with more respect
"""

In [240]:
# custom prediction function 
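def get_one_hot_predictions(pred_probs):
    """
    Turn class probabilities of shape (n_samples, n_classes) into one-hot rows
    with a 1 at the most probable class (ties yield several 1s in a row).
    """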
    max_probs = np.max(pred_probs, axis=1)
    
    # reshaping to (len_of_predicts, 1)
    max_probs = max_probs.reshape(max_probs.shape[0], 1)

    return np.equal(pred_probs, max_probs).astype(float)

In [241]:
# vectorizing comment with Count Vectorizer
comments_vect = comments_vect_counts.transform([input_comment])

print(comments_vect.shape, end='\n\n')

comments_pred = model.predict_proba(comments_vect)

print('Probabilities', end='\n\n')
print(comments_pred, end='\n\n')


comments_pred_hot = get_one_hot_predictions(comments_pred)

print('Probabilities One Hot Vector', end='\n\n')
print(comments_pred_hot, end='\n\n')

print('Category Prediction:', end='\n\n')
print(y_enc.inverse_transform(comments_pred_hot)[0][0])

(1, 3000)

Probabilities

[[7.8205051e-09 9.9996042e-01 5.0657007e-07 7.6840382e-07 3.8032923e-08
  1.6107725e-07 9.8454396e-12 1.4838255e-07 4.9604137e-10 1.4961180e-09
  1.3154876e-06 3.0278037e-05 6.2538120e-06 7.3398032e-10]]

Probabilities One Hot Vector

[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

Category Prediction:

agent contact skills
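The decoding step can be reproduced by hand from the probabilities printed above. The sketch assumes that the encoder columns follow the sorted list of the 14 lower-cased labels, which is how OneHotEncoder orders its categories_ by default; the list itself is typed in here rather than read from y_enc.

```python
import numpy as np

# Assumption: the 14 lower-cased labels in sorted order, which is how
# OneHotEncoder arranges its categories_ by default.
categories = sorted([
    'repair experience/availability', 'support efficiency',
    'information provided', 'warranty coverage', 'product/spares quality',
    'issue resolution', 'agent contact skills', 'positive verbatim',
    'price', 'promotion conditions', 'website / store experience',
    '3rd party complaint', 'product /spare availability', 'other',
])

# Probabilities copied from the output above.
probs = np.array([[7.8205051e-09, 9.9996042e-01, 5.0657007e-07, 7.6840382e-07,
                   3.8032923e-08, 1.6107725e-07, 9.8454396e-12, 1.4838255e-07,
                   4.9604137e-10, 1.4961180e-09, 1.3154876e-06, 3.0278037e-05,
                   6.2538120e-06, 7.3398032e-10]])

# Index of the most probable class and the label stored at that position.
best = int(np.argmax(probs, axis=1)[0])
print(best, categories[best], probs[0, best])
```

Taking the argmax directly is equivalent to building the one-hot vector and calling inverse_transform, as long as there are no ties.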


In [242]:
input_comment = """
my coffee machine was not working properly. you need to replace it

"""

In [243]:
# vectorizing comment with Count Vectorizer
comments_vect = comments_vect_counts.transform([input_comment])

print(comments_vect.shape, end='\n\n')

comments_pred = model.predict_proba(comments_vect)

print('Probabilities', end='\n\n')
print(comments_pred, end='\n\n')


comments_pred_hot = get_one_hot_predictions(comments_pred)

print('Probabilities One Hot Vector', end='\n\n')
print(comments_pred_hot, end='\n\n')

print('Category Prediction:', end='\n\n')
print(y_enc.inverse_transform(comments_pred_hot)[0][0])

(1, 3000)

Probabilities

[[8.0171067e-06 3.7088193e-05 1.8261351e-06 1.9022897e-04 2.8009591e-08
  3.2227317e-06 2.1956212e-06 1.6302232e-05 2.4610418e-01 3.6110265e-08
  7.4970257e-01 1.7178725e-06 3.9324574e-03 1.9985058e-07]]

Probabilities One Hot Vector

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]

Category Prediction:

repair experience/availability


In [244]:
input_comment = """
i had to repeat myself many times because the person i talked to did not understand what i was saying"""

In [245]:
comments_vect = comments_vect_counts.transform([input_comment])

print(comments_vect.shape, end='\n\n')

comments_pred = model.predict_proba(comments_vect)

print('Probabilities', end='\n\n')
print(comments_pred, end='\n\n')


comments_pred_hot = get_one_hot_predictions(comments_pred)

print('Probabilities One Hot Vector', end='\n\n')
print(comments_pred_hot, end='\n\n')

print('Category Prediction:', end='\n\n')
print(y_enc.inverse_transform(comments_pred_hot)[0][0])

(1, 3000)

Probabilities

[[4.5612730e-21 1.0000000e+00 1.0968228e-17 1.9419117e-16 1.6112674e-21
  6.2411744e-24 3.9289934e-34 2.7189185e-27 5.9850743e-22 3.8446253e-22
  6.9290071e-19 9.5610242e-10 1.3797056e-20 7.4135892e-23]]

Probabilities One Hot Vector

[[0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

Category Prediction:

agent contact skills


In [246]:
input_comment = """
i had to sent three e mails before you answer"""

In [247]:
comments_vect = comments_vect_counts.transform([input_comment])

print(comments_vect.shape, end='\n\n')

comments_pred = model.predict_proba(comments_vect)

print('Probabilities', end='\n\n')
print(comments_pred, end='\n\n')


comments_pred_hot = get_one_hot_predictions(comments_pred)

print('Probabilities One Hot Vector', end='\n\n')
print(comments_pred_hot, end='\n\n')

print('Category Prediction:', end='\n\n')
print(y_enc.inverse_transform(comments_pred_hot)[0][0])

(1, 3000)

Probabilities

[[1.7313271e-08 2.0742184e-05 1.9657342e-05 4.6481141e-03 5.6552727e-09
  3.9227157e-08 8.5542858e-11 3.7096447e-06 2.8665161e-07 5.1006445e-08
  2.1594708e-06 9.9450231e-01 2.9879510e-09 8.0292067e-04]]

Probabilities One Hot Vector

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]]

Category Prediction:

support efficiency


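As we can see, the predicted categories are plausible for all four comments. Four hand-picked examples are only anecdotal, though, so the test accuracy of 71.842 % remains the more reliable measure of the model's performance.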