### Guided Annotation tool - LSTM
This notebook walks through preparing the data that the guided annotation tool takes as input.
The tool presents unlabelled data as explainable clusters, so that similar examples can be labelled together.
The notebook covers the following steps:

    1. Load the dataset
    2. Train a model and explain it (an LSTM in this case)
    3. Perform SHAP clustering
    4. Save the clusters to the database, with the keywords the annotation tool should highlight

In [2]:
import os
files = os.listdir("C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/bin")
files = [i for i in files if i.endswith('.dll') and i.startswith('cu')]


In [3]:
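import ctypes
# Preload the CUDA DLLs so TensorFlow can find them; skip any that fail to load
cuda_bin = "C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.0\\bin\\"
for file in files:
    try:
        ctypes.WinDLL(cuda_bin + file)
    except OSError:
        pass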



## 1. Install dependencies and imports

In [4]:
!pip install --user -U tensorflow-gpu==1.15.0 --force-reinstall 

Collecting tensorflow-gpu==1.15.0
  Using cached tensorflow_gpu-1.15.0-cp37-cp37m-win_amd64.whl (294.5 MB)
Collecting six>=1.10.0
  Using cached six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting numpy<2.0,>=1.16.0
  Using cached numpy-1.20.3-cp37-cp37m-win_amd64.whl (13.6 MB)
Collecting wrapt>=1.11.1
  Using cached wrapt-1.12.1-cp37-cp37m-win_amd64.whl
Collecting tensorflow-estimator==1.15.1
  Using cached tensorflow_estimator-1.15.1-py2.py3-none-any.whl (503 kB)
Collecting absl-py>=0.7.0
  Using cached absl_py-0.12.0-py3-none-any.whl (129 kB)
Collecting keras-applications>=1.0.8
  Using cached Keras_Applications-1.0.8-py3-none-any.whl (50 kB)
Collecting termcolor>=1.1.0
  Using cached termcolor-1.1.0-py3-none-any.whl
Collecting google-pasta>=0.1.6
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting opt-einsum>=2.3.2
  Using cached opt_einsum-3.3.0-py3-none-any.whl (65 kB)
Collecting grpcio>=1.8.6
  Using cached grpcio-1.38.0-cp37-cp37m-win_amd64.whl (3.1 MB)
Collect




  Attempting uninstall: importlib-metadata
    Found existing installation: importlib-metadata 4.5.0
    Uninstalling importlib-metadata-4.5.0:
      Successfully uninstalled importlib-metadata-4.5.0
  Attempting uninstall: cached-property
    Found existing installation: cached-property 1.5.2
    Uninstalling cached-property-1.5.2:
      Successfully uninstalled cached-property-1.5.2
  Attempting uninstall: wheel
    Found existing installation: wheel 0.36.2
    Uninstalling wheel-0.36.2:
      Successfully uninstalled wheel-0.36.2
  Attempting uninstall: werkzeug
    Found existing installation: Werkzeug 2.0.1
    Uninstalling Werkzeug-2.0.1:
      Successfully uninstalled Werkzeug-2.0.1
  Attempting uninstall: setuptools
    Found existing installation: setuptools 57.0.0
    Uninstalling setuptools-57.0.0:
      Successfully uninstalled setuptools-57.0.0
  Attempting uninstall: protobuf
    Found existing installation: protobuf 3.17.3
    Uninstalling protobuf-3.17.3:
      Success

In [5]:
import json
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean, cosine

import nltk
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score, homogeneity_score, v_measure_score, completeness_score
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
import plotly
import plotly.graph_objs as go
# import chart_studio.plotly as py
import seaborn as sns
import shap


from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.text import Tokenizer


pd.set_option('display.max_colwidth', 1000)

In [6]:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 13817535035596038136
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 3186409472
locality {
  bus_id: 1
  links {
  }
}
incarnation: 16401184304559122423
physical_device_desc: "device: 0, name: Quadro M1200, pci bus id: 0000:01:00.0, compute capability: 5.0"
]


### Built-in function usage
To avoid rewriting the same code for each dataset and model, shared helper functions live in the `models` module (folder).
This tutorial uses that Python module.

In [7]:
import os
os.chdir('../')  # move up one directory so the `models` and `app` packages are importable
from models.trainers import Trainer
from app.utils import clear_labels
from models.guided_learning import GuidedLearner

## 2. Load dataset

In [8]:
df = pd.read_csv('datasets/davidson_dataset.csv') # substitute other datasets in similar format
print(df.shape)
df.head(10)

(5233, 4)


Unnamed: 0,index,label,text,processed
0,1,0,as a woman you should not complain about cleaning up your house as a man you should always take the garbage out,as a woman you should not complain about cleaning up your house as a man you should always take the garbage out
1,2,0,momma said no cats inside my doghouse,momma said no cats inside my doghouse
2,3,0,simply addicted to guys hot scally lad,simply addicted to guys hot scally lad
3,4,0,and hot soles,and hot soles
4,5,0,lemmie eat a oreo do these dishes one oreo,lemmie eat a oreo do these dishes one oreo
5,6,0,why the eggplant emoji would he say she looked like scream,why the eggplant emoji would he say she looked like scream
6,7,0,baby monkey bathtime this is so adorable,baby monkey bathtime this is so adorable
7,8,0,baseball season for the win yankees this is where the love started,baseball season for the win yankees this is where the love started
8,9,0,i am an early bird and i am a night owl so i am wise and have worms,i am an early bird and i am a night owl so i am wise and have worms
9,10,0,overdosing on heavy drugs does not sound bad tonight i do that every day,overdosing on heavy drugs does not sound bad tonight i do that every day


In [9]:
df['label'].value_counts()

0    3857
1    1376
Name: label, dtype: int64

## 3. Define Model

In [10]:
x = df['text'].values
y = df['label'].values

In [11]:
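# Shuffle the row indices without replacement so the train and test rows never overlap
# (np.random.randint samples with replacement, which leaks rows between the two sets)
indices = np.random.permutation(x.shape[0])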
train_indices = indices[0:round(0.8*x.shape[0])]
pool_indices = indices[round(0.8*x.shape[0]):]
df_train = df.iloc[train_indices]['text'].values
df_test = df.iloc[pool_indices]['text'].values
y_train = y[train_indices]
y_test = y[pool_indices]
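
Because the labels are imbalanced (3857 vs 1376), a stratified split can keep the class ratio the same in both parts. The following is a minimal sketch using scikit-learn's `train_test_split`; the toy arrays `x_demo` and `y_demo` are illustrative stand-ins for `x` and `y`, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the notebook's `x` (texts) and `y` (labels); illustrative only.
x_demo = np.array(["text %d" % i for i in range(100)])
y_demo = np.array([0] * 75 + [1] * 25)

# 80/20 split without overlap; `stratify` preserves the class ratio in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

print(len(x_tr), len(x_te))      # sizes of the two parts
print(y_tr.mean(), y_te.mean())  # share of label 1 in each part
```

Fixing `random_state` makes the split reproducible, which helps when comparing cluster quality across runs.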

In [12]:
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df_train)

X_train = tokenizer.texts_to_sequences(df_train)
X_test = tokenizer.texts_to_sequences(df_test)

vocab_size = len(tokenizer.word_index) + 1  # Adding 1 because of reserved 0 index
print(df_train[2])
print(X_train[2])

 this the i play soccer cheat on girls and wear khaki coloured cargos haircut 
[18, 1, 3, 173, 602, 660, 17, 190, 7, 395, 1020, 510, 1021, 1022]
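
To make the mapping above concrete, here is a simplified pure-Python sketch of how a frequency-ranked word index turns text into id sequences. It is an approximation of the idea, not the Keras implementation: it ignores punctuation filtering, and the function names `build_word_index` and `texts_to_sequences` are hypothetical helpers written for this illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; id 1 is the most frequent, id 0 stays reserved."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common sorts by descending count; ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its id, dropping unknown words and ids >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

texts = ["the cat sat", "the dog sat", "the bird flew"]
word_index = build_word_index(texts)
print(word_index)
print(texts_to_sequences(["the cat flew away"], word_index))
```

Words that never appeared during fitting (here "away") are silently dropped, which is why the test sequences can be shorter than the original texts.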


In [13]:
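# The GloVe vectors must be downloaded separately and placed in the working directory
path_to_glove_file = "glove.6B.300d.txt"

embeddings_index = {}
with open(path_to_glove_file, encoding="utf-8") as f: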
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

FileNotFoundError: [Errno 2] No such file or directory: 'glove.6B.300d.txt'
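
The model defined later in this notebook trains its own embeddings and does not use `embeddings_index`. If the GloVe file is available, one common way to use it is to build a matrix whose row `i` holds the pretrained vector for token id `i`. The sketch below assumes that approach; `build_embedding_matrix` and the toy dictionaries are hypothetical and stand in for `tokenizer.word_index` and the loaded GloVe vectors.

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings_index, embedding_dim):
    """Row i holds the pretrained vector for token id i; unknown words stay zero."""
    matrix = np.zeros((len(word_index) + 1, embedding_dim), dtype=np.float32)
    hits = 0
    for word, idx in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[idx] = vector
            hits += 1
    return matrix, hits

# Toy stand-ins for `tokenizer.word_index` and the GloVe `embeddings_index`.
toy_word_index = {"the": 1, "cat": 2, "zzz": 3}
toy_embeddings = {
    "the": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "cat": np.array([0.4, 0.5, 0.6], dtype=np.float32),
}
embedding_matrix, hits = build_embedding_matrix(toy_word_index, toy_embeddings, 3)
print(embedding_matrix.shape, hits)
```

With the real vectors, the matrix would have 300 columns, so the `Embedding` layer's output dimension would need to be 300 rather than 128.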

In [14]:
max_features = vocab_size
maxlen = 80  # cut texts after this number of words (among top max_features most common words)
batch_size = 4
n_epochs = 5
print('Loading data...')
(x_train, y_train), (x_test, y_test) = (X_train, y_train), (X_test, y_test)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
# Convert inputs and labels to float32 NumPy arrays
x_train = np.asarray(x_train).astype(np.float32)
y_train = np.asarray(y_train).astype(np.float32)
x_test = np.asarray(x_test).astype(np.float32)
y_test = np.asarray(y_test).astype(np.float32)
print('Build model...')
model = Sequential()
model.add(Embedding(vocab_size, 128))
model.add(LSTM(128, dropout=0.4, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=n_epochs,
                    validation_data=(x_test, y_test))
score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Loading data...
4186 train sequences
1047 test sequences
Pad sequences (samples x time)
x_train shape: (4186, 80)
x_test shape: (1047, 80)
Build model...
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor


Instructions for updating:
If using Keras pass *_constraint arguments to layers.


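NotImplementedError: Cannot convert a symbolic Tensor (lstm/strided_slice:0) to a numpy array.

Note: this error is commonly reported when TensorFlow 1.x runs with NumPy 1.20 or newer, and the install log above shows numpy 1.20.3 being collected. Pinning an older NumPy (for example `numpy<1.20`) before training is the usual workaround. The remaining cells assume that training succeeded.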
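
The padding step in the cell above turns variable-length id lists into a fixed `(samples, maxlen)` array. With the default settings of `pad_sequences`, short sequences are padded with zeros on the left and long ones lose their oldest tokens. The sketch below reproduces that default behaviour in plain NumPy; `pad_pre` is a hypothetical helper for illustration, not part of Keras.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` ids of longer sequences."""
    padded = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]  # 'pre' truncation drops the oldest tokens
        if trimmed:
            padded[row, -len(trimmed):] = trimmed
    return padded

print(pad_pre([[5, 6], [1, 2, 3, 4, 5, 6]], maxlen=4))
```

Because padding uses id 0, the same id that the tokenizer reserves, any later step that maps ids back to words should treat 0 as "no word".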

In [None]:
# Plot epochs vs train and test scores
# data = [go.Scatter(x=list(range(n_epochs)), y=homogeneity_scores, mode="lines", name="homogeneity"),


In [None]:
predictions = model.predict_classes(x_test, verbose=1)
# scikit-learn metrics take the true labels first, then the predictions
print('Overall F1 Score', f1_score(y_test, predictions))
sns.heatmap(confusion_matrix(y_test, predictions), annot=True)
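
As a small worked example of how the F1 score relates to the confusion matrix plotted above, the sketch below computes both on toy labels. The arrays are made up for illustration and are not model output.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(f1_manual, f1_score(y_true, y_pred))
```

On an imbalanced dataset such as this one, F1 on the minority class is usually more informative than accuracy alone.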

In [None]:
np.unique(predictions)

In [None]:
# Use up to the first 5000 training examples as the background dataset to integrate over
# (the training set has 4186 rows, so this slice covers all of it)
explainer = shap.DeepExplainer(model, x_train[:5000])

# Explain the first 1000 test predictions
# explaining each prediction requires 2 * background dataset size runs
shap_values = explainer.shap_values(x_test[:1000])

In [None]:
shap_values[0][0].shape

In [None]:
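# Create a reverse dictionary that maps token ids back to words
reverse_word_map = dict(map(reversed, tokenizer.word_index.items()))
# Each column of the SHAP matrix is a sequence position, so label the features by position;
# the feature rows must match the 1000 explained test examples
position_names = ['pos_%d' % i for i in range(maxlen)]
shap.summary_plot(shap_values[0], x_test[:1000], feature_names=position_names)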

In [None]:
# Transform the token ids back to words for the first 10 test examples
# (id 0 is padding and has no word, so it maps to "NONE")
num2word = reverse_word_map
x_test_words = np.stack([np.array([num2word.get(int(tok), "NONE") for tok in x_test[i]]) for i in range(10)])



In [None]:
index = 1
shap.force_plot(explainer.expected_value[0], shap_values[0][index], x_test_words[index], matplotlib=True), df_test[index]

In [None]:
shap_values_pool = shap_values[0]
n_clusters = 20
# n_jobs is omitted: the parameter was removed from KMeans in scikit-learn 1.0
kmeans = KMeans(n_clusters=n_clusters, max_iter=600)
kmeans.fit(shap_values_pool)
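
Once every example has a cluster label, one possible way to derive the keywords to highlight is to sum the absolute SHAP value each word receives inside a cluster and keep the top few. The sketch below illustrates that idea on toy data. It is an assumption about how keywords could be chosen, not the logic of `GuidedLearner` or the annotation tool, and `top_keywords_per_cluster` is a hypothetical helper.

```python
from collections import defaultdict
import numpy as np

def top_keywords_per_cluster(shap_matrix, token_ids, labels, id_to_word, top_k=3):
    """For each cluster, rank words by the total |SHAP| they receive in its members."""
    totals = defaultdict(lambda: defaultdict(float))
    for row_shap, row_ids, cluster in zip(shap_matrix, token_ids, labels):
        for value, tok in zip(row_shap, row_ids):
            tok = int(tok)
            if tok == 0:  # id 0 is padding, not a word
                continue
            totals[int(cluster)][id_to_word.get(tok, "NONE")] += abs(float(value))
    return {
        cluster: [w for w, _ in sorted(words.items(), key=lambda kv: -kv[1])[:top_k]]
        for cluster, words in totals.items()
    }

# Toy data: 3 padded examples of length 4, already assigned to 2 clusters.
id_to_word = {1: "good", 2: "bad", 3: "movie", 4: "plot"}
token_ids = np.array([[0, 1, 3, 4], [0, 0, 1, 3], [2, 3, 4, 2]])
shap_matrix = np.array([[0.0, 0.9, 0.1, 0.3], [0.0, 0.0, 0.8, 0.1], [-0.7, 0.1, 0.05, -0.6]])
labels = np.array([0, 0, 1])

keywords = top_keywords_per_cluster(shap_matrix, token_ids, labels, id_to_word, top_k=2)
print(keywords)
```

In the notebook's terms, `shap_matrix` would be `shap_values_pool`, `token_ids` the matching rows of `x_test`, `labels` would be `kmeans.labels_`, and `id_to_word` would be `reverse_word_map`.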

In [None]:
homogeneity_score(y_test[:1000], kmeans.labels_), v_measure_score(y_test[:1000], kmeans.labels_)
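
To help read these two scores, the toy example below contrasts homogeneity (each cluster contains a single class) with completeness (each class sits in a single cluster). The V-measure is their harmonic mean. The label lists are made up for illustration.

```python
from sklearn.metrics import completeness_score, homogeneity_score, v_measure_score

labels_true = [0, 0, 1, 1]

# Each cluster holds a single class: perfectly homogeneous, but class 0 is split.
pure_but_split = [0, 1, 2, 2]
# One cluster holds everything: complete, but not homogeneous at all.
single_cluster = [0, 0, 0, 0]

print(homogeneity_score(labels_true, pure_but_split), completeness_score(labels_true, pure_but_split))
print(homogeneity_score(labels_true, single_cluster), completeness_score(labels_true, single_cluster))
print(v_measure_score(labels_true, pure_but_split))
```

With 20 clusters and only 2 classes, completeness is necessarily low, so homogeneity is the more relevant score here: a pure cluster can be labelled in one step by the annotator.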