# Target Visualization - T-SNE and Doc2Vec
Source: https://www.kaggle.com/arthurtok/target-visualization-t-sne-and-doc2vec/notebook

This kernel explores the target variable and how it is distributed across the structure of the training data, to see whether any useful information or patterns can be gleaned going forward. Classical treatments of text data typically suffer from high dimensionality (term frequencies or term frequency-inverse document frequencies), so the plan is to explore the target variable visually in a lower-dimensional space obtained in two ways: SVD-based LSA (Latent Semantic Analysis) and Doc2Vec. In these lower-dimensional spaces we can then apply the manifold learning technique t-distributed stochastic neighbour embedding (t-SNE) to reduce the dimensionality further for visualisation of the target variable.

In [3]:

# Importing the relevant libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer

# from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer 
# from nltk.stem import WordNetLemmatizer, PorterStemmer
# from nltk.corpus import stopwords
from string import punctuation


import re
from functools import reduce

import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show, output_notebook, reset_output
from bokeh.palettes import d3
import bokeh.models as bmo
from bokeh.io import save, output_file

# init_notebook_mode(connected = True)
# color = sns.color_palette("Set2")
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999

In [6]:
from pathlib import Path

BASE_PATH = Path('../..')
events_path = BASE_PATH / 'events'
dictionary_path = BASE_PATH / 'dictionary'
data_path = BASE_PATH / 'data'
subset_reports_path = data_path / 'subset'
subset_reports_path_txt = data_path / 'subset_txt'
df_path = data_path / 'dataframes'
patterns_path = dictionary_path / 'patterns'
triggers_path = dictionary_path / 'trigger phrases'

GROUP = 0
group_events_path = events_path / f'group_{GROUP}_events.csv'
labelled_path = events_path / f'group_{GROUP}_labelled.csv'
processed_path = events_path / f'group_{GROUP}_processed.csv'

In [9]:
import string
import spacy
from spacy import displacy

from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English

# 1. Data preprocessing

In [115]:
#Read in labelled event data file from 6 groups

groups = [0, 1, 2, 3, 4, 6] # lol

filenames = {group: events_path / f'group_{group}_labelled.csv' for group in groups}

# instantiate empty list to store dfs on read
dfall = []
for group in groups:
    df = pd.read_csv(filenames[group])
    
    # data processing and cleaning on near miss event column
    df = df.loc[df['Near Miss Event'].notna(), ]
    
    # pd.Series(['False']) returns True as string are converted to bool on whether they are empty or not!
    df['Near Miss Event'] = df['Near Miss Event'].apply(lambda x : (x == 'True') | (x == True)).astype(bool)
    
    # need to read in dataframe to work out length of group column
    df.insert(2, 'group', np.repeat(group, len(df)))
    dfall.append(df)
    
# concat list of dfs as a single data frame
dfall = pd.concat(dfall)
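
The comment about string truthiness in the cell above can be checked directly. This is a minimal sketch in plain Python with made-up sample values. As that comment notes, pandas behaves the same way when it casts strings to bool, so the explicit comparison is the safer route.

```python
# Hypothetical raw values as they might appear in a hand-labelled column
raw_labels = ["True", "False", True, False]

# Naive cast: any non-empty string is truthy, so "False" becomes True
naive = [bool(v) for v in raw_labels]

# Explicit comparison, mirroring the lambda used in the cell above
explicit = [(v == "True") or (v == True) for v in raw_labels]

print(naive)     # [True, True, True, False]
print(explicit)  # [True, False, True, False]
```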

In [116]:
dfall.to_csv(events_path / 'group_all_labelled.csv', index=False)

In [119]:
df = dfall.loc[dfall.reviewed][['event_id','filename', 'group', 'event_text', 'Near Miss Event']]
# Target Label
df['Label'] = df['Near Miss Event'].astype(int)
df.head()

Unnamed: 0,event_id,filename,group,event_text,Near Miss Event,Label
0,a080918_e9_1443_annual_09_13904956_0,a080918_e9_1443_annual_09_13904956.json,0,following the completion of the hole and loggi...,False,0
1,a080918_e9_1443_annual_09_13904956_15,a080918_e9_1443_annual_09_13904956.json,0,mineral drillholes data 2. lithology summary a...,False,0
2,a080918_e9_1443_annual_09_13904956_18,a080918_e9_1443_annual_09_13904956.json,0,several suitable target areas were identified ...,False,0
3,a080918_e9_1443_annual_09_13904956_21,a080918_e9_1443_annual_09_13904956.json,0,the gascoyne platform is a diamond shaped area...,False,0
4,a080918_e9_1443_annual_09_13904956_34,a080918_e9_1443_annual_09_13904956.json,0,bromine levels in the halite are high (up to 3...,True,1


In [138]:
X = df[['filename', 'event_text', 'group', 'Label']]
X.head()

Unnamed: 0,filename,event_text,group,Label
0,a080918_e9_1443_annual_09_13904956.json,following the completion of the hole and loggi...,0,0
1,a080918_e9_1443_annual_09_13904956.json,mineral drillholes data 2. lithology summary a...,0,0
2,a080918_e9_1443_annual_09_13904956.json,several suitable target areas were identified ...,0,0
3,a080918_e9_1443_annual_09_13904956.json,the gascoyne platform is a diamond shaped area...,0,0
4,a080918_e9_1443_annual_09_13904956.json,bromine levels in the halite are high (up to 3...,0,1



In this section we pre-process the event text contained in the labelled data. The steps applied are standard for a text-based NLP problem:

- Tokenization
- Stemming or Lemmatization

In [139]:
# Create our list of punctuation marks
punctuations = string.punctuation

# Load the large English model (tagger, parser, NER and word vectors);
# it is not used by the tokenizer function below
nlp = spacy.load('en_core_web_lg')

# Create our list of stopwords
stop_words = STOP_WORDS

# Blank English pipeline: provides the tokenizer only
parser = English()

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = parser(sentence)

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens

In [140]:
# Apply the spaCy tokenizer to every event text
X["event_text"] = X["event_text"].apply(spacy_tokenizer)

In [141]:
X.head()

Unnamed: 0,filename,event_text,group,Label
0,a080918_e9_1443_annual_09_13904956.json,"[following, completion, hole, logging, core, l...",0,0
1,a080918_e9_1443_annual_09_13904956.json,"[mineral, drillholes, data, 2, lithology, summ...",0,0
2,a080918_e9_1443_annual_09_13904956.json,"[suitable, target, areas, identified, areas, a...",0,0
3,a080918_e9_1443_annual_09_13904956.json,"[gascoyne, platform, diamond, shaped, area, co...",0,0
4,a080918_e9_1443_annual_09_13904956.json,"[bromine, levels, halite, high, 330ppm, sugges...",0,1


In [142]:
X['Label'].value_counts()

0    1245
1     425
Name: Label, dtype: int64

# 2. T-SNE applied to Latent Semantic (LSA) space


To start off, we look at the sparse representation of text documents produced by the term frequency-inverse document frequency (TF-IDF) method. It builds a matrix representation that upweights terms that are prevalent locally (within a document) but rare globally (across the corpus), thereby correcting for the occurrence bias that arises when using raw term frequencies alone.
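
The upweighting can be seen on a toy example. This is a minimal sketch using only the standard library, with a made-up three-document corpus. It uses the smoothed idf formula that scikit-learn applies by default and skips the L2 row normalisation that `TfidfVectorizer` also performs, so the numbers are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (made up): "drill" appears in every document, "bromine" in only one
docs = [
    ["drill", "core", "drill"],
    ["drill", "bromine"],
    ["drill", "core"],
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return unnormalised tf-idf weights for one tokenised document."""
    tf = Counter(doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    return {t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in tf}

weights = tfidf(docs[1])
print(weights)
```

In the second document both terms occur once, but the globally rare "bromine" receives a larger weight than the ubiquitous "drill".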

In [143]:
tf_idf_vec = TfidfVectorizer(min_df=3,
                             max_features = 60_000, #100_000,
                             analyzer="word",
                             ngram_range=(1,3), # (1,6)
                             stop_words="english")

tf_idf = tf_idf_vec.fit_transform(list(X["event_text"].map(lambda tokens: " ".join(tokens))))

In [144]:
# Applying the Singular value decomposition
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=50, random_state=2018)
svd_tfidf = svd.fit_transform(tf_idf)
print("Dimensionality of LSA space: {}".format(svd_tfidf.shape))

Dimensionality of LSA space: (1670, 50)
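
The LSA step above can be illustrated on a tiny matrix. This is a minimal sketch, assuming a made-up 4x4 document-term matrix and using NumPy's full SVD as a stand-in for `TruncatedSVD`. The latter uses a different solver but returns the same kind of document coordinates, up to sign.

```python
import numpy as np

# Toy document-term matrix (rows = documents, columns = terms); values are made up
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])

k = 2  # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Document coordinates in the k-dimensional latent space: U_k scaled by the singular values
docs_lsa = U[:, :k] * s[:k]

print(docs_lsa.shape)  # (4, 2)
```

Projecting the original rows onto the top-k right singular vectors (`A @ Vt[:k].T`) gives the same coordinates, which is how unseen documents would be mapped into the same space.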


In [172]:
# # Showing scatter plots 
# from mpl_toolkits.mplot3d import Axes3D
# fig = plt.figure(figsize=(16,12))

# # Plot models:
# ax = Axes3D(fig) 
# ax.scatter(svd_tfidf[:,0],
#            svd_tfidf[:,1],
#            svd_tfidf[:,2],
#            c=X.Label.values,
#            cmap=plt.cm.winter_r,
#            s=2,
#            edgecolor='none',
#            marker='o')
# plt.title("Semantic Tf-Idf-SVD reduced plot of Sincere-Insincere data distribution")
# plt.xlabel("First dimension")
# plt.ylabel("Second dimension")
# plt.legend()
# plt.xlim(0.0, 0.20)
# plt.ylim(-0.2,0.4)
# plt.show()

In [147]:
from sklearn.manifold import TSNE

# Importing multicore version of TSNE
#from MulticoreTSNE import MulticoreTSNE as TSNE

In [148]:
tsne_model = TSNE(n_jobs=4,
                  early_exaggeration=4, # Trying out exaggeration trick
                  n_components=2,
                  verbose=1,
                  random_state=2018,
                  n_iter=500)
tsne_tfidf = tsne_model.fit_transform(svd_tfidf)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1670 samples in 0.004s...
[t-SNE] Computed neighbors for 1670 samples in 0.097s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1670
[t-SNE] Computed conditional probabilities for sample 1670 / 1670
[t-SNE] Mean sigma: 0.122082
[t-SNE] KL divergence after 250 iterations with early exaggeration: 19.006523
[t-SNE] KL divergence after 500 iterations: 1.454434
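
The "91 nearest neighbors" in the first log line follows from the perplexity. This is a small arithmetic check, assuming scikit-learn's default perplexity of 30 (the cell above does not override it) and its rule of using roughly three times the perplexity as the neighbourhood size.

```python
perplexity = 30.0   # scikit-learn's default; assumed, since the cell does not set it
n_samples = 1670    # number of events, taken from the log above

# Neighbourhood size used for the sparse affinity computation
n_neighbors = min(n_samples - 1, int(3.0 * perplexity + 1))

print(n_neighbors)  # 91
```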


In [149]:
# Putting the tsne information into a dataframe
tsne_tfidf_df = pd.DataFrame(data=tsne_tfidf, columns=["x", "y"])
tsne_tfidf_df["filename"] = X["filename"].values
tsne_tfidf_df["event_text"] = X["event_text"].values
tsne_tfidf_df["Label"] = X["Label"].values
tsne_tfidf_df['group'] = X['group'].values
tsne_tfidf_df.head()

Unnamed: 0,x,y,filename,event_text,Label,group
0,-9.841389,9.088011,a080918_e9_1443_annual_09_13904956.json,"[following, completion, hole, logging, core, l...",0,0
1,-8.386702,-2.832415,a080918_e9_1443_annual_09_13904956.json,"[mineral, drillholes, data, 2, lithology, summ...",0,0
2,-8.094007,-3.112274,a080918_e9_1443_annual_09_13904956.json,"[suitable, target, areas, identified, areas, a...",0,0
3,-6.529096,-4.482185,a080918_e9_1443_annual_09_13904956.json,"[gascoyne, platform, diamond, shaped, area, co...",0,0
4,-8.624323,-2.836989,a080918_e9_1443_annual_09_13904956.json,"[bromine, levels, halite, high, 330ppm, sugges...",1,0


In [171]:
tsne_tfidf_df.group.apply(lambda x : colormap[x])

0       darkblue
1       darkblue
2       darkblue
3       darkblue
4       darkblue
          ...   
1665      yellow
1666      yellow
1667      yellow
1668      yellow
1669      yellow
Name: group, Length: 1670, dtype: object

In [173]:
output_notebook()
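plot_tfidf = bp.figure(plot_width = 800, plot_height = 700, 
                       title = "T-SNE applied to Tfidf_SVD space",
                       # hover must be requested explicitly, otherwise the HoverTool selection below is empty
                       tools = "pan,wheel_zoom,box_zoom,reset,hover,save")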

# colormap = np.array(["#6d8dca", "#d07d3c"])

# we need a list of length 7 because charlie labelled group 6 instead of 5 lol
colormap = np.array(["darkblue", "red", "purple", "green", "orange", "yellow", "yellow"])

# palette = d3["Category10"][len(tsne_tfidf_df["asset_name"].unique())]
source = ColumnDataSource(data = dict(x = tsne_tfidf_df["x"], 
                                      y = tsne_tfidf_df["y"],
                                      color = colormap[tsne_tfidf_df["Label"]],
                                      event_text = tsne_tfidf_df["event_text"],
                                      filename = tsne_tfidf_df["filename"],
                                      Label = tsne_tfidf_df["Label"]))

plot_tfidf.scatter(x = "x", 
                   y = "y", 
                   color="color",
                   legend = "Label",
                   source = source,
                   alpha = 0.7)
hover = plot_tfidf.select(dict(type = HoverTool))
hover.tooltips = {"filename": "@filename", 
                  "event_text": "@event_text", 
                  "Label":"@Label"}

show(plot_tfidf)
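
The colour lookup in the plot above relies on NumPy integer-array indexing: each integer in the index array picks the colour at that position. This minimal sketch uses made-up group ids and shows why the colormap needs seven entries even though there are only six groups.

```python
import numpy as np

colormap = np.array(["darkblue", "red", "purple", "green", "orange", "yellow", "yellow"])

# Hypothetical group ids; the largest id is 6, so index 6 must exist in colormap
group_ids = np.array([0, 1, 4, 6])

# Integer-array indexing returns one colour per id
colors = colormap[group_ids].tolist()

print(colors)  # ['darkblue', 'red', 'orange', 'yellow']
```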

In [175]:
output_notebook()
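plot_tfidf = bp.figure(plot_width = 800, plot_height = 700, 
                       title = "T-SNE applied to Tfidf_SVD space",
                       # hover must be requested explicitly, otherwise the HoverTool selection below is empty
                       tools = "pan,wheel_zoom,box_zoom,reset,hover,save")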

# colormap = np.array(["#6d8dca", "#d07d3c"])

# we need a list of length 7 because charlie labelled group 6 instead of 5 lol
colormap = np.array(["darkblue", "red", "purple", "green", "orange", "yellow", "yellow"])

# palette = d3["Category10"][len(tsne_tfidf_df["asset_name"].unique())]
source = ColumnDataSource(data = dict(x = tsne_tfidf_df["x"], 
                                      y = tsne_tfidf_df["y"],
                                      color = colormap[tsne_tfidf_df["group"]],
                                      event_text = tsne_tfidf_df["event_text"],
                                      filename = tsne_tfidf_df["filename"],
                                      Label = tsne_tfidf_df["group"]))

plot_tfidf.scatter(x = "x", 
                   y = "y", 
                   color="color",
                   legend = "Label",
                   source = source,
                   alpha = 0.7)
hover = plot_tfidf.select(dict(type = HoverTool))
hover.tooltips = {"filename": "@filename", 
                  "event_text": "@event_text", 
                  "Label":"@Label"}

show(plot_tfidf)

# 3. T-SNE applied on Doc2Vec embedding
Pushing forward with our t-SNE visual explorations, we next move away from semantic matrices into the realm of embeddings. Here we use the Doc2Vec algorithm which, much like its well-known counterpart Word2vec, learns continuous representations of text in an unsupervised way. Whereas Word2vec finds representations for individual words (i.e. word embeddings), Doc2vec extends the method to sentences and even whole documents.

For this notebook we use gensim's Doc2Vec class, which inherits from the base Word2Vec class, so its usage and parameters are similar. The main difference lies in the names of the training methods, which are the “distributed memory” and “distributed bag of words” methods.

According to the gensim documentation, Doc2Vec requires its input to be an iterable of documents, each represented by two lists: a list of terms and a list of labels (tags).
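
The input structure described above can be sketched without gensim. The example below uses a hypothetical stand-in, `TaggedDoc`, that mirrors gensim's `TaggedDocument` (a namedtuple with `words` and `tags` fields). The token lists are made up.

```python
from collections import namedtuple

# Stand-in mirroring gensim's TaggedDocument (a namedtuple of words and tags)
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

# Hypothetical tokenised events
token_lists = [
    ["bromine", "levels", "halite"],
    ["mineral", "drillholes", "data"],
]

# One tagged document per event; the tag is the row index as a string
tagged = [TaggedDoc(words=tokens, tags=[str(i)]) for i, tokens in enumerate(token_lists)]

print(tagged[1].tags)  # ['1']
```

Using the row index as the tag keeps the learned document vectors aligned with the rows of the dataframe.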

In [166]:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [167]:
# Storing the tokenised event texts in a list
event_texts = list(X["event_text"])

# Creating a list of terms and a list of labels to go with it
documents = [TaggedDocument(doc, tags=[str(i)]) for i, doc in enumerate(event_texts)]

In [168]:
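# Implement Doc2Vec
max_epochs = 100  # NOTE: defined but never passed to Doc2Vec below, so gensim's default epoch count applies
alpha = 0.025
model = Doc2Vec(documents,
                size=10,  # dimensionality of the document vectors
                min_alpha=0.00025,
                alpha=alpha,
                min_count=1,
#                 window=2, 
                workers=4)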

In [169]:
# Creating and fitting the tsne model to the document embeddings
tsne_model = TSNE(n_jobs=4,
                  early_exaggeration=4,
                  n_components=2,
                  verbose=1,
                  random_state=2018,
                  n_iter=300)
tsne_d2v = tsne_model.fit_transform(model.docvecs.vectors_docs)

# Putting the tsne information into a dataframe
tsne_d2v_df = pd.DataFrame(data=tsne_d2v, columns=["x", "y"])
tsne_d2v_df["filename"] = X["filename"].values
tsne_d2v_df["event_text"] = X["event_text"].values
tsne_d2v_df["Label"] = X["Label"].values
tsne_d2v_df["group"] = X["group"].values

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1670 samples in 0.001s...
[t-SNE] Computed neighbors for 1670 samples in 0.046s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1670
[t-SNE] Computed conditional probabilities for sample 1670 / 1670
[t-SNE] Mean sigma: 0.044392
[t-SNE] KL divergence after 250 iterations with early exaggeration: 15.474952
[t-SNE] KL divergence after 300 iterations: 1.568828


In [170]:
output_notebook()
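plot_d2v = bp.figure(plot_width = 800, plot_height = 700, 
                     title = "T-SNE applied to Doc2vec document embeddings",
                     # hover must be requested explicitly, otherwise the HoverTool selection below is empty
                     tools = "pan,wheel_zoom,box_zoom,reset,hover,save")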

# colormap = np.array(["#6d8dca", "#d07d3c"])
colormap = np.array(["darkblue", "red", "purple", "green", "orange", "yellow", "yellow"])

# palette = d3["Category10"][len(tsne_tfidf_df["asset_name"].unique())]
source = ColumnDataSource(data = dict(x = tsne_d2v_df["x"], 
                                      y = tsne_d2v_df["y"],
                                      color = colormap[tsne_tfidf_df["group"]],
                                      event_text = tsne_tfidf_df["event_text"],
                                      filename = tsne_tfidf_df["filename"],
                                      Label = tsne_tfidf_df["group"]))

plot_d2v.scatter(x = "x", 
                   y = "y", 
                   color="color",
                   legend = "Label",
                   source = source,
                   alpha = 0.7)
hover = plot_d2v.select(dict(type = HoverTool))
hover.tooltips = {"filename": "@filename", 
                  "event_text": "@event_text", 
                  "Label":"@Label"}

show(plot_d2v)

Takeaways from the plot

The visual overlap between near miss and non-near miss events is even greater in the Doc2Vec plot, so much so that there doesn't seem to be any obvious way to segregate the labels by eye if we go down the route of document embeddings.