### Guided Annotation tool
This notebook walks through preparing the data that the guided annotation tool takes as input.
The tool presents unlabelled data as explainable clusters for the annotator to label.
The notebook covers the following steps:

    1. Load dataset
    2. Train a model and explain it
    3. Perform SHAP clustering
    4. Save the clusters to the database, with keywords for the annotation tool to highlight

#### Imports

In [None]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import euclidean, cosine

import nltk
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score, homogeneity_score, v_measure_score, completeness_score
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
import plotly
import plotly.graph_objs as go
# import chart_studio.plotly as py
import seaborn as sns
import shap

#### Built-in function usage
To avoid rewriting the same code for each dataset and model, I have wrapped the common steps in functions in the models module/folder.
We use this Python module throughout the tutorial.

In [None]:
import os
# Move up one directory so the local packages below (models, app) can be imported
os.chdir('../')
from models.trainers import Trainer
from app.utils import clear_labels
from models.guided_learning import GuidedLearner
pd.set_option('display.max_colwidth', 1000)

#### View dataset

In [None]:
df = pd.read_csv('datasets/davidson_dataset.csv') # substitute other datasets in similar format
print(df.shape)
df.head(10)

In [None]:
df["label"].hist()

#### Splitting data
We split the data into four parts: training, test, pool and individual. Pool is the unlabelled set we want to generate SHAP clusters for.
Individual is the set of instances we want the user to label without any guidance.

The default split (can be altered) is:

- 70% train
- 10% test
- 10% pool
- 10% individual
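
For readers without access to the `Trainer` class, the following is a minimal sketch of what a 70/10/10/10 split could look like. The helper name `split_four_ways`, its parameters and the toy DataFrame are hypothetical and are not taken from the `models` module; `train_test_pool_split` may differ (for example, it may stratify by label).

```python
import numpy as np
import pandas as pd

def split_four_ways(df, fractions=(0.7, 0.1, 0.1, 0.1), seed=0):
    """Shuffle df and cut it into train/test/pool/individual parts.

    Hypothetical helper for illustration only; not the Trainer implementation.
    """
    if not np.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must sum to 1")
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    # Cumulative cut points (70%, 80%, 90% of the rows by default).
    # Rounding avoids float errors such as 79.999... truncating to 79.
    cuts = np.round(np.cumsum(fractions[:-1]) * len(shuffled)).astype(int)
    starts = [0, *cuts]
    ends = [*cuts, len(shuffled)]
    return [shuffled.iloc[a:b] for a, b in zip(starts, ends)]

# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(100)],
    "label": [i % 2 for i in range(100)],
})
train, test, pool, individual = split_four_ways(toy)
print(len(train), len(test), len(pool), len(individual))
```

The four parts are disjoint and together cover every row, so no instance can leak from the pool into the training set.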

In [None]:
t = Trainer(dataset_name="davidson") # the name which you want for the tables in the database
df_train, df_test, df_pool, df_individual = t.train_test_pool_split(df)
df_train.shape, df_test.shape, df_pool.shape, df_individual.shape


#### Model fitting


In [None]:
learner = GuidedLearner(df_train, df_test, df_pool, df_individual, 'davidson', 1)
tfid, x_train, x_test, x_pool, y_train, y_test, y_pool = learner.tfid_fit()

In [None]:
model, explainer = learner.grid_search_fit_svc(c=[1])

#### Perform shap clustering
We are going to cluster the unlabelled pool using SHAP explanations (Shapley space).
SHAP clustering clusters on the Shapley values of each instance rather than on its raw features.
This means that instances are grouped by explanation similarity.
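
To make the idea concrete, here is a minimal, self-contained sketch of clustering in Shapley space. It is not the `cluster_data_pool` implementation. To avoid depending on the `shap` package, it assumes a linear model with features treated as independent, for which the Shapley value of feature j on instance i has the closed form w_j * (x_ij - mean_j) in log-odds units. The synthetic data and the choice of 4 clusters are also assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for TF-IDF features (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X, y)

# Closed-form Shapley values for a linear model with independent features:
# phi_ij = w_j * (x_ij - mean_j), measured in log-odds.
feature_means = X.mean(axis=0)
shap_values = model.coef_[0] * (X - feature_means)

# Cluster the explanations, not the raw features
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(shap_values)

print(shap_values.shape, np.bincount(cluster_ids))
```

Two instances land in the same cluster when the model's prediction for them is driven by the same features in the same direction, even if their raw feature vectors differ.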

In [None]:
df_final_labels, uncertainty = learner.cluster_data_pool(n_clusters=25)

The predicted probability is converted to an uncertainty score. In binary classification this is the same as 1 - P, where P is the predicted probability.
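
A minimal sketch of this conversion, assuming P is the probability of the most likely class (the usual "least confidence" score). The helper name `uncertainty_from_proba` and the toy probabilities are hypothetical; the learner may compute its `uncertainty` differently.

```python
import numpy as np

def uncertainty_from_proba(proba):
    """Return 1 - max class probability for each row.

    proba: array of shape (n_samples, n_classes), e.g. from predict_proba.
    Hypothetical helper for illustration only.
    """
    proba = np.asarray(proba, dtype=float)
    return 1.0 - proba.max(axis=1)

# Toy binary probabilities: confident, nearly undecided, fairly confident
proba = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.20, 0.80]])
uncertainty = uncertainty_from_proba(proba)
print(uncertainty)
```

In the binary case this score ranges from 0 (fully confident) to 0.5 (undecided between the two classes).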

In [None]:
df_final_labels.head()

In [None]:
learner.save_to_db(df_final_labels)

In [None]:
plt.hist(uncertainty)
plt.show()

#### Additional explanations

Once the clusters are saved to the database, all the steps needed for guided annotation are complete.
This section shows how to look at the explanation of a single instance.

In [None]:
predictions = model.predict(x_pool)

In [None]:
shap_values_train = explainer.shap_values(x_train)
shap_values_pool = explainer.shap_values(x_pool)

In [None]:
shap_values_pool.shape

In [None]:
df_test.head()

Explain a single positive prediction at 'index'

In [None]:
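positive_index = 0  # which of the positive predictions to explain
# predictions and shap_values_pool were computed on the pool, so index into the pool data
index = np.where(predictions == 1)[0][positive_index]
print("text ", df_pool["text"].values[index], " prediction: ", predictions[index], "actual ", np.asarray(y_pool)[index])
shap.force_plot(explainer.expected_value,
                shap_values_pool[index, :],
                x_pool[index, :],
                feature_names=tfid.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
                matplotlib=True)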
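
As a text alternative to the force plot, the tokens that contribute most to one prediction can be listed by ranking a row of SHAP values by absolute magnitude. This is a minimal sketch: the helper `top_tokens` and the toy values are hypothetical, and this ranking is not necessarily how the annotation tool picks the keywords it highlights.

```python
import numpy as np

def top_tokens(shap_row, feature_names, k=3):
    """Return the k (token, shap_value) pairs with the largest |shap_value|.

    Hypothetical helper for illustration only.
    """
    shap_row = np.asarray(shap_row, dtype=float)
    order = np.argsort(-np.abs(shap_row))[:k]
    return [(feature_names[i], float(shap_row[i])) for i in order]

# Toy stand-ins for one row of SHAP values and the vectorizer's vocabulary
feature_names = ["good", "bad", "the", "awful", "movie"]
shap_row = np.array([0.10, -0.40, 0.00, -0.70, 0.05])

ranked = top_tokens(shap_row, feature_names)
print(ranked)
```

With the notebook's objects, a dense 1-D row of `shap_values_pool` and the vectorizer's feature names would take the place of the toy inputs.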