# Description
In this notebook I demonstrate how to use [**Neural Graph Learning**](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/bbd774a3c6f13f05bf754e09aa45e7aa6faa08a8.pdf) to improve model accuracy. The idea of this method is to leverage prior knowledge about how your data is interconnected to make training more robust and accurate.

In other words, if you have a graph where each node is an example from your data and the weight of the edge between two nodes represents how similar those two examples are, then you can use this graph to make your model predict similar labels for examples connected by a high-weight edge.
To understand this idea, have a look at this objective function:

$\mathcal{L}(\theta)=\alpha_1\sum_{(u, v)\in E}{w_{uv}\,d(h_\theta(x_u), h_\theta(x_v))}+\alpha_2\sum_{i=1}^{N}\mathrm{cost}(g_\theta(x_i), y_i)$

*  $E$: edges between labeled nodes in the graph
*  $d$: distance function
*  $w_{uv}$: weight of the edge between u and v
*  $h_\theta$: activation of some hidden layer in the model
*  $g_\theta$: activation of the last layer of your model
*  $\mathrm{cost}$: cross-entropy, MSE, ...
*  $\alpha_1, \alpha_2$: weights that balance the two terms

The first term in the equation above acts as a regularization term. When the weight between u and v is high (similar examples), the model must keep the distance between their hidden representations small or it is penalized. In other words, the objective encourages the model to assign similar labels to similar examples. This method also generalizes to semi-supervised learning: we can incorporate unlabeled nodes by adding their distances to the objective function. For more details, see [**Neural Graph Learning**](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/bbd774a3c6f13f05bf754e09aa45e7aa6faa08a8.pdf).
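
To make the objective concrete, here is a minimal pure-Python sketch that evaluates both terms on toy numbers. The function names, the squared L2 distance, the binary cross-entropy cost, and the values of the two weights are all assumptions chosen for illustration; they are not taken from the paper or from any library.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def binary_cross_entropy(p, y, eps=1e-7):
    """Cross-entropy between a predicted probability p and a 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def graph_regularized_loss(hidden, preds, labels, edges, alpha_1=0.1, alpha_2=1.0):
    """alpha_1 * (weighted distances over edges) + alpha_2 * (supervised cost)."""
    graph_term = sum(w * squared_l2(hidden[u], hidden[v]) for u, v, w in edges)
    supervised_term = sum(binary_cross_entropy(p, y) for p, y in zip(preds, labels))
    return alpha_1 * graph_term + alpha_2 * supervised_term

# Toy data (hypothetical): three examples with 2-d hidden activations.
hidden = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
preds = [0.9, 0.8, 0.2]
labels = [1, 1, 0]

# The same edge weight connects either two close or two distant representations.
loss_close = graph_regularized_loss(hidden, preds, labels, edges=[(0, 1, 0.9)])
loss_far = graph_regularized_loss(hidden, preds, labels, edges=[(0, 2, 0.9)])
print(loss_close, loss_far)
```

The supervised term is identical in both calls, so the gap between the two losses comes only from the graph term: a heavy edge between distant representations raises the loss.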

## How can we build a graph from our data?
TensorFlow provides a framework called **neural-structured-learning** that allows us to build graphs from our data. First, however, we need to transform the data into a format that the framework accepts.

*  Create an ID and an embedding for each textual example in our dataset (labeled and unlabeled)
*  Transform each example to TFRecord format
*  Serialize the transformed data to disk
*  Use **neural-structured-learning** to generate a graph of our data based on cosine similarity
*  Transform the training data to TFRecord format and serialize it to disk
*  Use **neural-structured-learning** to augment the training data with neighbors

At the end of this stage, we will get a dataset where each record is a dictionary that consists of:
*  The original training example
*  N neighbors of the example
*  The weight of each neighbor
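
As a rough picture of such a record, the sketch below assembles one dictionary by hand. The key names follow the `NL_nbr_` prefix and `_weight` suffix that the parsing code later in this notebook uses; the `pack_neighbors` helper, the toy data, and the choice to rank neighbors by edge weight are assumptions for illustration, not the library's implementation.

```python
def pack_neighbors(example_id, embeddings, edges, max_nbrs):
    """Build one augmented record: the example plus its top-weighted neighbors."""
    neighbors = []
    for u, v, w in edges:
        # Treat every edge as undirected: u is a neighbor of v and vice versa.
        if u == example_id:
            neighbors.append((v, w))
        elif v == example_id:
            neighbors.append((u, w))
    # Assumption: keep the neighbors with the highest edge weights.
    neighbors.sort(key=lambda pair: pair[1], reverse=True)

    record = {"embed": embeddings[example_id]}
    for i, (nbr_id, weight) in enumerate(neighbors[:max_nbrs]):
        record["NL_nbr_{}_embed".format(i)] = embeddings[nbr_id]
        record["NL_nbr_{}_weight".format(i)] = weight
    return record

# Toy data (hypothetical): three examples and two edges touching example 0.
toy_embeddings = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.95, 0.05]}
toy_edges = [(0, 1, 0.98), (2, 0, 0.99)]
record = pack_neighbors(0, toy_embeddings, toy_edges, max_nbrs=1)
print(sorted(record.keys()))
```

With `max_nbrs=1`, only the neighbor with the largest weight (example 2) is kept alongside the original embedding.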

## How to build the regularized model?
**neural-structured-learning** provides functionality to wrap any base model and compute all the necessary terms.
The framework computes a pairwise distance between each example in our data and its neighbors, then adds the result to the cost function defined by the base model.
## After using this method, you will see that the accuracy increases by about 1.15%
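
The sketch below shows one way this combination can look for a single example. It assumes the distance is an L1 distance between model outputs and that the neighbor term is scaled by a multiplier before being added to the base cost; the function names and numbers are hypothetical, and the framework's exact reduction (for instance, how it averages over neighbors or the batch) may differ.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def regularized_cost(base_cost, example_output, nbr_outputs, nbr_weights, multiplier):
    """Base cost plus the weighted distances to each neighbor's output."""
    nbr_term = sum(
        w * l1_distance(example_output, out)
        for out, w in zip(nbr_outputs, nbr_weights)
    )
    return base_cost + multiplier * nbr_term

# One example with two neighbor slots. The second slot is padding: its weight
# is 0.0, so it contributes nothing no matter what its output is.
cost = regularized_cost(
    base_cost=0.3,
    example_output=[0.9],
    nbr_outputs=[[0.7], [0.0]],
    nbr_weights=[0.98, 0.0],
    multiplier=10,
)
print(round(cost, 4))
```

Zero-weight slots matter in practice because examples with fewer neighbors than the maximum are padded, and the padding should not affect the loss.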

# Install libs

In [None]:
!pip install neural-structured-learning
!pip install tensorflow-text

# Import libs

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

import neural_structured_learning as nsl

import tensorflow as tf
import tensorflow_text as text
import tensorflow_hub as tfh

tf.keras.backend.clear_session()

# Read data/shuffle/split

In [None]:
df = pd.read_csv('/kaggle/input/disaster-tweets-cleaned/df.csv')
test_df = pd.read_csv('/kaggle/input/disaster-tweets-cleaned/test_df.csv')

X_tr, X_val, y_tr, y_val = train_test_split(
    df['ctext'].values, df['target'].values,
    test_size = 0.15,
    shuffle = True, stratify = df['target']
)
X_test = test_df['ctext'].values
y_test = test_df['target'].values
y_tr = np.reshape(y_tr, (-1, 1))
y_val = np.reshape(y_val, (-1, 1))
y_test = np.reshape(y_test, (-1, 1))
print(X_tr.shape, y_tr.shape)
print(X_val.shape, y_val.shape)
print(X_test.shape, y_test.shape)

# BertTokenizer/BertLayer

*  layers: 4
*  hidden state size: 128
*  Attention heads: 2

In [None]:
preprocessor = tfh.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
embedding_handler = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/2'
embedding_layer = tfh.KerasLayer(embedding_handler, trainable = True, name = 'embedder')

# Fine-tune BERT

In [None]:
def make_model():
    # Raw strings go in; the TF Hub preprocessor tokenizes them for BERT.
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
    encoder_inputs = preprocessor(text_input)
    x = embedding_layer(encoder_inputs)
    # Use the pooled sentence representation as the feature vector.
    x = x['pooled_output']
    output = tf.keras.layers.Dense(1, activation = 'sigmoid')(x)
    model = tf.keras.Model(text_input, output)
    model.compile(loss = 'binary_crossentropy', 
                  optimizer = tf.keras.optimizers.Adam(1e-4), 
                  metrics = ['acc'])
    return model

model = make_model()
model.fit(x = X_tr, y = y_tr, epochs = 3, validation_data = (X_val, y_val))

# Get a feature extractor from the fine-tuned BERT

In [None]:
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
encoder_inputs = preprocessor(text_input)
embedd = model.get_layer('embedder')(encoder_inputs)
embedder = tf.keras.Model(text_input, embedd)

# Create embeddings for the data

In [None]:
embed_X_tr = embedder(X_tr)['pooled_output'].numpy()
embed_X_val = embedder(X_val)['pooled_output'].numpy()
embed_X_test = embedder(X_test)['pooled_output'].numpy()

np.save('X_tr', embed_X_tr)
np.save('X_val', embed_X_val)
np.save('X_test', embed_X_test)

In [None]:
embed_X_tr = np.load('X_tr.npy')
embed_X_val = np.load('X_val.npy')
embed_X_test = np.load('X_test.npy')

# Neural Graph Learning

# Transform data to TFRecord

In [None]:
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value = value.tolist()))
def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value = [value.encode('utf-8')]))
def _float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value = value.tolist()))

In [None]:
def create_embedding_example(sent_embed, record_id, layer_id = 'pooled_output'):
    features = {
        'id': _bytes_feature(str(record_id)),
        'embedding': _float_feature(sent_embed)
    }
    return tf.train.Example(features = tf.train.Features(feature = features))

def create_embeddings(sent_embeds, output_path, starting_id = 0):
    record_id = int(starting_id)
    with tf.io.TFRecordWriter(output_path) as writer:
        for sent_embed in sent_embeds:
            example = create_embedding_example(sent_embed, record_id)
            record_id += 1
            writer.write(example.SerializeToString())
    return record_id

In [None]:
!mkdir -p tmp/tweets
create_embeddings(embed_X_tr, 'tmp/tweets/embeddings.tfr', 0)

# Build Graph from the Data

*  **lsh_splits:** number of times locality-sensitive hashing (LSH) splits the embedding space into buckets (0 in this case because the dataset is small, so there is no need to group the data into smaller buckets)
*  **lsh_rounds:** number of LSH rounds to run (0 in this case because we are not bucketizing the data)
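
For intuition about what bucketing means, here is a generic random-hyperplane LSH sketch in pure Python. It illustrates the general technique only; the helper names are hypothetical and the graph builder's internal hashing may work differently. Each hyperplane contributes one bit to a bucket key, so zero splits yields a single empty key and every example lands in the same bucket.

```python
import random

def make_hyperplanes(num_splits, dim, seed):
    """Draw num_splits random hyperplane normals in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_splits)]

def lsh_bucket(vector, hyperplanes):
    """Bucket key: one bit per hyperplane, set by the side the vector falls on."""
    bits = []
    for plane in hyperplanes:
        dot = sum(x * p for x, p in zip(vector, plane))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)

no_planes = make_hyperplanes(num_splits=0, dim=3, seed=12345)
two_planes = make_hyperplanes(num_splits=2, dim=3, seed=12345)

vec = [0.2, -0.5, 0.9]
same_direction = [0.4, -1.0, 1.8]  # vec scaled by 2

print(repr(lsh_bucket(vec, no_planes)))  # no splits: one shared bucket
print(lsh_bucket(vec, two_planes) == lsh_bucket(same_direction, two_planes))
```

Vectors that point in the same direction always share a bucket, which is why this kind of hashing pairs naturally with cosine similarity. Only examples in the same bucket would be compared, trading some missed edges for speed on large datasets.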

In [None]:
graph_builder_config = nsl.configs.GraphBuilderConfig(similarity_threshold=0.97, 
                                                      lsh_splits=0, 
                                                      lsh_rounds=0, 
                                                      random_seed=12345)
nsl.tools.build_graph_from_config(['tmp/tweets/embeddings.tfr'],
                                  'tmp/tweets/graph.tsv',
                                  graph_builder_config)

In [None]:
!wc -l tmp/tweets/graph.tsv

## Explore the Graph

In [None]:
graph = pd.read_csv('tmp/tweets/graph.tsv', 
                    delimiter = '\t', 
                    header = None,
                   names = ['node1', 'node2', 'weight'])
graph.head()

In [None]:
id1 = graph.node1.values[0]
g_tmp = graph[graph['node1'] == id1]
g_tmp.head()

In [None]:
print('Node1:', id1)
print(X_tr[id1])
print()
print('Neighbors:')
# node2 holds the IDs of the neighbors, which index into X_tr
for nbr_id in g_tmp.node2.values:
    print(X_tr[nbr_id])

# Transform each example in the data to a node with an ID

In [None]:
def create_example(record, label, record_id):
    features={
        'id': _bytes_feature(str(record_id)),
        'label': _int64_feature(label),
        'embed': _float_feature(record),
    }
    return tf.train.Example(features = tf.train.Features(feature=features))

def create_records(sent_embeds, labels, record_path, start_id = 0):
    record_id = int(start_id)
    with tf.io.TFRecordWriter(record_path) as writer:
        for sent_embed, label in zip(sent_embeds, labels):
            example = create_example(sent_embed, label, record_id)
            record_id += 1
            writer.write(example.SerializeToString())
    return record_id

In [None]:
next_record_id = create_records(embed_X_tr, y_tr, 'tmp/tweets/train.tfr', 0)
next_record_id

# Associate neighbors with each node

In [None]:
max_nbrs = 4
nsl.tools.pack_nbrs(
    'tmp/tweets/train.tfr',
    '',
    'tmp/tweets/graph.tsv',
    'tmp/tweets/nsl_train.tfr',
    add_undirected_edges=True,
    max_nbrs=max_nbrs)

# Read augmented data

In [None]:
NBR_FEATURE_PREFIX = 'NL_nbr_'
NBR_WEIGHT_SUFFIX = '_weight'
def parse_example(val):
    def pad_sequence(sequence, max_seq_length):
        pad_size = tf.maximum([0], max_seq_length - tf.shape(sequence)[0])
        padded = tf.concat(
            [sequence.values,
             tf.fill((pad_size), tf.cast(0, sequence.dtype))],
            axis=0)
        return tf.slice(padded, [0], [max_seq_length])
    
    feature_spec = {
        'embed': tf.io.VarLenFeature(dtype = tf.float32),
        'label': tf.io.FixedLenFeature((), tf.int64, default_value=-1),
    }
    for i in range(max_nbrs):
        nbr_feature_key = '{}{}_{}'.format(NBR_FEATURE_PREFIX, i, 'embed')
        nbr_weight_key = '{}{}{}'.format(NBR_FEATURE_PREFIX, i,
                                         NBR_WEIGHT_SUFFIX)
        feature_spec[nbr_feature_key] = tf.io.VarLenFeature(dtype = tf.float32)
        feature_spec[nbr_weight_key] = tf.io.FixedLenFeature(
            [1], tf.float32, default_value=tf.constant([0.0]))

    features = tf.io.parse_single_example(val, feature_spec)
    features['embed'] = pad_sequence(features['embed'], 128)
    for i in range(max_nbrs):
        nbr_feature_key = '{}{}_{}'.format(NBR_FEATURE_PREFIX, i, 'embed')
        features[nbr_feature_key] = pad_sequence(features[nbr_feature_key],
                                                 128)

    return features, features.pop('label')

In [None]:
tmp_ds = tf.data.TFRecordDataset(['tmp/tweets/nsl_train.tfr'])
tmp_ds = tmp_ds.map(parse_example)
for t, l in tmp_ds.take(1):
    for n in t.keys():
        print(n)

# Create Batched data pipeline

In [None]:
train_ds = tf.data.TFRecordDataset(['tmp/tweets/nsl_train.tfr'])
train_ds = train_ds.map(parse_example)
train_ds = train_ds.batch(256)

val_ds = tf.data.Dataset.from_tensor_slices((embed_X_val, y_val))
val_ds = val_ds.batch(256)

# Create Base Model

In [None]:
def make_feed_forward_model():
    inputs = tf.keras.Input(shape=(128,), dtype='float32', name='embed')
    dense_layer = tf.keras.layers.Dense(128, activation='relu')(inputs)
    dense_layer = tf.keras.layers.Dense(32, activation='relu')(dense_layer)
    outputs = tf.keras.layers.Dense(1, activation='sigmoid')(dense_layer)
    return tf.keras.Model(inputs=inputs, outputs=outputs)

model = make_feed_forward_model()
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['acc'])
model.fit(embed_X_tr, y_tr, validation_data = (embed_X_val, y_val), epochs = 15, verbose = 1)

## Base model evaluation

In [None]:
results = model.evaluate(embed_X_test, y_test)
print(results)

# Create regularized model

In [None]:
base_model = make_feed_forward_model()
graph_reg_config = nsl.configs.make_graph_reg_config(
    max_neighbors=2,
    multiplier=10,
    distance_type=nsl.configs.DistanceType.L1,
    sum_over_axis=-1)
graph_reg_model = nsl.keras.GraphRegularization(base_model,
                                                graph_reg_config)

In [None]:
graph_reg_model.compile(
    optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
graph_reg_model.fit(train_ds, validation_data = val_ds, epochs = 15, verbose = 1)

## Regularized Model evaluation

In [None]:
results = graph_reg_model.evaluate(embed_X_test, y_test)
print(results)