# Text Classification - Universal Sentence Encoder

In this notebook we train a classifier on top of the Universal Sentence Encoder, USE (DAN), using two kinds of input:

1. The whole abstract
2. Single sentences of the abstract

The data are not preprocessed; punctuation removal and similar steps are handled by the feature-column module.

In [12]:
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns
from nltk.tokenize import sent_tokenize
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

## Loading the model 

In [3]:
# I don't know whether the cached model is overwritten during retraining.
model_url = "https://tfhub.dev/google/universal-sentence-encoder/2" 

In [4]:
import hashlib
# Directory where TF-Hub caches the model (use an absolute path)
os.environ["TFHUB_CACHE_DIR"] = ''

# TF-Hub names the cached model directory after the SHA-1 hex digest of the URL
hashlib.sha1(model_url.encode("utf8")).hexdigest()

'1fb57c3ffe1a38479233ee9853ddd7a8ac8a8c47'
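
The hex digest above is the directory name TF-Hub uses inside `TFHUB_CACHE_DIR`. A minimal sketch of locating the cached module on disk follows; the cache directory `/tmp/tfhub_cache` and the helper name `cached_module_dir` are assumptions for illustration, not values from this notebook.

```python
import hashlib
import os


def cached_module_dir(module_url, cache_dir):
    """Return the directory in which TF-Hub would cache `module_url`."""
    digest = hashlib.sha1(module_url.encode("utf8")).hexdigest()
    return os.path.join(cache_dir, digest)


module_url = "https://tfhub.dev/google/universal-sentence-encoder/2"
# "/tmp/tfhub_cache" is a hypothetical cache location.
module_dir = cached_module_dir(module_url, "/tmp/tfhub_cache")
print(module_dir)
```

Checking whether this directory exists is a quick way to tell if the ~1 GB download will be triggered again.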

In [5]:
# Reduce logging output.
tf.logging.set_verbosity(tf.logging.ERROR)

In [6]:
%%time
# The first call downloads the model from TF-Hub (~1 GB), which takes a while
model = hub.Module(model_url, trainable = True) # trainable = True lets the module weights be fine-tuned

CPU times: user 2.72 s, sys: 204 ms, total: 2.92 s
Wall time: 2.96 s


## Loading the Abstract into a Dataframe

In [8]:
data_dir = "../../data/datasets/daniel_0212"
num_classes = 4 # ADJUST THAT

In [17]:
# Adjust to your dataset: maps a class directory name to an integer label
def get_label_id(class_name:str):
    if class_name == "clustering":   
        return 0
    if class_name == "association":
        return 1
    if class_name == "regression":
        return 2
    if class_name == "classification":
        return 3
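
Note that `get_label_id` returns `None` for any directory name it does not know, which would silently end up in the `class` column. A hedged alternative, assuming the same four class names, is a dictionary lookup that fails loudly; the names `LABEL_IDS` and `get_label_id_strict` are hypothetical.

```python
# Same mapping as get_label_id above, expressed as a lookup table.
LABEL_IDS = {
    "clustering": 0,
    "association": 1,
    "regression": 2,
    "classification": 3,
}


def get_label_id_strict(class_name: str) -> int:
    """Return the label id, raising an error for unknown class names."""
    try:
        return LABEL_IDS[class_name]
    except KeyError:
        raise ValueError(f"Unknown class directory: {class_name!r}") from None


print(get_label_id_strict("regression"))
```

With this variant, `num_classes` could also be derived as `len(LABEL_IDS)` instead of being set by hand.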

In [24]:
%%time
data_abstract = {}
data_abstract["sentence"] = []
data_abstract["class"] = []

data_sentences = {}
data_sentences["sentence"] = []
data_sentences["class"] = []

for root, dirs, files in os.walk(data_dir):
    for _dir in dirs: 
        for txt_file in [x for x in os.listdir(os.path.join(root, _dir)) if x.endswith((".txt", ".TXT"))]:
            # Class name = dir name
            class_name = _dir
            # Read the file; the context manager closes it even if reading fails
            file_name = os.path.join(root, _dir, txt_file)
            with open(file_name, "r") as file:
                txt = file.read()
            # Abstracts
            data_abstract["sentence"].append(txt)
            data_abstract["class"].append(get_label_id(class_name))
            # Sentences
            sentences = sent_tokenize(txt)
            for sentence in sentences:
                data_sentences["sentence"].append(sentence)
                data_sentences["class"].append(get_label_id(class_name))
                
            
            
            
df_abstracts = pd.DataFrame.from_dict(data_abstract)
df_sentences = pd.DataFrame.from_dict(data_sentences)
del data_abstract
del data_sentences

CPU times: user 3.57 s, sys: 327 ms, total: 3.89 s
Wall time: 3.91 s


### Abstracts

In [25]:
df_abstracts.sample(frac=1).head()

| | sentence | class |
|---|---|---|
| 5410 | Clustering is an essential data mining tool th... | 0 |
| 3713 | Single-cell RNA sequencing (scRNA-seq) is a fa... | 1 |
| 6603 | Tensor regression has shown to be advantageous... | 2 |
| 7678 | Having a regression model, we are interested i... | 2 |
| 5750 | Subspace clustering is the problem of partitio... | 0 |


In [28]:
df_abstracts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8240 entries, 0 to 8239
Data columns (total 2 columns):
sentence    8240 non-null object
class       8240 non-null int64
dtypes: int64(1), object(1)
memory usage: 128.8+ KB


### Sentences

In [26]:
df_sentences.sample(frac=1).head()

| | sentence | class |
|---|---|---|
| 45468 | HDDC is based on the idea that high-dimensiona... | 0 |
| 52189 | We apply our method on a real example to estim... | 2 |
| 47657 | Spatial Auto-Regression (SAR) is a\ncommon too... | 2 |
| 38030 | Some common clustering algorithms are\napplied... | 0 |
| 55678 | We develop a novel "decouple-recouple" dynamic... | 2 |


In [29]:
df_sentences.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58243 entries, 0 to 58242
Data columns (total 2 columns):
sentence    58243 non-null object
class       58243 non-null int64
dtypes: int64(1), object(1)
memory usage: 910.1+ KB


## Train and Test Split

### Abstract DataFrame

In [33]:
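df_abstracts = shuffle(df_abstracts) # Shuffle the DataFrame
# Note: X_abstracts is the training split and Y_abstracts the test split (both keep the "class" column)
X_abstracts, Y_abstracts = train_test_split(df_abstracts, test_size=0.2, random_state = 101)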

In [40]:
print("Train Data: \n")
print(X_abstracts.info())
print("\n Test Data: \n")
print(Y_abstracts.info())

Train Data: 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6592 entries, 4127 to 5946
Data columns (total 2 columns):
sentence    6592 non-null object
class       6592 non-null int64
dtypes: int64(1), object(1)
memory usage: 154.5+ KB
None

 Test Data: 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1648 entries, 7507 to 8158
Data columns (total 2 columns):
sentence    1648 non-null object
class       1648 non-null int64
dtypes: int64(1), object(1)
memory usage: 38.6+ KB
None


### Sentence DataFrame

In [46]:
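df_sentences = shuffle(df_sentences) # Shuffle the DataFrame
# Note: X_sentences is the training split and Y_sentences the test split (both keep the "class" column)
X_sentences, Y_sentences = train_test_split(df_sentences, test_size=0.2, random_state = 101)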

In [48]:
print("Train Data: \n")
print(X_sentences.info())
print("\n Test Data: \n")
print(Y_sentences.info())

Train Data: 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46594 entries, 7286 to 42326
Data columns (total 2 columns):
sentence    46594 non-null object
class       46594 non-null int64
dtypes: int64(1), object(1)
memory usage: 1.1+ MB
None

 Test Data: 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11649 entries, 8892 to 48762
Data columns (total 2 columns):
sentence    11649 non-null object
class       11649 non-null int64
dtypes: int64(1), object(1)
memory usage: 273.0+ KB
None
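
The split sizes in the outputs above follow from `test_size=0.2`: scikit-learn rounds the test share up and gives the remainder to training. A small standard-library sketch of that arithmetic; the helper `split_sizes` is hypothetical and only reproduces the counts, not the split itself.

```python
import math


def split_sizes(n_samples, test_size=0.2):
    """Return (n_train, n_test) for a fractional test size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


abstract_sizes = split_sizes(8240)    # number of abstracts in df_abstracts
sentence_sizes = split_sizes(58243)   # number of sentences in df_sentences
print(abstract_sizes, sentence_sizes)
```

Both results agree with the `info()` outputs: 6592 / 1648 abstracts and 46594 / 11649 sentences.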


## Training of the Classifier

### The Feature-Column Module

1. Preprocesses the input strings
2. Constructs a dense representation of each string (an embedding), which is fed into the classifier

___
```
hub.text_embedding_column(
    key,
    module_spec,
    trainable=False
)
```
___

- key = The column in the DataFrame that is passed to the Estimator
- module_spec = The URL of the embedding model
- trainable = Whether the embedding model's weights are updated during training
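
To make "dense representation" concrete, here is a toy sketch of the averaging idea behind a deep averaging network (DAN): each word is mapped to a vector and the sentence vector is their mean. The hash-based word vectors, the 8-dimensional size, and the function names are made up for illustration. The real USE module learns its vectors, passes the average through feed-forward layers, and outputs 512 dimensions.

```python
import hashlib

TOY_DIM = 8  # toy size chosen for readability, not the real embedding size


def toy_word_vector(word, dim=TOY_DIM):
    """Map a word to a fixed pseudo-random vector derived from its hash."""
    digest = hashlib.sha1(word.encode("utf8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def toy_sentence_embedding(sentence, dim=TOY_DIM):
    """Average the word vectors of a sentence into one fixed-size vector."""
    words = sentence.lower().split()
    if not words:
        return [0.0] * dim
    vectors = [toy_word_vector(word, dim) for word in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


embedding = toy_sentence_embedding("Subspace clustering partitions data")
print(len(embedding))
```

Whatever the length of the input text, the classifier always receives a vector of the same size, which is why whole abstracts and single sentences can share one architecture.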

In [None]:
embedded_text_feature_column = hub.text_embedding_column(
    key="sentence", 
    module_spec=model_url,
    trainable = False)

In [17]:
embedded_text_feature_column_retrain = hub.text_embedding_column(
    key="sentence", 
    module_spec=model_url,
    trainable = True)

### Constructing the Estimator

In the following we use a simple DNNClassifier. More: https://www.tensorflow.org/api_docs/python/tf/estimator/DNNClassifier



Key Facts: 

0. We build two otherwise identical estimators per dataset: one that uses the embedding model frozen (plain transfer learning) and one that retrains it.
1. Optimizer: Adadelta, which seems to be the fastest (http://ruder.io/optimizing-gradient-descent/index.html#amsgrad)
2. Hidden units: [1024, 512, 256]. It is unclear whether this size is justified; earlier runs used fewer hidden units, and this run checks for an improvement.
3. No Dropout
4. No Batch-Normalization
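
For a sense of scale, the classifier's trainable parameters can be counted by hand. This sketch assumes that the embedding column delivers 512-dimensional vectors (the USE output size) and that the output layer has one unit per class (4 here); the helper name `count_dense_parameters` is hypothetical.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases of a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus bias vector
    return total


# 512-d embedding -> hidden units [1024, 512, 256] -> 4 class logits
layer_sizes = [512, 1024, 512, 256, 4]
total_parameters = count_dense_parameters(layer_sizes)
print(total_parameters)
```

Under these assumptions the classifier head has roughly 1.2 million parameters, against about 6,600 training abstracts, which is one reason to watch the gap between training and test accuracy.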

In [None]:
tf.logging.set_verbosity(tf.logging.INFO) # Show INFO-level logs (training progress); this is more verbose than ERROR

In [22]:
estimator_abstracts = tf.estimator.DNNClassifier(
    hidden_units=[1024, 512, 256],
    feature_columns=[embedded_text_feature_column],
    model_dir = "models_save/abstracts",
    n_classes=num_classes,
    optimizer=tf.train.AdadeltaOptimizer(learning_rate=0.001))

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmp786y30g8', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fb140524ef0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [None]:
estimator_sentences = tf.estimator.DNNClassifier(
    hidden_units=[1024, 512, 256],
    feature_columns=[embedded_text_feature_column],
    model_dir = "models_save/sentences",
    n_classes=num_classes,
    optimizer=tf.train.AdadeltaOptimizer(learning_rate=0.001))

In [None]:
estimator_retrain_abstracts = tf.estimator.DNNClassifier(
    hidden_units=[1024, 512, 256],
    feature_columns=[embedded_text_feature_column_retrain],
    model_dir = "models_save/abstracts_retrain",
    n_classes=num_classes,
    optimizer=tf.train.AdadeltaOptimizer(learning_rate=0.001))

In [None]:
estimator_retrain_sentences = tf.estimator.DNNClassifier(
    hidden_units=[1024, 512, 256],
    feature_columns=[embedded_text_feature_column_retrain],
    model_dir = "models_save/sentences_retrain",
    n_classes=num_classes,
    optimizer=tf.train.AdadeltaOptimizer(learning_rate=0.001))

### Preparing the Input Functions

As with the model, we do not tune hyperparameters here. The batch size is the `pandas_input_fn` default of 128.

#### DataFrame with Abstracts

In [24]:
# Training input on the whole training set with no limit on training epochs.
train_input_fn_abstracts = tf.estimator.inputs.pandas_input_fn(
    X_abstracts, X_abstracts["class"], num_epochs=None, shuffle = False)

# Prediction on the whole training set.
predict_train_input_fn_abstracts = tf.estimator.inputs.pandas_input_fn(
    X_abstracts, X_abstracts["class"], shuffle=False)
# Prediction on the test set.
predict_test_input_fn_abstracts = tf.estimator.inputs.pandas_input_fn(
    Y_abstracts, Y_abstracts["class"], shuffle=False)

#### DataFrame with Sentences

In [None]:
# Training input on the whole training set with no limit on training epochs.
train_input_fn_sentences = tf.estimator.inputs.pandas_input_fn(
    X_sentences, X_sentences["class"], num_epochs=None, shuffle = False)

# Prediction on the whole training set.
predict_train_input_fn_sentences = tf.estimator.inputs.pandas_input_fn(
    X_sentences, X_sentences["class"], shuffle=False)
# Prediction on the test set.
predict_test_input_fn_sentences = tf.estimator.inputs.pandas_input_fn(
    Y_sentences, Y_sentences["class"], shuffle=False)

### Training

#### Estimators without retraining the embedding model

In [26]:
estimator_abstracts.train(input_fn=train_input_fn_abstracts, steps=4000);

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp786y30g8/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp786y30g8/model.ckpt.
INFO:tensorflow:loss = 177.35513, step = 1
INFO:tensorflow:global_step/sec: 5.66988
INFO:tensorflow:loss = 46.401173, step = 101 (17.641 sec)
INFO:tensorflow:global_step/sec: 5.97983
INFO:tensorflow:loss = 9.45266, step = 201 (16.722 sec)
INFO:tensorflow:global_step/sec: 8.94714
INFO:tensorflow:loss = 11.922728, step = 301 (11.175 sec)
INFO:tensorflow:global_step/sec: 9.2158
INFO:tensorflow:loss = 5.3924923, step = 401 (10.852 sec)

In [None]:
estimator_sentences.train(input_fn=train_input_fn_sentences, steps=4000)

#### Estimators with retraining the embedding model

In [None]:
estimator_retrain_abstracts.train(input_fn=train_input_fn_abstracts, steps=4000)

In [None]:
estimator_retrain_sentences.train(input_fn=train_input_fn_sentences, steps=4000)
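
Because every estimator trains for `steps=4000` at a batch size of 128, the two datasets are traversed a very different number of times. A short sketch of the arithmetic, using the training-set sizes reported earlier (6592 abstracts, 46594 sentences); the variable names are illustrative.

```python
steps = 4000
batch_size = 128  # batch size stated in the input-function section

examples_seen = steps * batch_size
epochs_abstracts = examples_seen / 6592    # training abstracts
epochs_sentences = examples_seen / 46594   # training sentences

print(examples_seen)
print(round(epochs_abstracts, 2), round(epochs_sentences, 2))
```

The abstract models see each training example about 78 times, the sentence models only about 11 times, so the two setups are not trained for an equal number of epochs.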

## Evaluation

### In-Sample

In [None]:
result_abstracts = estimator_abstracts.evaluate(input_fn=predict_train_input_fn_abstracts)
result_sentences = estimator_sentences.evaluate(input_fn=predict_train_input_fn_sentences)

print("Model learned with abstracts and without retraining: \n")
print(result_abstracts)

print("Model learned with sentences and without retraining: \n")
print(result_sentences)

In [None]:
result_abstracts = estimator_retrain_abstracts.evaluate(input_fn=predict_train_input_fn_abstracts)
result_sentences = estimator_retrain_sentences.evaluate(input_fn=predict_train_input_fn_sentences)

print("Model learned with abstracts and retraining: \n")
print(result_abstracts)

print("Model learned with sentences and retraining: \n")
print(result_sentences)

### Out-of-Sample

In [None]:
result_abstracts = estimator_abstracts.evaluate(input_fn=predict_test_input_fn_abstracts)
result_sentences = estimator_sentences.evaluate(input_fn=predict_test_input_fn_sentences)

print("Model learned with abstracts and without retraining: \n")
print(result_abstracts)

print("Model learned with sentences and without retraining: \n")
print(result_sentences)

In [27]:
result_abstracts_retrain = estimator_retrain_abstracts.evaluate(input_fn=predict_test_input_fn_abstracts)
result_sentences_retrain = estimator_retrain_sentences.evaluate(input_fn=predict_test_input_fn_sentences)

print("Model learned with abstracts and retraining: \n")
print(result_abstracts_retrain)

print("Model learned with sentences and retraining: \n")
print(result_sentences_retrain)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-06-17:43:29
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp786y30g8/model.ckpt-4000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-12-06-17:43:55
INFO:tensorflow:Saving dict for global step 4000: accuracy = 0.9764866, average_loss = 0.033985835, global_step = 4000, loss = 4.308358
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 4000: /tmp/tmp786y30g8/model.ckpt-4000
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the g
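
The evaluation log reports both `average_loss` and `loss`. The numbers are consistent with `average_loss` being the mean loss per example and `loss` being the mean of the summed loss per batch. A quick check, assuming the logged line belongs to the 1,648-abstract test set evaluated with the default batch size of 128:

```python
import math

num_examples = 1648         # size of the abstract test set
batch_size = 128            # pandas_input_fn default
average_loss = 0.033985835  # per-example loss taken from the log line

num_batches = math.ceil(num_examples / batch_size)
total_loss = average_loss * num_examples
loss_per_batch = total_loss / num_batches

print(num_batches, round(loss_per_batch, 4))
```

The result matches the logged `loss = 4.308358`, so `average_loss` is the value to compare across datasets of different size.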

## Confusion Matrix
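
A confusion matrix counts, for every true class, how often each class was predicted. The sketch below uses made-up labels purely to show the mechanics; the function name and the toy data are assumptions, and no result of this notebook is implied. Real predictions would come from the estimators' `predict` method.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Return a matrix whose entry [t][p] counts examples of class t predicted as p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_id, predicted_id in zip(true_labels, predicted_labels):
        matrix[true_id][predicted_id] += 1
    return matrix


# Toy labels (0=clustering, 1=association, 2=regression, 3=classification)
toy_true = [0, 0, 1, 2, 3, 3]
toy_pred = [0, 1, 1, 2, 3, 2]

toy_matrix = confusion_matrix(toy_true, toy_pred, num_classes=4)
toy_accuracy = sum(toy_matrix[i][i] for i in range(4)) / len(toy_true)

for row in toy_matrix:
    print(row)
```

Rows are true classes and columns are predicted classes, so off-diagonal entries show which classes get confused with each other.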

### In-Sample

### Test Set