# The Stanford Sentiment Treebank 
The Stanford Sentiment Treebank consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. We use the two-way (positive/negative) class split, and use only sentence-level labels.

In [1]:
from IPython.display import display, Markdown
with open('../../doc/env_variables_setup.md', 'r') as fh:
    content = fh.read()
display(Markdown(content))

Environment variables that need to be defined:   
`export DIR_PROJ=your_path_git_repository`  
`export PYTHONPATH=$DIR_PROJ/src`  
`export PATH_TENSORBOARD=your_path_tensorboard`  
`export PATH_DATASETS=your_path_datasets`  
`export PROJECT_ID=your_gcp_project_id`  
`export BUCKET_NAME=your_gcp_gs_bucket_name`  
`export BUCKET_TRANSLATION_NAME=your_gcp_gs_bucket_translation_name`  
`export BUCKET_STAGING_NAME=your_gcp_gs_bucket_staging_name`  
`export REGION=your_region`  
`export PATH_SAVE_MODEL=your_path_to_save_model`  
`export CLOUDSDK_PYTHON=your_path/conda-env/env_gcp_sdk/bin/python`  
`export CLOUDSDK_GSUTIL_PYTHON=your_path/conda-env/env_gcp_sdk/bin/python`  

- Use local Jupyter Lab 
    - you need to have the `jupyter-notebook` Anaconda python environment created [link](local_jupyter_lab_installation.md) 
    - you need to have the `jupyter-notebook` Anaconda python environment activated [link](local_jupyter_lab_installation.md) 
    - then define the environment variables above (copy and paste) 
    - you need to have the `env_multilingual_class` Anaconda python environment created [link](local_jupyter_lab_installation.md)  
    - start Jupyter Lab:  `jupyter lab` 
    - open a Jupyter Lab notebook from `notebook/` 
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`
    - choose the proper Anaconda python environment:  `Python [conda env:env_multilingual_class]` [link](conda_env.md) 


- Use GCP Jupyter Lab 
    - Go to GCP
    - open a Cloud Shell
    - `ssh-keygen -t rsa -b 4096 -C firstName_lastName`
    - `cp .ssh/id_rsa.pub .`
    - use Cloud Editor to open the file `id_rsa.pub` and copy its full content
    - Go to Compute Engine -> Metadata
    - Click SSH Keys
    - Click Edit
    - Click + Add item and paste the content of `id_rsa.pub`
    - You should see firstName_lastName on the left
    - Click Save
    - you need to start an AI Platform instance 
    - open a Jupyter Lab terminal and go to `/home/gcp_user_name/`
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`
    - then `cd proj_multilingual_text_classification/`
    - create the Anaconda Python environment `conda env create -f env/environment.yml`
    - create a file `config.sh` in `/home` with the following information: 
    ```
    #!/bin/bash
    
    echo "applying some configuration ..."
    git config --global user.email user_email
    git config --global user.name user_name
    git config --global credential.helper store
        
    # Add the environment variables listed above here
    # [EDIT ME]
    export DIR_PROJ=your_path_git_repository
    export PYTHONPATH=$DIR_PROJ/src
  
    cd /home/gcp_user_name/
    
    conda activate env_multilingual_class

    export PS1='\[\e[91m\]\u@:\[\e[32m\]\w\[\e[0m\]$'
    ```
    - Go to AI Platform Notebook, select your instance and click "Reset".
    - Wait and refresh your web browser with the Notebook


## Import Packages

In [2]:
import tensorflow as tf
from transformers import (
    BertConfig,
    BertTokenizer,
    XLMRobertaTokenizer,
    TFBertModel,
    TFXLMRobertaModel,
)
import os
from datetime import datetime
import sys

from absl import logging
from absl import flags
from absl import app
import logging as logger
tf.get_logger().propagate = False

## Import local packages

In [3]:
import preprocessing.preprocessing as pp
import utils.model_metrics as mm
import utils.model_utils as mu
import model.tf_custom_bert_classification.model as tf_custom_bert
import model.tf_bert_classification.model as tf_bert

In [4]:
import importlib
importlib.reload(pp);
importlib.reload(mm);
importlib.reload(mu);
importlib.reload(tf_bert);
importlib.reload(tf_custom_bert);

## Check configuration

In [5]:
print(tf.version.GIT_VERSION, tf.version.VERSION)

v2.2.0-rc4-8-g2b96f3662b 2.2.0


In [6]:
print(tf.keras.__version__)

2.3.0-tf


In [7]:
gpus = tf.config.list_physical_devices('GPU')
if len(gpus)>0:
    for gpu in gpus:
        print('Name:', gpu.name, '  Type:', gpu.device_type)
else:
    print('No GPU available !!!!')

No GPU available !!!!


## Define Paths

In [8]:
try:
    data_dir=os.environ['PATH_DATASETS']
except KeyError:
    print('missing PATH_DATASETS')
try:   
    tensorboard_dir=os.environ['PATH_TENSORBOARD']
except KeyError:
    print('missing PATH_TENSORBOARD')
try:   
    savemodel_dir=os.environ['PATH_SAVE_MODEL']
except KeyError:
    print('missing PATH_SAVE_MODEL')
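
The lookups above only print a message when a variable is missing, so a later cell that uses `data_dir` or `savemodel_dir` would fail with a `NameError`. A minimal sketch of a fail-fast alternative follows. `require_env` is a hypothetical helper (not part of the project code), and the example paths are placeholders.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for all names, or raise listing every missing one.

    Hypothetical helper for illustration; not part of the project code.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError('missing environment variables: ' + ', '.join(missing))
    return {name: environ[name] for name in names}


# Example with an explicit mapping (placeholder paths) so the sketch runs anywhere;
# in the notebook one would call require_env([...]) without the second argument.
example_env = {'PATH_DATASETS': '/tmp/datasets',
               'PATH_TENSORBOARD': '/tmp/tensorboard',
               'PATH_SAVE_MODEL': '/tmp/models'}
paths = require_env(['PATH_DATASETS', 'PATH_TENSORBOARD', 'PATH_SAVE_MODEL'], example_env)
data_dir = paths['PATH_DATASETS']
tensorboard_dir = paths['PATH_TENSORBOARD']
savemodel_dir = paths['PATH_SAVE_MODEL']
```

Collecting all missing names before raising means a single run reports every variable that still needs to be exported.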

## Read data from TFRecord files [local training of the model]

In [9]:
# Path of the directory with TFRecord files
# (no trailing slash: sub-directories are appended below with a leading '/')
tfrecord_data_dir=data_dir+'/tfrecord/sst2'

## Define parameters of the model

In [10]:
# models
MODELS = [(TFBertModel,         BertTokenizer,       'bert-base-multilingual-uncased'),
          (TFXLMRobertaModel,   XLMRobertaTokenizer, 'jplu/tf-xlm-roberta-base')]
model_index = 0 # BERT
model_class        = MODELS[model_index][0] # i.e. TFBertModel
tokenizer_class    = MODELS[model_index][1] # i.e. BertTokenizer
pretrained_weights = MODELS[model_index][2] # i.e. 'bert-base-multilingual-uncased'
number_label = 2                                                        

## Train the model locally with AI Platform Training (for tests)

In [11]:
#savemodel_path = os.path.join(savemodel_dir, 'saved_model')
pretrained_model_dir=savemodel_dir+'/pretrained_model/'+pretrained_weights
model_name='tf_bert_classification'

In [12]:
# train locally
os.environ['EPOCH'] = '1' 
os.environ['STEPS_PER_EPOCH_TRAIN'] = '1' 
os.environ['BATCH_SIZE_TRAIN'] = '32' 
os.environ['STEPS_PER_EPOCH_EVAL'] = '1' 
os.environ['BATCH_SIZE_EVAL'] = '64'
os.environ['TRAINER_PACKAGE_PATH'] = os.environ['PYTHONPATH']
os.environ['MAIN_TRAINER_MODULE'] = 'model.'+model_name+'.task'
os.environ['INPUT_EVAL_TFRECORDS'] = tfrecord_data_dir+'/valid'
os.environ['INPUT_TRAIN_TFRECORDS'] = tfrecord_data_dir+'/train'
os.environ['OUTPUT_DIR'] = savemodel_dir
os.environ['PRETRAINED_MODEL_DIR']= pretrained_model_dir
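
The variables set above are expanded by the shell when `gcloud ai-platform local train` is invoked. As a sketch, the same argument list can be assembled in Python, which makes it easy to inspect before launching anything. `build_local_train_command` is a hypothetical helper and the package path is a placeholder. Only flags that appear in this notebook are used.

```python
import shlex


def build_local_train_command(module_name, package_path, trainer_args):
    """Assemble the `gcloud ai-platform local train` command as an argument list.

    Hypothetical helper; everything after the bare `--` is passed to the trainer module.
    """
    command = ['gcloud', 'ai-platform', 'local', 'train',
               '--module-name=' + module_name,
               '--package-path=' + package_path,
               '--']
    command += ['--{}={}'.format(key, value) for key, value in trainer_args.items()]
    return command


command = build_local_train_command(
    'model.tf_bert_classification.task',
    'your_path_git_repository/src',       # placeholder for $PYTHONPATH
    {'epochs': 1,
     'steps_per_epoch_train': 1,
     'batch_size_train': 32,
     'steps_per_epoch_eval': 1,
     'batch_size_eval': 64})

# Print a shell-safe version of the command for inspection.
print(' '.join(shlex.quote(part) for part in command))
```

The resulting list could be handed to `subprocess.run(command)`, which avoids depending on environment variables being visible to the shell.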

In [13]:
%%bash
# Use AI Platform (formerly Cloud Machine Learning Engine) to train the model on the local file system
gcloud ai-platform local train \
   --module-name=$MAIN_TRAINER_MODULE \
   --package-path=$TRAINER_PACKAGE_PATH \
   -- \
   --epochs=$EPOCH \
   --steps_per_epoch_train=$STEPS_PER_EPOCH_TRAIN \
   --batch_size_train=$BATCH_SIZE_TRAIN \
   --steps_per_epoch_eval=$STEPS_PER_EPOCH_EVAL \
   --batch_size_eval=$BATCH_SIZE_EVAL \
   --input_eval_tfrecords=$INPUT_EVAL_TFRECORDS \
   --input_train_tfrecords=$INPUT_TRAIN_TFRECORDS \
   --output_dir=$OUTPUT_DIR \
   --pretrained_model_dir=$PRETRAINED_MODEL_DIR \
   --verbosity_level='INFO'

Process is interrupted.


## Debug model's function

In [14]:
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# reset
tf.keras.backend.clear_session()

# create and compile the Keras model in the context of strategy.scope
with strategy.scope():
    model=tf_bert.create_model(pretrained_weights, 
                               pretrained_model_dir=pretrained_model_dir,
                               num_labels=number_label,
                               learning_rate=3e-5,
                               epsilon=1e-08)

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
Number of devices: 1






In [15]:
model.summary()

Model: "tf_bert_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  167356416 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 167,357,954
Trainable params: 167,357,954
Non-trainable params: 0
_________________________________________________________________


In [16]:
model.inputs

In [17]:
# define default parameters
BATCH_SIZE_TRAIN = 32
BATCH_SIZE_TEST = 32
BATCH_SIZE_VALID = 64
EPOCHS = 1
STEP_EPOCH_TRAIN = 5
STEP_EPOCH_VALID = 1

In [18]:
# Using function
train_files = tfrecord_data_dir+'/'+model.name+'/train'
test_files = tfrecord_data_dir+'/'+model.name+'/test'
valid_files = tfrecord_data_dir+'/'+model.name+'/valid'

train_dataset = tf_bert.build_dataset(train_files, BATCH_SIZE_TRAIN)
test_dataset = tf_bert.build_dataset(test_files, BATCH_SIZE_TEST)
valid_dataset = tf_bert.build_dataset(valid_files, BATCH_SIZE_VALID)

train_dataset=train_dataset.repeat(EPOCHS+1)

In [19]:
for i in valid_dataset:
    print(i)
    break

({'input_ids': <tf.Tensor: shape=(64, 128), dtype=int32, numpy=
array([[  101, 11526, 10855, ...,     0,     0,     0],
       [  101, 10768,   112, ...,     0,     0,     0],
       [  101, 10399, 10108, ...,     0,     0,     0],
       ...,
       [  101, 10855, 11229, ...,     0,     0,     0],
       [  101, 10923, 12207, ...,     0,     0,     0],
       [  101,   151, 10407, ...,     0,     0,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(64, 128), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(64, 128), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>
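
In the batch printed above, every row of `input_ids` has 128 positions and ends in zeros, `attention_mask` is 1 at the start of each row and 0 at the end, and `token_type_ids` is 0 everywhere. The sketch below shows how fixed-length features of this kind are commonly built. It assumes a padding id of 0 and a length of 128, both read off the output. The project's real preprocessing lives in `preprocessing.preprocessing` and may differ. `pad_and_mask` and the example ids are illustrative only.

```python
def pad_and_mask(token_ids, max_length=128, pad_id=0):
    """Truncate or right-pad token_ids and build the matching masks.

    Simplified illustration: plain truncation, no special-token handling.
    """
    ids = list(token_ids)[:max_length]
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        'input_ids': ids + [pad_id] * n_pad,
        'attention_mask': [1] * n_real + [0] * n_pad,   # 1 = real token, 0 = padding
        'token_type_ids': [0] * max_length,             # single sentence: one segment
    }


features = pad_and_mask([101, 11526, 10855, 102])   # illustrative ids
print(features['input_ids'][:6])
print(sum(features['attention_mask']))
```

The attention mask lets the model ignore the padded positions, so sentences of different lengths can share one batch.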

In [20]:
FLAGS = flags.FLAGS

def del_all_flags(FLAGS):
    flags_dict = FLAGS._flags()    
    keys_list = [keys for keys in flags_dict]    
    for keys in keys_list:
        FLAGS.__delattr__(keys)
    
del_all_flags(flags.FLAGS)

# to avoid crashes in Notebook
flags.DEFINE_string('f', '', 'kernel') # only needed in a Jupyter notebook, to avoid "UnrecognizedFlagError: Unknown command line flag 'f'"
# to avoid crashes with absl
flags.DEFINE_enum('verbosity', 'INFO', ['VERBOSE', 'DEBUG', 'INFO', 'WARNING', 'ERROR', 'FATAL'], 'verbosity in the logfile')

# default parameters for training the model
# compute and save accuracy and loss after N steps
N_STEPS_HISTORY = 10

# hyper parameters
# adam parameters
LEARNING_RATE = 3e-5
EPSILON = 1e-08
# learning rate decay parameters
DECAY_LR = 0.95
DECAY_TYPE = 'exponential'
N_BATCH_DECAY = 2
# number of classes
NUM_CLASSES = 2
# BERT maximum sequence length; be careful, BERT supports at most 512 tokens!
MAX_LENGTH = 512

# get parameters for the training
flags.DEFINE_float('learning_rate', LEARNING_RATE, 'learning rate')
flags.DEFINE_float('decay_learning_rate', DECAY_LR, 'decay of the learning rate, e.g. 0.9')
flags.DEFINE_float('epsilon', EPSILON, 'epsilon')
flags.DEFINE_integer('epochs', EPOCHS, 'The number of epochs to train')
flags.DEFINE_integer('steps_per_epoch_train', STEP_EPOCH_TRAIN, 'The number of steps per epoch to train')
flags.DEFINE_integer('batch_size_train', BATCH_SIZE_TRAIN, 'Batch size for training')
flags.DEFINE_integer('steps_per_epoch_eval', STEP_EPOCH_VALID, 'The number of steps per epoch to evaluate')
flags.DEFINE_integer('batch_size_eval', BATCH_SIZE_VALID, 'Batch size for evaluation')
flags.DEFINE_integer('num_classes', NUM_CLASSES, 'number of classes in our model')
flags.DEFINE_integer('n_steps_history', N_STEPS_HISTORY, 'number of steps after which custom history is recorded')
flags.DEFINE_integer('n_batch_decay', N_BATCH_DECAY, 'number of batches after which the learning rate is updated')
flags.DEFINE_string('decay_type', DECAY_TYPE, 'type of decay for the learning rate: exponential, stepwise, timebased, or constant')
flags.DEFINE_string('input_train_tfrecords', None, 'input folder of tfrecords training data')
flags.DEFINE_string('input_eval_tfrecords', None, 'input folder of tfrecords evaluation data')
flags.DEFINE_string('output_dir', None, 'gs blob where all the outputs of the model are stored')
flags.DEFINE_string('pretrained_model_dir', None, 'directory containing the pretrained model')
flags.DEFINE_enum('verbosity_level', 'INFO', ['VERBOSE', 'DEBUG', 'INFO', 'WARNING', 'ERROR', 'FATAL'], 'verbosity in the logfile')
flags.DEFINE_boolean('use_tpu', False, 'activate TPU for training')
flags.DEFINE_boolean('use_decay_learning_rate', False, 'activate decay learning rate')
flags.DEFINE_boolean('is_hyperparameter_tuning', False, 'automatic and inter flag')
FLAGS(sys.argv);
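
The flags `decay_learning_rate`, `n_batch_decay` and `decay_type` describe a learning-rate schedule. The exact formula applied inside `train_and_evaluate` is not shown in this notebook. One common reading of an `'exponential'` decay with these parameters is a staircase schedule that multiplies the rate by the decay factor every `n_batch_decay` batches. The sketch below illustrates that assumption with the default values defined above. `exponential_decay` is a hypothetical function.

```python
def exponential_decay(initial_lr, decay, n_batch_decay, batch_index):
    """Staircase exponential decay: multiply by `decay` every `n_batch_decay` batches.

    Assumed formula for illustration; the project's own schedule may differ.
    """
    return initial_lr * decay ** (batch_index // n_batch_decay)


# Defaults from the cell above: LEARNING_RATE=3e-5, DECAY_LR=0.95, N_BATCH_DECAY=2
schedule = [exponential_decay(3e-5, 0.95, 2, batch) for batch in range(6)]
for batch, lr in enumerate(schedule):
    print('batch {}: lr = {:.3e}'.format(batch, lr))
```

Under this assumption the rate stays at its initial value for batches 0 and 1, drops by 5% for batches 2 and 3, and so on.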

In [21]:
print(FLAGS)


/Users/tarrade/anaconda-release/conda-env/env_multilingual_class/lib/python3.7/site-packages/ipykernel_launcher.py:
  --batch_size_eval: Batch size for evaluation
    (default: '64')
    (an integer)
  --batch_size_train: Batch size for training
    (default: '32')
    (an integer)
  --decay_learning_rate: decay of the learning rate, e.g. 0.9
    (default: '0.95')
    (a number)
  --decay_type: type of decay for the learning rate: exponential, stepwise,
    timebased, or constant
    (default: 'exponential')
  --epochs: The number of epochs to train
    (default: '1')
    (an integer)
  --epsilon: epsilon
    (default: '1e-08')
    (a number)
  --f: kernel
    (default: '')
  --input_eval_tfrecords: input folder of tfrecords evaluation data
  --input_train_tfrecords: input folder of tfrecords training data
  --[no]is_hyperparameter_tuning: automatic and inter flag
    (default: 'false')
  --learning_rate: learning rate
    (default: '3e-05')
    (a number)
  --n_batch_decay: number of

In [22]:
history_test=tf_bert.train_and_evaluate(model, 
                                        num_epochs=1, 
                                        steps_per_epoch=2, 
                                        train_data=train_dataset, 
                                        validation_steps=1, 
                                        eval_data=valid_dataset, 
                                        n_steps_history=1,
                                        output_dir=savemodel_dir,
                                        FLAGS=FLAGS,
                                        decay_type='exponential',
                                        learning_rate=3e-5,
                                        s=0.95,
                                        n_batch_decay=2,
                                        metric_accuracy='NotDefined')

INFO:absl:training the model ...
INFO:absl:model's callback:
 [<tensorflow.python.keras.callbacks.TensorBoard object at 0x7f8282e2e950>]
INFO:absl:starting model.fit




INFO:absl:
execution time: 0:01:35
INFO:absl:
debugging .... : 


# Structure of the data:

   <RepeatDataset shapes: ({input_ids: (None, None), attention_mask: (None, None), token_type_ids: (None, None)}, (None,)), types: ({input_ids: tf.int32, attention_mask: tf.int32, token_type_ids: tf.int32}, tf.int64)>

# Output shape of one entry:
   ({'input_ids': TensorShape([None, None]), 'attention_mask': TensorShape([None, None]), 'token_type_ids': TensorShape([None, None])}, TensorShape([None]))

# Output types of one entry:
   ({'input_ids': tf.int32, 'attention_mask': tf.int32, 'token_type_ids': tf.int32}, tf.int64)

# Output typesof one entry:
   ({'input_ids': <class 'tensorflow.python.framework.ops.Tensor'>, 'attention_mask': <class 'tensorflow.python.framework.ops.Tensor'>, 'token_type_ids': <class 'tensorflow.python.framework.ops.Tensor'>}, <class 'tensorflow.python.framework.ops.Tensor'>)
 

# Shape of the data:

   (4210, 2)
   ---> 4210 batches
   ---> 2 dim
        label
           shape: (32,)
        dict structure
           dim: 3
        

In [23]:
for i in train_dataset:
    print(i)
    break

({'input_ids': <tf.Tensor: shape=(32, 128), dtype=int32, numpy=
array([[  101, 21270, 94696, ...,     0,     0,     0],
       [  101,   143, 45100, ...,     0,     0,     0],
       [  101, 24220,   102, ...,     0,     0,     0],
       ...,
       [  101, 11008, 10346, ...,     0,     0,     0],
       [  101, 43062, 15648, ...,     0,     0,     0],
       [  101, 13178, 18418, ...,     0,     0,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(32, 128), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(32, 128), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>

In [24]:
model.summary()

Model: "tf_bert_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  167356416 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 167,357,954
Trainable params: 167,357,954
Non-trainable params: 0
_________________________________________________________________


In [25]:
model.inputs

In [26]:
model.outputs

In [27]:
#os.environ['MODEL_LOCAL']=savemodel_path+'/'+model.name
os.environ['MODEL_LOCAL']=savemodel_dir+'/saved_model/'+model.name

In [28]:
#os.environ['MODEL_LOCAL']

In [29]:
!ls -la $MODEL_LOCAL

total 18984
drwxr-xr-x  5 tarrade  staff      160 Aug 30 14:36 .
drwxr-xr-x  4 tarrade  staff      128 Aug 30 12:34 ..
drwxr-xr-x  2 tarrade  staff       64 Aug 30 12:34 assets
-rw-r--r--  1 tarrade  staff  9719354 Aug 30 14:36 saved_model.pb
drwxr-xr-x  4 tarrade  staff      128 Aug 30 14:36 variables


In [30]:
%%bash
saved_model_cli show --dir $MODEL_LOCAL --tag_set serve --signature_def serving_default

The given SavedModel SignatureDef contains the following input(s):
  inputs['input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 5)
      name: serving_default_input_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['output_1'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 2)
      name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict
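
The signature above exposes a single input, `input_ids`, with dtype int32 and shape `(-1, 5)`, and returns two scores per row. As a sketch, a request body for such a signature could be built in the row ("instances") format of the TensorFlow Serving REST API. Serving this model is an assumption, since the notebook does not deploy it. `make_predict_request` and the example ids are hypothetical.

```python
import json


def make_predict_request(batch_input_ids, sequence_length=5):
    """Build a TensorFlow Serving style JSON body (row format) for `input_ids`.

    sequence_length=5 mirrors the (-1, 5) shape reported by saved_model_cli.
    """
    for row in batch_input_ids:
        if len(row) != sequence_length:
            raise ValueError(
                'expected {} ids per row, got {}'.format(sequence_length, len(row)))
    return json.dumps(
        {'instances': [{'input_ids': list(row)} for row in batch_input_ids]})


body = make_predict_request([[101, 11526, 10855, 102, 0]])   # illustrative ids
print(body)
```

Checking the row length up front surfaces a shape mismatch before the request reaches the model. A fixed length of 5 in the exported signature also suggests the model was traced with a short dummy input, which is worth verifying against the 128-token batches used for training.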


In [32]:
model.evaluate(test_dataset)



[0.5532639622688293, 0.9994508624076843]

In [102]:
train_dataset

<RepeatDataset shapes: ({input_ids: (None, None), attention_mask: (None, None), token_type_ids: (None, None)}, (None,)), types: ({input_ids: tf.int32, attention_mask: tf.int32, token_type_ids: tf.int32}, tf.int64)>

In [104]:
from tensorflow.python.data.ops import dataset_ops
dataset_ops.get_legacy_output_shapes(train_dataset)

({'input_ids': TensorShape([None, None]),
  'attention_mask': TensorShape([None, None]),
  'token_type_ids': TensorShape([None, None])},
 TensorShape([None]))

In [33]:
pp.print_info_data(train_dataset)

# Structure of the data:

   <RepeatDataset shapes: ({input_ids: (None, None), attention_mask: (None, None), token_type_ids: (None, None)}, (None,)), types: ({input_ids: tf.int32, attention_mask: tf.int32, token_type_ids: tf.int32}, tf.int64)>

# Output shape of one entry:
   ({'input_ids': TensorShape([None, None]), 'attention_mask': TensorShape([None, None]), 'token_type_ids': TensorShape([None, None])}, TensorShape([None]))

# Output types of one entry:
   ({'input_ids': tf.int32, 'attention_mask': tf.int32, 'token_type_ids': tf.int32}, tf.int64)

# Output typesof one entry:
   ({'input_ids': <class 'tensorflow.python.framework.ops.Tensor'>, 'attention_mask': <class 'tensorflow.python.framework.ops.Tensor'>, 'token_type_ids': <class 'tensorflow.python.framework.ops.Tensor'>}, <class 'tensorflow.python.framework.ops.Tensor'>)
 

# Shape of the data:

   (4210, 2)
   ---> 4210 batches
   ---> 2 dim
        label
           shape: (32,)
        dict structure
           dim: 3
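
The count of 4210 batches reported above can be checked by hand. The training dataset is batched with `BATCH_SIZE_TRAIN = 32` and then repeated `EPOCHS+1 = 2` times. Assuming the training split holds 67,349 sentences (the size of the GLUE SST-2 training set, which is not stated in this notebook) and that the last partial batch is kept, the arithmetic is sketched below. `expected_batches` is a hypothetical helper.

```python
import math


def expected_batches(n_examples, batch_size, repeats):
    """Number of batches when batching (keeping the last partial batch), then repeating."""
    return math.ceil(n_examples / batch_size) * repeats


N_TRAIN_EXAMPLES = 67349   # assumption: size of the GLUE SST-2 training split
n_batches = expected_batches(N_TRAIN_EXAMPLES, batch_size=32, repeats=2)
print(n_batches)
```

With these numbers one pass gives 2105 batches (2104 full batches plus one partial batch of 21 sentences), and two passes give 4210, matching the printed shape.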
        