# The Stanford Sentiment Treebank 
The Stanford Sentiment Treebank (SST) consists of sentences from movie reviews with human annotations of their sentiment. The task is to predict the sentiment of a given sentence. We use the two-way (positive/negative) class split and only the sentence-level labels.
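
To make the two-way split concrete, here is a minimal sketch that maps a fine-grained sentiment score in [0, 1] to a binary label and discards the neutral band. It assumes the usual SST cutoffs (negative up to 0.4, positive above 0.6); the function name `to_binary_label` is hypothetical and not part of this project.

```python
from typing import Optional


def to_binary_label(score: float) -> Optional[int]:
    """Map a sentiment score in [0, 1] to 0 (negative), 1 (positive) or None (neutral)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score <= 0.4:  # assumed cutoff: very negative and negative
        return 0
    if score > 0.6:   # assumed cutoff: positive and very positive
        return 1
    return None       # neutral sentences are dropped in the two-way split


# Hypothetical scores, for illustration only
labels = [to_binary_label(s) for s in (0.1, 0.5, 0.9)]
print(labels)
```

Sentences mapped to `None` would simply be filtered out before training.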

In [1]:
from IPython.display import display, Markdown
with open('../../doc/env_variables_setup.md', 'r') as fh:
    content = fh.read()
display(Markdown(content))

Environment variables that need to be defined:   
`export DIR_PROJ=your_path_git_repository`  
`export PYTHONPATH=$DIR_PROJ/src`  
`export PATH_TENSORBOARD=your_path_tensorboard`  
`export PATH_DATASETS=your_path_datasets`  
`export PROJECT_ID=your_gcp_project_id`  
`export BUCKET_NAME=your_gcp_gs_bucket_name`  
`export BUCKET_TRANSLATION_NAME=your_gcp_gs_bucket_translation_name`  
`export REGION=your_region`  
`export PATH_SAVE_MODEL=your_path_to_save_model`  
`export CLOUDSDK_PYTHON=your_path/conda-env/env_gcp_sdk/bin/python`  
`export CLOUDSDK_GSUTIL_PYTHON=your_path/conda-env/env_gcp_sdk/bin/python`  
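
Before running the rest of the notebook, it can help to check that these variables are actually set. The following is a small sketch of such a check; the helper `missing_env_vars` is hypothetical and only uses the variable names listed above.

```python
import os

# Variable names taken from the list above
REQUIRED_VARS = [
    "DIR_PROJ", "PYTHONPATH", "PATH_TENSORBOARD", "PATH_DATASETS",
    "PROJECT_ID", "BUCKET_NAME", "BUCKET_TRANSLATION_NAME", "REGION",
    "PATH_SAVE_MODEL", "CLOUDSDK_PYTHON", "CLOUDSDK_GSUTIL_PYTHON",
]


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the names in `required` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All environment variables are defined.")
```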

- Use local Jupyter Lab 
    - you need to have the `jupyter-notebook` Anaconda python environment created [link](local_jupyter_lab_installation.md) 
    - you need to have the `jupyter-notebook` Anaconda python environment activated [link](local_jupyter_lab_installation.md) 
    - then define the environment variables above (copy and paste) 
    - you need to have the `env_multilingual_class` Anaconda python environment created [link](local_jupyter_lab_installation.md)  
    - start Jupyter Lab:  `jupyter lab` 
    - open a Jupyter Lab notebook from `notebook/` 
    - choose the proper Anaconda python environment:  `Python [conda env:env_multilingual_class]` [link](conda_env.md) 
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`


- Use GCP Jupyter Lab 
    - Go on GCP
    - open a Cloud Shell
    - `ssh-keygen -t rsa -b 4096 -C firstName_lastName`
    - `cp .ssh/id_rsa.pub .`
    - use Cloud Editor to edit this file `id_rsa.pub` and copy the full content
    - Go on Compute Engine -> Metadata
    - Click SSH Keys
    - Click Edit
    - Click + Add item, copy the content of `id_rsa.pub`
    - You should see firstName_lastName on the left
    - Click Save
    - you need to start an AI Platform instance 
    - open a Jupyter Lab terminal and go to `/home/gcp_user_name/`
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`
    - then `cd proj_multilingual_text_classification/`
    - create the Anaconda Python environment `conda env create -f env/environment.yml`
    - create a file `config.sh` in `/home` with the following information: 
    ```
    #!/bin/bash
    
    echo "applying some configuration ..."
    git config --global user.email user_email
    git config --global user.name user_name
    git config --global credential.helper store
        
    # Add the environment variables listed above here
    # [EDIT ME]
    export DIR_PROJ=your_path_git_repository
    export PYTHONPATH=$DIR_PROJ/src
  
    cd /home/gcp_user_name/
    
    conda activate env_multilingual_class

    export PS1='\[\e[91m\]\u@:\[\e[32m\]\w\[\e[0m\]$'
    ```
    - Go to AI Platform Notebook, select your instance and click "Reset".
    - Wait, then refresh the web browser tab showing the Notebook


## Import Packages

In [2]:
import tensorflow as tf
from transformers import (
    BertConfig,
    BertTokenizer,
    XLMRobertaTokenizer,
    TFBertModel,
    TFXLMRobertaModel,
)
import os
from datetime import datetime

## Import local packages

In [3]:
import preprocessing.preprocessing as pp
import utils.model_metrics as mm
import utils.model_utils as mu
import model.tf_custom_bert_classification.model as tf_custom_bert
import model.tf_bert_classification.model as tf_bert



In [49]:
import importlib
importlib.reload(pp);
importlib.reload(mm);
importlib.reload(mu);
importlib.reload(tf_bert);
importlib.reload(tf_custom_bert);

## Check configuration

In [5]:
print(tf.version.GIT_VERSION, tf.version.VERSION)

v2.2.0-rc4-8-g2b96f3662b 2.2.0


In [6]:
print(tf.keras.__version__)

2.3.0-tf


In [7]:
gpus = tf.config.list_physical_devices('GPU')
if len(gpus)>0:
    for gpu in gpus:
        print('Name:', gpu.name, '  Type:', gpu.device_type)
else:
    print('No GPU available !!!!')

No GPU available !!!!


## Define Paths

In [11]:
try:
    data_dir=os.environ['PATH_DATASETS']
except KeyError:
    print('missing PATH_DATASETS')
try:   
    tensorboard_dir=os.environ['PATH_TENSORBOARD']
except KeyError:
    print('missing PATH_TENSORBOARD')
try:   
    savemodel_dir=os.environ['PATH_SAVE_MODEL']
except KeyError:
    print('missing PATH_SAVE_MODEL')

## Read data from TFRecord files [local training of the model]

In [12]:
# Path of the directory with TFRecord files
tfrecord_data_dir=data_dir+'/tfrecord/sst2'

## Define parameters of the model

In [13]:
# models
MODELS = [(TFBertModel,         BertTokenizer,       'bert-base-multilingual-uncased'),
          (TFXLMRobertaModel,   XLMRobertaTokenizer, 'jplu/tf-xlm-roberta-base')]
model_index = 0 # BERT
model_class        = MODELS[model_index][0] # i.e. TFBertModel
tokenizer_class    = MODELS[model_index][1] # i.e. BertTokenizer
pretrained_weights = MODELS[model_index][2] # i.e. 'bert-base-multilingual-uncased'
number_label = 2                                                        

## Train the model locally with AI Platform Training (for tests)

In [14]:
savemodel_path = os.path.join(savemodel_dir, 'saved_model')
pretrained_model_dir=savemodel_dir+'/pretrained_model/'+pretrained_weights
model_name='tf_bert_classification'

In [15]:
# train locally
os.environ['EPOCH'] = '1' 
os.environ['STEPS_PER_EPOCH_TRAIN'] = '1' 
os.environ['BATCH_SIZE_TRAIN'] = '32' 
os.environ['STEPS_PER_EPOCH_EVAL'] = '1' 
os.environ['BATCH_SIZE_EVAL'] = '64'
os.environ['TRAINER_PACKAGE_PATH'] = os.environ['PYTHONPATH']
os.environ['MAIN_TRAINER_MODULE'] = 'model.'+model_name+'.task'
os.environ['INPUT_EVAL_TFRECORDS'] = tfrecord_data_dir
os.environ['INPUT_TRAIN_TFRECORDS'] = tfrecord_data_dir
os.environ['OUTPUT_DIR'] = savemodel_path
os.environ['PRETRAINED_MODEL_DIR']= pretrained_model_dir

In [36]:
%%bash
# Run the AI Platform trainer package on the local machine (no cloud job is submitted)
gcloud ai-platform local train \
   --module-name=$MAIN_TRAINER_MODULE \
   --package-path=$TRAINER_PACKAGE_PATH \
   -- \
   --epochs=$EPOCH \
   --steps_per_epoch_train=$STEPS_PER_EPOCH_TRAIN \
   --batch_size_train=$BATCH_SIZE_TRAIN \
   --steps_per_epoch_eval=$STEPS_PER_EPOCH_EVAL \
   --batch_size_eval=$BATCH_SIZE_EVAL \
   --input_eval_tfrecords=$INPUT_EVAL_TFRECORDS \
   --input_train_tfrecords=$INPUT_TRAIN_TFRECORDS \
   --output_dir=$OUTPUT_DIR \
   --pretrained_model_dir=$PRETRAINED_MODEL_DIR \
   --verbosity_level='INFO'

Process is interrupted.
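
The command above has two parts: the `gcloud` flags (`--module-name`, `--package-path`) and, after the bare `--`, the arguments forwarded to the trainer module. As a rough sketch, the same command line can be assembled in Python, which makes that split explicit. The helper `build_local_train_command` and the placeholder package path are assumptions for illustration; the argument names are the ones used in the cell above.

```python
import shlex


def build_local_train_command(module_name, package_path, user_args):
    """Assemble the argument list for `gcloud ai-platform local train`."""
    command = [
        "gcloud", "ai-platform", "local", "train",
        f"--module-name={module_name}",
        f"--package-path={package_path}",
        "--",  # everything after this marker is passed to the trainer module
    ]
    command += [f"--{key}={value}" for key, value in user_args.items()]
    return command


command = build_local_train_command(
    module_name="model.tf_bert_classification.task",
    package_path="/path/to/src",  # placeholder for $PYTHONPATH
    user_args={"epochs": 1, "steps_per_epoch_train": 1, "batch_size_train": 32},
)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command, check=True)` instead of using a `%%bash` cell.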


## Debug the model's functions

In [16]:
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# create and compile the Keras model in the context of strategy.scope
with strategy.scope():
    model=tf_bert.create_model(pretrained_weights, 
                               pretrained_model_dir=pretrained_model_dir,
                               num_labels=number_label,
                               learning_rate=3e-5,
                               epsilon=1e-08)

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
Number of devices: 1


In [39]:
# define parameters
BATCH_SIZE_TRAIN = 10 #32
BATCH_SIZE_TEST = 10 #32
BATCH_SIZE_VALID = 10 #64
EPOCH = 1

In [51]:
# List the TFRecord files of each split
train_files = tf.io.gfile.glob(tfrecord_data_dir+'/'+model.name+'/train/*.tfrecord')
test_files = tf.io.gfile.glob(tfrecord_data_dir+'/'+model.name+'/test/*.tfrecord')
valid_files = tf.io.gfile.glob(tfrecord_data_dir+'/'+model.name+'/valid/*.tfrecord')

In [46]:
# Build the input pipelines directly from the TFRecord files
train_files = tf.data.TFRecordDataset(tf.io.gfile.glob(tfrecord_data_dir+'/tf_bert_classification/train/*.tfrecord'))
test_files = tf.data.TFRecordDataset(tf.io.gfile.glob(tfrecord_data_dir+'/tf_bert_classification/test/*.tfrecord'))
valid_files = tf.data.TFRecordDataset(tf.io.gfile.glob(tfrecord_data_dir+'/tf_bert_classification/valid/*.tfrecord'))

train_dataset = train_files.map(pp.parse_tfrecord_glue_files)
test_dataset = test_files.map(pp.parse_tfrecord_glue_files)
valid_dataset = valid_files.map(pp.parse_tfrecord_glue_files)

train_dataset = train_dataset.shuffle(100).batch(BATCH_SIZE_TRAIN).repeat(EPOCH+1)
test_dataset = test_dataset.shuffle(100).batch(BATCH_SIZE_TEST).repeat(EPOCH+1)
valid_dataset = valid_dataset.batch(BATCH_SIZE_VALID) #.repeat(EPOCH+1)
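
The order of `batch` and `repeat` determines how many batches the pipeline can deliver, which matters when `steps_per_epoch` is set explicitly. The sketch below only does the arithmetic, assuming the default `drop_remainder=False`; the dataset size of 1005 is a made-up number, and both helper names are hypothetical.

```python
import math


def batches_after_batch_then_repeat(num_examples, batch_size, repeat_count):
    """Batches produced by .batch(batch_size).repeat(repeat_count)."""
    # Every pass over the data ends with its own (possibly smaller) last batch.
    return math.ceil(num_examples / batch_size) * repeat_count


def batches_after_repeat_then_batch(num_examples, batch_size, repeat_count):
    """Batches produced by .repeat(repeat_count).batch(batch_size)."""
    # Batches may straddle the boundary between two passes over the data.
    return math.ceil(num_examples * repeat_count / batch_size)


NUM_EXAMPLES = 1005  # hypothetical dataset size
print(batches_after_batch_then_repeat(NUM_EXAMPLES, batch_size=10, repeat_count=2))
print(batches_after_repeat_then_batch(NUM_EXAMPLES, batch_size=10, repeat_count=2))
```

`shuffle` does not change these counts; it only changes which examples end up in which batch.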

In [58]:
# Build the same pipelines with the helper function tf_bert.build_dataset
train_files = tf.io.gfile.glob(tfrecord_data_dir+'/'+model.name+'/train/*.tfrecord')
test_files = tf.io.gfile.glob(tfrecord_data_dir+'/'+model.name+'/test/*.tfrecord')
valid_files = tf.io.gfile.glob(tfrecord_data_dir+'/'+model.name+'/valid/*.tfrecord')

train_dataset_2 = tf_bert.build_dataset(train_files, BATCH_SIZE_TRAIN)
test_dataset_2 = tf_bert.build_dataset(test_files, BATCH_SIZE_TEST)
valid_dataset_2 = tf_bert.build_dataset(valid_files, BATCH_SIZE_VALID)

train_dataset_2=train_dataset_2.repeat(EPOCH+1)

In [74]:
# possible next step (not executed): .interleave(tf.data.TFRecordDataset, cycle_length=tf.data.experimental.AUTOTUNE, num_parallel_calls=tf.data.experimental.AUTOTUNE)
list(tf.data.Dataset.list_files(tf.io.gfile.glob(tfrecord_data_dir+'/'+model.name+'/train/*.tfrecord')))

[<tf.Tensor: shape=(), dtype=string, numpy=b'/Users/tarrade/tensorflow_datasets/tfrecord/sst2/tf_bert_classification/train/train_dataset.tfrecord'>]

In [73]:
list(tf.data.Dataset.list_files(tfrecord_data_dir+'/'+model.name+'/train/*.tfrecord'))

[<tf.Tensor: shape=(), dtype=string, numpy=b'/Users/tarrade/tensorflow_datasets/tfrecord/sst2/tf_bert_classification/train/train_dataset.tfrecord'>]

In [70]:
tf.io.gfile.glob(tfrecord_data_dir+'/'+model.name+'/train/*.tfrecord')

['/Users/tarrade/tensorflow_datasets/tfrecord/sst2/tf_bert_classification/train/train_dataset.tfrecord']

In [65]:
tf.data.Dataset.list_files(tfrecord_data_dir+'/'+model.name+'/train/*.tfrecord')

<ShuffleDataset shapes: (), types: tf.string>
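
The cells above show the difference between the two calls: `tf.io.gfile.glob` returns a plain Python list of paths, while `tf.data.Dataset.list_files` returns a dataset of file names that is shuffled by default, hence the `ShuffleDataset`. As a local-only illustration of the pattern matching, the sketch below uses the standard-library `glob` as a stand-in (it does not handle `gs://` paths) and sorts the result to get a deterministic order. The helper `list_split_files` and the file names are hypothetical.

```python
import glob
import os
import tempfile


def list_split_files(root, model_name, split):
    """Return the sorted *.tfrecord paths under root/model_name/split."""
    pattern = os.path.join(root, model_name, split, "*.tfrecord")
    return sorted(glob.glob(pattern))


# Build a throwaway directory tree that mimics the layout used in this notebook
with tempfile.TemporaryDirectory() as root:
    train_dir = os.path.join(root, "tf_bert_classification", "train")
    os.makedirs(train_dir)
    for name in ("b.tfrecord", "a.tfrecord", "notes.txt"):
        open(os.path.join(train_dir, name), "w").close()
    paths = list_split_files(root, "tf_bert_classification", "train")
    found = [os.path.basename(p) for p in paths]

print(found)
```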

In [54]:
tf.get_logger().propagate = False
from absl import logging
logging.set_verbosity(logging.INFO)
history_test=tf_bert.train_and_evaluate(model, 
                                        num_epochs=1, 
                                        steps_per_epoch=2, 
                                        train_data=train_dataset, 
                                        validation_steps=1, 
                                        eval_data=valid_dataset, 
                                        n_steps_history=1,
                                        output_dir=savemodel_path)

INFO:absl:training the model ...



 training set -> batch:1 loss:0.6349474787712097 and acc: 0.699999988079071

 validation set -> batch:1 val loss:0.7102639675140381 and val acc: 0.5091742873191833

 training set -> batch:2 loss:0.7102600932121277 and acc: 0.5090702772140503

 validation set -> batch:2 val loss:0.7107909321784973 and val acc: 0.5091742873191833

Epoch 00001: saving model to /Users/tarrade/tensorflow_model/saved_model/checkpoint_model/ckpt_01
accuracy_train 0.5090702772140503 epoch 0 



INFO:absl:
execution time: 0:05:07
INFO:absl:timing per epoch:
['0:05:07']
INFO:absl:sum timing over all epochs:
0:05:07


INFO:tensorflow:Assets written to: /Users/tarrade/tensorflow_model/saved_model/saved_model/tf_bert_classification/assets
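
To compare runs, the per-batch lines printed above can be turned into numbers. The following is a minimal sketch that parses lines of exactly this format with a regular expression; the function `parse_metrics` is hypothetical and is not part of `tf_bert`.

```python
import re

# Matches both "training set -> batch:N loss:X and acc: Y"
# and "validation set -> batch:N val loss:X and val acc: Y"
METRIC_LINE = re.compile(
    r"(?P<split>training|validation) set -> batch:(?P<batch>\d+) "
    r"(?:val )?loss:(?P<loss>[0-9.]+) and (?:val )?acc: (?P<acc>[0-9.]+)"
)


def parse_metrics(log_text):
    """Extract split, batch index, loss and accuracy from the training log."""
    records = []
    for match in METRIC_LINE.finditer(log_text):
        records.append({
            "split": match.group("split"),
            "batch": int(match.group("batch")),
            "loss": float(match.group("loss")),
            "acc": float(match.group("acc")),
        })
    return records


LOG = """
 training set -> batch:1 loss:0.6349474787712097 and acc: 0.699999988079071

 validation set -> batch:1 val loss:0.7102639675140381 and val acc: 0.5091742873191833
"""
records = parse_metrics(LOG)
for record in records:
    print(record)
```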


In [55]:
k=0
for i in train_dataset:
    print(i)
    k+=1
    if  k>5:
        break

({'input_ids': <tf.Tensor: shape=(10, 128), dtype=int32, numpy=
array([[  101, 10144, 19141, ...,     0,     0,     0],
       [  101, 11811,   143, ...,     0,     0,     0],
       [  101, 33631, 26976, ...,     0,     0,     0],
       ...,
       [  101, 10160, 13645, ...,     0,     0,     0],
       [  101, 25164, 48740, ...,     0,     0,     0],
       [  101, 19567, 14478, ...,     0,     0,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(10, 128), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(10, 128), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>
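
In the batch printed above, every row of `input_ids` starts with 101 and ends in zeros, `attention_mask` is 1 over real tokens and 0 over padding, and `token_type_ids` is all zeros because each example is a single sentence. The sketch below reproduces that layout for one example in plain Python. It assumes a pad id of 0 and a maximum length of 128, as seen in the output; the helper `pad_encoding` is hypothetical, and the real preprocessing in this project is done by the tokenizer.

```python
def pad_encoding(token_ids, max_length=128, pad_id=0):
    """Pad (or naively truncate) token ids and build the matching masks."""
    # Note: naive truncation may cut off the final separator token.
    token_ids = list(token_ids)[:max_length]
    n_pad = max_length - len(token_ids)
    return {
        "input_ids": token_ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(token_ids) + [0] * n_pad,
        "token_type_ids": [0] * max_length,  # single-sentence input
    }


# Illustrative ids: 101 and 102 are the [CLS] and [SEP] ids in BERT vocabularies
encoding = pad_encoding([101, 10144, 19141, 102])
print(encoding["input_ids"][:6])
print(encoding["attention_mask"][:6])
```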

In [56]:
tf.get_logger().propagate = False
from absl import logging
logging.set_verbosity(logging.INFO)
history_test=tf_bert.train_and_evaluate(model, 
                                        num_epochs=1, 
                                        steps_per_epoch=2, 
                                        train_data=train_dataset_2, 
                                        validation_steps=1, 
                                        eval_data=valid_dataset_2, 
                                        n_steps_history=1,
                                        output_dir=savemodel_path)

INFO:absl:training the model ...



 training set -> batch:1 loss:0.7118471264839172 and acc: 0.5

 validation set -> batch:1 val loss:0.7104257345199585 and val acc: 0.5091953873634338

 training set -> batch:2 loss:0.7106784582138062 and acc: 0.5090909004211426

 validation set -> batch:2 val loss:0.7091863751411438 and val acc: 0.5091953873634338

Epoch 00001: saving model to /Users/tarrade/tensorflow_model/saved_model/checkpoint_model/ckpt_01
accuracy_train 0.5090909004211426 epoch 0 



INFO:absl:
execution time: 0:05:05
INFO:absl:timing per epoch:
['0:05:04']
INFO:absl:sum timing over all epochs:
0:05:04


INFO:tensorflow:Assets written to: /Users/tarrade/tensorflow_model/saved_model/saved_model/tf_bert_classification/assets


In [57]:
k=0
for i in train_dataset_2:
    print(i)
    k+=1
    if  k>5:
        break

({'input_ids': <tf.Tensor: shape=(10, 128), dtype=int32, numpy=
array([[  101, 79296, 59289, ...,     0,     0,     0],
       [  101, 56352, 10486, ...,     0,     0,     0],
       [  101, 27635, 29233, ...,     0,     0,     0],
       ...,
       [  101, 20524, 10563, ...,     0,     0,     0],
       [  101, 11811,   143, ...,     0,     0,     0],
       [  101, 89441, 18771, ...,     0,     0,     0]], dtype=int32)>, 'attention_mask': <tf.Tensor: shape=(10, 128), dtype=int32, numpy=
array([[1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       ...,
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0],
       [1, 1, 1, ..., 0, 0, 0]], dtype=int32)>, 'token_type_ids': <tf.Tensor: shape=(10, 128), dtype=int32, numpy=
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int32)>