# The Stanford Sentiment Treebank 
The Stanford Sentiment Treebank consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. We use the two-way (positive/negative) class split, and use only sentence-level labels.

In [1]:
from IPython.display import display, Markdown
with open('../../doc/env_variables_setup.md', 'r') as fh:
    content = fh.read()
display(Markdown(content))

Environment variables that need to be defined:   
`export DIR_PROJ=your_path_git_repository`  
`export PYTHONPATH=$DIR_PROJ/src`  
`export PATH_TENSORBOARD=your_path_tensorboard`  
`export PATH_DATASETS=your_path_datasets`  
`export PROJECT_ID=your_gcp_project_id`  
`export BUCKET_NAME=your_gcp_gs_bucket_name`  
`export BUCKET_TRANSLATION_NAME=your_gcp_gs_bucket_translation_name`  
`export REGION=your_region`  
`export PATH_SAVE_MODEL=your_path_to_save_model`  
`export CLOUDSDK_PYTHON=your_path/conda-env/env_gcp_sdk/bin/python`  
`export CLOUDSDK_GSUTIL_PYTHON=your_path/conda-env/env_gcp_sdk/bin/python`  

- Use local Jupyter Lab 
    - you need to have the `jupyter-notebook` Anaconda python environment created [link](local_jupyter_lab_installation.md) 
    - you need to have the `jupyter-notebook` Anaconda python environment activated [link](local_jupyter_lab_installation.md) 
    - then define the environment variables above (copy and paste) 
    - you need to have the `env_multilingual_class` Anaconda python environment created [link](local_jupyter_lab_installation.md)  
    - start Jupyter Lab:  `jupyter lab` 
    - open a Jupyter Lab notebook from `notebook/` 
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`
    - choose the proper Anaconda python environment:  `Python [conda env:env_multilingual_class]` [link](conda_env.md) 


- Use GCP Jupyter Lab 
    - Go to GCP
    - open a Cloud Shell
    - `ssh-keygen -t rsa -b 4096 -C firstName_lastName`
    - `cp .ssh/id_rsa.pub .`
    - use Cloud Editor to open the file `id_rsa.pub` and copy its full content
    - Go to Compute Engine -> Metadata
    - Click SSH Keys
    - Click Edit
    - Click + Add item and paste the content of `id_rsa.pub`
    - You should see firstName_lastName on the left
    - Click Save
    - start an AI Platform instance 
    - open a Jupyter Lab terminal and go to `/home/gcp_user_name/`
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`
    - then `cd proj_multilingual_text_classification/`
    - create the Anaconda Python environment `conda env create -f env/environment.yml`
    - create a file `config.sh` in `/home` with the following information: 
    ```
    #!/bin/bash
    
    echo "applying some configuration ..."
    git config --global user.email user_email
    git config --global user.name user_name
    git config --global credential.helper store
        
    # Add the environment variables listed above below this line
    # [EDIT ME]
    export DIR_PROJ=your_path_git_repository
    export PYTHONPATH=$DIR_PROJ/src
  
    cd /home/gcp_user_name/
    
    conda activate env_multilingual_class

    export PS1='\[\e[91m\]\u@:\[\e[32m\]\w\[\e[0m\]$'
    ```
    - Go to AI Platform Notebook, select your instance and click "Reset".
    - Wait and refresh your web browser with the Notebook


## Import Packages

In [1]:
import tensorflow as tf
from transformers import (
    BertConfig,
    BertTokenizer,
    XLMRobertaTokenizer,
    TFBertModel,
    TFXLMRobertaModel,
)
import os
from datetime import datetime

## Import local packages

In [12]:
import preprocessing.preprocessing as pp
import utils.model_metrics as mm
import utils.model_utils as mu
import model.tf_custom_bert_classification.model as tf_custom_bert
import model.tf_bert_classification.model as tf_bert



## Check configuration

In [2]:
print(tf.version.GIT_VERSION, tf.version.VERSION)

v2.1.0-rc2-17-ge5bf8de410 2.1.0


In [3]:
print(tf.keras.__version__)

2.2.4-tf


In [4]:
gpus = tf.config.list_physical_devices('GPU')
if len(gpus)>0:
    for gpu in gpus:
        print('Name:', gpu.name, '  Type:', gpu.device_type)
else:
    print('No GPU available !!!!')

No GPU available !!!!


## Define Paths

In [5]:
try:
    data_dir=os.environ['PATH_DATASETS']
except KeyError:
    print('missing PATH_DATASETS')
try:   
    tensorboard_dir=os.environ['PATH_TENSORBOARD']
except KeyError:
    print('missing PATH_TENSORBOARD')
try:   
    savemodel_dir=os.environ['PATH_SAVE_MODEL']
except KeyError:
    print('missing PATH_SAVE_MODEL')
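
The cell above only prints a message when a variable is missing, so `data_dir`, `tensorboard_dir` or `savemodel_dir` can stay undefined and fail later with a less obvious `NameError`. A minimal sketch of a fail-fast alternative follows; the helper name `require_env` and the example mapping are hypothetical and not part of the project.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable or raise a descriptive error."""
    try:
        return environ[name]
    except KeyError:
        raise RuntimeError(f"environment variable {name} is not defined") from None


# Demonstration with a stand-in mapping (hypothetical value) instead of the real environment
example_env = {'PATH_DATASETS': '/tmp/datasets'}
example_data_dir = require_env('PATH_DATASETS', example_env)
print(example_data_dir)
```

In the notebook itself the call would be `require_env('PATH_DATASETS')`, which reads from `os.environ` by default.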

## Read data from TFRecord files [local training of the model]

In [6]:
# Path of the directory with TFRecord files
tfrecord_data_dir=data_dir+'/tfrecord/sst2'

## Define parameters of the model

In [7]:
# models
MODELS = [(TFBertModel,         BertTokenizer,       'bert-base-multilingual-uncased'),
          (TFXLMRobertaModel,   XLMRobertaTokenizer, 'jplu/tf-xlm-roberta-base')]
model_index = 0 # BERT
model_class        = MODELS[model_index][0] # i.e. TFBertModel
tokenizer_class    = MODELS[model_index][1] # i.e. BertTokenizer
pretrained_weights = MODELS[model_index][2] # i.e. 'bert-base-multilingual-uncased'
number_label = 2                                                        
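
Selecting a model by position (`model_index = 0`) works, but a name-keyed lookup can be easier to read. The sketch below is one possible layout, not the project's code: `ModelSpec`, `MODEL_SPECS` and the keys `'bert'` and `'xlm-roberta'` are hypothetical, and the class names are stored as plain strings only so that the example runs without `transformers` installed. In the notebook the first two fields would hold the imported classes themselves.

```python
from typing import NamedTuple


class ModelSpec(NamedTuple):
    """One entry of the model table: model class, tokenizer class, weights name."""
    model_class: str
    tokenizer_class: str
    pretrained_weights: str


# Same two entries as the MODELS list above, keyed by a short (hypothetical) name
MODEL_SPECS = {
    'bert': ModelSpec('TFBertModel', 'BertTokenizer',
                      'bert-base-multilingual-uncased'),
    'xlm-roberta': ModelSpec('TFXLMRobertaModel', 'XLMRobertaTokenizer',
                             'jplu/tf-xlm-roberta-base'),
}

spec = MODEL_SPECS['bert']
print(spec.pretrained_weights)
```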

## Train the model locally with AI Platform Training (for tests)

In [8]:
savemodel_path = os.path.join(savemodel_dir, 'saved_model')
pretrained_model_dir=savemodel_dir+'/pretrained_model/'+pretrained_weights
model_name='tf_bert_classification'

In [9]:
# train locally
os.environ['EPOCH'] = '1' 
os.environ['STEPS_PER_EPOCH_TRAIN'] = '1' 
os.environ['BATCH_SIZE_TRAIN'] = '32' 
os.environ['STEPS_PER_EPOCH_EVAL'] = '1' 
os.environ['BATCH_SIZE_EVAL'] = '64'
os.environ['TRAINER_PACKAGE_PATH'] = os.environ['PYTHONPATH']
os.environ['MAIN_TRAINER_MODULE'] = 'model.'+model_name+'.task'
os.environ['INPUT_EVAL_TFRECORDS'] = tfrecord_data_dir
os.environ['INPUT_TRAIN_TFRECORDS'] = tfrecord_data_dir
os.environ['OUTPUT_DIR'] = savemodel_path
os.environ['PRETRAINED_MODEL_DIR']= pretrained_model_dir
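
The variables set above are expanded by the `%%bash` cell that follows. To inspect the resulting command before running it, the same argument list can be assembled in Python. This is a sketch only: the function name `build_local_train_command` and all paths in `example_settings` are hypothetical placeholders, while the flags are copied from the `gcloud ai-platform local train` call in this notebook.

```python
import shlex

# (flag, settings key) pairs forwarded to the trainer module after the `--` separator
TRAINER_FLAGS = [
    ('--epochs', 'EPOCH'),
    ('--steps_per_epoch_train', 'STEPS_PER_EPOCH_TRAIN'),
    ('--batch_size_train', 'BATCH_SIZE_TRAIN'),
    ('--steps_per_epoch_eval', 'STEPS_PER_EPOCH_EVAL'),
    ('--batch_size_eval', 'BATCH_SIZE_EVAL'),
    ('--input_eval_tfrecords', 'INPUT_EVAL_TFRECORDS'),
    ('--input_train_tfrecords', 'INPUT_TRAIN_TFRECORDS'),
    ('--output_dir', 'OUTPUT_DIR'),
    ('--pretrained_model_dir', 'PRETRAINED_MODEL_DIR'),
]


def build_local_train_command(settings):
    """Assemble the local training command as a list of arguments."""
    command = [
        'gcloud', 'ai-platform', 'local', 'train',
        f"--module-name={settings['MAIN_TRAINER_MODULE']}",
        f"--package-path={settings['TRAINER_PACKAGE_PATH']}",
        '--',
    ]
    command += [f"{flag}={settings[key]}" for flag, key in TRAINER_FLAGS]
    command.append('--verbosity_level=INFO')
    return command


# Placeholder values; in the notebook these would be read from os.environ
example_settings = {
    'MAIN_TRAINER_MODULE': 'model.tf_bert_classification.task',
    'TRAINER_PACKAGE_PATH': '/path/to/src',
    'EPOCH': '1',
    'STEPS_PER_EPOCH_TRAIN': '1',
    'BATCH_SIZE_TRAIN': '32',
    'STEPS_PER_EPOCH_EVAL': '1',
    'BATCH_SIZE_EVAL': '64',
    'INPUT_EVAL_TFRECORDS': '/path/to/tfrecord/sst2',
    'INPUT_TRAIN_TFRECORDS': '/path/to/tfrecord/sst2',
    'OUTPUT_DIR': '/path/to/saved_model',
    'PRETRAINED_MODEL_DIR': '/path/to/pretrained_model',
}

command = build_local_train_command(example_settings)
print(shlex.join(command))  # shell-quoted form, requires Python 3.8+
```

Printing the command does not run it; passing `dict(os.environ)` instead of `example_settings` would show the exact call made with the current environment.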

In [27]:
%%bash
# Use AI Platform (formerly Cloud Machine Learning Engine) to train the model on the local machine
gcloud ai-platform local train \
   --module-name=$MAIN_TRAINER_MODULE \
   --package-path=$TRAINER_PACKAGE_PATH \
   -- \
   --epochs=$EPOCH \
   --steps_per_epoch_train=$STEPS_PER_EPOCH_TRAIN \
   --batch_size_train=$BATCH_SIZE_TRAIN \
   --steps_per_epoch_eval=$STEPS_PER_EPOCH_EVAL \
   --batch_size_eval=$BATCH_SIZE_EVAL \
   --input_eval_tfrecords=$INPUT_EVAL_TFRECORDS \
   --input_train_tfrecords=$INPUT_TRAIN_TFRECORDS \
   --output_dir=$OUTPUT_DIR \
   --pretrained_model_dir=$PRETRAINED_MODEL_DIR \
   --verbosity_level='INFO'

Train for 1 steps, validate for 1 steps

 training set -> batch:1 loss:0.6951441168785095 and acc: 0.5

Epoch 00001: saving model to /Users/tarrade/tensorflow_model/saved_model/checkpoint_model/ckpt_01
accuracy_train 0.5 epoch 0 



[INFO 2020-05-06 11:39:38,945 task.py:71] 2.1.0
[INFO 2020-05-06 11:39:38,945 task.py:72] 2.2.4-tf
[INFO 2020-05-06 11:39:38,945 task.py:73] ['logtostderr', 'alsologtostderr', 'log_dir', 'v', 'verbosity', 'stderrthreshold', 'showprefixforinfo', 'run_with_pdb', 'pdb_post_mortem', 'run_with_profiling', 'profile_file', 'use_cprofile_for_profiling', 'only_check_args', 'op_conversion_fallback_to_while_loop', 'test_random_seed', 'test_srcdir', 'test_tmpdir', 'test_randomize_ordering_seed', 'xml_output_file', 'learning_rate', 'epsilon', 'epochs', 'steps_per_epoch_train', 'batch_size_train', 'steps_per_epoch_eval', 'batch_size_eval', 'num_classes', 'n_steps_history', 'input_train_tfrecords', 'input_eval_tfrecords', 'output_dir', 'pretrained_model_dir', 'verbosity_level', 'use_tpu', '?', 'help', 'helpshort', 'helpfull', 'helpxml']
[INFO 2020-05-06 11:39:38,947 task.py:86] downloading pretrained model!
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:C

## Debug the model's functions

In [28]:
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# create and compile the Keras model in the context of strategy.scope
with strategy.scope():
    model=tf_bert.create_model(pretrained_weights, 
                               pretrained_model_dir=pretrained_model_dir,
                               num_labels=number_label,
                               learning_rate=3e-5,
                               epsilon=1e-08)

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
Number of devices: 1


In [61]:
# TFRecords encode and store data
#train_files = tf.data.TFRecordDataset(tfrecord_data_dir+'/train_dataset.tfrecord')
#test_files = tf.data.TFRecordDataset(tfrecord_data_dir+'/test_dataset.tfrecord')
#valid_files = tf.data.TFRecordDataset(tfrecord_data_dir+'/valid_dataset.tfrecord')

train_files = tf.data.TFRecordDataset(tf.io.gfile.glob(tfrecord_data_dir+'/'+model.name+'/train/*.tfrecord'))
test_files = tf.data.TFRecordDataset(tf.io.gfile.glob(tfrecord_data_dir+'/'+model.name+'/test/*.tfrecord'))
valid_files = tf.data.TFRecordDataset(tf.io.gfile.glob(tfrecord_data_dir+'/'+model.name+'/valid/*.tfrecord'))
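
The three patterns above follow the layout `<tfrecord_data_dir>/<model.name>/<split>/*.tfrecord`. If a directory is missing or misnamed, the glob may simply match nothing and the problem only shows up later as an empty dataset. The sketch below checks for shards up front. It assumes local paths (the standard-library `glob` does not understand `gs://` URLs), and the helper name `list_shards` and the demo directory names are hypothetical.

```python
import glob
import os
import tempfile


def list_shards(data_dir, model_name, split):
    """Return the sorted TFRecord shards of one split, or raise if none is found."""
    pattern = os.path.join(data_dir, model_name, split, '*.tfrecord')
    shards = sorted(glob.glob(pattern))
    if not shards:
        raise FileNotFoundError(f'no TFRecord file matches {pattern}')
    return shards


# Demonstration on a throwaway directory tree with one empty shard file
with tempfile.TemporaryDirectory() as tmp:
    train_dir = os.path.join(tmp, 'demo_model', 'train')
    os.makedirs(train_dir)
    open(os.path.join(train_dir, 'part_0.tfrecord'), 'w').close()
    found = [os.path.basename(p) for p in list_shards(tmp, 'demo_model', 'train')]
    print(found)
```

The returned list could then be passed to `tf.data.TFRecordDataset` in place of the `tf.io.gfile.glob(...)` result.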

In [62]:
train_dataset = train_files.map(pp.parse_tfrecord_glue_files)
test_dataset = test_files.map(pp.parse_tfrecord_glue_files)
valid_dataset = valid_files.map(pp.parse_tfrecord_glue_files)

In [63]:
# define parameters
BATCH_SIZE_TRAIN = 32
BATCH_SIZE_TEST = 32
BATCH_SIZE_VALID = 64
EPOCH = 2

In [64]:
# set shuffle and batch size
train_dataset = train_dataset.shuffle(100).batch(BATCH_SIZE_TRAIN).repeat(EPOCH+1)
test_dataset = test_dataset.shuffle(100).batch(BATCH_SIZE_TEST).repeat(EPOCH+1)
valid_dataset = valid_dataset.batch(BATCH_SIZE_VALID)
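
Two details of the pipeline above are easy to miss. `shuffle(100)` keeps a buffer of 100 elements and draws from it at random, so the shuffling is only local when the dataset is much larger than the buffer. And because `batch` comes before `repeat`, every pass over the data ends with a possibly smaller last batch. The pure-Python sketch below imitates both behaviours for illustration. It is not TensorFlow's implementation, and the dataset size of 1000 is a hypothetical value.

```python
import math
import random


def buffered_shuffle(items, buffer_size, rng):
    """Yield items in a buffer-limited random order, similar in spirit to Dataset.shuffle."""
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # fill the buffer first
            continue
        index = rng.randrange(buffer_size)
        yield buffer[index]              # emit a random buffered element
        buffer[index] = item             # and replace it with the new one
    rng.shuffle(buffer)                  # input exhausted: drain what is left
    yield from buffer


def batches_per_pass(num_examples, batch_size):
    """Number of batches in one pass when the last partial batch is kept."""
    return math.ceil(num_examples / batch_size)


rng = random.Random(0)
shuffled = list(buffered_shuffle(range(1000), buffer_size=100, rng=rng))

# Every element appears exactly once
print(sorted(shuffled) == list(range(1000)))  # True
# An element can move forward by at most buffer_size - 1 positions
print(max(item - position for position, item in enumerate(shuffled)))
# 1000 examples in batches of 32: 31 full batches plus one of 8
print(batches_per_pass(1000, 32))  # 32
```

With `repeat(EPOCH+1)` applied after `batch`, the total number of batches available is `batches_per_pass(...) * (EPOCH + 1)`.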

In [66]:
tf.get_logger().propagate = False
from absl import logging
logging.set_verbosity(logging.INFO)
history_test=tf_bert.train_and_evaluate(model, 
                                        num_epochs=1, 
                                        steps_per_epoch=1, 
                                        train_data=train_dataset, 
                                        validation_steps=1, 
                                        eval_data=valid_dataset, 
                                        n_steps_history=1,
                                        output_dir=savemodel_path)

INFO:absl:training the model ...


Train for 1 steps, validate for 1 steps

 training set -> batch:1 loss:0.6993996500968933 and acc: 0.5

 validation set -> batch:1 val loss:0.6954854258469173 and val acc: 0.5091742873191833

Epoch 00001: saving model to /Users/tarrade/tensorflow_model/saved_model/checkpoint_model/ckpt_01
accuracy_train 0.5 epoch 0 



INFO:absl:
execution time: 0:04:07
INFO:absl:timing per epoch:
['0:04:05']
INFO:absl:sum timing over all epochs:
0:04:05
INFO:absl:env variables: 
environ({'BUCKET_NAME_STAGING': 'ai-platform-training-package-staging', 'TERM_PROGRAM': 'Apple_Terminal', 'PATH_TENSORBOARD': '/Users/tarrade/tensorboard', 'SHELL': '/bin/bash', 'TERM': 'xterm-color', 'CLOUDSDK_GSUTIL_PYTHON': '/Users/tarrade/anaconda-release/conda-env/env_gcp_sdk/bin/python', 'TMPDIR': '/var/folders/l7/00kxfwvs0vbbqxtrp3rpf3yh0000gn/T/', 'Apple_PubSub_Socket_Render': '/private/tmp/com.apple.launchd.3b9hENPCVa/Render', 'CONDA_SHLVL': '3', 'DIR_PROJ': '/Users/tarrade/Desktop/Work/Data_Science/Tutorials_Codes/Python/proj_multilingual_text_classification/src', 'TERM_PROGRAM_VERSION': '421.2', 'CONDA_PROMPT_MODIFIER': '(/Users/tarrade/anaconda-release/conda-env/env_multilingual_class) ', 'TERM_SESSION_ID': '9A9F31A5-24E4-468C-9323-3B89F0AE8F4D', 'LC_ALL': 'en_US.UTF-8', 'USER': 'tarrade', 'BUCKET_STAGING_NAME': 'ai-platform-trai

INFO:tensorflow:Assets written to: /Users/tarrade/tensorflow_model/saved_model/saved_model/tf_bert_classification/assets
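
The losses of about 0.69 and accuracies near 0.5 reported above are what one would expect after a single training step. Assuming the model is trained with a softmax cross-entropy loss (the loss is set inside `tf_bert.create_model`, which this notebook does not show), a classifier that assigns equal probability to both classes has a loss of $-\ln(1/2) = \ln 2$. The short check below computes that reference value; the function name is hypothetical.

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy of a prediction that gives every class the same probability."""
    return -math.log(1.0 / num_classes)


print(round(uniform_cross_entropy(2), 4))  # 0.6931
```

A loss that stays close to this value over many steps would suggest the model is not learning, whereas for a one-step smoke test like the one above it is simply the starting point.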
