# The IMDb Dataset
The IMDb dataset consists of movie reviews labeled by sentiment. The task is to predict the sentiment of a given review. We use the two-way (positive/negative) class split and only review-level labels.

In [17]:
from IPython.display import display, Markdown
with open('../../doc/env_variables_setup.md', 'r') as fh:
    content = fh.read()
display(Markdown(content))

Environment variables that need to be defined:   
`export DIR_PROJ=your_path_git_repository`  
`export PYTHONPATH=$DIR_PROJ/src`  
`export PATH_TENSORBOARD=your_path_tensorboard`  
`export PATH_DATASETS=your_path_datasets`  
`export PROJECT_ID=your_gcp_project_id`  
`export BUCKET_NAME=your_gcp_gs_bucket_name`  
`export BUCKET_TRANSLATION_NAME=your_gcp_gs_bucket_translation_name`  
`export BUCKET_STAGING_NAME=your_gcp_gs_bucket_staging_name`  
`export REGION=your_region`  
`export PATH_SAVE_MODEL=your_path_to_save_model`  
`export CLOUDSDK_PYTHON=your_path/conda-env/env_gcp_sdk/bin/python`  
`export CLOUDSDK_GSUTIL_PYTHON=your_path/conda-env/env_gcp_sdk/bin/python`  

- Use local Jupyter Lab 
    - you need to have the `jupyter-notebook` Anaconda python environment created [link](local_jupyter_lab_installation.md) 
    - you need to have the `jupyter-notebook` Anaconda python environment activated [link](local_jupyter_lab_installation.md) 
    - then define the environment variables above (copy and paste) 
    - you need to have the `env_multilingual_class` Anaconda python environment created [link](local_jupyter_lab_installation.md)  
    - start Jupyter Lab:  `jupyter lab` 
    - open a Jupyter Lab notebook from `notebook/` 
    - choose the proper Anaconda python environment:  `Python [conda env:env_multilingual_class]` [link](conda_env.md) 
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`


- Use GCP Jupyter Lab 
    - Go to GCP
    - open a Cloud Shell
    - `ssh-keygen -t rsa -b 4096 -C firstName_lastName`
    - `cp .ssh/id_rsa.pub .`
    - use Cloud Editor to open the file `id_rsa.pub` and copy its full content
    - Go to Compute Engine -> Metadata
    - Click SSH Keys
    - Click Edit
    - Click + Add item and paste the content of `id_rsa.pub`
    - You should see firstName_lastName on the left
    - Click Save
    - you need to start an AI Platform instance 
    - open a Jupyter Lab terminal and go to `/home/gcp_user_name/`
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`
    - then `cd proj_multilingual_text_classification/`
    - create the Anaconda Python environment: `conda env create -f env/environment.yml`
    - create a file `config.sh` in `/home` with the following information: 
    ```
    #!/bin/bash
    
    echo "applying some configuration ..."
    git config --global user.email user_email
    git config --global user.name user_name
    git config --global credential.helper store
        
    # Add the environment variables listed above below this line
    # [EDIT ME]
    export DIR_PROJ=your_path_git_repository
    export PYTHONPATH=$DIR_PROJ/src
  
    cd /home/gcp_user_name/
    
    conda activate env_multilingual_class

    export PS1='\[\e[91m\]\u@:\[\e[32m\]\w\[\e[0m\]$'
    ```
    - Go to AI Platform Notebook, select your instance and click "Reset".
    - Wait, then refresh your web browser tab with the Notebook


## Import Packages

In [18]:
import tensorflow as tf
from transformers import (
    BertConfig,
    BertTokenizer,
    XLMRobertaTokenizer,
    TFBertModel,
    TFXLMRobertaModel,
)
import os
from datetime import datetime
import tensorflow_datasets
from tensorboard import notebook
import math
from google.cloud import storage
from googleapiclient import discovery
from googleapiclient import errors
import logging
import subprocess

import time

## Check configuration

In [19]:
print(tf.version.GIT_VERSION, tf.version.VERSION)

v2.2.0-rc4-8-g2b96f3662b 2.2.0


In [20]:
print(tf.keras.__version__)

2.3.0-tf


In [21]:
gpus = tf.config.list_physical_devices('GPU')
if len(gpus)>0:
    for gpu in gpus:
        print('Name:', gpu.name, '  Type:', gpu.device_type)
else:
    print('No GPU available !!!!')

No GPU available !!!!


## Define Paths

In [22]:
try:
    data_dir=os.environ['PATH_DATASETS']
except KeyError:
    print('missing PATH_DATASETS')
try:   
    tensorboard_dir=os.environ['PATH_TENSORBOARD']
except KeyError:
    print('missing PATH_TENSORBOARD')
try:   
    savemodel_dir=os.environ['PATH_SAVE_MODEL']
except KeyError:
    print('missing PATH_SAVE_MODEL')
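
The three `try`/`except` blocks above print a message when a variable is missing but leave the corresponding name undefined, so the problem only surfaces later as a `NameError`. A minimal sketch of an alternative, assuming a hypothetical helper `require_env` (not part of this repository), collects every missing variable and fails once with a complete list:

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for all names, or raise listing every missing one.

    `require_env` is a hypothetical helper, not part of this repository.
    `environ` defaults to os.environ; a plain dict can be passed for testing.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if name not in environ]
    if missing:
        raise RuntimeError('missing environment variables: ' + ', '.join(missing))
    return {name: environ[name] for name in names}
```

In this notebook it could be called as `paths = require_env(['PATH_DATASETS', 'PATH_TENSORBOARD', 'PATH_SAVE_MODEL'])`, followed by `data_dir = paths['PATH_DATASETS']` and so on.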

## Train the model on AI Platform Training (for production)

In [23]:
model_name = 'tf_bert_classification'

In [24]:
os.environ['DIR_PROJ']

'/home/vera_luechinger/proj_multilingual_text_classification'

In [25]:
# create the package
process=subprocess.Popen(['python','setup.py', 'sdist'], cwd=os.environ['DIR_PROJ']+'/src', shell=False, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# wait for the process to terminate; communicate() drains both pipes so neither can block
stdout, stderr = process.communicate()
for line in stderr.decode('utf8').splitlines():
    print(line)
for line in stdout.decode('utf8').splitlines():
    print(line)



running sdist
running egg_info
writing bert_model.egg-info/PKG-INFO
writing dependency_links to bert_model.egg-info/dependency_links.txt
writing requirements to bert_model.egg-info/requires.txt
writing top-level names to bert_model.egg-info/top_level.txt
reading manifest file 'bert_model.egg-info/SOURCES.txt'
writing manifest file 'bert_model.egg-info/SOURCES.txt'
running check
creating bert_model-0.1
creating bert_model-0.1/analysis
creating bert_model-0.1/bert_model.egg-info
creating bert_model-0.1/model
creating bert_model-0.1/model/sklearn_naive_bayes
creating bert_model-0.1/model/test
creating bert_model-0.1/model/tf_bert_classification
creating bert_model-0.1/model/tf_custom_bert_classification
creating bert_model-0.1/preprocessing
creating bert_model-0.1/utils
copying files to bert_model-0.1...
copying setup.py -> bert_model-0.1
copying analysis/__init__.py -> bert_model-0.1/analysis
copying analysis/get_data.py -> bert_model-0.1/analysis
copying bert_model.egg-info/PKG-INFO -

In [26]:
path_package=''
name_package=''
for root, dirs, files in os.walk(os.environ['DIR_PROJ']+'/src/dist/'):
    for filename in files:
        print(root.split('/')[-4]+'/'+filename)
        print('Last modified: {}'.format(time.ctime(os.path.getmtime(root+'/'+filename))))
        print('Created: {}'.format(time.ctime(os.path.getctime(root+'/'+filename))))
        path_package = root+'/'+filename
        name_package = filename

proj_multilingual_text_classification/bert_model-0.1.tar.gz
Last modified: Thu May 28 08:35:06 2020
Created: Thu May 28 08:35:06 2020
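
The loop above keeps whichever file `os.walk` happens to visit last, which is not necessarily the newest archive once `dist/` holds more than one build. A minimal sketch, assuming a hypothetical helper `latest_sdist` (not part of this repository), selects the `.tar.gz` with the most recent modification time instead:

```python
from pathlib import Path


def latest_sdist(dist_dir):
    """Return the most recently modified *.tar.gz in dist_dir, or None.

    `latest_sdist` is a hypothetical helper, not part of this repository.
    """
    archives = list(Path(dist_dir).glob('*.tar.gz'))
    if not archives:
        return None
    # Compare archives by modification time and keep the newest one.
    return max(archives, key=lambda path: path.stat().st_mtime)
```

With this helper, `path_package` could be set from `latest_sdist(os.environ['DIR_PROJ'] + '/src/dist')` and `name_package` from its `.name` attribute, after checking that the result is not `None`.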


In [27]:
bucket_name = os.environ['BUCKET_STAGING_NAME']
output_folder = model_name +'_'+datetime.now().strftime("%Y_%m_%d_%H%M%S")

# upload the package found in the previous cell
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(output_folder+'/'+name_package)
blob.upload_from_filename(path_package)

path_package_gcs='gs://'+bucket_name+'/'+output_folder+'/'+name_package

In [28]:
project_name = os.environ['PROJECT_ID']
project_id = 'projects/{}'.format(project_name)
ai_platform_training = discovery.build('ml', 'v1')

In [29]:
output_folder

'tf_bert_classification_2020_05_28_083506'

In [30]:
# variables used to configure and label the job
type_production = 'test' #'test', 'production'
hardware = 'gpu' #'cpu', 'gpu', 'tpu'
owner = os.environ['OWNER']
tier = 'custom' #'basic', 'custom'
hp_tuning= False

# define parameters for ai platform training
package_gcs = path_package_gcs

job_name = model_name+'_lr_3e5_1_600_'+datetime.now().strftime("%Y_%m_%d_%H%M%S")
module_name = 'model.'+model_name+'.task'
if tier=='basic' and hardware=='cpu':
    # CPU
    region = 'europe-west1'
    
elif tier=='basic' and hardware=='gpu':
    # GPU
    region = 'europe-west1'
    
elif tier=='custom' and hardware=='gpu':
    # Custom GPU
    region = 'europe-west4'
    
elif tier=='basic' and hardware=='tpu':
    # TPU
    region = 'us-central1'
    
else:
    # Default
    region = 'europe-west1'
    
verbosity = 'INFO'

# define parameters for training of the model
if type_production=='production':
    # reading metadata
    _, info = tensorflow_datasets.load(name='glue/imdb',
                                       data_dir=data_dir,
                                       with_info=True)
    # define parameters
    epochs = 2 
    batch_size_train = 32
    #batch_size_test = 32
    batch_size_eval = 64  
    
    # Maximum length; be careful, BERT's maximum length is 512!
    max_length = 512

    # extract parameters
    size_train_dataset=info.splits['train'].num_examples
    #size_test_dataset=info.splits['test'].num_examples
    size_valid_dataset=info.splits['validation'].num_examples

    # compute parameters
    steps_per_epoch_train = math.ceil(size_train_dataset/batch_size_train)
    #steps_per_epoch_test = math.ceil(size_test_dataset/batch_size_test)
    steps_per_epoch_eval = math.ceil(size_valid_dataset/batch_size_eval)

    #print('Dataset size:          {:6}/{:6}/{:6}'.format(size_train_dataset, size_test_dataset, size_valid_dataset))
    #print('Batch size:            {:6}/{:6}/{:6}'.format(batch_size_train, batch_size_test, batch_size_eval))
    #print('Step per epoch:        {:6}/{:6}/{:6}'.format(steps_per_epoch_train, steps_per_epoch_test, steps_per_epoch_eval))
    #print('Total number of batch: {:6}/{:6}/{:6}'.format(steps_per_epoch_train*(epochs+1), steps_per_epoch_test*(epochs+1), steps_per_epoch_eval*1))
    print('Number of epoch:        {:6}'.format(epochs))
    print('Batch size:            {:6}/{:6}'.format(batch_size_train, batch_size_eval))
    print('Step per epoch:        {:6}/{:6}'.format(steps_per_epoch_train, steps_per_epoch_eval))

else:
    epochs = 2
    steps_per_epoch_train = 50
    batch_size_train = 32 
    steps_per_epoch_eval = 2
    batch_size_eval = 64
    
input_eval_tfrecords = 'gs://'+os.environ['BUCKET_NAME']+'/tfrecord/imdb/bert-base-multilingual-uncased/valid'
input_train_tfrecords = 'gs://'+os.environ['BUCKET_NAME']+'/tfrecord/imdb/bert-base-multilingual-uncased/train'
output_dir = 'gs://'+os.environ['BUCKET_NAME']+'/training_model_gcp/'+job_name
pretrained_model_dir = 'gs://'+os.environ['BUCKET_NAME']+'/pretrained_model/bert-base-multilingual-uncased'
epsilon = 1e-08
learning_rate= 3e-5
s = 0.5
decay_type = 'test'
n_batch_decay = 2

# building training_inputs
parameters =  ['--epochs', str(epochs),
               '--steps_per_epoch_train', str(steps_per_epoch_train),
               '--batch_size_train', str(batch_size_train),
               '--steps_per_epoch_eval', str(steps_per_epoch_eval),
               '--batch_size_eval', str(batch_size_eval),
               '--input_eval_tfrecords', input_eval_tfrecords ,
               '--input_train_tfrecords', input_train_tfrecords,
               '--output_dir', output_dir,
               '--pretrained_model_dir', pretrained_model_dir,
               '--verbosity_level', verbosity,
               '--epsilon', str(epsilon),
               '--learning_rate', str(learning_rate),
               '--s', str(s),
               '--decay_type', decay_type,
               '--n_batch_decay', str(n_batch_decay)]
if hardware=='tpu':
    parameters.append('--use_tpu')
    parameters.append('True')

training_inputs = {
    'packageUris': [package_gcs],
    'pythonModule': module_name,
    'args': parameters,
    'region': region,
    'runtimeVersion': '2.1',
    'pythonVersion': '3.7',
}

if tier=='basic' and hardware=='cpu':
    # CPU
    training_inputs['scaleTier'] = 'BASIC'
    
elif tier=='basic' and hardware=='gpu':
    # GPU
    training_inputs['scaleTier'] = 'BASIC_GPU'
    
elif tier=='custom' and hardware=='gpu':
    # Custom GPU
    training_inputs['scaleTier'] = 'CUSTOM'
    training_inputs['masterType'] = 'n1-standard-8'
    accelerator_master = {'acceleratorConfig': {
        'count': '1',
        'type': 'NVIDIA_TESLA_V100'}
    }
    training_inputs['masterConfig'] = accelerator_master

    
elif tier=='basic' and hardware=='tpu':
    # TPU
    training_inputs['scaleTier'] = 'BASIC_TPU'

else:
    # Default
    training_inputs['scaleTier'] = 'BASIC'

# add hyperparameter tuning to the job config.
if hp_tuning:
    hyperparams = {
        'algorithm': 'ALGORITHM_UNSPECIFIED',
        'goal': 'MAXIMIZE',
        'maxTrials': 3,
        'maxParallelTrials': 2,
        'maxFailedTrials': 1,
        'enableTrialEarlyStopping': True,
        'hyperparameterMetricTag': 'accuracy_train',
        'params': []}

    hyperparams['params'].append({
        'parameterName':'learning_rate',
        'type':'DOUBLE',
        'minValue': 1.0e-8,
        'maxValue': 1.0,
        'scaleType': 'UNIT_LOG_SCALE'})
    
    hyperparams['params'].append({
        'parameterName':'epsilon',
        'type':'DOUBLE',
        'minValue': 1.0e-9,
        'maxValue': 1.0,
        'scaleType': 'UNIT_LOG_SCALE'})

    # Add hyperparameter specification to the training inputs dictionary.
    training_inputs['hyperparameters'] = hyperparams
    
# building job_spec
labels = {'accelerator': hardware,
          'type': type_production,
          'owner': owner}

job_spec = {'jobId': job_name, 
            'labels': labels, 
            'trainingInput': training_inputs}
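
The cell above branches on `(tier, hardware)` twice: once to pick the region and once to pick the scale tier. A minimal sketch of the same mapping as a single lookup table, using only the values from that cell; `TIER_SETTINGS`, `DEFAULT_SETTINGS` and `resolve_tier` are hypothetical names, not part of this repository:

```python
import copy

# Fallback used for any (tier, hardware) pair not listed below,
# mirroring the final `else` branches of the original cell.
DEFAULT_SETTINGS = {'region': 'europe-west1', 'scaleTier': 'BASIC'}

# Hypothetical lookup table mirroring the if/elif chains in the cell above.
TIER_SETTINGS = {
    ('basic', 'cpu'): {'region': 'europe-west1', 'scaleTier': 'BASIC'},
    ('basic', 'gpu'): {'region': 'europe-west1', 'scaleTier': 'BASIC_GPU'},
    ('basic', 'tpu'): {'region': 'us-central1', 'scaleTier': 'BASIC_TPU'},
    ('custom', 'gpu'): {
        'region': 'europe-west4',
        'scaleTier': 'CUSTOM',
        'masterType': 'n1-standard-8',
        'masterConfig': {
            'acceleratorConfig': {'count': '1', 'type': 'NVIDIA_TESLA_V100'}
        },
    },
}


def resolve_tier(tier, hardware):
    """Return a fresh copy of the settings for (tier, hardware)."""
    settings = TIER_SETTINGS.get((tier, hardware), DEFAULT_SETTINGS)
    # Deep copy so callers can modify the result without changing the table.
    return copy.deepcopy(settings)
```

The result could then be merged into the job configuration with `training_inputs.update(resolve_tier(tier, hardware))`, which keeps the region and the scale tier from drifting apart when a new combination is added.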

In [31]:
training_inputs

{'packageUris': ['gs://ai-platform-training-package-staging/tf_bert_classification_2020_05_28_083506/bert_model-0.1.tar.gz'],
 'pythonModule': 'model.tf_bert_classification.task',
 'args': ['--epochs',
  '2',
  '--steps_per_epoch_train',
  '50',
  '--batch_size_train',
  '32',
  '--steps_per_epoch_eval',
  '2',
  '--batch_size_eval',
  '64',
  '--input_eval_tfrecords',
  'gs://multilingual_text_classification/tfrecord/imdb/bert-base-multilingual-uncased/valid',
  '--input_train_tfrecords',
  'gs://multilingual_text_classification/tfrecord/imdb/bert-base-multilingual-uncased/train',
  '--output_dir',
  'gs://multilingual_text_classification/training_model_gcp/tf_bert_classification_lr_3e5_1_600_2020_05_28_083507',
  '--pretrained_model_dir',
  'gs://multilingual_text_classification/pretrained_model/bert-base-multilingual-uncased',
  '--verbosity_level',
  'INFO',
  '--epsilon',
  '1e-08',
  '--learning_rate',
  '3e-05',
  '--s',
  '0.5',
  '--decay_type',
  'test',
  '--n_batch_decay',


In [32]:
# submit the training job
request = ai_platform_training.projects().jobs().create(body=job_spec,
                                                        parent=project_id)
try:
    response = request.execute()
    print('Job status for {}:'.format(response['jobId']))
    print('    state : {}'.format(response['state']))
    print('    createTime: {}'.format(response['createTime']))

except errors.HttpError as err:
    # For this example, just send some text to the logs.
    # You need to import logging for this to work.
    logging.error('There was an error creating the training job.'
                  ' Check the details:')
    logging.error(err._get_reason())

Job status for tf_bert_classification_lr_3e5_1_600_2020_05_28_083507:
    state : QUEUED
    createTime: 2020-05-28T08:35:16Z


In [33]:
# if you want to specify a specific job ID
#job_name = 'tf_bert_classification_2020_05_16_193551'
jobId = 'projects/{}/jobs/{}'.format(project_name, job_name)
request = ai_platform_training.projects().jobs().get(name=jobId)
response = None

try:
    response = request.execute()
    print('Job status for {}:'.format(response['jobId']))
    print('    state : {}'.format(response['state']))
    if 'trainingOutput' in response.keys():
        if 'trials' in response['trainingOutput'].keys():
            for sub_job in response['trainingOutput']['trials']:
                print('    trials : {}'.format(sub_job))
        # consumedMLUnits is reported inside trainingOutput
        if 'consumedMLUnits' in response['trainingOutput'].keys():
            print('    consumedMLUnits : {}'.format(response['trainingOutput']['consumedMLUnits']))
    if 'errorMessage' in response.keys():
        print('    errorMessage : {}'.format(response['errorMessage']))
    
except errors.HttpError as err:
    logging.error('There was an error getting the logs.'
                  ' Check the details:')
    logging.error(err._get_reason())

Job status for tf_bert_classification_lr_3e5_1_600_2020_05_28_083507:
    state : PREPARING
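
The status cell above queries the job once, and the outputs show it moving from `QUEUED` to `PREPARING`. A minimal sketch of a polling loop that waits for a terminal state; `wait_for_job` is a hypothetical helper, and the set of terminal states (`SUCCEEDED`, `FAILED`, `CANCELLED`) is an assumption about the job states returned by the API. The state is fetched through a callable so the loop does not depend on the API client:

```python
import time

# Assumed terminal job states; adjust if the API reports others.
TERMINAL_STATES = {'SUCCEEDED', 'FAILED', 'CANCELLED'}


def wait_for_job(get_state, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Call get_state() until it returns a terminal state or polls run out.

    Returns the last state seen, which may be non-terminal on timeout.
    """
    state = None
    for attempt in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            break
        # Do not sleep after the final poll.
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    return state
```

In this notebook the callable could be `lambda: ai_platform_training.projects().jobs().get(name=jobId).execute()['state']`.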


In [63]:
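# how to stream logs from a terminal:
#   gcloud ai-platform jobs stream-logs <job_name>
# or pass --stream-logs to `gcloud ai-platform jobs submit training`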

## TensorBoard for jobs running on GCP

In [22]:
# View open TensorBoard instance
notebook.list() 

Known TensorBoard instances:
  - port 6080: logdir /home/vera_luechinger/tensorflow_model/saved_model/tensorboard (started 2:28:10 ago; pid 4119)
  - port 8083: logdir /home/vera_luechinger/tensorflow_model/saved_model/tensorboard (started 2:21:31 ago; pid 4153)
  - port 6006: logdir /home/vera_luechinger/tensorflow_model/saved_model/tensorboard (started 3:36:08 ago; pid 2100)


In [92]:
# View pid
#!ps -ef|grep tensorboard

In [23]:
# Kill a TensorBoard process by using its pid
!kill -9 4119

In [1]:
%load_ext tensorboard
#%reload_ext tensorboard
%tensorboard  --logdir {'/home/vera_luechinger/tensorflow_model/saved_model/tensorboard'} \
              --host 0.0.0.0 \
              --port 6006

Reusing TensorBoard on port 6006 (pid 2100), started 3:39:55 ago. (Use '!kill 2100' to kill it.)

In [33]:
#%load_ext tensorboard
%reload_ext tensorboard
%tensorboard  --logdir {output_dir+'/tensorboard'} \
              --host 0.0.0.0 \
              --port 6006 \
              #--debugger_port 6006

Reusing TensorBoard on port 6006 (pid 2947), started 3:14:58 ago. (Use '!kill 2947' to kill it.)

In [226]:
%load_ext tensorboard
#%reload_ext tensorboard
%tensorboard  --logdir {os.environ['OUTPUT_DIR']+'/hparams_tuning'} \
              #--host 0.0.0.0 \
              #--port 6006 \
              #--debugger_port 6006

The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard
