# The Stanford Sentiment Treebank 
The Stanford Sentiment Treebank consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. We use the two-way (positive/negative) class split, and use only sentence-level labels.

In [1]:
from IPython.display import display, Markdown
with open('../../doc/env_variables_setup.md', 'r') as fh:
    content = fh.read()
display(Markdown(content))

Environment variables that need to be defined:   
`export DIR_PROJ=your_path_git_repository`  
`export PYTHONPATH=$DIR_PROJ/src`  
`export PATH_TENSORBOARD=your_path_tensorboard`  
`export PATH_DATASETS=your_path_datasets`  
`export PROJECT_ID=your_gcp_project_id`  
`export BUCKET_NAME=your_gcp_gs_bucket_name`  
`export BUCKET_TRANSLATION_NAME=your_gcp_gs_bucket_translation_name`
`export REGION=your_region`  
`export PATH_SAVE_MODEL=your_path_to_save_model` 
`export CLOUDSDK_PYTHON=your_path/conda-env/env_gcp_sdk/bin/python`
`export CLOUDSDK_GSUTIL_PYTHON=your_path/conda-env/env_gcp_sdk/bin/python`

- Use local Jupyter Lab 
    - you need to have the `jupyter-notebook` Anaconda python environment created [link](local_jupyter_lab_installation.md) 
    - you need to have the `jupyter-notebook` Anaconda python environment activated [link](local_jupyter_lab_installation.md) 
    - then define the environment variables above (copy and paste) 
    - you need to have the `env_multilingual_class` Anaconda python environment created [link](local_jupyter_lab_installation.md)  
    - start Jupyter Lab:  `jupyter lab` 
    - open a Jupyter Lab notebook from `notebook/` 
    - choose the proper Anaconda python environment:  `Python [conda env:env_multilingual_class]` [link](conda_env.md) 
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`


- Use GCP Jupyter Lab 
    - Go to GCP
    - open a Cloud Shell
    - `ssh-keygen -t rsa -b 4096 -C firstName_lastName`
    - `cp .ssh/id_rsa.pub .`
    - use Cloud Editor to edit this file `id_rsa.pub` and copy the full content
    - Go to Compute Engine -> Metadata
    - Click SSH Keys
    - Click Edit
    - Click + Add item, copy the content of `id_rsa.pub`
    - You should see firstName_lastName on the left
    - Click Save
    - you need to start an AI Platform instance 
    - open a Jupyter Lab terminal and go to `/home/gcp_user_name/`
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`
    - then `cd proj_multilingual_text_classification/`
    - create the Anaconda Python environment `conda env create -f env/environment.yml`
    - create a file `config.sh` in `/home` with the following information: 
    ```
    #!/bin/bash
    
    echo "applying some configuration ..."
    git config --global user.email user_email
    git config --global user.name user_name
    git config --global credential.helper store
        
    # Add the environment variables listed above here
    # [EDIT ME]
    export DIR_PROJ=your_path_git_repository
    export PYTHONPATH=$DIR_PROJ/src
  
    cd /home/gcp_user_name/
    
    conda activate env_multilingual_class

    export PS1='\[\e[91m\]\u@:\[\e[32m\]\w\[\e[0m\]$'
    ```
    - Go to AI Platform Notebook, select your instance and click "Reset".
    - Wait and refresh your Web browser with the Notebook


## Import Packages

In [1]:
import tensorflow as tf
import tensorflow_datasets

from tensorflow.keras.utils import to_categorical

from transformers import (
    BertConfig,
    BertTokenizer,
    XLMRobertaTokenizer,
    TFBertModel,
    TFXLMRobertaModel,
    TFBertForSequenceClassification,
    glue_convert_examples_to_features,
    glue_processors
)

from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report

import matplotlib.pyplot as plt

from google.cloud import storage

import math
import numpy as np
import os
import glob
import time
from datetime import timedelta
import shutil
from datetime import datetime
import pickle
import re
import codecs
import json

## Check configuration

In [2]:
print(tf.version.GIT_VERSION, tf.version.VERSION)

unknown 2.1.0


In [3]:
print(tf.keras.__version__)

2.2.4-tf


In [4]:
gpus = tf.config.list_physical_devices('GPU')
if len(gpus)>0:
    for gpu in gpus:
        print('Name:', gpu.name, '  Type:', gpu.device_type)
else:
    print('No GPU available !!!!')

No GPU available !!!!


## Define Paths

In [5]:
try:
    data_dir=os.environ['PATH_DATASETS']
except KeyError:
    print('missing PATH_DATASETS')
try:   
    tensorboard_dir=os.environ['PATH_TENSORBOARD']
except KeyError:
    print('missing PATH_TENSORBOARD')
try:   
    savemodel_dir=os.environ['PATH_SAVE_MODEL']
except KeyError:
    print('missing PATH_SAVE_MODEL')
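
The cell above only prints a message when a variable is missing, so a later cell would fail with a `NameError`. A minimal sketch of a stricter lookup is shown below. It assumes a hypothetical helper `require_env` (not part of this project's `utils` package) and example paths that are placeholders only.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for every name, or raise if any is missing.

    `require_env` is a hypothetical helper used for illustration only.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise EnvironmentError('missing environment variables: ' + ', '.join(missing))
    return {name: environ[name] for name in names}


# usage with an explicit mapping so the example does not depend on the shell;
# the paths below are placeholders, not the project's real directories
example_paths = require_env(
    ['PATH_DATASETS', 'PATH_TENSORBOARD', 'PATH_SAVE_MODEL'],
    environ={
        'PATH_DATASETS': '/tmp/datasets',
        'PATH_TENSORBOARD': '/tmp/tensorboard',
        'PATH_SAVE_MODEL': '/tmp/models',
    },
)
example_data_dir = example_paths['PATH_DATASETS']
```

Calling `require_env([...])` without the `environ` argument reads from `os.environ`, which is what the notebook would need in practice.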

## Import local packages

In [6]:
import preprocessing.preprocessing as pp
import utils.model_metrics as mm
import utils.model_utils as mu



In [7]:
import importlib
importlib.reload(pp);
importlib.reload(mm);
importlib.reload(mu);

## Check the local model

In [10]:
savemodel_path = os.path.join(savemodel_dir, 'saved_model_512')
os.makedirs(savemodel_path, exist_ok=True)

In [11]:
model=tf.keras.models.load_model(os.path.join(savemodel_path, 'tf_bert_classification'))

In [12]:
# check the saved model
print('Model: {}'.format(model.name))
for i in os.listdir(os.path.join(savemodel_path,model.name)):
        print(' ',i)

Model: tf_bert_classification
  assets
  variables
  saved_model.pb


In [13]:
model.summary()

Model: "tf_bert_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  167356416 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 167,357,954
Trainable params: 167,357,954
Non-trainable params: 0
_________________________________________________________________


In [14]:
source_bucket_name = savemodel_path+'/'+model.name

In [15]:
for root, dirs, files in os.walk(source_bucket_name):
    for name in files:
        if 'history' not in name:
            print(os.path.join('.../', name))

.../saved_model.pb
.../variables.index
.../variables.data-00000-of-00001


The **variables** directory contains a standard training checkpoint (see the guide to training checkpoints).  
The **assets** directory contains files used by the TensorFlow graph, for example text files used to initialize vocabulary tables.  
The **saved_model.pb** file stores the actual TensorFlow program, or model, and a set of named signatures, each identifying a function that accepts tensor inputs and produces tensor outputs.
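
The layout described above can be checked before uploading or serving. The following is a minimal sketch, assuming a hypothetical helper `check_saved_model_layout` that treats `saved_model.pb` and `variables` as required and `assets` as optional (that split is an assumption made for this example).

```python
import os
import tempfile


def check_saved_model_layout(model_dir):
    """Return the required SavedModel entries that are missing from model_dir."""
    required = {
        'saved_model.pb': os.path.isfile,
        'variables': os.path.isdir,
    }
    return [name for name, exists in required.items()
            if not exists(os.path.join(model_dir, name))]


# usage on a throwaway directory that mimics the layout listed above
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'variables'))
    missing_before = check_saved_model_layout(tmp)
    open(os.path.join(tmp, 'saved_model.pb'), 'wb').close()
    missing_after = check_saved_model_layout(tmp)

print(missing_before)  # ['saved_model.pb']
print(missing_after)   # []
```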

In [16]:
os.environ['MODEL_LOCAL']=savemodel_path+'/'+model.name

SavedModels may contain multiple variants of the model (multiple `v1.MetaGraphDefs`, identified with the **--tag_set** flag to `saved_model_cli`), but this is rare. APIs which create multiple variants of a model include `tf.Estimator.experimental_export_all_saved_models` and, in TensorFlow 1.x, `tf.saved_model.Builder`.

In [17]:
%%bash
saved_model_cli show --dir $MODEL_LOCAL --tag_set serve 

The given SavedModel MetaGraphDef contains SignatureDefs with the following keys:
SignatureDef key: "__saved_model_init_op"
SignatureDef key: "serving_default"


In [18]:
%%bash
saved_model_cli show --dir $MODEL_LOCAL --tag_set serve --signature_def serving_default

The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 512)
      name: serving_default_attention_mask:0
  inputs['input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 512)
      name: serving_default_input_ids:0
  inputs['token_type_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 512)
      name: serving_default_token_type_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['output_1'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 2)
      name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict
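
The signature above expects three integer inputs of length 512 per instance. A minimal sketch of a check that a prediction instance matches it is given below; `validate_instance` and the two constants are hypothetical names introduced for this example.

```python
EXPECTED_INPUTS = ('attention_mask', 'input_ids', 'token_type_ids')
SEQ_LENGTH = 512  # second dimension of every input in the signature above


def validate_instance(instance, seq_length=SEQ_LENGTH):
    """Return a list of problems found in one prediction instance (empty if none)."""
    problems = []
    for key in EXPECTED_INPUTS:
        if key not in instance:
            problems.append('missing input: ' + key)
            continue
        values = instance[key]
        if len(values) != seq_length:
            problems.append('{} has length {}, expected {}'.format(key, len(values), seq_length))
        elif not all(isinstance(v, int) and not isinstance(v, bool) for v in values):
            problems.append(key + ' must contain integers only')
    for key in instance:
        if key not in EXPECTED_INPUTS:
            problems.append('unexpected input: ' + key)
    return problems


good = {key: [0] * SEQ_LENGTH for key in EXPECTED_INPUTS}
bad = {'input_ids': [101, 102], 'attention_mask': [1] * SEQ_LENGTH}
print(validate_instance(good))  # []
print(validate_instance(bad))
# ['input_ids has length 2, expected 512', 'missing input: token_type_ids']
```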


## Copy the local model on GCP

def copy_local_directory_to_gcs(local_path, bucket, gcs_path):
    """Recursively copy a directory of files to GCS.

    local_path should be a directory and not have a trailing slash.
    """
    assert os.path.isdir(local_path)
    for root, dirs, files in os.walk(local_path):
        for name in files:
            local_file = os.path.join(root, name)
            remote_path = os.path.join(gcs_path, local_file[1 + len(local_path) :])
            print(remote_path)
            blob = bucket.blob(remote_path)
            blob.upload_from_filename(local_file)
            print('copy of the file on gs:// done !')
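
Before starting a long upload, it can help to list the object names that would be created. The sketch below is a dry run only and does not call the `google.cloud.storage` API; `plan_gcs_uploads` is a hypothetical helper, and it builds remote names with `posixpath` so that they use `/` regardless of the local path separator.

```python
import os
import posixpath
import tempfile


def plan_gcs_uploads(local_path, gcs_path):
    """Return (local_file, remote_path) pairs, sorted by remote path, without uploading."""
    pairs = []
    for root, _dirs, files in os.walk(local_path):
        for name in files:
            local_file = os.path.join(root, name)
            relative = os.path.relpath(local_file, local_path)
            # object names in GCS use '/', whatever the local separator is
            remote = posixpath.join(gcs_path, *relative.split(os.sep))
            pairs.append((local_file, remote))
    return sorted(pairs, key=lambda pair: pair[1])


# usage on a throwaway directory that mimics a saved model
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'variables'))
    for rel in ('saved_model.pb', os.path.join('variables', 'variables.index')):
        open(os.path.join(tmp, rel), 'wb').close()
    remote_paths = [remote for _local, remote
                    in plan_gcs_uploads(tmp, 'saved_model/tf_bert_classification')]

print(remote_paths)
```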

In [19]:
destination_bucket_name = 'saved_model/tf_bert_classification'
copy_model_gcp=False

In [20]:
# will take some time since the size of the model is 2 GB!
storage_client = storage.Client()
bucket = storage_client.get_bucket(os.environ['BUCKET_NAME'])
if copy_model_gcp:
    mu.copy_local_directory_to_gcs(source_bucket_name, bucket, destination_bucket_name)

## Model serving setup

### Define a name and a region for our model on GCP and create it

In [21]:
# defining the name of the model for online prediction
os.environ['MODEL_NAME']=model.name+'_IMDb_512'

In [22]:
# The default VM supports models up to 500 MB; larger models (up to 2 GB) need an n1-standard-2 VM for online prediction, which is only available in us-central1.
os.environ['REGION_PRED']='us-central1'
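
Whether the larger machine type is needed depends on the size of the saved model on disk. The following is a minimal sketch that measures a directory and compares it with the two limits quoted in the comment above. The helper names are hypothetical, the limits are copied from that comment rather than from current GCP documentation, and interpreting MB and GB as powers of 1024 is an assumption.

```python
import os
import tempfile

DEFAULT_LIMIT_BYTES = 500 * 1024 ** 2  # limit quoted above for the default VM
LARGE_LIMIT_BYTES = 2 * 1024 ** 3      # limit quoted above for n1-standard-2


def directory_size_bytes(path):
    """Sum the sizes of all files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def required_machine(size_bytes):
    """Map a model size to the machine category suggested by the limits above."""
    if size_bytes <= DEFAULT_LIMIT_BYTES:
        return 'default'
    if size_bytes <= LARGE_LIMIT_BYTES:
        return 'n1-standard-2'
    return 'too large'


# usage on a throwaway directory holding a single 1024-byte file
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'saved_model.pb'), 'wb') as fh:
        fh.write(b'\0' * 1024)
    example_size = directory_size_bytes(tmp)

print(example_size, required_machine(example_size))  # 1024 default
```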

### Check models already deployed

In [23]:
%%bash
gcloud ai-platform models list

NAME                         DEFAULT_VERSION_NAME
tf_bert_classification_test  v1


In [24]:
%%bash
gcloud ai-platform models create $MODEL_NAME \
       --regions=$REGION_PRED \
       --enable-logging > /dev/null 2>&1 \
|| printf "\n\nthe model already exists !\n\n\n"



the model already exists !




### Define all parameters and upload our model

In [27]:
# define the Python and runtime versions
os.environ['RUNTIME_VERSION'] = '2.1'
os.environ['PYTHON_VERSION'] = '3.7'
os.environ['MODEL_BINARIES'] = 'gs://'+os.environ['BUCKET_NAME']+'/saved_model_512/'+model.name
os.environ['MODEL_VERSION'] = 'v1'

In [29]:
#!gsutil ls gs://multilingual_text_classification/saved_model_512/tf_bert_classification

gs://multilingual_text_classification/saved_model_512/tf_bert_classification/saved_model.pb
gs://multilingual_text_classification/saved_model_512/tf_bert_classification/variables/


In [30]:
%%bash
gcloud beta ai-platform versions create $MODEL_VERSION \
       --model $MODEL_NAME \
       --origin $MODEL_BINARIES \
       --runtime-version $RUNTIME_VERSION \
       --python-version $PYTHON_VERSION \
       --machine-type n1-standard-2 \
       --description "This is a sentiment classifier using BERT, fine-tuned on SST-2, for testing" > /dev/null 2>&1 \
|| printf "\n\nsame version of the model already exists !\n\n\n"

### Check that the new model was deployed

In [31]:
%%bash
gcloud ai-platform models list


NAME                             DEFAULT_VERSION_NAME
tf_bert_classification_IMDb_512  v1
tf_bert_classification_test      v1


## Model serving inference

### Prepare data for online prediction

Examples of the two input formats:

`--json-request`
```
{
  "instances": [
    {"x": [1, 2], "y": [3, 4]},
    {"x": [-1, -2], "y": [-3, -4]}
  ]
}
```

`--json-instances`
```
{"images": [0.0, …, 0.1], "key": 3}
{"images": [0.0, …, 0.1], "key": 2}
```

In [27]:
# load data
#tfrecord_data_dir=data_dir+'/tfrecord/sst2'
#os.makedirs(tfrecord_data_dir, exist_ok=True)
#valid_files = tf.data.TFRecordDataset(tfrecord_data_dir+'/valid_dataset.tfrecord')
#valid_dataset = valid_files.map(pp.parse_tfrecord_glue_files)

In [36]:
TOKENIZER = 'bert-base-multilingual-uncased'
tokenizer = BertTokenizer.from_pretrained(TOKENIZER)

In [40]:
def transform(text):
    # tokenize a list of strings and pad each sequence to 512 tokens,
    # the input length expected by the serving signature
    tokens=tokenizer.batch_encode_plus(text, return_tensors="tf", pad_to_max_length=True, max_length=512)

    return tokens

In [41]:
text = ["Ich habe nicht wirklich viele Filme mit Holly Hunter gesehen, aber es war eine angenehme Überraschung, sie in den Broadcast News zu sehen. Sie ist eine hartgesottene Journalistin, Jane Craig, die ihre ganze Zeit der TV-Nachrichtensendung gewidmet hat. Ihr Kollege Aaron Altman hat ihre Fackel lange getragen, ohne etwas zu sagen. Das Liebesdreieck wird von Tom Grunnick vervollständigt. Er ist der etwas distanzierte Ex-Sportcaster, der der neue Reporter ist. Für Jane symbolisiert er alles, was sie an Nachrichten nicht mag, und verwandelt sie in Edutainment, nicht in ernsthafte Geschäfte. Zu ihrer großen Überraschung fühlt sich Jane von Tom angezogen. Holly Hunter macht eine großartige Leistung als freche Journalistin. Aber ich verstehe nicht ganz, was sie an ihrem neuen Kollegen Tom so reizvoll findet. Es ist etwas mit ihnen, das uns daran hindert, ihm ganz nah zu kommen. Fast ebenso beeindruckend ist Albert Brooks, der alles in der Rolle eines Profis gibt, der mehr als 100 Prozent für seinen Job gibt, aber nicht ganz so viel dafür bekommt. Eigentlich dachte ich eine Weile, er sei Steve Guttenberg von der Police Academy (1984). Er hat ein paar lustige Zeilen und wenn dies ein Bild von Meg Ryan wäre, würden sie es eine romantische Komödie nennen. Bei einer Laufzeit von mehr als zwei Stunden könnten einige Szenen bearbeitet oder komplett weggelassen worden sein, z. Janes und Aarons Reise nach Mittelamerika. Außerdem bin ich ein Trottel für Happy Ends und hatte sieben Jahre später ein anderes Ende als nur ein Wiedersehen zwischen den dreien vorgezogen."]

In [46]:
example = transform(text)
example

{'input_ids': <tf.Tensor: shape=(1, 512), dtype=int32, numpy=
 array([[  101, 12373, 21072, 10801, 24257, 90393, 16994, 14373, 10234,
         37948, 19282, 55223,   117, 11712, 10153, 10313, 10361, 56085,
         57290, 10688, 10859, 13026, 31267,   117, 10271, 10104, 10129,
         19826, 11636, 10331, 24622,   119, 10271, 10339, 10361, 19799,
         13217, 82486, 10111, 19477, 10262,   117, 13758, 21859,   117,
         10121, 12566, 37930, 12201, 10118, 10827,   118, 60583, 62307,
         10728, 25463, 95001, 11193,   119, 13329, 39407, 21432, 24028,
         13631, 10629, 11193, 12566, 72010, 15135, 13405, 79571,   117,
         15025, 23345, 10331, 58528,   119, 10216, 24941, 69246, 31261,
         22067, 10888, 10168, 11956, 27730, 27342, 15405, 44655, 73166,
         10123,   119, 10162, 10339, 10118, 23345, 91023, 16149, 11460,
           118, 13148, 28265, 10177,   117, 10118, 10118, 13931, 24136,
         10339,   119, 10325, 13758, 24395, 51180, 10162, 21785,   117,
  

In [57]:
#np_array = np.array(list(example.as_numpy_iterator()))
example['input_ids'].numpy()[0]

array([  101, 12373, 21072, 10801, 24257, 90393, 16994, 14373, 10234,
       37948, 19282, 55223,   117, 11712, 10153, 10313, 10361, 56085,
       57290, 10688, 10859, 13026, 31267,   117, 10271, 10104, 10129,
       19826, 11636, 10331, 24622,   119, 10271, 10339, 10361, 19799,
       13217, 82486, 10111, 19477, 10262,   117, 13758, 21859,   117,
       10121, 12566, 37930, 12201, 10118, 10827,   118, 60583, 62307,
       10728, 25463, 95001, 11193,   119, 13329, 39407, 21432, 24028,
       13631, 10629, 11193, 12566, 72010, 15135, 13405, 79571,   117,
       15025, 23345, 10331, 58528,   119, 10216, 24941, 69246, 31261,
       22067, 10888, 10168, 11956, 27730, 27342, 15405, 44655, 73166,
       10123,   119, 10162, 10339, 10118, 23345, 91023, 16149, 11460,
         118, 13148, 28265, 10177,   117, 10118, 10118, 13931, 24136,
       10339,   119, 10325, 13758, 24395, 51180, 10162, 21785,   117,
       10140, 10271, 10144, 60583, 10801, 18115,   117, 10138, 15405,
       30253, 16380,

In [58]:
# create a JSON file with the right format for online prediction
serving_data_dir=data_dir+'/serving/sst2'
os.makedirs(serving_data_dir, exist_ok=True)
json_file = serving_data_dir+'/input_predict_gcloud.json' 

with codecs.open(json_file, 'w', encoding='utf-8') as f:
    # one JSON object per line, one line per instance
    instance = {
        'input_ids': example['input_ids'].numpy()[0].tolist(),
        'attention_mask': example['attention_mask'].numpy()[0].tolist(),
        'token_type_ids': example['token_type_ids'].numpy()[0].tolist(),
    }
    json.dump(instance, f, sort_keys=True)
    f.write("\n")
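
The file written above follows the newline-delimited `--json-instances` format, with one JSON object per line. A minimal sketch of wrapping such lines into the single-object `--json-request` format is shown below; `instances_to_request` is a hypothetical helper and the two sample lines are toy data, not model inputs.

```python
import json


def instances_to_request(lines):
    """Wrap newline-delimited JSON instances into one {"instances": [...]} body."""
    instances = [json.loads(line) for line in lines if line.strip()]
    return {'instances': instances}


# toy instances, one JSON object per line
json_instances = [
    '{"x": [1, 2], "y": [3, 4]}\n',
    '{"x": [-1, -2], "y": [-3, -4]}\n',
]
request = instances_to_request(json_instances)
print(json.dumps(request))
# {"instances": [{"x": [1, 2], "y": [3, 4]}, {"x": [-1, -2], "y": [-3, -4]}]}
```

Passing an open file handle instead of a list works as well, since iterating over a text file yields its lines.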

In [59]:
# inspecting the newly created data
with open(json_file) as f:
    for line in f:
        # parse each line as JSON (safer than eval on file content)
        json_data = json.loads(line)
        print(json_data.keys())
        print(json_data)

dict_keys(['attention_mask', 'input_ids', 'token_type_ids'])
{'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

### Make prediction using local model to test the input data

In [60]:
os.environ['DATA_ONLINE_PRED']=json_file

In [61]:
%%bash
gcloud ai-platform local predict \
       --signature-name serving_default \
       --model-dir $MODEL_LOCAL \
       --json-instances $DATA_ONLINE_PRED \
       --verbosity info \
       #--log-http 

OUTPUT_1
[-1.6366816759109497, 1.5684915781021118]


Instructions for updating:
non-resource variables are not supported in the long term
2020-04-14 15:49:45.212670: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2020-04-14 15:49:45.225213: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2800235000 Hz
2020-04-14 15:49:45.228970: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55fdca7496a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-14 15:49:45.229019: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-14 15:49:45.229294: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
Instructions for updating:
This function will only be available through the

### Make online prediction

In [62]:
%%bash
gcloud ai-platform predict \
       --model $MODEL_NAME \
       --version $MODEL_VERSION \
       --json-instances $DATA_ONLINE_PRED \
       --verbosity info \
       #--log-http

[[-1.63668013, 1.56849098]]


INFO: Display format: "default table[no-heading](predictions)"
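
The two values returned per instance appear to be raw scores rather than probabilities, since one of them is negative. Assuming they are logits, a softmax turns them into class probabilities, as in the minimal sketch below. Which index corresponds to the positive class is not stated in this notebook and is left open here.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.63668013, 1.56849098]  # values copied from the online prediction above
probabilities = softmax(logits)
predicted_class = probabilities.index(max(probabilities))
print(predicted_class)  # 1
```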
