# The Stanford Sentiment Treebank 
The Stanford Sentiment Treebank consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given sentence. We use the two-way (positive/negative) class split, and use only sentence-level labels.

In [1]:
from IPython.display import display, Markdown
with open('../../doc/env_variables_setup.md', 'r') as fh:
    content = fh.read()
display(Markdown(content))

Environment variables that need to be defined:   
`export DIR_PROJ=your_path_git_repository`  
`export PYTHONPATH=$DIR_PROJ/src`  
`export PATH_TENSORBOARD=your_path_tensorboard`  
`export PATH_DATASETS=your_path_datasets`  
`export PROJECT_ID=your_gcp_project_id`  
`export BUCKET_NAME=your_gcp_gs_bucket_name`  
`export REGION=your_region`  
`export PATH_SAVE_MODEL=your_path_to_save_model` 

- Use local Jupyter Lab 
    - you need to have the `jupyter-notebook` Anaconda python environment created [link](local_jupyter_lab_installation.md) 
    - you need to have the `jupyter-notebook` Anaconda python environment activated [link](local_jupyter_lab_installation.md) 
    - then define the environment variables above (copy and paste) 
    - you need to have the `env_multilingual_class` Anaconda python environment created [link](local_jupyter_lab_installation.md)  
    - start Jupyter Lab:  `jupyter lab` 
    - open a Jupyter Lab notebook from `notebook/` 
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`
    - choose the proper Anaconda python environment:  `Python [conda env:env_multilingual_class]` [link](conda_env.md) 


- Use GCP Jupyter Lab 
    - Go on GCP
    - open a Cloud Shell
    - `ssh-keygen -t rsa -b 4096 -C firstName_lastName`
    - `cp .ssh/id_rsa.pub .`
    - use Cloud Editor to edit this file `id_rsa.pub` and copy the full content
    - Go on Compute Engine -> Metadata
    - Click SSH Keys
    - Click Edit
    - Click + Add item, copy the content of `id_rsa.pub`
    - You should see firstName_lastName on the left
    - Click Save
    - you need to start an AI Platform instance 
    - open a Jupyter Lab terminal and go to `/home/gcp_user_name/`
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`
    - then `cd proj_multilingual_text_classification/`
    - create the Anaconda Python environment `conda env create -f env/environment.yml`
    - create a file `config.sh` in `/home` with the following information: 
    ```
    #!/bin/bash
    
    echo "applying some configuration ..."
    git config --global user.email user_email
    git config --global user.name user_name
    git config --global credential.helper store
        
    # Add the environment variables listed above here
    # [EDIT ME]
    export DIR_PROJ=your_path_git_repository
    export PYTHONPATH=$DIR_PROJ/src
  
    cd /home/gcp_user_name/
    
    conda activate env_multilingual_class

    export PS1='\[\e[91m\]\u@:\[\e[32m\]\w\[\e[0m\]$'
    ```
    - Go to AI Platform Notebook, select your instance and click "Reset".
    - Wait and refresh your web browser with the Notebook


## Import Packages

In [2]:
import tensorflow as tf
import tensorflow_datasets

from tensorflow.keras.utils import to_categorical

from transformers import (
    BertConfig,
    BertTokenizer,
    XLMRobertaTokenizer,
    TFBertModel,
    TFXLMRobertaModel,
    TFBertForSequenceClassification,
    glue_convert_examples_to_features,
    glue_processors
)

from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report

import matplotlib.pyplot as plt

from google.cloud import storage

import math
import numpy as np
import os
import glob
import time
from datetime import datetime, timedelta
import shutil
import pickle
import re

## Check configuration

In [3]:
print(tf.version.GIT_VERSION, tf.version.VERSION)

v2.1.0-rc2-17-ge5bf8de410 2.1.0


In [4]:
print(tf.keras.__version__)

2.2.4-tf


In [5]:
gpus = tf.config.list_physical_devices('GPU')
if len(gpus)>0:
    for gpu in gpus:
        print('Name:', gpu.name, '  Type:', gpu.device_type)
else:
    print('No GPU available !!!!')

No GPU available !!!!


## Define Paths

In [6]:
try:
    data_dir=os.environ['PATH_DATASETS']
except KeyError:
    print('missing PATH_DATASETS')
try:   
    tensorboard_dir=os.environ['PATH_TENSORBOARD']
except KeyError:
    print('missing PATH_TENSORBOARD')
try:   
    savemodel_dir=os.environ['PATH_SAVE_MODEL']
except KeyError:
    print('missing PATH_SAVE_MODEL')
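
The cell above only prints a message when a variable is missing, so later cells would fail with a `NameError`. As an alternative, here is a minimal sketch that fails early and reports every missing variable at once. The helper name `require_env` and the demo dictionary are hypothetical and only serve the illustration; in the notebook you would call it with the default `os.environ`.

```python
import os


def require_env(names, environ=os.environ):
    """Return {name: value} for all names, or raise if any is undefined."""
    missing = [name for name in names if name not in environ]
    if missing:
        raise KeyError('missing environment variables: ' + ', '.join(missing))
    return {name: environ[name] for name in names}


# Demo with a fake environment (hypothetical paths, for illustration only).
demo_env = {
    'PATH_DATASETS': '/tmp/datasets',
    'PATH_TENSORBOARD': '/tmp/tensorboard',
    'PATH_SAVE_MODEL': '/tmp/models',
}
paths = require_env(
    ['PATH_DATASETS', 'PATH_TENSORBOARD', 'PATH_SAVE_MODEL'],
    environ=demo_env,
)
print(paths['PATH_DATASETS'])
```

Raising immediately keeps the error close to its cause instead of surfacing several cells later.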

## Import local packages

In [7]:
import preprocessing.preprocessing as pp
import utils.model_metrics as mm
import utils.model_utils as mu



In [53]:
import importlib
importlib.reload(pp);
importlib.reload(mm);
importlib.reload(mu);

## Loading data from TensorFlow Datasets

In [9]:
data, info = tensorflow_datasets.load(name='glue/sst2',
                                      data_dir=data_dir,
                                      with_info=True)

INFO:absl:Overwrite dataset info from restored data version.
INFO:absl:Reusing dataset glue (/Users/tarrade/tensorflow_datasets/glue/sst2/1.0.0)
INFO:absl:Constructing tf.data.Dataset for split None, from /Users/tarrade/tensorflow_datasets/glue/sst2/1.0.0


### Checking basic info from the metadata

In [10]:
info

tfds.core.DatasetInfo(
    name='glue',
    version=1.0.0,
    description='GLUE, the General Language Understanding Evaluation benchmark
(https://gluebenchmark.com/) is a collection of resources for training,
evaluating, and analyzing natural language understanding systems.

            The Stanford Sentiment Treebank consists of sentences from movie reviews and
            human annotations of their sentiment. The task is to predict the sentiment of a
            given sentence. We use the two-way (positive/negative) class split, and use only
            sentence-level labels.',
    homepage='https://nlp.stanford.edu/sentiment/index.html',
    features=FeaturesDict({
        'idx': tf.int32,
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'sentence': Text(shape=(), dtype=tf.string),
    }),
    total_num_examples=70042,
    splits={
        'test': 1821,
        'train': 67349,
        'validation': 872,
    },
    supervised_keys=None,
    citation="""@

In [11]:
pp.print_info_dataset(info)

Labels:
      ['negative', 'positive']

Number of label:
      2

Structure of the data:
      dict_keys(['sentence', 'label', 'idx'])

Number of entries:
   Train dataset: 67349
   Test dataset:  1821
   Valid dataset: 872



### Checking basic info from the data

In [12]:
data

{'test': <DatasetV1Adapter shapes: {idx: (), label: (), sentence: ()}, types: {idx: tf.int32, label: tf.int64, sentence: tf.string}>,
 'train': <DatasetV1Adapter shapes: {idx: (), label: (), sentence: ()}, types: {idx: tf.int32, label: tf.int64, sentence: tf.string}>,
 'validation': <DatasetV1Adapter shapes: {idx: (), label: (), sentence: ()}, types: {idx: tf.int32, label: tf.int64, sentence: tf.string}>}

In [13]:
data.keys()

dict_keys(['test', 'train', 'validation'])

In [14]:
pp.print_info_data(data['train'])

# Structure of the data:

   <DatasetV1Adapter shapes: {idx: (), label: (), sentence: ()}, types: {idx: tf.int32, label: tf.int64, sentence: tf.string}>

# Output shape of one entry:
   {'idx': TensorShape([]), 'label': TensorShape([]), 'sentence': TensorShape([])}

# Output types of one entry:
   {'idx': tf.int32, 'label': tf.int64, 'sentence': tf.string}

# Output typesof one entry:
   {'idx': <class 'tensorflow.python.framework.ops.Tensor'>, 'label': <class 'tensorflow.python.framework.ops.Tensor'>, 'sentence': <class 'tensorflow.python.framework.ops.Tensor'>}
 

# Shape of the data:

   (67349,)
   ---> 67349 entries
   ---> 1 dim
        dict structure
           dim: 3
           [idx       / label     / sentence ]
           [()        / ()        / ()       ]
           [int32     / int64     / bytes    ]


# Examples of data:
{'idx': 16399,
 'label': 0,
 'sentence': b'for the uninitiated plays better on video with the sound '}
{'idx': 1680,
 'label': 0,
 'sentence': b'like a g

## Define parameters of the model

In [15]:
# models
#MODELS = [(TFBertModel,     BertTokenizer,       'bert-base-multilingual-uncased'),
#          (OpenAIGPTModel,  OpenAIGPTTokenizer,  'openai-gpt'),
#          (GPT2Model,       GPT2Tokenizer,       'gpt2'),
#          (CTRLModel,       CTRLTokenizer,       'ctrl'),
#          (TransfoXLModel,  TransfoXLTokenizer,  'transfo-xl-wt103'),
#          (XLNetModel,      XLNetTokenizer,      'xlnet-base-cased'),
#          (XLMModel,        XLMTokenizer,        'xlm-mlm-enfr-1024'),
#          (DistilBertModel, DistilBertTokenizer, 'distilbert-base-cased'),
#          (RobertaModel,    RobertaTokenizer,    'roberta-base'),
#          (XLMRobertaModel, XLMRobertaTokenizer, 'xlm-roberta-base'),
#         ]
MODELS = [(TFBertModel,         BertTokenizer,       'bert-base-multilingual-uncased'),
          (TFXLMRobertaModel,   XLMRobertaTokenizer, 'jplu/tf-xlm-roberta-base')]
model_index = 0 # BERT
model_class        = MODELS[model_index][0] # e.g. TFBertModel
tokenizer_class    = MODELS[model_index][1] # e.g. BertTokenizer
pretrained_weights = MODELS[model_index][2] # e.g. 'bert-base-multilingual-uncased'

# Maximum sequence length; be careful, BERT's maximum length is 512!
MAX_LENGTH = 128

# define parameters
BATCH_SIZE_TRAIN = 32
BATCH_SIZE_TEST = 32
BATCH_SIZE_VALID = 64
EPOCH = 2

# extract parameters
size_train_dataset  = info.splits['train'].num_examples
size_test_dataset   = info.splits['test'].num_examples
size_valid_dataset = info.splits['validation'].num_examples
number_label = info.features["label"].num_classes

# computed parameters
STEP_EPOCH_TRAIN = math.ceil(size_train_dataset/BATCH_SIZE_TRAIN)
STEP_EPOCH_TEST = math.ceil(size_test_dataset/BATCH_SIZE_TEST)
# use the validation split size here (not the test split size)
STEP_EPOCH_VALID = math.ceil(size_valid_dataset/BATCH_SIZE_VALID)


print('Dataset size:          {:6}/{:6}/{:6}'.format(size_train_dataset, size_test_dataset, size_valid_dataset))
print('Batch size:            {:6}/{:6}/{:6}'.format(BATCH_SIZE_TRAIN, BATCH_SIZE_TEST, BATCH_SIZE_VALID))
print('Step per epoch:        {:6}/{:6}/{:6}'.format(STEP_EPOCH_TRAIN, STEP_EPOCH_TEST, STEP_EPOCH_VALID))
print('Total number of batch: {:6}/{:6}/{:6}'.format(STEP_EPOCH_TRAIN*(EPOCH+1), STEP_EPOCH_TEST*(EPOCH+1), STEP_EPOCH_VALID*(EPOCH+1)))

Dataset size:           67349/  1821/   872
Batch size:                32/    32/    64
Step per epoch:          2105/    57/    14
Total number of batch:   6315/   171/    42


## Tokenize and prepare data for BERT

In [16]:
# Define the directory used to cache the pretrained model files
pretrained_model_dir=savemodel_dir+'/pretrained_model/'+pretrained_weights
os.makedirs(pretrained_model_dir, exist_ok=True)

In [17]:
# Load tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights, cache_dir=pretrained_model_dir)

In [18]:
# recap of input dataset
print(data['train'])
print(tf.data.experimental.cardinality(data['train']))
print(tf.data.experimental.cardinality(data['test']))
print(tf.data.experimental.cardinality(data['validation']))
# super slow since looping over all data
#print(len(list(data['train'])))

<DatasetV1Adapter shapes: {idx: (), label: (), sentence: ()}, types: {idx: tf.int32, label: tf.int64, sentence: tf.string}>
tf.Tensor(-2, shape=(), dtype=int64)
tf.Tensor(-2, shape=(), dtype=int64)
tf.Tensor(-2, shape=(), dtype=int64)
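
The value `-2` printed above is the constant `tf.data.experimental.UNKNOWN_CARDINALITY`: TensorFlow cannot tell how many elements the dataset holds without iterating over it. The sketch below reproduces this on a toy dataset (not the SST-2 data) and shows how `assert_cardinality`, which the notebook only uses for TensorFlow 2.2.0, attaches a known size. It assumes a TensorFlow version that provides `assert_cardinality`.

```python
import tensorflow as tf

# A range dataset has a statically known size.
toy = tf.data.Dataset.range(10)
known = tf.data.experimental.cardinality(toy).numpy()

# After a filter, the size can no longer be inferred statically.
filtered = toy.filter(lambda x: x % 2 == 0)
unknown = tf.data.experimental.cardinality(filtered).numpy()

# Declaring the size by hand makes it available again.
declared = filtered.apply(tf.data.experimental.assert_cardinality(5))
restored = tf.data.experimental.cardinality(declared).numpy()

print(known, unknown, restored)
print(unknown == tf.data.experimental.UNKNOWN_CARDINALITY)
```

Note that `assert_cardinality` does not count anything itself; if the declared size is wrong, an error is raised only when the dataset is iterated.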


In [19]:
# Prepare data for BERT
train_dataset = glue_convert_examples_to_features(data['train'], 
                                                  tokenizer, 
                                                  max_length=MAX_LENGTH, 
                                                  task='sst-2')
test_dataset = glue_convert_examples_to_features(data['test'], 
                                                  tokenizer, 
                                                  max_length=MAX_LENGTH, 
                                                  task='sst-2')
valid_dataset = glue_convert_examples_to_features(data['validation'], 
                                                  tokenizer, 
                                                  max_length=MAX_LENGTH, 
                                                  task='sst-2')

In [20]:
# adding the number of entries
if tf.version.VERSION[0:5]=='2.2.0':
    train_dataset=train_dataset.apply(tf.data.experimental.assert_cardinality(tf.data.experimental.cardinality(data['train'])))
    test_dataset=test_dataset.apply(tf.data.experimental.assert_cardinality(tf.data.experimental.cardinality(data['test'])))
    valid_dataset=valid_dataset.apply(tf.data.experimental.assert_cardinality(tf.data.experimental.cardinality(data['validation']))) 

In [21]:
# recap of the preprocessed dataset
print(train_dataset)
if tf.version.VERSION[0:5]=='2.2.0':
    print(tf.data.experimental.cardinality(train_dataset))
    print(tf.data.experimental.cardinality(test_dataset))
    print(tf.data.experimental.cardinality(valid_dataset))
    # super slow since looping over all data
    #print(len(list(train_dataset)))
else:
    print(size_train_dataset)
    print(size_test_dataset)
    print(size_valid_dataset)

<FlatMapDataset shapes: ({input_ids: (None,), attention_mask: (None,), token_type_ids: (None,)}, ()), types: ({input_ids: tf.int32, attention_mask: tf.int32, token_type_ids: tf.int32}, tf.int64)>
67349
1821
872


## Check the final data

In [23]:
pp.print_info_data(train_dataset,print_example=False)

# Structure of the data:

   <FlatMapDataset shapes: ({input_ids: (None,), attention_mask: (None,), token_type_ids: (None,)}, ()), types: ({input_ids: tf.int32, attention_mask: tf.int32, token_type_ids: tf.int32}, tf.int64)>

# Output shape of one entry:
   ({'input_ids': TensorShape([None]), 'attention_mask': TensorShape([None]), 'token_type_ids': TensorShape([None])}, TensorShape([]))

# Output types of one entry:
   ({'input_ids': tf.int32, 'attention_mask': tf.int32, 'token_type_ids': tf.int32}, tf.int64)

# Output typesof one entry:
   ({'input_ids': <class 'tensorflow.python.framework.ops.Tensor'>, 'attention_mask': <class 'tensorflow.python.framework.ops.Tensor'>, 'token_type_ids': <class 'tensorflow.python.framework.ops.Tensor'>}, <class 'tensorflow.python.framework.ops.Tensor'>)
 

# Shape of the data:

   (67349, 2)
   ---> 67349 batches
   ---> 2 dim
        label
           shape: ()
        dict structure
           dim: 3
           [input_ids       / attention_mask  / to

In [24]:
pp.print_detail_tokeniser(train_dataset, tokenizer)

 input_ids     ---->    attention_mask    token_type_ids    modified text                 

       101     ---->           1                 1          [ C L S ]                     
     10139     ---->           1                 1          f o r                         
     10103     ---->           1                 1          t h e                         
     18768     ---->           1                 1          u n i                         
     45611     ---->           1                 1          # # n i t i                   
     21096     ---->           1                 1          # # a t e d                   
     17173     ---->           1                 1          p l a y s                     
     16197     ---->           1                 1          b e t t e r                   
     10125     ---->           1                 1          o n                           
     11379     ---->           1                 1          v i d e o                    

## Save data as TFRecord files

In [25]:
# Create directory to save TFRecord files
tfrecord_data_dir=data_dir+'/tfrecord/sst2'
os.makedirs(tfrecord_data_dir, exist_ok=True)

In [26]:
pp.write_tf_data_into_tfrecord(train_dataset,tfrecord_data_dir+'/train_dataset')

In [27]:
pp.write_tf_data_into_tfrecord(test_dataset,tfrecord_data_dir+'/test_dataset')

In [28]:
pp.write_tf_data_into_tfrecord(valid_dataset,tfrecord_data_dir+'/valid_dataset')
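
The source of `pp.write_tf_data_into_tfrecord` is not shown in this notebook. The following is a minimal sketch of how such a helper could work, assuming each record stores `input_ids`, `attention_mask`, `token_type_ids`, and `label` as int64 lists. The function names, the toy examples, and the temporary path are hypothetical and are not the project's actual implementation.

```python
import os
import tempfile

import tensorflow as tf


def _int64_feature(values):
    """Wrap a sequence of ints as a tf.train.Feature."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))


def serialize_example(features, label):
    """Serialize one (features, label) pair into a tf.train.Example string."""
    feature = {
        'input_ids': _int64_feature(features['input_ids']),
        'attention_mask': _int64_feature(features['attention_mask']),
        'token_type_ids': _int64_feature(features['token_type_ids']),
        'label': _int64_feature([label]),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


def write_examples(examples, path):
    """Write an iterable of (features, label) pairs to one TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for features, label in examples:
            writer.write(serialize_example(features, label))


# Two toy examples with a maximum length of 4 (illustration only).
toy_examples = [
    ({'input_ids': [101, 10139, 102, 0],
      'attention_mask': [1, 1, 1, 0],
      'token_type_ids': [0, 0, 0, 0]}, 0),
    ({'input_ids': [101, 10103, 102, 0],
      'attention_mask': [1, 1, 1, 0],
      'token_type_ids': [0, 0, 0, 0]}, 1),
]
toy_path = os.path.join(tempfile.mkdtemp(), 'toy_dataset.tfrecord')
write_examples(toy_examples, toy_path)

# Count the records that were written.
n_records = sum(1 for _ in tf.data.TFRecordDataset(toy_path))
print(n_records)
```

To write a `tf.data.Dataset` such as `train_dataset` with this sketch, the tensors of each element would first need to be converted with `.numpy()`.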

## Read data from TFRecord files (sanity check)

In [44]:
# TFRecords encode and store data
train_files = tf.data.TFRecordDataset(tfrecord_data_dir+'/train_dataset.tfrecord')
test_files = tf.data.TFRecordDataset(tfrecord_data_dir+'/test_dataset.tfrecord')
valid_files = tf.data.TFRecordDataset(tfrecord_data_dir+'/valid_dataset.tfrecord')

In [54]:
train_dataset2 = train_files.map(pp.parse_tfrecord_glue_files)
test_dataset2 = test_files.map(pp.parse_tfrecord_glue_files)
valid_dataset2 = valid_files.map(pp.parse_tfrecord_glue_files)
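
`pp.parse_tfrecord_glue_files` is likewise not shown here. A parsing function of this kind could look like the sketch below, assuming fixed-length int64 features with the same four keys as the tokenized dataset. The feature specification, the toy length of 4, and the function name `parse_example` are assumptions for illustration; the project's helper may differ (for instance, it may use variable-length features).

```python
import tensorflow as tf

TOY_MAX_LENGTH = 4  # the notebook uses MAX_LENGTH = 128

# Assumed layout of one serialized record.
feature_spec = {
    'input_ids': tf.io.FixedLenFeature([TOY_MAX_LENGTH], tf.int64),
    'attention_mask': tf.io.FixedLenFeature([TOY_MAX_LENGTH], tf.int64),
    'token_type_ids': tf.io.FixedLenFeature([TOY_MAX_LENGTH], tf.int64),
    'label': tf.io.FixedLenFeature([], tf.int64),
}


def parse_example(serialized):
    """Decode one record into the (features, label) structure used above."""
    parsed = tf.io.parse_single_example(serialized, feature_spec)
    features = {
        name: tf.cast(parsed[name], tf.int32)
        for name in ('input_ids', 'attention_mask', 'token_type_ids')
    }
    return features, parsed['label']


def _int64_feature(values):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=values))


# Build one serialized record in memory so the sketch needs no file.
record = tf.train.Example(features=tf.train.Features(feature={
    'input_ids': _int64_feature([101, 10139, 102, 0]),
    'attention_mask': _int64_feature([1, 1, 1, 0]),
    'token_type_ids': _int64_feature([0, 0, 0, 0]),
    'label': _int64_feature([0]),
})).SerializeToString()

parsed_features, parsed_label = parse_example(record)
print(parsed_features['input_ids'].numpy(), parsed_label.numpy())
```

TFRecord files store integers as int64, so the cast back to `tf.int32` is what restores the dtypes reported for `train_dataset`.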

In [31]:
# adding the number of entries
if tf.version.VERSION[0:5]=='2.2.0':
    train_dataset2=train_dataset2.apply(tf.data.experimental.assert_cardinality(train_dataset2.reduce(0, lambda x, _: x + 1).numpy()))
    test_dataset2=test_dataset2.apply(tf.data.experimental.assert_cardinality(test_dataset2.reduce(0, lambda x, _: x + 1).numpy()))
    valid_dataset2=valid_dataset2.apply(tf.data.experimental.assert_cardinality(valid_dataset2.reduce(0, lambda x, _: x + 1).numpy()))

In [32]:
if tf.version.VERSION[0:5]=='2.2.0':
    print(tf.data.experimental.cardinality(train_dataset2))
    print(tf.data.experimental.cardinality(test_dataset2))
    print(tf.data.experimental.cardinality(valid_dataset2))
else:
    print(train_dataset2.reduce(0, lambda x, _: x + 1).numpy())
    print(test_dataset2.reduce(0, lambda x, _: x + 1).numpy())
    print(valid_dataset2.reduce(0, lambda x, _: x + 1).numpy())

67349
1821
872


In [55]:
pp.print_info_data(train_dataset2,print_example=False)

# Structure of the data:

   <MapDataset shapes: ({input_ids: (None,), attention_mask: (None,), token_type_ids: (None,)}, ()), types: ({input_ids: tf.int32, attention_mask: tf.int32, token_type_ids: tf.int32}, tf.int64)>

# Output shape of one entry:
   ({'input_ids': TensorShape([None]), 'attention_mask': TensorShape([None]), 'token_type_ids': TensorShape([None])}, TensorShape([]))

# Output types of one entry:
   ({'input_ids': tf.int32, 'attention_mask': tf.int32, 'token_type_ids': tf.int32}, tf.int64)

# Output typesof one entry:
   ({'input_ids': <class 'tensorflow.python.framework.ops.Tensor'>, 'attention_mask': <class 'tensorflow.python.framework.ops.Tensor'>, 'token_type_ids': <class 'tensorflow.python.framework.ops.Tensor'>}, <class 'tensorflow.python.framework.ops.Tensor'>)
 

# Shape of the data:

   (67349, 2)
   ---> 67349 batches
   ---> 2 dim
        label
           shape: ()
        dict structure
           dim: 3
           [input_ids       / attention_mask  / token_

In [34]:
pp.print_detail_tokeniser(train_dataset2, tokenizer)

 input_ids     ---->    attention_mask    token_type_ids    modified text                 

       101     ---->           1                 1          [ C L S ]                     
     10139     ---->           1                 1          f o r                         
     10103     ---->           1                 1          t h e                         
     18768     ---->           1                 1          u n i                         
     45611     ---->           1                 1          # # n i t i                   
     21096     ---->           1                 1          # # a t e d                   
     17173     ---->           1                 1          p l a y s                     
     16197     ---->           1                 1          b e t t e r                   
     10125     ---->           1                 1          o n                           
     11379     ---->           1                 1          v i d e o                    

In [36]:
train_dataset

<FlatMapDataset shapes: ({input_ids: (None,), attention_mask: (None,), token_type_ids: (None,)}, ()), types: ({input_ids: tf.int32, attention_mask: tf.int32, token_type_ids: tf.int32}, tf.int64)>

In [35]:
train_dataset2

<MapDataset shapes: ({input_ids: <unknown>, attention_mask: <unknown>, token_type_ids: <unknown>}, ()), types: ({input_ids: tf.int32, attention_mask: tf.int32, token_type_ids: tf.int32}, tf.int64)>

In [38]:
for i in train_dataset2:
    print(i)
    break

({'input_ids': <tf.Tensor: shape=(128,), dtype=int32, numpy=
array([  101, 10139, 10103, 18768, 45611, 21096, 17173, 16197, 10125,
       11379, 10171, 10103, 14127,   102,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,  

In [37]:
for i in train_dataset:
    print(i)
    break

({'input_ids': <tf.Tensor: shape=(128,), dtype=int32, numpy=
array([  101, 10139, 10103, 18768, 45611, 21096, 17173, 16197, 10125,
       11379, 10171, 10103, 14127,   102,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,  
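
Both printed tensors start with the same 14 token ids followed by zeros, which is the padding up to `MAX_LENGTH = 128`. The pure-Python sketch below illustrates how such a padded `input_ids` vector and the matching `attention_mask` are conventionally built. The helper `pad_encoding` is hypothetical and is not the code used by `glue_convert_examples_to_features`; the token ids are copied from the output above.

```python
def pad_encoding(token_ids, max_length, pad_id=0):
    """Right-pad token ids and build the matching attention mask."""
    if len(token_ids) > max_length:
        raise ValueError('sequence longer than max_length')
    n_pad = max_length - len(token_ids)
    input_ids = list(token_ids) + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask


# Token ids of the first training example, from [CLS] (101) to [SEP] (102).
first_ids = [101, 10139, 10103, 18768, 45611, 21096, 17173,
             16197, 10125, 11379, 10171, 10103, 14127, 102]

input_ids, attention_mask = pad_encoding(first_ids, max_length=128)
print(len(input_ids), sum(attention_mask))
```

Comparing `sum(attention_mask)` across examples is a quick way to check how much of the 128-token budget the sentences actually use.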