# The IMDb Dataset
The IMDb dataset consists of movie reviews and human annotations of their sentiment. The task is to predict the sentiment of a given review. We use the two-way (positive/negative) class split and only the review-level labels.

In [28]:
from IPython.display import display, Markdown
with open('../../doc/env_variables_setup.md', 'r') as fh:
    content = fh.read()
display(Markdown(content))

Environment variables that need to be defined:   
`export DIR_PROJ=your_path_git_repository`  
`export PYTHONPATH=$DIR_PROJ/src`  
`export PATH_TENSORBOARD=your_path_tensorboard`  
`export PATH_DATASETS=your_path_datasets`  
`export PROJECT_ID=your_gcp_project_id`  
`export BUCKET_NAME=your_gcp_gs_bucket_name`  
`export REGION=your_region`  
`export PATH_SAVE_MODEL=your_path_to_save_model` 

- Use local Jupyter Lab 
    - you need to have the `jupyter-notebook` Anaconda python environment created [link](local_jupyter_lab_installation.md) 
    - you need to have the `jupyter-notebook` Anaconda python environment activated [link](local_jupyter_lab_installation.md) 
    - then define the environment variables above (copy and paste) 
    - you need to have the `env_multilingual_class` Anaconda python environment created [link](local_jupyter_lab_installation.md)  
    - start Jupyter Lab:  `jupyter lab` 
    - open a Jupyter Lab notebook from `notebook/` 
    - choose the proper Anaconda python environment:  `Python [conda env:env_multilingual_class]` [link](conda_env.md) 
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`


- Use GCP Jupyter Lab 
    - Go to GCP
    - open a Cloud Shell
    - `ssh-keygen -t rsa -b 4096 -C firstName_lastName`
    - `cp .ssh/id_rsa.pub .`
    - use Cloud Editor to edit this file `id_rsa.pub` and copy the full content
    - Go to Compute Engine -> Metadata
    - Click SSH Keys
    - Click Edit
    - Click + Add item, copy the content of `id_rsa.pub`
    - You should see firstName_lastName on the left
    - Click Save
    - you need to start an AI Platform instance 
    - open a Jupyter Lab terminal and go to `/home/gcp_user_name/`
    - clone this repository: `git clone https://github.com/tarrade/proj_multilingual_text_classification.git`
    - then `cd proj_multilingual_text_classification/`
    - create the Anaconda Python environment `conda env create -f env/environment.yml`
    - create a file `config.sh` in `/home` with the following information: 
    ```
    #!/bin/bash
    
    echo "applying some configuration ..."
    git config --global user.email user_email
    git config --global user.name user_name
    git config --global credential.helper store
        
    # Add the environment variables listed above here
    # [EDIT ME]
    export DIR_PROJ=your_path_git_repository
    export PYTHONPATH=$DIR_PROJ/src
  
    cd /home/gcp_user_name/
    
    conda activate env_multilingual_class

    export PS1='\[\e[91m\]\u@:\[\e[32m\]\w\[\e[0m\]$'
    ```
    - Go to AI Platform Notebook, select your instance and click "Reset".
    - Wait and refresh your web browser with the Notebook


## Import Packages

In [29]:
import tensorflow as tf
import tensorflow_datasets

from tensorflow.keras.utils import to_categorical

from transformers import (
    BertConfig,
    BertTokenizer,
    TFBertModel,
    TFBertForSequenceClassification,
    glue_convert_examples_to_features,
    glue_processors
)

from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics import classification_report

import matplotlib.pyplot as plt

import math
import numpy as np
import os
import time
from datetime import datetime, timedelta
import shutil
import pickle

# new
import re
# use the Keras bundled with TensorFlow to avoid mixing it with standalone Keras
from tensorflow.keras.models import Sequential, load_model

## Check configuration

In [30]:
print(tf.version.GIT_VERSION, tf.version.VERSION)

unknown 2.1.0


In [31]:
print(tf.keras.__version__)

2.2.4-tf


In [32]:
gpus = tf.config.list_physical_devices('GPU')
if len(gpus)>0:
    for gpu in gpus:
        print('Name:', gpu.name, '  Type:', gpu.device_type)
else:
    print('No GPU available !!!!')

No GPU available !!!!


## Define Paths

In [33]:
# note: these need to be specified in the config.sh file
try:
    data_dir=os.environ['PATH_DATASETS']
except KeyError:
    print('missing PATH_DATASETS')
try:   
    tensorboard_dir=os.environ['PATH_TENSORBOARD']
except KeyError:
    print('missing PATH_TENSORBOARD')
try:   
    savemodel_dir=os.environ['PATH_SAVE_MODEL']
except KeyError:
    print('missing PATH_SAVE_MODEL')
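
The cell above prints a message when a variable is missing but leaves the corresponding Python name undefined, so the error only surfaces later. A minimal sketch of an alternative that reports all missing variables at once is shown below. The helper name `read_required_env` and the example paths are hypothetical and not part of the project code.

```python
import os

REQUIRED_VARS = ("PATH_DATASETS", "PATH_TENSORBOARD", "PATH_SAVE_MODEL")


def read_required_env(names, environ=None):
    """Return {name: value} for all names, or raise listing every missing one."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if name not in environ]
    if missing:
        raise EnvironmentError(
            "missing environment variables: " + ", ".join(missing)
        )
    return {name: environ[name] for name in names}


# Example with an explicit mapping instead of the real environment
# (the paths are placeholders chosen for this illustration).
example_env = {
    "PATH_DATASETS": "/tmp/datasets",
    "PATH_TENSORBOARD": "/tmp/tensorboard",
    "PATH_SAVE_MODEL": "/tmp/models",
}
paths = read_required_env(REQUIRED_VARS, example_env)
print(paths["PATH_DATASETS"])
```

In the notebook, calling `read_required_env(REQUIRED_VARS)` without a mapping would read from `os.environ` and stop early if `config.sh` has not been sourced.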

## Import local packages

In [34]:
import preprocessing.preprocessing as pp
import utils.model_metrics as mm

In [35]:
import importlib
importlib.reload(pp);
importlib.reload(mm);

## Loading data from TensorFlow Datasets

In [36]:
data, info = tensorflow_datasets.load(name="imdb_reviews",
                            data_dir=data_dir,
                            as_supervised=True,
                            with_info=True)

INFO:absl:No config specified, defaulting to first: imdb_reviews/plain_text
INFO:absl:Overwrite dataset info from restored data version.
INFO:absl:Reusing dataset imdb_reviews (/home/vera_luechinger/data/imdb_reviews/plain_text/1.0.0)
INFO:absl:Constructing tf.data.Dataset for split None, from /home/vera_luechinger/data/imdb_reviews/plain_text/1.0.0


In [37]:
# IMDb specific:
data_valid = data['test'].take(1000)

# trying to create a true validation data set for after the computation
#data_valid_ext = data['test'].take(2000)
#data_valid = data_valid_ext.take(1000)
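
The commented-out lines take 2000 test examples and then the first 1000 of those, which, assuming a deterministic iteration order, would give back the same 1000 examples as `data_valid`. To obtain two disjoint subsets, the second one has to start where the first one ends. The sketch below illustrates the idea with plain Python lists; with `tf.data` the analogous calls would be `take` and `skip`. The helper names and the stand-in data are made up for this illustration.

```python
from itertools import islice


def take(items, n):
    """Return the first n items."""
    return list(islice(items, n))


def skip(items, n):
    """Return everything after the first n items."""
    return list(islice(items, n, None))


# Stand-in for the test split (hypothetical data, 10 items instead of 25000).
test_examples = ["review_{}".format(i) for i in range(10)]

valid_part = take(test_examples, 4)              # first 4 items
holdout_part = take(skip(test_examples, 4), 4)   # next 4 items, no overlap

print(valid_part)
print(holdout_part)
```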

### Checking basic info from the metadata

In [38]:
info

tfds.core.DatasetInfo(
    name='imdb_reviews',
    version=1.0.0,
    description='Large Movie Review Dataset.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.',
    homepage='http://ai.stanford.edu/~amaas/data/sentiment/',
    features=FeaturesDict({
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=2),
        'text': Text(shape=(), dtype=tf.string),
    }),
    total_num_examples=100000,
    splits={
        'test': 25000,
        'train': 25000,
        'unsupervised': 50000,
    },
    supervised_keys=('text', 'label'),
    citation="""@InProceedings{maas-EtAl:2011:ACL-HLT2011,
      author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
      title     = {Learning Word

In [39]:
pp.print_info_dataset(info)

Labels:
      ['neg', 'pos']

Number of label:
      2

Structure of the data:
      dict_keys(['text', 'label'])

Number of entries:
   Train dataset: 25000
   Test dataset:  25000
--> validation dataset not defined


### Checking basic info from the data

In [40]:
data

{'test': <DatasetV1Adapter shapes: ((), ()), types: (tf.string, tf.int64)>,
 'train': <DatasetV1Adapter shapes: ((), ()), types: (tf.string, tf.int64)>,
 'unsupervised': <DatasetV1Adapter shapes: ((), ()), types: (tf.string, tf.int64)>}

In [41]:
data.keys()

dict_keys(['test', 'train', 'unsupervised'])

In [42]:
# only works for glue-compatible datasets
try:
    pp.print_info_data(data['train'])
except AttributeError:
    print('data format incompatible')


# Structure of the data:

   <DatasetV1Adapter shapes: ((), ()), types: (tf.string, tf.int64)>

# Output shape of one entry:
   (TensorShape([]), TensorShape([]))

# Output types of one entry:
   (tf.string, tf.int64)

# Output typesof one entry:
   (<class 'tensorflow.python.framework.ops.Tensor'>, <class 'tensorflow.python.framework.ops.Tensor'>)
 

# Shape of the data:

   (25000, 2)
   ---> 25000 entries
   ---> 2 dim
           [text            / label           ]
           [()              / ()              ]
           [|S709           / |S1             ]


# Examples of data:
array([b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revo

## Define parameters of the model

In [43]:
# changes: had to eliminate all lines concerning a test data set because we only have train and valid


# define parameters
#BATCH_SIZE_TRAIN = 32
#BATCH_SIZE_TEST = 32
#BATCH_SIZE_VALID = 64
#EPOCH = 2
#TOKENIZER = 'bert-base-multilingual-uncased'
#MAX_LENGTH = 512

# extract parameters
size_train_dataset = info.splits['train'].num_examples

# the size for the validation data set has been manually computed according to the function 
# pp.print_info_data because the test set has been manually split above
size_valid_dataset = np.shape(np.array(list(data_valid.as_numpy_iterator())))[0]
number_label = info.features["label"].num_classes

# computed parameters
#STEP_EPOCH_TRAIN = math.ceil(size_train_dataset/BATCH_SIZE_TRAIN)
#STEP_EPOCH_VALID = math.ceil(size_valid_dataset/BATCH_SIZE_VALID)


#print('Dataset size:          {:6}/{:6}'.format(size_train_dataset, size_valid_dataset))
#print('Batch size:            {:6}/{:6}'.format(BATCH_SIZE_TRAIN, BATCH_SIZE_VALID))
#print('Step per epoch:        {:6}/{:6}'.format(STEP_EPOCH_TRAIN, STEP_EPOCH_VALID))
#print('Total number of batch: {:6}/{:6}'.format(STEP_EPOCH_TRAIN*(EPOCH+1), STEP_EPOCH_VALID*(EPOCH+1)))
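
The commented-out step computation rounds up so that the last, partially filled batch is still counted. A small worked example, using the dataset sizes from this notebook (25000 training reviews, 1000 validation reviews) and the batch sizes from the commented-out parameters (32 and 64); the function name `steps_per_epoch` is an assumption for this sketch.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


train_steps = steps_per_epoch(25000, 32)  # 25000 / 32 = 781.25 -> 782
valid_steps = steps_per_epoch(1000, 64)   # 1000 / 64 = 15.625 -> 16

print("Step per epoch: {:6}/{:6}".format(train_steps, valid_steps))
```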

### Additional steps for the IMDb dataset specifically

#### Cleaning

In [44]:
def preprocess_reviews(reviews):
    #REPLACE_NO_SPACE = re.compile("[.;:!\'?,\"()\[\]]")
    # raw string so that \s, \- and \/ reach the regex engine unchanged
    REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")
    
    #ae, oe, ue => only for GERMAN data
    #REPLACE_UMLAUT_AE = re.compile("(ae)")
    #REPLACE_UMLAUT_OE = re.compile("(oe)")
    #REPLACE_UMLAUT_UE = re.compile("(ue)")
    
    #reviews = [REPLACE_NO_SPACE.sub("", line[0].decode("utf-8").lower()) for line in np.array(list(reviews.as_numpy_iterator()))]
    # each dataset element is a (text, label) pair; only the text is cleaned
    reviews = [REPLACE_WITH_SPACE.sub(" ", text.decode("utf-8")) for text, _ in reviews.as_numpy_iterator()]
    #reviews = [REPLACE_UMLAUT_AE.sub("ä", line[0]) for line in reviews]
    #reviews = [REPLACE_UMLAUT_OE.sub("ö", line[0]) for line in reviews]
    #reviews = [REPLACE_UMLAUT_UE.sub("ü", line[0]) for line in reviews]
    
    return reviews
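
To see what the cleaning pattern does, it can be applied to a plain string outside of the dataset pipeline. The sketch below reuses the same regular expression; the helper name `clean_text` and the sample sentences are made up for this illustration.

```python
import re

# Same pattern as in preprocess_reviews, written as a raw string.
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")


def clean_text(text):
    """Replace double <br /> tags, hyphens and slashes with one space each."""
    return REPLACE_WITH_SPACE.sub(" ", text)


sample = "Great movie!<br /><br />A must-see and/or rent."  # made-up review
cleaned = clean_text(sample)
print(cleaned)

# A single <br /> is not matched as a tag; only its slash is replaced.
single_break = clean_text("one<br />two")
print(single_break)
```

The first call prints `Great movie! A must see and or rent.`. The second call shows a limitation of the pattern: a lone `<br />` is left in the text with its slash turned into a space.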

In [45]:
reviews_train_clean = preprocess_reviews(data['train'])
reviews_valid_clean = preprocess_reviews(data_valid)

In [46]:
# calculate the number of characters
x = []
for i in reviews_valid_clean:
    x.append(len(i))
    
sum(x)

1221777

In [47]:
# divide into two batches
batch_1 = reviews_valid_clean[:500]
batch_2 = reviews_valid_clean[500:]

## Translating the Validation Dataset

In [54]:
# step 1: save the data in a supported format (.txt, .tsv or .html)
# (to try this on 3 examples first, iterate over reviews_valid_clean[:3] instead of batch_2)
with open('en_batch_2.txt', 'w', encoding='utf-8') as f:
    for item in batch_2:
        f.write("%s\n\n\n" % item)
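
Each review is written followed by three newline characters, so the file contains two blank lines between consecutive reviews. The sketch below shows one way to write such a file and read it back into a list, under the assumption that a review itself contains no newline. The helper names and sample texts are hypothetical.

```python
import os
import tempfile


def write_reviews(path, reviews):
    """Write one review per line, followed by two blank lines."""
    with open(path, "w", encoding="utf-8") as f:
        for item in reviews:
            f.write("%s\n\n\n" % item)


def read_reviews(path):
    """Read the reviews back, skipping the blank separator lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]


sample = ["First review.", "Second review."]  # made-up texts
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "en_sample.txt")
    write_reviews(path, sample)
    restored = read_reviews(path)

print(restored)
```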


In [21]:
# step 2: upload to storage bucket 1 (os.environ['BUCKET_NAME'])
# gsutil cp /home/vera_luechinger/proj_multilingual_text_classification/notebook/00-Test/en_batch_2.txt gs://$BUCKET_NAME/
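
The upload command in the comment depends on the bucket name stored in the `BUCKET_NAME` environment variable. A minimal sketch of assembling that command from Python is given below. The function name `build_upload_command` and the fallback bucket name are assumptions for this illustration, and the command is only printed, not executed.

```python
import os
import shlex


def build_upload_command(local_path, bucket_name):
    """Return the gsutil argument list that copies local_path into the bucket."""
    return ["gsutil", "cp", local_path, "gs://{}/".format(bucket_name)]


# Fall back to a placeholder when BUCKET_NAME is not set (illustration only).
bucket_name = os.environ.get("BUCKET_NAME", "my-example-bucket")
command = build_upload_command("en_batch_2.txt", bucket_name)
print(" ".join(shlex.quote(part) for part in command))
```

Passing the list to `subprocess.run(command, check=True)` would run the copy, provided `gsutil` is installed and authenticated.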

In [50]:

# step 3: translate in storage and write the result to bucket 2 (os.environ['BUCKET_NAME_TRANSLATION'], which must be empty before the translation starts)


# batch translation

from google.cloud import translate
import time


In [55]:
def batch_translate_text(
    input_uri="gs://"+os.environ['BUCKET_NAME']+"/en_batch_2.txt",
    output_uri="gs://"+os.environ['BUCKET_NAME_TRANSLATION']+"/",
    project_id=os.environ['PROJECT_ID']
):
    """Translates a batch of texts on GCS and stores the result in a GCS location."""

    client = translate.TranslationServiceClient()


    location = "us-central1"
    # Supported file types: https://cloud.google.com/translate/docs/supported-formats
    gcs_source = {"input_uri": input_uri}

    input_configs_element = {
        "gcs_source": gcs_source,
        "mime_type": "text/plain"  # Can be "text/plain" or "text/html".
    }
    gcs_destination = {"output_uri_prefix": output_uri}
    output_config = {"gcs_destination": gcs_destination}
    parent = client.location_path(project_id, location)

    # Supported language codes: https://cloud.google.com/translate/docs/language
    start_time = time.time()

    operation = client.batch_translate_text(
        parent=parent,
        source_language_code="en",
        target_language_codes=["fr","de"],  # Up to 10 language codes here.
        input_configs=[input_configs_element],
        output_config=output_config)

    print(u"Waiting for operation to complete...")
    response = operation.result(180)
    elapsed_time_secs = time.time() - start_time
    print(u"Execution Time: {}".format(elapsed_time_secs))
    print(u"Total Characters: {}".format(response.total_characters))
    print(u"Translated Characters: {}".format(response.translated_characters))
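
Note that the default arguments of `batch_translate_text` are evaluated once, when the function is defined, so a missing `BUCKET_NAME` or `BUCKET_NAME_TRANSLATION` raises a `KeyError` at definition time and later changes to the environment are ignored. A sketch of resolving the URIs at call time instead is shown below; the helper name `resolve_translation_uris` and the example bucket names are hypothetical.

```python
import os


def resolve_translation_uris(input_uri=None, output_uri=None, environ=None):
    """Fill in default GCS URIs from the environment when none are given."""
    environ = os.environ if environ is None else environ
    if input_uri is None:
        input_uri = "gs://" + environ["BUCKET_NAME"] + "/en_batch_2.txt"
    if output_uri is None:
        output_uri = "gs://" + environ["BUCKET_NAME_TRANSLATION"] + "/"
    return input_uri, output_uri


# Example with an explicit mapping (placeholder bucket names).
example_env = {
    "BUCKET_NAME": "demo-bucket",
    "BUCKET_NAME_TRANSLATION": "demo-bucket_translation",
}
input_uri, output_uri = resolve_translation_uris(environ=example_env)
print(input_uri)
print(output_uri)
```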

In [57]:
batch_translate_text()

Waiting for operation to complete...
Execution Time: 136.1635901927948
Total Characters: 1238630
Translated Characters: 1238630


In [21]:
# step 4: save files in the first bucket
#gsutil cp gs://${BUCKET_NAME}_translation/${BUCKET_NAME}_en_batch_2_fr_translations.txt gs://${BUCKET_NAME}/batch_2/


In [60]:
de_1_dir = "gs://"+os.environ['BUCKET_NAME']+"/batch_1/"+os.environ['BUCKET_NAME']+"_en_batch_1_de_translations.txt"

In [66]:
from google.cloud import storage
#from config import bucketName, localFolder, bucketFolder

storage_client = storage.Client()
bucket = storage_client.get_bucket(os.environ['BUCKET_NAME'])
#bucket

In [69]:
def download_file(bucketName, file, localFolder):
    """Download the blob `file` from the GCP bucket `bucketName` into `localFolder`."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucketName)
    blob = bucket.blob(file)
    # keep only the file name, dropping any folder prefix of the blob
    fileName = blob.name.split('/')[-1]
    # os.path.join works whether or not localFolder ends with a separator
    blob.download_to_filename(os.path.join(localFolder, fileName))
    return f'{fileName} downloaded from bucket.'

In [1]:
# drop this before pushing

download_file(os.environ['BUCKET_NAME'], "batch_1/"+os.environ['BUCKET_NAME']+"_en_batch_1_fr_translations.txt", "/home/vera_luechinger/data/imdb_reviews/")
download_file(os.environ['BUCKET_NAME'], "batch_1/"+os.environ['BUCKET_NAME']+"_en_batch_1_de_translations.txt", "/home/vera_luechinger/data/imdb_reviews/")
download_file(os.environ['BUCKET_NAME'], "batch_2/"+os.environ['BUCKET_NAME']+"_en_batch_2_fr_translations.txt", "/home/vera_luechinger/data/imdb_reviews/")
download_file(os.environ['BUCKET_NAME'], "batch_2/"+os.environ['BUCKET_NAME']+"_en_batch_2_de_translations.txt", "/home/vera_luechinger/data/imdb_reviews/")
print("")

NameError: name 'download_file' is not defined

In [89]:
# step 5: get translated files from storage to use in notebook
with open("/home/vera_luechinger/data/imdb_reviews/"+os.environ['BUCKET_NAME']+"_en_batch_1_de_translations.txt", 'r') as file:
    de_1 = file.readlines()
    
with open("/home/vera_luechinger/data/imdb_reviews/"+os.environ['BUCKET_NAME']+"_en_batch_2_de_translations.txt", 'r') as file:
    de_2 = file.readlines()

with open("/home/vera_luechinger/data/imdb_reviews/"+os.environ['BUCKET_NAME']+"_en_batch_1_fr_translations.txt", 'r') as file:
    fr_1 = file.readlines()

with open("/home/vera_luechinger/data/imdb_reviews/"+os.environ['BUCKET_NAME']+"_en_batch_2_fr_translations.txt", 'r') as file:
    fr_2 = file.readlines()

de = de_1 + de_2
fr = fr_1 + fr_2
de = [item.replace("\n","") for item in de]
fr = [item.replace("\n","") for item in fr]
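
The four `open` blocks above differ only in the batch number and the language code. A minimal sketch of a helper that builds the file names and concatenates the batches is given below. The function names, the `demo-bucket` name and the temporary demo files are assumptions for this illustration; the file name pattern follows the one used in the cell above.

```python
import os
import tempfile


def translation_file_name(bucket_name, batch, lang):
    """File name pattern used for the downloaded translation files."""
    return "{}_en_batch_{}_{}_translations.txt".format(bucket_name, batch, lang)


def load_translations(folder, bucket_name, lang, batches=(1, 2)):
    """Read all batches for one language and strip the newline characters."""
    lines = []
    for batch in batches:
        path = os.path.join(folder, translation_file_name(bucket_name, batch, lang))
        with open(path, "r", encoding="utf-8") as f:
            lines.extend(line.replace("\n", "") for line in f)
    return lines


# Demo with two small temporary files standing in for the real downloads.
with tempfile.TemporaryDirectory() as tmp_dir:
    for batch, text in ((1, "eins\nzwei\n"), (2, "drei\n")):
        name = translation_file_name("demo-bucket", batch, "de")
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write(text)
    de_demo = load_translations(tmp_dir, "demo-bucket", "de")

print(de_demo)
```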

In [94]:

len(de)

1000