# mT5 Transformer fine-tuned on the XNLI dataset for Language Prediction

- XNLI is an evaluation corpus for language transfer and cross-lingual sentence classification in 15 languages.
- mT5 is pretrained on the mC4 corpus, which covers 101 languages.
- mT5 was pretrained on mC4 only, without any supervised training. It therefore has to be fine-tuned before it is usable on a downstream task such as language classification.
- The `data` folder contains `xnli.test.tsv` and `xnli.dev.tsv`. These files can also be downloaded from [here.](https://cims.nyu.edu/~sbowman/xnli/)
- The `output` folder contains the outputs produced by the trained model. I trained the model on a TPU; a GPU can also be used for fine-tuning, but it is slower.
- The trained model can be found at this [drive link.](https://drive.google.com/drive/folders/1VBtbUt66v_gAwM6o0vPVUgqgeGrKbkMk?usp=sharing)

### Steps:
1. Run the cells one by one.
2. If using a TPU, set the `ON_TPU` flag to `True`.
3. Also, if using a TPU, set `MAIN_DIR` and `PATH_TO_DATA` to your GCS bucket.
4. If you are not training the model again, download the model from the [drive link](https://drive.google.com/drive/folders/1VBtbUt66v_gAwM6o0vPVUgqgeGrKbkMk?usp=sharing) and run the last cell for the predictions.
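
The notebook builds its paths by plain string concatenation (for example `MAIN_DIR+"models/"`), so a bucket path set in step 3 most likely needs a trailing slash. The sketch below illustrates that behaviour; the `resolve` helper and the bucket name `gs://my-bucket/` are hypothetical and not part of the notebook.

```python
def resolve(main_dir, relative_path):
    """Prefix `relative_path` with `main_dir`, or keep it local when `main_dir` is empty."""
    return main_dir + relative_path if main_dir else relative_path


# Local run: MAIN_DIR is left empty, so paths stay relative.
local_models = resolve("", "models/")

# TPU run: hypothetical bucket name, note the trailing slash.
bucket_models = resolve("gs://my-bucket/", "models/")

print(local_models)
print(bucket_models)
```

Without the trailing slash the two parts would be glued together into a single, unintended directory name.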

In [None]:
!pip install t5

#### Set the `ON_TPU` flag to `True` if using a TPU

In [None]:
import functools
import tensorflow.compat.v1 as tf
import pandas as pd
import os
import t5
import t5.models
from t5.models import MtfModel
import seqio

# Run TensorFlow in TF1 compatibility mode.
tf.disable_v2_behavior()

ON_TPU = False      # Set to True if a TPU is available
tf.test.gpu_device_name()  # Name of the GPU device, or an empty string if none is found

#### Set `PATH_TO_DATA` to your GCS bucket (if using a TPU) or to the local path where the XNLI data is stored.

In [None]:
# Read the data into pandas DataFrames.
# The XNLI test split serves as training data and the dev split as test data.

PATH_TO_DATA = "data/"
train_df = pd.read_csv(PATH_TO_DATA+"xnli.test.tsv", sep="\t")
test_df = pd.read_csv(PATH_TO_DATA+"xnli.dev.tsv",sep="\t")
print(train_df.shape)
print(test_df.shape)

In [None]:
def create_data(old,new):

    """
    Create a new headerless CSV file by stacking the `sentence1` and `sentence2` fields into one `input` column
    :param old: Path to the source TSV file
    :param new: Path of the CSV file to create
    :return: None; the new CSV file is written to `new`
    """
    
    df = pd.read_csv(old,sep='\t')
    df = df[['language','sentence1','sentence2']]
    sent1 = df[['language','sentence1']].rename(columns={"sentence1":"input"})
    sent2 = df[['language','sentence2']].rename(columns={"sentence2":"input"})
    final = pd.concat([sent1,sent2],ignore_index=True)
    final['input'] = 'input: '+final.input
    final = final.drop_duplicates()
    final = final.sample(frac=1)
    final.to_csv(new,index=False,header=False)
    print(f'Shape: {final.shape}')

## Task Registration
1. A `Task` is a dataset along with preprocessing functions and evaluation metrics.
2. For this notebook, we register a new task called `lang_classify` in the `TaskRegistry`.
3. There are predefined tasks as well, but a custom downstream task such as question answering or language identification (as done here) needs to be registered first.
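
The data flow behind the task can be sketched in plain Python: each headerless CSV row is split into a `language` and an `input` field, and the preprocessor then renames them to the `inputs` and `targets` keys of a text-to-text example. This is only an illustration of that mapping, not the `tf.data` pipeline itself; the helper names `parse_line` and `to_text_to_text` and the sample row are made up.

```python
import csv
import io


def parse_line(line):
    """Split one headerless CSV row into the two fields used in this notebook."""
    language, text = next(csv.reader(io.StringIO(line)))
    return {"language": language, "input": text}


def to_text_to_text(example):
    """Rename the fields: the sentence becomes the input, the language code the target."""
    return {"inputs": example["input"], "targets": example["language"]}


# A made-up row; the quotes protect the comma inside the sentence.
row = 'en,"input: Hello, world"'
example = to_text_to_text(parse_line(row))
print(example)
```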

#### Set `MAIN_DIR` to your GCS bucket if using a TPU

In [None]:
MAIN_DIR = ""         # Path to a GCS bucket (ending with "/") if using a TPU

xnli_csv_path = {
    "train":"train.csv",
    "test": "test.csv"
}

def xnli_dataset_fn(split, shuffle_files=False):
  if MAIN_DIR=="":
    ds = tf.data.TextLineDataset(xnli_csv_path[split])
  else:
    ds = tf.data.TextLineDataset(MAIN_DIR+xnli_csv_path[split])
  ds = ds.map(
      functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                        field_delim=","),
      num_parallel_calls=tf.data.experimental.AUTOTUNE)
  ds = ds.map(lambda *ex: dict(zip(["language", "input"], ex)))
  return ds

def lang_preprocessor(data):
  # The sentence is the model input and the language code is the target.
  return data.map(
      lambda ex: {"inputs": ex["input"], "targets": ex["language"]},
      num_parallel_calls=tf.data.experimental.AUTOTUNE)

DEFAULT_VOCAB = t5.data.SentencePieceVocabulary("gs://t5-data/vocabs/mc4.250000.100extra/sentencepiece.model")

DEFAULT_OUTPUT_FEATURES = {
    "inputs":
        seqio.Feature(
            vocabulary=DEFAULT_VOCAB, add_eos=True,required=False),
    "targets":
        seqio.Feature(
            vocabulary=DEFAULT_VOCAB, add_eos=True)
}

task = "lang_classify"

# Drop any previous registration so that this cell can be re-run.
seqio.TaskRegistry.remove(task)
seqio.TaskRegistry.add(
    task,
    source=seqio.FunctionDataSource(
        dataset_fn=xnli_dataset_fn,
        splits=["train", "test"],
        ),
    preprocessors=[
        lang_preprocessor,
        seqio.preprocessors.tokenize_and_append_eos,
    ],
    postprocess_fn=t5.data.postprocessors.lower_text,
    metric_fns=[t5.evaluation.metrics.accuracy],
    output_features=DEFAULT_OUTPUT_FEATURES,
  )

## Data Creation and PreProcessing
1. We create two new files, `train.csv` and `test.csv`, with the `create_data` function.
2. These files contain only the data that is relevant here, i.e. the `language` column and the sentences from `sentence1` and `sentence2`.
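
To make the resulting layout concrete, here is a minimal sketch of the same stacking and de-duplication on a tiny in-memory table. The two rows are made-up sentences, not XNLI data, and the shuffling and file writing done by `create_data` are left out.

```python
import pandas as pd

# Toy stand-in for two rows of an XNLI TSV file (made-up sentences).
toy = pd.DataFrame({
    "language": ["en", "fr"],
    "sentence1": ["The park is open.", "Le parc est ouvert."],
    "sentence2": ["The park is open.", "Allons au parc."],
})

# Stack sentence1 and sentence2 into a single `input` column.
sent1 = toy[["language", "sentence1"]].rename(columns={"sentence1": "input"})
sent2 = toy[["language", "sentence2"]].rename(columns={"sentence2": "input"})
stacked = pd.concat([sent1, sent2], ignore_index=True)

# Add the text prefix and drop exact duplicates.
stacked["input"] = "input: " + stacked["input"]
stacked = stacked.drop_duplicates().reset_index(drop=True)

print(stacked)
```

The English sentence appears in both columns, so only one copy survives and the toy table ends up with three rows.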

In [None]:
create_data(PATH_TO_DATA+'xnli.test.tsv',"train.csv")
create_data(PATH_TO_DATA+'xnli.dev.tsv',"test.csv")
# The files are written without a header row, so read them with header=None.
print(pd.read_csv("train.csv", header=None).shape)

## Model Training

1. The mT5 small Transformer is fine-tuned with the following settings.
    - Learning rate = 0.003
    - Batch size = 32 on a GPU, or 128 on a TPU
    - Epochs (`EPOCH`) = 5
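
The number of fine-tuning steps follows from these settings: the training cell divides the number of examples by the batch size, rounds down, and multiplies by the number of epochs. A small sketch of that arithmetic is shown below; the helper name `finetune_steps` is hypothetical and the example count of 100,000 is made up, not the size of the real `train.csv`.

```python
def finetune_steps(num_examples, batch_size, epochs):
    """Steps needed to pass over the data `epochs` times, ignoring a partial last batch."""
    steps_per_epoch = num_examples // batch_size
    return steps_per_epoch * epochs


# 100_000 is an illustrative example count only.
gpu_steps = finetune_steps(100_000, batch_size=32, epochs=5)
tpu_steps = finetune_steps(100_000, batch_size=128, epochs=5)

print(gpu_steps)
print(tpu_steps)
```

With the larger TPU batch size, the same five epochs take roughly a quarter of the steps.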

In [None]:
PRE_TRAINED_MODEL = "gs://t5-data/pretrained_models/mt5/small"
LR = 0.003
BATCH_SIZE = 32

TPU_TOPOLOGY = "v2-8"
TPU_ADDRESS = None

if ON_TPU:
    BATCH_SIZE = 128
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
        TPU_ADDRESS = tpu.get_master()
        print('Running on TPU:', TPU_ADDRESS)
    except ValueError as e:
        raise RuntimeError('Not connected to a TPU runtime; connect to one or set ON_TPU to False.') from e

    tf.enable_eager_execution()
    tf.config.experimental_connect_to_host(TPU_ADDRESS)

# train.csv has no header row; header=None keeps the first example in the count.
n = pd.read_csv("train.csv", header=None).shape[0]
EPOCH = 5
ft_steps = (n // BATCH_SIZE) * EPOCH

if MAIN_DIR=="":
  MODEL_DIR = "models/"
else:
  MODEL_DIR = MAIN_DIR+"models/"

model = MtfModel(MODEL_DIR,
                 tpu=TPU_ADDRESS,
                 tpu_topology=TPU_TOPOLOGY,
                 model_parallelism=1,
                 batch_size=BATCH_SIZE,
                 sequence_length={"inputs": 64, "targets": 15},
                 learning_rate_schedule=LR,
                 save_checkpoints_steps=5000,
                 keep_checkpoint_max=16 if ON_TPU else None,
                 iterations_per_loop=300 if ON_TPU else 100)

model.finetune(
      mixture_or_task_name=task,
      pretrained_model_dir=PRE_TRAINED_MODEL,
      finetune_steps=ft_steps,
      split="train")


## Model Evaluation
We now evaluate the model on the `test` split of the task, which was built from `xnli.dev.tsv`.

In [None]:
model.batch_size = BATCH_SIZE*4
SUMM_DIR = "output/" if MAIN_DIR=="" else MAIN_DIR+"output/"
model.eval(
    "lang_classify",
    summary_dir=SUMM_DIR,
    checkpoint_steps=-1,
    split="test"
)

## Model Metrics
Here we evaluate the model with precision, recall and F1, and finally look at the per-language classification report.
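
The overall scores are computed with `average='micro'`. For a single-label task like this one, every wrong prediction counts once as a false positive (for the predicted language) and once as a false negative (for the true language), so micro-averaged precision and recall both equal plain accuracy. The sketch below checks this on five made-up labels; `micro_precision_recall` is a hypothetical helper, not the scikit-learn implementation.

```python
def micro_precision_recall(targets, preds):
    """Micro-averaged precision and recall: counts are pooled over all labels."""
    labels = set(targets) | set(preds)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(targets, preds):
            if p == label and t == label:
                tp += 1
            elif p == label and t != label:
                fp += 1
            elif p != label and t == label:
                fn += 1
    return tp / (tp + fp), tp / (tp + fn)


# Made-up labels: one Hindi sentence is mistaken for Urdu.
targets = ["en", "hi", "ur", "hi", "fr"]
preds = ["en", "ur", "ur", "hi", "fr"]

precision, recall = micro_precision_recall(targets, preds)
accuracy = sum(t == p for t, p in zip(targets, preds)) / len(targets)
print(precision, recall, accuracy)
```

This is why the three overall numbers reported for the model are identical.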

In [14]:
import ast
from sklearn.metrics import precision_recall_fscore_support, classification_report

def get_prediction(output_dir,task_name):

    """
    Helper function to get the latest prediction file
    :param output_dir: Directory where the outputs of .eval() were saved.
    :param task_name: Task name
    :return: Path of the prediction file with the highest checkpoint step, or None
    """

    def _prediction_file_to_ckpt(path):
        return int(path.split("_")[-2])
    prediction_files = tf.io.gfile.glob(os.path.join(output_dir,"%s_*_predictions"%task_name))
    if len(prediction_files) == 0: return None
    return sorted(prediction_files, key=_prediction_file_to_ckpt)[-1]

def evaluation(output_dir,task_name):

    """
    Prints the evaluation of the trained model, i.e. precision, recall and f1
    :param output_dir: Directory where the outputs of .eval() were saved.
    :param task_name: Task name
    :return: None; the scores and the classification report are printed
    """

    pred_fn = get_prediction(output_dir,task_name)
    if not pred_fn: return None
    with tf.io.gfile.GFile(pred_fn) as p:
        preds = [line.strip() for line in p]

    with tf.io.gfile.GFile(os.path.join(output_dir,"%s_targets" % task_name)) as t:
        targets = [line.strip() for line in t]

    with tf.io.gfile.GFile(os.path.join(output_dir,"%s_inputs" % task_name)) as i:
        # Each line holds a bytes literal; literal_eval parses it without executing code.
        inputs = [ast.literal_eval(line).decode('utf-8') for line in i]

    p,r,f1,_ = precision_recall_fscore_support(targets, preds,average='micro')
    print(f'precision: {p}\nrecall: {r}\nf1: {f1}\n')
    print()
    print(classification_report(targets,preds))

evaluation(SUMM_DIR, "lang_classify")


precision: 0.9972889933128501
recall: 0.9972889933128501
f1: 0.9972889933128501


              precision    recall  f1-score   support

          ar       1.00      1.00      1.00      3320
          bg       1.00      1.00      1.00      3320
          de       1.00      1.00      1.00      3320
          el       1.00      1.00      1.00      3320
          en       0.99      1.00      0.99      3320
          es       1.00      1.00      1.00      3320
          fr       1.00      1.00      1.00      3320
          hi       1.00      0.98      0.99      3320
          ru       1.00      1.00      1.00      3320
          sw       1.00      1.00      1.00      3319
          th       1.00      1.00      1.00      3320
          tr       1.00      1.00      1.00      3320
          ur       0.98      1.00      0.99      3319
          vi       1.00      1.00      1.00      3320
          zh       1.00      1.00      1.00      3319

    accuracy                           1.00     49797

In [None]:
if ON_TPU:
    %reload_ext tensorboard
%load_ext tensorboard
%tensorboard --logdir="$MODEL_DIR" --port=0

## Predictions
1. Finally, the model is used to predict the language of new, unseen sentences.
2. `load_model` returns the latest model, and `predictions` writes the predictions for the given inputs to a file.
<br/><br/>
*Note: the task `lang_classify` must be registered in the `TaskRegistry` first. This can be done by running the cell under `Task Registration`.*
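
Before calling the model, the prediction cell writes one sentence per line to a text file, lower-cased and with the same `input: ` prefix that was added to the training data. The sketch below shows only that formatting step; `write_inputs` is a hypothetical helper and a temporary directory is used instead of the notebook's `inputs.txt`.

```python
import os
import tempfile


def write_inputs(sentences, path, prefix="input: "):
    """Write one prefixed, lower-cased sentence per line."""
    if isinstance(sentences, str):
        sentences = [sentences]
    # UTF-8 is set explicitly so that non-Latin scripts are written correctly.
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            f.write(f"{prefix}{sentence.lower()}\n")


path = os.path.join(tempfile.mkdtemp(), "inputs.txt")
write_inputs(["Vamos a aparcar", "Пойдем в парк"], path)

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)
```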

In [None]:
# NOTE: Register the "lang_classify" task before running this cell by running the `Task Registration` cell.

def load_model(path):

    """
    Loads the current model
    :param path: path to model
    :return: MtfModel
    """

    return MtfModel(path,
                   tpu=None,
                   model_parallelism=1,
                   sequence_length={"inputs": 64, "targets": 15})

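def predictions(inputs,model):
    """
    Get predictions for the input.
    :param inputs: List of strings or a single string
    :param model: Model used to predict
    :return: None; predictions are written to `predd.txt` (the library may append the checkpoint step to the file name)
    """

    if isinstance(inputs, str):
        inputs = [inputs]

    # Write UTF-8 explicitly so that non-Latin scripts are stored correctly.
    with open('inputs.txt', "w", encoding="utf-8") as f:
        for inp in inputs:
            f.write("input: %s\n" % inp.lower())

    model.predict(
          input_file='inputs.txt',
          output_file='predd.txt',
          temperature=0,   # Greedy decoding: pick the most probable token at each step
          # Assumption: the mT5 vocabulary is passed explicitly, since the library default may differ.
          vocabulary=DEFAULT_VOCAB,
      )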


model = load_model(MODEL_DIR)
inputs = [
    "चलो पार्क चलते हैं",
    "Hãy đến công viên",
    "Vamos a aparcar",
    "Пойдем в парк",
          ]
predictions(inputs,model)