# Transfer Learning for Text Data

In this lab, we begin exploring transfer learning for multilingual text. We fine-tune a multilingual BERT model on English-language book reviews, then test it on Hindi-language reviews.

First, make sure you have the following dependencies:

```bash
    $ pip install torch tensorflow transformers cld2-cffi pandas scikit-learn requests simplejson openpyxl
```

### Imports

In [1]:
import os
import cld2
import gzip
import time
import torch
import random
import zipfile
import requests
import simplejson
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
from transformers import BertConfig, BertTokenizer
from transformers import BertPreTrainedModel, BertModel
from transformers import get_linear_schedule_with_warmup
from sklearn.model_selection import train_test_split as tts
from transformers import BertForSequenceClassification, AdamW
from tensorflow.keras.preprocessing.sequence import pad_sequences
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler

## Download and load the datasets

This project requires two datasets, both containing reviews of books. The first dataset contains Hindi-language book reviews, and was originally gathered by Raghvendra Pratap Singh
([MrRaghav](https://github.com/MrRaghav)) for [his GitHub repository](https://github.com/MrRaghav/Complaints-mining-from-Hindi-product-reviews) on complaint mining in product reviews.

The second dataset contains English-language book reviews. It is a subset of the [Amazon product review corpus](https://registry.opendata.aws/amazon-reviews-ml/); a portion of that corpus (English-only, to my knowledge) is available [here](http://jmcauley.ucsd.edu/data/amazon/) from [Julian McAuley at UCSD](https://cseweb.ucsd.edu/~jmcauley/).

In [2]:
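def fetch_data(url, fname):
    """
    Helper function to download a file with the requests library
    and save it to fname. Returns the absolute path of the saved file.
    """
    response = requests.get(url)
    # Fail early on HTTP errors rather than saving an error page to disk
    response.raise_for_status()
    outpath  = os.path.abspath(fname)
    with open(outpath, "wb") as f:
        f.write(response.content)

    return outpath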

In [3]:
# Fetch the Hindi review data

FIXTURES = os.path.join("..", "data")

if not os.path.exists(FIXTURES):
    os.makedirs(FIXTURES)

HINDI_FILE = os.path.join(FIXTURES, "amazon-youtube-hindi-complaints-data.xlsx")
HINDI_URL = "https://tinyurl.com/y5h2dkn8"
HINDI_REVIEWS = fetch_data(HINDI_URL, HINDI_FILE)

In [4]:
hindi_reviews = pd.read_excel(
    HINDI_REVIEWS, 
    sheet_name="Sheet1"
)
hindi_reviews.head()

|   | Category | Label | Reviews |
|---|----------|-------|---------|
| 0 | Phone | 1 | वीवो वी 19 अच्छा है इनका गैलरी मजा नहीं आता स... |
| 1 | Phone | 0 | बहोत सस्ता है |
| 2 | Book | 0 | किंडल आपके साथ इस किताब को पढ़ने में मुझे कंटि... |
| 3 | Book | 0 | मुस्लिम शासकों उनके अत्याचारों से हिन्दू जनता ... |
| 4 | Book | 0 | पर नशा है आईएएस की तैयारी |


In [5]:
# This dataset includes both book and phone reviews.
# Let's keep only the book reviews.

hindi_reviews = hindi_reviews[hindi_reviews.Category == "Book"]
hindi_reviews = hindi_reviews.drop(columns=["Category"])
print(len(hindi_reviews))

2839


In [6]:
hindi_reviews.head()

|   | Label | Reviews |
|---|-------|---------|
| 2 | 0 | किंडल आपके साथ इस किताब को पढ़ने में मुझे कंटि... |
| 3 | 0 | मुस्लिम शासकों उनके अत्याचारों से हिन्दू जनता ... |
| 4 | 0 | पर नशा है आईएएस की तैयारी |
| 5 | 0 | एकदम जबरदस्त किताब है |
| 6 | 0 | एक जबरदस्त कहानी |


In [7]:
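# Now we'll load the English-language reviews.
# Note that we've previously downloaded them from the link below:
# http://jmcauley.ucsd.edu/data/amazon/
# It's a 3 GB file, compressed.

import ast

def parse(path, n_rows=10000):
    """
    Yield the first n_rows reviews from the gzipped file, one dict per line.
    """
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for idx, line in enumerate(g):
            if idx >= n_rows:
                break
            # literal_eval parses the line as a Python literal without
            # executing arbitrary code, which a bare eval would allow
            yield ast.literal_eval(line)

def make_dataframe(path):
    # Key each review by its row number so it becomes the dataframe index
    df = {idx: dictionary for idx, dictionary in enumerate(parse(path))}
    return pd.DataFrame.from_dict(df, orient='index')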
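
The line-by-line reading used above can be tried without the 3 GB download. The sketch below builds a tiny gzip stream in memory and parses it one record per line. The two sample records and the helper name `iter_records` are made up for illustration; it is also an assumption that a given file might hold either strict JSON or Python-style dict literals, which is why the sketch tries `json.loads` first and falls back to `ast.literal_eval`.

```python
import ast
import gzip
import io
import json

def iter_records(fileobj, n_rows=3):
    """Yield at most n_rows parsed records from a gzipped, line-delimited stream."""
    with gzip.open(fileobj, "rt", encoding="utf-8") as lines:
        for idx, line in enumerate(lines):
            if idx >= n_rows:
                break
            try:
                # Strict JSON: double-quoted keys and strings
                yield json.loads(line)
            except json.JSONDecodeError:
                # Python-literal dicts, e.g. single-quoted strings
                yield ast.literal_eval(line)

# Hypothetical sample: one strict-JSON line and one Python-literal line
raw = (
    b'{"reviewText": "Great book", "overall": 5.0}\n'
    b"{'reviewText': 'Not for me', 'overall': 2.0}\n"
)
buffer = io.BytesIO(gzip.compress(raw))

records = list(iter_records(buffer))
for record in records:
    print(record["overall"], record["reviewText"])
```

Neither parser executes code found in the file, so a malformed or malicious line raises an error instead of running.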

In [8]:
ENGL_REVIEWS = os.path.join(FIXTURES, "reviews_Books_5.json.gz")
english_reviews = make_dataframe(ENGL_REVIEWS)
english_reviews.head()

|   | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|------------|------|--------------|---------|------------|---------|---------|----------------|------------|
| 0 | A10000012B7CGYKOMPQ4L | 000100039X | Adam | [0, 0] | Spiritually and mentally inspiring! A book tha... | 5.0 | Wonderful! | 1355616000 | 12 16, 2012 |
| 1 | A2S166WSCFIFP5 | 000100039X | adead_poet@hotmail.com "adead_poet@hotmail.com" | [0, 2] | This is one my must have books. It is a master... | 5.0 | close to god | 1071100800 | 12 11, 2003 |
| 2 | A1BM81XB4QHOA3 | 000100039X | Ahoro Blethends "Seriously" | [0, 0] | This book provides a reflection that you can a... | 5.0 | Must Read for Life Afficianados | 1390003200 | 01 18, 2014 |
| 3 | A1MOSTXNIO5MPJ | 000100039X | Alan Krug | [0, 0] | I first read THE PROPHET in college back in th... | 5.0 | Timeless for every good and bad time in your l... | 1317081600 | 09 27, 2011 |
| 4 | A2XQ5LZHTD4AFT | 000100039X | Alaturka | [7, 9] | A timeless classic. It is a very demanding an... | 5.0 | A Modern Rumi | 1033948800 | 10 7, 2002 |


In [9]:
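def get_complaints(rating):
    """
    Label reviews rated 1 or 2 stars as complaints (1)
    and all higher ratings as non-complaints (0).
    """
    if rating > 2:
        return 0
    else:
        return 1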
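
To make the labeling rule concrete, here is a small standalone sketch that applies the same threshold to each possible star rating. The name `rating_to_label` and its `max_complaint_rating` parameter are illustrative and not part of the notebook.

```python
def rating_to_label(rating, max_complaint_rating=2):
    """Return 1 (complaint) for ratings at or below the threshold, else 0."""
    return int(rating <= max_complaint_rating)

ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
labels = [rating_to_label(r) for r in ratings]

for rating, label in zip(ratings, labels):
    print(rating, "->", label)
```

Note that a 3-star review counts as a non-complaint under this rule, so the threshold is a modeling choice rather than a property of the data.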

In [10]:
english_reviews["Score"] = english_reviews["overall"].apply(get_complaints)

english_reviews = english_reviews.drop(
    columns=[
        "reviewerID", "asin", "reviewerName", "helpful", 
        "summary", "unixReviewTime", "reviewTime", "overall"
    ]
)

In [11]:
english_reviews.columns = ["Reviews", "Label"]
english_reviews.head()

|   | Reviews | Label |
|---|---------|-------|
| 0 | Spiritually and mentally inspiring! A book tha... | 0 |
| 1 | This is one my must have books. It is a master... | 0 |
| 2 | This book provides a reflection that you can a... | 0 |
| 3 | I first read THE PROPHET in college back in th... | 0 |
| 4 | A timeless classic. It is a very demanding an... | 0 |


## Set up Model Architecture

The number of epochs, maximum length of sequences, batch size, and random seed for training are global variables, as is the path to the directory where we will store the trained model.

In [12]:
EPOCHS = 3
MAX_LEN = 128
BATCH_SIZE = 32
RANDOM_SEED = 38

STORE_PATH = os.path.join("..", "results")

if not os.path.exists(STORE_PATH):
    os.makedirs(STORE_PATH)

In [13]:
def prep(df):
    """
    This prep function will take the feature dataframe as input,
    perform tokenization, and return the encoded feature vectors
    """
    sentences = df.values
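    # The cased checkpoint expects unmodified text. Lowercasing in
    # BertTokenizer also strips combining marks, which can damage
    # Devanagari text, so it is disabled here.
    tokenizer = BertTokenizer.from_pretrained(
        'bert-base-multilingual-cased', do_lower_case=False
    )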
    
    encoded_sentences = []
    for sent in sentences:
        encoded_sent = tokenizer.encode(
            sent,
            add_special_tokens=True,
            truncation=True,
            max_length=MAX_LEN
        )
        
        encoded_sentences.append(encoded_sent)

    encoded_sentences = pad_sequences(
        encoded_sentences, 
        maxlen=MAX_LEN, 
        dtype="long", 
        value=0, 
        truncating="post", 
        padding="post"
    )

    return encoded_sentences


def attn_mask(encoded_sentences):
    """
    This function takes the encoded sentences as input and returns 
    attention masks ahead of BERT training. 
    
    A 0 value corresponds to padding, and a value of 1 is an actual token.
    """

    attention_masks = []
    for sent in encoded_sentences:
        att_mask = [int(token_id > 0) for token_id in sent]
        attention_masks.append(att_mask)
    return attention_masks
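
The padding and masking steps above can be seen on a toy input. This is a minimal pure-Python sketch, not the notebook's implementation: the token ids are hypothetical, `MAX_LEN` is shortened to 8 for readability, and `pad_post` and `mask_for` are illustrative names standing in for `pad_sequences` and `attn_mask`.

```python
MAX_LEN = 8  # shortened from the notebook's 128 so the output fits on one line
PAD_ID = 0   # same padding value the notebook passes to pad_sequences

def pad_post(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Truncate from the end, then right-pad, so the result has exactly max_len ids."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

def mask_for(padded_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding positions."""
    return [int(token_id != pad_id) for token_id in padded_ids]

# Hypothetical token ids for a four-token sentence
encoded = [101, 7592, 2088, 102]

padded = pad_post(encoded)
mask = mask_for(padded)

print(padded)
print(mask)
```

The mask tells the model to ignore the padded positions, so sentences of different lengths can share one fixed-size batch.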

### Split the data and preprocess it

In [14]:
X = english_reviews["Reviews"]
y = english_reviews["Label"]

# Create train and test splits
X_train, X_test, y_train, y_test = tts(
    X, y, test_size=0.20, random_state=38, shuffle=True
)

X_train_encoded = prep(X_train)
X_train_masks = attn_mask(X_train_encoded)

X_test_encoded = prep(X_test)
X_test_masks = attn_mask(X_test_encoded)
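
As a rough check on the sizes involved, the arithmetic below estimates how many rows and batches the split produces. It assumes about 10,000 English reviews were loaded (the default `n_rows` in `parse`) and that the test share is rounded up, which is my understanding of how scikit-learn sizes the test set. The helper names are illustrative.

```python
import math

BATCH_SIZE = 32     # same value as the notebook's global
N_REVIEWS = 10000   # assumed number of loaded English reviews
TEST_PERCENT = 20   # test_size=0.20 written as an integer percentage

def split_sizes(n_samples, test_percent):
    """Return (n_train, n_test), rounding the test share up."""
    n_test = math.ceil(n_samples * test_percent / 100)
    return n_samples - n_test, n_test

def n_batches(n_rows, batch_size):
    """Number of batches a data loader yields when the last partial batch is kept."""
    return math.ceil(n_rows / batch_size)

n_train, n_test = split_sizes(N_REVIEWS, TEST_PERCENT)
train_batches = n_batches(n_train, BATCH_SIZE)
test_batches = n_batches(n_test, BATCH_SIZE)
last_test_batch = n_test - (test_batches - 1) * BATCH_SIZE

print(n_train, n_test)
print(train_batches, test_batches, last_test_batch)
```

Under these assumptions each epoch has 250 training batches, and the final validation batch holds only 16 rows.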

### Convert the inputs and labels to tensors

In [15]:
train_inputs = torch.tensor(X_train_encoded)
train_labels = torch.tensor(y_train.values)
train_masks = torch.tensor(X_train_masks)

validation_inputs = torch.tensor(X_test_encoded)
validation_labels = torch.tensor(y_test.values)
validation_masks = torch.tensor(X_test_masks)

### Configure data loaders for training and validation

In [16]:
# data loader for training
train_data = TensorDataset(
    train_inputs, 
    train_masks, 
    train_labels
)
train_sampler = SequentialSampler(train_data)
trainer = DataLoader(
    train_data, 
    sampler=train_sampler, 
    batch_size=BATCH_SIZE
)

# data loader for validation
validation_data = TensorDataset(
    validation_inputs, 
    validation_masks, 
    validation_labels
)
validation_sampler = SequentialSampler(validation_data)
validator = DataLoader(
    validation_data, 
    sampler=validation_sampler, 
    batch_size=BATCH_SIZE
)

In [17]:
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=2,   # we are doing binary classification
    output_attentions=False, 
    output_hidden_states=False, 
)

optimizer = AdamW(
    model.parameters(),
    lr=3e-5, 
    eps=1e-8,
    weight_decay=0.01
)


total_steps = len(trainer) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer, 
    num_warmup_steps=0,
    num_training_steps=total_steps
)

Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model ch
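
The scheduler configured above scales the learning rate linearly over the course of training. The sketch below reproduces that multiplier in plain Python, based on my understanding of the schedule rather than on the library's source: it rises from 0 to 1 during warmup, then decays linearly to 0. The function name `lr_multiplier` is illustrative, and the step count assumes 250 batches per epoch.

```python
def lr_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 up to 1
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return max(0.0, remaining / decay_span)

BASE_LR = 3e-5          # same base rate passed to the optimizer
TOTAL_STEPS = 250 * 3   # assumed: 250 batches per epoch, 3 epochs

for step in (0, 375, 750):
    print(step, BASE_LR * lr_multiplier(step, 0, TOTAL_STEPS))
```

With zero warmup steps the rate starts at its full value and is halved by the midpoint of training. The schedule only advances when `scheduler.step()` is called once per optimizer step.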

In [18]:
# See Hugging Face example: https://tinyurl.com/y5629dsp

def compute_accuracy(y_pred, y_true):
    """
    Compute the accuracy of the predicted values
    """
    predicted = np.argmax(y_pred, axis=1).flatten()
    actual = y_true.flatten()
    return np.sum(predicted==actual)/len(actual)


def train_model(train_loader, test_loader, epochs):
    losses = []
    for e in range(epochs):
        print('======== Epoch {:} / {:} ========'.format(e + 1, epochs))
        start_train_time = time.time()
        total_loss = 0
        model.train()
        for step, batch in enumerate(train_loader):

            if step%10 == 0:
                elapsed = time.time() - start_train_time
                print(
                    "{}/{} --> Time elapsed {}".format(
                        step, len(train_loader), elapsed
                    )
                )

            input_data, input_masks, input_labels = batch
            input_data = input_data.type(torch.LongTensor)
            input_masks = input_masks.type(torch.LongTensor)
            input_labels = input_labels.type(torch.LongTensor)

            model.zero_grad()

            # forward propagation
            out = model(
                input_data,
                token_type_ids=None, 
                attention_mask=input_masks,
                labels=input_labels
            )
            loss = out[0]
            total_loss = total_loss + loss.item()

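            # backward propagation
            loss.backward()
            # clip gradients to a maximum norm of 1 to stabilize training
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
            optimizer.step()
            # advance the learning rate schedule
            scheduler.step()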
        
        epoch_loss = total_loss/len(train_loader)
        losses.append(epoch_loss)
        print("Training took {}".format(
            (time.time() - start_train_time)
        ))

        # Validation
        start_validation_time = time.time()
        model.eval()
        eval_loss, eval_acc = 0, 0
        for step, batch in enumerate(test_loader):
            eval_data, eval_masks, eval_labels = batch
            # Convert the validation batch itself, not the last training batch
            eval_data = eval_data.type(torch.LongTensor)
            eval_masks = eval_masks.type(torch.LongTensor)
            eval_labels = eval_labels.type(torch.LongTensor)
            
            with torch.no_grad():
                out = model(
                    eval_data,
                    token_type_ids=None, 
                    attention_mask=eval_masks
                )
            logits = out[0]

            batch_acc = compute_accuracy(
                logits.numpy(), eval_labels.numpy()
            )

            eval_acc += batch_acc
            
        print(
            "Accuracy: {}, Time elapsed: {}".format(
                eval_acc/(step + 1),
                time.time() - start_validation_time
            )
        )
        
    return losses
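
The validation loop reports the mean of the per-batch accuracies. That can differ slightly from the accuracy over all examples when the last batch is smaller than the rest, because the small batch gets the same weight as a full one. The sketch below illustrates the gap with hypothetical predictions and labels; both function names are illustrative.

```python
def batch_averaged_accuracy(batches):
    """Mean of per-batch accuracies, as the validation loop computes it."""
    per_batch = [
        sum(p == t for p, t in zip(preds, labels)) / len(labels)
        for preds, labels in batches
    ]
    return sum(per_batch) / len(per_batch)

def overall_accuracy(batches):
    """Correct predictions divided by the total number of examples."""
    correct = sum(
        sum(p == t for p, t in zip(preds, labels))
        for preds, labels in batches
    )
    total = sum(len(labels) for _, labels in batches)
    return correct / total

# Hypothetical (predictions, labels) pairs: one batch of 4 and a final batch of 1
batches = [
    ([0, 0, 1, 0], [0, 0, 1, 1]),
    ([1], [1]),
]

print(batch_averaged_accuracy(batches))
print(overall_accuracy(batches))
```

With batches of 32 and thousands of examples the difference is small, but counting correct predictions across all batches gives the exact figure.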

In [19]:
losses = train_model(trainer, validator, EPOCHS)

0/250 --> Time elapsed 0.007717132568359375
10/250 --> Time elapsed 327.6781442165375
20/250 --> Time elapsed 671.3720242977142
30/250 --> Time elapsed 980.1099593639374
40/250 --> Time elapsed 1277.7987241744995
50/250 --> Time elapsed 1568.9109942913055
60/250 --> Time elapsed 1894.9694511890411
70/250 --> Time elapsed 2249.337110042572
80/250 --> Time elapsed 2551.6829221248627
90/250 --> Time elapsed 2854.640032052994
100/250 --> Time elapsed 3155.7811381816864
110/250 --> Time elapsed 3452.4280750751495
120/250 --> Time elapsed 3757.349958181381
130/250 --> Time elapsed 4064.8563373088837
140/250 --> Time elapsed 4367.3598091602325
150/250 --> Time elapsed 4682.460071325302
160/250 --> Time elapsed 4979.263302326202
170/250 --> Time elapsed 5297.050108194351
180/250 --> Time elapsed 5601.208858251572
190/250 --> Time elapsed 5905.388324260712
200/250 --> Time elapsed 6215.239118099213
210/250 --> Time elapsed 6516.767795085907
220/250 --> Time elapsed 6821.974551200867
230/250 -->

In [20]:
model_to_save = model.module if hasattr(model, 'module') else model
model_to_save.save_pretrained(STORE_PATH)

In [23]:
def test_model(new_df):
    """
    Test the trained model on a dataset in another language.
    This function assumes the input dataframe contains two columns:
    "Reviews" (the text of the review) and "Label" (the score for
    the review, where 0 represents no complaint and 1 represents
    a complaint).
    """
    X = new_df["Reviews"]
    y = new_df["Label"]

    X_test_encoded = prep(X)
    X_test_masks = attn_mask(X_test_encoded)

    test_inputs = torch.tensor(X_test_encoded)
    test_labels = torch.tensor(y.values)
    test_masks = torch.tensor(X_test_masks)

    test_data = TensorDataset(
        test_inputs, 
        test_masks, 
        test_labels
    )
    test_sampler = SequentialSampler(test_data)
    tester = DataLoader(
        test_data, 
        sampler=test_sampler, 
        batch_size=BATCH_SIZE
    )

    model.eval()
    eval_loss, eval_acc = 0, 0
    
    for step, batch in enumerate(tester):
        eval_data, eval_masks, eval_labels = batch
        eval_data = eval_data.type(torch.LongTensor)
        eval_masks = eval_masks.type(torch.LongTensor)
        eval_labels = eval_labels.type(torch.LongTensor)
            
        with torch.no_grad():
            out = model(
                eval_data,
                token_type_ids=None,
                attention_mask=eval_masks
            )
        logits = out[0]
        logits = logits.detach().cpu().numpy()
        eval_labels = eval_labels.to('cpu').numpy()
        batch_acc = compute_accuracy(logits, eval_labels)
        eval_acc += batch_acc
    print("Accuracy: {}".format(eval_acc/(step + 1)))

In [24]:
test_model(hindi_reviews)

Accuracy: 0.9507053004396678
