# CS-433 2019 - Text Classification

## Foreword
This notebook contains the full preprocessing, training, and classification pipeline for the [text classification challenge](https://www.aicrowd.com/challenges/epfl-ml-text-classification-2019/leaderboards).

It was run on a [Google Cloud Deep Learning VM](https://cloud.google.com/ai-platform/deep-learning-vm/docs) based on the [PyTorch image](https://cloud.google.com/ai-platform/deep-learning-vm/docs/pytorch_start_instance). The VM was configured with 2 vCPUs (13 GB of RAM) and an NVIDIA Tesla P100.

As a result, the only package missing from that image is [transformers](https://huggingface.co/transformers/index.html).

### Disclaimer
We don't recommend training the model at home without a powerful GPU (such as an RTX 2080 Ti). Inference can be run fairly easily on Google Colab, and we also provide a `run.py` for reference. See the [`README`](README.md) for details.

## 0. Setup
**Google Cloud** for the entire process or **Google Colab** (inference only)

In [None]:
!pip install git+https://github.com/huggingface/transformers.git

In [None]:
# Check the current GPU info
!nvidia-smi

### Imports
If you are not using an online service (Google Cloud or Google Colab), install the dependencies locally. Assuming an active conda env, run:
```shell
conda install pytorch cudatoolkit=10.1 -c pytorch
conda install pandas tqdm spacy numpy scikit-learn
python -m spacy download en_core_web_sm
```

In [None]:
import random
import csv
from os import mkdir

import numpy as np
import torch
from torch.nn.utils import clip_grad_norm_
from transformers import BertTokenizer, BertForSequenceClassification,\
                         AdamW, get_cosine_schedule_with_warmup
from sklearn.model_selection import train_test_split

from tqdm.auto import tqdm, trange
import pandas as pd
tqdm.pandas()

from preprocessing import tokenize, transform
from helpers import BatchGenerator, get_device, set_seed, create_csv_submission
from run import TestModel
%load_ext autoreload
%autoreload 2

### Config

In [None]:
SEED = 1
MODEL_NAME = 'bert-base-uncased'
TOKENIZER = BertTokenizer
MODEL = BertForSequenceClassification

BATCH_SIZE = 25
LOG_EVERY = 3000
LR = 1e-5
MAX_GRAD_NORM = 1
NUM_TRAINING_STEPS = 300_000
NUM_WARMUP_STEPS = 30_000

## 1. Preprocessing
The preprocessing is the same as for our GloVe-based models, except that we drop the `<` and `>` characters.
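
The actual logic lives in `preprocessing.py` and is not reproduced here. As a minimal sketch of the one difference mentioned above, assuming GloVe-style placeholder tokens such as `<user>` appear in the text, a hypothetical helper (`strip_angle_brackets` is our name for illustration, not a function from `preprocessing.py`) could look like this:

```python
def strip_angle_brackets(text: str) -> str:
    """Hypothetical helper: remove '<' and '>' so '<user>' becomes 'user'."""
    return text.replace('<', '').replace('>', '')


# Made-up tweet with GloVe-style placeholders (an assumption, see above)
example = strip_angle_brackets('<user> i love this <url>')
print(example)
```

Without the brackets, the placeholders become ordinary words that the BERT tokenizer can split into WordPiece tokens.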

**1.1 Preprocessing functions**
From `preprocessing.py`

**1.2 Preprocess the train data**
Methods adapted from our GloVe model
1. Load data from file `.txt`
2. Preprocess the data using `tokenize` and `transform`
3. Save as `.tsv` for reusability

*Training Data*

In [None]:
def preprocess_train(positive_file, negative_file, out_file):
    with open(negative_file, 'r', encoding='utf-8') as neg,\
            open(positive_file, 'r', encoding='utf-8') as pos,\
            open(out_file, 'w', encoding='utf-8') as out:
        print('label\ttweet', file=out)
        for l in tqdm(neg, total=1250000, desc='Neg'):
            print('0\t' + transform(' '.join(tokenize(l))), file=out)
        for l in tqdm(pos, total=1250000, desc='Pos'):
            print('1\t' + transform(' '.join(tokenize(l))), file=out)
    
preprocess_train('data/train_pos_full.txt',
                 'data/train_neg_full.txt',
                 'data/train_preprocessed.tsv')

**1.3 Encoding procedure**
1. Load preprocessed data into Pandas df
2. Apply the tokenizer's encoding to turn each text into a tensor of WordPiece indices
3. (opt) Save as `pkl.gz` for reusability

In [None]:
tokenizer = TOKENIZER.from_pretrained(MODEL_NAME)

def encode_df(df, tokenizer, save=True, path='data/train_encoded_BERT.pkl.gz'):
    tqdm.pandas()
    df['tensor'] = df.tweet.progress_apply(lambda x: torch.tensor(tokenizer.encode(x),
                                                                  dtype=torch.long))
    df['length'] = df.tensor.progress_apply(len)
    df = df[['label', 'tensor', 'length']].sort_values(by='length')
    if save:
        df[['label', 'tensor']].to_pickle(path, compression='gzip')
    
    return df
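
`encode_df` above also sorts the rows by token count. A likely motivation (our assumption, since `BatchGenerator` is not shown here) is that batches built from neighbouring rows then contain sequences of similar length and need less padding. The sketch below counts pad tokens for made-up lengths, with `padding_tokens` as a hypothetical helper:

```python
def padding_tokens(lengths, batch_size):
    """Count pad tokens needed when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


lengths = [3, 40, 5, 38, 4, 41, 6, 39]  # made-up token counts
unsorted_padding = padding_tokens(lengths, batch_size=4)
sorted_padding = padding_tokens(sorted(lengths), batch_size=4)
print(unsorted_padding, sorted_padding)  # 148 12
```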

*Encode the train set*

In [None]:
train_df = pd.read_csv('data/train_preprocessed.tsv', delimiter='\t')

In [None]:
train_df = encode_df(train_df, tokenizer)

*Preprocess + encode the test set*

In [None]:
def read_test(path='data/test_data.txt'):
    with open(path, 'r', encoding='utf-8') as test_file:
        test_lines = [line.rstrip('\n').split(',', 1) for line in test_file]
        df = pd.DataFrame(test_lines, columns=['label', 'tweet'])
        df.tweet = df.tweet.apply(lambda x: transform(' '.join(tokenize(x))))
        return df

test = encode_df(read_test(), tokenizer, save=True, path='data/test_encoded_BERT.pkl.gz')
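
`read_test` splits each line on the first comma only, so commas inside the tweet are preserved. The short sketch below shows this on a made-up line in the `id,tweet` layout that the code assumes. Note that the `label` column of the test frame therefore holds the tweet id, not a class.

```python
line = '42,well , that was fun, right ?\n'  # made-up test line
tweet_id, tweet = line.rstrip('\n').split(',', 1)
print(tweet_id)  # 42
print(tweet)     # well , that was fun, right ?
```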

## 2. Training

Run the cell below to load the encoded df directly (for example, if you have downloaded our encoded training set) instead of re-running the encoding above.

In [None]:
train_df = pd.read_pickle('data/train_encoded_BERT.pkl.gz')

### 2.1 Set up the seed for reproducibility + select device (GPU)

In [None]:
set_seed(SEED)
device = get_device()

### 2.2 Split the data (90% train, 10% validation) + generate batches

In [None]:
train, val = train_test_split(train_df, train_size=0.9, random_state=SEED, shuffle=False)
train_batch = BatchGenerator(train, BATCH_SIZE, device)
val_batch = BatchGenerator(val, BATCH_SIZE, device, shuffle=False)
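
`BatchGenerator` comes from `helpers.py` and is not shown in this notebook. The training loop later unpacks each batch as `(seq, mask, labels)`, so a batch presumably holds padded token ids plus an attention mask. As a rough sketch under that assumption, a hypothetical `pad_batch` helper working on plain Python lists could build both:

```python
PAD_ID = 0  # id of the [PAD] token in the bert-base-uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Hypothetical helper: right-pad to the longest sequence and build the mask."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    # 1 marks a real token, 0 marks padding that attention should ignore
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask


# Illustrative token ids only
seq, mask = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(seq)
print(mask)
```

In the real pipeline these lists would be tensors moved to `device` before being passed to the model.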

### 2.3 Helper functions

In [None]:
def compute_accuracy(model, batch, text):
    ''' Make predictions for a set (train/val), print and return the accuracy'''
    val_pred = []
    val_target = []
    model.eval()
    # Evaluation needs no gradients; no_grad avoids keeping the graph in GPU memory
    with torch.no_grad():
        for seq, mask, labels in tqdm(batch):
            # The first output of the model holds the logits
            pred = model(seq, attention_mask=mask)[0]
            val_pred.append(pred.argmax(dim=1))
            val_target.append(labels)
    accuracy = (torch.cat(val_pred) == torch.cat(val_target)).float().mean().item()
    print(text, accuracy)
    return accuracy

In [None]:
def fit(model, optimizer, scheduler, train_batch, val_batch, log_every=None, n_epoch=3, path_prefix='model/checkpoint_BERT_'):
    ''' 
    Fit the model on train_batch for n_epoch epochs using the optimizer and scheduler
    Every log_every steps, print the train loss summed over those steps
    At the end of each epoch, save the model under path_prefix and compute the accuracy over val_batch
    '''
    for epoch in range(n_epoch):
        sum_loss = 0
        print('EPOCH', epoch)
        model.train()
        for i, (seq, mask, labels) in enumerate(tqdm(train_batch)):
            optimizer.zero_grad()
            loss, pred = model(seq, attention_mask=mask, labels=labels)[:2]
            sum_loss += loss.item()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
            optimizer.step()
            scheduler.step()
            if log_every and i % log_every == log_every-1:
                print(i+1, sum_loss)
                sum_loss = 0
        
        # Save current model
        path = path_prefix + str(epoch)
        try:
            mkdir(path)
        except:
            pass
        model.save_pretrained(path)
        
        # Compute accuracy at the end of the epoch
        #compute_accuracy(model, train_batch, f'Train accuracy {epoch}')
        compute_accuracy(model, val_batch, f'Validation accuracy {epoch}')
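
The call to `clip_grad_norm_` in `fit` rescales all gradients by a single factor whenever their joint L2 norm exceeds `MAX_GRAD_NORM`. The sketch below reproduces that idea on a flat list of floats. `clip_by_global_norm` is a hypothetical name, and the small `eps` term is an assumption added to avoid division by zero, not a copy of the library code.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Factor is 1.0 when the norm is already within the limit
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # roughly [0.6, 0.8]
```

Because every component is scaled by the same factor, the direction of the update is preserved and only its magnitude is limited.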

### 2.4 Create model and run

In [None]:
model = MODEL.from_pretrained(MODEL_NAME).to(device)
optimizer = AdamW(model.parameters(), lr=LR) 
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=NUM_WARMUP_STEPS,
                                            num_training_steps=NUM_TRAINING_STEPS)

fit(model, optimizer, scheduler, train_batch, val_batch, log_every=LOG_EVERY)
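
To get a feel for the learning rate produced by the config above, the sketch below recomputes the multiplier of a cosine schedule with linear warmup (half a cosine cycle, which is the default of `get_cosine_schedule_with_warmup`). It is a standalone re-implementation for illustration, not the library code. The step counts assume 2.5M training tweets (the `total=1250000` hints in `preprocess_train`), a 90% split, and one optimizer step per batch of 25; `cosine_warmup_multiplier` is a hypothetical name.

```python
import math


def cosine_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Linear warmup from 0 to 1, then half-cosine decay from 1 towards 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


n_train = 2_500_000 * 9 // 10               # assumed size of the 90% train split
steps_per_epoch = math.ceil(n_train / 25)   # 90,000 with BATCH_SIZE = 25
last_step = 3 * steps_per_epoch             # 270,000 after the default 3 epochs

for step in (0, 15_000, 30_000, 165_000, last_step):
    lr = 1e-5 * cosine_warmup_multiplier(step, 30_000, 300_000)
    print(step, lr)
```

Under these assumptions, three epochs end at step 270,000, slightly before `NUM_TRAINING_STEPS`, so the learning rate finishes at about 3% of `LR` without reaching zero.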

## 3. Making predictions

Run this part only if you have downloaded our model + encoded test set

In [None]:
model = TestModel(MODEL, 'model/checkpoint_BERT_1', 'data/test_encoded_BERT.pkl.gz', BATCH_SIZE, device)
test_ids, test_pred = model.make_predictions()
create_csv_submission(test_ids, test_pred, 'data/best_submission.csv')