# Machine Learning Assignment Report
* **Student name:** Đặng Trung Cương
* **Student ID:** 19021229
* **Class:** Machine Learning - INT3405E 20
* **Instructor:** Trần Quốc Long

# Problem Description

**Quora** is a question-and-answer website where people ask about almost any topic (StackOverflow is aimed mainly at engineers, while Quora is not limited to technology). The site hosts a huge number of questions, both good (sincere) and bad (insincere). Our task is to classify each question into one of these two categories: 0 for sincere and 1 for insincere, i.e. binary classification.

The **input** (datasets) that Quora provided contains:
* train.csv: the training set; every question is labeled 0 or 1.
* test.csv: the unlabeled test set whose questions we have to classify.
* sample_submission.csv: a sample file in the format expected for the Kaggle submission.
* embedding.zip: pre-trained word embeddings such as GloVe.

Output: submission.csv, in which each question listed in 'sample_submission.csv' is labeled 0 or 1.

Since the submission must run with the Internet connection turned off, the required packages/dependencies have to be downloaded beforehand and attached to the notebook as datasets.

**Install packages/dependencies for pytorch-xla (TPU)**

In [None]:
# Turn on TPU before executing these lines
# !cp ../input/pytorch-xla-setup-script/torch-nightly-cp37-cp37m-linux_x86_64.whl ./torch-nightly-cp37-cp37m-linux_x86_64.whl
# !cp ../input/pytorch-xla-setup-script/torch_xla-nightly-cp37-cp37m-linux_x86_64.whl ./torch_xla-nightly-cp37-cp37m-linux_x86_64.whl
# !cp ../input/pytorch-xla-setup-script/torchvision-nightly-cp37-cp37m-linux_x86_64.whl ./torchvision-nightly-cp37-cp37m-linux_x86_64.whl

# # These deb files are the dependencies; copy them to the working dir.
# !cp ../input/pytorch-xla-setup-script/libgfortran4_7.5.0-3ubuntu1_18.04_amd64.deb ./libgfortran4_7.5.0-3ubuntu1_18.04_amd64.deb
# !cp ../input/pytorch-xla-setup-script/libomp5_5.0.1-1_amd64.deb ./libomp5_5.0.1-1_amd64.deb
# !cp ../input/pytorch-xla-setup-script/libopenblas-base_0.2.20ds-4_amd64.deb ./libopenblas-base_0.2.20ds-4_amd64.deb
# !cp ../input/pytorch-xla-setup-script/libopenblas-dev_0.2.20ds-4_amd64.deb ./libopenblas-dev_0.2.20ds-4_amd64.deb

# #installing pytorch-xla by running this script
# !python ../input/pytorch-xla-setup-script/pytorch-xla-env-setup.py --version nightly

# # Now, installing dependencies
# !dpkg -i ./libgfortran4_7.5.0-3ubuntu1_18.04_amd64.deb
# !dpkg -i ./libomp5_5.0.1-1_amd64.deb
# !dpkg -i ./libopenblas-base_0.2.20ds-4_amd64.deb
# !dpkg -i ./libopenblas-dev_0.2.20ds-4_amd64.deb

# # Removing wheel and deb files, as we don't need them now.
# !rm torch-nightly-cp37-cp37m-linux_x86_64.whl 
# !rm torch_xla-nightly-cp37-cp37m-linux_x86_64.whl 
# !rm torchvision-nightly-cp37-cp37m-linux_x86_64.whl
# !rm libgfortran4_7.5.0-3ubuntu1_18.04_amd64.deb 
# !rm libomp5_5.0.1-1_amd64.deb 
# !rm libopenblas-base_0.2.20ds-4_amd64.deb 
# !rm libopenblas-dev_0.2.20ds-4_amd64.deb

**Import normal packages**

In [None]:
# Please switch on the TPU before running these lines.
import pandas as pd
import numpy as np
import os
import gc

import random

import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

# set a seed value
torch.manual_seed(555)

from sklearn.utils import shuffle
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score

import transformers
from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification
from transformers import AdamW
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")


print(torch.__version__)


In [None]:
# torch 1.8.1 has to be installed because Kaggle environments (by default) only provide torch-1.7.1+,
# which is not compatible with torch-xla
# !pip install /kaggle/input/torch181whl/torch-1.8.1+cu111-cp37-cp37m-linux_x86_64.whl

In [None]:
# Imports required to use TPUs with Pytorch.
# https://pytorch.org/xla/release/1.5/index.html

# import torch_xla
# import torch_xla.core.xla_model as xm

# Get files from Kaggle directory

In [None]:
# If Kaggle runs out of RAM then we may need this.
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    # Only report when requested (the verbose flag was previously ignored).
    if verbose:
        print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
        print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df
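
The idea behind `reduce_mem_usage` is to store each numeric column in the narrowest type whose range still covers the column's values. The sketch below illustrates that selection for signed integers only, without pandas. `smallest_int_type` is a hypothetical helper, not a function from this notebook, and it uses inclusive bounds.

```python
def smallest_int_type(c_min, c_max):
    """Return the name of the narrowest signed integer type that can hold [c_min, c_max]."""
    for bits in (8, 16, 32, 64):
        lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        if lo <= c_min and c_max <= hi:
            return f"int{bits}"
    raise OverflowError("range does not fit in a 64-bit signed integer")

print(smallest_int_type(0, 1))          # a binary column such as `target`
print(smallest_int_type(-200, 40000))   # 40000 exceeds the int16 maximum of 32767
```

For the `target` column, which only holds 0 and 1, this kind of selection ends up at the 8-bit type.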

In [None]:
# Check dir contents
os.listdir('../input/quora-insincere-questions-classification')

In [None]:
# Load the training data.

path = '../input/quora-insincere-questions-classification/train.csv'
df_train = pd.read_csv(path)

print(df_train.shape)

df_train.head()

In [None]:
# Load the test data.

path = '../input/quora-insincere-questions-classification/test.csv'
df_test = pd.read_csv(path)

print(df_test.shape)

df_test.head()

# Investigate the datasets (train.csv && test.csv)

In [None]:
# Check info of the intact file
df_train.info() 

In [None]:
# Modify to reduce memory usage but keep the file's integrity
df_train = reduce_mem_usage(df_train)

In [None]:
df_train.info()

In [None]:
df_test = reduce_mem_usage(df_test)

**Investigate the "target" column**

In [None]:
# Print len(df_train) and len(df_test)
print("length of df_train: ", len(df_train))
print("length of df_test: ", len(df_test))

# sinc_q: sincere question // insinc_q: insincere
sinc_q = df_train[df_train.target == 0]
insinc_q = df_train[df_train.target == 1]
# Print number of sincere questions
print("Number of sincere questions: ", len(sinc_q))
# Print number of insincere questions
print("Number of insincere questions: ", len(insinc_q))
# Ratio of sinc_q and insinc_q with total number of questions
print("Percentage of sincere questions: ", len(sinc_q)/len(df_train)*100, "%")
print("Percentage of insincere questions: ", len(insinc_q)/len(df_train)*100, "%")

**Datasets overview**

**Train dataset(train.csv):**

Number of questions: 1306122

Columns: 3 (`qid`, `question_text` (text column), `target` (label column: 0 = sincere, 1 = insincere))

**Test dataset(test.csv):**

Number of questions: 375806

Columns: 2 (qid, question_text)

**"target" column of train dataset:**

The dataset is dominated by sincere questions and contains comparatively few insincere ones (1225312 sincere versus 80810 insincere, so the sincere:insincere ratio is approximately 15:1). This imbalance may lead to some problems:
* Accuracy should not be used as the main metric to judge the model, because predicting every question as sincere (target=0) already gives an accuracy of ~0.94.
=> We need other evaluation metrics such as F1 score or ROC AUC.
* Since there are so many sincere questions, the trained model will be good at detecting sincere questions, while its performance at detecting insincere questions will suffer.

Besides using other metrics, we can try a pre-trained model (BERT, ...) or boosting methods (LightGBM, XGBoost, ...).
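
To make the point about accuracy concrete, the following sketch scores a trivial baseline that labels every question as sincere, using the class counts reported above. The helper name is illustrative and not part of the notebook; the metrics are computed by hand to keep the example dependency-free.

```python
# Class counts taken from the dataset overview above.
N_SINCERE = 1225312
N_INSINCERE = 80810

def baseline_scores(n_negative, n_positive):
    """Score a classifier that predicts the negative class (0) for every sample."""
    tp = 0                 # no positive is ever predicted
    fp = 0
    fn = n_positive        # every insincere question is missed
    tn = n_negative
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # F1 = 2*TP / (2*TP + FP + FN); it is 0 when there are no true positives.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

accuracy, f1 = baseline_scores(N_SINCERE, N_INSINCERE)
print(f"accuracy = {accuracy:.4f}, F1 = {f1:.4f}")  # accuracy = 0.9381, F1 = 0.0000
```

The baseline looks strong on accuracy while detecting no insincere question at all, which is why F1 is the more informative metric here.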

# Text Preprocessing


**Remove punctuations**

In [None]:

def clean_text(x):
    puncts = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£', 
 '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', 
 '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', 
 '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', 
 '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]
    x = str(x)
    for punct in "/-'":
        x = x.replace(punct, ' ')
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
    for punct in puncts:
        x = x.replace(punct, f' {punct} ')
    return x



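**Mask numbers (replace digits with `#`)**

In [None]:
import re
def clean_numbers(x):
    # Mask the longest digit runs first so that long numbers are not split up.
    x = re.sub(r'[0-9]{5,}', '#####', x)
    x = re.sub(r'[0-9]{4}', '####', x)
    x = re.sub(r'[0-9]{3}', '###', x)
    x = re.sub(r'[0-9]{2}', '##', x)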
    return x
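
As a quick illustration of what this masking does, the sketch below repeats the same four substitutions and applies them to a made-up sentence. Runs of five or more digits collapse to `#####`, while a lone digit is left untouched because no pattern matches it. The sample sentence is invented for the example.

```python
import re

def clean_numbers(x):
    # Longest runs first, so a 5+ digit number is not split into shorter masks.
    x = re.sub(r'[0-9]{5,}', '#####', x)
    x = re.sub(r'[0-9]{4}', '####', x)
    x = re.sub(r'[0-9]{3}', '###', x)
    x = re.sub(r'[0-9]{2}', '##', x)
    return x

masked = clean_numbers("born in 1995, paid 123456 for 2 cars and 42 bikes")
print(masked)  # born in ####, paid ##### for 2 cars and ## bikes
```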

**Fix misspelled words and expand contractions**

In [None]:
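def _get_mispell(mispell_dict):
    # Try longer keys first so that e.g. "i'd've" is not shadowed by its prefix "i'd",
    # and escape the keys so that they are matched literally.
    keys = sorted(mispell_dict.keys(), key=len, reverse=True)
    mispell_re = re.compile('(%s)' % '|'.join(re.escape(k) for k in keys))
    return mispell_dict, mispell_re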
 
 
mispell_dict = {'colour':'color','centre':'center','didnt':'did not','doesnt':'does not',
                'isnt':'is not','shouldnt':'should not','favourite':'favorite','travelling':'traveling',
                'counselling':'counseling','theatre':'theater','cancelled':'canceled','labour':'labor',
                'organisation':'organization','wwii':'world war 2','citicise':'criticize','instagram': 'social medium',
                'whatsapp': 'social medium','snapchat': 'social medium',"ain't": "is not", 
                "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", 
                "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", 
                "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would",
                "he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", 
                "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", 
                "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", 
                "i'd": "i would", "i'd've": "i would have", "i'll": "i will","i'll've": "i will have",
                "i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", 
                "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have",
                "it's": "it is","let's": "let us", "ma'am": "madam", "mayn't": "may not", 
                "might've": "might have","mightn't": "might not","mightn't've": "might not have", 
                "must've": "must have", "mustn't": "must not", "mustn't've": "must not have",
                "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", 
                "oughtn't": "ought not","oughtn't've": "ought not have", "shan't": "shall not", 
                "sha'n't": "shall not","shan't've": "shall not have","she'd": "she would", 
                "she'd've": "she would have","she'll": "she will","she'll've": "she will have", 
                "she's": "she is","should've": "should have","shouldn't": "should not","shouldn't've": "should not have", 
                "so've": "so have","so's": "so as","this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", 
                "there'd": "there would", "there'd've": "there would have", "there's": "there is", 
                "here's": "here is","they'd": "they would", "they'd've": "they would have",
                "they'll": "they will", "they'll've": "they will have", 
                "they're": "they are", "they've": "they have", "to've": "to have", 
                "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", 
                "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
                "we've": "we have", "weren't": "were not", "what'll": "what will", 
                "what'll've": "what will have", "what're": "what are",  "what's": "what is",
                "what've": "what have", "when's": "when is", "when've": "when have", 
                "where'd": "where did", "where's": "where is", "where've": "where have", 
                "who'll": "who will", "who'll've": "who will have", "who's": "who is", 
                "who've": "who have", "why's": "why is", "why've": "why have", 
                "will've": "will have", "won't": "will not", "won't've": "will not have", 
                "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", 
                "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have",
                "y'all're": "you all are","y'all've": "you all have","you'd": "you would", 
                "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                "you're": "you are", "you've": "you have", 'colour': 'color', 'centre': 'center', 
                'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 
                'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 
                'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 
                'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 
                'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 
                'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do',
                'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 
                'mastrubation': 'masturbation', 'mastrubate': 'masturbate', 
                "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 
                'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 
                'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', 
                "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 
                'demonitization': 'demonetization', 'demonetisation': 'demonetization'
                }
 
mispellings, mispellings_re = _get_mispell(mispell_dict)
 
def replace_typical_misspell(text):
    def replace(match):
        return mispellings[match.group(0)]
 
    return mispellings_re.sub(replace, text)
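
A detail worth keeping in mind with this kind of dictionary-driven replacement: Python's `re` alternation tries the alternatives from left to right and takes the first one that matches, so a short key listed before a longer key that starts with it wins. The sketch below uses a made-up three-entry dictionary and sorts the keys longest-first, escaping them with `re.escape`. `build_replacer` is a hypothetical helper name.

```python
import re

def build_replacer(mapping):
    """Compile one regex for all keys and return a function that applies the mapping."""
    # Longest keys first, so "i'd've" is tried before its prefix "i'd".
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)

small_dict = {"i'd": "i would", "i'd've": "i would have", "colour": "color"}
fix = build_replacer(small_dict)
print(fix("i'd've picked another colour"))  # i would have picked another color
```

Note that the keys are matched as plain substrings, so they only fire while the apostrophes and the original casing they rely on are still present in the text.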

**Process**

In [None]:
# lower
df_train['prep_question_text'] = df_train['question_text'].str.lower()
df_test['prep_question_text'] = df_test['question_text'].str.lower()
 
# fix spellings and expand contractions first: the dictionary keys contain
# apostrophes and digits, which the next two steps would strip or mask
df_train['prep_question_text'] = df_train['prep_question_text'].apply(replace_typical_misspell)
df_test['prep_question_text'] = df_test['prep_question_text'].apply(replace_typical_misspell)
 
# clean the text
df_train['prep_question_text'] = df_train['prep_question_text'].apply(clean_text)
df_test['prep_question_text'] = df_test['prep_question_text'].apply(clean_text)
 
# clean numbers
df_train['prep_question_text'] = df_train['prep_question_text'].apply(clean_numbers)
df_test['prep_question_text'] = df_test['prep_question_text'].apply(clean_numbers)
 


In [None]:
# processed text
df_train.head()

In [None]:
# Save the preprocessed training data to the working directory
df_train.to_csv('prep.csv', columns=['qid', 'target', 'prep_question_text'], index=False)

df_prep = pd.read_csv('./prep.csv')

In [None]:
df_prep.groupby("target").size()


# Model Training (XGBoost and XLM-RoBERTa)

**XGBoost**

**XGBoost** is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples.
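
To give an intuition for the boosting idea itself, here is a minimal sketch of plain gradient boosting with decision stumps on a tiny made-up 1-D regression problem. It is not XGBoost's algorithm (there is no regularisation, no second-order information and no classification loss); the data, `fit_stump` and `boost` are all assumptions made for the illustration. Each round fits a stump to the residuals of the current ensemble and adds it with a small learning rate.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on a 1-D feature that minimises squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Additively fit stumps to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    stumps = []
    predict = lambda x: base + learning_rate * sum(s(x) for s in stumps)
    history = []  # history[k] is the training MSE after k stumps
    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        history.append(sum(r * r for r in residuals) / len(ys))
        stumps.append(fit_stump(xs, residuals))
    return predict, history

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
predict, mse_per_round = boost(xs, ys)
print(f"MSE with 0 stumps: {mse_per_round[0]:.3f}, "
      f"with {len(mse_per_round) - 1} stumps: {mse_per_round[-1]:.3f}")
```

The training error shrinks as stumps are added, which is the core mechanism that libraries such as XGBoost build on with far more engineering.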

In [None]:
# Install; it looks like XGBoost is pre-installed in the Kaggle kernel.

# !pip install xgboost

In [None]:
# from numpy import loadtxt
# from xgboost import XGBClassifier
# from sklearn.model_selection import train_test_split
# from sklearn.metrics import accuracy_score, f1_score, classification_report

In [None]:
# # Vectorizer data
# from sklearn.feature_extraction.text import TfidfVectorizer
# vectorizer = TfidfVectorizer(stop_words="english",
#                              ngram_range=(1, 3))
# X = vectorizer.fit_transform(df_train['prep_question_text'])
# x = vectorizer.transform(df_test['prep_question_text'])

In [None]:
# # Split train and test data
# from sklearn.model_selection import train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, df_train['target'], test_size=0.3, random_state=42)

In [None]:
# # Function for f1_score calculation
# from sklearn.metrics import f1_score, accuracy_score, classification_report
# def get_score(model, name):
#     # Evaluate on the held-out split only
#     y_pred = model.predict(X_test)
#     print(classification_report(y_test, y_pred), '\n')
#     print('{} model with F1 score = {}'.format(name, f1_score(y_test, y_pred)))

In [None]:
# # Fit (the variable is named xgb_model so that it does not shadow the xgb module)
# import xgboost as xgb
# xgb_model = xgb.XGBClassifier()
# xgb_model.fit(X_train, y_train)

In [None]:
# get_score(xgb_model, 'XGBClassifier')

**XLM-RoBERTa**

**XLM-RoBERTa** is a multilingual version of **RoBERTa**. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.

**RoBERTa** is a transformers model pretrained on a large corpus in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts.

More precisely, it was pretrained with the masked language modeling (MLM) objective. Taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
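
The masking step can be sketched in a few lines. This is a simplified illustration on whitespace-separated words, assuming a `<mask>` placeholder and a fixed seed; the actual pretraining works on subword tokens and has further details (for example, a selected token is sometimes kept or replaced by a random one) that are omitted here. `mask_tokens` is a hypothetical helper.

```python
import random

MASK_TOKEN = "<mask>"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random ~15% of tokens with a mask and keep the originals as labels."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    inputs = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    # Labels are only defined at masked positions; None means "not scored".
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return inputs, labels

tokens = "why do some people ask questions that can be answered with a quick search".split()
inputs, labels = mask_tokens(tokens)
print(inputs)
print(labels)
```

The model sees `inputs` and is trained to recover the non-`None` entries of `labels`, using context on both sides of each mask.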

In [None]:
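# # Undersample the sincere class to a 10:1 ratio (sincere:insincere) to reduce the imbalance
# from sklearn.utils import resample

# # replace=False so that no sincere question is drawn more than once
# df_train = pd.concat([resample(sinc_q, replace=False, n_samples=len(insinc_q) * 10), insinc_q])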
# df_train
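
The undersampling idea can be illustrated without pandas. The sketch below keeps every minority item and draws a random subset of the majority class without replacement, so no item appears twice. The toy data and the `undersample` helper are assumptions made for the example.

```python
import random

def undersample(majority, minority, ratio=10, seed=0):
    """Keep all minority items and a random subset of the majority, without replacement."""
    n_keep = min(len(majority), len(minority) * ratio)
    rng = random.Random(seed)
    kept = rng.sample(majority, n_keep)   # sample() never picks the same item twice
    combined = kept + list(minority)
    rng.shuffle(combined)
    return combined

# Toy data: 1000 "sincere" ids (label 0) and 20 "insincere" ids (label 1).
sincere = [(i, 0) for i in range(1000)]
insincere = [(i, 1) for i in range(1000, 1020)]
balanced = undersample(sincere, insincere)
n_pos = sum(label for _, label in balanced)
print(len(balanced), n_pos)  # 220 20
```

Resampling is normally applied to the training split only, so that validation metrics still reflect the original class ratio.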

**Create training folds**

In [None]:
# from sklearn.model_selection import KFold, StratifiedKFold

# # shuffle
# df = shuffle(df_train)

# # initialize kfold
# kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=1024)

# # for stratification
# y = df['target']

# # Note:
# # Each fold is a tuple ([train_index_values], [val_index_values])

# # Put the folds into a list. This is a list of tuples.
# fold_list = list(kf.split(df, y))

# train_df_list = []
# val_df_list = []

# for i, fold in enumerate(fold_list):

#     # kf.split returns positional indices, so select rows with iloc
#     # (df.index still holds the pre-shuffle labels)
#     fold_train = df.iloc[fold[0]]
#     fold_val = df.iloc[fold[1]]
    
#     train_df_list.append(fold_train)
#     val_df_list.append(fold_val)
    
    

# print(len(train_df_list))
# print(len(val_df_list))

In [None]:
# # Display one train fold

# df_train = train_df_list[0]

# df_train.head()

In [None]:
# # Display one val fold

# df_val = val_df_list[0]

# df_val.head()

In [None]:
# # Plot histogram of text lengths => optimize max tokens length
# word_length_list = [len(x.split()) for x in df_train['prep_question_text'] if len(x.split()) < 60]
# char_length_list = [len(x) for x in df_train['prep_question_text'] if len(x) < 100]
# fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True)
# axs[0].hist(word_length_list, bins=25)
# axs[0].set_title('Number of Words')

# axs[1].hist(char_length_list, bins=25)
# axs[1].set_title('Number of Characters')
# plt.show()

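From the above histograms, a maximum length of roughly 30-35 tokens should be fine for the model (XLM-RoBERTa). Note that the histograms count whitespace-separated words, and a subword tokenizer usually produces more tokens than words, so this is a lower estimate.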
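
Instead of reading the cutoff off a plot, it can also be derived from a percentile of the length distribution. The sketch below uses the nearest-rank percentile on word counts of a toy corpus; `length_at_percentile` is a hypothetical helper, and in practice the lengths would be measured on tokenizer output rather than on words.

```python
import math

def length_at_percentile(texts, q):
    """Return the word count at the q-th percentile of the texts (nearest-rank method)."""
    lengths = sorted(len(t.split()) for t in texts)
    rank = max(1, math.ceil(q * len(lengths) / 100))
    return lengths[rank - 1]

# Toy corpus: ten "questions" with 1 to 10 words each.
texts = [" ".join(["word"] * n) for n in range(1, 11)]
print(length_at_percentile(texts, 90))   # 9
print(length_at_percentile(texts, 100))  # 10
```

Choosing a high percentile rather than the maximum keeps padding small while truncating only a few unusually long questions.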

**XLM-RoBERTa Implementation**

In [None]:
# PRETRAINED_PATH = "../input/xlmroberta/xlm-roberta-base"


# L_RATE = 1e-5 # Optimized lrate<paper>
# MAX_LEN = 35

# NUM_EPOCHS = 2
# BATCH_SIZE_TRAIN = 32
# BATCH_SIZE_TEST = 64
# NUM_CORES = os.cpu_count()

# NUM_CORES

In [None]:
# # Tell PyTorch to use the TPU.    
# device = xm.xla_device()

# print(device)

In [None]:
# # Only use one fold
# df_train = train_df_list[0]

# df_train.head()

In [None]:
# df_val = val_df_list[0]

# df_val.head()

In [None]:
# from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification


# print('Loading XLMRoberta tokenizer...')
# tokenizer = XLMRobertaTokenizer.from_pretrained(PRETRAINED_PATH)

In [None]:
# df_train = df_train.reset_index(drop=True)
# df_val = df_val.reset_index(drop=True)

In [None]:
# class QuestionDataset(Dataset):

#     def __init__(self, df):
#         self.df_data = df



#     def __getitem__(self, index):

#         # get the sentence from the dataframe
#         sentence1 = self.df_data.loc[index, 'prep_question_text']
      
#         # Process the sentence
#         # ---------------------

#         encoded_dict = tokenizer.encode_plus(
#                     sentence1,        # Sentences to encode.
#                     add_special_tokens = True,      # Add the special tokens.
#                     max_length = MAX_LEN,           # Pad & truncate all sentences.
#                     padding = 'max_length',         # replaces the deprecated pad_to_max_length
#                     truncation = True,
#                     return_attention_mask = True,   # Construct attn. masks.
#                     return_tensors = 'pt',          # Return pytorch tensors.
#                )
        
#         # These are torch tensors.
#         padded_token_list = encoded_dict['input_ids'][0]
#         att_mask = encoded_dict['attention_mask'][0]
        
#         # Convert the target to a torch tensor
#         target = torch.tensor(self.df_data.loc[index, 'target'])

#         sample = (padded_token_list, att_mask, target)


#         return sample


#     def __len__(self):
#         return len(self.df_data)
    
    
    
    
    

# class TestDataset(Dataset):

#     def __init__(self, df):
#         self.df_data = df



#     def __getitem__(self, index):

#         # get the sentence from the dataframe
#         sentence1 = self.df_data.loc[index, 'prep_question_text']

#         # Process the sentence
#         # ---------------------

#         encoded_dict = tokenizer.encode_plus(
#                     sentence1,         # Sentence to encode.
#                     add_special_tokens = True,      # Add the special tokens.
#                     max_length = MAX_LEN,           # Pad & truncate all sentences.
#                     padding = 'max_length',         # replaces the deprecated pad_to_max_length
#                     truncation = True,
#                     return_attention_mask = True,   # Construct attn. masks.
#                     return_tensors = 'pt',          # Return pytorch tensors.
#                )
        
#         # These are torch tensors.
#         padded_token_list = encoded_dict['input_ids'][0]
#         att_mask = encoded_dict['attention_mask'][0]
        
               

#         sample = (padded_token_list, att_mask)


#         return sample


#     def __len__(self):
#         return len(self.df_data)

In [None]:
# train_data = QuestionDataset(df_train)
# val_data = QuestionDataset(df_val)
# test_data = TestDataset(df_test)

# train_dataloader = torch.utils.data.DataLoader(train_data,
#                                         batch_size=BATCH_SIZE_TRAIN,
#                                         shuffle=True,
#                                        num_workers=NUM_CORES)

# val_dataloader = torch.utils.data.DataLoader(val_data,
#                                         batch_size=BATCH_SIZE_TRAIN,
#                                         shuffle=True,
#                                        num_workers=NUM_CORES)

# test_dataloader = torch.utils.data.DataLoader(test_data,
#                                         batch_size=BATCH_SIZE_TEST,
#                                         shuffle=False,
#                                        num_workers=NUM_CORES)



# print(len(train_dataloader))
# print(len(val_dataloader))
# print(len(test_dataloader))

In [None]:
# # Get one train batch

# padded_token_list, att_mask, target = next(iter(train_dataloader))

# print(padded_token_list.shape)
# print(att_mask.shape)
# print(target.shape)

In [None]:
# # Get one val batch

# padded_token_list, att_mask, target = next(iter(val_dataloader))

# print(padded_token_list.shape)
# print(att_mask.shape)
# print(target.shape)

In [None]:
# # Get one test batch

# padded_token_list, att_mask = next(iter(test_dataloader))

# print(padded_token_list.shape)
# print(att_mask.shape)

In [None]:
# from transformers import XLMRobertaForSequenceClassification

# model = XLMRobertaForSequenceClassification.from_pretrained(
#     PRETRAINED_PATH, 
#     num_labels = 2, # The number of output labels. 2 for binary classification.
# )

# # Send the model to the device.
# model.to(device)

In [None]:
# # Create a batch of train samples
# # We will set a small batch size of 8 so that the model's output can be easily displayed.

# train_dataloader = torch.utils.data.DataLoader(train_data,
#                                         batch_size=8,
#                                         shuffle=True,
#                                        num_workers=NUM_CORES)

# b_input_ids, b_input_mask, b_labels = next(iter(train_dataloader))

# print(b_input_ids.shape)
# print(b_input_mask.shape)
# print(b_labels.shape)

In [None]:
# # Pass a batch of train samples to the model.

# batch = next(iter(train_dataloader))

# # Send the data to the device
# b_input_ids = batch[0].to(device)
# b_input_mask = batch[1].to(device)
# b_labels = batch[2].to(device)

# # Run the model
# outputs = model(b_input_ids, 
#                         attention_mask=b_input_mask, 
#                         labels=b_labels)

# # The output is a tuple (loss, preds).
# outputs

In [None]:
# outputs

In [None]:
# # The output is a tuple: (loss, preds)

# len(outputs)

In [None]:
# # This is the loss.

# outputs[0]

In [None]:
# # These are the predictions.

# outputs[1]

In [None]:
# preds = outputs[1].detach().cpu().numpy()

# y_true = b_labels.detach().cpu().numpy()
# y_pred = np.argmax(preds, axis=1)

# y_pred
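
The second element of `outputs` holds raw, unnormalised scores (logits), one per class, which is why `np.argmax` is enough to get the predicted label. The sketch below shows the relation between logits, softmax probabilities and the argmax on made-up values, in plain Python; the logits are hypothetical and not produced by the model.

```python
import math

def softmax(logits):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a batch of three questions: [sincere, insincere].
batch_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5]]
probs = [softmax(row) for row in batch_logits]
y_pred = [row.index(max(row)) for row in batch_logits]
print(y_pred)  # [0, 1, 1]
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits or of the probabilities gives the same label.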

In [None]:
# # This is the accuracy without fine tuning.

# val_acc = accuracy_score(y_true, y_pred)

# val_acc

In [None]:
# # The loss and preds are Torch tensors

# print(type(outputs[0]))
# print(type(outputs[1]))

In [None]:
# # Define the optimizer
# optimizer = AdamW(model.parameters(),
#               lr = L_RATE,  # use the learning rate defined above (1e-5)
#               eps = 1e-8 
#             )

In [None]:
# # Create the dataloaders.

# train_data = QuestionDataset(df_train)
# val_data = QuestionDataset(df_val)
# test_data = TestDataset(df_test)

# train_dataloader = torch.utils.data.DataLoader(train_data,
#                                         batch_size=BATCH_SIZE_TRAIN,
#                                         shuffle=True,
#                                        num_workers=NUM_CORES)

# val_dataloader = torch.utils.data.DataLoader(val_data,
#                                         batch_size=BATCH_SIZE_TRAIN,
#                                         shuffle=True,
#                                        num_workers=NUM_CORES)

# test_dataloader = torch.utils.data.DataLoader(test_data,
#                                         batch_size=BATCH_SIZE_TEST,
#                                         shuffle=False,
#                                        num_workers=NUM_CORES)



# print(len(train_dataloader))
# print(len(val_dataloader))
# print(len(test_dataloader))

In [None]:
# %%time


# # Set the seed.
# seed_val = 101

# random.seed(seed_val)
# np.random.seed(seed_val)
# torch.manual_seed(seed_val)
# torch.cuda.manual_seed_all(seed_val)

# # Store the average loss after each epoch so we can plot them.
# loss_values = []


# # For each epoch...
# for epoch in range(0, NUM_EPOCHS):
    
#     print("")
#     print('======== Epoch {:} / {:} ========'.format(epoch + 1, NUM_EPOCHS))
    

#     stacked_val_labels = []
#     targets_list = []

#     # ========================================
#     #               Training
#     # ========================================
    
#     print('Training...')
    
#     # put the model into train mode
#     model.train()
    
#     # This turns gradient calculations on and off.
#     torch.set_grad_enabled(True)


#     # Reset the total loss for this epoch.
#     total_train_loss = 0

#     for i, batch in enumerate(train_dataloader):
        
#         train_status = 'Batch ' + str(i) + ' of ' + str(len(train_dataloader))
        
#         print(train_status, end='\r')


#         b_input_ids = batch[0].to(device)
#         b_input_mask = batch[1].to(device)
#         b_labels = batch[2].to(device)

#         model.zero_grad()        


#         outputs = model(b_input_ids, 
#                     attention_mask=b_input_mask,
#                     labels=b_labels)
        
#         # Get the loss from the outputs tuple: (loss, logits)
#         loss = outputs[0]
        
#         # Convert the loss from a torch tensor to a number.
#         # Calculate the total loss.
#         total_train_loss = total_train_loss + loss.item()
        
#         # Zero the gradients
#         optimizer.zero_grad()
        
#         # Perform a backward pass to calculate the gradients.
#         loss.backward()
        
        
#         # Clip the norm of the gradients to 1.0.
#         # This is to help prevent the "exploding gradients" problem.
#         torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        
        
#         # Use the optimizer to update the weights.
        
#         # Optimizer for GPU
#         # optimizer.step() 
        
#         # Optimizer for TPU
#         # https://pytorch.org/xla/
#         xm.optimizer_step(optimizer, barrier=True)

    
#     print('Train loss:' ,total_train_loss)


#     # ========================================
#     #               Validation
#     # ========================================
    
#     print('\nValidation...')

#     # Put the model in evaluation mode.
#     model.eval()

#     # Turn off the gradient calculations.
#     # This tells the model not to compute or store gradients.
#     # This step saves memory and speeds up validation.
#     torch.set_grad_enabled(False)
    
    
#     # Reset the total loss for this epoch.
#     total_val_loss = 0
    

#     for j, batch in enumerate(val_dataloader):
        
#         val_status = 'Batch ' + str(j) + ' of ' + str(len(val_dataloader))
        
#         print(val_status, end='\r')

#         b_input_ids = batch[0].to(device)
#         b_input_mask = batch[1].to(device)
#         b_labels = batch[2].to(device)      


#         outputs = model(b_input_ids, 
#                 attention_mask=b_input_mask, 
#                 labels=b_labels)
        
#         # Get the loss from the outputs tuple: (loss, logits)
#         loss = outputs[0]
        
#         # Convert the loss from a torch tensor to a number.
#         # Calculate the total loss.
#         total_val_loss = total_val_loss + loss.item()
        

#         # Get the preds
#         preds = outputs[1]


#         # Move preds to the CPU
#         val_preds = preds.detach().cpu().numpy()
        
#         # Move the labels to the cpu
#         targets_np = b_labels.to('cpu').numpy()

#         # Append the labels to a numpy list
#         targets_list.extend(targets_np)

#         if j == 0:  # first batch
#             stacked_val_preds = val_preds

#         else:
#             stacked_val_preds = np.vstack((stacked_val_preds, val_preds))

    
#     # Calculate the validation accuracy
#     y_true = targets_list
#     y_pred = np.argmax(stacked_val_preds, axis=1)
    
#     val_acc = accuracy_score(y_true, y_pred)
#     val_f1 = f1_score(y_true, y_pred)
    
#     print('Val loss:' ,total_val_loss)
#     print('Val acc: ', val_acc)
#     print('Val F1: ', val_f1)


#     # Save the Model
#     torch.save(model.state_dict(), 'model.pt')
    
#     # Use the garbage collector to save memory.
#     gc.collect()
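
The training loop above clips the gradient norm to 1.0 before the optimizer step. As a rough illustration of what that clipping does, here is a minimal pure-Python sketch. `clip_by_global_norm` is a hypothetical helper written for this note, not a PyTorch function, and it skips the small epsilon that `torch.nn.utils.clip_grad_norm_` adds to the denominator.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        # Already small enough: leave the gradients untouched.
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

# Toy gradient with norm 5.0, clipped down to norm 1.0
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

The direction of the gradient is preserved; only its length is capped, which is why clipping helps against exploding gradients without changing which way the weights move.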

In [None]:
# for j, batch in enumerate(test_dataloader):
        
#         inference_status = 'Batch ' + str(j+1) + ' of ' + str(len(test_dataloader))
        
#         print(inference_status, end='\r')

#         b_input_ids = batch[0].to(device)
#         b_input_mask = batch[1].to(device)


#         outputs = model(b_input_ids, 
#                 attention_mask=b_input_mask)
        
        
#         # Get the preds
#         preds = outputs[0]


#         # Move preds to the CPU
#         preds = preds.detach().cpu().numpy()
        
#         # The test set has no labels, so there are no targets to collect here.
        
#         # Stack the predictions.

#         if j == 0:  # first batch
#             stacked_preds = preds

#         else:
#             stacked_preds = np.vstack((stacked_preds, preds))

In [None]:
# stacked_preds

In [None]:
# # Take the argmax. This returns the column index of the max value in each row.

# preds = np.argmax(stacked_preds, axis=1)

# preds
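
The argmax step above works directly on the raw logits. A small sketch (pure Python, with made-up logits) of why that is safe: softmax is monotonic, so the largest logit and the largest probability always sit in the same column. The `softmax` and `argmax` helpers are written here for illustration only.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)

# Made-up logits for three questions, columns = (sincere, insincere)
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1]]
labels_from_logits = [argmax(r) for r in logits]
labels_from_probs = [argmax(softmax(r)) for r in logits]
print(labels_from_logits, labels_from_probs)
```

With two classes, taking the argmax is the same as thresholding the insincere probability at 0.5; a different threshold would need the softmax scores instead.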

In [None]:
# # Load the sample submission.
# # The row order in the test set and the sample submission is the same.

# path = '../input/quora-insincere-questions-classification/sample_submission.csv'

# df_sample = pd.read_csv(path)

# print(df_sample.shape)

# df_sample.head()

In [None]:
# # Assign the preds to the prediction column

# df_sample['prediction'] = preds

# df_sample.head()

In [None]:
# # Create a submission csv file
# # Note that for this competition the submission file must be named submission.csv.
# df_sample.to_csv('submission.csv', index=False)

In [None]:
# !ls

In [None]:
# # Check the distribution of the predicted classes.

# df_sample['prediction'].value_counts()

# Using GPU for RoBERTa

In [None]:
!pip install pytorch-lightning
!pip install transformers
!pip install sentencepiece
!pip install fairseq

In [None]:
# try plain data to reproduce error
import pytorch_lightning as pl
from torch.utils.data import DataLoader, Dataset, random_split
from typing import Optional

train_ratio = 0.8
DATA_DIR = "../input/quora-insincere-questions-classification/train.csv"
train = pd.read_csv(DATA_DIR)
train.head()

In [None]:
class QuestionData(Dataset):
    """
    Dataset class for question analysis.
    Every custom PyTorch dataset should subclass Dataset,
    which requires two methods: __len__ and __getitem__.
    """
    def __init__(self, data_dir):
        """
        Args:
            data_dir (string): Path to the csv file
        """
        self.df = pd.read_csv(data_dir, index_col=0)

    def __len__(self):
        """
        Length of the dataset, i.e. the number of rows in the csv file
        Returns: int
        """
        return len(self.df)

    def __getitem__(self, idx):
        """
        Given a row position, returns the corresponding row of the csv file
        Returns: text (string), label (int)
        """
        # index_col=0 makes qid the index, so rows are looked up by position
        text = self.df["question_text"].iloc[idx]
        label = self.df["target"].iloc[idx]

        return text, label


class QuestionDataModule(pl.LightningDataModule):
    """
    Data module for question analysis. It loads the data and serves it to the model
    through the train and validation dataloaders.
    It is a subclass of LightningDataModule.
    """

    def __init__(self, data_dir: str = DATA_DIR, batch_size: int = 32):
        """
        Args:
            data_dir (string): Directory with the csv file
            batch_size (int): batch size for dataloader
        """
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size

    def setup(self, stage: Optional[str] = None):
        """
        Loads the csv file and splits it into training and validation subsets.
        The data is loaded in the setup function so that it is not reloaded for every dataloader.
        """
        data_full = QuestionData(self.data_dir)
        train_size = round(len(data_full) * train_ratio)
        val_size = len(data_full) - train_size
        print(len(data_full), train_size, val_size)
        self.data_train, self.data_val = random_split(data_full, [train_size, val_size])

    def train_dataloader(self):
        """
        Returns: dataloader for training
        """
        return DataLoader(self.data_train, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        """
        Returns: dataloader for validation
        """
        # Validation metrics do not depend on batch order, so no shuffling is needed
        return DataLoader(self.data_val, batch_size=self.batch_size, shuffle=False)

# Quick sanity check: print the first few training batches
if __name__ == "__main__":
    dm = QuestionDataModule(DATA_DIR)
    dm.setup()
    idx = 0
    for item in dm.train_dataloader():
        print(idx)
        print(item)
        idx += 1
        if idx > 5:
            break
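
One thing to watch with the split above: `random_split` draws a fresh permutation each time `setup()` runs, so calling `setup()` more than once can put different questions in the validation set. `random_split` accepts a `generator` argument (a seeded `torch.Generator`) to pin the split. The sketch below shows the same idea with the standard library only; `seeded_split` is a hypothetical helper, not part of PyTorch.

```python
import random

def seeded_split(n_items, train_ratio=0.8, seed=123):
    """Split row positions 0..n_items-1 into train/val lists, reproducibly."""
    indices = list(range(n_items))
    # A private Random instance keeps the split independent of global RNG state.
    random.Random(seed).shuffle(indices)
    train_size = round(n_items * train_ratio)
    return indices[:train_size], indices[train_size:]

train_a, val_a = seeded_split(10)
train_b, val_b = seeded_split(10)
print(len(train_a), len(val_a), train_a == train_b)
```

Calling the helper twice with the same seed returns the same partition, which keeps validation scores comparable between runs.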

In [None]:
from os.path import join as pjoin

import sentencepiece as spm
from fairseq.data import Dictionary
from transformers import PreTrainedTokenizer


class XLMRobertaTokenizer(PreTrainedTokenizer):
    """
    XLM-RoBERTa tokenizer built on transformers.PreTrainedTokenizer. It converts the input text into
    token ids, e.g.
    
    input: "Hello, how are you?" output: [1, 2, 3, 65, 2, 1]
    
    This class also provides the method to convert the token ids back into the original text,
    
    e.g. input: [1, 2, 3, 65, 2, 1] output: "Hello, how are you?"
    
    (the ids shown here are illustrative, not real vocabulary entries)
    
    """
    def __init__(
            self,
            pretrained_file,
            bos_token="<s>",
            eos_token="</s>",
            sep_token="</s>",
            cls_token="<s>",
            unk_token="<unk>",
            pad_token="<pad>",
            mask_token="<mask>",
            **kwargs
    ):
        """
        :param pretrained_file: path to the pretrained model directory (must contain sentencepiece.bpe.model)
        :param bos_token: beginning of sentence token
        :param eos_token: end of sentence token
        :param sep_token: separation token
        :param cls_token: classification token
        :param unk_token: unknown token
        :param pad_token: padding token
        :param mask_token: mask token
        """
        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            unk_token=unk_token,
            sep_token=sep_token,
            cls_token=cls_token,
            pad_token=pad_token,
            mask_token=mask_token,
            **kwargs,
        )
        # load the sentencepiece model and the fairseq vocab file
        sentencepiece_model = pjoin(pretrained_file, 'sentencepiece.bpe.model')
        vocab_file = '../input/robertabase/roberta-base-dict.txt'
        self.sp_model = spm.SentencePieceProcessor()
        # NOTE: use sp_model only to split text into pieces (see _tokenize);
        # relying on anything else from it breaks the pipeline. Token ids come from bpe_dict.
        self.sp_model.Load(sentencepiece_model)
        self.bpe_dict = Dictionary.load(vocab_file)
        # Mimic the fairseq token-to-id alignment for the first 4 tokens
        self.fairseq_tokens_to_ids = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
        # The first "real" token "," has position 4 in the original fairseq vocab and position 3 in the spm vocab.
        # Ids here are read from the fairseq dictionary itself, so no offset is applied.
        self.fairseq_offset = 0
        self.fairseq_tokens_to_ids["<mask>"] = len(self.bpe_dict) + self.fairseq_offset
        self.fairseq_ids_to_tokens = {v: k for k, v in self.fairseq_tokens_to_ids.items()}

    def _tokenize(self, text):
        """ Tokenize a string. """
        return self.sp_model.EncodeAsPieces(text)

    def _convert_token_to_id(self, token):
        """ Converts a token (str) to an id using the vocab. """
        if token in self.fairseq_tokens_to_ids:
            return self.fairseq_tokens_to_ids[token]
        spm_id = self.bpe_dict.index(token)
        return spm_id

    def _convert_id_to_token(self, index):
        """Converts an index (integer) to a token (str) using the vocab."""
        if index in self.fairseq_ids_to_tokens:
            return self.fairseq_ids_to_tokens[index]
        return self.bpe_dict[index]

    @property
    def vocab_size(self):
        """ Size of the base vocabulary (without the added tokens) """
        return len(self.bpe_dict) + self.fairseq_offset + 1  # Add the <mask> token

    def get_vocab(self):
        """ Returns the vocabulary as a dict mapping each token to its id. """
        vocab = {self.convert_ids_to_tokens(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab
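
The lookup order used by `_convert_token_to_id` and `_convert_id_to_token` (special tokens first, then the dictionary, with unknown pieces falling back to `<unk>`) can be sketched with a toy vocabulary. Everything below is illustrative: `TOY_VOCAB` and its pieces are made up, and plain Python containers stand in for the fairseq `Dictionary`.

```python
# Special tokens use the same ids as fairseq_tokens_to_ids in the class above.
SPECIAL_IDS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}

# Hypothetical vocabulary; a real dictionary has tens of thousands of pieces.
TOY_VOCAB = ["<s>", "<pad>", "</s>", "<unk>", "▁Hello", ",", "▁how"]
TOY_INDEX = {tok: i for i, tok in enumerate(TOY_VOCAB)}

def token_to_id(token):
    """Special tokens first, then the vocabulary, else the <unk> id."""
    if token in SPECIAL_IDS:
        return SPECIAL_IDS[token]
    return TOY_INDEX.get(token, SPECIAL_IDS["<unk>"])

def id_to_token(index):
    """Inverse lookup; out-of-range ids map to <unk>."""
    return TOY_VOCAB[index] if 0 <= index < len(TOY_VOCAB) else "<unk>"

pieces = ["<s>", "▁Hello", ",", "▁how", "▁zzz", "</s>"]
ids = [token_to_id(p) for p in pieces]
print(ids)
print([id_to_token(i) for i in ids])
```

Note that the round trip is lossy for unseen pieces: `▁zzz` maps to the `<unk>` id and comes back as `<unk>`.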

In [None]:
from transformers import XLMRobertaConfig, XLMRobertaForSequenceClassification
import torch

pretrained_path = '../input/xlmroberta/xlm-roberta-base' 
!ls $pretrained_path
# load the pretrained model and the tokenizer
roberta = XLMRobertaForSequenceClassification.from_pretrained(pretrained_path)
tokenizer = XLMRobertaTokenizer(pretrained_path)
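
The model's forward pass later calls this tokenizer with `padding=True, truncation=True, max_length=256`. As a rough picture of what those options produce, the sketch below pads a batch of id lists to a common width and builds the matching attention mask. `pad_batch` is a hypothetical helper, the ids are made up, and the truncation is simplified: it just cuts the list, whereas the real tokenizer keeps the closing `</s>` token.

```python
def pad_batch(id_lists, pad_id=1, max_length=256):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = [ids[:max_length] for ids in id_lists]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# pad_id=1 matches the "<pad>" id used by the tokenizer class above
input_ids, attention_mask = pad_batch([[0, 4, 5, 2], [0, 6, 2]], max_length=4)
print(input_ids)
print(attention_mask)
```

Padding only to the longest question in each batch, rather than always to 256, keeps batches of short questions cheap.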

In [None]:
from sklearn.metrics import roc_auc_score, classification_report, accuracy_score


class QuestionRoberta(pl.LightningModule):
    """
    QuestionRoberta class inherits from LightningModule
    This class is used to train a model using PyTorch Lightning
    It overrides the following methods:
        - forward : forward pass of the model
        - training_step : training step of the model
        - validation_step : validation step of the model
        - validation_epoch_end : end of the validation epoch
        - configure_optimizers : configure optimizers
    """
    def __init__(self, lr_roberta, lr_classifier):
        """
        Initialize the model with the following parameters:
            - lr_roberta : learning rate of the roberta model
            - lr_classifier : learning rate of the classifier model
        """
        super().__init__()
        self.roberta = XLMRobertaForSequenceClassification.from_pretrained(pretrained_path)
        self.tokenizer = XLMRobertaTokenizer(pretrained_path)
        self.lr_roberta = lr_roberta
        self.lr_classifer = lr_classifier

    def forward(self, texts, labels=None):
        """
        Forward pass of the model
        Args:
            - texts : input texts
            - labels : labels of the input texts
        """
        inputs = self.tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=256)
        for key in inputs:
            inputs[key] = inputs[key].to(self.device)

        outputs = self.roberta(**inputs, labels=labels)
        return outputs

    def configure_optimizers(self):
        """
        Configure optimizers
        Builds an AdamW optimizer with two parameter groups, so that the RoBERTa encoder
        and the classifier head each use their own learning rate, plus a StepLR scheduler
        """
        roberta_params = self.roberta.roberta.named_parameters()
        classifier_params = self.roberta.classifier.named_parameters()

        grouped_params = [
            {"params": [p for n, p in roberta_params], "lr": self.lr_roberta},
            {"params": [p for n, p in classifier_params], "lr": self.lr_classifer}
        ]
        optimizer = torch.optim.AdamW(
            grouped_params
        )
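        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.98)
        return {
            'optimizer': optimizer,
            'lr_scheduler': {
                'scheduler': scheduler,
                'monitor': 'f1/val',
                # Step the scheduler after every batch so that step_size counts optimizer steps;
                # with Lightning's default per-epoch stepping, step_size=1000 would mean 1000 epochs.
                'interval': 'step',
            }
        }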

    def training_step(self, batch, batch_idx):
        """
        Training step of the model
        Args:
            - batch : batch of the data
            - batch_idx : index of the batch
        """
        texts, labels = batch
        outputs = self(texts, labels=labels)

        # The model output exposes the loss by name when labels are passed
        return outputs.loss

    def validation_step(self, batch, batch_idx):
        """
        Validation step of the model, used to compute the metrics
        Args:
            - batch : batch of the data
            - batch_idx : index of the batch
        """
        texts, labels = batch
        outputs = self(texts, labels=labels)

        # Read the fields by name instead of unpacking by position
        loss, logits = outputs.loss, outputs.logits

        output_scores = torch.softmax(logits, dim=-1)
        return loss, output_scores, labels

    def validation_epoch_end(self, validation_step_outputs):
        """
        Called at the end of each validation epoch. It gathers the outputs of all validation
        steps and computes the classification metrics (loss, AUC, accuracy, precision, recall, F1)
        Args:
            - validation_step_outputs : outputs of the validation step
        """

        val_preds = torch.tensor([], device=self.device)
        val_scores = torch.tensor([], device=self.device)
        val_labels = torch.tensor([], device=self.device)
        val_loss = 0
        total_item = 0

        for idx, item in enumerate(validation_step_outputs):
            loss, output_scores, labels = item

            predictions = torch.argmax(output_scores, dim=-1)
            val_preds = torch.cat((val_preds, predictions), dim=0)
            val_scores = torch.cat((val_scores, output_scores[:, 1]), dim=0)
            val_labels = torch.cat((val_labels, labels), dim=0)

            val_loss += loss
            total_item += 1

        # print("VAL PREDS", val_preds.shape)
        # print("VAL SCORES", val_scores.shape)
        # print("VAL LABELS", val_labels.shape)
        val_preds = val_preds.cpu().numpy()
        val_scores = val_scores.cpu().numpy()
        val_labels = val_labels.cpu().numpy()

        reports = classification_report(val_labels, val_preds, output_dict=True)
        print("VAL LABELS", val_labels)
        print("VAL SCORES", val_scores)
        try:
            auc = roc_auc_score(val_labels, val_scores)
        except Exception as e:
            print(e)
            print("Cannot calculate AUC. Default to 0")
            auc = 0
        accuracy = accuracy_score(val_labels, val_preds)

        print(classification_report(val_labels, val_preds))

        # Log the mean loss per batch rather than the running sum
        self.log("loss/val", val_loss / max(total_item, 1))
        self.log("auc/val", auc)
        self.log("accuracy/val", accuracy)
        self.log("precision/val", reports["weighted avg"]["precision"])
        self.log("recall/val", reports["weighted avg"]["recall"])
        self.log("f1/val", reports["weighted avg"]["f1-score"])
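
The precision, recall and F1 logged above are the "weighted avg" rows of the classification report, i.e. per-class scores weighted by class support. On a heavily imbalanced dataset that number is dominated by the majority class and can look much better than the F1 of the insincere class, which is what `f1_score(y_true, y_pred)` in the earlier commented BERT loop measures. The sketch below uses made-up predictions (94 sincere, 6 insincere) and hand-written helpers to show the gap; it is an illustration, not a result from this notebook.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    total = sum(f1_for_class(y_true, y_pred, c) * y_true.count(c) for c in classes)
    return total / len(y_true)

# Made-up example: the model finds 2 of 6 insincere questions and raises 1 false alarm
y_true = [0] * 94 + [1] * 6
y_pred = [0] * 93 + [1] + [1] * 2 + [0] * 4

positive_f1 = f1_for_class(y_true, y_pred, 1)
weighted = weighted_f1(y_true, y_pred)
print(round(positive_f1, 3), round(weighted, 3))
```

In this toy case the weighted score stays above 0.9 while the insincere-class F1 is below 0.5, so the two should not be compared directly.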

In [None]:
trainer = pl.Trainer(
    fast_dev_run=True,
)
model = QuestionRoberta(lr_roberta=1e-5, lr_classifier=3e-3)
dm = QuestionDataModule()

trainer.fit(model, dm)
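
The two learning rates passed here (`1e-5` for the encoder, `3e-3` for the classifier) are both decayed by the `StepLR(step_size=1000, gamma=0.98)` scheduler from `configure_optimizers`. StepLR has a simple closed form, sketched below in plain Python. `step_lr` is a hypothetical helper and `step` counts how many times the scheduler has been stepped, which depends on the `interval` setting (Lightning's default is once per epoch).

```python
def step_lr(base_lr, step, step_size=1000, gamma=0.98):
    """Learning rate after `step` scheduler steps under StepLR."""
    return base_lr * gamma ** (step // step_size)

for step in (0, 999, 1000, 5000):
    # Both parameter groups decay by the same factor, from different starting points.
    print(step, step_lr(1e-5, step), step_lr(3e-3, step))
```

The rate stays flat inside each block of 1000 steps and drops by 2% at every block boundary, so the decay is gentle.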

In [None]:
from pytorch_lightning import loggers as pl_loggers
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

torch.manual_seed(123)

tb_logger = pl_loggers.TensorBoardLogger('/tb_logs/')

trainer = pl.Trainer(
    min_epochs=1,
    max_epochs=2,
    gpus=1,
    precision=16,
    val_check_interval=0.5,
    # check_val_every_n_epoch=1,
    callbacks=[
      ModelCheckpoint(
          dirpath='/ckpt',
          save_top_k=3,
          monitor='f1/val',
          mode='max',  # higher F1 is better, so keep the highest-scoring checkpoints
      ), 
      EarlyStopping('f1/val', patience=5, mode='max')
    ],
    fast_dev_run=False,
    logger=tb_logger
)

dm.setup(stage="fit")
trainer.fit(model, dm)
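
`EarlyStopping` and `ModelCheckpoint` compare the monitored value in the direction given by their `mode` argument, which in recent Lightning versions defaults to `'min'`. For a score such as F1 the direction should be `'max'`. The sketch below is a simplified stand-in for the patience logic (no `min_delta`, made-up scores); `early_stop_index` is a hypothetical helper, not the Lightning implementation.

```python
def early_stop_index(scores, patience, mode="max"):
    """Return the index of the validation check that triggers a stop, or None."""
    best = None
    bad_checks = 0
    for i, score in enumerate(scores):
        if best is None:
            improved = True
        elif mode == "max":
            improved = score > best
        else:
            improved = score < best
        if improved:
            best = score
            bad_checks = 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                return i
    return None

# Made-up F1 values from four validation checks, steadily improving
f1_history = [0.60, 0.64, 0.66, 0.67]
print(early_stop_index(f1_history, patience=2, mode="max"))
print(early_stop_index(f1_history, patience=2, mode="min"))
print(early_stop_index(f1_history, patience=5, mode="max"))
```

With the wrong direction a steadily improving F1 counts as "no improvement" and training stops early. Also note that `max_epochs=2` with `val_check_interval=0.5` gives roughly four validation checks, so a patience of 5 is unlikely to trigger at all.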

In [None]:
# # Kaggle banned Tensorboard. Got my account locked for days.
# %reload_ext tensorboard
# %tensorboard --logdir '/tb_logs/'