# Sentence Embeddings for Regression with RAPIDS

## Summary
1. Obtaining Sentence Embeddings from Transformers
2. Using Them for Regression

### Installing RAPIDS and other requirements

[RAPIDS](https://rapids.ai) lets you run many NumPy-, pandas- and scikit-learn-style operations and models entirely on the GPU for higher performance.

In [None]:
import sys
!cp ../input/rapids/rapids.0.18.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path 
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/

In [None]:
import cudf as pd # pandas on GPU
import cupy as np # numpy on GPU
from cuml.decomposition import PCA # scikit-learn on GPU
!pip install sentence-transformers
from sentence_transformers import SentenceTransformer # built on PyTorch
import gc
import torch
import matplotlib.pyplot as plt
import matplotlib as mpl

### Reading and Processing Text Data

In [None]:
df_train = pd.read_csv('../input/commonlitreadabilityprize/train.csv')
df_test  = pd.read_csv('../input/commonlitreadabilityprize/test.csv')

In [None]:
df_train.head()

In [None]:
df_train.shape

We concatenate the text from the train and test sets so that both are processed together.

In [None]:
n0 = df_train.shape[0]
# lower-case the excerpts before embedding (the source text is mixed case)
# txt = [sent3.lower() for df in (df_train, df_test) for t in df.excerpt.to_array() for sent in t.split('\n') for sent2 in sent.split('.') for sent3 in sent2.split(';')]
txt = [t.lower() for df in (df_train, df_test) for t in df.excerpt.to_array()]
ids = [id_ for df in (df_train, df_test) for id_ in df.id.to_array() ]
txt[:3]

In [None]:
IDS = []
TXT = []
for t, id_ in zip(txt, ids):
    sent = [w.replace('\\', '') for w in t.split('\n')]
    TXT += sent
    for _ in range(len(sent)):
        IDS.append(id_)
TXT[:3]

In [None]:
txt = []
ids = []
for t, id_ in zip(TXT, IDS):
    sent = [t]
    for char in ['...', '.', ';', '!', '?', '"']:
        sent = [w for t_ in sent for w in t_.split(char)]
    sent = list(filter(lambda w: len(w)>1, sent))
    txt += sent
    for _ in range(len(sent)):
        ids.append(id_)
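
The two splitting passes above can also be written as a single helper. The sketch below is one way to package them, not the notebook's own code: `split_excerpts` is a hypothetical name, and it uses `re.split` with the same delimiters (newline, `.`, `;`, `!`, `?`, `"`) and the same `len > 1` filter, while carrying each excerpt's id along with every fragment.

```python
import re

# Same delimiters as the loops above; '...' needs no special case because
# splitting on '.' already breaks it up and the empty pieces are filtered out.
_DELIMS = re.compile(r'[\n.;!?"]')

def split_excerpts(texts, text_ids):
    """Return (fragments, fragment_ids) with one id per kept fragment."""
    fragments, fragment_ids = [], []
    for text, id_ in zip(texts, text_ids):
        cleaned = text.replace('\\', '')          # drop stray backslashes
        for piece in _DELIMS.split(cleaned):
            if len(piece) > 1:                    # same filter as above
                fragments.append(piece)
                fragment_ids.append(id_)
    return fragments, fragment_ids

demo_frags, demo_ids = split_excerpts(
    ["first one. second one!\nthird", "only one"], ["a", "b"])
print(demo_frags)
print(demo_ids)
```

Note that, as in the original loops, fragments keep their leading whitespace; only fragments of length 0 or 1 are dropped.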

In [None]:
# txt, ids

Transformer models are an effective way to obtain text embeddings.
I will compute raw sentence embeddings with the paraphrase-trained DistilRoBERTa model (`paraphrase-distilroberta-base-v1`). See the [sentence-transformers repository](https://github.com/UKPLab/sentence-transformers) or [sbert.net](https://www.sbert.net) for more on this model.

### Model Preparation

In [None]:
if torch.cuda.is_available():  # check that the kernel has a GPU enabled
    print('CUDA is available')

In [None]:
model = SentenceTransformer('paraphrase-distilroberta-base-v1', device='cuda')
print(f'Initial sequence length in paraphrase distilroberta : {model.max_seq_length}')
print(f'First sentence : {txt[0]}\nCorresponding tokens : {model.tokenizer(txt[0])}')
print(f"Maximal sequence length in our text data : {max([len(model.tokenizer(t)['input_ids']) for t in txt])}")
model.max_seq_length = 150
print(f'Resized sequence length in paraphrase distilroberta : {model.max_seq_length}')

### Raw RoBERTa Embeddings

In [None]:
txt_encoded = np.array(model.encode(txt, normalize_embeddings=True))
txt_encoded.shape
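
With `normalize_embeddings=True`, each returned vector is scaled to unit L2 norm, so the dot product of two embeddings equals their cosine similarity. A minimal pure-Python sketch of that normalisation, using made-up 3-dimensional vectors rather than real 768-dimensional embeddings (`l2_normalize` and `dot` are hypothetical helpers, not part of sentence-transformers):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

u = l2_normalize([3.0, 4.0, 0.0])
v = l2_normalize([4.0, 3.0, 0.0])
print(dot(u, u))   # squared length of a normalised vector
print(dot(u, v))   # cosine similarity of the two original vectors
```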

In [None]:
plt.hist(np.var(txt_encoded, axis=0).get(), bins=100)
plt.title('Variance on the 768 Embedding Coordinates')
plt.show()

In [None]:
train_id_set = set(df_train.id.to_pandas())  # build the lookup once instead of once per fragment
train_ids = [i for i in ids if i in train_id_set]

In [None]:
n0 = len(train_ids)

In [None]:
x_train, x_test = txt_encoded[:n0, :], txt_encoded[n0:, :]
x_train.shape
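
This slice relies on the fragments being ordered train first, then test, which holds here because `txt` and `ids` were built by iterating over `(df_train, df_test)` in that order. A small sanity check along these lines can make the assumption explicit; `split_boundary` is a hypothetical helper and the ids are toy values.

```python
def split_boundary(fragment_ids, train_id_set):
    """Count the train fragments and verify that none appears after them."""
    n_train = sum(1 for id_ in fragment_ids if id_ in train_id_set)
    if any(id_ in train_id_set for id_ in fragment_ids[n_train:]):
        raise ValueError("train and test fragments are interleaved")
    return n_train

toy_ids = ["a", "a", "b", "t1", "t1", "t2"]      # train fragments first, then test
n_train = split_boundary(toy_ids, {"a", "b"})
toy_rows = list(range(len(toy_ids)))              # stand-in for embedding rows
train_rows, test_rows = toy_rows[:n_train], toy_rows[n_train:]
print(n_train, train_rows, test_rows)
```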

In [None]:
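# look up each fragment's target once on the CPU, then move the result to the GPU
target_by_id = df_train[['id', 'target']].to_pandas().set_index('id')['target']
targets = np.array(target_by_id.loc[train_ids].to_numpy())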
targets

### Simple Predictive Network

In [None]:
import torch
import torch.utils.data as Data

torch.manual_seed(1)    # reproducible

net = torch.nn.Sequential(
        torch.nn.Linear(768, 200),
        torch.nn.LeakyReLU(),
        torch.nn.Linear(200, 200),
        torch.nn.LeakyReLU(),
        torch.nn.Linear(200, 100),
        torch.nn.LeakyReLU(),
        torch.nn.Linear(100, 1),
    )
net.cuda()

optimizer = torch.optim.Adam(net.parameters(), lr=0.001)
loss_func = torch.nn.MSELoss()  # mean squared error, the standard regression loss

BATCH_SIZE = 64
EPOCH = 10

torch_dataset = Data.TensorDataset(torch.tensor(x_train, device='cuda'), torch.tensor(targets, device='cuda'))
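
For a sense of scale, the layer sizes above imply a little over 200,000 trainable parameters. The count follows from `in_features * out_features + out_features` per `Linear` layer (`LeakyReLU` adds none). The helper below does the arithmetic in plain Python; `mlp_parameter_count` is a hypothetical name, not a PyTorch function.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# Layer widths taken from the network defined above: 768 -> 200 -> 200 -> 100 -> 1
n_params = mlp_parameter_count([768, 200, 200, 100, 1])
print(n_params)
```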

In [None]:
from tqdm.notebook import tqdm
loader = Data.DataLoader(
    dataset=torch_dataset, 
    batch_size=BATCH_SIZE, 
    shuffle=True)
# start training
l = []
for epoch in tqdm(range(EPOCH)):
    total_loss = 0
    for step, (b_x, b_y) in enumerate(loader): # for each training step
        
#         b_x = Variable(batch_x)
#         b_y = Variable(batch_y)

        prediction = net(b_x)     # input x and predict based on x

        loss = loss_func(prediction.float(), b_y.float())     # must be (1. nn output, 2. target)
        total_loss += loss.item() / len(loader)  # running mean of the batch losses
        optimizer.zero_grad()   # clear gradients for next train
        loss.backward()         # backpropagation, compute gradients
        optimizer.step()        # apply gradients
    l.append(total_loss)
plt.plot(l)
plt.show()

In [None]:
pred = net(torch.tensor(x_test, device='cuda'))
pred

That's not great: the network predicts almost a constant. We'll have to improve that!
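
One plausible contributor, stated as an assumption rather than a confirmed diagnosis of this run: `net` returns a tensor of shape `(batch, 1)` while `b_y` has shape `(batch,)`. `torch.nn.MSELoss` broadcasts the two to `(batch, batch)`, so the loss averages over every prediction/target pair instead of over matched pairs, and the output that minimises it is the batch mean. The plain-Python sketch below mimics both computations on toy numbers (the function names are hypothetical):

```python
def mse_matched(pred, target):
    """Mean squared error over matched pairs (pred[i], target[i])."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def mse_all_pairs(pred, target):
    """What broadcasting (batch, 1) against (batch,) computes: every pair (pred[i], target[j])."""
    return sum((p - t) ** 2 for p in pred for t in target) / (len(pred) * len(target))

toy_target = [-2.0, 0.0, 2.0]
perfect = list(toy_target)                 # predictions equal to the targets
constant = [0.0, 0.0, 0.0]                 # always predict the mean target

print(mse_matched(perfect, toy_target), mse_matched(constant, toy_target))
print(mse_all_pairs(perfect, toy_target), mse_all_pairs(constant, toy_target))
```

Under the all-pairs loss the constant prediction scores better than the perfect one. If this is what happens in the training loop, calling `prediction.squeeze(1)` (or `b_y.unsqueeze(1)`) before `loss_func` would make the two shapes match.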

In [None]:
sub = pd.read_csv('../input/commonlitreadabilityprize/sample_submission.csv')
sub.head()

In [None]:
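test_id_set = set(df_test.id.to_pandas())  # build the lookup once instead of once per fragment
test_ids = [i for i in ids if i in test_id_set]
# flatten the (n, 1) predictions to 1-D, then average the fragment scores per excerpt
sub = pd.DataFrame({'id': test_ids, 'target': pred.detach().cpu().numpy().ravel()}).groupby('id').mean()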
sub.head()
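
The `groupby('id').mean()` step collapses the fragment-level predictions into one score per excerpt. In plain Python the same aggregation looks roughly like this (toy ids and values; `mean_by_id` is a hypothetical helper):

```python
from collections import defaultdict

def mean_by_id(fragment_ids, fragment_preds):
    """Average the fragment predictions that share an excerpt id."""
    sums, counts = defaultdict(float), defaultdict(int)
    for id_, pred in zip(fragment_ids, fragment_preds):
        sums[id_] += pred
        counts[id_] += 1
    return {id_: sums[id_] / counts[id_] for id_ in sums}

toy_scores = mean_by_id(["t1", "t1", "t2"], [1.0, 3.0, -0.5])
print(toy_scores)
```

An unweighted mean treats a three-word fragment the same as a full sentence; weighting by fragment length would be one alternative worth trying.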

In [None]:
sub.to_csv('submission.csv', index=True)