# PCA on Sentence Embeddings with RAPIDS

This notebook aims to provide [**optimal text embeddings**](https://www.kaggle.com/louise2001/embeddings-actuarial-loss-competition) for the claim descriptions of the [actuarial loss competition](https://www.kaggle.com/c/actuarial-loss-estimation). I detail the whole procedure and illustrate the resulting vector representations.

## Summary :
1. Obtaining Sentence Embeddings from Transformers
2. Principal Components Analysis with RAPIDS
3. Data Analysis and Visualization

### Installing RAPIDS and other requirements

[RAPIDS](https://rapids.ai) lets you run many NumPy-, pandas- and scikit-learn-style manipulations and models on the GPU for higher performance.

In [None]:
import sys
!cp ../input/rapids/rapids.0.18.0 /opt/conda/envs/rapids.tar.gz
!cd /opt/conda/envs/ && tar -xzvf rapids.tar.gz > /dev/null
sys.path = ["/opt/conda/envs/rapids/lib/python3.7/site-packages"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib/python3.7"] + sys.path
sys.path = ["/opt/conda/envs/rapids/lib"] + sys.path 
!cp /opt/conda/envs/rapids/lib/libxgboost.so /opt/conda/lib/

In [None]:
import cudf as pd # pandas on GPU
import cupy as np # numpy on GPU
from cuml.decomposition import PCA # scikit-learn on GPU
!pip install sentence-transformers
from sentence_transformers import SentenceTransformer # PyTorch supported
import gc
import torch
import matplotlib.pyplot as plt
import matplotlib as mpl
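
The `pd` and `np` aliases above work because cuDF and CuPy mirror a large part of the pandas and NumPy APIs. As a minimal sketch (the helper name `get_array_module` is hypothetical and not part of either library), array code can be written against whichever backend is usable, falling back to NumPy when no GPU is found. The full module name `numpy` is used here so that the notebook's `np` alias is left untouched.

```python
import numpy

def get_array_module():
    """Return cupy if it imports and sees a GPU, otherwise numpy (hypothetical helper)."""
    try:
        import cupy
        if cupy.cuda.runtime.getDeviceCount() > 0:
            return cupy
    except Exception:
        pass  # cupy is missing or no usable CUDA device was found
    return numpy

xp = get_array_module()
a = xp.arange(6, dtype=xp.float32).reshape(2, 3)
col_means = a.mean(axis=0)  # the same call works on either backend
print(xp.__name__, col_means)
```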

### Reading and Processing Text Data

In [None]:
df_train = pd.read_csv('../input/actuarial-loss-estimation/train.csv')
df_test  = pd.read_csv('../input/actuarial-loss-estimation/test.csv')

We concatenate the text from the train and test sets in order to process them together. This gives the PCA more data to fit on and ensures that both sets are projected onto the same axes.

In [None]:
n0 = df_train.shape[0]
# text is upper case in source data, but models are lower case
txt = [t.lower() for df in (df_train, df_test) for t in df.ClaimDescription.to_array()]
txt[:3]
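
The only bookkeeping needed to process both sets together is the train row count `n0`: rows before it belong to train, rows from it onward belong to test. A toy sketch in plain Python, with made-up strings rather than competition data and hypothetical variable names:

```python
# Made-up stand-ins for the two ClaimDescription columns
train_txt = ["SLIPPED ON WET FLOOR", "DROPPED BOX ON FOOT"]
test_txt = ["TRIPPED ON STAIRS"]

n0_toy = len(train_txt)
all_txt = [t.lower() for part in (train_txt, test_txt) for t in part]

# Any row-wise transform keeps the order, so slicing at n0_toy splits it back
word_counts = [len(t.split()) for t in all_txt]
train_part, test_part = word_counts[:n0_toy], word_counts[n0_toy:]
print(train_part, test_part)
```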

Transformer models are an effective way of obtaining high-quality text embeddings.
I will compute raw sentence embeddings with the paraphrase-trained DistilRoBERTa. You can read more about this model [here](https://github.com/UKPLab/sentence-transformers) or [here](https://www.sbert.net).

Why did I choose the model trained on the paraphrase-scoring task? We are trying to extract the overall meaning of each description, in order to get an idea of the severity or lasting consequences of an accident. A model trained to judge whether two sentences mean the same thing therefore seemed a related and adequate choice for getting my raw embeddings.

### Model Preparation

In [None]:
if torch.cuda.is_available(): # check if GPU enabled kernel
    print('Cuda !')

In [None]:
model = SentenceTransformer('paraphrase-distilroberta-base-v1', device='cuda')
print(f'Initial sequence length in paraphrase distilroberta : {model.max_seq_length}')
print(f'First sentence : {txt[0]}\nCorresponding tokens : {model.tokenizer(txt[0])}')
print(f"Maximal sequence length in our text data : {max([len(model.tokenizer(t)['input_ids']) for t in txt])}")
# reduce max_seq_length for faster computation (removes a lot of useless <PAD> tokens)
# note: sentences longer than max_seq_length tokens get truncated
model.max_seq_length = 25
print(f'Resized sequence length in paraphrase distilroberta : {model.max_seq_length}')
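
Capping `max_seq_length` only loses information if some sentences tokenize to more tokens than the cap, which the maximum printed above lets you check. As a hedged sketch, the share of affected sentences can be measured from the token counts; the helper name `truncated_share` and the counts below are made up for illustration.

```python
def truncated_share(token_counts, cap):
    """Fraction of sentences with more than `cap` tokens (these would be truncated)."""
    if not token_counts:
        return 0.0
    return sum(1 for n in token_counts if n > cap) / len(token_counts)

# Made-up token counts; in the notebook they would come from
# len(model.tokenizer(t)['input_ids']) for each sentence t
toy_counts = [8, 10, 12, 9, 14, 11, 30, 10]
share = truncated_share(toy_counts, cap=25)
print(f"{share:.1%} of the toy sentences exceed the cap")
```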

### Raw Roberta embeddings

In [None]:
txt_encoded = np.array(model.encode(txt, normalize_embeddings=True))
txt_encoded.shape

We get 768-dimensional, normalized vector representations of our 90,000 sentences. That's quite big, and not very efficient: since our sentences share the same writing style, their embeddings are probably highly correlated, which is not great for building a model that can differentiate injury severity. Moreover, we would like the coordinates to carry explanatory power, and ideally to lie along orthonormal axes.

In [None]:
plt.hist(np.var(txt_encoded, axis=0).get(), bins=100)
plt.title('Variance on the 768 Embedding Coordinates')
plt.show()

Indeed, the variance along each embedding axis is very low (for comparison, coordinates sampled from $\mathcal{U}([0,1])$ would have a variance of $\frac{1}{12} \approx 0.0833$ along each of the 768 axes). Part of this is a direct effect of the normalization, but it is also consistent with what we expected: the data is highly collinear, with each embedding coordinate concentrated around nearly the same value for all sentences.
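
Because the embeddings are L2-normalized, the squared coordinates of each vector sum to 1, so the variances of the 768 coordinates sum to at most 1 and their mean cannot exceed $\frac{1}{768} \approx 0.0013$. A small sketch on random unit vectors (synthetic data, not the claim embeddings, and plain NumPy on CPU rather than CuPy) illustrates this bound:

```python
import numpy

rng = numpy.random.default_rng(0)
d = 768
x = rng.normal(size=(2000, d))
x /= numpy.linalg.norm(x, axis=1, keepdims=True)  # L2-normalize each row

per_axis_var = x.var(axis=0)
total_var = per_axis_var.sum()  # equals 1 - ||mean vector||^2, hence at most 1
print(per_axis_var.mean(), 1 / d)
```

So the histogram is best read relative to this scale rather than to $\frac{1}{12}$ alone.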

### Principal Component Analysis

In [None]:
pca = PCA(n_components=2, copy=True, random_state=0, svd_solver='jacobi', whiten=True, verbose=True)
txt_encoded_pca = pca.fit_transform(txt_encoded)
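
With `whiten=True`, each projected component is rescaled to unit variance, so the returned coordinates are uncorrelated and on the same scale. The following is a plain NumPy sketch of PCA with whitening via SVD on synthetic data; it illustrates the idea and is not cuML's implementation, and the function name `pca_whiten` is hypothetical.

```python
import numpy

def pca_whiten(x, n_components):
    """Project centered data on the top principal axes, scaled to unit variance."""
    xc = x - x.mean(axis=0)
    u, s, vt = numpy.linalg.svd(xc, full_matrices=False)
    # Columns of u have unit norm; sqrt(n - 1) rescales them to unit sample variance
    return u[:, :n_components] * numpy.sqrt(x.shape[0] - 1)

rng = numpy.random.default_rng(0)
# Correlated synthetic features (not the embeddings)
toy = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))
z = pca_whiten(toy, n_components=2)
cov = numpy.cov(z, rowvar=False)  # close to the 2x2 identity matrix
print(cov.round(6))
```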

How much of the total variance is explained by these first 2 axes?

In [None]:
pca.explained_variance_ratio_  # share of total variance per component
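
To read the answer as a share of the total, each component's variance has to be divided by the total variance of the data. A short sketch on synthetic data, using the eigenvalues of the covariance matrix (plain NumPy, hypothetical helper name, for illustration only):

```python
import numpy

def explained_variance_ratio(x, n_components):
    """Share of total variance carried by each of the top principal components."""
    eigvals = numpy.linalg.eigvalsh(numpy.cov(x, rowvar=False))  # ascending order
    top = eigvals[::-1][:n_components]
    return top / eigvals.sum()

rng = numpy.random.default_rng(0)
# Synthetic data with per-axis standard deviations 3, 2, 1, 1 (not the embeddings);
# the population shares of the two largest axes are 9/15 = 0.6 and 4/15
toy = rng.normal(size=(5000, 4)) * numpy.array([3.0, 2.0, 1.0, 1.0])
ratio = explained_variance_ratio(toy, n_components=2)
print(ratio)
```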

That's not much... Let's visualize how our claim descriptions scatter along the first 2 principal axes.

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
ax.scatter(txt_encoded_pca[:, 0].get(), txt_encoded_pca[:, 1].get(), c='r')
ax.set_xlabel('First Principal Component')
ax.set_ylabel('Second Principal Component')
plt.show()

Isn't that cute? Now, let's see how much these axes explain about our problem. We will look at how the accidents scatter along these 2 axes, colored by the target we wish to predict, the ultimate cost.

### Clustering on Target Value

In [None]:
def quantile(array, threshold):
    array_ = np.sort(array.flatten())
    return array_[int(threshold * array_.shape[0])]
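
Note that this helper indexes the sorted array directly, so `threshold=1.0` would point one past the last element. A plain NumPy sketch of a clamped variant (the name `quantile_clamped` is hypothetical), compared with `numpy.quantile`, which by default interpolates linearly between neighboring values:

```python
import numpy

def quantile_clamped(array, threshold):
    """Index-based quantile that stays in bounds for threshold in [0, 1]."""
    array_ = numpy.sort(numpy.asarray(array).ravel())
    idx = min(int(threshold * array_.shape[0]), array_.shape[0] - 1)
    return array_[idx]

values = numpy.array([5.0, 1.0, 4.0, 2.0, 3.0])
q80 = quantile_clamped(values, 0.8)       # sorted index int(0.8 * 5) = 4
q100 = quantile_clamped(values, 1.0)      # clamped to the last element
q80_interp = numpy.quantile(values, 0.8)  # interpolated between order statistics
print(q80, q100, q80_interp)
```

The two definitions differ slightly on small arrays, which matters little for a plotting cutoff.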

In [None]:
# take only train data as we want to explain the target, here ultimate cost
x, y = txt_encoded_pca[:n0, 0].get(), txt_encoded_pca[:n0, 1].get()
c = df_train.UltimateIncurredClaimCost.values
# for readability, drop the extreme values above the 80% quantile
trunc = quantile(c, 0.8)
keep = (c <= trunc).get()  # boolean mask moved to host memory
x, y, c = x[keep], y[keep], c.get()[keep]  # everything is NumPy from here on
fig, ax = plt.subplots(figsize=(10,10))
ax.scatter(x, y, c=c, cmap='jet')
ax.set_xlabel('First Principal Component')
ax.set_ylabel('Second Principal Component')

cmap = mpl.cm.jet
norm = mpl.colors.Normalize(vmin=float(c.min()), vmax=float(c.max()))

cbar = fig.colorbar(mpl.cm.ScalarMappable(norm=norm, cmap=cmap),
             ax=ax, orientation='vertical', label='Ultimate Cost')

plt.savefig('Ultimate_cost_based_on_text.png', dpi=100)
plt.show()

We notice a very clear separation of the two halves of the heart into 2 subgroups, the left half with lower ultimate costs, the right half with higher ultimate costs.

In conclusion, the data provided in the [Embeddings dataset](https://www.kaggle.com/louise2001/embeddings-actuarial-loss-competition) was generated from the first 20 orthonormal principal components. I provide the source code below.

In [None]:
n_comp = 20
pca = PCA(n_components=n_comp, copy=True, random_state=0, svd_solver='jacobi', whiten=True, verbose=True)
txt_encoded_pca = pca.fit_transform(txt_encoded)
cols = [f'X_{i}' for i in range(n_comp)]
embeddings_train = pd.DataFrame(txt_encoded_pca[:n0, :], columns=cols)
embeddings_test = pd.DataFrame(txt_encoded_pca[n0:, :], columns=cols)
embeddings_train.to_csv(f'embeddings_train_{n_comp}.csv', index=False)
embeddings_test.to_csv(f'embeddings_test_{n_comp}.csv', index=False)