<a href="https://colab.research.google.com/github/unpackAI/unpackai/blob/main/examples/nlp_regression_toxity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP tasks
> From unpackAI

In [1]:
!pip install -q unpackai==0.1.8.9
!pip install -Uqq fastai
!pip install -q transformers

In [None]:
from fastai.text.all import *
from unpackai.nlp import HFTextBlock
from transformers import AutoModel, AutoTokenizer

## Regression Problem

### About the problem
Acute toxicity LD50 measures the most conservative dose that can lead to lethal adverse effects. 

> We want to design a model that takes a chemical structure as input and predicts its likely toxicity, so that experiments can be prioritized toward the more promising drug candidates.
* X: **SMILES** string (simplified molecular-input line-entry system), a way to write the structure of an organic molecule as a text string from which the structure can be reconstructed
* Y: **LD50 toxicity level**, the dose of the compound that kills around 50% of the lab rats

### About dataset
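[TDC](https://tdcommons.ai/start/) (Therapeutics Data Commons) describes itself as the first unifying framework to systematically access and evaluate machine learning across the entire range of therapeutics. The `LD50_Zhu` dataset loaded below comes from its toxicity (`Tox`) collection.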

### Download dataset

In [2]:
!pip install -q PyTDC


In [3]:
from tdc.single_pred import Tox
data = Tox(name = 'LD50_Zhu')

Downloading...
100%|██████████| 707k/707k [00:01<00:00, 382kiB/s] 
Loading...
Done!


In [4]:
df = data.get_data()
df

| | Drug_ID | Drug | Y |
|---|---|---|---|
| 0 | Diazene, diphenyl-, 1-oxide | `[O-][N+](=Nc1ccccc1)c1ccccc1` | 2.505 |
| 1 | Methane, tribromo- | `BrC(Br)Br` | 2.343 |
| 2 | Bromoethene (9CI) | `C=CBr` | 2.330 |
| 3 | 1,1'-Biphenyl, hexabromo- | `Brc1ccc(-c2ccc(Br)c(Br)c2Br)c(Br)c1Br` | 1.465 |
| 4 | Isothiocyanic acid, p-bromophenyl ester | `S=C=Nc1ccc(Br)cc1` | 2.729 |
| ... | ... | ... | ... |
| 7380 | Isocyanic acid, (alpha,alpha,alpha-trifluoro-o... | `O=C=Nc1ccccc1C(F)(F)F` | 1.427 |
| 7381 | 1,3-Benzenediamine, 4-(1,1,2,2-tetrafluoroetho... | `Nc1ccc(OC(F)(F)C(F)F)c(N)c1` | 2.321 |
| 7382 | Propanenitrile, 3-((2-(acetyloxy)ethyl)phenyla... | `CC(=O)OCCN(CCC#N)c1ccccc1` | 2.050 |
| 7383 | IH-Indene-1,3(2H)-dione, 2-(2-methyl-1-(2-(1-p... | `CC1=CC(=C2C(=O)c3ccccc3C2=O)C=CN1CCN1CCCCC1` | 2.951 |


In [5]:
df.describe()

| | Y |
|---|---|
| count | 7385.0 |
| mean | 2.5444 |
| std | 0.958667 |
| min | -0.343 |
| 25% | 1.856 |
| 50% | 2.368 |
| 75% | 3.035 |
| max | 10.207 |


Center the target by subtracting 2.5, which is close to the mean of `Y` (2.54), so that the model predicts values around zero.

In [34]:
df['target'] = df["Y"]-2.5
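
Because the model is trained on the shifted column, its predictions live on the shifted scale too and need the offset added back before they are read as LD50 values. A minimal sketch of that round trip in plain Python, using the first five `Y` values from the table above; the helper names `center` and `uncenter` are illustrative and not part of fastai or unpackai:

```python
OFFSET = 2.5  # same constant that is subtracted from Y in the notebook

def center(y, offset=OFFSET):
    """Shift raw Y values so they sit around zero."""
    return [v - offset for v in y]

def uncenter(pred, offset=OFFSET):
    """Map values on the centered scale back to the original Y scale."""
    return [p + offset for p in pred]

y_raw = [2.505, 2.343, 2.330, 1.465, 2.729]  # first rows of df["Y"]
targets = center(y_raw)       # what the model is trained to predict
restored = uncenter(targets)  # what a prediction should be converted to
```

The same `uncenter` step would apply to whatever the trained model outputs for a new molecule.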

## Download pretrained model

In [7]:
pretrained = AutoModel.from_pretrained("seyonec/SMILES_tokenized_PubChem_shard00_160k")
tokenizer = AutoTokenizer.from_pretrained("seyonec/SMILES_tokenized_PubChem_shard00_160k")

Some weights of the model checkpoint at seyonec/SMILES_tokenized_PubChem_shard00_160k were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Data
### Tokenizer
Let's see what the tokenization of a SMILES string looks like.

It works exactly as it does for ordinary sentences: the string is split into tokens, and each token is mapped to an integer id.

In [8]:
tokenizer("CC1=CC(=C2C(=O)c3ccccc3C2=O)C=CN1CCN1CCCCC1")

{'input_ids': [12, 16, 16, 20, 22, 16, 16, 17, 22, 16, 21, 16, 17, 22, 19, 18, 15, 26, 15, 15, 15, 15, 15, 26, 16, 21, 22, 19, 18, 16, 22, 16, 23, 20, 16, 16, 23, 20, 16, 16, 16, 16, 16, 20, 13], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
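
The output above contains 45 ids for a 43-character string, which is consistent with one token per SMILES symbol plus a start and an end token (ids 12 and 13 here). The sketch below illustrates that kind of symbol-level split with a regular expression. It is an illustration only, not the pretrained tokenizer's actual implementation; the pattern and the `split_smiles` name are assumptions, and the pattern covers common SMILES symbols rather than the full grammar.

```python
import re

# Bracket atoms, two-letter halogens, single-letter atoms, ring closures, bonds, branches
SMILES_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[=#\-+\\/().:~@*$]"
)

def split_smiles(smiles):
    """Split a SMILES string into symbol-level tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Fail loudly if the pattern skipped any character
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized symbol in {smiles!r}")
    return tokens

tokens = split_smiles("CC1=CC(=C2C(=O)c3ccccc3C2=O)C=CN1CCN1CCCCC1")
print(len(tokens))                # number of symbols, before any special tokens
print(split_smiles("BrC(Br)Br"))  # "Br" is kept as a single token
```

Multi-character atoms such as `Br`, `Cl`, or bracketed atoms like `[N+]` are the reason a plain character split is not enough in general.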

### Datablock

In [35]:
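dblock = DataBlock(
    blocks=[
        # Input: SMILES strings tokenized with the pretrained tokenizer
        HFTextBlock(tokenizer, max_length=128, padding=True),
        # Target: a single continuous value
        RegressionBlock,
    ],
    get_x=ColReader("Drug"),    # column holding the SMILES string
    get_y=ColReader("target"),  # column holding the centered LD50 value
)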

### Dataloaders


In [37]:
dls = dblock.dataloaders(df, bs=32)
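
No `splitter` is passed to the `DataBlock`, so the train/validation split is left to fastai's default. Assuming that default is a random split that holds out 20% of the rows, the idea can be sketched in plain Python as follows; `random_split` is an illustrative helper, not fastai's implementation, and the seed is an arbitrary choice:

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Return (train_idx, valid_idx) for a shuffled hold-out split."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    cut = int(valid_pct * n_items)
    # The first `cut` shuffled indices form the validation set
    return idx[cut:], idx[:cut]

# 7385 is the row count reported by df.describe()
train_idx, valid_idx = random_split(7385, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))
```

For reproducible experiments, one option is to pass `splitter=RandomSplitter(valid_pct=0.2, seed=42)` to the `DataBlock` so that the validation rows stay the same between runs.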

In [41]:
dls.train_ds[3]

('CCOP(=O)(Oc1ccc(Cl)cc1Cl)Oc1ccc(Cl)cc1Cl', tensor(-0.6730))

A preview of one batch of data:

In [44]:
x, y = dls.one_batch()

In [45]:
x.shape, y.shape

(torch.Size([32, 53]), torch.Size([32]))
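
The batch is 53 tokens wide rather than the `max_length=128` set in the block, which suggests that sequences are padded only up to the longest one in the batch, with `max_length` acting as an upper bound. A minimal sketch of that padding step in plain Python; the `pad_batch` helper and the `pad_id` of 0 are hypothetical and not taken from unpackai or the tokenizer:

```python
def pad_batch(sequences, pad_id=0, max_length=128):
    """Truncate to max_length, then right-pad every sequence to the longest one."""
    clipped = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in clipped)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in clipped]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in clipped]
    return input_ids, attention_mask

# Three toy sequences of different lengths (ids are illustrative)
batch = [[12, 16, 16, 13], [12, 16, 13], [12, 16, 16, 20, 22, 13]]
input_ids, attention_mask = pad_batch(batch)
```

Padding to the longest sequence in each batch keeps tensors rectangular while avoiding the cost of always processing 128 positions.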

## Create a model with pretrained weights

In [54]:
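class NLPClassifier(nn.Module):
    """Pretrained transformer encoder with a single-output linear head.

    Despite the name, the head emits one value per input, so it acts as a regressor here.
    """
    def __init__(self, pretrained):
        super().__init__()
        self.pretrained = pretrained
        # Map the pooled sequence representation to one scalar (the centered LD50 target)
        self.top = nn.Linear(pretrained.config.hidden_size, 1)

    def forward(self, x):
        # x: batch of token ids, shape (batch_size, seq_len)
        output = self.pretrained(x)
        # Result shape: (batch_size, 1)
        return self.top(output.pooler_output)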

In [55]:
model = NLPClassifier(pretrained)

## Training

In [56]:
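# MSELossFlat flattens predictions (bs, 1) and targets (bs,) first; plain nn.MSELoss would broadcast them to (bs, bs)
learn = Learner(model=model, dls=dls, loss_func=MSELossFlat())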

In [None]:
learn.fit(2)

epoch,train_loss,valid_loss,time
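
The `train_loss` and `valid_loss` columns report mean squared error on the centered target. Taking the square root gives an error in the same units as `Y`, which is easier to compare with the standard deviation of about 0.96 reported by `df.describe()`. A small sketch in plain Python; the values in `preds` are made up for illustration and are not model output:

```python
import math

def mse(preds, targets):
    """Mean squared error between two equal-length sequences."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

targets = [0.005, -0.157, -0.170, -1.035, 0.229]  # first five Y values minus 2.5
preds = [0.0, 0.0, 0.0, 0.0, 0.0]                 # illustrative baseline: always predict the offset
loss = mse(preds, targets)
rmse = math.sqrt(loss)  # same units as Y
```

A constant prediction like this one makes a useful baseline: a trained model should reach a validation RMSE clearly below the spread of the target itself.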
