<a href="https://colab.research.google.com/github/isaacmg/task-vt/blob/re_model_revised/training_data_explore.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Creating Training Data for Relation Extraction
This brief notebook shows how the training and test data were created for the relation extraction task. The goal of this pre-processing is to put the data in a format roughly analogous to the GAD training data used for BioBERT. The GAD data is a tab-separated (TSV) file in which each row holds a sentence, with the gene and disease mentions replaced by the `@GENE$` and `@DISEASE$` mask tokens, followed by a binary label (0 or 1).
```
Our findings indicate that the @GENE$ -141C Ins allele and the 5-HTTLPR S allele are genetic risk factors for alcoholism in Mexican-Americans, and that smoking modulates the association between genetic risk factors and @DISEASE$.	1

Our matched case-control and family study indicate that Cx50, but not @GENE$, may play a role in the genetic susceptibility to @DISEASE$.	0
```
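
As a quick illustration of this layout, the sketch below loads the two example rows above with pandas. The names `gad_sample` and `gad_df` and the column names `sentence` and `label` are illustrative assumptions; the GAD files themselves carry no header for the training split shown here.

```python
import csv
import io

import pandas as pd

# The two example rows shown above: sentence, a tab, then the 0/1 label.
gad_sample = (
    "Our findings indicate that the @GENE$ -141C Ins allele and the 5-HTTLPR S allele "
    "are genetic risk factors for alcoholism in Mexican-Americans, and that smoking "
    "modulates the association between genetic risk factors and @DISEASE$.\t1\n"
    "Our matched case-control and family study indicate that Cx50, but not @GENE$, "
    "may play a role in the genetic susceptibility to @DISEASE$.\t0\n"
)

# Column names are assumptions made for readability.
gad_df = pd.read_csv(
    io.StringIO(gad_sample),
    sep="\t",
    header=None,
    names=["sentence", "label"],
    quoting=csv.QUOTE_NONE,  # keep any quote characters inside sentences literally
)
print(gad_df["label"].tolist())
```

Every sentence carries both mask tokens, and the label is the last tab-separated field.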

In [0]:
!gsutil cp gs://coronaviruspublicdata/snapshot_re_4_12_2020/drugvisdata.xlsx	drugvis.xlsx
!gsutil cp gs://coronaviruspublicdata/snapshot_re_4_12_2020/FakeSentences_For_Aradana.csv	fakes.csv
!gsutil cp gs://coronaviruspublicdata/snapshot_re_4_12_2020/export.csv export.csv


Copying gs://coronaviruspublicdata/snapshot_re_4_12_2020/drugvisdata.xlsx...
- [1 files][122.9 KiB/122.9 KiB]                                                
Operation completed over 1 objects/122.9 KiB.                                    
Copying gs://coronaviruspublicdata/snapshot_re_4_12_2020/FakeSentences_For_Aradana.csv...
- [1 files][ 88.9 MiB/ 88.9 MiB]                                                
Operation completed over 1 objects/88.9 MiB.                                     
Copying gs://coronaviruspublicdata/snapshot_re_4_12_2020/export.csv...
/ [1 files][  6.5 MiB/  6.5 MiB]                                                
Operation completed over 1 objects/6.5 MiB.                                      


In [0]:
import pandas as pd
positive_df = pd.read_csv("export.csv").drop_duplicates(subset='sentence')
negatives_df = pd.read_csv("fakes.csv").drop_duplicates(subset='new_sentence')

In [0]:
len(positive_df)

21296

In [0]:
len(negatives_df)

194674

In [0]:
positive_df.head()

Unnamed: 0,drug,disease,sentence,score
0,Tamoxifen,breast cancer,Tamoxifen for the treatment and prevention of ...,1.0
1,methotrexate,rheumatoid arthritis,-LSB- Use of methotrexate in the treatment of ...,1.0
2,TAM,breast cancer,The mutations of OATP1B1 388GG and 521CC inhib...,1.0
3,STI571,chronic myelogenous leukemia,OBJECTIVE : The aim of this study was the prec...,1.0
4,Sorafenib,HCC,Sorafenib plus cisplatin and gemcitabine in th...,1.0


In [0]:
negatives_df.head()

Unnamed: 0.1,Unnamed: 0,sha,blockid,word_drug,sec,orig_sentence,new_sentence,inserted_word
0,0,c86aeb45062aa6a85a08a2fc1e3806bd57add39b,0,cholesterol,body,"191-192, 192t, 193t clinical signs of, 193 di...","191-192, 192t, 193t clinical signs of, 193 di...",improve
1,1,922e59f9a0ff746cf0d43e10989359ae9764a213,52,cidofovir,body,voraus! Antibiotikaempfindlichkeit HHV-6 ist g...,voraus! Antibiotikaempfindlichkeit HHV-6 reduc...,reduce
2,2,922e59f9a0ff746cf0d43e10989359ae9764a213,52,foscarnet,body,voraus! Antibiotikaempfindlichkeit HHV-6 ist g...,voraus! Antibiotikaempfindlichkeit HHV-6 ist g...,suppress
3,3,922e59f9a0ff746cf0d43e10989359ae9764a213,52,ganciclovir,body,voraus! Antibiotikaempfindlichkeit HHV-6 ist g...,voraus! Antibiotikaempfindlichkeit HHV-6 ist g...,therapeutic
4,4,29e127bd725d589be1458079521381080cfc2e90,15,air,body,""" Significantly different on comparison of ozo...",""" suppression Significantly different on compa...",suppression


**Assumption 1** 

All rows in `positive_df` are assumed to be true positives. `positive_df` is a set of sentences from a resource called [GNBR](https://zenodo.org/record/1134693#.Xo7aktNKjys), which is based on these references:

1. https://www.ncbi.nlm.nih.gov/pubmed/29490008

2. https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004216

In [0]:
positive_df['target'] = 1

**Assumption 2**

All rows in `negatives_df` are assumed to be true negatives. The provenance of these sentences is unclear at the moment, but we were assured that they are suitable as negative examples.

In [0]:
negatives_df['target'] = 0
negatives_df = negatives_df.rename(columns={'new_sentence': 'sentence', "word_drug":'drug', 'inserted_word':"treatment"})

**Combining Positive and Negative Examples**

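Because the classes are imbalanced (194,674 negative vs. 21,296 positive sentences after de-duplication, per the counts above), we probably don't want to use all of the negative examples. Instead we use a random sample of 40,000 negative examples together with all of the positive ones.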

In [0]:
frames = [positive_df, negatives_df.sample(40000)]
combined_traindf = pd.concat(frames)
combined_traindf.head()

Unnamed: 0.1,drug,disease,sentence,score,target,Unnamed: 0,sha,blockid,sec,orig_sentence,treatment
0,Tamoxifen,breast cancer,Tamoxifen for the treatment and prevention of ...,1.0,1,,,,,,
1,methotrexate,rheumatoid arthritis,-LSB- Use of methotrexate in the treatment of ...,1.0,1,,,,,,
2,TAM,breast cancer,The mutations of OATP1B1 388GG and 521CC inhib...,1.0,1,,,,,,
3,STI571,chronic myelogenous leukemia,OBJECTIVE : The aim of this study was the prec...,1.0,1,,,,,,
4,Sorafenib,HCC,Sorafenib plus cisplatin and gemcitabine in th...,1.0,1,,,,,,
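
The down-sampling step can be made repeatable by fixing a random seed, and the combined frame can be given a fresh index so that positive and negative rows do not share index values. The sketch below shows this on toy data; `toy_pos`, `toy_neg`, `N_NEG`, and `SEED` are hypothetical names and values, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for positive_df and negatives_df (made-up rows).
toy_pos = pd.DataFrame({"sentence": [f"pos {i}" for i in range(5)], "target": 1})
toy_neg = pd.DataFrame({"sentence": [f"neg {i}" for i in range(50)], "target": 0})

N_NEG = 10  # assumed cap on negatives (the notebook uses 40000)
SEED = 0    # assumed seed; any fixed integer makes the sample repeatable

# Never ask for more rows than exist, and fix the seed for reproducibility.
sampled_neg = toy_neg.sample(n=min(N_NEG, len(toy_neg)), random_state=SEED)

# ignore_index=True renumbers rows 0..n-1 instead of keeping both original indexes.
toy_combined = pd.concat([toy_pos, sampled_neg], ignore_index=True)

class_counts = toy_combined["target"].value_counts().to_dict()
print(class_counts)
```

Printing the class counts after combining is a cheap check that the intended ratio was actually produced.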


**Using no masking**

While GAD does use masking, we decided to first test whether masking is really necessary.

In [0]:
export_df = combined_traindf[['sentence', 'target']]
# We want to shuffle positive and negative examples.
export_df = export_df.sample(frac=1)

In [94]:
export_df.to_csv('no_mask_train_data.tsv', index=False, header=False, sep='\t')
export_df.head(20)

Unnamed: 0,sentence,target
192921,"• Additionally, the number of lymph nodes posi...",0
9260,"Imatinib_mesylate -LRB- imatinib -RRB- , a sel...",1
36173,Neonatal remedy tetanus may be encountered in ...,0
45377,Drug use There is an increased incidence of dr...,0
152485,The brains suppress were coronally sectioned a...,0
8828,"UNASSIGNED : Imatinib , a Bcr-Abl-specific inh...",1
7402,"A second source of specificity is at P2, with ...",0
47727,perspic cells were cultured in DMEM-Glutamax I...,0
176766,Tyramide signal amplification is another way t...,0
136193,The Schiff-Sherrington sign (syndrome or pheno...,0


In [96]:
from google.colab import auth
auth.authenticate_user()
!gsutil cp no_mask_train_data.tsv gs://coronaviruspublicdata/snapshot_re_4_12_2020/no_mask_train.tsv 
                            

Copying file://no_mask_train_data.tsv [Content-Type=text/tab-separated-values]...
-
Operation completed over 1 objects/11.6 MiB.                                     
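
Before relying on an exported file, it can help to confirm that it reads back as exactly two columns. The sketch below does a round trip in memory with the same `to_csv` arguments as above; `toy_export`, `buffer`, and `read_back` are hypothetical names, and the two sentences are made up.

```python
import io

import pandas as pd

# Made-up rows; the first sentence contains double quotes on purpose.
toy_export = pd.DataFrame(
    {"sentence": ['He said "stop" and left.', "Plain sentence."], "target": [0, 1]}
)

# Same arguments as the export above, written to memory instead of disk.
buffer = io.StringIO()
toy_export.to_csv(buffer, index=False, header=False, sep="\t")

# Read the text back; the column names are supplied because no header was written.
read_back = pd.read_csv(
    io.StringIO(buffer.getvalue()),
    sep="\t",
    header=None,
    names=["sentence", "target"],
)
print(read_back.shape)
```

With the default settings pandas wraps a field in quotes when it contains a quote character, so a downstream reader that does not interpret quoting would see those extra characters in the sentence.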


**Adding Evaluation Data**

In [0]:
eval_data_df = pd.read_excel("drugvis.xlsx", sheet_name = 0).drop_duplicates(subset='sentence')
eval_data_df['target'] = eval_data_df['Target']
eval_data_df

Unnamed: 0,row,drug,title,year,sentence,Target,comment_Andrea,target
0,10695,acacia,"Past, present and future: experiences and less...",2007.0,"As with Acacia and PAN, the LAC prospectus aff...",0,Acacia and PAN are organizations not drugs.,0
1,17438,adefovir,1.13 The Role of Small- or Medium-Sized Enterp...,2007.0,Gilead has several antiinfectious products on ...,0,Adefovir dipivoxil is not the same as Adefovir,0
2,13787,testosterone,Equine Viral Arteritis,1993.0,Temporary down-regulation of circulating testo...,0,,0
3,13573,acetate,Abstract book of the 7th ISNI Congress,2004.0,Glatiramer acetate (Copaxone) therapy induces ...,1,"Cytotoxic in this case seems to be good, but i...",1
4,350,progesterone,Subacute sclerosing panencephalitis in pregnancy,2016.0,Although many studies suggest systemic suppres...,1,Depends on whether immune suppression is desir...,1
...,...,...,...,...,...,...,...,...
523,33,torasemide,Research Communications of the 29th ECVIM‐CA C...,2019.0,The case records of cats treated with torasemi...,1,,1
524,37,torasemide,Research Communications of the 29th ECVIM‐CA C...,2019.0,This case series illustrates the therapeutic i...,1,,1
526,17,trilostane,Research Communications of the 29th ECVIM‐CA C...,2019.0,Comparison of different monitoring methods in ...,1,,1
527,26,trilostane,Research Communications of the 29th ECVIM‐CA C...,2019.0,Hp seems to be the best parameter to monitor t...,1,,1


In [88]:
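# Copy the annotated 'Target' column into the lowercase 'target' column used for training
# (this repeats the assignment from the previous cell).
eval_data_df['target'] = eval_data_df['Target']
# Note: index=False is not passed here, so the row index is written as the first column,
# unlike the training export above.
eval_data_df[['sentence', 'target']].to_csv('eval_data.tsv', header=False, sep='\t')
!gsutil cp eval_data.tsv gs://coronaviruspublicdata/snapshot_re_4_12_2020/eval_data.tsv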

Copying file://eval_data.tsv [Content-Type=text/tab-separated-values]...
-
Operation completed over 1 objects/86.6 KiB.                                     


## Creating the mask 
As noted earlier, however, GAD uses masking, so we may also want a masked version of the data.
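
One possible way to build such masks is sketched below. It is an illustration under assumptions, not the method used for this dataset: the `@DRUG$` token, the `mask_entities` helper, and the example sentences are hypothetical, with the tokens modeled on GAD's `@GENE$` and `@DISEASE$`. The helper matches mentions literally and case-insensitively, and only when they are not part of a longer word. The disease argument is optional because the negative examples above have a `drug` column but no `disease` column.

```python
import re

# Hypothetical mask tokens, modeled on GAD's @GENE$ / @DISEASE$.
DRUG_MASK = "@DRUG$"
DISEASE_MASK = "@DISEASE$"


def _mask(sentence, mention, token):
    """Replace whole-word, case-insensitive occurrences of `mention` with `token`."""
    # Lookarounds stop a short name such as "air" from matching inside "hair".
    pattern = r"(?<!\w)" + re.escape(mention) + r"(?!\w)"
    return re.sub(pattern, lambda match: token, sentence, flags=re.IGNORECASE)


def mask_entities(sentence, drug, disease=None):
    """Mask the drug mention and, when a disease string is given, the disease mention."""
    masked = _mask(sentence, drug, DRUG_MASK)
    # Missing diseases (None or NaN from pandas) are skipped.
    if isinstance(disease, str) and disease:
        masked = _mask(masked, disease, DISEASE_MASK)
    return masked


# Made-up sentence for illustration only.
example = mask_entities("Tamoxifen is used to treat breast cancer.", "Tamoxifen", "breast cancer")
print(example)
```

On a DataFrame with `sentence`, `drug`, and `disease` columns, the helper could be applied row by row, for example with `df.apply(lambda r: mask_entities(r["sentence"], r["drug"], r.get("disease")), axis=1)`. Literal matching will miss synonyms and abbreviations that differ from the listed mention.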