Building the SmartTag models
===


## Initial checks

In [None]:
source .venv/bin/activate

In [None]:
python setup.py install

In [None]:
ls

In [None]:
ls models

In [None]:
ls data

In [None]:
ls data4th

In [None]:
smtag-neo2xml --help

In [None]:
smtag-convert2th --help

In [None]:
smtag-meta --help

In [None]:
smtag-eval --help

In [None]:
smtag-predict --help
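
The `--help` calls above only confirm that the console scripts were installed. A minimal Python sketch of the same check, assuming the five command names used in this notebook are on `PATH` once the virtual environment is active (`missing_tools` is a hypothetical helper, not part of SmartTag):

```python
import shutil

# Console scripts invoked in this notebook.
SMTAG_TOOLS = [
    "smtag-neo2xml",
    "smtag-convert2th",
    "smtag-meta",
    "smtag-eval",
    "smtag-predict",
]


def missing_tools(tools=SMTAG_TOOLS):
    """Return the names of the tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("not installed:", ", ".join(missing))
    else:
        print("all SmartTag command-line tools found")
```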

## Panel start

### Data preparation

Use generic dataset with all panels that contain at least some entities (`-y`).

In [None]:
smtag-neo2xml -f all_entities

In [None]:
smtag-convert2th -X5 -L1200 -c all_entities -f 5X_L1200_entities # dataset to be zipped and transferred manually to GPU

### Hyper scan

In [None]:
smtag-meta -E50 -o panel_start -H depth,kernel -I 25 -f 5X_L1200_entities -w /efs/smtag # after data transfer to GPU machine

Winning combo:

    4,4,4,4

But results vary a lot between runs. Initialization issues?
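
One way to picture the scan is the minimal sketch below. It assumes that `-H depth,kernel` draws a depth and a kernel width at random for each of the `-I 25` iterations and builds uniform tables such as `4,4,4,4`. The candidate ranges, the fixed 8 channels and the pool size of 2 are assumptions for illustration; the actual sampling inside `smtag-meta` is not shown in this notebook.

```python
import random


def sample_configs(n_iter, depths=(2, 3, 4), kernels=(3, 4, 5, 6, 7), seed=None):
    """Draw n_iter random (depth, kernel) pairs and expand them into tables.

    The candidate depths and kernels are hypothetical ranges.
    """
    rng = random.Random(seed)
    configs = []
    for _ in range(n_iter):
        depth = rng.choice(depths)
        kernel = rng.choice(kernels)
        configs.append({
            "kernel_table": [kernel] * depth,  # e.g. 4,4,4,4
            "nf_table": [8] * depth,           # assumed fixed
            "pool_table": [2] * depth,         # assumed fixed
        })
    return configs


# Print the kernel tables of the first three sampled configurations.
for config in sample_configs(n_iter=25, seed=0)[:3]:
    print(",".join(str(k) for k in config["kernel_table"]))
```

With 25 draws over a grid of this size, each combination is visited only once or twice on average, so a single winner may partly reflect run-to-run noise rather than a real difference.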

### Train

In [None]:
smtag-meta -E50 -o panel_start -Z32 -R0.01 -k 4,4,4,4 -n 8,8,8,8 -p 2,2,2,2 -f 5X_L1200_entities -w /efs/smtag

saved as `5X_L1200_entities_panel_start_2018-10-05-14-59.zip`

### Eval

In [None]:
smtag-eval -f 5X_L1200_entities -m 5X_L1200_entities_panel_start_2018-10-05-14-59.zip -S -T

    thres prec  recal f1
    0.000 0.002 1.000 0.005
    0.100 0.659 0.955 0.780
    0.200 0.746 0.937 0.831
    0.300 0.795 0.923 0.854
    0.400 0.826 0.909 0.866
    0.500 0.854 0.894 0.873
    0.600 0.876 0.876 0.876<<<
    0.700 0.898 0.848 0.872
    0.800 0.920 0.794 0.853
    0.900 0.948 0.672 0.787
    1.000 0.500 0.000 0.000
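
The `<<<` marker flags the threshold with the highest F1. A small sketch of that selection, recomputing F1 as the harmonic mean of precision and recall from the table above (`parse_scan` and `best_threshold` are hypothetical helpers, not part of `smtag-eval`):

```python
# Threshold scan copied from the table above: thres, precision, recall, f1.
SCAN = """
0.000 0.002 1.000 0.005
0.100 0.659 0.955 0.780
0.200 0.746 0.937 0.831
0.300 0.795 0.923 0.854
0.400 0.826 0.909 0.866
0.500 0.854 0.894 0.873
0.600 0.876 0.876 0.876
0.700 0.898 0.848 0.872
0.800 0.920 0.794 0.853
0.900 0.948 0.672 0.787
1.000 0.500 0.000 0.000
"""


def parse_scan(text):
    """Turn the whitespace-separated scan into a list of float tuples."""
    return [tuple(float(x) for x in line.split())
            for line in text.strip().splitlines()]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 0.0 if total == 0 else 2 * precision * recall / total


def best_threshold(rows):
    """Return the threshold whose recomputed F1 is highest."""
    return max(rows, key=lambda row: f1_score(row[1], row[2]))[0]


print(best_threshold(parse_scan(SCAN)))  # prints 0.6
```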

In [None]:
python setup.py install # after mapper has been updated with optimal threshold

In [None]:
smtag-eval -f 5X_L1200_entities -m 5X_L1200_entities_panel_start_2018-10-05-14-59.zip

========================================================

 Data: ./data4th/5X_L1200_entities/test

 Model: 5X_L1200_entities_panel_start_2018-10-05-14-59.zip

 Global stats: 

	precision = 0.8786332607269287.2f
	recall = 0.9617899656295776.2f
	f1 = 0.9183329343795776.2f

 Feature: ': 'panel_start' ()'

	precision = 0.8786332607269287.2f
	recall = 0.9617899656295776.2f
	f1 = 0.9183329343795776.2f

## Gene products

### Data preparation

In [None]:
smtag-neo2xml -y gene,protein -f gene_protein

In [None]:
smtag-convert2th -X5 -L1200 -c gene_protein -f 5X_L1200_gene_protein

### Hyper scan

In [None]:
smtag-meta -E50 -o geneprod -H kernel,depth -I 25 -f 5X_L1200_gene_protein -w /efs/smtag # after transfer to GPU machine

Winning combo with `f1=0.80` after 50 epochs, with no overfitting on the validation set:

    namebase=5X_L1200_gene_protein; modelname=; learning_rate=0.01000000000000001; epochs=50; minibatch_size=32; selected_features=['geneprod']; collapsed_features=[]; overlap_features=[]; features_as_input=[]; nf_table=[8, 8, 8]; pool_table=[2, 2, 2]; kernel_table=[6, 6, 6]; dropout=0.1; validation_fraction=0.2; nf_input=32; nf_output=1
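
The configuration line above is a flat `key=value; key=value` string. A minimal sketch for turning it into a dictionary, so that tables such as `kernel_table` can be compared between runs (`parse_config` is a hypothetical helper; values that are not Python literals are kept as plain strings):

```python
import ast

CONFIG = (
    "namebase=5X_L1200_gene_protein; modelname=; "
    "learning_rate=0.01000000000000001; epochs=50; minibatch_size=32; "
    "selected_features=['geneprod']; collapsed_features=[]; "
    "overlap_features=[]; features_as_input=[]; nf_table=[8, 8, 8]; "
    "pool_table=[2, 2, 2]; kernel_table=[6, 6, 6]; dropout=0.1; "
    "validation_fraction=0.2; nf_input=32; nf_output=1"
)


def parse_config(line):
    """Parse 'key=value; key=value' into a dict of Python values."""
    config = {}
    for item in line.split("; "):
        key, _, raw = item.partition("=")
        try:
            config[key.strip()] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            # Not a Python literal (e.g. a bare name or empty value).
            config[key.strip()] = raw
    return config


config = parse_config(CONFIG)
print(config["kernel_table"], config["nf_table"], config["pool_table"])
```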

In [None]:
# add kernel 3 at the model exit
python setup.py install

In [None]:
smtag-meta -E200 -Z32 -R0.01 -o geneprod -k 6,6,6 -n 8,8,8 -p 2,2,2 -f 5X_L1200_gene_protein -w /efs/smtag

 ========================================================

 Data: ./data4th/5X_L1200_gene_protein/test

 Model: 5X_L1200_gene_protein_geneprod_2018-10-04-10-05.zip

 Global stats: 

	precision = 0.8259445428848267.2f
	recall = 0.7741530537605286.2f
	f1 = 0.7992106080055237.2f

 Feature: 'entity: 'geneprod' (geneprod)'

	precision = 0.8259445428848267.2f
	recall = 0.7741530537605286.2f
	f1 = 0.7992106080055237.2f
    
No decisive advantage from the kernel-3 exit layer. Reverting to a 1x1 final conv1d.

### Train

For gene products (`-o`) with 32 examples per minibatch (`-Z`), 120 epochs (`-E`), learning rate 0.01 (`-R`), 3 super layers with kernel 6 (`-k`), 8 channels (`-n`), max pool window and stride 2 (`-p`):

In [None]:
smtag-meta -Z32 -E120 -R0.01 -k 6,6,6 -n 8,8,8 -p 2,2,2 -o geneprod -f 5X_L1200_gene_protein -w /efs/smtag

saved as `5X_L1200_gene_protein_geneprod_2018-10-05-04-07.zip`
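
The comma-separated tables passed to `-k`, `-n` and `-p` carry one value per layer, so they should have the same length. A sketch of a sanity check, together with the sequence length after each pooling step. It assumes unpadded max pooling where the window equals the stride; the helpers are hypothetical and the real architecture may differ.

```python
def parse_table(arg):
    """Parse a comma-separated option value such as '6,6,6'."""
    return [int(x) for x in arg.split(",")]


def check_tables(kernel, nf, pool):
    """Return one (kernel, channels, pool) tuple per layer, or raise."""
    k, n, p = parse_table(kernel), parse_table(nf), parse_table(pool)
    if not (len(k) == len(n) == len(p)):
        raise ValueError(
            f"table lengths differ: -k {len(k)}, -n {len(n)}, -p {len(p)}"
        )
    return list(zip(k, n, p))


def pooled_lengths(length, pool):
    """Sequence length before and after each pooling step."""
    lengths = [length]
    for window in parse_table(pool):
        length //= window  # floor division: window == stride, no padding
        lengths.append(length)
    return lengths


print(check_tables("6,6,6", "8,8,8", "2,2,2"))
print(pooled_lengths(1200, "2,2,2"))  # [1200, 600, 300, 150]
```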

### Eval

In [None]:
smtag-eval -f 5X_L1200_gene_protein_train -m 5X_L1200_gene_protein_geneprod_2018-10-05-04-07.zip -S -T

    thres prec  recal f1
    0.000 0.045 1.000 0.086
    0.100 0.651 0.943 0.770
    0.200 0.728 0.903 0.806
    0.300 0.775 0.865 0.817
    0.400 0.812 0.824 0.818<<<
    0.500 0.842 0.779 0.809
    0.600 0.871 0.725 0.791
    0.700 0.897 0.655 0.757
    0.800 0.922 0.557 0.695
    0.900 0.950 0.398 0.561
    1.000 0.500 0.000 0.000

In [None]:
python setup.py install # after mapper has been updated with optimal threshold

In [None]:
smtag-eval -f 5X_L1200_gene_protein -m 5X_L1200_gene_protein_geneprod_2018-10-05-04-07.zip

 Data: ./data4th/5X_L1200_gene_protein/test

 Model: 5X_L1200_gene_protein_geneprod_2018-10-05-04-07.zip

 Global stats: 

	precision = 0.8096075057983398.2f
	recall = 0.7962476015090942.2f
	f1 = 0.8028719425201416.2f

 Feature: 'entity: 'geneprod' (geneprod)'

	precision = 0.8096075057983398.2f
	recall = 0.7962476015090942.2f
	f1 = 0.8028719425201416.2f

## Role intervention/assayed with reporter not anonymized

### Data gen

Anonymize genes and proteins (`-A`) but do not anonymize reporters (`-AA`). Keep roles only for genes and proteins (`-s`).

In [None]:
smtag-neo2xml -y gene,protein -A gene,protein -AA reporter -s -f gene_protein_anonym_not_reporter

In [None]:
smtag-convert2th -L1200 -X5 -c gene_protein_anonym_not_reporter -f 5X_L1200_gene_protein_anonym_not_reporter

### Train

In [None]:
smtag-meta -Z32 -E200 -R0.01 -o intervention,assayed -f 5X_L1200_gene_protein_anonym_not_reporter # -w /efs/smtag

saved as `5X_L1200_gene_protein_anonym_not_reporter_intervention_assayed_2018-10-06-06-48.zip`

Note: 20 epochs would have been enough; the model overfits thereafter.
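
An early-stopping rule would avoid the wasted epochs. The sketch below is generic and not tied to `smtag-meta`, which may or may not offer such an option; the validation losses are made-up numbers used only to show the mechanics.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch with the lowest validation loss seen
    before `patience` consecutive epochs pass without improvement.

    Returns 0 for an empty list.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # stop training here and keep the best checkpoint
    return best_epoch


# Hypothetical validation losses: improvement stops after epoch 4.
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.60, 0.63, 0.70]
print(early_stopping_epoch(losses, patience=3))  # prints 4
```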

### Eval

In [None]:
smtag-eval -f 5X_L1200_gene_protein_anonym_not_reporter -m 5X_L1200_gene_protein_anonym_not_reporter_intervention_assayed_2018-10-06-06-48.zip -S -T

          intervention        assayed
    thres prec  recal f1      prec  recal f1
    0.000 0.013 1.000 0.026   0.021 1.000 0.040
    0.100 0.500 0.908 0.645   0.644 0.951 0.768
    0.200 0.569 0.851 0.682   0.696 0.915 0.791
    0.300 0.615 0.801 0.696   0.728 0.881 0.797
    0.400 0.652 0.749 0.697   0.755 0.846 0.798<<<
    0.500 0.687 0.692 0.690   0.779 0.804 0.791
    0.600 0.724 0.628 0.673   0.804 0.754 0.778
    0.700 0.765 0.548 0.639   0.828 0.685 0.749
    0.800 0.821 0.424 0.560   0.852 0.575 0.686
    0.900 0.876 0.191 0.313   0.892 0.363 0.516
    1.000 0.500 0.000 0.000   0.500 0.000 0.000

In [None]:
smtag-eval -f 5X_L1200_gene_protein_anonym_not_reporter -m 5X_L1200_gene_protein_anonym_not_reporter_intervention_assayed_2018-10-06-06-48.zip

========================================================

 Data: ./data4th/5X_L1200_gene_protein_anonym_not_reporter/test

 Model: 5X_L1200_gene_protein_anonym_not_reporter_intervention_assayed_2018-10-06-06-48.zip

 Global stats: 

	precision = 0.7332009077072144.2f
	recall = 0.7734760046005249.2f
	f1 = 0.7527563571929932.2f

 Feature: 'entity: 'intervention' (intervention)'

	precision = 0.6864056587219238.2f
	recall = 0.7123331427574158.2f
	f1 = 0.6991291046142578.2f

 Feature: 'entity: 'assayed' (assayed)'

	precision = 0.7799961566925049.2f
	recall = 0.8346188068389893.2f
	f1 = 0.8063835501670837.2f
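
The global stats above are consistent with an unweighted mean of the per-feature values, rather than an F1 recomputed from the global precision and recall. This is an observation from the logged numbers, not something confirmed from the `smtag-eval` source. A quick check:

```python
from statistics import mean

# Values copied from the evaluation output above.
per_feature = {
    "intervention": {
        "precision": 0.6864056587219238,
        "recall": 0.7123331427574158,
        "f1": 0.6991291046142578,
    },
    "assayed": {
        "precision": 0.7799961566925049,
        "recall": 0.8346188068389893,
        "f1": 0.8063835501670837,
    },
}
global_stats = {
    "precision": 0.7332009077072144,
    "recall": 0.7734760046005249,
    "f1": 0.7527563571929932,
}


def macro_average(stats_by_feature, metric):
    """Unweighted mean of one metric over all features."""
    return mean(stats[metric] for stats in stats_by_feature.values())


for metric, logged in global_stats.items():
    computed = macro_average(per_feature, metric)
    print(f"{metric}: macro average {computed:.6f}, logged {logged:.6f}")
```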

## Reporter gene product

### Data prep

No anonymization; keep roles only for proteins and genes (`-s`).

In [None]:
smtag-neo2xml -l1000 -y gene,protein -s -f geneprod_selective

In [None]:
smtag-convert2th -X5 -L1200 -c geneprod_selective -f 5X_L1200_geneprod_selective

### Train

Learn the reporter role. Since roles were kept only for gene products, the model learns gene product reporters.

In [None]:
smtag-meta -E100 -Z32 -R0.01 -o reporter -f 5X_L1200_geneprod_selective

saved as `5X_L1200_geneprod_selective_reporter_2018-10-06-08-14.zip`

### Eval

In [None]:
smtag-eval -f 5X_L1200_geneprod_selective -m 5X_L1200_geneprod_selective_reporter_2018-10-06-08-14.zip  -S -T

    0.000 0.006 1.000 0.011
    0.100 0.765 0.905 0.829
    0.200 0.813 0.886 0.848
    0.300 0.838 0.863 0.850
    0.400 0.857 0.840 0.848
    0.500 0.873 0.822 0.847
    0.600 0.891 0.801 0.844
    0.700 0.908 0.779 0.839
    0.800 0.929 0.740 0.824
    0.900 0.950 0.630 0.757
    1.000 0.500 0.000 0.000

In [None]:
python setup.py install # after mapper has been updated with optimal threshold

In [None]:
smtag-eval -f 5X_L1200_geneprod_selective -m 5X_L1200_geneprod_selective_reporter_2018-10-06-08-14.zip 

========================================================

 Data: ./data4th/5X_L1200_geneprod_selective/test

 Model: 5X_L1200_geneprod_selective_reporter_2018-10-06-08-14.zip

 Global stats: 

	precision = 0.8972335457801819.2f
	recall = 0.8679876327514648.2f
	f1 = 0.8823683261871338.2f

 Feature: 'entity: 'reporter' (reporter)'

	precision = 0.8972335457801819.2f
	recall = 0.8679876327514648.2f
	f1 = 0.8823683261871338.2f

## Experimental assay (measurementMethod)

Using the generic entities dataset.

### Train

In [None]:
smtag-meta -Z32 -E200 -R0.01 -k 6,6,6 -p 2,2,2 -n 8,8,8 -o assay -f 5X_L1200_entities_train -w /efs/smtag/

saved as `5X_L1200_entities_assay_2018-10-06-16-45.zip`

### Eval

In [None]:
smtag-eval -f 5X_L1200_entities -m 5X_L1200_entities_assay_2018-10-06-16-45.zip -S -T

    0.000 0.029 1.000 0.055
    0.100 0.579 0.848 0.688
    0.200 0.675 0.812 0.737
    0.300 0.726 0.786 0.755
    0.400 0.763 0.763 0.763
    0.500 0.794 0.742 0.767<<<
    0.600 0.822 0.717 0.766
    0.700 0.851 0.684 0.758
    0.800 0.886 0.627 0.734
    0.900 0.930 0.518 0.665
    1.000 0.500 0.000 0.000

In [None]:
python setup.py install # after mapper has been updated with optimal threshold

In [None]:
smtag-eval -f 5X_L1200_entities -m 5X_L1200_entities_assay_2018-10-06-16-45.zip

========================================================

 Data: ./data4th/5X_L1200_entities/test

 Model: 5X_L1200_entities_assay_2018-10-06-16-45.zip

 Global stats: 

	precision = 0.8133354783058167.2f
	recall = 0.7144335508346558.2f
	f1 = 0.7606832385063171.2f

 Feature: 'assay: 'assay' ()'

	precision = 0.8133354783058167.2f
	recall = 0.7144335508346558.2f
	f1 = 0.7606832385063171.2f

## Small molecule model

### Data gen

Enrich examples for small molecules (`-y`). Generate the data by sampling each example 5 times (`-X5`), each sample 1200 characters long (`-L1200`).

In [None]:
smtag-neo2xml -l1000 -y molecule -f small_molecule

In [None]:
smtag-convert2th -X5 -L1200 -c small_molecule -f 5X_L1200_small_molecule
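
A rough sketch of what drawing 5 windows of 1200 characters from one example could look like. It assumes random start offsets and space padding for short texts; how `smtag-convert2th` actually samples and pads is not shown in this notebook, and `sample_windows` is a hypothetical helper.

```python
import random


def sample_windows(text, n_samples=5, length=1200, seed=None):
    """Draw n_samples fixed-length windows at random offsets.

    Texts shorter than `length` are right-padded with spaces (assumption).
    """
    rng = random.Random(seed)
    padded = text.ljust(length)
    max_start = len(padded) - length
    windows = []
    for _ in range(n_samples):
        start = rng.randint(0, max_start)  # inclusive on both ends
        windows.append(padded[start:start + length])
    return windows


# Placeholder text standing in for a figure legend.
legend = "Western blot of cell lysates. " * 100
samples = sample_windows(legend, n_samples=5, length=1200, seed=0)
print(len(samples), {len(s) for s in samples})
```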

### Train

In [None]:
smtag-meta -E100 -Z32 -R0.01 -o small_molecule -f 5X_L1200_small_molecule_train -w /efs/smtag

saved as `5X_L1200_small_molecule_train_small_molecule_2018-09-30-12-36.zip`

### Eval

In [None]:
smtag-eval -f 5X_L1200_small_molecule_train -m 5X_L1200_small_molecule_train_small_molecule_2018-09-30-12-36.zip -S -T

    0.000 0.021 1.000 0.042
    0.100 0.706 0.932 0.803
    0.200 0.832 0.888 0.859
    0.300 0.889 0.847 0.867
    0.400 0.922 0.807 0.861
    0.500 0.946 0.765 0.846
    0.600 0.962 0.718 0.823
    0.700 0.974 0.661 0.788
    0.800 0.983 0.585 0.734
    0.900 0.990 0.467 0.634
    1.000 0.500 0.000 0.000

In [None]:
python setup.py install # after mapper has been updated with optimal threshold

In [None]:
smtag-eval -f 5X_L1200_small_molecule_test -m 5X_L1200_small_molecule_train_small_molecule_2018-09-30-12-36.zip

Global stats: 

	precision = 0.8067349791526794.2f
	recall = 0.5043478012084961.2f
	f1 = 0.6206701993942261.2f

 Feature: 'entity: 'small_molecule' (small_molecule)'

	precision = 0.8067349791526794.2f
	recall = 0.5043478012084961.2f
	f1 = 0.6206701993942261.2f

## Subcellular structures

### Data prep

In [None]:
smtag-neo2xml -l1000 -y subcellular -f subcellular

In [None]:
smtag-convert2th -X10 -L1200 -c subcellular -f 10X_L1200_subcellular

In [None]:
smtag-convert2th -X10 -L1200 -c subcellular -f 10X_L1200_subcellular -T

### Train

In [None]:
smtag-meta -E200 -Z32 -R0.01 -o subcellular -f 10X_L1200_subcellular_train

saved as `10X_L1200_subcellular_train_subcellular_2018-10-02-04-31.zip`

### Eval

In [None]:
smtag-eval -m 10X_L1200_subcellular_train_subcellular_2018-10-02-04-31.zip -f 10X_L1200_subcellular_train -T -S

    0.000 0.023 1.000 0.046
    0.100 0.768 0.949 0.849
    0.200 0.856 0.925 0.890
    0.300 0.900 0.902 0.901
    0.400 0.928 0.878 0.902
    0.500 0.949 0.850 0.897
    0.600 0.966 0.816 0.885
    0.700 0.978 0.772 0.863
    0.800 0.988 0.708 0.824
    0.900 0.995 0.590 0.741
    1.000 0.500 0.000 0.000

In [None]:
python setup.py install # after mapper has been updated with optimal threshold

In [None]:
smtag-eval -m 10X_L1200_subcellular_train_subcellular_2018-10-02-04-31.zip -f 10X_L1200_subcellular_test

Global stats: 

	precision = 0.8699927926063538.2f
	recall = 0.5450037717819214.2f
	f1 = 0.6701773405075073.2f

 Feature: 'entity: 'subcellular' (subcellular)'

	precision = 0.8699927926063538.2f
	recall = 0.5450037717819214.2f
	f1 = 0.6701773405075073.2f

## Cell lines and cell types

### Data gen

In [None]:
smtag-neo2xml -l1000 -y cell -f cell

In [None]:
smtag-convert2th -X10 -L1200 -c cell -f 10X_L1200_cell

In [None]:
smtag-convert2th -X10 -L1200 -c cell -f 10X_L1200_cell -T

### Train

saved as `10X_L1200_cell_train_cell_2018-10-02-13-01.zip`

### Eval

In [None]:
smtag-eval -f 10X_L1200_cell_train -m 10X_L1200_cell_train_cell_2018-10-02-13-01.zip -S -T

    0.000 0.019 1.000 0.038
    0.100 0.746 0.985 0.849
    0.200 0.833 0.978 0.900
    0.300 0.879 0.972 0.923
    0.400 0.908 0.965 0.936
    0.500 0.930 0.957 0.943
    0.600 0.946 0.946 0.946
    0.700 0.960 0.932 0.946
    0.800 0.973 0.908 0.939
    0.900 0.984 0.856 0.916
    1.000 0.500 0.000 0.000

In [None]:
python setup.py install # after mapper has been updated with optimal threshold

In [None]:
smtag-eval -f 10X_L1200_cell_test -m 10X_L1200_cell_train_cell_2018-10-02-13-01.zip

Global stats: 

	precision = 0.8186321258544922.2f
	recall = 0.6864217519760132.2f
	f1 = 0.7467199563980103.2f

 Feature: 'entity: 'cell' (cell)'

	precision = 0.8186321258544922.2f
	recall = 0.6864217519760132.2f
	f1 = 0.7467199563980103.2f

## Tissue and organs

### Data prep

In [None]:
smtag-neo2xml -l1000 -y tissue -f tissue

In [None]:
smtag-convert2th -X10 -L1200 -c tissue -f 10X_L1200_tissue

In [None]:
smtag-convert2th -X10 -L1200 -c tissue -f 10X_L1200_tissue -T

### Train

In [None]:
smtag-meta -E200 -Z32 -R0.01 -o tissue -f 10X_L1200_tissue_train -w /efs/smtag

saved as `10X_L1200_tissue_train_tissue_2018-10-02-08-46.zip`

### Eval

In [None]:
smtag-eval -m 10X_L1200_tissue_train_tissue_2018-10-02-08-46.zip -f 10X_L1200_tissue_train -S -T

    0.000 0.022 1.000 0.042
    0.100 0.835 0.927 0.879
    0.200 0.901 0.880 0.890
    0.300 0.928 0.837 0.880
    0.400 0.946 0.792 0.862
    0.500 0.957 0.741 0.836
    0.600 0.967 0.679 0.798
    0.700 0.976 0.597 0.741
    0.800 0.983 0.486 0.651
    0.900 0.986 0.321 0.485

In [None]:
python setup.py install # after mapper has been updated with optimal threshold

In [None]:
smtag-eval -m 10X_L1200_tissue_train_tissue_2018-10-02-08-46.zip -f 10X_L1200_tissue_test

 Global stats: 

	precision = 0.7564066648483276.2f
	recall = 0.5962263941764832.2f
	f1 = 0.6668321490287781.2f

 Feature: 'entity: 'tissue' (tissue)'

	precision = 0.7564066648483276.2f
	recall = 0.5962263941764832.2f
	f1 = 0.6668321490287781.2f

## Organisms

### Data prep with sd-graph

In [None]:
smtag-neo2xml -l1000 -y organism -f organism

In [None]:
smtag-convert2th -X10 -L1200 -c organism -f 10X_L1200_organism

In [None]:
smtag-convert2th -X10 -L1200 -c organism -f 10X_L1200_organism -T

### Train

In [None]:
smtag-meta -E200 -Z32 -R0.01 -o organism -f 10X_L1200_organism_train

saved as `10X_L1200_organism_train_organism_2018-10-02-17-56.zip`

### Eval

In [None]:
smtag-eval -f 10X_L1200_organism_train -m 10X_L1200_organism_train_organism_2018-10-02-17-56.zip -S -T

    0.000 0.016 1.000 0.031
    0.100 0.783 0.981 0.871
    0.200 0.857 0.974 0.912
    0.300 0.894 0.967 0.929
    0.400 0.917 0.960 0.938
    0.500 0.933 0.952 0.943
    0.600 0.947 0.943 0.945
    0.700 0.958 0.928 0.943
    0.800 0.969 0.905 0.936
    0.900 0.981 0.850 0.911
    1.000 0.500 0.000 0.000

In [None]:
python setup.py install # after mapper has been updated with optimal threshold

In [None]:
smtag-eval -f 10X_L1200_organism_test -m 10X_L1200_organism_train_organism_2018-10-02-17-56.zip

Global stats: 

	precision = 0.8689602613449097.2f
	recall = 0.6663433313369751.2f
	f1 = 0.7542819380760193.2f

 Feature: 'entity: 'organism' (organism)'

	precision = 0.8689602613449097.2f
	recall = 0.6663433313369751.2f
	f1 = 0.7542819380760193.2f

### Data prep with LINNAEUS_GSC_brat

### Benchmark on sd

### Pretraining

### Training

## Diseases

### Data prep

In [None]:
smtag-convert2th -L1200 -X5 -b -c NCBI_disease -f 5X_L1200_NCBI_disease

In [None]:
smtag-convert2th -L1200 -X5 -b -c NCBI_disease -f 5X_L1200_NCBI_disease -T

### Train

In [None]:
smtag-meta -E200 -Z32 -R0.01 -o disease -f 5X_L1200_NCBI_disease_train

saved as `5X_L1200_NCBI_disease_train_disease_2018-10-03-12-49.zip`

### Eval

In [None]:
smtag-eval -m 5X_L1200_NCBI_disease_train_disease_2018-10-03-12-49.zip -f 5X_L1200_NCBI_disease_train -S -T

    0.000 0.082 1.000 0.152
    0.100 0.774 0.974 0.862
    0.200 0.856 0.954 0.902
    0.300 0.898 0.935 0.916
    0.400 0.925 0.914 0.920
    0.500 0.946 0.890 0.917
    0.600 0.962 0.862 0.909
    0.700 0.974 0.825 0.893
    0.800 0.984 0.773 0.866
    0.900 0.992 0.678 0.805
    1.000 0.500 0.000 0.000

In [None]:
smtag-eval -m 5X_L1200_NCBI_disease_train_disease_2018-10-03-12-49.zip -f 5X_L1200_NCBI_disease_test

Global stats: 

	precision = 0.8464170694351196.2f
	recall = 0.7622231245040894.2f
	f1 = 0.8021168112754822.2f

 Feature: 'disease: 'disease' ()'

	precision = 0.8464170694351196.2f
	recall = 0.7622231245040894.2f
	f1 = 0.8021168112754822.2f

## Loading models into the rack

In [None]:
cd ~/Documents/code/py-smtag

In [None]:
cp models/5X_L1200_entities_train_panel_start_2018-09-28-18-08.zip rack/panel_start.zip

In [None]:
cp models/5X_L1200_gene_protein_train_geneprod_2018-09-28-20-17.zip rack/geneprod.zip

In [None]:
cp models/5X_L1200_small_molecule_train_small_molecule_2018-09-30-12-36.zip rack/small_molecule.zip

In [None]:
cp models/10X_L1200_subcellular_train_subcellular_2018-10-02-04-31.zip rack/subcellular.zip

In [None]:
cp models/10X_L1200_cell_train_cell_2018-10-02-13-01.zip rack/cell.zip

In [None]:
cp models/10X_L1200_tissue_train_tissue_2018-10-02-08-46.zip rack/tissue.zip

In [None]:
cp models/10X_L1200_organism_train_organism_2018-10-02-17-56.zip rack/organism.zip

In [None]:
cp models/5X_L1200_entities_train_assay_2018-09-30-11-25.zip rack/exp_assay.zip

In [None]:
cp models/5X_L1200_gene_protein_anonym_train_intervention_assayed_2018-09-30-10-09.zip rack/role_geneprod.zip

In [None]:
cp models/5X_L1200_reporter_geneprod_train_reporter_2018-10-03-00-29.zip rack/reporter_geneprod.zip

In [None]:
cp models/5X_L1200_NCBI_disease_train_disease_2018-10-03-12-49.zip rack/disease.zip

In [None]:
ls rack
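
The copy steps above can also be kept as a single mapping from rack name to model archive, which makes it easier to see which model version is deployed. A minimal sketch with the file names taken from the `cp` commands (`load_rack` is a hypothetical helper, not part of SmartTag):

```python
import shutil
from pathlib import Path

# Rack name -> model archive, as in the cp commands above.
RACK = {
    "panel_start.zip": "5X_L1200_entities_train_panel_start_2018-09-28-18-08.zip",
    "geneprod.zip": "5X_L1200_gene_protein_train_geneprod_2018-09-28-20-17.zip",
    "small_molecule.zip": "5X_L1200_small_molecule_train_small_molecule_2018-09-30-12-36.zip",
    "subcellular.zip": "10X_L1200_subcellular_train_subcellular_2018-10-02-04-31.zip",
    "cell.zip": "10X_L1200_cell_train_cell_2018-10-02-13-01.zip",
    "tissue.zip": "10X_L1200_tissue_train_tissue_2018-10-02-08-46.zip",
    "organism.zip": "10X_L1200_organism_train_organism_2018-10-02-17-56.zip",
    "exp_assay.zip": "5X_L1200_entities_train_assay_2018-09-30-11-25.zip",
    "role_geneprod.zip": "5X_L1200_gene_protein_anonym_train_intervention_assayed_2018-09-30-10-09.zip",
    "reporter_geneprod.zip": "5X_L1200_reporter_geneprod_train_reporter_2018-10-03-00-29.zip",
    "disease.zip": "5X_L1200_NCBI_disease_train_disease_2018-10-03-12-49.zip",
}


def load_rack(models_dir, rack_dir, mapping=RACK):
    """Copy each model archive into the rack under its short name."""
    models_dir, rack_dir = Path(models_dir), Path(rack_dir)
    missing = [src for src in mapping.values()
               if not (models_dir / src).is_file()]
    if missing:
        # Fail before copying anything, so the rack is never half-updated.
        raise FileNotFoundError("missing model archives: " + ", ".join(missing))
    rack_dir.mkdir(parents=True, exist_ok=True)
    for target, source in mapping.items():
        shutil.copy(models_dir / source, rack_dir / target)
    return sorted(path.name for path in rack_dir.iterdir())
```

Calling `load_rack("models", "rack")` from the repository root would then replace the individual copy cells and return the same listing as `ls rack`.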