Building the SmartTag models
===


## Initial checks

In [None]:
source .venv/bin/activate

In [None]:
python setup.py install

In [None]:
ls

In [None]:
ls models

In [None]:
ls data

In [None]:
ls data4th

In [None]:
smtag-neo2xml --help

In [None]:
smtag-convert2th --help

In [None]:
smtag-meta --help

In [None]:
smtag-eval --help

In [None]:
smtag-predict --help

## Preparation of corpus of xml documents and images

In [None]:
smtag-neo2xml -l10000 -f 181203all # on large sd-graph with 30000 panels from 1100 papers

In [None]:
smtag-ocr # run only once

## Small molecules

Classic

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X10 -y ".//sd-tag[@type='molecule']" \
    -f 10X_L1200_molecule \
    -w /efs/smtag --noocr

    smtag-meta -f 10X_L1200_molecule -w /efs/smtag \
    -E120 -Z128 -R0.01 \
    -o small_molecule \
    -w /efs/smtag

saved `10X_L1200_molecule_small_molecule_2019-01-05-19-23.sddl` (size: 96117)
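
The dataset and model names in these notes follow a regular pattern: `<X>X_L<L>_<label>` for the dataset passed to `-f`, and `<dataset>_<output>_<timestamp>` for the saved model. A minimal sketch of that convention is given below; the helper names `dataset_name` and `model_archive` are hypothetical and are not part of the smtag CLI, which takes whatever name is given to `-f`.

```shell
# Hypothetical helpers illustrating the naming pattern used in these notes.

# Dataset name from sampling factor (-X), example length (-L) and a label.
dataset_name() {
    printf '%sX_L%s_%s\n' "$1" "$2" "$3"
}

# Model archive name from dataset name, output feature (-o) and timestamp.
model_archive() {
    printf '%s_%s_%s.zip\n' "$1" "$2" "$3"
}

ds=$(dataset_name 10 1200 molecule)
echo "$ds"                                            # 10X_L1200_molecule
model_archive "$ds" small_molecule 2019-01-05-19-23   # 10X_L1200_molecule_small_molecule_2019-01-05-19-23.zip
```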

    smtag-eval -f .zip -S -T
    smtag-eval -f 5X_L1200_molecule_small_molecule_2018-12-19-06-04 -m 5X_L1200_molecule_small_molecule_2018-12-19-06-04.zip
    
    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_molecule_small_molecule_2019-01-05-19-23.zip ../py-smtag/smtag/rack/small_molecule.zip

With 4 super layers, more sampling and low learning rate

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X15 -y ".//sd-tag[@type='molecule']" \
    -f 15X_L1200_molecule -w /efs/smtag --noocr

    smtag-meta -f 15X_L1200_molecule \
    -E200 -Z32 -R0.001 \
    -o small_molecule \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2" \
    -w /efs/smtag

- not great, 68% or so
- overfitting after 15 epochs


## Gene products

In [None]:
smtag-convert2th -c 181203all \
-E ".//figure-caption" \
-L1200 -X5 -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
-f 5X_L1200_geneprod --noocr \
-w /efs/smtag

In [None]:
smtag-meta -f 5X_L1200_geneprod \
-E120 -Z32 -R0.01 \
-o geneprod \
-w /efs/smtag 

model saved under `5X_L1200_geneprod_geneprod_2018-12-22-18-16.sddl`

In [None]:
smtag-eval -f 3X_L600_geneprod -m 3X_L600_geneprod_geneprod_2018-12-05-00-32.zip -S -T

In [None]:
smtag-eval -f .zip

Trying without entry BN, with tracking

    smtag-meta -f 5X_L1200_geneprod \
    -E120 -Z32 -R0.005 \
    -o geneprod \
    -w /efs/smtag
    
- Without entry BN with tracking: does not work very well, immediate learning to max 75% accuracy without further progress
- using -R0.01 or -Z128 does not improve anything
- trying without entry BN __and__ without tracking: a bit better
- trying restoring BN with tracking (classic mode): best so far
- trying with -k "6,6,6,6" -n "16,8,8,8" -p "2,2,2,2": even better

Trying 4 layers but __without__ entry BN and __without__ tracking, smaller learning rate (0.001) with more epochs (500)

    smtag-meta -f 5X_L1200_geneprod \
    -E500 -Z32 -R0.001 \
    -o geneprod \
    -k "6,6,6,6" -n "16,8,8,8" -p "2,2,2,2" \
    -w /efs/smtag

- better: 80% max, overfitting starts at epoch 120

Trying with more sampling, with entry BN and tracking (classic), reduce features first super layer

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X15 -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -f 15X_L1200_geneprod --noocr \
    -w /efs/smtag

    smtag-meta -f 15X_L1200_geneprod \
    -E200 -Z32 -R0.001 \
    -o geneprod \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2" \
    -w /efs/smtag

- better than 3 layers
- this seems pretty good, with the combination of higher sampling, feature bottleneck and slow learning
- remarkably, no overfitting after 200 epochs
- still, entry BN is expected to cause problems when short fragments are padded with spaces, which changes the input distribution
- small batch size makes training slow

__saved `15X_L1200_geneprod_geneprod_2018-12-31-20-12.sddl` (size: 115412)__

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/15X_L1200_geneprod_geneprod_2018-12-31-20-12.zip ../py-smtag/smtag/rack/geneprod.zip
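
The same `scp` pattern is repeated for every model that goes into the rack. As a sketch, the command can be generated from the model name and the rack file name. `rack_copy_cmd` is a hypothetical helper; it only prints the command (dry run) and reuses the key file, host and paths that appear in these notes.

```shell
# Hypothetical dry-run helper: prints the scp command, does not execute it.
rack_copy_cmd() {
    # $1: model name without extension, $2: rack file name without extension
    printf 'scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/%s.zip ../py-smtag/smtag/rack/%s.zip\n' "$1" "$2"
}

rack_copy_cmd 15X_L1200_geneprod_geneprod_2018-12-31-20-12 geneprod
```

Printing first makes it easy to check which model ends up under which rack name before anything is overwritten.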

Trying with 6 super layers, which gives a window of 1200 characters, with classical BN

    smtag-meta -f 15X_L1200_geneprod \
    -E500 -Z32 -R0.001 \
    -o geneprod \
    -k "6,6,6,6,6,6" -n "8,8,8,8,8,8" -p "2,2,2,2,2,2" \
    -w /efs/smtag

Not better than 3 layers after 160 epochs (80%), and starting to overfit

In [None]:
# trying half window padding instead of 20
smtag-convert2th -c 181203all \
-E ".//figure-caption" \
-L1200 -X15 -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
-p 64 \
-f 15X_L1200_geneprod_64_padding --noocr \
-w /efs/smtag

In [None]:
# trying without padding at all
smtag-meta -f 15X_L1200_geneprod \
-E200 -Z32 -R0.001 \
-o geneprod \
-k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2" \
-p 0 \
-w /efs/smtag

Trying panel-level training

    smtag-convert2th -c 181203all \
    -E ".//sd-panel" \
    -L600 -X10 -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -f 10X_L600_geneprod  \
    -w /efs/smtag

    cp models/3X_L600_geneprod_geneprod_2018-12-05-00-32.zip rack/geneprod.zip

Trying concurrent model with whole document examples, no entry BN, with BN tracking

    smtag-convert2th -c 181203all \
    -E "." \
    -L6000 -X5 \
    -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -f 5X_L6000_geneprod_whole_doc -w /efs/smtag --noocr

    smtag-meta -f 5X_L6000_geneprod_whole_doc -w /efs/smtag \
    -E120 -Z32 -R0.01 \
    -o geneprod \
    -k "6,6,6,6" -n "16,8,8,8" -p "2,2,2,2"

saved `5X_L6000_geneprod_whole_doc_geneprod_2018-12-23-13-53.json`

## Subcellular

Figure level, classic

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X10 -y ".//sd-tag[@type='subcellular']" \
    -f 10X_L1200_subcellular -w /efs/smtag --noocr

Rapid training

    smtag-meta -f 10X_L1200_subcellular -w /efs/smtag \
    -E120 -Z128 -R0.01 \
    -o subcellular

saved as `10X_L1200_subcellular_subcellular_2018-12-22-21-12.sddl` (size: 100167)

With 4 layers, small batch size and slow learning

    smtag-meta -f 10X_L1200_subcellular -w /efs/smtag \
    -E120 -Z32 -R0.001 \
    -o subcellular \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2"

__saved as `10X_L1200_subcellular_subcellular_2019-01-03-21-29.sddl` (size: 115412)__

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_subcellular_subcellular_2019-01-03-21-29.zip ../py-smtag/smtag/rack/subcellular.zip

In [None]:
smtag-eval -f  -m .zip -S -T

In [None]:
smtag-eval -f  -m .zip

In [None]:
cp models/.zip rack/subcellular.zip
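
Because the timestamp suffix is zero-padded (`YYYY-MM-DD-HH-MM`), sorting the file names alphabetically also sorts them chronologically. The sketch below uses that to pick the most recent archive for a given dataset and output. `latest_model` is a hypothetical helper, and the demo runs on empty placeholder files in a scratch directory rather than on the real `models` directory.

```shell
# Hypothetical helper: print the newest "<prefix>_<timestamp>.zip" in a directory.
latest_model() {
    latest=""
    # Glob results are sorted, so the last existing match is the newest.
    for f in "$1"/"$2"_*.zip; do
        if [ -e "$f" ]; then latest=$f; fi
    done
    if [ -n "$latest" ]; then basename "$latest"; fi
}

# Demo with empty placeholder files
demo=$(mktemp -d)
touch "$demo/10X_L1200_subcellular_subcellular_2018-12-22-21-12.zip" \
      "$demo/10X_L1200_subcellular_subcellular_2019-01-03-21-29.zip"
latest_model "$demo" 10X_L1200_subcellular_subcellular
# prints 10X_L1200_subcellular_subcellular_2019-01-03-21-29.zip
```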

## Cell

Classic training

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X10 -y ".//sd-tag[@type='cell']" \
    -f 10X_L1200_cell -w /efs/smtag --noocr

    smtag-meta -f 10X_L1200_cell \
    -E120 -Z128 -R0.01 \
    -o cell \
    -w /efs/smtag
    
__saved `10X_L1200_cell_cell_2019-01-04-23-37.sddl` (size: 102531)__

Note: quite a bit of overfitting

Trying with smaller batch and slower learning:

    smtag-meta -f 10X_L1200_cell \
    -E120 -Z32 -R0.001 \
    -o cell \
    -w /efs/smtag
    
77% instead of 75%: pretty similar to classic, perhaps slightly less overfitting?

saved `10X_L1200_cell_cell_2019-01-05-02-07.sddl` (size: 102531)
    
    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_cell_cell_2019-01-04-23-37.zip ../py-smtag/smtag/rack/cell.zip

In [None]:
smtag-eval -f  -m .zip -S -T

In [None]:
smtag-eval -f  -m .zip

## Tissue 

Classic

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X10 -y ".//sd-tag[@type='tissue']" \
    -f 10X_L1200_tissue --noocr \
    -w /efs/smtag

    smtag-meta -f 10X_L1200_tissue \
    -E120 -Z128 -R0.01 \
    -o tissue \
    -w /efs/smtag
    
saved `10X_L1200_tissue_tissue_2019-01-05-09-54.sddl` (size: 102531)

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_tissue_tissue_2019-01-05-09-54.zip ../py-smtag/smtag/rack/tissue.zip


In [None]:
smtag-eval -f 3X_L600_tissue -m 3X_L600_tissue_tissue_2018-12-06-23-58.zip -S -T

In [None]:
smtag-eval -f 3X_L600_tissue -m 3X_L600_tissue_tissue_2018-12-06-23-58.zip

## Organism 

Classic 

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X10 -y ".//sd-tag[@type='organism']" \
    --noocr \
    -f 10X_L1200_organism \
    -w /efs/smtag

    smtag-meta -f 10X_L1200_organism \
    -E120 -Z128 -R0.01 \
    -o organism \
    -w /efs/smtag
    
__saved `10X_L1200_organism_organism_2019-01-05-10-48.sddl` (size: 102517)__

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_organism_organism_2019-01-05-10-48.zip ../py-smtag/smtag/rack/organism.zip
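
The classic runs for subcellular, cell, tissue and organism use the same flags and differ only in the tag type. A dry-run sketch that prints the two commands for each entity is shown below. The function name `classic_commands` is hypothetical; all flags are copied from the commands in these notes.

```shell
# Hypothetical dry-run helper: prints the classic convert + train commands
# for each entity type. Nothing is executed.
classic_commands() {
    for entity in subcellular cell tissue organism; do
        ds="10X_L1200_${entity}"
        printf '%s\n' "smtag-convert2th -c 181203all -E \".//figure-caption\" -L1200 -X10 -y \".//sd-tag[@type='${entity}']\" -f ${ds} -w /efs/smtag --noocr"
        printf '%s\n' "smtag-meta -f ${ds} -E120 -Z128 -R0.01 -o ${entity} -w /efs/smtag"
    done
}

classic_commands
```

On a machine where the smtag tools are installed, the printed lines could be reviewed and then piped to `sh`.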

In [None]:
smtag-eval -f  -m .zip -S -T

In [None]:
smtag-eval -f  -m .zip

## Diseases 

From brat dataset (option `-b`)

    smtag-convert2th -L1200 -X10 -b -c NCBI_disease -f 10X_L1200_NCBI_disease

    smtag-meta -E120 -Z128 -R0.01 -o disease -f 10X_L1200_NCBI_disease -w /efs/smtag
    
saved `10X_L1200_NCBI_disease_disease_2019-01-05-10-57.sddl`
    
    smtag-meta -E120 -Z32 -R0.001 -o disease -f 10X_L1200_NCBI_disease -w /efs/smtag
    
saved `10X_L1200_NCBI_disease_disease_2019-01-05-11-22.sddl`
    
Delays overfitting to 30 epochs instead of 15, but overall neither the accuracy nor the validation loss improves

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_NCBI_disease_disease_2019-01-05-10-57.zip ../py-smtag/smtag/rack/disease.zip
    
Hyperscan

    smtag-meta -f 10X_L1200_NCBI_disease \
    -E15 -Z32 -R0.01 -o disease \
    -H depth,kernel \
    -w /efs/smtag

No huge differences; large variability; maybe kernel 8 best
    
    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/scans/scan_depth_kernel_X25_2019-01-05-11-24/scanned_perf.csv ../py-smtag/scans/scan_depth_kernel_X25_2019-01-05-11-24_scanned_perf.csv
    
The models cause problems when applied. The training set is probably just too small and not varied enough. It would need to be complemented with negative text to avoid crazy false positives.

Pre-training on negative data


    smtag-meta -E60 -Z128 -R0.01 -o disease -f 15X_L1200_geneprod -w /efs/smtag
    
Stopped after 60 epochs

    smtag-meta -E120 -Z128 -R0.01 -o disease -f 10X_L1200_NCBI_disease -m 15X_L1200_geneprod_last_saved.zip -w /efs/smtag
    
    smtag-meta -E60 -Z128 -R0.01 -o disease -f 15X_L1200_geneprod -m 15X_L1200_geneprod_disease_2019-01-06-20-29.zip -w /efs/smtag
    
saved `15X_L1200_geneprod_last_saved.sddl`
    
Stopped after 18 epochs. Apparently kept the ability to recognize disease terms, started at f1 0.75 and climbed quickly to 0.8

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/15X_L1200_geneprod_last_saved.zip ../py-smtag/smtag/rack/disease.zip 
    
Still, many artefacts due to hyphens or strange predictions.

Trying to generate a larger dataset (test set merged into the training set with `cp test/* train`, giving a training set of 1426 examples)

    smtag-convert2th -L1200 -X10 -b -c NCBI_disease -f 10X_L1200_NCBI_disease -w /efs/smtag
    
    smtag-meta -E60 -Z128 -R0.01 -o disease -f 10X_L1200_NCBI_disease -w /efs/smtag
    
Saved `10X_L1200_NCBI_disease_disease_2019-01-07-10-51.sddl`


    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_NCBI_disease_disease_2019-01-07-10-51.zip ../py-smtag/smtag/rack/disease.zip
    
Still the same kind of artifacts due to hyphens, or bizarre tagging of "the levels"... Puzzling!
    
Trying to quench false positives by post-training on negative data:
    
    smtag-meta -E60 -Z128 -R0.01 -o disease -f 15X_L1200_geneprod -m 10X_L1200_NCBI_disease_disease_2019-01-07-10-51.zip -w /efs/smtag
    
Loss goes to 1E-4 after 6 epochs already.
    
Saved `10X_L1200_NCBI_disease_disease_2019-01-07-11-46.sddl`

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_NCBI_disease_disease_2019-01-07-11-46.zip ../py-smtag/smtag/rack/disease.zip
    
Seems to have forgotten after so many epochs. :-( Confirmed by retraining the retrained model. Starts from zero!

Trying slower, more limited training to avoid overfitting

    smtag-meta -E30 -Z32 -R0.001 -o disease -f 10X_L1200_NCBI_disease -w /efs/smtag

__Saved `10X_L1200_NCBI_disease_disease_2019-01-07-12-06.sddl`__

No overfitting!

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_NCBI_disease_disease_2019-01-07-12-06.zip ../py-smtag/smtag/rack/disease.zip
    
Still influenced by INF-γ. Not sure what to do next. Included a cleanup step that replaces em dashes, minus signs and similar characters with the standard ASCII hyphen.
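
For illustration only, the kind of hyphen cleanup described here can be sketched with `sed`. This is not the cleanup code used in smtag; `normalize_hyphens` is a hypothetical name, and the list of characters (hyphen U+2010, non-breaking hyphen U+2011, figure dash U+2012, en dash U+2013, em dash U+2014, minus sign U+2212) is an assumption about what "and other" covers.

```shell
# Hypothetical sketch: replace dash-like Unicode characters with ASCII '-'.
# The characters are written as UTF-8 octal escapes so the script itself
# stays ASCII; LC_ALL=C makes sed treat the input as plain bytes.
normalize_hyphens() {
    LC_ALL=C sed \
        -e "s/$(printf '\342\200\220')/-/g" \
        -e "s/$(printf '\342\200\221')/-/g" \
        -e "s/$(printf '\342\200\222')/-/g" \
        -e "s/$(printf '\342\200\223')/-/g" \
        -e "s/$(printf '\342\200\224')/-/g" \
        -e "s/$(printf '\342\210\222')/-/g"
}

# "INF", an en dash, then "γ"
printf 'INF\342\200\223γ\n' | normalize_hyphens   # prints INF-γ
```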

To intermingle negative examples during the whole training, manually assembled an `encoded/10X_L1200_NCBI_disease_augmented` dataset by adding all of the already encoded examples from `encoded/15X_L1200_geneprod`.

    smtag-convert2th -L1200 -X10 -b -c NCBI_disease -f 10X_L1200_NCBI_disease_augmented -w /efs/smtag
    
Since the examples are already encoded, `smtag-convert2th` does not touch them and should proceed with sampling from the whole set of encoded examples.
    
    smtag-meta -E120 -Z32 -R0.001 -o disease -f 10X_L1200_NCBI_disease_augmented -w /efs/smtag
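
The manual assembly of the augmented dataset might look like the sketch below. It assumes that each encoded example is a file or sub-directory directly under `encoded/<dataset>`, which is not confirmed by these notes; the file names in the demo are placeholders and `augment_dataset` is a hypothetical helper.

```shell
# Hypothetical helper: copy every encoded example from one dataset into another.
augment_dataset() {
    # $1: directory with the negative examples, $2: dataset directory to augment
    cp -R "$1"/. "$2"/
}

# Demo in a scratch directory with placeholder examples
work=$(mktemp -d)
mkdir -p "$work/encoded/15X_L1200_geneprod" \
         "$work/encoded/10X_L1200_NCBI_disease_augmented"
touch "$work/encoded/15X_L1200_geneprod/neg_0001" \
      "$work/encoded/15X_L1200_geneprod/neg_0002" \
      "$work/encoded/10X_L1200_NCBI_disease_augmented/pos_0001"

augment_dataset "$work/encoded/15X_L1200_geneprod" \
                "$work/encoded/10X_L1200_NCBI_disease_augmented"
ls "$work/encoded/10X_L1200_NCBI_disease_augmented"
```

If the two datasets used identical example names, the copy would overwrite files, so the names should be checked first.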
    
It is possible that the negative examples dominate the training set too much. After an initial peak (5th-6th epoch), performance decreases as training progresses. No overfitting. Performance goes back up after a valley (around 30-40 epochs), then goes down again later, when the validation loss goes up. Capacity of the network too small?
   
    smtag-meta -E200 -Z128 -R0.01 -n "16,16,16,16" -p "2,2,2,2" -k "6,6,6,6" -o disease -f 10X_L1200_NCBI_disease_augmented -w /efs/smtag
    
Testing an intermediate model after 65 epochs, f1 0.8, no overfitting:

Saved `10X_L1200_NCBI_disease_augmented_last_saved.sddl`
    
    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_NCBI_disease_augmented_last_saved.zip ../py-smtag/smtag/rack/disease.zip
    
Final model: f1=0.85

Saved `10X_L1200_NCBI_disease_augmented_disease_2019-01-08-17-59.zip`

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_NCBI_disease_augmented_disease_2019-01-08-17-59.zip ../py-smtag/smtag/rack/disease.zip

    


## Experimental assay 

Classic

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X10 -y ".//sd-tag[@category='assay']" \
    -f 10X_L1200_assay \
    --noocr \
    -w /efs/smtag
    
    smtag-meta -f 10X_L1200_assay \
    -E120 -Z128 -R0.01 \
    -o assay \
    -w /efs/smtag
    
__saved `10X_L1200_assay_assay_2019-01-05-15-14.sddl`__

Checking hyperparameters a bit

    smtag-meta -f 10X_L1200_assay \
    -E30 -Z128 -R0.01 -o assay \
    -H depth,kernel -I 100 \
    -w /efs/smtag
    
Very little impact. With a kernel of 3 or so, results are less good, but otherwise the model seems remarkably insensitive.
    
    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_assay_assay_2019-01-05-15-14.zip ../py-smtag/smtag/rack/exp_assay.zip

## Intervention-assay geneprod 

Classical training

    smtag-convert2th -c 181203all \
    -L1200 -X15 \
    -E ".//figure-caption" \
    -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -e ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -A ".//sd-tag[@role='intervention']",".//sd-tag[@role='assayed']",".//sd-tag[@role='normalizing']",".//sd-tag[@role='experiment']",".//sd-tag[@role='component']", \
    -f 15X_L1200_geneprod_anonym_not_reporter -w /efs/smtag --noocr
    
    smtag-meta -f 15X_L1200_geneprod_anonym_not_reporter -w /efs/smtag \
    -E120 -Z128 -R0.01 \
    -o intervention,assayed \
    -k "6,6,6" -n "8,8,8" -p "2,2,2"

__saved `15X_L1200_geneprod_anonym_not_reporter_intervention_assayed_2019-01-04-15-05.sddl` (size: 102679)__

    smtag-meta -f 15X_L1200_geneprod_anonym_not_reporter -w /efs/smtag \
    -E120 -Z32 -R0.001 \
    -o intervention,assayed \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2"
    
Slightly worse than the 3 layer version, intervention 69%, assayed 78%, slightly more overfitting.

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/15X_L1200_geneprod_anonym_not_reporter_intervention_assayed_2019-01-04-15-05.zip ../py-smtag/smtag/rack/role_geneprod.zip

Trying whole document, __without entry BN__, tracking on

    smtag-convert2th -c 181203all \
    -L6000 -X5 \
    -E "." \
    -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -e ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -A ".//sd-tag[@role='intervention']",\
    ".//sd-tag[@role='assayed']",\
    ".//sd-tag[@role='normalizing']",\
    ".//sd-tag[@role='experiment']",\
    ".//sd-tag[@role='component']", \
    -f 5X_L6000_whole_doc_geneprod_anonym_not_reporter -w /efs/smtag --noocr

    smtag-meta -f 5X_L6000_whole_doc_geneprod_anonym_not_reporter -w /efs/smtag \
    -E120 -Z32 -R0.01 \
    -o intervention,assayed \
    -k "6,6,6,6" -n "16,8,8,8" -p "2,2,2,2"

- not better
- later experiments suggest tracking should be off if there is no entry BN?

saved `5X_L6000_whole_doc_geneprod_anonym_not_reporter_intervention_assayed_2018-12-24-13-13.sddl` (size: 142459)

Trying whole document, entry BN and tracking (classic), reduced features number (8 everywhere), slower learning rate:

    smtag-meta -f 5X_L6000_whole_doc_geneprod_anonym_not_reporter -w /efs/smtag \
    -E500 -Z32 -R0.001 \
    -o intervention,assayed \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2"
    
- not good, intervention 60%, assayed 70%, overfitting

Trying without skip links in the U-Net to force a bottleneck

    smtag-meta -f 5X_L6000_whole_doc_geneprod_anonym_not_reporter \
    -E500 -Z32 -R0.01 \
    -o intervention,assayed \
    -k "6,6,6" -n "16,32,64" -p "2,2,2" \
    -w /efs/smtag 

- a disaster...

## Geneprod Reporter

Classical

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X5 \
    -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -e ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -f 5X_L1200_geneprod_reporter -w /efs/smtag --noocr

    smtag-meta -f 5X_L1200_geneprod_reporter -w /efs/smtag \
    -E120 -Z128 -R0.01 \
    -o reporter

saved `5X_L1200_geneprod_reporter_reporter_2018-12-22-23-44.sddl` (size: 96117)

Trying 4 layers, with BN and tracking

    smtag-meta -f 5X_L1200_geneprod_reporter -w /efs/smtag \
    -E120 -Z32 -R0.001 \
    -o reporter \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2"

__saved `5X_L1200_geneprod_reporter_reporter_2019-01-03-18-48.sddl` (size: 115412)__


    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/5X_L1200_geneprod_reporter_reporter_2019-01-03-18-48.zip ../py-smtag/smtag/rack/reporter_geneprod.zip

In [None]:
smtag-eval -f .zip -S -T

In [None]:
smtag-eval -f 5X_L1200_geneprod_reporter -m 

In [None]:
# trying concurrent model with whole document examples,
# 32 ex batches, reduced learning rate, 4 layers
# without entry BN but with BN tracking
smtag-convert2th -c 181203all \
-E "." \
-L6000 -X5 \
-y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
-e ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
-f 5X_L6000_geneprod_reporter_whole_doc -w /efs/smtag --noocr

smtag-meta -f 5X_L6000_geneprod_reporter_whole_doc -w /efs/smtag \
-E120 -Z32 -R0.005 -o reporter \
-k "6,6,6,6" -n "16,8,8,8" -p "2,2,2,2"

saved `5X_L6000_geneprod_reporter_whole_doc_reporter_2018-12-25-21-32.sddl` (size: 142311)

In [None]:
cp models/.zip rack/reporter_geneprod.zip

## Intervention-assay small molecule

In [None]:
smtag-convert2th -c 181203all \
-L600 -X3 \
-y ".//sd-tag[@type='molecule']" \
-e ".//sd-tag[@type='molecule']" \
-A ".//sd-tag[@type='molecule']" \
-f 3X_L600_molecule_anonym

In [None]:
smtag-meta -f 3X_L600_molecule_anonym \
-E120 -Z128 -R0.01 \
-o intervention,assayed

saved `3X_L600_molecule_anonym_intervention_assayed_2018-12-13-05-20.sddl` (size: 100315)

In [None]:
cp models/3X_L600_molecule_anonym_intervention_assayed_2018-12-13-05-20.zip rack/role_small_molecule.zip

In [None]:
smtag-eval -f 3X_L600_molecule_anonym -m 3X_L600_molecule_anonym_intervention_assayed_2018-12-13-05-20.zip -S -T

In [None]:
smtag-eval -f 3X_L600_molecule_anonym -m 3X_L600_molecule_anonym_intervention_assayed_2018-12-13-05-20.zip

## Panel start

With 4 layers:
    
    smtag-meta -f 15X_L1200_geneprod \
    -E100 -Z32 -R0.001 \
    -o panel_start \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2" \
    -w /efs/smtag

__saved `15X_L1200_geneprod_panel_start_2019-01-03-04-15.sddl` (size: 115412)__

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/15X_L1200_geneprod_panel_start_2019-01-03-04-15.zip ../py-smtag/smtag/rack/panel_start.zip

In [None]:
smtag-eval -f  -m .zip -S -T

In [None]:
smtag-eval -f -m .zip

In [None]:
# trying whole document level, without entry BN, BN tracking
smtag-convert2th -c 181203all \
--noocr -E "." \
-L6000 -X10 -f 10X_L6000_whole_figure_level_no_ocr \
-w /efs/smtag
# wrong file name...

In [None]:
# wrong file name...
smtag-meta -f 10X_L6000_whole_figure_level_no_ocr \
-E120 -Z32 -R0.005 \
-o panel_start \
-k "6,6,6,6" -n "16,8,8,8" -p "2,2,2,2" \
-w /efs/smtag

In [None]:
saved 10X_L6000_whole_figure_level_no_ocr_panel_start_2018-12-26-13-19.sddl (size: 142325)
# WRONG NAME...

## Special models (experimental)

### Whole document level intervention-assayed model

In [None]:
smtag-convert2th -c 181203all \
--noocr -E "." \
-e ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
-A ".//sd-tag[@role='intervention']",\
".//sd-tag[@role='assayed']",\
".//sd-tag[@role='normalizing']",\
".//sd-tag[@role='experiment']",\
".//sd-tag[@role='component']", \
-L6000 -X -f 1X_L5000_doc_level_geneprod_anonym_not_reporter_no_ocr

In [None]:
smtag-meta -f 1X_L5000_doc_level_geneprod_anonym_not_reporter_no_ocr \
-E200 -Z32 -R0.01 \
-n 8,8,8,8 -p 2,2,2,2 -k 6,6,6,6 \
-o intervention,assayed

In [None]:
smtag-convert2th -c 181203all \
--noocr -E "." \
-y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
-L6000 -X3 -f 3X_L6000_doc_level_geneprod_no_ocr

In [None]:
smtag-meta -f 3X_L6000_doc_level_geneprod_no_ocr \
-E200 -Z32 -R0.01 \
-n 8,8,8,8 -p 2,2,2,2 -k 6,6,6,6 \
-o geneprod

## Rack for SmartTag engine

In [None]:
ls rack

## Hyperparam scans

In [None]:
smtag-meta -E50 -o panel_start -H depth,kernel -I 25 -f 5X_L1200_entities -w /efs/smtag # after data transfer to GPU machine

Winning combo:

    4,4,4,4
    
But a lot of variability. Initialization issues?

In [None]:
smtag-meta -E50 -o geneprod -H kernel,depth -I 25 -f 5X_L1200_gene_protein -w /efs/smtag # after transfer to GPU machine

Winning combo with `f1=0.80` after 50 epochs, no overfitting with valid:

    namebase=5X_L1200_gene_protein; modelname=; learning_rate=0.01000000000000001; epochs=50; minibatch_size=32; selected_features=['geneprod']; collapsed_features=[]; overlap_features=[]; features_as_input=[]; nf_table=[8, 8, 8]; pool_table=[2, 2, 2]; kernel_table=[6, 6, 6]; dropout=0.1; validation_fraction=0.2; nf_input=32; nf_output=1
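
The metadata line above is a `;`-separated list of `key=value` pairs, so individual settings can be pulled out with standard tools. The sketch below does that and turns the three tables back into flags. `get_param` and `table_to_flag` are hypothetical helpers, and the mapping of `kernel_table`, `nf_table` and `pool_table` to `-k`, `-n` and `-p` is inferred from the commands in these notes, not confirmed.

```shell
# Shortened copy of the metadata line, for the demo
meta='namebase=5X_L1200_gene_protein; modelname=; learning_rate=0.01000000000000001; epochs=50; minibatch_size=32; nf_table=[8, 8, 8]; pool_table=[2, 2, 2]; kernel_table=[6, 6, 6]; dropout=0.1'

# Hypothetical helper: print the value of one key from a "k=v; k=v" string.
get_param() {
    # $1: metadata string, $2: key
    printf '%s\n' "$1" | tr ';' '\n' | sed -e 's/^ *//' | grep "^$2=" | cut -d= -f2-
}

# Hypothetical helper: "[6, 6, 6]" -> "6,6,6"
table_to_flag() {
    get_param "$1" "$2" | sed -e 's/[][ ]//g'
}

get_param "$meta" epochs          # prints 50
get_param "$meta" kernel_table    # prints [6, 6, 6]
printf -- '-k "%s" -n "%s" -p "%s"\n' \
    "$(table_to_flag "$meta" kernel_table)" \
    "$(table_to_flag "$meta" nf_table)" \
    "$(table_to_flag "$meta" pool_table)"
# prints -k "6,6,6" -n "8,8,8" -p "2,2,2"
```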