Building the SmartTag models
===


## Initial checks

```
usage: smtag-meta [-h] [-f FILE] [-E EPOCHS] [-Z MINIBATCH_SIZE]
                  [-R LEARNING_RATE] [-D DROPOUT_RATE] [-o OUTPUT_FEATURES]
                  [-i FEATURES_AS_INPUT] [-a OVERLAP_FEATURES]
                  [-c COLLAPSED_FEATURES] [-n NF_TABLE] [-k KERNEL_TABLE]
                  [-p POOL_TABLE] [-w WORKING_DIRECTORY] [-H HYPERPARAMS]
                  [-I ITERATIONS] [-m MODEL] [--ocrxy] [--ocr1] [--ocr2]
                  [--viz]

Top level module to manage training.

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Namebase of dataset to import (default:
                        demo_xml_train)
  -E EPOCHS, --epochs EPOCHS
                        Number of training epochs. (default: 200)
  -Z MINIBATCH_SIZE, --minibatch_size MINIBATCH_SIZE
                        Minibatch size. (default: 32)
  -R LEARNING_RATE, --learning_rate LEARNING_RATE
                        Learning rate. (default: 0.01)
  -D DROPOUT_RATE, --dropout_rate DROPOUT_RATE
                        Dropout rate. (default: 0.1)
  -o OUTPUT_FEATURES, --output_features OUTPUT_FEATURES
                        Selected output features (use quotes if comma+space
                        delimited). (default: geneprod)
  -i FEATURES_AS_INPUT, --features_as_input FEATURES_AS_INPUT
                        Features that should be added to the input (use quotes
                        if comma+space delimited). (default: )
  -a OVERLAP_FEATURES, --overlap_features OVERLAP_FEATURES
                        Features that should be combined by intersecting them
                        (equivalent to AND operation) (use quotes if
                        comma+space delimited). (default: )
  -c COLLAPSED_FEATURES, --collapsed_features COLLAPSED_FEATURES
                        Features that should be collapsed into a single one
                        (equivalent to OR operation) (use quotes if
                        comma+space delimited). (default: )
  -n NF_TABLE, --nf_table NF_TABLE
                        Number of features in each hidden super-layer.
                        (default: 8,8,8)
  -k KERNEL_TABLE, --kernel_table KERNEL_TABLE
                        Convolution kernel for each hidden layer. (default:
                        6,6,6)
  -p POOL_TABLE, --pool_table POOL_TABLE
                        Pooling for each hidden layer (use quotes if
                        comma+space delimited). (default: 2,2,2)
  -w WORKING_DIRECTORY, --working_directory WORKING_DIRECTORY
                        Specify the working directory for meta, where to read
                        and write files to (default: None)
  -H HYPERPARAMS, --hyperparams HYPERPARAMS
                        Perform a scanning of the hyperparameters selected.
                        (default: )
  -I ITERATIONS, --iterations ITERATIONS
                        Number of iterations for the hyperparameters scanning.
                        (default: 25)
  -m MODEL, --model MODEL
                        Load pre-trained model and continue training.
                        (default: )
  --ocrxy               Use as additional input position and orientation of
                        words extracted by OCR from the illustration.
                        (default: False)
  --ocr1                Use as additional presence of words extracted by OCR
                        from the illustration. (default: False)
  --ocr2                Use as additional input orientation of words extracted
                        by OCR from the illustration. (default: False)
  --viz                 Use as additional visual features extracted from the
                        illustration. (default: False)
```

```
usage: smtag-eval [-h] [-f FILE] [-m MODEL] [-T] [-S]

Accuracy evaluation.

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Basename of the dataset to import (testset) (default:
                        test_entities_test)
  -m MODEL, --model MODEL
                        Basename of the model to benchmark. (default:
                        entities.sddl)
  -T, --no_token        Flag to disable tokenization. (default: False)
  -S, --scan            Flag to switch to threshold scaning mode. (default:
                        False)
```

## Preparation of corpus of xml documents and images

    smtag-neo2xml -l10000 -f 181203all # on large sd-graph with 30000 panels from 1100 papers

    smtag-ocr # run only once

## Small molecules

Classic

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X10 -y ".//sd-tag[@type='molecule']" \
    -f 10X_L1200_molecule \
    -w /efs/smtag --noocr

    smtag-meta -f 10X_L1200_molecule -w /efs/smtag \
    -E120 -Z128 -R0.01 \
    -o small_molecule

saved `10X_L1200_molecule_small_molecule_2019-01-05-19-23.sddl` (size: 96117)

    smtag-eval -f .zip -S -T
    smtag-eval -f 5X_L1200_molecule_small_molecule_2018-12-19-06-04 -m 5X_L1200_molecule_small_molecule_2018-12-19-06-04.zip
    
    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_molecule_small_molecule_2019-01-05-19-23.zip ../py-smtag/smtag/rack/small_molecule.zip
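
The same copy step recurs for every model in these notes: the `.zip` saved under `/efs/smtag/models` is copied into the local `rack` directory under a short name. Below is a minimal Python sketch that assembles this `scp` command from the values used above. The helper name `rack_copy_command` and its parameters are hypothetical and are not part of `smtag`.

```python
import shlex

# Values taken from the scp commands in these notes.
KEY_FILE = "basicLinuxAMI.pem"
REMOTE_MODELS = "ec2-user@smtag-web:/efs/smtag/models"
RACK_DIR = "../py-smtag/smtag/rack"


def rack_copy_command(model_basename, rack_name):
    """Build the scp argument list that copies a saved model into the rack.

    model_basename: name reported by smtag-meta, without extension.
    rack_name: short name under which the model is stored in the rack.
    (Hypothetical helper, for illustration only.)
    """
    return [
        "scp", "-i", KEY_FILE,
        f"{REMOTE_MODELS}/{model_basename}.zip",
        f"{RACK_DIR}/{rack_name}.zip",
    ]


cmd = rack_copy_command(
    "10X_L1200_molecule_small_molecule_2019-01-05-19-23", "small_molecule"
)
# Print a shell-ready command line; nothing is executed here.
print(shlex.join(cmd))
```

Returning an argument list instead of a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.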

With 4 super layers, more sampling and low learning rate

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X15 -y ".//sd-tag[@type='molecule']" \
    -f 15X_L1200_molecule -w /efs/smtag --noocr

    smtag-meta -f 15X_L1200_molecule \
    -E200 -Z32 -R0.001 \
    -o small_molecule \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2" \
    -w /efs/smtag

- not great, 68% or so
- overfitting after 15 epochs
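
Overfitting onset is judged by eye throughout these notes. One way to make the "overfitting after N epochs" observation reproducible is sketched below, assuming a list of per-epoch validation losses is available. The function name `overfitting_onset` and the `patience` parameter are illustrative assumptions, not something `smtag-meta` provides.

```python
def overfitting_onset(valid_losses, patience=5):
    """Return the epoch index of the best validation loss once the loss
    has failed to improve for `patience` consecutive epochs, or None if
    the loss is still improving at the end of the series.
    (Illustrative helper, not part of smtag.)
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            # New minimum: remember where it happened.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: report the minimum.
            return best_epoch
    return None


# Synthetic losses for illustration only (not measured values).
losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.65, 0.7, 0.8]
onset = overfitting_onset(losses, patience=3)
print(onset)
```

The returned epoch gives a rough upper bound for `-E` in a subsequent, shorter run.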


## Gene products

Classic L=1200 character dataset

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X5 -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -f 5X_L1200_geneprod --noocr \
    -w /efs/smtag
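
To make the selectors concrete, here is a small Python sketch of what the `-E` and `-y` XPath expressions match. The sample XML is invented for illustration, and using `xml.etree.ElementTree` is an assumption; this does not show how `smtag-convert2th` is implemented.

```python
import xml.etree.ElementTree as ET

# Invented sample document, for illustration only.
SAMPLE = (
    "<figure><figure-caption>Cells were treated with "
    '<sd-tag type="molecule">rapamycin</sd-tag> and '
    '<sd-tag type="protein">mTOR</sd-tag> levels were measured after '
    '<sd-tag type="gene">ATG5</sd-tag> knockdown.'
    "</figure-caption></figure>"
)


def select_text(root, example_xpath, tag_xpaths):
    """Collect the text of all elements matched by any of `tag_xpaths`
    inside each element matched by `example_xpath`."""
    selected = []
    for example in root.findall(example_xpath):
        for xpath in tag_xpaths:
            selected.extend(el.text for el in example.findall(xpath))
    return selected


root = ET.fromstring(SAMPLE)
geneprod = select_text(
    root,
    ".//figure-caption",
    [".//sd-tag[@type='gene']", ".//sd-tag[@type='protein']"],
)
molecule = select_text(root, ".//figure-caption", [".//sd-tag[@type='molecule']"])
print(geneprod, molecule)
```

Passing two expressions to `-y` selects the union of gene and protein tags, which is what the `geneprod` feature is trained on.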

Trying without entry BN, with tracking

    smtag-meta -f 5X_L1200_geneprod \
    -E120 -Z32 -R0.005 \
    -o geneprod \
    -w /efs/smtag
    
- Without entry BN with tracking: does not work very well, immediate learning to max 75% accuracy without further progress
- using -R0.01 or -Z128 does not improve anything
- trying without entry BN __and__ without tracking: a bit better
- trying restoring BN with tracking (classic mode): best so far
- trying with -k "6,6,6,6" -n "16,8,8,8" -p "2,2,2,2": even better

Trying 4 layers but __without__ entry BN and __without__ tracking, smaller learning rate (0.001) with more epochs (500)

    smtag-meta -f 5X_L1200_geneprod \
    -E500 -Z32 -R0.001 \
    -o geneprod \
    -k "6,6,6,6" -n "16,8,8,8" -p "2,2,2,2" \
    -w /efs/smtag

- better 80% max, overfit starts at epoch 120

Trying with more sampling, with entry BN and tracking (classic), reduce features first super layer

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X15 -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -f 15X_L1200_geneprod --noocr \
    -w /efs/smtag

    smtag-meta -f 15X_L1200_geneprod \
    -E200 -Z32 -R0.001 \
    -o geneprod \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2" \
    -w /efs/smtag

- better than 3 layers
- this seems pretty good, with the combination of higher sampling, feature bottleneck and slow learning
- remarkably, no overfitting after 200 epochs
- still, entry BN is expected to cause problems when short fragments are padded with spaces, which changes the input distribution
- small batch size makes training slow

__saved `15X_L1200_geneprod_geneprod_2018-12-31-20-12.sddl` (size: 115412)__

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/15X_L1200_geneprod_geneprod_2018-12-31-20-12.zip ../py-smtag/smtag/rack/geneprod.zip

Trying with 6 super layers which gives window of 1200 characters
with classical BN

    smtag-meta -f 15X_L1200_geneprod \
    -E500 -Z32 -R0.001 \
    -o geneprod \
    -k "6,6,6,6,6,6" -n "8,8,8,8,8,8" -p "2,2,2,2,2,2" \
    -w /efs/smtag

Not better than 3 layers after 160 epochs (80%), and starting to overfit.

Trying half-window padding instead of 20

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X15 -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -p 64 \
    -f 15X_L1200_geneprod_64_padding --noocr \
    -w /efs/smtag

Trying without padding at all

    smtag-meta -f 15X_L1200_geneprod \
    -E200 -Z32 -R0.001 \
    -o geneprod \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2" \
    -p 0 \
    -w /efs/smtag

Trying panel-level training

    smtag-convert2th -c 181203all \
    -E ".//sd-panel" \
    -L600 -X10 -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -f 10X_L600_geneprod  \
    -w /efs/smtag


Trying concurrent model with whole document examples, no entry BN, with BN tracking

    smtag-convert2th -c 181203all \
    -E "." \
    -L6000 -X5 \
    -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -f 5X_L6000_geneprod_whole_doc -w /efs/smtag --noocr

    smtag-meta -f 5X_L6000_geneprod_whole_doc -w /efs/smtag \
    -E120 -Z32 -R0.01 \
    -o geneprod \
    -k "6,6,6,6" -n "16,8,8,8" -p "2,2,2,2"

saved `5X_L6000_geneprod_whole_doc_geneprod_2018-12-23-13-53.json`

## Subcellular

Figure level, classic

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X10 -y ".//sd-tag[@type='subcellular']" \
    -f 10X_L1200_subcellular -w /efs/smtag --noocr

Rapid training

    smtag-meta -f 10X_L1200_subcellular -w /efs/smtag \
    -E120 -Z128 -R0.01 \
    -o subcellular

saved `10X_L1200_subcellular_subcellular_2018-12-22-21-12.sddl` (size: 100167)

With 4 layers, small batch size and slow learning

    smtag-meta -f 10X_L1200_subcellular -w /efs/smtag \
    -E120 -Z32 -R0.001 \
    -o subcellular \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2"

__saved `10X_L1200_subcellular_subcellular_2019-01-03-21-29.sddl` (size: 115412)__

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_subcellular_subcellular_2019-01-03-21-29.zip ../py-smtag/smtag/rack/subcellular.zip

## Cell

Classic training

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X10 -y ".//sd-tag[@type='cell']" \
    -f 10X_L1200_cell -w /efs/smtag --noocr

    smtag-meta -f 10X_L1200_cell \
    -E120 -Z128 -R0.01 \
    -o cell \
    -w /efs/smtag
    
__saved `10X_L1200_cell_cell_2019-01-04-23-37.sddl` (size: 102531)__

Note: quite a bit of overfitting

Trying with smaller batch and slower learning:

    smtag-meta -f 10X_L1200_cell \
    -E120 -Z32 -R0.001 \
    -o cell \
    -w /efs/smtag
    
77% instead of 75%: pretty similar to classic, perhaps slightly less overfitting?

saved `10X_L1200_cell_cell_2019-01-05-02-07.sddl` (size: 102531)
    
    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_cell_cell_2019-01-04-23-37.zip ../py-smtag/smtag/rack/cell.zip

## Tissue 

Classic

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X10 -y ".//sd-tag[@type='tissue']" \
    -f 10X_L1200_tissue --noocr \
    -w /efs/smtag

    smtag-meta -f 10X_L1200_tissue \
    -E120 -Z128 -R0.01 \
    -o tissue \
    -w /efs/smtag
    
saved `10X_L1200_tissue_tissue_2019-01-05-09-54.sddl` (size: 102531)

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_tissue_tissue_2019-01-05-09-54.zip ../py-smtag/smtag/rack/tissue.zip


## Organism 

Classic 

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X10 -y ".//sd-tag[@type='organism']" \
    --noocr \
    -f 10X_L1200_organism \
    -w /efs/smtag

    smtag-meta -f 10X_L1200_organism \
    -E120 -Z128 -R0.01 \
    -o organism \
    -w /efs/smtag
    
__saved `10X_L1200_organism_organism_2019-01-05-10-48.sddl` (size: 102517)__

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_organism_organism_2019-01-05-10-48.zip ../py-smtag/smtag/rack/organism.zip

## Diseases 

From brat dataset (option `-b`)

    smtag-convert2th -L1200 -X10 -b -c NCBI_disease -f 10X_L1200_NCBI_disease

    smtag-meta -E120 -Z128 -R0.01 -o disease -f 10X_L1200_NCBI_disease -w /efs/smtag
    
saved `10X_L1200_NCBI_disease_disease_2019-01-05-10-57.sddl`
    
    smtag-meta -E120 -Z32 -R0.001 -o disease -f 10X_L1200_NCBI_disease -w /efs/smtag
    
saved `10X_L1200_NCBI_disease_disease_2019-01-05-11-22.sddl`
    
Delays overfitting to 30 epochs instead of 15, but overall neither accuracy nor validation loss improves.

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_NCBI_disease_disease_2019-01-05-10-57.zip ../py-smtag/smtag/rack/disease.zip
    
Hyperscan

    smtag-meta -f 10X_L1200_NCBI_disease \
    -E15 -Z32 -R0.01 -o disease \
    -H depth,kernel \
    -w /efs/smtag

No huge differences; large variability; maybe kernel 8 best
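
For intuition, a random scan over `depth` and `kernel` can be sketched as below. How `-H` samples internally is not documented in these notes, so the value ranges, the use of one kernel size for all layers, and the helper names `sample_architecture` and `to_cli_args` are all assumptions. Only the `-k`, `-n` and `-p` table formats and the default of 25 iterations come from the `smtag-meta` help text.

```python
import random


def sample_architecture(rng, depths, kernels):
    """Draw one (depth, kernel) pair uniformly from the candidate values."""
    return rng.choice(depths), rng.choice(kernels)


def to_cli_args(depth, kernel, nf=8, pool=2):
    """Expand a sampled configuration into -k/-n/-p table arguments,
    repeating the same value for every super-layer (an assumption)."""
    def table(value):
        return ",".join([str(value)] * depth)
    return ["-k", table(kernel), "-n", table(nf), "-p", table(pool)]


rng = random.Random(0)  # fixed seed so the scan can be repeated
# Hypothetical candidate ranges; 25 draws mirrors the default of -I.
configs = [
    sample_architecture(rng, depths=[2, 3, 4, 5], kernels=[3, 4, 5, 6, 7, 8])
    for _ in range(25)
]
for depth, kernel in configs[:3]:
    print(to_cli_args(depth, kernel))
```

With only 25 draws over 24 possible combinations, each configuration is visited about once, which is consistent with the large variability noted above.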
    
    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/scans/scan_depth_kernel_X25_2019-01-05-11-24/scanned_perf.csv ../py-smtag/scans/scan_depth_kernel_X25_2019-01-05-11-24_scanned_perf.csv
    
The model causes problems when applied in SmartTag. The training set is probably just too small and not varied enough. It would need to be complemented with negative text to avoid crazy false positives.

Pre-training on negative data


    smtag-meta -E60 -Z128 -R0.01 -o disease -f 15X_L1200_geneprod -w /efs/smtag
    
Stopped after 60 epochs

    smtag-meta -E120 -Z128 -R0.01 -o disease -f 10X_L1200_NCBI_disease -m 15X_L1200_geneprod_last_saved.zip -w /efs/smtag
    
    smtag-meta -E60 -Z128 -R0.01 -o disease -f 15X_L1200_geneprod -m 15X_L1200_geneprod_disease_2019-01-06-20-29.zip -w /efs/smtag
    
saved `15X_L1200_geneprod_last_saved.sddl`
    
Stopped after 18 epochs. Apparently kept the ability to recognize disease terms, started at f1 0.75 and climbed quickly to 0.8

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/15X_L1200_geneprod_last_saved.zip ../py-smtag/smtag/rack/disease.zip 
    
Still, many artefacts due to hyphens or strange predictions.

Trying to generate a larger dataset (testset combined with train `cp test/* train` to make training set with 1426 examples)

    smtag-convert2th -L1200 -X10 -b -c NCBI_disease -f 10X_L1200_NCBI_disease -w /efs/smtag
    
    smtag-meta -E60 -Z128 -R0.01 -o disease -f 10X_L1200_NCBI_disease -w /efs/smtag
    
Saved `10X_L1200_NCBI_disease_disease_2019-01-07-10-51.sddl`


    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_NCBI_disease_disease_2019-01-07-10-51.zip ../py-smtag/smtag/rack/disease.zip
    
Still same kind of artifacts due to hyphen or bizarre tagging of "the levels"... Puzzling!
    
Trying to quench false positives by post-training on negative data:
    
    smtag-meta -E60 -Z128 -R0.01 -o disease -f 15X_L1200_geneprod -m 10X_L1200_NCBI_disease_disease_2019-01-07-10-51.zip -w /efs/smtag
    
Loss goes to 1E-4 after 6 epochs already.
    
Saved `10X_L1200_NCBI_disease_disease_2019-01-07-11-46.sddl`

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_NCBI_disease_disease_2019-01-07-11-46.zip ../py-smtag/smtag/rack/disease.zip
    
Seems to have forgotten the disease terms after so many epochs. :-( Confirmed by retraining the retrained model: it starts from zero!

Trying slower more limited training to avoid overfitting

    smtag-meta -E30 -Z32 -R0.001 -o disease -f 10X_L1200_NCBI_disease -w /efs/smtag

Saved `10X_L1200_NCBI_disease_disease_2019-01-07-12-06.sddl`

No overfitting!

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_NCBI_disease_disease_2019-01-07-12-06.zip ../py-smtag/smtag/rack/disease.zip
    
Still influenced by IFN-γ. Not sure what to do next. Included a cleanup step that replaces em dashes, minus signs and other dash-like characters with the standard ASCII hyphen.
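
A minimal sketch of such a cleanup step in Python follows. The exact set of characters handled by the real cleanup is not recorded in these notes, so the list below and the function name `normalize_dashes` are assumptions.

```python
# Dash-like Unicode characters mapped to the ASCII hyphen-minus.
# This list is an assumption; extend it as needed.
DASH_LIKE = (
    "\u2010"  # hyphen
    "\u2011"  # non-breaking hyphen
    "\u2012"  # figure dash
    "\u2013"  # en dash
    "\u2014"  # em dash
    "\u2015"  # horizontal bar
    "\u2212"  # minus sign
)
_DASH_TABLE = str.maketrans({char: "-" for char in DASH_LIKE})


def normalize_dashes(text):
    """Replace every dash-like character with '-' without changing the
    length of the string, so character-level labels stay aligned."""
    return text.translate(_DASH_TABLE)


cleaned = normalize_dashes("IFN\u2013\u03b3 levels \u2212 and non\u2011small cell")
print(cleaned)
```

Because each replacement is one character for one character, the offsets of existing annotations are preserved.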

To intermingle negative examples throughout training, manually assembled an `encoded/10X_L1200_NCBI_disease_augmented` dataset by adding all of the already encoded examples from `encoded/15X_L1200_geneprod`.

    smtag-convert2th -L1200 -X10 -b -c NCBI_disease -f 10X_L1200_NCBI_disease_augmented -w /efs/smtag
    
Since the examples are already encoded, `smtag-convert2th` does not touch them and should proceed with sampling of the whole set of encoded examples. 
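
The manual assembly amounts to copying encoded examples from one dataset directory into another. A hedged Python sketch follows; the internal layout of the `encoded/` directories is not described in these notes, so the helper simply copies every top-level entry and skips names that already exist. The function name `add_encoded_examples` is hypothetical.

```python
import shutil
from pathlib import Path


def add_encoded_examples(src_dir, dst_dir):
    """Copy every top-level entry of src_dir into dst_dir.

    Entries whose name already exists in dst_dir are left untouched.
    Returns (copied, skipped) counts. The directory layout is an assumption.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = skipped = 0
    for item in sorted(src.iterdir()):
        target = dst / item.name
        if target.exists():
            skipped += 1  # never overwrite an existing example
        elif item.is_dir():
            shutil.copytree(item, target)
            copied += 1
        else:
            shutil.copy2(item, target)
            copied += 1
    return copied, skipped
```

A call such as `add_encoded_examples("encoded/15X_L1200_geneprod", "encoded/10X_L1200_NCBI_disease_augmented")`, run from the working directory, would reproduce the manual step under these assumptions.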
    
    smtag-meta -E120 -Z32 -R0.001 -o disease -f 10X_L1200_NCBI_disease_augmented -w /efs/smtag
    
Possibly the negative examples dominate the training set too much. After an initial peak (5th-6th epoch), performance decreases as training progresses. No overfitting. Performance goes back up after a valley (around 30-40 epochs), then goes down again later, when validation loss goes up. Is the capacity of the network too small? Trying with 4 super layers and 16 features/layer:
   
    smtag-meta -E200 -Z128 -R0.01 -n "16,16,16,16" -p "2,2,2,2" -k "6,6,6,6" -o disease -f 10X_L1200_NCBI_disease_augmented -w /efs/smtag
    
Ha, seems to work! Testing an intermediate model after 65 epochs: f1 0.8, no overfitting.

Saved `10X_L1200_NCBI_disease_augmented_last_saved.sddl`
    
    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_NCBI_disease_augmented_last_saved.zip ../py-smtag/smtag/rack/disease.zip
    
Final model: f1=0.85

__Saved `10X_L1200_NCBI_disease_augmented_disease_2019-01-08-17-59.zip`__

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_NCBI_disease_augmented_disease_2019-01-08-17-59.zip ../py-smtag/smtag/rack/disease.zip

    


## Experimental assay 

Classic

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X10 -y ".//sd-tag[@category='assay']" \
    -f 10X_L1200_assay \
    --noocr \
    -w /efs/smtag
    
    smtag-meta -f 10X_L1200_assay \
    -E120 -Z128 -R0.01 \
    -o assay \
    -w /efs/smtag
    
__saved `10X_L1200_assay_assay_2019-01-05-15-14.sddl`__

Quick check of hyperparameters

    smtag-meta -f 10X_L1200_assay \
    -E30 -Z128 -R0.01 -o assay \
    -H depth,kernel -I 100 \
    -w /efs/smtag
    
Very little impact. A kernel of about 3 is worse, but otherwise performance seems remarkably insensitive.
    
    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/10X_L1200_assay_assay_2019-01-05-15-14.zip ../py-smtag/smtag/rack/exp_assay.zip

## Intervention-assay geneprod 

Classical training

    smtag-convert2th -c 181203all \
    -L1200 -X15 \
    -E ".//figure-caption" \
    -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -e ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -A ".//sd-tag[@role='intervention']",".//sd-tag[@role='assayed']",".//sd-tag[@role='normalizing']",".//sd-tag[@role='experiment']",".//sd-tag[@role='component']", \
    -f 15X_L1200_geneprod_anonym_not_reporter \
    --noocr \
    -w /efs/smtag
    
    smtag-meta -f 15X_L1200_geneprod_anonym_not_reporter -w /efs/smtag \
    -E120 -Z128 -R0.01 \
    -o intervention,assayed \
    -k "6,6,6" -n "8,8,8" -p "2,2,2"

__saved `15X_L1200_geneprod_anonym_not_reporter_intervention_assayed_2019-01-04-15-05.sddl (size: 102679)`__

    smtag-meta -f 15X_L1200_geneprod_anonym_not_reporter -w /efs/smtag \
    -E120 -Z32 -R0.001 \
    -o intervention,assayed \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2"
    
Slightly worse than the 3-layer version: intervention 69%, assayed 78%, slightly more overfitting.

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/15X_L1200_geneprod_anonym_not_reporter_intervention_assayed_2019-01-04-15-05.zip ../py-smtag/smtag/rack/role_geneprod.zip

Trying whole document, __without entry BN__, tracking on

    smtag-convert2th -c 181203all \
    -L6000 -X5 \
    -E "." \
    -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -e ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -A ".//sd-tag[@role='intervention']",\
    ".//sd-tag[@role='assayed']",\
    ".//sd-tag[@role='normalizing']",\
    ".//sd-tag[@role='experiment']",\
    ".//sd-tag[@role='component']", \
    -f 5X_L6000_whole_doc_geneprod_anonym_not_reporter -w /efs/smtag --noocr

    smtag-meta -f 5X_L6000_whole_doc_geneprod_anonym_not_reporter -w /efs/smtag \
    -E120 -Z32 -R0.01 \
    -o intervention,assayed \
    -k "6,6,6,6" -n "16,8,8,8" -p "2,2,2,2"

- not better
- later experiments suggest tracking should be off if there is no entry BN?

saved `5X_L6000_whole_doc_geneprod_anonym_not_reporter_intervention_assayed_2018-12-24-13-13.sddl` (size: 142459)

Trying whole document, entry BN and tracking (classic), reduced features number (8 everywhere), slower learning rate:

    smtag-meta -f 5X_L6000_whole_doc_geneprod_anonym_not_reporter -w /efs/smtag \
    -E500 -Z32 -R0.001 \
    -o intervention,assayed \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2"
    
- not good, intervention 60%, assayed 70%, overfitting

Trying without skip links in the U-Net to force a bottleneck

    smtag-meta -f 5X_L6000_whole_doc_geneprod_anonym_not_reporter \
    -E500 -Z32 -R0.01 \
    -o intervention,assayed \
    -k "6,6,6" -n "16,32,64" -p "2,2,2" \
    -w /efs/smtag 

- a disaster...

Hyperscan

    smtag-meta -f 15X_L1200_geneprod_anonym_not_reporter \
    -E20 -Z128 -R0.01 \
    -o intervention,assayed \
    -H depth,pooling,nf -I 100 \
    -w /efs/smtag

scans: `scans/scan_depth_pooling_nf_X100_2019-01-10-12-37/scanned_perf.csv`

- In general: a lot of variability, effects are not drastic; add more depth, more features, don't pool more than 2
- With pool 1: better to have 4-layer depth and tends to be better with more features (13-16), rather variable
- With pool 2: better to have 4-layer depth, with maybe some tendency to be better with more features, perhaps less variable
- With pool 3: worse than the other in general.

Recommendation: `nf=[16,16,16,16] pooling=[2,2,2,2] kernel=[6,6,6,6]` short training (`-E10`) to avoid overfitting:

    smtag-meta -f 15X_L1200_geneprod_anonym_not_reporter -w /efs/smtag \
    -E10 -Z128 -R0.01 \
    -o intervention,assayed \
    -k "6,6,6,6" -n "16,16,16,16" -p "2,2,2,2"
    
saved `15X_L1200_geneprod_anonym_not_reporter_intervention_assayed_2019-01-11-12-14.sddl`

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/15X_L1200_geneprod_anonym_not_reporter_intervention_assayed_2019-01-11-12-14.zip ../py-smtag/smtag/rack/role_geneprod.zip
    
Benchmark
    
    smtag-eval -S -T -f 15X_L1200_geneprod_anonym_not_reporter -m 15X_L1200_geneprod_anonym_not_reporter_intervention_assayed_2019-01-11-12-14.zip -w /efs/smtag
    
    smtag-eval -f 15X_L1200_geneprod_anonym_not_reporter -m 15X_L1200_geneprod_anonym_not_reporter_intervention_assayed_2019-01-11-12-14.zip -w /efs/smtag
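
The `-S` flag switches `smtag-eval` to threshold scanning mode. As a rough illustration of the idea, and not of the actual `smtag-eval` implementation, the sketch below computes the F1 score of binarized per-character scores at several thresholds. The function names and the toy data are assumptions.

```python
def f1_score(predicted, target):
    """F1 from two equal-length sequences of booleans."""
    tp = sum(p and t for p, t in zip(predicted, target))
    fp = sum(p and not t for p, t in zip(predicted, target))
    fn = sum(t and not p for p, t in zip(predicted, target))
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is recovered
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def scan_thresholds(scores, target, thresholds):
    """Return {threshold: F1} after binarizing scores at each threshold."""
    return {
        th: f1_score([s >= th for s in scores], target)
        for th in thresholds
    }


# Toy per-character scores and labels, for illustration only.
scores = [0.9, 0.8, 0.4, 0.2]
target = [True, True, False, False]
results = scan_thresholds(scores, target, [0.1, 0.5, 0.95])
best_threshold = max(results, key=results.get)
print(results, best_threshold)
```

A low threshold tags everything and hurts precision, a high one tags nothing and hurts recall; the scan locates the compromise between the two.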

## Geneprod Reporter

Classical

    smtag-convert2th -c 181203all \
    -E ".//figure-caption" \
    -L1200 -X5 \
    -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -e ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -f 5X_L1200_geneprod_reporter -w /efs/smtag --noocr

    smtag-meta -f 5X_L1200_geneprod_reporter -w /efs/smtag \
    -E120 -Z128 -R0.01 \
    -o reporter

saved `5X_L1200_geneprod_reporter_reporter_2018-12-22-23-44.sddl` (size: 96117)

Trying 4 layers, with BN and tracking

    smtag-meta -f 5X_L1200_geneprod_reporter -w /efs/smtag \
    -E120 -Z32 -R0.001 \
    -o reporter \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2"

__saved `5X_L1200_geneprod_reporter_reporter_2019-01-03-18-48.sddl` (size: 115412)__


    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/5X_L1200_geneprod_reporter_reporter_2019-01-03-18-48.zip ../py-smtag/smtag/rack/reporter_geneprod.zip

Trying concurrent model with whole document examples, 32-example batches, reduced learning rate, 4 layers, without entry BN but with BN tracking

    smtag-convert2th -c 181203all \
    -E "." \
    -L6000 -X5 \
    -y ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -e ".//sd-tag[@type='gene']",".//sd-tag[@type='protein']" \
    -f 5X_L6000_geneprod_reporter_whole_doc -w /efs/smtag --noocr

    smtag-meta -f 5X_L6000_geneprod_reporter_whole_doc -w /efs/smtag \
    -E120 -Z32 -R0.005 -o reporter \
    -k "6,6,6,6" -n "16,8,8,8" -p "2,2,2,2"

saved `5X_L6000_geneprod_reporter_whole_doc_reporter_2018-12-25-21-32.sddl` (size: 142311)

## Intervention-assay small molecule

Classic training set

    smtag-convert2th -c 181203all \
    -L1200 -X15 \
    -E ".//figure-caption" --noocr \
    -y ".//sd-tag[@type='molecule']" \
    -e ".//sd-tag[@type='molecule']" \
    -A ".//sd-tag[@type='molecule']" \
    -f 15X_L1200_molecule_anonym \
    -w /efs/smtag
    
    smtag-meta -f 15X_L1200_molecule_anonym \
    -E120 -Z128 -R0.01 \
    -o intervention,assayed \
    -w /efs/smtag
    
Saved `...`

No recognition of assayed small molecules whatsoever... Number of examples where small molecules are assayed is very small as compared to small molecule (drug) perturbations. Need some balancing mechanism?

Trying to get only assayed molecules:

    smtag-meta -f 15X_L1200_molecule_anonym \
    -E120 -Z128 -R0.01 \
    -o assayed \
    -w /efs/smtag
    
This works to some extent (f1 0.6).
    
Only 1000 panels with small molecules as both intervention and assayed:

    MATCH 
    (f:Figure)-->(p:Panel)-->(t1:Tag {type:"molecule", role:"intervention"}),
    (p)-->(t2:Tag {type:"molecule", role:"assayed"})
    RETURN COUNT(DISTINCT(f))
    
Trying

    smtag-convert2th -c 181203all \
    -L1200 -X15 \
    -E ".//figure-caption" --noocr \
    -y ".//sd-tag[@type='molecule'][@role='intervention']",".//sd-tag[@type='molecule'][@role='assayed']" \
    -e ".//sd-tag[@type='molecule']" \
    -A ".//sd-tag[@type='molecule']" \
    -f 15X_L1200_molecule_anonym_balanced \
    -w /efs/smtag
    
    
    smtag-meta -f 15X_L1200_molecule_anonym_balanced \
    -E40 -Z128 -R0.01 \
    -o intervention,assayed \
    -w /efs/smtag
    
saved `15X_L1200_molecule_anonym_balanced_intervention_assayed_2019-01-09-15-28.sddl`

    scp -i basicLinuxAMI.pem \
    ec2-user@smtag-web:/efs/smtag/models/15X_L1200_molecule_anonym_balanced_intervention_assayed_2019-01-09-15-28.zip \
    ../py-smtag/smtag/rack/role_small_molecule.zip


## Panel start

With 4 layers:
    
    smtag-meta -f 15X_L1200_geneprod \
    -E100 -Z32 -R0.001 \
    -o panel_start \
    -k "6,6,6,6" -n "8,8,8,8" -p "2,2,2,2" \
    -w /efs/smtag

__saved `15X_L1200_geneprod_panel_start_2019-01-03-04-15.sddl` (size: 115412)__

    scp -i basicLinuxAMI.pem ec2-user@smtag-web:/efs/smtag/models/15X_L1200_geneprod_panel_start_2019-01-03-04-15.zip ../py-smtag/smtag/rack/panel_start.zip