<< to keep track of changes made, prev. ideas, etc. >>

apr 8, 2025 - flux_transf.jl
-

this is what was used for the original 2D input into mhsa (which doesn't work; it needs (q, k, v) input)

In [None]:
function (tf::Transf)(input::Float32Matrix2DType) # input is features (978) x batch
    sa_out = tf.mhsa(tf.norm_1(input)) # OG paper states norm after sa, but norm before sa is more common?
    # x = input + tf.dropout(sa_out)
    x = input + sa_out
    mlp_out = tf.mlp(tf.norm_2(x))
    # x = x + tf.dropout(mlp_out)
    x = x + mlp_out
    return x
end

apr 11, 2025 - flux_transf.jl
-

using a for loop and replacing a matrix of -1 rather than making a copy is much faster! 
using the function below (commented parts) is a little easier to read, but the current impl. in the file is much more efficient.
tmp: what if we replaced the -1s with the ranking ints themselves rather than the gene names (Str)?

In [None]:
function sort_gene(expr)
    # data_ranked = Matrix{Int}(undef, size(expr))
    data_ranked = fill(-1, size(expr))
    # gs = Symbol.(gene_symbols)
    n, m = size(expr)
    p = Vector{Int}(undef, n)
    # tmp = sortperm(expr[:, 1])

    for j in 1:m
        e = view(expr, :, j)
        sortperm!(p, e, rev=true)
        # data_ranked[!, j] = gs[tmp]

        for i in 1:n
            data_ranked[i, j] = p[i]
        end
    end
    return data_ranked
end
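
quick sketch (toy values, not from flux_transf.jl) of what `sort_gene` actually stores per column: `sortperm` gives the gene indices ordered from highest to lowest expression, which is not the same thing as the rank of each gene. if the rank per gene is ever needed, `invperm` converts one into the other.

```julia
# Toy column: expression of 3 genes in one sample (values are made up).
expr_col = [0.2, 0.9, 0.5]

# What sort_gene stores: gene indices ordered from highest to lowest expression.
order = sortperm(expr_col, rev=true)   # [2, 3, 1]

# The alternative: the rank of each gene, kept in gene order.
ranks = invperm(order)                 # [3, 1, 2]
```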

GENES AS TOKENS gene-gene interactions (sequence length as len(genes))
- each gene is a position in your sequence
- each token's embedding contains information about that gene's ranking across samples
- the sequence length equals the number of genes you're considering
SAMPLES AS TOKENS sample-sample interactions (sequence length as len(samples))
- each sample is a position in your sequence
- each token's embedding contains the gene ranking information for that sample
- the sequence length equals the number of samples
TECHNICAL CONSIDERATIONS
transformers struggle with very long sequences, so if we have many more genes than samples, using samples as tokens may be more computationally feasible
self-attention complexity grows quadratically with sequence length
which dimension has more examples to learn from?
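
rough arithmetic for the quadratic growth mentioned above. `attention_entries` is a made-up helper and the sample count is hypothetical; only the 978 genes come from these notes.

```julia
# Number of entries in the attention score matrices for one forward pass:
# one (seq_len × seq_len) matrix per head and per batch element.
attention_entries(seq_len; n_heads=1, batch_size=1) = seq_len^2 * n_heads * batch_size

n_genes = 978        # sequence length with genes as tokens (from the notes)
n_samples = 10_000   # hypothetical sample count, for illustration only

genes_as_tokens = attention_entries(n_genes)      # 956_484
samples_as_tokens = attention_entries(n_samples)  # 100_000_000
```

so whichever dimension is shorter is the cheaper choice for the sequence axis, all else being equal.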

good option: Flux.MultiHeadAttention((64, 64, 64) => (64, 64) => 64, nheads=1), can incr nheads later
- q, k, v input dim should all be the same if data type is the same or we aren't doing encoder-decoder
- middle dimensions should also be the same unless we want to reduce computational complexity in the middle
- output can also be the same unless we want to do ft compression or expansion
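
to make the q/k/v shapes concrete, here is a plain-Julia sketch of single-head scaled dot-product attention. it is NOT Flux's implementation: it skips the learned input/output projections, multiple heads, dropout and batching, and the helper names are made up.

```julia
# Column-wise softmax, stabilised by subtracting each column's maximum.
function softmax_cols(A)
    E = exp.(A .- maximum(A; dims=1))
    return E ./ sum(E; dims=1)
end

# q: (d, q_len), k: (d, kv_len), v: (d_v, kv_len); one head, one sample.
function single_head_attention(q, k, v)
    d = size(q, 1)
    scores = (k' * q) ./ sqrt(Float32(d))  # (kv_len, q_len)
    weights = softmax_cols(scores)         # each column sums to 1
    return v * weights, weights            # output is (d_v, q_len)
end

x = randn(Float32, 64, 10)                     # 64-dim embeddings, 10 tokens
out, weights = single_head_attention(x, x, x)  # self-attention: q = k = v
```

with q = k = v (self-attention) all three input dims are necessarily equal, which is why (64, 64, 64) is the natural choice above.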

may 1, 2025 - masked loss fxn - DONE
- 

In [None]:
function loss_masked(model, x, y_masked)
    logits = model(x)  # (n_classes, seq_len, batch_size)
    logits = permutedims(logits, (2, 3, 1))  # seq_len × batch_size × n_classes (to match targets)
    logits = reshape(logits, :, n_classes)   # (seq_len * batch_size) × n_classes

    y_masked_flat = vec(y_masked) # flatten
    # only keep where y_masked != -100
    mask = y_masked_flat .!= -100
    logits_masked = (logits[mask, :])'
    targets_masked = y_masked_flat[mask]
    y_oh = Flux.onehotbatch(targets_masked, 1:n_classes)

    return Flux.logitcrossentropy(logits_masked, y_oh)
end

may 27, 2025 - training fxn with masked values for accuracy - DONE
- 

In [None]:
function loss(model, x, y)
    logits = model(x)  # (n_classes, seq_len, batch_size)
    logits_flat = reshape(logits, size(logits, 1), :) # (n_classes, seq_len*batch_size)
    y_flat = vec(y) # (seq_len*batch_size) column vec

    mask = y_flat .!= -100 # bit vec, where sum = n_masked
    logits_masked = logits_flat[:, mask] # (n_classes, n_masked)
    y_masked = y_flat[mask] # (n_masked) column vec

    y_oh = Flux.onehotbatch(y_masked, 1:n_classes) # (n_classes, n_masked)
    return Flux.logitcrossentropy(logits_masked, y_oh) 
end

could return logits_masked and y_masked as well, then do:

In [None]:
preds_masked = Flux.onecold(logits_masked)
preds_masked_cpu = preds_masked |> cpu
preds_masked_cpu .== y_masked
accuracy = sum(preds_masked_cpu .== y_masked) / length(y_masked)
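
CPU-only sketch of the same masked accuracy, written as one function so it can be checked on a toy input. `masked_accuracy` is a made-up name; it assumes the same layout as `loss` above (logits are n_classes × seq_len × batch, labels are seq_len × batch with -100 for unscored positions).

```julia
function masked_accuracy(logits, y; ignore_index=-100)
    logits_flat = reshape(logits, size(logits, 1), :)  # (n_classes, seq_len*batch)
    y_flat = vec(y)
    mask = y_flat .!= ignore_index
    # argmax over the class dimension for every scored position
    preds = [argmax(col) for col in eachcol(logits_flat[:, mask])]
    return sum(preds .== y_flat[mask]) / count(mask)
end

# Toy check: 3 classes, 2 positions, 2 samples; only two positions are scored.
logits = zeros(Float32, 3, 2, 2)
logits[2, 1, 1] = 5f0   # position (1, 1) predicts class 2
logits[3, 2, 2] = 5f0   # position (2, 2) predicts class 3
y = [2 -100; -100 1]    # true labels; -100 marks unscored positions
acc = masked_accuracy(logits, y)   # 0.5: one of the two scored positions is right
```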

may 29, 2025 - sparse matrices
- 

In [None]:
# in train loop, instead of y_gpu:
y_batch_sparse = get_sparse_batch(y_train_masked, start_idx, end_idx)

using SparseArrays
using Random  # randperm
using CUDA    # cu

function mask_input_sparse(X::Matrix{Int64}; mask_ratio=0.10)
    X_masked = copy(X)
    # Create sparse matrix for labels
    I_indices = Int[]  # row indices
    J_indices = Int[]  # column indices  
    values = Int16[]   # actual gene indices
    
    for j in 1:size(X, 2)
        num_masked = ceil(Int, size(X, 1) * mask_ratio)
        mask_positions = randperm(size(X, 1))[1:num_masked]
        
        for pos in mask_positions
            push!(I_indices, pos)
            push!(J_indices, j)
            push!(values, X[pos, j])  # original gene index
            
            X_masked[pos, j] = MASK_ID
        end
    end
    
    y_sparse = sparse(I_indices, J_indices, values, size(X)...)
    return X_masked, y_sparse
end

X_train_masked, y_train_masked = mask_input_sparse(X_train)
X_test_masked, y_test_masked = mask_input_sparse(X_test)

function loss_sparse(model, x, y_sparse_batch, mode)
    logits = model(x)  # (n_classes, seq_len, batch_size)
    
    rows_cpu, cols_cpu, vals_cpu = findnz(y_sparse_batch)
    
    if isempty(rows_cpu)
        return 0.0f0
    end
    
    rows_gpu = cu(rows_cpu)
    cols_gpu = cu(cols_cpu) 
    vals_gpu = cu(vals_cpu)
    
    batch_size = size(logits, 3)
    seq_len = size(logits, 2)
    
    linear_indices = (cols_gpu .- 1) .* seq_len .+ rows_gpu
    
    logits_reshaped = reshape(logits, size(logits, 1), :) # (n_classes, seq_len * batch_size)
    masked_logits = logits_reshaped[:, linear_indices]  # (n_classes, n_masked)
    
    y_oh = Flux.onehotbatch(vals_gpu, 1:n_classes)
    
    if mode == "train"
        return Flux.logitcrossentropy(masked_logits, y_oh)
    elseif mode == "test"
        return Flux.logitcrossentropy(masked_logits, y_oh), masked_logits, vals_gpu
    end
end

function get_sparse_batch(y_sparse, start_idx, end_idx)
    rows, cols, vals = findnz(y_sparse[:, start_idx:end_idx])
    cols_adjusted = cols
    batch_sparse = sparse(rows, cols_adjusted, vals, size(y_sparse, 1), end_idx - start_idx + 1)
    
    return batch_sparse
end

** with the above code, 11mins 1 epoch (see github @ this time for other params) VS. 11mins 1 epoch for dense representations.
can revisit later if memory bottlenecks are found in the matrix operations; for now the dense version is sufficient b/c:
https://www.reddit.com/r/Julia/comments/108g5ou/when_is_it_worth_working_with_sparse_matrices/
https://medium.com/data-science/sparse-matrices-in-pytorch-part-2-gpus-fd9cc0725b71
- the above state that GPU sparse matrices only become efficient at roughly 1% density or less (i.e. ~99%+ zeros); with mask_ratio=0.10 the label matrix is ~10% dense
- the sparse path also needs repeated CPU <-> GPU data transfers and sparse slicing operations when batching
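
quick check of how dense the label matrix actually is when built like `mask_input_sparse` does it. `label_density` is a made-up helper and the 100 samples are illustrative; the 978 genes and mask_ratio=0.10 are from the code above.

```julia
using SparseArrays
using Random

# Build a label matrix the way mask_input_sparse does: ceil(n * mask_ratio)
# nonzero labels per column, then report the fraction of stored entries.
function label_density(n_genes, n_samples; mask_ratio=0.10, rng=Random.default_rng())
    rows, cols, vals = Int[], Int[], Int[]
    n_masked = ceil(Int, n_genes * mask_ratio)
    for j in 1:n_samples
        for pos in randperm(rng, n_genes)[1:n_masked]
            push!(rows, pos); push!(cols, j); push!(vals, pos)
        end
    end
    y = sparse(rows, cols, vals, n_genes, n_samples)
    return nnz(y) / length(y)
end

density = label_density(978, 100)   # 98 / 978 ≈ 0.10
```

that is about ten times denser than the ~1% rule of thumb, which fits with the sparse version not being any faster.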

june 12, 2025 trying dynamic masking;
- 

**RoBERTa shows that masking a different subset every epoch already helps, 
but recent work finds that decreasing the rate during training is even better 
(to try later!!! aka scheduler)

***dynamic on the train, static on the test
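
sketch of dynamic masking with a decaying rate. everything here is an assumption for illustration: the linear schedule, the 0.15 -> 0.05 range, and mask_id=979 (one id past the 978 gene ids); labels use -100 for unmasked positions like the loss functions above.

```julia
using Random

# Linearly decaying mask rate; the start/stop values are assumptions.
function mask_rate(epoch, n_epochs; start=0.15, stop=0.05)
    n_epochs == 1 && return start
    return start + (stop - start) * (epoch - 1) / (n_epochs - 1)
end

# Draw a fresh mask; labels are -100 everywhere except the masked positions.
function dynamic_mask(X::Matrix{Int}, rate; mask_id=979, rng=Random.default_rng())
    X_masked = copy(X)
    labels = fill(-100, size(X))
    n = size(X, 1)
    n_masked = ceil(Int, n * rate)
    for j in axes(X, 2)
        for pos in randperm(rng, n)[1:n_masked]
            labels[pos, j] = X[pos, j]
            X_masked[pos, j] = mask_id
        end
    end
    return X_masked, labels
end

X = reshape(collect(1:40), 10, 4)          # toy 10-gene × 4-sample input
for epoch in 1:3
    Xm, y = dynamic_mask(X, mask_rate(epoch, 3))   # new mask every epoch
    # the training step on (Xm, y) would go here
end
```

the test set would be masked once, up front, with a fixed rate and seed so that epochs stay comparable.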

jun 16, 2025 - tried to only mask 1 position; if it can't predict just the missing # from 1-978, then it's dumb.
- 

- the issue here might be that the test mask is different from the train mask; 
thus w/o dynamic masking, the train learns a single value each time, and the test provides a new value not seen before..?

also trying to profile the memory b/c 25min per epoch is way too long;
Profile.Allocs.@profile sample_rate=1 begin/end is taking wayyyyy too long too (~3hrs so far) to run in the REPL.. sample_rate=1 records every allocation, so a lower rate should be much cheaper;
is there maybe another way?

*also what takes 26h on kraken takes 4h on smaug...
there seems to be slightly better trainval loss using higher embed dim --> try higher dim

jun 20, 2025 - based on the results from smaug, 2025-06-19_21-48
- 

there seems to be an issue with learning diff masked tokens (1 per sample) across the whole dataset
it works if we use the same, say, 5 masks across the whole dataset (albeit at only an 84% accuracy for some reason)
1. double check masking function - ensure that it is correct
2. scale up learning rate..? not sure what else to do here
the training masks have sufficient examples to learn from i think (60 per label) so what is going wrong?


jul 24, 2025 - speed/memory optimization
-

- lux.jl is equivalent of jax in python (uses xla backend)
- unrelated, but wb a split data struct..? (what lea uses)

jul 25, 2025 - exp masking...?
- 

1. can you embed counts? how would that work
2. needs to be a regression output (1 value) rather than a vector of probabilities per class?
3. loss needs to be defined differently
4. should it just be a dense network? does MHA work on 

jul 30, 2025 - exp masking
- 

TODO:
- Dense layer instead of Flux.Embedding 
    - this is b/c below static vs. dynamic Embeddings
    - dense layer stores f(x) = Wx + b, where W and b are learned during training for all genes/samples
    - W is a matrix and b is a vector; W is multiplied by the input x and b is added, giving f(x), the embedding vector
- MHA still works with non-tokenized values!
    - model learns the meaning of the expression levels rather than the gene identity/ranks?
- mask token should be 0.0 after normalizing input
- MSE loss
- regression output to 1 value

Static Embeddings (like Word2Vec or Flux.Embedding): 
- The model learns one single, fixed vector for each unique token (e.g., the word "bank"). 
- The goal is to learn the meaning of the token itself by averaging its usage across thousands of different contexts.
Dynamic Feature Representation (Your Model): 
- Your model does not learn a static vector for "Gene 1". 
- Instead, it learns a function that maps any given expression value to a vector representation.
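
plain-Julia sketch of the difference (no Flux; names and sizes are made up, and n_tokens = 979 assumes 978 gene ids + 1 mask id). the lookup table has one learned column per token id, while the dense projection shares one W and b across all genes and samples.

```julia
embed_dim = 4      # small for readability; the notes use 128
n_tokens = 979     # assumed: 978 gene ids + 1 mask id

# Static embedding: a lookup table with one learned column per token id.
E = randn(Float32, embed_dim, n_tokens)
embed_token(id::Int) = E[:, id]

# Dense projection: f(x) = W*x + b, shared by every gene and every sample.
W = randn(Float32, embed_dim, 1)
b = zeros(Float32, embed_dim)
embed_value(x::Real) = vec(W .* Float32(x)) .+ b

v_token = embed_token(78)        # depends only on the id
v_low   = embed_value(0.5)       # depends only on the expression value
v_high  = embed_value(7.2)

n_params_embedding = length(E)            # embed_dim * n_tokens
n_params_dense = length(W) + length(b)    # 2 * embed_dim
```

note that with b = 0 the dense version maps every value onto the same direction W, only scaled; whatever tells genes apart then has to come from somewhere else (e.g. the positional encoding).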

9PM - edit to exp masking
- had to move loss calc to before model update - resulted in lower loss in test than train
- change masking val to -1, apparently there are exp levels of 0 in the original dataset?

***should make a plot similar to this scatter plot for check_error.jl file


aug 1, 2025 - pre lab-meeting
- 

***need to fix the logging of params for predstrues.csv and params.txt ; didn't save in the last couple runs

aug 4, 2025 - debug + seb loss issue
-

- running on GPU 2 for indef run - rerun for seb
- need to 
    1. figure out logging for test/train - is it diff than what's in the slides?
    2. debug original structure typing
    3. fix logging of params for predstrues.csv and params.txt when doing input comparison plots
    4. fix progressbar for indef_run.jl (why is it out of 628, repeats for each epoch???)

- 08-04 run is original test/loss definitions from creating 720ep graph
- changed code to use mask_transf code in the while loop
- so Flux.withgradient = 1st loss calc --> update! --> second loss calc
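
toy illustration of why the order matters (plain gradient descent on a made-up 1-D problem, no Flux): the loss value that comes back together with the gradient is computed with the weights from BEFORE the update, so anything evaluated after `update!` sees an already improved model.

```julia
# Toy problem: minimise (w - 3)^2 with plain gradient descent.
loss(w) = (w - 3.0)^2
grad(w) = 2.0 * (w - 3.0)

function train_step(w; lr=0.1)
    loss_before = loss(w)      # what a withgradient-style call reports
    w_new = w - lr * grad(w)   # the update! step
    loss_after = loss(w_new)   # what any later evaluation sees
    return w_new, loss_before, loss_after
end

w, loss_before, loss_after = train_step(0.0)
# loss_before = 9.0, loss_after ≈ 5.76
```

so if train loss is logged from the first calc and test loss is computed after the update, test can look better than train for that reason alone.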

aug 5, 2025 - debug
-

original structure:

In [None]:
struct PosEnc
    pe_matrix::CuArray{Float32,2}
end

function PosEnc(embed_dim::Int, max_len::Int) # max_len is usually maximum length of sequence but here it is just len(genes)
    pe_matrix = Matrix{Float32}(undef, embed_dim, max_len)
    for pos in 1:max_len, i in 1:embed_dim
        angle = pos / (10000^(2*(div(i-1,2))/embed_dim))
        if mod(i, 2) == 1
            pe_matrix[i,pos] = sin(angle) # odd indices
        else
            pe_matrix[i,pos] = cos(angle) # even indices
        end
    end
    return PosEnc(cu(pe_matrix))
end

Flux.@functor PosEnc

function (pe::PosEnc)(input::Float32Matrix3DType)
    seq_len = size(input,2)
    return input .+ pe.pe_matrix[:,1:seq_len] # adds positional encoding to input embeddings
end

### building transformer section

struct Transf
    mha::Flux.MultiHeadAttention
    att_dropout::Flux.Dropout
    att_norm::Flux.LayerNorm # this is the normalization aspect
    mlp::Flux.Chain
    mlp_norm::Flux.LayerNorm
end

function Transf(
    embed_dim::Int, 
    hidden_dim::Int; 
    n_heads::Int, 
    dropout_prob::Float64
    )

    mha = Flux.MultiHeadAttention((embed_dim, embed_dim, embed_dim) => (embed_dim, embed_dim) => embed_dim, 
                                    nheads=n_heads, 
                                    dropout_prob=dropout_prob
                                    )

    att_dropout = Flux.Dropout(dropout_prob)
    
    att_norm = Flux.LayerNorm(embed_dim)
    
    mlp = Flux.Chain(
        Flux.Dense(embed_dim => hidden_dim, gelu),
        Flux.Dropout(dropout_prob),
        Flux.Dense(hidden_dim => embed_dim),
        Flux.Dropout(dropout_prob)
        )
    mlp_norm = Flux.LayerNorm(embed_dim)

    return Transf(mha, att_dropout, att_norm, mlp, mlp_norm)
end

Flux.@functor Transf

function (tf::Transf)(input::Float32Matrix3DType) # input shape: embed_dim × seq_len × batch_size
    normed = tf.att_norm(input)
    atted = tf.mha(normed, normed, normed)[1] # outputs a tuple (a, b)
    att_dropped = tf.att_dropout(atted)
    residualed = input + att_dropped
    res_normed = tf.mlp_norm(residualed)

    embed_dim, seq_len, batch_size = size(res_normed)
    reshaped = reshape(res_normed, embed_dim, seq_len * batch_size) # dense layers expect 2D inputs
    mlp_out = tf.mlp(reshaped)
    mlp_out_reshaped = reshape(mlp_out, embed_dim, seq_len, batch_size)
    
    tf_output = residualed + mlp_out_reshaped
    return tf_output
end

### full model as << ranked data --> token embedding --> position embedding --> transformer --> classifier head >>

struct Model
    embedding::Flux.Embedding
    pos_encoder::PosEnc
    pos_dropout::Flux.Dropout
    transformer::Flux.Chain
    classifier::Flux.Chain
end

function Model(;
    input_size::Int,
    embed_dim::Int,
    n_layers::Int,
    n_classes::Int,
    n_heads::Int,
    hidden_dim::Int,
    dropout_prob::Float64
    )

    embedding = Flux.Embedding(input_size => embed_dim)

    pos_encoder = PosEnc(embed_dim, input_size)

    pos_dropout = Flux.Dropout(dropout_prob)

    transformer = Flux.Chain(
        [Transf(embed_dim, hidden_dim; n_heads, dropout_prob) for _ in 1:n_layers]...
        )

    classifier = Flux.Chain(
        Flux.Dense(embed_dim => embed_dim, gelu),
        Flux.LayerNorm(embed_dim),
        Flux.Dense(embed_dim => n_classes)
        )

    return Model(embedding, pos_encoder, pos_dropout, transformer, classifier)
end

Flux.@functor Model

function (model::Model)(input::IntMatrix2DType)
    embedded = model.embedding(input)
    encoded = model.pos_encoder(embedded)
    encoded_dropped = model.pos_dropout(encoded)
    transformed = model.transformer(encoded_dropped)
    # pooled = dropdims(mean(transformed; dims=2), dims=2)
    logits_output = model.classifier(transformed)
    return logits_output
end

re-typed structure:

In [None]:
struct PosEnc{U<:AbstractMatrix}
    pe_matrix::U
end

function PosEnc(embed_dim::Int, max_len::Int) # max_len is usually maximum length of sequence but here it is just len(genes)
    pe_matrix = Matrix{Float32}(undef, embed_dim, max_len)
    for pos in 1:max_len, i in 1:embed_dim
        angle = pos / (10000^(2*(div(i-1,2))/embed_dim))
        if mod(i, 2) == 1
            pe_matrix[i,pos] = sin(angle) # odd indices
        else
            pe_matrix[i,pos] = cos(angle) # even indices
        end
    end
    return PosEnc(pe_matrix)
end

Flux.@functor PosEnc

function (pe::PosEnc)(input::Float32Matrix3DType)
    seq_len = size(input,2)
    return input .+ pe.pe_matrix[:,1:seq_len] # adds positional encoding to input embeddings
end

### building transformer section

struct Transf{MHA<:Flux.MultiHeadAttention, D<:Flux.Dropout, LN<:Flux.LayerNorm, C<:Flux.Chain}
    mha::MHA
    att_dropout::D
    att_norm::LN
    mlp::C
    mlp_norm::LN
end

function Transf(
    embed_dim::Int, 
    hidden_dim::Int; 
    n_heads::Int, 
    dropout_prob::Float64
    )

    mha = Flux.MultiHeadAttention((embed_dim, embed_dim, embed_dim) => (embed_dim, embed_dim) => embed_dim, 
                                    nheads=n_heads, 
                                    dropout_prob=dropout_prob
                                    )

    att_dropout = Flux.Dropout(dropout_prob)
    
    att_norm = Flux.LayerNorm(embed_dim)
    
    mlp = Flux.Chain(
        Flux.Dense(embed_dim => hidden_dim, gelu),
        Flux.Dropout(dropout_prob),
        Flux.Dense(hidden_dim => embed_dim),
        Flux.Dropout(dropout_prob)
        )
    mlp_norm = Flux.LayerNorm(embed_dim)

    return Transf(mha, att_dropout, att_norm, mlp, mlp_norm)
end

Flux.@functor Transf

function (tf::Transf)(input::Float32Matrix3DType) # input shape: embed_dim × seq_len × batch_size
    normed = tf.att_norm(input)
    atted, _ = tf.mha(normed, normed, normed) # outputs a tuple (a, b)
    att_dropped = tf.att_dropout(atted)
    residualed = input + att_dropped
    res_normed = tf.mlp_norm(residualed)

    embed_dim, seq_len, batch_size = size(res_normed)
    reshaped = reshape(res_normed, embed_dim, seq_len * batch_size) # dense layers expect 2D inputs
    mlp_out = tf.mlp(reshaped)
    mlp_out_reshaped = reshape(mlp_out, embed_dim, seq_len, batch_size)
    
    tf_output = residualed + mlp_out_reshaped
    return tf_output
end

struct Model{E<:Flux.Embedding, P<:PosEnc, D<:Flux.Dropout, T<:Flux.Chain, C<:Flux.Chain}
    embedding::E
    pos_encoder::P
    pos_dropout::D
    transformer::T
    classifier::C
end

function Model(;
    input_size::Int,
    embed_dim::Int,
    n_layers::Int,
    n_classes::Int,
    n_heads::Int,
    hidden_dim::Int,
    dropout_prob::Float64
    )

    embedding = Flux.Embedding(input_size => embed_dim)

    pos_encoder = PosEnc(embed_dim, input_size)

    pos_dropout = Flux.Dropout(dropout_prob)

    transformer = Flux.Chain(
    (Transf(embed_dim, hidden_dim; n_heads, dropout_prob) for _ in 1:n_layers)...
    )

    classifier = Flux.Chain(
        Flux.Dense(embed_dim => embed_dim, gelu),
        Flux.LayerNorm(embed_dim),
        Flux.Dense(embed_dim => n_classes)
        )

    return Model(embedding, pos_encoder, pos_dropout, transformer, classifier)
end

Flux.@functor Model

function (model::Model)(input::T) where {T<:IntMatrix2DType} 
    # there is an issue here - where type is Any from the Flux portion
    # now - if Flux is causing issues, go into source code and redefine as above:
    # (m::Embedding)(x::T) where {T<:AbstractArray} = reshape(m(vec(x)), :, size(x)...), copied from Flux source code
    # AND
    # input::T where T<:type, allows it to be distinguished as a subtype of the input type
    # should theoretically be able to avoid Anys, and be type-stable!
    embedded = model.embedding(input)
    encoded = model.pos_encoder(embedded)
    encoded_dropped = model.pos_dropout(encoded)
    transformed = model.transformer(encoded_dropped)
    pooled = dropdims(mean(transformed; dims=2), dims=2)
    logits_output = model.classifier(pooled)
    return logits_output
    # return embedded  # debugging leftover; was unreachable after the return above
end
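
minimal sketch of what the re-typing changes, using made-up structs (`LoosePosEnc`, `TightPosEnc`). a field annotated with a non-concrete type hides the stored type from the compiler, while a type parameter makes it part of the struct's type. as far as i can tell the same applies to fields like `mlp::Flux.Chain` in the original structure, since those layer types are parametric.

```julia
# Field annotated with an abstract type: the compiler cannot know the
# concrete array type stored in `pe_matrix`.
struct LoosePosEnc
    pe_matrix::AbstractMatrix{Float32}
end

# Parametric field: the concrete type becomes part of the struct's type.
struct TightPosEnc{U<:AbstractMatrix}
    pe_matrix::U
end

m = zeros(Float32, 4, 6)
loose = LoosePosEnc(m)
tight = TightPosEnc(m)

loose_is_concrete = isconcretetype(fieldtype(typeof(loose), :pe_matrix))  # false
tight_is_concrete = isconcretetype(fieldtype(typeof(tight), :pe_matrix))  # true
```

running `@code_warntype` on the forward pass is the usual way to confirm whether any `Any`s are left.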

TODO:
- need to 
    1. ~~re-run with fixed typing~~ faster.jl running on kraken gpu 1

    2. ~~figure out why test is better than train for loss/accuracy (potentially change in indef_run code or run mask_transf code for x epochs if test is only better in indef_run and not mask_transf)~~ ~~indef_run.jl code changed, running new on smaug gpu 2 ONCE OLD_INDEF_RUN.JL gets to 40!!! --> running new one now! what was the diff bruh~~ done + clarified fix, see indef_masked_rankings 08-04 vs. 08-05. issue was the withgradient (AGAIN!!!)

    3. ~~fix param logging for exp_transf + mask_transf (predstrues.csv, params.txt)~~

    4. ~~fix progressbar for indef_run.jl~~ removed progress bar lol

    5. ~~fix scatter plot for mask_transf comparison~~ ~~mask_transf_err.jl running on kraken gpu 0~~ done, see masked_rankings/2025-08-05

    6. ~~x-bin for exp_transf comparison~~ ~~exp_transf.jl running on smaug gpu 3~~ done, see masked_expression/2025-08-05

    7. reorganize exp, mask, indef, faster for tomorrow

    8. put fxns/structs into separate src files! more organized.

- 08-04 run is original test/loss calculations from creating 720ep graph
- 08-05 run is updated calculations from mask_transf code

aug 6, 2025 - recap
=

asap:
- ~~exp_transf.jl running on kraken 0 (10ep x-bin)~~ done
- ~~mask_transf_err.jl running on smaug 3 (10ep heatmap)~~ done

still pending:
- faster.jl running on kraken gpu 1 --> old code: 668774 ms, new code:
    - terminated - need to fix code
- reorganize exp, mask, indef, faster
- put fxns/structs into separate src files! more organized.
- ~~why not: exp transf run on untrt - kraken 0~~

aug 12, 2025 - predicting the average, ensuring no repeats, fixing plots
=

TODO:
- ~~redo heatmap/x-bin into boxplots or hex-bin?~~
    ~~- exp on kraken 0, rank on smaug 0~~
    - longer rank run on smaug 0
- ~~see if model is just pred the avg rather than acc learning (raw exp)~~
    - in exp code, running 100ep on kraken 0 for comparison
- ~~see if possible to ensure model has no repeats (via permutations, inductive bias, pointer networks)~~
    - trying on smaug 1 (need to clean up and understand tho)
- put fxns/structs into separate src files! more organized.
- make new diagrams! (look into CLS token)
- do test without pretrain to see if masking even helps
- faster.jl ; compare memory usage still high!!

aug 14, 2025 - for ensuring no repeats
=

a pointer network is designed to select its output from the elements that are **present in the input sequence**. however, the correct answer (the masked number) is the one element that is explicitly absent from the input.

https://kierszbaumsamuel.medium.com/pointer-networks-what-are-they-c3cb68fae076#:~:text=Notice%20how%20they%20are%20placed,-%20output%20dictionary%3A

https://arxiv.org/abs/1506.03134 - pointer networks

https://arxiv.org/abs/2006.06380 - pointer graph networks

alternatively:

can have two sets of inputs:
- context: masked sequence [1, 2, ..., MASK, 79, ...].
- candidate: complete, unmasked set of all possible tokens [1, 2, ..., 978].

where:
- encoder processes the context input to understand what's missing.
- decoder or attention mechanism then uses this context to point to the correct token within the candidate input.

thus, the model isn't pointing to the sequence it was given but to a complete "dictionary" of possibilities, using the masked sequence to figure out which item in the dictionary is the right one.

is this realistically better?

standard transf:
- model must produce a single vector for the [MASK] token that, after going through one final linear layer, can be classified as 78
- vector has to implicitly encode the identity "78"
- model learns a complex, abstract function to map from context to an identity

pointer:
- model must produce a query vector for the [MASK] token
- query's job is to be more similar to the vector for 78 in the candidate set than to any other number's vector
- forces the model to learn a shared, consistent embedding space for all numbers
- representation for 78 must be similar whether it's in the input or in the candidate list
- encourages the model to learn the concept of "78-ness" in a way that is directly comparable to the concepts of "77-ness" and "79-ness."

THUS, leads to a more structured and relational embedding space..?
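
tiny sketch of the pointer-style scoring step only (no encoder, nothing trained). the embeddings are random, and the query is simply set to the embedding of token 78 as a stand-in for what a trained encoder would output at the [MASK] position; all names and sizes apart from 978 are made up.

```julia
using Random

rng = MersenneTwister(0)
embed_dim, n_candidates = 16, 978

# Candidate "dictionary": one unit-norm embedding per possible token.
C = randn(rng, Float32, embed_dim, n_candidates)
C ./= sqrt.(sum(abs2, C; dims=1))

# Pointer-style scoring: similarity between the [MASK] query and every candidate.
pointer_scores(query, candidates) = candidates' * query
point_to(query, candidates) = argmax(pointer_scores(query, candidates))

# Stand-in for the encoder output at the [MASK] position: here we cheat and
# use the embedding of token 78 itself, so the pointer must select 78.
query = C[:, 78]
picked = point_to(query, C)   # 78
```

in a real model the scores would go through a softmax and the same cross-entropy loss as before; the only change is that the output layer is replaced by a similarity against the shared embeddings.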

*useful for things like travelling salesman problem

similar to: https://arxiv.org/abs/2005.11401

RAG models improve language models by first using the input to retrieve relevant documents from a vast database (like Wikipedia). The model then uses both the original input and the retrieved documents to generate a final answer

aug 18, 2025 - currently
- 

running rn:
- smaug 0: ranking long run (300ep) with updated metrics plotting

- ~~smaug 1: exp long run (300ep) with updated metrics plotting (and checking if just pred avg)~~
    - done, not just pred avg; did well! (2025-08-18_16-03)

working on rn:
- ~~seeing if model is just predicting the average~~
    - i think done, seems that model is doing better than average (2025-08-18_16-03)
- investigating pointer networks and/or permutations, inductive bias
- lux.jl to decrease mem allocs?
- do test w/ and w/o pretrain to see if masking even helps
    - should save weights of model as well (or model itself somehow for downstream applications)
- reorganize

aug 19, 2025 - no repeats (more options)
-

some ideas:
- inductive bias on the logits before softmax
- constrained beam search (decoder-side control)
- energy-based; adding a penalty for choosing a forbidden value
    - allows for soft penalties .. not sure if this is good or bad
- ~~output as a set (Set Transformers, DeepSets, or Determinantal Point Processes (DPPs))~~
    - implies order doesn't matter?
- ~~ILP/SAT decoding??~~
    - not scalable
- symbolic rule-based filter or constraint-satisfaction layer
    - like logic tensor networks
- copy/generate models
    - also soft penalties; can forbid copy mode (in input)


ones w/ hard penalties:
- inductive bias
    - isn't this what i am already doing via masking?
    - no; rn i have data-level masking. inductive bias is prediction-level masking!
- constrained beam search
    - does not scale well to 100k samples
- neural/symbolic hybrid
    - 2 stage process - not sure how to backprop thru symbolic rules

inductive bias:
- before applying softmax (in the loss function, after computing logits), set the logits to -Inf (or a very negative number) for:
    - any gene IDs already present in the input for that sample (so they can’t be predicted), and
    - any gene IDs already predicted in previous decoding steps (if doing sequential prediction).
- with -Inf, this ensures that the probability of those disallowed gene IDs is exactly 0.
- rn code allows the model to assign probability mass to any of the n_classes outputs.
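
minimal sketch of this prediction-level masking for a single position (toy logits, made-up helper names). one caveat that is an assumption about the training setup: the in-place `.=` is fine for inference, but inside a Zygote-differentiated loss it would probably have to be replaced by adding a precomputed 0 / -Inf mask instead of mutating.

```julia
# Softmax over a vector, stabilised by subtracting the maximum.
function softmax_vec(z)
    e = exp.(z .- maximum(z))
    return e ./ sum(e)
end

# Set the logits of every id already present in the input to -Inf,
# so those ids receive exactly zero probability after softmax.
function mask_seen_logits(logits::AbstractVector, seen_ids)
    masked = float.(logits)     # copy; leaves the caller's logits untouched
    masked[seen_ids] .= -Inf
    return masked
end

logits = Float32[2.0, 1.0, 0.5, 3.0, 0.1]   # toy scores for 5 gene ids
seen = [1, 2, 4, 5]                          # ids visible in the unmasked input
probs = softmax_vec(mask_seen_logits(logits, seen))
# probs == [0, 0, 1, 0, 0]: only the missing id 3 keeps probability mass
```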

constrained beam search:
- ~~greedy~~
    - At each position t (row), you look at the logits for that position and immediately pick the single best legal option (argmax after masking forbidden genes).
    - If enforce_unique=true, you also remove that choice from future positions.
    - The decision is locally optimal (best at that timestep given constraints).
- compact
    - Instead of committing immediately, you keep the top-k partial hypotheses (“beams”) as you decode.
    - At each timestep, every beam expands to multiple candidates (respecting constraints), then you prune back down to the top-k by cumulative score.
    - After the last position, you return the best sequence.
    - The decision is globally optimized across the sequence within beam budget.
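
small sketch contrasting the two decoders under a "no repeats" constraint. this is a toy, not the grid beam search or dynamic beam allocation algorithms; the function names, the beam width and the 2×2 score matrix are all made up for illustration.

```julia
# scores: (n_classes × n_positions) log-probabilities; every id may be used once.

function greedy_unique(scores)
    n_classes, n_pos = size(scores)
    used = falses(n_classes)
    seq, total = Int[], 0.0
    for t in 1:n_pos
        col = [used[c] ? -Inf : scores[c, t] for c in 1:n_classes]
        best = argmax(col)          # best legal option right now
        used[best] = true
        push!(seq, best)
        total += col[best]
    end
    return seq, total
end

function beam_unique(scores; beam_width=2)
    n_classes, n_pos = size(scores)
    beams = [(Int[], 0.0)]                    # (partial sequence, cumulative score)
    for t in 1:n_pos
        candidates = Tuple{Vector{Int},Float64}[]
        for (seq, total) in beams, c in 1:n_classes
            c in seq && continue              # constraint: no repeated ids
            push!(candidates, (vcat(seq, c), total + scores[c, t]))
        end
        sort!(candidates; by=last, rev=true)  # prune to the top-k hypotheses
        beams = candidates[1:min(beam_width, length(candidates))]
    end
    return first(beams)
end

# Toy case where the locally best first choice blocks a much better second one.
scores = log.([0.6 0.9;
               0.4 0.1])
greedy_seq, greedy_score = greedy_unique(scores)   # [1, 2]
beam_seq, beam_score = beam_unique(scores)         # [2, 1]
```

greedy commits to id 1 at the first position (0.6) and is then forced to take id 2 at 0.1; the beam keeps both options alive and ends on [2, 1] with a higher total score.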

neural/symbolic hybrid
- split prediction into two modules:
    - neural module: proposes a ranked list of candidate genes
    - symbolic module: enforces your biological or structural rules
- this is like a two-stage pipeline: model suggests --> rules finalize.

SO FINAL IDEAS TO IMPLEMENT:
- compact beam search
    - 2017: https://arxiv.org/pdf/1704.07138 (grid beam search)
    - 2018: https://arxiv.org/pdf/1804.06609 (dynamic beam allocation)
    - explanation: https://huggingface.co/blog/constrained-beam-search
- neural/symbolic constraints
    - 2021: https://arxiv.org/pdf/2103.17232 (first introduction)
    - 2024: https://arxiv.org/pdf/2410.20957 (logical constraints)

aug 20, 2025 - post-seb meeting
-

- comparison plots
    - ~~waiting on smaug 0~~ done
    - ~~- show distribution of true values in histogram; has very little values below 5 so we can ignore that~~ done; explain the distribution correlation with the accuracy as well!
    - ~~- plot both boxplot and hist on top of one another with same axes~~
    - ~~- see if possible to top whisker at 0.9 quartile and bottom whisker at 0.1 quartile; thus 10% of pts at upper and lower whisker rather than extrapolating data~~ done; not much different but easier to explain plot
    - ~~- currently, what is used is IQR of 1.5 for whisker length~~
    - ~~- try rangebars or @recipe macro on CairoMakie~~
- just predicting avg expression?
    - can also show this in the boxplot by adding an X where the average is (maybe no need)
    - ~~- also, double check if we are comparing the hexbin of true values vs. average of predicted values (correct) or average of true values (incorrect)~~ correct!
- improving ranked model
    - ensure parameters are on the same scale generally; such that the exp and rank models have the same capacity
    - since the rank model has less info in the input, it doing just as good as the exp model is sufficient
    - for eval:
        - can do a downstream task, OR
        - we can compare actual values of expression model output to rank model output without needing significant inductive biases applied or downstream tasks
        - need to look into how to have input = rank, output = expression value (for ranked model)
        - ex. quantile normalization (w/ same distribution?)
- slurm connect
- review how exp model works


sep 3, 2025 - check-in
-

plotting
- new_boxplot.png compared to old boxplot in aug22 ppt
- box_hex.png

improving ranked model
- opt 1: input = rank, output = expression val
    - A. concatenate/add ranked embeddings + raw expression embeddings
        - but this means that the rank model isn't 100% a rank model
    - B. change to regression task w/ only model structure, fwd pass, and loss
        - structure: classifier --> Flux.Chain(embed_dim => hidden_dim => 1)
            - 1 for each value masked; loss is calculated on each masked value individually
        - fwd pass: somewhat same, or we can use pooling?
            - involves taking the mean across the sequence length
        - loss: logit cross-entropy --> MSE?
        - similar to what is already done in exp_transf.jl, just with diff input?
    - C. quantile normalization
        - replaces the integer rank with float of average expression for each gene
        - then can convert that into a rank but maintain the previous raw values for y-labels?
        - then predict singular value (Flux.Chain => 1) for expression prediction?
        - does this generalize too much tho?
- opt 2: constraint stuff
    - ex. constrained beam search, neural/symbolic hybrid
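
sketch of option C under my reading of quantile normalization: every rank position gets replaced by the mean expression observed at that rank across samples, so all samples end up with the same value distribution. the function name and toy matrix are made up; ties are broken arbitrarily by `sortperm`.

```julia
using Statistics

# Quantile-normalise a genes × samples matrix: every sample ends up with the
# same value distribution, the mean expression observed at each rank.
function quantile_normalize(expr::AbstractMatrix{<:Real})
    sorted = sort(expr; dims=1, rev=true)        # each column, highest first
    rank_means = vec(mean(sorted; dims=2))       # reference value for each rank
    out = similar(expr, Float64)
    for j in axes(expr, 2)
        order = sortperm(view(expr, :, j); rev=true)
        out[order, j] = rank_means               # rank r receives rank_means[r]
    end
    return out, rank_means
end

expr = [5.0 4.0;
        2.0 1.0;
        3.0 6.0]
normed, rank_means = quantile_normalize(expr)
# rank_means == [5.5, 3.5, 1.5]
# normed == [5.5 3.5; 1.5 1.5; 3.5 5.5]
```

`rank_means` is also a fixed rank -> expression lookup, so a rank model's predicted rank could be mapped back to an approximate expression value without changing the model.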

misc
- downstream task?
- slurm connect

sep 4, 2025 - post- seb meeting
- 

plotting
- get correlation of avg hexbin for comparison
- update entropy graphs for comparison
    - ex. have accuracy/entropy vs. rank and vs. expression
        - look at density of probability for the expression one (but also seb said maybe no for this)
        - re-running error vs rank for trt dataset on kraken 0
    - see if the movement in position is correlated with cell type as well!
    - look into more explanation stuff like this^^^

dataset
- do LINCS treated and untreated dataset
    - for example, if we have larger dataset (by x10) then do # epochs / 10 (300 ep on untrt, 30 ep on trt/untrt); thus we have the same # of gradient steps
    - try to run a small subsample first for speed and testing
    - rank currently running on smaug 0, exp currently running on smaug 1 (09-07)

task
- predict cell line then predict average expression of that cell line

models
- use FNN/MLP for expression profile with same number of parameters as ranking transformer
    - as the initial comparison

evaluation
- find more differences? but masking task might be sufficient
- look into autoencoders which ask to recover values from 0 (ex. denoising autoencoder); is this similar to masking alr tho?

validating it is acc learning
- could remove the mean expression for each gene before input; then predicting the average just means predicting 0, so if the model does better than that it is definitely learning more than the average
    - this also means i'd need to change the MASK from -1 to something else (ex. a collection of 0s and 1s..?)
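
sketch of the centering step with toy numbers. `center_genes` is a made-up helper; using the training means for the test set as well is my assumption (it avoids leaking test information).

```julia
using Statistics

# Centre each gene (row) using means estimated on the training samples only.
function center_genes(train::AbstractMatrix, test::AbstractMatrix)
    gene_means = mean(train; dims=2)     # (n_genes, 1)
    return train .- gene_means, test .- gene_means, vec(gene_means)
end

train = [1.0 3.0 5.0;
         10.0 10.0 13.0]
test = reshape([3.0, 11.0], 2, 1)
train_c, test_c, gene_means = center_genes(train, test)
# gene_means == [3.0, 11.0]; every row of train_c averages to 0,
# so always predicting the per-gene average now means predicting 0.
```

after centering, values like -1 are legitimate inputs, which is why the MASK value would have to change.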

sep 9, 2025 - prep ppt for meeting fri
-

for ppt:
- introduction, model, how masking works,etc
- rank vs. exp comparison
    - plot of train/test loss
    - plot of boxplots of pred vs true
        - with a histogram for the exp val
        - no histogram needed for the rank; it is uniform distr
    - also error only for rank (should there be error for exp as well???)
    - if smaug done in time, show big params vs smol params runs (30ep vs 5ep)
- reasoning results
    - plot of entropies in dataset
    - plot of prediction error by value
    - plot of avg hexbin???
- new task!
- new model..?
***

plotting
- updated entropy graphs to include trt data
    - rank: done
    - exp: done
        - using bins of size 0.01 and 0.1 for discretization
- to compare to entropy graphs, re-run on trt data for error vs:
    - rank: **running on kraken 0, 5ep**
    - exp: **running on kraken 1, 5ep**
- plot cell type against error
    - cell type vs. error: ?
    - cell type vs. entropy: done
- get explanations for above
- get correlation of avg hexbin for comparison (not urgent)

dataset --> changed from untrt only to trt and untrt
- long run with big params
    - rank: **running on smaug 0, 30ep**
    - exp: **running on smaug 1, 30ep**

task

models

sep 12, 2025 - parameter checking, organizing todo
- 

1. comparing input results
- be able to explain projection on raw vs. embedding on rank
- check: does exp model have only 128 params per input vector via Wx+b (128\*1) while the rank model has 128\*978 params where each value gets projected to the weight dimension?
    - this would result in less complexity for the embedding model, which means that the embedding model has a disadvantage
    - **EXP MODEL # PARAMS: 649,649**
        - via total_params = sum(length, Flux.params(model))
        - embed_dim = 128, hidden_dim = 236, n_heads = 2, n_layers = 4
    - **RANK MODEL # PARAMS: 921,426**
        - embed_dim = 128, hidden_dim = 256, n_heads = 2, n_layers = 4
        - params length = 56
        - (128, 979)(128, 979)(128, 128)(128, 128)(128, 128)(128, 128)(128,)(128,)(256, 128)(256,)(128, 256)(128,)(128,)(128,)(128, 128)(128, 128)(128, 128)(128, 128)(128,)(128,)(256, 128)(256,)(128, 256)(128,)(128,)(128,)(128, 128)(128, 128)(128, 128)(128, 128)(128,)(128,)(256, 128)(256,)(128, 256)(128,)(128,)(128,)(128, 128)(128, 128)(128, 128)(128, 128)(128,)(128,)(256, 128)(256,)(128, 256)(128,)(128,)(128,)(128, 128)(128,)(128,)(128,)(978, 128)(978,)

2. entropy vs. error graphs **--> APPLY TO THE 30EP RUN**
- rank:
    - take mean/avg error per rank instead for better visualization
- expression:
    - double check the p(x) distribution of the bins; are they fitting in properly or do some bins have 0 counts?
    - do std. dev of expression (or potentially interquartile distance) rather than entropy - this has better representation of variation
3. next steps
- MLP as an autoencoder; train the beginning then prediction head at the end is different - to sub in for downstream task
    - aka pretraining an MLP?
    - see denoising autoencoder
- stacked autoencoder vs. transformer with same number of parameters (same degrees of freedom)
    - if autoencoder better then yay! if transformer better then boo.
- see leo's bottleneck stuff
- look at benchmarks
    - aka lea's; predicting gene expression from a different cell line and same perturbation

currently running: 30ep exp run on smaug 0

masked model
-
julia> model.classifier
Chain(
  Dense(128 => 128, gelu_tanh),         # 16_512 parameters
  LayerNorm(128),                       # 256 parameters
  Dense(128 => 978),                    # 126_162 parameters
)                   # Total: 6 arrays, 142_930 parameters, 992 bytes.

julia> model.embedding
Embedding(979 => 128)  # 125_312 parameters

julia> model.pos_encoder
PosEnc(Float32[0.84147096 0.9092974 … -0.8218694 -0.92342377; 0.5403023 -0.41614684 … -0.56967604 0.38378194; … ; 0.0001154782 0.0002309564 … 0.11269774 0.11281249; 1.0 1.0 … 0.99362934 0.9936163])

julia> model.transformer
Chain(
  Transf(
    MultiHeadAttention(128; nheads=2, dropout_prob=0.05),  # 65_536 parameters
    Dropout(0.05),
    LayerNorm(128),                     # 256 parameters
    Chain(
      Dense(128 => 256, gelu_tanh),     # 33_024 parameters
      Dropout(0.05),
      Dense(256 => 128),                # 32_896 parameters
      Dropout(0.05),
    ),
    LayerNorm(128),                     # 256 parameters
  ),
  Transf(
    MultiHeadAttention(128; nheads=2, dropout_prob=0.05),  # 65_536 parameters
    Dropout(0.05),
    LayerNorm(128),                     # 256 parameters
    Chain(
      Dense(128 => 256, gelu_tanh),     # 33_024 parameters
      Dropout(0.05),
      Dense(256 => 128),                # 32_896 parameters
      Dropout(0.05),
    ),
    LayerNorm(128),                     # 256 parameters
  ),
  Transf(
    MultiHeadAttention(128; nheads=2, dropout_prob=0.05),  # 65_536 parameters
    Dropout(0.05),
    LayerNorm(128),                     # 256 parameters
    Chain(
      Dense(128 => 256, gelu_tanh),     # 33_024 parameters
      Dropout(0.05),
      Dense(256 => 128),                # 32_896 parameters
      Dropout(0.05),
    ),
    LayerNorm(128),                     # 256 parameters
  ),
  Transf(
    MultiHeadAttention(128; nheads=2, dropout_prob=0.05),  # 65_536 parameters
    Dropout(0.05),
    LayerNorm(128),                     # 256 parameters
    Chain(
      Dense(128 => 256, gelu_tanh),     # 33_024 parameters
      Dropout(0.05),
      Dense(256 => 128),                # 32_896 parameters
      Dropout(0.05),
    ),
    LayerNorm(128),                     # 256 parameters
  ),
)                   # Total: 48 arrays, 527_872 parameters, 8.328 KiB.

exp model
-
julia> model.classifier
Chain(
  Dense(128 => 128, gelu_tanh),         # 16_512 parameters
  LayerNorm(128),                       # 256 parameters
  Dense(128 => 1, softplus),            # 129 parameters
)                   # Total: 6 arrays, 16_897 parameters, 992 bytes.

julia> model.projection
Dense(1 => 128)     # 256 parameters

julia> model.pos_encoder
PosEnc(Float32[0.84147096 0.9092974 … 0.035307925 -0.8218694; 0.5403023 -0.41614684 … -0.9993765 -0.56967604; … ; 0.0001154782 0.0002309564 … 0.112583004 0.11269774; 1.0 1.0 … 0.99364233 0.99362934])

julia> model.transformer
Chain(
  Transf(
    MultiHeadAttention(128; nheads=2, dropout_prob=0.05),  # 65_536 parameters
    Dropout(0.05),
    LayerNorm(128),                     # 256 parameters
    Chain(
      Dense(128 => 236, gelu_tanh),     # 30_444 parameters
      Dropout(0.05),
      Dense(236 => 128),                # 30_336 parameters
      Dropout(0.05),
    ),
    LayerNorm(128),                     # 256 parameters
  ),
  Transf(
    MultiHeadAttention(128; nheads=2, dropout_prob=0.05),  # 65_536 parameters
    Dropout(0.05),
    LayerNorm(128),                     # 256 parameters
    Chain(
      Dense(128 => 236, gelu_tanh),     # 30_444 parameters
      Dropout(0.05),
      Dense(236 => 128),                # 30_336 parameters
      Dropout(0.05),
    ),
    LayerNorm(128),                     # 256 parameters
  ),
  Transf(
    MultiHeadAttention(128; nheads=2, dropout_prob=0.05),  # 65_536 parameters
    Dropout(0.05),
    LayerNorm(128),                     # 256 parameters
    Chain(
      Dense(128 => 236, gelu_tanh),     # 30_444 parameters
      Dropout(0.05),
      Dense(236 => 128),                # 30_336 parameters
      Dropout(0.05),
    ),
    LayerNorm(128),                     # 256 parameters
  ),
  Transf(
    MultiHeadAttention(128; nheads=2, dropout_prob=0.05),  # 65_536 parameters
    Dropout(0.05),
    LayerNorm(128),                     # 256 parameters
    Chain(
      Dense(128 => 236, gelu_tanh),     # 30_444 parameters
      Dropout(0.05),
      Dense(236 => 128),                # 30_336 parameters
      Dropout(0.05),
    ),
    LayerNorm(128),                     # 256 parameters
  ),
)                   # Total: 48 arrays, 507_312 parameters, 8.328 KiB.


rank model:
- 
- input:
    - Embedding(979 => 128) layer
    - creates a lookup table to convert each of 979 unique input tokens into a 128-dimensional vector
    - 979×128=125,312 parameters
- output:
    - Dense(128 => 978) layer
    - takes the final 128-dimensional representation and projects it into 978 output values, corresponding to the probability of each token in the vocabulary
    - 128×978+978=126,162 parameters

exp model:
- input:
    - Dense(1 => 128) layer
    - takes a single number and projects it into a 128-dimensional vector
    - 1×128+128=256 parameters.
- output: 
    - Dense(128 => 1, softplus)
    - outputs a single number
    - 128×1+1=129 parameters
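
To make the input-side contrast above concrete, here is a plain-Julia sketch (no Flux) of what an `Embedding(979 => 128)` lookup and a `Dense(1 => 128)` projection compute. The random weights and the `embed`/`project` names are placeholders, not the trained model.

```julia
using Random

rng = MersenneTwister(0)
embed_dim, vocab = 128, 979

# Rank input: a lookup table with one independent column per token id.
E = randn(rng, Float32, embed_dim, vocab)
embed(tokens::AbstractVector{<:Integer}) = E[:, tokens]

# Exp input: an affine map of one scalar, W*x + b, shared by every value.
W = randn(rng, Float32, embed_dim, 1)
b = zeros(Float32, embed_dim)
project(vals::AbstractVector{<:Real}) = W * reshape(Float32.(vals), 1, :) .+ b

ranks = [3, 1, 2]
vals = Float32[0.5, 2.0, 7.1]
println(size(embed(ranks)), " ", size(project(vals)))
println(length(E), " vs ", length(W) + length(b))
```

Both give one 128-dimensional vector per position, but the projection places every expression value on a single line in that space (W scaled by the value, plus b), while the lookup stores a free vector per token.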

difference: ~250k parameters

~~**ADDITIONALLY:**~~ **--> FIXED**
- masked: transf network expands the dimension from 128 to 256 and then back to 128 (128 => 256 => 128)
- exp: transf network expands the dimension from 128 to 236 and then back to 128 (128 => 236 => 128)

difference: ~20k parameters
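
The counts above can be re-derived by hand. This sketch rebuilds both totals from the printed layer sizes; the helper names are made up, and it assumes the PosEnc matrix is counted as a parameter (the rank shape list shows two `(128, 979)` arrays; the 128 x 978 size for the exp model is inferred from its total, not read from the code).

```julia
# Per-layer parameter counts, matching the printed Flux summaries.
dense(in, out) = in * out + out      # weights + biases
layernorm(d) = 2d                    # scale + bias
mha(d) = 4 * d * d                   # q, k, v, out projections, no biases

# One Transf block: attention, two LayerNorms, and the 2-layer MLP.
block(d, hidden) = mha(d) + 2 * layernorm(d) + dense(d, hidden) + dense(hidden, d)

function rank_model(; d=128, hidden=256, layers=4, vocab=979, n_out=978)
    embedding = vocab * d
    posenc = d * vocab               # counted because it shows up in Flux.params
    classifier = dense(d, d) + layernorm(d) + dense(d, n_out)
    return embedding + posenc + layers * block(d, hidden) + classifier
end

function exp_model(; d=128, hidden=236, layers=4, seq_len=978)
    projection = dense(1, d)
    posenc = d * seq_len             # inferred size, see lead-in
    classifier = dense(d, d) + layernorm(d) + dense(d, 1)
    return projection + posenc + layers * block(d, hidden) + classifier
end

println((rank_model(), exp_model(), rank_model() - exp_model()))
```

This reproduces 921,426 and 649,649. The 271,777 gap splits into ~251k from the input/output layers, ~20.5k from the 256 vs. 236 hidden width, and 128 from the extra positional column.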

sep 15, 2025 - autoencoder research
- 

https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf
- stacked denoising autoencoders --> similar to masked pretrain objective
    - difference:
        - DAE = layer-wise pretrain; T = end-to-end pretrain
        - DAE = local ft detection (neural net); T = global/bidirectional context awareness (via posenc too) (attention)
    - similarity:
        - adding noise = setting inputs to 0 = masking
            - so one of the main things of the DAE is that it uses diff "corruption" methods; ex. Gaussian noise, masking (set to 0), salt/pepper noise (set to max/min val, usually either 0 or 1) 
            - reconstructing input = predicting identity of whatever was masked/hidden
- for my project:
    - essentially can use this as the "MLP" (try with/without the stacking)
    - different reconstruction tasks other than masking; ex. Gaussian noise, salt/pepper noise?
    - BUT should i be doing 1. exp DAE, 2. rank DAE, 3. exp transf, 4. rank transf?
        - for comparison against both input aspects and architecture aspects?
        - what else needs to be done in terms of supporting the idea that the baselines can perform just as well? (referencing https://www.nature.com/articles/s41592-025-02772-6#Sec2)
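
A hedged sketch of the three DAE corruption types listed above, in plain Julia. The function names, the corruption rate `p`, and the noise scale `σ` are illustrative choices, not values from the paper or from the project code.

```julia
using Random

# x: features x batch. Each function returns a corrupted copy; the DAE is
# then trained to reconstruct the clean x from the corrupted input.

# Masking noise: each entry is set to 0 with probability p.
mask_noise(rng, x; p=0.15) = x .* (rand(rng, Float32, size(x)) .>= p)

# Additive Gaussian noise with standard deviation σ.
gaussian_noise(rng, x; σ=0.1f0) = x .+ σ .* randn(rng, Float32, size(x))

# Salt-and-pepper noise: each entry is set to hi or lo with probability p.
function salt_pepper(rng, x; p=0.15, lo=minimum(x), hi=maximum(x))
    out = copy(x)
    for i in eachindex(out)
        if rand(rng) < p
            out[i] = rand(rng, Bool) ? hi : lo
        end
    end
    return out
end

# Reconstruction loss is computed against the clean input, not the corrupted one.
mse(x_hat, x) = sum(abs2, x_hat .- x) / length(x)

rng = MersenneTwister(1)
x = rand(rng, Float32, 978, 4)
println(mse(mask_noise(rng, x), x))
```

Masking noise is the variant closest to the current MLM setup; the other two would be new reconstruction tasks.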

reasoning:
- https://arxiv.org/abs/2502.19718 (2025)
    - information theory perspective on masked autoencoders

**look into:**
- can i use a DAE or MAE to compare against an MLM?
    - shouldn't it be bidirectional like a transformer? or is it already?
    - DAE: see above
    - MAE: https://arxiv.org/pdf/1502.03509
- what about contractive autoencoders: https://icml.cc/2011/papers/455_icmlpaper.pdf

**some new stuff:**
- https://www.nature.com/articles/s41598-025-96215-z
- https://arxiv.org/abs/2505.22914

for tmo:
- leo's bottleneck stuff
- lea's benchmarking?
- other benchmarking?
- prep for meeting; other reasonings why to do or not to do DAE/MAE (specifically the quad-comparison as mentioned td above)
- difference between AE and MLP or FNN? or same thing

sep 16, 2025 - organizing objectives
- 

for input comparison:
- i'm thinking of having exp vs. rank on a transformer and exp vs. rank on a FNN...?
- currently the only difference i have between the inputs is that the degree of variation per token learned is a lot greater in the ranked input (measured in entropy) compared to the raw expression value input (measured in variance)

for architecture comparison:
- currently there are ~250k more parameters in the masked (rank) model than in the exp model from the input/output layers alone (totals: 921,426 vs. 649,649)
    - this is due to:
        - rank model:
            - input: Embedding(979 => 128) layer (979×128=125,312 params)
            - output: Dense(128 => 978) layer (128×978+978=126,162 params) - +978 is due to # biases
        - exp model:
            - input: Dense(1 => 128) layer (1×128+128=256 params) - +128 is due to # biases
            - output: Dense(128 => 1, softplus) (128×1+1=129 params) - +1 is due to # biases
- mainly: FNN against MLM:
    - should the FNN be 1 layer, DAE, or MAE? (or maybe try all 3?)

update on plotting:
- entropy vs. error graphs **--> APPLY TO THE 30EP RUN**
    - rank:
        - take mean/avg error per rank instead for better visualization
    - expression:
        - double check the p(x) distribution of the bins; are they fitting in properly or do some bins have 0 counts?
        - do std. dev of expression (or potentially interquartile distance) rather than entropy - this has better representation of variation
- get correlation of avg hexbin for comparison in supplementary
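
For the bin check and the std. dev / interquartile alternatives noted above, a small sketch using only the Statistics stdlib. The bin width of 0.1 is one of the two widths mentioned earlier; the function names are hypothetical.

```julia
using Statistics

# Bin index of each value for a fixed bin width.
bin_ids(x; bin_width=0.1) = floor.(Int, x ./ bin_width)

# Shannon entropy (bits) of the binned values of one gene.
function binned_entropy(x; bin_width=0.1)
    counts = Dict{Int,Int}()
    for b in bin_ids(x; bin_width=bin_width)
        counts[b] = get(counts, b, 0) + 1
    end
    n = length(x)
    return -sum(c / n * log2(c / n) for c in values(counts))
end

# Number of bins between the min and max bin that received no values.
function empty_bins(x; bin_width=0.1)
    ids = bin_ids(x; bin_width=bin_width)
    return (maximum(ids) - minimum(ids) + 1) - length(unique(ids))
end

# Spread measures that do not depend on the binning.
iqr(x) = quantile(x, 0.75) - quantile(x, 0.25)

x = [0.05, 0.15, 0.25, 0.35]
println((binned_entropy(x), empty_bins(x), std(x), iqr(x)))
```

Entropy changes with the bin width (0.01 vs. 0.1), while std. dev and the interquartile distance do not, which is one argument for the switch.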

it was mentioned last wk:
- look at benchmarks
    - leo's; some bottleneck stuff?
    - lea's; predicting gene expression from a different cell line and same perturbation


for reference:
- MLP
    - supervised learning (vs. AE is unsupervised)
    - input layer + hidden layer(s) + output layer; loss is calculated by comparing output to true label
- DAE
    - corrupts input data and decoder reconstructs the original; loss is calculated between reconstructed form z and original input x
    - https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf (stacked DAE)
    - https://www.cs.toronto.edu/~larocheh/publications/icml-2008-denoising-autoencoders.pdf (robust ft w/ DAEs)
- MAE
    - in images: randomly mask 75% of an image, encoding only the visible patches and decoder reconstructs the original
    - outside of images:
        - https://arxiv.org/abs/2309.13793 (ReMasker)
            - original input has missing values; additional inputs are masked for AE to train on - then applied to original missing values
        - https://arxiv.org/abs/2412.19152 (PMAE)***
            - masks original input w/ a probability inversely related to a column's observation rate
            - ensures that the model must learn more from rarer/less frequently observed features
- CAE
    - AE w/ small penalty to loss fxn to reduce sensitivity to small/local variations in the input

todo: 
- ~~does MAE exist outside of images/comp vision?~~ yes; see above
- ~~wtf does MAE for distribution estimation do?~~ learns data distribution from reconstructing masked input
- ~~finish why params are different~~
- ~~redo error per rank graph~~ @ /home/golem/scratch/chans/lincs/plots/trt_and_untrt/masked_rankings/2025-09-11_08-26
- ~~redo entropy (std dev instead) per exp graph (on untrt since trt still running)~~
    - check bin distribution of expression error (on untrt since trt still running)
- do avg exp of genes then sort from highest to lowest for comparison against rank error graph
- look more into reasoning why input diff = terrible ranked input performance

later:
- avg hexbin correlation

sep 17, 2025
- 

INPUT COMPARISON:
- i'm thinking of having exp vs. rank on a transformer and exp vs. rank on a FNN...?
- currently the only difference i have between the inputs is that the degree of variation per token learned is a lot greater in the ranked input (measured in entropy) compared to the raw expression value input (measured in variance)

ARCHITECTURE COMPARISON:
- currently there are ~250k more parameters in the masked (rank) model than in the exp model from the input/output layers alone (totals: 921,426 vs. 649,649)
    - this is due to:
        - rank model:
            - input: Embedding(979 => 128) layer (979×128=125,312 params)
            - output: Dense(128 => 978) layer (128×978+978=126,162 params) - +978 is due to # biases
        - exp model:
            - input: Dense(1 => 128) layer (1×128+128=256 params) - +128 is due to # biases
            - output: Dense(128 => 1, softplus) (128×1+1=129 params) - +1 is due to # biases
- mainly: FNN against MLM:
    - should the FNN be 1 layer, DAE, or MAE? (or maybe try all 3?)

PLOTTING:
- entropy vs. error graphs **--> APPLY TO THE 30EP RUN**
    - rank: (trt-rankings)
        - see rank_vs_avgerror (scatter vs line.png)
    - expression: (infographs)
        - see gene_exp_std_dev_trt.png (is this ok or is IQ distance a better representation?)
    - sorted gene expression: (infographs)
        - sorted_gene_mean_exp_trt.png for validation of sorting
        - sorted_gene_std_dev_trt.png for comparison against rank_entropy_trt.png
    - sorted exp error: (untrt-exp)
        - gene_vs_meanerror.png vs. sorted_gene_vs_meanerror.png
        - sorted_gene_vs_meanerror.png vs. rank_vs_avgerror_scatter.png
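
A sketch of how the sorted-gene plots above could be computed: order genes by mean expression (highest first), then report std. dev and mean error in that order. It assumes genes x samples matrices, and `sorted_gene_summary` and its field names are hypothetical.

```julia
using Statistics

# expr, err: genes x samples (expression values and per-entry prediction error).
function sorted_gene_summary(expr::AbstractMatrix, err::AbstractMatrix)
    gene_mean = vec(mean(expr, dims=2))
    order = sortperm(gene_mean, rev=true)        # highest mean expression first
    return (order = order,
            mean_exp = gene_mean[order],
            std_exp = vec(std(expr, dims=2))[order],
            mean_err = vec(mean(err, dims=2))[order])
end

expr = [1.0 2.0 3.0; 10.0 20.0 30.0; 5.0 5.0 5.0]
err = [0.1 0.1 0.1; 0.5 0.5 0.5; 0.2 0.2 0.2]
s = sorted_gene_summary(expr, err)
println(s.order)
```

Position 1 in this ordering holds the gene with the highest average expression, which is what makes the x-axis roughly comparable to the rank axis in the rank error plot.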

MENTIONED:
- look at benchmarks
    - leo's; some bottleneck stuff?
    - lea's; predicting gene expression from a different cell line and same perturbation


FOR REFERENCE:
- MLP
    - supervised learning (vs. AE is unsupervised)
    - input layer + hidden layer(s) + output layer; loss is calculated by comparing output to true label
- DAE
    - corrupts input data and decoder reconstructs the original; loss is calculated between reconstructed form z and original input x
    - https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf (stacked DAE)
    - https://www.cs.toronto.edu/~larocheh/publications/icml-2008-denoising-autoencoders.pdf (robust ft w/ DAEs)
- MAE
    - in images: randomly mask 75% of an image, encoding only the visible patches and decoder reconstructs the original
    - outside of images:
        - https://arxiv.org/abs/2309.13793 (ReMasker)
            - original input has missing values; additional inputs are masked for AE to train on - then applied to original missing values
        - https://arxiv.org/abs/2412.19152 (PMAE)***
            - masks original input w/ a probability inversely related to a column's observation rate
            - ensures that the model must learn more from rarer/less frequently observed features
            - more missing values = more mask (complete columns used for reconstruction)
            - evaluated w/ coefficient of determination (numerical values) and accuracy (categorical values)
            - argued that the MAE was better than the transformer version for local dependencies

- CAE
    - AE w/ small penalty to loss fxn to reduce sensitivity to small/local variations in the input

aside:
- https://arxiv.org/abs/2105.01601
    - stacked MLP for computer vision that achieved competitive (around the same) scores compared to CNNs and vision transformers
    - this is the model that PMAE paper used

sep 18, 2025 - todo
- 

INPUT COMPARISON:
- complete reasoning results
- complete error explanations
- get graphs for exp 30ep (on smaug 0 rn)
- begin run on rank 30ep afterwards

ARCHITECTURE COMPARISON:
- complete reasoning results
- complete error explanations

PLOTTING:
- rank:
- expression:
        - double check the p(x) distribution of the bins; are they fitting in properly or do some bins have 0 counts?
- get correlation of avg hexbin for comparison in supplementary

sep 20, 2025 - fixing 30ep runs, reorganize/planning for cp
- 

INPUT COMPARISON:
- complete reasoning results
- complete error explanations
- ~~get graphs for trt exp 30ep (30ep on smaug 1 rn thru sbatch)~~ done!
- get graphs for trt rank 30ep (30ep on smaug 0 rn thru nohup)
   - sbatch for rank doesn't work; some kind of OOM error?
      - likely because the flag has to be --mem-per-gpu rather than --mem (smaug has 300gb total! or something)

ARCHITECTURE COMPARISON:
- build FNN-DAE for exp and rank
   - running exp_nn --> issue with code; need to fix
      - right now, it's kinda cooked. going to try to make it so that it's reconstructing the original input rather than reconstructing the embedding space (output = 978 \* batch rather than output = 64 \* batch)
      - should masking be done before compressing input into the embedding or after??
      - things changed:
         - removed output 0.0f0 when sum(mask) = 0 in the loss
         - added mlp_head in Model + function
      - or acc no need to reconstruct original.... because embedding should already be a more robust representation of the input data?
      - should be fine now;
         - just 1. increase mask ratio, 2. decrease LR, 3. normalize input?
   - rank_nn still pending
- complete reasoning results
- complete error explanations

PLOTTING:
- rank:
- expression:
        - double check the p(x) distribution of the bins; are they fitting in properly or do some bins have 0 counts?
- get correlation of avg hexbin for comparison in supplementary

GENERALLY:
- create a doc of everything done, tests, etc. need to create some kind of story for talks later + poster presentations in oct
- start prepping poster :0

some additional things brought up last meeting:

INPUT
- exp:
    - try w/ + w/o positional encoding
    - ~~is posenc independent of gene exp?~~ YES.
        - the positional encoding is deterministic;
        - only uses the position and the embedding dimension index
        - no knowledge of the actual gene expression values
    - ~~or is posenc = gene exp embedding?~~ NO.
        - posenc is added to gene exp embedding
        - "input .+ pe.pe_matrix[:,1:seq_len]"
- rank: 
    - is it possible to pass (posenc + exp) --> add to rank?
        - yes; if we were to make a hybrid version -- try later!!!********** via adding/concatenating rank embedding and expression projected into same dim as embedding
    - ~~does rank need embeddings?~~
        - if i were to feed raw rank integers, the model would assume rank 2 > rank 1 based on numerical distance between numbers (which is not true)
        - since they are arbitrary ids, each rank should have an embedding vector that represents its identity and meaning
    - ~~try embed_dim = 1; since rank input val = arbitrary #~~
        - it isn't an arbitrary number! see above

ARCH
- tf:
    - output is either dummy embed vec returned from pretrain OR add/concat them together from 2nd last layer
- nn:
    - is bottleneck layer needed?
- is it possible to get tf to be as good as nn? (if tf not as good)