
Implement "pico" GPT example #51

Open · hmf opened this issue Aug 17, 2023 · 56 comments

hmf commented Aug 17, 2023

Response to request in issue #44.

Attempt to rewrite the "pico" example from Karpathy's "Let's build GPT: from scratch, in code, spelled out" in storch.


hmf commented Aug 17, 2023

@sbrunk or anyone else. I need some assistance with this work. To code the "pico" example, I need the Embedding operator. In my branch I have added this here. I have also added comments and made sure the ScalaDoc is OK (minus the math expressions).

The code I am working on now is the BiGram class. If I understand the code correctly, I have to pass a Tensor of shape/size (B,T) and get a Float back. According to the native code, that seems to be a call to the forward method. So I am using this in the embedding class, as per the other modules:

  def apply(t: Tensor[Int64]): Tensor[D] = Tensor(nativeModule.forward(t.native))

And this is a problem because I get the error:

[error] 101 |final class Embedding[D <: DType: Default](
[error]     |            ^
[error]     |class Embedding needs to be abstract, since def apply(v1: T1): R in trait Function1 in package scala is not defined 
[error]     |(Note that
[error]     | parameter T1 in def apply(v1: T1): R in trait Function1 in package scala does not match
[error]     | parameter torch.Tensor[torch.Int64] in def apply(t: torch.Tensor[torch.Int64]): torch.Tensor[D] in class Embedding in package torch.nn.modules.embed
[error]     | )

I think this is because we extend from TensorModule :

trait TensorModule[D <: DType] extends Module with (Tensor[D] => Tensor[D]):

In other words, the apply from (Tensor[D] => Tensor[D]) assumes the input and output are of the same type. Do we have other operators where this is not true? If not, how should we handle this?

On a related note, is it possible to constrain the Tensor by its shape?

TIA


hmf commented Aug 17, 2023

In order to keep going I have used the following solution:

  def apply(t: Tensor[D]): Tensor[D] = Tensor(nativeModule.forward(t.native))
  @targetName("apply_T_D")
  def apply[T<:DType](t: Tensor[T]): Tensor[D] = Tensor(nativeModule.forward(t.native))

Is this ok for a final solution?


sbrunk commented Aug 17, 2023

@hmf You're right Embedding is an example where the input type might be different from the output, so we can't inherit from TensorModule.

Note that @davoclavo has also added Embedding and a few other modules in #36 (haven't been able to finish and merge that yet, unfortunately) and added a more generic TensorModuleBase to tackle this issue:

final class Embedding[ParamType <: FloatNN | ComplexNN: Default](
    numEmbeddings: Int,
    embeddingDim: Int,
    paddingIdx: Option[Int] = None,
    maxNorm: Option[Double] = None,
    normType: Option[Double] = Some(2.0),
    scaleGradByFreq: Boolean = false,
    sparse: Boolean = false
) extends HasParams[ParamType]
    with HasWeight[ParamType]
    with TensorModuleBase[Int64, ParamType]:

trait TensorModuleBase[D <: DType, D2 <: DType] extends Module with (Tensor[D] => Tensor[D2]) {
  override def toString() = "TensorModuleBase"
}

So eventually we need to merge your solutions, but for now you could also just inherit from nn.Module and then use your apply method:

  def apply[T<:DType](t: Tensor[T]): Tensor[D] = Tensor(nativeModule.forward(t.native))

On a related note, is it possible to constrain the Tensor by its shape?

Right now, we're tracking only the dtype at compile time. We might add that in the future though.
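
For illustration, this is what the compile-time tracking gives you today (a minimal sketch):

  val t = torch.randn(Seq(2, 3)) // t: Tensor[Float32], the dtype lives in the type
  val s = t.shape                // Seq(2, 3), the shape is only a runtime value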


hmf commented Aug 17, 2023

@sbrunk I have looked at the embedding class and my version is pretty close to it. Currently I cannot search @davoclavo's branch, but I think I can copy and use that code (minimum set of classes with updated docs). Might be easier on your side.

In the meantime, if you do merge into the main branch, I will update accordingly. OK with you?


sbrunk commented Aug 17, 2023

Sounds good to me 👍


hmf commented Aug 17, 2023

Question about the cross-entropy functions. The original code uses something like:

import torch
import torch.nn as nn
from torch.nn import functional as F

...
loss = F.cross_entropy(logits, targets)
...
            probs = F.softmax(logits, dim=-1) # (B, C)

I see that we have 2 options, a function in the Loss package (does not exist yet, only binary version available) and the torch.nn.loss.CrossEntropyLoss version. The storch examples use the latter.

What are the advantages/disadvantages of using one or the other?


sbrunk commented Aug 17, 2023

I see that we have 2 options, a function in the Loss package (does not exist yet, only binary version available) and the torch.nn.loss.CrossEntropyLoss version. The storch examples use the latter.

What are the advantages/disadvantages of using one or the other?

PyTorch has a functional and a class/module variant for most of its nn operations. See torch.nn.functional.cross_entropy
and torch.nn.CrossEntropyLoss. The class variant usually inherits from Module so it's easy to put it into containers expecting modules.

The functional variant does not contain any state, you call it directly with the tensor inputs and other arguments. The class/module variant can be initialized first with init parameters, and then later reused for different inputs. If you have modules with learnable weights/parameters, the module variant also helps you manage that state (makes it easier to update all weights of your model etc.).

For stateless ops without weights, like cross_entropy, the class variant doesn't have much of an advantage except for reuse, so you can also just use the functional variant; it doesn't make much of a difference after all.
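
To make the contrast concrete, here is a rough sketch (the functional crossEntropy doesn't exist yet as you noted, and randint is assumed, so take the names with a grain of salt):

  val logits  = torch.randn(Seq(3, 5))
  val targets = torch.randint(0, 5, Seq(3)) // assumed helper; any Tensor[Int64] of shape (3) works

  // class/module variant: initialize once with its init parameters, reuse across batches
  val criterion = nn.loss.CrossEntropyLoss()
  val loss1 = criterion(logits, targets)

  // functional variant: stateless, called directly with the inputs
  val loss2 = F.crossEntropy(logits, targets)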

@davoclavo

Hello @hmf! Awesome work on implementing Karpathy's examples. I have made some progress as well, but last month I got sidetracked with some things at work, so I wasn't able to prepare the code for sharing.

I'll leave my progress implementing some of the model building blocks here in case it is helpful in any way to you. As @sbrunk mentioned, there are some new modules implemented in PR #36 - such as Embedding, LayerNorm, ModuleList, etc. - and this code expects those modules to exist in storch.

(Btw, you should be able to access my branch via the PR, or via this direct link.)

final case class Head[D <: FloatNN: Default](
    numEmbeddings: Int,
    headSize: Int,
    blockSize: Int,
    dropoutProb: Float
) extends TensorModule[D] {
  val query = register(nn.Linear(numEmbeddings, headSize))
  val key = register(nn.Linear(numEmbeddings, headSize))
  val value = register(nn.Linear(numEmbeddings, headSize))
  val tril = register(torch.tril(torch.ones(Seq(blockSize, blockSize))))
  val dropout = register(Dropout(dropoutProb))

  override def apply(input: Tensor[D]): Tensor[D] =
      val Seq(batch, timeStep, channels) = input.shape // (B, T, C) (64, 256, 384) [Float32]
      assert(blockSize == timeStep, "Block size must be equal to time step")

      val k: Tensor[D] = key(input) // (64, 256, 64) [Float32]
      val q: Tensor[D] = query(input) // (64, 256, 64) [Float32]
      val v: Tensor[D] = value(input) // (64, 256, 64) [Float32]

      // TODO Get rid of the `.to(dtype = q.dtype)`
      val weight =
        torch.matmul(q, torch.transpose(k, -2, -1)) / Tensor(Math.sqrt(channels)).to(dtype = q.dtype) // (64, 256, 256) [Float32]
      val weightMasked =
        weight.maskedFill(
          tril(Slice(0, timeStep), Slice(0, timeStep)) == 0,
          Float.NegativeInfinity
        ) // (64, 256, 256) [Float32]
      val attention =
        torch.nn.functional.softmax(weightMasked, dim = 2)(
          weightMasked.dtype
        ) // (64, 256, 256) [Float32]
      val attentionDropout = dropout(attention) // (64, 256, 256) [Float32]
      val output = attentionDropout.matmul(v) // (64, 256, 64) [Float32]; use the masked, dropped-out attention, not the raw scores
      output
}

final case class MultiHeadAttention[D <: FloatNN: Default](
    numHeads: Int,
    numEmbeddings: Int,
    headSize: Int,
    blockSize: Int,
    dropoutProb: Float
) extends TensorModule[D] {
  // Multiple heads of self-attention in parallel

  val heads = register(nn.ModuleList(Range(0, numHeads).map { _ =>
    Head[D](numEmbeddings, headSize, blockSize, dropoutProb)
  }*))
  val projection = register(nn.Linear(numHeads * headSize, numEmbeddings))
  val dropout = register(Dropout(dropoutProb))
  override def apply(input: Tensor[D]): Tensor[D] =
      val headOutputs = heads.map { head =>
        head(input)
      } // 6 tensors of shape (64, 256, 64) [Float32]
      val headOutputsConcat = torch.cat(headOutputs, dim = -1) // (64, 256, 384) [Float32]
      val projectedOutput = projection(headOutputsConcat) // (64, 256, 384) [Float32]
      dropout(projectedOutput) // (64, 256, 384) [Float32]
}

final case class FeedForward[D <: FloatNN: Default](numEmbeddings: Int, dropoutProb: Float)
    extends TensorModule[D] {
  // A simple linear layer followed by a non-linearity

  val net = register(nn.Sequential(
    nn.Linear(numEmbeddings, numEmbeddings * 4),
    nn.ReLU(),
    nn.Linear(numEmbeddings * 4, numEmbeddings),
    Dropout(dropoutProb)
  ))
  override def apply(input: Tensor[D]): Tensor[D] =
    net(input)

}

final case class Block[D <: FloatNN: Default](numEmbeddings: Int, numHeads: Int, blockSize: Int, dropoutProb: Float)
    extends TensorModule[D] {
  // Transformer block: communication followed by computation
  val headSize = numEmbeddings / numHeads // 384 / 6 = 64
  val attention = register(MultiHeadAttention(numHeads, numEmbeddings, headSize, blockSize, dropoutProb))
  val feedForward = register(FeedForward(numEmbeddings, dropoutProb))
  val layerNorm1 = register(nn.LayerNorm(Seq(numEmbeddings)))
  val layerNorm2 = register(nn.LayerNorm(Seq(numEmbeddings)))

  override def apply(input: Tensor[D]): Tensor[D] =
      // (64, 256, 384) [Float32]
      val a = input + attention(layerNorm1(input)) // (64, 256, 384) [Float32]
      val b = a + feedForward(layerNorm2(a)) // (64, 256, 384) [Float32]
      b

}

final case class Dropout[D <: FloatNN: Default](probability: Float) extends TensorModule[D] {
  override def apply(x: Tensor[D]): Tensor[D] =
    nn.functional.dropout(x, probability)
}

I'm happy to assist you in any way to get this to work. I was able to get some inference going without any runtime errors, but haven't had time to train the model using shakespeare writings yet.

I will also be available to continue work on the pending PR to get it merged, in case I can help in any way @sbrunk

@davoclavo

Oh, I forgot: there are also some changes needed for pico GPT that I haven't created a PR for, but have fixed in my local project. I aim to get these changes submitted soon, but here they are in case you need them earlier:

Tensor#maskedFill

def maskedFill[S <: ScalaType](mask: Tensor[Bool], value: S): Tensor[D] = Tensor(
  native.masked_fill(mask.native, toScalar(value))
)

Tensor#sqrt

def sqrt = Tensor(native.sqrt())

torch.tril

  def tril[D <: DType](input: Tensor[D], diagonal: Int = 0): Tensor[D] =
    Tensor(torchNative.tril(input.native, diagonal.toLong))

Fixing tensor.split (see #39)

  def split[D <: DType](
      input: Tensor[D],
      splitSizeOrSections: Int | Seq[Int],
      dim: Int = 0
  ): Seq[Tensor[D]] = {
    val result =
      splitSizeOrSections match {
        case i: Int      => torchNative.split(input.native, i.toLong, dim.toLong)
        case s: Seq[Int] => torchNative.split(input.native, s.map(_.toLong).toArray, dim.toLong)
      }
    (0L until result.size()).map(i => Tensor(result.get(i)).clone())
  }
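
A usage sketch of the fixed split (torch.arange is assumed here; the section sizes must cover the dimension being split):

  val t = torch.arange(0, 10)
  torch.split(t, 5)         // two chunks of size 5
  torch.split(t, Seq(3, 7)) // chunks of sizes 3 and 7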


sbrunk commented Aug 17, 2023

I will also be available to continue work on the pending PR to get it merged, in case I can help in any way @sbrunk

@davoclavo feel free to take over #36 again if you have capacity. I've merged main into it with some improvements of the native bindings but since Scala Days is only 4 weeks away I'd like to focus on getting my Storch talk ready first. Happy to help/review etc. but I'm not sure I'll be able to actually work on it before the talk.

@davoclavo

@sbrunk sounds good, I'll try to polish the last remaining bits.

Best of luck on the Scala Days talk! Hopefully it will be streamed/recorded, I'd love to watch it :D


sbrunk commented Aug 17, 2023

Best of luck on the Scala Days talk! Hopefully it will be streamed/recorded, I'd love to watch it :D

Thanks! I'm sure it will be recorded and put on YouTube some time after the conference, as the videos from the Seattle edition in June are already online.
I'll keep you posted :)


hmf commented Aug 18, 2023

@davoclavo Thanks for the assist. Please note that at this time I am working on the very simple "video" version. My aim here is to learn about GPT.

I will look at your code and incorporate all I can to make merging easier.


hmf commented Aug 18, 2023

Questions regarding softmax. I was coding the cross_entropy examples to make sure the typing is correct. In the second example we need the softmax function in the link below. Looking at the code I see we have:

  def softmax[In <: DType, Out <: DType](input: Tensor[In], dim: Long)(
      dtype: Out = input.dtype
  ): Tensor[Out] =
    val nativeDType =
      if dtype == input.dtype then ScalarTypeOptional() else ScalarTypeOptional(dtype.toScalarType)
    Tensor(torchNative.softmax(input.native, dim, nativeDType))

This means that we have to explicitly provide the last (usually empty) parameter, so:

  val target1 = F.softmax(input = torch.randn(Seq(3, 5)), dim = 1L)()

If we don't, we get the error:

[error] 358 |  val loss1 = F.crossEntropy(input1, target1)
[error]     |                                     ^^^^^^^
[error]     |Found:    (gpt.BiGram.target1 : torch.DType => torch.Tensor[torch.DType])
[error]     |Required: torch.Tensor[O]
[error]     |
[error]     |where:    O is a type variable with constraint <: torch.NumericRealNN

I have made that last parameter an implicit. I did the same for logSoftmax. If we do this, we avoid having to provide that last parameter. It seems that only the softmax call was used. Ran the tests, had no problem. OK with this change, or am I missing something?

The original Python example code uses a Tensor.softmax(dim=1) call. This method does not exist in storch. The Python documentation states that it is an "Alias for torch.nn.functional.softmax()." Should we add this? If so, do we add it as a standard method or use Scala 3 extension methods?

TIA


sbrunk commented Aug 18, 2023

I have made that last parameter an implicit. I did the same for logSoftmax. If we do this, we avoid having to provide that last parameter. It seems that only the softmax call was used. Ran the tests, had no problem. OK with this change, or am I missing something?

That's fine but could you give the following variant a try? It's a solution we already use in other places and avoids both implicits and multiple parameter lists (at the expense of a slightly more verbose type signature).

import Derive.derive

// ...

  def softmax[In <: DType, Out <: FloatNN | Derive](
      input: Tensor[In],
      dim: Long,
      dtype: Out = derive
  ): Tensor[DTypeOrDeriveFromTensor[In, Out]] =
    val derivedDType = dtype match
      case _: Derive => input.dtype
      case d: DType  => d
    val nativeDType =
      if dtype == input.dtype then ScalarTypeOptional()
      else ScalarTypeOptional(derivedDType.toScalarType)
    Tensor(torchNative.softmax(input.native, dim, nativeDType))
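
At the call site this then works without the extra parameter list (sketch):

  val t = torch.randn(Seq(3, 5))
  val a = F.softmax(t, dim = 1L)                  // dtype derived from the input: Tensor[Float32]
  val b = F.softmax(t, dim = 1L, dtype = float64) // explicit dtype: Tensor[Float64]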

The original Python example code uses a Tensor.softmax(dim=1) call. This method does not exist in storch. The Python documentation states that it is an "Alias for torch.nn.functional.softmax()." Should we add this? If so, do we add it as a standard method or use Scala 3 extension methods?

Yes, you can add it as a regular method in Tensor delegating to the implementation in nn.functional


hmf commented Aug 18, 2023

That's fine but could you give the following variant a try? It's a solution we already use in other places and avoids both implicits and multiple parameter lists (at the expense of a slightly more verbose type signature).

Done (also for logSoftmax). Compiled and all tests pass.

Yes, you can add it as a regular method in Tensor delegating to the implementation in nn.functional

Done:

  def shape: Seq[Int] = size

  def softmax[Out <: FloatNN | Derive](
      dim: Long,
      dtype: Out = derive
  ): Tensor[DTypeOrDeriveFromTensor[D, Out]] = F.softmax(input = this, dim = dim, dtype = dtype)

  def square = Tensor(native.square())


hmf commented Aug 18, 2023

While trying to replicate the Colaboratory notebook to check the code is working, I tried to do the following:

  // We want x[b,t] = mean_{i<=t} x[b,i]
  val xbow = torch.zeros(Seq(b0, t0, c0))
  for b <- 0 until b0
  do
    for t <- 0 until t0
    do
      val xprev = x(b,º`:`t+1) // (t,C)
      xbow(b,t) = torch.mean(xprev, 0)  

The Tensor class has no assignment operator. I also did not find a method for this in the JavaCPP code. How should one go about assigning a value?

TIA


sbrunk commented Aug 18, 2023

The Tensor class has no assignment operator. I also did not find a method for this in the JavaCPP code. How should one go about assigning a value?

The C++ API has a method for assigning values (with indices): See https://pytorch.org/cppdocs/notes/tensor_indexing.html#setter
It's just not that easy to find, because it's named index_put_. It's also mapped via JavaCPP, but was missing in Storch.

#53 should add support for it. Could you give it a try?
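
For reference, Scala desugars assignment syntax into a call to an update method, so once Tensor has an update delegating to the native index_put_, setter syntax should just work (sketch):

  // xbow(Seq(b, t)) = value is sugar for xbow.update(Seq(b, t), value)
  xbow(Seq(b, t)) = torch.mean(xprev, dim = 0)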


hmf commented Aug 19, 2023

Found some compiler weirdness with the changes above. These do not compile:

      xbow(Seq(b,t)) = torch.mean(input=xprev, dim=0)
      xbow(Seq(b,t)) = torch.mean(xprev, dim=0)  

The error is:

method mean in trait ReductionOps: (input: torch.Tensor[?], dtype: torch.Float32): torch.Tensor[torch.Float32] does not have a parameter dim

and (for the last one):

Found:    (0 : Int)
Required: torch.Float32

But these do:

      xbow(b,t) += torch.mean(xprev, dim=0)  
      val c = torch.mean(xprev, dim=0) 
      xbow(Seq(b,t)) = c
      xbow(Seq(b,t)) = torch.mean(input=xprev, dim=0, true, float32)
      xbow(Seq(b,t)) = torch.mean(input=xprev, dim=0, true)

Maybe some tweaking of the first definition may get it working, but it seems like a Scala issue.


sbrunk commented Aug 19, 2023

It looks like the compiler gets confused by the overloaded variants of mean for whatever reason. I've seen this in other places with different generic overloads.

I realized that the default dim argument with an empty seq defaults to the behavior of the overloaded variants, making them redundant, so I've removed them in #53. Could you give it another try with the changes?


hmf commented Aug 20, 2023

@sbrunk Changes work fine. Thanks.


hmf commented Sep 8, 2023

I need to use Dropout. In Python this seems to return a constructor of sorts (did not check), which can then be applied to a Tensor.

I see that we have a torch.nn.Dropout that is private to the torch package. So the more obvious solution of having a public Dropout class and its companion object will require changes. I have the following questions:

  1. Is the suggested change above ok?
  2. If so, can I go ahead and change this?
  3. If not, what is the storch way?

EDIT 1:

@davoclavo I realized you have already defined Dropout. I searched your repo but did not find it. Where did you define it? TIA


hmf commented Sep 8, 2023

I would like to use register_buffer. According to the Python API doc, we must pass in a name.

Looking at the org.bytedeco.pytorch.Module we have:

  public Tensor register_buffer(BytePointer name, Tensor tensor) { return asModule()._register_buffer(name, tensor); }
  private native @ByRef @Name("register_buffer") Tensor _register_buffer(@StdString BytePointer name, @ByVal Tensor tensor);
  public Tensor register_buffer(String name, Tensor tensor) { return asModule()._register_buffer(name, tensor); }
  private native @ByRef @Name("register_buffer") Tensor _register_buffer(@StdString String name, @ByVal Tensor tensor);

So in torch.nn.modules.Module something like this should work:

  def registerB[D <: DType](n: String, t: Tensor[D]): Tensor[D] =
    nativeModule.register_buffer(n, t.native)
    t

However, as an example:

  def register[D <: DType](t: Tensor[D], requiresGrad: Boolean = true)(using
      name: sourcecode.Name
  ): Tensor[D] =
    nativeModule.register_parameter(name.value, t.native, requiresGrad)
    t

the name is implicitly defined. Is there any way I can keep the implicit but still allow manually setting that name?

On a related note, shouldn't these functions return a Tensor(t)? We are assuming the same tensor is returned, but this is not guaranteed.

EDIT 1: we also have the problem of duplicate overload methods due to the use of defaults. What is the way to solve this here? Can I change the names?

EDIT 2: In the meantime I will use:

  def buffer[D <: DType](t: Tensor[D], n: String = "")(using
      name: sourcecode.Name
  ): Tensor[D] =
    val name_ = if n.trim().isEmpty() then name.value else n.trim()
    Tensor(nativeModule.register_buffer(name_, t.native)) // use the resolved name_, not n

TIA


sbrunk commented Sep 8, 2023

I need to use Dropout. In Python this seems to return a constructor of sorts (did not check), which can then be applied to a Tensor.

I see that we have a torch.nn.Dropout that is private to the torch package. So the more obvious solution of having a public Dropout class and its companion object will require changes. I have the following questions:

1. Is the suggested change above ok?

2. If so, can I go ahead and change this?

3. If not, what is the `storch` way?

I think what you found is the Dropout trait in torch.nn.functional, right? The trait is private because its members are exposed through the package object, so you can call it like this:

torch.nn.functional.dropout(input=torch.rand(Seq(3,3)))
// res2: Tensor[Float32] = tensor dtype=float32, shape=[3, 3], device=CPU 
// [[0,4759, 1,4497, 1,7002],
//  [1,2299, 0,0000, 1,1805],
//  [0,0000, 0,0000, 0,0000]]

It corresponds to torch.nn.functional.dropout in Python.

Seems like we're still missing the module variant of Dropout, which corresponds to the Python module you linked to. If you'd like to add that, that would be great! We should put it under torch.nn.modules somewhere, like the other modules.


sbrunk commented Sep 8, 2023

So in torch.nn.modules.Module something like this should work:

  def registerB[D <: DType](n: String, t: Tensor[D]): Tensor[D] =
    nativeModule.register_buffer(n, t.native)
    t

However, as an example:

  def register[D <: DType](t: Tensor[D], requiresGrad: Boolean = true)(using
      name: sourcecode.Name
  ): Tensor[D] =
    nativeModule.register_parameter(name.value, t.native, requiresGrad)
    t

the name is implicitly defined. Is there any way I can keep the implicit but still allow manually setting that name?

We could add an explicit optional name parameter, e.g. defaulting to an empty string, or using an Option. If the caller provides a real name, we take that; otherwise we fall back to the implicit. Ah, I see you've just done that below in the buffer impl :)

On a related note, shouldn't these functions return a Tensor(t)? We are assuming the same tensor is returned, but this is not guaranteed.

You're right, it's better to use the tensor returned by the native register method.

EDIT 1: we also have the problem of duplicate overload methods due to the use of defaults. What is the way to solve this here? Can I change the names?

Yes, please go ahead. Perhaps we can keep register for modules, because it is used quite often, but use registerParameter and registerBuffer for the others.
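
Based on your buffer implementation (quoted below), the renamed variants could look roughly like this (a sketch; signatures assumed):

  def registerParameter[D <: DType](t: Tensor[D], requiresGrad: Boolean = true, n: String = "")(using
      name: sourcecode.Name
  ): Tensor[D] =
    val name_ = if n.trim().isEmpty() then name.value else n.trim()
    Tensor(nativeModule.register_parameter(name_, t.native, requiresGrad))

  def registerBuffer[D <: DType](t: Tensor[D], n: String = "")(using
      name: sourcecode.Name
  ): Tensor[D] =
    val name_ = if n.trim().isEmpty() then name.value else n.trim()
    Tensor(nativeModule.register_buffer(name_, t.native))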

EDIT 2: In the meantime I will use:

  def buffer[D <: DType](t: Tensor[D], n: String = "")(using
      name: sourcecode.Name
  ): Tensor[D] =
    val name_ = if n.trim().isEmpty() then name.value else n.trim()
    Tensor(nativeModule.register_buffer(name_, t.native)) // use the resolved name_, not n

👍


davoclavo commented Sep 8, 2023

@davoclavo I realized you have already defined Dropout. I searched your repo but did not find it. Where did you define it? TIA

Hi @hmf! Apologies for the confusion, I have not committed my changes yet, as I have a bunch of other stuff that needs to be cleaned up. I just posted them in my previous comment to share partial progress in case it was useful to you :)

You should be able to either drop in that code I shared in your script/example, or add it as a new module to storch.

I'll keep my ear open in case you need any further help, and hopefully I'll find some time soon to help contribute these modules to storch.


hmf commented Sep 22, 2023

While trying to implement and debug the multi-head attention mechanism, I have what seems to be unexpected behavior. For a model with the multi-head "only", the code:

    val nuParams = m.parameters.map(_.numel).sum
    println(s"${nuParams} parameters")

Reports:

Multi-head attention
4481 parameters

Now to this model I add the following layer:

    val ffwd = register( FeedFoward(nEmbed) )

where nEmbed = 32. If I count the number of parameters of this layer I get 1056 (nEmbed*nEmbed + nEmbed), which is correct. But the model still reports:

Multi-head attention + FFWD
4481 parameters

Shouldn't that be 4481 + 1056?

TIA


sbrunk commented Sep 22, 2023

@hmf I have a hunch (not tested). Could you try to wrap your Sequential in your feed forward module inside a register as well like so:

class FeedFoward[D <: FloatNN: Default](
    nEmbed: Int
) extends torch.nn.modules.TensorModule[D]: // extends nn.Module:
  val net = nn.Sequential(
    nn.Linear(nEmbed, nEmbed),
    nn.ReLU()
    // nn.Linear(nEmbed, 4 * nEmbed),
    // nn.ReLU(),
    // nn.Linear(4 * nEmbed, nEmbed),
    // nn.Dropout(dropout),
  )

- val net = nn.Sequential(
+ val net = register(nn.Sequential(

Right now it's registering the layers inside Sequential as submodules of net, but not net itself as a submodule of FeedForward. In Python this is done implicitly. Perhaps we need a macro at some point to achieve something similar in Storch as well.


hmf commented Sep 23, 2023

@sbrunk I have confirmed that I need to register the inner modules. As for the macro, maybe a single function that traverses the sub-modules and registers them would do. But we also have parameter and buffer registering, so that would also have to be dealt with.

Thanks.


hmf commented Sep 29, 2023

@davoclavo Thanks for the feedback. Regarding the seed, I have tried to do as in the video, but this may not be correct. Also, some of the constants may be off. For the final version I will try to replicate this code, so general performance should match.

As for the causes of the differences in loss: when I tested the MNIST example on Linux, its behavior was not the same as on Mac. In fact, the process did not converge, which was strange. @sbrunk changed the code, altering the learning rate, so that learning would converge on both OSes.


sbrunk commented Sep 29, 2023

@hmf I'll try to look into this to get a better understanding. Is there anything I need to consider if I want to run it? E.g., I've seen you are using mill, right?


hmf commented Sep 29, 2023

@sbrunk Thanks.

Is there anything I need to consider if I want to run it?

Not really. Simple Scala object. Messy code though. Sorry about that.

I've seen you are using mill, right?

Correct, but it just calls the main. Execution is in the object initialization. Something to correct.

I was hoping to contribute the Mill script (another issue). It is just missing project publishing. When I get time I want to upgrade it to the latest Laika version to avoid the need to override the Helium templates (currently it overrides the header).


hmf commented Oct 9, 2023

I have implemented a clean version of this Python code. It is here. I am able to get a validation error below 2.0 (even less) as shown in the tutorial video, although with an increased number of iterations.

Unfortunately I am unable to use the exact same parameters due to memory issues. I am using a GPU with a whopping 24 GiB. As soon as I start training, CUDA (nvidia-smi) shows over 18 GiB being used. The (smaller) model (13_347_905 parameters) seems to use about 50 MiB. The training loop uses Pointer.physicalBytes(), which reports a stable 2.2 MiB after many iterations. So I am at a loss to know why so much memory is used.

I have looked for the APIs but cannot find the calls to get the CUDA memory stats.

Can anyone give me some pointers on how to check where the memory is used and diagnose this issue?

TIA


sbrunk commented Oct 10, 2023

@hmf yeah right now, we don't really have a good way to do memory profiling. Need to look into that too. Perhaps we can use the JavaCPP Cuda bindings to get better GPU memory usage information.

One idea you could try for now is to run only parts of the model (i.e. just the attention layer etc.) inside a training loop (you can use just random inputs of the right size). That might help to isolate better what part consumes so much memory or where it leaks.


hmf commented Oct 11, 2023

@sbrunk thanks for the suggestions. I have started using a kludge to try and get an idea of where the memory is being allocated. What I do is set a Thread.sleep at certain parts of the code and use nvidia-smi to check the memory. Not practical, but it has helped me see that I need to add some additional PointerScopes. Still trying to figure out why the memory accumulates so much.
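
For reference, the pattern I am converging on looks roughly like this (a sketch; getBatch and the model call are placeholders for my actual code):

  import org.bytedeco.javacpp.PointerScope
  import scala.util.Using

  for step <- 0 until maxIterations do
    Using.resource(new PointerScope()) { _ =>
      // tensors allocated inside the scope are released when it closes,
      // so per-iteration temporaries do not accumulate across the loop
      val (xb, yb) = getBatch(trainData)
      val (logits, loss) = model(xb, yb)
      optimizer.zeroGrad()
      loss.backward()
      optimizer.step()
    }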

Perhaps we can use the JavaCPP Cuda bindings to get better GPU memory usage information.

I was hoping the PointerScope and Pointer would help. No luck, but still looking into it. Do you by any chance have any examples of explicitly de-allocating Tensors via these interfaces/classes?

What JavaCPP Cuda bindings are you referring to? A quick look at the API does not reveal much. I also think that we would need to access this information in a device-independent manner. Maybe via Device?

EDIT: Device does not seem to be helpful. Need to look at torch.cuda.memory_stats for possible solution.


sbrunk commented Oct 11, 2023

@sbrunk thanks for the suggestions. I have started using a kludge to try and get an idea of where the memory is being allocated. What I do is set a Thread.sleep at certain parts of the code and use nvidia-smi to check the memory. Not practical, but it has helped me see that I need to add some additional PointerScopes. Still trying to figure out why the memory accumulates so much.

Does it allocate too much inside a single iteration already or does it grow over multiple iterations during the training loop?

I was hoping the PointerScope and Pointer would help. No luck, but still looking into it. Do you by any chance have any examples of explicitly de-allocating Tensors via these interfaces/classes?

In Storch itself, I think we only have the image classifier example and #5, but you seem to be already doing it this way.

The JavaCPP tests for deallocation and PointerScope might be helpful here. There are also a few issues/discussions in the javacpp/javacpp-presets repo like this: bytedeco/javacpp-presets#1160

What JavaCPP Cuda bindings are you referring to? A quick look at the API does not reveal much. I also think that we would need to access this information in a device-independent manner. Maybe via Device?

The Java bindings to the CUDA toolkit itself. But that's a long shot, I'm not sure if it provides something usable for us here.

EDIT: Device does not seem to be helpful. Need to look at torch.cuda.memory_stats for possible solution.

It looks like LibTorch provides something like this, see: https://discuss.pytorch.org/t/libtorch-equivalent-of-torch-cuda-memory-reserved/165995
But I haven't found anything related in the JavaCPP PyTorch bindings, so it might need to be mapped in the preset. Perhaps you could open an issue in https://github.com/bytedeco/javacpp-presets to ask about it.


hmf commented Oct 11, 2023

@sbrunk thanks for the suggestions. I have started using a kludge to try and get an idea of where the memory is being allocated. What I do is set a Thread.sleep at certain parts of the code and use nvidia-smi to check the memory. Not practical, but it has helped me see that I need to add some additional PointerScopes. Still trying to figure out why the memory accumulates so much.

Does it allocate too much inside a single iteration already or does it grow over multiple iterations during the training loop?

It grows as it iterates.

I was hoping the PointerScope and Pointer would help. No luck, but still looking into it. Do you by any chance have any examples of explicitly de-allocating Tensors via these interfaces/classes?

In Storch itself, I think we only have the image classifier example and #5, but you seem to be already doing it this way.

The JavaCPP tests for deallocation and PointerScope might be helpful here. There are also a few issues/discussions in the javacpp/javacpp-presets repo like this: bytedeco/javacpp-presets#1160

I had seen these already. What I have learned is that one of the functions (the one that calculates the validation and training loss) was accumulating memory. I added another PointerScope and now the model runs with the original parameters. The 13_443_137-parameter model still seems to consume too much memory (7 GiB). Not everyone has a GPU with that much memory. Maybe another PointerScope can reduce this at a cost.

What JavaCPP Cuda bindings are you referring to? A quick look at the API does not reveal much. I also think that we would need to access this information in a device-independent manner. Maybe via Device?

The Java bindings to the CUDA toolkit itself. But that's a long shot, I'm not sure if it provides something usable for us here.

Ok. I agree with you.

EDIT: Device does not seem to be helpful. Need to look at torch.cuda.memory_stats for possible solution.

It looks like LibTorch provides something like this, see: https://discuss.pytorch.org/t/libtorch-equivalent-of-torch-cuda-memory-reserved/165995 But I haven't found anything related in the JavaCPP PyTorch bindings, so it might need to be mapped in the preset. Perhaps you could open an issue in https://github.com/bytedeco/javacpp-presets to ask about it.

Ok.


hmf commented Oct 11, 2023

The current V2 implementation is using the same parameters but will not converge using the original learning rate. The only thing that is missing is the weight and bias initialization. The nn.Module does not seem to have an apply method like PyTorch's that does this.

So what is the best way forward here? Should we include such a method? What should we name it? Do we simply iterate through all layers and apply a function, like Python does?

Should I open a new issue to discuss this?

TIA

EDIT: above it should read "The current V2 implementation is using the same parameters but will not converge as quickly using the original learning rate."


sbrunk commented Oct 11, 2023

Good idea. A recursive apply like in Python should be quite useful. And yes, please create a new issue for this.
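
Something along these lines perhaps (a rough sketch; the namedChildren accessor and the init call are assumptions):

  // mirrors Python's nn.Module.apply: visit all submodules recursively, then the module itself
  def applyRecursively(m: Module)(fn: Module => Unit): Unit =
    m.namedChildren.values.foreach(child => applyRecursively(child)(fn))
    fn(m)

  // usage sketch for weight initialization:
  // applyRecursively(model) {
  //   case l: nn.Linear[?] => () // init l.weight here
  //   case _               => ()
  // }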


hmf commented Oct 12, 2023

Results of v2 on par with tutorial (loss below 2.0), but slower convergence. After a while it diverges. At the end I show an example of its output.

13443137 parameters
learningRate = 1.0E-4
maxIterations = 67000
dropout = 0.2
GPU total = 24.0 GiB
GPU used = 6.9 GiB
13443137 parameters >= 53772548 bytes = 51.3 MiB
step 0: train loss 4.335848, val loss 4.332262, mem 714.0 MiB @ 00 00:00:00.000, mean 00 00:00:00.000
step 500: train loss 2.5476382, val loss 2.5570214, mem 838.6 MiB @ 00 00:01:01.811, mean 00 00:00:00.123
step 1000: train loss 2.508413, val loss 2.5130005, mem 839.4 MiB @ 00 00:02:03.242, mean 00 00:00:00.122
step 1500: train loss 2.4970562, val loss 2.495719, mem 839.5 MiB @ 00 00:03:04.595, mean 00 00:00:00.122
step 2000: train loss 2.469262, val loss 2.4932768, mem 839.7 MiB @ 00 00:04:05.924, mean 00 00:00:00.122
step 2500: train loss 2.4545126, val loss 2.4732363, mem 839.7 MiB @ 00 00:05:07.223, mean 00 00:00:00.122
step 3000: train loss 2.4397492, val loss 2.4652252, mem 841.5 MiB @ 00 00:06:08.506, mean 00 00:00:00.122
step 3500: train loss 2.4432235, val loss 2.4631853, mem 841.6 MiB @ 00 00:07:09.783, mean 00 00:00:00.122
step 4000: train loss 2.4328675, val loss 2.457826, mem 841.6 MiB @ 00 00:08:11.065, mean 00 00:00:00.122
step 4500: train loss 2.4265091, val loss 2.4551604, mem 841.6 MiB @ 00 00:09:12.319, mean 00 00:00:00.122
step 5000: train loss 2.4185965, val loss 2.450682, mem 841.6 MiB @ 00 00:10:13.570, mean 00 00:00:00.122
step 5500: train loss 2.4081905, val loss 2.447644, mem 841.9 MiB @ 00 00:11:14.834, mean 00 00:00:00.122
step 6000: train loss 2.3958697, val loss 2.4314053, mem 842.1 MiB @ 00 00:12:16.084, mean 00 00:00:00.122
step 6500: train loss 2.381643, val loss 2.4313533, mem 842.1 MiB @ 00 00:13:17.338, mean 00 00:00:00.122
step 7000: train loss 2.364158, val loss 2.4134161, mem 842.1 MiB @ 00 00:14:18.601, mean 00 00:00:00.122
step 7500: train loss 2.3529167, val loss 2.4074175, mem 842.2 MiB @ 00 00:15:19.855, mean 00 00:00:00.122
step 8000: train loss 2.328031, val loss 2.3847246, mem 842.5 MiB @ 00 00:16:21.105, mean 00 00:00:00.122
step 8500: train loss 2.292856, val loss 2.351461, mem 842.5 MiB @ 00 00:17:22.352, mean 00 00:00:00.122
step 9000: train loss 2.2544227, val loss 2.321474, mem 842.5 MiB @ 00 00:18:23.577, mean 00 00:00:00.122
step 9500: train loss 2.219748, val loss 2.2897422, mem 842.5 MiB @ 00 00:19:24.806, mean 00 00:00:00.122
step 10000: train loss 2.1745658, val loss 2.2487366, mem 842.5 MiB @ 00 00:20:25.992, mean 00 00:00:00.122
step 10500: train loss 2.1545537, val loss 2.235534, mem 842.5 MiB @ 00 00:21:27.209, mean 00 00:00:00.122
step 11000: train loss 2.13079, val loss 2.2194557, mem 842.5 MiB @ 00 00:22:28.427, mean 00 00:00:00.122
step 11500: train loss 2.107516, val loss 2.1982605, mem 842.5 MiB @ 00 00:23:29.635, mean 00 00:00:00.122
step 12000: train loss 2.085714, val loss 2.1769443, mem 842.5 MiB @ 00 00:24:30.833, mean 00 00:00:00.122
step 12500: train loss 2.0651603, val loss 2.1682646, mem 842.5 MiB @ 00 00:25:32.036, mean 00 00:00:00.122
step 13000: train loss 2.04403, val loss 2.145483, mem 842.6 MiB @ 00 00:26:33.255, mean 00 00:00:00.122
step 13500: train loss 2.0215368, val loss 2.1287656, mem 842.8 MiB @ 00 00:27:34.475, mean 00 00:00:00.122
step 14000: train loss 2.073082, val loss 2.1562624, mem 842.8 MiB @ 00 00:28:35.691, mean 00 00:00:00.122
step 14500: train loss 2.0549197, val loss 2.1463277, mem 842.8 MiB @ 00 00:29:36.914, mean 00 00:00:00.122
step 15000: train loss 2.0292356, val loss 2.1369631, mem 842.8 MiB @ 00 00:30:38.123, mean 00 00:00:00.122
step 15500: train loss 2.0073128, val loss 2.1167858, mem 842.8 MiB @ 00 00:31:39.302, mean 00 00:00:00.122
step 16000: train loss 1.987694, val loss 2.1022239, mem 842.8 MiB @ 00 00:32:40.479, mean 00 00:00:00.122
step 16500: train loss 1.9841061, val loss 2.0968378, mem 842.8 MiB @ 00 00:33:41.664, mean 00 00:00:00.122
step 17000: train loss 1.964174, val loss 2.0827105, mem 842.8 MiB @ 00 00:34:42.826, mean 00 00:00:00.122
step 17500: train loss 1.9512877, val loss 2.0708296, mem 843.1 MiB @ 00 00:35:43.998, mean 00 00:00:00.122
step 18000: train loss 1.9287692, val loss 2.0533903, mem 843.3 MiB @ 00 00:36:45.177, mean 00 00:00:00.122
step 18500: train loss 1.9105072, val loss 2.0451093, mem 843.3 MiB @ 00 00:37:46.354, mean 00 00:00:00.122
step 19000: train loss 1.8970441, val loss 2.0320392, mem 843.3 MiB @ 00 00:38:47.511, mean 00 00:00:00.122
step 19500: train loss 1.8854179, val loss 2.0191305, mem 843.6 MiB @ 00 00:39:48.674, mean 00 00:00:00.122
step 20000: train loss 1.8791237, val loss 2.0184035, mem 843.6 MiB @ 00 00:40:49.850, mean 00 00:00:00.122
step 20500: train loss 1.8812588, val loss 2.0221977, mem 843.6 MiB @ 00 00:41:51.024, mean 00 00:00:00.122
step 21000: train loss 1.903744, val loss 2.030137, mem 843.6 MiB @ 00 00:42:52.190, mean 00 00:00:00.122
step 21500: train loss 1.8765324, val loss 2.0156875, mem 843.7 MiB @ 00 00:43:53.337, mean 00 00:00:00.122
step 22000: train loss 1.8608563, val loss 2.0002515, mem 843.7 MiB @ 00 00:44:54.489, mean 00 00:00:00.122
step 22500: train loss 1.8509007, val loss 1.9926138, mem 843.7 MiB @ 00 00:45:55.633, mean 00 00:00:00.122
step 23000: train loss 1.8334926, val loss 1.9830146, mem 843.7 MiB @ 00 00:46:56.772, mean 00 00:00:00.122
step 23500: train loss 1.8383644, val loss 1.9679743, mem 843.7 MiB @ 00 00:47:57.908, mean 00 00:00:00.122
step 24000: train loss 1.8247951, val loss 1.9709209, mem 843.7 MiB @ 00 00:48:59.069, mean 00 00:00:00.122
step 24500: train loss 1.8079953, val loss 1.9558063, mem 843.8 MiB @ 00 00:50:00.229, mean 00 00:00:00.122
step 25000: train loss 1.8013923, val loss 1.9549898, mem 843.8 MiB @ 00 00:51:01.393, mean 00 00:00:00.122
step 25500: train loss 1.7907901, val loss 1.945321, mem 843.8 MiB @ 00 00:52:02.553, mean 00 00:00:00.122
step 26000: train loss 1.7799153, val loss 1.939117, mem 843.8 MiB @ 00 00:53:03.719, mean 00 00:00:00.122
step 26500: train loss 1.7685446, val loss 1.9267551, mem 843.8 MiB @ 00 00:54:04.868, mean 00 00:00:00.122
step 27000: train loss 1.7624547, val loss 1.9231879, mem 843.8 MiB @ 00 00:55:06.017, mean 00 00:00:00.122
step 27500: train loss 1.7491188, val loss 1.9149151, mem 844.1 MiB @ 00 00:56:07.160, mean 00 00:00:00.122
step 28000: train loss 1.7429442, val loss 1.9086628, mem 844.1 MiB @ 00 00:57:08.307, mean 00 00:00:00.122
step 28500: train loss 1.7355636, val loss 1.9029418, mem 844.1 MiB @ 00 00:58:09.454, mean 00 00:00:00.122
step 29000: train loss 1.7248852, val loss 1.8960071, mem 844.1 MiB @ 00 00:59:10.608, mean 00 00:00:00.122
step 29500: train loss 1.7195947, val loss 1.8904512, mem 844.1 MiB @ 00 01:00:11.752, mean 00 00:00:00.122
step 30000: train loss 1.7153524, val loss 1.8848493, mem 844.1 MiB @ 00 01:01:12.904, mean 00 00:00:00.122
step 30500: train loss 1.7048767, val loss 1.8789163, mem 844.1 MiB @ 00 01:02:14.060, mean 00 00:00:00.122
step 31000: train loss 1.694385, val loss 1.870024, mem 844.1 MiB @ 00 01:03:15.210, mean 00 00:00:00.122
step 31500: train loss 1.6884319, val loss 1.8608154, mem 844.1 MiB @ 00 01:04:16.349, mean 00 00:00:00.122
step 32000: train loss 1.6768422, val loss 1.8586318, mem 844.1 MiB @ 00 01:05:17.483, mean 00 00:00:00.122
step 32500: train loss 1.6761434, val loss 1.8587543, mem 844.1 MiB @ 00 01:06:18.619, mean 00 00:00:00.122
step 33000: train loss 1.6758552, val loss 1.8544992, mem 844.1 MiB @ 00 01:07:19.746, mean 00 00:00:00.122
step 33500: train loss 1.67037, val loss 1.8574976, mem 844.1 MiB @ 00 01:08:20.870, mean 00 00:00:00.122
step 34000: train loss 1.6646343, val loss 1.8511721, mem 844.1 MiB @ 00 01:09:21.990, mean 00 00:00:00.122
step 34500: train loss 1.6610796, val loss 1.8486292, mem 844.3 MiB @ 00 01:10:23.103, mean 00 00:00:00.122
step 35000: train loss 1.6537488, val loss 1.8431506, mem 844.3 MiB @ 00 01:11:24.236, mean 00 00:00:00.122
step 35500: train loss 1.6544412, val loss 1.843468, mem 844.3 MiB @ 00 01:12:25.375, mean 00 00:00:00.122
step 36000: train loss 1.6563864, val loss 1.842051, mem 844.3 MiB @ 00 01:13:26.514, mean 00 00:00:00.122
step 36500: train loss 1.6723832, val loss 1.8542444, mem 844.3 MiB @ 00 01:14:27.646, mean 00 00:00:00.122
step 37000: train loss 1.6729113, val loss 1.8599828, mem 844.6 MiB @ 00 01:15:28.785, mean 00 00:00:00.122
step 37500: train loss 1.657896, val loss 1.8432986, mem 844.6 MiB @ 00 01:16:29.928, mean 00 00:00:00.122
step 38000: train loss 1.6419864, val loss 1.8300749, mem 845.2 MiB @ 00 01:17:31.076, mean 00 00:00:00.122
step 38500: train loss 1.6395802, val loss 1.831336, mem 845.4 MiB @ 00 01:18:32.209, mean 00 00:00:00.122
step 39000: train loss 1.6333517, val loss 1.8239709, mem 845.4 MiB @ 00 01:19:33.320, mean 00 00:00:00.122
step 39500: train loss 1.6248128, val loss 1.8159283, mem 845.4 MiB @ 00 01:20:34.444, mean 00 00:00:00.122
step 40000: train loss 1.6188323, val loss 1.8165076, mem 845.4 MiB @ 00 01:21:35.571, mean 00 00:00:00.122
step 40500: train loss 1.6140128, val loss 1.8128036, mem 845.4 MiB @ 00 01:22:36.716, mean 00 00:00:00.122
step 41000: train loss 1.6085365, val loss 1.8036649, mem 850.8 MiB @ 00 01:23:37.860, mean 00 00:00:00.122
step 41500: train loss 1.6002386, val loss 1.8010824, mem 850.8 MiB @ 00 01:24:38.994, mean 00 00:00:00.122
step 42000: train loss 1.5997845, val loss 1.803601, mem 850.8 MiB @ 00 01:25:40.133, mean 00 00:00:00.122
step 42500: train loss 1.9896176, val loss 2.08117, mem 850.8 MiB @ 00 01:26:41.304, mean 00 00:00:00.122
step 43000: train loss 2.6600757, val loss 2.7321503, mem 850.8 MiB @ 00 01:27:42.501, mean 00 00:00:00.122
step 43500: train loss 3.4551952, val loss 3.4863732, mem 850.8 MiB @ 00 01:28:43.643, mean 00 00:00:00.122
step 44000: train loss 3.44979, val loss 3.486229, mem 850.8 MiB @ 00 01:29:44.789, mean 00 00:00:00.122
step 44500: train loss 3.426718, val loss 3.467198, mem 850.8 MiB @ 00 01:30:45.911, mean 00 00:00:00.122
step 45000: train loss 3.4057078, val loss 3.4453905, mem 851.3 MiB @ 00 01:31:46.969, mean 00 00:00:00.122
step 45500: train loss 3.3713775, val loss 3.416267, mem 851.3 MiB @ 00 01:32:47.956, mean 00 00:00:00.121
step 46000: train loss 3.386201, val loss 3.4285066, mem 851.3 MiB @ 00 01:33:48.900, mean 00 00:00:00.121
step 46500: train loss 3.368285, val loss 3.4076788, mem 851.3 MiB @ 00 01:34:49.797, mean 00 00:00:00.121

Output (removed initial white spaces, too many):

GLOUCESTER:
Meaved:
Le, loopeak thather miscapio, and Caps noues?
Buch willift my freat I r'd did

LANDWAGLOUD INA:
Foolvest Hapry, shall he no cacinf it pake thou.
Are withe ret withsks shat usant:
Why somein so, helly how.
G Edwars haven, af of my unt: and,
To douer naty and babt! age she woun wilcy stith fabasten ther na s,
To beethe n as ave thigh bid sty are in in is
To nevirly thed, and our aft shal well, couser's be thringe.

ROMET:
For that ther sthing the meet is than him!
's.

HMOPSAMP'


sbrunk commented Oct 12, 2023

@hmf amazing work!

I got hands on a larger GPU now and will start playing with it.


dejvid commented Nov 18, 2023

Is it possible to get the code for this example? It would be an excellent example project to learn from if it were posted.


hmf commented Nov 18, 2023

@dejvid The examples are in this fork. It is not completed yet - still needs weight initialization (see issue #61). I have had trouble finishing due to an update to PyTorch. See issue #62.

hmf added a commit to hmf/storch that referenced this issue Dec 18, 2023

hmf commented Dec 18, 2023

@sbrunk I have created a clean branch with the changes that implement the example. This holds changes for #51 and #61 (apologies for the comment error in the commit).
Some notes on the implementation:

  1. It does not have the performance of the original Python code. I have worked on this for a long time but cannot figure out what is wrong. Needs reviewing.
  2. The use of weight initialization makes performance significantly worse. I have left it in. Maybe it should be removed or corrected (it might not be the intended/best initialization).

I now have another issue. After updating to your latest changes (2.1.1), the GPU does not work on my side. This is also true for the original LeNet example (nvidia-smi shows no process using the GPU). Note that because of this the current code breaks, and it is not really usable without a GPU.

EDIT: forgot to mention that with version 2.1.0 I had some memory issues I did not have with the previous version. In particular I implemented a wrapper for memory_stats. I think it is necessary to add the memory management functions (such as clearing the cache) to allow us to use storch effectively.

Could you check this and give me feedback?

TIA


sbrunk commented Dec 19, 2023

Thanks @hmf for pushing this forward. Could you create a PR from your branch? That should make it easier to do reviewing.

I'll try to figure out the new GPU issue.


hmf commented Dec 20, 2023

@sbrunk The code assumes a GPU is available. The printMemoryInfo will fail otherwise. Remove it if you need to test with CPU only.

The class is gpt.V2.


sbrunk commented Dec 24, 2023

I now have another issue. After updating to your latest changes (2.1.1), the GPU does not work on my side. This is also true for the original LeNet example (nvidia-smi shows no process using the GPU). Note that because of this the current code breaks, and it is not really usable without a GPU.

I can't reproduce the GPU issue at the moment, but I could only try on an RTX 4090 so far, which is ADA architecture, while 3090 is Ampere.

It did work with 2.1.0 for you, right?

Could you give it a try with the latest update to PyTorch 2.1.2 by bumping the PyTorch patch version in build.sbt?

- val pytorchVersion = "2.1.1"
+ val pytorchVersion = "2.1.2"


hmf commented Dec 27, 2023

@sbrunk

GPU working with 2.1.2

Note that with:

set ThisBuild / enableGPU := true

In sbt I now get:

[error] stack trace is suppressed; run last core / update for the full output
[error] (core / update) lmcoursier.internal.shaded.coursier.error.FetchError$DownloadingArtifacts: Error fetching artifacts:
[error] https://oss.sonatype.org/content/repositories/snapshots/org/bytedeco/pytorch/2.1.1-1.5.10-SNAPSHOT/pytorch-2.1.1-1.5.10-20231204.171720-12-linux-x86_64-gpu.jar: not found: https://oss.sonatype.org/content/repositories/snapshots/org/bytedeco/pytorch/2.1.1-1.5.10-SNAPSHOT/pytorch-2.1.1-1.5.10-20231204.171720-12-linux-x86_64-gpu.jar

But no issues with 2.1.2.

Thanks.

EDIT: yes it is working with 2.1.0


sbrunk commented Jan 2, 2024

@hmf I'm trying to reproduce your results but it's diverging much faster in my case. I get down to a train loss of 2.3 but then it starts to go up again. At some point, the losses even go NaN.

[info] step 8000: train loss 2.3154821, val loss 2.3707209, mem 942,0 MiB @ 00 00:08:13.783, mean 00 00:00:00.061
...
[info] step 40500: train loss NaN, val loss NaN, mem 944,2 MiB @ 00 00:41:43.711, mean 00 00:00:00.061

Here's how I ran it: Took the branch, disabled weight init, ran with a learning rate of 1e-4.

// modules.foreach(init_weights)
// ...

train(model, 1e-4, 67000)

Any idea what could be different?


hmf commented Jan 3, 2024

@sbrunk I am trying to rerun the test to confirm all is OK. Unfortunately I have had to recreate the dev container. I have also merged your latest changes from main. I am now running with initialization off.

As soon as I get some results I will report back.

Any idea what could be different?

At this time, it only occurs to me that the OS libraries (including NVIDIA's stuff) may be different. I am assuming you are using Ubuntu Linux. Below is a list of the setup.

storch_os.txt

We could also try setting up a fixed random number seed for replication.


hmf commented Jan 3, 2024

@sbrunk I have rerun with PyTorch 2.1.2 and get:

[info] step 41500: train loss 1.6008276, val loss 1.7987113, @ 00 01:25:21.304, mean 00 00:00:00.123

Here is the full output:
tmp.txt

I have added:

 torch.manualSeed(1337)

and am running this again. We can then test and compare with this seed.

I can't reproduce the GPU issue at the moment, but I could only try on an RTX 4090 so far, which is ADA architecture, while 3090 is Ampere.

I have noticed that your compute time per iteration is about half of mine (0.123 vs 0.061). Nice 8-)

EDIT: this run resulted in an abrupt divergence with a resulting NaN. Can you check that you get the same output? Here is the output I get:

tmp.txt


sbrunk commented Jan 3, 2024

@hmf I did a run on d8d75b7 and it's much closer to your result than before:
train-gpt.v2.txt.

There are still numeric differences but I guess that could be due to slightly different hardware/driver.


hmf commented Jan 4, 2024

@sbrunk I am somewhat skeptical of the results. I noticed that the first loss is indeed the same value (barring precision errors). For that reason I would expect the next values to be the same (the data is the same and should be loaded in the same order). This also does not bode well for unit tests, which could be added to confirm your hypothesis.

Having said this, I am not satisfied with the results. I am considering a ViT implementation with SOTA performance that can be compared against. This case is easier to test than NLP because preprocessing should be simpler. Just a thought.


sbrunk commented Jan 7, 2024

@sbrunk I am somewhat skeptical of the results. I noticed that the first loss is indeed the same value (barring precision errors). For that reason I would expect the next values to be the same (the data is the same and should be loaded in the same order). This also does not bode well for unit tests, which could be added to confirm your hypothesis.

That's true. It seems like I get reproducible results on the same hardware/setup though when I run it twice. I'll have access to yet another GPU type next week and I'll try it there too for comparison.

Having said this, I am not satisfied with the results. I am considering a ViT implementation with SOTA performance that can be compared against. This case is easier to test than NLP because preprocessing should be simpler. Just a thought.

That's a great idea. If you'll give it a try, let me know if I can help you in any way.

sbrunk added a commit that referenced this issue Jan 14, 2024
Pico GPT v2 code for #51. Includes changes for #61