
Add GGJT loader #114

Closed
wants to merge 10 commits into from

Conversation

iacore
Contributor

@iacore iacore commented Apr 6, 2023

Related to #62 and #93.

Single-file model format (magic = ggjt).

@iacore iacore changed the title from "Add loader stub for GGJT" to "Add GGJT loader" Apr 6, 2023
@philpax
Collaborator

philpax commented Apr 6, 2023

Very nice! Glad someone took this on - we were discussing this on the Discord but decided to wait until upstream figured out what they wanted to do.

There's overlap with #84 and #85, so the merge in the future might be tricky. Just a heads-up.

@iacore
Contributor Author

iacore commented Apr 6, 2023

I ported the code from C++.

I'm sure the reading part is correct. Maybe the tensor setup has changed? Or ggml has changed?

The model runs, but it is producing garbage:

### Assistant: 1 + 1 is '`
[2023-04-06T20:03:26Z INFO  llama_cli] Warning: Bad token in vocab at index 131
[2023-04-06T20:03:26Z INFO  llama_cli] Warning: Bad token in vocab at index 132
[... the same warning repeats for every index up to 257 ...]
[2023-04-06T20:03:26Z INFO  llama_cli] Warning: Bad token in vocab at index 258
[2023-04-06T20:03:26Z INFO  llama_cli] ggml ctx size = 7759.50 MB
[2023-04-06T20:03:26Z INFO  llama_cli] Loading model part 1/1 from 'models/ggml-vicuna-13b-4bit.bin'
[2023-04-06T20:03:26Z INFO  llama_cli] Loaded tensor 8/363
[2023-04-06T20:03:26Z INFO  llama_cli] Loaded tensor 16/363
[... progress logged every 8 tensors up to 352/363 ...]
[2023-04-06T20:03:26Z INFO  llama_cli] Loaded tensor 360/363
[2023-04-06T20:03:26Z INFO  llama_cli] Loading of 'models/ggml-vicuna-13b-4bit.bin' complete
[2023-04-06T20:03:26Z INFO  llama_cli] Model size = 7759.40 MB / num tensors = 363
[2023-04-06T20:03:26Z INFO  llama_cli] Model fully loaded!
>> Hello?
⣟ 

!#
  #
   #!$

@philpax
Collaborator

philpax commented Apr 6, 2023

I'd take the same model in ggmf and ggjt format and compare the loaded tensors. I'm guessing it's probably a misalignment or something similar.

@iacore
Contributor Author

iacore commented Apr 6, 2023

Is it normal to warn about bad tokens? The bad tokens are single-byte strings with values above 128. That's invalid UTF-8, but the tokenizer doesn't care?

I'm probably done with this for a while.

@philpax
Collaborator

philpax commented Apr 6, 2023

Yes, we're discussing that over at #11. The short explanation is that the invalid tokens are not valid UTF-8 by themselves, but are composed to form valid UTF-8. We're figuring out what the actual solution should be.
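As a quick illustration of that point (an editor's sketch, not code from this repo): the byte-level pieces of a multi-byte character are invalid UTF-8 on their own but valid once concatenated, so a decoder has to buffer token bytes instead of converting each token to a string individually.

fn main() {
    // Two single-byte "tokens" that together encode "é" (0xC3 0xA9).
    let pieces: [&[u8]; 2] = [&[0xC3], &[0xA9]];
    for p in pieces {
        // Each piece alone is not valid UTF-8...
        assert!(std::str::from_utf8(p).is_err());
    }
    // ...but their concatenation is.
    let joined: Vec<u8> = pieces.concat();
    assert_eq!(std::str::from_utf8(&joined).unwrap(), "é");
}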

@KerfuffleV2
Contributor

Does it work with the old formats and only produce garbage when using a GGJT model, or is it always garbage?

@iacore
Contributor Author

iacore commented Apr 7, 2023

I haven't changed the behavior of the old formats; only GGJT produces garbage.

I don't have an older model. Can you try this branch on GGMF models?

@KerfuffleV2
Contributor

ggml: Fine.

ggmf: thread 'main' panicked at 'Could not load model: TensorWrongSize { tensor_name: "tok_embeddings.weight", path: "blah.bin" }', llama-cli/src/main.rs:206:6
(I don't know if it's actually supposed to work here or not.)

ggjt: Garbage.

@KerfuffleV2
Contributor

Was that last force push just rebasing on the current version or does it involve changes to the loading that may fix stuff that previously didn't work?

@iacore
Contributor Author

iacore commented Apr 7, 2023

Was that last force push just rebasing on the current version or does it involve changes to the loading that may fix stuff that previously didn't work?

Just a rebase.

Can you share a simple GGMF model that I can use for testing?

@KerfuffleV2
Contributor

I think this one is GGMF: https://huggingface.co/Sosaka/Alpaca-native-4bit-ggml/

(Don't know if it's a problem to post something like that here, if so let me know and I'll edit it out after iacore sees it.)

@iacore
Contributor Author

iacore commented Apr 7, 2023

That one is GGML (a.k.a. unversioned), not GGMF.

@KerfuffleV2
Contributor

Sorry, my mistake. Unfortunately, I don't really have a reasonable way to share huge files.

I've been trying to figure out what the issue with GGJT is. One thing I can say is I don't think your logic for finding the tensors/lengths/offsets has a problem.

I added some printfs to llama.cpp and the corresponding ones to llama-rs.

Load: tok_embeddings.weight, offset=432672, size=102403200
Load: norm.weight, offset=102835904, size=20480
Load: output.weight, offset=102856448, size=102403200
[...]
Load: layers.39.feed_forward.w2.weight, offset=8048282880, size=44236800
Load: layers.39.feed_forward.w3.weight, offset=8092519744, size=44236800
Load: layers.39.ffn_norm.weight, offset=8136756608, size=20480

Absolutely no difference in output between the C and Rust versions.

@iacore
Contributor Author

iacore commented Apr 7, 2023

Still no clue.

@KerfuffleV2
Contributor

Okay, so I got it actually running inference on a GGJT model. However, what I had to do makes the mmap part pointless.

I believe the problem has something to do with the no_alloc context parameter. In the llama.cpp change that added the format, they set no_alloc to true for the main context and then reduce the context size a lot so that GGML doesn't allocate memory for the actual tensors.

However, we're still doing the old context size calculation. I tried making the Context::init function take a bool for no_alloc and set it, but just got a segfault immediately.
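For reference, here's a minimal sketch of what that flag looks like at the ggml level: ggml's ggml_init_params carries mem_size, mem_buffer, and no_alloc. The binding and wrapper names below are made up for illustration and are not the actual llama-rs API.

use std::os::raw::c_void;

// Assumed raw binding mirroring ggml's C struct and entry point.
#[allow(non_camel_case_types)]
#[repr(C)]
struct ggml_init_params {
    mem_size: usize,
    mem_buffer: *mut c_void,
    no_alloc: bool,
}

extern "C" {
    fn ggml_init(params: ggml_init_params) -> *mut c_void;
}

// Hypothetical wrapper: with no_alloc = true, ggml only creates tensor metadata
// in the (much smaller) context buffer, and tensor data must be pointed at
// externally managed memory such as an mmap.
fn init_context(mem_size: usize, no_alloc: bool) -> *mut c_void {
    unsafe {
        ggml_init(ggml_init_params {
            mem_size,
            mem_buffer: std::ptr::null_mut(),
            no_alloc,
        })
    }
}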

Anyway, in loader.rs:load_weights_ggjt just change:

tensor.set_data(ptr as *mut std::ffi::c_void);

to

ptr.copy_to_nonoverlapping(tensor.data() as *mut u8, tensor.nbytes());

With that, it runs just fine on the GGJT model. Loading speed seems normal compared to the current version.
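For context, a rough sketch of that change (assuming a memmap2 Mmap and a minimal stand-in for the tensor wrapper; the names here are illustrative, not the PR's exact code):

use memmap2::Mmap;

// Minimal stand-in for the ggml tensor wrapper (assumed shape).
struct Tensor {
    data: *mut u8,   // buffer ggml already allocated for this tensor
    nbytes: usize,   // total size of the tensor data in bytes
}

// Instead of repointing the tensor at the mapping with set_data, copy the bytes
// out of the mmap into the buffer ggml allocated.
unsafe fn copy_tensor_from_mmap(mmap: &Mmap, offset: usize, tensor: &Tensor) {
    let src = mmap.as_ptr().add(offset);
    src.copy_to_nonoverlapping(tensor.data, tensor.nbytes);
}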

Obviously it's silly to mmap and just copy the data immediately. What I'd recommend is just ditching mmap right now and simply reading into the tensor data instead.

Then later on it will be possible to add mmap support as a separate thing which could work in a general way for the other formats too.

By the way, you probably need to run clippy on your changes. It's very unhappy right now!

@philpax
Collaborator

philpax commented Apr 7, 2023

Yeah, I'd be happy with not supporting mmap right now. We can figure out what that's meant to look like once we have support for all the model types working.

@KerfuffleV2
Contributor

KerfuffleV2 commented Apr 7, 2023

The approach I'd go for if I were writing it would be to have a general type that can describe each tensor's type, location, dimensions, etc. Something similar to this: https://github.com/KerfuffleV2/smolrsrwkv/blob/182cd3205b7a7c95571a09bcfbb954b0041e4f90/smolrwkv/src/loader.rs#L15

Then the specific file format code can just scan through the file for metadata and build a structure with a list of those things. Then there can be generic loader code that just loads tensors based on that: it could use file reads, mmap, whatever.

edit: It could also convert from something like SafeTensors, PyTorch, etc. If some data actually needs to be converted, that could be described in the same structure, and the conversion process wouldn't have to care about low-level details like GGJT vs GGML, just "I have a tensor of type X, but I need Y".

I think that approach would make dealing with different file formats much easier.
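A rough sketch of the kind of format-agnostic descriptor being suggested (an editor's illustration; these names are not from this PR or the linked smolrsrwkv code):

use std::io::{Read, Seek};

/// One tensor's metadata, independent of the on-disk format.
struct TensorDescriptor {
    name: String,
    /// Element type, e.g. F32, F16, Q4_0.
    dtype: ElementType,
    /// Dimensions, innermost dimension first (ggml convention).
    shape: Vec<usize>,
    /// Byte offset of the tensor data within the file.
    offset: u64,
    /// Size of the tensor data in bytes.
    n_bytes: usize,
}

enum ElementType {
    F32,
    F16,
    Q4_0,
    Q4_1,
}

/// Each file format only has to produce the descriptor list; generic loader
/// code (buffered reads, mmap, or conversion from other formats) consumes it.
trait ModelFormat {
    fn scan<R: Read + Seek>(reader: &mut R) -> std::io::Result<Vec<TensorDescriptor>>;
}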

@philpax
Collaborator

philpax commented Apr 7, 2023

I'd be OK with that. It'd also help with #117 / #84.

@iacore iacore marked this pull request as ready for review April 7, 2023 19:25
@iacore
Contributor Author

iacore commented Apr 7, 2023

Got Vicuna working here. The loading speed is unfortunate.

I can use io_uring to make this faster, but that's more code.

Maybe copying from mmap-ed memory is faster?

ship it 🚀

@KerfuffleV2
Contributor

KerfuffleV2 commented Apr 7, 2023

@iacore I was experimenting with trying to clean it up also. It does seem like reading is way slower than the copying from mmap approach. I don't know why.

My change looks like this:

        // Tensor data in GGJT files is 32-byte aligned: round the current offset
        // up to the next multiple of 32 and skip past the padding.
        let offset_curr = reader.stream_position()?;
        let offset_aligned: u64 = (offset_curr + 31) & !31;
        reader.seek_relative((offset_aligned - offset_curr) as i64)?;

        // Read the tensor's bytes directly into the buffer ggml allocated for it.
        let td =
            unsafe { std::slice::from_raw_parts_mut(tensor.data() as *mut u8, tensor.nbytes()) };
        reader.read_exact(td)?;
        total_loaded_bytes += tensor.nbytes();

So, uhh... I guess maybe just keep mmap?

edit:

I can use io_uring to make this faster, but that's more code.

Also, we probably don't want OS-specific optimizations.


I also experimented with making the BufReader capacity really big (up to 2GB) and it didn't seem to help the reading speed.

@iacore
Contributor Author

iacore commented Apr 7, 2023

For me, sequential read is faster than mmap.

I've made a branch, ggjt-variant-copy-mmap, for that solution.

@KerfuffleV2
Contributor

Are you saying you tried the copy_to_nonoverlapping version and it was slower than the current version that changed to BufReader?

@iacore
Contributor Author

iacore commented Apr 7, 2023

Are you saying you tried the copy_to_nonoverlapping version and it was slower than the current version that changed to BufReader?

Yes.

@KerfuffleV2
Contributor

Ahh, why can't life be simple for once?

What OS are you using, out of curiosity?

@iacore
Contributor Author

iacore commented Apr 7, 2023

Linux.

I think this branch is done. I'll not touch it again.

@KerfuffleV2
Contributor

I'm on Linux as well.

Sorry for the confusion. I've been trying both versions and I don't get consistent results. Not sure what's going on, but I don't think there's a problem with the current approach.

I'll not touch it again.

How can you say that when Clippy is still sad?

You can probably just nuke the set_data method; it doesn't seem like there's even a way to successfully use it at the moment.

@iacore
Contributor Author

iacore commented Apr 7, 2023

You can probably just nuke the set_data method; it doesn't seem like there's even a way to successfully use it at the moment.

it probably will be useful in the future?

@KerfuffleV2
Contributor

it probably will be useful in the future?

We'd have to figure out how to actually use it in the future, though. I don't think it currently can work at all, since you aren't even able to turn on no_alloc when creating a context, so memory for tensors will always get allocated no matter what.

You'd think it would still be possible to set the tensor to point at a different chunk of memory, but that didn't actually work; otherwise your first approach would have had no issues.

So my line of thinking is: it's only a couple of lines of code to wrap a ggml function, so it wouldn't be hard to add back later if it was actually needed and could be used, but it's currently non-functional, so it may as well be removed.

Just to be clear, this is just the opinion of some other random person on the internet. So take that for what it's worth; I have no authority.

@jon-chuang
Contributor

I also got this working on the new ggjt file format.

@jon-chuang
Contributor

jon-chuang commented Apr 12, 2023

Btw, from https://justine.lol/mmap/

Remember that progress bar which made you wait for weights to load each time you ran the command? We got rid of that. Linux users should expect a 100x improvement in load time. Windows and MacOS users should expect a 10x improvement. What this means is that tokens will start being produced effectively instantaneously when you run LLaMA, almost providing a similar UX to ChatGPT on the shell. It's important to note these improvements are due to an amortized cost. The first time you load a model after rebooting your computer, it's still going to go slow, because it has to load the weights from disk. However each time it's loaded afterwards, it should be fast (at least until memory pressure causes your file cache to be evicted).

The speedup is meant to be for subsequent loads. Did you guys check that?

Or just the initial load?

Btw, you may need to configure some settings, like read_advise in the mmap2 library, to get better prefetching.

By default, mmap-ed reads are served lazily via page faults.
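If it helps, a sketch of that kind of hint using the memmap2 crate (assuming its advise API; exact method and enum names depend on the crate version):

use memmap2::{Advice, Mmap};
use std::fs::File;

fn map_model(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    let mmap = unsafe { Mmap::map(&file)? };
    // madvise(MADV_WILLNEED): ask the kernel to start reading the mapping ahead
    // of time instead of paying a page fault per untouched page.
    mmap.advise(Advice::WillNeed)?;
    Ok(mmap)
}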

@jon-chuang
Contributor

Another thing to check: multiple processes utilizing the same mmap-ed file:

More Processes
You can now run multiple LLaMA processes simultaneously on your computer. Here's a video of Georgi having a conversation with four chatbots powered by four independent llama.cpp processes running on the same Mac. So llama.cpp is not only going to be a better friend to you, it can also serve as your artificial circle of friends too. The trick that makes it possible is mmap() lets us map the read-only weights using MAP_SHARED, which is the same technique that's traditionally been used for loading executable software. So we figured, why aren't we using it to load neural network software too? Now we can.

@iacore
Contributor Author

iacore commented Apr 12, 2023

Succeeded by #125

@iacore iacore closed this Apr 12, 2023