10 changes: 5 additions & 5 deletions vllm/model_executor/layers/quantization/torchao.py
@@ -144,9 +144,9 @@ def torchao_quantize_param_data(param: torch.Tensor,
     """Quantize a Tensor with torchao quantization specified by torchao_config

     Args:
-        `param`: weight parameter of the linear module
-        `torchao_config`: type of quantization and their arguments we want to
-            use to quantize the Tensor
+        param: weight parameter of the linear module
+        torchao_config: type of quantization and their arguments we want to
+            use to quantize the Tensor
     """
     from torchao.core.config import AOBaseConfig
     from torchao.quantization import quantize_
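For context on the helper this docstring documents: it quantizes a bare weight by wrapping it in a throwaway `nn.Linear` and applying the torchao config via `quantize_`. A minimal sketch of that pattern, assuming torchao's config-style API; `Int8WeightOnlyConfig` here is just an illustrative `AOBaseConfig` subclass, not something this PR prescribes:

```python
import torch
from torchao.quantization import Int8WeightOnlyConfig, quantize_

# Stand-in for a linear module's weight parameter.
param = torch.randn(1024, 1024)

# quantize_ operates on modules, so wrap the raw tensor in a dummy Linear,
# then let torchao swap its weight for a quantized tensor subclass in place.
dummy_linear = torch.nn.Linear(param.shape[1], param.shape[0], bias=False)
dummy_linear.weight = torch.nn.Parameter(param, requires_grad=False)
quantize_(dummy_linear, Int8WeightOnlyConfig())

quantized_weight = dummy_linear.weight  # quantized tensor subclass
print(type(quantized_weight))
```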
@@ -172,8 +172,8 @@ class TorchAOLinearMethod(LinearMethodBase):
     """Linear method for torchao.

     Args:
-        torchao_config: The torchao quantization config, a string
-            that encodes the type of quantization and all relevant arguments.
+        quant_config: The torchao quantization config, a string that encodes
+            the type of quantization and all relevant arguments.
     """

     def __init__(self, quant_config: TorchAOConfig):
@@ -423,7 +423,7 @@ def w8a8_block_int8_matmul(
         Bs: The per-block quantization scale for `B`.
         block_size: The block size for per-block quantization. It should be
             2-dim, e.g., [128, 128].
-        output_dytpe: The dtype of the returned tensor.
+        output_dtype: The dtype of the returned tensor.

     Returns:
         torch.Tensor: The result of matmul.
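For readers of the corrected docstring: the kernel multiplies int8 activations and weights with block-wise scales. Below is a rough pure-PyTorch reference of that math, under assumed layouts (`As: [M, K/block_k]` per-token-group scales for `A`, `Bs: [N/block_n, K/block_k]` per-block scales for `B`); the Triton kernel's exact contract lives in the source, this is only a readability sketch:

```python
import torch

def w8a8_block_int8_ref(A, As, B, Bs, block_size, output_dtype=torch.float16):
    """Reference for a block-wise int8 matmul: C = dequant(A) @ dequant(B).T."""
    block_n, block_k = block_size
    M, K = A.shape
    N = B.shape[0]
    C = torch.zeros(M, N, dtype=torch.float32, device=A.device)
    for kb in range(0, K, block_k):
        ki = kb // block_k
        # Dequantize one K-block of A with its per-token scale column.
        a = A[:, kb:kb + block_k].float() * As[:, ki:ki + 1]
        for nb in range(0, N, block_n):
            ni = nb // block_n
            # Dequantize the matching (N-block, K-block) tile of B.
            b = B[nb:nb + block_n, kb:kb + block_k].float() * Bs[ni, ki]
            C[:, nb:nb + block_n] += a @ b.t()
    return C.to(output_dtype)

# Example shapes with the docstring's [128, 128] block size:
M, N, K = 4, 256, 512
A = torch.randint(-128, 128, (M, K), dtype=torch.int8)
B = torch.randint(-128, 128, (N, K), dtype=torch.int8)
As = torch.rand(M, K // 128) * 0.01
Bs = torch.rand(N // 128, K // 128) * 0.01
out = w8a8_block_int8_ref(A, As, B, Bs, [128, 128])  # [M, N], float16
```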
4 changes: 2 additions & 2 deletions vllm/model_executor/layers/rotary_embedding/mrope.py
@@ -135,8 +135,8 @@ def triton_mrope(
     """Qwen2VL mrope kernel.

     Args:
-        query: [num_tokens, num_heads * head_size]
-        key: [num_tokens, num_kv_heads * head_size]
+        q: [num_tokens, num_heads * head_size]
+        k: [num_tokens, num_kv_heads * head_size]
         cos: [3, num_tokens, head_size //2 ]
             (T/H/W positions with multimodal inputs)
         sin: [3, num_tokens, head_size //2 ]
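To make the fixed shape annotations concrete, a shape-only sketch of the tensors this kernel expects, taken directly from the docstring (all values are random placeholders):

```python
import torch

num_tokens, num_heads, num_kv_heads, head_size = 16, 8, 2, 64

# Flattened query/key, as the renamed q/k annotations describe.
q = torch.randn(num_tokens, num_heads * head_size)
k = torch.randn(num_tokens, num_kv_heads * head_size)

# One cos/sin table per rotary section: temporal/height/width (T/H/W),
# each covering half the head dimension.
cos = torch.randn(3, num_tokens, head_size // 2)
sin = torch.randn(3, num_tokens, head_size // 2)
```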
89 changes: 45 additions & 44 deletions vllm/model_executor/model_loader/tensorizer.py
@@ -171,51 +171,52 @@ class TensorizerConfig(MutableMapping):
     _is_sharded: bool = field(init=False, default=False)
     _fields: ClassVar[tuple[str, ...]]
     _keys: ClassVar[frozenset[str]]
-    """
-    Args for the TensorizerConfig class. These are used to configure the
-    behavior of model serialization and deserialization using Tensorizer.
+    """Configuration class for Tensorizer settings.

-    Args:
-        tensorizer_uri: Path to serialized model tensors. Can be a local file
-            path or a S3 URI. This is a required field unless lora_dir is
-            provided and the config is meant to be used for the
-            `tensorize_lora_adapter` function. Unless a `tensorizer_dir` or
-            `lora_dir` is passed to this object's initializer, this is a required
-            argument.
-        tensorizer_dir: Path to a directory containing serialized model tensors,
-            and all other potential model artifacts to load the model, such as
-            configs and tokenizer files. Can be passed instead of `tensorizer_uri`
-            where the `model.tensors` file will be assumed to be in this
-            directory.
-        vllm_tensorized: If True, indicates that the serialized model is a
-            vLLM model. This is used to determine the behavior of the
-            TensorDeserializer when loading tensors from a serialized model.
-            It is far faster to deserialize a vLLM model as it utilizes
-            tensorizer's optimized GPU loading. Note that this is now
-            deprecated, as serialized vLLM models are now automatically
-            inferred as vLLM models.
-        verify_hash: If True, the hashes of each tensor will be verified against
-            the hashes stored in the metadata. A `HashMismatchError` will be
-            raised if any of the hashes do not match.
-        num_readers: Controls how many threads are allowed to read concurrently
-            from the source file. Default is `None`, which will dynamically set
-            the number of readers based on the number of available
-            resources and model size. This greatly increases performance.
-        encryption_keyfile: File path to a binary file containing a
-            binary key to use for decryption. `None` (the default) means
-            no decryption. See the example script in
-            examples/others/tensorize_vllm_model.py.
-        s3_access_key_id: The access key for the S3 bucket. Can also be set via
-            the S3_ACCESS_KEY_ID environment variable.
-        s3_secret_access_key: The secret access key for the S3 bucket. Can also
-            be set via the S3_SECRET_ACCESS_KEY environment variable.
-        s3_endpoint: The endpoint for the S3 bucket. Can also be set via the
-            S3_ENDPOINT_URL environment variable.
-        lora_dir: Path to a directory containing LoRA adapter artifacts for
-            serialization or deserialization. When serializing LoRA adapters
-            this is the only necessary parameter to pass to this object's
-            initializer.
-    """
+    These settings configure the behavior of model serialization and
+    deserialization using Tensorizer.
+
+    Attributes:
+        tensorizer_uri: Path to serialized model tensors. Can be a local file
+            path or a S3 URI. This is a required field unless lora_dir is
+            provided and the config is meant to be used for the
+            `tensorize_lora_adapter` function. Unless a `tensorizer_dir` or
+            `lora_dir` is passed to this object's initializer, this is
+            a required argument.
+        tensorizer_dir: Path to a directory containing serialized model tensors,
+            and all other potential model artifacts to load the model, such as
+            configs and tokenizer files. Can be passed instead of
+            `tensorizer_uri` where the `model.tensors` file will be assumed
+            to be in this directory.
+        vllm_tensorized: If True, indicates that the serialized model is a
+            vLLM model. This is used to determine the behavior of the
+            TensorDeserializer when loading tensors from a serialized model.
+            It is far faster to deserialize a vLLM model as it utilizes
+            tensorizer's optimized GPU loading. Note that this is now
+            deprecated, as serialized vLLM models are now automatically
+            inferred as vLLM models.
+        verify_hash: If True, the hashes of each tensor will be verified
+            against the hashes stored in the metadata. A `HashMismatchError`
+            will be raised if any of the hashes do not match.
+        num_readers: Controls how many threads are allowed to read concurrently
+            from the source file. Default is `None`, which will dynamically set
+            the number of readers based on the number of available
+            resources and model size. This greatly increases performance.
+        encryption_keyfile: File path to a binary file containing a
+            binary key to use for decryption. `None` (the default) means
+            no decryption. See the example script in
+            examples/others/tensorize_vllm_model.py.
+        s3_access_key_id: The access key for the S3 bucket. Can also be set via
+            the S3_ACCESS_KEY_ID environment variable.
+        s3_secret_access_key: The secret access key for the S3 bucket. Can also
+            be set via the S3_SECRET_ACCESS_KEY environment variable.
+        s3_endpoint: The endpoint for the S3 bucket. Can also be set via the
+            S3_ENDPOINT_URL environment variable.
+        lora_dir: Path to a directory containing LoRA adapter artifacts for
+            serialization or deserialization. When serializing LoRA adapters
+            this is the only necessary parameter to pass to this object's
+            initializer.
+    """

     def __post_init__(self):
         # check if the configuration is for a sharded vLLM model
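As a usage illustration for the reworded docstring, constructing the config from the documented attributes might look like the sketch below; the URI and endpoint are placeholders, and credentials are left to the documented environment variables:

```python
from vllm.model_executor.model_loader.tensorizer import TensorizerConfig

# Deserialize from S3, with hash verification and a dynamically chosen
# reader count. Credentials could equally come from the S3_ACCESS_KEY_ID /
# S3_SECRET_ACCESS_KEY / S3_ENDPOINT_URL environment variables.
config = TensorizerConfig(
    tensorizer_uri="s3://my-bucket/vllm/model.tensors",  # placeholder URI
    verify_hash=True,
    num_readers=None,  # None: pick reader count from resources/model size
    s3_endpoint="http://localhost:9000",  # placeholder endpoint
)
```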
16 changes: 4 additions & 12 deletions vllm/model_executor/models/aria.py
@@ -143,16 +143,8 @@ class AriaProjector(nn.Module):
     projects ViT's outputs into MoE's inputs.

     Args:
-        patch_to_query_dict (dict): Maps patch numbers to their corresponding
-            query numbers,
-            e.g., {1225: 128, 4900: 256}. This allows for different query sizes
-            based on image resolution.
-        embed_dim (int): Embedding dimension.
-        num_heads (int): Number of attention heads.
-        kv_dim (int): Dimension of key and value.
-        ff_dim (int): Hidden dimension of the feed-forward network.
-        output_dim (int): Output dimension.
-        norm_layer (nn.Module): Normalization layer. Default is nn.LayerNorm.
+        config: [AriaConfig](https://huggingface.co/docs/transformers/main/model_doc/aria#transformers.AriaConfig)
+            containing projector configuration parameters.

     Outputs:
         A tensor with the shape of (batch_size, query_number, output_dim)
@@ -282,8 +274,8 @@ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
         Forward pass of the MoE Layer.

         Args:
-            hidden_states (torch.Tensor): Input tensor of shape (batch_size,
-                sequence_length, hidden_size).
+            hidden_states: Input tensor of shape
+                (batch_size, sequence_length, hidden_size).

         Returns:
             torch.Tensor: Output tensor after passing through the MoE layer.
104 changes: 40 additions & 64 deletions vllm/model_executor/models/bart.py
@@ -401,8 +401,7 @@ def __init__(
     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
         r"""
         Args:
-            hidden_states
-                torch.Tensor of *encoder* input embeddings.
+            hidden_states: torch.Tensor of *encoder* input embeddings.
         Returns:
             Encoder layer output torch.Tensor
         """
@@ -490,10 +489,8 @@ def forward(
     ) -> torch.Tensor:
         r"""
         Args:
-            decoder_hidden_states
-                torch.Tensor of *decoder* input embeddings.
-            encoder_hidden_states
-                torch.Tensor of *encoder* input embeddings.
+            decoder_hidden_states: torch.Tensor of *decoder* input embeddings.
+            encoder_hidden_states: torch.Tensor of *encoder* input embeddings.
         Returns:
             Decoder layer output torch.Tensor
         """
@@ -584,12 +581,10 @@ def forward(
     ) -> torch.Tensor:
         r"""
         Args:
-            input_ids
-                Indices of *encoder* input sequence tokens in the vocabulary.
-                Padding will be ignored by default should you
-                provide it.
-            positions
-                Positions of *encoder* input sequence tokens.
+            input_ids: Indices of *encoder* input sequence tokens in the
+                vocabulary.
+                Padding will be ignored by default should you provide it.
+            positions: Positions of *encoder* input sequence tokens.
         Returns:
             Decoder output torch.Tensor
         """
@@ -663,14 +658,11 @@ def forward(
     ) -> torch.Tensor:
         r"""
         Args:
-            decoder_input_ids
-                Indices of *decoder* input sequence tokens in the vocabulary.
-                Padding will be ignored by default should you
-                provide it.
-            decoder_positions
-                Positions of *decoder* input sequence tokens.
-            encoder_hidden_states:
-                Tensor of encoder output embeddings
+            decoder_input_ids: Indices of *decoder* input sequence tokens
+                in the vocabulary.
+                Padding will be ignored by default should you provide it.
+            decoder_positions: Positions of *decoder* input sequence tokens.
+            encoder_hidden_states: Tensor of encoder output embeddings.
         Returns:
             Decoder output torch.Tensor
         """
@@ -732,16 +724,13 @@ def forward(self, input_ids: torch.Tensor, positions: torch.Tensor,
                 encoder_positions: torch.Tensor) -> torch.Tensor:
         r"""
         Args:
-            input_ids
-                Indices of *decoder* input sequence tokens in the vocabulary.
-                Padding will be ignored by default should you
-                provide it.
-            positions
-                Positions of *decoder* input sequence tokens.
-            encoder_input_ids
-                Indices of *encoder* input sequence tokens in the vocabulary.
-            encoder_positions:
-                Positions of *encoder* input sequence tokens.
+            input_ids: Indices of *decoder* input sequence tokens
+                in the vocabulary.
+                Padding will be ignored by default should you provide it.
+            positions: Positions of *decoder* input sequence tokens.
+            encoder_input_ids: Indices of *encoder* input sequence tokens
+                in the vocabulary.
+            encoder_positions: Positions of *encoder* input sequence tokens.
         Returns:
             Model output torch.Tensor
         """
@@ -848,14 +837,10 @@ def forward(
     ) -> torch.Tensor:
         r"""
         Args:
-            input_ids
-                torch.Tensor of *decoder* input token ids.
-            positions
-                torch.Tensor of *decoder* position indices.
-            encoder_input_ids
-                torch.Tensor of *encoder* input token ids.
-            encoder_positions
-                torch.Tensor of *encoder* position indices
+            input_ids: torch.Tensor of *decoder* input token ids.
+            positions: torch.Tensor of *decoder* position indices.
+            encoder_input_ids: torch.Tensor of *encoder* input token ids.
+            encoder_positions: torch.Tensor of *encoder* position indices.
         Returns:
             Output torch.Tensor
         """
@@ -912,8 +897,7 @@ class MBartEncoderLayer(BartEncoderLayer):
     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
         r"""
         Args:
-            hidden_states
-                torch.Tensor of *encoder* input embeddings.
+            hidden_states: torch.Tensor of *encoder* input embeddings.
         Returns:
             Encoder layer output torch.Tensor
         """
@@ -1035,12 +1019,10 @@ def forward(
     ) -> torch.Tensor:
         r"""
         Args:
-            input_ids
-                Indices of *encoder* input sequence tokens in the vocabulary.
-                Padding will be ignored by default should you
-                provide it.
-            positions
-                Positions of *encoder* input sequence tokens.
+            input_ids: Indices of *encoder* input sequence tokens in the
+                vocabulary.
+                Padding will be ignored by default should you provide it.
+            positions: Positions of *encoder* input sequence tokens.
         Returns:
             Decoder output torch.Tensor
         """
@@ -1116,14 +1098,11 @@ def forward(
     ) -> torch.Tensor:
         r"""
         Args:
-            decoder_input_ids
-                Indices of *decoder* input sequence tokens in the vocabulary.
-                Padding will be ignored by default should you
-                provide it.
-            decoder_positions
-                Positions of *decoder* input sequence tokens.
-            encoder_hidden_states:
-                Tensor of encoder output embeddings
+            decoder_input_ids: Indices of *decoder* input sequence tokens
+                in the vocabulary.
+                Padding will be ignored by default should you provide it.
+            decoder_positions: Positions of *decoder* input sequence tokens.
+            encoder_hidden_states: Tensor of encoder output embeddings.
         Returns:
             Decoder output torch.Tensor
         """
@@ -1185,16 +1164,13 @@ def forward(self, input_ids: torch.Tensor, positions: torch.Tensor,
                 encoder_positions: torch.Tensor) -> torch.Tensor:
         r"""
         Args:
-            input_ids
-                Indices of *decoder* input sequence tokens in the vocabulary.
-                Padding will be ignored by default should you
-                provide it.
-            positions
-                Positions of *decoder* input sequence tokens.
-            encoder_input_ids
-                Indices of *encoder* input sequence tokens in the vocabulary.
-            encoder_positions:
-                Positions of *encoder* input sequence tokens.
+            input_ids: Indices of *decoder* input sequence tokens
+                in the vocabulary.
+                Padding will be ignored by default should you provide it.
+            positions: Positions of *decoder* input sequence tokens.
+            encoder_input_ids: Indices of *encoder* input sequence tokens
+                in the vocabulary.
+            encoder_positions: Positions of *encoder* input sequence tokens.
         Returns:
             Model output torch.Tensor
         """
1 change: 0 additions & 1 deletion vllm/model_executor/models/blip2.py
@@ -678,7 +678,6 @@ def forward(
         Args:
             input_ids: Flattened (concatenated) input_ids corresponding to a
                 batch.
-            pixel_values: The pixels in each input image.

         Info:
             [Blip2ImageInputs][]
18 changes: 6 additions & 12 deletions vllm/model_executor/models/donut.py
@@ -79,10 +79,8 @@ def forward(
     ) -> torch.Tensor:
         r"""
         Args:
-            input_ids
-                torch.Tensor of *decoder* input token ids.
-            positions
-                torch.Tensor of *decoder* position indices.
+            input_ids: torch.Tensor of *decoder* input token ids.
+            positions: torch.Tensor of *decoder* position indices.
         Returns:
             Output torch.Tensor
         """
@@ -351,14 +349,10 @@ def forward(
     ) -> torch.Tensor:
         r"""
         Args:
-            input_ids
-                torch.Tensor of *decoder* input token ids.
-            positions
-                torch.Tensor of *decoder* position indices.
-            encoder_input_ids
-                torch.Tensor of *encoder* input token ids.
-            encoder_positions
-                torch.Tensor of *encoder* position indices
+            input_ids: torch.Tensor of *decoder* input token ids.
+            positions: torch.Tensor of *decoder* position indices.
+            encoder_input_ids: torch.Tensor of *encoder* input token ids.
+            encoder_positions: torch.Tensor of *encoder* position indices
         Returns:
             Output torch.Tensor
         """