In [1]:
!pip install super-json-mode vllm==0.3.0

Collecting vllm==0.3.0
  Downloading vllm-0.3.0-cp310-cp310-manylinux1_x86_64.whl.metadata (7.4 kB)
Collecting ninja (from vllm==0.3.0)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Collecting ray>=2.9 (from vllm==0.3.0)
  Downloading ray-2.9.2-cp310-cp310-manylinux2014_x86_64.whl.metadata (13 kB)
Collecting sentencepiece (from vllm==0.3.0)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting torch==2.1.2 (from vllm==0.3.0)
  Downloading torch-2.1.2-cp310-cp310-manylinux1_x86_64.whl.metadata (25 kB)
Collecting xformers==0.0.23.post1 (from vllm==0.3.0)
  Downloading xformers-0.0.23.post1-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.0 kB)
Collecting fastapi (from vllm==0.3.0)
  Downloading fastapi-0.109.2-py3-none-any.whl.metadata (2

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
from superjsonmode.integrations.vllm import StructuredVLLMModel
from pydantic import BaseModel

In [4]:
my_vllm = StructuredVLLMModel("mistralai/Mistral-7B-Instruct-v0.1")

INFO 02-06 05:40:56 llm_engine.py:72] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.1', tokenizer='mistralai/Mistral-7B-Instruct-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, seed=0)
INFO 02-06 05:40:59 weight_utils.py:164] Using model weights format ['*.safetensors']
INFO 02-06 05:41:08 llm_engine.py:322] # GPU blocks: 11219, # CPU blocks: 2048
INFO 02-06 05:41:09 model_runner.py:632] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 02-06 05:41:09 model_runner.py:636] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, con

In [5]:
# Target schema for the extraction; every field is typed as a plain string.
class QuarterlyReport(BaseModel):
    company: str
    stock_ticker: str
    date: str
    reported_revenue: str
    dividend: str

In [6]:
prompt_template = """[INST]{prompt}

Based on this excerpt, extract the correct value for the provided key. Keep it succinct. It should be a {type}.[/INST]

{key}: """
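
The template above has three placeholders: {prompt}, {key} and {type}. The progress bar further down shows 5 prompts for the 5 schema fields, which suggests one prompt is built per key. The following is a minimal sketch of that fan-out, not the library's actual implementation: build_field_prompts and the fields dict are hypothetical names introduced here for illustration, and the passage is shortened.

```python
# Hypothetical sketch: fill the extraction template once per schema field.
# This illustrates the idea only; it is not super-json-mode's internal code.
PROMPT_TEMPLATE = """[INST]{prompt}

Based on this excerpt, extract the correct value for the provided key. Keep it succinct. It should be a {type}.[/INST]

{key}: """


def build_field_prompts(passage, fields, template=PROMPT_TEMPLATE):
    """Return {field_name: filled_prompt} for each (name, type_name) pair."""
    return {
        key: template.format(prompt=passage, key=key, type=type_name)
        for key, type_name in fields.items()
    }


# Field names mirror QuarterlyReport; the type names are written out by hand.
fields = {
    "company": "str",
    "stock_ticker": "str",
    "date": "str",
    "reported_revenue": "str",
    "dividend": "str",
}

passage = "NVIDIA (NASDAQ: NVDA) today reported revenue of $18.12 billion."
field_prompts = build_field_prompts(passage, fields)
print(field_prompts["stock_ticker"])
```

Each filled prompt ends with the key name and a colon, so the model only has to complete a short value. Because the prompts are independent of each other, they can be submitted together as one batch.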

In [7]:
prompt = """NVIDIA Announces Financial Results for Third Quarter Fiscal 2024
November 21, 2023
Record revenue of $18.12 billion, up 34% from Q2, up 206% from year ago
Record Data Center revenue of $14.51 billion, up 41% from Q2, up 279% from year ago
NVIDIA (NASDAQ: NVDA) today reported revenue for the third quarter ended October 29, 2023, of $18.12 billion, up 206% from a year ago and up 34% from the previous quarter.

GAAP earnings per diluted share for the quarter were $3.71, up more than 12x from a year ago and up 50% from the previous quarter. Non-GAAP earnings per diluted share were $4.02, up nearly 6x from a year ago and up 49% from the previous quarter.

“Our strong growth reflects the broad industry platform transition from general-purpose to accelerated computing and generative AI,” said Jensen Huang, founder and CEO of NVIDIA.

“Large language model startups, consumer internet companies and global cloud service providers were the first movers, and the next waves are starting to build. Nations and regional CSPs are investing in AI clouds to serve local demand, enterprise software companies are adding AI copilots and assistants to their platforms, and enterprises are creating custom AI to automate the world’s largest industries.

“NVIDIA GPUs, CPUs, networking, AI foundry services and NVIDIA AI Enterprise software are all growth engines in full throttle. The era of generative AI is taking off,” he said.

NVIDIA will pay its next quarterly cash dividend of $0.04 per share on December 28, 2023, to all shareholders of record on December 6, 2023."""

In [22]:
import time
start = time.time()
output = my_vllm.generate(
    prompt,
    extraction_prompt_template=prompt_template,
    schema=QuarterlyReport,
    batch_size=6,
    temperature=0,
)
print(f"Total time: {time.time() - start}")

Processed prompts: 100%|██████████| 5/5 [00:00<00:00,  7.47it/s]

Total time: 0.6783156394958496

In [23]:
output

{'company': 'NVIDIA',
 'stock_ticker': '```\nNVDA\n```',
 'date': '2023-11-21',
 'reported_revenue': '18.12 billion',
 'dividend': '0.04'}
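
Note that the stock_ticker value came back wrapped in a Markdown code fence. A small post-processing step can strip such wrappers before the values are used. The helper below is a sketch under that assumption; strip_code_fence is a hypothetical name, not part of super-json-mode.

```python
import re

# Matches a value that consists entirely of one fenced block,
# with an optional language tag after the opening backticks.
_FENCE = re.compile(r"^```[\w-]*\s*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_code_fence(value: str) -> str:
    """Return the text inside a surrounding code fence, or the value unchanged."""
    text = value.strip()
    match = _FENCE.match(text)
    return match.group(1).strip() if match else text


# The dictionary printed above, copied verbatim.
raw = {
    "company": "NVIDIA",
    "stock_ticker": "```\nNVDA\n```",
    "date": "2023-11-21",
    "reported_revenue": "18.12 billion",
    "dividend": "0.04",
}

cleaned = {key: strip_code_fence(value) for key, value in raw.items()}
print(cleaned["stock_ticker"])
```

Values without a fence pass through untouched, so the helper can be applied to every field.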

In [12]:
import time
from vllm import SamplingParams

start = time.time()

default_prompt = f"""{prompt}
---
Based on the passage above, generate a JSON blob with the following keys: "company", "stock_ticker", "date", "reported_revenue", and "dividend".
"""
sampling_params = SamplingParams(max_tokens=1024)

output = my_vllm.llm.generate(default_prompt,  sampling_params=sampling_params)

print(f"Total time: {time.time() - start}")

Processed prompts: 100%|██████████| 1/1 [00:04<00:00,  4.69s/it]

Total time: 4.6964170932769775
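
For a rough comparison, the two wall-clock timings printed in this notebook can be divided. This is only a sketch of the arithmetic: the two runs are not strictly like for like, since the structured call used temperature=0 on five short prompts while this baseline used default sampling settings with max_tokens set to 1024.

```python
# Timings copied from the "Total time" lines printed in this notebook.
structured_seconds = 0.6783156394958496  # per-key extraction, 5 prompts
baseline_seconds = 4.6964170932769775    # single free-form JSON prompt

speedup = baseline_seconds / structured_seconds
print(f"Speedup: {speedup:.1f}x")
```

On this single run the per-key approach is roughly 6.9x faster; a fair benchmark would repeat both runs several times with matched sampling settings.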

In [19]:
print(output[0].outputs[0].text)


{
"company": "NVIDIA",
"stock_ticker": "NVDA",
"date": "October 29, 2023",
"reported_revenue": "$18.12 billion",
"dividend": "$0.04 per share"
}

This JSON blob contains information about NVIDIA's stock ticker (NVDA), the company that reported the revenue, the date on which the report was issued, and the revenue figure and dividend amount reported by the company. This information could be used to analyze NVIDIA's financial performance and compare it to other companies in the industry. It could also be used to make investment decisions based on the company's potential future performance.
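
The baseline response mixes a JSON object with a trailing explanation, so it cannot be passed straight to json.loads. One way to recover the object is to scan for an opening brace and let json.JSONDecoder.raw_decode parse from there, ignoring whatever follows. This is a minimal sketch; extract_first_json_object is a hypothetical helper and the sample text is an abbreviated copy of the output above.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Parse the first JSON object embedded in free-form text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start`
            # and ignores any text after the value ends.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in text")


sample = """
{
"company": "NVIDIA",
"stock_ticker": "NVDA",
"date": "October 29, 2023",
"reported_revenue": "$18.12 billion",
"dividend": "$0.04 per share"
}

This JSON blob contains information about NVIDIA's stock ticker (NVDA).
"""

report = extract_first_json_object(sample)
print(report["date"])
```

Even after parsing, the values differ from the per-key run: the baseline reports the quarter-end date (October 29, 2023) rather than the announcement date, and it keeps currency symbols and units in the revenue and dividend fields.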
