## Testing Images from HF

This notebook builds a simple interface, to be improved incrementally, for quickly testing various model images from Hugging Face (HF). Options we can add include:
- loading in floating point (FP), i.e. without quantisation
- loading in 8-bit or 4-bit with bitsandbytes
- loading with other quantisation methods
- different types of memory, if possible
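
As a rough sketch, the loading options listed above could be collected into keyword arguments for `from_pretrained`. The helper name `build_load_kwargs` is hypothetical. `load_in_8bit` is the flag used later in this notebook; `load_in_4bit` is assumed to be accepted in the same way by the installed `transformers` and `bitsandbytes` versions (newer releases may prefer a `BitsAndBytesConfig` object instead).

```python
def build_load_kwargs(quant: str = "full", device_map: str = "auto") -> dict:
    """Translate notebook-level options into keyword arguments for from_pretrained.

    Hypothetical helper: the option names mirror the ones used in this notebook.
    """
    kwargs = {"device_map": device_map}
    if quant == "8bit":
        kwargs["load_in_8bit"] = True  # bitsandbytes 8-bit weights
    elif quant == "4bit":
        kwargs["load_in_4bit"] = True  # assumption: supported by the installed versions
    elif quant != "full":
        raise ValueError(f"unknown quantisation option: {quant!r}")
    return kwargs


print(build_load_kwargs("8bit"))  # {'device_map': 'auto', 'load_in_8bit': True}
```

The resulting dictionary would be unpacked into the loader, for example `AutoModelForCausalLM.from_pretrained(model_name, **build_load_kwargs("8bit"))`.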

The models are used for inference and served through a Gradio app.

Let's start with a new model for coding assistance.

In [1]:
!pip install python-dotenv
!pip install openai



In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline
import os
import openai
import torch

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"

In [2]:
model_name = "mistralai/Mistral-7B-Instruct-v0.2"

In [3]:
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", 
                                             trust_remote_code=True, load_in_8bit=True
                                            )
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [13]:
from accelerate import infer_auto_device_map, init_empty_weights
device_map = infer_auto_device_map(model)
device_map

OrderedDict([('', 0)])
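
The device map above has a single entry: the empty module name `''` stands for the model root, and `0` is the index of the first GPU, so the whole model sits on one device. A small sketch of how such a map can be summarised; `summarise_device_map` is a hypothetical helper, and the split map is invented purely for illustration.

```python
from collections import Counter, OrderedDict


def summarise_device_map(device_map):
    """Count how many entries of a device map are assigned to each device."""
    return Counter(str(device) for device in device_map.values())


# The map printed above: everything on GPU 0
whole_model = OrderedDict([("", 0)])
print(summarise_device_map(whole_model))  # Counter({'0': 1})

# Made-up example of a model split between GPU 0 and the CPU
split_model = OrderedDict(
    [("model.embed_tokens", 0), ("model.layers.0", 0), ("lm_head", "cpu")]
)
print(summarise_device_map(split_model))
```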

In [7]:
generation_config = GenerationConfig.from_pretrained(model_name)
generation_config.max_new_tokens = 512
generation_config.temperature = 0.01
generation_config.top_p = 0.95
generation_config.do_sample = True
generation_config.repetition_penalty = 1.1
generation_config.pad_token_id = tokenizer.eos_token_id
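
With `do_sample = True` and `temperature = 0.01`, sampling is close to greedy decoding: dividing the logits by a very small temperature pushes almost all probability mass onto the top token. The sketch below illustrates this with made-up logits; it is plain Python and does not reproduce the exact `transformers` sampling code.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.1, 0.01):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])
```

At `t = 1.0` the three tokens keep comparable probabilities, while at `t = 0.01` the first token takes essentially all of the mass.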

In [9]:
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,
    generation_config=generation_config,
)

In [18]:
response = text_pipeline("What PC game would you recommend that is similar to The Witcher 3 - The Wild Hunt?")

In [19]:
print(response[0]["generated_text"])

What PC game would you recommend that is similar to The Witcher 3 - The Wild Hunt?
User 1: I'd say Skyrim, but with mods. It's not as good of a story, but the open world and freedom is very similar.

If you want something more like TW3, then maybe Dragon Age Inquisition or Mass Effect Andromeda. Both have great stories and are open world RPGs.
User 2: I've played both Skyrim and DA:I, and while they were fun, they didn't quite scratch the itch for me. I'll give ME:A another try though, since I only got about halfway through it before getting bored.
User 1: Yeah, I'm not a huge fan of the ME series either. But I've heard good things about the new one.

Another suggestion could be the Divinity Original Sin games. They're turn based RPGs, but they have a lot of depth and a great story.
User 2: I've heard good things about DOS2, but haven't tried it yet. I'll definitely check it out! Thanks for the suggestions!
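
The reply above reads like a forum thread, probably because the raw question was passed to an instruction-tuned model without its instruction markers, so the model simply continued the text. A minimal sketch of wrapping the question in the `[INST] ... [/INST]` markers that the Mistral template later in this notebook uses. `to_mistral_instruct` is a hypothetical helper, and the leading `<s>` is left out on the assumption that the tokenizer adds the beginning-of-sequence token itself.

```python
def to_mistral_instruct(question: str) -> str:
    """Wrap a plain question in [INST] ... [/INST] markers."""
    return f"[INST] {question.strip()} [/INST]"


prompt = to_mistral_instruct(
    "What PC game would you recommend that is similar to The Witcher 3 - The Wild Hunt?"
)
print(prompt)
```

The wrapped prompt would then be passed to the pipeline in place of the raw question, for example `text_pipeline(prompt)`.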


In [12]:
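limit = 9000  # stop once the next Fibonacci number reaches this value (avoids shadowing the built-in sum)

def fib(n, m):
    """Print consecutive Fibonacci pairs until their sum reaches `limit`."""
    if n + m >= limit:
        print(n, m, n + m)
    else:
        print(n, m)
        fib(m, n + m)

fib(0, 1)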

0 1
1 1
1 2
2 3
3 5
5 8
8 13
13 21
21 34
34 55
55 89
89 144
144 233
233 377
377 610
610 987
987 1597
1597 2584
2584 4181
4181 6765 10946


### Defining the structure in Gradio and LangChain

The idea is to parametrise the selection of model and other options.

In [1]:
!pip install gradio

Collecting gradio
  Downloading gradio-4.19.2-py3-none-any.whl.metadata (15 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting altair<6.0,>=4.2.0 (from gradio)
  Downloading altair-5.2.0-py3-none-any.whl.metadata (8.7 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.3.2.tar.gz (5.5 kB)
  Preparing metadata (setup.py) ... done
Collecting gradio-client==0.10.1 (from gradio)
  Downloading gradio_client-0.10.1-py3-none-any.whl.metadata (7.1 kB)
Collecting orjson~=3.0 (from gradio)
  Downloading orjson-3.9.15-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (49 kB)
     49.5/49.5 kB 898.1 kB/s eta 0:00:00
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Collecting python-multipart>=0.0.9 (from gradio)
  Downloading python_multipart-0.0.9-py3-none

In [11]:
# app includes and inference test with the function
import gradio as gr
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain import HuggingFacePipeline, ConversationChain, LLMChain
from langchain.memory import ConversationBufferMemory, ConversationSummaryBufferMemory
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline
import torch
import time

import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

def getConfigFromModel(model_name, tokenizer):
    """Load the model's default generation config and apply this notebook's sampling settings."""
    generation_config = GenerationConfig.from_pretrained(model_name)
    generation_config.max_new_tokens = 512
    generation_config.temperature = 0.1
    generation_config.top_p = 0.85
    generation_config.do_sample = True
    generation_config.repetition_penalty = 1.15
    # Reuse the end-of-sequence token for padding
    generation_config.pad_token_id = tokenizer.eos_token_id
    return generation_config




DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
# Template for mistralai/Mistral-7B-Instruct-v0.2
# template = """
# <s>[INST]Answer the question below as good as you can. Do not repeat yourself.
# Current conversation:
# {history}
# Question: {input}[/INST]
# """

# Template for MS Phi-2
TEMPLATE = """
Current conversation:
{history}
Instruct:
{input}.
Output:
"""



In [13]:

model_name="/home/jovyan/ext_models/phi-2"
model = AutoModelForCausalLM.from_pretrained(model_name, 
                                             trust_remote_code=True, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
config = getConfigFromModel(model_name, tokenizer)
text_pipeline = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        return_full_text=True,
        generation_config=config
    )
open_llm = HuggingFacePipeline(pipeline=text_pipeline)
prompt = PromptTemplate(template=TEMPLATE.replace('{history}', ""), input_variables=["input"])
chain = LLMChain(llm=open_llm, prompt=prompt, verbose=True)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [15]:
r = chain.invoke({"input": "What is the population of India and China?"})


> Entering new LLMChain chain...
Prompt after formatting:

Current conversation:

Instruct:
What is the population of India and China?.
Output:


> Finished chain.


In [17]:
r["text"]

'The current population of India is approximately 1.366 billion, while China has a population of around 1.439 billion people.\n'

In [10]:
chain.predict(input="nice to meet you my name is Snoop. I am a famous rapper")


> Entering new ConversationChain chain...
Prompt after formatting:
Human: 
<s>[INST]Answer the question below as good as you can. Do not repeat yourself.
Current conversation:
Human: Hi there who is this?
AI: Response: Hello! I'm an artificial intelligence designed to assist with various tasks and answer questions to the best of my ability. How may I help you today?
Question: nice to meet you my name is Snoop. I am a famous rapper[/INST]


> Finished chain.


"Hi Snoop, it's nice to meet you! I know that you are a renowned rapper. Is there something specific you would like assistance or information on related to your music career or any particular topic? I'll do my best to provide accurate and helpful responses."

### The App Definition.

We will use Blocks so that we can define different data flows. We need one data flow to set up the model and another to provide the chatbot.

Here we go:

In [12]:
def model_runner(llm_obj, model_name, quant, device, llm_type):
    """Load the model and tokenizer and return them, with their pipeline, as the new state."""
    # llm_obj and llm_type are supplied by the Gradio inputs but are not needed for loading
    model_args = dict()
    if quant == '8bit':
        model_args.update({"load_in_8bit": True})
    # Pin the model to the detected CUDA device, otherwise let accelerate place it
    if device == 'gpu' and "cuda" in DEVICE:
        model_args.update({"device_map": DEVICE})
    else:
        model_args.update({"device_map": 'auto'})
    model = AutoModelForCausalLM.from_pretrained(model_name, 
                                             trust_remote_code=True, **model_args)
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    # total_steps = list(model.parameters())
    
    config = getConfigFromModel(model_name, tokenizer)
    text_pipeline = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        return_full_text=True,
        generation_config=config
    )
    # Wrap the pipeline so that LangChain chains can call it
    open_llm = HuggingFacePipeline(pipeline=text_pipeline)
    
    # Everything the chat flow and the cleanup step need later on
    llm_obj = {"open_llm": open_llm, 
               "config": config, 
               "model": model, 
               "tokenizer": tokenizer}
    return llm_obj

In [13]:
def inference(message, history, llm_obj, llm_type, template=None):
    """Run one chat turn; return an empty string (clears the textbox) and the updated history."""
    # Render earlier turns plus the new, still unanswered message as tagged text
    history_transformer_format = history + [[message, ""]]
    history_str = "".join(["".join(["\n<question>"+item[0], "\n<answer>"+item[1]])  #curr_system_message +
                for item in history_transformer_format])
    # Escape literal braces so that PromptTemplate does not read them as variables
    history_str = history_str.replace("{", "{{").replace("}", "}}")
    result_template = template if template is not None and template != "" else TEMPLATE.strip()
    prompt = PromptTemplate(template=result_template.replace('{history}', history_str), input_variables=["input"])
    if (llm_type == 'gpt_3'):
        gpt3_llm = ChatOpenAI(temperature=0.2, model="gpt-3.5-turbo")
        chain = LLMChain(llm=gpt3_llm, prompt=prompt, verbose=True)
    else:
        # Note: 'gpt_4' is not handled separately yet and also falls through to the local model
        chain = LLMChain(llm=llm_obj["open_llm"], prompt=prompt, verbose=True)
    response = chain.invoke({"input": message})
    history.append((message, str(response["text"])))
    
    # response = chain.predict(input=text)
    return "", history
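
Because the history is spliced into the template text before the prompt is formatted, any literal `{` or `}` in earlier turns (for example code in an answer) would be read as a template variable. The sketch below shows the history rendering and brace escaping in isolation. `format_history` and `escape_braces` are hypothetical helpers, the template is a shortened copy of the Phi-2 one, and `str.format` stands in for the f-string style formatting that `PromptTemplate` is assumed to apply by default.

```python
def format_history(history, message):
    """Render earlier turns plus the new, unanswered message as tagged text."""
    turns = list(history) + [(message, "")]
    return "".join(f"\n<question>{q}\n<answer>{a}" for q, a in turns)


def escape_braces(text):
    """Double literal braces so a format-style template leaves them alone."""
    return text.replace("{", "{{").replace("}", "}}")


TEMPLATE = "Current conversation:\n{history}\nInstruct:\n{input}.\nOutput:"

# Made-up conversation whose answer contains braces
history = [("Show me a dict", "d = {'a': 1}")]
history_str = escape_braces(format_history(history, "Thanks"))

template = TEMPLATE.replace("{history}", history_str)
rendered = template.format(input="Thanks")  # stand-in for PromptTemplate formatting
print(rendered)
```

Without the `escape_braces` step, formatting the same template raises an error, because `{'a': 1}` is parsed as a replacement field.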
    
    

In [14]:
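def cleanup(llm_obj):
    """Drop references to the loaded model so that its GPU memory can be reclaimed."""
    import gc
    if llm_obj is not None:  # nothing to clear if no model was loaded
        for key in llm_obj.keys():
            llm_obj[key] = None
    gc.collect()  # collect the now unreferenced objects before clearing the CUDA cache
    torch.cuda.empty_cache()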
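
The step in `cleanup` that matters is clearing the dictionary entries: deleting a local name only removes one reference, and the object stays alive for as long as the state dictionary still points to it. A small sketch with a stand-in object shows the effect. `FakeModel` and `clear_state` are hypothetical names, and the immediate release relies on CPython reference counting.

```python
import gc
import weakref


class FakeModel:
    """Stand-in for a large model object (hypothetical)."""


def clear_state(llm_obj):
    """Set every entry to None so the referenced objects can be freed."""
    for key in list(llm_obj):
        llm_obj[key] = None
    gc.collect()  # also picks up objects that are kept alive by reference cycles


model = FakeModel()
ref = weakref.ref(model)  # lets us observe whether the object is still alive
state = {"model": model}

del model  # the dictionary still holds a reference
alive_after_del = ref() is not None
print(alive_after_del)

clear_state(state)
freed_after_clear = ref() is None
print(freed_after_clear)
```

On a GPU, `torch.cuda.empty_cache()` would follow the clearing step so that the cached memory is returned to the device.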
    

In [15]:
demo = gr.Blocks()
with demo:
    placeholder_template = TEMPLATE.strip()
    llm_obj = gr.State()
    model_name = gr.Textbox(label='Model')
    with gr.Row():
        quant = gr.Dropdown(['full', '8bit'], label='Quant')
        device = gr.Dropdown(['auto', 'gpu'], label='Device')
    b1 = gr.Button('Load Model')
    # progress = gr.Textbox()
    llm_type = gr.Radio(choices=['open_llm', 'gpt_3', 'gpt_4'], value='open_llm')
    message = gr.Textbox(label='Message')
    template = gr.TextArea(value=None, label='Custom Template', placeholder=placeholder_template)
    bot = gr.Chatbot(label='Response')
    b2 = gr.Button('Submit')
    b3 = gr.Button('Unload Model')
    b1.click(model_runner, inputs=[llm_obj, model_name, quant, device, llm_type], outputs=[llm_obj])
    b2.click(inference, inputs=[message, bot, llm_obj, llm_type, template], outputs=[message, bot])
    b3.click(cleanup, inputs=[llm_obj])


In [16]:
demo.launch(server_name="0.0.0.0", server_port=7860)

Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.


> Entering new LLMChain chain...
Prompt after formatting:
Current conversation:

<question>Can you suggest a simple python app structure. The app will be used to launch a gradio application for a chatbot
<answer>
Instruct:
Can you suggest a simple python app structure. The app will be used to launch a gradio application for a chatbot.
Output:

> Finished chain.


> Entering new LLMChain chain...
Prompt after formatting:
Current conversation:

<question>Can you suggest a simple python app structure. The app will be used to launch a gradio application for a chatbot
<answer>Sure! Here is a simple Python app structure for launching a Gradio application for a chatbot:

1. Create a Python file for your chatbot logic, such as chatbot.py.
2. Define your chatbot functionality in the chatbot.py file, including any necessary functions or classes.
3. Create a separate Python file for your Gradio application, such as app.py.
4. In the app.py f

In [18]:
demo.close()

Closing server running on port: 7860
