#### Tuesday, December 5, 2023

[LangChain - Using Hugging Face Models locally (code walkthrough)](https://www.youtube.com/watch?v=Kn7SX2Mx_Jk)

https://colab.research.google.com/drive/1h2505J5H4Y9vngzPD08ppf1ga8sWxLvZ?usp=sharing#scrollTo=VkVTT54xNq8T

I was able to step through most of this notebook.


In [None]:
# !pip -q install langchain huggingface_hub transformers sentence_transformers

## HuggingFace

There are two Hugging Face LLM wrappers: one for a local pipeline and one for a model hosted on the Hugging Face Hub. Note that these wrappers only work for models that support one of the following tasks: `text2text-generation`, `text-generation`.
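
As a rough illustration of the task restriction above, the check can be sketched in plain Python. The helper name `wrapper_supports` is hypothetical, and the tuple simply mirrors the two tasks named in the note. It is not LangChain's actual implementation.

```python
# Tasks that the note above says the LangChain Hugging Face wrappers accept.
SUPPORTED_TASKS = ("text2text-generation", "text-generation")


def wrapper_supports(task: str) -> bool:
    """Return True if a model with this pipeline task can be wrapped (hypothetical helper)."""
    return task in SUPPORTED_TASKS


print(wrapper_supports("text2text-generation"))
print(wrapper_supports("conversational"))
```

A check like this is a cheap way to see, before building a chain, whether a model is worth trying with either wrapper.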


In [1]:
from getpass import getpass

# enter your api key
HUGGINGFACEHUB_API_TOKEN = getpass("Enter your HuggingFace API Token")

In [2]:
import os

os.environ['HUGGINGFACEHUB_API_TOKEN'] = HUGGINGFACEHUB_API_TOKEN
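
The two cells above always prompt for the token. A minimal sketch of an alternative, assuming you would rather reuse a token that is already set in the environment, is shown here. The function name `get_hf_token` and its `env` and `ask` parameters are hypothetical and exist only to make the logic easy to exercise.

```python
import os
from getpass import getpass


def get_hf_token(env=os.environ, ask=getpass):
    """Return the token from the environment, prompting only if it is missing."""
    token = env.get("HUGGINGFACEHUB_API_TOKEN")
    if not token:
        # Nothing set yet: fall back to an interactive prompt.
        token = ask("Enter your HuggingFace API Token: ")
    return token
```

Calling `get_hf_token()` with no arguments reads `os.environ` and falls back to `getpass`, so a re-run of the notebook does not ask again once the variable is set.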

## Use the HuggingFaceHub

In [3]:
from langchain import PromptTemplate, HuggingFaceHub, LLMChain

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
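
To see the text the model actually receives, the template can be filled in with plain `str.format`. This is only a sketch of the substitution step under the assumption that the template uses the default f-string style placeholders. `fill_template` is a hypothetical helper, not part of LangChain.

```python
template = """Question: {question}

Answer: Let's think step by step."""


def fill_template(question: str) -> str:
    """Substitute the question into the {question} placeholder of the template."""
    return template.format(question=question)


print(fill_template("What is the capital of France?"))
```

The "Let's think step by step." suffix is part of every prompt, which explains the reasoning-style answers further down.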

In [4]:
llm_chain = LLMChain(prompt=prompt,
                     llm=HuggingFaceHub(repo_id="google/flan-t5-xl",
                                        model_kwargs={"temperature":0,
                                                      "max_length":64}))



In [5]:
question = "What is the capital of France?"

print(llm_chain.run(question))

ValueError: Error raised by inference API: Model google/flan-t5-xl time out
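
Hosted inference can time out, as the error above shows. One possible workaround is to retry with a growing delay. The sketch below assumes the failure surfaces as a `ValueError`, as it did here; `run_with_retry` and its parameters are hypothetical names.

```python
import time


def run_with_retry(call, retries=3, delay_seconds=5.0, sleep=time.sleep):
    """Call `call()` and retry on ValueError, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return call()
        except ValueError:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            sleep(delay_seconds * (2 ** attempt))
```

In this notebook the call would be something like `run_with_retry(lambda: llm_chain.run(question))`. Retrying does not help if the model is simply too large for the free inference tier, so a smaller hosted model or a local pipeline may still be needed.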

In [None]:
# I am not going to run this because I bet it too will time out ...
question = "What area is best for growing wine in France?"

print(llm_chain.run(question))

The best area for growing wine in France is the Loire Valley. The Loire Valley is located in the south of France. The area of France that is best for growing wine is the Loire Valley. The final answer: Loire Valley.


## BlenderBot

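BlenderBot does not work through the `HuggingFaceHub` wrapper. Building the chain raises a `ValidationError`, most likely because the model's task on the Hub is not one of the two supported tasks listed above.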

In [None]:
# I am not going to run this cell ... 
blenderbot_chain = LLMChain(prompt=prompt,
                     llm=HuggingFaceHub(repo_id="facebook/blenderbot-1B-distill",
                                        model_kwargs={"temperature":0,
                                                      "max_length":64}))

ValidationError: ignored

In [None]:
# question = "What is the capital of France?"
# question = "What area is best for growing wine in France?"

# print(blenderbot_chain.run(question))

## With Local model from HF

### Why would you want to use a local model?

- you have a fine-tuned model of your own
- you have your own GPU to host the model on
- some models only work locally, not through the Hub wrapper

In [6]:
!ls /home/rob/Data2/huggingface/transformers

models--TheBloke--CodeLlama-34B-Instruct-GPTQ  tmpcjh0h7gn
models--TheBloke--Llama-2-13b-Chat-GPTQ        tmpzafytbf_
models--TheBloke--Python-Code-33B-GPTQ	       version.txt
models--meta-llama--Llama-2-13b-hf


In [None]:
# load in the target model to this container ...

## T5-Flan - Encoder-Decoder

In [7]:
from langchain.llms import HuggingFacePipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM


In [8]:
model_id = 'google/flan-t5-large'  # go for a smaller model if you don't have the VRAM
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [9]:
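%%time
# load_in_8bit=True relies on the bitsandbytes and accelerate packages,
# which the pip install at the top of this notebook does not include.
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True)

# The wall time reported below is mostly the one-time 3.13 GB model download.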

CPU times: user 19.2 s, sys: 17.2 s, total: 36.4 s
Wall time: 43min 38s


In [10]:
!ls /home/rob/Data2/huggingface/transformers

models--TheBloke--CodeLlama-34B-Instruct-GPTQ
models--TheBloke--Llama-2-13b-Chat-GPTQ
models--TheBloke--Python-Code-33B-GPTQ
models--google--flan-t5-large
models--meta-llama--Llama-2-13b-hf
tmpcjh0h7gn
tmpzafytbf_
version.txt
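
The `models--org--name` folders in the listing follow the Hugging Face cache naming scheme. A small sketch for turning such a listing back into repo ids is given below. `cached_repo_ids` is a hypothetical helper, and it assumes that no model name itself contains a double hyphen.

```python
from pathlib import Path


def cached_repo_ids(cache_dir):
    """Map `models--org--name` folder names back to `org/name` repo ids."""
    repo_ids = []
    for entry in sorted(Path(cache_dir).iterdir()):
        if entry.is_dir() and entry.name.startswith("models--"):
            # Drop the "models--" prefix; the remaining "--" separators become "/".
            repo_ids.append(entry.name[len("models--"):].replace("--", "/"))
    return repo_ids
```

For the directory above, `cached_repo_ids("/home/rob/Data2/huggingface/transformers")` would list the cached models while skipping the `tmp...` folders and `version.txt`.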


In [None]:
# Backup models--google--flan-t5-large
# docker cp c9b676310ea0://home/rob/Data2/huggingface/transformers/models--google--flan-t5-large /home/rob/Data3/huggingface/transformers
# Successfully copied 3.14GB to /home/rob/Data3/huggingface/transformers

In [11]:
# from langchain.llms import HuggingFacePipeline
# import torch
# from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM

# model_id = 'google/flan-t5-large'  # go for a smaller model if you don't have the VRAM
# tokenizer = AutoTokenizer.from_pretrained(model_id)
# model = AutoModelForSeq2SeqLM.from_pretrained(model_id, load_in_8bit=True)

pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=100
)

local_llm = HuggingFacePipeline(pipeline=pipe)


In [12]:
print(local_llm('What is the capital of France? '))

# 0.7s

paris


In [13]:
llm_chain = LLMChain(prompt=prompt,
                     llm=local_llm
                     )

question = "What is the capital of England?"

print(llm_chain.run(question))

# 2.1s

The capital of England is London. London is the capital of England. So the answer is London.


## GPT2-medium - Decoder Only Model

Another decoder-only model to try: `microsoft/DialoGPT-large`

In [14]:
model_id = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [16]:
model = AutoModelForCausalLM.from_pretrained(model_id)

In [17]:
!ls /home/rob/Data2/huggingface/transformers

models--TheBloke--CodeLlama-34B-Instruct-GPTQ
models--TheBloke--Llama-2-13b-Chat-GPTQ
models--TheBloke--Python-Code-33B-GPTQ
models--google--flan-t5-large
models--gpt2-medium
models--meta-llama--Llama-2-13b-hf
tmpcjh0h7gn
tmpfcrmwgx2
tmpzafytbf_
version.txt


In [None]:
# Backup "gpt2-medium"
# docker cp c9b676310ea0://home/rob/Data2/huggingface/transformers/models--gpt2-medium /home/rob/Data3/huggingface/transformers
# Successfully copied 1.52GB to /home/rob/Data3/huggingface/transformers

In [18]:
# model_id = "gpt2-medium"
# tokenizer = AutoTokenizer.from_pretrained(model_id)
# model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=100
)

local_llm = HuggingFacePipeline(pipeline=pipe)
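
The local examples in this notebook differ mainly in the pipeline task string: the encoder-decoder models (Flan-T5, BlenderBot) use `text2text-generation`, while the decoder-only GPT-2 uses `text-generation`. A minimal sketch of that choice follows. `pipeline_task` is a hypothetical helper, and it assumes the flag comes from something like `model.config.is_encoder_decoder`.

```python
def pipeline_task(is_encoder_decoder: bool) -> str:
    """Pick the pipeline task string that matches the model architecture."""
    if is_encoder_decoder:
        # Seq2seq models such as T5 read the prompt with an encoder and generate a separate answer.
        return "text2text-generation"
    # Decoder-only models such as GPT-2 simply continue the prompt text.
    return "text-generation"


print(pipeline_task(True))
print(pipeline_task(False))
```

Passing the wrong task string for a model class is an easy mistake when copying cells between the sections, so deriving it from the model can save a confusing error.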

In [19]:
llm_chain = LLMChain(prompt=prompt,
                     llm=local_llm
                     )

question = "What is the capital of France?"

print(llm_chain.run(question))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.




Let's do it.

Luxembourg, at the foot of the Rhine

What capital is Luxembourg? —

N.B. – The French national bank has a long standing interest in its existence, and it shares the common responsibility with Belgium and Luxembourg in the creation, financing and maintenance of its capital.

In this country capital is, of course


## BlenderBot - Encoder-Decoder

In [20]:
from langchain.llms import HuggingFacePipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM

In [21]:
model_id = 'facebook/blenderbot-1B-distill'
tokenizer = AutoTokenizer.from_pretrained(model_id)

In [22]:
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

In [23]:
!ls /home/rob/Data2/huggingface/transformers

models--TheBloke--CodeLlama-34B-Instruct-GPTQ
models--TheBloke--Llama-2-13b-Chat-GPTQ
models--TheBloke--Python-Code-33B-GPTQ
models--facebook--blenderbot-1B-distill
models--google--flan-t5-large
models--gpt2-medium
models--meta-llama--Llama-2-13b-hf
tmpcjh0h7gn
tmpfcrmwgx2
tmpzafytbf_
version.txt


In [None]:
# Backup 'facebook/blenderbot-1B-distill'
# docker cp c9b676310ea0://home/rob/Data2/huggingface/transformers/models--facebook--blenderbot-1B-distill /home/rob/Data3/huggingface/transformers
# Successfully copied 2.88GB to /home/rob/Data3/huggingface/transformers

In [24]:
# from langchain.llms import HuggingFacePipeline
# import torch
# from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM

# model_id = 'facebook/blenderbot-1B-distill'
# tokenizer = AutoTokenizer.from_pretrained(model_id)
# model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=100
)

local_llm = HuggingFacePipeline(pipeline=pipe)

In [25]:
llm_chain = LLMChain(prompt=prompt,
                     llm=local_llm
                     )

question = "What area is best for growing wine in France?"

print(llm_chain.run(question))

 I'm not sure, but I do know that France is one of the largest producers of wine in the world.


## SentenceTransformers

In [27]:
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"

hf = HuggingFaceEmbeddings(model_name=model_name)

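Where was `sentence-transformers/all-mpnet-base-v2` downloaded to? It shows up in neither of the two cache directories listed below. Older sentence-transformers releases default to `~/.cache/torch/sentence_transformers`, so that may be the place to look.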

In [41]:
!ls /home/rob/Data2/huggingface/transformers

models--TheBloke--CodeLlama-34B-Instruct-GPTQ
models--TheBloke--Llama-2-13b-Chat-GPTQ
models--TheBloke--Python-Code-33B-GPTQ
models--facebook--blenderbot-1B-distill
models--google--flan-t5-large
models--gpt2-medium
models--meta-llama--Llama-2-13b-hf
tmpcjh0h7gn
tmpfcrmwgx2
tmpzafytbf_
version.txt


In [51]:
!ls ~/.cache/huggingface/hub

066329cdbf0bb3f3816e6f84e2d4876143fd4e26a11ec3ed7c9837e33d9fe41e.30cb82e2121eb5ce659a35437f3a58bdceac7b5cd7c0aaaa1f10345719259ba9
066329cdbf0bb3f3816e6f84e2d4876143fd4e26a11ec3ed7c9837e33d9fe41e.30cb82e2121eb5ce659a35437f3a58bdceac7b5cd7c0aaaa1f10345719259ba9.json
066329cdbf0bb3f3816e6f84e2d4876143fd4e26a11ec3ed7c9837e33d9fe41e.30cb82e2121eb5ce659a35437f3a58bdceac7b5cd7c0aaaa1f10345719259ba9.lock
models--llama2-13b-journal-finetune--checkpoint-500
models--robkayinto--bloomz-560m_PROMPT_TUNING_CAUSAL_LM
models--robkayinto--roberta-large-lora-token-classification
models--robkayinto--t5-large_PREFIX_TUNING_SEQ2SEQ
models--robkayinto--vit-base-patch16-224-in21k-finetuned-lora-food101
models--stevhliu--roberta-large-lora-token-classification


In [57]:
# !find ~/.cache/huggingface -name '*net*'
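
The same search can be done from Python, which makes it easy to try several candidate cache roots in turn. This is only a sketch: `find_by_pattern` is a hypothetical helper built on the standard library, and which directories to search is your own guess.

```python
import fnmatch
import os


def find_by_pattern(root, pattern):
    """Return paths under `root` whose file or directory name matches `pattern`."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(os.path.expanduser(root)):
        for name in dirnames + filenames:
            if fnmatch.fnmatch(name, pattern):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

For example, `find_by_pattern("~/.cache", "*mpnet*")` walks the whole user cache rather than only the `huggingface` subfolder. A root that does not exist simply yields an empty list.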

In [50]:
!ls ~/.cache/huggingface/token

/root/.cache/huggingface/token


In [39]:
hf.embed_query('this is an embedding')

[0.010657313279807568,
 -0.09967267513275146,
 -0.02696710266172886,
 0.06531776487827301,
 0.021004972979426384,
 0.04262346029281616,
 0.011534163728356361,
 -0.006229331251233816,
 0.051758233457803726,
 0.007306778337806463,
 0.021353479474782944,
 0.04269151762127876,
 0.023143872618675232,
 0.009952736087143421,
 0.056463081389665604,
 -0.06137977913022041,
 0.0527438148856163,
 0.024683985859155655,
 -0.013267772272229195,
 -0.007051215972751379,
 0.026656348258256912,
 -0.005913526751101017,
 0.004097505006939173,
 0.03841238096356392,
 -0.014230641536414623,
 0.023023545742034912,
 -0.007326621096581221,
 -0.03562534600496292,
 -0.017934126779437065,
 -0.013930212706327438,
 0.011977538466453552,
 -0.007365955505520105,
 0.024451518431305885,
 -0.06637249141931534,
 1.5677645706091425e-06,
 0.018217220902442932,
 0.0019748734775930643,
 -0.018329372629523277,
 -0.014930730685591698,
 -0.005393403582274914,
 -0.01122232899069786,
 0.015792936086654663,
 -0.02714184671640396,
 -

In [40]:
hf.embed_documents(['this is an embedding','this another embedding'])

[[0.01065733376890421,
  -0.09967267513275146,
  -0.02696710266172886,
  0.0653177872300148,
  0.021004965528845787,
  0.04262348264455795,
  0.011534172110259533,
  -0.006229327991604805,
  0.05175822228193283,
  0.007306752260774374,
  0.021353481337428093,
  0.04269150644540787,
  0.023143883794546127,
  0.009952722117304802,
  0.056463103741407394,
  -0.0613798089325428,
  0.052743807435035706,
  0.0246839951723814,
  -0.013267735950648785,
  -0.007051208522170782,
  0.02665632963180542,
  -0.0059135183691978455,
  0.0040974742732942104,
  0.038412414491176605,
  -0.014230631291866302,
  0.023023515939712524,
  -0.007326597347855568,
  -0.035625357180833817,
  -0.01793413795530796,
  -0.013930226676166058,
  0.01197753194719553,
  -0.007365953177213669,
  0.024451525881886482,
  -0.06637251377105713,
  1.5677645706091425e-06,
  0.018217215314507484,
  0.0019748753402382135,
  -0.01832936890423298,
  -0.014930754899978638,
  -0.005393403582274914,
  -0.01122231874614954,
  0.0157929
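
The outputs above are plain lists of floats, one vector per text. A common next step is to compare two such vectors with cosine similarity. The sketch below uses only the standard library; `cosine_similarity` is a name introduced here for illustration and is not part of LangChain.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

With the objects from the cells above, `cosine_similarity(hf.embed_query('this is an embedding'), hf.embed_documents(['this is an embedding'])[0])` should come out very close to 1, since the two calls embed the same text.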

In [None]:


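from langchain.embeddings import HuggingFaceHubEmbeddings

# Same embedding model, but served by the Hugging Face Hub instead of run locally.
hf = HuggingFaceHubEmbeddings(
    repo_id=model_name,
    task="feature-extraction",
    # huggingfacehub_api_token="my-api-key",
)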