<a href="https://colab.research.google.com/github/sebastianbuzdugan/LLM/blob/main/LLM_Hugging_Faces.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip -q install langchain huggingface_hub transformers sentence_transformers

## HuggingFace

LangChain has two Hugging Face LLM wrappers: one for a local pipeline and one for a model hosted on the Hugging Face Hub. These wrappers only work with models that support one of the following tasks: `text2text-generation` or `text-generation`.
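
As a rough illustration of that constraint, the check below tests a task name against the two tasks named above. The helper `is_supported_task` and the extra task name `conversational` are assumptions made for this sketch; they are not part of the LangChain API.

```python
# The two tasks the text above names as supported by the wrappers.
SUPPORTED_TASKS = {"text2text-generation", "text-generation"}


def is_supported_task(task: str) -> bool:
    """Hypothetical helper: True if `task` is one of the supported tasks."""
    return task in SUPPORTED_TASKS


for task in ("text2text-generation", "text-generation", "conversational"):
    status = "supported" if is_supported_task(task) else "not supported"
    print(f"{task}: {status}")
```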


In [1]:
import os


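from getpass import getpass

# Never hard-code an API token in a shared notebook; prompt for it instead.
os.environ['HUGGINGFACEHUB_API_TOKEN'] = getpass('Hugging Face API token: ')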

## Use the HuggingFaceHub

In [5]:
!pip install langchain_community


In [6]:
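# Recent LangChain releases no longer expose these names at the top level
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_community.llms import HuggingFaceHub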

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate(template=template, input_variables=["question"])
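
To see the text the chain will send to the model, the template can be rendered by hand. This sketch uses plain `str.format` instead of LangChain, on the assumption that `PromptTemplate` with its default f-string style substitutes `{question}` in the same way. The variable name `rendered` is only for illustration.

```python
# Same template string as in the cell above.
template = """Question: {question}

Answer: Let's think step by step."""

# Fill in the single input variable and inspect the final prompt text.
rendered = template.format(question="What is the capital of France?")
print(rendered)
```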



In [14]:

# Change to a smaller model
llm_chain = LLMChain(prompt=prompt,
                     llm=HuggingFaceHub(repo_id="google/flan-t5-large",  # Updated model
                                        model_kwargs={"temperature": 0,
                                                      "max_length": 64}))

In [15]:

# Example question
question = "What is the capital of France?"

# Run the model
print(llm_chain.run(question))

The capital of France is Paris. Paris is the capital of France. So the answer is Paris.


In [None]:
question = "What area is best for growing wine in France?"

print(llm_chain.run(question))

The best area for growing wine in France is the Loire Valley. The Loire Valley is located in the south of France. The area of France that is best for growing wine is the Loire Valley. The final answer: Loire Valley.


## BlenderBot

This model does not work through the `HuggingFaceHub` wrapper, so the call that runs the chain is left commented out.

In [16]:
blenderbot_chain = LLMChain(prompt=prompt,
                     llm=HuggingFaceHub(repo_id="facebook/blenderbot-1B-distill",
                                        model_kwargs={"temperature":0,
                                                      "max_length":64}))

In [17]:
# question = "What is the capital of France?"
# question = "What area is best for growing wine in France?"

# print(blenderbot_chain.run(question))

## With Local model from HF

### Why would you want to use local mode?

- you want to run your own fine-tuned models
- you want the model hosted on your own GPU or infrastructure
- some models only work locally

In [1]:
!pip install -U bitsandbytes accelerate




## Flan-T5 - Encoder-Decoder

In [16]:
from langchain_community.llms import HuggingFacePipeline
import torch
from transformers import (AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM,
                          BitsAndBytesConfig, pipeline)

model_id = 'google/flan-t5-large'  # go for a smaller model if you don't have the VRAM
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The bare load_in_8bit argument is deprecated; pass a BitsAndBytesConfig instead
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=100
)

local_llm = HuggingFacePipeline(pipeline=pipe)


In [4]:
print(local_llm.invoke('What is the capital of France?'))

paris


In [7]:
llm_chain = LLMChain(prompt=prompt,
                     llm=local_llm
                     )

question = "What is the capital of England?"

print(llm_chain.run(question))


The capital of England is London. London is the capital of England. So the answer is London.
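
The Flan-T5 model in this section is loaded in 8-bit mainly to save memory. The back-of-the-envelope estimate below shows why. It assumes roughly 780 million parameters for `google/flan-t5-large` and counts weights only. Real usage will be higher, since activations and any layers left unquantized are ignored. `weight_gib` is a helper defined for this sketch.

```python
# Approximate parameter count for google/flan-t5-large (an assumption, not measured).
PARAMS = 780_000_000

# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_gib(n_params: int, dtype: str) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 1024**3


for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_gib(PARAMS, dtype):.2f} GiB")
```

Under these assumptions, 8-bit weights take about a quarter of the space of 32-bit weights, which is often the difference between fitting on a small GPU and not fitting at all.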


## GPT2-medium - Decoder Only Model

Another decoder-only model you could try here is `microsoft/DialoGPT-large`.

In [8]:
model_id = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=100
)

local_llm = HuggingFacePipeline(pipeline=pipe)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [9]:
llm_chain = LLMChain(prompt=prompt,
                     llm=local_llm
                     )

question = "What is the capital of France?"

print(llm_chain.run(question))

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Question: What is the capital of France?

Answer: Let's think step by step.

We can say, for starters, France is the home of all its nations. It's the nation that invented guns.

The name France comes from the French word for gunpowder.

It also comes from gunpowder itself: gun.

When you look at France today, you don't see France as just a vast empire. We don't see it as a sprawling
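
The output above starts with the prompt itself because a `text-generation` pipeline returns the prompt followed by the continuation by default. The rambling answer is what you would expect from a base model like `gpt2-medium`, which is not instruction-tuned. If you only want the continuation, one option is to strip the echoed prompt afterwards. The sketch below does this in plain Python. `strip_prompt` is a hypothetical helper, not part of LangChain or transformers, and the continuation string is a placeholder.

```python
def strip_prompt(prompt: str, generated: str) -> str:
    """Drop the echoed prompt from the front of `generated`, if present."""
    if generated.startswith(prompt):
        return generated[len(prompt):].lstrip()
    return generated


prompt_text = (
    "Question: What is the capital of France?\n\n"
    "Answer: Let's think step by step."
)
# Placeholder standing in for whatever the model actually generated.
generated = prompt_text + "\n\n<model continuation>"

print(strip_prompt(prompt_text, generated))
```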


## BlenderBot - Encoder-Decoder

In [10]:
from langchain_community.llms import HuggingFacePipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM

model_id = 'facebook/blenderbot-1B-distill'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=100
)

local_llm = HuggingFacePipeline(pipeline=pipe)


Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


In [11]:
llm_chain = LLMChain(prompt=prompt,
                     llm=local_llm
                     )

question = "What area is best for growing wine in France?"

print(llm_chain.run(question))

 I'm not sure, but I do know that France is one of the largest producers of wine in the world.


## SentenceTransformers

In [12]:
from langchain_community.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"

hf = HuggingFaceEmbeddings(model_name=model_name)


In [13]:
hf.embed_query('this is an embedding')

[0.010657301172614098,
 -0.09967263042926788,
 -0.026967110112309456,
 0.065317802131176,
 0.021004989743232727,
 0.04262344539165497,
 0.011534185148775578,
 -0.006229348015040159,
 0.05175818130373955,
 0.0073067559860646725,
 0.02135351672768593,
 0.04269151762127876,
 0.02314387634396553,
 0.009952723048627377,
 0.056463129818439484,
 -0.061379820108413696,
 0.05274379998445511,
 0.024683991447091103,
 -0.013267709873616695,
 -0.0070512196980416775,
 0.026656337082386017,
 -0.005913520231842995,
 0.004097478464245796,
 0.038412388414144516,
 -0.01423066109418869,
 0.023023512214422226,
 -0.007326608989387751,
 -0.03562537953257561,
 -0.01793413795530796,
 -0.013930206187069416,
 0.011977513320744038,
 -0.0073659829795360565,
 0.024451514706015587,
 -0.06637246906757355,
 1.5677649116696557e-06,
 0.018217239528894424,
 0.0019748907070606947,
 -0.018329322338104248,
 -0.01493076141923666,
 -0.005393383093178272,
 -0.011222349479794502,
 0.01579292118549347,
 -0.027141869068145752,
 -

In [14]:
hf.embed_documents(['this is an embedding','this another embedding'])

[[0.010657299309968948,
  -0.09967267513275146,
  -0.026967095211148262,
  0.0653177797794342,
  0.021004972979426384,
  0.04262344539165497,
  0.011534164659678936,
  -0.006229314487427473,
  0.05175821855664253,
  0.0073067666962742805,
  0.021353499963879585,
  0.042691510170698166,
  0.023143868893384933,
  0.009952740743756294,
  0.056463103741407394,
  -0.06137979030609131,
  0.05274380370974541,
  0.024683991447091103,
  -0.013267714530229568,
  -0.007051228545606136,
  0.02665632776916027,
  -0.005913547705858946,
  0.004097473341971636,
  0.038412388414144516,
  -0.014230670407414436,
  0.023023530840873718,
  -0.007326618768274784,
  -0.03562536835670471,
  -0.017934126779437065,
  -0.013930201530456543,
  0.011977544985711575,
  -0.007365954574197531,
  0.024451516568660736,
  -0.06637245416641235,
  1.567764797982818e-06,
  0.01821722649037838,
  0.0019749023485928774,
  -0.018329346552491188,
  -0.014930750243365765,
  -0.005393392406404018,
  -0.01122233085334301,
  0.015
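
Embeddings like the ones above are usually compared with cosine similarity. The sketch below is plain Python, with short made-up vectors standing in for the much longer real ones. `cosine_similarity` is defined here for illustration and is not a LangChain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not real model output).
v1 = [0.1, -0.2, 0.3]
v2 = [0.1, -0.2, 0.3]   # identical to v1
v3 = [-0.3, 0.1, 0.2]   # points in a different direction

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

With real embeddings the call would look like `cosine_similarity(hf.embed_query(text_a), hf.embed_query(text_b))`, where `text_a` and `text_b` are whatever strings you want to compare.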