# HuggingFace Pipeline

- Author: [Sunworl](https://github.com/sunworl)
- Design: [Teddy](https://github.com/teddylee777)
- Peer Review: [Teddy](https://github.com/teddylee777)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)


## Hugging Face Local Pipeline

You can run Hugging Face models locally via the `HuggingFacePipeline` class.

The Hugging Face Model Hub hosts over 120,000 models, 20,000 datasets, and 50,000 demo apps (Spaces). All of them are open source and publicly available, which makes it easy to collaborate and build ML together.

These models can be used in LangChain either through this local pipeline wrapper or by calling hosted inference endpoints through the `HuggingFaceHub` class. For more information on hosted pipelines, refer to the HuggingFaceHub notebook.

To use this wrapper, install the `transformers` Python package along with PyTorch.

Optionally, install `xformers` for a more memory-efficient attention implementation.

In [None]:
%pip install --upgrade --quiet transformers

Set the path to download the model.

In [None]:
# Path to download Hugging Face models/tokenizers
# (Example)
import os

# Download models and tokenizers to the ./cache/ directory
os.environ["TRANSFORMERS_CACHE"] = "./cache/"
os.environ["HF_HOME"] = "./cache/"
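
The Hugging Face libraries generally read these variables when they are first imported, so it is safest to set them before importing `transformers`. The sketch below assumes you want an absolute cache directory that exists up front; `cache_dir` is an illustrative name and not part of the original notebook.

```python
import os
from pathlib import Path

# Illustrative name (assumption): resolve ./cache/ to an absolute path so the
# location does not depend on the working directory of later cells.
cache_dir = Path("./cache").resolve()
cache_dir.mkdir(parents=True, exist_ok=True)

# Set the variable before importing transformers or huggingface_hub.
os.environ["HF_HOME"] = str(cache_dir)

print(f"Hugging Face cache directory: {os.environ['HF_HOME']}")
```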

## Model Loading

Models can be loaded by specifying model parameters using the method `from_model_id`.


- The `HuggingFacePipeline` class is used to load a pre-trained model from Hugging Face.

- The `from_model_id` method is used to specify the `beomi/llama-2-ko-7b` model and set the task to "text-generation".

- The `pipeline_kwargs` parameter is used to limit the maximum number of new tokens to be generated to 512.

- The loaded model is assigned to the `hf` variable, which can be used to perform text generation tasks.

The model used: https://huggingface.co/beomi/llama-2-ko-7b

In [None]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

In [None]:
# Download the Hugging Face model
hf = HuggingFacePipeline.from_model_id(

    model_id="beomi/llama-2-ko-7b",  # Specify the ID of the model to use

    task="text-generation",  # Specify the task to perform. Here, it's text generation
    
    # Set additional arguments to pass to the pipeline. Here, we limit the maximum number of new tokens to 512
    pipeline_kwargs={"max_new_tokens": 512},
)

You can also load by directly passing an existing `transformers` pipeline.

The text generation model is wrapped with `HuggingFacePipeline`.


- `AutoTokenizer` and `AutoModelForCausalLM` are used to load the `beomi/llama-2-ko-7b` model and tokenizer.

- The `pipeline` function is used to create a "text-generation" pipeline, setting up the model and tokenizer. The maximum number of newly generated tokens is limited to 512.

- The `HuggingFacePipeline` class is used to create an `hf` object, and the generated pipeline is passed to it.


Using this created `hf` object, you can perform text generation for a given prompt.

In [None]:
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "beomi/llama-2-ko-7b"  # Specify the ID of the model to use
tokenizer = AutoTokenizer.from_pretrained(
    model_id
)  # Load the tokenizer for the specified model

model = AutoModelForCausalLM.from_pretrained(model_id)  # Load the specified model

# Create a text generation pipeline and set the maximum number of new tokens to be generated to 512
pipe = pipeline("text-generation", model=model,
                tokenizer=tokenizer, max_new_tokens=512)

# Create a HuggingFacePipeline object and pass the generated pipeline to it
hf = HuggingFacePipeline(pipeline=pipe)

## Create Chain

Once the model is loaded into memory, you can configure it with prompts to form a chain.


- A prompt template defining the question and answer format is created using the `PromptTemplate` class.

- A `chain` object is created by connecting the `prompt` object, the `hf` object, and a `StrOutputParser` in a pipeline.

- Call the `chain.invoke()` method to generate and output an answer for the given question.

In [4]:
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

template = """Answer the following question in Korean.
#Question: 
{question}

#Answer: """  # A template defining the question and answer format
prompt = PromptTemplate.from_template(template)  # Create a prompt object using the template

# Create a chain by connecting the prompt and the language model
chain = prompt | hf | StrOutputParser()

question = "대한민국의 수도는 어디야?"  # Define the question ("What is the capital of South Korea?")

print(
    chain.invoke({"question": question})
)  # Call the chain to generate and output an answer to the question
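
To see what text the model actually receives, the template can be rendered by hand. This sketch assumes the default f-string style template, for which plain `str.format` should produce the same text as the prompt object; `render_prompt` is a hypothetical helper, not a LangChain function.

```python
# The same template string used in the cell above.
template = """Answer the following question in Korean.
#Question: 
{question}

#Answer: """


def render_prompt(question: str) -> str:
    """Hypothetical helper: fill the {question} placeholder with str.format."""
    return template.format(question=question)


rendered = render_prompt("대한민국의 수도는 어디야?")
print(rendered)
```

Because the rendered text ends with `#Answer: `, a text-generation model is nudged to continue with the answer itself.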

## GPU Inference

When running on a GPU, you can specify the `device=n` parameter to place the model on a specific device.

The default value is `-1`, which means inference is performed on the CPU.

If you have multiple GPUs or if the model is too large for a single GPU, you can specify `device_map="auto"`.

In this case, the [Accelerate](https://huggingface.co/docs/accelerate/index) library is required and is used to automatically determine how to load the model weights.

*Caution*: `device` and `device_map` should not be specified together, as this can cause unexpected behavior.
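
One way to respect this caution is to build the placement argument in a single place. The sketch below is only an illustration: `device_kwargs` and `load_llm` are hypothetical helpers, and the `max_new_tokens` value is an arbitrary choice. The LangChain import is deferred so that the helper can be tried without loading a model.

```python
from typing import Any, Dict, Optional, Union


def device_kwargs(
    device: Optional[int] = None,
    device_map: Optional[Union[str, Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: return exactly one device placement argument."""
    if device is not None and device_map is not None:
        raise ValueError("Specify either `device` or `device_map`, not both.")
    if device_map is not None:
        return {"device_map": device_map}  # placement is delegated to Accelerate
    return {"device": -1 if device is None else device}  # -1 means CPU


def load_llm(model_id: str, **placement: Any):
    """Hypothetical helper: load the model with a single placement argument."""
    # Deferred import so device_kwargs can be used without LangChain installed.
    from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

    return HuggingFacePipeline.from_model_id(
        model_id=model_id,
        task="text-generation",
        pipeline_kwargs={"max_new_tokens": 64},
        **device_kwargs(**placement),
    )


print(device_kwargs(device_map="auto"))  # {'device_map': 'auto'}
print(device_kwargs(device=0))  # {'device': 0}
```

Under these assumptions, `load_llm("beomi/llama-2-ko-7b", device_map="auto")` would let Accelerate decide where the weights go, while passing both arguments fails early with a clear error.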



- Load the `beomi/llama-2-ko-7b` model using `HuggingFacePipeline` and set the `device` parameter to 0 to run it on the GPU.

- Limit the maximum number of new tokens to be generated to 64 using the `pipeline_kwargs` parameter.

- Connect the `prompt` and `gpu_llm` in a pipeline to create the `gpu_chain`.

- Call the `gpu_chain.invoke()` method to generate and output an answer for the given question.

In [5]:
gpu_llm = HuggingFacePipeline.from_model_id(
    
    model_id="beomi/llama-2-ko-7b",  # Specify the ID of the model to be used

    task="text-generation",  # Set the task to be performed. In this case, it is text generation

    # Specify the GPU device number to be used. To let the accelerate library place the model, pass device_map="auto" instead of device
    device=0,

    # Set additional arguments to be passed to the pipeline. In this case, limit the maximum number of new tokens to be generated to 64
    pipeline_kwargs={"max_new_tokens": 64},
)

# Create a chain by connecting the prompt, the language model, and the output parser
gpu_chain = prompt | gpu_llm | StrOutputParser()

question = "대한민국의 수도는 어디야?"  # Define the question ("What is the capital of South Korea?")

# Invoke the chain to generate and output an answer to the question
print(gpu_chain.invoke({"question": question}))

## Batch GPU Inference

When running on a GPU, you can perform inference in batch mode.


- Load the `beomi/llama-2-ko-7b` model using `HuggingFacePipeline` and set it to run on the GPU.

- When creating the `gpu_llm`, set the `batch_size` to 2, `temperature` to 0, and `max_length` to 256.

- Connect the `prompt` and `gpu_llm` in a pipeline to create the `gpu_chain`, and set the stop sequence to "\n\n".

- Use `gpu_chain.batch()` to generate answers for all entries in `questions` in one call.

- Print each generated answer on its own line.

In [None]:
gpu_llm = HuggingFacePipeline.from_model_id(

    model_id="beomi/llama-2-ko-7b",  # Specify the ID of the model to be used

    task="text-generation",  # Set the task to be performed

    device=0,  # Specify the GPU device number. -1 indicates CPU

    batch_size=2,  # Adjust the batch size. Set it appropriately based on GPU memory and model size.

    model_kwargs={
        "temperature": 0,
        "max_length": 256,
    },  # Set additional arguments to be passed to the model

)

# Create a chain by connecting the prompt and the language model
gpu_chain = prompt | gpu_llm.bind(stop=["\n\n"])

questions = []
for i in range(4):

    # Generate a list of questions (each asks "What is the number {i} in Korean?")
    questions.append({"question": f"숫자 {i} 이 한글로 뭐에요?"})

answers = gpu_chain.batch(questions)  # Batch process the list of questions to generate answers

for answer in answers:

    print(answer)  # Output the generated answers
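
As a rough mental model of `batch_size` (an assumption about how the wrapper groups work, not a copy of its implementation), the prompts can be thought of as being split into consecutive groups that are sent to the GPU together. `chunk_prompts` below is a hypothetical helper that illustrates the grouping for the four questions above.

```python
from typing import Iterator, List


def chunk_prompts(prompts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Hypothetical helper: yield consecutive groups of at most batch_size prompts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        yield prompts[start : start + batch_size]


prompts = [f"숫자 {i} 이 한글로 뭐에요?" for i in range(4)]
batches = list(chunk_prompts(prompts, batch_size=2))

for index, batch in enumerate(batches):
    print(f"batch {index}: {len(batch)} prompts")
```

With four questions and a batch size of 2, this grouping yields two batches of two prompts each. A larger batch size means fewer passes but more GPU memory per pass.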
