<a href="https://colab.research.google.com/github/sagargowda88/LLM/blob/main/DatasetBuild_Helper_Tool.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Hugging Face Transformers is an open-source deep learning framework created by Hugging Face.
# It provides APIs and tools to download state-of-the-art pretrained models and fine-tune them for your task.
# These models support common tasks in different modalities, such as natural language processing, computer vision, audio, and multi-modal applications.
# Using pretrained models can reduce your compute costs and carbon footprint,
# and save the time and resources required to train a model from scratch.

# https://huggingface.co/docs/transformers/index
# https://huggingface.co/docs/hub/index

# The Accelerate library helps you train a 🤗 Transformers model on any type of distributed setup,
# whether that is multiple GPUs on one machine or multiple GPUs across several machines.
# It is also required for the device_map="auto" option used when loading the model.

!pip install -q transformers langchain huggingface_hub accelerate

In [None]:
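# We need to log in to Hugging Face to download gated models such as Llama 2.
# This step requires a free Hugging Face access token.
# Never hardcode the token in the notebook; read it from the environment instead
# (this assumes the token is stored in an HF_TOKEN environment variable).

import os
from huggingface_hub import login
login(token=os.environ["HF_TOKEN"])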

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
# HuggingFacePipeline wraps a Hugging Face Transformers pipeline so LangChain can use it as an LLM.
from langchain import HuggingFacePipeline

# This line imports the AutoTokenizer class from the transformers library.
# The AutoTokenizer class is used to load tokenizers for various pre-trained language models available in the Hugging Face model hub.
from transformers import AutoTokenizer

# This line imports the entire transformers library, which is a popular library developed by
# Hugging Face for working with various transformer-based models in natural language processing (NLP),
# including both models and tokenizers.
import transformers

# This line imports the torch library, which is the primary library used for deep learning and tensor computations in PyTorch.
import torch

# Model name that we want to use
# https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

model = "meta-llama/Llama-2-7b-chat-hf"

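# Load the tokenizer; the pipeline below loads the model weights
tokenizer = AutoTokenizer.from_pretrained(model)

# Set up the text-generation pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,  # half-precision weights to reduce memory use
    device_map="auto",           # let Accelerate place the model on the available devices
    max_new_tokens=512,          # upper bound on the number of generated tokens per call
    do_sample=True,              # sample from the distribution instead of greedy decoding
    top_k=10,                    # sample only from the 10 most likely next tokens
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)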

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
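
The pipeline above sets `do_sample=True` with `top_k=10`. As a rough illustration of what top-k sampling does, here is a minimal standard-library sketch. The function name `top_k_sample` and the toy logits are made up for this example, and this is not how Transformers implements it internally.

```python
import math
import random

def top_k_sample(logits, k, rng):
    """Sample one index from the k highest-scoring entries of `logits`."""
    # Keep the indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only (subtract the max for numerical stability).
    m = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - m) for i in top]
    # Draw one index in proportion to its weight.
    return rng.choices(top, weights=weights, k=1)[0]

logits = [2.0, 0.5, 1.5, -1.0, 0.1]  # toy scores for a 5-token vocabulary
rng = random.Random(0)
samples = {top_k_sample(logits, k=2, rng=rng) for _ in range(200)}
print(sorted(samples))  # only indices 0 and 2 (the two largest logits) can appear
```

A smaller `k` keeps the output closer to the most likely continuation, while a larger `k` allows more varied text.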

In [None]:
# HuggingFacePipeline wraps the text-generation pipeline defined earlier so that LangChain can call it as an LLM.
# model_kwargs carries model-specific keyword arguments (here, temperature).

llm = HuggingFacePipeline(pipeline=pipeline, model_kwargs={"temperature": 0})

In [None]:
from langchain import PromptTemplate, LLMChain

# template = """
#              Create a SQL query snippet using the below text:
#               ```{text}```
#               Just SQL query:
#            """

# prompt = PromptTemplate(template=template, input_variables=["text"])

# llm_chain = LLMChain(prompt=prompt, llm=llm)

# text = """ Extract all the unique values from column "age"
# """



In [None]:
# print(llm_chain.run(text))

In [None]:
# template1 = """

# Schema is given below:
# "name": "Activities",
# "columns":
# [
# [
# "name": "Activity Type Chapter",
# "keywords": ["Type Chapter", "Chapter"]
# ],
# [
# "name": "Percentage Value per Year",
# "keywords": ["Percentage Value", "percent value", "Value per Year", "Percentage Value per Year", "Percentage per Year"]
# ]
# ]
# ]
# Example Question:
# What is the type of chapter for the activity with the highest percentage value per year?
# Type: Class 0

# Answer Choices:
# Class 0: 'SELECT [] FROM []'
# Class 1: 'SELECT MAX([]) FROM []'
# Class 2: 'SELECT MIN([]) FROM []'
# Class 3: 'SELECT COUNT([]) FROM []'
# Class 4: 'SELECT SUM([]) FROM []'
# Class 5: 'SELECT AVG([]) FROM []'


#               ```{text1}```

#            """

# prompt1 = PromptTemplate(template=template1, input_variables=["text1"])

# llm_chain2 = LLMChain(prompt=prompt1, llm=llm)

# text1 = """
#       Please generate 10 more questions based on the schema provided.


# """

In [None]:
pre_prompt = """[INST] <<SYS>>\nPlease generate 10 more questions and map each one to a class type from the list given, based on the schema and example question provided. Just generate each question with its class type and don't print the answer choices.
Schema is given below:
[
    "name": "Blood Banks",
    "keywords":["Blood Banks"],
    "columns":
    [
        [
        "name": "Unit Type",
        "keywords": ["Blood Group", "Blood Type"]
        ],
        [
        "name": "Units Count",
        "keywords": ["Unit Count", "count of units"]

        ]
    ]
]
Example Question:
What is the type of chapter for the activity with the highest percentage value per year?
Type: Class 0

Class Type:
Class 0: 'SELECT [] FROM []'
Class 1: 'SELECT MAX([]) FROM []'
Class 2: 'SELECT MIN([]) FROM []'
Class 3: 'SELECT COUNT([]) FROM []'
Class 4: 'SELECT SUM([]) FROM []'
Class 5: 'SELECT AVG([]) FROM []'
<</SYS>>\n\n"""
prompt23 = pre_prompt + "a:\n\n{text}\n" + "[/INST]"

In [None]:
llama_prompt = PromptTemplate(template=prompt23, input_variables=['text'])
llm_chain21 = LLMChain(prompt=llama_prompt, llm=llm)
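
The prompt above follows the Llama 2 chat layout: a system message between `<<SYS>>` and `<</SYS>>`, and the whole turn wrapped in `[INST]` ... `[/INST]`. A small helper can make that layout harder to get wrong. This is a minimal sketch; `build_llama2_prompt` is a hypothetical name, not a function from LangChain or Transformers.

```python
def build_llama2_prompt(system: str, user: str) -> str:
    """Wrap a system message and a user message in the Llama 2 chat markers."""
    return f"[INST] <<SYS>>\n{system.strip()}\n<</SYS>>\n\n{user.strip()} [/INST]"

example = build_llama2_prompt(
    "You write questions for a given table schema.",
    "Please generate 10 more questions based on the schema provided.",
)
print(example)
```

If the result is passed to `PromptTemplate`, any literal curly braces in the system text would need to be escaped, which is presumably why the schema above uses square brackets.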

In [None]:
# from langchain.globals import set_debug

# set_debug(True)

# from langchain.globals import set_verbose

# set_verbose(True)



In [None]:
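# Pass the user request as the template input rather than the PromptTemplate object itself
print(llm_chain21.run("Please generate 10 more questions based on the schema provided."))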

  Okay, here are 10 more questions based on the schema and example question provided:
1. What is the average number of units of blood donated per year?
Class Type: Class 3
2. Which unit type has the highest percentage of blood donations?
Class Type: Class 1
3. What is the total number of units of blood donated in the last 5 years?
Class Type: Class 4
4. What is the average blood group of units donated in the last year?
Class Type: Class 5
5. Which chapter has the highest percentage of blood donations in the last year?
Class Type: Class 2
6. What is the total number of units of blood donated by each blood type in the last year?
Class Type: Class 5
7. Which unit type has the lowest percentage of blood donations in the last year?
Class Type: Class 1
8. What is the average number of units of blood donated per month in the last year?
Class Type: Class 4
9. Which chapter has the lowest percentage of blood donations in the last 5 years?
Class Type: Class 2
10. What is the total number of unit
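
To turn generated text like the output above into dataset rows, the question and class lines can be paired with a regular expression. The sketch below assumes the model keeps the `N. question` / `Class Type: Class K` layout seen here; `parse_questions` and `CLASS_TEMPLATES` are hypothetical names chosen for this example.

```python
import re

# Class IDs and SQL skeletons copied from the prompt.
CLASS_TEMPLATES = {
    0: "SELECT [] FROM []",
    1: "SELECT MAX([]) FROM []",
    2: "SELECT MIN([]) FROM []",
    3: "SELECT COUNT([]) FROM []",
    4: "SELECT SUM([]) FROM []",
    5: "SELECT AVG([]) FROM []",
}

# A numbered question line followed by its "Class Type: Class K" line.
PAIR_RE = re.compile(
    r"^\s*\d+\.\s*(?P<question>.+?)\s*\n\s*Class Type:\s*Class\s*(?P<cls>\d+)",
    re.MULTILINE,
)

def parse_questions(generated: str):
    """Return one dict per complete question/class pair found in `generated`."""
    rows = []
    for match in PAIR_RE.finditer(generated):
        cls = int(match.group("cls"))
        if cls in CLASS_TEMPLATES:  # skip class IDs the prompt never defined
            rows.append({
                "question": match.group("question"),
                "class": cls,
                "template": CLASS_TEMPLATES[cls],
            })
    return rows

sample = """1. What is the average number of units of blood donated per year?
Class Type: Class 3
2. Which unit type has the highest percentage of blood donations?
Class Type: Class 1
10. What is the total number of unit"""

rows = parse_questions(sample)
for row in rows:
    print(row["class"], row["question"])
```

The truncated final item has no class line, so it is dropped. The labels still need manual review: in the output above, question 1 asks for an average but is labeled Class 3, which the prompt defines as `COUNT`.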

In [None]:
# chat_history = []

# query = "What is LangChain and what applications can be created using LangChain?"

# result = chain({"question": query, "chat_history": chat_history})

# print("answer", result['answer'])
# chat_history = [(query, result["answer"])]

# query = "Please repeat the applications mentioned just now?"
# result = chain({"question": query, "context" : result["answer"], "chat_history": chat_history})

# print("answer", result['answer'])
# print("source_documents : ", result['source_documents'])