<a href="https://colab.research.google.com/github/sheegansrigm/RAG-LLM/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### News

**[NEW] We've fixed many bugs in Phi-4** which greatly increases Phi-4's accuracy. See our [blogpost](https://unsloth.ai/blog/phi4)

[NEW] You can view all Phi-4 model uploads with our bug fixes including [dynamic 4-bit quants](https://unsloth.ai/blog/dynamic-4bit), GGUF & more [here](https://huggingface.co/collections/unsloth/phi-4-all-versions-677eecf93784e61afe762afa)

[NEW] As of November 2024, Unsloth now supports [vision finetuning](https://unsloth.ai/blog/vision)!


### Installation

In [1]:
%%capture
!pip install unsloth
# Also get the latest nightly Unsloth!
!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

### Unsloth

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.1.6: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [6]:
!pip install --upgrade langchain



In [3]:

# Move the model to the appropriate device (GPU or CPU)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096, padding_idx=128004)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSN
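
The `Linear4bit` shapes in the printout above allow a rough estimate of how much memory 4-bit loading saves. The sketch below is back-of-the-envelope arithmetic only: it counts the linear weights of the 32 decoder layers and ignores quantization constants, the embedding table, and the output head (which the truncated printout does not show), so the numbers are approximate. The helper names are illustrative, not part of Unsloth.

```python
# (in_features, out_features) of each Linear4bit module in one decoder layer,
# copied from the model printout above.
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # "(0-31): 32 x LlamaDecoderLayer"


def linear_param_count(shapes, num_layers):
    """Total number of weights in the listed linear modules across all layers."""
    per_layer = sum(n_in * n_out for n_in, n_out in shapes.values())
    return per_layer * num_layers


def weight_gigabytes(num_params, bits_per_param):
    """Storage for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


linear_params = linear_param_count(LAYER_SHAPES, NUM_LAYERS)
fp16_gb = weight_gigabytes(linear_params, 16)
four_bit_gb = weight_gigabytes(linear_params, 4)

print(f"Linear weights: {linear_params:,}")
print(f"16-bit storage: {fp16_gb:.2f} GB, 4-bit storage: {four_bit_gb:.2f} GB")
```

Under these assumptions the decoder's linear weights shrink to a quarter of their 16-bit size, which is what lets the model fit comfortably on a 14.7 GB T4.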

In [9]:
!pip install langchain_community



In [13]:
from unsloth import FastLanguageModel
from langchain_community.llms import BaseLLM
from typing import Optional, List

In [14]:


class UnslothLLM(BaseLLM):
    """Unsloth LLM wrapper for LangChain."""
    model: any
    tokenizer: any
    max_tokens: int = 256  # Default max tokens

    def __init__(self, model, tokenizer, **kwargs):
        super().__init__(model=model, tokenizer=tokenizer)
        self.max_tokens = kwargs.get("max_tokens", self.max_tokens)

    @property
    def _llm_type(self) -> str:
        return "unsloth"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        device = next(self.model.parameters()).device
        inputs = self.tokenizer(prompt, return_tensors="pt").to(device)
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=self.max_tokens,
            temperature=0.7,
            top_k=50,
            eos_token_id=self.tokenizer.eos_token_id,
        )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Initialize the UnslothLLM instance
unsloth_llm = UnslothLLM(model=model, tokenizer=tokenizer, max_tokens=300)

# Use the `invoke` method to interact with the model
response = unsloth_llm.invoke(
    input="Explain Section 66 of the IT Act related to cybercrimes."
)
print(response)


  warn(


TypeError: Can't instantiate abstract class UnslothLLM with abstract method _generate
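
The `TypeError` above is standard Python behaviour for abstract base classes: a subclass cannot be instantiated until every abstract method has an implementation, and `BaseLLM` declares `_generate` as abstract. The toy sketch below reproduces the pattern with the standard library only. `ToyBaseLLM` and `ToyLLM` are hypothetical stand-ins, not LangChain classes; in LangChain, the `LLM` class in `langchain_core.language_models.llms` plays the role of `ToyLLM`, implementing `_generate` in terms of `_call`.

```python
from abc import ABC, abstractmethod
from typing import List


class ToyBaseLLM(ABC):
    """Hypothetical base class that, like BaseLLM, requires `_generate`."""

    @abstractmethod
    def _generate(self, prompts: List[str]) -> List[str]:
        ...

    def invoke(self, prompt: str) -> str:
        return self._generate([prompt])[0]


class ToyLLM(ToyBaseLLM):
    """Hypothetical convenience class: supplies `_generate` by looping over `_call`."""

    def _generate(self, prompts: List[str]) -> List[str]:
        return [self._call(p) for p in prompts]

    @abstractmethod
    def _call(self, prompt: str) -> str:
        ...


class OnlyCall(ToyBaseLLM):
    """Defines `_call` but not `_generate`, like the wrapper above."""

    def _call(self, prompt: str) -> str:
        return prompt.upper()


class CallViaToyLLM(ToyLLM):
    """Same `_call`, but inherits a working `_generate`."""

    def _call(self, prompt: str) -> str:
        return prompt.upper()


try:
    OnlyCall()
    instantiation_error = None
except TypeError as exc:
    instantiation_error = str(exc)

echo = CallViaToyLLM().invoke("hello")
print(instantiation_error)
print(echo)
```

Subclassing the convenience class rather than the bare base class is usually the simpler route when a wrapper only needs to map one prompt string to one output string.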

In [16]:
from unsloth import FastLanguageModel
from langchain_community.llms import BaseLLM
from typing import Optional, List, Any, Dict


class UnslothLLM(BaseLLM):
    """Unsloth LLM wrapper for LangChain."""
    model: any
    tokenizer: any
    max_tokens: int = 256  # Default max tokens

    def __init__(self, model, tokenizer, **kwargs):
        super().__init__(model=model, tokenizer=tokenizer)
        self.max_tokens = kwargs.get("max_tokens", self.max_tokens)
        # Enable inference mode during initialization
        FastLanguageModel.for_inference(self.model)

    @property
    def _llm_type(self) -> str:
        return "unsloth"

    # Define the _generate method
    def _generate(
        self, prompt: str, stop: Optional[List[str]] = None
    ) -> Dict[str, Any]:
        """Generate text from the Unsloth model."""
        # BaseLLM requires this method. Note that LangChain calls it with a list
        # of prompts and expects an `LLMResult` back; returning a plain dict, as
        # done here, is what triggers the AttributeError shown in the output below.
        response = self._call(prompt, stop=stop)
        return {"generations": [[{"text": response}]]}

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        device = next(self.model.parameters()).device
        inputs = self.tokenizer(prompt, return_tensors="pt").to(device)
        outputs = self.model.generate(
            **inputs,
            max_new_tokens=self.max_tokens,
            temperature=0.7,
            top_k=50,
            eos_token_id=self.tokenizer.eos_token_id,
        )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

# Initialize the UnslothLLM instance
unsloth_llm = UnslothLLM(model=model, tokenizer=tokenizer, max_tokens=300)

# Use the `invoke` method to interact with the model
response = unsloth_llm.invoke(
    input="Explain Section 66 of the IT Act related to cybercrimes."
)
print(response)

  warn(


AttributeError: 'dict' object has no attribute 'flatten'

In [25]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

# 1. Load the PDF
pdf_path = "ilovepdf_merged.pdf"  # Replace with your PDF file path
loader = PyPDFLoader(file_path=pdf_path)
documents = loader.load()

# 2. Split Documents into Chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_docs = text_splitter.split_documents(documents)

# 3. Generate Embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# 4. Create a Vector Store for Retrieval
vectorstore = FAISS.from_documents(split_docs, embeddings)

# 5. Initialize the Retriever
retriever = vectorstore.as_retriever()


  embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]
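
To make `chunk_size=1000` and `chunk_overlap=200` concrete, here is a simplified fixed-window splitter. It is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its real chunk boundaries will differ. The `chunk_text` helper and the sample string are illustrative assumptions.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "".join(str(i % 10) for i in range(2500))  # 2,500 placeholder characters
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

Each chunk repeats the last 200 characters of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk.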

In [26]:
from langchain.prompts import PromptTemplate

# Define the prompt
template = """
You are a legal assistant specializing in Indian cyber law. Using the relevant legal data retrieved from the user's query, answer the following clearly and concisely.

Query: {query}
Answer:"""
prompt = PromptTemplate(template=template, input_variables=["query"])


In [40]:
from langchain.chains import LLMChain, RetrievalQA, RetrievalQAWithSourcesChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.schema import Document
# 1. Wrap Unsloth LLM with LangChain
unsloth_llm = UnslothLLM(model=model, tokenizer=tokenizer, max_tokens=300)

# 2. Update Prompt Template to include 'context'
template = """
You are a legal assistant specializing in Indian cyber law. Using the relevant legal data retrieved from the user's query, answer the following clearly and concisely.

Query: {query}
Context: {context}
Answer:""" # Added {context} here
prompt = PromptTemplate(template=template, input_variables=["query", "context"]) # Added 'context' here

# 3. Create LLM Chain with the updated Prompt Template
llm_chain = LLMChain(llm=unsloth_llm, prompt=prompt)

# 4. Create a StuffDocumentsChain to combine documents
stuff_chain = StuffDocumentsChain(
    llm_chain=llm_chain, document_variable_name="context"
)

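# 5. Build the retrieval QA chain. `from_chain_type` creates its own default
#    "stuff" prompt, so `prompt` and `stuff_chain` defined above are not used here.
qa_chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=unsloth_llm, chain_type="stuff", retriever=retriever
)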

# 6. Query the System
query = "What does the Indian IT Act say about data privacy?"
response = qa_chain({"question": query})  # Pass only the query
print("Response:", response)

Response: {'question': 'What does the Indian IT Act say about data privacy?', 'answer': 'Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES"). \nIf you don\'t know the answer, just say that you don\'t know. Don\'t try to make up an answer.\nALWAYS return a "SOURCES" part in your answer.\n\n', 'sources': "Which state/country's law governs the interpretation of the contract?"}
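
Conceptually, the "stuff" strategy is just string assembly: the retrieved chunks are joined into one context string and placed into the prompt. The sketch below illustrates this with the standard library. The `stuff_documents` and `build_prompt` helpers, the blank-line separator, and the placeholder chunks are assumptions for illustration, not LangChain's implementation.

```python
from typing import List

PROMPT_TEMPLATE = """
You are a legal assistant specializing in Indian cyber law. Using the relevant legal data retrieved from the user's query, answer the following clearly and concisely.

Query: {query}
Context: {context}
Answer:"""


def stuff_documents(chunks: List[str], separator: str = "\n\n") -> str:
    """Concatenate all retrieved chunks into a single context string."""
    return separator.join(chunks)


def build_prompt(query: str, chunks: List[str]) -> str:
    """Fill the template with the query and the stuffed context."""
    return PROMPT_TEMPLATE.format(query=query, context=stuff_documents(chunks))


# Placeholder chunks standing in for retriever output.
example_chunks = ["Placeholder chunk A.", "Placeholder chunk B."]
prompt_text = build_prompt(
    "What does the Indian IT Act say about data privacy?", example_chunks
)
print(prompt_text)
```

Because everything lands in one prompt, the combined length of the chunks plus the template has to stay within the model's `max_seq_length`.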


In [35]:
from langchain.chains import SequentialChain

# Summarization Prompt
summary_template = """
Summarize the legal data retrieved for the user's query. Be concise and highlight key points.

Retrieved Data: {retrieved_data}
Summary:"""
summary_prompt = PromptTemplate(template=summary_template, input_variables=["retrieved_data"])
summary_chain = LLMChain(llm=unsloth_llm, prompt=summary_prompt)

# Combine Retrieval, Reasoning, and Summarization
def mixed_chain(query):
    retrieved_data = retriever.get_relevant_documents(query)
    retrieved_text = " ".join([doc.page_content for doc in retrieved_data])

    # Step 1: Retrieve and Generate Response
    reasoning_response = llm_chain.run(query=query,context=retrieved_text)

    # Step 2: Summarize Retrieved Data
    summary_response = summary_chain.run(retrieved_data=retrieved_text)

    return {
        "reasoning_response": reasoning_response,
        "summary_response": summary_response,
    }

# Test the Mixed Chain
result = mixed_chain("What are the legal consequences of hacking under Indian law?")
print("Reasoning Response:", result["reasoning_response"])
print("Summary Response:", result["summary_response"])


Reasoning Response: 
You are a legal assistant specializing in Indian cyber law. Using the relevant legal data retrieved from the user's query, answer the following clearly and concisely.

Query: What are the legal consequences of hacking under Indian law?
Context: shall be guilty of an offence and shall be liable on conviction to imprisonment for a term not exceeding 
two years or a fine not exceeding one lakh rupees or with both.] 
2[69. Power to issue directions for interception or monitoring or decryption of any information 
through any computer resource.–(1) Where the Central Government or a State Government or any of 
its officers specially authorised by the Central Government or the Sta te Government, as the case may be, 
in this behalf may, if satisfied that it is necessary or expedient so to do, in the interest of the sovereignty 
or integrity of India, defence of India, security of the State, friendly relations with foreign States or public 
order or for preventing incitement
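
The response above starts by repeating the prompt. For decoder-only models, `generate` returns the prompt tokens followed by the newly generated ones, so decoding `outputs[0]` in `_call` reproduces the prompt before any answer. A minimal sketch of the usual remedy, slicing off the prompt before decoding, is shown here with plain Python lists standing in for token-id tensors. The token ids and the `strip_prompt` helper are illustrative assumptions.

```python
from typing import List


def strip_prompt(output_ids: List[int], prompt_length: int) -> List[int]:
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


prompt_ids = [101, 7, 8, 9]              # made-up ids for the prompt
generated_ids = prompt_ids + [42, 43, 2]  # generate() output: prompt + continuation
new_tokens = strip_prompt(generated_ids, len(prompt_ids))
print(new_tokens)

# With real tensors inside `_call`, the equivalent slice would be:
#   outputs[0][inputs["input_ids"].shape[1]:]
# and only that slice would be passed to tokenizer.decode(...).
```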

In [42]:
reasoning_template = """
You are a legal expert specializing in Indian cyber law. Using the retrieved data provided below, analyze the query and provide a clear, concise, and actionable legal response.

Retrieved Data: {context}
Query: {query}
Legal Advice:"""
reasoning_prompt = PromptTemplate(template=reasoning_template, input_variables=["query", "context"])
reasoning_chain = LLMChain(llm=unsloth_llm, prompt=reasoning_prompt)

def mixed_chain(query):
    # Retrieve documents
    retrieved_data = retriever.get_relevant_documents(query)
    retrieved_text = " ".join([doc.page_content for doc in retrieved_data])

    # Step 1: Reasoning with Retrieved Data
    reasoning_response = reasoning_chain.run(query=query, context=retrieved_text)

    # Step 2: Summarize Retrieved Data
    summary_response = summary_chain.run(retrieved_data=retrieved_text)

    return {
        "reasoning_response": reasoning_response,
        "summary_response": summary_response,
    }

# Test the Mixed Chain
result = mixed_chain("What are the legal consequences of hacking under Indian law?")
print("Reasoning Response:", result["reasoning_response"])
print("Summary Response:", result["summary_response"])



Reasoning Response: 
You are a legal expert specializing in Indian cyber law. Using the retrieved data provided below, analyze the query and provide a clear, concise, and actionable legal response.

Retrieved Data: shall be guilty of an offence and shall be liable on conviction to imprisonment for a term not exceeding 
two years or a fine not exceeding one lakh rupees or with both.] 
2[69. Power to issue directions for interception or monitoring or decryption of any information 
through any computer resource.–(1) Where the Central Government or a State Government or any of 
its officers specially authorised by the Central Government or the Sta te Government, as the case may be, 
in this behalf may, if satisfied that it is necessary or expedient so to do, in the interest of the sovereignty 
or integrity of India, defence of India, security of the State, friendly relations with foreign States or public 
order or for preventing incitement to the commission of any cognizable offence relati
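
Behind `retriever.get_relevant_documents(query)`, the vector store embeds the query and returns the chunks whose embeddings lie closest to it. The sketch below shows that nearest-neighbour step with toy 2-D vectors. It assumes Euclidean (L2) distance and `k = 4`, which are believed to be the defaults of LangChain's FAISS wrapper but should be checked against the installed version; the vectors and helper names are made up.

```python
import math
from typing import List, Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 4) -> List[int]:
    """Return the indices of the k document vectors closest to the query."""
    scored = sorted((l2_distance(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs))
    return [idx for _, idx in scored[:k]]


toy_docs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (0.9, 0.1)]
nearest = top_k((1.0, 0.0), toy_docs, k=2)
print(nearest)
```

Retrieval quality depends entirely on these distances, which is why a vague query can surface loosely related sections of the Act.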

In [43]:
result = mixed_chain("Some person have using privated softwares what can be the issues faced by them?")
print("Reasoning Response:", result["reasoning_response"])
print("Summary Response:", result["summary_response"])

Reasoning Response: 
You are a legal expert specializing in Indian cyber law. Using the retrieved data provided below, analyze the query and provide a clear, concise, and actionable legal response.

Retrieved Data: 25 
 
(e) ―under circumstances violating privacy ‖ means circumstances in which a person can hav e a 
reasonable expectation that– 
(i) he or she could disrobe in privacy, without being concerned that an image of his private 
area was being captured; or 
(ii) any part of his or her private area would not be visible to the public, regardless of whether 
that person is in a public or private place. 
66F. Punishment for cyber terrorism.–(1) Whoever,– 
(A) with intent to threaten the unity, integrity, security or sovereignty of India or to strike terror in 
the people or any section of the people by– 
(i) denying or cause the denial of access to any person authoris ed to access computer 
resource; or 
(ii) attempting to penetrate or access a computer resource without authorisa t

In [44]:
result = mixed_chain("what if the person continue doing that crime again?")
print("Reasoning Response:", result["reasoning_response"])

Reasoning Response: 
You are a legal expert specializing in Indian cyber law. Using the retrieved data provided below, analyze the query and provide a clear, concise, and actionable legal response.

Retrieved Data: force. 
77A. Compounding of offences.–A court of competent jurisdiction may compound offences, other 
than offences for which the punishment for life or imprisonment for a term exceeding three years has  
been provided, under this Act: 
Provided that the court shall not compound such offence where the accused is, by reason of his 
previous conviction, liable to either enhanced punishment or to a punishment of a different kind: 
Provided further that the court shall not compound any offence where such offence affects the socio 
economic conditions of the country or has been committed against a child below the age of 18 years or a 
woman. 
                                                           
1. Subs. by Act 10 of 2009, s. 38, for section 77 (w.e.f. 27-10-2009). 68. Pena

In [45]:
!pip install huggingface_hub




In [47]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
The token `colab` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `colab`


In [49]:
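import os
import torch

# Make sure the output folder exists before writing into it
os.makedirs("model", exist_ok=True)

# Save the model weights as a raw PyTorch state dict.
# Note: this is not the Hugging Face format, so `from_pretrained` cannot load it directly.
torch.save(unsloth_llm.model.state_dict(), "model/unsloth_model.pth")

# Save the tokenizer
unsloth_llm.tokenizer.save_pretrained("model/unsloth_tokenizer")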


('model/unsloth_tokenizer/tokenizer_config.json',
 'model/unsloth_tokenizer/special_tokens_map.json',
 'model/unsloth_tokenizer/tokenizer.json')

In [50]:
# Assuming your vector store is stored in `vectorstore`
vectorstore.save_local("model/vectorstore")


In [51]:
import pickle

# Save the complete LangChain system
with open("model/legal_chain_system.pkl", 'wb') as f:
    pickle.dump(mixed_chain, f)
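
A caution about the cell above: `pickle` serializes a plain function by reference (its module name plus its function name), not its code or the objects it uses. That is consistent with the 39-byte `legal_chain_system.pkl` in the upload log further down. The sketch below demonstrates the behaviour with a standard-library function, chosen only because it can be pickled in any environment.

```python
import pickle
import textwrap

# Pickling a function stores only "where to find it", not what it does.
payload = pickle.dumps(textwrap.dedent)
payload_size = len(payload)

# Unpickling looks the name up again and returns the very same object.
restored = pickle.loads(payload)

print(payload_size, "bytes")
print(restored is textwrap.dedent)
```

Loading the pickled `mixed_chain` in a fresh session would therefore only work if `mixed_chain`, the retriever, and the chains were all defined there already. Saving the model, tokenizer, and vector store separately, as the other cells do, is what actually preserves the system.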


In [18]:
response = unsloth_llm.invoke(
    input="Some person have using privated softwares what can be the issues faced by them"
)
print(response)

Some person have using privated softwares what can be the issues faced by them?
If you are using a pirated software, then there are many problems that you may face. First of all, you will not be able to update the software. Secondly, you will not be able to contact the customer support for the software. Thirdly, you will not be able to get any warranty on the software. Fourthly, you will not be able to use the software for commercial purposes. Lastly, you may face legal issues if you are found using a pirated software.
I hope you will be satisfied with my answer.


In [54]:
!huggingface-cli repo create Sheegan/Legal-Advisor --type model

git version 2.34.1
git-lfs/3.0.2 (GitHub; linux amd64; go 1.18.1)

You are about to create Sheegan/Sheegan/Legal-Advisor
Proceed? [Y/n] Traceback (most recent call last):
  File "/usr/local/bin/huggingface-cli", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/commands/huggingface_cli.py", line 57, in main
  File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/commands/user.py", line 285, in run
    choice = input("Proceed? [Y/n] ").lower()
             ^^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
^C


In [56]:
from huggingface_hub import upload_folder

# Upload Model and Tokenizer
upload_folder(
    repo_id="Sheegan/Legal_Advisor",  # Replace with your Hugging Face repo name
    folder_path="model",  # Path to where you've saved the model, tokenizer, and vectorstore
    commit_message="Upload full legal system including model, tokenizer, and vector store",
)

  0%|          | 0/4 [00:00<?, ?it/s]

legal_chain_system.pkl:   0%|          | 0.00/39.0 [00:00<?, ?B/s]

unsloth_model.pth:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

unsloth_tokenizer/tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

vectorstore/index.pkl:   0%|          | 0.00/402k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/Sheegan/Legal_Advisor/commit/76e0a109c164b98b4721649ee97af73977e58bae', commit_message='Upload full legal system including model, tokenizer, and vector store', commit_description='', oid='76e0a109c164b98b4721649ee97af73977e58bae', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Sheegan/Legal_Advisor', endpoint='https://huggingface.co', repo_type='model', repo_id='Sheegan/Legal_Advisor'), pr_revision=None, pr_num=None)

In [None]:
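from huggingface_hub import upload_file, upload_folder

# Upload the vector store into a `vectorstore/` subfolder of the repo
upload_folder(
    repo_id="Sheegan/Legal_Advisor",  # Replace with your repo name
    folder_path="model/vectorstore",  # Path to the saved vector store
    path_in_repo="vectorstore",
    commit_message="Upload vector store",
)

# Upload the LangChain system (a single file, so use upload_file rather than upload_folder)
upload_file(
    repo_id="Sheegan/Legal_Advisor",  # Replace with your repo name
    path_or_fileobj="model/legal_chain_system.pkl",  # Path to the saved LangChain setup
    path_in_repo="legal_chain_system.pkl",
    commit_message="Upload LangChain setup",
)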


In [None]:
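from unsloth import FastLanguageModel

# FastLanguageModel.from_pretrained expects a Hugging Face model folder or repo id
# and returns a (model, tokenizer) pair; it cannot read the raw `.pth` state dict
# or the tokenizer folder saved above. Reload the base checkpoint instead
# (or simply reuse the `model` and `tokenizer` already in memory).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Save the model and tokenizer in Hugging Face format.
# Note: this writes to a local folder named 'Sheegan/Legal_Advisor'; it does not upload to the Hub.
model.save_pretrained('Sheegan/Legal_Advisor')
tokenizer.save_pretrained('Sheegan/Legal_Advisor')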


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [63]:
from unsloth import FastLanguageModel

# Load the model and tokenizer from Hugging Face.
# FastLanguageModel.from_pretrained returns a (model, tokenizer) pair, so a
# separate AutoTokenizer call is not needed.
model, tokenizer = FastLanguageModel.from_pretrained("Sheegan/Legal_Advisor")


==((====))==  Unsloth 2025.1.6: Fast Llama patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


OSError: Sheegan/Legal_Advisor does not appear to have a file named pytorch_model.bin, model.safetensors, tf_model.h5, model.ckpt or flax_model.msgpack.
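
The `OSError` above lists the weight file names that `from_pretrained` looks for; the uploaded `unsloth_model.pth` is not one of them. A quick local check before uploading can catch this. The sketch below is a hypothetical helper built only from the names in that error message. It does not cover sharded checkpoints or the `config.json` that a Hugging Face model folder also needs.

```python
import os
import tempfile
from typing import List

# File names taken from the OSError message above.
WEIGHT_FILENAMES = (
    "pytorch_model.bin",
    "model.safetensors",
    "tf_model.h5",
    "model.ckpt",
    "flax_model.msgpack",
)


def find_weight_files(folder: str) -> List[str]:
    """Return the recognised weight files present in the top level of a folder."""
    present = set(os.listdir(folder))
    return [name for name in WEIGHT_FILENAMES if name in present]


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "unsloth_model.pth"), "wb").close()
    before = find_weight_files(tmp)   # a raw .pth file is not recognised
    open(os.path.join(tmp, "model.safetensors"), "wb").close()
    after = find_weight_files(tmp)

print(before, after)
```

Saving with `model.save_pretrained(...)` instead of `torch.save(model.state_dict(), ...)` is what normally produces one of the recognised files.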

In [19]:
pip install faiss-cpu langchain transformers




In [22]:
!pip install pypdf



In [23]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader(file_path="ilovepdf_merged.pdf")
documents = loader.load()


In [12]:
unsloth_llm = UnslothLLM(model=model, tokenizer=tokenizer, max_tokens=300)

response = unsloth_llm(
    prompt="Explain Section 66 of the IT Act related to cybercrimes."
)
print(response)


  response = unsloth_llm(


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
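
To see how many parameters these settings actually train: LoRA attaches two low-rank matrices to each target module, one of shape `r x in_features` and one of shape `out_features x r`, which adds `r * (in_features + out_features)` trainable weights per module. The sketch below applies that formula to the layer shapes from the model printout earlier in this notebook; the helper name is illustrative.

```python
# (in_features, out_features) for each LoRA target module in one decoder layer,
# taken from the model printout earlier in the notebook.
TARGET_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32
RANK = 16  # r = 16 in get_peft_model above


def lora_trainable_params(shapes, rank, num_layers):
    """Trainable LoRA weights: rank * (in + out) per target module, summed over layers."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


trainable = lora_trainable_params(TARGET_SHAPES, RANK, NUM_LAYERS)
print(f"{trainable:,} trainable LoRA parameters")
```

The result agrees with the "Number of trainable parameters" line that the trainer prints when training starts.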


<a name="Data"></a>
### Data Prep
We now use the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `llama-3` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

Downloading readme:   0%|          | 0.00/11.6k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/44.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/51760 [00:00<?, ? examples/s]

Map:   0%|          | 0/51760 [00:00<?, ? examples/s]

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up; for a full run, set `num_train_epochs=1` and turn off the step limit with `max_steps=None`. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Change to "wandb" etc. to enable experiment logging
    ),
)

Map (num_proc=2):   0%|          | 0/51760 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
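
As a rough guide to how much of the dataset a 60-step run covers, the arithmetic below combines the batch settings from the cell above with the dataset size reported during loading. It assumes a single GPU and is only an estimate of how the trainer counts steps.

```python
import math

num_examples = 51_760          # size of the train split reported during loading
per_device_batch_size = 2      # per_device_train_batch_size
grad_accum_steps = 4           # gradient_accumulation_steps
num_gpus = 1                   # assumption: a single GPU
max_steps = 60

# One optimizer step consumes this many examples.
effective_batch_size = per_device_batch_size * grad_accum_steps * num_gpus
examples_seen = max_steps * effective_batch_size
steps_per_epoch = math.ceil(num_examples / effective_batch_size)
fraction_of_epoch = examples_seen / num_examples

print(f"Effective batch size: {effective_batch_size}")
print(f"Examples seen in {max_steps} steps: {examples_seen}")
print(f"Steps for one full epoch: {steps_per_epoch}")
print(f"Fraction of one epoch covered: {fraction_of_epoch:.2%}")
```

Under these assumptions the short run sees fewer than 1% of the examples, so a full epoch would take roughly a hundred times longer.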


In [None]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.984 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 51,760 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 60
 "-____-"     Number of trainable parameters = 41,943,040


Step,Training Loss
1,1.8176
2,2.3042
3,1.6893
4,1.9382
5,1.6569
6,1.6219
7,1.1871
8,1.2642
9,1.1012
10,1.1895
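
The training banner reports 41,943,040 trainable parameters. The sketch below shows one way to arrive at that number. It assumes LoRA rank 16 applied to seven projection matrices in each of 32 layers, with Llama 3.1 8B layer shapes. None of these settings appear in this section, so treat them as assumptions and check them against the model-setup cell.

```python
# Assumed settings (not shown in this section): LoRA rank and Llama 3.1 8B shapes.
LORA_RANK = 16
NUM_LAYERS = 32
HIDDEN = 4096    # model hidden size
KV_DIM = 1024    # key/value projection width (grouped-query attention)
MLP_DIM = 14336  # feed-forward intermediate size

# (in_features, out_features) of each adapted projection in one layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """Each adapted matrix adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


total = lora_param_count(TARGET_SHAPES, LORA_RANK, NUM_LAYERS)
print(f"{total:,}")
```

Because only these small matrices are trained, the trainable parameter count is a small fraction of the full 8B model.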


In [None]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

462.7198 seconds used for training.
7.71 minutes used for training.
Peak reserved memory = 7.922 GB.
Peak reserved memory for training = 1.938 GB.
Peak reserved memory % of max memory = 53.716 %.
Peak reserved memory for training % of max memory = 13.141 %.


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

['<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nContinue the fibonnaci sequence.\n\n### Input:\n1, 1, 2, 3, 5, 8\n\n### Response:\n13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025']
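
The decoded string contains the full prompt plus special tokens. If only the answer is needed, a small post-processing helper like the one below can pull it out. The helper name `extract_response` and the abbreviated sample string are hypothetical. Passing `skip_special_tokens = True` to `batch_decode` is another way to drop the special tokens.

```python
def extract_response(
    decoded,
    marker="### Response:",
    special_tokens=("<|begin_of_text|>", "<|end_of_text|>"),
):
    """Return the text after the last response marker, without special tokens.

    If the marker is missing, the whole string (minus special tokens) is returned.
    """
    _, _, response = decoded.rpartition(marker)
    for token in special_tokens:
        response = response.replace(token, "")
    return response.strip()


# Abbreviated, made-up stand-in for one element of tokenizer.batch_decode(outputs).
sample = (
    "<|begin_of_text|>### Instruction:\nContinue the sequence.\n\n"
    "### Input:\n1, 1, 2, 3, 5, 8\n\n"
    "### Response:\n13, 21, 34<|end_of_text|>"
)
print(extract_response(sample))
```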

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Continue the fibonnaci sequence.

### Input:
1, 1, 2, 3, 5, 8

### Response:
13, 21, 34, 55, 89, 144<|end_of_text|>


<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model")  # Local saving
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What is a famous tall tower in Paris?

### Input:


### Response:
One of the most famous and iconic tall towers in Paris is the Eiffel Tower. Standing at 324 meters (1,063 feet) tall, this wrought iron tower is a symbol of the city and a must-see attraction for tourists from all over the world.<|end_of_text|>


You can also use Hugging Face's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Not recommended - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
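
Conceptually, a merged save folds the adapter into the base weights so the saved model no longer needs the adapter at load time. The sketch below checks the standard LoRA identity `W' = W + (alpha / r) * B @ A` on tiny made-up matrices in plain Python. It is an illustration of the idea, not Unsloth's implementation, and all names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale):
    """Return a + scale * b, element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Made-up base weight (out=2, in=3) and rank-1 adapter factors.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]       # out x r
A = [[0.5, 0.0, 1.0]]    # r x in
alpha, r = 2, 1
scale = alpha / r

x = [[1.0], [2.0], [3.0]]  # input column vector

# Adapter path: base output plus the scaled low-rank update.
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged path: fold the update into the weight once, then do one matmul.
W_merged = add_scaled(W, matmul(B, A), scale)
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)
```

Both paths give the same output, which is why a merged model can be served without the adapter files.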

### GGUF / llama.cpp Conversion
Saving to `GGUF` / `llama.cpp` is now supported natively! We clone `llama.cpp` and save to `q8_0` by default. All methods, such as `q4_k_m`, are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to Hugging Face.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to join projects and keep up with the latest LLM news, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐️
</div>
