<a href="https://colab.research.google.com/github/women-in-ai-ireland/June-2024-Group-002/blob/main/Study_Pal_Geo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# To stop warnings from showing
import warnings
warnings.filterwarnings("ignore", category=UserWarning, message="`do_sample` is set to `False`.")

In [2]:
# Installing required libraries
!pip install torch torchvision torchaudio
!pip install transformers==4.30
!pip install langchain sentence_transformers huggingface-hub
!pip install pypdf
!pip install -U langchain-community
!pip install bitsandbytes
!pip install faiss-cpu langchain-openai tiktoken unstructured selenium newspaper3k textstat
!pip install accelerate

!pip install langchain-huggingface
!pip install sentence-transformers==2.2.2
!pip install InstructorEmbedding


In [3]:
# Importing all used packages
from google.colab import drive, userdata
import os
import pickle
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_community.embeddings import HuggingFaceInstructEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from langchain.chains import RetrievalQA
from langchain_huggingface import HuggingFacePipeline
from InstructorEmbedding import INSTRUCTOR

In [4]:
# Mount Google Drive
drive.mount('/content/gdrive', force_remount=True)
root_dir = "/content/gdrive/MyDrive/WAI_project2/"

# Set device - using GPU if possible (T4 GPU used in this project)
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

# Set HF token - Note you need a Hugging Face token with required permissions for this project
hf_token = userdata.get('HF_TOKEN')

Mounted at /content/gdrive
Using device: cuda


In [5]:
# Function to initialize Hugging Face Instructor Embeddings
def initialize_huggingface_embeddings(model_name="hkunlp/instructor-xl", device="cuda"):
    """
    Initializes Hugging Face Instructor Embeddings model.

    Args:
    - model_name (str): Name of the Hugging Face model. Instructor Embeddings used.
    - device (str): Device to run the model on.

    Returns:
    - embeddings: Initialized Hugging Face Instructor Embeddings model.
    """
    return HuggingFaceInstructEmbeddings(model_name=model_name, model_kwargs={"device": device})

# Reuse the device detected above instead of assuming CUDA is available
instructor_embeddings = initialize_huggingface_embeddings(device=device)


load INSTRUCTOR_Transformer
max_seq_length  512




In [6]:
# Set the path for storing embeddings; currently on Google Drive (the store used about 4.64 GB)
embedding_store_path = os.path.join(root_dir, "embedding_store")

In [7]:
# Define the parameters for the recursive text splitter (sizes are measured in characters via len)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=10,
    length_function=len,
)
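To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification and not the splitter's real algorithm: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph breaks, line breaks, and spaces, so its chunks are usually shorter than `chunk_size`. The function name `sliding_window_chunks` is hypothetical and used only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int = 100, chunk_overlap: int = 10) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap.

    Illustration only: this ignores separators, unlike the recursive splitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Toy input: 250 characters of repeating digits
sample = "".join(str(i % 10) for i in range(250))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])
# The last 10 characters of one chunk reappear at the start of the next
print(chunks[0][-10:] == chunks[1][:10])
```

With 100-character chunks, each retrieved chunk carries only a sentence or so of context, which is worth keeping in mind when only one chunk is passed to the model.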

In [8]:
# Function to ingest and chunk knowledge database using Recursive Splitter
def ingest_and_chunk_pdfs(root_dir, text_splitter):
    """
    Ingests all PDFs from a specified directory and splits their content into smaller text chunks.

    Args:
    - root_dir (str): The path to the directory containing the PDFs.
    - text_splitter (TextSplitter): An instance of a text splitter used to divide documents into chunks.

    Returns:
    - list: A list of text chunks from the documents.
    """
    loader = DirectoryLoader(root_dir, glob="*.pdf", loader_cls=PyPDFLoader)
    documents = loader.load()
    texts = text_splitter.split_documents(documents)
    return texts

In [9]:
# Function to store embeddings
def store_embeddings(docs, embeddings, store_name, path):
    """
    Stores embeddings in FAISS format and saves to a pickle file.

    Args:
    - docs (list): List of documents.
    - embeddings: Embedding model.
    - store_name (str): Name of the embedding store.
    - path (str): Path to the directory where embeddings will be stored (Google Drive in this project).

    Result: Stores the FAISS vector store in the given path as a pickle file (around 4.64 GB in this project).
    """
    vector_store = FAISS.from_documents(docs, embeddings)
    os.makedirs(path, exist_ok=True)  # create the target directory if it does not exist yet
    with open(os.path.join(path, f"faiss_{store_name}.pkl"), "wb") as f:
        pickle.dump(vector_store, f)

In [10]:
# Function to load embeddings
def load_embeddings(store_name, path):
    """
    Loads embeddings from a pickle file.

    Args:
    - store_name (str): Name of the embedding store.
    - path (str): Path to the directory where embeddings are stored.

    Returns:
    - vector_store: Loaded FAISS vector store.
    """
    with open(os.path.join(path, f"faiss_{store_name}.pkl"), "rb") as f:
        vector_store = pickle.load(f)
    return vector_store
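The store and load functions rely on the same file naming convention, `faiss_{store_name}.pkl`. The sketch below shows that round trip with the standard library only. The dictionary `toy_store` is a stand-in for the real FAISS vector store, and the helper `store_file_path` is a hypothetical name introduced for this illustration.

```python
import os
import pickle
import tempfile


def store_file_path(path: str, store_name: str) -> str:
    """Build the pickle file path using the notebook's naming convention."""
    return os.path.join(path, f"faiss_{store_name}.pkl")


# Stand-in for a vector store: any picklable object works for this illustration
toy_store = {"texts": ["chunk one", "chunk two"], "vectors": [[0.1, 0.2], [0.3, 0.4]]}

with tempfile.TemporaryDirectory() as tmp_dir:
    target = store_file_path(tmp_dir, "instructEmbeddings")
    with open(target, "wb") as f:
        pickle.dump(toy_store, f)      # same call used when storing embeddings
    with open(target, "rb") as f:
        restored = pickle.load(f)      # same call used when loading embeddings
    file_name = os.path.basename(target)

print(file_name)
print(restored == toy_store)
```

Note that `pickle.load` can execute arbitrary code, so it should only be used on files you created yourself. LangChain's FAISS wrapper also provides `save_local` and `load_local`, which may be a sturdier option across library versions.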


In [11]:
# Function to set up embeddings
def setup_embeddings(root_dir, text_splitter, instructor_embeddings, embedding_store_path):
    """
    Ingests and chunks PDF documents, creates embeddings, and stores them.

    Args:
    - root_dir (str): The directory path containing PDF files. Google Drive used in this project
    - text_splitter (TextSplitter): Recursive Text Splitter
    - instructor_embeddings: The model used for generating embeddings. Hugging Face Instructor Embeddings
    - embedding_store_path (str): Directory path where the embeddings will be stored. Google Drive used in this project

    Returns:
    - tuple: A tuple containing the list of text chunks and the generated vector store.
    """
    torch.cuda.empty_cache()

    # Use the given Knowledge Database to ingest and chunk
    texts = ingest_and_chunk_pdfs(root_dir, text_splitter)

    # Create and store embeddings
    vector_store = FAISS.from_documents(texts, instructor_embeddings)
    os.makedirs(embedding_store_path, exist_ok=True)  # create the target directory if needed
    with open(os.path.join(embedding_store_path, "faiss_instructEmbeddings.pkl"), "wb") as f:
        pickle.dump(vector_store, f)

    # Return texts and vector_store for verification
    return texts, vector_store

In [12]:
# Initializing Gemma 2B LLM and Tokenizer
def initialize_model_and_tokenizer():
    """
    Initializes a pre-trained language model (Gemma 2B) and tokenizer with 4-bit quantization.

    Returns:
    - tuple: A tuple containing the initialized model and tokenizer.
    """
    quantization_config = BitsAndBytesConfig(load_in_4bit=True)
    # Gemma is a gated model, so the Hugging Face token is passed explicitly
    model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", quantization_config=quantization_config, token=hf_token)
    # The tokenizer is not quantized, so it takes no quantization_config
    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b", model_max_length=256, token=hf_token)
    return model, tokenizer

# Function to find similar chunks to query - set to just 1 chunk for this project
def retrieve_relevant_chunks(question, vector_store, num_chunks=1):
    """
    Retrieves the most relevant document chunks from the vector store based on the user query.

    Args:
    - question (str): The question used to search for relevant chunks.
    - vector_store: The FAISS vector store containing document embeddings.
    - num_chunks (int, optional): The number of relevant chunks to retrieve. Defaults to 1.

    Returns:
    - list: A list of relevant document chunks. In this project it is set to 1 chunk
    """
    docs = vector_store.similarity_search(question, k=num_chunks)
    return docs
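Conceptually, a similarity search embeds the question and returns the `k` stored vectors closest to it. The following is a minimal sketch of that idea in plain Python, assuming Euclidean (L2) distance; the metric actually used depends on how the FAISS index was built. The function `top_k_nearest` and the toy two-dimensional vectors are hypothetical and stand in for real embeddings.

```python
import math


def top_k_nearest(query_vec, doc_vecs, k=1):
    """Return the indices of the k document vectors closest to query_vec."""
    # Pair every distance with its document index, then sort by distance
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    distances.sort()
    return [idx for _, idx in distances[:k]]


# Toy "embeddings" for three chunks and one query
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query_vec = [1.0, 0.05]

print(top_k_nearest(query_vec, doc_vecs, k=1))
print(top_k_nearest(query_vec, doc_vecs, k=2))
```

Raising `num_chunks` above 1 corresponds to a larger `k` here: the model then sees more context, at the cost of a longer prompt.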

In [20]:
# Format the prompt so that the model answers only from the Knowledge Database, to reduce hallucinations
def format_prompt(question, chunks):
    """
    Formats a prompt for the language model by combining the question with relevant context chunks.

    Args:
    - question (str): User query to be answered.
    - chunks (list): A list of document chunks providing context for the answer.

    Returns:
    - str: A formatted prompt string.
    """
    context = "\n".join([chunk.page_content for chunk in chunks])
    prompt = f"Provide an answer to the following question using only the context provided: {question.strip().rstrip('?')}? " \
             f"If you cannot answer this question from the information provided, respond with 'There is insufficient information to answer this question.'\n\n{context}"
    return prompt

# Function for Generating answer
def gen_answer(prompt, tokenizer, model, max_length=200, temperature=0.5, top_p=0.9):
    """
    Generates an answer to a prompt using a pre-trained language model. Hyperparameters can be changed to suit user preference.

    Args:
    - prompt (str): The formatted prompt to ensure contextually aware response.
    - tokenizer: The tokenizer used for encoding the prompt.
    - model: Gemma 2B Language Model.
    - max_length (int, optional): Maximum length of the generated answer. Defaults to 200.
    - temperature (float, optional): Sampling temperature for the generation. Defaults to 0.5.
    - top_p (float, optional): Nucleus sampling probability. Defaults to 0.9.

    Returns:
    - str: The generated answer.
    """
    inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        response = model.generate(
            inputs,
            max_new_tokens=max_length,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            num_return_sequences=1,
            pad_token_id=tokenizer.eos_token_id
        )
    answer = tokenizer.decode(response[0], skip_special_tokens=True)
    return answer.strip()
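The `temperature` and `top_p` arguments shape the distribution each new token is sampled from. The sketch below illustrates the idea in plain Python: temperature rescales the logits before the softmax, and nucleus (top-p) filtering keeps the smallest set of most likely tokens whose cumulative probability reaches `top_p`. It is a conceptual illustration, not the implementation used inside `model.generate`; the function names and the toy logits are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature (lower values sharpen the distribution)."""
    scaled = [logit / temperature for logit in logits]
    highest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def nucleus_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalise."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


# Toy logits for a four-token vocabulary
logits = [2.0, 1.0, 0.0, -1.0]
probs = apply_temperature(logits, temperature=0.5)
nucleus = nucleus_filter(probs, top_p=0.9)

print([round(p, 3) for p in probs])
print(sorted(nucleus))  # token indices that remain candidates for sampling
```

In this toy case, a temperature of 0.5 concentrates most of the probability on the first token, and `top_p=0.9` then discards the two least likely tokens entirely.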

In [21]:
# Main function to process the question and generate the answer
def main(question):
    """
    Runs the entire RAG framework: processes a user query, retrieves relevant chunks from the vector store, generates a prompt,
    and produces an answer using the chosen language model (Gemma 2B).

    Args:
    - question (str): The User Query.

    Returns:
    - str: The generated answer to the question.
    """
    torch.cuda.empty_cache()
    vector_store = load_embeddings(store_name='instructEmbeddings', path=embedding_store_path)
    model, tokenizer = initialize_model_and_tokenizer()
    torch.cuda.empty_cache()
    relevant_chunks = retrieve_relevant_chunks(question, vector_store)
    torch.cuda.empty_cache()
    prompt = format_prompt(question, relevant_chunks)
    answer = gen_answer(prompt, tokenizer, model)
    torch.cuda.empty_cache()
    return answer
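Because `gen_answer` decodes the full output sequence of a decoder-only model, the string returned by `main` begins with the prompt itself, followed by the newly generated text. The following is a minimal sketch of trimming that echo, assuming the decoded text starts with the prompt verbatim. The helper `strip_prompt_echo` and the toy strings are hypothetical.

```python
def strip_prompt_echo(decoded: str, prompt: str) -> str:
    """Return the decoded text without the leading prompt, if the prompt is echoed."""
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    return decoded.strip()  # nothing to trim, so return the text unchanged


# Toy strings standing in for a real prompt and a real model output
prompt = "Provide an answer using only the context provided: What is a Hadley Cell?"
decoded = prompt + "\n\nA toy completion used only for this illustration."

print(strip_prompt_echo(decoded, prompt))
```

Decoding does not always reproduce the prompt character for character, so a sturdier alternative may be to slice the generated token ids before decoding, for example `response[0][inputs.shape[-1]:]` inside `gen_answer`.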

In [23]:
# Set up simple interface to allow user query
question = input("Enter your question: ")
answer = main(question)
print(answer)

Enter your question: How does the Hadley Cell work?


`low_cpu_mem_usage` was None, now set to True since model is quantized.



Provide an answer to the following question using only the context provided: How does the Hadley Cell work?? If you cannot answer this question from the information provided, respond with 'There is insufficient information to answer this question.'

The main cell to know is the Hadley Cell  
Here’s how it works:
1. The Hadley Cell is a 3D visualization of the Earth’s atmosphere.
2. It’s set up like a 3D globe, with the equator at the bottom and the poles at the top.
3. The Hadley Cell is divided into four zones: the polar regions, the polar maritime, the subtropical highland, and the subtropical lowland.
4. Each of the four zones is then further divided into four quadrants, which are then each further subdivided into sectors.
5. Each of the four quadrants is then further subdivided into four sectors, which are each then further subdivided into eight sectors.
6. Each of the eight sectors is then each further subdivided into four subsectors, which are each then each further subdivided in