<a href="https://colab.research.google.com/github/souramay/Darksoul/blob/main/pdf_querry.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Installing**

In [None]:
!pip install -q -U langchain langchain-community cassio
!pip install -q datasets tiktoken sentence-transformers
!pip install -q llama-cpp-python PyPDF2 pymupdf



# **Downloading the Llama 2 model**

In [None]:
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7b-chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
    force_download=True,
    cache_dir="/root/.cache/huggingface/hub"
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# **Defining model path**

In [None]:
# Reuse the path returned by hf_hub_download instead of hard-coding the snapshot hash,
# which changes whenever the model repository is updated.
LLAMA_MODEL_PATH = model_path

## **Importing**

In [None]:
import cassio
import fitz  # PyMuPDF
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.embeddings import LlamaCppEmbeddings
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import Cassandra

## **Configuration**

In [None]:
# Configuration
ASTRA_DB_APPLICATION_TOKEN = "your_token"
ASTRA_DB_ID = "your_id"

PDF_PATH = "Kali_Linux_Guide.pdf"
KEYSPACE = "default_keyspace"
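
Hard-coded credentials are easy to leak when a notebook is shared. A minimal sketch of reading the same settings from environment variables instead; the helper `read_setting` and the variable name `ASTRA_DB_KEYSPACE` are hypothetical and not part of cassio or LangChain:

```python
import os
from typing import Optional


def read_setting(name: str, default: Optional[str] = None) -> str:
    """Return an environment variable, or fail loudly if it is missing or empty."""
    value = os.environ.get(name, default)
    if not value:
        raise RuntimeError(f"Missing required setting: {name}")
    return value


# Placeholder values are set here only so the sketch runs on its own;
# in practice they would be exported outside the notebook.
os.environ.setdefault("ASTRA_DB_APPLICATION_TOKEN", "your_token")
os.environ.setdefault("ASTRA_DB_ID", "your_id")

token = read_setting("ASTRA_DB_APPLICATION_TOKEN")
database_id = read_setting("ASTRA_DB_ID")
keyspace = read_setting("ASTRA_DB_KEYSPACE", default="default_keyspace")
```

The values could then be passed to `cassio.init` in place of the constants above.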

# **Initialize Astra DB**

In [None]:

cassio.init(
    token=ASTRA_DB_APPLICATION_TOKEN,
    database_id=ASTRA_DB_ID,
    keyspace=KEYSPACE
)


# **Load and process PDF**

In [None]:

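def load_and_chunk_pdf(pdf_path=PDF_PATH):
    """Extract the text of every page and split it into overlapping chunks."""
    # Close the document as soon as the text has been read.
    with fitz.open(pdf_path) as doc:
        raw_text = "".join(page.get_text() for page in doc)

    # Split on newlines into chunks of up to 800 characters,
    # with up to 200 characters shared between neighbouring chunks.
    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=800,
        chunk_overlap=200,
        length_function=len,
    )
    return text_splitter.split_text(raw_text)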

text_chunks = load_and_chunk_pdf()
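
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified pure-Python illustration. It is not LangChain's implementation; `chunk_lines` is a hypothetical helper that packs lines greedily and repeats trailing lines that fit within the overlap budget:

```python
from typing import List


def chunk_lines(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Greedily pack lines into chunks, repeating trailing lines as overlap."""
    chunks: List[str] = []
    current: List[str] = []
    for line in text.split("\n"):
        candidate = "\n".join(current + [line])
        if current and len(candidate) > chunk_size:
            chunks.append("\n".join(current))
            # Carry over trailing lines whose joined length fits the overlap budget.
            tail: List[str] = []
            for prev in reversed(current):
                if len("\n".join([prev] + tail)) > chunk_overlap:
                    break
                tail.insert(0, prev)
            current = tail
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


sample = "aaaa\nbbbb\ncccc\ndddd"
for chunk in chunk_lines(sample, chunk_size=9, chunk_overlap=4):
    print(repr(chunk))
```

With these toy values each chunk starts with the last line of the previous one, which is the same kind of repetition visible between the first two entries of `text_chunks`.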

# **Text chunks**

In [None]:
text_chunks

['Kali Linux: A Comprehensive Guide\n## 1. Introduction to Kali Linux\nKali Linux is a Debian-based Linux distribution developed by\nOffensive Security, specifically designed for penetration testing,\nethical hacking, and cybersecurity research. It comes pre-installed\nwith a vast collection of security tools and is widely used by security\nprofessionals, ethical hackers, and digital forensic experts.\n## 2. History and Development\nKali Linux evolved from BackTrack Linux, another popular\nsecurity-focused distribution. In 2013, Offensive Security introduced\nKali Linux as a more robust and flexible alternative, incorporating a\nrolling release model to ensure continuous updates.\n## 3. Features of Kali Linux\n- Pre-installed Security Tools: Comes with over 600 tools for',
 'rolling release model to ensure continuous updates.\n## 3. Features of Kali Linux\n- Pre-installed Security Tools: Comes with over 600 tools for\npenetration testing, forensic analysis, and network security.\n- Cus

## **Initialize models**

In [None]:

llm = LlamaCpp(
    model_path=LLAMA_MODEL_PATH,
    n_ctx=2048,
    n_gpu_layers=40,
    n_threads=8,
    temperature=0.1,  # Lower temperature for more factual responses
    verbose=False
)

embeddings = LlamaCppEmbeddings(
    model_path=LLAMA_MODEL_PATH,
    n_ctx=512
)

llama_init_from_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (4096) -- the full capacity of the model will not be utilized
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /root/.cache/huggingface/hub/models--TheBloke--Llama-2-7b-chat-GGUF/snapshots/191239b3e26b2882fb562ffccdd1cf0f65402adb/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.b

# **Create and populate vector store**

In [None]:

vector_store = Cassandra(
    embedding=embeddings,
    table_name="pdf_qa_store",
    keyspace=KEYSPACE
)


llama_perf_context_print:        load time =    5717.94 ms
llama_perf_context_print: prompt eval time =    5675.35 ms /     7 tokens (  810.76 ms per token,     1.23 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    5719.46 ms /     8 tokens


# **Batch insert with progress tracking**

In [None]:

BATCH_SIZE = 10
for i in range(0, len(text_chunks), BATCH_SIZE):
    batch = text_chunks[i:i+BATCH_SIZE]
    vector_store.add_texts(batch)
    print(f"Inserted batch {i//BATCH_SIZE + 1}/{(len(text_chunks)-1)//BATCH_SIZE + 1}")
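
The batch total printed above is a ceiling division of the chunk count by the batch size. A small sketch of the same slicing on stand-in data; `batched` and `demo_chunks` are hypothetical names, and no database is involved:

```python
import math
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


demo_chunks = [f"chunk {n}" for n in range(23)]  # stand-in for text_chunks
total = math.ceil(len(demo_chunks) / 10)
for number, batch in enumerate(batched(demo_chunks, 10), start=1):
    print(f"batch {number}/{total}: {len(batch)} chunks")
```

For 23 chunks and a batch size of 10 this gives three batches, the last holding the remaining 3 chunks.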

# **Custom prompt template enforcing strict context-based answers**

In [None]:

STRICT_PROMPT = """Use the context below to answer the question. If the answer isn't clearly explained in the context,
respond with "This information is not covered in the document."

Context: {context}
Question: {question}
Answer: """

PROMPT = PromptTemplate(
    template=STRICT_PROMPT,
    input_variables=["context", "question"]
)
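
To see what the model actually receives, the placeholders can be filled by hand. This sketch uses plain `str.format` on a copy of the template rather than LangChain, and assumes the retrieved passages are joined with blank lines (believed to be the default for the `stuff` chain type); the passages and question are made up for illustration:

```python
TEMPLATE = (
    "Use the context below to answer the question. If the answer isn't "
    "clearly explained in the context,\n"
    'respond with "This information is not covered in the document."\n\n'
    "Context: {context}\n"
    "Question: {question}\n"
    "Answer: "
)

# Illustrative passages standing in for retrieved chunks.
retrieved = [
    "Kali Linux is a Debian-based Linux distribution.",
    "Kali Linux evolved from BackTrack Linux.",
]

rendered = TEMPLATE.format(
    context="\n\n".join(retrieved),
    question="What is Kali Linux based on?",
)
print(rendered)
```

The rendered text ends with `Answer: `, so the model's completion is read as the answer.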


# **Create QA chain with optimized retriever settings**

In [None]:

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(
        search_type="similarity",  # Changed from "mmr" to "similarity"
        search_kwargs={"k": 5}  # Retrieve the top 5 most relevant passages
    ),
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True
)
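
Conceptually, a similarity search with `k` results ranks stored embeddings by how close they are to the query embedding and keeps the best `k`. A toy sketch using cosine similarity on hand-made 2-D vectors; the metric Astra DB actually applies depends on how the table is configured, and `cosine_similarity` and `top_k` are hypothetical helpers:

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar vectors, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], doc_vecs, k=2))
```

Here the first vector matches the query exactly and the third is the next closest, so indices 0 and 2 are returned.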

### **Validating the response**

In [None]:
def validate_response(response: str, sources: list) -> bool:
    """Validate if answer content exists in retrieved source documents."""
    if not sources:
        return False  # No source means no valid answer

    # Allow default response to pass validation
    if "not covered in the document" in response.lower():
        return True

    # Convert source text into a single lowercase string
    source_text = " ".join([doc.page_content.lower() for doc in sources])

    # Extract meaningful words from response (excluding common stopwords)
    stopwords = {"the", "a", "an", "is", "in", "of", "and", "to"}
    answer_keywords = set(response.lower().split()) - stopwords

    # Check if at least one keyword from response exists in the source text
    return any(keyword in source_text for keyword in answer_keywords)
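
Because the check above passes as soon as any single word of the answer occurs as a substring of the sources, it rejects very little. One possible stricter variant, sketched under assumptions: whole-word matching and a minimum overlap ratio. The helpers `tokenize`, `keyword_overlap`, and `is_grounded` are hypothetical, and the 0.5 threshold is an arbitrary starting point rather than a tuned value:

```python
import re
from typing import Iterable, Set

STOPWORDS = {"the", "a", "an", "is", "in", "of", "and", "to"}


def tokenize(text: str) -> Set[str]:
    """Lowercase alphanumeric words with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def keyword_overlap(response: str, source_texts: Iterable[str]) -> float:
    """Fraction of answer keywords that appear as whole words in the sources."""
    answer_words = tokenize(response)
    if not answer_words:
        return 0.0
    source_words: Set[str] = set()
    for text in source_texts:
        source_words |= tokenize(text)
    return len(answer_words & source_words) / len(answer_words)


def is_grounded(response: str, source_texts: Iterable[str], threshold: float = 0.5) -> bool:
    """True when enough of the answer's keywords are backed by the sources."""
    return keyword_overlap(response, source_texts) >= threshold


sources_text = ["Kali Linux is a Debian-based Linux distribution developed by Offensive Security."]
print(is_grounded("Kali Linux is based on Debian.", sources_text))
print(is_grounded("Windows ships with a graphical installer.", sources_text))
```

In the notebook the source texts would come from `[doc.page_content for doc in sources]`.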

# **Taking query input until exit or quit is typed**

In [None]:
# Interactive QA loop with validation
while True:
    try:
        query = input("\nQuestion (type 'exit' to quit): ").strip()
        if not query:
            continue  # Skip empty input instead of querying the model
        if query.lower() in ("exit", "quit", "q"):
            break

        result = qa_chain.invoke({"query": query})
        response = result["result"]
        sources = result["source_documents"]

        # Fall back to the default reply when nothing was retrieved
        # or the answer cannot be matched against the retrieved sources.
        if not sources or not validate_response(response, sources):
            response = "This information is not covered in the document."

        # Display final answer
        print(f"\nANSWER: {response}")

        # Display source references
        if sources:
            print("\nSOURCE REFERENCES:")
            for i, doc in enumerate(sources[:3], 1):  # Limit to 3 sources
                print(f"[Reference {i}] {doc.page_content[:200]}...")  # Show first 200 chars
        else:
            print("\nNo supporting documents found.")

    except KeyboardInterrupt:
        break
    except Exception as e:
        print(f"Error: {e}")

print("\nSession ended.")



Question (type 'exit' to quit): what are the recommed of kali tools


llama_perf_context_print:        load time =    5717.94 ms
llama_perf_context_print: prompt eval time =    5491.41 ms /    10 tokens (  549.14 ms per token,     1.82 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    5539.31 ms /    11 tokens



ANSWER: 
The document provides information on the features, installation process, and basic commands of Kali Linux. However, it does not cover the recommended tools for ethical hacking or digital forensics using Kali Linux. To answer this question, you would need to consult other resources or expert opinions on the topic. Therefore, I respond with "This information is not covered in the document."

SOURCE REFERENCES:
[Reference 1] Kali Linux: A Comprehensive Guide
## 1. Introduction to Kali Linux
Kali Linux is a Debian-based Linux distribution developed by
Offensive Security, specifically designed for penetration testing,
ethic...
[Reference 2] - 20GB of free disk space
- 64-bit or 32-bit processor
- Bootable USB drive or DVD
### 4.2 Installation Steps
1. Download the latest Kali Linux ISO from the official website.
2. Create a bootable USB ...
[Reference 3] rolling release model to ensure continuous updates.
## 3. Features of Kali Linux
- Pre-installed Security Tools: Comes with over

llama_perf_context_print:        load time =    5717.94 ms
llama_perf_context_print: prompt eval time =    3821.57 ms /     9 tokens (  424.62 ms per token,     2.36 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    3869.65 ms /    10 tokens



ANSWER:  Basic Kali Linux Commands include:
- ls - List directory contents
- cd - Change directory
- pwd - Print working directory
- cp - Copy files and directories
- mv - Move or rename files
- whoami - Displays the current user
- passwd - Change user password
- adduser - Create a new user
- shutdown -h now - Shutdown system
- reboot - Restart system

SOURCE REFERENCES:
[Reference 1] Kali Linux: A Comprehensive Guide
## 1. Introduction to Kali Linux
Kali Linux is a Debian-based Linux distribution developed by
Offensive Security, specifically designed for penetration testing,
ethic...
[Reference 2] - 20GB of free disk space
- 64-bit or 32-bit processor
- Bootable USB drive or DVD
### 4.2 Installation Steps
1. Download the latest Kali Linux ISO from the official website.
2. Create a bootable USB ...
[Reference 3] rolling release model to ensure continuous updates.
## 3. Features of Kali Linux
- Pre-installed Security Tools: Comes with over 600 tools for
penetration testing, forensic an

llama_perf_context_print:        load time =    5717.94 ms
llama_perf_context_print: prompt eval time =    2433.61 ms /     5 tokens (  486.72 ms per token,     2.05 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    2486.77 ms /     6 tokens



ANSWER:  Kali Linux is a Debian-based Linux distribution developed by Offensive Security, specifically designed for penetration testing, ethical hacking, and cybersecurity research. It comes pre-installed with a vast collection of security tools and is widely used by security professionals, ethical hackers, and digital forensic experts.

SOURCE REFERENCES:
[Reference 1] Kali Linux: A Comprehensive Guide
## 1. Introduction to Kali Linux
Kali Linux is a Debian-based Linux distribution developed by
Offensive Security, specifically designed for penetration testing,
ethic...
[Reference 2] - 20GB of free disk space
- 64-bit or 32-bit processor
- Bootable USB drive or DVD
### 4.2 Installation Steps
1. Download the latest Kali Linux ISO from the official website.
2. Create a bootable USB ...
[Reference 3] rolling release model to ensure continuous updates.
## 3. Features of Kali Linux
- Pre-installed Security Tools: Comes with over 600 tools for
penetration testing, forensic analysis, and net