### Author : Rahul Bhoyar



In this tutorial, we convert the rows of a CSV file into a structured PDF document. The resulting PDF serves as the knowledge base for a RAG (Retrieval-Augmented Generation) pipeline, which we then query for dataset recommendations.

Installing the necessary libraries

In [1]:
!pip install reportlab pandas tqdm



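The function below reads the CSV file and writes each row to the PDF as wrapped "column: value" lines, with a separator line after every row.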

In [27]:
import pandas as pd
from reportlab.pdfgen import canvas
from tqdm import tqdm

def create_pdf(csv_file, pdf_file):
    # Read CSV data into a pandas DataFrame
    df = pd.read_csv(csv_file)

    # Create a PDF document
    pdf = canvas.Canvas(pdf_file)

    # Set font for the content
    pdf.setFont("Helvetica", 12)

    # Set consistent margins
    top_margin = 750
    bottom_margin = 50
    left_margin = 100
    right_margin = 500

    # Function to wrap text based on available space
    def wrap_text(text, available_space):
        words = text.split()
        lines = []
        current_line = words[0]

        for word in words[1:]:
            if pdf.stringWidth(current_line + " " + word, "Helvetica", 12) < available_space:
                current_line += " " + word
            else:
                lines.append(current_line)
                current_line = word

        lines.append(current_line)
        return lines

    # Write data to the PDF for each row
    y = top_margin  # Current vertical position on the page
    for index, row in tqdm(df.iterrows(), total=len(df)):
        # Write data to the PDF for each column in the row
        for col in df.columns:
            text = f"{col}: {row[col]}"
            lines = wrap_text(text, right_margin - left_margin)

            for line in lines:
                # Check before every line so long rows cannot run off the page
                if y < bottom_margin:
                    pdf.showPage()  # Create a new page
                    pdf.setFont("Helvetica", 12)  # showPage() resets the font
                    y = top_margin  # Start again at the top of the new page

                pdf.drawString(left_margin, y, line)
                y -= 15  # Adjust Y position for the next line

        # Draw separator line after each row
        pdf.line(left_margin, y, right_margin, y)
        y -= 20  # Add extra space between rows

    # Save the PDF outside the loop, after all rows have been processed
    pdf.save()
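
The `wrap_text` helper follows a greedy strategy: it keeps appending words to the current line until the next word would exceed the available width, then starts a new line. The sketch below isolates that logic from ReportLab so it can be tried on its own. The name `greedy_wrap` and the `measure` parameter are assumptions made for illustration; `measure` defaults to counting characters, where the notebook uses `pdf.stringWidth` to measure points.

```python
def greedy_wrap(text, max_width, measure=len):
    """Greedy word wrap; `measure` is a stand-in for pdf.stringWidth."""
    words = text.split()
    if not words:
        return []  # Nothing to wrap

    lines = []
    current = words[0]
    for word in words[1:]:
        candidate = current + " " + word
        if measure(candidate) < max_width:
            current = candidate  # The word still fits on the current line
        else:
            lines.append(current)  # Line is full: store it and start a new one
            current = word
    lines.append(current)
    return lines


wrapped = greedy_wrap("title: Appliances energy prediction dataset", 20)
print(wrapped)
```

A single word wider than `max_width` is still placed on its own line rather than split, which matches the behaviour of the helper above.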


In [28]:
CSV_FILE_PATH = "kaggle_datasets_v4.csv"

In [29]:
pdf_file_path = "knowlege_base_kaggle_datasets.pdf"  # Replace with your desired PDF output path

create_pdf(CSV_FILE_PATH, pdf_file_path)
print("File saved.")

2155it [00:03, 689.97it/s]


File saved.
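
Each record in the PDF is a block of "column: value" lines, and that text is what a retriever later searches. To preview it without opening the PDF, something like the following could be used. This is a minimal sketch using only the standard `csv` module; the sample data and the column names `title` and `url` are made up for illustration and are not taken from the real CSV file.

```python
import csv
import io

# Made-up sample data standing in for the real CSV file
SAMPLE_CSV = """title,url
Appliances energy prediction,https://example.com/energy
Earthquake dataset,https://example.com/quake
"""


def rows_to_records(csv_text):
    """Turn each CSV row into a block of 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        lines = [f"{col}: {value}" for col, value in row.items()]
        records.append("\n".join(lines))
    return records


records = rows_to_records(SAMPLE_CSV)
print(records[0])
```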


Setting up the LLM and the retrieval chain

In [None]:
!pip install langchain langchain_openai

In [None]:
from langchain_openai import ChatOpenAI

# ChatOpenAI reads the API key from the OPENAI_API_KEY environment variable
llm = ChatOpenAI()

In [None]:
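from langchain.chains import ConversationalRetrievalChain

# new_db2 must be an existing vector store (presumably built from the PDF above);
# its creation is not shown in this notebook.
chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=new_db2.as_retriever())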
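
The chain depends on a retriever, which returns the chunks of the knowledge base most similar to the question so the LLM can answer from them. The notebook does not define `new_db2`, so the following is only a conceptual sketch of what retrieval does, not the actual vector store. It assumes word-count vectors and cosine similarity as a stand-in for the learned embeddings a real store would use; `tokenize`, `cosine`, `retrieve`, and the sample chunks are all hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


# Made-up chunks in the same "column: value" style as the PDF
chunks = [
    "title: Appliances energy prediction",
    "title: Earthquake dataset",
    "title: Malicious URLs dataset",
]
top = retrieve("energy consumption datasets", chunks, k=1)
print(top)
```

Because this toy version matches exact words, "datasets" does not match "dataset"; embedding-based retrieval is generally more forgiving of such variations.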


Querying

In [None]:
query = """

Give me the dataset for Energy Consumption. I want at least 10 datasets.
Format should be :
Sr No -
Dataset name -
URL -

"""

In [None]:
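CHAT_HISTORY = []  # No earlier turns for the first question

# invoke() is the current way to run a chain; calling chain(...) directly is deprecated
result = chain.invoke({"question": query, "chat_history": CHAT_HISTORY})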
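
`CHAT_HISTORY` is empty here because this is the first question. For follow-up questions, earlier turns need to be passed back in so the chain can resolve references such as "those datasets". The sketch below shows that bookkeeping, assuming the history is kept as a list of (question, answer) tuples, which is one of the formats the chain accepts. `fake_chain` is a hypothetical stand-in so the example runs without an API key.

```python
def fake_chain(inputs):
    """Hypothetical stand-in for the real chain; returns a canned answer."""
    turns = len(inputs["chat_history"])
    return {"answer": f"answer to {inputs['question']!r} (saw {turns} earlier turns)"}


chat_history = []
for question in ["Datasets on energy?", "Only the ones from UCI, please."]:
    result = fake_chain({"question": question, "chat_history": chat_history})
    # Record the finished turn so the next question can refer back to it
    chat_history.append((question, result["answer"]))

print(len(chat_history))
```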

In [None]:
result['answer']

"I found a dataset related to energy prediction that you might find interesting:\n\n1. Dataset name: Appliances energy prediction\n   URL: https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction\n\nUnfortunately, I couldn't find 10 datasets specifically related to energy consumption."

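The answer comes back as free text, so any downstream use needs some parsing. A minimal sketch follows, assuming the model keeps the "Dataset name: / URL:" layout seen in the answer above. The pattern is tailored to that one answer; an LLM's output format is not guaranteed, so other answers may need a different pattern.

```python
import re

# The answer string returned above, copied verbatim
answer = (
    "I found a dataset related to energy prediction that you might find interesting:\n\n"
    "1. Dataset name: Appliances energy prediction\n"
    "   URL: https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction\n\n"
    "Unfortunately, I couldn't find 10 datasets specifically related to energy consumption."
)

# A name line followed by a URL line; whitespace around both is ignored
PATTERN = re.compile(r"Dataset name:\s*(?P<name>.+?)\s*\n\s*URL:\s*(?P<url>\S+)")

datasets = [match.groupdict() for match in PATTERN.finditer(answer)]
print(datasets)
```
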
In [None]:
1
Cardiovascular Disease Dataset
https://github.com/durgeshrao9993/heart-failure-dataset

2
A-Z Medicine Dataset of India
https://github.com/shudhanshusingh/az-medicine-dataset-of-india

In [None]:
1 - Electoral Datasets - https://www.kaggle.com/datasets/arnabmitra007/electoral-datasets
2 - polityafter91 - https://www.kaggle.com/datasets/prakharatre99/polityafter91
3 - Bangladesh's 2024 Election News Dataset - https://www.kaggle.com/datasets/mahinshikder/polotical

In [None]:
1. Malicious URLs dataset
   URL: sid321axn/malicious-urls-dataset

2. WEB-D-E
   URL: saurabhshahane/webdemon

3. Earthquake dataset
   URL: warcoder/earthquake-dataset