## Author : Rahul Bhoyar

### Proof of Concept: Leveraging Language Models for the Construction of a Comprehensive Dataset


**Objective:**

The aim is to assess the feasibility and effectiveness of using a Large Language Model (LLM) to construct a comprehensive Kaggle dataset.

**Tutorial:**

This tutorial outlines a systematic approach to building datasets with the help of a Large Language Model (LLM). A web page loader crawls the Kaggle search page, and the LLM extracts the relevant dataset details from the crawled text.

**Steps:**

**1. User Input:** Begin by prompting the user to provide a keyword for which the dataset is to be constructed.

**2. URL Construction:** Utilize the provided keyword to construct a Kaggle URL, which will serve as the target for web crawling in the subsequent steps.

**3. Web Page Loader Initialization:** Use LangChain's *WebBaseLoader* to start the crawling process. The loader fetches the web page, extracts its text with Beautiful Soup, and returns the result as a "document."

**4. Vectorization and Database Storage:** Once the document is obtained, proceed to vectorize its contents. Store the resulting vectors in a dedicated Vector Database. This step ensures a structured and organized representation of the crawled data.

**5. Querying with GPT-4:** Use GPT-4, accessed through the OpenAI API, to query the data stored in the Vector Database. Construct queries that extract the relevant information about Kaggle datasets from the documents.

**6. Execution of Queries:** Execute the queries with GPT-4 to obtain the specific details of the Kaggle datasets present in the crawled documents.

By following these systematic steps, users can harness the power of LLMs to construct and query comprehensive datasets from Kaggle, facilitating streamlined access to valuable information for diverse purposes.


#### Step 1 -  User Input: Begin by prompting the user to provide a keyword for which the dataset is to be constructed.

(A) Install the required libraries.

In [None]:
!pip install langchain langchain-community langchain-openai faiss-cpu beautifulsoup4



(B) Read the keyword for which the dataset is to be constructed.

In [None]:
# input() already returns a string; strip() removes accidental surrounding whitespace.
keyword_input = input("Enter the keyword for which you need to construct the Kaggle dataset :").strip()

print("-"*100)
print("Provided input by user is :", keyword_input)

Enter the keyword for which you need to construct the Kaggle dataset :Healthcare
----------------------------------------------------------------------------------------------------
Provided input by user is : Healthcare


#### Step 2 - URL Construction: Utilize the provided keyword to construct a Kaggle URL, which will serve as the target for web crawling in the subsequent steps.

Construct the Kaggle search URL from the keyword above.

In [None]:
URL = f"https://www.kaggle.com/search?q={keyword_input}"

print("Constructed Kaggle URL is :",URL)

Constructed Kaggle URL is : https://www.kaggle.com/search?q=Healthcare
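
The f-string above works for a single-word keyword such as "Healthcare". For keywords that contain spaces or characters like `&`, the query string should be percent-encoded. The following is a minimal sketch using the standard library; the helper name `build_search_url` and the constant `KAGGLE_SEARCH_URL` are illustrative and not part of the original notebook.

```python
from urllib.parse import urlencode

# Base URL taken from the cell above; the constant name is an assumption.
KAGGLE_SEARCH_URL = "https://www.kaggle.com/search"


def build_search_url(keyword: str) -> str:
    """Return the Kaggle search URL for a keyword, percent-encoding unsafe characters."""
    # urlencode turns spaces into "+" and escapes characters such as "&" or "#".
    return f"{KAGGLE_SEARCH_URL}?{urlencode({'q': keyword.strip()})}"


print(build_search_url("Healthcare"))
print(build_search_url("heart disease & stroke"))
```

The resulting string can be passed to the loader in place of `URL`.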


#### Step 3 - Web Page Loader Initialization: Use LangChain's WebBaseLoader to start the crawling process. The loader fetches the web page, extracts its text with Beautiful Soup, and returns the result as a "document."

(A) Set up the OpenAI environment. This requires an OpenAI API access key.

We will store it in the environment variable "OPENAI_API_KEY".

In [None]:
import os
from getpass import getpass

# Read the key at runtime instead of hardcoding it; never print or commit a secret.
openai_api_key = os.environ.get("OPENAI_API_KEY") or getpass("Enter your OPENAI_API_KEY: ")
os.environ["OPENAI_API_KEY"] = openai_api_key
print("OPENAI API key is set successfully.")

OPENAI API key is set successfully.


(B) Load the web page and store its content as a document.

In [None]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(URL)
documents = loader.load()
print("Document object created.")

Document object created.


In [None]:
print("Document loaded from Kaggle Webpage :")
print("-"*200)
print(documents)

Document loaded from Kaggle Webpage :
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[Document(page_content='\n\n\n\nSearch | Kaggle\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', metadata={'source': 'https://www.kaggle.com/search?q=Healthcare', 'title': 'Search | Kaggle', 'description': 'Search for anything on Kaggle.', 'language': 'en'})]


In [None]:
print("The length of the document object :", len(documents))

The length of the document object : 1
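
The `page_content` printed above consists of the page title surrounded by newlines, so it is worth checking the crawl result before embedding it. The sketch below is one possible heuristic, not part of the original notebook: the function name `has_meaningful_content` and the 200-character threshold are assumptions chosen for illustration.

```python
def has_meaningful_content(page_content: str, title: str = "", min_chars: int = 200) -> bool:
    """Heuristic: True if the crawled text holds more than the page title and whitespace."""
    # Collapse all runs of whitespace so padding newlines do not count as content.
    text = " ".join(page_content.split())
    # Drop the title, since a page may return nothing else in its static HTML.
    if title:
        text = text.replace(title, "").strip()
    # min_chars is an arbitrary threshold (an assumption); tune it for the target site.
    return len(text) >= min_chars


# The page_content shown above, shortened: only the title surrounded by newlines.
crawled = "\n\n\n\nSearch | Kaggle\n\n\n\n"
print(has_meaningful_content(crawled, title="Search | Kaggle"))
```

In the notebook, the same check could be applied to `documents[0].page_content` and `documents[0].metadata["title"]` to stop early when the loader returns an effectively empty page.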


#### Step 4 - Vectorization and Database Storage:

Once the document is obtained, proceed to vectorize its contents. Store the resulting vectors in a dedicated Vector Database. This step ensures a structured and organized representation of the crawled data.

(A) Initialise the LLM. We will use GPT-4 as the base LLM.

In [None]:
from langchain_openai import ChatOpenAI
MODEL_NAME = "gpt-4"
llm = ChatOpenAI(model = MODEL_NAME)
print("LLM model loaded successfully.")

LLM model loaded successfully.


(B) Initialise the embedding model, which converts the document text into a numerical (vector) representation.

In [None]:
from langchain_openai import OpenAIEmbeddings

# Uses the default OpenAI embedding model, which is separate from the GPT-4 chat model.
embeddings = OpenAIEmbeddings()
print("Embedding model loaded successfully.")

Embedding model loaded successfully.


(C) Split the document into chunks, embed them, and store the resulting vectors in a FAISS vector store.

In [None]:
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter


text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(documents)
vector = FAISS.from_documents(documents, embeddings)

print("Vector object created successfully.")

Vector object created successfully.
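
To make the idea behind the vector store concrete, here is a toy sketch of similarity search in plain Python. It is not how FAISS or `OpenAIEmbeddings` work internally: the bag-of-words `embed` function is a deliberately simple stand-in, and the chunk texts are made-up examples.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(query: str, chunks: list[str]) -> str:
    """Return the chunk whose vector is closest to the query vector."""
    query_vector = embed(query)
    return max(chunks, key=lambda chunk: cosine_similarity(query_vector, embed(chunk)))


# Made-up chunks, used only to illustrate the lookup.
chunks = [
    "heart disease patient records dataset",
    "stock market daily prices dataset",
]
print(most_similar("patient heart records", chunks))
```

A real vector store follows the same principle, with dense vectors from the embedding model and an index optimised for nearest-neighbour search.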


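(D) Create the chain objects: a document chain that inserts the retrieved chunks into the prompt, and a retrieval chain that connects it to the vector store retriever.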

In [None]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template("""Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}""")

document_chain = create_stuff_documents_chain(llm, prompt)

In [None]:
from langchain.chains import create_retrieval_chain

retriever = vector.as_retriever()
retrieval_chain = create_retrieval_chain(retriever, document_chain)
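
Conceptually, the retrieval chain fetches the most relevant chunks and the "stuff" document chain places them into the `{context}` slot of the prompt. The sketch below approximates that substitution in plain Python. The helper `build_prompt` is hypothetical, and joining chunks with a blank line is an assumption about the default behaviour; the actual LangChain implementation may differ in detail.

```python
PROMPT_TEMPLATE = """Answer the following question based only on the provided context:

<context>
{context}
</context>

Question: {input}"""


def build_prompt(chunks: list[str], question: str) -> str:
    """Join the retrieved chunks and substitute them into the prompt template."""
    # Assumption: chunks are separated by a blank line inside the context block.
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, input=question)


# If the crawled document holds only the page title, this is all the model gets to see.
print(build_prompt(["Search | Kaggle"], "Which datasets are listed?"))
```

Because the model is told to answer only from the context, an almost empty context leaves it with nothing to extract.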

#### Step 5 and Step 6 - Querying and Execution of Queries: Construct the queries and execute them with GPT-4 to obtain the specific details of the Kaggle datasets present in the crawled documents.

(A) Creating a query so that we can fetch the data in the desired format.

In [None]:
query = """
Give me all the Kaggle datasets, each with its link and description, from the text that you have.
Give me all possible datasets.
Format should be like this:
index : Serial number of the dataset
(next line)
dataset_name : Name of the dataset
(next line)
description : Description of the dataset
(next line)
link : Link of the dataset
(next line)
"""

In [None]:
response = retrieval_chain.invoke({"input": query})
answer = response["answer"]

In [None]:
print("-"*200)
print("The answer is :")
print("-"*200)
print(answer)

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The answer is :
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The text does not provide any Kaggle datasets or their descriptions.
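
Had the model returned entries in the requested format, they would still need to be converted into rows to form a dataset. The following is a minimal parsing sketch under that assumption. The function `parse_answer` is hypothetical, and the sample entry, including its link, is made up purely to exercise the parser; it is not output from the notebook.

```python
def parse_answer(answer: str) -> list[dict[str, str]]:
    """Parse 'key : value' lines into one dict per dataset; each 'index' starts a new record."""
    fields = {"index", "dataset_name", "description", "link"}
    records: list[dict[str, str]] = []
    for line in answer.splitlines():
        # Split on the first colon only, so "https://..." in a link stays intact.
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or key not in fields:
            continue  # skip blank lines and free text such as a refusal
        if key == "index" or not records:
            records.append({})
        records[-1][key] = value.strip()
    return records


# Made-up sample in the format requested by the query above.
sample = """index : 1
dataset_name : Example Dataset
description : A made-up entry used only to exercise the parser
link : https://www.kaggle.com/datasets/example/example-dataset
"""
print(parse_answer(sample))
print(parse_answer("The text does not provide any Kaggle datasets or their descriptions."))
```

For the answer actually obtained above, the parser returns an empty list, which makes the failure easy to detect programmatically.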


### Conclusion :


Using an LLM to build a comprehensive dataset in this way runs into a limitation: LangChain's Beautiful Soup based WebBaseLoader could not retrieve the Kaggle search results. The loaded document contained only the page title ("Search | Kaggle"), which suggests that the results are rendered in the browser by JavaScript and are absent from the static HTML that the loader fetches.

Consequently, the RAG-based approach, as implemented here, is not feasible for constructing the Kaggle dataset: with no dataset entries in the crawled text, GPT-4 has nothing to extract.