<div style="overflow: hidden;">
  <!-- Colab Button on the Left -->
  <a href="https://colab.research.google.com/github/trashpanda-ai/TBD.ipynb" target="_parent">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" width="150" style="float: left; margin-right: 15px;" />
  </a>
  
  <!-- Text on the Right -->
  <div style="text-align: right;">
    <h3><span style="color:gray"> GPT Tool for HAI Literature Review </span></h3>
  </div>
</div>

<br>
<br>
<br>


<h1><center>Retrieval Augmented Generation (RAG) Chatbot with OpenAI LLM and LangChain</center></h1>
<h2><center> <span style="font-weight:normal"><font color='#0065bd'> Take home assignment for an application </font>  </span></center></h2>


<h3><center><font color='gray'>JONAS GOTTAL</font></center></h3>





### GPT-based Research Summarization and Literature Review Tool (RAG System)
This solution uses a Retrieval-Augmented Generation (RAG) framework: a collection of research papers is combined with an interactive GPT-powered chatbot for real-time querying and information retrieval. The system enables users to:

- Summarize Research Articles: The tool stores and indexes the 32 reference papers, allowing the user to query specific aspects of each article. The chatbot can extract and summarize key details such as the research context, questions, findings, themes (like human vs. AI comparisons or human-AI collaboration), methods, contributions, and future directions, providing concise, on-demand summaries.
- Generate Literature Reviews: The system allows users to input custom queries (e.g., "I want all research on human vs. AI and empirical studies"). Based on the query, the system retrieves the most relevant papers from the database, summarizes them, and synthesizes a literature review that discusses common themes, trends, and potential future research areas. The review is presented in coherent paragraphs, with citations formatted according to academic standards.

By combining a database of research papers with an interactive chatbot interface, this RAG-powered tool lets users engage with academic literature, ask detailed questions, and receive customized summaries and literature reviews in real time.

### Detailed requirements
#### Function 1: Guidance
For each of the 32 reference papers (PDF), your GPT tool should extract the following information:
1. Context: Specify whether the study focuses on a specific industry or task, or has a broader, conceptual scope.
2. Research Question and Findings: Identify the main research question and summarise the key findings.
3. Theme of Research:
   - Human vs. AI: Highlight any comparison of the relative advantages of humans and AI, including condition-based results or scenarios where one outperforms the other.
   - Human + AI Collaboration: Indicate the type of collaboration discussed, such as the roles of humans and AI and the sequence of actions each takes in the process.
4. Method: Classify the study method as one of the following:
   - Conceptual/Case Study
   - Modeling: Either Stylized Modeling or Operations Research (OR) Model
   - Empirical Study: Lab/Field Experiment or Secondary Data Analysis
5. Contribution: Identify the primary contribution of the study, categorizing it as theoretical, managerial, or methodological.
6. Future Potential and Limitations: Summarise what the study states about future research directions or the limitations of its findings.
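
The six extraction fields above can be pictured as one structured record per paper. The following is a minimal sketch under assumptions: the class and field names (`PaperSummary`, `Method`, `Contribution`) are hypothetical and are not taken from `rag.py`, which may represent summaries differently (for example as free text returned by the LLM).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Method(Enum):
    """Study method categories from requirement 4."""
    CONCEPTUAL_CASE_STUDY = "Conceptual/Case Study"
    MODELING = "Modeling"
    EMPIRICAL = "Empirical Study"


class Contribution(Enum):
    """Contribution categories from requirement 5."""
    THEORETICAL = "theoretical"
    MANAGERIAL = "managerial"
    METHODOLOGICAL = "methodological"


@dataclass
class PaperSummary:
    """Hypothetical record holding the six fields for one reference paper."""
    title: str
    context: str
    research_question: str
    findings: str
    human_vs_ai: Optional[str]             # None if the paper makes no comparison
    human_ai_collaboration: Optional[str]  # None if no collaboration is discussed
    method: Method
    contribution: Contribution
    future_and_limitations: str

    def themes(self) -> List[str]:
        """Return the theme labels (requirement 3) that apply to this paper."""
        labels = []
        if self.human_vs_ai:
            labels.append("Human vs. AI")
        if self.human_ai_collaboration:
            labels.append("Human + AI Collaboration")
        return labels
```

Keeping the method and contribution as enumerations rather than free text would make later filtering (as in Function 2) a simple equality check.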

#### Function 2: Guidance
If a user types "I want all research to demonstrate 1) human VS AI research, and 2) empirical research," the system should:
- Retrieve all relevant articles from the backend database that discuss human vs. AI research and empirical research.
- Generate summaries of these articles in the specified format.
- Compile the summaries into coherent paragraphs that discuss how the research in these areas is connected, identifying common themes, trends, and potential future directions.
- Provide a reference list for all retrieved articles, formatted according to standard academic citation styles.
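
The retrieve-and-cite steps above can be sketched as a filter over already-extracted records. This is an illustration only: it assumes each paper has been reduced to a dictionary with `themes`, `method`, `authors`, `year`, and `title` keys (all hypothetical names), uses placeholder records rather than real papers, and formats references in a simplified author-year style. The actual tool retrieves via the RAG pipeline rather than a metadata filter.

```python
from typing import Dict, List


def select_papers(papers: List[Dict], theme: str, method: str) -> List[Dict]:
    """Keep papers that carry the requested theme AND use the requested method."""
    return [p for p in papers if theme in p["themes"] and p["method"] == method]


def format_reference(paper: Dict) -> str:
    """Simplified author-year reference line (not a full citation style)."""
    return f'{paper["authors"]} ({paper["year"]}). {paper["title"]}.'


def reference_list(papers: List[Dict]) -> List[str]:
    """Alphabetically sorted reference lines for the selected papers."""
    return sorted(format_reference(p) for p in papers)


# Placeholder records, not real publications.
papers = [
    {"authors": "Author B", "year": 2001, "title": "Placeholder Paper B",
     "themes": ["Human vs. AI"], "method": "Empirical Study"},
    {"authors": "Author A", "year": 2000, "title": "Placeholder Paper A",
     "themes": ["Human vs. AI"], "method": "Empirical Study"},
    {"authors": "Author C", "year": 2002, "title": "Placeholder Paper C",
     "themes": ["Human + AI Collaboration"], "method": "Modeling"},
]

selected = select_papers(papers, theme="Human vs. AI", method="Empirical Study")
for line in reference_list(selected):
    print(line)
```

In the full workflow, the summaries of the selected papers would then be passed to the LLM to write the connecting paragraphs.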

### Interpretation of requirements
Since users interact with the system through natural-language queries, the implementation needs a variant of RAG. This implementation uses a vector database built with LangChain for retrieval. The LLM is accessed through OpenAI's API.
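
To make the retrieve-then-generate idea concrete, here is a toy sketch in plain Python. It is not the implementation in `rag.py`: word counts stand in for a real embedding model, a sorted list stands in for the vector database, and the function names (`embed`, `retrieve`, `build_prompt`) and the prompt wording are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context_chunks: List[str]) -> str:
    """Combine retrieved context and the user question into one LLM prompt."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Placeholder chunks, not text from the actual papers.
chunks = [
    "placeholder chunk about human versus AI comparison",
    "placeholder chunk about inventory modeling",
]
top = retrieve("human versus AI", chunks, k=1)
prompt = build_prompt("human versus AI", top)
```

The resulting prompt string is what would be sent to the LLM; in the real system, the retrieval step is handled by the vector database and the generation step by the OpenAI model.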

### Install all packages:

In [1]:
#from google.colab import drive
#drive.mount('/content/drive/')
import warnings
warnings.filterwarnings("ignore")
!pip install langchain -q
!pip install openai -q
!pip install tiktoken -q
!pip install faiss-gpu -q
!pip install langchain_experimental -q
!pip install unstructured -q
!pip install "unstructured[pdf]"  -q



### Upload your literature folder as a zip file:

In [4]:
# get folder of PDFs
# unfortunately too large for github and google drive...
!wget https://syncandshare.lrz.de/dl/fiRNoSMHyQvo58L8YKuK3b/literature.zip
!wget https://raw.githubusercontent.com/trashpanda-ai/RAG-chatbot/refs/heads/main/__init__.py
!wget https://raw.githubusercontent.com/trashpanda-ai/RAG-chatbot/refs/heads/main/rag.py
!unzip literature.zip
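
After unzipping, it can be worth confirming that all expected PDFs are present before the index is built. The following is a small sketch under assumptions: the helper names are hypothetical, and the extracted folder name (`literature` in the usage note) is guessed from the archive name rather than confirmed by the notebook.

```python
from pathlib import Path
from typing import List


def list_pdfs(folder: str) -> List[Path]:
    """Return all PDF files under `folder` (searched recursively), sorted by path."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def check_corpus(folder: str, expected: int = 32) -> bool:
    """True if the folder contains exactly the expected number of PDFs."""
    return len(list_pdfs(folder)) == expected
```

For example, `check_corpus("literature")` would return `True` only if exactly 32 PDFs were extracted, assuming that folder name is correct.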


In [None]:
# if the exact same data is used, you can "recycle" the vector database which speeds up the start time
#!wget https://syncandshare.lrz.de/dl/fi9E4ZMTV2QEaNADSkazuq/vectorstore.zip
#!unzip vectorstore.zip
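
Recycling the vector database follows a load-or-build pattern: reuse a saved index if one exists on disk, otherwise build it from the PDFs. The sketch below only illustrates that pattern; how `rag.py` actually detects and loads a saved index is not shown in this notebook, and the `load` and `build` callables are hypothetical placeholders.

```python
import os
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_build(path: str, load: Callable[[str], T], build: Callable[[str], T]) -> T:
    """Reuse a saved index at `path` if it exists and is non-empty, else build a new one."""
    if os.path.isdir(path) and os.listdir(path):
        return load(path)
    return build(path)
```

Building embeddings for every PDF requires API calls and time, so skipping that step when the data is unchanged is what shortens the start-up.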

### Import the chatbot from ```rag.py```:

In [1]:
from rag import SimpleRAGChat
import os

### Enter your API Key:
(If you don't have one, reach out to me and I'll provide one for testing purposes.)

In [5]:
# Prompt the user for their OpenAI API key
api_key = input("Please enter your OpenAI API key: ")

# Set the API key as an environment variable
os.environ["OPENAI_API_KEY"] = api_key
print("OPENAI_API_KEY has been set!")

Please enter your OpenAI API key: NotYourKey
OPENAI_API_KEY has been set!
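
Because `input()` echoes the key into the notebook output, a variant using the standard library's `getpass` may be preferable when the notebook is shared. This is an optional sketch, not part of the original notebook; the helper names `store_api_key` and `prompt_for_api_key` are hypothetical.

```python
import os
from getpass import getpass
from typing import MutableMapping


def store_api_key(key: str, env: MutableMapping[str, str] = os.environ) -> None:
    """Strip whitespace, reject empty keys, and store the key under OPENAI_API_KEY."""
    key = key.strip()
    if not key:
        raise ValueError("API key must not be empty")
    env["OPENAI_API_KEY"] = key


def prompt_for_api_key() -> None:
    """Ask for the key without echoing it to the screen or the cell output."""
    store_api_key(getpass("Please enter your OpenAI API key: "))
```

Calling `prompt_for_api_key()` in a cell would replace the `input()` prompt above while leaving the rest of the notebook unchanged.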


### Let's go!

In [3]:
chat = SimpleRAGChat()

HTML(value='<h2 style="color: #0065bd; font-family: Arial, sans-serif; margin-bottom: 20px;">AI Research Assis…

Output(layout=Layout(border='1px solid #ddd', height='500px', overflow_y='auto', padding='20px', width='800px'…

Text(value='', layout=Layout(border='1px solid #ddd', margin='10px 0', padding='10px', width='800px'), placeho…