## **Article Retrieval System using FAISS and GPT-3.5-turbo**

In [1]:
!pip install ydata-profiling langchain sentence_transformers faiss-cpu openai python-dotenv codecarbon


In [3]:
import pandas as pd
import torch
from openai import OpenAI
import os
from dotenv import load_dotenv
from langchain.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from ydata_profiling import ProfileReport
from codecarbon import EmissionsTracker
import warnings

In [4]:
warnings.filterwarnings("ignore")

In [6]:
load_dotenv('config.env')
openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
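
If `config.env` is missing or does not define the key, `os.environ.get` returns `None`, and the error that follows may not point clearly at the env file. A minimal sketch of an early check is shown here; `require_env` is a hypothetical helper name, not part of the original notebook.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for illustration; the notebook itself reads the
    variable directly with os.environ.get.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check that config.env was loaded")
    return value
```

Under that assumption, the client could be built as `OpenAI(api_key=require_env("OPENAI_API_KEY"))`, so a missing key fails at this cell rather than later.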

In [7]:
filename = 'medium.csv'

In [8]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if device.type == 'cuda':
    gpu_name = torch.cuda.get_device_name(0)
    total_memory = torch.cuda.get_device_properties(0).total_memory
    total_memory_gb = total_memory / (1024**3)
    print(f"GPU is available: {gpu_name} with {total_memory_gb:.2f} GB")
else:
    print("GPU is not available. Using CPU")

GPU is not available. Using CPU


In [16]:
df = pd.read_csv(filename)
profile = ProfileReport(df, title='1300 Kaggle Articles about Data Science')
profile.to_notebook_iframe()


In [17]:
articles = DataFrameLoader(df, page_content_column="Title")
document = articles.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)
splitted_texts = splitter.split_documents(document)
db = FAISS.from_documents(splitted_texts, HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5"))
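
To make the retrieval step less opaque: a vector store embeds the query and returns the stored vectors closest to it. The sketch below illustrates that idea with a brute-force cosine-similarity ranking in plain Python. It is a simplified illustration, not FAISS's implementation and not necessarily the distance metric this index uses; the function names and toy vectors are made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy 2-dimensional "embeddings" (illustrative values only).
toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
print(top_k([1.0, 0.1], toy_docs, k=2))  # → [0, 1]
```

A real index avoids scoring every vector one by one in Python, but the result it returns plays the same role as the `k` indices here.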

In [20]:
def show_rag(query):
    docs = db.similarity_search(query, k=3)
    print(f'Query: {query}')
    print(f'Retrieved documents: {len(docs)}')
    for index, doc in enumerate(docs, start=1):
        title = doc.page_content  # the "Title" column was loaded as page content
        text = doc.metadata['Text'][:500]  # first 500 characters of the article body
        print(f"Document {index}: {title}\n")
        print(f"Text: {text}\n")
        print('-' * 80)  # Print a separator line

In [21]:
query_text = "What is kNN?"
show_rag(query_text)

Query: What is kNN?
Retrieved documents: 3
Document 1: Layman’s Introduction to KNN

Text: Layman’s Introduction to KNN

Photo by timJ on Unsplash

kNN stands for k-Nearest Neighbours. It is a supervised learning algorithm. This means that we train it under supervision. We train it using the labelled data already available to us. Given a labelled dataset consisting of observations (x,y), we would like to capture the relationship between x — the data and y — the label. More formally, we want to learn a function g : X→Y so that given an unseen observation X, g(x) can confidently predict

--------------------------------------------------------------------------------
Document 2: K-Nearest Neighbors (KNN) Algorithm

Text: K-Nearest Neighbors (KNN) Algorithm

A Brief Introduction Afroz Chakure · Follow Published in DataDrivenInvestor · 4 min read · Jul 6, 2019 -- Listen Share

Simple Analogy for K-Nearest Neighbors (K-NN)

In this blog, we’ll talk about one of the most widely used machine 

In [23]:
def retrieve_documents(query, k=3):
    # Return the full article text stored in each retrieved document's metadata
    docs = db.similarity_search(query, k=k)
    return [doc.metadata['Text'] for doc in docs]

In [24]:
def format_documents_for_prompt(retrieved_documents):
    formatted_docs = "\n\n".join([f"Document: {doc[:500]}" for doc in retrieved_documents])
    return formatted_docs
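
The fixed `doc[:500]` slice can cut an article in the middle of a word. One possible variant, sketched here under the assumption that trimming back to the last whole word is acceptable, is shown below; `truncate_at_word` and `format_documents` are hypothetical names, not functions from the notebook.

```python
def truncate_at_word(text: str, limit: int = 500) -> str:
    """Cut text to at most `limit` characters without splitting a word."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    # If the cut already falls on a word boundary, keep it as is.
    if text[limit].isspace():
        return cut
    # Otherwise drop the trailing partial word, if there is a space to cut at.
    head, sep, _ = cut.rpartition(" ")
    return head if sep else cut

def format_documents(docs, limit=500):
    """Join truncated documents into one prompt-ready string."""
    return "\n\n".join(f"Document: {truncate_at_word(d, limit)}" for d in docs)

print(format_documents(["alpha beta gamma", "short"], limit=12))
# Document: alpha beta
#
# Document: short
```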

In [27]:
def llm(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = format_documents_for_prompt(retrieved_documents)

    response = openai_client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. You will be shown the user's question and relevant information from the articles. Answer the user's question using only this information."
            },
            {
                "role": "user",
                "content": f"Question: {query}. \n Information: {information}"
            }
        ]
    )
    return response.choices[0].message.content

In [28]:
query_text = "What is kNN?"
retrieved_documents = retrieve_documents(query_text)
output = llm(query=query_text, retrieved_documents=retrieved_documents, model="gpt-3.5-turbo")
print(output)

kNN stands for k-Nearest Neighbours. It is a supervised learning algorithm trained using labelled data to capture the relationship between data and labels in order to make predictions for new, unseen observations.
