<a href="https://colab.research.google.com/github/vishalkumar-sahu/Intelligent-Support-for-Researchers-A-Learning-Assistance-and-Plagiarism-Detection-Tool/blob/main/Working_on_ArXiv_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Working on the arXiv Dataset -:**

## **Installing Kaggle and Accessing it -:**

In [None]:
! pip install kaggle

In [None]:
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

## **Downloading arXiv metadata(Snapshot) files --**

In [None]:
! kaggle datasets download -d Cornell-University/arxiv

Downloading arxiv.zip to /content
100% 1.17G/1.17G [00:14<00:00, 100MB/s] 
100% 1.17G/1.17G [00:14<00:00, 83.8MB/s]


In [None]:
import numpy as np
import pandas as pd
import json
from collections import Counter
import zipfile

In [None]:
# use a name that does not shadow the built-in zip()
with zipfile.ZipFile('/content/arxiv.zip', 'r') as zf:
    zf.extractall('/content/arXiv_metadata')

## **Getting Metadata from Snapshot json file --**

In [None]:
def get_metadata():
    with open('/content/arXiv_metadata/arxiv-metadata-oai-snapshot.json', 'r') as f:
        for line in f:
            yield line

metadata = get_metadata()
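
Each line yielded by `get_metadata()` is one JSON record. As a minimal sketch, assuming the snapshot is in JSON Lines form with at least `id` and `categories` fields, the records can be parsed lazily and previewed without reading the whole file. The helper name `iter_records`, the sample lines, and the ids in them are illustrative and not taken from the dataset.

```python
import json
from itertools import islice

def iter_records(lines):
    """Lazily parse JSON Lines input, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

# Illustrative stand-in for the snapshot file (made-up ids, assumed layout).
sample_lines = [
    '{"id": "0000.0001", "categories": "cs.DS"}\n',
    '\n',
    '{"id": "0000.0002", "categories": "hep-ph"}\n',
    '{"id": "0000.0003", "categories": "cs.AI cs.LG"}\n',
]

# Preview only the first two records; the rest is never parsed.
preview = list(islice(iter_records(sample_lines), 2))
print([record["id"] for record in preview])
```

With the real file, the same helper could be fed the open file handle instead of `sample_lines`.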

## **Category Mapping of Research Papers of CS domain --**

In [None]:
# every entry needs its own comma; adjacent string literals would be silently concatenated
category_map = ['cs.AI','cs.AR','cs.CC','cs.CE','cs.CG','cs.CL','cs.CR','cs.CV','cs.CY','cs.DB','cs.DC','cs.DL','cs.DM','cs.DS','cs.ET','cs.FL','cs.GL','cs.GR','cs.GT',
'cs.HC','cs.IR','cs.IT','cs.LG','cs.LO','cs.MA','cs.MM','cs.MS','cs.NA','cs.NE','cs.NI','cs.OH','cs.PF','cs.PL','cs.RO','cs.SC','cs.SD','cs.SE','cs.SI','cs.SY']

## **Extracting the CS domain Paper Ids from Metadata file -:**

In [None]:
idx = []
for paper in metadata:
    data = json.loads(paper)
    # `categories` is a single string, so this keeps papers whose only category is a CS one
    if data['categories'] in category_map:
        idx.append(data['id'])

print(len(idx))
print(idx)

158859
['0704.0062', '0704.0108', '0704.0213', '0704.0229', '0704.0301', '0704.0492', '0704.0834', '0704.0858', '0704.0860', '0704.0879', '0704.1267', '0704.1294', '0704.1373', '0704.1394', '0704.1827', '0704.1829', '0704.1838', '0704.1842', '0704.2010', '0704.2295', '0704.2344', '0704.2355', '0704.2542', '0704.3141', '0704.3313', '0704.3433', '0704.3500', '0704.3501', '0704.3515', '0704.3520', '0704.3643', '0704.3708', '0704.3773', '0704.3890', '0704.3905', '0705.0150', '0705.0178', '0705.0197', '0705.0204', '0705.0214', '0705.0262', '0705.0281', '0705.0449', '0705.0450', '0705.0453', '0705.0454', '0705.0599', '0705.0612', '0705.0635', '0705.0693', '0705.0734', '0705.0738', '0705.0742', '0705.0761', '0705.0781', '0705.0828', '0705.0856', '0705.0915', '0705.0952', '0705.0956', '0705.0959', '0705.0960', '0705.0961', '0705.0962', '0705.0969', '0705.0982', '0705.1025', '0705.1031', '0705.1036', '0705.1037', '0705.1038', '0705.1148', '0705.1150', '0705.1209', '0705.1215', '0705.1217', '070
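
The membership test above compares the whole `categories` string against the list, so cross-listed papers are left out. If those should count as well, one possible variant, assuming `categories` is a space-separated string, splits the field and checks for any overlap. `has_cs_category`, the shortened category set, and the sample records are hypothetical.

```python
CS_CATEGORIES = frozenset({"cs.AI", "cs.CL", "cs.LG"})  # shortened set for illustration

def has_cs_category(categories, wanted=CS_CATEGORIES):
    """Return True if any space-separated category is in `wanted`."""
    return not wanted.isdisjoint(categories.split())

# Made-up records that only mimic the assumed field layout.
records = [
    {"id": "a", "categories": "cs.LG"},
    {"id": "b", "categories": "math.CO cs.AI"},
    {"id": "c", "categories": "hep-ph"},
]
matched = [r["id"] for r in records if has_cs_category(r["categories"])]
print(matched)
```

Using a set keeps each lookup cheap, which matters when the check runs once per record over the full snapshot.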

## **Installing and importing necessary dependencies --**

In [None]:
!pip install -qU langchain
!pip install embeddings
!pip install pypdf
!pip install sentence_transformers
!pip install ray
!pip install faiss-cpu

Collecting embeddings
  Downloading embeddings-0.0.8-py3-none-any.whl (12 kB)
Installing collected packages: embeddings
Successfully installed embeddings-0.0.8
Collecting pypdf
  Downloading pypdf-3.12.1-py3-none-any.whl (254 kB)
Installing collected packages: pypdf
Successfully installed pypdf-3.12.1
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)

In [None]:
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.base import Embeddings
from sentence_transformers import SentenceTransformer
from langchain.document_loaders import PyPDFDirectoryLoader

from typing import List
import time
import os
import ray
import numpy as np
import pypdf
from urllib.error import HTTPError
import requests
from time import sleep

## **Initializing the RAY instance --**

In [None]:
ray.init(num_cpus = 1)

2023-07-15 03:06:45,787	INFO worker.py:1636 -- Started a local Ray instance.


Python version: 3.10.12
Ray version: 2.5.1


## **Downloading and Loading (using PyPDFLoader) of PDFs --**

In [None]:
pdf_loaders = []
count = 0
for paper_id in idx:
    pdf_loaders.append(PyPDFLoader("https://arxiv.org/pdf/" + paper_id))
    count += 1

    if count == 1500:  # stop after the first 1500 papers
        break

    if count % 400 == 0:
        sleep(300)  # sleep for 5 min after every 400 papers

    if count == 800:
        sleep(150)  # sleep for a further 2.5 min at the 800th paper
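
The pauses in the loop above throttle requests to arxiv.org. Writing that schedule as a pure function makes the timing easy to check without downloading anything. This is only a sketch: `pause_seconds` and its parameter names are hypothetical, and the defaults copy the values used in the loop.

```python
def pause_seconds(count, batch=400, batch_pause=300, extra_at=800, extra_pause=150):
    """Seconds to wait after `count` downloads, following the loop's rules."""
    seconds = 0
    if count % batch == 0:
        seconds += batch_pause   # regular pause after each full batch
    if count == extra_at:
        seconds += extra_pause   # one-off additional pause
    return seconds

# The loop breaks at count 1500 before sleeping, so only counts 1..1499 can pause.
total = sum(pause_seconds(n) for n in range(1, 1500))
print(f"total pause: {total} s")
```

Under these assumptions, the waiting time for a 1500-paper run comes only from counts 400, 800, and 1200.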


In [None]:
print(len(pdf_loaders))

1500


## **Extracting pdfs and dividing text into chunks --**

In [None]:
docs = []
for loader in pdf_loaders:
    try:
        docs.extend(loader.load_and_split())
    except pypdf.errors.PdfStreamError:
        # skip PDFs that pypdf cannot parse
        continue

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 300,
    chunk_overlap = 20,
    length_function = len,
)

chunks = text_splitter.create_documents(
    [doc.page_content for doc in docs],
    metadatas=[doc.metadata for doc in docs])

print(len(chunks))



223252


In [None]:
print(chunks[100])

page_content='memory needed for the decoding algorithm; for example, pres ence of states with similar emission\nprobabilities tends to increase memory requirements. Is it possible to characterize HMMs that\nrequire large amounts of memory to decode? Can we characteri ze the states that are likely to serve' metadata={'source': '/tmp/tmp978xrbrw/tmp.pdf', 'page': 8}
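
`chunk_size` and `chunk_overlap` control how much text each chunk holds and how much neighbouring chunks share. The sketch below is a plain sliding window over characters, not `RecursiveCharacterTextSplitter` itself, which additionally tries to break on separators such as newlines and spaces. `sliding_chunks` is a hypothetical helper meant only to show the overlap arithmetic.

```python
def sliding_chunks(text, chunk_size=300, chunk_overlap=20):
    """Cut `text` into windows of chunk_size characters sharing chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

# 700 characters of repeating alphabet as stand-in page text.
text = "".join(chr(97 + i % 26) for i in range(700))
pieces = sliding_chunks(text)
print([len(p) for p in pieces])
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 20 characters of one chunk reappear at the start of the next.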


## **Initializing the FAISS Vector Database --**

In [None]:
FAISS_INDEX_PATH = "faiss_index_09062023"

class LocalHuggingFaceEmbeddings(Embeddings):
    def __init__(self, model_id):
        self.model = SentenceTransformer(model_id)

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # encode() returns a NumPy array; convert it to match the declared return type
        embeddings = self.model.encode(texts)
        return embeddings.tolist()

    def embed_query(self, text: str) -> List[float]:
        embedding = self.model.encode(text)
        return list(map(float, embedding))
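
Once texts are embedded, retrieval comes down to comparing a query vector with the stored vectors. The toy sketch below scores by dot product, which the model name `multi-qa-mpnet-base-dot-v1` suggests it was tuned for; which metric the FAISS index actually applies depends on how the store is configured, so treat the scoring choice as an assumption. `dot`, `top_k`, and the two-dimensional vectors are hypothetical, and this exhaustive search is not how FAISS is implemented.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k stored vectors with the highest dot product."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Tiny made-up "embeddings" for three documents.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
best = top_k([1.0, 0.2], doc_vecs)
print(best)
```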

## **Function for creating Embeddings --**

In [None]:
@ray.remote(num_gpus=0.5)
def process_shard(shard):
    print(f"Starting process_shard of {len(shard)} chunks.")
    st = time.time()

    embeddings = LocalHuggingFaceEmbeddings("multi-qa-mpnet-base-dot-v1")
    result = FAISS.from_documents(shard, embeddings)

    et = time.time() - st
    print(f"Shard completed in {et} seconds.")

    return result

In [None]:
! ray status

Node status
---------------------------------------------------------------
Healthy:
 1 node_7677f7cdf4c88ffe99cffc4704fbe86ced5c385fdf2a85d6330c4b0b
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/1.0 CPU
 0.0/1.0 GPU
 0B/7.24GiB memory
 0B/3.62GiB object_store_memory

Demands:
 (no resource demands)

## **Creating Embeddings --**

In [None]:
core = 2
shards = np.array_split(chunks, core)

futures = [process_shard.remote(shards[i]) for i in range(core)]
results = ray.get(futures)

db = results[0]
for i in range(1, core):
    db.merge_from(results[i])

db.save_local(FAISS_INDEX_PATH)

(process_shard pid=19718) Starting process_shard of 111626 chunks.
(process_shard pid=19718) Shard completed in 1290.7926216125488 seconds.
(process_shard pid=25369) Starting process_shard of 111626 chunks.
(process_shard pid=25369) Shard completed in 1212.108793258667 seconds.


In [None]:
! ray status

Node status
---------------------------------------------------------------
Healthy:
 1 node_7677f7cdf4c88ffe99cffc4704fbe86ced5c385fdf2a85d6330c4b0b
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/1.0 CPU
 0.0/1.0 GPU
 0B/7.24GiB memory
 1.61GiB/3.62GiB object_store_memory

Demands:
 (no resource demands)
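
Splitting the chunks into shards and merging the per-shard indexes is what lets the embedding work in this section be spread across workers. The pure-Python sketch below imitates the near-equal contiguous split that `np.array_split` performs, where earlier shards receive any leftover items. `split_evenly` is a hypothetical helper and the integers stand in for document chunks.

```python
def split_evenly(items, n_shards):
    """Split `items` into n_shards contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_shards)
    shards = []
    start = 0
    for i in range(n_shards):
        size = base + (1 if i < extra else 0)  # first `extra` shards get one more item
        shards.append(items[start:start + size])
        start += size
    return shards

shards = split_evenly(list(range(10)), 3)
print([len(s) for s in shards])

# Merging the shards back in order restores the original sequence.
merged = [item for shard in shards for item in shard]
```

Applied to the 223252 chunks produced earlier with two shards, this rule gives two halves of 111626, matching the shard sizes reported in the log.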

## **Shutdown RAY Instance --**

In [None]:
ray.shutdown()

## **Installing Necessary Dependencies and Importing Them --**

In [None]:
!pip install openai

Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
Installing collected packages: openai
Successfully installed openai-0.27.8


In [None]:
from langchain.llms import OpenAI

from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

## **Setting up OpenAI Credentials --**

In [None]:
os.environ["OPENAI_API_KEY"] = 'Enter key here'
# pass the key itself, not the literal name of the environment variable
llm = OpenAI(openai_api_key=os.environ["OPENAI_API_KEY"])
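
A missing or placeholder key otherwise only surfaces when the first request fails, so it can help to check it up front. The sketch below is one way to do that; `read_api_key` is a hypothetical helper, the placeholder text is the one used in the cell above, and `sk-example` is a made-up value.

```python
import os

def read_api_key(env=os.environ, name="OPENAI_API_KEY", placeholder="Enter key here"):
    """Return the key stored under `name`, or raise if it is missing or a placeholder."""
    key = env.get(name, "").strip()
    if not key or key == placeholder:
        raise RuntimeError(f"{name} is not set to a real key")
    return key

# A plain dict stands in for os.environ so the sketch runs anywhere.
key = read_api_key({"OPENAI_API_KEY": "sk-example"})
print(len(key))
```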

## **Initializing the Buffer Memory --**

In [None]:
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
pdf_qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.9) , db.as_retriever(), memory=memory)

## **Answering the Questions --**

In [None]:
question_framing="Give me the Name and detail of research paper that have information about "

In [None]:
query = question_framing +" measurement-based availability assessment study using field data collected during a 4-year period from 373 SunOS/Solaris Unix workstations and servers interconnected through a local area network. "
result = pdf_qa({"question": query})
print("Answer:")
result["answer"]

Answer:


' The research paper is entitled “Failure Data Analysis of a LAN of Windows NT-Based Computers” and was published in the 18th IEEE Symposium on Reliable Distributed Systems. It discusses a measurement-based availability assessment study using field data collected over a 4-year period from 373 SunOS/Solaris Unix workstations and servers interconnected through a local area network.'
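
The query cells in this section all follow the same pattern: build a question, call the chain, read `result["answer"]`. A small wrapper can remove the repetition. This is a sketch under the assumption that the chain is callable with a `{"question": ...}` dict and returns a dict with an `"answer"` key, as in the cell above. `ask` and `fake_chain` are hypothetical names, and `fake_chain` only echoes its input so the example runs without an index or API key.

```python
def ask(chain, topic, framing=""):
    """Send framing + topic to a chain-like callable and return the stripped answer text."""
    result = chain({"question": framing + topic})
    return result["answer"].strip()

def fake_chain(inputs):
    """Stand-in for pdf_qa: echoes the question instead of querying a model."""
    return {"answer": " echo: " + inputs["question"]}

answer = ask(fake_chain, "scrip systems", framing="Papers about ")
print(answer)
```

In the notebook itself, `pdf_qa` would take the place of `fake_chain` and `question_framing` would be passed as `framing`.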

In [None]:
query = question_framing +" we provide first approaches of assisting possibilities for users how to resolve the difference of requirements and their actual activities with respect to privacy protection in Elearing domain"
result = pdf_qa({"question": query})
print("Answer:")
result["answer"]

Answer:


' The research paper "Privacy - an Issue for eLearning? A Trend Analysis Reflecting the Attitude of European eLearning Users" by Katrin Borcea-Pfitzmann and Anne-Katrin Stange provides first approaches of assisting possibilities for users to resolve the difference of requirements and their actual activities with respect to privacy protection in eLearning domain.'

In [None]:
query =question_framing + "design of efficient scrip systems and develop tools for empirically analyzing them. For those interested in the empirical study of scrip system"
result = pdf_qa({"question": query})
print("Answer:")
result["answer"]

Answer:


' This paper, titled "Design of Efficient Scrip Systems and Empirical Analysis Tools." It is available from the Computer Science Dept. at Cornell University, and was written by Halpern and published as an Abstract.'

In [None]:
para = "We also characterize the distribution of money in the system in equilibrium, as a function of the fraction of agents of each type. Using this characterization, we show that we can infer the threshold strategies that different types of agents are using simply from the distribution of money. This shows that, by simply observing a scrip system in operation, we can learn a great deal about the agents in the system. Not only is such information of interest to social scientists and marketers, it is also important to a system designer trying to optimize the performance of the system. This is because agents that have no money will be unable to pay for service; agents that are at their threshold are unwilling to serve others. We show that, typically, it is the number of agents with no money that has the more significant impact on the overall efficiency of the system. Thus, the way to optimize the performance of the system is to try to minimize the number of agents with no money. As we show, we can decrease the number of agents with no money by increasing the money supply. However, this only works up to a point. Once a certain amount of money is reached, the system experiences a monetary crash: there is so much money that, in equilibrium, everyone will feel rich and no agents are willing to work. The point where the system crashes gives us a sharp threshold. We show that, to get optimal performance, we want the total amount of money in the system to be as close as possible to the"

In [None]:
query = question_framing+para
result = pdf_qa({"question": query})
print("Answer:")
result["answer"]

Answer:


'\nThis paper, titled "Design and Analysis of Scrip Systems" by Halpern.'

In [None]:
query = "what is On-line Viterbi Algorithm"
result = pdf_qa({"question": query})
print("Answer:")
result["answer"]

Answer:


' On-line Viterbi Algorithm is an algorithm that is related to random walks and has memory requirements that can be bounded by an asymptotic limit. It is described in detail in a 1973 paper by G.D. Forney Jr. entitled "The Viterbi Algorithm" and the pseudocode for the algorithm is given in the text above.'

In [None]:
query = "who is pm of india"
result = pdf_qa({"question": query})
print("Answer:")
result["answer"]

Answer:


" I don't know."