# Digital Research Assistant (digitalRA)

The intention is to help with the litrature review writing (not a survay paper, but the litrature review section of a paper or a grant).

You describe your idea, and some parameters for filtering paper, the output is a set of found papers that are relevant to your idea, why do the model think they are relevant, and a "high quality" text reveiwing the papers and relating them to your idea.


Provide an idea for a research you are planning to do. This cell is all you need to provide. The rest runs without your inputs, you just need to run it.

In [8]:
# idea_text = "One problem in self supervised learning without negative instances is collapse. To avoid collapse, I will use eigenvalues of the output embedding space and ensure they are all larger than 1.0 to make sure the space of embedding is used effectively. Still, making sure the variations of the same data point (generated by some augmentation) are mapped closer to one another. "
# idea_text = """The prevailing mental health predicament compels us to investigate resilience—the intrinsic capability to counteract stress and rejuvenate mental well-being. While resilience is recognized for its potent influence on mental health, aging, recovery from ailments, and possible deterrence of cognitive decline, there remains a significant void in understanding its transformation with age and its modulation by major life events. This research endeavors to delve into the neurocognitive mechanisms underpinning emotional resilience in older populations, both healthy and those with mood disorders. The central thrust is to discern how aging individuals process emotions, given its profound impact on their overall well-being. Harnessing a synergistic approach that melds brain imaging, cognitive assessments, physiological monitoring, and real-world data, the study seeks to: 1) Uncover the intricate neurological, cognitive, and physiological substrates bolstering emotional resilience in aging; 2) Contrast the neurological and physiological responses of older individuals with late-life depression against their healthy counterparts; and 3) Track and prognosticate the trajectory of mental well-being in the twilight years. Through these pursuits, this research aims to amplify our grasp on the intricate dance between aging, emotion, and cognition, ultimately steering the creation of innovative strategies to bolster mental health in senior years."""
idea_text = """As people age, they may pay less attention to the social aspects of their environment2. Normative aging has a negative impact on certain aspects of social cognition and specifically in social perception. One notable difference between younger and older individuals is the age-related decline in perceiving and integrating social-emotional cues from faces. Most of the evidence, however, stems from studies in which the data is averaged across individuals, making conclusions about the two extreme ends of the age spectrum rather than a complete picture across the lifespan. In addition, processing social-emotional cues typically vary across individuals and across tests. Without current knowledge of how social perception is affected across the adult lifespan – from young to middle age to late adulthood – and without fully understanding individual differences in processing social-emotional cues (inter-individual differences) in different tests, it is difficult to draw conclusions as to why social perception change with age. The plan is to identify neurocognitive mechanisms underlying age-related differences in social perception function. I will achieve this by examining brain networks across the adult lifespan, from young adulthood to middle age and older adulthood to identify critical window in which this function starts to change. Due to the heterogeneity of aging population and individual variability of social-emotional processing, I will further investigate inter-individual differences in social perception functions. I will achieve this by having repeated measurements of social perception among few individuals, also known as deep sampling, to determine how individual variabilities could explain age-related differences observed in social perception. Furthermore, I will zoom in and investigate the role of subcortical brain structures, which have been largely overlooked in social cognitive functions due to technical challenges associated with accurate mapping of these areas. I will achieve this by measuring structural and functional properties of the amygdala and locus coeruleus to further understand their role in social perception functions."""

# Pick your samll GPT instance (used for research and extracting papers) and Large GPT instance (used for writing the litrature review)
large_mdl = 'gpt-3.5-turbo-16k' # 'gpt-4'
small_mdl = 'gpt-3.5-turbo-0613'

# We filter artilces with some spec. For example, only papers later than 2020, or if the papers had more than 100 citations, and their relevance score was at least medium.

min_cite = 100
min_year = 2020
litrature_review_len = 1500 # Tokens

Establish latest gpt pricing from OpenAI: https://openai.com/pricing

In [9]:
pricing_map = {'gpt-4-0613': [0.03/1000, 0.06/1000], 'gpt-3.5-turbo-16k': [0.003/1000, 0.004/1000], 'gpt-3.5-turbo-0613': [0.0015/1000, 0.002/1000]}

Some initial methods and functions are defined here

In [10]:
import subprocess
import os
import time
import json
import pandas as pd

working_dir = './results/reza_gpt2_layers/'
if not os.path.exists(working_dir):
    os.makedirs(working_dir)

if not os.path.exists('settings.json'):
    data = {"OPENAI_API_KEY": "YYY"}
    with open('settings.json', 'w') as f:
        json.dump(data, f, indent=4) 

with open('settings.json', 'r') as file:
    data = json.load(file)
    field_name = "OPENAI_API_KEY"
    OPENAI_API_KEY = data[field_name]

def run_pop8query(keywords, datasource, max_results, output_format, output_file):
    cmd = [
        "./assets/pop8query",
        "--keywords={}".format(keywords),
        "--{}".format(datasource),
        "--max={}".format(max_results),
        "--format={}".format(output_format),
        output_file
    ]

    result = subprocess.run(cmd, capture_output=True, text=True)
    
    if result.returncode != 0:
        print("Error occurred:", result.stderr)
    else:
        print("Command executed successfully!")
        print("Output:", result.stdout)

# Example usage:
# run_pop8query("machine learning", "semscholar", 20, "json", "output.json")


def get_papers(search_phrase, dataset="semscholar"):
    try:
        run_pop8query(search_phrase, dataset, 20, "json", "output.json")
        
        # Check if output.json was created and is not empty
        if not os.path.exists("output.json") or os.path.getsize("output.json") == 0:
            print(f"Error: Output file for '{search_phrase}' not created or is empty.")
            return pd.DataFrame()  # Return empty dataframe

        with open("output.json", "r", encoding="utf-8-sig") as file:
            data = json.load(file)

        if not data:
            print(f"No data found in the JSON file for '{search_phrase}'.")
            return pd.DataFrame()  # Return empty dataframe

        df = pd.DataFrame(data)

        return df

    except Exception as e:
        print(f"Error processing '{search_phrase}': {e}")
        return pd.DataFrame()  # Return empty dataframe in case of any other unexpected errors


In [11]:
import openai
import time
import tiktoken

class llmOperations:    
    total_prompt_tokens = 0
    total_cmpl_tokens = 0

    openai.api_key = OPENAI_API_KEY
    # openai.api_key = os.getenv('OPENAI_API_KEY').strip('"')
    # language_model = 'gpt-3.5-turbo-instruct-0914'
    # language_model = "babbage-002"

    def __init__(self, language_model="gpt-3.5-turbo-0613", price_inp=0.0015/1000, price_out=0.002/1000):
        self.language_model=language_model
        self.price_inp=price_inp
        self.price_out=price_out    
        self.tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")

    
    def get_llm_response(self, prompt, system_prompt = "You are a smart, very knowledgable, research assistant."):
        chat_data = [{'role': 'system', "content": system_prompt}, {'role': 'user', 'content': prompt}]
        # print(chat_data)
        
        response = openai.ChatCompletion.create(model=self.language_model, messages=chat_data)
        final_response = response['choices'][0]['message']['content']
        
        self.total_prompt_tokens += response['usage']['prompt_tokens']
        self.total_cmpl_tokens += response['usage']['completion_tokens']
        
        return final_response, response

    def get_current_cost(self):
        return self.total_prompt_tokens*self.price_inp + self.total_cmpl_tokens*self.price_out
        
    def get_estimated_cost(self, prompt, completion_estimate_len=100):
        # Assumes the system prompt is small, and prompt variable contains all text to be processed by LLM        
        return len(self.tokenizer.encode(prompt))*self.price_inp + completion_estimate_len*self.price_out


# response = openai.Completion.create(model=language_model, prompt=prompt, max_tokens=300, temperature=0)
# final_response = response['choices'][0]['text']

In [12]:
# Load short context and logn context models

short_context_model = llmOperations(small_mdl, price_inp=pricing_map[small_mdl][0], price_out=pricing_map[small_mdl][1])
# long_context_model = llmOperations('gpt-3.5-turbo-16k', price_inp=0.003/1000, price_out=0.004/1000)
long_context_model = llmOperations(large_mdl, price_inp=pricing_map[large_mdl][0], price_out=pricing_map[large_mdl][1])

print('Loaded ' + small_mdl + ' for short context cases and ' + large_mdl + ' for long context inferences.')

Loaded gpt-3.5-turbo-0613 for short context cases and gpt-3.5-turbo-16k for long context inferences.


## Establish the research background

- We first establish the experties we need to perform the research. Based on that, we ask our model to act like an expert in that field.
- The expert GPT-small-instance-expert first generates search phrases out of the idea described above to search.
- Next, we use those phrases to find relevant paper.
- GPT-small-instance-expert then goes through those papers and rate them to extent to which they are relevant to out idea.
- It also generate a short reasons why that score was given.


In [15]:
prompt = "Here is a research proposal:\n"+idea_text+'\n If a professor is going to research this propsoal, what would the professor be expert at? Generate the answer in the format of "You are an expeert in the field of XXX"'
researcher_spec, response = short_context_model.get_llm_response(prompt)

researcher_spec += " You are the best in the world in this field. "
print(researcher_spec)

You are an expert in the field of social cognition and social perception, specifically in relation to age-related changes and individual differences. You are the best in the world in this field. 


In [16]:
prompt = f'{researcher_spec} \n\nHere is a description of an idea: {idea_text}. \n Generate 5 search phrases to search in Google for related articles. Generate the search phrases in a json format, with fields of "search phrase X", where X is the number. Include nothing but this json format output in your response.'
final_response, response = short_context_model.get_llm_response(prompt)

parsed_data = json.loads(final_response)
search_phrases = []
for i in parsed_data.keys():
    search_phrases.append(parsed_data[i])
print('Here are search phrases i suggest: ', search_phrases)

with open(working_dir + 'search_phrases.txt', 'w') as f:
    f.write('\n'.join(search_phrases))

print('Current cost: ', short_context_model.get_current_cost())

Here are search phrases i suggest:  ['age-related changes in social cognition and social perception', 'individual differences in social perception across the lifespan', 'neurocognitive mechanisms underlying age-related differences in social perception', 'inter-individual differences in social perception function', 'role of amygdala and locus coeruleus in social perception functions']
Current cost:  0.0031865


In [17]:
# Get a summary of the idea. 

prompt = researcher_spec+"\n\nSummarize this research idea to a concise paragraph while make sure it does not loose any important message or question:\n"+idea_text
idea_text_summary, response = short_context_model.get_llm_response(prompt)

with open(working_dir + 'idea_summary.txt', 'w') as f:
    f.write('Idea: \n\n')
    f.write(idea_text)
    f.write('\n\n idea summary:\n\n')
    f.write(idea_text_summary)

print('Here is a summary of your idea: \n', idea_text_summary)

Here is a summary of your idea: 
 Summary:
This research aims to understand age-related changes in social perception function and individual differences in processing social-emotional cues. Current knowledge is limited due to research averaging data across individuals and lacking a complete lifespan perspective. The plan is to examine brain networks across the adult lifespan, focusing on critical windows of change. Deep sampling of individuals will help explore inter-individual differences in social perception. The study will also investigate the role of subcortical brain structures, specifically the amygdala and locus coeruleus, in social perception functions.


In [20]:
# Find papers
import pandas as pd

# Find papers
combined_df = pd.DataFrame()
engines = ['gscholar', 'pubmed', 'semscholar']
# engines = ['pubmed']

for search_phrase in search_phrases:
    for engine in engines:
        print(f"Extracting papers with search phrase: {search_phrase} from {engine}")
        df = get_papers(search_phrase, )
        combined_df = pd.concat([combined_df, df])
        time.sleep(2)    
combined_df = combined_df.drop_duplicates(subset=['abstract'])
clean_df = combined_df.dropna(subset= ['abstract'])

# Save the found papers
clean_df.to_csv(working_dir+'papers_found.csv')

print(f'Found  {clean_df.shape[0]} articles.')

Extracting papers with search phrase: age-related changes in social cognition and social perception from gscholar
Command executed successfully!
Output: 
Extracting papers with search phrase: age-related changes in social cognition and social perception from pubmed
Command executed successfully!
Output: 
Extracting papers with search phrase: age-related changes in social cognition and social perception from semscholar
Command executed successfully!
Output: 
Extracting papers with search phrase: individual differences in social perception across the lifespan from gscholar
Command executed successfully!
Output: 
Extracting papers with search phrase: individual differences in social perception across the lifespan from pubmed
Command executed successfully!
Output: 
Extracting papers with search phrase: individual differences in social perception across the lifespan from semscholar
Command executed successfully!
Output: 
Extracting papers with search phrase: neurocognitive mechanisms underl

In [21]:
# Simulate cost for finding relevance scores and reasons for relevance
import tiktoken
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")
token_out_estimate = 100
estimated_cost = 0
relevance_scores = []

# clean_df['abstract'].values.shape[0]
for i in range(clean_df.shape[0]):
    abstract = clean_df['abstract'].values[i]
    # print(abstract)
    text = researcher_spec + '\n\nHere is an idea: ' + idea_text_summary + '\n \n' + "How relevant this idea is to the following abstract of a paper:\n" + abstract + """\n \nPick the relevance score from very low, low, medium, high, and very high. Output format as json, with field "relevance" and "reason". Include nothing but this json format output in your response."""
    parsed_response = """{
        "relevance": "very high",
        "reason": "Your idea aligns closely with the concepts discussed in the paper's abstract, which focuses on avoiding collapse and regularizing the covariance matrix of network outputs. Your idea of using eigenvalues to ensure consistent embeddings is directly related to the paper's content."
        }
    """   
    estimated_cost += short_context_model.get_estimated_cost(text, 100)

print("Estimated cost is: ", estimated_cost)

Estimated cost is:  0.0857385


Now we are ready to analyze the found papers and give them scores. 

In [22]:
relevance_scores = []
real_cost = 0

for i in range(clean_df.shape[0]):
    abstract = clean_df['abstract'].values[i]
    
    prompt = researcher_spec + '\n\nHere is an idea: ' + idea_text_summary + '\n' + "How relevant this idea is to the following abstract of a paper:\n" + abstract + """\n \nPick the relevance score from very low, low, medium, high, and very high. Output format as json, with fields "relevance" and "reason", which would look like:\n
    {"relevance": "RELEVANCE", "reason": "THE REASON"}. Include nothing but this json format output in your response."""

    parsed_response, response = short_context_model.get_llm_response(prompt)

    try:
        parsed_data = json.loads(parsed_response)
    except:
        parsed_data = {"relevance": "unknown", "reason": parsed_response}
    columns_to_pass = ['authors', 'abstract', 'doi', 'cites', 'year', 'title']
    for c in columns_to_pass:
        try:
            parsed_data[c] = clean_df[c].values[i]
        except:
            parsed_data[c] = "NONE"
    
    relevance_scores.append(parsed_data)
    
    tokens_out = tokenizer.encode(parsed_response)
    real_cost = short_context_model.get_current_cost()
    print(i, ' out of ', clean_df.shape[0], ', tokens generated: ', len(tokens_out), ', cost so far:', real_cost)
    
print('your cost so far: ', short_context_model.get_current_cost())

# Save the results
relevance_scores_df = pd.DataFrame(relevance_scores)
relevance_scores_df.to_csv(working_dir + '/first_level_analysis.csv')

0  out of  63 , tokens generated:  90 , cost so far: 0.005170500000000001
1  out of  63 , tokens generated:  70 , cost so far: 0.00611
2  out of  63 , tokens generated:  87 , cost so far: 0.006992
3  out of  63 , tokens generated:  69 , cost so far: 0.007802
4  out of  63 , tokens generated:  88 , cost so far: 0.008788
5  out of  63 , tokens generated:  77 , cost so far: 0.009686
6  out of  63 , tokens generated:  69 , cost so far: 0.01058
7  out of  63 , tokens generated:  81 , cost so far: 0.011491999999999999
8  out of  63 , tokens generated:  87 , cost so far: 0.012444499999999999
9  out of  63 , tokens generated:  58 , cost so far: 0.013469499999999999
10  out of  63 , tokens generated:  69 , cost so far: 0.014233
11  out of  63 , tokens generated:  84 , cost so far: 0.015419500000000001
12  out of  63 , tokens generated:  76 , cost so far: 0.017037
13  out of  63 , tokens generated:  46 , cost so far: 0.018029
14  out of  63 , tokens generated:  79 , cost so far: 0.0195655
15  ou

## Write the litrature review

- We first filter artilces with some spec. For example, only papers later than 2020, or if the papers had more than 100 citations, and their relevance score was at least medium.
- We then use the larger instance to write the litrature review.

In [29]:
min_cite = 50
min_year = 2010
litrature_review_len = 2000 # Tokens

filtered_df = relevance_scores_df[relevance_scores_df["relevance"].str.lower().isin(["high", "very high"])]
papers_df = filtered_df[(filtered_df['year']>min_year) | (filtered_df['cites']>min_cite)]
print(f'Selected {papers_df.shape[0]} papers for the review.')

concated_data = [('Paper ID '+ str(i) + ': \n' + p + '\n\n') for i, p in enumerate(papers_df['abstract'].values.tolist())]
concated_data = ''.join(concated_data)
print(f'Estimated price for litrature review: {long_context_model.get_estimated_cost(concated_data, litrature_review_len)} to write a review of around {litrature_review_len*3/4} words')


Selected 15 papers for the review.
Estimated price for litrature review: 0.033263 to write a review of around 1500.0 words


In [30]:
with open(working_dir + 'used_papers_review.txt', 'w', encoding="utf-8-sig") as f:
    f.write(concated_data)
# print(concated_data)

In [31]:
# Write the litrature review
prompt = researcher_spec+'\n \n Here is an idea: ' + idea_text_summary + '\n' + "and here are some paper abstracts that are relevant to this idea:\n\n" + concated_data + """\n\n END OF PAPER ABSTRACT PROVIDED.\n \nWrite a litrature review section, which I will be using for my paper in the background section, using these papers about the idea. Use as much as these papers as you can. Ensure the review is engaging and compares the ideas, rather than a flat list of papers. Also, the review makes reference back to my idea where relevant. Use Paper IDs for referencing, for example [Paper ID 3]. Also, at the very end, add one short and condensed paragraph and discuss how my idea is going to advance the field further and what gaps it will be filling."""
litrature_review, response = long_context_model.get_llm_response(prompt)

print('your cost so far: ', long_context_model.get_current_cost())

your cost so far:  0.030831


In [32]:
refs = [''.join(['[', str(i), '], ', p]) for i, p in enumerate((papers_df['title']+ ', (' + papers_df['year'].astype(str) + ')\n').values.tolist())]
print(litrature_review)
print(''.join(refs))

with open(working_dir + 'litrature_review_gpt4.txt', 'w') as f:
    f.write(litrature_review)
    f.write('\n\nReferences:\n')
    f.write(''.join(refs))

Literature Review:

The field of social cognition and social perception has garnered much attention in recent years, with researchers seeking to understand the age-related changes and individual differences in processing social-emotional cues. This literature review will explore relevant studies and provide an overview of the current understanding in this area. Furthermore, it will discuss the limitations of existing research and highlight the potential for further advancements in the field.

One particular study [Paper ID 0] aimed to analyze age-related changes in behavioral and neurophysiological correlates of social cognition. The findings indicated that older adults (OA) had higher neural responses (N170) during emotional recognition tasks, despite similar behavioral performances compared to younger adults (YA) and middle-aged adults (MA). Additionally, YA and MA outperformed OA in theory of mind tasks, while OA had attenuated neural responses (N2) during moral judgment tasks.

Ano

In [112]:
print(f'Total cost: {long_context_model.get_current_cost()+short_context_model.get_current_cost()}')

Total cost: 0.09337150000000001


In [41]:
# from transformers import BertTokenizer, BertModel
# import torch
# from sklearn.metrics.pairwise import cosine_similarity

# # 1. Initialize BERT model and tokenizer
# model_name = 'bert-base-uncased'
# tokenizer = BertTokenizer.from_pretrained(model_name)
# model = BertModel.from_pretrained(model_name)

# # 2. Function to get embeddings
# def get_embedding(text):
#     tokens = tokenizer(text, return_tensors='pt', truncation=True, max_length=512, padding='max_length')
#     with torch.no_grad():
#         outputs = model(**tokens)
#     return outputs['pooler_output'].detach().numpy()

# # 3. Compute the embedding for the idea
# # idea_text = idea_text
# idea_embedding = get_embedding(idea_text_summary)

# # 4. Compute embeddings for abstracts and rank them
# abstracts = combined_df['abstract'].dropna().values  # List of abstracts
# abstract_embeddings = [get_embedding(abstract) for abstract in abstracts]

# # 5. Calculate cosine similarities and rank
# similarities = [cosine_similarity(idea_embedding, abstract_embedding)[0][0] for abstract_embedding in abstract_embeddings]
# ranked_abstracts = sorted(zip(abstracts, similarities), key=lambda x: x[1], reverse=True)

# # 6. Print ranked abstracts
# for rank, (abstract, score) in enumerate(ranked_abstracts, 1):
#     print(f"Rank: {rank}, Similarity Score: {score:.4f}")
#     print(abstract)
#     print("="*50)



Rank: 1, Similarity Score: 0.9609
Real world data is mostly unlabeled or only few instances are labeled. Manually labeling data is a very expensive and daunting task. This calls for unsupervised learning techniques that are powerful enough to achieve comparable results as semi-supervised/supervised techniques. Contrastive self-supervised learning has emerged as a powerful direction, in some cases outperforming supervised techniques. In this study, we propose, SelfGNN, a novel contrastive self-supervised graph neural network (GNN) without relying on explicit contrastive terms. We leverage Batch Normalization, which introduces implicit contrastive terms, without sacrificing performance. Furthermore, as data augmentation is key in contrastive learning, we introduce four feature augmentation (FA) techniques for graphs. Though graph topological augmentation (TA) is commonly used, our empirical findings show that FA perform as good as TA. Moreover, FA incurs no computational overhead, unlike

In [None]:
# pip install -q transformers
from transformers import pipeline

checkpoint = "MBZUAI/LaMini-Flan-T5-77M"

model = pipeline('text2text-generation', model = checkpoint)
input_prompt = """Here is an idea: Lets say I have a neural network that maps a set of images from the same distribution. I want all of them to be mapped to the same embedding. To avoid collapse, I will use eigenvalues of the output embedding and ensure they are all larger 1.0, or maybe if they are multiplied to one another, the result is larger than 1.0.

How relevant this idea is to the following abstract of a paper:

Deep neural networks (DNNs), regardless of their impressive performance, are vulnerable to attacks from adversarial inputs and, more recently, Trojans to misguide or hijack the decision of the model. We expose the existence of an intriguing class of spatially bounded, physically realizable, adversarial examples— Universal NaTuralistic adversarial paTches—we call TnTs, by exploring the super set of the spatially bounded adversarial example space and the natural input space within generative adversarial networks. Now, an adversary can arm themselves with a patch that is naturalistic, less malicious-looking, physically realizable, highly effective—achieving high attack success rates, and universal. A TnT is universal because any input image captured with a TnT in the scene will: i) misguide a network (untargeted attack); or ii) force the network to make a malicious decision (targeted attack). Interestingly, now, an adversarial patch attacker has the potential to exert a greater level of control—the ability to choose a location independent, natural-looking patch as a trigger in contrast to being constrained to noisy perturbations—an ability is thus far shown to be only possible with Trojan attack methods needing to interfere with the model building processes to embed a backdoor at the risk discovery; but, still realize a patch deployable in the physical world. Through extensive experiments on the large-scale visual classification task, ImageNet with evaluations across its entire validation set of 50,000 images, we demonstrate the realistic threat from TnTs and the robustness of the attack. We show a generalization of the attack to create patches achieving higher attack success rates than existing state-of-the-art methods. Our results show the generalizability of the attack to different visual classification tasks (CIFAR-10, GTSRB, PubFig) and multiple state-of-the-art deep neural networks such as WideResnet50, Inception-V3 and VGG-16."""
input_prompt = 'Please let me know your thoughts on the given place and why you think it deserves to be visited: \n"Barcelona, Spain"'
generated_text = model(input_prompt, max_length=512, do_sample=True)[0]['generated_text']

print("Response", generated_text)
