In [None]:
!pip install arxiv langchain sentence-transformers faiss-cpu transformers accelerate


In [None]:
import arxiv
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np


In [None]:
def fetch_arxiv_papers(query, max_results=25):
    """Fetch up to max_results arXiv papers matching the query."""
    search = arxiv.Search(query=query, max_results=max_results)
    # Search.results() is deprecated in arxiv 2.x; iterate through a Client instead
    client = arxiv.Client()
    papers = []
    for result in client.results(search):
        # Join the author names into a single comma-separated string
        authors = ", ".join(author.name for author in result.authors)
        papers.append({
            "title": result.title,
            "summary": result.summary,
            "url": result.entry_id,
            "authors": authors,
            "date": result.published
        })
    return papers


In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2')

def embed_papers(papers):
    texts = [paper["summary"] for paper in papers]
    embeddings = model.encode(texts)
    return embeddings



In [None]:
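def create_faiss_index(embeddings):
    # FAISS expects a contiguous float32 array of shape (n_vectors, dimension)
    embeddings = np.ascontiguousarray(embeddings, dtype="float32")
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)  # exact (brute-force) L2 search
    index.add(embeddings)
    return index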


In [None]:
def semantic_search(index, papers, query, top_k=8):
    query_embedding = model.encode([query])
    distances, indices = index.search(query_embedding, top_k)
    # FAISS pads the result with -1 when fewer than top_k vectors are indexed
    results = [papers[i] for i in indices[0] if i >= 0]
    return results
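
A flat L2 index compares the query against every stored vector and keeps the closest ones. The pure-Python sketch below mirrors that ranking on toy vectors so the behaviour is easy to inspect. `brute_force_l2_search` and `toy_vectors` are illustrative names, not part of FAISS, and the sketch assumes plain lists of floats rather than NumPy arrays. Like `IndexFlatL2`, it ranks by squared Euclidean distance.

```python
import heapq


def brute_force_l2_search(vectors, query, top_k=3):
    """Return (squared_distance, index) pairs for the top_k nearest vectors.

    Hypothetical helper for illustration only; it is not a FAISS function.
    """
    scored = []
    for i, vec in enumerate(vectors):
        # Squared Euclidean distance between the stored vector and the query
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, i))
    # nsmallest never returns more items than exist, so a large top_k is safe
    return heapq.nsmallest(top_k, scored)


# Toy 2-D vectors standing in for sentence embeddings
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
hits = brute_force_l2_search(toy_vectors, [1.0, 1.0], top_k=2)
for dist, i in hits:
    print(f"vector {i} at squared distance {dist}")
```

Asking for more neighbours than there are vectors simply returns every vector here, whereas FAISS pads the missing slots with index -1.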


In [None]:
# Citation Management Utilities

citation_tracker = {}

def add_citation(key: str, reference: str):
    """
    Adds a citation to the tracker. If the key exists, it will not duplicate.
    """
    if key not in citation_tracker:
        citation_tracker[key] = reference


def get_citations():
    """
    Prints all tracked citations with the typewriter effect.

    Relies on typewriter(), which is defined in a later cell and must be run first.
    """
    if not citation_tracker:
        typewriter("No citations tracked yet.")
        return
    citations = "\n\n📚 **Citations:**\n"
    for i, ref in enumerate(citation_tracker.values(), 1):
        citations += f"{i}. {ref}\n"
    typewriter(citations)


# Example Usage:
# add_citation("bert2018", "Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://arxiv.org/abs/1810.04805")
# get_citations()
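
Because the tracker deduplicates by key, the choice of key decides whether two versions of the same paper count as one citation. The sketch below derives a version-independent key from an arXiv entry URL. `arxiv_id_from_url` is a hypothetical helper, and it assumes new-style identifiers of the form `2204.04477v1`; any other URL is returned unchanged.

```python
import re


def arxiv_id_from_url(url):
    """Extract the version-independent arXiv ID from an abs URL.

    Hypothetical helper; assumes new-style IDs such as 2204.04477v1.
    Returns the URL unchanged when the pattern does not match.
    """
    match = re.search(r"arxiv\.org/abs/(\d{4}\.\d{4,5})(v\d+)?$", url)
    return match.group(1) if match else url


key = arxiv_id_from_url("http://arxiv.org/abs/2204.04477v1")
print(key)
```

The result could be passed as the `key` argument of `add_citation` in place of the paper title.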


In [None]:
import time
import sys

def typewriter(text, delay=0.03):
    """
    Prints text with a typewriter effect.

    Args:
        text (str): The text to print.
        delay (float, optional): The delay between printing characters (in seconds). Defaults to 0.03.
    """
    for char in text:
        sys.stdout.write(char)
        sys.stdout.flush()
        time.sleep(delay)
    print()

# Example usage:
# typewriter("Hello, this is a typewriter effect!")


In [None]:
query = "Compare BERT and GPT models in terms of performance and architecture"
papers = fetch_arxiv_papers(query)  # pass max_results=... to fetch more candidates
embeddings = embed_papers(papers)
index = create_faiss_index(np.array(embeddings))

# Rank the fetched papers by semantic similarity to the query
results = semantic_search(index, papers, query)

# Print the top results and track a citation for each
for paper in results:
    print("Title:", paper["title"])
    print("URL:", paper["url"])
    print("Summary:", paper["summary"][:300], "...\n")

    # Use the title as the citation key (any other unique identifier would also work)
    citation_key = paper["title"]
    citation_reference = f"{paper['authors']} - {paper['title']}. URL: {paper['url']}"

    # Add the citation to the tracker
    add_citation(citation_key, citation_reference)




Title: FoundationLayerNorm: Scaling BERT and GPT to 1,000 Layers
URL: http://arxiv.org/abs/2204.04477v1
Summary: The mainstream BERT/GPT model contains only 10 to 20 layers, and there is
little literature to discuss the training of deep BERT/GPT. This paper proposes
a simple yet effective method to stabilize BERT and GPT training. We
successfully scale up BERT and GPT to 1,000 layers, which is an order of
magn ...

Title: Sensitivity Analysis on Transferred Neural Architectures of BERT and GPT-2 for Financial Sentiment Analysis
URL: http://arxiv.org/abs/2207.03037v1
Summary: The explosion in novel NLP word embedding and deep learning techniques has
induced significant endeavors into potential applications. One of these
directions is in the financial sector. Although there is a lot of work done in
state-of-the-art models like GPT and BERT, there are relatively few works  ...

Title: Selecting Between BERT and GPT for Text Classification in Political Science Research
URL: http://arxiv.or

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "HuggingFaceH4/zephyr-7b-beta"

tokenizer = AutoTokenizer.from_pretrained(model_id)
llm_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Half-precision weights; on GPUs without native bfloat16 support (e.g. the T4), torch.float16 may be the better choice
    device_map="auto"
)





In [None]:
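def ask_question(context, question, max_tokens=4096):
    # Build the prompt with the tokenizer's own chat template so the role and
    # end-of-turn markers match the format the model was trained on
    messages = [
        {"role": "system", "content": "You are a helpful research assistant. Use the context below to answer the user's question precisely and clearly."},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    inputs = tokenizer(prompt, return_tensors="pt").to(llm_model.device)
    outputs = llm_model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode only the newly generated tokens, skipping the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    typewriter(answer)  # prints the answer; the function itself returns None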

In [None]:
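# Redefines ask_question with greedy decoding for reproducible answers;
# this version replaces the sampling-based definition in the previous cell
def ask_question(context, question, max_tokens=4096):
    # Build the prompt with the tokenizer's own chat template so the role and
    # end-of-turn markers match the format the model was trained on
    messages = [
        {"role": "system", "content": "You are a helpful research assistant. Use the context below to answer the user's question precisely and clearly."},
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    inputs = tokenizer(prompt, return_tensors="pt").to(llm_model.device)
    outputs = llm_model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=False,  # greedy decoding; stands in for the near-zero temperature used before
        pad_token_id=tokenizer.eos_token_id
    )

    # Decode only the newly generated tokens, skipping the echoed prompt
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    answer = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    typewriter(answer)  # prints the answer; the function itself returns None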
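
`ask_question` accepts any string as context, so several retrieved summaries can be combined before the model sees them. The sketch below joins titles and summaries into one numbered context string and stops before a character budget is exceeded. `build_context` and `max_chars` are hypothetical names, the budget is an arbitrary assumption rather than a tuned value, and the `demo` entries are made-up placeholders, not real papers.

```python
def build_context(papers, max_chars=4000):
    """Join paper titles and summaries into one numbered context string.

    Hypothetical helper; stops adding papers before max_chars is exceeded.
    """
    parts = []
    used = 0
    for n, paper in enumerate(papers, 1):
        # Collapse the hard line breaks that arXiv summaries contain
        summary = " ".join(paper["summary"].split())
        block = f"[{n}] {paper['title']}\n{summary}"
        # Account for the blank-line separator between blocks
        extra = len(block) + (2 if parts else 0)
        if used + extra > max_chars:
            break
        parts.append(block)
        used += extra
    return "\n\n".join(parts)


# Placeholder entries shaped like the dicts from fetch_arxiv_papers
demo = [
    {"title": "Paper A", "summary": "First\nsummary."},
    {"title": "Paper B", "summary": "Second summary."},
]
context = build_context(demo)
print(context)
```

With the notebook's variables, a call such as `ask_question(build_context(results), question)` would give the model more than one abstract to draw on, at the cost of a longer prompt.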

In [None]:
question = "In what scenarios is BERT better than GPT, and vice versa?"
ask_question(results[0]["summary"], question)

The paper you mentioned focuses on scaling up the depth of BERT and GPT models, and it does not directly address the question of when BERT is better than GPT or vice versa. Both BERT and GPT are powerful language models that have shown impressive performance in various natural language processing tasks. The choice between BERT and GPT may depend on the specific task and the nature of the input data.

BERT (Bidirectional Encoder Representations from Transformers) is generally better suited for tasks that require understanding the context of a word in a sentence, such as question answering, sentiment analysis, and text classification. BERT's bidirectional architecture allows it to consider the context of a word from both the left and right sides, which can improve its ability to understand the meaning of a word in context.

On the other hand, GPT (Generative Pretrained Transformer) is better suited for tasks that require generating new text, such as summarization, translation, and text c

In [None]:
get_citations()



📚 **Citations:**
1. Dezhou Shen - FoundationLayerNorm: Scaling BERT and GPT to 1,000 Layers. URL: http://arxiv.org/abs/2204.04477v1
2. Tracy Qian, Andy Xie, Camille Bruckmann - Sensitivity Analysis on Transferred Neural Architectures of BERT and GPT-2 for Financial Sentiment Analysis. URL: http://arxiv.org/abs/2207.03037v1
3. Yu Wang, Wen Qu, Xin Ye - Selecting Between BERT and GPT for Text Classification in Political Science Research. URL: http://arxiv.org/abs/2411.05050v1
4. Francesco Periti, Haim Dubossarsky, Nina Tahmasebi - (Chat)GPT v BERT: Dawn of Justice for Semantic Change Detection. URL: http://arxiv.org/abs/2401.14040v3
5. Jingye Yang, Cong Liu, Wendy Deng, Da Wu, Chunhua Weng, Yunyun Zhou, Kai Wang - Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT. URL: http://arxiv.org/abs/2308.06294v2
6. Lewen Yang, Xuanyu Zhou, Juao Fan, Xinyi Xie, Shengxin Zhu - Can bidirectional encoder become the ultimate winner for downstream 