Downloading Test Data

This dataset gives us ~42K text chunks to embed, each roughly a paragraph or two.

In [12]:
from datasets import load_dataset
# https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked/viewer/default/train?p=0
# data = load_dataset(
#     "jamescalam/ai-arxiv-chunked",
#     split= "train")


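# load the locally saved copy of the dataset (one JSON object per line)
data = load_dataset("json", data_files="data/train.jsonl", split="train")
print(data.info)
# the "chunk" column holds the text chunks; show its type and the number of chunks
print(type(data['chunk']), len(data['chunk']))
# type and character length of the first chunk
print(type(data['chunk'][0]), len(data['chunk'][0]))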



## Creating Embeddings

Create an embed function for each model. Each function takes a list of strings and returns a list of embedding vectors, one per string.
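
One way to read this requirement is that every model exposes a function with the same contract, so the indexing and query code does not need to know which model is behind it. The sketch below illustrates only that contract. `toy_embed` and `EMBEDDERS` are hypothetical names, and the hash-based vectors are a stand-in with no semantic meaning, not what the real embed function computes.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]
EmbedFn = Callable[[List[str]], List[Vector]]


def toy_embed(texts: List[str], dim: int = 8) -> List[Vector]:
    """Stand-in embedder: hash each string into `dim` floats in [0, 1]."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([byte / 255 for byte in digest[:dim]])
    return vectors


# Hypothetical registry: one embed function per model, all sharing the contract.
EMBEDDERS: Dict[str, EmbedFn] = {"toy": toy_embed}

vectors = EMBEDDERS["toy"](["first chunk", "second chunk"])
print(len(vectors), len(vectors[0]))
```

Swapping models then only means registering another function under a new key.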


### Hugging Face: load the embedding model and its embedding function

In [None]:
from embedding_models.hugginface import load_LLM


# model_id = "intfloat/e5-base-v2"
model_id = "C:/models/e5-base-v2"

model, tokenizer, device = load_LLM(model_id)

device: cpu


### Building a Vector Index

Use this to build a NumPy array of embedding vectors.

In [None]:
from embedding_models.hugginface import embed
from tqdm.auto import tqdm
import numpy as np


def VectorIndex_in_memory(chunks, tokenizer=None, model=None, device=None, batch_size=256):
    """Embed chunks in batches and return all embeddings as one NumPy array."""
    batches = []
    for i in tqdm(range(0, len(chunks), batch_size)):
        # slicing already clamps at the end of the list
        chunk_batch = chunks[i:i + batch_size]
        # embed the current batch (one embedding per chunk)
        embed_batch = embed(chunk_batch, tokenizer=tokenizer, model=model, device=device)
        # np.asarray assumes embed returns a list or a CPU array/tensor
        batches.append(np.asarray(embed_batch))
        if i > 10:  # debug limit: stops after the second batch; remove to embed everything
            break
    # stack the per-batch arrays row-wise; `arr += batch` would add element-wise instead
    arr = np.concatenate(batches)
    print(i, len(arr))
    return arr
# arr = VectorIndex_in_memory(chunks=data["chunk"], tokenizer=tokenizer, model=model, device=device)




In [None]:
from add_data_to_db import VectorIndexUpdate
VectorIndexUpdate(texts= data["chunk"][:100],tokenizer=tokenizer,model=model,device=device)

100%|██████████| 50/50 [00:21<00:00,  2.30it/s]

finished writing Data to DB!





Now we need a query mechanism. This is simply a cosine similarity calculation between a query vector and the vectors in arr.
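
As a rough illustration of that calculation, the sketch below scores a query vector against every row of an embedding matrix and returns the best matches in descending order of similarity. The function name `top_k_cosine` and the toy 2-D vectors are made up for the example; the real arr would hold the model's embeddings.

```python
import numpy as np


def top_k_cosine(arr: np.ndarray, xq: np.ndarray, top_k: int = 3):
    """Return (indices, scores) of the top_k rows of arr most similar to xq."""
    # cosine similarity = dot product divided by the product of the norms
    sim = arr @ xq / (np.linalg.norm(arr, axis=1) * np.linalg.norm(xq))
    # argsort is ascending, so take the last top_k indices and reverse them
    idx = np.argsort(sim)[-top_k:][::-1]
    return idx, sim[idx]


# toy 2-D "embeddings": rows 0 and 2 point roughly along the query direction
arr = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
xq = np.array([1.0, 0.1])

idx, scores = top_k_cosine(arr, xq, top_k=2)
print(idx, scores)
```

`np.argsort` sorts the full similarity vector. `np.argpartition` is cheaper on a large index, but it returns the top_k entries in arbitrary order, so they would still need a final sort before display.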

In [None]:
# from numpy.linalg import norm

# # convert chunks list to array for easy indexing
# chunk_arr = np.array(chunks)

# def query_in_memory(text: str, top_k: int=3) -> list[str]:
#     # create query embedding
#     xq = embed([f"query: {text}"])[0]
#     # calculate cosine similarities
#     sim = np.dot(arr, xq.T) / (norm(arr, axis=1)*norm(xq.T))
#     # get indices of top_k records
#     idx = np.argpartition(sim, -top_k)[-top_k:]
#     docs = chunk_arr[idx]
#     for d in docs.tolist():
#         print(d)
#         print("----------")
# # query_in_memory("why should I use llama 2?")

In [None]:
from vector_search import query
texts = ["why should I use llama 2?"]
query(texts, model, tokenizer, device)



Input text: why should I use llama 2?
Similarity: 0.7536534335583621;	table ID: 324	Content: little to no technical skill. While this might make our paper seem harmful, we believe the beneﬁts of
publishing this attack far outweighs any potential harms.
The ﬁrst reason the beneﬁts outweigh the harms is that, to the best of our knowledge, multimodal
contrastive classiﬁers are not yet used in any security-critical situations. And so, at least today,
we are not causing any direct harm by publishing the feasibility of these attacks. Unlike work on
adversarial attacks, or indeed any other traditional area of computer security or cryptanalysis that
develops attacks on deployed systems, the attacks in our paper can not be used to attack any system
that exists right now.
Compounding on the above, by publicizing the limitations of these classiﬁers early, we can prevent
users in the future from assuming these classiﬁers are robust when they in fact are not. If we were
to wait to publish the feasi

## Use Cases

https://medium.com/@vladris/embeddings-and-vector-databases-732f9927b377

https://medium.com/@vladris/n-shot-learning-f9bc0d670a41


Q&A solution

In [None]:
import json
import os
from embedding_models.hugginface import embed

embeddings = {}

# embed every file in ./racing, keyed by its path
for name in os.listdir('./racing'):
    path = os.path.join('./racing', name)
    with open(path, 'r') as f:
        text = [f.read()]

    vector = embed(text, tokenizer=tokenizer, model=model, device=device)[0]
    # store plain floats so json.dump works whatever array type embed returns
    embeddings[path] = [float(x) for x in vector]

with open('embeddings.json', 'w') as f:
    json.dump(embeddings, f)

In [None]:
import json

with open('embeddings.json', 'r') as f:
    embeddings = json.load(f)


def cosine_distance(a, b):
    # 1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite)
    dot = sum(a_i * b_i for a_i, b_i in zip(a, b))
    norm_a = sum(a_i ** 2 for a_i in a) ** 0.5
    norm_b = sum(b_i ** 2 for b_i in b) ** 0.5
    return 1 - dot / (norm_a * norm_b)


def nearest_embedding(embedding):
    # start at infinity so a file is always returned, even if every distance is >= 1
    nearest, nearest_distance = None, float('inf')

    for path, embedding2 in embeddings.items():
        distance = cosine_distance(embedding, embedding2)
        if distance < nearest_distance:
            nearest, nearest_distance = path, distance

    return nearest  # path of the nearest file


# single pass with a fixed prompt; swap in input() and a longer loop for interactive use
i = 0
while i < 1:
    i += 1
    # prompt = input('user: ')
    prompt = 'What happened to Senn Kava during the Genosis Challenge?'
    if prompt == 'exit':
        break

    # find best context for prompt
    context = nearest_embedding(embed([prompt], tokenizer=tokenizer, model=model, device=device)[0])

    # read the context file; a chat model can then answer the prompt from it
    with open(context, 'r') as f:
        data = f.read()
    print(data)
    # message = chat.completion(
    #     {'data': data, 'prompt': prompt}).choices[0].message
    # print(f'{message.role}: {message.content}')


During the Genosis Challenge Pod Racing race, there were several exhilarating and unforeseen events that shaped the final standings:

Lightning Bolt's Electrodynamic Boost: Tira Suro, piloting the Lightning Bolt, had equipped the pod with a cutting-edge electrodynamic propulsion system. As the race began, Suro ingeniously synchronized the pod's engine with the planet's unique electromagnetic field, harnessing its energy to achieve an unprecedented burst of speed. This electrifying boost propelled the Lightning Bolt into an early lead, setting the stage for Suro's victory.

Razor Blade's Risky Gambit: Kael Voss, the pilot of the Razor Blade, opted for a daring strategy to gain an advantage. Approaching a treacherous section filled with narrow rock formations, Voss executed a series of precise maneuvers, utilizing the Razor Blade's superior agility to navigate through the hazardous obstacles. Despite the risks involved, Voss's calculated moves allowed the Razor Blade to maintain a strong

In [None]:
import copy
import json
import openai
import os
import re
from langchain_community.llms import LlamaCpp


# openai.api_key = None
# if openai.api_key is None:
#     raise Exception('OPENAI_API_KEY not set')

def insert_params(string, **kwargs):
    pattern = r"{{(.*?)}}"
    matches = re.findall(pattern, string)
    for match in matches:
        replacement = kwargs.get(match.strip())
        if replacement is not None:
            string = string.replace("{{" + match + "}}", replacement)
    return string

class ChatTemplate:
    def __init__(self, template):
        self.template = template

    @staticmethod
    def from_file(template_file):
        with open(template_file, 'r') as f:
            template = json.load(f)
        return ChatTemplate(template)

    def completion(self, parameters):
        instance = copy.deepcopy(self.template)
        for item in instance['messages']:
            item['content'] = insert_params(item['content'], **parameters)
        # return openai.ChatCompletion.create(
        #     model='gpt-3.5-turbo',
        #     **instance)
        model_path = 'C:/models/Llama-2-7B-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf'

        llm = LlamaCpp(model_path = model_path, temperature=0.5, max_tokens=1000, top_p=0.9, top_k=50,verbose = False,n_ctx=2048)
        # print(instance)
        # flatten the filled-in messages into one prompt string for the local model,
        # so the answer depends on the data and prompt passed in rather than a fixed string
        prompt_text = " ".join(
            f"role: {item['role']}, content: {item['content']}" for item in instance['messages'])
        return llm.invoke(prompt_text)




chat = ChatTemplate( template=
    {'messages': [{'role': 'system', 'content': 'You are a Q&A AI.'},
                  {'role': 'system', 'content': 'Here are some facts that can help you answer the following question: {{data}}'},
                  {'role': 'user', 'content': '{{prompt}}'}]
     })
message = chat.completion(
    {'data': data, 'prompt': prompt})
# .choices[0].message
print(message)


In [None]:
from langchain_community.llms import LlamaCpp
# from llama_cpp import Llama

model_path = 'C:/models/Llama-2-7B-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf'
llm = LlamaCpp(model_path=model_path, temperature=0.5, max_tokens=500, top_p=0.9, top_k=50, verbose=False)

text = "What is the capital of France?"
response = llm.invoke(text)
print(response)


 everybody knows that the capital of France is Paris. But do you know why it's called "Paris"? The name Paris comes from a Celtic word meaning "path" or "way". It was originally a small Gallo-Roman settlement on the Île de la Cité, which was later fortified by the Romans and became the center of the city. Over time, the name evolved into "Paris" and the city grew to become one of the most famous and romantic cities in the world.
