<a href="https://colab.research.google.com/github/sachaRfd/openfabric-test/blob/main/Model_and_Embedding_Loading_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [11]:
# Installations (pinecone.init, used below, belongs to the v2 client, so pinecone-client is pinned below 3):
!pip install -qU datasets "pinecone-client<3" sentence-transformers torch
!pip install tqdm

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [7]:
# Upload science_dataset.csv to the Colab working directory before running this cell.
import pandas as pd

# Load the dataset:
df = pd.read_csv('science_dataset.csv')
df.head()


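# Get Pinecone access. The key is read from an environment variable rather than
# hard-coded in the notebook (the name PINECONE_API_KEY is an assumption):
import os
import pinecone
pinecone.init(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="us-east1-gcp"
)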

Connect to the Pinecone index that will hold the science embeddings:

In [9]:
index_name = "science-bot"

# connect to the science-bot index in Pinecone (created manually beforehand):
index = pinecone.Index(index_name)

Loading an MPNet model as our retriever, which has 2 functions: 
- Create embeddings of our dataset
- Create an embedding of our question

### The model is already pretrained, so no training is needed here

In [10]:
import torch
from sentence_transformers import SentenceTransformer

# use the GPU if one is available, otherwise fall back to the CPU:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model:
retriever = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base", device=device)
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
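
The printed pipeline ends with a `Normalize()` module, so the retriever returns unit-length vectors. For unit-length vectors the dot product equals the cosine similarity, which is why a passage embedding and a question embedding can be compared directly. A minimal sketch of that identity, using made-up 3-dimensional vectors in place of the real 768-dimensional embeddings (the helper names are illustrative, not part of any library):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length, as a normalization layer would."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy stand-ins for a passage embedding and a question embedding.
passage_vec = [3.0, 4.0, 0.0]
question_vec = [1.0, 2.0, 2.0]

# Dot product of the normalized vectors ...
score_normalized_dot = dot(l2_normalize(passage_vec), l2_normalize(question_vec))
# ... matches the cosine similarity of the raw vectors.
score_cosine = cosine(passage_vec, question_vec)
```

Both scores agree up to floating-point error, so for normalized embeddings either a dot-product or a cosine metric ranks passages the same way.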

Let's do some pre-processing of our dataset: 

In [69]:
df.head(10)

| | question | distractor3 | distractor1 | distractor2 | correct_answer | support |
|---|---|---|---|---|---|---|
| 0 | What do the letters in our blood types represent? | iron levels | genomes | proteins | alleles | Another exception to Mendel's laws is a phenom... |
| 1 | What term is used to describe when a liquid is... | evaporating point | freezing point | burning point | boiling point | Boiling Points When the vapor pressure increas... |
| 2 | What force is responsible for erosion by flowi... | weight | kinetic | electromagnetic | gravity | Gravity is responsible for erosion by flowing ... |
| 3 | What are the most energetic of all electromagn... | X-ray | ultraviolet | infrared | gamma rays | Gamma rays are the most energetic of all elect... |
| 4 | What pair of tubes that extends toward the ova... | vas deferens | golgi apparatus | ovarian tubes | fallopian tubes | Extending from the upper corners of the uterus... |
| 5 | Due to the difference in the distribution of c... | crooked | ionic | uneven | polar | Due to the difference in the distribution of c... |
| 6 | How many eyelid membranes do frogs have? | one | two | four | three | In order to live on land and in water, frogs h... |
| 7 | Is the progeny produced by asexual reproductio... | weaker | the same | lighter | stronger | NaN |
| 8 | The denser regions of the electron cloud are c... | isotopes | cores | lattices | orbitals | Some regions of the electron cloud are denser ... |
| 9 | A fish's stream-lined body that reduces water ... | evolution | natural selection | retraction | adaptation | Many structures in fish are adaptations for th... |


We will only be using the question and support columns (but we keep the correct_answer column in case we have missing data): 

In [70]:
clean_df = df.drop(columns=['distractor3', 'distractor1', 'distractor2'])
clean_df.head(5)

| | question | correct_answer | support |
|---|---|---|---|
| 0 | What do the letters in our blood types represent? | alleles | Another exception to Mendel's laws is a phenom... |
| 1 | What term is used to describe when a liquid is... | boiling point | Boiling Points When the vapor pressure increas... |
| 2 | What force is responsible for erosion by flowi... | gravity | Gravity is responsible for erosion by flowing ... |
| 3 | What are the most energetic of all electromagn... | gamma rays | Gamma rays are the most energetic of all elect... |
| 4 | What pair of tubes that extends toward the ova... | fallopian tubes | Extending from the upper corners of the uterus... |


Now let's check for duplicates and null values: 

In [71]:
print(f'Duplicated Values in dataset: {clean_df.duplicated().sum()}')
print(f'Null Values in dataset: {clean_df.isna().sum()}')

Duplicated Values in dataset: 2
Null Values in dataset: question             0
correct_answer       0
support           1198
dtype: int64


There are 2 duplicated rows in our dataset, so we can simply drop them. 

In [72]:
clean_df.drop_duplicates(inplace=True)
print(f'Duplicated Values in dataset now: {clean_df.duplicated().sum()}')

Duplicated Values in dataset now: 0


Now, how should we deal with the null values? Deleting those rows would discard a significant share of the data (the check above found 1198 rows with a missing support passage). Let's look at some examples to see what we are dealing with: 

In [74]:
clean_df[clean_df.support.isna()]

| | question | correct_answer | support |
|---|---|---|---|
| 7 | Is the progeny produced by asexual reproductio... | stronger | NaN |
| 13 | When animals get rid of their gaseous waste, w... | carbon dioxide | NaN |
| 15 | What is the term for an infection caused by a ... | mycosis | NaN |
| 36 | The science of analyzing tree rings is called ... | dendrochronology | NaN |
| 37 | What is defined as the change of water from it... | evaporation | NaN |
| ... | ... | ... | ... |
| 11637 | All new alleles are formed by what types of mu... | random | NaN |
| 11638 | Alleles that carry deadly diseases are usually... | recessive | NaN |
| 11655 | When the ground absorbs the water and it settl... | groundwater | NaN |
| 11656 | Impenetrable what underlies the soil of the fo... | bedrock | NaN |


It seems the entire support passage is missing for these rows. Instead of deleting them, which would lead to a significant loss of data, we can replace each missing value with the text from the correct_answer column. 


This is still not ideal, as those rows lose their context and the range of questions that can be answered well shrinks, but at least those questions can still be answered. 

In [75]:
clean_df.loc[clean_df.support.isna(), 'support'] = clean_df[clean_df.support.isna()].correct_answer

In [76]:
print(f'Null Values in dataset: {clean_df.isna().sum()}')

Null Values in dataset: question          0
correct_answer    0
support           0
dtype: int64
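
The fill step above can also be written with `fillna`, which aligns on the row index so that each missing support value receives the correct answer from the same row. The sketch below runs on a small made-up frame, not the real dataset, and adds a hypothetical `support_is_fallback` column (not in the original notebook) so that filled rows remain identifiable afterwards:

```python
import pandas as pd

# Toy frame mirroring the notebook's columns (values are illustrative only).
toy = pd.DataFrame({
    "question": ["Q1", "Q2", "Q3"],
    "correct_answer": ["gravity", "evaporation", "orbitals"],
    "support": ["Gravity pulls water downhill.", None, None],
})

# Record which rows are about to be filled (hypothetical helper column).
toy["support_is_fallback"] = toy["support"].isna()

# fillna with a Series aligns on the index, so each missing support
# takes the correct_answer from its own row.
toy["support"] = toy["support"].fillna(toy["correct_answer"])

filled_support = toy["support"].tolist()
fallback_flags = toy["support_is_fallback"].tolist()
```

Keeping a flag like this is one possible design choice: it would let later steps treat one-word fallback passages differently from full passages.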


We are now ready to embed the support passages and upload them to Pinecone: 

In [79]:
from tqdm import tqdm

# Generate Embeddings:
batch_size = 64
# Iterate through DataFrame in batches
for i in tqdm(range(0, len(clean_df), batch_size)):
    # current batch's start and end index
    start_index, end_index = i, min(i + batch_size, len(clean_df))
    
    # current batch of rows
    batch = clean_df.iloc[start_index:end_index]
    
    # embeddings for current batch
    emb = retriever.encode(batch["support"].tolist()).tolist()
    
    # metadata for current batch
    meta = batch.to_dict(orient="records")
    
    # unique IDs for current batch
    ids = [f"{idx}" for idx in range(start_index, end_index)]
    
    # list of IDs, embeddings, and metadata
    to_upsert = list(zip(ids, emb, meta))
    
    # Upsert/insert records to Pinecone
    _ = index.upsert(vectors=to_upsert)

# Check that all vectors have been added to the Pinecone index
index.describe_index_stats()

100%|██████████| 183/183 [01:35<00:00,  1.92it/s]


{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 11677}},
 'total_vector_count': 11677}
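
The progress bar and the index statistics above can be sanity-checked without touching Pinecone. The sketch below reproduces only the batching arithmetic of the loop (the `batch_ranges` helper is illustrative, not part of the notebook): 11677 rows in batches of 64 give 183 batches, and positional IDs from `"0"` to `"11676"` are unique.

```python
import math

def batch_ranges(n_rows, batch_size):
    """Yield (start, end) index pairs that cover range(n_rows) in order."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

n_rows, batch_size = 11677, 64
ranges = list(batch_ranges(n_rows, batch_size))

# Number of upsert calls; this is what the tqdm bar counts.
n_batches = len(ranges)

# The final batch is shorter because 11677 is not a multiple of 64.
last_batch_size = ranges[-1][1] - ranges[-1][0]

# Positional string IDs, built the same way as in the upsert loop.
ids = [str(i) for start, end in ranges for i in range(start, end)]
```

Because the IDs are positions rather than DataFrame index labels, they stay contiguous even though dropping duplicates left gaps in the index.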

In [80]:
from transformers import BartTokenizer, BartForConditionalGeneration

# download models
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)

In [102]:
query = "what is an atomic scale?"

In [103]:
def generate_answer(query):
    # Tokenize the query, truncating anything beyond the model's input limit
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt").to(device)
    
    # Use the generator to predict output ids (beam search with 2 beams)
    ids = generator.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        num_beams=2,
        min_length=20,
        max_length=40,
    )
    
    # Decode the output ids into text and return it to the caller
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    return answer

In [105]:
print(generate_answer(query))

The atomic scale is a measure of the number of protons in an atom. It's a measure of the number of protons.
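
Note that `generate_answer` passes only the bare question to the generator; neither the retriever nor the Pinecone index is consulted. A minimal sketch of how retrieved passages might be folded into the generator input follows. It rests on two assumptions that are not confirmed by this notebook: that matches arrive as dictionaries with a `metadata` entry holding the `support` text (as upserted above), and that the generator is prompted in a `question: ... context: <P> ...` layout. The function name and the example matches are hypothetical.

```python
def build_generator_input(question, matches, text_key="support"):
    """Join retrieved passages into a single prompt string.

    `matches` is assumed to be a list of dicts with a 'metadata' dict
    that contains the passage text under `text_key`.
    """
    # Prefix each passage with a <P> marker (assumed prompt convention).
    passages = [f"<P> {m['metadata'][text_key]}" for m in matches]
    context = " ".join(passages)
    return f"question: {question} context: {context}"

# Hypothetical matches with placeholder passages, not real query results.
example_matches = [
    {"id": "0", "score": 0.9, "metadata": {"support": "first retrieved passage"}},
    {"id": "1", "score": 0.8, "metadata": {"support": "second retrieved passage"}},
]

prompt = build_generator_input("what is an atomic scale?", example_matches)
```

The resulting string could then be passed to `generate_answer` in place of the bare question, so that the generated answer is conditioned on the retrieved passages.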
