# Task
Create vector embeddings of the "https://huggingface.co/datasets/openlifescienceai/medmcqa" dataset and upload them to a Pinecone vector database.

## Install necessary libraries

### Subtask:
Install `git-lfs` and the Pinecone client library.


**Reasoning**:
The subtask requires installing two packages, so I will use pip to install them in a single code block.



In [None]:
%pip install git-lfs pinecone

Collecting pinecone
  Downloading pinecone-7.3.0-py3-none-any.whl.metadata (9.5 kB)
Collecting pinecone-plugin-assistant<2.0.0,>=1.6.0 (from pinecone)
  Downloading pinecone_plugin_assistant-1.7.0-py3-none-any.whl.metadata (28 kB)
Downloading pinecone-7.3.0-py3-none-any.whl (587 kB)
Downloading pinecone_plugin_assistant-1.7.0-py3-none-any.whl (239 kB)
Installing collected packages: pinecone-plugin-assistant, pinecone
Successfully installed pinecone-7.3.0 pinecone-plugin-assistant-1.7.0
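
Later cells also import `pandas` and `sentence_transformers`, which this install step does not cover (Colab usually ships them, but that is an assumption about the runtime). A minimal sketch for checking what is actually installed before continuing; `installed_version` is a hypothetical helper name, and it only relies on the standard library's `importlib.metadata`:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as published on PyPI, not import names.
for name in ("pinecone", "sentence-transformers", "pandas"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Anything reported as `not installed` can be added to the `%pip install` line above.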


## Clone the dataset

### Subtask:
Use `git clone` to get the dataset from Hugging Face.


**Reasoning**:
Use git clone to download the dataset from the provided Hugging Face URL into a local directory.



In [None]:
!git clone https://huggingface.co/datasets/openlifescienceai/medmcqa /tmp/medmcqa

fatal: destination path '/tmp/medmcqa' already exists and is not an empty directory.
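
The `fatal` message above only means the repository was already cloned in an earlier run, so the data is still available. A small sketch of how the clone step could be made safe to re-run; `clone_if_missing` is a hypothetical helper, and treating the presence of a `.git` directory as "already cloned" is an assumption:

```python
import os
import subprocess


def clone_if_missing(url, dest, run=subprocess.run):
    """Clone `url` into `dest` unless `dest` already holds a git checkout.

    Returns True if a clone was started, False if it was skipped.
    """
    if os.path.isdir(os.path.join(dest, ".git")):
        return False
    run(["git", "clone", url, dest], check=True)
    return True


# Usage in this notebook (not executed here):
# clone_if_missing(
#     "https://huggingface.co/datasets/openlifescienceai/medmcqa",
#     "/tmp/medmcqa",
# )
```

The `run` parameter exists only so the helper can be exercised without network access.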


## Load the data

### Subtask:
Load the dataset into a pandas DataFrame or similar structure.


**Reasoning**:
Import the pandas library and load the dataset from the specified path into a DataFrame. Then, display the head and info of the DataFrame to understand its structure.



**Reasoning**:
The previous attempt to load the data failed because the file `train.json` was not found. I need to check the contents of the `/tmp/medmcqa` directory to identify the correct data file(s).



In [None]:
import os

dataset_path = '/tmp/medmcqa'
print(os.listdir(dataset_path))

['data', 'README.md', '.git', '.gitattributes']


**Reasoning**:
The previous command showed that the data files are likely within the 'data' subdirectory. I will list the contents of the 'data' directory to find the data files.



In [None]:
import os

data_dir = '/tmp/medmcqa/data'
print(os.listdir(data_dir))

['validation-00000-of-00001.parquet', 'test-00000-of-00001.parquet', 'train-00000-of-00001.parquet']


**Reasoning**:
The previous command revealed that the data files are in Parquet format within the 'data' subdirectory. I will load the 'train-00000-of-00001.parquet' file into a pandas DataFrame and display the head and info.



In [None]:
import pandas as pd

data_file = '/tmp/medmcqa/data/train-00000-of-00001.parquet'

df = pd.read_parquet(data_file)

display(df.head())
df.info()

|   | id | question | opa | opb | opc | opd | cop | choice_type | exp | subject_name | topic_name |
|---|----|----------|-----|-----|-----|-----|-----|-------------|-----|--------------|------------|
| 0 | e9ad821a-c438-4965-9f77-760819dfa155 | Chronic urethral obstruction due to benign pri... | Hyperplasia | Hyperophy | Atrophy | Dyplasia | 2 | single | Chronic urethral obstruction because of urinar... | Anatomy | Urinary tract |
| 1 | e3d3c4e1-4fb2-45e7-9f88-247cc8f373b3 | Which vitamin is supplied from only animal sou... | Vitamin C | Vitamin B7 | Vitamin B12 | Vitamin D | 2 | single | Ans. (c) Vitamin B12 Ref: Harrison's 19th ed. ... | Biochemistry | Vitamins and Minerals |
| 2 | 5c38bea6-787a-44a9-b2df-88f4218ab914 | All of the following are surgical options for ... | Adjustable gastric banding | Biliopancreatic diversion | Duodenal Switch | Roux en Y Duodenal By pass | 3 | multi | Ans. is 'd' i.e., Roux en Y Duodenal Bypass Ba... | Surgery | Surgical Treatment Obesity |
| 3 | cdeedb04-fbe9-432c-937c-d53ac24475de | Following endaerectomy on the right common car... | Central aery of the retina | Infraorbital aery | Lacrimal aery | Nasociliary aretry | 0 | multi | The central aery of the retina is a branch of ... | Ophthalmology | |
| 4 | dc6794a3-b108-47c5-8b1b-3b4931577249 | Growth hormone has its effect on growth through? | Directly | IG1-1 | Thyroxine | Intranuclear receptors | 1 | single | Ans. is 'b' i.e., IGI-1GH has two major functi... | Physiology | |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 182822 entries, 0 to 182821
Data columns (total 11 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   id            182822 non-null  object
 1   question      182822 non-null  object
 2   opa           182822 non-null  object
 3   opb           182822 non-null  object
 4   opc           182822 non-null  object
 5   opd           182822 non-null  object
 6   cop           182822 non-null  int64 
 7   choice_type   182822 non-null  object
 8   exp           160869 non-null  object
 9   subject_name  182822 non-null  object
 10  topic_name    87209 non-null   object
dtypes: int64(1), object(10)
memory usage: 15.3+ MB
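
Only the train shard is loaded above, although the `data` directory also holds validation and test shards. A sketch of how the shard filenames could be grouped by split before loading; the `<split>-NNNNN-of-NNNNN.parquet` naming pattern is inferred from the directory listing, and `group_shards_by_split` is a hypothetical helper:

```python
import re
from pathlib import Path

# Pattern inferred from names such as 'train-00000-of-00001.parquet'.
SHARD_PATTERN = re.compile(r"^(?P<split>.+)-\d{5}-of-\d{5}\.parquet$")


def group_shards_by_split(filenames):
    """Map each split name to the sorted list of shard files belonging to it."""
    groups = {}
    for name in sorted(filenames):
        match = SHARD_PATTERN.match(Path(name).name)
        if match:
            groups.setdefault(match.group("split"), []).append(name)
    return groups


files = [
    "validation-00000-of-00001.parquet",
    "test-00000-of-00001.parquet",
    "train-00000-of-00001.parquet",
]
print(group_shards_by_split(files))

# Each group could then be loaded with pandas, for example:
# pd.concat(pd.read_parquet(os.path.join(data_dir, f)) for f in groups["train"])
```

With a single shard per split this is equivalent to the direct `pd.read_parquet` call above; it only matters if a split is ever stored in several shards.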


## Process the data

### Subtask:
Clean and preprocess the data for embedding. This might involve selecting relevant columns, handling missing values, and formatting text.


**Reasoning**:
Select relevant columns, handle missing values, concatenate text, and perform basic text cleaning to prepare the data for embedding.



In [None]:
# Select relevant columns
selected_cols = ['question', 'opa', 'opb', 'opc', 'opd', 'exp']
df_selected = df[selected_cols].copy()

# Handle missing values: fill with empty strings
df_selected = df_selected.fillna('')

# Concatenate text from selected columns
df_selected['text'] = df_selected['question'] + ' ' + \
                      df_selected['opa'] + ' ' + \
                      df_selected['opb'] + ' ' + \
                      df_selected['opc'] + ' ' + \
                      df_selected['opd'] + ' ' + \
                      df_selected['exp']

# Perform basic text cleaning: convert to lowercase
df_selected['text'] = df_selected['text'].str.lower()

# Display the first few rows of the processed data
display(df_selected[['text']].head())

|   | text |
|---|------|
| 0 | chronic urethral obstruction due to benign pri... |
| 1 | which vitamin is supplied from only animal sou... |
| 2 | all of the following are surgical options for ... |
| 3 | following endaerectomy on the right common car... |
| 4 | growth hormone has its effect on growth throug... |
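
The plain concatenation above works, but it leaves no boundary between the question, the options, and the explanation. One possible variant, offered as a sketch rather than as the method used in this notebook, labels each part and skips empty fields. `build_text` and `OPTION_FIELDS` are hypothetical names; the column names come from the DataFrame above:

```python
# (label, column) pairs for the four answer options in the dataset.
OPTION_FIELDS = (("A", "opa"), ("B", "opb"), ("C", "opc"), ("D", "opd"))


def build_text(row):
    """Join question, options and explanation into one labelled string.

    Empty strings and None are skipped. NaN is truthy in Python, so this
    should run after missing values have been filled with ''.
    """
    parts = [f"Question: {row['question']}"]
    for label, field in OPTION_FIELDS:
        value = row.get(field)
        if value:
            parts.append(f"{label}) {value}")
    explanation = row.get("exp")
    if explanation:
        parts.append(f"Explanation: {explanation}")
    return "\n".join(parts)


example = {
    "question": "Which vitamin is supplied from only animal source?",
    "opa": "Vitamin C",
    "opb": "Vitamin B7",
    "opc": "Vitamin B12",
    "opd": "Vitamin D",
    "exp": "",
}
print(build_text(example))

# In the notebook this could be applied row-wise:
# df_selected['text'] = df_selected.apply(build_text, axis=1)
```

Whether labelled text yields better retrieval than the plain concatenation is not something this notebook measures, so treat it as an option to evaluate.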


## Choose an embedding model

### Subtask:
Select a suitable model for creating embeddings (e.g., a sentence transformer or a model from the `transformers` library).


**Reasoning**:
Select a pre-trained model from the `sentence-transformers` library, which is designed for semantic-similarity tasks and is widely used to embed text. The `all-MiniLM-L6-v2` model offers a good balance of embedding quality and computational cost for general-purpose text, and it produces 384-dimensional vectors.



In [None]:
from sentence_transformers import SentenceTransformer

model_name = 'all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

print(f"Selected model: {model_name}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Selected model: all-MiniLM-L6-v2


## Generate embeddings

### Subtask:
Generate embeddings for the concatenated text data using the chosen sentence transformer model.


**Reasoning**:
Generate embeddings for the 'text' column of the `df_selected` DataFrame using the loaded SentenceTransformer model and store them in a new column.



In [None]:
import torch

# Check for GPU and set device
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

embeddings = model.encode(df_selected['text'].tolist(), show_progress_bar=True, device=device)
df_selected['embeddings'] = embeddings.tolist()

display(df_selected[['text', 'embeddings']].head())

Using device: cuda


Batches:   0%|          | 0/5714 [00:00<?, ?it/s]

|   | text | embeddings |
|---|------|------------|
| 0 | chronic urethral obstruction due to benign pri... | [0.00010939138155663386, 0.016637397930026054,... |
| 1 | which vitamin is supplied from only animal sou... | [-0.030962148681282997, -0.004025406204164028,... |
| 2 | all of the following are surgical options for ... | [0.03295871242880821, -0.006006740033626556, -... |
| 3 | following endaerectomy on the right common car... | [0.011834653094410896, 0.0037785270251333714, ... |
| 4 | growth hormone has its effect on growth throug... | [-0.0075697205029428005, -0.025288023054599762... |
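
A quick back-of-the-envelope check of the encoding step, written as a sketch. It assumes `encode` was called with its default `batch_size` of 32 (the cell above does not override it) and that `all-MiniLM-L6-v2` returns 384-dimensional `float32` vectors:

```python
import math

num_texts = 182_822        # rows in the train split, per df.info()
encode_batch_size = 32     # default batch_size of SentenceTransformer.encode
embedding_dim = 384        # assumed output size of all-MiniLM-L6-v2
bytes_per_float32 = 4

# Number of forward passes the progress bar counts.
num_batches = math.ceil(num_texts / encode_batch_size)

# Size of the dense array returned by encode, before any .tolist() call.
array_bytes = num_texts * embedding_dim * bytes_per_float32

print(f"encode batches: {num_batches}")
print(f"float32 array size: {array_bytes / 2**20:.1f} MiB")
```

The `.tolist()` conversion that stores the vectors in a DataFrame column turns every value into a Python float object, which takes considerably more memory than the array itself. Keeping the `embeddings` array and indexing into it by row position is a lighter alternative if memory becomes a constraint.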




## Upload embeddings to Pinecone

In [None]:
import os
from pinecone import Pinecone, ServerlessSpec
from google.colab import userdata

PINECONE_API_KEY = userdata.get('PINECONE_API_KEY')

# Initialize Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)

# Define index name and dimension
index_name = 'medmcqa-embeddings'
embedding_dimension = embeddings.shape[1]

# Create index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=embedding_dimension,
        metric='cosine',
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )

# Connect to the index
pinecone_index = pc.Index(index_name)
print(f"Pinecone index '{index_name}' set up successfully.")

# Prepare data to upload
data_to_upload = []
for idx, row in df_selected.iterrows():
    vector_id = str(df['id'].iloc[idx])
    vector = row['embeddings']
    metadata = {
        'question': df['question'].iloc[idx],
        'opa': df['opa'].iloc[idx],
        'opb': df['opb'].iloc[idx],
        'opc': df['opc'].iloc[idx],
        'opd': df['opd'].iloc[idx],
        'cop': str(df['cop'].iloc[idx]),
        'choice_type': df['choice_type'].iloc[idx],
        'exp': df['exp'].iloc[idx],
        'subject_name': df['subject_name'].iloc[idx],
        'topic_name': df['topic_name'].iloc[idx] if pd.notna(df['topic_name'].iloc[idx]) else ''
    }
    data_to_upload.append((vector_id, vector, metadata))

print(f"Prepared {len(data_to_upload)} data points for upload.")


Pinecone index 'medmcqa-embeddings' set up successfully.
Prepared 182822 data points for upload.


PineconeApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Date': 'Sat, 28 Jun 2025 14:29:45 GMT', 'Content-Type': 'application/json', 'Content-Length': '131', 'Connection': 'keep-alive', 'x-pinecone-request-latency-ms': '436', 'x-pinecone-request-id': '7265171089736132733', 'x-envoy-upstream-service-time': '63', 'server': 'envoy'})
HTTP response body: {"code":3,"message":"Metadata value must be a string, number, boolean or list of strings, got 'null' for field 'exp'","details":[]}
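
The error message states the rule directly: a metadata value must be a string, number, boolean, or list of strings, so the `None` values in `exp` (about 22,000 rows have no explanation) are rejected. A minimal sketch of a sanitizer that could be applied to each metadata dict before upload; `clean_metadata` is a hypothetical helper, and handling NaN as well as `None` is a precaution rather than something the traceback requires:

```python
import math


def clean_metadata(meta, drop_missing=True):
    """Return a copy of `meta` without None/NaN values.

    With drop_missing=True the offending keys are omitted; otherwise they
    are kept with an empty string as the value.
    """
    cleaned = {}
    for key, value in meta.items():
        is_missing = value is None or (
            isinstance(value, float) and math.isnan(value)
        )
        if not is_missing:
            cleaned[key] = value
        elif not drop_missing:
            cleaned[key] = ""
    return cleaned


example = {"question": "Growth hormone ...", "exp": None, "topic_name": float("nan")}
print(clean_metadata(example))
print(clean_metadata(example, drop_missing=False))
```

Dropping the key keeps the stored metadata smaller, while substituting an empty string keeps every record with the same set of fields. Which is preferable depends on how the metadata will be filtered later.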


In [None]:
batch_size = 100
for i in range(0, len(data_to_upload), batch_size):
    batch = data_to_upload[i:i + batch_size]
    # fix metadata type issue
    safe_batch = []
    for vid, vec, meta in batch:
        # replace any null/None with empty string for metadata fields
        clean_meta = {k: (v if v is not None else "") for k,v in meta.items()}
        safe_batch.append((vid, vec, clean_meta))
    pinecone_index.upsert(vectors=safe_batch)
    print(f"Uploaded batch {int(i/batch_size) + 1}/{int(len(data_to_upload)/batch_size)}")
print("Upload to Pinecone complete.")

Uploaded batch 1/1828
Uploaded batch 2/1828
Uploaded batch 3/1828
Uploaded batch 4/1828
Uploaded batch 5/1828
Uploaded batch 6/1828
Uploaded batch 7/1828
Uploaded batch 8/1828
Uploaded batch 9/1828
Uploaded batch 10/1828
Uploaded batch 11/1828
Uploaded batch 12/1828
Uploaded batch 13/1828
Uploaded batch 14/1828
Uploaded batch 15/1828
Uploaded batch 16/1828
Uploaded batch 17/1828
Uploaded batch 18/1828
Uploaded batch 19/1828
Uploaded batch 20/1828
Uploaded batch 21/1828
Uploaded batch 22/1828
Uploaded batch 23/1828
Uploaded batch 24/1828
Uploaded batch 25/1828
Uploaded batch 26/1828
Uploaded batch 27/1828
Uploaded batch 28/1828
Uploaded batch 29/1828
Uploaded batch 30/1828
Uploaded batch 31/1828
Uploaded batch 32/1828
Uploaded batch 33/1828
Uploaded batch 34/1828
Uploaded batch 35/1828
Uploaded batch 36/1828
Uploaded batch 37/1828
Uploaded batch 38/1828
Uploaded batch 39/1828
Uploaded batch 40/1828
Uploaded batch 41/1828
Uploaded batch 42/1828
Uploaded batch 43/1828
Uploaded batch 44/18
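
One detail in the loop above: the total is computed with `int(len(data_to_upload)/batch_size)`, which rounds down. With 182,822 vectors and a batch size of 100 there are 1,829 batches, because the last one holds the remaining 22 vectors, so the final progress line would read `1829/1828`. A sketch of the same loop with a rounded-up total; `iter_batches` and `upsert_in_batches` are hypothetical helpers, and `index` is assumed to expose `upsert(vectors=...)` as `pinecone_index` does above:

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_in_batches(index, vectors, batch_size=100):
    """Upsert `vectors` batch by batch and return the number of batches sent."""
    total = math.ceil(len(vectors) / batch_size)
    for number, batch in enumerate(iter_batches(vectors, batch_size), start=1):
        index.upsert(vectors=batch)
        print(f"Uploaded batch {number}/{total}")
    return total


# Usage in this notebook (not executed here):
# upsert_in_batches(pinecone_index, data_to_upload, batch_size=100)
```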

## Finish task

### Subtask:
Summarize the process and confirm that the embeddings have been successfully uploaded to Pinecone.

**Reasoning**:
Provide a summary of the steps taken and confirm the successful upload of embeddings to the Pinecone index.
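
Before reporting success, the vector count in the index can be compared with the number of rows prepared for upload. The sketch below polls a count until it reaches the expected value, since a freshly written index may report a lower number for a short time. `wait_for_vector_count` is a hypothetical helper, and reading the count from `describe_index_stats().total_vector_count` is an assumption about the installed Pinecone client:

```python
import time


def wait_for_vector_count(get_count, expected, attempts=5,
                          delay_seconds=2.0, sleep=time.sleep):
    """Poll get_count() until it reaches `expected` or attempts run out.

    Returns the last count seen, so the caller can compare it with `expected`.
    """
    count = get_count()
    for _ in range(attempts - 1):
        if count >= expected:
            break
        sleep(delay_seconds)
        count = get_count()
    return count


# Usage in this notebook (not executed here; attribute name is an assumption):
# seen = wait_for_vector_count(
#     lambda: pinecone_index.describe_index_stats().total_vector_count,
#     expected=len(data_to_upload),
# )
# print(f"{seen} of {len(data_to_upload)} vectors visible in the index")
```

If the count stays below the expected value, the batch log above shows where the upload stopped and from which batch it would need to resume.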