## Install Library

In [None]:
! pip install datasets

## Acquire Data

Let's suppose the data is in `.csv` form.

In [None]:
import pandas as pd

In [23]:
# create df
df = pd.DataFrame({
    'questions': [
        "What is your name?",
        "What is your age?",
        "What is your gender?",
    ],
    'answers': [
        "My name is Tom",
        "I am 22 years old",
        "I am a male",
    ]
})

In [24]:
# save
df.to_csv('/content/toy_data.csv', index=False)

In [25]:
csv = pd.read_csv('/content/toy_data.csv')

raw_content_questions = list(csv['questions'])
raw_content_answers = list(csv['answers'])

In [26]:
raw_content_questions

['What is your name?', 'What is your age?', 'What is your gender?']
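Before converting the lists, it can help to confirm that every question has a matching, non-empty answer. The following is a minimal sketch in plain Python; `check_qa_pairs` is a hypothetical helper that is not part of the original notebook, and the toy lists are copied from the cells above.

```python
def check_qa_pairs(questions, answers):
    """Return a list of problems found; an empty list means the pairs look usable."""
    problems = []
    if len(questions) != len(answers):
        problems.append(
            f"length mismatch: {len(questions)} questions vs {len(answers)} answers"
        )
    for i, (q, a) in enumerate(zip(questions, answers)):
        # Missing CSV cells come back from pandas as float NaN, not str
        if not isinstance(q, str) or not q.strip():
            problems.append(f"row {i}: empty or non-text question")
        if not isinstance(a, str) or not a.strip():
            problems.append(f"row {i}: empty or non-text answer")
    return problems


# Toy data copied from the cells above
questions = ["What is your name?", "What is your age?", "What is your gender?"]
answers = ["My name is Tom", "I am 22 years old", "I am a male"]
print(check_qa_pairs(questions, answers))
```

An empty list means the pairs can be passed on as they are; any other result names the rows to fix first.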

Once the data is in a `pd.DataFrame` with one question/answer pair per row, start from here.

In [27]:
from datasets import Dataset, DatasetDict

In [28]:
# Example data - replace these with your actual data
train_data = {
    'questions': raw_content_questions,
    'answers': raw_content_answers
}

# Create a Dataset object for the training split
train_dataset = Dataset.from_dict(train_data)

# Wrap it in a DatasetDict
dataset_dict = DatasetDict({
    'train': train_dataset,
})

# Display the structure of the dataset
print(dataset_dict)

DatasetDict({
    train: Dataset({
        features: ['questions', 'answers'],
        num_rows: 3
    })
})


## Push to HuggingFace Hub

In [None]:
# ! huggingface-cli login

In [None]:
from huggingface_hub import HfApi, create_repo

In [None]:
from google.colab import userdata
HF_TOKEN = userdata.get('HF_TOKEN')

In [30]:
# HF_TOKEN is read from the Colab secrets in the previous cell
# Replace 'sample_toy_data_v1' with your desired repository name
auth_token = HF_TOKEN
repo_name = 'sample_toy_data_v1'
username = 'eagle0504' # replace with your Hugging Face username

api = HfApi()
# Note: without repo_type="dataset", create_repo makes a model repo (see repo_type in the output below)
create_repo(repo_name, token=auth_token, private=False) # Set private=True if you want a private repo

RepoUrl('https://huggingface.co/eagle0504/sample_toy_data_v1', endpoint='https://huggingface.co', repo_type='model', repo_id='eagle0504/sample_toy_data_v1')

In [31]:
# verify name
app_id = f"{username}/{repo_name}"
print(app_id)

eagle0504/sample_toy_data_v1


In [32]:
%%time

dataset_dict.push_to_hub(app_id)

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

CPU times: user 91.8 ms, sys: 7.4 ms, total: 99.2 ms
Wall time: 1.51 s


CommitInfo(commit_url='https://huggingface.co/datasets/eagle0504/sample_toy_data_v1/commit/a4952165da01b58c693a2a736e0eeec6f82cd111', commit_message='Upload dataset', commit_description='', oid='a4952165da01b58c693a2a736e0eeec6f82cd111', pr_url=None, pr_revision=None, pr_num=None)

## Pull Data from HuggingFace

If you already have a `DatasetDict` on *HuggingFace*, you can start here and use the following code to load the data directly and run queries against it.

You can use this code directly in a `streamlit` application.

In [None]:
! pip install chromadb

In [33]:
import chromadb
from datasets import load_dataset
import numpy as np
import pandas as pd
import string

You'll need to know the repository ID (`username/dataset_name`) of the `HuggingFace` dataset that you want to load.

For example, this notebook uses [this dataset](https://huggingface.co/datasets/eagle0504/sample_toy_data) as a demonstration. The same URL is given below.

```md
https://huggingface.co/datasets/eagle0504/sample_toy_data
```

In [34]:
%%time

# Load the dataset from the Hub
dataset = load_dataset("eagle0504/sample_toy_data_v1")
client = chromadb.Client()
# Build a random collection name so that reruns do not reuse an existing name
random_number = np.random.randint(low=1e9, high=1e10)
random_string = ''.join(np.random.choice(list(string.ascii_uppercase + string.digits), size=10))
combined_string = f"{random_number}{random_string}"
collection = client.create_collection(combined_string)

# Embed and store every question in the train split
L = len(dataset["train"]['questions'])
collection.add(
    ids=[str(i) for i in range(0, L)],  # IDs are just strings
    documents=dataset["train"]['questions'], # Enter questions here
    metadatas=[{"type": "support"} for _ in range(0, L)],
)

Downloading readme:   0%|          | 0.00/301 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.42k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/3 [00:00<?, ? examples/s]

CPU times: user 682 ms, sys: 22.7 ms, total: 705 ms
Wall time: 2.58 s
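The random number and random string in the cell above serve only to give each run a fresh collection name. A shorter way to get the same effect, sketched here with the standard-library `uuid` module, is shown below. `make_collection_name` and the `qa` prefix are hypothetical choices for illustration. Check the resulting name against Chroma's naming rules before relying on it.

```python
import uuid


def make_collection_name(prefix="qa"):
    """Build a unique name: a short prefix, a hyphen, and 32 hex characters."""
    return f"{prefix}-{uuid.uuid4().hex}"


name = make_collection_name()
print(name)
```

The returned string could then be passed to `client.create_collection` in place of `combined_string`.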


In [39]:
question = "What is your nom?"

In [40]:
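results = collection.query(
    query_texts=[question],  # one inner result list is returned per query text
    n_results=5  # ask for up to 5 matches; the toy collection holds only 3
)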



In [41]:
idx = results["ids"][0]
idx = [int(i) for i in idx]
idx

[0, 2, 1]
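The `[0]` index is needed because the query result holds one inner list per query text. The sketch below uses a hand-written dictionary as a stand-in for the real return value, with IDs and distances copied from this notebook's outputs. It runs without `chromadb` and only illustrates the shape.

```python
# Hand-written stand-in for the dictionary returned by collection.query;
# the values are copied from this notebook's outputs.
results = {
    "ids": [["0", "2", "1"]],
    "distances": [[1.098543, 1.126488, 1.377615]],
}

# One inner list per query text; a single question means index 0
ids_for_first_query = results["ids"][0]
distances_for_first_query = results["distances"][0]

# IDs were stored as strings, so convert them back to row positions
idx = [int(i) for i in ids_for_first_query]
print(idx)  # [0, 2, 1]

# Pair each row position with its distance (smaller means closer)
pairs = list(zip(idx, distances_for_first_query))
for row, dist in pairs:
    print(row, dist)
```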

In [42]:
ref = pd.DataFrame(
    {
        "idx": idx,
        "question": [dataset["train"]['questions'][i] for i in idx],
        "answers": [dataset["train"]['answers'][i] for i in idx],
        "distances": results["distances"][0]
    }
)
ref

|   | idx | question | answers | distances |
|---|-----|----------|---------|-----------|
| 0 | 0 | What is your name? | My name is Tom | 1.098543 |
| 1 | 2 | What is your gender? | I am a male | 1.126488 |
| 2 | 1 | What is your age? | I am 22 years old | 1.377615 |


In [None]:
# Keep only matches whose distance is below the threshold (smaller distance = closer match)
special_threshold = 0.3
filtered_ref = ref[ref["distances"] < special_threshold]
filtered_ref

|   | idx | question | answers | distances |
|---|-----|----------|---------|-----------|
| 0 | 0 | What is your name? | My name is Keshav | 0.0 |
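With the distances returned for "What is your nom?" (all above 1.0), a threshold of 0.3 keeps no rows, so an application built on this filter needs a fallback for the empty case. The sketch below shows one way to handle it in plain Python. `filter_by_distance` and the fallback message are hypothetical, and the rows are copied from the query results earlier in this notebook.

```python
def filter_by_distance(rows, threshold):
    """Keep only rows whose distance is strictly below the threshold."""
    return [row for row in rows if row["distance"] < threshold]


# Rows copied from the query results earlier in this notebook
rows = [
    {"idx": 0, "question": "What is your name?", "distance": 1.098543},
    {"idx": 2, "question": "What is your gender?", "distance": 1.126488},
    {"idx": 1, "question": "What is your age?", "distance": 1.377615},
]

matches = filter_by_distance(rows, threshold=0.3)
if matches:
    reply = matches[0]["question"]
else:
    # Hypothetical fallback message; choose whatever suits the application
    reply = "No close match found."
print(reply)
```

Raising the threshold admits looser matches; the right value depends on the embedding model and distance metric in use.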
