# Dataset Basics

These are some notes on the basics of working with [HF datasets](https://huggingface.co/docs/datasets/index).  They matter if you want to fine-tune LLMs, because you will frequently download datasets from the Hub and upload datasets to it.

# Highlights

- `dataset.map` does a kind of dict merge, so a `dataset.map(...)` call whose function emits a new dict key adds an additional field.
- For LLM instruction tuning, you likely want some fields like `features: ['output', 'instruction', 'input']`.  
- You can stream data `ds = load_dataset("bigcode/the-stack", streaming=True, split="train")`
- Using `batched=True` with `map` is a good way to speed things up.
- You can go back and forth between datasets and pandas dataframes, which is handy for data manipulation.

# Dataset Quickstart

These notes follow [this page](https://huggingface.co/docs/datasets/quickstart#nlp).

In [None]:
from datasets import load_dataset
dataset = load_dataset("glue", "mrpc", split="train")

Found cached dataset glue (/Users/hamel/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


In [None]:
dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [None]:
dataset

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx'],
    num_rows: 3668
})

### Tokenize Data

The model takes token ids rather than raw text, so you will want to tokenize the examples in this case:

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


With just one input, the `token_type_ids` are all the same:

In [None]:
tokenizer('hello what is going on')

{'input_ids': [101, 7592, 2054, 2003, 2183, 2006, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

With two inputs, the `token_type_ids` are indexed accordingly:

In [None]:
out = tokenizer('hello what is going on?',  'I am here.')
out

{'input_ids': [101, 7592, 2054, 2003, 2183, 2006, 1029, 102, 1045, 2572, 2182, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
groups = [[], []]
for i,tt in zip(out['input_ids'], out['token_type_ids']):
    groups[tt].append(i)

for g in groups:
    print(tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(g)))

[CLS] hello what is going on? [SEP]
i am here. [SEP]


In [None]:
def encode(examples):
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length")

In [None]:
tds = dataset.map(encode, batched=True)
tds

  0%|          | 0/4 [00:00<?, ?ba/s]

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3668
})

### Add additional field / change col name

The quickstart says that the model requires the field name `labels`.  How would we know?  We can look at the `forward` method of the model:

In [None]:
help(model.forward)

Help on method forward in module transformers.models.bert.modeling_bert:

forward(input_ids: Optional[torch.Tensor] = None, attention_mask: Optional[torch.Tensor] = None, token_type_ids: Optional[torch.Tensor] = None, position_ids: Optional[torch.Tensor] = None, head_mask: Optional[torch.Tensor] = None, inputs_embeds: Optional[torch.Tensor] = None, labels: Optional[torch.Tensor] = None, output_attentions: Optional[bool] = None, output_hidden_states: Optional[bool] = None, return_dict: Optional[bool] = None) -> Union[Tuple[torch.Tensor], transformers.modeling_outputs.SequenceClassifierOutput] method of transformers.models.bert.modeling_bert.BertForSequenceClassification instance
    The [`BertForSequenceClassification`] forward method, overrides the `__call__` special method.
    
    <Tip>
    
    Although the recipe for forward pass needs to be defined within this function, one should call the [`Module`]
    instance afterwards instead of this since the former takes care of running t

Copy `label` into a new `labels` field (`map` adds the new column and keeps the original `label`):

In [None]:
tds = tds.map(lambda examples: {"labels": examples["label"]}, batched=True)
tds

  0%|          | 0/4 [00:00<?, ?ba/s]

Dataset({
    features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 3668
})

### Turn Dataset into a PyTorch dataloader

In [None]:
import torch

tds.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
dataloader = torch.utils.data.DataLoader(tds, batch_size=32)

## Wikipedia Dataset

There seem to be many subsets. [This is the page](https://huggingface.co/datasets/wikitext/viewer)

In [None]:
ds = load_dataset("wikitext", "wikitext-2-v1", streaming=True, split="validation")

In [None]:
example = ds.take(5)

In [None]:
ds.description

' The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified\n Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike\n License.\n'

In [None]:
list(example)

[{'text': ''},
 {'text': ' = Homarus gammarus = \n'},
 {'text': ''},
 {'text': ' Homarus gammarus , known as the European lobster or common lobster , is a species of <unk> lobster from the eastern Atlantic Ocean , Mediterranean Sea and parts of the Black Sea . It is closely related to the American lobster , H. americanus . It may grow to a length of 60 cm ( 24 in ) and a mass of 6 kilograms ( 13 lb ) , and bears a conspicuous pair of claws . In life , the lobsters are blue , only becoming " lobster red " on cooking . Mating occurs in the summer , producing eggs which are carried by the females for up to a year before hatching into <unk> larvae . Homarus gammarus is a highly esteemed food , and is widely caught using lobster pots , mostly around the British Isles . \n'},
 {'text': ''}]

# Loading Custom Dataset

You can load a dataset from `csv, tsv, text, json, jsonl, dataframes`.
You can also point to a URL:

In [None]:
from datasets import load_dataset
ds = load_dataset("csv", data_files="https://github.com/datablist/sample-csv-files/raw/main/files/customers/customers-500000.zip")

Using custom data configuration default-6e1837ea838b9492
Found cached dataset csv (/Users/hamel/.cache/huggingface/datasets/csv/default-6e1837ea838b9492/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)


  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
ds

DatasetDict({
    train: Dataset({
        features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Phone 1', 'Phone 2', 'Email', 'Subscription Date', 'Website'],
        num_rows: 500000
    })
})

In [None]:
ds['train'][0]

{'Index': 1,
 'Customer Id': 'e685B8690f9fbce',
 'First Name': 'Erik',
 'Last Name': 'Little',
 'Company': 'Blankenship PLC',
 'City': 'Caitlynmouth',
 'Country': 'Sao Tome and Principe',
 'Phone 1': '457-542-6899',
 'Phone 2': '055.415.2664x5425',
 'Email': 'shanehester@campbell.org',
 'Subscription Date': '2021-12-23',
 'Website': 'https://wagner.com/'}

## Transformations

### `map`

In [None]:
def fullnm(d): return {'Full Name': d['First Name'] + ' ' + d['Last Name']}
ds = ds.map(fullnm)

  0%|          | 0/500000 [00:00<?, ?ex/s]

In [None]:
ds['train'][0]

{'Index': 1,
 'Customer Id': 'e685B8690f9fbce',
 'First Name': 'Erik',
 'Last Name': 'Little',
 'Company': 'Blankenship PLC',
 'City': 'Caitlynmouth',
 'Country': 'Sao Tome and Principe',
 'Phone 1': '457-542-6899',
 'Phone 2': '055.415.2664x5425',
 'Email': 'shanehester@campbell.org',
 'Subscription Date': '2021-12-23',
 'Website': 'https://wagner.com/',
 'Full Name': 'Erik Little'}

#### `batched=True` for `map`

With `batched=True`, your function operates over a list of items instead of one item at a time, which usually speeds things up.  The example below is significantly faster than the default.

Per [the docs](https://huggingface.co/learn/nlp-course/chapter5/3?fw=pt#the-map-methods-superpowers):

> list comprehensions are usually faster than executing the same code in a for loop, and we also gain some performance by accessing lots of elements at the same time instead of one by one.

> Using Dataset.map() with batched=True will be essential to unlock the speed of the “fast” tokenizers

In [None]:
def fullnm_batched(d): return {'Full Name': [f + ' ' + l for f,l in zip(d['First Name'], d['Last Name'])]}
ds.map(fullnm_batched, batched=True)

  0%|          | 0/500 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Phone 1', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
        num_rows: 500000
    })
})

#### `batched=True` speed test

HF tokenizers can work with or without `batched=True`.  To see the difference, let's use a dataset with a larger text field:

In [None]:
from datasets import disable_caching
disable_caching()  # replaces the deprecated set_caching_enabled(False)

In [None]:
tds = load_dataset("csv",
                   data_files='https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip',
                   delimiter="\t");

Using custom data configuration default-3340c354bf896b6f
Found cached dataset csv (/Users/hamel/.cache/huggingface/datasets/csv/default-3340c354bf896b6f/0.0.0/652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a)


  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
tds['train']['review'][0]

'"I&#039;ve tried a few antidepressants over the years (citalopram, fluoxetine, amitriptyline), but none of those helped with my depression, insomnia &amp; anxiety. My doctor suggested and changed me onto 45mg mirtazapine and this medicine has saved my life. Thankfully I have had no side effects especially the most common - weight gain, I&#039;ve actually lost alot of weight. I still have suicidal thoughts but mirtazapine has saved me."'

In [None]:
def tokenize_function(examples): return tokenizer(examples["review"], truncation=True)

##### Without `batched`

In [None]:
%time tds.map(tokenize_function)

  0%|          | 0/215063 [00:00<?, ?ex/s]

CPU times: user 1min 21s, sys: 1.71 s, total: 1min 23s
Wall time: 1min 23s


DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 215063
    })
})

##### With `batched`

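19 seconds in one run!  (The output captured below reports a wall time of 1min 6s, so timings evidently vary from run to run.)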

In [None]:
%time tds.map(tokenize_function, batched=True)

  0%|          | 0/216 [00:00<?, ?ba/s]

CPU times: user 1min 5s, sys: 1.18 s, total: 1min 6s
Wall time: 1min 6s


DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 215063
    })
})

#### Multicore

15.7s in one run!  (The output captured below reports a wall time of 18.1 s.)

>  for values of num_proc other than 8, our tests showed that it was faster to use batched=True without that option. In general, we don’t recommend using Python multiprocessing for fast tokenizers with batched=True.

In [None]:
%time tds.map(tokenize_function, batched=True, num_proc=8)

                

#0:   0%|          | 0/27 [00:00<?, ?ba/s]

#5:   0%|          | 0/27 [00:00<?, ?ba/s]

#3:   0%|          | 0/27 [00:00<?, ?ba/s]

#2:   0%|          | 0/27 [00:00<?, ?ba/s]

#1:   0%|          | 0/27 [00:00<?, ?ba/s]

#4:   0%|          | 0/27 [00:00<?, ?ba/s]

#6:   0%|          | 0/27 [00:00<?, ?ba/s]

#7:   0%|          | 0/27 [00:00<?, ?ba/s]

CPU times: user 911 ms, sys: 533 ms, total: 1.44 s
Wall time: 18.1 s


DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'drugName', 'condition', 'review', 'rating', 'date', 'usefulCount', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 215063
    })
})

### `select`

`shuffle` followed by `select` is a good way to preview a random sample of rows:

In [None]:
sample = ds['train'].shuffle(seed=42).select(range(10))
sample[:2]

{'Index': [209712, 246986],
 'Customer Id': ['fad0d3B75B73cd7', 'D75eCaeAc8C6BD6'],
 'First Name': ['Jo', 'Judith'],
 'Last Name': ['Pittman', 'Thomas'],
 'Company': ['Pineda-Hobbs', 'Mcguire, Alvarado and Kennedy'],
 'City': ['Traciestad', 'Palmerfort'],
 'Country': ['Finland', 'Tonga'],
 'Phone 1': ['001-086-011-7063', '+1-495-667-1061x21703'],
 'Phone 2': ['853-679-2287x631', '589.777.0504'],
 'Email': ['gsantos@stuart.biz', 'vchung@bowman.com'],
 'Subscription Date': ['2020-08-04', '2021-08-14'],
 'Website': ['https://www.bautista.com/', 'https://wilkerson.org/'],
 'Full Name': ['Jo Pittman', 'Judith Thomas']}

### `unique`

In [None]:
len(ds['train'].unique('Index')), ds.num_rows

(500000, {'train': 500000})

### `rename_column`

In [None]:
ds = ds.rename_column('Phone 1', new_column_name='Primary Phone Number')

### `filter`

In [None]:
def erik(d): return d['First Name'].lower() == 'erik'

e_ds = ds.filter(erik)

  0%|          | 0/500 [00:00<?, ?ba/s]

In [None]:
e_ds['train'].select(range(5))['First Name']

['Erik', 'Erik', 'Erik', 'Erik', 'Erik']

### `sort`

In [None]:
ds['train'].sort('First Name').select(range(10))[:3]

{'Index': [491821, 170619, 212021],
 'Customer Id': ['84C747dDFac8Dc7', '5886eaffEF8dc6D', 'B8a6cFab936Fb2A'],
 'First Name': ['Aaron', 'Aaron', 'Aaron'],
 'Last Name': ['Hull', 'Cain', 'Mays'],
 'Company': ['Morrow Inc', 'Mccormick-Hardy', 'Hopkins-Larson'],
 'City': ['West Charles', 'West Connie', 'Mccallchester'],
 'Country': ['Netherlands', 'Vanuatu', 'Ecuador'],
 'Primary Phone Number': ['670-796-3507',
  '323-296-0014',
  '(594)960-9651x17240'],
 'Phone 2': ['001-917-832-0423x324',
  '+1-551-114-3103x05351',
  '996.174.5737x6442'],
 'Email': ['ivan16@bender.org',
  'shelley82@bender.org',
  'qrhodes@stokes-larson.info'],
 'Subscription Date': ['2020-05-28', '2021-04-11', '2022-03-19'],
 'Website': ['http://carney-lawson.info/',
  'http://www.wiggins.biz/',
  'http://pugh.com/'],
 'Full Name': ['Aaron Hull', 'Aaron Cain', 'Aaron Mays']}

## Dataframes from datasets

`set_format` seems to work in place:

In [None]:
ds.set_format('pandas')

In [None]:
ds['train'][:5]

| | Index | Customer Id | First Name | Last Name | Company | City | Country | Primary Phone Number | Phone 2 | Email | Subscription Date | Website | Full Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | e685B8690f9fbce | Erik | Little | Blankenship PLC | Caitlynmouth | Sao Tome and Principe | 457-542-6899 | 055.415.2664x5425 | shanehester@campbell.org | 2021-12-23 | https://wagner.com/ | Erik Little |
| 1 | 2 | 6EDdBA3a2DFA7De | Yvonne | Shaw | Jensen and Sons | Janetfort | Palestinian Territory | 9610730173 | 531-482-3000x7085 | kleinluis@vang.com | 2021-01-01 | https://www.paul.org/ | Yvonne Shaw |
| 2 | 3 | b9Da13bedEc47de | Jeffery | Ibarra | Rose, Deleon and Sanders | Darlenebury | Albania | (840)539-1797x479 | 209-519-5817 | deckerjamie@bartlett.biz | 2020-03-30 | https://www.morgan-phelps.com/ | Jeffery Ibarra |
| 3 | 4 | 710D4dA2FAa96B5 | James | Walters | Kline and Sons | Donhaven | Bahrain | +1-985-596-1072x3040 | (528)734-8924x054 | dochoa@carey-morse.com | 2022-01-18 | https://brennan.com/ | James Walters |
| 4 | 5 | 3c44ed62d7BfEBC | Leslie | Snyder | Price, Mason and Doyle | Mossfort | Central African Republic | 812-016-9904x8231 | 254.631.9380 | darrylbarber@warren.org | 2020-01-25 | http://www.trujillo-sullivan.info/ | Leslie Snyder |


You can get a proper pandas dataframe like this:

> 🚨 Under the hood, `Dataset.set_format()` changes the return format for the dataset’s `__getitem__()` dunder method. This means that when we want to create a new object like `train_df` from a `Dataset` in the "pandas" format, we need to slice the whole dataset to obtain a `pandas.DataFrame`. You can verify for yourself that the type of `drug_dataset["train"]` is `Dataset`, irrespective of the output format.

In [None]:
df = ds['train'][:]
type(df)

pandas.core.frame.DataFrame

## Datasets from DataFrames

This goes in the other direction, df -> ds:

In [None]:
from datasets import Dataset
new_ds = Dataset.from_pandas(df)  # from_pandas is a classmethod, so call it on the class
new_ds

Dataset({
    features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
    num_rows: 500000
})

In [None]:
new_ds[:2]

{'Index': [1, 2],
 'Customer Id': ['e685B8690f9fbce', '6EDdBA3a2DFA7De'],
 'First Name': ['Erik', 'Yvonne'],
 'Last Name': ['Little', 'Shaw'],
 'Company': ['Blankenship PLC', 'Jensen and Sons'],
 'City': ['Caitlynmouth', 'Janetfort'],
 'Country': ['Sao Tome and Principe', 'Palestinian Territory'],
 'Primary Phone Number': ['457-542-6899', '9610730173'],
 'Phone 2': ['055.415.2664x5425', '531-482-3000x7085'],
 'Email': ['shanehester@campbell.org', 'kleinluis@vang.com'],
 'Subscription Date': ['2021-12-23', '2021-01-01'],
 'Website': ['https://wagner.com/', 'https://www.paul.org/'],
 'Full Name': ['Erik Little', 'Yvonne Shaw']}

### Reset the format
Note you can reset the format at any time:

In [None]:
new_ds.set_format('pandas')
type(new_ds[:3])

pandas.core.frame.DataFrame

In [None]:
new_ds.reset_format()
type(new_ds[:3])

dict

## Creating data partitions

train/test etc.

In [None]:
split_ds = new_ds.train_test_split(train_size=0.8, seed=42)

In [None]:
split_ds

DatasetDict({
    train: Dataset({
        features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
        num_rows: 400000
    })
    test: Dataset({
        features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
        num_rows: 100000
    })
})

To add a validation partition, split the `train` partition again with `train_test_split` and assign the results to keys of the `DatasetDict`:

In [None]:
split_ds2 = split_ds['train'].train_test_split(train_size=0.8)

In [None]:
split_ds['train'] = split_ds2['train']
split_ds['validation'] = split_ds2['test']

In [None]:
split_ds

DatasetDict({
    train: Dataset({
        features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
        num_rows: 320000
    })
    test: Dataset({
        features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
        num_rows: 100000
    })
    validation: Dataset({
        features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
        num_rows: 80000
    })
})

# Saving & Loading Datasets

Let's save our `new_ds` dataset to disk:

In [None]:
new_ds

Dataset({
    features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
    num_rows: 500000
})

In [None]:
new_ds.save_to_disk('tabular_data')

In [None]:
!tree tabular_data

tabular_data
├── dataset.arrow
├── dataset_dict.json
├── dataset_info.json
├── state.json
└── train
    ├── dataset.arrow
    ├── dataset_info.json
    └── state.json

1 directory, 7 files


Now load the data from disk:

In [None]:
from datasets import Dataset
from_disk_ds = Dataset.load_from_disk('tabular_data')  # call the loader on the class, not on an unrelated instance

In [None]:
from_disk_ds

Dataset({
    features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
    num_rows: 500000
})

# Streaming a `dataset`

When you set `streaming=True` you are returned an `IterableDataset` object.

In [None]:
sds = load_dataset("wikitext", "wikitext-2-v1",
                  streaming=True, split="validation")
type(sds)

datasets.iterable_dataset.IterableDataset

## `take` and `skip`

These are special methods for `IterableDataset`.  On a regular `Dataset`, `select` with a `range` does the same job.

In [None]:
list(sds.take(4))

[{'text': ''},
 {'text': ' = Homarus gammarus = \n'},
 {'text': ''},
 {'text': ' Homarus gammarus , known as the European lobster or common lobster , is a species of <unk> lobster from the eastern Atlantic Ocean , Mediterranean Sea and parts of the Black Sea . It is closely related to the American lobster , H. americanus . It may grow to a length of 60 cm ( 24 in ) and a mass of 6 kilograms ( 13 lb ) , and bears a conspicuous pair of claws . In life , the lobsters are blue , only becoming " lobster red " on cooking . Mating occurs in the summer , producing eggs which are carried by the females for up to a year before hatching into <unk> larvae . Homarus gammarus is a highly esteemed food , and is widely caught using lobster pots , mostly around the British Isles . \n'}]

In [None]:
foo = list(sds.skip(100))

In [None]:
len(foo), len(list(sds))

(3660, 3760)

You can also use `itertools.islice` to get multiple items:

In [None]:
from itertools import islice
len(list(islice(sds, 5)))

5
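As a plain-Python analogy (not the library's implementation), `take` and `skip` behave like `islice` over a lazy iterator. The `stream` generator below is a made-up stand-in for a streamed dataset, sized to match the 3760 rows seen above.

```python
from itertools import islice

def stream(n_rows):
    """Stand-in for a streamed dataset: yields rows lazily, one at a time."""
    for i in range(n_rows):
        yield {"text": f"row {i}"}

# Roughly what take(4) does: stop after the first 4 rows.
first_four = list(islice(stream(3760), 4))

# Roughly what skip(100) does: drop the first 100 rows, keep the rest.
after_skip = list(islice(stream(3760), 100, None))

print(len(first_four), len(after_skip))
```

Neither operation needs to know the total length up front, which is what makes them a natural fit for streaming.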

Without `streaming=True` (the old way), you get a regular `Dataset`:

In [None]:
nds = load_dataset("wikitext", "wikitext-2-v1",
                   split="validation")
type(nds)

Found cached dataset wikitext (/Users/hamel/.cache/huggingface/datasets/wikitext/wikitext-2-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)


datasets.arrow_dataset.Dataset

In [None]:
ds

DatasetDict({
    train: Dataset({
        features: ['Index', 'Customer Id', 'First Name', 'Last Name', 'Company', 'City', 'Country', 'Primary Phone Number', 'Phone 2', 'Email', 'Subscription Date', 'Website', 'Full Name'],
        num_rows: 500000
    })
})

# Uploading Dataset to the Hub

See [the docs](https://huggingface.co/learn/nlp-course/chapter5/5?fw=pt#uploading-the-dataset-to-the-hugging-face-hub)

## Login & upload with notebook

In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
from datasets import Dataset
from_disk_ds = Dataset.load_from_disk('tabular_data')  # call the loader on the class, not on an unrelated instance

In [None]:
remote_name = 'hamel/tabular-data-test'
from_disk_ds.push_to_hub(remote_name)

Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading metadata:   0%|          | 0.00/2.19k [00:00<?, ?B/s]

Updating downloaded metadata with the new split.


## Using the cli

In [None]:
!huggingface-cli --help

usage: huggingface-cli <command> [<args>]

positional arguments:
  {login,whoami,logout,repo,lfs-enable-largefiles,lfs-multipart-upload}
                        huggingface-cli command helpers
    login               Log in using a token from
                        huggingface.co/settings/tokens
    whoami              Find out which huggingface.co account you are logged
                        in as.
    logout              Log out
    repo                {create, ls-files} Commands to interact with your
                        huggingface.co repos.
    lfs-enable-largefiles
                        Configure your repository to enable upload of files >
                        5GB.
    lfs-multipart-upload
                        Command will get called by git-lfs, do not call it
                        directly.

optional arguments:
  -h, --help            show this help message and exit


You can use `huggingface-cli login` to log in.

### Datasets are Git repos

HF datasets are just git repos!  You can clone a repo like this:

In [None]:
_dir = remote_name.split('/')[-1]

!rm -rf {_dir}
!git clone 'https://huggingface.co/datasets/'{remote_name}
!ls {_dir}

Cloning into 'tabular-data-test'...
remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 13 (delta 3), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (13/13), 1.93 KiB | 164.00 KiB/s, done.
data               dataset_infos.json


The parquet file is here:

In [None]:
!ls {_dir}'/data'

train-00000-of-00001-646295d7cc3e7eab.parquet


## Dataset Cards

1. You specify the dataset card by filling out the `README.md` file.  In the Hub there is a README creation tool that has a template you can fill out.
2. There are tags for the dataset that you can set in the front matter of the README.  [This is an example](https://raw.githubusercontent.com/huggingface/datasets/main/templates/README_guide.md).  [This application](https://huggingface.co/spaces/huggingface/datasets-tagging) can help you generate the tags.

# FAISS Semantic Search

See [this lesson](https://huggingface.co/learn/nlp-course/chapter5/6?fw=tf). HF datasets have really nice built-in tools to do semantic search.  This is really useful and fun.