# Dataset.map()

This notebook walks through a workflow for `Dataset.map`. The method applies a custom map function to every row of a dataset and can write the output to a new column.


In [1]:
%load_ext autoreload
%autoreload 2
import lilac as ll

ll.set_project_dir('./data')

try:
  glue = ll.get_dataset('local', 'glue_ax_map_2')
except Exception:
  glue = ll.create_dataset(
    ll.DatasetConfig(
      namespace='local',
      name='glue_ax_map_2',
      source=ll.HuggingFaceSource(
        dataset_name='glue',
        config_name='ax',
        sample_size=100
      )))

#ll.start_server()



True

# Upper case 'premise'

The following map upper-cases the `premise` field of each row in the dataset.

The output of the map is returned as a generator.
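To make the lazy, generator-style behaviour concrete, here is a minimal pure-Python sketch that does not use Lilac at all. The helper `lazy_map` and the two sample rows are hypothetical, chosen only for illustration; the sketch shows that no row is processed until the result is iterated.

```python
from typing import Callable, Iterable, Iterator


def lazy_map(rows: Iterable[dict], fn: Callable[[dict], str]) -> Iterator[str]:
  """Yield fn(row) one row at a time; nothing runs until the result is iterated."""
  for row in rows:
    yield fn(row)


# Hypothetical sample rows standing in for a dataset.
rows = [
  {'premise': 'The cat sat on the mat.'},
  {'premise': 'Some dogs like to scratch their ears.'},
]

res = lazy_map(rows, lambda item: item['premise'].upper())
first = next(iter(res))  # Only the first row has been processed at this point.
print(first)
```

Pulling a single item with `next(iter(...))`, as the cell below does, is a cheap way to check a map function before committing to a full run.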


In [2]:
# Upper case 'premise' and print the first result
# This call does not save the output to a column.
def _upper(item: dict) -> str:
  #print('premise==>', item['premise'])
  return item['premise'].upper()

res = glue.map(_upper)
print(next(iter(res)))
print()

# Write the output to a column 'premise_upper'.
# glue.map(lambda item: item['premise'].upper(), output_path='premise_upper', overwrite=True, num_jobs=-1)

# rows = glue.select_rows(['premise', 'premise_upper'], limit=3)
# for row in rows:
#   print(row)


Scheduling task "99197345dced43ada2ba9fda35aaebd7": "[local/glue_ax_map_2] map "_upper"".


Perhaps you already have a cluster running?
Hosting the HTTP server on port 54128 instead


IF THE PIPELINE TOKENIZATION SCHEME DOES NOT CORRESPOND TO THE ONE THAT WAS USED WHEN A MODEL WAS CREATED, IT WOULD BE EXPECTED TO NEGATIVELY IMPACT THE PIPELINE RESULTS.



[local/glue_ax_map_2] map "_upper": 100%|██████████| 100/100 [00:00<00:00, 16572.38it/s]


Task finished "99197345dced43ada2ba9fda35aaebd7": "[local/glue_ax_map_2] map "_upper"" in 5s.


# Resuming a map after an error or computer shutdown

`Dataset.map()` does not lose completed work if an error is thrown partway through a run that writes to disk. The next time it is called, it continues from where it left off. The column is written only once the map has finally completed.
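The resume behaviour can be illustrated with a small pure-Python sketch. This is not Lilac's implementation: the helper `resumable_map`, the `rowid` and `value` keys, and the JSONL cache layout are all assumptions made for the example. The idea is that each result is appended to a cache file as soon as it is computed, so a rerun can skip the rows that are already on disk.

```python
import json
import os
import tempfile
from typing import Callable


def resumable_map(rows: list[dict], fn: Callable[[dict], str], cache_path: str) -> dict[str, str]:
  """Apply fn to each row, appending results to a JSONL cache so a rerun skips finished rows."""
  done: dict[str, str] = {}
  # Load results saved by an earlier (possibly failed) run.
  if os.path.exists(cache_path):
    with open(cache_path) as f:
      for line in f:
        record = json.loads(line)
        done[record['rowid']] = record['value']
  with open(cache_path, 'a') as f:
    for row in rows:
      if row['rowid'] in done:
        continue  # Already computed in an earlier run.
      value = fn(row)  # May raise; rows written before this point stay on disk.
      f.write(json.dumps({'rowid': row['rowid'], 'value': value}) + '\n')
      f.flush()
      done[row['rowid']] = value
  return done


# Hypothetical rows and a map function that fails on one row during the first run.
rows = [{'rowid': str(i), 'premise': f'premise {i}'} for i in range(5)]
state = {'throw': True}
calls: list[str] = []


def upper(item: dict) -> str:
  if state['throw'] and item['rowid'] == '3':
    raise ValueError('Throwing for 3')
  calls.append(item['rowid'])
  return item['premise'].upper()


cache_path = os.path.join(tempfile.mkdtemp(), 'premise_upper.jsonl')
try:
  resumable_map(rows, upper, cache_path)  # Processes rows 0-2, then raises on row 3.
except ValueError:
  pass

state['throw'] = False
calls.clear()
result = resumable_map(rows, upper, cache_path)
print(calls)  # Only the rows that were not finished in the first run.
```

On the second call, `upper` runs only for rows `'3'` and `'4'`, which is the behaviour the cells below demonstrate against a real dataset.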


In [3]:
throw_for_rowid = True

random_row_id = list(glue.select_rows([ll.ROWID], limit=1))[0][ll.ROWID]

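def _upper(item: dict) -> str:
  # Simulate a failure on one specific row while `throw_for_rowid` is set.
  if throw_for_rowid and item[ll.ROWID] == random_row_id:
    raise ValueError(f'Throwing for {random_row_id}')
  # On the rerun, print each row that is actually processed.
  if not throw_for_rowid:
    print(item['premise'].upper())
  return item['premise'].upper()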


# This throws when the map reaches the row whose id is `random_row_id`. When we call it
# again, it will only call _upper() for the rows that were not already processed.
glue.map(_upper, output_path='premise_upper2', overwrite=True, num_jobs=-1)


Scheduling task "1da54efcbe0f497e9bd9085cf98144e6": "[local/glue_ax_map_2][1/12] map "_upper" to "premise_upper2"".
Scheduling task "e6d7dae5dd5a460e88d2d0e384636380": "[local/glue_ax_map_2][2/12] map "_upper" to "premise_upper2"".
Scheduling task "704e030c9c71413881b6539511fe1aa1": "[local/glue_ax_map_2][3/12] map "_upper" to "premise_upper2"".
Scheduling task "5621b696af3a4962a08c3da2fdd67fe3": "[local/glue_ax_map_2][4/12] map "_upper" to "premise_upper2"".
Scheduling task "2ba96252fefa4200b815c8ddabbf5fed": "[local/glue_ax_map_2][5/12] map "_upper" to "premise_upper2"".
Scheduling task "d5a8710fba1647aea11eb1bc96c42111": "[local/glue_ax_map_2][6/12] map "_upper" to "premise_upper2"".
Scheduling task "c92841f9775e4071bae1178c675cf8bd": "[local/glue_ax_map_2][7/12] map "_upper" to "premise_upper2"".
Scheduling task "15f763466dfe4cd8b6344ed2b43e1ed2": "[local/glue_ax_map_2][8/12] map "_upper" to "premise_upper2"".
Scheduling task "a922afdf40b94a638b8dac7b07b62294": "[local/glue_ax_map_

Perhaps you already have a cluster running?
Hosting the HTTP server on port 54248 instead
[local/glue_ax_map_2][1/12] map "_upper" to "premise_upper2":   9%|▉         | 9/100 [00:00<00:00, 1152.95it/s]
[local/glue_ax_map_2][2/12] map "_upper" to "premise_upper2":   9%|▉         | 9/100 [00:00<00:00, 604.27it/s]


Task finished "1da54efcbe0f497e9bd9085cf98144e6": "[local/glue_ax_map_2][1/12] map "_upper" to "premise_upper2"" in 8s.
Task finished "e6d7dae5dd5a460e88d2d0e384636380": "[local/glue_ax_map_2][2/12] map "_upper" to "premise_upper2"" in 8s.


[local/glue_ax_map_2][3/12] map "_upper" to "premise_upper2":   9%|▉         | 9/100 [00:00<00:00, 1401.63it/s]


Task finished "704e030c9c71413881b6539511fe1aa1": "[local/glue_ax_map_2][3/12] map "_upper" to "premise_upper2"" in 8s.


[local/glue_ax_map_2][4/12] map "_upper" to "premise_upper2":   9%|▉         | 9/100 [00:00<00:00, 1301.91it/s]


Task finished "5621b696af3a4962a08c3da2fdd67fe3": "[local/glue_ax_map_2][4/12] map "_upper" to "premise_upper2"" in 9s.
Task error "2ba96252fefa4200b815c8ddabbf5fed": "[local/glue_ax_map_2][5/12] map "_upper" to "premise_upper2"" in 9s.


[local/glue_ax_map_2][5/12] map "_upper" to "premise_upper2":   6%|▌         | 6/100 [00:00<00:00, 937.66it/s]
Key:       2ba96252fefa4200b815c8ddabbf5fed
Function:  _execute_task
args:      (<function _upper at 0x2aef63920>, (36, 45), './data/.cache/lilac/local/glue_ax_map_2/premise_upper2.36-45.jsonl', 'premise_upper2', None, True, False, False, ('2ba96252fefa4200b815c8ddabbf5fed', 0))
kwargs:    {}
Exception: "ValueError('Throwing for 56505f14f6dc496e807018c3675ac840')"

[local/glue_ax_map_2][6/12] map "_upper" to "premise_upper2":   9%|▉         | 9/100 [00:00<00:00, 1587.55it/s]


Task finished "d5a8710fba1647aea11eb1bc96c42111": "[local/glue_ax_map_2][6/12] map "_upper" to "premise_upper2"" in 9s.


[local/glue_ax_map_2][7/12] map "_upper" to "premise_upper2":   9%|▉         | 9/100 [00:00<00:00, 1857.99it/s]
[local/glue_ax_map_2][8/12] map "_upper" to "premise_upper2":   9%|▉         | 9/100 [00:00<00:00, 2045.78it/s]


Task finished "c92841f9775e4071bae1178c675cf8bd": "[local/glue_ax_map_2][7/12] map "_upper" to "premise_upper2"" in 10s.
Task finished "15f763466dfe4cd8b6344ed2b43e1ed2": "[local/glue_ax_map_2][8/12] map "_upper" to "premise_upper2"" in 10s.


[local/glue_ax_map_2][9/12] map "_upper" to "premise_upper2":   9%|▉         | 9/100 [00:00<00:00, 1299.80it/s]


Task finished "a922afdf40b94a638b8dac7b07b62294": "[local/glue_ax_map_2][9/12] map "_upper" to "premise_upper2"" in 10s.


[local/glue_ax_map_2][10/12] map "_upper" to "premise_upper2":   9%|▉         | 9/100 [00:00<00:00, 2163.50it/s]
[local/glue_ax_map_2][11/12] map "_upper" to "premise_upper2":   9%|▉         | 9/100 [00:00<00:00, 1947.32it/s]
[local/glue_ax_map_2][12/12] map "_upper" to "premise_upper2":   1%|          | 1/100 [00:00<00:00, 258.59it/s]


Task finished "b2f6b5dbfec247d993c6bd8e409d6c76": "[local/glue_ax_map_2][10/12] map "_upper" to "premise_upper2"" in 11s.
Task finished "1f4032a8910c431db311d4bfd874b552": "[local/glue_ax_map_2][11/12] map "_upper" to "premise_upper2"" in 11s.


ValueError: Throwing for 56505f14f6dc496e807018c3675ac840

Task finished "c7e6304252154fd8a75f8ad5fc27f66e": "[local/glue_ax_map_2][12/12] map "_upper" to "premise_upper2"" in 11s.


In [4]:
throw_for_rowid = False
# This finishes the map. _upper() is not called again for rows that were already processed.
glue.map(_upper, output_path='premise_upper2', num_jobs=-1)


Scheduling task "37d5981f1df24e57932e66c18b451482": "[local/glue_ax_map_2][1/12] map "_upper" to "premise_upper2"".
Scheduling task "5427454733a14a16af71c66b30d7cc53": "[local/glue_ax_map_2][2/12] map "_upper" to "premise_upper2"".
Scheduling task "3510adb400e040db9eb0656cb16a43f5": "[local/glue_ax_map_2][3/12] map "_upper" to "premise_upper2"".
Scheduling task "17367dfd696d486e9a804c260b15d58b": "[local/glue_ax_map_2][4/12] map "_upper" to "premise_upper2"".
Scheduling task "a255d661dfb24a8a89e85a058ccd004f": "[local/glue_ax_map_2][5/12] map "_upper" to "premise_upper2"".
Scheduling task "d054c47d8ffb4fa7b05dd9ec4c6c1dab": "[local/glue_ax_map_2][6/12] map "_upper" to "premise_upper2"".
Scheduling task "c9cfa05c3fc0423285eda0e01a34a6cf": "[local/glue_ax_map_2][7/12] map "_upper" to "premise_upper2"".
Scheduling task "fe4f94cc4c0b4ac08cf1b84bdb59722c": "[local/glue_ax_map_2][8/12] map "_upper" to "premise_upper2"".
Scheduling task "474178fe84be4234add6574c00e1787d": "[local/glue_ax_map_

[local/glue_ax_map_2][1/12] map "_upper" to "premise_upper2":   0%|          | 0/100 [00:00<?, ?it/s]
[local/glue_ax_map_2][2/12] map "_upper" to "premise_upper2":   0%|          | 0/100 [00:00<?, ?it/s]
[local/glue_ax_map_2][3/12] map "_upper" to "premise_upper2":   0%|          | 0/100 [00:00<?, ?it/s]
[local/glue_ax_map_2][4/12] map "_upper" to "premise_upper2":   0%|          | 0/100 [00:00<?, ?it/s]


THE CAT SAT ON THE MAT.
SOME DOGS LIKE TO SCRATCH THEIR EARS.
ALL DOGS LIKE TO SCRATCH THEIR EARS.


[local/glue_ax_map_2][5/12] map "_upper" to "premise_upper2":   3%|▎         | 3/100 [00:00<00:07, 13.79it/s]
[local/glue_ax_map_2][6/12] map "_upper" to "premise_upper2":   0%|          | 0/100 [00:00<?, ?it/s]
[local/glue_ax_map_2][7/12] map "_upper" to "premise_upper2":   0%|          | 0/100 [00:00<?, ?it/s]
[local/glue_ax_map_2][9/12] map "_upper" to "premise_upper2":   0%|          | 0/100 [00:00<?, ?it/s]
[local/glue_ax_map_2][8/12] map "_upper" to "premise_upper2":   0%|          | 0/100 [00:00<?, ?it/s]
[local/glue_ax_map_2][10/12] map "_upper" to "premise_upper2":   0%|          | 0/100 [00:00<?, ?it/s]


Task finished "5427454733a14a16af71c66b30d7cc53": "[local/glue_ax_map_2][2/12] map "_upper" to "premise_upper2"" in 2s.
Task finished "37d5981f1df24e57932e66c18b451482": "[local/glue_ax_map_2][1/12] map "_upper" to "premise_upper2"" in 3s.
Task finished "3510adb400e040db9eb0656cb16a43f5": "[local/glue_ax_map_2][3/12] map "_upper" to "premise_upper2"" in 3s.
Task finished "17367dfd696d486e9a804c260b15d58b": "[local/glue_ax_map_2][4/12] map "_upper" to "premise_upper2"" in 3s.
Task finished "a255d661dfb24a8a89e85a058ccd004f": "[local/glue_ax_map_2][5/12] map "_upper" to "premise_upper2"" in 3s.
Task finished "c9cfa05c3fc0423285eda0e01a34a6cf": "[local/glue_ax_map_2][7/12] map "_upper" to "premise_upper2"" in 3s.
Task finished "d054c47d8ffb4fa7b05dd9ec4c6c1dab": "[local/glue_ax_map_2][6/12] map "_upper" to "premise_upper2"" in 3s.
Task finished "474178fe84be4234add6574c00e1787d": "[local/glue_ax_map_2][9/12] map "_upper" to "premise_upper2"" in 3s.
Task finished "fe4f94cc4c0b4ac08cf1b84bd

[local/glue_ax_map_2][12/12] map "_upper" to "premise_upper2":   0%|          | 0/100 [00:00<?, ?it/s]
[local/glue_ax_map_2][11/12] map "_upper" to "premise_upper2":   0%|          | 0/100 [00:00<?, ?it/s]


<lilac.data.dataset_duckdb.DuckDBMapOutput at 0x3200a3c90>

Task finished "83b94f11c5134242a63f60dee03d1829": "[local/glue_ax_map_2][12/12] map "_upper" to "premise_upper2"" in 3s.
Task finished "97a309cd763f46b7bace45709ac5ac76": "[local/glue_ax_map_2][11/12] map "_upper" to "premise_upper2"" in 3s.


