# Upload and download YT tables

This notebook shows how to upload and download [tables](https://ytsaurus.tech/docs/en/user-guide/storage/files) in YTsaurus. It covers:

1. How to work with tables without schema.
2. How to create a table with schema and upload data.
3. How to work with structured tables and yt_dataclass objects.
4. How to upload datetime objects to Datetime columns.

In [2]:
from yt import wrapper as yt
from yt import type_info

In [3]:
from datetime import datetime, timedelta
from dataclasses import asdict
import uuid
import time

## Create a base directory for examples

In [5]:
working_dir = f"//tmp/examples/simple-upload-download-table_{uuid.uuid4()}"
yt.create("map_node", working_dir, recursive=True)
print(working_dir)

//tmp/examples/simple-upload-download-table_3de6a588-bffa-483d-bfc0-f4f748543739


## Upload and download unstructured data

The easiest way to upload data is the `write_table` method of the YTsaurus client. `write_table` creates a table without a [schema](https://ytsaurus.tech/docs/en/user-guide/storage/static-schema). `read_table` automatically converts table data to primitive Python types.

This approach is simple, but it does not protect against some typical problems:
1. Typos in column names across records.
2. Unexpected type conversions.
3. Inefficient resource use by large tables without a schema.
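
The first problem in the list can be caught locally before upload. The sketch below is plain Python and not part of the YTsaurus SDK; `find_inconsistent_keys` is a hypothetical helper that compares each record's column names with those of the first record.

```python
from typing import Any


def find_inconsistent_keys(records: list[dict[str, Any]]) -> list[tuple[int, set[str]]]:
    """Return (index, differing keys) for records whose keys differ from the first record."""
    if not records:
        return []
    expected = set(records[0])
    problems = []
    for index, record in enumerate(records):
        # Symmetric difference catches both missing and unexpected column names.
        diff = expected ^ set(record)
        if diff:
            problems.append((index, diff))
    return problems


rows = [
    {"field_string": "a", "field_uint32": 1},
    {"field_string": "b", "field_unit32": 2},  # typo: "unit" instead of "uint"
]
print(find_inconsistent_keys(rows))
```

Running such a check before `write_table` flags the second record, which would otherwise silently create an extra column in a table without a schema.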

In [7]:
records_unstractued = [
    { 
        "field_string": f"string_{i}",
        "field_float": i / 10,
        "field_date": (datetime.now() - timedelta(days=i)).isoformat(),
        "field_uint32": i,
    }
    for i in range(10)
]
print(records_unstractued)

[{'field_string': 'string_0', 'field_float': 0.0, 'field_date': '2025-01-21T19:36:55.206925', 'field_uint32': 0}, {'field_string': 'string_1', 'field_float': 0.1, 'field_date': '2025-01-20T19:36:55.206941', 'field_uint32': 1}, {'field_string': 'string_2', 'field_float': 0.2, 'field_date': '2025-01-19T19:36:55.206944', 'field_uint32': 2}, {'field_string': 'string_3', 'field_float': 0.3, 'field_date': '2025-01-18T19:36:55.206950', 'field_uint32': 3}, {'field_string': 'string_4', 'field_float': 0.4, 'field_date': '2025-01-17T19:36:55.206952', 'field_uint32': 4}, {'field_string': 'string_5', 'field_float': 0.5, 'field_date': '2025-01-16T19:36:55.206955', 'field_uint32': 5}, {'field_string': 'string_6', 'field_float': 0.6, 'field_date': '2025-01-15T19:36:55.206957', 'field_uint32': 6}, {'field_string': 'string_7', 'field_float': 0.7, 'field_date': '2025-01-14T19:36:55.206960', 'field_uint32': 7}, {'field_string': 'string_8', 'field_float': 0.8, 'field_date': '2025-01-13T19:36:55.206964', 'f

In [8]:
unstructured_table_path = f"{working_dir}/unstructured_table"
yt.write_table(unstructured_table_path, records_unstractued)

In [9]:
for record in yt.read_table(unstructured_table_path):
    print(record)

{'field_string': 'string_0', 'field_float': 0.0, 'field_date': '2025-01-21T19:36:55.206925', 'field_uint32': 0}
{'field_string': 'string_1', 'field_float': 0.1, 'field_date': '2025-01-20T19:36:55.206941', 'field_uint32': 1}
{'field_string': 'string_2', 'field_float': 0.2, 'field_date': '2025-01-19T19:36:55.206944', 'field_uint32': 2}
{'field_string': 'string_3', 'field_float': 0.3, 'field_date': '2025-01-18T19:36:55.206950', 'field_uint32': 3}
{'field_string': 'string_4', 'field_float': 0.4, 'field_date': '2025-01-17T19:36:55.206952', 'field_uint32': 4}
{'field_string': 'string_5', 'field_float': 0.5, 'field_date': '2025-01-16T19:36:55.206955', 'field_uint32': 5}
{'field_string': 'string_6', 'field_float': 0.6, 'field_date': '2025-01-15T19:36:55.206957', 'field_uint32': 6}
{'field_string': 'string_7', 'field_float': 0.7, 'field_date': '2025-01-14T19:36:55.206960', 'field_uint32': 7}
{'field_string': 'string_8', 'field_float': 0.8, 'field_date': '2025-01-13T19:36:55.206964', 'field_uint

## Create schema and upload unstructured data

To work with schematized tables, we can create a table with a strict [schema](https://ytsaurus.tech/docs/en/user-guide/storage/static-schema) and upload unstructured data into it.
A schema defines the table structure and column types and allows resources to be used more efficiently.

In [11]:
schema = yt.schema.TableSchema()
schema.add_column("field_string", type_info.String)
schema.add_column("field_float", type_info.Float)
schema.add_column("field_date", type_info.Datetime)
schema.add_column("field_uint32", type_info.Uint32)

TableSchema({'value': [{'name': 'field_string', 'type_v3': 'string'}, {'name': 'field_float', 'type_v3': 'float'}, {'name': 'field_date', 'type_v3': 'datetime'}, {'name': 'field_uint32', 'type_v3': 'uint32'}], 'attributes': {'strict': True, 'unique_keys': False}})

The YTsaurus client has no standard conversion mechanism for `datetime.datetime` objects. Datetime columns expect uint32 values, so we have to convert `datetime.datetime` objects to Python `int`.

In [13]:
def datetime_to_unixtime(datetime_obj: datetime) -> int:
    # workaround for https://github.com/ytsaurus/ytsaurus/issues/309
    return int(time.mktime(datetime_obj.timetuple()))
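
`time.mktime` interprets a naive datetime in the local time zone of the machine that runs the notebook. When timestamps are timezone-aware, or should be treated as UTC, a variant like the following may be more predictable. Both helper names are hypothetical, treating naive datetimes as UTC is a choice made for this sketch, and the range check is an assumption based on the statement above that Datetime columns expect uint32 values.

```python
from datetime import datetime, timezone

UINT32_MAX = 2**32 - 1


def datetime_to_unixtime_utc(datetime_obj: datetime) -> int:
    """Convert a datetime to whole seconds since the Unix epoch.

    Naive datetimes are treated as UTC (an assumption of this sketch).
    """
    if datetime_obj.tzinfo is None:
        datetime_obj = datetime_obj.replace(tzinfo=timezone.utc)
    seconds = int(datetime_obj.timestamp())
    if not 0 <= seconds <= UINT32_MAX:
        raise ValueError(f"{datetime_obj!r} does not fit into an unsigned 32-bit value")
    return seconds


def unixtime_to_datetime_utc(seconds: int) -> datetime:
    """Convert seconds since the epoch back to an aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


moment = datetime(2025, 1, 21, 19, 36, 55, tzinfo=timezone.utc)
seconds = datetime_to_unixtime_utc(moment)
print(seconds, unixtime_to_datetime_utc(seconds).isoformat())
```

The reverse helper is useful when reading the table back, since the column values come out as integers rather than `datetime` objects.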

In [14]:
records_handmade_schema = [
    { 
        "field_string": f"string_{i}",
        "field_float": i / 10,
        "field_date": datetime_to_unixtime(datetime.now() - timedelta(days=i)),
        "field_uint32": i,
    }
    for i in range(10)
]
print(records_handmade_schema)

In [15]:
handmade_schema_table_path = f"{working_dir}/handmade_schema_table"
yt.create_table(handmade_schema_table_path, attributes={"schema": schema.to_yson_type()})



'307d-ea759-13440191-edfb8ca8'

In [16]:
yt.write_table(handmade_schema_table_path, records_handmade_schema)
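
Rows that do not match the schema are expected to fail on write, so it can be convenient to check records locally first. The following is a minimal sketch in plain Python, not an SDK feature: `COLUMN_CHECKS` and `validate_record` are hypothetical names, and the checks only approximate the four column types declared above.

```python
from typing import Any, Callable

# Hypothetical local checks mirroring the four columns declared in the schema.
# Note: bool is a subclass of int in Python, so True/False pass the integer checks.
COLUMN_CHECKS: dict[str, Callable[[Any], bool]] = {
    "field_string": lambda v: isinstance(v, str),
    "field_float": lambda v: isinstance(v, float),
    "field_date": lambda v: isinstance(v, int) and 0 <= v < 2**32,
    "field_uint32": lambda v: isinstance(v, int) and 0 <= v < 2**32,
}


def validate_record(record: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in one record (empty list if none)."""
    problems = []
    # Columns that the schema does not know about, in a stable order.
    for name in sorted(record.keys() - COLUMN_CHECKS.keys()):
        problems.append(f"unexpected column: {name}")
    for name, check in COLUMN_CHECKS.items():
        if name not in record:
            problems.append(f"missing column: {name}")
        elif not check(record[name]):
            problems.append(f"bad value for {name}: {record[name]!r}")
    return problems


sample = {"field_string": "s", "field_float": 0.1, "field_date": "2025-01-21", "field_uint32": 1}
print(validate_record(sample))
```

Here the ISO string in `field_date` is reported, which is the same mismatch the `datetime_to_unixtime` conversion is meant to avoid.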

## `yt.yt_dataclass` objects for schematized data

The YTsaurus SDK provides a Python-native representation of tables. We can use `yt_dataclass` to describe the table's schema and to check that Python types correspond to it. See the [documentation](https://ytsaurus.tech/docs/en/api/python/dataclass).

In [18]:
@yt.yt_dataclass
class Nested:
    a: str
    b: int

@yt.yt_dataclass
class TableRow:
    field_string: str
    field_float: float
    field_datetime: yt.schema.Datetime
    field_uint32: yt.schema.Uint32
    field_nested: Nested

In [19]:
records_dataclasses = [
    TableRow(
        field_string=f"string_{i}", 
        field_float=i / 10,
        field_datetime=datetime_to_unixtime(datetime.now() - timedelta(days=i)),
        field_uint32=i,
        field_nested=Nested(
            a=f"string_{i}",
            b=i,
        )
    )
    for i in range(10)
]
print(records_dataclasses)

[TableRow(field_string='string_0', field_float=0.0, field_datetime=1737488224, field_uint32=0, field_nested=Nested(a='string_0', b=0)), TableRow(field_string='string_1', field_float=0.1, field_datetime=1737401824, field_uint32=1, field_nested=Nested(a='string_1', b=1)), TableRow(field_string='string_2', field_float=0.2, field_datetime=1737315424, field_uint32=2, field_nested=Nested(a='string_2', b=2)), TableRow(field_string='string_3', field_float=0.3, field_datetime=1737229024, field_uint32=3, field_nested=Nested(a='string_3', b=3)), TableRow(field_string='string_4', field_float=0.4, field_datetime=1737142624, field_uint32=4, field_nested=Nested(a='string_4', b=4)), TableRow(field_string='string_5', field_float=0.5, field_datetime=1737056224, field_uint32=5, field_nested=Nested(a='string_5', b=5)), TableRow(field_string='string_6', field_float=0.6, field_datetime=1736969824, field_uint32=6, field_nested=Nested(a='string_6', b=6)), TableRow(field_string='string_7', field_float=0.7, fie

In [20]:
dataclass_table_path = f"{working_dir}/dataclass_based_table"
yt.write_table_structured(dataclass_table_path, TableRow, records_dataclasses)

In [21]:
for record in yt.read_table_structured(dataclass_table_path, TableRow):
    print(record)

TableRow(field_string='string_0', field_float=0.0, field_datetime=1737488224, field_uint32=0, field_nested=Nested(a='string_0', b=0))
TableRow(field_string='string_1', field_float=0.1, field_datetime=1737401824, field_uint32=1, field_nested=Nested(a='string_1', b=1))
TableRow(field_string='string_2', field_float=0.2, field_datetime=1737315424, field_uint32=2, field_nested=Nested(a='string_2', b=2))
TableRow(field_string='string_3', field_float=0.3, field_datetime=1737229024, field_uint32=3, field_nested=Nested(a='string_3', b=3))
TableRow(field_string='string_4', field_float=0.4, field_datetime=1737142624, field_uint32=4, field_nested=Nested(a='string_4', b=4))
TableRow(field_string='string_5', field_float=0.5, field_datetime=1737056224, field_uint32=5, field_nested=Nested(a='string_5', b=5))
TableRow(field_string='string_6', field_float=0.6, field_datetime=1736969824, field_uint32=6, field_nested=Nested(a='string_6', b=6))
TableRow(field_string='string_7', field_float=0.7, field_datet
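
When structured rows need to be passed to code that expects plain dictionaries, nested dataclasses can be flattened with `dataclasses.asdict`. The sketch below uses standard-library dataclasses as stand-ins (`NestedPlain` and `TableRowPlain` are hypothetical names); whether `asdict` behaves identically on `yt_dataclass` classes is an assumption to verify against the SDK documentation.

```python
from dataclasses import asdict, dataclass


@dataclass
class NestedPlain:
    a: str
    b: int


@dataclass
class TableRowPlain:
    field_string: str
    field_uint32: int
    field_nested: NestedPlain


row = TableRowPlain(
    field_string="string_0",
    field_uint32=0,
    field_nested=NestedPlain(a="string_0", b=0),
)
# asdict recurses into nested dataclasses and returns plain dicts.
row_dict = asdict(row)
print(row_dict)
# {'field_string': 'string_0', 'field_uint32': 0, 'field_nested': {'a': 'string_0', 'b': 0}}
```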