When is it more likely to see a running squirrel in Central Park, New York? Let's find out, along with an example of data preparation on YTsaurus.

This notebook demonstrates:

* How to use `map`, `reduce`, `sort`, and `mapreduce` operations on schematized and non-schematized tables
* How to use YTsaurus to transform unstructured data into structured data
* How to process the `Date` type on YTsaurus

At the end of this example, we will find out when it is more likely to encounter a running squirrel in Central Park, New York.

In [1]:
# configure the environment to run this notebook
import uuid
import yt.wrapper as yt

username = yt.get_user_name()
if yt.exists(f"//sys/users/{username}/@user_info/home_path"):
    # prepare working directory on distributed file system
    user_info = yt.get(f"//sys/users/{username}/@user_info")
    homedir = user_info["home_path"]
    # find available VM presets
    cpu_pool_trees = [pool_tree for pool_tree in user_info["available_pool_trees"] if pool_tree.endswith("cpu")] or ["default"]
    h100_pool_trees = [pool_tree for pool_tree in user_info["available_pool_trees"] if pool_tree.endswith("h100")]
    h100_8_pool_trees = [pool_tree for pool_tree in user_info["available_pool_trees"] if pool_tree.endswith("h100_8")]
    workdir = f"{homedir}/tmp/demo_workdir/{uuid.uuid4().hex}"
else:
    cpu_pool_trees = ["default"]
    h100_pool_trees = ["gpu_h100"]
    h100_8_pool_trees = ["gpu_h100"]
    workdir = f"//tmp/examples/{uuid.uuid4().hex}"

yt.create("map_node", workdir, recursive=True, ignore_existing=True)
print("Current working directory:", workdir)

Current working directory: //home/equal_amethyst_vulture/tmp/demo_workdir/6b22ea3130c44386908eaeec58ee6af8


In [2]:
from yt import wrapper as yt
from yt import type_info

import uuid
import re
import datetime
import time

from typing import Iterable
from collections import defaultdict

# Dataset preparation

Let's use `//home/samples/squirrels-hectare-data`. This dataset contains environmental data related to each of the 350 “countable” hectares of Central Park. Examples include weather, litter, animals sighted, and human density.

This dataset has several problems:
1. Date has a non-standard format.
2. The `other_animals_sightings` column is unstructured.
3. Weather data is also unstructured. Let's extract the temperature and structure the weather description.

## Extract weather data

First, request the size of the original dataset. We will use it to estimate the proportion of values that could be parsed.

In [4]:
dataset_size = yt.get("//home/samples/squirrels-hectare-data/@row_count")
print(dataset_size)

700


Looking at the dataset, we can notice some facts:
1. Temperature data is at the beginning of the record
2. Temperature can be indicated in either Fahrenheit or Celsius
3. Typically the data is separated by a comma

This way we can iteratively refine our parsing function, apply it in a map operation, and examine the records that could not be parsed. Since the dataset contains few records, we can read them and inspect them right in this notebook.

In [6]:
F_TEMP_REGEXP = re.compile(r"^~?(\d+\.?\d*)(-\d+)?\s*[°º]?\s*[fF]")
C_TEMP_REGEXP = re.compile(r"^~?(\d+\.?\d*)\s*[°º]?\s*[cC]")
# approximate values with an "s" or "ish" suffix; a group is used instead of a character class
F2_TEMP_REGEXP = re.compile(r"(\d+\.?\d*)(?:s|ish)")

def f_to_c(temp: int) -> int:
    return round((5 / 9) * (temp - 32))

def str_to_int(value: str) -> int:
    return round(float(value))

def parse_weather_data(raw_weather_data: str | None) -> tuple[int | None, list[str]]:
    if raw_weather_data is None:
        return None, []

    weather_data_parts = [part.strip(" ").lower() for part in raw_weather_data.split(",")]

    if len(weather_data_parts) == 0:
        return None, []
    maybe_temp = weather_data_parts[0]

    f_match = F_TEMP_REGEXP.search(maybe_temp)
    if f_match:
        return f_to_c(str_to_int(f_match.group(1))), weather_data_parts[1:]
    
    f2_match = F2_TEMP_REGEXP.search(maybe_temp) 
    if f2_match:
        return f_to_c(str_to_int(f2_match.group(1))), weather_data_parts[1:]

    c_match = C_TEMP_REGEXP.search(maybe_temp)
    if c_match:
        return str_to_int(c_match.group(1)), weather_data_parts[1:]
    
    return None, weather_data_parts
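
Before launching an operation on the cluster, the patterns can be spot-checked locally. The sketch below is a simplified stand-in for the Fahrenheit branch only; the sample strings and the helper name `first_temperature_c` are made up for illustration and are not taken from the dataset.

```python
import re

# Simplified Fahrenheit pattern (assumption: a reduced form of the notebook's F_TEMP_REGEXP)
F_TEMP = re.compile(r"^~?(\d+\.?\d*)\s*[°º]?\s*[fF]")


def f_to_c(temp_f: float) -> int:
    """Convert Fahrenheit to Celsius, rounded to the nearest integer."""
    return round((5 / 9) * (temp_f - 32))


def first_temperature_c(raw: str):
    """Look for a Fahrenheit value in the first comma-separated part only."""
    first_part = raw.split(",")[0].strip().lower()
    match = F_TEMP.search(first_part)
    return f_to_c(float(match.group(1))) if match else None


# Hypothetical inputs, not real records
samples = ["68° F, sunny", "~50F, overcast", "cloudy, drizzling"]
parsed = [first_temperature_c(sample) for sample in samples]
print(parsed)
```

Records for which such a helper returns `None` are the ones worth collecting and inspecting, which is what the map operation in the next section does at scale.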

## Map operation for testing the parsing function

Let's run a [map operation](https://ytsaurus.tech/docs/en/user-guide/data-processing/operations/map) that keeps only the records whose temperature could not be parsed.

In [8]:
def filter_records_without_temperature(record: dict) -> Iterable[dict]:
    temp, weather_data = parse_weather_data(record["sighter_observed_weather_data"])
    if temp is None:  # a parsed temperature of 0 is a valid value
        yield {"sighter_observed_weather_data": record["sighter_observed_weather_data"]}

yt.run_map(
    filter_records_without_temperature,
    source_table="//home/samples/squirrels-hectare-data",
    destination_table=f"{workdir}/records_without_temperature",
)

records = [record for record in yt.read_table(f"{workdir}/records_without_temperature")]
filtered_count = yt.get(f"{workdir}/records_without_temperature/@row_count")
print(f"{filtered_count / dataset_size * 100}%")

for record in records:
    print(record)

2025-06-19 18:49:54,337	INFO	Operation started: https://playground.tracto.ai/playground/operations/2e7b0209-e884d6f7-24dd03e8-a3568eb0/details


2025-06-19 18:49:54,373	INFO	( 0 min) operation 2e7b0209-e884d6f7-24dd03e8-a3568eb0 starting


2025-06-19 18:49:54,915	INFO	( 0 min) operation 2e7b0209-e884d6f7-24dd03e8-a3568eb0 initializing


2025-06-19 18:49:57,660	INFO	( 0 min) Unrecognized spec: {'enable_partitioned_data_balancing': false, 'mapper': {'title': 'filter_records_without_tempera'}}


2025-06-19 18:49:57,699	INFO	( 0 min) operation 2e7b0209-e884d6f7-24dd03e8-a3568eb0: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:49:58,825	INFO	( 0 min) operation 2e7b0209-e884d6f7-24dd03e8-a3568eb0 completing


2025-06-19 18:49:59,365	INFO	( 0 min) operation 2e7b0209-e884d6f7-24dd03e8-a3568eb0 completed


10.857142857142858%
{'sighter_observed_weather_data': None}
{'sighter_observed_weather_data': 'cloudy, slight drizzle'}
{'sighter_observed_weather_data': 'sunny, chilly'}
{'sighter_observed_weather_data': 'Misty'}
{'sighter_observed_weather_data': 'Muggy, cloudy, slightly damp'}
{'sighter_observed_weather_data': None}
{'sighter_observed_weather_data': 'chilly, sunny'}
{'sighter_observed_weather_data': 'cloudy, drizzling'}
{'sighter_observed_weather_data': None}
{'sighter_observed_weather_data': 'drizzling'}
{'sighter_observed_weather_data': 'overcast, damp'}
{'sighter_observed_weather_data': 'Partly cloudy, dewy'}
{'sighter_observed_weather_data': 'Cool, Cloudy'}
{'sighter_observed_weather_data': 'partly cloudy, pleasant'}
{'sighter_observed_weather_data': 'rainy'}
{'sighter_observed_weather_data': 'cool, cloudy'}
{'sighter_observed_weather_data': 'cool, windy'}
{'sighter_observed_weather_data': 'got dark very suddenly'}
{'sighter_observed_weather_data': 'damp, misty, cloudy'}
{'sighte

We can verify that no unparsed temperature data is left: the remaining records simply do not mention a temperature. The proportion of records without a temperature is about 11%, which we consider acceptable for this demonstration.

## Prepare dataset

YTsaurus represents dates as the number of days since `1970-01-01` (like Unix time, but counted in days), so a `Date` value is stored as an integer. For simplicity, we will not schematize the data at this stage, except for a few important columns.
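
The conversion to days since the epoch can be sketched with the standard library alone. This is a minimal illustration that assumes the raw value is an `MMDDYYYY` string, which is what the slicing in the canonizer below implies; the helper names are hypothetical.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)


def mmddyyyy_to_days(raw: str) -> int:
    """Parse an MMDDYYYY string and return the number of days since 1970-01-01."""
    parsed = datetime.datetime.strptime(raw, "%m%d%Y").date()
    return (parsed - EPOCH).days


def days_to_date(days: int) -> datetime.date:
    """Inverse conversion, handy for reading Date values back."""
    return EPOCH + datetime.timedelta(days=days)


days = mmddyyyy_to_days("10102018")
print(days, days_to_date(days))
```

Using a fixed `datetime.date(1970, 1, 1)` rather than a timestamp-based epoch keeps the result independent of the machine's local time zone.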

In the next step we plan to join this data with another dataset, so to avoid problems with implicit type casting, we create an explicit non-strict schema only for three columns:
* date
* hectare
* shift

YTsaurus operations can also be implemented as Python classes.

In [10]:
REMOVE_BRACKETS_REGEXP = re.compile(r"\(.*?\)")

class HectareDataCanonizer:
    def _canonize_date(self, date: str) -> int:
        day, month, year = date[2:4], date[:2], date[4:8]
        date_str = f"{year}-{month}-{day}"
        date_obj = datetime.datetime.strptime(date_str, '%Y-%m-%d')
        # use a fixed epoch: date.fromtimestamp(0) depends on the local time zone
        unix_days = (date_obj.date() - datetime.date(1970, 1, 1)).days
        return unix_days

    def _canonize_other_animals(self, other_animals_sightings: str | None) -> list[str]:
        if not other_animals_sightings:
            return []

        return [REMOVE_BRACKETS_REGEXP.sub("", r).strip(" ").lower() for r in other_animals_sightings.split(",")]

    def __call__(self, record: dict) -> Iterable[dict]:
        record["date"] = self._canonize_date(record["date"])
        temperature, weather_data = parse_weather_data(record["sighter_observed_weather_data"])
        record["temperature_celsius"] = temperature
        record["weather_data"] = weather_data
        record["other_animals_sightings"] = self._canonize_other_animals(record["other_animals_sightings"])
        yield record

canonized_squirrels_hectare_data = f"{workdir}/hectare_data"
schema = yt.schema.TableSchema(strict=False)
schema.add_column("date", type_info.Date)
schema.add_column("hectare", type_info.String)
schema.add_column("shift", type_info.String)

yt.create("table", canonized_squirrels_hectare_data, force=True, attributes={"schema": schema.to_yson_type()})

yt.run_map(
    HectareDataCanonizer(),
    source_table="//home/samples/squirrels-hectare-data",
    destination_table=canonized_squirrels_hectare_data,
)

2025-06-19 18:50:01,342	INFO	Operation started: https://playground.tracto.ai/playground/operations/8c39f736-29c91382-24dd03e8-13d7920a/details


2025-06-19 18:50:01,392	INFO	( 0 min) operation 8c39f736-29c91382-24dd03e8-13d7920a starting


2025-06-19 18:50:01,935	INFO	( 0 min) operation 8c39f736-29c91382-24dd03e8-13d7920a initializing


2025-06-19 18:50:03,609	INFO	( 0 min) Unrecognized spec: {'enable_partitioned_data_balancing': false, 'mapper': {'title': 'HectareDataCanonizer'}}


2025-06-19 18:50:03,655	INFO	( 0 min) operation 8c39f736-29c91382-24dd03e8-13d7920a: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:50:04,246	INFO	( 0 min) operation 8c39f736-29c91382-24dd03e8-13d7920a: running=0     completed=1     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:50:04,789	INFO	( 0 min) operation 8c39f736-29c91382-24dd03e8-13d7920a completed


<yt.wrapper.operation_commands.Operation at 0x7fe9cd296ba0>

# Verify dataset

We have another dataset from the same authors that contains squirrel data for each of the 3,023 sightings, including location coordinates, age, primary and secondary fur color, elevation, activities, communications, and interactions between squirrels and humans.

We can use this data for:
1. Verifying our current dataset
2. Creating a new dataset that includes data from both of them

Since we will be using the reduce operation, we have to [sort](https://ytsaurus.tech/docs/en/user-guide/data-processing/operations/sort) the table by the keys.
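
As a local analogy for why reduce needs sorted input, `itertools.groupby` behaves similarly: it only merges adjacent rows with equal keys. The rows below are made up and merely shaped like the canonized hectare table; this is a sketch of the idea, not of how YTsaurus executes the operation.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows (values are invented for illustration)
rows = [
    {"date": 17811, "hectare": "02B", "shift": "PM", "number_of_squirrels": 3},
    {"date": 17810, "hectare": "01A", "shift": "AM", "number_of_squirrels": 4},
    {"date": 17811, "hectare": "01A", "shift": "AM", "number_of_squirrels": 1},
    {"date": 17810, "hectare": "01A", "shift": "PM", "number_of_squirrels": 2},
]

# "sort" step: order rows by the full key
sorted_rows = sorted(rows, key=itemgetter("date", "hectare", "shift"))

# "reduce" step: group by a prefix of the sort key and aggregate each group
totals = [
    {"date": date, "squirrels": sum(r["number_of_squirrels"] for r in group)}
    for date, group in groupby(sorted_rows, key=itemgetter("date"))
]
print(totals)
```

Grouping by `date` works here because `date` is a prefix of the sort key, which mirrors sorting by `["date", "hectare", "shift"]` and then reducing by `["date"]`.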

In [12]:
yt.run_sort(
    source_table=canonized_squirrels_hectare_data,
    destination_table=canonized_squirrels_hectare_data,
    sort_by=["date", "hectare", "shift"],
)

2025-06-19 18:50:05,392	INFO	Operation started: https://playground.tracto.ai/playground/operations/b5c350fe-7c2f0072-24dd03e8-d55f7daf/details


2025-06-19 18:50:05,436	INFO	( 0 min) operation b5c350fe-7c2f0072-24dd03e8-d55f7daf starting


2025-06-19 18:50:05,985	INFO	( 0 min) operation b5c350fe-7c2f0072-24dd03e8-d55f7daf initializing


2025-06-19 18:50:06,615	INFO	( 0 min) operation b5c350fe-7c2f0072-24dd03e8-d55f7daf: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:50:07,203	INFO	( 0 min) operation b5c350fe-7c2f0072-24dd03e8-d55f7daf: running=0     completed=1     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:50:07,751	INFO	( 0 min) operation b5c350fe-7c2f0072-24dd03e8-d55f7daf completed


<yt.wrapper.operation_commands.Operation at 0x7fe9cc246ed0>

Let's count how many squirrels were seen each day in the first dataset, using a [reduce operation](https://ytsaurus.tech/docs/en/user-guide/data-processing/operations/reduce).

In [14]:
def sum_squirrels_by_date_hectare(key: dict[str, int], records: Iterable[dict]):
    squirrels = 0
    for record in records:
        squirrels += record["number_of_squirrels"]
    yield {"date": int(key["date"]), "squirrels": squirrels}

squirrels_by_date_hectare = f"{workdir}/squirrels_by_date_hectare"

yt.run_reduce(
    sum_squirrels_by_date_hectare,
    source_table=canonized_squirrels_hectare_data,
    destination_table=squirrels_by_date_hectare,
    reduce_by=["date"],
)
yt.run_sort(
    source_table=squirrels_by_date_hectare,
    destination_table=squirrels_by_date_hectare,
    sort_by=["date"],
)

2025-06-19 18:50:09,372	INFO	Operation started: https://playground.tracto.ai/playground/operations/9b05cd75-d05730f5-24dd03e8-24d9a881/details


2025-06-19 18:50:09,419	INFO	( 0 min) operation 9b05cd75-d05730f5-24dd03e8-24d9a881 starting


2025-06-19 18:50:09,962	INFO	( 0 min) operation 9b05cd75-d05730f5-24dd03e8-24d9a881 initializing


2025-06-19 18:50:12,721	INFO	( 0 min) Unrecognized spec: {'enable_partitioned_data_balancing': false, 'reducer': {'title': 'sum_squirrels_by_date_hectare'}}


2025-06-19 18:50:12,762	INFO	( 0 min) operation 9b05cd75-d05730f5-24dd03e8-24d9a881: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:50:13,342	INFO	( 0 min) operation 9b05cd75-d05730f5-24dd03e8-24d9a881: running=0     completed=1     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:50:13,876	INFO	( 0 min) operation 9b05cd75-d05730f5-24dd03e8-24d9a881 completing


2025-06-19 18:50:14,419	INFO	( 0 min) operation 9b05cd75-d05730f5-24dd03e8-24d9a881 completed


2025-06-19 18:50:14,965	INFO	Operation started: https://playground.tracto.ai/playground/operations/bba847ae-398e7ee9-24dd03e8-5020cb15/details


2025-06-19 18:50:15,008	INFO	( 0 min) operation bba847ae-398e7ee9-24dd03e8-5020cb15 starting


2025-06-19 18:50:16,233	INFO	( 0 min) operation bba847ae-398e7ee9-24dd03e8-5020cb15: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:50:17,365	INFO	( 0 min) operation bba847ae-398e7ee9-24dd03e8-5020cb15 completing


2025-06-19 18:50:17,898	INFO	( 0 min) operation bba847ae-398e7ee9-24dd03e8-5020cb15 completed


<yt.wrapper.operation_commands.Operation at 0x7fe9cc2809e0>

Let's count how many squirrels were seen every day in the second dataset.

In [16]:
sorted_squirrels_data = f"{workdir}/sorted_squirrels_data"

yt.run_sort(
    source_table="//home/samples/squirrels",
    destination_table=sorted_squirrels_data,
    sort_by=["date", "hectare", "shift"],
)

2025-06-19 18:50:18,523	INFO	Operation started: https://playground.tracto.ai/playground/operations/1bf213ea-e7c4fb5a-24dd03e8-b58327d3/details


2025-06-19 18:50:18,562	INFO	( 0 min) operation 1bf213ea-e7c4fb5a-24dd03e8-b58327d3 starting


2025-06-19 18:50:19,114	INFO	( 0 min) operation 1bf213ea-e7c4fb5a-24dd03e8-b58327d3 initializing


2025-06-19 18:50:20,786	INFO	( 0 min) operation 1bf213ea-e7c4fb5a-24dd03e8-b58327d3 completing


2025-06-19 18:50:21,336	INFO	( 0 min) operation 1bf213ea-e7c4fb5a-24dd03e8-b58327d3 completed


<yt.wrapper.operation_commands.Operation at 0x7fe9cc282fc0>

In [17]:
def sum_squirrels_by_date_squirrels(key: dict[str, int], records: Iterable[dict]):
    squirrels = []
    for record in records:
        squirrels.append(record["squirrel_id"])
    squirrels_count = len(squirrels)
    yield {"date": int(key["date"]), "squirrels": squirrels_count}

squirrels_by_date_squirrels = f"{workdir}/squirrels_by_date_squirrels"

yt.run_reduce(
    sum_squirrels_by_date_squirrels,
    source_table=sorted_squirrels_data,
    destination_table=squirrels_by_date_squirrels,
    reduce_by=["date"],
)
yt.run_sort(
    source_table=squirrels_by_date_squirrels,
    destination_table=squirrels_by_date_squirrels,
    sort_by=["date"],  # the reduced table only has the "date" and "squirrels" columns
)

2025-06-19 18:50:23,015	INFO	Operation started: https://playground.tracto.ai/playground/operations/9671f4-8fd445f5-24dd03e8-d1607911/details


2025-06-19 18:50:23,068	INFO	( 0 min) operation 9671f4-8fd445f5-24dd03e8-d1607911 starting


2025-06-19 18:50:23,608	INFO	( 0 min) operation 9671f4-8fd445f5-24dd03e8-d1607911 initializing


2025-06-19 18:50:24,732	INFO	( 0 min) Unrecognized spec: {'enable_partitioned_data_balancing': false, 'reducer': {'title': 'sum_squirrels_by_date_squirrel'}}


2025-06-19 18:50:24,774	INFO	( 0 min) operation 9671f4-8fd445f5-24dd03e8-d1607911: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:50:27,123	INFO	( 0 min) operation 9671f4-8fd445f5-24dd03e8-d1607911: running=1     completed=0     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:50:28,292	INFO	( 0 min) operation 9671f4-8fd445f5-24dd03e8-d1607911: running=0     completed=1     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:50:28,935	INFO	( 0 min) operation 9671f4-8fd445f5-24dd03e8-d1607911 completed


2025-06-19 18:50:29,479	INFO	Operation started: https://playground.tracto.ai/playground/operations/4696732f-52420ad1-24dd03e8-e802b333/details


2025-06-19 18:50:29,521	INFO	( 0 min) operation 4696732f-52420ad1-24dd03e8-e802b333 starting


2025-06-19 18:50:30,068	INFO	( 0 min) operation 4696732f-52420ad1-24dd03e8-e802b333 initializing


2025-06-19 18:50:32,440	INFO	( 0 min) operation 4696732f-52420ad1-24dd03e8-e802b333: running=0     completed=1     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:50:32,991	INFO	( 0 min) operation 4696732f-52420ad1-24dd03e8-e802b333 completed


<yt.wrapper.operation_commands.Operation at 0x7fe9cc279670>

## Join tables

Now we can compare the data.
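
The comparison is a reduce-side join: rows from both tables that share a key arrive in one group, and the table index tells them apart. A pure-Python sketch of that pattern, with made-up counts and a hypothetical helper name `compare_counts`, could look like this:

```python
from itertools import groupby


def compare_counts(left_rows, right_rows):
    """Return the dates on which the two per-day counts disagree."""
    # Tag every row with the index of the table it came from,
    # playing the role of context.table_index in a YTsaurus job
    tagged = [(row["date"], 0, row["squirrels"]) for row in left_rows]
    tagged += [(row["date"], 1, row["squirrels"]) for row in right_rows]
    tagged.sort()

    diffs = []
    for date, group in groupby(tagged, key=lambda item: item[0]):
        count_by_table = {table_index: count for _, table_index, count in group}
        if count_by_table.get(0) != count_by_table.get(1):
            diffs.append({
                "date": date,
                "squirrels_data": count_by_table.get(0),
                "hectare_data": count_by_table.get(1),
            })
    return diffs


# Invented sample data
left = [{"date": 1, "squirrels": 5}, {"date": 2, "squirrels": 7}]
right = [{"date": 1, "squirrels": 5}, {"date": 2, "squirrels": 6}, {"date": 3, "squirrels": 1}]
diffs = compare_counts(left, right)
print(diffs)
```

A date that is present in only one table shows up with `None` on the other side, so missing days are reported as differences too.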

In [19]:
@yt.with_context
def reduce_compare_squirrels_count(key: dict[str, str], records: Iterable[dict], context: yt.schema.Context):
    count_by_table = {}
    for record in records:
        table_index = context.table_index
        assert table_index not in count_by_table 
        count_by_table[table_index] = record["squirrels"]
    if count_by_table.get(0) != count_by_table.get(1):
        yield {"date": key["date"], "squirrels_data": count_by_table.get(0), "hectare_data": count_by_table.get(1)}

squirrels_count_diff = f"{workdir}/squirrels_count_diff"

yt.run_reduce(
    reduce_compare_squirrels_count,
    source_table=[squirrels_by_date_squirrels, squirrels_by_date_hectare],
    destination_table=squirrels_count_diff,
    reduce_by=["date"],
)

2025-06-19 18:50:34,540	INFO	Operation started: https://playground.tracto.ai/playground/operations/e3bcc7ee-7244865d-24dd03e8-f7b6fb73/details


2025-06-19 18:50:34,581	INFO	( 0 min) operation e3bcc7ee-7244865d-24dd03e8-f7b6fb73 starting


2025-06-19 18:50:35,125	INFO	( 0 min) operation e3bcc7ee-7244865d-24dd03e8-f7b6fb73 initializing


2025-06-19 18:50:36,794	INFO	( 0 min) Unrecognized spec: {'enable_partitioned_data_balancing': false, 'reducer': {'title': 'reduce_compare_squirrels_count'}}


2025-06-19 18:50:37,417	INFO	( 0 min) operation e3bcc7ee-7244865d-24dd03e8-f7b6fb73: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:50:40,353	INFO	( 0 min) operation e3bcc7ee-7244865d-24dd03e8-f7b6fb73: running=1     completed=0     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:56:55,936	INFO	( 6 min) operation e3bcc7ee-7244865d-24dd03e8-f7b6fb73 completed


<yt.wrapper.operation_commands.Operation at 0x7fe9cc255ee0>

In [20]:
for record in yt.read_table(squirrels_count_diff):
    print(record)

{'date': 17814, 'squirrels_data': 335, 'hectare_data': 334}


We can see that the two datasets disagree on only one day, and the counts differ by just 1. For this demo, we consider the result acceptable.

# Make new dataset

## Use destination table with schema

We can describe the table's schema as a `yt_dataclass` and reuse this object in the next steps.

In [22]:
from typing import Optional, Any

@yt.yt_dataclass
class JoinedDatasetRow:
    date: yt.schema.Date
    hectare: str
    shift: str
    other_animals_sightings: list[str]
    temperature_celsius: Optional[int]
    weather_data: list[str]
    age: str
    squirrel_id: str
    running: bool
    chasing: bool
    climbing: bool
    eating: bool
    foraging: bool
    kuks: bool
    quaas: bool
    moans: bool
    tail_flags: bool
    tail_twitches: bool
    approaches: bool
    indifferent: bool
    runs_from: bool


joined_dataset = f"{workdir}/joined_dataset"
yt.create("table", joined_dataset, force=True, attributes={"schema": yt.schema.TableSchema.from_row_type(JoinedDatasetRow)})


@yt.with_context
def reduce_join(key: dict[str, str], records: Iterable[dict[str, Any]], context: yt.schema.Context) -> Iterable:
    squirrels: list[dict[str, Any]] = []
    other_animals_sightings = set()
    temperatures = []
    weather_data = set()
    for record in records:
        if context.table_index == 0:
            squirrels.append(record)
        elif context.table_index == 1:
            other_animals_sightings.update(record["other_animals_sightings"])
            # skip hectare records without a parsed temperature
            if record["temperature_celsius"] is not None:
                temperatures.append(record["temperature_celsius"])
            weather_data.update(record["weather_data"])
    # average over all hectare records that share this key
    temperature_celsius = round(sum(temperatures) / len(temperatures)) if temperatures else None
    weather_data = list(weather_data)
    other_animals_sightings = list(other_animals_sightings)
    for squirrel in squirrels:
        yield dict(
            date=key["date"],
            hectare=key["hectare"],
            shift=key["shift"],
            other_animals_sightings=other_animals_sightings,
            temperature_celsius=temperature_celsius,
            weather_data=weather_data,
            age=squirrel["age"],
            squirrel_id=squirrel["squirrel_id"],
            running=squirrel["running"],
            chasing=squirrel["chasing"],
            climbing=squirrel["climbing"],
            eating=squirrel["eating"],
            foraging=squirrel["foraging"],
            kuks=squirrel["kuks"],
            quaas=squirrel["quaas"],
            moans=squirrel["moans"],
            tail_flags=squirrel["tail_flags"],
            tail_twitches=squirrel["tail_twitches"],
            approaches=squirrel["approaches"],
            indifferent=squirrel["indifferent"],
            runs_from=squirrel["runs_from"],
        )


yt.run_reduce(
    reduce_join,
    source_table=[sorted_squirrels_data, canonized_squirrels_hectare_data],
    destination_table=joined_dataset,
    reduce_by=["date", "hectare", "shift"],
)

2025-06-19 18:56:58,066	INFO	Operation started: https://playground.tracto.ai/playground/operations/ada1861c-d86c4a51-24dd03e8-37c676c4/details


2025-06-19 18:56:58,112	INFO	( 0 min) operation ada1861c-d86c4a51-24dd03e8-37c676c4 starting


2025-06-19 18:56:58,660	INFO	( 0 min) operation ada1861c-d86c4a51-24dd03e8-37c676c4 initializing


2025-06-19 18:57:01,426	INFO	( 0 min) Unrecognized spec: {'enable_partitioned_data_balancing': false, 'reducer': {'title': 'reduce_join'}}


2025-06-19 18:57:01,467	INFO	( 0 min) operation ada1861c-d86c4a51-24dd03e8-37c676c4: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:57:03,214	INFO	( 0 min) operation ada1861c-d86c4a51-24dd03e8-37c676c4: running=0     completed=1     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-06-19 18:57:03,756	INFO	( 0 min) operation ada1861c-d86c4a51-24dd03e8-37c676c4 completing


2025-06-19 18:57:04,308	INFO	( 0 min) operation ada1861c-d86c4a51-24dd03e8-37c676c4 completed


<yt.wrapper.operation_commands.Operation at 0x7fe9cc27acc0>

# Running squirrels

Using our new dataset, we can now find out when it was more likely to see a running squirrel in Central Park, New York, in October 2018: on cold days or on warm days. Let's do this with a [mapreduce operation](https://ytsaurus.tech/docs/en/user-guide/data-processing/operations/mapreduce). We treat sightings with a temperature of 15 °C or higher as warm and those below 15 °C as cold.
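
The data flow of such an operation can be sketched locally: the mapper emits a bucket label per row, rows are grouped by that label, and the reducer counts each group. The rows, the constant `COLD_THRESHOLD_C`, and the function names below are illustrative assumptions, not part of the YTsaurus API.

```python
from collections import defaultdict

COLD_THRESHOLD_C = 15  # same threshold as in the text


def map_row(row):
    """Emit (bucket, is_running) pairs; rows without a temperature are dropped."""
    if row["temperature_celsius"] is None:
        return
    yield "total", row["running"]
    bucket = "cold" if row["temperature_celsius"] < COLD_THRESHOLD_C else "not_cold"
    yield bucket, row["running"]


def reduce_group(bucket, flags):
    """Count running and non-running sightings within one bucket."""
    running = sum(flags)
    return {"temperature": bucket, "is_running": running, "not_running": len(flags) - running}


# Invented rows shaped like the joined dataset
rows = [
    {"temperature_celsius": 10, "running": True},
    {"temperature_celsius": 12, "running": False},
    {"temperature_celsius": 20, "running": True},
    {"temperature_celsius": None, "running": True},
    {"temperature_celsius": 15, "running": False},
]

# "shuffle" step: collect mapper output by key
grouped = defaultdict(list)
for row in rows:
    for bucket, flag in map_row(row):
        grouped[bucket].append(flag)

result = [reduce_group(bucket, grouped[bucket]) for bucket in sorted(grouped)]
print(result)
```

Emitting an extra `"total"` row from the mapper gives the overall counts in the same pass, at the cost of doubling the mapper output.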

In [24]:
@yt.yt_dataclass
class RunningIsColdRow:
    temperature: str
    is_running: bool


@yt.yt_dataclass
class RunningIsColdResultRow:
    temperature: str
    is_running: int
    not_running: int


class RunningIsColdMapper(yt.TypedJob):
    def __call__(self, record: JoinedDatasetRow) -> Iterable[RunningIsColdRow]:
        if record.temperature_celsius is None:
            return 
        yield RunningIsColdRow(
            is_running=record.running,
            temperature="total",
        )
        yield RunningIsColdRow(
            is_running=record.running,
            temperature="cold" if (record.temperature_celsius < 15) else "not_cold",
        )


class RunningIsColdReducer(yt.TypedJob):
    def __call__(self, records: yt.schema.RowIterator[RunningIsColdRow]) -> Iterable[RunningIsColdResultRow]:
        is_running = 0
        not_running = 0
        for record in records:
            if record.is_running:
                is_running += 1
            else:
                not_running += 1
        yield RunningIsColdResultRow(
            is_running=is_running,
            not_running=not_running,
            temperature=record.temperature,
        )


running_squirrels = f"{workdir}/running_squirrels"

yt.run_map_reduce(
    mapper=RunningIsColdMapper(),
    reducer=RunningIsColdReducer(),
    source_table=joined_dataset,
    destination_table=running_squirrels,
    reduce_by=["temperature"],
)

for line in yt.read_table(running_squirrels):
    print(line)

2025-06-19 18:57:07,071	INFO	Operation started: https://playground.tracto.ai/playground/operations/53e31f6f-c46163f7-24dd03e8-6a678268/details


2025-06-19 18:57:07,109	INFO	( 0 min) operation 53e31f6f-c46163f7-24dd03e8-6a678268 starting


2025-06-19 18:57:07,655	INFO	( 0 min) operation 53e31f6f-c46163f7-24dd03e8-6a678268 initializing


2025-06-19 18:57:10,398	INFO	( 0 min) Unrecognized spec: {'mapper': {'title': 'RunningIsColdMapper'}, 'reducer': {'title': 'RunningIsColdReducer'}}


2025-06-19 18:57:11,022	INFO	( 0 min) operation 53e31f6f-c46163f7-24dd03e8-6a678268: running=0     completed=1     pending=1     failed=0     aborted=0     lost=0     total=2     blocked=0    


2025-06-19 18:57:13,887	INFO	( 0 min) operation 53e31f6f-c46163f7-24dd03e8-6a678268: running=1     completed=1     pending=0     failed=0     aborted=0     lost=0     total=2     blocked=0    


2025-06-19 18:57:18,739	INFO	( 0 min) operation 53e31f6f-c46163f7-24dd03e8-6a678268 completed


{'temperature': 'cold', 'is_running': 296, 'not_running': 799}
{'temperature': 'not_cold', 'is_running': 368, 'not_running': 1264}
{'temperature': 'total', 'is_running': 664, 'not_running': 2063}


In [25]:
print("Cold: ", 296 / (296 + 799))
print("Warm: ", 368 / (368 + 1264))

Cold:  0.27031963470319637
Warm:  0.22549019607843138
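
Instead of typing the counts by hand, the shares can be derived from the result rows themselves. The sketch below copies the three rows printed above into a plain list; in the notebook they could come from reading the result table.

```python
# Rows copied from the operation output above
rows = [
    {"temperature": "cold", "is_running": 296, "not_running": 799},
    {"temperature": "not_cold", "is_running": 368, "not_running": 1264},
    {"temperature": "total", "is_running": 664, "not_running": 2063},
]

# Share of sightings in which the squirrel was running, per bucket
shares = {
    row["temperature"]: row["is_running"] / (row["is_running"] + row["not_running"])
    for row in rows
}

for bucket, share in shares.items():
    print(f"{bucket}: {share:.3f}")
```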


We can see that the proportion of sightings with a running squirrel was higher on cold days. Let's use a [chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test) to check whether this difference is statistically significant.

In [27]:
!pip install scipy


In [28]:
from scipy.stats import chi2_contingency

observed = [
    [296, 799],
    [368, 1264],
]

chi2, p, dof, expected = chi2_contingency(observed)

p < 0.05

np.True_
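
As a rough cross-check that does not depend on scipy, the statistic can also be computed by hand. This sketch assumes Yates' continuity correction, which `chi2_contingency` applies by default to 2x2 tables, and uses the fact that with one degree of freedom the chi-squared tail probability can be written with `math.erfc`.

```python
import math

# Counts from the mapreduce output: [running, not running]
observed = [
    [296, 799],   # cold
    [368, 1264],  # warm
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        # expected count under independence of temperature and running
        expected = row_totals[i] * col_totals[j] / total
        # Yates' continuity correction: shrink each deviation by 0.5
        chi2 += (abs(count - expected) - 0.5) ** 2 / expected

# For one degree of freedom: P(X > chi2) = erfc(sqrt(chi2 / 2))
p_value = math.erfc(math.sqrt(chi2 / 2))
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

The closed form for the p-value holds only for a 2x2 table; larger tables need the general chi-squared distribution, which is what the scipy call provides.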

~~Therefore, in cold days squirrels run more.~~

Therefore, we see that on the cold days of October 2018, it was more likely to see a running squirrel in Central Park, New York, than on the warm days.