When is it more likely to see a running squirrel in Central Park, New York? Let's find out, along with an example of data preparation on YTsaurus.

This notebook demonstrates:

* How to use `map`, `reduce`, `sort`, and `mapreduce` operations on schematized and non-schematized tables
* How to use YTsaurus to transform unstructured data into structured data
* How to process the Date type in YTsaurus

At the end of this example, we will find out when it is more likely to encounter a running squirrel in Central Park, New York.

In [1]:
from yt import wrapper as yt
from yt import type_info

import uuid
import re
import datetime
import time

from typing import Iterable
from collections import defaultdict

## Create a base directory for examples

In [3]:
working_dir = f"//tmp/examples/process-squirrels-data_{uuid.uuid4()}"
yt.create("map_node", working_dir, recursive=True)
print(working_dir)

//tmp/examples/process-squirrels-data_32335170-1236-4633-942e-086b0af41222


# Dataset preparation

Let's use the `//home/samples/squirrels-hectare-data` table. This dataset contains environmental data related to each of the 350 “countable” hectares of Central Park. Examples include weather, litter, animals sighted, and human density.

This dataset has several problems:
1. Date has a non-standard format.
2. The `other_animals_sightings` column is unstructured.
3. Weather data is also unstructured. Let's extract the temperature and structure the weather description.

## Extract weather data

First, request the size of the original dataset. We will use it to estimate the proportion of parsed values.

In [8]:
dataset_size = yt.get("//home/samples/squirrels-hectare-data/@row_count")
print(dataset_size)

700


Looking at the dataset, we can notice some facts:
1. Temperature data is at the beginning of the record
2. Temperature can be indicated in either Fahrenheit or Celsius
3. Typically the data is separated by a comma

This lets us refine the parsing function iteratively: apply it in a map operation, then inspect the records it could not parse. Since the dataset is small, we can read those records and print them right in this notebook.

In [10]:
F_TEMP_REGEXP = re.compile(r"^~?(\d+\.?\d*)(-\d+)?\s*[°º]?\s*[fF]")
C_TEMP_REGEXP = re.compile(r"^~?(\d+\.?\d*)\s*[°º]?\s*[cC]")
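# Matches approximate values such as "60s" or "60ish"; a non-capturing group is needed here,
# because a character class like [s|ish] matches only a single character.
F2_TEMP_REGEXP = re.compile(r"(\d+\.?\d*)(?:s|ish)")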

def f_to_c(temp: int) -> int:
    return round((5 / 9) * (temp - 32))

def str_to_int(value: str) -> int:
    return round(float(value))

def parse_weather_data(raw_weather_data: str | None) -> tuple[int | None, list[str]]:
    if raw_weather_data is None:
        return None, []

    weather_data_parts = [part.strip(" ").lower() for part in raw_weather_data.split(",")]

    if len(weather_data_parts) == 0:
        return None, []
    maybe_temp = weather_data_parts[0]

    f_match = F_TEMP_REGEXP.search(maybe_temp)
    if f_match:
        return f_to_c(str_to_int(f_match.group(1))), weather_data_parts[1:]
    
    f2_match = F2_TEMP_REGEXP.search(maybe_temp) 
    if f2_match:
        return f_to_c(str_to_int(f2_match.group(1))), weather_data_parts[1:]

    c_match = C_TEMP_REGEXP.search(maybe_temp)
    if c_match:
        return str_to_int(c_match.group(1)), weather_data_parts[1:]
    
    return None, weather_data_parts

## Map operation for testing the parsing function

Let's run a [map operation](https://ytsaurus.tech/docs/en/user-guide/data-processing/operations/map) that keeps only the records whose temperature could not be parsed.

In [13]:
def filter_records_without_temperature(record: dict) -> Iterable[dict]:
    temp, weather_data = parse_weather_data(record["sighter_observed_weather_data"])
    if temp is None:  # an explicit check, so that a valid 0 °C reading is not treated as missing
        yield {"sighter_observed_weather_data": record["sighter_observed_weather_data"]}

yt.run_map(
    filter_records_without_temperature,
    source_table="//home/samples/squirrels-hectare-data",
    destination_table=f"{working_dir}/records_without_temperature",
)

records = [record for record in yt.read_table(f"{working_dir}/records_without_temperature")]
filtered_count = yt.get(f"{working_dir}/records_without_temperature/@row_count")
print(f"{filtered_count / dataset_size * 100}%")

for record in records:
    print(record)

2025-01-21 19:29:19,970	INFO	Operation started: https://planck.yt.nebius.yt/playground/operations/e4a31444-46a78637-134403e8-8adec6c/details


2025-01-21 19:29:19,990	INFO	( 0 min) operation e4a31444-46a78637-134403e8-8adec6c starting


2025-01-21 19:29:20,541	INFO	( 0 min) operation e4a31444-46a78637-134403e8-8adec6c initializing


2025-01-21 19:29:21,612	INFO	( 0 min) Unrecognized spec: {'enable_partitioned_data_balancing': false, 'mapper': {'title': 'filter_records_without_tempera'}}


2025-01-21 19:29:23,930	INFO	( 0 min) operation e4a31444-46a78637-134403e8-8adec6c: running=1     completed=0     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:29:35,413	INFO	( 0 min) operation e4a31444-46a78637-134403e8-8adec6c: running=0     completed=1     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:29:38,013	INFO	( 0 min) operation e4a31444-46a78637-134403e8-8adec6c completed


10.857142857142858%
{'sighter_observed_weather_data': None}
{'sighter_observed_weather_data': 'cloudy, slight drizzle'}
{'sighter_observed_weather_data': 'sunny, chilly'}
{'sighter_observed_weather_data': 'Misty'}
{'sighter_observed_weather_data': 'Muggy, cloudy, slightly damp'}
{'sighter_observed_weather_data': None}
{'sighter_observed_weather_data': 'chilly, sunny'}
{'sighter_observed_weather_data': 'cloudy, drizzling'}
{'sighter_observed_weather_data': None}
{'sighter_observed_weather_data': 'drizzling'}
{'sighter_observed_weather_data': 'overcast, damp'}
{'sighter_observed_weather_data': 'Partly cloudy, dewy'}
{'sighter_observed_weather_data': 'Cool, Cloudy'}
{'sighter_observed_weather_data': 'partly cloudy, pleasant'}
{'sighter_observed_weather_data': 'rainy'}
{'sighter_observed_weather_data': 'cool, cloudy'}
{'sighter_observed_weather_data': 'cool, windy'}
{'sighter_observed_weather_data': 'got dark very suddenly'}
{'sighter_observed_weather_data': 'damp, misty, cloudy'}
{'sighte

The remaining records contain no temperature at all, so no parseable temperature data was missed. About 11% of the records have an undefined temperature; let's consider that acceptable for this demonstration.

## Prepare dataset

Dates in YTsaurus are represented as the number of days since `1970-01-01` (like Unix time, but counted in days), so a date value is an integer. For simplicity, we will not use data schematization at this stage, except for some important columns.
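
To make the representation concrete, here is a small sketch in plain Python that converts a calendar date to days since the epoch and back. The raw string `10102018` is an illustrative value in the MMDDYYYY layout that the canonizer in this section handles, and the helper names are hypothetical.

```python
import datetime

EPOCH = datetime.date(1970, 1, 1)


def date_to_days(raw: str) -> int:
    """Convert an MMDDYYYY string to the number of days since 1970-01-01."""
    month, day, year = int(raw[:2]), int(raw[2:4]), int(raw[4:8])
    return (datetime.date(year, month, day) - EPOCH).days


def days_to_date(days: int) -> datetime.date:
    """Inverse conversion, handy when a Date value is read back as an integer."""
    return EPOCH + datetime.timedelta(days=days)


days = date_to_days("10102018")
print(days)                # 17814
print(days_to_date(days))  # 2018-10-10
```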

In the next step we plan to join this data with another dataset, so to avoid problems with implicit type casting, we create an explicit non-strict schema only for three columns:
* date
* hectare
* shift

YTsaurus operations can also be implemented as Python classes.

In [18]:
REMOVE_BRACKETS_REGEXP = re.compile(r"\(.*?\)")

class HectareDataCanonizer:
    def _canonize_date(self, date: str) -> int:
        # The raw value is formatted as MMDDYYYY.
        day, month, year = date[2:4], date[:2], date[4:8]
        date_obj = datetime.date(int(year), int(month), int(day))
        # YTsaurus stores Date as days since 1970-01-01; a fixed epoch avoids
        # the local-timezone dependence of date.fromtimestamp(0).
        return (date_obj - datetime.date(1970, 1, 1)).days

    def _canonize_other_animals(self, other_animals_sightings: str | None) -> list[str]:
        if not other_animals_sightings:
            return []

        return [REMOVE_BRACKETS_REGEXP.sub("", r).strip(" ").lower() for r in other_animals_sightings.split(",")]

    def __call__(self, record: dict) -> Iterable[dict]:
        record["date"] = self._canonize_date(record["date"])
        temperature, weather_data = parse_weather_data(record["sighter_observed_weather_data"])
        record["temperature_celsius"] = temperature
        record["weather_data"] = weather_data
        record["other_animals_sightings"] = self._canonize_other_animals(record["other_animals_sightings"])
        yield record

canonized_squirrels_hectare_data = f"{working_dir}/hectare_data"
schema = yt.schema.TableSchema(strict=False)
schema.add_column("date", type_info.Date)
schema.add_column("hectare", type_info.String)
schema.add_column("shift", type_info.String)

yt.create("table", canonized_squirrels_hectare_data, force=True, attributes={"schema": schema.to_yson_type()})

yt.run_map(
    HectareDataCanonizer(),
    source_table="//home/samples/squirrels-hectare-data",
    destination_table=canonized_squirrels_hectare_data,
)

2025-01-21 19:29:48,978	INFO	Operation started: https://planck.yt.nebius.yt/playground/operations/80abcaf5-61bd8e50-134403e8-d29aa2e0/details


2025-01-21 19:29:48,999	INFO	( 0 min) operation 80abcaf5-61bd8e50-134403e8-d29aa2e0 initializing


2025-01-21 19:29:51,630	INFO	( 0 min) Unrecognized spec: {'enable_partitioned_data_balancing': false, 'mapper': {'title': 'HectareDataCanonizer'}}


2025-01-21 19:29:51,659	INFO	( 0 min) operation 80abcaf5-61bd8e50-134403e8-d29aa2e0: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:29:53,853	INFO	( 0 min) operation 80abcaf5-61bd8e50-134403e8-d29aa2e0: running=1     completed=0     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:35:52,776	INFO	( 6 min) operation 80abcaf5-61bd8e50-134403e8-d29aa2e0 completed


<yt.wrapper.operation_commands.Operation at 0x7fa3e8515750>

# Verify dataset

We also have a second dataset from the same authors. It contains squirrel data for each of the 3,023 sightings, including location coordinates, age, primary and secondary fur color, elevation, activities, communications, and interactions between squirrels and humans.

We can use this data for:
1. Verifying our current dataset
2. Creating a new dataset that includes data from both of them

Since we will be using the reduce operation, we have to [sort](https://ytsaurus.tech/docs/en/user-guide/data-processing/operations/sort) the table by the keys.
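
Why the sort matters can be seen in a local analogy. A reducer processes all rows that share a key as one group, which only works if equal keys are adjacent. The sketch below mimics this with `itertools.groupby` on made-up rows; it runs without a cluster and is an illustration of the idea, not a description of how YTsaurus is implemented internally. The names `reduce_sum` and `local_reduce` are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# Made-up rows; "date" is in days since 1970-01-01.
rows = [
    {"date": 17811, "number_of_squirrels": 3},
    {"date": 17810, "number_of_squirrels": 5},
    {"date": 17811, "number_of_squirrels": 4},
]


def reduce_sum(key, records):
    """Sum the squirrel counts of all records that share one key."""
    yield {"date": key, "squirrels": sum(r["number_of_squirrels"] for r in records)}


def local_reduce(table, key_name, reducer):
    # groupby merges only adjacent rows, so the input must be sorted by the key.
    result = []
    for key, group in groupby(table, key=itemgetter(key_name)):
        result.extend(reducer(key, group))
    return result


unsorted_result = local_reduce(rows, "date", reduce_sum)
sorted_result = local_reduce(sorted(rows, key=itemgetter("date")), "date", reduce_sum)

print(len(unsorted_result))  # 3: the key 17811 is split into two groups
print(sorted_result)
```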

In [22]:
yt.run_sort(
    source_table=canonized_squirrels_hectare_data,
    destination_table=canonized_squirrels_hectare_data,
    sort_by=["date", "hectare", "shift"],
)

2025-01-21 19:35:53,175	INFO	Operation started: https://planck.yt.nebius.yt/playground/operations/8f9fe338-19a84995-134403e8-ff867333/details


2025-01-21 19:35:53,197	INFO	( 0 min) operation 8f9fe338-19a84995-134403e8-ff867333 starting


2025-01-21 19:35:53,721	INFO	( 0 min) operation 8f9fe338-19a84995-134403e8-ff867333 initializing


2025-01-21 19:35:55,423	INFO	( 0 min) operation 8f9fe338-19a84995-134403e8-ff867333: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:35:59,217	INFO	( 0 min) operation 8f9fe338-19a84995-134403e8-ff867333 completed


<yt.wrapper.operation_commands.Operation at 0x7fa3e85728d0>

Let's count how many squirrels were seen each day in the first dataset, using a [reduce operation](https://ytsaurus.tech/docs/en/user-guide/data-processing/operations/reduce).

In [24]:
def sum_squirrels_by_date_hectare(key: dict[str, int], records: Iterable[dict]):
    squirrels = 0
    for record in records:
        squirrels += record["number_of_squirrels"]
    yield {"date": int(key["date"]), "squirrels": squirrels}

squirrels_by_date_hectare = f"{working_dir}/squirrels_by_date_hectare"

yt.run_reduce(
    sum_squirrels_by_date_hectare,
    source_table=canonized_squirrels_hectare_data,
    destination_table=squirrels_by_date_hectare,
    reduce_by=["date"],
)
yt.run_sort(
    source_table=squirrels_by_date_hectare,
    destination_table=squirrels_by_date_hectare,
    sort_by=["date"],
)

2025-01-21 19:36:06,106	INFO	Operation started: https://planck.yt.nebius.yt/playground/operations/bbac56ad-12c2673a-134403e8-b3ce3e42/details


2025-01-21 19:36:06,130	INFO	( 0 min) operation bbac56ad-12c2673a-134403e8-b3ce3e42 starting


2025-01-21 19:36:06,646	INFO	( 0 min) operation bbac56ad-12c2673a-134403e8-b3ce3e42 initializing


2025-01-21 19:36:09,292	INFO	( 0 min) Unrecognized spec: {'enable_partitioned_data_balancing': false, 'reducer': {'title': 'sum_squirrels_by_date_hectare'}}


2025-01-21 19:36:09,312	INFO	( 0 min) operation bbac56ad-12c2673a-134403e8-b3ce3e42: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:36:10,935	INFO	( 0 min) operation bbac56ad-12c2673a-134403e8-b3ce3e42: running=1     completed=0     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:39:44,548	INFO	( 3 min) operation bbac56ad-12c2673a-134403e8-b3ce3e42 completed


2025-01-21 19:39:44,980	INFO	Operation started: https://planck.yt.nebius.yt/playground/operations/5102a997-94566239-134403e8-55d0933f/details


2025-01-21 19:39:45,001	INFO	( 0 min) operation 5102a997-94566239-134403e8-55d0933f starting


2025-01-21 19:39:45,520	INFO	( 0 min) operation 5102a997-94566239-134403e8-55d0933f initializing


2025-01-21 19:39:46,654	INFO	( 0 min) operation 5102a997-94566239-134403e8-55d0933f: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:39:49,878	INFO	( 0 min) operation 5102a997-94566239-134403e8-55d0933f: running=1     completed=0     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:39:57,897	INFO	( 0 min) operation 5102a997-94566239-134403e8-55d0933f completing


2025-01-21 19:39:58,415	INFO	( 0 min) operation 5102a997-94566239-134403e8-55d0933f completed


<yt.wrapper.operation_commands.Operation at 0x7fa3e9e3ccd0>

Let's count how many squirrels were seen every day in the second dataset.

In [26]:
sorted_squirrels_data = f"{working_dir}/sorted_squirrels_data"

yt.run_sort(
    source_table="//home/samples/squirrels",
    destination_table=sorted_squirrels_data,
    sort_by=["date", "hectare", "shift"],
)

2025-01-21 19:39:59,680	INFO	Operation started: https://planck.yt.nebius.yt/playground/operations/586d4921-cd22b52f-134403e8-169ebdcc/details


2025-01-21 19:39:59,694	INFO	( 0 min) operation 586d4921-cd22b52f-134403e8-169ebdcc starting


2025-01-21 19:40:00,216	INFO	( 0 min) operation 586d4921-cd22b52f-134403e8-169ebdcc initializing


2025-01-21 19:40:03,447	INFO	( 0 min) operation 586d4921-cd22b52f-134403e8-169ebdcc: running=1     completed=0     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:40:17,344	INFO	( 0 min) operation 586d4921-cd22b52f-134403e8-169ebdcc: running=0     completed=1     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:40:20,454	INFO	( 0 min) operation 586d4921-cd22b52f-134403e8-169ebdcc completed


<yt.wrapper.operation_commands.Operation at 0x7fa3e847fd10>

In [27]:
def sum_squirrels_by_date_squirrels(key: dict[str, int], records: Iterable[dict]):
    squirrels = []
    for record in records:
        squirrels.append(record["squirrel_id"])
    squirrels_count = len(squirrels)
    yield {"date": int(key["date"]), "squirrels": squirrels_count}

squirrels_by_date_squirrels = f"{working_dir}/squirrels_by_date_squirrels"

yt.run_reduce(
    sum_squirrels_by_date_squirrels,
    source_table=sorted_squirrels_data,
    destination_table=squirrels_by_date_squirrels,
    reduce_by=["date"],
)
yt.run_sort(
    source_table=squirrels_by_date_squirrels,
    destination_table=squirrels_by_date_squirrels,
    sort_by=["date"],  # the aggregated table has only the "date" and "squirrels" columns
)

2025-01-21 19:40:27,440	INFO	Operation started: https://planck.yt.nebius.yt/playground/operations/38798a85-54287e4b-134403e8-b03b2314/details


2025-01-21 19:40:27,466	INFO	( 0 min) operation 38798a85-54287e4b-134403e8-b03b2314 starting


2025-01-21 19:40:27,991	INFO	( 0 min) operation 38798a85-54287e4b-134403e8-b03b2314 initializing


2025-01-21 19:40:28,531	INFO	( 0 min) Unrecognized spec: {'enable_partitioned_data_balancing': false, 'reducer': {'title': 'sum_squirrels_by_date_squirrel'}}


2025-01-21 19:40:29,123	INFO	( 0 min) operation 38798a85-54287e4b-134403e8-b03b2314: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:40:32,467	INFO	( 0 min) operation 38798a85-54287e4b-134403e8-b03b2314: running=1     completed=0     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:43:37,534	INFO	( 3 min) operation 38798a85-54287e4b-134403e8-b03b2314 completed


2025-01-21 19:43:38,228	INFO	Operation started: https://planck.yt.nebius.yt/playground/operations/e6700343-652fa9ff-134403e8-b9a7a44e/details


2025-01-21 19:43:38,250	INFO	( 0 min) operation e6700343-652fa9ff-134403e8-b9a7a44e initializing


2025-01-21 19:43:41,296	INFO	( 0 min) operation e6700343-652fa9ff-134403e8-b9a7a44e: running=0     completed=1     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:43:41,820	INFO	( 0 min) operation e6700343-652fa9ff-134403e8-b9a7a44e completing


2025-01-21 19:43:42,345	INFO	( 0 min) operation e6700343-652fa9ff-134403e8-b9a7a44e completed


<yt.wrapper.operation_commands.Operation at 0x7fa3e816b090>

## Join tables

Now we can compare the data.
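
The comparison relies on a reduce over two input tables: rows with the same key arrive together, and the table index tells which input each row came from. Here is a local sketch of that idea with made-up counts and a hypothetical `compare_counts` helper; in the actual reducer the index is read from `context.table_index`.

```python
from collections import defaultdict

# Made-up per-day counts standing in for the two aggregated tables.
table_0 = [{"date": 17810, "squirrels": 5}, {"date": 17811, "squirrels": 7}]
table_1 = [{"date": 17810, "squirrels": 5}, {"date": 17811, "squirrels": 6}]


def compare_counts(tables):
    """Yield the keys whose counts differ between the two input tables."""
    counts_by_date = defaultdict(dict)
    for table_index, table in enumerate(tables):
        for row in table:
            counts_by_date[row["date"]][table_index] = row["squirrels"]
    for date in sorted(counts_by_date):
        counts = counts_by_date[date]
        # .get() returns None for a date that is missing from one of the tables.
        if counts.get(0) != counts.get(1):
            yield {"date": date, "table_0": counts.get(0), "table_1": counts.get(1)}


differences = list(compare_counts([table_0, table_1]))
print(differences)
```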

In [30]:
@yt.with_context
def reduce_compare_squirrels_count(key: dict[str, str], records: Iterable[dict], context: yt.schema.Context):
    count_by_table = {}
    for record in records:
        table_index = context.table_index
        assert table_index not in count_by_table 
        count_by_table[table_index] = record["squirrels"]
    if count_by_table.get(0) != count_by_table.get(1):
        yield {"date": key["date"], "squirrels_data": count_by_table.get(0), "hectare_data": count_by_table.get(1)}

squirrels_count_diff = f"{working_dir}/squirrels_count_diff"

yt.run_reduce(
    reduce_compare_squirrels_count,
    source_table=[squirrels_by_date_squirrels, squirrels_by_date_hectare],
    destination_table=squirrels_count_diff,
    reduce_by=["date"],
)

2025-01-21 19:43:48,746	INFO	Operation started: https://planck.yt.nebius.yt/playground/operations/661600b1-9d03b2a2-134403e8-77e968b3/details


2025-01-21 19:43:48,765	INFO	( 0 min) operation 661600b1-9d03b2a2-134403e8-77e968b3 starting


2025-01-21 19:43:49,284	INFO	( 0 min) operation 661600b1-9d03b2a2-134403e8-77e968b3 initializing


2025-01-21 19:43:50,878	INFO	( 0 min) Unrecognized spec: {'enable_partitioned_data_balancing': false, 'reducer': {'title': 'reduce_compare_squirrels_count'}}


2025-01-21 19:43:50,897	INFO	( 0 min) operation 661600b1-9d03b2a2-134403e8-77e968b3: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:43:54,205	INFO	( 0 min) operation 661600b1-9d03b2a2-134403e8-77e968b3: running=1     completed=0     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:46:45,171	INFO	( 2 min) operation 661600b1-9d03b2a2-134403e8-77e968b3 completed


<yt.wrapper.operation_commands.Operation at 0x7fa3e9e87710>

In [31]:
for record in yt.read_table(squirrels_count_diff):
    print(record)

{'date': 17814, 'squirrels_data': 335, 'hectare_data': 334}


The counts differ on only one day, and only by 1. For this demo, we consider that result acceptable.

# Make new dataset

## Use destination table with schema

We can describe the table's schema as a `yt_dataclass` and reuse this class in the next steps.

In [35]:
from typing import Optional, Any

@yt.yt_dataclass
class JoinedDatasetRow:
    date: yt.schema.Date
    hectare: str
    shift: str
    other_animals_sightings: list[str]
    temperature_celsius: Optional[int]
    weather_data: list[str]
    age: str
    squirrel_id: str
    running: bool
    chasing: bool
    climbing: bool
    eating: bool
    foraging: bool
    kuks: bool
    quaas: bool
    moans: bool
    tail_flags: bool
    tail_twitches: bool
    approaches: bool
    indifferent: bool
    runs_from: bool


joined_dataset = f"{working_dir}/joined_dataset"
yt.create("table", joined_dataset, force=True, attributes={"schema": yt.schema.TableSchema.from_row_type(JoinedDatasetRow)})


@yt.with_context
def reduce_join(key: dict[str, str], records: Iterable[dict[str, Any]], context: yt.schema.Context) -> Iterable:
    squirrels: list[dict[str, Any]] = []
    other_animals_sightings = set()
    temperature_celsius = None
    weather_data = set()
    for record in records:
        if context.table_index == 0:
            squirrels.append(record)
        elif context.table_index == 1:
            other_animals_sightings.update(set(record["other_animals_sightings"]))
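            record_temperature = record["temperature_celsius"]
            if record_temperature is not None:
                # Pairwise average with the readings seen so far; missing readings are skipped.
                temperature_celsius = (
                    record_temperature
                    if temperature_celsius is None
                    else (temperature_celsius + record_temperature) / 2
                )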
            weather_data.update(set(record["weather_data"]))
    temperature_celsius = round(temperature_celsius) if temperature_celsius is not None else None
    weather_data = list(weather_data)
    other_animals_sightings = list(other_animals_sightings)
    for squirrel in squirrels:
        yield dict(
            date=key["date"],
            hectare=key["hectare"],
            shift=key["shift"],
            other_animals_sightings=other_animals_sightings,
            temperature_celsius=temperature_celsius,
            weather_data=weather_data,
            age=squirrel["age"],
            squirrel_id=squirrel["squirrel_id"],
            running=squirrel["running"],
            chasing=squirrel["chasing"],
            climbing=squirrel["climbing"],
            eating=squirrel["eating"],
            foraging=squirrel["foraging"],
            kuks=squirrel["kuks"],
            quaas=squirrel["quaas"],
            moans=squirrel["moans"],
            tail_flags=squirrel["tail_flags"],
            tail_twitches=squirrel["tail_twitches"],
            approaches=squirrel["approaches"],
            indifferent=squirrel["indifferent"],
            runs_from=squirrel["runs_from"],
        )


yt.run_reduce(
    reduce_join,
    source_table=[sorted_squirrels_data, canonized_squirrels_hectare_data],
    destination_table=joined_dataset,
    reduce_by=["date", "hectare", "shift"],
)

2025-01-21 19:46:51,444	INFO	Operation started: https://planck.yt.nebius.yt/playground/operations/5a8606e7-ae10c92d-134403e8-377ace7e/details


2025-01-21 19:46:51,461	INFO	( 0 min) operation 5a8606e7-ae10c92d-134403e8-377ace7e starting


2025-01-21 19:46:51,980	INFO	( 0 min) operation 5a8606e7-ae10c92d-134403e8-377ace7e initializing


2025-01-21 19:46:52,522	INFO	( 0 min) Unrecognized spec: {'enable_partitioned_data_balancing': false, 'reducer': {'title': 'reduce_join'}}


2025-01-21 19:46:53,613	INFO	( 0 min) operation 5a8606e7-ae10c92d-134403e8-377ace7e: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:46:57,176	INFO	( 0 min) operation 5a8606e7-ae10c92d-134403e8-377ace7e: running=1     completed=0     pending=0     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:46:58,063	INFO	( 0 min) operation 5a8606e7-ae10c92d-134403e8-377ace7e completing


2025-01-21 19:46:59,096	INFO	( 0 min) operation 5a8606e7-ae10c92d-134403e8-377ace7e completed


<yt.wrapper.operation_commands.Operation at 0x7fa3e8561490>

# Running squirrels

Now we can use our new dataset to find out when it was more likely to see a running squirrel in Central Park, New York, in October 2018: on cold days or on warm days. Let's do this using a [mapreduce operation](https://ytsaurus.tech/docs/en/user-guide/data-processing/operations/mapreduce). We will treat days with temperatures of at least 15 °C as warm and days below 15 °C as cold.
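
Before running this on the cluster, the logic can be sketched locally. The mapper emits each row under two keys, `total` and its temperature bucket, and the reduce step counts running and non-running squirrels per key. The rows below are made up, the `mapper` function is a hypothetical stand-in, and `collections.Counter` replaces the shuffle and reduce stages.

```python
from collections import Counter

COLD_THRESHOLD_CELSIUS = 15  # same threshold as in the text above

# Made-up rows, not taken from the joined dataset.
rows = [
    {"temperature_celsius": 10, "running": True},
    {"temperature_celsius": 12, "running": False},
    {"temperature_celsius": 20, "running": False},
    {"temperature_celsius": None, "running": True},
]


def mapper(row):
    """Emit (key, is_running) pairs; rows without a temperature are dropped."""
    if row["temperature_celsius"] is None:
        return
    bucket = "cold" if row["temperature_celsius"] < COLD_THRESHOLD_CELSIUS else "not_cold"
    for key in ("total", bucket):
        yield key, row["running"]


# Counter plays the role of grouping by key and counting inside the reducer.
counts = Counter()
for row in rows:
    for key, is_running in mapper(row):
        counts[(key, is_running)] += 1

for key in ("cold", "not_cold", "total"):
    print(key, {"is_running": counts[(key, True)], "not_running": counts[(key, False)]})
```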

In [37]:
@yt.yt_dataclass
class RunningIsColdRow:
    temperature: str
    is_running: bool


@yt.yt_dataclass
class RunningIsColdResultRow:
    temperature: str
    is_running: int
    not_running: int


class RunningIsColdMapper(yt.TypedJob):
    def __call__(self, record: JoinedDatasetRow) -> Iterable[RunningIsColdRow]:
        if record.temperature_celsius is None:
            return 
        yield RunningIsColdRow(
            is_running=record.running,
            temperature="total",
        )
        yield RunningIsColdRow(
            is_running=record.running,
            temperature="cold" if (record.temperature_celsius < 15) else "not_cold",
        )


class RunningIsColdReducer(yt.TypedJob):
    def __call__(self, records: yt.schema.RowIterator[RunningIsColdRow]) -> Iterable[RunningIsColdResultRow]:
        is_running = 0
        not_running = 0
        for record in records:
            if record.is_running:
                is_running += 1
            else:
                not_running += 1
        yield RunningIsColdResultRow(
            is_running=is_running,
            not_running=not_running,
            temperature=record.temperature,
        )


running_squirrels = f"{working_dir}/running_squirrels"

yt.run_map_reduce(
    mapper=RunningIsColdMapper(),
    reducer=RunningIsColdReducer(),
    source_table=joined_dataset,
    destination_table=running_squirrels,
    reduce_by=["temperature"],
)

for line in yt.read_table(running_squirrels):
    print(line)

2025-01-21 19:47:11,571	INFO	Operation started: https://planck.yt.nebius.yt/playground/operations/63a15d58-67d774a2-134403e8-863f1c8a/details


2025-01-21 19:47:11,634	INFO	( 0 min) operation 63a15d58-67d774a2-134403e8-863f1c8a initializing


2025-01-21 19:47:13,354	INFO	( 0 min) Unrecognized spec: {'mapper': {'title': 'RunningIsColdMapper'}, 'reducer': {'title': 'RunningIsColdReducer'}}


2025-01-21 19:47:15,836	INFO	( 0 min) operation 63a15d58-67d774a2-134403e8-863f1c8a: running=0     completed=0     pending=1     failed=0     aborted=0     lost=0     total=1     blocked=0    


2025-01-21 19:47:19,535	INFO	( 0 min) operation 63a15d58-67d774a2-134403e8-863f1c8a: running=1     completed=1     pending=0     failed=0     aborted=0     lost=0     total=2     blocked=0    


2025-01-21 19:50:05,035	INFO	( 2 min) operation 63a15d58-67d774a2-134403e8-863f1c8a completed


{'temperature': 'cold', 'is_running': 296, 'not_running': 799}
{'temperature': 'not_cold', 'is_running': 368, 'not_running': 1264}
{'temperature': 'total', 'is_running': 664, 'not_running': 2063}


In [38]:
print("Cold: ", 296 / (296 + 799))
print("Warm: ", 368 / (368 + 1264))

Cold:  0.27031963470319637
Warm:  0.22549019607843138


We can see that the proportion of sightings with a running squirrel was higher on cold days. Let's use a [chi-squared test](https://en.wikipedia.org/wiki/Chi-squared_test) to check whether the difference is statistically significant.
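
For a 2×2 table the test can also be computed with the standard library alone, which is useful when SciPy is unavailable. This is a sketch rather than a replacement for `scipy.stats.chi2_contingency`: it handles only 2×2 tables, applies Yates' continuity correction (which SciPy also applies by default when there is one degree of freedom), and uses the fact that for one degree of freedom the p-value equals `erfc(sqrt(chi2 / 2))`. The function name `chi2_test_2x2` is hypothetical.

```python
import math


def chi2_test_2x2(observed):
    """Chi-squared test of independence for a 2x2 table with Yates' correction."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            # Yates' correction: shrink each deviation by 0.5, but not below zero.
            diff = max(abs(observed[i][j] - expected) - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    # Survival function of the chi-squared distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts from the mapreduce output: [running, not running] for cold and warm days.
observed = [
    [296, 799],
    [368, 1264],
]
chi2, p_value = chi2_test_2x2(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```

With these counts the p-value comes out below 0.05, so the difference between cold and warm days is unlikely to be due to chance alone.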

In [40]:
!pip install scipy

Collecting scipy
  Downloading scipy-1.15.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (40.6 MB)
Installing collected packages: scipy
Successfully installed scipy-1.15.1

In [41]:
from scipy.stats import chi2_contingency

observed = [
    [296, 799],
    [368, 1264],
]

chi2, p, dof, expected = chi2_contingency(observed)

p < 0.05

ImportError: Error importing numpy: you should not try to import numpy from
        its source directory; please exit the numpy source tree, and relaunch
        your python interpreter from there.

~~Therefore, in cold days squirrels run more.~~

Therefore, on cold days in October 2018, a running squirrel was more likely to be seen in Central Park, New York, than on warm days.