Skip to content

Commit

Permalink
Introduce SessionTotalsArray class (#365)
Browse files Browse the repository at this point in the history
* Introduce `SessionTotalsArray` class

Currently we've been having infro problems with a customer that uploads a LOT of carryforward flags. The issue we're having is that some columns on the database have such huge inserts (hundreds of MB) that the DB fails. These huge inserts are caused by the `session_totals` array, because the index in which a `SessionTotals` is inside of the array is the ID of the session that generated those totals.

So for many carryforward flags (we think) you end up having huge arrays filled with `null` and a few session_totals at the very end. Also if we shift by lines we might even clear out the totals and just have a huge array of all nulls.

As a quick-and-dirty solution for that problem we'll be encoding this array in a `SessionTotalsArray` class, that has a `real_length` so we know the index of any appended totals, and the non-null totals indexed by their index. We hope that for large totals arrays this will save enough space to protect the DB.

I still want to make some study to see how impactful these changes reall are in compressing the data.

Yes these are dangerous changes so I 'll be testing more thoroughly after hooking up to api and worker.

* Make sure that the items in the `non_null_items` are `SessionTotals`

First step into integration hell. I had issues when integrating the `SessionTotalsArray` into the worker because at times the class would receive lists as items of the `non_null_items` (when pulling from db), and other times `ReportTotals` (when processing a report).

When the items are `ReportTotals` I was having issues on the encoding of the `SessionTotalsArray`. So now when creating it we first make sure to convert all internal items of `non_null_items` into `ReportTotals` first.

One interesting benefit of doing this is that `ReportTotals` encoding includes a step to remove trailing zeros from the array, further reducing the final size of the encoded object.

* Fix session_totals when deleting and carry forwarding sessions

Step two into integration hell.

This bug came when integrating the worker. In a particular test we were deleting the session of ID 0. You can see in `editable.py` that when deleting sessions (which is relevant in the carryforward context) we were iterating over the session totals to generate new ones. Because `__iter__` is defined it would return a list of sessions but with the wrong index (which is now the key of the session in the `non_null_items` dict, NOT it's position in an array anymore).

To solve this we just need to explicitly delete from the `non_null_items` dict using the given key.

You can also see in the `test_carryforward.py` file that this unveiled tests that were gettign erroneus passes before. By creating the `session_totals` as proper `SessionArrayTotals` and keeping track of the session_totals that should be carried forward (via the flag, "simple" or "complex") we can now see that the results are correct.

As usual we can't promise a bug-free experience with 100% certainty, but this seems to be a step in the right direction.

* Foreshadow the next release number for SessionTotalsArray changes

* Small touches to SessionTotalsArray + use that in NetworkFile

Changes to SessionTotalsArray such as the default value in the constructor being immutable (from `{}` to `None`) and better typehints.

Also expanding SessionTotalsArray on iteration correctly (meaning that it now matches what legacy/expanded format for session_totals should be.

Bigger changes around the `NetworkFile` where we were not using `SessionTotalsArray`, but are now.

* Change `real_length` by `session_count` in `SessionTotalsArray`

To improve readability and maintain better context over time we're changing `real_length` to `session_count`

* Add flag to fallback to legacy report style on save

Because the new report format is not backwards compatible there's a change the deploy will go terribly wrong.
To make sure we can revert back quickly to the old style we are adding a feature flag to enable saving the reports in the legacy style.

The best case scenario is that we won't have to use, but it's no good to be prepare only for the best case scenario.
Thanks @scott-codecov for this good idea
  • Loading branch information
giovanni-guidini authored Apr 6, 2023
1 parent 3425659 commit 85d6d4a
Showing 13 changed files with 796 additions and 164 deletions.
2 changes: 1 addition & 1 deletion setup.py
Original file line number Diff line number Diff line change
@@ -11,7 +11,7 @@

setup(
name="shared",
version="0.8.3",
version="0.11.0",
rust_extensions=[RustExtension("shared.rustyribs", binding=Binding.PyO3)],
packages=find_packages(exclude=["contrib", "docs", "tests*"]),
# rust extensions are not zip safe, just like C-extensions.
9 changes: 3 additions & 6 deletions shared/reports/editable.py
Original file line number Diff line number Diff line change
@@ -152,15 +152,12 @@ def delete_multiple_sessions(self, session_ids_to_delete: List[int]):
if file is not None:
file.delete_multiple_sessions(session_ids_to_delete)
if file:
new_session_totals = [
a
for ind, a in enumerate(self._files[file.name].session_totals)
if ind not in session_ids_to_delete
]
session_totals = self._files[file.name].session_totals
session_totals.delete_many(session_ids_to_delete)
self._files[file.name] = dataclasses.replace(
self._files.get(file.name),
file_totals=file.totals,
session_totals=new_session_totals,
session_totals=session_totals,
)
else:
del self[file.name]
205 changes: 174 additions & 31 deletions shared/reports/types.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,11 @@
import logging
from dataclasses import asdict, dataclass
from decimal import Decimal
from typing import Any, List, Optional, Sequence, Tuple, Union
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union

from shared.config import get_config

log = logging.getLogger(__name__)


@dataclass
@@ -39,6 +44,12 @@ def astuple(self):
self.diff,
)

def to_database(self):
obj = list(self)
while obj and obj[-1] in ("0", 0):
obj.pop()
return obj

def asdict(self):
return asdict(self)

@@ -61,20 +72,6 @@ def default_totals(cls):
)


@dataclass
class NetworkFile(object):
totals: ReportTotals
session_totals: ReportTotals
diff_totals: ReportTotals

def astuple(self):
return (
self.totals.astuple(),
[s.astuple() for s in self.session_totals] if self.session_totals else None,
self.diff_totals.astuple() if self.diff_totals else None,
)


@dataclass
class LineSession(object):
__slots__ = ("id", "coverage", "branches", "partials", "complexity")
@@ -172,22 +169,6 @@ def __post_init__(self):
self.datapoints[i] = CoverageDatapoint(*cov_dp)


@dataclass
class ReportFileSummary(object):
file_index: int
file_totals: ReportTotals = None
session_totals: Sequence[ReportTotals] = None
diff_totals: Any = None

def astuple(self):
return (
self.file_index,
self.file_totals,
self.session_totals,
self.diff_totals,
)


@dataclass
class Change(object):
path: str = None
@@ -206,3 +187,165 @@ def __post_init__(self):
EMPTY = ""

TOTALS_MAP = tuple("fnhmpcbdMsCN")


SessionTotals = ReportTotals


class SessionTotalsArray(object):
def __init__(self, session_count=0, non_null_items=None):
self.session_count: int = session_count

parsed_non_null_items = {}
if non_null_items is None:
non_null_items = {}
for key, value in non_null_items.items():
if isinstance(value, SessionTotals):
parsed_non_null_items[key] = value
elif isinstance(value, list):
parsed_non_null_items[key] = SessionTotals(*value)
else:
log.warning(
"Unknown value for SessionTotal. Ignoring.",
extra=dict(session_total=value, key=key),
)
self.non_null_items: Dict[int, SessionTotals] = parsed_non_null_items

@classmethod
def build_from_encoded_data(cls, sessions_array: Union[dict, list]):
if isinstance(sessions_array, dict):
# The session_totals array is already encoded in the new format
if "meta" not in sessions_array:
# This shouldn't happen, but it would be a good indication that processing is not as we expect
log.warning(
"meta info not found in encoded SessionArray",
extra=dict(sessions_array=sessions_array),
)
sessions_array["meta"] = {
"session_count": max(sessions_array.keys()) + 1
}
meta_info = sessions_array.pop("meta")
session_count = meta_info["session_count"]
# Force keys to be integers for standarization.
# It probably becomes a strong when going to the database
non_null_items = {int(key): value for key, value in sessions_array.items()}
return cls(session_count=session_count, non_null_items=non_null_items)
elif isinstance(sessions_array, list):
session_count = len(sessions_array)
non_null_items = {}
for idx, session_totals in enumerate(sessions_array):
if session_totals is not None:
non_null_items[idx] = session_totals
non_null_items = non_null_items
return cls(session_count=session_count, non_null_items=non_null_items)
elif isinstance(sessions_array, cls):
return sessions_array
elif sessions_array is None:
return SessionTotalsArray()
log.warning(
"Tried to build SessionArray from unknown encoded data.",
dict(data=sessions_array, data_type=type(sessions_array)),
)
return None

def to_database(self):
if get_config("setup", "legacy_report_style", default=False):
return [
value.to_database() if value is not None else None for value in self
]
encoded_obj = {
key: value.to_database() for key, value in self.non_null_items.items()
}
encoded_obj["meta"] = dict(session_count=self.session_count)
return encoded_obj

def __repr__(self) -> str:
return f"SessionTotalsArray<session_count={self.session_count}, non_null_items={self.non_null_items}>"

def __iter__(self):
"""
Expands SessionTotalsArray back to the legacy format
e.g. [None, None, ReportTotals, None, ReportTotals]
"""
for idx in range(self.session_count):
if idx in self.non_null_items:
yield self.non_null_items[idx]
else:
yield None

def __eq__(self, value: object) -> bool:
if isinstance(value, SessionTotalsArray):
return (
self.session_count == value.session_count
and self.non_null_items == value.non_null_items
)
return False

def __bool__(self):
return self.session_count > 0

def append(self, totals: SessionTotals):
if totals == None:
log.warning("Trying to append None session total to SessionTotalsArray")
return
new_totals_index = self.session_count
self.non_null_items[new_totals_index] = totals
self.session_count += 1

def delete_many(self, indexes_to_delete: List[int]):
deleted_items = [self.delete(index) for index in indexes_to_delete]
return deleted_items

def delete(self, index_to_delete: Union[int, str]):
return self.non_null_items.pop(int(index_to_delete), None)


@dataclass
class NetworkFile(object):
totals: ReportTotals
session_totals: SessionTotalsArray
diff_totals: ReportTotals

def __init__(
self, totals=None, session_totals=None, diff_totals=None, *args, **kwargs
) -> None:
self.totals = totals
self.session_totals = SessionTotalsArray.build_from_encoded_data(session_totals)
self.diff_totals = diff_totals

def astuple(self):
return (
self.totals.astuple(),
self.session_totals.to_database(),
self.diff_totals.astuple() if self.diff_totals else None,
)


@dataclass
class ReportFileSummary(object):
file_index: int
file_totals: ReportTotals = None
session_totals: SessionTotalsArray = None
diff_totals: Any = None

def __init__(
self,
file_index,
file_totals=None,
session_totals=None,
diff_totals=None,
*args,
**kwargs,
) -> None:
self.file_index = file_index
self.file_totals = file_totals
self.diff_totals = diff_totals
self.session_totals = SessionTotalsArray.build_from_encoded_data(session_totals)

def astuple(self):
return (
self.file_index,
self.file_totals,
self.session_totals,
self.diff_totals,
)
13 changes: 6 additions & 7 deletions shared/utils/ReportEncoder.py
Original file line number Diff line number Diff line change
@@ -3,7 +3,7 @@
from json import JSONEncoder
from types import GeneratorType

from shared.reports.types import ReportTotals
from shared.reports.types import ReportTotals, SessionTotalsArray


class ReportEncoder(JSONEncoder):
@@ -12,14 +12,13 @@ class ReportEncoder(JSONEncoder):
def default(self, obj):
if dataclasses.is_dataclass(obj):
return obj.astuple()
if isinstance(obj, Fraction):
elif isinstance(obj, Fraction):
return str(obj)
if isinstance(obj, ReportTotals):
elif isinstance(obj, ReportTotals):
# reduce totals
obj = list(obj)
while obj and obj[-1] in ("0", 0):
obj.pop()
return obj
return obj.to_database()
elif isinstance(obj, SessionTotalsArray):
return obj.to_database()
elif hasattr(obj, "_encode"):
return obj._encode()
elif isinstance(obj, GeneratorType):
8 changes: 3 additions & 5 deletions shared/utils/make_network_file.py
Original file line number Diff line number Diff line change
@@ -1,11 +1,9 @@
from shared.reports.types import NetworkFile, ReportTotals
from shared.reports.types import NetworkFile, ReportTotals, SessionTotalsArray


def make_network_file(totals, sessions=None, diff=None):
def make_network_file(totals, sessions_totals: SessionTotalsArray = None, diff=None):
return NetworkFile(
ReportTotals(*totals) if totals else ReportTotals(),
[ReportTotals(*session) if session else None for session in sessions]
if sessions
else None,
sessions_totals,
ReportTotals(*diff) if diff else None,
)
1 change: 1 addition & 0 deletions shared/validation/install.py
Original file line number Diff line number Diff line change
@@ -112,6 +112,7 @@ def check_task_config_key(field, value, error):
"yaml": {"type": "integer"},
},
},
"legacy_report_style": {"type": "boolean"},
"loglvl": {"type": "string", "allowed": ("INFO",)},
"max_sessions": {"type": "integer"},
"debug": {"type": "boolean"},
2 changes: 1 addition & 1 deletion tests/integration/test_report.py
Original file line number Diff line number Diff line change
@@ -445,7 +445,7 @@ def test_to_database():
"diff": None,
"N": 0,
},
'{"files": {"file.py": [0, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], null, null]}, "sessions": {}}',
'{"files": {"file.py": [0, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], {"meta": {"session_count": 0}}, null]}, "sessions": {}}',
)
res = Report(
files={"file.py": [0, ReportTotals()]},
Loading

0 comments on commit 85d6d4a

Please sign in to comment.