This repository was archived by the owner on May 5, 2025. It is now read-only.

Commit 85d6d4a

Introduce SessionTotalsArray class (#365)
* Introduce `SessionTotalsArray` class

  We've been having infra problems with a customer that uploads a LOT of carryforward flags. The issue is that some database columns receive such huge inserts (hundreds of MB) that the DB fails. These huge inserts are caused by the `session_totals` array, because the index of a `SessionTotals` inside the array is the ID of the session that generated those totals. So with many carryforward flags (we think) you end up with huge arrays filled with `null` and a few session totals at the very end. Also, if we shift by lines we might even clear out the totals and be left with a huge array of all nulls.

  As a quick-and-dirty solution for that problem we'll encode this array in a `SessionTotalsArray` class that has a `real_length`, so we know the index of any appended totals, plus the non-null totals keyed by their index. We hope that for large totals arrays this will save enough space to protect the DB.

  I still want to do some analysis to see how impactful these changes really are in compressing the data. Yes, these are dangerous changes, so I'll be testing more thoroughly after hooking up to api and worker.

* Make sure that the items in the `non_null_items` are `SessionTotals`

  First step into integration hell. I had issues when integrating the `SessionTotalsArray` into the worker because at times the class would receive lists as items of `non_null_items` (when pulling from the DB), and at other times `ReportTotals` (when processing a report). When the items were `ReportTotals` I was having issues with the encoding of the `SessionTotalsArray`. So now, when creating it, we first convert all internal items of `non_null_items` into `ReportTotals`.

  One interesting benefit of doing this is that `ReportTotals` encoding includes a step that removes trailing zeros from the array, further reducing the final size of the encoded object.

* Fix session_totals when deleting and carry forwarding sessions

  Step two into integration hell. This bug came up when integrating the worker. In a particular test we were deleting the session with ID 0. You can see in `editable.py` that when deleting sessions (which is relevant in the carryforward context) we were iterating over the session totals to generate new ones. Because `__iter__` is defined it would return a list of sessions, but with the wrong index (the index is now the key of the session in the `non_null_items` dict, NOT its position in an array anymore). To solve this we just need to explicitly delete from the `non_null_items` dict using the given key.

  You can also see in the `test_carryforward.py` file that this unveiled tests that were getting erroneous passes before. By creating the `session_totals` as a proper `SessionTotalsArray` and keeping track of the session totals that should be carried forward (via the flag, "simple" or "complex") we can now see that the results are correct.

  As usual we can't promise a bug-free experience with 100% certainty, but this seems to be a step in the right direction.

* Foreshadow the next release number for SessionTotalsArray changes

* Small touches to SessionTotalsArray + use that in NetworkFile

  Changes to `SessionTotalsArray` such as making the default value in the constructor immutable (from `{}` to `None`) and better type hints. Also expanding `SessionTotalsArray` correctly on iteration (meaning that it now matches what the legacy/expanded format for session_totals should be).

  Bigger changes around `NetworkFile`, where we were not using `SessionTotalsArray` but are now.

* Change `real_length` to `session_count` in `SessionTotalsArray`

  To improve readability and maintain better context over time we're renaming `real_length` to `session_count`.

* Add flag to fallback to legacy report style on save

  Because the new report format is not backwards compatible there's a chance the deploy will go terribly wrong. To make sure we can quickly revert to the old style we are adding a feature flag that enables saving the reports in the legacy style.

  The best-case scenario is that we won't have to use it, but it's no good to be prepared only for the best-case scenario.

  Thanks @scott-codecov for this good idea
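To make the space argument in the first bullet concrete, here is a minimal sketch (not code from this commit) that compares the JSON size of a null-padded legacy list with the sparse layout. The session ID `9999` and the totals values are made up for illustration; only the `meta` / `session_count` key names are taken from the format this commit introduces.

```python
import json

# Hypothetical file: only session 9999 produced totals, every earlier ID is empty.
session_id = 9999
totals = [0, 10, 8, 2, 0, "80.00000"]  # made-up totals values

# Legacy layout: the list index is the session ID, so earlier slots are padded with None.
legacy = [None] * session_id + [totals]

# Sparse layout: keep only the non-null entries plus the logical length.
sparse = {str(session_id): totals, "meta": {"session_count": session_id + 1}}

legacy_size = len(json.dumps(legacy))
sparse_size = len(json.dumps(sparse))
print(legacy_size, sparse_size)
```

Under these assumptions the legacy encoding grows with the highest session ID, while the sparse one grows only with the number of sessions that actually have totals.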
1 parent 3425659 commit 85d6d4a

13 files changed: +796 -164 lines changed


setup.py

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@

 setup(
     name="shared",
-    version="0.8.3",
+    version="0.11.0",
     rust_extensions=[RustExtension("shared.rustyribs", binding=Binding.PyO3)],
     packages=find_packages(exclude=["contrib", "docs", "tests*"]),
     # rust extensions are not zip safe, just like C-extensions.

shared/reports/editable.py

Lines changed: 3 additions & 6 deletions
@@ -152,15 +152,12 @@ def delete_multiple_sessions(self, session_ids_to_delete: List[int]):
             if file is not None:
                 file.delete_multiple_sessions(session_ids_to_delete)
                 if file:
-                    new_session_totals = [
-                        a
-                        for ind, a in enumerate(self._files[file.name].session_totals)
-                        if ind not in session_ids_to_delete
-                    ]
+                    session_totals = self._files[file.name].session_totals
+                    session_totals.delete_many(session_ids_to_delete)
                     self._files[file.name] = dataclasses.replace(
                         self._files.get(file.name),
                         file_totals=file.totals,
-                        session_totals=new_session_totals,
+                        session_totals=session_totals,
                     )
                 else:
                     del self[file.name]
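A small sketch of why the position-based filter was replaced with key-based deletion. It uses plain lists and dicts with placeholder strings rather than the real classes, so the names below are illustrative only.

```python
ids_to_delete = [0]

# Position-based filtering on a legacy list (index == session ID).
legacy = ["totals-0", None, None, "totals-3"]
compacted = [t for ind, t in enumerate(legacy) if ind not in ids_to_delete]
# Dropping slot 0 shifts everything left: session 3's totals now sit at index 2.
print(compacted)

# Key-based deletion on a sparse mapping (key == session ID).
sparse = {0: "totals-0", 3: "totals-3"}
deleted = [sparse.pop(int(session_id), None) for session_id in ids_to_delete]
# The remaining entry keeps its session ID as its key.
print(sparse, deleted)
```

Popping by key leaves the IDs of the surviving sessions untouched, which is the property the carryforward logic relies on.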

shared/reports/types.py

Lines changed: 174 additions & 31 deletions
@@ -1,6 +1,11 @@
+import logging
 from dataclasses import asdict, dataclass
 from decimal import Decimal
-from typing import Any, List, Optional, Sequence, Tuple, Union
+from typing import Any, Dict, List, Optional, Sequence, Tuple, Union
+
+from shared.config import get_config
+
+log = logging.getLogger(__name__)


 @dataclass
@@ -39,6 +44,12 @@ def astuple(self):
             self.diff,
         )

+    def to_database(self):
+        obj = list(self)
+        while obj and obj[-1] in ("0", 0):
+            obj.pop()
+        return obj
+
     def asdict(self):
         return asdict(self)

@@ -61,20 +72,6 @@ def default_totals(cls):
         )


-@dataclass
-class NetworkFile(object):
-    totals: ReportTotals
-    session_totals: ReportTotals
-    diff_totals: ReportTotals
-
-    def astuple(self):
-        return (
-            self.totals.astuple(),
-            [s.astuple() for s in self.session_totals] if self.session_totals else None,
-            self.diff_totals.astuple() if self.diff_totals else None,
-        )
-
-
 @dataclass
 class LineSession(object):
     __slots__ = ("id", "coverage", "branches", "partials", "complexity")
@@ -172,22 +169,6 @@ def __post_init__(self):
                     self.datapoints[i] = CoverageDatapoint(*cov_dp)


-@dataclass
-class ReportFileSummary(object):
-    file_index: int
-    file_totals: ReportTotals = None
-    session_totals: Sequence[ReportTotals] = None
-    diff_totals: Any = None
-
-    def astuple(self):
-        return (
-            self.file_index,
-            self.file_totals,
-            self.session_totals,
-            self.diff_totals,
-        )
-
-
 @dataclass
 class Change(object):
     path: str = None
@@ -206,3 +187,165 @@ def __post_init__(self):
 EMPTY = ""

 TOTALS_MAP = tuple("fnhmpcbdMsCN")
+
+
+SessionTotals = ReportTotals
+
+
+class SessionTotalsArray(object):
+    def __init__(self, session_count=0, non_null_items=None):
+        self.session_count: int = session_count
+
+        parsed_non_null_items = {}
+        if non_null_items is None:
+            non_null_items = {}
+        for key, value in non_null_items.items():
+            if isinstance(value, SessionTotals):
+                parsed_non_null_items[key] = value
+            elif isinstance(value, list):
+                parsed_non_null_items[key] = SessionTotals(*value)
+            else:
+                log.warning(
+                    "Unknown value for SessionTotal. Ignoring.",
+                    extra=dict(session_total=value, key=key),
+                )
+        self.non_null_items: Dict[int, SessionTotals] = parsed_non_null_items
+
+    @classmethod
+    def build_from_encoded_data(cls, sessions_array: Union[dict, list]):
+        if isinstance(sessions_array, dict):
+            # The session_totals array is already encoded in the new format
+            if "meta" not in sessions_array:
+                # This shouldn't happen, but it would be a good indication that processing is not as we expect
+                log.warning(
+                    "meta info not found in encoded SessionArray",
+                    extra=dict(sessions_array=sessions_array),
+                )
+                sessions_array["meta"] = {
+                    "session_count": max(sessions_array.keys()) + 1
+                }
+            meta_info = sessions_array.pop("meta")
+            session_count = meta_info["session_count"]
+            # Force keys to be integers for standarization.
+            # It probably becomes a strong when going to the database
+            non_null_items = {int(key): value for key, value in sessions_array.items()}
+            return cls(session_count=session_count, non_null_items=non_null_items)
+        elif isinstance(sessions_array, list):
+            session_count = len(sessions_array)
+            non_null_items = {}
+            for idx, session_totals in enumerate(sessions_array):
+                if session_totals is not None:
+                    non_null_items[idx] = session_totals
+            non_null_items = non_null_items
+            return cls(session_count=session_count, non_null_items=non_null_items)
+        elif isinstance(sessions_array, cls):
+            return sessions_array
+        elif sessions_array is None:
+            return SessionTotalsArray()
+        log.warning(
+            "Tried to build SessionArray from unknown encoded data.",
+            dict(data=sessions_array, data_type=type(sessions_array)),
+        )
+        return None
+
+    def to_database(self):
+        if get_config("setup", "legacy_report_style", default=False):
+            return [
+                value.to_database() if value is not None else None for value in self
+            ]
+        encoded_obj = {
+            key: value.to_database() for key, value in self.non_null_items.items()
+        }
+        encoded_obj["meta"] = dict(session_count=self.session_count)
+        return encoded_obj
+
+    def __repr__(self) -> str:
+        return f"SessionTotalsArray<session_count={self.session_count}, non_null_items={self.non_null_items}>"
+
+    def __iter__(self):
+        """
+        Expands SessionTotalsArray back to the legacy format
+        e.g. [None, None, ReportTotals, None, ReportTotals]
+        """
+        for idx in range(self.session_count):
+            if idx in self.non_null_items:
+                yield self.non_null_items[idx]
+            else:
+                yield None
+
+    def __eq__(self, value: object) -> bool:
+        if isinstance(value, SessionTotalsArray):
+            return (
+                self.session_count == value.session_count
+                and self.non_null_items == value.non_null_items
+            )
+        return False
+
+    def __bool__(self):
+        return self.session_count > 0
+
+    def append(self, totals: SessionTotals):
+        if totals == None:
+            log.warning("Trying to append None session total to SessionTotalsArray")
+            return
+        new_totals_index = self.session_count
+        self.non_null_items[new_totals_index] = totals
+        self.session_count += 1
+
+    def delete_many(self, indexes_to_delete: List[int]):
+        deleted_items = [self.delete(index) for index in indexes_to_delete]
+        return deleted_items
+
+    def delete(self, index_to_delete: Union[int, str]):
+        return self.non_null_items.pop(int(index_to_delete), None)
+
+
+@dataclass
+class NetworkFile(object):
+    totals: ReportTotals
+    session_totals: SessionTotalsArray
+    diff_totals: ReportTotals
+
+    def __init__(
+        self, totals=None, session_totals=None, diff_totals=None, *args, **kwargs
+    ) -> None:
+        self.totals = totals
+        self.session_totals = SessionTotalsArray.build_from_encoded_data(session_totals)
+        self.diff_totals = diff_totals
+
+    def astuple(self):
+        return (
+            self.totals.astuple(),
+            self.session_totals.to_database(),
+            self.diff_totals.astuple() if self.diff_totals else None,
+        )
+
+
+@dataclass
+class ReportFileSummary(object):
+    file_index: int
+    file_totals: ReportTotals = None
+    session_totals: SessionTotalsArray = None
+    diff_totals: Any = None
+
+    def __init__(
+        self,
+        file_index,
+        file_totals=None,
+        session_totals=None,
+        diff_totals=None,
+        *args,
+        **kwargs,
+    ) -> None:
+        self.file_index = file_index
+        self.file_totals = file_totals
+        self.diff_totals = diff_totals
+        self.session_totals = SessionTotalsArray.build_from_encoded_data(session_totals)
+
+    def astuple(self):
+        return (
+            self.file_index,
+            self.file_totals,
+            self.session_totals,
+            self.diff_totals,
+        )
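The decoding path above accepts both stored layouts. The following standalone sketch mirrors that idea with plain functions, so it can be run without the `shared` package. `decode` and `expand` are hypothetical helper names, the totals values are made up, and unlike `build_from_encoded_data` this version copies the input dict instead of popping `meta` from it.

```python
import json
from typing import Dict, List, Optional, Tuple


def decode(encoded) -> Tuple[int, Dict[int, list]]:
    """Return (session_count, non_null_items) from either stored layout."""
    if isinstance(encoded, dict):
        items = dict(encoded)  # copy so the caller's dict is not mutated
        meta = items.pop("meta")
        # JSON object keys are always strings, so cast them back to int.
        return meta["session_count"], {int(k): v for k, v in items.items()}
    # Legacy layout: a list whose index is the session ID.
    return len(encoded), {i: v for i, v in enumerate(encoded) if v is not None}


def expand(session_count: int, items: Dict[int, list]) -> List[Optional[list]]:
    """Rebuild the legacy null-padded list from the sparse form."""
    return [items.get(i) for i in range(session_count)]


stored = json.loads('{"2": [0, 5], "4": [0, 7], "meta": {"session_count": 5}}')
count, items = decode(stored)
expanded = expand(count, items)
same_as_legacy = decode([None, None, [0, 5], None, [0, 7]]) == (count, items)
print(expanded, same_as_legacy)
```

Both inputs decode to the same `(session_count, non_null_items)` pair, which is what lets old and new rows coexist in the database.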

shared/utils/ReportEncoder.py

Lines changed: 6 additions & 7 deletions
@@ -3,7 +3,7 @@
 from json import JSONEncoder
 from types import GeneratorType

-from shared.reports.types import ReportTotals
+from shared.reports.types import ReportTotals, SessionTotalsArray


 class ReportEncoder(JSONEncoder):
@@ -12,14 +12,13 @@ class ReportEncoder(JSONEncoder):
     def default(self, obj):
         if dataclasses.is_dataclass(obj):
             return obj.astuple()
-        if isinstance(obj, Fraction):
+        elif isinstance(obj, Fraction):
             return str(obj)
-        if isinstance(obj, ReportTotals):
+        elif isinstance(obj, ReportTotals):
             # reduce totals
-            obj = list(obj)
-            while obj and obj[-1] in ("0", 0):
-                obj.pop()
-            return obj
+            return obj.to_database()
+        elif isinstance(obj, SessionTotalsArray):
+            return obj.to_database()
         elif hasattr(obj, "_encode"):
             return obj._encode()
         elif isinstance(obj, GeneratorType):
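The pattern in this hunk, moving the trailing-zero trimming into a `to_database()` method and letting the encoder delegate to it, can be sketched in isolation as follows. `Totals` and `SketchEncoder` are hypothetical stand-ins, not the real `ReportTotals` or `ReportEncoder`.

```python
import json
from fractions import Fraction


class Totals:
    """Hypothetical stand-in for a totals record (not the real ReportTotals)."""

    def __init__(self, *values):
        self.values = list(values)

    def to_database(self):
        # Drop trailing zeros (int 0 or the string "0") to shrink the payload.
        obj = list(self.values)
        while obj and obj[-1] in ("0", 0):
            obj.pop()
        return obj


class SketchEncoder(json.JSONEncoder):
    def default(self, obj):
        # default() is only called for objects json cannot serialize natively.
        if isinstance(obj, Fraction):
            return str(obj)
        elif isinstance(obj, Totals):
            return obj.to_database()
        return super().default(obj)


encoded = json.dumps(
    {"t": Totals(3, 10, 0, "0", 0), "f": Fraction(1, 3)}, cls=SketchEncoder
)
print(encoded)
```

Keeping the trimming on the object means any caller that serializes totals gets the same compact form, whether or not it goes through the JSON encoder.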

shared/utils/make_network_file.py

Lines changed: 3 additions & 5 deletions
@@ -1,11 +1,9 @@
-from shared.reports.types import NetworkFile, ReportTotals
+from shared.reports.types import NetworkFile, ReportTotals, SessionTotalsArray


-def make_network_file(totals, sessions=None, diff=None):
+def make_network_file(totals, sessions_totals: SessionTotalsArray = None, diff=None):
     return NetworkFile(
         ReportTotals(*totals) if totals else ReportTotals(),
-        [ReportTotals(*session) if session else None for session in sessions]
-        if sessions
-        else None,
+        sessions_totals,
         ReportTotals(*diff) if diff else None,
     )

shared/validation/install.py

Lines changed: 1 addition & 0 deletions
@@ -112,6 +112,7 @@ def check_task_config_key(field, value, error):
                         "yaml": {"type": "integer"},
                     },
                 },
+                "legacy_report_style": {"type": "boolean"},
                 "loglvl": {"type": "string", "allowed": ("INFO",)},
                 "max_sessions": {"type": "integer"},
                 "debug": {"type": "boolean"},

tests/integration/test_report.py

Lines changed: 1 addition & 1 deletion
@@ -445,7 +445,7 @@ def test_to_database():
                 "diff": None,
                 "N": 0,
             },
-            '{"files": {"file.py": [0, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], null, null]}, "sessions": {}}',
+            '{"files": {"file.py": [0, [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], {"meta": {"session_count": 0}}, null]}, "sessions": {}}',
         )
     res = Report(
         files={"file.py": [0, ReportTotals()]},
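The updated expectation above shows what an empty session-totals array serializes to by default. As a rough illustration of how the `legacy_report_style` flag from this commit switches between the two layouts, here is a self-contained sketch. `CONFIG`, `get_flag`, and `encode` are hypothetical names standing in for the project's configuration lookup and `to_database()`; the totals values are made up.

```python
# Hypothetical in-memory config; the commit itself reads the flag through
# get_config("setup", "legacy_report_style", default=False).
CONFIG = {"setup": {"legacy_report_style": False}}


def get_flag(*path, default=None):
    """Walk nested dict keys, returning `default` when any key is missing."""
    node = CONFIG
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def encode(session_count, non_null_items):
    """Encode sparse totals in the legacy or the new layout, depending on the flag."""
    if get_flag("setup", "legacy_report_style", default=False):
        # Legacy: null-padded list indexed by session ID.
        return [non_null_items.get(i) for i in range(session_count)]
    # New: sparse mapping plus the logical length.
    encoded = dict(non_null_items)
    encoded["meta"] = {"session_count": session_count}
    return encoded


empty = encode(0, {})
new_style = encode(3, {2: [0, 5]})
CONFIG["setup"]["legacy_report_style"] = True
legacy_style = encode(3, {2: [0, 5]})
print(empty)
print(new_style)
print(legacy_style)
```

Because the flag is read at encode time, flipping it affects only subsequent saves; rows already written in the new layout still need the dict-aware decoding path.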
