Merge pull request #83 from rmnldwg/release-1.2.0

Release 1.2.0
rmnldwg · Mar 29, 2024 · b7f453a
2 parents 6559f11 + b64a1c8
commit b7f453a
Showing 22 changed files with 594 additions and 331 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,48 @@
 
 All notable changes to this project will be documented in this file.
 
+<a name="1.2.0"></a>
+## [1.2.0] - 2024-03-29
+
+### Bug Fixes
+
+- (**mid**) `obs_dist` may return 3D array.
+
+
+### Documentation
+
+- Fix unknown version in title.
+- Add missing blank before list.
+- (**mid**) Add comment about midext marginalizing.
+
+
+### Features
+
+- (**mid**) Add `posterior_state_dist()` method.\
+  The `Midline` model now has a `posterior_state_dist()` method, too.
+- (**types**) Base `Model` has state dist methods.\
+  Both `state_dist()` and `posterior_state_dist()` have been added to the
+  `types.Model` base class.
+- Add `marginalize()` method.\
+  With this new method, one can marginalize a (prior or posterior) state
+  distribution over all states that match a provided involvement.\
+  It is used e.g. to refactor the code of the `risk()` methods.
+- (**types**) Add `obs_dist` and `marginalize`.\
+  The `types.Model` base abstract base class now also has the methods
+  `obs_dist` and `marginalize` for better autocomplete support in editors.
+
+
+### Testing
+
+- Remove plain test risk.
+
+
+### Change
+
+- (**types**) Improve type hints for inv. pattern.
+- Rename "diagnose" to "diagnosis" when noun.\
+  When used as a noun, "diagnosis" is correct, not "diagnose".
+
 
 <a name="1.1.0"></a>
 ## [1.1.0] - 2024-03-20
@@ -626,7 +668,8 @@ Almost the entire API has changed. I'd therefore recommend to have a look at the
 - add pre-commit hook to check commit msg
 
 
-[Unreleased]: https://github.com/rmnldwg/lymph/compare/1.1.0...HEAD
+[Unreleased]: https://github.com/rmnldwg/lymph/compare/1.2.0...HEAD
+[1.2.0]: https://github.com/rmnldwg/lymph/compare/1.1.0...1.2.0
 [1.1.0]: https://github.com/rmnldwg/lymph/compare/1.0.0...1.1.0
 [1.0.0]: https://github.com/rmnldwg/lymph/compare/1.0.0.rc2...1.0.0
 [1.0.0.rc2]: https://github.com/rmnldwg/lymph/compare/1.0.0.rc1...1.0.0.rc2
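
The `marginalize()` changelog entry describes summing a (prior or posterior) state distribution over all hidden states that match a provided involvement. The snippet below is a minimal sketch of that idea only, not the implementation in `lymph`: the names `matches` and `marginalize_sketch` are hypothetical, the state ordering from `itertools.product` is an assumption, and only `True`, `False`, and `None` (unknown) are handled as pattern values.

```python
from itertools import product


def matches(state, pattern, lnls):
    """Return True if ``state`` agrees with every LNL that ``pattern`` specifies.

    LNLs that are missing from ``pattern`` or set to ``None`` count as unknown
    and therefore match any state.
    """
    return all(
        pattern.get(lnl) is None or bool(pattern[lnl]) == bool(value)
        for lnl, value in zip(lnls, state)
    )


def marginalize_sketch(state_dist, pattern, lnls):
    """Sum the probabilities of all binary states compatible with ``pattern``."""
    # Assumed ordering: (0, 0, 0), (0, 0, 1), ..., (1, 1, 1).
    states = product([0, 1], repeat=len(lnls))
    return sum(p for state, p in zip(states, state_dist) if matches(state, pattern, lnls))


lnls = ["I", "II", "III"]
uniform = [1 / 8] * 8  # uniform distribution over the 2**3 hidden states
risk_of_II = marginalize_sketch(uniform, {"II": True}, lnls)
```

With a uniform distribution, half of the eight states have LNL II involved, so `risk_of_II` is the sum over those four states. An empty pattern matches every state and returns the total probability mass.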

diff --git a/README.rst b/README.rst
@@ -26,13 +26,13 @@ HNSCC spreads though the lymphatic system of the neck and forms metastases in re
 
 To account for this microscopic involvement, parts of the lymphatic system are often irradiated electively to increase tumor control. Which parts are included in this elective clinical target volume is currently decided based on guidelines [1]_ [2]_ [3]_ [4]_. These in turn are derived from reports of the prevalence of involvement per lymph node level (LNL), i.e. the portion of patients that were diagnosed with metastases in any given LNL, stratified by primary tumor location. It is recommended to include a LNL in the elective target volume if 10 - 15% of patients showed involvement in that particular level.
 
-However, while the prevalence of involvement has been reported in the literature [5]_ [6]_, and the general lymph drainage pathways are understood well, the detailed progression patterns of HNSCC remain poorly quantified. We believe that the risk for microscopic involvement in an LNL depends highly on the specific diagnose of a particular patient and their treatment can hence be personalized if the progression patterns were better quantified.
+However, while the prevalence of involvement has been reported in the literature [5]_ [6]_, and the general lymph drainage pathways are understood well, the detailed progression patterns of HNSCC remain poorly quantified. We believe that the risk for microscopic involvement in an LNL depends highly on the specific diagnosis of a particular patient and their treatment can hence be personalized if the progression patterns were better quantified.
 
 
 Our Goal
 ========
 
-With this Python package we want to provide a framework to accurately predict the risk for microscopic metastases in any lymph node level for the specific diagnose a particular patient presents with.
+With this Python package we want to provide a framework to accurately predict the risk for microscopic metastases in any lymph node level for the specific diagnosis a particular patient presents with.
 
 The implemented model is highly interpretable and was developed together with clinicians to accurately represent the anatomy of the lymphatic drainiage. It can be trained with data that reports the patterns of lymphatic progression in detail, like the `dataset(s) <https://github.com/rmnldwg/lydata>`_ we collected at our institution, the University Hospital Zurich (USZ).
 

diff --git a/docs/source/components.rst b/docs/source/components.rst
@@ -18,10 +18,10 @@ Diagnostic Modalities
     :show-inheritance:
 
 
-Marginalization over Diagnose Times
+Marginalization over Diagnosis Times
 -----------------------------------
 
-.. automodule:: lymph.diagnose_times
+.. automodule:: lymph.diagnosis_times
     :members:
     :special-members: __init__, __hash__
     :show-inheritance:

diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -3,36 +3,19 @@
 # This file only contains a selection of the most common options. For a full
 # list see the documentation:
 # https://www.sphinx-doc.org/en/main/usage/configuration.html
-
-# -- Path setup --------------------------------------------------------------
-
-# If extensions (or modules to document with autodoc) are in another directory,
-# add these directories to sys.path here. If the directory is relative to the
-# documentation root, use os.path.abspath to make it absolute, like shown here.
-#
-import os
-import sys
-
-from pkg_resources import DistributionNotFound, get_distribution
-
-sys.path.insert(0, os.path.abspath('../..'))
-
-try:
-    __version__ = get_distribution("lymph").version
-except DistributionNotFound:
-    __version__ = "unknown version"
-
+import lymph
 
 # -- Project information -----------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
 
 project = 'lymph'
 copyright = '2022, Roman Ludwig'
 author = 'Roman Ludwig'
 gh_username = 'rmnldwg'
 
-version = __version__
+version = lymph.__version__
 # The full version, including alpha/beta/rc tags
-release = __version__
+release = lymph.__version__
 
 
 # -- General configuration ---------------------------------------------------
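
The hunk above drops the `pkg_resources` lookup and reads `lymph.__version__` directly, which requires the package to be importable when the docs are built. For comparison, a version lookup that keeps the old "unknown version" fallback without `pkg_resources` could be sketched with the standard library's `importlib.metadata`. This is an illustrative alternative, not what the commit does, and `get_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def get_version(dist_name: str, fallback: str = "unknown version") -> str:
    """Return the installed version of ``dist_name`` or ``fallback`` if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return fallback


docs_version = get_version("lymph")
```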

diff --git a/docs/source/quickstart_bilateral.ipynb b/docs/source/quickstart_bilateral.ipynb
@@ -222,9 +222,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Distribution over Diagnose Times\n",
+    "## Distribution over Diagnosis Times\n",
     "\n",
-    "Just as with the modalities, the distributions over diagnose times are delegated to the two sides via the exact same API as in the `Unilateral` model:"
+    "Just as with the modalities, the distributions over diagnosis times are delegated to the two sides via the exact same API as in the `Unilateral` model:"
    ]
   },
   {
@@ -276,7 +276,7 @@
     "\n",
     ":::{note}\n",
     "\n",
-    "You cannot set the diagnose time distributions asymmetrically! With the modalities this may make sense (although it is not really supported, you may try), but for the diagnose times, this will surely break!\n",
+    "You cannot set the diagnosis time distributions asymmetrically! With the modalities this may make sense (although it is not really supported, you may try), but for the diagnosis times, this will surely break!\n",
     ":::\n",
     "\n",
     "## Likelihood\n",

diff --git a/docs/source/quickstart_unilateral.ipynb b/docs/source/quickstart_unilateral.ipynb
@@ -6,7 +6,7 @@
    "source": [
     "# Getting started\n",
     "\n",
-    "A lot of people get diagnosed with squamous cell carcinoma in the head & neck region ([HNSCC](https://en.wikipedia.org/wiki/Head_and_neck_cancer)), which frequently metastasizes via the lymphatic system. We set out to develop a methodology to predict the risk of a new patient having metastases in so-called lymph node levels (LNLs), based on their personal diagnose (e.g. findings from a CT scan) and information of previously diagnosed and treated patients. And that's exactly what this code enables you to do as well.\n",
+    "A lot of people get diagnosed with squamous cell carcinoma in the head & neck region ([HNSCC](https://en.wikipedia.org/wiki/Head_and_neck_cancer)), which frequently metastasizes via the lymphatic system. We set out to develop a methodology to predict the risk of a new patient having metastases in so-called lymph node levels (LNLs), based on their personal diagnosis (e.g. findings from a CT scan) and information of previously diagnosed and treated patients. And that's exactly what this code enables you to do as well.\n",
     "\n",
     "As mentioned, this package is meant to be a relatively simple-to-use frontend. The math is done under the hood and one does not need to worry about it a lot. But let's have a quick look at what we're doing here.\n",
     "\n",
@@ -152,7 +152,7 @@
    "source": [
     "## Diagnostic Modalities\n",
     "\n",
-    "To ultimately compute the likelihoods of observations, we need to fix the sensitivities and specificities of the obtained diagnoses. And since we might have multiple diagnostic modalities available, we need to tell the system which of them comes with which specificity and sensitivity. We do this by adding specificity/sensitivity pairs to our model:"
+    "To ultimately compute the likelihoods of observations, we need to fix the sensitivities and specificities of the obtained diagnosis. And since we might have multiple diagnostic modalities available, we need to tell the system which of them comes with which specificity and sensitivity. We do this by adding specificity/sensitivity pairs to our model:"
    ]
   },
   {
@@ -256,7 +256,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "To feed the dataset into the system, we assign the dataset to the attribute `patient_data`. What the system then does here is creating a diagnose matrix for every T-stage in the data."
+    "To feed the dataset into the system, we assign the dataset to the attribute `patient_data`. What the system then does here is creating a diagnosis matrix for every T-stage in the data."
    ]
   },
   {
@@ -275,17 +275,17 @@
    "source": [
     ":::{note}\n",
     "\n",
-    "The data now has an additional top-level header `\"_model\"` which stores only the information the model actually needs. In this case, it only stores the ipsilateral CT diagnoses of the LNLs I, II, III, and IV, as well as the mapped T-stage of the patients. Note that from the original T-stages 1, 2, 3, and 4, only \"early\" and \"late\" are left. This is the default transformation, but it can be changed by providing a function to the `mapping` keyword argument in the `load_patient_data()` method.\n",
+    "The data now has an additional top-level header `\"_model\"` which stores only the information the model actually needs. In this case, it only stores the ipsilateral CT diagnosis of the LNLs I, II, III, and IV, as well as the mapped T-stage of the patients. Note that from the original T-stages 1, 2, 3, and 4, only \"early\" and \"late\" are left. This is the default transformation, but it can be changed by providing a function to the `mapping` keyword argument in the `load_patient_data()` method.\n",
     ":::"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Distribution over Diagnose Times\n",
+    "## Distribution over Diagnosis Times\n",
     "\n",
-    "The last ingredient to set up (at least when using the hidden Markov model) would now be the distribution over diagnose times. Our dataset contains two different T-stages \"early\" and \"late\". One of the underlying assumptions with our model is that earlier T-stage patients have been - on average - diagnosed at an earlier time-point, compared to late T-stage patients. We can reflect that using distributions over the diagnosis time:"
+    "The last ingredient to set up (at least when using the hidden Markov model) would now be the distribution over diagnosis times. Our dataset contains two different T-stages \"early\" and \"late\". One of the underlying assumptions with our model is that earlier T-stage patients have been - on average - diagnosed at an earlier time-point, compared to late T-stage patients. We can reflect that using distributions over the diagnosis time:"
    ]
   },
   {
@@ -313,7 +313,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We can now set a fixed prior for the distribution over diagnose times of early T-stage patients (i.e., patients with T1 and T2 tumors)."
+    "We can now set a fixed prior for the distribution over diagnosis times of early T-stage patients (i.e., patients with T1 and T2 tumors)."
    ]
   },
   {
@@ -330,11 +330,11 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Let's define a parametrized PMF over diagnose times for patients with late T-stage tumors (T3 and T4) to show this functionality. For that, we first define a parametrized function with the signature\n",
+    "Let's define a parametrized PMF over diagnosis times for patients with late T-stage tumors (T3 and T4) to show this functionality. For that, we first define a parametrized function with the signature\n",
     "\n",
     "```python\n",
     "def distribution(support: list[float] | np.ndarray, a=1, b=2, c=3, ...) -> np.ndarray:\n",
-    "    \"\"\"PMF over diagnose times (``support``) with parameters ``a``, ``b``, and ``c``.\"\"\"\n",
+    "    \"\"\"PMF over diagnosis times (``support``) with parameters ``a``, ``b``, and ``c``.\"\"\"\n",
     "    ...\n",
     "    return result\n",
     "```\n",
@@ -405,7 +405,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Note how the set of adjustable parameters now also contains the `p` parameter for the late T-stage's distribution over diagnose times. For the early T-stage, it is not present, because that one was provided as a fixed array."
+    "Note how the set of adjustable parameters now also contains the `p` parameter for the late T-stage's distribution over diagnosis times. For the early T-stage, it is not present, because that one was provided as a fixed array."
    ]
   },
   {

diff --git a/lymph/__init__.py b/lymph/__init__.py
@@ -16,11 +16,11 @@
 
 # nopycln: file
 
-from lymph import diagnose_times, graph, matrix, models
+from lymph import diagnosis_times, graph, matrix, models
 from lymph.utils import clinical, pathological
 
 __all__ = [
-    "diagnose_times", "matrix",
+    "diagnosis_times", "matrix",
     "graph", "models",
     "clinical", "pathological",
 ]

diff --git a/lymph/diagnose_times.py → lymph/diagnosis_times.py b/lymph/diagnose_times.py → lymph/diagnosis_times.py
@@ -1,5 +1,5 @@
 """
-Module for marginalizing over diagnose times.
+Module for marginalizing over diagnosis times.
 
 The hidden Markov model we implement assumes that every patient started off with a
 healthy neck, meaning no lymph node levels harboured any metastases. This is a valid
@@ -33,22 +33,22 @@ class SupportError(Exception):
 
 
 class Distribution:
-    """Class that provides a way of storing distributions over diagnose times."""
+    """Class that provides a way of storing distributions over diagnosis times."""
     def __init__(
         self,
         distribution: Iterable[float] | callable,
         max_time: int | None = None,
         **kwargs,
     ) -> None:
-        """Initialize a distribution over diagnose times.
+        """Initialize a distribution over diagnosis times.
 
         This object can either be created by passing a parametrized function (e.g.,
         ``scipy.stats`` distribution) or by passing a list of probabilities for each
-        diagnose time.
+        diagnosis time.
 
         The signature of the function must be ``func(support, **kwargs)``, where
         ``support`` is the support of the distribution from 0 to ``max_time``. The
-        function must return a list of probabilities for each diagnose time.
+        function must return a list of probabilities for each diagnosis time.
 
         Note:
             All arguments except ``support`` must have default values and if some
@@ -214,7 +214,7 @@ def get_params(
         """If updateable, return the dist's ``param`` value or all params in a dict.
 
         See Also:
-            :py:meth:`lymph.diagnose_times.DistributionsUserDict.get_params`
+            :py:meth:`lymph.diagnosis_times.DistributionsUserDict.get_params`
             :py:meth:`lymph.graph.Edge.get_params`
             :py:meth:`lymph.models.Unilateral.get_params`
             :py:meth:`lymph.models.Bilateral.get_params`
@@ -264,7 +264,7 @@ def draw_diag_times(
         rng: np.random.Generator | None = None,
         seed: int = 42,
     ) -> np.ndarray:
-        """Draw ``num`` samples of diagnose times from the stored PMF.
+        """Draw ``num`` samples of diagnosis times from the stored PMF.
 
         A random number generator can be provided as ``rng``. If ``None``, a new one
         is initialized with the given ``seed`` (or ``42``, by default).
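
The `Distribution` docstring in this file asks for a function with the signature `func(support, **kwargs)` that returns one probability per diagnosis time, with defaults for every argument except `support`. A minimal sketch of such a function, using a binomial PMF as an assumed example, is shown here. The name `binom_pmf` and the parameter `p` are illustrative, and the sketch uses only the standard library instead of `scipy.stats`.

```python
from math import comb


def binom_pmf(support, p=0.5):
    """Binomial PMF over the time steps in ``support`` (0 to ``max_time``).

    ``p`` has a default value, as the docstring requires for all
    arguments other than ``support``.
    """
    max_time = len(support) - 1
    return [
        comb(max_time, int(t)) * p ** int(t) * (1 - p) ** (max_time - int(t))
        for t in support
    ]


support = list(range(11))  # time steps 0..10, i.e. max_time = 10
late_pmf = binom_pmf(support, p=0.7)
```

Judging from the `__init__` signature shown in the hunk, a function like this would be passed as the `distribution` argument together with `max_time`; how the parameters are then updated is not visible in this diff.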

diff --git a/lymph/graph.py b/lymph/graph.py
@@ -406,8 +406,8 @@ def get_params(
         """Return the value of the parameter ``param`` or all params in a dict.
 
         See Also:
-            :py:meth:`lymph.diagnose_times.Distribution.get_params`
-            :py:meth:`lymph.diagnose_times.DistributionsUserDict.get_params`
+            :py:meth:`lymph.diagnosis_times.Distribution.get_params`
+            :py:meth:`lymph.diagnosis_times.DistributionsUserDict.get_params`
             :py:meth:`lymph.models.Unilateral.get_params`
             :py:meth:`lymph.models.Bilateral.get_params`
         """

diff --git a/lymph/matrix.py b/lymph/matrix.py
@@ -11,9 +11,9 @@
 import numpy as np
 import pandas as pd
 
-from lymph import graph
-from lymph.utils import get_state_idx_matrix, row_wise_kron, tile_and_repeat
+from lymph import graph, types
 from lymph.modalities import Modality
+from lymph.utils import get_state_idx_matrix, row_wise_kron, tile_and_repeat
 
 
 @lru_cache(maxsize=128)
@@ -94,17 +94,18 @@ def generate_observation(
 
 def compute_encoding(
     lnls: list[str],
-    pattern: pd.Series | dict[str, bool | int | str],
+    pattern: pd.Series | dict[str, types.InvolvementIndicator],
     base: int = 2,
 ) -> np.ndarray:
     """Compute the encoding of a particular ``pattern`` of involvement.
 
     A ``pattern`` holds information about the involvement of each LNL and the function
     transforms this into a binary encoding which is ``True`` for all possible complete
-    states/diagnoses that are compatible with the given ``pattern``.
+    states/diagnosis that are compatible with the given ``pattern``.
 
     In the binary case (``base=2``), the value behind ``pattern[lnl]`` can be one of
     the following things:
+
     - ``False``: The LNL is healthy.
     - ``"healthy"``: The LNL is healthy.
     - ``True``: The LNL is involved.
@@ -113,6 +114,7 @@ def compute_encoding(
 
     In the trinary case (``base=3``), the value behind ``pattern[lnl]`` can be one of
     these things:
+
     - ``False``: The LNL is healthy.
     - ``"healthy"``: The LNL is healthy.
     - ``True``: The LNL is involved (micro- or macroscopic).
@@ -211,12 +213,12 @@ def generate_data_encoding(
             if modality_name not in patient_row:
                 warnings.warn(f"Modality {modality_name} not in data. Skipping.")
                 continue
-            diagnose_encoding = compute_encoding(
+            diagnosis_encoding = compute_encoding(
                 lnls=lnls,
                 pattern=patient_row[modality_name],
                 base=2,   # observations are always binary!
             )
-            patient_encoding = np.kron(patient_encoding, diagnose_encoding)
+            patient_encoding = np.kron(patient_encoding, diagnosis_encoding)
 
         result[:,i] = patient_encoding