## Expectations from this notebook

This notebook is going to give some examples of statistical analyses that can be performed on data from a set of KEP instances. It is important to note a few things.

1. The data being used has been randomly generated for testing purposes. It does not match any known population, and should not be used to arrive at any conclusions for real-world kidney exchange programmes.
2. The outputs from statistical analyses in general, including the ones in this notebook, should not be used without considering whether the analysis technique is appropriate, and whether the conclusion is believable.

To properly consider the outputs from the analyses, it is important to understand what the various parameters mean. For a brief introduction to the parameters used, see [the kep_solver documentation](https://kep-solver.readthedocs.io/en/latest/source/terms.html).

First we install the relevant packages. For our analyses we install `statsmodels`, and for visualisations we use `matplotlib`. We also pull in `numpy` and `pandas` so we can refer to certain parts of them. Note that `kep_solver` itself doesn't do any statistical analyses. Plenty of other packages exist for such purposes, so we use them, and you can use your preferred statistics package instead.

In [None]:
!pip install kep_solver statsmodels matplotlib numpy pandas
import json

from kep_solver.fileio import read_json
from kep_solver.entities import InstanceSet, BloodGroup
import kep_solver.generation as generation

import statsmodels.api as sm
import matplotlib.pyplot as plt
# Some extra imports to make things look nicer
import matplotlib.ticker as mticker
import matplotlib.colors as mcolors

import numpy
import pandas

We can now build an `InstanceSet`. This is a part of the `kep_solver` package, and is a representation of a set of instances. We will see later that it can perform some relevant calculations over the set of instances. Note that while this cell looks simple, if you are importing new data then a lot of the work required will involve either getting your data into formats supported by `kep_solver` (see [the kep_solver documentation](https://kep-solver.readthedocs.io/en/latest/source/formats.html#input)) or adding new functionality to `kep_solver` to read your given file format.

In [None]:
# get the data
filenames = [f"medium-{i}.json" for i in range(1,11)]
instances = [read_json("../tests/test_instances/" + filename) for filename in filenames]
instances = InstanceSet(instances)

Details of the donors, as a complete table, can be extracted. This is returned as a `pandas.DataFrame` object.

In [None]:
donors = instances.donor_details()
print(donors)

Blood group distributions will come up a few times, so we make a function to plot such distributions, and then plot the distribution of blood groups amongst donors.

In [None]:
# make a blood group distribution function
def plot_bloodgroups(column, axis=None, title="Blood group distribution"):
    plot = column.value_counts().plot(ax=axis,
                                      kind='bar',
                                      xlabel='Blood group',
                                      ylabel='Frequency',
                                      title=title)
    plot.tick_params(rotation=0)

plot_bloodgroups(donors["donor_bloodgroup"])

We will create "configurations" that can be used to create random entity generators. These configurations are meant to be stored as JSON files, so we use the `to_json` method from `pandas` to create a JSON string. We then turn this string back into a dictionary using the `json.loads()` function, so that it can be combined with other configuration later.

In [None]:
generic_donor_distribution = json.loads(donors["donor_bloodgroup"].value_counts(normalize=True).to_json())
print(generic_donor_distribution)
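The cell above relies on `value_counts(normalize=True)` to turn raw counts into relative frequencies. As a minimal sketch of what that computes, the following uses only the standard library. The sample list and the helper name `normalised_counts` are illustrative assumptions and are not part of `kep_solver` or `pandas`.

```python
from collections import Counter


def normalised_counts(values):
    """Return a dict mapping each distinct value to its relative frequency."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Hypothetical sample of donor blood groups, for illustration only
sample = ["O", "A", "O", "B", "A", "O", "AB", "O"]
distribution = normalised_counts(sample)
print(distribution)
```

The resulting dictionary has the same shape as `generic_donor_distribution`: blood groups as keys and probabilities that sum to one as values.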

We now consider our first possible correlation between variables. The question we pose is: does the distribution of donor blood groups depend on the blood group of their paired recipient? The plots below suggest that a correlation may exist: if a recipient has blood group AB, it seems unlikely that their donor will have blood group B. In a real analysis, you should take note of the sample size (here, only 3 donors are paired with recipients who have blood group AB) and use that to drive your decision. In this notebook we are simply demonstrating what is possible, and so ignore the sample sizes.

In [None]:
fig, axs = plt.subplots(2, 2, tight_layout=True)
donor_config = {"Generic": generic_donor_distribution}
for bloodgroup, ax in zip(BloodGroup.all(), fig.axes):
    filtered = donors[donors['paired_recipient_bloodgroup'] == bloodgroup]
    plot_bloodgroups(filtered["donor_bloodgroup"], axis = ax, title=f"Recipient has {bloodgroup}")
    donor_config[str(bloodgroup)] = json.loads(filtered["donor_bloodgroup"].value_counts(normalize=True).to_json())
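Because small groups can produce misleading distributions, it can help to keep the group size next to each conditional distribution. The following is a rough sketch of that idea using only the standard library. The `pairs` data and the function name `conditional_distributions` are hypothetical and chosen purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (recipient blood group, donor blood group) pairs, for illustration only
pairs = [("O", "A"), ("O", "A"), ("O", "B"), ("A", "O"), ("A", "A"), ("AB", "A")]


def conditional_distributions(pairs):
    """Group donors by recipient blood group; return (group size, distribution) per group."""
    grouped = defaultdict(Counter)
    for recipient_bg, donor_bg in pairs:
        grouped[recipient_bg][donor_bg] += 1
    result = {}
    for recipient_bg, counts in grouped.items():
        size = sum(counts.values())
        result[recipient_bg] = (size, {bg: n / size for bg, n in counts.items()})
    return result


for recipient_bg, (size, dist) in conditional_distributions(pairs).items():
    print(f"Recipient {recipient_bg}: n={size}, donors={dist}")
```

In this made-up data the AB group contains a single donor, so its "distribution" puts all the probability on one blood group, which is exactly the kind of result that should be treated with caution.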

When splitting out donor blood groups by the blood group of their paired recipient, we lost all non-directed donors, so let's look at those now.

In [None]:
plot_bloodgroups(donors[donors["NDD"] == True]["donor_bloodgroup"], title="Non-directed donors")
ndd_bloodgroup_dist = json.loads(donors[donors["NDD"] == True]["donor_bloodgroup"].value_counts(normalize=True).to_json())

Now we want to look at recipients, but first we have to calculate the `compatibility_chance`. For a given recipient R, the `compatibility_chance` of R is the ratio between `number of distinct donors who can donate to R` and `number of donors who are blood group compatible with R and appear in a common instance with R`. This is calculated when we extract the recipient details with `calculate_compatibility=True`.

In [None]:
# This calculates compatibility_chance for each recipient.

recipients = instances.recipient_details(calculate_compatibility=True)
print(recipients)
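To make the definition of compatibility chance concrete, here is a small worked example of the ratio. This is only a sketch of the arithmetic, not the `kep_solver` implementation. The function name and the handling of a zero denominator are assumptions made for illustration.

```python
def compatibility_chance(num_can_donate, num_abo_compatible):
    """Ratio of donors who can donate to those who are blood group compatible."""
    if num_abo_compatible == 0:
        # Assumption: treat the ratio as undefined when no donor is blood group compatible
        return None
    return num_can_donate / num_abo_compatible


# Hypothetical recipient: 40 blood group compatible donors appear in a common
# instance with the recipient, and 6 of those donors can actually donate
print(compatibility_chance(6, 40))
```

A recipient who can receive from few of their blood group compatible donors therefore has a compatibility chance close to zero.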

Since we have some more data, let's look at a correlation matrix for the recipients. As a reminder, a correlation matrix has values between -1.0 and 1.0, where a value of -1.0 indicates a perfect negative linear correlation, 0.0 indicates no linear correlation, and 1.0 indicates a perfect positive linear correlation. Again, we highlight that such matrices make many assumptions (such as linear relationships, but also on the underlying distribution of each variable), so one should consider the implications of any correlations that are utilised.

We see from the below that a higher cPRA is perhaps linked to having a lower compatibility chance (which makes sense from the definitions of cPRA and compatibility chance) and that having a blood group compatible donor might be linked to having a higher cPRA (which makes sense, as a recipient with a blood group compatible donor and a low cPRA is more likely to have a transplant arranged without entering a KEP).

In [None]:
print(recipients.corr(numeric_only=True))
labels = ["cPRA", "Compatibility", "#Donors", "ABO Comp. Donor"]
fig, ax = plt.subplots(figsize=(10,5))
norm = mcolors.Normalize(vmin=-1, vmax=1)
cax = ax.matshow(recipients.corr(numeric_only=True), cmap=plt.colormaps.get_cmap('PuBu'), norm=norm)
fig.colorbar(cax)
ticks_loc = ax.get_xticks().tolist()
ax.xaxis.set_major_locator(mticker.FixedLocator(ticks_loc))
ax.set_xticklabels([''] + labels + [''])

# Fix the y-axis tick positions with a FixedLocator before setting the tick labels
ticks_loc = ax.get_yticks().tolist()
ax.yaxis.set_major_locator(mticker.FixedLocator(ticks_loc))
ax.set_yticklabels([''] + labels + ['']);
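By default, `DataFrame.corr()` computes Pearson correlation coefficients. As a reminder of what a single entry in the matrix measures, here is a minimal sketch of the Pearson coefficient written with the standard library only. The data values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, for illustration only
cpra = [0.0, 0.2, 0.5, 0.8, 1.0]
chance = [0.9, 0.7, 0.5, 0.2, 0.1]
print(pearson(cpra, chance))                 # strongly negative
print(pearson(cpra, [x * x for x in cpra]))  # monotone but non-linear: high, yet below 1
```

The second call illustrates the linearity assumption: a relationship can be perfectly predictable and still have a coefficient below 1.0 if it is not a straight line.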

We can now look at some individual properties of the recipients, starting with the distribution of number of donors. Again, we are building configuration objects for use later as well.

In [None]:
plot = recipients["num_donors"].value_counts().plot(kind='bar',
                                  xlabel='Number of donors',
                                  ylabel='Frequency',
                                  title="Donors per recipient")
plot.tick_params(rotation=0)
donor_count_config = {}
for num_donors, probability in recipients["num_donors"].value_counts(normalize=True).to_dict().items():
    donor_count_config[num_donors] = probability

The blood group distribution of the recipients is also relevant, and easy to examine.

In [None]:
plot_bloodgroups(recipients["recipient_bloodgroup"], title="Recipient Bloodgroup distribution")
recipient_bloodgroup_distribution = json.loads(recipients["recipient_bloodgroup"].value_counts(normalize=True).to_json())

Next we look at the distribution of cPRA. Since this can take many values, we use a histogram to view the data. Note that we have specified the bins exactly, but you can also just specify a number of bins.

In [None]:
cPRA_histogram_bins = [0.05 * index for index in range(0, 21)]
plot = recipients["cPRA"].plot(kind='hist',
                               xlabel='cPRA',
                               ylabel='Frequency',
                               title="cPRA Distribution",
                               bins=cPRA_histogram_bins,
                               edgecolor="black"
                               )
plot.tick_params(rotation=0)

Recall that the correlation matrix indicated a possible link between having a blood group compatible donor and cPRA. We begin by separating recipients into those who have a blood group compatible donor, and those who don't, and create two separate histograms on the same axes.

In [None]:
compatible = recipients[recipients["has_abo_compatible_donor"] == True]["cPRA"]
incompatible = recipients[recipients["has_abo_compatible_donor"] == False]["cPRA"]
fig, ax = plt.subplots()
cpra_results = ax.hist([compatible, incompatible], edgecolor="black", density=True, bins=cPRA_histogram_bins,
                       label=["Has ABO compatible donor", "Does not have an ABO compatible donor"])
ax.set_xlabel("cPRA")
ax.set_ylabel("Density")
ax.set_title("cPRA distributions")
ax.legend();

From the plot, it seems likely that there is some link, so we're going to extract this data into configuration for a generator we will use later. We do this by utilising the density-based results from calling `hist()`. We then create the configurations by capturing, for each bin, its lower and upper bound as well as the probability of a value falling in that bin.

Note that we have to slightly shift the upper bound of the last bin. In `matplotlib`, every bin excludes its upper bound except the last, which includes it, whereas `kep_solver` expects the last bin to exclude its upper bound as well. To ensure that we can still get cPRA values equal to 1.0, we push this upper bound slightly above 1.0. The `kep_solver` software will itself ensure that a cPRA above 1.0 is not generated.

In [None]:
compat_densities = []
incompat_densities = []
for ind, bin_lower in enumerate(cPRA_histogram_bins[:-1]):
    bin_upper = cPRA_histogram_bins[ind+1]
    # ax.hist with density=True returns a probability density whose integral over all bins equals one.
    # We want the probabilities themselves to sum to one, so multiply each density by the bin width.
    # cpra_results[0] holds one array of densities per dataset: index 0 is compatible, index 1 is incompatible.
    probability_compatible = cpra_results[0][0][ind] * (bin_upper - bin_lower)
    probability_incompatible = cpra_results[0][1][ind] * (bin_upper - bin_lower)
    # We have to slightly tweak the upper bound of the last bin. matplotlib makes the last upper bound closed,
    # while kep_solver expects it to be open
    if ind == len(cPRA_histogram_bins) - 2:
        bin_upper += 1e-6
    compat_densities.append(((bin_lower, bin_upper), probability_compatible))
    incompat_densities.append(((bin_lower, bin_upper), probability_incompatible))
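The conversion from density to probability can be checked by hand on a small example. The sketch below builds a density-normalised histogram with the standard library, following the convention that every bin is half-open except the last, and then multiplies each density by its bin width. The data, the bin edges and the function name `bin_probabilities` are assumptions for illustration only.

```python
def bin_probabilities(values, edges):
    """Return (densities, probabilities) per bin for a density-normalised histogram."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # Bins are half-open [lower, upper), except the last, which includes its upper edge
            if edges[i] <= value < edges[i + 1] or (is_last and value == edges[i + 1]):
                counts[i] += 1
                break
    total = sum(counts)
    densities = [c / (total * (edges[i + 1] - edges[i])) for i, c in enumerate(counts)]
    probabilities = [d * (edges[i + 1] - edges[i]) for i, d in enumerate(densities)]
    return densities, probabilities


# Hypothetical cPRA values and four equal-width bins, for illustration only
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
values = [0.0, 0.1, 0.3, 0.6, 0.9, 0.95, 1.0, 1.0]
densities, probabilities = bin_probabilities(values, edges)
print(densities)      # integrates to one over [0, 1]
print(probabilities)  # sums to one
```

Note that the densities can exceed 1.0 for narrow bins, which is why they cannot be used directly as probabilities.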

Next we look at any correlation between cPRA and compatibility chance. Again, we begin by plotting. From this, it seems plausible that there is some negative correlation, but it may be hard to model precisely.

In [None]:
plt.scatter(recipients["cPRA"], recipients["compatibility_chance"])
plt.xlabel("cPRA")
plt.ylabel("Compatibility Chance");

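We fit a linear model regardless, just to see how well it works. With no family specified, `glm` uses a Gaussian family with an identity link, which gives the same fit as ordinary least squares.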

In [None]:
model = sm.formula.glm("compatibility_chance ~ cPRA",
                      data=recipients)
result = model.fit()
print(result.summary())
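For a single explanatory variable, the intercept and slope of a least squares line have a simple closed form. The following sketch computes them with the standard library on made-up data, as a way to sanity-check the `Intercept` and `cPRA` parameters reported in a model summary. The function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data lying exactly on the line 0.8 - 0.6 * cPRA
cpra = [0.0, 0.25, 0.5, 0.75, 1.0]
chance = [0.8, 0.65, 0.5, 0.35, 0.2]
intercept, slope = fit_line(cpra, chance)
print(intercept, slope)
```

With real data the points will not lie on a line, and the summary's standard errors indicate how much trust to place in the fitted values.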

We now plot both the actual data and our linear regression. The black line is the line of best fit, but it is clearly not a perfect description of the data. Still, we don't have any better relationship to use, and from a medical perspective it does make sense that a higher cPRA means a lower compatibility chance.

In [None]:
plt.scatter(recipients["cPRA"], recipients["compatibility_chance"], label="Data")
plt.xlabel("cPRA")
plt.ylabel("Compatibility Chance");
xseq = numpy.linspace(0, 1.0, num=100)

# Plot regression line
plt.plot(xseq, result.params['Intercept'] + result.params['cPRA'] * xseq, color="k", lw=2.5, label="Regression");
plt.legend();

We can use all of the configurations we've been collecting to build generators. These generators allow us to randomly sample blood groups, donors, recipients, even instances, based on the distributions we've chosen.

In [None]:
donor_generator = generation.DonorGenerator.from_json(donor_config)
recipient_blood_generator = generation.BloodGroupGenerator.from_json(recipient_bloodgroup_distribution)
donor_count_generator = generation.DonorCountGenerator(donor_count_config)
cpra_generator = generation.CPRAGenerator.from_json({"Compatible": compat_densities,
                                                     "Incompatible": incompat_densities})
compat_generator = generation.CompatibilityChanceGenerator.from_json([[0, 1.01, 
                                                                       {"function": 
                                                                        {"type": "linear", 
                                                                         "offset": result.params['Intercept'],
                                                                         "coefficient": result.params['cPRA']}}]])
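To give an intuition for what sampling from such configurations involves, here is a rough sketch using the standard library. It is not how `kep_solver` implements its generators. The helper names `sample_discrete` and `sample_binned`, and the distributions used, are assumptions for illustration only.

```python
import random


def sample_discrete(distribution, rng):
    """Draw one key from a {value: probability} dictionary."""
    values = list(distribution)
    weights = [distribution[value] for value in values]
    return rng.choices(values, weights=weights, k=1)[0]


def sample_binned(bins, rng):
    """Draw from [((lower, upper), probability), ...]: pick a bin, then a uniform point in it."""
    (lower, upper), _ = rng.choices(bins, weights=[p for _, p in bins], k=1)[0]
    # Clamp so that the nudged upper bound of the last bin cannot give a value above 1.0
    return min(rng.uniform(lower, upper), 1.0)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
blood_groups = {"O": 0.5, "A": 0.3, "B": 0.15, "AB": 0.05}  # hypothetical distribution
cpra_bins = [((0.0, 0.5), 0.7), ((0.5, 1.0 + 1e-6), 0.3)]   # hypothetical bins
print([sample_discrete(blood_groups, rng) for _ in range(5)])
print([round(sample_binned(cpra_bins, rng), 3) for _ in range(5)])
```

The two-step draw in `sample_binned` is why the configuration stores a probability per bin rather than a density.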

The recipient generator needs a number of generators itself to function.

In [None]:
recipient_generator = generation.RecipientGenerator(recipient_blood_generator,
                                                    donor_count_generator,
                                                    donor_generator,
                                                    cpra_generator,
                                                    compat_generator)

An instance generator then needs to know how to create non-directed donors and recipients. Directed donors (also known as paired donors) will be constructed when their paired recipient is constructed.

In [None]:
ndd_blood_generator = generation.BloodGroupGenerator.from_json(ndd_bloodgroup_dist)
instance_generator = generation.InstanceGenerator(recipient_generator, ndd_blood_generator)

Now we can `draw()` random instances. The `50` here means we want 50 recipients. We didn't specify any non-directed donors, so we get none. Note that we (probably) have more than 50 donors, as recipients have a small chance of being generated with two donors. Of course, we are sampling at random so we might actually get exactly 50 donors.

In [None]:
new_instance = instance_generator.draw(50)
print(f"The new instance has {len(new_instance.recipients())} recipients")
print(f"The new instance has {len(new_instance.donors())} donors")
print(f"The new instance has {len([donor for donor in new_instance.donors() if donor.NDD])} non-directed donors")

If we want to have non-directed donors, we tell the `draw()` function how many to generate with a second parameter.

In [None]:
new_instance = instance_generator.draw(50, 5)
print(f"The new instance has {len(new_instance.recipients())} recipients")
print(f"The new instance has {len(new_instance.donors())} donors")
print(f"The new instance has {len([donor for donor in new_instance.donors() if donor.NDD])} non-directed donors")

Above we created all the individual generators, and then combined them. However, we can put all of the configuration into one large dictionary as follows. The following prints out a large (and messy looking) string, but this string contains all the information required to construct an instance generator. This string can be saved in a file, and shared with others so they can generate random instances with the same distributions.

In [None]:
recipient_generator_config = {
    "RecipientBloodGroupGenerator": recipient_bloodgroup_distribution,
    "DonorCountGenerator": donor_count_config,
    "DonorGenerator": donor_config,
    "CPRAGenerator": {"Compatible": compat_densities, "Incompatible": incompat_densities},
    "CompatibilityChanceGenerator": [[0, 1.01, {"function": {"type": "linear", 
                                                             "offset": result.params['Intercept'],
                                                             "coefficient": result.params['cPRA']
                                                            }
                                               }]]
}
instance_configuration = {
    "RecipientGenerator": recipient_generator_config,
    "NDDBloodGroupGenerator": ndd_bloodgroup_dist,
}
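# Assumption: InstanceGenerator provides a from_json() constructor like the other generators used above
instance_generator_json = generation.InstanceGenerator.from_json(instance_configuration)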
json.dumps(instance_configuration)
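Saving such a configuration to a file and reading it back is a plain JSON round trip. The sketch below uses a much smaller, made-up configuration and a temporary directory. The file name and the dictionary contents are illustrative assumptions.

```python
import json
import os
import tempfile

# Hypothetical, much smaller configuration, for illustration only
config = {"NDDBloodGroupGenerator": {"O": 0.6, "A": 0.4},
          "CPRAGenerator": {"Compatible": [((0.0, 1.000001), 1.0)]}}

path = os.path.join(tempfile.mkdtemp(), "generator_config.json")
with open(path, "w") as outfile:
    json.dump(config, outfile, indent=2)

with open(path) as infile:
    loaded = json.load(infile)

# JSON has no tuple type, so the (lower, upper) tuples come back as lists
print(loaded["CPRAGenerator"]["Compatible"])
```

Anyone receiving the file therefore sees bin bounds as two-element lists rather than tuples, which is worth keeping in mind when comparing a loaded configuration with the original dictionary.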