# Spatial and Machine Learning Analysis for WHO-AFRO
## Performed by Dr. Scott Pezanowski
#### 2022-10-15 to 2023-07-31

The spatial analysis portion of my code closely follows the ESDA procedure set forth by Dr. Dani Arribas-Bel in the free online course referenced below.

Arribas-Bel, D. (2019). A course on Geographic Data Science. *The Journal of Open Source Education*, *2*(14). https://doi.org/10.21105/jose.00042
```
@article{darribas_gds_course,
  author = {Dani Arribas-Bel},
  title = {A course on Geographic Data Science},
  year = 2019,
  journal = {The Journal of Open Source Education},
  volume = 2,
  number = 14,
  doi = {https://doi.org/10.21105/jose.00042}
}
```

http://darribas.org/gds_course/content/bF/lab_F.html

The second half of my code that uses machine learning is not from Dr. Arribas-Bel's course.

# Spatial autocorrelation and Exploratory Spatial Data Analysis

Spatial autocorrelation has to do with the degree to which the similarity in values between observations in a dataset is related to the similarity in locations of such observations. Not completely unlike the traditional correlation between two variables -which informs us about how the values in one variable change as a function of those in the other- and analogous to its time-series counterpart -which relates the value of a variable at a given point in time with those in previous periods-, spatial autocorrelation relates the value of the variable of interest in a given location, with values of the same variable in surrounding locations.

A key idea in this context is that of spatial randomness: a situation in which the location of an observation gives no information whatsoever about its value. In other words, a variable is spatially random if it is distributed following no discernible pattern over space. Spatial autocorrelation can thus be formally defined as the “absence of spatial randomness”, which gives room for two main classes of autocorrelation, similar to the traditional case: positive spatial autocorrelation, when similar values tend to group together in similar locations; and negative spatial autocorrelation, in cases where similar values tend to be dispersed and further apart from each other.

In this session we will learn how to explore spatial autocorrelation in a given dataset, interrogating the data about its presence, nature, and strength. To do this, we will use a set of tools collectively known as Exploratory Spatial Data Analysis (ESDA), specifically designed for this purpose. The range of ESDA methods is very wide and spans from less sophisticated approaches like choropleths and general table querying, to more advanced and robust methodologies that include statistical inference and an explicit recognition of the geographical dimension of the data. The purpose of this session is to dip our toes into the latter group.

ESDA techniques are usually divided into two main groups: tools to analyze *global* and *local* spatial autocorrelation. The former consider the overall trend that the location of values follows, and make it possible to state the degree of clustering in the dataset. *Do values generally follow a particular pattern in their geographical distribution? Are similar values closer to other similar values than we would expect from pure chance?* These are some of the questions that tools for global spatial autocorrelation allow us to answer. We will practice with global spatial autocorrelation by using Moran’s I statistic.

Tools for *local* spatial autocorrelation instead focus on spatial instability: the departure of parts of a map from the general trend. The idea here is that, even though there is a given trend for the data in terms of the nature and strength of spatial association, some particular areas can diverge quite substantially from the general pattern. Regardless of the overall degree of concentration in the values, we can observe pockets of unusually high (low) values close to other high (low) values, in what we will call hot(cold)spots. Additionally, it is also possible to observe some high (low) values surrounded by low (high) values, and we will name these “spatial outliers”. The main technique we will review in this session to explore local spatial autocorrelation is the Local Indicators of Spatial Association (LISA).

1. Before running this notebook, you must install AutoGluon via the instructions here https://auto.gluon.ai/stable/install.html into a Conda environment.
2. Then, install the additional dependencies below.

In [None]:
!conda install -c conda-forge pysal -y
!conda install -c conda-forge contextily -y
!conda install -c anaconda sqlalchemy -y
!python -m pip install psycopg2
!conda install -c conda-forge imbalanced-learn -y

In [None]:
import os
from datetime import datetime
import pyproj

import seaborn as sns
import pandas as pd
import esda
from pysal.lib import weights
from splot.esda import (
    moran_scatterplot, lisa_cluster, plot_local_autocorrelation
)
import geopandas as gpd
import numpy as np
import contextily as ctx
import matplotlib.pyplot as plt

ts = datetime.today().strftime('%Y%m%d')
report_image_path = '/path-to-a-directory-to-save-results/' # where to save results

## Data

Query for the disease data from the PostgreSQL/PostGIS database. You can change the variable *diseases* to one of the commented options to see results for a different disease.

This query depends on the PostgreSQL/PostGIS database having been created beforehand from the IDSR dataset, the administrative boundaries provided by Dr. Etien, a water bodies dataset extracted from OpenStreetMap, and the population and relative wealth datasets downloaded from the links specified. All SQL statements and import commands are provided in separate files.

In [None]:
# diseases = ['cholera'] 
# diseases = ['ebola', 'ebola virus disease'] # 2023-08-02: There are not enough cases in IDSR for Ebola for proper analysis.
# diseases = ['malaria', 'malaria (imported)', 'malaria confirmed', 'malaria tested'] # is named multiple different ways in IDSR
# diseases = ['meningitis', 'meningococcal meningitis', 'meningococal meningitis'] # is named multiple different ways in IDSR
diseases = ['yellow fever', 'yf'] # is named multiple different ways in IDSR
# note that some of the diseases are named slightly differently in the IDSR dataset.
disease = diseases[0]

sql_data = f"""
WITH projection AS (
	SELECT foo1.my_id,
	foo1.adm0_name,
	foo1.epidemic_week, 
	foo1.y, 
	COALESCE(foo3.totalcases, 0) AS totalcases FROM (
		SELECT ab.my_id, ab.adm0_name, e.epidemic_week, e.y FROM admin2_afro_master_subset1 ab
		CROSS JOIN (
			SELECT DISTINCT epidemic_week, y FROM idsr
		) e WHERE ab.in_idsr = 'true'
	) foo1
	LEFT JOIN 
	(
		SELECT	
		sum(ir.totalcases) AS totalcases,
		ir.epidemic_week,
		ir.y,
		ir.adm2_my_id
		FROM idsr ir
		INNER JOIN diseases d
		ON ir.d_id = d.id
		AND lower(d.disease) =  ANY ('{{"{'","'.join(diseases)}"}}'::text[])
		GROUP BY adm2_my_id, epidemic_week, y
	) foo3
	ON foo1.my_id = foo3.adm2_my_id AND foo1.epidemic_week = foo3.epidemic_week AND foo1.y = foo3.y
)
SELECT 
p.my_id,
p.adm0_name,
p.epidemic_week,
p.y,
pr.val_precipitation, 
tp.val_temperature,
trees.val AS val_trees,
trees.percent_cov AS percent_cov_trees,
crops.val AS val_crops,
crops.percent_cov AS percent_cov_crops,
builtup.val AS val_builtup,
builtup.percent_cov AS percent_cov_builtup,
bareground.val AS val_bareground,
bareground.percent_cov AS percent_cov_bareground,
rangeland.val AS val_rangeland,
rangeland.percent_cov AS percent_cov_rangeland,
pf.pop_total / pf.count_total AS relative_pop_density,
pf.pop_near_water,
rw.val_wealth,
floor(ef.val_elevation) AS val_elevation,
p.totalcases
FROM 
projection p
LEFT JOIN precipitation_full pr
ON p.my_id = pr.my_id AND p.epidemic_week = pr.epidemic_week
LEFT JOIN temperature_full tp
ON p.my_id = tp.my_id AND p.epidemic_week = tp.epidemic_week
LEFT JOIN landcover_full trees 
ON p.my_id = trees.my_id AND p.y = trees."year" AND trees.clas = 'trees'
LEFT JOIN landcover_full crops 
ON p.my_id = crops.my_id AND p.y = crops."year" AND crops.clas = 'crops'
LEFT JOIN landcover_full builtup 
ON p.my_id = builtup.my_id AND p.y = builtup."year" AND builtup.clas = 'builtup'
LEFT JOIN landcover_full bareground 
ON p.my_id = bareground.my_id AND p.y = bareground."year" AND bareground.clas = 'bareground'
LEFT JOIN landcover_full rangeland 
ON p.my_id = rangeland.my_id AND p.y = rangeland."year" AND rangeland.clas = 'rangeland'
LEFT JOIN population_full pf
ON p.my_id = pf.my_id AND (CASE WHEN p.y > 2020 THEN 2020 ELSE p.y END) = pf."year"
LEFT JOIN relativewealth_full rw
ON p.my_id = rw.my_id
LEFT JOIN elevation_full ef
ON p.my_id = ef.my_id
ORDER BY p.epidemic_week ASC;
"""

sql_geom = """
 SELECT my_id::integer, geom FROM admin2_afro_master_subset1
"""

In [None]:
# Create the DB connection
from sqlalchemy import create_engine  
db_connection_url = "postgresql://<dbuser>:<dbpasswd>@<dbhost>:<dbport>/<dbname>"
con = create_engine(db_connection_url)

In [None]:
# read from Postgresql and specify the datatypes for the variables to make sure they are correct
dtypes = {
    'my_id': np.int64,
    'adm0_name': np.str_,
    'val_precipitation': np.float64,
    'val_temperature': np.float64,
    'val_trees': np.int64,
    'percent_cov_trees': np.float64,
    'val_crops': np.int64,
    'percent_cov_crops': np.float64,
    'val_builtup': np.int64,
    'percent_cov_builtup': np.float64,
    'val_bareground': np.int64,
    'percent_cov_bareground': np.float64,
    'val_rangeland': np.int64,
    'percent_cov_rangeland': np.float64,
    'relative_pop_density': np.float64,
    'pop_near_water': np.float64,
    'val_wealth': np.float64,
    'val_elevation': np.int64,
    'totalcases': np.int64,
}
data = pd.read_sql_query(sql_data, con, dtype=dtypes, parse_dates=['epidemic_week'])

In [None]:
data.info()

In [None]:
# make sure there are no null values
len(data[data.isnull().any(axis=1)])

In [None]:
print(len(data[data['totalcases'] > 0]))
print(len(data[data['totalcases'] == 0]))

In [None]:
data[data['totalcases'] > 0]

In [None]:
# read the boundaries of the administrative districts so that we can produce map visualizations
geo = gpd.read_postgis(sql_geom, con, geom_col='geom', index_col='my_id')

In [None]:
d = {
    'val_precipitation': 'mean',
    'val_temperature': 'mean',
    'val_trees': 'mean',
    'percent_cov_trees': 'mean',
    'val_crops': 'mean',
    'percent_cov_crops': 'mean',
    'val_builtup': 'mean',
    'percent_cov_builtup': 'mean',
    'val_bareground': 'mean',
    'percent_cov_bareground': 'mean',
    'val_rangeland': 'mean',
    'percent_cov_rangeland': 'mean',
    'relative_pop_density': 'mean',
    'pop_near_water': 'mean',
    'val_wealth': 'mean',
    'val_elevation': 'mean',
    'totalcases': 'sum',
}
data_agg = data.groupby(['my_id', 'adm0_name'], as_index = False).agg(d)

In [None]:
data_agg.info()

In [None]:
# join the disease data with the boundaries
geo2 = geo.merge(data_agg, on='my_id', how='left')
geo2['totalcases'] = geo2['totalcases'].astype(int)
geo2['adm0_name'] = geo2['adm0_name'].astype(str)

In [None]:
geo2.info()

In [None]:
# read in some geographic datasets to help with the map visualizations
waterbodies_gdf = gpd.read_file("/path-to-data/africa_waterbody.gpkg")
africa_gdf = gpd.read_file("/path-to-data/africa.gpkg")

In [None]:
# visualize the summary counts on a map
ax = africa_gdf.plot(color='beige', figsize=(20,18))
geo2.plot(ax=ax, color="lightgrey")
lbl = f"Total {disease} cases"
geo2[geo2['totalcases'] > 0].plot(ax=ax, column='totalcases', scheme='FisherJenks', k=7, legend=True)
leg1 = ax.get_legend()
leg1.set_title(f'Total cases for {disease}')
ax.set_axis_off()
ax.figure.savefig(f'{report_image_path}{os.sep}{ts}_{disease}_summarymap.png', bbox_inches='tight', transparent=True, dpi=300)

Now let’s reset the index so that it becomes a regular column again, and display a summary of the table:

In [None]:
# I commented out these lines because I explicitly set the index in the read_postgis call above
# Index table on the district ID
# geo2.set_index("my_id", inplace=True)
geo2.reset_index(inplace=True)
# Display summary
geo2.info()

## Preparing the data

Let’s get a first view of the data:

In [None]:
# Plot polygons
ax = geo2.plot(alpha=0.5, color='red')
# Add background map, expressing target CRS so the basemap can be
# reprojected (warped)
ctx.add_basemap(ax, crs=geo2.crs)

### Spatial weights matrix

As discussed before, a spatial weights matrix is the way geographical space is formally encoded into a numerical form so it is easy for a computer (or a statistical method) to understand. We have seen already many of the conceptual ways in which we can define a spatial weights matrix, such as contiguity, distance-based, or block.

For this example, we will show how to build a queen contiguity matrix, which considers two observations as neighbors if they share at least one point of their boundary. In other words, for a pair of districts in the dataset to be considered neighbors under this W, they need to share a border, that is, to be “touching” each other to some degree.
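To make the queen criterion concrete, here is a minimal sketch on a hypothetical 3 x 3 grid of square cells rather than on the real district polygons. The grid, the cell numbering, and the helper `queen_neighbors` are assumptions made for illustration only; the notebook itself relies on PySAL to derive neighbors from the geometries.

```python
# Toy illustration of queen contiguity on a 3 x 3 grid of square cells.
# Cells are numbered row by row:  0 1 2 / 3 4 5 / 6 7 8.
ROWS, COLS = 3, 3

def queen_neighbors(cell, rows=ROWS, cols=COLS):
    """Return the ids of cells sharing an edge or a corner with `cell`."""
    r, c = divmod(cell, cols)
    found = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # a cell is not its own neighbor
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                found.append(nr * cols + nc)
    return sorted(found)

neighbors = {cell: queen_neighbors(cell) for cell in range(ROWS * COLS)}
print(neighbors[4])  # centre cell touches all eight other cells
print(neighbors[0])  # corner cell touches three cells
```

Under a rook criterion only the shared edges would count, so the corner cell would keep two neighbors instead of three.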

Technically speaking, we will approach building the contiguity matrix in the same way as in Lab 5 of the course. We will begin with a GeoDataFrame and pass it on to the queen contiguity weights builder in PySAL (weights.Queen.from_dataframe). We will also make sure our table of data is previously indexed on the district ID, so the W is also indexed in that form.

In [None]:
# Create the spatial weights matrix
%time w = weights.Queen.from_dataframe(geo2)

In [None]:
w.islands

#### There are some islands. They could be connected using the nearest neighbor, but I do not do that here.

I removed the lines below about accounting for islands because it caused problems with the spatial correlation. The islands for our analysis are actually physical islands off the coast of Africa. Therefore, it does not make sense to include them in the spatial correlation analysis. Any sort of spatial correlation that includes them would require a more complicated way to include their relationships with other districts on the mainland.

#### Next, I calculate block weights, which give a weight to all districts in the same country.

I am doing this because I am making the assumption that people will not move as much between countries as within countries. Therefore, farther down, if the Queen neighbors fall in the same country, I give them a greater weight.
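The idea of favoring same-country neighbors can be sketched as follows. This is not the code used in the analysis: the four districts, their countries, and the multiplier `SAME_COUNTRY_BONUS` are all hypothetical values chosen for illustration.

```python
# Hypothetical toy data: four districts, their countries, and queen neighbors.
country = {"a": "X", "b": "X", "c": "Y", "d": "Y"}
queen = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}

SAME_COUNTRY_BONUS = 2.0  # assumed multiplier, not taken from the analysis

# Queen neighbors in the same country get the larger weight.
combined = {
    i: {j: (SAME_COUNTRY_BONUS if country[i] == country[j] else 1.0) for j in nbrs}
    for i, nbrs in queen.items()
}

# Row-standardize so that each district's weights sum to one.
standardized = {
    i: {j: w_ij / sum(row.values()) for j, w_ij in row.items()}
    for i, row in combined.items()
}
print(standardized["a"])
```

District "a" borders "b" (same country) and "c" (different country), so after standardization "b" carries two thirds of the weight and "c" one third.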

Now, the w object we have just created is of the same type as any other one we have created in the past. As such, we can inspect it in the same way. For example, we can check who is a neighbor of a given observation through w.neighbors.

However, the cell where we computed W returned a warning on “islands”. Remember these are islands not necessarily in the geographic sense (although some of them will be), but in the mathematical sense of the term: districts that do not share a border with any other one and thus do not have any neighbors. We can inspect and map them to get a better sense of what we are dealing with:

In [None]:
# Plot polygons
ax = geo2.loc[w.islands].plot(alpha=0.5, color='red')
# Add background map, expressing target CRS so the basemap can be
# reprojected (warped)
ctx.add_basemap(ax, crs=geo2.crs)

* Again, note that in our case the islands are actual islands and not errors in the geometry.

In this case, all the islands are indeed “real” islands. These cases can create issues in the analysis and distort the results. There are several solutions to this situation, such as connecting the islands to other observations through a different criterion (e.g. nearest neighbor), and then combining both spatial weights matrices. For convenience, we will remove them from the dataset because they are a small sample and their removal is unlikely to have a large impact on the calculations.
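In the mathematical sense used here, an island is simply an observation with an empty neighbor list. A minimal sketch on hypothetical neighbor lists (the district names are made up) shows how such observations can be found and subset away:

```python
# Hypothetical neighbor lists; district "e" has no neighbors (an island).
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"], "e": []}

# An island is an observation whose neighbor list is empty.
islands = [i for i, nbrs in neighbors.items() if len(nbrs) == 0]

# Drop the islands by subsetting the remaining observations.
kept = {i: nbrs for i, nbrs in neighbors.items() if i not in islands}
print(islands)
print(kept)
```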

Technically, this amounts to a subsetting, very much like we saw in the first weeks of the course, although in this case we will use the drop command, which comes in very handy in these cases:

In [None]:
geo2 = geo2.drop(w.islands)

Once we have the set of districts that are not islands, we need to re-calculate the weights matrix:

In [None]:
# Create the spatial weights matrix
# NOTE: this might take a few minutes as the geometries
#       are very detailed
%time w = weights.Queen.from_dataframe(geo2)

And, finally, let us row-standardize it to make sure every row of the matrix sums up to one:

In [None]:
# Row standardize the matrix
# w_new.transform = 'R'
w.transform = 'R'

Now, because we have row-standardized the matrix, the weights of the neighbors of each observation sum up to one. For example, for an observation with three neighbors, the weight given to each of them is 0.33.
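Row-standardization itself is a small operation. The sketch below applies it to a hypothetical binary contiguity matrix for four observations; the matrix and the helper `row_standardize` are illustrative assumptions, whereas the notebook obtains the same effect by setting w.transform = 'R'.

```python
# Binary contiguity matrix for four toy observations (1 = neighbors).
W = [
    [0, 1, 1, 1],  # observation 0 has three neighbors
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

def row_standardize(matrix):
    """Divide every row by its sum so that each row adds up to one."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

W_std = row_standardize(W)
print([round(v, 2) for v in W_std[0]])
```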

### Spatial Lag

Once we have the data and the spatial weights matrix ready, we can start by computing the spatial lag of the total cases of the disease. Remember the spatial lag is the product of the spatial weights matrix and a given variable and that, if W is row-standardized, the result amounts to the average value of the variable in the neighborhood of each observation.
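As a sketch of that definition, the spatial lag is the product of W and the variable, computed here by hand for three hypothetical observations arranged on a line (0 - 1 - 2). The weights and case counts are made up for illustration.

```python
# Row-standardized weights for three toy observations on a line: 0 - 1 - 2.
W_std = [
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
cases = [10, 0, 4]

# Spatial lag: matrix-vector product of W and the variable.
lag = [sum(w_ij * y_j for w_ij, y_j in zip(row, cases)) for row in W_std]
print(lag)  # [0.0, 7.0, 0.0]
```

The middle observation has no cases itself, but its lag is 7.0, the average of its two neighbors (10 and 4).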

We can calculate the spatial lag for the variable totalcases and store it directly in the main table with the following line of code:

In [None]:
geo2['w_totalcases'] = weights.lag_spatial(w, geo2['totalcases'])

Let us have a quick look at the resulting variable, as compared to the original one:

In [None]:
geo2[['my_id', 'totalcases', 'w_totalcases']].head()

## Global Spatial autocorrelation

Global spatial autocorrelation relates to the overall geographical pattern present in the data. Statistics designed to measure this trend thus characterize a map in terms of its degree of clustering and summarize it. This summary can be visual or numerical. In this section, we will walk through an example of each of them: the Moran Plot, and Moran’s I statistic of spatial autocorrelation.

### Moran Plot

The Moran Plot is a way of visualizing a spatial dataset to explore the nature and strength of spatial autocorrelation. It is essentially a traditional scatter plot in which the variable of interest is displayed against its spatial lag. In order to be able to interpret values as above or below the mean, and their quantities in terms of standard deviations, the variable of interest is usually standardized by subtracting its mean and dividing it by its standard deviation.

Technically speaking, creating a Moran Plot is very similar to creating any other scatter plot in Python, provided we have standardized the variable and calculated its spatial lag beforehand:

In [None]:
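# Standardize the variable and lag the standardized values, so that zero
# on each axis corresponds to the mean
totalcases_std = ((geo2['totalcases'] - geo2['totalcases'].mean()) / geo2['totalcases'].std()).to_numpy()
w_totalcases_std = weights.lag_spatial(w, totalcases_std)
# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(9, 9))
# Plot values
sns.regplot(x=totalcases_std, y=w_totalcases_std, ci=None, ax=ax)
ax.set_xlabel('totalcases (standardized)')
ax.set_ylabel('spatial lag of totalcases (standardized)')
# Add vertical and horizontal lines
plt.axvline(0, c='k', alpha=0.5)
plt.axhline(0, c='k', alpha=0.5)
# Display
plt.savefig(f'{report_image_path}{os.sep}{ts}_{disease}_summaryspatialcorrelationplot.png', bbox_inches='tight', transparent=True, dpi=300)
plt.show()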

The figure above displays the relationship between the standardized total cases and its spatial lag which, because the W that was used is row-standardized, can be interpreted as the average cases in the surrounding areas of a given district. In order to guide the interpretation of the plot, a linear fit is also included in the plot. This line represents the best linear fit to the scatter plot or, in other words, the best way to represent the relationship between the two variables as a straight line.

The plot displays a positive relationship between both variables. This is associated with the presence of positive spatial autocorrelation: similar values tend to be located close to each other. This means that the overall trend is for high values to be close to other high values, and for low values to be surrounded by other low values. This however does not mean that this is the only situation in the dataset: there can of course be particular cases where high values are surrounded by low ones, and vice versa. But it means that, if we had to summarize the main pattern of the data in terms of how clustered similar values are, the best way would be to say they are positively correlated and, hence, clustered over space.

In the context of the example, this can be interpreted along the lines of: districts display positive spatial autocorrelation in the number of total cases. This means that districts with high total cases tend to be located near other districts with high total cases, and vice versa.

### Moran's I

The Moran Plot is an excellent tool to explore the data and get a good sense of how much values are clustered over space. However, because it is a graphical device, it is sometimes hard to condense its insights into a more concise way. For these cases, a good approach is to come up with a statistical measure that summarizes the figure. This is exactly what Moran’s I is meant to do.

Very much in the same way the mean summarizes a crucial element of the distribution of values in a non-spatial setting, so does Moran’s I for a spatial dataset. Continuing the comparison, we can think of the mean as a single numerical value summarizing a histogram or a kernel density plot. Similarly, Moran’s I captures much of the essence of the Moran Plot. In fact, there is an even closer connection between the two: the value of Moran’s I corresponds with the slope of the linear fit overlaid on top of the Moran Plot.
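The link between Moran’s I and the slope of the Moran Plot can be checked by hand on a small example. The sketch below assumes five hypothetical observations on a line with row-standardized weights; it is a plain-Python illustration of the textbook formula, not the esda implementation.

```python
from statistics import mean

# Toy data: five observations on a line, each neighboring the adjacent ones.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
y = [1.0, 2.0, 4.0, 7.0, 9.0]
n = len(y)

# Deviations from the mean and their row-standardized spatial lag
# (the lag of each observation is the average over its neighbors).
z = [v - mean(y) for v in y]
lag_z = [mean(z[j] for j in neighbors[i]) for i in range(n)]

# Global Moran's I. With row-standardized weights the sum of all weights
# equals n, so the usual n / S0 factor cancels out.
morans_i = sum(zi * li for zi, li in zip(z, lag_z)) / sum(zi * zi for zi in z)

# Least-squares slope of the spatial lag regressed on the variable.
lag_centered = [li - mean(lag_z) for li in lag_z]
slope = sum(zi * li for zi, li in zip(z, lag_centered)) / sum(zi * zi for zi in z)

print(round(morans_i, 4), round(slope, 4))
```

Both numbers agree because the deviations z sum to zero, so centering the lag does not change the numerator of the slope.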

In order to calculate Moran’s I in our dataset, we can call a specific function in PySAL directly.

Note how we do not need to use the standardized version in this context as we will not represent it visually.

The class esda.Moran creates an object that contains much more information than the actual statistic. If we want to retrieve the value of the statistic, we can do it this way:

In [None]:
mi = esda.Moran(geo2['totalcases'], w)
print(mi.I)
print(mi.EI_sim)

The other bit of information we will extract from Moran’s I relates to statistical inference: how likely is the pattern we observe in the map and Moran’s I captures in its value to be generated by an entirely random process? If we considered the same variable but shuffled its locations randomly, would we obtain a map with similar characteristics?

The specific details of the mechanism to calculate this are beyond the scope of the session, but it is important to know that a small enough p-value associated with the Moran’s I of a map allows us to reject the hypothesis that the map is random. In other words, we can conclude that the map displays more spatial pattern than we would expect if the values had been randomly allocated to a particular location.

The most reliable p-value for Moran’s I can be found in the attribute p_sim:

In [None]:
print(mi.p_sim)
print(mi.z_sim)

In [None]:
spatial_stats_file = f'{report_image_path}{os.sep}{ts}_summaryspatialcorrelationstats.csv'
file_exists = os.path.isfile(spatial_stats_file)
with open(spatial_stats_file, 'a') as f:
    if not file_exists:
        f.write("Disease,Moran's I,EI_sim,p_sim,z_sim\n")
    f.write(f'{disease},{round(mi.I, 4)},{round(mi.EI_sim, 4)},{mi.p_sim},{round(mi.z_sim, 4)}\n')

For example, a p-value of 0.001 is just 0.1% and, by standard terms, it would be considered statistically significant. We can quickly elaborate on its intuition. What that 0.001 (or 0.1%) means is that, if we generated a large number of maps with the same values but randomly allocated over space, and calculated the Moran’s I statistic for each of those maps, only 0.1% of them would display a larger (absolute) value than the one we obtain from the real data, and the other 99.9% of the random maps would receive a smaller (absolute) value of Moran’s I. If we remember again that the value of Moran’s I can also be interpreted as the slope of the Moran Plot, what we have is that, in this case, the particular spatial arrangement of values for cases is more concentrated than if the values had been allocated following a completely spatially random process, hence the statistical significance.
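The random-maps intuition can be sketched as a permutation test. The example below uses six hypothetical observations on a line and a one-sided count of random maps whose Moran’s I is at least as large as the observed one; the data, the seed, and the helper `morans_i` are assumptions for illustration, and esda may differ in detail.

```python
import random
from statistics import mean

# Toy data: six observations on a line, each neighboring the adjacent ones.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
y = [1.0, 2.0, 2.5, 7.0, 8.0, 9.0]

def morans_i(values):
    """Global Moran's I under row-standardized weights."""
    z = [v - mean(values) for v in values]
    lag = [mean(z[j] for j in neighbors[i]) for i in range(len(z))]
    return sum(a * b for a, b in zip(z, lag)) / sum(a * a for a in z)

observed = morans_i(y)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
permutations = 999
as_extreme = 0
for _ in range(permutations):
    shuffled = y[:]
    rng.shuffle(shuffled)  # same values, randomly reallocated over space
    if morans_i(shuffled) >= observed:
        as_extreme += 1

# Pseudo p-value: share of maps (random ones plus the observed one)
# that are at least as extreme as the observed map.
p_sim = (as_extreme + 1) / (permutations + 1)
print(round(observed, 4), p_sim)
```

With 999 permutations the smallest value this pseudo p-value can take is 0.001, which is why that figure appears so often in practice.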

Once we have calculated Moran’s I and created an object like mi, we can use some of the functionality in splot to replicate the plot above more easily (remember, D.R.Y.):

In [None]:
moran_scatterplot(mi)

As a first step, the global autocorrelation analysis can teach us that observations do seem to be positively correlated over space. In terms of our initial goal to find spatial structure in the disease cases, this view seems to align: if the cases had no such structure, it should not show a pattern over space -technically, it would show a random one.

## Local Spatial autocorrelation

Moran’s I is a good tool to summarize a dataset into a single value that informs about its degree of clustering. However, it is not an appropriate measure to identify areas within the map where specific values are located. In other words, Moran’s I can tell us values are clustered overall, but it will not inform us about where the clusters are. For that purpose, we need to use a local measure of spatial autocorrelation. Local measures consider each single observation in a dataset and operate on them, as opposed to on the overall data, as global measures do. Because of that, they are not good at summarizing a map, but they allow us to obtain further insight.

In this session, we will consider Local Indicators of Spatial Association (LISAs), a local counterpart of global measures like Moran’s I. At the core of these methods is a classification of the observations in a dataset into four groups derived from the Moran Plot: high values surrounded by high values (HH), low values nearby other low values (LL), high values among low values (HL), and vice versa (LH). Each of these groups is typically called a “quadrant”. An illustration of where each of these groups falls in the Moran Plot can be seen below:

In [None]:
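# Standardize the variable and lag the standardized values, so that the
# quadrant boundaries sit at the mean (zero on both axes)
totalcases_std = ((geo2['totalcases'] - geo2['totalcases'].mean()) / geo2['totalcases'].std()).to_numpy()
w_totalcases_std = weights.lag_spatial(w, totalcases_std)
# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(8, 8))
# Plot values
sns.regplot(x=totalcases_std, y=w_totalcases_std, ci=None, ax=ax)
# Add vertical and horizontal lines
plt.axvline(0, c='k', alpha=0.5)
plt.axhline(0, c='k', alpha=0.5)
# Label the four quadrants
plt.text(1.75, 0.5, "HH", fontsize=25)
plt.text(1.5, -1.5, "HL", fontsize=25)
plt.text(-2, 1, "LH", fontsize=25)
plt.text(-1.5, -2.5, "LL", fontsize=25)
# Display
plt.show()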

So far we have classified each observation in the dataset depending on its value and that of its neighbors. This is only halfway to identifying areas of unusual concentration of values. To know whether each of the locations is a statistically significant cluster of a given kind, we again need to compare it with what we would expect if the data were allocated in a completely random way. After all, by definition, every observation will be of one kind or another, based on the comparison above. However, what we are interested in is whether the strength with which the values are concentrated is unusually high.

This is exactly what LISAs are designed to do. As before, a more detailed description of their statistical underpinnings is beyond the scope of this context, but we will try to shed some light on the intuition of how they go about it. The core idea is to identify cases in which the comparison between the value of an observation and the average of its neighbors is either more similar (HH, LL) or dissimilar (HL, LH) than we would expect from pure chance. The mechanism to do this is similar to the one in the global Moran’s I, but applied in this case to each observation, resulting then in as many statistics as original observations.
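To see what "one statistic per observation" means, the sketch below computes one common textbook form of the local Moran statistic on five hypothetical observations on a line. The data are made up, and the scaling used inside esda may differ by a constant factor, so treat this as an illustration of the idea rather than a replica of the library.

```python
from statistics import mean

# Toy data: five observations on a line, each neighboring the adjacent ones.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
y = [1.0, 2.0, 4.0, 7.0, 9.0]
n = len(y)

z = [v - mean(y) for v in y]
lag_z = [mean(z[j] for j in neighbors[i]) for i in range(n)]
m2 = sum(zi * zi for zi in z) / n  # second moment of the variable

# One local statistic per observation: its deviation from the mean
# times the average deviation of its neighbors, scaled by m2.
local_i = [zi * li / m2 for zi, li in zip(z, lag_z)]

# With row-standardized weights, their average is the global Moran's I.
global_i = sum(zi * li for zi, li in zip(z, lag_z)) / sum(zi * zi for zi in z)
print([round(v, 3) for v in local_i])
print(round(mean(local_i), 4), round(global_i, 4))
```

Positive local values mark observations that resemble their neighbors (HH or LL), while negative ones mark spatial outliers (HL or LH).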

LISAs are widely used in many fields to identify clusters of values in space. They are a very useful tool that can quickly return areas in which values are concentrated and provide suggestive evidence about the processes that might be at work. For that, they have a prime place in the exploratory toolbox. Examples of contexts where LISAs can be useful include: identification of spatial clusters of poverty in regions, detection of ethnic enclaves, delineation of areas of particularly high/low activity of any phenomenon, etc.

In Python, we can calculate LISAs in a very streamlined way thanks to PySAL:

In [None]:
lisa = esda.Moran_Local(geo2['totalcases'], w)

All we need to pass is the variable of interest -total cases- and the spatial weights that describe the neighborhood relations between the different observations that make up the dataset.

Because of their very nature, looking at the numerical result of LISAs is not always the most useful way to exploit all the information they can provide. Remember that we are calculating a statistic for every single observation in the data so, if we have many of them, it will be difficult to extract any meaningful pattern. Instead, what is typically done is to create a map, a cluster map as it is usually called, that extracts the significant observations (those that are highly unlikely to have come from pure chance) and plots them with a specific color depending on their quadrant category.

All of the needed pieces are contained inside the lisa object we have created above. But, to make the map making more straightforward, it is convenient to pull them out and insert them in the main data table, geo2:

In [None]:
# Break observations into significant or not
geo2['significant'] = lisa.p_sim < 0.05
# Store the quadrant they belong to
geo2['quadrant'] = lisa.q

Let us stop for a second on these two steps. First, the significant column. As with global Moran’s I, PySAL automatically computes a p-value for each LISA. Because not every observation represents a statistically significant one, we want to identify those with a p-value small enough to rule out the possibility of obtaining a similar situation from pure chance. Following a similar reasoning as with global Moran’s I, we select 5% as the threshold for statistical significance. To identify these values, we create a variable, significant, that contains True if the p-value of the observation satisfies the condition, and False otherwise. We can check this is the case:

In [None]:
geo2['significant'].head()

And the first five p-values can be checked by:

In [None]:
lisa.p_sim[:5]

Note how every p-value smaller than 0.05 corresponds to a True in the variable significant.

Second, the quadrant each observation belongs to. This one is easier as it comes built into the lisa object directly:

In [None]:
geo2['quadrant'].head()

The correspondence between the numbers in the variable and the actual quadrants is as follows:

1: HH

2: LH

3: LL

4: HL
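The quadrant codes listed above, and the way they combine with significance, can be reproduced by hand. The sketch below assumes six hypothetical observations on a line and made-up p-values; the helper `quadrant` and the "ns" label for non-significant observations are illustrative choices, not part of the analysis.

```python
from statistics import mean

# Toy data: six observations on a line, each neighboring the adjacent ones.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
y = [10.0, 8.0, 2.0, 9.0, 1.0, 0.0]

z = [v - mean(y) for v in y]
lag_z = [mean(z[j] for j in neighbors[i]) for i in range(len(y))]

def quadrant(value, lag):
    """Map a (deviation, lagged deviation) pair to the codes listed above."""
    if value > 0 and lag > 0:
        return 1  # HH
    if value <= 0 and lag > 0:
        return 2  # LH
    if value <= 0 and lag <= 0:
        return 3  # LL
    return 4      # HL

quadrants = [quadrant(v, l) for v, l in zip(z, lag_z)]
print(quadrants)  # [1, 1, 2, 4, 3, 3]

# Hypothetical p-values; only observations below the threshold keep a label.
p_values = [0.01, 0.30, 0.04, 0.02, 0.60, 0.03]
NAMES = {1: "HH", 2: "LH", 3: "LL", 4: "HL"}
labels = [NAMES[q] if p < 0.05 else "ns" for q, p in zip(quadrants, p_values)]
print(labels)  # ['HH', 'ns', 'LH', 'HL', 'ns', 'LL']
```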

With these two elements, significant and quadrant, we can build a typical LISA cluster map combining the mapping skills with what we have learned about subsetting and querying tables:

We can create a quick LISA cluster map with splot:

In [None]:
print(f'{report_image_path}{ts}_{disease}_summaryhotspots.png')

In [None]:
import os
from datetime import datetime

# f, ax = plt.subplots(1, figsize=(20, 20))
# f.patch.set_facecolor('#BFBFBF')
ax = africa_gdf.plot(color='beige', figsize=(20,20))
geo2.plot(ax=ax, color="lightgrey")
lisa_cluster(lisa, geo2, ax=ax)
# notincluded_gdf.plot(ax=ax, color='beige')
waterbodies_gdf.plot(ax=ax, color='black')
leg1 = ax.get_legend()
leg1.set_title(f'Hot spots for {disease}')
ax.set_axis_off()
plt.savefig(f'{report_image_path}{ts}_{disease}_summaryhotspots.png', bbox_inches='tight', transparent=True, dpi=300)

Or, if we want to have more control over what is being displayed, and how each component is presented, we can “cook” the plot ourselves:

Below is the same plot but showing how you can change colors.

In [None]:
# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(9, 9))
# Plot insignificant clusters
ns = geo2.loc[geo2['significant']==False, 'geom']
ns.plot(ax=ax, color='grey')
# Plot HH clusters
hh = geo2.loc[(geo2['quadrant']==1) & (geo2['significant']==True), 'geom']
hh.plot(ax=ax, color='red')
# Plot LL clusters
ll = geo2.loc[(geo2['quadrant']==3) & (geo2['significant']==True), 'geom']
ll.plot(ax=ax, color='mediumblue')
# Plot LH clusters
lh = geo2.loc[(geo2['quadrant']==2) & (geo2['significant']==True), 'geom']
lh.plot(ax=ax, color='lightblue')
# Plot HL clusters
hl = geo2.loc[(geo2['quadrant']==4) & (geo2['significant']==True), 'geom']
hl.plot(ax=ax, color='salmon')
waterbodies_gdf.plot(ax=ax, color='black')
# Style and draw
f.suptitle('Total cases', size=30)
f.set_facecolor('0.75')
ax.set_axis_off()
plt.show()

The map above displays the LISA results for the disease's total cases. In bright red, we find those districts with an unusual concentration of high cases surrounded also by high levels of cases. In medium blue, we find the opposite: districts with low cases surrounded by other districts with low cases. In light red, we find the first type of spatial outlier: areas with high cases surrounded by areas with low cases. Finally, in light blue we find the other type of spatial outlier: districts with low cases surrounded by districts with high cases.

The substantive interpretation of a LISA map needs to relate its output to the original intention of the analyst who created the map. In this case, our original idea was to explore the spatial structure of cases. The LISA proves a fairly useful tool in this context. Comparing the LISA map above with the choropleth we started with, we can interpret the LISA as a “simplification” of the detailed but perhaps too complicated picture in the choropleth. It focuses the reader’s attention on the areas that display a particularly high concentration of (dis)similar values, helping the spatial structure of the cases emerge in a more explicit way.

The results from the LISA statistics can be connected to the Moran plot to visualise where in the scatter plot each type of polygon falls:

In [None]:
plot_local_autocorrelation(lisa, geo2, 'totalcases', figsize=(20,6))

## Correlation of total cases and population near water bodies

Here, I calculate a simple linear correlation between the total cases per district and each of the other variables in that same district. We can see that there is a slight but significant global linear correlation between the variables and cases.

In [None]:
# Map each column to the name used in the printed message
correlation_columns = {
    'val_precipitation': 'precipitation',
    'val_temperature': 'temperature',
    'val_trees': 'val_trees',
    'percent_cov_trees': 'percent_cov_trees',
    'val_crops': 'val_crops',
    'percent_cov_crops': 'percent_cov_crops',
    'val_builtup': 'val_builtup',
    'percent_cov_builtup': 'percent_cov_builtup',
    'val_bareground': 'val_bareground',
    'percent_cov_bareground': 'percent_cov_bareground',
    'val_rangeland': 'val_rangeland',
    'percent_cov_rangeland': 'percent_cov_rangeland',
}

# Pearson (linear) correlation of total cases with each variable
for column, name in correlation_columns.items():
    corr = data['totalcases'].corr(data[column])
    print(f'Linear correlation for {name} and total_cases is: {corr}')

Scatterplots of the total cases against the other variables.

In [None]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(6, 2, figsize=(12, 25))
data.plot.scatter(x = 'val_precipitation', y = 'totalcases', ax = axs[0, 0], c='DarkBlue')
data.plot.scatter(x = 'val_temperature', y = 'totalcases', ax = axs[0, 1], c='DarkBlue')
data.plot.scatter(x = 'val_trees', y = 'totalcases', ax = axs[1, 0], c='DarkBlue')
data.plot.scatter(x = 'percent_cov_trees', y = 'totalcases', ax = axs[1, 1], c='DarkBlue')
data.plot.scatter(x = 'val_crops', y = 'totalcases', ax = axs[2, 0], c='DarkBlue')
data.plot.scatter(x = 'percent_cov_crops', y = 'totalcases', ax = axs[2, 1], c='DarkBlue')
data.plot.scatter(x = 'val_builtup', y = 'totalcases', ax = axs[3, 0], c='DarkGreen')
data.plot.scatter(x = 'percent_cov_builtup', y = 'totalcases', ax = axs[3, 1], c='DarkRed')
data.plot.scatter(x = 'val_bareground', y = 'totalcases', ax = axs[4, 0], c='DarkBlue')
data.plot.scatter(x = 'percent_cov_bareground', y = 'totalcases', ax = axs[4, 1], c='DarkBlue')
data.plot.scatter(x = 'val_rangeland', y = 'totalcases', ax = axs[5, 0], c='DarkGreen')
data.plot.scatter(x = 'percent_cov_rangeland', y = 'totalcases', ax = axs[5, 1], c='DarkRed')
plt.savefig(f'{report_image_path}{os.sep}{ts}_{disease}_scatterplots.png', bbox_inches='tight', transparent=True, dpi=300)

Below, I fit a polynomial curve to the data and visualize the results in the graphs. I tried polynomials of order 3 and 5; the cell below uses order=2. The graphs suggest that by using correlations other than linear and by using machine learning models, there is good potential to predict cases from the variables.

In [None]:
fig, axs = plt.subplots(6, 2, figsize=(8, 25))

sns.regplot(data=data_agg, ax=axs[0,0], x='totalcases', y='val_precipitation', order=2, scatter_kws={"color": "LightBlue"}, line_kws={"color": "grey"})
sns.regplot(data=data_agg, ax=axs[0,1], x='totalcases', y='val_temperature', order=2, scatter_kws={"color": "LightBlue"}, line_kws={"color": "grey"})
sns.regplot(data=data_agg, ax=axs[1,0], x='totalcases', y='val_trees', order=2, scatter_kws={"color": "LightBlue"}, line_kws={"color": "grey"})
sns.regplot(data=data_agg, ax=axs[1,1], x='totalcases', y='percent_cov_trees', order=2, scatter_kws={"color": "LightBlue"}, line_kws={"color": "grey"})
sns.regplot(data=data_agg, ax=axs[2,0], x='totalcases', y='val_crops', order=2, scatter_kws={"color": "LightBlue"}, line_kws={"color": "grey"})
sns.regplot(data=data_agg, ax=axs[2,1], x='totalcases', y='percent_cov_crops', order=2, scatter_kws={"color": "LightBlue"}, line_kws={"color": "grey"})
sns.regplot(data=data_agg, ax=axs[3,0], x='totalcases', y='val_builtup', order=2, scatter_kws={"color": "LightBlue"}, line_kws={"color": "grey"})
sns.regplot(data=data_agg, ax=axs[3,1], x='totalcases', y='percent_cov_builtup', order=2, scatter_kws={"color": "LightBlue"}, line_kws={"color": "grey"})
sns.regplot(data=data_agg, ax=axs[4,0], x='totalcases', y='val_bareground', order=2, scatter_kws={"color": "LightBlue"}, line_kws={"color": "grey"})
sns.regplot(data=data_agg, ax=axs[4,1], x='totalcases', y='percent_cov_bareground', order=2, scatter_kws={"color": "LightBlue"}, line_kws={"color": "grey"})
sns.regplot(data=data_agg, ax=axs[5,0], x='totalcases', y='val_rangeland', order=2, scatter_kws={"color": "LightBlue"}, line_kws={"color": "grey"})
sns.regplot(data=data_agg, ax=axs[5,1], x='totalcases', y='percent_cov_rangeland', order=2, scatter_kws={"color": "LightBlue"}, line_kws={"color": "grey"})
plt.savefig(f'{report_image_path}{os.sep}{ts}_{disease}_summaryregressionplots.png', bbox_inches='tight', transparent=True, dpi=300)
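
As one illustration of a correlation "other than linear", the sketch below compares the Pearson correlation with a rank-based (Spearman) correlation on toy data. Spearman is my choice of example here, not something the notebook computes, and it is obtained as the Pearson correlation of the ranks:

```python
import pandas as pd

# Toy data with a monotonic but non-linear relationship (illustrative only)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = x ** 3

# Linear (Pearson) correlation, the pandas default
pearson = x.corr(y)

# Spearman correlation: Pearson correlation computed on the ranks
spearman = x.rank().corr(y.rank())

print(f'Pearson: {pearson:.3f}, Spearman: {spearman:.3f}')
```

On data like this the rank-based value is higher than the linear one, because the relationship is perfectly monotonic but curved.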

## Outliers

The boxplot shows that there are many outliers and a lot of variance in the data.

In [None]:
import seaborn as sns

fig, ax = plt.subplots(figsize=(12, 25))
cols = ['totalcases']  #'val_precipitation', 'val_temperature'] #, 'pop_near_water_10']
# Pass the frame by keyword so it is not mistaken for another argument
sns.boxplot(data=data[data['totalcases'] > 0][cols], ax=ax)

# Some more data preprocessing

# Machine Learning predictions

## Decision trees

### Use a decision tree to see how well the attributes can predict the total cases.

Decide if we want to classify the cases. If we set n_classes = 1, the model will try to predict the actual case values. n_classes = 2 is a binary classification of either cases present or not. n_classes greater than 2 can be considered a range from low to high, but keeping 0 cases as its own class since 0 cases is an important cutoff for disease.

Note that I experimented with previous_week_totalcases and previous_week_neighbors_totalcases, where I explicitly calculated these values for each district and added them to the training data. I removed them because I felt it was not correct to make this dependence so explicit. Yes, previous_week_totalcases did improve the models' predictions substantially. However, the model also depended on this feature much more than on any other feature. In the real world, a model that relies almost completely on the previous week's cases is not as valuable because we cannot control time; a model that focuses more on factors that we can control is best.

* The n_classes variable is the number of classes for the machine learning problem.
* n_classes = 1 means do not classify the case values; treat the problem as a regression that tries to predict the actual number of cases.
* n_classes = 2 is a binary classification of cases vs. no cases.
* n_classes > 2 is a multiclass classification that keeps one of the classes as no cases. So, if you set it to 3, one class would be no cases and the other two would be low and high cases.
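
The three settings can be sketched on a toy list of counts as follows. This is a simplified, vectorized illustration rather than the notebook's own code: the function name `label_cases` and the hand-picked `breaks` are assumptions, whereas the notebook derives its breaks with mapclassify.

```python
import numpy as np

def label_cases(totalcases, n_classes, breaks=()):
    """Turn case counts into labels.

    `breaks` holds hypothetical upper bounds separating the non-zero
    classes; it is only needed when n_classes > 2.
    """
    counts = np.asarray(totalcases)
    if n_classes < 2:
        return counts                       # regression: keep the raw counts
    if n_classes == 2:
        return (counts > 0).astype(int)     # 0 = no cases, 1 = cases
    # Class 0 is reserved for zero cases; positive counts get 1..n_classes-1
    labels = np.digitize(counts, breaks, right=True) + 1
    return np.where(counts > 0, labels, 0)

counts = [0, 3, 0, 12, 250]
print(label_cases(counts, 1).tolist())
print(label_cases(counts, 2).tolist())
print(label_cases(counts, 3, breaks=[10]).tolist())
```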

In [None]:
# decision tree for feature importance on a regression problem
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler
from matplotlib import pyplot
import mapclassify as mc

features_temp = [
    'epidemic_week',
    'val_precipitation',
    'val_temperature',
    'val_trees',
    'percent_cov_trees',
    'val_crops',
    'percent_cov_crops',
    'val_builtup',
    'percent_cov_builtup',
    'val_bareground',
    'percent_cov_bareground',
    'val_rangeland',
    'percent_cov_rangeland',
    'relative_pop_density',
    'pop_near_water',
    'val_wealth',
    'val_elevation',
    # 'previous_week_totalcases',
    # 'previous_week_neighbors_totalcases',
]

features = [
    'val_precipitation',
    'val_temperature',
    'trees',
    'crops',
    'builtup',
    'bareground',
    'rangeland',
    'relative_pop_density',
    'pop_near_water',
    'val_wealth',
    'val_elevation',
    # 'previous_week_totalcases',
    # 'previous_week_neighbors_totalcases',
]

label = 'cases'
n_classes = 2

if n_classes > 2:
    nb = mc.NaturalBreaks(data[data['totalcases'] > 0]['totalcases'], k=n_classes-1) # I subtract 1 from the n_classes because I will consider 0 case counts as its own class and get that manually.

def classify(d, bins):
    c = []
    for r in d:
        v = 0
        if r > 0:
            for i in range(len(bins)-1, -1, -1):
                if r > bins[i]:
                    v = i + 2
                    break
                v = 1
        c.append(v)
    return np.array(c)


def classify_binary(d):
    c = []
    for r in d:
        v = 0
        if r > 0:
            v = 1
        c.append(v)
    return np.array(c)

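if n_classes > 2:
    # Use every break except the last one (the maximum), so that positive
    # counts fall into classes 1..n_classes-1 and 0 stays its own class.
    classes = classify(data['totalcases'], nb.bins[:-1])
elif n_classes == 2:
    classes = classify_binary(data['totalcases'])
elif n_classes < 2:
    classes = data['totalcases']
data[label] = classes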

### Select previous week's cases
1. The number of cases for that district for the previous week.
2. The number of cases for the neighbors of that district, weighted by the Queen spatial weights above.

**Note**: be careful, because on my computer, which is a fairly capable machine, selecting the previous week's cases **took about 5 hours.**

Note that I excluded this code as described above.
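
For readers who want to experiment with these two features, the following is a minimal sketch on toy data, not the excluded code. The `district` column and the `weights` dictionary are hypothetical stand-ins for the district identifier and the Queen weights:

```python
import pandas as pd

# Toy weekly counts; 'district' is a hypothetical column name
df = pd.DataFrame({
    'district': ['A', 'A', 'A', 'B', 'B', 'B'],
    'epidemic_week': [1, 2, 3, 1, 2, 3],
    'totalcases': [5, 7, 2, 0, 1, 4],
})

# Hypothetical row-standardised neighbor weights: {district: {neighbor: weight}}
weights = {'A': {'B': 1.0}, 'B': {'A': 1.0}}

df = df.sort_values(['district', 'epidemic_week'])

# 1. Cases in the same district one week earlier (NaN for the first week)
df['previous_week_totalcases'] = df.groupby('district')['totalcases'].shift(1)

# Lookup of the lagged value by (district, week)
lagged = {
    (d, wk): v
    for d, wk, v in zip(df['district'], df['epidemic_week'],
                        df['previous_week_totalcases'])
}

# 2. Weighted sum of the neighbors' cases one week earlier
df['previous_week_neighbors_totalcases'] = [
    sum(w * lagged.get((neighbor, wk), float('nan'))
        for neighbor, w in weights[d].items())
    for d, wk in zip(df['district'], df['epidemic_week'])
]
print(df)
```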

### Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split


X = data[features_temp]
y = data[label]
X_train_temp, X_test_temp, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Create the landcover values for both the amount of coverage in the district and the percent of coverage of the district for each landcover class.

Combine measurements for:
1. The amount of landcover
2. The percent landcover
3. The population density

This was done so that large and small districts in area and those with high and low populations are not unrealistically affected by one of the values.

In [None]:
landcover_classes = [
    'trees',
    'crops',
    'builtup',
    'bareground',
    'rangeland'
]

for lc in landcover_classes:
    f1 = X_train_temp[f'val_{lc}']
    f2 = X_train_temp[f'percent_cov_{lc}']
    f3 = X_train_temp['relative_pop_density']
    f1_min = f1.min()
    f1_max = f1.max()
    f2_min = f2.min()
    f2_max = f2.max()
    f3_min = f3.min()
    f3_max = f3.max()
    X_train_temp[lc] = ((((f1-f1_min) / (f1_max-f1_min)) + ((f2-f2_min) / (f2_max-f2_min))) / 2) * ((f3-f3_min) / (f3_max-f3_min))
    
    f1 = X_test_temp[f'val_{lc}']
    f2 = X_test_temp[f'percent_cov_{lc}']
    f3 = X_test_temp['relative_pop_density']
    f1_min = f1.min()
    f1_max = f1.max()
    f2_min = f2.min()
    f2_max = f2.max()
    f3_min = f3.min()
    f3_max = f3.max()
    X_test_temp[lc] = ((((f1-f1_min) / (f1_max-f1_min)) + ((f2-f2_min) / (f2_max-f2_min))) / 2) * ((f3-f3_min) / (f3_max-f3_min))
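
The per-class computation above can be read as three steps: min-max scale the amount and the percent coverage, average the two, and multiply by the min-max scaled population density. Below is a sketch of the same arithmetic as a reusable helper, shown on toy values. The function names `min_max` and `combine_landcover` are my own and do not appear in the notebook:

```python
import pandas as pd

def min_max(s):
    """Rescale a Series to the 0-1 range."""
    return (s - s.min()) / (s.max() - s.min())

def combine_landcover(df, lc):
    """Average the scaled amount and percent cover, then weight by scaled density."""
    amount = min_max(df[f'val_{lc}'])
    percent = min_max(df[f'percent_cov_{lc}'])
    density = min_max(df['relative_pop_density'])
    return ((amount + percent) / 2) * density

# Toy values (illustrative only)
toy = pd.DataFrame({
    'val_trees': [0.0, 50.0, 100.0],
    'percent_cov_trees': [0.0, 20.0, 40.0],
    'relative_pop_density': [1.0, 3.0, 5.0],
})
toy['trees'] = combine_landcover(toy, 'trees')
print(toy['trees'].tolist())
```

One consequence of the multiplication is that the row with the lowest population density always gets a combined value of 0, whatever its land cover.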

### Select the features to use in machine learning.

In [None]:
X_train = X_train_temp[features]
X_test = X_test_temp[features]

### Normalize the Data

Use the RobustScaler because our data has lots of outliers and the RobustScaler better normalizes when there are many outliers.

In [None]:
scaler = RobustScaler()
scaler = scaler.fit(X_train)

In [None]:
X_train = pd.concat([pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns), X_train_temp['epidemic_week']], axis=1)
X_test = pd.concat([pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns), X_test_temp['epidemic_week']], axis=1)
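
With its default settings, RobustScaler subtracts each column's median and divides by its interquartile range, so a few extreme values barely change how the remaining values are scaled. A small numpy sketch of that arithmetic on made-up numbers (not the notebook's data), with mean and standard deviation scaling shown for comparison:

```python
import numpy as np

# One feature with a single extreme outlier (illustrative values)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Median and interquartile range, the statistics RobustScaler uses by default
median = np.median(x)
q25, q75 = np.percentile(x, [25, 75])
robust = (x - median) / (q75 - q25)

# Mean and standard deviation scaling, for comparison
standard = (x - x.mean()) / x.std()

print(robust.tolist())
print(np.round(standard, 2).tolist())
```

In the robust version the four ordinary values stay spread between -1 and 0.5, while in the mean-based version the outlier compresses them into a narrow band.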

### Perform oversampling or undersampling if desired.

I excluded this in the end. Most AutoML libraries include ways to account for class imbalances without the need for oversampling or undersampling. When I tested both random undersampling and SMOTE oversampling, the results were worse. Therefore, I preferred to train the model with eval_metric='f1' because it can use all of the data and still account for the class imbalance by optimizing the F1 score, which balances precision and recall.

from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import ADASYN
from imblearn.over_sampling import SVMSMOTE
from imblearn.over_sampling import KMeansSMOTE
from imblearn.combine import SMOTEENN
from imblearn.combine import SMOTETomek


n_jobs = os.cpu_count() - 2
resampling_methods = {
    'NearMiss': NearMiss(n_jobs=n_jobs),
    'RandomOverSampler': RandomOverSampler(),
    'ADASYN': ADASYN(n_jobs=n_jobs), 
    'SVMSMOTE': SVMSMOTE(n_jobs=n_jobs),
    'KMeansSMOTE': KMeansSMOTE(n_jobs=n_jobs),
    'SMOTEENN': SMOTEENN(n_jobs=n_jobs),
    'SMOTETomek': SMOTETomek(n_jobs=n_jobs)
}

def perform_resampling(X, y, resampling_method):
    # X: independent variables as a pandas DataFrame
    # y: dependent variable in pandas format
    # Returns the resampled X and y
    X, y = resampling_method.fit_resample(X, y)
    return X, y
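
To illustrate why F1 is a more informative target than accuracy when one class is rare, here is a small pure-Python sketch on made-up labels. The 90/10 split is an assumption for illustration, not the notebook's actual class balance:

```python
# Toy imbalanced labels: 1 = cases, 0 = no cases (illustrative only)
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100            # a model that always predicts "no cases"

def f1(y_true, y_pred):
    """F1 score for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, f1(y_true, y_pred))
```

The always-negative model looks good on accuracy but scores zero on F1, which is the behavior we want the metric to penalize.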

### Put the features and labels together into one dataframe with the labels as the last column.

In [None]:
df_train = pd.concat([X_train, y_train], axis=1)
df_test = pd.concat([X_test, y_test], axis=1)

### Set up AutoML using AutoGluon

In [None]:
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset(df_train)
train_data.head()

In [None]:
label = 'cases'
save_path = 'agModels-predictClass'  # specifies folder to store trained models
predictor = TabularPredictor(label=label, path=save_path, sample_weight='auto_weight', eval_metric='f1')   # eval_metric='recall_weighted' for more than 2 classes # recall

### Train (fit) the model

The time_limit parameter of fit() caps the training time in seconds. If you omit it, as in the cell below, or set it higher, training will run longer but will most likely give better results. From my experience, while you are developing, testing different scenarios, or changing lots of the code above, set time_limit=180 (180 seconds); you will likely get decent results. Then, once your code is more settled, you can let it run longer.

In [None]:
predictor = predictor.fit(train_data)

### Evaluate the best model created by AutoGluon

In [None]:
test_data = TabularDataset(df_test)

y_pred = predictor.predict(test_data.drop(columns=[label]))

df_test[f'{label}_pred'] = y_pred
# df_test.to_pickle('df_test.pickle')

In [None]:
mtr = predictor.evaluate(test_data, silent=True)
model_metrics_file = f'{report_image_path}{os.sep}{ts}_{n_classes}_modelmetrics.csv'
file_exists = os.path.isfile(model_metrics_file)
with open(model_metrics_file, 'a') as f:
    if n_classes == 2:
        if not file_exists:
            f.write("disease,accuracy,balanced_accuracy,mcc,roc_auc,f1,precision,recall\n")
        f.write(f"{disease},{round(mtr['accuracy'],3)},{round(mtr['balanced_accuracy'],3)},{round(mtr['mcc'],3)},{round(mtr['roc_auc'],3)},{round(mtr['f1'],3)},{round(mtr['precision'],3)},{round(mtr['recall'],3)}\n")
    if n_classes == 3:
        if not file_exists:
            f.write("disease,accuracy,balanced_accuracy,recall_weighted,mcc\n")
        f.write(f"{disease},{round(mtr['accuracy'],3)},{round(mtr['balanced_accuracy'],3)},{round(mtr['recall_weighted'],3)},{round(mtr['mcc'],3)}\n")
mtr

In [None]:
# Calculating the confusion matrix
if n_classes == 2:
    cm = ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=['no cases', 'cases'], values_format = '.6g', normalize='true')
    cm.figure_.savefig(f'{report_image_path}{os.sep}{ts}_{disease}_{n_classes}_confusion_matrix.png', bbox_inches='tight', transparent=True, dpi=300)
if n_classes == 3:
    cm = ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=['none', 'low', 'high'], values_format = '.6g', normalize='true')
    cm.figure_.savefig(f'{report_image_path}{os.sep}{ts}_{disease}_{n_classes}_confusion_matrix.png', bbox_inches='tight', transparent=True, dpi=300)
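
The normalize='true' option divides each row of the confusion matrix by the number of samples in that true class, so each row sums to 1 and the diagonal shows per-class recall. A small numpy sketch of that normalization on made-up labels:

```python
import numpy as np

# Toy labels: 0 = no cases, 1 = cases (illustrative only)
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])

# Raw counts: rows are true classes, columns are predicted classes
counts = np.zeros((2, 2))
for t, p in zip(y_true, y_pred):
    counts[t, p] += 1

# Divide each row by its total so every row sums to 1
normalized = counts / counts.sum(axis=1, keepdims=True)
print(normalized)
```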

### List the best performing models from AutoGluon.

In [None]:
predictor.leaderboard(test_data, silent=True)

## Generate the feature importances.

The article below describes how AutoGluon calculates feature importance.

https://explained.ai/rf-importance/

It is important to note that feature importance is similar to a variable's coefficient in linear regression, but it is not the same and is calculated differently. Feature importance gives you a good sense of which variables matter most to the model when it makes predictions. Therefore, it is not exactly the same as which variable is most important in determining disease spread, but it gives you a close estimate of this.

In [None]:
fi = predictor.feature_importance(train_data)
fi
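
The idea behind permutation-based importance, as described in the linked article, is to shuffle one feature at a time and measure how much the model's score drops. The sketch below illustrates this with numpy on synthetic data. The least-squares "model" and the variance-based score are stand-ins I chose for brevity; this is not AutoGluon's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the target depends on feature 0 only; feature 1 is pure noise
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

# Stand-in "model": ordinary least squares fitted with numpy
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def score(X, y):
    """Share of the target's variance explained by the fitted coefficients."""
    residual = y - X @ coef
    return 1 - residual.var() / y.var()

baseline = score(X, y)
importance = []
for j in range(X.shape[1]):
    X_shuffled = X.copy()
    # Shuffling a column breaks its link to the target
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    # Importance is the drop in score caused by the shuffle
    importance.append(baseline - score(X_shuffled, y))

print(importance)
```

The informative feature shows a large drop in score, while the noise feature shows a drop close to zero.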

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
ax.barh(y=fi.index, width=fi.importance)
ax.invert_yaxis()
plt.grid()
plt.savefig(f'{report_image_path}{os.sep}{ts}_{disease}_{n_classes}_featureimportancechart.png', bbox_inches='tight', transparent=True, dpi=300)
plt.show()

**Note**: the code below is **deprecated** because I switched to the AutoGluon library above for AutoML instead of FLAML below. AutoGluon gets better results and has a few built-in deep learning models to choose from.

### Set up AutoML using Microsoft's FLAML library

AutoML will help us pick the best performing model for predicting the data.

If n_classes >= 2, then we should treat it as a classification problem. Otherwise, predicting the actual case numbers is a regression problem.

### Train the model using AutoML

### Find the best performing model

### Compute predictions of testing dataset

This is to get a first look at how the model did.

### Compute different metric values on testing dataset to evaluate it

#### metrics

If it's a regression problem, then we need to use different metrics than for a classification problem.

### compare predicted versus actual

Let us print some of the predictions compared to the actual values to visually check the predictions.
In the classification problem, I subsetted the values where there are actual cases present since this is the smaller class and therefore more challenging to get correct.

### Plot the learning curve

To make sure we avoid overfitting and to see if training longer will produce better results.

### Plot feature importance

The feature importance measures tell us which variables the model decided are most important for predicting disease cases.