# Detecting and dealing with bias in EHR data

Biases in Electronic Health Records (EHR) data pose significant challenges to healthcare analytics, potentially leading to skewed research findings and clinical decision-making.
These biases can arise from various sources, such as selective documentation, patient demographic skews, or algorithmic biases in data collection and processing.
Such biases can result in the misrepresentation of patient populations, the underrepresentation of minority groups, or inaccuracies in disease prevalence rates.
Exploratory data analysis (EDA), through data visualization, statistical summaries, and pattern identification, is a critical step in uncovering and mitigating biases.

In this tutorial, we outline the various sources of bias and show how they can be detected and potentially mitigated with ehrapy.
It is important to note that many biases are inherent to the data collection process itself and can only be revealed, not always dealt with.
We make use of the [Diabetes 130-US Hospitals for years 1999-2008](https://archive.ics.uci.edu/dataset/296/diabetes+130-us+hospitals+for+years+1999-2008) dataset and a synthetic dataset for our analysis.

## Selection bias

Selection bias occurs when the data are not representative of the general population, often because the individuals in the dataset are more likely to seek care or have certain conditions.
This can lead to skewed results and incorrect inferences about disease prevalence, treatment effects, or health outcomes.
For example, a study using EHR data from a specialized clinic may overestimate the prevalence of a specific condition because patients visiting the clinic are more likely to have that condition compared to the general population.

One way to detect selection bias is to compute summary statistics of the cohort, such as median income, compare them with the national average, and plot the comparison.
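A minimal sketch of such a comparison, using age groups instead of income. Both the cohort and the reference shares are made-up numbers for illustration; in practice the reference would come from census or registry data.

```python
import pandas as pd

# Hypothetical cohort: age group of each patient in the EHR extract.
cohort = pd.DataFrame(
    {"age_group": ["<40"] * 10 + ["40-65"] * 30 + [">65"] * 60}
)

# Hypothetical reference shares (made-up numbers, not real census data).
reference_share = pd.Series({"<40": 0.50, "40-65": 0.30, ">65": 0.20})

# Observed share of each group in the cohort.
cohort_share = cohort["age_group"].value_counts(normalize=True)

# Representation ratio: > 1 means over-represented, < 1 under-represented.
comparison = pd.DataFrame({"cohort": cohort_share, "reference": reference_share})
comparison["ratio"] = comparison["cohort"] / comparison["reference"]
print(comparison.sort_values("ratio"))
```

If matplotlib is installed, `comparison[["cohort", "reference"]].plot.bar()` draws the two distributions side by side.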

## Information bias

Information bias refers to inaccuracies or inconsistencies in the data recorded, which can stem from errors in how health information is documented, interpreted, or coded.
Such biases can lead to misclassifications, either overestimating or underestimating the association between exposures and outcomes.
For instance, if a condition is under-documented due to lack of standardized diagnostic criteria across EHR systems, studies might underestimate its prevalence and the effectiveness of treatments.

Detection largely comes down to classical quality control. It is worth checking whether one group is more susceptible to data collection errors than another, for example because a minority group is measured with a cheaper, error-prone instrument while wealthier patients are measured with newer, more accurate equipment.
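The following sketch simulates this situation under stated assumptions: both groups share the same true distribution of a lab value, but group `B` is measured with a hypothetical noisier device. The column names, noise levels, and plausibility thresholds are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500  # patients per group

# Synthetic data: identical true HbA1c distribution (%) in both groups,
# but group "B" is measured with a hypothetical, noisier device.
true_value = rng.normal(loc=7.0, scale=1.0, size=2 * n)
group = np.repeat(["A", "B"], n)
noise_sd = np.where(group == "A", 0.1, 1.5)
measured = true_value + rng.normal(0.0, 1.0, size=2 * n) * noise_sd

df = pd.DataFrame({"group": group, "hba1c": measured})

# Flag values outside a plausible range (illustrative thresholds only).
df["implausible"] = ~df["hba1c"].between(4.0, 14.0)

# Group-wise quality control: spread and rate of implausible values.
qc = df.groupby("group").agg(
    mean=("hba1c", "mean"),
    std=("hba1c", "std"),
    implausible_rate=("implausible", "mean"),
)
print(qc.round(3))
```

A clearly larger spread or implausible-value rate in one group, with similar means, suggests a measurement problem rather than a clinical difference.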

## Coding bias

Coding bias in EHR data arises when there are discrepancies or inconsistencies in how medical conditions, procedures, and outcomes are coded, often due to variation in the understanding and application of coding systems by different healthcare providers.
This can lead to misrepresentation of patient conditions, treatments received, and outcomes, affecting the reliability of research and analyses conducted using this data.
For example, two providers may code the same symptom differently, leading to challenges in accurately aggregating and comparing data across EHR systems.

One approach is to look for overlapping ICD codes that can lead to this issue. Ontologies such as Mondo are valuable for harmonization, but they may simplify some concepts in ways that introduce bias.
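As a rough illustration, the sketch below compares how two hypothetical hospitals code comparable patients with ICD-9. One hospital records detailed diabetes codes (`250.00`, `250.01`, `250.02`), the other only the three-digit category `250`. The records are invented for this example.

```python
import pandas as pd

# Hypothetical diagnosis records from two hospitals with comparable patients.
records = pd.DataFrame(
    {
        "hospital": ["H1"] * 6 + ["H2"] * 6,
        "icd9": ["250.00", "250.02", "250.02", "250.01", "250.00", "401.9",
                 "250", "250", "250", "250", "250", "401.9"],
    }
)

# Share of each code within each hospital. Large differences for comparable
# populations hint at coding practice rather than clinical differences.
code_share = pd.crosstab(records["hospital"], records["icd9"], normalize="index")
print(code_share.round(2))

# Collapsing to the three-digit ICD-9 category removes granularity differences.
records["icd9_category"] = records["icd9"].str.split(".").str[0]
category_share = pd.crosstab(
    records["hospital"], records["icd9_category"], normalize="index"
)
print(category_share.round(2))
```

Collapsing codes makes the hospitals comparable, but it also discards the distinction between subtypes, which is the kind of simplification that can hide real differences.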

## Surveillance bias

Surveillance bias occurs when the likelihood of detecting a condition or outcome is influenced by the intensity or frequency of monitoring, leading to an overestimation of the association between exposure and outcome.
This bias is particularly prevalent in studies where certain groups are more closely observed than others, resulting in a higher detection rate of conditions in these groups regardless of actual prevalence.
For instance, individuals in a clinical trial may receive more rigorous testing and follow-up compared to the general population, thus appearing to have higher rates of certain conditions or side effects.

Statistics such as the number of measurements per visit can be computed per group and plotted to reveal differences in monitoring intensity.
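The sketch below simulates two groups with the same true prevalence but different monitoring intensity. The per-test detection probability of 0.5 and the group names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000  # patients per group

# Synthetic setup: identical true prevalence, different monitoring intensity.
df = pd.DataFrame(
    {
        "group": np.repeat(["routine", "intensive"], n),
        "n_tests": np.repeat([1, 4], n),
        "has_condition": rng.random(2 * n) < 0.20,
    }
)

# Assumption: each test independently detects an existing condition with
# probability 0.5, so more tests mean a higher chance of detection.
p_detect = 1 - 0.5 ** df["n_tests"].to_numpy()
df["detected"] = df["has_condition"] & (rng.random(2 * n) < p_detect)

summary = df.groupby("group").agg(
    tests_per_patient=("n_tests", "mean"),
    true_prevalence=("has_condition", "mean"),
    observed_prevalence=("detected", "mean"),
)
print(summary.round(3))
```

In real data the true prevalence is unknown, so the practical check is whether observed prevalence tracks the number of tests per patient across groups.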

## Attrition bias

Attrition bias emerges when there is a systematic difference between participants who continue to be followed up within the healthcare system and those who are lost to follow-up or withdraw from the system.
This can lead to skewed outcomes or distorted associations in longitudinal studies, as the data may no longer be representative of the original population.
For example, if patients with more severe conditions are more likely to remain engaged in the healthcare system for ongoing treatment, studies may overestimate the prevalence of these conditions and their associated healthcare outcomes.

To detect attrition bias, compute statistics on which patients are lost to follow-up and plot them.
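One simple check is to compare baseline characteristics of retained patients with those of patients lost to follow-up. The cohort below is invented, and treating "no follow-up visit" as lost to follow-up is an assumption that would need to be adapted to the study design.

```python
import pandas as pd

# Hypothetical cohort with a baseline severity score and follow-up counts.
cohort = pd.DataFrame(
    {
        "patient_id": range(1, 11),
        "baseline_severity": [1, 2, 2, 3, 3, 6, 7, 7, 8, 9],
        "n_followup_visits": [0, 0, 1, 0, 0, 3, 4, 2, 5, 3],
    }
)

# Assumption: a patient without any follow-up visit counts as lost.
cohort["status"] = (cohort["n_followup_visits"] > 0).map(
    {True: "retained", False: "lost"}
)

attrition_rate = (cohort["status"] == "lost").mean()
baseline_by_status = cohort.groupby("status")["baseline_severity"].agg(
    ["count", "mean", "median"]
)
print(f"Attrition rate: {attrition_rate:.0%}")
print(baseline_by_status)
```

A large gap in baseline severity between the two groups indicates that the retained patients no longer represent the original cohort.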

## Algorithmic bias

Algorithmic bias occurs when algorithms systematically favor certain groups over others, often due to biases inherent in the data used to train these algorithms.
This can result in unequal treatment recommendations, risk predictions, or health outcomes assessments across different demographics, such as race, gender, or socioeconomic status.
For instance, an algorithm trained predominantly on data from one ethnic group may perform poorly or inaccurately predict outcomes for individuals from underrepresented groups, exacerbating disparities in healthcare access and outcomes.

Useful tools here include fairness assessments with Fairlearn, uncertainty estimates, and feature importance analyses.
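A first step is to evaluate a model separately for each group. Fairlearn offers this kind of disaggregated evaluation (for example through its `MetricFrame` class); the sketch below uses plain pandas on invented predictions to stay dependency-free. The `group_metrics` helper is hypothetical.

```python
import pandas as pd

# Hypothetical model output: sensitive attribute, true label, predicted label.
results = pd.DataFrame(
    {
        "group": ["A"] * 8 + ["B"] * 8,
        "y_true": [1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0],
        "y_pred": [1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    }
)


def group_metrics(g: pd.DataFrame) -> pd.Series:
    """Accuracy, selection rate, and true positive rate for one group."""
    positives = g[g["y_true"] == 1]
    return pd.Series(
        {
            "accuracy": (g["y_true"] == g["y_pred"]).mean(),
            "selection_rate": g["y_pred"].mean(),
            "true_positive_rate": positives["y_pred"].mean(),
        }
    )


by_group = pd.DataFrame(
    {name: group_metrics(g) for name, g in results.groupby("group")}
).T
tpr_gap = by_group["true_positive_rate"].max() - by_group["true_positive_rate"].min()
print(by_group)
print(f"True positive rate gap: {tpr_gap:.2f}")
```

Here the model misses most positive cases in group `B`, which an overall accuracy figure alone would not reveal.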

## Confounding bias

Confounding bias arises when an outside variable, not accounted for in the analysis, influences both the exposure of interest and the outcome, leading to a spurious association between them.
This can distort the true effect of the exposure on the outcome, either exaggerating or underestimating it.
For example, if a study investigating the effect of a medication on disease progression fails to account for the severity of illness at baseline, any observed effect might be due more to the initial health status of the patient than to the medication itself.

Consulting a domain expert helps here: try to gather as much information about the data collection process as possible.
Alternatively, try to find correlations between variables that may point to confounders; the exact approach is still to be determined.
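When a suspected confounder has been recorded, comparing the crude effect with the effect within strata of that variable is a simple check. The counts below are invented so that severity drives both treatment assignment and outcome, mirroring the medication example above.

```python
import pandas as pd

# Hypothetical aggregated counts: severe patients are treated more often
# and also progress more often, regardless of treatment.
counts = pd.DataFrame(
    {
        "severity": ["mild", "mild", "severe", "severe"],
        "arm": ["treated", "untreated", "treated", "untreated"],
        "n_patients": [20, 80, 80, 20],
        "n_progressed": [2, 8, 40, 10],
    }
)

# Crude comparison: ignores severity entirely.
crude = counts.groupby("arm")[["n_patients", "n_progressed"]].sum()
crude["risk"] = crude["n_progressed"] / crude["n_patients"]
print(crude)

# Stratified comparison: risk per arm within each severity level.
stratified = counts.assign(
    risk=counts["n_progressed"] / counts["n_patients"]
).pivot(index="severity", columns="arm", values="risk")
print(stratified)
```

The crude risks differ (0.42 versus 0.18), yet within each severity stratum the risks are identical, so the apparent treatment effect is produced entirely by severity.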

## Missing data bias

Missing data bias refers to the distortion of analysis results caused by absent data points. Missingness is commonly classified as missing completely at random (MCAR), missing at random (MAR), or not missing at random (NMAR).
Each type affects subsequent analyses differently. MCAR has the least impact because the missingness is unrelated to any study variable or outcome. Under MAR, missingness depends on observed variables and introduces bias unless those variables are accounted for, whereas under NMAR it depends on the unobserved values themselves, so the missingness is itself informative.
For instance, if patients with more severe symptoms are less likely to have complete records (NMAR), analyses may underestimate the severity and impact of certain conditions.

Little's test can be applied to data for which the type of missingness is known.
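The sketch below is not Little's test. It is a simpler diagnostic on synthetic data where the missingness mechanism is known by construction: it checks whether an observed covariate (age) differs between rows with and without a missing lab value. All variable names and probabilities are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
age = rng.normal(60, 10, size=n)
lab = rng.normal(100, 15, size=n)

# MCAR: every lab value has the same 30% chance of being missing.
mcar_mask = rng.random(n) < 0.30
# MAR: older patients are more likely to have a missing lab value.
mar_mask = rng.random(n) < np.where(age > 65, 0.60, 0.15)

df = pd.DataFrame(
    {
        "age": age,
        "lab_mcar": np.where(mcar_mask, np.nan, lab),
        "lab_mar": np.where(mar_mask, np.nan, lab),
    }
)


def age_gap(column: str) -> float:
    """Mean age where `column` is missing minus mean age where it is observed."""
    missing = df[column].isna()
    return df.loc[missing, "age"].mean() - df.loc[~missing, "age"].mean()


gaps = {col: age_gap(col) for col in ["lab_mcar", "lab_mar"]}
print(gaps)
```

A gap near zero is compatible with MCAR, while a large gap shows that missingness depends on an observed variable. NMAR cannot be detected this way, because it depends on values that were never recorded.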

## Imputation bias

Imputation bias, a more specific form of algorithmic bias, occurs when the process used to estimate and fill in missing values introduces systematic differences between the imputed values and the true values.
This can happen if the imputation method does not accurately reflect the underlying data distribution or the reasons for missingness, leading to skewed analyses and incorrect conclusions.
For example, mean imputation artificially reduces variability, and in a skewed distribution the mean is not a typical value, so the imputed dataset can lead to misleading inferences about population characteristics or treatment effects.

KNN imputation may introduce biases in specific setups, which is worth demonstrating. Open questions include whether there are other ways to measure imputation uncertainty, which calls for a look into the literature and potentially new implementations (see also https://github.com/theislab/ehrapy/issues/652), and how time-aware imputation fits in.
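The effect of mean imputation on variability can be checked directly on synthetic data. The sketch below uses a skewed, made-up variable and removes values completely at random; the same before-and-after comparison could be repeated for KNN or other imputation methods.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Skewed synthetic variable, e.g. length of stay in days.
true_values = pd.Series(rng.lognormal(mean=1.0, sigma=0.8, size=2000))

# Remove 40% of the values completely at random.
observed = true_values.mask(rng.random(true_values.size) < 0.40)

# Mean imputation: every missing entry receives the observed mean.
imputed = observed.fillna(observed.mean())

summary = pd.DataFrame(
    {
        "true": true_values.describe(),
        "observed": observed.describe(),
        "mean_imputed": imputed.describe(),
    }
).loc[["count", "mean", "std", "50%"]]
print(summary.round(2))
```

The mean is preserved, but the standard deviation of the imputed column is necessarily smaller than that of the observed values, because every imputed point sits exactly at the mean.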

## Filtering bias

Filtering bias emerges when the criteria used to include or exclude records in the analysis are not clearly defined or are applied inconsistently, potentially leading to a non-representative sample of the original population.
This bias can obscure the true relationships between variables by systematically removing certain patient groups or information based on arbitrary or non-transparent criteria.
For instance, if multiple filtering steps are conducted to clean the data or select specific cohorts without adequately documenting the reasons or thresholds for these decisions, it may be difficult to replicate the study or assess the validity of its findings, thus compromising the reliability of the conclusions drawn.

Tools such as tableone, which summarize cohort characteristics, help to document the effect of each filtering step.
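A lightweight way to make filtering transparent is to log the cohort size and a few key characteristics after every step. The cohort, the filter thresholds, and the tracked statistics below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical cohort.
cohort = pd.DataFrame(
    {
        "age": [25, 34, 47, 52, 61, 68, 73, 79, 85, 90],
        "sex": ["F", "M", "F", "M", "F", "M", "F", "F", "F", "F"],
        "n_labs": [0, 3, 5, 1, 4, 6, 2, 0, 1, 0],
    }
)

# Each filtering step is named so that it can be reported later.
steps = {
    "all patients": lambda d: d,
    "age < 80": lambda d: d[d["age"] < 80],
    "at least 2 labs": lambda d: d[d["n_labs"] >= 2],
}

log = []
current = cohort
for name, step in steps.items():
    current = step(current)
    log.append(
        {
            "step": name,
            "n": len(current),
            "share_female": (current["sex"] == "F").mean(),
            "median_age": current["age"].median(),
        }
    )

flow = pd.DataFrame(log).set_index("step")
print(flow)
```

In this example the cohort shrinks from 10 to 5 patients and the share of women drops from 0.7 to 0.6, a shift that would go unnoticed without such a log.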