# Detecting and dealing with bias in EHR data

PROBLEM STATEMENT + SOLUTION STATEMENT

Discuss that many biases are already determined at the data collection stage and that only some of them can be mitigated at the data analysis stage. Discuss how exploratory analysis can uncover biases, for example by showing a few UMAPs and bar plots or line plots of individual features:

1. Selection Bias: This occurs when the patients included in the EHR system are not representative of the general population due to the specific demographics of the healthcare system's patient base. For example, a hospital in a wealthy urban area may have a patient population that skews towards certain socioeconomic statuses, affecting the applicability of findings to broader populations.

Ideas:
Calculate summary statistics such as median income, compare them to the national figures, and plot the comparison.
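A minimal sketch of such a comparison, using made-up data rather than a real dataset. The `income_bracket` column and the reference shares are hypothetical; in practice the reference would come from census or registry statistics.

```python
import pandas as pd

# Hypothetical cohort: one row per patient with an income bracket.
cohort = pd.DataFrame(
    {"income_bracket": ["high"] * 60 + ["middle"] * 30 + ["low"] * 10}
)

# Hypothetical reference shares for the general population (made-up numbers).
reference = pd.Series({"high": 0.20, "middle": 0.50, "low": 0.30})

# Observed share of each bracket in the cohort.
observed = cohort["income_bracket"].value_counts(normalize=True)

# Align both series on the bracket name and compute the gap in shares.
comparison = pd.DataFrame({"cohort": observed, "reference": reference})
comparison["difference"] = comparison["cohort"] - comparison["reference"]
print(comparison.sort_values("difference", ascending=False))
```

Large positive or negative differences point to groups that are over- or underrepresented relative to the reference population. `comparison.plot.bar()` would give the corresponding bar plot.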

2. Information Bias: This arises from inaccuracies in the data recorded in the EHRs. It includes errors in diagnosis, treatment information, or outcome data, and can be due to misreporting, misunderstanding, or misclassification. Information bias can lead to incorrect conclusions about associations between variables.

Ideas:
Classical quality control. Show that one group may be more susceptible to data collection errors than another. For example, a minority group might be measured with a cheap, error-prone instrument, whereas wealthier patients are measured with a new, accurate device.
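One way to illustrate this is a per-group quality control summary on simulated data. Everything below is an assumption made for illustration: the two devices, their noise levels, and the plausibility range of 90 to 160 for systolic blood pressure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000  # patients per device

# Hypothetical true systolic blood pressure, same distribution in both groups.
true_new = rng.normal(120, 10, size=n)
true_old = rng.normal(120, 10, size=n)

df = pd.DataFrame(
    {
        "device": ["new"] * n + ["old"] * n,
        # The old device adds far more measurement noise than the new one.
        "sbp": np.concatenate(
            [
                true_new + rng.normal(0, 2, size=n),
                true_old + rng.normal(0, 15, size=n),
            ]
        ),
    }
)

# Flag values outside a (hypothetical) plausible range.
df["implausible"] = (df["sbp"] < 90) | (df["sbp"] > 160)

# Per-device summary: similar means, but very different spread and error rates.
summary = df.groupby("device").agg(
    mean_sbp=("sbp", "mean"),
    std_sbp=("sbp", "std"),
    share_implausible=("implausible", "mean"),
)
print(summary)
```

If device use correlates with a demographic group, the noisier measurements end up concentrated in that group, even though the underlying physiology is the same.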

3. Coding Bias: Related to information bias, coding bias occurs when there are inconsistencies in how health conditions and procedures are coded (e.g., ICD-10 codes). Different practitioners may use different codes for the same condition or procedure, leading to potential misinterpretations of the data.

Ideas:
Find overlapping ICD codes that can cause this issue. Maybe show that Mondo is useful but simplifies some concepts in ways that can introduce bias.

4. Surveillance Bias: This occurs when the likelihood of diagnosing a condition is influenced by the intensity of monitoring or screening. Patients with more frequent healthcare interactions are more likely to have conditions diagnosed than those with fewer interactions, which can skew analysis results.

Ideas:
Find a way to get statistics on this (e.g. several measurements per visit) and plot it. 
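A small simulation can make the effect visible before looking at real data. The sketch below assumes a constant true prevalence of 10% and a 20% chance of detecting the condition at each visit; both numbers and the column names are invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 20_000

# The true prevalence is identical for everyone (10%).
has_condition = rng.random(n) < 0.10

# Number of visits per patient (at least one).
n_visits = rng.poisson(3, size=n) + 1

# Each visit has a 20% chance of detecting an existing condition.
p_detect = 1 - (1 - 0.2) ** n_visits
diagnosed = has_condition & (rng.random(n) < p_detect)

df = pd.DataFrame({"n_visits": n_visits, "diagnosed": diagnosed})
df["visit_group"] = np.where(df["n_visits"] <= 2, "1-2 visits", "3+ visits")

# Recorded diagnosis rate per visit group.
rates = df.groupby("visit_group")["diagnosed"].mean()
print(rates)
```

The recorded diagnosis rate rises with the number of visits although the true prevalence is the same in both groups, which is the pattern to look for in real visit counts.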

5. Attrition Bias: In longitudinal studies using EHR data, attrition bias can occur if there is a systematic difference between those who continue to participate or are followed up in the system and those who are lost to follow-up. This can affect the validity of the findings.

Ideas:
Find a way to get statistics on this and plot it.
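A common check is to compare baseline characteristics of patients who were lost to follow-up with those who were retained, for instance via the standardized mean difference (SMD). The sketch below uses simulated data in which sicker patients are more likely to drop out; the `severity` score and the dropout model are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical baseline severity score (standardized).
severity = rng.normal(0.0, 1.0, size=n)

# Assumed dropout model: higher severity means higher chance of being lost.
p_lost = 1.0 / (1.0 + np.exp(-(severity - 1.0)))
lost = rng.random(n) < p_lost


def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd


smd = standardized_mean_difference(severity[lost], severity[~lost])
print(f"Lost to follow-up: {lost.mean():.1%}")
print(f"SMD of baseline severity (lost vs. retained): {smd:.2f}")
```

An SMD far from zero indicates that the retained patients are no longer representative of the baseline cohort with respect to that feature.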

6. Algorithmic Bias: When using machine learning models or other algorithmic processes for exploratory analysis, biases in the algorithms themselves or in the training data can lead to biased outcomes. This includes overrepresentation or underrepresentation of certain groups in the data used to train algorithms.

Ideas:
Fairlearn metrics, uncertainty estimates, feature importance.
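The core idea behind group fairness metrics is to break a metric down by a sensitive feature. The sketch below does this by hand with NumPy on toy labels and predictions; Fairlearn's `MetricFrame` offers a library version of this kind of per-group breakdown. The helper `per_group_metrics` and all data are hypothetical.

```python
import numpy as np

# Toy labels and predictions for two groups with identical true label rates.
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0])
group = np.array(["A"] * 6 + ["B"] * 6)


def per_group_metrics(y_true, y_pred, group):
    """Selection rate and true positive rate for each group."""
    results = {}
    for g in np.unique(group):
        mask = group == g
        positives = mask & (y_true == 1)
        results[str(g)] = {
            # Share of the group that receives a positive prediction.
            "selection_rate": float(y_pred[mask].mean()),
            # Share of true positives in the group that the model finds.
            "true_positive_rate": float(y_pred[positives].mean()),
        }
    return results


metrics = per_group_metrics(y_true, y_pred, group)
parity_gap = abs(metrics["A"]["selection_rate"] - metrics["B"]["selection_rate"])
print(metrics)
print(f"Demographic parity difference: {parity_gap:.2f}")
```

Here both groups have the same share of true positives, yet the model selects group A far more often and misses most positives in group B.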

7. Confounding Bias: This happens when the relationship between two variables is influenced by a third variable that is not accounted for in the analysis. EHR data is complex and multifaceted, making it challenging to control for all potential confounding variables.

Ideas:
Involve a domain expert and gather as much information about the data collection process as possible.
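To show why unadjusted comparisons can mislead, the following sketch simulates a treatment with no true effect whose assignment depends on age, while age also drives the outcome. The variable names and effect sizes are assumptions, and the adjustment is a plain least-squares regression rather than a recommended causal method.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

# Age is the confounder: it affects both treatment assignment and outcome.
age = rng.normal(60, 10, size=n)
treated = rng.random(n) < 1.0 / (1.0 + np.exp(-(age - 60) / 5))

# The outcome depends on age only; the treatment has no true effect.
risk_score = 0.05 * (age - 60) + rng.normal(0, 1, size=n)

# Crude comparison ignores age and shows a spurious treatment "effect".
crude = risk_score[treated].mean() - risk_score[~treated].mean()

# Adjusted estimate: regress the outcome on treatment and age.
X = np.column_stack([np.ones(n), treated.astype(float), age])
coef, *_ = np.linalg.lstsq(X, risk_score, rcond=None)
adjusted = coef[1]

print(f"Crude difference:    {crude:.2f}")
print(f"Adjusted difference: {adjusted:.2f}")
```

The adjustment only works because age was recorded. A confounder that is missing from the EHR cannot be corrected this way, which is where knowledge about the data collection becomes essential.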

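8. Missing Data Bias: Data can be missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR), and the mechanism is often unknown. An analysis that assumes the wrong mechanism can produce biased results.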

Ideas:
Run Little's test, which tests the MCAR assumption, on data where the missingness mechanism is known.
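Before applying a formal test, it can help to simulate the three mechanisms and see how each distorts a simple summary. This is not Little's test; it is a sketch on synthetic data in which a `lab` value depends on `age`, and all parameters are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

age = rng.normal(60, 10, size=n)
lab = 0.1 * age + rng.normal(0, 1, size=n)  # lab value correlates with age


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# True where the lab value would be missing under each mechanism.
masks = {
    "MCAR": rng.random(n) < 0.3,                      # independent of everything
    "MAR": rng.random(n) < sigmoid((age - 65) / 5),   # depends on observed age
    "MNAR": rng.random(n) < sigmoid(lab - 6.5),       # depends on the lab value itself
}

print(f"Mean of complete lab data: {lab.mean():.2f}")
observed_means = {}
for name, missing in masks.items():
    observed_means[name] = lab[~missing].mean()
    # Age gap between rows with and without a missing lab value.
    age_gap = age[missing].mean() - age[~missing].mean()
    print(
        f"{name}: observed lab mean = {observed_means[name]:.2f}, "
        f"age gap (missing vs. observed) = {age_gap:.2f}"
    )
```

Under MCAR the observed mean stays close to the complete-data mean, while MAR and MNAR shift it. A nonzero age gap only signals a departure from MCAR; it cannot separate MAR from MNAR, because here the lab value and age are correlated.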

9. Imputation Bias: Different imputation algorithms introduce different kinds of biases.

Ideas:
Maybe show how KNN imputation can introduce bias in specific setups. Are there other ways to measure uncertainty? We should review the literature and potentially implement some of the approaches. Maybe also relevant: https://github.com/theislab/ehrapy/issues/652. What about time-aware imputation?
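One setup where KNN imputation distorts the data is variance shrinkage: averaging over neighbors smooths out noise, so imputed values vary less than the true ones. The sketch below uses a deliberately simplified, hand-written nearest-neighbor imputer on a single predictor (the hypothetical helper `knn_impute`), not a library implementation such as scikit-learn's `KNNImputer`.

```python
import numpy as np


def knn_impute(x, y, k=5):
    """Fill NaNs in y with the mean y of the k observed rows closest in x."""
    y_filled = y.copy()
    observed = ~np.isnan(y)
    x_obs, y_obs = x[observed], y[observed]
    for i in np.flatnonzero(~observed):
        nearest = np.argsort(np.abs(x_obs - x[i]))[:k]
        y_filled[i] = y_obs[nearest].mean()
    return y_filled


rng = np.random.default_rng(5)
n = 2000

x = rng.normal(0, 1, size=n)
y_true = 0.5 * x + rng.normal(0, 1, size=n)

# Remove 30% of the y values completely at random.
missing = rng.random(n) < 0.3
y = y_true.copy()
y[missing] = np.nan

y_filled = knn_impute(x, y, k=5)

# Compare the spread of the imputed values with the spread of the hidden truth.
var_true = y_true[missing].var()
var_imputed = y_filled[missing].var()
print(f"Variance of true values at missing positions: {var_true:.2f}")
print(f"Variance of imputed values:                   {var_imputed:.2f}")
```

Downstream analyses that rely on the spread of a feature, such as outlier detection or correlation estimates, inherit this shrinkage.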

10. Filtering Bias: Analyses often involve many filtering steps, and afterwards it can be unclear why each one was applied. This is less an explicit bias in the data than something that arises during analysis.

Ideas:
Essentially everything that tableone and similar cohort summary tools provide.
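A lightweight way to keep filtering transparent is to log the cohort size and a few key characteristics after every step. The sketch below does this with plain pandas on a toy table; the columns, the filters, and the tracked statistic are hypothetical.

```python
import pandas as pd

# Toy cohort with hypothetical columns.
df = pd.DataFrame(
    {
        "age": [25, 40, 70, 85, 17, 60, 55, 90],
        "sex": ["F", "M", "F", "F", "M", "M", "F", "F"],
        "n_visits": [1, 3, 5, 2, 4, 1, 6, 2],
    }
)

# Each filter is a (description, condition) pair so the reason is recorded.
filters = [
    ("adults only", lambda d: d["age"] >= 18),
    ("at least 2 visits", lambda d: d["n_visits"] >= 2),
    ("age below 80", lambda d: d["age"] < 80),
]

current = df
log = [{"step": "start", "n": len(current), "share_female": (current["sex"] == "F").mean()}]
for description, condition in filters:
    current = current[condition(current)]
    log.append(
        {
            "step": description,
            "n": len(current),
            "share_female": (current["sex"] == "F").mean(),
        }
    )

# One row per filtering step: how many patients remain and who they are.
flow = pd.DataFrame(log)
print(flow)
```

Tracking a demographic share alongside the count shows not only how many patients each step removes but also whether it changes the composition of the cohort.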

In [1]:
import ehrapy as ep


In [3]:
adata = ep.dt.diabetes_130()
adata

2024-03-11 15:47:03,405 - root INFO - Transformed passed DataFrame into an AnnData object with n_obs x n_vars = `101766` x `51`.


AnnData object with n_obs × n_vars = 101766 × 51
    var: 'ehrapy_column_type'
    layers: 'original'