<div style="background-color:white;" >
<div style="clear: both; display: table;">
  <div style="float: left; width: 14%; padding: 5px; height:auto">
    <img src="img/TUBraunschweig_CO_200vH_300dpi.jpg" alt="TU_Braunschweig" style="width:100%">
  </div>
  <div style="float: left; width: 28%; padding: 5px; height:auto">
    <img src="img/TU_Clausthal_Logo.png" alt="TU_Clausthal" style="width:100%">
  </div>
  <div style="float: left; width: 25%; padding: 5px; height:auto">
    <img src="img/ostfalia.jpg" alt="Ostfalia" style="width:100%">
  </div>
  <div style="float: left; width: 21%; padding: 5px;">
    <img src="img/niedersachsen_rgb_whitebg.png" alt="Niedersachsen" style="width:100%">
  </div>
  <div style="float: left; width: 9%; padding: 5px;">
    <img src="img/internet_BMBF_gefoerdert_2017_en.jpg" alt="bmbf" style="width:100%">
  </div>
</div>
<div style="text-align:center">
<img src="img/ki4all.jpg" alt="KI4ALL-Logo" width="200"/>
</div>
</div>

# Microcredit Artificial Data Generator
**Authors:** Sigrun May, Johann Katron, Daria Kober  
**Date:** August 2023  
**Version:** 4  
**Credits:** 0.25 ECTS  
**License:** [MIT License](https://opensource.org/licenses/MIT)  
**Developed by:** [TU Braunschweig](https://www.tu-braunschweig.de/), [Ostfalia Hochschule](https://landing.ostfalia.de/) and [TU Clausthal](https://www.tu-clausthal.de/)  
**Sponsored by:** [Bundesministerium für Bildung und Forschung](https://www.bmbf.de/bmbf/de/home/home_node.html)  

## Target audience
Students, professionals and the general public who are interested in obtaining a ... (**TODO**)

## Prerequisites
The reader should have a foundational understanding of basic mathematical concepts. Familiarity with the basics of Python (see the microcredit [Python Introduction](https://git.rz.tu-bs.de/ifn-public/ki4all/python-introduction)) and with Python libraries such as NumPy and Pandas for data analysis is essential, as is an introductory knowledge of machine learning principles (as outlined in the microcredit [Machine Learning Introduction](https://git.rz.tu-bs.de/ifn-public/ki4all/machine-learning-introduction)). The reader should understand the composition and significance of training data, including classification, features (which correspond to data columns), samples (which correspond to data rows), and labels (which form the vector indicating the category of each sample), as well as statistical distributions such as the normal and log-normal distributions. Knowledge of effect size and clustering techniques is also important for data analysis and interpretation. We recommend consulting **[THIS LITERATURE]** for comprehensive coverage of these topics.

Moreover, an understanding of feature engineering, feature selection, and feature extraction, and of why these processes matter, helps in optimizing machine learning models. A basic knowledge of biomarkers and of the application areas of artificial data, together with proficiency in handling data formats such as CSV, is also beneficial.


## Learning goals
After reading this document, the reader should be able to:
- Define synthetic (artificial) biomedical data
- Explain why synthetic data is important in biomedical research
- Understand the structure of synthetic data
- Generate a simple synthetic dataset
- Visualize and explore the generated data


## A note to the reader
Explaining every machine learning term and concept at full length is beyond the scope of this document. Most of them are, however, explained briefly. For a more detailed understanding, the reader is referred to textbooks or other material.

# Lesson 1: Introduction to Artificial Biomedical Data

## What is synthetic biomedical data?

**Synthetic biomedical data** is artificially generated data that reproduces the statistical structure of real-world biomedical datasets without containing any actual patient information.

In biomedical research, datasets may include laboratory test results, genetic sequences, imaging-derived measurements, or other biological signals collected from many individuals. **High-throughput data** refers to datasets where large numbers of measurements are obtained in parallel, typically using automated lab technologies — for example, simultaneously measuring thousands of gene expression levels for each sample. Because each sample can have thousands or even tens of thousands of measured variables, these datasets are called **high-dimensional**.

Within such data, a **biomarker** is a measurable variable (e.g., a gene expression level, protein concentration, or imaging-derived metric) that is associated with a biological condition, disease state, or outcome of interest. Identifying relevant biomarkers is a central task in many biomedical studies.

When generating synthetic biomedical data, you define:

- **Features**: The simulated variables (e.g., “gene_1_expression”, “protein_5_concentration”).
- **Class labels**: The outcome categories, such as *healthy* vs. *diseased*.
- **Feature importance**: A numerical indication of how much each feature contributes to predicting the class label. In synthetic data, you can assign exact importance values.
- **Feature dependencies**: Correlations or functional relationships between features, which can be specified explicitly.
- **Noise characteristics**: Random variability introduced to mimic measurement error or natural biological variation.

Because all these parameters are defined during generation, the “ground truth” is fully known. This makes synthetic data ideal for testing algorithms, teaching concepts, and benchmarking pipelines.
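
The following is a minimal sketch of how such a dataset could be generated with NumPy and pandas. It is not the generator developed in this microcredit: the function name `make_synthetic_dataset`, its parameters, and the column names are assumptions made for illustration. Informative features are shifted by a fixed amount in the diseased class, all other features are pure noise, and feature dependencies are left out to keep the example short.

```python
import numpy as np
import pandas as pd


def make_synthetic_dataset(n_samples=200, n_informative=3, n_noise=7,
                           class_shift=1.5, noise_sd=1.0, seed=0):
    """Return (data, labels, importance) for a two-class toy dataset.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)

    # Class labels: first half "healthy" (0), second half "diseased" (1).
    labels = np.repeat([0, 1], n_samples // 2)

    # Noise characteristics: every feature starts as normally distributed noise.
    n_features = n_informative + n_noise
    values = rng.normal(0.0, noise_sd, size=(labels.size, n_features))

    # Informative features: shifted by `class_shift` in the diseased class.
    values[labels == 1, :n_informative] += class_shift

    columns = ([f"biomarker_{i}" for i in range(n_informative)]
               + [f"noise_{i}" for i in range(n_noise)])
    data = pd.DataFrame(values, columns=columns)

    # Ground truth: the class shift assigned to each feature (0 = irrelevant).
    importance = pd.Series([class_shift] * n_informative + [0.0] * n_noise,
                           index=columns, name="true_shift")
    return data, pd.Series(labels, name="label"), importance


data, labels, importance = make_synthetic_dataset()
print(data.shape)
print(importance[importance > 0].index.tolist())
```

Because `importance` is returned together with the data, any later analysis can be compared directly with the values that were chosen during generation.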

---

## Why use synthetic biomedical data?

Real high-dimensional biomedical datasets are valuable but come with challenges:

1. **Incomplete ground truth**: In actual patient data, the true set of relevant biomarkers is often uncertain. Many features may appear associated with the outcome due to chance or confounding factors.
2. **Complex dependencies**: Features can be interdependent in subtle ways, making it hard to know whether a method is correctly identifying causal signals or just correlations.
3. **Privacy constraints**: Real patient data cannot always be freely shared due to privacy regulations (e.g., GDPR, HIPAA).

**Synthetic data** addresses these issues by allowing you to:

- Control exactly which features are relevant and their **feature importance**.
- Define **irrelevant features**, either:
  - As **pure noise** — random values with no link to the outcome, or
  - As part of a **pseudo-class** — a made-up group that behaves like a real class in some ways, but is unrelated to the true outcome. This helps test whether a model can ignore misleading signals.
- Simulate **random effects** — sources of variation unrelated to the main research question. In biomedical contexts, random effects might come from batch effects (measurements taken on different days), equipment calibration differences, or other uncontrolled conditions. Including them tests whether models can remain robust under such variation.
- Predefine class distributions, making it easy to simulate balanced datasets or introduce class imbalance intentionally.
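
As a rough illustration of the points above, the sketch below builds one relevant feature, one pseudo-class feature, and one pure-noise feature for an intentionally imbalanced dataset. All names (`true_class`, `pseudo_class`, `mean_gap`) and all numeric settings are assumptions chosen for the example, not values prescribed by this microcredit.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 2000

# Intentional class imbalance: roughly 20% of samples are "diseased" (1).
true_class = (rng.random(n_samples) < 0.2).astype(int)

# Pseudo-class: a second grouping drawn independently of the true class.
pseudo_class = rng.integers(0, 2, size=n_samples)

# A relevant feature follows the true class ...
relevant = rng.normal(0.0, 1.0, n_samples) + 2.0 * true_class
# ... a pseudo-class feature follows only the made-up grouping ...
misleading = rng.normal(0.0, 1.0, n_samples) + 2.0 * pseudo_class
# ... and a pure-noise feature follows nothing at all.
pure_noise = rng.normal(0.0, 1.0, n_samples)


def mean_gap(feature, groups):
    """Difference of group means (group 1 minus group 0)."""
    return feature[groups == 1].mean() - feature[groups == 0].mean()


print(f"relevant   vs. true class:   {mean_gap(relevant, true_class):.2f}")
print(f"misleading vs. true class:   {mean_gap(misleading, true_class):.2f}")
print(f"misleading vs. pseudo-class: {mean_gap(misleading, pseudo_class):.2f}")
print(f"pure noise vs. true class:   {mean_gap(pure_noise, true_class):.2f}")
```

The pseudo-class feature shows clear group structure, but that structure says nothing about the true outcome. A method that ranks it as important has been misled.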

Because the data is artificial, there are **no privacy concerns**, making it safe to publish, share in teaching environments, or include in open-source repositories.

Typical use cases include:

- **Method development**: Creating controlled datasets to test new feature selection or classification algorithms.
- **Benchmarking**: Comparing multiple methods on identical, reproducible data.
- **Educational demonstrations**: Teaching statistical and ML concepts without needing access to real patient data.
- **Pipeline testing**: Checking preprocessing steps (normalization, imputation, batch correction) against known expected results.
- **Simulating edge cases**: Generating data for rare diseases or extreme conditions to test model performance.
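
The pipeline-testing use case can be sketched as follows, assuming a very simple preprocessing step (column-wise mean imputation) stands in for a real pipeline. Values are hidden at random from a complete synthetic table, imputed, and then compared with the known originals. The variable names and the 10% missingness rate are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Complete synthetic table: the known "expected result".
complete = pd.DataFrame(rng.normal(0.0, 1.0, size=(1000, 5)),
                        columns=[f"feature_{i}" for i in range(5)])

# Hide about 10% of the values; the truth stays available in `complete`.
hidden = rng.random(complete.shape) < 0.1
observed = complete.mask(hidden)

# Preprocessing step under test: column-wise mean imputation.
imputed = observed.fillna(observed.mean())

# Compare the imputed entries with the known true values.
errors = (imputed - complete).to_numpy()[hidden]
rmse = float(np.sqrt(np.mean(errors ** 2)))
print(f"RMSE on hidden entries: {rmse:.2f}")
```

With real patient data the hidden values would be unknown, so this kind of direct error measurement is only possible on data whose complete version exists.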

In summary, synthetic biomedical data offers a **flexible, reproducible, and ethically safe** way to explore and evaluate methods in biomedical data science, while providing full control over the statistical properties and ground truth.



## What is synthetic biomedical data?

**Synthetic biomedical data** is data that is *artificially created* to imitate the statistical properties of real-world medical or biological data, without containing any actual patient information.

In medicine and life sciences, datasets can be extremely large and complex. They might include laboratory measurements, results from imaging tests, genetic sequences, or clinical observations — often collected from hundreds or thousands of people. Synthetic biomedical data reproduces the *structure* and *statistical properties* of these kinds of datasets, but is generated entirely by a computer program. This means it can be designed to match the scale, complexity, and variability of real data, while remaining completely fictional.

Creating synthetic biomedical data involves deciding in advance:

- **What variables (features) to include** – for example, simulated gene expression values, protein concentrations, or blood test results.
- **How those features are related** – such as correlations between certain biomarkers or patterns within specific patient groups.
- **How many categories (classes) exist** – for instance, “healthy” vs. “diseased” groups.
- **How much randomness to add** – to reflect natural variability or measurement noise.

Because the entire dataset is generated artificially, the *ground truth* is known — meaning the importance of each feature, the exact relationships among them, and the correct classification of each sample are all defined from the start. This makes synthetic biomedical data a powerful tool for testing, teaching, and developing new analytical methods.
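
One way to decide in advance how features are related is to draw them from a multivariate normal distribution with a chosen covariance matrix. The sketch below assumes three features with unit variance, of which the first two are correlated. The target correlation of 0.8 is an arbitrary value picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
target_corr = 0.8  # assumed correlation between two simulated biomarkers

# Covariance matrix for three unit-variance features:
# features 0 and 1 are correlated, feature 2 is independent of both.
cov = np.array([[1.0, target_corr, 0.0],
                [target_corr, 1.0, 0.0],
                [0.0, 0.0, 1.0]])

features = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=5000)

# The empirical correlations should be close to the predefined ones.
empirical = np.corrcoef(features, rowvar=False)
print(np.round(empirical, 2))
```

Because the covariance matrix is part of the generation step, the exact relationship between the features is known and can be used to check correlation-based analyses.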

---

## Why use synthetic biomedical data?

Modern biomedical research often relies on vast and complex datasets — for example, thousands of gene activity measurements or protein levels collected from patient samples. These are sometimes called **high-throughput datasets**, because they are produced in large quantities very quickly by advanced laboratory equipment. Within such datasets, scientists may look for **biomarkers**: measurable signals, such as the level of a certain molecule in the blood, that can help identify a disease, predict its course, or monitor treatment effects. Because these datasets often contain many thousands of measurements for each sample, they are referred to as **high-dimensional datasets**.

Working with real high-dimensional biological data is challenging for several reasons. First, in real patient datasets, it is rarely known with certainty which biomarkers are truly important for the question at hand. Second, the relationships between different measurements — what statisticians call *dependencies* — can be complex, and sometimes hidden. Without knowing the “correct” answers in advance, it is difficult to objectively evaluate how well a data analysis method is performing.

This is where **synthetic biomedical data** becomes valuable. Synthetic data is artificially generated to mimic certain aspects of real biological data, but with a crucial difference: the data creator controls every detail. In a synthetic dataset, the **feature importance** — meaning how strongly each measurement contributes to predicting the outcome — is predefined. The internal dependencies between features, as well as the overall distribution of different groups or categories (called *classes*), are also known. This means researchers can evaluate their methods against a “ground truth” where the right answer is already established.
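
Evaluating a method against a known ground truth can be sketched as follows. As a stand-in for a real feature selection method, the example ranks features by the absolute value of Cohen's d (an effect size) and checks how many of the truly informative features end up among the top three. The dataset size, the class shift of 1.5, and the choice of Cohen's d are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n_per_class, n_features = 200, 50
informative = {0, 1, 2}  # ground truth chosen by the data creator

healthy = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
diseased = rng.normal(0.0, 1.0, size=(n_per_class, n_features))
diseased[:, sorted(informative)] += 1.5  # predefined class shift

# Cohen's d per feature: mean difference divided by the pooled
# standard deviation (equal group sizes).
pooled_sd = np.sqrt((healthy.var(axis=0, ddof=1)
                     + diseased.var(axis=0, ddof=1)) / 2)
cohens_d = (diseased.mean(axis=0) - healthy.mean(axis=0)) / pooled_sd

# Toy "feature selection": keep the three largest absolute effect sizes.
selected = set(np.argsort(-np.abs(cohens_d))[:3].tolist())

# Recall: fraction of truly informative features that were selected.
recall = len(selected & informative) / len(informative)
print(f"selected features: {sorted(selected)}, recall: {recall:.2f}")
```

The same comparison works for any selection method that returns a set of feature indices, which makes different methods directly comparable on identical data.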

Synthetic datasets can also be designed with **irrelevant features** — measurements that have nothing to do with the outcome. These irrelevant features might be purely random (completely unconnected to anything else in the dataset) or belong to a **pseudo-class**, which is a made-up category that behaves somewhat like a real one but is included purely to challenge the method. Including such features makes it possible to test whether an analysis method can correctly ignore unimportant information.

Another advantage is that synthetic data can include **random effects**, meaning variations that happen by chance or due to external factors not related to the main research question. For example, measurements might differ slightly depending on which laboratory instrument was used or on what day the test was run. Adding such controlled randomness makes it possible to test how robust a method is to noise and other unpredictable factors.
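
A batch effect of this kind can be simulated by adding one random offset per batch to every measurement in that batch. The sketch below assumes three batches, a true class effect of 1.0, and batch-wise mean centering as a deliberately simple correction. All names and numbers are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
n_samples, n_batches = 600, 3

# Each sample belongs to a measurement batch (e.g. a day or an instrument).
batch = rng.integers(0, n_batches, size=n_samples)
# One random offset per batch: the random effect, unrelated to the class.
batch_offset = rng.normal(0.0, 2.0, size=n_batches)

label = rng.integers(0, 2, size=n_samples)
signal = 1.0 * label                           # true class effect
noise = rng.normal(0.0, 0.5, size=n_samples)   # measurement noise
measurement = signal + batch_offset[batch] + noise

df = pd.DataFrame({"batch": batch, "label": label,
                   "measurement": measurement})

# Simple correction: subtract each batch's mean from its measurements.
batch_mean = df.groupby("batch")["measurement"].transform("mean")
df["centered"] = df["measurement"] - batch_mean

print(df.groupby("batch")[["measurement", "centered"]].mean().round(2))
print(df.groupby("label")["centered"].mean().round(2))
```

Mean centering only works well here because both classes appear in every batch in similar proportions. If class and batch were confounded, the correction would also remove part of the true signal, which is exactly the kind of situation synthetic data allows one to test.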

Because synthetic biomedical data is fully artificial, it has no privacy concerns — no real patient information is included. This makes it safe to share publicly, use in teaching, and incorporate into collaborative projects without legal or ethical restrictions. Researchers can use it to:

- Develop new methods for selecting important features from complex datasets.
- Compare different analysis techniques under identical, reproducible conditions.
- Teach students how to work with biomedical data without accessing confidential records.
- Benchmark software tools, preprocessing pipelines, or statistical workflows.
- Simulate rare diseases or conditions that may be underrepresented in real datasets.
- Explore the effects of noise, irrelevant features, or confounding factors in a controlled way.

In short, synthetic biomedical data acts as a safe, flexible, and highly informative testing ground. It allows scientists, educators, and developers to experiment, learn, and improve analytical methods in situations where using real data would be difficult, risky, or impossible.



## Why use synthetic biomedical data?

1. **Benchmarking algorithms** — test feature selection, classification, and other ML methods.
2. **Educational purposes** — teach students without using real patient data.
3. **Prototyping and software testing** — develop pipelines before deploying on sensitive data.
4. **Privacy-preserving research** — share and analyze data without violating privacy laws.
5. **Data augmentation** — enhance existing datasets to improve model robustness.
6. **Cost-effective** — reduce costs associated with data collection and storage.
7. **Regulatory compliance** — meet legal requirements for data usage in research.
8. **Scalability** — easily generate large datasets for training complex models.
9. **Diversity** — create diverse datasets to avoid bias in machine learning models.
10. **Exploratory analysis** — allow researchers to explore hypothetical scenarios without ethical concerns.
11. **Data sharing** — facilitate collaboration between institutions without sharing sensitive data.
12. **Model validation** — validate models on synthetic data before applying them to real-world scenarios.
13. **Hypothesis testing** — test hypotheses in a controlled environment without real-world constraints.
14. **Data simulation** — simulate rare diseases or conditions that are difficult to study in real life.
15. **Training data generation** — generate training data for machine learning models when real data is scarce.
16. **Feature engineering** — create complex features that may not be present in real datasets.
17. **Data balancing** — balance class distributions in datasets to improve model performance.
18. **Scenario testing** — test models under various hypothetical scenarios to assess robustness.
19. **Development of new methods** — create controlled environments to develop and test new algorithms.

## Synthetic data for the development of new machine learning methods

To develop new feature selection methods or to compare existing ones, reference data with known dependencies and known importance of the individual features is needed. This data generator can be used to simulate biological data, for example artificial high-throughput data that includes artificial biomarkers. Because the true biomarkers and internal dependencies of high-dimensional biological datasets are usually not all known with certainty, artificial data makes it possible to know the expected outcome in advance. In synthetic data, the feature importances and the distribution of each class are known. Irrelevant features can be purely random or belong to a pseudo-class. Such data can be used, for example, to make random effects observable.
