To figure out:
1. Find a toy problem that this method solves: one where a standard model does poorly on biased data, or where accuracy falls sharply as the number of samples decreases (clinical trials?)
2. Find a way to theoretically describe how well it should do to overcome the above bias
3. Find a way to generalize to categorical data
4. Apply it to deep learning 
5. Find a more mathematical way of approximating the pdf, rather than just clustering
6. Read similar papers to know how to format it
7. Doping for this paper or the next?




Label the unlabeled data by KNN, then train on the relabeled data.
Sample red points, label them by KNN, then train on them.
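
The relabeling idea in the two lines above could be sketched as follows. This is a minimal pure-Python illustration, not a settled design: the function names `knn_label` and `pseudo_label`, the choice of Euclidean distance, `k=3`, and the toy points are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_label(labeled, query, k=3):
    """Return the majority label among the k labeled points nearest to query.

    labeled: list of (point, label) pairs, where point is a tuple of floats.
    """
    # Sort labeled points by Euclidean distance to the query and keep the k closest.
    neighbors = sorted(labeled, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def pseudo_label(labeled, unlabeled, k=3):
    """Attach a KNN-derived label to every unlabeled point."""
    return [(x, knn_label(labeled, x, k)) for x in unlabeled]

# Toy data (hypothetical): two well-separated groups of labeled points.
labeled = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b"),
]
unlabeled = [(0.05, 0.1), (1.05, 0.95)]

relabeled = pseudo_label(labeled, unlabeled, k=3)
# The relabeled pairs can then be passed to any supervised learner.
```

Whether training on KNN-assigned labels actually helps is exactly the open question in these notes; the sketch only shows the mechanics.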

Talk about a pathological example of how different the red and green data can be.


1. 10 hour literature search
    SMOTE
    Xiao-Li Meng - data quality matters far more than model value


### Research Paper Outline: Optimizing Ensemble Models with Unlabeled Data for Bias Mitigation

**Title**: "Adaptive Weighting of Ensemble Classifiers Using Unlabeled Data for Bias Correction"

#### Abstract
- Briefly introduce the problem of bias in machine learning models due to unrepresentative training data.
- Outline the proposed solution: optimizing ensemble model weights using unlabeled data.
- Summarize the methodology, key findings, and implications.

#### Introduction
- **Background and Motivation**:
  - Discuss the challenges of biased training data in machine learning.
  - Emphasize the importance of ensemble models in improving prediction accuracy.
- **Problem Statement**:
  - Define the specific problem addressed by the paper (bias in ensemble models).
- **Objectives**:
  - Detail the goals of the research (optimizing ensemble weights using unlabeled data).
- **Contributions**:
  - Highlight the novel contributions of the paper.

#### Literature Review
- **Ensemble Learning**:
  - Review existing literature on ensemble methods and their strengths.
- **Bias in Machine Learning**:
  - Discuss research on bias in training data and its impact.
- **Unsupervised and Semi-Supervised Learning**:
  - Summarize key developments in using unlabeled data for model training.
- **Weight Optimization in Ensembles**:
  - Review existing methods and identify gaps your research addresses.

#### Methodology
- **Model Framework**:
  - Describe the ensemble model structure and component classifiers.
- **Data Description**:
  - Detail the characteristics of both the original and new unlabeled data sets.
- **Optimization Process**:
  - Explain the mathematical formulation of the weight optimization.
  - Discuss the choice of optimization algorithms and their justification.
- **Regularization and Constraints**:
  - Detail the regularization techniques used and their purpose.

#### Experimentation and Results
- **Experimental Setup**:
  - Describe the experimental design, including data splits and baseline models for comparison.
- **Implementation Details**:
  - Provide specifics on the implementation, including software and hardware used.
- **Results**:
  - Present the results of the optimization process.
  - Compare performance with baseline models.
- **Discussion**:
  - Analyze and interpret the results.
  - Discuss the effectiveness of the method in reducing bias.

#### Validation and Robustness Checks
- **Cross-Validation**:
  - Detail the cross-validation process used on the unlabeled data.
- **Sensitivity Analysis**:
  - Discuss the robustness of the model to changes in data and parameters.

#### Implications and Applications
- **Practical Implications**:
  - Discuss how this method can be applied in real-world scenarios.
- **Theoretical Contributions**:
  - Highlight how the findings contribute to the existing body of knowledge.

#### Limitations and Future Work
- **Limitations**:
  - Acknowledge any limitations or potential biases in your research.
- **Future Research Directions**:
  - Suggest areas for further exploration and improvement.

#### Conclusion
- Summarize the key findings and their significance.
- Reinforce the importance of the research in the context of bias mitigation in machine learning.

#### References
- Cite all sources used in the paper following the chosen formatting style.

#### Appendices (if necessary)
- Include additional data, mathematical derivations, or experimental details.

---

### Optimization Problem Formulation

The objective is to minimize the negative log-likelihood of the observed data under a weighted combination of models:

$$
\text{minimize}_{\mathbf{w}} \quad -\sum_{i=1}^{n} \ln(\mathbf{w}^T P(\mathbf{x}_i))
$$

Subject to the constraints that the weights sum to 1 and are strictly positive:

$$
\text{subject to} \quad \sum_{k=1}^{K} w_k = 1, \quad w_k > 0 \;\; \text{for all } k
$$

Where:
- $\mathbf{w} = [w_1, w_2, \ldots, w_K]^T$ are the weights of the models.
- $P(\mathbf{x}_i) = [p_1(\mathbf{x}_i | m_1), p_2(\mathbf{x}_i | m_2), \ldots, p_K(\mathbf{x}_i | m_K)]^T$ is the vector of probabilities that each model $m_k$ assigns to the observation $\mathbf{x}_i$.
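
One way to solve this problem numerically is the standard EM update for mixture weights with fixed components, which keeps the weights on the simplex at every step. The sketch below is an assumption about how the optimization could be done, not a choice the notes have made; `fit_weights`, `neg_log_likelihood`, the iteration count, and the toy matrix `P` (row $i$ holds $P(\mathbf{x}_i)$) are hypothetical.

```python
import math

def neg_log_likelihood(w, P):
    """Objective: -sum_i ln(w^T P(x_i)), with P given as a list of rows."""
    return -sum(math.log(sum(wk * pk for wk, pk in zip(w, row))) for row in P)

def fit_weights(P, iters=200):
    """EM iterations for mixture weights when the component models are fixed.

    Update: w_k <- (1/n) * sum_i w_k * P[i][k] / (w^T P[i]).
    Assumes every row has a positive mixture probability.
    """
    n, K = len(P), len(P[0])
    w = [1.0 / K] * K  # start from uniform weights
    for _ in range(iters):
        new_w = [0.0] * K
        for row in P:
            mix = sum(wk * pk for wk, pk in zip(w, row))
            for k in range(K):
                # Responsibility of model k for this observation.
                new_w[k] += w[k] * row[k] / mix
        w = [v / n for v in new_w]
    return w

# Toy example (hypothetical numbers): 4 observations, 2 models.
P = [
    [0.9, 0.1],
    [0.8, 0.3],
    [0.7, 0.2],
    [0.2, 0.6],
]
w = fit_weights(P)
```

Each EM step does not increase the objective, so the fitted weights are at least as good as the uniform starting point. Note that the iterates can approach the boundary of the simplex, so the strict constraint $w_k > 0$ may only hold in the limit for some data.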

### Probability Distribution for Each Model

Each model $m_k$ assigns a probability to an observation according to the condition $l_{kj}$ that the observation satisfies:

$$
p_k(\mathbf{x}_i | m_k) = 
\begin{cases} 
p_{k1} & \text{if } \mathbf{x}_i \in l_{k1} \\
\vdots \\
p_{kh} & \text{if } \mathbf{x}_i \in l_{kh}
\end{cases}
$$

Where:
- $p_{kj}$ are the probabilities corresponding to the different conditions $l_{kj}$ for model $m_k$.
- $\sum_{j=1}^{h} p_{kj} = 1$ for each model $m_k$, ensuring a valid probability distribution.
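
To make the piecewise definition concrete, here is a minimal one-dimensional sketch in which the conditions $l_{kj}$ are taken to be half-open intervals. That choice, the class name `PiecewiseModel`, the helper `prob_vector`, and the numbers are assumptions for illustration; the notes do not fix how the regions are built (for example, by clustering).

```python
from bisect import bisect_right

class PiecewiseModel:
    """Piecewise-constant model: region j is the interval [edges[j], edges[j+1])."""

    def __init__(self, edges, probs):
        if len(probs) != len(edges) - 1:
            raise ValueError("need exactly one probability per region")
        if abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError("region probabilities must sum to 1")
        self.edges = list(edges)
        self.probs = list(probs)

    def prob(self, x):
        """Return p_kj for the region containing x, or 0.0 outside all regions."""
        if x < self.edges[0] or x >= self.edges[-1]:
            return 0.0
        j = bisect_right(self.edges, x) - 1
        return self.probs[j]

# Two hypothetical models over the range [0, 3).
m1 = PiecewiseModel([0.0, 1.0, 2.0, 3.0], [0.5, 0.3, 0.2])
m2 = PiecewiseModel([0.0, 1.5, 3.0], [0.4, 0.6])

def prob_vector(x, models=(m1, m2)):
    """Build P(x): one probability per model."""
    return [m.prob(x) for m in models]
```

An observation that falls outside every model's regions gets probability 0 from all of them, which makes the log-likelihood undefined, so in practice the regions would need to cover the data or the probabilities would need smoothing.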


### Proving the Objective Function Is Convex
$$
-\sum_{i=1}^{n} \ln(\mathbf{w}^T P(\mathbf{x}_i))
$$


1. **Concavity of the Logarithm Function**:
   - The natural logarithm function, $\ln(x)$, is concave for $x > 0$, because its second derivative is negative: $\frac{d^2}{dx^2}\ln(x) = -\frac{1}{x^2} < 0$ for $x > 0$.

2. **Negative of a Concave Function is Convex**:
   - Since $\ln(x)$ is concave, $-\ln(x)$ is convex on $x > 0$.

3. **The Argument Is Affine in the Weights**:
   - For fixed data, $P(\mathbf{x}_i)$ is a constant vector, so $\mathbf{w}^T P(\mathbf{x}_i)$ is a linear (and therefore affine) function of the optimization variable $\mathbf{w}$.

4. **Composition of a Convex Function with an Affine Map**:
   - If $g$ is convex and $A$ is an affine map, then $g(A(\mathbf{w}))$ is convex in $\mathbf{w}$. No monotonicity condition on $g$ is needed; monotonicity only matters when the inner function is merely convex or concave rather than affine. Here $g(x) = -\ln(x)$ and $A(\mathbf{w}) = \mathbf{w}^T P(\mathbf{x}_i)$, so each term $-\ln(\mathbf{w}^T P(\mathbf{x}_i))$ is convex in $\mathbf{w}$.

5. **Domain**:
   - Each term is defined only where $\mathbf{w}^T P(\mathbf{x}_i) > 0$. On the feasible set ($w_k > 0$, $\sum_k w_k = 1$) this holds whenever at least one model assigns positive probability to $\mathbf{x}_i$. The feasible set is itself convex.

6. **Sum of Convex Functions**:
   - The sum of convex functions is convex. Therefore, summing $-\ln(\mathbf{w}^T P(\mathbf{x}_i))$ over $i$ preserves convexity.

In summary, the objective function $-\sum_{i=1}^{n} \ln(\mathbf{w}^T P(\mathbf{x}_i))$ is convex in $\mathbf{w}$ wherever each $\mathbf{w}^T P(\mathbf{x}_i)$ is positive. The proof rests on three facts: $-\ln$ is convex on the positive reals, composing a convex function with an affine map preserves convexity, and sums of convex functions are convex.
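
As a cross-check on the convexity argument, the Hessian can be written out directly. This is a sketch under the assumption that $\mathbf{w}^T P(\mathbf{x}_i) > 0$ for every $i$; the shorthand $\mathbf{p}_i = P(\mathbf{x}_i)$ and the symbol $f$ are introduced here for brevity and are not notation used elsewhere in these notes.

$$
f(\mathbf{w}) = -\sum_{i=1}^{n} \ln(\mathbf{w}^T \mathbf{p}_i), \qquad
\nabla f(\mathbf{w}) = -\sum_{i=1}^{n} \frac{\mathbf{p}_i}{\mathbf{w}^T \mathbf{p}_i}, \qquad
\nabla^2 f(\mathbf{w}) = \sum_{i=1}^{n} \frac{\mathbf{p}_i \mathbf{p}_i^T}{(\mathbf{w}^T \mathbf{p}_i)^2}
$$

For any direction $\mathbf{v}$:

$$
\mathbf{v}^T \nabla^2 f(\mathbf{w}) \, \mathbf{v} = \sum_{i=1}^{n} \frac{(\mathbf{v}^T \mathbf{p}_i)^2}{(\mathbf{w}^T \mathbf{p}_i)^2} \geq 0
$$

The Hessian is therefore positive semidefinite, which is the second-order condition for convexity.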

All Papers:
1. Original introduction of idea
2. Doping/Sample Selection (using the ideas of priors and a Bayesian approach to selecting the data to dope your model with)
3. Application of idea to Neural Nets
4. (possible) Potential complexity speedups of the process
5. (possible) Bayesian alternative to updating the weights

\begin{align*}
\text{Let } & f_x \text{ and } f_y \text{ be the prior distributions of } X \text{ and } Y \text{ respectively.} \\
& \tilde{f}_x \text{ and } \tilde{f}_y \text{ denote the posterior distributions of } X \text{ and } Y. \\
& D = \{z_1, \ldots, z_n\} \text{ is the observed data.} \\
& K_x \text{ and } K_y \text{ are normalization constants.} \\
\text{Given:} & \, \tilde{f}_y \text{ is known.} \\
\text{The posterior of } X \text{ is given by:} \\
\tilde{f}_x & = \frac{P(D|X,\tilde{Y})P(X|\tilde{Y})}{K_x} \\
& = \frac{\left[\prod_{i=1}^{n} P(z_i|X,\tilde{Y})\right] P(X|\tilde{Y})}{K_x} \\
\text{Assuming } X + \tilde{Y} & \text{ is distributed according to the convolution } f_x * \tilde{f}_y, \\
\tilde{f}_x & = \frac{\left[\prod_{i=1}^{n} f_{x+\tilde{y}}(z_i)\right] f_x(x)}{K_x} \\
\text{Similarly, for } Y\text{, we have:} \\
\tilde{f}_y & = \frac{\left[\prod_{i=1}^{n} f_{\tilde{x}+y}(z_i)\right] f_y(y)}{K_y}
\end{align*}
