# Combining models

There are many methods to combine models. 

🗳️ **Voting**

Just count what each model predicts and take the most common prediction. This is simple to use but has its pitfalls:
- **Strong correlation**: Sometimes two models produce almost the same predictions; with voting you just double their influence.
- **Repeated mistakes**: Not every model is good on the same kind of data. If most of the models make the same mistake on a datapoint, that does not make them right.
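
As a minimal sketch of plain voting (not the code used in this notebook), the snippet below counts one prediction per model and returns the most common label. The function name `majority_vote` and the example labels are made up for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among one prediction per model."""
    # most_common(1) returns [(label, count)]; on a tie the label seen first wins.
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions for a single datapoint.
# model_a and model_b are near-copies, so together they outvote model_c.
votes = {"model_a": 1, "model_b": 1, "model_c": 2}
winner = majority_vote(votes.values())
print(winner)
```

Here two strongly correlated models decide the outcome on their own, which is exactly the first pitfall listed above.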

🗳️ + 🔢 **Voting with scores**

To counter these problems, we need models that differ in their predictions and that report a confidence score, so we know when to listen to each one.
In my model with UMAP + nearest neighbors, it was very rare for correct predictions to have ambiguity among their neighbors.
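
One way this could look, as a rough sketch under assumptions: take the share of nearest neighbors that agree on a label as the confidence score, then sum the scores per label instead of counting heads. The helpers `neighbor_confidence` and `weighted_vote` and all numbers below are hypothetical, not taken from the actual model.

```python
from collections import Counter, defaultdict


def neighbor_confidence(neighbor_labels):
    """Return (label, confidence), where confidence is the share of neighbors agreeing."""
    label, count = Counter(neighbor_labels).most_common(1)[0]
    return label, count / len(neighbor_labels)


def weighted_vote(predictions):
    """Pick the label with the highest summed confidence.

    predictions: iterable of (label, confidence) pairs, one per model.
    """
    totals = defaultdict(float)
    for label, confidence in predictions:
        totals[label] += confidence
    return max(totals, key=totals.get)


# Hypothetical example: two unsure models say 1, one confident model says 2.
predictions = [(1, 0.3), (1, 0.3), (2, 0.9)]
winner = weighted_vote(predictions)
print(winner)
```

With plain voting label 1 would win two to one; with scores the single confident model is heard. This only helps if the scores are comparable across models, which is an assumption of the sketch.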

```python
from pathlib import Path
from dataclasses import dataclass
import pandas as pd
```

## Correlation

Let's find a model that does NOT correlate strongly with my model. For this I am looking for notebooks that introduce their own solution rather than just taking existing submissions and combining them.

```python
@dataclass
class Submission:
    name: str  # name of the notebook folder the file came from
    path: Path
    predictions: pd.DataFrame


# Collect every submission.csv found under the Kaggle input directory.
all_submissions_paths = Path("/kaggle/input/").glob("**/submission.csv")

all_submissions = []
for path in all_submissions_paths:
    # The parent folder name identifies the source notebook.
    name = path.parent.name
    all_submissions.append(Submission(name=name, path=path, predictions=pd.read_csv(path)))

all_names = [s.name for s in all_submissions]
```

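```python
# Pairwise agreement: the fraction of rows where two submissions predict the same real_text_id.
# This assumes every submission lists its rows in the same order.
corr_data = pd.DataFrame(index=all_names, columns=all_names, dtype=float)
for x in all_submissions:
    for y in all_submissions:
        corr_data.loc[x.name, y.name] = (
            x.predictions.real_text_id == y.predictions.real_text_id
        ).mean()
corr_data.head()
```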

| | 0-87759-fake-or-real-bert-pca-randomforest | dimension-reduction-with-umap | extracting-features-with-spacy |
|---|---|---|---|
| 0-87759-fake-or-real-bert-pca-randomforest | 1.0 | 0.855805 | 0.808052 |
| dimension-reduction-with-umap | 0.855805 | 1.0 | 0.832397 |
| extracting-features-with-spacy | 0.808052 | 0.832397 | 1.0 |
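
To read the agreement values above programmatically, a small sketch could pick the submission that agrees least with a chosen reference model. It assumes that `dimension-reduction-with-umap` is the reference (my UMAP model); the frame is rebuilt here from the numbers shown so the snippet runs on its own, and the variable names are illustrative.

```python
import pandas as pd

names = [
    "0-87759-fake-or-real-bert-pca-randomforest",
    "dimension-reduction-with-umap",
    "extracting-features-with-spacy",
]
# Agreement rates copied from the output above.
agreement = pd.DataFrame(
    [
        [1.0, 0.855805, 0.808052],
        [0.855805, 1.0, 0.832397],
        [0.808052, 0.832397, 1.0],
    ],
    index=names,
    columns=names,
)

# Assumption: this row is the reference model to compare against.
my_model = "dimension-reduction-with-umap"

# Drop the self-comparison (always 1.0), then take the lowest agreement.
others = agreement.loc[my_model].drop(my_model)
least_similar = others.idxmin()
print(least_similar, others[least_similar])
```

Low agreement alone does not say the other model is good, only that it is different, so its own score still has to be checked before combining.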
