# Combining models

There are many methods to combine models. 

🗳️ **Voting**

Count what each model predicts and take the most common prediction. This is simple to use but has its pitfalls.
- **Strong correlation**: Sometimes two models produce almost the same predictions; with voting, you just double their influence.
- **Repeated mistakes**: Not every model handles the same kind of data well. If most of the models make the same mistake on a data point, that does not make them right.
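
The strong-correlation pitfall can be illustrated with a minimal sketch on made-up toy predictions. The model names, the labels, and the `majority_vote` helper below are hypothetical and only serve the illustration; the vote itself relies on pandas' `DataFrame.mode(axis=1)`.

```python
import pandas as pd

# Toy predictions (hypothetical): labels are 1 or 2, one column per model.
votes = pd.DataFrame(
    {
        "model_a": [1, 2, 1, 2],
        "model_a_copy": [1, 2, 1, 2],  # near-duplicate of model_a
        "model_b": [2, 2, 2, 1],
    }
)


def majority_vote(frame: pd.DataFrame) -> pd.Series:
    """Return the most common label per row (the smallest label wins a tie)."""
    return frame.mode(axis=1)[0].astype(int)


# With the duplicate present, model_a effectively votes twice,
# so the ensemble can never disagree with it.
with_duplicate = majority_vote(votes)
print((with_duplicate == votes["model_a"]).all())
```

In this toy setup `model_b` is outvoted on every row where it disagrees, so the "ensemble" is just `model_a` under another name.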

🗳️ + 🔢 **Voting with scores**

To counter these problems, we need models that differ in their predictions and that come with a confidence score, so we know when to listen to them.
In my model with UMAP + nearest neighbors, it was very rare for a correct prediction to have any ambiguity among its neighbors.
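
Listening to a model only when it is confident can be sketched as follows. This is an illustration under stated assumptions, not the exact method of the UMAP notebook: it assumes the uncertainty of a nearest-neighbor prediction is the share of neighbors that disagree with the majority label, and both helper functions are hypothetical names.

```python
import pandas as pd


def uncertainty_from_neighbors(neighbor_labels: list[int]) -> tuple[int, float]:
    """Return (majority label, share of neighbors that disagree with it).

    Assumption: uncertainty is defined as the disagreeing share of neighbors.
    """
    counts = pd.Series(neighbor_labels).value_counts()
    label = int(counts.idxmax())
    uncertainty = 1.0 - counts.max() / len(neighbor_labels)
    return label, float(uncertainty)


def gated_prediction(own_label: int, uncertainty: float, fallback_label: int) -> int:
    """Keep the model's own label when its neighbors are unanimous, else fall back."""
    return own_label if uncertainty == 0 else fallback_label


# Seven neighbors, one of which disagrees with the majority.
label, uncertainty = uncertainty_from_neighbors([2, 2, 2, 2, 2, 2, 1])
final_label = gated_prediction(label, uncertainty, fallback_label=1)
```

The fallback label would typically come from another source, such as a vote across other models.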

In [1]:
from pathlib import Path
from dataclasses import dataclass
import pandas as pd

## Correlation

Let's find a model that does NOT correlate strongly with my model. For this, I am looking for notebooks that introduce their own solution rather than just taking existing submissions and combining them.

In [2]:
@dataclass
class Submission:
    name: str
    path: Path
    predictions: pd.DataFrame


all_submissions_paths = Path(r"/kaggle/input/").glob(r"**/submission.csv")

all_submissions = []
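for submission_path in all_submissions_paths:
    # The parent directory name identifies the notebook that produced the file
    name = submission_path.parent.name
    all_submissions.append(
        Submission(name=name, path=submission_path, predictions=pd.read_csv(submission_path))
    )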

all_names = [i.name for i in all_submissions]

prediction_umap = pd.read_csv(r"/kaggle/input/dimension-reduction-with-umap/prediction.csv")
prediction_umap.head()

|   | id | real_text_id | uncertainty | overwritten |
|---|---|---|---|---|
| 0 | 0 | 2 | 0.0 | True |
| 1 | 1 | 2 | 0.142857 | False |
| 2 | 2 | 1 | 0.0 | False |
| 3 | 3 | 1 | 0.0 | True |
| 4 | 4 | 2 | 0.0 | True |


In [3]:
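# Pairwise agreement: the share of rows on which two submissions predict
# the same real_text_id (1.0 means identical predictions)
corr_data = pd.DataFrame(index=all_names, columns=all_names)
for x in all_submissions:
    for y in all_submissions:
        corr_data.loc[x.name, y.name] = (
            x.predictions.real_text_id == y.predictions.real_text_id
        ).mean()
corr_data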

|   | 0-89211-fake-or-real-distil-bert-self-attention | extracting-features-with-spacy | 0-87759-fake-or-real-bert-pca-randomforest | memory-optimized-transformers-for-impostor-hunt | dimension-reduction-with-umap | 0-84232-enssenbel-4-model-impostor-hunt | truthgpt-spotting-real-in-this-fake-world |
|---|---|---|---|---|---|---|---|
| 0-89211-fake-or-real-distil-bert-self-attention | 1.0 | 0.831461 | 0.871723 | 0.888577 | 0.858614 | 0.846442 | 0.817416 |
| extracting-features-with-spacy | 0.831461 | 1.0 | 0.808052 | 0.838015 | 0.830524 | 0.805243 | 0.770599 |
| 0-87759-fake-or-real-bert-pca-randomforest | 0.871723 | 0.808052 | 1.0 | 0.827715 | 0.865169 | 0.817416 | 0.805243 |
| memory-optimized-transformers-for-impostor-hunt | 0.888577 | 0.838015 | 0.827715 | 1.0 | 0.807116 | 0.83427 | 0.835206 |
| dimension-reduction-with-umap | 0.858614 | 0.830524 | 0.865169 | 0.807116 | 1.0 | 0.821161 | 0.786517 |
| 0-84232-enssenbel-4-model-impostor-hunt | 0.846442 | 0.805243 | 0.817416 | 0.83427 | 0.821161 | 1.0 | 0.815543 |
| truthgpt-spotting-real-in-this-fake-world | 0.817416 | 0.770599 | 0.805243 | 0.835206 | 0.786517 | 0.815543 | 1.0 |
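
To read the least similar model off the table without scanning it by eye, the row for `dimension-reduction-with-umap` can be sorted. The sketch below hard-codes the values from that row so it runs on its own; the variable names are illustrative only.

```python
import pandas as pd

# Agreement of each notebook with dimension-reduction-with-umap,
# copied from the table above (the self-agreement of 1.0 is left out).
agreement_with_umap = pd.Series(
    {
        "0-89211-fake-or-real-distil-bert-self-attention": 0.858614,
        "extracting-features-with-spacy": 0.830524,
        "0-87759-fake-or-real-bert-pca-randomforest": 0.865169,
        "memory-optimized-transformers-for-impostor-hunt": 0.807116,
        "0-84232-enssenbel-4-model-impostor-hunt": 0.821161,
        "truthgpt-spotting-real-in-this-fake-world": 0.786517,
    }
)

# Lowest agreement first: these are the models that differ most from UMAP.
ranked = agreement_with_umap.sort_values()
least_similar = ranked.index[0]
print(least_similar)
```

Low agreement only says that two models differ, not which of them is right, so it is a way to shortlist candidates rather than a quality measure.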


In [4]:
# One column per submission; assumes every submission lists the ids in the same order
all_submissions_table = pd.DataFrame(columns=["id", *all_names])
all_submissions_table.id = all_submissions[0].predictions.id
for submission in all_submissions:
    all_submissions_table[submission.name] = submission.predictions.real_text_id
all_submissions_table.head()

|   | id | 0-89211-fake-or-real-distil-bert-self-attention | extracting-features-with-spacy | 0-87759-fake-or-real-bert-pca-randomforest | memory-optimized-transformers-for-impostor-hunt | dimension-reduction-with-umap | 0-84232-enssenbel-4-model-impostor-hunt | truthgpt-spotting-real-in-this-fake-world |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 | 1 |
| 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| 3 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | 4 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
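
Building the table column by column only works when every `submission.csv` lists its rows in the same order. As a hedged alternative, the sketch below aligns the predictions on `id` before comparing them. `build_vote_table` is a hypothetical helper, not part of the notebook, and it assumes each frame has `id` and `real_text_id` columns.

```python
import pandas as pd


def build_vote_table(predictions_by_name: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Return one column of real_text_id per model, with rows aligned on id."""
    columns = {
        name: frame.set_index("id")["real_text_id"]
        for name, frame in predictions_by_name.items()
    }
    # concat with axis=1 matches rows by index (the id), not by position
    return pd.concat(columns, axis=1).sort_index()


# Toy frames (hypothetical) whose rows are deliberately in different orders.
first = pd.DataFrame({"id": [0, 1, 2], "real_text_id": [1, 2, 1]})
second = pd.DataFrame({"id": [2, 0, 1], "real_text_id": [2, 1, 1]})
vote_table = build_vote_table({"first": first, "second": second})
```

An id that is missing from one submission shows up as `NaN` in that column, which makes incomplete files easy to spot before voting.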


In [5]:
# Vote over the model columns only; the id column must not count as a vote
naif_voting = all_submissions_table[all_names].mode(axis=1)[0]
naif_voting

0       2
1       2
2       1
3       1
4       2
       ..
1063    1
1064    1
1065    1
1066    2
1067    2
Name: 0, Length: 1068, dtype: int64

In [6]:
# Copy so the assignment below modifies an independent frame, not a slice of prediction_umap
submission2 = prediction_umap[["id", "real_text_id"]].copy()
# Rows where the UMAP prediction was not overwritten and its neighbors disagreed
unsure_indexes = ~prediction_umap.overwritten & (prediction_umap.uncertainty > 0)
# For those rows, fall back to the majority vote of all models
submission2.loc[unsure_indexes, "real_text_id"] = naif_voting[unsure_indexes]
# Assert no NaN
assert not submission2.isna().sum().sum()
# Index should be 1068 unique values between 0 and 1067
assert len(submission2.id.unique()) == 1068
assert submission2.id.min() == 0
assert submission2.id.max() == 1067
# Real_text_id only can be 1 or 2
assert ((submission2.real_text_id == 1) | (submission2.real_text_id == 2)).all()
submission2[["id", "real_text_id"]].to_csv("submission.csv", index=False)
submission2


|   | id | real_text_id |
|---|---|---|
| 0 | 0 | 2 |
| 1 | 1 | 2 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
| 4 | 4 | 2 |
| ... | ... | ... |
| 1063 | 1063 | 1 |
| 1064 | 1064 | 1 |
| 1065 | 1065 | 1 |
| 1066 | 1066 | 2 |
