# Introduction

**Purpose:**
This notebook is concerned with pseudolabelling: identifying the test-set samples whose labels an ensemble of models agrees on, and treating the agreed prediction as the label. These samples can then be fed back into the training set to improve the accuracy of a model.
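
As a rough illustration of the idea, the sketch below keeps only the samples on which every model predicts the same class. The model names and prediction values are made up for the example and are not taken from the real prediction files.

```python
import pandas as pd

# Toy predictions from three hypothetical models for five test samples
preds = pd.DataFrame(
    {
        "model_a": [1, 2, 2, 3, 1],
        "model_b": [1, 2, 3, 3, 1],
        "model_c": [1, 2, 2, 3, 2],
    },
    index=pd.Index([10, 11, 12, 13, 14], name="Id"),
)

# A sample counts as "confident" when all models predict the same class
unanimous = preds.nunique(axis=1) == 1

# The pseudolabel is the agreed class (any column works, since they all match)
pseudolabels = preds.loc[unanimous, "model_a"].rename("pseudolabel")
print(pseudolabels)
```

Requiring unanimous agreement is a strict criterion: it discards more samples than a majority vote would, in exchange for fewer wrong pseudolabels.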

**Note:**
Because of memory problems, I had to run my five classifiers in other notebooks. The results are in this [dataset](https://www.kaggle.com/samuelcortinhas/tps-dec-pseudolabels).

See my other notebook where I do EDA and make use of these pseudolabels.

# Libraries

In [None]:
# Core
import numpy as np
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.1f' % x)
pd.set_option("display.max_columns", 55)  # set (not get) the column display limit
import seaborn as sns
sns.set_style('darkgrid')
import matplotlib.pyplot as plt
%matplotlib inline
from itertools import combinations
import statistics
import time

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score
from sklearn.ensemble import RandomForestClassifier

# Models
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

# Tensorflow
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import callbacks

# Data

In [None]:
RF_preds=pd.read_csv('../input/tps-dec-pseudolabels/RF_predictions.csv', index_col='Id')
XGB_preds=pd.read_csv('../input/tps-dec-pseudolabels/XGB_predictions.csv', index_col='Id')
LGBM_preds=pd.read_csv('../input/tps-dec-pseudolabels/LGBM_predictions.csv', index_col='Id')
CAT_preds=pd.read_csv('../input/tps-dec-pseudolabels/CAT_predictions.csv', index_col='Id')
ANN_preds=pd.read_csv('../input/tps-dec-pseudolabels/ANN_predictions.csv', index_col='Id')
test_data=pd.read_csv('../input/tabular-playground-series-dec-2021/test.csv', index_col='Id')
test_index=test_data.index # save for later

In [None]:
ensemble_preds=pd.concat([RF_preds,XGB_preds,LGBM_preds,CAT_preds,ANN_preds], axis=1)
ensemble_preds.head()
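
Before filtering on agreement, it can be useful to see how often each pair of models agrees. The following is a minimal sketch on toy data; the column names and values are assumptions for illustration, and on the real data `preds` would be replaced by `ensemble_preds`.

```python
from itertools import combinations

import pandas as pd

# Toy predictions (hypothetical values) standing in for ensemble_preds
preds = pd.DataFrame(
    {
        "RF": [1, 2, 2, 3],
        "XGB": [1, 2, 3, 3],
        "ANN": [1, 1, 2, 3],
    }
)

# Fraction of samples on which each pair of models predicts the same class
agreement = {
    (a, b): (preds[a] == preds[b]).mean()
    for a, b in combinations(preds.columns, 2)
}

for pair, rate in agreement.items():
    print(pair, rate)
```

A pair with a low agreement rate is the one most likely to shrink the set of unanimous predictions.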

# Pseudolabels

**Assign pseudolabels**

In [None]:
# Samples on which all models agree (row min equals row max)
conf_preds_index=ensemble_preds[ensemble_preds.min(axis=1)==ensemble_preds.max(axis=1)].index

# Select corresponding test data (copy to avoid SettingWithCopyWarning on assignment)
pseudo_label_df=test_data.loc[conf_preds_index].copy()

# Assign Pseudolabels
pseudo_label_df['Cover_Type']=ensemble_preds.loc[conf_preds_index, 'RF']

# Print shape
print('Data frame shape:',pseudo_label_df.shape)

# Print proportion of samples predicted
print('Proportion of samples predicted:',pseudo_label_df.shape[0]/test_data.shape[0])

# Preview df
pseudo_label_df.head()

**Plot distributions of labels**

In [None]:
# Figure size
plt.figure(figsize=(12,6))

# Countplot
sns.countplot(x=pseudo_label_df['Cover_Type'])  # pass the column as x; recent seaborn versions require keyword arguments

# Aesthetics
plt.title('Distribution of pseudolabels', fontsize=15)

**Save to csv**

In [None]:
# Save to csv
pseudo_label_df.to_csv('pseudo_label_df.csv')
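
The pseudolabelled rows are meant to be appended to the training set in a later notebook. The sketch below shows one way that might look, using small toy frames in place of the real training data and `pseudo_label_df`. The `Elevation` values, the `Id` values and the `is_pseudo` flag are assumptions made for the example, not part of the original workflow.

```python
import pandas as pd

# Toy stand-ins for the real training set and pseudo_label_df (hypothetical values)
train = pd.DataFrame(
    {"Elevation": [2500, 3100], "Cover_Type": [2, 1]},
    index=pd.Index([0, 1], name="Id"),
)
pseudo = pd.DataFrame(
    {"Elevation": [2900, 3300, 2700], "Cover_Type": [2, 1, 3]},
    index=pd.Index([4000000, 4000001, 4000002], name="Id"),
)

# Flag the origin of each row so pseudolabelled rows can be kept out of validation folds
train["is_pseudo"] = False
pseudo["is_pseudo"] = True

# Stack the two frames row-wise; both must share the same columns
augmented = pd.concat([train, pseudo], axis=0)

# Split into features and target for model fitting
X = augmented.drop(columns=["Cover_Type", "is_pseudo"])
y = augmented["Cover_Type"]
print(augmented.shape)
```

Keeping a flag such as `is_pseudo` makes it possible to validate only on rows with true labels, so that validation scores are not inflated by the models' own predictions.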