### I have already shown ([here](https://www.kaggle.com/arnabbiswas1/classification-approach-lgbm-log-loss/comments)) that a classification approach does not work well with this data.

### However, since the target variable "loss" takes 43 discrete integer values, we can still create a **confusion matrix using the out-of-fold predictions** generated during cross-validation.

To do this, I have loaded the out-of-fold (OOF) predictions and the test predictions (i.e. predictions on the test data) generated by my vanilla XGBoost Regressor (with StratifiedKFold). The OOF score for this submission was 7.86117 and the public LB score was 7.88749.

Now, to create a confusion matrix, I need the OOF predictions to be integers (discrete values between 0 and 42). However, the predictions made by the regressor are floats. To handle this, I have rounded the OOF predictions to the nearest integer.

Reference Discussion: https://www.kaggle.com/c/tabular-playground-series-aug-2021/discussion/263862

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.metrics import confusion_matrix, mean_squared_error

In [None]:
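def plot_confusion_matrix(cm_array, labels, figsize):
    # Wrap the matrix in a DataFrame so the heatmap axes show the class labels
    # (rows: true values, columns: predicted values)
    df_cm = pd.DataFrame(cm_array, index=labels, columns=labels)
    plt.figure(figsize=figsize)
    # fmt='d' prints the cell counts as integers
    sns.heatmap(df_cm, annot=True, fmt='d')
    plt.show()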

#### Read the data

In [None]:
# Train and Test Data
train_df = pd.read_csv('../input/tabular-playground-series-aug-2021/train.csv', index_col="id")
test_df = pd.read_csv('../input/tabular-playground-series-aug-2021/test.csv', index_col="id")

# OOF Predictions from my submission
df_oof = pd.read_csv("../input/tpsaug2021oof785946/oof_xgb_benchmark_Stratifiedkfold_0804_1945_7.86117.csv", index_col="id")
# Predictions on test data from my submission
df_submission = pd.read_csv("/kaggle/input/tpsaug2021oof785946/sub_xgb_benchmark_Stratifiedkfold_0804_1945_7.86117.csv", index_col="id")

#### Round off the OOF predictions to the nearest integer values & create a new column called loss_round. Now, there are two columns:
- loss: OOF Prediction by the XGB Regressor
- loss_round: Rounded values of the OOF predictions

In [None]:
df_oof = df_oof.rename(columns={"0": "loss"})
df_oof["loss_round"] = df_oof.loss.round()
df_oof.head()
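
A regressor is free to predict values below 0 or above 42, so the rounded predictions are not guaranteed to be valid class labels. The block below is a minimal sketch of rounding, casting to integer, and clipping to the observed range. The helper name `to_class_labels` and the example numbers are made up for illustration; they are not taken from the OOF file. Note that `np.rint` rounds exact halves to the nearest even integer.

```python
import numpy as np


def to_class_labels(preds, lo=0, hi=42):
    """Round float predictions to the nearest integer and clip them to [lo, hi]."""
    rounded = np.rint(preds).astype(int)  # halves go to the nearest even integer
    return np.clip(rounded, lo, hi)


# Made-up predictions for illustration only (not from the OOF file)
example_preds = np.array([-0.7, 0.4, 6.5, 7.49, 41.8, 43.2])
example_labels = to_class_labels(example_preds)
print(example_labels)
```

Whether clipping is appropriate here is a judgment call; comparing `df_oof.loss_round.min()` and `df_oof.loss_round.max()` against the label range would show whether it matters at all for this submission.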

#### Plot the confusion matrix using true value of "loss" from train data and OOF predictions (rounded values)

In [None]:
labels = list(np.sort(train_df.loss.unique()))
cm_array = confusion_matrix(y_true=train_df.loss, y_pred=df_oof.loss_round, labels=labels)

plot_confusion_matrix(cm_array, labels, figsize=(30, 20))

If the image is not displayed clearly, check it [here](https://imgur.com/a/obqT6yM).

The confusion matrix looks surprising. Aren't we supposed to see a lot of large numbers along the diagonal of the matrix? Here, most of the elements on the diagonal are zero. That means most of the classes (0 to 42) are NOT predicted correctly.
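
One way to put a number on "how much sits on the diagonal" is the share of samples whose rounded prediction equals the true value, i.e. the trace of the matrix divided by its total. The sketch below uses a toy 3x3 matrix, not the real one, and the helper name `exact_hit_rate` is my own.

```python
import numpy as np


def exact_hit_rate(cm):
    """Fraction of samples that lie on the diagonal of a confusion matrix."""
    cm = np.asarray(cm)
    return np.trace(cm) / cm.sum()


# Toy matrix (rows: true class, columns: predicted class), not the real data
toy_cm = np.array([
    [0, 5, 0],
    [0, 8, 2],
    [0, 4, 1],
])
toy_rate = exact_hit_rate(toy_cm)
print(f"Share of exact hits: {toy_rate:.2f}")
```

With the notebook's variables this would be `exact_hit_rate(cm_array)`.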

On a closer look:
- All instances of loss=0 are predicted as values between 1 and 17
- All instances of loss=1 are predicted as values between 1 and 17 (only one instance of 1 has been predicted as 1)
- All instances of loss=7 are predicted as values between 2 and 15
- ...
- All instances of loss=38 are predicted as values between 5 and 11 (none of the instances of 38 has been predicted as 38)

and so on.

Overall,
- whatever the real value is, the predicted values lie between 1 and 17.
- the higher the real value, the smaller the spread of the predicted values (some kind of Normal distribution?)
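
The two observations above can also be checked numerically rather than by reading the heatmap: group the predictions by the true value and look at the count, minimum, maximum, and standard deviation per group. The sketch below runs on synthetic stand-in data (the column names `loss_true` and `loss_pred` and the data-generating formula are assumptions for illustration), so its numbers say nothing about the real model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: integer targets in 0..42 and predictions that barely
# react to the target, loosely mimicking a model that predicts near the mean
true_loss = rng.integers(0, 43, size=5000)
pred_loss = 7 + 0.05 * (true_loss - 7) + rng.normal(0.0, 1.5, size=5000)

toy = pd.DataFrame({"loss_true": true_loss, "loss_pred": pred_loss})

# One row per true value: how many samples, and how widely predictions spread
spread = toy.groupby("loss_true")["loss_pred"].agg(["count", "min", "max", "std"])
print(spread.head())
```

On the real data, the same table could be built from `pd.DataFrame({"loss_true": train_df.loss, "loss_pred": df_oof.loss})`, assuming both are indexed by the same `id`. Keep in mind that high loss values are rare, so a narrower min-max range for them may partly reflect the smaller sample count.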

#### Let's check the RMSE value for the rounded version of OOF predictions.

In [None]:
np.sqrt(mean_squared_error(y_true=train_df.loss, y_pred=df_oof.loss_round))

#### Let's check the RMSE value for the original version of OOF predictions.

In [None]:
np.sqrt(mean_squared_error(y_true=train_df.loss, y_pred=df_oof.loss))

#### As we can see, they are actually pretty close. So, rounding off the OOF predictions has not introduced much error.
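
There is a simple back-of-the-envelope reason for this. If the rounding error is roughly uniform on [-0.5, 0.5] and uncorrelated with the residual (both are assumptions, not something verified here), rounding adds about 1/12 to the mean squared error. For an RMSE of 7.86117 that gives roughly sqrt(7.86117² + 1/12) ≈ 7.866. The sketch below checks the approximation on synthetic data; the noise level is chosen only to resemble the OOF score.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Synthetic targets and noisy "predictions" (not the real OOF file)
y_true = rng.integers(0, 43, size=n).astype(float)
y_pred = y_true + rng.normal(0.0, 7.86, size=n)


def rmse(a, b):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((a - b) ** 2)))


rmse_raw = rmse(y_true, y_pred)
rmse_rounded = rmse(y_true, np.rint(y_pred))
# Uniform rounding error on [-0.5, 0.5] has variance 1/12
rmse_approx = float(np.sqrt(rmse_raw ** 2 + 1 / 12))

print(f"raw: {rmse_raw:.4f}  rounded: {rmse_rounded:.4f}  approx: {rmse_approx:.4f}")
```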

#### Let me plot histograms of the following:

- True value of loss from train data (Red)
- OOF Predictions (Green)
- Predictions made using test data (Blue)

#### From the figure below, it's very clear that **the model is not actually predicting anything close to the true value**

## Update (12th August 2021)
Based on Patrick's (@paddykb) comment, I have also plotted the means of the above three distributions. They are so close that the values can't be told apart in the plot.

In [None]:
figsize = (20, 10)
bins=50
train_df.loss.plot(
    kind="hist",
    figsize=figsize,
    label="loss_on_true",
    bins=bins,
    alpha=0.4,
    color="red",
    title="",
)
df_oof.loss.plot(
    kind="hist",
    figsize=figsize,
    label="prediction_on_oof",
    bins=bins,
    alpha=0.4,
    color="darkgreen",
)
df_submission.loss.plot(
    kind="hist",
    figsize=figsize,
    label="prediction_on_test",
    bins=bins,
    alpha=0.4,
    color="blue",
)

plt.axvline(train_df.loss.mean(), color='red', linestyle='solid', linewidth=1, alpha=0.3, label="mean of loss_on_true")
plt.axvline(df_oof.loss.mean(), color='darkgreen', linestyle='dotted', linewidth=2.5, alpha=0.3, label="mean of prediction_on_oof")
plt.axvline(df_submission.loss.mean(), color='blue', linestyle=':', linewidth=1, alpha=0.3, label="mean of prediction_on_test")

plt.legend()
plt.show()

In [None]:
print(f"Mean of loss in train data (True target): {train_df.loss.mean()}")
print(f"Mean of loss in OOF predictions : {df_oof.loss.mean()}")
print(f"Mean of loss in prediction in test : {df_submission.loss.mean()}")

#### As we can see, the differences are at the third decimal place. I am not sure whether they can be called statistically significant.
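
For the train-versus-OOF comparison, where each row has both a true value and a prediction, one rough way to approach the significance question is a normal-approximation confidence interval for the mean of the per-row differences. The sketch below is an assumption-laden illustration: the helper name `mean_diff_interval` is my own, the data is synthetic, and the interval treats the rows as independent. It does not apply to the test predictions, since their true values are unknown.

```python
import numpy as np


def mean_diff_interval(y_true, y_pred, z=1.96):
    """Mean of (y_pred - y_true) with an approximate 95% normal interval."""
    diff = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    mean_diff = diff.mean()
    # Standard error of the mean difference
    se = diff.std(ddof=1) / np.sqrt(diff.size)
    return float(mean_diff), float(mean_diff - z * se), float(mean_diff + z * se)


# Synthetic stand-in data (not the real train/OOF files)
rng = np.random.default_rng(1)
toy_true = rng.integers(0, 43, size=10_000).astype(float)
toy_pred = toy_true + rng.normal(0.0, 7.9, size=10_000)

toy_mean, toy_lo, toy_hi = mean_diff_interval(toy_true, toy_pred)
print(f"mean difference: {toy_mean:.4f}, interval: [{toy_lo:.4f}, {toy_hi:.4f}]")
```

If the interval computed from `train_df.loss` and `df_oof.loss` contains zero, the difference in means would be within the noise expected from residuals of this size.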

#### The model I have used is a vanilla XGBoost Regressor. I am wondering about the means for the models at the top of the LB (please feel free to use this kernel for a quick calculation).