<div>
    <h1 align="center">Results & "RMSE" Evaluation</h1>    
    <h1 align="center">Tabular Playground Series - Aug 2021</h1> 
    <h4 align="center">By: Somayyeh Gholami & Mehran Kazeminia</h4>
</div>

<div class="alert alert-success">  
</div>

### This challenge is evaluated with "RMSE": submissions are scored on the root mean squared error between the predicted and the actual loss. Naturally, every machine learning method that participants use is tuned to minimize this quantity. But the important question is:

## Can merely optimizing the "RMSE" equation guarantee the success of classification or regression?

### Unfortunately, the answer is no. In this notebook, you will see that even the best-scoring machine learning results have a distribution that is far from that of the real target values. For this study, we used the outputs of several public notebooks that rely on different methods and were among the best public notebooks in terms of score.
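
For reference, the metric can be written as below. The notation is ours and is not taken from the competition page: $y_i$ is the true loss of row $i$, $\hat{y}_i$ is the predicted loss, and $n$ is the number of scored rows.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

Because each error is squared before averaging, a single badly missed row costs far more than many slightly missed rows, which already hints at why cautious predictions near the centre of the data score well.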


<div class="alert alert-success">
    <h1 align="center">If you find this work useful, please don't forget to upvote :)</h1>
</div>

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.express as px

%matplotlib inline

<div class="alert alert-success">  
</div>

## Data Set of Challenge

In [None]:
DF1 = pd.read_csv('../input/tabular-playground-series-aug-2021/train.csv')

DF2 = pd.read_csv('../input/tabular-playground-series-aug-2021/test.csv')

SAM = pd.read_csv('../input/tabular-playground-series-aug-2021/sample_submission.csv')

In [None]:
X  = DF1.drop(columns = ['id','loss'])

XX = DF2.drop(columns = ['id'])

y  = DF1.loss

In [None]:
display(y, y.min(), y.max())

In [None]:
y.value_counts().plot(figsize=(16, 8), kind='bar')

In [None]:
y.value_counts().plot(figsize=(10, 10), kind='pie')

y.value_counts(normalize=True)

In [None]:
hist_data = [ y ]  

group_labels = ['y']
    
fig = ff.create_distplot(hist_data, group_labels, bin_size=.2, show_hist=False, show_rug=False) 

fig.show()

### We are given no information about how "train_data" and "test_data" differ. For this reason, we expect the distribution of our predictions to be roughly similar to that of "y". ***But did this happen?***

### We will see that the predictions of all the public notebooks are distributed very differently from "y", while the score (error value) of these notebooks is approximately "7.8". This is a large error, even larger than "np.mean(y)".
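
To put a score of about 7.8 in context, it may help to compare it with the simplest possible baseline: predicting the mean for every row. The RMSE of that baseline equals the standard deviation of the target. The sketch below shows this on made-up data. `y_demo` is a hypothetical skewed, non-negative target and not the competition's `y`; in the notebook one would pass `y` instead.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical stand-in for the target: skewed, non-negative integers.
rng = np.random.default_rng(0)
y_demo = rng.geometric(p=0.13, size=50_000) - 1

# Predict the mean for every row and score that constant prediction.
baseline = rmse(y_demo, np.full(y_demo.shape, y_demo.mean()))

print(f"mean(y) = {y_demo.mean():.3f}")
print(f"std(y)  = {y_demo.std():.3f}")
print(f"RMSE of the constant-mean prediction = {baseline:.3f}")
```

If the leaderboard score is only slightly below `np.std(y)`, the models explain only a small part of the variation in the target.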


<div class="alert alert-success">  
</div>   

## Notebook Results

Thanks to: @dmitryuarov https://www.kaggle.com/dmitryuarov/falling-below-7-87-voting-cb-xgb-lgbm

In [None]:
path0 = '../input/tps8-786780/voting.csv' 

sub786780 = pd.read_csv(path0)

Thanks to: @oxzplvifi https://www.kaggle.com/oxzplvifi/tabular-denoising-residual-network

Thanks to: @pourchot https://www.kaggle.com/pourchot/in-python-tabular-denoising-residual-network

In [None]:
path1 = '../input/tps8-786595/submission43.csv' 

sub786595 = pd.read_csv(path1)

Thanks to: @alexryzhkov https://www.kaggle.com/alexryzhkov/aug21-lightautoml-starter

In [None]:
path2 = '../input/tps8-786259/In_LightAutoML_we_trust.csv' 

sub786259 = pd.read_csv(path2)

Thanks to: @tensorchoko https://www.kaggle.com/tensorchoko/tabular-aug-2021-lightgbm

In [None]:
path3 = '../input/tps8-786132/submit.csv' 

sub786132 = pd.read_csv(path3)

Thanks to: @hiro5299834 https://www.kaggle.com/hiro5299834/tps-aug-2021-lgbm?scriptVersionId=71439698

In [None]:
path4 = '../input/tps8-785852/submission.csv' 

sub785852 = pd.read_csv(path4)

Thanks to: @alexryzhkov https://www.kaggle.com/alexryzhkov/lightautoml-classifier-regressor-mix/output?scriptVersionId=71481321

In [None]:
path5 = '../input/tps8-785308/TPS8_785308.csv' 

sub785308 = pd.read_csv(path5)

Thanks to: @hiro5299834 https://www.kaggle.com/hiro5299834/tps-aug-2021-lgbm-xgb-catboost

In [None]:
path6 = '../input/tps8-785239/submission.csv' 

sub785239 = pd.read_csv(path6)

In [None]:
hist_data = [sub786780.loss, sub786595.loss, sub786259.loss, sub786132.loss, sub785852.loss, sub785308.loss, sub785239.loss]  

group_labels = ['7.86780', '7.86595', '7.86259', '7.86132', '7.85852', '7.85308', '7.85239']
    
fig = ff.create_distplot(hist_data, group_labels, bin_size=.2, show_hist=False, show_rug=False) 

fig.show()

### As you can see, the distributions produced by the different notebooks differ noticeably in shape, but they have one thing in common: their minimum and maximum values are almost equal. Strangely enough, these values are completely different from the minimum and maximum of "y". Please note the following diagram:
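
The same comparison can also be read off as numbers instead of curves. The helper below is a small sketch of our own (`range_summary` is a hypothetical name, and the values in the demo are made up). In the notebook one could pass the `loss` column of each submission together with `y`.

```python
import pandas as pd

def range_summary(series_by_label):
    """Return min, max, mean and std for each labelled series of values."""
    rows = {
        label: {
            "min": values.min(),
            "max": values.max(),
            "mean": values.mean(),
            "std": values.std(),
        }
        for label, values in series_by_label.items()
    }
    # The dict of dicts gives one column per label; transpose to one row per label.
    return pd.DataFrame(rows).T

# Tiny made-up example, only to show the shape of the output.
demo = range_summary({
    "predictions": pd.Series([5.9, 6.4, 7.1, 8.3]),
    "y": pd.Series([0, 2, 9, 40]),
})
print(demo)
```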

In [None]:
hist_data = [sub786780.loss, sub786595.loss, sub786259.loss, sub786132.loss, sub785852.loss, sub785308.loss, sub785239.loss, y]  

group_labels = ['7.86780', '7.86595', '7.86259', '7.86132', '7.85852', '7.85308', '7.85239', 'y']
    
fig = ff.create_distplot(hist_data, group_labels, bin_size=.2, show_hist=False, show_rug=False) 

fig.show()

### The first priority of every machine learning method here was to minimize "RMSE", and in practice this did not produce a successful classification or regression. The predicted values lack the desired scatter and cluster around "np.mean(y)". Obviously, this range of numbers optimizes "RMSE", but as you can see, it is not necessarily a good or realistic prediction.
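
A short derivation, in our own notation, suggests why the predictions gather near the mean. It assumes that the models approximate the predictor that minimizes expected squared error. Among all functions of the features $x$, that predictor is the conditional mean $f^{*}(x) = \mathbb{E}[y \mid x]$, and the law of total variance gives:

$$
\operatorname{Var}(y) \;=\; \operatorname{Var}\big(\mathbb{E}[y \mid x]\big) + \mathbb{E}\big[\operatorname{Var}(y \mid x)\big] \;\ge\; \operatorname{Var}\big(f^{*}(x)\big)
$$

So the ideal "RMSE" prediction is never more spread out than "y" itself. The less the features explain, the larger the second term becomes and the more the predictions shrink toward the mean.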


<div class="alert alert-success">  
</div>

## Methods such as "Ensembling" and the "Comparative Method" do not help to increase the scatter of the numbers.


### Please note that these methods may improve the notebook score (that is, lower the error), but the final minimum and maximum values still do not change much. For comparison, the charts below show the inputs and the output of our second notebook.

## Inputs

In [None]:
path11 = '../input/tps-785318/TPS_785318.csv' 

sub785318 = pd.read_csv(path11)

In [None]:
path12 = '../input/tps-785254/TPS8_785254.csv' 

sub785254 = pd.read_csv(path12)

In [None]:
path13 = '../input/tps8-785237/TPS8_785237.csv' 

sub785237 = pd.read_csv(path13)

In [None]:
hist_data = [sub785318.loss, sub785254.loss, sub785237.loss]  

group_labels = ['Public Score: 7.85318', 'Public Score: 7.85254', 'Public Score: 7.85237']
    
fig = ff.create_distplot(hist_data, group_labels, bin_size=.2, show_hist=False, show_rug=False) 

fig.show()

## Output

In [None]:
path14 = '../input/tps8-785159/TPS8_785159.csv' 

sub785159 = pd.read_csv(path14)

In [None]:
hist_data = [sub785159.loss]  

group_labels = ['Ensembling (7.85159)']
    
fig = ff.create_distplot(hist_data, group_labels, bin_size=.2, show_hist=False, show_rug=False) 

fig.show()

## Inputs & Output

In [None]:
hist_data = [sub785318.loss, sub785254.loss, sub785237.loss, sub785159.loss]  

group_labels = ['Public Score: 7.85318', 'Public Score: 7.85254', 'Public Score: 7.85237', 'Ensembling (7.85159)']
    
fig = ff.create_distplot(hist_data, group_labels, bin_size=.2, show_hist=False, show_rug=False) 

fig.show()

### As you can see, the score improved with "Ensembling" (from 7.85237 for the best input to 7.85159), but the minimum and maximum values did not change much. One could say that "RMSE" has practically created a cage for our results and does not allow us to approach realistic predictions :)
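
One reason the range cannot widen is arithmetic. If the ensemble is a weighted average with non-negative weights that sum to 1, the blended values can never fall below the smallest input value or rise above the largest one. That is an assumption on our side, since other ways of combining submissions exist. The sketch below checks the bound on made-up prediction vectors; `inputs` and `weights` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Three made-up prediction vectors standing in for the ensemble inputs.
inputs = [rng.normal(loc=6.9, scale=0.8, size=1_000) for _ in range(3)]

# Hypothetical blending weights: non-negative and summing to 1.
weights = np.array([0.2, 0.3, 0.5])
blend = sum(w * p for w, p in zip(weights, inputs))

lowest_input = min(p.min() for p in inputs)
highest_input = max(p.max() for p in inputs)

print(f"inputs: min={lowest_input:.3f}  max={highest_input:.3f}")
print(f"blend : min={blend.min():.3f}  max={blend.max():.3f}")
```

The demo vectors are independent, so the blend narrows more than it would for real submissions, which are strongly correlated. The bound itself holds in both cases.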

<div class="alert alert-success">  
</div>

## How to get out of the "RMSE" cage?

### At first glance, this may not seem like a difficult task. Unfortunately, it is complicated. The order and ranking of the values predicted by the notebooks are not accurate, which means that even a slight change can push the predictions away from the "RMSE" optimum. In practice we cannot simply increase the scatter of the numbers ... but we can still do something.
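
A minimal sketch of why the scatter cannot simply be increased, using synthetic data that we made up for illustration. The names `signal`, `y_demo`, `pred` and `stretched` are hypothetical and have nothing to do with the competition data. Even when the ranking is left untouched, stretching the predictions away from their mean raises "RMSE" when the target is mostly noise around the predicted value.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

rng = np.random.default_rng(2)
n = 100_000

# Synthetic setup: a weak signal buried in much larger noise.
signal = rng.normal(loc=7.0, scale=1.0, size=n)
y_demo = signal + rng.normal(loc=0.0, scale=7.0, size=n)

# Idealized prediction: exactly the part of the target that can be explained.
pred = signal

# Same ranking, three times the spread around the mean.
stretched = pred.mean() + 3.0 * (pred - pred.mean())

rmse_pred = rmse(y_demo, pred)
rmse_stretched = rmse(y_demo, stretched)

print(f"std of predictions: {pred.std():.3f} -> {stretched.std():.3f}")
print(f"RMSE              : {rmse_pred:.3f} -> {rmse_stretched:.3f}")
```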


## Method 1: Coordinate with constant values

In [None]:
def coordinate1(main, min_lim, max_lim, constant1, constant2):
    """Push the tails of a submission outward.

    Values <= min_lim are lowered by constant1 and values >= max_lim are
    raised by constant2. `main` is a submission frame whose second column
    holds the predictions. Returns the adjusted copy.
    """
    # Original predictions as a float array
    suba = main.iloc[:, 1].to_numpy(dtype=float)

    # Vectorized replacement for the row-by-row loop; both conditions are
    # tested against the original values, as before
    shifted = np.where(suba <= min_lim, suba - constant1, suba)
    shifted = np.where(suba >= max_lim, suba + constant2, shifted)

    coor = main.copy()
    coor[coor.columns[1]] = shifted

    ###############################
    # Scatter plot: original values on the diagonal vs. adjusted values.
    # 'seaborn-whitegrid' was renamed in Matplotlib 3.6; use whichever name exists
    style = 'seaborn-v0_8-whitegrid'
    if style not in plt.style.available:
        style = 'seaborn-whitegrid'
    plt.style.use(style)
    plt.figure(figsize=(9, 9), facecolor='lightgray')
    plt.title('\nC O O R D I N A T E\n')

    plt.scatter(suba, suba, s=2.0, label='Main(X=Y)')
    plt.scatter(suba, shifted, s=2.0, label='Coordinated')

    plt.legend(fontsize=12, loc=2)
    #plt.savefig('Coordinate_1.png')
    plt.show()
    ###############################
    # Density curves before and after the adjustment
    hist_data = [suba, shifted]
    group_labels = ['Main', 'Coordinated']

    fig = ff.create_distplot(hist_data, group_labels, bin_size=.2, show_hist=False, show_rug=False)
    fig.show()
    ###############################
    print()
    print(':::::::::::::::: Main Values ::::::::::::::::')
    print(f'Min:{suba.min()}   Max:{suba.max()}\n')
    print(':::::::::::: Coordinated Values :::::::::::::')
    print(f'Min:{shifted.min()}   Max:{shifted.max()}\n')
    ###############################

    return coor

In [None]:
sub1 = coordinate1(sub785159, 5.5, 8.0, 0.2, 0.2)

### Of course, **we do not recommend this method**. We will publish much better methods in upcoming notebooks. Still, it is a good example for conveying the concept we wanted to share.

### We hope that the contents of this notebook will be useful for you to continue this challenge.

## Good Luck.

<div class="alert alert-success">  
</div>

In [None]:
sub1.to_csv("submission1.csv",index=False)
!ls
