# EDA of outputs vs True values

This is the output obtained from [Chris Deotte](https://www.kaggle.com/cdeotte)'s notebook: [RAPIDS SVR Boost - [17.8]](https://www.kaggle.com/cdeotte/rapids-svr-boost-17-8).
I have taken these results into consideration:
* True values
* NN OOF outputs combined
* NN with SVR head OOF outputs combined
* Ensemble of NN and NN with SVR head outputs (equal weights) combined

I have plotted the following:
* Distributions of the True, NN preds, SVR preds and ensemble preds
* Change in the values from true values to ensemble predictions
* Box plots

## **Please DO Upvote**
I will be updating with more findings :)

In [None]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
from scipy.stats import skew

In [None]:
df = pd.read_csv('../input/pawpularity-results/eda.csv')

In [None]:
df = df.set_axis(['True', 'NN', 'SVR', 'Ensemble'], axis=1)

# Data Summary

The summary below shows the count, mean, standard deviation, minimum, quartiles (including the median) and maximum of each column.

In [None]:
df.describe()

# True values distribution

The True values (actual Pawpularity score) distribution is shown here. It shows that the data is very skewed.

In [None]:
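# Count how often each score occurs; the columns are named explicitly so the
# cell behaves the same across pandas versions
df1 = df['True'].value_counts().rename_axis('index').reset_index(name='True')
fig = px.bar(df1, y="True", x="index", color="index",
             color_continuous_scale='Bluered_r')
fig.show()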

In [None]:
print('-'*75)
print('Skewness : ', skew(df['True'].values))
print('-'*75)

# Reducing skewness

Taking the square root of each value should reduce the skewness.

Let us check.

In [None]:
ss = [round(np.sqrt(x)) for x in df['True'].values]

In [None]:
print('-'*75)
print('Skewness : ', skew(ss))
print('-'*75)

#### I have no idea if this will help
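One way to judge whether a transform is worth trying is to compare the skewness of several candidates side by side. The sketch below does this on a synthetic right-skewed sample that stands in for `df['True']`; the gamma parameters and the `log1p` alternative are assumptions for illustration and are not part of the original analysis. Unlike the cell above, the transformed values are not rounded.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed scores clipped to [1, 100]; a stand-in for df['True'].values
rng = np.random.default_rng(0)
scores = np.clip(rng.gamma(shape=2.0, scale=15.0, size=5000), 1, 100)

# Skewness of the raw values and of two candidate transforms
skewness = {
    'raw': skew(scores),
    'sqrt': skew(np.sqrt(scores)),
    'log1p': skew(np.log1p(scores)),
}
for name, value in skewness.items():
    print(f'{name:>6}: {value: .3f}')
```

A value closer to zero means a more symmetric distribution. Whether a more symmetric target actually improves the model would still need to be checked with cross-validation.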

# NN predictions distribution (rounded off to the nearest integer)

The predictions of the NN (Swin Transformer) are shown here. Since the outputs are continuous, I have made them discrete by rounding to the nearest integer.
I then plotted the distribution, which shows that the predictions are almost as skewed as the actual Pawpularity.

In [None]:
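# Round, then count each value; explicit column names keep the cell working across pandas versions
df2 = pd.DataFrame([round(y) for y in df['NN'].values])[0].value_counts().rename_axis('index').reset_index(name=0)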
fig = px.bar(df2, y=0, x="index", color="index",
             color_continuous_scale='sunset')
fig.show()

In [None]:
print('-'*75)
print('Skewness : ', skew(df['NN'].values))
print('-'*75)

# SVR head predictions distribution (rounded off to the nearest integer)

In [None]:
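# Round, then count each value; explicit column names keep the cell working across pandas versions
df2 = pd.DataFrame([round(y) for y in df['SVR'].values])[0].value_counts().rename_axis('index').reset_index(name=0)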
fig = px.bar(df2, y=0, x="index", color="index",
             color_continuous_scale='earth')
fig.show()

In [None]:
print('-'*75)
print('Skewness : ', skew(df['SVR'].values))
print('-'*75)

# Ensemble predictions distribution (rounded off to the nearest integer)

In [None]:
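# Use the Ensemble column here (the NN column was plotted in an earlier section)
df2 = pd.DataFrame([round(y) for y in df['Ensemble'].values])[0].value_counts().rename_axis('index').reset_index(name=0)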
fig = px.bar(df2, y=0, x="index", color="index",
             color_continuous_scale='purp')
fig.show()

In [None]:
print('-'*75)
print('Skewness : ', skew(df['Ensemble'].values))
print('-'*75)

# True vs Ensemble distribution

In [None]:
# Count the true scores and the rounded ensemble predictions, sorted by score
true_counts = df['True'].value_counts().sort_index()
ens_counts = df['Ensemble'].round().astype(int).value_counts().sort_index()
fig = go.Figure()
fig.add_trace(go.Bar(x=true_counts.index, y=true_counts.values, name='True'))
fig.add_trace(go.Bar(x=ens_counts.index, y=ens_counts.values, name='Ensemble'))
fig.update_layout(
    title="True vs Ensemble distribution",
    xaxis_title="Pawpularity",
    yaxis_title="Count",
    legend_title="Values for:",
    height=700,
    width=1300,
    font=dict(
        family="Courier New, monospace",
        size=16,
        color="RebeccaPurple"
    )
)

fig.show()

# Parallel Coordinates Plot to see the change in values 

This plot shows how the predictions differ from the original Pawpularity values.

Each sample is drawn as a line that connects its True, NN, SVR and Ensemble values, with each of the four drawn on its own vertical axis.

Following a line from left to right shows how the value changes from the true score to each prediction.

In [None]:
fig = px.parallel_coordinates(df[['True', 'NN', 'SVR', 'Ensemble']], color="True",
                             color_continuous_scale='earth',
                             color_continuous_midpoint=2, title='True vs predictions change')
fig.show()

# Plotting for a small sample

The same plot for a 5% random sample of the rows, which is easier to read.

In [None]:
fig = px.parallel_coordinates(df[['True', 'NN', 'SVR', 'Ensemble']].sample(frac=0.05, random_state=11).reset_index(drop=True), color="True",
                             color_continuous_scale='rdbu',
                             color_continuous_midpoint=2,
                             title='True vs predictions change (Small Sample)')
fig.show()

# True vs Prediction

This plot shows only the true values and the ensembled predictions, for the same 5% sample.

Each line shows whether the prediction is higher or lower than the original value.

In [None]:
fig = px.parallel_coordinates(df[['True','Ensemble']].sample(frac=0.05, random_state=11).reset_index(drop=True), color="True",
                             color_continuous_scale='rdbu',
                             color_continuous_midpoint=2,
                             title='True vs predictions change (Small Sample)')
fig.show()

# Getting the means of each range: 0-10, 10-20, ..., 90-100

In [None]:
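def get_means(df):
    # Split the rows into ten bins by true score: (0, 10], (10, 20], ..., (90, 100]
    all_dfs = []
    for i in range(0, 100, 10):
        small_df = df[(df['True'] > i) & (df['True'] <= i + 10)]
        all_dfs.append(small_df)

    # Column means (True, NN, SVR, Ensemble) per bin, one column vector per bin
    all_means = []
    for small_df in all_dfs:
        all_means.append(np.vstack(small_df.mean().values))

    # Shape (4, 10): one row per column of df, one column per bin
    all_means = np.hstack(all_means)
    return all_means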

In [None]:
all_means = get_means(df)
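The same per-range means can also be computed with `pd.cut` and `groupby`. This is a minimal sketch rather than the notebook's own method: `toy` is a synthetic stand-in for `df`, and the linear relation used to generate its predictions is an assumption chosen only to produce plausible numbers.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's df (columns True, NN, SVR, Ensemble)
rng = np.random.default_rng(0)
true = rng.integers(1, 101, size=1000)
toy = pd.DataFrame({
    'True': true,
    'NN': 0.3 * true + 27 + rng.normal(0, 5, size=1000),
    'SVR': 0.3 * true + 26 + rng.normal(0, 5, size=1000),
})
toy['Ensemble'] = (toy['NN'] + toy['SVR']) / 2

# Right-closed bins (0, 10], (10, 20], ..., (90, 100], the same ranges as get_means
ranges = pd.cut(toy['True'], bins=np.arange(0, 101, 10)).rename('range')

# One row per range, one column per column of toy
bin_means = toy.groupby(ranges, observed=False).mean()
print(bin_means.round(2))
```

Here `bin_means.T.values` has the same (4, 10) layout as the array returned by `get_means`, so it could be passed to the same plotting code.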

# Plotting the mean values in each range

This plot compares the mean of the true values with the mean of each set of predictions, range by range.

I have divided the samples into ranges of 10 by their actual Pawpularity and computed the mean of the true values in each range.

I have then computed the mean of the NN, SVR and Ensemble predictions for the samples in the same range.

In [None]:
fig = go.Figure()
xaxis_labels=[f'{i}-{i+10}' for i in range(0,100,10)]


fig.add_trace(go.Bar(x = xaxis_labels, y=all_means[0], name = 'True avg.'))
fig.add_trace(go.Bar(x = xaxis_labels, y=all_means[1], name = 'NN avg.'))
fig.add_trace(go.Bar(x = xaxis_labels, y=all_means[2], name = 'SVR avg.'))
fig.add_trace(go.Bar(x = xaxis_labels, y=all_means[3], name = 'Ensemble avg.'))
fig.update_xaxes(type='category')
fig.update_layout(
    title="Mean values in each range of 10",
    xaxis_title="Range",
    yaxis_title="Pawpularity",
    legend_title="Values for:",
    height=700,
    width=1300,
    font=dict(
        family="Courier New, monospace",
        size=16,
        color="RebeccaPurple"
    )
)

fig.show()

## The predictions for low and high Pawpularity scores are far from the original Pawpularity

# Plotting All the True values vs Ensemble predictions

Plotting the original Pawpularity and the ensembled predictions against the sample Id (row index).

In [None]:
fig = go.Figure()

fig.add_trace(go.Scatter(x = df['True'].index, y=df['True'].values, name = 'True', opacity=0.6))
fig.add_trace(go.Scatter(x = df['Ensemble'].index, y=df['Ensemble'].values, name = 'Ensemble', opacity=0.6))
fig.update_layout(
    title="True vs Ensemble",
    xaxis_title="Id",
    yaxis_title="Pawpularity",
    legend_title="Values for:",
    height=700,
    width=1300,
    font=dict(
        family="Courier New, monospace",
        size=16,
        color="RebeccaPurple"
    )
)
fig.show()

#### Since most predictions cluster around the mean, let's look at the number of extreme predictions that are far from the actual values

# Box plots

In [None]:
fig = go.Figure()

fig.add_trace(go.Box(y=df['True'].values, name = 'True'))
fig.add_trace(go.Box(y=df['NN'].values, name = 'NN'))
fig.add_trace(go.Box(y=df['SVR'].values, name = 'SVR'))
fig.add_trace(go.Box(y=df['Ensemble'].values, name = 'Ensemble'))
fig.update_layout(
    title="Box plots",
    xaxis_title="Values for",
    yaxis_title="Pawpularity",
    legend_title="Values for:",
    height=700,
    width=1300,
    font=dict(
        family="Courier New, monospace",
        size=16,
        color="RebeccaPurple"
    )
)
fig.show()

# Some other calculations

In [None]:
print('-'*70)
print('No. of NN predictions with an absolute error of 5 or more: ',len(df[np.abs(df['True']-df['NN'])>=5]))
print('-'*70)
print('No. of NN predictions >= 30 with true values < 10 :', len(df[(df['True']<10) & (df['NN']>=30)]))
print('-'*70)
print('No. of NN predictions < 50 with true values >= 80 :', len(df[(df['True']>=80) & (df['NN']<50)]))
print('-'*70)
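The counts above can be complemented with a single error figure per model. The sketch below assumes the root mean squared error (RMSE) is the quantity of interest. The `rmse` helper and the synthetic `toy` frame are hypothetical stand-ins, so the printed numbers say nothing about the real results.

```python
import numpy as np
import pandas as pd

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Synthetic stand-in for the notebook's df (columns True, NN, SVR, Ensemble)
rng = np.random.default_rng(0)
true = rng.integers(1, 101, size=1000)
toy = pd.DataFrame({
    'True': true,
    'NN': 0.3 * true + 27 + rng.normal(0, 5, size=1000),
    'SVR': 0.3 * true + 26 + rng.normal(0, 5, size=1000),
})
toy['Ensemble'] = (toy['NN'] + toy['SVR']) / 2

# RMSE of each prediction column against the true values
scores = {col: rmse(toy['True'], toy[col]) for col in ['NN', 'SVR', 'Ensemble']}
for col, value in scores.items():
    print(f'{col:>8}: {value:.3f}')
```

With the real data, replacing `toy` with `df` would give one RMSE per column of predictions.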

# Conclusions

* The box plots show the spread of the predictions.
* The outputs are highly concentrated in the middle. 
* Data augmentation might help build a more balanced dataset, but I haven't tried it yet.
* Any ideas on augmentation are welcome :)


# Please **DO** upvote :)