# Introduction

As [@grayjay](https://www.kaggle.com/grayjay) pointed out [here](https://www.kaggle.com/c/tabular-playground-series-nov-2021/discussion/286731), the training data appears to consist of ten chunks with different average target values and differently distributed features. Here we check these claims with a few plots.

# Import the data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
train = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')

In [None]:
train.head()

In [None]:
# Alleged size of the chunk
CS = 60000

# Cumulative sum of target differs between chunks?

As suggested by @grayjay in the comments, the difference is easier to see if we recode target 0 as -1.

In [None]:
zero_target = train['target'] == 0
train.loc[zero_target, 'target'] = -1

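The cumulative sum of the recoded target is the variable we should be looking at: it rises where positive targets dominate and falls where they do not, so a change in the mean target shows up as a change in slope.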

In [None]:
train['target_cumsum'] = train['target'].cumsum()

The differences between chunks are clearly visible:

In [None]:
plt.plot(train['target_cumsum'])
# Mark the alleged chunk boundaries
for i in range(10):
    plt.axvline(CS*i, color='lightgray', alpha=0.3)
plt.ylabel('Cumulative sum of target')
plt.xlabel('id');

So the training data does indeed appear to be split into chunks of 60,000 rows.
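To see why the -1/+1 recoding makes chunk boundaries show up as kinks, here is a small sketch on synthetic data. The chunk size of 1,000 and the positive rates are made-up values for illustration, not numbers from the competition data. For a chunk with positive rate `p`, the recoded cumulative sum should climb with an average slope of about `2 * p - 1`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three chunks of 1,000 rows with different positive rates
chunk_size = 1000
rates = [0.3, 0.5, 0.7]
target = np.concatenate([rng.random(chunk_size) < p for p in rates]).astype(int)

# Recode 0 -> -1 so that each row moves the running sum up or down by one
signed = 2 * target - 1
cumsum = np.cumsum(signed)

# Average slope inside each chunk: change in the cumulative sum per row
edges = np.concatenate([[0], cumsum[chunk_size - 1::chunk_size]])
slopes = np.diff(edges) / chunk_size
for p, slope in zip(rates, slopes):
    print(f'rate {p:.1f}: expected slope {2 * p - 1:+.2f}, observed {slope:+.2f}')
```

With the original 0/1 coding every chunk has a positive slope, so the curve only bends slightly; with the recoding a chunk can go up, stay flat, or go down.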

Mean target in the individual chunks:

In [None]:
# Reverse the 0 to -1 replacement in the target
target_m1 = train['target'] == -1
train.loc[target_m1, 'target'] = 0

for i in range(10):
    chunk = train['target'][i*CS:(i+1)*CS]
    print(f'Mean in chunk {i} is {np.mean(chunk):.2f}')

Wow, large differences!
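The same per-chunk means can also be computed without an explicit loop, by grouping on the row position divided by the chunk size. This is a minimal sketch on a toy frame; `toy`, `chunk_size` and the values in it are placeholders, not the competition data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train: 12 rows in 3 chunks of 4 (hypothetical values)
toy = pd.DataFrame({'target': [0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1]})
chunk_size = 4

# Row position // chunk_size gives the chunk number of every row
chunk_id = np.arange(len(toy)) // chunk_size
chunk_means = toy['target'].groupby(chunk_id).mean()
print(chunk_means)
```

On the real data, the equivalent would presumably be `train['target'].groupby(np.arange(len(train)) // CS).mean()`, assuming the rows are still in their original order.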

# Difference in the features

Allegedly the feature distributions also differ between the chunks, notably for f27. Let's take a look at its mean and standard deviation within each chunk.

In [None]:
m = np.zeros(10)
s = np.zeros(10)
for i in range(10):
    print(f'Chunk {i}')

    m[i] = np.mean(train['f27'][i*CS:(i+1)*CS])
    s[i] = np.std(train['f27'][i*CS:(i+1)*CS])
    
    print(f'   Mean {m[i]:.4f}')
    print(f'   Std  {s[i]:.4f}')

So the standard deviation in the last chunk is indeed much smaller.

The feature histograms of the first and last chunks are clearly different:

In [None]:
bins = np.arange(-0.2, 0.3, 0.01)
plt.hist(train['f27'][:CS], bins=bins, label='first chunk')
plt.hist(train['f27'][9*CS:], bins=bins, alpha=0.2, label='last chunk')
plt.legend();
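One possible way to put a number on such a visual difference is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs. The sketch below computes it by hand on synthetic samples; the normal distributions and their spreads are assumptions chosen to mimic "same centre, smaller spread", and `ks_statistic` is a helper written for this illustration, not a library function.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for one feature in two chunks (hypothetical values):
# same mean, but the second sample has a much smaller spread
first = rng.normal(0.0, 0.10, size=5000)
last = rng.normal(0.0, 0.03, size=5000)

def ks_statistic(a, b):
    """Largest vertical gap between the empirical CDFs of a and b."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side='right') / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side='right') / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

print(f'std first: {first.std():.3f}, std last: {last.std():.3f}')
print(f'KS statistic: {ks_statistic(first, last):.3f}')
```

A value near 0 means the two samples look alike, and larger values mean the distributions differ more. Applied to the notebook's data, `first` and `last` would be replaced by the f27 values of the two chunks.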