In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


Let's import our packages and dataset first. The dataset comes from a [house price prediction competition](https://www.kaggle.com/c/home-data-for-ml-course), where the goal is to predict the selling price of each new house as accurately as possible.

In [None]:
from sklearn.preprocessing import power_transform

import seaborn as sns

seed = 0

# Read the data
X = pd.read_csv('../input/house-prices-data/train.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)

Before we partition our data into training and validation sets, let's look at the distribution of the target attribute `SalePrice`.

As the plot below shows, house prices are heavily right-skewed. A purely random split is therefore risky: the few extreme points in the right tail could end up concentrated in one of the two sets (in the worst case **solely** in the validation set), so the model would be validated on price ranges it never saw during training.

In [None]:
# Histogram of SalePrice with a kernel density estimate overlaid
sns.histplot(y, kde=True, stat='density')

Instead, we can first apply the [Box-Cox transformation](https://en.wikipedia.org/wiki/Power_transform) to `SalePrice`. For those who are not familiar with it, Box-Cox is a family of power functions, indexed by a parameter $\lambda$, that reshapes strictly positive data toward a normal-like distribution; the best $\lambda$ is chosen by maximum likelihood estimation. The $\log$ transformation is the special case $\lambda = 0$. Note that scikit-learn's `power_transform` also standardizes its output to zero mean and unit variance by default, so the transformed values can be read directly in units of $\sigma$.
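
For reference, the one-parameter Box-Cox transform of a strictly positive value $y$ is usually written as follows, with the $\log$ case arising as the limit $\lambda \to 0$:

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\
\ln y, & \lambda = 0.
\end{cases}
$$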

In [None]:
# power_transform expects a 2-D array, so reshape the target into a single column
y_box_cox = power_transform(y.to_numpy().reshape(-1, 1), method='box-cox').flatten()
sns.histplot(y_box_cox, kde=True, stat='density')
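
One detail worth checking is that `power_transform` standardizes its output by default (`standardize=True`), which is what lets us treat the transformed values as multiples of $\sigma$. The sketch below is only an illustration: it uses synthetic log-normal data as a stand-in for `SalePrice` (an assumption, not the competition data) and compares scikit-learn's result with `scipy.stats.boxcox` followed by manual standardization.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import power_transform

rng = np.random.default_rng(0)
# Synthetic, strictly positive, right-skewed values (stand-in for SalePrice)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

# scikit-learn: fits lambda by maximum likelihood, then standardizes by default
transformed = power_transform(prices.reshape(-1, 1), method='box-cox').ravel()

# SciPy: the same transform without standardization; also returns the fitted lambda
raw, fitted_lambda = stats.boxcox(prices)
manual = (raw - raw.mean()) / raw.std()

print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"mean: {transformed.mean():.3f}, std: {transformed.std():.3f}")
print("max abs difference:", np.abs(transformed - manual).max())
```

If the two results agree up to floating-point error, the bin edges at $-3, -2, \dots, 3$ used in the next step can indeed be interpreted in units of $\sigma$.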

Now `SalePrice` looks much closer to a normal distribution, so we can cut it into bins and partition the data according to those bins. For simplicity, I create only 8 bins from 7 edges placed at integer multiples of $\sigma$: values below $-3\sigma$, between $-3\sigma$ and $-2\sigma$, between $-2\sigma$ and $-1\sigma$, ..., between $2\sigma$ and $3\sigma$, and finally above $3\sigma$.

In [None]:
bins = np.linspace(start=-3, stop=3, num=7)
bins
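
To make the bin layout concrete, here is a small sketch of how `np.digitize` maps standardized values onto these seven edges. The sample values are made up for illustration, one per interval.

```python
import numpy as np

bins = np.linspace(start=-3, stop=3, num=7)  # edges: -3, -2, -1, 0, 1, 2, 3

# Made-up standardized values, one per interval
samples = np.array([-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5])

# With the default right=False, label i means bins[i-1] <= x < bins[i];
# label 0 is the left tail (x < -3) and label len(bins) is the right tail (x >= 3)
labels = np.digitize(samples, bins)
print(labels)  # [0 1 2 3 4 5 6 7]
```

Seven edges therefore produce eight labels, 0 through 7, which serve as the stratification classes.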

Lastly, we apply a technique called **stratified train-test split** to our data: each point is first assigned to its bin, and the split is then performed within each bin independently. As a result, both the training and validation sets contain points from every price range, in roughly the same proportions.

In [None]:
from sklearn.model_selection import train_test_split

y_binned = np.digitize(y_box_cox, bins)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, stratify=y_binned, random_state=seed)
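
A quick way to sanity-check a stratified split is to compare the share of each bin in the two resulting sets. The sketch below uses synthetic standard-normal values in place of `y_box_cox` and placeholder features (both assumptions). It also adds one step that is not part of the notebook above: the two tail bins are folded into their neighbors with `np.clip`, because `train_test_split` raises a `ValueError` when a stratification class has only one member, which can happen in sparse tails.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the standardized Box-Cox target
y_std = rng.standard_normal(1000)
X_demo = np.arange(len(y_std)).reshape(-1, 1)  # placeholder features

bins = np.linspace(start=-3, stop=3, num=7)
# Fold the tail bins (labels 0 and 7) into their neighbors (labels 1 and 6),
# which makes it unlikely that a class ends up with a single member
y_binned = np.clip(np.digitize(y_std, bins), 1, len(bins) - 1)

X_tr, X_va, b_tr, b_va = train_test_split(
    X_demo, y_binned, test_size=0.2, stratify=y_binned, random_state=0
)

# Share of each bin in the training and validation splits
labels = np.unique(y_binned)
train_share = np.array([(b_tr == k).mean() for k in labels])
valid_share = np.array([(b_va == k).mean() for k in labels])
for k, t, v in zip(labels, train_share, valid_share):
    print(f"bin {k}: train {t:.3f}, valid {v:.3f}")
```

The per-bin shares should be nearly identical across the two sets, which is the property the stratification is meant to provide.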

This notebook ends here. Please give it a vote if you found it helpful, and feel free to check out my other notebooks.