# Splitting a Data Set

## Using NumPy to perform the split
It is useful to see how we can write a function that splits our data set using NumPy.  A few sections later, we will use the built-in functions from scikit-learn.

In [5]:
import numpy as np
def fractional_split(data_set, test_fraction=0.2, seed=42):
    data_count = len(data_set)
    test_count = int(test_fraction*data_count)
    
    np.random.seed(seed)
    shuffled_indices = np.random.permutation(data_count)
    
    # Use the front of the shuffled list as the test set
    # Use the back of the shuffled list as the training set
    test_indices = shuffled_indices[:test_count]
    train_indices = shuffled_indices[test_count:]
    
    return data_set.iloc[train_indices], data_set.iloc[test_indices]

import pandas as pd

data_frame = pd.read_csv("h.csv", sep=";")

train_set, test_set = fractional_split(data_frame)

print(len(train_set), len(test_set))
print(train_set.head())
print(test_set.head())


436 108
      height     weight   age  male
332   81.915  11.878440   2.0     1
210  143.510  31.071052  18.0     0
185  142.875  32.205032  17.0     0
370   83.820   9.213587   1.0     0
543  158.750  52.531624  68.0     1
      height     weight   age  male
457  163.830  55.394923  43.0     1
257  163.195  48.137451  67.0     1
357  152.400  43.431434  21.0     0
532  156.210  41.050076  53.0     1
542   71.120   8.051258   0.0     1
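The function above reseeds NumPy's global random number generator on every call. As a minimal sketch of an alternative, the variant below draws from its own generator created with `np.random.default_rng`, so other code that relies on NumPy's global random state is unaffected. The name `fractional_split_local` and the synthetic `demo_frame` are assumptions made for illustration; they are not part of the original notebook, and the shuffle order will differ from the output shown above.

```python
import numpy as np
import pandas as pd

def fractional_split_local(data_set, test_fraction=0.2, seed=42):
    """Hypothetical variant of fractional_split that uses a local generator."""
    rng = np.random.default_rng(seed)  # local generator; global NumPy state is untouched
    shuffled_indices = rng.permutation(len(data_set))
    test_count = int(test_fraction * len(data_set))
    test_indices = shuffled_indices[:test_count]
    train_indices = shuffled_indices[test_count:]
    return data_set.iloc[train_indices], data_set.iloc[test_indices]

# Synthetic stand-in with 544 rows (not the Howell data)
demo_frame = pd.DataFrame({"x": np.arange(544), "y": np.arange(544) * 2.0})
demo_train, demo_test = fractional_split_local(demo_frame)
print(len(demo_train), len(demo_test))

# Sanity check: the two sets share no rows
overlap = set(demo_train.index) & set(demo_test.index)
print(len(overlap))
```

Checking that the overlap is empty is a cheap way to confirm that no row leaks from the training set into the test set.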


## Using scikit-learn to perform the split
The `train_test_split` function performs much like the one we wrote.  It shuffles the data and then uses a fraction to split the data set.  You can pass in a seed for the random number generator through the `random_state` parameter.

In [13]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(data_frame, test_size=0.2, random_state=123)
print(len(train_set), len(test_set))
print(train_set.head())
print(test_set.head())


435 109
      height     weight   age  male
55    97.790  12.757275   5.0     0
543  158.750  52.531624  68.0     1
287  167.005  50.603858  49.0     1
166  141.605  44.338618  24.0     0
132  163.195  53.098613  22.0     1
      height     weight   age  male
138  141.605  29.313383  15.0     1
308  157.480  49.214732  18.0     0
440   64.135   6.662132   1.0     0
282  147.320  35.947166  40.0     0
356  152.400  43.544832  63.0     0
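Note that the test set here has 109 rows, while our own function produced 108. For a fractional `test_size`, scikit-learn rounds the test count up, whereas our function truncates with `int`. The sketch below illustrates this, and also shows that the same `random_state` reproduces the same split. It uses a synthetic 544-row `demo_frame` as a stand-in for the real data.

```python
import math
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 544 rows (not the Howell data)
demo_frame = pd.DataFrame({"x": np.arange(544)})

first_train, first_test = train_test_split(demo_frame, test_size=0.2, random_state=123)
second_train, second_test = train_test_split(demo_frame, test_size=0.2, random_state=123)

# Test-set size compared with rounding up and with truncating 0.2 * 544
print(len(first_test), math.ceil(0.2 * 544), int(0.2 * 544))

# The same random_state selects the same rows
print(first_test.index.equals(second_test.index))
```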


## Performing a Stratified Sample

### Using an existing feature
Suppose that we want both the test set and the training set to have the same ratio of men to women as the original data set.  In the Howell data set there are 287 women and 257 men.

We can use the `StratifiedShuffleSplit` class to create our training and test sets.

We will do just a single split (`n_splits = 1`).



In [22]:
from sklearn.model_selection import StratifiedShuffleSplit

print(data_frame["male"].value_counts())

splitter = StratifiedShuffleSplit(n_splits = 1, test_size=0.2, random_state=123)
for train_indices, test_indices in splitter.split(data_frame, data_frame["male"]):
    #the body only executes once because the number of splits is one
    train_set = data_frame.iloc[train_indices]
    test_set = data_frame.iloc[test_indices]
print(train_set["male"].value_counts())
print(test_set["male"].value_counts())


0    287
1    257
Name: male, dtype: int64
0    229
1    206
Name: male, dtype: int64
0    58
1    51
Name: male, dtype: int64
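For a single stratified split, `train_test_split` also accepts a `stratify` argument, which avoids the loop. The following is a sketch on synthetic data with the same 287/257 class counts; `demo_frame` and its `value` column are made up for illustration. It uses `value_counts(normalize=True)` to compare proportions rather than raw counts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same 287/257 class counts as the "male" column
demo_frame = pd.DataFrame({
    "male": np.array([0] * 287 + [1] * 257),
    "value": np.arange(544),
})

demo_train, demo_test = train_test_split(
    demo_frame, test_size=0.2, random_state=123, stratify=demo_frame["male"]
)

# normalize=True reports fractions instead of counts
print(demo_frame["male"].value_counts(normalize=True).round(3))
print(demo_train["male"].value_counts(normalize=True).round(3))
print(demo_test["male"].value_counts(normalize=True).round(3))
```

The three sets of fractions should agree closely, differing only because each class count has to be a whole number.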


### Creating a new feature
Sometimes the features that we get are not exactly what we want.  We can create a new feature out of old features.  For example, we can create a BMI feature out of the weight and height.  Assuming that the height is in cm and the weight is in kg, we can compute $\mathrm{BMI} = \frac{10000 \times \mathrm{weight}}{\mathrm{height}^2}$.

In [24]:
data_frame["bmi"] = 10000*data_frame["weight"]/(data_frame["height"]**2)
data_frame.head()

    height     weight   age  male        bmi
0  151.765  47.825606  63.0     1  20.764297
1  139.700  36.485807  63.0     0  18.695244
2  136.525  31.864838  65.0     0  17.095718
3  156.845  53.041915  41.0     1  21.561444
4  145.415  41.276872  51.0     0  19.520384
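The factor of 10000 comes from converting centimetres to metres: BMI is defined as weight in kg divided by the square of height in metres, and $(\mathrm{height}/100)^2 = \mathrm{height}^2/10000$. As a quick check, the arithmetic for row 0 of the table above can be worked through both ways (the variable names are illustrative only):

```python
# Values taken from row 0 of the table above
height_cm = 151.765
weight_kg = 47.825606

# Standard definition: kg divided by metres squared
height_m = height_cm / 100
bmi_from_metres = weight_kg / height_m**2

# Equivalent form used in the notebook, with height left in cm
bmi_from_cm = 10000 * weight_kg / height_cm**2

print(round(bmi_from_metres, 4), round(bmi_from_cm, 4))
```

Both expressions give the same value, which agrees with the `bmi` column for that row.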


### Stratifying on a continuous feature
Stratified sampling needs discrete class labels, so we first group the continuous values into a small number of layers (strata).  Then we can sample from each layer.


In [41]:
# Option 1: divide into equal-width buckets (each bucket spans 3 BMI units)
data_frame["bmi_category"] = np.ceil(data_frame["bmi"]/3)
data_frame.head()
data_frame["bmi_category"].value_counts()

# Option 2: use a function to define the categories (this overwrites option 1)
def bmi_rater(bmi):
    if bmi < 18.5 : return 1
    if bmi < 25 : return 2
    if bmi < 30 : return 3
    return 4
# And then map it onto the bmi
data_frame["bmi_category"] = data_frame["bmi"].map(bmi_rater)
data_frame.head()
print(data_frame["bmi_category"].value_counts())

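# Merge all the categories above 2 down to 2.
# where keeps each value for which the condition holds and substitutes 2 otherwise.
# Assign the result back to the column: calling where(..., inplace=True) on a
# selected column may not update the DataFrame in newer pandas versions.
data_frame["bmi_category"] = data_frame["bmi_category"].where(data_frame["bmi_category"] <= 2, 2)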
print(data_frame["bmi_category"].value_counts())


splitter = StratifiedShuffleSplit(n_splits = 1, test_size=0.2, random_state=123)
for train_indices, test_indices in splitter.split(data_frame, data_frame["bmi_category"]):
    #the body only executes once because the number of splits is one
    train_set = data_frame.iloc[train_indices]
    test_set = data_frame.iloc[test_indices]
print(train_set["bmi_category"].value_counts())
print(test_set["bmi_category"].value_counts())

1    341
2    202
3      1
Name: bmi_category, dtype: int64
1    341
2    203
Name: bmi_category, dtype: int64
1    273
2    162
Name: bmi_category, dtype: int64
1    68
2    41
Name: bmi_category, dtype: int64
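The map-then-merge steps above can also be expressed in one call with `pd.cut`, which assigns each value to a labelled bin. This is a sketch on a handful of made-up BMI values, not the notebook's own method. Passing `right=False` makes each bin include its left edge, matching the `bmi < 18.5` test in `bmi_rater`.

```python
import numpy as np
import pandas as pd

# Made-up BMI values for illustration
demo_bmi = pd.Series([14.1, 17.4, 18.5, 21.4, 27.0, 31.2])

# Bins are [0, 18.5) -> 1 and [18.5, inf) -> 2
demo_category = pd.cut(demo_bmi, bins=[0, 18.5, np.inf], labels=[1, 2], right=False)
print(demo_category.tolist())
```

The result has a categorical dtype and can be passed as the stratification labels in the same way as the integer column above.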


## Make a working copy
It is useful to work on a copy of the training set.  Because `copy()` returns an independent DataFrame, we can apply additional transformations to the working copy without altering `train_set`, and clone it again as needed.

In [42]:



working_set = train_set.copy()
print(working_set)

       height     weight   age  male        bmi  bmi_category
15   163.1950  48.562694  36.0     1  18.234299             1
495  141.6050  42.885420  43.0     0  21.387129             2
385  152.4000  45.160753  38.0     0  19.444252             2
111  162.5600  45.954540  35.0     1  17.390058             1
457  163.8300  55.394923  43.0     1  20.638736             2
80   147.9550  41.900561  17.0     0  19.140820             2
358   81.2800  11.509897   1.0     1  17.422242             1
294  152.4000  43.856676  33.0     1  18.882773             2
126   96.5200  13.097469   5.0     1  14.058946             1
191  148.5900  43.885026  33.0     0  19.876376             2
434  154.9400  39.179009  16.0     1  16.320233             1
16   157.4800  42.325803  44.0     1  17.066890             1
208  156.2100  44.338618  29.0     0  18.170400             1
410  163.1950  51.029100  39.0     1  19.160384             2
129  149.2250  42.155707  27.0     0  18.930984             2
149  157