# Creating Train/Test Splits in Python

We often need to partition a dataframe into training and test samples.  The training sample is used to fit an algorithm, and the test sample is used to evaluate its performance.  In the examples below, the training sample is roughly **two-thirds** of the data, and the test sample is the remaining **one-third**.

In [2]:
# Load relevant libraries.

%pylab inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.api import abline_plot
import patsy
import seaborn as sns
sns.set(context='notebook', style='whitegrid', palette='deep', font='sans-serif', font_scale=1, rc=None)
import sklearn as skl

Populating the interactive namespace from numpy and matplotlib


In [4]:
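# Spam database.

# Original source of the Spambase data, kept for reference (not read below).
target_url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
              "spambase/spambase.data")

# The data are read from a local file, spam.csv, in the working directory.
spam = pd.read_csv("spam.csv")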
print(spam.info())
print(spam['v57'].describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 11 columns):
v57    4601 non-null int64
v1     4601 non-null float64
v2     4601 non-null float64
v3     4601 non-null float64
v4     4601 non-null float64
v5     4601 non-null float64
v6     4601 non-null float64
v7     4601 non-null float64
v8     4601 non-null float64
v9     4601 non-null float64
v10    4601 non-null float64
dtypes: float64(10), int64(1)
memory usage: 395.5 KB
None
count    4601.000000
mean        0.394045
std         0.488698
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: v57, dtype: float64


## METHOD ONE

Every row in a dataframe has a position, or observation number.  Dividing that number by three leaves a remainder of 0, 1, or 2.  When the remainder is 1, the observation is a test observation.  Otherwise, it is a training observation.  Every third row therefore lands in the test sample.

In [6]:
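# Number the observations 0, 1, 2, ...
spam['index'] = range(len(spam))

# In Python, % is the modulo (remainder) operator.
spam['index'] = spam['index'] % 3
# Flag rows with remainder 1 as test observations (1 = test, 0 = train).
spam['test'] = (spam['index'] == 1).astype(int)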

# We don't need the index, so drop it.
del spam['index']

# Now we can create new train and test dataframes.
# A boolean condition inside the brackets keeps only the rows where it is True:
# spamtrain keeps rows where test is 0, spamtest keeps rows where test is 1.
spamtrain = spam[spam['test'] == 0]
spamtest = spam[spam['test'] == 1]

# Confirm data has been split properly.
print(spamtrain['v57'].count())
print(spamtest['v57'].count())
print(spam['v57'].count())

3067
1534
4601
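
The remainder rule is easy to check by hand on a small frame. The sketch below is illustrative only: `toy` is a hypothetical ten-row dataframe, not the spam data, and it uses a boolean mask directly rather than a 0/1 column.

```python
import numpy as np
import pandas as pd

# Hypothetical ten-row frame standing in for the spam data.
toy = pd.DataFrame({"x": np.arange(10) * 1.5})

# Positional observation numbers 0..9, independent of the index labels.
obs = np.arange(len(toy))

# True where the remainder after dividing by 3 is 1, i.e. rows 1, 4, 7.
is_test = obs % 3 == 1

toy_test = toy[is_test]
toy_train = toy[~is_test]

print(list(toy_test.index))           # [1, 4, 7]
print(len(toy_train), len(toy_test))  # 7 3
```

Because the rule depends only on row position, the split is fully deterministic. If the rows are sorted in some meaningful order, the two samples inherit that pattern.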


## METHOD TWO

Generate a uniform[0, 1] draw for every observation in the dataframe.  Observations whose draw is less than 1/3 are flagged as the test sample, and the remaining observations form the training sample.  Because the flag is random, the test sample is only approximately one-third of the data.

In [8]:
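# Seed NumPy's random number generator so the split is reproducible.
np.random.seed(12345)
spam['index'] = np.random.uniform(low = 0, high = 1, size = len(spam))
# Flag draws at or below 0.3333 as test observations (1 = test, 0 = train).
spam['test'] = (spam['index'] <= 0.3333).astype(int)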

# We don't need the index, so drop it.
del spam['index']

# Now we can create new train and test dataframes.
# A boolean condition inside the brackets keeps only the rows where it is True:
# spamtrain keeps rows where test is 0, spamtest keeps rows where test is 1.
spamtrain = spam[spam['test'] == 0]
spamtest = spam[spam['test'] == 1]

# Confirm data has been split properly.
print(spamtrain['v57'].count())
print(spamtest['v57'].count())
print(spam['v57'].count())

3056
1545
4601
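
The same idea can be sketched with NumPy's newer `Generator` interface. This is an assumption-laden variant, not the cell above: `toy` is a hypothetical frame, and `np.random.default_rng` produces a different stream of draws than the legacy seeding call, so the counts will not match the notebook's.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the spam data.
toy = pd.DataFrame({"x": np.arange(3000)})

# A local generator keeps the seed from leaking into other code.
rng = np.random.default_rng(12345)
draws = rng.uniform(0.0, 1.0, size=len(toy))

# Each row lands in the test sample with probability 1/3.
is_test = draws < 1 / 3

toy_test = toy[is_test]
toy_train = toy[~is_test]

# The sizes always add up to the full frame, but the test share
# is only close to one-third, not exactly one-third.
print(len(toy_train), len(toy_test))
print(round(is_test.mean(), 3))
```

Unlike the remainder rule, this split does not depend on row order, at the cost of sample sizes that vary from seed to seed.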


## METHOD THREE

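Ask pandas to do it.  The sample method draws a random 67% of the rows for the training sample, and the drop method removes those rows from the full dataframe, leaving the rest as the test sample.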

In [12]:
spamtrain = spam.sample(frac = 0.67, random_state = 1066)
spamtest = spam.drop(spamtrain.index)

# Confirm data has been split properly.
print(spamtrain['v57'].count())
print(spamtest['v57'].count())
print(spam['v57'].count())

3083
1518
4601
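
A quick check that the sample-then-drop pattern yields two disjoint pieces that together cover the data. This is a minimal sketch on a hypothetical 100-row frame named `toy`. It assumes the index labels are unique, since drop removes rows by label.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the spam data.
toy = pd.DataFrame({"x": np.arange(100)})

# drop() works on index labels, so duplicate labels would remove extra rows.
assert toy.index.is_unique

toy_train = toy.sample(frac=0.67, random_state=1066)
toy_test = toy.drop(toy_train.index)

# No row appears in both samples, and no row is lost.
overlap = toy_train.index.intersection(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))  # 67 33 0
```

If a dataframe's index is not unique, calling `reset_index(drop=True)` before sampling is one way to restore unique labels.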
