# How to Sample Data in Python

## Learning Objectives
In order to get an unbiased assessment of the performance of a supervised machine learning model, we need to evaluate it based on data that it did not previously encounter during the training process. To accomplish this, we must first split our data into a training subset and a test subset prior to the model build stage. One common way to split data in this fashion is by creating non-overlapping subsets of the original data using one of several **sampling** approaches. By the end of the tutorial, you will have learned:

+ how to split data using simple random sampling
+ how to split data using stratified random sampling

In [35]:
# To illustrate how to split data using sampling, let's import and preview a dataset of vehicles evaluated by
# the EPA between 1984 and 2018.

import pandas as pd
vehicles = pd.read_csv("vehicles.csv")
vehicles

Unnamed: 0,citympg,cylinders,displacement,drive,highwaympg,make,model,class,year,transmissiontype,transmissionspeeds,co2emissions
0,14.0,6,4.1,2-Wheel Drive,19.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4,555.437500
1,14.0,8,5.0,2-Wheel Drive,20.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4,555.437500
2,18.0,8,5.7,2-Wheel Drive,26.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4,484.761905
3,21.0,6,4.3,Rear-Wheel Drive,31.0,Cadillac,Fleetwood/DeVille (FWD),Large Cars,1984,Automatic,4,424.166667
4,14.0,8,4.1,Rear-Wheel Drive,19.0,Cadillac,Brougham/DeVille (RWD),Large Cars,1984,Automatic,4,555.437500
...,...,...,...,...,...,...,...,...,...,...,...,...
36974,17.0,8,4.7,Rear-Wheel Drive,25.0,Mercedes-Benz,SL550,Two Seaters,2018,Automatic,9,442.000000
36975,16.0,8,6.2,Rear-Wheel Drive,25.0,Chevrolet,Corvette,Two Seaters,2018,Manual,7,466.000000
36976,15.0,8,6.2,Rear-Wheel Drive,22.0,Chevrolet,Corvette,Two Seaters,2018,Manual,7,503.000000
36977,12.0,12,6.5,Rear-Wheel Drive,16.0,Ferrari,812 Superfast,Two Seaters,2018,Automatic,7,661.000000


In [16]:

# Before we split our data, we must first separate the dependent variable from the independent variables.
# If we treat the co2emissions column as the dependent variable, we can create a data frame called y
# based on that column alone. To do this, we store the column name in a string variable called response,
# subset the vehicles data frame by that variable, and preview the result.

response = 'co2emissions'
y = vehicles[[response]]
y.head()

Unnamed: 0,co2emissions
0,555.4375
1,555.4375
2,484.761905
3,424.166667
4,555.4375
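
The double brackets in `vehicles[[response]]` matter: a list of column names returns a one-column data frame, whereas a single column name returns a Series. A minimal sketch on a small made-up frame (the `toy` values are illustrative and do not come from `vehicles.csv`):

```python
import pandas as pd

# Hypothetical toy data standing in for vehicles.csv
toy = pd.DataFrame({
    "citympg": [14.0, 18.0, 21.0],
    "co2emissions": [555.4, 484.8, 424.2],
})
response = "co2emissions"

y_frame = toy[[response]]   # list of column names -> DataFrame
y_series = toy[response]    # single column name -> Series

print(type(y_frame).__name__, y_frame.shape)    # DataFrame (3, 1)
print(type(y_series).__name__, y_series.shape)  # Series (3,)
```

This is why the shape of a data frame built this way has two dimensions, with a single column.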


In [17]:

# To create a data frame of the independent variables, we first create a list called predictors that holds all
# the column names in the vehicles data frame. We build it by passing vehicles.columns to list().

predictors = list(vehicles.columns)
predictors

['citympg',
 'cylinders',
 'displacement',
 'drive',
 'highwaympg',
 'make',
 'model',
 'class',
 'year',
 'transmissiontype',
 'transmissionspeeds',
 'co2emissions']

In [18]:
# Next, we remove the co2emissions column from the list by calling the list's remove method and passing it
# the column name stored in response. That leaves all the columns except co2emissions.
predictors.remove(response)
predictors

['citympg',
 'cylinders',
 'displacement',
 'drive',
 'highwaympg',
 'make',
 'model',
 'class',
 'year',
 'transmissiontype',
 'transmissionspeeds']

In [19]:
 
# Using predictors, we can now create a data frame called x from the remaining columns in the list,
# the independent variables, by subsetting vehicles. We then preview the x data frame.

x = vehicles[predictors]
x.head()

Unnamed: 0,citympg,cylinders,displacement,drive,highwaympg,make,model,class,year,transmissiontype,transmissionspeeds
0,14.0,6,4.1,2-Wheel Drive,19.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4
1,14.0,8,5.0,2-Wheel Drive,20.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4
2,18.0,8,5.7,2-Wheel Drive,26.0,Buick,Electra/Park Avenue,Large Cars,1984,Automatic,4
3,21.0,6,4.3,Rear-Wheel Drive,31.0,Cadillac,Fleetwood/DeVille (FWD),Large Cars,1984,Automatic,4
4,14.0,8,4.1,Rear-Wheel Drive,19.0,Cadillac,Brougham/DeVille (RWD),Large Cars,1984,Automatic,4
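
As an aside, the list-then-remove steps above can also be written as a single call to `DataFrame.drop`. The sketch below uses a small made-up frame (`toy` is hypothetical, not the vehicles data) to check that both routes give the same predictors frame:

```python
import pandas as pd

# Hypothetical toy data standing in for vehicles.csv
toy = pd.DataFrame({
    "citympg": [14.0, 18.0, 21.0],
    "cylinders": [6, 8, 6],
    "co2emissions": [555.4, 484.8, 424.2],
})
response = "co2emissions"

# Approach from the tutorial: list the column names, then remove the response
predictors = list(toy.columns)
predictors.remove(response)
x_from_list = toy[predictors]

# Shortcut: drop the response column directly
x_from_drop = toy.drop(columns=[response])

print(x_from_list.equals(x_from_drop))  # True
```

The explicit list is still handy when the predictor names are needed later, for example to report which columns a model was trained on.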


## How to split data using Simple Random Sampling

In [20]:
# The first approach we use to split our data is simple random sampling.
# The train_test_split function from the sklearn.model_selection subpackage allows us to do this.

from sklearn.model_selection import train_test_split

In [21]:
# Next, we pass our independent variables, x, and our dependent variable, y, to train_test_split, which returns
# four datasets. We name them x_train, x_test, y_train, and y_test.

x_train, x_test, y_train, y_test = train_test_split(x, y) 

In [22]:
# The x_train data frame holds the independent variables of the training set. It has 27,734 rows and 11 columns.

x_train.shape

(27734, 11)

In [23]:
# The y_train data frame holds the dependent variable of the training set. It also has 27,734 rows but only one column.

y_train.shape

(27734, 1)

In [24]:
# The x_test data frame holds the independent variables of the test set. It has 9,245 rows and 11 columns.

x_test.shape

(9245, 11)

In [25]:
# Finally, the y_test data frame holds the dependent variable of the test set. It also has 9,245 rows
# but only one column.


y_test.shape

(9245, 1)
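
To see that the split really produces non-overlapping subsets, and that predictors and response stay aligned, the following sketch runs `train_test_split` on a small made-up frame. The `toy` data and its column names are assumptions for illustration only:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, two predictors and one response
toy = pd.DataFrame({
    "a": range(20),
    "b": range(20, 40),
    "target": range(40, 60),
})
x = toy[["a", "b"]]
y = toy[["target"]]

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

# Default split: 25% of the rows go to the test set
print(x_train.shape, x_test.shape)  # (15, 2) (5, 2)

# No row index appears in both subsets
overlap = x_train.index.intersection(x_test.index)
print(len(overlap))  # 0

# Predictors and response keep the same row order within each subset
print(x_train.index.equals(y_train.index))  # True
print(x_test.index.equals(y_test.index))    # True
```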

In [30]:
# The original data has 36,979 rows. This means that the 9,245 rows in each of x_test and y_test represent
# 25% of the original data.

# By default, the train_test_split function allocates 25% of the original data to the test set.
# If we want to override this behavior, we can do so by setting either the train_size argument or the test_size
# argument of the function.

In [31]:
# For example, we can allocate 40% of the original data to the test set by passing an additional argument,
# test_size, set to 0.4. We see that 14,792 rows are now assigned to the test set.
# That is 40% of the 36,979 rows in the original dataset.

x_train, x_test, y_train, y_test = train_test_split(x, y, 
                                                    test_size = 0.4) 
x_test.shape

(14792, 11)
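
The row counts reported above are consistent with rounding the test-set size up to the next whole row and giving the remaining rows to the training set. The sketch below reproduces that arithmetic. Note that `split_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function, and the rounding rule is an assumption inferred from the observed counts:

```python
import math

n_rows = 36979  # rows in the vehicles data frame


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and the training set
    receives the remaining rows.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


print(split_sizes(n_rows, 0.25))  # (27734, 9245)
print(split_sizes(n_rows, 0.4))   # (22187, 14792)
```

Because 36,979 is not evenly divisible, the test share is only approximately 25% or 40%.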

## How to split data using Stratified Random Sampling

In [32]:
# The second approach we use to split our data is stratified random sampling. With this approach, our objective is
# to maintain the same distribution of values for a specific column across the original, training, and test data.

# To contrast the two sampling approaches, let's split our data once again using simple random sampling.
# This time, we want 1% of the original data allocated to the test set. Notice that in our code, we also set
# the random_state argument. This helps ensure that the results in this tutorial are reproducible
# at a later time.

x_train, x_test, y_train, y_test = train_test_split(x, y, 
                                                    test_size = 0.01, 
                                                    random_state = 1234)
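
A quick way to see what `random_state` buys us is to split the same data twice with the same seed and compare the selected rows. The sketch below does this on a small made-up frame; the `toy` data and the second seed, 4321, are arbitrary choices for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, one predictor and one response
toy = pd.DataFrame({"a": range(100), "target": range(100, 200)})
x = toy[["a"]]
y = toy[["target"]]

# Two calls with the same random_state select the same test rows
_, x_test_1, _, _ = train_test_split(x, y, test_size=0.1, random_state=1234)
_, x_test_2, _, _ = train_test_split(x, y, test_size=0.1, random_state=1234)
print(x_test_1.index.equals(x_test_2.index))  # True

# A different seed will generally select different rows
_, x_test_3, _, _ = train_test_split(x, y, test_size=0.1, random_state=4321)
print(x_test_1.index.equals(x_test_3.index))
```

Without `random_state`, each run shuffles the rows differently, so the subsets and any metrics computed from them change from run to run.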

In [33]:
# If our objective for using stratified random sampling is to maintain the same distribution of
# values for the drive column across the original, training, and test sets, then let's first get the distribution of
# the drive column in the original data.

# We select the drive column of x and call its value_counts method. Setting normalize=True returns
# proportions rather than raw counts, which gives us the distribution of values for the drive column.

x['drive'].value_counts(normalize = True)

Rear-Wheel Drive     0.356797
Front-Wheel Drive    0.353552
All-Wheel Drive      0.239893
4-Wheel Drive        0.036480
2-Wheel Drive        0.013278
Name: drive, dtype: float64

In [34]:
# We do the same for the test set, which was created using simple random sampling: we select the drive column of
# x_test and call value_counts with normalize=True.

# Comparing the two distributions, we notice a small but noticeable difference in the values.


x_test['drive'].value_counts(normalize = True)

Front-Wheel Drive    0.364865
Rear-Wheel Drive     0.332432
All-Wheel Drive      0.248649
4-Wheel Drive        0.035135
2-Wheel Drive        0.018919
Name: drive, dtype: float64
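
To put a number on that difference, the sketch below copies the proportions from the two outputs above into plain dictionaries and computes the absolute gap for each drive type. The variable names are chosen for this illustration only:

```python
# Proportions copied from the two outputs above
original = {
    "Rear-Wheel Drive": 0.356797,
    "Front-Wheel Drive": 0.353552,
    "All-Wheel Drive": 0.239893,
    "4-Wheel Drive": 0.036480,
    "2-Wheel Drive": 0.013278,
}
simple_test = {
    "Front-Wheel Drive": 0.364865,
    "Rear-Wheel Drive": 0.332432,
    "All-Wheel Drive": 0.248649,
    "4-Wheel Drive": 0.035135,
    "2-Wheel Drive": 0.018919,
}

# Absolute gap per drive type, largest first
gaps = {name: abs(original[name] - simple_test[name]) for name in original}
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<18} {gap:.6f}")
```

The largest gap is for Rear-Wheel Drive, at roughly 2.4 percentage points, and the two most common categories swap rank in the test set.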

In [None]:
# Now, let's split the data using stratified random sampling, stratifying by drive.
# To do so, we pass an additional argument, stratify, set to the column we want to stratify by,
# which is x['drive'], and run our code.

# Once again, we get the distribution of the drive column within the test data.
# Comparing the x_test distribution with the original x distribution, we see that the distribution of values
# for the drive column in the test set created by stratified random sampling
# more closely mimics the original data above.

In [37]:
x_train, x_test, y_train, y_test = train_test_split(x, y, 
                                                    test_size = 0.01, 
                                                    random_state = 1234, 
                                                    stratify = x['drive']) 

In [38]:
x_test['drive'].value_counts(normalize = True)

Rear-Wheel Drive     0.356757
Front-Wheel Drive    0.354054
All-Wheel Drive      0.240541
4-Wheel Drive        0.035135
2-Wheel Drive        0.013514
Name: drive, dtype: float64
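
The same effect can be checked on data where the class mix is known in advance. The sketch below builds a made-up frame with a 50/30/20 mix of three drive types, stratifies on that column, and measures the largest gap between the original and test-set proportions. The `toy` data and its category labels are assumptions for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1,000 rows with a 50/30/20 mix of drive types
drive = ["Front"] * 500 + ["Rear"] * 300 + ["All"] * 200
toy = pd.DataFrame({"drive": drive, "target": range(1000)})
x = toy[["drive"]]
y = toy[["target"]]

_, x_test, _, _ = train_test_split(
    x, y, test_size=0.1, random_state=1234, stratify=x["drive"]
)

original = x["drive"].value_counts(normalize=True)
stratified = x_test["drive"].value_counts(normalize=True)

# Largest gap between the original and test-set proportions;
# subtraction aligns the two Series on their category labels
max_gap = (original - stratified).abs().max()
print(max_gap)
```

With stratification, any remaining gap comes from rounding class counts to whole rows, so it shrinks as the test set grows.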