## 5.2.3 Credit card data: Splitting the data: training and test sets
### Why do we need a train/test split?
We have already introduced a foundational idea in predictive modeling: using a trained model to make predictions on new data that the model has never "seen" before.

When creating a model for prediction, we need some kind of measure of how well the model can make predictions on data that were not used to fit the model. This is because in fitting a model, the model becomes "specialised" at learning the relationship between features and response on the specific set of labeled data that were used for fitting. 
- While this is nice, in the end, we want to be able to use the model to make accurate predictions on new, unseen data, for which we don't know the true value of the labels.

For example, in our case study, once we deliver the trained model to our client, they will obtain a new dataset with the same features as the one we have now, but with new values for V1..V28, time, and amount. The client will feed these features to the model to predict whether each transaction is fraudulent or not.

It is therefore important to estimate how well the trained model will identify fraudulent transactions in data it has not seen. To do that, we can take our current dataset and split it into two sets:

- The training set (training data). This consists of the samples used to train the model.
- The test set (test data). This consists of samples that were not used in training the model. The test data is used to evaluate the model on data it did not see during training.

Evaluating the model on a set of test data should give an idea of how the model will perform when it is actually used by the client for its intended purpose in solving the business problem (i.e., to make predictions on samples that were not included in model training).
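The gap between performance on training data and on unseen data can be sketched with a small experiment. This is only an illustration under stated assumptions: the data are synthetic (generated with `make_classification`, not the credit card dataset), and the decision tree is chosen simply because an unconstrained tree tends to memorise its training samples. The names `X_demo`, `y_demo`, `train_acc`, and `test_acc` are introduced here for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the credit card dataset).
# flip_y=0.2 injects label noise so that memorising the training set does not generalise.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, flip_y=0.2, random_state=0
)

# Hold out 30% of the samples; the model never sees them during fitting.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# An unconstrained tree can fit the training samples very closely.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_tr, y_tr)

# Accuracy on the data used for fitting versus accuracy on held-out data.
train_acc = model.score(X_tr, y_tr)
test_acc = model.score(X_te, y_te)
print(f"training accuracy: {train_acc:.3f}")
print(f"test accuracy:     {test_acc:.3f}")
```

The training accuracy should come out higher than the test accuracy, which is why the score on the training set alone would be an overly optimistic estimate of real-world performance.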

### 1. Train/test split in scikit-learn
In this section, we illustrate the concept of train/test split using scikit-learn. We shall use the train_test_split function from the scikit-learn library to split the data into 70% for training and 30% for testing.

- A split of 70% for training and 30% for testing is a common choice. The idea is to keep enough data in the training set for the model to learn the relationship between features and response.
- You may consider reducing the size of the training data if there is plenty of data available and you do not want the training process to consume too much computational power or time.
- In summary, there is no hard rule for choosing the training/testing percentages, but the ones mentioned above are common.
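To make the effect of the split percentage concrete, the following sketch splits a placeholder array of 1,000 row indices with a few different `test_size` values and records the resulting set sizes. The array `rows` and the dictionary `sizes` are illustrative names, not part of the case study.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder "dataset": 1,000 row indices standing in for real samples.
rows = np.arange(1000)

sizes = {}
for test_size in (0.2, 0.3, 0.4):
    train_rows, test_rows = train_test_split(
        rows, test_size=test_size, random_state=0
    )
    # Record (number of training rows, number of test rows) for this setting.
    sizes[test_size] = (len(train_rows), len(test_rows))
    print(f"test_size={test_size}: {len(train_rows)} train, {len(test_rows)} test")
```

A larger `test_size` gives a more stable estimate of performance but leaves fewer samples for fitting, so the choice is a trade-off rather than a fixed rule.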

### 2. Defining features and target/response variables
Before we perform the data splitting, we define the input features and the target variable as follows:

In [1]:
import pandas as pd
df = pd.read_csv('../datasets/clean_creditcard.csv')
print(df.shape)

X = df.drop(['Class_Category'], axis=1)
y = df[['Class_Category']]

(892, 31)


### 3. Performing the train/test split
The first argument to the train_test_split function is the features X, and the second argument is the response variable y.

In [2]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

There are four outputs of the function train_test_split:

- X_train: the features of the samples in the training set
- X_test: the features of the samples in the test set
- y_train: the response variable corresponding to the samples in X_train
- y_test: the response variable corresponding to the samples in X_test

The train_test_split function randomly selects 30% of the row indices from the dataset and sets aside the corresponding features and responses as test data, leaving the rest for training.

In the above code, we've set test_size to 0.3, or 30%. The size of the training data will be automatically set to the remainder, 70%. 

In making the train/test split, the random_state parameter is set to a specific value, which serves as the random number seed.

- Using this parameter gives a consistent train/test split across runs of the code. Otherwise, the random splitting procedure would select a different 30% of the data for testing each time the code is run.
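The effect of the seed can be checked directly. The sketch below splits a small placeholder array (`rows`, an illustrative name) three times: twice with the same `random_state` and once with a different one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 row indices.
rows = np.arange(20)

# Same seed twice, then a different seed.
_, test_a = train_test_split(rows, test_size=0.3, random_state=0)
_, test_b = train_test_split(rows, test_size=0.3, random_state=0)
_, test_c = train_test_split(rows, test_size=0.3, random_state=1)

# Identical seeds reproduce exactly the same test rows.
same_seed_matches = np.array_equal(test_a, test_b)
print("same seed, same test rows:", same_seed_matches)

# A different seed will usually select a different set of test rows.
print("seed 0 test rows:", test_a)
print("seed 1 test rows:", test_c)
```

Fixing the seed is mainly useful for reproducibility: results can be compared between runs without the split itself being a source of variation.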

### 4. Examining the shape of data
Let's examine the shapes of our training and test data, as in the following code:

In [3]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(624, 30)
(268, 30)
(624, 1)
(268, 1)
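These shapes can be sanity-checked with a little arithmetic. The sketch below assumes that, for a fractional `test_size`, scikit-learn rounds the number of test samples up and assigns the remaining samples to the training set. This matches the output above, but treat it as an assumption about the library's rounding rather than a guarantee.

```python
import math

n_samples = 892   # number of rows reported by df.shape
test_size = 0.3   # fraction requested for the test set

# Assumed rounding: test count rounded up, the remainder used for training.
n_test = math.ceil(test_size * n_samples)   # 0.3 * 892 = 267.6
n_train = n_samples - n_test

print(n_train, n_test)
```

The feature sets keep 30 columns because one of the original 31 columns, Class_Category, was moved into y.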


### 5. Checking the nature of data (imbalanced/balanced)
Now that we have our training and test data, it's good to check that the class balance is similar in both sets. In particular, is the fraction of the positive class similar? You can observe this in the following output:

In [4]:
import numpy as np
print(np.mean(y_train))
print(np.mean(y_test))

0.5032051282051282
0.4925373134328358


The positive class fractions in the training and test data are both about 50% (roughly 50.3% and 49.3%). This is good: the two sets have a similar class balance, so we can say that the training set is representative of the test set.
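In the split above, the similar class fractions are a result of random sampling on a roughly balanced dataset. When the classes are imbalanced, one option is the `stratify` argument of train_test_split, which keeps the class proportions close to equal in both sets. The following is a minimal sketch on hypothetical toy labels (`X_toy` and `y_toy` are made up for illustration, with 10% positives), not on the case study data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10 positives and 90 negatives.
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.arange(100).reshape(-1, 1)  # a single placeholder feature

# stratify=y_toy asks for the class proportions to be preserved in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

# Fraction of the positive class in each set.
print(y_tr.mean())
print(y_te.mean())
```

With stratification, both sets keep the 10% positive fraction of the toy labels, whereas a purely random split of such a small, imbalanced sample could drift noticeably from it.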