## Train/Test Split

The code below splits our dataset into training and test subsets. Because the target variable `readmitted` is imbalanced, we use a stratified split, which keeps the proportion of the positive class approximately the same in the training set and the test set.
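
To illustrate why stratification matters, here is a minimal sketch on synthetic labels (not the notebook's data). The 10% positive rate and the array names are assumptions chosen only for illustration. It compares a stratified split with an unstratified one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative labels: 1000 samples, 10% positive (an assumption, not the real data)
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y)).reshape(-1, 1)

# Stratified split: class proportions are preserved in both subsets
_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)

# Unstratified split: proportions can drift by chance
_, _, y_train_u, y_test_u = train_test_split(
    X, y, test_size=0.10, random_state=42
)

print("stratified   train/test positive rate:", y_train_s.mean(), y_test_s.mean())
print("unstratified train/test positive rate:", y_train_u.mean(), y_test_u.mean())
```

With stratification, both rates stay close to the overall 10%; without it, the test-set rate depends on the random draw.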

### Key Arguments for the Splitting Process:

1. The random state is fixed (`random_state=42`) so that the split can be reproduced exactly.
2. The test subset is set to 10% of the data (`test_size=0.10`).
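
The effect of these two arguments can be checked with a small sketch. The toy frame and its column names (`feature`, `target`) and the helper `split` are hypothetical and stand in for the real data. Splitting twice with the same seed should select the same rows, and the test set should hold about 10% of them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a hypothetical binary target column named 'target'
toy = pd.DataFrame({"feature": np.arange(200), "target": [1] * 40 + [0] * 160})

def split(df, seed):
    """Stratified 90/10 split of the toy frame with a given seed."""
    return train_test_split(
        df.drop(columns=["target"]),
        df["target"],
        stratify=df["target"],
        test_size=0.10,
        random_state=seed,
    )

first = split(toy, 42)
second = split(toy, 42)

# Same seed -> identical test rows; test set holds ~10% of the data
same_rows = first[1].index.equals(second[1].index)
test_fraction = len(first[1]) / len(toy)
print("identical test rows:", same_rows)
print("test fraction:", test_fraction)
```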

### Post-Split Evaluations:

1. We confirm that the split is stratified by printing the positive-class proportion before the split and in the training and test sets after it.
2. Although not detailed in this notebook, we have conducted additional checks to verify that both the training and testing sets contain comparable representations of patients categorized by the number of visits (specifically, those with a single visit compared to those with multiple visits).
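
Both checks could be sketched as follows. This is not the code used for the undisclosed visit check; it runs on synthetic stand-in data, and the column names `patient_id`, `label`, and `multiple_visits` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in data; 'patient_id' and 'label' are hypothetical column names
toy = pd.DataFrame({
    "patient_id": rng.integers(0, 1200, size=n),
    "label": rng.binomial(1, 0.1, size=n),
})

# Flag rows whose patient appears more than once in the full dataset
toy["multiple_visits"] = toy["patient_id"].duplicated(keep=False)

train, test = train_test_split(
    toy, stratify=toy["label"], test_size=0.10, random_state=42
)

# Compare class balance and the share of multiple-visit rows across subsets
summary = pd.DataFrame(
    {
        "positive_rate": [train["label"].mean(), test["label"].mean()],
        "multiple_visit_share": [
            train["multiple_visits"].mean(),
            test["multiple_visits"].mean(),
        ],
    },
    index=["train", "test"],
)
print(summary)
```

Stratification guarantees only that the `positive_rate` column is close across rows. The `multiple_visit_share` column is not controlled by the split, which is why it needs a separate check.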


In [44]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns   

In [10]:
processed_data=pd.read_csv('../data/diabetic_data_processed.csv', 
                            na_values='?',
                            low_memory=False #silence the mixed dtypes warning
                           )

In [13]:
from sklearn.model_selection import train_test_split

def stratify_split(dataframe):
    """Split into train/test sets, stratified on the binary 'readmitted' target."""
    # Separate the target from the features
    y = dataframe['readmitted']
    X = dataframe.drop(columns=['readmitted'])
    print('Before splitting, the positive class proportion in the dataset is:', round(y.sum() / len(y), 4))

    # Stratified 90/10 split with a fixed seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        stratify=y,
                                                        test_size=0.10,
                                                        random_state=42)

    print('After splitting, the positive class proportion in test is:', round(y_test.sum() / len(y_test), 4),
          'and in train is:', round(y_train.sum() / len(y_train), 4))

    return X_train, X_test, y_train, y_test

    



In [None]:
X_train, X_test, y_train, y_test=stratify_split(processed_data)